Spark SQL & Catalyst — Spark Data Flow

The query pipeline

SQL text / DataFrame ops
   │  parse
   ▼
Unresolved LogicalPlan
   │  Analyzer        (resolve tables, columns, functions, types)
   ▼
Analyzed LogicalPlan
   │  Optimizer       (predicate pushdown, column pruning, constant folding, ...)
   ▼
Optimized LogicalPlan
   │  SparkPlanner    (strategies map logical → physical operators)
   ▼
SparkPlan (physical)
   │  preparations    (EnsureRequirements adds shuffles, CollapseCodegenStages, AQE)
   ▼
executedPlan
   │  execute()
   ▼
RDD[InternalRow]  ──► the scheduler, stages, tasks (same as everything else)

Each arrow is a lazy phase in QueryExecution; an action on a Dataset forces executedPlan, which pulls all the earlier phases.

SparkSession and Dataset — lazy by construction

SparkSession.sql(...) parses text into an unresolved LogicalPlan and wraps it in a Dataset via Dataset.ofRows. A Dataset is just a QueryExecution plus an Encoder; every DataFrame transformation produces a new logical plan. Nothing runs until an action like collect calls withAction, which forces the executed plan and collects rows.

SparkSession.scala L528-L559 — sql()

Dataset.scala L110-L138 — ofRows

QueryExecution — the phase orchestrator

Each phase is a lazy val: analyzed, optimizedPlan, sparkPlan, executedPlan, and finally toRdd. Accessing one forces the previous, so the whole pipeline runs on demand and the planning time of each phase is tracked separately.

analyzed       = analyzer.executeAndCheck(logical)
optimizedPlan  = optimizer.executeAndTrack(withCachedData)
sparkPlan      = createSparkPlan(planner, optimizedPlan)
executedPlan   = prepareForExecution(preparations, sparkPlan)
toRdd          = new SQLExecutionRDD(executedPlan.execute(), conf)

QueryExecution.scala L313-L394 — optimized, spark, executed, toRdd

TreeNode and Rule — Catalyst's whole foundation

Every plan, expression, and data type in Catalyst is a TreeNode: an immutable node with children. All transformations are expressed as transformDown/transformUp traversals that return a rewritten tree. A Rule is a single rewrite step (apply(tree): tree), and the RuleExecutor runs batches of rules until they reach a fixed point. This tiny abstraction is why adding an optimization is just writing a new rule.

TreeNode.scala L70-L75 — base node

Rule.scala L24-L35 — Rule

Analyzer — resolving the unresolved

The parser produces a plan full of UnresolvedRelation and UnresolvedAttribute placeholders. The Analyzer (a RuleExecutor) consults the catalog to bind table and column references, resolve functions, add casts for type coercion, and validate the result. Its output is a fully typed, analyzed logical plan.

Analyzer.scala L300-L309 — class

Analyzer.scala L506-L609 — resolution batches

Optimizer — cost-agnostic rewrites

The Optimizer applies rule batches that make the plan cheaper without changing its meaning: predicate pushdown, column pruning, constant folding, filter combination, boolean simplification, join reordering, and subquery rewriting. These are heuristic (rule-based) rewrites; statistics-driven choices like join strategy happen during physical planning and Adaptive Query Execution.

Optimizer.scala L100-L291 — default rule batches

SparkPlanner — logical to physical

Physical planning applies a list of strategies that pattern-match logical operators and emit physical SparkPlan operators (the *Exec classes). JoinSelection decides between broadcast, sort-merge, and shuffle-hash joins; FileSourceStrategy turns a scan into a file read with pushed-down filters; BasicOperators maps the simple cases. The planner takes the first complete candidate.

SparkPlanner.scala L31-L57 — the strategy list

SparkStrategies.scala L906-L949 — BasicOperators

SparkPlan — where SQL meets RDDs

A SparkPlan is a physical operator. Its doExecute() returns an RDD[InternalRow] — the bridge back to the RDD engine. InternalRow (and its binary form UnsafeRow) is Tungsten's compact, off-heap-friendly row layout that flows through these RDDs, avoiding Java object overhead.

SparkPlan.scala L197-L202 — execute

SparkPlan.scala L338-L343 — doExecute returns RDD[InternalRow]

Tungsten: whole-stage code generation

A naive operator tree calls next() through many virtual method calls per row (the "volcano" model). CollapseCodegenStages, a preparation rule, fuses a chain of codegen-capable operators (scan → filter → project) into a single generated Java class with one tight loop. WholeStageCodegenExec.doExecute() compiles that source at runtime and runs it, falling back to normal execution if compilation fails. This is the largest single performance feature in modern Spark SQL.

WholeStageCodegenExec.scala L673-L749 — doCodeGen & doExecute

WholeStageCodegenExec.scala L914-L994 — CollapseCodegenStages

The payoff: one engine, two front ends

Because Catalyst lowers everything to a SparkPlan and then to an RDD[InternalRow], a SQL query and a hand-written RDD job share the same scheduler, shuffle, memory, and executor machinery described on the other pages. The difference is that the SQL front end knows the schema and the relational semantics, so it can optimize aggressively (pruning columns, pushing filters into scans, picking join algorithms) before a single task is scheduled.

Key takeaways

SQL/DataFrames compile through analyze → optimize → plan → prepare → RDD.
Everything is a TreeNode; every optimization is a Rule run to a fixed point.
SparkPlan.doExecute returns an RDD[InternalRow] — the link to the core engine.
Whole-stage codegen fuses operators into one compiled loop over UnsafeRows.

Spark SQL: from a query to an RDD of rows

The query pipeline

The payoff: one engine, two front ends

Key takeaways

External references