The query pipeline
SQL text / DataFrame ops
│ parse
▼
Unresolved LogicalPlan
│ Analyzer (resolve tables, columns, functions, types)
▼
Analyzed LogicalPlan
│ Optimizer (predicate pushdown, column pruning, constant folding, ...)
▼
Optimized LogicalPlan
│ SparkPlanner (strategies map logical → physical operators)
▼
SparkPlan (physical)
│ preparations (EnsureRequirements adds shuffles, CollapseCodegenStages, AQE)
▼
executedPlan
│ execute()
▼
RDD[InternalRow] ──► the scheduler, stages, tasks (same as everything else)
Each arrow is a lazy phase in QueryExecution; an action on a Dataset forces executedPlan, which pulls all the earlier phases.
SparkSession and Dataset — lazy by construction
SparkSession.sql(...) parses text into an unresolved LogicalPlan and wraps it in a Dataset via Dataset.ofRows. A Dataset is essentially a QueryExecution plus an Encoder; every DataFrame transformation returns a new Dataset around a new logical plan rather than touching any data. Dataset.ofRows does check analysis eagerly, so a misspelled column fails right away, but no job runs until an action like collect calls withAction, which forces the executed plan and collects rows.
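The "transformations only build plans" idea can be illustrated with a toy model in plain Scala. This is a sketch, not Spark's API: ToyDataset, Source, MapRows, and the executions counter are all hypothetical names invented for the example.

```scala
// Toy model, not Spark's API: a "dataset" that only records a plan until an action runs.
object LazyDatasetSketch {
  sealed trait Plan
  case class Source(rows: Seq[Int]) extends Plan
  case class Filter(p: Int => Boolean, child: Plan) extends Plan
  case class MapRows(f: Int => Int, child: Plan) extends Plan

  var executions = 0 // counts how many times a plan was actually run

  // Interpreting the plan is the only place where data is touched.
  private def run(plan: Plan): Seq[Int] = plan match {
    case Source(rows)      => rows
    case Filter(p, child)  => run(child).filter(p)
    case MapRows(f, child) => run(child).map(f)
  }

  final class ToyDataset(val plan: Plan) {
    // Transformations wrap the current plan in a new node and return a new dataset.
    def filter(p: Int => Boolean): ToyDataset = new ToyDataset(Filter(p, plan))
    def map(f: Int => Int): ToyDataset = new ToyDataset(MapRows(f, plan))

    // The action is what finally executes the accumulated plan.
    def collect(): Seq[Int] = { executions += 1; run(plan) }
  }

  // Building the dataset runs nothing; only collect() would.
  val ds: ToyDataset =
    new ToyDataset(Source(Seq(1, 2, 3, 4))).filter(_ % 2 == 0).map(_ * 10)
}
```

Constructing ds leaves executions at zero; only collect() walks the plan. The real Dataset follows the same shape, with the plan handed to a QueryExecution instead of a small interpreter.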
QueryExecution — the phase orchestrator
Each phase is a lazy val: analyzed, optimizedPlan, sparkPlan, executedPlan, and finally toRdd. Accessing one forces the previous, so the whole pipeline runs on demand and the planning time of each phase is tracked separately.
analyzed = analyzer.executeAndCheck(logical)
optimizedPlan = optimizer.executeAndTrack(withCachedData)
sparkPlan = createSparkPlan(planner, optimizedPlan)
executedPlan = prepareForExecution(preparations, sparkPlan)
toRdd = new SQLExecutionRDD(executedPlan.execute(), conf)
QueryExecution.scala L313-L394 — optimized, spark, executed, toRdd
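The lazy val chaining can be reproduced in a few lines of plain Scala. The sketch below is an assumption-laden stand-in, not the real class: ToyQueryExecution is hypothetical, strings play the role of plans, and the log buffer exists only to show the order in which phases complete.

```scala
import scala.collection.mutable.ArrayBuffer

object LazyPhasesSketch {
  // Toy stand-in for QueryExecution: strings play the role of plans.
  final class ToyQueryExecution(logical: String) {
    // Records phase names in the order the phases finish.
    val log: ArrayBuffer[String] = ArrayBuffer.empty[String]

    // Evaluate the phase body first (which may force earlier phases), then log it.
    private def phase(name: String)(body: => String): String = {
      val result = body
      log += name
      result
    }

    // Each phase reads the previous lazy val, so forcing one forces all before it.
    lazy val analyzed: String      = phase("analysis")(s"analyzed($logical)")
    lazy val optimizedPlan: String = phase("optimization")(s"optimized($analyzed)")
    lazy val sparkPlan: String     = phase("planning")(s"physical($optimizedPlan)")
    lazy val executedPlan: String  = phase("preparation")(s"prepared($sparkPlan)")
  }
}
```

Reading executedPlan on a fresh instance fills the log with all four phases in order, and reading it again adds nothing because lazy val caches its result. In a real session the corresponding phases can be inspected through df.queryExecution (for example .analyzed or .executedPlan) or printed with df.explain(true).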
TreeNode and Rule — Catalyst's whole foundation
Every plan and expression in Catalyst is a TreeNode: an immutable node with children. Transformations are expressed as transformDown/transformUp traversals that return a rewritten tree rather than mutating the original. A Rule is a single rewrite step (apply(plan): plan), and a RuleExecutor runs batches of rules, each batch either once or repeatedly until the tree stops changing (a fixed point) or an iteration limit is hit. This small abstraction is why adding an optimization usually amounts to writing a new rule.
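A minimal sketch of the tree, rule, and fixed-point idea, written in plain Scala rather than against Catalyst. Expr, Lit, Attr, Add, and fixedPoint are hypothetical names for this example; Catalyst's TreeNode and RuleExecutor are considerably richer.

```scala
object TreeRuleSketch {
  // A tiny immutable expression tree.
  sealed trait Expr {
    def children: Seq[Expr]
    def withNewChildren(cs: Seq[Expr]): Expr

    // Bottom-up traversal: rewrite the children first, then offer this node to the rule.
    def transformUp(rule: PartialFunction[Expr, Expr]): Expr = {
      val rebuilt =
        if (children.isEmpty) this
        else withNewChildren(children.map(_.transformUp(rule)))
      if (rule.isDefinedAt(rebuilt)) rule(rebuilt) else rebuilt
    }
  }
  case class Lit(value: Int) extends Expr {
    def children: Seq[Expr] = Nil
    def withNewChildren(cs: Seq[Expr]): Expr = this
  }
  case class Attr(name: String) extends Expr {
    def children: Seq[Expr] = Nil
    def withNewChildren(cs: Seq[Expr]): Expr = this
  }
  case class Add(left: Expr, right: Expr) extends Expr {
    def children: Seq[Expr] = Seq(left, right)
    def withNewChildren(cs: Seq[Expr]): Expr = Add(cs(0), cs(1))
  }

  // A rule is one rewrite step over the whole tree.
  type Rule = Expr => Expr

  val constantFolding: Rule = (e: Expr) =>
    e.transformUp { case Add(Lit(a), Lit(b)) => Lit(a + b) }

  val removeAddZero: Rule = (e: Expr) =>
    e.transformUp {
      case Add(x, Lit(0)) => x
      case Add(Lit(0), x) => x
    }

  // Run the batch repeatedly until the tree stops changing or the cap is hit.
  def fixedPoint(rules: Seq[Rule], maxIterations: Int = 100)(plan: Expr): Expr = {
    var current = plan
    var changed = true
    var i = 0
    while (changed && i < maxIterations) {
      val next = rules.foldLeft(current)((p, r) => r(p))
      changed = next != current
      current = next
      i += 1
    }
    current
  }
}
```

For x + (2 + (-2)) with the batch ordered as removeAddZero then constantFolding, the first pass only folds the constants into x + 0, and a second pass is needed to strip the zero. That interaction between rules is the reason batches are rerun until nothing changes.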
Analyzer — resolving the unresolved
The parser produces a plan full of UnresolvedRelation and UnresolvedAttribute placeholders. The Analyzer (a RuleExecutor) consults the catalog to bind table and column references, resolve functions, add casts for type coercion, and validate the result. Its output is a fully typed, analyzed logical plan.
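Resolution against a catalog can be sketched as follows. This is a simplified illustration under stated assumptions: the Catalog type is a plain Map invented for the example, types are strings, and only the node names UnresolvedRelation and UnresolvedAttribute are borrowed from the text. Function resolution and type coercion are left out.

```scala
object AnalyzerSketch {
  sealed trait Column
  case class UnresolvedAttribute(name: String) extends Column
  case class AttributeReference(name: String, dataType: String) extends Column

  sealed trait Plan
  case class UnresolvedRelation(table: String) extends Plan
  case class ResolvedRelation(table: String, schema: Map[String, String]) extends Plan
  case class Project(columns: Seq[Column], child: Plan) extends Plan

  // Hypothetical catalog: table name -> (column name -> type name).
  type Catalog = Map[String, Map[String, String]]

  // Columns a resolved plan exposes to its parent.
  private def schemaOf(plan: Plan): Map[String, String] = plan match {
    case ResolvedRelation(_, schema) => schema
    case Project(cols, _)            => cols.collect { case AttributeReference(n, t) => n -> t }.toMap
    case UnresolvedRelation(_)       => Map.empty
  }

  // Bind tables first, then bind each column against the child's schema.
  def analyze(plan: Plan, catalog: Catalog): Either[String, Plan] = plan match {
    case UnresolvedRelation(t) =>
      catalog.get(t).map(s => ResolvedRelation(t, s)).toRight(s"Table not found: $t")
    case r: ResolvedRelation => Right(r)
    case Project(cols, child) =>
      analyze(child, catalog).flatMap { resolvedChild =>
        val schema = schemaOf(resolvedChild)
        val resolved: Seq[Either[String, Column]] = cols.map {
          case UnresolvedAttribute(n) =>
            schema.get(n).map(t => AttributeReference(n, t)).toRight(s"Column not found: $n")
          case a: AttributeReference => Right(a)
        }
        // Validation step: report the first unresolved column, if any.
        resolved.collectFirst { case Left(err) => err } match {
          case Some(err) => Left(err)
          case None      => Right(Project(resolved.collect { case Right(c) => c }, resolvedChild))
        }
      }
  }
}
```

A query selecting age from a known people table comes back fully typed, while selecting an unknown column returns a Left with an error message, which loosely mirrors how analysis failures surface before any execution.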
Optimizer — cost-agnostic rewrites
The Optimizer applies rule batches that make the plan cheaper without changing its result: predicate pushdown, column pruning, constant folding, filter combination, boolean simplification, join reordering, and subquery rewriting. Most of these are heuristic (rule-based) rewrites that do not consult statistics; the main exception is cost-based join reordering, which applies only when the cost-based optimizer is enabled. Statistics-driven choices like join strategy happen during physical planning and Adaptive Query Execution.
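Two of the rewrites named above, predicate pushdown and column pruning, can be sketched on a toy logical plan. Everything here is hypothetical and simplified: predicates are modelled only by a label and the set of columns they read, and each function is a single pass rather than a rule run to a fixed point.

```scala
object OptimizerSketch {
  // A predicate is represented by a label plus the columns it references.
  case class Pred(label: String, refs: Set[String])

  sealed trait Plan
  case class Scan(table: String, columns: Seq[String]) extends Plan
  case class Filter(cond: Pred, child: Plan) extends Plan
  case class Project(columns: Seq[String], child: Plan) extends Plan

  // Predicate pushdown: move a Filter below a Project when the Project
  // passes through every column the predicate reads.
  def pushDownFilters(plan: Plan): Plan = plan match {
    case Filter(c, Project(cs, child)) if c.refs.subsetOf(cs.toSet) =>
      Project(cs, pushDownFilters(Filter(c, child)))
    case Filter(c, child)   => Filter(c, pushDownFilters(child))
    case Project(cs, child) => Project(cs, pushDownFilters(child))
    case s: Scan            => s
  }

  // Column pruning: walk top-down carrying the columns the parents need,
  // and narrow the scan to just those. None means "everything is needed".
  def pruneColumns(plan: Plan, required: Option[Set[String]] = None): Plan = plan match {
    case Project(cs, child) => Project(cs, pruneColumns(child, Some(cs.toSet)))
    case Filter(c, child)   => Filter(c, pruneColumns(child, required.map(_ ++ c.refs)))
    case Scan(t, cols)      => Scan(t, required.fold(cols)(req => cols.filter(req)))
  }
}
```

Applied to a filter on age sitting above a projection of name and age over a four-column scan, the two passes leave the filter directly above a scan that reads only name and age. Neither rewrite changes which rows or values the query returns, only how much data the lower operators handle.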
SparkPlanner — logical to physical
Physical planning applies a list of strategies that pattern-match logical operators and emit physical SparkPlan operators (the *Exec classes). JoinSelection chooses among broadcast hash, shuffle hash, and sort-merge joins for equi-joins, falling back to nested-loop or Cartesian-product joins when there are no equi-join keys; FileSourceStrategy turns a scan into a file read with pushed-down filters; BasicOperators maps the simple one-to-one cases. A strategy may return several candidates, but the planner takes the first complete one.
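The strategy mechanism can be approximated as a list of functions that each return zero or more physical candidates. The sketch below is a toy under explicit assumptions: the operator classes only borrow the *Exec naming, the 10 MiB broadcastThreshold and the size estimate are invented for the example, and the real JoinSelection considers hints, join types, and more join algorithms.

```scala
object PlannerSketch {
  sealed trait Logical
  case class Relation(name: String, sizeInBytes: Long) extends Logical
  case class Join(left: Logical, right: Logical) extends Logical

  sealed trait Physical
  case class ScanExec(name: String) extends Physical
  case class BroadcastHashJoinExec(left: Physical, right: Physical, buildRight: Boolean) extends Physical
  case class SortMergeJoinExec(left: Physical, right: Physical) extends Physical

  // A strategy returns candidate physical plans; an empty Seq means "does not apply".
  type Strategy = Logical => Seq[Physical]

  val broadcastThreshold: Long = 10L * 1024 * 1024 // assumed value for this sketch

  // Crude placeholder size estimate; real estimates come from statistics.
  def size(plan: Logical): Long = plan match {
    case Relation(_, bytes) => bytes
    case Join(l, r)         => size(l) + size(r)
  }

  // Prefer broadcasting a small side; otherwise fall back to sort-merge.
  val joinSelection: Strategy = {
    case Join(l, r) if size(r) <= broadcastThreshold =>
      Seq(BroadcastHashJoinExec(plan(l), plan(r), buildRight = true))
    case Join(l, r) if size(l) <= broadcastThreshold =>
      Seq(BroadcastHashJoinExec(plan(l), plan(r), buildRight = false))
    case Join(l, r) => Seq(SortMergeJoinExec(plan(l), plan(r)))
    case _          => Nil
  }

  val basicOperators: Strategy = {
    case Relation(name, _) => Seq(ScanExec(name))
    case _                 => Nil
  }

  val strategies: Seq[Strategy] = Seq(joinSelection, basicOperators)

  // Try the strategies in order and take the first candidate produced.
  def plan(logical: Logical): Physical = {
    val candidates = strategies.iterator.flatMap(strategy => strategy(logical))
    if (candidates.hasNext) candidates.next()
    else throw new IllegalStateException(s"No plan for $logical")
  }
}
```

Joining a 1 GiB fact relation with a 1 KiB dim relation yields a broadcast hash join that builds on the small side, while joining two large relations yields a sort-merge join.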
SparkPlan — where SQL meets RDDs
A SparkPlan is a physical operator. Callers invoke execute(), which delegates to the operator's doExecute() and returns an RDD[InternalRow]: the bridge back to the RDD engine. InternalRow is the engine's internal row abstraction; its binary implementation, UnsafeRow, is Tungsten's compact, off-heap-friendly row layout, which avoids Java object overhead as rows flow through these RDDs.
SparkPlan.scala L197-L202 — execute
SparkPlan.scala L338-L343 — doExecute returns RDD[InternalRow]
Tungsten: whole-stage code generation
A naive operator tree pulls each row through a chain of next() calls, paying several virtual method calls per row (the "volcano" model). CollapseCodegenStages, a preparation rule, fuses a chain of codegen-capable operators (scan → filter → project) into a single generated Java class with one tight loop. WholeStageCodegenExec.doExecute() compiles that source at runtime and runs it, falling back to normal execution if compilation fails. For CPU-bound queries this is one of the most significant performance features in modern Spark SQL.
WholeStageCodegenExec.scala L673-L749 — doCodeGen & doExecute
WholeStageCodegenExec.scala L914-L994 — CollapseCodegenStages
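The contrast between the volcano model and a fused stage can be shown with a hand-written example. This is only an illustration: the Operator classes are hypothetical, rows are plain Ints, and fusedSum is written by hand, whereas Spark generates and compiles Java source for the fused loop at runtime.

```scala
object CodegenSketch {
  // Volcano style: every operator is an iterator pulling from its child,
  // so each row crosses one next() call per operator.
  trait Operator { def execute(): Iterator[Int] }

  final class Scan(data: Array[Int]) extends Operator {
    def execute(): Iterator[Int] = data.iterator
  }
  final class Filter(child: Operator, p: Int => Boolean) extends Operator {
    def execute(): Iterator[Int] = child.execute().filter(p)
  }
  final class Project(child: Operator, f: Int => Int) extends Operator {
    def execute(): Iterator[Int] = child.execute().map(f)
  }

  // scan -> filter (even values) -> project (times 3) -> sum, as an operator tree.
  def volcanoSum(data: Array[Int]): Long =
    new Project(new Filter(new Scan(data), _ % 2 == 0), _ * 3)
      .execute()
      .foldLeft(0L)(_ + _)

  // Roughly what a fused stage boils down to: the same three operators
  // collapsed into one loop with no per-operator iterators in between.
  def fusedSum(data: Array[Int]): Long = {
    var sum = 0L
    var i = 0
    while (i < data.length) {
      val v = data(i)
      if (v % 2 == 0) sum += v * 3
      i += 1
    }
    sum
  }
}
```

Both functions compute the same answer; the fused version simply removes the per-row iterator hops that the JIT struggles to optimize away. In a real session, operators that were fused into a generated stage appear with a `*(n)` prefix in the explain output.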
The payoff: one engine, two front ends
Because Catalyst lowers everything to a SparkPlan and then to an RDD[InternalRow], a SQL query and a hand-written RDD job share the same scheduler, shuffle, memory, and executor machinery described on the other pages. The difference is that the SQL front end knows the schema and the relational semantics, so it can optimize aggressively (pruning columns, pushing filters into scans, picking join algorithms) before a single task is scheduled.
Key takeaways
- SQL/DataFrames compile through analyze → optimize → plan → prepare → RDD.
- Every plan and expression is a TreeNode; every optimization is a Rule run to a fixed point.
- SparkPlan.doExecute returns an RDD[InternalRow], the link to the core engine.
- Whole-stage codegen fuses operators into one compiled loop over UnsafeRows.