Apache Spark — Codebase Deep Dive

1. Executive mental model

Apache Spark is a distributed engine for batch and stream analytics. Its high-level APIs (Scala, Java, Python, R) turn user programs into a graph of parallel tasks that run on a cluster of JVM processes. The codebase at github.com/apache/spark (version 5.0.0-SNAPSHOT on master) is a large polyglot monorepo centered on Scala 2.13 and Java 17+.

Think of Spark in three stacked layers:

┌─────────────────────────────────────────────────────────────┐
│  User APIs: SparkSession, Dataset/DataFrame, RDD, Streaming │
├─────────────────────────────────────────────────────────────┤
│  Compiler:  Catalyst (SQL), RDD lineage graph (Core)        │
├─────────────────────────────────────────────────────────────┤
│  Runtime:   DAGScheduler → TaskScheduler → Executors        │
│             BlockManager, ShuffleManager, RpcEnv, SparkEnv  │
└─────────────────────────────────────────────────────────────┘
        ▲                              │
        │  cluster managers            │  HDFS, S3, Kafka, Hive,
        │  (YARN, K8s, Standalone)     │  JDBC, cloud FS, etc.
        └──────────────────────────────┘

Driver vs executor: One JVM (the driver) holds SparkContext, builds the execution plan, and schedules work. Worker JVMs (executors) run tasks, cache blocks, and write shuffle data. In local[*] mode, driver and executors share one process.
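The driver/executor split can be illustrated with a toy model in plain Python. This is only a sketch, not Spark code: `split_into_partitions`, `run_task`, and `run_job` are hypothetical names, and a thread pool stands in for executors. Because everything shares one process, it loosely resembles local mode.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Driver side (hypothetical helper): slice the input into partitions."""
    return [data[i::num_partitions] for i in range(num_partitions)]


def run_task(partition):
    """Executor side: one task processes exactly one partition."""
    return sum(x * x for x in partition)


def run_job(data, num_partitions=4):
    """Driver side: create one task per partition, dispatch, merge results."""
    partitions = split_into_partitions(data, num_partitions)
    # The thread pool is a stand-in for executors; real executors are
    # separate JVMs reached over RPC.
    with ThreadPoolExecutor(max_workers=num_partitions) as executors:
        partial_results = list(executors.map(run_task, partitions))
    return sum(partial_results)


if __name__ == "__main__":
    print(run_job(list(range(10))))  # sum of squares of 0..9 -> 285
```

The point of the sketch is the division of labour: the planning and merging code runs once in the driver role, while `run_task` runs once per partition in the executor role.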

Lazy evaluation: Transformations build a logical graph (RDD lineage or Catalyst LogicalPlan) without running cluster work. Actions (count, collect, writing sinks) trigger job submission.
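A small pure-Python sketch can make the transformation/action distinction concrete. `LazyDataset` is a hypothetical toy class, not a Spark API: transformations only append to a recorded lineage, and work happens when an action is called.

```python
class LazyDataset:
    """Toy stand-in for an RDD: a source plus a list of pending transformations."""

    def __init__(self, source, lineage=()):
        self._source = source
        self._lineage = tuple(lineage)

    # Transformations: return a new dataset and touch no data.
    def map(self, fn):
        return LazyDataset(self._source, self._lineage + (("map", fn),))

    def filter(self, pred):
        return LazyDataset(self._source, self._lineage + (("filter", pred),))

    def _compute(self):
        """Replay the recorded lineage over the source as a chain of iterators."""
        rows = iter(self._source)
        for kind, fn in self._lineage:
            rows = map(fn, rows) if kind == "map" else filter(fn, rows)
        return rows

    # Actions: these are the only methods that actually run the lineage.
    def collect(self):
        return list(self._compute())

    def count(self):
        return sum(1 for _ in self._compute())


if __name__ == "__main__":
    calls = []

    def traced_double(x):
        calls.append(x)  # record every invocation to observe laziness
        return x * 2

    ds = LazyDataset(range(5)).map(traced_double).filter(lambda x: x > 2)
    print(calls)         # [] : building the lineage ran nothing
    print(ds.collect())  # [4, 6, 8]
    print(len(calls))    # 5 : the map ran only when collect() was called
```

In this toy, every action replays the whole lineage from the source, which mirrors how an uncached RDD is recomputed for each job.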

Two query paths coexist:

RDD API — lineage graph → stages at shuffle boundaries → tasks.
SQL/DataFrame API — SQL text or DataFrame ops → Catalyst parse/analyze/optimize → physical SparkPlan → RDD → same scheduler.
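Both paths end in stages cut at shuffle boundaries. The sketch below shows that cut for a simplified, strictly linear lineage. It is an illustration under stated assumptions, not the DAGScheduler's algorithm: the real scheduler walks a dependency graph backwards from the final RDD and handles joins and shared parents. `Op` and `split_into_stages` are hypothetical names.

```python
from typing import List, NamedTuple


class Op(NamedTuple):
    name: str
    needs_shuffle: bool  # True when the op reads its input through a shuffle


def split_into_stages(lineage: List[Op]) -> List[List[str]]:
    """Group a linear chain of ops into stages, cutting before each shuffle read."""
    stages: List[List[str]] = []
    current: List[str] = []
    for op in lineage:
        # A shuffle-reading op cannot be pipelined with its parent,
        # so it closes the current stage and opens a new one.
        if op.needs_shuffle and current:
            stages.append(current)
            current = []
        current.append(op.name)
    if current:
        stages.append(current)
    return stages


if __name__ == "__main__":
    word_count = [
        Op("textFile", False),
        Op("flatMap", False),
        Op("map", False),
        Op("reduceByKey", True),
        Op("filter", False),
        Op("repartition", True),
    ]
    for i, stage in enumerate(split_into_stages(word_count)):
        print(i, stage)
    # 0 ['textFile', 'flatMap', 'map']
    # 1 ['reduceByKey', 'filter']
    # 2 ['repartition']
```

Narrow operations are pipelined inside a single stage, so the number of stages in this model is one more than the number of shuffle-reading operations that follow at least one other op.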

Spark Connect (introduced in 3.4 and heavily expanded since) splits client and server: the client builds protobuf plans; the server runs the classic Catalyst + QueryExecution stack and returns Arrow batches over gRPC.

2. Repository map

What kind of project

Apache Spark is an open-source distributed computing framework, not a web app or database server. It is packaged as Maven/SBT modules, assembled into a tarball with bin/ launch scripts, and deployed on YARN, Kubernetes, or Spark Standalone.

Languages, runtimes, build tools

Technology	Role	Evidence
Scala 2.13	Primary implementation language for core, SQL, streaming	`pom.xml` artifact `spark-parent_2.13`
Java 17+	Launcher, network, some core/tests	`README.md`, `launcher/`
Python 3.11+	PySpark via Py4J	`python/pyspark/`, `pyproject.toml`
R (deprecated)	SparkR bindings	`R/`, README note
Apache Maven	Official build & release	`./build/mvn`, root `pom.xml`
SBT	Developer/CI fast iteration	`project/SparkBuild.scala`, `./build/sbt`
Protobuf/gRPC	Spark Connect wire protocol	`sql/connect/`

Top-level directories

Directory	Role	Central vs peripheral
`core/`	RDD, scheduling, deploy, RPC, storage, shuffle — the engine kernel	Core
`sql/`	Catalyst compiler, execution engine, Hive, Connect, pipelines	Core
`common/`	Shared libs: network, unsafe memory, kvstore, utils	Core
`launcher/`	Minimal JVM to construct `java` command lines	Core (bootstrap)
`resource-managers/`	YARN and Kubernetes integration	Core (when deployed)
`python/`	PySpark client library and tests	Core (API surface)
`streaming/`	Legacy DStream API (pre-Structured Streaming)	Peripheral (legacy)
`sql/.../streaming/`	Structured Streaming (micro-batch, state store)	Core
`mllib/`, `graphx/`	ML and graph algorithms on RDDs/DataFrames	Peripheral (libraries)
`connector/`	Kafka, Avro, Protobuf, Kinesis connectors	Peripheral (integrations)
`assembly/`	Fat JAR / distribution assembly	Build infra
`bin/`, `sbin/`	CLI entry scripts, cluster daemons	Entry points
`conf/`	Template configs (`spark-defaults.conf.template`)	Configuration
`examples/`	Sample apps (Pi, etc.)	Peripheral
`docs/`	User-facing documentation site source	Docs
`dev/`	Test runners, release scripts, lint	Tooling
`.github/workflows/`	CI (63 workflow files)	Infra
`repl/`	Scala REPL integration	Peripheral
`udf/`	External UDF worker over gRPC	Peripheral (extension)
`ui-test/`	Jest tests for Spark UI static assets	Tests

Entry points

Entry	Path	Main class
`spark-submit`	`bin/spark-submit` → `bin/spark-class`	`org.apache.spark.deploy.SparkSubmit`
Launcher bootstrap	`bin/spark-class`	`org.apache.spark.launcher.Main`
Interactive Scala	`bin/spark-shell`	`org.apache.spark.repl.Main`
Interactive Python	`bin/pyspark`	Py4J gateway → JVM shell
SQL CLI	`bin/spark-sql`	`org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver` (via SparkSubmit)
Standalone master	`sbin/start-master.sh`	`org.apache.spark.deploy.master.Master`
Standalone worker	`sbin/start-worker.sh`	`org.apache.spark.deploy.worker.Worker`
Executor process	Launched by cluster manager	`org.apache.spark.executor.CoarseGrainedExecutorBackend`
Spark Connect server	Started inside a Spark application or by `sbin/start-connect-server.sh`	`org.apache.spark.sql.connect.service.SparkConnectService`
In-process API	User code	`SparkSession.builder().getOrCreate()`

Configuration vs generated vs tests

Configuration: conf/*.template, SparkConf, SQLConf, typed keys in core/.../internal/config/package.scala
Generated: Protobuf classes from sql/connect, ANTLR parsers in Catalyst, build info emitted by the build/spark-build-info script
Tests: <module>/src/test/scala/** (ScalaTest), src/test/java/** (JUnit 5), python/pyspark/**/tests/ (unittest)
Scripts: dev/run-tests.py, dev/make-distribution.sh, build/mvn, build/sbt

Why core/ is central: Every API path eventually calls SparkContext.runJob, uses SparkEnv singletons on each JVM, and depends on DAGScheduler + BlockManager. SQL is a compiler layered on top; without core, nothing runs on a cluster.
