Apache Spark — Codebase Deep Dive

1. Executive mental model

Apache Spark is a distributed batch and stream analytics engine. It exposes high-level APIs (Scala, Java, Python, R) that compile user programs into a graph of parallel tasks executed on a cluster of JVM processes. The codebase at github.com/apache/spark (version 5.0.0-SNAPSHOT in master) is a large polyglot monorepo centered on Scala 2.13 and Java 17+.

Think of Spark in three stacked layers:

┌─────────────────────────────────────────────────────────────┐
│ User APIs: SparkSession, Dataset/DataFrame, RDD, Streaming  │
├─────────────────────────────────────────────────────────────┤
│ Compiler:  Catalyst (SQL), RDD lineage graph (Core)         │
├─────────────────────────────────────────────────────────────┤
│ Runtime:   DAGScheduler → TaskScheduler → Executors         │
│            BlockManager, ShuffleManager, RpcEnv, SparkEnv   │
└─────────────────────────────────────────────────────────────┘
        ▲                              │
        │ cluster managers             │ HDFS, S3, Kafka, Hive,
        │ (YARN, K8s, Standalone)      │ JDBC, cloud FS, etc.
        └──────────────────────────────┘

Driver vs executor: One JVM (the driver) holds SparkContext, builds the execution plan, and schedules work. Worker JVMs (executors) run tasks, cache blocks, and write shuffle data. In local[*] mode, driver and executors share one process.

Lazy evaluation: Transformations build a logical graph (RDD lineage or a Catalyst LogicalPlan) without running any cluster work. Actions (count, collect, writes to a sink) trigger job submission.
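To make the transformation/action split concrete, here is a minimal pure-Python sketch. It is a toy model, not Spark code: the LazyDataset class and its methods are hypothetical names invented for illustration, and it runs in a single process with no scheduler. It only shows the shape of the idea: transformations append a node to a lineage chain, and nothing executes until an action walks that chain.

```python
from typing import Callable, Iterable, Iterator, List, Optional


class LazyDataset:
    """Toy stand-in for an RDD: each node stores its parent and one operation."""

    def __init__(
        self,
        source: Optional[Iterable[int]] = None,
        parent: Optional["LazyDataset"] = None,
        op: Optional[Callable[[Iterator[int]], Iterator[int]]] = None,
        name: str = "source",
    ) -> None:
        self.source = source
        self.parent = parent
        self.op = op
        self.name = name

    # Transformations: return a new node and run nothing.
    def map(self, f: Callable[[int], int]) -> "LazyDataset":
        return LazyDataset(parent=self, op=lambda rows: (f(r) for r in rows), name="map")

    def filter(self, pred: Callable[[int], bool]) -> "LazyDataset":
        return LazyDataset(
            parent=self, op=lambda rows: (r for r in rows if pred(r)), name="filter"
        )

    def lineage(self) -> List[str]:
        """Return operation names from the source down to this node."""
        names: List[str] = []
        node: Optional["LazyDataset"] = self
        while node is not None:
            names.append(node.name)
            node = node.parent
        return list(reversed(names))

    def _compute(self) -> Iterator[int]:
        # Recursively evaluate the chain, starting from the source node.
        if self.parent is None:
            return iter(self.source)
        return self.op(self.parent._compute())

    # Actions: force evaluation of the whole chain.
    def collect(self) -> List[int]:
        return list(self._compute())

    def count(self) -> int:
        return sum(1 for _ in self._compute())


calls: List[int] = []


def traced_double(x: int) -> int:
    calls.append(x)  # record that the function actually ran
    return x * 2


ds = LazyDataset(source=range(5)).map(traced_double).filter(lambda x: x > 2)
ran_before_action = len(calls)  # 0: building the graph executed nothing
result = ds.collect()           # [4, 6, 8]: the action evaluates the chain
```

Because the toy keeps no cache, calling a second action such as ds.count() re-evaluates the chain from the source.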

Two query paths coexist:

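Classic (in-process): user code calls a SparkSession in the driver, which runs Catalyst and QueryExecution directly.

Spark Connect (since 3.4, heavily expanded since): splits client and server. The client builds protobuf plans; the server runs the classic Catalyst + QueryExecution stack and returns Arrow batches over gRPC.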
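The client/server split can be sketched in a few lines of plain Python. This is a toy under loud assumptions, not the Spark Connect protocol: JSON bytes stand in for protobuf messages, lists of dicts stand in for Arrow batches, a direct function call stands in for gRPC, and every name here (read_table, filter_gt, project, handle_request, TABLES) is hypothetical. The point is only that the client ships a description of the query, and the server is the side that executes it.

```python
import json
from typing import Any, Dict, List

Plan = Dict[str, Any]
Row = Dict[str, Any]


# Client side: build a plan as plain data. Nothing is executed here.
def read_table(name: str) -> Plan:
    return {"op": "read", "table": name}


def filter_gt(child: Plan, column: str, value: int) -> Plan:
    return {"op": "filter_gt", "child": child, "column": column, "value": value}


def project(child: Plan, columns: List[str]) -> Plan:
    return {"op": "project", "child": child, "columns": columns}


# Server side: owns the data and interprets the plan.
TABLES: Dict[str, List[Row]] = {
    "people": [
        {"name": "Ada", "age": 36},
        {"name": "Linus", "age": 21},
        {"name": "Grace", "age": 45},
    ]
}


def execute(plan: Plan) -> List[Row]:
    op = plan["op"]
    if op == "read":
        return list(TABLES[plan["table"]])
    rows = execute(plan["child"])  # evaluate the child plan first
    if op == "filter_gt":
        return [r for r in rows if r[plan["column"]] > plan["value"]]
    if op == "project":
        return [{c: r[c] for c in plan["columns"]} for r in rows]
    raise ValueError(f"unknown op: {op}")


def handle_request(wire_bytes: bytes) -> bytes:
    """Decode a plan, run it, and encode the result rows."""
    plan = json.loads(wire_bytes.decode("utf-8"))
    return json.dumps(execute(plan)).encode("utf-8")


# The client serializes the plan, "sends" it, and decodes the reply.
plan = project(filter_gt(read_table("people"), "age", 30), ["name"])
request = json.dumps(plan).encode("utf-8")
response = json.loads(handle_request(request).decode("utf-8"))
# response == [{"name": "Ada"}, {"name": "Grace"}]
```

In this shape the client needs no access to the data or the execution engine, which is the property that lets a thin client in one language talk to a server written in another.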

2. Repository map

What kind of project

Apache Spark is an open-source distributed computing framework, not a web app or database server. It is packaged as Maven/SBT modules, assembled into a tarball with bin/ launch scripts, and deployed on YARN, Kubernetes, or Spark Standalone.

Languages, runtimes, build tools

| Technology | Role | Evidence |
|---|---|---|
| Scala 2.13 | Primary implementation language for core, SQL, streaming | pom.xml artifact spark-parent_2.13 |
| Java 17+ | Launcher, network, some core/tests | README.md, launcher/ |
| Python 3.11+ | PySpark via Py4J | python/pyspark/, pyproject.toml |
| R (deprecated) | SparkR bindings | R/, README note |
| Apache Maven | Official build & release | ./build/mvn, root pom.xml |
| SBT | Developer/CI fast iteration | project/SparkBuild.scala, ./build/sbt |
| Protobuf/gRPC | Spark Connect wire protocol | sql/connect/ |

Top-level directories

| Directory | Role | Central vs peripheral |
|---|---|---|
| core/ | RDD, scheduling, deploy, RPC, storage, shuffle: the engine kernel | Core |
| sql/ | Catalyst compiler, execution engine, Hive, Connect, pipelines | Core |
| common/ | Shared libs: network, unsafe memory, kvstore, utils | Core |
| launcher/ | Minimal JVM to construct java command lines | Core (bootstrap) |
| resource-managers/ | YARN and Kubernetes integration | Core (when deployed) |
| python/ | PySpark client library and tests | Core (API surface) |
| streaming/ | Legacy DStream API (pre-Structured Streaming) | Peripheral (legacy) |
| sql/.../streaming/ | Structured Streaming (micro-batch, state store) | Core |
| mllib/, graphx/ | ML and graph algorithms on RDDs/DataFrames | Peripheral (libraries) |
| connector/ | Kafka, Avro, Protobuf, Kinesis connectors | Peripheral (integrations) |
| assembly/ | Fat JAR / distribution assembly | Build infra |
| bin/, sbin/ | CLI entry scripts, cluster daemons | Entry points |
| conf/ | Template configs (spark-defaults.conf.template) | Configuration |
| examples/ | Sample apps (Pi, etc.) | Peripheral |
| docs/ | User-facing documentation site source | Docs |
| dev/ | Test runners, release scripts, lint | Tooling |
| .github/workflows/ | CI (63 workflow files) | Infra |
| repl/ | Scala REPL integration | Peripheral |
| udf/ | External UDF worker over gRPC | Peripheral (extension) |
| ui-test/ | Jest tests for Spark UI static assets | Tests |

Entry points

| Entry | Path | Main class |
|---|---|---|
| spark-submit | bin/spark-submit → bin/spark-class | org.apache.spark.deploy.SparkSubmit |
| Launcher bootstrap | bin/spark-class | org.apache.spark.launcher.Main |
| Interactive Scala | bin/spark-shell | org.apache.spark.repl.Main |
| Interactive Python | bin/pyspark | Py4J gateway → JVM shell |
| SQL CLI | bin/spark-sql | SQL shell main via SparkSubmit |
| Standalone master | sbin/start-master.sh | org.apache.spark.deploy.master.Master |
| Standalone worker | sbin/start-worker.sh | org.apache.spark.deploy.worker.Worker |
| Executor process | Launched by cluster manager | org.apache.spark.executor.CoarseGrainedExecutorBackend |
| Spark Connect server | Started with Spark app / dedicated command | org.apache.spark.sql.connect.service.SparkConnectService |
| In-process API | User code | SparkSession.builder().getOrCreate() |
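As an illustration of the spark-submit entry point, the sketch below assembles a command line in Python. The helper build_spark_submit and the JAR path are hypothetical; the flags it emits (--class, --master, --deploy-mode, --conf) are standard spark-submit options, and org.apache.spark.examples.SparkPi is the Pi sample from examples/. The helper only builds an argument list and does not launch anything.

```python
import shlex
from typing import Dict, List, Optional


def build_spark_submit(
    app_jar: str,
    main_class: str,
    master: str = "local[*]",
    deploy_mode: Optional[str] = None,
    conf: Optional[Dict[str, str]] = None,
    app_args: Optional[List[str]] = None,
    spark_home: Optional[str] = None,
) -> List[str]:
    """Return an argv list for spark-submit (hypothetical helper)."""
    if spark_home:
        script = spark_home.rstrip("/") + "/bin/spark-submit"
    else:
        script = "spark-submit"  # assume it is on PATH
    argv = [script, "--class", main_class, "--master", master]
    if deploy_mode:
        argv += ["--deploy-mode", deploy_mode]
    # Sort keys so the generated command line is deterministic.
    for key, value in sorted((conf or {}).items()):
        argv += ["--conf", f"{key}={value}"]
    argv.append(app_jar)            # options must precede the application JAR
    argv += list(app_args or [])    # everything after the JAR goes to the app
    return argv


argv = build_spark_submit(
    app_jar="examples/jars/spark-examples.jar",  # hypothetical path
    main_class="org.apache.spark.examples.SparkPi",
    conf={"spark.executor.memory": "2g"},
    app_args=["100"],
)
command_line = shlex.join(argv)  # shell-quoted form, suitable for logging
```

Passing argv as a list to subprocess.run avoids shell quoting problems with values such as local[*]; the joined string is intended for display only.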

Configuration vs generated vs tests

Why core/ is central: Every API path eventually calls SparkContext.runJob, uses SparkEnv singletons on each JVM, and depends on DAGScheduler + BlockManager. SQL is a compiler layered on top; without core, nothing runs on a cluster.
