Apache Spark — Architecture

3. Main runtime architecture

Problem solved

Spark solves large-scale data-parallel computation with a unified programming model. Users write declarative or functional transformations; Spark handles partitioning, scheduling, fault tolerance (lineage recomputation), shuffle, and data locality. It integrates with Hadoop-compatible filesystems, Hive metastore, and cloud object stores.

Runtime components

┌──────────────── Driver JVM ────────────────┐ │ SparkSession / SparkContext │ │ DAGScheduler, TaskSchedulerImpl │ │ BlockManagerMaster, MapOutputTrackerMaster │ │ SparkUI, LiveListenerBus │ │ [Connect] SparkConnectService (gRPC) │ └───────────────┬───────────────────────────────┘ │ RpcEnv (Netty) ┌─────────────────────┼─────────────────────┐ ▼ ▼ ▼ ┌────────────────┐ ┌────────────────┐ ┌────────────────┐ │ Executor JVM │ │ Executor JVM │ │ Executor JVM │ │ BlockManager │ │ BlockManager │ │ BlockManager │ │ ShuffleManager │ │ ShuffleManager │ │ ShuffleManager │ │ Task threads │ │ Task threads │ │ Task threads │ └────────────────┘ └────────────────┘ └────────────────┘

Core abstractions

AbstractionLocationMeaning
RDD[T]core/.../rdd/RDD.scalaImmutable partitioned collection with lineage
Dependencycore/.../Dependency.scalaNarrow (pipeline) vs ShuffleDependency (stage break)
Stagecore/.../scheduler/Stage.scalaShuffleMapStage or ResultStage — set of tasks on same RDD op chain
Taskcore/.../scheduler/Task.scalaShuffleMapTask or ResultTask — unit sent to one core
LogicalPlansql/catalyst/.../logical/Unresolved/resolved relational algebra tree
SparkPlansql/core/.../execution/SparkPlan.scalaPhysical operator tree (*Exec suffix)
QueryExecutionsql/core/.../execution/QueryExecution.scalaLazy pipeline: analyze → optimize → plan → execute
SparkEnvcore/.../SparkEnv.scalaPer-JVM runtime bag: serializer, RpcEnv, BlockManager, etc.

Data flows

  1. Input: Hadoop FileSystem, DataSource V2 connectors, JDBC, Kafka (via connector), in-memory collections.
  2. Shuffle: Map tasks write partitioned files (sort-based by default via SortShuffleManager); reduce tasks fetch via BlockStoreClient.
  3. Cache: BlockManager stores serialized/deserialized blocks in memory or spills to disk (StorageLevel).
  4. Output: Actions collect to driver, write to sinks (Parquet, ORC, Hive tables), or stream via Structured Streaming sinks.

Control flows

External systems

SystemIntegration point
HDFS / S3 / GCS / ABFSHadoop FileSystem, hadoop-cloud/, DataSource readers
Hive Metastoresql/hive/HiveExternalCatalog.scala
YARNresource-managers/yarn/, --master yarn
Kubernetesresource-managers/kubernetes/core/, --master k8s://...
Kafkaconnector/kafka-0-10-sql/
JDBC databasessql/core/.../jdbc/
History server / event logcore/.../deploy/history/
Metrics (JMX, Prometheus sink)core/.../metrics/

Startup sequence (driver, client mode)

spark-submit MyApp → launcher.Main builds java command → SparkSubmit.doSubmit → runMain (user main) → SparkSession.builder.getOrCreate() → SparkContext(conf) → SparkEnv.createDriverEnv → createTaskScheduler (Local / Standalone / YARN / K8s backend) → new DAGScheduler → taskScheduler.start → backend.start → blockManager.initialize → SparkUI.bind → application runs until sc.stop() or JVM exit

Startup sequence (executor)

CoarseGrainedExecutorBackend.main → RpcEnv + fetch SparkAppConfig from driver → SparkEnv.createExecutorEnv → RegisterExecutor RPC to driver → new Executor(...) — thread pool ready → await LaunchTask messages

System purpose — mental model for engineers

When debugging Spark, ask four questions:

  1. What graph? RDD lineage or Catalyst df.queryExecution.executedPlan.
  2. What stages? Spark UI / DAGScheduler stage list — shuffle boundaries.
  3. Where is data? Block manager cache, shuffle files, or source partitions.
  4. Who schedules? Driver TaskScheduler + cluster manager for containers.

Spark is not a classic request/response server. User-facing "requests" are actions or streaming micro-batches that enqueue jobs on an internal event-driven scheduler.

Next: Key execution flows →