Apache Spark — Core Modules

5. Core modules (module-by-module)

core/ — distributed runtime kernel

Responsibility: RDD API, scheduling, shuffle, block storage, RPC, deployment, UI.

Public API: org.apache.spark.SparkContext, SparkConf, org.apache.spark.rdd.RDD. Note that org.apache.spark.sql.SparkSession is provided by the SQL module, not by core.

Key internals:

Depends on: common/network-*, common/unsafe, Hadoop client libs.

Depended on by: All other Spark modules.

Invariants: One active SparkContext per JVM; shuffle IDs globally unique per app; stage cleanup when jobs complete.

sql/catalyst/ — query compiler

Responsibility: SQL parsing, analysis, optimization, expression trees, rule engine.

Public API: Mostly private[sql] / developer — ParserInterface, Analyzer, Optimizer, LogicalPlan, Rule[T].

Key classes: RuleExecutor (fixed-point batches), CatalystSqlParser, TreeNode (transformations via patterns).

Pattern: Compiler passes — batches of Rule[LogicalPlan] applied until fixpoint.

Non-obvious: Catalyst is largely cluster-agnostic; no RDD references in logical plans.
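
The fixed-point batch idea can be illustrated with a toy rule executor. This is a minimal sketch, not Catalyst's implementation: the Lit, Attr, and Add node classes, the two rules, and run_batch are all hypothetical names invented for the example, and the real RuleExecutor works on LogicalPlan trees in Scala.

```python
from dataclasses import dataclass
from typing import Callable, List, Union


# Toy expression nodes (hypothetical; they only stand in for Catalyst's TreeNode).
@dataclass(frozen=True)
class Lit:
    value: int


@dataclass(frozen=True)
class Attr:
    name: str


@dataclass(frozen=True)
class Add:
    left: "Expr"
    right: "Expr"


Expr = Union[Lit, Attr, Add]
Rule = Callable[[Expr], Expr]


def transform_up(expr: Expr, rule: Rule) -> Expr:
    """Rewrite children first, then apply the rule to the rebuilt node."""
    if isinstance(expr, Add):
        expr = Add(transform_up(expr.left, rule), transform_up(expr.right, rule))
    return rule(expr)


def constant_fold(expr: Expr) -> Expr:
    """Add(Lit(a), Lit(b)) becomes Lit(a + b)."""
    if isinstance(expr, Add) and isinstance(expr.left, Lit) and isinstance(expr.right, Lit):
        return Lit(expr.left.value + expr.right.value)
    return expr


def drop_zero(expr: Expr) -> Expr:
    """x + 0 and 0 + x become x."""
    if isinstance(expr, Add):
        if expr.left == Lit(0):
            return expr.right
        if expr.right == Lit(0):
            return expr.left
    return expr


def run_batch(plan: Expr, rules: List[Rule], max_iterations: int = 100) -> Expr:
    """Apply every rule in order, repeating until the plan stops changing."""
    for _ in range(max_iterations):
        new_plan = plan
        for rule in rules:
            new_plan = transform_up(new_plan, rule)
        if new_plan == plan:  # fixpoint reached
            return plan
        plan = new_plan
    raise RuntimeError("batch did not converge within max_iterations")


plan = Add(Add(Lit(1), Lit(-1)), Add(Attr("x"), Lit(0)))
optimized = run_batch(plan, [constant_fold, drop_zero])
print(optimized)
```

Here the first pass folds 1 + (-1) into 0 and strips both zero additions, and the second pass changes nothing, which is what ends the batch. The iteration cap is a safeguard against rule sets that never settle.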

sql/core/ — SQL runtime

Responsibility: Physical planning, execution operators, session state, classic API, Structured Streaming runtime.

Public API: org.apache.spark.sql.classic.SparkSession, Dataset, DataFrameReader/Writer.

Key classes: QueryExecution, SparkPlan, SparkPlanner, SQLExecution, FileSourceScanExec, AdaptiveSparkPlanExec.

Depends on: catalyst, core.

sql/hive/ — Hive metastore integration

Responsibility: HiveExternalCatalog, Hive SerDe tables, HiveTableScanExec.

Activation: SparkSession.builder.enableHiveSupport() → HiveSessionStateBuilder.

Peripheral when: Using only DataSource V2 / in-memory catalog.

sql/connect/ — decoupled client/server

| Submodule | Role |
| --- | --- |
| connect/common | Client SparkSession, SparkConnectClient, proto plan builders |
| connect/server | SparkConnectService gRPC, SparkConnectPlanner, session manager |
| connect/client/jvm | JVM client packaging |
| connect/shims | Version compatibility shims |

Critical invariant: Server reuses classic QueryExecution — Connect is transport + plan deserialization, not a separate engine.
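
The "transport + plan deserialization" invariant can be pictured with a toy model in which the in-process path and the remote path both end in the same engine function. This is only a sketch under loud assumptions: execute_plan, the dict-based plan format, and the two collect helpers are hypothetical, and JSON stands in for the Protobuf and Arrow encodings that Connect actually uses.

```python
import json
from typing import Any, Dict, List

Plan = Dict[str, Any]


def execute_plan(plan: Plan) -> List[int]:
    """Stand-in for the single execution engine (hypothetical, two operators only)."""
    op = plan["op"]
    if op == "range":
        return list(range(plan["end"]))
    if op == "filter_even":
        return [x for x in execute_plan(plan["child"]) if x % 2 == 0]
    raise ValueError(f"unknown operator: {op}")


def classic_collect(plan: Plan) -> List[int]:
    """In-process path: the plan goes straight to the engine."""
    return execute_plan(plan)


def connect_server_handle(request: bytes) -> bytes:
    """Server side: deserialize the plan, then reuse the very same engine."""
    plan = json.loads(request)
    rows = execute_plan(plan)
    return json.dumps(rows).encode("utf-8")


def connect_client_collect(plan: Plan) -> List[int]:
    """Client side: serialize the plan, 'send' it, decode the result."""
    response = connect_server_handle(json.dumps(plan).encode("utf-8"))
    return json.loads(response)


plan = {"op": "filter_even", "child": {"op": "range", "end": 6}}
print(classic_collect(plan), connect_client_collect(plan))
```

Because the server adds no planning or execution logic of its own in this model, both paths return identical results by construction.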

launcher/ — bootstrap only

Responsibility: Construct JVM commands without loading full Spark.

API: SparkLauncher (Java), Main, SparkSubmitCommandBuilder.

Why separate: Keeps shell scripts fast; classpath computed before heavy Spark classes load.
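
As a rough illustration of what "constructing a JVM command" means, the sketch below assembles an argument list that ends in the org.apache.spark.deploy.SparkSubmit main class. It is a simplified assumption of the shape of such a command, not the logic of SparkSubmitCommandBuilder; build_submit_command and all paths are hypothetical.

```python
import os
from typing import Dict, List


def build_submit_command(
    java_home: str,
    classpath: List[str],
    conf: Dict[str, str],
    app_resource: str,
    app_args: List[str],
) -> List[str]:
    """Build an argv list only; nothing from Spark is imported or started here."""
    java = os.path.join(java_home, "bin", "java")
    cmd = [java, "-cp", os.pathsep.join(classpath), "org.apache.spark.deploy.SparkSubmit"]
    for key in sorted(conf):  # sorted for a deterministic command line
        cmd += ["--conf", f"{key}={conf[key]}"]
    cmd.append(app_resource)
    cmd.extend(app_args)
    return cmd


# All paths and values below are made up for the example.
command = build_submit_command(
    java_home="/opt/java",
    classpath=["/opt/spark/jars/*"],
    conf={"spark.executor.memory": "2g"},
    app_resource="app.jar",
    app_args=["input.txt"],
)
print(command)
```

The point of the design is visible even in the toy: the builder needs only strings and the filesystem layout, so it can run before any heavy classes are loaded.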

common/ — shared infrastructure

| Module | Purpose |
| --- | --- |
| network-common | Netty RPC, transport config |
| network-shuffle | External shuffle service protocol |
| network-yarn | YARN shuffle integration |
| unsafe | Off-heap memory access for Tungsten |
| kvstore | RocksDB/InMemory KV for UI history |
| sketch | Approximate algorithms (Bloom, quantiles) |

resource-managers/

YARN: Client.scala, ApplicationMaster.scala, YarnSchedulerBackend.scala.

Kubernetes: org.apache.spark.deploy.k8s — pod spec builders, driver/executor pod lifecycle.

Standalone: Lives in core/deploy/master and core/deploy/worker — not under resource-managers.

python/pyspark/

Responsibility: Python API mirroring JVM; Py4J bridge; Python worker for UDFs/Pandas UDFs.

Key files: java_gateway.py, worker.py, sql/, connect/.

Type: Glue + API — heavy lifting on JVM.

streaming/ (legacy DStreams)

Old micro-batch API over RDDs. Structured Streaming in sql/core/.../execution/streaming/ is the maintained path.

mllib/, graphx/

Algorithm libraries built on RDDs/DataFrames. Important for ML users, but not part of the scheduling core.

connector/

Kafka, Avro, Protobuf, Kinesis — optional Maven modules, DataSource V1/V2 implementations.

6. Data model and persistence

Domain entities (in-memory)

| Entity | Representation |
| --- | --- |
| Row (SQL) | InternalRow (Catalyst), Row (user-facing) |
| Schema | StructType, StructField |
| Table metadata | CatalogTable, V2 Table via TableCatalog |
| Block | BlockId (e.g. rdd_123_4, shuffle_0_0_0) |
| Shuffle metadata | MapStatus, index files + data files on disk |
| Streaming offset | OffsetSeqMetadata, JSON in checkpoint dir |
| State store | Versioned key-value per operator + partition (HDFSBackedStateStore) |
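
The block names in the table encode their coordinates: rdd_<rddId>_<splitIndex> and shuffle_<shuffleId>_<mapId>_<reduceId>. The following sketch parses just those two shapes. The RddBlock and ShuffleBlock tuples and parse_block_id are hypothetical helpers for illustration; Spark's own BlockId hierarchy is Scala and covers more block kinds than shown here.

```python
import re
from typing import NamedTuple, Union


class RddBlock(NamedTuple):
    rdd_id: int
    split_index: int


class ShuffleBlock(NamedTuple):
    shuffle_id: int
    map_id: int
    reduce_id: int


_RDD = re.compile(r"rdd_(\d+)_(\d+)")
_SHUFFLE = re.compile(r"shuffle_(\d+)_(\d+)_(\d+)")


def parse_block_id(name: str) -> Union[RddBlock, ShuffleBlock]:
    """Parse the two block-name shapes used as examples in the table above."""
    match = _RDD.fullmatch(name)
    if match:
        return RddBlock(int(match.group(1)), int(match.group(2)))
    match = _SHUFFLE.fullmatch(name)
    if match:
        return ShuffleBlock(*(int(g) for g in match.groups()))
    raise ValueError(f"unrecognized block name: {name}")


print(parse_block_id("rdd_123_4"))
print(parse_block_id("shuffle_0_0_0"))
```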

No ORM / app database

Spark itself does not persist application domain data in an internal database. Metadata goes to:

Serialization

Config formats

Data lifecycle example (batch write)

Parquet files on S3 → FileSourceScanExec reads splits → InternalRow in executor memory (possibly off-heap via UnsafeRow) → shuffle (optional) → sort/aggregate Exec → WriterCommitMessage → committed files on S3 (transactional commit via OutputCommitter)
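
The last hop of this lifecycle, where task output only becomes visible at commit time, can be sketched as a write-to-staging-then-rename protocol. The code below is a simplified local-filesystem analogue, assuming a single staging directory and using the returned path as a stand-in for a commit message. write_task and commit_job are hypothetical names, and this is not the behavior of any specific OutputCommitter implementation.

```python
import os
import shutil
import tempfile
from typing import Iterable, List


def write_task(staging_dir: str, task_id: int, rows: Iterable[int]) -> str:
    """Each task writes into the staging area and returns its 'commit message'."""
    os.makedirs(staging_dir, exist_ok=True)
    path = os.path.join(staging_dir, f"part-{task_id:05d}.txt")
    with open(path, "w") as handle:
        for row in rows:
            handle.write(f"{row}\n")
    return path


def commit_job(output_dir: str, commit_messages: List[str]) -> None:
    """Driver-side commit: move staged files into place, then write a marker."""
    for staged in commit_messages:
        os.replace(staged, os.path.join(output_dir, os.path.basename(staged)))
    open(os.path.join(output_dir, "_SUCCESS"), "w").close()


output_dir = tempfile.mkdtemp()
staging_dir = os.path.join(output_dir, "_temporary")

messages = [write_task(staging_dir, i, rows) for i, rows in enumerate([[1, 2], [3]])]
# Before the commit, readers that skip underscore-prefixed names see nothing.
visible_before = sorted(n for n in os.listdir(output_dir) if not n.startswith("_"))

commit_job(output_dir, messages)
shutil.rmtree(staging_dir)
visible_after = sorted(n for n in os.listdir(output_dir) if not n.startswith("_"))
print(visible_before, visible_after)
```

Separating "task wrote its file" from "job published its files" is what lets a failed job be discarded by deleting the staging area, without readers ever seeing partial output.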

7. External interfaces (summary)

| Interface | Protocol / format | Entry |
| --- | --- | --- |
| CLI | shell scripts | bin/* |
| Scala/Java API | in-process | SparkSession, SparkContext |
| Python API | Py4J TCP | pyspark |
| Spark Connect | gRPC + Protobuf + Arrow | SparkConnectClient |
| JDBC/ODBC | Thrift JDBC | sql/hive-thriftserver |
| REST submission | JSON | core/.../deploy/rest/ |
| Filesystem | Hadoop FileSystem API | DataSources, checkpoints |
| Plugin SPI | ServiceLoader | ExternalClusterManager, DataSource V2, catalog plugins |

8. Dependencies and architecture patterns

Dependency direction

python/R APIs → sql/api → sql/core → sql/catalyst
        ↓
core → common/*
        ↓
hadoop-client, netty

Patterns in use

Coupling vs separation

| Clean separation | Tight coupling |
| --- | --- |
| Catalyst logical vs physical plans | SparkEnv global singleton |
| SchedulerBackend trait vs implementations | SQL ↔ Core via RDD bridge (necessary) |
| DataSource V2 connector API | Hive module patches SessionState builder |
| Launcher vs core classpath | PySpark requires matching JVM/Python versions |
