Spark — Tests, Build & Learning Path

10. Tests

Frameworks

Language	Framework	Base class / runner
Scala	ScalaTest `AnyFunSuite`	`SparkFunSuite`
Java	JUnit 5 (Jupiter)	classes named `*Suite.java`, `*Test.java`
Python	stdlib `unittest`	`PySparkTestCase`, `python/run-tests.py`
UI	Jest 30	`ui-test/tests/*.test.js`

Test types

Unit: Catalyst rule tests, serializer tests, individual RDD ops with local[2]
Integration: Full SQL queries, Hive metastore (often Derby embedded), streaming with memory sinks
Cluster-ish: local-cluster mode simulates multiple executors in one JVM
Tagged slow/extended: tag annotations from org.apache.spark.tags such as @SlowSQLTest and @ExtendedYarnTest; skipped in default CI
Docker/K8s IT: connector/docker-integration-tests, resource-managers/kubernetes/integration-tests

What tests reveal about intended behavior

DAGSchedulerSuite — stage boundaries, fetch failure recovery, cache tracking
QueryExecutionSuite / SQLQueryTestSuite — golden SQL plans and results
StreamingQuerySuite — micro-batch offset progression
SparkConnect* suites — client/server protocol parity with classic API
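
The "golden" approach mentioned for SQLQueryTestSuite can be illustrated with a minimal sketch. It is not Spark's implementation: `check_golden`, `render_result`, and the `.out` file naming are hypothetical names chosen for this example. It only shows the general technique of storing expected output once and comparing later runs against it.

```python
import tempfile
from pathlib import Path


def render_result(rows):
    """Hypothetical stand-in for rendering a query result as stable text."""
    return "\n".join(",".join(str(col) for col in row) for row in rows) + "\n"


def check_golden(name, actual, golden_dir, regenerate=False):
    """Compare `actual` with the stored golden text, or rewrite it when regenerating."""
    golden_file = Path(golden_dir) / f"{name}.out"
    if regenerate:
        golden_file.write_text(actual)
        return True
    if not golden_file.exists():
        return False  # a missing golden file is treated as a failure
    return golden_file.read_text() == actual


golden_dir = tempfile.mkdtemp()
# First run records the expected output.
recorded = check_golden("range_count", render_result([(1000,)]), golden_dir, regenerate=True)
# Later runs pass only while the rendered output stays identical.
unchanged = check_golden("range_count", render_result([(1000,)]), golden_dir)
regressed = check_golden("range_count", render_result([(999,)]), golden_dir)
```

The trade-off of this style is that any intentional behavior change requires regenerating and reviewing the golden files, which is why such diffs tend to show up in pull requests.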

What appears less covered

Full multi-node failure chaos (partial — mostly simulated)
Every connector combination with real cloud services (Docker ITs cover some)
Performance regressions (separate benchmarking, not unit tests)

How to run tests

# Full suite (long)
./dev/run-tests

# Specific modules
./dev/run-tests --parallelism 1 --modules core,sql

# Maven single module
./build/mvn -pl :spark-core_2.13 test

# Python
./python/run-tests --modules pyspark-sql --python-executables=python3.12

# With tags
./dev/run-tests --included-tags org.apache.spark.tags.ExtendedSQLTest

Test fixtures

A SparkContext is often created per suite with master local[2] and spark.testing=true. Temporary directories come from Utils.createTempDir. Test JARs shared across suites live in core/target/...; the module dependency graph in dev/sparktestsupport/modules.py determines which suites run when files change.
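
The idea of selecting suites from a module dependency graph can be sketched as follows. This is a simplified illustration, not the contents of dev/sparktestsupport/modules.py: the `MODULES` table, its path prefixes, and both function names are assumptions made up for the example.

```python
from collections import deque

# Hypothetical, heavily simplified table: name -> (source path prefixes, dependencies).
MODULES = {
    "core": (["core/"], []),
    "catalyst": (["sql/catalyst/"], ["core"]),
    "sql": (["sql/core/"], ["catalyst"]),
    "hive": (["sql/hive/"], ["sql"]),
    "pyspark-sql": (["python/pyspark/sql/"], ["sql"]),
}


def modules_for_files(changed_files):
    """Return the modules whose source prefixes match any changed file."""
    return {
        name
        for name, (prefixes, _) in MODULES.items()
        for path in changed_files
        if any(path.startswith(prefix) for prefix in prefixes)
    }


def with_dependents(changed_modules):
    """Add every module that depends, directly or transitively, on a changed one."""
    dependents = {name: set() for name in MODULES}
    for name, (_, deps) in MODULES.items():
        for dep in deps:
            dependents[dep].add(name)
    selected = set(changed_modules)
    queue = deque(changed_modules)
    while queue:
        for child in dependents[queue.popleft()]:
            if child not in selected:
                selected.add(child)
                queue.append(child)
    return selected


to_test = with_dependents(modules_for_files(["sql/catalyst/src/main/scala/Foo.scala"]))
print(sorted(to_test))  # ['catalyst', 'hive', 'pyspark-sql', 'sql']
```

A change in a leaf module selects only that module, while a change near the root of the graph pulls in almost everything, which is the reason core changes lead to long CI runs.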

11. Build, run, and deploy

Install dependencies

JDK 17+ (Temurin recommended per README)
Maven (bootstrapped by ./build/mvn)
Python 3.11+ for PySpark development
Node.js for UI tests (optional)

Build

# Standard build
./build/mvn -DskipTests clean package

# With cluster profiles
./build/mvn -Pyarn -Phive -Pkubernetes -DskipTests clean package

# SBT (dev)
./build/sbt package

# Distribution tarball
./dev/make-distribution.sh --tgz -Pyarn -Pkubernetes

Run locally

./bin/spark-shell
./bin/pyspark
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
  examples/jars/spark-examples_*.jar

# Standalone mini-cluster
./sbin/start-master.sh
./sbin/start-worker.sh spark://localhost:7077

Configuration

Defaults file: copy conf/spark-defaults.conf.template to conf/spark-defaults.conf
Per application: --conf key=value on submit, or SparkConf in code
Environment variables: SPARK_HOME, JAVA_HOME, HADOOP_CONF_DIR (for YARN/HDFS)
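
When the same key is set in more than one place, Spark's documented precedence is: values set on SparkConf in code win over spark-submit flags, which win over spark-defaults.conf. The sketch below models that merge with plain dictionaries. `parse_defaults` and `effective_conf` are hypothetical helpers, and the parser handles only the simple whitespace-separated `key value` form.

```python
def parse_defaults(text):
    """Parse simplified spark-defaults.conf text: 'key value' lines, '#' comments."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        conf[parts[0]] = parts[1].strip() if len(parts) > 1 else ""
    return conf


def effective_conf(defaults_file, submit_flags, code_conf):
    """Merge the three sources; later layers override earlier ones."""
    merged = {}
    for layer in (defaults_file, submit_flags, code_conf):
        merged.update(layer)
    return merged


defaults = parse_defaults(
    "# cluster-wide defaults\n"
    "spark.master            local[2]\n"
    "spark.executor.memory   2g\n"
)
flags = {"spark.executor.memory": "4g"}  # e.g. --conf spark.executor.memory=4g
in_code = {"spark.app.name": "demo", "spark.master": "local[4]"}

conf = effective_conf(defaults, flags, in_code)
```

Here `spark.executor.memory` ends up as `4g` (flag beats defaults file) and `spark.master` as `local[4]` (code beats defaults file).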

Deployment

Target	Mechanism
Standalone	`sbin/start-*`, jars under `assembly/target`
YARN	`spark-submit --master yarn --deploy-mode cluster`
Kubernetes	`--master k8s://https://...`, docker images via `bin/docker-image-tool.sh`
PyPI	`python/packaging/classic/setup.py` → `pyspark` package bundles JARs
Maven Central	Release workflow `.github/workflows/release.yml`
Docs site	`docs/` + GitHub Pages workflow

CI

The repository has 63 GitHub Actions workflows. The primary one is build_main.yml, which calls the reusable build_and_test.yml (SBT precompile plus a dev/run-tests matrix). maven_test.yml covers a separate Maven path. Python and Java version matrices run in branch-specific workflows.

12. Most important files to understand first

#	File	Why
1	`core/.../SparkContext.scala`	Driver bootstrap, job submission API
2	`core/.../rdd/RDD.scala`	Foundational abstraction and actions
3	`core/.../scheduler/DAGScheduler.scala`	Stage-oriented scheduling, failures
4	`core/.../scheduler/TaskSchedulerImpl.scala`	Task dispatch, locality, retries
5	`core/.../executor/Executor.scala`	Task execution on workers
6	`core/.../SparkEnv.scala`	Runtime service locator per JVM
7	`core/.../deploy/SparkSubmit.scala`	How apps launch on clusters
8	`sql/core/.../classic/SparkSession.scala`	Primary user entry for SQL
9	`sql/core/.../execution/QueryExecution.scala`	SQL compilation pipeline
10	`sql/catalyst/.../analysis/Analyzer.scala`	Name/type resolution rules
11	`sql/catalyst/.../optimizer/Optimizer.scala`	Logical rewrite rules
12	`sql/core/.../execution/SparkPlan.scala`	Physical execution contract
13	`core/.../storage/BlockManager.scala`	Cache and block transfer
14	`core/.../shuffle/sort/SortShuffleManager.scala`	Default shuffle implementation
15	`sql/connect/server/.../SparkConnectService.scala`	Modern remote client architecture
16	`core/.../internal/config/package.scala`	All typed configuration keys
17	`bin/spark-class` + `launcher/.../Main.java`	Process bootstrap chain
18	`python/pyspark/java_gateway.py`	PySpark JVM bridge

13. Things that are confusing or risky

Two streaming systems — legacy DStreams vs Structured Streaming; docs and imports differ.
Classic vs Connect API — Connect client has no local SparkContext; debugging requires server logs.
Hive vs V2 catalog — table resolution paths differ; spark.sql.catalogImplementation matters.
Global SparkEnv — implicit state complicates testing and embedding.
Closure serialization — capturing non-serializable objects fails at runtime, not compile time.
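Shuffle + dynamic allocation: removing an executor can lose its shuffle files unless an external shuffle service is running or shuffle tracking (spark.dynamicAllocation.shuffleTracking.enabled) is on.
AQE: the plan that actually runs may differ from what explain() shows before execution; look for the AdaptiveSparkPlanExec node and its isFinalPlan flag.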
Version skew — PySpark pip package must match cluster Spark version.
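
The closure serialization pitfall above can be reproduced without a cluster. The sketch below uses the standard library's `pickle` as a stand-in for Spark's closure serializer (the JVM side uses Java serialization and PySpark uses cloudpickle, so error messages differ). `serialization_error` is a hypothetical helper. The point is only that building the task succeeds and the failure appears when it is serialized.

```python
import functools
import operator
import os
import pickle


def serialization_error(task):
    """Return None if `task` pickles cleanly, otherwise a short error description."""
    try:
        pickle.dumps(task)
    except (TypeError, AttributeError, pickle.PicklingError) as exc:
        return f"{type(exc).__name__}: {exc}"
    return None


# Captures only a plain integer, so it can be shipped to another process.
scale = functools.partial(operator.mul, 3)

# Captures an open file handle. Constructing the task works; serializing it does not.
log_handle = open(os.devnull, "w")
log_line = functools.partial(print, file=log_handle)

ok = serialization_error(scale)
failed = serialization_error(log_line)
log_handle.close()
```

The usual fix is to capture only plain data (paths, connection strings) and create the handle or connection inside the task, on the executor.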

14. Suggested learning path for a new engineer

Read README.md, then start ./bin/spark-shell and run spark.range(1000).count().
Trace submit: bin/spark-submit → bin/spark-class → launcher/Main.java → SparkSubmit.scala.
Core model: RDD.scala (doc comment) → Dependency.scala → DAGScheduler.scala (header comment).
Run one test: DAGSchedulerSuite in IDE or Maven to see stage creation.
SQL path: SparkSession.sql → QueryExecution lazy vals → SparkPlan.execute.
Catalyst: Pick one optimizer rule (e.g. PushDownPredicates) and follow RuleExecutor.
Executor: CoarseGrainedExecutorBackend + Executor.TaskRunner.
Storage/shuffle: BlockManager, SortShuffleManager.
Deploy: Skim YARN Client.scala or K8s submit path for your target environment.
Connect (if relevant): SparkConnectService → SparkConnectPlanExecution.
Streaming (if relevant): MicroBatchExecution.scala + checkpoint layout.
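
For the Catalyst step, it can help to have a mental model of rule execution before reading RuleExecutor. The toy below is an assumption-laden sketch, not Catalyst code: plans are nested tuples, `transform_up`, `fold_constants`, `drop_add_zero`, and `execute` are invented names, and the executor simply reapplies all rules until the plan stops changing.

```python
def transform_up(node, rule):
    """Rewrite children first, then apply `rule` to the rebuilt node."""
    if isinstance(node, tuple):
        node = (node[0],) + tuple(transform_up(child, rule) for child in node[1:])
    return rule(node)


def fold_constants(node):
    """add(lit a, lit b) -> lit(a + b)"""
    if isinstance(node, tuple) and node[0] == "add":
        _, left, right = node
        if left[0] == "lit" and right[0] == "lit":
            return ("lit", left[1] + right[1])
    return node


def drop_add_zero(node):
    """add(x, lit 0) -> x and add(lit 0, x) -> x"""
    if isinstance(node, tuple) and node[0] == "add":
        _, left, right = node
        if right == ("lit", 0):
            return left
        if left == ("lit", 0):
            return right
    return node


def execute(plan, rules, max_iterations=100):
    """Apply all rules repeatedly until a fixed point is reached."""
    for _ in range(max_iterations):
        new_plan = plan
        for rule in rules:
            new_plan = transform_up(new_plan, rule)
        if new_plan == plan:
            return plan
        plan = new_plan
    # Simplification: this toy raises instead of continuing with the last plan.
    raise RuntimeError("rules did not converge")


# x + (2 + -2): folding exposes a zero that drop_add_zero removes on the next pass.
plan = ("add", ("col", "x"), ("add", ("lit", 2), ("lit", -2)))
optimized = execute(plan, [drop_add_zero, fold_constants])
```

With the rules in this order, the first pass only folds `2 + -2` into `0`, and the second pass removes the addition, which is why a single pass over the rules is not enough and a fixed-point loop is used.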

15. Open questions

These cannot be fully answered from code alone without running clusters at scale:

Exact production tuning for a specific cloud (instance types, shuffle service sizing) — docs give guidelines, not proofs.
Real-world performance vs Databricks Runtime optimizations (some APIs are OSS stubs or differ in commercial builds).
Complete connector compatibility matrix for every cloud vendor fork of Hadoop FS.
Which deprecated code paths will be removed in which release: check the SPARK JIRA tickets and release notes.
Operational SLOs for Spark Connect at N concurrent sessions — requires load testing.

Source revision: Analysis based on shallow clone of apache/spark master at Spark 5.0.0-SNAPSHOT (June 2026). Line numbers and class names may shift on other branches (e.g. branch-4.x).
