10. Tests
Frameworks
| Language | Framework | Base class / runner |
|---|---|---|
| Scala | ScalaTest AnyFunSuite | SparkFunSuite |
| Java | JUnit 5 (Jupiter) | *Suite.java, *Test.java |
| Python | stdlib unittest | PySparkTestCase, python/run-tests.py |
| UI | Jest 30 | ui-test/tests/*.test.js |
Test types
- Unit: Catalyst rule tests, serializer tests, individual RDD ops with local[2]
- Integration: Full SQL queries, Hive metastore (often Derby embedded), streaming with memory sinks
- Cluster-ish: local-cluster mode simulates multiple executors on one machine (master and workers share the driver JVM; executors run in separate JVMs)
- Tagged slow/extended: @Tag(SlowSQLTest), ExtendedYarnTest (skipped in default CI)
- Docker/K8s IT: connector/docker-integration-tests, kubernetes/integration-tests
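The tag-based skipping above can be illustrated with a small filter. This is only a sketch: the include/exclude semantics are modelled on ScalaTest's `-n` / `-l` runner flags, the `Suite` type and `select_suites` helper are hypothetical, and the tag assignments are illustrative rather than taken from the repo.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Suite:
    """Hypothetical stand-in for a test suite and the tags attached to it."""
    name: str
    tags: frozenset = field(default_factory=frozenset)


def select_suites(suites, included=frozenset(), excluded=frozenset()):
    """Return the names of suites that should run.

    Assumed semantics (modelled on ScalaTest's -n / -l flags):
    - if `included` is non-empty, a suite needs at least one included tag;
    - a suite carrying any excluded tag never runs.
    """
    selected = []
    for suite in suites:
        if included and not (suite.tags & included):
            continue
        if suite.tags & excluded:
            continue
        selected.append(suite.name)
    return selected


# Illustrative tag assignments, not copied from the repo.
SUITES = [
    Suite("DAGSchedulerSuite"),
    Suite("SomeSlowSQLSuite", frozenset({"SlowSQLTest"})),
    Suite("SomeYarnSuite", frozenset({"ExtendedYarnTest"})),
]

default_run = select_suites(
    SUITES, excluded=frozenset({"SlowSQLTest", "ExtendedYarnTest"})
)
yarn_only = select_suites(SUITES, included=frozenset({"ExtendedYarnTest"}))

print(default_run)  # ['DAGSchedulerSuite']
print(yarn_only)    # ['SomeYarnSuite']
```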
What tests reveal about intended behavior
- DAGSchedulerSuite: stage boundaries, fetch failure recovery, cache tracking
- QueryExecutionSuite / SQLQueryTestSuite: golden SQL plans and results
- StreamingQuerySuite: micro-batch offset progression
- SparkConnect* suites: client/server protocol parity with classic API
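To make "stage boundaries" concrete, here is a toy model of cutting a lineage graph at shuffle dependencies. It is a sketch under simplifying assumptions, not the DAGScheduler algorithm: `Node` and `build_stages` are hypothetical names, and caching, retries, and stage reuse across jobs are ignored.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A toy RDD: a name plus (parent, is_shuffle) dependency pairs."""
    name: str
    deps: list = field(default_factory=list)


def build_stages(final):
    """Group lineage into stages, cutting at every shuffle dependency.

    Returns a list of stages (each a sorted list of node names),
    with parent stages listed before the stages that read from them.
    """
    stages = []
    done = set()  # names of nodes that already terminate a stage

    def visit(root):
        if root.name in done:
            return
        done.add(root.name)
        members, shuffle_parents = set(), []
        stack = [root]
        while stack:
            node = stack.pop()
            if node.name in members:
                continue
            members.add(node.name)
            for parent, is_shuffle in node.deps:
                if is_shuffle:
                    shuffle_parents.append(parent)  # boundary: new stage
                else:
                    stack.append(parent)            # narrow: same stage
        for parent in shuffle_parents:
            visit(parent)
        stages.append(sorted(members))

    visit(final)
    return stages


# lines -> pairs (narrow) -> counts (shuffle) -> result (narrow)
lines = Node("lines")
pairs = Node("pairs", [(lines, False)])
counts = Node("counts", [(pairs, True)])
result = Node("result", [(counts, False)])

stages = build_stages(result)
print(stages)  # [['lines', 'pairs'], ['counts', 'result']]
```

The single shuffle edge produces two stages: a map-side stage ending at `pairs` and a result stage starting at `counts`.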
What appears less covered
- Full multi-node failure chaos (partial — mostly simulated)
- Every connector combination with real cloud services (Docker ITs cover some)
- Performance regressions (separate benchmarking, not unit tests)
How to run tests
    # Full suite (long)
    ./dev/run-tests

    # Specific modules
    ./dev/run-tests --parallelism 1 --modules core,sql

    # Maven single module
    ./build/mvn -pl :spark-core_2.13 test

    # Python
    ./python/run-tests --modules pyspark-sql --python-executables=python3.12

    # With tags
    ./dev/run-tests --included-tags org.apache.spark.tags.ExtendedSQLTest
Test fixtures
SparkContext often created per suite with local[2] and spark.testing=true. Temporary dirs via Utils.createTempDir. Shared test JARs in core/target/.... Module dependency graph in dev/sparktestsupport/modules.py determines which suites run when files change.
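The idea behind change-based test selection can be sketched as follows. The module table here is hypothetical and heavily simplified; it is not the contents of dev/sparktestsupport/modules.py, and the real script may treat unmatched files differently (this sketch simply selects nothing for them).

```python
import re

# Hypothetical module table: name -> (path regexes, direct dependencies).
MODULES = {
    "core": ([r"^core/"], []),
    "catalyst": ([r"^sql/catalyst/"], ["core"]),
    "sql": ([r"^sql/core/"], ["catalyst"]),
    "pyspark-sql": ([r"^python/pyspark/sql/"], ["sql"]),
}


def modules_to_test(changed_files):
    """Map changed paths to modules, then add every transitive dependent."""
    result = {
        name
        for name, (patterns, _) in MODULES.items()
        for path in changed_files
        if any(re.match(p, path) for p in patterns)
    }
    grew = True
    while grew:  # keep adding dependents until nothing new appears
        grew = False
        for name, (_, deps) in MODULES.items():
            if name not in result and any(d in result for d in deps):
                result.add(name)
                grew = True
    return sorted(result)


print(modules_to_test(["sql/catalyst/src/main/scala/Foo.scala"]))
# ['catalyst', 'pyspark-sql', 'sql']
```

A change in a low-level module fans out to everything built on it, which is why edits under core/ tend to trigger the longest runs.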
11. Build, run, and deploy
Install dependencies
- JDK 17+ (Temurin recommended per README)
- Maven (bootstrapped by ./build/mvn)
- Python 3.11+ for PySpark development
- Node.js for UI tests (optional)
Build
    # Standard build
    ./build/mvn -DskipTests clean package

    # With cluster profiles
    ./build/mvn -Pyarn -Phive -Pkubernetes -DskipTests clean package

    # SBT (dev)
    ./build/sbt package

    # Distribution tarball
    ./dev/make-distribution.sh --tgz -Pyarn -Pkubernetes
Run locally
    ./bin/spark-shell
    ./bin/pyspark
    ./bin/spark-submit --class org.apache.spark.examples.SparkPi \
      examples/jars/spark-examples_*.jar

    # Standalone mini-cluster
    ./sbin/start-master.sh
    ./sbin/start-worker.sh spark://localhost:7077
Configuration
- Copy conf/spark-defaults.conf.template → conf/spark-defaults.conf
- --conf key=value on submit; SparkConf in code
- SPARK_HOME, JAVA_HOME, HADOOP_CONF_DIR (for YARN/HDFS)
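When the same key is set in more than one place, Spark's configuration docs give the precedence as: values set on SparkConf in code, then submit-time flags, then spark-defaults.conf. The sketch below models that lookup order with a `ChainMap`. The parsing helpers and sample values are hypothetical and only handle the simple "key, whitespace, value" form.

```python
from collections import ChainMap

# Sample file content in spark-defaults.conf style (values are made up).
DEFAULTS_TEXT = """\
# key and value separated by whitespace
spark.master                   local[2]
spark.executor.memory          2g
spark.sql.shuffle.partitions   200
"""


def parse_defaults(text):
    """Parse 'key value' lines, skipping blanks and # comments."""
    conf = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        parts = line.split(None, 1)
        if len(parts) == 2:
            conf[parts[0]] = parts[1].strip()
    return conf


def parse_conf_flags(flags):
    """Parse the key=value payloads of repeated --conf flags."""
    return dict(flag.split("=", 1) for flag in flags)


file_defaults = parse_defaults(DEFAULTS_TEXT)
submit_flags = parse_conf_flags(
    ["spark.executor.memory=4g", "spark.sql.shuffle.partitions=64"]
)
in_code = {"spark.sql.shuffle.partitions": "8"}  # as if set on SparkConf

# ChainMap searches left to right, so the leftmost mapping wins.
effective = ChainMap(in_code, submit_flags, file_defaults)

print(effective["spark.master"])                  # local[2]
print(effective["spark.executor.memory"])         # 4g
print(effective["spark.sql.shuffle.partitions"])  # 8
```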
Deployment
| Target | Mechanism |
|---|---|
| Standalone | sbin/start-*, fat jars in assembly/target |
| YARN | spark-submit --master yarn --deploy-mode cluster |
| Kubernetes | --master k8s://https://..., docker images via bin/docker-image-tool.sh |
| PyPI | python/packaging/classic/setup.py → pyspark package bundles JARs |
| Maven Central | Release workflow .github/workflows/release.yml |
| Docs site | docs/ + GitHub Pages workflow |
CI
63 GitHub Actions workflows. Primary: build_main.yml → reusable build_and_test.yml (SBT precompile + dev/run-tests matrix). Separate Maven path in maven_test.yml. Python/Java version matrices on branch-specific workflows.
12. Most important files to understand first
| # | File | Why |
|---|---|---|
| 1 | core/.../SparkContext.scala | Driver bootstrap, job submission API |
| 2 | core/.../rdd/RDD.scala | Foundational abstraction and actions |
| 3 | core/.../scheduler/DAGScheduler.scala | Stage-oriented scheduling, failures |
| 4 | core/.../scheduler/TaskSchedulerImpl.scala | Task dispatch, locality, retries |
| 5 | core/.../executor/Executor.scala | Task execution on workers |
| 6 | core/.../SparkEnv.scala | Runtime service locator per JVM |
| 7 | core/.../deploy/SparkSubmit.scala | How apps launch on clusters |
| 8 | sql/core/.../classic/SparkSession.scala | Primary user entry for SQL |
| 9 | sql/core/.../execution/QueryExecution.scala | SQL compilation pipeline |
| 10 | sql/catalyst/.../analysis/Analyzer.scala | Name/type resolution rules |
| 11 | sql/catalyst/.../optimizer/Optimizer.scala | Logical rewrite rules |
| 12 | sql/core/.../execution/SparkPlan.scala | Physical execution contract |
| 13 | core/.../storage/BlockManager.scala | Cache and block transfer |
| 14 | core/.../shuffle/sort/SortShuffleManager.scala | Default shuffle implementation |
| 15 | sql/connect/server/.../SparkConnectService.scala | Modern remote client architecture |
| 16 | core/.../internal/config/package.scala | All typed configuration keys |
| 17 | bin/spark-class + launcher/.../Main.java | Process bootstrap chain |
| 18 | python/pyspark/java_gateway.py | PySpark JVM bridge |
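For row 9, it helps to know that QueryExecution exposes its phases as lazy vals, so each phase is computed on first access and then reused. The following is a toy analogue using `functools.cached_property`; `ToyQueryExecution` and its string "plans" are hypothetical and only mimic the laziness, not any real planning.

```python
from functools import cached_property


class ToyQueryExecution:
    """Each phase is computed once, on first access, pulling in its inputs."""

    def __init__(self, sql):
        self.sql = sql
        self.log = []  # records the order in which phases actually ran

    @cached_property
    def analyzed(self):
        self.log.append("analyze")
        return f"Analyzed({self.sql})"

    @cached_property
    def optimized_plan(self):
        analyzed = self.analyzed  # forces analysis if not done yet
        self.log.append("optimize")
        return f"Optimized({analyzed})"

    @cached_property
    def executed_plan(self):
        optimized = self.optimized_plan
        self.log.append("plan")
        return f"Physical({optimized})"


qe = ToyQueryExecution("SELECT 1")
print(qe.log)            # [] : nothing runs until a phase is requested
print(qe.executed_plan)  # Physical(Optimized(Analyzed(SELECT 1)))
print(qe.executed_plan)  # same value, served from the cache
print(qe.log)            # ['analyze', 'optimize', 'plan']
```

This is also why inspecting an intermediate plan in a debugger can trigger earlier phases as a side effect.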
13. Things that are confusing or risky
- Two streaming systems: legacy DStreams vs Structured Streaming; docs and imports differ.
- Classic vs Connect API: the Connect client has no local SparkContext; debugging requires server logs.
- Hive vs V2 catalog: table resolution paths differ; spark.sql.catalogImplementation matters.
- Global SparkEnv: implicit state complicates testing and embedding.
- Closure serialization: capturing non-serializable objects fails at runtime, not compile time.
- Shuffle + dynamic allocation: requires an external shuffle service (or shuffle tracking); otherwise removing an executor can lose its shuffle files.
- AQE: the plan at runtime may differ from explain() output unless you account for AdaptiveSparkPlanExec.
- Version skew: the PySpark pip package must match the cluster's Spark version.
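The closure serialization pitfall is easy to reproduce outside Spark. In this sketch stdlib `pickle` stands in for Spark's serializers (an assumption; Spark itself uses other mechanisms), and the `ship` / `try_ship` helpers are hypothetical. The point is only that the task works locally and fails when it is serialized.

```python
import operator
import pickle
import threading
from functools import partial


def ship(task):
    """Round-trip a task through pickle, standing in for sending it to an executor."""
    return pickle.loads(pickle.dumps(task))


def try_ship(task):
    """Return (shipped_task, None) on success or (None, error) on failure."""
    try:
        return ship(task), None
    except TypeError as exc:
        return None, exc


# A task capturing only plain data serializes fine.
lookup = {"a": 1}
good_task = partial(operator.getitem, lookup)
shipped, _ = try_ship(good_task)
print(shipped("a"))  # 1

# The same task capturing a table that happens to hold a lock.
registry = {"a": 1, "_guard": threading.Lock()}
bad_task = partial(operator.getitem, registry)
print(bad_task("a"))  # 1 : works locally, nothing flags the problem

_, error = try_ship(bad_task)
print(type(error).__name__)  # TypeError, raised only at serialization time
```

The usual remedy is to capture plain data only and create locks, connections, and similar resources inside the code that runs on the executor.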
14. Suggested learning path for a new engineer
- Read README.md, run ./bin/spark-shell and spark.range(1000).count().
- Trace submit: bin/spark-submit → launcher/Main.java → SparkSubmit.scala.
- Core model: RDD.scala (doc comment) → Dependency.scala → DAGScheduler.scala (header comment).
- Run one test: DAGSchedulerSuite in IDE or Maven to see stage creation.
- SQL path: SparkSession.sql → QueryExecution lazy vals → SparkPlan.execute.
- Catalyst: Pick one optimizer rule (e.g. PushDownPredicates) and follow RuleExecutor.
- Executor: CoarseGrainedExecutorBackend + Executor.TaskRunner.
- Storage/shuffle: BlockManager, SortShuffleManager.
- Deploy: Skim YARN Client.scala or K8s submit path for your target environment.
- Connect (if relevant): SparkConnectService → SparkConnectPlanExecution.
- Streaming (if relevant): MicroBatchExecution.scala + checkpoint layout.
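For the Catalyst step, it may help to see the fixed-point idea in miniature before reading RuleExecutor: a batch of rules is applied repeatedly until the plan stops changing. The tuple-based expression tree, the two rules, and `run_to_fixed_point` below are all hypothetical; real rules operate on Catalyst tree nodes and batches carry their own iteration strategy.

```python
def fold_constants(expr):
    """Rewrite ("add", 2, 3) into 5 when both operands are integer literals."""
    if isinstance(expr, tuple):
        op, left, right = expr
        left, right = fold_constants(left), fold_constants(right)
        if op == "add" and isinstance(left, int) and isinstance(right, int):
            return left + right
        return (op, left, right)
    return expr


def drop_multiply_by_one(expr):
    """Rewrite ("mul", x, 1) into x."""
    if isinstance(expr, tuple):
        op, left, right = expr
        left, right = drop_multiply_by_one(left), drop_multiply_by_one(right)
        if op == "mul" and right == 1:
            return left
        return (op, left, right)
    return expr


def run_to_fixed_point(plan, rules, max_iterations=100):
    """Apply the rule batch until an iteration leaves the plan unchanged."""
    for iteration in range(1, max_iterations + 1):
        before = plan
        for rule in rules:
            plan = rule(plan)
        if plan == before:
            return plan, iteration
    raise RuntimeError("rules did not converge")


# x * (0 + 1): folding must run before the multiply rule can fire.
plan = ("mul", "x", ("add", 0, 1))
result, iterations = run_to_fixed_point(
    plan, [drop_multiply_by_one, fold_constants]
)
print(result, iterations)  # x 3
```

The first pass only folds the constant, the second removes the multiplication, and the third confirms nothing changes. One rule enabling another is the reason a single pass is not enough.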
15. Open questions
These cannot be fully answered from code alone without running clusters at scale:
- Exact production tuning for a specific cloud (instance types, shuffle service sizing) — docs give guidelines, not proofs.
- Real-world performance vs Databricks Runtime optimizations (some APIs are OSS stubs or differ in commercial builds).
- Complete connector compatibility matrix for every cloud vendor fork of Hadoop FS.
- Which deprecated code paths will be removed in which release — check JIRA/SPARK tickets and release notes.
- Operational SLOs for Spark Connect at N concurrent sessions — requires load testing.
Source revision: Analysis based on a shallow clone of apache/spark master at Spark 5.0.0-SNAPSHOT (June 2026). Line numbers and class names may shift on other branches (e.g. branch-4.x).