Apache Spark — Tests & Learning Path

10. Tests

Frameworks

Language | Framework | Base class / runner
Scala | ScalaTest AnyFunSuite | SparkFunSuite
Java | JUnit 5 (Jupiter) | *Suite.java, *Test.java
Python | stdlib unittest | PySparkTestCase, python/run-tests.py
UI | Jest 30 | ui-test/tests/*.test.js

Test types

What tests reveal about intended behavior

What appears less covered

How to run tests

# Full suite (long)
./dev/run-tests

# Specific modules
./dev/run-tests --parallelism 1 --modules core,sql

# Maven single module
./build/mvn -pl :spark-core_2.13 test

# Python
./python/run-tests --modules pyspark-sql --python-executables=python3.12

# With tags
./dev/run-tests --included-tags org.apache.spark.tags.ExtendedSQLTest

Test fixtures

A SparkContext is often created per suite with master local[2] and spark.testing=true. Temporary directories come from Utils.createTempDir. Shared test JARs live under core/target/... The module dependency graph in dev/sparktestsupport/modules.py determines which suites run when files change.
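The idea of selecting suites from a module dependency graph can be illustrated with a small standalone sketch. It is not the code in dev/sparktestsupport/modules.py: the Module class, the four-module graph, and the path prefixes below are hypothetical simplifications chosen for illustration. The sketch maps changed files to modules by path prefix, then adds every module that depends on them, directly or transitively.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Module:
    """Hypothetical, simplified stand-in for a test module definition."""
    name: str
    source_prefixes: tuple[str, ...]
    dependencies: tuple[str, ...] = ()


# Assumed toy graph; the real project defines many more modules.
MODULES = {
    m.name: m
    for m in [
        Module("core", ("core/",)),
        Module("catalyst", ("sql/catalyst/",), ("core",)),
        Module("sql", ("sql/core/",), ("catalyst",)),
        Module("pyspark-sql", ("python/pyspark/sql/",), ("sql",)),
    ]
}


def modules_for_files(changed_files):
    """Modules whose source prefixes match at least one changed file."""
    return {
        m.name
        for m in MODULES.values()
        for path in changed_files
        if path.startswith(m.source_prefixes)
    }


def with_dependents(changed):
    """Add every module that depends, directly or transitively, on `changed`."""
    selected = set(changed)
    frontier = list(changed)
    while frontier:
        current = frontier.pop()
        for m in MODULES.values():
            if current in m.dependencies and m.name not in selected:
                selected.add(m.name)
                frontier.append(m.name)
    return selected


def modules_to_test(changed_files):
    """Sorted names of all modules whose suites should run."""
    return sorted(with_dependents(modules_for_files(changed_files)))


print(modules_to_test(["sql/catalyst/src/main/scala/Foo.scala"]))
# ['catalyst', 'pyspark-sql', 'sql']
```

A change under sql/catalyst/ selects catalyst plus everything downstream of it, while a change under python/pyspark/sql/ selects only pyspark-sql. The sketch leaves out details such as how files that match no module are handled.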

11. Build, run, and deploy

Install dependencies

Build

# Standard build
./build/mvn -DskipTests clean package

# With cluster profiles
./build/mvn -Pyarn -Phive -Pkubernetes -DskipTests clean package

# SBT (dev)
./build/sbt package

# Distribution tarball
./dev/make-distribution.sh --tgz -Pyarn -Pkubernetes

Run locally

./bin/spark-shell
./bin/pyspark
./bin/spark-submit --class org.apache.spark.examples.SparkPi \
  examples/jars/spark-examples_*.jar

# Standalone mini-cluster
./sbin/start-master.sh
./sbin/start-worker.sh spark://localhost:7077

Configuration

Deployment

Target | Mechanism
Standalone | sbin/start-*, fat jars in assembly/target
YARN | spark-submit --master yarn --deploy-mode cluster
Kubernetes | --master k8s://https://..., docker images via bin/docker-image-tool.sh
PyPI | python/packaging/classic/setup.py; pyspark package bundles JARs
Maven Central | Release workflow .github/workflows/release.yml
Docs site | docs/ + GitHub Pages workflow

CI

63 GitHub Actions workflows. Primary: build_main.yml → reusable build_and_test.yml (SBT precompile + dev/run-tests matrix). Separate Maven path in maven_test.yml. Python/Java version matrices on branch-specific workflows.

12. Most important files to understand first

# | File | Why
1 | core/.../SparkContext.scala | Driver bootstrap, job submission API
2 | core/.../rdd/RDD.scala | Foundational abstraction and actions
3 | core/.../scheduler/DAGScheduler.scala | Stage-oriented scheduling, failures
4 | core/.../scheduler/TaskSchedulerImpl.scala | Task dispatch, locality, retries
5 | core/.../executor/Executor.scala | Task execution on workers
6 | core/.../SparkEnv.scala | Runtime service locator per JVM
7 | core/.../deploy/SparkSubmit.scala | How apps launch on clusters
8 | sql/core/.../classic/SparkSession.scala | Primary user entry for SQL
9 | sql/core/.../execution/QueryExecution.scala | SQL compilation pipeline
10 | sql/catalyst/.../analysis/Analyzer.scala | Name/type resolution rules
11 | sql/catalyst/.../optimizer/Optimizer.scala | Logical rewrite rules
12 | sql/core/.../execution/SparkPlan.scala | Physical execution contract
13 | core/.../storage/BlockManager.scala | Cache and block transfer
14 | core/.../shuffle/sort/SortShuffleManager.scala | Default shuffle implementation
15 | sql/connect/server/.../SparkConnectService.scala | Modern remote client architecture
16 | core/.../internal/config/package.scala | All typed configuration keys
17 | bin/spark-class + launcher/.../Main.java | Process bootstrap chain
18 | python/pyspark/java_gateway.py | PySpark JVM bridge
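The "stage-oriented scheduling" noted for DAGScheduler.scala is easier to picture with a toy model. The sketch below is not Spark code: Node, Dep, and build_stages are hypothetical names, and the lineage is a hand-built example. It shows only the central idea that narrow dependencies are grouped into one stage, while each shuffle dependency starts a parent stage that must run first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Dep:
    """Hypothetical dependency edge: shuffle=True marks a stage boundary."""
    parent: "Node"
    shuffle: bool


@dataclass(frozen=True)
class Node:
    """Hypothetical stand-in for a dataset in a lineage graph."""
    name: str
    deps: tuple = ()


def build_stages(final):
    """Return stages in execution order; each stage is a list of node names."""
    stages = []
    stage_roots_seen = set()

    def visit_stage(root):
        if root.name in stage_roots_seen:
            return
        stage_roots_seen.add(root.name)
        members, parent_roots = [], []
        stack, visited = [root], set()
        while stack:
            node = stack.pop()
            if node.name in visited:
                continue
            visited.add(node.name)
            members.append(node.name)
            for dep in node.deps:
                if dep.shuffle:
                    parent_roots.append(dep.parent)  # boundary: new parent stage
                else:
                    stack.append(dep.parent)  # narrow: same stage
        for parent in parent_roots:
            visit_stage(parent)  # parent stages are emitted first
        stages.append(sorted(members))  # sorted only for stable output

    visit_stage(final)
    return stages


# Toy lineage: lines -> pairs -(shuffle)-> counts -> result
lines = Node("lines")
pairs = Node("pairs", (Dep(lines, shuffle=False),))
counts = Node("counts", (Dep(pairs, shuffle=True),))
result = Node("result", (Dep(counts, shuffle=False),))

print(build_stages(result))
# [['lines', 'pairs'], ['counts', 'result']]
```

Reading DAGSchedulerSuite alongside a model like this can make the stage-creation assertions easier to follow, though the real scheduler also handles caching, retries, and failure recovery that the sketch ignores.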

13. Things that are confusing or risky

14. Suggested learning path for a new engineer

  1. Read README.md, run ./bin/spark-shell and spark.range(1000).count().
  2. Trace submit: bin/spark-submit → launcher/Main.java → SparkSubmit.scala.
  3. Core model: RDD.scala (doc comment) → Dependency.scala → DAGScheduler.scala (header comment).
  4. Run one test: DAGSchedulerSuite in IDE or Maven to see stage creation.
  5. SQL path: SparkSession.sql → QueryExecution lazy vals → SparkPlan.execute.
  6. Catalyst: Pick one optimizer rule (e.g. PushDownPredicates) and follow RuleExecutor.
  7. Executor: CoarseGrainedExecutorBackend + Executor.TaskRunner.
  8. Storage/shuffle: BlockManager, SortShuffleManager.
  9. Deploy: Skim YARN Client.scala or K8s submit path for your target environment.
  10. Connect (if relevant): SparkConnectService → SparkConnectPlanExecution.
  11. Streaming (if relevant): MicroBatchExecution.scala + checkpoint layout.
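For step 6, it can help to have a mental model of a rule executor before opening the real one. The sketch below is a toy under stated assumptions, not Catalyst: plans are nested tuples, the two rules and the execute function are hypothetical, and only the general shape (batches of rules, re-applied until the plan stops changing or an iteration cap is hit) mirrors the fixed-point strategy that RuleExecutor offers.

```python
def transform_up(plan, rule):
    """Apply `rule` to every node of a tuple-based plan, children first."""
    if isinstance(plan, tuple):
        plan = (plan[0],) + tuple(transform_up(child, rule) for child in plan[1:])
    return rule(plan)


def fold_constants(node):
    """Hypothetical rule: ("add", 1, 2) becomes 3."""
    if (isinstance(node, tuple) and node[0] == "add"
            and all(isinstance(arg, int) for arg in node[1:])):
        return node[1] + node[2]
    return node


def remove_add_zero(node):
    """Hypothetical rule: ("add", x, 0) or ("add", 0, x) becomes x."""
    if isinstance(node, tuple) and node[0] == "add":
        if node[2] == 0:
            return node[1]
        if node[1] == 0:
            return node[2]
    return node


def execute(plan, batches):
    """Run each batch until the plan stops changing or the cap is reached."""
    for _name, max_iterations, rules in batches:
        for _ in range(max_iterations):
            before = plan
            for rule in rules:
                plan = transform_up(plan, rule)
            if plan == before:  # fixed point reached for this batch
                break
    return plan


# Batch layout (name, max_iterations, rules) is an assumption of this sketch.
batches = [("Simplify", 10, [fold_constants, remove_add_zero])]
plan = ("add", ("add", 1, 2), ("add", "x", 0))

print(execute(plan, batches))
# ('add', 3, 'x')
```

When following a real optimizer rule, the same questions apply: which batch it belongs to, what pattern it matches, and whether it can re-trigger other rules in the same batch.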

15. Open questions

These cannot be fully answered from code alone without running clusters at scale:

Source revision: Analysis based on shallow clone of apache/spark master at Spark 5.0.0-SNAPSHOT (June 2026). Line numbers and class names may shift on other branches (e.g. branch-4.x).
