Cluster & Deploy — Spark Data Flow

Control plane vs. data plane, again

spark-submit
   │ parse args, set cluster manager from master URL
   ▼
DRIVER (your SparkContext)
   │ registers the application, requests executors
   ▼
CLUSTER MANAGER (Standalone Master / YARN RM / K8s API)
   │ picks nodes with free resources
   ▼
WORKER NODES launch EXECUTOR JVMs
   │ each executor connects back to the DRIVER
   ▼
DRIVER schedules tasks straight onto executors  (cluster manager steps aside)

The key idea: the cluster manager only handles process placement and lifecycle. Once executors register with the driver, all task scheduling flows directly between driver and executors, exactly as described on the scheduling page.

spark-submit — the universal launcher

SparkSubmit parses arguments, prepares the classpath, and dispatches based on the action (submit, kill, request status). prepareSubmitEnvironment reads the master URL to choose the cluster manager (spark:// for Standalone, yarn for YARN, k8s:// for Kubernetes, local for in-process execution) and reads the deploy mode. In client mode it runs your main class in the submitting JVM; in cluster mode it asks the cluster manager to launch the driver on a cluster node.

SparkSubmit.scala L70-L100 — doSubmit

SparkSubmit.scala L243-L265 — choosing the cluster manager
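
To make the master URL dispatch concrete, here is a minimal Python sketch of the same decision. It is a toy model, not the code in SparkSubmit.scala: the function name cluster_manager_for and the returned labels are hypothetical, and it only covers the four URL forms listed on this page.

```python
def cluster_manager_for(master: str) -> str:
    """Toy classifier: map a master URL to a cluster manager label.

    Hypothetical helper for illustration only. It covers just the four
    URL forms discussed on this page and rejects everything else.
    """
    if master == "yarn":
        return "yarn"
    if master.startswith("spark://"):
        return "standalone"
    if master.startswith("k8s://"):
        return "kubernetes"
    if master == "local" or master.startswith("local["):
        return "local"
    raise ValueError(f"unrecognized master URL: {master!r}")


if __name__ == "__main__":
    for url in ("spark://host:7077", "yarn", "k8s://https://example:6443", "local[*]"):
        print(url, "->", cluster_manager_for(url))
```

The point of the sketch is that the choice is made once, from a string, before any cluster process exists; everything after that is specific to the chosen manager.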

Client mode vs. cluster mode

In client mode the driver runs wherever you typed spark-submit (your laptop, an edge node); executors run on the cluster and connect back to it. This is great for interactive shells but ties the application's life to your local process. In cluster mode the driver itself is shipped to and run on a cluster node, so the application survives your client disconnecting — the right choice for production jobs.
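
As a small illustration, the two modes differ only in one spark-submit flag. The sketch below assembles a command line using the real --master, --deploy-mode, and --class options; the helper name spark_submit_argv, the class com.example.Etl, and the jar etl.jar are hypothetical placeholders.

```python
from typing import List, Sequence


def spark_submit_argv(
    master: str,
    deploy_mode: str,
    main_class: str,
    app_jar: str,
    app_args: Sequence[str] = (),
) -> List[str]:
    """Build a spark-submit argument list (hypothetical helper).

    deploy_mode decides where the driver runs:
      "client"  -> in the JVM started by spark-submit itself
      "cluster" -> on a node chosen by the cluster manager
    """
    if deploy_mode not in ("client", "cluster"):
        raise ValueError("deploy_mode must be 'client' or 'cluster'")
    return [
        "spark-submit",
        "--master", master,
        "--deploy-mode", deploy_mode,
        "--class", main_class,
        app_jar,
        *app_args,
    ]


if __name__ == "__main__":
    # Placeholder application: com.example.Etl in etl.jar.
    print(" ".join(spark_submit_argv("yarn", "cluster", "com.example.Etl", "etl.jar")))
```

Switching the same job from an interactive run to a production run is then a matter of changing "client" to "cluster"; the application code does not change.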

Standalone Master — the resource broker

The Standalone Master is an RPC endpoint that tracks registered workers, applications, and drivers. When a driver's StandaloneAppClient sends RegisterApplication, the master records the application and runs schedule(). startExecutorsOnWorkers serves waiting applications in FIFO order and, by default (spark.deploy.spreadOut is true), spreads each application's executors across the workers that have enough free cores and memory. launchExecutor then sends a LaunchExecutor message to each chosen worker.

Master.scala L291-L304 — RegisterApplication

Master.scala L831-L847 — startExecutorsOnWorkers
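
The spreading behaviour can be approximated in a few lines. The following is a simplified sketch under stated assumptions, not the logic in Master.scala: it handles a single application, places whole executors of a fixed size, and visits workers round-robin. The names Worker and spread_executors are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Worker:
    name: str
    free_cores: int
    free_mem_mb: int


def spread_executors(
    workers: List[Worker], wanted: int, cores_each: int, mem_each_mb: int
) -> List[str]:
    """Place up to `wanted` executors, visiting workers round-robin.

    Returns the worker name chosen for each executor, in placement order.
    Mutates the workers' free resources as executors are placed.
    """
    placements: List[str] = []
    while len(placements) < wanted:
        placed_this_round = False
        for w in workers:
            if len(placements) == wanted:
                break
            if w.free_cores >= cores_each and w.free_mem_mb >= mem_each_mb:
                w.free_cores -= cores_each
                w.free_mem_mb -= mem_each_mb
                placements.append(w.name)
                placed_this_round = True
        if not placed_this_round:
            break  # no worker can fit another executor
    return placements


if __name__ == "__main__":
    cluster = [Worker("a", 4, 8192), Worker("b", 2, 8192), Worker("c", 1, 8192)]
    print(spread_executors(cluster, wanted=4, cores_each=2, mem_each_mb=1024))
```

With these three workers, only three of the four requested executors fit: one round places an executor on a and b, a second round places another on a, and c never has enough cores. A request that cannot be fully satisfied is simply filled as far as resources allow.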

Standalone Worker — the process launcher

A Worker registers its capacity with the master and waits. On LaunchExecutor it creates a work directory and an ExecutorRunner that spawns a separate CoarseGrainedExecutorBackend JVM; on LaunchDriver (cluster mode) it spawns the driver via a DriverRunner. The newly started executor connects back to the driver, not to the master.

Worker.scala L601-L662 — LaunchExecutor
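
A rough model of the worker's bookkeeping on LaunchExecutor is sketched below. It is illustrative only: ToyWorker is a hypothetical class, the work-directory layout (work dir / application id / executor id) is an assumption of this sketch, and no JVM is actually spawned.

```python
import tempfile
from pathlib import Path
from typing import Dict


class ToyWorker:
    """Hypothetical stand-in for a Standalone Worker's executor bookkeeping."""

    def __init__(self, work_dir: Path, cores: int) -> None:
        self.work_dir = work_dir
        self.free_cores = cores
        self.executors: Dict[str, Path] = {}  # "appId/execId" -> directory

    def launch_executor(self, app_id: str, exec_id: int, cores: int) -> Path:
        """Reserve cores and create a per-executor work directory.

        A real worker would now start a separate executor JVM; this
        sketch only records the reservation.
        """
        if cores > self.free_cores:
            raise RuntimeError("not enough free cores on this worker")
        # Assumed layout: <work dir>/<app id>/<executor id>
        executor_dir = self.work_dir / app_id / str(exec_id)
        executor_dir.mkdir(parents=True)
        self.free_cores -= cores
        self.executors[f"{app_id}/{exec_id}"] = executor_dir
        return executor_dir


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        worker = ToyWorker(Path(tmp), cores=4)
        print(worker.launch_executor("app-1", 0, cores=2))
        print("free cores:", worker.free_cores)
```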

How an executor joins the driver

After the master sends LaunchExecutor, it notifies the driver with ExecutorAdded. The executor process starts, registers with the driver's CoarseGrainedSchedulerBackend, and from then on receives LaunchTask messages directly. This is the seam where deployment hands off to scheduling: the same CoarseGrainedExecutorBackend is used regardless of whether YARN, Kubernetes, or Standalone launched it.

Master.scala L998-L1006 — launchExecutor & ExecutorAdded
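
The ordering of this handoff is easy to get wrong, so here is a toy model of the driver's side. It is a sketch, not Spark's implementation: ToyDriver and its method names are hypothetical, and each method stands in for one message (ExecutorAdded from the master, the executor's registration with the driver, and LaunchTask from driver to executor).

```python
from typing import List, Set, Tuple


class ToyDriver:
    """Hypothetical model of the driver during the deploy-to-schedule handoff."""

    def __init__(self) -> None:
        self.announced: Set[str] = set()   # master said this executor is coming
        self.registered: Set[str] = set()  # executor has connected back to us
        self.log: List[Tuple[str, str]] = []

    def executor_added(self, exec_id: str) -> None:
        """Master -> driver: an executor was launched for this application."""
        self.announced.add(exec_id)
        self.log.append(("ExecutorAdded", exec_id))

    def register_executor(self, exec_id: str) -> None:
        """Executor -> driver: the executor process is up and reachable."""
        self.registered.add(exec_id)
        self.log.append(("RegisterExecutor", exec_id))

    def launch_task(self, exec_id: str, task_id: str) -> None:
        """Driver -> executor: only possible once the executor has registered."""
        if exec_id not in self.registered:
            raise RuntimeError(f"executor {exec_id} has not registered yet")
        self.log.append(("LaunchTask", f"{task_id}@{exec_id}"))


if __name__ == "__main__":
    driver = ToyDriver()
    driver.executor_added("0")       # deployment: the master's notification
    driver.register_executor("0")    # the seam: executor connects to the driver
    driver.launch_task("0", "task-0")  # scheduling: no cluster manager involved
    for event in driver.log:
        print(event)
```

Note that being announced by the master is not enough to receive work in this model; tasks flow only after the executor itself has connected to the driver, which mirrors the point that the cluster manager is out of the task path.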

The cluster managers compared

Manager	Master URL	How executors are placed
Standalone	`spark://host:7077`	Spark's own Master/Worker daemons; applications served FIFO, executors spread across workers.
YARN	`yarn`	Executors run as YARN containers requested from the ResourceManager.
Kubernetes	`k8s://https://...`	Driver and executors run as pods created via the K8s API.
Local	`local[*]`	Driver and "executors" are threads in one JVM — for development.

All four present the same SchedulerBackend interface to the rest of Spark, which is why a job written and tested with local[*] can run unchanged on a thousand-node YARN or Kubernetes cluster.
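
The value of a shared interface can be shown with a toy model. Everything below is hypothetical (ToySchedulerBackend, its two implementations, and sum_of_squares are names invented for this sketch), and the "cluster" backend only records where each task would go rather than running anything remotely. The job function is written once, against the interface.

```python
from abc import ABC, abstractmethod
from typing import Callable, List, Sequence


class ToySchedulerBackend(ABC):
    """Hypothetical, much-reduced stand-in for a scheduler backend."""

    @abstractmethod
    def default_parallelism(self) -> int: ...

    @abstractmethod
    def run_tasks(self, tasks: Sequence[Callable[[], int]]) -> List[int]: ...


class ToyLocalBackend(ToySchedulerBackend):
    def __init__(self, threads: int) -> None:
        self.threads = threads

    def default_parallelism(self) -> int:
        return self.threads

    def run_tasks(self, tasks):
        return [t() for t in tasks]  # everything runs in this process


class ToyClusterBackend(ToySchedulerBackend):
    def __init__(self, executors: Sequence[str], cores_each: int) -> None:
        self.executors = list(executors)
        self.cores_each = cores_each
        self.placements: List[str] = []  # which executor each task was assigned to

    def default_parallelism(self) -> int:
        return len(self.executors) * self.cores_each

    def run_tasks(self, tasks):
        results = []
        for i, t in enumerate(tasks):
            self.placements.append(self.executors[i % len(self.executors)])
            results.append(t())  # simulated: still executed locally
        return results


def sum_of_squares(backend: ToySchedulerBackend, numbers: Sequence[int]) -> int:
    """A 'job' that depends only on the backend interface."""
    n = backend.default_parallelism()
    chunks = [numbers[i::n] for i in range(n)]
    tasks = [lambda c=c: sum(x * x for x in c) for c in chunks]
    return sum(backend.run_tasks(tasks))


if __name__ == "__main__":
    data = list(range(10))
    print(sum_of_squares(ToyLocalBackend(2), data))
    print(sum_of_squares(ToyClusterBackend(["exec-0", "exec-1", "exec-2"], 2), data))
```

Both calls produce the same answer; only the number of partitions and the recorded placements differ.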

Key takeaways

spark-submit chooses the cluster manager from the master URL.
Client mode runs the driver locally; cluster mode ships it onto the cluster.
The cluster manager only places processes; the driver schedules tasks directly.
On a real cluster, executors are always CoarseGrainedExecutorBackend JVMs, regardless of manager; only local mode runs them as threads inside the driver JVM.
