Apache Spark — Runtime & Errors

8. Error handling and edge cases

Error handling patterns

Layer      Pattern                                  Examples
---------  ---------------------------------------  -----------------------------------
Core       Typed error objects                      SparkCoreErrors, SparkException
SQL        AnalysisException, QueryExecutionErrors  Unresolved column, type mismatch
Tasks      Serialize failure reason back to driver  ExceptionFailure, FetchFailed
Scheduler  Retry with limits                        maxTaskFailures, stage abort
Logging    Structured logging (Log4j2)              spark.log.structuredLogging.enabled

Task failure flow

Executor: task throws
  → TaskRunner catches
  → statusUpdate(FAILED, reason)
  → TaskSchedulerImpl.statusUpdate
      if FetchFailed          → DAGScheduler.handleTaskCompletion (resubmit stage)
      else if attempts < max  → retry on another executor
      else                    → abort stage → fail job
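The branching at the end of this flow can be sketched in a few lines of plain Python. This is a simplified model, not Spark's code: Action, FailureReason, and on_task_failed are hypothetical names, and the default of 4 is an assumption chosen to mirror the default of spark.task.maxFailures.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    RESUBMIT_STAGE = "resubmit_stage"
    RETRY_TASK = "retry_task"
    ABORT_STAGE = "abort_stage"


@dataclass(frozen=True)
class FailureReason:
    # Hypothetical stand-in for a serialized failure reason,
    # e.g. kind="FetchFailed" or kind="ExceptionFailure".
    kind: str
    message: str = ""


def on_task_failed(reason: FailureReason, failed_attempts: int,
                   max_failures: int = 4) -> Action:
    """Decide what the scheduler does after a task failure.

    failed_attempts counts failures of this task so far, including
    the one being reported.
    """
    if reason.kind == "FetchFailed":
        # Missing shuffle data: retrying the same task would not help,
        # so the stage that produced the data is resubmitted instead.
        return Action.RESUBMIT_STAGE
    if failed_attempts < max_failures:
        return Action.RETRY_TASK
    return Action.ABORT_STAGE


if __name__ == "__main__":
    print(on_task_failed(FailureReason("ExceptionFailure"), failed_attempts=1))
    print(on_task_failed(FailureReason("FetchFailed"), failed_attempts=1))
```

Keeping the FetchFailed check first reflects the ordering in the flow: a fetch failure is routed to stage resubmission regardless of how many attempts the task has used.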

Shuffle fetch failure

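When a map output file is missing (for example, because the executor that wrote it was lost), the fetching task fails with FetchFailed. The failure propagates to DAGScheduler, which invalidates the affected map stage and resubmits it to regenerate the missing output. This path is distinct from generic task failure; see the header comments in DAGScheduler.scala (lines 106–114).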
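As a rough illustration of the bookkeeping involved, the sketch below models map output locations as a dictionary and shows which partitions a resubmitted map stage would need to recompute. MapOutputRegistry and handle_fetch_failed are hypothetical names for this toy model; Spark's own tracking is considerably more involved.

```python
from typing import Dict, List, Optional


class MapOutputRegistry:
    """Toy model of map output locations: partition id -> executor id."""

    def __init__(self, num_partitions: int) -> None:
        self.locations: Dict[int, Optional[str]] = {
            p: None for p in range(num_partitions)
        }

    def register(self, partition: int, executor_id: str) -> None:
        self.locations[partition] = executor_id

    def remove_outputs_on_executor(self, executor_id: str) -> None:
        # Invalidate every output that lived on the lost executor.
        for partition, location in self.locations.items():
            if location == executor_id:
                self.locations[partition] = None

    def missing_partitions(self) -> List[int]:
        return sorted(p for p, loc in self.locations.items() if loc is None)


def handle_fetch_failed(registry: MapOutputRegistry,
                        lost_executor: str) -> List[int]:
    """Invalidate outputs on the lost executor and return the map
    partitions that the resubmitted stage must recompute."""
    registry.remove_outputs_on_executor(lost_executor)
    return registry.missing_partitions()


if __name__ == "__main__":
    registry = MapOutputRegistry(num_partitions=4)
    for partition, executor in enumerate(["exec-1", "exec-2", "exec-1", "exec-2"]):
        registry.register(partition, executor)
    print(handle_fetch_failed(registry, "exec-1"))
```

Only the partitions whose output was lost are returned, which is the point of treating fetch failure separately: the map stage is rerun for the missing pieces rather than the reduce task being retried against data that no longer exists.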

SQL analysis vs execution errors

Security-sensitive paths

9. Concurrency and lifecycle

Concurrency model

Component             Model
--------------------  ----------------------------------------------------------------
DAGScheduler          Single-threaded event loop (DAGSchedulerEventProcessLoop)
TaskSchedulerImpl     Thread-safe; synchronized task set managers
Executor              Thread pool: one thread per task slot (spark.executor.cores)
BlockManager          Fine-grained locks per block; master RPC serialized
SparkContext          Documented as not thread-safe for all ops; SQL uses withActive session guard
Structured Streaming  Micro-batch driver thread + concurrent state store maintenance
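The single-threaded event loop model listed for DAGScheduler can be illustrated with a generic sketch: many threads post events to a queue, and exactly one thread drains it, so the handler's state needs no locks. This is a minimal Python analogue under that assumption, not a port of DAGSchedulerEventProcessLoop; EventLoop and run_demo are hypothetical names.

```python
import queue
import threading
from typing import Any, Callable, Set, Tuple

_STOP = object()  # sentinel that tells the loop thread to exit


class EventLoop:
    """Events may be posted from any thread; the handler runs on one thread."""

    def __init__(self, handler: Callable[[Any], None]) -> None:
        self._queue: "queue.Queue[Any]" = queue.Queue()
        self._handler = handler
        self._thread = threading.Thread(
            target=self._run, name="event-loop", daemon=True
        )

    def start(self) -> None:
        self._thread.start()

    def post(self, event: Any) -> None:
        self._queue.put(event)

    def stop(self) -> None:
        # The queue is FIFO, so everything posted earlier is handled first.
        self._queue.put(_STOP)
        self._thread.join()

    def _run(self) -> None:
        while True:
            event = self._queue.get()
            if event is _STOP:
                return
            self._handler(event)


def run_demo(producers: int = 4, events_each: int = 100) -> Tuple[int, Set[str]]:
    handled = []                      # touched only by the loop thread
    handler_threads: Set[str] = set()

    def handle(event: Any) -> None:
        handled.append(event)
        handler_threads.add(threading.current_thread().name)

    loop = EventLoop(handle)
    loop.start()

    def produce(producer_id: int) -> None:
        for n in range(events_each):
            loop.post((producer_id, n))

    workers = [threading.Thread(target=produce, args=(i,)) for i in range(producers)]
    for worker in workers:
        worker.start()
    for worker in workers:
        worker.join()
    loop.stop()
    return len(handled), handler_threads


if __name__ == "__main__":
    print(run_demo())
```

The trade-off of this design is that a slow handler delays every later event, which is why work done on such a loop is usually kept short.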

Retries and timeouts

Cancellation

Resource cleanup

Dynamic allocation

ExecutorAllocationManager (core/.../scheduler/dynalloc/) requests and kills executors based on load when spark.dynamicAllocation.enabled=true. Shrinking safely requires the external shuffle service, so that shuffle files written by a removed executor remain fetchable.
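To make "based on load" concrete, the sketch below estimates a target executor count from the task backlog and clamps it to configured bounds. It is a simplified approximation under stated assumptions, not the manager's actual logic: target_executors is a hypothetical helper, timing-based ramp-up and idle timeouts are ignored, and the values in DYNALLOC_CONF are illustrative only (the keys are standard Spark configuration names).

```python
import math

# Illustrative settings; the values are assumptions, not recommendations.
DYNALLOC_CONF = {
    "spark.dynamicAllocation.enabled": "true",
    "spark.shuffle.service.enabled": "true",   # external shuffle service
    "spark.dynamicAllocation.minExecutors": "1",
    "spark.dynamicAllocation.maxExecutors": "20",
}


def target_executors(pending_tasks: int, running_tasks: int,
                     executor_cores: int, task_cpus: int = 1,
                     min_executors: int = 0,
                     max_executors: int = 2**31 - 1,
                     allocation_ratio: float = 1.0) -> int:
    """Estimate how many executors the current load calls for."""
    # How many tasks one executor can run at the same time.
    tasks_per_executor = max(1, executor_cores // task_cpus)
    needed = math.ceil(
        (pending_tasks + running_tasks) * allocation_ratio / tasks_per_executor
    )
    # Never go below the configured minimum or above the maximum.
    return max(min_executors, min(max_executors, needed))


if __name__ == "__main__":
    lo = int(DYNALLOC_CONF["spark.dynamicAllocation.minExecutors"])
    hi = int(DYNALLOC_CONF["spark.dynamicAllocation.maxExecutors"])
    print(target_executors(10, 6, executor_cores=4,
                           min_executors=lo, max_executors=hi))
```

When the estimate drops below the current executor count, idle executors become candidates for removal, which is the shrink case that depends on shuffle files staying available.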

Consistency assumptions

11. Non-obvious insights

Hidden coupling

Implicit conventions

Magic constants / globals

Generated code

Performance-sensitive code

Backward compatibility

Architecture diagram: failure domains

┌──────────── Driver failure ───────────────┐
│ Lose entire app unless checkpoint/        │
│ streaming recovery from durable log       │
└───────────────────────────────────────────┘
┌──────────── Executor failure ─────────────┐
│ Tasks retried; shuffle blocks rebuilt     │
│ if not using external shuffle service     │
└───────────────────────────────────────────┘
┌──────────── Task failure ─────────────────┐
│ Retry on another executor (bounded)       │
└───────────────────────────────────────────┘
