RDDs: the lineage that defines a computation

The Resilient Distributed Dataset is Spark's foundational abstraction. Even DataFrames and SQL ultimately compile down to RDDs. Understanding the five properties, the lazy graph they form, and the difference between narrow and wide dependencies explains almost everything about how Spark schedules and recovers work.

An RDD is a recipe, not a table

An RDD does not hold data. It is an immutable description of how to compute a partitioned collection from its parents. The source comment lists the five properties that every RDD answers; all scheduling decisions are derived from them.

Internally, each RDD is characterized by five main properties:

  - A list of partitions                       getPartitions
  - A function for computing each split         compute(partition, ctx)
  - A list of dependencies on other RDDs        getDependencies
  - Optionally, a Partitioner for key-value     partitioner
  - Optionally, preferred locations per split   getPreferredLocations

RDD.scala L69-L83 — the five properties

RDD.scala L84-L139 — the abstract class and the five methods/fields

Lineage: a chain of small wrappers

sc.textFile("data")        HadoopRDD          partitions = HDFS splits
   .map(parse)        →    MapPartitionsRDD   OneToOneDependency
   .filter(valid)     →    MapPartitionsRDD   OneToOneDependency
   .map(toPair)       →    MapPartitionsRDD   OneToOneDependency
   .reduceByKey(_+_)  →    ShuffledRDD        ShuffleDependency  ◄── stage boundary
   .collect()              ACTION             triggers a job

Each transformation returns a new RDD that points back at its parent. The arrows are Dependency objects. A single-parent RDD gets a OneToOneDependency automatically through the convenience constructor.

RDD.scala L102-L104 — one-to-one parent constructor

Transformations build the graph (lazy)

A transformation never touches data on the cluster. map cleans the user function and wraps the parent in a MapPartitionsRDD whose compute simply applies iter.map(cleanF) to the parent's iterator. Because the new RDD keeps the same number of partitions and depends one-to-one on its parent, it can be pipelined with neighbours into a single stage.

def map[U](f: T => U): RDD[U] = {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this,
    (context, pid, iter) => iter.map(cleanF))
}

RDD.scala L424-L450 — map and filter

Actions launch jobs (eager)

Actions are the only operations that cross the driver boundary. collect runs a job that turns each partition into an array and concatenates them on the driver; count sums per-partition sizes; reduce merges per-partition reductions. All of them call sc.runJob.

def collect(): Array[T] = {
  val results = sc.runJob(this, iter => iter.toArray)
  Array.concat(results: _*)
}

RDD.scala L1072-L1076 — collect

RDD.scala L1146-L1167 — reduce

Narrow vs. shuffle dependencies

The Dependency type is the single most important fact about an RDD edge. A NarrowDependency means each child partition needs only a few specific parent partitions, so the work can be pipelined locally. A ShuffleDependency means a child partition may need data from every parent partition, which requires a network exchange and forces a stage boundary.

NarrowDependency.getParents(childPartitionId): Seq[Int]
  // OneToOneDependency: child i ← parent i  (map, filter)
  // RangeDependency:    contiguous ranges    (union)

ShuffleDependency
  // registers a shuffleId + shuffleHandle with the ShuffleManager
  // carries the Partitioner that decides the reducer for each key

Creating a ShuffleDependency immediately registers the shuffle with the shuffle manager, allocating a shuffleId — the shuffle is "planned" the moment the lineage is built, even though no bytes move until execution.

Dependency.scala L46-L61 — NarrowDependency

Dependency.scala L83-L138 — ShuffleDependency and shuffle registration

Partitions and the Partitioner

A Partition is just a serializable index identifying one slice of an RDD; it is the unit of task parallelism. For key/value RDDs, a Partitioner maps each key to a partition id and therefore decides how many reducers a shuffle has and where each key lands. HashPartitioner (the default) uses key.hashCode mod numPartitions; RangePartitioner samples the data to build sorted bounds for sortByKey.

Choosing a good partitioner is the main lever for avoiding skew and unnecessary shuffles: if two RDDs already share a partitioner, a join between them can be narrow.

Partition.scala L23-L33 — the Partition trait

Partitioner.scala L114-L132 — HashPartitioner

iterator: compute, cache, or checkpoint

When a task needs a partition it calls RDD.iterator. This is the decision point between recomputation and reuse: if the RDD is persisted, it goes through getOrCompute (which asks the local BlockManager for a cached block, computing and storing it on a miss); otherwise it calls computeOrReadCheckpoint, which either reads a materialized checkpoint or finally calls the subclass's compute.

final def iterator(split, context): Iterator[T] =
  if (storageLevel != NONE) getOrCompute(split, context)   // cache path
  else                      computeOrReadCheckpoint(split, context)

RDD.scala L334-L409 — iterator, computeOrReadCheckpoint, getOrCompute

persist / cache and fault tolerance

cache() is persist(MEMORY_ONLY). Persisting only sets a storage level and registers the RDD with the context; the data is materialized the first time a partition is computed, after which getOrCompute finds it in the block manager. Lineage is what makes this safe: if a cached partition is lost, Spark recomputes it from its parents rather than failing the job. This recompute-from-lineage property is the "Resilient" in RDD.

RDD.scala L186-L205 — persist and cache

Why this matters downstream

The lineage graph is the input to the DAGScheduler. It walks dependencies backward from the action: narrow edges are folded into the current stage, and every shuffle edge becomes a boundary. So the way you write transformations — particularly which ones introduce shuffles — directly determines the number of stages, the amount of network traffic, and the recovery behaviour of your job. Continue to DAG & Task Scheduling to see the graph become stages and tasks.

Key takeaways

  • An RDD stores a recipe (five properties), not data.
  • Transformations are lazy graph-building; actions are eager and call runJob.
  • Narrow dependencies pipeline; shuffle dependencies create stage boundaries.
  • The partitioner controls reducer count, key placement, and shuffle-free joins.
  • Lineage enables recomputation, which is Spark's fault-tolerance mechanism.