RDD & Lineage — Spark Data Flow

An RDD is a recipe, not a table

An RDD does not hold data. It is an immutable description of how to compute a partitioned collection from its parents. The source comment lists the five properties that every RDD answers; all scheduling decisions are derived from them.

Internally, each RDD is characterized by five main properties:

  - A list of partitions                       getPartitions
  - A function for computing each split         compute(partition, ctx)
  - A list of dependencies on other RDDs        getDependencies
  - Optionally, a Partitioner for key-value     partitioner
  - Optionally, preferred locations per split   getPreferredLocations

RDD.scala L69-L83 — the five properties

RDD.scala L84-L139 — the abstract class and the five methods/fields

Lineage: a chain of small wrappers

sc.textFile("data")        HadoopRDD          partitions = HDFS splits
   .map(parse)        →    MapPartitionsRDD   OneToOneDependency
   .filter(valid)     →    MapPartitionsRDD   OneToOneDependency
   .map(toPair)       →    MapPartitionsRDD   OneToOneDependency
   .reduceByKey(_+_)  →    ShuffledRDD        ShuffleDependency  ◄── stage boundary
   .collect()              ACTION             triggers a job

Each transformation returns a new RDD that points back at its parent. The arrows are Dependency objects. A single-parent RDD gets a OneToOneDependency automatically through the convenience constructor.

RDD.scala L102-L104 — one-to-one parent constructor

Transformations build the graph (lazy)

A transformation never touches data on the cluster. map cleans the user function and wraps the parent in a MapPartitionsRDD whose compute simply applies iter.map(cleanF) to the parent's iterator. Because the new RDD keeps the same number of partitions and depends one-to-one on its parent, it can be pipelined with neighbours into a single stage.

def map[U](f: T => U): RDD[U] = {
  val cleanF = sc.clean(f)
  new MapPartitionsRDD[U, T](this,
    (context, pid, iter) => iter.map(cleanF))
}

RDD.scala L424-L450 — map and filter

Actions launch jobs (eager)

Actions are the only operations that cross the driver boundary. collect runs a job that turns each partition into an array and concatenates them on the driver; count sums per-partition sizes; reduce merges per-partition reductions. All of them call sc.runJob.

def collect(): Array[T] = {
  val results = sc.runJob(this, iter => iter.toArray)
  Array.concat(results: _*)
}

RDD.scala L1072-L1076 — collect

RDD.scala L1146-L1167 — reduce

Narrow vs. shuffle dependencies

The Dependency type is the single most important fact about an RDD edge. A NarrowDependency means each child partition needs only a few specific parent partitions, so the work can be pipelined locally. A ShuffleDependency means a child partition may need data from every parent partition, which requires a network exchange and forces a stage boundary.

NarrowDependency.getParents(childPartitionId): Seq[Int]
  // OneToOneDependency: child i ← parent i  (map, filter)
  // RangeDependency:    contiguous ranges    (union)

ShuffleDependency
  // registers a shuffleId + shuffleHandle with the ShuffleManager
  // carries the Partitioner that decides the reducer for each key

Creating a ShuffleDependency immediately registers the shuffle with the shuffle manager, allocating a shuffleId — the shuffle is "planned" the moment the lineage is built, even though no bytes move until execution.

Dependency.scala L46-L61 — NarrowDependency

Dependency.scala L83-L138 — ShuffleDependency and shuffle registration

Partitions and the Partitioner

A Partition is just a serializable index identifying one slice of an RDD; it is the unit of task parallelism. For key/value RDDs, a Partitioner maps each key to a partition id and therefore decides how many reducers a shuffle has and where each key lands. HashPartitioner (the default) uses key.hashCode mod numPartitions; RangePartitioner samples the data to build sorted bounds for sortByKey.

Choosing a good partitioner is the main lever for avoiding skew and unnecessary shuffles: if two RDDs already share a partitioner, a join between them can be narrow.

Partition.scala L23-L33 — the Partition trait

Partitioner.scala L114-L132 — HashPartitioner

iterator: compute, cache, or checkpoint

When a task needs a partition it calls RDD.iterator. This is the decision point between recomputation and reuse: if the RDD is persisted, it goes through getOrCompute (which asks the local BlockManager for a cached block, computing and storing it on a miss); otherwise it calls computeOrReadCheckpoint, which either reads a materialized checkpoint or finally calls the subclass's compute.

final def iterator(split, context): Iterator[T] =
  if (storageLevel != NONE) getOrCompute(split, context)   // cache path
  else                      computeOrReadCheckpoint(split, context)

RDD.scala L334-L409 — iterator, computeOrReadCheckpoint, getOrCompute

persist / cache and fault tolerance

cache() is persist(MEMORY_ONLY). Persisting only sets a storage level and registers the RDD with the context; the data is materialized the first time a partition is computed, after which getOrCompute finds it in the block manager. Lineage is what makes this safe: if a cached partition is lost, Spark recomputes it from its parents rather than failing the job. This recompute-from-lineage property is the "Resilient" in RDD.

RDD.scala L186-L205 — persist and cache

Why this matters downstream

The lineage graph is the input to the DAGScheduler. It walks dependencies backward from the action: narrow edges are folded into the current stage, and every shuffle edge becomes a boundary. So the way you write transformations — particularly which ones introduce shuffles — directly determines the number of stages, the amount of network traffic, and the recovery behaviour of your job. Continue to DAG & Task Scheduling to see the graph become stages and tasks.

Key takeaways

An RDD stores a recipe (five properties), not data.
Transformations are lazy graph-building; actions are eager and call runJob.
Narrow dependencies pipeline; shuffle dependencies create stage boundaries.
The partitioner controls reducer count, key placement, and shuffle-free joins.
Lineage enables recomputation, which is Spark's fault-tolerance mechanism.

RDDs: the lineage that defines a computation

An RDD is a recipe, not a table

Lineage: a chain of small wrappers

Why this matters downstream

Key takeaways

External references