An RDD is a recipe, not a table
An RDD does not hold data. It is an immutable description of how to compute a partitioned collection from its parents. The source comment lists the five properties that every RDD answers; all scheduling decisions are derived from them.
Internally, each RDD is characterized by five main properties:
- A list of partitions getPartitions
- A function for computing each split compute(partition, ctx)
- A list of dependencies on other RDDs getDependencies
- Optionally, a Partitioner for key-value partitioner
- Optionally, preferred locations per split getPreferredLocations
RDD.scala L69-L83 — the five properties
RDD.scala L84-L139 — the abstract class and the five methods/fields
Lineage: a chain of small wrappers
sc.textFile("data") HadoopRDD partitions = HDFS splits
.map(parse) → MapPartitionsRDD OneToOneDependency
.filter(valid) → MapPartitionsRDD OneToOneDependency
.map(toPair) → MapPartitionsRDD OneToOneDependency
.reduceByKey(_+_) → ShuffledRDD ShuffleDependency ◄── stage boundary
.collect() ACTION triggers a job
Each transformation returns a new RDD that points back at its parent. The arrows are Dependency objects. A single-parent RDD gets a OneToOneDependency automatically through the convenience constructor.
Transformations build the graph (lazy)
A transformation never touches data on the cluster. map cleans the user function and wraps the parent in a MapPartitionsRDD whose compute simply applies iter.map(cleanF) to the parent's iterator. Because the new RDD keeps the same number of partitions and depends one-to-one on its parent, it can be pipelined with neighbours into a single stage.
def map[U](f: T => U): RDD[U] = {
val cleanF = sc.clean(f)
new MapPartitionsRDD[U, T](this,
(context, pid, iter) => iter.map(cleanF))
}
Actions launch jobs (eager)
Actions are the only operations that cross the driver boundary. collect runs a job that turns each partition into an array and concatenates them on the driver; count sums per-partition sizes; reduce merges per-partition reductions. All of them call sc.runJob.
def collect(): Array[T] = {
val results = sc.runJob(this, iter => iter.toArray)
Array.concat(results: _*)
}
Narrow vs. shuffle dependencies
The Dependency type is the single most important fact about an RDD edge. A NarrowDependency means each child partition needs only a few specific parent partitions, so the work can be pipelined locally. A ShuffleDependency means a child partition may need data from every parent partition, which requires a network exchange and forces a stage boundary.
NarrowDependency.getParents(childPartitionId): Seq[Int]
// OneToOneDependency: child i ← parent i (map, filter)
// RangeDependency: contiguous ranges (union)
ShuffleDependency
// registers a shuffleId + shuffleHandle with the ShuffleManager
// carries the Partitioner that decides the reducer for each key
Creating a ShuffleDependency immediately registers the shuffle with the shuffle manager, allocating a shuffleId — the shuffle is "planned" the moment the lineage is built, even though no bytes move until execution.
Dependency.scala L46-L61 — NarrowDependency
Dependency.scala L83-L138 — ShuffleDependency and shuffle registration
Partitions and the Partitioner
A Partition is just a serializable index identifying one slice of an RDD; it is the unit of task parallelism. For key/value RDDs, a Partitioner maps each key to a partition id and therefore decides how many reducers a shuffle has and where each key lands. HashPartitioner (the default) uses key.hashCode mod numPartitions; RangePartitioner samples the data to build sorted bounds for sortByKey.
Choosing a good partitioner is the main lever for avoiding skew and unnecessary shuffles: if two RDDs already share a partitioner, a join between them can be narrow.
iterator: compute, cache, or checkpoint
When a task needs a partition it calls RDD.iterator. This is the decision point between recomputation and reuse: if the RDD is persisted, it goes through getOrCompute (which asks the local BlockManager for a cached block, computing and storing it on a miss); otherwise it calls computeOrReadCheckpoint, which either reads a materialized checkpoint or finally calls the subclass's compute.
final def iterator(split, context): Iterator[T] =
if (storageLevel != NONE) getOrCompute(split, context) // cache path
else computeOrReadCheckpoint(split, context)
RDD.scala L334-L409 — iterator, computeOrReadCheckpoint, getOrCompute
persist / cache and fault tolerance
cache() is persist(MEMORY_ONLY). Persisting only sets a storage level and registers the RDD with the context; the data is materialized the first time a partition is computed, after which getOrCompute finds it in the block manager. Lineage is what makes this safe: if a cached partition is lost, Spark recomputes it from its parents rather than failing the job. This recompute-from-lineage property is the "Resilient" in RDD.
Why this matters downstream
The lineage graph is the input to the DAGScheduler. It walks dependencies backward from the action: narrow edges are folded into the current stage, and every shuffle edge becomes a boundary. So the way you write transformations — particularly which ones introduce shuffles — directly determines the number of stages, the amount of network traffic, and the recovery behaviour of your job. Continue to DAG & Task Scheduling to see the graph become stages and tasks.
Key takeaways
- An RDD stores a recipe (five properties), not data.
- Transformations are lazy graph-building; actions are eager and call
runJob. - Narrow dependencies pipeline; shuffle dependencies create stage boundaries.
- The partitioner controls reducer count, key placement, and shuffle-free joins.
- Lineage enables recomputation, which is Spark's fault-tolerance mechanism.