How pdf_reflow reshapes a PDF

Five stages turn a two-column desktop PDF into a single-column phone-sized one. The pipeline is pure geometry and font heuristics, with no ML.

A two-column page with a figure becomes a single mobile column.

Stage 0 · Source PDF: two columns, a figure, headers and footers
in: PDF file → out: PageContent[] (extract.py)

PyMuPDF (Python bindings for the MuPDF C library) reads the source bytes and returns every glyph in rawdict form: bbox, font, size, weight flags, color. Vector drawings (lines, rects, curves) and embedded raster images are reduced to bounding boxes.

At this boundary, ligature codepoints (such as ﬁ and ﬂ), smart quotes, and soft hyphens are normalized to plain ASCII, so every downstream stage sees the same text and the base-14 PDF fonts used for rendering have a glyph for every character.
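The normalization step could look roughly like the sketch below. The mapping table and the name `normalize_text` are assumptions made for illustration; the actual table in extract.py may cover more codepoints.

```python
# Hypothetical normalization table; the real mapping in extract.py may differ.
_REPLACEMENTS = {
    "\ufb00": "ff",   # ligature ff
    "\ufb01": "fi",   # ligature fi
    "\ufb02": "fl",   # ligature fl
    "\ufb03": "ffi",  # ligature ffi
    "\ufb04": "ffl",  # ligature ffl
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u00ad": "",     # soft hyphen: dropped
}
_TABLE = str.maketrans(_REPLACEMENTS)


def normalize_text(text: str) -> str:
    """Map ligatures, smart quotes and soft hyphens to plain ASCII."""
    return text.translate(_TABLE)
```

Doing this once at extraction time means later stages never need to special-case these characters.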

[Diagram: source.pdf → PageContent { spans: [Span(text, bbox, font, size, flags)], drawings: [bbox, kind], images: [bbox, xref], rect: (0, 0, w, h) }]
Every glyph keeps its exact bbox & font metadata.
Span
  • text — one run of chars
  • bbox — (x0, y0, x1, y1)
  • font / size / flags
  • is_bold / is_italic
PageContent
  • index — source page #
  • rect — page dimensions
  • spans, drawings, images
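As a rough illustration, the two records above might be modelled as dataclasses. The field types, the `BBox` alias, and deriving `is_bold` / `is_italic` from PyMuPDF's span flag bits (2 for italic, 16 for bold) are assumptions, not the project's confirmed definitions.

```python
from dataclasses import dataclass, field

BBox = tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass
class Span:
    text: str    # one run of chars
    bbox: BBox
    font: str
    size: float
    flags: int   # PyMuPDF span flags bitfield

    @property
    def is_italic(self) -> bool:
        return bool(self.flags & 2)   # bit 1: italic

    @property
    def is_bold(self) -> bool:
        return bool(self.flags & 16)  # bit 4: bold


@dataclass
class PageContent:
    index: int   # source page number
    rect: BBox   # page dimensions
    spans: list[Span] = field(default_factory=list)
    drawings: list[dict] = field(default_factory=list)  # assumed: {"bbox", "kind"}
    images: list[dict] = field(default_factory=list)    # assumed: {"bbox", "xref"}
```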

Parallelism: long docs use extract_document_parallel, opening a fresh fitz.Document in each worker process (PyMuPDF isn't thread-safe).
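A minimal sketch of the per-process pattern, assuming pages are split into contiguous chunks. The names `page_chunks`, `_extract_chunk` and `extract_parallel_sketch` are hypothetical and are not the real `extract_document_parallel`; the worker returns raw `rawdict` output rather than `PageContent`.

```python
from concurrent.futures import ProcessPoolExecutor


def page_chunks(n_pages: int, n_workers: int) -> list[range]:
    """Split page indices into at most n_workers contiguous ranges."""
    n_workers = max(1, min(n_workers, n_pages))
    size, extra = divmod(n_pages, n_workers)
    chunks, start = [], 0
    for i in range(n_workers):
        stop = start + size + (1 if i < extra else 0)
        chunks.append(range(start, stop))
        start = stop
    return chunks


def _extract_chunk(args):
    path, pages = args
    import fitz  # PyMuPDF, imported inside the worker process

    doc = fitz.open(path)  # fresh Document per process
    try:
        return [(i, doc[i].get_text("rawdict")) for i in pages]
    finally:
        doc.close()


def extract_parallel_sketch(path, n_pages, n_workers=4):
    jobs = [(path, chunk) for chunk in page_chunks(n_pages, n_workers)]
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(_extract_chunk, jobs))
    # pool.map preserves job order, so pages come back in document order
    return [page for chunk in results for page in chunk]
```

Passing the file path rather than an open document keeps the job arguments picklable and guarantees no `fitz.Document` is shared between processes.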

in: PageContent → out: Block[] (analyze.py)

Three groupings happen here:

  • Lines — spans whose vertical midpoints differ by less than half a span height are merged. Then sorted by x.
  • Blocks — adjacent lines with similar font size and vertical gap < ~1.2× line height become one block (a paragraph candidate).
  • Body size — weighted mode of all span sizes. Defines what counts as "body" for the rest of the pipeline.
[Diagram: spans → lines → blocks]
Vertical-midpoint clustering, then size/gap thresholds.
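# analyze.py: group spans into lines and blocks
from collections import Counter

def _group_lines(spans):
    # Sort by y, sweep top to bottom, and merge spans whose
    # mid-y differs by less than half a span height.
    ...

def _group_blocks(lines, body_size):
    # Lines join the same block when gap < 1.2 × line_h
    # and the font-size difference is small.
    ...

def body_font_size(pages):
    # Weighted mode: each span votes with its character count,
    # so long body runs outweigh short headings.
    counter = Counter()
    for p in pages:
        for s in p.spans:
            counter[round(s.size, 1)] += len(s.text)
    return counter.most_common(1)[0][0]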
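The two elided helpers might be filled in along these lines. This is only a sketch of the thresholds described above: the `S` tuple is a stand-in for `Span`, the `size_tol` value is an assumption, and the real functions in analyze.py may differ in detail.

```python
from collections import namedtuple

S = namedtuple("S", "text bbox size")  # hypothetical stand-in for Span


def _mid_y(span):
    return (span.bbox[1] + span.bbox[3]) / 2


def group_lines(spans):
    """Merge spans whose vertical midpoints differ by < half a span height."""
    lines = []
    for s in sorted(spans, key=_mid_y):
        height = s.bbox[3] - s.bbox[1]
        if lines and abs(_mid_y(s) - _mid_y(lines[-1][-1])) < 0.5 * height:
            lines[-1].append(s)
        else:
            lines.append([s])
    # Within each line, restore left-to-right order
    return [sorted(line, key=lambda s: s.bbox[0]) for line in lines]


def group_blocks(lines, gap_factor=1.2, size_tol=0.5):
    """Join adjacent lines with a small gap and similar font size."""
    blocks = []
    for line in lines:
        top = min(s.bbox[1] for s in line)
        bottom = max(s.bbox[3] for s in line)
        size = max(s.size for s in line)
        if blocks:
            prev = blocks[-1]
            gap = top - prev["bottom"]
            if gap < gap_factor * (bottom - top) and abs(size - prev["size"]) <= size_tol:
                prev["lines"].append(line)
                prev["bottom"] = bottom
                continue
        blocks.append({"lines": [line], "bottom": bottom, "size": size})
    return [b["lines"] for b in blocks]
```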

in: Block[] → out: typed Block[] (analyze.py)

Each block is tagged using font & geometry signals:

heading
  • size > body + 1pt and bold, or much larger than body
body
  • body-sized prose, no special font signals
code
  • block in a Courier-family font; line breaks preserved
equation
  • contains characters in the Unicode Private Use Area, or uses a math-only font such as CMMI or CMSY → rasterized
caption
  • sub-body size text near a figure region
toc / label
  • dot-leader entries / single-token numerics (page #s, axis ticks)
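A simplified sketch of how such tagging rules might be ordered. `BlockInfo`, `classify`, and the `big_delta` threshold standing in for "much larger than body" are all assumptions; the toc / label rules and the Private Use Area check are left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class BlockInfo:  # hypothetical summary of one block
    text: str
    font: str
    size: float
    bold: bool = False
    near_figure: bool = False


MATH_FONT_PREFIXES = ("CMMI", "CMSY")


def classify(block, body_size, big_delta=4.0):
    font = block.font.split("+")[-1]  # drop a subset prefix such as "ABCDEF+"
    if font.upper().startswith(MATH_FONT_PREFIXES):
        return "equation"
    if "courier" in font.lower():
        return "code"
    if (block.size > body_size + 1 and block.bold) or block.size >= body_size + big_delta:
        return "heading"
    if block.size < body_size and block.near_figure:
        return "caption"
    return "body"
```

Font-based rules run first because a math or Courier font is a stronger signal than size alone.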

Why bother with PUA detection? LaTeX math fonts emit symbols at Private-Use codepoints that never round-trip to plain text. Typeset as text, these blocks would render as garbage, so they are rasterized as images instead.
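The check itself is small. A sketch, assuming the detector tests against the three Private Use ranges defined by Unicode (the function name `has_pua` is hypothetical):

```python
PUA_RANGES = (
    (0xE000, 0xF8FF),      # Private Use Area (BMP)
    (0xF0000, 0xFFFFD),    # Supplementary Private Use Area-A
    (0x100000, 0x10FFFD),  # Supplementary Private Use Area-B
)


def has_pua(text: str) -> bool:
    """True if any character falls in a Unicode Private Use range."""
    return any(lo <= ord(ch) <= hi for ch in text for lo, hi in PUA_RANGES)
```

Ordinary math symbols with real codepoints (Greek letters, ≤, ∑) do not trigger it; only glyphs the font placed at private codepoints do.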

in: Block[], drawings, images → out: FlowItem[] (analyze.py)
  • Figure bands — vector drawings on a page are merged vertically into y-bands. Equation blocks seed bands too. Captions and short fragments inside the band are absorbed.
  • Column detection — body-block x-centers are clustered; two clusters = two-column layout.
  • Reading order — column-major top→bottom. Figures take a synthetic slot at their band's top y.
  • Page chrome — tiny text in the top 10% / bottom 12% (running headers, footers, page numbers) is dropped.
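Two of the steps above can be sketched in a few lines. Band merging is shown as a plain interval merge with a small padding, and column detection as a largest-gap split over sorted x-centers, which is a simple substitute for whatever clustering analyze.py actually uses. The `pad` and `min_gap_frac` values are assumptions.

```python
def merge_bands(boxes, pad=4.0):
    """Merge bboxes (x0, y0, x1, y1) into vertical bands (y0, y1)."""
    bands = []
    for y0, y1 in sorted((b[1], b[3]) for b in boxes):
        if bands and y0 <= bands[-1][1] + pad:
            # Overlaps or nearly touches the previous band: extend it
            bands[-1][1] = max(bands[-1][1], y1)
        else:
            bands.append([y0, y1])
    return [tuple(b) for b in bands]


def detect_columns(x_centers, page_width, min_gap_frac=0.25):
    """Return (n_columns, split_x); split_x is None for one column."""
    xs = sorted(x_centers)
    if len(xs) < 2:
        return 1, None
    gap, i = max((xs[k + 1] - xs[k], k) for k in range(len(xs) - 1))
    if gap >= min_gap_frac * page_width:
        return 2, (xs[i] + xs[i + 1]) / 2
    return 1, None
```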
[Diagram: a two-column page with body blocks L¹, L², L³ and R¹, R², R³ plus a figure band, flattened into FlowItem[]]
Column-major flatten; figure inserted at its y-position.
# Sort key: (column, top_y)
items_with_y.sort(key=lambda t: (t[0], t[1]))
# figures use column 0 so they appear in-stream
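To make the sort key concrete, here is a small self-contained sketch of the flatten and the chrome filter. The tuple layout, the `<figure n>` placeholder, and the `tiny_size` cutoff are assumptions for illustration; only the (column, top_y) key, the column-0 rule for figures, and the 10% / 12% margins come from the description above.

```python
def is_chrome(bbox, size, page_height, tiny_size=8.0):
    """Tiny text lying entirely in the top 10% or bottom 12% of the page."""
    in_top = bbox[3] <= 0.10 * page_height
    in_bottom = bbox[1] >= (1 - 0.12) * page_height
    return (in_top or in_bottom) and size <= tiny_size


def flow_order(blocks, figure_bands, split_x):
    """blocks: (text, bbox) pairs; figure_bands: (y0, y1) pairs."""
    items = []
    for text, bbox in blocks:
        center = (bbox[0] + bbox[2]) / 2
        column = 0 if split_x is None or center < split_x else 1
        items.append((column, bbox[1], text))
    for n, (y0, _y1) in enumerate(figure_bands):
        items.append((0, y0, f"<figure {n}>"))  # figures ride in column 0
    items.sort(key=lambda t: (t[0], t[1]))
    return [t[2] for t in items]
```

With this key, a figure lands between the left-column blocks above and below its band, and the whole right column follows afterwards.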

in: FlowItem[] → out: mobile PDF (layout.py · render.py)
  • Page size — default 360×600 pt (iPhone-friendly).
  • Wrap — greedy word-wrap using PyMuPDF's base-14 font metrics (tiro, tibo, tiit, cour). Per-char widths are cached so wrapping is a Python-side sum (no SWIG crossing per word).
  • Figures — clipped from the source page, tightened to the actual content bbox, rasterized at 150 dpi, scaled to fit the column.
  • Code — small blocks kept together; large blocks scaled so the longest line fits.
  • TOC — anchors mirror the source PDF outline; if none, headings synthesise one.
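The wrap step can be illustrated with a greedy loop over words and a cached per-character width. In this sketch the width source is injected as a callable, so it runs without PyMuPDF; in the real pipeline the widths would come from the base-14 font metrics. `make_measurer` and `greedy_wrap` are hypothetical names.

```python
from functools import lru_cache


def make_measurer(char_width, size):
    """char_width(ch) returns the advance at font size 1; results are cached."""

    @lru_cache(maxsize=None)
    def width_of(ch):
        return char_width(ch) * size

    def measure(text):
        # Pure-Python sum over cached widths, no library call per word
        return sum(width_of(ch) for ch in text)

    return measure


def greedy_wrap(text, max_width, measure):
    """Put as many words on each line as fit within max_width."""
    lines, current = [], ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if current and measure(candidate) > max_width:
            lines.append(current)
            current = word
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines
```

A single word wider than the column is kept on its own line rather than split; hyphenation is out of scope for this sketch.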
[Diagram: FlowItem[] → greedy wrap + figure clip → paginated mobile PDF]
# layout.py: central loop
for it in items:
    if it.kind == "heading":
        emit_paragraph(..., font="times-bold")
    elif it.kind == "body":
        emit_paragraph(..., size=body)
    elif it.kind == "code":
        ...  # courier; scale if too wide
    elif it.kind == "figure":
        pb.emit_image(src_page, src_rect, w, h)

# render.py — figures are rasterized via PyMuPDF Pixmap
pix = src.get_pixmap(matrix=fitz.Matrix(scale, scale),
                     clip=fitz.Rect(rect), alpha=False)
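The scaling arithmetic behind the code and figure rules is short enough to spell out. This sketch assumes Courier's fixed advance of 0.6 em, a hypothetical base and minimum code font size, and that figures are only ever shrunk to the column width, never enlarged.

```python
COURIER_ADVANCE = 0.6  # Courier glyphs are 600/1000 em wide


def code_font_size(lines, column_width, base_size=8.0, min_size=5.0):
    """Largest size <= base_size at which the longest line fits the column."""
    longest = max((len(line) for line in lines), default=0)
    if longest == 0:
        return base_size
    fit = column_width / (longest * COURIER_ADVANCE)
    return max(min_size, min(base_size, fit))


def raster_zoom(dpi=150):
    """Zoom factor for fitz.Matrix: PDF user space is 72 units per inch."""
    return dpi / 72.0


def fit_to_column(src_w, src_h, column_width):
    """Shrink a figure to the column width, keeping its aspect ratio."""
    scale = min(1.0, column_width / src_w)
    return src_w * scale, src_h * scale
```

For example, at 150 dpi the zoom is about 2.08, so a 240 pt wide crop becomes a 500 px wide pixmap before it is placed at column width.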

A pre-pass de-duplicates figure rasterization tasks by (page, rect, scale) so the same crop isn't redrawn for two output pages.
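One way such a pre-pass might work is to key each task on (page, rect, scale) and keep only the first occurrence. Rounding the coordinates before comparison is an assumption added here so that near-identical float rects collapse to one key; `dedupe_tasks` is a hypothetical name.

```python
def dedupe_tasks(tasks, ndigits=2):
    """tasks: iterable of (page, rect, scale); rect is (x0, y0, x1, y1).

    Returns the unique tasks in first-seen order and a key -> index map
    that callers can use to look up the shared raster.
    """
    index, unique = {}, []
    for page, rect, scale in tasks:
        key = (page, tuple(round(v, ndigits) for v in rect), round(scale, ndigits))
        if key not in index:
            index[key] = len(unique)
            unique.append((page, rect, scale))
    return unique, index
```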