pdf_reflow — data flow

1. Extract: PDF bytes → Spans, Drawings, Images

in: PDF file · out: PageContent[] · extract.py

PyMuPDF (Python bindings for the MuPDF C library) reads the source bytes and returns every glyph in rawdict form: bbox, font, size, weight flags, color. Vector drawings (lines, rects, curves) and embedded raster images are reduced to bounding boxes.

At this boundary, ligature codepoints (ﬁ, ﬂ), smart quotes, and soft hyphens are normalized to plain ASCII so every downstream stage sees the same text and the base-14 PDF fonts we render with can find a glyph for every char.
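The normalization step could look roughly like the sketch below. The function name `normalize_text` and the exact contents of the mapping table are assumptions for illustration; the real table in extract.py may cover more codepoints or map them differently.

```python
# Hypothetical mapping table: the real one in extract.py may differ.
_ASCII_MAP = {
    "\ufb00": "ff", "\ufb01": "fi", "\ufb02": "fl",   # Latin ligatures
    "\ufb03": "ffi", "\ufb04": "ffl",
    "\u2018": "'", "\u2019": "'",                     # single smart quotes
    "\u201c": '"', "\u201d": '"',                     # double smart quotes
    "\u00ad": "",                                     # soft hyphen: dropped
}
_TABLE = str.maketrans(_ASCII_MAP)


def normalize_text(text: str) -> str:
    """Replace ligatures, smart quotes, and soft hyphens with ASCII."""
    return text.translate(_TABLE)


print(normalize_text("\ufb01nd the \u201cso\u00adft\u201d hyphen"))
```

An explicit table is used here instead of Unicode NFKC normalization because NFKC would also rewrite characters the text above does not mention (superscripts, full-width forms), which may or may not be wanted.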

Every glyph keeps its exact bbox & font metadata.

Span

text — one run of chars
bbox — (x0, y0, x1, y1)
font / size / flags
is_bold / is_italic

PageContent

index — source page #
rect — page dimensions
spans, drawings, images
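As a rough illustration, the two records above might be modeled as dataclasses. The field types, the `BBox` alias, and deriving `is_bold` / `is_italic` from the flags bitfield are assumptions; the bit values follow PyMuPDF's span flags (bit 1 = italic, bit 4 = bold).

```python
from dataclasses import dataclass, field

BBox = tuple[float, float, float, float]  # (x0, y0, x1, y1)


@dataclass
class Span:
    text: str        # one run of characters
    bbox: BBox
    font: str
    size: float
    flags: int = 0   # PyMuPDF span flags bitfield

    @property
    def is_italic(self) -> bool:
        return bool(self.flags & 2)    # bit 1

    @property
    def is_bold(self) -> bool:
        return bool(self.flags & 16)   # bit 4


@dataclass
class PageContent:
    index: int       # source page number
    rect: BBox       # page dimensions
    spans: list[Span] = field(default_factory=list)
    drawings: list[BBox] = field(default_factory=list)  # reduced to bboxes
    images: list[BBox] = field(default_factory=list)    # reduced to bboxes
```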

Parallelism: long docs use extract_document_parallel, opening a fresh fitz.Document in each worker process (PyMuPDF isn't thread-safe).
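A minimal sketch of the per-process pattern, assuming pages are split into contiguous chunks. `page_chunks`, `_extract_chunk`, and `extract_parallel_sketch` are hypothetical names, not the actual body of extract_document_parallel; only `fitz.open` and `Page.get_text("rawdict")` are real PyMuPDF calls.

```python
from concurrent.futures import ProcessPoolExecutor


def page_chunks(n_pages: int, n_workers: int) -> list[range]:
    """Split [0, n_pages) into at most n_workers contiguous ranges."""
    n_workers = max(1, min(n_workers, n_pages))
    size, extra = divmod(n_pages, n_workers)
    chunks, start = [], 0
    for i in range(n_workers):
        stop = start + size + (1 if i < extra else 0)
        chunks.append(range(start, stop))
        start = stop
    return chunks


def _extract_chunk(job):
    path, pages = job
    import fitz  # PyMuPDF, imported inside the worker process
    doc = fitz.open(path)  # a fresh Document per process, never shared
    try:
        return [(i, doc[i].get_text("rawdict")) for i in pages]
    finally:
        doc.close()


def extract_parallel_sketch(path: str, n_pages: int, n_workers: int = 4):
    jobs = [(path, chunk) for chunk in page_chunks(n_pages, n_workers)]
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        # map() preserves job order, so pages come back in document order
        return [page for chunk in pool.map(_extract_chunk, jobs)
                for page in chunk]
```

Passing the file path (not an open document) to each worker keeps the job picklable and guarantees no MuPDF state crosses process boundaries.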

2. Lines & blocks: Spans → Lines → Blocks; infer body size

in: PageContent · out: Block[] · analyze.py

Three groupings happen here:

Lines: spans whose vertical midpoints differ by less than half a span height are merged into one line, and each line's spans are then sorted by x.
Blocks: adjacent lines with similar font size and vertical gap < ~1.2× line height become one block (a paragraph candidate).
Body size: the mode of all span font sizes, weighted by character count. Defines what counts as "body" for the rest of the pipeline.

Vertical-midpoint clustering, then size/gap thresholds.

# analyze.py: group spans into lines, lines into blocks
from collections import Counter

def _group_lines(spans):
    # sort by y, sweep, merge if mid-y diff < ½ height
    ...

def _group_blocks(lines, body_size):
    # gap < 1.2 × line_h AND |size diff| small → same block
    ...

def body_font_size(pages):
    # mode of span sizes (rounded to 0.1 pt), weighted by character count
    counter = Counter()
    for p in pages:
        for s in p.spans:
            counter[round(s.size, 1)] += len(s.text)
    return counter.most_common(1)[0][0]
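The line-grouping rule described above can be sketched as a single sweep. This is not the elided body of `_group_lines`: the stand-in `Span` type, the choice to compare against the first span of the current line, and using the taller of the two spans as the height are all assumptions.

```python
from typing import NamedTuple


class Span(NamedTuple):   # minimal stand-in for the pipeline's Span
    text: str
    bbox: tuple           # (x0, y0, x1, y1)


def _mid_y(s: Span) -> float:
    return (s.bbox[1] + s.bbox[3]) / 2


def _height(s: Span) -> float:
    return s.bbox[3] - s.bbox[1]


def group_lines_sketch(spans):
    """Cluster spans by vertical midpoint, then sort each line by x0."""
    lines = []
    for s in sorted(spans, key=_mid_y):
        if lines:
            ref = lines[-1][0]  # first span of the current line
            limit = 0.5 * max(_height(s), _height(ref))
            if abs(_mid_y(s) - _mid_y(ref)) < limit:
                lines[-1].append(s)
                continue
        lines.append([s])
    return [sorted(line, key=lambda sp: sp.bbox[0]) for line in lines]


spans = [
    Span("world", (60, 100, 100, 112)),
    Span("Hello", (10, 101, 55, 113)),
    Span("Next", (10, 120, 40, 132)),
]
print([[s.text for s in line] for line in group_lines_sketch(spans)])
```

Comparing against a fixed reference span (rather than the previous span) keeps a slowly drifting baseline from chaining many rows into one line.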

3. Classify: tag each block as heading / body / code / equation…

in: Block[] · out: typed Block[] · analyze.py

Each block is tagged using font & geometry signals:

heading: size > body + 1pt and bold, or much larger than body
body: body-sized prose, no special font signals
code: block in a Courier-family font; line breaks preserved
equation: contains chars in the Unicode Private Use Area, or uses a math-only font like CMMI or CMSY → rasterize
caption: sub-body size text near a figure region
toc / label: dot-leader entries / single-token numerics (page #s, axis ticks)

Why bother with PUA detection? LaTeX math fonts emit symbols at Private-Use codepoints that never round-trip to plain text — these blocks would render as garbage if typeset, so they're routed to be rasterized as images instead.
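A minimal sketch of the equation signals, assuming a block is flagged when any of its fonts is a math font. The function names, the prefix list beyond CMMI and CMSY being absent, and the "any font" rule are assumptions; the Private Use ranges are the three defined by Unicode.

```python
MATH_FONT_PREFIXES = ("CMMI", "CMSY")  # the two families named above


def has_private_use(text: str) -> bool:
    """True if any character sits in a Unicode Private Use Area."""
    return any(
        0xE000 <= ord(ch) <= 0xF8FF          # BMP Private Use Area
        or 0xF0000 <= ord(ch) <= 0xFFFFD     # Supplementary PUA-A
        or 0x100000 <= ord(ch) <= 0x10FFFD   # Supplementary PUA-B
        for ch in text
    )


def base_font_name(font: str) -> str:
    # Subset fonts may carry a tag such as "ABCDEF+CMMI10"; drop it.
    return font.split("+", 1)[-1]


def looks_like_equation(text: str, fonts) -> bool:
    # Assumption: one math-font span is enough to flag the whole block.
    return has_private_use(text) or any(
        base_font_name(f).upper().startswith(MATH_FONT_PREFIXES)
        for f in fonts
    )
```

A stricter variant could require that most of the block's characters come from math fonts, which would keep body paragraphs with a little inline math out of the rasterize path.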

4. Figure bands & reading order: detect columns, build the linear FlowItem stream

in: Block[], drawings, images · out: FlowItem[] · analyze.py

Figure bands — vector drawings on a page are merged vertically into y-bands. Equation blocks seed bands too. Captions and short fragments inside the band are absorbed.
Column detection — body-block x-centers are clustered; two clusters = two-column layout.
Reading order — column-major top→bottom. Figures take a synthetic slot at their band's top y.
Page chrome — tiny text in the top 10% / bottom 12% (running headers, footers, page numbers) is dropped.

Column-major flatten; figure inserted at its y-position.

# Sort key: (column, top_y)
items_with_y.sort(key=lambda t: (t[0], t[1]))
# figures use column 0 so they appear in-stream
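Column detection and the column-major sort can be sketched together. Splitting at the largest gap between sorted x-centers and the `min_gap_frac` threshold are assumptions standing in for whatever clustering analyze.py actually uses; figure handling is left out.

```python
def detect_column_split(x_centers, page_width, min_gap_frac=0.25):
    """Return the x that separates two columns, or None for one column."""
    xs = sorted(x_centers)
    if len(xs) < 2:
        return None
    # Largest gap between neighbouring centers, and where it occurs.
    gap, i = max((xs[k + 1] - xs[k], k) for k in range(len(xs) - 1))
    if gap < min_gap_frac * page_width:   # hypothetical threshold
        return None
    return (xs[i] + xs[i + 1]) / 2


def reading_order(blocks, split):
    """blocks: (x_center, top_y, label) tuples; returns labels in order."""
    def key(block):
        x, y, _ = block
        column = 0 if split is None or x < split else 1
        return (column, y)   # same (column, top_y) key as above
    return [label for _, _, label in sorted(blocks, key=key)]


blocks = [(450, 100, "R1"), (150, 300, "L2"), (150, 100, "L1"), (450, 300, "R2")]
split = detect_column_split([b[0] for b in blocks], page_width=612)
print(split, reading_order(blocks, split))
```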

5. Layout & render: FlowItem[] → 360×600 pt PDF pages

in: FlowItem[] · out: mobile PDF · layout.py, render.py

Page size — default 360×600 pt (iPhone-friendly).
Wrap — greedy word-wrap using PyMuPDF's base-14 font metrics (tiro, tibo, tiit, cour). Per-char widths are cached so wrapping is a Python-side sum (no SWIG crossing per word).
Figures — clipped from the source page, tightened to the actual content bbox, rasterized at 150 dpi, scaled to fit the column.
Code — small blocks kept together; large blocks scaled so the longest line fits.
TOC — anchors mirror the source PDF outline; if none, headings synthesise one.

Greedy-wrap + figure clip → paginated mobile PDF.
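The wrap step can be sketched as below. The constant 0.5 em width is a stand-in metric so the example runs without PyMuPDF; a real cache might be filled from something like `fitz.Font("tiro").glyph_advance(ord(ch))`. The function names are hypothetical.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def char_width(ch: str) -> float:
    # Stand-in metric: every character is 0.5 em wide (assumption).
    return 0.5


def text_width(text: str, size: float) -> float:
    # Pure-Python sum over cached widths, scaled by font size.
    return sum(char_width(c) for c in text) * size


def greedy_wrap(text: str, max_width: float, size: float) -> list[str]:
    """Put as many words on each line as fit within max_width."""
    lines, current = [], ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if current and text_width(candidate, size) > max_width:
            lines.append(current)
            current = word
        else:
            current = candidate
    if current:
        lines.append(current)
    return lines


print(greedy_wrap("the quick brown fox jumps", max_width=50, size=10))
```

A word wider than `max_width` is placed on its own line and allowed to overflow; this sketch does not hyphenate or split it.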

# layout.py: central loop
for it in items:
    if it.kind == "heading":
        emit_paragraph(..., font="times-bold")
    elif it.kind == "body":
        emit_paragraph(..., size=body)
    elif it.kind == "code":
        ...  # courier; scale if too wide
    elif it.kind == "figure":
        pb.emit_image(src_page, src_rect, w, h)

# render.py — figures are rasterized via PyMuPDF Pixmap
pix = src.get_pixmap(matrix=fitz.Matrix(scale, scale),
                     clip=fitz.Rect(rect), alpha=False)

A pre-pass de-duplicates figure rasterization tasks by (page, rect, scale) so the same crop isn't redrawn for two output pages.
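One way to express that pre-pass is a first-seen-wins dictionary keyed on (page, rect, scale). The function name and the rounding of coordinates (so that float noise does not defeat the match) are assumptions.

```python
def dedupe_raster_tasks(tasks, ndigits=2):
    """tasks: iterable of (page, rect, scale); keep the first of each key."""
    unique = {}
    for page, rect, scale in tasks:
        # Rounding is an assumption: it makes near-identical crops collide.
        key = (page, tuple(round(v, ndigits) for v in rect), round(scale, ndigits))
        unique.setdefault(key, (page, rect, scale))
    return list(unique.values())  # dicts preserve insertion order


tasks = [
    (3, (10.0, 20.0, 200.0, 300.0), 2.0),
    (3, (10.001, 20.0, 200.0, 300.0), 2.0),   # same crop, float noise
    (4, (10.0, 20.0, 200.0, 300.0), 2.0),     # different page
]
print(len(dedupe_raster_tasks(tasks)))
```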
