Five stages turn a two-column desktop PDF into a single-column, phone-sized one. The pipeline is pure geometry and font heuristics, with no ML.
A two-column page with a figure becomes a single mobile column.
PyMuPDF (Python bindings for the MuPDF C library) reads the source bytes and
returns every glyph in rawdict form: bbox, font, size, weight flags, color.
Vector drawings (lines, rects, curves) and embedded raster images are
reduced to bounding boxes.
At this boundary, ligature codepoints (ﬁ, ﬂ),
smart quotes, and soft hyphens are normalized to plain ASCII so every
downstream stage sees the same text and the base-14 PDF fonts we render
with can find a glyph for every char.
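The normalization step described above can be sketched with a translation table. This is a minimal illustration, not the pipeline's actual code: the name `normalize_text` and the exact set of mapped characters are assumptions, and a real table would likely cover more codepoints.

```python
# Hypothetical mapping; the real pipeline's table may cover more characters.
_REPLACEMENTS = {
    "\ufb01": "fi",  # LATIN SMALL LIGATURE FI
    "\ufb02": "fl",  # LATIN SMALL LIGATURE FL
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
    "\u00ad": "",    # soft hyphen: dropped entirely
}
_TABLE = str.maketrans(_REPLACEMENTS)


def normalize_text(text: str) -> str:
    """Map ligatures, smart quotes, and soft hyphens to plain ASCII."""
    return text.translate(_TABLE)
```

Doing this once at the extraction boundary means later stages never have to special-case these characters.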
Parallelism: long docs use extract_document_parallel, opening a fresh
fitz.Document in each worker process (PyMuPDF isn't thread-safe).
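One way to arrange this is to hand each worker process a contiguous page range and let it open its own document. The sketch below is an assumption about the shape of such code, not the source of `extract_document_parallel`; the names `page_chunks`, `_extract_range`, and `extract_parallel_sketch` are hypothetical.

```python
from concurrent.futures import ProcessPoolExecutor


def page_chunks(n_pages: int, n_workers: int) -> list[tuple[int, int]]:
    """Split [0, n_pages) into at most n_workers contiguous (start, stop) ranges."""
    if n_pages <= 0:
        return []
    n_workers = max(1, min(n_workers, n_pages))
    base, extra = divmod(n_pages, n_workers)
    chunks, start = [], 0
    for i in range(n_workers):
        stop = start + base + (1 if i < extra else 0)
        chunks.append((start, stop))
        start = stop
    return chunks


def _extract_range(job):
    """Runs inside a worker process: open a fresh Document, read one range."""
    path, start, stop = job
    import fitz  # PyMuPDF, imported in the worker so nothing is shared

    doc = fitz.open(path)
    try:
        return [doc[i].get_text("rawdict") for i in range(start, stop)]
    finally:
        doc.close()


def extract_parallel_sketch(path, n_pages, n_workers=4):
    """Hypothetical driver: fan page ranges out, flatten results in page order."""
    jobs = [(path, a, b) for a, b in page_chunks(n_pages, n_workers)]
    with ProcessPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(_extract_range, jobs)  # map preserves job order
        return [page for chunk in results for page in chunk]
```

On platforms that spawn rather than fork worker processes, the call to `extract_parallel_sketch` needs to sit under an `if __name__ == "__main__":` guard.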
Three groupings happen here:
# analyze.py: group spans into lines
from collections import Counter

def _group_lines(spans):
    # sort by y, sweep, merge if mid-y diff < ½ height
    ...

def _group_blocks(lines, body_size):
    # gap < 1.2 × line_h AND |size diff| small → same block
    ...

def body_font_size(pages):
    # the most common rounded size, weighted by character count, is the body size
    counter = Counter()
    for p in pages:
        for s in p.spans:
            counter[round(s.size, 1)] += len(s.text)
    return counter.most_common(1)[0][0]
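The `_group_lines` comment above describes a y-sorted sweep that merges spans whose vertical midpoints are close. A minimal sketch of that idea follows. The `Span` dataclass, the name `group_lines`, and the choice to compare against the first span of the current line are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Span:
    # Hypothetical span record: text plus its bounding box.
    text: str
    x0: float
    y0: float
    x1: float
    y1: float


def group_lines(spans):
    """Sweep spans top to bottom; a span joins the current line when its
    vertical midpoint is within half the height of the line's first span."""
    lines = []
    for s in sorted(spans, key=lambda s: ((s.y0 + s.y1) / 2, s.x0)):
        mid = (s.y0 + s.y1) / 2
        if lines:
            first = lines[-1][0]
            first_mid = (first.y0 + first.y1) / 2
            first_h = first.y1 - first.y0
            if abs(mid - first_mid) < 0.5 * first_h:
                lines[-1].append(s)
                continue
        lines.append([s])
    # Restore left-to-right order inside each line.
    return [sorted(line, key=lambda s: s.x0) for line in lines]
```

Comparing midpoints rather than top edges tolerates small baseline shifts between spans of slightly different sizes on the same line.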
Each block is tagged using font & geometry signals:
CMMI, CMSY → rasterize
Why bother with PUA detection? LaTeX math fonts emit symbols at Private-Use codepoints that never round-trip to plain text; these blocks would render as garbage if typeset, so they're routed to be rasterized as images instead.
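The PUA check above can be sketched as a test on codepoint ranges plus a font-name prefix match. The function names and the handling of subset-tagged font names such as "ABCDEF+CMMI10" are assumptions; only the CMMI and CMSY prefixes come from the text.

```python
# Prefixes named in the text; a real list is likely longer.
MATH_FONT_PREFIXES = ("CMMI", "CMSY")


def is_private_use(ch: str) -> bool:
    """True if ch lies in one of Unicode's three Private Use ranges."""
    cp = ord(ch)
    return (
        0xE000 <= cp <= 0xF8FF          # BMP Private Use Area
        or 0xF0000 <= cp <= 0xFFFFD     # Supplementary Private Use Area-A
        or 0x100000 <= cp <= 0x10FFFD   # Supplementary Private Use Area-B
    )


def should_rasterize(text: str, font_name: str) -> bool:
    """Hypothetical rule: rasterize on a math-font prefix or any PUA char."""
    # Embedded subset fonts are often named like "ABCDEF+CMMI10".
    base = font_name.split("+", 1)[-1].upper()
    if base.startswith(MATH_FONT_PREFIXES):
        return True
    return any(is_private_use(c) for c in text)
```

Checking both signals covers math fonts that happen to emit ordinary codepoints as well as PUA symbols from fonts with unfamiliar names.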
# Sort key: (column, top_y)
items_with_y.sort(key=lambda t: (t[0], t[1]))
# figures use column 0 so they appear in-stream
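To see what the (column, top_y) key does to a two-column page, here is a small self-contained sketch. How the real pipeline assigns columns is not shown in the text, so the midpoint-versus-page-center rule below is an assumption, as are the name `reading_order` and the tuple layout.

```python
def reading_order(items, page_width):
    """items: (kind, x0, y0, x1) tuples. Returns them sorted by (column, top_y)."""
    center = page_width / 2
    keyed = []
    for kind, x0, y0, x1 in items:
        if kind == "figure":
            col = 0  # figures use column 0, as in the text
        else:
            # Assumed rule: a block belongs to the column its midpoint falls in.
            col = 0 if (x0 + x1) / 2 < center else 1
        keyed.append((col, y0, (kind, x0, y0, x1)))
    keyed.sort(key=lambda t: (t[0], t[1]))
    return [item for _, _, item in keyed]


page = [
    ("body", 320, 80, 560),     # top of right column
    ("body", 40, 300, 280),     # lower left column
    ("figure", 320, 200, 560),  # figure sitting in the right column
    ("body", 40, 80, 280),      # top of left column
]
ordered = reading_order(page, page_width=600)
```

Here the whole left column comes out first, the figure lands between the two left-column blocks according to its y position, and the right column follows.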
tiro, tibo, tiit, cour). Per-char widths are cached so wrapping is a Python-side sum (no SWIG crossing per word).
# layout.py: central loop
for it in items:
    if it.kind == "heading":
        emit_paragraph(..., font="times-bold")
    elif it.kind == "body":
        emit_paragraph(..., size=body)
    elif it.kind == "code":
        # courier; scale if too wide
        ...
    elif it.kind == "figure":
        pb.emit_image(src_page, src_rect, w, h)

# render.py: figures are rasterized via PyMuPDF Pixmap
pix = src.get_pixmap(matrix=fitz.Matrix(scale, scale),
                     clip=fitz.Rect(rect), alpha=False)
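The cached-width wrapping mentioned above amounts to a greedy line breaker that measures words by summing table lookups. The sketch below assumes a `char_width` dict holding each character's advance at font size 1 (presumably filled once per font from PyMuPDF in the real pipeline); the name `wrap_words` and the 0.5 fallback width are hypothetical.

```python
def wrap_words(text, char_width, font_size, max_width):
    """Greedy wrap. char_width maps a character to its advance at size 1."""

    def width(s):
        # Pure-Python sum over cached advances; no library call per word.
        return sum(char_width.get(c, 0.5) for c in s) * font_size

    space = width(" ")
    lines, cur, cur_w = [], [], 0.0
    for word in text.split():
        w = width(word)
        extra = w if not cur else space + w
        if cur and cur_w + extra > max_width:
            lines.append(" ".join(cur))  # flush the full line
            cur, cur_w = [word], w
        else:
            cur.append(word)
            cur_w += extra
    if cur:
        lines.append(" ".join(cur))
    return lines
```

A word wider than `max_width` is placed on its own line rather than split; a code block that overflows would need the scaling path instead.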
A pre-pass de-duplicates figure rasterization tasks by (page, rect, scale) so the same crop isn't redrawn for two output pages.
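A de-duplication pre-pass of this kind can be sketched with a dict keyed on (page, rect, scale). The function name, the tuple layout, and the rounding of coordinates before comparison are assumptions added for illustration.

```python
def dedupe_raster_tasks(tasks, ndigits=2):
    """tasks: iterable of (page, rect, scale), with rect = (x0, y0, x1, y1).

    Returns (unique_tasks, index), where index[i] is the position in
    unique_tasks of the crop that input task i should reuse.
    """
    seen = {}
    unique, index = [], []
    for page, rect, scale in tasks:
        # Round so float noise does not split otherwise identical crops.
        key = (page, tuple(round(v, ndigits) for v in rect), round(scale, ndigits))
        if key not in seen:
            seen[key] = len(unique)
            unique.append((page, rect, scale))
        index.append(seen[key])
    return unique, index
```

Each unique task is rasterized once, and `index` lets every output page look up the shared result.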