How PDF compression actually works (and why your file is so big)

A PDF can be 50 KB or 50 MB for the same number of pages. Here's what actually takes up the space, and what a compressor can and can't do about it.

By Muhammad Tahir · 7 min read · pdf, explainer

You export a two-page document and it's 18 MB. Your colleague's looks identical and weighs 200 KB. Same page count, same fonts, wildly different size. The difference is almost never the text — it's what's embedded around it. Understanding what's inside a PDF tells you exactly why it's big and how much a compressor can realistically claw back.

A PDF is a bundle of objects, not a picture of a page

A PDF file isn't a flat image of each page. It's a container holding a graph of objects: text strings with positioning, vector drawing instructions, embedded fonts, embedded raster images, metadata, and bookmarks. When you open the file, a viewer reassembles those objects into the page you see.
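
As a rough illustration of that object graph, here is a toy model in Python. The object numbers, kinds, and byte sizes are hypothetical, and real PDFs express references with their own syntax. The sketch only shows the idea of starting at a root object, following references, and adding up where the bytes sit.

```python
# Toy model of a PDF object graph. Object numbers, kinds, and sizes are
# hypothetical values for illustration, not taken from a real file.
objects = {
    1: {"kind": "catalog", "bytes": 100, "refs": [2]},
    2: {"kind": "pages", "bytes": 80, "refs": [3, 4]},
    3: {"kind": "page", "bytes": 200, "refs": [5, 6, 7]},
    4: {"kind": "page", "bytes": 200, "refs": [5, 8]},
    5: {"kind": "font", "bytes": 40_000, "refs": []},
    6: {"kind": "content", "bytes": 3_000, "refs": []},
    7: {"kind": "image", "bytes": 4_000_000, "refs": []},
    8: {"kind": "content", "bytes": 2_500, "refs": []},
    9: {"kind": "image", "bytes": 1_500_000, "refs": []},  # nothing points here
}


def reachable(objs, root):
    """Return the set of object numbers a viewer would reach from the root."""
    seen = set()
    stack = [root]
    while stack:
        num = stack.pop()
        if num in seen:
            continue
        seen.add(num)
        stack.extend(objs[num]["refs"])
    return seen


def bytes_by_kind(objs, ids):
    """Sum byte sizes per object kind for the given object numbers."""
    totals = {}
    for num in ids:
        kind = objs[num]["kind"]
        totals[kind] = totals.get(kind, 0) + objs[num]["bytes"]
    return totals


live = reachable(objects, root=1)
print(bytes_by_kind(objects, live))
print("unreferenced:", sorted(set(objects) - live))  # unreferenced: [9]
```

In this made-up file, both pages share one font object, a single image outweighs everything else combined, and object 9 is carried in the file even though no page uses it.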

That structure matters because it tells you where the weight goes. A page of pure text (say 500 words) is a few kilobytes of character data plus whatever fonts it references. You could fit hundreds of such pages in a megabyte. So if your text-only PDF is huge, the text is not the culprit. Something heavier got embedded.

In practice, three things make PDFs large:

  • Embedded raster images — photos, screenshots, logos saved at full camera resolution.
  • Embedded fonts — especially full font files when only a few characters are used.
  • Scanned pages — where every page is a high-resolution image, so the entire document is pictures.

The first and third are where almost all real bloat lives.

Why embedded images dominate the file size

Say you drop a single phone photo into a report. That photo is 4032 by 3024 pixels — about 12 million pixels. Even compressed as JPEG it might be 3–5 MB. The PDF stores that image stream more or less as-is. Add five photos and you've got a 20 MB document whose text would have fit in 50 KB.

Here's the wasteful part: the page only displays that image at maybe 600 pixels wide. The other 3,400 pixels of width are downloaded, stored, and then thrown away by the viewer at render time. You're paying to carry resolution nobody will ever see.

This is the single biggest lever a compressor has, and it's called downsampling.

Downsampling: matching resolution to how the page is actually viewed

Downsampling reduces an embedded image's pixel dimensions to something appropriate for its display size. The unit is PPI (pixels per inch) relative to the size the image occupies on the page.

  • 300 PPI — print quality. The right target if the document will be physically printed.
  • 150 PPI — comfortable for on-screen reading and most office use.
  • 72–96 PPI — fine for screen-only documents that won't be zoomed deeply.

If a photo sits in a 2-inch-wide slot but is stored at 4032 pixels wide, that's roughly 2000 PPI, wildly more than needed. Downsample it to 150 PPI and it becomes 300 pixels wide (and 225 tall, keeping the 4:3 shape). That's a reduction from ~12 million pixels to 67,500, well over a 99% cut in pixel count for that image. At normal zoom, the visible result on screen is unchanged.
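
The arithmetic above can be sketched in a few lines of Python. The function names are made up for this example, and the sketch assumes the image is scaled uniformly and is never upsampled.

```python
def effective_ppi(pixel_width, display_width_in):
    """Pixels per inch at the size the image occupies on the page."""
    return pixel_width / display_width_in


def downsampled_size(pixel_width, pixel_height, display_width_in, target_ppi):
    """Pixel dimensions after downsampling to target_ppi (never upsamples)."""
    current = effective_ppi(pixel_width, display_width_in)
    if current <= target_ppi:
        return pixel_width, pixel_height
    scale = target_ppi / current
    return round(pixel_width * scale), round(pixel_height * scale)


# The phone photo from the text, placed in a 2-inch-wide slot.
print(effective_ppi(4032, 2.0))  # 2016.0
new_w, new_h = downsampled_size(4032, 3024, 2.0, 150)
print(new_w, new_h)  # 300 225

pixel_cut = 1 - (new_w * new_h) / (4032 * 3024)
print(f"{pixel_cut:.1%} fewer pixels")
```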

Downsampling is lossy and irreversible — you're discarding pixels — but for images displayed far below their native resolution, nobody can tell. This is where the dramatic 20 MB to 800 KB transformations come from.
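
To make "discarding pixels" concrete, here is a minimal sketch of one simple downsampling method, block averaging, on a tiny grayscale image stored as a list of rows. Real compressors typically use more refined filters, and the pixel values here are invented for the example.

```python
def box_downsample(pixels, factor):
    """Average each factor x factor block of a grayscale image into one pixel."""
    height = len(pixels) // factor
    width = len(pixels[0]) // factor
    out = []
    for by in range(height):
        row = []
        for bx in range(width):
            block = [
                pixels[by * factor + dy][bx * factor + dx]
                for dy in range(factor)
                for dx in range(factor)
            ]
            row.append(round(sum(block) / len(block)))
        out.append(row)
    return out


# 4x4 image: two smooth corners, one flat corner, one 2x2 checkerboard.
image = [
    [10, 20, 200, 210],
    [30, 40, 220, 230],
    [90, 90, 0, 254],
    [90, 90, 254, 0],
]
small = box_downsample(image, 2)
print(small)  # [[25, 215], [90, 127]]
```

The checkerboard in the bottom-right corner collapses to a single mid-gray value of 127. Nothing in the output says whether that pixel came from a checkerboard or a flat gray patch, which is why the step can't be undone.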

Re-encoding the JPEGs that are already inside

Even after downsampling, each embedded image is encoded with some quality setting. JPEG quality trades file size against visible artifacts: blockiness in smooth regions like skies and shadows, and ringing around sharp edges. A scan or photo saved at near-maximum quality (say 95) is much larger than the same image at quality 75, and at normal viewing size the two are very hard to tell apart.

A compressor re-encodes embedded images at a sensible quality target. Combined with downsampling, a photo that entered the PDF as 4 MB might leave as 120 KB. The catch with JPEG is that it's lossy and generational — re-encoding an already-JPEG image adds a little more loss each time. Doing it once at a reasonable quality is invisible; doing it repeatedly across many save-and-recompress cycles slowly degrades the image. Compress once, from the best source you have.
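
The size-versus-quality trade comes from a rounding step inside JPEG called quantization. The sketch below is a toy version, not a JPEG encoder: it takes a handful of made-up coefficient values and rounds them to multiples of a step size. How a quality number maps to step sizes depends on the encoder, so the steps 4 and 16 here are simply assumptions for illustration.

```python
def quantize(coefficients, step):
    """Round each value to the nearest multiple of step.

    Returns the integer levels that would be stored and the values
    a decoder would reconstruct from them.
    """
    levels = [round(c / step) for c in coefficients]
    restored = [level * step for level in levels]
    return levels, restored


def total_error(original, restored):
    """Sum of absolute differences between original and restored values."""
    return sum(abs(a - b) for a, b in zip(original, restored))


coefficients = [3, -7, 12, 45, -31, 9, 0, 19]  # invented sample values

fine_levels, fine = quantize(coefficients, 4)       # small step, "high quality"
coarse_levels, coarse = quantize(coefficients, 16)  # large step, "low quality"

print(fine_levels, total_error(coefficients, fine))      # [1, -2, 3, 11, -8, 2, 0, 5] 6
print(coarse_levels, total_error(coefficients, coarse))  # [0, 0, 1, 3, -2, 1, 0, 1] 28
```

The coarse step stores smaller numbers and more zeros, which encode compactly, but the reconstructed values sit further from the originals. That rounding error is the loss that can't be recovered on a later save.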

You can preview exactly this trade-off on standalone images with the Image Compressor before they ever go into a document — it's the same JPEG machinery, just outside the PDF wrapper.

Lossless compression of the non-image parts

Not everything in a PDF is an image, and the rest gets compressed too, but losslessly. PDF streams (text content, vector paths, metadata) are typically compressed with Deflate, the same algorithm used by ZIP and PNG (PDF calls this filter FlateDecode). This is lossless, so it never changes how anything looks; it just removes redundancy in the byte stream.
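
A quick way to see this is Python's standard zlib module, which implements the same Deflate compression. The content stream below is a made-up example in the general style of PDF text operators, not bytes from a real document.

```python
import zlib

# Build a repetitive, PDF-style text content stream (illustrative only).
content_stream = b"".join(
    b"BT /F1 11 Tf 72 %d Td (Line %d of the quarterly report.) Tj ET\n"
    % (720 - 14 * i, i)
    for i in range(40)
)

compressed = zlib.compress(content_stream, 9)
restored = zlib.decompress(compressed)

# Lossless: decompression returns exactly the original bytes.
assert restored == content_stream

ratio = len(compressed) / len(content_stream)
print(len(content_stream), "->", len(compressed), f"bytes ({ratio:.0%} of original)")
```

The exact ratio depends on how repetitive the stream is, but the round trip is always byte-for-byte identical.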

A few structural optimizations also help:

  • Object stream compression bundles many small objects together and deflates them as one block, which compresses far better than each tiny object alone.
  • De-duplication spots identical embedded resources — the same logo placed on every page, an identical font, a repeated background — and stores one copy referenced many times instead of dozens of duplicates.
  • Removing cruft strips unused objects, stale incremental-save history, orphaned bookmarks, and bloated metadata. A file edited and re-saved many times accumulates dead objects that a clean rewrite discards.
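
The benefit of bundling small objects can be sketched with zlib. The 50 dictionary-like objects below are invented for the example, and a real object stream also carries a small index that this sketch leaves out.

```python
import zlib

# Fifty small, similar objects (hypothetical link annotations).
small_objects = [
    b"<< /Type /Annot /Subtype /Link /Rect [72 %d 300 %d] /Border [0 0 0] >>"
    % (700 - 20 * i, 712 - 20 * i)
    for i in range(50)
]

# Option 1: deflate each object on its own.
separate = sum(len(zlib.compress(obj, 9)) for obj in small_objects)

# Option 2: join them and deflate the bundle as one block.
bundled = len(zlib.compress(b"\n".join(small_objects), 9))

print("separately:", separate, "bytes")
print("bundled:   ", bundled, "bytes")
```

Compressed one at a time, each object is too short for the compressor to find much repetition. Compressed together, the shared structure across objects is stored once and referenced thereafter.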

These steps are safe — they reduce size without touching fidelity. On a text-heavy PDF with few images, lossless optimization plus de-duplication might be the only meaningful win, and it's typically modest: think 10–30%, not 90%.
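
De-duplication in particular is simple to sketch: hash each embedded stream, keep the first copy of each distinct one, and point repeats at it. This is a minimal illustration with fake byte strings standing in for a logo and a photo. How an actual PDF tool identifies and rewrites duplicate resources is not shown here.

```python
import hashlib


def deduplicate(streams):
    """Return (unique_streams, refs), where refs[i] indexes into unique_streams."""
    unique = []
    index_by_digest = {}
    refs = []
    for data in streams:
        digest = hashlib.sha256(data).digest()
        if digest not in index_by_digest:
            index_by_digest[digest] = len(unique)
            unique.append(data)
        refs.append(index_by_digest[digest])
    return unique, refs


# Placeholder bytes standing in for embedded resources.
logo = b"\x89fake-logo-bytes" * 1000
photo = b"fake-photo-bytes" * 5000
streams = [logo, photo, logo, logo]  # the logo was embedded three times

unique, refs = deduplicate(streams)
before = sum(len(s) for s in streams)
after = sum(len(s) for s in unique)
print(refs)  # [0, 1, 0, 0]
print(before, "->", after, "bytes")  # 128000 -> 96000 bytes
```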

Fonts: small files that occasionally aren't

Fonts are usually a minor cost, but they can surprise you. If a document embeds a complete font file for a typeface where only a handful of glyphs are used, that's wasted space. Subsetting embeds only the characters the document actually uses. A full CJK font can be megabytes; a subset of the few hundred characters in your document is a fraction of that. Most well-behaved exporters already subset, but older tools and certain workflows don't, which is why a short document can occasionally carry surprising font weight.
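
A rough way to see why subsetting helps is to count how many distinct characters a document actually uses. The sketch below treats one character as one glyph, which is a simplification (real subsetting works on the font's glyphs and handles cases like ligatures), and the 20,000-glyph font size is an assumed round number.

```python
def glyphs_needed(text):
    """Distinct non-whitespace characters used in the text."""
    return {ch for ch in text if not ch.isspace()}


def subset_fraction(text, font_glyph_count):
    """Share of the font's glyphs the text needs (one char = one glyph here)."""
    return len(glyphs_needed(text)) / font_glyph_count


sample = "Quarterly report: revenue rose in all regions."
needed = glyphs_needed(sample)
print(len(needed), "distinct characters")
print(f"{subset_fraction(sample, 20_000):.2%} of a 20,000-glyph font")
```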

Why a scanned PDF compresses completely differently

This is the distinction that confuses people most. A text PDF and a scanned PDF can look identical on screen but behave nothing alike under compression.

A scanned PDF that hasn't been through OCR has no text objects at all. Each page is a single large image, essentially a photograph of paper. There is no character data to compress losslessly because there are no characters, just pixels. So the only levers are the lossy ones: downsample the page images and re-encode them. Scanned documents therefore both start large (a color scan at 300 PPI is heavy) and respond dramatically to compression, because reducing those page images is almost the entire file.

A born-digital text PDF is the opposite: it's mostly lightweight text and vectors, so it's already small, and there's little image weight to squeeze. Running it through aggressive image compression does almost nothing because there are barely any images to shrink.

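A practical tell: if your PDF is large but you can't select the text with your cursor, it's a scan, and downsampling will be your big win. If you can select the text and it's still large, hunt for a few oversized embedded images. (One exception: a scan that has been through OCR carries an invisible text layer, so selectable text doesn't completely rule out a scan.)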

Realistic size expectations

Rough numbers, so you know whether a result is normal or a sign something's off:

  • Text-only document (no images): a few KB per page. A 50-page report can be well under 1 MB. Compression saves maybe 10–30%.
  • Document with a few photos at full resolution: easily 5–20 MB. Downsampling to 150 PPI plus re-encoding can cut this 80–95%, often landing around 1 MB or less while looking unchanged on screen.
  • Color scan at 300 PPI: roughly 1–3 MB per page, so a 20-page scan is 20–60 MB. Dropping to 150 PPI grayscale where color isn't needed can shrink it 70–90%.
  • Already-optimized PDF: shrinks by 5% or less. If a file barely shrinks, it was probably compressed well already, and that's a good sign, not a failure.
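
The scan figures can be sanity-checked with some raw pixel arithmetic. This sketch assumes a US Letter page (8.5 by 11 inches) and 8 bits per channel, and it counts uncompressed pixel data before any JPEG encoding, so real file sizes will be much smaller than these totals.

```python
def raw_bytes(width_in, height_in, ppi, channels):
    """Uncompressed size of one scanned page at 8 bits per channel."""
    return round(width_in * ppi) * round(height_in * ppi) * channels


color_300 = raw_bytes(8.5, 11, 300, channels=3)  # 2550 x 3300 pixels, RGB
gray_150 = raw_bytes(8.5, 11, 150, channels=1)   # 1275 x 1650 pixels, grayscale

print(color_300)  # 25245000
print(gray_150)   # 2103750

raw_reduction = 1 - gray_150 / color_300
print(f"{raw_reduction:.1%} less raw pixel data")
```

Halving the resolution cuts the pixel count to a quarter, and dropping from three channels to one cuts it by a further two thirds. The encoded saving won't track the raw figure exactly, since JPEG does not spend bytes evenly across channels.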

The honest takeaway: a compressor isn't magic. It can only remove redundancy and resolution you don't need. A lean text document is already near its floor, while an image-heavy or scanned one has enormous headroom. Knowing which kind you have tells you what to expect before you even start.

When you want to shrink a real file, run it through the PDF Compressor and pick a target based on use: 150 PPI for on-screen sharing, 300 PPI only if it's going to print. And if the bloat is one specific image you'll reuse elsewhere, compress that image on its own with the Image Compressor first — it's often the cleanest fix.