Vision, PDF & File Input

Intermediate

Claude is multimodal. You can include images and documents (such as PDFs) in a message and ask it to extract data, describe a chart, summarize a contract, or read a screenshot.

Three ways to send a file

Base64 — encode the bytes inline in the request. Simplest for one-offs.
URL — point to a hosted file.
Files API — upload once, then reference it by file_id across many requests (avoids re-uploading large files).

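# Conceptual: an image content block alongside text (see docs for exact fields)
import base64

# "chart.png" is a placeholder path; b64 must be a base64 string, not raw bytes
with open("chart.png", "rb") as f:
    b64 = base64.standard_b64encode(f.read()).decode("ascii")

messages = [{
  "role": "user",
  "content": [
    {"type": "image", "source": {"type": "base64", "media_type": "image/png", "data": b64}},
    {"type": "text", "text": "What does this chart show? Give the top 3 takeaways."},
  ],
}]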
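
The snippet above uses the base64 form. As a rough sketch, the other two options change only the source object inside the content block. The field names for the url and file variants, the helper names, the example URL, and the file_id value below are assumptions for illustration; confirm the exact fields against the API reference.

```python
import base64


def base64_source(data: bytes, media_type: str) -> dict:
    """Inline the raw bytes as a base64 string."""
    return {
        "type": "base64",
        "media_type": media_type,
        "data": base64.standard_b64encode(data).decode("ascii"),
    }


def url_source(url: str) -> dict:
    """Point at a file that is already hosted somewhere reachable (assumed shape)."""
    return {"type": "url", "url": url}


def file_source(file_id: str) -> dict:
    """Reference a file uploaded earlier through the Files API (assumed shape)."""
    return {"type": "file", "file_id": file_id}


def image_block(source: dict) -> dict:
    """Wrap any of the three source objects in an image content block."""
    return {"type": "image", "source": source}


# Placeholder inputs for illustration only.
png_header = b"\x89PNG\r\n\x1a\n"
blocks = [
    image_block(base64_source(png_header, "image/png")),
    image_block(url_source("https://example.com/chart.png")),
    image_block(file_source("file_abc123")),  # hypothetical file_id
]

for block in blocks:
    print(block["source"]["type"])
```

Keeping the source construction separate from the block type means the same three helpers could feed a document block as well as an image block.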

PDFs work similarly, using a document content block instead of an image block. Claude can read both the text and the visual layout, and you can request citations that point back to the passage in the document an answer came from.
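
A minimal sketch of what such a document block might look like, assuming a base64 PDF source and a citations flag on the block. The helper name pdf_document_block, the optional title field, and the placeholder bytes are illustrative assumptions; check the docs for the exact fields.

```python
import base64
from typing import Optional


def pdf_document_block(
    pdf_bytes: bytes,
    title: Optional[str] = None,
    citations: bool = True,
) -> dict:
    """Build a document content block from raw PDF bytes (assumed field layout)."""
    block = {
        "type": "document",
        "source": {
            "type": "base64",
            "media_type": "application/pdf",
            "data": base64.standard_b64encode(pdf_bytes).decode("ascii"),
        },
    }
    if title is not None:
        block["title"] = title
    if citations:
        # Assumed switch for asking the model to cite passages in its answer.
        block["citations"] = {"enabled": True}
    return block


# Placeholder bytes; in practice read them from a real PDF file.
fake_pdf = b"%PDF-1.4 placeholder"

messages = [{
    "role": "user",
    "content": [
        pdf_document_block(fake_pdf, title="Service contract"),
        {"type": "text", "text": "Summarize the termination clause."},
    ],
}]

print(messages[0]["content"][0]["type"])
```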

What it's great for

Extraction — pull structured data from invoices, forms, tables (pair with Structured Output).
Understanding visuals — charts, diagrams, screenshots, UI.
Document Q&A — ask questions across a long PDF with citations.

Tips

Reuse with file_id when sending the same big document repeatedly: you skip re-uploading the bytes on every request, which keeps requests smaller and faster.
Resolution and size matter: very large images may be downscaled, so check the documented limits before sending.
Verify extracted numbers: vision is strong but not infallible, so spot-check critical figures (see Hallucinations).
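
One cheap way to spot-check extracted figures is an arithmetic consistency test. The sketch below assumes a hypothetical extraction result with line_items and total keys (not a format the API guarantees) and checks that the line items add up to the stated total.

```python
from decimal import Decimal


def totals_match(extracted: dict, tolerance: Decimal = Decimal("0.01")) -> bool:
    """Return True when the line item amounts sum to the stated total."""
    line_sum = sum(
        (Decimal(str(item["amount"])) for item in extracted["line_items"]),
        Decimal("0"),
    )
    return abs(line_sum - Decimal(str(extracted["total"]))) <= tolerance


# Hypothetical extraction results, e.g. parsed from a structured reply.
good = {"line_items": [{"amount": "19.99"}, {"amount": "5.01"}], "total": "25.00"}
bad = {"line_items": [{"amount": "19.99"}, {"amount": "5.01"}], "total": "52.00"}

print(totals_match(good))  # True
print(totals_match(bad))   # False: the digits of the total were transposed
```

A check like this only catches internal inconsistencies. A figure that was misread consistently still needs a comparison against the source document.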

Next

Structured Output
Generating Real Files (docx/pptx/xlsx/pdf)
Tokens & Pricing — images cost tokens too
