Common Patterns
The typical flow is: upload a file (POST /api/sources/upload), poll for completion (GET /api/sources/{id} until extraction_status is ‘extracted’ or ‘attention_required’), then retrieve the extracted text (GET /api/sources/{id}/page-texts). For files already in project storage, use import-from-storage. For web pages, use import-url. To swap extraction backends after the fact, POST /api/sources/{id}/reextract with a new extraction_model.
GET /api/sources
List all sources with optional status filter.
Filter by extraction_status. One of: pending, extracting, extracted, attention_required, failed, cancelled.
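As a rough illustration, a client might check the filter value before building the request URL. Only the six allowed values come from the description above; the query parameter name `status`, the helper name, and the base URL are assumptions made for this sketch.

```python
from urllib.parse import urlencode

# Allowed extraction_status values, as listed in the endpoint description.
EXTRACTION_STATUSES = (
    "pending", "extracting", "extracted",
    "attention_required", "failed", "cancelled",
)


def list_sources_url(base_url, status=None):
    """Build the GET /api/sources URL, optionally filtered by status.

    NOTE: the query parameter name "status" is an assumption; check which
    name your deployment actually expects.
    """
    url = base_url.rstrip("/") + "/api/sources"
    if status is None:
        return url
    if status not in EXTRACTION_STATUSES:
        raise ValueError(f"unknown extraction_status: {status!r}")
    return url + "?" + urlencode({"status": status})
```

Rejecting unknown values locally avoids a round trip for a request that cannot match any source.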
POST /api/sources/upload
Upload a file for extraction. Accepts PDF, DOCX, PPTX, XLSX, images (PNG/JPG/WebP/GIF/TIFF), and plain text. Uses multipart/form-data. Optional fields: name (display name), metadata (JSON string, preserved through indexing), extraction_model (PDF only; one of auto, mistral, paddleocr, lighton, opendataloader, fitz, pdfplumber).
POST /api/sources/import-from-storage
Import a file already in project storage as a source.
POST /api/sources/import-url
Import content from web URLs. mode=‘urls’ imports a fixed list, mode=‘crawl’ spiders from a seed URL, mode=‘sitemap’ parses a sitemap XML. Requires a Firecrawl API key to be configured in project settings.
GET /api/sources/{id}
Get source details including extraction status.
id: Source ID
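The polling step of the typical flow (repeat GET /api/sources/{id} until extraction_status is ‘extracted’ or ‘attention_required’) could be sketched as below. This is a minimal sketch, not the service's own client: `get_source` is a hypothetical callable that returns the parsed JSON body as a dict, `extraction_status` is assumed to be a top-level key of that body, and treating ‘failed’ and ‘cancelled’ as final is also an assumption.

```python
import time

# Statuses that end polling in the documented flow.
READY_STATUSES = {"extracted", "attention_required"}
# Assumption: extraction will not progress further from these statuses.
STOPPED_STATUSES = {"failed", "cancelled"}


def wait_for_extraction(get_source, source_id, interval=2.0, timeout=300.0,
                        sleep=time.sleep, clock=time.monotonic):
    """Poll get_source(source_id) until extraction reaches a final status.

    get_source is any callable returning the parsed body of
    GET /api/sources/{id}. sleep and clock are injectable so the loop can
    be exercised without real waiting.
    """
    deadline = clock() + timeout
    while True:
        source = get_source(source_id)
        status = source["extraction_status"]
        if status in READY_STATUSES:
            return source
        if status in STOPPED_STATUSES:
            raise RuntimeError(f"extraction ended with status {status!r}")
        if clock() >= deadline:
            raise TimeoutError(
                f"source {source_id} still {status!r} after {timeout}s")
        sleep(interval)
```

In practice `get_source` would wrap whatever HTTP client the project already uses; keeping it as a parameter leaves authentication and base URL out of the polling logic.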
GET /api/sources/{id}/page-texts
Get extracted text content organized by page.
id: Source ID
page: Specific page number (integer ≥ 1)
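A small helper for building the page-texts URL might look like the following. The query parameter name `page` and the integer ≥ 1 rule are taken from the error table; that the parameter is optional, plus the helper name and base URL, are assumptions of this sketch.

```python
from urllib.parse import quote, urlencode


def page_texts_url(base_url, source_id, page=None):
    """Build the GET /api/sources/{id}/page-texts URL.

    page, when given, must be an integer >= 1; the same rule the API is
    documented to enforce with a 400 response.
    """
    url = "{}/api/sources/{}/page-texts".format(
        base_url.rstrip("/"), quote(str(source_id), safe=""))
    if page is None:
        return url
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(page, bool) or not isinstance(page, int) or page < 1:
        raise ValueError("page must be an integer >= 1")
    return url + "?" + urlencode({"page": page})
```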
PATCH /api/sources/{id}
Update a source’s display name or metadata.
id: Source ID
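One way to avoid the documented 400 for an empty update is to build the request body through a helper that refuses to produce one. The field names `name` and `metadata` are borrowed from the upload endpoint and are an assumption here, as is sending metadata as a JSON object in a JSON body.

```python
import json


def build_patch_body(name=None, metadata=None):
    """Return a JSON-encoded body for PATCH /api/sources/{id}.

    Assumptions: the updatable fields are called "name" and "metadata",
    and the endpoint accepts a JSON body.
    """
    fields = {}
    if name is not None:
        fields["name"] = name
    if metadata is not None:
        fields["metadata"] = metadata
    if not fields:
        # Mirrors the documented 400: "no body or no valid fields to update".
        raise ValueError("nothing to update: pass name and/or metadata")
    return json.dumps(fields).encode("utf-8")
```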
POST /api/sources/{id}/reextract
Re-run extraction on an existing source, optionally with a different extraction_model.
id: Source ID
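A re-extraction request body could be assembled as sketched below. The model names are the ones this reference lists for uploads; a given project may have a different set configured, so the tuple should be treated as illustrative. Sending the value in a JSON body is an assumption.

```python
import json

# extraction_model values listed for uploads in this reference; the set
# actually configured for a project may differ.
EXTRACTION_MODELS = (
    "auto", "mistral", "paddleocr", "lighton",
    "opendataloader", "fitz", "pdfplumber",
)


def build_reextract_body(extraction_model=None):
    """Return a JSON-encoded body for POST /api/sources/{id}/reextract.

    With no model given, an empty JSON object is sent (assumption: the
    service then reuses its default or previous extraction_model).
    """
    if extraction_model is None:
        return b"{}"
    if extraction_model not in EXTRACTION_MODELS:
        raise ValueError(f"unknown extraction_model: {extraction_model!r}")
    return json.dumps({"extraction_model": extraction_model}).encode("utf-8")
```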
POST /api/sources/{id}/cancel
Cancel an in-progress extraction. Sets extraction_status to ‘cancelled’.
id: Source ID
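Because the error table documents a 409 when the source is not pending or extracting, a client may want to check the status before cancelling. The sketch below assumes two hypothetical callables, `get_source` (returns the parsed source details as a dict) and `post_cancel` (issues the cancel request), and assumes `extraction_status` is a top-level key.

```python
# Statuses from which the API documents cancellation as allowed.
CANCELLABLE_STATUSES = {"pending", "extracting"}


def cancel_if_running(get_source, post_cancel, source_id):
    """Send a cancel request only when the source looks cancellable.

    Returns True if the cancel request was sent, False otherwise.
    """
    source = get_source(source_id)
    if source.get("extraction_status") not in CANCELLABLE_STATUSES:
        return False
    post_cancel(source_id)
    return True
```

The status can still change between the check and the request, so a 409 response should be handled even with this guard in place.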
GET /api/sources/{id}/download
Download the original uploaded file (as stored in project storage).
id: Source ID
GET /api/sources/{id}/derivatives/{type}/download
Download a derivative artifact. type is one of: markdown, text, page_text, image. For per-page types (page_text, image), pass index=N (0-based) in the query string.
id: Source ID
type: Derivative type, one of markdown, text, page_text, or image
index: 0-based index for per-page derivatives (page_text, image)
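The type and index rules above can be captured in a URL builder such as the one below. The derivative types and the 0-based `index` query parameter come from the description; requiring `index` for per-page types and rejecting it for whole-document types are design choices of this sketch, and the helper name and base URL are hypothetical.

```python
from urllib.parse import quote, urlencode

DERIVATIVE_TYPES = ("markdown", "text", "page_text", "image")
PER_PAGE_TYPES = ("page_text", "image")


def derivative_download_url(base_url, source_id, derivative_type, index=None):
    """Build the derivative download URL for a source."""
    if derivative_type not in DERIVATIVE_TYPES:
        raise ValueError(f"unknown derivative type: {derivative_type!r}")
    url = "{}/api/sources/{}/derivatives/{}/download".format(
        base_url.rstrip("/"), quote(str(source_id), safe=""), derivative_type)
    if derivative_type in PER_PAGE_TYPES:
        # Assumption: per-page derivatives always need an explicit index.
        if isinstance(index, bool) or not isinstance(index, int) or index < 0:
            raise ValueError("per-page derivatives need a 0-based integer index")
        return url + "?" + urlencode({"index": index})
    if index is not None:
        raise ValueError(f"index does not apply to {derivative_type!r}")
    return url
```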
DELETE /api/sources/{id}
Delete a source and its associated storage files (original + derivatives).
id: Source ID
Error Responses
On failure, source routes return a JSON body of the form {"error": "<message>"}.
| Status | Description |
|---|---|
| 400 | Upload missing the file form field, no filename, or invalid metadata JSON |
| 400 | Upload or /import-from-storage: unsupported file extension (allowed: .pdf, .txt, .md, .docx, .xlsx, .xls, .pptx, .png, .jpg, .jpeg, .webp, .gif, .tiff) |
| 400 | Upload, /import-from-storage, or /reextract: invalid extraction_model (must be one of the configured PDF extraction methods) |
| 400 | /import-from-storage: missing bucket or path |
| 400 | /import-url: missing/invalid mode (urls/crawl/sitemap), missing or empty URL list, invalid sitemap URL, or no URLs found in sitemap |
| 400 | PATCH: no body or no valid fields to update |
| 400 | /page-texts: invalid page query (must be an integer ≥ 1); /derivatives/{type}/download: invalid index query |
| 404 | No source exists with the given ID; referenced file not found in storage (/import-from-storage); requested page or derivative does not exist; no file/derivative available for download |
| 409 | /cancel: extraction is not in a cancellable state (must be pending or extracting) |
| 500 | Upload, import, extraction, page download, or derivative download failed — body contains the underlying error message |
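Client code can turn these responses into a single exception type. The sketch below relies only on the documented {"error": "<message>"} shape and falls back to the raw body when the response is not in that shape; the class and function names are hypothetical.

```python
import json


class SourceApiError(Exception):
    """Error returned by a source route, with its HTTP status code."""

    def __init__(self, status, message):
        super().__init__(f"{status}: {message}")
        self.status = status
        self.message = message


def parse_error(status, body):
    """Build a SourceApiError from a status code and raw response bytes."""
    try:
        payload = json.loads(body.decode("utf-8"))
        message = payload["error"]
    except (ValueError, KeyError, TypeError):
        # Not JSON, or JSON without an "error" key: keep the raw text so
        # the underlying problem is not lost.
        message = body.decode("utf-8", errors="replace") or "(empty body)"
    return SourceApiError(status, str(message))
```

A caller could then branch on `err.status`, for example treating 409 from /cancel as "already finished" rather than as a failure.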