Sources represent uploaded documents in the platform. Each source goes through an asynchronous extraction pipeline that converts files into structured derivatives (page texts, markdown, per-page images). Sources are the raw material for knowledge bases — once extracted, their content can be chunked and indexed for semantic search.

Common Patterns

The typical flow is: upload a file (POST /api/sources/upload), poll GET /api/sources/{id} until extraction_status is ‘extracted’ or ‘attention_required’, then retrieve the extracted text (GET /api/sources/{id}/page-texts). For files already in project storage, use POST /api/sources/import-from-storage. For web pages, use POST /api/sources/import-url. To switch extraction backends after the fact, POST /api/sources/{id}/reextract with a new extraction_model.
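
The polling step of this flow can be sketched as a small helper. This is a minimal sketch, not the platform's client: `wait_for_extraction` and `fetch_source` are hypothetical names, the response is assumed to be a JSON object with a top-level `extraction_status` field, and treating `failed` and `cancelled` as terminal states is an assumption based on the status list for GET /api/sources.

```python
import time

# 'extracted' and 'attention_required' are the completion states named in the flow.
DONE_STATES = {"extracted", "attention_required"}
# Assumption: these two statuses never progress further, so polling should stop.
DEAD_STATES = {"failed", "cancelled"}


def wait_for_extraction(fetch_source, source_id, interval=2.0, timeout=300.0,
                        sleep=time.sleep):
    """Poll fetch_source(source_id) until extraction_status is terminal.

    fetch_source is a hypothetical callable that returns the parsed JSON body
    of GET /api/sources/{id} as a dict.
    """
    deadline = time.monotonic() + timeout
    while True:
        source = fetch_source(source_id)
        status = source.get("extraction_status")
        if status in DONE_STATES:
            return source
        if status in DEAD_STATES:
            raise RuntimeError(f"extraction ended with status {status!r}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"source {source_id} still {status!r} after {timeout}s")
        sleep(interval)
```

With `requests`, `fetch_source` could be `lambda sid: requests.get(f"{BASE_URL}/api/sources/{sid}", headers=headers).json()`. Passing the fetcher in as a callable keeps the loop independent of the HTTP library and easy to exercise with a stub.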

GET /api/sources

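List all sources, optionally filtered by extraction status.
status (string, optional): Filter by extraction_status. One of: pending, extracting, extracted, attention_required, failed, cancelled.
import requests
# BASE_URL and headers (apikey plus Authorization: Bearer) are assumed to be defined.
response = requests.get(f"{BASE_URL}/api/sources", headers=headers)
print(response.json())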
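
To apply the status filter, the value goes in the query string. The helper below is a small sketch that checks the value against the six documented statuses before a request is sent; `list_sources_params` is a hypothetical name, not part of the API.

```python
# The six extraction_status values listed for the status filter.
VALID_STATUSES = (
    "pending", "extracting", "extracted",
    "attention_required", "failed", "cancelled",
)


def list_sources_params(status=None):
    """Build the query parameters for GET /api/sources."""
    if status is None:
        return {}  # no filter: list every source
    if status not in VALID_STATUSES:
        raise ValueError(f"status must be one of {VALID_STATUSES}, got {status!r}")
    return {"status": status}
```

The result can be passed straight to `requests.get(f"{BASE_URL}/api/sources", headers=headers, params=list_sources_params("extracted"))`.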

POST /api/sources/upload

Upload a file for extraction. Accepts PDF, DOCX, PPTX, XLSX/XLS, images (PNG/JPG/WebP/GIF/TIFF), and plain text or Markdown. Uses multipart/form-data. Optional fields: name (display name), metadata (JSON string, preserved through indexing), extraction_model (PDF only; one of auto, mistral, paddleocr, lighton, opendataloader, fitz, pdfplumber).
with open("file.pdf", "rb") as f:
    response = requests.post(
        f"{BASE_URL}/api/sources/upload",
        headers={"apikey": API_KEY, "Authorization": f"Bearer {API_KEY}"},
        files={"file": ("file.pdf", f, "application/pdf")},
        data={"extraction_model": "mistral"},
    )
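
Because metadata travels as a JSON string inside the multipart form, it has to be serialized before upload. The sketch below builds the optional form fields; `build_upload_fields` is a hypothetical helper, and the model list is copied from the description above (the server's configured set may differ).

```python
import json

# Extraction models listed for PDF uploads; the configured set may differ.
PDF_EXTRACTION_MODELS = (
    "auto", "mistral", "paddleocr", "lighton",
    "opendataloader", "fitz", "pdfplumber",
)


def build_upload_fields(filename, name=None, metadata=None, extraction_model=None):
    """Return the optional form fields for POST /api/sources/upload."""
    fields = {}
    if name is not None:
        fields["name"] = name
    if metadata is not None:
        # metadata must be sent as a JSON string, not as a nested form value.
        fields["metadata"] = json.dumps(metadata)
    if extraction_model is not None:
        if not filename.lower().endswith(".pdf"):
            raise ValueError("extraction_model applies to PDF uploads only")
        if extraction_model not in PDF_EXTRACTION_MODELS:
            raise ValueError(f"unknown extraction_model: {extraction_model!r}")
        fields["extraction_model"] = extraction_model
    return fields
```

The returned dict is meant for the `data=` argument of `requests.post`, alongside the `files=` argument that carries the file itself.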

POST /api/sources/import-from-storage

Import a file already in project storage as a source.
{
  "bucket": "documents",
  "path": "reports/q4.pdf",
  "name": "Q4 Report"
}
response = requests.post(
    f"{BASE_URL}/api/sources/import-from-storage",
    headers=headers,
    json={"bucket": "documents", "path": "reports/q4.pdf"},
)

POST /api/sources/import-url

Import content from web URLs. mode=‘urls’ imports a fixed list, mode=‘crawl’ spiders from a seed URL, and mode=‘sitemap’ parses a sitemap XML. Requires a Firecrawl API key to be configured in project settings.
{
  "mode": "urls",
  "urls": ["https://example.com/page1", "https://example.com/page2"],
  "max_pages": 50
}
response = requests.post(
    f"{BASE_URL}/api/sources/import-url",
    headers=headers,
    json={"mode": "urls", "urls": ["https://example.com/page1"]},
)
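
The request body can be validated client-side before sending, which avoids the 400 responses for a bad mode or an empty URL list. This is a rough sketch: `build_import_url_payload` is a hypothetical helper, and it assumes that crawl and sitemap modes pass their seed or sitemap URL in the same `urls` field as the `urls` mode, which this section does not confirm.

```python
from urllib.parse import urlparse

VALID_MODES = ("urls", "crawl", "sitemap")


def build_import_url_payload(mode, urls, max_pages=None):
    """Build a JSON body for POST /api/sources/import-url.

    Assumption: every mode carries its URL(s) in the "urls" list.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"mode must be one of {VALID_MODES}, got {mode!r}")
    urls = [u for u in urls if u]  # drop empty entries
    if not urls:
        raise ValueError("at least one URL is required")
    for u in urls:
        parsed = urlparse(u)
        if parsed.scheme not in ("http", "https") or not parsed.netloc:
            raise ValueError(f"not an absolute http(s) URL: {u!r}")
    payload = {"mode": mode, "urls": urls}
    if max_pages is not None:
        payload["max_pages"] = int(max_pages)
    return payload
```

The payload is meant for the `json=` argument of `requests.post`.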

GET /api/sources/{id}

Get source details including extraction status.
id (string, required): Source ID
response = requests.get(f"{BASE_URL}/api/sources/{source_id}", headers=headers)

GET /api/sources/{id}/page-texts

Get extracted text content organized by page.
id (string, required): Source ID
page (integer, optional): Specific page number; must be an integer ≥ 1
response = requests.get(f"{BASE_URL}/api/sources/{source_id}/page-texts", headers=headers)

PATCH /api/sources/{id}

Update a source’s display name or metadata.
id (string, required): Source ID
{
  "name": "New Display Name",
  "metadata": { "author": "alice" }
}
response = requests.patch(f"{BASE_URL}/api/sources/{source_id}", headers=headers, json={"name": "New Display Name"})

POST /api/sources/{id}/reextract

Re-run extraction on an existing source, optionally with a different extraction_model.
id (string, required): Source ID
{
  "extraction_model": "paddleocr"
}
response = requests.post(f"{BASE_URL}/api/sources/{source_id}/reextract", headers=headers, json={"extraction_model": "paddleocr"})

POST /api/sources/{id}/cancel

Cancel an in-progress extraction. Sets extraction_status to ‘cancelled’.
id (string, required): Source ID
response = requests.post(f"{BASE_URL}/api/sources/{source_id}/cancel", headers=headers)

GET /api/sources/{id}/download

Download the original uploaded file (as stored in project storage).
id (string, required): Source ID
response = requests.get(f"{BASE_URL}/api/sources/{source_id}/download", headers=headers)
with open("source.pdf", "wb") as f:  # close the file handle deterministically
    f.write(response.content)

GET /api/sources/{id}/derivatives/{type}/download

Download a derivative artifact. type is one of: markdown, text, page_text, image. For per-page types (page_text, image) pass index=N (0-based) in the query string.
id (string, required): Source ID
type (string, required): Derivative type: markdown, text, page_text, or image
index (integer, optional): 0-based index for per-page derivatives (page_text, image)
response = requests.get(f"{BASE_URL}/api/sources/{source_id}/derivatives/markdown/download", headers=headers)
print(response.text)
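
The path and query string for a derivative download can be assembled as follows. This is a sketch only: `derivative_request` is a hypothetical helper, and it assumes that `index` is mandatory for per-page types and can simply be omitted for whole-document types.

```python
PER_PAGE_TYPES = ("page_text", "image")
ALL_TYPES = ("markdown", "text") + PER_PAGE_TYPES


def derivative_request(source_id, dtype, index=None):
    """Return (path, params) for a derivative download request."""
    if dtype not in ALL_TYPES:
        raise ValueError(f"type must be one of {ALL_TYPES}, got {dtype!r}")
    params = {}
    if dtype in PER_PAGE_TYPES:
        # Assumption: per-page derivatives always need an explicit 0-based index.
        valid = isinstance(index, int) and not isinstance(index, bool) and index >= 0
        if not valid:
            raise ValueError("per-page derivatives need a 0-based integer index")
        params["index"] = index
    path = f"/api/sources/{source_id}/derivatives/{dtype}/download"
    return path, params
```

For example, `path, params = derivative_request(source_id, "image", index=0)` followed by `requests.get(BASE_URL + path, headers=headers, params=params)` would request the first page image. Note that `index` here is 0-based, while the `page` query on /page-texts starts at 1.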

DELETE /api/sources/{id}

Delete a source and its associated storage files (original + derivatives).
id (string, required): Source ID
response = requests.delete(f"{BASE_URL}/api/sources/{source_id}", headers=headers)

Error Responses

On failure, source routes return a JSON body of the form {"error": "<message>"}.
Status  Description
400     Upload missing the file form field, no filename, or invalid metadata JSON
400     Upload or /import-from-storage: unsupported file extension (allowed: .pdf, .txt, .md, .docx, .xlsx, .xls, .pptx, .png, .jpg, .jpeg, .webp, .gif, .tiff)
400     Upload, /import-from-storage, or /reextract: invalid extraction_model (must be one of the configured PDF extraction methods)
400     /import-from-storage: missing bucket or path
400     /import-url: missing/invalid mode (urls/crawl/sitemap), missing or empty URL list, invalid sitemap URL, or no URLs found in sitemap
400     PATCH: no body or no valid fields to update
400     /page-texts: invalid page query (must be an integer ≥ 1); /derivatives/{type}/download: invalid index query
404     No source exists with the given ID; referenced file not found in storage (/import-from-storage); requested page or derivative does not exist; no file/derivative available for download
409     /cancel: extraction is not in a cancellable state (must be pending or extracting)
500     Upload, import, extraction, page download, or derivative download failed; the body contains the underlying error message
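
Since every error carries its message under the `error` key, a client can turn failed responses into exceptions in one place. The sketch below is illustrative: `SourceApiError` and `raise_for_source_error` are hypothetical names, and the only things assumed about the response object are a `status_code` attribute and a `json()` method, as on `requests.Response`.

```python
class SourceApiError(Exception):
    """Raised for 4xx/5xx responses from the source routes."""

    def __init__(self, status_code, message):
        super().__init__(f"{status_code}: {message}")
        self.status_code = status_code
        self.message = message


def raise_for_source_error(response):
    """Return the response unchanged on success, raise SourceApiError otherwise."""
    if response.status_code < 400:
        return response
    try:
        message = response.json().get("error", "")
    except (ValueError, AttributeError):
        # Body was not JSON, or not a JSON object; fall back to the raw text.
        message = getattr(response, "text", "")
    raise SourceApiError(response.status_code, message)
```

A caller might catch `SourceApiError` and branch on `status_code`, for example treating 409 from /cancel as "already finished" while surfacing 500 to the user.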