Sources

Sources represent uploaded documents in the platform. Each source goes through an asynchronous extraction pipeline that converts files into structured derivatives (page texts, markdown, per-page images). Sources are the raw material for knowledge bases — once extracted, their content can be chunked and indexed for semantic search.

Common Patterns

The typical flow is to upload a file (POST /api/sources/upload), poll GET /api/sources/{id} until extraction_status is 'extracted' or 'attention_required', then retrieve the extracted text (GET /api/sources/{id}/page-texts). For files already in project storage, use POST /api/sources/import-from-storage instead of uploading. For web pages, use POST /api/sources/import-url. To switch extraction backends after the fact, POST /api/sources/{id}/reextract with a new extraction_model.
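The polling step in this flow can be sketched as a small helper. This is a minimal sketch, not part of the platform: `wait_for_extraction`, its parameters, and the choice to also stop on `failed` and `cancelled` are assumptions made for illustration. The status lookup is passed in as a callable so the loop stays independent of any HTTP client.

```python
import time
from typing import Callable

# Statuses after which polling stops. 'extracted' and 'attention_required'
# come from the flow described above; treating 'failed' and 'cancelled'
# as terminal as well is an assumption of this sketch.
TERMINAL_STATUSES = {"extracted", "attention_required", "failed", "cancelled"}


def wait_for_extraction(
    fetch_status: Callable[[], str],
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Call fetch_status() until it returns a terminal status or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"extraction still '{status}' after {timeout}s")
        sleep(interval)
```

With requests, `fetch_status` could be `lambda: requests.get(f"{BASE_URL}/api/sources/{source_id}", headers=headers).json()["extraction_status"]`, assuming the source details response exposes `extraction_status` as a top-level field.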

GET /api/sources

List all sources, optionally filtered by extraction status.

status (string): Filter by extraction_status. One of: pending, extracting, extracted, attention_required, failed, cancelled.

response = requests.get(f"{BASE_URL}/api/sources", headers=headers)
print(response.json())
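To apply the status filter, the value goes in the query string. The helper below is a hypothetical convenience (the name `list_sources_params` is not part of the API) that checks the value against the documented list before a request is sent.

```python
from typing import Optional

# Values documented for the `status` query parameter.
EXTRACTION_STATUSES = (
    "pending",
    "extracting",
    "extracted",
    "attention_required",
    "failed",
    "cancelled",
)


def list_sources_params(status: Optional[str] = None) -> dict:
    """Return query parameters for GET /api/sources, validating `status` client-side."""
    if status is None:
        return {}
    if status not in EXTRACTION_STATUSES:
        raise ValueError(f"unknown extraction status: {status!r}")
    return {"status": status}
```

The result can be passed to requests as `requests.get(f"{BASE_URL}/api/sources", headers=headers, params=list_sources_params("failed"))`.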

POST /api/sources/upload

Upload a file for extraction. Accepts PDF, DOCX, PPTX, XLSX, images (PNG/JPG/WebP/GIF/TIFF), and plain text. Uses multipart/form-data. Optional fields: name (display name), metadata (JSON string, preserved through indexing), extraction_model (PDF only — one of auto, mistral, paddleocr, lighton, opendataloader, fitz, pdfplumber).

with open("file.pdf", "rb") as f:
    response = requests.post(
        f"{BASE_URL}/api/sources/upload",
        headers={"apikey": API_KEY, "Authorization": f"Bearer {API_KEY}"},
        files={"file": ("file.pdf", f, "application/pdf")},
        data={"extraction_model": "mistral"},
    )
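Because metadata must be sent as a JSON string inside the multipart form, it helps to build the optional fields in one place. The following is a sketch under that reading of the description above; `build_upload_fields` is a hypothetical helper, and the client-side check on extraction_model is a convenience rather than documented server behavior.

```python
import json
from typing import Optional

# Extraction backends listed for PDF uploads.
PDF_EXTRACTION_MODELS = {
    "auto", "mistral", "paddleocr", "lighton",
    "opendataloader", "fitz", "pdfplumber",
}


def build_upload_fields(
    name: Optional[str] = None,
    metadata: Optional[dict] = None,
    extraction_model: Optional[str] = None,
) -> dict:
    """Build the optional form fields for POST /api/sources/upload."""
    fields = {}
    if name is not None:
        fields["name"] = name
    if metadata is not None:
        # The metadata field is a JSON string, not a nested form value.
        fields["metadata"] = json.dumps(metadata)
    if extraction_model is not None:
        if extraction_model not in PDF_EXTRACTION_MODELS:
            raise ValueError(f"unknown extraction_model: {extraction_model!r}")
        fields["extraction_model"] = extraction_model
    return fields
```

The returned dict can be passed as the `data=` argument of `requests.post` alongside `files=`, as in the upload example above.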

POST /api/sources/import-from-storage

Import a file already in project storage as a source.

{
  "bucket": "documents",
  "path": "reports/q4.pdf",
  "name": "Q4 Report"
}

response = requests.post(
    f"{BASE_URL}/api/sources/import-from-storage",
    headers=headers,
    json={"bucket": "documents", "path": "reports/q4.pdf"},
)

POST /api/sources/import-url

Import content from web URLs. mode='urls' imports a fixed list of URLs, mode='crawl' spiders outward from a seed URL, and mode='sitemap' parses a sitemap XML file. Requires a Firecrawl API key configured in project settings.

{
  "mode": "urls",
  "urls": ["https://example.com/page1", "https://example.com/page2"],
  "max_pages": 50
}

response = requests.post(
    f"{BASE_URL}/api/sources/import-url",
    headers=headers,
    json={"mode": "urls", "urls": ["https://example.com/page1"]},
)
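For mode='urls', the request body can be assembled and sanity-checked before sending. This sketch covers only that mode, since the field names for the crawl and sitemap modes are not shown here. `build_url_import` is a hypothetical helper, and the de-duplication and scheme check are client-side conveniences, not API requirements.

```python
from typing import Iterable, Optional


def build_url_import(urls: Iterable[str], max_pages: Optional[int] = None) -> dict:
    """Build a mode='urls' body for POST /api/sources/import-url."""
    # Drop duplicate URLs while preserving their original order.
    unique = list(dict.fromkeys(urls))
    if not unique:
        raise ValueError("at least one URL is required")
    for url in unique:
        if not url.startswith(("http://", "https://")):
            raise ValueError(f"not an http(s) URL: {url!r}")
    payload = {"mode": "urls", "urls": unique}
    if max_pages is not None:
        if max_pages < 1:
            raise ValueError("max_pages must be a positive integer")
        payload["max_pages"] = max_pages
    return payload
```

The payload can then be sent with `requests.post(f"{BASE_URL}/api/sources/import-url", headers=headers, json=build_url_import([...]))`.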

GET /api/sources/{id}

Get source details including extraction status.

id (string, required): Source ID

response = requests.get(f"{BASE_URL}/api/sources/{source_id}", headers=headers)

GET /api/sources/{id}/page-texts

Get extracted text content organized by page.

id (string, required): Source ID

page (integer): Specific page number

response = requests.get(f"{BASE_URL}/api/sources/{source_id}/page-texts", headers=headers)

PATCH /api/sources/{id}

Update a source's display name or metadata.

id (string, required): Source ID

{
  "name": "New Display Name",
  "metadata": { "author": "alice" }
}

response = requests.patch(f"{BASE_URL}/api/sources/{source_id}", headers=headers, json={"name": "New Display Name"})

POST /api/sources/{id}/reextract

Re-run extraction on an existing source, optionally with a different extraction_model.

id (string, required): Source ID

{
  "extraction_model": "paddleocr"
}

response = requests.post(f"{BASE_URL}/api/sources/{source_id}/reextract", headers=headers, json={"extraction_model": "paddleocr"})

POST /api/sources/{id}/cancel

Cancel an in-progress extraction. Sets extraction_status to 'cancelled'.

id (string, required): Source ID

response = requests.post(f"{BASE_URL}/api/sources/{source_id}/cancel", headers=headers)

GET /api/sources/{id}/download

Download the original uploaded file (as stored in project storage).

id (string, required): Source ID

response = requests.get(f"{BASE_URL}/api/sources/{source_id}/download", headers=headers)
with open("source.pdf", "wb") as f:
    f.write(response.content)

GET /api/sources/{id}/derivatives/{type}/download

Download a derivative artifact. type is one of: markdown, text, page_text, image. For per-page types (page_text, image), pass index=N (0-based) in the query string.

id (string, required): Source ID

type (string, required): Derivative type: markdown, text, page_text, or image

index (integer): 0-based index for per-page derivatives (page_text, image)

response = requests.get(f"{BASE_URL}/api/sources/{source_id}/derivatives/markdown/download", headers=headers)
print(response.text)
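The example above fetches a whole-document derivative. For per-page types, the index goes in the query string. The helper below is a minimal sketch of how the URL and query parameters might be assembled; `derivative_request` is a hypothetical name, and requiring index for per-page types is an assumption of this sketch rather than confirmed server behavior.

```python
from typing import Optional, Tuple

# Derivative types that are stored once per page and addressed by `index`.
PER_PAGE_TYPES = {"page_text", "image"}
# Derivative types that cover the whole document.
WHOLE_DOCUMENT_TYPES = {"markdown", "text"}


def derivative_request(
    base_url: str,
    source_id: str,
    derivative_type: str,
    index: Optional[int] = None,
) -> Tuple[str, dict]:
    """Return (url, query_params) for a derivative download."""
    if derivative_type not in PER_PAGE_TYPES | WHOLE_DOCUMENT_TYPES:
        raise ValueError(f"unknown derivative type: {derivative_type!r}")
    params = {}
    if derivative_type in PER_PAGE_TYPES:
        # Assumption: a per-page download always needs a 0-based index.
        if index is None or index < 0:
            raise ValueError("per-page derivatives need a 0-based index")
        params["index"] = index
    url = f"{base_url}/api/sources/{source_id}/derivatives/{derivative_type}/download"
    return url, params
```

For instance, `url, params = derivative_request(BASE_URL, source_id, "page_text", index=0)` followed by `requests.get(url, headers=headers, params=params)` would request the text of the first page.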

DELETE /api/sources/{id}

Delete a source and its associated storage files (original + derivatives).

id (string, required): Source ID

response = requests.delete(f"{BASE_URL}/api/sources/{source_id}", headers=headers)

Error Responses

Status	Code	Description
400	`invalid_file`	The uploaded file type is not supported or the file is corrupted
404	`source_not_found`	No source exists with the given ID
413	`file_too_large`	The uploaded file exceeds the maximum allowed size
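One way to surface these errors in client code is to map the HTTP status to the documented code and raise a typed exception. This is only a sketch: `SourcesAPIError` and `raise_for_source_error` are hypothetical names, the `unknown_error` label is invented for statuses not in the table, and the mapping relies on the status code alone because the shape of the error response body is not described here.

```python
# HTTP status -> (code, description), copied from the table above.
DOCUMENTED_ERRORS = {
    400: ("invalid_file", "The uploaded file type is not supported or the file is corrupted"),
    404: ("source_not_found", "No source exists with the given ID"),
    413: ("file_too_large", "The uploaded file exceeds the maximum allowed size"),
}


class SourcesAPIError(Exception):
    """Hypothetical exception carrying the documented error code."""

    def __init__(self, status_code: int, code: str, description: str):
        super().__init__(f"{status_code} {code}: {description}")
        self.status_code = status_code
        self.code = code
        self.description = description


def raise_for_source_error(status_code: int) -> None:
    """Raise SourcesAPIError for any 4xx/5xx status; do nothing otherwise."""
    if status_code < 400:
        return
    # 'unknown_error' is a placeholder for statuses the table does not list.
    code, description = DOCUMENTED_ERRORS.get(
        status_code, ("unknown_error", "Undocumented error status")
    )
    raise SourcesAPIError(status_code, code, description)
```

Calling `raise_for_source_error(response.status_code)` right after a request gives callers a single exception type to catch, with the documented code available as `exc.code`.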
