Documentation Index
Fetch the complete documentation index at: https://docs.powabase.ai/llms.txt
Use this file to discover all available pages before exploring further.
What is a Source?
A Source represents a single uploaded document. When you upload a file, the platform creates a Source record, stores the original file, and kicks off an asynchronous extraction pipeline. The pipeline extracts text content from the document — handling PDFs, Word documents, images (via OCR), and other file types. The result is clean, structured text organized by page.
Supported File Types
| Type | Extensions | Extraction Method |
|---|---|---|
| PDF | .pdf | Multiple extractors — see ‘Extraction Models’ below |
| Word | .docx, .doc | Structured text extraction preserving headings and formatting |
| Images | .png, .jpg, .jpeg, .webp, .gif, .tiff | OCR (Optical Character Recognition) |
| Text | .txt, .md, .csv | Direct text reading |
| PowerPoint | .pptx | Slide-by-slide text extraction (REST API only) |
| Excel | .xlsx | Sheet-by-sheet cell content extraction (REST API only) |
| URLs | http(s):// | Fetched via URL import — single URLs, crawl, or sitemap |
Extraction Models (PDF)
For PDFs you can choose how the text is extracted by passing extraction_model at upload time or via POST /sources/{id}/reextract. If you don’t pass one, the pipeline uses auto, which tries mistral → opendataloader → fitz → pdfplumber in order until one succeeds. paddleocr and lighton are not part of the auto chain — request them explicitly.
| Model | What it does | Requires |
|---|---|---|
| auto | Default fallback chain (mistral → opendataloader → fitz → pdfplumber) | — |
| mistral | Mistral OCR — scanned PDFs, image-heavy documents | MISTRAL_API_KEY |
| paddleocr | PaddleOCR-VL API — strong non-English support and layout detection | PADDLEOCR_API_KEY, PADDLEOCR_BASE_URL |
| lighton | LightOn OCR API | LIGHTON_API_KEY, LIGHTON_BASE_URL |
| opendataloader | Local high-accuracy structural extraction (tables, headings, layout) | None (local) |
| fitz | PyMuPDF — fast, text-based extraction | None (local) |
| pdfplumber | Reliable fallback for complex tables and unusual layouts | None (local) |
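To switch models after upload, use the documented POST /sources/{id}/reextract endpoint. A minimal sketch of building that request, assuming a JSON body with an extraction_model field and a hypothetical base URL (both to be confirmed against the Sources API Reference):

```python
# The /sources/{id}/reextract path comes from this page; BASE_URL and the
# exact JSON body shape are assumptions to verify in the API reference.
BASE_URL = "https://api.example.com"

def build_reextract_request(source_id: str, extraction_model: str):
    """Build the URL and payload for POST /sources/{id}/reextract."""
    url = f"{BASE_URL}/sources/{source_id}/reextract"
    # paddleocr and lighton are never chosen by 'auto', so they must be
    # named explicitly like this.
    payload = {"extraction_model": extraction_model}
    return url, payload

url, payload = build_reextract_request("src_123", "paddleocr")
print(url, payload)
```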
Extraction Pipeline
When you upload a file, it goes through several stages: the file is stored in project storage, a Celery worker picks up the extraction task, the appropriate strategy is selected based on file type, and the extracted content is stored as derivatives (page texts, markdown, per-page images) associated with the source. This process runs asynchronously — poll the source status to check progress.
Status Lifecycle
Every source goes through an extraction_status lifecycle. After upload the source is pending; a worker picks it up and it moves to extracting; on success it becomes extracted. If some pages fail but others succeed the status is attention_required (partial success — the source is still indexable). Terminal states are extracted, failed, attention_required, and cancelled.
| Status | Meaning |
|---|---|
| pending | Uploaded but not yet picked up by a worker |
| extracting | Currently being extracted |
| extracted | Extraction finished — derivatives available, source is indexable |
| attention_required | Partial success — some pages failed. error_message explains. Still indexable. |
| failed | Extraction failed — check error_message for details |
| cancelled | User cancelled via POST /sources/{id}/cancel |
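Because extraction runs asynchronously, clients typically poll until one of the four terminal statuses above is reached. A sketch of such a loop, assuming you have some way to read a source's current extraction_status (for example, a wrapper around a GET /sources/{id} call, which is not documented on this page):

```python
import time

# Terminal extraction_status values, taken from the lifecycle table above.
TERMINAL_STATUSES = {"extracted", "failed", "attention_required", "cancelled"}

def wait_for_extraction(fetch_status, interval: float = 2.0, max_checks: int = 150):
    """Poll fetch_status() until the source reaches a terminal status.

    fetch_status is any callable returning the current extraction_status
    string, e.g. a wrapper around a hypothetical GET /sources/{id} call.
    """
    for _ in range(max_checks):
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        time.sleep(interval)
    raise TimeoutError("source did not reach a terminal status in time")
```

Note that attention_required counts as done here: the table above marks it a terminal, partial-success state in which the source is still indexable.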
Storage Integration
You can also import files that are already in your project’s storage buckets using the import-from-storage endpoint. This avoids re-uploading files and is useful when you have an existing storage workflow. The extraction process is the same regardless of whether the file was uploaded directly or imported from storage.
Next Steps
Upload a Document
Step-by-step guide to uploading and extracting.
Knowledge Bases & Indexing
Turn extracted text into searchable vectors.
Sources API Reference
Full endpoint documentation.