Documentation Index

Fetch the complete documentation index at: https://docs.powabase.ai/llms.txt

Use this file to discover all available pages before exploring further.

What is a Source?

A Source represents a single uploaded document. When you upload a file, the platform creates a Source record, stores the original file, and kicks off an asynchronous extraction pipeline. The pipeline extracts text content from the document — handling PDFs, Word documents, images (via OCR), and other file types. The result is clean, structured text organized by page.
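The upload-then-extract flow can be sketched as below. The endpoint path (`POST /sources`), the multipart field name, and bearer-token auth are assumptions for illustration, not confirmed by this page; check the Sources API Reference for the real contract.

```python
import json
import mimetypes
import urllib.request
import uuid

API_BASE = "https://api.example.com/v1"  # placeholder; not the real base URL


def encode_multipart(field: str, filename: str, data: bytes) -> tuple[bytes, str]:
    """Build a minimal multipart/form-data body for a single file field."""
    boundary = uuid.uuid4().hex
    ctype = mimetypes.guess_type(filename)[0] or "application/octet-stream"
    body = (
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="{field}"; filename="{filename}"\r\n'
        f"Content-Type: {ctype}\r\n\r\n"
    ).encode() + data + f"\r\n--{boundary}--\r\n".encode()
    return body, f"multipart/form-data; boundary={boundary}"


def upload_source(path: str, token: str) -> dict:
    """POST the file; the platform creates a Source and starts extraction."""
    with open(path, "rb") as f:
        body, content_type = encode_multipart("file", path, f.read())
    req = urllib.request.Request(
        f"{API_BASE}/sources",  # assumed path for illustration
        data=body,
        headers={"Authorization": f"Bearer {token}", "Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)  # typically includes id and extraction_status
```

Because extraction is asynchronous, the response arrives before any text has been extracted; the new Source starts in the `pending` status described below.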

Supported File Types

| Type | Extensions | Extraction Method |
|---|---|---|
| PDF | .pdf | Multiple extractors; see "Extraction Models" below |
| Word | .docx, .doc | Structured text extraction preserving headings and formatting |
| Images | .png, .jpg, .jpeg, .webp, .gif, .tiff | OCR (Optical Character Recognition) |
| Text | .txt, .md, .csv | Direct text reading |
| PowerPoint | .pptx | Slide-by-slide text extraction (REST API only) |
| Excel | .xlsx | Sheet-by-sheet cell content extraction (REST API only) |
| URLs | http(s):// | Fetched via URL import: single URLs, crawl, or sitemap |

Extraction Models (PDF)

For PDFs you can choose how the text is extracted by passing extraction_model at upload time or via POST /sources/{id}/reextract. If you don’t pass one, the pipeline uses auto, which tries mistral → opendataloader → fitz → pdfplumber in order until one succeeds. paddleocr and lighton are not part of the auto chain — request them explicitly.
| Model | What it does | Requires |
|---|---|---|
| auto | Default fallback chain (mistral → opendataloader → fitz → pdfplumber) | |
| mistral | Mistral OCR: scanned PDFs, image-heavy documents | MISTRAL_API_KEY |
| paddleocr | PaddleOCR-VL API: strong non-English support and layout detection | PADDLEOCR_API_KEY, PADDLEOCR_BASE_URL |
| lighton | LightOn OCR API | LIGHTON_API_KEY, LIGHTON_BASE_URL |
| opendataloader | Local high-accuracy structural extraction (tables, headings, layout) | None (local) |
| fitz | PyMuPDF: fast, text-based extraction | None (local) |
| pdfplumber | Reliable fallback for complex tables and unusual layouts | None (local) |
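Requesting a specific extractor might look like the sketch below. The `POST /sources/{id}/reextract` path comes from this page; the JSON field name `extraction_model` is taken from the upload parameter above, while the auth header is an assumption.

```python
import json
import urllib.request

API_BASE = "https://api.example.com/v1"  # placeholder base URL

# Models listed in the table above; paddleocr and lighton must be requested
# explicitly because they are not part of the auto fallback chain.
KNOWN_MODELS = {
    "auto", "mistral", "paddleocr", "lighton",
    "opendataloader", "fitz", "pdfplumber",
}


def build_reextract_request(source_id: str, model: str, token: str) -> urllib.request.Request:
    """Build (but do not send) a re-extract request with an explicit model."""
    if model not in KNOWN_MODELS:
        raise ValueError(f"unknown extraction_model: {model}")
    return urllib.request.Request(
        f"{API_BASE}/sources/{source_id}/reextract",
        data=json.dumps({"extraction_model": model}).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

# To send it: urllib.request.urlopen(build_reextract_request("src_123", "paddleocr", token))
```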

Extraction Pipeline

When you upload a file, it goes through several stages: the file is stored in project storage, a Celery worker picks up the extraction task, the appropriate strategy is selected based on file type, and the extracted content is stored as derivatives (page texts, markdown, per-page images) associated with the source. This process runs asynchronously — poll the source status to check progress.
Extraction pipeline: File Upload → Project Storage → Celery Worker (selects strategy: PDF extractor, DOCX parser, OCR engine) → Derivatives (page texts, markdown, images) → Source record updated to extracted status.
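The strategy-selection step can be illustrated with a toy dispatcher keyed on file extension. The mapping mirrors the file-type table above; the strategy labels are invented for illustration, since the worker's internal names are not documented here.

```python
from pathlib import Path

# Toy dispatcher mirroring the supported-file-types table; strategy names
# are illustrative only, not the platform's internal identifiers.
STRATEGIES = {
    ".pdf": "pdf_extractor",  # auto chain or an explicit extraction_model
    ".docx": "docx_parser", ".doc": "docx_parser",
    ".png": "ocr_engine", ".jpg": "ocr_engine", ".jpeg": "ocr_engine",
    ".webp": "ocr_engine", ".gif": "ocr_engine", ".tiff": "ocr_engine",
    ".txt": "plain_text", ".md": "plain_text", ".csv": "plain_text",
    ".pptx": "pptx_extractor",
    ".xlsx": "xlsx_extractor",
}


def select_strategy(filename: str) -> str:
    """Pick an extraction strategy from the file extension (case-insensitive)."""
    ext = Path(filename).suffix.lower()
    try:
        return STRATEGIES[ext]
    except KeyError:
        raise ValueError(f"unsupported file type: {ext or filename}")
```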

Status Lifecycle

Every source goes through an extraction_status lifecycle. After upload the source is pending; a worker picks it up and it moves to extracting; on success it becomes extracted. If some pages fail but others succeed the status is attention_required (partial success — the source is still indexable). Terminal states are extracted, failed, attention_required, and cancelled.
| Status | Meaning |
|---|---|
| pending | Uploaded but not yet picked up by a worker |
| extracting | Currently being extracted |
| extracted | Extraction finished; derivatives available, source is indexable |
| attention_required | Partial success: some pages failed; error_message explains. Still indexable. |
| failed | Extraction failed; check error_message for details |
| cancelled | User cancelled via POST /sources/{id}/cancel |
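A polling loop over this lifecycle might look like the sketch below. The terminal states come from the table above; the `GET /sources/{id}` path, response shape, and auth header are assumptions for illustration.

```python
import json
import time
import urllib.request

API_BASE = "https://api.example.com/v1"  # placeholder base URL

# Terminal states per the lifecycle table above.
TERMINAL_STATES = {"extracted", "failed", "attention_required", "cancelled"}


def is_terminal(status: str) -> bool:
    """True once extraction can no longer progress further."""
    return status in TERMINAL_STATES


def wait_for_extraction(source_id: str, token: str, interval: float = 2.0) -> str:
    """Poll the source until it reaches a terminal state; return that status."""
    while True:
        req = urllib.request.Request(
            f"{API_BASE}/sources/{source_id}",  # assumed read path
            headers={"Authorization": f"Bearer {token}"},
        )
        with urllib.request.urlopen(req) as resp:
            status = json.load(resp)["extraction_status"]
        if is_terminal(status):
            return status
        time.sleep(interval)
```

Note that `attention_required` is terminal but still usable: the source can be indexed even though some pages failed.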

Storage Integration

You can also import files that are already in your project’s storage buckets using the import-from-storage endpoint. This avoids re-uploading files and is useful when you have an existing storage workflow. The extraction process is the same regardless of whether the file was uploaded directly or imported from storage.
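An import call might be sketched like this. The path is assumed from the endpoint name given above, and the payload field names (bucket, object path) are guesses; consult the Sources API Reference for the actual schema.

```python
import json
import urllib.request

API_BASE = "https://api.example.com/v1"  # placeholder base URL


def build_storage_import_payload(bucket: str, object_path: str) -> dict:
    """Field names are assumed for illustration; check the API reference."""
    return {"bucket": bucket, "path": object_path}


def import_from_storage(bucket: str, object_path: str, token: str) -> dict:
    """Create a Source from a file already in a project storage bucket."""
    req = urllib.request.Request(
        f"{API_BASE}/sources/import-from-storage",  # path assumed from the endpoint name
        data=json.dumps(build_storage_import_payload(bucket, object_path)).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)  # extraction then proceeds as with a direct upload
```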

Next Steps

Upload a Document

Step-by-step guide to uploading and extracting.

Knowledge Bases & Indexing

Turn extracted text into searchable vectors.

Sources API Reference

Full endpoint documentation.