DocumentStructuredExtraction
A document_structured_extraction record is the persisted output of one LLM structured-extraction pass over a parsed document. It is separate from document_extractions, which stores raw OCR/parser text: this table stores the typed, schema-validated JSON produced after classification and field mapping. The cache key is a composite of document_pk, extraction_family, schema_name, schema_version, prompt_version, model_policy, source_checksum, and evidence_checksum. It ensures that retries and backfills reuse existing output without bypassing the stale-source or stale-schema guards. Records are scoped to a workspace and hang off a parent document; soft-deletion invalidates superseded cache entries while preserving the audit trail.
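The cache-key behaviour described above can be illustrated with a minimal sketch. The `ExtractionCacheKey` interface, the `serializeCacheKey` and `isCacheHit` helpers, and the numeric type of `document_pk` are assumptions made for illustration; only the eight component names come from this page, and the real lookup is presumably done by the database index rather than in application code.

```typescript
// Hypothetical shape of the composite cache key; field names follow this page.
// document_pk is assumed to be numeric.
interface ExtractionCacheKey {
  document_pk: number;
  extraction_family: string;
  schema_name: string;
  schema_version: string;
  prompt_version: string;
  model_policy: string;
  source_checksum: string;
  evidence_checksum: string;
}

// Fixed component order, so two keys serialise identically regardless of property order.
const CACHE_KEY_FIELDS: readonly (keyof ExtractionCacheKey)[] = [
  "document_pk",
  "extraction_family",
  "schema_name",
  "schema_version",
  "prompt_version",
  "model_policy",
  "source_checksum",
  "evidence_checksum",
];

// Serialise as a JSON array to avoid ambiguity from delimiter characters inside values.
function serializeCacheKey(key: ExtractionCacheKey): string {
  return JSON.stringify(CACHE_KEY_FIELDS.map((field) => key[field]));
}

// A cached row may be reused only when every component matches.
function isCacheHit(cached: ExtractionCacheKey, requested: ExtractionCacheKey): boolean {
  return serializeCacheKey(cached) === serializeCacheKey(requested);
}

const cachedKey: ExtractionCacheKey = {
  document_pk: 1001,
  extraction_family: "invoice",
  schema_name: "invoice_v2_fr",
  schema_version: "2.4.0",
  prompt_version: "p1.3",
  model_policy: "gpt-5_structured_v1",
  source_checksum: "sha256:a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4",
  evidence_checksum: "sha256:f6e5d4c3b2a1f6e5d4c3b2a1f6e5d4c3",
};

const retryKey: ExtractionCacheKey = { ...cachedKey };
const afterSchemaBump: ExtractionCacheKey = { ...cachedKey, schema_version: "2.5.0" };

console.log(isCacheHit(cachedKey, retryKey)); // true: a retry reuses the cached output
console.log(isCacheHit(cachedKey, afterSchemaBump)); // false: a schema bump forces a new pass
```

Because every component takes part in the comparison, a change to the source text, the evidence slice, the schema, the prompt, or the model policy each produces a miss on its own.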
| Naming | Value |
|---|---|
| Object | DocumentStructuredExtraction |
| Resource type (JSON:API type) | document_structured_extraction |
| Collection / records root | None (not a records root) |
| REST base | /v1/document-structured-extractions |
| Entity class | DocumentStructuredExtraction |
Internal object. Not currently exposed on the public REST API. The operations below describe the intended contract.
API operations
| Operation | Method & path | Status |
|---|---|---|
| List | GET /v1/document-structured-extractions | 🟡 Planned |
| Retrieve | GET /v1/document-structured-extractions/{id} | 🟡 Planned |
| Create | POST /v1/document-structured-extractions | 🟡 Planned |
| Update | PATCH /v1/document-structured-extractions/{id} | 🟡 Planned |
| Delete | DELETE /v1/document-structured-extractions/{id} | 🟡 Planned |
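As a rough illustration of the intended contract in the table above, the following sketch maps each planned operation to its method and path. The `requestTarget` helper and its types are hypothetical; none of these routes is live yet, so this only mirrors the table and is not a working client.

```typescript
type Operation = "list" | "retrieve" | "create" | "update" | "delete";

interface RequestTarget {
  method: "GET" | "POST" | "PATCH" | "DELETE";
  path: string;
}

// REST base taken from the naming table.
const BASE_PATH = "/v1/document-structured-extractions";

// Resolve an operation to its planned method and path. Item operations require an id.
function requestTarget(operation: Operation, id?: string): RequestTarget {
  const needsId = operation === "retrieve" || operation === "update" || operation === "delete";
  if (needsId && !id) {
    throw new Error(`Operation "${operation}" requires an id`);
  }
  const itemPath = `${BASE_PATH}/${encodeURIComponent(id ?? "")}`;
  switch (operation) {
    case "list":
      return { method: "GET", path: BASE_PATH };
    case "create":
      return { method: "POST", path: BASE_PATH };
    case "retrieve":
      return { method: "GET", path: itemPath };
    case "update":
      return { method: "PATCH", path: itemPath };
    case "delete":
      return { method: "DELETE", path: itemPath };
  }
}

const retrieveTarget = requestTarget("retrieve", "d3a1e8f2-5b4c-4e2f-9aab-0123456789ab");
console.log(retrieveTarget.method, retrieveTarget.path);
```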
Data model
Attributes
| Field | Type | Required | Constraints | Allowed values | Description |
|---|---|---|---|---|---|
| document_structured_extraction_id | string (UUID) | ✅ Yes | unique | - | Public UUID identifier, auto-generated by gen_random_uuid(). This is the stable external reference surfaced in the API. The internal pk is never exposed. |
| extraction_family | string | ✅ Yes | max length 32; included in partial-unique-on-deleted_at index | - | High-level category of the extraction pass (e.g. 'invoice', 'receipt', 'contract'). Part of the cache-key composite unique index. |
| schema_name | string | ✅ Yes | max length 96; included in partial-unique-on-deleted_at index | - | Name of the Zod/JSON schema used to validate and shape the extraction output. Part of the cache-key composite unique index. |
| schema_version | string | ✅ Yes | max length 48; included in partial-unique-on-deleted_at index | - | Semver string of the schema. A bump forces a new extraction even if all other cache-key components are unchanged. |
| prompt_version | string | ✅ Yes | max length 48; included in partial-unique-on-deleted_at index | - | Version identifier of the LLM prompt template used. Part of the cache-key composite unique index. |
| model_policy | string | ✅ Yes | max length 64; included in partial-unique-on-deleted_at index | - | Identifies the model-selection policy in effect when the extraction ran (e.g. a model alias or routing policy name). Part of the cache-key composite unique index. |
| source_checksum | string | ✅ Yes | max length 64; included in partial-unique-on-deleted_at index | - | Checksum of the raw source content (OCR text / parsed PDF bytes) fed into the extraction. Stale-source guard: if the document's parsed text changes, this checksum changes and a new extraction is required. |
| evidence_checksum | string | ✅ Yes | max length 64; included in partial-unique-on-deleted_at index | - | Checksum of the structured evidence slice passed to the LLM (may differ from source_checksum when evidence is preprocessed or truncated). Part of the cache-key composite unique index. |
| classification_json | jsonb | ✅ Yes | NOT NULL | - | LLM output from the classification phase: document type, confidence scores, locale, and any routing signals that determined which schema to apply in subsequent passes. |
| core_extraction_json | jsonb | ✅ Yes | NOT NULL | - | LLM output from the core extraction phase: the mandatory, high-confidence fields (header-level invoice fields such as issuer, reference, date, totals). Always present even when detail extraction is skipped. |
| detail_extraction_json | jsonb | ⚪ No | nullable | - | LLM output from the optional detail extraction phase: line items, tax breakdowns, and other structured sub-arrays that require a secondary prompt. NULL when detail extraction was not requested or failed gracefully. |
| final_extraction_json | jsonb | ✅ Yes | NOT NULL | - | Merged, post-processed extraction output combining classification, core, and detail phases. This is the authoritative input used by downstream invoice mapping and persistence services. |
| invoice_mapped_json | jsonb | ⚪ No | nullable | - | The result of mapping final_extraction_json onto Well's internal invoice schema (entity PKs resolved, field names normalised). NULL when the document is not an invoice or mapping has not yet run. |
| selected_provider | string | ⚪ No | max length 32; nullable | - | The AI provider selected at runtime (e.g. 'openai', 'anthropic'). NULL when the model policy does not record per-extraction provider choice. |
| selected_model | string | ⚪ No | max length 128; nullable | - | The specific model identifier resolved from the model_policy at runtime (e.g. 'gpt-5.4'). NULL when not recorded. |
| quality_flags | jsonb | ⚪ No | nullable | - | Post-extraction quality signals: low-confidence fields, OCR warnings, schema-validation failures, and any flags the extraction pipeline chose to surface for downstream review. NULL when no quality issues were detected. |
| created_at | 🔒 system - timestamp with time zone | ✅ Yes | NOT NULL; defaultRaw: NOW() | - | Row creation timestamp, set once by the onCreate hook. Reflects when the extraction pipeline persisted this result. |
| updated_at | 🔒 system - timestamp with time zone | ⚪ No | nullable | - | Row update timestamp, managed by the onUpdate hook. NULL until the first update after creation. |
| deleted_at | 🔒 system - timestamp with time zone | ⚪ No | nullable; excluded from uniq_document_structured_extraction_active when non-NULL | - | Soft-delete timestamp. When set, the row is excluded from the composite partial-unique index (uniq_document_structured_extraction_active), allowing a fresh extraction with the same cache key to be inserted. |
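The length constraints in the table above could be checked before a write with something like the sketch below. The limits are copied from the Constraints column; the `validateLengths` helper and its types are hypothetical and are not part of the entity. Note that JavaScript's `length` counts UTF-16 code units, which can differ from the database's character count for non-BMP text.

```typescript
// Maximum lengths copied from the Constraints column of the attribute table.
const MAX_LENGTHS = {
  extraction_family: 32,
  schema_name: 96,
  schema_version: 48,
  prompt_version: 48,
  model_policy: 64,
  source_checksum: 64,
  evidence_checksum: 64,
  selected_provider: 32,
  selected_model: 128,
} as const;

type BoundedField = keyof typeof MAX_LENGTHS;
type BoundedAttributes = Partial<Record<BoundedField, string | null>>;

// The two fields the table marks as nullable; all other bounded fields are required.
const NULLABLE_FIELDS: ReadonlySet<BoundedField> = new Set<BoundedField>([
  "selected_provider",
  "selected_model",
]);

// Returns a list of human-readable problems; an empty list means the input passes.
function validateLengths(attrs: BoundedAttributes): string[] {
  const errors: string[] = [];
  for (const field of Object.keys(MAX_LENGTHS) as BoundedField[]) {
    const value = attrs[field];
    if (value === null || value === undefined) {
      if (!NULLABLE_FIELDS.has(field)) errors.push(`${field} is required`);
      continue;
    }
    if (value.length > MAX_LENGTHS[field]) {
      errors.push(`${field} exceeds max length ${MAX_LENGTHS[field]}`);
    }
  }
  return errors;
}

const validAttrs: BoundedAttributes = {
  extraction_family: "invoice",
  schema_name: "invoice_v2_fr",
  schema_version: "2.4.0",
  prompt_version: "p1.3",
  model_policy: "gpt-5_structured_v1",
  source_checksum: "sha256:a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4",
  evidence_checksum: "sha256:f6e5d4c3b2a1f6e5d4c3b2a1f6e5d4c3",
  selected_provider: null,
};

console.log(validateLengths(validAttrs)); // []
console.log(validateLengths({ ...validAttrs, extraction_family: "x".repeat(33) }));
```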
Relationships
| Name | Type | Required | Description |
|---|---|---|---|
| document | to-one (ManyToOne) | ✅ Yes | The parent Document this extraction was produced from. The document_pk FK carries ON DELETE CASCADE, so deleting a document hard-deletes all its structured extractions. |
| workspace | to-one (ManyToOne) | ✅ Yes | The tenant workspace that owns this extraction record. Used for all workspace-scoped queries and Hasura RLS enforcement. The workspace_pk FK carries ON DELETE CASCADE. |
System-computed
- document_structured_extraction_id: auto-generated by the gen_random_uuid() database default; never supplied by the caller
- created_at: set to NOW() on INSERT via the MikroORM onCreate hook; never updated
- updated_at: set to NOW() on UPDATE via the MikroORM onUpdate hook; NULL on creation
- deleted_at: set by the extraction pipeline's soft-delete path; when set, the partial-unique index uniq_document_structured_extraction_active no longer covers the row, so a replacement extraction with the same cache key can be inserted
- Cache-key deduplication: the partial unique index (document_pk, extraction_family, schema_name, schema_version, prompt_version, model_policy, source_checksum, evidence_checksum) WHERE deleted_at IS NULL guarantees at most one live structured extraction per distinct extraction context; on invalidation, the pipeline soft-deletes the old row before inserting a fresh one
- Row written exclusively by the LLM extraction pipeline (ExtractPersistenceService / document structured-extraction service); no user-facing PATCH route exists for this entity
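The soft-delete-then-insert behaviour listed above can be modelled in memory as follows. This is only a sketch of the semantics of a partial unique index restricted to rows where deleted_at IS NULL; `InMemoryExtractionTable` and its methods are hypothetical, and the cache key is reduced to a single string for brevity.

```typescript
interface ExtractionRow {
  cacheKey: string; // stand-in for the eight-column composite key
  final_extraction_json: unknown;
  deleted_at: Date | null;
}

class InMemoryExtractionTable {
  private rows: ExtractionRow[] = [];

  // Only live rows (deleted_at === null) are covered by the uniqueness rule.
  findLive(cacheKey: string): ExtractionRow | undefined {
    return this.rows.find((row) => row.cacheKey === cacheKey && row.deleted_at === null);
  }

  // Mimics the unique violation raised when a live row already holds the key.
  insert(cacheKey: string, final_extraction_json: unknown): ExtractionRow {
    if (this.findLive(cacheKey)) {
      throw new Error(`unique violation: live row already exists for ${cacheKey}`);
    }
    const row: ExtractionRow = { cacheKey, final_extraction_json, deleted_at: null };
    this.rows.push(row);
    return row;
  }

  // Invalidation path: soft-delete the live row first, then insert the replacement.
  replace(cacheKey: string, final_extraction_json: unknown, now: Date = new Date()): ExtractionRow {
    const live = this.findLive(cacheKey);
    if (live) live.deleted_at = now;
    return this.insert(cacheKey, final_extraction_json);
  }

  count(cacheKey: string): { live: number; total: number } {
    const matching = this.rows.filter((row) => row.cacheKey === cacheKey);
    return {
      live: matching.filter((row) => row.deleted_at === null).length,
      total: matching.length,
    };
  }
}

const table = new InMemoryExtractionTable();
const key = "doc-1001|invoice|invoice_v2_fr|2.4.0";
table.insert(key, { grand_total: 1200 });

let duplicateRejected = false;
try {
  table.insert(key, { grand_total: 1200 });
} catch {
  duplicateRejected = true; // a second live row with the same key is refused
}

table.replace(key, { grand_total: 1250 });
console.log(duplicateRejected, table.count(key)); // true { live: 1, total: 2 }
```

The superseded row stays in the table with deleted_at set, which is what preserves the audit trail while keeping a single live row per key.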
Example
{
  "data": {
    "type": "document_structured_extraction",
    "id": "d3a1e8f2-5b4c-4e2f-9aab-0123456789ab",
    "attributes": {
      "extraction_family": "invoice",
      "schema_name": "invoice_v2_fr",
      "schema_version": "2.4.0",
      "prompt_version": "p1.3",
      "model_policy": "gpt-5_structured_v1",
      "source_checksum": "sha256:a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4",
      "evidence_checksum": "sha256:f6e5d4c3b2a1f6e5d4c3b2a1f6e5d4c3",
      "classification_json": {
        "document_type": "invoice",
        "confidence": 0.98,
        "locale": "fr-FR"
      },
      "core_extraction_json": {
        "issuer_name": "Acme SAS",
        "reference_number": "FAC-2026-0042",
        "issue_date": "2026-05-15",
        "grand_total": 1200.00,
        "currency": "EUR"
      },
      "detail_extraction_json": {
        "line_items": [
          { "description": "Consulting services", "quantity": 1, "unit_price": 1000.00 },
          { "description": "Expenses", "quantity": 1, "unit_price": 200.00 }
        ]
      },
      "final_extraction_json": {
        "issuer_name": "Acme SAS",
        "reference_number": "FAC-2026-0042",
        "issue_date": "2026-05-15",
        "grand_total": 1200.00,
        "currency": "EUR",
        "line_items": [
          { "description": "Consulting services", "quantity": 1, "unit_price": 1000.00 },
          { "description": "Expenses", "quantity": 1, "unit_price": 200.00 }
        ]
      },
      "invoice_mapped_json": {
        "issuer_pk": 482917,
        "receiver_pk": 482918,
        "grand_total": 1200.00,
        "local_currency": "EUR",
        "status": "unpaid"
      },
      "selected_provider": "openai",
      "selected_model": "gpt-5.4",
      "quality_flags": {
        "low_confidence_fields": ["due_date"],
        "ocr_warnings": []
      },
      "created_at": "2026-05-28T14:32:10.000Z",
      "updated_at": "2026-05-28T14:32:11.000Z",
      "deleted_at": null
    },
    "relationships": {
      "document": {
        "data": { "type": "document", "id": "c7f2a1e0-0001-0001-0001-000000000001" }
      },
      "workspace": {
        "data": { "type": "workspace", "id": "9f3b2d00-aaaa-bbbb-cccc-000000000001" }
      }
    }
  }
}
apps/api/src/database/entities/DocumentStructuredExtraction.ts · domain: ingestion · tier: Infrastructure