DocumentStructuredExtraction

A document_structured_extraction record is the persisted output of one LLM structured-extraction pass over a parsed document. It is distinct from document_extractions, which stores raw OCR/parser text; this table stores the typed, schema-validated JSON produced after classification and field mapping. The cache key is a composite of document_pk, extraction_family, schema_name, schema_version, prompt_version, model_policy, source_checksum, and evidence_checksum. Because every input that affects the output is part of the key, retries and backfills reuse existing output, while a changed source, schema, or prompt produces a different key and therefore a fresh extraction rather than a stale cache hit. Records are scoped to a workspace and hang off a parent document; soft-deletion invalidates superseded cache entries while preserving the audit trail.
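The cache-key behaviour described above can be sketched in a few lines. This is a minimal illustration, not the pipeline's actual code: the `ExtractionCacheKey` type and `cacheKeyString` helper are hypothetical names, and `document_pk` is assumed to be numeric.

```typescript
// Illustrative only: the type and helper names are hypothetical, and
// document_pk is assumed to be numeric.
interface ExtractionCacheKey {
  document_pk: number;
  extraction_family: string;
  schema_name: string;
  schema_version: string;
  prompt_version: string;
  model_policy: string;
  source_checksum: string;
  evidence_checksum: string;
}

// Serialise the components in a fixed order so that equal keys always
// produce equal strings. A JSON array keeps values unambiguous even if
// one of them contains a separator character.
function cacheKeyString(key: ExtractionCacheKey): string {
  return JSON.stringify([
    key.document_pk,
    key.extraction_family,
    key.schema_name,
    key.schema_version,
    key.prompt_version,
    key.model_policy,
    key.source_checksum,
    key.evidence_checksum,
  ]);
}

const baseKey: ExtractionCacheKey = {
  document_pk: 1,
  extraction_family: "invoice",
  schema_name: "invoice_v2_fr",
  schema_version: "2.4.0",
  prompt_version: "p1.3",
  model_policy: "gpt-5_structured_v1",
  source_checksum: "sha256:a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4",
  evidence_checksum: "sha256:f6e5d4c3b2a1f6e5d4c3b2a1f6e5d4c3",
};

// A retry with identical inputs maps to the same key (cache hit).
const retryHits = cacheKeyString(baseKey) === cacheKeyString({ ...baseKey });

// A schema bump or a changed source yields a different key (cache miss).
const schemaBumpMisses =
  cacheKeyString(baseKey) !== cacheKeyString({ ...baseKey, schema_version: "2.5.0" });
const sourceChangeMisses =
  cacheKeyString(baseKey) !== cacheKeyString({ ...baseKey, source_checksum: "sha256:0000" });
```

In the database the same comparison is performed by the composite unique index rather than by string equality; the serialised form here only makes the hit/miss logic easy to see.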

Naming	Value
Object	DocumentStructuredExtraction
Resource type (JSON:API `type`)	`document_structured_extraction`
Collection / records root	— (not a records root)
REST base	`/v1/document-structured-extractions`
Entity class	`DocumentStructuredExtraction`

Internal object. Not currently exposed on the public REST API. The operations below describe the intended contract.

API operations

Operation	Method & path	Status
List	`GET /v1/document-structured-extractions`	🟡 Planned
Retrieve	`GET /v1/document-structured-extractions/{id}`	🟡 Planned
Create	`POST /v1/document-structured-extractions`	🟡 Planned
Update	`PATCH /v1/document-structured-extractions/{id}`	🟡 Planned
Delete	`DELETE /v1/document-structured-extractions/{id}`	🟡 Planned

Data model

Attributes

Field	Type	Required	Constraints	Allowed values	Description
document_structured_extraction_id	string (UUID)	✅ Yes	unique	—	Public UUID identifier, auto-generated by gen_random_uuid(). This is the stable external reference surfaced in the API. The internal pk is never exposed.
extraction_family	string	✅ Yes	max length 32; included in partial-unique-on-deleted_at index	—	High-level category of the extraction pass (e.g. 'invoice', 'receipt', 'contract'). Part of the cache-key composite unique index.
schema_name	string	✅ Yes	max length 96; included in partial-unique-on-deleted_at index	—	Name of the Zod/JSON schema used to validate and shape the extraction output. Part of the cache-key composite unique index.
schema_version	string	✅ Yes	max length 48; included in partial-unique-on-deleted_at index	—	Semver string of the schema. A version bump forces a new extraction even if all other cache-key components are unchanged.
prompt_version	string	✅ Yes	max length 48; included in partial-unique-on-deleted_at index	—	Version identifier of the LLM prompt template used. Part of the cache-key composite unique index.
model_policy	string	✅ Yes	max length 64; included in partial-unique-on-deleted_at index	—	Identifies the model-selection policy in effect when the extraction ran (e.g. a model alias or routing policy name). Part of the cache-key composite unique index.
source_checksum	string	✅ Yes	max length 64; included in partial-unique-on-deleted_at index	—	Checksum of the raw source content (OCR text / parsed PDF bytes) fed into the extraction. Stale-source guard: if the document's parsed text changes, this checksum changes and a new extraction is required.
evidence_checksum	string	✅ Yes	max length 64; included in partial-unique-on-deleted_at index	—	Checksum of the structured evidence slice passed to the LLM (may differ from source_checksum when evidence is preprocessed or truncated). Part of the cache-key composite unique index.
classification_json	jsonb	✅ Yes	NOT NULL	—	LLM output from the classification phase: document type, confidence scores, locale, and any routing signals that determined which schema to apply in subsequent passes.
core_extraction_json	jsonb	✅ Yes	NOT NULL	—	LLM output from the core extraction phase: the mandatory, high-confidence fields (header-level invoice fields such as issuer, reference, date, totals). Always present even when detail extraction is skipped.
detail_extraction_json	jsonb	⚪ No	nullable	—	LLM output from the optional detail extraction phase: line items, tax breakdowns, and other structured sub-arrays that require a secondary prompt. NULL when detail extraction was not requested or failed gracefully.
final_extraction_json	jsonb	✅ Yes	NOT NULL	—	Merged, post-processed extraction output combining classification, core, and detail phases. This is the authoritative input used by downstream invoice mapping and persistence services.
invoice_mapped_json	jsonb	⚪ No	nullable	—	The result of mapping final_extraction_json onto Well's internal invoice schema (entity PKs resolved, field names normalised). NULL when the document is not an invoice or mapping has not yet run.
selected_provider	string	⚪ No	max length 32; nullable	—	The AI provider selected at runtime (e.g. 'openai', 'anthropic'). NULL when the model policy does not record per-extraction provider choice.
selected_model	string	⚪ No	max length 128; nullable	—	The specific model identifier resolved from the model_policy at runtime (e.g. 'gpt-5.4'). NULL when not recorded.
quality_flags	jsonb	⚪ No	nullable	—	Post-extraction quality signals: low-confidence fields, OCR warnings, schema-validation failures, and any flags the extraction pipeline chose to surface for downstream review. NULL when no quality issues were detected.
created_at	🔒 system — timestamp with time zone	✅ Yes	NOT NULL; defaultRaw: NOW()	—	Row creation timestamp, set once by the onCreate hook. Reflects when the extraction pipeline persisted this result.
updated_at	🔒 system — timestamp with time zone	⚪ No	nullable	—	Row update timestamp, managed by the onUpdate hook. NULL until the first update after creation.
deleted_at	🔒 system — timestamp with time zone	⚪ No	nullable; excluded from uniq_document_structured_extraction_active when non-NULL	—	Soft-delete timestamp. When set, the row is excluded from the composite partial-unique index (uniq_document_structured_extraction_active), allowing a fresh extraction with the same cache key to be inserted.
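The length limits in the Constraints column can be checked before a row is written. The sketch below copies the limits from the table; the `MAX_LENGTHS` constant and `overlongFields` helper are hypothetical names, and the check covers only length, not nullability or uniqueness.

```typescript
// Max lengths copied from the Constraints column of the attribute table.
const MAX_LENGTHS = {
  extraction_family: 32,
  schema_name: 96,
  schema_version: 48,
  prompt_version: 48,
  model_policy: 64,
  source_checksum: 64,
  evidence_checksum: 64,
  selected_provider: 32,
  selected_model: 128,
} as const;

type BoundedField = keyof typeof MAX_LENGTHS;

// Returns the names of fields whose values exceed their documented limit.
// null and missing values are skipped. String.length counts UTF-16 code
// units, which matches character count for the ASCII values expected here.
function overlongFields(
  values: Partial<Record<BoundedField, string | null>>
): BoundedField[] {
  return (Object.keys(MAX_LENGTHS) as BoundedField[]).filter((field) => {
    const value = values[field];
    return typeof value === "string" && value.length > MAX_LENGTHS[field];
  });
}

// A full SHA-256 hex digest is 64 characters, so adding a "sha256:" prefix
// to a full digest would not fit in a 64-character column.
const prefixedFullDigest = "sha256:" + "a".repeat(64);
const violations = overlongFields({
  extraction_family: "invoice",
  source_checksum: prefixedFullDigest,
  selected_model: null,
});
```

Here `violations` contains only `source_checksum`. Whether stored checksums carry an algorithm prefix, a truncated digest, or a bare digest is not stated in this page, so the prefixed value is purely an illustration of the limit.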

Relationships

Name	Type	Required	Description
document	to-one (ManyToOne)	✅ Yes	The parent Document this extraction was produced from. The document_pk FK carries ON DELETE CASCADE, so deleting a document hard-deletes all its structured extractions.
workspace	to-one (ManyToOne)	✅ Yes	The tenant workspace that owns this extraction record. Used for all workspace-scoped queries and Hasura RLS enforcement. workspace_pk FK carries ON DELETE CASCADE.

System-computed

document_structured_extraction_id — auto-generated by gen_random_uuid() database default; never supplied by the caller
created_at — set to NOW() on INSERT via MikroORM onCreate hook; never updated
updated_at — set to NOW() on UPDATE via MikroORM onUpdate hook; NULL on creation
deleted_at — set by the extraction pipeline soft-delete path; when set, the partial-unique index uniq_document_structured_extraction_active no longer covers the row, enabling a replacement extraction with the same cache key to be inserted
Cache-key deduplication — the partial unique index (document_pk, extraction_family, schema_name, schema_version, prompt_version, model_policy, source_checksum, evidence_checksum) WHERE deleted_at IS NULL guarantees at-most-one live structured extraction per distinct extraction context; the pipeline soft-deletes the old row before inserting a fresh one on invalidation
Rows are written exclusively by the LLM extraction pipeline (ExtractPersistenceService / document structured-extraction service); no user-facing PATCH route currently exists for this entity (the Update operation listed under API operations is planned only)
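The interaction between soft-deletion and the partial unique index can be modelled in memory. This is a simplified sketch under stated assumptions: `InMemoryExtractionTable` and its methods are hypothetical, the cache key is reduced to a single string, and a real implementation would run the soft-delete and insert inside one database transaction.

```typescript
interface ExtractionRow {
  id: number;
  cacheKey: string; // stand-in for the eight-column composite key
  deleted_at: Date | null;
}

class InMemoryExtractionTable {
  private rows: ExtractionRow[] = [];
  private nextId = 1;

  // The live row for a key, if any (deleted_at IS NULL).
  findLive(cacheKey: string): ExtractionRow | undefined {
    return this.rows.find((r) => r.cacheKey === cacheKey && r.deleted_at === null);
  }

  // All rows for a key, including soft-deleted ones (the audit trail).
  history(cacheKey: string): ExtractionRow[] {
    return this.rows.filter((r) => r.cacheKey === cacheKey);
  }

  // Mimics the partial unique index: only live rows can conflict.
  insert(cacheKey: string): ExtractionRow {
    const live = this.findLive(cacheKey);
    if (live) {
      throw new Error(`unique violation: live row ${live.id} already has this cache key`);
    }
    const row: ExtractionRow = { id: this.nextId++, cacheKey, deleted_at: null };
    this.rows.push(row);
    return row;
  }

  softDelete(id: number, now: Date = new Date()): void {
    const row = this.rows.find((r) => r.id === id);
    if (row && row.deleted_at === null) row.deleted_at = now;
  }

  // Invalidation: soft-delete the live row first, then insert the replacement.
  replace(cacheKey: string): ExtractionRow {
    const live = this.findLive(cacheKey);
    if (live) this.softDelete(live.id);
    return this.insert(cacheKey);
  }
}

const extractionTable = new InMemoryExtractionTable();
const sampleKey = "doc-1|invoice|invoice_v2_fr|2.4.0|p1.3|policy|src|ev";
const firstRow = extractionTable.insert(sampleKey);

// A second live row with the same key is rejected.
let duplicateRejected = false;
try {
  extractionTable.insert(sampleKey);
} catch {
  duplicateRejected = true;
}

// After invalidation the replacement is accepted and the old row is kept.
const replacementRow = extractionTable.replace(sampleKey);
```

After this runs, `history(sampleKey)` holds two rows: the soft-deleted original and the live replacement, which is the at-most-one-live-row guarantee the index provides.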

Example

{
  "data": {
    "type": "document_structured_extraction",
    "id": "d3a1e8f2-5b4c-4e2f-9aab-0123456789ab",
    "attributes": {
      "extraction_family": "invoice",
      "schema_name": "invoice_v2_fr",
      "schema_version": "2.4.0",
      "prompt_version": "p1.3",
      "model_policy": "gpt-5_structured_v1",
      "source_checksum": "sha256:a1b2c3d4e5f6a1b2c3d4e5f6a1b2c3d4",
      "evidence_checksum": "sha256:f6e5d4c3b2a1f6e5d4c3b2a1f6e5d4c3",
      "classification_json": {
        "document_type": "invoice",
        "confidence": 0.98,
        "locale": "fr-FR"
      },
      "core_extraction_json": {
        "issuer_name": "Acme SAS",
        "reference_number": "FAC-2026-0042",
        "issue_date": "2026-05-15",
        "grand_total": 1200.00,
        "currency": "EUR"
      },
      "detail_extraction_json": {
        "line_items": [
          { "description": "Consulting services", "quantity": 1, "unit_price": 1000.00 },
          { "description": "Expenses", "quantity": 1, "unit_price": 200.00 }
        ]
      },
      "final_extraction_json": {
        "issuer_name": "Acme SAS",
        "reference_number": "FAC-2026-0042",
        "issue_date": "2026-05-15",
        "grand_total": 1200.00,
        "currency": "EUR",
        "line_items": [
          { "description": "Consulting services", "quantity": 1, "unit_price": 1000.00 },
          { "description": "Expenses", "quantity": 1, "unit_price": 200.00 }
        ]
      },
      "invoice_mapped_json": {
        "issuer_pk": 482917,
        "receiver_pk": 482918,
        "grand_total": 1200.00,
        "local_currency": "EUR",
        "status": "unpaid"
      },
      "selected_provider": "openai",
      "selected_model": "gpt-5.4",
      "quality_flags": {
        "low_confidence_fields": ["due_date"],
        "ocr_warnings": []
      },
      "created_at": "2026-05-28T14:32:10.000Z",
      "updated_at": "2026-05-28T14:32:11.000Z",
      "deleted_at": null
    },
    "relationships": {
      "document": {
        "data": { "type": "document", "id": "c7f2a1e0-0001-0001-0001-000000000001" }
      },
      "workspace": {
        "data": { "type": "workspace", "id": "9f3b2d00-aaaa-bbbb-cccc-000000000001" }
      }
    }
  }
}
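In the example above, final_extraction_json contains the core fields plus the line_items array from the detail phase. One way that relationship could arise is a shallow merge, sketched below. This is an assumption for illustration only: `mergeExtraction` is a hypothetical helper, and the real pipeline also post-processes the result and draws on classification output, which the sketch leaves out.

```typescript
type JsonObject = Record<string, unknown>;

// Hypothetical shallow merge: core fields first, then detail fields when
// the detail phase produced output. A null detail leaves core unchanged.
function mergeExtraction(core: JsonObject, detail: JsonObject | null): JsonObject {
  return detail === null ? { ...core } : { ...core, ...detail };
}

const coreExample: JsonObject = {
  issuer_name: "Acme SAS",
  reference_number: "FAC-2026-0042",
  issue_date: "2026-05-15",
  grand_total: 1200.0,
  currency: "EUR",
};

const detailExample: JsonObject = {
  line_items: [
    { description: "Consulting services", quantity: 1, unit_price: 1000.0 },
    { description: "Expenses", quantity: 1, unit_price: 200.0 },
  ],
};

// Matches the shape of final_extraction_json in the example payload.
const mergedWithDetail = mergeExtraction(coreExample, detailExample);

// When detail extraction was skipped, only the core fields remain.
const mergedCoreOnly = mergeExtraction(coreExample, null);
```

A shallow merge lets detail fields override core fields that share a name; whether the pipeline resolves such conflicts that way is not stated in this page.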

Source: apps/api/src/database/entities/DocumentStructuredExtraction.ts · domain: ingestion · tier: Infrastructure