Generic Document

Extract raw text, tables, and key-value pairs from any document using general OCR capabilities.

Overview

The Generic document type provides universal OCR (Optical Character Recognition) capabilities for processing any type of document. When you don't have a specific document type or need to extract raw content without structured parsing, use this type to capture all text, tables, and key-value pairs found in the document.

This is ideal for:

  • Documents without a predefined template
  • Initial document analysis before choosing a specific type
  • Custom document formats unique to your business
  • Quality assurance and verification workflows

Extracted Fields

| Field | Type | Description |
|-------|------|-------------|
| rawText | string | Complete extracted text content from the document |
| tables | array | Detected tables with rows and columns preserved |
| keyValuePairs | array | Automatically detected label-value pairs |
| detectedLanguage | string | ISO 639-1 language code of the document content |

Example Request

POST /v1/documents/process
curl -X POST 'https://api.docurift.com/v1/documents/process' \
  -H 'X-API-Key: your_api_key' \
  -F 'file=@document.pdf' \
  -F 'documentType=generic'

Example Response

{
  "success": true,
  "data": {
    "id": "doc_8f7a3b2c1d4e5f6a",
    "documentType": "generic",
    "result": {
      "rawText": "ACME Corporation\nQuarterly Report Q3 2024\n\nExecutive Summary\nRevenue increased by 15% compared to the previous quarter...\n\nKey Metrics:\nTotal Revenue: $2,450,000\nOperating Costs: $1,890,000\nNet Profit: $560,000",
      "tables": [
        {
          "tableIndex": 0,
          "headers": ["Metric", "Q2 2024", "Q3 2024", "Change"],
          "rows": [
            ["Revenue", "$2,130,000", "$2,450,000", "+15%"],
            ["Costs", "$1,750,000", "$1,890,000", "+8%"],
            ["Profit", "$380,000", "$560,000", "+47%"]
          ]
        }
      ],
      "keyValuePairs": [
        { "key": "Total Revenue", "value": "$2,450,000", "confidence": 0.98 },
        { "key": "Operating Costs", "value": "$1,890,000", "confidence": 0.97 },
        { "key": "Net Profit", "value": "$560,000", "confidence": 0.99 }
      ],
      "detectedLanguage": "en"
    },
    "confidence": 0.96,
    "processingTimeMs": 1250
  }
}
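
The response above can be unpacked with a few lines of Python. This is a minimal sketch, assuming the body has already been received as a JSON string; `unwrap_result` is a hypothetical helper name, and because the shape of error responses is not described on this page, it only checks the `success` flag.

```python
import json


def unwrap_result(body: str) -> dict:
    """Parse a response body and return the `data.result` object.

    Hypothetical helper: raises ValueError when `success` is not true,
    because the error payload format is an unknown here.
    """
    payload = json.loads(body)
    if payload.get("success") is not True:
        raise ValueError("document processing did not succeed")
    return payload["data"]["result"]


# Trimmed copy of the example response, serialized to a string.
body = json.dumps({
    "success": True,
    "data": {
        "id": "doc_8f7a3b2c1d4e5f6a",
        "documentType": "generic",
        "result": {
            "rawText": "ACME Corporation\nQuarterly Report Q3 2024",
            "tables": [],
            "keyValuePairs": [],
            "detectedLanguage": "en",
        },
        "confidence": 0.96,
    },
})

result = unwrap_result(body)
print(result["detectedLanguage"])  # en
```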

Field Definitions

rawText

The complete text content extracted from the document, preserving the reading order as much as possible. Line breaks are represented as \n characters. This includes all visible text from headers, paragraphs, labels, and values.

tables

An array of detected table structures. Each table object contains:

  • tableIndex: Zero-based index of the table in the document
  • headers: Array of column header strings (if detected)
  • rows: Two-dimensional array of cell values

Tables are detected based on visual layout and grid patterns. Complex nested tables may be split into multiple table objects.
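
Because `headers` and `rows` arrive as parallel arrays, it is often convenient to pair them into one record per row. The sketch below assumes the table object shape shown in the example response; `table_to_records` and the `col_N` fallback names are illustrative choices, not part of the API.

```python
def table_to_records(table: dict) -> list[dict]:
    """Pair each row's cells with the column headers.

    Falls back to positional names (col_0, col_1, ...) when `headers`
    is missing or shorter than a row, since headers are only present
    if detected.
    """
    headers = table.get("headers") or []
    records = []
    for row in table.get("rows", []):
        # Extend the names when a row has more cells than there are headers.
        names = list(headers) + [f"col_{i}" for i in range(len(headers), len(row))]
        records.append(dict(zip(names, row)))
    return records


# Table object copied from the example response.
table = {
    "tableIndex": 0,
    "headers": ["Metric", "Q2 2024", "Q3 2024", "Change"],
    "rows": [
        ["Revenue", "$2,130,000", "$2,450,000", "+15%"],
        ["Costs", "$1,750,000", "$1,890,000", "+8%"],
        ["Profit", "$380,000", "$560,000", "+47%"],
    ],
}

records = table_to_records(table)
print(records[0]["Q3 2024"])  # $2,450,000
```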

keyValuePairs

Automatically detected label-value pairs found throughout the document. Each pair includes:

  • key: The label or field name
  • value: The associated value
  • confidence: Confidence score (0.0 to 1.0) for this extraction

The algorithm identifies patterns like "Label: Value", "Label - Value", and spatially adjacent label-value layouts.
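
The per-pair `confidence` score makes it possible to discard weak detections before using the values. A minimal sketch follows; the `confident_pairs` helper and the 0.9 default threshold are assumptions for illustration, so pick a cutoff that suits your own documents.

```python
def confident_pairs(pairs: list[dict], min_confidence: float = 0.9) -> dict[str, str]:
    """Return {key: value} for pairs at or above min_confidence.

    If the same key appears more than once, the value with the
    highest confidence is kept.
    """
    best: dict[str, tuple[float, str]] = {}
    for pair in pairs:
        score = pair.get("confidence", 0.0)
        if score < min_confidence:
            continue
        key = pair["key"]
        if key not in best or score > best[key][0]:
            best[key] = (score, pair["value"])
    return {key: value for key, (_, value) in best.items()}


# Pairs copied from the example response.
pairs = [
    {"key": "Total Revenue", "value": "$2,450,000", "confidence": 0.98},
    {"key": "Operating Costs", "value": "$1,890,000", "confidence": 0.97},
    {"key": "Net Profit", "value": "$560,000", "confidence": 0.99},
]

fields = confident_pairs(pairs, min_confidence=0.98)
print(sorted(fields))  # ['Net Profit', 'Total Revenue']
```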

detectedLanguage

The primary language of the document content as an ISO 639-1 two-letter code (e.g., "en" for English, "es" for Spanish, "de" for German). This is determined by analyzing the extracted text content.

Best Practices

  1. Image Quality: Ensure scanned documents have at least 300 DPI resolution for optimal text recognition.

  2. File Formats: PDF files with an embedded text layer are processed faster and more accurately than image-only PDFs.

  3. Document Orientation: While DocuRift can detect and correct rotated pages, properly oriented documents yield better results.

  4. Complex Layouts: For documents with multiple columns or complex layouts, consider splitting them into sections if extraction accuracy is critical.

  5. Language Detection: If your document contains multiple languages, the detected language will be the predominant one. For multilingual documents, consider processing sections separately.

  6. Table Extraction: For best table results, ensure tables have clear borders or consistent spacing between columns.

  7. Post-Processing: Use the keyValuePairs array as a starting point for custom field extraction by filtering on specific key patterns relevant to your use case.
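
The post-processing tip in item 7 could be sketched as a regular-expression filter over the pair keys. `select_pairs` is a hypothetical helper, and the `revenue|profit` pattern is only an example; substitute the labels that matter in your documents.

```python
import re


def select_pairs(pairs: list[dict], pattern: str) -> list[dict]:
    """Keep pairs whose key matches the regular expression (case-insensitive)."""
    matcher = re.compile(pattern, re.IGNORECASE)
    return [pair for pair in pairs if matcher.search(pair["key"])]


# Pairs copied from the example response.
pairs = [
    {"key": "Total Revenue", "value": "$2,450,000", "confidence": 0.98},
    {"key": "Operating Costs", "value": "$1,890,000", "confidence": 0.97},
    {"key": "Net Profit", "value": "$560,000", "confidence": 0.99},
]

selected = select_pairs(pairs, r"revenue|profit")
print([pair["key"] for pair in selected])  # ['Total Revenue', 'Net Profit']
```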