Data Extraction from PDF
Document extraction turns unstructured files into structured, actionable data. Long a difficult problem, it has become far more tractable with recent advances in AI: tools like Orq now make extracting structured data from documents both efficient and practical. This cookbook demonstrates how to use Orq to process PDF invoices, from uploading files to extracting structured results.
To get started, you'll need to sign up for an Orq account if you haven't already.
Additionally, we've prepared a Google Colab notebook that you can copy and run right away; replace the API key with your own to start experimenting with document processing.
Step 1: Setting Up the Environment
The first step is ensuring the environment is ready. Installing the Orq SDK is quick and straightforward.
!pip install orq-ai-sdk
Step 2: Connecting to Orq
Interacting with Orq’s platform starts with client initialization.
import os
from orq_ai_sdk import OrqAI

# Store the API key as a variable
API_KEY = os.environ.get("ORQ_API_KEY", "your_api_key_here")

# Initialize the Orq client
client = OrqAI(
    api_key=API_KEY,
    environment="production"
)

# Optional: identify the user for tracing
client.set_user(id=2024)
Step 3: Uploading PDF Files for Processing
Here, PDF invoices are uploaded to Orq, making them ready for extraction and analysis. In this case, we have a few PDF files stored in a Google Drive folder that will be used for demonstration. You can easily replace these with your own files to suit your use case.
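If you're running this notebook in Google Colab, mount your Drive first so the folder path used below resolves; adjust the path to match your own folder:

# Colab-only: make Google Drive available under /content/drive
from google.colab import drive
drive.mount('/content/drive')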
import requests

# API details
url = "https://my.orq.ai/v2/files"
headers = {
    "Authorization": f"Bearer {API_KEY}"
}

# Specify the folder containing PDF files (here, limited to the first three)
folder_path = '/content/drive/MyDrive/invoice_test'
pdf_files = [file for file in os.listdir(folder_path) if file.endswith('.pdf')][:3]

# Store JSON responses
responses_json = []

for file_name in pdf_files:
    file_path = os.path.join(folder_path, file_name)
    try:
        with open(file_path, 'rb') as file:
            files = {
                'purpose': (None, 'retrieval'),
                'file': (file_name, file)
            }
            response = requests.post(url, headers=headers, files=files)
        if response.status_code == 200:
            responses_json.append(response.json())
            print(f"Successfully uploaded: {file_name}")
        else:
            print(f"Failed to upload: {file_name} (status {response.status_code})")
    except Exception as e:
        print(f"Error uploading {file_name}: {e}")
Extracting File IDs
Once the files are uploaded, their unique identifiers (file_ids) are extracted from the responses. These IDs are required for processing in the next step.
file_ids = [response.get('_id') for response in responses_json if response.get('_id')]
print(f"Extracted file IDs: {file_ids}")
Step 4: Deploying for Data Extraction
To ensure consistent, structured outputs, the GPT-4o model can be configured to adhere to a predefined JSON schema. The schema guides the model to generate results in a precise format, reducing ambiguity and guaranteeing compatibility with downstream systems.
Below is an example schema for extracting key fields from receipts: transaction date, vendor name, amount, expense category, payment method, and invoice number. With strict mode enabled, every field is required and must match its declared type, so the output can be loaded directly into applications or databases for analysis, reporting, or automation.
This is the prompt in Orq.ai:
Analyze the provided images of receipts and invoices. Extract the following relevant information:
- Date: The date of the transaction.
- Vendor Name: The name of the company or individual from whom the goods or services were purchased.
- Amount: The total amount spent, including any applicable taxes.
- Category: An appropriate category for the expense (e.g., Travel, Food, Office Supplies).
- Payment Method: The method of payment used (e.g., Credit Card, Cash, Bank Transfer).
- Invoice Number: If available, the unique identifier for the invoice.

Map each extracted piece of information to the appropriate columns in a CSV file with the following headers: Date, Vendor Name, Amount, Category, Payment Method, Invoice Number. Provide the results in a structured format suitable for CSV output.
This is the receipt:
This is the JSON Schema that helps generate the structured output:
{
  "name": "dataextraction_receipts",
  "strict": true,
  "schema": {
    "type": "object",
    "properties": {
      "Date": {
        "type": "string",
        "description": "The date of the transaction in YYYY-MM-DD format."
      },
      "VendorName": {
        "type": "string",
        "description": "The name of the company or individual from whom the goods or services were purchased."
      },
      "Amount": {
        "type": "number",
        "description": "The total amount spent, including any applicable taxes."
      },
      "Category": {
        "type": "string",
        "description": "An appropriate category for the expense (e.g., Travel, Food, Office Supplies)."
      },
      "PaymentMethod": {
        "type": "string",
        "description": "The method of payment used (e.g., Credit Card, Cash, Bank Transfer)."
      },
      "InvoiceNumber": {
        "type": "string",
        "description": "The unique identifier for the invoice, if available."
      }
    },
    "additionalProperties": false,
    "required": [
      "Date",
      "VendorName",
      "Amount",
      "Category",
      "PaymentMethod",
      "InvoiceNumber"
    ]
  }
}
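Because the deployment enforces this schema, responses should already conform to it, but it can be useful to validate extracted outputs locally before loading them elsewhere. The following is a sketch using the third-party jsonschema package (pip install jsonschema); the receipt_schema variable and parse_receipt helper are illustrative additions, not part of the Orq SDK:

import json
from jsonschema import validate

# The inner "schema" object from the definition above
receipt_schema = {
    "type": "object",
    "properties": {
        "Date": {"type": "string"},
        "VendorName": {"type": "string"},
        "Amount": {"type": "number"},
        "Category": {"type": "string"},
        "PaymentMethod": {"type": "string"},
        "InvoiceNumber": {"type": "string"},
    },
    "additionalProperties": False,
    "required": ["Date", "VendorName", "Amount",
                 "Category", "PaymentMethod", "InvoiceNumber"],
}

def parse_receipt(raw: str) -> dict:
    """Parse a model response and validate it against the receipt schema."""
    record = json.loads(raw)
    validate(instance=record, schema=receipt_schema)  # raises ValidationError on mismatch
    return record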
With the file_ids in hand, the next step is invoking the pre-configured deployment to extract structured data from each invoice.
for file_id in file_ids:
    try:
        generation = client.deployments.invoke(
            key="DataExtraction_Receipts",
            context={"environments": []},
            file_ids=[file_id],
            metadata={"custom-field-name": "custom-metadata-value"}
        )
        print(f"Extraction results for {file_id}: {generation.choices[0].message.content}")
    except Exception as e:
        print(f"Error processing {file_id}: {e}")
What’s Next?
Orq’s tools provide robust capabilities for extracting structured data from unstructured PDF documents. With this workflow, you can:
- Scale Data Processing: Adapt the workflow to handle larger batches of PDF files or integrate it into your existing systems (see the sketch after this list).
- Refine Extraction Outputs: Leverage Orq’s deployment configurations to fine-tune the extraction process for specific document formats, layouts, or fields.
- Automate End-to-End Workflows: Combine this process with automated pipelines to optimize tasks such as invoice management, financial reporting, or compliance monitoring.
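As a starting point for scaling, the upload and extraction steps can be wrapped into a single helper and mapped over a whole folder. This is a sketch under the same assumptions as above; upload_pdf and extract_receipt are hypothetical helpers built from the Step 3 request and the parse_receipt function, not part of the Orq SDK:

from pathlib import Path

def upload_pdf(path: Path) -> str:
    """Upload one PDF to Orq and return its file ID (wraps the Step 3 request)."""
    with open(path, "rb") as f:
        response = requests.post(
            url,
            headers=headers,
            files={"purpose": (None, "retrieval"), "file": (path.name, f)},
        )
    response.raise_for_status()
    return response.json()["_id"]

def extract_receipt(path: Path) -> dict:
    """End to end: upload a PDF, invoke the deployment, parse the result."""
    generation = client.deployments.invoke(
        key="DataExtraction_Receipts",
        context={"environments": []},
        file_ids=[upload_pdf(path)],
    )
    return parse_receipt(generation.choices[0].message.content)

# Process every PDF in the folder in one pass
results = [extract_receipt(p) for p in Path(folder_path).glob("*.pdf")]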
By transforming unstructured PDF data into actionable insights, Orq empowers businesses to streamline operations, improve decision-making, and unlock new efficiencies with ease.