Extract data from PDFs with LLMs - Orq.ai Documentation

Document extraction has long been a hard problem. Advances in AI have made even complex cases tractable, and with tools like Orq, extracting structured data from documents is now efficient and practical. This cookbook shows how to process PDF invoices with Orq by sending them directly to the model as native file attachments and extracting structured fields from them.

To get started, sign up for an Orq account if you haven’t already. We’ve also prepared a Google Colab file that you can copy and run right away, so you can experiment with document processing as soon as you have inserted your own API key.

Step 1: Setting Up the Environment

Start by installing the Orq SDK.

!pip install orq-ai-sdk

Step 2: Identity Tracking (Optional)

Identities in Orq.ai help track user interactions and API usage across your application. They can represent users, teams, or projects and enable better analytics and budget management.

Create an Identity through the AI Studio:

Go to Identity Analytics in your workspace
Click Create an Identity
Add the user details (name, email, external ID)
Set optional metadata and budget limits

To learn more about creating identities, see Creating an Identity.

Step 3: Connecting to Orq

Interacting with Orq’s platform starts with initializing the client.

import os
from orq_ai_sdk import Orq

# Store the API key as a variable
API_KEY = os.environ.get("ORQ_API_KEY", "your_api_key_here")

# Initialize Orq client
client = Orq(
    api_key=API_KEY,
)
# Pass identity per-request: identity={"id": "<identity-id>"} in deployments.invoke() or responses.create()
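
The snippet above falls back to the placeholder string when `ORQ_API_KEY` is not set, so a missing key only surfaces later as a failed request. A minimal sketch of a fail-fast check follows; `require_api_key` is a hypothetical helper, not part of the Orq SDK, and it only inspects the environment.

```python
import os

# Same placeholder string used in the client setup snippet.
PLACEHOLDER = "your_api_key_here"


def require_api_key(env=os.environ, var_name="ORQ_API_KEY"):
    """Return the API key from `env`, or raise if it is missing or still the placeholder.

    Hypothetical helper for illustration; it is not part of the Orq SDK.
    """
    key = env.get(var_name, "").strip()
    if not key or key == PLACEHOLDER:
        raise RuntimeError(
            f"Set the {var_name} environment variable before creating the client"
        )
    return key
```

The returned value can then be passed as `api_key` when creating the client, so a misconfigured notebook stops at setup rather than at the first request.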

Step 4: Locating the PDF Files

This example uses a few PDF files stored in a Google Drive folder. The path below assumes the Drive is already mounted in Colab under /content/drive. Replace the folder with your own files to suit your use case.

# Folder containing the PDF files
folder_path = '/content/drive/MyDrive/invoice_test'
# Keep at most the first three PDF files returned by os.listdir
pdf_files = [file for file in os.listdir(folder_path) if file.endswith('.pdf')][:3]

print(f"Found {len(pdf_files)} PDF files to process")
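
The listing above matches only the lowercase `.pdf` extension, and `os.listdir` returns names in arbitrary order. As a hedged alternative, the sketch below matches the extension case-insensitively and sorts by name so that repeated runs pick the same files. `find_pdfs` is a hypothetical helper name introduced here for illustration.

```python
from pathlib import Path


def find_pdfs(folder, limit=None):
    """Return PDF paths in `folder`, sorted by lowercased file name.

    The extension check is case-insensitive, so `scan.PDF` is included.
    `limit` optionally caps the number of paths returned.
    """
    paths = sorted(
        (p for p in Path(folder).iterdir()
         if p.is_file() and p.suffix.lower() == ".pdf"),
        key=lambda p: p.name.lower(),
    )
    return paths if limit is None else paths[:limit]
```

Calling `find_pdfs(folder_path, limit=3)` would mirror the three-file cap used in the snippet above. Note that it returns `Path` objects rather than bare file names.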

Step 5: Deploying for Data Extraction

To get consistent, structured output from the extraction, configure the GPT-4o model to adhere to a predefined JSON schema. The schema tells the model exactly which fields to return and with which types, which reduces ambiguity and keeps the output compatible with downstream systems.

The example schema in this step extracts key fields from receipts, including transaction date, vendor name, and payment details. It enables strict mode, marks every field as required, and assigns a specific data type to each property, so the output can be loaded directly into applications or databases for analysis, reporting, or automation.

This is the prompt in Orq.ai:

Analyze the provided images of receipts and invoices. Extract the following relevant information:

Date: The date of the transaction.
Vendor Name: The name of the company or individual from whom the goods or services were purchased.
Amount: The total amount spent, including any applicable taxes.
Category: An appropriate category for the expense (e.g., Travel, Food, Office Supplies).
Payment Method: The method of payment used (e.g., Credit Card, Cash, Bank Transfer).
Invoice Number: If available, the unique identifier for the invoice.
Map each extracted piece of information to the appropriate columns in a CSV file with the following headers: Date, Vendor Name, Amount, Category, Payment Method, Invoice Number. Provide the results in a structured format suitable for CSV output.

This is the receipt: 

This is the JSON Schema that helps generate the structured output:

{
  "name": "dataextraction_receipts",
  "strict": true,
  "schema": {
    "type": "object",
    "properties": {
      "Date": {
        "type": "string",
        "description": "The date of the transaction in YYYY-MM-DD format."
      },
      "VendorName": {
        "type": "string",
        "description": "The name of the company or individual from whom the goods or services were purchased."
      },
      "Amount": {
        "type": "number",
        "description": "The total amount spent, including any applicable taxes."
      },
      "Category": {
        "type": "string",
        "description": "An appropriate category for the expense (e.g., Travel, Food, Office Supplies)."
      },
      "PaymentMethod": {
        "type": "string",
        "description": "The method of payment used (e.g., Credit Card, Cash, Bank Transfer)."
      },
      "InvoiceNumber": {
        "type": "string",
        "description": "The unique identifier for the invoice, if available."
      }
    },
    "additionalProperties": false,
    "required": [
      "Date",
      "VendorName",
      "Amount",
      "Category",
      "PaymentMethod",
      "InvoiceNumber"
    ]
  }
}
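
Even with a strict schema, it can be useful to check a response before loading it into a downstream system, particularly because the `YYYY-MM-DD` date format appears only in a field description. The following is a minimal, standard-library sketch. The field names and types are copied from the schema above; `validate_receipt` is a hypothetical helper, and it assumes the model response arrives as a JSON string.

```python
import json
from datetime import datetime

# Property names and JSON types copied from the schema above.
EXPECTED_FIELDS = {
    "Date": str,
    "VendorName": str,
    "Amount": (int, float),
    "Category": str,
    "PaymentMethod": str,
    "InvoiceNumber": str,
}


def validate_receipt(raw_json):
    """Parse a JSON response string and return (record, problems)."""
    record = json.loads(raw_json)
    if not isinstance(record, dict):
        return record, ["response is not a JSON object"]

    problems = []
    for field, expected_type in EXPECTED_FIELDS.items():
        if field not in record:
            problems.append(f"missing field: {field}")
        # bool is a subclass of int in Python, so reject it explicitly.
        elif isinstance(record[field], bool) or not isinstance(record[field], expected_type):
            problems.append(f"wrong type for {field}")

    extra = set(record) - set(EXPECTED_FIELDS)
    if extra:
        problems.append(f"unexpected fields: {sorted(extra)}")

    # Check that Date parses as a calendar date in year-month-day order.
    if isinstance(record.get("Date"), str):
        try:
            datetime.strptime(record["Date"], "%Y-%m-%d")
        except ValueError:
            problems.append("Date is not in YYYY-MM-DD format")
    return record, problems
```

An empty `problems` list means the record has exactly the six expected fields with the expected types and a parseable date.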

Next, invoke the configured deployment to extract structured data from the invoices. Each PDF is read from disk, base64-encoded, and sent directly to the model as a native file content part, so no upload step is required. This works with OpenAI, Anthropic, and Google Gemini models.

import base64

for file_name in pdf_files:
    file_path = os.path.join(folder_path, file_name)
    try:
        with open(file_path, "rb") as f:
            encoded_pdf = base64.b64encode(f.read()).decode("utf-8")

        generation = client.deployments.invoke(
            key="DataExtraction_Receipts",
            context={"environments": []},
            messages=[
                {
                    "role": "user",
                    "content": [
                        {
                            "type": "file",
                            "file": {
                                "file_data": f"data:application/pdf;base64,{encoded_pdf}",
                                "filename": file_name,
                            },
                        }
                    ],
                }
            ],
            metadata={"custom-field-name": "custom-metadata-value"},
        )
        print(f"Extraction results for {file_name}: {generation.choices[0].message.content}")
    except Exception as e:
        print(f"Error processing {file_name}: {e}")
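
The prompt asks for output that maps onto CSV columns, while the loop above only prints each result. One way to bridge the two is sketched below, assuming each response's `message.content` is a JSON string that follows the schema from this step. `responses_to_csv` is a hypothetical helper; the column headers come from the prompt and the keys from the schema.

```python
import csv
import json

# CSV headers from the prompt, mapped to the property names in the JSON schema.
CSV_COLUMNS = {
    "Date": "Date",
    "Vendor Name": "VendorName",
    "Amount": "Amount",
    "Category": "Category",
    "Payment Method": "PaymentMethod",
    "Invoice Number": "InvoiceNumber",
}


def responses_to_csv(raw_responses, out):
    """Write a header row plus one row per JSON response string to `out`.

    `out` is any writable text file object. Missing keys become empty cells.
    """
    writer = csv.writer(out)
    writer.writerow(CSV_COLUMNS.keys())
    for raw in raw_responses:
        record = json.loads(raw)
        writer.writerow([record.get(key, "") for key in CSV_COLUMNS.values()])
```

To use it, collect each `generation.choices[0].message.content` in a list inside the loop, then call `responses_to_csv(collected, f)` with a file opened via `open("receipts.csv", "w", newline="")`. The file name here is only an example.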

Feedback Collection (Optional)

Feedback in Orq.ai helps track response quality and identify areas for improvement. You can collect user ratings, defect classifications, and corrections to continuously enhance your application.

Provide feedback through the AI Studio:

Go to Logs in your workspace
Find the specific deployment invocation
Use the feedback interface to rate responses
Add defect classifications or corrections as needed

You can also collect feedback programmatically via the API if needed.

What’s Next?

Orq’s tools provide robust capabilities for extracting structured data from unstructured PDF documents. With this workflow, you can:

Scale Data Processing: Adapt the workflow to handle larger batches of PDF files or seamlessly integrate it into your existing systems.
Refine Extraction Outputs: Leverage Orq’s deployment configurations to fine-tune the extraction process for specific document formats, layouts, or fields.
Automate End-to-End Workflows: Combine this process with automated pipelines to optimize tasks such as invoice management, financial reporting, or compliance monitoring.
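
As one illustration of scaling to larger batches, the sketch below runs a per-file extraction function across a thread pool and collects failures separately, so one bad PDF does not stop the batch. `process_batch` and `extract_one` are hypothetical names. `extract_one` stands in for a function that wraps the `client.deployments.invoke` call from Step 5. Whether concurrent calls suit your workspace's rate limits is an assumption to verify.

```python
from concurrent.futures import ThreadPoolExecutor


def process_batch(file_names, extract_one, max_workers=4):
    """Run extract_one(file_name) for each file using a thread pool.

    Returns (results, errors): two dicts keyed by file name. A failure in
    one file is recorded in `errors` and does not stop the other files.
    """
    results, errors = {}, {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(extract_one, name) for name in file_names}
        for name, future in futures.items():
            try:
                results[name] = future.result()
            except Exception as exc:  # record the failure and keep going
                errors[name] = str(exc)
    return results, errors
```

Keeping `max_workers` small is a conservative default for API-bound work. The `errors` dict gives a list of files to retry or inspect by hand.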

By turning unstructured PDF data into structured records, Orq helps businesses streamline operations and make better-informed decisions.