Creating SQL Queries from Natural Language

This tutorial demonstrates how to transform natural language instructions into SQL queries using Orq's AI-powered deployment features. By automating SQL generation, you can streamline database interactions and enhance data accessibility for non-technical users.

You'll build an application that generates SQL queries from natural language instructions and then evaluates the quality of the generated queries, using Orq's deployment feature along the way. By the end of this tutorial, you'll be ready to experiment with SQL generation in your own projects.

Before starting, ensure you have an Orq account. If not, sign up at Orq.ai.

To simplify setup, we've also prepared a Google Colab file that you can copy and run immediately after filling in your API key. It provides a ready-to-use environment with all required configuration in place, so you can focus on experimenting with SQL generation rather than on initial setup. Let's dive in!

Step 1: Setting Up the Environment
The following commands install the required libraries for working with the Orq platform, handling datasets, and managing the SQL generation workflow. Feel free to reuse and adapt this code for your projects.

!pip install orq-ai-sdk datasets huggingface_hub

Step 2: Initializing the Orq Client
The Orq client is the core interface for deploying and invoking models. Here’s how to set it up using an API key, which can be stored as an environment variable (ORQ_API_KEY) or passed directly.

import os
from orq_ai_sdk import OrqAI

client = OrqAI(
    api_key=os.environ.get("ORQ_API_KEY", "your_orq_api_key"),
    environment="production"
)
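
If you're running in Colab (or anywhere the variable isn't preconfigured), you can set ORQ_API_KEY for the session before constructing the client. This is a minimal sketch; the value below is a placeholder you must replace with your own key:

import os

# Set the key for this session only; replace the placeholder with your real key
os.environ["ORQ_API_KEY"] = "your_orq_api_key"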

Step 3: Loading the Dataset
Use the Hugging Face datasets library to load a dataset containing table schemas and natural language instructions. Convert the dataset to a pandas DataFrame for easy manipulation.

from datasets import load_dataset

ds = load_dataset("Clinton/Text-to-sql-v1")

# Convert to a pandas DataFrame (selecting the "train" split as an example)
df = ds["train"].to_pandas()

# Keep only the first 50 rows to limit API calls while experimenting
df = df.head(50)

# Preview the DataFrame
print(df)
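
The generation step later reads the instruction and input columns, so it's worth confirming they exist and seeing what one example looks like:

# Confirm the columns the generation step expects ("instruction" and "input")
print(df.columns.tolist())

# Inspect one example to see how tasks and schemas are phrased
print(df.iloc[0]["instruction"])
print(df.iloc[0]["input"])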

SQL Query Generation Use Case

This deployment is designed to generate valid SQL queries based on specific table schemas and user-provided instructions. The model analyzes the instruction and the associated table schema to produce a precise and contextually appropriate SQL query.

SQL query generation is particularly useful when automating database interactions, building query assistants, or streamlining access to structured data through natural language inputs. The deployment's prompt template is shown below; the {{instruction}} and {{table}} placeholders are filled in at invocation time.

Below are SQL table schemas paired with an instruction that describes a task. Using valid SQLite, write a response that appropriately completes the request for the provided tables:

Here is the instruction: {{instruction}}

Here is the table: {{table}}

Here are some output examples; adhere to this output format:

SELECT home_team FROM table_name_77 WHERE away_team = "carlton"

SELECT "Yellow jersey" FROM table_3791 WHERE "Distance (km)" = '125'

SELECT COUNT(DISTINCT demographic.subject_id) FROM demographic INNER JOIN diagnoses ON demographic.hadm_id = diagnoses.hadm_id INNER JOIN lab ON demographic.hadm_id = lab.hadm_id WHERE diagnoses.icd9_code = "45620" AND lab.fluid = "Blood"

NO NEED TO INCLUDE ```sql``` AROUND THE OUTPUT:

SELECT Sex, COUNT(*) AS FacultyCount
FROM Faculty
GROUP BY Sex
ORDER BY FacultyCount DESC;
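
Despite that instruction, models sometimes wrap their answer in ```sql``` fences anyway. The optional helper below strips such fences before you store the output; it's a small convenience sketch, not part of the deployment itself:

def strip_sql_fences(text: str) -> str:
    """Remove accidental ```sql ... ``` fences around a model response."""
    cleaned = text.strip()
    if cleaned.startswith("```"):
        lines = cleaned.splitlines()
        # Drop the opening fence line (e.g. "```sql")
        lines = lines[1:]
        # Drop a trailing closing fence if present
        if lines and lines[-1].strip() == "```":
            lines = lines[:-1]
        cleaned = "\n".join(lines).strip()
    return cleaned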

Step 4: Generating SQL Queries
This step involves invoking the Orq deployment to generate SQL queries for each row in the dataset. The "instruction" column provides the natural language task, while the "input" column contains the table schema. The results are stored in a new column named "output".

# Initialize the outputs list
outputs = []

# Iterate through each row in the DataFrame
for _, row in df.iterrows():
    # Extract the 'instruction' and 'input' columns for each row
    instruction = row["instruction"]
    table = row["input"]
    
    # Invoke the deployment for each row
    generation = client.deployments.invoke(
        key="text_to_SQL",  # Replace with your actual deployment key
        context={
            "environments": []
        },
        inputs={
            "table": table,
            "instruction": instruction
        },
        metadata={
            "custom-field-name": "custom-metadata-value"
        }
    )
    
    # Append the model's output to the outputs list
    outputs.append(generation.choices[0].message.content)

# Add the outputs as a new column in the DataFrame
df["output"] = outputs

Step 5: Saving and Evaluating Results
Save the updated DataFrame containing the SQL queries to a file and evaluate their quality. Use metrics or manual inspection to verify the accuracy and relevance of the generated queries; a simple evaluation sketch follows the save step below.

# Save the results to a CSV file
df.to_csv("generated_sql_queries.csv", index=False)
print("Generated SQL queries saved to 'generated_sql_queries.csv'.")

Next Steps
Congratulations! You've successfully built and tested a SQL generation application using Orq. To further enhance your project:

  • Experiment with different datasets or deployment keys.
  • Refine the prompt to improve SQL generation quality.
  • Integrate the solution into a larger application for automated data access.

For more details and advanced features, visit the Orq documentation.