EvaluatorQ Python SDK
v4.1.0
Run experiments programmatically and measure AI performance directly from your Python code. Test Orq deployments, Orq agents, or any third-party framework, execute them over datasets, and evaluate results without leaving your IDE. The real power? Every experiment you kick off renders directly in Orq’s AI Studio, giving your team full visibility into what’s working and what isn’t.
Key Features:
  • Run experiments from code to compare any AI system against your evaluation criteria. Test an Orq deployment against a third-party LangGraph agent. Run your CrewAI setup over the same dataset as your Orq agent. Execute against local datasets or pull directly from datasets stored in Orq. The SDK handles the orchestration while you focus on what matters: understanding how your systems actually perform.
  • Results rendered in Orq’s AI Studio so when experiments complete, the full picture is waiting for you in the platform. When a version underperforms, hand it off to whoever owns the prompt engineering. They can drill into the exact failure points and answer the real questions: Are the tool descriptions unclear? Do agent instructions need adjustment? Does the prompt need refinement? No more “it failed in CI, good luck figuring out why.”
  • Framework-agnostic testing because production AI systems rarely live in one ecosystem. Evaluate Orq-native agents alongside external implementations with the same evaluators, getting consistent and comparable metrics in one place. Your evaluation logic shouldn’t care where the agent was built.
Now available in Python, alongside our existing Node.js SDK. For examples and common patterns, check out github.com/orq-ai/orqkit
Learn more about running experiments from code at EvaluatorQ Documentation.
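The workflow described above can be illustrated with a minimal, framework-agnostic sketch. Note that the names below (`run_experiment`, `exact_match`, the stand-in systems) are illustrative, not the actual EvaluatorQ API; see the orqkit repository for real usage patterns:

```python
from typing import Callable

# A dataset row: an input plus a reference answer. In practice this
# could be a local file or a dataset pulled from Orq.
dataset = [
    {"input": "What is the capital of France?", "expected": "Paris"},
    {"input": "What is 2 + 2?", "expected": "4"},
]

# Two "systems" under test. In a real experiment these would wrap an
# Orq deployment and, say, a LangGraph or CrewAI agent.
def system_a(prompt: str) -> str:
    return {"What is the capital of France?": "Paris",
            "What is 2 + 2?": "4"}.get(prompt, "")

def system_b(prompt: str) -> str:
    return "Paris" if "France" in prompt else "unknown"

# An evaluator scores one (output, expected) pair.
def exact_match(output: str, expected: str) -> float:
    return 1.0 if output.strip() == expected.strip() else 0.0

def run_experiment(system: Callable[[str], str]) -> float:
    """Run a system over the dataset and average the evaluator scores."""
    scores = [exact_match(system(row["input"]), row["expected"])
              for row in dataset]
    return sum(scores) / len(scores)

print(run_experiment(system_a))  # 1.0 - both rows correct
print(run_experiment(system_b))  # 0.5 - only the France row matches
```

The key point is that both systems run over the same dataset with the same evaluator, so their scores are directly comparable regardless of which framework built them.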
Command Bar
v4.1.0
Navigate and create faster with universal search in the new Command Bar. We’ve all been there: you know the deployment exists, you just can’t remember which project it’s in. Or you need to create an experiment but you’re three levels deep in the folder structure. The Command Bar eliminates that friction entirely.
What’s new:
  • Quick access with Command+K (Mac) or Ctrl+K (Windows/Linux) to open the Command Bar from anywhere.
  • Universal search across files, documentation, and resources to find what you need instantly. Type a few characters and get fuzzy-matched results across deployments, experiments, agents, datasets, knowledge bases, and even internal documentation.
  • Rapid entity creation to create deployments, experiments, agents, and other resources without leaving your current view. Start typing “new deployment” or “create experiment” and you’re immediately in the creation flow.
Discover keyboard shortcuts and navigation features in the Platform Documentation.
Experiment Improvements
v4.1.0
Greater flexibility to iterate and refine experiments after they run. We’ve removed one of the most frustrating bottlenecks in the experimentation workflow: having to rerun entire experiments just to add one more evaluator or get human feedback on existing results.
New Capabilities:
  • Add evaluators dynamically to experiments that have already completed without rerunning the entire experiment. Realized you need to check for hallucinations after the fact? Just add a new evaluator and run it against your existing results. No need to burn through tokens and time rerunning all your prompts. The evaluator processes the cached outputs you already generated.
  • Enable human review on completed experiments to gather qualitative feedback on existing results. Sometimes the metrics look great but something feels off. Now you can retroactively add human review to experiments, and get team feedback without disrupting the results you’ve already collected.
  • Rerun error cells to retry failed executions without discarding successful results. Network timeouts, rate limits, temporary API issues… they happen. Instead of throwing away an entire experiment run because 3 out of 100 calls failed, just retry those specific cells and preserve everything else.
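Conceptually, both capabilities operate on cached results instead of fresh model calls. A rough sketch of the idea (the cell structure and function names are illustrative, not the platform’s internal API):

```python
# Cached experiment results: each cell stores the generated output,
# or an error marker if the call failed.
cells = [
    {"input": "Summarize doc A", "output": "A short summary.", "error": None},
    {"input": "Summarize doc B", "output": None, "error": "rate_limited"},
    {"input": "Summarize doc C", "output": "Another summary.", "error": None},
]

def add_evaluator(cells, evaluator):
    """Score already-generated outputs without calling the model again."""
    return [evaluator(c["output"]) if c["error"] is None else None
            for c in cells]

def rerun_error_cells(cells, model_call):
    """Retry only the failed cells, preserving successful results."""
    for c in cells:
        if c["error"] is not None:
            c["output"] = model_call(c["input"])
            c["error"] = None
    return cells

# Retroactively add a word-count evaluator to the completed run.
word_count = lambda text: len(text.split())
print(add_evaluator(cells, word_count))  # failed cell stays unscored (None)

# Retry only the rate-limited cell; the two good outputs are untouched.
rerun_error_cells(cells, lambda prompt: "Recovered summary.")
print(add_evaluator(cells, word_count))  # all cells now scored
```

No tokens are spent re-generating outputs that already exist; only the retried cell triggers a new model call.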
Discover how to use these experiment capabilities at Experiments Documentation.
Tool Call Experimentation
v4.1.0
Test how agents handle tool calls in realistic scenarios, including error recovery and self-correction. Most tools test whether agents call the right function. With Orq.ai you can now test what happens when agents must recover from their own mistakes. This is essential for production agents operating in real-world conditions, where tools fail and multi-turn corrections are necessary.
What’s available:
  • Historical tool call testing to evaluate how agents behave when confronted with previous tool call failures. Include incorrect tool calls from conversation history to test whether agents can recognize errors, self-correct, and maintain consistent reasoning during recovery.
  • Side-by-side context comparison by running identical prompts with and without historical tool calls. See if past failures improve decision-making or degrade performance, and validate whether your system prompts effectively guide error recovery.
  • Full tool call inspection showing trigger conditions, parameters, and responses. Identify patterns like redundant calls (“web_search called twice unnecessarily”) or systematic failures (“30% fail with malformed JSON”).
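Seeding history with a failed tool call looks roughly like this in an OpenAI-style chat format (shown here only as an illustration of the pattern; the exact message schema depends on your provider):

```python
import json

# Conversation history seeded with an incorrect tool call, to test
# whether the agent recognizes the error and self-corrects.
history = [
    {"role": "user", "content": "What's the weather in Amsterdam?"},
    {
        "role": "assistant",
        "tool_calls": [{
            "id": "call_1",
            "type": "function",
            "function": {
                "name": "get_weather",
                # Malformed arguments: city passed under the wrong key.
                "arguments": json.dumps({"location_id": "Amsterdam"}),
            },
        }],
    },
    {
        "role": "tool",
        "tool_call_id": "call_1",
        "content": json.dumps({"error": "unknown parameter 'location_id'"}),
    },
]

def last_tool_failed(messages) -> bool:
    """Check whether the most recent tool response reports an error."""
    tool_msgs = [m for m in messages if m["role"] == "tool"]
    return bool(tool_msgs) and "error" in json.loads(tool_msgs[-1]["content"])

print(last_tool_failed(history))  # True
```

An experiment can then run the same user prompt with and without this failure context and compare whether the agent retries `get_weather` with corrected arguments.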
Learn more about tool call experimentation at Tools Documentation.
Memory Store Studio
v4.1.0
Create and manage agent memory stores directly in the UI. Previously, setting up memory stores required writing code; now you can configure everything visually and actually see what your agents are remembering.
What’s new:
  • Visual memory store creation to configure memory stores without writing code: define keys, descriptions, and select embedding models through an intuitive interface. Spin up a new memory store in seconds.
  • Memory inspection and editing to view all chunks that agents have created and stored, with the ability to manually adjust or overwrite memory content for greater control. See everything your agent has decided to memorize, identify when it’s storing irrelevant information or missing important context, and directly edit the memory chunks to correct issues. Think of it as “view source” for your agent’s memory. No more black box wondering what it actually remembers.
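The inspect-and-edit workflow amounts to reading and overwriting the chunks stored under a memory key. A tiny sketch of that idea (the field names and store shape here are illustrative, not the platform’s data model):

```python
# A memory store holds chunks the agent has decided to remember,
# grouped under keys defined at store creation.
memory_store = {
    "user_preferences": [
        {"id": "m1", "text": "User prefers concise answers."},
        {"id": "m2", "text": "User's favorite color is blue."},  # irrelevant
    ],
}

def inspect(store, key):
    """List every chunk stored under a memory key."""
    return [chunk["text"] for chunk in store.get(key, [])]

def overwrite(store, key, chunk_id, new_text):
    """Manually correct a chunk the agent stored."""
    for chunk in store.get(key, []):
        if chunk["id"] == chunk_id:
            chunk["text"] = new_text

# Replace the irrelevant chunk with something actually useful.
overwrite(memory_store, "user_preferences", "m2",
          "User works in a Python codebase.")
print(inspect(memory_store, "user_preferences"))
```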
Explore memory store configuration at Memory Stores Documentation.
Project Settings
v4.1.0
Fine-grained access control and configuration at the project level. Workspaces are great for high-level organization, but real teams need project-level isolation: different API keys for different projects, annotation workflows that don’t bleed across teams, and the ability to lock down sensitive projects without affecting everything else.
Key Features:
  • Project-specific API keys to manage access credentials directly from the project settings page. Generate API keys scoped to individual projects, rotate them independently when team members leave, and avoid the nightmare scenario where one compromised key exposes your entire workspace.
  • Project-scoped human reviews and review sets to create and manage custom annotation workflows tailored to each project’s specific evaluation needs. Your customer support team needs different review criteria than your content generation team. Now each project can define its own annotation labels, review queues, and quality rubrics without cluttering up everyone else’s workflows.
Learn more about project administration at Projects Documentation.
Moonshot AI Models
v4.1.0
We’ve added Moonshot AI as a new model provider, bringing their Kimi K2 model series. The series offers extended 256K context windows for handling lengthy documents and complex multi-turn conversations, dedicated thinking modes with reasoning capabilities for multi-step problem solving and tool usage, and high-speed turbo variants generating 60-100 tokens/sec for responsive interactions.
New Models:
  • kimi-k2-thinking - Long-term thinking model with 256K context, supporting multi-step tool usage and deep reasoning for complex problems
  • kimi-k2-thinking-turbo - High-speed thinking model with 256K context, delivering 60-100 tokens/sec while maintaining deep reasoning capabilities
  • kimi-k2-turbo-preview - Performance-optimized model with 256K context and 60-100 tokens/sec output speed
Pricing:
  • Standard models: $0.60 per 1M input tokens, $2.50 per 1M output tokens
  • Turbo models: $1.15 per 1M input tokens, $8.00 per 1M output tokens
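Per-1M-token pricing makes per-request costs easy to estimate. A quick cost check using the rates listed above (the helper function is our own, not part of any SDK):

```python
# Per-1M-token prices from the table above (USD).
PRICES = {
    "standard": {"input": 0.60, "output": 2.50},
    "turbo":    {"input": 1.15, "output": 8.00},
}

def request_cost(tier: str, input_tokens: int, output_tokens: int) -> float:
    """Cost in USD for one request at the listed per-1M-token rates."""
    p = PRICES[tier]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1_000_000

# Filling most of the 256K context (200K in) with a 2K-token answer:
print(round(request_cost("standard", 200_000, 2_000), 4))  # 0.125
print(round(request_cost("turbo", 200_000, 2_000), 4))     # 0.246
```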
Explore Moonshot AI Kimi K2 models and their reasoning capabilities in the Model Garden.
Z.ai Models
v4.1.0
We’ve added Z.ai as a new model provider, bringing their GLM (General Language Model) series and CogView image generation. Highlights include multimodal vision, image, video, and file understanding in glm-4.5v, extended context windows up to 200K tokens with 128K maximum output for handling lengthy conversations, and hybrid Thinking/Non-Thinking modes that balance speed and depth across tasks.
New Models:
  • glm-4.5 - 355B parameter MoE model with 32B active parameters, 128K context, and hybrid Thinking/Non-Thinking modes generating 100+ tokens/sec
  • glm-4.5v - 106B parameter multimodal model with vision, image, video, and file inputs, featuring Thinking Mode for balanced speed and reasoning
  • glm-4.6 - Advanced model with 200K context and 128K output tokens, comparable to Claude Sonnet 4/4.6 with 30% better token efficiency
  • cogView-4-250304 - Efficient image generation model for creating visual content from text descriptions
Pricing:
  • Chat models: $0.60 per 1M input tokens, $1.80-$2.20 per 1M output tokens
  • Image generation: $0.01 per image
Explore Z.ai models and their multimodal capabilities in the Model Garden.
DeepSeek Models
v4.1.0
We’ve added DeepSeek as a new model provider, bringing their advanced reasoning models with step-by-step chain-of-thought capabilities for mathematical proofs and multi-step problem solving, along with strong coding capabilities for code generation, debugging, and algorithm optimization. The models deliver frontier-level performance at highly competitive pricing.
New Models:
  • deepseek-chat - DeepSeek-V3 with 671B parameters, a Mixture-of-Experts model excelling at general chat, coding, and complex reasoning
  • deepseek-reasoner - DeepSeek-R1 with chain-of-thought capabilities for step-by-step problem solving in mathematics and complex reasoning tasks
Pricing:
  • Input tokens: $0.28 per 1M tokens
  • Output tokens: $0.42 per 1M tokens
Explore DeepSeek models and compare reasoning performance in the Model Garden.
Contextual AI Rerankers
v4.1.0
We’ve added Contextual AI as a new model provider, bringing their reranking models with context-aware ranking to reorder search results based on semantic relevance and query understanding, RAG optimization to enhance retrieval-augmented generation pipelines by surfacing the most relevant documents, and full compatibility across Deployments, Experiments, and the AI Gateway.
New Models:
  • ctxl-rerank-v1-instruct - High-performance reranking model for improving search result ordering
  • ctxl-rerank-v2-instruct-multilingual - Multilingual reranking with support for diverse languages and cross-lingual search
  • ctxl-rerank-v2-instruct-multilingual-mini - Efficient multilingual reranking optimized for speed and cost
Pricing:
  • $0.02 - $0.05 per 1M tokens depending on model variant
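In a RAG pipeline, a reranker sits between retrieval and generation: it scores each candidate document against the query and keeps only the top-k. The sketch below uses a toy term-overlap scorer as a stand-in for a call to a ctxl-rerank model (the function names are ours, not the Contextual AI API):

```python
def mock_rerank(query: str, documents: list[str]) -> list[float]:
    """Toy relevance score: fraction of query terms found in each doc.
    Stand-in for a real reranker model call."""
    terms = set(query.lower().split())
    return [len(terms & set(doc.lower().replace(".", "").split())) / len(terms)
            for doc in documents]

def rerank_top_k(query, documents, score_fn, k=2):
    """Reorder retrieved documents by reranker score, keep the best k."""
    scored = sorted(zip(score_fn(query, documents), documents), reverse=True)
    return [doc for _, doc in scored[:k]]

# Candidates from a first-pass retriever, in arbitrary order.
docs = [
    "The Eiffel Tower is in Paris.",
    "Paris is the capital of France.",
    "Bananas are rich in potassium.",
]
print(rerank_top_k("capital of France", docs, mock_rerank))
```

Swapping `mock_rerank` for a real reranker call is the only change needed to use this shape in production; the surrounding pipeline stays the same.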
Explore Contextual AI rerankers and integrate them into your RAG workflows at Model Garden.
Enforce Enabled Models
v4.1.0
Control which models are enabled across your workspace to ensure compliance with organizational requirements. Not all models are created equal when it comes to cost, compliance, or performance characteristics. Workspace admins can now create guardrails that prevent teams from accidentally deploying expensive or non-compliant models.
Configure model enforcement in your Workspace Settings.
Looking Forward in 4.2 (WIP)
v4.1.0
Coming soon:
  • Trace Insights for pattern detection and anomaly analysis across your agent executions
  • Dataset Management Enhancements including version control and lineage tracking for training data
  • Advanced Agent Monitoring with real-time performance dashboards and intelligent alerting