After observing an application in production, the next step is annotating and curating that data to build evaluation datasets. This process turns raw production logs into high-quality test cases that drive systematic improvement.

Use Cases

Capture thumbs up/down ratings, custom scores, or categorical labels on AI responses. Build a feedback loop that surfaces low-quality generations for review.
Flag responses with specific defects (hallucination, off-topic, inappropriate content) using structured annotation keys shared across the team.
Annotate Traces with corrections and quality labels, then export curated subsets as training datasets for future experiments.
Route Traces to Annotation Queues for systematic expert review. Combine with Trace Automations to automatically surface Traces that meet specific criteria.

Concepts

Three concepts work together to form the annotations system:
  • Human Reviews: define the schema (key, value type, options) that annotations must conform to
  • Annotation Queues: organized workflows for reviewing Traces in bulk via AI Studio
  • Annotations API: the API and SDK for applying feedback values to a Trace or span programmatically

Human Reviews

Define annotation schemas: keys, value types, and validation rules. Available on chat completion and responses spans once created.

Annotation Queues

Organize human review workflows. Filter and present relevant Traces for review in bulk.

Annotations API

Apply structured human feedback to Traces and spans programmatically via the API and SDK.

Create Human Review

Human Reviews define the structure and validation rules for annotations. Each annotation key must match an existing Human Review definition in the project.
To create a Human Review, head to Project Settings > Human Review and press the button. Human Reviews can also be created directly from an Annotation Queue.
Create human review form with Key, Title, Description fields and a Type selector showing Categorical, Range, and Text options.
Three Human Review types are available:
  • Categorical: button options with custom labels, such as good/bad or saved/deleted
  • Range: a custom scoring slider, for example a scale from 0 to 100
  • Open field: free-form text input for detailed comments
Once created, a Human Review is available on all chat completion spans and responses spans in the project. No additional configuration or filtering is required.
Deleting a Human Review removes it from any Annotation Queues and Experiments that use it, so it no longer appears as a review option there. Annotations already recorded with that Human Review are preserved: every annotated data point remains stored and queryable.
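The validation rule described in this section (an annotation key must match an existing Human Review, and the value must fit that review's type) can be sketched locally as follows. This is a minimal illustration, not the platform's implementation: the `HumanReview` class, the `validate_annotation` function, and the type names `categorical`, `range`, and `text` are hypothetical names chosen for this example.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class HumanReview:
    """Hypothetical local model of a Human Review definition."""
    key: str
    type: str  # "categorical", "range", or "text" (names assumed for this sketch)
    options: list[str] = field(default_factory=list)  # labels for categorical reviews
    minimum: float = 0.0    # lower bound for range reviews
    maximum: float = 100.0  # upper bound for range reviews


def validate_annotation(reviews: dict[str, HumanReview], key: str, value: Any) -> None:
    """Raise ValueError if the value does not conform to the review stored under key."""
    review = reviews.get(key)
    if review is None:
        raise ValueError(f"no Human Review defined for key {key!r}")
    if review.type == "categorical":
        if value not in review.options:
            raise ValueError(f"{value!r} is not one of {review.options}")
    elif review.type == "range":
        # bool is a subclass of int, so reject it explicitly
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError("range value must be a number")
        if not review.minimum <= value <= review.maximum:
            raise ValueError(f"{value} is outside {review.minimum}..{review.maximum}")
    elif review.type == "text":
        if not isinstance(value, str) or not value.strip():
            raise ValueError("text value must be a non-empty string")
    else:
        raise ValueError(f"unknown Human Review type {review.type!r}")
```

Running a check like this before sending feedback programmatically surfaces mismatched keys or out-of-range values early, under the assumption that the local definitions mirror the ones configured in the project.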

Common Annotation Types (Legacy)

Rate the overall quality of AI responses:
Rating  Description
good    The response was helpful and accurate.
bad     The response was unhelpful or inaccurate.
Identify specific issues with AI responses:
Defect          Description
grammatical     Responses that contain grammatical errors
spelling        Responses that contain spelling errors
hallucination   Responses that contain hallucinations or factual inaccuracies
repetition      Responses that contain unnecessary repetition
inappropriate   Responses that are deemed inappropriate or offensive
off_topic       Responses that do not address the user’s query
incompleteness  Responses that are incomplete or partially address the query
ambiguity       Responses that are vague or unclear
Multiple defects can be selected for one response using an array-type Human Review.
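As a rough sketch of how a multi-select defect value could be checked before it is recorded, the snippet below validates a list of labels against the defect keys listed above and removes duplicates. The function name `normalize_defects` and the choice to reject unknown labels are assumptions made for this example, not documented platform behavior.

```python
# Defect keys taken from the list above.
ALLOWED_DEFECTS = frozenset({
    "grammatical", "spelling", "hallucination", "repetition",
    "inappropriate", "off_topic", "incompleteness", "ambiguity",
})


def normalize_defects(selected: list[str]) -> list[str]:
    """Return the selected defects without duplicates, in first-seen order.

    Raises ValueError when a label is not one of the known defect keys.
    """
    unknown = [label for label in selected if label not in ALLOWED_DEFECTS]
    if unknown:
        raise ValueError(f"unknown defect labels: {unknown}")
    # dict.fromkeys preserves insertion order, which de-duplicates stably
    return list(dict.fromkeys(selected))
```

For example, `normalize_defects(["hallucination", "off_topic", "hallucination"])` yields `["hallucination", "off_topic"]`.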

Use Annotations

Annotations can be applied wherever a Trace or span is reviewed:
  • Directly on a Trace or Log: open a single Trace or Log in the Traces or Logs view and use the Annotations panel.
  • In an Annotation Queue: review a curated set of Traces in bulk. Fill a queue with Trace Automations or by manually adding individual Traces or Logs.
  • Programmatically: apply feedback through the API and SDK.
  • In an Experiment: apply Human Reviews while reviewing experiment outputs.
Every annotation applied in an Annotation Queue is written back to its originating Trace. Because the values live on the Trace, they can be queried with the Orq MCP and used to run analysis across reviewed data.
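Once annotation values have been retrieved from reviewed Traces, a simple aggregate is often the first analysis to run. The sketch below computes the share of each rating label across a list of Trace records. The record shape (a dict with an `annotations` mapping) and the function name `rating_shares` are hypothetical; the actual structure returned by the platform may differ.

```python
from collections import Counter


def rating_shares(traces: list[dict], key: str = "rating") -> dict[str, float]:
    """Return the fraction of annotated traces per label for the given annotation key.

    Traces without the key are ignored. The record shape is assumed for this sketch.
    """
    counts = Counter(
        trace["annotations"][key]
        for trace in traces
        if key in trace.get("annotations", {})
    )
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: count / total for label, count in counts.items()}
```

Tracking a share like this over time gives a quick signal of whether reviewed quality is improving between releases.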
The annotation capabilities differ between Logs and Traces. Logs support both human feedback and corrections, while Traces only support human feedback annotations.
Navigate to the Traces view and select a single trace. The Annotations panel will be displayed, allowing you to apply human feedback to the AI response.
Trace detail panel for a claude-sonnet chat-completion showing Evaluations section with Defects, Interactions, and Rating feedback options including good/bad thumbs.

Create Annotation Queues

Annotation Queues help you organize and apply Human Reviews effectively to relevant incoming Traces.
To create an Annotation Queue, head to AI Studio > Annotation Queue and choose Create Annotation Queue. The following fields are configurable:
  • The Name of the queue
  • The Description of the Annotation Queue
  • The Human Reviews that Traces will be reviewed by
Create Annotation Queue panel with fields for name, description, and human reviews, showing Defects, Interactions, and Rating tags selected.

Fill Annotation Queues

Once a queue exists, fill it with the Traces to review. Traces can be added automatically or manually.
Use Trace Automations to route Traces into a queue based on configured rules. Add an Add to Annotation Queue action to an automation and select the target queue. As matching Traces arrive, they are added to the queue without manual effort, which keeps a steady stream of relevant Traces ready for review.
Edit Automation panel with a metadata filter on request_id, an Add to Annotation Queue action selecting the fireflies_annotation queue, and an Apply Evaluator action marked Coming soon.
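Conceptually, an automation with an Add to Annotation Queue action behaves like a rule that matches incoming Traces against a filter and appends the matches to a queue. The following sketch models that idea in memory. The `Automation` class, the exact-match metadata filter, and the `route_trace` function are hypothetical simplifications; real automations are configured in the product rather than in code.

```python
from dataclasses import dataclass


@dataclass
class Automation:
    """Hypothetical rule: route traces whose metadata matches the filter to a queue."""
    metadata_filter: dict[str, str]  # metadata key -> required value (exact match assumed)
    target_queue: str


def route_trace(trace: dict, automations: list[Automation], queues: dict[str, list[str]]) -> list[str]:
    """Add the trace id to every queue whose automation matches; return the matched queue names."""
    matched = []
    metadata = trace.get("metadata", {})
    for automation in automations:
        if all(metadata.get(k) == v for k, v in automation.metadata_filter.items()):
            queues.setdefault(automation.target_queue, []).append(trace["id"])
            matched.append(automation.target_queue)
    return matched
```

A Trace that matches several automations lands in each target queue, while a Trace that matches none is left alone.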

Use Annotation Queues

Open an Annotation Queue to step through its Traces one at a time in the review screen.
Annotation Queue review screen showing Item 7 of 43 in the header, a left panel with Inputs, Metrics (Latency, Cost, tokens), and Task (Model claude-haiku-4-5, Provider anthropic), a center panel with the System instructions, User input, and Assistant output, and a right Annotations panel with a comment field and a rating with good and bad buttons. A dataset selector and Add to dataset button sit at the bottom.
The screen is divided into three panels:
  • Left: details for the selected Trace.
    • Inputs: the variables mapped to inputs, when configured.
    • Metrics: latency, cost, and token usage.
    • Task: the model, provider, and other configuration parameters.
    The header shows the current position, the total number of items in the queue, and how many have already been reviewed.
  • Center: the full interaction for the selected Trace.
  • Right: the Annotations panel with the Human Reviews configured for the queue, such as a rating with categorical buttons or an open comment field. Selecting a value saves immediately and marks the Trace as reviewed.
Navigate between items with K (previous) and J (next), or use the up and down buttons at the top left. When a data point is worth reusing, select Add to dataset to send the Trace to a Dataset for use in a future Experiment.
Adding a Trace to a Dataset does not currently copy its annotations. The annotation values stay on the originating Trace, where they remain queryable via the Orq MCP.
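Because annotations are not carried into the Dataset, one workaround is to assemble curated rows outside the platform from Trace records and their annotation values. The sketch below keeps only Traces with a chosen rating and attaches a copy of the annotations to each row. The record fields (`id`, `input`, `output`, `annotations`) and the function name `build_dataset_rows` are assumptions for illustration, not the platform's export format.

```python
def build_dataset_rows(traces: list[dict], keep_rating: str = "good") -> list[dict]:
    """Build dataset rows from annotated traces that carry the requested rating.

    Each row keeps the trace id so annotations can still be traced back to their source.
    """
    rows = []
    for trace in traces:
        annotations = trace.get("annotations", {})
        if annotations.get("rating") != keep_rating:
            continue  # skip unrated traces and traces with a different rating
        rows.append({
            "trace_id": trace["id"],
            "input": trace["input"],
            "output": trace["output"],
            "annotations": dict(annotations),  # copy so later edits do not leak back
        })
    return rows
```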

Annotations in Experiments

Human Reviews can also be applied outside of Annotation Queues, while reviewing the outputs of an Experiment. In the experiment review screen, the Human Reviews defined for the project appear alongside Evaluator scores, so outputs can be annotated manually as part of an evaluation run.
Experiment review screen showing Response 1 of 20 for product-orchestrator-A, a left panel with Inputs, Expected, and Metrics, a center panel with the System instructions, User input, and Assistant output including function calls, and a right panel with an Annotations comment field and good/bad rating above an Evaluators section listing a json_check evaluator marked No.