- Each Run executes model generations using configured Inputs and Messages from a Dataset.
- After a Run completes, Latency, Cost, and Time to First Token metrics are recorded for each generation.
- Results can be reviewed manually or validated automatically with Evaluators and Human Review, allowing for comparison across an Expected Output.
Get Started with Experiments, see Creating an Experiment.