Creating an LLM-as-a-judge Evaluator

To start building an LLM-as-a-judge Evaluator, head to a Project, use the + button, and select Evaluator.

The following modal opens:

Select the **LLM-as-a-judge** Evaluator type.

Configure Model & Output

Then select the model you would like to use to evaluate the output (the model needs to be enabled in your Model Garden).

Choose the type of output your evaluation will provide:

  • **Boolean**, if the evaluation generates a True/False response.
  • **Number**, if the evaluation generates a score.
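As an illustration only (not the platform's implementation), the sketch below shows how a judge model's raw text response could map onto these two output types; the `parse_judge_output` helper is hypothetical.

```python
# Hypothetical helper: interprets a judge model's raw text response
# according to the chosen output type (Boolean or Number).
def parse_judge_output(raw: str, output_type: str):
    text = raw.strip()
    if output_type == "boolean":
        # Boolean evaluators are expected to answer with True/False.
        return text.lower() in ("true", "yes", "1")
    if output_type == "number":
        # Number evaluators are expected to answer with a score, e.g. "7".
        return float(text)
    raise ValueError(f"Unknown output type: {output_type}")

# Example: a familiarity judge that replies "8" yields the score 8.0.
score = parse_judge_output("8", "number")
```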

Configure Prompt

Your prompt has access to two variables:

  • {{log.input}} contains the input message sent to the evaluated model
  • {{log.output}} contains the output response generated by the evaluated model
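For intuition, the following sketch (hypothetical, not platform code) shows how these placeholders are filled in with the evaluated call's actual input and output before the prompt is sent to the judge model; the `render_prompt` helper is illustrative only.

```python
# Illustrative only: replaces {{log.input}} and {{log.output}} placeholders
# with the evaluated model call's input message and generated output.
def render_prompt(template: str, log_input: str, log_output: str) -> str:
    return (template
            .replace("{{log.input}}", log_input)
            .replace("{{log.output}}", log_output))

template = (
    "Evaluate how accurate a response [OUTPUT] is compared to the query [INPUT].\n"
    "[INPUT] {{log.input}}\n"
    "[OUTPUT] {{log.output}}"
)
prompt = render_prompt(
    template,
    "What is the capital of France?",
    "Paris is the capital of France.",
)
```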

Example

Evaluating the familiarity of an output

Evaluate the familiarity of the [OUTPUT], give a score between 1 and 10, 1 being very formal, 10 being very familiar. Only output the score.

[OUTPUT] {{log.output}}

Evaluating the accuracy of a response

Evaluate how accurate a response [OUTPUT] is compared to the query [INPUT]. Give a score between 1 and 10, 1 being not accurate at all, 10 being perfectly accurate. Only output the score.

[INPUT] {{log.input}}
[OUTPUT] {{log.output}}

Guardrail Configuration

Within a Deployment, you can use your LLM-as-a-judge Evaluator as a Guardrail, enabling validation of the input and output of a deployment generation.

Enabling the Guardrail toggle blocks payloads that don't meet the required score or the expected Boolean response.
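Conceptually, the check behaves like the sketch below; the `guardrail_passes` function and its parameters are illustrative assumptions, not the platform's actual API.

```python
# Conceptual sketch of a guardrail decision: block the generation when the
# evaluator's result does not meet the configured expectation.
def guardrail_passes(result, *, min_score: float | None = None,
                     expected_bool: bool | None = None) -> bool:
    if min_score is not None:          # Number evaluator: require a minimum score.
        return float(result) >= min_score
    if expected_bool is not None:      # Boolean evaluator: require the expected answer.
        return bool(result) is expected_bool
    return True

# Example: a familiarity score of 3 fails a guardrail requiring at least 7.
blocked = not guardrail_passes(3, min_score=7)
```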

Once created, the Evaluator is available for use in Deployments. To learn more, see Evaluators & Guardrails in Deployments.