LLM/AI Model Evaluation — Overview

EP-046 extends the platform to become the first open-source experimentation platform with native LLM experiment support. Users can compare prompt versions, model variants (GPT-4 vs Claude vs Gemini), agent configurations, and system prompts against real business metrics — all without leaving the platform.


Why LLM Evaluation is Different from Traditional A/B Testing

Traditional A/B tests compare binary outcomes (conversion yes/no) or continuous metrics (revenue). LLM experiments introduce new dimensions:

| Dimension       | Traditional A/B            | LLM Experiment                     |
|-----------------|----------------------------|------------------------------------|
| Unit            | Page / feature variant     | Prompt / model config              |
| Response        | Binary or numeric          | Free-form text                     |
| Latency         | Milliseconds (deterministic) | 500 ms – 30 s (variable)         |
| Cost            | Fixed infra cost           | Per-token billing                  |
| Quality         | Implicit (conversion)      | Explicit (human rating / LLM judge) |
| Reproducibility | High                       | Low (temperature / sampling)       |

The platform handles all of this automatically: it tracks latency, estimates cost, supports human rating collection, and can run automated LLM-as-judge scoring.


Supported Providers and Models

| Provider  | Models                                                              | Notes                        |
|-----------|---------------------------------------------------------------------|------------------------------|
| Anthropic | claude-3-5-sonnet-20241022, claude-3-haiku-20240307, claude-opus-4-6 | Recommended judge model     |
| OpenAI    | gpt-4o, gpt-4o-mini, gpt-3.5-turbo                                  | Cost-efficient options available |
| Google    | gemini-1.5-pro, gemini-1.5-flash                                    | Flash: lowest cost per token |
| Cohere    | command-r-plus, command-r                                           | Enterprise RAG use cases     |
| Mistral   | mistral-large-latest, mistral-small-latest                          | EU-hosted option             |
| Local     | Any Ollama-compatible model                                         | Privacy / cost-free option   |

Configure API keys via environment variables:

LLM_ANTHROPIC_API_KEY=sk-ant-...
LLM_OPENAI_API_KEY=sk-...
LLM_GOOGLE_API_KEY=...

Task Types

Each LLM experiment is associated with a task type that describes what the model does:

| Task Type       | Description                              | Example Use Case           |
|-----------------|------------------------------------------|----------------------------|
| chat_completion | Multi-turn conversational interaction    | Customer support chatbot   |
| text_generation | Single-turn open-ended generation        | Product description writer |
| classification  | Categorise input into predefined classes | Sentiment analyser         |
| summarization   | Condense long text into shorter form     | Document summariser        |
| code_generation | Produce executable code                  | Programming assistant      |
| embedding       | Produce vector representations           | Semantic search            |

Evaluation Metrics

Choose the primary metric that defines "winning" for your experiment:

| Metric          | Description                                     | When to Use                     |
|-----------------|-------------------------------------------------|---------------------------------|
| business_metric | Downstream KPI (conversion, revenue, engagement) | Most production scenarios      |
| human_rating    | Human expert 1–5 score                          | High-stakes quality evaluation  |
| auto_eval_score | LLM-as-judge 0–1 score                          | Scalable quality evaluation     |
| latency         | Response time in milliseconds                   | Latency-sensitive applications  |
| cost            | Estimated USD per request                       | Budget-constrained deployments  |
| accuracy        | Binary correctness on ground truth              | Classification / QA tasks       |
| relevance       | Relevance to the query                          | Search / retrieval tasks        |
| fluency         | Natural language quality                        | Text generation tasks           |
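
Task type and primary metric come together when an experiment is defined. The dictionary below is a hypothetical experiment definition combining values from the two tables above; the field names (`task_type`, `primary_metric`, `traffic_split`) are illustrative and may not match the platform's actual schema.

```python
# Hypothetical experiment definition; field names are illustrative.
experiment = {
    "name": "support-bot-prompt-v2",
    "task_type": "chat_completion",          # from the Task Types table
    "primary_metric": "auto_eval_score",     # from the Evaluation Metrics table
    "variants": [
        {"name": "control", "model": "gpt-4o", "traffic_split": 0.5},
        {"name": "treatment", "model": "claude-3-5-sonnet-20241022", "traffic_split": 0.5},
    ],
}

# Traffic splits must cover the full population.
assert sum(v["traffic_split"] for v in experiment["variants"]) == 1.0
```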

Variant Assignment

Variants are assigned using consistent MD5 hashing (the same algorithm used by all platform SDKs). The same user_id always receives the same variant, ensuring reproducibility and preventing confounding from re-assignment.

The hash key is "{experiment_id}:{user_id}", mapped to a bucket in [0, 1) and distributed across variants according to their traffic_split values.
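
The bucketing described above can be sketched in a few lines. This is a minimal stand-alone implementation of the scheme the text describes (MD5 of `"{experiment_id}:{user_id}"`, mapped to `[0, 1)`, walked across cumulative traffic splits); the `assign_variant` helper and its variant-dict shape are illustrative, not the SDKs' actual code.

```python
import hashlib

def assign_variant(experiment_id: str, user_id: str, variants: list[dict]) -> str:
    """Deterministically map a user to a variant via MD5 bucketing.

    `variants` is a list of {"name": str, "traffic_split": float} dicts
    whose splits sum to 1.0 (field names are illustrative).
    """
    key = f"{experiment_id}:{user_id}"
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    bucket = int(digest, 16) / 2**128  # 128-bit digest -> [0, 1)
    cumulative = 0.0
    for variant in variants:
        cumulative += variant["traffic_split"]
        if bucket < cumulative:
            return variant["name"]
    return variants[-1]["name"]  # guard against float rounding
```

Because the hash depends only on the experiment and user IDs, repeated calls for the same user always return the same variant.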


LLM-as-Judge Scoring

The platform supports automated quality evaluation using a judge LLM (default: claude-3-5-sonnet-20241022). The judge is prompted:

Rate the following AI response on '{criteria}' using a score from 0 (terrible)
to 1 (excellent).

Original prompt: {prompt}
Response: {response}

Reply with ONLY: {"score": <float 0-1>, "reasoning": "<one sentence>"}

Judge scores are stored as auto_eval_score on each evaluation record and included in the results analytics.
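
A minimal sketch of the judge round-trip: building the prompt from the template above and parsing the JSON reply into a clamped score. The actual model call is omitted (it would go through the provider API); `build_judge_prompt` and `parse_judge_reply` are illustrative helper names, not the platform's API.

```python
import json

# Template reproduced from the judge prompt above.
JUDGE_TEMPLATE = (
    "Rate the following AI response on '{criteria}' using a score from 0 (terrible)\n"
    "to 1 (excellent).\n\n"
    "Original prompt: {prompt}\n"
    "Response: {response}\n\n"
    'Reply with ONLY: {{"score": <float 0-1>, "reasoning": "<one sentence>"}}'
)

def build_judge_prompt(criteria: str, prompt: str, response: str) -> str:
    """Fill the judge template; the result is sent to the judge model."""
    return JUDGE_TEMPLATE.format(criteria=criteria, prompt=prompt, response=response)

def parse_judge_reply(raw: str) -> tuple[float, str]:
    """Parse the judge's JSON reply, clamping the score into [0, 1]."""
    data = json.loads(raw)
    score = min(max(float(data["score"]), 0.0), 1.0)
    return score, data.get("reasoning", "")
```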


Cost Estimation

The platform automatically estimates the cost of each LLM call using a built-in cost table:

claude-3-5-sonnet-20241022: $0.003/1k input + $0.015/1k output
gpt-4o:                     $0.0025/1k input + $0.010/1k output
gemini-1.5-flash:           $0.000075/1k input + $0.0003/1k output

Cost data is visible per-variant in the results dashboard, enabling cost/quality trade-off analysis.
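
The per-call arithmetic reduces to two rate lookups. The sketch below uses only the rates printed in the table above; the `estimate_cost` function name and the fallback of 0.0 for unknown models are assumptions, not confirmed platform behaviour.

```python
# Per-1k-token USD rates (input, output), taken from the cost table above.
COST_TABLE = {
    "claude-3-5-sonnet-20241022": (0.003, 0.015),
    "gpt-4o": (0.0025, 0.010),
    "gemini-1.5-flash": (0.000075, 0.0003),
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate USD cost of one call; unknown models fall back to 0.0."""
    rate_in, rate_out = COST_TABLE.get(model, (0.0, 0.0))
    return input_tokens / 1000 * rate_in + output_tokens / 1000 * rate_out
```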


Statistical Analysis

Results include:

  • Mean and 95% confidence interval for latency, cost, auto-eval score, and business metric
  • Welch's t-test p-value (treatment vs control) for continuous metrics
  • Cohen's d effect size for business metric and auto-eval score
  • Winner determination based on the highest mean business metric (falls back to auto-eval score)
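
The two test statistics above are straightforward to compute by hand. This stdlib-only sketch shows Welch's t statistic and Cohen's d; it is not the platform's implementation, and a p-value for the t statistic would come from the t distribution (e.g. `scipy.stats.ttest_ind(a, b, equal_var=False)`).

```python
from statistics import mean, variance

def welch_t(a: list[float], b: list[float]) -> float:
    """Welch's t statistic for two samples with unequal variances."""
    va, vb = variance(a), variance(b)
    return (mean(a) - mean(b)) / ((va / len(a) + vb / len(b)) ** 0.5)

def cohens_d(a: list[float], b: list[float]) -> float:
    """Cohen's d effect size using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_sd = (((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)) ** 0.5
    return (mean(a) - mean(b)) / pooled_sd
```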