LLM/AI Model Evaluation — Overview
EP-046 extends the platform to become the first open-source experimentation platform with native LLM experiment support. Users can compare prompt versions, model variants (GPT-4 vs Claude vs Gemini), agent configurations, and system prompts against real business metrics — all without leaving the platform.
Why LLM Evaluation is Different from Traditional A/B Testing
Traditional A/B tests compare binary outcomes (conversion yes/no) or continuous metrics (revenue). LLM experiments introduce new dimensions:
| Dimension | Traditional A/B | LLM Experiment |
|---|---|---|
| Unit | Page / feature variant | Prompt / model config |
| Response | Binary or numeric | Free-form text |
| Latency | Milliseconds (deterministic) | 500 ms – 30 s (variable) |
| Cost | Fixed infra cost | Per-token billing |
| Quality | Implicit (conversion) | Explicit (human rating / LLM judge) |
| Reproducibility | High | Low (temperature / sampling) |
The platform handles all of this automatically: it tracks latency, estimates cost, supports human rating collection, and can run automated LLM-as-judge scoring.
Supported Providers and Models
| Provider | Models | Notes |
|---|---|---|
| Anthropic | claude-3-5-sonnet-20241022, claude-3-haiku-20240307, claude-opus-4-6 | Recommended judge model |
| OpenAI | gpt-4o, gpt-4o-mini, gpt-3.5-turbo | Cost-efficient options available |
| Google | gemini-1.5-pro, gemini-1.5-flash | Flash: lowest cost per token |
| Cohere | command-r-plus, command-r | Enterprise RAG use cases |
| Mistral | mistral-large-latest, mistral-small-latest | EU-hosted option |
| Local | Any Ollama-compatible model | Privacy / cost-free option |
Configure API keys via environment variables:
```bash
LLM_ANTHROPIC_API_KEY=sk-ant-...
LLM_OPENAI_API_KEY=sk-...
LLM_GOOGLE_API_KEY=...
```
Task Types
Each LLM experiment is associated with a task type that describes what the model does:
| Task Type | Description | Example Use Case |
|---|---|---|
| chat_completion | Multi-turn conversational interaction | Customer support chatbot |
| text_generation | Single-turn open-ended generation | Product description writer |
| classification | Categorise input into predefined classes | Sentiment analyser |
| summarization | Condense long text into shorter form | Document summariser |
| code_generation | Produce executable code | Programming assistant |
| embedding | Produce vector representations | Semantic search |
Evaluation Metrics
Choose the primary metric that defines "winning" for your experiment:
| Metric | Description | When to Use |
|---|---|---|
| business_metric | Downstream KPI (conversion, revenue, engagement) | Most production scenarios |
| human_rating | Human expert 1–5 score | High-stakes quality evaluation |
| auto_eval_score | LLM-as-judge 0–1 score | Scalable quality evaluation |
| latency | Response time in milliseconds | Latency-sensitive applications |
| cost | Estimated USD per request | Budget-constrained deployments |
| accuracy | Binary correctness on ground truth | Classification / QA tasks |
| relevance | Relevance to the query | Search / retrieval tasks |
| fluency | Natural language quality | Text generation tasks |
Variant Assignment
Variants are assigned using consistent MD5 hashing (the same algorithm used by all platform SDKs). The same user_id always receives the same variant, ensuring reproducibility and preventing confounding from re-assignment.
The hash key is "{experiment_id}:{user_id}", mapped to a bucket in [0, 1) and distributed across variants according to their traffic_split values.
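The assignment scheme above can be sketched as follows. The hash key format, MD5, and the [0, 1) bucket come from the description; the exact bucket width and tie-breaking in the platform SDKs may differ.

```python
import hashlib


def assign_variant(experiment_id: str, user_id: str,
                   traffic_split: dict[str, float]) -> str:
    """Deterministically map a user to a variant.

    Hashes "{experiment_id}:{user_id}" with MD5, maps the digest to a
    bucket in [0, 1), then walks the cumulative traffic_split weights.
    """
    key = f"{experiment_id}:{user_id}".encode("utf-8")
    digest = hashlib.md5(key).hexdigest()
    # Use the first 8 hex chars (32 bits) as a uniform bucket in [0, 1).
    bucket = int(digest[:8], 16) / 2**32
    cumulative = 0.0
    for variant, weight in traffic_split.items():
        cumulative += weight
        if bucket < cumulative:
            return variant
    # Guard against floating-point rounding in the weights.
    return list(traffic_split)[-1]
```

Because the bucket depends only on the hash key, the same `user_id` always lands on the same variant for a given experiment.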
LLM-as-Judge Scoring
The platform supports automated quality evaluation using a judge LLM (default: claude-3-5-sonnet-20241022). The judge is prompted:
```
Rate the following AI response on '{criteria}' using a score from 0 (terrible)
to 1 (excellent).
Original prompt: {prompt}
Response: {response}
Reply with ONLY: {"score": <float 0-1>, "reasoning": "<one sentence>"}
```
Judge scores are stored as auto_eval_score on each evaluation record and included in the results analytics.
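A sketch of the judge round-trip: fill the template above, call the judge, and parse the JSON reply. `call_model` is a hypothetical stand-in for whatever client sends a prompt to the judge model and returns its text; the template itself is taken from the documentation above.

```python
import json
from typing import Callable

JUDGE_TEMPLATE = (
    "Rate the following AI response on '{criteria}' using a score from 0 (terrible)\n"
    "to 1 (excellent).\n"
    "Original prompt: {prompt}\n"
    "Response: {response}\n"
    'Reply with ONLY: {{"score": <float 0-1>, "reasoning": "<one sentence>"}}'
)


def judge_response(call_model: Callable[[str], str], criteria: str,
                   prompt: str, response: str) -> dict:
    """Format the judge prompt, call the judge model, and parse its JSON reply."""
    judge_prompt = JUDGE_TEMPLATE.format(
        criteria=criteria, prompt=prompt, response=response)
    reply = call_model(judge_prompt)
    result = json.loads(reply)
    score = float(result["score"])
    if not 0.0 <= score <= 1.0:
        raise ValueError(f"judge score out of range: {score}")
    return {"score": score, "reasoning": result.get("reasoning", "")}
```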
Cost Estimation
The platform automatically estimates the cost of each LLM call using a built-in cost table:
| Model | Input ($/1k tokens) | Output ($/1k tokens) |
|---|---|---|
| claude-3-5-sonnet-20241022 | $0.003 | $0.015 |
| gpt-4o | $0.0025 | $0.010 |
| gemini-1.5-flash | $0.000075 | $0.0003 |
Cost data is visible per-variant in the results dashboard, enabling cost/quality trade-off analysis.
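The per-call estimate is a straightforward lookup against the cost table. This is a sketch using only the three rates listed above; the platform's built-in table covers more models.

```python
# Rates from the built-in cost table above, in USD per 1,000 tokens
# (input rate, output rate).
COST_PER_1K = {
    "claude-3-5-sonnet-20241022": (0.003, 0.015),
    "gpt-4o": (0.0025, 0.010),
    "gemini-1.5-flash": (0.000075, 0.0003),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the USD cost of one call under per-token billing."""
    input_rate, output_rate = COST_PER_1K[model]
    return (input_tokens / 1000) * input_rate + (output_tokens / 1000) * output_rate
```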
Statistical Analysis
Results include:
- Mean and 95% confidence interval for latency, cost, auto-eval score, and business metric
- Welch's t-test p-value (treatment vs control) for continuous metrics
- Cohen's d effect size for business metric and auto-eval score
- Winner determination based on the highest mean business metric (falls back to auto-eval score)
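The statistics above can be sketched in plain Python. The Welch t statistic, Cohen's d, and the 95% interval follow the standard textbook formulas; converting the t statistic to a p-value needs the t-distribution CDF (e.g. via `scipy.stats`), which is omitted here to keep the sketch dependency-free.

```python
from math import sqrt
from statistics import mean, stdev


def welch_t(a: list[float], b: list[float]) -> tuple[float, float]:
    """Welch's t statistic and degrees of freedom (no equal-variance assumption)."""
    va, vb = stdev(a) ** 2, stdev(b) ** 2
    na, nb = len(a), len(b)
    se2 = va / na + vb / nb
    t = (mean(a) - mean(b)) / sqrt(se2)
    df = se2 ** 2 / ((va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1))
    return t, df


def cohens_d(a: list[float], b: list[float]) -> float:
    """Cohen's d effect size using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled = sqrt(((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2)
                  / (na + nb - 2))
    return (mean(a) - mean(b)) / pooled


def ci95(x: list[float]) -> tuple[float, float]:
    """Normal-approximation 95% confidence interval for the mean."""
    m, se = mean(x), stdev(x) / sqrt(len(x))
    return m - 1.96 * se, m + 1.96 * se
```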