layout: default
title: "Langfuse Tutorial - Chapter 6: Datasets & Testing"
nav_order: 6
has_children: false
parent: Langfuse Tutorial

Chapter 6: Datasets & Testing

Welcome to Chapter 6: Datasets & Testing. In this part of Langfuse Tutorial: LLM Observability, Evaluation, and Prompt Operations, you will build an intuitive mental model first, then move into concrete implementation details and practical production tradeoffs.

Create test suites from production traces and run regression tests.

Previous: Chapter 5: Analytics & Metrics

Overview

When you update a prompt, swap a model, or change your retrieval logic, how do you know things did not get worse? The answer is datasets. Langfuse lets you extract real examples from production traces, curate them into test datasets, and run automated regression tests. This gives you confidence that changes improve quality -- or at least do not degrade it.

In this chapter, you will learn how to create datasets from traces, design effective test suites, run evaluations against them, set up CI/CD pipelines for automated testing, manage dataset versions, and compare model performance systematically.

Dataset Workflow

Here is the high-level workflow for using datasets in Langfuse:

flowchart LR
    A[Production Traces] --> B[Filter High-Quality Examples]
    B --> C[Curate Dataset Items]
    C --> D[Run Tests Against Dataset]
    D --> E[Evaluate Results]
    E --> F{Quality OK?}
    F -->|Yes| G[Deploy Changes]
    F -->|No| H[Iterate on Prompt/Model]
    H --> D
    G --> A

The cycle is continuous: production traces feed new dataset items, which you use to test changes, which then go back into production and generate more traces. Over time, your test suite grows stronger and more representative.

Dataset Design Principles

Before creating your first dataset, it helps to think about what makes a good one. A poorly designed dataset will give you false confidence, while a well-designed one will catch real regressions.

Principle 1: Representativeness

Your dataset should reflect the actual distribution of queries your application receives. If 60% of your traffic is simple FAQ questions and 10% is complex multi-step reasoning, your dataset should roughly mirror that ratio. Otherwise, you will optimize for edge cases while ignoring the common case.
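One way to apply this principle is to turn traffic shares into per-category item quotas. The sketch below is illustrative only: `allocate_quota` is a hypothetical helper (not part of the Langfuse SDK) and the category names and shares are made up. It uses largest-remainder rounding so the quotas always add up to the target dataset size.

```python
from math import floor


def allocate_quota(traffic_share: dict[str, float], dataset_size: int) -> dict[str, int]:
    """Split dataset_size across categories in proportion to their traffic share."""
    total = sum(traffic_share.values())
    if total <= 0:
        raise ValueError("traffic shares must sum to a positive number")

    # Ideal (fractional) number of items per category.
    ideal = {cat: dataset_size * share / total for cat, share in traffic_share.items()}
    quota = {cat: floor(n) for cat, n in ideal.items()}

    # Hand out the items lost to rounding, largest fractional part first.
    leftover = dataset_size - sum(quota.values())
    by_remainder = sorted(ideal, key=lambda cat: ideal[cat] - quota[cat], reverse=True)
    for cat in by_remainder[:leftover]:
        quota[cat] += 1
    return quota


# Hypothetical traffic mix, expressed in percent.
shares = {"faq": 60, "troubleshooting": 30, "multi_step": 10}
print(allocate_quota(shares, 50))
# → {'faq': 30, 'troubleshooting': 15, 'multi_step': 5}
```

The quotas are a starting point for curation, not a hard rule; rare but important categories may deserve more items than their traffic share suggests.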

Principle 2: Diversity

Include examples that cover different categories of input:

  • Different user intents: questions, commands, clarifications, complaints
  • Different input lengths: short queries, long documents, multi-turn conversations
  • Different languages: if your app serves multiple locales
  • Different difficulty levels: easy questions your model always gets right, hard ones it sometimes misses

Principle 3: Include Failure Cases

Do not only include examples where your model succeeded. Include cases where it failed, hallucinated, or gave a mediocre response. These are the most valuable test cases because they tell you whether a change actually fixes known problems.

Principle 4: Manageable Size

Start small. A dataset of 50-100 well-chosen examples is far more useful than 10,000 random traces. You can always expand later, but a small, curated dataset is faster to iterate on and easier to review manually.

Creating Datasets from Traces

The most natural way to build a dataset is to extract examples from production:

from langfuse import Langfuse

langfuse = Langfuse()

# Create a new dataset
dataset = langfuse.create_dataset(
    name="support-qa-v1",
    description="Real customer support conversations, curated for testing",
)

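# Fetch high-quality production traces.
# NOTE: the `filters` argument is schematic. The trace-query method and its
# parameters differ between SDK versions, so check the reference for the
# version you have installed; score filters may need to be applied client-side.
traces = langfuse.get_traces(
    limit=200,
    filters={
        "tags": ["production"],
        "scores": {"helpfulness": {"gte": 0.9}},
    }
)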

# Add selected traces as dataset items
for trace in traces[:50]:  # Start with 50 examples
    langfuse.create_dataset_item(
        dataset_name="support-qa-v1",
        input=trace.input,
        expected_output=trace.output,
        metadata={
            "source_trace_id": trace.id,
            "original_scores": trace.scores,
            "tags": trace.tags,
        }
    )

Dataset Item Structure

Each dataset item has a clear structure:

{
    "input": {
        "messages": [
            {"role": "user", "content": "How do I reset my password?"}
        ],
        "context": "User is on the settings page.",
    },
    "expected_output": "To reset your password, go to Settings > Security > Change Password.",
    "metadata": {
        "tags": ["production", "high-quality"],
        "scores": {"helpfulness": 0.95, "accuracy": 1.0},
        "source_trace_id": "trace_abc123",
        "category": "faq",
        "difficulty": "easy",
    }
}

The metadata field is flexible -- use it to tag items by category, difficulty, language, or anything else useful for filtering and analysis.
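Because metadata is free-form, it can help to check items against a team convention before uploading them. The following is a minimal sketch under assumed conventions: `validate_item`, the required `category` key, and the allowed difficulty values are all hypothetical choices, not requirements of Langfuse.

```python
REQUIRED_KEYS = ("input", "expected_output")
ALLOWED_DIFFICULTY = {"easy", "medium", "hard"}  # assumed team convention


def validate_item(item: dict) -> list[str]:
    """Return a list of problems found in a dataset item (empty if none)."""
    problems = []
    for key in REQUIRED_KEYS:
        if item.get(key) in (None, "", {}, []):
            problems.append(f"missing or empty '{key}'")

    metadata = item.get("metadata") or {}
    if "category" not in metadata:
        problems.append("metadata has no 'category'")
    difficulty = metadata.get("difficulty")
    if difficulty is not None and difficulty not in ALLOWED_DIFFICULTY:
        problems.append(f"unknown difficulty '{difficulty}'")
    return problems


item = {
    "input": {"messages": [{"role": "user", "content": "How do I reset my password?"}]},
    "expected_output": "To reset your password, go to Settings > Security > Change Password.",
    "metadata": {"category": "faq", "difficulty": "easy"},
}
print(validate_item(item))  # an empty list means no problems were found
```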

Running Tests Against Datasets

Once you have a dataset, you can run your pipeline against it and compare the results:

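from datetime import datetime

# Assumes `langfuse` is the client created earlier, and that your application
# provides generate_response() (your pipeline) and evaluate_similarity()
# (a scorer that returns a float between 0 and 1).
def run_dataset_test(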
    dataset_name: str,
    run_name: str,
    model: str = "gpt-4o-mini",
    prompt_version: str = "v1",
):
    """Run a test against a dataset and log results to Langfuse."""
    dataset = langfuse.get_dataset(dataset_name)
    results = []
    scores = []

    for item in dataset.items:
        # Run your pipeline with the current configuration
        trace = langfuse.trace(
            name="dataset-test",
            metadata={"dataset": dataset_name, "model": model, "prompt_version": prompt_version}
        )

        actual_output = generate_response(
            item.input,
            model=model,
            prompt_version=prompt_version,
        )

        # Evaluate the result
        score = evaluate_similarity(actual_output, item.expected_output)
        scores.append(score)

        # Link the trace to the dataset item
        item.link(trace, run_name=run_name)

        # Score the trace
        trace.score(name="accuracy", value=score)

        results.append({
            "item_id": item.id,
            "expected": item.expected_output,
            "actual": actual_output,
            "score": score,
        })

    # Send any buffered events to Langfuse before reporting
    langfuse.flush()

    # Summary statistics
    avg_score = sum(scores) / len(scores) if scores else 0
    min_score = min(scores) if scores else 0
    pass_rate = sum(1 for s in scores if s >= 0.8) / len(scores) if scores else 0

    print(f"Run: {run_name}")
    print(f"Average score: {avg_score:.3f}")
    print(f"Min score: {min_score:.3f}")
    print(f"Pass rate (>= 0.8): {pass_rate:.1%}")

    return {"avg_score": avg_score, "min_score": min_score, "pass_rate": pass_rate, "results": results}
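The function above calls `evaluate_similarity`, which this chapter does not define. As a stand-in for experimenting, here is a minimal sketch that assumes plain lexical similarity is good enough; it uses the standard library's `difflib` and will score paraphrases poorly, so an embedding-based or LLM-as-judge scorer may be a better fit for real use.

```python
from difflib import SequenceMatcher


def evaluate_similarity(actual: str, expected: str) -> float:
    """Return a lexical similarity score between 0.0 and 1.0."""
    # Normalize case and whitespace so trivial formatting differences
    # do not lower the score.
    a = " ".join(str(actual).lower().split())
    b = " ".join(str(expected).lower().split())
    if not a and not b:
        return 1.0  # two empty answers are treated as identical
    return SequenceMatcher(None, a, b).ratio()


print(evaluate_similarity(
    "Go to Settings > Security > Change Password.",
    "To reset your password, go to Settings > Security > Change Password.",
))
```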

Interpreting Results

When reviewing test results, look at more than just the average score:

  • Average score: The overall quality level. Should stay stable or improve.
  • Minimum score: The worst-case result. A single terrible answer can damage user trust.
  • Pass rate: The percentage of items that meet your quality threshold. This is often the most actionable metric.
  • Score distribution: Are scores clustered around 0.9, or spread between 0.5 and 1.0? A tight distribution means consistent quality.
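The metrics in this list can be computed together from the raw scores. The sketch below assumes scores are floats between 0 and 1 and a pass threshold of 0.8, as in the test function above; `summarize_scores` is a hypothetical helper name.

```python
from statistics import mean, pstdev, quantiles


def summarize_scores(scores: list[float], threshold: float = 0.8) -> dict[str, float]:
    """Summarize a run: level, worst case, pass rate, and spread."""
    if not scores:
        raise ValueError("scores must not be empty")

    # quantiles() needs at least two data points; with one, reuse that value.
    if len(scores) >= 2:
        q1, median, q3 = quantiles(scores, n=4, method="inclusive")
    else:
        q1 = median = q3 = scores[0]

    return {
        "avg": mean(scores),
        "min": min(scores),
        "pass_rate": sum(s >= threshold for s in scores) / len(scores),
        "stdev": pstdev(scores),  # a small value indicates a tight distribution
        "p25": q1,
        "median": median,
        "p75": q3,
    }


print(summarize_scores([0.5, 0.9, 0.9, 1.0]))
```

A large gap between `min` and `p25` usually points at a few outliers worth reviewing by hand, while a wide `p25`-`p75` range suggests inconsistent quality across the whole dataset.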

Golden Dataset Creation

A "golden dataset" is a curated set of examples with expert-verified expected outputs. These are your highest-confidence test cases.

def create_golden_dataset():
    """Create a golden dataset from manually reviewed traces."""
    # Step 1: Filter traces with high scores
    traces = langfuse.get_traces(
        limit=500,
        filters={"scores": {"helpfulness": {"gte": 0.95}, "accuracy": {"gte": 0.95}}}
    )

    # Step 2: Create the dataset
    dataset = langfuse.create_dataset(
        name="golden-support-v1",
        description="Expert-verified examples for regression testing",
    )

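    # Step 3: Add items with the manual review flag. This loop does not review
    # anything itself: only add traces that a human expert has already verified.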
    for trace in traces[:30]:  # Start with 30 golden examples
        langfuse.create_dataset_item(
            dataset_name="golden-support-v1",
            input=trace.input,
            expected_output=trace.output,
            metadata={
                "reviewed_by": "human-expert",
                "review_date": datetime.now().isoformat(),
                "confidence": "high",
            }
        )

    return dataset

The golden dataset should be small (30-50 items), carefully reviewed, and updated only when you are certain about the expected outputs.

Benchmark Datasets

Beyond testing your own application, you may want benchmark datasets that let you compare models and configurations objectively. Benchmarks differ from golden datasets in that they focus on standardized tasks rather than your specific use case.

Creating a Benchmark Suite

def create_benchmark_dataset():
    """Create a benchmark dataset for comparing models."""
    benchmarks = [
        {
            "input": {"messages": [{"role": "user", "content": "Summarize this in one sentence: [article text]"}]},
            "expected_output": "A concise one-sentence summary.",
            "metadata": {"category": "summarization", "difficulty": "medium"},
        },
        {
            "input": {"messages": [{"role": "user", "content": "Translate to French: Hello, how are you?"}]},
            "expected_output": "Bonjour, comment allez-vous ?",
            "metadata": {"category": "translation", "difficulty": "easy"},
        },
        {
            "input": {"messages": [{"role": "user", "content": "What is 15% of 240?"}]},
            "expected_output": "36",
            "metadata": {"category": "math", "difficulty": "easy"},
        },
        # Add more benchmark items covering different capabilities
    ]

    dataset = langfuse.create_dataset(
        name="model-benchmark-v1",
        description="Standardized benchmark for comparing model performance",
    )

    for item in benchmarks:
        langfuse.create_dataset_item(
            dataset_name="model-benchmark-v1",
            input=item["input"],
            expected_output=item["expected_output"],
            metadata=item["metadata"],
        )

Running Benchmarks Across Models

models_to_test = ["gpt-4o-mini", "gpt-4o", "claude-3-haiku-20240307"]

benchmark_results = {}
for model in models_to_test:
    result = run_dataset_test(
        dataset_name="model-benchmark-v1",
        run_name=f"benchmark-{model}-{datetime.now().strftime('%Y%m%d')}",
        model=model,
    )
    benchmark_results[model] = result

# Print comparison table
print(f"\n{'Model':<30} {'Avg Score':<12} {'Pass Rate':<12} {'Min Score':<12}")
print("-" * 66)
for model, result in benchmark_results.items():
    print(f"{model:<30} {result['avg_score']:<12.3f} {result['pass_rate']:<12.1%} {result['min_score']:<12.3f}")

Comparing Model Performance

When you are deciding between models (or between prompt versions), you need a systematic approach to comparison.

Head-to-Head Comparison

def compare_models(dataset_name: str, model_a: str, model_b: str):
    """Run the same dataset against two models and compare results."""
    result_a = run_dataset_test(dataset_name, f"compare-{model_a}", model=model_a)
    result_b = run_dataset_test(dataset_name, f"compare-{model_b}", model=model_b)

    print(f"\nComparison: {model_a} vs {model_b}")
    print(f"{'Metric':<20} {model_a:<20} {model_b:<20}")
    print("-" * 60)
    print(f"{'Avg Score':<20} {result_a['avg_score']:<20.3f} {result_b['avg_score']:<20.3f}")
    print(f"{'Pass Rate':<20} {result_a['pass_rate']:<20.1%} {result_b['pass_rate']:<20.1%}")
    print(f"{'Min Score':<20} {result_a['min_score']:<20.3f} {result_b['min_score']:<20.3f}")

    # Per-item comparison to find where models differ
    wins_a, wins_b, ties = 0, 0, 0
    for ra, rb in zip(result_a["results"], result_b["results"]):
        if ra["score"] > rb["score"] + 0.05:
            wins_a += 1
        elif rb["score"] > ra["score"] + 0.05:
            wins_b += 1
        else:
            ties += 1

    print(f"\n{model_a} wins: {wins_a}  |  {model_b} wins: {wins_b}  |  Ties: {ties}")

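This shows which model scores higher overall and how often each one wins on individual items. To see where each model is stronger, break the per-item results down by a metadata field such as category or difficulty.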
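The win counting above pairs results by position, which assumes both runs returned items in the same order. A slightly more defensive sketch pairs them by `item_id` and keeps the IDs so you can inspect the disagreements; `paired_comparison` is a hypothetical helper that operates on the `results` lists returned by `run_dataset_test`.

```python
def paired_comparison(results_a: list[dict], results_b: list[dict], margin: float = 0.05) -> dict:
    """Pair two runs by item_id and record which items each side wins."""
    scores_b = {r["item_id"]: r["score"] for r in results_b}
    outcome = {"a_wins": [], "b_wins": [], "ties": [], "unmatched": []}

    for r in results_a:
        item_id = r["item_id"]
        if item_id not in scores_b:
            # Item missing from run B (for example, the run failed part-way).
            outcome["unmatched"].append(item_id)
            continue
        diff = r["score"] - scores_b[item_id]
        if diff > margin:
            outcome["a_wins"].append(item_id)
        elif diff < -margin:
            outcome["b_wins"].append(item_id)
        else:
            outcome["ties"].append(item_id)
    return outcome


run_a = [{"item_id": "1", "score": 0.9}, {"item_id": "2", "score": 0.5}]
run_b = [{"item_id": "2", "score": 0.9}, {"item_id": "1", "score": 0.6}]
print(paired_comparison(run_a, run_b))
# → {'a_wins': ['1'], 'b_wins': ['2'], 'ties': [], 'unmatched': []}
```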

Dataset Versioning Strategy

As your application evolves, your datasets need to evolve too. Here is a practical versioning approach.

Naming Convention

Use a clear naming scheme: {purpose}-{version}, for example support-qa-v2 or golden-support-v3. Include a description that explains what changed.
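If you adopt this convention, a small helper can derive the next name and reject names that do not follow it. This is a sketch assuming integer versions; `next_version_name` is a hypothetical helper, not a Langfuse function.

```python
import re

# Matches names such as "support-qa-v2": a purpose, then "-v", then an integer.
_VERSIONED = re.compile(r"^(?P<purpose>.+)-v(?P<version>\d+)$")


def next_version_name(name: str) -> str:
    """Return the dataset name with its version number incremented by one."""
    match = _VERSIONED.match(name)
    if match is None:
        raise ValueError(f"{name!r} does not follow the {{purpose}}-v{{N}} convention")
    return f"{match['purpose']}-v{int(match['version']) + 1}"


print(next_version_name("support-qa-v2"))      # → support-qa-v3
print(next_version_name("golden-support-v9"))  # → golden-support-v10
```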

Versioning Workflow

from typing import Optional

def create_new_dataset_version(
    source_name: str,
    new_version: str,
    add_items: Optional[list] = None,
    remove_item_ids: Optional[list] = None,
):
    """Create a new version of a dataset with modifications."""
    source = langfuse.get_dataset(source_name)
    new_name = f"{source_name.rsplit('-v', 1)[0]}-v{new_version}"

    new_dataset = langfuse.create_dataset(
        name=new_name,
        description=f"Version {new_version}. Based on {source_name}.",
    )

    # Copy existing items (minus removals)
    remove_ids = set(remove_item_ids or [])
    item_count = 0
    for item in source.items:
        if item.id not in remove_ids:
            langfuse.create_dataset_item(
                dataset_name=new_name,
                input=item.input,
                expected_output=item.expected_output,
                # item.metadata may be None, so fall back to an empty dict
                metadata={**(item.metadata or {}), "copied_from": source_name},
            )
            item_count += 1

    # Add new items
    for new_item in (add_items or []):
        langfuse.create_dataset_item(
            dataset_name=new_name,
            input=new_item["input"],
            expected_output=new_item["expected_output"],
            metadata=new_item.get("metadata", {}),
        )
        item_count += 1

    # Count items as they are written: the object returned by create_dataset()
    # was created before any items existed, so it cannot report them.
    print(f"Created {new_name} with {item_count} items")
    return new_dataset

When to Create a New Version

  • Adding examples: You found new edge cases in production that should be tested.
  • Fixing labels: The expected output for some items was wrong and has been corrected.
  • Removing stale items: Your application changed and some test cases are no longer relevant.
  • Major prompt changes: A significant prompt rewrite may require updated expected outputs.
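To describe what changed between two versions (for the dataset description or a changelog), you can diff them by input. The sketch below assumes items have been exported to plain dicts with `input` and `expected_output` keys; `fingerprint` and `diff_versions` are hypothetical helpers.

```python
import hashlib
import json


def fingerprint(item_input) -> str:
    """Stable hash of an item's input, independent of dict key order."""
    canonical = json.dumps(item_input, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def diff_versions(old_items: list[dict], new_items: list[dict]) -> dict:
    """Classify inputs as added, removed, or relabeled between two versions."""
    old = {fingerprint(i["input"]): i for i in old_items}
    new = {fingerprint(i["input"]): i for i in new_items}
    return {
        "added": [new[k]["input"] for k in new if k not in old],
        "removed": [old[k]["input"] for k in old if k not in new],
        # Same input in both versions, but the expected output was corrected.
        "relabeled": [
            new[k]["input"]
            for k in new
            if k in old and new[k]["expected_output"] != old[k]["expected_output"]
        ],
    }


v1 = [{"input": {"q": "a"}, "expected_output": "1"}, {"input": {"q": "b"}, "expected_output": "2"}]
v2 = [{"input": {"q": "a"}, "expected_output": "1 (fixed)"}, {"input": {"q": "c"}, "expected_output": "3"}]
print(diff_versions(v1, v2))
# → {'added': [{'q': 'c'}], 'removed': [{'q': 'b'}], 'relabeled': [{'q': 'a'}]}
```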

Automated Regression Testing in CI/CD

Integrating dataset tests into your CI/CD pipeline ensures that every code change is tested before it reaches production.

GitHub Actions Workflow

# .github/workflows/llm-regression.yml
name: LLM Regression Tests

on:
  pull_request:
    paths:
      - 'prompts/**'
      - 'src/llm/**'
      - 'config/models.yaml'

jobs:
  regression-test:
    runs-on: ubuntu-latest
    timeout-minutes: 30

    steps:
      - uses: actions/checkout@v4

      - name: Setup Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.11'

      - name: Install dependencies
        run: |
          pip install langfuse openai numpy

      - name: Run golden dataset tests
        env:
          LANGFUSE_PUBLIC_KEY: ${{ secrets.LANGFUSE_PUBLIC_KEY }}
          LANGFUSE_SECRET_KEY: ${{ secrets.LANGFUSE_SECRET_KEY }}
          LANGFUSE_HOST: ${{ secrets.LANGFUSE_HOST }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: python tests/test_llm_regression.py

      - name: Check results
        run: |
          python -c "
          import json
          with open('test_results.json') as f:
              results = json.load(f)
          if results['pass_rate'] < 0.85:
              print(f'FAIL: Pass rate {results[\"pass_rate\"]:.1%} is below 85% threshold')
              exit(1)
          if results['avg_score'] < 0.80:
              print(f'FAIL: Average score {results[\"avg_score\"]:.3f} is below 0.80 threshold')
              exit(1)
          print(f'PASS: Score={results[\"avg_score\"]:.3f}, Pass rate={results[\"pass_rate\"]:.1%}')
          "

The Test Script

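# tests/test_llm_regression.py
import json
import os

from langfuse import Langfuse

# generate_response() and evaluate_similarity() are assumed to come from your
# application code; import them here.

langfuse = Langfuse()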

def test_golden_dataset():
    dataset = langfuse.get_dataset("golden-support-v1")
    scores = []

    for item in dataset.items:
        trace = langfuse.trace(name="regression-test")
        output = generate_response(item.input)
        score = evaluate_similarity(output, item.expected_output)
        scores.append(score)
        item.link(trace, run_name=f"ci-{os.environ.get('GITHUB_SHA', 'local')[:8]}")
        trace.score(name="accuracy", value=score)

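    # Fail loudly on an empty dataset instead of dividing by zero below
    if not scores:
        raise RuntimeError("Dataset 'golden-support-v1' has no items; nothing to test")

    results = {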
        "avg_score": sum(scores) / len(scores),
        "min_score": min(scores),
        "pass_rate": sum(1 for s in scores if s >= 0.8) / len(scores),
        "total_items": len(scores),
    }

    with open("test_results.json", "w") as f:
        json.dump(results, f)

    langfuse.flush()
    return results

if __name__ == "__main__":
    results = test_golden_dataset()
    print(f"Results: {json.dumps(results, indent=2)}")
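The workflow's "Check results" step embeds its threshold logic in an inline `python -c` string, which is awkward to quote and hard to test. One alternative is to move it into a small script; the sketch below uses the same 85% pass-rate and 0.80 average-score thresholds, and the function names and the idea of a separate script file are assumptions rather than part of the original workflow.

```python
import json
from pathlib import Path

# Same gates as the workflow step: minimum acceptable value per metric.
THRESHOLDS = {"pass_rate": 0.85, "avg_score": 0.80}


def check_thresholds(results: dict, thresholds: dict) -> list[str]:
    """Return one failure message per metric that is missing or too low."""
    failures = []
    for metric, minimum in thresholds.items():
        value = results.get(metric)
        if value is None:
            failures.append(f"{metric} is missing from the results")
        elif value < minimum:
            failures.append(f"{metric} {value:.3f} is below the {minimum:.2f} threshold")
    return failures


def main(path: str = "test_results.json") -> int:
    """Read the results file and return a process exit code (0 = pass)."""
    results_file = Path(path)
    if not results_file.exists():
        print(f"FAIL: {path} not found")
        return 1
    failures = check_thresholds(json.loads(results_file.read_text()), THRESHOLDS)
    for failure in failures:
        print(f"FAIL: {failure}")
    if not failures:
        print("PASS: all thresholds met")
    return 1 if failures else 0


# In-memory example; in CI you would end the script with sys.exit(main()).
print(check_thresholds({"pass_rate": 0.90, "avg_score": 0.75}, THRESHOLDS))
```

Returning an exit code from `main` rather than calling `exit()` inside the checks keeps the threshold logic importable from unit tests.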

Test Coverage

Ensure comprehensive coverage across multiple dimensions:

  • Edge Cases: Error scenarios, unusual inputs, empty strings, very long inputs
  • Diversity: Different user types, languages, conversation contexts
  • Scale: Various input lengths and complexity levels
  • Time Sensitivity: Questions about current events or time-dependent information
  • Adversarial: Prompt injection attempts, off-topic questions, ambiguous queries

A good rule of thumb: if a bug could appear in production and your dataset would not catch it, add a test case for it.
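A quick way to spot gaps is to count items per category and compare against the categories you expect to cover. This sketch assumes each item carries a `category` in its metadata; `coverage_gaps`, the category names, and the minimum of three items per category are hypothetical choices.

```python
from collections import Counter


def coverage_gaps(items: list[dict], required_categories: list[str], min_per_category: int = 3) -> dict[str, int]:
    """Return categories with fewer than min_per_category items, with their counts."""
    counts = Counter(
        (item.get("metadata") or {}).get("category", "uncategorized") for item in items
    )
    return {
        category: counts.get(category, 0)
        for category in required_categories
        if counts.get(category, 0) < min_per_category
    }


items = [
    {"metadata": {"category": "faq"}},
    {"metadata": {"category": "faq"}},
    {"metadata": {"category": "faq"}},
    {"metadata": {"category": "adversarial"}},
]
print(coverage_gaps(items, ["faq", "adversarial", "edge_case"]))
# → {'adversarial': 1, 'edge_case': 0}
```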

Tips

  • Start with 50-100 examples and expand as you discover new failure modes.
  • Include both positive and negative examples. You need to know that good inputs still produce good outputs and that tricky inputs are handled gracefully.
  • Run tests before every production deployment. Make it a gate in your CI/CD pipeline.
  • Review test failures manually. Sometimes a low score means the expected output is wrong, not the model.
  • Keep datasets up to date. Stale expected outputs lead to false failures and erode trust in the test suite.
  • Tag dataset items by category so you can analyze performance per category, not just overall.
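The last tip can be sketched as a grouping step over per-item results. It assumes each result dict carries a `category` key copied from the dataset item's metadata, which the `run_dataset_test` function in this chapter does not add by default; `scores_by_category` is a hypothetical helper.

```python
from collections import defaultdict


def scores_by_category(results: list[dict], threshold: float = 0.8) -> dict[str, dict]:
    """Group per-item scores by category and summarize each group."""
    grouped = defaultdict(list)
    for result in results:
        grouped[result.get("category", "uncategorized")].append(result["score"])

    return {
        category: {
            "count": len(scores),
            "avg_score": sum(scores) / len(scores),
            "pass_rate": sum(s >= threshold for s in scores) / len(scores),
        }
        for category, scores in grouped.items()
    }


results = [
    {"category": "faq", "score": 1.0},
    {"category": "faq", "score": 0.5},
    {"category": "math", "score": 1.0},
]
for category, summary in scores_by_category(results).items():
    print(category, summary)
```

A healthy overall average can hide a category that fails consistently, which is why the breakdown is worth checking before each release.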

Summary

Datasets and testing are the foundation of reliable LLM development. In this chapter, you learned how to create datasets from production traces, design them with the right principles, run tests and interpret results, set up CI/CD regression testing, manage dataset versions, and compare models systematically. With a solid test suite in place, you can iterate on your prompts and models with confidence.

Next: Chapter 7: Integrations -- connect Langfuse with LangChain, LlamaIndex, OpenAI SDK, and other frameworks.

Depth Expansion Playbook

Source Code Walkthrough

package.json

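The root package.json of the Langfuse repository defines the monorepo's development scripts. It does not implement datasets itself, but its scripts are how you start a local Langfuse instance to test against (excerpt):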

{
  "name": "langfuse",
  "version": "3.163.0",
  "author": "engineering@langfuse.com",
  "license": "MIT",
  "private": true,
  "engines": {
    "node": "24"
  },
  "scripts": {
    "agents:check": "node scripts/agents/sync-agent-shims.mjs --check",
    "agents:sync": "node scripts/agents/sync-agent-shims.mjs",
    "postinstall": "node -e \"const fs = require('node:fs'); const cp = require('node:child_process'); if (!fs.existsSync('scripts/postinstall.sh')) { console.log('Skipping repo postinstall helper: scripts/postinstall.sh is not present in this install context.'); process.exit(0); } cp.execSync('bash scripts/postinstall.sh', { stdio: 'inherit' });\"",
    "preinstall": "npx only-allow pnpm",
    "infra:dev:up": "docker compose -f ./docker-compose.dev.yml up -d --wait",
    "infra:dev:down": "docker compose -f ./docker-compose.dev.yml down",
    "infra:dev:prune": "docker compose -f ./docker-compose.dev.yml down -v",
    "db:generate": "turbo run db:generate",
    "db:migrate": "turbo run db:migrate",
    "db:seed": "turbo run db:seed",
    "db:seed:examples": "turbo run db:seed:examples",
    "nuke": "bash ./scripts/nuke.sh",
    "dx": "pnpm i && pnpm run infra:dev:prune && pnpm run infra:dev:up --pull always && pnpm --filter=shared run db:reset:test && pnpm --filter=shared run db:reset && pnpm --filter=shared run ch:reset && pnpm --filter=shared run db:seed:examples && pnpm run dev",
    "dx-f": "pnpm i && pnpm run infra:dev:prune && pnpm run infra:dev:up --pull always && pnpm --filter=shared run db:reset:test && pnpm --filter=shared run db:reset -f && SKIP_CONFIRM=1 pnpm --filter=shared run ch:reset && pnpm --filter=shared run db:seed:examples && pnpm run dev",
    "dx:skip-infra": "pnpm i && pnpm --filter=shared run db:reset:test && pnpm --filter=shared run db:reset && pnpm --filter=shared run ch:reset && pnpm --filter=shared run db:seed:examples && pnpm run dev",
    "build": "turbo run build",
    "build:check": "turbo run build:check",
    "typecheck": "turbo run typecheck",
    "tc": "turbo run typecheck",
    "start": "turbo run start",
    "dev": "turbo run dev",
    "dev:worker": "turbo run dev --filter=worker",
    "dev:web": "turbo run dev --filter=web",
    "dev:web-webpack": "turbo run dev --filter=web -- --webpack",
    "lint": "turbo run lint",

In this excerpt, infra:dev:up and infra:dev:down start and stop the development containers through Docker Compose, the db:* scripts generate, migrate, and seed the database, and dev starts the application through Turborepo.

docker-compose.dev-azure.yml

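docker-compose.dev-azure.yml defines development services for running Langfuse with Azure-style blob storage. The excerpt shows ClickHouse, Azurite (Microsoft's Azure Storage emulator), and the start of the MinIO service: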

services:
  clickhouse:
    image: docker.io/clickhouse/clickhouse-server:24.3
    user: "101:101"
    environment:
      CLICKHOUSE_DB: default
      CLICKHOUSE_USER: ${CLICKHOUSE_USER:-clickhouse}
      CLICKHOUSE_PASSWORD: ${CLICKHOUSE_PASSWORD:-clickhouse}
    volumes:
      - langfuse_clickhouse_data:/var/lib/clickhouse
      - langfuse_clickhouse_logs:/var/log/clickhouse-server
    ports:
      - "8123:8123"
      - "9000:9000"
    healthcheck:
      test: wget --no-verbose --tries=1 --spider http://localhost:8123/ping || exit 1
      interval: 5s
      timeout: 5s
      retries: 10
      start_period: 1s
    depends_on:
      - postgres

  azurite:
    image: mcr.microsoft.com/azure-storage/azurite
    command: azurite-blob --blobHost 0.0.0.0
    ports:
      - "10000:10000"
    volumes:
      - langfuse_azurite_data:/data

  minio:
    image: cgr.dev/chainguard/minio
    container_name: ${MINIO_CONTAINER_NAME:-langfuse-minio}
    entrypoint: sh

Here ClickHouse is exposed on ports 8123 and 9000 and Azurite's blob endpoint on port 10000. A local stack like this is one place to run the dataset tests from this chapter without touching production data.

docker-compose.yml

docker-compose.yml is the Compose file for self-hosting. The excerpt shows the langfuse-worker service, which starts only after Postgres, MinIO, Redis, and ClickHouse report healthy:

# Make sure to update the credential placeholders with your own secrets.
# We mark them with # CHANGEME in the file below.
# In addition, we recommend to restrict inbound traffic on the host to langfuse-web (port 3000) and minio (port 9090) only.
# All other components are bound to localhost (127.0.0.1) to only accept connections from the local machine.
# External connections from other machines will not be able to reach these services directly.
services:
  langfuse-worker:
    image: docker.io/langfuse/langfuse-worker:3
    restart: always
    depends_on: &langfuse-depends-on
      postgres:
        condition: service_healthy
      minio:
        condition: service_healthy
      redis:
        condition: service_healthy
      clickhouse:
        condition: service_healthy
    ports:
      - 127.0.0.1:3030:3030
    environment: &langfuse-worker-env
      NEXTAUTH_URL: ${NEXTAUTH_URL:-http://localhost:3000}
      DATABASE_URL: ${DATABASE_URL:-postgresql://postgres:postgres@postgres:5432/postgres} # CHANGEME
      SALT: ${SALT:-mysalt} # CHANGEME
      ENCRYPTION_KEY: ${ENCRYPTION_KEY:-0000000000000000000000000000000000000000000000000000000000000000} # CHANGEME: generate via `openssl rand -hex 32`
      TELEMETRY_ENABLED: ${TELEMETRY_ENABLED:-true}
      LANGFUSE_ENABLE_EXPERIMENTAL_FEATURES: ${LANGFUSE_ENABLE_EXPERIMENTAL_FEATURES:-false}
      CLICKHOUSE_MIGRATION_URL: ${CLICKHOUSE_MIGRATION_URL:-clickhouse://clickhouse:9000}
      CLICKHOUSE_URL: ${CLICKHOUSE_URL:-http://clickhouse:8123}
      CLICKHOUSE_USER: ${CLICKHOUSE_USER:-clickhouse}
      CLICKHOUSE_PASSWORD: ${CLICKHOUSE_PASSWORD:-clickhouse} # CHANGEME
      CLICKHOUSE_CLUSTER_ENABLED: ${CLICKHOUSE_CLUSTER_ENABLED:-false}
      LANGFUSE_USE_AZURE_BLOB: ${LANGFUSE_USE_AZURE_BLOB:-false}
      LANGFUSE_S3_EVENT_UPLOAD_BUCKET: ${LANGFUSE_S3_EVENT_UPLOAD_BUCKET:-langfuse}
      LANGFUSE_S3_EVENT_UPLOAD_REGION: ${LANGFUSE_S3_EVENT_UPLOAD_REGION:-auto}

The values marked # CHANGEME (database URL, salt, encryption key, and ClickHouse password) are placeholders; replace them with your own secrets before exposing the deployment.

How These Components Connect

flowchart TD
    A[package]
    B[docker-compose.dev-azure]
    C[docker-compose]
    A --> B
    B --> C