LLM Evaluation Dashboard

A comprehensive dashboard for visualizing and analyzing evaluation results of Large Language Models, including performance metrics, cost analysis, and model comparisons.


[Screenshot: home page of the LLM Evals dashboard]

🚀 Features

📊 Dashboard Sections

  1. 🎯 Performance & TopN Analysis

    • Model accuracy comparisons across benchmarks
    • TopN performance metrics (Top@1, Top@2, Top@5)
    • Performance heatmaps for model-benchmark combinations
    • Detailed performance progression analysis
  2. 💰 Cost Analysis

    • Cost per token analysis (1K and 1M tokens)
    • Total run cost comparisons
    • Cost efficiency metrics
    • Detailed cost breakdown by model
  3. ⚡ Throughput Analysis

    • Tokens per Second (TPS) measurements
    • Time to First Token (TTFT) analysis
    • Output token generation speed metrics
    • Throughput efficiency scatter plots (accuracy vs speed)
    • Detailed throughput data tables
  4. 🔍 Model Comparison

    • Efficiency scatter plots (accuracy vs cost)
    • Automated model rankings with efficiency scores
  5. 📊 Advanced Analytics

    • Multi-dimensional efficiency scoring (accuracy + cost + throughput)
    • Data export functionality (CSV download)
    • Interactive visualizations with tooltips
    • Comprehensive metrics tables

🛠️ Installation

Prerequisites

  • Python 3.10 or higher
  • Conda (recommended) or virtualenv

Quick Setup

  1. Clone the repository:

    git clone https://github.com/debabratamishra/llm-evals
    cd llm-evals
  2. Create and activate a conda environment:

    conda create -n llm_ui python=3.12
    conda activate llm_ui
  3. Install dependencies:

    pip install -r requirements.txt

🚀 Usage

Running the Dashboard

Option 1: Using the startup script (Recommended)

chmod +x start_dashboard.sh
./start_dashboard.sh

Option 2: Manual startup

conda activate llm_ui
streamlit run app.py

The dashboard will be available at http://localhost:8501

๐Ÿ“ Data Structure

The dashboard automatically loads evaluation data from the data/ directory. The following file formats are supported:

  • advanced_eval_summary.json - Summary evaluation files
  • *__*_details.json - Detailed evaluation results
  • *__cost_throughput.json - Cost and throughput metrics
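As a minimal sketch, the three file patterns above can be discovered with the standard-library `glob` module. Note that `discover_eval_files` is a hypothetical helper for illustration; the dashboard's actual loading logic lives in data_loader.py.

```python
import glob
import os

def discover_eval_files(data_dir: str) -> dict:
    """Group evaluation files in data_dir by the supported filename patterns.

    Illustrative helper only; mirrors the patterns listed above.
    """
    return {
        "summary": glob.glob(os.path.join(data_dir, "advanced_eval_summary.json")),
        "details": glob.glob(os.path.join(data_dir, "*__*_details.json")),
        "cost_throughput": glob.glob(os.path.join(data_dir, "*__cost_throughput.json")),
    }
```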

Expected Data Format

Performance Metrics:

{
  "model_name": {
    "benchmark_name": {
      "n": 100,
      "acc": 0.75,
      "top@1": 0.75,
      "top@2": 0.85,
      "top@5": 0.95
    }
  }
}

Cost & Throughput Metrics:

{
  "model_name": {
    "cost_throughput": {
      "cost": {
        "run_cost_usd": 0.01,
        "cost_per_1k_tokens_usd": 0.0001,
        "cost_per_1m_tokens_usd": 0.1
      },
      "mode": "api",
      "elapsed_seconds": 45.2,
      "total_tokens": 25000,
      "input_tokens": 15000,
      "output_tokens": 10000
    }
  }
}

🎮 Dashboard Navigation

Sidebar Controls

  • Data Directory: Configure the path to evaluation data (defaults to ./data)
  • Real-time data loading with progress indicators

Main Tabs

  1. 🎯 Performance & TopN Analysis

    • Select performance metrics to visualize
    • Compare TopN accuracy across models
    • View performance heatmaps
  2. 💰 Cost Analysis

    • Analyze cost per token metrics
    • Compare total run costs
    • Identify cost-effective models
  3. ⚡ Throughput Analysis

    • Tokens per Second (TPS) performance metrics
    • Time to First Token (TTFT) measurements
    • Throughput efficiency analysis (accuracy vs speed)
    • Detailed throughput data tables
  4. 🔍 Model Comparison

    • Efficiency analysis (accuracy vs cost)
    • Customizable ranking systems
  5. 📊 Advanced Analytics

    • Multi-dimensional efficiency calculations (accuracy + cost + throughput)
    • Data export functionality
    • Interactive metric visualization

📊 Visualization Features

  • Interactive Charts: Built with Plotly for responsive visualization
  • Custom Tooltips: Detailed explanations for all metrics
  • Export Options: Download charts and data as CSV/PNG
  • Responsive Design: Optimized for different screen sizes
  • Real-time Updates: Data refreshes automatically when changed

⚡ Throughput Metrics Explained

Key Metrics

  • Tokens per Second (TPS): Total throughput including both input and output processing
  • Output Tokens per Second: Generation speed for output tokens only
  • Time to First Token (TTFT): Latency measurement for initial response (estimated)
  • Throughput Efficiency: Composite metric combining accuracy and speed performance
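The first two metrics follow directly from the fields in the cost & throughput format shown earlier (`total_tokens`, `output_tokens`, `elapsed_seconds`). A minimal sketch of those two calculations, using the sample values from that section (the function names are illustrative, not dashboard code; the TTFT estimation method is not documented here, so it is omitted):

```python
def tokens_per_second(total_tokens: int, elapsed_seconds: float) -> float:
    """Overall throughput: input + output tokens over wall-clock time."""
    return total_tokens / elapsed_seconds

def output_tokens_per_second(output_tokens: int, elapsed_seconds: float) -> float:
    """Generation speed counting output tokens only."""
    return output_tokens / elapsed_seconds

# Sample values from the cost & throughput format above:
tps = tokens_per_second(25000, 45.2)             # ≈ 553.1 tokens/s
out_tps = output_tokens_per_second(10000, 45.2)  # ≈ 221.2 tokens/s
```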

Use Cases

  • Latency Optimization: Use TTFT metrics for real-time applications
  • Throughput Planning: Use TPS metrics for batch processing scenarios
  • Balanced Selection: Use efficiency metrics for optimal accuracy-speed trade-offs
  • Cost-Performance Analysis: Combined with cost metrics for comprehensive evaluation

🔧 Configuration

Custom Data Directory

You can specify a different data directory using the sidebar input or by modifying the default path in app.py:

default_data_dir = os.path.join(dashboard_dir, "your_data_directory")

Adding New Metrics

  1. Extend the data loader (data_loader.py):

    def _extract_new_metrics(self) -> pd.DataFrame:
        # Add your metric extraction logic here
        pass
  2. Create visualizations (visualizations.py):

    def create_new_chart(self, data: pd.DataFrame) -> go.Figure:
        # Add your visualization logic here
        pass
  3. Update the dashboard (app.py):

    # Add a new tab (st.tabs returns one tab object per label)
    (new_tab,) = st.tabs(["New Analysis"])
    with new_tab:
        new_fig = visualizer.create_new_chart(data)
        st.plotly_chart(new_fig)

๐Ÿ—๏ธ Architecture

Component Overview

llm-evals/
├── app.py                 # Main Streamlit application
├── data_loader.py         # Data loading and processing
├── visualizations.py      # Chart and visualization creation
├── requirements.txt       # Python dependencies
├── start_dashboard.sh     # Quick start script
└── data/                  # Evaluation data directory
    ├── advanced_eval_summary.json
    ├── *__cost_throughput.json
    └── *__*_details.json

Data Flow

  1. Load → JSON files are loaded from the data/ directory
  2. Process → Data is parsed and structured into pandas DataFrames
  3. Visualize → Charts are created using Plotly
  4. Interact → Users can filter, compare, and export data
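Steps 1–2 can be sketched as a flattening pass over the nested performance JSON (format shown in the Data Structure section), producing one DataFrame row per model-benchmark pair. The function name and sample model/benchmark names here are illustrative, not the repository's actual code:

```python
import pandas as pd

def performance_to_frame(raw: dict) -> pd.DataFrame:
    """Flatten {model: {benchmark: metrics}} JSON into one row per pair."""
    rows = [
        {"model": model, "benchmark": bench, **metrics}
        for model, benchmarks in raw.items()
        for bench, metrics in benchmarks.items()
    ]
    return pd.DataFrame(rows)

# Hypothetical input in the documented performance format:
raw = {"model_a": {"mmlu": {"n": 100, "acc": 0.75, "top@1": 0.75}}}
df = performance_to_frame(raw)
```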

🧪 Testing

To verify the installation and data loading:

conda activate llm_ui
python -c "
from data_loader import EvaluationDataLoader
loader = EvaluationDataLoader('./data')
data = loader.load_all_data()
print(f'Loaded {len(data)} data sections')
print('Available sections:', list(data.keys()))
"

๐Ÿ› Troubleshooting

Common Issues

Data Not Loading

  • Check file format: Ensure JSON files are properly formatted
  • Verify data directory: Confirm the data/ directory exists and contains files
  • File permissions: Ensure read permissions on data files

Import Errors

# Reinstall dependencies
conda activate llm_ui
pip install --upgrade -r requirements.txt

Performance Issues

  • Large datasets: Consider filtering data for better performance
  • Memory usage: Monitor system resources with large evaluation datasets
  • Browser cache: Clear browser cache if visualizations aren't updating

Visualization Problems

  • Missing data: Check console logs for data processing errors
  • Chart rendering: Ensure browser supports modern JavaScript features
  • Interactive features: Verify Plotly.js is loading correctly

Debug Mode

Run the dashboard in debug mode:

streamlit run app.py --logger.level=debug

📈 Performance Optimization

  • Data Caching: Streamlit automatically caches loaded data
  • Efficient Processing: Use pandas vectorized operations
  • Memory Management: Process data in chunks for large datasets
  • Visualization: Limit data points for complex charts
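The last tip, limiting data points for complex charts, could be sketched as an even-stride downsample applied before a DataFrame is handed to Plotly. This is an assumption-level illustration (`downsample` is not a function in this repository):

```python
import pandas as pd

def downsample(df: pd.DataFrame, max_points: int = 1000) -> pd.DataFrame:
    """Keep at most max_points evenly spaced rows so charts stay responsive.

    Illustrative sketch of the 'limit data points' tip; not dashboard code.
    """
    if len(df) <= max_points:
        return df
    step = len(df) // max_points
    return df.iloc[::step].head(max_points)
```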

๐Ÿค Contributing

  1. Fork the repository
  2. Create a feature branch: git checkout -b feature-name
  3. Make changes with proper documentation
  4. Test thoroughly with sample data
  5. Submit a pull request

Development Guidelines

  • Follow PEP 8 style guidelines
  • Add comprehensive docstrings
  • Include type hints where appropriate
  • Test with various data formats
  • Update documentation for new features

📜 License

This project is licensed under the MIT License - see the LICENSE file for details.

๐Ÿ™ Acknowledgments

  • Streamlit for the web application framework
  • Plotly for interactive visualizations
  • Pandas for data processing capabilities

📧 Support

For issues, questions, or contributions:

  • Create an issue in the repository
  • Check existing documentation
  • Review troubleshooting section

Built with ❤️ for the LLM evaluation community
