A comprehensive dashboard for visualizing and analyzing evaluation results of Large Language Models, including performance metrics, cost analysis, and model comparisons.
## Features

### 🎯 Performance & TopN Analysis
- Model accuracy comparisons across benchmarks
- TopN performance metrics (Top@1, Top@2, Top@5)
- Performance heatmaps for model-benchmark combinations
- Detailed performance progression analysis
### 💰 Cost Analysis
- Cost per token analysis (1K and 1M tokens)
- Total run cost comparisons
- Cost efficiency metrics
- Detailed cost breakdown by model
### ⚡ Throughput Analysis
- Tokens per Second (TPS) measurements
- Time to First Token (TTFT) analysis
- Output token generation speed metrics
- Throughput efficiency scatter plots (accuracy vs speed)
- Detailed throughput data tables
### 📊 Model Comparison
- Efficiency scatter plots (accuracy vs cost)
- Automated model rankings with efficiency scores
### 🔍 Advanced Analytics
- Multi-dimensional efficiency scoring (accuracy + cost + throughput)
- Data export functionality (CSV download)
- Interactive visualizations with tooltips
- Comprehensive metrics tables
## Prerequisites

- Python 3.10 or higher
- Conda (recommended) or virtualenv
## Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/debabratamishra/llm-evals
   cd llm-evals
   ```
2. Create and activate a conda environment:

   ```bash
   conda create -n llm_ui python=3.12
   conda activate llm_ui
   ```
3. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

## Running the Dashboard

Using the quick start script:

```bash
chmod +x start_dashboard.sh
./start_dashboard.sh
```

Or manually:

```bash
conda activate llm_ui
streamlit run app.py
```

The dashboard will be available at http://localhost:8501.
## Data Format

The dashboard automatically loads evaluation data from the `data/` directory. The following file formats are supported:

- `advanced_eval_summary.json`: summary evaluation files
- `*__*_details.json`: detailed evaluation results
- `*__cost_throughput.json`: cost and throughput metrics
**Performance Metrics:**

```json
{
  "model_name": {
    "benchmark_name": {
      "n": 100,
      "acc": 0.75,
      "top@1": 0.75,
      "top@2": 0.85,
      "top@5": 0.95
    }
  }
}
```

**Cost & Throughput Metrics:**
```json
{
  "model_name": {
    "cost_throughput": {
      "cost": {
        "run_cost_usd": 0.01,
        "cost_per_1k_tokens_usd": 0.0001,
        "cost_per_1m_tokens_usd": 0.1
      },
      "mode": "api",
      "elapsed_seconds": 45.2,
      "total_tokens": 25000,
      "input_tokens": 15000,
      "output_tokens": 10000
    }
  }
}
```

## Configuration

- Data Directory: Configure the path to evaluation data (defaults to `./data`)
- Real-time data loading with progress indicators
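As a hedged illustration of how the cost & throughput schema shown above maps onto a table (the actual loader lives in `data_loader.py` and may differ; `flatten_cost_file` is a hypothetical helper name):

```python
import json
import pandas as pd

def flatten_cost_file(path: str) -> pd.DataFrame:
    """Flatten a *__cost_throughput.json file into one row per model."""
    with open(path) as f:
        raw = json.load(f)

    rows = []
    for model, payload in raw.items():
        ct = payload["cost_throughput"]
        rows.append({
            "model": model,
            "run_cost_usd": ct["cost"]["run_cost_usd"],
            "cost_per_1m_tokens_usd": ct["cost"]["cost_per_1m_tokens_usd"],
            "elapsed_seconds": ct["elapsed_seconds"],
            "total_tokens": ct["total_tokens"],
            "output_tokens": ct["output_tokens"],
        })
    return pd.DataFrame(rows)
```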
## Usage

### 🎯 Performance & TopN Analysis
- Select performance metrics to visualize
- Compare TopN accuracy across models
- View performance heatmaps
### 💰 Cost Analysis
- Analyze cost per token metrics
- Compare total run costs
- Identify cost-effective models
### ⚡ Throughput Analysis
- Tokens per Second (TPS) performance metrics
- Time to First Token (TTFT) measurements
- Throughput efficiency analysis (accuracy vs speed)
- Detailed throughput data tables
### 📊 Model Comparison
- Efficiency analysis (accuracy vs cost)
- Customizable ranking systems
### 🔍 Advanced Analytics
- Multi-dimensional efficiency calculations (accuracy + cost + throughput)
- Data export functionality
- Interactive metric visualization
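The CSV export can be sketched as a plain pandas serialization fed to Streamlit's download widget (the helper name, label, and filename below are illustrative assumptions, not the dashboard's actual ones):

```python
import pandas as pd

def metrics_to_csv_bytes(df: pd.DataFrame) -> bytes:
    """Serialize a metrics table for a download widget such as st.download_button."""
    return df.to_csv(index=False).encode("utf-8")

# Inside the app the bytes would feed the widget, e.g.:
# st.download_button("Download metrics as CSV", data=metrics_to_csv_bytes(df),
#                    file_name="metrics.csv", mime="text/csv")
```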
- Interactive Charts: Built with Plotly for responsive visualization
- Custom Tooltips: Detailed explanations for all metrics
- Export Options: Download charts and data as CSV/PNG
- Responsive Design: Optimized for different screen sizes
- Real-time Updates: Data refreshes automatically when changed
- Tokens per Second (TPS): Total throughput including both input and output processing
- Output Tokens per Second: Generation speed for output tokens only
- Time to First Token (TTFT): Latency measurement for initial response (estimated)
- Throughput Efficiency: Composite metric combining accuracy and speed performance
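Using the field names from the cost & throughput sample earlier, TPS and output TPS reduce to simple ratios (a sketch; the dashboard's exact computation may differ, and the TTFT estimation method is not shown here):

```python
def throughput_metrics(total_tokens: int, output_tokens: int,
                       elapsed_seconds: float) -> dict:
    """Derive throughput figures from raw token counts and wall-clock time."""
    return {
        # TPS: all processed tokens (input + output) over the full run
        "tps": total_tokens / elapsed_seconds,
        # Output TPS: generation speed for output tokens only
        "output_tps": output_tokens / elapsed_seconds,
    }
```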
- Latency Optimization: Use TTFT metrics for real-time applications
- Throughput Planning: Use TPS metrics for batch processing scenarios
- Balanced Selection: Use efficiency metrics for optimal accuracy-speed trade-offs
- Cost-Performance Analysis: Combined with cost metrics for comprehensive evaluation
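One common way to combine accuracy, cost, and throughput into a single score is min-max normalization with equal weights. This is a minimal sketch under that assumption; the dashboard's actual efficiency formula and weighting may differ:

```python
import pandas as pd

def efficiency_scores(df: pd.DataFrame) -> pd.Series:
    """Equal-weight composite of accuracy (higher is better), cost
    (lower is better), and TPS (higher is better), min-max normalized.
    Illustrative only; not the dashboard's exact formula."""
    def norm(col: pd.Series) -> pd.Series:
        span = col.max() - col.min()
        return (col - col.min()) / span if span else pd.Series(1.0, index=col.index)

    return (norm(df["acc"])
            + (1.0 - norm(df["cost_per_1m_tokens_usd"]))
            + norm(df["tps"])) / 3.0
```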
You can specify a different data directory using the sidebar input or by modifying the default path in `app.py`:

```python
default_data_dir = os.path.join(dashboard_dir, "your_data_directory")
```
1. Extend the data loader (`data_loader.py`):

   ```python
   def _extract_new_metrics(self) -> pd.DataFrame:
       # Add your metric extraction logic here
       pass
   ```

2. Create visualizations (`visualizations.py`):

   ```python
   def create_new_chart(self, data: pd.DataFrame) -> go.Figure:
       # Add your visualization logic here
       pass
   ```

3. Update the dashboard (`app.py`):

   ```python
   # Add a new tab or section (Streamlit creates tabs with st.tabs)
   (new_tab,) = st.tabs(["New Analysis"])
   with new_tab:
       new_fig = visualizer.create_new_chart(data)
       st.plotly_chart(new_fig)
   ```
## Project Structure

```
llm-evals/
├── app.py               # Main Streamlit application
├── data_loader.py       # Data loading and processing
├── visualizations.py    # Chart and visualization creation
├── requirements.txt     # Python dependencies
├── start_dashboard.sh   # Quick start script
└── data/                # Evaluation data directory
    ├── advanced_eval_summary.json
    ├── *__cost_throughput.json
    └── *__*_details.json
```
1. Load → JSON files are loaded from the `data/` directory
2. Process → Data is parsed and structured into pandas DataFrames
3. Visualize → Charts are created using Plotly
4. Interact → Users can filter, compare, and export data
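The Load step above can be sketched as a glob over the three supported filename patterns (a simplified stand-in for what `data_loader.py` does; the function name is hypothetical):

```python
import glob
import os

def discover_eval_files(data_dir: str = "./data") -> dict:
    """Group evaluation JSON files by the filename patterns the dashboard supports."""
    patterns = {
        "summary": "advanced_eval_summary.json",
        "details": "*__*_details.json",
        "cost_throughput": "*__cost_throughput.json",
    }
    return {kind: sorted(glob.glob(os.path.join(data_dir, pattern)))
            for kind, pattern in patterns.items()}
```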
To verify the installation and data loading:

```bash
conda activate llm_ui
python -c "
from data_loader import EvaluationDataLoader
loader = EvaluationDataLoader('./data')
data = loader.load_all_data()
print(f'Loaded {len(data)} data sections')
print('Available sections:', list(data.keys()))
"
```

## Troubleshooting

- Check file format: Ensure JSON files are properly formatted
- Verify data directory: Confirm the `data/` directory exists and contains files
- File permissions: Ensure read permissions on data files
```bash
# Reinstall dependencies
conda activate llm_ui
pip install --upgrade -r requirements.txt
```

- Large datasets: Consider filtering data for better performance
- Memory usage: Monitor system resources with large evaluation datasets
- Browser cache: Clear browser cache if visualizations aren't updating
- Missing data: Check console logs for data processing errors
- Chart rendering: Ensure browser supports modern JavaScript features
- Interactive features: Verify Plotly.js is loading correctly
Run the dashboard in debug mode:

```bash
streamlit run app.py --logger.level=debug
```

- Data Caching: Streamlit automatically caches loaded data
- Efficient Processing: Use pandas vectorized operations
- Memory Management: Process data in chunks for large datasets
- Visualization: Limit data points for complex charts
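For instance, per-token costs across all models can be computed with one vectorized expression instead of a per-row Python loop (column names follow the cost sample in the Data Format section; the values here are made up for illustration):

```python
import pandas as pd

df = pd.DataFrame({
    "model": ["model_a", "model_b"],
    "run_cost_usd": [0.01, 0.04],
    "total_tokens": [25000, 50000],
})
# Vectorized: one expression over whole columns, no explicit iteration
df["cost_per_1k_tokens_usd"] = df["run_cost_usd"] / df["total_tokens"] * 1000
```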
## Contributing

1. Fork the repository
2. Create a feature branch: `git checkout -b feature-name`
3. Make changes with proper documentation
4. Test thoroughly with sample data
5. Submit a pull request
- Follow PEP 8 style guidelines
- Add comprehensive docstrings
- Include type hints where appropriate
- Test with various data formats
- Update documentation for new features
This project is licensed under the MIT License - see the LICENSE file for details.
- Streamlit for the web application framework
- Plotly for interactive visualizations
- Pandas for data processing capabilities
For issues, questions, or contributions:
- Create an issue in the repository
- Check existing documentation
- Review troubleshooting section
Built with ❤️ for the LLM evaluation community
