A cross-platform SwiftUI application for benchmarking LLM code generation capabilities using the OpenRouter API. SwiftBench evaluates how well language models can generate modern Swift and SwiftUI code that follows Apple's latest APIs and Human Interface Guidelines.
- LLM Benchmarking: Test any model available through OpenRouter against a comprehensive Swift/SwiftUI benchmark
- Real-time Streaming: Watch code generation in real-time with streaming API responses
- Multi-dimensional Scoring: Evaluates generated code across 5 dimensions:
- Token overlap (35%) - Jaccard similarity of code tokens
- Line overlap (20%) - Structural similarity of code lines
- Modern APIs (20%) - Usage of latest iOS 26 APIs and patterns
- Code quality (15%) - Documentation, comments, and best practices
- Length match (10%) - Similarity in code length
- Anti-pattern Detection: Penalizes deprecated APIs and bad practices
- 24 Benchmark Tasks across 6 categories:
- Algorithms - Classic implementations (Fibonacci, Palindrome, Binary Search, Two Sum)
- Data Modeling - SwiftData models and relationships
- Concurrency - async/await, Actors, Sendable protocols
- SwiftUI Composition - View composition, state management, navigation
- SwiftData Queries - Predicates, sorting, filtering, and query optimization
- Refactors & Bug Fixes - Modernizing deprecated patterns and fixing bugs
- Pass@k Evaluation: Run each task multiple times (configurable) to measure reliability
- Execution-Based Scoring (macOS only):
- Automatic compilation of generated Swift code
- XCTest execution against test suites
- AST-based style analysis for modern Swift patterns
- Score breakdown: Compilation (30%) + Test Pass Rate (70%)
- Style Analysis: Detects modern Swift patterns using SwiftSyntax AST parsing
- Aggregate Metrics:
- Pass@1 - Tasks where at least one run passed all tests
- Pass@k - Tasks where all k runs passed all tests
- Mean Score - Average score across all runs
- Category Breakdown - Performance by task category
- Token Usage - Total tokens consumed
- Execution Time - Total time for compilation and testing
- Leaderboard: Track and compare model performance over time with SwiftData persistence
- Suite Results: View aggregate metrics and category breakdowns for suite runs
- Export to CSV: Export leaderboard data for external analysis
- Delete & Reset: Remove individual runs or reset entire leaderboard
- BYOK: Bring your own OpenRouter API key (securely stored in Keychain)
- Xcode 26+ with Swift 6.2
- iOS 26.0+ / iPadOS 26.0+ / macOS 26.0+ / visionOS 26.0+
- OpenRouter API Key: Get one at openrouter.ai
- Clone the repository
- Open
SwiftBench.xcodeprojin Xcode - Build and run on your target platform
# Build for macOS
xcodebuild -project SwiftBench.xcodeproj -scheme SwiftBench -destination 'platform=macOS' build
# Build for iOS Simulator
xcodebuild -project SwiftBench.xcodeproj -scheme SwiftBench -destination 'platform=iOS Simulator,name=iPhone 17' build- Enter API Key: Add your OpenRouter API key in the "Connection" section and save it
- Select Model: Choose from preset models or enter a custom model identifier
- Enter Prompt: Type your prompt or use the benchmark template
- Run Benchmark: Click "Run Benchmark" to start the evaluation
- View Results: See real-time streaming output and final score breakdown
- Select Suite: Choose SwiftBench Suite v2 with 24 tasks
- Configure Runs: Set runs per task (for Pass@k evaluation) and temperature
- Run Suite: Execute all tasks in the suite
- Track Progress: Monitor real-time progress with task-by-task feedback
- Review Results: Check Suite Results tab for aggregate metrics and category breakdowns
- Filter & Sort: Use the filter and sort controls to explore results
- Delete Runs: Swipe left on iOS or use context menu to delete individual runs
- Reset Leaderboard: Use the Actions menu to clear all results
- Export CSV: Export filtered results for external analysis
- View Details: Tap any run to see detailed information
The benchmark evaluates LLM ability to generate a complete SwiftUI application with:
@MainActor @Observablestate management- SwiftData
@Modelfor persistence NavigationStackwith modernTabAPINSViewRepresentable/UIViewRepresentablefor platform integrationasync/awaitconcurrencyforegroundStyle()andclipShape(.rect(cornerRadius:))ContentUnavailableViewfor empty states@ScaledMetricfor Dynamic Type supportscrollIndicators(.hidden)modifier
- Documentation comments (
///and// MARK:) - Proper access control (
private var/func/let) - Accessibility labels and values
- Following Apple Human Interface Guidelines
DispatchQueue(use async/await instead)ObservableObject/@Published(use @Observable)foregroundColor()(use foregroundStyle)cornerRadius()(use clipShape).tabItem()(use Tab API)NavigationView(use NavigationStack)- Force unwraps (
try!,as!)
The test suite evaluates LLM ability to generate correct, modern Swift code across multiple categories:
-
Algorithms (4 tasks)
- Classic algorithm implementations
- Efficient iterative solutions
- Input/output validation
-
Data Modeling (4 tasks)
- SwiftData
@Modeldefinitions - Relationships (one-to-one, one-to-many)
- Computed properties and validation
- SwiftData
-
Concurrency (4 tasks)
async/awaitpatterns- Actor isolation
- Sendable conformance
- Structured concurrency
-
SwiftUI Composition (4 tasks)
- View composition and modifiers
- State management (
@Observable,@State,@Binding) - Navigation with
NavigationStackandTabAPI - Lists and ForEach
-
SwiftData Queries (4 tasks)
@Querypredicates- Sorting and filtering
- Aggregate calculations
- Relationship queries
-
Refactors & Bug Fixes (4 tasks)
- Modernizing deprecated APIs
- Fixing concurrency issues
- Correcting SwiftUI patterns
- Memory leak fixes
- Easy - Simple implementations, straightforward APIs
- Medium - Multiple concepts involved, requires careful implementation
- Hard - Complex scenarios, edge cases, advanced patterns
Generated code is compiled and tested automatically:
- Compilation: Code must compile without errors
- Testing: Code passes all XCTest cases
- Style Analysis: AST parsing detects modern Swift patterns
- Score Formula: 30% (compilation) + 70% (test pass rate)
Run each task k times to measure consistency:
- Pass@1: Task passed if any one run succeeded
- Pass@k: Task passed if all
kruns succeeded - Mean Score: Average score across all runs
SwiftBench/
├── App/ - App entry point, global state
├── Benchmarks/ - Benchmark specs and scoring algorithms
│ └── Suites/ - Comprehensive test suite definitions
├── Features/ - Feature-specific views
│ ├── Run/ - Benchmark execution interface
│ ├── Leaderboard/ - Historical results and rankings
│ ├── Suite/ - Suite run configuration and progress
│ ├── Results/ - Suite result aggregation and display
│ └── Tasks/ - Individual task browsing and details
├── Models/ - SwiftData models and domain types
├── Services/ - Keychain, test execution, CSV export
└── Views/ - Shared UI components
- AppState:
@MainActor @Observableclass managing global application state and suite runs - BenchmarkSuite: Versioned collections of benchmark tasks with metadata
- BenchmarkTask: Individual test tasks with prompts, tests, and style rules
- BenchmarkScorer: Multi-dimensional scoring algorithm with anti-pattern detection
- TestExecutionService: Compiles and executes generated Swift code with XCTest (macOS)
- ExecutionScorer: Scores execution results based on compilation and test pass rate
- CodeExtractor: Extracts code blocks from LLM responses (handles markdown)
- LeaderboardExporter: Generates CSV exports of leaderboard data
- CodeTextView: Platform-specific scrollable code display (AppKit/UIKit)
| Model | Identifier |
|---|---|
| Devstral 2512 | mistralai/devstral-2512 |
| Qwen 3 Coder | qwen/qwen3-coder |
| Claude 4.5 Sonnet | anthropic/claude-sonnet-4.5 |
| Gemini 3.0 Flash | google/gemini-3-flash-preview |
You can also enter any custom model identifier supported by OpenRouter.
- Swift 6.2 with strict concurrency enabled
- SwiftData for persistent storage
- Keychain for secure API key storage
- AIProxy library for OpenRouter streaming API
- Sandboxed with network access enabled
See LICENSE file for details.
Contributions are welcome! Please ensure your code follows the project's coding standards:
- Use modern Swift APIs (iOS 26+)
- Follow Apple Human Interface Guidelines
- Add documentation comments for public APIs
- Include accessibility labels for UI elements