View Results
Explore benchmark results and leaderboards.
Results Repository
All submitted results are published in the public results repository:
GitHub: github.com/fastcraft-ai/pawmate-ai-results
What's Available
Individual Result Files
Every submitted benchmark result as JSON:
results/submitted/
├── cursor-v0-43_modelA_REST_run1_20260103T1413.json
├── copilot_modelA_REST_run1_20260103T1520.json
├── codeium_modelB_GraphQL_run1_20260103T1630.json
└── ...
Each file contains:
- Tool name and version
- Complete timing metrics
- Build and test status
- LLM usage data
- Intervention counts
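The filenames themselves encode the run identity (tool, target model, API style, run number, timestamp). A small sketch for splitting a filename into those parts, assuming the underscore-separated naming visible in the listing above:
from pathlib import Path

def parse_result_filename(path):
    """Split a result filename such as
    cursor-v0-43_modelA_REST_run1_20260103T1413.json
    into its identity fields (assumes the underscore-separated
    naming visible in the listing above)."""
    tool, model, api_style, run, stamp = Path(path).stem.split("_")
    return {
        "tool": tool,                                  # e.g. "cursor-v0-43"
        "target_model": model.removeprefix("model"),   # "A" or "B" (Python 3.9+)
        "api_style": api_style,                        # "REST" or "GraphQL"
        "run_number": int(run.removeprefix("run")),
        "timestamp": stamp,                            # e.g. "20260103T1413"
    }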
Aggregated Reports
Comparison reports showing:
- Performance across tools
- Model A vs Model B completion rates
- REST vs GraphQL comparisons
- Trending data over time
Leaderboards
Rankings by:
- Fastest time to completion
- Highest test pass rates
- Lowest intervention counts
- Most autonomous completions
Result File Format
Example result file structure:
{
  "schema_version": "3.0",
  "result_data": {
    "run_identity": {
      "tool_name": "Cursor",
      "tool_version": "v0.43",
      "target_model": "A",
      "api_style": "REST",
      "run_number": 1,
      "timestamp": "2026-01-03T14:13:20Z",
      "operator_name": "anonymous"
    },
    "timing_metrics": {
      "generation_started": "2026-01-03T14:13:20.000Z",
      "code_complete": "2026-01-03T14:25:45.000Z",
      "build_clean": "2026-01-03T14:27:30.000Z",
      "seed_loaded": "2026-01-03T14:28:00.000Z",
      "app_started": "2026-01-03T14:28:15.000Z",
      "all_tests_pass": "2026-01-03T14:35:45.000Z"
    },
    "build_status": {
      "build_success": true,
      "tests_pass": true,
      "app_started": true
    },
    "intervention_metrics": {
      "clarifications_count": 0,
      "interventions_count": 2,
      "reruns_count": 0
    },
    "llm_usage": {
      "model_used": "claude-sonnet-4.5",
      "total_tokens": 125000,
      "input_tokens": 80000,
      "output_tokens": 45000,
      "requests_count": 45,
      "estimated_cost_usd": 1.25
    }
  }
}
Key Metrics Explained
Timing Metrics
Time to First Code (TTFC):
code_complete - generation_started
How long until all code files are written.
Time to Build:
build_clean - code_complete
How long to resolve dependencies and build successfully.
Time to Running:
app_started - generation_started
Total time from start to API responding.
Time to Tests Pass:
all_tests_pass - generation_started
Total time to achieve 100% test pass rate.
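For a concrete illustration, these four durations can be computed directly from a result file's timing_metrics block. A minimal sketch, assuming the JSON layout shown above (the helper name derived_timings is illustrative, not one of the repository's scripts):
import json
from datetime import datetime

def derived_timings(result_path):
    """Compute the derived durations (in seconds) from a result file."""
    with open(result_path) as f:
        timing = json.load(f)["result_data"]["timing_metrics"]

    def ts(name):
        # Timestamps are ISO 8601 with a trailing 'Z' (UTC).
        return datetime.fromisoformat(timing[name].replace("Z", "+00:00"))

    def seconds(start, end):
        return (ts(end) - ts(start)).total_seconds()

    return {
        "time_to_first_code": seconds("generation_started", "code_complete"),
        "time_to_build": seconds("code_complete", "build_clean"),
        "time_to_running": seconds("generation_started", "app_started"),
        "time_to_tests_pass": seconds("generation_started", "all_tests_pass"),
    }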
Intervention Metrics
Clarifications: Number of times the AI asked questions requiring operator input.
- Ideal: 0
- Good: 1-2
- Needs improvement: 3+
Interventions: Number of times the operator sent "continue", provided error messages, or manually edited code.
- Ideal: 0
- Good: 1-5
- Needs improvement: 6+
Reruns: Number of times the entire prompt was re-issued.
- Ideal: 0
- Acceptable: 1
- Problematic: 2+
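These bands can be encoded in a small helper for spot-checking a run. A sketch using the thresholds from the lists above (the function itself is hypothetical, not an official script):
def rate_autonomy(clarifications, interventions, reruns):
    """Rate a run against the bands described above."""
    if clarifications == 0 and interventions == 0 and reruns == 0:
        return "ideal"
    if clarifications <= 2 and interventions <= 5 and reruns <= 1:
        return "good"
    return "needs improvement"

# The example result above (0 clarifications, 2 interventions, 0 reruns):
print(rate_autonomy(0, 2, 0))  # -> good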
Build Status
Binary success indicators:
- build_success - Did npm install complete?
- tests_pass - Did all tests pass?
- app_started - Did the API start and respond?
- ui_build_success - Did the UI build (if attempted)?
- ui_running - Is the UI accessible (if attempted)?
LLM Usage
Token consumption and costs:
- total_tokens - Sum of input and output tokens
- requests_count - Number of API calls made
- estimated_cost_usd - Approximate cost (if available)
Note: LLM usage varies significantly based on:
- Tool's model selection
- Context window usage
- Retry strategies
- Code generation approach
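Because of that variance, token and cost figures are most meaningful when averaged per tool rather than compared run by run. A hedged sketch aggregating llm_usage across submitted result files, assuming the directory layout and field names shown earlier on this page:
import json
from collections import defaultdict
from pathlib import Path
from statistics import mean

def usage_by_tool(results_dir="results/submitted"):
    """Average token use and estimated cost per tool across result files."""
    by_tool = defaultdict(list)
    for path in Path(results_dir).glob("*.json"):
        data = json.loads(path.read_text())["result_data"]
        by_tool[data["run_identity"]["tool_name"]].append(data["llm_usage"])

    return {
        tool: {
            "runs": len(usages),
            "avg_total_tokens": mean(u["total_tokens"] for u in usages),
            "avg_cost_usd": round(mean(u.get("estimated_cost_usd", 0.0) for u in usages), 2),
        }
        for tool, usages in by_tool.items()
    }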
Leaderboards
Fastest Completion (Model A - REST)
| Rank | Tool | Version | Time to Tests Pass | Date |
|---|---|---|---|---|
| 1 | TBD | - | - | - |
| 2 | TBD | - | - | - |
| 3 | TBD | - | - | - |
Most Autonomous (Model A - REST)
| Rank | Tool | Version | Intervention Count | Date |
|---|---|---|---|---|
| 1 | TBD | - | - | - |
| 2 | TBD | - | - | - |
| 3 | TBD | - | - | - |
Highest Success Rate
| Tool | Version | Runs | Success Rate | Avg Time |
|---|---|---|---|---|
| TBD | - | - | - | - |
Leaderboards are updated as results are submitted.
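Until the tables above are populated, an equivalent ranking can be produced locally from the raw result files. A minimal sketch of the Fastest Completion ranking for one profile, assuming the result layout shown earlier (this is not one of the repository's official aggregation scripts):
import json
from datetime import datetime
from pathlib import Path

def _ts(value):
    # Result timestamps are ISO 8601 with a trailing 'Z' (UTC).
    return datetime.fromisoformat(value.replace("Z", "+00:00"))

def fastest_completion(results_dir="results/submitted", model="A", api_style="REST"):
    """Rank completed runs for one profile by time from start to all tests passing."""
    rows = []
    for path in Path(results_dir).glob("*.json"):
        data = json.loads(path.read_text())["result_data"]
        ident, timing, status = data["run_identity"], data["timing_metrics"], data["build_status"]
        if ident["target_model"] != model or ident["api_style"] != api_style:
            continue
        if not status["tests_pass"]:
            continue  # only fully passing runs are ranked
        elapsed = _ts(timing["all_tests_pass"]) - _ts(timing["generation_started"])
        rows.append((ident["tool_name"], ident["tool_version"], elapsed))
    return sorted(rows, key=lambda row: row[2])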
Comparative Analysis
Model A vs Model B
Performance comparison across complexity levels:
| Metric | Model A Avg | Model B Avg | Difference |
|---|---|---|---|
| Completion Time | TBD | TBD | TBD |
| Test Iterations | TBD | TBD | TBD |
| Intervention Count | TBD | TBD | TBD |
| Success Rate | TBD | TBD | TBD |
REST vs GraphQL
API style comparison:
| Metric | REST Avg | GraphQL Avg | Difference |
|---|---|---|---|
| Completion Time | TBD | TBD | TBD |
| Build Success | TBD | TBD | TBD |
| Test Pass Rate | TBD | TBD | TBD |
Tool Comparison
Cross-tool performance:
| Tool | Avg Time | Success Rate | Avg Interventions |
|---|---|---|---|
| TBD | - | - | - |
Submission Timeline
Typical turnaround times:
- GitHub Issues: Immediate (pending review)
- Email Submissions: 24-48 hours
- Leaderboard Updates: Weekly
- Aggregated Reports: Monthly
Data Quality
All results undergo validation:
- Schema Validation - Correct format and required fields
- Timestamp Verification - Logical ordering and format
- Metadata Check - Tool, version, profile consistency
- Manual Review - Spot checks for anomalies
Invalid or suspicious submissions may be flagged for review.
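The first two checks can be approximated locally before submitting. A simplified sketch of the schema and timestamp checks, assuming the result format shown above (the repository's scripts/validate-result-v3.py remains the authoritative validator):
import json
from datetime import datetime

REQUIRED_BLOCKS = ["run_identity", "timing_metrics", "build_status",
                   "intervention_metrics", "llm_usage"]
TIMESTAMP_ORDER = ["generation_started", "code_complete", "build_clean",
                   "seed_loaded", "app_started", "all_tests_pass"]

def quick_validate(path):
    """Return a list of problems found; an empty list means the basic checks pass."""
    with open(path) as f:
        data = json.load(f).get("result_data", {})

    problems = []

    # Schema validation: required blocks present
    for block in REQUIRED_BLOCKS:
        if block not in data:
            problems.append(f"missing block: {block}")

    # Timestamp verification: parseable and logically ordered
    timing = data.get("timing_metrics", {})
    stamps = [datetime.fromisoformat(timing[name].replace("Z", "+00:00"))
              for name in TIMESTAMP_ORDER if name in timing]
    if stamps != sorted(stamps):
        problems.append("timestamps are not in logical order")

    return problems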
Privacy
Public Data:
- Tool name and version
- Timing and performance metrics
- Build/test status
- Intervention counts
- Attribution (if provided)
Private Data:
- Email addresses
- Implementation source code
- Personal information
Contributing Results
Help build the dataset:
- Run benchmarks with different tools
- Submit all results (successes and failures)
- Use multiple profiles for comprehensive data
- Update tools and rerun for version comparisons
Get started running benchmarks →
Viewing Raw Data
Browse all results:
# Clone the results repository
git clone https://github.com/fastcraft-ai/pawmate-ai-results.git
cd pawmate-ai-results
# View submitted results
ls results/submitted/
# Read a result file
cat results/submitted/cursor-v0-43_modelA_REST_run1_20260103T1413.json | jq
Aggregation Scripts
The results repository includes scripts for analyzing data:
# Aggregate results
python scripts/aggregate_results.py
# Generate dashboard
python scripts/generate_dashboard.py
# Validate a result
python scripts/validate-result-v3.py results/submitted/your-result.json
API Access (Future)
Planned features:
- REST API for querying results
- JSON export of leaderboards
- Webhook notifications for new submissions
- Grafana/dashboard integration
