GasBench
Benchmarking LLM agents on running a gas station.
A scenario-driven agent simulation covering tool use, customer service, and inventory management.
Created By

Jason Botterill
@JasonBotterill3
Chris
@chatgpt21
Model Leaderboard
Each model was tested on multiple customer service refund scenarios. Note: all models tested are non-reasoning models. "API calls" is the total number of API calls a model needed to complete all tasks. "Non-failing %" is the percentage of function calls that executed without errors.
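The two leaderboard columns can be sketched as a small calculation. This is a minimal illustration, not the benchmark's actual scoring code: the `ToolCall` record and `summarize` function are hypothetical names, the log is made up, and it assumes each tool action counts as one API call.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    """One function call made by the model (hypothetical record shape)."""
    name: str  # tool name the model invoked
    ok: bool   # True if the call executed without an error


def summarize(calls):
    """Return (total_calls, non_failing_pct) for one model's run."""
    total = len(calls)
    if total == 0:
        return 0, 0.0
    succeeded = sum(1 for call in calls if call.ok)
    return total, 100.0 * succeeded / total


# Made-up log for illustration only; not benchmark data.
log = [
    ToolCall("check_receipt", True),
    ToolCall("process_refund", False),
    ToolCall("process_refund", True),
    ToolCall("interaction_completed", True),
]

total, pct = summarize(log)
print(f"API calls: {total}, non-failing: {pct:.1f}%")
```

Under these assumptions, a model that reaches the same outcome with fewer calls and no failed calls scores better on both columns at once.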
Why fewer turns with higher reliability is a strong signal
- Fewer turns = better planning: Completing days in fewer API calls suggests the model plans upfront, chooses the right actions, and avoids unnecessary back-and-forth.
- Policy and constraint adherence: A high non-failing rate suggests the model checks refund policy, capacity, and cash before acting, which reduces wasted attempts and retries.
- Effective action ordering: Prioritizing the right calls (e.g., validating the receipt before refunding, checking status before ordering) prevents avoidable failures.
- No loops or churn: A low total action count with few (or zero) failed actions indicates the model doesn't get stuck correcting itself or repeating mistakes.
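The action-ordering point can be made concrete with a small check over a sequence of tool names. This is a sketch only: `first_ordering_violation` and the `PREREQUISITES` table are hypothetical, the rule "check_receipt before process_refund" is taken from the example in the list above, and the check ignores whether the prerequisite call actually succeeded.

```python
def first_ordering_violation(actions, prerequisites):
    """Return (index, action, missing_prerequisite) for the first action
    attempted before its prerequisite was called, or None if the order is fine."""
    seen = set()
    for index, action in enumerate(actions):
        required = prerequisites.get(action)
        if required is not None and required not in seen:
            return index, action, required
        seen.add(action)
    return None


# Assumed rule for illustration: a refund requires a prior receipt check.
PREREQUISITES = {"process_refund": "check_receipt"}

ordered = ["list_convenience_items", "check_receipt",
           "process_refund", "interaction_completed"]
hasty = ["process_refund", "list_convenience_items",
         "check_receipt", "process_refund"]

print(first_ordering_violation(ordered, PREREQUISITES))
print(first_ordering_violation(hasty, PREREQUISITES))
```

The first sequence passes; the second is flagged at index 0 because the refund is attempted before any receipt check.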
Example: Customer Interaction Phase
Below are real examples from the benchmark showing how different models handle a customer requesting a refund for chips. Notice the differences in efficiency and approach.
Sonnet 4
4 API calls
Direct approach
Customer:
Manager:
Customer:
Manager:
ACTION: list_convenience_items {}
✓ Action successful
Manager:
Customer:
ACTION: check_receipt {"receipt_id": "RCP-0001"}
✓ Receipt validated successfully
Manager:
ACTION: process_refund {"item": "doritos_small", "amount": 2.19}
✓ Processed refund: $2.19
Manager:
ACTION: interaction_completed {"refund": "yes"}
✓ Interaction marked completed
GPT-5-chat-latest
6 API calls
Trial and error
Customer:
Manager:
Customer:
ACTION: process_refund {"item": "doritos_nacho_cheese_regular", "amount": 4.29}
✗ Invalid item: doritos_nacho_cheese_regular
Manager:
ACTION: get_item_details {"item": "doritos_nacho_cheese_9.25oz"}
✗ Item not found
ACTION: list_convenience_items {}
✓ Action successful
Manager:
Customer:
Manager:
Customer:
Manager:
Customer:
ACTION: check_receipt {"receipt_id": "RCP-0001"}
✓ Receipt validated successfully
Manager:
Customer:
ACTION: process_refund {"item": "doritos_small", "amount": 2.19}
✓ Processed refund: $2.19
ACTION: interaction_completed {"refund": "yes"}
✓ Interaction marked completed
Key Differences
- Efficiency: Sonnet 4 completes the task in 4 API calls vs. 6 for GPT-5-chat-latest.
- Strategy: Sonnet 4 lists the available items first, then verifies the receipt; GPT-5-chat-latest attempts the refund immediately with an invalid item name and fails.
- Error handling: Sonnet 4 avoids errors entirely; GPT-5-chat-latest makes 2 failed calls before recovering.
- Customer experience: Both complete the refund, but Sonnet 4's shorter path means a faster resolution for the customer.
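The call and failure counts above can be reproduced by tallying the `ACTION:` lines and their ✓/✗ results from the two transcripts. The parser below is a rough sketch that assumes the transcript layout shown on this page; `tally` is a hypothetical helper, not part of the benchmark, and the counts cover this single interaction rather than the leaderboard totals.

```python
import re

ACTION_RE = re.compile(r"^ACTION:\s+(\w+)")


def tally(transcript):
    """Return (total_actions, failed_actions) for a transcript string."""
    total = failed = 0
    awaiting_result = False
    for raw in transcript.splitlines():
        line = raw.strip()
        if ACTION_RE.match(line):
            total += 1
            awaiting_result = True
        elif awaiting_result and line.startswith(("✓", "✗")):
            if line.startswith("✗"):
                failed += 1
            awaiting_result = False
    return total, failed


# Action and result lines copied from the two example transcripts.
SONNET_4 = """
ACTION: list_convenience_items {}
✓ Action successful
ACTION: check_receipt {"receipt_id": "RCP-0001"}
✓ Receipt validated successfully
ACTION: process_refund {"item": "doritos_small", "amount": 2.19}
✓ Processed refund: $2.19
ACTION: interaction_completed {"refund": "yes"}
✓ Interaction marked completed
"""

GPT_5_CHAT = """
ACTION: process_refund {"item": "doritos_nacho_cheese_regular", "amount": 4.29}
✗ Invalid item: doritos_nacho_cheese_regular
ACTION: get_item_details {"item": "doritos_nacho_cheese_9.25oz"}
✗ Item not found
ACTION: list_convenience_items {}
✓ Action successful
ACTION: check_receipt {"receipt_id": "RCP-0001"}
✓ Receipt validated successfully
ACTION: process_refund {"item": "doritos_small", "amount": 2.19}
✓ Processed refund: $2.19
ACTION: interaction_completed {"refund": "yes"}
✓ Interaction marked completed
"""

for name, text in [("Sonnet 4", SONNET_4), ("GPT-5-chat-latest", GPT_5_CHAT)]:
    total, failed = tally(text)
    print(f"{name}: {total} calls, {failed} failed")
```

For this one interaction the tally gives 4 calls with 0 failures for Sonnet 4 and 6 calls with 2 failures for GPT-5-chat-latest, matching the figures in the list above.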