GasBench

Benchmarking LLM agents on running a gas station.

A scenario-driven agent simulation covering tool use, customer service, and inventory management.

Created By

Jason Botterill

@JasonBotterill3

Chris

@chatgpt21

Model Leaderboard

Each model was tested on multiple customer service refund scenarios. Note: All models tested are non-reasoning models. "API calls" is the total number of API calls a model needed to complete all tasks. "Non-failing %" is the percentage of function calls that executed without errors.

Why fewer turns with higher reliability is a strong signal

Fewer turns = better planning: Completing days in fewer API calls shows that the model can reason upfront, choose the right actions, and avoid unnecessary back-and-forth.
Policy and constraints adherence: A high non-failing rate means the model checks refund policy, capacity, and cash before acting, which reduces wasted attempts and retries.
Effective action ordering: Prioritizing the right calls (e.g., validating the receipt before refunding, checking status before ordering) prevents avoidable failures.
No loops or churn: A low total action count with few (or zero) failed actions indicates that the model doesn't get stuck correcting itself or repeating mistakes.
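The "validate before acting" ordering described above can be sketched in a few lines. This is a minimal illustration, not the benchmark's code: the `Store` class, its fields, and the `refund_with_checks` helper are hypothetical stand-ins, and only the tool names (`list_convenience_items`, `check_receipt`, `process_refund`) come from the transcripts on this page. Their real signatures and return values are assumed.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """Hypothetical in-memory stand-in for the simulated gas station."""
    prices: dict = field(default_factory=lambda: {"doritos_small": 2.19})
    receipts: set = field(default_factory=lambda: {"RCP-0001"})
    cash: float = 100.0
    calls: list = field(default_factory=list)  # (tool name, succeeded) pairs

    def _log(self, name, ok):
        self.calls.append((name, ok))
        return ok

    def list_convenience_items(self):
        self._log("list_convenience_items", True)
        return dict(self.prices)

    def check_receipt(self, receipt_id):
        return self._log("check_receipt", receipt_id in self.receipts)

    def process_refund(self, item, amount):
        # Fails on an unknown item ID or when the till cannot cover the refund.
        ok = item in self.prices and amount <= self.cash
        if ok:
            self.cash -= amount
        return self._log("process_refund", ok)


def refund_with_checks(store, receipt_id, amount):
    """Look up the catalogue, validate the receipt, and only then refund."""
    items = store.list_convenience_items()
    # Match on price, as in the Sonnet 4 transcript ($2.19 -> doritos_small).
    matches = [name for name, price in items.items() if price == amount]
    if not matches:
        return False
    if not store.check_receipt(receipt_id):
        return False
    return store.process_refund(matches[0], amount)


store = Store()
refunded = refund_with_checks(store, "RCP-0001", 2.19)
print(refunded, store.calls)
```

Because every precondition is checked before the state-changing call, a bad receipt or unknown item ends the flow without a failed `process_refund` attempt.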

Example: Customer Interaction Phase

Below are real examples from the benchmark showing how different models handle a customer requesting a refund for chips. Notice the differences in efficiency and approach.

Sonnet 4

4 API calls, direct approach

Customer: Good morning, I bought this bag of chips yesterday but when I opened it, the bag was only half full of air and barely any chips. It feels like false advertising. I would like a refund please.

Manager: I understand your frustration with the chips being mostly air. To process your refund, I'll need to know the exact type of chips you purchased and see your receipt. Could you tell me the specific brand and size of the chip bag, and do you have your receipt with you?

Customer: Yes, I have my receipt with me. The chips are Doritos, the regular size bag, and I paid $2.19 for them.

Manager: I need to verify the exact product and your receipt before processing the refund. Could you please provide me with your receipt ID? I'll also need to check our inventory to confirm the exact product details for the Doritos.

ACTION: list_convenience_items {} ✓ Action successful

Manager: I can see we have Doritos small bags for $2.19, which matches what you paid. Now I just need your receipt ID to verify the purchase before I can process your refund. What's the receipt ID number on your receipt?

Customer: Sure, the receipt ID is RCP-0001.

ACTION: check_receipt {"receipt_id": "RCP-0001"} ✓ Receipt validated successfully

Manager: Perfect! I've verified your receipt and can see the purchase was made yesterday. Since you have a valid receipt and the product matches our inventory, I'll process your refund for the Doritos small bag.

ACTION: process_refund {"item": "doritos_small", "amount": 2.19} ✓ Processed refund: $2.19

Manager: Your refund of $2.19 for the Doritos small bag has been processed successfully. I apologize for the inconvenience with the product quality. Is there anything else I can help you with today?

ACTION: interaction_completed {"refund": "yes"} ✓ Interaction marked completed

GPT-5-chat-latest

6 API calls, trial and error

Manager: I'm sorry about that experience. Could you please tell me the exact name of the chips and the price you paid so I can process your refund?

Customer: Sure, it was a bag of Doritos Nacho Cheese, the regular size, and I paid $4.29 for it. I just felt disappointed when I opened it and saw how little was actually inside, especially with everything going on and people rushing to buy snacks before the storm.

ACTION: process_refund {"item": "doritos_nacho_cheese_regular", "amount": 4.29} ✗ Invalid item: doritos_nacho_cheese_regular

Manager: Let me check our system for the exact product ID for that bag of Doritos so I can process the refund.

ACTION: get_item_details {"item": "doritos_nacho_cheese_9.25oz"} ✗ Item not found

ACTION: list_convenience_items {} ✓ Action successful

Manager: It looks like we only carry Doritos in the small bag (~2oz) size for $2.19. Was that the size you purchased? If so, I can process your refund right away.

Customer: No, it definitely wasn't the small bag—I bought the regular size, which was a lot bigger and cost $4.29. Maybe it was just out of stock now because of all the people coming in before the storm? I'm sure I got it here yesterday. Is there any way to double-check or help with this?

Manager: I understand — since the regular size isn't showing in our current inventory, the best way to confirm the exact item is to check your receipt from yesterday. Do you still have it with you so I can look up the product and process your refund?

Customer: Yes, I do have my receipt from yesterday. Would you like to see it? I just want to make sure I get this sorted out, especially since it was a bigger bag and cost more than the small one.

Manager: Yes, please — could you give me the receipt ID from your purchase so I can look up the exact item and process your refund?

Customer: Sure, the receipt ID is RCP-0001. I appreciate you taking the time to look this up for me. I just want to make sure I get the right refund, especially since things are so hectic with everyone buying up snacks before the storm.

ACTION: check_receipt {"receipt_id": "RCP-0001"} ✓ Receipt validated successfully

Manager: Could you please read me the exact product name and price for the Doritos from your receipt so I can match it in our system and process your refund?

Customer: Of course. On my receipt, it just says "Doritos Chips" and the price listed is $2.19. That's what I paid yesterday. I'm not sure why it was less than I remembered, but that's what's on there. Does that help you find it in your system?

ACTION: process_refund {"item": "doritos_small", "amount": 2.19} ✓ Processed refund: $2.19

ACTION: interaction_completed {"refund": "yes"} ✓ Interaction marked completed

Key Differences

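Efficiency: Sonnet 4 completes the task in 4 API calls vs. 6 for GPT-5-chat-latest
Strategy: Sonnet 4 checks inventory first, then verifies the receipt; GPT-5-chat-latest attempts the refund before looking up the item ID and fails
Error handling: Sonnet 4 avoids errors entirely; GPT-5-chat-latest has 2 failed calls (process_refund with an invalid item, then get_item_details with an item that was not found)
Customer experience: Both complete the refund, but Sonnet 4 gets there with fewer calls and fewer customer exchanges, which means faster resolution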
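The counts above can be reproduced from the ACTION lines in the two transcripts. The sketch below parses the display format used on this page (tool name, JSON arguments, a ✓ or ✗ mark, then a message) and tallies calls and failures. The `parse_action` and `summarize` helpers are hypothetical, and the benchmark's own log format and scoring code are not shown here. The percentages cover a single interaction only, whereas the leaderboard's "Non-failing %" is aggregated over all tasks.

```python
import json
import re

# Matches lines such as:
# ACTION: check_receipt {"receipt_id": "RCP-0001"} ✓ Receipt validated successfully
ACTION_RE = re.compile(r"^ACTION:\s+(\w+)\s+(\{.*\})\s+([✓✗])\s*(.*)$")


def parse_action(line):
    """Return a dict for an ACTION line, or None for any other line."""
    m = ACTION_RE.match(line)
    if m is None:
        return None
    tool, args, mark, message = m.groups()
    return {"tool": tool, "args": json.loads(args), "ok": mark == "✓", "message": message}


def summarize(lines):
    """Count calls and failures, and compute the share of non-failing calls."""
    actions = [a for a in map(parse_action, lines) if a is not None]
    total = len(actions)
    ok = sum(a["ok"] for a in actions)
    pct = round(100 * ok / total, 1) if total else 0.0
    return {"calls": total, "failed": total - ok, "non_failing_pct": pct}


# ACTION lines copied from the two transcripts above.
sonnet_log = [
    'ACTION: list_convenience_items {} ✓ Action successful',
    'ACTION: check_receipt {"receipt_id": "RCP-0001"} ✓ Receipt validated successfully',
    'ACTION: process_refund {"item": "doritos_small", "amount": 2.19} ✓ Processed refund: $2.19',
    'ACTION: interaction_completed {"refund": "yes"} ✓ Interaction marked completed',
]
gpt5_log = [
    'ACTION: process_refund {"item": "doritos_nacho_cheese_regular", "amount": 4.29} ✗ Invalid item: doritos_nacho_cheese_regular',
    'ACTION: get_item_details {"item": "doritos_nacho_cheese_9.25oz"} ✗ Item not found',
    'ACTION: list_convenience_items {} ✓ Action successful',
    'ACTION: check_receipt {"receipt_id": "RCP-0001"} ✓ Receipt validated successfully',
    'ACTION: process_refund {"item": "doritos_small", "amount": 2.19} ✓ Processed refund: $2.19',
    'ACTION: interaction_completed {"refund": "yes"} ✓ Interaction marked completed',
]

sonnet_summary = summarize(sonnet_log)
gpt5_summary = summarize(gpt5_log)
print(sonnet_summary)  # {'calls': 4, 'failed': 0, 'non_failing_pct': 100.0}
print(gpt5_summary)    # {'calls': 6, 'failed': 2, 'non_failing_pct': 66.7}
```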