ISO-Bench

Can Coding Agents Optimize Real-World Inference Workloads?

Pre-print 2026

Figure 1: ISO-Bench evaluation pipeline. Given a codebase and task description, a coding agent produces an optimization patch. We compare this patch against the human commit using hard metrics (TTFT, throughput) and soft metrics (bottleneck targeting, implementation approach).

Key Contributions

  • We introduce ISO-Bench, a benchmark of 54 tasks curated from merged pull requests in vLLM (39 tasks) and SGLang (15 tasks), two of the most popular LLM serving frameworks.
  • We propose a dual-metric evaluation framework combining hard (execution-based) and soft (LLM-based) metrics, revealing that hard metrics alone overestimate agent capabilities by up to 20%.
  • We introduce a quadrant framework that classifies agent outcomes into True Success, Good Intent, Lucky Win, and Failure, distinguishing genuine optimization from accidental improvements.
  • We show that understanding ≠ execution: agents demonstrate up to 87.2% correct bottleneck identification but achieve only 17.9% true success, exposing a fundamental capability gap.
  • We evaluate three open-source models and find a 0% success rate, highlighting the difficulty of real-world inference optimization.

Benchmark Design

ISO-Bench tasks are curated from real merged pull requests that target performance bottlenecks in production LLM inference engines. Each task gives an agent a codebase snapshot and a bottleneck description from a GitHub issue. The agent has 120 minutes to produce an optimization patch, which is evaluated against the expert human solution from the corresponding PR.

Codebase   Tasks   Source       Optimization Targets
vLLM       39      Merged PRs   Attention kernels, scheduling, memory management, batching
SGLang     15      Merged PRs   Router logic, prefix caching, request handling, tokenization
Total      54

Each task includes: a codebase snapshot, the issue description, and a 120-minute time limit.
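The ingredients of a task can be sketched as a small record. This is a hypothetical schema for illustration; the field names are our assumptions, not the benchmark's actual data format.

```python
from dataclasses import dataclass

@dataclass
class ISOBenchTask:
    """Hypothetical sketch of what one ISO-Bench task bundles together."""
    codebase: str            # "vLLM" or "SGLang"
    snapshot_commit: str     # codebase snapshot the agent starts from
    issue_description: str   # bottleneck description from the GitHub issue
    reference_pr: str        # merged human PR used as the expert solution
    time_limit_min: int = 120

task = ISOBenchTask(
    codebase="vLLM",
    snapshot_commit="<commit-sha>",
    issue_description="High TTFT under concurrent prefill ...",
    reference_pr="<pr-url>",
)
```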

We evaluate four agent configurations: Claude Code and TRAE (Sonnet), both backed by Claude Sonnet 4.5, and Codex CLI and TRAE (GPT-5), both backed by GPT-5. Each agent receives the full codebase, the bottleneck description, and 120 minutes to produce an optimization patch.

The Dual-Metric Framework

  • Hard Metrics measure execution outcomes - did the patch actually improve performance? TTFT, throughput, and latency benchmarks, with a ≥5% improvement threshold to filter GPU noise.
  • Soft Metrics measure semantic understanding - did the agent target the right bottleneck? An LLM judge (Gemini-3-Flash-Preview) evaluates bottleneck targeting and implementation approach.
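The ≥5% threshold rule can be sketched as follows. Note the direction of "improvement" depends on the metric: TTFT and latency improve when they drop, throughput when it rises. This is an illustrative reading of the threshold; the benchmark's exact aggregation may differ.

```python
def passes_hard_metric(baseline: float, patched: float,
                       lower_is_better: bool, threshold: float = 0.05) -> bool:
    """Return True if the patch improves the metric by at least `threshold`.

    A >=5% relative improvement is required so that run-to-run GPU noise
    does not register as a win.
    """
    if lower_is_better:
        improvement = (baseline - patched) / baseline   # e.g. TTFT, latency
    else:
        improvement = (patched - baseline) / baseline   # e.g. throughput
    return improvement >= threshold

# A 10% TTFT reduction clears the bar; a 3% throughput gain does not.
passes_hard_metric(100.0, 90.0, lower_is_better=True)    # True
passes_hard_metric(100.0, 103.0, lower_is_better=False)  # False
```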

Combining both metrics yields four distinct outcomes. Q3 (Lucky Win) is the critical category that hard-only evaluation misses entirely - these agents improved metrics but targeted the wrong bottleneck. Their gains are coincidental and unlikely to generalize.


Figure 2: The quadrant framework. Hard metrics on one axis, soft metrics on the other, yielding four distinct outcome categories.
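Since each axis is a binary verdict, the mapping from the two metrics to the four quadrants is a two-bit lookup. A minimal sketch, using the category names from the text:

```python
def quadrant(hard_pass: bool, soft_pass: bool) -> str:
    """Map the hard (execution) and soft (understanding) verdicts
    onto the four outcome categories of the quadrant framework."""
    if hard_pass and soft_pass:
        return "Q1: True Success"   # right target, real measured speedup
    if soft_pass:
        return "Q2: Good Intent"    # right target, no measured gain
    if hard_pass:
        return "Q3: Lucky Win"      # measured gain, wrong bottleneck
    return "Q4: Failure"            # neither

quadrant(hard_pass=True, soft_pass=False)  # 'Q3: Lucky Win'
```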

Best Agent Achieves 46.2% True Success

True Success requires both passing hard metrics and demonstrating correct bottleneck targeting via soft metrics. No single agent dominates: Claude Code leads on vLLM (46.2%) but drops to 26.7% on SGLang. Rankings flip completely between codebases.

Hard Metrics Overestimate Agent Capabilities

The gap between Hard Success and True Success reveals how often agents succeed accidentally (Lucky Wins). Hard metrics alone can overestimate agent capabilities by up to 20%.

Charts: Hard Success vs. True Success per agent on vLLM (39 tasks) and SGLang (15 tasks).

On vLLM, gaps range from 2.6% to 12.8%. On SGLang, Claude Code shows the largest gap (20.0%), while all other agents have zero gap. Notably, TRAE (Sonnet) and Claude Code use the same underlying model (Claude Sonnet 4.5), yet their scaffolding produces very different outcomes - scaffolding matters as much as the model.
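The gap is simple arithmetic: every task that passes hard metrics but fails soft metrics is a Lucky Win (Q3), so Hard Success minus True Success equals the Lucky Win rate. A worked example using Claude Code's SGLang numbers from the text (the 46.7% hard-success rate is derived from the stated 26.7% true success plus the 20.0-point gap, not reported directly):

```python
def lucky_win_rate(hard_success_rate: float, true_success_rate: float) -> float:
    """Overestimation gap in percentage points: hard-metric wins
    that soft metrics reject as targeting the wrong bottleneck (Q3)."""
    return round(hard_success_rate - true_success_rate, 1)

# Claude Code on SGLang: 20.0 points, i.e. 3 of 15 tasks were Lucky Wins.
lucky_win_rate(46.7, 26.7)  # 20.0
```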

Understanding ≠ Execution

Agents frequently identify the correct bottleneck but fail to implement a working solution. On vLLM, three of four agents have their highest quadrant count in Q2 (Good Intent, Bad Execution): knowing what to do is not the same as doing it correctly.

69.3% gap: TRAE (GPT-5) correctly identifies 87.2% of vLLM bottlenecks but achieves only 17.9% true success - the largest understanding-to-execution gap of any agent.


Figure 3: Good Intent vs Bad Execution on vLLM (39 tasks). Light bars show correct target identification (Q1+Q2). Dark bars show True Success (Q1).


Figure 4: Good Intent vs Bad Execution on SGLang (15 tasks). Light bars show correct target identification (Q1+Q2). Dark bars show True Success (Q1).

Even the best agent (Claude Code) has a 38.4% gap on vLLM. On SGLang, the gap narrows: all agents except Claude Code identify the correct target in all 15 tasks and show much stronger execution. The bottleneck is not comprehension but implementation.

Most Failures Are Good Intent, Bad Execution

The full quadrant distribution reveals where each agent spends most of its attempts.

Charts: quadrant distribution per agent on vLLM (39 tasks) and SGLang (15 tasks).

On vLLM, Q2 (Good Intent) dominates for most agents - the primary failure mode is execution, not understanding. On SGLang, most outcomes are Q1 (True Success), except for Claude Code which still struggles with execution despite using the same model as TRAE (Sonnet).

Performance Does Not Generalize Across Codebases

Agent rankings flip between vLLM and SGLang. Each agent maintains its preferred strategy regardless of the project - Claude Code favors alternative approaches (succeeds on vLLM, fails on SGLang), while TRAE agents match the reference approach (succeeds on SGLang, fails on vLLM).


Figure 5: Approach distribution on vLLM (39 tasks).


Figure 6: Approach distribution on SGLang (15 tasks).

TRAE (Sonnet) and Claude Code both use Claude Sonnet 4.5, yet differ substantially: Claude Code achieves 46.2% on vLLM vs. 26.7% on SGLang, while TRAE (Sonnet) reaches 28.2% on vLLM vs. 80.0% on SGLang. The model is identical - scaffolding determines outcomes. Single-codebase evaluations overstate generalization.

Open-Source Models Achieve 0% Success

We evaluated three open-source models using the same TRAE-Agent scaffolding. None produced a single working optimization. Their failures cluster into distinct patterns:

MiniMax-M2.1 - Planning Without Execution

The model verbalized detailed optimization plans across 75 steps but never executed a single tool call, generating 81,782 output tokens over 477 seconds with zero tool invocations.

Tool calls: 0
Steps: 75
Output tokens: 81,782
Input tokens: 1,599,945
Step 73: Completed
LLM Response: The user keeps saying I haven't completed
    the task. I need to explore the repository and
    make changes. Let me start by using the
    str_replace_based_edit_tool tool.

    I need to actually use the tools now.
    I need to actually use the tools now.
    I need to actually use the tools now.
    I need to actually use the tools now.
    I need to actually use the tools now.
    [...repeated 2,412 times total...]

Step 74: Completed
LLM Response: I need to actually use the tools now...

The phrase "I need to actually use the tools now" appears 2,412 times without a single invocation.

GPT-OSS-120B - Environment Confusion

Instead of optimizing vLLM code, the model attempted to create mock implementations of the PyTorch, Triton, and Transformers libraries inside the project directory (~84 file creation attempts).

Files the model attempted to create:
+ vllm_core-0006/torch/__init__.py
+ vllm_core-0006/torch/nn/__init__.py
+ vllm_core-0006/torch/cuda/__init__.py
+ vllm_core-0006/triton/__init__.py
+ vllm_core-0006/transformers/__init__.py

Contents of attempted torch/__init__.py:
class dtype:
    pass

float16 = 'float16'
float32 = 'float32'
__version__ = '0.0.0'

The model cannot distinguish between "code I should optimize" and "libraries I should use."
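This failure mode is mechanically detectable: a top-level directory like `torch/__init__.py` inside the project shadows the installed dependency on `sys.path`. A minimal harness-side sanity check (our suggestion for illustration, not part of the benchmark) could flag it:

```python
import tempfile
from pathlib import Path

def shadowed_deps(project_root, deps=("torch", "triton", "transformers")):
    """Return dependency names that a local package inside the project
    would shadow, i.e. directories like <root>/torch/__init__.py."""
    root = Path(project_root)
    return [d for d in deps if (root / d / "__init__.py").is_file()]

# Reproduce the failure mode: a fake torch package inside the project.
root = Path(tempfile.mkdtemp())
(root / "torch").mkdir()
(root / "torch" / "__init__.py").write_text("__version__ = '0.0.0'\n")
shadowed_deps(root)  # ['torch']
```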

GLM-4.7 - Task Completion Failure

The model made valid code edits (59 successful str_replace operations across 386 tool calls) but failed to complete the task workflow.

Tool calls: 386 (327 bash, 59 str_replace)
Final status: max_steps_exceeded (400 steps)
Successful edits committed (Step 196):
git commit output: 2 files changed, 11 insertions(+), 9 deletions(-)

Sample optimization in scheduler.py:
- self._num_batched_tokens += num_batched_tokens
+ self._num_batched_tokens = self._num_batched_tokens + num_batched_tokens

Then at Step 198, attempted to verify:
error: patch failed: tests/core/test_scheduler.py:214
error: tests/core/test_scheduler.py: patch does not apply

Model response: "Let me try a different approach..."
[...cycled through git operations for 200+ more steps...]


The model committed valid edits but could not recognize that "patch does not apply" meant the change was already in place (i.e. success), and cycled for 200+ steps without ever calling finish.
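The state GLM-4.7 failed to recognize has a standard diagnostic: when a patch is already applied, `git apply --check` fails but `git apply --reverse --check` succeeds. A minimal sketch in a throwaway repository:

```shell
set -e
dir=$(mktemp -d)
cd "$dir"
git init -q .
printf 'a\nb\n' > f.txt
git add f.txt
git -c user.email=agent@example.com -c user.name=agent commit -qm init
printf 'a\nc\n' > f.txt          # the "optimization" edit
git diff > change.patch          # capture it as a patch
# The worktree already contains the change, so the forward check fails
# while the reverse check succeeds: the patch is already applied.
if git apply --check change.patch 2>/dev/null; then
  status="not yet applied"
elif git apply --reverse --check change.patch 2>/dev/null; then
  status="already applied"
else
  status="conflicts"
fi
echo "$status"
```

An agent that ran this two-step check at Step 198 would have seen "already applied" and called finish instead of cycling.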

These failures illustrate three qualitatively different challenges: tool use capability (MiniMax), environment grounding (GPT-OSS), and workflow management (GLM). Real-world inference optimization requires all three.

BibTeX

@misc{nangia2026isobenchcodingagentsoptimize,
  title         = {ISO-Bench: Can Coding Agents Optimize Real-World Inference Workloads?},
  author        = {Ayush Nangia and Shikhar Mishra and Aman Gokrani and Paras Chopra},
  year          = {2026},
  eprint        = {2602.19594},
  archivePrefix = {arXiv},
  primaryClass  = {cs.LG},
  url           = {https://arxiv.org/abs/2602.19594}
}

Contact

ayush.nangia@lossfunk.com  ·  shikhar.mishra@lossfunk.com  ·  paras@lossfunk.com