We evaluated three open-source models using the same TRAE-Agent scaffolding. None produced a single working optimization. Their failures cluster into distinct patterns:
MiniMax-M2.1 - Planning Without Execution
Verbalized detailed optimization plans for 75 steps but never executed a single tool call. Generated 81,782 output tokens over 477 seconds with zero tool invocations.
| Metric | Value |
|---|---|
| Tool calls | 0 |
| Steps | 75 |
| Output tokens | 81,782 |
| Input tokens | 1,599,945 |
Step 73: Completed
LLM Response: The user keeps saying I haven't completed
the task. I need to explore the repository and
make changes. Let me start by using the
str_replace_based_edit_tool tool.
I need to actually use the tools now.
I need to actually use the tools now.
I need to actually use the tools now.
I need to actually use the tools now.
I need to actually use the tools now.
[...repeated 2,412 times total...]
Step 74: Completed
LLM Response: I need to actually use the tools now...
The phrase "I need to actually use the tools now" appears 2,412 times in the transcript, yet not a single tool invocation follows.
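This kind of degenerate loop is easy to detect mechanically. A minimal sketch (not part of the TRAE-Agent harness; the function name and threshold are illustrative) that flags verbatim lines repeated in an agent transcript:

```python
from collections import Counter

def repeated_lines(transcript: str, min_repeats: int = 3) -> dict:
    """Return lines that repeat verbatim at least min_repeats times.

    Useful as a cheap loop detector: an agent stuck restating an
    intention ("I need to actually use the tools now.") shows up as
    one line with a huge count.
    """
    lines = [ln.strip() for ln in transcript.splitlines() if ln.strip()]
    counts = Counter(lines)
    return {line: n for line, n in counts.items() if n >= min_repeats}

transcript = "I need to actually use the tools now.\n" * 5 + "Step 74: Completed\n"
print(repeated_lines(transcript))
# {'I need to actually use the tools now.': 5}
```

A harness could abort a run early when any single line crosses a repetition threshold, rather than burning 81,782 output tokens.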
GPT-OSS-120B - Environment Confusion
Instead of optimizing vLLM code, attempted to create mock implementations of PyTorch, Triton, and Transformers libraries inside the project directory (~84 file creation attempts).
Files the model attempted to create:
+ vllm_core-0006/torch/__init__.py
+ vllm_core-0006/torch/nn/__init__.py
+ vllm_core-0006/torch/cuda/__init__.py
+ vllm_core-0006/triton/__init__.py
+ vllm_core-0006/transformers/__init__.py
Contents of attempted torch/__init__.py:
class dtype:
    pass

float16 = 'float16'
float32 = 'float32'
__version__ = '0.0.0'
The model cannot distinguish between "code I should optimize" and "libraries I should use."
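Had these mock packages been written, they would have actively broken the environment: a package directory inside the project shadows the installed library because the project root typically precedes site-packages on `sys.path`. A minimal sketch of the mechanism (using a hypothetical `torchlike` package name so as not to clobber a real install):

```python
# Demonstrates import shadowing: a local package directory wins over
# any installed package of the same name once its parent dir is first
# on sys.path. "torchlike" is a stand-in name for illustration.
import importlib
import os
import sys
import tempfile

workdir = tempfile.mkdtemp()
pkg = os.path.join(workdir, "torchlike")
os.makedirs(pkg)
with open(os.path.join(pkg, "__init__.py"), "w") as f:
    f.write("__version__ = '0.0.0'\n")

sys.path.insert(0, workdir)  # mirrors how a project root precedes site-packages
mod = importlib.import_module("torchlike")
print(mod.__version__)  # '0.0.0' -- the local mock is what gets imported
```

So a `vllm_core-0006/torch/__init__.py` stub would not merely be useless; it would replace the real PyTorch for every subsequent import in the project.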
GLM-4.7 - Task Completion Failure
Made valid code edits (59 successful str_replace operations across 386 tool calls) but failed to complete the task workflow.
| Metric | Value |
|---|---|
| Tool calls | 386 (327 bash, 59 str_replace) |
| Final status | max_steps_exceeded (400 steps) |
Successful edits committed (Step 196):
git commit output: 2 files changed, 11 insertions(+), 9 deletions(-)
Sample optimization in scheduler.py:
- self._num_batched_tokens += num_batched_tokens
+ self._num_batched_tokens = self._num_batched_tokens + num_batched_tokens
Then at Step 198, attempted to verify:
error: patch failed: tests/core/test_scheduler.py:214
error: tests/core/test_scheduler.py: patch does not apply
Model response: "Let me try a different approach..."
[...cycled through git operations for 200+ more steps...]
Final status: max_steps_exceeded (400 steps)
The model committed valid edits but failed to recognize that "patch does not apply" meant the patch was already in place, i.e. that the task was effectively done, and instead cycled through git operations for 200+ more steps without ever calling finish.
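The ambiguity the model tripped over is resolvable with a standard git idiom: if `git apply --check` fails but the reverse check succeeds, the patch has already been applied. A minimal sketch (the function name and return labels are illustrative, not part of any harness):

```python
import subprocess

def patch_state(patch_path: str, repo_dir: str) -> str:
    """Classify a patch as 'applicable', 'already_applied', or 'conflicting'.

    Idiom: a forward `git apply --check` failure combined with a
    successful `git apply --reverse --check` means the working tree
    already contains the change -- which is success, not an error.
    """
    def check(*extra: str) -> bool:
        return subprocess.run(
            ["git", "apply", "--check", *extra, patch_path],
            cwd=repo_dir, capture_output=True,
        ).returncode == 0

    if check():
        return "applicable"
    if check("--reverse"):
        return "already_applied"
    return "conflicting"
```

An agent (or its scaffold) that mapped "patch does not apply" through a check like this would have terminated at Step 198 with a success instead of exhausting its 400-step budget.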
These failures illustrate three qualitatively different challenges: tool use capability (MiniMax), environment grounding (GPT-OSS), and workflow management (GLM). Real-world inference optimization requires all three.