UITARS actions left_double/right_single not mapped → invalid output + message conversion crash #639

@zju-lx

Summary

The UITARS action space defines left_double and right_single, but convert_to_computer_actions only handles double_click and right_click.
Since the model emits the former names, these actions are silently dropped.
When this happens, the fallback logic in convert_uitars_messages_to_litellm appends a list of raw strings to content, which later causes huggingfacelocal_adapter._convert_messages to crash with 'str' object has no attribute 'get'.

Steps to Reproduce

  1. Update the test prompt in tests/agent_loop_testing/agent_test.py:

    message = "Open Safari browser by double click"

  2. Run:

    uv run tests/agent_loop_testing/agent_test.py --model "huggingface-local/ByteDance-Seed/UI-TARS-1.5-7B"

Expected Behavior

  • left_double should correctly translate into a double-click action.
  • right_single should translate into a right-click action.
  • Message conversion should never introduce raw strings into content, ensuring downstream adapters can parse messages correctly.
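The last expectation can be stated as a small invariant check. This is only a sketch: `find_raw_string_content` is a hypothetical helper, not part of the agent library, and it assumes messages are plain dicts whose `content` is either a string or a list of dict items.

```python
from typing import Any


def find_raw_string_content(messages: list[dict[str, Any]]) -> list[int]:
    """Return indices of messages whose list-valued content holds a non-dict item."""
    bad = []
    for index, message in enumerate(messages):
        content = message.get("content")
        # A plain string content is allowed; only list content must hold dict items.
        if isinstance(content, list) and any(not isinstance(item, dict) for item in content):
            bad.append(index)
    return bad


messages = [
    {"role": "user", "content": "Open Safari browser by double click"},
    # Malformed: a list of raw strings, as produced by the fallback path.
    {"role": "assistant", "content": ["raw model output"]},
    # Well-formed: a list of structured blocks.
    {"role": "assistant", "content": [{"type": "text", "text": "raw model output"}]},
]

print(find_raw_string_content(messages))  # [1]
```

A check like this could run in a unit test right after message conversion, so a schema regression fails there instead of inside the adapter.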

Actual Behavior

Actions are ignored, resulting in no computer events being executed.
During the next conversion pass, huggingfacelocal_adapter._convert_messages crashes because content contains raw strings instead of structured message objects.

🔥Error Logs

🤖 Testing CUA Agent: huggingface-local//home/liangxiao/NAS/ByteDance-Seed/UI-TARS-1.5-7B
==================================================
✅ CUA Agent created
✅ Mock computer ready
🚀 Running agent...

Iteration 1:
  Agent: 
  Unknown output type

Iteration 2:
  (No output from agent)
  Debug - Full result: {'output': [], 'usage': Usage(completion_tokens=0, prompt_tokens=0, total_tokens=0)}

Iteration 3:
  (No output from agent)
  Debug - Full result: {'output': [], 'usage': Usage(completion_tokens=0, prompt_tokens=0, total_tokens=0)}

❌ Test failed: litellm.APIConnectionError: 'str' object has no attribute 'get'
Traceback (most recent call last):
  File "huggingfacelocal_adapter.py", line 70, in _convert_messages
    if item.get("type") == "text":
       ^^^^^^^^
AttributeError: 'str' object has no attribute 'get'
LiteLLM Retried: 3 times

Root Cause

  • convert_to_computer_actions does not map UITARS-native actions (left_double, right_single) to the expected internal action names.
  • When the model emits these actions, they are skipped, resulting in no structured action blocks.
  • The fallback in convert_uitars_messages_to_litellm appends current_assistant_content as a list of raw strings, breaking message schema and causing _convert_messages to fail.
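The failure mode can be reproduced in isolation. The loop below is a simplified stand-in modeled on the single line shown in the traceback (`if item.get("type") == "text":`); it is not the adapter's actual code, and `convert_content_items` is a hypothetical name.

```python
def convert_content_items(content):
    """Simplified stand-in for the adapter loop from the traceback."""
    texts = []
    for item in content:
        # Assumes every item is a dict; a raw string has no .get method.
        if item.get("type") == "text":
            texts.append(item["text"])
    return texts


# Content shaped like the fallback output: a list of raw strings.
malformed = {"role": "assistant", "content": ["raw model output"]}

try:
    convert_content_items(malformed["content"])
except AttributeError as exc:
    error_message = str(exc)
    print(error_message)  # 'str' object has no attribute 'get'
```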

Fix

  • In convert_to_computer_actions: map left_double and right_single to double-click/right-click.
  • In convert_uitars_messages_to_litellm: wrap any trailing current_assistant_content in a single text block, {"role": "assistant", "content": [{"type": "text", "text": "\n".join(...)}]}, so that content always holds structured items rather than raw strings.
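A minimal sketch of both changes, under stated assumptions: `UITARS_ACTION_ALIASES`, `normalize_action_type`, and `wrap_assistant_text` are hypothetical names introduced for illustration, and the real functions in uitars.py may be structured differently. Only the action names and the message shape come from this report.

```python
# Hypothetical alias table: UITARS-native name -> name the converter already handles.
UITARS_ACTION_ALIASES = {
    "left_double": "double_click",
    "right_single": "right_click",
}


def normalize_action_type(action_type: str) -> str:
    """Map a UITARS-native action name to its handled equivalent, if any."""
    return UITARS_ACTION_ALIASES.get(action_type, action_type)


def wrap_assistant_text(lines: list[str]) -> dict:
    """Wrap leftover assistant text lines in one structured text block."""
    return {
        "role": "assistant",
        "content": [{"type": "text", "text": "\n".join(lines)}],
    }


print(normalize_action_type("left_double"))   # double_click
print(normalize_action_type("right_single"))  # right_click
print(wrap_assistant_text(["first line", "second line"]))
```

Normalizing the name before dispatch keeps the existing double_click and right_click branches unchanged, and names without an alias pass through as they are.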

References

  • libs/python/agent/agent/loops/uitars.py (action conversion + message conversion)
