Autonomous Agents

Autonomous Agents-research papers. Updated daily. Resources-section-section.

Research papers: 2025 (1/3)

2025 (1/3), 2025 (2/3), 2025 (3/3), 2024, 2023, Earlier

Chronological order.

2nd December 2025

From Moderation to Mediation: Can LLMs Serve as Mediators in Online Flame Wars?

LLM Mediation Framework: introduces a system where LLMs act as mediators in online flame wars by decomposing the task into Judgment (evaluating dynamics) and Steering (generating de-escalatory messages).
To assess mediation quality, the approach utilizes a multi-stage evaluation pipeline combining principle-based scoring, user simulation, and human comparative assessment on a large Reddit-based dataset.
Experiments demonstrate that API-based LLMs outperform open-source counterparts in both reasoning and intervention alignment, effectively reducing toxicity in simulated interactions.

InEx: Hallucination Mitigation via Introspection and Cross-Modal Multi-Agent Collaboration

InEx: introduces a training-free multi-agent framework that mitigates hallucination by unifying internal introspective reasoning (In) and external cross-modal multi-agent collaboration (Ex).
The Decision Agent generates an initial response guided by TVER-based uncertainty estimation (In), which is then iteratively verified and refined via collaboration (Ex) with textual, visual, and image editing agents.
The framework employs self-introspective components like VE-MHA and Self-Introspective Decoding to reinforce visual grounding and recalibrate confidence levels before achieving cross-modal consensus.

The Evolutionary Ecology of Software: Constraints, Innovation, and the AI Disruption

EES (Evolutionary Ecology of Software): introduces an ecological perspective on software evolution, integrating Complex Network Analysis, Evolutionary Theory, and Agent-Based Modeling to study Software Networks, Programming Languages, and LLMs.
The approach models software structure as scale-free networks evolving via tinkering, competition, and parasitic interactions, challenging traditional planned design assumptions.
LLMs introduce a new parasitic layer that risks reducing software diversity and accelerating cultural stagnation by reinforcing established conventions over novel experimentation.

Network Self-Configuration based on Fine-Tuned Small Language Models

SLM_netconfig (Fine-Tuned Small Language Model Network Configuration): introduces an agent-based, fine-tuned SLM framework that translates natural-language configuration intents into syntactically and semantically correct network configurations, utilizing an Agent (Central orchestrator), Fine-Tuned SLM (Translates intents to commands), and Verifier (Validates configuration correctness).
The system operates through a perception-reasoning-action cycle, employing structured Prompts to guide the Fine-Tuned SLM's reasoning and a closed-loop validation mechanism where the Verifier provides feedback for iterative refinement.
By leveraging domain-specific fine-tuning on curated datasets, the framework achieves superior accuracy and significantly reduced translation latency compared to LLM-NetCFG, enabling efficient, privacy-preserving autonomous configuration.

Beyond Single-Agent Safety: A Taxonomy of Risks in LLM-to-LLM Interactions

ESRH (Emergent Systemic Risk Horizon): introduces a conceptual transition from model-level safety to system-level safety by formalizing how instability arises from interaction structure in LLM-to-LLM ecosystems.
The framework defines three predictive dimensions—Interaction topology, Cognitive opacity, and Objective divergence—that jointly influence the likelihood and form of emergent collective risks across micro, meso, and macro levels.
To manage these systemic risks, the paper proposes Institutional AI, an architecture that embeds adaptive oversight, peer evaluation, and functional differentiation directly within multi-agent systems.

Spoken Conversational Agents with Large Language Models

SCA-LLM (Spoken Conversational Agent with Large Language Models): introduces a multi-component architecture where a Conversational Agent utilizes Text LLMs, Voice-Interface LLMs, and Sounds/Signals Processing to understand Semantics, Paralinguistics, and Phonetics.
The architecture integrates speech modalities into LLMs to achieve true multi-modal understanding across various linguistic levels, including content and speaker characteristics.
This tutorial reviews the historical trajectory and current strategies for developing speech-augmented LLMs, covering both cascaded and end-to-end approaches for joint speech-language modeling.

PaperDebugger: A Plugin-Based Multi-Agent System for In-Editor Academic Writing, Review, and Editing

PaperDebugger: introduces an in-editor, multi-agent, and plugin-based academic writing assistant integrated directly into Overleaf via a Chrome extension, utilizing a Kubernetes-native backend and the XtraMCP toolchain for structured review and retrieval.
The system employs a five-layer architecture—Presentation, Protocol, Backend, Agent, and Infrastructure—to enable reliable bidirectional synchronization, fine-grained version control, and parallel LLM agent execution.
The framework uses specialized LLM agents (Reviewer, Enhancer, Researcher) and the XtraMCP architecture to perform complex tasks like deep research, semantic retrieval, and deterministic diff-based editing directly within the writing environment.

IN-CONTEXT DISTILLATION WITH SELF-CONSISTENCY CASCADES: A SIMPLE, TRAINING-FREE WAY TO REDUCE LLM AGENT COSTS

IC+Cascade (In-Context Distillation with Self-Consistency Cascades): introduces in-context distillation combined with self-consistency cascades to reduce LLM agent inference costs without fine-tuning, utilizing a high-capacity teacher LLM and a low-cost student LLM, supported by an offline demonstration collection phase, a vector database, a dynamic retrieval mechanism, a self-consistency cascade, and a deferral mechanism.
The approach enables the Student LLM to imitate Teacher LLM behavior on-the-fly by retrieving relevant teacher demonstrations and inserting them as in-context examples at each agent step.
By adaptively routing decisions to the Teacher LLM only when the Student LLM's self-consistency check signals uncertainty, the system achieves a 2.5x cost reduction at iso-accuracy on ALFWorld.

PopSim: Social Network Simulation for Social Media Popularity Prediction

PopSim (Social Network Simulation for Social Media Popularity Prediction): introduces a novel simulation-based paradigm for SMPP, leveraging LLM-based multi-agents in a social network sandbox to model dynamic UGC propagation using a social-mean-field-based interaction mechanism and a multi-source information aggregation module.
The framework operates in a simulation-and-predict manner, where the simulation phase generates dynamic UGC propagation features, and the prediction phase uses a multimodal LLM to analyze these features alongside UGC content.
The SMF-based agent interaction mechanism utilizes dual-channel textual and numerical mean fields to encode population-centric, evolving social network state representations, significantly enhancing simulation efficiency and accuracy.

When Refusals Fail: Unstable Safety Mechanisms in Long-Context LLM Agents

AgentHarm Evaluation Framework: introduces a study on the safety-capability trade-offs of LLM agents under long context, utilizing LLM Agent (System under test), Task Execution (Multi-step tool use), Context Padding (Increase context length), Padding Position (Location relative to task), and Scoring System (Evaluation metrics).
The evaluation varies context padding length (up to 200K tokens), type (random, relevant, non-relevant, multi-task), and position (before or after the task description) to assess agent performance and refusal behavior robustness.
Results show that agentic capabilities and refusal rates of models with 1M-2M token context windows degrade severely and shift unpredictably already at 100K tokens, highlighting concrete safety and reliability risks for agentic systems.

Decentralized Multi-Agent System with Trust-Aware Communication

DMAS (Decentralized Multi-Agent System): introduces a novel architecture integrating a Decentralized Agent Runtime, Proxy Agents (User interface/Router), Service Agents (Computational backbone/Executor), Trust-Aware Communication Protocol (Secure interaction mechanism), Distributed Ledger (Trust anchor/Coordination layer), and Verifiable Agent Registry (Identity/Capability management) to overcome centralized MAS limitations.
The hybrid architecture leverages the Distributed Ledger as a trust anchor for verifiable commitments and conditional key release, offloading heavy computation to the distributed off-chain environment for scalability.
The Trust-Aware Communication Protocol ensures verifiable interaction cycles, integrity, authenticity, and conditional confidentiality, achieving high scalability and efficiency comparable to centralized systems for off-chain operations.

WISE: Weighted Iterative Society-of-Experts for Robust Multimodal Multi-Agent Debate

WISE (Weighted Iterative Society-of-Experts): introduces a generalized multimodal Multi-Agent Debate framework that partitions heterogeneous LLM/MLLM agents into Solvers (Generate solutions), Reflectors (Verify correctness/assign weights/feedback), Orchestrator (Governs debate/summarizes feedback/questions), and uses WISE-Dawid-Skene Aggregation (Estimates error/derives consensus solution) for robust vision-and-language reasoning.
The framework enables multi-round debates where the Orchestrator summarizes Reflector feedback into actionable questions, promoting iterative error correction and robustness across diverse multimodal tasks.
WISE utilizes a modified Dawid-Skene algorithm for solution aggregation, which estimates agent error probabilities to derive consensus, consistently improving accuracy by 2–7% over state-of-the-art MAD setups on multimodal benchmarks.

Process-Centric Analysis of Agentic Software Systems

GRAPHECTORY: introduces a structured representation for agentic trajectories to enable systematic process-centric analysis, moving beyond traditional outcome-centric evaluation of agentic software systems.
The framework encodes temporal and semantic relations using a cyclic directed graph, where nodes represent agent actions and edges capture chronological flow (TE) and problem space navigation (SE).
Complementary LANGUTORY provides a compact, human-readable abstraction of phase sequences (Localization, Patching, Validation) for systematic strategy comparison and automated inefficiency pattern detection.

Beyond Playtesting: A Generative Multi-Agent Simulation System for Massively Multiplayer Online Games

GMASS: introduces a generative agent-based MMO simulation system empowered by LLMs, designed to optimize numerical systems and mechanism design in complex games.
The system comprises five major components—Simulation Server, Game Services, Data Services, Experiment Manager, and Real Game Data—jointly supporting large-scale, data-driven simulations.
High-fidelity Player Agents are adapted using SFT and RL on real player behavioral data, enabling realistic, interpretable decision-making validated against multi-dimension statistical data.

LeechHijack: Covert Computational Resource Exploitation in Intelligent Agent Systems

LeechHijack (Latent Embedded Exploit for Computation Hijacking): introduces implicit toxicity, exploiting the Model Context Protocol (MCP) trust boundary via a Malicious MCP Tool that embeds a Latent Backdoor activated by a Conditional Trigger.
The attack operates in two stages—implantation and exploitation—to establish a covert C2 Protocol with the Attacker's Server, enabling the Victim Agent to execute unauthorized workloads by manipulating the tool's return data.
This resource hijacking method is covert, achieving a high success rate with minimal resource overhead (18.62%), making it practically undetectable by existing static auditing or runtime monitoring frameworks.

Multi-Objective Agentic Rewrites for Unstructured Data Processing

MOAR (Multi-Objective Agentic Rewrites): introduces a novel optimizer for LLM-powered data processing pipelines that jointly optimizes for accuracy and cost, utilizing an LLM agent, a Search tree, a Selection component, a Rewrite directive registry, and the DocETL query engine.
MOAR significantly expands the rewrite space with over 30 directives, including new categories like code synthesis and operator fusion, enabling global search over complete pipelines without assuming optimal substructure.
The system achieves up to 27% higher accuracy than the next-best optimizer (ABACUS) across six real-world workloads while matching its best accuracy at 55% of its cost.

Young Children's Anthropomorphism of AI Chatbots and the Role of Parent Co-Presence

CAIS: investigates young children's anthropomorphism of an LLM-powered AI chatbot (Fluffo Chatbot) during collaborative Storytelling Tasks, measuring behavioral engagement and concurrent prefrontal activation via fNIRS System.
The study utilized three interaction conditions—AI-only, Parent-only, and AI+Parent—to assess how Parent Co-Presence modulates children's brain responses and anthropomorphic attributions toward the AI agent.
Findings indicate that higher perceptive anthropomorphism toward the AI is associated with greater right dmPFC activation during AI-only interaction, suggesting increased mentalizing effort, which is attenuated by parent co-presence.

Radiologist Copilot: An Agentic Assistant with Orchestrated Tools for Radiology Reporting with Quality Control

Radiologist Copilot: introduces an agentic AI assistant for automated radiology reporting with quality control, leveraging an LLM reasoning backbone, Action Planner, Action Executor, Memory, and orchestrated tools: Segmentator, Analyzer, Report Generator, and Quality Controller.
The agentic system autonomously selects tools, plans, and executes actions, emulating the holistic behavior of radiologists throughout image analysis, report generation, and quality control.
The orchestrated tools include Region Analysis Planning and Strategic Template Selection, enabling comprehensive, feedback-driven adaptive refinement of the generated reports.

Cybersecurity AI: The World's Top AI Agent for Security Capture-the-Flag (CTF)

CAI (Cybersecurity AI): introduces a specialized multi-model architecture leveraging the alias1 base LLM with dynamic entropy-based selection of support models for cost-efficient security operations.
This architecture achieves a 98% cost reduction, lowering 1B token inference costs from $5,940 to $119, making continuous security agent operation financially viable.
The dynamic model selection uses a weighted harmonic mean of token-level perplexity and task-level confidence to conservatively activate auxiliary models only when uncertainty is low.

IACT: A Self-Organizing Recursive Model for General AI Agents

IACT (Interactive Agents Call Tree): introduces a computational model that autonomously grows a dynamic, recursive agent topology tailored to the problem's structure, utilizing Agent Nodes, an LLM (Brain), and an Interpreter (Executor).
The architecture replaces rigid unidirectional function calls with Bidirectional, Stateful Dialogues, enabling interactional redundancy for continuous runtime verification and error correction.
IACT enforces Contextual Isolation via the Recursive Tree Topology and uses the Hippocampus (Global Associative Memory) to balance efficiency and global state coherence across the system.

Intervention Strategies for Fairness and Efficiency at Autonomous Single-Intersection Traffic Flows

MILP (Mixed-Integer Linear Programming): introduces a centralized coordination framework for autonomous agents at a signal-less intersection, optimizing trajectories for safety, efficiency, and fairness using a Receding Horizon strategy within a Control Zone.
The framework explicitly integrates a reversal-based Fairness Constraint, measured via pairwise reversal counts ($O_{q,r}$), to minimize violations of the First-In-First-Out (FIFO) crossing order.
The study investigates the existence of an optimal Control Zone radius $R^*$ that balances efficiency gains (often achieved via platoon formation facilitated by reversals) against the cost of maintaining fairness.
The study investigates the existence of an optimal Control Zone radius $R^*$ that balances efficiency gains (often achieved via platoon formation facilitated by reversals) against the cost of maintaining fairness.

Semantic Trading: Agentic AI for Clustering and Relationship Discovery in Prediction Markets

Semantic Trading Pipeline (STP): introduces an end-to-end agentic AI workflow that clusters prediction markets using natural language understanding and identifies high-confidence "same-outcome" or "different-outcome" relationships between market pairs.
The pipeline leverages the Agentics Framework and Model Context Protocol (MCP) tools, including Clustering, Cluster Labeling, and Relationship Discovery MCPs, to structure and validate LLM outputs against resolved market data.
Agent-identified relationships achieve 60-70% accuracy and, when translated into a simple leader-follower trading strategy, yield an average return on investment of approximately 20% over week-long horizons.

Multi-Domain Enhanced Map-Free Trajectory Prediction with Selective Attention

Multi-Domain Enhanced Map-Free Trajectory Prediction (MDE-MFTP): introduces a map-free trajectory prediction framework operating across temporal, spatial, and frequency domains, utilizing FTSAM, SSAM, and MTD to eliminate redundant information.
The FTSAM employs a Mixture of Experts (MoE) mechanism and multi-granularity temporal modeling to adaptively select critical frequency components and fuse multi-scale temporal information.
The SSAM and MTD use selective attention and cross-attention, respectively, to filter redundant spatio-temporal signals, supervised by a novel patch-structural-based loss for robust prediction.

Towards autonomous normative multi-agent systems for Human-AI software engineering teams

BDIM-SE (Belief, Desire, Intention, and Memory for Software Engineering agents): introduces a cognitive architecture for autonomous SE agents, equipped with LLM-based Belief (Knowledge storage), Desire (Agent goals), Intention (Goal realization), Procedural Memory (Plan library), and a Normative Reasoner (Compliance checking), enabling human-like reasoning and situatedness in software development.
The agents operate within the NorMAS-SE system, where coordination is governed by explicit commitments and Norms (Behavior regulation) that regulate interactions and ensure regulatory compliance in Human-AI teams.
Unlike prior LLM-based systems, BDIM-SE integrates persistent memory and symbolic reasoning, allowing for multi-step planning and dynamic adaptation to complex software engineering tasks.

Truthful and Trustworthy IoT AI Agents via Immediate-Penalty Enforcement under Approximate VCG Mechanisms

IP-aVCG (Immediate Penalty Approximate VCG): introduces a trust-enforcement framework for IoT energy trading that combines an $\alpha$-approximate VCG double auction with an immediate one-shot penalty mechanism to restore truthful reporting.
The mechanism analytically characterizes the approximation-induced incentive gap and derives a penalty threshold $\Pi > (1-\alpha)C/\rho$ that guarantees truthful equilibrium even under imperfect deviation detection.
Empirical validation using MARL agents in a P2P smart-grid environment confirms that learned bidding behaviors align with theoretical predictions across varying approximation levels and monitoring noise.

Input Order Shapes LLM Semantic Alignment in Multi-Document Summarization

Experimental Pipeline: introduces a methodology to test positional bias in multi-document summarization using triplets of stance-annotated articles, permuted input orders, the Gemini 2.5 Flash LLM, and multiple evaluation metrics.
The pipeline evaluates whether the sequential ordering of source articles significantly influences their representational weight in LLM-generated summaries, using abortion news articles as the test case.
Results reveal a consistent primacy effect, particularly at the semantic level measured by BERTScore, where summaries align more closely with the first-seen input document.

1st December 2025

Agentic Policy Optimization via Instruction-Policy Co-Evolution

INSPO (INStruction-Policy co-evolution): introduces a novel framework that integrates instruction optimization as a dynamic component of the reinforcement learning (RL) loop, enabling instruction and policy to co-evolve online.
The system maintains a dynamic Instruction Population and uses Reward Signals attributed to each instruction to update both the Policy Model and Instruction Weight Update.
New instructions are generated via an Experience-Driven Instruction Generation mechanism, where the LLM-based Optimizer reflects on failure trajectories stored in the Replay Buffer.

Bayesian Ambiguity Contraction-based Adaptive Robust Markov Decision Processes for Adversarial Surveillance Missions

Adaptive Robust Planning (Adaptive RMDP): introduces an adaptive RMDP framework for Collaborative Combat Aircraft (CCA) Intelligence, Surveillance, and Reconnaissance (ISR) missions, integrating Robust Bellman Operator, Bayesian Belief Update, Credible Set, Ambiguity Set, Ambiguity Contraction, Two-Phase State Space, ISR Graph, Exposure Variables, and Novelty Map.
The framework models the mission environment as a graph-structured RMDP with alternating movement and sensing phases, balancing information gathering utility against exposure risk penalties.
By using Bayesian inference to contract ambiguity sets based on online observations, the planner transitions from conservative robust behavior to efficient nominal performance while maintaining safety guarantees.

CuES: A Curiosity-driven and Environment-grounded Synthesis Framework for Agentic RL

CuES (Curiosity-driven and Environment-grounded Synthesis framework): introduces a scalable foundation for agentic RL by autonomously generating diverse, executable, and meaningful training tasks directly from the environment's structure and affordances.
The framework addresses task scarcity by operating via five stages—Requirement Confirm, Curious Exploration, Task Abstraction, Quality Control, and Goal Rewrite—unifying bottom-up discovery with lightweight top-down guidance.
CuES utilizes intrinsic curiosity, an Environment Memory Tree, and explicit quality control to produce high-quality task distributions that enable substantial downstream policy improvements for LLM-based agents.

EGENT: AN AUTONOMOUS AGENT FOR EQUIVALENT WIDTH MEASUREMENT

Egent (Autonomous Agent for Equivalent Width Measurement): introduces an autonomous agent for Equivalent Width (EW) measurement, combining Multi-Voigt Profile Fitting, Quality Check, LLM Visual Inspector, and an Iterative Refinement Loop.
The agent operates directly on raw flux spectra without requiring pre-normalized continua, using LLM function calls (Adjust Window, Add Peaks, Set Continuum) for visual inspection and iterative refinement of borderline fits.
Egent achieves expert-level quality (5-7 mÅ agreement with manual measurements) and stores complete Full Provenance, including Voigt parameters and LLM reasoning chains, ensuring reproducibility.

DrawingBench: Evaluating Spatial Reasoning and UI Interaction Capabilities of Large Language Models through Mouse-Based Drawing Tasks

DrawingBench: introduces a verifiable evaluation framework for assessing agentic LLMs' spatial reasoning and UI interaction capabilities using mouse-based drawing tasks that require generating sequences of low-level GUI actions.
The framework uses a two-turn protocol where LLMs generate action sequences, which are executed in a browser environment and assessed by a rule-based system providing structured external feedback.
Evaluation relies on 8 objective criteria and 4 error types, demonstrating that transparent evaluation and external oversight establish trust in agentic systems, achieving 92.8% perfect performance with feedback.

HybridWorldSim: A Scalable and Controllable High-fidelity Simulator for Autonomous Driving

HybridWorldSim: introduces a scalable and controllable high-fidelity simulator for autonomous driving, integrating multi-traversal neural reconstruction for static backgrounds with generative modeling for dynamic agents.
The static stage uses a Hybrid Gaussian Model with specialized nodes (Sky, Ground, Background) and appearance latents to capture diverse environmental conditions and complex geometry.
The dynamic scene generation stage employs a diffusion model guided by geometric and photometric consistency conditions derived from the static scene prior to synthesize realistic, view-consistent dynamic agents.

Phase-Adaptive LLM Framework with Multi-Stage Validation for Construction Robot Task Allocation: A Systematic Benchmark Against Traditional Optimization Algorithms

LTAA (LangGraph-based Task Allocation Agent): introduces a novel LLM-driven coordination system that combines natural language reasoning with phase-adaptive allocation strategies and hierarchical validation mechanisms.
The framework employs a nine-node LangGraph workflow featuring a Phase Detection Node and a Multi-Stage Validation system with hierarchical retries to ensure reasoning quality and consistency for multi-robot task allocation.
LTAA achieves significant computational efficiency gains, reducing token usage by 94.6% and allocation time by 86% compared to the SMART-LLM baseline, while matching traditional optimization algorithm performance.

DialogGuard: Multi-Agent Psychosocial Safety Evaluation of Sensitive LLM Responses

DialogGuard: introduces a unified multi-agent framework for evaluating psychosocial safety in LLM-generated responses, operationalizing four LLM-as-a-judge pipelines: single-agent scoring, dual-agent correction, multi-agent debate, and stochastic majority voting.
The framework assesses risks across five high-severity dimensions: privacy violations, discriminatory behavior, mental manipulation, psychological harm, and insulting behavior, using a shared three-level scoring rubric.
Experiments show that multi-agent mechanisms, especially Dual-Agent Correction and Majority Voting, offer more stable and human-aligned assessments than single-agent judging, and the system is deployed via an open-source web interface providing explainable natural-language rationales.

TradeTrap: Are LLM-based Trading Agents Truly Reliable and Faithful?

TradeTrap: introduces a unified evaluation framework for stress-testing LLM-based trading agents, systematically evaluating Adaptive and Procedural agents across four core components: market intelligence, strategy formulation, portfolio and ledger handling, and trade execution, using various attack modules.
The framework conducts evaluations in a closed-loop historical backtesting setting using real U.S. equity market data to quantify robustness by comparing decision trajectories and final portfolio values under controlled system-level perturbations.
Experiments show that small perturbations at a single component can propagate through the agent's decision loop, inducing extreme concentration, runaway exposure, and large capital drawdowns.

Benchmarking LLM Agents in Wealth-Management Workflows

FFAE: introduces a reproducible, tool-rich environment for benchmarking LLM agents on wealth-management assistant workflows, extending TheAgentCompany (TAC) with EspoCRM, finance data, and deterministic evaluators.
The benchmark consists of 12 high-autonomy (brief) and 12 low-autonomy (schema/path-explicit) task variants spanning retrieval, analysis, and synthesis/communication, graded via granular checkpoints.
Evaluation shows that agent performance is limited primarily by end-to-end workflow reliability (access/delivery) rather than mathematical reasoning, with low autonomy significantly improving accuracy on computational tasks.

STRIDE: A Systematic Framework for Selecting AI Modalities—Agentic AI, AI Assistants, or LLM Calls

STRIDE (Systematic Task Reasoning Intelligence Deployment Evaluator): introduces a five-stage design-time framework utilizing a Knowledge Base to systematically evaluate tasks via Task Decomposition & Representation, Dynamic Reasoning & Tool Assessment, Dynamism Attribution, and Self-Reflection Assessment, culminating in an Intelligent Recommendation Engine that uses the Agentic Suitability Score (ASS) and True Dynamism Score (TDS) to select the optimal AI modality (LLM call, AI assistant, or Agentic AI).
The framework analyzes task complexity across four integrated analytical dimensions—task decomposition, dynamic reasoning, dynamism attribution, and self-reflection—to produce the ASS, ensuring full agentic autonomy is reserved only for tasks with inherent dynamism or evolving context.
STRIDE achieved 92% accuracy in modality selection across 30 real-world tasks, reducing unnecessary agent deployments by 45% and cutting resource costs by 37% compared to baseline methods.

LLM CHESS: BENCHMARKING REASONING AND INSTRUCTION-FOLLOWING IN LLMS THROUGH CHESS

LLM CHESS: introduces an evaluation framework designed to probe the generalization of reasoning and instruction-following abilities in LLMs through extended agentic interaction in chess, utilizing a Proxy, a Chess Environment, and three specific actions (get_current_board, get_legal_moves, make_move).
The framework ranks over 50 models using behavioral metrics like win/loss rates and move quality against a random opponent, and derives an Elo estimate for top models by playing against a variably configured chess engine (Dragon 1).
The stochastic and dynamic nature of the benchmark reduces overfitting and memorization, revealing that even powerful reasoning-enhanced LLMs struggle with instruction-following and consistent wins.

How Far Are We from Genuinely Useful Deep Research Agents?

FINDER (Fine-grained DEepResearch bench) and DEFT (Deep rEsearch Failure Taxonomy): introduces a unified framework for evaluating and diagnosing Deep Research Agents (DRAs) using 419 structured checklist items and a 14-category failure taxonomy derived via a human-LLM collaborative grounded theory approach.
The DEFT taxonomy categorizes failures into three core dimensions—Reasoning, Retrieval, and Generation—to diagnose weaknesses in evidence integration, verification, and reasoning-resilient planning.
Experimental results using FINDER reveal that current DRAs frequently struggle with Strategic Content Fabrication (SCF) and Deficient Analytical Rigor (DAR), highlighting the need for stronger generative constraints and verification mechanisms.

An Empirical Study of Agent Developer Practices in AI Agent Frameworks

LLM-based Agent Framework Ecosystem Study: introduces an empirical analysis of ten widely used LLM-based agent frameworks, classifying their functional roles into basic orchestration, multi-agent collaboration, data processing, and experimental exploration.
The study identifies a taxonomy of developer challenges across the Software Development Lifecycle (SDLC), categorized into Logic, Tool, Performance, and Version failures, with Logic failures accounting for over one-third of all issues.
A five-dimensional evaluation metric is used to compare frameworks, finding that 96% of top-starred projects combine multiple frameworks to meet complex application demands.

Latent Debate: A Surrogate Framework for Interpreting LLM Thinking

Latent Debate: introduces a novel, model-agnostic surrogate framework for interpreting LLM thinking by capturing implicit internal arguments and disagreements within a single inference step.
The framework is symbolically instantiated for LLM True/False prediction tasks, where hidden states act as latent arguments, the unembedding matrix serves as the argument interpreter, and a QBAF functions as the thinking module.
Empirical studies validate that the surrogate model achieves high consistency with the original LLM predictions and provides a strong baseline for hallucination detection, correlating high debate in middle layers with hallucination risk.

INNOGYM: BENCHMARKING THE INNOVATION POTENTIAL OF AI AGENTS

InnoGym (iBench & iGym): introduces the first benchmark and framework designed to systematically evaluate the innovation potential of AI agents, combining performance gain and novelty metrics.
The framework consists of iBench, 18 standardized Improvable Tasks curated from real-world domains, and iGym, a unified execution environment supporting robust tool use and long-horizon evaluations.
Innovation is quantified by Performance Gain (G), measuring improvement over baselines, and Novelty (N), capturing methodological differences via an LLM-based distance function $D$ (Agent-as-judge).

AUTOMATING MODELING IN MECHANICS: LLMS AS DESIGNERS OF PHYSICS-CONSTRAINED NEURAL NETWORKS FOR CONSTITUTIVE MODELING OF MATERIALS

GenCANN (LLM-generated Constitutive Artificial Neural Network): introduces a framework where an LLM dynamically generates specialized, physics-constrained neural networks (CANNs) tailored to specific material classes and datasets.
The LLM handles all key design choices, including architecture selection, integration of physical constraints, and complete code generation for the CANN module, guided by static code providing the task description and continuum mechanics theory.
GenCANNs achieve accuracy comparable to or exceeding manually engineered CANNs, demonstrating reliable generalization and extrapolation capabilities across various material benchmarks (brain, rubber, skin).

MMAG: Mixed Memory-Augmented Generation for Large Language Models Applications

MMAG (Mixed Memory-Augmented Generation): introduces a memory framework for LLM-based agents organized into five interacting layers: conversational, long-term user, episodic and event-linked, sensory and context-aware, and short-term working memory.
The framework maps these memory types, inspired by cognitive psychology, to technical components like vector databases, secure profile stores, and scheduling modules, managed by a Central Memory Controller.
Implemented in the Heero conversational agent, the system uses conversational history and encrypted long-term bios to achieve improved user engagement and retention.

30th November 2025

SIMWORLD: An Open-ended Realistic Simulator for Autonomous Agents in Physical and Social Worlds

SIMWORLD: introduces a hierarchical, closed-loop simulator built on the Unreal Engine Backend, Environment Layer, and Agent Layer, designed for developing and evaluating LLM/VLM agents in realistic, open-ended physical and social worlds.
The platform features realistic, open-ended world simulation via Procedural Generation and LLM-based Scene Editing, a rich interface for LLM/VLM agents, and diverse physical and social reasoning scenarios.
The Agent Layer utilizes an LLM/VLM Backend with Perception, Memory, and Reasoning/Planning modules, connected to the Environment Layer via a Gym-like Interface and Action Planner to execute high-level language commands as low-level actions.

The Silence that Speaks: Neural Estimation via Communication Gaps

CALM (Communication-Aware Learning and Monitoring): introduces a novel learning-based framework for remote state estimation that jointly optimizes communication scheduling and estimator design by leveraging implicit information from communication silence.
The framework employs an alternating deep reinforcement learning approach using Proximal Policy Optimization (PPO) within an actor-critic architecture, where the scheduler is the actor and the estimator is the critic.
CALM utilizes neural networks as function approximators for both the scheduler and the nonlinear estimator, enabling the extraction of latent information embedded in no-communication events to enhance estimation accuracy.

Chain of Unit-Physics: A Primitive-Centric Approach to Scientific Code Synthesis

Chain of Unit-Physics: introduces a first-principles-centric, multi-agent system for scientific code synthesis, utilizing a Supervisor Agent, Code Agent, Diagnostic Agent, Verification Agent, Code Emulator, Graph Database, Unit-Physics Tests, and System of Transformer Models.
The framework embeds human expert knowledge as formalized unit-physics tests that explicitly constrain LLM-driven code generation, ensuring physical and numerical consistency via iterative feedback loops.
This inverse-design methodology converges within 5–6 iterations on a combustion task, matching human-expert accuracy while achieving faster runtime and efficient memory usage.

AFRAgent : An Adaptive Feature Renormalization Based High Resolution Aware GUI agent

AFRAgent (Adaptive Feature Renormalization Based High Resolution Aware GUI agent): introduces an InstructBLIP-based multimodal architecture for GUI automation, utilizing the Adaptive Feature Renormalization Block (affine transformation feature fusion) to enrich QueryFormer features with low- and high-resolution image embeddings.
The Adaptive Feature Renormalization (AFR) technique computes scaling and shifting parameters from enriching features to modulate target features, enhancing spatial awareness without significant computational overhead.
The lightweight 4-billion parameter model achieves state-of-the-art results on GUI benchmarks by efficiently fusing high-resolution details via AFR into low-resolution embeddings for action prediction.

ARCADIA: Scalable Causal Discovery for Corporate Bankruptcy Analysis Using Agentic AI

ARCADIA (Agentic Reasoning for CAusal DIscovery Algorithm): introduces an iterative causal DAG discovery framework combining LLM Agent reasoning and statistical validation, orchestrated by a Control Graph with INITIALISE, PROPOSE, EVALUATE, and FINISH nodes.
The LLM Agent acts as an autonomous research assistant, using Reasoning and Tool Use to propose theory-informed causal structures and refine the Causal Model based on diagnostic feedback from Statistical Validation.
The Iterative Process prioritizes causal validity and temporal coherence over raw statistical fit, ensuring the resulting DAGs are robust and explainable for counterfactual analysis in corporate bankruptcy prediction.

On the Regulatory Potential of User Interfaces for AI Agent Governance

UI-DPs (User Interface Design Patterns): introduces six high-level interaction design patterns—Visible thoughts, plans, and actions, Mechanisms for control transfer, Watch mode, Customizable rule-based governance, Inspectable and editable agent memory, and Sandboxes for agents with low-level environmental control—as targets for regulating AI agent UIs to enforce transparency and behavioral requirements.
The approach complements traditional governance methods like system-level safeguards and agent infrastructure by focusing on the user-facing UI layer to jumpstart necessary interventions.
Regulating these patterns, such as requiring agent memory to be editable or displaying sandbox health, enhances user agency, oversight, and safety during autonomous agent deployment.

Augmented Runtime Collaboration for Self-Organizing Multi-Agent Systems: A Hybrid Bi-Criteria Routing Approach

BiRouter (Hybrid Bi-Criteria Routing Approach): introduces a novel dual-criteria routing method for Self-Organizing Multi-Agent Systems (SO-MAS), enabling agents to autonomously execute "next-hop" task routing using only local information.
The core mechanism balances two metrics, ImpScore (long-term importance) and GapScore (contextual continuity), integrated with a dynamic Agent Reputation score for robust decision-making.
This decentralized approach dynamically constructs globally efficient agent chains, demonstrating superior performance and token efficiency compared to centralized and static baselines.

Robust Geospatial Coordination of Multi-Agent Communications Networks Under Attrition

ΦIREMAN (Physics-Informed Robust Employment of Multi-Agent Networks): introduces the Robust Task Networking Under Attrition (RTNUA) problem, achieving robust networking via physics-inspired fluid dynamics modeling to produce emergent behaviors that anticipate and respond to attrition, using Drones, Tasks, Base Station, Controller, Semi-Steiner Task Tree, Task-Space Potential Field, Attraction Potential, Repulsion Potential, Network Maintenance, and Message Passing.
The approach proactively creates redundant network geometries using physics-inspired potential fields, significantly outperforming the DCCRS baseline across various problem sizes and attrition rates by maintaining high task uptime.
The core mechanism involves driving the multi-agent network system toward low-energy states defined by a total potential energy manifold, which encourages hexagonal mesh patterns for regenerating network contiguity and robustness.

29th November 2025

ML-Tool-Bench: Tool-Augmented Planning for ML Tasks

ML-Tool-Bench: introduces a comprehensive benchmark and tool-augmented planning framework for ML tasks, featuring a Scratchpad for named-object management and Hierarchical MCTS for robust long-horizon planning.
Hierarchical MCTS improves trajectory validity and performance by decomposing the ML problem into sequenced subtasks and applying tool masking to focus the LLM agent's search space.
The proposed MCTS-Shaped variant utilizes shaped deterministic rewards and targeted textual feedback to guide the search process, establishing strong baselines and reducing reliance on subjective LLM scoring.

Hierarchical Decentralized Multi-Agent Coordination with Privacy-Preserving Knowledge Sharing: Extending AgentNet for Scalable Autonomous Systems

AgentNet++: introduces a hierarchical decentralized framework that extends AgentNet by organizing LLM-based agents into clusters for scalable coordination and privacy-preserving knowledge sharing.
The system operates across three levels—individual agents, agent clusters, and inter-cluster coordination—using dynamic DAG topologies and decentralized consensus mechanisms.
Scalability is achieved through hierarchical task routing and cluster formation, while privacy is guaranteed via differential privacy and secure aggregation protocols during knowledge exchange.

IslandRun: Privacy-Aware Multi-Objective Orchestration for Distributed AI Inference

IslandRun: introduces a privacy-aware, multi-objective orchestration system for distributed AI inference across heterogeneous computing environments.
The architecture decomposes the routing problem into four cooperating agents (WAVES, MIST, TIDE, LIGHTHOUSE) and two execution endpoints (SHORE, HORIZON) spanning personal devices, private edge, and public cloud.
The system prioritizes privacy and trust constraints over performance optimization, utilizing typed placeholder sanitization to preserve context semantics when migrating LLM chat history across trust boundaries.

HAVEN: Hierarchical Adversary-aware Visibility-Enabled Navigation with Cover Utilization using Deep Transformer Q-Networks

HAVEN: introduces a hierarchical navigation framework that integrates a Deep Transformer Q-Network (DTQN) high-level subgoal selector with a low-level potential field controller for safe navigation in partially observable, adversarial environments.
The DTQN leverages k-step memory and visibility-aware features to learn occlusion- and cover-aware strategies, minimizing exposure to adversarial fields-of-view (FoVs).
The framework demonstrates direct transfer from 2D training to 3D Unity-ROS environments by projecting point-cloud perception into the same feature schema without architectural changes.

Toward a Safe Internet of Agents

Internet of Agents (IoA) Architecture: introduces a foundational guide for engineering safe and reliable agentic systems by deconstructing the ecosystem across three levels of increasing complexity: Single Agent, Multi-Agent System (MAS), and Interoperable Multi-Agent System (IMAS).
The Single Agent is defined by its Model, Memory, Design Patterns, Tools, and Guardrails; MAS adds collective behavior components like Architectural Patterns and Verification; and IMAS requires Standardized Protocols, Discovery, Vetting, and Governance.
The analysis emphasizes that agentic safety is an architectural principle, treating each component as a dual-use interface where capability increases are linked to expanded attack surfaces.

Smart-TCP: An Agentic AI-based Autonomous and Adaptive TCP Protocol

Smart-TCP: introduces an agentic AI-based autonomous TCP protocol that reframes TCP's core logic as an LLM-driven agent, integrating logical reasoning with deterministic computation via LLM (Logical reasoning), ALU (Deterministic computation), State Module (Internal state storage), Context Aggregation Mechanism (Synthesizes protocol context), and Dual-Agent Interaction Framework (Client/Server interaction).
The architecture employs a dual-agent interaction framework where the LLM serves as the cognitive core and an Arithmetic Logic Unit (ALU) acts as a specialized tool for precise 32-bit arithmetic operations, such as sequence and acknowledgment number calculation.
This design overcomes the arithmetic limitations of pure LLM protocol implementations by decoupling LLM reasoning from deterministic ALU computation, achieving high accuracy in end-to-end sessions.

SelfAI: Building a Self-Training AI System with LLM Agents

SelfAI: introduces a unified multi-agent self-training pipeline for autonomous scientific discovery, integrating the User Agent (Translates objectives to configurations), Cognitive Agent (LLM-powered reasoning/planning/stopping), and Experiment Manager (Orchestrates parallel training/resource management).
The Cognitive Agent utilizes LLMs and optimal stopping criteria to iteratively refine hyperparameter searches and adapt the search trajectory based on accumulated experimental evidence.
The system introduces two novel evaluation metrics, Score and AUPD, to quantify discovery efficiency and search diversity across diverse scientific domains.

Provable Memory Efficient Self-Play Algorithm for Model-Free Reinforcement Learning

ME-Nash-QL (Memory-Efficient Nash Q-Learning): introduces a model-free self-play algorithm for two-player zero-sum Markov games, integrating reference-advantage decomposition and an early-settlement approach.
The algorithm achieves minimal space complexity $O(SABH)$ and near-optimal sample complexity $O(H^4SAB/\epsilon^2)$ for finding an $\epsilon$-approximate Nash Equilibrium.
ME-Nash-QL utilizes UCB/LCB exploration strategies and Coarse Correlated Equilibrium (CCE) computation to ensure low computational complexity and output a single Markov and Nash policy.

Design and Evaluation of a Multi-Agent Perception System for Autonomous Flying Networks

MAPS (Multi-Agent Perception System): introduces a modular and scalable perception framework for Autonomous Flying Networks (FNs) that leverages MM-LLMs and Agentic AI to generate structured Service Level Specifications (SLSs).
The system processes multimodal inputs (visual and audio data from UAVs) through Perception, Brain, and Action layers to estimate user count, spatial distribution, and traffic demand.
MAPS operationalizes the perception layer required by zero-touch network management frameworks (ETSI ZSM, ITU Autonomous Networks) to enable autonomous FN decision-making.

28th November 2025

Towards Continuous Intelligence Growth: Self-Training, Continual Learning, and Dual-Scale Memory in SuperIntelliAgent

SuperIntelliAgent: introduces an agentic learning framework coupling a trainable small diffusion model (Learner) with a frozen LLM (Verifier) to enable continual intelligence growth through self-supervised interaction.
The system autonomously generates chosen/rejected pairs for Direct Preference Optimization (DPO) by having the Learner generate outputs and the Verifier evaluate them via step-by-step reasoning.
The architecture integrates a dual-scale memory mechanism, using a replay buffer for short-term experience traces and on-the-fly LoRA fine-tuning for long-term knowledge consolidation.

27th November 2025

Matrix: Peer-to-Peer Multi-Agent Synthetic Data Generation Framework

Matrix: introduces a decentralized peer-to-peer multi-agent framework for scalable synthetic data generation, utilizing serialized Orchestrator messages for control and state flow, processed by stateless AgentActors, and supported by Distributed Services for heavy computation.
The architecture eliminates centralized orchestration bottlenecks and achieves high throughput by implementing fine-grained, asynchronous row-level scheduling across distributed queues, enabling tens of thousands of concurrent workflows.
The framework leverages open-source tools like Ray, SLURM, vLLM, and Apptainer for cluster management and distributed execution, demonstrating 2-15x higher data generation throughput than centralized baselines.

Agentic AI Framework for Cloudburst Prediction and Coordinated Response

AIF-AWCI (Agentic AI Framework for Atmospheric Water-Cycle Intelligence): introduces a multi-agent architecture that integrates sensing, forecasting, downscaling, hydrological modeling, and coordinated response into a closed-loop system.
The framework utilizes autonomous but cooperative agents across Perception, Decision, and Action layers to transform atmospheric data into real-time decision intelligence.
Empirical evaluation demonstrated that the multi-agent configuration enhances forecast reliability, critical success index, and warning lead time compared to baseline models.

Agentic AI Framework for Individuals with Disabilities and Neurodivergence: A Multi-Agent System for Healthy Eating, Daily Routines, and Inclusive Well-Being

AAS (Agentic AI System): introduces a multi-agent framework designed to assist individuals with disabilities and neurodivergence in healthy eating and daily routines.
The system utilizes four specialized agents—Meal Planner, Reminder, Food Guidance, and Monitoring—coordinated by a Hybrid Reasoning Engine via a Blackboard/Event Bus.
The framework emphasizes personalization, accessibility, and transparency through multimodal interfaces, adaptive learning (RL), and privacy-conscious data integration.

Exposing Vulnerabilities in RL: A Novel Stealthy Backdoor Attack through Reward Poisoning

BABO: introduces a novel stealthy backdoor attack that manipulates an RL agent's policy by poisoning its reward signals, formulated via a penalty-based bi-level optimization problem.
The attack minimizes data distortion using the Reward Perturbation Network ($\Delta$) while ensuring the agent learns the Target Backdoor Policy ($\pi^\dagger$) under black-box constraints.
The method achieves high stealthiness with minimal performance drop under normal conditions, yet causes catastrophic performance decline (up to 85.01%) when a trigger is activated.

Distributed Koopman Operator Learning for Perception and Safe Navigation

DKOL-MPC: introduces a unified, scalable framework for predictive and safe autonomous navigation by integrating Model Predictive Control with Distributed Koopman Operator Learning.
The framework uses a consensus-based distributed learning algorithm where multiple computational nodes collaboratively estimate the Koopman operator from high-dimensional sensory data without centralized data aggregation.
The learned operator forecasts future obstacle spatial densities, which are converted into convex polytopic linear constraints embedded in the MPC formulation to guarantee collision-free navigation.

CO-EVOLVING AGENTS: LEARNING FROM FAILURES AS HARD NEGATIVE

Co-Evolving Agents Framework: introduces a self-improving agent architecture where a Target Agent and an auxiliary Failure Agent jointly improve through mutual interaction and alternating training phases.
The Failure Agent specializes in preference optimization over failure trajectories to autonomously generate informative Hard Negatives, which are high-reward failures close to success.
Incorporating these structured Hard Negatives into the Target Agent's DPO optimization sharpens decision boundaries and significantly enhances LLM generalization across diverse tasks.

MTR-VP: Towards End-to-End Trajectory Planning through Context-Driven Image Encoding and Multiple Trajectory Prediction

MTR-VP (Motion Transformer for Vision-based Planning): introduces an end-to-end trajectory planning method using a two-stage architecture comprising a Scene Context Encoder and a Scene Context Decoder, which outputs K possible future trajectories and their probability distribution.
The Scene Context Encoder leverages a Pretrained ViT for image encoding and a State Encoder (temporal transformer) for past kinematic states, concatenating them to form scene context embeddings, replacing map-based features.
The approach adapts the MTR framework to a vision-first context, utilizing cross-attention to fuse the encoded intent with the learned scene context, and predicting multiple futures to boost planning performance in long-tail scenarios.

TinyLLM: Evaluation and Optimization of Small Language Models for Agentic Tasks on Edge Devices

TinyLLM: introduces a pipeline for optimizing SLMs for edge agentic tasks, utilizing Data Processing and a Data Preparation Pipeline to convert AgentBank SFT Dataset into the AgentBank Chosen-Rejected Dataset for the DPO Training Pipeline, resulting in a Finetuned SLM evaluated against the BFCL Framework using Performance Metrics and various Optimization Strategies.
The approach focuses on preference alignment via Direct Preference Optimization (DPO) to efficiently align SLMs (under 3B parameters) for robust function/tool calling without relying on costly cloud infrastructure.
Benchmarking across diverse scenarios revealed that medium-scale SLMs (1-3B parameters) significantly outperform ultra-compact models, achieving high overall and multi-turn accuracy through hybrid optimization.

26th November 2025

Model-Based Policy Adaptation for Closed-Loop End-to-End Autonomous Driving

MPA (Model-Based Policy Adaptation): introduces a general framework for end-to-end autonomous driving that enhances robustness and safety by adapting a pretrained E2E agent using counterfactual data.
The approach generates diverse counterfactual trajectories via a geometry-consistent 3DGS-based simulation engine to expose the agent to scenarios beyond the original dataset.
MPA trains a diffusion-based policy adapter to refine base policy predictions and a multi-step Q-value model to evaluate long-term outcomes for inference-time guidance.

BAMAS: Structuring Budget-Aware Multi-Agent Systems

BAMAS (Budget-Aware Multi-Agent Systems): introduces a novel framework for constructing multi-agent systems under budget constraints, including Budget-Constrained LLM Provisioning, Agent Collaboration Topology Selection, and Agent Instantiation.
The framework jointly optimizes LLM selection using an Integer Linear Programming Solver and agent collaboration topology using a Topo-Selection Policy trained via offline reinforcement learning.
BAMAS achieves a strong cost-performance trade-off by adaptively selecting LLMs and collaboration patterns (Topo Set) to maximize task performance within a fixed cost budget.

Tool-RoCo: An Agent-as-Tool Self-organization Large Language Model Benchmark in Multi-robot Cooperation

Tool-RoCo: introduces a novel LLM-based multi-agent benchmark for multi-robot cooperation, leveraging the agent-as-tool concept and four progressive cooperation paradigms.
The framework evaluates LLM autonomy and coordination using tool usage metrics, including Cooperative Tool Ratio (CT) and Self-Organization Ratio (SO).
Tool-RoCo utilizes three multi-robot tasks (CABINET, PACK, SORT) and two types of tools (Common and Cooperative) to systematically assess LLM performance across varying levels of centralized and decentralized control.

EWE: AN AGENTIC FRAMEWORK FOR EXTREME WEATHER ANALYSIS

EWE (Extreme Weather Expert): introduces an intelligent agent framework for extreme weather analysis, integrating Knowledge-Enhanced Planning, Self-Evolving Closed-Loop Reasoning, and a Meteorological Toolkit.
The framework operationalizes expert workflows using an MLLM reasoning backbone to autonomously generate and interpret multimodal visualizations from raw meteorological data.
Self-Evolving Closed-Loop Reasoning employs a Dual-Auditor Module (Code Auditor and Content Auditor) to verify both operational success and physical plausibility of generated code and visualizations.

Large Language Models for Unit Test Generation: Achievements, Challenges, and the Road Ahead

UFW-UTG (Unified Framework for LLM-based Unit Test Generation): introduces a systematic engineering view of LLM-based unit test generation, including Model Preparation (Specializes LLM), Context Enrichment (Constructs context-rich prompt), Prompt-driven Generation (Core LLM operation), Raw Generated Tests (Initial LLM output), Quality Assurance Loop (Validates and refines tests), Final Test Suite (Executable, high-quality tests), Synergy (Integrates traditional SE tools), and Feedback Loop (Iterative refinement mechanism), where the framework treats LLMs as stochastic generators requiring systematic engineering constraints.
The analysis reveals that prompt engineering is the dominant strategy (89%), and the iterative validation and repair loop is the standard mechanism for ensuring robust test usability, boosting pass rates from under 30% to over 70%.
Future research emphasizes a paradigm shift toward autonomous testing agents and hybrid systems that combine LLMs' semantic understanding with traditional tools' systematic exploration capabilities to improve fault detection.

Dual-Agent Reinforcement Learning for Adaptive and Cost-Aware Visual-Inertial Odometry

Dual-Agent Reinforcement Learning for Adaptive and Cost-Aware Visual-Inertial Odometry: introduces a decoupled RL-based VIO framework that mitigates the Visual-Inertial Bundle Adjustment (VIBA) bottleneck using a Select Agent (RL computational scheduler) and a composite Fusion Agent (Composite RL fusion policy).
The Select Agent uses IMU-only data to decide whether to run the costly Visual Odometry pipeline, achieving significant computational savings by skipping redundant or uninformative frames.
The Fusion Agent adaptively fuses high-rate IMU predictions with sparse VO updates by learning context-dependent weights, resulting in a favorable accuracy-throughput-memory trade-off compared to prior GPU-based VIO systems.

EVILGENIE: A Reward Hacking Benchmark

EVILGENIE: introduces a benchmark for reward hacking in programming settings using problems sourced from LIVECODEBENCH, designed to allow agents to circumvent test cases.
The benchmark evaluates agent behavior using a combination of held-out unit tests, LLM judges for solution classification, and automated test file edit detection.
Evaluation across proprietary and standardized LLM agents reveals that LLM judges are highly effective at detection, while held-out tests show minimal improvement in unambiguous cases.

Aligning LLMs Toward Multi-Turn Conversational Outcomes Using Iterative PPO

Iterative PPO: introduces a batch online policy iteration algorithm that reduces the multi-turn RL problem into a sequence of single-turn RLHF problems using a learned Q-function as the reward model.
The approach alternates between collecting multi-turn trajectories and performing policy improvement using standard token-level PPO, leveraging existing stable single-turn RLHF tools.
This method enables continual learning from real customer-business interactions without requiring an environment simulator, balancing online adaptability with offline stability.

MADRA: Multi-Agent Debate for Risk-Aware Embodied Planning

MADRA (Multi-Agent Debate for Risk-Aware Embodied Planning): introduces a training-free Multi-Agent Debate Risk Assessment framework leveraging collective reasoning to enhance safety awareness in embodied planning without sacrificing task performance.
MADRA employs multiple LLM-based Risk Assessment Agents guided by a Critical Evaluator to iteratively debate instruction safety and vote for consensus, curbing single-LLM bias and reducing false rejections.
The MADRA module is integrated into a Hierarchical Cognitive Collaborative Planning Framework that includes Memory Enhancement, Hierarchical Planning, and a Self-evolution Mechanism for continuous learning and improved task success rates.
The MADRA module is integrated into a Hierarchical Cognitive Collaborative Planning Framework that includes Memory Enhancement, Hierarchical Planning, and a Self-evolution Mechanism for continuous learning and improved task success rates.

Prune4Web: DOM Tree Pruning Programming for Web Agent

Prune4Web (DOM Tree Pruning Programming for Web Agent): introduces a multi-stage framework for web automation, including a Planner (decomposes high-level task), a Programmatic Element Filter (generates Python scoring program), and an Action Grounder (selects final executable action).
The core innovation, DOM Tree Pruning Programming, transforms DOM processing from LLM-based filtering to programmatic pruning, reducing candidate elements by 25-50 times.
The approach uses LLMs to generate executable Python scoring programs based on semantic clues from decomposed sub-tasks, enabling precise action localization without attention dilution.

Multi-Agent Systems for Dataset Adaptation in Software Engineering: Capabilities, Limitations, and Future Directions

MADAP (Multi-Agent Dataset Adaptation Pipeline): introduces an empirical study evaluating LLM-based multi-agent systems, specifically GitHub Copilot, on dataset adaptation tasks using a structured five-stage evaluation pipeline.
The pipeline assesses agent performance across file comprehension, code editing, command generation, validation, and final execution, revealing that current systems struggle to produce functionally correct implementations.
Prompt-level interventions, such as providing error messages and reference code, significantly improve structural similarity and highlight the need for robust feedback-driven guidance in future agents.

Aligning LLMs with Biomedical Knowledge using Balanced Fine-Tuning

BFT (Balanced Fine-Tuning): introduces an efficient post-training method for aligning LLMs with specialized biomedical knowledge, utilizing token-level weighting (stabilizes gradients) and sample-level reweighting (focuses on hard samples).
This method operates through a two-layer confidence-based weighting mechanism to learn complex reasoning from sparse data without requiring external reward signals or costly reinforcement learning.
BFT-based LLMs surpass SFT and other baselines in medical and biological reasoning tasks, generating biologically meaningful embeddings for downstream applications like gene interaction prediction.

OVOD-Agent: A Markov-Bandit Framework for Proactive Visual Reasoning and Self-Evolving Detection

OVOD-Agent (Open-Vocabulary Object Detection Agent): introduces a lightweight, LLM-free framework that transforms passive category matching into proactive visual reasoning and self-evolving detection, utilizing an Environment (updates visual state), Detector (outputs region proposals), Weakly Markovian Decision Process (w-MDP) (models visual-semantic transitions), Bandit Sampling Process (UCB-based exploration), Markov State Transition Matrix (stores transition statistics), Reward Model (RM) (guides inference refinement), and Visual Chain-of-Thought (Visual-CoT) Actions (iteratively refine textual representation).
The framework models visual context transitions as a Weakly Markovian Decision Process (w-MDP) over eight compact visual states, enabling an interpretable multi-step Visual-CoT reasoning process with explicit actions.
A Bandit module generates exploration signals under limited supervision, and its trajectories are coupled with Markov transition matrices to train a self-supervised Reward Model (RM) for continuous policy improvement.

LOOM: Personalized Learning Informed by Daily LLM Conversations Toward Long-Term Mastery via a Dynamic Learner Memory Graph

LOOM: introduces an agentic four-stage pipeline that transforms everyday LLM conversations into personalized learning trajectories using a Dynamic Learner Memory Graph, Chat Summarizer, Topic Decider, Course Generator, and Goals Updater.
The system unifies continuity and initiative by proactively inferring evolving learner needs from chat summaries and generating adaptive, goal-aligned mini-courses that address identified knowledge gaps.
The Dynamic Learner Memory Graph tracks mastery, links adjacent concepts, and continuously updates based on user engagement and learning outcomes to guide next steps and reinforcement.

Towards Trustworthy Legal AI through LLM Agents and Formal Reasoning

L4M (Legal Logic LLM): introduces a novel neural-symbolic framework combining adversarial LLM agents with SMT-solver-backed proofs to achieve trustworthy, verifiable legal AI.
The system uses dual LLM agents (Prosecutor and Attorney) for independent, adversarial fact and statute extraction, ensuring role isolation and comprehensive coverage.
Extracted facts are autoformalized into Z3 assertions, verified by the SMT solver, and refined via an iterative self-critique loop before a Judge LLM verbalizes the final, auditable verdict.

CaptionQA: Is Your Caption as Useful as the Image Itself?

CaptionQA: introduces a utility-based caption evaluation benchmark covering four domains (Natural, Document, E-commerce, Embodied AI) using a deterministic QA protocol where a text-only LLM answers taxonomy-grounded multiple-choice questions based solely on the generated caption.
The benchmark construction pipeline involves human-designed taxonomies, VLM-based question generation, filtering text-answerable questions, deduplication, and dual-VLM quality control to ensure high-density, visually-grounded QA pairs.
Evaluation using the benchmark reveals substantial utility gaps between image-level and caption-level performance across state-of-the-art MLLMs, especially in Embodied AI and spatial reasoning tasks.

25th November 2025

FRAGMENTA: End-to-end Fragmentation-based Generative Model with Agentic Tuning for Drug Lead Optimization

FRAGMENTA (End-to-end Fragmentation-based Generative Model with Agentic Tuning for Drug Lead Optimization): introduces an end-to-end framework for drug lead optimization that integrates the LVSEF generative model and an Agentic AI System for automated tuning, enabling a closed-loop iterative process.
The LVSEF component reframes fragment selection as a vocabulary selection problem, jointly optimizing fragment sets and molecule generation using dynamic Q-learning and reconstruction rewards.
The Agentic AI System utilizes specialized LLM-based agents (EvalAgent, QueryAgent, ExtractAgent, CodeAgent) and a shared Knowledge Base to interpret expert feedback and autonomously refine the generative model's objectives.

AD-R1: Closed-Loop Reinforcement Learning for End-to-End Autonomous Driving with Impartial World Models

AD-R1: introduces a closed-loop RL framework leveraging an Impartial World Model (IWM) as an internal critic to refine autonomous driving policies by learning from imagined failures.
The IWM is trained using Counterfactual Synthesis, a novel data pipeline that systematically generates a curriculum of plausible collisions and off-road events to overcome the optimistic bias inherent in standard world models.
During policy refinement, the IWM predicts 4D future occupancy sequences for candidate actions, enabling the 4D Rewarded Modeling module to provide dense, physically-grounded safety-critical feedback.

CostNav: A Navigation Benchmark for Cost-Aware Evaluation of Embodied Agents

CostNav (Micro-Navigation Economic Testbed): introduces a comprehensive benchmark evaluating embodied agents via an Economic Model that translates Simulation Logs (collision, energy, time) into financial metrics, including Pre-Run Costs, Run Costs, Revenue, and Break-Even Analysis.
The framework uses industry-derived parameters to model the complete economic lifecycle of autonomous navigation systems, revealing that optimizing for task success differs fundamentally from optimizing for economic deployment.
Initial evaluation of a Learning-Based On-Device baseline shows that maintenance costs, driven by a high collision rate, overwhelmingly dominate operational costs, resulting in negative profit per run.

FROM DATA TO CONCEPTS VIA WIRING DIAGRAMS

Hasse Clustering (HC): introduces a method for extracting abstract concepts from sequential data using quasi-skeleton wiring diagrams, involving sequence-to-matrix conversion and categorical constraint analysis.
The approach leverages the correspondence between quasi-skeleton wiring diagram graphs and Hasse diagrams to generalize individual data points into more abstract, representative concepts.
HC was successfully applied to time series data from a reinforcement learning agent playing a computer game, correctly identifying the unique or multiple winning strategies.

CLIMATEAGENT: Multi-Agent Orchestration for Complex Climate Data Science Workflows

CLIMATEAGENT: introduces an autonomous multi-agent framework that orchestrates complex climate data science workflows by decomposing user questions into executable sub-tasks coordinated by planning and orchestration agents, acquiring data via specialized DATA-AGENTS, and completing analysis and reporting with self-correcting CODING-AGENTs.
The system employs a three-layer hierarchical architecture with specialized LLM-based agents, persistent contextual coordination via a Persistent Workflow Context, and adaptive self-correction mechanisms to ensure robustness against API variability and runtime errors.
Evaluated on CLIMATE-AGENT-BENCH-85, the framework achieves 100% task completion and significantly outperforms GPT-5 and Copilot baselines in report quality across six climate domains, demonstrating reliable end-to-end automation.

"Are We Done Yet?”: A Vision-Based Judge for Autonomous Task Completion of Computer Use Agents

VBFJ (Vision-Based Feedback Judge): introduces an autonomous evaluation and feedback framework utilizing VLMs to assess task completion directly from screenshots and task descriptions for Computer Use Agents (CUAs).
The framework achieves up to 73% classification accuracy in task success detection and provides an average relative improvement of 27% in the overall task success rate of CUAs.
The core mechanism involves the VLM providing natural language reasoning as feedback to the CUA, enabling the agent to replan and reattempt the task from its current state.

WaymoQA: A Multi-View Visual Question Answering Dataset for Safety-Critical Reasoning in Autonomous Driving

WaymoQA (Multi-View Visual Question Answering Dataset): introduces Safety-Critical Reasoning, a new task leveraging Multi-View Input (Comprehensive scene coverage) and structured into two stages: Stage 1 (Immediate risk resolution) and Stage 2 (Downstream risk mitigation).
The WaymoQA dataset contains 35,000 human-annotated question-answer pairs covering complex, high-risk driving scenarios across both Video QA (Temporal reasoning) and Image QA (Alternative actions/outcomes) modalities.
Experiments reveal that existing MLLMs underperform significantly in safety-critical scenarios, but fine-tuning on the dataset substantially improves their reasoning ability, highlighting the need for targeted supervision.

Hierarchical Spatio-Temporal Attention Network with Adaptive Risk-Aware Decision for Forward Collision Warning in Complex Scenarios

HSTAN+DRTA (Hierarchical Spatio-Temporal Attention Network + Dynamic Risk Threshold Adjustment): introduces an integrated Forward Collision Warning (FCW) framework combining HSTAN for efficient trajectory prediction and DRTA for adaptive, reliable warning decisions, including SAM (spatial interaction modeling), TAM (temporal dynamics modeling), CQR Module (uncertainty quantification), and DRTA (adaptive decision-making).
HSTAN uses a decoupled architecture with GAT-MHA for spatial interactions (O(N·K) complexity) and cascaded GRU/MHA units for temporal dynamics, achieving high prediction accuracy and low inference time (12.3 ms).
The DRTA module transforms predictions into warnings using a physics-informed risk potential function integrating kinematics and road geometry, combined with an adaptive threshold mechanism based on sliding-window traffic statistics.

Towards Edge General Intelligence: Knowledge Distillation for Mobile Agentic AI

KD-EGI (Knowledge Distillation for Edge General Intelligence): introduces a comprehensive survey investigating the integration of KD into EGI, positioning it as a key enabler for efficient, communication-aware, and scalable mobile agentic AI.
The approach leverages KD to compress large Teacher Models into compact Student Models, transferring complex cognitive skills required for the Agentic Loop (Perception, Planning, Action, Memory).
The survey reviews specialized distillation methods for wireless communication and novel edge architectures (Mamba, RWKV) to bridge the deployment chasm for LLM-powered agents on resource-constrained IoT Edge Systems.

IMPROVED LINEAR-TIME CONSTRUCTION OF MINIMAL DOMINATING SET VIA MOBILE AGENTS

LTMDS (Improved Linear-Time Construction of Minimal Dominating Set): introduces two linear-time algorithms for computing a minimal dominating set (mDS) in anonymous graphs using mobile agents, achieving $O(n)$ round complexity.
The approach leverages an optimal dispersion algorithm to reach a covered configuration, utilizing Seeker Agents for parallel neighborhood probing to assign colors (red for mDS members) in $O(1)$ time per step.
The methodology simultaneously constructs a spanning tree, performs leader election, and achieves agent gathering, all within the same $O(n)$ time and $O(\log n)$ memory bounds, improving upon prior complexity results.

Distributionally Robust Cascading Risk in Multi-Agent Rendezvous: Extended Analysis of Parameter-Induced Ambiguity

DRRF (Distributionally Robust Risk Framework): analyzes the distributionally robust risk of cascading failures in a Multi-Agent Rendezvous System, using the Conditional Distributionally Robust Functional defined over an Ambiguity Set of probability measures derived from the Steady-State Covariance Matrix of the Observables Vector, which captures Systemic Events in the Time-Delayed Linear Consensus Network.
The framework explicitly incorporates distributional ambiguity arising from bounded uncertainties in system parameters, including diffusion coefficients, time delays, and network edge weights.
The approach derives a closed-form risk expression and establishes fundamental bounds that relate the distributionally robust cascading risk to network eigenvalues and parameter uncertainty, providing insights for robust network design.

OpenApps: Simulating Environment Variations to Measure UI-Agent Reliability

OpenApps: introduces a flexible simulator for systematically evaluating UI-agent reliability across app variations, including the OpenApps environment, State ($s_t$), Observation ($o_t$), Agent, Prompt, Policy ($\pi(a_t | h)$), Action ($a_t$), Reward ($r$), BrowserGym API, Six functional apps, and Configuration (YAML files).
The system generates thousands of app versions by configuring appearance and content variables via simple YAML files, enabling large-scale, reproducible experiments on modest hardware.
Evaluation across seven leading multimodal agents (including LLMs like GPT-4o and Claude) demonstrates that reliability fluctuates drastically across app variations, often underestimating failure modes when tested only on fixed app clones.

Learning from Risk: LLM-Guided Generation of Safety-Critical Scenarios with Prior Knowledge

LRF (Learning from Risk Framework): introduces a high-fidelity safety-critical scenario generation framework integrating a CVAE-GNN Module (Generates physically consistent base scenarios) with an LLM (Adversarial reasoning engine) for synthesizing diverse, risk-sensitive driving scenarios.
The CVAE-GNN learns latent traffic structures from real-world trajectories and map data, while the LLM acts as a knowledge-driven controller, interpreting scene semantics and dynamically adjusting optimization objectives.
The framework utilizes a knowledge-driven loss adaptation mechanism and a Cross-Risk Scenario Distribution Module to ensure generated scenarios are both plausible and risk-sensitive across low-, high-, and long-tail risk regimes.

Learning Multi-Access Point Coordination in Agentic AI Wi-Fi with Large Language Models

AAWF (Agentic AI Wi-Fi Framework): introduces a novel multi-LLM-agent system where each AP acts as an autonomous LLM Agent, leveraging its LLM Brain, Short-Term Memory, Long-Term Memory (RAG), Tool Use Module, Prompt Engine, and Coordination Protocol to collaboratively negotiate adaptive Multi-Access Point Coordination (MAPC) strategies.
The framework utilizes natural language dialogue and a cognitive workflow (evaluation, reflection, action generation) to dynamically navigate the Co-SR/Co-TDMA trade-off, adapting to diverse and dynamic interference scenarios in Wi-Fi networks.
Simulation results demonstrate that this self-organized agent negotiation significantly outperforms conventional static protocols and AI-driven baselines in terms of throughput and adaptability.

Arcadia: Toward a Full-Lifecycle Framework for Embodied Lifelong Learning

Arcadia: introduces a full-lifecycle framework for embodied lifelong learning that tightly couples four stages—Self-Evolving Exploration and Grounding, Generative Scene Reconstruction and Augmentation, Shared Embodied Representation Architecture, and Sim-from-Real Evaluation and Evolution—to form a closed self-improving loop.
The framework addresses core limitations in embodied AI by enabling continuous real-world data acquisition, generative simulation updates, and shared-representation learning to support lifelong improvement.
The Sim-from-Real Evaluation and Evolution component integrates structured deployment feedback (Task, Scene, Robot) back into simulation to refine both assets and policies, effectively closing the real-to-sim-to-real loop.

24th November 2025

BEYOND PROTEIN LANGUAGE MODELS: AN AGENTIC LLM FRAMEWORK FOR MECHANISTIC ENZYME DESIGN

Genie-CAT: introduces an agentic LLM system that integrates literature-grounded reasoning (RAG), structural analysis, electrostatic potential calculation, and ML-based redox potential modeling to generate mechanistically interpretable protein design hypotheses.
The system utilizes a ReAct (Reasoning and Acting) pattern within the LLM Agent Core to dynamically select and orchestrate domain-specific tools, bridging symbolic reasoning with quantitative physical modeling.
Demonstrated using metalloproteins (ferredoxins), the framework autonomously identifies residue modifications near [Fe-S] clusters that affect redox tuning, significantly reducing the time and expertise required for hypothesis generation.

Be My Eyes: Extending Large Language Models to New Modalities Through Multi-Agent Collaboration

BEMYEYES: introduces a modular, multi-agent framework that extends LLMs to multimodal reasoning by orchestrating collaboration between a Perceiver Agent (VLM) and a Reasoner Agent (LLM) through multi-turn conversations.
The Perceiver Agent extracts visual information and communicates detailed descriptions, while the frozen LLM Reasoner Agent applies its extensive knowledge and reasoning capabilities to solve the given task.
The system utilizes a data synthesis and supervised fine-tuning pipeline to train the perceiver for effective collaboration, enabling text-only LLMs to outperform large proprietary VLMs like GPT-4o on multimodal tasks.

LEARNING ROBUST SOCIAL STRATEGIES WITH LARGE LANGUAGE MODELS

AdAlign (Advantage Alignment): introduces a method to train LLM agents to learn robust social strategies in mixed-motive social dilemmas, utilizing LLM Agents, Multi-agent RLOO, LoRA finetuning, an Agent Buffer, and a Social Dilemma Testbed.
AdAlign adapts an opponent-learning awareness algorithm to fine-tune LLMs, modifying the policy gradient update with a reweighting of action gradients based on the agent's and opponent's advantages, simplified using a group-relative baseline.
The approach achieves higher collective payoffs and non-exploitability across environments like IPD and the novel Trust and Split, demonstrating robustness against greedy RL-trained opponents, unlike naive MARL which converges to greedy policies.

LLM-Driven Stationarity-Aware Expert Demonstrations for Multi-Agent Reinforcement Learning in Mobile Systems

RELED (LLM-Driven Stationarity-Aware Expert Demonstrations for Multi-Agent Reinforcement Learning in Mobile Systems): introduces a scalable MARL framework integrating LLM-driven expert demonstrations with autonomous agent exploration using the Stationarity-Aware Expert Demonstration (SED) and Hybrid Expert-Agent Policy Optimization (HPO) modules.
The SED module leverages theoretical non-stationarity bounds, quantified by the reward volatility and policy divergence indices, as feedback to iteratively refine LLM-generated instruction sequences for high-quality expert trajectories.
The HPO module employs a fully decentralized training approach where agents independently optimize a hybrid policy loss function, adaptively balancing learning from expert and self-generated samples via dynamic time warping distance.

MAESTRO: Multi-Agent Environment Shaping through Task and Reward Optimization

MAESTRO (Multi-Agent Environment Shaping through Task and Reward Optimization): introduces a generative meta-learning framework that shifts the LLM role from real-time agent to high-level architect, dynamically designing the task and solving guidance for the MARL training process.
The system operates as a dual-loop optimization problem, integrating a Semantic Curriculum Generator and an Automated Reward Synthesizer to shape the environment for the MADDPG learner backbone.
By distilling semantic knowledge into executable training scaffolds (tasks and rewards), the framework guides a standard MARL policy, isolating expensive LLM inference from the real-time execution loop.

LLM-Based Agentic Negotiation for 6G: Addressing Uncertainty Neglect and Tail-Event Risk

RAAN (Risk-Aware Agentic Negotiation): introduces an unbiased, risk-aware framework for LLM-based agents in 6G network slicing negotiation, utilizing Digital Twins, CVaR, Epistemic Confidence Score, and a Dynamic SLA Target.
The framework mitigates uncertainty neglect bias by shifting the agent's objective from mean-based reasoning to tail-event risk quantification using CVaR, ensuring robust resource allocation and eliminating SLA violations.
Agents are compelled to quantify epistemic uncertainty via the confidence score, which dynamically tightens the internal SLA target to prevent decisions based on unreliable Digital Twin predictions.

Reinforcement Learning for Self-Healing Material Systems

RLCF (Reinforcement Learning Control Framework): introduces a self-healing material system modeled as a Markov Decision Process, where an RL agent learns optimal policies to balance structural integrity recovery against finite resource consumption.
The system architecture integrates self-healing material, sensor arrays, and actuators, allowing the RL agent to select discrete (Q-learning, DQN) or continuous (TD3) healing actions based on the observed damage state.
Comparative evaluation showed that the continuous-action TD3 agent achieved the fastest and most stable material recovery, demonstrating the necessity of fine-grained, proportional actuation in dynamic self-healing applications.

A Multi-Agent LLM Framework for Multi-Domain Low-Resource In-Context NER via Knowledge Retrieval, Disambiguation and Reflective Analysis

KDR-Agent (Knowledge Retrieval, Disambiguation, and Reflective Analysis): introduces a novel multi-agent LLM framework for multi-domain low-resource in-context NER, integrating external knowledge retrieval, entity disambiguation, and reflective correction.
The framework operates in two stages: Knowledge In-context Construction, which builds enriched prompts, and Reflection & Correction, which refines predictions using structured error analysis.
KDR-Agent reduces reliance on large annotated corpora by using concise natural-language type definitions and a static set of entity-level positive-negative contrastive demonstrations.

Defending Large Language Models Against Jailbreak Exploits with Responsible AI Considerations

DLLE (Defending LLMs Against Jailbreak Exploits): introduces a systematic taxonomy of jailbreak defenses and proposes three complementary strategies: PLDF, LBSD, and MetaGPT-DSAD.
The PLDF uses sanitization, paraphrasing, and adaptive system prompts, while the LBSD applies inference-time vector steering in safety-aware layers to reinforce refusal behavior.
The MetaGPT-DSAD, employing structured, role-based collaboration among Rephrase, Core LLM, and Judge Agents, achieved full mitigation of jailbreak attempts in experiments.

LLM-Driven Kernel Evolution: Automating Driver Updates in Linux

AUTODRIVER (LLM-driven adaptation and validation loop): introduces a closed-loop, LLM-driven system for automating Linux driver maintenance, utilizing the DRIVEBENCH Executable Corpus and Taxonomy for structured input and validation.
The system employs a multi-agent architecture, including prompt engineering-, coding-, static analysis-, and patch fix-agents, operating within a Closed-Loop Refinement Cycle guided by compiler diagnostics.
Validation integrates a Localization Engine for precise edit scoping, followed by Docker Compilation and Linux QEMU Testing to ensure functional and security consistency across kernel versions.

KERNELBAND: Boosting LLM-based Kernel Optimization with a Hierarchical and Hardware-aware Multi-armed Bandit

KERNELBAND: introduces a novel framework that formulates kernel optimization as a hierarchical multi-armed bandit problem, enabling LLM agents to strategically navigate the optimization space using runtime behavior clustering and profiling-guided strategy selection.
The approach leverages hardware profiling to identify promising optimization strategies and employs runtime clustering to reduce exploration overhead by sharing insights across similar kernel candidates.
The core mechanism is a three-term Hierarchical UCB score that balances exploitation, exploration, and hardware guidance, leading to superior performance and efficiency compared to state-of-the-art methods.

Cognitive Alpha Mining via LLM-Driven Code-Based Evolution

CogAlpha (Cognitive Alpha Mining Framework): introduces a multi-agent framework combining code-level alpha representation with LLM-driven reasoning and evolutionary search for automated and explainable alpha discovery.
The framework utilizes a Seven-Level Agent Hierarchy for broad exploration and a Multi-Agent Quality Checker to ensure the validity and economic interpretability of generated alpha codes.
Thinking Evolution employs LLM-guided mutation and crossover operations to iteratively refine qualified alpha candidates based on financial feedback and predictive metrics.

UNeMo: Collaborative Visual-Language Reasoning and Navigation via a Multimodal World Model

UNeMo (Unlock Next Moment): introduces a novel framework for Vision-and-Language Navigation (VLN) that collaboratively optimizes visual state reasoning and navigational decision-making.
The core architecture includes the Multimodal World Model (MWM) for predicting subsequent visual states and the Hierarchical Prediction-Feedback Navigator (HPFN) for integrating this state reasoning into action selection.
The MWM uses a CVAE structure with cross-attention to fuse visual features, language instructions, and navigational actions, while HPFN enables dynamic bidirectional promotion between the MWM and navigation policies.

HERMES: Towards Efficient and Verifiable Mathematical Reasoning in LLMs

HERMES (Hybrid Agent for Reasoning in Mathematics with NEuro-Symbolic Lean4 verification): introduces a Lean4-driven, multi-modular reasoning agent that uses a Reasoning LLM (Generates informal steps), Translation Module (Formalizes steps), Prover Module (Attempts formal proof/counter-proof), and Feedback Module (Returns verification signals) to interleave informal reasoning with formally verified proof steps.
The framework performs intermediate formal checking using the Lean4 compiler and Lean4 REPL to prevent reasoning drift and employs a Memory Block (Stores validated proof steps) to maintain proof continuity across long, multi-step reasoning chains.
By leveraging symbolic-engine-backed correctness signals, the agent significantly improves reasoning accuracy while substantially reducing token usage and computational cost compared to reward-based approaches.

RhinoInsight: Improving Deep Research through Control Mechanisms for Model Behavior and Context

RhinoInsight (Deep Research Framework): introduces two control mechanisms, the Verifiable Checklist Module (Supervises model behavior) and the Evidence Audit Module (Organizes context information), to enhance robustness and traceability in deep research tasks.
The VCM transforms user queries into traceable, verifiable sub-goals via a Checklist Generator and LLM Critic, compiling them into a hierarchical outline to constrain planning and prevent non-executable actions.
The EAM structures search content, iteratively updates the outline, prunes noisy context, and uses a Critic to rank and bind high-quality evidence to drafted content, ensuring verifiability and reducing hallucinations.

HuggingR⁴: A Progressive Reasoning Framework for Discovering Optimal Model Companions

HuggingR⁴: introduces a progressive reasoning framework combining Reasoning, Retrieval, Refinement, and Reflection to efficiently select optimal AI models from large-scale community repositories like HuggingFace.
The framework uses a coarse-to-fine strategy, starting with iterative reasoning and vector-based retrieval to narrow candidates, followed by fine-grained refinement using a sliding window strategy to manage token consumption.
The approach attains high workability (92.03%) and reasonability (82.46%) on a new multimodal human-annotated dataset while maintaining constant token consumption regardless of the candidate pool size.

VIL2C: Value-of-Information Aware Low-Latency Communication for Multi-Agent Reinforcement Learning

VIL2C (Value-of-Information aware Low-latency Communication): introduces a scheme that proactively adjusts communication latency distribution using VoI-aware resource allocation and a progressive message reception strategy to enhance multi-agent cooperation performance.
The scheme defines Value of Information (VoI) based on message importance (KL divergence) and communication latency, optimizing bandwidth and power allocation via ResoNet to prioritize high-VoI messages.
The Progressive Reception module adaptively determines the recipient's waiting time, terminating reception when the uncertainty of the action probability distribution falls below a predefined entropy threshold.

Agent Discovery in Internet of Agents: Challenges and Solutions

SDCD (Semantic-Driven Capability Discovery): introduces a novel two-stage capability discovery framework for the Internet of Agents (IoA) that integrates semantic capability modeling, scalable indexing, and memory-enhanced continual discovery.
The framework addresses challenges in IoA heterogeneity and scalability by using LLMs for semantic profiling and compressing high-dimensional embeddings into compact, updatable agent codes.
Continual discovery ensures long-term performance in dynamic environments by training the retrieval model with knowledge replay and stability constraints to prevent forgetting of established agents.

HABIT: Human Action Benchmark for Interactive Traffic in CARLA

HABIT (Human Action Benchmark for Interactive Traffic): introduces a high-fidelity simulation benchmark integrating 4,730 semantically curated, real-world pedestrian motions into the CARLA simulator for rigorous autonomous driving evaluation.
The framework utilizes a modular motion retargeting pipeline to convert heterogeneous motion capture and video data into physically consistent, globally aligned SMPL-based trajectories.
HABIT introduces novel safety metrics, including the Abbreviated Injury Scale (AIS) and False Positive Braking Rate (FPBR), to expose planner weaknesses and safety-conservatism trade-offs hidden in scripted simulations.

Robot-Powered Data Flywheels: Deploying Robots in the Wild for Continual Data Collection and Foundation Model Adaptation

RPDF (Robot-Powered Data Flywheel): introduces an iterative framework where a mobile manipulator robot (Scanford) performs useful tasks while autonomously collecting and curating domain-representative data to continually fine-tune a Vision-Language Model (VLM).
The Scanford system instantiates RPDF by deploying a mobile manipulator equipped with a VLM in a library to scan shelves and identify books, leveraging the library catalog for automated, high-quality data labeling.
The framework successfully improves VLM performance on domain-specific book identification (32.0% to 71.8%) and domain-adjacent multilingual OCR, while saving an estimated 18.7 hours of human labor during a two-week deployment.

IRSDA: An Agent-Orchestrated Framework for Enterprise Intrusion Response

IRSDA (Intrusion Response System Digital Assistant): introduces an agent-orchestrated framework for enterprise intrusion response, combining the MAPE-K loop with Self-Adaptive Autonomic Computing Systems (SA-ACS) for autonomous, policy-compliant cyber defense.
The architecture uses an $n$-tier design, featuring the IRSDAC (client interface), IRSDAS (server), IRSDAAO (orchestration layer with Agentic Brain), partition-specific IRS Agents, and Tier V components: IRSKG (knowledge graph) and IRSLLM (cybersecurity-tuned LLM).
The system leverages graph-based RAG to ground the IRSLLM's contextual reasoning and automated responses using real-time enterprise data and dynamic Rules-of-Engagement (ROE), ensuring explainability and policy alignment.

AttackPilot: Autonomous Inference Attacks Against ML Services With LLM-Based Agents

AttackPilot: introduces an autonomous multi-agent framework capable of independently conducting inference attacks against ML services, comprising the ControllerAgent (managing and monitoring) and concurrent AttackAgents (executing specific attacks).
The framework achieves near-expert attack performance and 100.0% task completion using robust LLMs, task-specific action spaces, and a reusable environment.
Task-specific action spaces guide the AttackAgent through critical steps like selecting shadow datasets and setting hyperparameters, mitigating common LLM errors such as bad plans and context loss.

Agint: Agentic Graph Compilation for Software Engineering Agents

Agint (Agentic Graph Compilation for Software Engineering Agents): introduces an agentic graph compiler, interpreter, and runtime that converts natural language instructions into typed, effect-aware code Directed Acyclic Graphs (DAGs) using a six-tier type floor system.
The system utilizes a composable Unix-style toolchain, including Dagify (DAG compiler) and Dagent (hybrid JIT runtime), unified by the Agilink addressing system for reliable data and tool flow.
Agint employs Flyte (unified LLM orchestration) integrated with Hydantic (hierarchical structured generation) to enable dynamic graph refinement, parallel compilation, and hybrid execution modes (prefine, dynamic, predict).

DUALGAUGE: Automated Joint Security–Functionality Benchmarking for Secure Code Generation

DUALGAUGE: introduces the first fully automated benchmarking framework designed to rigorously evaluate the security and correctness of LLM-generated code in unison, utilizing a Sample Generator, Agentic Executor, LLM Based Evaluator, and Aggregation and Dashboard.
The system uses the DUALGAUGE-BENCH suite, featuring 154 tasks each paired with dual, coverage-enforced functional and security test suites, to assess LLM performance holistically.
The Agentic Executor runs generated code in Isolated Containers, resolving runtime issues via an LLM Agent, while the LLM Based Evaluator performs semantic analysis of execution traces for security assessment.

23rd November 2025

FHE-Agent: Automating CKKS Configuration for Practical Encrypted Inference via an LLM-Guided Agentic Framework

FHE-Agent: introduces an agentic framework that automates CKKS configuration for encrypted inference by coupling an LLM controller with a deterministic tool suite to decompose the search into global parameter selection and layer-wise bottleneck repair.
The system operates within a multi-fidelity workflow (Phase A/B/C) that uses cheap static analysis and cleartext simulation to aggressively prune invalid regimes before reserving expensive encrypted evaluations for promising candidates.
By exposing layerwise profilers and cost models, the framework consistently achieves better precision and lower latency than naive search strategies, successfully finding feasible 128-bit secure configurations for complex models where baseline heuristics fail.

A Synthetic Encyclopedic Dictionary and Semantic Knowledge Graph

OpenGloss: introduces a synthetic encyclopedic dictionary and semantic knowledge graph generated by a Multi-Agent Generation Pipeline (Four-stage process) that uses LLM Backends (Configurable foundation models) and Schema Validation (Ensures structured output) to perform Lexeme Selection (Establishes vocabulary foundation), Sense Generation (Generates definitions/relationships), Graph Construction (Extracts explicit semantic edges), and Enrichment (Adds context/history).
The system produced 537K sense definitions across 150K lexemes and 9.1M semantic edges in under 96 hours for less than $1,000, demonstrating rapid, cost-effective creation of comprehensive lexical resources.
The resource uniquely integrates lexicographic definitions, encyclopedic context, etymological histories, usage examples, and semantic relationships, addressing gaps in pedagogical applications and general NLP tasks.

From Code Foundation Models to Agents and Applications: A Practical Guide to Code Intelligence

Code Intelligence Ecosystem (CIE): introduces a comprehensive synthesis and practical guide to code LLMs, systematically examining the complete model life cycle from data curation to autonomous coding agents.
The guide analyzes general and code-specialized LLMs, critically examining techniques, design decisions, and trade-offs across pre-training, supervised fine-tuning (SFT), and reinforcement learning (RL) stages.
Extensive experiments provide data-driven guidelines for compute-efficient pre-training (scaling laws) and calibrated RL recipes for maximizing verifiable correctness in code generation.

LockForge: Automating Paper-to-Code for Logic Locking with Multi-Agent Reasoning LLMs

LockForge: introduces, "a multi-agent, multi-stage LLM workflow with role isolation for LL coding and evaluation," which systematically converts Logic Locking (LL) paper descriptions into executable and verified code.
The pipeline includes Forethoughts, Implementation, and a Refinement Loop driven by Content Mining and Local Execution, orchestrated by LLM-A (Coder) with PDF access.
Validation relies on independent LLM-B (Judge) and LLM-C (Examiner) agents using a formalized BCSRP Similarity Scoring system and paper-grounded true/false examination to ensure conceptual fidelity.

End-to-End Automated Logging via Multi-Agent Framework

AUTOLOGGER (End-to-End Automated Logging via Multi-Agent Framework): introduces a novel hybrid framework addressing the complete logging pipeline, including the neglected whether-to-log decision, using a Judger and a Multi-Agent System.
The Judger, a fine-tuned binary classifier, efficiently determines logging necessity, acting as a filter before activating the resource-intensive MAS for generation tasks.
The MAS utilizes specialized Locator and Generator agents, supported by a Tool Pool (including Backward Slicing and Similar Case Retrieval) to ground reasoning in factual code analysis and mitigate LLM hallucination.

Shadows in the Code: Exploring the Risks and Defenses of LLM-based Multi-Agent Software Development Systems

IMBIA/Adv-IMBIA Methodology: introduces a security analysis framework for LLM-based Multi-Agent Software Development Systems (Target) using the Implicit Malicious Behavior Injection Attack ($P_m$) and the Adversarial IMBIA defense ($P_{adv}$), where agents (Design/Code/Test) are exploited or protected across two scenarios.
The IMBIA attack leverages a Malicious Injection Prompt ($P_m$), composed of a Secret Task Summary ($T_s$), Secret Task Descriptions ($T_d$), and Code Instructions ($C_i$), to inject covert malicious behavior into software generated from Benign Software Requirements ($P_b$).
The Adv-IMBIA defense uses an Adversarial Prompt ($P_{adv}$) integrated either at the user interface or directly into agent profiles to mitigate attacks, revealing that coding and testing phases present the highest security risks across frameworks like ChatDev, MetaGPT, and AgentVerse.

LLMs as Firmware Experts: A Runtime-Grown Tree-of-Agents Framework

FIRMHIVE (Recursive Delegation Engine, Proactive Knowledge Hub): introduces a recursive agent hive framework enabling LLMs to act as autonomous firmware security analysts by transforming delegation into an executable primitive and constructing a runtime Tree of Agents (ToA).
The framework utilizes the Recursive Delegation Engine (RDE) to dynamically decompose complex tasks into structured, parallel workflows aligned with firmware structure, mitigating context fragmentation.
The Proactive Knowledge Hub (PKH) serves as a persistent global memory, aggregating intermediate results and enabling cross-component dependency resolution and long-term coherence across distributed analyses.

General Agentic Memory Via Deep Research

GAM (General Agentic Memory): introduces a novel memory framework based on the Just-in-Time (JIT) compilation principle, featuring a Memorizer for offline history compression and a Researcher for online deep retrieval.
The Memorizer extracts key information into a lightweight Memory and preserves complete historical information in a Page-store, while the Researcher performs iterative Planning, Searching using multiple tools, and Reflection to generate customized context for client requests.
The dual-agent architecture leverages LLMs' agentic capabilities and test-time scalability to achieve high-fidelity memory and optimize downstream task completion, significantly outperforming existing Ahead-of-Time memory systems.

Multi-Agent Collaborative Filtering: Orchestrating Users and Items for Agentic Recommendations

MACF (Multi-Agent Collaborative Filtering): introduces an agentic recommendation framework that orchestrates User Agents (similar users) and Item Agents (relevant items) via a central Orchestrator Agent across a Multi-Round Discussion.
The Orchestrator Agent dynamically manages collaboration by issuing Personalized Collaboration Instructions and performing Dynamic Agent Recruitment based on the target user query and interaction history.
This structure allows the system to aggregate collaborative signals in a structured, adaptive manner, enabling agents to refine candidates and surface agreements or conflicts using shared context and Retrieval Tools.

A Multimodal Conversational Agent for Tabular Data Analysis

Talk2Data: introduces a multimodal conversational agent for tabular data analysis that unifies voice/text input with visual, tabular, and spoken outputs via an agentic orchestration loop.
The system uses an Orchestration/Router component to adaptively select between LLM-driven code generation (executed in a secure sandbox) or direct narrative response (rendered via TTS).
Grounded prompts inject dataset metadata and conversational memory into the LLM, ensuring context-aware behavior and supporting iterative, multi-turn data exploration.

Path-Constrained Retrieval: A Structural Approach to Reliable LLM Agent Reasoning Through Graph-Scoped Semantic Search

PCR (Path-Constrained Retrieval): introduces a novel retrieval method combining structural graph constraints with semantic search to ensure retrieved information maintains logical consistency within a knowledge graph for LLM agents.
The method restricts the search space to nodes structurally reachable from an anchor node, preventing the retrieval of disconnected information that often leads to inconsistent LLM reasoning chains.
Evaluated on the PathRAG-6 benchmark, PCR achieved 100% structural consistency, significantly outperforming baseline vector and hybrid retrieval methods while maintaining competitive relevance scores.

Hierarchical Deep Research with Local–Web RAG: Toward Automated System-Level Materials Discovery

DToR (Deep Tree of Research): introduces a hierarchical deep research agent for materials and device discovery, integrating local retrieval-augmented generation with LLM reasoners and a Deep Tree of Research mechanism for adaptive exploration.
The framework treats each Deep Research instance as a Research Node within a tree-structured workflow, using a local-first retrieval policy and gap-driven web expansion to maximize coverage and coherence for S3-S4 level hypotheses.
DToR consistently outperforms single-instance DR and commercial systems in synthesis quality across 27 nanomaterials/device topics, enabling cost-effective, on-prem integration for complex long-horizon scientific inquiry.

Cross-Disciplinary Knowledge Retrieval and Synthesis: A Compound AI Architecture for Scientific Discovery

BioSage (Compound AI Architecture): introduces a novel compound AI architecture that integrates LLMs with RAG, specialized agents, and tools to enable cross-disciplinary scientific discovery and synthesis.
The system features specialized agents—including retrieval, translation, and reasoning agents—orchestrated via a Query Planning Agent to provide citation-backed, transparent, and traceable responses.
The architecture utilizes a multi-level memory system (semantic, procedural, episodic) and user-centric design principles to support scientific workflows like summarization, research debate, and brainstorming.

LLM Assisted Coding with Metamorphic Specification Mutation Agent

CMA (CodeMetaAgent): introduces an MR-driven LLM-agent framework that systematically refines task specifications and generates semantically constrained test cases, integrating transformation, validation, generation, and execution within a unified pipeline.
The framework coordinates four core modules—Mutator, Reviewer, Generator, and Evaluator—to operationalize MRs as proactive semantic operators, guiding LLM reasoning for code generation and test case synthesis.
Experiments show that MR-guided transformations significantly improve code generation accuracy by up to 17% and achieve high test coverage (up to 99.81%) across multiple LLMs and software engineering benchmarks.

Can LLMs Help Allocate Public Health Resources? A Case Study on Childhood Lead Testing

PS Framework: introduces a systematic approach for public health resource allocation by integrating Prevalence of elevated BLLs, Percentage of untested children, and Public health coverage ratio, weighted dynamically to rank neighborhoods for intervention.
The study evaluates state-of-the-art LLMs operating in agentic and deep research modes on a resource allocation task involving distributing 1,000 lead test kits across neighborhoods in Chicago, New York City, and Washington, D.C.
Evaluation results reveal that LLMs struggle with information retrieval and evidence-based reasoning, frequently overlooking high-priority neighborhoods and allocating disproportionate resources to lower-priority areas.

Energy-Efficient Task Computation at the Edge for Vehicular Services

LAPPO/MALAPPO (Multi-Agent Proximal Policy Optimization based Task Computation Strategy): introduces an energy-efficient task computation strategy for V2X services using a decentralized PPO-based algorithm that minimizes total energy consumption while satisfying task latency requirements.
The strategy operates within a 3-tier MEC architecture, leveraging empirical car mobility analysis to adapt task offloading decisions for both static (LAPPO) and mobile (MALAPPO) vehicular scenarios.
Evaluation using real-world mobility traces demonstrates that the mobility-aware solution significantly reduces task interruptions and achieves substantial energy savings compared to state-of-the-art schemes.

AutoMAS: A Generic Multi-Agent System for Algorithm Self-Adaptation in Wireless Networks

AutoMAS (A Generic Multi-Agent System for Algorithm Self-Adaptation in Wireless Networks): introduces a multi-agent system deployed in a C-RAN architecture that autonomously selects the most suitable wireless optimization algorithm based on dynamic environmental observations.
The system utilizes a closed-loop cognitive single-agent architecture, where an LLM coordinates observation, reasoning, and action, supported by memory and external tools.
AutoMAS employs a supervisor-executor mechanism to dynamically select specialized agents from an agent pool and orchestrate their workflow for flexible and efficient task resolution, validated through channel estimation case studies.

Wireless Power Transfer and Intent-Driven Network Optimization in AAVs-assisted IoT for 6G Sustainable Connectivity

HDT/DA-MAPPO: introduces an Intent-Driven Framework for Autonomous Network Optimization using the HDT for implicit intent prediction and DA-MAPPO for multi-agent decision-making in AAV-assisted IoT systems.
HDT replaces conventional floating-point matrix operations with symbolic Hyperdimensional computations to reduce computational and energy overhead for long-context parsing.
DA-MAPPO employs decoupled networks and cascaded coupling to handle high-dimensional double action spaces (trajectory planning and intent response) while preserving high-order dependencies.

Weakly-supervised Latent Models for Task-specific Visual-Language Control

LDM (Latent Dynamics Model): introduces a task-specific latent dynamics model trained with weak goal-state supervision to enable precise visual-language control for object centering in autonomous inspection.
The model uses separate encoders to map images, instructions, and actions into a shared latent space, where the dynamics model predicts action-induced state shifts toward a goal prototype.
Training leverages complementary losses, including directional, ranking, consistency, and regularization losses, to stabilize learning and ensure robust spatial grounding, significantly outperforming LLM baselines.

22nd November 2025

INFINIBENCH: INFINITE BENCHMARKING FOR VISUAL SPATIAL REASONING WITH CUSTOMIZABLE SCENE COMPLEXITY

InfiniBench: introduces a fully automated, customizable benchmark generator that synthesizes a theoretically infinite variety of complex, physically plausible 3D scenes and renders them into photo-realistic videos for VLM spatial reasoning evaluation.
The pipeline uses an LLM-based agentic framework for iterative constraint refinement, a cluster-based layout optimizer for dense scene generation, and a task-aware camera trajectory optimization for informative video rendering.
The system allows parameterized control over compositional, relational, and observational scene complexities, enabling fine-grained diagnostic analysis of VLM successes and failures in spatial reasoning tasks.

Agent-as-a-Graph: Knowledge Graph-Based Tool and Agent Retrieval for LLM Multi-Agent Systems

Agent-as-a-Graph Retrieval: introduces a knowledge graph retrieval augmented generation approach that represents tools and their parent agents as co-equal nodes and edges in a knowledge graph to enable unified retrieval.
The retrieval process involves initial vector search for relevant nodes, followed by type-specific weighted reciprocal rank fusion (wRRF) for reranking, and finally graph traversal to identify the final set of executable parent agents for LLM multi-agent systems.
By integrating both tool-level specificity and agent-level context, the approach achieves significant improvements in Recall@5 and nDCG@5 metrics over prior state-of-the-art LLM retrievers on the LiveMCPBenchmark.

ARIAL: An Agentic Framework for Document VQA with Precise Answer Localization

ARIAL (Agentic Reasoning for Interpretable Answer Localization): introduces a modular framework for Document VQA that orchestrates specialized tools via an LLM-based Planning Agent (LLM-based orchestration) to achieve precise answer extraction and reliable spatial grounding.
The system decomposes Document VQA into structured subtasks handled by dedicated modules, including OCR (Text and BBox extraction), RAG (Semantic search retrieval), QA (Answer generation), and Grounding (Spatial localization).
ARIAL achieves state-of-the-art results across four benchmarks by leveraging agentic orchestration to improve both textual accuracy (ANLS) and spatial precision (mAP@IoU), providing transparent reasoning traces.

Rethinking Retrieval: From Traditional Retrieval Augmented Generation to Agentic and Non-Vector Reasoning Systems in the Financial Domain for Large Language Models

Financial RAG Architectures (FRA): introduces a systematic evaluation comparing Vector-Based Agentic RAG (Hybrid search and filtering) against Hierarchical Node-Based Reasoning System (Structured document traversal) for financial document Q&A.
The Vector-Based Agentic RAG achieved a 68% win rate over the hierarchical system with comparable latency (5.2 vs 5.98 seconds) across 1,200 SEC filings.
Advanced RAG techniques, including Cross-Encoder Reranking and Small-to-Big Retrieval, significantly improved retrieval accuracy and answer quality, demonstrating cost-performance tradeoffs for production.

ASTRA: Agentic Steerability and Risk Assessment Framework

ASTRA (Agentic Steerability and Risk Assessment Framework): introduces a first-of-its-kind framework designed to evaluate LLMs' ability to enforce custom guardrails during multi-turn planning and strict tool activation, using LLM, Agent (ReAct paradigm), LangGraph, Scenario Generator, System Prompt, Guardrails, Tool Suite, Jailbreak Techniques, and Automated Statistical Analysis Pipeline.
The framework simulates 10 diverse autonomous agents with 37 unique tools against novel agentic threats, focusing on security steerability in context-specific operational functions rather than universal threats.
ASTRA uses simulated tool interactions and sophisticated jailbreak techniques to provide a robust methodology for measuring agentic steerability, revealing that this capability is distinct from general security resistance.

MASTEST: A LLM-Based Multi-Agent System For RESTful API Tests

MASTEST (LLM-Based Multi-Agent System For RESTful API Tests): introduces a multi-agent system that automates the entire RESTful API testing workflow, including scenario generation, script generation, execution, and result analysis, using a combination of LLM-based and programmed agents.
The architecture includes specialized agents like the API Parser, Unit/System Test Scenario Generators, Test Script Generator, and various checkers (Syntax, Data Type, Status Code Coverage) to ensure quality and coverage.
The system incorporates human testers via a GUI to review and correct LLM-generated artifacts at multiple stages, mitigating LLM hallucination and error accumulation while shifting human focus to quality assurance.

QuickLAP: Quick Language-Action Preference Learning for Autonomous Driving Agents

QuickLAP (Quick Language-Action Preference learning): introduces a closed-form Bayesian framework that fuses physical corrections and natural language feedback in real time to infer user preference weights.
The system uses a dual-LLM architecture, including LM${att}$ and LM${pref}$, to process free-form utterances into structured reward signals (attention mask, shift, and confidence).
By treating language as a probabilistic observation over latent preferences, the framework resolves ambiguity inherent in physical corrections and achieves robust online adaptation.

A superpersuasive autonomous policy debating system

DeepDebater: introduces a hierarchical multi-agent framework for autonomous policy debating, utilizing specialized LLM agent workflows for iterative retrieval, synthesis, and self-correction against the OpenDebateEvidence corpus.
The system models the entire competitive policy debate lifecycle, generating complete speech transcripts, cross-examinations, and rebuttals, and rendering them using AI speech and EchoMimic V1 talking-head animation.
The architecture decomposes complex creative and strategic tasks into discrete, role-based agent workflows, enabling the system to achieve superior argumentative quality and consistently win simulated rounds.

SKILLWRAPPER: GENERATIVE PREDICATE INVENTION FOR SKILL ABSTRACTION

SKILLWRAPPER: introduces a principled system for generative predicate invention that leverages foundation models to learn human-interpretable, provably sound, and complete symbolic representations (operators and predicates) of black-box robot skills from RGB image observations.
The system iteratively performs Active Data Gathering, Predicate Invention (using VLMs to propose and classify predicates), and Operator Learning to construct an abstract transition model usable by off-the-shelf classical planners.
By focusing on resolving inconsistencies between observed data and the current abstract model, the approach ensures the learned symbolic model is sound and probabilistically complete for long-horizon planning tasks.

Towards Automating Data Access Permissions in AI Agents

APMS (Automated Permission Management System): introduces a permission prediction model based on a Hybrid ML Framework that combines LLM-based in-context learning and collaborative filtering to automatically decide data access permissions for AI agents.
The Hybrid ML Framework achieves 85.1% overall accuracy and 94.4% accuracy for high-confidence predictions by leveraging limited individual permission history and preferences from similar users.
The system is designed to address the limitations of conventional permission models, which are inadequate for the autonomous execution paradigm of LLM-based AI agents, where permission decisions must often be made at runtime for unseen data types.

Building Browser Agents: Architecture, Security, and Practical Solutions

Production Browser Agent Architecture (PBAA): introduces an architecture for reliable and safe browser agents, combining hybrid context management, a robust execution layer, and programmatic safety boundaries enforced by specialization.
Context management relies on single-snapshot retention, intelligent trimming using a lightweight LLM, and conversation history compression to maintain a stable token budget and reduce operational costs by 57%.
Safety is achieved through deterministic, code-level constraints like domain allowlisting and action restriction, enabling the agent to reach an 85% success rate on the WebGames benchmark.

21st November 2025

GHOSTEI-BENCH: DO MOBILE AGENTS RESILIENCE TO ENVIRONMENTAL INJECTION IN DYNAMIC ON-DEVICE ENVIRONMENTS?

GhostEI-Bench: introduces the first benchmark dedicated to assessing mobile agent robustness against environmental injection attacks in dynamic, executable environments, utilizing a Tested Agent (Perceives/Plans/Acts), an Environment Controller (Prepares/Injects attacks), an Evaluation Module (Assesses agent behavior), a Judge LLM (Analyzes failure trajectory), an Android Emulator (Realistic GUI environment), and Attack Vectors (Threat models).
The benchmark systematically injects adversarial UI elements, such as deceptive overlays and spoofed notifications, directly into realistic application workflows running inside fully operational Android emulators.
A novel LLM-based evaluation protocol performs fine-grained failure analysis by reviewing the agent's action trajectory and corresponding screenshots to identify the precise point of failure (perception, recognition, or reasoning).

MDG: Masked Denoising Generation for Multi-Agent Behavior Modeling in Traffic Environments

MDG (Masked Denoising Generation): introduces a unified generative framework that reformulates multi-agent behavior modeling as the reconstruction of independently noised spatiotemporal tensors, supporting diverse tasks like open-loop prediction and closed-loop planning.
The approach utilizes a continuous, per-agent and per-timestep Noise Mask field to regulate localized denoising, enabling efficient and controllable trajectory generation in a single or few forward passes.
The architecture employs a Scene Encoder to fuse multimodal context and a Transformer Denoiser with specialized attention mechanisms to progressively reconstruct clean trajectories, achieving competitive closed-loop performance on Waymo Sim Agents and nuPlan benchmarks.

Agentifying Agentic AI

Agentifying Agentic AI (AAAI): introduces a path toward responsible agency by integrating adaptive, data-driven LLM approaches with structured models from AAMAS, including BDI Architecture (Explicit mental states), Communication Protocols (Structured inter-agent messages), and Norms, Institutions, Roles (Social constraints, expectations).
The paper argues that true agency requires complementing learning-based mechanisms with explicit models of cognition, cooperation, and governance to ensure transparency, coherence, and accountability in multi-agent settings.
By reintroducing formal concepts like Mechanism Design and Theory of Mind, the framework aims to address current Agentic AI challenges related to reliability, grounding, long-horizon agency, and robust multi-agent coordination.

Agentic Program Verification

AutoRocQ: introduces an LLM agent for program verification that uses Context Analysis, Context-aware Tactic Generation, Proof Tree-aware Interpretation, Context-assisted Feedback Handling, Error Analysis, History Manager, and Proof Certificate to autonomously construct proofs in collaboration with the Rocq Proof Assistant.
The agent employs an iterative refinement loop, leveraging agentic context search via query commands to retrieve relevant lemmas and definitions on demand, significantly reducing contextual noise compared to static retrieval methods.
By maintaining a structured proof tree representation, the system achieves high-level interpretation of the proof derivation, enabling strategic decision-making and effective error recovery during complex verification tasks.

Designing Domain-Specific Agents via Hierarchical Task Abstraction Mechanism

HTAM (Hierarchical Task Abstraction Mechanism): introduces a novel agent design framework that structures multi-agent systems into a logical hierarchy mirroring the intrinsic task-dependency graph of a specialized domain.
Instantiated as EarthAgent for complex geospatial analysis, the architecture uses a dual-pass mechanism: top-down planning for decomposition and bottom-up execution for sequential data processing.
The framework enforces procedural correctness and modularity by decomposing the problem into distinct functional layers, each populated by specialized LLM-driven sub-agents.

AutoLink: Autonomous Schema Exploration and Expansion for Scalable Schema Linking in Text-to-SQL at Scale

AutoLink: introduces an autonomous agent framework that reformulates schema linking as an iterative, sequential discovery process, utilizing an LLM policy to dynamically explore and expand the linked schema subset.
The agent interacts with two specialized environments, the Database Environment ($\mathcal{E}{DB}$) for SQL exploration and the Schema Vector Store Environment ($\mathcal{E}{VS}$) for efficient semantic retrieval, without requiring the full database schema input.
The agent employs a diverse set of actions, including schema exploration, semantic retrieval, and verification, to iteratively refine the linked schema, achieving state-of-the-art strict recall and superior token efficiency.

PathAgent: Toward Interpretable Analysis of Whole-slide Pathology Images via Large Language Model-based Agentic Reasoning

PathAgent (Large Language Model-based Agent Framework): introduces a training-free LLM-based agent framework that emulates pathologists' reflective, stepwise analysis by coordinating a Navigator, Perceptor, and Executor for iterative, evidence-driven reasoning on Whole-slide images.
The Executor, serving as the central module, employs Multi-Step Reasoning and Reflection to dynamically adjust magnification and retrieve new Regions of Interest, generating an explicit chain-of-thought for fully interpretable predictions.
The framework achieves strong zero-shot generalization and superior accuracy in open-ended and constrained visual question-answering tasks without requiring specific training data.

DETERMINISTIC INFERENCE ACROSS TENSOR PARALLEL SIZES THAT ELIMINATES TRAINING-INFERENCE MISMATCH

TBIK (Tree-Based Invariant Kernels): introduces a framework for achieving fully deterministic LLM inference by proposing TP-invariant matrix multiplication and reduction primitives that eliminate the training-inference mismatch.
The core mechanism involves aligning intra- and inter-GPU reduction orders using a unified hierarchical binary tree structure, ensuring a consistent arithmetic sequence regardless of Tensor Parallel (TP) size or GPU count.
Integrated into vLLM and FSDP, TBIK, combined with Batch-Invariant Operations (BIO), achieves bit-wise identical results across varying TP configurations and frameworks, crucial for stable on-policy Reinforcement Learning (RL) training.

Episodic Memory in Agentic Frameworks: Suggesting Next Tasks

EM Architecture: introduces an episodic memory architecture designed to support workflow completion in agentic frameworks by storing and retrieving past scientific workflows to guide agents in suggesting plausible next tasks.
The architecture interposes the EM Agent between the chat UI and the domain crew, enabling it to compile execution trajectories into formalized workflows and retrieve similar historical sequences from the Workflow DB.
The EM Agent leverages an LLM to analyze the retrieved similar workflows against the current workflow, generating context-aware suggestions for subsequent steps, thereby mitigating hallucination risks associated with relying solely on the LLM's pre-trained memory.

M³-Bench: Multi-Modal, Multi-Hop, Multi-Threaded Tool-Using MLLM Agent Benchmark

M³-Bench (Multi-Modal, Multiplex, Matching-aware MCP Benchmark): introduces a principled evaluation suite for multimodal tool use under the Model Context Protocol (MCP), featuring an MLLM Executor, MCP Servers, a Judge, and a Similarity-Bucketed Hungarian Alignment module.
The benchmark targets realistic, multi-hop, and multi-threaded workflows that require visual grounding, textual reasoning, cross-tool dependencies, and persistence of intermediate resources across steps.
The evaluation pipeline uses a similarity-driven alignment method based on a sentence encoder and Hungarian matching to obtain auditable one-to-one correspondences, decoupling semantic fidelity from workflow consistency.

PersonaAgent with GraphRAG: Community-Aware Knowledge Graphs for Personalized LLM

PersonaAgent with GraphRAG: introduces a novel framework for persona-based LLM agents that leverages a Knowledge Graph-enhanced Retrieval-Augmented Generation (GraphRAG) mechanism to ground personalized outputs in both individual and collective knowledge.
The system integrates a persona prompt encoding user preferences, a knowledge graph capturing personal interactions and community patterns, and a GraphRAG mechanism that retrieves and synthesizes relevant context for generation.
This approach dynamically generates context-rich prompts by combining user-specific history and global community patterns, significantly improving personalization metrics across news categorization, movie tagging, and product rating tasks.

REMSA: AN LLM AGENT FOR FOUNDATION Model SELECTION IN REMOTE SENSING

REMSA (Remote-sensing Model Selection Agent): introduces the first LLM agent for automated Remote Sensing Foundation Model (RSFM) selection, combining structured metadata grounding via RS-FMD (Remote Sensing Foundation Model Database) with a task-driven agentic workflow.
The modular agent architecture includes an Interpreter, a Task Orchestrator, and specialized Tools (Retrieval, Ranking, Clarification, Explanation) to support complex, constraint-heavy RS scenarios.
The system leverages in-context learning for ranking and multi-turn clarification to deliver transparent, reproducible selections, outperforming retrieval-only and unstructured RAG baselines on an expert-verified benchmark of 75 RS query scenarios.

Humanlike Multi-user Agent (HUMA): Designing a Deceptively Human AI Facilitator for Group Chats

HUMA (Humanlike Multi-user Agent): introduces an LLM-based facilitator for asynchronous group chats using an event-driven architecture with Router (Strategy Selection), Action Agent (Strategy Execution, Timing Simulation), and Reflection (Context Synthesis, Coherence) components.
The system simulates human-like response timing (50-100 WPM) and handles interruptions by preserving the agent's internal scratchpad and intended actions, enabling natural adaptation to rapid conversation dynamics.
Evaluation showed that participants could not reliably distinguish the AI facilitator from human community managers, achieving near-chance detection rates and comparable subjective experience scores.

A Simple Yet Strong Baseline for Long-Term Conversational Memory of LLM Agents

EMem-G (Event-Centric Memory with Graph Propagation): introduces an event-centric conversational memory representation based on enriched Elementary Discourse Units (EDUs) organized into a heterogeneous graph, supporting associative recall via Personalized PageRank.
The system uses LLM-based extractors to decompose dialogue into self-contained EDUs and arguments, avoiding lossy compression or fragmentation typical of relation triples.
Retrieval involves dense similarity search followed by a recall-oriented LLM filter to select relevant EDUs and arguments before graph propagation augments the final QA context.

JIGSAWCOMM: Joint Semantic Feature Encoding and Transmission for Communication-Efficient Cooperative Perception

JIGSAWCOMM: introduces a novel communication-efficient Cooperative Perception (CP) framework that jointly trains a Sparse BEV Feature Encoder and a Feature Utility Estimator (FUE) Network to maximize the contribution of every transmitted bit to the final perception task.
The system uses an end-to-end differentiable Transmission Scheduler and a redundancy-aware top-1-per-cell policy, leveraging exchanged Meta Utility Maps to select only essential, non-redundant features for transmission.
This approach achieves an asymptotic O(1) communication cost relative to the number of agents, significantly reducing data volume (up to >500x) while maintaining high CP accuracy on OPV2V and DAIR-V2X benchmarks.

Physical Reinforcement Learning

CLLN (Contrastive Local Learning Network): introduces a novel analog, distributed system adapted for Q-Learning in reinforcement learning tasks, utilizing self-adjusting nonlinear resistors.
The network performs gradient descent on a global loss function via a local, contrastive training protocol that compares power dissipation in free and clamped states.
This physical approach aims to achieve energy efficiency and fault tolerance, features inherent to biological systems but lacking in traditional digital RL hardware.

20th Nov 2025

Large Language Model-Based Reward Design for Deep Reinforcement Learning-Driven Autonomous Cyber Defense

LLM-assisted Reward Design: introduces a method using a Large Language Model (LLM), specifically Claude Sonnet 4, to generate context-aware reward structures for Deep Reinforcement Learning (DRL) agents in an autonomous cyber defense simulation environment (Cyberwheel), leveraging Atomic Red Team (ART) and MITRE ATT&CK context.
The generated reward structures guide the training of DRL-based autonomous defense policies against various heuristic-based attack personas (e.g., aggressive, stealthy) defined using ART and MITRE ATT&CK techniques.
The study evaluates different blue agent policies (baseline, proactive-v1, proactive-v2) trained with LLM-informed rewards, showing that LLM guidance leads to more effective defense strategies against diverse adversarial behaviors.

DynaMimicGen: A Data Generation Framework for Robot Learning of Dynamic Tasks

DynaMimicGen (D-MG): introduces a scalable dataset generation framework that leverages Dynamic Movement Primitives (DMPs) to adapt demonstrations to novel and dynamic environments, producing smooth, realistic, and task-consistent Cartesian trajectories.
The framework transforms a minimal set of human demonstrations (Dsre) into a large, diverse dataset (Dgen) by segmenting tasks and generalizing motion primitives to new scene configurations, supporting dynamic adaptation during execution.
This approach significantly reduces the need for extensive human data collection while enabling policy training that generalizes robustly to dynamic task settings unseen in the original demonstrations.

AskDB: An LLM Agent for Natural Language Interaction with Relational Databases

AskDB: introduces a novel LLM-powered agent designed to unify data analysis and database administration through natural language, leveraging a ReAct cognitive cycle, Core Safety Protocol, and Dynamic Schema-Aware Prompting.
The agent utilizes Gemini LLMs and a curated set of tools to autonomously debug SQL, retrieve contextual information, and manage multi-step tasks for both analytical queries and administrative commands.
AskDB emphasizes Interaction Efficiency and autonomy, achieving strong performance on Spider 1.0 while incorporating safety mechanisms like PII shielding and destructive operation playbooks.

Operon: Incremental Construction of Ragged Data via Named Dimensions

Operon: introduces a Rust-based workflow engine that addresses challenges in processing ragged data through a novel formalism of named dimensions with explicit dependency relations, using a statically verified DSL and an automatically generated runtime system.
The system formalizes dimensional dependencies and uses a structured model for partial data to enable incremental construction of shapes, supporting robust persistence and recovery mechanisms.
Empirical evaluation shows that Operon significantly outperforms an existing workflow engine (Prefect) in overhead reduction for large-scale data generation pipelines.

Agent0: Unleashing Self-Evolving Agents from Zero Data via Tool-Integrated Reasoning

Agent0: introduces a fully autonomous framework that evolves high-performing LLM agents from scratch without external data by combining multi-step co-evolution between a Curriculum Agent and an Executor Agent with seamless external tool integration.
The framework establishes a symbiotic competition where the Curriculum Agent proposes increasingly challenging, tool-aware tasks based on the Executor Agent's uncertainty, driving a virtuous cycle of capability improvement.
Empirically, Agent0 significantly boosts reasoning capabilities across mathematical and general benchmarks, demonstrating the effectiveness of tool-augmented, self-driven curriculum generation.

InfCode-C++: Intent-Guided Semantic Retrieval and AST-Structured Search for C++ Issue Resolution

INFCODE-C++: introduces an autonomous system for end-to-end C++ issue resolution that combines semantic code-intent retrieval and deterministic AST-structured querying, utilizing a Reproducer Agent, Patch Agent, and Selector Agent.
The framework addresses C++ complexities like overloaded identifiers and nested namespaces by building an AST-Based Structural Index and a Semantic Code-Intent Index for precise fault localization.
It achieves a 25.58% resolution rate on MultiSWE-bench-CPP, significantly outperforming prior state-of-the-art Python-oriented agents.

Hiding in the AI Traffic: Abusing MCP for LLM-Powered Agentic Red Teaming

Introduces a novel Command & Control (C2) architecture leveraging the Model Context Protocol (MCP) to coordinate distributed, adaptive reconnaissance agents covertly across networks, with components including Reconnaissance Agents, an MCP Coordination Server, and a Red Team Command Agent.
The decoupled, two-leg C2 communication flow uses the MCP for stealthy tasking and leverages public LLM APIs for complex reasoning and payload generation, blending traffic with legitimate AI service usage.
This framework enables advanced adversarial capabilities like event-driven operations, multi-agent swarm coordination, and on-demand polymorphic malware generation while minimizing detection footprint.

D-GARA: A Dynamic Benchmarking Framework for GUI Agent Robustness in Real-World Anomalies

D-GARA (Dynamic Benchmarking Framework for GUI Agent Robustness in Real-World Anomalies): introduces a dynamic benchmarking framework to evaluate Android GUI agent robustness by integrating an Android simulator, an Execution Cycle, an Anomaly Trigger Mechanism, Interruption Injection, a Success Validation Mechanism, and a DataCollector tool.
The framework simulates real-world anomalies, such as permission dialogs and system alerts, by injecting them dynamically into the agent's execution trajectory using a rule-based Semantic Anomaly Triggering Mechanism.
D-GARA utilizes a state-centered Success Validator that checks the final UI state against declarative goal conditions, enabling realistic robustness evaluation beyond static benchmarks.

AutoBackdoor: Automating Backdoor Attacks via LLM Agents

AUTOBACKDOOR: introduces a fully automated red-teaming framework for LLMs that uses an autonomous LLM agent to execute the entire backdoor injection pipeline, including trigger generation, poisoned data construction, and model fine-tuning.
The framework employs a chained agentic workflow and a reflection-guided generation mechanism to synthesize semantically coherent, context-aware triggers and high-quality poisoned instruction-response pairs.
Experiments show the approach achieves over 90% attack success with minimal poisoned samples across various LLMs and tasks, highlighting the failure of existing defenses against agent-driven semantic backdoors.

Multi-Agent Coordination in Autonomous Vehicle Routing: A Simulation-Based Study of Communication, Memory, and Routing Loops

OMM (Object Memory Management): introduces a lightweight mechanism enabling autonomous vehicle agents to retain and share persistent knowledge of encountered obstacles to prevent inefficient path recalculation cycles.
The system utilizes V2V communication to broadcast minimal obstacle node IDs, which agents use to maintain a distributed blacklist consulted during Modified Dijkstra's path planning.
OMM-enabled coordination reduces average travel time by 75.7% and wait time by 88% compared to memory-less reactive rerouting systems, which suffer catastrophic performance degradation due to routing loops.

Dialogue Diplomats: An End-to-End Multi-Agent Reinforcement Learning System for Automated Conflict Resolution and Consensus Building

Dialogue Diplomats (DD): introduces a novel end-to-end Multi-Agent Reinforcement Learning (MARL) framework for automated conflict resolution, integrating the Hierarchical Consensus Network (HCN), Progressive Negotiation Protocol (PNP), and Context-Aware Reward Shaping.
The HCN architecture uses graph attention mechanisms and hierarchical reinforcement learning across micro-, meso-, and macro-levels to model complex inter-agent dependencies and strategic planning.
The system achieves superior performance, reaching 94.2% consensus rates and reducing conflict resolution times by 37.8% compared to baselines, while scaling effectively up to 50 concurrent agents.

MARL-CC: A Mathematical Framework for Multi-Agent Reinforcement Learning in Connected Autonomous Vehicles: Addressing Nonlinearity, Partial Observability, and Credit Assignment for Optimal Control

MARL-CC (Multi-Agent Reinforcement Learning with Control Coordination): introduces a unified mathematical framework for cooperative optimal control in Connected Autonomous Vehicles (CAVs), integrating Differential Geometric Control (Nonlinear optimal control), Probabilistic Belief Inference (Partial observability handling), and Shapley-Value Reward Allocation (Credit assignment mechanism).
The framework employs a Centralized Training, Decentralized Execution paradigm, leveraging belief states derived from Bayesian inference to enable robust, decentralized decision-making under uncertainty and communication delays.
Theoretical analysis establishes convergence and stability guarantees, demonstrating up to 40% improvement in convergence rate and enhanced cooperative efficiency over baselines in simulation and real-world testbeds.

SWITCH: Benchmarking Modeling and Handling of Tangible Interfaces in Long-horizon Embodied Scenarios

SWITCH (Semantic World Interface Tasks for Control & Handling): introduces an embodied, task-driven benchmark evaluating LLMs' ability to perceive, reason, and interact with Tangible Control Interfaces (TCIs) in long-horizon scenarios.
The benchmark is structured around five complementary tasks—Task-Aware VQA, Semantic UI Comprehension, Action Generation, State Transition Prediction, and Result Verification—covering perception, planning, and verification capabilities using egocentric RGB video input.
Evaluation results show that current LMMMs exhibit inconsistent performance, often over-relying on textual cues and struggling with fine-grained visual perception and generalization across diverse TCI implementations.

GOAL-DIRECTED SEARCH OUTPERFORMS GOAL-AGNOSTIC MEMORY COMPRESSION IN LONG-CONTEXT MEMORY TASKS

SUMER (Search in Uncompressed Memory via Experience Replay): introduces an end-to-end RL agent that learns goal-directed search strategies over raw, uncompressed conversational memory using multi-turn tool interactions, trained via GRPO and verifiable reward.
The LLM agent utilizes specialized tools, search_memory (keyword and semantic search) and submit_answer, to gather evidence across temporally distant sessions in the Langmem memory bank.
By optimizing the search policy for response accuracy, the framework achieves state-of-the-art performance on the LoCoMo long-context conversational QA benchmark, significantly outperforming compression-based memory systems.

19th Nov 2025

Computer-Use Agents as Judges for Generative User Interface

Coder-CUA (Coder-Computer-Use Agent) Collaboration framework: introduces a system where a Coder acts as Designer, generating and revising websites, while a CUA acts as Judge, evaluating functionality and refining designs using the AUI-Gym benchmark.
The framework leverages a Verifier for programmatic task validation and the CUA Dashboard to distill complex CUA navigation trajectories into concise, actionable feedback for the Coder.
This approach shifts interface design toward agent-native efficiency and reliability, optimizing UIs for agent execution success rather than purely human aesthetics.

Know Your Intent: An Autonomous Multi-Perspective LLM Agent Framework for DeFi User Transaction Intent Mining

TIM (Transaction Intent Mining): introduces a novel multi-agent system based on LLMs, employing a self-derived hierarchical agent architecture including a Meta-Level Planner, Perspective-Specific Domain Experts, Question Solvers, and a Cognitive Evaluator, to autonomously infer user intents from complex DeFi transactions.
The framework integrates multimodal on-chain and off-chain data and critically evaluates findings using a Cognitive Evaluator to ensure accuracy and mitigate LLM hallucinations.
Experimental results show that TIM significantly outperforms machine learning models, single LLMs, and single-agent baselines across evaluation metrics.

Platform-Agnostic Reinforcement Learning Framework for Safe Exploration of Cluttered Environments with Graph Attention

PALF (Platform-Agnostic Reinforcement Learning Framework for Safe Exploration): introduces a hierarchical framework combining a GNN-driven exploration policy ($\pi_{\theta}$) with a safety filter ($\sigma$) to achieve efficient and safe autonomous exploration in cluttered environments.
The framework utilizes a custom graph representation of the environment, where nodes encode waypoints and frontiers, and the policy is trained using the PPO algorithm with a Safety-Gated Adaptive (SGA) reward function.
The integration of the GNN policy with an explicit safety mechanism ensures robust decision-making adaptable to real-world robotic platforms.

Octopus: Agentic Multimodal Reasoning with Six-Capability Orchestration

Octopus (Agentic Multimodal Reasoning with Six-Capability Orchestration): introduces a new paradigm for multimodal agentic reasoning that autonomously explores reasoning pathways by dynamically selecting and orchestrating six core capabilities, using an MLLM backbone, a code agent, and an observation tool.
The framework decomposes multimodal reasoning into six fundamental capabilities: Percept, Augment, Spatial, Logic, Transform, and Generation, each supported by corresponding tool modules.
Octopus achieves state-of-the-art performance on the capability-centric Octopus-Bench, demonstrating the effectiveness of capability orchestration over existing paradigms.

Symmetry-Breaking in Multi-Agent Navigation: Winding Number-Aware MPC with a Learned Topological Strategy

WNumMPC (Winding Number-aware MPC): introduces a novel hierarchical navigation method that learns topological cooperative strategies using the winding number via a learning-based Planner and executes them with a model-based Controller to resolve symmetry-induced deadlocks.
The hierarchical architecture separates high-level strategy acquisition (Planner) from reliable, low-level execution (Controller), combining learning flexibility with model-based reliability.
The Planner learns target winding numbers and dynamic weights to prioritize interactions, effectively breaking symmetry in dense, multi-agent crossing scenarios.

Modelling and Model-Checking a ROS2 Multi-Robot System using Timed Rebeca

Timed Rebeca: introduces an actor-based modelling language with temporal constructs and its model-checking compiler to systematically design and verify multi-robot systems implemented in ROS2, efficiently transforming continuous dynamics into discrete models.
The approach addresses challenges in multi-robot verification by proposing discretization strategies for data types and introducing optimization techniques to accelerate model-checking time.
The work demonstrates a bidirectional flow between the abstract Timed Rebeca model and the ROS2 implementation, maintaining semantic consistency through manual validation.

Trustworthy GenAI over 6G: Integrated Applications and Security Frameworks

Trustworthy GenAI over 6G Framework: introduces a unified perspective on cross-domain vulnerabilities in GenAI-enabled 6G networks, proposing an Adaptive Evolutionary Defense (AED) concept that co-evolves with attacks through GenAI-driven simulation and feedback.
The framework integrates Integrated Sensing and Communication (ISAC), Federated Learning (FL), Digital Twins (DTs), Diffusion Models (DMs), and Large Telecommunication Models (LTMs) to address security risks arising from their convergence.
The AED concept utilizes a Policy Generator, Fitness Evaluator, and Coordinator within a Red-Blue Sandbox environment to ensure system robustness remains above a defined lower-bound threshold against evolving adversaries.

Two-Faced Social Agents: Context Collapse in Role-Conditioned Large Language Models

Two-Faced Social Agents: Context Collapse in Role-Conditioned Large Language Models introduces an empirical evaluation of persona fidelity in frontier LLMs (GPT-5, Claude Sonnet 4.5, Gemini 2.5 Flash) across cognitively demanding SAT mathematics items and less constrained Affective Preference Tasks, using socioeconomic personas.
The study finds that under cognitive load (SAT reasoning), GPT-5 exhibits complete contextual collapse, Gemini 2.5 Flash shows partial collapse, while Claude Sonnet 4.5 retains limited role-specific variation, contrasting with robust variation in preference tasks when cognitive constraints are relaxed.
This task-dependent collapse suggests optimization pressures drive identity convergence, implying that current alignment paradigms may fundamentally limit the ability of LLMs to sustain contextual selves during complex reasoning.

NAMeGEn: Creative Name Generation via A Novel Agent-based Multiple Personalized Goal Enhancement Framework

NAMeGEn (Novel Agent-based Multi-Personalized-Goal Enhancement Framework): introduces a training-free, multi-agent collaborative architecture to address multi-objective flexibility and interpretive complexity in Creative Natural Language Generation (CNLG) tasks like Chinese Baby Naming (NCB), utilizing MOM, MOG, and MOE agents.
The framework iteratively alternates between information preparation (task analysis, knowledge retrieval) and dynamic optimization (generation, evaluation) to balance Explicit User-specified Objectives (EUOs) and Implicit Interpretive Objectives (IIOs).
It demonstrates superior performance across various LLM backbones on the CBNames benchmark, achieving high quality and interpretability while mitigating hallucinations through retrieval-augmented generation using the CPoetry corpus.

DEPO: Dual-Efficiency Preference Optimization for LLM Agents

DEPO (Dual-Efficiency Preference Optimization): introduces a method that jointly optimizes step-level efficiency (minimizing tokens per step) and trajectory-level efficiency (minimizing steps per task) for LLM agents by extending KTO with an efficiency bonus.
The method uses offline desirable and undesirable trajectory labels derived from MCTS rollouts and a reward thresholding protocol to guide the LLM agent towards generating concise responses and fewer action steps.
Experiments on WebShop and BabyAI demonstrate that DEPO significantly reduces token usage and step count while maintaining or improving performance compared to baselines like BC and vanilla KTO.

Cost-Aware Prediction (CAP): An LLM-Enhanced Machine Learning Pipeline and Decision Support System for Heart Failure Mortality Prediction

CAP (Cost-Aware Prediction): introduces a three-stage framework integrating an ML classifier outcome, Clinical Impact Projection (CIP) curves, and four Large Language Model (LLM) agents to provide transparent and interpretable decision support for 1-year heart failure mortality prediction.
The framework utilizes an XGB model trained on EHR data to predict mortality, visualizes trade-offs using CIP curves based on Quality of Life (QoL) and Healthcare System (HC) costs, and employs LLM agents to generate patient-specific cost-benefit analyses.
The system was evaluated by clinicians, showing high reliability for descriptive agents (I and II) but lower accuracy for speculative guidance (III and IV), emphasizing the strength in risk communication.

OEMA: Ontology-Enhanced Multi-Agent Collaboration Framework for Zero-Shot Clinical Named Entity Recognition

OEMA (Ontology-Enhanced Multi-Agent Collaboration Framework): introduces a novel zero-shot clinical Named Entity Recognition (NER) framework based on multi-agent collaboration, consisting of a self-annotator, a discriminator, and a predictor.
The framework addresses challenges in zero-shot NER by using ontology-guided reasoning for fine-grained example selection and integrating entity-type descriptions with self-generated examples in the prompt.
The proposed multi-agent design achieves state-of-the-art performance on clinical NER benchmarks, approaching supervised model results in a zero-shot setting.

Taxonomy, Evaluation and Exploitation of IPI-Centric LLM Agent Defense Frameworks

SoK (Systematization of Knowledge): introduces a comprehensive taxonomy and evaluation of IPI-centric defense frameworks, classifying them across five dimensions and analyzing six root causes of defense failure, with components including Detection/Prompt Engineering/Fine-tuning/System Design/Runtime Checking/Policy Enforcing/Adaptive Attacks.
The analysis reveals that System Design and Policy Enforcement frameworks offer the best security against Indirect Prompt Injection (IPI) attacks, while Fine-tuning-based methods show weaker generalization.
The authors design three novel logic-driven adaptive attacks—Semantic-Masquerading IPI, Cascading IPI, and Isolation-Breach IPI—to exploit architectural flaws in state-of-the-art defenses.

SOLID: a Framework of Synergizing Optimization and LLMs for Intelligent Decision-Making

SOLID (Synergizing Optimization and Large Language Models for Intelligent Decision-Making): introduces a novel framework that integrates mathematical optimization with the contextual capabilities of LLMs via iterative collaboration mediated by a Coordinator using dual prices and deviation penalties.
The framework is inspired by the Alternating Direction Method of Multipliers (ADMM) to ensure convergence guarantees under convexity assumptions while handling structured and unstructured data inputs.
Empirical results in stock portfolio investment demonstrate that SOLID variants achieve improved annualized returns compared to optimizer-only baselines.

Knowledge-Informed Automatic Feature Extraction via Collaborative Large Language Model Agents

Rogue One: introduces a novel, LLM-based multi-agent framework for knowledge-informed automatic feature extraction, operationalizing a decentralized system of three specialized agents—Scientist, Extractor, and Tester—that collaborate iteratively.
The framework utilizes a rich, qualitative feedback mechanism and a "flooding-pruning" strategy, actively incorporating external knowledge via an integrated Retrieval-Augmented Generation (RAG) system.
This approach generates features that are statistically powerful, semantically meaningful, and interpretable, significantly outperforming state-of-the-art methods on classification and regression tasks.

Beyond GeneGPT: A Multi-Agent Architecture with Open-Source LLMs for Enhanced Genomic Question Answering

OpenBioLLM: introduces a modular multi-agent framework that extends GeneGPT by using open-source LLMs (like Qwen2.5) for genomic question answering, featuring specialized agents for tool routing, query generation, and response validation.
The architecture decomposes the workflow into specialized agents, which improves interpretability, traceability, and efficiency compared to the monolithic GeneGPT design.
OpenBioLLM achieves competitive or superior performance on GeneTuring and GeneHop benchmarks while significantly reducing latency compared to monolithic LLM setups.

ACCELOPT: A SELF-IMPROVING LLM AGENTIC SYSTEM FOR AI ACCELERATOR KERNEL OPTIMIZATION

AccelOpt: a self-improving LLM agentic system, introduces an iterative kernel optimization framework combining beam search with an optimization memory, guided by a three-component agentic workflow (planner, executor, summarizer).
The system autonomously optimizes kernels for AWS Trainium accelerators using open-source LLMs, achieving performance comparable to proprietary models while being significantly cheaper.
Evaluation on the custom NKIBench benchmark demonstrates that the memory accumulation enables progressive improvement and cost-effective discovery of both local and non-trivial global optimizations.

Normative active inference: A numerical proof of principle for a computational and economic legal analytic approach to AI governance

NAIF (Normative Active Inference Framework): introduces a computational model grounded in AIF and Economic Legal Analysis (ELA) to enable AI agents to achieve lawful and norm-sensitive behavior through "regulation by design," with all components, where the model simulates an autonomous driving agent resolving conflicting legal imperatives.
The framework utilizes Context Dependent Preference Tensors (C) to formalize how AIF implements context-dependent preferences, allowing the agent's preference ranking over outcomes to shift based on latent legal or emergency context states (F2, F3).
The Policy Precision ($\gamma$) component tracks the agent's confidence over its selected policy, serving as a "safety valve" mechanism that promotes vigilance under ambiguous normative contexts and confident action when higher-order norms apply.

Smart Manufacturing: MLOps-Enabled Event-Driven Architecture for Enhanced Control in Steel Production

DT-EDMA-DRL: introduces an MLOps-enabled event-driven architecture integrating a Digital Twin (Virtual furnace model), EDMA (Real-time data processing), and a DRL Agent (Optimizes control decisions) to enhance control in steel production.
The system uses a microservices edge-compute platform to ingest real-time sensor data from PLCs via an OPC-UA server and Kafka Message Broker, ensuring low-latency control loops for induction furnace optimization.
The DRL agent learns optimal power settings by interacting with the DT environment, aiming to reduce manufacturing waste and improve operational quality in complex industrial settings.

18th November 2025

Discovering autonomous quantum error correction via deep reinforcement learning

AQEC (Autonomous Quantum Error Correction): introduces a new Bosonic AQEC code discovered using a Curriculum Learning (CL)-enhanced Deep Reinforcement Learning (DRL) framework, which incorporates higher-order photon losses and adapts to large Fock spaces, utilizing an Analytical Master Equation Solver to accelerate training.
The discovered Generalized RL (GRL) code exhibits superior robustness against both single-photon and double-photon losses compared to existing codes by converting a catastrophic logical-flip error into a manageable dephasing error.
The analytical solver significantly reduces computational complexity compared to conventional numerical methods like QuTip, enabling faster exploration of optimal encoding strategies.

Large Language Models and 3D Vision for Intelligent Robotic Perception and Autonomy

LLMs (Large Language Models) and 3D Vision Integration: reviews the state-of-the-art methodologies, applications, and challenges at the intersection of LLMs and 3D vision for next-generation robotic sensing technologies, covering components like Transformer Architecture, Object Grounding, Scene Understanding, Text-to-3D Generation, and Embodied Agents.
The convergence of LLMs and 3D vision enables machines to perceive, reason, and interact with complex environments using natural language and spatial understanding, bridging the gap between linguistic intelligence and spatial perception.
The review catalogs benchmark datasets and evaluation metrics, and identifies future research directions focusing on adaptive architectures, cross-modal alignment, and real-time processing for context-aware robotic sensing systems.

Requirements for Aligned, Dynamic Resolution of Conflicts in Operational Constraints

OAMNCC (Online, Aligned Mitigation of Novel Constraint Conflicts): introduces a knowledge-level analysis characterizing requirements for agent decision making when facing novel constraint conflicts in operational environments, by enumerating conflict types and required agent knowledge.
The paper uses scenario analysis (Sailor Overboard, Piracy Interdiction, Merchants with Water Cannons, Piracy and vessel adrift) to ground the abstract knowledge requirements necessary for aligned, dynamic conflict resolution.
The analysis culminates in a taxonomy of required knowledge types, including World Knowledge, Metaknowledge, and Mitigation Utility, mapped onto a five-step conflict mitigation process (Algorithm 2).

CTRL-ALT-DECEIT: Sabotage Evaluations for Automated AI R&D

CTRL-ALT-DECEIT: introduces an evaluation framework, MLE-Sabotage, extending MLE-Bench with code-sabotage and sandbagging tasks, using Inspect framework, ReAct agent, AIDE agent, and LM monitors, to assess AI agents' capabilities to act against user interests during ML engineering.
The research evaluates frontier LLM agents' ability to implant backdoors, cause generalization failures (code-sabotage), and strategically underperform (sandbagging) while attempting to evade detection by automated LM monitors.
Results indicate agents make meaningful progress on sabotage tasks, but detecting sandbagging is more difficult than code-sabotage, and monitor performance degrades when agents are aware of monitoring.

AutoTool: Efficient Tool Selection for Large Language Model Agents

AutoTool: introduces a novel graph-based framework that bypasses repeated LLM inference for tool selection by exploiting tool usage inertia, using an Inertia Sensing module and a Parameter Filling module guided by a Tool Inertia Graph (TIG).
The TIG captures sequential dependencies via Tool Sequence Edges and data flow via Parameter Dependency Edges, enabling efficient, inertia-aware tool and parameter selection.
Experimental results show that this approach substantially reduces LLM call counts and token consumption (up to 30% reduction) while maintaining competitive task completion rates across diverse agent tasks.

ReflexGrad: Three-Way Synergistic Architecture for Zero-Shot Generalization in LLM Agents

ReflexGrad: introduces a novel architecture that tightly couples LLM-based hierarchical TODO decomposition, history-aware causal reflection, and gradient-based optimization (TextGrad) via a three-way closed feedback loop for zero-shot generalization in LLM Agents.
The system achieves true zero-shot generalization by relying on pure LLM semantic reasoning for task decomposition and memory retrieval, avoiding task-specific examples or hardcoded metrics.
Key architectural components include a three-tier hierarchical memory system and a synergistic coupling mechanism where reflexions inform gradients, and gradients guide TODO progression and reflexion priorities.

Agent-R1: Training Powerful LLM Agents with End-to-End Reinforcement Learning

Agent-R1: introduces a modular, flexible, and user-friendly training framework for RL-based LLM Agents by extending the Markov Decision Process (MDP) framework to comprehensively define key components for multi-turn interaction.
The framework supports multi-turn rollouts, precise credit assignment via action masks, and flexible integration of Tools and ToolEnv for active environmental intervention.
Agent-R1 utilizes process rewards and action masks during policy optimization to effectively train LLM agents for complex, interactive tasks like Multi-hop QA.

Agentic Video Intelligence: A Flexible Framework for Advanced Video Exploration and Understanding

Agentic Video Intelligence (AVI): introduces a flexible and training-free framework that mirrors human video comprehension through system-level design and optimization, utilizing a structured knowledge base and three-phase reasoning.
The framework employs an Agentic Core with Retrieve, Perceive, and Review phases, leveraging specialized tool suites for global exploration and fine-grained visual analysis.
AVI builds a structured video knowledge base including entity graphs and uses an open-source model ensemble, eliminating reliance on proprietary APIs or resource-intensive RL training.

Tell Me: An LLM-powered Mental Well-being Assistant with RAG, Synthetic Dialogue Generation, and Agentic Planning

Tell Me: introduces a mental well-being system that leverages LLMs, integrating a Retrieval-Augmented Generation (RAG) assistant, a synthetic client-therapist dialogue generator, and a Well-being AI Crew for personalized, knowledge-grounded dialogue, data augmentation, and adaptive self-care planning.
The system components include a RAG assistant for context-aware reflective dialogue, a module for generating synthetic transcripts based on client profiles to address data scarcity, and a CrewAI-based planner for dynamic self-care routines like weekly plans and guided meditations.
The work demonstrates how retrieval grounding enhances responsible interaction in emotionally sensitive domains and provides an open-source testbed for responsible LLM applications in mental well-being.

MEDBENCH V4: A ROBUST AND SCALABLE BENCHMARK FOR EVALUATING CHINESE MEDICAL LANGUAGE MODELS, MULTIMODAL MODELS, AND INTELLIGENT AGENTS

MedBench v4: introduces a nationwide, cloud-based benchmarking infrastructure for medical AI, comprising expert-curated tasks across LLM, multimodal, and agent tracks, with scoring calibrated by an LLM-as-a-judge system and human ratings.
The benchmark covers 24 primary and 91 secondary Chinese medical specialties, focusing on scenario-aligned evaluations that mirror real clinical workflows, including safety and ethics constraints.
Agentic orchestration significantly improves end-to-end performance, especially in safety tasks, suggesting governance-aware systems enhance clinical readiness beyond base model capabilities.

Enhancing LLM-based Autonomous Driving with Modular Traffic Light and Sign Recognition

TLS-Assist: introduces a modular redundancy layer that augments LLM-based autonomous driving agents with explicit traffic light and sign recognition, using components like Traffic Light Recognition (TLR), Traffic Sign Recognition (TSR), Relevance Prediction, State Validation, and a Message Generator.
The framework converts visual detections into concise natural language messages injected into the LLM input to enforce attention to safety-critical traffic rules.
Evaluation on the LangAuto benchmark shows consistent performance improvements for LMDrive and BEVDriver baselines, particularly in reducing traffic rule infractions.

DataSage: Multi-agent Collaboration for Insight Discovery with External Knowledge Retrieval, Multi-role Debating, and Multi-path Reasoning

DataSage: introduces a novel multi-agent framework that incorporates external knowledge retrieval, multi-role debating, and multi-path reasoning to automate data insight discovery, addressing limitations like insufficient domain knowledge, shallow depth, and error-prone code generation, using components like the Dataset Description Module/RAKG Module/Question Raising Module/Insights Generation Module.
The framework operates in an iterative Question-Answering (QA) loop, where specialized agents collaborate within four core modules to progressively refine analytical questions and generate robust, executable code for insight extraction.
Experimental results on InsightBench show that DataSage consistently outperforms existing data insight agents across all difficulty levels, particularly excelling in complex and high-difficulty tasks.

Run, Ruminate, and Regulate: A Dual-process Thinking System for Vision-and-Language Navigation

R³ (Run, Ruminate, and Regulate): introduces a novel dual-process thinking framework for Vision-and-Language Navigation (VLN) integrating LLMs' generalization with VLN-specific expertise, comprising Runner, Ruminator, and Regulator modules.
The framework emulates human cognition, using the Runner for fast, routine navigation and the Ruminator (backed by GPT-4o and Chain-of-Thought prompting) for slow, methodical reasoning in anomalous scenarios.
The Regulator adaptively switches between the two thinking modes based on looping, scoring, and ending criteria, achieving superior performance and efficiency over state-of-the-art LLM-assisted methods.

PRISM: Prompt-Refined In-Context System Modelling for Financial Retrieval

PRISM (Prompt-Refined In-Context System Modelling): introduces a training-free framework that integrates refined system prompting, in-context learning (ICL), and a lightweight multi-agent system for document and chunk ranking in financial information retrieval.
The framework utilizes four prompt variants ($P_1$ to $P_4$) to structure reasoning, an embedding-based retrieval mechanism for few-shot examples, and specialized agents coordinated via a state-graph for chunk ranking.
The best non-agentic configuration ($P_4$ prompt with document-level ICL) achieved high performance on the FinAgentBench benchmark, demonstrating practical feasibility.

Knowledge-Grounded Agentic Large Language Models for Multi-Hazard Understanding from Reconnaissance Reports

MoRA-RAG (Mixture-of-Retrieval Agentic RAG): introduces a knowledge-grounded LLM framework that transforms unstructured reconnaissance reports into a structured foundation for multi-hazard reasoning by integrating a Mixture-of-Retrieval mechanism and an agentic verification loop.
The framework utilizes agentic chunking to preserve contextual coherence and employs specialized agents for evidence validation, external search, and query refinement to enhance retrieval precision and reasoning robustness.
MoRA-RAG achieves up to 94.5% accuracy on the HazardRecQA dataset, significantly outperforming standard RAG systems and enabling open-weight LLMs to achieve performance comparable to proprietary models.

O-Mem: Omni Memory System for Personalized, Long Horizon, Self-Evolving Agents

O-Mem (Omni Memory System): introduces a human-centric memory framework based on active user profiling that dynamically extracts and updates user characteristics and event records, utilizing Persona Memory (PM), Episodic Memory (EM), and Working Memory (WM) for hierarchical retrieval.
The framework supports Long-Term Personality Modeling, Dual-Context Awareness, and Structured, Multi-Stage Retrieval to enhance personalized and context-aware interactions for LLM agents.
Experimental results show O-Mem achieves state-of-the-art performance on benchmarks like LoCoMo and PERSONAMEM while significantly reducing token consumption and latency compared to existing memory frameworks.

Multi-Agent Deep Research: Training Multi-Agent Systems with M-GRPO

M-GRPO (Multi-agent Group Relative Policy Optimization): introduces a hierarchical reinforcement learning framework for training separate LLMs in vertical multi-agent systems, featuring a main agent (M) and multiple sub-agents (S) with group-relative advantages and trajectory alignment.
The framework addresses optimization challenges in vertical multi-agent systems by computing hierarchical credit assignment and using a trajectory-alignment scheme to handle variable sub-agent invocations efficiently.
Empirical results show that co-training both agents using this method consistently outperforms single-agent and main-only training baselines on complex, tool-augmented reasoning benchmarks.

MiroThinker: Pushing the Performance Boundaries of Open-Source Research Agents via Model, Context, and Interactive Scaling

MiroThinker: introduces MiroThinker v1.0, an open-source research agent that advances tool-augmented reasoning and information-seeking capabilities by exploring interaction scaling as a third dimension alongside model size and context length, utilizing components like a structured Tool Interface, Recency-Based Context Retention, and a three-stage Training Pipeline (SFT, DPO, GRPO).
The agentic workflow follows the ReAct paradigm, iteratively generating thoughts, invoking tools via a modular Tool Interface (Execution Environment, File Management, Information Retrieval), and processing observations, managed by Recency-Based Context Retention to optimize context window usage.
Training involves Supervised Fine-tuning (SFT), Agentic Preference Optimization (DPO), and Agentic Reinforcement Learning (GRPO) using data synthesized via a comprehensive Data Construction Pipeline, leading to state-of-the-art performance among open-source research agents.

Z-Merge: Multi-Agent Reinforcement Learning for On-Ramp Merging with Zone-Specific V2X Traffic Information

Z-Merge: introduces a zone-based on-ramp merging control method using MARL (Multi-Agent Reinforcement Learning) incorporating RSU-collected, zone-specific traffic information from pre-merging, merging, and ramp zones, with components like MA-POMDP/PDQN/Double PDQN/SimServ/SUMO/MOSAIC.
The framework utilizes a hybrid action space combining discrete lane changes and continuous acceleration/gap control, evaluated using metrics like efficiency, safety, comfort, success rate, and queue length.
The approach leverages centralized training with decentralized execution (CTDE) and parameter-sharing to enable agents to make holistic decisions using both local and global traffic observations.

Attacking Autonomous Driving Agents with Adversarial Machine Learning: A Holistic Evaluation with the CARLA Leaderboard

The paper introduces a holistic evaluation methodology for adversarial machine learning attacks against Autonomous Driving Agents using the CARLA Simulator and CARLA Leaderboard, focusing on the interaction between the ML Model and other control modules.
The evaluation assesses stopping and steering attacks against open-source agents, demonstrating that agent-specific modules like PID controllers and GPS-based rules can mitigate attacks that successfully mislead the underlying ML Model.
The authors propose a new leaderboard structure to facilitate systematic red-and-blue-team evaluation of adversarial robustness in standardized driving environments.

Enhancing Agentic Autonomous Scientific Discovery with Vision-Language Model Capabilities

cmbagent: introduces a multi-agent system guided by Vision-Language Models (VLMs) to improve end-to-end autonomous scientific discovery by treating plots as verifiable checkpoints, utilizing a VLM-as-a-judge for self-correction and steering exploration.
The system employs specialized agents like the Plot Judge and Plot Debugger, routing execution based on VLM feedback against domain-specific rubrics to correct errors or initiate exploratory analysis.
This approach achieves pass@1 scores of 0.7-0.8 on a 10-task benchmark, significantly outperforming code-only and code-and-text baselines, while generating auditable reasoning traces.

Emergent Cooperative Driving Strategies for Stop-and-Go Wave Mitigation via Multi-Agent Reinforcement Learning

Emergent Cooperative Driving Strategies for Stop-and-Go Wave Mitigation via Multi-Agent Reinforcement Learning: introduces a novel mitigation strategy for stop-and-go waves discovered through training DRL agents in a simulated ring-road environment, where one vehicle acts as a buffer.
The discovered cooperative strategy involves heterogeneous behavior where a single "buffer" vehicle maintains a large headway while others platoon closely, enhancing stability and throughput compared to non-cooperative uniform driving.
This buffering approach is validated by implementing it in the classical Intelligent Driver Model (IDM), showing suppression of stop-and-go waves and improved average speed under stability constraints.

Orion: A Unified Visual Agent for Multimodal Perception, Advanced Visual Reasoning and Execution

Orion: introduces a visual agent framework that orchestrates specialized computer vision tools using an agentic controller, enabling advanced multimodal perception, reasoning, and execution.
The framework integrates large Vision-Language Models (VLMs) with hyper-specialized tools for tasks like object detection, OCR, and image generation, moving beyond descriptive outputs to active, tool-driven visual intelligence.
It employs a ReAct-style orchestration with Plan, Execute, and Reflect phases, ensuring structured, verifiable, and high-quality multi-step visual workflows.

APD-Agents: A Large Language Model-Driven Multi-Agents Collaborative Framework for Automated Page Design

APD-Agents: introduces a large language model (LLM) driven multi-agent framework for automated page design in mobile applications, containing OrchestratorAgent, SemanticParserAgent, PrimaryLayoutAgent, TemplateRetrievalAgent, and RecursiveComponentAgent.
The framework operates in a coarse-to-fine, top-down, iterative generation process, outputting structured JSON data compatible with professional design software like Sketch and Figma.
It leverages In-Context Learning via the TemplateRetrievalAgent to enhance layout quality without explicit model training.

Hybrid Agentic AI and Multi-Agent Systems in Smart Manufacturing

HAIMAS (Hybrid Agentic AI Multi-Agent System): introduces a layered architecture for Prescriptive Maintenance (RxM) in smart manufacturing, utilizing an LLM Orchestrator Agent for strategic planning and specialized agents (Perception, Preprocessing, Analysis, Optimization) for distributed execution.
The framework integrates high-level LLM reasoning with efficient, domain-specific execution by rule-based and local SLMs, ensuring robustness and scalability at the edge.
A Human-In-The-Loop (HITL) interface ensures transparency and auditability by allowing human experts to review, approve, or reject the actionable, prioritized maintenance recommendations.

17th Nov 2025

KForge: Program Synthesis for Diverse AI Hardware Accelerators

KForge: introduces an agentic program synthesis framework that iteratively refines programs using a Generation Agent and a Performance Analysis Agent, interpreting diverse profiling data to guide optimization for arbitrary accelerators.
The framework supports single-shot generation and iterative refinement, leveraging cross-platform knowledge transfer from reference implementations to improve generation quality across different hardware targets like NVIDIA CUDA and Apple Metal.
Key components include two collaborative LLM-based agents that simulate a practical kernel engineering workflow, focusing on functional correctness before performance optimization.

DualTAP: A Dual-Task Adversarial Protector for Mobile MLLM Agents

DualTAP (Dual-Task Adversarial Protector): introduces a novel framework that explicitly decouples privacy protection and task utility objectives for mobile Multimodal Large Language Model (MLLM) agents by training a perturbation generator guided by a contrastive attention module and a dual-task adversarial objective.
The framework utilizes a contrastive attention module to precisely locate PII-sensitive regions and optimizes the generator to minimize task-preservation loss ($L_n$) while maximizing privacy-interference loss ($L_p$).
DualTAP achieves state-of-the-art privacy protection by significantly reducing leakage rates while maintaining high task success rates across diverse MLLMs, resolving the privacy-utility trade-off.

MEGA-GUI: Multi-stage Enhanced Grounding Agents for GUI Elements

MEGA-GUI: introduces a modular, multi-stage framework that decomposes GUI grounding into coarse Region-of-Interest (ROI) selection and fine-grained element grounding by orchestrating specialized agents based on diverse Vision-Language Models (VLMs).
The framework centers on a bidirectional ROI zoom algorithm for robust search and error recovery, complemented by a context-aware rewriting agent to resolve semantic ambiguity in user instructions.
This modular, agentic architecture achieves state-of-the-art performance by leveraging the complementary strengths of different VLMs for distinct sub-tasks.

LIVE-SWE-AGENT: Can Software Engineering Agents Self-Evolve on the Fly?

LIVE-SWE-AGENT: introduces the first live software agent that autonomously and continuously evolves its own scaffold implementation on-the-fly during runtime when solving real-world software problems, starting from a minimal bash-only scaffold (mini-SWE-agent).
The agent achieves state-of-the-art open-source performance on SWE-bench Verified (75.4%) and SWE-Bench Pro (45.8%) by iteratively synthesizing and using custom tools based on a reflection mechanism.
This on-the-fly self-evolution approach requires no costly offline training and demonstrates generalizability across different LLMs and benchmarks.

An Operational Kardashev-Style Scale for Autonomous AI - Towards AGI and Superintelligence

AAI Scale: introduces a Kardashev-inspired, multi-axis, and testable Autonomous AI (AAI) Scale to measure progression from fixed automation (AAI-0) to Superintelligence (AAI-5), utilizing an AAI-Index, a Self-Improvement Coefficient $\kappa$, and closure properties.
The framework defines ten capability axes (e.g., Autonomy, Generality, Planning, Tool Economy) normalized to [0,1] and aggregated via a weighted geometric mean (AAI-Index).
It formalizes AGI and Superintelligence through measurable level gates (AAI-0 to AAI-4/5) based on axis thresholds, sustained self-improvement ($\kappa$), and closure proofs (maintenance and expansion).

CorrectAD: A Self-Correcting Agentic System to Improve End-to-end Planning in Autonomous Driving

CorrectAD: introduces a self-correcting agentic system, composed of PM-Agent and DriveSora, to automatically generate targeted training data to improve the robustness of End-to-end (E2E) planning models in autonomous driving by addressing failure cases.
The PM-Agent analyzes failure causes using a VLM to formulate multimodal requirements, which DriveSora then uses to generate high-fidelity, diverse training videos aligned with 3D scene annotations.
This agentic pipeline is model-agnostic and demonstrated significant reduction in collision rates on both public and in-house datasets.

LLM-based Multi-Agent System for Simulating Strategic and Goal-Oriented Data Marketplaces

LLM-MAS (Large Language Model-based Multi-Agent System): introduces a simulation framework for data marketplaces where LLM-powered buyer and seller agents perform strategic, goal-oriented actions using natural language reasoning.
The system utilizes a GoalGenerator for objectives and a DataGenerator for metadata, storing embeddings in a Vector Database to enable similarity-based search for agent actions.
Evaluation against real transaction data shows the LLM-MAS faithfully reproduces structural features like scale-free distributions, though temporal dynamics are overestimated.

Agent-Oriented Visual Programming for the Web of Things

AOV-DEP (Agent-Oriented Visual Programming for Domain-Expert Programming): introduces an approach for multi-agent-oriented visual programming using a blocks-based visual development environment built on the JaCaMo platform and integrated with the Web of Things (WoT) to enable domain experts to design and configure autonomous software.
The system leverages agent abstractions, specifically the Belief-Desire-Intention (BDI) model, to align with human practical reasoning for simpler programming by non-technical users.
The implementation uses the Blockly framework for the visual language and Yggdrasil for WoT integration, validated by a pilot user study showing promising usability.

Resilient and Efficient Allocation for Large-Scale Autonomous Fleets via Decentralized Coordination

DESIRA (Decentralized Side-Information Resource Allocation): introduces a framework combining side-information-conditioned risk shaping with scalable consensus-based coordination, using Distributional Predictions and a CVaR Penalty, coordinated via Consensus-ADMM.
The approach models uncertain resource consumption using feature-conditioned distributional predictions to derive risk-adjusted allocation requirements, ensuring safety guarantees via chance constraints.
The decentralized coordination is achieved through local message passing over a sparse communication graph, leading to near-centralized performance with high resilience and near-linear scaling.

LoCoBench-Agent: An Interactive Benchmark for LLM Agents in Long-Context Software Engineering

LoCoBench-Agent: introduces a comprehensive evaluation framework for LLM agents in long-context software engineering, extending LoCoBench scenarios into interactive environments with specialized tools and bias-free metrics.
The framework focuses on multi-turn interaction, tool usage patterns, and long-context handling (10K-1M tokens) across 8,000 scenarios spanning 10 programming languages and 36 domains.
Key findings reveal a fundamental comprehension-efficiency trade-off and highlight the importance of architectural mechanisms like hierarchical memory and semantic search integration for long-context performance.

EchoAgent: Guideline-Centric Reasoning Agent for Echocardiography Measurement and Interpretation

EchoAgent: introduces a guideline-centric agentic framework that integrates specialized vision tools under Large Language Model (LLM) orchestration to perform structured, interpretable echocardiography measurement and interpretation.
The framework utilizes an iterative reasoning loop involving observation, thought, and action phases, leveraging tools for phase detection, measurement feasibility prediction, segmentation, and guideline retrieval.
A key feature is the measurement-feasibility prediction model, which ensures that only visually supported and clinically relevant measurements are attempted, enhancing trustworthiness.

Market-Dependent Communication in Multi-Agent Alpha Generation

Market-Dependent Communication in Multi-Agent Alpha Generation: investigates the impact of five organizational structures on 5-agent LLM-based trading systems across different market characteristics, comparing isolated baseline, leaderboard, collaborative conversation, conversation-leaderboard, and competitive conversation.
Communication generally improves performance, but the optimal structure depends on market volatility, with competitive conversation excelling in volatile tech stocks and collaborative conversation in stable general stocks.
All organizational structures converge to similar strategy correlations over time, indicating that behavioral mechanisms, not information sharing transparency, drive performance differences.

P1: Mastering Physics Olympiads with Reinforcement Learning

P1: introduces a family of open-source physics reasoning models trained via reinforcement learning (RL) and augmented with the PhysicsMinions agentic framework, achieving Gold-medal performance on the International Physics Olympiad 2025 (IPhO 2025).
The training incorporates a multi-stage RL framework with adaptive learnability adjustment and stabilization mechanisms, utilizing both rule-based and model-based verifiers for reward generation.
The framework demonstrates strong generalizability to mathematics and coding tasks, suggesting transferable reasoning skills beyond the specialized physics domain.

FreeAskWorld: An Interactive and Closed-Loop Simulator for Human-Centric Embodied AI

FreeAskWorld: introduces an interactive and closed-loop simulation framework that integrates LLMs for high-level behavior planning and semantically grounded interaction, grounded in theories of intention and social cognition, to support human-centric embodied AI.
The framework supports scalable, realistic human-agent simulations and includes a modular data generation pipeline, releasing a large-scale benchmark dataset for the novel Direction Inquiry Task.
The system leverages LLMs for intention modeling and naturalistic human behavior simulation within photorealistic 3D environments, emphasizing interaction as an information modality.

Mem-PAL: Towards Memory-based Personalized Dialogue Assistants for Long-term User-Agent Interaction

Mem-PAL: introduces PAL-Bench (Personalization benchmark) and PAL-Set (Chinese dataset) for long-term user-agent interaction evaluation, utilizing H²Memory (Hierarchical memory framework) with MG (Concrete memory from logs), MB (Abstract memory of user background), MT (Concrete memory from dialogues), and Mp (Abstract memory of user principles) via RAG (Generation strategy) to enhance personalized response generation.
The H²Memory framework organizes interaction history into a hierarchical and heterogeneous structure, separating concrete details (logs and dialogue outlines) from abstract concepts (background and principles) for effective retrieval and personalized response generation.
The proposed method demonstrates superior performance across three evaluation tasks in PAL-Bench: Requirement Restatement, Solution Proposal, and Multi-turn Dialogue Interaction, validating the effectiveness of the memory components.

MedDCR: Learning to Design Agentic Workflows for Medical Coding

MedDCR (Medical Coding Workflow Design as a Learning Problem): introduces a closed-loop framework that treats medical coding workflow design as a learning problem, utilizing a Designer, Coder, and Reflector meta-agent architecture supported by a memory archive.
The framework iteratively proposes, compiles, executes, and reflects on workflow plans, leveraging past successful designs and diverse recent explorations to discover effective coding strategies.
MedDCR achieves state-of-the-art performance on ICD-10 coding benchmarks while producing interpretable and adaptable workflows.

SAINT: Service-level Integration Test Generation with Program Analysis and LLM-based Agents

SAINT (Service-level Integration Test Generation with Program Analysis and LLM-based Agents): introduces a novel white-box testing approach for service-level testing of enterprise Java applications by combining Static Analysis, LLM-based Agents, an Endpoint Model, and an Operation Dependency Graph (ODG) to automatically generate endpoint-focused and scenario-based tests.
The approach involves a Model-Construction Phase to build the Endpoint Model and ODG, followed by a Test-Generation Phase utilizing agentic workflows for test creation and refinement.
Endpoint-focused tests maximize code coverage, while scenario-based tests cover meaningful use cases, with developer feedback strongly endorsing the latter.

Grounded by Experience: Generative Healthcare Prediction Augmented with Hierarchical Agentic Retrieval

GHAR (Generative Hierarchical Agentic Retrieval): introduces a generative hierarchical agentic RAG framework for healthcare prediction that resolves the "when to retrieve" dilemma and enables collaborative optimization between retrieval and generation submodules.
The framework utilizes a dual-agent architecture (Agent-Top and Agent-Low) within a unified Markov Decision Process optimized via multi-agent Reinforcement Learning to ensure synergistic retrieval and generation.
GHAR employs meta-path partitioning for fine-grained retrieval and a diverse reward structure to align the distinct roles of the agents towards accurate, contextually appropriate predictions.

Dropouts in Confidence: Moral Uncertainty in Human-LLM Alignment

Dropouts in Confidence (DIC): introduces a method to quantify and modulate uncertainty in LLMs facing moral dilemmas using information-theoretic measures like binary entropy, Total Entropy (TE), Conditional Entropy (CE), and Mutual Information (MI), and demonstrates that injecting uncertainty via attention dropout improves alignment with human preferences.
The study analyzes 32 open-source LLMs across 9 moral dimensions derived from the Moral Machine experiment, finding significant model-architecture-dependent confidence variability.
The core finding is that reducing LLM overconfidence by increasing Mutual Information (MI) through inference-time dropout leads to better alignment with human ethical judgments in complex scenarios.

Cost-Effective Communication: An Auction-based Method for Language Agent Interaction

DALA (Dynamic Auction-based Language Agent): introduces a novel framework that treats communication bandwidth as a scarce, tradable resource in Multi-Agent Systems (MAS) using a centralized auction mechanism, where agents bid based on predicted message value density, trained via MAPPO.
The framework utilizes an Actor Network to generate candidate messages and a Critic Network to compute their value density ($\pi$), which serves as a bid in a budget-constrained VCG auction to maximize task success while minimizing token cost.
This economic approach cultivates the emergent skill of strategic silence, leading to state-of-the-art performance on reasoning benchmarks with significantly reduced token consumption compared to existing methods.

Extracting Events Like Code: A Multi-Agent Programming Framework for Zero-Shot Event Extraction

AEC (Agent-Event-Coder): introduces a novel multi-agent framework that treats zero-shot event extraction (ZSEE) as a structured, iterative code-generation process, utilizing a Retrieval Agent/Planning Agent/Coding Agent/Verification Agent workflow.
The framework represents event schemas as executable Python classes to enable deterministic validation and enforce structural fidelity in zero-shot extractions.
AEC consistently outperforms prior zero-shot baselines by combining step-wise reasoning with deterministic schema validation to resolve trigger ambiguity and enforce output structure.

WebCoach: Self-Evolving Web Agents with Cross-Session Memory Guidance

WebCoach: introduces a model-agnostic self-evolving framework that equips web browsing agents with persistent, cross-session memory, enabling improved long-term planning, reflection, and continual learning without retraining.
The framework consists of a WebCondenser, an External Memory Store (EMS) for storing semantic embeddings of past trajectories, and a Coach LLM that provides task-specific guidance via runtime hooks.
Evaluations show that WebCoach consistently improves task success rates across different LLM backbones, achieving performance comparable to GPT-4o with smaller open-source models.

ENGRAM: EFFECTIVE, LIGHTWEIGHT MEMORY ORCHESTRATION FOR CONVERSATIONAL AGENTS

ENGRAM (Effective, Lightweight Memory Orchestration): introduces a lightweight memory system that organizes conversation into episodic, semantic, and procedural memory types using a single router and retriever, achieving state-of-the-art results on long-horizon QA benchmarks.
The architecture converts user turns into typed memory records persisted in a database, retrieves top-k neighbors per type at query time, merges results, and provides evidence as context to the answering LLM.
This typed separation and straightforward dense retrieval approach challenges the trend toward complex memory architectures by prioritizing simplicity, efficiency, and interpretability.

Can We Predict the Next Question? A Collaborative Filtering Approach to Modeling User Behavior

CFQP (Collaborative Filtering-enhanced Question Prediction): introduces a novel hybrid framework that integrates personalized memory modules with graph-based preference propagation to dynamically model evolving user-question interactions for superior user-specific question prediction.
The framework utilizes an Embedding-based User Representation via BGE to create user vectors, calculates user similarity to form a User Association graph, and employs an LLM-based Prediction Model refined by a Diagnostic Collaborative Correction loop.
This approach aims to overcome the limitations of static LLM personalization by capturing dynamic user interests and leveraging collective intelligence from similar users.

Fault2Flow: An AlphaEvolve-Optimized Human-in-the-Loop Multi-Agent System for Fault-to-Workflow Automation

Fault2Flow: introduces an LLM-based multi-agent system that automates fault diagnosis to workflow execution by systematically extracting regulatory logic, integrating expert knowledge, optimizing reasoning, and synthesizing an executable workflow, utilizing an AlphaEvolve optimization module.
The system operates via a decoupled front-end/back-end design, where the back-end employs coordinated agents to transform unstructured regulatory documents into verified, n8n-executable workflows.
Experimental validation on transformer fault diagnosis confirms 100% topological consistency and high semantic fidelity, substantially reducing expert workload.

Think, Speak, Decide: Language-Augmented Multi-Agent Policy Learning in Economic Environments

LAMP (Language-Augmented Multi-Agent Policy Learning): introduces a framework that integrates LLM-driven reasoning and reflection over numerical observations and textual signals to support optimal decision-making in multi-agent economic environments, following a Think-Speak-Decide pipeline.
The framework utilizes a dual-path Think module to generate short-term shock analysis and long-term trend reasoning, which informs the Speak module for strategic message exchange and belief updating via a Reflection Module.
The Decide module fuses numerical data, reasoning, and reflections into a centralized training/decentralized execution Multi-Agent Reinforcement Learning (MARL) policy, achieving superior performance over MARL and LLM-only baselines.

HPCAgentTester: A Multi-Agent LLM Approach for Enhanced HPC Unit Test Generation

HPCAgentTester: introduces a novel multi-agent Large Language Model (LLM) framework for automating and enhancing unit test generation for HPC software using OpenMP and MPI, employing specialized LLM agents in a collaborative workflow.
The framework utilizes a structured Test Recipe as an intermediate representation, grounding the Test Agent's code generation and enabling iterative refinement via a critique loop involving feedback, confidence scoring, and justification.
This approach significantly improves test compilation rates and functional correctness compared to standalone LLMs by systematically targeting parallel constructs and semantic correctness.

16th Nov 2025

MMWOZ: Building Multimodal Agent for Task-oriented Dialogue

MATE (Multimodal Agent for Task-oriented dialogue): introduces MMWOZ, a multimodal task-oriented dialogue dataset interacting with a web-style GUI, and proposes MATE, a baseline multimodal model leveraging dialogue history, action log, and web page snapshot (text and image features) to generate GUI operation instructions or natural language responses.
The MMWOZ dataset extends MultiWOZ 2.3 by designing a web-style GUI and automatically converting dialogue states and system actions into operation instructions paired with web page snapshots.
The MATE model architecture includes an OCR Parser and Image Encoder to process the snapshot, which feed into a Projector and Action Generator conditioned on dialogue history and action log to determine the next step.

Multi-Agent Reinforcement Learning for Heterogeneous Satellite Cluster Resources Optimization

MARL (Multi-Agent Reinforcement Learning): introduces a framework for resource optimization in heterogeneous satellite clusters performing Earth Observation (EO) missions, utilizing algorithms like MAPPO, HAPPO, and HATRPO within a CTDE paradigm.
The study models the EO mission as a Dec-POMDP to handle decentralized decision-making under resource constraints and agent heterogeneity (optical and SAR satellites).
The research evaluates the performance and stability of state-of-the-art MARL algorithms specifically tailored to account for agent heterogeneity in satellite resource allocation.

Are LLMs The Way Forward? A Case Study on LLM-Guided Reinforcement Learning for Decentralized Autonomous Driving

Framework name here: introduces a case study comparing RL-only, LLM-only, and hybrid approaches, where LLMs augment RL rewards by scoring state-action transitions during training, while standard RL policies execute at test time.
The study uses small, locally deployable LLMs (Qwen3-14B and Gemma3-12B) to investigate their ability to support autonomous highway driving through reward shaping rather than direct control.
Findings indicate that hybrid approaches improve safety over RL-only agents but introduce a systematic conservative bias, highlighting limitations of current small LLMs for safety-critical control.

Evolve the Method, Not the Prompts: Evolutionary Synthesis of Jailbreak Attacks on LLMs

EvoSynth: introduces an autonomous framework that shifts the red-teaming paradigm from attack planning to the evolutionary synthesis of novel, code-based jailbreak methods, employing a multi-agent system with a code-level self-correction loop.
The framework utilizes a Reconnaissance Agent for strategy formulation, an Algorithm Creation Agent for code synthesis and evolution, an Exploitation Agent for deployment, and a Coordinator Agent for orchestration and iterative refinement.
This approach achieves a new state-of-the-art Attack Success Rate (ASR) against robust models and generates attacks with significantly higher programmatic complexity and diversity than existing methods.

On two-degrees-of-freedom agreement protocols

2DOF agreement protocol: introduces a distributed two-degrees-of-freedom (2DOF) architecture for driving heterogeneous agents to agreement, separating local feedback from network filtering.
This architecture is inspired by classical servo regulation and aims to counter shortcomings of consensus protocols like poor noise attenuation and inability to reject disturbances exciting unstable poles.
The resulting closed-loop dynamics explicitly separate network and local dynamics, accommodating agent heterogeneity when the network component is homogeneous.

Scaling Patterns in Adversarial Alignment: Evidence from Multi-LLM Jailbreak Experiments

The research introduces an empirical framework for exploring multi-model interactions using JailbreakBench, involving an Attacker Model (Ma), a Target Model (My), and a Judge Model (MJ), to quantify how relative model scale influences adversarial potency.
The study simulates over 6000 multi-turn exchanges across various LLM sizes (0.6B-120B) to measure harm score and refusal behavior as indicators of adversarial success and alignment integrity.
Key findings show a positive correlation between the attacker-to-target size ratio and mean harm, and a strong negative correlation between attacker refusal frequency and harm.

Knots: A Large-Scale Multi-Agent Enhanced Expert-Annotated Dataset and LLM Prompt Optimization for NOTAM Semantic Parsing

NOTAM semantic parsing: introduces a novel task extending beyond traditional information extraction by generating structured, inference-rich outputs from Notices to Air Missions (NOTAMs), supported by the Knots dataset and utilizing LLM Prompt Optimization, MDA, and HDF components.
The framework employs a two-stage multi-agent system (MDA for recall, HDF for precision) to systematically discover and refine operational fields, addressing semantic ambiguity and complexity inherent in aviation texts.
The research validates various LLM prompting strategies, finding 5-shot In-Context Learning (ICL) optimal for safety-critical reliability, and provides a large, expert-annotated dataset (Knots) for future research.

FINRS: A RISK-SENSITIVE TRADING FRAMEWORK FOR REAL FINANCIAL MARKETS

FinRS (Risk-Sensitive Trading Framework): introduces a risk-sensitive LLM trading framework that combines hierarchical market analysis, dual-decision agents, and multi-timescale reward reflection to align trading actions with return objectives and downside risk constraints, utilizing components like the Market Perception and Analysis Module, Risk-Sensitive Decision Making Module, and Multi-scale Reward Reflection Module.
The framework addresses limitations in existing LLM trading agents by embedding risk-awareness directly into the decision process, featuring dynamic position sizing and layered information filtering.
Experimental results confirm that the full configuration of FinRS achieves superior profitability and stability compared to various baselines across multiple stocks and market conditions.

Co-Layout: LLM-driven Co-optimization for Interior Layout

Co-Layout: introduces a novel framework that combines Large Language Models (LLMs) with grid-based Integer Programming (IP) to jointly optimize room layout and furniture placement, using a Coarse-to-Fine Strategy and a Grid-Based Formulation.
The LLM-based Preprocessor translates textual requirements into structured design constraints, which are then formalized using a grid-based representation inspired by "Modulor" for the IP model.
The framework employs a Coarse-to-Fine Strategy to manage computational complexity by first solving a simplified problem on a coarse grid before refining the solution on the full-resolution grid.

The 'Sure' Trap: Multi-Scale Poisoning Analysis of Stealthy Compliance-Only Backdoors in Fine-Tuned Large Language Models

Sure Trap: introduces a compliance-only backdoor during SFT where a single benign compliance token ("Sure") acts as a latent behavioral gate to enable unsafe generation when paired with an arbitrary trigger token.
The attack relies on poisoning a small subset of prompts with a trigger and the single-token response "Sure," achieving near-deterministic compliance rates above a small poison budget threshold (~50 examples).
This mechanism exposes a stealthy data-supply-chain risk and suggests using the gate-like dynamics for explicit, auditable control tokens in agentic systems.

15th Nov 2025

Learning to Trust: Bayesian Adaptation to Varying Suggester Reliability in Sequential Decision Making

The proposed framework introduces a unified POMDP-based approach that dynamically learns and adapts to varying suggester reliability in partially observable environments by integrating suggester quality into the belief state and introducing an explicit 'ask' action.
The framework utilizes a MOMDP formulation to manage computational complexity when modeling the hidden state component of suggester types ($\mathcal{T}$).
Experimental results across Tag and RockSample domains demonstrate robust performance, adaptation to changing reliability, and strategic management of suggestion requests.

Fast Reasoning Segmentation for Images and Videos

FastReasonSeg: introduces a distillation framework that reduces computational demands for reasoning segmentation by transferring knowledge from a Teacher LLM to a compact Student LLM using structured Digital Twin Representations.
The framework decouples perception from reasoning via the Digital Twin Representation, enabling the Student LLM to perform complex analysis without processing raw visual tokens.
The two-stage distillation process involves SFT followed by RL with a composite reward function to preserve multi-step reasoning capabilities.

Goal-Oriented Multi-Agent Reinforcement Learning for Decentralized Agent Teams

Goal-Oriented Multi-Agent Reinforcement Learning (MARL) framework: introduces a decentralized MARL approach for agent teams in dynamic, partially observable environments, enabling selective, goal-aware communication and coordination.
The method utilizes weight merging to share learning parameters among agents pursuing the same individual goal, enhancing collaboration while maintaining decentralization.
Experimental validation in complex grid navigation tasks shows improved success rates and reduced time-to-goal compared to non-cooperative and unrestricted communication baselines, demonstrating scalability.

Decision and Gender Biases in Large Language Models: A Behavioral-Economic Perspective

LLMs: introduces an investigation into whether advanced LLMs behave as rational agents or reproduce human behavioral tendencies in classic decision problems (Ultimatum Game and Gambling Game) using behavioral economics parameters, involving LLMs (Gemma-7B and Qwen-2.5-32B-Instruct-AWQ) under neutral and gender-conditioned prompts.
The study estimates parameters for inequity aversion and loss aversion, comparing LLM results to human benchmarks, finding persistent deviations from rationality, including moderate fairness concerns and subtle gender-conditioned differences.
The methodology employs canonical behavioral-economic tasks to elicit parameters related to fairness and risk preferences, providing a behavioral-economics perspective on LLM decision-making.

UpBench: A Dynamically Evolving Real-World Labor-Market Agentic Benchmark Framework Built for Human-Centric AI

UpBench: introduces a dynamically evolving, real-world labor-market agentic benchmark framework, utilizing real jobs, human-generated rubrics, and expert freelancer evaluation, to assess LLM agents' competence and collaboration capacity.
The framework integrates human expertise across data collection, rubric creation, and evaluation stages, supporting fine-grained analysis beyond binary pass/fail metrics.
It provides a scalable foundation for evaluating agentic systems in authentic contexts, emphasizing human-AI collaboration over simple automation.

ProofWright: Towards Agentic Formal Verification of CUDA

ProofWright: introduces an agentic verification framework that integrates automated formal verification with LLM-based code generation to provide end-to-end guarantees of memory safety, thread safety, and semantic correctness for LLM-generated CUDA kernels, utilizing components like the VerCors Agent and Semantic Equivalence Framework.
The framework employs the VerCors Agent to establish safety properties using the VerCors verifier guided by an LLM-generated Annotation Guide, and the Semantic Equivalence Framework to prove functional adherence using the Rocq Theorem Prover.
It addresses the validation bottleneck in LLM-generated GPU code by automating formal reasoning, achieving safety guarantees for 74% of KernelBench L1 programs with modest overhead.

MoralReason: Generalizable Moral Decision Alignment For LLM Agents Using Reasoning-Level Reinforcement Learning

MoralReason: introduces a reasoning-level reinforcement learning approach using MoralReason-QA and GRPO on Qwen-3-4B-Base to achieve out-of-distribution moral decision alignment in LLM agents.
The approach utilizes a multi-component reward function combining alignment and keyword rewards to facilitate learning of underlying moral frameworks.
Experimental results demonstrate successful generalization to unseen moral scenarios for Utilitarian and Deontological frameworks.

RulePilot: An LLM-Powered Agent for Security Rule Generation

RulePilot: introduces an LLM-powered agent workflow utilizing Chain of Thought (CoT) reasoning, Intermediate Representation (IR), and Reflection & Iterative Optimization to automate the generation and conversion of SIEM-specific detection rules, abstracting complexity for security analysts.
The framework uses an IR to structure complex SIEM rule logic, enabling LLMs to focus on manageable generation steps, and employs iterative reflection with tool invocation for robust refinement.
Evaluation shows RulePilot significantly outperforms standalone LLMs in textual similarity and achieves high execution success rates in detecting simulated attacks in a Splunk environment.

CriticSearch: Fine-Grained Credit Assignment for Search Agents via a Retrospective Critic

CriticSearch: introduces a fine-grained credit-assignment framework that leverages a frozen, asymmetric Critique LLM to provide dense, turn-level feedback via a retrospective mechanism for search agents trained with reinforcement learning.
The framework uses privileged information, specifically the gold answer and full trajectory, to enable the Critique LLM to assign stable, dense rewards that guide policy improvement, complementing sparse outcome rewards.
This retrospective assessment approach, integrated with the GRPO algorithm, consistently outperforms existing baselines by achieving faster convergence and improved training stability across multi-hop reasoning benchmarks.

AI-Salesman: Towards Reliable Large Language Model Driven Telemarketing

AI-Salesman: introduces an end-to-end framework that addresses challenges in goal-driven persuasive dialogue like telemarketing, utilizing a dual-stage architecture with Bayesian-supervised reinforcement learning and a Dynamic Outline-Guided Agent (DOGA).
The framework is supported by the newly released TeleSalesCorpus, the first real-world-grounded dialogue dataset for this domain, and a comprehensive LLM-as-a-Judge evaluation framework.
Experimental results show that the proposed method significantly outperforms baselines across key sales capabilities, validating its effectiveness in complex persuasive scenarios.

14th Nov 2025

Chapter 14: Looking Forward: Challenges and Opportunities in Agentic AI Reliability

Chapter 14: Looking Forward: Challenges and Opportunities in Agentic AI Reliability: presents perspectives on challenges and future development in building reliable agentic AI systems, discussing open research problems related to mitigating cascading failures, dynamic environments, inconsistent execution, emergent behaviors, resource-intensive mechanisms, and evaluation.
The chapter organizes reliability challenges into five main areas: Cascading Failures, Vulnerability in Dynamic Environments, Inconsistency in Task Execution, Unpredictable Emergent Behavior, and Resource-Intensive Reliability Mechanisms, alongside the need for new Reliability Testing and Evaluation paradigms.
Addressing these challenges requires cross-layer coordination, dynamic adaptation, integrated reasoning, and resource-aware reliability designs to ensure trustworthy, consistent, and safe outputs from agentic AI systems.

MULTI-PHASE SPACECRAFT TRAJECTORY OPTIMIZATION VIA TRANSFORMER-BASED REINFORCEMENT LEARNING

Transformer-based RL framework: introduces a unified control framework leveraging Gated Transformer-XL and PPO to handle multi-phase spacecraft trajectory optimization using a single adaptive policy, with components including an Observation Encoder, Actor-Critic Model, Policy Head, and Value Head.
The architecture utilizes the GTrXL's sliding memory window and self-attention mechanisms to maintain coherent memory across dynamically distinct mission phases without explicit phase transitions.
The framework is validated on single-phase benchmarks, multi-phase waypoint navigation, and a complex multiphase rocket ascent problem, demonstrating near-optimal performance compared to traditional methods.

Robust and Efficient Communication in Multi-Agent Reinforcement Learning

Survey: introduces a systematic review of recent advances in robust and efficient communication strategies for MARL under realistic constraints, including message perturbations, transmission delays, and limited bandwidth, focusing on applications like cooperative autonomous driving, distributed SLAM, and federated learning.
The review organizes communication strategies along three key dimensions: when to transmit, whom/how to communicate, and what/rate to transmit, highlighting a shift from idealized assumptions to practical, imperfect environments.
The paper advocates for a unified approach that co-designs communication, learning, and robustness to bridge the gap between theoretical MARL models and practical implementations.

Building the Web for Agents: A Declarative Framework for Agent-Web Interaction

VOIX: introduces a concrete, web-native mechanism that makes site capabilities and state discoverable and invokable by agents through declarative, typed semantics, using <tool> and <context> HTML elements.
The framework decouples website functionality from agent reasoning, distributing responsibilities among the Website, the Browser Agent, and the Inference Provider.
Empirical evaluation via a hackathon confirmed the framework's learnability, expressiveness for multimodal interactions, and efficiency compared to inference-based approaches.

GraphPilot: Grounded Scene Graph Conditioning for Language-Based Autonomous Driving

GraphPilot: introduces a model-agnostic method that conditions language-based driving models on structured relational context in the form of traffic scene graphs, using an LLM-based AD Agent, Scene-Graph, Navigation Command, and Future Trajectory.
The approach serializes traffic scene graphs at various abstraction levels and incorporates them via structured prompt templates to enhance structured reasoning over spatial, regulatory, and inter-actor dependencies.
Training with scene graph supervision (SG10) yields performance gains that persist even when scene graphs are omitted at test-time, indicating internalized relational knowledge.

UAVBench: An Open Benchmark Dataset for Autonomous and Agentic AI UAV Systems via LLM-Generated Flight Scenarios

UAVBench: introduces an open benchmark dataset comprising 50,000 validated UAV flight scenarios generated through taxonomy-guided LLM prompting and multi-stage safety validation, with UAVBench_MCQ extending it for reasoning evaluation.
The framework unifies scenario generation, validation, risk labeling, and reasoning into a single pipeline, encoding missions in a structured JSON schema covering configuration, environment, objectives, and safety constraints.
UAVBench_MCQ evaluates LLMs across ten cognitive and ethical reasoning styles, revealing strong performance in perception but persistent challenges in ethics-aware and resource-constrained decision-making.

Refine and Align: Confidence Calibration through Multi-Agent Interaction in VQA

AlignVQA: introduces a debate-based multi-agent framework, AlignVQA, which uses Specialized Agents and Generalist Agents with an AlignCal Loss to improve confidence calibration in Visual Question Answering (VQA).
The framework involves a two-stage interaction where specialized agents provide initial answers, followed by generalist agents engaging in debate to critique, refine, and aggregate proposals, yielding calibrated confidence estimates.
The novel AlignCal loss is a differentiable surrogate for the Upper Bound on Classification Error (UBCE), explicitly optimizing specialized agents for confidence fidelity during training.

Autonomous Vehicle Path Planning by Searching With Differentiable Simulation

DSS (Differentiable Simulation for Search): introduces a framework that leverages the differentiable simulator Waymax as both a next state predictor and a critic, optimizing actions via gradient descent over imagined future trajectories.
The approach uses Classifier-Guided Action Selection to incorporate non-differentiable events like collisions into the differentiable planning loss function.
The framework achieves improved tracking and path planning accuracy compared to sequence prediction, imitation learning, and model-free RL methods by combining search and gradient-based refinement.

Miniature Testbed for Validating Multi-Agent Cooperative Autonomous Driving

CIVAT (Cooperative Intelligent V2X Autonomous Testbed): introduces a 1:15-scale miniature testbed for validating cooperative autonomous driving, integrating miniature vehicles equipped with onboard sensors and smart infrastructure supported by 3D LiDAR and edge computing, with components including CAV/Infrastructure/Perception/Planning/Control/Message Generator/LiDAR/Depth Camera/IMU/MCU/SBC (Jetson Orin NX)/Custom PCB/V2V Communication/V2I Communication.
The infrastructure acts as an active agent, performing infrastructure-centric 3D object detection and Human Vehicle (HV) identification to coordinate Connected Autonomous Vehicles (CAVs) using priority-based intersection management.
The platform supports both fully CAV and mixed-traffic scenarios, demonstrating real-time applicability for cooperative driving algorithms via V2I and V2V communication using a Wi-Fi-based ROS2 publish-subscribe framework.

InData: Towards Secure Multi-Step, Tool-Based Data Analysis

INDATA (Indirect Data Engagement): introduces a security-motivated alternative for LLM-based data analysis by restricting LLMs to interact with data exclusively through a predefined set of secure, verified tools, and presents the INDATA dataset to evaluate multi-step tool-based reasoning ability.
The framework uses Predefined Tools as a secure barrier between the LLM and Sensitive Data, contrasting with direct code generation approaches that pose security risks.
The INDATA dataset specifically targets complex, compositional, multi-step reasoning, revealing a capability gap in current LLMs compared to simple tool selection tasks.

An Analysis of Architectural Impact on LLM-based Abstract Visual Reasoning: A Systematic Benchmark on RAVEN-FAIR

RAVEN-FAIR: introduces a systematic evaluation of Large Language Models (LLMs) performance on abstract visual reasoning tasks using four reasoning architectures, a three-stage process (JSON extraction, LLM reasoning, Tool Function), and visual/textual metrics.
The study benchmarks four LLMs (GPT-4.1-Mini, Claude-3.5-Haiku, Gemini-1.5-Flash, Llama-3.3-70b) across four reasoning configurations to analyze decision-making quality, error tolerance, and consistency.
Results indicate that architectural selection is critical, performance is model-specific, and trade-offs exist between semantic grounding and quantitative precision across strategies.

Conformal Policy Optimization for Cost-Effective LLM Agents

CCPO (Conformal Constrained Policy Optimization): introduces a framework for training an orchestration policy to select between multiple LLM agents to minimize cost while satisfying a user-specified reliability constraint formalized via conformal prediction, using components like a Base LLM, Guide LLM, Orchestration Policy, Conformal Prediction, and V-trace.
The framework formalizes the deployment problem as a finite-horizon Partially Observable Markov Decision Process (POMDP) where the policy $\pi$ is parameterized stochastically and a threshold $\kappa$ is updated online to ensure coverage guarantees.
Empirical results show that CCPO reduces total computational and API costs by up to 30% compared to state-of-the-art cost-aware baselines while maintaining target reliability on HotpotQA and MMLU benchmarks.

From Single to Societal: Analyzing Persona-Induced Bias in Multi-Agent Interactions

The paper introduces a systematic investigation of persona-induced biases in LLM-based multi-agent interactions, utilizing LLM-based Multi-Agent Systems with Persona Assignment and a Default Agent across Collaborative Problem Solving (CPS) Task and Persuasion Task.
The study quantifies biases in trustworthiness and insistence, finding that personas from historically advantaged groups are perceived as less trustworthy and insistent, and reveals in-group favoritism in agent conformity.
These behavioral patterns persist across different LLMs, group sizes, and interaction rounds, highlighting the need for bias mitigation in autonomous agent environments.

MALBO: Multi-Agent LLM Bayesian Optimization

MALBO (Multi-Agent LLM Bayesian Optimization): introduces a systematic framework designed to automate the efficient composition of LLM-based agent teams by formalizing the assignment challenge as a multi-objective, black-box optimization problem, using Multi-Objective Bayesian Optimization (MOBO) with Gaussian Process surrogate models and the qEHVI acquisition function.
The methodology employs a continuous relaxation of the discrete LLM assignment space, projecting ideal continuous solutions back to real, deployable LLM assignments via a nearest-neighbor projection function.
Results show that the framework achieves a 45.64% reduction in mean cost while maintaining comparable performance compared to initial random search, and identifies heterogeneous teams with up to 65.8% cost reduction over homogeneous baselines.

Experience-Guided Adaptation of Inference-Time Reasoning Strategies

EGUR (Experience-Guided Reasoner): introduces a system that dynamically generates tailored strategies—complete computational procedures involving LLM calls, tools, sampling parameters, and control logic—at inference time based on accumulated experience, utilizing a Guide and a Consolidator.
The system formalizes strategies as compositions of stateful processes, enabling adaptation of all strategy components, unlike prior methods limited to textual steering.
EGUR achieves up to 14% accuracy improvements and up to 111x reduction in computational costs across challenging benchmarks by learning from comparative strategy evaluation.

MarsRL: Advancing Multi-Agent Reasoning System via Reinforcement Learning with Agentic Pipeline Parallelism

MarsRL (Multi-Agent Reasoning System via Reinforcement Learning with Agentic Pipeline Parallelism): introduces a novel reinforcement learning framework to jointly optimize Solver, Verifier, and Corrector agents in a multi-agent reasoning system, addressing reward noise and training efficiency challenges.
The framework employs agent-specific rewards to decouple credit assignment and utilizes pipeline parallelism to accelerate the training process for long reasoning trajectories.
Experimental results show significant performance gains on AIME2025 and BeyondAIME benchmarks when applying the framework to Qwen3-30B-A3B-Thinking-2507.

SRLF: An Agent-Driven Set-Wise Reflective Learning Framework for Sequential Recommendation

SRLF (Set-wise Reflective Learning Framework): introduces a closed-loop "assess-validate-reflect" cycle using LLM agents to move beyond point-wise assessment by formulating a holistic judgment over sets of items, utilizing components like the Set-wise Assessment Agent (SAA), Validation via Set-wise Mismatch Loss, and Dual-Path Reflective Learning.
The framework captures complex contextual patterns by analyzing intra-set item relationships and their alignment with the user's preference profile, which is crucial for sequential recommendation tasks.
The reflective mechanism concurrently refines the user profile and item semantics to adapt to dynamic user interests and improve representation learning.

LaoBench: A Large-Scale Multidimensional Lao Benchmark for Large Language Models

LaoBench: introduces the first large-scale, multidimensional benchmark dataset dedicated to assessing LLMs' comprehensive language understanding and reasoning abilities in Lao, covering Knowledge Application, K12 Foundational Education, and Bilingual Translation, utilizing a pipeline integrating expert human curation and agent-assisted verification.
The benchmark comprises over 17,000 curated samples split into open-source (Lao-7k, Lao-500) and closed-source (Lao-10k) subsets to ensure fairness and transparency in evaluation.
Evaluation results show that current state-of-the-art LLMs face significant challenges in mastering Lao, highlighting the need for targeted research in this low-resource Southeast Asian language.

UFO³: Weaving the Digital Agent Galaxy

UFO³: Weaving the Digital Agent Galaxy, introduces a cross-device orchestration system that unifies heterogeneous endpoints into a single fabric using a mutable TASKCONSTELLATION (distributed DAG of TASKSTARS), a CONSTELLATIONAGENT (LLM-driven planner), a Constellation Orchestrator (asynchronous execution engine), and the AIP (communication protocol).
The system models user requests as a TASKCONSTELLATION, a dynamic DAG where nodes (TASKSTARS) are atomic subtasks with dependencies (TASKSTARLINES) that evolve based on runtime feedback.
It addresses challenges in cross-device agent workflows by providing asynchronous parallelism, distributed coordination, and heterogeneous extensibility across devices like Windows, Linux, and mobile.

iMAD: Intelligent Multi-Agent Debate for Efficient and Accurate LLM Inference

iMAD (Intelligent Multi-Agent Debate): introduces a token-efficient framework that selectively triggers Multi-Agent Debate (MAD) only when beneficial, utilizing a structured self-critique prompt and a Debate-Decision Classifier trained with FocusCal Loss, to enhance LLM inference efficiency and accuracy.
The framework addresses the high computational cost and inconsistent accuracy gains of standard MAD by learning generalizable model behaviors to identify recoverable errors via 41 interpretable linguistic and semantic features extracted from a single-agent's self-critique response.
Experiments show iMAD significantly reduces token usage (up to 92%) while improving final answer accuracy (up to 13.5%) across various QA and VQA datasets compared to single-agent and full-debate baselines.

Multi-agent Undercover Gaming: Hallucination Removal via Counterfactual Test for Multimodal Reasoning

MUG (Multi-agent Undercover Gaming): introduces a protocol inspired by social deduction games to address LLM hallucinations in multimodal reasoning by employing multimodal counterfactual tests to detect "undercover" agents (those hallucinating) using components like the Counterfactual Editing Module, Undercover Detection Game, and Summarization Game.
The framework dynamically modifies reference images to create counterfactual evidence (I-) to enable direct factual verification, moving beyond the statistical consensus reliance of traditional Multi-Agent Debate (MAD) protocols.
MUG fosters active reasoning where agents engage in probing discussions based on information asymmetry between normal agents (seeing I+) and the undercover agent (seeing I-).

Scaling Equitable Reflection Assessment in Education via Large Language Models and Role-Based Feedback Agents

Equitable Reflection Assessment Pipeline: introduces a theory-grounded system using five coordinated role-based LLM agents (Evaluator, Equity Monitor, Metacognitive Coach, Aggregator, and Reflexion Reviewer) to score learner reflections with a shared rubric and generate short, bias-aware, learner-facing comments.
The multi-agent LLM system aims to deliver equitable, high-quality formative feedback at scale by integrating structured agent roles, fairness checks, and learning-science principles.
The pipeline produces auditable rubric scores and bias-aware, conversational feedback, addressing the challenge of providing timely, high-quality feedback in large or low-resource courses.

Key Decision-Makers in Multi-Agent Debates: Who Holds the Power?

MADC (Multi-Agent Debate Consistency): introduces a novel role allocation strategy for Multi-Agent Debate (MAD) frameworks by leveraging path consistency metrics to dynamically order agents, aiming to improve reasoning performance across various LLMs and tasks.
The research identifies role allocation strategy as a critical, underexplored scaling dimension in MAD, showing that placing agents with correct viewpoints last ("Truth Last") significantly boosts accuracy.
The proposed MADC method is orthogonal to existing MAD frameworks, optimizing role arrangement without modifying agent prompts or context to unlock potential performance gains.

GraphMASAL: A Graph-based Multi-Agent System for Adaptive Learning

GraphMASAL (Graph-based Multi-Agent System for Adaptive Learning): introduces an integrated, graph-based multi-agent system for adaptive learning that addresses challenges in knowledge dynamism, execution complexity, optimization, and validation, utilizing a Dynamic Knowledge Graph, a trio of specialized agents orchestrated by LangGraph, a KG-enhanced retrieval pipeline, and an MSMS planning engine.
The framework employs a Diagnostic Agent for cognitive diagnosis, a Planning Agent for path optimization using the MSMS algorithm, and a Tutor Agent for coordination, all grounded by a Dynamic Knowledge Graph that evolves with student state.
Performance evaluation shows superior structural alignment of learning paths (PathSim) and cognitive diagnosis fidelity compared to LLM prompting baselines, validated by correlation with human expert ratings.

PATCHEVAL: A New Benchmark for Evaluating LLMs on Patching Real-World Vulnerabilities

PATCHEVAL: introduces a new benchmark for evaluating LLMs on Automated Vulnerability Repair (AVR) tasks, incorporating Benchmark Construction, an Evaluator with multiple Patch Validation methods, and two Task Formulations (Patch Generation with Location Oracle and End-to-End Patch Generation).
The benchmark focuses on Python, JavaScript, and Go, curating 1,000 real-world vulnerabilities from 2015-2025 across 65 CWEs, with 230 having runtime sandbox environments for dynamic testing.
Evaluation reveals that even the best-performing LLM achieves only a 23.0% success rate in single-attempt patch generation, highlighting the difficulty of real-world AVR.

AI Agent-Driven Framework for Automated Product Knowledge Graph Construction in E-Commerce

AI Agent-Driven Framework: introduces a fully automated, AI agent-driven framework for constructing product knowledge graphs directly from unstructured product descriptions using Large Language Models (LLMs) across three stages: ontology creation and expansion, ontology refinement, and knowledge graph population.
The framework utilizes dedicated LLM-powered agents in a modular pipeline to ensure semantic coherence and scalability without requiring predefined schemas or handcrafted extraction rules.
Evaluation on air conditioner product data demonstrated strong performance, achieving over 97% property coverage in the resulting knowledge graph.

Beyond Accuracy: Behavioral Dynamics of Agentic Multi-Hunk Repair

MAPLE (MODEL CONTEXT PROTOCOL FOR AUTOMATED LIGHTWEIGHT REPOSITORY CONTEXT EXTRACTION): introduces a systematic study of four LLM-driven coding agents (CLAUDE CODE, CODEX, GEMINI-CLI, and QWEN CODE) on multi-hunk program repair using fine-grained behavioral metrics and the MAPLE context-assistance mechanism.
The study evaluates agents on localization success, repair accuracy, regression behavior, and operational dynamics across 372 multi-hunk bugs from the HUNK4J dataset, revealing significant variation in effectiveness correlated with bug complexity metrics like hunk divergence and spatial proximity.
MAPLE improves repair accuracy for GEMINI-CLI by 30% by enhancing bug localization through structured repository context extraction, highlighting the value of context-assistance for agents with baseline reasoning capabilities.

Exposing Weak Links in Multi-Agent Systems under Adversarial Prompting

SAFEAGENTS: introduces a unified and extensible framework for fine-grained security assessment of Multi-Agent Systems (MAS) under adversarial prompting, complemented by the DHARMA diagnostic measure.
The framework systematically exposes how design choices like plan construction and context sharing affect susceptibility to adversarial inputs across different MAS architectures.
The study reveals significant vulnerabilities in common design patterns, emphasizing the need for security-aware design in MAS.

When AI Does Science: Evaluating the Autonomous AI Scientist KOSMOS in Radiation Biology

KOSMOS (Autonomous AI Scientist): introduces an evaluation of the autonomous AI scientist KOSMOS on three radiobiology hypotheses using a falsification-based auditing methodology with empirical null models.
The evaluation assessed KOSMOS's claims against null distributions derived from random gene sets or permutations to determine statistical significance and validity.
The study found one well-supported discovery (CDO1), one ambiguous result (12-gene signature), and one false result (DDR to p53 correlation), highlighting the need for rigorous auditing of AI-generated science.

13th November 2025

Co-EPG: A Framework for Co-Evolution of Planning and Grounding in Autonomous GUI Agents

Co-EPG (Co-Evolution of Planning and Grounding): introduces a self-iterative training framework for autonomous GUI agents, featuring Iterative Training (alternating optimization loop), Grounding SFT (grounding model fine-tuning), Planning SFT (planning model fine-tuning), GRPO (planning model refinement), Rollouts (diverse plan generation), C-DREM (dynamic reward ensemble), Grounding Models (plan executability assessment), Group Computation (advantage calculation), Data Enhancement (dataset refinement), Planner II (planning diversity enhancement), and Verifier Φ (discrimination reliability improvement), which establishes a positive feedback loop for the co-evolution of planning and grounding models.
The framework enables continuous self-improvement of agent capabilities through self-play optimization and training data distillation, where the planning model explores strategies under grounding-based reward guidance, and the grounding model is optimized with diverse data generated by the planning model.
Co-EPG leverages a confidence-based dynamic reward ensemble mechanism (C-DREM) to reduce reward noise and accelerate GRPO training, leading to enhanced generalization and state-of-the-art performance on GUI task automation benchmarks without requiring external data.

Safe Planning in Interactive Environments via Iterative Policy Updates and Adversarially Robust Conformal Prediction

The paper introduces an iterative framework that robustly maintains safety guarantees across policy updates in interactive environments using Adversarially Robust Conformal Prediction (ACP), which involves Iterative Policy Updates, an Explicit Solver, or an Implicit Solver.
This framework addresses the "chicken-and-egg" problem where the autonomous agent's policy update changes the environment's behavior distribution, violating standard Conformal Prediction exchangeability assumptions.
The approach provides episodic safety guarantees by analytically bounding the policy-induced distribution shift and offers explicit conditions for convergence of the uncertainty set radius.

Towards autonomous quantum physics research using LLM agents with access to intelligent tools

AI-MANDEL: introduces an LLM agent system that autonomously generates and implements novel ideas in quantum physics by accessing scientific literature and the intelligent discovery tool PYTHEUS, aiming for an AI physicist.
The system consists of Idea generation agents (Researcher, Novelty, Judge, Mediator) and Idea implementation agents (Expert) interacting with the PYTHEUS tool to produce concrete, actionable experiment designs.
Successful designs are stored in an Idea Pool and have led to the writing of independent, publishable scientific papers in quantum physics.

nuPlan-R: A Closed-Loop Planning Benchmark for Autonomous Driving via Reactive Multi-Agent Simulation

nuPlan-R: introduces a reactive closed-loop planning benchmark by integrating learning-based reactive agents and an interaction-aware agent selection mechanism into the nuPlan framework, replacing rule-based Intelligent Driver Model (IDM) agents.
The benchmark extends evaluation with Success Rate (SR) and All-Core Pass Rate (PR) metrics to assess planner robustness and performance balance across multiple dimensions.
The learning-based reactive agents, based on a noise-decoupled diffusion framework (Nexus architecture), produce more realistic, diverse, and human-like traffic behaviors compared to rule-based agents.

AgentEvolver: Towards Efficient Self-Evolving Agent System

AgentEvolver: introduces a self-evolving agent system that leverages LLMs' semantic understanding and reasoning to drive autonomous agent learning, addressing high data construction costs, inefficient exploration, and poor sample utilization in current LLM-based agents.
The system integrates three synergistic mechanisms: self-questioning for curiosity-driven task generation, self-navigating for experience reuse and hybrid policy guidance, and self-attributing for enhanced sample efficiency via differentiated rewards.
The practical infrastructure supports modularity and scalability, enabling continual improvement of agent capabilities through a unified orchestration loop.

VISTA: A Vision and Intent-Aware Social Attention Framework for Multi-Agent Trajectory Prediction

VISTA (Vision and Intent-Aware Social Attention Framework): introduces a recursive goal-conditioned transformer architecture that integrates long-term goals, past trajectories, and social interactions for multi-agent trajectory prediction.
The framework decouples destination goal prediction from local trajectory generation using a Goal Prediction Module (GPM) and refines predictions recursively within the Trajectory Prediction Module (TPM).
Key innovations include goal-trajectory fusion via cross-attention and social-token attention, which result in state-of-the-art accuracy and significantly reduced collision rates on dense benchmarks.

ENVTRACE: SIMULATION-BASED SEMANTIC EVALUATION OF LLM CODE VIA EXECUTION TRACE ALIGNMENT—DEMONSTRATED AT SYNCHROTRON BEAMLINES

EnvTrace: introduces a simulation-based method that evaluates LLM-generated instrument control code via execution trace alignment with a beamline control-logic digital twin, assessing functional correctness and runtime performance.
The framework captures state changes (Process Variable updates) from both ground-truth and LLM code execution in a sandboxed environment to compute multi-faceted scores like pv_match_rate, timing_score, and temp_score.
This approach provides a more reliable measure of code correctness for high-stakes physical systems compared to purely syntactic metrics, enabling safer deployment of LLM agents.

Multi-Agent Multimodal Large Language Model Framework for Automated Interpretation of Fuel Efficiency Analytics in Public Transportation

MAML-AIF (Multi-Agent Multimodal Large Language Model Framework for Automated Interpretation of Fuel Efficiency Analytics in Public Transportation): introduces a modular and scalable multi-agent framework leveraging multimodal LLMs to automate data narration and energy insight generation, coordinating specialized agents for iterative refinement.
The framework operates across four stages: raw data description, data modeling, post hoc analytics, and integration/narration, building cumulative contextual knowledge across stages.
The system was validated on public bus fuel efficiency data, finding that GPT-4.1 mini with Chain-of-Thought prompting provided the optimal balance of narrative accuracy and computational cost.

HARNESS: Human-Agent Risk Navigation and Event Safety System for Proactive Hazard Forecasting in High-Risk DOE Environments

HARNESS (Human-Agent Risk Navigation and Event Safety System): introduces a modular AI framework integrating LLMs with structured data and historical event retrieval for proactive hazard forecasting in high-risk Department of Energy (DOE) environments, utilizing an agentic orchestration structure.
The system employs a human-in-the-loop mechanism where Subject Matter Experts (SMEs) refine predictions, creating an adaptive learning loop that enhances system performance over time through iterative agentic reasoning.
Key architectural components include a central Orchestrator Agent coordinating specialized agents for retrieval, analysis, mitigation strategy generation, and final report compilation.

Towards an Agentic Workflow for Internet Measurement Research

ArachNet: introduces an agentic workflow system for Internet measurement research that uses four specialized LLM agents (QueryMind, WorkflowScout, SolutionWeaver, RegistryCurator) to independently generate executable measurement workflows mimicking expert reasoning.
The system automates the systematic reasoning process of problem decomposition, solution design, implementation, and registry evolution, significantly lowering the barrier to composing complex, multi-framework analyses.
ArachNet validates its capabilities by successfully replicating expert-level analysis in Internet resilience scenarios, including single-framework replication, multi-framework orchestration, and temporal forensic investigations.

Rethinking the Reliability of Multi-agent System: A Perspective from Byzantine Fault Tolerance

CP-WBFT (Confidence Probing-based Weighted Byzantine Fault Tolerant): introduces a consensus mechanism leveraging LLM's reflective capabilities via confidence probes (PCP and HCP) to enhance Multi-agent System (MAS) stability against Byzantine faults.
LLM-based agents show stronger skepticism against erroneous messages than traditional agents, motivating the development of CP-WBFT which uses weighted information flow based on confidence scores.
The proposed CP-WBFT achieves superior Byzantine Fault Tolerance improvement, especially under extreme fault rates (up to 85.7% malicious nodes), across various network topologies.

Simulating Misinformation Propagation in Social Networks using Large Language Models

Auditor-Node Framework: introduces a framework combining persona-conditioned LLM agents and a QA-based auditor to simulate and quantify misinformation evolution across synthetic social networks.
The framework uses Misinformation Index (MI) and Misinformation Propagation Rate (MPR) to track factual degradation across sequential rewrites by agents mimicking human biases.
Findings reveal that identity/ideology-based personas accelerate misinformation, while expert/neutral personas act as stabilizers.

Behavior Modeling for Training-free Building of Private Domain Multi Agent System

Behavior Modeling for Training-free Building of Private Domain Multi Agent System: introduces a framework for private-domain multi-agent conversational systems that avoids training and data generation by adopting behavior modeling and documentation, utilizing an Orchestrator Agent, a Tool-Calling Agent (TCA), and a General Chat Agent (GCA).
The core of the approach is 'SpecDoc', a comprehensive document that explicitly details domain knowledge, tool specifications, and usage conventions to align agent behavior via structured prompting.
This training-free method offers a sustainable path for vertical AI systems by keeping knowledge external, queryable, and easily updatable, mitigating risks like catastrophic forgetting associated with fine-tuning.

Fixed-Persona SLMs with Modular Memory: Scalable NPC Dialogue on Consumer Hardware

Fixed-Persona SLMs (Fixed-Persona Small Language Models): introduces a modular NPC dialogue system leveraging SLMs fine-tuned with fixed personas via LoRA and integrated with runtime-swappable memory modules (Conversational memory/World knowledge memory) to enable scalable, expressive dialogue on consumer hardware.
The architecture decouples character identity (fixed persona in the SLM) from dynamic context (swappable memory stores), allowing a single base model to power multiple distinct NPC instances.
Evaluation across DistilGPT-2, TinyLlama-1.1B-Chat, and Mistral-7B-Instruct models demonstrated superior dialogue quality with the Mistral-7B-Instruct variant trained on a smaller dataset (OliverS).

GraphIF: Enhancing Multi-Turn Instruction Following for Large Language Models with Relation Graph Prompt

GraphIF: introduces a training-free and plug-and-play framework that models multi-turn dialogues as directed relation graphs and leverages graph prompts to enhance the instruction following capabilities of LLMs, with components including an agent-based relation extraction module, a relation graph prompt generation module, and a response rewriting module.
The framework addresses the limitations of existing methods that treat response generation as isolated tasks by explicitly modeling cross-turn relational constraints using graph structures.
Extensive experiments show that GraphIF significantly improves performance across multi-turn instruction-following metrics when integrated into instruction-tuned LLMs.

Continuous Benchmark Generation for Evaluating Enterprise-scale LLM Agents

Continuous Benchmark Generation Pipeline: introduces a methodology for creating evolving benchmarks for enterprise-scale LLM agents by leveraging developer-authored Knowledge Bases (KBs), KB Analysis, and Reference Implementations.
The approach addresses challenges in evaluating LLM agents operating under continuously changing enterprise requirements by separating requirement specification (KBs) from concrete evaluation instances derived from migrated services.
The pipeline uses LLMs' reasoning capabilities to generate evaluation artifacts, such as regular expressions, from semi-structured documents, resulting in cleaner benchmarks than manually created ones.

DemoTuner: Efficient DBMS Knobs Tuning via LLM-Assisted Demonstration Reinforcement Learning

DemoTuner: introduces an efficient DBMS knobs tuning framework via LLM-assisted demonstration reinforcement learning, utilizing a structured Chain-of-Thought prompt for condition-aware tuning hints extraction and the HA-DDPGfD algorithm for agent training.
The framework addresses slow convergence in RL-based tuning by pre-training an agent with extracted explicit and implicit tuning hints, incorporating domain knowledge throughout fine-tuning using hpPER and reward shaping.
Experimental results on MySQL and PostgreSQL show DemoTuner achieves significant performance gains and lower online tuning costs compared to baselines like DB-BERT, GPTuner, and CDBTune, while also demonstrating superior adaptability to unknown workloads.

SPAN: Benchmarking and Improving Cross-Calendar Temporal Reasoning of Large Language Models

SPAN (Cross-Calendar Temporal Reasoning Benchmark): introduces a benchmark and evaluation protocol for assessing LLMs' ability to perform temporal reasoning across six different calendar systems, utilizing components like search_calendar and the Time Agent.
The benchmark covers ten cross-calendar reasoning directions, two reasoning types (date-based and festival-based), and two question formats (polar and content), using a dynamic instance generation protocol to mitigate data contamination.
Experimental results show current LLMs struggle with an average accuracy of 34.5%, but the Time Agent achieves 95.31% accuracy by leveraging tool-augmented code generation via the search_calendar interface.

HIERROUTER: Coordinated Routing of Specialized Large Language Models via Reinforcement Learning

HIERROUTER: introduces a hierarchical routing framework that dynamically assembles inference pipelines from a pool of specialized, lightweight language models using a PPO-based reinforcement learning agent, optimizing response quality against cumulative inference cost.
The routing process is formalized as a finite-horizon Markov Decision Process (MDP) where the agent selects models across a fixed number of L stages (hops) based on the evolving context, current depth, and accumulated cost.
The system leverages specialized LLMs for specific tasks, achieving up to 2.4x improvement in response quality over individual models while maintaining cost efficiency through adaptive, multi-hop coordination.

12th November 2025

TaskSense: Cognitive Chain Modeling and Difficulty Estimation of GUI Tasks

TaskSense (Cognitive Chain Modeling and Difficulty Estimation): introduces a novel framework for estimating GUI task difficulty by modeling cognitive processes preceding motor actions, using an LLM-based method to automatically extract cognitive chains and their associated difficulty.
The framework decomposes GUI tasks into sequences of cognitive steps, each with a difficulty index grounded in information theories, and validates its model against both human user completion times and state-of-the-art GUI agent performance.
TaskSense reveals patterns of Human-AI consistency in cognitive capabilities and identifies current agent limitations on cognitively demanding tasks, paving the way for improved agent training and human-agent task delegation.

ProBench: Benchmarking GUI Agents with Accurate Process Information

ProBench: introduces a comprehensive mobile benchmark, with Task Curation (generates, refines GUI tasks), Dynamic Environment (agents interact with device), and Evaluation Pipeline (assesses agent performance), to rigorously evaluate GUI agents' ability to capture and execute necessary operation processes.
The benchmark includes over 200 challenging GUI tasks across 34 mainstream Chinese and English online applications, covering both State-related and Process-related tasks.
A key innovation is the Process Provider, which automatically supplies accurate process information via a Structure Description Converter and an MLLM-based Summarizer, enabling precise assessment of intermediate steps.

History-Aware Reasoning for GUI Agents

HAR (History-Aware Reasoning) framework: introduces a method to enhance GUI agents' reasoning capabilities by equipping them with stable short-term memory for episodic reasoning through error-aware cognitive correction within a tailored reflection scenario.
The framework operates in two stages: a GUI Scenario Warm-up Stage for domain-specific knowledge injection via supervised fine-tuning, and a Learning From Failure Stage that enhances short-term memory through reflective learning, tailored correction guidelines, and a hybrid RL reward function.
This approach transforms the GUI agent's reasoning mode from history-agnostic to history-aware, enabling it to effectively leverage historical interaction clues for robust performance in long-horizon GUI tasks.

Lumine: An Open Recipe for Building Generalist Agents in 3D Open Worlds

Lumine: introduces a generalist agent for 3D open worlds, integrating a Vision-Language Model (VLM) (core processing unit), Perception Module (raw pixel input), Hybrid Thinking Strategy (adaptive reasoning), Action Generation Module (keyboard/mouse output), Context Management Module (short/long-term memory), Vision Transformer (ViT) Backbone (visual encoder), LLM Prefill Module (input token processing), and LLM Decode Module (output token generation) to achieve human-like interaction.
The agent processes raw pixels at 5 Hz, generates 30 Hz keyboard-mouse actions, and adaptively invokes reasoning for complex, long-horizon missions in real-time.
Trained on Genshin Impact, it demonstrates strong zero-shot cross-game generalization to Wuthering Waves and Honkai: Star Rail, marking a step towards generalist agents in open-ended environments.

Baby Sophia: A Developmental Approach to Self-Exploration through Self-Touch and Hand Regard

Baby Sophia: introduces a Reinforcement Learning (RL) framework for autonomous self-exploration in a robotic agent, using the Baby-Bench simulation environment, to learn self-touch and hand regard behaviors via intrinsic rewards.
The framework utilizes a semantic body map for high-dimensional tactile input compression and employs motor babbling followed by curiosity-based rewards to drive skill acquisition mimicking infant development.
The approach demonstrates that intrinsic motivation and curriculum learning can enable complex sensorimotor skills from raw, high-dimensional inputs without external supervision.

ECHOING: IDENTITY FAILURES WHEN LLM AGENTS TALK TO EACH OTHER

ECHOING: introduces, AxA (Agent x Agent interaction) involving LLM Agents ($\pi_i$) that suffer from echoing (identity abandonment), detected by EchoEvalLM, and mitigated via AgentResponse (Pydantic structure for mitigation), describing this failure mode.
The study systematically investigates echoing across 60 configurations, 3 domains (hotel booking, car sales, supply chain), and multiple LLM providers, finding rates between 5% and 70%.
Echoing persists even in advanced reasoning models and is not eliminated by increased reasoning effort or prompt variations, suggesting a need for architectural solutions.

Digital Co-Founders: Transforming Imagination into Viable Solo Business via Agentic AI

Conceptual Framework: introduces a three-stage framework—imagination shaping, reality testing, and reality scaling—to articulate how AI-augmented solopreneurs transform inner vision into a sustainable solo business reality, supported by agentic AI.
The framework details specific inputs, mechanisms, resources (including AI agents), and psychological factors characterizing each stage, emphasizing the recursive nature of the process.
It bridges macro-level solo economy observations with micro-level mechanisms, providing design implications for tools supporting AI-augmented solopreneurs as "digital co-founders."

BARRIERBENCH: EVALUATING LARGE LANGUAGE MODELS FOR SAFETY VERIFICATION IN DYNAMICAL SYSTEMS

BARRIERBENCH: introduces an LLM agentic framework for barrier certificate synthesis that leverages natural language reasoning, Retrieval-Augmented Generation (RAG), and agentic coordination with SMT-based verification to ensure correctness in dynamical systems.
The framework utilizes three collaborating agents—Retrieval, Synthesis, and Verifier—to iteratively propose, refine, and validate candidate barrier certificates, including co-synthesis with controllers.
The associated BARRIERBENCH benchmark comprises 100 dynamical systems to evaluate the framework's capability, achieving over 90% success rate, significantly outperforming single-prompt LLM baselines.

Scaling Environments for LLM Agents in the Era of Learning from Interaction: A Survey

Survey: introduces a systematic review of environment scaling methods for LLM agents aligned with the Generation-Execution-Feedback (GEF) loop, covering task generation, task execution, and feedback stages.
The paper proposes an environment-centric taxonomy to organize scaling methods based on the three stages of the GEF loop: Task Generation, Task Execution, and Feedback.
A key challenge identified is the Generator-Verifier Asymmetry, which describes the mismatch in intelligence required for task generation/execution versus feedback provision.

Perspectives on a Reliability Monitoring Framework for Agentic AI Systems

Reliability Monitoring Framework for Agentic AI Systems: introduces a two-layered framework consisting of an Out-of-Distribution (OOD) Detection Layer and an AI Transparency Layer to monitor the operational reliability of agentic AI systems.
The framework addresses the fundamental challenge of unpredictable environments by first detecting novel inputs and then providing context on the system's internal response to support human decision-making.
This approach moves beyond simple novelty detection by integrating diagnostic transparency to distinguish between a failure mode and successful adaptation.

Learning Efficient Communication Protocols for Multi-Agent Reinforcement Learning

Generalized MARL Framework: introduces a generalized framework for learning multi-round communication protocols in Multi-Agent Reinforcement Learning (MARL) systems, utilizing Observation Processing/Multi-Round Communication Protocol (Message Encoding/Topology Selection/Message Aggregation/Hidden State Update) and Policy Optimization/Decision Making components.
The framework is evaluated using three novel Communication Efficiency Metrics (CEMs): Information Entropy Efficiency Index (IEI), Specialization Efficiency Index (SEI), and Topology Efficiency Index (TEI).
The research proposes incorporating IEI and SEI directly into the loss function as regularization terms to achieve efficiency augmentation without increasing communication rounds.

MACEVAL: A MULTI-AGENT CONTINUAL EVALUATION NETWORK FOR LARGE MODELS

MACEVAL (Multi-Agent Continual Evaluation network): introduces a dynamic continual evaluation framework that measures the progress of large models autonomously by implementing a multi-agent collaboration system, modeling evaluation as a multi-round interview process.
The framework utilizes specialized agents (Interviewee, Interviewer, Supervisor) within a graph-based MAEN structure and employs an AUC-inspired metric for sustainable performance assessment.
It addresses issues like data contamination and human dependency by using in-process, AI-generated, open-ended tasks across visual perception, text comprehension, math, algorithm, and coding capabilities.

Towards a Generalisable Cyber Defence Agent for Real-World Computer Networks

TERLA (Topological Extensions for Reinforcement Learning Agents): introduces a set of extensions for Deep Reinforcement Learning agents, specifically applied to a Proximal Policy Optimisation (PPO) model, to achieve generalisability for cyber defence across networks with varying topology and size without retraining, utilizing components like HGTConv, ReLU, and Global Pooling.
The approach uses heterogeneous graph neural network layers to create a fixed-size latent embedding of the network state, enabling topology and size invariance for the policy learning stage.
Key architectural elements include an Observation Converter, a Representation Learning stage using HGT layers, and a Policy Learning stage with a reduced, fixed-size action space.

UniMM-V2X: MoE-Enhanced Multi-Level Fusion for End-to-End Cooperative Autonomous Driving

UniMM-V2X: introduces a novel end-to-end multi-agent framework that enables hierarchical cooperation across perception, prediction, and planning, utilizing a multi-level fusion strategy and a Mixture-of-Experts (MoE) architecture in both the BEV encoder and motion decoder.
The framework integrates cooperative information fusion at perception and prediction levels, with MoE dynamically generating task-specialized BEV representations and expert-guided motion queries.
This unified MoE-enhanced multi-level fusion paradigm achieves state-of-the-art performance across perception, prediction, and planning tasks on the DAIR-V2X dataset.

Achieving Equilibrium under Utility Heterogeneity: An Agent-Attention Framework for Multi-Agent Multi-Objective Reinforcement Learning

AA-MAMORL (Agent-Attention Multi-Agent Multi-Objective Reinforcement Learning): introduces a framework to achieve Bayesian Nash Equilibrium (BNE) in Multi-Agent Multi-Objective Systems (MAMOS) by implicitly learning a joint belief over other agents' utility functions and policies using a centralized agent-attention critic during training, enabling decentralized execution.
The framework addresses the challenge of heterogeneous and conflicting objectives in MAMOS by modeling the necessary global preference information within the attention mechanism for Case II (observation-dependent preferences).
The approach consistently outperforms state-of-the-art methods in MAMO benchmarks by effectively modeling inter-agent influence and dynamic utility variations.

SlideBot: A Multi-Agent Framework for Generating Informative, Reliable, Multi-Modal Presentations

SlideBot: introduces a modular, multi-agent slide generation framework that integrates LLMs with retrieval, structured planning, and code generation, organized around pillars of informativeness, reliability, and practicality.
The framework decomposes slide creation into three stages: Content Retrieval, Slide Draft Generation, and Presentation Enhancement, coordinated by a central Moderator agent.
It incorporates principles from Cognitive Load Theory (CLT) and Cognitive Theory of Multimedia Learning (CTML) to ensure pedagogically sound and context-grounded presentations.

Evaluating Software Process Models for Multi-Agent Class-Level Code Generation

Waterfall Model: introduces a multi-agent workflow structured around the classical Waterfall software process model (Requirement $\rightarrow$ Design $\rightarrow$ Implementation $\rightarrow$ Testing) for class-level code generation using specialized LLM agents (Requirement Engineer, Architect, Developer, Tester) compared against a RawPrompt baseline.
The study evaluates three LLMs (GPT-40-mini, DeepSeek-Chat, and Claude-3.5-Haiku) on 100 Python tasks from the ClassEval benchmark to analyze the impact of process structure on functional correctness and code quality.
Results indicate that structured workflows reorganize performance, often improving code quality (cleanliness, maintainability) at the expense of functional correctness (Pass@1) and increased reasoning/validation errors, with model performance being highly dependent on the workflow structure.

Self-Correcting Large Language Models: Generation vs. Multiple Choice

Self-Correcting LLMs: introduces a systematic investigation comparing self-correction performance trends and error-correction behaviors in Large Language Models (LLMs) across two paradigms: Open-Ended Generation and Multiple-Choice Prediction, utilizing components like Self-Correction, Open-Ended Generation, and Multiple-Choice Prediction.
The study contrasts the dynamics, finding that generation benefits from flexibility and rapid early gains but risks semantic drift, while multiple-choice offers stability but suffers from logit inertia.
Findings highlight an inherent adaptability-stability trade-off, suggesting that task structure fundamentally shapes how LLMs benefit from iterative refinement.

Value-Aligned Prompt Moderation via Zero-Shot Agentic Rewriting for Safe Image Generation

VALOR (Value-Aligned LLM-Overseen Rewriter): introduces a zero-shot agentic framework for safer and more helpful text-to-image generation by integrating layered prompt analysis with human-aligned value reasoning, utilizing a Multi-granular Safety Detector, an Intention Judgement Module, an LLM-Guided Rewriting Agent, and optional Safety-Guided Regeneration.
The framework detects risks across lexical, semantic, and value-sensitive dimensions, and uses an LLM to rewrite prompts to preserve user intent while enforcing alignment, achieving up to 100.00% reduction in unsafe outputs.
VALOR addresses challenges in T2I safety, including semantic jailbreaking and value mismatch, by employing modular system prompts for the rewriting LLM based on detected risk categories.

ENABLING AGENTS TO COMMUNICATE ENTIRELY IN LATENT SPACE

Interlat (Inter-agent Latent Space Communication): introduces a paradigm leveraging the last hidden states of an LLM as a representation of its internal state for direct inter-agent communication entirely in latent space, using a Communication Adapter, Reasoning Model, Actor Model, Projector, and MHA.
This approach bypasses the constraints of natural language by transmitting rich, high-dimensional latent vectors, enabling more expressive and efficient coordination between agents.
The framework is validated on the ALFWorld benchmark, demonstrating improved performance and substantial latency reduction through compression of the latent messages.

11th November 2025

AgentPRM: Process Reward Models for LLM Agents via Step-Wise Promise and Progress

AgentPRM (Process Reward Models for LLM Agents via Step-Wise Promise and Progress): introduces a novel process reward model for LLM agents that captures both the immediate progress and the long-term promise of each decision, utilizing TD-based estimation with GAE for efficient training.
This framework guides LLM agents in multi-turn decision-making tasks by evaluating each step's contribution to the final goal and the dependencies between sequential decisions, enabling better progress tracking and exploration-exploitation balance.
AgentPRM demonstrates superior compute efficiency and robust performance across various agentic tasks and model sizes, and can be seamlessly integrated into reinforcement learning processes for LLM agents.

Material-Based Intelligence: Self-organizing, Autonomous and Adaptive Cognition Embodied in Physical Substrates

Material-Based Intelligence (MBI): introduces a paradigm shift focusing on architectures where material-based intelligence arises spontaneously from self-organization, leveraging minimal physical models and intrinsically embedding information-theoretic control within the material's own physics, with components including Self-Organization, Sensing/Transduction, Intrinsic Physical Computation, Active Memory/Adaptation, and Actuation/Response, all grounded in Physical Substrates.
This framework distinguishes MBI from traditional machine-based intelligence by minimizing the hardware-software separation, embedding computation directly into the material's dynamics, and operating far from thermodynamic equilibrium.
The functional manifestations of MBI include autonomous, adaptive, and goal-directed behaviors emerging from intrinsic dynamics, requiring local interaction, active memory, embodied computation, and adaptive feedback loops.

Low-cost Multi-agent Fleet for Acoustic Cooperative Localization Research

CoUGARs (Configurable Underwater Group of Autonomous Robots): introduces a low-cost, configurable Autonomous Underwater Vehicle (AUV) platform, the CougUV, built from COTS and 3D-printed parts, designed to support multi-agent autonomy research, specifically acoustic localization.
The platform utilizes a containerized ROS 2 Software Stack, featuring a GTSAM-based State Estimator and decoupled Control Systems, validated through simulation in HoloOcean and in-situ field trials.
Key hardware components include a Raspberry Pi 5, Teensy 4.1, DVL, and USBL acoustic array, integrated for cooperative localization experiments.

Discovering and exploiting active sensing motifs for estimation

BOUNDS (Bounding Observability for Uncertain Nonlinear Dynamic Systems): introduces a computational pipeline to empirically determine observability levels of individual state variables and how they change with sensor motion, using tools from control and information theory, alongside the pybounds package.
The work also presents the Augmented Information Kalman Filter (AI-KF), which merges data-driven state estimates (from ANNs) with model-based filtering (Kalman Filter) using observability knowledge to improve state estimation robustness.
The framework is demonstrated by discovering active sensing motifs for a flying agent to estimate variables like wind direction and altitude, and by validating the AI-KF's superior performance over traditional filters in scenarios with sparse observability.

Simulating the Visual World with Artificial Intelligence: A Roadmap

Roadmap: introduces a systematic overview of modern video foundation models conceptualized as a combination of an implicit world model and a video renderer, tracing their evolution through four generations based on core capabilities.
The framework defines a physical world model as a digital simulation engine capable of predicting the next scene conditioned on multimodal inputs and spatial/navigation conditions.
The four generations (Faithfulness, Interactiveness, Planning, Stochasticity) represent an evolutionary ladder of increasing capability in world modeling.

AlphaResearch: Accelerating New Algorithm Discovery with Language Models

AlphaResearch: introduces an autonomous research agent designed to discover new algorithms on open-ended problems by synergizing idea generation, execution-based verification, and simulated peer-review via a novel dual research environment, utilizing an LLM Ensemble and a trained Reward Model (AlphaResearch-RM-7B).
The system iteratively proposes ideas, verifies them using program execution, and refines proposals based on feedback from both execution results and the simulated peer-review Reward Model.
AlphaResearch achieved a 2/8 win rate against human researchers on the AlphaResearchComp benchmark, notably discovering a best-of-known performance algorithm for the "packing circles" problem.

Prioritizing Perception-Guided Self-Supervision: A New Paradigm for Causal Modeling in End-to-End Autonomous Driving

PGS (Perception-Guided Self-Supervision): introduces a training paradigm for end-to-end autonomous driving that leverages perception outputs as primary supervisory signals for decision-making, explicitly modeling causal relationships via MTPS, STPS, and NTPS components.
The framework aligns inputs and outputs of the decision-making module with perception results (e.g., lane centerlines, predicted agent motions) to mitigate causal confusion stemming from noisy expert trajectories.
This perception-guided self-supervision approach, built on a standard end-to-end architecture, achieves state-of-the-art closed-loop performance on the Bench2Drive benchmark.

Effective Game-Theoretic Motion Planning via Nested Search

Game-Theoretic Nested Search (GTNS): introduces a novel, scalable, and provably-correct approach for computing Nash Equilibria (NEs) in general dynamical systems using a nested search structure, an outer A*-search on the implicit tensor-product graph, and an inner best-response oracle.
The framework guarantees convergence to a global NE and allows explicit tuning of the solution via a user-specified global objective function, unlike prior optimization-based or local NE methods.
GTNS efficiently searches the joint action space by implicitly encoding trajectories and verifying the NE constraint via the inner search, achieving solutions in seconds for autonomous driving and racing scenarios.

From Experience to Strategy: Empowering LLM Agents with Trainable Graph Memory

Trainable Memory Graph: introduces a novel agent-centric, trainable, multi-layered graph memory framework that abstracts raw agent trajectories into structured decision paths and distills them into high-level, human-interpretable strategic meta-cognition, using reinforcement-based weight optimization to calibrate memory utility.
The framework integrates this structured memory as an explicit policy prior into the LLM agent's Reinforcement Learning (RL) training loop to guide decision-making and improve learning efficiency.
Empirically, the learnable graph memory demonstrates robust generalization, enhances strategic reasoning performance, and provides consistent benefits during RL training across diverse question-answering benchmarks.

Bio AI Agent: A Multi-Agent Artificial Intelligence System for Autonomous CAR-T Cell Therapy Development with Integrated Target Discovery, Toxicity Prediction, and Rational Molecular Design

Bio AI Agent: introduces a multi-agent artificial intelligence system powered by LLMs that enables autonomous Chimeric Antigen Receptor T-cell (CAR-T) development through collaborative specialized agents, including Target Selection Agent/Toxicity Prediction Agent/Molecular Design Agent/Patent Intelligence Agent/Clinical Translation Agent/Decision Orchestration Agent.
The system integrates target discovery, safety assessment, molecular optimization, patent analysis, and clinical translation across six specialized, collaborating LLM-powered agents.
Validation demonstrated autonomous identification of high-risk targets (FcRH5, CD229) and generation of comprehensive development roadmaps, accelerating timelines significantly compared to manual review.

AURORA: Autonomous Updating of ROM and Controller via Recursive Adaptation

AURORA (Autonomous Updating of ROM and Controller via Recursive Adaptation): introduces a multi-agent LLM framework automating ROM-based controller design with online adaptation, employing five specialized functional agents collaborating through a shared Code Agent.
The framework iteratively refines the Reduced-Order Model (ROM) and controller using generation-judge-revision cycles managed by the Code Agent, diagnosing degradation sources via the Evaluation Agent.
It establishes practical viability for autonomous control design by validating high autonomy and performance improvements over expert-tuned baselines across diverse benchmark systems.

Multi-agent self-triage system with medical flowcharts

TriageMD: introduces a proof-of-concept conversational self-triage system that guides LLMs with clinically validated flowcharts from the American Medical Association, providing a structured and auditable framework for patient decision support, leveraging a multi-agent framework consisting of a retrieval agent, a decision agent, and a chat agent.
The system combines the flexibility of free-text interaction with the rigor of standardized clinical protocols, achieving high accuracy in both flowchart retrieval (95.29% top-3) and navigation (99.10%) across diverse conversational styles.
This approach demonstrates the feasibility of transparent, accurate, and generalizable AI-assisted self-triage, aiming to improve healthcare resource utilization by managing nonurgent emergency department visits.

OSWORLD-MCP: BENCHMARKING MCP TOOL INVOCATION IN COMPUTER-USE AGENTS

OSWorld-MCP: introduces a comprehensive and fair benchmark for evaluating computer-use agents by integrating 158 high-quality MCP Tools and GUI operations in real-world scenarios.
The benchmark assesses multimodal agents' decision-making, GUI operation, and tool invocation capabilities in a hybrid environment, bridging the gap between pure-GUI and text-based tool-use evaluations.
New metrics, Tool Invocation Rate (TIR) and Average Completion Steps (ACS), are introduced to provide a nuanced assessment of agents' tool utilization propensity and task completion efficiency.

10th November 2025

IterResearch: Rethinking Long-Horizon Agents via Markovian State Reconstruction

IterResearch (Iterative Deep-Research Paradigm): introduces a novel iterative deep-research paradigm that reformulates long-horizon research as a Markov Decision Process with strategic workspace reconstruction, maintaining sustained reasoning capacity through periodic synthesis and an evolving report memory.
The framework addresses context suffocation and noise contamination by maintaining a bounded Workspace S, where each state includes the Question, an evolving Report, and Immediate Context, rather than accumulating all historical information.
It employs Efficiency-Aware Policy Optimization (EAPO) to train agents for efficient exploration using geometrically discounted rewards and adaptive downsampling, enabling robust performance across extended interactions and diverse tasks.

People Perceive More Phantom Costs From Autonomous Agents When They Make Unreasonably Generous Offers

Phantom Costs Perception Framework: introduces a study investigating how agent type (human/robot), autonomy (autonomous/non-autonomous), and discount size (small/large offer) influence the perception of phantom costs (hidden drawbacks/risks), perceived self-interest (agent's motivation), purchase intention (buying likelihood), and trust (confidence in agent/product) within a car-buying simulation (experimental scenario), grounded in the Heuristic of Sufficient Explanation (HOSE) model (explains phantom costs).
The research reveals that robots are perceived as less self-interested than humans, reducing phantom costs, while larger discounts increase phantom costs but also boost purchase intentions, suggesting perceived benefits can outweigh perceived risks.
Phantom costs were attributed not only to the agent but also to the product and the agent's manager, highlighting multiple sources of suspicion in human-human and human-robot interactions.

Surgical Agent Orchestration Platform for Voice-directed Patient Data Interaction

SAOP (Surgical Agent Orchestrator Platform): introduces a voice-directed hierarchical multi-agent framework for multimodal patient data interaction during robotic surgery, including a Workflow Orchestrator Agent, task-specific agents (IR, IV, AR), and memory states.
The platform leverages LLMs for autonomous planning, command refinement, validation, and reasoning to map voice commands to specific tasks like retrieving clinical information, manipulating CT scans, or navigating 3D anatomical models.
SAOP demonstrates high accuracy and robustness against speech recognition errors and diverse free-form commands, enhancing support for minimally invasive da Vinci robotic surgery.

AGENTICSCIML: COLLABORATIVE MULTI-AGENT SYSTEMS FOR EMERGENT DISCOVERY IN SCIENTIFIC MACHINE LEARNING

AgenticSciML (Collaborative Multi-Agent Systems for Emergent Discovery in Scientific Machine Learning): introduces a collaborative multi-agent framework that coordinates specialized AI agents, including Human, Data Analyst, Evaluator, Root Solution Engineer, Knowledge Retriever, Proposer, Critic, Engineer, Debugger, Result Analyst, and Selector, along with a Knowledge Base, Analysis Base, and Solution Tree, to iteratively propose, critique, and refine SciML solutions for emergent discovery.
The framework integrates structured debate, retrieval-augmented method memory, and ensemble-guided evolutionary search to generate and assess new hypotheses about architectures and optimization procedures in scientific machine learning.
AgenticSciML discovers novel SciML strategies that outperform single-agent and human-designed baselines by up to four orders of magnitude in error reduction, demonstrating emergent methodological innovation through collaborative reasoning.

Bridging the Prototype-Production Gap: A Multi-Agent System for Notebooks Transformation

Codelevate (Multi-Agent System for Software Architecture): introduces a novel multi-agent system that automatically transforms Jupyter notebooks into production-ready Python codebases, employing a Preprocessor, Dependency Analyzer, and a Multi-agent System with Architect, Developer, and Structure agents.
This system leverages specialized agents, each with specific roles, working collaboratively through a shared dependency tree to ensure architectural coherence and code quality, utilizing LLMs and tool-calling capabilities for autonomous code transformation.
Codelevate aims to bridge the prototype-to-production gap by applying critical software engineering principles, resulting in quantifiable improvements in code quality and maintainability while preserving computational semantics.

Resilient by Design – Active Inference for Distributed Continuum Intelligence

PAIR-Agent (Probabilistic Active Inference Resilience Agent): introduces a framework for achieving resilience in Distributed Computing Continuum (DCC) systems by collecting logs, constructing a Causal Fault Graph (CFG), inferring faults using Markov blankets and the Free-energy principle, and autonomously healing through active inference.
The framework ensures adaptive stability, self-healing capability, and sustained operational continuity in complex, heterogeneous DCC environments by continuously monitoring and adaptively reconfiguring the system.
Theoretical validations confirm the reliability and effectiveness of the proposed approach in managing uncertainties and adapting to diverse failure conditions across cloud, fog, edge, and IoT layers.

Dynamics-Decoupled Trajectory Alignment for Sim-to-Real Transfer in Reinforcement Learning for Autonomous Driving

Dynamics-Decoupled Trajectory Alignment: introduces a framework for zero-shot sim-to-real transfer in autonomous driving by decoupling motion planning from vehicle control, utilizing an RL agent, kinematic bicycle model, trajectory-predicting agent, virtual vehicle, real system/vehicle, Stanley controller, and adaptive longitudinal alignment mechanisms (feed-forward/feed-back control, velocity control, freeze, fast-forward strategies).
The framework trains an RL agent in simulation using a kinematic bicycle model, distills its behavior into a trajectory-predicting agent, and then aligns this virtual trajectory with a real vehicle using a Stanley controller for lateral dynamics and adaptive longitudinal synchronization.
This approach enables robust zero-shot transfer of RL policies from simulation to reality by minimizing longitudinal and lateral errors without requiring high-fidelity simulators or vehicle-specific dynamics models.

Multi-Agent Reinforcement Learning for Deadlock Handling among Autonomous Mobile Robots

MARL-based Methodology for Deadlock Handling: introduces a structured framework for integrating Multi-Agent Reinforcement Learning into logistics planning, encompassing RL Problem Formulation, Model Selection, Algorithm Selection, and System Deployment, to address deadlock situations among Autonomous Mobile Robots.
This methodology leverages simulation models as learning environments to train MARL algorithms like PPO and IMPALA, particularly using Centralized Training with Decentralized Execution, to develop adaptive policies for collision avoidance and deadlock recovery in complex intralogistics scenarios.
The framework aims to enhance system resilience and operational efficiency by enabling AMRs to dynamically adapt to changing conditions and resolve conflicts, outperforming traditional rule-based or heuristic methods in congested environments.

Differentiable Semantic Meta-Learning Framework for Long-Tail Motion Forecasting in Autonomous Driving

SAML (Semantic-Aware Meta-Learning framework): introduces a novel framework for long-tail motion forecasting in autonomous driving, featuring a Map Encoder (encodes HD map data), Agent Encoder (encodes agent motion histories), Interaction-Aware Encoder (extracts context-aware features), Bayesian Tail Perceiver (quantifies motion tailness), Meta-Memory Adaptation (adapts to rare patterns), and Multi-modal Decoder (generates motion forecasts).
SAML quantifies motion rarity via semantically meaningful intrinsic (kinematic, geometric, temporal) and interactive (local, global risk) properties, which are fused into a continuous, uncertainty-aware Tail Index by the Bayesian Tail Perceiver.
The framework's Meta-Memory Adaptation module, guided by the Tail Index, couples a dynamic prototype memory with a MAML-based cognitive set mechanism for rapid adaptation to rare or evolving patterns.

HYBRID ACTION REINFORCEMENT LEARNING FOR QUANTUM ARCHITECTURE SEARCH

HyRLQAS (Hybrid-Action Reinforcement Learning for Quantum Architecture Search): introduces a unified framework that couples discrete gate placement and continuous parameter generation within a hybrid action space, including a Tensor-based Circuit Encoding (encodes circuit information), a Hybrid Policy Network (generates hybrid actions) with a Hybrid Policy Network Backbone (shared feature extractor), Hybrid Policy Network Discrete head (selects gate type/position), Hybrid Policy Network Param head (initializes gate parameters), and Hybrid Policy Network Refine head (refines existing parameters), an Environment (executes circuit, provides reward) with an Environment CPU (classical processing unit), Environment External optimizer (fine-tunes circuit parameters), and Environment Quantum circuit (executes quantum operations), and a Batch of Trajectories (stores experience tuples).
This framework jointly learns circuit topology and parameter initialization while dynamically refining previously placed gates through a reinforcement learning process, aiming to minimize molecular ground-state energy in a variational quantum eigensolver (VQE) environment.
HyRLQAS achieves lower energy errors and shorter circuits compared to discrete-only and continuous-only baselines by providing favorable parameter initializations and improved circuit structures, leading to more stable and reliable outcomes.

Shocks Under Control: Taming Transonic Compressible Flow over an RAE2822 Airfoil with Deep Reinforcement Learning

DRL (Deep Reinforcement Learning): introduces a framework for active flow control of transonic shock-boundary layer interactions over an RAE2822 airfoil using a high-fidelity CFD solver and synthetic jet actuation, employing DRL/PPO/TD3/CFD Solver/Synthetic Jet Actuation components.
The framework uses a fifth-order spectral Discontinuous Galerkin (DG) method with Adaptive Mesh Refinement (AMR) for accurate flow simulation.
The study investigates both on-policy PPO and off-policy TD3 algorithms, demonstrating superior performance of TD3 in achieving drag reduction while preserving lift dynamics.

QOC DAO - Stepwise Development Towards an AI Driven Decentralized Autonomous Organization

QOC DAO (Question-Option-Criteria Decentralized Autonomous Organization): introduces a structured, stepwise governance framework evolving from human-led to fully autonomous AI-driven processes by integrating the Question-Option-Criteria (QOC) model with AI agents.
The framework decomposes decisions into a Question, Options, and weighted Criteria, enabling structured, criterion-based evaluations that enhance transparency and fairness in Decentralized Autonomous Organizations (DAOs).
The stepwise integration involves human-driven, human-in-the-loop, and fully AI-driven stages, utilizing Large Language Models (LLMs) for automated evaluation support.

9th November 2025

CoFineLLM: Conformal Finetuning of Large Language Models for Language-Instructed Robot Planning

CoFineLLM (Conformal Finetuning of Large Language Models): introduces the first Conformal Prediction (CP)-aware fine-tuning framework for LLM-based robot planners, explicitly reducing prediction-set sizes and human intervention rates while maintaining CP coverage guarantees.
The framework integrates CP during training by simulating conformalization within mini-batches and employs a novel loss function combining cross-entropy with a CP-based term to penalize non-singleton prediction sets.
CoFineLLM utilizes Low-Rank Adaptation (LoRA) and a curriculum-based training scheme to optimize LLM parameters, demonstrating robustness in out-of-distribution scenarios and consistent improvements in help rates and prediction-set size.

FLEX: Continuous Agent Evolution via Forward Learning from Experience

FLEX (Forward Learning with Experience): introduces a gradient-free learning paradigm enabling LLM agents to continuously evolve through accumulated experience by constructing a structured experience library via continual reflection on successes and failures, with an LLM Agent, Experience Library, Updater, Actor, and Critic.
The framework employs a forward learning loop where an Actor explores to collect experiences, a Critic provides semantic feedback, and an Updater integrates distilled knowledge into a hierarchical experience library, guiding future reasoning.
FLEX demonstrates substantial performance improvements across mathematical reasoning, chemical retrosynthesis, and protein fitness prediction, establishing a scalable and inheritable continuous agent evolution.

AUTO-Explorer: Automated Data Collection for GUI Agent

AUTO-Explorer: introduces an automated data collection method for GUI agents, with a GUI Parser (detects UI elements), an Explore Module (determines next actions), a Difference Spot Module (detects new elements), a Critic Module (evaluates interaction significance), a Sampler (selects actions), and Environment Observation (provides GUI states), designed to autonomously parse and explore GUI environments for efficient data gathering.
The framework utilizes UI Automation (UIA), Optical Character Recognition (OCR), and icon template matching to parse GUI elements, enabling robust interaction with diverse software and web interfaces.
The system's exploration strategy involves comparing GUI states before and after actions to discover new elements, which are then sampled for subsequent interactions, and includes mechanisms for trajectory termination and error state identification.

The STATION: An Open-World Environment for AI-Driven Discovery

The STATION (An Open-World Environment for AI-Driven Discovery): introduces an open-world multi-agent environment that models a miniature scientific ecosystem, with Agents (autonomous researchers), Rooms (distinct functional spaces), Auxiliary Systems (background support mechanisms), and Data/Communication Structures (for interaction and persistence), enabling LLMs to autonomously pursue scientific discovery.
This framework allows AI agents to engage in long scientific journeys, including reading papers, formulating hypotheses, submitting code, performing analyses, and publishing results, all without centralized coordination.
The Station fosters emergent behavior and novel scientific breakthroughs by providing a persistent world where agents can explore, create, and collaborate, moving beyond rigid optimization paradigms.

GAIA: A General Agency Interaction Architecture for LLM-Human B2B Negotiation & Screening

GAIA (General Agency Interaction Architecture): introduces a governance-first framework for LLM-human agency in B2B negotiation and screening, defining Principal, Delegate (LLM agent), and Counterparty roles, with optional Critic and Moderator, structured by information-gated progression, dual feedback integration, and authorization boundaries.
This framework employs a formal state machine with commitment detection, Task-Completeness Index (TCI) tracking for information completeness, and structured escalation paths to ensure bounded authorization and human oversight.
GAIA provides a hybrid validation blueprint combining automated protocol metrics with human judgment to offer a reproducible specification for safe, efficient, and accountable AI delegation across various domains.

ROAR: Robust Accident Recognition and Anticipation for Autonomous Driving

ROAR (Robust Accident Recognition and Anticipation for Autonomous Driving): introduces a novel approach for accident detection and prediction, combining a Discrete Wavelet Transform (extracts multi-resolution features), a self-adaptive object-aware module (enhances spatial representations), and dynamic focal loss (mitigates class imbalance) to improve accuracy and robustness in autonomous driving.
The framework processes input video frames through an object detector and feature extractor, then refines these features using the self-adaptive object-aware module and DWT, before fusing them and passing through a GRU and Temporal Attention Fusion for anticipation probability.
ROAR integrates spatial, temporal, and hierarchical features, along with a time weight layer, to adjust temporal influence on predictions, demonstrating superior performance on real-world datasets under challenging conditions like sensor degradation and environmental noise.

Dataforge: A Data Agent Platform for Autonomous Data Engineering

Dataforge: introduces an autonomous data agent platform for tabular data, leveraging LLM reasoning and grounded validation to automatically perform data cleaning, hierarchical routing, and feature-level optimization through dual feedback loops.
The system embodies principles of being automatic, safe, and non-expert friendly, ensuring end-to-end reliability without human supervision by iteratively orchestrating grounded actions.
This framework transforms raw data into AI-ready data, addressing scalability and expertise dependence in data preparation for various AI applications.

A Low-Rank Method for Vision Language Model Hallucination Mitigation in Autonomous Driving

LRHM (Low-Rank Hallucination Mitigation): introduces a novel self-contained low-rank approach to automatically rank multiple candidate captions generated by multiple VLMs based on their hallucination levels, using only the captions themselves without requiring external references or model access.
The method constructs an embedding matrix from VLM-generated captions, applies Singular Value Decomposition to separate a low-rank consensus component from a sparse residual, and then uses the residual magnitude for hallucination scoring.
This parallelizable architecture achieves sub-second hallucination mitigation, significantly reducing inference time compared to debate approaches, making it practical for real-time autonomous driving applications by improving VLM trustworthiness in safety-critical scenarios.

8th November 2025

RadioSim Agent: Combining Large Language Models and Deterministic EM Simulators for Interactive Radio Map Analysis

RadioSim Agent: introduces an agentic framework that unifies LLM-based reasoning with deterministic EM solvers and vision-based analysis for interactive, multimodal, and explainable radio map generation.
The framework operates through a Reason-Act-Observe cycle, where an LLM interprets user intent, plans tasks, executes EM simulations via a tool library, and analyzes outputs using a vision-enabled LLM.
It enables users to provide natural-language instructions to perform simulations, visualize EM fields, and interrogate results directly within a unified agentic environment, bridging natural language understanding with physical modeling.

7th November 2025

STAIR: Stability criterion for Time-windowed Assignment and Internal adversarial influence in Routing and decision-making

STAIR (Stability criterion for Time-windowed Assignment and Internal adversarial influence in Routing and decision-making): introduces a novel average-cost-based stability criterion for multi-agent routing systems with adversarial agents, linking policy stability to operational metrics like rejected requests.
This framework incorporates time-window constraints and a wait-time-constrained stage cost to address the limitations of traditional queuing theory and discounted RL stability definitions in adversarial settings.
STAIR provides a more reliable assessment of long-term behavior and improved interpretability by removing reliance on arbitrary discount factors and better reflecting real-world service constraints.

TeaRAG: A Token-Efficient Agentic Retrieval-Augmented Generation Framework

TeaRAG (Token-Efficient Agentic Retrieval-Augmented Generation Framework): introduces a token-efficient agentic RAG framework that optimizes retrieved content density and reasoning step conciseness through a hybrid retrieval and a process-aware training paradigm, including an LLM Agent (controls workflow, plans, reasons, generates), Important Entity Recognition (identifies key entities), Subquery Generation (decomposes query into subqueries), Hybrid Context Retrieval (combines semantic and graph retrieval), Semantic Retrieval (retrieves document chunks), Graph Retrieval (retrieves knowledge triplets), Knowledge Association Graph (KAG) Construction (builds graph from chunks, triplets), Personalized PageRank (PPR) Filtering (filters KAG for relevant content), Summary Generation (summarizes retrieved content), Supervised Fine-Tuning (SFT) (initial training for reasoning format), Iterative Process-aware Direct Preference Optimization (IP-DPO) (iterative training for conciseness, generalization), Reward Design (calculates outcome, format, process rewards), Knowledge Matching (assesses evidence acquisition), and DPO Pair Construction (creates preferred/rejected reasoning paths).
TeaRAG compresses retrieved content by combining semantic and graph retrieval to build a Knowledge Association Graph, which is then filtered by Personalized PageRank to yield high-density, concise information.
The framework's two-stage training, including IP-DPO with process-aware rewards, generates high-quality preference data to iteratively optimize LLMs for more concise reasoning paths, significantly reducing output tokens while improving accuracy.

CONVERSE: Benchmarking Contextual Safety in Agent-to-Agent Conversations

CONVERSE introduces a dynamic benchmark for evaluating privacy and security risks in multi-turn agent-to-agent conversations, featuring a simulated user environment, an AI assistant, and an external agent interacting across three realistic domains with contextual attacks and pre-generated ground truth.
The benchmark models autonomous, multi-turn agent-to-agent conversations where malicious requests are contextually embedded within plausible discourse, testing data abstraction, tool use, and preference manipulation.
It evaluates seven state-of-the-art LLMs, revealing persistent vulnerabilities where privacy attacks succeed in up to 88% of cases and security breaches in up to 60%, highlighting a tension between utility and protection.

TAMAS: BENCHMARKING ADVERSARIAL RISKS IN MULTI-AGENT LLM SYSTEMS

TAMAS (Threats and Attacks in Multi-Agent Systems): introduces a benchmark to evaluate the robustness and safety of multi-agent LLM systems, comprising User, Agent Configuration (Centralized Orchestrator, Decentralized Collaboration, Sequential), Agent, Tools, Environment (Interface, Web, Database), Attack Vectors (Impersonation, Direct Prompt Injection, Indirect Prompt Injection, Contradicting Agents, Byzantine Agent, Colluding Agents), LLM Backbones, Underlying Frameworks (AutoGen, CrewAI), and Evaluation Metrics (Effective Robustness Score (ERS), ARIA Framework, Performance under No Attack (PNA)), designed to assess vulnerabilities across diverse attack types and interaction configurations.
The benchmark includes 300 adversarial instances across six attack types and five high-impact domains, evaluating performance on ten backbone LLMs and three agent interaction configurations from AutoGen and CrewAI frameworks.
The findings reveal that multi-agent LLM systems are highly susceptible to adversarial attacks, highlighting the urgent need for stronger defense mechanisms and robust design strategies.

Beyond Master and Apprentice: Grounding Foundation Models for Symbiotic Interactive Learning in a Shared Latent Space

SIL (Symbiotic Interactive Learning): introduces a framework for human-agent interaction that enables mutual co-adaptation through a shared latent task space, leveraging an Interaction/Feedback Interface, LLM-based Reasoning and Uncertainty Estimation, Command Parser, Shared Task Space for belief alignment, Memory Architecture with continual learning safeguards, Perception via Vision-Language Models, and an Action Executor.
This approach moves beyond the traditional master-apprentice model by allowing both human and agent to adapt reciprocally, improving interaction efficiency and robustness.
The framework explicitly represents, measures, and aligns human and agent beliefs, facilitating proactive clarification, adaptive suggestions, and shared plan refinement in dynamic real-world environments.

SELF-INTEREST AND SYSTEMIC BENEFITS: EMERGENCE OF COLLECTIVE RATIONALITY IN MIXED AUTONOMY TRAFFIC THROUGH DEEP REINFORCEMENT LEARNING

SI-DRL (Self-Interested Deep Reinforcement Learning): introduces a framework for self-interested AVs to achieve collective rationality in mixed autonomy traffic, utilizing an SI-DRL agent (Autonomous vehicle decision-maker) interacting with a Driving simulator (Dynamic traffic environment) through State (Vehicle/surrounding info input) inputs, Action (Lane change decisions output) outputs, and a Reward (Speed gain/lane change penalty) function, with a DQN (Q-value function approximator) and Experience Replay (Trajectory storage/sampling) for learning.
The framework demonstrates that self-interested AVs, trained with a simple reward design, can achieve Pareto-efficient Nash equilibria and improve overall traffic flow by fostering spatial organization, including intra-class platooning and inter-class segregation.
This research validates the emergence of collective rationality through DRL simulations, showing alignment with game-theoretical predictions and suggesting that enhancing spatial organization benefits all road users in mixed-autonomy systems.

Introducing LongCat-Flash-Thinking: A Technical Report

LongCat-Flash-Thinking: introduces an efficient 560-billion-parameter open-source Mixture-of-Experts (MoE) reasoning model, cultivated through a two-phase pipeline of Long CoT Cold-Start Training (initial reasoning capability building) and Large-Scale RL (advanced capability scaling).
The framework employs a domain-parallel training scheme for decoupled optimization across STEM, Code, and Agentic tasks, fusing resulting expert models into a nearly Pareto-optimal model, powered by the DORA (Dynamic ORchestration for Asynchronous rollout) system.
This system, a large-scale RL framework, delivers a greater than threefold training speedup over synchronous methods, achieving state-of-the-art performance on complex reasoning tasks with exceptional efficiency, reducing token consumption by 64.5% on AIME-25.

6th November 2025

Environment Agnostic Goal-Conditioning, A Study of Reward-Free Autonomous Learning

EAGC (Environment Agnostic Goal-Conditioning): introduces a method to transform regular reinforcement learning environments into goal-conditioned environments, enabling agents to learn tasks autonomously and reward-free by selecting their own goals.
The approach utilizes a wrapper within the Stable-Baselines3 framework, incorporating modular goal evaluation and selection strategies like uniform sampling, novelty seeking, and intermediate success rate selection.
EAGC demonstrates comparable performance to externally guided baselines in terms of task solving and training times, while also enabling generic agent training prior to specific use cases.

Jr. AI Scientist and Its Risk Report: Autonomous Scientific Exploration from a Baseline Paper

Jr. AI Scientist: introduces an autonomous AI scientist system that mimics a novice student researcher's workflow, encompassing automatic idea generation, implementation and validation of proposed ideas, and research paper writing.
The system leverages LLMs for idea generation and novelty checks, and powerful coding agents for handling complex, multi-file implementations and rigorous experimentation.
It significantly improves generated paper quality by utilizing baseline paper resources, LaTeX sources, PDFs, and codebases across all research pipeline stages, while also reporting identified risks.

Promoting Sustainable Web Agents: Benchmarking and Estimating Energy Consumption Through Empirical and Theoretical Analysis

Web Agent Sustainability Benchmarking: introduces an empirical and theoretical framework to quantify the energy consumption and CO2 emissions of web agents, advocating for dedicated sustainability metrics in their evaluation.
The empirical evaluation benchmarks five open-source LLM-driven web agents on various GPUs using the Mind2Web benchmark, while theoretical estimation is applied to agents with proprietary LLMs like GPT-4.
The research highlights that web agent design and LLM choice significantly impact energy consumption, demonstrating that higher energy use does not always correlate with better performance, and emphasizes the need for transparency in model parameters for accurate estimation.

ForeRobo: Unlocking Infinite Simulation Data for 3D Goal-driven Robotic Manipulation

ForeRobo: introduces a generative robotic agent that autonomously acquires manipulation skills by integrating generative simulations with classical control.
It operates through a self-guided propose-generate-learn-actuate cycle, leveraging LLMs for task proposal and ForeGen for infinite simulation data generation.
The ForeFormer model, trained on simulated data, predicts 3D goal states for zero-shot sim-to-real transfer and multi-entity generalization in real-world robotic manipulation.

Studying the Effect of Explicit Interaction Representations on Learning Scene-level Distributions of Human Trajectories

GMOP (Graph-based Motion Prediction): introduces a normalizing flow-based model to capture joint distributions of human trajectories by factorizing the joint distribution using a learned directed acyclic interaction graph.
The framework investigates various explicit interaction representations, including Euclidean distance, crossing, and hypothetical crossing heuristics (and their flipped variants), to construct the interaction graph and assess their effect on prediction performance.
GMOP integrates RNN encoders/decoders, GNNs, and an MLP classifier to process past trajectories and static environment context, learning agent interactions for robust scene-level future trajectory prediction.

Deep reinforcement learning based navigation of a jellyfish-like swimmer in flows with obstacles

DRL Framework with SAC: introduces a physics-aware machine learning framework for controlling a bio-inspired jellyfish-like swimmer to navigate complex fluid environments with obstacles, by augmenting the agent's state representation with real-time hydrodynamic forces and torque.
This framework utilizes a Soft Actor-Critic (SAC) algorithm for policy learning, an A* algorithm for pathfinding, and an immersed boundary method for fluid-structure interaction simulations, enabling the swimmer to perceive wall proximity and orientation through distinct force signatures.
The explicit force feedback facilitates earlier, smoother maneuvers and exploitation of wall effects for efficient turning, leading to enhanced navigation efficiency and robust underwater exploration capabilities in confined, obstacle-laden spaces.

Benchmarking and Studying the LLM-based Agent System in End-to-End Software Development

E2EDevBench (End-to-End Software Development Benchmark): introduces a comprehensive framework for benchmarking LLM-based agents in end-to-end software development, integrating a challenging dataset construction process with a hybrid evaluation methodology.
The framework includes Dataset Construction (collects, filters, and samples PyPI projects to generate requirements) and an Evaluation Framework (combines automated Test Case Migration and Objective Requirement Verification using an LLM-as-Judge).
This approach provides a more realistic and robust assessment of agent capabilities by mitigating data leakage, simulating authentic development workflows, and enabling fair comparisons of different agent architectures.

DR. WELL: Dynamic Reasoning and Learning with Symbolic World Model for Embodied LLM-Based Multi-Agent Collaboration

DR. WELL (Dynamic Reasoning and Learning with Symbolic World Model for Embodied LLM-Based Multi-Agent Collaboration): introduces a decentralized neurosymbolic framework for cooperative multi-agent planning, enabling LLM-based agents to collaborate on interdependent tasks through a dynamic world model and a two-phase negotiation protocol.
The framework allows agents to propose and commit to tasks, then independently generate and refine symbolic plans using a shared world model that captures environment state and past experience, ensuring coordination without detailed trajectory sharing.
By integrating symbolic reasoning with LLM planning, DR. WELL improves coordination efficiency, task completion rates, and interpretability in multi-agent environments, adapting strategies across episodes.

RAGalyst: Automated Human-Aligned Agentic Evaluation for Domain-Specific RAG

RAGalyst: introduces an automated, human-aligned agentic framework for domain-specific RAG evaluation, featuring a document preprocessing module, an agentic QA generation pipeline with LLM-based filtering, and an LLM-as-a-Judge evaluation module with prompt-optimized metrics.
The framework generates high-quality synthetic question-answering datasets from source documents and refines Answer Correctness and Answerability metrics to strongly correlate with human annotations.
RAGalyst enables rigorous benchmarking of RAG systems across diverse domains like military operations, cybersecurity, and bridge engineering, identifying domain-specific trade-offs and informing design choices for reliable RAG systems.

Beyond Shortest Path: Agentic Vehicular Routing with Semantic Context

PAVe (Personalized Agentic Vehicular Routing): introduces a hybrid agentic assistant that augments classical pathfinding algorithms with contextual reasoning, including an LLM agent, Routing Engine Tool, Geospatial Context Tool, Contextual Route Assessment Tool, Central Orchestrator, POIFinder Module, Geospatial Cache, Urban Road Network Graph, and Dijkstra Algorithm.
This framework leverages an LLM agent for semantic reasoning and contextual understanding to evaluate candidate routes generated by a multi-objective Dijkstra algorithm against user-provided tasks, preferences, and avoidance rules.
PAVe aims to create personalized, adaptive, and scalable solutions for urban mobility optimization by integrating complex user intent with efficient algorithmic pathfinding using real-world urban datasets and geospatial information.

Speed at the Cost of Quality? The Impact of LLM Agent Assistance on Software Development

LLM Agent Impact Evaluation Framework: introduces a study estimating the causal effect of LLM agent assistants (specifically Cursor) on software development velocity and quality, utilizing a DiD Design (causal inference), Staggered Adoption (temporal variation), Propensity Score Matching (control group selection), Panel GMM Models (dynamic interaction analysis), GitHub Data Collection (repository metrics), and SonarQube Metrics Calculation (code quality assessment).
The study finds that Cursor adoption leads to a significant but transient increase in development velocity, alongside a significant and persistent increase in static analysis warnings and code complexity.
Further analysis reveals that the accumulated technical debt, indicated by increased warnings and complexity, subsequently causes a long-term slowdown in development velocity, creating a self-reinforcing cycle.

GUI-360°: A COMPREHENSIVE DATASET AND BENCHMARK FOR COMPUTER-USING AGENTS

GUI-360°: introduces a comprehensive dataset and benchmark suite for computer-using agents, featuring an LLM-augmented, largely automated pipeline for query sourcing, environment-template construction, task instantiation, batched execution, and LLM-driven quality filtering.
The framework includes a specialized TrajAgent for automatic trajectory collection, comprising a MAgent for task decomposition, EAgents for perception and action execution, and a Recorder for logging multi-modal data.
GUI-360° supports three canonical tasks: GUI grounding, screen parsing, and action prediction, providing full-resolution screenshots, accessibility metadata, and reasoning traces across Windows office applications.

Trustworthy LLM-Mediated Communication: Evaluating Information Fidelity in LLM as a Communicator (LAAC) Framework in Multiple Application Domains

LAAC (LLM as a Communicator): introduces a multi-agent framework that positions LLMs as intelligent communication intermediaries, featuring an Interview Agent (extracts sender intent), an Extraction Agent (generates structured knowledge), and a Query Agent (responds to recipient queries), to facilitate authentic knowledge exchange.
This framework aims to overcome the "AI-generated inflation and compression" cycle by capturing sender intent through structured dialogue and enabling recipients to interact directly with this structured knowledge.
The paper systematically evaluates LAAC's trustworthiness across information capture fidelity, reproducibility, and query response integrity, revealing measurable trust gaps that require addressing for reliable deployment.

BAPPA: Benchmarking Agents, Plans, and Pipelines for Automated Text-to-SQL Generation

BAPPA (Benchmarking Agents, Plans, and Pipelines for Automated Text-to-SQL Generation): introduces three multi-agent LLM pipelines, Multi-Agent Discussion Pipeline (iterative critique and refinement), Planner-Coder Pipeline (structured planning and execution), and Coder-Aggregator Pipeline (diverse candidate generation and selection), to enhance Text-to-SQL generation.
The paper systematically benchmarks these pipelines across various open-source LLMs to evaluate their intrinsic planning, reasoning, and coding abilities for converting natural language questions into SQL queries.
The research demonstrates that multi-agent collaboration and structured reasoning can significantly improve SQL generation quality and robustness, especially for smaller and mid-scale LLMs.

Agentmandering: A Game-Theoretic Framework for Fair Redistricting via Large Language Model Agents

Agentmandering: introduces a game-theoretic framework for fair redistricting, simulating turn-based negotiation between LLM agents representing opposing political interests, with Republican Agent (LLM-powered partisan agent), Democratic Agent (LLM-powered partisan agent), District Information (State political profile data), Choose-and-Freeze Protocol (Turn-based negotiation game), Candidate Generator (Generates feasible districting plans), Unpartitioned Region (Current unassigned territory), Candidate Maps (Set of generated districting plans), Selectable Districts (Districts from chosen map), and Frozen District (Permanently assigned district).
The framework leverages the Choose-and-Freeze protocol, where LLM agents alternate selecting preferred districting plans and freezing individual districts from a set of candidate maps.
This approach aims to produce districting outcomes that are robust against partisan manipulation, reduce bias, and achieve lower variance compared to traditional methods.

DETECTING SILENT FAILURES IN MULTI-AGENTIC AI TRAJECTORIES

Dataset Curation Pipeline: introduces a comprehensive pipeline for curating datasets from agentic traces for anomaly detection, encompassing Multi-Agentic AI System trace collection, LLM span and trace information extraction, feature engineering, inter-annotator ground truth definition, automated normal/anomaly labeling, and final dataset generation.
The paper addresses the challenge of detecting silent failures in multi-agentic LLM systems by curating two benchmark datasets from agentic traces and evaluating supervised and semi-supervised anomaly detection methods, achieving high accuracies.
This work provides the first systematic study of anomaly detection in Multi-Agentic AI systems, offering datasets, benchmarks, and insights to guide future research.

ArchPilot: A Proxy-Guided Multi-Agent Approach for Machine Learning Engineering

ArchPilot: introduces a multi-agent system for cost-efficient Neural Architecture Search (NAS) that explicitly decouples generation, evaluation, and orchestration into three collaborating agents: Orchestration Agent (coordinates search, manages memory, budgets), Generation Agent (generates, improves, debugs architectures), and Evaluation Agent (executes proxy training, optimizes proxies).
This framework leverages multi-proxy evaluation with adaptive reweighting and a restart-enabled Monte Carlo Tree Search (MCTS) algorithm to prioritize high-potential candidates, minimizing reliance on expensive full training runs.
The system achieves efficient ML engineering under limited budgets by exploring a significantly larger portion of the search space and outperforms state-of-the-art baselines on the MLE-Bench benchmark.

Direct Semantic Communication Between Large Language Models via Vector Translation

Dual-Encoder Framework: introduces direct semantic communication between LLMs via vector translation, utilizing a Dual-Encoder Translator to map semantic representations from a LLaMA-2-7B Source to a Mistral-7B Target, which are then integrated via an Injection Mechanism to produce an Enhanced Output from a Semantic Input.
This framework enables LLMs to share meaning directly at latent speed, bypassing token serialization, by learning bidirectional vector translations and conservatively injecting these translated vectors into the target model's internal processing pipeline.
The approach demonstrates computational stability and effective semantic transfer across diverse domains, revealing a 2.01:1 bidirectional asymmetry suggesting general-purpose LLMs develop more transferable representations than instruction-tuned variants.

PEFA-AI: Advancing Open-source LLMs for RTL generation using Progressive Error Feedback Agentic-AI

PEFA-AI (Progressive Error Feedback Agentic-AI): introduces an agentic flow with User Agent (provides prompt/testbench), Master Agent (parses input, manages agents), Code Generator (generates RTL code), Code Executor (lints, compiles, executes code), Log Summarizer (summarizes error logs), Summary Generator (summarizes group chat), and Optional Human Feedback (user intervention for failures), designed for autonomous Register-Transfer Level (RTL) generation using specialized LLMs and hardware simulation tools.
This framework employs a novel self-correcting mechanism that leverages iterative error feedback to progressively refine generated RTL code, checking for compilation, functional correctness, and synthesizable constructs.
The approach demonstrates state-of-the-art pass rates on open-source natural language-to-RTL datasets, bridging the performance gap between open- and closed-source LLMs while being efficient in token counts.

Collaborative Agents for Automated Program Repair in Ruby

RAMP (Ruby Automated Multi-agent Program repair): introduces a lightweight, feedback-driven framework for Ruby program repair, employing a team of collaborative agents including a Feedback Integrator Agent (produces initial self-reflection, integrates execution feedback), Test Designer Agent (generates guiding test cases), Programmer Agent (produces candidate repair program), and Test Executor Agent (runs candidate repairs, produces verdicts and traces).
This framework formulates program repair as an iterative process where agents reflect on errors, generate targeted tests, propose candidate fixes, and validate them through execution feedback, refining solutions until a correct one is found or the iteration budget is exhausted.
RAMP avoids reliance on large multilingual repair databases or costly fine-tuning, operating directly on Ruby code through lightweight prompting and test-driven feedback, achieving state-of-the-art performance on the XCODEEVAL benchmark for Ruby.

Post-Training LLMs as Better Decision-Making Agents: A Regret-Minimization Approach

ITERATIVE RMFT (ITERATIVE REGRET-MINIMIZATION FINE-TUNING): introduces a post-training procedure that iteratively distills low-regret decision trajectories, generated by a base LLM, back into the model via supervised fine-tuning to enhance decision-making abilities.
This self-improving approach leverages the regret metric to automatically elicit and reinforce the LLM's decision-making capabilities, including self-generated reasoning rationales, across diverse online decision-making environments.
Empirical results demonstrate that ITERATIVE RMFT improves LLMs' performance by achieving lower regret values, better exploration-exploitation tradeoffs, and enhanced generalization across various task specifications and real-world contexts.

Agentic Refactoring: An Empirical Study of AI Coding Agents

Agentic Refactoring: introduces a large-scale empirical study of AI agent-generated refactorings in real-world open-source Java projects, analyzing 15,451 refactoring instances across 12,256 pull requests and 14,998 commits.
The study reveals that agentic refactoring is common, dominated by low-level, consistency-oriented edits, and primarily driven by maintainability (52.5%) and readability (28.1%) concerns.
Agentic refactoring yields small but statistically significant improvements in structural metrics, particularly for medium-level changes, but currently fails to consistently reduce the overall count of known design and implementation smells.

ReGen: GENERATIVE ROBOT SIMULATION VIA INVERSE DESIGN

ReGen (Generative Robot Simulation via Inverse Design): introduces a generative simulation framework that automates simulation design by inferring plausible scenarios and environments from a robot's behavior and textual description, leveraging LLMs to synthesize scenarios via a directed graph translated into a symbolic program for simulation.
The framework supports augmenting simulations, controllable counterfactual scenario generation, reasoning about agent cognition and mental states, and handling distinct sensing modalities.
ReGen is demonstrated in autonomous driving and robot manipulation tasks, generating diverse, complex simulated environments with high success rates and enabling controllable generation for corner cases.

DIAP: A Decentralized Agent Identity Protocol with Zero-Knowledge Proofs and a Hybrid P2P Stack

DIAP (Decentralized Interstellar Agent Protocol): introduces a novel framework for agent identity and communication that binds identity to an immutable IPFS CID and uses Zero-Knowledge Proofs (ZKP) for stateless ownership proof, enabling persistent, verifiable, and trustless interoperability.
The architecture employs a layered stack, integrating Libp2p GossipSub for discovery and Iroh (QUIC-based) for high-performance direct interaction, alongside a privacy mechanism using EncryptedPeerID.
A key engineering contribution is the zero-dependency ZKP SDK, achieved by pre-compiling the Noir circuit using the UniversalNoirManager, simplifying deployment for developers.

5th November 2025

Inter-Agent Trust Models: A Comparative Study of Brief, Claim, Proof, Stake, Reputation and Constraint in Agentic Web Protocol Design—A2A, AP2, ERC-8004, and Beyond

Inter-Agent Trust Models: introduces a comparative study of six trust models—Brief (endorsed claims/credentials), Claim (self-proclaimed identity/abilities), Proof (cryptographic verification/attestations), Stake (economic collateral/slashing), Reputation (community feedback/trust scores), and Constraint (technical limits/sandboxing)—and a tiered blueprint (T0-T3) for applying them in agentic web protocols.
The paper analyzes how existing protocols like A2A, AP2, and ERC-8004 implement these trust models, considering their strengths, weaknesses, and mitigation of LLM-specific fragilities.
It concludes by recommending hybrid trust model architectures and design guidelines for safer, interoperable, and scalable agent economies, emphasizing a "trustless-by-default" approach for high-impact actions.

Scaling Agent Learning via Experience Synthesis

DREAMGYM (Scaling Agent Learning via Experience Synthesis): introduces a unified and scalable RL framework that synthesizes diverse experiences for LLM agent training, utilizing an Agent (LLM-based decision maker), a Reasoning Experience Model (synthesizes states/rewards via CoT), an Experience Replay Buffer (stores/retrieves diverse trajectories), a Curriculum Task Generator (creates challenging task variations), and a Scalable LLM Serving Infra (hosts core components).
The framework addresses challenges in RL training for LLM agents by generating synthetic, reasoning-based experiences, thereby reducing reliance on costly real-environment rollouts and improving sample efficiency.
It enables effective online curriculum learning through adaptive task generation and ensures stable policy improvement by providing consistent state transitions and informative reward signals.

A Modular, Data-Free Pipeline for Multi-Label Intention Recognition in Transportation Agentic AI Applications

DMTC (Data-less Multi-label Text Classification): introduces a modular, data-free pipeline for multi-label intention recognition in transportation agentic AI applications, leveraging LLMs for synthetic data, Sentence-T5 for semantic embeddings, and a novel online focal-contrastive loss for robust multi-label classification.
This approach eliminates the need for costly data collection and manual annotation, enhancing accuracy in fine-grained, multi-label intention understanding for agentic AI systems.
DMTC achieves state-of-the-art performance, outperforming traditional and LLM-based baselines with a Hamming loss of 5.35% and an AUC of 95.92%, laying groundwork for autonomous, intention-aware agents.

Hybrid Fact-Checking that Integrates Knowledge Graphs, Large Language Models, and Search-Based Retrieval Agents Improves Interpretable Claim Verification

Hybrid Fact-Verification Pipeline: introduces a modular, real-time fact-checking system that integrates Knowledge Graphs, LLMs, and search-based retrieval agents to improve interpretable claim verification, which includes Claim Input (natural language statement), Entity Linking (detects, disambiguates entities), KG Retrieval (fetches one-hop triples), Evidence Ranking (scores semantic relevance), Classifier (assigns claim label), Web Retrieval (rewrites query, retrieves snippets), Reannotation Study (validates ambiguous cases), and a Fallback Strategy (triggers web search).
The pipeline employs a KG-first strategy for high precision and interpretability, with a web-based retrieval fallback for broader coverage when KG evidence is insufficient.
The system achieves high F1 scores on benchmarks like FEVER without task-specific fine-tuning and uncovers valid evidence for claims initially labeled as "Not Enough Information" through a reannotation study.

Toward Autonomous Engineering Design: A Knowledge-Guided Multi-Agent Framework

Knowledge-Guided Multi-Agent Framework: introduces a novel multi-agent reasoning framework for autonomous engineering design, incorporating specialized LLM agents (Graph Ontologist, Design Engineer, Systems Engineer) and a human Manager to guide the iterative design and review process.
The framework leverages knowledge graphs, generated by the Graph Ontologist from existing literature, to imbue the Design Engineer and Systems Engineer LLM agents with domain-specific expertise for generating and evaluating airfoil designs.
This approach demonstrates a path toward improving efficiency and quality in engineering design by combining LLM knowledge curation with established engineering practices and human-in-the-loop validation.

RefAgent: A Multi-agent LLM-based Framework for Automatic Software Refactoring

RefAgent (A Multi-agent LLM-based Framework for Automatic Software Refactoring): introduces a multi-agent LLM-based framework for end-to-end software refactoring, comprising a Context-Aware Planner Agent (identifies opportunities, plans refactoring), Refactoring Generator Agent (generates refactored Java code), Compiler Agent (compiles code, addresses errors), and Tester Agent (tests functionality, fixes failures) to dynamically adapt and autonomously make decisions.
The framework leverages specialized LLM agents with tool-calling capabilities and iterative feedback loops to identify refactoring opportunities, generate code, ensure compilation, and preserve functionality.
RefAgent achieves high unit test pass rates, reduces code smells, and improves quality attributes across Java projects, outperforming single-agent approaches and aligning with developer refactorings.

Fiedler-Based Characterization and Identification of Leaders in Semi-Autonomous Networks

External Observer-Based Leader Identification: introduces a data-driven algorithm that identifies leader nodes in semi-autonomous consensus networks by processing time series of agent states to estimate the Fiedler vector, sort its components, determine the number of leaders, and finally identify the leader nodes.
This framework leverages the concept of relative tempo, which relates agents' steady-state velocities to the Fiedler vector, enabling leader identification without prior knowledge of the network topology.
The approach unifies graph analysis with data-driven inference, providing insights into how leader influence manifests in the network's dynamical response.

Human-AI Co-Embodied Intelligence for Scientific Experimentation and Manufacturing

APEX (Agentic-Physical Experimentation) system: introduces human-AI co-embodied intelligence, integrating human researchers/operators (precise execution, control), agentic AI (memory, reasoning, planning, feedback) with its Planning, Step-tracking, Context, and Analysis agents, and a wearable MR hardware platform (MR Goggles) (captures data, provides guidance) for real-time multimodal perception (interprets video, hand/eye tracking), adaptive plan (dynamic procedure adjustment), and feedback (real-time guidance, alerts) in scientific experimentation and manufacturing.
This framework unifies multimodal perception, multi-agent reasoning, and mixed-reality interaction to enable AI agents to perceive, reason, and act in real-world scenarios, providing 3D visual guidance, error detection, and automated documentation.
APEX transforms complex manual fabrication into autonomous, traceable, interpretable, and scalable processes, significantly improving reproducibility, skill transfer, and real-time error correction for both expert and novice users.

Outbidding and Outbluffing Elite Humans: Mastering Liar's Poker via Self-Play and Reinforcement Learning

Solly: introduces an AI agent that masters reduced-format Liar's Poker against elite humans and LLMs, utilizing self-play, the R-NaD (Regularized Nash Dynamics) actor-critic algorithm, and a Policy Network (MLP) with State, Action, Policy Head, and Value Head components.
The agent demonstrates elite human-level performance in both heads-up and multi-player settings, outperforming LLMs by developing novel bidding strategies and effective randomized play.
This research marks the first AI to achieve elite human play in multi-player Liar's Poker, a game characterized by extensive multi-player engagement and a rebid feature, while using relatively limited compute resources.

AnaFlow: Agentic LLM-based Workflow for Reasoning-Driven Explainable and Sample-Efficient Analog Circuit Sizing

AnaFlow: introduces an agentic LLM-based workflow for analog circuit sizing, employing specialized LLM agents (Explainer, Matching Finder, DC Goal Setter, Initial Design Generator, DC Reviewer, DC Sizer, Specs Reviewer, Reasoning Sizer, Advisor Reviewer, Equipped Sizer) that collaborate with simulation tools (DC (.op) Simulator, Full Simulator, External Optimizer) and Memory to achieve reasoning-driven, sample-efficient, and explainable circuit sizing.
The framework mimics an expert analog designer's cognitive workflow, breaking the sizing task into four phases: circuit understanding, DC-OP-focused sizing, reasoning-only sizing, and optimizer-equipped sizing, ensuring a reliable and explainable path to optimized solutions.
By integrating LLM-based reasoning with simulation and optimization tools, the system significantly reduces required simulations, provides human-interpretable design rationales, and learns from its optimization history to accelerate convergence.

The OpenHands Software Agent SDK: A Composable and Extensible Foundation for Production Agents

OpenHands Software Agent SDK: introduces a toolkit for implementing software development agents, providing a complete architectural redesign of agent components for the OpenHands framework, built on a modular SDK architecture with four decoupled packages.
The SDK integrates native sandboxed execution, lifecycle control, model-agnostic multi-LLM routing, and built-in security analysis to offer a practical foundation for prototyping and deploying agents at scale.
The framework supports seamless local-to-remote execution portability, integrated REST/WebSocket services, and various interactive interfaces for human interaction, demonstrating strong performance on SWE-Bench Verified and GAIA benchmarks.

LiveTradeBench: Seeking Real-World Alpha with Large Language Models

LiveTradeBench: introduces a live trading environment for evaluating LLM agents in realistic and evolving markets, featuring live data streaming, a portfolio-management abstraction, and multi-market evaluation across U.S. stocks and Polymarket prediction markets.
The framework enables LLM agents to observe real-time market prices, news, and their portfolio, then output percentage allocations that balance risk and return, integrating tool use, memory, and reasoning capabilities.
Evaluations of 21 LLMs reveal that high general reasoning scores do not guarantee superior trading outcomes, models exhibit distinct portfolio styles, and some LLMs effectively adapt decisions using live signals, highlighting a gap between static evaluation and real-world financial competence.

PerfDojo: Automated ML Library Generation for Heterogeneous Architectures

PerfDojo: introduces a novel automatic optimization methodology, PerfLLM, for generating ML libraries for heterogeneous architectures, with Finetuned LLM, Embedding, Policy Network, Target Network, Replay Buffer, Loss Computation, Reward Function, Compile and Execute, Code Representation, Transformations, and Applicability Detection components, enabling effective code optimization without prior hardware knowledge.
The framework frames code optimization as a Reinforcement Learning game within an environment that uses a human-readable, mathematically-inspired code representation to ensure semantic validity throughout transformations.
This approach achieves significant performance gains across diverse CPU and GPU architectures by leveraging LLMs and RL to discover high-performance code transformations.

U2F: Encouraging SWE-Agent to Seize Novelty without Losing Feasibility

U2F (Unknown Unknowns to Functional solutions): introduces a cognitive-inspired, uncertainty-embracing multi-agent architecture for systematically surfacing "Unknown Unknowns" in software engineering, featuring a Discovery Agent, Exploration Agent, and Integration Agent, supported by cognitive enhancement mechanisms and human-AI collaboration.
The framework operationalizes Unknown Unknowns discovery through cross-domain analogical reasoning, reverse thinking, and external validation, enabling LLMs to engage in deep, modular reasoning across the innovation process.
U2F demonstrates improved novelty and semantic novelty in solutions while maintaining feasibility, leveraging uncertainty as a source of innovation in software engineering tasks.

HaluMem: Evaluating Hallucinations in Memory Systems of Agents

HaluMem (Hallucination in Memory Benchmark): introduces the first operation-level hallucination evaluation benchmark for memory systems, comprising memory extraction, memory updating, and memory question answering tasks.
This benchmark comprehensively reveals hallucination behaviors across different operational stages of interaction by defining stage-specific gold standards and evaluation metrics.
HaluMem constructs two user-centric, multi-turn human-AI interaction datasets, HaluMem-Medium and HaluMem-Long, to support evaluation across various context scales and task complexities.

ROSBag MCP Server: Analyzing Robot Data with LLMs for Agentic Embodied AI Applications

ROSBag MCP Server: introduces an MCP server for analyzing ROS and ROS 2 bag files, enabling natural language interaction with robotic datasets through LLMs and VLMs, featuring LLM Providers, MCP Client/LLM UI, MCP Lab, MCP Host, ROSBag MCP Server, Python3 rosbags library, Filesystem, ROS bags folder, Toolset, JSON-RPC, and stdio.
The framework provides domain-specific tools for trajectory analysis, laser scan processing, coordinate frame transformations, and time series visualization, bridging complex robotic data with conversational AI interfaces.
It includes a lightweight UI (MCP Lab) for benchmarking different LLMs and VLMs, demonstrating significant disparities in tool-calling capabilities and performance across models.

RAGBOOST: EFFICIENT RETRIEVAL-AUGMENTED GENERATION WITH ACCURACY-PRESERVING CONTEXT REUSE

RAGBOOST (Efficient Retrieval-Augmented Generation with Accuracy-Preserving Context Reuse): introduces an efficient RAG system that achieves high cache reuse without sacrificing accuracy through accuracy-preserving context reuse, with Context Index (tracks KV-cache status), Context Ordering (reorders documents for reuse), Context Deduplication (removes redundant documents), Contextual Hints (preserves reasoning fidelity), and KV-cache (stores key-value pairs).
The system detects overlapping retrieved items across concurrent sessions and multi-turn interactions, using efficient context indexing, ordering, and de-duplication to maximize reuse while maintaining reasoning fidelity with contextual hints.
RAGBOOST seamlessly integrates with existing LLM inference engines, improving prefill performance by 1.5–3× and preserving or enhancing reasoning accuracy across diverse RAG and agentic AI workloads.

Towards Realistic Project-Level Code Generation via Multi-Agent Collaboration and Semantic Architecture Modeling

PROJECTGEN (Multi-Agent Framework): introduces a multi-agent framework for project-level code generation, decomposing the process into architecture design, skeleton generation, and code filling stages, with each stage involving a generation agent (ArchAgent, SkeletonAgent, CodeAgent) and a judging agent (JudgeA, JudgeS, JudgeC) for iterative refinement and memory-based context management, utilizing a Semantic Software Architecture Tree (SSAT) as a structured architecture representation.
The framework leverages SSAT to bridge the semantic gap between user requirements and source code, enabling LLMs to interpret architectural intent and progressively generate implementation-level artifacts.
Iterative refinement, guided by judge feedback and memory-based context management, mitigates error propagation and ensures overall integrity and correctness throughout the project generation process.

EQ-Negotiator: Dynamic Emotional Personas Empower Small Language Models for Edge-Deployable Credit Negotiation

EQ-Negotiator (Dynamic Emotional Personas Empower Small Language Models for Edge-Deployable Credit Negotiation): introduces a novel framework that equips SLMs with dynamic emotional personas for edge-deployable credit negotiation, integrating game theory and a Hidden Markov Model to learn and track debtor emotional states.
This framework enables SLMs to strategically adapt emotional responses in real-time, counter manipulation, and uphold ethical standards, outperforming larger LLMs in debt recovery and negotiation efficiency.
By transforming persona modeling from static profiles to dynamic emotional architectures, EQ-Negotiator establishes strategic emotional intelligence as a critical factor for effective, ethical, and privacy-preserving AI negotiators on the edge.

Auditing M-LLMs for Privacy Risks: A Synthetic Benchmark and Evaluation Framework

PRISM: introduces a novel framework and benchmark for auditing M-LLMs for privacy risks by generating synthetic multi-modal social media data and evaluating cross-modal privacy inference capabilities using a multi-agent architecture.
The framework includes a data generation workflow that creates realistic user profiles and corresponding multi-modal posts, and a multi-agent inference architecture with specialized LLMs for textual, image, and multi-modal synthesis.
Experiments demonstrate that M-LLMs significantly outperform human performance in inferring sensitive attributes from multi-modal data, highlighting the urgent need for robust privacy defenses.

From Measurement to Expertise: Empathetic Expert Adapters for Context-Based Empathy in Conversational AI Agents

Empathetic Expert Adapters (EEA): introduces a novel framework for developing and evaluating context-specific empathetic LLMs by analyzing real human-AI conversations, defining task-specific empathy patterns, generating synthetic conversations, measuring empathy with reward models, and training context-specific expert adapters.
The framework leverages a synthetic multi-turn conversational generation pipeline using GPT-4o and Llama-3-8B-Instruct to create empathy-steered dialogues, which then inform the training of LoRA adapters on a frozen LLM backbone.
Empirical results demonstrate that EEA significantly reduce the gap between perceived and desired empathy, outperforming baseline and system prompt approaches in maintaining empathy across multi-turn conversations.

A PROPRIETARY MODEL-BASED Safety RESPONSE FRAMEWORK FOR AI AGENTS

Caizhi-Safety-Control-Model: introduces a novel safety response framework designed to safeguard LLMs at both input and output levels, including a Safety Risk Classification Model (classifies user queries), a Sensitivity Check Module (evaluates unsafe queries), a Real-time Knowledge Base and Dynamic Retrieval (provides updated information), an Interpretation LLM (generates grounded responses), and a Response Decision Logic (orchestrates query handling).
The framework employs a supervised fine-tuning-based safety classification model at the input level, utilizing a four-tier taxonomy (Safe, Unsafe, Conditionally Safe, Focused Attention) for precise risk identification and differentiated handling of user queries.
At the output level, the framework integrates Retrieval-Augmented Generation (RAG) with a specifically fine-tuned Interpretation LLM, ensuring all responses are grounded in a real-time, trustworthy knowledge base to eliminate information fabrication and enable result traceability.

ALAS: TRANSACTIONAL AND DYNAMIC MULTI-AGENT LLM PLANNING

ALAS (Transactional and Dynamic Multi-Agent LLM Planning): introduces a five-layer architecture including Workflow Blueprinting Layer (defines task specifications), Agent Factory & Canonical IR Layer (instantiates agents and compiles to IR), Runtime Execution & Localized Repair Layer (manages execution with policies and logs), Revalidation Layer (re-checks feasibility post-repair), and Supervision Layer (selects plans and records metrics), which together enable robust multi-agent LLM planning.
The framework's operational loop integrates a Plan Proposal Module, Validation Module, Disruption Detection Module, Localized Repair (LCRP) Module, and Commit and Continue Module to dynamically adapt to runtime disruptions and ensure transactional reliability.
Key components like the Independent Validator, Versioned Execution Log, and Canonical Workflow IR ensure non-circular validation, grounded checks, and portable execution across various workflow runtimes, significantly improving planning robustness and efficiency.

GAIA: AN AGENTIC ARTIFICIAL INTELLIGENCE SYSTEM FOR GEOTHERMAL FIELD DEVELOPMENT

GAIA (Geothermal Analytics and Intelligent Agent): introduces an AI-based system for automating and assisting geothermal field development, integrating an LLM-powered task orchestrator, a web-based user interface, a digital twin for physics models and tools, and a multi-modal knowledge base.
The system employs an agentic retrieval-augmented generation (RAG) workflow, where the GAIA Agent plans and orchestrates multi-step analyses by querying knowledge bases and executing tools within the GAIA Digital Twin.
GAIA aims to accelerate project workflows, assist human experts in decision-making, and enable automation of the geothermal development process through its modular and extensible design.

KNOWTHYSELF: AN AGENTIC ASSISTANT FOR LLM INTERPRETABILITY

KnowThyself: introduces an agentic assistant for LLM interpretability, consolidating existing tools into a chat-based interface where users upload models, pose natural language questions, and obtain interactive visualizations with guided explanations.
The platform employs an Orchestrator LLM to reformulate queries and contextualize results, an Agent Router to direct queries to specialized agents, and various Specialized Agents (BertViz, TransformerLens, RAG, BiasEval) to perform specific interpretability tasks.
This modular, multi-agent orchestration framework lowers technical barriers by embedding the entire process into a conversational workflow, providing an extensible and accessible foundation for LLM inspection.

To See or To Read: User Behavior Reasoning in Multimodal LLMs

BehaviorLens: introduces a systematic benchmarking framework for evaluating modality trade-offs in user behavior reasoning, utilizing textual, scatter plot, and flowchart representations of transaction data as input for MLLMs to perform next-purchase prediction.
The framework compares the performance of six MLLMs across these input modalities, assessing prediction accuracy, computational cost, and the quality of generated explanations.
BehaviorLens reveals that holistic image representations of user history significantly improve next-purchase prediction accuracy without additional computational cost compared to textual representations.

ASAP: an Agentic Solution to Auto-optimize Performance of Large-Scale LLM Training

ASAP (Agentic Solution to Auto-optimize Performance of Large-Scale LLM Training): introduces a multi-agent system for auto-optimizing large-scale LLM training performance by diagnosing bottlenecks and proposing sharding configurations.
It integrates Coordinator, Analyzer, and Proposal agents with Sharding Memory, leveraging performance profiling tools, RAG, and historical optimization data.
The framework automates the diagnosis of sharding issues and generates explainable, optimized configurations, significantly reducing manual effort and improving hardware efficiency.

Leveraging LLM-based agents for social science research: insights from citation network simulations

CiteAgent (Citation Agent) Framework: introduces a simulation framework that leverages LLM-based agents to model human behaviors in citation networks, including Initialization, Socialization, and Creation stages, enabling the generation and analysis of citation network phenomena.
The framework incorporates LLM-based agents as distinct authors with attributes and memory, facilitating collaborative paper drafting and scholarly search for references, and supports two research paradigms: LLM-SE and LLM-LE.
CiteAgent allows researchers to test and validate existing theories in network science through customizable experiments, providing insights into power-law distribution, citational distortion, and other social science phenomena.

Approximating the Mathematical Structure of Psychodynamics

Psychodynamics Process Theory (PTP): introduces a mathematical framework to formalize human psychodynamics and cognitive processes using a diagrammatic approach based on process theory, making it quantitatively precise and accessible across various fields.
PTP leverages concepts from quantum cognition and holographic cognition to model mental states as cogit state vectors and their evolution through various internal and external processes, including conscious self-reflection, stimuli, and communication.
The framework supports hierarchical Bayesian inference for understanding cognitive dynamics, exemplified by the Wittgenstein-Lion Language Game, and offers applications in AI safety, such as analyzing AI-driven cognitive manipulation and developing advanced AI agents.

4th November 2025

Kosmos: An AI Scientist for Autonomous Discovery

Kosmos: introduces an AI scientist that automates data-driven discovery by performing iterative cycles of parallel data analysis, literature search, and hypothesis generation, synthesizing discoveries into scientific reports.
The system leverages LLMs, a structured world model for information sharing, and specialized agents to coherently pursue open-ended research objectives over extended periods.
Kosmos demonstrates the ability to reproduce existing findings, refine knowledge, and make novel, clinically-relevant discoveries across diverse scientific domains with traceable reasoning.

MEMSEARCHER: TRAINING LLMS TO REASON, SEARCH AND MANAGE MEMORY VIA END-TO-END REINFORCEMENT LEARNING

MemSearcher: introduces an agent workflow that iteratively maintains a compact memory and combines the current turn with it, fusing the user's question with memory to generate reasoning traces, perform search actions, and update memory to retain only essential information.
This design stabilizes context length across multi-turn interactions, improving efficiency without sacrificing accuracy, and is optimized using multi-context GRPO, an end-to-end RL framework.
Multi-context GRPO jointly optimizes reasoning, search strategies, and memory management by sampling groups of trajectories under different contexts and propagating trajectory-level advantages.

Controlling Performance and Budget of a Centralized Multi-agent LLM System with Reinforcement Learning

CORL (Cost-controllable Reinforcement Learning): introduces a centralized multi-LLM framework where a Controller LLM coordinates a pool of Expert LLMs, optimized via Reinforcement Learning with dual objectives for task performance and inference cost, adapting to various Budget Conditions.
This framework enables dynamic budget-aware decision-making, allowing the system to achieve high performance in high-budget modes while maintaining cost efficiency in low-budget settings.
The approach leverages a cost-controllable training strategy and dual reward signals to learn judicious use of expert LLMs, generalizing well to unseen data and different budget levels.

Agentic World Modeling for 6G: Near-Real-Time Generative State-Space Reasoning

WM-MS3M (World-Modeled Multi-Scale Structured State-Space Mixture): introduces an agentic world modeling paradigm for 6G O-RAN Near-RT control, leveraging a causal MS³M backbone, a lightweight stochastic latent variable, and dual decoders to provide action-conditioned generative state-space reasoning and short-horizon planning.
This framework enables quantitative "what-if" forecasting and calibrated uncertainty modeling for Key Performance Indicator (KPI) prediction, treating Physical Resource Blocks (PRBs) as explicit control inputs.
The approach integrates with an MPC/CEM planner to optimize actions within data-driven PRB bounds, ensuring leakage-safe, auditable, and robust control for 6G networks.

CostBench: Evaluating Multi-Turn Cost-Optimal Planning and Adaptation in Dynamic Environments for LLM Tool-Use Agents

CostBench: introduces a scalable, cost-centric benchmark for evaluating LLM agents' multi-turn cost-optimal planning and adaptation capabilities in dynamic environments, featuring a query construction module, an environment module, atomic tools, composite tools, flexible cost assignment, an LLM agent, a trajectory planning module, dynamic blocking events, and a re-planning mechanism.
The benchmark is situated in the travel-planning domain, comprising tasks solvable via multiple sequences of atomic and composite tools with diverse, customizable costs, and supports four types of dynamic blocking events to simulate real-world unpredictability.
Evaluations on CostBench reveal a substantial gap in cost-aware planning, with leading models failing to identify cost-optimal solutions in static settings and showing significant performance drops under dynamic conditions, highlighting the need for more robust and adaptive LLM agents.

Curriculum Design for Trajectory-Constrained Agent: Compressing Chain-of-Thought Tokens in LLMs

CURLTRAC (Curriculum Design for Trajectory-Constrained Agent): introduces an adaptive curriculum learning strategy for training agents under strict deployment-time constraints, utilizing a teacher component to adjust the permissible cost budget and a student component to update the agent's policy based on rollouts in various environments.
This strategy enables agents, including RL and LLM agents, to progressively master challenging environments by starting with relaxed trajectory constraints and adaptively tightening them, ensuring efficient learning and adherence to strict deployment conditions.
When applied to LLMs, CURLTRAC effectively compresses output chain-of-thought tokens, leading to substantial inference speedup and reduced computational cost while maintaining accuracy.

Apriel-H1: Towards Efficient Enterprise Reasoning Models

Apriel-H1 (Hybrid Large Language Models): introduces a family of hybrid LLMs that combine Transformer Attention (Multi-Head Attention) and SSM Sequence Mixers (Mamba blocks) through a staged distillation process from a pre-trained transformer teacher, aiming for efficient enterprise reasoning.
The framework progressively replaces less critical attention layers with linear Mamba blocks, guided by layer importance estimation, to achieve higher inference throughput with minimal performance degradation.
Apriel-H1 models demonstrate up to 3.4x higher inference throughput compared to pure transformer baselines on reasoning-heavy benchmarks, showcasing substantial efficiency gains.

Adapting General-Purpose Foundation Models for X-ray Ptychography in Low-Data Regimes

PtychoBench: introduces a multi-modal, multi-task benchmark for X-ray ptychographic analysis, systematically comparing Supervised Fine-Tuning (SFT) and In-Context Learning (ICL) specialization strategies for Vision-Language Models (VLMs) and LLMs.
The benchmark evaluates VLM-based artifact detection and LLM-based parameter recommendation in low-data regimes, revealing task-dependent optimal specialization pathways.
Findings highlight that SFT and ICL are complementary for visual tasks, while ICL on large base models is superior for textual tasks, emphasizing the importance of context-aware prompting and model scale.

Modeling Hawkish-Dovish Latent Beliefs in Multi-Agent Debate-Based LLMs for Monetary Policy Decision Classification

Multi-Agent Debate-Based LLMs Framework: introduces a novel approach that simulates the FOMC's collective decision-making process using multiple LLM Agents (interacting decision-makers), each starting with Initial Beliefs (distinct policy stances) and processing Input Data (qualitative policy texts/quantitative macroeconomic indicators/historical policy rate), then revising predictions through Iterative Debate Rounds (sequential prediction revision) mediated by Latent Beliefs (hawkish/dovish stance representation), and finally reaching a Consensus Mechanism (final decision aggregation).
This framework enhances interpretability by explicitly modeling each agent's internal policy beliefs as a discrete latent variable, demonstrating how these beliefs mediate the perception of input information and interaction dynamics.
Empirical results show that this debate-based approach significantly outperforms standard LLM-based baselines in predicting central bank policy decisions, providing insights into individual perspectives and social influence on collective forecasts.

From the Laboratory to Real-World Application: Evaluating Zero-Shot Scene Interpretation on Edge Devices for Mobile Robotics

Proposed Architecture: introduces a pipeline for zero-shot scene interpretation on edge devices for mobile robotics, integrating a Small VLM for scene description, a Detector + Segmentor for object identification, and Tracking for object monitoring, all feeding into a Decision Making unit, with optional Cloud support for larger LLMs/VLMs.
This architecture enables mobile robots to perceive, interpret, and make rational decisions in dynamic environments by processing visual information locally on edge devices while preserving privacy.
The system is evaluated on diverse real-world datasets, demonstrating the capabilities of small VLMs for scene interpretation and action recognition in various outdoor and indoor scenarios.

ReAcTree: Hierarchical LLM Agent Trees with Control Flow for Long-Horizon Task Planning

ReAcTree: introduces a hierarchical task-planning framework that dynamically constructs an LLM agent tree, where agent nodes (LLM-based task planner) reason, act, and expand subgoals, while control flow nodes (coordinates child execution) manage execution strategies, supported by episodic memory (stores subgoal-level experiences) and working memory (shares environment observations) for robust long-horizon task planning.
This framework addresses limitations of monolithic trajectories by decomposing complex goals into semantically isolated subgoals, preventing error propagation and enhancing tractability for LLMs.
Experiments demonstrate ReAcTree's consistent outperformance of strong baselines across various LLMs in partially observable settings, showcasing its effectiveness in agentic decision-making.

EvoDev: An Iterative Feature-Driven Framework for End-to-End Software Development with LLM-based Agents

EvoDev (Iterative Feature-Driven Framework for End-to-End Software Development with LLM-based Agents): introduces an iterative software development framework that decomposes user requirements into features, constructs a Feature Map for dependencies, and iteratively develops software using LLM-based agents.
The framework explicitly models dependencies between features and propagates multi-level information (business logic, design, code) as context for subsequent development iterations.
EvoDev significantly outperforms existing LLM-agent baselines in Android development tasks by improving build success rate and functional completeness through its FDD-inspired iterative workflow.

Revisiting put-that-there, context aware window interactions via LLMs

Task-Centric Window Management System: introduces a multimodal, LLM-driven system for managing virtual windows in XR environments, integrating LLM Integration, Scene Understanding, Window Workspace, and User Behaviour components.
This system enables users to organize virtual windows through natural multimodal interaction, fusing explicit/implicit speech with non-verbal cues like pointing and head-gaze, and semantic scene representations.
It supports one-to-many action mappings and goal-centric reasoning, allowing the LLM to dynamically infer relevant applications and layout decisions, thereby reducing cognitive load and improving user efficiency.

LIVESECBENCH: A DYNAMIC AND CULTURALLY-RELEVANT AI SAFETY BENCHMARK FOR LLMS IN CHINESE CONTEXT

LiveSecBench: introduces a dynamic and continuously updated AI safety benchmark specifically for Chinese-language LLM application scenarios, evaluating models across six critical dimensions (Legality, Ethics, Factuality, Privacy, Adversarial Robustness, and Reasoning Safety) using a culturally-relevant dataset and an ELO rating system.
The benchmark maintains relevance through a dynamic update schedule that incorporates new threat vectors and regularly refreshes test questions, with planned expansions to include Text-to-Image Generation Safety and Agentic Safety.
LiveSecBench provides a public online leaderboard and detailed evaluation reports, offering transparent insights into LLM safety performance within Chinese legal and social frameworks.

UNLOCKING THE POWER OF MULTI-AGENT LLM FOR REASONING: FROM LAZY AGENTS TO DELIBERATION

Dr. MAMR (Multi-Agent Meta-Reasoning Done Right): introduces a multi-agent LLM reasoning framework that addresses lazy agent behavior by incorporating a meta-thinking agent (decomposes tasks, sets goals), a reasoning agent (executes subtasks, performs computations), a Shapley-inspired causal influence method (measures step-level contribution), a verifiable reward mechanism for restart behavior (rewards adaptive deliberation), and an Aggregated Step-Level Advantage (combines rewards for credit).
The framework theoretically analyzes and mitigates the root cause of lazy agent behavior in multi-turn Group Relative Preference Optimization (GRPO) by removing a normalization term and introducing a robust causal influence measure.
Dr. MAMR enhances multi-agent collaboration and reasoning performance on complex tasks by enabling agents to adaptively discard prior outputs and restart reasoning when necessary, leading to more stable training and improved accuracy.

Demo: Statistically Significant Results On Biases and Errors of LLMs Do Not Guarantee Generalizable Results

LLM Evaluation Infrastructure: introduces a system for automatically generating diverse medical queries for LLMs and evaluating their answers using multiple LLM-as-a-judge setups and agentic workflows.
The infrastructure includes a prompt generation pipeline that synthesizes patient demographics, medical histories, disorders, and writing styles to create realistic questions, and an answer evaluation pipeline for detecting hallucinations, omissions, and treatment categories.
This system facilitates large-scale experiments to investigate LLM biases and errors in patient-facing medical scenarios, highlighting the need for multiple LLM evaluators to ensure generalizable results.

DEEP IDEATION: DESIGNING LLM AGENTS TO GENERATE NOVEL RESEARCH IDEAS ON SCIENTIFIC CONCEPT NETWORK

Deep Ideation framework: introduces a system for generating novel research ideas, integrating a Scientific Network (knowledge base), Relation Analysis Module (summarizes keyword connections), Keyword Selection Module (selects impactful keywords), Idea Formulation Module (synthesizes keywords into ideas), Idea Stack (tracks research progress), Critic Model (evaluates idea quality), Router (determines next action), and LLM Agents (perform module tasks).
The framework employs an iterative explore-expand-evolve workflow, leveraging the scientific concept network to dynamically refine research ideas and incorporating reviewer feedback for continuous improvement.
This approach significantly enhances the novelty and feasibility of generated research ideas across multiple AI domains, outperforming existing methods.

CONTINUUM: EFFICIENT AND ROBUST MULTI-TURN LLM AGENT SCHEDULING WITH KV CACHE TIME-TO-LIVE

Continuum: introduces a tool-call aware LLM serving system with a Scheduler (manages request scheduling), Tool Call Handler (parses tool calls, estimates latency), Tool Call Prediction (predicts tool call duration), KV Cache TTL (pins/unpins KV cache), Request & Multi-turn Info (tracks program state), and Unpin Mechanism (releases expired pins), designed to optimize multi-turn agent workloads by intelligently managing KV cache with time-to-live values.
The system predicts tool call durations and uses this information to set a Time-to-Live (TTL) for pinning KV cache in GPU memory, preventing unnecessary evictions and re-computations.
By combining tool-aware KV cache timeout with program-level first-come-first-serve scheduling, Continuum significantly reduces scheduling bubbles and preserves multi-turn continuity for complex agentic workflows.

Training Proactive and Personalized LLM Agents

PPP-Agent (Productive, Proactive, and Personalized LLM Agents): introduces a multi-objective reinforcement learning framework that optimizes LLM agents for productivity, proactivity, and personalization using an interactive environment with LLM-based user simulators.
The framework leverages USERVILLE's prompt vaguenization and preference-aware user simulation to create realistic training scenarios, enabling agents to learn strategic interaction and adapt communication styles.
It employs a composite reward signal derived from task success, interaction quality, and alignment with user preferences, demonstrating significant improvements over strong baselines.

Optimal-Agent-Selection: State-Aware Routing Framework for Efficient Multi-Agent Collaboration

STRMAC (State-Aware Routing Framework for Efficient Multi-Agent Collaboration): introduces a state-aware routing framework for multi-agent collaboration, which includes LLM Agents (perform tasks), a State-based Router (selects optimal agent) with an LLM Encoder (encodes agent private context) and a Router Encoder (encodes current system state), and a Selected Agent (executes next action).
The framework dynamically selects the most suitable single agent at each step by encoding interaction history and agent knowledge, improving collaboration efficiency and effectiveness.
It also incorporates a self-evolving data generation approach to accelerate the collection of high-quality execution paths, significantly reducing training data overhead.

Tool-to-Agent Retrieval: Bridging Tools and Agents for Scalable LLM Multi-Agent Systems

Tool-to-Agent Retrieval: introduces a unified framework for LLM multi-agent systems that embeds Tools (API calls, functions, actions) and Agents (MCP servers, sub-agents) in a Shared Vector Space (unified embedding space), connecting them via Metadata Relationships (links tools to agents) within a Unified Tool-Agent Catalog (integrates tools/agents) comprising a Tool Corpus (tool names, descriptions) and Agent Corpus (agent names, descriptions), and utilizing a Retrieval Process (top-K ranking, aggregation) driven by Query Paradigms (input methods) such as Direct Querying (high-level question) or Step-wise Querying (decomposed sub-tasks).
This framework enables granular tool-level or agent-level retrieval by explicitly modeling tool capabilities and traversing metadata, thereby avoiding context dilution and improving routing for both focused and multi-step queries.
Evaluations across eight embedding models on the LiveMCPBench benchmark demonstrate consistent improvements in Recall@5 and nDCG@5 over previous state-of-the-art agent retrievers.

Collaborative Large Language Model Inference via Resource-Aware Parallel Speculative Decoding

TMA-MASAC (Two-phase Matching-based Association Multi-Agent Soft Actor-Critic): introduces a novel framework that jointly optimizes user association and resource allocation (UARA) for efficient parallel speculative decoding in Mobile Edge Computing (MEC) systems, utilizing a MASAC network for resource allocation and a TMA strategy for user association.
The framework addresses the challenge of parallelizing autoregressive LLM generation in resource-constrained MEC environments by synchronizing mobile computation and uplink communication, minimizing edge-side computing latency, and ensuring energy efficiency.
It employs a lightweight draft model on mobile devices and a powerful target model on edge servers, reducing end-to-end latency by up to 28.0% and an average of 23.7% without compromising inference accuracy.

A Collaborative Reasoning Framework for Anomaly Diagnostics in Underwater Robotics

AURA (Autonomous Resilience Agent): introduces a collaborative framework for anomaly and fault diagnostics in underwater robotics, integrating a Digital Twin (DT) (real-time normative model), Real AUV (physical vehicle), Simulator (virtual replica), Statistical Anomaly Detection (detects state deviations), State Anomaly Characterisation Agent (Agent A) (low-level perception LLM), Anomaly Digest (structured problem description), Diagnostic Reasoning Agent (Agent B) (high-level cognitive LLM), Human Operator (interactive dialogue partner), Vector Database (VDB) (stores distilled lessons), Embedding Model (converts text to vectors), Featured Cloud Search (external knowledge source), ROS 2 topics (human-robot interface), and Orchestration Framework (LangChain) (manages Agent B's flow).
This framework employs a two-agent LLM design with distinct responsibilities, where Agent A monitors telemetry and translates data into natural language, and Agent B engages a human operator in dialogue to determine root causes, supported by external knowledge.
The human-validated diagnosis is processed into a new training example, stored in the VDB via an Embedding Model, refining Agent A's perceptual model and enabling continuous learning from human feedback.

PoCo: Agentic Proof-of-Concept Exploit Generation for Smart Contracts

PoCo (Agentic Proof-of-Concept Exploit Generation): introduces an agentic framework that automatically generates executable PoC exploits for smart contracts from natural-language vulnerability descriptions, utilizing an LLM within a Reason-Act-Observe loop and a suite of specialized tools.
The framework accepts a target smart contract and an auditor-written vulnerability annotation as input, producing a Foundry-compatible executable PoC exploit as output.
PoCo significantly reduces the effort and time required for high-quality PoC generation in smart contract audits, providing verifiable evidence for auditors and actionable test cases for developers.

A Criminology of Machines

A Criminology of Machines: introduces a conceptual framework for understanding crime and social control in a hybrid society, defining AI agency through computational, social, and legal dimensions, and classifying deviant behaviors into maliciously aligned systems and unplanned emergent deviance.
This framework addresses the implications of increasing autonomous AI agents and their machine-machine interactions, moving beyond viewing AI solely as a tool to recognizing its agency in generating unlawful outcomes.
The paper highlights the urgent need for criminologists to collaborate with AI experts to predict, mitigate, and govern risks from multi-agent AI systems, especially concerning accountability gaps and emergent behaviors.

Stochastic Redistribution of Indistinguishable Items in Shared Habitation: A Multi-Agent Simulation Framework

Stochastic Redistribution of Indistinguishable Items in Shared Habitation: A Multi-Agent Simulation Framework: introduces a discrete-event stochastic model simulating the redistribution of indistinguishable items, like socks, among cohabitants, utilizing autonomous agents, probabilistic mixing, correction, and loss processes over iterative laundry cycles.
The framework, implemented with SimPy, models item migration through random mixing events, selective recollection, and attrition, demonstrating how even minimal exchange probabilities can lead to emergent asymmetries and long-term disorder.
This multi-agent system captures the dynamic interplay between order and disorder in shared domestic environments, connecting everyday phenomena to statistical mechanics principles of entropy and diffusion.

Agentic AI for Mobile Network RAN Management and Optimization

Agentic AI for RAN Management and Optimization: introduces a framework for autonomous 5G RAN management and optimization, leveraging specialized agents (Master Orchestrator, Analysis, Historical Retrieval, Documentation, Validation) that utilize an LLM Reasoning Module, Memory, and various data tools to detect KPI deviations, diagnose causes, and propose corrective actions.
This framework enables goal-driven systems to dynamically adapt to changing network conditions, employing design patterns like reflection, planning, and multi-agent collaboration for continuous refinement and autonomous decision-making.
By integrating large AI models with planning, memory, and reasoning capabilities, the framework addresses the increasing complexity of 5G/6G networks, moving beyond traditional rule-based systems to achieve higher levels of automation and intelligence.

Dexterous Robotic Piano Playing at Scale

OMNIPIANIST: introduces an agent capable of performing nearly one thousand music pieces by combining an Optimal Transport (OT) based fingering strategy, large-scale Reinforcement Learning (RL) for data generation, and a Flow Matching Transformer for multi-task imitation learning.
The OT-based fingering strategy enables RL agents to autonomously discover efficient piano-playing strategies without human demonstrations, generating the diverse RP1M++ dataset from over 2,000 specialist agents.
The Flow Matching Transformer leverages the RP1M++ dataset to learn a multi-song policy, achieving human-level dexterity and strong generalization across various musical tasks.

A Spatially Informed Gaussian Process UCB Method for Decentralized Coverage Control

SIGP-UCB (Spatially Informed Gaussian Process UCB): introduces a novel decentralized algorithm for multi-agent coverage control in unknown spatial environments, utilizing local GP models, a local cost function balancing expected locational cost and variance-based exploration, inducing points selected via a greedy strategy, a communication graph, a consensus protocol for hyperparameters, gradient descent, a temporary buffer, and an Adam optimizer.
This algorithm allows each agent to autonomously determine its trajectory by minimizing a local cost function, balancing exploration of uncertain regions with exploitation of high-density areas, and updating its GP model using local observations and neighbor communication.
The decentralized approach, employing sparse GPs and local information sharing, enhances scalability and enables agents to escape local minima, leading to improved coverage efficiency compared to centralized and model-based methods.

LACY: A Vision-Language Model-based Language-Action Cycle for Self-Improving Robotic Manipulation

LACY (Language-Action CYcle): introduces a unified VLM framework built upon a single LLaVA-NeXT model, fine-tuned to perform language-to-action generation (L2A), action-to-language explanation (A2L), and semantic consistency verification (L2C).
The framework operates as a closed-loop system, leveraging its bidirectional capabilities to autonomously generate and filter new high-quality training data through a self-improving data generation pipeline and a confidence-based active data augmentation strategy.
This approach significantly improves robotic manipulation task success rates in both simulation and real-world settings by focusing learning on ambiguous cases and reducing reliance on external human supervision.

ACCUMULATING CONTEXT CHANGES THE BELIEFS OF LANGUAGE MODELS

Belief Shift Measurement Framework: introduces a three-stage process to measure changes in LLM stated beliefs and behaviors, including initial belief recording, context accumulation through intentional and non-intentional tasks, and post-task belief recording.
The framework reveals that LLMs' belief profiles are highly malleable, with significant shifts observed in both stated beliefs and behaviors after various interactions.
This analysis exposes the hidden risk of belief shift in LLMs during extended sessions of talking or reading, impacting their reliability and consistency.

No-Human in the Loop: Agentic Evaluation at Scale for Recommendation

ScalingEval: introduces a large-scale, multi-agent benchmarking framework that positions LLMs as judges for evaluating complementary-item recommendations at scale without human annotation, utilizing an Evaluation Generation Query, Tools, Multi-Agent Planning, Memory, Evaluation Report, and Scalable Majority-vote Ground Truth Synthesis.
The framework orchestrates specialized LLM agents for CI pattern auditing, recommendation issue identification, and report generation, supported by data retrieval, analysis, and batch processing tools.
It employs a scalable majority-vote ground truth synthesis mechanism, where multiple LLMs independently evaluate item pairs, and their judgments are aggregated to produce robust consensus results.

UNSUPERVISED EVALUATION OF MULTI-TURN OBJECTIVE-DRIVEN INTERACTIONS

UEF (Unsupervised Evaluation Framework): introduces a suite of unsupervised metrics for evaluating multi-turn objective-driven LLM interactions, including LLM-guided Clustering (for user goals), an Interaction Completeness Metric (for goal completion), and a Response Uncertainty Metric (for LLM confidence).
The framework leverages statistical properties of unlabeled interaction data and fine-tuned LLMs to adapt to distributional shifts, providing LLM judge-free metrics without relying on human-generated ideal responses.
The approach is validated on open-domain and task-specific interaction data, demonstrating its ability to label user goals, measure goal completion, and quantify LLM uncertainty effectively.

PublicAgent: Multi-Agent Design Principles From an LLM-Based Open Data Analysis Framework

PublicAgent: introduces a multi-agent framework for open data analysis, with Orchestrator Agent (coordinates agents, validates progress), Intent Clarifying Agent (resolves query ambiguities), Data Discovery Agent (semantic search, metadata synthesis), Data Analysis Agent (generates, validates statistical code), and Report Generation Agent (synthesizes findings, adds caveats), which addresses LLM limitations in end-to-end analytical workflows by decomposing tasks into specialized agents.
This framework enhances data accessibility for non-experts by providing natural language interfaces for query clarification, dataset discovery, statistical analysis, and comprehensive report generation from public data repositories.
The multi-agent architecture improves performance, mitigates distinct failure modes, and offers architectural benefits across task complexities, demonstrating the value of specialization independent of base LLM strength.

LEGO-EVAL: TOWARDS FINE-GRAINED EVALUATION ON SYNTHESIZING 3D EMBODIED ENVIRONMENTS WITH TOOL AUGMENTATION

LEGO-EVAL: introduces a comprehensive evaluation framework for text-guided 3D scene synthesis, utilizing Constraint Identification (identifies constraints), Tool Execution Planning (generates tool plans), Argument Selection & Execution (selects arguments and executes tools), and Constraint Validation (assesses scene alignment using LLM/VLM) with a diverse Tool Set (for environment interaction, textual, and multimodal reasoning).
The framework addresses limitations of existing methods by performing multi-hop grounding of scene components and verifying attributes and spatial relationships through tool-augmented VLMs.
LEGO-EVAL, along with the LEGO-BENCH dataset, provides a robust and interpretable evaluation for 3D scene generation, demonstrating superior agreement with human judgments compared to baselines.

Cache Mechanism for Agent RAG Systems

ARC (Agent RAG Cache Mechanism): introduces a novel, annotation-free caching framework that dynamically manages small, high-value corpora for each LLM agent by synthesizing historical query distribution patterns with the intrinsic geometry of cached items in the embedding space.
This framework leverages query-based dynamics and structural properties of the item representation space, drastically reducing storage requirements while preserving retrieval effectiveness.
ARC achieves a 79.8% cache has-answer rate and an 80% average reduction in retrieval latency, significantly enhancing efficiency and effectiveness in RAG-powered LLM agents.

AgentSLA: Towards a Service Level Agreement for AI Agents

AgentSLA (Service Level Agreement for AI Agents): introduces a framework for defining Service Level Agreements for AI agents, including an extended Quality Model (ISO/IEC 25010 extension), the AgentSLA DSL, its Metamodel, a Validating Parser, and key entities like Agent, ModelCard, Provider, QoSMetric, SLA, and SLO, leveraging protocols such as Agent2Agent Protocol (A2A) and Model Context Protocol (MCP).
The framework addresses the challenge of specifying Quality of Service (QoS) for AI agents by extending the ISO/IEC 25010 standard with new quality characteristics like Sustainability, Autonomy, Interoperability, Understandability, and Output properties.
The AgentSLA DSL, with its JSON-based concrete syntax and Python parser, enables formal and automatic processing of SLAs, facilitating the integration and quality assurance of AI agents in software systems.

3rd November 2025

INSURAGENT: A LARGE LANGUAGE MODEL-EMPOWERED AGENT FOR SIMULATING INDIVIDUAL BEHAVIOR IN PURCHASING FLOOD INSURANCE

InsurAgent (A Large Language Model-Empowered Agent for Simulating Individual Behavior in Purchasing Flood Insurance): introduces an LLM-empowered agent for simulating individual flood insurance purchase decisions, integrating perception (parsing user profiles), retrieval (acquiring empirical survey data via RAG), reasoning (emulating human cognitive processes and extrapolating), action (generating purchase probabilities and explanations), and memory (archiving temporal history for dynamic modeling).
This framework addresses the LLM's limitation in quantitative probability estimation by grounding decisions in empirical data and leveraging common sense for contextual adjustments beyond survey data.
InsurAgent provides a valuable tool for behavioral modeling and policy analysis by accurately estimating marginal and bivariate probabilities and simulating dynamic decision evolutions over time.

Automated Reward Design for Gran Turismo

Iterative LLM-based Reward Design: introduces a scalable iterative framework for automated reward design in Gran Turismo 7, leveraging LLM-based reward generation, VLM preference-based evaluation, and optional human feedback to produce competitive racing agents from text-based instructions.
The framework efficiently searches a space of reward functions, using a trajectory alignment filter to prune misaligned candidates and a VLM/LLM for preference-based evaluation, replacing the need for a ground-truth fitness metric.
This system generates reward functions capable of producing racing agents competitive with GT Sophy, a champion-level RL agent, and can also generate novel behaviors in the Gran Turismo 7 environment.

Simulating Environments with Reasoning Models for Agent Training

Simia-SFT and Simia-RL: introduce frameworks that enable LLMs to simulate realistic environment feedback for scalable agent training without real environment implementations.
Simia-SFT is a pipeline that synthesizes supervised fine-tuning data by amplifying small seed sets into diverse trajectories in an environment-agnostic manner.
Simia-RL enables reinforcement learning training without real environment implementations by generating LLM-simulated feedback, replacing heavy environment engineering with flexible LLM-based simulation.

Hybrid Retrieval-Augmented Generation Agent for Trustworthy Legal Question Answering in Judicial Forensics

Hybrid Legal QA Agent: introduces a hybrid legal QA agent for trustworthy legal question answering in judicial forensics, integrating retrieval-augmented generation (RAG) with multi-model ensembling and a dynamic knowledge-base update mechanism.
The system prioritizes retrieval from a trusted legal repository; if retrieval fails, multiple LLMs generate candidate answers, which are then scored by a specialized selector.
High-quality outputs undergo human review before being written back into the knowledge base, enabling dynamic knowledge evolution and provenance tracking to ensure reliability and compliance.

Scaling Graph Chain-of-Thought Reasoning: A Multi-Agent Framework with Efficient LLM Serving

GLM (Graph Chain-of-Thought with Efficient LLM Serving): introduces a multi-agent Graph-CoT framework with Classification Agent (classifies query type), Reasoning Agent (determines info sufficiency, answers), Action Agent (generates code for retrieval), Graph RAG Retriever (executes code, retrieves graph facts), LLM service/Inference Engine (executes agent prompts), Notebook (accumulates known facts), Vertex-Centric KV Cache Reuse Model (maximizes KV cache reuse), Priority-based KV Cache Eviction Policy (manages cache retention), and Pipelined Execution Strategy (overlaps retrieval, LLM decoding), enabling scalable and efficient graph reasoning for LLMs.
This framework decomposes complex reasoning tasks into specialized agents and integrates an optimized LLM serving architecture to reduce token cost, latency, and improve throughput.
The co-designed approach addresses limitations of single-agent Graph-CoT systems by enhancing accuracy and efficiency through selective context sharing and advanced KV-cache management.

UniDataBench: Evaluating Data Analytics Agents Across Structured and Unstructured Data

ReActInsight: introduces an autonomous LLM-based agent for end-to-end data analysis across diverse structured and unstructured data sources, featuring Multi-Source Data Exploration & Cross-Source Linkage Discovery (initial data understanding), Heterogeneous Schema Extraction (extracts metadata), Unified Metadata Hub (MetaGraph) Construction (centralizes metadata), Entity-Graph Generation via Similarity Analysis (discovers relationships), Actionable Join-Hint Formulation (creates join instructions), ReAct-style Hierarchical Planning (decomposes analytical goals), Hierarchical Planning Mechanism (breaks down goals), Code Generation with Self-Correction (automates code creation), Code Generation Module (generates executable code), Self-Correction and Debugging Module (ensures code reliability), Adaptive Visualization Techniques (uncovers underlying patterns), Insights Synthesis (distills findings), Insight Synthesis Module (summarizes results), and Model Cascading (optimizes LLM usage).
The agent initiates its workflow with intelligent multi-source data exploration to build a semantic understanding of how disparate datasets relate, constructing a unified MetaGraph and formulating actionable Join-Hints.
It employs a hierarchical planning mechanism to decompose high-level goals into answerable sub-questions, generates self-correcting executable code with adaptive visualizations, and synthesizes results into coherent summaries and recommendations, optimizing LLM usage through model cascading.

TPS-BENCH: EVALUATING AI AGENTS' TOOL PLANNING & SCHEDULING ABILITIES IN COMPOUNDING TASKS

TPS-Bench (Tool Planning and Scheduling Benchmark): introduces a benchmark for evaluating LLM agents' tool planning and scheduling abilities in compounding tasks, featuring Compounding Tasks, a Tool Repository with Model Context Protocol (MCP) Tools, an LLM Agent, Evaluation Metrics, and an LLM-as-a-judge.
The benchmark collects 200 compounding tasks of two difficulty levels, requiring agents to select appropriate tools, decompose tasks into subtasks, identify dependencies, and strategically schedule tool execution for efficiency.
Evaluation emphasizes task completion rate, tool selection score, token usage, and execution time, with an initial study showing reinforcement learning can improve scheduling efficiency and task completion.

LiCoMemory: Lightweight and Cognitive Agentic Memory for Efficient Long-Term Reasoning

LiCoMemory (Lightweight and Cognitive Agentic Memory): introduces an end-to-end agentic memory framework for LLM agents, featuring CogniGraph, a lightweight hierarchical graph for real-time updating and retrieval, which utilizes entities and relations as semantic indexing layers.
The framework employs temporal and hierarchy-aware search with integrated reranking for adaptive and coherent knowledge retrieval, significantly reducing update latency and improving efficiency.
LiCoMemory's design enables multi-granular reasoning from abstract contextual understanding to fine-grained evidence retrieval, supporting robust long-term conversational reasoning.

ZoFia: Zero-Shot Fake News Detection with Entity-Guided Retrieval and Multi-LLM Interaction

ZoFia (Zero-Shot Fake News Detection with Entity-Guided Retrieval and Multi-LLM Interaction): introduces a novel two-stage zero-shot fake news detection framework that combines entity-guided retrieval for external evidence with a multi-LLM interactive system for collaborative analysis and adversarial debate.
The framework first employs Hierarchical Salience and SC-MMR algorithms to extract informative and diverse keywords, which are then used to build a comprehensive Multi-Source Information Matrix from internal and external knowledge.
Subsequently, a multi-agent system, including Linguist, Expert, Claim Extractor, and Claim Verifier, performs multi-view analysis and engages in adversarial debate to produce an interpretable and robust judgment.

MicroRemed: Benchmarking LLMs in Microservices Remediation

ThinkRemed (multi-agent framework): introduces a multi-agent framework for end-to-end microservice remediation, comprising a Coordinator, Probe Agent, Execution Agent, Verification Agent, Judge, Auxiliary Context, Failure Report, Microservice Systems, Ansible Playbook, and Reflection.
This framework emulates Site Reliability Engineer (SRE) reasoning by performing dynamic probing, iterative reasoning, and limited trial-and-reflection cycles to generate effective remediation actions.
ThinkRemed operates within the MicroRemed benchmark, which evaluates LLMs' ability to autonomously generate executable Ansible playbooks from diagnosis reports to restore system functionality in real microservice environments.

Interaction As Intelligence Part2: Asynchronous Human-Agent Rollout for Long-Horizon Task Training

APOLLO: introduces a sampling framework that integrates asynchronous human guidance with action-level data filtering for long-horizon task training, including Agent, Environment, Human-AI Interaction Interface (Frontend), Human, Backend of Human-AI Interaction Interface, LLM As Judge, Raw Trajectory, Masked Trajectory, and Training Set Task.
This framework enables humans to intervene only when an LLM agent deviates from a promising trajectory, providing strategic advice and prior knowledge to generate valuable trajectories at a lower cost.
APOLLO applies supervision control to filter out sub-optimal actions, preventing error propagation and demonstrating significant performance improvements on long-horizon, domain-specialized tasks.

InnovatorBench: Evaluating Agents' Ability to Conduct Innovative LLM Research

InnovatorBench: introduces a benchmark-platform pair for evaluating AI agents' ability to conduct innovative LLM research, comprising 20 tasks across six research domains, supported by the ResearchGym environment.
ResearchGym provides a scalable and realistic environment with infrastructure support for multi-computer control, asynchronous execution, and snapshot saving, alongside diverse actions for file operations, web browsing, terminal access, web search, and file parsing.
The framework assesses LLM agents on end-to-end research tasks, emphasizing innovation and problem-solving, revealing strengths in data-related tasks and weaknesses in algorithmic design and long-horizon planning.

MATHEMATICAL EXPLORATION AND DISCOVERY AT SCALE

AlphaEvolve: introduces a generic evolutionary coding agent that combines LLM generative capabilities with automated evaluation in an iterative framework to propose, test, and refine algorithmic solutions for mathematical problems.
The system iteratively improves a population of programs through a Generator (LLM) that mutates programs and an Evaluator (fitness function) that assigns a numerical score to their performance.
AlphaEvolve operates in "search mode" to evolve heuristic algorithms or "generalizer mode" to discover programs for any input, and integrates with external AI tools like Deep Think and AlphaProof for formal verification.

Driving scenario generation and evaluation using a structured layer representation and foundational models

5LM (Structured Five-Layer Model): introduces a novel framework for generating and evaluating diverse driving scenarios, leveraging a structured five-layer representation and foundational models to create synthetic visual data from textual descriptions.
The framework employs a data augmentation strategy where an MLLM analyzes real-world driving scenarios and an LLM edits specific layers of the 5LM to generate Edge Cases, which are then evaluated using semantic embedding-based diversity and originality metrics.
This approach aims to produce rare and challenging driving scenarios for autonomous vehicle development by focusing on textual description relevance before visual generation, ensuring higher-quality and diverse responses.

From Passive to Proactive: A Multi-Agent System with Dynamic Task Orchestration for Intelligent Medical Pre-Consultation

MAS-DTO (Multi-Agent System with Dynamic Task Orchestration): introduces a hierarchical multi-agent framework for intelligent medical pre-consultation, featuring a Controller (select optimal next subtask) that coordinates specialized agents to achieve proactive, structured medical inquiry.
The framework includes a Virtual Patient (generate clinical presentations), Recipient (update medical records), Triager (perform hierarchical department triage), Monitor (assess subtask completion), Prompter (formulate context-aware inquiry strategies), Inquirer (produce clinical questions), and Evaluator (provide performance assessment) to manage the pre-consultation workflow.
This system transforms passive medical AI into proactive inquiry agents, demonstrating superior clinical quality and high task completion rates across various LLMs without task-specific fine-tuning, while preserving data privacy.

When Machines Join the Moral Circle: The Persona Effect of Generative AI Agents in Collaborative Reasoning

Generative AI Agents with Personas: introduces a study investigating how generative AI agents, designed with either a supportive or contrarian persona, influence collaborative moral reasoning in human-AI triads, using an autonomous-vehicle dilemma.
The framework includes Generative AI Agents (core intelligent entities), a Supportive Persona (empathetic, consensus-oriented role), a Contrarian Persona (analytical, skeptical role), and a Collaborative Reasoning Environment (setting for human-AI interaction), demonstrating how AI personas reshape moral discourse processes rather than outcomes.
Supportive AI teammates increased grounded/qualified claims and consolidated integrative reasoning, while contrarian AI teammates broadened moral framing and sustained value pluralism, with both personas reducing thematic drift in discussions.

2nd November 2025

Quantitative Risk Assessment in Radiation Oncology via LLM-Powered Root Cause Analysis of Incident Reports

LLM-Powered Data-Driven Framework: introduces an automated pipeline utilizing an LLM (Gemini 2.5 Pro) for incident report processing, severity generation, event classification, and responsibility assignment based on standardized taxonomies, transforming unstructured narratives into a structured database for quantitative analyses.
This framework employs Ordinal Logistic Regression, Association Rule Mining, Chi-square tests, and ANOVA to identify predictors of event severity and uncover systemic vulnerabilities in radiation oncology safety incidents.
The methodology provides an objective, evidence-based approach to risk assessment, enabling targeted interventions and continuous safety improvement by leveraging real-world incident data.

Aligning LLM agents with human learning and adjustment behavior: a dual agent approach

Dual-LLM Agent Framework: introduces a novel dual-agent framework that enables continuous learning and alignment between LLM agents and human travelers on learning and adaptation behavior from online data streams, including LLM Traveler Agents (simulates human behavior), LLM Calibration Agent (optimizes traveler personas), Environment (simulates urban network), LLM core (cognitive engine), Persona (describes agent characteristics), Memory (stores past experiences), Perception (updates agent memory), Retrieval (accesses short/long-term memories), Decision-making (generates simulated decisions), Rolling Window (focuses on recent data), Textual Gradient (suggests persona corrections), Loss minimization (evaluates candidate personas), and Smoothing (mitigates overfitting).
The framework employs a set of LLM traveler agents, each with a memory system and a learnable persona, to simulate human travelers, and an LLM calibration agent that leverages LLM reasoning to train these personas for behavioral alignment.
This dual-agent system tracks and aligns underlying decision-making mechanisms of travelers, producing realistic, adaptive simulations that significantly outperform existing LLM-based methods in individual behavioral alignment and aggregate simulation accuracy.

A Comprehensive Empirical Evaluation of Agent Frameworks on Code-centric Software Engineering Tasks

Agent Framework: introduces a generalized agentic workflow paradigm, comprising Orchestration and Reasoning (high-level decision-making), Collaborative Role (specialized agent roles), and Tool Augmentation (external tool access), to systematically evaluate seven general-purpose agent frameworks across software development, vulnerability detection, and program repair tasks.
The study assesses agent performance across effectiveness, efficiency, and overhead, using standard benchmarks like SRDD, LLM-SmartAudit, and SWE-bench Lite.
Findings reveal distinct capability patterns and trade-offs, with OPENHANDS balancing software development quality, GPTSWARM excelling in vulnerability detection, and program repair remaining challenging for most agents.

Portal UX Agent - A Plug-and-Play Engine for Rendering UIs from Natural-Language Specifications

Portal UX Agent: introduces a bounded-generation architecture that translates natural-language intent into rendered UIs by decoupling high-level planning (LLM-based planner) from low-level assembly (deterministic renderer), using a schema-validated typed composition and a vetted inventory of components and layout templates.
The system ensures auditability, reuse, and safety by constraining the LLM's output to a schema and rendering only from pre-approved components, preventing arbitrary code generation.
A mixed-methods evaluation framework, combining automatic checks and an LLM-as-a-Judge rubric, assesses UI quality, intent alignment, and visual polish, demonstrating reliable intent translation and strong compositional quality.

FREESH: FAIR, RESOURCE- AND ENERGY-EFFICIENT SCHEDULING FOR LLM SERVING ON HETEROGENEOUS GPUS

FREESH (FAIR, RESOURCE- AND ENERGY-EFFICIENT SCHEDULING FOR LLM SERVING ON HETEROGENEOUS GPUS): introduces a hierarchical and coordinated scheduling framework that optimizes LLM serving across distributed heterogeneous GPUs by integrating pool-level resource allocation, GPU-level frequency scaling, and request-level fair scheduling.
The framework leverages spatiotemporal computation flexibility and GPU characteristics to minimize carbon emissions and energy consumption while satisfying service level objectives and ensuring fairness.
It achieves this through dynamic request partitioning, adaptive GPU frequency scaling, and a Least-Laxity-First (LLF) scheduling strategy, demonstrating significant reductions in energy and emissions on production workloads.

GrowthHacker: Automated Off-Policy Evaluation Optimization Using Code-Modifying LLM Agents

GrowthHacker (Automated Off-Policy Evaluation Optimization System): introduces a benchmark system that leverages LLM-based agents, specifically a two-agent framework comprising a Prompter/Analyzer Agent and a Coder Agent, to autonomously and iteratively optimize Off-Policy Evaluation (OPE) code through modifications.
The system operates by having the Prompter/Analyzer Agent identify optimization opportunities and generate modification instructions, which the Coder Agent then implements to produce syntactically correct, functional code for execution and performance evaluation.
This iterative process, supported by file-based communication and post-hoc selection of the best-performing configuration, aims to automate OPE optimization in the code space, addressing limitations of manual hyperparameter tuning and improving reliability and performance.

Count-Based Approaches Remain Strong: A Benchmark Against Transformer and LLM Pipelines on Structured EHR

MoA LLM pipeline: introduces a method for structured EHR prediction that converts patient longitudinal records into natural-language summaries using an LLM-based summarizer agent, which are then classified by a text classifier for downstream prediction.
The paper benchmarks this MoA LLM pipeline against count-based models (LightGBM, TabPFN) and a pretrained sequential transformer (CLMBR) on eight clinical prediction tasks using the EHRSHOT dataset.
Results indicate that count-based methods and the MoA LLM pipeline generally outperform CLMBR, with wins largely split between the former two, highlighting the continued strength of count-based approaches and the potential of LLM-based agent pipelines for structured EHR.

Reevaluating Self-Consistency Scaling in Multi-Agent Systems

Self-Consistency Scaling in Multi-Agent Systems: introduces a structured framework to evaluate the trade-offs of increasing sampled reasoning paths in LLMs, utilizing multiple reasoning agents, an aggregator model, and an evaluator LLM.
The study employs Gemini 2.5 models (Flash-Lite and Pro) on HotpotQA and Math-500 datasets, comparing multi-agent configurations against a single CoT baseline based on accuracy and token cost.
Results indicate that self-consistency improves accuracy but gains diminish and plateau with increased agents, suggesting that high-sample configurations offer limited benefit relative to their computational cost.

What's the next frontier for Data-centric AI? Data Savvy Agents!

Data Savvy Agents: introduces a framework for AI agents to autonomously acquire, process, evaluate, and adapt data in dynamic, real-world environments.
This framework integrates proactive data acquisition, sophisticated data processing, interactive test data synthesis, and continual adaptation to enable agents to go beyond static datasets and predefined tasks.
By continuously engaging with diverse data sources and adapting to shifting conditions, Data Savvy Agents enhance AI system flexibility, resilience, and self-improvement in complex deployments.

CodeClash: Benchmarking Goal-Oriented Software Engineering

CodeClash: introduces a benchmark for goal-oriented software engineering where LLM-based SWE-agents iteratively refine codebases in multi-round tournaments, competing in code arenas, and receiving logs as feedback.
The framework evaluates LLMs on open-ended objectives like score maximization or resource acquisition, moving beyond traditional code completion or bug fixing tasks.
CodeClash reveals LLMs' diverse development styles and limitations in strategic reasoning, long-term codebase maintenance, and interpreting competitive feedback, highlighting a significant gap compared to human performance.

Real-Time Learning of Predictive Dynamic Obstacle Models for Robotic Motion Planning

Adaptive Sliding-Window Page-Hankel DMD Predictor: introduces an online framework for real-time learning and prediction of nonlinear dynamic obstacle models from noisy, partial observations, utilizing an adaptive sliding-window strategy, Page matrix, Singular Value Hard Thresholding (SVHT), Cadzow projection, Hankel matrix, Hankel-DMD, residual analysis, and multi-step forecasts.
The framework denoises measurements and forecasts dynamics by embedding noisy data into a Hankel matrix, estimating effective rank via Page matrix and SVHT, and applying Cadzow projection for structured low-rank consistency.
This approach constructs a time-varying Hankel-DMD lifted linear predictor for multi-step forecasts, providing denoised trajectories and local noise variance estimates suitable for real-time control frameworks.

GUI-AIMA: ALIGNING INTRINSIC MULTIMODAL ATTENTION WITH A CONTEXT ANCHOR FOR GUI GROUNDING

GUI-AIMA (Aligning Intrinsic Multimodal Attention with a Context Anchor for GUI Grounding): introduces an attention-based, coordinate-free framework that aligns intrinsic MLLM multi-head self-attention with patch-wise grounding signals, utilizing a Vision Encoder (processes screenshot into visual tokens), Language Model Decoder (processes user query into text tokens), Multi-head Self-Attention (computes attention between query/visual tokens), Token (aggregates query-visual attentions), Visual-sink Query Tokens (identifies relevant query tokens for weighting), Attention Head Weighting Mechanism (weights attention heads based on Qs), Patch-wise Attention Vector (aggregated attention for grounding), Patch-wise Prediction (final grounding output), Coordinate-free Patch-wise Labeling (generates ground truth patch labels), Attention Grounding Loss (supervises patch-wise predictions), and an optional Two-step Inference with Zoom-in (refines predictions for high-res GUIs).
The framework simplifies vanilla attention-based visual grounding by using a learnable token to implicitly aggregate query-to-visual attention heads and employs a novel attention head weighting mechanism based on visual-sink query tokens for efficient and generalized GUI grounding.
GUI-AIMA achieves state-of-the-art performance among 3B models with exceptional data efficiency, demonstrating that light training can trigger the native grounding capability of MLLMs, and can be extended with a zoom-in stage for high-resolution screenshots without additional training.

EXPERIENCE-DRIVEN EXPLORATION FOR EFFICIENT API-FREE AI AGENTS

KG-Agent: introduces an experience-driven learning framework that structures pixel-level GUI interactions into a persistent State-Action Knowledge Graph (SA-KG), a Procedural Memory, and a VLM-based Reasoning Module, enabling efficient exploration and long-term strategic planning in API-free environments.
The SA-KG serves as the agent's long-term memory, connecting functionally similar GUI states and modeling acquired skills as edges, while a hybrid intrinsic reward mechanism guides learning by balancing exploitation and exploration.
This approach significantly enhances exploration efficiency and strategic depth in complex, open-ended GUI-based decision-making environments by transforming unstructured pixel-level experience into actionable knowledge.

1st November 2025

Don't Just Search, Understand: Semantic Path Planning Agent for Spherical Tensegrity Robots in Unknown Environments

SATPlanner (Semantic Agent for Tensegrity robots): introduces an LLM-driven agent for spherical tensegrity robots, leveraging a System Prompt, Sensors Module, Memory Module, Prompt Manager, Reasoning (LLM), Self-Check Module, Controller, Actuators, and an Adaptive Observation Window (AOW) Mechanism to perform efficient and robust path planning in unknown environments.
The framework reframes path planning as a semantic reasoning task, utilizing the LLM's comprehension capabilities to generate efficient and reliable planning strategies, and dynamically adjusts its perceptual field via the AOW mechanism.
SATPlanner achieves a 100% success rate and significantly reduces search space compared to traditional algorithms, demonstrating practical feasibility on a physical spherical tensegrity robot prototype.

A CPU-CENTRIC PERSPECTIVE ON AGENTIC AI

CGAM (CPU and GPU-Aware Micro-batching) and MAWS (Mixed Agentic Workload Scheduling): introduces two scheduling optimizations, CGAM and MAWS, to address CPU-centric bottlenecks in agentic AI workloads, improving performance and efficiency.
CGAM optimizes homogeneous workloads by capping batch sizes and using micro-batching for sequential CPU tool processing and GPU LLM inference, while MAWS adaptively schedules heterogeneous CPU-heavy and LLM-heavy tasks using multi-processing and multi-threading.
The framework achieves up to 2.1x P50 latency speedup for homogeneous workloads and 1.41x for heterogeneous workloads compared to multi-processing benchmarks, demonstrating significant performance gains.

Leveraging Multi-Agent System (MAS) and Fine-Tuned Small Language Models (SLMs) for Automated Telecom Network Troubleshooting

MAS (Multi-Agent System): introduces an agentic workflow for automated telecom network troubleshooting, coordinating specialized agents like an LLM-powered orchestrator, a fine-tuned SLM-powered solution planner, root cause analyzer, executor, data retriever, and dashboard display.
The framework leverages fine-tuned SLMs on proprietary troubleshooting documents to generate domain-grounded remediation plans, significantly reducing troubleshooting time and SME workload.
It integrates a Human-in-the-Loop mechanism for plan validation and employs a ReAct-style loop for fault detection, analysis, and remediation across RAN and Core network domains.

AgentGit: A Version Control Framework for Reliable and Scalable LLM-Powered Multi-Agent Systems

AgentGit (Agent Version Control Framework for Reliable and Scalable LLM-Powered Multi-Agent Systems): introduces a novel framework that integrates Git-like rollback and branching mechanisms into LLM-powered multi-agent systems, built on LangGraph, enabling state commit, revert, branching, and checkpoints for enhanced reliability and scalability.
This framework allows agents to traverse, compare, and explore multiple trajectories efficiently, significantly reducing redundant computation, runtime, and token usage in complex tasks.
AgentGit provides robust solutions for error recovery, safe exploration, iterative debugging, and A/B testing, fostering more robust MAS design and collaborative AI systems.

GDPR-Bench-Android: A Benchmark for Evaluating Automated GDPR Compliance Detection in Android

GDPR-Bench-Android: introduces a comprehensive benchmark for evaluating automated GDPR compliance detection in Android applications, featuring a GDPR-Bench-Android Dataset (1951 annotated Android violations), a novel Formal-AST (source-code-native formal method), and evaluations of Baseline LLMs, Retrieval-Augmented (RAG) Method (LLM + violation knowledge base), and Agentic (ReAct) Method (LLM + reasoning + tool use) across two tasks: Task 1: Multi-Granularity Violation Localization (rank GDPR articles at file/module/line) and Task 2: Snippet-Level Multi-Label Classification (assign all applicable articles to snippet).
The benchmark provides the first systematic evaluation of diverse automated methods on GDPR compliance detection directly from Android source code, addressing a critical gap in existing research.
Empirical results reveal that no single paradigm excels across all tasks, with agentic methods performing best at file-level localization, LLMs at line-level localization, and RAG achieving the highest precision for multi-label classification.

Agentic Auto-Scheduling: An Experimental Study of LLM-Guided Loop Optimization

COMPILOT (Compiler Pilot): introduces an experimental framework where an LLM acts as an optimization agent, iteratively proposing loop transformations to a compiler and refining its strategy based on empirical feedback.
This closed-loop interaction involves the Context Initializer briefing the LLM, the Interaction Loop Handler processing LLM proposals and compiler feedback, and the Compiler & Runtime Environment applying transformations and measuring performance.
The framework leverages off-the-shelf LLMs for high-level strategic exploration while entrusting the compiler with formal correctness checks and code generation, achieving significant speedups without LLM fine-tuning.

Issue-Oriented Agent-Based Framework for Automated Review Comment Generation

RevAgent (Issue-Oriented Agent-Based Framework for Automated Review Comment Generation): introduces a novel agent-based framework that decomposes automated code review comment generation into Generation, Discrimination, and Training stages, utilizing category-specific commentator agents and a critic agent to produce accurate, issue-oriented review comments.
The framework leverages five specialized LLM commentator agents to analyze code changes from distinct perspectives and generate candidate comments, which are then evaluated by a critic agent to select the most appropriate issue-comment pair.
RevAgent's training stage fine-tunes all agents on curated, category-specific data using LoRA and a Candidate Comment Retrieval approach, enhancing task specialization and overall performance in generating readable, accurate, and context-aware review comments.

ReMind: Understanding Deductive Code Reasoning in LLMs

ReMind: introduces a novel multi-agent framework for robust deductive code reasoning, integrating code mutation, execution, and inspection to enhance reasoning accuracy and robustness.
The framework systematically explores code variants, simulates execution traces, and validates reasoning paths against control flow graphs to detect and correct flaws.
ReMind significantly improves code reasoning accuracy across diverse LLMs, reduces self-execution bias, and enhances zero-shot generalization on complex benchmarks.

SmartDoc: A Context-Aware Agentic Method Comment Generation Plugin

SmartDoc (Context-Aware Agentic Method Comment Generation Plugin): introduces an IntelliJ IDEA plugin that acts as an AI agent, leveraging its Memory (Stack), Tool (AST Analysis), and an LLM to generate context-aware method comments for Java codebases.
The system employs a Comment Generation Coordinator to manage the workflow, including call graph traversal via DFS for full-context LLM prompts, and provides a View/Alter Suggestion interface for user interaction.
SmartDoc also incorporates a Feedback Mechanism for user satisfaction and utilizes metrics like BERTScore, BLEU, and ROUGE-1 to evaluate the accuracy of its generated comments against ground truth.

TREE TRAINING: ACCELERATING AGENTIC LLMS TRAINING VIA SHARED PREFIX REUSE

Tree Training: introduces a novel paradigm for accelerating agentic LLM training by computing shared prefixes once and reusing intermediate results across branches, comprising Tree Packing, Gradient Restoration, custom kernel, and runtime optimizations.
This approach efficiently reuses shared computations across tree-structured trajectories, significantly reducing redundant forward and backward passes while maintaining gradient correctness.
The method achieves up to 3.9x reduction in total training time for agentic LLM SFT and RL training by addressing memory constraints and ensuring accurate gradient propagation.

EvoMem: Improving Multi-Agent Planning with Dual-Evolving Memory

EvoMem (Improving Multi-Agent Planning with Dual-Evolving Memory): introduces a multi-agent framework for planning, comprising LLM-based agents (Constraint Extractor, Verifier, Actor) and two memory modules (Constraint Memory, Query-feedback Memory).
This framework leverages a dual-evolving memory mechanism where CMem (Constraint Memory) stores fixed, query-level constraints, and QMem (Query-feedback Memory) accumulates dynamic, iteration-level feedback for solution refinement.
EvoMem's iterative self-correction process, guided by these memory modules, significantly enhances performance in complex natural language planning tasks.

Sherlock: RELIABLE AND EFFICIENT AGENTIC WORKFLOW EXECUTION

Sherlock: introduces a principled serving framework for agentic workflows that jointly optimizes latency, cost, and accuracy by identifying and verifying error-prone nodes through counterfactual analysis and dynamic verifier selection, complemented by selective speculative execution and rollback mechanisms.
The framework includes a Domain On-boarding Phase (learns policies offline) and an Online Phase (executes workflows dynamically), utilizing a Topological Vulnerability Estimator (identifies error-prone nodes) and a Learned Verifier Selector (chooses cost-optimal verifier).
Its Speculative Execution Runtime (overlaps verification, computation) with a Rollback Controller (manages re-execution on failure) and Similarity-based Rollback Policy (decides when to rollback) significantly reduces execution time and cost while improving accuracy.

SlideAgent: Hierarchical Agentic Framework for Multi-Page Visual Document Understanding

SlideAgent (Hierarchical Agentic Framework for Multi-Page Visual Document Understanding): introduces a versatile agentic framework for understanding multi-modal, multi-page, and multi-layout documents, especially slide decks, with Global Agent (generates document-level knowledge), Page Agent (generates page-level knowledge), Element Agent (generates element-level knowledge), Element Parsing (decomposes page into elements), Element Detection (detects visual elements), Merging & Deduplication (merges fragmented elements), Element Retrieval (retrieves parsed elements), Knowledge Base (stores hierarchical knowledge), Global Knowledge (document-wide topics), Page Knowledge (page-specific features), Element Knowledge (fine-grained components), Inference (retrieves, reasons, answers), Agent Orchestrator (classifies query, activates agents), Subquery Generation (generates query-specific subqueries), Retrieval Function (fetches relevant content), Answer Synthesizer (combines agent reasoning), Visual Input (multi-page visual documents), Query (user query), and Answer (natural language response).
SlideAgent employs specialized LLM-based agents at global, page, and element levels to construct a structured, query-agnostic knowledge base during a knowledge construction stage, capturing overarching themes and detailed visual/textual cues.
During inference, the framework selectively activates these specialized agents for multi-level reasoning and integrates their outputs into coherent, context-aware answers, significantly improving fine-grained reasoning over complex visual documents.

SciTextures: Collecting and Connecting Visual Patterns, Models, and Code Across Science and Art

SciTextures: introduces a large-scale dataset of visual patterns, models, and code, generated by an agentic AI pipeline, and three novel benchmarking tasks (Im2Code, Im2Im, Im2Sim2Im) to evaluate AI's understanding of generative processes.
The dataset comprises over 100,000 images from 1,200+ generative models across science, technology, and art, enabling exploration of the link between visual forms and underlying mechanisms.
The benchmarking tasks assess Vision-Language Models' ability to match images to code/descriptions, identify patterns from the same process, and infer/simulate generative processes from real-world images.

Unveiling Uniform Shifted Power Law in Stochastic Human and Autonomous Driving Behavior

Shifted Power Law Model: introduces a novel distribution model that accurately characterizes the stochasticity of human-driven and autonomous vehicle behaviors, particularly in the long-tail regime, using a parsimonious analytical form with one or two parameters.
This model, integrated into an agent-based traffic simulator, enables forward-rolling simulations that reproduce realistic crash patterns and improves the fidelity of safety assessment without post hoc correction.
The framework leverages an LSTM network and FFNs to predict vehicle acceleration statistics, then applies the shifted power law to model the normalized residual distribution, and quantifies risk using a derived Risk Index.

COHERE - Congestion-aware Offloading and Handover via Empirical RAT Evaluation for Multi-RAT Networks

COHERE (Congestion-aware Offloading and Handover via Empirical RAT Evaluation): introduces a multi-criteria framework for dense multi-RAT networks, utilizing Input/Measurement, Normalization of measurements, AHP based weights, Entropy based weights, Weighted Decision Matrix, TOPSIS based ranking, RAT-based RSSI threshold, Target AP, Stand-in AP, and Radio Link Transfer to enable congestion-aware offloading and handover decisions.
The framework integrates subjective (AHP) and objective (Entropy) weighting strategies within a TOPSIS pipeline, augmented by a RAT-based RSSI threshold, to ensure robust and policy-aligned offloading decisions.
COHERE aims to reduce 5G network load, minimize handovers, and improve link delay and throughput by considering RSSI, access-node load, and link delay for optimal RAT selection.

Yanyun-3: Enabling Cross-Platform Strategy Game Operation with Vision-Language Models

Yanyun-3: introduces a general-purpose agent framework that enables autonomous cross-platform operation across three heterogeneous strategy game environments by integrating Qwen2.5-VL for vision-language reasoning and UI-TARS for precise execution.
The framework utilizes a closed-loop pipeline of screen capture, model inference, and action execution, demonstrating strong real-time performance and cross-platform generalization.
The work establishes a general paradigm, "combination granularity," for enhancing VLM performance through structured multimodal data organization, differentiating between intra-sample fusion and inter-sample mixing.

Information-Driven Fault Detection and Identification For Multi-Agent Spacecraft Systems: Collaborative On-Orbit Inspection Mission

Information-Driven FDI framework: introduces a global-to-local, task-aware fault detection and identification (FDI) framework for multi-spacecraft systems performing collaborative inspection by linking fault metrics directly to a global cost functional ($H$), agent contribution metrics ($H_i(t)$), and an adaptive threshold ($\tau_i(t)$).
The framework unifies global task awareness with local agent-level performance monitoring to reliably detect and classify actuator and sensor faults in distributed spacecraft networks.
Key components include the global cost functional $H$ derived from information gain, its decomposition into agent contributions $H_i(t)$, and higher-order gradient metrics used for fault separation.

One Request, Multiple Experts: LLM Orchestrates Domain Specific Models via Adaptive Task Routing

ADN-Agent: introduces an architecture that leverages a general LLM powered Planner to coordinate multiple Domain Specific Models (DSMs) via a novel communication mechanism, enabling adaptive intent recognition, task decomposition, and DSM invocation.
The architecture includes a Planner, a suite of DSMs augmented with Translator Modules, and a Summarizer, all designed to handle complex, multi-scenario Active Distribution Network (ADN) operation requests.
An automated training pipeline for Fine-Tuned Small Language Models (FT-SLMs) is also proposed to enhance the system's capability for language-intensive subtasks like ADN model adjustment.

Alonopedia: an LLM agent orchestrating multimodal learning for ionic liquid discovery

Alonopedia: introduces an LLM agent orchestrating multimodal learning for Ionic Liquid (IL) discovery, powered by an LLM-augmented multimodal domain foundation model for ILs, enabling accurate property predictions and incorporating a hierarchical search architecture for molecular screening and design.
The agent utilizes a ReAct-driven pipeline centered around a GPT-5 powered planner that interacts with six specialized tools for end-to-end IL research, from knowledge extraction to wet-lab validation.
The core Property Predictor employs a two-stage training strategy (modality alignment and fine-tuning) fusing molecular graphs, SMILES sequences, and physicochemical descriptors.

Learning to Refine: An Agentic RL Approach for Iterative SPARQL Query Construction

Agentic RL Framework: introduces a novel agentic framework where an LLM learns a resilient policy for the sequential process of iterative SPARQL construction using Group Relative Policy Optimization (GRPO), with components including an Agent policy (LLM with QLoRA adapters), an Environment (SPARQL execution), State (Conversation history), Action (Text generation with structured blocks), and Reward (Terminal composite signal).
The framework transforms multi-hop Knowledge Graph Question Answering (KGQA) from a one-shot generation task into a dynamic decision-making process grounded in executable feedback from a Knowledge Graph (KG).
The RL-Tuned Agent achieves 49.7% accuracy on a curated LC-QuAD 2.0 subset, significantly outperforming zero-shot baselines by learning adaptive interaction policies.

Safe-ROS: An Architecture for Autonomous Robots in Safety-Critical Domains

Safe-ROS: introduces an architecture for developing reliable and verifiable autonomous robots in safety-critical domains, featuring an intelligent control system (SRAS) and a formally verifiable oversight system (SS) composed of Safety Instrumented Functions (SIFs).
The architecture integrates formal methods tools like FRET for requirement elicitation, MCAPL/AJPF/GWENDOLEN for SIF verification, and Dafny for integration correctness proof.
The SIF, implemented as a BDI agent, monitors the SRAS (ROS-based motion controller) and enforces safety requirements, demonstrated via an obstacle avoidance task on an AgileX Scout Mini robot.

Human-AI collaborative autonomous synthesis with pulsed laser deposition for remote epitaxy

HAIC (Human-AI collaborative) workflow: introduces a tightly coupled, mixed-initiative system integrating human expertise, LLMs, and an autonomous pulsed laser deposition (PLD) system for accelerated materials synthesis.
The workflow utilizes LLM-assisted hypothesis generation via RAG and Bayesian Optimization for active learning in autonomous batches, targeting remote epitaxy of BaTiO3/graphene.
Offline Human-AI Conferences enable iterative data analysis and process refinement, allowing the system to efficiently map the growth space and identify optimal synthesis conditions using in situ diagnostics.

LLM-Driven Transient Stability Assessment: From Automated Simulation to Neural Architecture Design

LLM-Driven TSA Workflow: introduces an end-to-end agentic LLM framework that automates Transient Stability Assessment (TSA) from scenario generation using the ANDES simulator to optimized neural network design via a multi-agent LLM-NND pipeline.
The framework utilizes Prompt Engineering and Enhanced RAG to enable the LLM to translate natural language requests into executable simulation code and generate high-quality, balanced datasets for training.
The LLM-NND component employs collaborative LLM agents (Stratege, Operator, Generator) within a performance-driven feedback loop to autonomously discover compact, high-accuracy TSA models.

Citation

How to cite my work?

@misc{MaattaAutonomousAgents2023,
  author = {Teemu Maatta},
  title = {Autonomous Agents},
  year = {2023},
  howpublished = {\url{http://github.com/tmgthb/Autonomous-Agents}},
  note = {Accessed: YYYY-MM-DD}
}

Back to top

Name		Name	Last commit message	Last commit date
Latest commit History 1,468 Commits
resources		resources
LICENSE		LICENSE
README.md		README.md

License

tmgthb/Autonomous-Agents

Folders and files

Latest commit

History

Repository files navigation

Autonomous Agents

Research papers: 2025 (1/3)

2nd December 2025

1st December 2025

30th November 2025

29th November 2025

28th November 2025

27th November 2025

26th November 2025

25th November 2025

24th November 2025

23rd November 2025

22nd November 2025

21st November 2025

20th Nov 2025

19th Nov 2025

18th November 2025

17th Nov 2025

16th Nov 2025

15th Nov 2025

14th Nov 2025

13th November 2025

12th November 2025

11th November 2025

10th November 2025

9th November 2025

8th November 2025

7th November 2025

6th November 2025

5th November 2025

4th November 2025

3rd November 2025

2nd November 2025

1st November 2025

Citation

About

Topics

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 2

Uh oh!

Packages