Problems of Scaling Large Language Model Inference During Simultaneous Operation of Multiple Autonomous Agents (Vol. 12, Issue 2, March-April 2026)
Author(s): Kapil Verma
Page No: 170-178
Keywords: large language models, multi-agent systems, inference scaling, KV cache.
|
Abstract:
The article examines the transformation of large language models from a single-query, single-response regime to multi-agent configurations, in which a single external stimulus generates a tree of dependent calls to the model, and analyzes the specific constraints this places on inference scaling. The relevance of the study stems from the proliferation of LLM-based agents and the growing share of workloads in which the decisive factor is not the total number of tokens but the cadence of short iterations. The objective is to identify the causal bottlenecks that determine throughput and tail latencies during the concurrent operation of multiple autonomous executors. On the basis of an analytic–synthetic review of 11 sources, a framework is proposed that shifts the unit of analysis from an individual response to a chain of dependent micro-steps, interpreted as competing job classes. The scientific contribution consists in systematizing the role of the KV cache as a dynamic scarce resource, introducing the phenomenon of contextual inflation, and linking these effects to batching policies, service fairness, distributed inference, and step routing across models of different sizes. It is shown that bottlenecks in multi-agent systems shift from arithmetic performance to memory, attention-state management, stopping discipline, and context engineering, while tail and network latencies take on the character of cascading lockups; the necessity of role-dependent token budgets and carefully designed eviction and state-folding strategies is substantiated. The article is intended for researchers and engineers developing LLM-based multi-agent systems and the infrastructure for their operation.
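As an illustration of the abstract's core framing, the following minimal Python sketch treats agent micro-steps as competing job classes with role-dependent token budgets and models the KV cache as a scarce pool with LRU eviction of idle agent state. All names, budget values, and model dimensions below are hypothetical assumptions for illustration only and are not drawn from the article.

```python
from collections import OrderedDict
from dataclasses import dataclass

# Illustrative per-role token budgets (hypothetical values, not from the article).
ROLE_BUDGETS = {"planner": 4096, "worker": 1024, "critic": 512}

# Rough per-token KV-cache footprint for a hypothetical 32-layer, 4096-wide
# fp16 model: 2 tensors (K and V) * 2 bytes * layers * hidden size.
BYTES_PER_TOKEN_KV = 2 * 2 * 32 * 4096


@dataclass
class MicroStep:
    agent_id: str
    role: str
    prompt_tokens: int
    max_new_tokens: int


class KVCachePool:
    """Tracks per-agent KV-cache residency under a fixed memory budget.

    When admitting a new micro-step would exceed the budget, the
    least-recently-used idle agent state is evicted first: a crude
    stand-in for the eviction and state-folding strategies the
    abstract argues for.
    """

    def __init__(self, budget_tokens: int):
        self.budget = budget_tokens * BYTES_PER_TOKEN_KV
        self.resident = OrderedDict()  # agent_id -> bytes held

    def used(self) -> int:
        return sum(self.resident.values())

    def admit(self, step: MicroStep) -> bool:
        # Clamp generation length to the role's token budget: a simple
        # form of the role-dependent budgets described in the abstract.
        cap = ROLE_BUDGETS.get(step.role, 512)
        tokens = step.prompt_tokens + min(step.max_new_tokens, cap)
        need = tokens * BYTES_PER_TOKEN_KV
        # Evict LRU idle states until the step fits or nothing remains.
        while self.used() + need > self.budget and self.resident:
            self.resident.popitem(last=False)
        if self.used() + need > self.budget:
            return False  # step must queue: one source of cascading delay
        self.resident[step.agent_id] = need
        self.resident.move_to_end(step.agent_id)
        return True


if __name__ == "__main__":
    pool = KVCachePool(budget_tokens=4096)  # illustrative budget
    steps = [
        MicroStep("planner-1", "planner", prompt_tokens=2000, max_new_tokens=800),
        MicroStep("worker-1", "worker", prompt_tokens=900, max_new_tokens=200),
        MicroStep("critic-1", "critic", prompt_tokens=300, max_new_tokens=100),
    ]
    for s in steps:
        print(s.agent_id, "admitted" if pool.admit(s) else "queued")
    print("resident after admission:", list(pool.resident))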
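Running the sketch, the third micro-step forces eviction of the idle planner state, so all three are admitted but only the two most recent agents remain resident: a toy analogue of contextual inflation turning the KV cache into the binding constraint rather than arithmetic throughput.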
|
Article Info:
|
Received: 21 Mar 2026; Received in revised form: 22 Apr 2026; Accepted: 25 Apr 2026; Available online: 29 Apr 2026 |
|