MemAgent: Reshaping Long-Context LLM with Multi-Conv RL based Memory Agent

Hongli Yu1,2,3, Tinghong Chen2, Jiangtao Feng2, Jiangjie Chen1,3, Weinan Dai1,2,3, Qiying Yu1,2,3, Ya-Qin Zhang2,3, Wei-Ying Ma2,3, Jingjing Liu2,3, Mingxuan Wang1,3, Hao Zhou2,3
1ByteDance Seed
2Institute for AI Industry Research (AIR), Tsinghua University
3SIA-Lab of Tsinghua AIR and ByteDance Seed

Introduction

We propose MemAgent, a novel long-context processing framework that directly optimizes long-context task performance through end-to-end reinforcement learning, without altering the underlying model architecture. MemAgent demonstrates strong long-context capability: trained with an 8K context window on 32K-token texts, it extrapolates to a 3.5M-token QA task with less than 5% performance loss and achieves over 95% accuracy on the 512K RULER test.


MemAgent achieves three core breakthroughs:


1. Novel memory mechanism: The agent reads text in segments and efficiently updates memory through an overwriting strategy. This design enables the model to process arbitrarily long inputs within a fixed context window, fundamentally overcoming the window length limitations of traditional Transformer architectures.

2. O(n) complexity: By decoupling computation from text length, the complexity of processing long texts is transformed from quadratic growth to linear growth.

3. RL-driven extrapolation: We enhance the DAPO algorithm to support multi-turn training over context-independent conversations. Based on this, the trained model exhibits remarkable extrapolation performance.

Through a simple yet effective design, we demonstrate the first truly trainable memory mechanism powered by reinforcement learning, showcasing the vast potential of using RL to optimize agent workflows.


Figure: Accuracy scores on RULER-HotpotQA (main result).

Method

The MemAgent Framework

Inspired by human behavioral patterns when processing long texts, we propose MemAgent, a novel approach to long-context processing that requires no modification to the model architecture: equipping the LLM with a dynamically updated "memory module".

Figure: MemAgent architecture overview.

MemAgent introduces a fixed-length auxiliary memory panel that enables the model to process long texts segment by segment, actively updating the memory state after each segment in a "local processing + global fusion" workflow. The memory is updated dynamically throughout inference and, once all segments have been processed, it is used to generate the final output by aggregating the critical information it has accumulated.
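To make the workflow concrete, here is a minimal sketch of the inference loop, assuming a hypothetical `llm_generate` callable and simplified prompt wording (the actual templates are shown in the figure in the next subsection):

```python
# Minimal sketch of the MemAgent inference loop (illustrative only; the prompt
# wording, chunk size, and `llm_generate` helper are assumptions, not the
# exact templates or code used in this work).

def memagent_answer(llm_generate, question: str, document: str,
                    chunk_size: int = 4096) -> str:
    memory = ""  # m^0: the memory starts empty
    # Split the document into contiguous chunks so each turn fits a fixed window.
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    for piece in chunks:
        # Each turn sees only (question, current memory, one chunk); the model's
        # output overwrites the previous memory.
        memory = llm_generate(
            f"Question: {question}\n"
            f"Current memory:\n{memory}\n"
            f"New text chunk:\n{piece}\n"
            "Rewrite the memory, keeping only information relevant to the question."
        )
    # The final answer is generated from the memory alone, not the full document.
    return llm_generate(
        f"Question: {question}\nMemory:\n{memory}\nAnswer based on the memory above."
    )
```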



Training MemAgent with Multi-conv RL

We train MemAgent with Reinforcement Learning from Verifiable Rewards (RLVR), which has recently shown exceptional performance in the reasoning domain, rather than relying on simple fine-tuning or prompt engineering. To this end, we extend the existing DAPO algorithm to support end-to-end optimization of agent workflows composed of multiple context-independent conversations.


Figure: Comparison between vanilla GRPO and Multi-Conv DAPO.
Figure: MemAgent template for context processing (top) and final answer generation (bottom).

Multi-Conv Training Mechanism: For each input sample, the model generates multiple conversations. The input of each conversation is constructed independently rather than by concatenating earlier generation trajectories, in contrast to tool-calling setups that feed the full multi-turn trajectory back in as input.


Reward Computation: The final answer is extracted from the last conversation turn. The advantage is computed from a rule-based outcome reward with group normalization and is then assigned to all conversations associated with that rollout.


Policy Optimization: Each conversation serves as an independent optimization target, with a DAPO-style token-level averaged loss computed from its advantage.


$$ \begin{aligned} \mathcal{J}_{\text{DAPO}}(\theta) =\quad &\mathbb{E}_{(q,a)\sim \mathcal{D}, \{o_{i,j}\}_{i=1}^G\sim \pi_{\theta_\text{old}}(\cdot\mid q,~o_{i,j-1})} \\ &\Bigg[\frac{1}{\sum_{i=1}^{G}\sum_{j=1}^{n_i}|o_{i,j}|}\sum_{i=1}^{G}\sum_{j=1}^{n_i}\sum_{t=1}^{|o_{i,j}|} \Big(\mathcal{C}_{i,j,t} - \beta D_{\text{KL}}(\pi_{\theta} || \pi_{\text{ref}} )\Big) \Bigg] \\ \text{where } \mathcal{C}_{i,j,t} = &\min\Big(r_{i,j,t}(\theta) \hat{A}_{i,j,t}, \ \text{clip} \Big( r_{i,j,t}(\theta), 1 - {\varepsilon_{low}}, 1 + {\varepsilon_{high}} \Big) \hat{A}_{i,j,t}\Big) \end{aligned} $$
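To illustrate how the reward and advantages flow through this objective, the following is a minimal sketch of the group-normalized advantage computation and its broadcast to every conversation of a rollout; `reward_fn` and the dictionary layout are assumptions for illustration, not the actual training implementation:

```python
# Sketch of Multi-Conv advantage computation (illustrative, not the actual
# training code). For one prompt, G independent multi-conversation rollouts
# are sampled; each rollout's outcome reward is group-normalized, and the
# resulting advantage is shared by every conversation in that rollout.
import statistics

def multi_conv_advantages(rollouts, reference_answer, reward_fn):
    """rollouts: list of G rollouts; each rollout is a list of conversations,
    and the last conversation of a rollout contains the final answer."""
    # Rule-based outcome reward, computed only from the final answer.
    rewards = [reward_fn(rollout[-1]["response"], reference_answer)
               for rollout in rollouts]
    # Group normalization across the G rollouts.
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) + 1e-6
    advantages = [(r - mean) / std for r in rewards]
    # Every conversation (and every response token within it) inherits its
    # rollout's advantage; the DAPO-style loss then averages the clipped
    # surrogate over all response tokens of all conversations in the group.
    return [
        [{"conversation": conv, "advantage": adv} for conv in rollout]
        for rollout, adv in zip(rollouts, advantages)
    ]
```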


Re-modeling the Language Model Generation Process

Traditional Text Modeling:

Traditional autoregressive language models factorize a token sequence of length N as follows:


$p(\mathbf{x}_{1:N}) = p(x_1) \prod_{n=2}^{N} p(x_n \mid \mathbf{x}_{1:n-1})$

This method requires attention computation over all previously generated tokens, leading to a cost of $O(N^2)$.


Rethinking MemAgent from an LM Perspective

To better understand the MemAgent design, we re-think the language-model factorization in the following way.


Specifically, the input sequence is segmented into K contiguous chunks \((c^1, c^2, \ldots, c^K)\), with each chunk containing at most \(C\) tokens.

Let \(m^{1:K-1}\) denote the latent memory variables, with initial state \(m^0 = \emptyset\). The autoregressive factorization is then reformulated as a sequence of chunk-processing steps that read from and write to the memory:


$p(\mathbf{x}_{1:N}) = \sum_{\mathbf{m}^{1:K-1}} \prod_{k=1}^{K} \underbrace{p(\mathbf{c}^k \mid \mathbf{m}^{k-1})}_{\text{read}} \cdot \underbrace{p(\mathbf{m}^k \mid \mathbf{c}^k, \mathbf{m}^{k-1})}_{\text{write}}$

This new formulation bounds the context window of each conversation to a fixed size, yielding an $O(N)$ overall computational cost.
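A rough accounting makes the linear cost explicit (here $|m|$ denotes the fixed memory length, a symbol introduced only for this estimate):

$$
\underbrace{\sum_{n=1}^{N} O(n)}_{\text{full-context attention}} = O(N^2),
\qquad
\underbrace{\left\lceil \tfrac{N}{C} \right\rceil \cdot O\big((C + |m|)^2\big)}_{\text{MemAgent: one conversation per chunk}} = O(N) \ \text{ for fixed } C \text{ and } |m|.
$$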


Experiments

Main Results


Baseline Models: Experimental results demonstrate that existing models exhibit significant performance degradation when confronted with ultra-long contexts.

  • DS-distill series: Within the context window, performance decays rapidly to extremely low levels as length increases; beyond the window, the models become essentially ineffective due to information loss.
  • QwenLong-L1: With a post-training length of 60K, performance declines by roughly 10% within that range, while the drop from 64K to 112K exceeds 28%, even though these lengths still fit within the context window.
  • Qwen2.5-Instruct-1M series: Performance declines gradually within the context window, but falls to zero at the 896K test length, which is still within the 1M context range.


RL-MemAgent: In contrast, RL-MemAgent demonstrates exceptional stability in ultra-long context processing:

  • RL-MemAgent-14B: Less than 5.5% performance degradation on 3.5M-token tasks, achieving near-lossless extrapolation.
  • RL-MemAgent-7B: Only an 11% performance decline at the longest context lengths, with overall performance far exceeding existing long-context models.


Figure: Main experimental results.

Ablation Study


To validate the necessity of reinforcement learning for training MemAgent, we conduct comprehensive ablation experiments.

  • Base Model: The original model exhibits severe performance degradation as context length increases; in particular, beyond 112K the input is truncated by the context-window limit, making effective extrapolation nearly impossible.
  • MemAgent (w/o RL): Compared to the base model, it performs better and maintains reasonable capability on tasks exceeding the context length, but overall performance still declines as input length grows.
  • RL-MemAgent: The RL-trained MemAgent maintains near-lossless extrapolation capability across all context lengths.


Figure: Ablation study on RULER-HotpotQA.

Other OOD Tasks in the RULER Benchmark


RULER is the current standard benchmark for studying long-context extrapolation; its core advantage is that task lengths can be controlled during data generation. For training, we use synthetic QA data built from HotpotQA.
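As a hedged sketch of how such length-controlled QA samples can be synthesized (an assumption about the pipeline, not the exact generation script), the gold paragraphs of a HotpotQA question are mixed with distractor paragraphs until the context reaches a target length:

```python
# Hypothetical length-controlled QA sample construction in the RULER style.
# `count_tokens` stands in for a tokenizer-based length function.
import random

def build_sample(question, answer, gold_paragraphs, distractor_pool,
                 target_len, count_tokens):
    context = list(gold_paragraphs)
    total = sum(count_tokens(p) for p in context)
    pool = list(distractor_pool)
    random.shuffle(pool)
    # Pad with distractor paragraphs until the target context length is reached.
    for para in pool:
        if total >= target_len:
            break
        context.append(para)
        total += count_tokens(para)
    random.shuffle(context)  # gold paragraphs land at random positions
    return {"question": question, "answer": answer,
            "context": "\n\n".join(context)}
```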


Needle-in-a-Haystack (NIAH)

Locating key needles in ultra-long texts, across 8 distractor variants.


Variable Tracking (VT)

Simulating program analysis scenarios, tracking variable references and assignment relationships.


Aggregation (Agg)

Aggregating scattered information to evaluate the model's ability to grasp global features.


Question Answering (QA)

Conducting multi-hop complex reasoning to test the model's contextual understanding and QA capabilities.


OOD Experiments: We test our model on 10 untrained tasks and on a QA task synthesized from a new dataset, SQuAD, using heatmaps to visualize the performance of different models across length ranges and task types. Our model achieves SOTA results on both OOD tests, and the 14B model reaches an average score above 95% across the 10 OOD RULER tasks at 512K context length.


Figures: RULER average across 10 tasks (AVG) and the RULER-QA task built from SQuAD (QA_1).

Conclusion


In this work, we propose a novel framework for long-context processing, with contributions spanning three key dimensions:

1. Architectural Innovation: We introduce an innovative mechanism that enables large language models to process arbitrarily long input sequences within a limited context window and with linear-time complexity, fundamentally addressing the computational bottlenecks faced by traditional long-context methods.

2. Agent Training Methodology: We design a complete agent workflow to implement this mechanism, and develop an end-to-end training framework based on Multi-conv RL, enabling the agent to learn how to store and retrieve relevant information effectively.

3. Extrapolation Performance: Through extensive empirical evaluation, we demonstrate that our Multi-conv RL method allows models to extrapolate far beyond their training context length with almost lossless performance at test time, substantially expanding the capability frontier of current long-context LLM systems.

Engineering Design


We implement basic Multi-Conv DAPO training code based on verl, which defines a unified interface for the Multi-Conv workflow, datasets, and configuration items, making it easy to plug in new implementations.



We further implement a fully asynchronous training framework that achieves "Agent as a Function" through a unified interface, requiring only the definition of a single function to implement arbitrary agent workflows.

  • Fully Asynchronous Pipeline: GPU and CPU resources are decoupled; AsyncLLMEngine handles multi-node inference, Ray workers manage a resident CPU task pool, and scheduling is handled through async coroutines.
  • Unified API Interface: LLMs are called through an OpenAI-style API, supporting multi-turn tool use, multi-turn independent conversations, multi-agent parallelism, and multi-task training, eliminating traditional state-machine boilerplate.
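To illustrate the "Agent as a Function" idea, here is a hedged sketch of what a user-defined MemAgent workflow might look like against an OpenAI-style endpoint; the endpoint URL, model name, and function signature are assumptions for illustration, not the framework's exact interface:

```python
# Hedged sketch of an "Agent as a Function" workflow (illustrative only).
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

async def chat(model: str, prompt: str) -> str:
    resp = await client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return resp.choices[0].message.content

async def memagent_workflow(model: str, question: str, chunks: list[str]) -> str:
    """One rollout: overwrite the memory chunk by chunk, then answer from memory."""
    memory = ""
    for piece in chunks:
        memory = await chat(
            model,
            f"Question: {question}\nMemory:\n{memory}\nChunk:\n{piece}\n"
            "Rewrite the memory, keeping only information relevant to the question.",
        )
    return await chat(
        model, f"Question: {question}\nMemory:\n{memory}\nAnswer the question."
    )

async def main() -> None:
    # Many independent rollouts can be scheduled concurrently by the async pipeline.
    docs = [["chunk 1 ...", "chunk 2 ..."], ["another document, chunk 1 ..."]]
    answers = await asyncio.gather(
        *(memagent_workflow("my-model", "Who wrote X?", c) for c in docs)
    )
    print(answers)
```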

📝 Citation

If you find this work useful, please cite our paper:



@article{memagent,
  title={MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent},
  author={Yu, Hongli and Chen, Tinghong and Feng, Jiangtao and Chen, Jiangjie and Dai, Weinan and Yu, Qiying and others},
  journal={arXiv preprint arXiv:2507.02259},
  year={2025}
}