推文 | Alfred's Site

账号:全部 mattpocockuk dair_ai simonw bcherny adocomplete trq212 dexhorthy yetone felixrieseberg 0xblacklight dani_avila7 zarazhangrui AlchainHust badlogicgames dotey petergyang vikingmute ryancarson ClaudeDevs kunchenguid leon7hao ctatedev iamzhihui kieranklaassen mckaywrigley lennysan thdxr RhysSullivan garrytan jakevin7 johnlindquist karpathy servasyy_ai theo 0xMovez 9hills RLanceMartin aidenybai dingyi hylarucoder mitchellh mitsuhiko mvanhorn omarsar0 steipete zeeg GeminiApp GergelyOrosz ZaynHao addyosmani antirez danshipper dillon_mulroy elonmusk ewind_dev idoubicc nummanali realWeZZard swyx thsottiaux yiliush 0xPaulius 0x_rody AYi_AInotes AnthropicAI Barret_China BenJames_____CMGS1988 ChadMoran DanielMiessler DaveJ DavidKPiano Dimillian FactoryAI FardeemM FarzaTV GoogleDeepMind HamelHusain HiTw93 HilaShmuel IfanJew IndieDevHailey Jack_W_Lindsey Jackywine JasonZX Jiaxi_Cui JinjingLiang KhalidWarsa LinearUncle LinghuaJ MatthewBerman MaxForAI Meari_V2_0_G Mnilax QingQ77 RaillyHugo Saccc_c Taniyatweets_VincentLogic ai_explorer25 aibuilderclub_alexalbert__alliekmiller andrewfarah antoinecojp anxue201 arkuy99 artman asynkimo bbssppllvv bearliu bentossell bridgemindai buaaxhm cailynyongyong cgtwts charmaine_klee chrisbarber chrisparkX clairevo delba_oliveira demishassabis doodlestein driaforall elithrar elvissun ericzakariasson francoisfleuret gabriell_lab iamsahaj_xyz imdigitalashish imvihv jarredsumner jerryjliu0 jianshuo jiayuan_jy jlongster jonas_nelle joshalbrecht jpschroeder kaiofreitas kepano lexrus logancyang mattyp nabeel nifinet ninthbit_ai oran_ge patrickc pejmanjohn petradonka prathamgrv prathyvsh quanruzhuoxiu quant_sheep raroque realCaigu repsiace rohanpaul_ai sama samuelstroschei sawyerhood shadcn shao__meng shaogefenhao shreyansj techgirl1908 techwith_ram thorstenball tianyi tobiaswup toddsaunders tricalt turingou tuturetom uniswap12 untraceable_the vanillaCitron velvet_shark wengtianxin wquguru xiaogaifun xicilion ycombinator zachlloydtweets ziwenxu_zodchiii

@dair_ai·6月19日

Catch up on the "loop engineering" trend in this banger article.

@omarsar0

From Prompting Agents to Loop Engineering

↗ view on x.com

关注下「loop engineering」这个趋势，这篇文章写得很好。 [引用 @omarsar0]: 从 Prompting Agents 到 Loop Engineering

@dair_ai·6月19日

RT @omarsar0: // Automating SKILL.md Generation // Increasingly, mining sessions is one of the best ways to improve your agents. OpenAI r…

↗ view on x.com

// 自动化生成 SKILL.md // 越来越多人发现，挖掘历史 session 是改进 agent 的最佳途径之一。 OpenAI r…

@dair_ai·6月19日

Generate Artifacts from YouTube videos with our new /youtube-notetaker skill.

@omarsar0

YT Videos -> Aritfacts Watch how I use my new /youtube-notetaker skill to generate artifacts from YT videos. Captures slides, notes, transcriptions,... Go try it ↓ https://t.co/BxGwMNyGE4

↗ view on x.com

用我们新的 /youtube-notetaker skill 从 YouTube 视频生成 Artifacts。 [引用 @omarsar0]: YT 视频 -> Artifacts 看我如何用新的 /youtube-notetaker skill 从 YT 视频生成 artifacts。可以提取幻灯片、笔记、字幕... 快去试试 ↓

@dair_ai·6月18日

RT @omarsar0: Cool paper on Skill routing for LLM agents. Real tasks rarely map to a single skill. They need several composed together, bu…

↗ view on x.com

关于 LLM agent 技能路由的新论文。真实任务很少能映射到单一技能，往往需要多个技能组合——但如何优雅地路由和组合，是个难题。

@dair_ai·6月18日

If you build web agents, this one is worth your time. It's on how to make agent skills reusable. (bookmark it) LLM web agents usually run as tool callers. Each turn, the model reads a fresh page and emits one low-level action, so horizons and policy-facing LLM completions both blow up on benchmarks like Mind2Web and WebArena. Skill libraries are meant to fix this by wrapping repeated fragments as callable tools, but they trigger reuse on instruction similarity or site metadata, which barely fires on held-out sites. This work routes skill reuse by transferable interaction patterns instead, so a skill learned on one site fires on new sites that share the same interaction shape. That lifts reuse where domain-keyed retrieval falls flat. Why does it matter? The same search, filter, and paginate dance shows up across sites. Abstracting it into a pattern-keyed skill makes web-agent skills generalize beyond the site on which they were learned. Paper: https://t.co/ku7kFIBhhy Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c

↗ view on x.com

做 web agent 的值得看这篇。讲的是如何让 agent skill 可复用。（收藏） LLM web agent 通常作为 tool caller 运行。每轮模型读取新页面并执行一个低层动作，导致在 Mind2Web、WebArena 等 benchmark 上 horizon 和策略侧 LLM 补全双双崩溃。 Skill library 本来是为了解决这个问题——把重复片段封装成可调用工具。但传统方案按指令相似度或站点元数据来触发复用，在 held-out 站点上几乎不生效。这项工作改用「可迁移交互模式」来路由 skill 复用，让在某个站点学到的 skill 能在具有相同交互形态的新站点上触发。这在基于域名的检索失效的场景下显著提升了复用率。为什么重要？搜索、筛选、翻页这套流程在各个网站上反复出现。把它抽象成模式键控的 skill，就能让 web agent skill 泛化到训练站点之外。

@dair_ai·6月18日

RT @dair_ai: Who should design the training environment for an RL agent, the practitioner or the policy itself? RL pipelines for LLMs usua…

↗ view on x.com

谁应该设计 RL agent 的训练环境——从业者，还是 policy 本身？ LLM 的 RL pipeline 通常……

@dair_ai·6月18日

Who should design the training environment for an RL agent, the practitioner or the policy itself? RL pipelines for LLMs usually rely on manually redesigned environments between stages, with practitioners guessing which configuration will best improve the current policy. This work proposes an LLM-as-Environment-Engineer framework. The current policy analyzes its own failure trajectories plus context and proposes the next-stage environment configuration, automating a step that has stayed stubbornly manual. They also release MAPF-FrozenLake, a controllable multi-agent testbed whose generator exposes multi-dimensional environment configs. Why does it matter? Curriculum design between RL stages is mostly gut feel today. Letting the policy read its failures and shape the next environment closes a loop that practitioners currently close by hand. Paper: https://t.co/lZHlqozrQD Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c

↗ view on x.com

谁应该设计 RL agent 的训练环境——实践者还是策略本身？ LLM 的 RL pipeline 通常依赖实践者在各阶段之间手动重新设计环境，靠猜测哪种配置能最好地改善当前策略。这项工作提出了 LLM-as-Environment-Engineer 框架。当前策略分析自身的失败轨迹及上下文，并提出下一阶段的环境配置，将这个长期依赖人工的步骤自动化。他们还发布了 MAPF-FrozenLake，一个可控的多 agent 测试床，其生成器支持多维度环境配置。为何重要？ RL 阶段间的课程设计如今主要靠直觉。让策略读取自身失败并塑造下一个环境，补上了实践者当前手动完成的这个闭环。

@dair_ai·6月17日

Outstanding paper on computer-using agents. (bookmark it) Computer-using agents drive real software through the screen, but they solve every task from scratch. Ask one to repeat a task, and it re-reads the screen and re-reasons every tap, paying the full cost again. PreAct compiles the first successful run into a small state-machine program, states that check the screen and transitions that act, then replays it directly on later runs. That runs 8.5 to 13x faster with no per-step language-model calls. Replay stays guarded. At each step, PreAct checks that the screen matches what the program expects before acting, and hands control back to the agent when reality diverges. Why does it matter? Most computer-use costs are repeated reasoning on tasks the agent has already solved. Amortizing that into a replayable program is a clean way to make agents faster the second time. Paper: https://t.co/kMloX0qC5M Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c

↗ view on x.com

关于计算机操控 agent 的优秀论文。（值得收藏）计算机操控 agent 通过屏幕驱动真实软件，但每次任务都从头解决。让它重复同一个任务，它会重新读屏幕、重新推理每一次点击，全部成本再付一遍。 PreAct 将首次成功运行编译成一个小型状态机程序——状态负责检查屏幕，转换负责执行动作——并在后续运行中直接回放。速度提升 8.5 到 13 倍，无需每步调用语言模型。回放有保护机制。每一步，PreAct 都会验证屏幕是否与程序预期匹配再执行动作；若现实偏离，则将控制权交还给 agent。为何重要？计算机操控的大部分成本在于对已解决任务的重复推理。将其摊销为可回放程序，是让 agent 第二次运行时变快的简洁方案。

@dair_ai·6月17日

RT @omarsar0: Highly-recommended reading! After using /loops & /goal throughout my projects, I believe that verifiers and robust guardrail…

↗ view on x.com

强烈推荐阅读！在项目中大量使用 /loops 和 /goal 之后，我认为 verifier 和健壮的 guardrail 是这类 agentic 工作流真正发挥作用的关键。

@dair_ai·6月17日

RT @omarsar0: eve looks like a very promising agent framework. Built-in: - Durable execution - Sandboxed compute - Human-in-the-loop appr…

↗ view on x.com

eve 看起来是一个很有前途的 agent 框架。内置： - Durable execution（持久化执行） - 沙箱计算 - Human-in-the-loop 支持……

@dair_ai·6月16日

Can an LLM agent actually build a model of an environment it cannot see? This work makes the question gradeable. An agent has to uncover a hidden deterministic finite automaton by interacting with an oracle through membership queries (does this string belong?) and equivalence queries (is this the target?), with classic automata-learning algorithms as strong baselines. The honest result is that performance drops sharply as the automaton grows. Reasoning models do better than the rest, but everything degrades with size. Why does it matter? World-model claims about agents are usually vibes. Forcing an agent to actively reconstruct a hidden structure through queries is a clean, controlled way to measure whether it is modeling its environment or just reacting. Paper: https://t.co/Kw1WCLEAQ3 Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c

↗ view on x.com

LLM agent 真的能对它看不见的环境建立模型吗？这项工作让这个问题变得可评分。Agent 必须通过与 oracle 交互——提出成员查询（这个字符串属于该语言吗？）和等价查询（这就是目标自动机吗？）——来还原一个隐藏的确定性有限自动机，同时以经典自动机学习算法作为强 baseline。实验结果很诚实：随着自动机规模增大，性能急剧下降。推理模型表现优于其他模型，但所有模型都随规模退化。为什么重要？关于 agent 世界建模能力的声明，通常只是感觉。强迫 agent 通过查询主动重建一个隐藏结构，是衡量它究竟是在对环境建模、还是只在被动响应的一种干净、受控的方式。论文：[链接]

@dair_ai·6月15日

RT @omarsar0: Verifiers are a big deal. Without good verifiers, /goal & /loop breaks a lot. Anything out of distribution for an LLM, the…

↗ view on x.com

Verifier 非常重要。没有好的 verifier，/goal 和 /loop 会频繁崩。凡是 LLM 遇到分布外情况，就会……

@dair_ai·6月15日

// HarnessX: Harnesses You Compile, Not Hand-Build // (bookmark it) Most agent harnesses are hand-crafted and frozen. Each new model or task means rewriting the prompts, tools, memory, and control flow from scratch, and the rich traces from every run get thrown away. HarnessX treats the harness as something you assemble from typed primitives through a substitution algebra, then evolves with AEGIS, a trace-driven multi-agent engine that feeds execution history back into the design. Why does it matter? If the scaffolding can compose and improve itself from its own traces, the per-model rewrite tax that quietly dominates agent engineering starts to disappear. This is the cleanest version yet of treating the harness as a first-class, programmable artifact. Paper: https://t.co/zvz5YTLRZr Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c

↗ view on x.com

// HarnessX：编译出来的 Harness，不是手搭出来的 // （收藏一下）大多数 agent harness 都是手工搭、然后就冻住了。每换一个模型或任务，就得从头重写 prompt、工具、memory 和控制流，而每次运行产生的丰富 trace 也全部丢弃。 HarnessX 把 harness 当成一个可组合的对象——通过 typed primitives 和替换代数（substitution algebra）拼装而成，再借助 AEGIS（一个 trace 驱动的多 agent 引擎）把执行历史反哺回设计本身，让 harness 持续进化。为什么重要？如果脚手架能从自身 trace 中组合、自我改进，那个悄悄吞噬 agent 工程成本的「换模型就重写税」就会开始消失。这是目前把 harness 当作一等可编程制品来对待的最清晰版本。论文：[链接]

@dair_ai·6月14日

RT @omarsar0: Highly recommended reading. Don't offload your learning. Don't offload your creative process. "You can offload a task, or…

↗ view on x.com

强烈推荐阅读。不要把学习外包给 AI，不要把创作过程外包给 AI。「你可以把任务外包出去，但是……」

@dair_ai·6月13日

RT @omarsar0: Notes on the recent session we had related to autonomous long-running coding agents. (bookmark it) Topics: /goal, loop engi…

↗ view on x.com

关于自主长时运行 coding agent 的笔记（建议收藏）。话题：/goal、loop 工程……

@dair_ai·6月13日

RT @omarsar0: Own the harness, own the agent orchestrators. Great to see open-source work starting to enable it. Being able to compose…

↗ view on x.com

掌控 harness，就掌控了 agent 编排层。很高兴看到开源工作开始支持这一点。能够组合……

@dair_ai·6月12日

RT @omarsar0: How to effectively run autonomous long-running coding agents? This is one of the most exciting discussions on agents I've ev…

↗ view on x.com

如何有效运行自主长时运行的 coding agent？这是我见过的关于 agent 最令人兴奋的讨论之一……

@dair_ai·6月9日

// The Consistency Illusion // Multi-agent debate can make agents agree on the final answer while their underlying reasoning stays misaligned. This work finds that consensus on the output hides disagreement on the path that produced it, and you only ever see the output. A lot of pipelines treat debate or self-consistency as a correctness signal. This work shows that agreement can be an illusion, papering over reasoning that never actually lined up. If you trust convergence as a proxy for being right, you may be measuring the wrong thing. Paper: https://t.co/fd2edva1qu Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c

↗ view on x.com

// 一致性幻觉 // Multi-agent debate 可以让 agents 在最终答案上达成一致，而它们底层的推理过程却依然彼此背离。这项研究发现：输出层的共识掩盖了产生它的推理路径上的分歧——而你永远只能看到输出。很多 pipeline 把 debate 或 self-consistency 当作正确性的信号。这项研究表明，共识可以是一种幻觉，把从未真正对齐的推理粉饰过去。如果你把「收敛」作为「正确」的代理指标，你衡量的可能是错误的东西。

@dair_ai·6月7日

Great paper on self-improving agents:

@omarsar0

This was one of the standout AI papers of the week. (bookmark it) It tackles a question most self-improving AI agents ignore: is the agent actually discovering anything, or just remixing what it already knows? How can you tell whether the agent is doing real discovery or just confident retrieval? The authors give three clean buckets: - Retrieval is looking something up in a notebook you already have. - Search is combining tools you already own in new ways. - Discovery is inventing a new concept that wasn't in your toolkit before. The issue is that most agents stop at the first two. The math behind their definition (category theory plus a left Kan extension, if you care) is basically a bookkeeping trick to ask: could the old version of me have produced this result? If yes, it's not discovery. If no, something genuinely new showed up. They build a Builder/Breaker agent that studies protein mechanics. Over four rounds, the model's fit accuracy actually drops (R² goes from 0.48 to 0.68 to 0.54 to 0.41). At first glance, that looks like a failing agent. It isn't. The agent kept taking on harder proteins and rewriting its theory to cover them. Data grew almost 10x while the model code grew only 1.3x. A smaller theory covering a bigger world is exactly what good science looks like. Why does it matter? If you optimize for accuracy alone, your self-improving agent will just settle into easy benchmarks and stop. This paper offers a cleaner success signal and asks whether the agent is compressing more of the world into less code over time. Paper: https://t.co/Vb4TcCb5YD Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX

↗ view on x.com

关于自我改进 agent 的精彩论文：这是本周最值得关注的 AI 论文之一。（建议收藏）它回答了绝大多数自我改进 agent 都忽略的问题：agent 到底在真正发现新东西，还是只是在重组已有知识？怎么判断 agent 是在做真正的发现，还是只是有把握地检索？作者给出了三个清晰的分类： - **Retrieval（检索）**：从你已有的笔记本里查一条信息。 - **Search（搜索）**：把你已有的工具以新方式组合。 - **Discovery（发现）**：发明一个此前不在你工具箱里的新概念。问题在于，大多数 agent 只做到前两层就停了。他们的数学定义（基于范畴论和左 Kan 扩张）本质上是一个记账技巧：老版本的我能产出这个结果吗？如果能，就不算发现；如果不能，才是真正有新东西出现。他们构建了一个 Builder/Breaker agent 来研究蛋白质力学。历经四轮，模型拟合精度实际上下降了（R² 从 0.48 → 0.68 → 0.54 → 0.41）。乍看像是 agent 失败了。其实不然。 Agent 一直在挑战更难的蛋白质，并不断重写理论来覆盖它们。数据量增长了近 10 倍，而模型代码只增长了 1.3 倍。用更小的理论覆盖更大的世界，正是好科学的样子。为什么这很重要？如果只优化 accuracy，自我改进的 agent 会在简单 benchmark 上安于现状，停止进步。这篇论文提供了一个更清晰的成功信号：agent 是否在随时间推移，用更少的代码压缩更多的世界？

@dair_ai·6月7日

The Top AI Papers of the Week (May 31 - June 7) - LEAP - AutoLab - Learn From Your Own Latents - Reusable Context Engineering - Self-Revising Discovery Systems - Scaling Laws for Agent Harnesses - Disentangling Agent Self-Evolution Read on for more:

@dair_ai

🥇Top AI Papers of the Week

↗ view on x.com

本周 AI 顶级论文（5月31日 - 6月7日） - LEAP - AutoLab - Learn From Your Own Latents（从自身 Latent 中学习） - Reusable Context Engineering（可复用的 Context 工程） - Self-Revising Discovery Systems（自我修正的发现系统） - Scaling Laws for Agent Harnesses（Agent 框架的 Scaling Laws） - Disentangling Agent Self-Evolution（解耦 Agent 自我演化）详情见原帖：

@dair_ai·6月6日

New research from Renmin University. Treat skill selection as a harness in its own right. If you design skill routing for personal or edge agents, this work argues that the selection layer is a first-class component you train and own, sitting alongside memory rather than inside it. The work builds a lightweight local preference harness for on-device personal agents. It keeps a cheap statistical preference learner on-device while a remote LLM handles semantic intent, and the local statistics modulate the model's skill-selection decisions rather than overriding them. Framed as a bandit-style local optimization, the decoupled design reports the lowest cumulative regret and highest test accuracy against memory-augmented agents. Paper: https://t.co/nBigS6jRf7 Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c

↗ view on x.com

中国人民大学的新研究。将 skill 选择视为独立的 harness 组件。如果你在为个人 agent 或端侧 agent 设计 skill routing，这项研究认为：选择层是一个你需要单独训练和管理的一等组件，与 memory 并列，而非嵌套在 memory 内部。该工作为端侧个人 agent 构建了一个轻量级本地偏好 harness：设备端保留一个廉价的统计偏好学习器，远端 LLM 负责语义意图理解，本地统计量调节（而非覆盖）模型的 skill 选择决策。将其建模为 bandit 式本地优化问题，这种解耦设计在与 memory-augmented agent 的对比中，实现了最低的累积遗憾和最高的测试精度。

@dair_ai·6月5日

// Agents' Last Exam // Agents' Last Exam is a living benchmark of over 1,000 economically valuable tasks, built with 250+ industry experts and mapped to the U.S. federal occupational taxonomy. The hardest tier sits at a 2.6% average full pass rate across mainstream harnesses and backbones. ALE behaves like a GDP-coverage instrument instead of another test that saturates in a month. Paper: https://t.co/2FMeltJ23e Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c

↗ view on x.com

// Agents' Last Exam // Agents' Last Exam 是一个持续更新的 benchmark，包含 1000+ 具有经济价值的任务，由 250+ 行业专家共同构建，并映射到美国联邦职业分类体系。最难层级在主流 harness 和 backbone 上的平均完整通过率仅为 2.6%。 ALE 更像一个 GDP 覆盖率工具，而不是一个月内就会饱和的普通测试。

@dair_ai·6月4日

Outstanding paper on long-horizon agents. (bookmark it) Similar to humans, how do you make agents persist on a difficult task, and how is that useful? And which models today work well on this? This new work, AutoLab, explores this question and how encoding persistence in agents is beneficial for tasks such as auto research and engineering tasks. Can a model keep improving an artifact for hours, under a strict wall-clock budget, the way real research and engineering actually work? Results: AutoLab hands agents 36 expert-curated tasks across system optimization, model development, CUDA kernels, and puzzles, each starting from a correct but deliberately suboptimal baseline. Across 17 frontier models, the dominant predictor of success was not the quality of the first attempt. It was persistence, repeatedly benchmarking, editing, and folding in empirical feedback. It appears that Claude-opus-4.6 sustained that loop well. Most of the other models quit early or burned the budget, making almost no progress. Paper: https://t.co/jb8uYR0fpE Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c

↗ view on x.com

关于长程 agent 的杰出论文（值得收藏）。类比人类——如何让 agent 在困难任务上持续坚持？这有什么实际价值？当前哪些模型表现出色？新工作 AutoLab 探索了这些问题，以及在 agent 中编码「持续性」对自动化研究和工程任务的收益。一个模型能否像真实研究和工程那样，在严格的时钟预算内持续改进 artifact 数小时？实验结果： AutoLab 为 agent 准备了 36 个专家精心策划的任务，涵盖系统优化、模型开发、CUDA kernel 和谜题，每个任务均从一个正确但刻意次优的基线出发。在 17 个前沿模型中，成功的最强预测因子不是第一次尝试的质量，而是「持续性」——反复 benchmark、编辑、将实证反馈融入改进。 Claude-opus-4.6 能够很好地维持这个循环。其他大多数模型要么过早放弃，要么耗尽 budget，几乎没有进展。论文：https://t.co/jb8uYR0fpE

@dair_ai·6月4日

RT @omarsar0: NEW: NVIDIA ships 550B MoE open model for long-running agents. Very exciting times to see more open models to support local…

↗ view on x.com

新消息：NVIDIA 发布 550B MoE 开源模型，专为长时间运行的 agent 设计。看到越来越多支持本地部署的开源模型，真是令人振奋的时代。

@dair_ai·6月3日

Nice primer on post-training reasoning data. (bookmark it) This is one of the first primers to pull the scattered post-training reasoning-data literature into one place, synthesizing over 150 public studies and system reports that previously lived across dataset papers, RL recipes, reward-model studies, benchmarks, and frontier reports. It organizes everything around four questions. What data objects exist, what makes them useful, how they are constructed, and how they scale. Paper: https://t.co/royylAHk3y Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c

↗ view on x.com

关于后训练推理数据的优质入门读物。（建议收藏）这是首批将分散的后训练推理数据文献整合在一起的综述之一，汇总了 150 多项此前分散在数据集论文、RL 配方、奖励模型研究、benchmark 和前沿报告中的公开研究。内容围绕四个核心问题展开：数据对象是什么、是什么让它们有价值、如何构建，以及如何扩展。

@dair_ai·6月2日

// State-Externalizing Harnesses // A new paradigm is emerging on how to effectively build agents and harnesses. If there is a state that the environment can maintain reliably, it probably doesn't belong inside the policy. Move it into the harness, and a 20B model trains better and generalizes further. Search agents are usually trained on one policy over a growing transcript, so RL has to learn semantic search and routine bookkeeping at the same time. This model, Harness-1, splits those apart. The harness keeps the working memory (candidate pool, evidence links, verification records, deduplicated observations, budget-aware context) outside the policy, and the 20B model only decides what to search, what to keep, what to verify, and when to stop. Across eight retrieval benchmarks spanning web, finance, patents, and multi-hop QA, it reaches 0.730 average curated recall, beating the next-best open search agent by 11.4 points and staying competitive with much larger frontier searchers. The gains are largest on the held-out transfer. Paper: https://t.co/8DOQtsLsp2 Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c

↗ view on x.com

// State-Externalizing Harnesses（状态外化 Harness）// 一种构建 agent 和 harness 的新范式正在涌现。 **核心原则**：如果某个状态能被环境可靠地维护，它大概率不该放在 policy 内部。把它移进 harness，20B 的模型就能训练得更好、泛化能力更强。 Search agent 通常在不断增长的 transcript 上训练单一 policy，这意味着 RL 要同时学习「语义检索」和「日常簿记」两件事。这篇论文的模型 Harness-1 把这两件事拆开了。 Harness 负责维护工作记忆（候选池、证据链接、验证记录、去重观察、预算感知 context），20B 模型只负责决策：搜什么、留什么、验证什么、何时停止。在横跨网页、金融、专利、多跳 QA 的 8 个检索 benchmark 上，平均 curated recall 达到 0.730，领先次优开源 search agent 11.4 个点，并与更大的 frontier 模型保持竞争力。迁移泛化上的增益最为显著。

@dair_ai·6月1日

RT @dair_ai: The Top AI Papers of the Week (May 24 - May 31) - SkillOpt - AutoScientists - The Efficiency Frontier - Language Models Need…

↗ view on x.com

本周 AI 顶级论文（5月24日 - 5月31日） - SkillOpt - AutoScientists - The Efficiency Frontier - Language Models Need…

@dair_ai·6月1日

// Reusable Context Engineering // Context bloat quietly kills long-horizon runs, but you can fix it from the outside without fine-tuning the underlying agent. (bookmark this) Context management is usually baked into an agent's own prompt or weights, which does not transfer and cannot wrap a closed model. New research introduces AdaCoM, which trains a separate LLM with end-to-end RL to prune or preserve the context of a frozen, possibly closed-source agent. That turns context engineering into a reusable, model-agnostic policy you train once and reuse, with substantial gains across diverse agents on web search and deep research. The authors also surface a capability-dependent fidelity-reliability trade-off, and find transfer works best between agents of similar capability. Paper: https://t.co/2n1dpMs7QR Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c

↗ view on x.com

// 可复用的 Context Engineering // Context 膨胀会悄悄终结 long-horizon 任务，但你可以从外部修复它，无需微调底层 agent。（值得收藏） Context 管理通常被内嵌进 agent 自己的 prompt 或权重里，这意味着它无法跨模型迁移，也无法包裹闭源模型。新研究提出 AdaCoM：训练一个独立的 LLM，用端到端 RL 来对冻结的（甚至是闭源的）agent 的上下文进行裁剪或保留。这样，context engineering 就变成了一个可复用的、模型无关的策略——训练一次，到处复用。在 web search 和 deep research 任务上，跨不同 agent 均有显著提升。研究还揭示了一个依赖 capability 的「保真度-可靠性权衡」，并发现在能力相近的 agent 之间迁移效果最好。论文：https://t.co/2n1dpMs7QR

@dair_ai·5月31日

RT @omarsar0: // The Efficiency Frontier // Cool paper on context management. As agents reuse the same documents and histories across man…

↗ view on x.com

// 效率前沿 // 一篇关于 context 管理的好论文。当 agent 在多任务中复用相同文档和历史记录时……

@dair_ai·5月31日

The Top AI Papers of the Week (May 24 - May 31) - SkillOpt - AutoScientists - The Efficiency Frontier - Language Models Need Sleep - Adapting the Interface, Not the Model - Forecasting Scientific Progress with AI - Compiling Agentic Workflows into Weights Read on for more:

@dair_ai

🥇Top AI Papers of the Week

↗ view on x.com

本周顶级 AI 论文（5月24日 - 5月31日） - SkillOpt - AutoScientists - The Efficiency Frontier - Language Models Need Sleep - Adapting the Interface, Not the Model - Forecasting Scientific Progress with AI - Compiling Agentic Workflows into Weights 点击阅读更多：

@dair_ai·5月30日

RT @omarsar0: Increasingly, HTML Artifacts are becoming a core part of how I work with AI agents. Long-horizon agent sessions need a bette…

↗ view on x.com

HTML Artifacts 正在成为我与 AI agent 协作方式的核心部分。长时 agent session 需要更好的……

@dair_ai·5月30日

RT @omarsar0: In a few months, people will start to realize how fundamentally important MCP for agents is. It's not even about connecting…

↗ view on x.com

再过几个月，大家就会开始意识到 MCP 对 agent 有多根本性的重要。这甚至不是关于连接工具这件事……

@dair_ai·5月29日

Do proactive agents really need an LLM to decide when to wake? The default proactive agent calls an LLM on every event just to decide whether to wake up. That is a lot of expensive inference spent on a yes or no. New research from Microsoft and Purdue asks whether the trigger really needs a language model at all. Their answer is a 220MiB temporal-graph encoder that decides when to wake and what context to anchor. It gains +16.7 mean F1 across 14 backbones, runs 4 to 83x faster, and fits on-device at around 11ms per event. If you run an always-on agent loop, the polling decision is quietly the main cost driver. A tiny encoder removes it without giving up accuracy. Paper: https://t.co/15KpQEm7Eo Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c

↗ view on x.com

Proactive agent 真的需要用 LLM 来决定何时唤醒吗？默认的 proactive agent 每次事件都调用 LLM，仅仅是为了判断「要不要醒来」。这等于用大量昂贵的推理来回答一个是/否问题。 Microsoft 和 Purdue 的新研究问了一个好问题：触发器真的需要语言模型吗？他们的答案是一个 220MiB 的 temporal-graph encoder，负责决定何时唤醒以及锚定哪些上下文。结果：在 14 个 backbone 上平均 F1 提升 +16.7，速度快 4 到 83 倍，可在端侧运行，每次事件约 11ms。如果你在跑 always-on agent 循环，polling 决策其实是最大的隐性成本来源。一个轻量 encoder 就能移除它，同时不损失精度。

@dair_ai·5月28日

Banger paper from Harvard. AutoScientists drops the central planner entirely. Agents interpret shared experimental data, self-organize around promising directions, evaluate proposals before resource allocation, and document successes AND failures. Decentralized AI co-scientists with failure documentation as a first-class step. Validated across three concrete domains. Biomedical ML reaches 74.4% mean leaderboard percentile. Language model training converges 1.9x faster. Protein fitness prediction lifts +12.5% on specific assays and +6.5% broader. The strongest argument so far that the AI-scientist bottleneck is governance rather than raw capability. Paper: https://t.co/LtqUsrJ0os Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c

↗ view on x.com

哈佛出的重磅论文。 AutoScientists 完全去掉了中央规划器。Agents 自主解读共享实验数据，围绕有前景的方向自我组织，在资源分配前先评估提案，并同等记录成功与失败。去中心化的 AI 协作科学家，把失败文档作为一等公民。在三个具体领域得到验证：生物医学 ML 达到 74.4% 平均 leaderboard 百分位；语言模型训练收敛速度提升 1.9 倍；蛋白质适应性预测在特定 assay 上提升 +12.5%、整体提升 +6.5%。目前最有力的论据：AI 科学家的瓶颈在于治理机制，而非原始能力。

@dair_ai·5月22日

NEW paper worth reading. A full agentic workflow can be distilled into model weights and run at roughly 100x lower inference cost while preserving near-frontier task quality. The workflow includes multi-step LLM calls, tool invocations, intermediate scratchpads, and decision structure. Instead of expressing all of that at runtime through a framework, the paper amortizes the behavior into a compiled model through targeted distillation. This is the strongest economic argument for agent compilation so far. Runtime loops are flexible, but expensive. Compiled workflows trade some flexibility for a massive inference-cost reduction. Paper: https://t.co/4k4urYOAeQ Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c

↗ view on x.com

新论文值得一读。完整的 agentic workflow 可以被蒸馏进模型权重中，以约 100 倍更低的推理成本运行，同时保持接近前沿的任务质量。该 workflow 涵盖多步 LLM 调用、工具调用、中间 scratchpad 和决策结构。与其在运行时通过 framework 动态表达这一切，这篇论文通过定向蒸馏将行为摊销进一个「编译好的」模型中。这是迄今为止支持 agent 编译最强的经济学论据。运行时循环灵活，但成本高昂；编译后的 workflow 以牺牲部分灵活性换取大幅降低的推理成本。

@dair_ai·5月21日

Learn how to build LLM Wikis and LLM Artifacts.

@omarsar0

New VIDEO: From LLM Wikis to LLM Artifacts Shared all my thoughts on why LLM wikis and HTML artifacts are a big deal. Plus, new tools to help you build wikis and artifacts with agents. Just getting started! https://t.co/fDPDNUcJfx

↗ view on x.com

学习如何构建 LLM Wikis 和 LLM Artifacts。【引用 @omarsar0】：新视频：从 LLM Wikis 到 LLM Artifacts 分享了我对 LLM wikis 和 HTML artifacts 为何意义重大的全部思考。还有用 agent 帮你构建 wikis 和 artifacts 的新工具。这才刚刚开始！

@dair_ai·5月20日

// Memory as a Model // The paper augments any LLM with a separate trained memory model that stores, retrieves, and integrates facts on its behalf. It decouples memory updates from base-model weight updates. It achieves continual-learning robustness without catastrophic forgetting, which is a property that RAG fails to deliver. A vector store is a database with a learned encoder bolted on. MeMo is a learned subsystem with explicit interfaces. That distinction matters, as agents need to be able to ingest fresh knowledge weekly without retraining or vector-DB churn. At its core, the position here is that memory in agents should be modular, learned, and gated, not a context-window hack. Paper: https://t.co/iMrghPtxWW Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c

↗ view on x.com

// Memory as a Model // 这篇论文为任意 LLM 配备了一个独立训练的记忆模型，由它来负责知识的存储、检索和整合。核心思路是把记忆更新与基础模型的权重更新解耦。它实现了持续学习的鲁棒性，同时避免了灾难性遗忘——而这恰恰是 RAG 做不到的。 Vector store 本质是一个数据库加上一个后天接上去的编码器。MeMo 则是一个带有显式接口的可学习子系统。这个区别很关键：agent 需要每周消化新知识，却不能每次都重新训练，也不能靠不断折腾 vector DB 来解决。这篇论文的核心立场是：agent 中的记忆应该是模块化的、可学习的、有门控的，而不是一个 context window 的临时凑合方案。

@dair_ai·5月20日

If you design production agent systems, this matters. Most devs accidentally let their framework defaults make critical architecture decisions without thinking it through. This paper shows you how to choose deliberately instead. Why it matters? You need to start making more deliberate decision about your architecture. Paper: https://t.co/uSW5mUMnJp Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c

↗ view on x.com

如果你在设计生产级 agent 系统，这点很重要。大多数开发者会不加思考地让框架默认配置替他们做出关键架构决策。这篇论文告诉你如何有意识地主动选择。为什么重要？你需要开始对架构做出更刻意的判断，而不是被动接受默认值。

@dair_ai·5月20日

RT @omarsar0: Very interesting results from this NanoGPT-Bench eval. There is so much talk about self-improving agents. But can coding ag…

↗ view on x.com

NanoGPT-Bench 评测出了一些非常有趣的结果。关于 self-improving agent 的讨论铺天盖地。但 coding agent 真的能……

@dair_ai·5月19日

NEW paper worth reading: MetaCogAgent MetaCogAgent equips a multi-agent system with metacognition so each agent decides whether it should answer or delegate. In other words, it aims for self-aware task delegation rather than fixed routing. The bottleneck in multi-agent systems has been over-delegation and under-delegation. In a way, a metacognitive gate is a principled way to manage both. If you orchestrate specialists, this could give you a routing primitive that adapts to task uncertainty instead of relying on a fixed router. Paper: https://t.co/Y5RE4zgmIn Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c

↗ view on x.com

新论文值得一读：MetaCogAgent MetaCogAgent 为 multi-agent 系统赋予元认知能力，让每个 agent 自行判断：该自己回答，还是把任务委派出去。换句话说，它追求的是「自感知的任务委派」，而非固定路由规则。 multi-agent 系统的瓶颈一直是过度委派和委派不足并存。从这个角度看，元认知门控是一种有原则地同时管控两者的方式。如果你在编排多个专家 agent，这可能为你提供一个能适应任务不确定性的路由原语，不再依赖固定的路由器。

@dair_ai·5月19日

Great new paper to read: Code as Agent Harness (bookmark it)

@omarsar0

// Code as Agent Harness // 100+ page report on all things related to agent harnesses. (bookmark it) In particular, the survey summarizes methods and applications of code as agent harness. This paper makes a strong case that code-as-harness might be the key to moving us towards a broader science harness engineering. Is code all you need? Maybe. Regardless, the paper argues that future systems must have the following four properties: executable, inspectable, stateful, and governed. Paper: https://t.co/bcNtSpSsQq Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX

↗ view on x.com

好文推荐：Code as Agent Harness（先收藏） [转自 @omarsar0]：// Code as Agent Harness // 一份 100+ 页的 agent harness 全景报告。（先收藏）这篇综述总结了「代码作为 agent harness」的方法与应用，并有力论证：代码即 harness，可能是推动我们走向更宏观的 harness 工程科学的关键。代码是否就是全部答案？也许是。不管怎样，论文认为未来系统必须具备以下四个属性：可执行（executable）、可检视（inspectable）、有状态（stateful）、可治理（governed）。

@dair_ai·5月18日

NEW paper from Meta: Agentic Discovery of Neural Architectures. This is a hot new area of research! Keep an eye on it.

@omarsar0

NEW paper from Meta. (bookmark it) It's an agent system that autonomously discovers neural architectures that beat Llama 3.2 at 350M, 1B, and 3B scales, all under a 24-hour compute budget. They get this work by splitting the search into two agents: > AIRA-Compose searches the macro architecture. > AIRA-Design implements the low-level mechanisms. For devs: If one agent in your stack is doing both strategy and implementation, split it. Run a planner that picks the structure and an implementer that fills in the mechanisms. AIRA shows this beats a single end-to-end agent on a real, non-toy search problem. The same split is useful for pipeline assembly, query planning, prompt scaffolding, and tool-use programs. Paper: https://t.co/CYALI6CFjJ Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX

↗ view on x.com

Meta 新论文：Agent 自主发现神经网络架构。这个研究方向很值得关注！ [引用]：Meta 新论文（收藏备用）这是一个 agent 系统，能在 24 小时算力预算内，自主发现在 350M、1B、3B 参数规模上超越 Llama 3.2 的神经架构。核心做法是把搜索拆成两个 agent： > AIRA-Compose 负责搜索宏观架构。 > AIRA-Design 负责实现底层机制。 **对开发者的启示：** 如果你的 agent stack 里有一个 agent 同时在做「策略制定」和「具体实现」，把它拆开。让 planner 决定结构，让 implementer 填充机制。 AIRA 在一个真实的、非玩具级搜索问题上证明：这种拆分比单一端到端 agent 效果更好。同样的模式适用于 pipeline 组装、query planning、prompt 脚手架、tool-use 程序。

@dair_ai·5月18日

NEW paper worth reading. GPT-5.4 nano plus a critic-comparator orchestration loop hits 76.4% on SWE-bench Verified, matching standalone Gemini 3 Pro and Claude Opus 4.5 Thinking. The trick is to select from k=8 weak-model proposals using execution and proof signals. What does this mean? Many of the patches you'd expect from a frontier model are already inside a weak model's top-8 candidates. When you have 8 candidate patches from a weak model, don't ask the model which is best. Run them and verify them. That's enough to match a frontier model's accuracy. The takeaway for AI devs: a weak model's top-k often already contains the right answer. What limits you is the quality of your selector, not the capability of the model. Paper: https://t.co/Gx7j7EP9BM Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c

↗ view on x.com

新论文值得一读。 GPT-5.4 nano 加上 critic-comparator 编排循环，在 SWE-bench Verified 上达到 76.4%，与独立运行的 Gemini 3 Pro 和 Claude Opus 4.5 Thinking 持平。核心技巧：从弱模型生成的 k=8 个候选 patch 中，利用执行结果和验证信号来做选择。这意味着什么？你期望顶级模型才能给出的 patch，其实已经藏在弱模型的 top-8 候选里了。当你有 8 个弱模型候选 patch 时，别让模型自己挑——直接运行并验证。这就够了，足以匹配顶级模型的准确率。对 AI 开发者的启示：弱模型的 top-k 往往已经包含了正确答案。制约你的是选择器的质量，而不是模型的能力。

@dair_ai·5月18日

RT @dair_ai: The Top AI Papers of the Week (May 11 - May 17) - AEvo - δ-mem - AutoTTS - AI Co-Mathematician - Lighthouse Attention - Is Gr…

↗ view on x.com

本周 AI 论文精选（5月11日–17日） - AEvo - δ-mem - AutoTTS - AI Co-Mathematician - Lighthouse Attention - Is Gr…

@dair_ai·5月17日

The Top AI Papers of the Week (May 11 - May 17) - AEvo - δ-mem - AutoTTS - AI Co-Mathematician - Lighthouse Attention - Is Grep All You Need? - A Geometric Calculator Inside a Neural Network Read on for more:

@dair_ai

🥇Top AI Papers of the Week

↗ view on x.com

本周 Top AI 论文（5月11日 - 5月17日） - AEvo - δ-mem - AutoTTS - AI 联合数学家 - Lighthouse Attention - Grep 是否足够用？ - 神经网络内部的几何计算器点击阅读详情：

@dair_ai·5月16日

Are your benchmarks actually measuring the capability you think they measure? New paper says they probably not. Coined the "The Evaluation Trap", it provides a vocabulary for auditing whether your eval discriminates the underlying capability or just proxies behaviors that happen to correlate. Most benchmarks bake in implicit theory that nobody states explicitly, then evaluate as if the theory were neutral. Research indicates that most agent leaderboards are not measuring what we collectively think they are. Great read on evals, especially those making decisions on model selection. Paper: https://t.co/mPIiOhkB8P Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c

↗ view on x.com

你的 benchmark 真的在衡量你以为它在衡量的能力吗？新论文说：很可能没有。作者提出「评估陷阱」（The Evaluation Trap）这一概念，提供了一套词汇来审查你的 eval 是否真正区分了底层能力，还是仅在代理衡量那些恰好相关的行为。大多数 benchmark 内置了隐性理论，这些理论从未被明确陈述，却被当作中立标准来评估。研究表明，大多数 agent 排行榜并没有在衡量我们集体认为它们在衡量的东西。强烈推荐关注 eval 的人阅读，尤其是基于 eval 做模型选型决策的人。

@dair_ai·5月16日

RT @dair_ai: // Beyond Individual Intelligence // One of the more useful multi-agent surveys I've read this year. 200+ papers mapped alon…

↗ view on x.com

// 超越单体智能 // 今年读过的最有价值的 multi-agent 综述之一。 200+ 篇论文系统梳理……

@dair_ai·5月15日

// Beyond Individual Intelligence // One of the more useful multi-agent surveys I've read this year. 200+ papers mapped along three axes: collaboration mechanisms, failure attribution, and self-evolution. The self-evolution chapter is the cleanest field map of where memory, meta-learning, and procedure-editing approaches actually intersect. Paper: https://t.co/JHBPVFe1l7 Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c

↗ view on x.com

// 超越个体智能 // 今年读过最有价值的多智能体综述之一。 200+ 篇论文沿三个轴展开：协作机制、失败归因、自我进化。 self-evolution 章节是目前最清晰的领域地图，清楚地呈现了 memory、meta-learning、procedure-editing 这几条路线在哪里真正交叉。论文：[链接]

@dair_ai·5月15日

Great paper discussing agentic search vs. vector search.

@omarsar0

// Is Grep All You Need? // Pay attention to this on, AI devs. (bookmark it) They find that grep-style text search, when wrapped in the right agent harness, matches or beats embedding-based retrieval on coding-agent tasks. Are vector databases even needed where this is all going? It might be that what coding agents needed was not better embeddings. It was better harness design around primitive tools. If you operate a coding-agent stack that depends on a vector DB, it might be time to re-evaluate. My personal experience on this has been that agentic search, if done right, is more than good enough for a lot of use cases. But you also have to understand how to properly index and structure information for the agents to take advantage. At scale, vector databases do shine so take that into account as well. In most cases, a hybrid approach often works best but that's something we haven't figured out really well as of yet. Paper: https://t.co/VjjXDoZ2yL Learn to build effective AI agents in our academy: https://t.co/1e8RZKs4uX

↗ view on x.com

// 值得关注的论文：Grep 够用吗？// AI 开发者注意。（建议收藏）研究发现：在 coding agent 任务上，grep 风格的文本搜索——只要套上合适的 agent harness——能够匹配甚至超越基于 embedding 的检索。向量数据库还有必要吗？也许 coding agent 真正需要的不是更好的 embedding，而是围绕原始工具的更好 harness 设计。如果你的 coding agent 技术栈依赖向量数据库，可能是时候重新评估了。我个人的经验是：做对了的 agentic search，对大量用例已经绰绰有余。但你也需要理解如何为 agent 正确地索引和组织信息。大规模场景下向量数据库确实有优势，这一点要考虑进去。大多数情况下 hybrid 方案效果最好，但这一块我们目前还没有真正摸透。论文：[链接]

@dair_ai·5月14日

// Harnessing Agentic Evolution // Pay attention to this one if you run iterative agentic search loops. (bookmark it) AEvo splits the self-improvement loop into two jobs: > One proposes the next candidate. > The other watches what worked, what failed, and edits the procedure that proposes future candidates. Past runs (candidates, feedback, traces, failures) become memory the meta-agent reads from. Achieves 26% relative gain over the strongest evolution baseline on agentic and reasoning benchmarks. SOTA on three open-ended optimization tasks under the same iteration budget. If you are accumulating agentic search logs you never use, this is how to feed them back into the search procedure itself. Paper: https://t.co/eWFO4rI4iA Learn to build effective AI agents in our academy: https://t.co/LRnpZN7L4c

↗ view on x.com

// Harnessing Agentic Evolution // 如果你在跑迭代式 agentic search 循环，这篇值得关注。（建议收藏） AEvo 把自我改进循环拆成两个任务： > 一个负责提出下一个候选方案。 > 另一个负责观察什么有效、什么失败，并编辑「提案者」未来使用的流程本身。历史运行记录（候选方案、反馈、traces、失败案例）成为元 agent 读取的记忆。相比最强的 evolution baseline，在 agentic 和推理 benchmark 上取得 26% 的相对提升。在相同迭代预算下，三个开放式优化任务上达到 SOTA。如果你积累了一堆从未利用的 agentic search 日志，这就是把它们反哺回搜索流程本身的方法。论文：[链接]

760 tweets · 188 sources