📚 Weekly AI Paper Digest

Period: 2026-01-19 ~ 2026-01-24 | Selection: this week's Top 5 most-noticed papers


🏆 This Week's Top 5

| Rank | Paper | ⬆️ | Deep Dive |
|---|---|---|---|
| 🥇 | Agentic Reasoning for Large Language Mod… | 186 | DD-011 |
| 🥈 | Your Group-Relative Advantage Is Biased | 147 | DD-012 |
| 🥉 | EvoCUA: Evolving Computer Use Agents via… | 89 | DD-013 |
| 4. | LLM-in-Sandbox Elicits General Agentic I… | 82 | DD-014 |
| 5. | Being-H0.5: Scaling Human-Centric Robot … | 75 | DD-015 |

🔍 This Week's Trends

Key Keywords

  • Agentic Reasoning: autonomous reasoning that goes beyond plain text generation to plan and act while interacting with an environment.
  • Environment Interaction: learning by directly manipulating real or virtual environments such as code sandboxes, computer interfaces, and robots.
  • Synthetic Experience: improving a model with self-generated or scalable virtual experience data in order to get past the limits of static data.
  • Cross-Embodiment (generalization across bodies): general-purpose physical intelligence that keeps working across different forms of robot or agent hardware.
  • RLVR (Reinforcement Learning with Verifiable Rewards; verifier-based RL): a post-training technique that improves the reasoning process through reinforcement learning driven by rewards from a verifier.
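To make the RLVR entry above concrete, here is a minimal sketch of a verifier-style reward, assuming the task is a math question with a single numeric answer. The function names `extract_final_answer` and `verifiable_reward` are hypothetical and are not taken from any of the papers in this digest; real verifiers are usually far stricter about answer formats.

```python
import re


def extract_final_answer(completion: str):
    """Return the last number that appears in the completion, or None."""
    numbers = re.findall(r"-?\d+(?:\.\d+)?", completion)
    return numbers[-1] if numbers else None


def verifiable_reward(completion: str, reference: str) -> float:
    """Binary reward: 1.0 if the extracted answer matches the reference, else 0.0.

    Assumption: the reference is a plain numeric string such as "4".
    """
    answer = extract_final_answer(completion)
    if answer is None:
        return 0.0
    return 1.0 if float(answer) == float(reference) else 0.0


if __name__ == "__main__":
    print(verifiable_reward("2 + 2 = 4, so the answer is 4", "4"))  # 1.0
    print(verifiable_reward("I think the answer is 5.", "4"))       # 0.0
```

The reward is computed by a program rather than by a learned preference model, which is the property the "verifier" wording refers to.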

Common Themes

This week's research centers on **"the shift from static language models to dynamic, acting agents."** LLMs no longer stop at reasoning inside a closed world. They are evolving into agents that learn and solve problems on their own by interacting with open-ended environments such as code sandboxes, computers, and physical robots. There is also a clear trend toward training on real-time interaction experience or human-centric data instead of static datasets.

Highlights

The notable point of the LLM-in-Sandbox study is that it takes a sandbox environment built for code and uses it as a tool for raising general intelligence in non-code domains. EvoCUA and Being-H0.5 both target the bottleneck in scaling existing data by proposing new data paradigms: "scalable synthetic experience" for EvoCUA, and "human interaction as a universal language (mother tongue)" for Being-H0.5. Both aim to maximize the generalization ability of agents.

Practical Implications

Developers and researchers should move beyond simply scaling model size and focus on building environments (e.g., sandboxes, simulators) in which agents can explore and fail. In particular, when applying reasoning-reinforcement techniques such as RLVR, it is essential to understand the bias they can introduce and to tune the training pipeline to mitigate it. In physical AI and automation, the direction to take is designing **general-purpose action models (VLA)** that are not tied to specific hardware.


📑 Paper Summaries

🥇 1. Agentic Reasoning for Large Language Models

arXiv: 2601.12538 | ⬆️ 186 → View Deep Dive | Tags: agentic-reasoning llm-agents survey-paper autonomous-agents tool-use prompt-engineering ai-planning machine-learning

This paper matters because it systematizes and establishes the Agentic Reasoning paradigm, which turns large language models (LLMs) from passive text-generation tools into autonomous agents that plan and act on their own while interacting with an environment.

📖 Detailed analysis: follow → View Deep Dive for the in-depth analysis.


🥈 2. Your Group-Relative Advantage Is Biased

arXiv: 2601.08521 | ⬆️ 147 → View Deep Dive | Tags: llm reinforcement-learning rlvr grpo reasoning bias-correction post-training mathematics

The paper gives the first theoretical proof of a fundamental bias in GRPO, the group-based reinforcement learning algorithm widely used to improve the reasoning ability of large language models (LLMs). It then proposes the HA-DW framework, which corrects that bias using past training history, and reports markedly better mathematical reasoning performance.
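For readers unfamiliar with the quantity under discussion, the sketch below shows the commonly used group-relative advantage: each sampled answer's reward is standardized against the other answers drawn for the same prompt. This is an illustrative baseline only. It assumes the population standard deviation and a small `eps` for numerical stability, and it does not reproduce the paper's bias analysis or the HA-DW correction.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize each reward against its own sampling group.

    rewards: rewards for several completions sampled from ONE prompt.
    eps: small constant (an assumption here) that avoids division by zero
         when every completion in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


if __name__ == "__main__":
    # Four sampled answers to one prompt; only the first was verified correct.
    advantages = group_relative_advantages([1.0, 0.0, 0.0, 0.0])
    print([round(a, 3) for a in advantages])  # [1.732, -0.577, -0.577, -0.577]
```

Note that the baseline (`mu`) and the scale (`sigma`) are both estimated from the same handful of samples that are being scored.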

📖 Detailed analysis: follow → View Deep Dive for the in-depth analysis.


🥉 3. EvoCUA: Evolving Computer Use Agents via Learning from Scalable Synthetic Experience

arXiv: 2601.15876 | ⬆️ 89 → View Deep Dive | Tags: computer-use-agents synthetic-data reinforcement-learning scalable-infrastructure auto-ml rlhf reasoning virtualization

The paper gets past the collection limits of static data and markedly improves the performance and scalability of computer use agents (CUAs) through a self-evolving loop in which the agent generates its own tasks and verifiers and learns from them.

📖 Detailed analysis: follow → View Deep Dive for the in-depth analysis.


4. LLM-in-Sandbox Elicits General Agentic Intelligence

arXiv: 2601.16206 | ⬆️ 82 → View Deep Dive | Tags: llm-agent sandbox reinforcement-learning general-intelligence tool-use emergent-abilities llm-reasoning agentic-ai

The paper shows that giving an LLM a virtual computer environment (a sandbox) can elicit "agentic intelligence": the model uses tools and solves problems on its own even on general tasks that are not about coding.
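As a rough illustration of what "giving a model a sandbox" can mean in practice, the sketch below runs a model-proposed Python snippet in a separate process and returns the result as an observation. The helper name `run_in_sandbox` and the observation format are hypothetical, and a subprocess with a timeout is only a stand-in: it is not real isolation and is not the setup used in the paper.

```python
import subprocess
import sys
import tempfile


def run_in_sandbox(code: str, timeout_s: float = 5.0) -> dict:
    """Run a code string in a child Python process and report what happened.

    Assumption: a throwaway working directory plus a timeout is enough for a
    demo. A production sandbox would use containers or VMs instead.
    """
    with tempfile.TemporaryDirectory() as workdir:
        try:
            proc = subprocess.run(
                [sys.executable, "-I", "-c", code],  # -I: isolated mode
                cwd=workdir,
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return {"ok": False, "stdout": "", "stderr": "timeout"}
    return {
        "ok": proc.returncode == 0,
        "stdout": proc.stdout,
        "stderr": proc.stderr,
    }


if __name__ == "__main__":
    # A non-coding question ("what is 6 times 7?") answered by running code.
    print(run_in_sandbox("print(6 * 7)")["stdout"].strip())  # 42
```

Failures come back as ordinary observations (`ok` is `False`, with the error text in `stderr`), so an agent loop can read them and try again rather than crash.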

📖 Detailed analysis: follow → View Deep Dive for the in-depth analysis.


5. Being-H0.5: Scaling Human-Centric Robot Learning for Cross-Embodiment Generalization

arXiv: 2601.12993 | ⬆️ 75 → View Deep Dive | Tags: being-h05 vla cross-embodiment robotics uni-hand-20 representation-learning human-centric-ai multimodal

This paper matters because it uses human behavior data as a "universal language" for robot learning, laying the groundwork for robots of different forms (embodiments) to overcome data scarcity and acquire general-purpose intelligence.

📖 Detailed analysis: follow → View Deep Dive for the in-depth analysis.


📅 Generated: 2026-02-02 | 🤖 GLM-4.7 Weekly Digest