📚 Weekly AI Paper Digest

Period: 2026-02-02 ~ 2026-02-07 | Selection: the Top 5 most notable papers of the week


๐Ÿ† ์ด๋ฒˆ ์ฃผ Top 5

| Rank | Paper | ⬆️ | Deep Dive |
| --- | --- | --- | --- |
| 🥇 | Green-VLA: Staged Vision-Language-Action Model for Generalist Robots | 236 | DD-017 |
| 🥈 | ERNIE 5.0 Technical Report | 236 | DD-016 |
| 🥉 | Kimi K2.5: Visual Agentic Intelligence | 206 | DD-018 |
| 4 | Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models | 147 | DD-019 |
| 5 | PaperBanana: Automating Academic Illustration for AI Scientists | 137 | DD-020 |

๐Ÿ” ์ด๋ฒˆ ์ฃผ ํŠธ๋ Œ๋“œ

Key Keywords

  • Multimodal agents & VLA (Vision-Language-Action): the rise of AI models that go beyond understanding text and images to act in real environments or use complex tools to accomplish goals
  • Native unified architectures (Native Multimodality): a new line of model design that trains text, images, video, and other modalities under a single unified objective from the start, rather than wiring together pre-existing models (ERNIE 5.0, Kimi K2.5)
  • Real-world robot deployment: staged training curricula and policy reinforcement learning optimized so models run on actual robots (e.g., humanoids), beyond the lab
  • Automating AI research (AI for Science): a meta-trend in which AI agents automate the research process itself, from generating a paper's figures to carrying out deep research

Common Themes

์ด๋ฒˆ ์ฃผ ๋…ผ๋ฌธ๋“ค์€ AI๊ฐ€ ๋‹จ์ˆœํ•œ โ€˜์ง€๋Šฅํ˜• ๋น„์„œโ€™๋ฅผ ๋„˜์–ด **โ€˜๋Šฅ๋™์ ์ธ ํ–‰์œ„์ž(Agent)โ€˜**๋กœ ์ง„ํ™”ํ•˜๊ณ  ์žˆ์Œ์„ ๋ณด์—ฌ์ค๋‹ˆ๋‹ค. ํŠนํžˆ ์‹œ๊ฐ ์ •๋ณด(Vision)๋ฅผ ํ†ตํ•ด ์„ธ์ƒ์„ ์ดํ•ดํ•˜๊ณ  ์ด๋ฅผ ๋ฌผ๋ฆฌ์  ํ–‰๋™(Robotics)์ด๋‚˜ ๋ณต์žกํ•œ ์ธ์ง€์  ์ž‘์—…(Research)์œผ๋กœ ์—ฐ๊ฒฐํ•˜๋Š” Vision-Action ํ†ตํ•ฉ์ด ๊ฐ€์žฅ ๋‘๋“œ๋Ÿฌ์ง„ ๊ณตํ†ต ์ฃผ์ œ์ž…๋‹ˆ๋‹ค. ๋˜ํ•œ, ๋ชจ๋‹ฌ๋ฆฌํ‹ฐ ๊ฐ„์˜ ๊ฒฝ๊ณ„๋ฅผ ํ—ˆ๋ฌด๋Š” โ€˜๋„ค์ดํ‹ฐ๋ธŒ(Native)โ€™ ํ•™์Šต ๋ฐฉ์‹๊ณผ AI ์ž์‹ ์ด ์—ฐ๊ตฌ๋ฅผ ๋•๋Š” โ€˜์ž๋™ํ™”โ€™๊ฐ€ ๋™์‹œ์— ๊ณ ๋„ํ™”๋˜๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

Highlights

Green-VLA and Kimi K2.5 go beyond simply combining vision and language: we highlight their 'joint optimization' approach, in which the two modalities complement each other and are optimized together via reinforcement learning (RL). ERNIE 5.0 takes a groundbreaking 'native autoregressive' approach that trains all modalities from scratch under a unified token-prediction objective, demonstrating that text, images, video, and audio can be processed in one model without separate encoders. Work like PaperBanana suggests that frontier models are now becoming knowledge producers, starting to take over researchers' most labor-intensive tasks (e.g., producing paper illustrations).

Practical Implications

๊ฐœ๋ฐœ์ž์™€ ์—ฐ๊ตฌ์ž๋Š” ์ด์ œ ๋‹จ์ผ ๋ชจ๋‹ฌ๋ฆฌํ‹ฐ ๋ชจ๋ธ์ด ์•„๋‹Œ, ํ–‰๋™ ๊ณ„ํš๊ณผ ๋„๊ตฌ ์‚ฌ์šฉ์ด ๊ฐ€๋Šฅํ•œ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ์—์ด์ „ํŠธ๋ฅผ ์„ค๊ณ„ํ•ด์•ผ ํ•˜๋Š” ์‹œ์ ์— ์ ‘์–ด๋“ค์—ˆ์Šต๋‹ˆ๋‹ค. ํŠนํžˆ ๋กœ๋ด‡ ์‚ฐ์—…์ด๋‚˜ ์ž๋™ํ™” ๋ถ„์•ผ์—์„œ๋Š” VLA(Vision-Language-Action) ๋ชจ๋ธ์„ ์‹ค์ œ ํ•˜๋“œ์›จ์–ด์— ์–ด๋–ป๊ฒŒ ์ตœ์ ํ™”ํ•˜์—ฌ ํƒ‘์žฌํ• ์ง€(R0, R1, R2 ๋‹จ๊ณ„ ๋“ฑ)์— ๋Œ€ํ•œ ์ „๋žต์ด ํ•„์š”ํ•ฉ๋‹ˆ๋‹ค. ๋˜ํ•œ, ์—ฐ๊ตฌ ์ƒ์‚ฐ์„ฑ ์ธก๋ฉด์—์„œ๋Š” PaperBanana๋‚˜ Vision-DeepResearch์™€ ๊ฐ™์€ AI ์—์ด์ „ํŠธ ํˆด์„ ์ ๊ทน์ ์œผ๋กœ ๋„์ž…ํ•˜์—ฌ ๋ฆฌํ„ฐ๋Ÿฌ์น˜(๋ฌธํ—Œ ์กฐ์‚ฌ)๋‚˜ ์ฝ˜ํ…์ธ  ์ œ์ž‘ ์†Œ์š” ์‹œ๊ฐ„์„ ํš๊ธฐ์ ์œผ๋กœ ๋‹จ์ถ•ํ•  ์ˆ˜ ์žˆ๋Š” ์‹ค์šฉ์ ์ธ ๋ฐฉ์•ˆ์„ ๋ชจ์ƒ‰ํ•ด์•ผ ํ•ฉ๋‹ˆ๋‹ค.


📑 Paper Summaries

🥇 1. Green-VLA: Staged Vision-Language-Action Model for Generalist Robots

arXiv: 2602.00919 | ⬆️ 236 | → See Deep Dive | Tags: vla robotics curriculum-learning generalist-robot embodied-ai fine-tuning reinforcement-learning

์ด ๋…ผ๋ฌธ์€ ์„œ๋กœ ๋‹ค๋ฅธ ๋กœ๋ด‡์˜ ๋ฐ์ดํ„ฐ ์ด์งˆ์„ฑ๊ณผ ๋‚ฎ์€ ๋ฐ์ดํ„ฐ ํ’ˆ์งˆ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•˜๊ธฐ ์œ„ํ•ด 5๋‹จ๊ณ„๋กœ ๊ตฌ์„ฑ๋œ ๊ต์œก ๊ณผ์ •(Curriculum)์„ ์ œ์•ˆํ•˜์—ฌ, ์‹ค์ œ ๋กœ๋ด‡(Green ๋กœ๋ด‡)์— ์„ฑ๊ณต์ ์œผ๋กœ ๋ฐฐํฌ๋จ๊ณผ ๋™์‹œ์— ๋‹ค์–‘ํ•œ ํ˜•ํƒœ์˜ ๋กœ๋ด‡์— ์ผ๋ฐ˜ํ™” ๊ฐ€๋Šฅํ•œ VLA ๋ชจ๋ธ์„ ๊ตฌํ˜„ํ–ˆ๊ธฐ์— ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค.

๐Ÿ“– ์ƒ์„ธ ๋ถ„์„: โ†’ Deep Dive ๋ณด๊ธฐ์—์„œ ์‹ฌ์ธต ๋ถ„์„์„ ํ™•์ธํ•˜์„ธ์š”.


🥈 2. ERNIE 5.0 Technical Report

arXiv: 2602.04705 | ⬆️ 236 | → See Deep Dive | Tags: ernie-50 multimodal autoregressive mixture-of-experts foundation-model deep-learning nlp computer-vision

ํ…์ŠคํŠธ, ์ด๋ฏธ์ง€, ๋น„๋””์˜ค, ์˜ค๋””์˜ค๋ฅผ ํ•˜๋‚˜์˜ ํ†ตํ•ฉ๋œ ์ž๊ธฐํšŒ๊ท€(Autoregressive) ๋ฐฉ์‹์œผ๋กœ ์ฒ˜์Œ๋ถ€ํ„ฐ ๋๊นŒ์ง€ ์ฒ˜๋ฆฌํ•˜์—ฌ, ๊ธฐ์กด ๋ชจ๋ธ๋“ค์˜ ํ•œ๊ณ„์˜€๋˜ ์ดํ•ด(Understanding)์™€ ์ƒ์„ฑ(Generation)์˜ ๋ถ„๋ฆฌ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•œ ์ง„์ •ํ•œ ํ†ตํ•ฉ ๋ฉ€ํ‹ฐ๋ชจ๋‹ฌ ํŒŒ์šด๋ฐ์ด์…˜ ๋ชจ๋ธ์ด๊ธฐ ๋•Œ๋ฌธ์ž…๋‹ˆ๋‹ค.

๐Ÿ“– ์ƒ์„ธ ๋ถ„์„: โ†’ Deep Dive ๋ณด๊ธฐ์—์„œ ์‹ฌ์ธต ๋ถ„์„์„ ํ™•์ธํ•˜์„ธ์š”.


🥉 3. Kimi K2.5: Visual Agentic Intelligence

arXiv: 2602.02276 | ⬆️ 206 | → See Deep Dive | Tags: ai-agent multimodal parallel-processing reinforcement-learning kimi-k25 model-architecture state-of-the-art latency-reduction

์ด ๋…ผ๋ฌธ์€ ํ…์ŠคํŠธ์™€ ๋น„์ „์„ ๊ณต๋™์œผ๋กœ ์ตœ์ ํ™”ํ•˜๊ณ  ์—ฌ๋Ÿฌ ์—์ด์ „ํŠธ๋ฅผ ๋™์‹œ์— ์‹คํ–‰ํ•˜์—ฌ ๋ณต์žกํ•œ ์ž‘์—…์„ ๊ธฐ์กด๋ณด๋‹ค ๋น ๋ฅด๊ณ  ์ •ํ™•ํ•˜๊ฒŒ ์ฒ˜๋ฆฌํ•˜๋Š” ๋ฒ”์šฉ ์—์ด์ „ํŠธ ์ง€๋Šฅ์˜ ์ƒˆ๋กœ์šด ํŒจ๋Ÿฌ๋‹ค์ž„์„ ์ œ์‹œํ–ˆ๊ธฐ ๋•Œ๋ฌธ์— ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค.

๐Ÿ“– ์ƒ์„ธ ๋ถ„์„: โ†’ Deep Dive ๋ณด๊ธฐ์—์„œ ์‹ฌ์ธต ๋ถ„์„์„ ํ™•์ธํ•˜์„ธ์š”.


4. Vision-DeepResearch: Incentivizing DeepResearch Capability in Multimodal Large Language Models

arXiv: 2601.22060 | ⬆️ 147 | → See Deep Dive | Tags: multimodal-llm deep-research retrieval-augmented-generation computer-vision reinforcement-learning reasoning visual-search agent

This work is highly significant as the first large-scale demonstration of a 'deep research' agent capability in multimodal LLMs: overcoming the single-shot retrieval that limited earlier models, it autonomously tracks down complex, precise visual information through dozens of reasoning steps and hundreds of search-engine interactions.
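
A deep-research loop of this shape can be sketched as a reason-search cycle that runs for many steps instead of one retrieve-then-answer pass. The function signatures and stopping rule here are assumptions, not the paper's actual interface:

```python
# Illustrative reason-search loop for a deep-research agent.
def deep_research(question, reason, search, max_steps=50):
    """Alternate reasoning and search until the model commits to an answer.

    reason(question, evidence) -> (thought, query_or_None, answer_or_None)
    search(query) -> list of result snippets
    """
    evidence = []
    for _ in range(max_steps):
        thought, query, answer = reason(question, evidence)
        if answer is not None:        # confident enough: stop and answer
            return answer, evidence
        if query is not None:         # otherwise issue another search call
            evidence.extend(search(query))
    return None, evidence             # step budget exhausted
```

The contrast with plain retrieval-augmented generation is the loop itself: each search is conditioned on all evidence gathered so far, so later queries can refine or correct earlier ones.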

๐Ÿ“– ์ƒ์„ธ ๋ถ„์„: โ†’ Deep Dive ๋ณด๊ธฐ์—์„œ ์‹ฌ์ธต ๋ถ„์„์„ ํ™•์ธํ•˜์„ธ์š”.


5. PaperBanana: Automating Academic Illustration for AI Scientists

arXiv: 2601.23265 | ⬆️ 137 | → See Deep Dive | Tags: ai-scientist academic-illustration agentic-framework paperbanana visualization llm vlm multimodal-ai

AI ์—ฐ๊ตฌ์ž๋“ค์˜ ๊ฐ€์žฅ ํฐ ๋ณ‘๋ชฉ ํ˜„์ƒ์ธ โ€˜ํ•™์ˆ ์šฉ ์‚ฝํ™” ์ œ์ž‘โ€™์„ 5๊ฐœ์˜ ์ „๋ฌธ ์—์ด์ „ํŠธ๊ฐ€ ํ˜‘๋ ฅํ•˜๋Š” ์ž๋™ํ™” ํ”„๋ ˆ์ž„์›Œํฌ๋กœ ํ•ด๊ฒฐํ•˜์—ฌ, ์ธ๊ฐ„์˜ ํ‰๊ท  ํ’ˆ์งˆ์„ ๋›ฐ์–ด๋„˜๋Š” ๊ฒฐ๊ณผ๋ฅผ ๋ณด์—ฌ์ฃผ์—ˆ๊ธฐ ๋•Œ๋ฌธ์— ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค.

๐Ÿ“– ์ƒ์„ธ ๋ถ„์„: โ†’ Deep Dive ๋ณด๊ธฐ์—์„œ ์‹ฌ์ธต ๋ถ„์„์„ ํ™•์ธํ•˜์„ธ์š”.


๐Ÿ“… ์ƒ์„ฑ์ผ: 2026-02-08 | ๐Ÿค– GLM-4.7 Weekly Digest