📚 Weekly AI Paper Digest

Period: 2026-01-12 ~ 2026-01-17
Selection: the five most notable papers of the week


🏆 This Week's Top 5

| Rank | Paper | ⬆️ | Deep Dive |
| --- | --- | --- | --- |
| 🥇 | Watching, Reasoning, and Searching: A Vi… | 209 | DD-006 |
| 🥈 | BabyVision: Visual Reasoning Beyond Lang… | 193 | DD-007 |
| 🥉 | STEP3-VL-10B Technical Report | 190 | DD-008 |
| 4. | Thinking with Map: Reinforced Parallel M… | 165 | DD-009 |
| 5. | Urban Socio-Semantic Segmentation with V… | 155 | DD-010 |

📑 Summaries

🥇 1. Watching, Reasoning, and Searching: A Video Deep Research Benchmark on Open Web for Agentic Video Reasoning

arXiv: 2601.06943 | ⬆️ 209 → View Deep Dive
Tags: video-reasoning agentic-ai open-web-search multimodal-benchmark deep-research retrieval-augmented-generation fact-verification

This paper matters because it proposes VideoDR, the first benchmark for evaluating AI agents with "Deep Research" ability: for complex questions that cannot be answered from the video alone, the agent must find clues in the video, search the web, and reason over what it finds.


🥈 2. BabyVision: Visual Reasoning Beyond Language

arXiv: 2601.06521 | ⬆️ 193 → View Deep Dive
Tags: babyvision visual-reasoning multimodal-llm ai-benchmark computer-vision cognitive-science model-evaluation

Using the BabyVision benchmark, this paper shows that state-of-the-art multimodal LLMs (large language models) solve complex knowledge-based problems well yet are seriously weak at basic visual reasoning (shape discrimination, spatial perception, and the like) that even a three-year-old handles easily. It suggests that genuine visual intelligence requires reducing reliance on language.


🥉 3. STEP3-VL-10B Technical Report

arXiv: 2601.09668 | ⬆️ 190 → View Deep Dive
Tags: step3-vl vision-language-models efficient-ai pacore reinforcement-learning multimodal-reasoning llm open-source-model

An open-source multimodal model with only 10 billion parameters (10B) that delivers performance comparable to models hundreds of times larger (GPT-5.2, Gemini-3-Pro, and others), while substantially scaling **test-time compute (Parallel Coordinated Reasoning)** and thereby redefining the trade-off between efficiency and intelligence.


4. Thinking with Map: Reinforced Parallel Map-Augmented Agent for Geolocalization

arXiv: 2601.05432 | ⬆️ 165 → View Deep Dive
Tags: geolocalization map-agent lvlm reinforcement-learning test-time-scaling reasoning computer-vision

[!info] Problem definition


5. Urban Socio-Semantic Segmentation with Vision-Language Reasoning

arXiv: 2601.10477 | ⬆️ 155 → View Deep Dive
Tags: urban-ai semantic-segmentation vision-language-model remote-sensing socio-semantics zero-shot-generalization satellite-imagery multimodal-learning

This paper matters because it presents the first framework and dataset for precisely segmenting urban regions that carry social meaning, such as "school" or "park", which are hard to distinguish from satellite imagery alone. It does so through reasoning that combines digital maps with a vision-language model (VLM).


📅 Generated: 2026-02-02 | 🤖 GLM-4.7 Weekly Digest