
DD-012 Your Group-Relative Advantage Is Biased

arXiv: 2601.08521 | Upvotes: 147 | Comments: 7 | Rank: Top 2 this week

Figure 1


Paper Review: Your Group-Relative Advantage Is Biased (arXiv: 2601.08521)


1. Why Does This Paper Matter?

DeepSeek-R1์˜ ์„ฑ๊ณต ์ดํ›„, ๋Œ€๊ทœ๋ชจ ์–ธ์–ด ๋ชจ๋ธ(LLM)์˜ ์ถ”๋ก  ๋Šฅ๋ ฅ์„ ํ–ฅ์ƒ์‹œํ‚ค๊ธฐ ์œ„ํ•ด **GRPO(Group Relative Policy Optimization)**์™€ ๊ฐ™์€ ๊ทธ๋ฃน ๊ธฐ๋ฐ˜ ๊ฐ•ํ™” ํ•™์Šต(RL) ๋ฐฉ๋ฒ•์ด ํ‘œ์ค€์ฒ˜๋Ÿผ ์“ฐ์ด๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค. ๊ธฐ์กด ๋ฐฉ๋ฒ•์€ ๋ณ„๋„์˜ ๋น„ํ‰๊ฐ€(Critic) ๋ชจ๋ธ ์—†์ด ๊ทธ๋ฃน ๋‚ด ํ‰๊ท  ๋ณด์ƒ๋งŒ์œผ๋กœ ํ•™์Šตํ•˜์ง€๋งŒ, ์ด ๋…ผ๋ฌธ์€ ์ด ์ ‘๊ทผ๋ฒ•์ด **โ€œ์–ด๋ ค์šด ๋ฌธ์ œ๋Š” ์–ด๋ ท๊ฒŒ, ์‰ฌ์šด ๋ฌธ์ œ๋Š” ์‰ฝ๊ฒŒโ€ ํŒ๋‹จํ•˜๋Š” ๊ทผ๋ณธ์ ์ธ ํŽธํ–ฅ(Bias)**์„ ๊ฐ€์ง€๊ณ  ์žˆ์Œ์„ ์ˆ˜ํ•™์ ์œผ๋กœ ์ฆ๋ช…ํ–ˆ์Šต๋‹ˆ๋‹ค. ์ด ๋…ผ๋ฌธ์€ ๋‹จ์ˆœํžˆ ์„ฑ๋Šฅ์„ ๋†’์ด๋Š” ๊ฒƒ์„ ๋„˜์–ด, ํ˜„์žฌ ๊ฐ€์žฅ ํ•ซํ•œ LLM ํ•™์Šต ํŒจ๋Ÿฌ๋‹ค์ž„(RLVR)์˜ ์ˆจ๊ฒจ์ง„ ๊ฒฐํ•จ์„ ํ•ด๋ถ€ํ•˜๊ณ  ๊ณผ๊ฑฐ ์ด๋ ฅ์„ ํ™œ์šฉํ•ด ์ด ํŽธํ–ฅ์„ ๊ต์ •ํ•˜๋Š” HA-DW๋ผ๋Š” ๊ฐ•๋ ฅํ•œ ํ•ด๊ฒฐ์ฑ…์„ ์ œ์‹œํ–ˆ๋‹ค๋Š” ์ ์—์„œ ๋งค์šฐ ์ค‘์š”ํ•ฉ๋‹ˆ๋‹ค.


2. The Core Idea, Explained Simply

๐ŸŽฏ ์ผ์ƒ์ƒํ™œ ๋น„์œ : โ€œ๋„ˆ๊ทธ๋Ÿฌ์šด ๊ต์‚ฌ๋‹˜ vs ์—„๊ฒฉํ•œ ๊ต์‚ฌ๋‹˜โ€

์ด ๋…ผ๋ฌธ์˜ ํ•ต์‹ฌ์„ ์ดํ•ดํ•˜๋ ค๋ฉด **โ€˜์‹œํ—˜์„ ์น˜๋ฅด๋Š” ํ•™์ƒ๊ณผ ์ฑ„์ ํ•˜๋Š” ๊ต์‚ฌโ€™**์˜ ์ƒํ™ฉ์„ ์ƒ์ƒํ•ด ๋ณด์„ธ์š”.

  • ๊ธฐ์กด ๋ฐฉ์‹ (GRPO)์˜ ๋ฌธ์ œ์  - โ€œ๊ธฐ์ค€์ด ์—†๋Š” ๊ต์‚ฌโ€: ๊ต์‚ฌ๊ฐ€ ์‹œํ—˜ ๋ฌธ์ œ๋ฅผ ๊ทธ๋ฃน๋ณ„๋กœ ๋‚ด์ค๋‹ˆ๋‹ค.

    • ์ƒํ™ฉ A (์‰ฌ์šด ๋ฌธ์ œ): ๊ทธ๋ฃน์˜ ๋ชจ๋“  ํ•™์ƒ์ด 100์ ์„ ๋งž์•˜์Šต๋‹ˆ๋‹ค. ๊ต์‚ฌ๋Š” โ€œํ‰๊ท ์ด 100์ ์ด๋‹ˆ๊นŒ, 100์  ๋งž์€ ์• ๋Š” ๊ทธ๋ƒฅ ๋ณดํ†ต์ด๋„ค?โ€๋ผ๊ณ  ์ƒ๊ฐํ•˜๋ฉฐ **๋„ˆ๋ฌด ๋†’์€ ์ ์ˆ˜(๊ณผ๋Œ€ํ‰๊ฐ€)**๋ฅผ ์ค๋‹ˆ๋‹ค. ํ•™์ƒ๋“ค์€ ์‰ฌ์šด ๋ฌธ์ œ๋งŒ ๊ณ„์† ํ’€๋ ค๊ณ  ํ•ฉ๋‹ˆ๋‹ค.
    • ์ƒํ™ฉ B (์–ด๋ ค์šด ๋ฌธ์ œ): ์•„๋ฌด๋„ ๋ชป ํ’€๊ณ  ํ•œ ํ•™์ƒ๋งŒ ๊ฒจ์šฐ 10์ ์„ ๋ฐ›์•˜์Šต๋‹ˆ๋‹ค. ๊ต์‚ฌ๋Š” โ€œ์ด ๊ทธ๋ฃน ํ‰๊ท ์ด 2์ ์ด๋‹ˆ๊นŒ, 10์ ์€ ๊ฝค ์ž˜ํ–ˆ๋„ค?โ€๋ผ๊ณ  ์ƒ๊ฐํ•  ์ˆ˜ ์žˆ์ง€๋งŒ, ํ†ต๊ณ„์ ์œผ๋กœ ๊ทธ๋ฃน ๋‚ด ๋ถ„์‚ฐ์ด ๋‚ฎ์•„์„œ ์ง„์งœ ์‹ค๋ ฅ๋ณด๋‹ค ์ ์ˆ˜๋ฅผ ๊นŽ์•„(๊ณผ์†Œํ‰๊ฐ€) ๋ฒ„๋ฆฝ๋‹ˆ๋‹ค. ํ•™์ƒ์€ โ€œ์–ด๋ ค์šด ๋ฌธ์ œ๋ฅผ ํ’€์–ด๋„ ๋ณ„ ๋ณด์ƒ์ด ์—†๊ตฌ๋‚˜โ€๋ผ๊ณ  ๋А๊ปด ํฌ๊ธฐํ•ฉ๋‹ˆ๋‹ค.
    • ๊ฒฐ๊ณผ: ๋ชจ๋ธ์€ ์‰ฌ์šด ๋ฌธ์ œ์—๋งŒ ์ง‘์ฐฉํ•˜๊ณ  ์–ด๋ ค์šด ๋ฌธ์ œ๋Š” ์™ธ๋ฉดํ•˜๊ฒŒ ๋ฉ๋‹ˆ๋‹ค.
  • ์ œ์•ˆ ๋ฐฉ์‹ (HA-DW)์˜ ํ•ด๊ฒฐ์ฑ… - โ€œ์„ฑ์ ๋ถ€๊ฐ€ ์žˆ๋Š” ๊ต์‚ฌโ€: ์ด ๊ต์‚ฌ๋Š” **์ง€๋‚œ ์ˆ˜๋…„๊ฐ„์˜ ํ•™๊ธ‰ ์„ฑ์ (์ด๋ ฅ, History)**์„ ๊ฐ€์ง€๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.

    • Evolutionary Difficulty Anchor (์ง„ํ™”ํ•˜๋Š” ๋‚œ์ด๋„ ๊ธฐ์ค€): ๊ต์‚ฌ๋Š” โ€œ์ง€๊ธˆ๊นŒ์ง€ ์šฐ๋ฆฌ ํ•™๊ธ‰ ์‹ค๋ ฅ์ด ๋ณดํ†ต 70์ ์ด์—ˆ์–ดโ€๋ผ๋Š” **๊ธฐ์ค€์ (Anchor)**์„ ๊ฐ€์ง€๊ณ  ์žˆ์Šต๋‹ˆ๋‹ค.
    • Adaptive Reweighting (์ ์‘ํ˜• ๊ฐ€์ค‘์น˜ ์กฐ์ •): ์ด๋ฒˆ ์‹œํ—˜์—์„œ ๊ฐ‘์ž๊ธฐ ๋ฌธ์ œ๊ฐ€ ๋„ˆ๋ฌด ์–ด๋ ค์›Œ์„œ ์ ์ˆ˜๊ฐ€ 10์  ๋‚˜์™”๋”๋ผ๋„, **๊ณผ๊ฑฐ ์ด๋ ฅ(70์  ์‹ค๋ ฅ)**์„ ๊ณ ๋ คํ•ด โ€œ์˜ค๋Š˜ ๋ฌธ์ œ๊ฐ€ ์›ฌ์ผ์ด์ง€ ์—„์ฒญ ์–ด๋ ค์› ๊ตฌ๋‚˜! 10์  ๋งž์€ ์• ๋Š” ์‚ฌ์‹ค ์‹ค๋ ฅ์ด 100์ ์ด๋‚˜ ๋‹ค๋ฆ„์—†์–ด!โ€๋ผ๊ณ  ํŒ๋‹จํ•˜๊ณ  ๋ณด์ƒ์„ ์˜ฌ๋ ค์ค๋‹ˆ๋‹ค. ๋ฐ˜๋Œ€๋กœ ๋„ˆ๋ฌด ์‰ฌ์šด ์‹œํ—˜์ด๋ฉด ๋ณด์ƒ์„ ๋‚ฎ์ถฐ์ค๋‹ˆ๋‹ค.

โš™๏ธ ๋‹จ๊ณ„๋ณ„ ๋™์ž‘ ์›๋ฆฌ

  1. Identifying the bias (Bias Identification): vanilla GRPO uses the within-group mean as its baseline, which, when the number of sampled rollouts (G) is small, produces a bias that underestimates the advantage on hard problems and overestimates it on easy ones.
  2. Setting a history-aware anchor (History-Aware Anchor): as the model trains, the method tracks how well it has solved problems in the past. Like a Kalman filter, it blends the current batch's score ($y_t$) with the prior belief ($C_t^-$) to estimate the model's current ability ($C_t^+$).
    • $$C_t^+ = (1-\eta_t)C_t^- + \eta_t y_t$$
    • ($\eta_t$: a gain controlling sensitivity to new evidence)
  3. Adjusting rewards by difficulty (Adaptive Difficulty Weighting): the estimated ability ($C_t$) is compared with the current problem's success probability; the weight is raised when the problem is harder than expected and lowered when it is easier, correcting the distorted advantage back toward an unbiased estimate.

3. Analysis of the Experimental Results

The authors validated the method on mathematical-reasoning benchmarks using the Qwen3-4B-Base model.

  • Benchmarks: hard math suites including MATH500, AIME25 (American Invitational Mathematics Examination level), AMC23, Minerva, and OlympiadBench.
  • Key results (numbers and comparisons):
    • Gains over vanilla GRPO: GRPO with HA-DW delivered a meaningful improvement in overall average (AVG) performance over plain GRPO, with the gains most pronounced on extremely hard benchmarks such as AIME25.
    • Difficulty-stratified results: in the paper's Figure 1(c), HA-DW posts a much higher accuracy than vanilla GRPO specifically on Hard-level problems, confirming that the "correct the advantage on hard problems" mechanism actually works.
    • Sample efficiency: the bias shrinks even with few rollouts (generation attempts), so performance improves without added compute cost.

4. ํ•œ๊ณ„์ ๊ณผ ํ–ฅํ›„ ์—ฐ๊ตฌ ๋ฐฉํ–ฅ

  • ์ €์ž๊ฐ€ ์–ธ๊ธ‰ํ•œ ํ•œ๊ณ„:
    • ์ถ”๊ฐ€์ ์ธ ํ•˜์ดํผํŒŒ๋ผ๋ฏธํ„ฐ(๊ธฐ์šธ๊ธฐ ์กฐ์ ˆ, ์ด๋ ฅ ๋ฒ„ํผ ํฌ๊ธฐ ๋“ฑ)๊ฐ€ ํ•„์š”ํ•˜์—ฌ ํŠœ๋‹์ด ๋‹ค์†Œ ๊นŒ๋‹ค๋กœ์šธ ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
    • ์™„์ „ํ•œ ๋น„ํŽธํ–ฅ(Unbiased) ์ƒํƒœ๋ฅผ ๋ณด์žฅํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” ์ถฉ๋ถ„ํ•œ ๋กค์•„์›ƒ ์ˆ˜๊ฐ€ ํ•„์š”ํ•˜๋ฉฐ, ๊ทน๋‹จ์ ์œผ๋กœ ์ ์€ ์ƒ˜ํ”Œ์—์„œ๋Š” ์—ฌ์ „ํžˆ ๋ถˆ์•ˆ์ •ํ•  ์ˆ˜ ์žˆ์Šต๋‹ˆ๋‹ค.
  • ํ–ฅํ›„ ์—ฐ๊ตฌ ๋ฐฉํ–ฅ:
    • HA-DW๋ฅผ ๋‹ค๋ฅธ GRPO ๋ณ€ํ˜• ์•Œ๊ณ ๋ฆฌ์ฆ˜(GSPO, DAPO ๋“ฑ)๊ณผ ๊ฒฐํ•ฉํ•˜์—ฌ ์„ฑ๋Šฅ์„ ๊ทน๋Œ€ํ™”ํ•˜๋Š” ์—ฐ๊ตฌ.
    • ์ˆ˜ํ•™๋ฟ๋งŒ ์•„๋‹ˆ๋ผ ์ฝ”๋”ฉ(Code)์ด๋‚˜ ๋…ผ๋ฆฌ์  ์ถ”๋ก ์ด ํ•„์š”ํ•œ ๋‹ค๋ฅธ ๋„๋ฉ”์ธ์œผ๋กœ์˜ ์ผ๋ฐ˜ํ™” ๊ฐ€๋Šฅ์„ฑ ํ™•์ธ.

5. Practical Applicability

  • Application areas:
    • Strengthening LLM reasoning: immediately applicable wherever LLMs that handle math, coding, or complex logical queries are post-trained, and especially relevant when training reasoning-focused models in the mold of DeepSeek-R1.
  • Required resources:
    • No extra model: unlike PPO, which keeps a separate critic model, this is GRPO-based, so the GPU memory needed for training is comparatively small.
    • Data: a dataset that supports RLVR (verifiable-reward) training - an environment or verifier model able to hand out rewards.
    • Implementation effort: only a relatively simple formula (the weight update) needs to be added on top of existing GRPO code, so the engineering burden is modest.

6. ์ด ๋…ผ๋ฌธ์„ ์ดํ•ดํ•˜๊ธฐ ์œ„ํ•œ ์‚ฌ์ „ ์ง€์‹

  1. RLHF (Reinforcement Learning from Human Feedback): ์ธ๊ฐ„์˜ ํ”ผ๋“œ๋ฐฑ์ด๋‚˜ ๋ณด์ƒ ์‹ ํ˜ธ๋ฅผ ํ†ตํ•ด LLM์„ ์ธ๊ฐ„์ด ์„ ํ˜ธํ•˜๋Š” ๋ฐฉํ–ฅ์œผ๋กœ ํŠœ๋‹ํ•˜๋Š” ๊ธฐ๋ฒ•.
  2. PPO (Proximal Policy Optimization): OpenAI๊ฐ€ ์‚ฌ์šฉํ•œ ๋Œ€ํ‘œ์ ์ธ ๊ฐ•ํ™” ํ•™์Šต ์•Œ๊ณ ๋ฆฌ์ฆ˜์œผ๋กœ, ์ •์ฑ… ์—…๋ฐ์ดํŠธ๊ฐ€ ๋„ˆ๋ฌด ํฌ์ง€ ์•Š๋„๋ก ์ œ์•ฝํ•˜๋Š” ์•ˆ์ •์ ์ธ ์•Œ๊ณ ๋ฆฌ์ฆ˜.
  3. GRPO (Group Relative Policy Optimization): PPO์—์„œ ๋น„ํ‰๊ฐ€(Critic) ๋ชจ๋ธ์„ ์—†์• ๊ณ , ๊ฐ™์€ ์งˆ๋ฌธ์— ๋Œ€ํ•œ ์—ฌ๋Ÿฌ ๋‹ต๋ณ€(๊ทธ๋ฃน)์˜ ํ‰๊ท  ๋ณด์ƒ์„ ๊ธฐ์ค€์œผ๋กœ ์‚ผ์•„ ๋ฉ”๋ชจ๋ฆฌ ํšจ์œจ์„ ๊ทน๋Œ€ํ™”ํ•œ ์ตœ์‹  ์•Œ๊ณ ๋ฆฌ์ฆ˜.
  4. Advantage Estimation (์–ด๋“œ๋ฐดํ‹ฐ์ง€ ์ถ”์ •): ํŠน์ • ํ–‰๋™์ด ๊ธฐ์ค€์„ (Baseline)๋ณด๋‹ค ์–ผ๋งˆ๋‚˜ ๋” ์ข‹์•˜๋Š”์ง€๋ฅผ ์ธก์ •ํ•˜๋Š” ๊ฐ’์œผ๋กœ, ๊ฐ•ํ™” ํ•™์Šต์—์„œ ์ •์ฑ…์„ ์—…๋ฐ์ดํŠธํ•˜๋Š” ๋ฐฉํ–ฅ์„ ๊ฒฐ์ •ํ•˜๋Š” ํ•ต์‹ฌ ์ง€ํ‘œ.
  5. Bias-Variance Tradeoff (ํŽธํ–ฅ-๋ถ„์‚ฐ ํŠธ๋ ˆ์ด๋“œ์˜คํ”„): ๋ชจ๋ธ์ด ํ•™์Šต ๋ฐ์ดํ„ฐ์— ๋„ˆ๋ฌด ๊ณผ์ ํ•ฉ๋˜๊ฑฐ๋‚˜(๋ถ„์‚ฐ), ๋„ˆ๋ฌด ๋‹จ์ˆœํ•ด์ ธ์„œ(ํŽธํ–ฅ) ์‹ค์ œ ์„ฑ๋Šฅ์ด ๋–จ์–ด์ง€๋Š” ํ˜„์ƒ ์‚ฌ์ด์˜ ๊ท ํ˜•.
  6. Outcome Reward Model (ORM): ๋ชจ๋ธ์ด ์ƒ์„ฑํ•œ ์ตœ์ข… ๊ฒฐ๊ณผ(๋‹ต)๋งŒ ๋ณด๊ณ  ์ ์ˆ˜๋ฅผ ๋งค๊ธฐ๋Š” ๋ณด์ƒ ๋ชจ๋ธ๋กœ, ์ถ”๋ก  ๊ณผ์ •์„ ํ‰๊ฐ€ํ•  ๋•Œ ์ž์ฃผ ์“ฐ์ž„.
  7. Kalman Filter (์นผ๋งŒ ํ•„ํ„ฐ): ์‹œ์Šคํ…œ์˜ ์ƒํƒœ๋ฅผ ์ถ”์ •ํ•˜๊ธฐ ์œ„ํ•ด ๊ณผ๊ฑฐ์˜ ๋ฐ์ดํ„ฐ์™€ ํ˜„์žฌ์˜ ์ธก์ •๊ฐ’์„ ๊ฒฐํ•ฉํ•˜์—ฌ ์ตœ์ ์˜ ์ƒํƒœ๋ฅผ ์˜ˆ์ธกํ•˜๋Š” ์•Œ๊ณ ๋ฆฌ์ฆ˜ (๋…ผ๋ฌธ์—์„œ ์ด๋ ฅ์„ ๊ด€๋ฆฌํ•˜๋Š” ๋ฐ ์‚ฌ์šฉ๋จ).
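For reference, items 3 and 4 connect through GRPO's group-normalized advantage (as defined in the DeepSeekMath paper that introduced GRPO): the reward $r_i$ of the $i$-th of $G$ sampled answers is standardized against its own group,

$$\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}{\operatorname{std}(\{r_1, \dots, r_G\})}$$

and it is exactly this group statistic whose finite-sample bias the paper analyzes and HA-DW reweights.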

๐Ÿ“š ์ด๋ฒˆ ์ฃผ ๊ด€๋ จ Deep Dive

| Rank | Paper | Deep Dive |
| --- | --- | --- |
| 🥇 | Agentic Reasoning for Large Languag… | DD-011 |
| 🥈 | Your Group-Relative Advantage Is Bi… | 📍 this document |
| 🥉 | EvoCUA: Evolving Computer Use Agent… | DD-013 |
| 4 | LLM-in-Sandbox Elicits General Agen… | DD-014 |
| 5 | Being-H0.5: Scaling Human-Centric R… | DD-015 |

๐Ÿ“… ์ƒ์„ฑ์ผ: 2026-02-02 | ๐Ÿค– GLM-4.7 Deep Dive