[Paper Review] Deep Think with Confidence

Tags: LLM Reasoning, Test-time Scaling, Efficient Inference, AI
Published: March 8, 2026
Author

Problem

In self-consistency-based parallel reasoning, accuracy saturates or even drops as the number of traces grows, and low-quality traces dilute the votes for the correct answer.

Approach

DeepConf uses the token-level confidence that the model itself produces: (1) offline, it filters completed traces and weights their votes; (2) online, it terminates low-quality traces early, during generation.

Key Contributions

  1. Proposes fine-grained confidence metrics that outperform the conventional average trace confidence (Group Confidence, Bottom-10% Confidence, Tail Confidence)
  2. Designs an offline pipeline that combines confidence-based filtering with weighted voting
  3. Proposes an online algorithm based on early stopping during generation, in two variants: DeepConf-low and DeepConf-high
  4. Without additional training, reaches 99.9% accuracy on AIME 2025 with DeepSeek-8B and cuts generated tokens by up to 84.7% compared with full parallel thinking

Background and Motivation

Existing Approach

Self-consistency with majority voting is a test-time scaling method for LLM reasoning: sample many reasoning traces in parallel and take the most frequent final answer.
For example, when Qwen3-8B solves AIME 2025, majority voting raises accuracy from 65.1% for a single sample (pass@1) to 82.6%.

Problems and Bottlenecks

1. Diminishing Returns
The accuracy gain shrinks sharply as the number of traces grows.
  • With Qwen3-8B, majority voting over 512 traces generates about 100M additional tokens
  • Yet the accuracy gain over pass@1 is only about 17 percentage points
  • Going from 64 to 512 traces, an 8x increase, improves accuracy by only 2~3 percentage points
2. Vote Contamination by Low-Quality Traces
When every trace is treated equally, traces that reach wrong answers dilute the votes for the correct answer.
3. Wasted Computation
Many generated traces head in the wrong direction from the start, so there is no need to wait for them to finish. Even so, existing methods can only judge a trace after generating it to the end.

What Causes the Bottleneck?

Standard majority voting ignores quality differences between traces. Meanwhile, the LLM already produces a confidence signal during reasoning, derived from token-level log-probabilities, and this signal correlates strongly with trace correctness. Integrating it into the voting process can improve both accuracy and efficiency.

Core Method

Designing the Confidence Metrics

The central question in DeepConf is which confidence metric best separates correct traces from incorrect ones.

1. Token Confidence
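
A minimal sketch of this metric, assuming the definition I recall from the DeepConf paper: token confidence is the negative mean log-probability of the top-k candidate tokens at one position. A sharply peaked distribution then scores high and a flat one scores low. The function name and the example probabilities are illustrative, not taken from the paper.

```python
import math


def token_confidence(top_logprobs):
    """Negative mean log-probability over the top-k candidates at one position.

    Assumed definition: C_i = -(1/k) * sum_j log P_i(j).
    """
    if not top_logprobs:
        raise ValueError("need at least one candidate log-probability")
    return -sum(top_logprobs) / len(top_logprobs)


# Peaked distribution: one candidate takes almost all of the probability mass.
peaked = [math.log(0.97), math.log(0.02), math.log(0.01)]
# Flat distribution: the model is torn between several candidates.
flat = [math.log(0.40), math.log(0.35), math.log(0.25)]

print(token_confidence(peaked) > token_confidence(flat))
```

Under this definition a larger value means the model is more certain about the token it picked.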

2. Average Trace Confidence
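
A small sketch of the baseline metric, assuming it is simply the mean of the per-token confidences over the whole trace. The two example traces are made up to show why a single average can hide a local collapse, which is the motivation for the group-level metrics that follow.

```python
def average_trace_confidence(token_confs):
    """Mean token confidence over an entire trace."""
    if not token_confs:
        raise ValueError("trace has no tokens")
    return sum(token_confs) / len(token_confs)


# Hypothetical per-token confidences for two traces of equal length.
steady = [2.0] * 8                # uniformly confident
shaky = [2.5] * 6 + [0.5] * 2     # confident overall, but collapses near the end

print(average_trace_confidence(steady), average_trace_confidence(shaky))
```

Both traces average to the same value even though the second one contains a stretch of very low confidence.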

3. Group Confidence (group level), proposed by DeepConf

Tokens are grouped with a sliding window to capture local confidence.
  • Group: a fixed number of consecutive tokens (the window size)
  • Sliding-window scheme in which adjacent groups overlap
  • Pinpoints the stretches where confidence drops sharply mid-reasoning (e.g., around tokens such as "wait", "however", "think again")
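
The sliding-window idea above can be sketched as follows. This is an illustration under assumptions: each group ends at a token and covers the most recent `window` tokens, and groups at the very start of the trace are simply shorter. The function name and the tiny window size are hypothetical.

```python
def group_confidences(token_confs, window):
    """Mean confidence of the sliding window ending at each token.

    Adjacent windows overlap by window - 1 tokens. Early windows contain
    fewer than `window` tokens because not enough tokens exist yet.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    groups = []
    running = 0.0
    for i, conf in enumerate(token_confs):
        running += conf
        if i >= window:
            # Drop the token that just slid out of the window.
            running -= token_confs[i - window]
        groups.append(running / min(i + 1, window))
    return groups


print(group_confidences([1.0, 2.0, 3.0, 4.0], window=2))
```

A running sum keeps the cost at one addition and one subtraction per token, which matters when the window spans thousands of tokens.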

4. Bottom-10% Group Confidence, proposed by DeepConf

The mean of the lowest 10% of all group confidences serves as the trace quality metric.
  • Bottom-10% set: the groups whose confidence falls in the lowest 10%
  • Intuition: "how uncertain were the most uncertain stretches of the reasoning?"
  • If even one low stretch is severely uncertain, the trace as a whole is suspect
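
A short sketch of this metric, taking a list of group confidences as input. How to round the 10% cutoff is my own assumption (floor, with at least one group kept); the source does not specify it.

```python
def bottom_percent_confidence(group_confs, percent=10):
    """Mean of the lowest `percent` percent of group confidences."""
    if not group_confs:
        raise ValueError("trace has no groups")
    if not 0 < percent <= 100:
        raise ValueError("percent must be in (0, 100]")
    # Assumption: round the group count down, but always keep at least one.
    k = max(1, len(group_confs) * percent // 100)
    lowest = sorted(group_confs)[:k]
    return sum(lowest) / k


# Twenty hypothetical group confidences: 1.0, 2.0, ..., 20.0
confs = [float(x) for x in range(1, 21)]
print(bottom_percent_confidence(confs))  # averages the two lowest groups
```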

5. Lowest Group Confidence, proposed by DeepConf

Only the single lowest group confidence is used.
  • Group set: all groups in the trace
  • An extreme variant of Bottom-10%: the trace is judged by its single worst stretch
  • Especially useful online: during generation, only the current group's confidence needs to be checked
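
To make the last point concrete, here is a small sketch showing that the minimum can be maintained incrementally: each new group only has to be compared with the lowest value seen so far. The function names and sample values are illustrative.

```python
def lowest_group_confidence(group_confs):
    """Lowest group confidence of a completed trace."""
    if not group_confs:
        raise ValueError("trace has no groups")
    return min(group_confs)


def running_lowest(group_confs):
    """Lowest group confidence after each new group, computed incrementally."""
    lowest = float("inf")
    history = []
    for conf in group_confs:
        lowest = min(lowest, conf)
        history.append(lowest)
    return history


groups = [0.85, 0.72, 0.41, 0.90]
print(running_lowest(groups))
```

The last entry of the running history always equals the offline value, so no second pass over the trace is needed.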

6. Tail Confidence, proposed by DeepConf

Confidence is measured over only the final stretch of the trace.
  • Tail length: the number of tail tokens (e.g., 2048)
  • In math problems, confidence in the final steps that derive the answer correlates strongly with overall correctness
  • Catches traces that start out confident but waver at the conclusion
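
A minimal sketch of tail confidence, assuming it is the mean token confidence over the last `tail_len` tokens (or over the whole trace if it is shorter). The default of 2048 follows the example above; the function name is hypothetical.

```python
def tail_confidence(token_confs, tail_len=2048):
    """Mean token confidence over the final `tail_len` tokens of a trace."""
    if not token_confs:
        raise ValueError("trace has no tokens")
    if tail_len <= 0:
        raise ValueError("tail_len must be positive")
    tail = token_confs[-tail_len:]  # the whole trace if it is shorter than tail_len
    return sum(tail) / len(tail)


# A trace that starts confident and wavers at the end.
trace = [3.0, 3.0, 1.0, 1.0]
print(tail_confidence(trace, tail_len=2))
```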

Performance Comparison Across Metrics

[Figures: performance comparison of the confidence metrics]

Offline Thinking with Confidence

This mode evaluates confidence after the traces are complete and uses it to refine the vote.

Confidence-Weighted Majority Voting

Each trace's vote is weighted by its confidence.
  • Weight: the confidence of the trace casting the vote
  • Votes from high-confidence traces carry more influence
  • Unlike plain majority voting, a confident minority of traces can outvote an uncertain majority
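
The weighting described above can be sketched in a few lines. This assumes each trace contributes its trace-level confidence to the total of the answer it produced; the function name and the sample answers are made up for illustration.

```python
from collections import defaultdict


def weighted_majority_vote(answers, confidences):
    """Return the answer whose traces have the largest summed confidence."""
    if not answers or len(answers) != len(confidences):
        raise ValueError("answers and confidences must be non-empty and equal in length")
    totals = defaultdict(float)
    for answer, conf in zip(answers, confidences):
        totals[answer] += conf
    return max(totals, key=totals.get)


# Three uncertain traces say "42"; two confident traces say "17".
answers = ["42", "42", "42", "17", "17"]
confidences = [1.0, 1.0, 1.0, 2.0, 2.0]
print(weighted_majority_vote(answers, confidences))
```

Plain majority voting would pick "42" here (three votes to two), while the weighted totals are 3.0 against 4.0 in favor of "17".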

Confidence Filtering

Low-confidence traces are removed entirely before voting.
  • Top 10% filter (η = 10%): keeps only the most confident 10% of traces. High accuracy from few traces, but occasionally vulnerable to overconfident wrong answers
  • Top 90% filter (η = 90%): removes only the lowest 10%. Conservative but stable
In the experiments, Top 90% is the more stable option, while Top 10% reaches the highest accuracy but occasionally degrades.
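
A sketch of the filtering step, assuming traces are ranked by a single trace-level confidence and the top η percent are kept. The `(answer, confidence)` tuple layout and the rounding rule (floor, keep at least one) are my assumptions.

```python
def keep_top_percent(traces, eta: int):
    """Keep the `eta` percent of traces with the highest confidence.

    traces: list of (answer, confidence) pairs.
    eta:    integer percentage in (0, 100].
    """
    if not traces:
        raise ValueError("no traces to filter")
    if not 0 < eta <= 100:
        raise ValueError("eta must be in (0, 100]")
    k = max(1, len(traces) * eta // 100)
    ranked = sorted(traces, key=lambda trace: trace[1], reverse=True)
    return ranked[:k]


# Ten hypothetical traces with confidences 1.0 through 10.0.
traces = [(str(i), float(i)) for i in range(1, 11)]
print(keep_top_percent(traces, 10))
print(len(keep_top_percent(traces, 90)))
```

The surviving traces can then go through plain or confidence-weighted voting.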

Online Thinking with Confidence

This mode checks confidence during generation and terminates low-quality traces early, which cuts unnecessary token generation at the source.

DeepConf-low vs DeepConf-high

| Aspect | DeepConf-low | DeepConf-high |
| --- | --- | --- |
| Filtering ratio η | 10% (keep top 10%) | 90% (keep top 90%) |
| Stopping threshold | High (strict) | Low (lenient) |
| Token savings | High (43~84%) | Moderate (16~59%) |
| Accuracy stability | May occasionally drop by 1~2 percentage points | Nearly identical to majority voting |
| Best suited for | Efficiency first | Accuracy first |
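
One way to see why η = 10% gives a strict threshold and η = 90% a lenient one is to derive the threshold from a set of reference confidences. The sketch below assumes such a set exists (for example, a few traces generated to completion first) and picks the value that the top η percent of them meet or exceed. This calibration step is my assumption and is not described in this review.

```python
def stopping_threshold(reference_confs, eta: int):
    """Confidence that the top `eta` percent of reference traces meet or exceed."""
    if not reference_confs:
        raise ValueError("need at least one reference confidence")
    if not 0 < eta <= 100:
        raise ValueError("eta must be in (0, 100]")
    ranked = sorted(reference_confs, reverse=True)
    k = max(1, len(ranked) * eta // 100)
    return ranked[k - 1]


# Ten hypothetical reference confidences: 1.0 through 10.0.
reference = [float(x) for x in range(1, 11)]
strict = stopping_threshold(reference, eta=10)    # DeepConf-low style
lenient = stopping_threshold(reference, eta=90)   # DeepConf-high style
print(strict, lenient)
```

A trace whose confidence falls below the chosen threshold would be stopped, so the stricter threshold stops more traces and saves more tokens.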

Adaptive Sampling

The number of traces generated is adjusted dynamically according to problem difficulty.

    # Estimate difficulty from the consensus ratio: V(â) / Σ_a V(a)
    # votes maps each answer a to its vote total V(a)
    consensus = max(votes.values()) / sum(votes.values())

    # Easy problem: high consensus from few traces → stop early
    # Hard problem: low consensus → keep generating up to the budget
    if consensus >= tau:  # tau = 0.95 by default
        stop_generation()
    else:
        continue_generating()

Thanks to this mechanism, easy problems need only a few traces, and compute can be concentrated on hard problems.
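
A runnable sketch of the same loop, under assumptions: traces arrive one at a time as `(answer, confidence)` pairs, votes are confidence-weighted, and `min_traces` is a hypothetical guard so that a single trace (which trivially has a consensus of 1.0) does not end sampling immediately. The default budget of 512 is also only illustrative.

```python
from collections import defaultdict


def adaptive_sample(trace_stream, tau=0.95, budget=512, min_traces=4):
    """Consume traces until consensus >= tau or the budget is spent.

    Returns (best_answer, traces_used).
    """
    votes = defaultdict(float)
    used = 0
    for answer, conf in trace_stream:
        votes[answer] += conf
        used += 1
        consensus = max(votes.values()) / sum(votes.values())
        if used >= budget or (used >= min_traces and consensus >= tau):
            break
    if not votes:
        raise ValueError("trace_stream produced no traces")
    return max(votes, key=votes.get), used


# Easy problem: every trace agrees, so sampling stops at the minimum.
easy = [("7", 1.0)] * 100
# Hard problem: two answers keep splitting the vote, so the budget runs out.
hard = [("1", 1.0), ("2", 1.0)] * 10

print(adaptive_sample(easy))
print(adaptive_sample(hard, budget=10)[1])
```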

Why Is Lowest Group Confidence Suited to the Online Scenario?

To judge a trace in real time during generation, we cannot wait for the whole trace. Lowest Group Confidence only requires the minimum confidence over the groups generated so far, so it can be computed in a streaming fashion.

    Trace generation progress:
    [Group 1: C=0.85] [Group 2: C=0.72] [Group 3: C=0.41] ← below the threshold s = 0.55 → stop immediately

    Lowest Group Conf = min(0.85, 0.72, 0.41) = 0.41 < s
    → This trace wavered severely at least once in its reasoning
    → Not worth waiting for it to finish
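
The streaming check can be sketched as below. This is an illustration under assumptions rather than the paper's implementation: token confidences arrive one at a time, a group is the most recent `window` tokens, and only full windows are compared against the threshold. The tiny window and the 0/1 confidences are chosen so the arithmetic is easy to follow.

```python
from collections import deque


def generate_with_early_stop(token_conf_stream, window, threshold):
    """Consume token confidences and stop once a full window's mean drops below threshold.

    Returns (tokens_consumed, stopped_early).
    """
    if window <= 0:
        raise ValueError("window must be positive")
    recent = deque(maxlen=window)
    running = 0.0
    consumed = 0
    for conf in token_conf_stream:
        if len(recent) == window:
            running -= recent[0]  # this value is evicted by the append below
        recent.append(conf)
        running += conf
        consumed += 1
        if len(recent) == window and running / window < threshold:
            return consumed, True
    return consumed, False


# Confident for four tokens, then the confidence collapses.
collapsing = [1.0] * 4 + [0.0] * 4
steady = [1.0] * 8

print(generate_with_early_stop(collapsing, window=4, threshold=0.55))
print(generate_with_early_stop(steady, window=4, threshold=0.55))
```

In the collapsing trace the window mean goes 1.0, 0.75, 0.5, so generation stops at the sixth token and the last two tokens are never produced.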