LLM Fine-Tuning Strategies

Efficient fine-tuning strategies for large language models (LLMs) and how to apply them in resource-constrained environments.

1. LLM Fine-tuning Approaches

1.1 Custom Classifier

  • When to use: a sufficient dataset, optimizing performance on a specific task, large-batch inference

  • Characteristics: a task-specific model; all parameters are updated (full fine-tuning)

  • Advantages: high task performance, stable inference
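The custom-classifier route puts a task-specific head on top of a (frozen or fine-tuned) encoder. As a rough, library-free sketch of that head, where all dimensions and weight values are made up for illustration:

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def classify(pooled_embedding, weight, bias):
    # Task-specific linear head: logits[j] = emb . weight[j] + bias[j],
    # where weight holds one row of coefficients per class.
    logits = [
        sum(e * w for e, w in zip(pooled_embedding, col)) + b
        for col, b in zip(weight, bias)
    ]
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__), probs
```

In a real setup the pooled embedding would come from the encoder's final hidden state, and the head would be trained jointly with (or on top of) it.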

1.2 Prompt Template

  • When to use: limited data, rapid prototyping, zero/few-shot learning, structured output

  • Characteristics: predefined templates; requires prompt engineering

  • Advantages: usable with little data, flexible control
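A predefined template plus a handful of examples is all a few-shot prompt needs. A minimal sketch (the template text and examples are invented for illustration):

```python
def build_prompt(template, examples, query):
    # Fill a predefined template with few-shot examples, then append
    # the query with an empty output slot for the model to complete.
    shots = "\n".join(
        template.format(input=x, output=y) for x, y in examples
    )
    return shots + "\n" + template.format(input=query, output="")

TEMPLATE = "Review: {input}\nSentiment: {output}"
prompt = build_prompt(
    TEMPLATE,
    [("great movie", "positive"), ("waste of time", "negative")],
    "loved every minute",
)
```

The same builder covers zero-shot (empty example list) and structured output (a template whose output slot is, say, a JSON skeleton).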

2. Fine-tuning Techniques in Detail

2.1 Parameter-Efficient Fine-Tuning (PEFT)

  • LoRA (Low-Rank Adaptation): freezes the pretrained weights and trains small low-rank update matrices injected into selected layers, so only a small fraction of parameters is updated

  • Adapters: keep the model's parameters frozen and train small task-specific layers added to the network

  • Prefix Tuning: prepends trained prompt vectors to the input to control a Prompt Template's response format without updating the base weights
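The LoRA forward pass above can be sketched in plain Python: the base weight W stays frozen, and only the two small matrices A (down-projection) and B (up-projection) would be trained. All matrix shapes and values here are toy examples.

```python
def matmul(A, B):
    # Plain-Python matrix multiply: (n x k) @ (k x m) -> (n x m).
    return [
        [sum(A[i][t] * B[t][j] for t in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]

def lora_forward(x, W, A, B, alpha, r):
    # y = x @ W + (alpha / r) * (x @ A @ B)
    # W (d x d) is frozen; A (d x r) and B (r x d) are the trainable
    # low-rank pair, with B initialized to zero so training starts
    # exactly at the pretrained behavior.
    scale = alpha / r
    base = matmul(x, W)
    low_rank = matmul(matmul(x, A), B)
    return [
        [b + scale * l for b, l in zip(row_b, row_l)]
        for row_b, row_l in zip(base, low_rank)
    ]
```

With rank r much smaller than d, the trainable parameter count drops from d*d to 2*d*r per adapted layer.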

2.2 Model Compression and Transfer Learning

  • Knowledge Distillation: transfers knowledge from a large Teacher model to a small Student model, training a lightweight model

  • QLoRA: quantizes the frozen base model to 4-bit while training LoRA adapters, sharply cutting GPU memory use during fine-tuning
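The core of distillation is training the Student against the Teacher's temperature-softened output distribution rather than hard labels. A minimal sketch of that soft-target loss (following the standard Hinton-style formulation; the logit values in the test are arbitrary):

```python
import math

def softmax_T(logits, T):
    # Temperature-softened probabilities; a higher T spreads mass onto
    # non-argmax classes, exposing the Teacher's "dark knowledge".
    m = max(logits)
    exps = [math.exp((x - m) / T) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def distillation_loss(student_logits, teacher_logits, T=2.0):
    # KL(teacher || student) on softened distributions, scaled by T^2
    # so gradient magnitudes stay comparable across temperatures.
    p = softmax_T(teacher_logits, T)
    q = softmax_T(student_logits, T)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return T * T * kl
```

In practice this term is mixed with the ordinary cross-entropy on ground-truth labels via a weighting coefficient.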

3. Fine-tuning Environment and Considerations

3.1 Tokenizer and Special Tokens

  • Special Tokens: eos, bos, pad, sep, etc. help the model track sentence structure

  • Tokenizer: rules differ between models; the tokenizer (and the embedding table) may need updating when fine-tuning
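The mechanics behind those special tokens can be shown without any tokenizer library: wrap each sequence of token ids with bos/eos, right-pad to a fixed length, and produce the matching attention mask. The token ids here are hypothetical; real values depend on the model's tokenizer.

```python
# Hypothetical special-token ids -- real values differ per tokenizer.
BOS, EOS, PAD = 1, 2, 0

def prepare_batch(sequences, max_len):
    # Wrap each token-id sequence with bos/eos, truncate if needed,
    # then right-pad to max_len. The attention mask marks real
    # tokens (1) versus padding (0).
    batch, masks = [], []
    for ids in sequences:
        ids = [BOS] + ids[: max_len - 2] + [EOS]
        masks.append([1] * len(ids) + [0] * (max_len - len(ids)))
        batch.append(ids + [PAD] * (max_len - len(ids)))
    return batch, masks
```

This is why a mismatched or un-updated tokenizer breaks fine-tuning: if the model was pretrained with different special-token ids or padding conventions, the mask and structure signals no longer line up.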

3.2 Fine-tuning Strategy and Optimization

  • ํ•˜๋“œ์›จ์–ด ์ œ์•ฝ: RTX 3090 (70B LLM qLoRA), 7B SLM (๋งˆ์ง€๋ง‰ 2-3 Layer unfreeze)

  • ์•™์ƒ๋ธ”: Short, Mid, Long Sample Ensemble๋กœ ์ตœ์ข… ๊ฒฐ๊ณผ ํ–ฅ์ƒ

  • ์ตœ์ ํ™”: Batch Size ์กฐ์ •, Mixed Precision ํ™œ์šฉ (GPU ๋ฉ”๋ชจ๋ฆฌ ์ตœ์ ํ™”)
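When the batch size must shrink to fit VRAM, gradient accumulation recovers the larger effective batch. A simplified simulation of the bookkeeping (gradients are stand-in scalars here, not real tensors):

```python
def train_with_accumulation(grads_per_microbatch, accum_steps):
    # Average micro-batch gradients and take one optimizer step every
    # accum_steps micro-batches, so a small per-step batch behaves
    # like a batch accum_steps times larger.
    steps, buffer = [], 0.0
    for i, g in enumerate(grads_per_microbatch, 1):
        buffer += g / accum_steps  # scale so the sum is an average
        if i % accum_steps == 0:
            steps.append(buffer)   # optimizer.step() would run here
            buffer = 0.0
    return steps
```

In a real training loop the same pattern appears as calling `backward()` every micro-batch but `optimizer.step()` only every `accum_steps` iterations.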

3.3 PyTorch Lightning Updates

  • Hook rename: validation_epoch_end -> on_validation_epoch_end (PyTorch Lightning 2.0)

  • Purpose: clearer, event-driven hooks tied to explicit lifecycle stages
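The practical difference is that the renamed hook no longer receives the step outputs as an argument: each validation_step stores its results on the module, and on_validation_epoch_end reduces them. A minimal library-free stand-in illustrating that contract (not actual Lightning code; the "loss" is a placeholder):

```python
class LitModel:
    # Stand-in for a LightningModule showing the 2.x hook style.
    def __init__(self):
        self.val_losses = []
        self.epoch_loss = None

    def validation_step(self, batch):
        loss = sum(batch) / len(batch)  # placeholder "loss" per batch
        self.val_losses.append(loss)    # store state for the epoch hook

    def on_validation_epoch_end(self):
        # No `outputs` argument anymore: read what the steps stored.
        self.epoch_loss = sum(self.val_losses) / len(self.val_losses)
        self.val_losses.clear()
```

Under the old API, validation_epoch_end(outputs) received the collected step results directly; migrating means moving that collection into module state as above.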

3.4 Why a 70B Model Does Not Fit on an RTX 3060 (12GB VRAM)

  • Insufficient memory: a 70B model needs at least 40GB of VRAM

  • Alternatives:

    • LoRA/QLoRA: memory savings (can shrink usage to around 12-15GB)

    • CPU offload + GPU sharding: via Accelerate (with a performance penalty)

    • DeepSpeed ZeRO / tensor parallelism: limited benefit on a single GPU

    • Cloud: high-end GPU instances (A100, H100) on AWS EC2, Google Cloud, etc.
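The 40GB floor follows from simple arithmetic on weight storage alone (activations, KV cache, and optimizer state come on top):

```python
def weight_vram_gb(n_params, bits_per_param):
    # Rough VRAM needed just to hold the weights: params * bytes-per-param.
    # Excludes activations, KV cache, gradients, and optimizer state.
    return n_params * bits_per_param / 8 / 1e9

# A 70B model: ~140 GB in fp16, and still ~35 GB even at 4-bit --
# both far beyond an RTX 3060's 12 GB.
fp16_gb = weight_vram_gb(70e9, 16)
int4_gb = weight_vram_gb(70e9, 4)
```

This is also why the LoRA/QLoRA savings quoted above reach the 12-15GB range only for much smaller base models: a 70B base simply cannot be stored on 12GB at any usable precision.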

Detailed Techniques and Use Cases

| Technique | When to use | Characteristics |
| --- | --- | --- |
| LoRA | Only some layers of a Custom Classifier need fine-tuning | Updates only small low-rank sub-layers of the existing model, enabling efficient fine-tuning. |
| Knowledge Distillation | Knowledge must be transferred from a large Teacher model to a small Student model | The Student mimics the Teacher's performance, training a lightweight model. |
| Adapters | Model parameters must stay frozen while only task-specific layers are added | Keeps the existing weights and trains small added layers to adapt to a new task. |
| Prefix Tuning | The response format of a Prompt Template needs stronger enforcement and control | Prepends a trained prompt to the model input, controlling the output format without full fine-tuning. |