
BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding - Understanding BERT from the Ground Up

2025. 2. 24. 20:42

While everyone else is reading DeepSeek, I'm only now getting around to reading and summarizing BERT.

https://arxiv.org/abs/1810.04805

 


 

NLP๊ณ„์˜ ์กฐ์ƒ๋‹˜, ํ˜์‹ , ์–ด์ฉŒ๊ณ ์˜€๋˜ "๊ทธ" ๋…ผ๋ฌธ

๋งˆ์นจ ๋ฐœํ‘œํ•  ๊ธฐํšŒ๊ฐ€ ์ฐพ์•„ ์™€์„œ ์ฝ๊ณ  ์ •๋ฆฌํ•ด๋ณด์•˜์Šต๋‹ˆ๋‹ค.

 

Abstract

What is BERT?

Bidirectional Encoder Representations from Transformers

Literally: a bidirectional encoder representation based on the Transformer

 

Label ๋˜์ง€ ์•Š์€ text๋ฅผ ๋ชจ๋“  layer์˜ ์™ผ์ชฝ, ์˜ค๋ฅธ์ชฝ ๋ฌธ๋งฅ ๋ชจ๋‘์—์„œ ๊ณต๋™์œผ๋กœ(jointly) conditioning* 

* conditioning: ์–ด๋–ค ์ •๋ณด๋ฅผ ์ž…๋ ฅ(์กฐ๊ฑด)์œผ๋กœ ์‚ฌ์šฉํ•˜์—ฌ ๋ชจ๋ธ์„ ํ•™์Šตํ•˜๊ฑฐ๋‚˜ ์˜ˆ์ธก์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๊ฒƒ

→ ๋ฌธ์žฅ์˜ ์™ผ์ชฝ๊ณผ ์˜ค๋ฅธ์ชฝ ์ •๋ณด๋ฅผ ๋™์‹œ์— ๊ณ ๋ คํ•˜์—ฌ ๋ชจ๋ธ์ด ํ•™์Šต

 

A pre-trained BERT model can be fine-tuned with just a single additional output layer → producing models that handle a wide range of tasks (e.g. question answering, language inference, etc.)

 

Introduction

BERT๊ฐ€ ๋ญ”๋ฐ?; Language model pre-training

์‚ฌ์ „ ํ•™์Šต๋œ ์–ธ์–ด ๋ชจ๋ธ๋“ค์€ ๋Œ€ํ‘œ์ ์œผ๋กœ ๋‹ค์Œ๊ณผ ๊ฐ™์€ ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ ๋ฌธ์ œ(downstream task)๋ฅผ ํ•ด๊ฒฐํ•ด์™”๋‹ค.

๋ฌธ์žฅ ๋‹จ์œ„๋กœ ํ•ด์„ํ•˜๋ฉด sentence-level task, ๋ฌธ์žฅ์„ ์ชผ๊ฐœ์–ด ํ•ด์„ํ•˜๋ฉด token-level task๋ผ๊ณ  ํ•œ๋‹ค.

 

โœ”๏ธ Sentence-level task(์ž์—ฐ์–ด ์ถ”๋ก , paraphrasing): ๋ฌธ์žฅ ์ „์ฒด์˜ ์˜๋ฏธ๋ฅผ ์ดํ•ดํ•˜๊ณ , ๋ฌธ์žฅ ๊ฐ„ ๊ด€๊ณ„๋ฅผ ํŒ๋‹จํ•˜๋Š” task

- ์ž์—ฐ์–ด ์ถ”๋ก (Natural language inference): ์ฃผ์–ด์ง„ ๋‘ ๋ฌธ์žฅ์ด ํฌํ•จ(entailment) / ๋ชจ์ˆœ(contradiction) / ์ค‘๋ฆฝ(neutral) ๊ด€๊ณ„์ธ์ง€ ๋ถ„๋ฅ˜
  e.g. ํ•˜๋Š˜์€ ์ฒญ๋ช…ํ•˜๋‹ค. / ๋น„๊ฐ€ ์˜ค์ง€ ์•Š๋Š”๋‹ค. → ํฌํ•จ ๊ด€๊ณ„

 

- ๋ฌธ์žฅ ์œ ์‚ฌ๋„ ํŒ๋ณ„(paraphrasing): ๋‘ ๋ฌธ์žฅ์ด ๊ฐ™์€ ์˜๋ฏธ๋ฅผ ๊ฐ–๊ณ  ์žˆ๋Š”์ง€ ํŒ๋‹จํ•˜๋Š” task
  e.g. ๊ทธ๋Š” ์ฐจ๋ฅผ ์ƒˆ๋กœ ์ƒ€๋‹ค. / ๊ทธ ์ž๋™์ฐจ๋Š” ๊ทธ์— ์˜ํ•ด ๊ตฌ๋งค๋˜์—ˆ๋‹ค. → paraphrase

 

- ๊ฐ์ • ๋ถ„์„(Sentiment analysis): ๋ฌธ์žฅ์ด ๊ธ์ •์ ์ธ์ง€, ๋ถ€์ •์ ์ธ์ง€ ๋ถ„๋ฅ˜ํ•˜๋Š” task

 

โœ”๏ธ Token-level task(named-entity recognition, question-answering): ๋ฌธ์žฅ ๋‚ด ๊ฐœ๋ณ„ ๋‹จ์–ด๋‚˜ ๊ตฌ(phrase)์˜ ์˜๋ฏธ๋ฅผ ์˜ˆ์ธกํ•˜๋Š” task

- Named-Entity Recognition(NER, ๊ฐœ์ฒด๋ช… ์ธ์‹): ๋ฌธ์žฅ์—์„œ ์‚ฌ๋žŒ, ์žฅ์†Œ, ์กฐ์ง๋ช… ๋“ฑ ํƒœ๊น…ํ•˜๋Š” ํƒœ์Šคํฌ

  e.g. ์• ํ”Œ(Apple)์€ ์Šคํ‹ฐ๋ธŒ ์žก์Šค์— ์˜ํ•ด ์บ˜๋ฆฌํฌ๋‹ˆ์•„์—์„œ ์„ค๋ฆฝ๋˜์—ˆ๋‹ค.

  • Apple → organization
  • Steve Jobs → person
  • California → location

- Question Answering (QA): finding the part of a passage that answers a given question

  e.g.

Passage: Apple was founded by Steve Jobs in California.

Q. Who founded Apple?

→ A. Steve Jobs

 

- Part-of-Speech Tagging (POS): assigning a part of speech to each word in a sentence

 

How was it done before?; Existing strategies for applying pre-trained language representations to downstream tasks*

* downstream task: the problem we actually want to solve with NLP

Pre-trained language models have evolved in various ways to handle the downstream tasks introduced above,

and depending on how the model is trained, they can be broadly split into the feature-based approach and the fine-tuning based approach.

 

โœ”๏ธ Feature-based approach

- ์‚ฌ์ „ํ•™์Šต๋œ representation์„ ์ถ”๊ฐ€์  feature๋กœ ํฌํ•จํ•˜๋Š” task๋ณ„ ๊ตฌ์กฐ ์‚ฌ์šฉ

- ์ฆ‰, ์‚ฌ์ „ํ•™์Šต๋œ ํ‘œํ˜„ → task ๋ณ„ ๋ชจ๋ธ์— ๋ถ™์ด๋Š” ๋ฐฉ์‹, ๋”ฐ๋ผ์„œ task ๋ณ„ ๋ชจ๋ธ ์ถ”๊ฐ€์ ์œผ๋กœ ํ•„์š”

- ์‚ฌ์ „ํ•™์Šต๋œ ๋ชจ๋ธ ์ž์ฒด๋Š” ๊ฑด๋“ค์ง€ ์•Š๊ณ , ํ•ด๋‹น ๋ชจ๋ธ์—์„œ ๋‚˜์˜จ ํ‘œํ˜„๋งŒ ์‚ฌ์šฉ

๋‹ค์‹œ ๋งํ•˜์ž๋ฉด ์–˜๋Š” pre-trained model(ELMo, …) ๋“ฑ์„ ๊ทธ๋ƒฅ ๋‹จ์ˆœํžˆ feature extractor๋กœ๋งŒ ์‚ฌ์šฉ → task ์ฒ˜๋ฆฌ๋ฅผ ์œ„ํ•œ ๋ชจ๋ธ์€ ๋ณ„๋„๋กœ ์ •์˜

 

โœ”๏ธ Fine-tuning based approach

- ์ตœ์†Œํ•œ์˜ task๋ณ„ ํŒŒ๋ผ๋ฏธํ„ฐ ๋„์ž…,๋ชจ๋“  ํŒŒ๋ผ๋ฏธํ„ฐ fine-tuningํ•˜์—ฌ task์—์„œ ํ•™์Šต

- ์ฆ‰, ์‚ฌ์ „ํ•™์Šต๋œ ๋ชจ๋ธ ์ž์ฒด๋ฅผ task๋ณ„๋กœ fine-tuning์„ ์ง„ํ–‰ํ•˜์—ฌ ํ•™์Šตํ•˜๋Š” ๋ฐฉ์‹

- ์‚ฌ์ „ ํ•™์Šต๋œ ๋ชจ๋ธ์„ fine-tuning → ๊ทธ ๋ชจ๋ธ๋กœ task ํ•™์Šต ์ง„ํ–‰

์–˜๋Š” pre-trained model ์ž์ฒด๋ฅผ ํ•™์Šตํ•ด์„œ task๊นŒ์ง€ ์ฒ˜๋ฆฌํ•จ. BERT ์—ญ์‹œ fine-tuning based approach

→ ๋‘˜ ๋‹ค ๊ฒฐ๊ตญ์€ pre-training ๊ณผ์ •๊นŒ์ง€๋Š” ๋™์ผํ•œ ๋ชฉ์  ์ˆ˜ํ–‰(๋ฌธ์žฅ์œผ๋กœ๋ถ€ํ„ฐ representation ์ƒ์„ฑ)

 

๊ธฐ์กด ๋ฐฉ์‹๋“ค์˜ ํŠน์ง•:

- ์ด๋Ÿฌํ•œ ๋ฐฉ์‹๋“ค์€ pre-trained representation ๋Šฅ๋ ฅ ์ œํ•œ(ํŠนํžˆ fine-tuning ๊ธฐ๋ฐ˜ ๋ฐฉ๋ฒ•๋ก ๋“ค)

- ๊ทธ representation์„ ์–ป์„ ๋•Œ ๋‹จ๋ฐฉํ–ฅ ์–ธ์–ด ๋ชจ๋ธ(unidirectional language model) ์‚ฌ์šฉ

 

So why was BERT proposed?; Limitations of existing pre-trained language models

✔️ Unidirectional training

→ restricts the choice of architectures that can be used during pre-training.

- Models like OpenAI GPT use a left-to-right architecture → every token can only attend to the tokens before it (causal masking)

- Because unidirectional models have a hard time capturing full context, they cannot produce optimal results on sentence-level tasks, or when a fine-tuning based model is applied to token-level tasks

OpenAI GPT: an auto-regressive model*

* auto-regressive model: a model whose output depends linearly on its own previous values plus a stochastic term

The cat sat on the mat.

 

    → GPT์—์„œ 'sat' ์ฒ˜๋ฆฌ(์ถ”๋ก ) ์‹œ, ‘The’, ‘cat’ ๋งŒ ์ฐธ๊ณ  ๊ฐ€๋Šฅ, ์ดํ›„ ‘on’, ‘the’, ‘mat’์€ ๋ณผ ์ˆ˜ ์—†์Œ. ์ดํ›„ ๋‹จ์–ด๋ฅผ ๋ณด๋ ค ํ•˜๋ฉด ์ •๋‹ต์ง€(’sat’)๋ฅผ ๋ณด๊ฒŒ ๋  ์ˆ˜ ์žˆ์Œ.

๋ญ๊ฐ€ ์ƒˆ๋กญ์ง€?; Contribution

โœ”๏ธ ์–ธ์–ด ํ‘œํ˜„(language representation)์„ ์œ„ํ•œ ์–‘๋ฐฉํ–ฅ ์‚ฌ์ „ ํ•™์Šต ๋ชจ๋ธ ์ œ์•ˆ

- ๋งˆ์Šคํ‚น ์ฒ˜๋ฆฌ๋œ ์–ธ์–ด ๋ชจ๋ธ ์‚ฌ์šฉ → ์ขŒ์šฐ ๋ฌธ๋งฅ ๋™์‹œ์— ํŒŒ์•… ๋” ๊นŠ์€ ๋ฌธ๋งฅ ํŒŒ์•…์„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ํ•˜๋Š” ์–‘๋ฐฉํ–ฅ ์‹ฌ์ธต ํ‘œํ˜„ ํ•™์Šต(deep bidirectional representation) ๊ฐ€๋Šฅ

 

โœ”๏ธ ์‚ฌ์ „ ํ•™์Šต๋œ ํ‘œํ˜„์€ ๋งŽ์€ task๋ณ„ architecture์˜ ํ•„์š”์„ฑ์„ ์ค„์—ฌ์ค„ ์ˆ˜ ์žˆ์Œ → ๊ตณ์ด representation learning์— ์ƒˆ๋กœ์šด architecture๋ฅผ ์„ค๊ณ„ํ•˜์ง€ ์•Š์•„๋„ ๋จ

- ์˜ˆ์ „์—๋Š” ๊ฐ NLP ํƒœ์Šคํฌ๋งˆ๋‹ค ๋งž์ถคํ˜• ๋ชจ๋ธ์ด ํ•„์š”ํ–ˆ๋‹ค.

- Task-specificํ•œ/๋ณต์žกํ•œ ๊ตฌ์กฐ ์—†์ด BERT๋ฅผ ํ†ตํ•ด ์ข‹์€ ์„ฑ๋Šฅ ๋‚ผ ์ˆ˜ ์žˆ์Œ

 

 

BERT is: [a fine-tuning based, pre-trained language model]

✔️ Uses MLM (Masked Language Model) to overcome the limitation of unidirectional models

- Randomly masks some of the tokens in the input sentence

- The objective is to predict the original vocabulary id of each masked word using only the context of the sentence

- Compared with unidirectional pre-training, MLM fuses left and right context, making it possible to pre-train a deep bidirectional Transformer

 

โœ”๏ธ NSP(Next Senetence Prediction; ๋‹ค์Œ ๋ฌธ์žฅ ์˜ˆ์ธก) ์ˆ˜ํ–‰

- ์ด์–ด์ง€๋Š” ๋ฌธ์žฅ์Œ์„ ํ•˜๋‚˜์˜ ์ž…๋ ฅ ๋ฌธ์žฅ/์‹œํ€€์Šค๋กœ ์‚ฌ์šฉํ•˜์—ฌ ์‚ฌ์ „ ํ•™์Šต(pre-training) ์ง„ํ–‰

 

 

Model Architecture

BERT์˜ ๋ชจ๋ธ ๊ตฌ์กฐ: multi-layer bidirectional Transfomer encoder → ์‹ค์ œ Transformer์˜ encoder์™€ ๊ฑฐ์˜ ์œ ์‚ฌํ•œ ๊ตฌ์กฐ๋ฅผ ์‚ฌ์šฉํ•จ

 

์‹คํ—˜์„ ์œ„ํ•ด ์‚ฌ์šฉํ•œ ๋ชจ๋ธ๋กœ๋Š” BERTBASE, BERTLARGE

โœ”๏ธ BERTBASE

- L (encoder layer): 12

- H (hidden size(ํ† ํฐ๋‹น ํ•™์Šต๋˜๋Š” ์ฐจ์›์˜ ๊ธธ์ด)): 768

- ์ตœ๋Œ€ 512๊ฐœ ํ† ํฐ๊นŒ์ง€ ์ž…๋ ฅ ๊ฐ€๋Šฅ ์ตœ๋Œ€ ์ž…๋ ฅ ๋ฐ์ดํ„ฐ์˜ ํฌ๊ธฐ๋Š” (512, 768)

- A (Self-attention head): 12

- feed-forward/filter size to be 4H (3072 for the H=768)

 

โœ”๏ธ BERTLARGE

- L: 24

- H: 1024

- A: 16

- feed-forward/filter size: 4H (4096 for H=1024)
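For reference, these hyperparameters map directly onto the configuration object of the Hugging Face transformers library (this mapping is my own addition for illustration; the parameter names below are that library's, not the paper's):

```python
from transformers import BertConfig

# BERT_BASE: L=12, H=768, A=12, feed-forward size 4H=3072
base_config = BertConfig(
    num_hidden_layers=12,
    hidden_size=768,
    num_attention_heads=12,
    intermediate_size=3072,
    max_position_embeddings=512,  # up to 512 input tokens
    vocab_size=30522,             # WordPiece vocabulary of ~30k tokens
)

# BERT_LARGE: L=24, H=1024, A=16, feed-forward size 4H=4096
large_config = BertConfig(
    num_hidden_layers=24,
    hidden_size=1024,
    num_attention_heads=16,
    intermediate_size=4096,
    max_position_embeddings=512,
    vocab_size=30522,
)
```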

 

BERT

Training stages: pre-training & fine-tuning.

The pre-training stage learns general-purpose language patterns, and the fine-tuning stage then solves the actual NLP downstream task. As an analogy, pre-training is like compiling a general dictionary, and fine-tuning builds a specialized glossary of technical terms on top of that dictionary.

 

โœ”๏ธ Pre-training: label ๋˜์ง€ ์•Š์€ ๋ฐ์ดํ„ฐ๋กœ pre-trainned task(MLM / NSP) ์ˆ˜ํ–‰ํ•˜๋ฉฐ represenation ํ•™์Šต

- ์ฆ‰, ์‚ฌ๋žŒ์ด ์ •๋‹ต(๋ผ๋ฒจ)์„ ์ œ๊ณตํ•˜์ง€ ์•Š์€ ๋ฌธ์žฅ๋“ค์„ ํ•™์Šต(self-supervised learning ๊ด€์  → ๋ชจ๋ธ์ด label ์Šค์Šค๋กœ ์ƒ์„ฑ)

 

โœ”๏ธ Fine-tuning: ์‚ฌ์ „ ํ•™์Šต๋œ parameter๋กœ ์ดˆ๊ธฐํ™” → downstream task์˜ labeled data ํ™œ์šฉํ•˜์—ฌ parameter๋“ค fine-tuning ์ˆ˜ํ–‰(supervised learning์˜ ์˜์—ญ)

 

BERT์˜ ๊ฐ€์žฅ ํฐ ํŠน์ง•์€ task์— ๊ด€๊ณ„ ์—†์ด ๋‹จ์ผ ๊ตฌ์กฐ๋ฅผ ์‚ฌ์šฉํ•œ๋‹ค๋Š” ์ ์ด๋‹ค.

- ์‚ฌ์ „ ํ•™์Šต์— ์‚ฌ์šฉํ•˜๋Š” ๋ชจ๋ธ ๊ตฌ์กฐ์™€ ๋ฏธ์„ธ ์กฐ์ •์„ ์œ„ํ•ด ์„ค๊ณ„๋œ ๋ชจ๋ธ์—๋Š” ๊ฑฐ์˜ ์ฐจ์ด ์กด์žฌํ•˜์ง€ ์•Š์Œ. Fine-tuning ๋‹จ๊ณ„์—์„œ task์— ๋งž๋Š” layer(e.g. ๋ฌธ์žฅ ๋ถ„๋ฅ˜๋ฅผ ์œ„ํ•œ ๋ถ„๋ฅ˜๊ธฐ) ์ถ”๊ฐ€ํ•˜์—ฌ ํ•™์Šต ์ง„ํ–‰ํ•จ
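As a concrete (hedged) example of how little changes between the two stages, the sketch below uses the Hugging Face transformers library, which is not part of the paper but implements exactly this pattern: the pre-trained encoder is reused as-is, and only a small classification head on top of [CLS] is new.

```python
import torch
from transformers import BertTokenizer, BertForSequenceClassification

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Same pre-trained body; only a single linear classification layer is added for the task.
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("The movie was surprisingly good.", return_tensors="pt")
labels = torch.tensor([1])  # e.g. 1 = positive sentiment

outputs = model(**inputs, labels=labels)
outputs.loss.backward()     # fine-tuning updates *all* parameters end-to-end
```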

 

Input/Output Representations

โœ”๏ธ ์ž…๋ ฅ ํ‘œํ˜„(input representation)

- ์ž…๋ ฅ ์‹œํ€€์Šค๋Š” ๋‹จ์ผ ๋ฌธ์žฅ, ๋ฌธ์žฅ ์Œ(์งˆ์˜์‘๋‹ต, ์ž์—ฐ์–ด ์ถ”๋ก  ๋“ฑ์— ์‚ฌ์šฉ) ์ „๋ถ€ ์ปค๋ฒ„ ๊ฐ€๋Šฅํ•˜๋‹ค.

- ๊ฐ ๋ฌธ์žฅ๋“ค์„ ํ† ํฐํ™”ํ•˜์—ฌ ํ‘œํ˜„ํ•œ๋‹ค. ์ด๋•Œ, ํ† ํฐํ™”์—๋Š” WordPiece embedding์„ ์‚ฌ์šฉํ•œ๋‹ค.

    ใƒป WordPiece๋Š” 3๋งŒ์—ฌ๊ฐœ์˜ token vocabulary๋กœ ๊ตฌ์„ฑ๋˜์–ด์žˆ๋‹ค.

    ใƒป ์ž…๋ ฅ ์‹œํ€€์Šค๋“ค์€ WordPiece ํ† ํฌ๋‚˜์ด์ €๋ฅผ ๊ธฐ์ค€์œผ๋กœ ํ† ํฐํ™”๋ฅผ ํ•˜๋˜, ํ† ํฐ์€ subword๋ฅผ ๊ธฐ์ค€(subword-level)์œผ๋กœ ์ง„ํ–‰๋œ๋‹ค.

    e.g. "I love cats" → ["I", "love", "cats"], "unhappiness" → ["un", "##happiness"], "electroencephalography" → ["electro", "##ence", "##phal", "##ography"]

์ดํ›„ ๊ฐ ํ† ํฐ๋ณ„๋กœ ‘Token embeddings’ + ‘Segment Embeddings’ + ‘Position Embeddings’๋ฅผ ์—ฐ์‚ฐ ํ›„ concatํ•˜์—ฌ ์ตœ์ข… input embedding ๋˜๋Š” representation ๊ตฌ์„ฑํ•œ๋‹ค.
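A rough sketch of what this looks like in practice, using the Hugging Face WordPiece tokenizer as a stand-in (my assumption for illustration; inside the model the three embeddings are summed per token, each of size H):

```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

# A sentence pair becomes: [CLS] sentence A [SEP] sentence B [SEP]
enc = tokenizer("Who founded Apple?", "Apple was founded by Steve Jobs.")

print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
# ['[CLS]', 'who', 'founded', 'apple', '?', '[SEP]', 'apple', 'was', 'founded', ...]
print(enc["token_type_ids"])  # segment ids: 0 for sentence A tokens, 1 for sentence B tokens

# Final input representation (per token) = token embedding + segment embedding + position embedding
```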

 

💡 Input representation vs Output representation?

- The input representation is the combination of Token, Segment, and Position embeddings fed into the Transformer encoder.

- The output representation is the final hidden representation produced after passing through the Transformer encoder, and it is what the various downstream tasks consume.

 

Feature 1. The first token of every input representation is always [CLS]

- The final hidden state of this token is a vector that compresses the information of the whole sequence (the aggregate sequence representation)

- For sentence-level classification tasks, it is used as the vector representing the meaning of the entire sentence

 

Feature 2. When a sentence pair is trained together (e.g. for a QA task), the [SEP] token is used to separate the two sentences

- Segment embeddings: an additional "segment embedding" is learned to indicate whether each token belongs to sentence A or sentence B ⬅️ a difference from the original Transformer

 

โœ”๏ธ Output representation

- Transformer Encoder์˜ ์ตœ์ข… ๋ฒกํ„ฐ

- Token + Position + Segment Embedding(์ž…๋ ฅ ํ‘œํ˜„)๊ณผ ๋™์ผํ•œ ์ฐจ์›, ๋™์ผํ•œ ํ˜•ํƒœ

    BERT BASE๋ชจ๋ธ์—์„œ๋Š” ์ž…, ์ถœ๋ ฅ ๋ฒกํ„ฐ ๋ชจ๋‘ (์ž…๋ ฅ ํ† ํฐ ์ˆ˜ x 768), BERT LARGE ๋ชจ๋ธ์—์„œ๋Š” (์ž…๋ ฅ ํ† ํฐ ์ˆ˜ x 1024)

- ๋‹ค๋งŒ, downstream task์— ์ตœ์ ํ™”๋œ ์ •๊ตํ•œ embedding ๊ฒฐ๊ณผ ์ถœ๋ ฅ

 

- Pre-training result (general representation): a general-purpose representation learned through MLM (masked-word prediction) and NSP (sentence-relation prediction)
- Fine-tuning result (task-specific representation): the representation reshaped and optimized for the downstream task (sentiment analysis, QA, etc.)

Pre-training BERT

In the pre-training stage, BERT is pre-trained using two unsupervised tasks.

Task #1: Masked LM

GPT์™€ ๊ฐ™์€ ๊ธฐ์กด ์–ธ์–ด ๋ชจ๋ธ๋“ค์ด ํ•œ์ชฝ์œผ๋กœ๋งŒ ํ•™์Šต์„ ํ–ˆ๋˜ ์ด์œ ๋Š” ์–‘๋ฐฉํ–ฅ์œผ๋กœ ํ•™์Šตํ•˜๊ฒŒ ๋˜๋ฉด ์˜ˆ์ธก ๋Œ€์ƒ ๋‹จ์–ด๋ฅผ ํ•™์Šต ์‹œ ๋ด๋ฒ„๋ฆฌ๋Š”๋ฐ, ๊ทธ๋Ÿผ ๋ชจ๋ธ์ด ์˜ˆ์ธกํ•˜๋ ค๋Š” ๋‹จ์–ด๋ฅผ ์ด๋ฏธ ์•Œ๊ฒŒ ๋˜์–ด ์˜ˆ์ธก ์ž์ฒด๊ฐ€ ๋ฌด์˜๋ฏธํ•ด์งˆ ์ˆ˜ ์žˆ๋‹ค. ๋”ฐ๋ผ์„œ BERT์—์„œ๋Š” ํ•™์Šต์‹œ ๊ฐ ์‹œํ€€์Šค ๋ณ„ ๋ฌด์ž‘์œ„๋กœ ์ผ๋ถ€(15%) ํ† ํฐ์„ ๋งˆ์Šคํ‚นํ•ด์„œ ๊ฐ€๋ ค๋ฒ„๋ฆฌ๊ณ , ๋งˆ์Šคํ‚น ๋œ ํ† ํฐ([MASKED] ์ƒํƒœ)์˜ ํ† ํฐ ID๋ฅผ ์˜ˆ์ธกํ•˜๊ฒŒ ํ•œ๋‹ค. ์ด๋ฅผ MLM(Masked-Language Model)์ด๋ผ ํ•œ๋‹ค.

 

However, since the [MASK] token is never used during fine-tuning, the same token in the same sentence can be trained in different directions across the two stages. This is called the mismatch problem.

 

For example,

suppose the sentence 'The cat is lying on the couch.' is trained in both of BERT's training stages.

In the pre-training stage the input arrives as 'The cat is [MASK] on the couch.', and based on context alone the [MASK] position may be predicted as sleeping or sitting rather than lying, so a word representation with a contextual meaning other than 'lying' can be learned there. In the fine-tuning stage, however, the complete sentence 'The cat is lying on the couch.' is used, so the two representations diverge and the mismatch problem occurs.

 

์œ„์™€ ๊ฐ™์€ mismatch ๋ฌธ์ œ๋ฅผ ๋ฐฉ์ง€ํ•˜๊ธฐ ์œ„ํ•ด BERT์—์„œ๋Š” 15%์˜ ํ† ํฐ์„ ๋ฌด์ž‘์œ„๋กœ ์„ ํƒํ•˜์—ฌ ์„ ํƒํ•œ ํ† ํฐ ์ค‘

- 80%๋Š” [MASK] ํ† ํฐ์œผ๋กœ ๋Œ€์ฒด

- 10%๋Š” ๋ฌด์ž‘์œ„ ํ† ํฐ

- 10%๋Š” ๋ณ€๊ฒฝํ•˜์ง€ ์•Š๋Š”๋‹ค. (๋”ฐ๋ผ์„œ ์‹ค์ œ๋กœ๋Š” 13.5%์˜ ํ† ํฐ๋งŒ ์‹ค์ œ ํ† ํฐ์ด ์•„๋‹Œ ๋‹ค๋ฅธ ํ† ํฐ์œผ๋กœ ๋ณ€๊ฒฝ๋œ๋‹ค๊ณ  ๋ณด๋ฉด ๋œ๋‹ค.)

์œ„์™€ ๊ฐ™์€ ๋ฐฉ์‹์œผ๋กœ ํ•™์Šตํ•˜์—ฌ ์˜๋„์ ์œผ๋กœ ํ•™์Šต ๋ฐ์ดํ„ฐ์— ๋…ธ์ด์ฆˆ๋ฅผ ๋ผ์›Œ์คŒ์œผ๋กœ์จ ๋ชจ๋ธ์ด ํŠน์ • ๋ฐฉํ–ฅ์œผ๋กœ ๊ณผ์ ํ•ฉ ๋˜์ง€ ์•Š๋„๋ก ์ผ๋ฐ˜ํ™” ๋Šฅ๋ ฅ์„ ํ‚ค์šด๋‹ค.

Task #2: Next Sentence Prediction(NSP)

์งˆ์˜์‘๋‹ต๊ฐ™์€ task์—์„œ ์ค‘์š”ํ•œ ๊ฒƒ์€ ‘๋‘ ๋ฌธ์žฅ ์‚ฌ์ด ๊ด€๊ณ„’์ด๋‹ค. ๋”ฐ๋ผ์„œ NSP task์—์„œ๋Š” ๋‘ ๊ฐœ์˜ ๋ฌธ์žฅ์„ ์ž…๋ ฅ ๋ฐ›์•„, ์ด๋“ค์ด ์‹ค์ œ ์—ฐ์†๋œ ๋ฌธ์žฅ์ธ์ง€ ์•„๋‹Œ์ง€๋ฅผ ๋งžํžˆ๋Š” ์ด์ง„ ๋ถ„๋ฅ˜ ๋ฌธ์ œ๋ฅผ ํ•ด๊ฒฐํ•œ๋‹ค. ํ•™์Šต ๋ฐฉ์‹์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

 

โœ”๏ธ ๋ฌธ์žฅ A์™€ B๊ฐ€ ๋“ค์–ด์˜จ๋‹ค๊ณ  ํ• ๋•Œ, ์‹ค์ œ B๊ฐ€ A ๋‹ค์Œ ์ด์–ด์งˆ ๋ฌธ์žฅ์ด ๋  ํ™•๋ฅ ์€ 50%์ด๋‹ค.

- B๊ฐ€ A์— ์ด์–ด์ง€๋Š” ๋ฌธ์žฅ์ธ ๊ฒฝ์šฐ 'IsNext'

- B๊ฐ€ corpus์—์„œ ์ถ”์ถœํ•œ ๋ฌด์ž‘์œ„ ๋ฌธ์žฅ์ธ ๊ฒฝ์šฐ 'NotNext'

โœ”๏ธ ์ด๋•Œ์˜ [CLS] ํ† ํฐ์€ ๋ฌธ๋งฅ์  ๋‚ด์šฉ ๋‹ด๊ณ  ์žˆ๋Š” vector๋กœ์„œ ํ•™์Šต๋œ๋‹ค.

- ์ตœ์ข… ๋ฒกํ„ฐ๋Š” ๋‘ ๋ฌธ์žฅ(A, B) ์ „์ฒด์˜ ๋ฌธ๋งฅ์  ์ •๋ณด๋ฅผ ํฌํ•จํ•˜๋Š” ๋ฒกํ„ฐ๋กœ ํ•™์Šต๋œ๋‹ค.

 

์ด์™€ ๊ฐ™์€ ํ•™์Šต ๋ฐฉ์‹์€ ํ•™์Šต ๋ฐ์ดํ„ฐ๊ฐ€ ๋‘ ๋ฌธ์žฅ ์Œ์œผ๋กœ ์ด๋ฃจ์–ด์ง„ ์งˆ์˜์‘๋‹ต(QA)๊ณผ ์ž์—ฐ์–ด ์ถ”๋ก (NLI) task์—์„œ ๋งค์šฐ ์œ ์šฉํ•˜๋‹ค.

 

์ •๋ฆฌํ•˜์ž๋ฉด, Pre-training ๋‹จ๊ณ„์—์„œ๋Š” MLM๊ณผ NSP ๋‘ ๊ฐœ์˜ Loss๋ฅผ ๋™์‹œ์— ์ตœ์ ํ™”(๊ณต๋™ ํ•™์Šต)ํ•˜๋Š” ๋ฐฉ์‹์œผ๋กœ ํ•™์Šต๋œ๋‹ค. ์ž…๋ ฅ ๋ฌธ์žฅ์ด ์ฃผ์–ด์ง€๋ฉด, ๋ชจ๋ธ์ด ๋™์‹œ์— '๋‹จ์–ด ๋ณต์›(MLM)'๊ณผ '๋ฌธ์žฅ ๊ด€๊ณ„ ์˜ˆ์ธก(NSP)'์„ ์ˆ˜ํ–‰ํ•˜๋„๋ก ํ•™์Šตํ•˜์—ฌ ๋ณด๋‹ค ์ผ๋ฐ˜ํ™”๋œ ์–ธ์–ด ํŒจํ„ด์„ ํ•™์Šตํ•œ๋‹ค.

 

Fine-tuning BERT

๊ธฐ์กด์— ์ž…๋ ฅ ๋ฐ์ดํ„ฐ๋กœ์จ ๋ฌธ์žฅ ์Œ์„ ํ•™์Šตํ•˜๋Š” application๋“ค์€ ๋ณดํ†ต ๋ฌธ์žฅ ์Œ์„ ๋…๋ฆฝ์ ์œผ๋กœ encoding ํ–ˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ BERT๋Š” self-attention ํ†ตํ•ด ์™ผ→์˜ค, ์˜ค→์™ผ ์–‘๋ฐฉํ–ฅ์˜ ๋ฌธ๋งฅ ํ•™์Šต์„ ํ†ตํ•ฉํ•˜์˜€๊ณ , ์ด๋ฅผ ํ†ตํ•ด ํ•™์Šต์˜ ํšจ์œจ์„ฑ์„ ํ–ฅ์ƒ์‹œ์ผฐ๋‹ค.

    + BiLSTM์€ ์–‘๋ฐฉํ–ฅ ํ•™์Šต๋ชจ๋ธ์ด์ง€๋งŒ, ๋‹จ๋ฐฉํ–ฅ์œผ๋กœ ๋‘ ๋ฒˆ ํ•™์Šต์„ ์ง„ํ–‰ํ•˜๊ธฐ ๋•Œ๋ฌธ์— BERT์— ๋น„ํ•ด ํšจ์œจ์„ฑ์ด ๋–จ์–ด์ง€๋Š” ํ•œ๊ณ„๊ฐ€ ์กด์žฌํ•œ๋‹ค.

 

BERT์—์„œ์˜ self-attention์€ ๊ฐ ํ† ํฐ์ด ๋ฌธ์žฅ ๋‚ด ๋ชจ๋“  ๋‹จ์–ด๋“ค๊ณผ ์ง์ ‘ ์—ฐ๊ฒฐ๋  ์ˆ˜ ์žˆ๋„๋ก ํ•˜์—ฌ ๋ฌธ๋งฅ์„ ๋” ๊นŠ์ด ์ดํ•ดํ•˜๋Š” ๋ฐ ๋„์›€์„ ์ฃผ๋Š” ์—ญํ• ์„ ํ•œ๋‹ค. Transformer ๋‚ด Encoder์˜ Multi-Head Attention ๋ถ€๋ถ„์—์„œ ๊ฐ head๋ณ„๋กœ ์„œ๋กœ ๋‹ค๋ฅธ ๋‹จ์–ด·ํŒจํ„ด(e.g. ์ฃผ์–ด-๋ชฉ์ ์–ด, ์ฃผ์–ด-๋™์‚ฌ, ๋ชฉ์ ์–ด-๋™์‚ฌ, ...)์„ ์ง‘์ค‘์ ์œผ๋กœ ํ•™์Šตํ•˜๊ฒŒ ๋˜๋Š”๋ฐ, BERT์—์„œ๋Š” Multi-Head Attention ๋ฉ”์ปค๋‹ˆ์ฆ˜์„ ํ™œ์šฉํ•˜์—ฌ ๋ฌธ์žฅ์˜ ๊ฐ ํ† ํฐ์ด ๋ฌธ์žฅ ๋‚ด ๋‹ค๋ฅธ ๋ชจ๋“  ํ† ํฐ๊ณผ ์ƒํ˜ธ์ž‘์šฉํ•  ์ˆ˜ ์žˆ๋„๋ก ํ•™์Šต๋œ๋‹ค. ์ด๋Ÿฌํ•œ self-attention์„ ํ†ตํ•ด ์—ฐ๊ฒฐ๋œ ๋ฌธ์žฅ ์Œ ๊ฐ„์˜ ๊ด€๊ณ„๋ฅผ encodingํ•˜๊ฒŒ ๋˜๋ฉด ๋‘ ๋ฌธ์žฅ ์‚ฌ์ด bidirectional cross attention ํšจ๊ณผ์ ์œผ๋กœ ํฌํ•จ ๊ฐ€๋Šฅํ•˜๋‹ค. ์ฆ‰, self-attention์€ ๋ฌธ์žฅ A์™€ B๊ฐ€ ์„œ๋กœ๋ฅผ ์ฐธ๊ณ ํ•˜๋ฉด์„œ ์ด๋Ÿฌํ•œ ๋ฌธ๋งฅ์ด encoding์— ์ ์šฉ๋œ๋‹ค. ๋˜ํ•œ, ๊ทธ์— ๋”ฐ๋ผ ๋ฌธ์žฅ A์˜ ํŠน์ • ๋‹จ์–ด๊ฐ€ ๋ฌธ์žฅ B์˜ ์ •๋ณด๋ฅผ ์ฐธ๊ณ ํ•  ์ˆ˜ ์žˆ๊ณ , ๋ฐ˜๋Œ€๋กœ๋„ ๊ฐ€๋Šฅํ•˜๊ฒŒ ๋œ๋‹ค.

 

BERT์˜ fine-tuning ๋‹จ๊ฒŒ์—์„œ๋Š” task ๋ณ„ ์ž…๋ ฅ๊ณผ ์ถœ๋ ฅ ๋ฐ์ดํ„ฐ๋ฅผ BERT์— ์—ฐ๊ฒฐ์ง€์–ด parameter๋“ค์„ end-to-end๋กœ ๋ฏธ์„ธ ์กฐ์ •ํ•˜์—ฌ ํ•™์Šตํ•œ๋‹ค. ๊ฐ task๋ณ„ ์ž…๋ ฅ์œผ๋กœ ์ฃผ์–ด์ง€๋Š” ๋ฌธ์žฅ ์Œ ํ•™์Šต ๋ฐ์ดํ„ฐ๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™์ด ๊ตฌ์„ฑ๋œ๋‹ค.

- Paraphrasing(์œ ์˜์–ด ์ฒ˜๋ฆฌ): ์œ ์‚ฌ๋„ ์ธก์ •ํ•˜๊ณ ์ž ํ•˜๋Š” ๋ฌธ์žฅ ์Œ

- Entailment(์ž์—ฐ์–ด ์ถ”๋ก ): ๊ฐ€์„ค - ์ „์ œ ์Œ

- QA(์งˆ์˜์‘๋‹ต): ์งˆ๋ฌธ-(์ •๋ณด๊ฐ€ ์ฃผ์–ด์ง„)๋ฌธ์žฅ ์Œ

- ๋ฌธ์žฅ ๋ถ„๋ฅ˜: ๋‹จ์ผ ๋ฌธ์žฅ-๊ณต์ง‘ํ•ฉ ์Œ(ํ’ˆ์‚ฌ ํƒœ๊น…/๋ฌธ์žฅ ์š”์•ฝ๊ณผ ๊ฐ™์€ task์—๋Š” ๋ฌธ์žฅ์ด ์Œ์œผ๋กœ ์กด์žฌํ•˜์ง€ ์•Š์•„๋„ ๋จ)

 

The outputs produced after training are as follows.

- Individual tokens: the representation needed for each individual token is learned (used for token-level tasks).

- [CLS]: a representation that acts as a summary of the whole input is learned, used for entailment prediction, sentiment analysis, and the like.
