Attention Is All You Need - Transformer Paper Summary


๋”ฅ๋Ÿฌ๋‹ ํ•™๊ณ„ ์ „๋ฐ˜์— ํ˜์‹ ์ ์ธ ๋Œํ’์„ ๋ชฐ๊ณ  ์˜จ ๋…ผ๋ฌธ,

Attention is All You Need: https://dl.acm.org/doi/10.5555/3295222.3295349

 


 

BERT ๋…ผ๋ฌธ๋„ ์ปจํผ๋Ÿฐ์Šค ๋…ผ๋ฌธ์—๋‹ค ๊ฑฐ์˜ Encoder ๊ตฌ์กฐ๋ฅผ ๊ทธ๋Œ€๋กœ ๊ฐ”๋‹ค ์ผ๊ธฐ ๋•Œ๋ฌธ์—

๋ชจ๋ธ์˜ ๊ตฌ์กฐ์ ์ธ ๋ถ€๋ถ„์€ ๋‚˜์™€์žˆ์ง€ ์•Š์•„ ์–ด์ฉ” ์ˆ˜ ์—†์ด Transformer ๋…ผ๋ฌธ๋„ ๊ฐ™์ด ๊ณต๋ถ€ํ•˜๊ฒŒ ๋˜์—ˆ๊ณ ,

์ฝ์€ ๊น€์— ์ •๋ฆฌํ•ด์„œ ๊ธฐ๋ก์œผ๋กœ ๋‚จ๊ฒจ๋ณด์ž ํ•˜๋Š” ์ƒ๊ฐ์— ํฌ์ŠคํŒ…์„ ํ•˜๊ฒŒ ๋˜์—ˆ๋‹ค. ๋“œ๋””์–ด ๋‚ด๊ฐ€ ๋„ˆ๊นŒ์ง€ ์ดํ•ดํ–ˆ๋‹ค ํŠธ๋žœ์Šคํฌ๋จธ..

 

🗣️ Check out my BERT write-up here: https://miiinnn23.tistory.com/112

 

Transformer๋Š” ํฌ๊ฒŒ ๋ฌธ์žฅ์„ ์ดํ•ดํ•˜๋Š” encoder ๋ถ€๋ถ„๊ณผ, ์ดํ•ดํ•œ ๋‚ด์šฉ์„ ๊ธฐ๋ฐ˜์œผ๋กœ ์ƒ์„ฑ ์ž‘์—…์„ ์‹คํ–‰ํ•˜๋Š” decoder ๋ถ€๋ถ„์œผ๋กœ ๋‚˜๋‰˜์–ด์ ธ ์žˆ๋‹ค๋Š” ๊ฒƒ๋งŒ ์•Œ๊ณ  ์žˆ๊ณ , ๊ฐ๊ฐ์˜ ์—ฐ์‚ฐ๋“ค์ด ์–ด๋–ป๊ฒŒ ์ง„ํ–‰๋œ๋‹ค๊ฑฐ๋‚˜ ์„ธ๋ถ€์ ์ธ ๋ชจ๋ธ ๋‚ด์šฉ์„ ๊ณต๋ถ€ํ•  ์ˆ˜ ์žˆ๋Š” ๊ธฐํšŒ๊ฐ€ ์—†์—ˆ๋Š”๋ฐ(์‚ฌ์‹ค ์–ด๋ ค์›Œ์„œ ์ž‘๋…„์— ์ฝ๋‹ค ๋•Œ๋ ค์นจใ…Ž) ์ข‹์€ ๊ธฐํšŒ๋กœ ์ˆ˜์—…๋„ ๋“ฃ๊ณ  ๋ฐœํ‘œ๋„ ํ•ด๋ณธ ๋•์— ์ •๋ฆฌ๋ฅผ ํ•ด๋ณธ๋‹ค..

 

 

๋“ค์–ด๊ฐ€๊ธฐ ์ „์—,

 

As mentioned above, the Transformer consists of

 

an encoder part, which splits the incoming sentence (the input sequence) into tokens (the smallest units a machine can understand), interprets the relationships between the tokens, grasps the context, and turns each token into a vectorized representation,

and a decoder part, which takes the representations and contextual understanding learned by the encoder, transforms the input sequence to fit the task at hand (sentence generation, translation, summarization, and so on), and generates the output sentence token by token.

 

๊ทธ๋ž˜์„œ ํ”ํžˆ Encoder ๋ถ€๋ถ„์„ ๊ฐ•ํ™”ํ•œ ๋ชจ๋ธ์ด BERT์ด๊ณ , Decoder ๋ถ€๋ถ„์„ ๊ฐ•ํ™”ํ•œ ๋ชจ๋ธ์ด GPT๋กœ ์•Œ๋ ค์ ธ์žˆ๋‹ค. ๊ทธ๋Ÿฌ๋‚˜ ์ดˆ๊ธฐ์˜ BERT ๋ชจ๋ธ์€ ๋ฌธ๋งฅ์„ ํŒŒ์•…ํ• ์ง€์–ธ์ • ์ƒ์„ฑํ•˜๋Š” task๋Š” ํ•œ๊ณ„๊ฐ€ ์กด์žฌํ•˜๊ณ (QA๋ฌธ์ œ ํ…Œ์ŠคํŠธ ์‹œ ์ง€๋ฌธ์— ๋‹ต์ด ์—†๋Š” ๊ฒฝ์šฐ '๋‹ตํ•  ์ˆ˜ ์—†์Œ' ์ด๋ผ๋Š” ๊ฒฐ๊ณผ๋ฅผ ๋‚ด๋†“์•„์•ผ ํ–ˆ์Œ) GPT ๋ชจ๋ธ์€ ๋ฌธ์žฅ์„ ์ƒ์„ฑํ• ์ง€์–ธ์ • '์ž๊ธฐ ํšŒ๊ท€ ๋ชจ๋ธ'์ด๋ผ๋Š” ํŠน์„ฑ ํƒ“์— ๋‹จ๋ฐฉํ–ฅ ํ•™์Šต๋งŒ์ด ๊ฐ€๋Šฅํ•˜์—ฌ ๋ฌธ๋งฅ ํ•™์Šต์— ์ œํ•œ์ด ์กด์žฌํ–ˆ๋‹ค. ๊ทธ๋Ÿผ์—๋„ ๋ถˆ๊ตฌํ•˜๊ณ , transformer์™€ BERT/GPT์˜ ๋“ฑ์žฅ์œผ๋กœ ์ธํ•ด NLP ํ•™๊ณ„๊ฐ€ ์•„์ฃผ ์•„์ฃผ ์•„์ฃผ ์•„์ฃผ .. ํฐ ๋ฐœ์ „์„ ์ด๋ฃฉํ•˜์˜€๋‹ค๋Š” ์ ์—์„œ ์ด๋“ค์€ ๋งค์šฐ ์ƒ์ง•์ ์ด๋ผ๊ณ  ๋ณผ ์ˆ˜ ์žˆ๋‹ค.

 

๋˜ํ•œ ๋ถ€๊ฐ€์ ์œผ๋กœ .. CNN์ด Computer Vision ํ•™๊ณ„์— ๊ฝค๋‚˜ ํฐ ์ถฉ๊ฒฉ์„ ์ฃผ์—ˆ๋Š”๋ฐ, CNN์„ ํŒŒ๊ณ ํŒŒ๊ณ  ์ฅ์–ด์งœ๊ณ  ๋์—†์ด ํŒŒ์„œ ๋”์ด์ƒ ๋‚˜์˜ฌ ๊ฒƒ๋„ ์—†์„ ์ฏค์— Transformer์˜ ํ•ต์‹ฌ์ธ 'self-attention' ๋ฉ”์ปค๋‹ˆ์ฆ˜ ๋•์— ๋˜๋‹ค์‹œ ํž˜์„ ์–ป๊ณ  ViT(Vision Transformer)๋ผ๋Š” ๋Œ€๋‹จํ•œ ๋ชจ๋ธ์ด ๋“ฑ์žฅํ•  ์ˆ˜ ์žˆ์—ˆ์œผ๋‹ˆ ์—ฌ๋Ÿฌ๋ชจ๋กœ ๋‹ค์–‘ํ•œ ํ•™๊ณ„์˜ ์ƒˆ๋กœ์šด ๊ฐ€๋Šฅ์„ฑ์„ ์—ด์–ด์ค€ ๋Œ€๋‹จํ•œ ๋…ผ๋ฌธ์ด ๋ถ„๋ช…ํ•˜๋‹ค. ์‹ค์ œ๋กœ ๋น„์ „์ชฝ ์—ฐ๊ตฌ์‹ค ๋“ค์–ด๊ฐ„ ๋‚ด ์นœ๊ตฌ๋“ค์€ ์ „๋ถ€ transformer ๊ธฐ๋ฐ˜ ๋ชจ๋ธ๋“ค ์—ฐ๊ตฌํ•˜๊ณ  ์žˆ์—ˆ์Œ

 

Anyway, taking the class made me realize even more clearly why 'that' Transformer is such a big deal, and I found this backstory interesting, so I wrote it down before getting started.

 

If that wasn't interesting, never mind haha

 


Introduction

So what exactly is the Transformer?

A sequence transduction model

 

- Sequence-to-sequence: given a sequence as input, it learns to output another sequence

- This lets it solve natural language processing problems such as machine translation, document summarization, and speech recognition

 

ํ•ด๋‹น ๊ณผ์ •์—์„œ ๊ธฐ์กด sequence-to-sequence ๋ชจ๋ธ์— ๋งŽ์ด ํ™œ์šฉ๋˜๋˜ RNN์ด๋‚˜ convolution ๋“ฑ๋“ฑ์˜ ๋ฐฉ์‹์„ ์‚ฌ์šฉํ•˜์ง€ ์•Š๊ณ , attention ๋ฉ”์ปค๋‹ˆ์ฆ˜๋งŒ์„ ํ™œ์šฉํ•˜์—ฌ ์‹œํ€€์Šค ๋‚ด ํ† ํฐ ๊ฐ„์˜ ๊ด€๊ณ„๋ฅผ ๋ชจ๋ธ๋งํ•˜์˜€๋‹ค.

 

Background

The computer vision field is said to have advanced dramatically with the arrival of CNNs, which mimic the human visual system. The NLP field, on the other hand, relied heavily on recurrent models (RNN, LSTM, GRU) designed to make use of past information. Why? Because text is sequential data: to capture contextual information, a model needs to remember and use the preceding words.

 

์ˆœํ™˜ ๋ชจ๋ธ(Recurrent Model)๋“ค์˜ ์ž‘๋™ ๋ฐฉ์‹

- Computation proceeds sequentially, following each token's position (symbol position) in the input/output sequence

- The hidden state* $h_t$ (for the $t$-th token) is generated as a function of the previous hidden state $h_{t-1}$ and the input at the current position $t$

* hidden state: the information/representation of the token being learned
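
To make the sequential nature concrete, here is a minimal sketch of the recurrence (plain NumPy, with hypothetical weight matrices; not from the paper). Each hidden state needs the previous one, which is exactly why the time loop cannot be parallelized.

```python
import numpy as np

def rnn_forward(x_seq, W_xh, W_hh, b_h):
    """Vanilla RNN: h_t = tanh(x_t W_xh + h_{t-1} W_hh + b_h).
    x_seq: (seq_len, d_in), W_xh: (d_in, d_h), W_hh: (d_h, d_h), b_h: (d_h,)."""
    h = np.zeros(W_hh.shape[0])
    states = []
    for x_t in x_seq:                          # strictly one step at a time
        h = np.tanh(x_t @ W_xh + h @ W_hh + b_h)
        states.append(h)
    return np.stack(states)                    # (seq_len, d_h)
```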

 

Attention?

Focus on what matters

 

self-attention์˜ ์‹œ๊ฐํ™”

 

Because attention can model dependencies between tokens regardless of their distance (dependency modeling), it can effectively handle various sequence transduction problems such as machine translation, document summarization, and speech recognition. Here, dependency modeling means learning how the tokens in a sequence are connected to one another, that is, capturing their mutual relevance and influence and reflecting it during training.

→ 'Dependency modeling' is the essence of attention!

 

๊ธฐ์กด์—๋Š” attention์€ ์ˆœํ™˜ ๋ชจ๋ธ๊ณผ ํ•จ๊ป˜ ๋งŽ์ด ์‚ฌ์šฉ๋˜์—ˆ๋‹ค.

→ But in that setup, the sequential processing inherent to recurrent models makes parallelization difficult, and memory limits make them hard to use on long sentences (early information can be lost or distorted, which makes learning dependencies difficult and also hurts training efficiency)

 

Background

Self-Attention

- Also called intra-attention

- An attention mechanism that learns the relationships between the tokens within a single sequence in order to produce a representation of that sequence

- Each token selectively incorporates the information it needs by interacting with every other token, which makes it effective at capturing context and dependencies

End-to-end memory networks

- Based on a recurrent attention mechanism

- Showed good performance on simple QA (question answering) and language modeling tasks

 

Model Architecture

Transformer ๋ชจ๋ธ์€ ํฌ๊ฒŒ encoder์™€ decoder ๋‘ ๋‹จ๊ณ„์˜ ๊ตฌ์กฐ๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ๋‹ค. ์š”์•ฝํ•˜์ž๋ฉด,

 

- Encoder์˜ ์—ญํ• : ์ž…๋ ฅ ์‹œํ€€์Šค์˜ ํ† ํฐ(symbol)์„ $(x_1, x_2,\dots, x_n)$์„ ๊ฐ ํ† ํฐ์˜ ๋ฌธ๋งฅ์ ์ธ ์ •๋ณด๋ฅผ ๋ฐ˜์˜ํ•œ ํ‘œํ˜„ $\mathbf z=(z_1, \dots, z_n)$์œผ๋กœ ๋งคํ•‘

- Decoder์˜ ์—ญํ• : $\mathbf z$๋กœ๋ถ€ํ„ฐ ํ•œ๋ฒˆ์— ํ•˜๋‚˜์”ฉ ์ถœ๋ ฅ ์‹œํ€€์Šค $(y_1,\dots, y_m)$ ์ƒ์„ฑ

 

๋ชจ๋ธ์˜ ๊ฐ ๋‹จ๊ณ„๋Š” ์ž๊ธฐ ํšŒ๊ท€(auto-regressive) ํ˜•ํƒœ์ด๋ฉฐ, ํ…์ŠคํŠธ ์ƒ์„ฑ์‹œ ์ƒ์„ฑ๋œ ํ˜„์žฌ ํ† ํฐ์€ ๋‹ค์Œ ํ† ํฐ ์ƒ์„ฑ์„ ์œ„ํ•œ ์ถ”๊ฐ€์ ์ธ ์ž…๋ ฅ์œผ๋กœ ํ™œ์šฉ๋œ๋‹ค.

Encoder and Decoder Stacks

Transformer ๋ชจ๋ธ์˜ ์ „์ฒด ๊ตฌ์กฐ

 

Encoder

Transformer์˜ encoder๋Š” 6๊ฐœ์˜ ๋™์ผํ•œ ๋ ˆ์ด์–ด๋กœ ๊ตฌ์„ฑ๋˜์–ด์žˆ๊ณ , ๊ฐ encoder ๋ ˆ์ด์–ด๋Š” ๋‘๊ฐœ์˜ sub-layer๋กœ ์ด๋ฃจ์–ด์ ธ ์žˆ๋‹ค.

 

์ฒซ๋ฒˆ์งธ ๋ ˆ์ด์–ด๋Š” Multi-head attention ๋ฉ”์ปค๋‹ˆ์ฆ˜์„ ์ˆ˜ํ–‰ํ•˜๋Š” ๋ ˆ์ด์–ด์ด๊ณ , ๋‘๋ฒˆ์งธ๋Š” ํ† ํฐ์˜ ํฌ์ง€์…˜ ๋ณ„ ์—ฐ์‚ฐ๋˜๋Š” fully-connected feed-forward ๋„คํŠธ์›Œํฌ๋ฅผ ์˜๋ฏธํ•œ๋‹ค. ๊ฐ sub-layer ๋ณ„ ์—ฐ์‚ฐ ํ›„์—๋Š” residual connection์„ ์ˆ˜ํ–‰ํ•˜๊ณ , ์ดํ›„ layer normalization์„ ์ˆ˜ํ–‰ํ•˜์—ฌ ํšจ์œจ์ ์ธ ํ•™์Šต์„ ์ง„ํ–‰ํ•œ๋‹ค.

 

A residual connection adds the sub-layer's input to its output, so that at minimum the original input information is preserved through identity mapping (passing the input through unchanged), and it mitigates the vanishing gradient problem that can occur during backpropagation. Layer normalization keeps the mean and variance of the activations steady, improving the stability and efficiency of training.
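
As a rough sketch of this "Add & Norm" step (plain NumPy; the epsilon value and the omission of the learned gain/bias are my simplifications):

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    # Normalize each token vector to zero mean / unit variance
    # over its feature dimension (learned gain/bias omitted here).
    mean = x.mean(axis=-1, keepdims=True)
    std = x.std(axis=-1, keepdims=True)
    return (x - mean) / (std + eps)

def add_and_norm(x, sublayer_out):
    # Residual connection followed by layer normalization: LayerNorm(x + Sublayer(x))
    return layer_norm(x + sublayer_out)
```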

 

๊ทธ์™ธ์˜ ๋ถ€๋ถ„์— ๋Œ€ํ•ด์„œ ์ €๊ฒŒ ๋‹ค ๋ญ”์†Œ๋ฆฌ๋ƒ๋ฉด ๊ทธ๋ƒฅ ๊ทธ๋Ÿฐ๊ฒŒ ์žˆ๋‹ค๊ณ  ์ƒ๊ฐํ•˜๋ฉด ๋œ๋‹ค. ์•„๋ž˜์—์„œ ์ž์„ธํ•˜๊ฒŒ ์„ค๋ช…ํ•œ๋‹ค.

Decoder

Like the encoder, the decoder is a stack of 6 identical layers. Its basic structure is similar to the encoder's, but it has an additional sub-layer in the middle that performs multi-head attention over the encoder's output. Also, the first sub-layer modifies the self-attention layer into masked multi-head attention, so that the current token is not influenced by the tokens that come after it. In other words, this masking ensures that when producing the output embedding, the prediction for the i-th token can depend only on the tokens up to position i-1.

 

Causal mask: a technique that masks future (not yet generated) tokens in the decoder's self-attention layer so that they cannot influence the prediction of the current token

Attention

ํ•ด๋‹น ํŒŒํŠธ์—์„œ๋Š” attention ์—ฐ์‚ฐ์€ ๊ตฌ์ฒด์ ์œผ๋กœ ์–ด๋–ป๊ฒŒ ์ˆ˜ํ–‰๋˜๋Š”์ง€ ์„ค๋ช…ํ•œ๋‹ค.

 

Transformer์˜ Attention์€ ์˜๋ฏธ์ ์œผ๋กœ ์‹œํ€€์Šค์˜ ๊ฐ ํ† ํฐ๋“ค์ด ์ž๊ธฐ ์ž์‹ ์„ ํฌํ•จํ•˜์—ฌ ๋‹ค๋ฅธ ํ† ํฐ๋“ค๊ณผ ์–ด๋–ค ๊ด€๊ณ„๋ฅผ ๊ฐ–๋Š”์ง€ ํ•™์Šตํ•˜๋Š” ์—ญํ• ์„ ํ•˜์ง€๋งŒ, Attention ๊ธฐ๋ฒ•์„ ํ•จ์ˆ˜์ ์œผ๋กœ ์„ค๋ช…ํ•˜์ž๋ฉด (query, key, value)๋ฅผ ์ž…๋ ฅ๋ฐ›์•„ ํ•˜๋‚˜์˜ ์ถœ๋ ฅ ๋ฒกํ„ฐ๋กœ ๋งคํ•‘ํ•˜๋Š” ์—ญํ• ์„ ํ•œ๋‹ค. (๋‹จ, query, key, value, output์€ ๋ชจ๋‘ vector ํ˜•ํƒœ๋ผ๊ณ  ๊ฐ€์ •)

 

์—ฌ๊ธฐ์„œ์˜ ์ถœ๋ ฅ ๋ฒกํ„ฐ๋Š” Value์˜ ๊ฐ€์ค‘ ํ•ฉ(weighted sum)์œผ๋กœ ์—ฐ์‚ฐ๋˜๊ณ , Value๋Š” ๊ฐ ํ† ํฐ ๋ณ„ Key์™€ Query๋กœ๋ถ€ํ„ฐ ์—ฐ์‚ฐ๋œ๋‹ค. Transformer์—์„œ ์‚ฌ์šฉ๋˜๋Š” Query, Key, Value์— ๋Œ€ํ•œ ๋Œ€๋žต์ ์ธ ์ •์˜๋Š” ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค.

 

Query

- Key์— ๋Œ€ํ•œ ๋ฌธ๋งฅ์  ์—ฐ๊ด€์„ฑ ์ฐพ๊ณ ์ž ํ•˜๋Š” ํ•ญ๋ชฉ์„ ๋‚˜ํƒ€๋‚ด๋Š” ๋ฒกํ„ฐ

- ์–ด๋–ค ์ •๋ณด๊ฐ€ ์ค‘์š”ํ•œ์ง€ ์ฐพ๊ธฐ ์œ„ํ•ด ๊ฐ ๋‹จ์–ด๊ฐ€ ์–ผ๋งˆ๋‚˜ ์—ฐ๊ด€๋˜์–ด ์žˆ๋Š”์ง€ ํ‰๊ฐ€

 

Key

- The vector used when computing the attention scores

- Compared against the Query to decide how much the model should focus on each Value

- Serves as the criterion for judging how relevant each word's information is

 

Value

- The vector carrying the actual "information" to be passed along; the information of the important words is used to build the final result

- After the attention weights are computed, the value vectors are applied to produce the final output

→ Put simply, the Query asks "should I pay attention to this part?", the Key answers "here is what this part looks like", and the Value carries the actual information: "here is the important content!"

Scaled Dot-Product Attention

Sub-layer์˜ multi-head attention ๋‚ด ๊ฐœ๋ณ„์ ์ธ attention ์—ฐ์‚ฐ๋“ค์„ scaled dot-product attention ์ด๋ผ๊ณ  ํ•œ๋‹ค. ์ด๋•Œ, Input์€ Query, Key, Value๋กœ ๊ตฌ์„ฑ๋˜์–ด ์žˆ๋‹ค.

 

- Per-token Query, Key $\in \mathbb R^{d_k}$

- Per-token Value $\in \mathbb R^{d_v}$

 

Here $d_k$ and $d_v$ are hyper-parameters usually fixed when the model is designed; in its experiments the Transformer sets $d_k=d_v=\frac{d_{model}}{h}$ ($h$: the number of heads in multi-head attention, $d_{model}$: the overall model dimension). Likewise, $h$ and the model dimension $d_{model}$ are themselves hyper-parameters.

 

์ „์ฒด ์‹œํ€€์Šค์˜ attention์ด ๊ฐœ๋ณ„์ ์œผ๋กœ ์—ฐ์‚ฐ๋˜๋Š” ๊ณผ์ •์€ ๋‹ค์Œ๊ณผ ๊ฐ™๋‹ค. ์‹ค์ œ ์—ฐ์‚ฐ์—์„œ๋Š” ํ† ํฐ ๋ณ„ (query, key, value $\in\mathbb R^{d_k}$) ๋ฒกํ„ฐ ์—ฐ์‚ฐ์ด ์ง„ํ–‰๋œ๋‹ค๊ธฐ๋ณด๋‹จ ํ•œ ์‹œํ€€์Šค ๋‚ด ๋ชจ๋“  ํ† ํฐ๋“ค์— ๋Œ€ํ•ด ๋™์‹œ์— attention์ด ์—ฐ์‚ฐ๋˜์–ด ๋ฒกํ„ฐ๊ฐ€ ์•„๋‹Œ Q, K, V $\in\mathbb R^{seq\_len\times d_k}$์˜ ํ–‰๋ ฌ ํ˜•ํƒœ๋กœ ์—ฐ์‚ฐ์ด ์ˆ˜ํ–‰๋œ๋‹ค.

 

Query์™€ Key์˜ ๋‚ด์  ์—ฐ์‚ฐ(MatMul) → $\sqrt{d_k}$๋กœ ๊ฐ ๊ฐ’ ๋‚˜๋ˆ ์คŒ(Scale) → softmax function ์—ฐ์‚ฐ(SoftMax) → Value์™€ ์—ฐ์‚ฐ(MatMul)

 

๊ทธ๋Ÿผ ๊ฐ ๋‹จ๊ณ„์˜ ์—ฐ์‚ฐ์€ ๋ฌด์—‡์„ ์˜๋ฏธํ• ๊นŒ?

- Query์™€ Key์˜ ๋‚ด์  ์—ฐ์‚ฐ: $QK^T$๋Š” ๊ฐ ํ† ํฐ ๋ณ„ Query์™€ Key ์‚ฌ์ด์˜ ์œ ์‚ฌ๋„(๋˜๋Š” ๊ด€๋ จ๋„)๋ฅผ ์—ฐ์‚ฐํ•˜๋Š” ๊ณผ์ •์œผ๋กœ, ๊ฐ ํ† ํฐ๋ผ๋ฆฌ ์–ผ๋งˆ๋‚˜ ๊ฐ•ํ•˜๊ฒŒ ์—ฐ๊ฒฐ๋˜์–ด ์žˆ๋Š”์ง€๋ฅผ ์ˆ˜์น˜๋กœ ํ‘œํ˜„ํ•œ๋‹ค.

- Scale: ๊ฐ ๋ฒกํ„ฐ์˜ ๊ธธ์ด $d_k$๊ฐ€ ์ปค์งˆ์ˆ˜๋ก ๋‚ด์ ์˜ ๊ฒฐ๊ณผ๊ฐ’์ด ์ปค์ง€๋Š” ๋ฌธ์ œ๋ฅผ ์™„ํ™”ํ•˜๊ธฐ ์œ„ํ•œ ์Šค์ผ€์ผ๋ง์„ ์ง„ํ–‰ํ•œ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด softmax ํ•จ์ˆ˜์˜ ์ž…๋ ฅ ๊ฐ’์ด ๋„ˆ๋ฌด ์ปค์ง€์ง€ ์•Š๊ฒŒ ํ•˜์—ฌ, ํ•™์Šต ์‹œ ์•ˆ์ •์„ฑ์„ ๋†’์ธ๋‹ค.

- Softmax: ์Šค์ผ€์ผ๋ง๋œ ๋‚ด์  ๊ฐ’์— softmax๋ฅผ ์ ์šฉํ•˜์—ฌ, ๊ฐ Query์— ๋Œ€ํ•ด Key๋“ค์˜ ์œ ์‚ฌ๋„๋ฅผ ํ™•๋ฅ  ๋ถ„ํฌ๋กœ ๋ณ€ํ™˜ํ•œ๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ๊ฐ Key์— ํ• ๋‹น๋œ ๊ฐ€์ค‘์น˜๊ฐ€ 0๊ณผ 1 ์‚ฌ์ด์˜ ๊ฐ’์œผ๋กœ ์ •๊ทœํ™”๋˜๋ฉฐ, ์ „์ฒด Key์— ๋Œ€ํ•ด ํ•ฉ์ด 1์ด ๋œ๋‹ค.

 

→ The values computed up to this point are the attention scores (weights) over the Keys. Through these steps, each Query decides which Keys to focus on.

 

- Matrix product with Value: the softmax result is multiplied with the Values, so that for each Query the Values carrying the important information are combined as a weighted sum, producing the final output vector.

 

 

Why express the attention scores as a probability distribution at all?

0๊ณผ 1 ์‚ฌ์ด์˜ ๊ฐ’์œผ๋กœ ๋ณ€ํ™˜ํ•จ์œผ๋กœ์จ ์ •๊ทœํ™”์˜ ์˜๋ฏธ๋„ ๋“ค์–ด๊ฐ€๊ณ , ํ•ด์„์ด ๋” ์šฉ์ดํ•ด์ง€๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค. ์ด๋ฅผ ํ†ตํ•ด ์–ด๋–ค key๊ฐ€ ์–ผ๋งˆ๋‚˜ ์ค‘์š”ํ•œ์ง€๋ฅผ ํ™•๋ฅ ์ ์œผ๋กœ ํ•ด์„ํ•จ์œผ๋กœ์จ ์–ด๋–ค ์ •๋ณด์— ๋” ์ง‘์ค‘ํ• ์ง€๋ฅผ ๋ช…ํ™•ํ•˜๊ฒŒ ํŒ๋‹จํ•  ์ˆ˜ ์žˆ๋‹ค.

 

๊ฐ ์—ฐ์‚ฐ ๊ณผ์ •์„ ์œ„์™€ ๊ฐ™์ด ๋„์‹ํ™”ํ•  ์ˆ˜ ์žˆ์ง€๋งŒ.. ์‚ฌ์‹ค ๊ฐœ๋ณ„ attention ์—ฐ์‚ฐ์€ ์ˆ˜์‹ ํ•œ์ค„๋กœ ํ‘œํ˜„ํ•  ์ˆ˜ ์žˆ๋‹ค.

 

$$ \text{Attention}(Q,K,V) =\text{softmax}(\frac{QK^T}{\sqrt{d_k}})V $$

 

์ด๊ฒŒ ๋์ด๋‹ค.

Multi-Head Attention

๋‹จ์ผ attention ์—ฐ์‚ฐ์„ ํ•˜๋Š”๊ฑฐ๋ณด๋‹ค attention ์—ฐ์‚ฐ์„ ๋‹ค๋ฅด๊ฒŒ ์—ฌ๋Ÿฌ๋ฒˆ ์ˆ˜ํ–‰ํ•˜์—ฌ ๊ฐ attention ๋ณ„๋กœ ๋‹ค์–‘ํ•œ ์˜๋ฏธ๋ฅผ ํฌํ•จํ•œ representation์„ ์–ป๊ฒŒ ๋˜๋Š” ๊ฒƒ์ด ๋ฌธ๋งฅ ํŒŒ์•…์— ๋” ํšจ๊ณผ์ ์ด๋ผ๊ณ  ํ•œ๋‹ค. ๋”ฐ๋ผ์„œ ์œ„์—์„œ ์–ธ๊ธ‰ํ•œ attention์„ h๋ฒˆ ์ˆ˜ํ–‰ํ•˜์—ฌ ๋ฌธ์žฅ์˜ ์˜๋ฏธ ํŒŒ์•…์— ์žˆ์–ด head ๋ณ„๋กœ ์„œ๋กœ ๋‹ค๋ฅธ ์ •๋ณด๋ฅผ ํ•™์Šตํ•˜๊ณ ์ž multi-head attention์„ ์ˆ˜ํ–‰ํ•œ๋‹ค.

Different linear projections are applied to the Query, Key, and Value vectors to create h different representations, and the attention operation described above is performed on each of them in parallel. Each head thus computes attention independently and outputs a $d_v$-dimensional output vector carrying different contextual information. Multi-head attention lets the model jointly attend to information from different representation subspaces at different positions. With a single attention head, averaging would inhibit this kind of context capture.

 

๊ทธ๋Ÿผ ๊ฐ head๋ณ„ attention ์ •๋ณด๋Š” ์ตœ์ข…์ ์œผ๋กœ ์–ด๋–ป๊ฒŒ ์œตํ•ฉํ• ๊นŒ? ๋‹จ์ˆœํ•˜๊ฒŒ concatenationํ•œ๋‹ค.

 

$$ \text{MultiHead}(Q,K,V)=\text{Concat}(\text{head}_1,\dots,\text{head}_h)W^O \\ \text{where head}_i=\text{Attention}(QW^Q_i,KW^K_i,VW^V_i) $$

 

Here, $W^Q_i\in\mathbb R^{d_{model}\times d_k}, W^K_i\in\mathbb R^{d_{model}\times d_k},W^V_i\in\mathbb R^{d_{model}\times d_v},W^O\in\mathbb R^{hd_v\times d_{model}}$.

๋…ผ๋ฌธ์—์„œ๋Š” h=8, ๊ฐ ์ฟผ๋ฆฌ ๋ณ„ Query, Key์™€ Value์˜ ์ฐจ์›์„ ๋ชจ๋ธ์˜ ํฌ๊ธฐ/h ๋กœ ์ •์˜ํ•˜์—ฌ์„œ ๊ฒฐ๊ตญ ์ถœ๋ ฅ ๋ฒกํ„ฐ์ธ $W^O$๋Š” ์ •๋ฐฉํ–‰๋ ฌ์˜ ํ˜•ํƒœ๊ฐ€ ๋œ๋‹ค.

Applications of Attention in our Model

Transformer ๋…ผ๋ฌธ์—์„œ ๊ฐ sub-layer ๋ณ„ multi-head attention์€ ์–ด๋–ป๊ฒŒ ์—ฐ์‚ฐ๋ ๊นŒ? Transformer ๊ตฌ์กฐ ๋‚ด ๋“ฑ์žฅํ•˜๋Š” multi-head attention ์—ฐ์‚ฐ์€ encoder layer์—์„œ 1๋ฒˆ, decoder layer์—์„œ ๋‘ ๋ฒˆ ๋“ฑ์žฅํ•œ๋‹ค.

 

Decoder layer์˜ ‘multi-head attention’

๋…ผ๋ฌธ์—์„œ “encoder-decoder attention” layer๋ผ๊ณ ๋„ ๋ถˆ๋ฆฌ๋Š” ํ•ด๋‹น ๋‹จ๊ณ„์—์„œ๋Š” Query๋Š” ์ด์ „ decoder layer(Masked Multi-head Attention)์œผ๋กœ๋ถ€ํ„ฐ ํ•™์Šต๋œ ๊ฐ’์ด, (Key, Value)๋Š” encoder layer์˜ ์ตœ์ข… ์ถœ๋ ฅ ๋‹จ๊ณ„์—์„œ ๋‚˜์˜จ ๊ฐ’์„ ์‚ฌ์šฉํ•œ๋‹ค. ์ด๋Š” ๋‹ค๋ฅธ sequence-to-sequence model๋“ค์˜ encoder-decoder attention mechanism์„ ๋ชจ๋ฐฉํ•œ ๊ฒƒ์œผ๋กœ, decoder์—์„œ๋Š” ๊ฐ ์œ„์น˜์˜ ํ† ํฐ๋“ค์ด ์ž…๋ ฅ ์‹œํ€€์Šค์˜ ๋ชจ๋“  ์œ„์น˜ ์ •๋ณด๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ ์ฐธ๊ณ ํ•  ์ˆ˜ ์žˆ๋‹ค.

 

Here,

- (Key, Value): generated from the encoder's final output, so they contain information about every token of the input sequence

- Query: the representation produced by the decoder's preceding masked self-attention step, so it contains information up to that point (i.e., based on the previous tokens)

 

์ฆ‰, decoder์˜ ํ•ด๋‹น ํ† ํฐ์€ ์ž์‹ ์˜ query๋ฅผ ์ด์šฉํ•ด encoder์˜ ์ „์ฒด ์ถœ๋ ฅ(๋ชจ๋“  ์œ„์น˜์˜ ํ† ํฐ๋“ค)์„ ์ฐธ๊ณ  ๊ฐ€๋Šฅ → ์ „์ฒด ์ž…๋ ฅ ๋ฌธ๋งฅ์„ ๋ฐ˜์˜ํ•˜์—ฌ ๋ฌธ๋งฅ ์ •๋ณด ํ•™์Šต ๊ฐ€๋Šฅ

 

Encoder layer์˜ ‘multi-head attention’

Encoder layer์˜ multi-head attention์—์„œ๋Š” ์ƒ์„ฑ ๊ณผ์ •์ด ์•„๋‹Œ ๋ฌธ๋งฅ์„ ํŒŒ์•…ํ•˜๋Š” ๊ณผ์ •์ด๊ธฐ ๋•Œ๋ฌธ์—, ๊ฐ ํ† ํฐ๋“ค์€ ๋ชจ๋“  ์œ„์น˜์˜ ํ† ํฐ์„ ์ฐธ๊ณ ํ•  ์ˆ˜ ์žˆ๋‹ค. ์ฆ‰, encoder ๋‚ด ๊ฐ ํ† ํฐ๋“ค์€ ์ด์ „ layer์—์„œ ํ•™์Šต๋œ ๋ชจ๋“  ํ† ํฐ๋“ค์˜ ์ •๋ณด๋ฅผ ๋ฐ˜์˜ํ•˜์—ฌ ํ•™์Šต์ด ๊ฐ€๋Šฅํ•˜๋‹ค.

 

Decoder layer์˜ ‘masked multi-head attention’

Decoder์˜ multi-head attention์—์„œ๋Š” ๊ฐ ํ† ํฐ์ด ํ•ด๋‹น ์œ„์น˜๋ฅผ ํฌํ•จํ•œ decoder์˜ ๋ชจ๋“  ์œ„์น˜์˜ ํ† ํฐ์—์„œ attention ์—ฐ์‚ฐ์„ ํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•œ๋‹ค. ๋‹จ, ์ž๊ธฐ ํšŒ๊ท€(auto-regression) ํŠน์„ฑ์„ ๋งŒ์กฑํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” ํ˜„์žฌ ํ† ํฐ ์ดํ›„์— ์กด์žฌํ•˜๋Š” ํ† ํฐ๋“ค์˜ ์ •๋ณด๋ฅผ ์ฐธ๊ณ ํ•˜๋Š” ๊ฒƒ์„ ๋ฐฉ์ง€ํ•˜๊ธฐ ์œ„ํ•ด ๋’ค์˜ ํ† ํฐ๋“ค์„ masking(-∞๋กœ ์ฒ˜๋ฆฌ)ํ•˜์—ฌ scaled dot-product attention์„ ์ง„ํ–‰ํ•œ๋‹ค.

Position-wise Feed-Forward Networks

๊ฐ sub-layer ๋ณ„ ์—ฐ์‚ฐ ์ดํ›„ ์œ„์น˜ ๋ณ„ ๊ฐœ๋ณ„์ ์œผ๋กœ fully connected feed-forward network๋ฅผ ์ ์šฉํ•œ๋‹ค. ํ•ด๋‹น sub-layer๋Š” ๋‘ ๊ฐœ์˜ ์„ ํ˜• ๋ณ€ํ™˜๊ณผ ReLU activation function์œผ๋กœ ๊ตฌ์„ฑ๋˜์–ด์žˆ๋‹ค. ํ† ํฐ์˜ ์œ„์น˜ ๋ณ„ ์—ฐ์‚ฐ์ด ์ˆ˜ํ–‰๋˜๊ธฐ ๋•Œ๋ฌธ์—, ๋‹ค๋ฅธ ํ† ํฐ๋“ค์˜ ์ •๋ณด๋“ค์ด ์„ž์ด์ง€ ์•Š๊ณ  ๋…๋ฆฝ์ ์œผ๋กœ ์—ฐ์‚ฐ์ด ๊ฐ€๋Šฅํ•˜๋‹ค. ๋˜ํ•œ, ReLU ํ•จ์ˆ˜๋ฅผ ์ ์šฉํ•˜์—ฌ ํ† ํฐ์˜ ๋น„์„ ํ˜• ๋ณ€ํ™˜์„ ์ง„ํ–‰ํ•จ์œผ๋กœ์จ ๊ฐ ํ† ํฐ์˜ ํ‘œํ˜„์„ ๊ฐ•ํ™”ํ•˜๋Š” ์—ญํ• ์„ ์ˆ˜ํ–‰ํ•  ์ˆ˜ ์žˆ๋‹ค.

 

$$ \text{FFN}(x)=\max(0, xW_1+b_1)W_2+b_2 $$

 

ํ•ด๋‹น ์—ฐ์‚ฐ์€ ์ˆ˜ํ–‰๋˜๋Š” ์œ„์น˜๋Š” ๋™์ผํ•˜์ง€๋งŒ, $W_1$๊ณผ $W_2$์™€ ๊ด€๋ จ๋œ ํŒŒ๋ผ๋ฏธํ„ฐ๋“ค์€ ๋ ˆ์ด์–ด๋งˆ๋‹ค ๋ณ„๋„๋กœ ํ•™์Šต์ด ์ง„ํ–‰๋œ๋‹ค. ์ด ์—ฐ์‚ฐ์„ ๋‹ค๋ฅด๊ฒŒ ํ‘œํ˜„ํ•˜์ž๋ฉด ์ปค๋„ ํฌ๊ธฐ๊ฐ€ 1์ธ ๋‘๋ฒˆ์˜ convolution ์—ฐ์‚ฐ๊ณผ ๋™์ผํ•˜๋‹ค. ๋˜ํ•œ ReLU ํ•จ์ˆ˜๋ฅผ ์ ์šฉํ•˜์—ฌ ๋น„์„ ํ˜• ๋ณ€ํ™˜์„ ์ˆ˜ํ–‰ํ•จ์œผ๋กœ์จ ๊ฐ ๋ชจ๋ธ์ด ํ† ํฐ ๊ฐ„ ๋” ๋ณต์žกํ•œ ํŒจํ„ด๊ณผ ๊ด€๊ณ„๋ฅผ ํ•™์Šตํ•  ์ˆ˜ ์žˆ๊ฒŒ ํ•œ๋‹ค.

 

ํ•ด๋‹น ๋ ˆ์ด์–ด์—์„œ๋Š” ๊ฒฐ๊ตญ 512์ฐจ์› → 2048์ฐจ์› → 512์ฐจ์› ์ด ๋‘๋ฒˆ์˜ ์ฐจ์› ๋ณ€ํ™˜์„ ๊ฑฐ์ณ ํ† ํฐ์˜ ํ‘œํ˜„์„ ์žฌ๊ตฌ์„ฑํ•จ์œผ๋กœ์จ ๊ฐ ํ† ํฐ์˜ ํ‘œํ˜„์„ ๋”์šฑ ํ’๋ถ€ํ•˜๊ณ  ์˜๋ฏธ์žˆ๊ฒŒ ๋งŒ๋“œ๋Š” ์—ญํ• ์„ ์ˆ˜ํ–‰ํ•œ๋‹ค.

Embeddings and Softmax

Transformer์—์„œ ์ž…์ถœ๋ ฅ์œผ๋กœ ์‚ฌ์šฉํ•˜๋Š” ๊ฐ๊ฐ์˜ ํ† ํฐ๋“ค์€ ๋‹ค๋ฅธ ์‹œํ€€์Šค ๋ณ€ํ™˜ ๋ชจ๋ธ๋“ค๊ณผ ๋™์ผํ•˜๊ฒŒ ํ•™์Šต๋œ ์ž„๋ฒ ๋”ฉ์„ ์‚ฌ์šฉํ•ด ๊ฐ ํ† ํฐ์„ $d_{model}$ ์ฐจ์›์˜ ํ‘œํ˜„ ๋ฒกํ„ฐ๋กœ ๋ณ€ํ™˜ํ•œ๋‹ค. ๋˜ํ•œ decoder์˜ ์ถœ๋ ฅ์—์„œ๋Š” softmax ํ•จ์ˆ˜๋ฅผ ํ†ตํ•ด ๋‹ค์Œ ํ† ํฐ์˜ ๋“ฑ์žฅ ํ™•๋ฅ ์„ ์˜ˆ์ธกํ•œ๋‹ค. ์ฆ‰, Transformer๋Š” ์ž…/์ถœ๋ ฅ ๋ชจ๋‘ ๋™์ผํ•œ ์ฐจ์›์˜ ๋ฒกํ„ฐ ๊ณต๊ฐ„์œผ๋กœ ์ž„๋ฒ ๋”ฉํ•˜๊ณ , ๊ฐ€์ค‘์น˜ ๊ณต์œ  · ์Šค์ผ€์ผ๋ง์„ ํ†ตํ•ด ๋ชจ๋ธ์˜ ํšจ์œจ์„ฑ, ์•ˆ์ •์„ฑ์„ ๋†’์ธ๋‹ค.


 💡 What's the difference between Embedding / Encoding / Representation?

- Embedding: the "initial conversion" of a word into a fixed-dimensional vector.

- Encoding: the learned process of adding contextual information to the embedded token vectors

- Representation: the final (hidden) vector produced by encoding, reflecting context and meaning

This is a slightly different concept from the output embedding.. The output embedding is the set of vectors used to predict actual words from the hidden representation (i.e., the decoder's representation), whereas a representation is an intermediate or final vector produced by the model.
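
A rough sketch of the input side and the output side sharing one embedding matrix, along the lines of the weight sharing and scaling mentioned above (the vocabulary size and initialization here are only illustrative):

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d_model, vocab_size = 512, 32000                      # vocab size chosen for illustration
W_emb = np.random.normal(size=(vocab_size, d_model)) * 0.02   # shared weight matrix

def embed(token_ids):
    # Input embedding: look up rows and scale by sqrt(d_model)
    return W_emb[np.asarray(token_ids)] * np.sqrt(d_model)

def next_token_probs(decoder_repr):
    # The pre-softmax linear transformation reuses the same matrix (weight tying),
    # then softmax turns the logits into a next-token probability distribution.
    return softmax(decoder_repr @ W_emb.T)
```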


Positional Encoding

Transformer์—์„œ๋Š” convolution์ด๋‚˜ recurrence ๊ธฐ๋ฒ•์„ ์‚ฌ์šฉํ•˜์ง€ ์•Š๊ธฐ ๋•Œ๋ฌธ์—, ๋ฐ์ดํ„ฐ ‘์‹œํ€€์Šค’์— ๋Œ€ํ•œ ์ •๋ณด๋ฅผ ํ™œ์šฉํ•˜๊ธฐ ์œ„ํ•ด์„œ๋Š” ์ž…์ถœ๋ ฅ ์‹œํ€€์Šค๋กœ ๊ฐ–๋Š” ํ‘œํ˜„์— ์œ„์น˜ ์ •๋ณด๋ฅผ ๋ฐ˜๋“œ์‹œ ํฌํ•จํ•ด์ค˜์•ผ ํ•œ๋‹ค. ๋”ฐ๋ผ์„œ ๋ณธ ๋…ผ๋ฌธ์—์„œ๋Š” input embedding์— “positional encoding”์„ ์ถ”๊ฐ€ํ•˜์—ฌ encoder์™€ decoder ๋ชจ๋ธ์—์„œ ์‚ฌ์šฉํ•œ๋‹ค. “positional encoding”์€ $d_{model}$๊ณผ ๊ฐ™์€ ์ฐจ์›์ด๊ธฐ ๋•Œ๋ฌธ์—, ๋‘˜์„ ๋”ํ•˜์—ฌ ์‚ฌ์šฉํ•  ์ˆ˜ ์žˆ๋‹ค.

์—ฌ๊ธฐ์„œ๋Š” ์‹ธ์ธ๊ณผ ์ฝ”์‹ธ์ธ ํ•จ์ˆ˜๋ฅผ ์‚ฌ์šฉํ•˜์—ฌ ๊ฐ ์œ„์น˜๋ฅผ ๊ณ ์œ ํ•˜๊ฒŒ ์ธ์ฝ”๋”ฉํ•˜๋Š”๋ฐ, ์ด๋ฅผ ํ†ตํ•ด ์„œ๋กœ ๋‹ค๋ฅธ ์ฃผ๊ธฐ ํŒจํ„ด์„ ๊ฒฐํ•ฉํ•˜์—ฌ ์œ„์น˜ ์ •๋ณด๋ฅผ ํšจ๊ณผ์ ์œผ๋กœ ์ „๋‹ฌํ•˜๊ณ ์ž ํ•˜์˜€๋‹ค.(ํ‘ธ๋ฆฌ์—๋ณ€ํ™˜์˜ ์›๋ฆฌ์™€ ์œ ์‚ฌํ•˜๋‹ค ํ•จ!!)

 

$$ PE_{(pos, 2i)}=\sin(pos/10000^{2i/d_{model}})\\PE_{(pos, 2i+1)}=\cos(pos/10000^{2i/d_{model}}) $$

 

pos is the position and i is the dimension. Sinusoids whose wavelengths form a geometric progression from 2$\pi$ to 10000·2$\pi$ are used, which lets the model easily compute relative distances between tokens regardless of sequence length. Experiments with learned positional embeddings gave nearly identical performance in most cases, and the sinusoidal scheme was chosen because it also generalizes well to sequences longer than those seen during training.
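
A direct NumPy translation of the two formulas (the maximum length is an arbitrary choice for the sketch):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """PE[pos, 2i] = sin(pos / 10000^(2i/d_model)),
    PE[pos, 2i+1] = cos(pos / 10000^(2i/d_model))."""
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angle = pos / np.power(10000.0, (2 * i) / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angle)                        # even dimensions
    pe[:, 1::2] = np.cos(angle)                        # odd dimensions
    return pe

# Same dimension as the token embeddings, so it is simply added:
# x = input_embedding + positional_encoding(seq_len, 512)
```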

 

Why Self-Attention

๊ทธ๋ž˜์„œ ๋‹ค์‹œ ๋งํ•˜์ž๋ฉด, ๋„๋Œ€์ฒด self-attention์ด ์–ด๋–ค ์ ์ด ๊ธฐ์กด ์ˆœํ™˜ ๋ชจ๋ธ/์ปจ๋ณผ๋ฃจ์…˜ ๋ ˆ์ด์–ด์™€ ์ฐจ๋ณ„์ ์ด ์žˆ๋Š” ๊ฒƒ์ผ๊นŒ?

๊ฐ€์žฅ ํฐ ์ฐจ์ด์ ์„ ๊ผฝ์•„๋ณด๋ผ๋ฉด ๋ ˆ์ด์–ด ๋ณ„ ์—ฐ์‚ฐ ๋ณต์žก๋„, ๋ณ‘๋ ฌํ™” ๊ฐ€๋Šฅํ•œ ์—ฐ์‚ฐ์˜ ์–‘์ด ์žˆ๋‹ค. ๊ทธ ์™ธ์—๋„ ์žฅ๊ธฐ ์˜์กด์„ฑ(long-range dependencies) ๊ฐ„ path์˜ ๊ธธ์ด๋„ ์žˆ๋‹ค.

 

Self-attention ๋ ˆ์ด์–ด๋Š” ๋ชจ๋“  ์œ„์น˜์˜ ํ† ํฐ๋“ค์„ ์ƒ์ˆ˜์‹œ๊ฐ„ ๋‚ด์— ์‹คํ–‰๋˜๋Š” ์—ฐ์‚ฐ์„ ํ†ตํ•ด ์—ฐ๊ฒฐ ๊ฐ€๋Šฅํ•œ ๋ฐ˜๋ฉด, ์ˆœํ™˜ ๋ชจ๋ธ์€ O(n)์˜ ์—ฐ์‚ฐ์ด ํ•„์š”ํ•˜๋‹ค. ์—ฐ์‚ฐ๋Ÿ‰ ์ธก๋ฉด์—์„œ ์‹œํ€€์Šค ๊ธธ์ด n์ด ํ‘œํ˜„์˜ ์ฐจ์› d๋ณด๋‹ค ์ž‘์„ ๋•Œ(n < d) self-attention layer๋Š” ์ˆœํ™˜ ๋ชจ๋ธ๋ณด๋‹ค ๋น ๋ฅด๊ณ , ๊ทธ ๋ฐ˜๋Œ€์˜ ๊ฒฝ์šฐ ์—ฐ์‚ฐ์˜ ํšจ์œจ์„ ์œ„ํ•ด ์ž…๋ ฅ ์‹œํ€€์Šค์˜ ๊ฐ ์œ„์น˜๋กœ๋ถ€ํ„ฐ r ๊ฑฐ๋ฆฌ ๋‚ด์˜ ์ด์›ƒ๋“ค๋งŒ ๊ณ ๋ คํ•  ์ˆ˜ ์žˆ๋„๋ก attention ์—ฐ์‚ฐ์˜ ๋ฒ”์œ„๋ฅผ ์ œํ•œํ•  ์ˆ˜ ์žˆ๋‹ค. ์ด๋ ‡๊ฒŒ ํ•˜๋ฉด ์ตœ๋Œ€ ๊ฒฝ๋กœ์˜ ๊ธธ์ด๋ฅผ O(n/r) ๊นŒ์ง€ ์ฆ๊ฐ€์‹œํ‚ฌ ์ˆ˜ ์žˆ๋‹ค.

The same applies when comparing against convolution. A single convolutional layer with kernel width k < n cannot connect all pairs of input and output positions. Connecting all positions requires O(n/k) layers with contiguous kernels, or O(log_k(n)) layers with dilated convolutions, which can lengthen the paths between any two positions in the network.

 

Separable convolution* ์ ์šฉํ•˜๋ฉด ๋ณต์žก๋„๋ฅผ ๋‚ฎ์ถœ ์ˆ˜ ์žˆ๊ธฐ ๋•Œ๋ฌธ์—, ๋ณธ ๋ชจ๋ธ์—์„œ๋Š” feed-forward ๋ถ€๋ถ„์—์„œ ํ•ด๋‹น ๋ฐฉ์‹์„ ์ฑ„ํƒํ•˜๊ฒŒ ๋จ

* Separable convolution: a technique that decomposes a standard convolution into two stages

- Depthwise convolution: applies a separate filter to each input channel to extract spatial information (local patterns within the feature map); no mixing across channels happens at this stage.

- Pointwise convolution (1×1 convolution): applies a 1×1 filter to the depthwise output to combine information across channels

 

๊ทธ ์™ธ์—๋„ self-attention์„ ํ†ตํ•ด ๋” ํ•ด์„์ด ์šฉ์ดํ•œ ๋ชจ๋ธ์„ ์„ค๊ณ„ํ•  ์ˆ˜ ์žˆ๊ณ , multi-head attention์˜ ๊ฒฐ๊ณผ๋กœ ๊ฐœ๋ณ„์ ์ธ head๋“ค์ด ๋ฌธ์žฅ์˜ ๊ตฌ๋ฌธ์ด๋‚˜ ์˜๋ฏธ ๊ตฌ์กฐ์™€ ๊ฐ™์€ ๋ฌธ๋งฅ ํŒŒ์•…์— ์žˆ์–ด ๋‹ค์–‘ํ•œ ์˜๋ฏธ๊ฐ€ ๋‹ด๊ธด ํŒจํ„ด๋“ค์„ ํ•™์Šต ๊ฒฐ๊ณผ๋กœ ๋‚ด๋†“๊ธฐ ๋•Œ๋ฌธ์— ์ž์—ฐ์–ด ์ฒ˜๋ฆฌ ๋ชจ๋ธ์— ์žˆ์–ด ํš๊ธฐ์ ์ธ ํ•™์Šต์ด ๊ฐ€๋Šฅํ•จ์„ ์•Œ ์ˆ˜ ์žˆ๋‹ค.
