[LG Aimers] [Deep Learning-5] Transformer

Author

JH

Tags

AI

LG Aimers

Published

January 13, 2025

How Transformer Model Works

Transformer: High-level View

In seq2seq, the attention module can serve as both the sequence encoder and the decoder.

In other words, RNNs and CNNs are no longer needed; attention modules alone are enough.

Long-term Dependency Issue of RNN Models

As the number of input steps grows, information from earlier inputs fades from the context, which causes the long-term dependency problem.

Transformer: Scaled Dot-product Attention

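Scaled dot-product attention computes the relationship between the query (Q), key (K), and value (V) using

$\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V$

As the key dimension $d_k$ grows, the variance of the dot products in $QK^\top$ grows and the softmax output concentrates on a few entries, so the scores are divided by $\sqrt{d_k}$ to keep them balanced.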
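
The computation above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming the standard form softmax(QK^T / sqrt(d_k)) V; the function names and the toy matrices are made up for the example, and a real implementation would use a tensor library rather than nested lists.

```python
import math

def softmax(row):
    """Numerically stable softmax over a list of floats."""
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def scaled_dot_product_attention(Q, K, V):
    """Q: n_q x d_k, K: n_k x d_k, V: n_k x d_v, all as lists of rows."""
    d_k = len(K[0])
    scale = 1.0 / math.sqrt(d_k)
    output, weights = [], []
    for q in Q:
        # Scaled similarity between this query and every key.
        scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in K]
        w = softmax(scores)
        weights.append(w)
        # Weighted average of the value vectors.
        output.append([sum(wj * v[c] for wj, v in zip(w, V))
                       for c in range(len(V[0]))])
    return output, weights

# Toy inputs (hypothetical values): 2 queries, 3 key/value pairs.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out, attn = scaled_dot_product_attention(Q, K, V)
```

Each row of `attn` sums to 1, so every output row is a weighted average of the rows of `V`.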

Transformer: Multi-head Attention

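A single attention operation cannot capture all the different kinds of relationships in a sequence, so multiple attention heads are used.

The outputs of the heads are concatenated (Concat) and passed through a linear transformation, which maps the result back to the original vector dimension.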
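
A rough sketch of the concatenate-then-project step follows. It assumes each head has its own query, key, and value projection matrices and that the head size is `d_model / num_heads`; the helper names, the random weights, and the sizes are illustrative only.

```python
import math
import random

def matmul(A, B):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def softmax(row):
    top = max(row)
    exps = [math.exp(x - top) for x in row]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention for one head."""
    d_k = len(K[0])
    K_T = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_T)
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, V)

def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]

def multi_head_attention(X, heads, W_o):
    """heads: list of (W_q, W_k, W_v) triples, one per head."""
    head_outputs = [attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
                    for W_q, W_k, W_v in heads]
    # Concatenate the per-head outputs along the feature axis.
    concat = [sum((h[i] for h in head_outputs), []) for i in range(len(X))]
    # Final linear map back to d_model.
    return matmul(concat, W_o)

rng = random.Random(0)
n, d_model, num_heads = 4, 8, 2          # hypothetical sizes
d_head = d_model // num_heads
X = random_matrix(n, d_model, rng)
heads = [tuple(random_matrix(d_model, d_head, rng) for _ in range(3))
         for _ in range(num_heads)]
W_o = random_matrix(num_heads * d_head, d_model, rng)
Y = multi_head_attention(X, heads, W_o)  # same shape as X: n x d_model
```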

Transformer: Quadratic Memory Complexity

Storing the $n \times n$ attention score matrix $QK^\top$ requires $O(n^2)$ memory, so the memory cost grows quadratically with the sequence length $n$.
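
A quick back-of-the-envelope calculation makes the quadratic growth concrete. The sketch assumes 4-byte (float32) scores and counts a single score matrix only, that is, one head in one layer for one sequence; the function name is hypothetical.

```python
def attention_matrix_bytes(seq_len, bytes_per_value=4):
    """Bytes needed to store one seq_len x seq_len score matrix."""
    return seq_len * seq_len * bytes_per_value

for n in (512, 1024, 2048):
    mib = attention_matrix_bytes(n) / 2**20
    print(f"n={n}: {mib:.0f} MiB")
# n=512: 1 MiB
# n=1024: 4 MiB
# n=2048: 16 MiB
```

Doubling the sequence length quadruples the memory for the score matrix.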

Transformer: Block-Based Model

A Transformer is built from a stack of blocks, and each block contains:

Multi-head Attention
Feed-forward network (with ReLU activation)

Within each block, a residual connection and Layer Normalization are applied around each of these two sub-layers.
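
The structure of one block can be sketched as below. The sketch assumes the post-norm ordering LayerNorm(x + Sublayer(x)) from the original Transformer paper and omits the learned scale and shift of Layer Normalization; `uniform_attention` is a hypothetical stand-in for multi-head attention so that the example stays short, and all weights are arbitrary.

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalize one vector to zero mean and unit variance (no learned scale/shift)."""
    mean = sum(x) / len(x)
    var = sum((v - mean) ** 2 for v in x) / len(x)
    return [(v - mean) / math.sqrt(var + eps) for v in x]

def linear(x, W, b):
    """Compute x @ W + b for one vector x and a weight matrix W given as rows."""
    return [sum(xi * W[i][j] for i, xi in enumerate(x)) + b[j] for j in range(len(b))]

def feed_forward(x, W1, b1, W2, b2):
    hidden = [max(0.0, h) for h in linear(x, W1, b1)]  # ReLU
    return linear(hidden, W2, b2)

def add_and_norm(x, sublayer_out):
    """Residual connection followed by layer normalization."""
    return layer_norm([a + b for a, b in zip(x, sublayer_out)])

def transformer_block(X, self_attention, ffn_params):
    # Sub-layer 1: self-attention over the whole sequence.
    attended = self_attention(X)
    X = [add_and_norm(x, a) for x, a in zip(X, attended)]
    # Sub-layer 2: position-wise feed-forward network.
    return [add_and_norm(x, feed_forward(x, *ffn_params)) for x in X]

def uniform_attention(X):
    """Stand-in for multi-head attention: every position attends equally to all."""
    n = len(X)
    avg = [sum(col) / n for col in zip(*X)]
    return [avg[:] for _ in X]

d_model, d_ff = 4, 8  # hypothetical sizes
W1 = [[0.1 * ((i + j) % 3 - 1) for j in range(d_ff)] for i in range(d_model)]
b1 = [0.0] * d_ff
W2 = [[0.1 * ((i * j) % 3 - 1) for j in range(d_model)] for i in range(d_ff)]
b2 = [0.0] * d_model
X = [[1.0, 2.0, 3.0, 4.0], [4.0, 3.0, 2.0, 1.0], [0.0, 1.0, 0.0, 1.0]]
Y = transformer_block(X, uniform_attention, (W1, b1, W2, b2))
```

Because input and output have the same shape, blocks of this kind can be stacked one after another.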

Transformer: Positional Encoding

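To let the model learn the order of the input words, position information based on sine and cosine functions is added to the input embeddings:

$PE_{(pos, 2i)} = \sin\left(pos / 10000^{2i/d_{\text{model}}}\right)$

$PE_{(pos, 2i+1)} = \cos\left(pos / 10000^{2i/d_{\text{model}}}\right)$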
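
As an illustration, the sinusoidal encoding can be tabulated as follows. The sketch assumes the formulation from the original Transformer paper, with sine on even feature indices, cosine on odd ones, and a base of 10000; the function name, the sizes, and the all-zero placeholder embeddings are hypothetical.

```python
import math

def positional_encoding(max_len, d_model):
    """Return a max_len x d_model table of sinusoidal position encodings."""
    pe = [[0.0] * d_model for _ in range(max_len)]
    for pos in range(max_len):
        for i in range(0, d_model, 2):
            # i is the even feature index (2i in the usual notation).
            angle = pos / (10000 ** (i / d_model))
            pe[pos][i] = math.sin(angle)
            if i + 1 < d_model:
                pe[pos][i + 1] = math.cos(angle)
    return pe

pe = positional_encoding(max_len=50, d_model=8)

# The encoding is added element-wise to the token embeddings.
embeddings = [[0.0] * 8 for _ in range(50)]  # placeholder embeddings
inputs = [[e + p for e, p in zip(emb, row)] for emb, row in zip(embeddings, pe)]
```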

Transformer: Decoder

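Two sub-layers are changed in the decoder:

Masked self-attention over the previously generated outputs

Encoder-Decoder attention: the queries come from the previous decoder layer, while the keys and values come from the encoder output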
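
The only difference between Encoder-Decoder attention and self-attention is where the three inputs come from, which the sketch below tries to show. It omits the learned query, key, and value projections for brevity, and the function name and toy vectors are hypothetical.

```python
import math

def attend(queries, keys, values):
    """Scaled dot-product attention on lists of row vectors."""
    d_k = len(keys[0])
    outputs = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d_k) for k in keys]
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        outputs.append([sum(w * v[c] for w, v in zip(weights, values))
                        for c in range(len(values[0]))])
    return outputs

encoder_output = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 source positions
decoder_states = [[0.5, 0.5], [1.0, -1.0]]             # 2 target positions

# Queries come from the decoder; keys and values both come from the encoder.
context = attend(decoder_states, encoder_output, encoder_output)
```

The result has one row per decoder position, and each row is a weighted mix of encoder outputs.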

Transformer: Masked Self-attention

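The decoder uses masking so that each position can attend only to words generated at earlier steps.

Masking works as follows:

It blocks access to words that have not been generated yet
It renormalizes the softmax output over the remaining positions, so that invalid words receive no attention weight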
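
A small sketch of how such a mask can be applied: scores at future positions are set to negative infinity before the softmax, so they end up with weight 0 and the remaining weights are renormalized to sum to 1. The function names and the score values are made up for the example.

```python
import math

def causal_mask(n):
    """mask[i][j] is True when position i may attend to position j (j <= i)."""
    return [[j <= i for j in range(n)] for i in range(n)]

def masked_softmax(scores, allowed):
    """Softmax over one row of scores, ignoring disallowed positions."""
    masked = [s if ok else -math.inf for s, ok in zip(scores, allowed)]
    top = max(masked)
    exps = [math.exp(s - top) for s in masked]  # exp(-inf) == 0.0
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical raw attention scores for a 3-token target sequence.
scores = [[0.2, 1.5, -0.3],
          [0.9, 0.1, 2.0],
          [0.4, 0.4, 0.4]]
mask = causal_mask(3)
weights = [masked_softmax(row, allowed) for row, allowed in zip(scores, mask)]
# weights[0] == [1.0, 0.0, 0.0]: the first position can only see itself.
```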
