[Paper Review] R-Drop: Regularized Dropout for Neural Networks #26

Yebin46 · 2021-12-06T14:45:59Z

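Motivation

Dropout is effective at helping deep learning models generalize, and it's great for performance too!
But with Dropout, the model used during training differs from the one used at inference, right...?🤔
(During training, dropout means not every unit is used, so you are effectively training a sub-model; at inference, the full model is used.)

So we came up with a training strategy for a more consistent model!
We use Regularized Dropout, which trains the sub-models created by dropout so that their output distributions match!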

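Why R-Drop is nice

It doesn't change the model architecture at all! You just train with one extra loss term😎
It works with any model, as long as the model has randomness (i.e. dropout)😎
- (In fact, beyond summarization, applying it to transformer-family models on NMT, language understanding, and other tasks improved performance over not using it. Please see the paper for details on the other tasks!)
  
  ▲ Achieved SOTA on CNN/Daily Mail abstractive summarization (currently 2nd place)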

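Explanation

Passing the same input x through the model twice has an effect similar to passing it through two different models, because dropout samples a different sub-model on each pass. Two passes give two output distributions, right? R-Drop works by training to reduce the KL-divergence between these two output distributions.

▲ Equation 1. The usual negative log-likelihood loss (the ordinary case with a single forward pass)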
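
Written out, this loss looks roughly as follows. The notation is my reading of the paper and does not appear in the post itself: n is the number of training pairs (x_i, y_i), and P^w is the model's output distribution under weights w.

```latex
\mathcal{L}_{nll} = \frac{1}{n} \sum_{i=1}^{n} -\log \mathcal{P}^{w}(y_i \mid x_i)
```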

'''
This is how it is implemented in code. This code runs when self.label_smoother is called while using the transformers Trainer.
'''
logits = model_output["logits"] if isinstance(model_output, dict) else model_output[0]
log_probs = -nn.functional.log_softmax(logits, dim=-1)
if labels.dim() == log_probs.dim() - 1:
    labels = labels.unsqueeze(-1)

padding_mask = labels.eq(self.ignore_index)
# In case the ignore_index is -100, the gather will fail, so we replace labels by 0. The padding_mask
# will ignore them in any case.
labels = torch.clamp(labels, min=0)
nll_loss = log_probs.gather(dim=-1, index=labels)
# works for fp16 input tensor too, by internally upcasting it to fp32
smoothed_loss = log_probs.sum(dim=-1, keepdim=True, dtype=torch.float32)

nll_loss.masked_fill_(padding_mask, 0.0)
smoothed_loss.masked_fill_(padding_mask, 0.0)

# Take the mean over the label dimensions, then divide by the number of active elements (i.e. not-padded):
num_active_elements = padding_mask.numel() - padding_mask.long().sum()
nll_loss = nll_loss.sum() / num_active_elements # divide by n, as in the equation.
smoothed_loss = smoothed_loss.sum() / (num_active_elements * log_probs.shape[-1])
return (1 - self.epsilon) * nll_loss + self.epsilon * smoothed_loss

▲ Equation 2. Negative log-likelihood loss over two forward passes
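
For a single pair (x_i, y_i), the two-pass loss is simply the sum of the two NLL terms. The symbols P_1^w and P_2^w, standing for the output distributions of the two dropout-perturbed passes, are assumed notation taken from my reading of the paper.

```latex
\mathcal{L}^{i}_{NLL} = -\log \mathcal{P}_{1}^{w}(y_i \mid x_i) - \log \mathcal{P}_{2}^{w}(y_i \mid x_i)
```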

def label_smoothed_nll_loss(self, model_output, labels, epsilon):
    logits = model_output["logits"] if isinstance(model_output, dict) else model_output[0]
    log_probs = -F.log_softmax(logits, dim=-1)
    if labels.dim() == log_probs.dim() - 1:
        labels = labels.unsqueeze(-1)

    padding_mask = labels.eq(self.label_smoother.ignore_index)
    labels = torch.clamp(labels, min=0)
    nll_loss = log_probs.gather(dim=-1, index=labels)
    smoothed_loss = log_probs.sum(dim=-1, keepdim=True, dtype=torch.float32)

    nll_loss.masked_fill_(padding_mask, 0.0)
    smoothed_loss.masked_fill_(padding_mask, 0.0)

    nll_loss = nll_loss.sum() # not divided by n.
    smoothed_loss = smoothed_loss.sum()
    eps_i = epsilon / log_probs.size(-1)
    return (1. - epsilon) * nll_loss + eps_i * smoothed_loss

▲ Equation 3. The loss function used for R-Drop training (NLL + KL loss). Alpha is a hyperparameter; the current code uses 0.7.
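
As a sketch of how the two terms combine: the KL term is symmetric (averaged over both directions), and alpha weights it against the two-pass NLL of Equation 2. The notation is assumed from my reading of the paper, with P_1^w and P_2^w the output distributions of the two passes.

```latex
\begin{aligned}
\mathcal{L}^{i}_{KL} &= \frac{1}{2}\Big( \mathcal{D}_{KL}\big(\mathcal{P}_{1}^{w}(y_i \mid x_i) \,\|\, \mathcal{P}_{2}^{w}(y_i \mid x_i)\big) + \mathcal{D}_{KL}\big(\mathcal{P}_{2}^{w}(y_i \mid x_i) \,\|\, \mathcal{P}_{1}^{w}(y_i \mid x_i)\big) \Big) \\
\mathcal{L}^{i} &= \mathcal{L}^{i}_{NLL} + \alpha \cdot \mathcal{L}^{i}_{KL}
\end{aligned}
```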

loss = self.label_smoothed_nll_loss(outputs, labels, 0.1) # 0.1 is epsilon (label smoothing factor)
kl_loss = self.compute_kl_loss(outputs, pad_mask)
loss += 0.7 * kl_loss
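
The body of compute_kl_loss is not shown above, so here is a minimal sketch of what such a function could look like. It is not the actual code from this project: the standalone signature (taking raw logits rather than the model output), the assumption that the first half of the batch is "pass 1" and the second half is "pass 2", and the pad_mask shape are all hypothetical choices for illustration.

```python
import torch
import torch.nn.functional as F


def compute_kl_loss(logits, pad_mask=None):
    """Symmetric KL between the two halves of a doubled batch (hypothetical helper).

    Assumptions (not from the original post):
      logits:   shape (2 * B, ..., V); first B rows are pass 1, last B rows are pass 2.
      pad_mask: optional bool tensor of shape (B, ...), True at padded positions.
    """
    p_logits, q_logits = torch.chunk(logits, 2, dim=0)
    p_log = F.log_softmax(p_logits, dim=-1)
    q_log = F.log_softmax(q_logits, dim=-1)

    # F.kl_div(input, target, log_target=True) computes, elementwise,
    # exp(target) * (target - input), i.e. the terms of KL(target || input).
    kl_pq = F.kl_div(q_log, p_log, reduction="none", log_target=True)  # KL(p || q)
    kl_qp = F.kl_div(p_log, q_log, reduction="none", log_target=True)  # KL(q || p)

    if pad_mask is not None:
        # Broadcast the mask over the vocabulary dimension and zero out padding.
        mask = pad_mask.unsqueeze(-1)
        kl_pq = kl_pq.masked_fill(mask, 0.0)
        kl_qp = kl_qp.masked_fill(mask, 0.0)

    # Summed rather than averaged, to match the un-normalized NLL above.
    return (kl_pq.sum() + kl_qp.sum()) / 2
```

With a helper like this, the total loss would be formed as `nll + 0.7 * compute_kl_loss(logits, pad_mask)`, mirroring the three lines above.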

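Things to note

I explained above that the same input goes through two forward passes, but to keep the computational cost down, the input x is concatenated along the batch dimension and passed through only once. So if you feed in a batch size of 8, the model actually sees a batch size of 16.
As with other regularization techniques, it needs more training to converge.

▲ Needs more training to converge, but achieves excellent performance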
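
The batch-concat trick described in the first note can be sketched as follows. The tiny nn.Sequential model is only a stand-in I made up for illustration (any module containing dropout would behave the same way); it is not the summarization model used here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-in for a real model; the layer sizes are arbitrary.
model = nn.Sequential(
    nn.Linear(16, 32),
    nn.ReLU(),
    nn.Dropout(p=0.3),
    nn.Linear(32, 10),
)
model.train()  # dropout is only active in training mode

x = torch.randn(8, 16)                 # the batch you feed in: batch size 8
x_doubled = torch.cat([x, x], dim=0)   # what the model actually sees: batch size 16

logits = model(x_doubled)              # a single forward pass over both copies
logits_1, logits_2 = torch.chunk(logits, 2, dim=0)  # outputs for "pass 1" and "pass 2"

print(x_doubled.shape, logits_1.shape, logits_2.shape)
```

The two halves come from identical inputs but generally differ, since dropout draws a separate mask for each row; those two halves are what the NLL and KL terms are computed on.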

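Q&A + experiments from the paper

Q. If you concat the same input and feed it in, doesn't it just go through the same model?
- A. I think we first need to look at how dropout is applied!
  ( Reference: https://stats.stackexchange.com/questions/335690/understanding-dropout-method-one-mask-per-batch-or-more )
  
  The mask is applied not to the model weights but to the inputs (activations), and it is sampled independently for each example in the batch, so in the end the same input goes through a different sub-model! (This was the first time I learned that dropout works this way too... haha)
Q. [Paper experiment] Wouldn't three or more forward passes give even better performance?
- A. They tried concatenating three copies, and the BLEU score on the NMT task rose by only 0.05, so they concluded that two passes are already enough to regularize the model.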
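
The answer to the first question (a separate dropout mask per example) is easy to check directly. This is a small sketch using a bare nn.Dropout layer; the tensor sizes and the p=0.5 rate are arbitrary values chosen for the demo.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.5)

row = torch.ones(1, 1000)
batch = torch.cat([row, row], dim=0)   # two identical inputs in one batch

# Training mode: each element gets its own random mask, so the two rows differ.
# Dropped units become 0; kept units are scaled by 1 / (1 - p).
dropout.train()
out = dropout(batch)
print("train, rows identical:", torch.equal(out[0], out[1]))

# Eval mode: dropout is a no-op, so identical inputs give identical outputs.
dropout.eval()
out_eval = dropout(batch)
print("eval, rows identical:", torch.equal(out_eval[0], out_eval[1]))
```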

Wrap-up

Overall the paper is explained cleanly, and apart from the theoretical analysis section there are no difficult concepts, so I recommend giving it a read if you're interested!
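Once you've confirmed that it outperforms the existing model, it would also be worth tuning hyperparameters such as the dropout ratio, reg_alpha, and epsilon. (According to the paper, on the NMT task a dropout ratio of 0.3 and a reg_alpha of 5 worked well, but this reportedly varies by task, so experiments are needed.)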
Thanks for reading 😁

Yebin46 self-assigned this Dec 6, 2021

Yebin46 added the documentation Improvements or additions to documentation label Dec 6, 2021
