Description:
I have identified an issue in the implementation of the Transformer architecture in the "Transformer Anatomy" chapter. It appears that layer normalization is applied twice in a row, leading to an inconsistency with the standard Transformer model.
Code Snippet:
Here, layer normalization is first applied to the output of the Embeddings layer:
import torch
from torch import nn

class Embeddings(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.token_embeddings = nn.Embedding(config.vocab_size, config.hidden_size)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings,
                                                config.hidden_size)
        self.layer_norm = nn.LayerNorm(config.hidden_size, eps=1e-12)
        self.dropout = nn.Dropout()

    def forward(self, input_ids):
        # Create position IDs for input sequence
        seq_length = input_ids.size(1)
        position_ids = torch.arange(seq_length, dtype=torch.long).unsqueeze(0)
        # Create token and position embeddings
        token_embeddings = self.token_embeddings(input_ids)
        position_embeddings = self.position_embeddings(position_ids)
        # Combine token and position embeddings
        embeddings = token_embeddings + position_embeddings
        #################### Here ##########################
        embeddings = self.layer_norm(embeddings)
        ####################################################
        embeddings = self.dropout(embeddings)
        return embeddings
Then it is applied again to the input of TransformerEncoderLayer, which is the output of the Embeddings layer:
class TransformerEncoderLayer(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.layer_norm_1 = nn.LayerNorm(config.hidden_size)
        self.layer_norm_2 = nn.LayerNorm(config.hidden_size)
        self.attention = MultiHeadAttention(config)
        self.feed_forward = FeedForward(config)

    def forward(self, x):
        # Apply layer normalization and then copy input into query, key, value
        ################### Again Here ######################
        hidden_state = self.layer_norm_1(x)
        ######################################################
        # Apply attention with a skip connection
        x = x + self.attention(hidden_state)
        # Apply feed-forward layer with a skip connection
        x = x + self.feed_forward(self.layer_norm_2(x))
        return x
class TransformerEncoder(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.embeddings = Embeddings(config)
        self.layers = nn.ModuleList([TransformerEncoderLayer(config)
                                     for _ in range(config.num_hidden_layers)])

    def forward(self, x):
        ######################################################
        x = self.embeddings(x)
        for layer in self.layers:
            x = layer(x)
        ######################################################
        return x
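To see why the back-to-back normalization is largely redundant, at least at initialization and ignoring the dropout that sits between the two calls, here is a minimal check. The hidden size of 768 and the tensor shape are assumptions for illustration, and both nn.LayerNorm modules use their default affine initialization (weight = 1, bias = 0):

import torch
from torch import nn

torch.manual_seed(0)
hidden_size = 768                          # assumed size, e.g. a BERT-base style model
x = torch.randn(1, 5, hidden_size)         # (batch, seq_len, hidden_size)

layer_norm_a = nn.LayerNorm(hidden_size)   # stands in for Embeddings.layer_norm
layer_norm_b = nn.LayerNorm(hidden_size)   # stands in for TransformerEncoderLayer.layer_norm_1

once = layer_norm_a(x)
twice = layer_norm_b(layer_norm_a(x))

# At initialization both affine transforms are the identity, so re-normalizing an
# already normalized tensor changes almost nothing beyond floating-point noise
print(torch.max(torch.abs(once - twice)))

During training the second LayerNorm does learn its own scale and shift, so the two arrangements are not strictly identical, but the model still normalizes essentially the same activations twice in a row on the way into the first encoder layer.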
Expected Behavior:
Layer normalization should be applied only once in the forward pass, either in the Embeddings module or in the TransformerEncoderLayer.
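For illustration, one possible adjustment (a minimal sketch, not necessarily the fix the authors would choose) is to drop the LayerNorm from Embeddings and rely on layer_norm_1 inside TransformerEncoderLayer, so the embeddings are normalized only once on entry to the first layer:

import torch
from torch import nn

class Embeddings(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.token_embeddings = nn.Embedding(config.vocab_size, config.hidden_size)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings,
                                                config.hidden_size)
        self.dropout = nn.Dropout()

    def forward(self, input_ids):
        # Create position IDs for input sequence
        seq_length = input_ids.size(1)
        position_ids = torch.arange(seq_length, dtype=torch.long).unsqueeze(0)
        # Create token and position embeddings
        token_embeddings = self.token_embeddings(input_ids)
        position_embeddings = self.position_embeddings(position_ids)
        # Combine token and position embeddings; normalization is left to the
        # encoder layers (layer_norm_1), so it is applied only once here
        embeddings = token_embeddings + position_embeddings
        return self.dropout(embeddings)

The alternative, keeping the LayerNorm in Embeddings and removing layer_norm_1, would leave the attention input of the deeper layers un-normalized, so removing the embedding-level norm seems like the smaller change.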
Additional Information:
Chapter: Transformer Anatomy
Section: The Encoder