Double Layer Normalization in TransformerEncoderLayer #129

Open
AliHaiderAhmad001 opened this issue Dec 20, 2023 · 0 comments
Description:

I have identified an issue in the implementation of the Transformer architecture in the "Transformer Anatomy" chapter. It appears that layer normalization is applied twice in a row, leading to an inconsistency with the standard Transformer model.

Code Snippet:
Here, layer normalization is first applied to the output of the Embeddings layer:

class Embeddings(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.token_embeddings = nn.Embedding(config.vocab_size, config.hidden_size)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings,
                                                config.hidden_size)
        self.layer_norm = nn.LayerNorm(config.hidden_size, eps=1e-12)
        self.dropout = nn.Dropout()

    def forward(self, input_ids):
        # Create position IDs for input sequence
        seq_length = input_ids.size(1)
        position_ids = torch.arange(seq_length, dtype=torch.long).unsqueeze(0)
        # Create token and position embeddings
        token_embeddings = self.token_embeddings(input_ids)
        position_embeddings = self.position_embeddings(position_ids)
        # Combine token and position embeddings
        embeddings = token_embeddings + position_embeddings
        #################### Here ##########################
        embeddings = self.layer_norm(embeddings)
        ###################################################
        embeddings = self.dropout(embeddings)
        return embeddings

Then it is applied again to the input of TransformerEncoderLayer, which is the output of the Embeddings layer:

class TransformerEncoderLayer(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.layer_norm_1 = nn.LayerNorm(config.hidden_size)
        self.layer_norm_2 = nn.LayerNorm(config.hidden_size)
        self.attention = MultiHeadAttention(config)
        self.feed_forward = FeedForward(config)
    def forward(self, x):
        # Apply layer normalization and then copy input into query, key, value
        ################### Again Here ######################
        hidden_state = self.layer_norm_1(x)
        ###################################################
        # Apply attention with a skip connection
        x = x + self.attention(hidden_state)
        # Apply feed-forward layer with a skip connection
        x = x + self.feed_forward(self.layer_norm_2(x))
        return x

class TransformerEncoder(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.embeddings = Embeddings(config)
        self.layers = nn.ModuleList([TransformerEncoderLayer(config) for _ in range(config.num_hidden_layers)])
    def forward(self, x):
        ####################################################
        x = self.embeddings(x)
        for layer in self.layers:
            x = layer(x)
        ####################################################
        return x

Expected Behavior:

Layer normalization should be applied only once at this point in the forward pass: either in the Embeddings module or at the start of TransformerEncoderLayer, but not in both.
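
For illustration, here is a minimal sketch of one possible fix that keeps the pre-layer-normalization arrangement of TransformerEncoderLayer and simply drops the extra normalization from Embeddings. This is only a sketch under the chapter's config conventions (vocab_size, hidden_size, max_position_embeddings), not an official correction:

import torch
from torch import nn

class Embeddings(nn.Module):
    def __init__(self, config):
        super().__init__()
        self.token_embeddings = nn.Embedding(config.vocab_size, config.hidden_size)
        self.position_embeddings = nn.Embedding(config.max_position_embeddings,
                                                config.hidden_size)
        # No LayerNorm here: layer_norm_1 inside TransformerEncoderLayer
        # already normalizes this output before attention.
        self.dropout = nn.Dropout()

    def forward(self, input_ids):
        # Create position IDs for the input sequence
        seq_length = input_ids.size(1)
        position_ids = torch.arange(seq_length, dtype=torch.long,
                                    device=input_ids.device).unsqueeze(0)
        # Combine token and position embeddings, then apply dropout only
        token_embeddings = self.token_embeddings(input_ids)
        position_embeddings = self.position_embeddings(position_ids)
        embeddings = token_embeddings + position_embeddings
        embeddings = self.dropout(embeddings)
        return embeddings

Alternatively, the embedding-side layer norm could be kept and layer_norm_1 dropped from the encoder layer; either way the double application is removed.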

Additional Information:

  • Chapter: Transformer Anatomy
  • Section: The Encoder