Tensorflow Mesh needs documentation. Will this be provided anytime soon? #276

shyamalschandra opened this issue Jan 14, 2021 · 1 comment


shyamalschandra commented Jan 14, 2021

I read the Switch Transformers paper as carefully as possible. However, none of the parameters below are glossarized or well defined, either in the code or in the paper. For example, the code contains the following uncommented parameter lists:

      num_experts=16,
      loss_coef=1e-2,
      hidden_size=4096,
      group_size=1024,
      capacity_factor_train=1.25,
      capacity_factor_eval=2.0,
      use_second_place_loss=False,
      second_policy_train="random",
      second_policy_eval="random",
      second_threshold_train=0.2,
      second_threshold_eval=0.2,
      dropout_rate=0.0,
      activation="relu",
      moe_gating="top_2",
      min_expert_capacity=4,
      rand_1_policy_train="input_jitter",
      rand_1_policy_eval="input_jitter",
      rand_1_dropout=0.1,
      rand_1_temperature=1.0,
      rand_1_jitter=1e-2,
      switch_top_k=4,
      output_dim=None,
      use_experts_attention=False):
    self._hparams = HParams(
        moe_gating=moe_gating,
        moe_num_experts=num_experts,
        moe_loss_coef=loss_coef,
        moe_hidden_size=hidden_size,
        moe_group_size=group_size,
        moe_min_expert_capacity=min_expert_capacity,
        moe_capacity_factor_train=capacity_factor_train,
        moe_capacity_factor_eval=capacity_factor_eval,
        moe_use_second_place_loss=use_second_place_loss,
        moe_second_policy_train=second_policy_train,
        moe_second_policy_eval=second_policy_eval,
        moe_second_threshold_train=second_threshold_train,
        moe_second_threshold_eval=second_threshold_eval,
        moe_dropout_rate=dropout_rate,
        moe_rand_1_policy_train=rand_1_policy_train,
        moe_rand_1_policy_eval=rand_1_policy_eval,
        moe_rand_1_dropout=rand_1_dropout,
        moe_rand_1_temperature=rand_1_temperature,
        moe_rand_1_jitter=rand_1_jitter,
        moe_output_dim=output_dim,
        moe_switch_top_k=switch_top_k,
        moe_use_experts_attention=use_experts_attention)
    self._activation = activation
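
For context, these appear to be the constructor arguments of MoE1D in transformer/moe.py. Below is my best guess at how a few of them interact, based only on my reading of the Switch Transformers and GShard papers rather than on any documentation, so the helper name and the formula are assumptions on my part:

# Rough sketch (my assumption, not official documentation): how the per-expert
# token budget seems to follow from group_size, num_experts and the capacity
# factors listed above.
def approx_expert_capacity(group_size=1024, num_experts=16,
                           capacity_factor=1.25, min_expert_capacity=4):
    # Tokens are routed within a "group"; on average each expert receives
    # group_size / num_experts tokens, and the capacity factor adds head-room
    # before overflowing tokens are dropped.
    capacity = int(group_size * capacity_factor / num_experts)
    return max(capacity, min_expert_capacity)

print(approx_expert_capacity())                     # 80 tokens/expert with capacity_factor_train
print(approx_expert_capacity(capacity_factor=2.0))  # 128 tokens/expert with capacity_factor_eval

The second constructor below has the analogous, but two-level, parameters.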

      expert_x=8,
      expert_y=8,
      loss_coef=1e-2,
      hidden_size=4096,
      group_size=1024,
      capacity_factor_train=1.25,
      capacity_factor_eval=2.0,
      capacity_factor_second_level=1.0,
      use_second_place_loss=False,
      second_policy_train="random",
      second_policy_eval="random",
      second_threshold_train=0.2,
      second_threshold_eval=0.2):
    self._hparams = HParams(
        moe_gating="top_2",
        moe_num_experts=[expert_x, expert_y],
        moe_loss_coef=loss_coef,
        moe_hidden_size=hidden_size,
        moe_group_size=group_size,
        moe_capacity_factor_train=capacity_factor_train,
        moe_capacity_factor_eval=capacity_factor_eval,
        moe_capacity_factor_second_level=capacity_factor_second_level,
        moe_use_second_place_loss=use_second_place_loss,
        moe_second_policy_train=second_policy_train,
        moe_second_policy_eval=second_policy_eval,
        moe_second_threshold_train=second_threshold_train,
        moe_second_threshold_eval=second_threshold_eval)
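
Similarly, for this second, two-level constructor (presumably MoE2D), the only thing I can infer from the code itself is that moe_num_experts becomes a pair and the experts form a 2-D grid; the sketch below is my assumption, not documented behaviour:

# Sketch (assumption): the hierarchical variant appears to gate first over
# expert_x groups and then over expert_y experts within the chosen group,
# so the total expert count is the product of the two levels.
expert_x, expert_y = 8, 8
moe_num_experts = [expert_x, expert_y]   # mirrors the HParams above
total_experts = expert_x * expert_y      # 64 experts in total
# capacity_factor_second_level presumably plays the same head-room role as
# capacity_factor_train, but at the second gating level.
print(moe_num_experts, total_experts)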

However, I do see "some" documentation in the following docstring, although it is not inline with the parameters in the code:

"""2-level mixture of experts.
Adapted from the paper https://arxiv.org/abs/1701.06538
Note: until the algorithm and interface solidify, we pass in a hyperparameters
dictionary in order not to complicate the interface in mtf_transformer.py.
Once this code moves out of "research", we should pass the hyperparameters
separately.
Hyperparameters used:
hparams.moe_num_experts: number of experts
hparams.moe_hidden_size: size of hidden layer in each expert
hparams.moe_group_size: size of each "group" for gating purposes
hparams.moe_capacity_factor_train: a float
hparams.moe_capacity_factor_eval: a float
hparams.moe_capacity_factor_second_level: a float
hparams.moe_gating: a string
+ all hyperparameters used by _top_2_gating()
One set of params is used for the experts in the first level, and a different
set of hparams per expert in the second level.
The number of parameters in the gating network is:
(input_dim.size * hparams.num_experts) +
(moe_hidden_size * hparams.num_experts * hparams.num_experts)
The number of parameters in the experts themselves is:
(hparams.num_experts
* (input_dim.size + output_dim.size)
* hparams.moe_hidden_size)
The input is n-dimensional: [<batch_and_length_dims>, input_dim], consisting
of the representations of all positions in a batch of sequences.
Each position of each sequence is sent to 0-3 experts. The expert
choices and the combination weights are determined by a learned gating
function.
This function returns a small auxiliary loss that should be added to the
training loss of the model. This loss helps to balance expert usage.
Without the loss, it is very likely that a few experts will be trained and
the rest will starve.
Several hacks are necessary to get around current TPU limitations:
- To ensure static shapes, we enforce (by truncation/padding)
that each sequence send the same number of elements to each expert.
It would make more sense to enforce this equality over the entire batch,
but due to our hacked-up gather-by-matmul implementation, we need to divide
the batch into "groups". For each group, the same number of elements
are sent to each expert.
TODO(noam): Factor this code better. We want to be able to substitute
different code for the experts themselves.
Dimensions cheat sheet:
a, b: batch size
l: original sequence length
m: input depth
n: output depth
g, h: number of groups
s, t: group size
x, y: number of experts
c, d: expert capacity
input: [a0, b1, l, m]
input: [a0, g1, s, m]
dispatch_tensor_x: [a0, g1, s, x, c]
expert_input: [a0, g1, x, c, m]
alltoall: [a0, g, x1, c, m]
alltoall: [a0, g, x1, c, m]
transpose: [x1, a0, g, c, m]
reshape: [x1, h0, s, m]
assignment2: [x1, h0, t, y, d]
expert_input2: [x1, h0, y, d, m]
alltoall: [x1, h, y0, d, m]
...
reverse of that
gating params 0: [m, x]
gating params 1: [x1, m, y]
expert params:
[x1, y0, m, hidden]
[x1, y0, hidden, n]
Args:
inputs: a mtf.Tensor with shape [a, b, l, m]
output_dim: a mtf.Dimension (for Transformer, this is input_dim)
hparams: model hyperparameters
train: a boolean
variable_dtype: a mtf.VariableDType
layout: optional - an input to mtf.convert_to_layout_rules
mesh_shape: optional - an input to mtf.convert_to_shape
nonpadding: an optional mtf.Tensor with shape [a, b, l]
and the same dtype as inputs, consisting of ones(nonpadding)
and zeros(padding).
num_microbatches: number of microbatches.
Returns:
outputs: a Tensor with shape [a, b, l, n]
loss: a mtf scalar
Raises:
ValueError: on unrecognized hparams.moe_gating
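
To make the docstring's parameter-count formula at least somewhat concrete, here is a small worked example plugging in the defaults quoted above; input_dim and output_dim never appear in the snippet, so the model dimension of 1024 is purely an assumption for illustration:

# Worked example of the docstring's formula for the experts' parameter count:
#   hparams.num_experts * (input_dim.size + output_dim.size) * hparams.moe_hidden_size
num_experts = 16
hidden_size = 4096
input_dim = output_dim = 1024   # assumed model dimension, not from the code

expert_params = num_experts * (input_dim + output_dim) * hidden_size
print(f"{expert_params:,}")     # 134,217,728 parameters in the experts alone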

How can I understand the paper if the explanation of the MoE code behind Switch Transformers is this unclear and abstract for most people, unless they have access to the authors of the paper?

shyamalschandra (Author) commented:
Is anyone out there able to shed some light on this? Any help would be appreciated.
