Tensorflow Mesh needs documentation. Will this be provided anytime soon? #276

shyamalschandra opened this issue Jan 14, 2021 · 1 comment


shyamalschandra commented Jan 14, 2021

I read the Switch Transformers paper as carefully as possible. However, none of the parameters below are glossarized or well defined, either in the code or in the paper. For example, the code contains the following uncommented parameter lists:

      num_experts=16,
      loss_coef=1e-2,
      hidden_size=4096,
      group_size=1024,
      capacity_factor_train=1.25,
      capacity_factor_eval=2.0,
      use_second_place_loss=False,
      second_policy_train="random",
      second_policy_eval="random",
      second_threshold_train=0.2,
      second_threshold_eval=0.2,
      dropout_rate=0.0,
      activation="relu",
      moe_gating="top_2",
      min_expert_capacity=4,
      rand_1_policy_train="input_jitter",
      rand_1_policy_eval="input_jitter",
      rand_1_dropout=0.1,
      rand_1_temperature=1.0,
      rand_1_jitter=1e-2,
      switch_top_k=4,
      output_dim=None,
      use_experts_attention=False):
    self._hparams = HParams(
        moe_gating=moe_gating,
        moe_num_experts=num_experts,
        moe_loss_coef=loss_coef,
        moe_hidden_size=hidden_size,
        moe_group_size=group_size,
        moe_min_expert_capacity=min_expert_capacity,
        moe_capacity_factor_train=capacity_factor_train,
        moe_capacity_factor_eval=capacity_factor_eval,
        moe_use_second_place_loss=use_second_place_loss,
        moe_second_policy_train=second_policy_train,
        moe_second_policy_eval=second_policy_eval,
        moe_second_threshold_train=second_threshold_train,
        moe_second_threshold_eval=second_threshold_eval,
        moe_dropout_rate=dropout_rate,
        moe_rand_1_policy_train=rand_1_policy_train,
        moe_rand_1_policy_eval=rand_1_policy_eval,
        moe_rand_1_dropout=rand_1_dropout,
        moe_rand_1_temperature=rand_1_temperature,
        moe_rand_1_jitter=rand_1_jitter,
        moe_output_dim=output_dim,
        moe_switch_top_k=switch_top_k,
        moe_use_experts_attention=use_experts_attention)
    self._activation = activation
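
For context, these appear to be the constructor arguments of MoE1D in transformer/moe.py. Below is my best guess at how a few of them interact, based only on my reading of the Switch Transformers and GShard papers rather than on any documentation, so the helper name and the formula are assumptions on my part:

# Rough sketch (my assumption, not official documentation): how the per-expert
# token budget seems to follow from group_size, num_experts and the capacity
# factors listed above.
def approx_expert_capacity(group_size=1024, num_experts=16,
                           capacity_factor=1.25, min_expert_capacity=4):
    # Tokens are routed within a "group"; on average each expert receives
    # group_size / num_experts tokens, and the capacity factor adds head-room
    # before overflowing tokens are dropped.
    capacity = int(group_size * capacity_factor / num_experts)
    return max(capacity, min_expert_capacity)

print(approx_expert_capacity())                     # 80 tokens/expert with capacity_factor_train
print(approx_expert_capacity(capacity_factor=2.0))  # 128 tokens/expert with capacity_factor_eval

The second constructor below has the analogous, but two-level, parameters.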

      expert_x=8,
      expert_y=8,
      loss_coef=1e-2,
      hidden_size=4096,
      group_size=1024,
      capacity_factor_train=1.25,
      capacity_factor_eval=2.0,
      capacity_factor_second_level=1.0,
      use_second_place_loss=False,
      second_policy_train="random",
      second_policy_eval="random",
      second_threshold_train=0.2,
      second_threshold_eval=0.2):
    self._hparams = HParams(
        moe_gating="top_2",
        moe_num_experts=[expert_x, expert_y],
        moe_loss_coef=loss_coef,
        moe_hidden_size=hidden_size,
        moe_group_size=group_size,
        moe_capacity_factor_train=capacity_factor_train,
        moe_capacity_factor_eval=capacity_factor_eval,
        moe_capacity_factor_second_level=capacity_factor_second_level,
        moe_use_second_place_loss=use_second_place_loss,
        moe_second_policy_train=second_policy_train,
        moe_second_policy_eval=second_policy_eval,
        moe_second_threshold_train=second_threshold_train,
        moe_second_threshold_eval=second_threshold_eval)
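
Similarly, for this second, two-level constructor (presumably MoE2D), the only thing I can infer from the code itself is that moe_num_experts becomes a pair and the experts form a 2-D grid; the sketch below is my assumption, not documented behaviour:

# Sketch (assumption): the hierarchical variant appears to gate first over
# expert_x groups and then over expert_y experts within the chosen group,
# so the total expert count is the product of the two levels.
expert_x, expert_y = 8, 8
moe_num_experts = [expert_x, expert_y]   # mirrors the HParams above
total_experts = expert_x * expert_y      # 64 experts in total
# capacity_factor_second_level presumably plays the same head-room role as
# capacity_factor_train, but at the second gating level.
print(moe_num_experts, total_experts)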

However, I do see "some" documentation in the following docstring, although it is not inline with the parameters in the code:

"""2-level mixture of experts.
Adapted from the paper https://arxiv.org/abs/1701.06538
Note: until the algorithm and interface solidify, we pass in a hyperparameters
dictionary in order not to complicate the interface in mtf_transformer.py.
Once this code moves out of "research", we should pass the hyperparameters
separately.
Hyperparameters used:
hparams.moe_num_experts: number of experts
hparams.moe_hidden_size: size of hidden layer in each expert
hparams.moe_group_size: size of each "group" for gating purposes
hparams.moe_capacity_factor_train: a float
hparams.moe_capacity_factor_eval: a float
hparams.moe_capacity_factor_second_level: a float
hparams.moe_gating: a string
+ all hyperparameters used by _top_2_gating()
One set of params is used for the experts in the first level, and a different
set of hparams per expert in the second level.
The number of parameters in the gating network is:
(input_dim.size * hparams.num_experts) +
(moe_hidden_size * hparams.num_experts * hparams.num_experts)
The number of parameters in the experts themselves is:
(hparams.num_experts
* (input_dim.size + output_dim.size)
* hparams.moe_hidden_size)
The input is n-dimensional: [<batch_and_length_dims>, input_dim], consisting
of the representations of all positions in a batch of sequences.
Each position of each sequence is sent to 0-3 experts. The expert
choices and the combination weights are determined by a learned gating
function.
This function returns a small auxiliary loss that should be added to the
training loss of the model. This loss helps to balance expert usage.
Without the loss, it is very likely that a few experts will be trained and
the rest will starve.
Several hacks are necessary to get around current TPU limitations:
- To ensure static shapes, we enforce (by truncation/padding)
that each sequence send the same number of elements to each expert.
It would make more sense to enforce this equality over the entire batch,
but due to our hacked-up gather-by-matmul implementation, we need to divide
the batch into "groups". For each group, the same number of elements
are sent to each expert.
TODO(noam): Factor this code better. We want to be able to substitute
different code for the experts themselves.
Dimensions cheat sheet:
a, b: batch size
l: original sequence length
m: input depth
n: output depth
g, h: number of groups
s, t: group size
x, y: number of experts
c, d: expert capacity
input: [a0, b1, l, m]
input: [a0, g1, s, m]
dispatch_tensor_x: [a0, g1, s, x, c]
expert_input: [a0, g1, x, c, m]
alltoall: [a0, g, x1, c, m]
alltoall: [a0, g, x1, c, m]
transpose: [x1, a0, g, c, m]
reshape: [x1, h0, s, m]
assignment2: [x1, h0, t, y, d]
expert_input2: [x1, h0, y, d, m]
alltoall: [x1, h, y0, d, m]
...
reverse of that
gating params 0: [m, x]
gating params 1: [x1, m, y]
expert params:
[x1, y0, m, hidden]
[x1, y0, hidden, n]
Args:
inputs: a mtf.Tensor with shape [a, b, l, m]
output_dim: a mtf.Dimension (for Transformer, this is input_dim)
hparams: model hyperparameters
train: a boolean
variable_dtype: a mtf.VariableDType
layout: optional - an input to mtf.convert_to_layout_rules
mesh_shape: optional - an input to mtf.convert_to_shape
nonpadding: an optional mtf.Tensor with shape [a, b, l]
and the same dtype as inputs, consisting of ones(nonpadding)
and zeros(padding).
num_microbatches: number of microbatches.
Returns:
outputs: a Tensor with shape [a, b, l, n]
loss: a mtf scalar
Raises:
ValueError: on unrecognized hparams.moe_gating
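
To make the docstring's parameter-count formula at least somewhat concrete, here is a small worked example plugging in the defaults quoted above; input_dim and output_dim never appear in the snippet, so the model dimension of 1024 is purely an assumption for illustration:

# Worked example of the docstring's formula for the experts' parameter count:
#   hparams.num_experts * (input_dim.size + output_dim.size) * hparams.moe_hidden_size
num_experts = 16
hidden_size = 4096
input_dim = output_dim = 1024   # assumed model dimension, not from the code

expert_params = num_experts * (input_dim + output_dim) * hidden_size
print(f"{expert_params:,}")     # 134,217,728 parameters in the experts alone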

How can I understand the paper if the explanation of the MoE code behind Switch Transformers is this unclear and abstract for most people, unless they have access to the authors of the paper?

shyamalschandra (Author) commented:
Is anyone out there able to shed some light on this? Any help would be appreciated.
