Ablation #2
base: master
Conversation
try:
    loss = model[0].eval_batch(data_iterator)  # average loss per sample per microbatch
    # difficult to know if it is the right way to get the total loss
    loss = loss * args.micro_batch_size * args.seq_length  # losses per token
Why do you want a total loss rather than an average loss?
I am not sure micro_batch_size is the correct one: it is the batch size per GPU; the effective batch size is macro_batch_size.
I would suggest saving the average loss per token AND the total number of tokens in the dataset (separately),
so that we can choose between the stats (average / total) and run checks based on the number of tokens.
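The suggestion above can be sketched in plain Python (the dictionary keys and values are illustrative, not from the actual code): if each shard stores its average loss per token together with its token count, either statistic can be recovered afterwards.

```python
import math

# Hypothetical per-shard stats, as the reviewer suggests saving:
# the average loss per token AND the token count (separately).
shard_stats = [
    {"avg_loss_per_token": 2.31, "num_tokens": 1_000_000},
    {"avg_loss_per_token": 2.45, "num_tokens": 750_000},
]

# Either stat can be recomputed later from these two numbers:
total_tokens = sum(s["num_tokens"] for s in shard_stats)
total_loss = sum(s["avg_loss_per_token"] * s["num_tokens"] for s in shard_stats)
avg_loss = total_loss / total_tokens  # token-weighted average across shards

ppl = math.exp(min(20, avg_loss))  # same capped-exponent form as the PR code
```

Weighting by token count matters here: averaging the per-shard averages directly would bias the result toward smaller shards.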
To clarify, I suggest using
loss = model[0].eval_batch(data_iterator)
loss_dicts = [{'lm loss' : loss, 'num_batches' : 1}]
and aggregating the losses and the number of batches where relevant (I think it is around line 417).
Then, at the very end, normalize by the number of batches.
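A minimal sketch of that proposal, with plain Python floats standing in for the Megatron loss tensors (the values are illustrative): accumulate 'lm loss' and 'num_batches' per batch, and divide only at the very end.

```python
# Per-batch dicts as the reviewer suggests emitting (one per eval_batch call).
loss_dicts = [
    {"lm loss": 2.0, "num_batches": 1},
    {"lm loss": 2.6, "num_batches": 1},
    {"lm loss": 2.3, "num_batches": 1},
]

# Aggregation step (the reviewer places this around line 417 of the file).
total_loss_dict = {"lm loss": 0.0, "num_batches": 0}
for d in loss_dicts:
    total_loss_dict["lm loss"] += d["lm loss"]
    total_loss_dict["num_batches"] += d["num_batches"]

# Normalize once, at the very end.
avg_loss = total_loss_dict["lm loss"] / total_loss_dict["num_batches"]
```

Keeping the raw sum and the count separate until the end avoids the question raised above of which batch-size constant to multiply by.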
ablation/perplexity.py
Outdated
if is_last_rank():
    val_loss = total_loss_dict['lm loss'].item() / (num_tokenized_tokens - 1)
Also, it seems to me (I may be wrong) that "is_last_rank" is only True on one GPU in the multi-GPU case.
Would that mean that, with multiple GPUs, we would ignore the results on the other "n-1" GPUs?
ablation/perplexity.py
Outdated
if is_last_rank():
    val_loss = total_loss_dict['lm loss'].item() / (num_tokenized_tokens - 1)
    ppl = math.exp(min(20, val_loss))
Suggested change:
- ppl = math.exp(min(20, val_loss))
+ dist.all_reduce(val_loss, op=ReduceOp.SUM)  # mean reduction is not supported
+ dist.all_reduce(ppl, op=ReduceOp.SUM)
+ dist.all_reduce(adjusted_ppl, op=ReduceOp.SUM)
+ dist.all_reduce(token_ratio, op=ReduceOp.SUM)
+ val_loss = val_loss / NB_SHARDS
+ token_ratio = token_ratio / NB_SHARDS
+ ppl = math.exp(min(20, val_loss))
+ adjusted_ppl = math.exp(min(20, val_loss * token_ratio))
Thanks, I'll try it out, hope it solves the synchronization problem ;)
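The arithmetic of the suggested reduction can be mirrored in plain Python (illustrative values; note that in the real code `torch.distributed.all_reduce` operates on tensors, so plain floats would first need to be wrapped in tensors): SUM-reduce the per-shard values, divide by NB_SHARDS to get the mean, then recompute the perplexities from the averaged loss.

```python
import math

# One value per shard/GPU; a SUM all-reduce followed by division by
# NB_SHARDS is equivalent to taking the mean across shards.
NB_SHARDS = 4
shard_val_losses = [2.1, 2.3, 2.2, 2.4]
shard_token_ratios = [1.10, 1.12, 1.08, 1.10]

val_loss = sum(shard_val_losses) / NB_SHARDS      # SUM then divide == mean
token_ratio = sum(shard_token_ratios) / NB_SHARDS

# Perplexities are recomputed AFTER averaging the loss: exp(mean(loss))
# is not the mean of the per-shard exp(loss) values.
ppl = math.exp(min(20, val_loss))
adjusted_ppl = math.exp(min(20, val_loss * token_ratio))
```

This is also why the suggestion recomputes `ppl` from the reduced `val_loss` rather than relying on the summed per-shard `ppl` values.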
- Add datasets: Pile (WIP) and Stac (tiny).
- Improve the folder organization a bit.
- Add zstandard to requirements (to read datasets in .jsonl.zst format).