
Ablation #2

Open · wants to merge 41 commits into master
Conversation

Jeronymous (Member)

No description provided.

try:
loss = model[0].eval_batch(data_iterator) # average loss per sample per microbatch
# difficult to know if it is the right way to get the total loss
loss = loss * args.micro_batch_size * args.seq_length # losses per token
Jeronymous (Member Author), Jun 26, 2024

Why do you want a total loss, and not an average loss?

I am not sure micro_batch_size is the right one: it is the batch size per GPU; the effective batch size is macro_batch_size.

I would suggest saving the average loss per token AND the total number of tokens in the dataset (separately),
so that we can choose between the stats (average / total) and run checks based on the numbers of tokens.
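A tiny sketch of that bookkeeping idea; the field names (`avg_loss_per_token`, `num_tokens`) are hypothetical, chosen for illustration and not taken from the codebase:

```python
# Save the average loss per token AND the total number of tokens separately
# (hypothetical field names, for illustration only).
stats = {"avg_loss_per_token": 2.31, "num_tokens": 1_000_000}

# Either statistic can be recovered later: the total loss is just the product.
total_loss = stats["avg_loss_per_token"] * stats["num_tokens"]

# And the token count stays available for sanity checks.
assert stats["num_tokens"] > 0
```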

Jeronymous (Member Author), Jun 26, 2024

To clarify, I suggest using

loss = model[0].eval_batch(data_iterator)
loss_dicts = [{'lm loss' : loss, 'num_batches' : 1}]

and aggregating the losses and the numbers of batches where relevant (I think it is around line 417),
then normalizing by the number of batches only at the very end.
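A minimal stand-alone sketch of that scheme, assuming each call to `eval_batch` contributes one dict shaped like the snippet above (the loss values here are made up):

```python
import math

# Hypothetical per-micro-batch results; each dict mimics
# {'lm loss': loss, 'num_batches': 1} from the suggestion above.
loss_dicts = [
    {"lm loss": 2.31, "num_batches": 1},
    {"lm loss": 2.27, "num_batches": 1},
    {"lm loss": 2.35, "num_batches": 1},
]

# Aggregate losses and batch counts where relevant (around line 417).
total_loss = sum(d["lm loss"] for d in loss_dicts)
total_batches = sum(d["num_batches"] for d in loss_dicts)

# Normalize by the number of batches only at the very end.
val_loss = total_loss / total_batches
ppl = math.exp(min(20, val_loss))
```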

Comment on lines 415 to 417
if is_last_rank():

val_loss = total_loss_dict['lm loss'].item() / (num_tokenized_tokens - 1)
Jeronymous (Member Author), Jun 26, 2024

Also, it seems to me (I may be wrong) that "is_last_rank" is only True on one GPU in the multi-GPU case.
That would mean that, in multi-GPU, we would ignore the results on the other "n-1" GPUs?

if is_last_rank():

val_loss = total_loss_dict['lm loss'].item() / (num_tokenized_tokens - 1)
ppl = math.exp(min(20, val_loss))
Collaborator

Suggested change
ppl = math.exp(min(20, val_loss))
dist.all_reduce(val_loss, op=ReduceOp.SUM) # mean reduction is not supported
dist.all_reduce(ppl, op=ReduceOp.SUM)
dist.all_reduce(adjusted_ppl, op=ReduceOp.SUM)
dist.all_reduce(token_ratio, op=ReduceOp.SUM)
val_loss = val_loss / NB_SHARDS
token_ratio = token_ratio / NB_SHARDS
ppl = math.exp(min(20, val_loss))
adjusted_ppl = math.exp(min(20, val_loss * token_ratio))
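Note that in actual multi-GPU code `dist.all_reduce` operates on tensors, so the scalars would first need wrapping (e.g. `torch.tensor(val_loss)`). The averaging logic itself can be checked in plain Python; this sketch simulates a SUM reduction over `NB_SHARDS` ranks with made-up per-shard values:

```python
import math

NB_SHARDS = 4  # hypothetical number of data-parallel shards (world size)

# Made-up per-shard values, as each rank would hold them before reduction.
per_shard_val_loss = [2.30, 2.34, 2.28, 2.32]
per_shard_token_ratio = [1.10, 1.12, 1.09, 1.11]

# all_reduce(SUM) leaves the global sum on every rank; since a mean
# reduction is not supported, we divide by NB_SHARDS to get the average.
val_loss = sum(per_shard_val_loss) / NB_SHARDS
token_ratio = sum(per_shard_token_ratio) / NB_SHARDS

# Perplexities are then recomputed from the averaged quantities.
ppl = math.exp(min(20, val_loss))
adjusted_ppl = math.exp(min(20, val_loss * token_ratio))
```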

Collaborator

Thanks, I'll try it out; I hope it solves the synchronization problem ;)

lucashervier and others added 23 commits August 21, 2024 16:51
- Add datasets: Pile (WIP) and Stac (tiny).
- Improve a bit folder organization.
- Add zstandard in requirements (to read datasets in .jsonl.zst format)