[WIP] BFloat16 weights; 2 sec improvement #74

leloykun · 2025-01-21T23:56:59Z

The PyTorch trace below shows that the current record is bottlenecked by the backward pass, so it makes sense to shift our focus to optimizing that part of the run.

Converting the weights to bfloat16 is the lowest-hanging fruit here, and it seems to reduce wallclock time by 2 secs.

YouJiacheng · 2025-01-22T09:49:36Z

I think a better choice is to also implement master weights in the optimizers, so we can go back to FP32 with minimal performance regression when we need FP32.
Note that the embeddings are already fully BF16, so we only need to deal with lm_head and the weights updated by Muon.
BF16 accumulation is (probably) safe for now because, after the previous changes, we only decay the lr to 0.1× peak.
But in longer runs (ofc, irrelevant to the 3.28 speedrun) we might decay the lr to smaller values.
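
To make the concern about small learning rates concrete, here is a toy sketch (not the code in this PR) of why accumulating updates directly in BF16 can stall while an FP32 master weight keeps moving. It uses only the standard library and simulates BF16 by rounding an FP32 bit pattern to its top 16 bits; `to_fp32`, `to_bf16`, and `accumulate` are hypothetical helper names, and the constant `update` stands in for `lr * grad`.

```python
import struct


def to_fp32(x: float) -> float:
    """Round a Python float (double) to the nearest float32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def to_bf16(x: float) -> float:
    """Round to the nearest bfloat16 value (round-to-nearest-even).

    Assumes x is finite and well inside the float32 range.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Add half of the dropped range, plus the tie-breaking bit, then truncate.
    bias = 0x7FFF + ((bits >> 16) & 1)
    bits = (bits + bias) & 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def accumulate(steps: int, update: float, use_master: bool) -> float:
    """Apply `update` to a weight starting at 1.0 and return the BF16 weight.

    use_master=False: the weight itself is stored and updated in BF16.
    use_master=True:  an FP32 master copy is updated; the BF16 weight is
                      re-derived from it after every step.
    """
    weight = to_bf16(1.0)
    master = to_fp32(1.0)
    for _ in range(steps):
        if use_master:
            master = to_fp32(master + update)
            weight = to_bf16(master)
        else:
            weight = to_bf16(weight + update)
    return weight


if __name__ == "__main__":
    for update in (1e-2, 1e-3):
        print(
            update,
            accumulate(100, update, use_master=False),
            accumulate(100, update, use_master=True),
        )
```

Near 1.0, adjacent BF16 values are 2^-7 apart, so an update of 1e-3 is below half that spacing and is rounded away on every step: the BF16-only weight stays at exactly 1.0, while the master-weight variant ends up close to 1.1. Real weights and updates have very different magnitudes, so this only illustrates the mechanism, not whether the speedrun's actual lr schedule is affected.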

leloykun added 4 commits January 21, 2025 09:34

all bfloat16 training

46566f7

bfloat16 in custom op

8099ee3

bfloat16 in custom op

34c6cf6

increase num iterations

3f96031

leloykun added 6 commits January 23, 2025 16:16

increase num_iterations & prepare for fp8 ops on other layers

a3c83ba

.

d930325

.

58c9931

.

b77c123

adjust fp8 scales

63af732

fix Rotary dtype computation

7904ab1
