Fix typos. (#15365)
jasondavies authored Feb 19, 2025
1 parent e820e8d commit 3ebf8a8
Showing 2 changed files with 46 additions and 49 deletions.
40 changes: 20 additions & 20 deletions tech_reports/GEMM_FLOPS/GEMM_FLOPS.md
@@ -55,27 +55,27 @@ For more details please refer to the tech reports [Matrix Engine](../matrix_engi

For example, when changing the precision of the matrix, for a given size of matrix the output performance is expected to be different.

![A simple bar chart of the TFLOPS on WH when changing the precision of matrcies](images/effects_of_precision.png "Variance in performance of TFLOPS on WH from SRAM due to changing precision")
![A simple bar chart of the TFLOPS on WH when changing the precision of matrices](images/effects_of_precision.png "Variance in performance of TFLOPS on WH from SRAM due to changing precision")



## MicroBenchmarks

### Matrix Multiplication TFLOPs on Wormhole (WH)
### Matrix Multiplication TFLOPS on Wormhole (WH)

The WH matrix engine performs 8x16 x 16x16 = 8x16 in a single cycle.
- This is 2*8\*16\*16 = 4096 muladds in a single cycle.
- At 1GHz, this is 4 TFLOPs per matrix engine.
- At 1GHz, this is 4 TFLOPS per matrix engine.
- The 8x16 is the smallest matrix that can be fed into in0, and 16x16 is the smallest matrix that can be fed into in1.

If the input matrices fed into the engine are "shorter" than 8x16, for example 1x16, the engine will still perform 8x16 x 16x16 = 8x16, but the effective throughput will be 1/8.
Thus, for 1x16 x 16x16 matrices, the effective throughput is 0.5 TFLOP per matrix engine.
Thus, for 1x16 x 16x16 matrices, the effective throughput is 0.5 TFLOPS per matrix engine.

MATH_FIDELITY is used for higher precision, and TFLOPs are calculated by dividing by the MATH_FIDELITY value.
- LoFi -> ~4 TFLOPs
- HiFi2 -> ~2 TFLOPs
- HiFi3 -> ~1.33 TFLOPs
- HiFi4 -> ~1 TFLOPs
MATH_FIDELITY is used for higher precision, and TFLOPS are calculated by dividing by the MATH_FIDELITY value.
- LoFi -> ~4 TFLOPS
- HiFi2 -> ~2 TFLOPS
- HiFi3 -> ~1.33 TFLOPS
- HiFi4 -> ~1 TFLOPS
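
The figures above follow directly from the per-cycle math. A minimal sketch, assuming the 1 GHz clock stated above, that reproduces them:

```python
# Peak matmul throughput per matrix engine, reproduced from the numbers above.
# Assumptions: 1 GHz clock; one 8x16 x 16x16 matmul per cycle.
FLOPS_PER_CYCLE = 2 * 8 * 16 * 16   # 2048 multiply-accumulates = 4096 FLOPs per cycle
CLOCK_HZ = 1e9

# MATH_FIDELITY reruns the math, so peak throughput is divided by its value.
FIDELITY_DIVISOR = {"LoFi": 1, "HiFi2": 2, "HiFi3": 3, "HiFi4": 4}

for fidelity, divisor in FIDELITY_DIVISOR.items():
    tflops = FLOPS_PER_CYCLE * CLOCK_HZ / divisor / 1e12
    # Prints 4.096 / 2.048 / 1.365 / 1.024; the report rounds these to ~4 / ~2 / ~1.33 / ~1.
    print(f"{fidelity}: ~{tflops:.3f} TFLOPS per matrix engine")

# "Short" in0 inputs (e.g. 1x16) still occupy a full 8x16 slot, so they reach
# only 1/8 of the LoFi peak, i.e. roughly 0.5 TFLOPS per matrix engine.
```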


### Utilization derivation formula
@@ -90,7 +90,7 @@ Ideal cycles = (m * k * n) / (tile_height * tile_width * tile_height) * (cycle_p
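
The ideal-cycles formula above is truncated in this view. Purely as an illustration, the sketch below derives a per-tile cycle count from the matrix-engine rate quoted earlier and reproduces one of the utilization figures reported later in this document; the 32x32 tile size, 1 GHz clock, and 8x8 grid are assumptions, not read from the truncated line.

```python
# Illustrative sketch of the ideal-cycle/utilization derivation.
# Assumptions (not taken from the truncated formula above): 32x32 tiles, a 1 GHz
# clock, an 8x8 (64-core) grid, and the 8x16 x 16x16-per-cycle matmul rate.
def ideal_cycles(m, k, n, math_fidelity=2, num_cores=64, tile=32,
                 muladds_per_cycle=8 * 16 * 16):
    # A 32x32x32 tile-matmul is 32*32*32 = 32768 multiply-accumulates; at 2048
    # muladds/cycle that is 16 cycles at LoFi, scaled by the math-fidelity value.
    cycles_per_tile = tile ** 3 / muladds_per_cycle * math_fidelity
    num_tile_matmuls = (m * k * n) / tile ** 3
    return num_tile_matmuls * cycles_per_tile / num_cores  # work spread over the grid

def utilization(m, k, n, device_time_ns, clock_hz=1e9, **kwargs):
    actual_cycles = device_time_ns * 1e-9 * clock_hz
    return ideal_cycles(m, k, n, **kwargs) / actual_cycles

# The 1024x1024x1024 HiFi2 row measured at 36845.2 ns (rectangular-matrices
# table below) gives 44.47%, matching the reported utilization.
print(f"{utilization(1024, 1024, 1024, 36845.2):.2%}")
```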

### Manually tuned Performance

Here we show the peak results we can get based on manually selected matmul configuturations, including packer l1 enablement, math fidelity, input output sharding, and input ouput L1/DRAM selection.
Here we show the peak results we can get based on manually selected matmul configurations, including packer L1 enablement, math fidelity, input/output sharding, and input/output L1/DRAM selection.

#### Peak FLOPS

@@ -100,15 +100,15 @@ Below are the results generated from running the benchmark script, showcasing the

We also show the results with and without trace (see [AdvancedPerformanceOptimizationsForModels](../AdvancedPerformanceOptimizationsForModels/AdvancedPerformanceOptimizationsForModels.md) for details of trace). With trace, we can minimize host overhead, which better reflects the actual device performance.
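
For context, here is a minimal sketch of how trace capture and replay are typically done through the ttnn Python API. The call names (`ttnn.begin_trace_capture`, `ttnn.end_trace_capture`, `ttnn.execute_trace`, `ttnn.release_trace`) and the `trace_region_size` device argument are recalled from the ttnn docs rather than taken from this report, so verify them against the linked AdvancedPerformanceOptimizationsForModels report.

```python
import torch
import ttnn

# Sketch only: capture a matmul once, then replay it to avoid per-op host dispatch.
device = ttnn.open_device(device_id=0, trace_region_size=800_000)  # argument and size assumed

torch_a = torch.randn(1024, 1024)
torch_b = torch.randn(1024, 1024)
a = ttnn.from_torch(torch_a, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device)
b = ttnn.from_torch(torch_b, dtype=ttnn.bfloat16, layout=ttnn.TILE_LAYOUT, device=device)

out = ttnn.matmul(a, b)                      # run once to compile before capturing

trace_id = ttnn.begin_trace_capture(device, cq_id=0)
out = ttnn.matmul(a, b)                      # op is recorded into the trace
ttnn.end_trace_capture(device, trace_id, cq_id=0)

for _ in range(100):                         # replay with minimal host overhead
    ttnn.execute_trace(device, trace_id, cq_id=0, blocking=True)

ttnn.release_trace(device, trace_id)
ttnn.close_device(device)
```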

Finally, we present the results in terms of device time, device throughput in TFLOPs, device utilization compared to the user-specified grid size and device utilization compared to the full grid size (8x8 in Wormhole). Utilization is calculated with
Finally, we present the results in terms of device time, device throughput in TFLOPS, device utilization compared to the user-specified grid size and device utilization compared to the full grid size (8x8 in Wormhole). Utilization is calculated with


#### TFLOPS plot across all matrix sizes and configurations

![](images/matmul_tflops_5_exp.png)


#### Utilization plot across all matrix sizes and configurations, based on the Chip TFLOPs calculated per each Math Fidelity
#### Utilization plot across all matrix sizes and configurations, based on the Chip TFLOPS calculated for each Math Fidelity

![](images/matmul_utilization_5_exp.png)

@@ -123,15 +123,15 @@ Finally, we present the results in terms of device time, device throughput in TF
![](images/matmul_utilization_table_5_exp.png)


#### TFLOPS ratio between the results with trace and without-trace. The trace mode has signficiant impact (i.e. higher ratio) when running a sequence of smaller/faster OPs, because the OP dispatch time will be comparable to the OP device runtime.
#### TFLOPS ratio between the results with and without trace. The trace mode has a significant impact (i.e. a higher ratio) when running a sequence of smaller/faster OPs, because the OP dispatch time will be comparable to the OP device runtime.

![](images/mamtul_trace_nontrace_ratio_5_exp.png)



#### The full results table

| m | k | n | use_trace | grid_size | in0_sharded | out_sharded | in0_storage_type | in1_storage_type | out_storage_type | dtype | math_fidelity | inference_time_avg (ns) | TFLOPs (avg) | Utilization (vs user grid) | Utilization (vs 8x8 full grid) |
| m | k | n | use_trace | grid_size | in0_sharded | out_sharded | in0_storage_type | in1_storage_type | out_storage_type | dtype | math_fidelity | inference_time_avg (ns) | TFLOPS (avg) | Utilization (vs user grid) | Utilization (vs 8x8 full grid) |
|------:|------:|------:|:------------|:------------|:--------------|:--------------|:-------------------|:-------------------|:-------------------|:-------------------|:-------------------|--------------------------:|---------------:|:-----------------------------|:---------------------------------|
| 512 | 512 | 512 | False | (8, 8) | True | True | L1 | DRAM | L1 | DataType.BFLOAT16 | MathFidelity.HiFi2 | 378654 | 0.71 | 0.54% | 0.54% |
| 512 | 1024 | 1024 | False | (8, 8) | True | True | L1 | DRAM | L1 | DataType.BFLOAT16 | MathFidelity.HiFi2 | 363193 | 2.96 | 2.26% | 2.26% |
@@ -289,31 +289,31 @@

For most hardware, peak performance is achieved with square matrices that best align with the underlying hardware. For example, WH performs best with square input matrices; we achieve the highest device utilization with bfloat16 and HiFi4.

![A simple bar chart of the TFLOPS on WH when using various square matrcies](images/TFLOPS_WH_SQUARE.png "Square Matrix TFLOPS on WH from SRAM")
![A simple bar chart of the TFLOPS on WH when using various square matrices](images/TFLOPS_WH_SQUARE.png "Square Matrix TFLOPS on WH from SRAM")

#### Rectangular matrices

When deviating from square matrices, the balance of compute can be thrown off, lowering peak performance. For example, processing matrices with the same number of elements but different shapes can reduce peak TFLOPS.

Multiplying an input matrix A of 512x1024 by B of 1024x2048 to produce a 512x2048 output requires the same amount of computation as if both input matrices were of dimensions 1024x1024. However, the performance results are measurably different:

| m | k | n | use_trace | grid_size | in0_sharded | out_sharded | in0_storage_type | in1_storage_type | out_storage_type | dtype | math_fidelity | inference_time_avg (ns) | TFLOPs (avg) | Utilization (vs user grid) | Utilization (vs 8x8 full grid) |
| m | k | n | use_trace | grid_size | in0_sharded | out_sharded | in0_storage_type | in1_storage_type | out_storage_type | dtype | math_fidelity | inference_time_avg (ns) | TFLOPS (avg) | Utilization (vs user grid) | Utilization (vs 8x8 full grid) |
|------:|------:|------:|:------------|:------------|:--------------|:--------------|:-------------------|:-------------------|:-------------------|:-------------------|:-------------------|--------------------------:|---------------:|:-----------------------------|:---------------------------------|
| 512 | 1024 | 2048 | True | (8, 8) | True | True | L1 | DRAM | L1 | DataType.BFLOAT16 | MathFidelity.HiFi2 | 52824 | 40.65 | 31.02% | 31.02% |
| 1024 | 1024 | 1024 | True | (8, 8) | True | True | L1 | DRAM | L1 | DataType.BFLOAT16 | MathFidelity.HiFi2 | 36845.2 | 58.28 | 44.47% | 44.47% |
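
Both rows perform exactly the same 2 * m * k * n floating-point operations, so the gap comes entirely from device time. A quick sketch, assuming the 1 GHz clock and 8x8 grid used throughout this report, that recomputes the TFLOPS and utilization columns:

```python
# Recompute the TFLOPS and utilization columns from m, k, n and device time.
# Assumptions: 1 GHz clock, 8x8 (64-core) grid, HiFi2 fidelity divisor of 2.
def tflops(m, k, n, time_ns):
    return 2 * m * k * n / (time_ns * 1e-9) / 1e12

def utilization(achieved_tflops, math_fidelity=2, num_cores=64):
    peak_per_core = 2 * 8 * 16 * 16 * 1e9 / 1e12 / math_fidelity  # ~2.048 TFLOPS at HiFi2
    return achieved_tflops / (peak_per_core * num_cores)

for m, k, n, t_ns in [(512, 1024, 2048, 52824), (1024, 1024, 1024, 36845.2)]:
    tf = tflops(m, k, n, t_ns)   # both shapes do 2*m*k*n = 2**31 FLOPs
    print(f"{m}x{k}x{n}: {tf:.2f} TFLOPS, {utilization(tf):.2%} utilization")
# -> 40.65 TFLOPS / 31.02% and 58.28 TFLOPS / 44.47%, matching the rows above.
```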

![A simple bar chart of the TFLOPS on WH when using square vs rectangular matrcies](images/effects_of_shapes.png "Square vs rectangular Matrix TFLOPS on WH from SRAM")
![A simple bar chart of the TFLOPS on WH when using square vs rectangular matrices](images/effects_of_shapes.png "Square vs rectangular Matrix TFLOPS on WH from SRAM")


### Out of Box performance

We also show the peak results we can get based on auto-selected matmul configuturations, which the matmul op itself chooses the configuraitons. It currently is not perfect and we'll continue improve it so that it can match or even surpass the manually selected ones. We show the results from 512x512x512 to 4096x4096x4096. The reason we are not testing shapes larger is due to the wrong selections of matmul configuturations.
We also show the peak results we can get based on auto-selected matmul configurations, where the matmul op itself chooses the configurations. It is currently not perfect and we will continue to improve it so that it can match or even surpass the manually selected ones. We show the results from 512x512x512 to 4096x4096x4096. We do not test larger shapes because the auto-selected matmul configurations are currently wrong for them.

As we can see, the results are comprable to the manutally selected.
As we can see, the results are comparable to the manually selected ones.

#### The full results table

| m | k | n | use_trace | grid_size | in0_storage_type | in1_storage_type | out_storage_type | dtype | math_fidelity | inference_time_avg (ns) | TFLOPs (avg) | Utilization (vs user grid) | Utilization (vs 8x8 full grid) |
| m | k | n | use_trace | grid_size | in0_storage_type | in1_storage_type | out_storage_type | dtype | math_fidelity | inference_time_avg (ns) | TFLOPS (avg) | Utilization (vs user grid) | Utilization (vs 8x8 full grid) |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 512 | 512 | 512 | False | (8, 8) | DRAM | DRAM | DRAM | DataType.BFLOAT16 | MathFidelity.HiFi2 | 400640.96 | 0.67 | 0.51% | 0.51% |
| 512 | 1024 | 1024 | False | (8, 8) | DRAM | DRAM | DRAM | DataType.BFLOAT16 | MathFidelity.HiFi2 | 296726.23 | 3.62 | 2.76% | 2.76% |
55 changes: 26 additions & 29 deletions tech_reports/matrix_engine/matrix_engine.md
@@ -2,47 +2,47 @@

## Introduction

The matrix engine supports the following operations: matrix mult, reduction, eltwise add/sub/mul, and tranpose_xy.
The matrix engine supports the following operations: matrix mult, reduction, eltwise add/sub/mul, and transpose_xy.

## Operations

### Matrix Mult

The WH matrix engine performs 8x16 x 16x16 = 8x16 in a single cycle. \
This is 2*8\*16\*16 = 4096 muladds in a single cycle. At 1GHz, this is 4 TFLOPs per matrix engine. \
This is 2*8\*16\*16 = 4096 muladds in a single cycle. At 1GHz, this is 4 TFLOPS per matrix engine. \
The 8x16 is the smallest matrix that can be fed into in0, and 16x16 is the
smallest matrix that can be fed into in1.

If the input matrices fed into the engine are "shorter" than 8x16, for example 1x16, the engine will still perform 8x16 x 16x16 = 8x16, but the effective throughput will be 1/8.
Thus, for 1x16 x 16x16 matricies, the effective throughput is 0.5 TFLOP per matrix engine.
Thus, for 1x16 x 16x16 matrices, the effective throughput is 0.5 TFLOPS per matrix engine.

MATH_FIDELITY is used for higher precision, and TFLOPs are calculated by dividing by the MATH_FIDELITY value.
MATH_FIDELITY is used for higher precision, and TFLOPS are calculated by dividing by the MATH_FIDELITY value.

LoFi -> 4 TFLOPs \
HiFi2 -> 2 TFLOPs \
HiFi3 -> 1.33 TFLOPs \
HiFi4 -> 1 TFLOPs
LoFi -> 4 TFLOPS \
HiFi2 -> 2 TFLOPS \
HiFi3 -> 1.33 TFLOPS \
HiFi4 -> 1 TFLOPS

### Reduction: Addition and Max
The WH matrix engine performs 16x16 reduce max/average operations in a single cycle. \
This is 2*16\*16 multiply + adds in a single cycle. At 1GHz, this is 0.512 TFLOPs per matrix engine.
### Reduction: Max/Average/Sum
The WH matrix engine performs 16x16 reduce max/average/sum operations in a single cycle. \
This is 2*16\*16 multiply + adds in a single cycle. At 1GHz, this is 0.512 TFLOPS per matrix engine.

Reduce max does not use MATH_FIDELITY; however reduce average does use MATH_FIDELITY for higher precision, and TFLOPs are calculated by dividing by the MATH_FIDELITY value.
Reduce max does not use MATH_FIDELITY; however reduce average/sum does use MATH_FIDELITY for higher precision, and TFLOPS are calculated by dividing by the MATH_FIDELITY value.

LoFi -> 0.512 TFLOPs \
HiFi2 -> 0.256 TFLOPs \
HiFi3 -> 0.171 TFLOPs \
HiFi4 -> 0.128 TFLOPs
LoFi -> 0.512 TFLOPS \
HiFi2 -> 0.256 TFLOPS \
HiFi3 -> 0.171 TFLOPS \
HiFi4 -> 0.128 TFLOPS

### Eltwise: Add, Sub, Mul
The WH matrix engine performs 8x16 elementwise addition/subtraction/multiplication in a single cycle. \
This is 8\*16 (multiply or adds, not both) in a single cycle. At 1Ghz, this is 0.128 TFLOPs per matrix engine. \
Elementwise addition and subtraction do not use MATH_FIDELITY; however, Elementwise multiplication does use MATH_FIDELITY for higher precision, and TFLOPs are calculated by dividing by the MATH_FIDELITY value.
This is 8\*16 (multiply or adds, not both) in a single cycle. At 1GHz, this is 0.128 TFLOPS per matrix engine. \
Elementwise addition and subtraction do not use MATH_FIDELITY; however, elementwise multiplication does use MATH_FIDELITY for higher precision, and TFLOPS are calculated by dividing by the MATH_FIDELITY value.

LoFi -> 0.128 TFLOPs \
HiFi2 -> 0.064 TFLOPs \
HiFi3 -> 0.043 TFLOPs \
HiFi4 -> 0.032 TFLOPs
LoFi -> 0.128 TFLOPS \
HiFi2 -> 0.064 TFLOPS \
HiFi3 -> 0.043 TFLOPS \
HiFi4 -> 0.032 TFLOPS
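
As with the matmul section, the reduction and eltwise peaks above are simply ops-per-cycle at 1 GHz divided by the fidelity value. A small sketch tabulating all three (reduce max and eltwise add/sub ignore MATH_FIDELITY, so only their LoFi figure applies):

```python
# Per-engine peak rates from the sections above, at an assumed 1 GHz clock.
# Reduce max and eltwise add/sub do not use MATH_FIDELITY, so only the LoFi
# column is meaningful for those ops.
OPS_PER_CYCLE = {
    "matmul (8x16 x 16x16)": 2 * 8 * 16 * 16,  # 4096
    "reduction (16x16)": 2 * 16 * 16,          # 512
    "eltwise (8x16)": 8 * 16,                  # 128 (multiply or add, not both)
}
CLOCK_HZ = 1e9
for name, ops_per_cycle in OPS_PER_CYCLE.items():
    rates = [ops_per_cycle * CLOCK_HZ / fidelity / 1e12 for fidelity in (1, 2, 3, 4)]
    # Matches the lists above up to rounding (e.g. 4.096 is quoted as ~4 TFLOPS).
    print(f"{name}: " + ", ".join(f"{r:.3f}" for r in rates) + " TFLOPS (LoFi..HiFi4)")
```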

## Configurations

@@ -65,7 +65,7 @@ Math Fidelity specifies the number of times an operation is run to consume the f
LoFi -> SrcA register: uses 1 hidden bit + 4 most significant bits of the mantissa (MSB of the mantissa), SrcB register: uses 1 hidden bit + 6 MSB of the mantissa \
HiFi2 -> SrcA register: uses 1 hidden bit + next 4 bits of LSBs of the mantissa, SrcB register: uses 1 hidden bit + 6 MSB of the mantissa \
HiFi3 -> SrcA register: uses 1 hidden bit + 4 MSB of the mantissa, SrcB register: Uses 1 hidden bit + next 6 LSB of the mantissa \
HiFi4 -> SrcA register: uses 1 hidden bit + next 4 bits of LSBs of the mantissa, SrcB register: Uses 1 hidden bit + next 6 LSB of the mantissa

### Math Approx Mode

@@ -84,6 +84,3 @@ Warning: If this flag is set, the math destination register can fit as half as m
Wormhole has the ability to do accumulation in L1 memory: the packer will read the input address, accumulate it with the values read from dest, and then write back to the same address.
This feature is useful for accumulating in higher precision, followed by a final pack call to convert to lower precision (for example, accumulate in fp32, then output the final result as float16_b).
In order to enable this feature, `packer_l1_acc` must be set.
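
For reference, a sketch of how these knobs are typically exposed through the ttnn Python API. The class and field names (`ttnn.WormholeComputeKernelConfig`, `math_fidelity`, `math_approx_mode`, `fp32_dest_acc_en`, `packer_l1_acc`) and the `compute_kernel_config` argument are recalled from the ttnn documentation, not taken from this report, so treat them as assumptions:

```python
import ttnn

# Sketch only: a Wormhole compute-kernel config combining the options discussed above.
compute_config = ttnn.WormholeComputeKernelConfig(
    math_fidelity=ttnn.MathFidelity.HiFi2,  # divides peak matmul TFLOPS by 2
    math_approx_mode=False,                 # use the precise (non-approximate) math mode
    fp32_dest_acc_en=True,                  # fp32 accumulation in dest (halves dest tile capacity)
    packer_l1_acc=True,                     # accumulate partial results in L1 via the packer
)

# Then passed to an op, e.g.:
# out = ttnn.matmul(a, b, compute_kernel_config=compute_config)
```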


