forked from triton-lang/triton
* Move preamble code into tikzplot.tex
* Rename kpack to kWidth and allow kWidth = 32
* [API change] Take user input to set dim names
  - For blocked layout, use -tensorShape, which only takes two dims as dim0,dim1
  - For dot layout, use -dotShape, which takes three dims as M,N,K
* Re-structure files: separate each layout's code into its own file
* Extend dotLayout plot to support kWidth = 32
  - When kWidth is large, use a smaller elemSize horizontally to save space
  - Improve the labels:
    - change vec to kWidth for operands
    - change opA/opB to inA/inB and include operand dims
    - remove group dims in the operands so that they don't overlap with operand block dims
  - Better alignment: dot op and mfma zoomed-in pics are bottom aligned
* [API change] Add support for kGroup
  - kGroup is defined as total elements per thread / kWidth for one mfma instruction.
  - We need kGroup = 2 only for the newly added mfma_f32_16x16x128_f8f6f4 and
    mfma_f32_32x32x64_f8f6f4 with f8 input type on MI350.
* [API change] Add support for data types of both operands, and print the mfma
  instruction name accordingly. For now, mixed-precision mfma between 8-bit and
  4- or 6-bit inputs is not supported.
* Support mixed mfma with bf8/fp8 and fp6/bf6/f4
* [API change] Add support for scale
* [NFC] Fix format
* [API change] Refactor tensor and LDS layout
  - Support data types
  - Support both 32 and 64 banks
  - Still working on LDS accesses
* [LDS layout] Add support for ds_read access pattern for TN config
  - Fixed the issue with maxPhase computation. Need to submit a PR to fix it in
    the triton compiler.
  - For ds_read_b64 with 64 banks, there are bank conflicts. We need to figure
    out a different swizzling pattern to avoid them.
* [LDS layout] Add support for ds_write access pattern, assuming a basic global
  access pattern
* [LDS layout] Support access pattern for MN-contig without using
  mfma_transpose_load instructions
  - Elements along the M/N dim are contiguous in both global memory and LDS.
    Note that this is not the in-thread transpose case.
  - Swizzling is disabled
* [LDS layout] Support access pattern for MN-contig with mfma_trans_load
  instructions
* Clean up the code
* [LDS layout] Support padding
* Reduce required TeX packages
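The kGroup definition above can be written out as a short math sketch. The instruction shape comes from the commit message; the per-thread element count and the kWidth value below are derived arithmetic for illustration, not quoted from the repository:

```latex
% Hedged sketch: kGroup = (elements per thread for one mfma) / kWidth.
% For mfma_f32_16x16x128_f8f6f4 with f8 inputs on a 64-lane wave, one
% 16x128 operand gives 16*128/64 = 32 elements per thread; assuming
% kWidth = 16, this yields the kGroup = 2 case named above.
\[
  \mathrm{kGroup}
    = \frac{\text{elements per thread per mfma}}{\mathrm{kWidth}}
    = \frac{16 \cdot 128 / 64}{16}
    = 2
\]
```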
Showing 8 changed files with 1,818 additions and 1,059 deletions.
python/perf-kernels/tools/plot-layout/blockedLayout.tex (new file: 157 additions, 0 deletions)

@@ -0,0 +1,157 @@
\newcommand{\drawBlockedWave}[5]{
  %%
  %% Draw a wave coverage with blocked layout
  %%
  %% Wave TL: pre-defined top-left coordinate of the wave
  %% \elem: pre-defined variable
  %%
  %% #1: sizePerThread[0] --> sizePerThreadM
  %% #2: sizePerThread[1] --> sizePerThreadN
  %% #3: threadsPerWarp[0] --> threadsPerWarpM
  %% #4: threadsPerWarp[1] --> threadsPerWarpN
  %% #5: fastest changing dim --> order

  \pgfmathsetmacro{\sizePerThreadM}{#1}
  \pgfmathsetmacro{\sizePerThreadN}{#2}
  \pgfmathsetmacro{\threadsPerWarpM}{#3}
  \pgfmathsetmacro{\threadsPerWarpN}{#4}
  \pgfmathsetmacro{\order}{#5}

  \pgfmathsetmacro{\waveSizeM}{\sizePerThreadM*\threadsPerWarpM}
  \pgfmathsetmacro{\waveSizeN}{\sizePerThreadN*\threadsPerWarpN}

  \foreach \tid in {0,...,63}{
    \pgfmathsetmacro{\tidM}{int(\tid/\threadsPerWarpN)}
    \pgfmathsetmacro{\tidN}{mod(\tid,\threadsPerWarpN)}
    \coordinate (Thread TL) at ($(Wave TL)+(\tidN*\sizePerThreadN*\elem, -\tidM*\sizePerThreadM*\elem)$);
    \pgfmathsetmacro{\ratio}{\tidM*10}

    \ifthenelse{\tid = 0}{
      \draw [line width = 0.01mm, fill=red] (Thread TL)
        rectangle ++(\sizePerThreadN*\elem, -\sizePerThreadM*\elem);
    }{
      \draw [line width = 0.01mm, fill=blue!\ratio!white] (Thread TL)
        rectangle ++(\sizePerThreadN*\elem, -\sizePerThreadM*\elem);
    }
  }
  \draw (Wave TL) rectangle ++(\waveSizeN*\elem, -\waveSizeM*\elem);
}

\newcommand{\drawBlockedCTA}[7]{
  %%
  %% Draw a CTA coverage with blocked layout
  %%
  %% CTA TL: pre-defined top-left coordinate of the CTA
  %% \elem: pre-defined variable
  %%
  %% #1: sizePerThread[0] --> sizePerThreadM
  %% #2: sizePerThread[1] --> sizePerThreadN
  %% #3: threadsPerWarp[0] --> threadsPerWarpM
  %% #4: threadsPerWarp[1] --> threadsPerWarpN
  %% #5: warpsPerCTA[0] --> warpsPerCTAM
  %% #6: warpsPerCTA[1] --> warpsPerCTAN
  %% #7: fastest changing dim --> order

  \pgfmathsetmacro{\sizePerThreadM}{#1}
  \pgfmathsetmacro{\sizePerThreadN}{#2}
  \pgfmathsetmacro{\threadsPerWarpM}{#3}
  \pgfmathsetmacro{\threadsPerWarpN}{#4}
  \pgfmathsetmacro{\warpsPerCTAM}{#5}
  \pgfmathsetmacro{\warpsPerCTAN}{#6}
  \pgfmathsetmacro{\order}{#7}

  \pgfmathsetmacro{\CTASizeM}{\sizePerThreadM*\threadsPerWarpM*\warpsPerCTAM}
  \pgfmathsetmacro{\CTASizeN}{\sizePerThreadN*\threadsPerWarpN*\warpsPerCTAN}
  \pgfmathsetmacro{\waveSizeM}{\sizePerThreadM*\threadsPerWarpM}
  \pgfmathsetmacro{\waveSizeN}{\sizePerThreadN*\threadsPerWarpN}

  \pgfmathsetmacro{\maxWaveId}{\warpsPerCTAM*\warpsPerCTAN-1}

  \coordinate (Wave TL) at (CTA TL);
  \drawBlockedWave{\sizePerThreadM}{\sizePerThreadN}{\threadsPerWarpM}{\threadsPerWarpN}{\order}
  \foreach \waveId in {0,...,\maxWaveId}{
    \ifthenelse{\order=1}
    {
      \pgfmathsetmacro{\waveCoordM}{int(\waveId/\warpsPerCTAN)}
      \pgfmathsetmacro{\waveCoordN}{mod(\waveId,\warpsPerCTAN)}
      \pgfmathsetmacro{\rot}{0}
    }{
      \pgfmathsetmacro{\waveCoordM}{mod(\waveId,\warpsPerCTAM)}
      \pgfmathsetmacro{\waveCoordN}{int(\waveId/\warpsPerCTAM)}
      \pgfmathsetmacro{\rot}{90}
    }

    \coordinate (Wave TL) at ($(CTA TL)+(\waveCoordN*\waveSizeN*\elem, -\waveCoordM*\waveSizeM*\elem)$);
    \draw [ultra thin] (Wave TL) rectangle ++(\waveSizeN*\elem, -\waveSizeM*\elem)
      node [pos=.5, scale=.6*\scale, inner sep=0, fill=white, rotate=\rot] {wave\waveId};
  }

  \draw [thick] (CTA TL) rectangle ++(\CTASizeN*\elem, -\CTASizeM*\elem);
}

\newcommand{\drawBlockedTensor}[8]{
  %%
  %% Draw a tensor with blocked layout of the following parameters
  %%   sizePerThread[2]
  %%   threadsPerWarp[2]
  %%   warpsPerCTA[2]
  %%   order[2]
  %%
  %% TL: pre-defined top-left coordinate of the tensor
  %% \elem: pre-defined variable
  %% \dimColName: dim0Name
  %% \dimRowName: dim1Name
  %%
  %% #1: tensorShape[0] --> M
  %% #2: tensorShape[1] --> N
  %% #3: sizePerThread[0] --> sizePerThreadM
  %% #4: sizePerThread[1] --> sizePerThreadN
  %% #5: threadsPerWarp[0] --> threadsPerWarpM
  %%     Note that threadsPerWarp[1] is calculated by 64/threadsPerWarp[0]
  %% #6: warpsPerCTA[0] --> warpsPerCTAM
  %% #7: warpsPerCTA[1] --> warpsPerCTAN
  %% #8: fastest changing dim --> order

  \pgfmathsetmacro{\M}{#1}
  \pgfmathsetmacro{\N}{#2}
  \pgfmathsetmacro{\sizePerThreadM}{#3}
  \pgfmathsetmacro{\sizePerThreadN}{#4}
  \pgfmathsetmacro{\threadsPerWarpM}{#5}
  \pgfmathsetmacro{\warpsPerCTAM}{#6}
  \pgfmathsetmacro{\warpsPerCTAN}{#7}
  \pgfmathsetmacro{\order}{#8}

  \pgfmathsetmacro{\threadsPerWarpN}{64/\threadsPerWarpM}
  \pgfmathsetmacro{\CTASizeM}{\sizePerThreadM*\threadsPerWarpM*\warpsPerCTAM}
  \pgfmathsetmacro{\CTASizeN}{\sizePerThreadN*\threadsPerWarpN*\warpsPerCTAN}
  \pgfmathsetmacro{\CTARepM}{\M/\CTASizeM}
  \pgfmathsetmacro{\CTARepN}{\N/\CTASizeN}
  \pgfmathsetmacro{\maxCTAId}{\CTARepM*\CTARepN-1}

  \foreach \ctaId in {0,...,\maxCTAId}{
    \pgfmathsetmacro{\ctaCoordM}{int(\ctaId/\CTARepN)}
    \pgfmathsetmacro{\ctaCoordN}{mod(\ctaId,\CTARepN)}
    \coordinate (CTA TL) at ($(TL)+(\ctaCoordN*\CTASizeN*\elem, -\ctaCoordM*\CTASizeM*\elem)$);
    \drawBlockedCTA{\sizePerThreadM}{\sizePerThreadN}{\threadsPerWarpM}{\threadsPerWarpN}{\warpsPerCTAM}{\warpsPerCTAN}{\order}
  }

  \node [scale=.7*\scale, above, rotate=90] at ($(TL)+(0, -.5*\M*\elem)$) {\dimColName=\M};
  \node [scale=.7*\scale, above] at ($(TL)+(.5*\N*\elem, 0)$) {\dimRowName=\N};

  \def\zoomR{1.5}
  \coordinate (zoomin BL) at ($(TL)+(0, .3)$);

  \foreach \hl in {0,...,\sizePerThreadM}{
    \draw ($(zoomin BL)+(0, \hl*\elem*\zoomR)$) -- ++(\sizePerThreadN*\elem*\zoomR,0);
  }
  \foreach \vl in {0,...,\sizePerThreadN}{
    \draw ($(zoomin BL)+(\vl*\elem*\zoomR, 0)$) -- ++(0, \sizePerThreadM*\elem*\zoomR);
  }

  \node [scale=.6*\scale, left] at ($(zoomin BL)+(0, .5*\sizePerThreadM*\elem*\zoomR)$) {$t_0$};
  \node [scale=.6*\scale, right] at ($(zoomin BL)+(\sizePerThreadN*\elem*\zoomR, .5*\sizePerThreadM*\elem*\zoomR)$) {\sizePerThreadM$\times$\sizePerThreadN};

  \draw [densely dotted] (TL) -- (zoomin BL);
  \draw [densely dotted] ($(TL)+(\sizePerThreadN*\elem, 0)$) -- ($(zoomin BL)+(\sizePerThreadN*\elem*\zoomR, 0)$);
  \draw [fill=red] (TL) rectangle ++(\sizePerThreadN*\elem, -\sizePerThreadM*\elem);
}
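For orientation, here is a minimal standalone sketch of how `\drawBlockedTensor` could be invoked. It is an illustrative assumption, not a driver from the repository: the preamble, the pre-defined names the comments require (`\elem`, `\scale`, `\dimColName`, `\dimRowName`, the `TL` coordinate), and the parameter values are all chosen for this example.

```latex
% Hedged usage sketch -- values and preamble are illustrative only.
\documentclass[tikz]{standalone}
\usetikzlibrary{calc}   % for ($...$) coordinate arithmetic
\usepackage{ifthen}     % for \ifthenelse used by the macros
\begin{document}
\input{blockedLayout.tex}
\begin{tikzpicture}
  \def\scale{1.0}
  \def\elem{0.08}              % edge length of one tensor element
  \def\dimColName{M}
  \def\dimRowName{N}
  \coordinate (TL) at (0, 0);  % tensor top-left corner
  % tensorShape 64x64, sizePerThread 4x4, threadsPerWarpM 8
  % (so threadsPerWarpN = 64/8 = 8), warpsPerCTA 2x2, order 1:
  % CTA size is 64x64, so the tensor is covered by a single CTA.
  \drawBlockedTensor{64}{64}{4}{4}{8}{2}{2}{1}
\end{tikzpicture}
\end{document}
```

With these values `\CTASizeM = 4*8*2 = 64` and `\CTASizeN = 4*8*2 = 64`, so `\CTARepM = \CTARepN = 1` and the loop draws exactly one CTA.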