-
Notifications
You must be signed in to change notification settings - Fork 26
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AMX Extension #93
base: master
Are you sure you want to change the base?
AMX Extension #93
Conversation
Added requested changes in master. Update PR branch.
@@ -57,7 +57,7 @@ git_submodule_update() | |||
|
|||
if("${CMAKE_CXX_COMPILER_ID}" STREQUAL "MSVC") | |||
else() | |||
SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wall -Wextra -O2 -fdata-sections -ffunction-sections") | |||
SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -mamx-tile -Wall -Wextra -O2 -fdata-sections -ffunction-sections") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This change will not be necessary when the ldtilecfg instruction is integrated into the assembler kernel
// Create tile_cfg, fill it and return | ||
int i; | ||
tileinfo->palette_id = 1; | ||
tileinfo->start_row = 0; | ||
|
||
for (i = 0; i < 8; ++i) { | ||
tileinfo->colsb[i] = MAX_COLS; | ||
tileinfo->rows[i] = MAX_ROWS; | ||
} | ||
|
||
_tile_loadconfig(tileinfo); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function can be integrated into the assembler kernel. This will also require that the tile config struct is passed via the c-style inferface to this kernel. The pointer to the value returned by getMemoryAddress
function in https://github.com/tud-zih-energy/FIRESTARTER/blob/code-style-enforcing/include/firestarter/LoadWorkerMemory.hpp is saved in PointerReg
. Also take a look at X86Payload::emitDumpRegisterCode
function to see how this memory is addressed in the assembler payload.
long rc; | ||
unsigned long bitmask; | ||
rc = syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume that the syscall is required to enable AMX on the OS level. This however will cause to not compile on Windows and MacOS. Please guard this code with an linux ifdef and workerLog::fatal
on Windows/MacOS.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Taking a look at this PR microsoft/onnxruntime#14042 it seems that AMX is just supported on Windows. Only question is if the compiler sets some flag in the binary for the operating system or if it just works with Jit generated assembler code.
request_permission(); | ||
create_AMX_config(&tile_data); // Create tilecfg and fill it | ||
|
||
static bool init = true; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unused variable
|
||
// Initialize buffer with random values | ||
// Multiplication always produces either 1 or -1 | ||
// Accumulation operation always on (1 + -1) = 0 ensures stable values | ||
|
||
__bfloat16* buf1 = (__bfloat16*)src1; | ||
__bfloat16* buf2 = (__bfloat16*)src2; | ||
|
||
// TODO: Change MAX_ROWS/MAXC_COLS from constant to maximum size check by asmJit | ||
// Currently not supported by asmJit | ||
// Alternative: Manually parse CPUID | ||
|
||
for (int i = 0; i < MAX_ROWS; i++) { | ||
__bfloat16 random_init = (__bfloat16)(rand() % 65536); // Limit maximum size as 1/x needs to fit bfloat16 | ||
for (int j = 0; j < MAX_COLS; j++) { | ||
buf1[i * MAX_COLS + j] = (__bfloat16)(random_init); | ||
if (!(j % 2)) { | ||
buf2[i * MAX_COLS + j] = (__bfloat16)((-1) / random_init); | ||
} else if (j % 2) { | ||
buf2[i * MAX_COLS + j] = (__bfloat16)(1 / random_init); | ||
} | ||
} | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This function should be called in the AVX512WithAMXPayload::init function. It should override and call the AVX512Payload::init
. The pointer to memory is the same as in the assembler kernel.
workerLog::error() << "prctl(ARCH_GET_XCOMP_PERM) error: " << rc; | ||
} | ||
if (bitmask & XFEATURE_MASK_XTILE) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should this check if XFEATURE_MASK_XTILEDATA
is set? This check returns true if either/and XFEATURE_MASK_XTILECFG
and XFEATURE_MASK_XTILEDATA
is set.
} | ||
|
||
void AVX512Payload::init_buffer_rand(uintptr_t src1, uintptr_t src2) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pointers to the function should be __bfloat16*
This may however not be supported by all compilers. We might need to initialize this memory differently.
__bfloat16* buf2 = (__bfloat16*)src2; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please do not use c-style casts
unsigned int aligned_alloc_size = static_cast<unsigned int>(MAX * sizeof(__bfloat16)); | ||
if (aligned_alloc_size % 1024) { // aligned_alloc expects size to be multiple of alignment (aka 1024) | ||
aligned_alloc_size = aligned_alloc_size + (1024 - (aligned_alloc_size % 1024)); | ||
} | ||
src1 = (uintptr_t)aligned_alloc(1024, aligned_alloc_size); | ||
src2 = (uintptr_t)aligned_alloc(1024, aligned_alloc_size); | ||
src3 = (uint64_t)aligned_alloc(1024, aligned_alloc_size); | ||
if (((void*)src1 == nullptr) || (void*)src2 == nullptr || | ||
(void*)src3 == nullptr) { // uintptr_t garantuees we can cast it to void* and back | ||
std::cout << "[ERROR]: Allocation of source and target buffer for AMX failed. Aborting...\n"; | ||
exit(1); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This memory should be allocated in the LoadWorkerMemory
class. A platform independent alligned alloc abstraction is available there. You might need to change it sightly, so that these variables are aligned with 1024B instead of 64B. This change will not only allocate these arrays for the AVX512/AMX payload but for all. There should however be no negative effect other than increased allocated RAM size for all payloads.
AVX512WithAMXPayload() noexcept { | ||
// Enable the AMX instruction in the AVX512 Payload and request AMX_TILE and AMX_BF16 feature. | ||
addInstructionFlops("AMX", 512); | ||
addFeatureRequest(asmjit::CpuFeatures::X86::kAMX_TILE); | ||
addFeatureRequest(asmjit::CpuFeatures::X86::kAMX_BF16); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added this wrapper to the AVX512Payload
to allow for checking the AMX_TILE and AMX_BF16 features.
environment::payload::PayloadSettings(/*Threads=*/{1, 2}, | ||
/*DataCacheBufferSize=*/{32768, 1048576, 1441792}, | ||
/*RamBufferSize=*/1048576000, /*Lines=*/1536, | ||
/*InstructionGroups=*/ | ||
{{"RAM_S", 3}, | ||
{"RAM_P", 1}, | ||
{"L3_S", 1}, | ||
{"L3_P", 1}, | ||
{"L2_S", 4}, | ||
{"L2_L", 70}, | ||
{"L1_S", 0}, | ||
{"L1_L", 40}, | ||
{"REG", 140}, | ||
{"AMX", 1}}), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These values should be updated. E.g. the L1 data cache size changed from SkylakeSP to Sapphire Rapids
After taking another look, you will also need to specialize the |
@cyssi-cb I merged the refactoring branch onto this PR. You will probably want to reopen this PR with your own user account. You should have the permission to commit to this branch directly. I'll leave you the comments here but can also readd them to a new pr.