Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AMX Extension #93

Draft
wants to merge 44 commits into
base: master
Choose a base branch
from
Draft

AMX Extension #93

wants to merge 44 commits into from

Conversation

marenz2569
Copy link
Member

@cyssi-cb I merged the refactoring branch onto this PR. You will probably want to reopen this PR with your own user account. You should have the permission to commit to this branch directly. I'll leave you the comments here but can also readd them to a new pr.

@@ -57,7 +57,7 @@ git_submodule_update()

if("${CMAKE_CXX_COMPILER_ID}" STREQUAL "MSVC")
else()
SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -Wall -Wextra -O2 -fdata-sections -ffunction-sections")
SET(CMAKE_CXX_FLAGS "${CMAKE_CXX_FLAGS} -mamx-tile -Wall -Wextra -O2 -fdata-sections -ffunction-sections")
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This change will not be necessary when the ldtilecfg instruction is integrated into the assembler kernel

Comment on lines 451 to 463
// Create tile_cfg, fill it and return
int i;
tileinfo->palette_id = 1;
tileinfo->start_row = 0;

for (i = 0; i < 8; ++i) {
tileinfo->colsb[i] = MAX_COLS;
tileinfo->rows[i] = MAX_ROWS;
}

_tile_loadconfig(tileinfo);
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function can be integrated into the assembler kernel. This will also require that the tile config struct is passed via the c-style inferface to this kernel. The pointer to the value returned by getMemoryAddress function in https://github.com/tud-zih-energy/FIRESTARTER/blob/code-style-enforcing/include/firestarter/LoadWorkerMemory.hpp is saved in PointerReg. Also take a look at X86Payload::emitDumpRegisterCode function to see how this memory is addressed in the assembler payload.

long rc;
unsigned long bitmask;
rc = syscall(SYS_arch_prctl, ARCH_REQ_XCOMP_PERM, XFEATURE_XTILEDATA);
Copy link
Member Author

@marenz2569 marenz2569 Nov 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume that the syscall is required to enable AMX on the OS level. This however will cause to not compile on Windows and MacOS. Please guard this code with an linux ifdef and workerLog::fatal on Windows/MacOS.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Taking a look at this PR microsoft/onnxruntime#14042 it seems that AMX is just supported on Windows. Only question is if the compiler sets some flag in the binary for the operating system or if it just works with Jit generated assembler code.

request_permission();
create_AMX_config(&tile_data); // Create tilecfg and fill it

static bool init = true;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unused variable

Comment on lines 486 to 510

// Initialize buffer with random values
// Multiplication always produces either 1 or -1
// Accumulation operation always on (1 + -1) = 0 ensures stable values

__bfloat16* buf1 = (__bfloat16*)src1;
__bfloat16* buf2 = (__bfloat16*)src2;

// TODO: Change MAX_ROWS/MAXC_COLS from constant to maximum size check by asmJit
// Currently not supported by asmJit
// Alternative: Manually parse CPUID

for (int i = 0; i < MAX_ROWS; i++) {
__bfloat16 random_init = (__bfloat16)(rand() % 65536); // Limit maximum size as 1/x needs to fit bfloat16
for (int j = 0; j < MAX_COLS; j++) {
buf1[i * MAX_COLS + j] = (__bfloat16)(random_init);
if (!(j % 2)) {
buf2[i * MAX_COLS + j] = (__bfloat16)((-1) / random_init);
} else if (j % 2) {
buf2[i * MAX_COLS + j] = (__bfloat16)(1 / random_init);
}
}
}
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This function should be called in the AVX512WithAMXPayload::init function. It should override and call the AVX512Payload::init. The pointer to memory is the same as in the assembler kernel.

workerLog::error() << "prctl(ARCH_GET_XCOMP_PERM) error: " << rc;
}
if (bitmask & XFEATURE_MASK_XTILE) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this check if XFEATURE_MASK_XTILEDATA is set? This check returns true if either/and XFEATURE_MASK_XTILECFG and XFEATURE_MASK_XTILEDATA is set.

}

void AVX512Payload::init_buffer_rand(uintptr_t src1, uintptr_t src2) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pointers to the function should be __bfloat16* This may however not be supported by all compilers. We might need to initialize this memory differently.

Comment on lines 492 to 493
__bfloat16* buf2 = (__bfloat16*)src2;
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please do not use c-style casts

Comment on lines 191 to 202
unsigned int aligned_alloc_size = static_cast<unsigned int>(MAX * sizeof(__bfloat16));
if (aligned_alloc_size % 1024) { // aligned_alloc expects size to be multiple of alignment (aka 1024)
aligned_alloc_size = aligned_alloc_size + (1024 - (aligned_alloc_size % 1024));
}
src1 = (uintptr_t)aligned_alloc(1024, aligned_alloc_size);
src2 = (uintptr_t)aligned_alloc(1024, aligned_alloc_size);
src3 = (uint64_t)aligned_alloc(1024, aligned_alloc_size);
if (((void*)src1 == nullptr) || (void*)src2 == nullptr ||
(void*)src3 == nullptr) { // uintptr_t garantuees we can cast it to void* and back
std::cout << "[ERROR]: Allocation of source and target buffer for AMX failed. Aborting...\n";
exit(1);
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This memory should be allocated in the LoadWorkerMemory class. A platform independent alligned alloc abstraction is available there. You might need to change it sightly, so that these variables are aligned with 1024B instead of 64B. This change will not only allocate these arrays for the AVX512/AMX payload but for all. There should however be no negative effect other than increased allocated RAM size for all payloads.

Comment on lines +33 to +38
AVX512WithAMXPayload() noexcept {
// Enable the AMX instruction in the AVX512 Payload and request AMX_TILE and AMX_BF16 feature.
addInstructionFlops("AMX", 512);
addFeatureRequest(asmjit::CpuFeatures::X86::kAMX_TILE);
addFeatureRequest(asmjit::CpuFeatures::X86::kAMX_BF16);
}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added this wrapper to the AVX512Payload to allow for checking the AMX_TILE and AMX_BF16 features.

Comment on lines 33 to 46
environment::payload::PayloadSettings(/*Threads=*/{1, 2},
/*DataCacheBufferSize=*/{32768, 1048576, 1441792},
/*RamBufferSize=*/1048576000, /*Lines=*/1536,
/*InstructionGroups=*/
{{"RAM_S", 3},
{"RAM_P", 1},
{"L3_S", 1},
{"L3_P", 1},
{"L2_S", 4},
{"L2_L", 70},
{"L1_S", 0},
{"L1_L", 40},
{"REG", 140},
{"AMX", 1}}),
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These values should be updated. E.g. the L1 data cache size changed from SkylakeSP to Sapphire Rapids

@marenz2569
Copy link
Member Author

After taking another look, you will also need to specialize the compilePayload function so the call to CompiledX86Payload::create<AVX512Payload>(Stats, Code) uses AVX512WithAMXPayload in case of AMX and AVX512Payload otherwise. This will also cause the correct overloaded init function to be used.

@marenz2569 marenz2569 mentioned this pull request Nov 27, 2024
Base automatically changed from code-style-enforcing to master December 5, 2024 12:47
@marenz2569 marenz2569 changed the base branch from master to code-style-enforcing December 5, 2024 13:36
Base automatically changed from code-style-enforcing to master December 5, 2024 14:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants