Update to BLIS 0.9.0 #1

danieldk · 2022-04-11T12:13:31Z

Although the PR targets master now, it should probably target a to-be-created v0.9.0 branch.

Changes:

- Update to BLIS 0.9.0
- Add x86_64_no_zen2 and x86_64_no_zen3 architectures (in addition to x86_64_no_skx), because some compilers don't support the -march=znver2 and -march=znver3 flags

Details: - Reduced the KC cache blocksize for double real on the skx subconfig from 384 to 256. The maximum (extended) KC was also reduced accordingly from 480 to 320. Thanks to Tze Meng Low for suggesting this change.

Details: - Modified the octave scripts in test/3 so that the script does not choke when one or more of the expected OpenBLAS, Eigen, or vendor data files is missing. (The BLIS data set, however, must be complete.) When a file is missing, that data series is simply not included on that particular graph. Also factored out a lot of the redundant logic from plot_panel_4x5.m into a separate function in read_data.m.

Details: - Switched the small block allocator (sba), as defined in bli_sba.c and bli_apool.c, to static initialization of its internal mutex. Did a similar thing for the packing block allocator (pba), which appears as global_membrk in bli_membrk.c. - Commented out bli_membrk_init_mutex() and bli_membrk_finalize_mutex() to ensure they won't be used in the future. - In bli_thrcomm_pthreads.c and .h, removed old, commented-out cpp blocks guarded by BLIS_USE_PTHREAD_MUTEX.

Details: - Renamed the files, variables, and functions relating to the packing block allocator from its legacy name (membrk) to its current name (pba). This more clearly contrasts the packing block allocator with the small block allocator (sba). - Fixed a typo in bli_pack_set_pack_b(), defined in bli_pack.c, that caused the function to erroneously change the value of the pack_a field of the global rntm_t instead of the pack_b field. (Apparently nobody has used this API yet.) - Comment updates.
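The pack_a/pack_b slip described above can be sketched as follows. This is a minimal illustration, not the BLIS source: `demo_rntm_t` and the `demo_` functions are hypothetical stand-ins, and the flags live in a struct passed by pointer rather than in the global rntm_t.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the two packing flags of a runtime object;
   this is not the actual rntm_t definition. */
typedef struct
{
    bool pack_a;
    bool pack_b;
} demo_rntm_t;

/* Buggy shape: the setter meant for B writes the A field instead. */
void demo_set_pack_b_buggy( bool pack_b, demo_rntm_t* rntm )
{
    assert( rntm != NULL );
    rntm->pack_a = pack_b;
}

/* Fixed shape: the setter for B writes the B field and leaves A alone. */
void demo_set_pack_b( bool pack_b, demo_rntm_t* rntm )
{
    assert( rntm != NULL );
    rntm->pack_b = pack_b;
}
```

With the buggy variant, a caller who requests packing of B silently enables packing of A, and B stays unpacked.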

Details: - Removed the option to finalize BLIS after every BLAS call, which also meant that BLIS would initialize at the beginning of every BLAS call. This option never really made sense and wasn't even implemented properly to begin with. (Because bli_init_auto() and _finalize_auto() were implemented in terms of bli_init_once() and _finalize_once(), respectively, the application would have only been able to call one BLAS routine before BLIS would find itself in an unusable, permanently uninitialized state.) Because this option was never meant for regular use, it never made it into configure as an actual configure-time option, and therefore this commit only removes parts of the code affected by the cpp macro guard BLIS_ENABLE_STAY_AUTO_INITIALIZED.

Details: - Fixed a bug in the POWER10 DGEMM kernel whereby the microkernel did not store the microtile result correctly due to incorrect index calculations. (The error was introduced when I reorganized the 'kernels/power10/3' directory.)

Update to a newer version of SDE, and do a direct download as it seems you don't have to click through the license anymore.

Details: - Added an err_t* parameter to memory allocation functions including bli_malloc_intl(), bli_calloc_intl(), bli_malloc_user(), bli_fmalloc_align(), and bli_fmalloc_noalign(). Since these functions already use the return value to return the allocated memory address, they can't communicate errors to the caller through the return value. This commit does not employ any error checking within these functions or their callers, but this sets up BLIS for a more comprehensive commit that moves in that direction. - Moved the typedefs for malloc_ft and free_ft from bli_malloc.h to bli_type_defs.h. This was done so that what remains of bli_malloc.h can be included after the definition of the err_t enum. (This ordering was needed because bli_malloc.h now contains function prototypes that use err_t.) - Defined bli_is_success() and bli_is_failure() static functions in bli_param_macro_defs.h. These functions provide easy checks for error codes and will be used more heavily in future commits. - Unfortunately, the additional err_t* argument discussed above breaks the API for bli_malloc_user(), which is an exported symbol in the shared library. However, it's quite possible that the only application that calls bli_malloc_user()--indeed, the reason it was marked for symbol exporting to begin with--is the BLIS testsuite. And if that's the case, this breakage won't affect anyone. Nonetheless, the "major" part of the so_version file has been updated accordingly to 4.0.0.
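The out-parameter pattern described above might look roughly like this. It is a sketch under assumptions: `demo_err_t`, its values, and `demo_malloc()` are hypothetical and do not reproduce the BLIS err_t enum or its allocation functions. The commit itself adds no error checking, so the status assignment here only shows where a later commit could put it.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical error codes; names and values are illustrative only. */
typedef enum
{
    DEMO_SUCCESS       = 0,
    DEMO_MALLOC_FAILED = 1
} demo_err_t;

/* Small predicates so callers do not compare against raw codes. */
static inline bool demo_is_success( demo_err_t e ) { return e == DEMO_SUCCESS; }
static inline bool demo_is_failure( demo_err_t e ) { return e != DEMO_SUCCESS; }

/* The return value already carries the address, so the status is
   reported through the out-parameter r_val instead. */
void* demo_malloc( size_t size, demo_err_t* r_val )
{
    assert( r_val != NULL );

    void* p = malloc( size );

    /* malloc( 0 ) may legitimately return NULL, so a NULL result is
       treated as a failure only for non-zero requests. */
    *r_val = ( p != NULL || size == 0 ) ? DEMO_SUCCESS : DEMO_MALLOC_FAILED;

    return p;
}
```

A caller would declare a `demo_err_t`, pass its address, and test it with `demo_is_failure()` before touching the returned pointer.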

Details: - Fixed an issue where the wrong string was being passed in for the vendor legend string. - Changed the graph in which the legends appear. - Updates to runthese.m.

Details: - Added single-threaded and multithreaded performance results to docs/Performance.md. These results were gathered on the "Fugaku" Fujitsu A64fx supercomputer at the RIKEN Center for Computational Science in Kobe, Japan. Special thanks to RuQing Xu and Stepan Nassyr for their work in developing and optimizing A64fx support in BLIS and RuQing for gathering the performance data that is reflected in these new graphs.

* Performance.md Update A64fx Comments
  - Reason for ARMPL's missing data;
  - Additional envs / flags for kernel selection;
  - Update BLIS SRC commit.
* Include Another Fix in armsve-cfg-vendor: a prototype was forgotten, so the void* pointer was not fully returned.

Needed for compiling on e.g. Mac M1. AFAIK clang supports the same -mcpu flag for ThunderX2 as gcc.

Fix typo in FAQ.md

Details: - Spun off the Performance.md and PerformanceSmall.md links in the Documentation section into a new Performance section dedicated to those two links. (The previous entries remain redundantly listed within Documentation section.) Thanks to Robert van de Geijn for suggesting this change.

Details: - Changed bli_pack_get_pack_a() and bli_pack_get_pack_b() so that instead of returning a bool, they set a bool that is passed in by address. This does break the public exported API, but I expect very few users actually use this function. (This change is being made in preparation for a much more extensive commit relating to error checking.)

Details: - Defined getijv, setijv operations to get and set elements of a vector, in bli_setgetijv.c and .h. - Renamed bli_setgetij.c and .h to bli_setgetijm.c and .h, respectively. - Added additional bounds checking to getijm and setijm to prevent actions with negative indices. - Added documentation to BLISObjectAPI.md and BLISTypedAPI.md for getijv and setijv. - Added documentation to BLISTypedAPI.md for getijm and setijm, which were inadvertently missing. - Added a new entry to the FAQ titled "Why does BLIS have vector (level-1v) and matrix (level-1m) variations of most level-1 operations?" - Comment updates.
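A bounds-checked vector get/set pair in the spirit of getijv and setijv could be sketched as below. The `demo_vec_t` type and the function signatures are hypothetical; the real operations work on BLIS objects and typed buffers, and report errors differently.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical strided vector of doubles; not a BLIS object type. */
typedef struct
{
    double*   buf;  /* address of the first element */
    ptrdiff_t n;    /* number of elements */
    ptrdiff_t inc;  /* stride between consecutive elements */
} demo_vec_t;

/* True if index i addresses an element of x. Rejecting i < 0 is the
   kind of check the commit adds; i < n guards the upper end. */
static bool demo_index_ok( const demo_vec_t* x, ptrdiff_t i )
{
    return x != NULL && i >= 0 && i < x->n;
}

/* Stores val into element i of x; returns false if i is out of range. */
bool demo_setijv( double val, ptrdiff_t i, demo_vec_t* x )
{
    if ( !demo_index_ok( x, i ) ) return false;
    x->buf[ i * x->inc ] = val;
    return true;
}

/* Loads element i of x into *val; returns false if i is out of range. */
bool demo_getijv( ptrdiff_t i, const demo_vec_t* x, double* val )
{
    assert( val != NULL );
    if ( !demo_index_ok( x, i ) ) return false;
    *val = x->buf[ i * x->inc ];
    return true;
}
```

Using a signed index type is what makes the negative-index check meaningful; with an unsigned index, a negative value would wrap to a large positive one and be caught only by the upper bound.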

Details: - Added new implementations of bli_slamch() and bli_dlamch() that use constants from the standard C library in lieu of dynamically-computed values (via code inherited from netlib). The previous implementation is still available when the cpp macro BLIS_ENABLE_LEGACY_LAMCH is defined by the subconfiguration at compile-time. Thanks to Devin Matthews for providing this patch, and to Stefano Zampini for reporting the issue (flame#497) that prompted Devin to propose the patch.
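As a rough illustration of the idea, a lamch-style query built on `<float.h>` constants might look like this. `demo_dlamch()` is a hypothetical name and is not the patch that went into BLIS; the selector letters follow the LAPACK dlamch convention, and the value for 'E' assumes round-to-nearest arithmetic.

```c
#include <assert.h>
#include <ctype.h>
#include <float.h>

/* Returns a double-precision machine parameter selected by cmach,
   taken from <float.h> instead of being computed at run time.
   Unrecognized selectors return 0.0. */
double demo_dlamch( char cmach )
{
    /* Unit roundoff, assuming round-to-nearest arithmetic. */
    const double eps = DBL_EPSILON * 0.5;

    switch ( toupper( (unsigned char)cmach ) )
    {
        case 'E': return eps;                      /* relative machine epsilon */
        case 'S':                                  /* safe minimum */
        {
            double       sfmin   = DBL_MIN;
            const double inv_max = 1.0 / DBL_MAX;
            /* Keep 1/sfmin from overflowing if 1/DBL_MAX is the larger value. */
            if ( inv_max >= sfmin ) sfmin = inv_max * ( 1.0 + eps );
            return sfmin;
        }
        case 'B': return (double)FLT_RADIX;        /* base of the machine */
        case 'P': return eps * (double)FLT_RADIX;  /* eps * base */
        case 'N': return (double)DBL_MANT_DIG;     /* digits in the mantissa */
        case 'M': return (double)DBL_MIN_EXP;      /* minimum exponent */
        case 'U': return DBL_MIN;                  /* underflow threshold */
        case 'L': return (double)DBL_MAX_EXP;      /* maximum exponent */
        case 'O': return DBL_MAX;                  /* overflow threshold */
        default:  return 0.0;
    }
}
```

Because every value is a compile-time constant or a trivial expression of one, the function cannot be perturbed by compiler optimizations the way a dynamically computed probe can.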

Details: - Defined eqsc, eqv, and eqm operations, which set a bool depending on whether the two scalars, two vectors, or two matrix operands are equal (element-wise). eqsc and eqv support implicit conjugation and eqm supports diagonal offset, diag, uplo, and trans parameters (in a manner consistent with other level-1m operations). These operations are currently housed under frame/util, at least for now, because they are not computational in nature. - Redefined bli_obj_equals() in terms of eqsc, eqv, and eqm. - Documented eqsc, eqv, and eqm in BLISObjectAPI.md and BLISTypedAPI.md. Also: - Documented getsc and setsc in both docs. - Reordered entry for setijv in BLISTypedAPI.md, and added separator bars to both docs. - Added missing "Observed object properties" clauses to various level-1v entries in BLISObjectAPI.md. - Defined bli_apply_trans() in bli_param_macro_defs.h. - Defined supporting _check() function, bli_l0_xxbsc_check(), in bli_l0_check.c for eqsc. - Programming style and whitespace updates to bli_l1m_unb_var1.c. - Whitespace updates to bli_l0_oapi.c, bli_l1m_oapi.c - Consolidated redundant macro redefinition for copym function pointer type in bli_l1m_ft.h. - Added macros to bli_oapi_ba.h, _ex.h, and bli_tapi_ba.h, _ex.h that allow oapi and tapi source files to forego defining certain expert functions. (Certain operations such as printv and printm do not need to have both basic and expert interfaces. This also includes eqsc, eqv, and eqm.)
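An element-wise vector comparison that reports its result through a bool, as eqv does, can be sketched as follows. `demo_deqv()` is a hypothetical real-domain helper; it omits the conjugation parameter and does not reproduce the BLIS signature.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Sets *is_eq to true if x[i] == y[i] for every 0 <= i < n, and to
   false otherwise. incx and incy are element strides. In this sketch,
   empty vectors compare equal. */
void demo_deqv( size_t n,
                const double* x, ptrdiff_t incx,
                const double* y, ptrdiff_t incy,
                bool* is_eq )
{
    assert( is_eq != NULL );

    *is_eq = true;

    for ( size_t i = 0; i < n; ++i )
    {
        if ( x[ (ptrdiff_t)i * incx ] != y[ (ptrdiff_t)i * incy ] )
        {
            /* One mismatch settles the answer, so stop early. */
            *is_eq = false;
            return;
        }
    }
}
```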

Details: - Changed #ifdef BLIS_OAPI_BASIC to #ifdef BLIS_TAPI_BASIC in bli_util_ft.h. This typo was causing some types to be redefined when they weren't supposed to be.

Details: - Added frame/include/bli_xapi_undef.h, which explicitly undefines all macros defined in bli_oapi_ba.h, bli_oapi_ex.h, bli_tapi_ba.h, and bli_tapi_ex.h. (This is for safety and good cpp coding practice, not because it fixes anything.) - Added #include "bli_xapi_undef.h" to bli_l1v.h, bli_l1d.h, bli_l1f.h, bli_l1m.h, bli_l2.h, bli_l3.h, and bli_util.h. - Comment updates to bli_oapi_ba.h, bli_oapi_ex.h, bli_tapi_ba.h, and bli_tapi_ex.h. - Moved frame/3/bli_l3_ft_ex.h to local 'old' directory after realizing that nothing in BLIS used those function pointer types. Also commented out the "#include bli_l3_ft_ex.h" directive in frame/3/bli_l3.h.

Details: - Inserted a "#include bli_xapi_undef.h" after each usage of the basic and expert API macro setup headers: bli_oapi_ba.h, bli_oapi_ex.h, bli_tapi_ba.h, and bli_tapi_ex.h. This is functionally equivalent to the previous status quo, in which each header made minimal #undef prior to its own definitions and then a single instance of "#include bli_xapi_undef.h" cleaned up any remaining macro defs after all other headers were used. This commit will guarantee that macro defs from the setup of one header (say, bli_oapi_ex.h) don't "infect" the definitions made in a subsequent header. As with the previous commit, this change does not fix any issue but rather attempts to avoid creating orphaned macro definitions that are only needed within a very limited scope. - Removed minimal #undef from bli_?api_[ba|ex].h. - Removed old commented-out lines from bli_?api_[ba|ex].h.

armclang is treated as regular clang. Fixes flame#606. [ci skip]

No need to query MR during kernel runtime.

For clang (& armclang?) compilation. Hopefully solves flame#609 .

Look for `GCC` in addition to `gcc` to handle weird conda version strings. [ci skip]

Fixes flame#611.

Flat namespaces can cause problems due to conflicting system libraries, etc., so just mark `xerbla_` as a weak symbol on macOS instead.

Details: - Moved edge-case handling into the gemmtrsm microkernel. This required changing the microkernel API to take m and n dimension parameters as well as updating all existing gemmtrsm microkernel function pointer types, function signatures, and related definitions to take m and n dimensions. Also updated all existing gemmtrsm kernels in the 'kernels' directory (which for now is limited to haswell and penryn kernel sets, plus native and 1m-based reference kernels in 'ref_kernels') to take m and n dimensions, and implemented edge-case handling within those microkernels via a collection of new C preprocessor macros defined within bli_edge_case_macro_defs.h. Note that the edge-case handling for gemm-like operations had already been relocated into the gemm microkernel in 54fa28b. - Added descriptive comments to GEMM_UKR_SETUP_CT() and related macros in bli_edge_case_macro_defs.h to allow for easier reading. - Updated docs/KernelsHowTo.md to reflect above changes. Also cleaned up the bullet under "Implementation Notes for gemm" that covers alignment issues. (Thanks to Ivan Korostelev for pointing out the confusing and outdated language in issue flame#591.) - Other minor tweaks to KernelsHowTo.md.
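The general idea of handling edge cases inside a microkernel can be sketched for a plain gemm update (the trsm part is left out). Everything here is hypothetical: `DEMO_MR`, `DEMO_NR`, and `demo_dgemm_ukr()` are illustrative names, the packed micropanels are assumed to be zero-padded to full width, and the code does not use the macros from bli_edge_case_macro_defs.h.

```c
#include <assert.h>
#include <stddef.h>

enum { DEMO_MR = 4, DEMO_NR = 4 };  /* hypothetical register-block sizes */

/* Computes C(0:m-1, 0:n-1) += A * B for one microtile, where
   - a is a packed DEMO_MR x k micropanel, element (i,p) at a[p*DEMO_MR + i],
   - b is a packed k x DEMO_NR micropanel, element (p,j) at b[p*DEMO_NR + j],
   - c has row stride rs_c and column stride cs_c.
   The full DEMO_MR x DEMO_NR product is accumulated in a local tile,
   and only the m x n corner that exists in C is written back. */
void demo_dgemm_ukr( size_t m, size_t n, size_t k,
                     const double* a, const double* b,
                     double* c, ptrdiff_t rs_c, ptrdiff_t cs_c )
{
    assert( m <= DEMO_MR && n <= DEMO_NR );

    double ct[ DEMO_MR * DEMO_NR ] = { 0.0 };

    /* Rank-1 updates over the full tile, as an optimized kernel would do. */
    for ( size_t p = 0; p < k; ++p )
        for ( size_t j = 0; j < DEMO_NR; ++j )
            for ( size_t i = 0; i < DEMO_MR; ++i )
                ct[ i + j * DEMO_MR ] += a[ p * DEMO_MR + i ] * b[ p * DEMO_NR + j ];

    /* Edge-case handling: touch only the part of C that is in bounds. */
    for ( size_t j = 0; j < n; ++j )
        for ( size_t i = 0; i < m; ++i )
            c[ (ptrdiff_t)i * rs_c + (ptrdiff_t)j * cs_c ] += ct[ i + j * DEMO_MR ];
}
```

Passing m and n into the kernel lets the caller invoke it directly on partial tiles at the matrix boundary, instead of allocating a temporary tile and copying the valid part itself.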

Details: - Renamed the following macros defined in bli_kernel_macro_defs.h: BLIS_SIMD_NUM_REGISTERS -> BLIS_SIMD_MAX_NUM_REGISTERS BLIS_SIMD_SIZE -> BLIS_SIMD_MAX_SIZE Also updated all instances of these macros elsewhere, including subconfigurations, source code, and documentation. Thanks to Devin Matthews for suggesting this change.

Availability of the `armsve` subconfig is controlled by the compiler version (gcc/clang). Tested for SVE and non-SVE. Fixes flame#612.

Fixes flame#613. There are several macros/environment variables which need to be tuned to get good cache block sizes. It would be nice to have a way of getting values automatically.

Details:
- Consolidate handling of tools that are specifiable via CC, CXX, FC, PYTHON, AR, and RANLIB into one bash function, select_tool_w_env().
- If the user specifies a tool via an environment variable (e.g. CC=gcc) and that tool does not seem valid, print an error message and abort configure, unless the tool is optional (e.g. CXX or FC), in which case a warning message is printed instead.
- The definition of "seems valid" above amounts to:
  - responding to at least one of a basic set of command line options (e.g. --version, -V, -h) if the os_name is Linux (since GNU tools tend to respond to flags such as --version) or if the tool in question is CC, CXX, FC, or PYTHON (which tend to respond to the expected flags regardless of OS)
  - the binary merely existing for AR and RANLIB on Darwin/OSX/BSD. (These OSes tend to have non-GNU versions of ar and ranlib, which typically do not respond to --version and friends.)
- This PR addresses flame#584. Thanks to Devin Matthews for suggesting some of the changes in this commit.

Details: - Fixed a performance regression affecting nearly all level-3 operations that use the 'haswell' sgemm and dgemm microkernels. This regression was introduced in 54fa28b, caused by an ill-formed conditional expression in the assembly code that controls whether cache lines of C should be prefetched as rows or as columns. Essentially, the two branches were reversed, causing incomplete prefetching to occur for both row- and column-stored instances of matrix C. Thanks to Devin Matthews for his help finding and fixing this bug.

Use new API for POWER10 gemm microkernel

Details: - Implemented a multithreaded optimization for the special (and common) case of employing the gemmsup code path when the user requests (implicitly or explicitly) that neither A nor B be packed during computation. This optimization takes the form of a greatly reduced code branch in bli_thrinfo_sup_create_for_cntl(), which avoids a broadcast and two barriers, and results in higher performance when obtaining two-way or higher parallelism within BLIS. Thanks to Bhaskar Nallani of AMD for proposing this change via issue flame#605. - Added an early return branch to bli_thrinfo_create_for_cntl() that detects and quickly handles cases where no parallelism is being obtained within BLIS (i.e., single-threaded execution). Note that this special case handling was/is already present in bli_thrinfo_sup_create_for_cntl(). - CREDITS file update.

Details: - A co-attribution to Mithun Mohan was inadvertently omitted from the commit log for headline change in the previous commit, 7c07b47.

Add notes about `BLIS_IR_NT` (should typically be 1) and `BLIS_JR_NT` (should typically be small, e.g. <= 4). [ci skip]

Details: - Created ?gemm3m_() and cblas_?gemm3m() APIs that (for now) simply invoke the 1m implementation unconditionally. (Note that these APIs bypass sup handling.) - Added BLAS prototypes for gemm3m in frame/compat/bla_gemm3m.h. - Added CBLAS prototypes for gemm3m in frame/compat/cblas/src/cblas.h. - Relocated frame/compat/cblas/src/cblas_?gemmt.c files into frame/compat/cblas/src/extra/ - Relocated frame/compat/bla_gemmt.? into frame/compat/extra/ . - Minor reorganization of prototypes and cpp macro directives in bli_blas.h, cblas.h, and cblas_f77.h. - Trivial whitespace change to cblas_zgemm.c.

Details: - Allow building BLIS with certain framework files (each with the '_amd' suffix) that have been customized by AMD for Zen-based hardware. These customized files were derived from portable versions of the same files (i.e., those without the '_amd' suffix). Whether the portable or AMD-specific files are compiled is now controlled by a new configure option, --[en|dis]able-amd-frame-tweaks. This option is disabled by default in vanilla BLIS, though AMD may choose to enable it by default in their fork. For now, the added AMD-specific files are: - bli_gemv_unf_var2_amd.c - bla_copy_amd.c - bla_gemv_amd.c These files reside in 'amd' subdirectories found within the directory housing their generic counterparts. - Register optimized real-domain copyv, setv, and swapv kernels in bli_cntx_init_zen.c. - Various minor updates to level-1v kernels in 'zen' kernel set. - Added caxpyf kernel as well as saxpyf and multiple daxpyf kernels to the 'zen' kernel set. - If the problem passed to ?gemm_() in bla_gemm.c has a unit m or n dim, call gemv instead and return early. - Combined variable declarations with their initialization in various level-2 and level-3 BLAS compatibility files, and also inserted 'const' qualifier in those same declaration statements. - Moved frame/compat/bla_gemmt.c and .h to frame/compat/extra/ . - Added copyv and swapv test drivers to 'test' directory. - Whitespace, comment changes.

Details: - Fixed an unresolved symbol issue leftover from flame#590 whereby ?gemm3m_() as defined in bla_gemm3m.c was referencing bla_gemm3m_check(), which does not exist. It should have simply called the _check() function for gemm.

Add:
- x86_64_no_zen2: for compilers without znver2
- x86_64_no_zen3: for compilers without znver3

Also disable Knights Landing on compilers that do not support Zen 2. Knights Landing is pretty rare, so probably not worth maintaining a separate architecture for.

fgvanzee and others added 30 commits March 19, 2021 13:03

Reduced KC on skx from 384 to 256.

bf1b578

Details: - Reduced the KC cache blocksize for double real on the skx subconfig from 384 to 256. The maximum (extended) KC was also reduced accordingly from 480 to 320. Thanks to Tze Meng Low for suggesting this change.

Merge branch 'master' of github.com:flame/blis

57ef61f

CREDITS file update.

ca83f95

ReleaseNotes.md update in advance of next version.

e56d9f2

Version file update (0.8.1)

8535b3e

CHANGELOG update (0.8.1)

545e6c2

Update do_sde.sh (flame#489)

9050819

Update to a newer version of SDE, and do a direct download as it seems you don't have to click through the license anymore.

Merge branch 'master' into dev

f9ad55c

Minor updates and fixes to test/3/octave scripts.

ba3ba8d

Details: - Fixed an issue where the wrong string was being passed in for the vendor legend string. - Changed the graph in which the legends appear. - Updates to runthese.m.

Minor updates to a64fx section of Performance.md.

6280757

Allow clang for ThunderX2 config

6548ceb

Needed for compiling on e.g. Mac M1. AFAIK clang supports the same -mcpu flag for ThunderX2 as gcc.

Fix typo in FAQ.md

1f3461a

Merge pull request flame#493 from cassiersg/patch-1

40ce5fd

Fix typo in FAQ.md

Fixed typo in Table of Contents.

6a4aa98

Fixed typo in cpp guard in bli_util_ft.h.

5aa63cd

Details: - Changed #ifdef BLIS_OAPI_BASIC to #ifdef BLIS_TAPI_BASIC in bli_util_ft.h. This typo was causing some types to be redefined when they weren't supposed to be.

devinamatthews and others added 26 commits January 31, 2022 10:29

Add armclang detection to configure.

35195bb

armclang is treated as regular clang. Fixes flame#606. [ci skip]

Armv8a, ArmSVE: Simplify Gen-C

b5df181

Fix SVE Compil.

9cc897f

ArmSVE Use Predicate in M-Direction

72089bb

No need to query MR during kernel runtime.

ArmSVE Adopts Label Wrapper

2f3872e

For clang (& armclang?) compilation. Hopefully solves flame#609 .

Update CC_VENDOR logic

2674291

Look for `GCC` in addition to `gcc` to handle weird conda version strings. [ci skip]

Use -flat_namespace option to link on macOS

5a4d3f5

Fixes flame#611.

Don't use -Wl,-flat-namespace.

2506159

Flat namespaces can cause problems due to conflicting system libraries, etc., so just mark `xerbla_` as a weak symbol on macOS instead.

Add armsve to arm64 Metaconfig (flame#614)

4d83523

Availability of the `armsve` subconfig is controlled by the compiler version (gcc/clang). Tested for SVE and non-SVE. Fixes flame#612.

ArmSVE Ensure Non-zero Block Size (flame#615)

d514658

Fixes flame#613. There are several macros/environment variables which need to be tuned to get good cache block sizes. It would be nice to have a way of getting values automatically.

POWER10: edge cases in microkernel (flame#620)

cad1041

Use new API for POWER10 gemm microkernel

Trival whitespace change; commit log addendum.

f1dbb0e

Details: - A co-attribution to Mithun Mohan was inadvertently omitted from the commit log for headline change in the previous commit, 7c07b47.

Update Multithreading.md

d681000

Add notes about `BLIS_IR_NT` (should typically be 1) and `BLIS_JR_NT` (should typically be small, e.g. <= 4). [ci skip]

Fixed typo in BLAS gemm3m call to _check().

cf06364

Details: - Fixed an unresolved symbol issue leftover from flame#590 whereby ?gemm3m_() as defined in bla_gemm3m.c was referencing bla_gemm3m_check(), which does not exist. It should have simply called the _check() function for gemm.

CREDITS file update.

bee7678

ReleaseNotes.md update in advance of next version.

99bb900

Version file update (0.9.0)

14c86f6

Update to 0.9.0

aeb9d30

danieldk changed the base branch from master to v0.9.0 May 5, 2022 11:56

adrianeboyd approved these changes May 5, 2022

danieldk merged commit a9a7e28 into explosion:v0.9.0 May 5, 2022

danieldk deleted the v0.9.0 branch May 5, 2022 13:45
