Update to BLIS 0.9.0 #1

Merged
merged 616 commits into from
May 5, 2022

Conversation

danieldk

Although the PR targets master now, it should probably target a to-be-created v0.9.0 branch.

Changes:

  • Update to BLIS 0.9.0
  • Add x86_64_no_zen2 and x86_64_no_zen3 architectures (in addition to x86_64_no_skx), because some compilers don't support the -march=znver2 and -march=znver3 flags

fgvanzee and others added 30 commits March 19, 2021 13:03
Details:
- Reduced the KC cache blocksize for double real on the skx subconfig
  from 384 to 256. The maximum (extended) KC was also reduced
  accordingly from 480 to 320. Thanks to Tze Meng Low for suggesting
  this change.
Details:
- Modified the octave scripts in test/3 so that the script does not
  choke when one or more of the expected OpenBLAS, Eigen, or vendor data
  files is missing. (The BLIS data set, however, must be complete.) When
  a file is missing, that data series is simply not included on that
  particular graph. Also factored out a lot of the redundant logic from
  plot_panel_4x5.m into a separate function in read_data.m.
Details:
- Switched the small block allocator (sba), as defined in bli_sba.c and
  bli_apool.c, to static initialization of its internal mutex. Did a
  similar thing for the packing block allocator (pba), which appears as
  global_membrk in bli_membrk.c.
- Commented out bli_membrk_init_mutex() and bli_membrk_finalize_mutex()
  to ensure they won't be used in the future.
- In bli_thrcomm_pthreads.c and .h, removed old, commented-out cpp
  blocks guarded by BLIS_USE_PTHREAD_MUTEX.
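
The static mutex initialization described above can be sketched as follows. This is a minimal illustration rather than the code in bli_sba.c or bli_pba.c; `pool_t`, `global_pool`, and `pool_checkout()` are hypothetical names, and POSIX threads are assumed.

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical pool type standing in for an allocator's shared state. */
typedef struct
{
    pthread_mutex_t mutex;         /* protects the field below */
    size_t          n_checked_out; /* number of blocks handed out */
} pool_t;

/* Static initialization: the mutex is valid before any init function
   runs, so the lock itself needs no init/finalize pair. */
static pool_t global_pool =
{
    .mutex         = PTHREAD_MUTEX_INITIALIZER,
    .n_checked_out = 0
};

/* Record one block as checked out and return the updated count. */
size_t pool_checkout( void )
{
    pthread_mutex_lock( &global_pool.mutex );
    size_t n = ++global_pool.n_checked_out;
    pthread_mutex_unlock( &global_pool.mutex );
    return n;
}
```

Because the lock is ready at load time, a first call such as `pool_checkout()` is safe even if it happens before any library initialization.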
Details:
- Renamed the files, variables, and functions relating to the packing
  block allocator from its legacy name (membrk) to its current name
  (pba). This more clearly contrasts the packing block allocator with
  the small block allocator (sba).
- Fixed a typo in bli_pack_set_pack_b(), defined in bli_pack.c, that
  caused the function to erroneously change the value of the pack_a
  field of the global rntm_t instead of the pack_b field. (Apparently
  nobody has used this API yet.)
- Comment updates.
Details:
- Removed the option to finalize BLIS after every BLAS call, which also
  means that BLIS would initialize at the beginning of every BLAS call.
  This option never really made sense and wasn't even implemented
  properly to begin with. (Because bli_init_auto() and _finalize_auto()
  were implemented in terms of bli_init_once() and _finalize_once(),
  respectively, the application would have only been able to call one
  BLAS routine before BLIS would find itself in an unusable, permanently
  uninitialized state.) Because this option was never meant for regular
  use, it never made it into configure as an actual configure-time
  option, and therefore this commit only removes parts of the code
  affected by the cpp macro guard BLIS_ENABLE_STAY_AUTO_INITIALIZED.
Details:
- Fixed a bug in the POWER10 DGEMM kernel whereby the microkernel did
  not store the microtile result correctly due to incorrect index
  calculations. (The error was introduced when I reorganized the
  'kernels/power10/3' directory.)
Update to a newer version of SDE, and do a direct download, as it seems you no longer have to click through the license.
Details:
- Added an err_t* parameter to memory allocation functions including
  bli_malloc_intl(), bli_calloc_intl(), bli_malloc_user(),
  bli_fmalloc_align(), and bli_fmalloc_noalign(). Since these functions
  already use the return value to return the allocated memory address,
  they can't communicate errors to the caller through the return value.
  This commit does not employ any error checking within these functions
  or their callers, but this sets up BLIS for a more comprehensive
  commit that moves in that direction.
- Moved the typedefs for malloc_ft and free_ft from bli_malloc.h to
  bli_type_defs.h. This was done so that what remains of bli_malloc.h
  can be included after the definition of the err_t enum. (This ordering
  was needed because bli_malloc.h now contains function prototypes that
  use err_t.)
- Defined bli_is_success() and bli_is_failure() static functions in
  bli_param_macro_defs.h. These functions provide easy checks for error
  codes and will be used more heavily in future commits.
- Unfortunately, the additional err_t* argument discussed above breaks
  the API for bli_malloc_user(), which is an exported symbol in the
  shared library. However, it's quite possible that the only application
  that calls bli_malloc_user()--indeed, the reason it was marked for
  symbol exporting to begin with--is the BLIS testsuite. And if that's
  the case, this breakage won't affect anyone. Nonetheless, the "major"
  part of the so_version file has been updated accordingly to 4.0.0.
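
The error-reporting pattern described above, where the return value carries the address and a separate out-parameter carries the status, might look roughly like this. It is a sketch under assumptions: `my_err_t`, `my_malloc()`, and the two predicates are hypothetical stand-ins, not the BLIS definitions of err_t, bli_malloc_user(), bli_is_success(), or bli_is_failure().

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical error codes; a real error enum would have more values. */
typedef enum { MY_SUCCESS = 0, MY_MALLOC_FAILED = -1 } my_err_t;

/* Small predicates so callers do not compare against raw codes. */
static inline bool my_is_success( my_err_t r ) { return r == MY_SUCCESS; }
static inline bool my_is_failure( my_err_t r ) { return r != MY_SUCCESS; }

/* The return value is the allocated address, so the status is written
   through r_val instead. */
void* my_malloc( size_t size, my_err_t* r_val )
{
    void* p = malloc( size );

    /* malloc(0) may legitimately return NULL, so only flag a failure
       when a nonzero request comes back empty. */
    *r_val = ( p == NULL && size != 0 ) ? MY_MALLOC_FAILED : MY_SUCCESS;

    return p;
}
```

A caller would declare a `my_err_t`, pass its address, and test it with `my_is_failure()` before using the pointer.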
Details:
- Fixed an issue where the wrong string was being passed in for the
  vendor legend string.
- Changed the graph in which the legends appear.
- Updates to runthese.m.
Details:
- Added single-threaded and multithreaded performance results to
  docs/Performance.md. These results were gathered on the "Fugaku"
  Fujitsu A64fx supercomputer at the RIKEN Center for Computational
  Science in Kobe, Japan. Special thanks to RuQing Xu and Stepan
  Nassyr for their work in developing and optimizing A64fx support in
  BLIS and RuQing for gathering the performance data that is reflected
  in these new graphs.
* Performance.md Update A64fx Comments

- Reason for ARMPL's missing data;
- Additional envs / flags for kernel selection;
- Update BLIS SRC commit.

* Include Another Fix in armsve-cfg-vendor

A prototype was missing, so the void* pointer was not returned in full.
Needed for compiling on e.g. Mac M1. AFAIK clang supports the same -mcpu flag for ThunderX2 as gcc.
Details:
- Spun off the Performance.md and PerformanceSmall.md links in the
  Documentation section into a new Performance section dedicated to
  those two links. (The previous entries remain redundantly listed
  within the Documentation section.) Thanks to Robert van de Geijn for
  suggesting this change.
Details:
- Changed bli_pack_get_pack_a() and bli_pack_get_pack_b() so that
  instead of returning a bool, they set a bool that is passed in by
  address. This does break the public exported API, but I expect very
  few users actually use this function. (This change is being made in
  preparation for a much more extensive commit relating to error
  checking.)
Details:
- Defined getijv, setijv operations to get and set elements of a vector,
  in bli_setgetijv.c and .h.
- Renamed bli_setgetij.c and .h to bli_setgetijm.c and .h, respectively.
- Added additional bounds checking to getijm and setijm to prevent
  actions with negative indices.
- Added documentation to BLISObjectAPI.md and BLISTypedAPI.md for getijv
  and setijv.
- Added documentation to BLISTypedAPI.md for getijm and setijm, which
  were inadvertently missing.
- Added a new entry to the FAQ titled "Why does BLIS have vector
  (level-1v) and matrix (level-1m) variations of most level-1
  operations?"
- Comment updates.
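
As a rough illustration of element access with bounds checking, including rejection of negative indices, consider the sketch below. The `vec_t` type and the `vec_geti()`/`vec_seti()` functions are hypothetical and much simpler than the object-based getijv/setijv operations.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical vector view: n elements stored inc apart in buf. */
typedef struct { double* buf; long n; long inc; } vec_t;

/* Copy element i of x into *val. Returns false, leaving *val
   untouched, if i is negative or past the end. */
bool vec_geti( const vec_t* x, long i, double* val )
{
    if ( i < 0 || x->n <= i ) return false;
    *val = x->buf[ i * x->inc ];
    return true;
}

/* Store val into element i of x. Returns false, leaving x
   untouched, if i is negative or past the end. */
bool vec_seti( vec_t* x, long i, double val )
{
    if ( i < 0 || x->n <= i ) return false;
    x->buf[ i * x->inc ] = val;
    return true;
}
```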
Details:
- Added new implementations of bli_slamch() and bli_dlamch() that use
  constants from the standard C library in lieu of dynamically-computed
  values (via code inherited from netlib). The previous implementation
  is still available when the cpp macro BLIS_ENABLE_LEGACY_LAMCH is 
  defined by the subconfiguration at compile-time. Thanks to Devin
  Matthews for providing this patch, and to Stefano Zampini for
  reporting the issue (flame#497) that prompted Devin to propose the patch.
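
The idea of answering lamch-style queries from standard C library constants instead of computing them at run time can be sketched as below. `my_dlamch()` is a hypothetical helper, not the BLIS implementation. The selector letters follow the LAPACK dlamch convention, only a few of them are handled, and halving DBL_EPSILON for 'E' assumes round-to-nearest arithmetic.

```c
#include <ctype.h>
#include <float.h>

/* Hypothetical helper: return a double-precision machine parameter
   selected by cmach, using <float.h> constants only. */
double my_dlamch( char cmach )
{
    switch ( toupper( (unsigned char)cmach ) )
    {
        /* Relative machine epsilon, assuming round-to-nearest. */
        case 'E': return DBL_EPSILON * 0.5;
        /* Base of the floating-point representation. */
        case 'B': return (double)FLT_RADIX;
        /* Number of base digits in the mantissa. */
        case 'N': return (double)DBL_MANT_DIG;
        /* Smallest positive normalized value. */
        case 'U': return DBL_MIN;
        /* Largest finite value. */
        case 'O': return DBL_MAX;
        /* Selectors not covered by this sketch. */
        default:  return 0.0;
    }
}
```

The constants are fixed at compile time, so the result cannot be perturbed by optimization levels or extended-precision registers the way a run-time probing loop can.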
Details:
- Defined eqsc, eqv, and eqm operations, which set a bool depending on
  whether the two scalars, two vectors, or two matrix operands are equal
  (element-wise). eqsc and eqv support implicit conjugation and eqm
  supports diagonal offset, diag, uplo, and trans parameters (in a
  manner consistent with other level-1m operations). These operations
  are currently housed under frame/util, at least for now, because they
  are not computational in nature.
- Redefined bli_obj_equals() in terms of eqsc, eqv, and eqm.
- Documented eqsc, eqv, and eqm in BLISObjectAPI.md and BLISTypedAPI.md.
  Also:
  - Documented getsc and setsc in both docs.
  - Reordered entry for setijv in BLISTypedAPI.md, and added separator
    bars to both docs.
  - Added missing "Observed object properties" clauses to various
    level-1v entries in BLISObjectAPI.md.
- Defined bli_apply_trans() in bli_param_macro_defs.h.
- Defined supporting _check() function, bli_l0_xxbsc_check(), in
  bli_l0_check.c for eqsc.
- Programming style and whitespace updates to bli_l1m_unb_var1.c.
- Whitespace updates to bli_l0_oapi.c, bli_l1m_oapi.c
- Consolidated redundant macro redefinition for copym function pointer
  type in bli_l1m_ft.h.
- Added macros to bli_oapi_ba.h, _ex.h, and bli_tapi_ba.h, _ex.h that
  allow oapi and tapi source files to forego defining certain expert
  functions. (Certain operations such as printv and printm do not need
  to have both basic and expert interfaces. This also includes eqsc, eqv,
  and eqm.)
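
An element-wise vector equality test with implicit conjugation, in the spirit of eqv, could be sketched as follows. The names `zpair_t` and `my_zeqv()` are hypothetical; the signature is an assumption for illustration and does not reproduce the BLIS typed API.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical double-precision complex element. */
typedef struct { double real; double imag; } zpair_t;

/* Set *is_eq to true if x and y are equal element by element.
   If conjx is true, x is compared as conj(x) without being modified. */
void my_zeqv( bool conjx, size_t n,
              const zpair_t* x, size_t incx,
              const zpair_t* y, size_t incy,
              bool* is_eq )
{
    *is_eq = true;

    for ( size_t i = 0; i < n; ++i )
    {
        const zpair_t xi = x[ i * incx ];
        const zpair_t yi = y[ i * incy ];

        /* Apply the conjugation on the fly to the imaginary part. */
        const double xi_imag = conjx ? -xi.imag : xi.imag;

        if ( xi.real != yi.real || xi_imag != yi.imag )
        {
            *is_eq = false;
            return;
        }
    }
}
```

As in the description above, the result is delivered through a bool passed by address rather than through the return value.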
Details:
- Changed #ifdef BLIS_OAPI_BASIC to #ifdef BLIS_TAPI_BASIC in
  bli_util_ft.h. This typo was causing some types to be redefined when
  they weren't supposed to be.
Details:
- Added frame/include/bli_xapi_undef.h, which explicitly undefines all
  macros defined in bli_oapi_ba.h, bli_oapi_ex.h, bli_tapi_ba.h, and
  bli_tapi_ex.h. (This is for safety and good cpp coding practice, not
  because it fixes anything.)
- Added #include "bli_xapi_undef.h" to bli_l1v.h, bli_l1d.h, bli_l1f.h,
  bli_l1m.h, bli_l2.h, bli_l3.h, and bli_util.h.
- Comment updates to bli_oapi_ba.h, bli_oapi_ex.h, bli_tapi_ba.h, and
  bli_tapi_ex.h.
- Moved frame/3/bli_l3_ft_ex.h to local 'old' directory after realizing
  that nothing in BLIS used those function pointer types. Also commented
  out the "#include bli_l3_ft_ex.h" directive in frame/3/bli_l3.h.
Details:
- Inserted a "#include bli_xapi_undef.h" after each usage of the basic
  and expert API macro setup headers: bli_oapi_ba.h, bli_oapi_ex.h,
  bli_tapi_ba.h, and bli_tapi_ex.h. This is functionally equivalent to
  the previous status quo, in which each header made minimal #undef
  prior to its own definitions and then a single instance of
  "#include bli_xapi_undef.h" cleaned up any remaining macro defs after
  all other headers were used. This commit will guarantee that macro
  defs from the setup of one header (say, bli_oapi_ex.h) don't "infect"
  the definitions made in a subsequent header. As with the previous
  commit, this change does not fix any issue but rather attempts to
  avoid creating orphaned macro definitions that are only needed within
  a very limited scope.
- Removed minimal #undef from bli_?api_[ba|ex].h.
- Removed old commented-out lines from bli_?api_[ba|ex].h.
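
The define, use, then undefine discipline described above can be shown in a single file. This is only a sketch of the pattern: the `FUNCNAME` macro and the `my_addone_*` functions are hypothetical, and in BLIS the setup and cleanup steps live in separate headers rather than inline.

```c
/* Setup for the "basic" flavor (normally done by a setup header). */
#define FUNCNAME( op ) my_ ## op ## _basic

int FUNCNAME( addone )( int x ) { return x + 1; }

/* Cleanup right after use, so nothing leaks into what follows. */
#undef FUNCNAME

/* Setup for the "expert" flavor. Without the #undef above, this
   second definition would be an incompatible macro redefinition. */
#define FUNCNAME( op ) my_ ## op ## _expert

int FUNCNAME( addone )( int x, int scale ) { return scale * ( x + 1 ); }

/* Cleanup again, leaving no orphaned macro behind. */
#undef FUNCNAME
```

After preprocessing, the file defines `my_addone_basic()` and `my_addone_expert()`, and `FUNCNAME` no longer exists for later code to pick up by accident.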
devinamatthews and others added 26 commits January 31, 2022 10:29
armclang is treated as regular clang. Fixes flame#606. [ci skip]
No need to query MR during kernel runtime.
For clang (& armclang?) compilation.

Hopefully solves flame#609.
Look for `GCC` in addition to `gcc` to handle weird conda version strings. [ci skip]
Flat namespaces can cause problems due to conflicting system libraries,
etc., so just mark `xerbla_` as a weak symbol on macOS instead.
Details:
- Moved edge-case handling into the gemmtrsm microkernel. This required
  changing the microkernel API to take m and n dimension parameters as
  well as updating all existing gemmtrsm microkernel function pointer
  types, function signatures, and related definitions to take m and n
  dimensions. Also updated all existing gemmtrsm kernels in the
  'kernels' directory (which for now is limited to haswell and penryn
  kernel sets, plus native and 1m-based reference kernels in
  'ref_kernels') to take m and n dimensions, and implemented edge-case
  handling within those microkernels via a collection of new C
  preprocessor macros defined within bli_edge_case_macro_defs.h. Note
  that the edge-case handling for gemm-like operations had already
  been relocated into the gemm microkernel in 54fa28b.
- Added descriptive comments to GEMM_UKR_SETUP_CT() and related macros in
  bli_edge_case_macro_defs.h to allow for easier reading.
- Updated docs/KernelsHowTo.md to reflect above changes. Also cleaned up
  the bullet under "Implementation Notes for gemm" that covers alignment
  issues. (Thanks to Ivan Korostelev for pointing out the confusing and
  outdated language in issue flame#591.)
- Other minor tweaks to KernelsHowTo.md.
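
To illustrate what edge-case handling inside a microkernel means, here is a sketch for a plain gemm-style update (not gemmtrsm, and not the macros in bli_edge_case_macro_defs.h). The function `my_dgemm_ukr()`, the block sizes `MR` and `NR`, and the packed layouts are assumptions made for the example.

```c
#include <stddef.h>

enum { MR = 4, NR = 4 }; /* hypothetical register block sizes */

/* Hypothetical reference microkernel computing C := C + A*B on the
   m x n part of an MR x NR microtile, with m <= MR and n <= NR.
   a holds k columns of MR elements and b holds k rows of NR elements. */
void my_dgemm_ukr( size_t m, size_t n, size_t k,
                   const double* a, const double* b,
                   double* c, size_t rs_c, size_t cs_c )
{
    double ab[ MR * NR ] = { 0.0 };

    /* Always compute the full MR x NR product into a local tile. */
    for ( size_t p = 0; p < k; ++p )
        for ( size_t i = 0; i < MR; ++i )
            for ( size_t j = 0; j < NR; ++j )
                ab[ i * NR + j ] += a[ p * MR + i ] * b[ p * NR + j ];

    /* Edge case: update only the m x n elements that exist in C. */
    for ( size_t i = 0; i < m; ++i )
        for ( size_t j = 0; j < n; ++j )
            c[ i * rs_c + j * cs_c ] += ab[ i * NR + j ];
}
```

Passing m and n into the kernel means the caller no longer needs its own temporary tile and copy-back step for partial tiles at the matrix boundary.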
Details:
- Renamed the following macros defined in bli_kernel_macro_defs.h:

    BLIS_SIMD_NUM_REGISTERS -> BLIS_SIMD_MAX_NUM_REGISTERS
    BLIS_SIMD_SIZE          -> BLIS_SIMD_MAX_SIZE

  Also updated all instances of these macros elsewhere, including
  subconfigurations, source code, and documentation. Thanks to Devin
  Matthews for suggesting this change.
Availability of the `armsve` subconfig is controlled by the compiler version (gcc/clang). Tested for SVE and non-SVE. Fixes flame#612.
Fixes flame#613. There are several macros/environment variables which need to be tuned to get good cache block sizes. It would be nice to have a way of getting values automatically.
Details:
- Consolidate handling of tools that are specifiable via CC, CXX, FC, 
  PYTHON, AR, and RANLIB into one bash function, select_tool_w_env().
  - If the user specifies a tool via an environment variable (e.g. 
    CC=gcc) and that tool does not seem valid, print an error message 
    and abort configure, unless the tool is optional (e.g. CXX or FC), 
    in which case a warning message is printed instead.
  - The definition of "seems valid" above amounts to:
    - responding to at least one of a basic set of command line options 
      (e.g. --version, -V, -h) if the os_name is Linux (since GNU tools 
      tend to respond to flags such as --version) or if the tool in 
      question is CC, CXX, FC, or PYTHON (which tend to respond to the 
      expected flags regardless of OS)
    - the binary merely existing for AR and RANLIB on Darwin/OSX/BSD. 
      (These OSes tend to have non-GNU versions of ar and ranlib, which 
      typically do not respond to --version and friends.)
- This PR addresses flame#584. Thanks to Devin Matthews for suggesting some
  of the changes in this commit.
Details:
- Fixed a performance regression affecting nearly all level-3 operations
  that use the 'haswell' sgemm and dgemm microkernels. This regression
  was introduced in 54fa28b, caused by an ill-formed conditional
  expression in the assembly code that controls whether cache lines of C
  should be prefetched as rows or as columns. Essentially, the two
  branches were reversed, causing incomplete prefetching to occur for
  both row- and column-stored instances of matrix C. Thanks to Devin
  Matthews for his help finding and fixing this bug.
Use new API for POWER10 gemm microkernel
Details:
- Implemented a multithreaded optimization for the special (and common)
  case of employing the gemmsup code path when the user requests
  (implicitly or explicitly) that neither A nor B be packed during
  computation. This optimization takes the form of a greatly reduced
  code branch in bli_thrinfo_sup_create_for_cntl(), which avoids a
  broadcast and two barriers, and results in higher performance when
  obtaining two-way or higher parallelism within BLIS. Thanks to
  Bhaskar Nallani of AMD for proposing this change via issue flame#605.
- Added an early return branch to bli_thrinfo_create_for_cntl() that
  detects and quickly handles cases where no parallelism is being
  obtained within BLIS (i.e., single-threaded execution). Note that
  this special case handling was/is already present in
  bli_thrinfo_sup_create_for_cntl().
- CREDITS file update.
Details:
- A co-attribution to Mithun Mohan was inadvertently omitted from the
  commit log for the headline change in the previous commit, 7c07b47.
Add notes about `BLIS_IR_NT` (should typically be 1) and `BLIS_JR_NT` (should typically be small, e.g. <= 4). [ci skip]
Details:
- Created ?gemm3m_() and cblas_?gemm3m() APIs that (for now) simply
  invoke the 1m implementation unconditionally. (Note that these APIs
  bypass sup handling.)
- Added BLAS prototypes for gemm3m in frame/compat/bla_gemm3m.h.
- Added CBLAS prototypes for gemm3m in frame/compat/cblas/src/cblas.h.
- Relocated: 
    frame/compat/cblas/src/cblas_?gemmt.c 
  files into
    frame/compat/cblas/src/extra/ 
- Relocated frame/compat/bla_gemmt.? into frame/compat/extra/ .
- Minor reorganization of prototypes and cpp macro directives in 
  bli_blas.h, cblas.h, and cblas_f77.h.
- Trivial whitespace change to cblas_zgemm.c.
Details:
- Allow building BLIS with certain framework files (each with the '_amd'
  suffix) that have been customized by AMD for Zen-based hardware. These
  customized files were derived from portable versions of the same files
  (i.e., those without the '_amd' suffix). Whether the portable or AMD-
  specific files are compiled is now controlled by a new configure
  option, --[en|dis]able-amd-frame-tweaks. This option is disabled by
  default in vanilla BLIS, though AMD may choose to enable it by default
  in their fork. For now, the added AMD-specific files are:
  - bli_gemv_unf_var2_amd.c
  - bla_copy_amd.c
  - bla_gemv_amd.c
  These files reside in 'amd' subdirectories found within the directory
  housing their generic counterparts.
- Register optimized real-domain copyv, setv, and swapv kernels in
  bli_cntx_init_zen.c.
- Various minor updates to level-1v kernels in 'zen' kernel set.
- Added a caxpyf kernel as well as saxpyf and multiple daxpyf kernels to
  the 'zen' kernel set.
- If the problem passed to ?gemm_() in bla_gemm.c has a unit m or n dim,
  call gemv instead and return early.
- Combined variable declarations with their initialization in various
  level-2 and level-3 BLAS compatibility files, and also inserted
  'const' qualifier in those same declaration statements.
- Moved frame/compat/bla_gemmt.c and .h to frame/compat/extra/ .
- Added copyv and swapv test drivers to 'test' directory.
- Whitespace, comment changes.
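
The early return to gemv for problems with a unit dimension can be sketched as below. This is not the code in bla_gemm.c: `my_dgemv()` and `my_dgemm()` are hypothetical, column-major storage without transposition is assumed, and only the n == 1 case is handled (the m == 1 case would need a transposed, strided gemv).

```c
#include <stddef.h>

/* y := y + A*x for a column-major m x k matrix A with leading
   dimension lda. */
static void my_dgemv( size_t m, size_t k,
                      const double* a, size_t lda,
                      const double* x, double* y )
{
    for ( size_t p = 0; p < k; ++p )
        for ( size_t i = 0; i < m; ++i )
            y[ i ] += a[ i + p * lda ] * x[ p ];
}

/* C := C + A*B, column-major, no transposition. Returns 1 if the
   matrix-vector shortcut was taken and 0 otherwise. */
int my_dgemm( size_t m, size_t n, size_t k,
              const double* a, size_t lda,
              const double* b, size_t ldb,
              double* c, size_t ldc )
{
    if ( n == 1 )
    {
        /* B and C are single columns, so the product is a gemv. */
        my_dgemv( m, k, a, lda, b, c );
        return 1;
    }

    /* General case: straightforward triple loop. */
    for ( size_t j = 0; j < n; ++j )
        for ( size_t p = 0; p < k; ++p )
            for ( size_t i = 0; i < m; ++i )
                c[ i + j * ldc ] += a[ i + p * lda ] * b[ p + j * ldb ];

    return 0;
}
```

The return value exists only to make the chosen path observable in this sketch; a BLAS-compatible routine would return nothing.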
Details:
- Fixed an unresolved symbol issue leftover from flame#590 whereby ?gemm3m_()
  as defined in bla_gemm3m.c was referencing bla_gemm3m_check(), which
  does not exist. It should have simply called the _check() function for
  gemm.
Add:

- x86_64_no_zen2: for compilers without znver2
- x86_64_no_zen3: for compilers without znver3

Also disable Knights Landing on compilers that do not support Zen 2.
Knights Landing is pretty rare, so probably not worth maintaining a
separate architecture for.
@danieldk danieldk changed the base branch from master to v0.9.0 May 5, 2022 11:56
@danieldk danieldk merged commit a9a7e28 into explosion:v0.9.0 May 5, 2022
@danieldk danieldk deleted the v0.9.0 branch May 5, 2022 13:45