Enable user-customized packm ukernel/variant. (#549)
Details:
- Added four new fields to obj_t: .pack_fn, .pack_params, .ker_fn, and
  .ker_params. These fields store pointers to functions and data that
  will allow the user to more flexibly create custom operations while
  recycling BLIS's existing partitioning infrastructure.
- Updated typed API to packm variant and structure-aware kernels to
  replace the diagonal offset with panel offsets, and changed strides
  of both C and P to inc/ldim semantics. Updated object API to the packm
  variant to include rntm_t*.
- Removed the packm variant function pointer from the packm cntl_t node
  definition since it has been replaced by the .pack_fn pointer in the
  obj_t.
- Updated bli_packm_int() to read the new packm variant function pointer
  from the obj_t and call it instead of from the cntl_t node.
- Moved some of the logic of bli_l3_packm.c to a new file,
  bli_packm_alloc.c.
- Rewrote bli_packm_blk_var1.c so that it uses byte (char*) pointers
  instead of typed pointers, allowing a single function to be used
  regardless of datatype. This obviated having a separate implementation
  in bli_packm_blk_var1_md.c. Also relegated handling of scalars to a
  new function, bli_packm_scalar().
- Employed a new standard whereby right-hand matrix operands ("B") are
  always packed as column-stored row panels -- that is, identically to
  that of left-hand matrix operands ("A"). This means that while we pack
  matrix A normally, we actually pack B in a transposed state. This
  allowed us to simplify a lot of code throughout the framework, and
  also affected some of the logic in bli_l3_packa() and _packb().
- Simplified bli_packm_init.c in light of the new B^T convention
  described above. bli_packm_init()--which is now called from within
  bli_packm_blk_var1()--also now calls bli_packm_alloc() and returns
  a bool that indicates whether packing should be performed (or
  skipped).
- Consolidated bli_gemm_int() and bli_trsm_int() into a bli_l3_int(),
  which, among other things, defaults the new .pack_fn field of the
  obj_t to bli_packm_blk_var1() if the field is NULL.
- Defined a new function, bli_obj_reset_origin(), which permanently
  refocuses the view of an object so that it "forgets" any offsets from
  its original pointer. This function also sets the object's root field
  to itself. Calls to bli_obj_reset_origin() for each matrix operand
  appear in the _front() functions, after the obj_t's are aliased. This
  resetting of the underlying matrices' origins is needed in preparation
  for more advanced features from within custom packm kernels.
- Redefined bli_pba_rntm_set_pba() from a regular function to a static
  inline function.
- Updated gemm_ukr, gemmtrsm_ukr, and trsm_ukr testsuite modules to use
  libblis_test_pobj_create() to create local packed objects. Previously,
  these packed objects were created by calling lower-level functions.
- (cherry picked from commit cf7d616)
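
The .pack_fn defaulting described above can be pictured with a minimal sketch. Every name here (demo_obj_t, demo_pack_fn_t, demo_default_pack, demo_custom_pack, demo_l3_int) is a hypothetical stand-in rather than BLIS's actual obj_t or bli_l3_int(). Only the pattern is taken from the commit message: call the user's function pointer if one is set, otherwise install a default first.

```c
#include <stddef.h>

/* Hypothetical stand-in for an object that carries user-overridable hooks.
   This is NOT BLIS's obj_t; it only mirrors the .pack_fn/.pack_params idea. */
typedef struct demo_obj_s demo_obj_t;
typedef int (*demo_pack_fn_t)(const demo_obj_t *obj, const void *params);

struct demo_obj_s
{
    demo_pack_fn_t pack_fn;     /* NULL means "use the default" */
    const void    *pack_params; /* opaque data handed to pack_fn */
};

/* Hypothetical default packing routine; returns 0 to identify itself. */
static int demo_default_pack(const demo_obj_t *obj, const void *params)
{
    (void)obj;
    (void)params;
    return 0;
}

/* Hypothetical user-supplied routine; returns the int stored in params. */
static int demo_custom_pack(const demo_obj_t *obj, const void *params)
{
    (void)obj;
    return *(const int *)params;
}

/* Install the default when the field is NULL, then dispatch through
   whatever function pointer the object now holds. */
static int demo_l3_int(demo_obj_t *obj)
{
    if (obj->pack_fn == NULL)
        obj->pack_fn = demo_default_pack;

    return obj->pack_fn(obj, obj->pack_params);
}
```

In this sketch the caller never needs to know whether a custom routine was supplied, which is roughly why a per-object pointer can replace a pointer stored in the control tree node.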

Fixed trmm[3]/trsm performance bug in cf7d616. (#685)

Details:
- Fixed a performance bug in the packing of micropanels that intersect
  the diagonal of triangular matrices (i.e., those found in trmm, trmm3,
  and trsm). This bug was introduced in cf7d616 and stemmed from an
  ill-formed boolean conditional expression in bli_packm_blk_var1().
  This conditional would choose when to use round-robin parallel work
  allocation, but checked for the triangularity of the submatrix being
  packed while failing to also check whether the current micropanel
  actually intersected the diagonal. The net result of this bug was that
  *all* micropanels of a triangular matrix, no matter where the upanels
  resided within the matrix, were assigned to threads via a round-robin
  policy. This affected some microarchitectures and threading
  configurations much worse than others, but it seems that overall the
  effect was universally negative, likely because of the reduced spatial
  locality during the packing with round-robin. Thanks to Leick Robinson
  for his tireless efforts in helping track down this issue.
- (cherry picked from commit 872898d)
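
The shape of the conditional described above can be sketched as follows. The helper names (demo_intersects_diag, demo_use_round_robin, demo_use_round_robin_buggy) and the use of long for dimensions are assumptions made for illustration; this is not the code in bli_packm_blk_var1(). The diagonal offset is taken to mean that diagonal elements (i, j) satisfy j - i == diagoff.

```c
#include <stdbool.h>

/* An m x n block contains an element (i, j) with j - i == diagoff
   exactly when -m < diagoff < n. (Hypothetical helper.) */
static bool demo_intersects_diag(long diagoff, long m, long n)
{
    return -m < diagoff && diagoff < n;
}

/* Policy as the commit message describes it after the fix: round-robin
   only for micropanels of a triangular matrix that intersect the diagonal. */
static bool demo_use_round_robin(bool is_triangular,
                                 long diagoff, long m, long n)
{
    return is_triangular && demo_intersects_diag(diagoff, m, n);
}

/* Shape of the bug as described: triangularity alone selects round-robin,
   so every micropanel of a triangular matrix takes that path. */
static bool demo_use_round_robin_buggy(bool is_triangular)
{
    return is_triangular;
}
```

Under these assumptions, a micropanel lying entirely off the diagonal falls back to the non-round-robin assignment in the fixed version, whereas the buggy version sends it through round-robin regardless of its position.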
fgvanzee committed Nov 3, 2022
1 parent a7767e6 commit 126279b
Showing 71 changed files with 1,291 additions and 3,714 deletions.
1 change: 0 additions & 1 deletion build/libblis-symbols.def
@@ -1307,7 +1307,6 @@ bli_pba_init_pools
 bli_pba_pool_size
 bli_pba_query
 bli_pba_release
-bli_pba_rntm_set_pba
 bli_memsys_finalize
 bli_memsys_init
 bli_mkherm
18 changes: 10 additions & 8 deletions frame/1m/bli_l1m_ft_ker.h
@@ -50,21 +50,23 @@
 typedef void (*PASTECH3(ch,opname,_ker,tsuf)) \
 ( \
 struc_t strucc, \
-doff_t diagoffc, \
 diag_t diagc, \
 uplo_t uploc, \
 conj_t conjc, \
 pack_t schema, \
 bool invdiag, \
-dim_t m_panel, \
-dim_t n_panel, \
-dim_t m_panel_max, \
-dim_t n_panel_max, \
+dim_t panel_dim, \
+dim_t panel_len, \
+dim_t panel_dim_max, \
+dim_t panel_len_max, \
+dim_t panel_dim_off, \
+dim_t panel_len_off, \
 ctype* restrict kappa, \
-ctype* restrict c, inc_t rs_c, inc_t cs_c, \
-ctype* restrict p, inc_t rs_p, inc_t cs_p, \
+ctype* restrict c, inc_t incc, inc_t ldc, \
+ctype* restrict p, inc_t ldp, \
 inc_t is_p, \
-cntx_t* cntx \
+cntx_t* cntx, \
+void* params \
 );

 INSERT_GENTDEF( packm )
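
As a rough illustration of the inc/ldim stride semantics in the kernel signature above, the sketch below packs one micropanel through byte (char*) pointers so that a single routine serves any element size. It is a simplified, hypothetical routine (demo_pack_panel is not a BLIS function): it ignores kappa, conjugation, structure, and the panel offsets. It assumes that incc is the stride of C along the panel dimension, that ldc is the stride of C along the panel length, that P has unit stride along the panel dimension, and that ldp >= panel_dim_max. All strides are counted in elements.

```c
#include <stddef.h>
#include <string.h>

/* Copy a panel_dim x panel_len block of C into P and zero-pad it out to
   panel_dim_max x panel_len_max. Element (i, j) is read from
   c + (i*incc + j*ldc) elements and written to p + (i + j*ldp) elements. */
static void demo_pack_panel(size_t elem_size,
                            size_t panel_dim, size_t panel_len,
                            size_t panel_dim_max, size_t panel_len_max,
                            const void *c, size_t incc, size_t ldc,
                            void *p, size_t ldp)
{
    const char *c_bytes = (const char *)c;
    char       *p_bytes = (char *)p;

    for (size_t j = 0; j < panel_len_max; ++j)
    {
        char *p_col = p_bytes + j * ldp * elem_size;

        /* Zero the full column first so that any padding is all-zero bytes
           (which represent 0.0 for IEEE floating-point types). */
        memset(p_col, 0, panel_dim_max * elem_size);

        /* Columns past panel_len are padding only. */
        if (j >= panel_len)
            continue;

        for (size_t i = 0; i < panel_dim; ++i)
            memcpy(p_col + i * elem_size,
                   c_bytes + (i * incc + j * ldc) * elem_size,
                   elem_size);
    }
}
```

In this sketch, packing the transposed view of a matrix amounts to swapping the incc and ldc arguments, which is one way to picture how a single panel layout could serve both left-hand and right-hand operands.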
1 change: 1 addition & 0 deletions frame/1m/bli_l1m_oft_var.h
@@ -48,6 +48,7 @@ typedef void (*PASTECH(opname,_var_oft)) \
 obj_t* a, \
 obj_t* p, \
 cntx_t* cntx, \
+rntm_t* rntm, \
 cntl_t* cntl, \
 thrinfo_t* thread \
 );
8 changes: 5 additions & 3 deletions frame/1m/packm/bli_packm.h
@@ -33,15 +33,15 @@
*/

#include "bli_packm_alloc.h"
#include "bli_packm_cntl.h"
#include "bli_packm_check.h"
#include "bli_packm_init.h"
#include "bli_packm_int.h"
#include "bli_packm_scalar.h"

#include "bli_packm_part.h"

#include "bli_packm_var.h"

#include "bli_packm_struc_cxk.h"
#include "bli_packm_struc_cxk_1er.h"

@@ -50,6 +50,8 @@

// Mixed datatype support.
#ifdef BLIS_ENABLE_GEMM_MD
#include "bli_packm_md.h"
#include "bli_packm_struc_cxk_md.h"
#endif

#include "bli_packm_blk_var1.h"

139 changes: 64 additions & 75 deletions frame/1m/packm/bli_packm_var.h → frame/1m/packm/bli_packm_alloc.c
@@ -5,7 +5,7 @@
libraries.
Copyright (C) 2014, The University of Texas at Austin
Copyright (C) 2018 - 2019, Advanced Micro Devices, Inc.
Copyright (C) 2016, Hewlett Packard Enterprise Development LP
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
@@ -33,78 +33,67 @@
*/

//
// Prototype object-based interfaces.
//

#undef GENPROT
#define GENPROT( opname ) \
\
BLIS_EXPORT_BLIS void PASTEMAC0(opname) \
( \
obj_t* c, \
obj_t* p, \
cntx_t* cntx, \
cntl_t* cntl, \
thrinfo_t* t \
);

GENPROT( packm_unb_var1 )
GENPROT( packm_blk_var1 )

//
// Prototype BLAS-like interfaces with void pointer operands.
//

#undef GENTPROT
#define GENTPROT( ctype, ch, varname ) \
\
void PASTEMAC(ch,varname) \
( \
struc_t strucc, \
doff_t diagoffc, \
diag_t diagc, \
uplo_t uploc, \
trans_t transc, \
dim_t m, \
dim_t n, \
dim_t m_max, \
dim_t n_max, \
void* kappa, \
void* c, inc_t rs_c, inc_t cs_c, \
void* p, inc_t rs_p, inc_t cs_p, \
cntx_t* cntx \
);

INSERT_GENTPROT_BASIC0( packm_unb_var1 )

#undef GENTPROT
#define GENTPROT( ctype, ch, varname ) \
\
void PASTEMAC(ch,varname) \
( \
struc_t strucc, \
doff_t diagoffc, \
diag_t diagc, \
uplo_t uploc, \
trans_t transc, \
pack_t schema, \
bool invdiag, \
bool revifup, \
bool reviflo, \
dim_t m, \
dim_t n, \
dim_t m_max, \
dim_t n_max, \
void* kappa, \
void* c, inc_t rs_c, inc_t cs_c, \
void* p, inc_t rs_p, inc_t cs_p, \
inc_t is_p, \
dim_t pd_p, inc_t ps_p, \
void_fp packm_ker, \
cntx_t* cntx, \
thrinfo_t* thread \
);

INSERT_GENTPROT_BASIC0( packm_blk_var1 )
#include "blis.h"

void* bli_packm_alloc
     (
       siz_t      size_needed,
       rntm_t*    rntm,
       cntl_t*    cntl,
       thrinfo_t* thread
     )
{
	// Query the pack buffer type from the control tree node.
	packbuf_t pack_buf_type = bli_cntl_packm_params_pack_buf_type( cntl );

	// Query the address of the mem_t entry within the control tree node.
	mem_t* cntl_mem_p = bli_cntl_pack_mem( cntl );

	mem_t* local_mem_p;
	mem_t  local_mem_s;

	siz_t cntl_mem_size = 0;

	if ( bli_mem_is_alloc( cntl_mem_p ) )
		cntl_mem_size = bli_mem_size( cntl_mem_p );

	if ( cntl_mem_size < size_needed )
	{
		if ( bli_thread_am_ochief( thread ) )
		{
			// The chief thread releases the existing block associated with
			// the mem_t entry in the control tree, and then re-acquires a
			// new block, saving the associated mem_t entry to local_mem_s.
			if ( bli_mem_is_alloc( cntl_mem_p ) )
			{
				bli_pba_release
				(
				  rntm,
				  cntl_mem_p
				);
			}
			bli_pba_acquire_m
			(
			  rntm,
			  size_needed,
			  pack_buf_type,
			  &local_mem_s
			);
		}

		// Broadcast the address of the chief thread's local mem_t entry to
		// all threads.
		local_mem_p = bli_thread_broadcast( thread, &local_mem_s );

		// Save the chief thread's local mem_t entry to the mem_t field in
		// this thread's control tree node.
		*cntl_mem_p = *local_mem_p;

		// Barrier so that the master thread doesn't return from the function
		// before we are done reading.
		bli_thread_barrier( thread );
	}

	return bli_mem_buffer( cntl_mem_p );
}

17 changes: 7 additions & 10 deletions frame/3/bli_l3_packm.h → frame/1m/packm/bli_packm_alloc.h
@@ -5,7 +5,6 @@
libraries.
Copyright (C) 2014, The University of Texas at Austin
Copyright (C) 2018 - 2019, Advanced Micro Devices, Inc.
Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are
@@ -33,13 +32,11 @@
*/

void bli_l3_packm
(
obj_t* x,
obj_t* x_pack,
cntx_t* cntx,
rntm_t* rntm,
cntl_t* cntl,
thrinfo_t* thread
);
BLIS_EXPORT_BLIS void* bli_packm_alloc
(
siz_t size_needed,
rntm_t* rntm,
cntl_t* cntl,
thrinfo_t* thread
);
