Vectorized interface #23
Comments
I've made a small example on Godbolt, using the function:

The assembly for the scalar function is:

    test_mp_f_:
            push rbx #24.10
            sub rsp, 16 #24.10
            mov rbx, rsi #24.10
            vmovss xmm0, DWORD PTR [rdi] #29.3
            vmovss DWORD PTR [rsp], xmm0 #29.3[spill]
            call sinf #29.9
            movsxd rax, DWORD PTR [rbx] #29.8
            vmovss xmm1, DWORD PTR [rsp] #30.1[spill]
            vfmsub213ss xmm1, xmm0, DWORD PTR [-4+test_mp_p_+rax*4] #30.1
            vmovaps xmm0, xmm1 #30.1
            add rsp, 16 #30.1
            pop rbx #30.1
            ret

It is 13 instructions to process one element. The assembly of the vector function with explicit length is:

    test_mp_vf_:
            push rbp #33.10
            mov rbp, rsp #33.10
            and rsp, -32 #33.10
            push r12 #33.10
            push rbx #33.10
            sub rsp, 16 #33.10
            vmovss xmm0, DWORD PTR [rsi] #40.17
            vbroadcastss xmm1, xmm0 #40.7
            vmovups XMMWORD PTR [rsp], xmm1 #40.7[spill]
            mov rbx, QWORD PTR [rdi] #40.7
            movsxd r12, DWORD PTR [rdx] #40.16
            call sinf #40.17
            vbroadcastss xmm2, xmm0 #40.17
            vmovups xmm1, XMMWORD PTR [rsp] #40.7[spill]
            vfmsub213ps xmm2, xmm1, XMMWORD PTR [test_mp_p_+r12*4] #40.7
            vmovups XMMWORD PTR [rbx], xmm2 #40.7
            add rsp, 16 #42.1
            pop rbx #42.1
            pop r12 #42.1
            mov rsp, rbp #42.1
            pop rbp #42.1
            ret

Notably, it uses the
It could be nice to support a vector-instruction-friendly interface. There is an open issue for this in scipy/scipy#7242. The purpose is to solve a family of problems:
Ideally, the $p_k$ are somehow related in their physical origin (e.g. a spatial field), so the local convergence behavior in terms of $k$ will be similar. For root-finding on multi-dimensional parameter grids, e.g. $p_{j,k}$, one can use the Fortran pointer-remapping feature to make the problems 1-D.
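In symbols, the setting described above might be written as follows. The notation ($f$, $x_k$, $N$, $n_j$, $n_k$, $q_m$) is assumed here purely for illustration:

$$
f(x_k, p_k) = 0, \qquad k = 1, \dots, N .
$$

For a two-dimensional parameter grid $p_{j,k}$ with $j = 1, \dots, n_j$ and $k = 1, \dots, n_k$, a rank-1 view $q_m$ of the same contiguous storage follows Fortran's column-major order:

$$
q_m = p_{j,k}, \qquad m = j + (k - 1)\, n_j, \qquad N = n_j\, n_k .
$$

Neighbouring values of $m$ then correspond to neighbouring $j$ within one column, which is where similar convergence behavior would be expected if the $p_{j,k}$ vary smoothly along $j$.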
I did some benchmarking of `zeroin` recently (discussion here). The code runs in scalar mode for the most part. The complex control flow prevents vectorization (view on godbolt; code taken from netlib):

One may need to write the implementation in terms of "chunks" with reductions, which would then map to SIMD registers.
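The "chunks with reductions" idea can be sketched in C, used here as a stand-in for Fortran. This is a minimal illustration under stated assumptions, not `zeroin` itself: it swaps in plain bisection, uses a hypothetical test function `f(x, p) = x*x - p`, and assumes a chunk width `LANES` of 4. Each lane keeps its own bracket, the interval update is written as selects rather than divergent branches, and the only cross-lane operation is a max-reduction that decides when the whole chunk is done. Whether a given compiler actually vectorizes the inner loop is not guaranteed.

```c
#include <assert.h>

#define LANES 4  /* assumed chunk width, e.g. 4 doubles in a 256-bit register */

/* Hypothetical test function: f(x, p) = x*x - p, positive root sqrt(p). */
static inline double f(double x, double p) { return x * x - p; }

/* Bisection on LANES independent problems f(x, p[k]) = 0 in lockstep.
   On entry each [a[k], b[k]] must bracket a root. Returns 1 if every
   lane's bracket shrank to at most tol within max_iter steps, else 0.
   root[k] always holds the midpoint of the latest bracket. */
static int bisect_chunk(const double p[LANES], double a[LANES],
                        double b[LANES], double tol, int max_iter,
                        double root[LANES])
{
    double fa[LANES];
    assert(tol > 0.0 && max_iter > 0);

    for (int k = 0; k < LANES; ++k) {
        fa[k] = f(a[k], p[k]);
        root[k] = 0.5 * (a[k] + b[k]);
    }

    for (int it = 0; it < max_iter; ++it) {
        double width = 0.0;  /* max-reduction over the chunk */

        for (int k = 0; k < LANES; ++k) {
            double m  = 0.5 * (a[k] + b[k]);
            double fm = f(m, p[k]);

            /* 1 if the sign change lies in [a, m], 0 if it lies in [m, b]. */
            int left = (fa[k] < 0.0) != (fm < 0.0);

            /* Select form: every lane executes the same instructions. */
            b[k]  = left ? m : b[k];
            a[k]  = left ? a[k] : m;
            fa[k] = left ? fa[k] : fm;

            root[k] = 0.5 * (a[k] + b[k]);

            double w = b[k] - a[k];
            width = (w > width) ? w : width;
        }

        if (width <= tol)  /* the slowest lane has converged too */
            return 1;
    }
    return 0;
}
```

Calling `bisect_chunk` with `p = {1, 4, 9, 16}` and brackets `[0, 10]` in every lane drives all four lanes to their roots together. Lanes that converge early keep iterating until the slowest one finishes, which is why it helps when neighbouring $p_k$ converge at a similar rate.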
I'm not sure what the vector callback interface would look like, or what the natural way is to pass the parameters so that it is SIMD-friendly:
I suppose you could also do a callback of the form:
This way the SIMD length could be left to the program logic, and it would make it easier to handle peel/remainder loops. A more Fortranic way would be to use `do concurrent` (assuming compilers will do the right thing).
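A length-parameterized callback of this kind might look as follows, again sketched in C rather than Fortran. Everything here is hypothetical: the `vec_fn` type, the `cubic` example callback, and the `eval_chunked` driver are illustrative names, not part of any existing library. The point is only that the caller picks the chunk length, so full chunks and the remainder go through the same callback.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical callback type: evaluate f(x[i], p[i]) for i = 0..n-1.
   The caller chooses n, so no SIMD length is baked into the interface. */
typedef void (*vec_fn)(size_t n, const double *x, const double *p,
                       double *fx);

/* Example callback, f(x, p) = x^3 - p, written as a simple loop that
   a compiler may be able to vectorize. */
static void cubic(size_t n, const double *x, const double *p, double *fx)
{
    for (size_t i = 0; i < n; ++i)
        fx[i] = x[i] * x[i] * x[i] - p[i];
}

/* Evaluate fn over `total` points: full chunks of length `chunk` first,
   then one shorter call for the remainder, if any.
   Returns the number of callback invocations. */
static size_t eval_chunked(vec_fn fn, size_t total, size_t chunk,
                           const double *x, const double *p, double *fx)
{
    size_t calls = 0;
    size_t i = 0;

    assert(chunk > 0);

    for (; i + chunk <= total; i += chunk, ++calls)
        fn(chunk, x + i, p + i, fx + i);     /* full chunks */

    if (i < total) {                          /* remainder loop */
        fn(total - i, x + i, p + i, fx + i);
        ++calls;
    }
    return calls;
}
```

With `total = 10` and `chunk = 4`, the driver makes two full-chunk calls and one remainder call of length 2. A solver built on such an interface could choose `chunk` to match the register width without the callback author needing to know it.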