\lstset{language=C}
\chapter{Loop Transformations}
\label{chap:loop-transformations}
\begin{quote}
\emph{This chapter is adapted with permission from the PMPH Lecture
Notes written by Cosmin Oancea.}
\end{quote}
This chapter is organised as follows:
\begin{itemize}
\item \Cref{sec:loop-interch} presents a simple theorem that gives
necessary conditions for the safety of the transformation that
interchanges two perfectly nested loops.
\item \Cref{sec:loop-distrib} discusses the legality and
the manner in which a loop can be distributed across the
statements in its body.
\item \Cref{sec:false-dep-elim} discusses techniques for
eliminating cross-iteration write-after-read and
write-after-write dependencies.
\item \Cref{sec:strip-tiling} introduces a simple
transformation, named stripmining, which is always valid, and
shows how block and register tiling can be derived as a
combination of stripmining, loop interchange and loop
distribution.
\end{itemize}
\section{Loop interchange: legality and applications}
\label{sec:loop-interch}
Direction vectors are not only used for proving the parallel nature of
loops; they can also enable powerful code restructuring techniques. For
example, they can be straightforwardly applied to determine whether it
is safe to interchange two loops in a perfect loop nest\footnote{ A
perfect loop nest is a nest in which any two loops at consecutive
depth levels are not separated by any other statements; for example,
all loop nests in \cref{fig:data-dep-running-eg} are perfectly
nested. }---which may result in better locality and even in
changing the nature of an inner loop from dependent (sequential) to
independent (parallel).
The following theorem gives a sufficient condition for the
legality of loop interchange---i.e., for the transformation
to result in code that is semantically equivalent to the original one.
\begin{theorem}[Legality of Loop Interchange]\label{Loop-Interch}
A sufficient condition for the legality of interchanging two loops
at depth levels $k$ and $l$ in a perfect nest is that interchanging
columns $k$ and $l$ in the direction matrix of the loop nest
\emph{does not result} in a (leading) \texttt{>} direction as the
leftmost non-\texttt{=} direction of any row.
\end{theorem}
The theorem above shows that the legality of loop interchange
can be determined solely by inspecting the result of permuting
the columns of the direction matrix in the same way as the loops.
For the rationale of why a row-leading \texttt{>} direction
is illegal, we refer the reader to \cref{Leg-Dir-Vect}: a \texttt{>}
direction as the leftmost non-\texttt{=} entry of a row would
correspond to depending on something that happens in the future; this
currently seems impossible in our universe, and as such it signals an
illegal transformation.
%
The following corollary can be easily derived from \cref{Loop-Interch}:
\begin{corollary}[Interchanging a parallel loop inwards]\label{Par-Loop-Interch}
In a perfect loop nest, it is always safe to interchange a
{\em parallel loop} inwards one step at a time (i.e., if the
parallel loop is the $k^{th}$ loop in the nest then one can
always interchange it with loop $k+1$, then with loop $k+2$, etc.).
\end{corollary}
The corollary says that if we somehow know that a loop is parallel,
then we can safely interchange it to the immediately inner
position, without even having to build the dependence-direction matrix.
Let us analyse the legality of loop interchange for the three loop
nests of our running example:
\begin{description}
\item[\Cref{fig:data-dep-running-eg-a}:] The direction matrix is
\texttt{[=,=]} and, as such, it is legal to interchange the two
loops, because it would result in direction matrix
\texttt{[=,=]}. Moreover, applying loop interchange in this case is
highly beneficial because it \emph{optimises locality of reference}:
the loop of index \texttt{i} appears in the innermost position after
the interchange, which optimally exploits spatial locality for the
write and read accesses to \texttt{A[j][i]}.
\item[\Cref{fig:data-dep-running-eg-b}:] The direction matrices are
\[
M
= \begin{cases}\texttt{[<,<]}\\\texttt{[=,<]}\end{cases}
\]
and
\[
M^{intchg}
= \begin{cases}\texttt{[<,<]}\\\texttt{[<,=]}\end{cases}
\]
before and after interchange, respectively. It follows that the
loop interchange is legal---because $M^{intchg}$ satisfies
\cref{Loop-Interch}---and it also optimises spatial locality (as
before). What is interesting about this example is that after the
interchange, \emph{the innermost loop has become parallel}, by
\cref{Loop-Par}, because the outer loop carries all
dependencies---the direction column corresponding to the outer loop
consists only of \texttt{<} directions. (A sketch of a loop nest with
this direction matrix is given after this list.)
\item[\Cref{fig:data-dep-running-eg-c}:] The direction matrix is
\texttt{[<,>]} and \emph{interchanging the two loops is illegal}
because the direction matrix obtained after the interchange
\texttt{[>,<]} starts with a \texttt{>} direction; this would mean
that the current iteration depends on a future iteration, which is
impossible, hence the interchange is illegal.
\end{description}
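Since the running example of \cref{fig:data-dep-running-eg} is not
reproduced in this chapter, the sketch below is our own illustrative
loop nest---the concrete accesses are an assumption, not necessarily
those of the original figure---chosen so that it has the same
direction matrix as case (b), and it shows the effect of the
interchange:
\begin{lstlisting}
// direction matrix {[<,<], [=,<]}: the inner j loop carries [=,<],
// so it is sequential, and consecutive inner iterations touch
// different rows of A (poor spatial locality).
for (int i = 1; i < N; i++)
  for (int j = 1; j < M; j++)
    A[j][i] = A[j-1][i-1] + A[j-1][i];

// after interchange the matrix is {[<,<], [<,=]}: legal, the outer j
// loop now carries all dependencies, so the inner i loop is parallel,
// and all three accesses are stride-1 in the inner loop.
for (int j = 1; j < M; j++)
  for (int i = 1; i < N; i++)
    A[j][i] = A[j-1][i-1] + A[j-1][i];
\end{lstlisting}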
\section{Loop distribution: legality and applications}
\label{sec:loop-distrib}
This section introduces a transformation, named loop distribution,
where a loop is distributed across its statements. Potential benefits
are:
\begin{itemize}
\item Loop distribution provides the basis for performing
vectorisation: the innermost loop is distributed across its
statements, and then the distributed loops are chunked (stripmined,
\cref{sec:strip-tiling}) by a factor that permits utilisation of
the processor's vector instructions.
\item Loop distribution may enhance the degree of parallelism that can
be statically mapped to the hardware. As discussed in
\cref{sec:openmp-nesting}, OpenMP \texttt{collapse} clauses only
apply to perfect loop nests. Distribution lets us split apart
complex loop nests to create perfect nests of parallel constructs,
which can then be parallelised efficiently with OpenMP; a sketch of
this use is given after this list.
\end{itemize}
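As a small sketch of the second point (our own example, with made-up
arrays \texttt{x}, \texttt{A} and \texttt{B}), distributing an
imperfect nest over its statements can expose a perfect nest to
\texttt{collapse}:
\begin{lstlisting}
// imperfect nest: the write to x[i] sits between the two loop
// headers, so collapse(2) cannot be applied directly
for (int i = 0; i < N; i++) {
  x[i] = 2 * i;
  for (int j = 0; j < M; j++)
    A[i][j] = x[i] * B[i][j];
}

// after distributing the i loop (which has no cross-iteration
// dependencies) across its statements:
#pragma omp parallel for
for (int i = 0; i < N; i++)
  x[i] = 2 * i;
#pragma omp parallel for collapse(2)
for (int i = 0; i < N; i++)
  for (int j = 0; j < M; j++)
    A[i][j] = x[i] * B[i][j];
\end{lstlisting}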
Loop distribution requires the construction of a dependency graph,
which is defined below.
\begin{definition}[Dependency graph]\label{Dep-Graph}
A dependency graph of a loop is a directed graph in which
the nodes correspond to the statements of the loop nest and the
edges correspond to dependencies. An edge is directed (points)
from the source to the sink of the dependency, and is annotated
with the direction corresponding to that dependence.
If the loop contains another (inner) loop, the inner loop is
represented as a single statement that conservatively summarises the
behaviour of all the statements of the inner loop.
\end{definition}
The dependency graph of a loop can be used to characterise its
parallel behaviour:
\begin{theorem}[Dependency cycle]\label{Dep-Cycle}
A loop is parallel \emph{if and only if} its dependency graph does
not have cycles.
\end{theorem}
If the loop contains a cycle of dependencies, then it necessarily
exhibits at least one cross-iteration dependency (needed to form the
cycle), and thus the loop is not parallel. The following theorem
specifies how the transformation can be implemented:
\begin{theorem}[Loop distribution]\label{Loop-Distrib}
Distributing a loop across its statements can be performed
in the following way:
\begin{enumerate}
\item The dependency graph corresponding to the target loop
is constructed.
\item The graph is decomposed into strongly-connected components
(SCCs)\footnote{
A graph is said to be strongly connected if every vertex
is reachable from every other vertex, i.e., any two distinct
vertices lie on a common cycle.
The strongly-connected components of an arbitrary directed graph
can be found in linear time $\Theta(V+E)$,
where $V$ is the number of vertices and $E$ is the number of
edges.
}, and a new graph $G'$ is formed in which the SCCs are nodes.
\item The loop can be safely distributed across its
strongly-connected components, in the graph order of $G'$ (i.e., a
topological order of $G'$, with dependence sources before sinks).
Assuming a number $k$ of SCCs, this means that the result of the
transformation will be $k$ loops, each containing the statements
of the corresponding SCC. Inside an SCC, the statements remain in
program order, but the distributed loops are ordered according to
$G'$.
\item Array expansion (\cref{sec:array-expansion}) must be performed
for the variables that
\begin{itemize}
\item are either declared inside the loop or overwritten
in each iteration (output dependencies), \textbf{\em and}
\item are used in at least two strongly-connected components.
\end{itemize}
\end{enumerate}
\end{theorem}
The theorem above says that the statements that are in a dependency
cycle must remain together in one loop (which is sequential by
\cref{Dep-Cycle}). As such, the loop can be distributed across groups
of statements corresponding to the strongly-connected components (SCCs)
of the dependency graph. If the graph has only one SCC then the loop
cannot be distributed. The resulting distributed loops are written in
the order dictated by the graph of SCCs.
%
We demonstrate \cref{Loop-Distrib} on the simple code example presented
below:
\begin{lstlisting}[mathescape=true]
for (int i = 2; i < N; i++) {
$S_1:$ A[i] = B[i-2] ...
$S_2:$ B[i] = B[i-1] ...
}
\end{lstlisting}
The code has two dependencies:
\begin{description}
\item[$S_2\to S_1$:] In order for a dependency on \texttt{B} to exist,
the read from \texttt{B} in iteration $i_1$ of $S_1$ and
the write to \texttt{B} in iteration $i_2$ of $S_2$ must refer
to the same location. Hence $i_1 - 2 = i_2$, which means
$i_1 > i_2$, hence $S_2$ is the source, $S_1$ is the
sink and the direction vector is \texttt{[<]};
\item[$S_2\to S_2$:] similarly, there is a dependency between the
read from \texttt{B} in $S_2$ and the write to \texttt{B} in $S_2$
of direction vector \texttt{[<]}.
\end{description}
The dependency graph is thus:
\begin{center}
\includegraphics[height=15ex]{img/LoopDistr.pdf}
\end{center}
\noindent and it exhibits two strongly-connected components:
one formed by statement $S_2$ and one formed by statement $S_1$.
Loop distribution results in the following restructured code:
\begin{lstlisting}[mathescape=true]
for (int i = 2; i < N; i++)
$S_2:$ B[i] = B[i-1] ...
for (int i = 2; i < N; i++)
$S_1:$ A[i] = B[i-2] ...
\end{lstlisting}
in which, according to the graph order, the loop corresponding to
statement $S_2$ appears before the one corresponding to statement
$S_1$. Note that this does not match the program order of statements
$S_1$ and $S_2$ in the original program. Also note that the first loop
is \emph{not} parallel because the SCC consisting of $S_2$ has a
(dependency) cycle, but the second loop is parallel because the SCC
corresponding to $S_1$ does not have cycles.
If a loop is parallel then it can be straightforwardly distributed
across its statements in program order because:
\begin{itemize}
\item by \cref{Dep-Cycle}, the loop dependency graph has no cycles and
thereby each statement is a strongly connected component;
\item the program order naturally respects all dependencies.
\end{itemize}
\begin{corollary}[Parallel loop distribution]\label{Par-Loop-Distr}
A parallel loop can be directly distributed across each one of its
statements. The resulting loops appear in the same order in which
their corresponding statements appear in the original loop.
\end{corollary}
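As a minimal illustration of the corollary (our own example, with
arrays \texttt{A}, \texttt{B} and \texttt{C} of length \texttt{N}),
the parallel loop
\begin{lstlisting}[mathescape=true]
for (int i = 0; i < N; i++) {
  $S_1:$ A[i] = 2 * B[i];
  $S_2:$ C[i] = A[i] + 1;
}
\end{lstlisting}
\noindent can be distributed directly, keeping program order:
\begin{lstlisting}[mathescape=true]
for (int i = 0; i < N; i++)
  $S_1:$ A[i] = 2 * B[i];
for (int i = 0; i < N; i++)
  $S_2:$ C[i] = A[i] + 1;
\end{lstlisting}
\noindent Both resulting loops are themselves parallel, and all values
of \texttt{A} read by $S_2$ have already been produced by the first
loop.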
\subsection{Array expansion}
\label{sec:array-expansion}
Finally, it remains to demonstrate array expansion, mentioned
in the fourth step of \cref{Loop-Distrib}. Assume the slightly
modified code:
\begin{lstlisting}[mathescape=true]
float tmp;
for (int i = 2; i < N; i++) {
$S_1:$ tmp = 2 * B[i-2];
$S_2:$ A[i] = tmp;
$S_3:$ B[i] = tmp + B[i-1];
}
\end{lstlisting}
Statements $S_1$ and $S_3$ are in a dependency cycle, because
there is a dependency $S_3\to S_1$ with direction \texttt{<} caused
by the write to and the read from array \texttt{B}, and a dependency
$S_1\to S_3$ with direction \texttt{=} caused by \texttt{tmp}.
Statement $S_2$ is not in a dependency cycle, but there is a
dependency $S_1\to S_2$, and hence its distributed loop
should follow the distributed loop containing $S_1$ and $S_3$.
If we do not perform array expansion, the distributed code:
\pagebreak
\begin{lstlisting}[mathescape=true]
float tmp;
for (int i = 2; i < N; i++) {
$S_1:$ tmp = 2 * B[i-2];
$S_3:$ B[i] = tmp + B[i-1];
}
for (int i = 2; i < N; i++) {
$S_2:$ A[i] = tmp;
}
\end{lstlisting}
\noindent does not respect the semantics of the original program
because the second loop uses the same value of \texttt{tmp}---the one
set by the last iteration of the first loop---while the original loop
writes and then reads a different value of \texttt{tmp} for each
iteration. We fix this by performing array expansion for \texttt{tmp},
which means that we must expand it with an array dimension equal to
the loop count and replace its uses with corresponding indexing
expressions of the expanded array. This results in the following
\emph{correct} code:
\begin{lstlisting}[mathescape=true]
float tmp[N-2];
for (int i = 2; i < N; i++) {
  $S_1:$ tmp[i-2] = 2 * B[i-2];
  $S_3:$ B[i] = tmp[i-2] + B[i-1];
}
for (int i = 2; i < N; i++) {
  $S_2:$ A[i] = tmp[i-2];
}
\end{lstlisting}
Array expansion formally requires us to normalise the loop
first---that is, to rewrite it such that its index starts from $0$ and
increases by $1$ each iteration---so that the loop index can be used
directly to index the expanded dimension, whose size equals the trip
count. Our loop starts at $2$, so the normalised iteration number is
\texttt{i-2}, which is what we use to index \texttt{tmp}.
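For completeness, a fully normalised version of the distributed loops
(our own rewriting, not part of the original example) would look as
follows, with the new index running from $0$:
\begin{lstlisting}[mathescape=true]
float tmp[N-2];
for (int i = 0; i < N-2; i++) {
  $S_1:$ tmp[i] = 2 * B[i];
  $S_3:$ B[i+2] = tmp[i] + B[i+1];
}
for (int i = 0; i < N-2; i++) {
  $S_2:$ A[i+2] = tmp[i];
}
\end{lstlisting}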
\section{Eliminating false dependencies (WAR and WAW)}
\label{sec:false-dep-elim}
Anti and output dependencies are often referred to as \emph{false}
dependencies because they can be eliminated in most cases by copying
or privatisation operations:
\begin{itemize}
\item Cross-iteration anti dependencies (WAR) typically
correspond to a read from some original element of
the array---whose value was set before the start of
the loop execution---followed by an update to that
element in a later iteration. As such, this dependency
can be eliminated by copying (in parallel) the target
array before the loop and rewriting the offending
read access inside the loop such that it refers
to the copy of the array.
\item Cross-iteration output dependencies (WAW) can be
eliminated by a technique named privatisation (or renaming),
whenever it can be determined that {\em every read access}
from a scalar or array location {\em is covered by an
update} to that scalar or memory location that was
previously performed {\em in the same iteration}.
Semantically, privatisation moves the declaration of the
offending variable inside the loop, because it has already
been determined that the read/used value was produced
earlier in the same iteration.
\item Reasoning based on direction vectors is limited to
relatively simple loop nests; for example, it is difficult
to reason about privatisation by means of direction vectors.
\end{itemize}
\subsection{Eliminating WAR dependencies by copying}
Consider the simple C code below, which rotates an array one position
to the left (the first element wraps around to the end):
\begin{lstlisting}[mathescape=true]
float tmp = A[0];
for (int i=0; i<N-1; i++) {
A[i] = A[i+1]; // $S_1$
}
A[N-1] = tmp;
\end{lstlisting}
The loop exhibits a cross-iteration anti dependency (WAR) $S_1\to S_1$
(with direction vector $\texttt{[<]}$), and, as such, it is not safe
to execute it in parallel. However, one can observe that the reads
from \texttt{A} inside the loop always see the original elements of
\texttt{A}, as they existed before the loop, because each element is
only overwritten in a later iteration than the one that reads it. As
such, one can copy \texttt{A} before the loop and replace the read
access inside the loop to operate on the copy of array \texttt{A}.
This preserves the original loop semantics and results in a parallel
loop, because the read and write accesses operate on different arrays,
hence a dependency cannot occur. For example, using OpenMP:
\begin{lstlisting}[mathescape=true]
float Acopy[N];
#pragma omp parallel for
for (int i=0; i<N; i++) {
Acopy[i] = A[i];
}
tmp = A[0];
#pragma omp parallel for
for (int i=0; i<N-1; i++) {
A[i] = Acopy[i+1];
}
A[N-1] = tmp;
\end{lstlisting}
Note that in a real program, we would allocate the \texttt{Acopy}
array with \texttt{malloc()} rather than creating a potentially very
large stack allocation.
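As a minimal sketch of that remark (assuming the same \texttt{A} and
\texttt{N} as above, and \texttt{<stdlib.h>} for \texttt{malloc} and
\texttt{free}), the copy could be heap-allocated as follows:
\begin{lstlisting}
float *Acopy = malloc(N * sizeof(float));
#pragma omp parallel for
for (int i = 0; i < N; i++) {
  Acopy[i] = A[i];
}
float tmp = A[0];
#pragma omp parallel for
for (int i = 0; i < N-1; i++) {
  A[i] = Acopy[i+1];
}
A[N-1] = tmp;
free(Acopy);
\end{lstlisting}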
\subsection{Eliminating WAW dependencies by privatisation}
Consider the contrived and ugly-looking C code below:
\begin{lstlisting}[mathescape=true]
int A[M];
for (int i=0; i<N; i++) {
  for (int j=0; j<M; j++) { // writes the slice A[0:M-1]
    A[j] = (4*i+4*j) % M;   // $S_1$
  }
  for (int k=1; k<N; k++) { // reads A[j] where j$\in${0,$\ldots$,M-1}
                            // because % denotes the modulus op
    X[i][k] = X[i][k-1] * A[ A[(2*i+k)%M] % M]; // $S_2$
  }
}
\end{lstlisting}
Analysing the cross-iteration dependencies of the outer loop,
one can observe that there are frequent output dependencies
$S_1\to S_1$ of all directions (\texttt{*}), because, in
essence, all elements of \texttt{A} at indices $0\ldots M-1$
are (over)written in each iteration of the outer loop.
This also causes frequent cross-iteration WAR and RAW
dependencies between $S_1$ and $S_2$ of all directions
\texttt{*} because $S_2$ reads some of the values of
\texttt{A} which are written in $S_1$. The read access
is also statically unanalysable because the index
into \texttt{A} depends on a value of \texttt{A} (i.e.,
it is an indirect-array access \texttt{A[ A[...] ]}).
It would thus seem that this is a hopeless case and parallel
execution is a pipe dream. Not so! Actually the rationale
of how to transform the outer loop into a parallel one
is quite simple. One may observe that each iteration of
the outer loop writes the same indices of \texttt{A}, namely
the ones belonging to the closed integer interval
\texttt{[0,M-1]}. One may also observe that $S_2$ reads
from \texttt{A} elements whose indices necessarily belong to
\texttt{[0,M-1]}---due to the two modulus-\texttt{M} operations.
As such, one may conclude that any value
read in $S_2$ must have been produced in the same iteration
of the outer loop (in the inner loop enclosing $S_1$).
It follows that it is safe to rewrite the loop in the
following way:
\begin{enumerate}
\item Declare a new variable \texttt{A'} of the same dimensions as
\texttt{A} just inside the outer loop (or, equivalently, perform array
expansion of \texttt{A} into an array \texttt{A'} with a new outer
dimension of size \texttt{N}).
\item Replace all the uses of \texttt{A} in the outer loop by uses of
\texttt{A'}. The resulting loop is safe to execute in parallel
because there can be no dependencies on \texttt{A'} since each
iteration uses a different array \texttt{A'}.
\item As a last step, after the parallel execution of the loop
terminates, one must copy (in parallel) the elements produced by the
last iteration of the outer loop (i.e., \texttt{A'[N-1][0:M-1]} in the
array-expansion view) back to \texttt{A}.
\end{enumerate}
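As a concrete, if memory-hungry, sketch of these steps using the
array-expansion alternative---our own illustration, with the
hypothetical array \texttt{Aexp} playing the role of \texttt{A'}, and
heap allocation preferable in practice---the transformed code might
look as follows:
\begin{lstlisting}
int Aexp[N][M];              // expanded array: one private row per iteration
#pragma omp parallel for
for (int i=0; i<N; i++) {
  for (int j=0; j<M; j++) {  // step (2): every use of A becomes Aexp[i]
    Aexp[i][j] = (4*i+4*j) % M;
  }
  for (int k=1; k<N; k++) {
    X[i][k] = X[i][k-1] * Aexp[i][ Aexp[i][(2*i+k)%M] % M ];
  }
}
#pragma omp parallel for
for (int j=0; j<M; j++) {    // step (3): copy back the last iteration's values
  A[j] = Aexp[N-1][j];
}
\end{lstlisting}
OpenMP's privatisation clauses achieve the same effect far more
compactly.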
The parallel OpenMP code that implements these steps is
presented below:
\begin{lstlisting}[mathescape=true]
int A[M];
#pragma omp parallel for lastprivate(A)
for (int i=0; i<N; i++) {
  for (int j=0; j<M; j++) {
    A[j] = (4*i+4*j) % M;
  }
  for (int k=1; k<N; k++) {
    X[i][k] = X[i][k-1] * A[ A[(2*i+k) % M] % M];
  }
}
\end{lstlisting}
Declaring array \texttt{A} as private (by using the clause
\texttt{private(A)}) would result in semantically performing steps (1)
and (2) above. Using the \texttt{lastprivate(A)} clause instructs the
OpenMP compiler to also perform step (3)---to copy back the
privately-maintained result of \texttt{A} from the sequentially last
iteration into the globally-declared array \texttt{A}.
Please also note that the OpenMP execution will not allocate
a new \texttt{A'} for each iteration of the outer loop---that would
be equivalent to performing array expansion, which is also
applicable here---but will instead {\em allocate a copy of
\texttt{A} for each active thread}, thus significantly reducing
the memory footprint and/or the number of (de)allocations.
Privatisation can be applied whenever one can prove that every read
access in an iteration is covered by a previously-performed write
access in the same iteration. Privatisation can be implemented by
performing either array expansion or moving the declaration of the
target variable from outside to inside the loop. However, it saves
memory to allocate the private copy per active thread rather than per
iteration, which is what OpenMP does.
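To illustrate the per-thread scheme, a hand-written sketch (our own,
using \texttt{omp\_get\_max\_threads} and \texttt{omp\_get\_thread\_num}
from \texttt{<omp.h>}, plus \texttt{malloc} from \texttt{<stdlib.h>})
might look as follows. Reusing a thread's copy across iterations is
safe only because every read of \texttt{A} in an iteration is covered
by a write earlier in the same iteration; the copy-back of the last
iteration's values, which \texttt{lastprivate} provides, is omitted
for brevity:
\begin{lstlisting}
int (*Apriv)[M] = malloc(omp_get_max_threads() * sizeof *Apriv);
#pragma omp parallel for
for (int i = 0; i < N; i++) {
  int *Ap = Apriv[omp_get_thread_num()]; // this thread's private copy
  for (int j = 0; j < M; j++) {
    Ap[j] = (4*i + 4*j) % M;
  }
  for (int k = 1; k < N; k++) {
    X[i][k] = X[i][k-1] * Ap[ Ap[(2*i+k) % M] % M ];
  }
}
free(Apriv);
\end{lstlisting}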
\section{Loop stripmining, block and register tiling}
\label{sec:strip-tiling}
This section discusses several simple transformations that are going
to be combined in various ways to optimise locality of reference (both
temporal and spatial locality).
\textbf{\em Stripmining} refers to the following transformation,
which is always safe to apply:
\begin{lstlisting}[mathescape=true]
for (int i = 0; i<N; i++) {        for (int ii = 0; ii<N; ii+=T) {
    iteration body        $\Rightarrow$    for (int i=ii; i<min(ii+T,N); i++) {
}                                        iteration body
                                     } }
\end{lstlisting}
In essence, a normalised loop is split into a perfect nest of two
loops, in which the outer loop goes with stride \texttt{T} and
the inner one goes with stride \texttt{1}. Please notice that the
resulting loop nest executes the same statements, in the same order,
as the original loop.
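As a concrete instance (our own small example, with an assumed tile
size of $4$), stripmining a simple loop looks as follows; after loop
distribution, such fixed-size stride-$1$ chunks are exactly what a
vectorising compiler can map onto vector instructions, as mentioned in
\cref{sec:loop-distrib}:
\begin{lstlisting}
for (int ii = 0; ii < N; ii += 4) {
  int hi = ii + 4 < N ? ii + 4 : N;   // min(ii+4, N)
  for (int i = ii; i < hi; i++) {
    A[i] = 2.0f * B[i];
  }
}
\end{lstlisting}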
\textbf{\em Block tiling} refers to the transformation that
stripmines several consecutive innermost loops in a
perfect loop nest---named $l_{k+1}\ldots l_{k+n}$---and then
interchanges inwards the resulting loops of stride $1$.
The transformation is valid/safe if, in the original program,
it is safe to interchange any of the loops
$l_{k+i},~i\in\{1,\ldots,n-1\}$ to the innermost position.
For example, the code below demonstrates block tiling a perfect
loop nest of depth two:
\begin{lstlisting}[mathescape=true]
for(i = 0; i<N; i++) { for(ii=0; ii<N; ii+=T1) {
for(j = 0; j<M; j++) { for(jj=0; jj<M; jj+=T2) {
iteration body $\Rightarrow$ for(i=ii; i<min(ii+T1,N); i++) {
} for(j=jj; j<min(jj+T2,M); j++) {
} iteration body
} } } }
\end{lstlisting}
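As a concrete illustration (our own example; \texttt{min} is used as
in the schematics above), block tiling a matrix transpose, which has
no dependencies and therefore allows any interchange, improves
locality for both arrays:
\begin{lstlisting}
// untiled: for each i, the writes B[j][i] touch a different row of B
// on every inner iteration (poor spatial locality)
for (int i = 0; i < N; i++)
  for (int j = 0; j < M; j++)
    B[j][i] = A[i][j];

// block tiled with T x T tiles: each tile of A and B stays in cache
// while it is being traversed
for (int ii = 0; ii < N; ii += T)
  for (int jj = 0; jj < M; jj += T)
    for (int i = ii; i < min(ii+T, N); i++)
      for (int j = jj; j < min(jj+T, M); j++)
        B[j][i] = A[i][j];
\end{lstlisting}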
\textbf{\em Unroll and jam} refers to the transformation that partially
unrolls one or more of the outer loops in a perfect nest and then
fuses (``jams'') the resulting loops. Equivalently, one can stripmine
an outer loop, interchange the resulting stride-$1$ loop to the
innermost position (distributing it across the inner statements where
necessary), and then completely unroll it. The transformation is aimed
at decreasing the number of memory loads and stores by storing to and
reusing values from registers, and thus it is applied when the
original loop nest contains data references that allow for temporal
reuse---e.g., their indices are invariant to some of the loops
in the nest.
Due to this, it is also known as ``{\em register tiling}''.
We demonstrate the transformation on the matrix-matrix
multiplication code below:
\begin{lstlisting}[mathescape=true]
for(i=0; i<N; i++) {
for(j=0; j<M; j++) {
float c;
c = 0.0;
for(k=0; k<N; k++) {
c += A[i][k] * B[k][j];
}
C[i][j] = c;
}
}
\end{lstlisting}
The plan is to stripmine the loop of index \texttt{j} with a tile of
size $2$, and to interchange the resulting stride-$1$ loop to the
innermost position, performing the necessary loop distribution and
array expansion along the way:
\begin{lstlisting}[mathescape=true]
for(i=0; i<N; i++) {
for(jj=0; jj<M; jj+=2) {
float cs[2];
for(j=jj; j<min(jj+2,M); j++) {
cs[j-jj] = 0.0;
}
for(k=0; k<N; k++) {
for(j=jj; j<min(jj+2,M); j++) {
cs[j-jj] += A[i][k] * B[k][j];
} }
for(j=jj; j<min(jj+2,M); j++) {
C[i][j] = cs[j-jj];
}
} }
\end{lstlisting}
One can observe that the access \texttt{A[i][k]} is invariant to the
loop of index \texttt{j} that immediately encloses it, and thus it can
be hoisted outside that loop and saved into a register. Then the loops
of index \texttt{j} can be unrolled, and the array \texttt{cs} can be
scalarised as well:
\begin{lstlisting}[mathescape=true]
for(i=0; i<N; i++) {
for(jj=0; jj<M; jj+=2) {
float c1, c2;
if (jj < M) c1 = 0.0;
if (jj+1 < M) c2 = 0.0;
for(k=0; k<N; k++) {
float a;
a = A[i][k];
if (jj < M) c1 += a * B[k][jj ];
if (jj+1 < M) c2 += a * B[k][jj+1];
}
if (jj < M) C[i][jj ] = c1;
if (jj+1 < M) C[i][jj+1] = c2;
} }
\end{lstlisting}
In the resulting code, the accesses to the elements of \texttt{A} have
been halved. We can similarly apply unroll and jam to the loop
of index \texttt{i} with a tile size of $3$. This cuts down
the accesses to \texttt{B} by a factor of $3$. The resulting code
is shown in \cref{fig-unroll-jam}.
\begin{figure}
\begin{lstlisting}[mathescape=true]
for(ii=0; ii<N; ii+=3) {
for(jj=0; jj<M; jj+=2) {
float c11, c12, c21, c22, c31, c32;
if (ii < N && jj < M) c11 = 0.0;
if (ii+1 < N && jj < M) c21 = 0.0;
if (ii+2 < N && jj < M) c31 = 0.0;
if (ii < N && jj+1 < M) c12 = 0.0;
if (ii+1 < N && jj+1 < M) c22 = 0.0;
if (ii+2 < N && jj+1 < M) c32 = 0.0;
for(k=0; k<N; k++) {
float a1, a2, a3, b1, b2;
if (ii < N) a1 = A[ii ][k];
if (ii+1 < N) a2 = A[ii+1][k];
if (ii+2 < N) a3 = A[ii+2][k];
if (jj < M) b1 = B[k ][jj];
if (jj+1 < M) b2 = B[k ][jj+1];
if (ii < N && jj < M) c11 += a1 * b1;
if (ii+1 < N && jj < M) c21 += a2 * b1;
if (ii+2 < N && jj < M) c31 += a3 * b1;
if (ii < N && jj+1 < M) c12 += a1 * b2;
if (ii+1 < N && jj+1 < M) c22 += a2 * b2;
if (ii+2 < N && jj+1 < M) c32 += a3 * b2;
}
if (ii < N && jj < M) C[ii ][jj ] = c11;
if (ii+1 < N && jj < M) C[ii+1][jj ] = c21;
if (ii+2 < N && jj < M) C[ii+2][jj ] = c31;
if (ii   < N && jj+1 < M) C[ii  ][jj+1] = c12;
if (ii+1 < N && jj+1 < M) C[ii+1][jj+1] = c22;
if (ii+2 < N && jj+1 < M) C[ii+2][jj+1] = c32;
}
}
\end{lstlisting}
\caption{Result of unroll-and-jam applied to matrix-matrix
multiplication, where the first and second outer loops were tiled
with sizes $3$ and $2$, respectively. The number of accesses to
\texttt{A} and \texttt{B} has been reduced by a factor of
$2\times$ and $3\times$, respectively, at the expense of
introducing some conditional statements.}
\label{fig-unroll-jam}
\end{figure}
%%% Local Variables:
%%% mode: latex
%%% TeX-master: "hpps-notes"
%%% End: