% This is "sig-alternate.tex" V2.1 April 2013
% This file should be compiled with V2.8 of "sig-alternate.cls" May 2012
%
% This example file demonstrates the use of the 'sig-alternate.cls'
% V2.8 LaTeX2e document class file. It is for those submitting
% articles to ACM Conference Proceedings WHO DO NOT WISH TO
% STRICTLY ADHERE TO THE SIGS (PUBS-BOARD-ENDORSED) STYLE.
% The 'sig-alternate.cls' file will produce a similar-looking,
% albeit, 'tighter' paper resulting in, invariably, fewer pages.
%
% ----------------------------------------------------------------------------------------------------------------
% This .tex file (and associated .cls V2.8) produces:
% 1) The Permission Statement
% 2) The Conference (location) Info information
% 3) The Copyright Line with ACM data
% 4) NO page numbers
%
% as against the acm_proc_article-sp.cls file which
% DOES NOT produce 1) thru' 3) above.
%
% Using 'sig-alternate.cls' you have control, however, from within
% the source .tex file, over both the CopyrightYear
% (defaulted to 200X) and the ACM Copyright Data
% (defaulted to X-XXXXX-XX-X/XX/XX).
% e.g.
% \CopyrightYear{2007} will cause 2007 to appear in the copyright line.
% \crdata{0-12345-67-8/90/12} will cause 0-12345-67-8/90/12 to appear in the copyright line.
%
% ---------------------------------------------------------------------------------------------------------------
% This .tex source is an example which *does* use
% the .bib file (from which the .bbl file % is produced).
% REMEMBER HOWEVER: After having produced the .bbl file,
% and prior to final submission, you *NEED* to 'insert'
% your .bbl file into your source .tex file so as to provide
% ONE 'self-contained' source file.
%
% ================= IF YOU HAVE QUESTIONS =======================
% Questions regarding the SIGS styles, SIGS policies and
% procedures, Conferences etc. should be sent to
% Adrienne Griscti ([email protected])
%
% Technical questions _only_ to
% Gerald Murray ([email protected])
% ===============================================================
%
% For tracking purposes - this is V2.0 - May 2012
\documentclass{sig-alternate}
\begin{document}
% Copyright
\setcopyright{waclicense}
%% DOI
%\doi{10.475/123_4}
%
%% ISBN
%\isbn{123-4567-24-567/08/06}
%
%%Conference
%\conferenceinfo{PLDI '13}{June 16--19, 2013, Seattle, WA, USA}
%
%\acmPrice{\$15.00}
%
% --- Author Metadata here ---
\conferenceinfo{Web Audio Conference WAC-2016,}{April 4--6, 2016, Atlanta, USA}
\CopyrightYear{2016} % Allows default copyright year (20XX) to be over-ridden - IF NEED BE.
%\crdata{0-12345-67-8/90/01} % Allows default copyright data (0-89791-88-6/97/05) to be over-ridden - IF NEED BE.
% --- End of Author Metadata ---
\title{Time Stretching \& Pitch Shifting with the Web Audio API: Where are we at?}
%\subtitle{[Extended Abstract]
%\titlenote{A full version of this paper is available as
%\textit{Author's Guide to Preparing ACM SIG Proceedings Using
%\LaTeX$2_\epsilon$\ and BibTeX} at
%\texttt{www.acm.org/eaddress.htm}}}
%
% You need the command \numberofauthors to handle the 'placement
% and alignment' of the authors beneath the title.
%
% For aesthetic reasons, we recommend 'three authors at a time'
% i.e. three 'name/affiliation blocks' be placed beneath the title.
%
% NOTE: You are NOT restricted in how many 'rows' of
% "name/affiliations" may appear. We just ask that you restrict
% the number of 'columns' to three.
%
% Because of the available 'opening page real-estate'
% we ask you to refrain from putting more than six authors
% (two rows with three columns) beneath the article title.
% More than six makes the first-page appear very cluttered indeed.
%
% Use the \alignauthor commands to handle the names
% and affiliations for an 'aesthetic maximum' of six authors.
% Add names, affiliations, addresses for
% the seventh etc. author(s) as the argument for the
% \additionalauthors command.
% These 'additional authors' will be output/set for you
% without further effort on your part as the last section in
% the body of your article BEFORE References or any Appendices.
\numberofauthors{3} % in this sample file, there are a *total*
% of EIGHT authors. SIX appear on the 'first-page' (for formatting
% reasons) and the remaining two appear in the \additionalauthors section.
%
%\author{
% You can go ahead and credit any number of authors here,
% e.g. one 'row of three' or two rows (consisting of one row of three
% and a second row of one, two or three).
%
% The command \alignauthor (no curly braces needed) should
% precede each author name, affiliation/snail-mail address and
% e-mail address. Additionally, tag each line of
% affiliation/address with \affaddr, and tag the
% e-mail address with \email.
%
% 1st. author
%\alignauthor Matthew E. P. Davies
% \affaddr{Sound and Music Computing Group, INESC TEC, Porto, Portugal}
% \email{[email protected]}
% 2nd. author
%\alignauthor David Martins Matos\\
% \affaddr{INESC-ID, IST - Universidade de Lisboa, Portugal}\\
% \email{[email protected] }
% 3rd. author
%\alignauthor Helena Sofia Pinto\\
% \affaddr{INESC-ID, IST - Universidade de Lisboa, Portugal}\\
% \email{[email protected] }
%\and
% 4th. author
%\alignauthor Bruno Dias\\
% \affaddr{INESC-ID, IST - Universidade de Lisboa, Portugal}\\
% \email{[email protected]}
%}
% There's nothing stopping you putting the seventh, eighth, etc.
% author on the opening page (as the 'third row') but we ask,
% for aesthetic reasons that you place these 'additional authors'
% in the \additional authors block, viz.
\date{1 September 2015}
% Just remember to make sure that the TOTAL number of authors
% is the number that will appear on the first page PLUS the
% number that will appear in the \additionalauthors section.
\maketitle
\begin{abstract}
\begin{sloppypar}
Audio time stretching and pitch shifting are operations that all major commercial and/or open source Digital Audio Workstations, DJ Mixing Software and Live Coding Suites offer. These operations allow users to change the duration of audio files while maintaining the pitch and vice-versa. Such operations allow, for example, a DJ to speed up, or slow down, a song that is being mixed into another song, in order to align the tempo, and beats, of both songs. Unfortunately, there are few (and experimental) client-side JavaScript implementations of these two operations.
\end{sloppypar}
In this paper, we review the current state of the art for client-side implementations of time stretching and pitch shifting and their limitations, and describe two new implementations of two well-known algorithms: (1) the Phase Vocoder with Identity Phase Locking and (2) a modified version of Overlap \& Add.
Additionally, we discuss some issues related to the Web Audio API (WAA) and frequency-domain audio processing, regarding latency and audio quality in pitch shifting and time stretching, towards raising awareness of possible changes to the WAA.
\end{abstract}
%
% The code below should be generated by the tool at
% http://dl.acm.org/ccs.cfm
% Please copy and paste the code instead of the example below.
%
%\begin{CCSXML}
%<ccs2012>
%<concept>
%<concept_id>10010520.10010553.10010562</concept_id>
%<concept_desc>Computer systems organization~Embedded systems</concept_desc>
%<concept_significance>500</concept_significance>
%</concept>
%<concept>
%<concept_id>10010520.10010575.10010755</concept_id>
%<concept_desc>Computer systems organization~Redundancy</concept_desc>
%<concept_significance>300</concept_significance>
%</concept>
%<concept>
%<concept_id>10010520.10010553.10010554</concept_id>
%<concept_desc>Computer systems organization~Robotics</concept_desc>
%<concept_significance>100</concept_significance>
%</concept>
%<concept>
%<concept_id>10003033.10003083.10003095</concept_id>
%<concept_desc>Networks~Network reliability</concept_desc>
%<concept_significance>100</concept_significance>
%</concept>
%</ccs2012>
%\end{CCSXML}
%
%\ccsdesc[500]{Computer systems organization~Embedded systems}
%\ccsdesc[300]{Computer systems organization~Redundancy}
%\ccsdesc{Computer systems organization~Robotics}
%\ccsdesc[100]{Networks~Network reliability}
%
%
%%
%% End generated code
%%
%
%%
%% Use this command to print the description
%%
%\printccsdesc
%
%% We no longer use \terms command
%%\terms{Theory}
%
%\keywords{ACM proceedings, \LaTeX, text tagging}
\section{Introduction}
\begin{sloppypar}
Time stretching and pitch shifting are two operations widely available in commercial and open source musical applications like Ableton Live,\footnote{https://www.ableton.com/} Traktor Pro\footnote{http://bit.ly/1izsxHy} and Ardour.\footnote{http://ardour.org/} Currently, creative frameworks implemented in JavaScript and the Web Audio API (WAA) such as Flocking \cite{flocking:icmc2014} can only change the duration of a signal through re-sampling of audio buffers, thus changing both the duration (i.e.: tempo) and pitch at the same time. In digital audio workstations (DAW) like EarSketch \cite{earsketch:wac2015}, time stretching and pitch shifting are performed server-side, with SoX,\footnote{http://sox.sourceforge.net/} making real-time interaction impossible.
%% MD added extra sentence for motivation to bridge to the next section
Our aim in this work is to enable real-time pitch shifting and time stretching in browsers, currently an under-explored aspect of the WAA.
\end{sloppypar}
\begin{sloppypar}
This paper presents an overview of the current implementations of both operations using JavaScript in the browser environment, along with two new implementations. Additionally, we discuss certain decisions in the WAA design and philosophy, namely the absence of frequency-domain operators like the Fast Fourier Transform (FFT), and their impact on performance.
\end{sloppypar}
\begin{sloppypar}
For both implementations, we had the following non-functional requirements:
\begin{itemize}
\item \textbf{Constant memory usage}, with fixed size circular arrays, in order to minimize the impact of the garbage collector, by maintaining a constant memory profile.
\item \textbf{Intuitive API and documentation}, stating the trade-offs between audio quality and performance.
\item \textbf{Minimize assumptions} regarding sample rate, number of channels, the type of audio content (percussive or harmonic) being processed, third party software components and execution environments. %% MD added content - type of "audio" could be wrong interpreted as file type.. .wav .flac .mp3.
\end{itemize}
\end{sloppypar}
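The first requirement can be illustrated with a minimal fixed-size circular buffer sketch (hypothetical code, not our actual implementation): memory is allocated once and reused, so no garbage is produced during steady-state audio processing.

```javascript
// Minimal fixed-size circular buffer sketch (illustrative only).
class RingBuffer {
  constructor(size) {
    this.data = new Float32Array(size); // allocated once, reused forever
    this.size = size;
    this.head = 0;   // next write position
    this.length = 0; // number of valid samples currently stored
  }
  push(sample) {
    this.data[this.head] = sample;
    this.head = (this.head + 1) % this.size;
    if (this.length < this.size) this.length++;
  }
  // Read the i-th oldest sample currently stored.
  get(i) {
    const start = (this.head - this.length + this.size) % this.size;
    return this.data[(start + i) % this.size];
  }
}
```

Once the buffer is full, new samples overwrite the oldest ones, keeping the memory profile constant regardless of how long the stream runs.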
The outline of the paper is as follows. First, we review the basic theory behind time-domain and frequency-domain time stretching and pitch shifting algorithms. Then, we describe the existing JavaScript implementations of such algorithms. After that, we present our two new implementations of two well-known algorithms: (1) the Phase Vocoder with Identity Phase Locking \cite{laroche:phasinessbusiness} and (2) a modified Overlap \& Add (OLA) algorithm. Finally, we discuss the trade-offs between existing implementations and problems with the WAA regarding these two operations.
\section{Theoretical Background}
For both tasks, there are algorithms that work in the time domain, like Overlap and Add (OLA), Waveform Similarity based OLA (WSOLA) \cite{verhelst:wsola93} and delay line modulation \cite{modulationline:pitchshift}, and others that work in the frequency domain, like the Phase Vocoder \cite{dolsontutorial:phasevocoder} and Spectral Modelling \cite{dafx:2nd}. In this section, we give an overview of three popular methods: OLA, WSOLA and the Phase Vocoder.
Time- and frequency-domain algorithms for time stretching and pitch shifting are, in essence, more sophisticated versions of OLA. For that reason, we start with an overview of the basic OLA algorithm.
\subsection{OLA}
Let $x \in \mathbb{R}^M$ be the input signal of size $M$, $\alpha \in \mathbb{R}$ be the stretching factor and $y \in \mathbb{R}^L$ be the output signal of size $L=\alpha \cdot M$. There are three main steps for OLA:
\begin{enumerate}
\item Partition the input signal $x$ into a set of analysis frames of size $N$, each offset from the previous one by $H_a \in \mathbb{N}^+$ samples (i.e.: the analysis hop size).
\begin{equation}
x_i(n) = x(n + i \cdot H_a), \,\, x_i \in \mathbb{R}^N
\end{equation}
\item Apply a window $w$ to each analysis frame.
\begin{equation}
x_{w_i}(n) = w(n) \cdot x_i(n), \,\, w \in \mathbb{R}^N
\end{equation}
\item Add the (synthesis) frame to the output by overlapping it with the ``tail" of the output signal $y$, advancing by $H_s \in \mathbb{N}^+$ (the synthesis hop size) samples
\begin{equation}
y(n+i \cdot H_s) = \left\{\begin{array}{ll}
y_i(n), & \textrm{if}\,\,i=0 \\
y(n+i \cdot H_s) + \frac{y_i(n)}{w(n)}, & \textrm{if}\,\,i>0
\end{array}\right.
\end{equation}
\end{enumerate}
with $y_i = x_{w_i}$ for the basic OLA, $n \in \mathbb{I} \, \land \, 0 \leq n \leq N$ and $i \in \mathbb{N} \, \land \, i \cdot H_a \leq M \, \land \, i \cdot H_s \leq L$. The relation between the two hop sizes and the stretching factor is defined as:
\begin{equation}
\alpha = \frac{H_s}{H_a}. %% MD added full stop.
\end{equation}
%% MD added part about relevance for real-time processing.. and minor tweaks to text
Usually, one of the hop sizes is fixed while the other is a function of $\alpha$. Note that $\alpha$ can change at each new frame, which is particularly relevant for real-time audio processing. OLA does not preserve phase relations between consecutive frames and, as such, there are noticeable artefacts in the output signal: modulation of harmonic structures (e.g.: the human voice) and reverberation. The temporal complexity of OLA is $O(N)$. In figure \ref{fig:OlaAndWsola}(a), we can see what happens to both $H_a$ and $H_s$ when changing $\alpha$.
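The three OLA steps can be sketched compactly in JavaScript (an illustrative toy, assuming a Hann window and window-sum normalization in place of the per-sample division in step 3; the function names are our own):

```javascript
// Periodic Hann window of length N.
function hann(N) {
  const w = new Float32Array(N);
  for (let n = 0; n < N; n++) w[n] = 0.5 * (1 - Math.cos(2 * Math.PI * n / N));
  return w;
}

// Basic OLA time stretch: partition, window, overlap-add.
function olaStretch(x, alpha, N = 1024) {
  const Ha = Math.round(N / 4);       // analysis hop (fixed)
  const Hs = Math.round(alpha * Ha);  // synthesis hop, so alpha = Hs / Ha
  const w = hann(N);
  const numFrames = Math.floor((x.length - N) / Ha) + 1;
  const y = new Float32Array((numFrames - 1) * Hs + N);
  const norm = new Float32Array(y.length); // running window sum
  for (let i = 0; i < numFrames; i++) {
    for (let n = 0; n < N; n++) {
      y[i * Hs + n] += w[n] * x[i * Ha + n]; // overlap-add windowed frame
      norm[i * Hs + n] += w[n];
    }
  }
  // Normalize by the summed window to keep a roughly constant amplitude.
  for (let n = 0; n < y.length; n++) if (norm[n] > 1e-6) y[n] /= norm[n];
  return y;
}
```

With $\alpha = 2$ the synthesis hop is twice the analysis hop, so frames are laid down further apart and the output is (roughly, up to edge effects) twice as long.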
To perform pitch shifting with OLA, we can couple a re-sampler to the OLA time stretcher. In order to allow simultaneous time stretching and pitch shifting, let $t$ and $t+1$ be the current and next frames, $\beta_{t+1}$ be the new desired pitch shifting factor and $R_{b}$ the sampling rate of the input signal. Then, the new stretching factor $\alpha_{t+1}$ and the new sample rate $R_{t+1}$ are defined as
\begin{equation}
R_{t+1} = R_{b} * \beta_{t+1}
\end{equation}
\begin{equation}
\alpha_{t+1} = \frac{\alpha_{t}}{R_{t}}. %% MD added full stop.
\end{equation}
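The re-sampler coupled to the time stretcher can be sketched with linear interpolation (illustrative only; `resample` is a hypothetical helper, and a production implementation would use a proper interpolation filter to avoid aliasing):

```javascript
// Resample x by factor beta: beta > 1 shortens the signal (raising pitch
// when played at the original rate), beta < 1 lengthens it.
function resample(x, beta) {
  const outLen = Math.floor(x.length / beta);
  const y = new Float32Array(outLen);
  for (let m = 0; m < outLen; m++) {
    const t = m * beta;          // fractional read position in x
    const i = Math.floor(t);
    const frac = t - i;
    const next = i + 1 < x.length ? x[i + 1] : x[i];
    y[m] = (1 - frac) * x[i] + frac * next; // linear interpolation
  }
  return y;
}
```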
\subsection{WSOLA}
First introduced in \cite{verhelst:wsola93}, WSOLA applies a delay $\delta \in [-\delta_{max}:\delta_{max}]$ to each analysis frame, such that the waveforms of two overlapping synthesis frames are as similar as possible in the overlapping regions.\footnote{WSOLA is equivalent to OLA when $\delta_{max}=0$.} The delay can be obtained by calculating the cross-correlation between the overlapping regions of each synthesis frame. Therefore, the analysis step for $x_i$ is redefined as %% MD sample -> frame
\begin{equation}
x_i(n) = x(n + H_{a_i})
\end{equation}
where $H_{a_i}$ is the analysis hop size for frame $i$ and is defined as
\begin{equation}
H_{a_i} = i \cdot H_a + \delta_i
\end{equation}
and
\begin{equation}
\delta_i = x_i \star x_{i-1}[\delta_{max}]
\end{equation}
where $\delta_{max} < H_a$. This algorithm removes the reverberation and modulation but, for transient-rich sounds (percussion, guitar riffs), there might be missing transients and stuttering (i.e.: repeated transients). Regarding the temporal complexity of WSOLA, if the cross-correlation is implemented with FFT convolution \cite{dspguide}, we get $O(N \log_2 N)$. Otherwise, with a ``naive" implementation of the correlation, we get $O(N^2)$.
To perform pitch shifting, we can use the same method described for OLA.
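The delay search at the core of WSOLA can be sketched as follows (a naive $O(N \cdot \delta_{max})$ illustration; the function name and the simplified correlation against the previous analysis frame are our own, not taken from any cited implementation):

```javascript
// Pick the offset d in [-deltaMax, deltaMax] maximizing the
// cross-correlation between the previous frame and the candidate frame,
// so that overlapping synthesis frames are as similar as possible.
function bestDelta(x, prevFrameStart, nominalStart, N, deltaMax) {
  let best = 0, bestScore = -Infinity;
  for (let d = -deltaMax; d <= deltaMax; d++) {
    let score = 0;
    for (let n = 0; n < N; n++) {
      score += x[prevFrameStart + n] * x[nominalStart + d + n];
    }
    if (score > bestScore) { bestScore = score; best = d; }
  }
  return best; // delta_i, giving H_{a_i} = i * Ha + delta_i
}
```

For a periodic input, the chosen delay realigns the candidate frame to the phase of the previous one, which is exactly what suppresses OLA's modulation artefacts.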
\begin{figure*}
\centering
%PLACE THE IMAGES: (a) eps/constant-bpm, (b) eps/step-bpm, (c) eps/aprox-bpm, (d) eps/linear-bpm [width=1.4in]
\subfigure(a) \includegraphics[width=3.9in]{OLA}
\subfigure(b) \includegraphics[width=2.2in]{WSOLA}
\caption{Illustrations for (a) the basic OLA algorithm, with both analysis and synthesis steps, and (b) the WSOLA analysis step. For time compression, we have $H_a > H_s$ and, for time expansion, $H_a < H_s$. In WSOLA, the main difference is in the ``reading head" position: for each analysis frame $i$, the analysis hop size $H_{a_i}$ is calculated according to the optimal displacement $\delta_i$, obtained through cross-correlation between analysis frames $i$ and $i-1$, for a maximum delay of $\delta_{max}$: $H_{a_i} = i \cdot H_a + \delta_i$.}
\label{fig:OlaAndWsola}
\end{figure*}
\subsection{Phase Vocoder}
\begin{sloppypar}
The Phase Vocoder is a well-documented \cite{phasevocoder:bernardini, dolsontutorial:phasevocoder, phavorit, phasevocoder:latency, laroche:phasinessbusiness, phasevocoder:pitchshift, dafx:2nd} algorithm used for time stretching \cite{dolsontutorial:phasevocoder,phavorit}, pitch shifting \cite{phasevocoder:pitchshift} and other audio effects like robotization \cite{dafx:2nd}. In general, it has a higher computational cost than time-domain methods like WSOLA but offers higher audio quality, without missing transients. Each iteration of the Phase Vocoder has eight steps, which occur in between steps 2 and 3 of OLA:
\begin{enumerate}
\item Calculate the forward Fourier transform of $x_{w_i}$
\begin{equation}
X_{i}(\Omega_{k}) = \sum_{n=0}^{N-1} x_{w_i}(n) \cdot e^{-j\Omega_{k}n}, \,\, X_{i} \in \mathbb{C}^N
\end{equation}
where $\Omega_{k} = \frac{2\pi k}{N}$ is the frequency center for frequency bin $k$ and $e^{-j\Omega_{k}n}$ is a complex sinusoid of frequency $\Omega_k$.
\item Calculate the magnitude $|X_i| \in \mathbb{R}^N$ and phase $\angle X_i \in \mathbb{R}^N$ spectra for $X_i$ by converting $X_i$ to polar coordinates.
\item Calculate the difference between the current and previous phase spectra and, then, the sample-wise difference with the frequency centres $\Omega_{k}$
\begin{equation}
\Delta_{\angle X_i} = \angle X_{i} - \angle X_{i-1} - H_a \cdot \Omega_k
\end{equation}
\item Because phase values are given modulo $2\pi$, phase `jumps' will occur, so we need to unwrap the phase in order to obtain a continuous phase function
\begin{equation}
\Delta^{p}_{\angle X_i} = \Delta_{\angle X_i} - 2\pi \cdot \left\lfloor \frac{\Delta_{\angle X_i}}{2\pi} \right\rceil
\end{equation}
\item Compute the instantaneous frequency $\omega_k$ for each frequency bin $k$
\begin{equation}
\omega_k = \Omega_k + \frac{\Delta^{p}_{\angle X_i}}{H_a}
\end{equation}
\item Now we can use $\omega_k$ to compute the output phase spectra $\angle Y_i$, by advancing the previous output phase $\angle Y_{i-1}$ according to the synthesis hop size $H_s$
\begin{equation}
\angle Y_i = \angle Y_{i-1} + H_s \cdot \omega_k
\end{equation}
\item Compute the synthesis frame $Y_i \in \mathbb{C}^N$ by reusing the input magnitude spectra and the new phase spectra
\begin{equation}
Y_i = |X_i| \cdot e^{j \cdot \angle Y_i}
\end{equation}
\item Finally, we calculate the inverse Fourier transform $y_i$ of the frequency frame $Y_i$
\begin{equation}
y_{i}(n) = w(n) \cdot \sum_{k=0}^{N-1} Y_i(\Omega_k) \cdot e^{j\Omega_{k}n}
\end{equation}
\end{enumerate}
\end{sloppypar}
This algorithm ensures \textit{horizontal phase coherence} (i.e.: phase continuity for the same frequency bin across consecutive frames). However, \textit{vertical phase coherence} (i.e.: the phase relations between different frequency bins within the same frame) is usually destroyed in the phase correction process. The loss of \textit{vertical phase coherence} results in a distinct artefact: \textit{phasiness} (i.e.: a metallic, tunnel-like sound). The temporal complexity of the Phase Vocoder is $O(N \log_{2} N)$.
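Steps 3 to 6 above, the per-bin phase propagation, can be sketched as follows (a minimal illustration with a hypothetical function name, not code from any of the cited implementations):

```javascript
// Phase propagation for one frame: phase difference, principal-value
// unwrap, instantaneous frequency, and output phase advance by Hs.
function propagatePhase(phase, prevPhase, prevOutPhase, Ha, Hs, N) {
  const outPhase = new Float32Array(N);
  for (let k = 0; k < N; k++) {
    const omegaK = 2 * Math.PI * k / N;              // bin centre frequency
    let delta = phase[k] - prevPhase[k] - Ha * omegaK;
    delta = delta - 2 * Math.PI * Math.round(delta / (2 * Math.PI)); // unwrap
    const omega = omegaK + delta / Ha;               // instantaneous frequency
    outPhase[k] = prevOutPhase[k] + Hs * omega;      // advance by synthesis hop
  }
  return outPhase;
}
```

For a tone sitting exactly on a bin centre, the unwrapped deviation is zero and the output phase simply advances by $H_s \cdot \Omega_k$, which is the behaviour the derivation above predicts.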
To maintain both vertical and horizontal coherence, we can apply a technique known as \textit{Identity Phase-Locking} \cite{phasevocoder:identityphaselock}. The main idea is that frequency bins that are not spectral peaks contribute to the partials of the nearest spectral peak. A spectral peak is a local maximum in the magnitude spectra. To find the spectral peaks, we can use a simple heuristic: if a frequency bin has the maximum magnitude when compared with its four neighbouring bins, then it is a spectral peak. After identifying all spectral peaks, we need to infer the ``region of influence" of each peak (i.e.: for a given peak, which non-peak bins contribute to the peak partial; see section III-C of \cite{phasevocoder:identityphaselock}). After identifying both the spectral peaks and the ``regions of influence", the usual phase correction method is applied to the peak bins, and the phase of each non-peak bin is set equal to the phase of its corresponding peak bin.
To pitch shift with a Phase Vocoder, there are two methods available. The first is re-sampling, in the same manner as OLA and WSOLA. The second adds an extra step to the phase correction, as described in \cite{phasevocoder:pitchshift}, and proceeds as follows. After identifying the spectral peaks, for a pitch shift factor $\beta$, each peak is shifted to a new (angular) frequency $\beta \omega$, corresponding to a frequency shift $\Delta \omega = \omega(\beta-1)$. When $\Delta \omega$ is an integer number of bins, we just need to copy the Fourier transform values from the original ``region of influence" to the new one (around the new peak bin). If $\Delta \omega$ is fractional, a naive solution is to round it to the nearest integer; this presents acceptable results for low sample rates and large FFT frame sizes.
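The peak-picking heuristic used by Identity Phase-Locking (a bin whose magnitude exceeds that of its four neighbours, two on each side) can be sketched as (hypothetical helper, illustrative only):

```javascript
// Return the indices of spectral peaks in a magnitude spectrum.
function findPeaks(mag) {
  const peaks = [];
  for (let k = 2; k < mag.length - 2; k++) {
    if (mag[k] > mag[k - 2] && mag[k] > mag[k - 1] &&
        mag[k] > mag[k + 1] && mag[k] > mag[k + 2]) {
      peaks.push(k);
    }
  }
  return peaks;
}
```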
\section{Existing Implementations}
\begin{sloppypar}
In this section, we give an overview of time stretching and pitch shifting implementations built with web technologies like JavaScript and the WAA, as well as of native web browser implementations, available through the Audio Element \cite{audiotag} via its \textit{playbackRate} attribute \cite{audiotag:playbackRate}.
For pitch shift only implementations in JavaScript, there is pitchshift.js \cite{kievII} and jungle.js.\footnote{https://github.com/cwilso/Audio-Input-Effects} The first is a port of a C++ implementation\footnote{http://downloads.dspdimension.com/smbPitchShift.cpp} of the Phase Vocoder for pitch shifting \cite{phasevocoder:pitchshift} while the second is an implementation of the algorithm presented in \cite{modulationline:pitchshift}.
In Vexwarp,\footnote{http://www.vexflow.com/vexwarp/} time stretching is performed with the basic phase vocoder algorithm \cite{dolsontutorial:phasevocoder,phavorit}. With this application, it is not possible to perform real-time processing of the input signal.
Soundtouch.js\footnote{https://github.com/also/soundtouch-js} is a port of the C++ library SoundTouch,\footnote{http://www.surina.net/soundtouch/} a WSOLA implementation. This library performs both time stretching and pitch shifting (through re-sampling) with some additional features:
\begin{itemize}
\item Cross-correlation is computed with an interleaved array with all audio channels.
\item Instead of implementing the ``naive" approach to cross-correlation, the developers used a hierarchical algorithm.
\end{itemize}
This port tightly couples the buffers, buffer management, stretcher, re-sampler and parameter adjustments, and contains some hard-coded parameters, making integration into new applications a difficult task.
There is another WSOLA implementation, tempo.js \cite{kievII}: a port of the SoX tempo effect compiled with Emscripten \cite{emscripten}. Currently, it has no documentation and the only available demo uses deprecated APIs that are no longer available.
In the WAVES project,\footnote{http://wavesjs.github.io/} the Audio library \cite{wavesaudio:wac2015} supports time stretching and pitch shifting through granular synthesis and re-sampling, offering two classes to perform both tasks: GranularEngine and SegmentEngine. The first one causes significant transient smearing. The second class requires the developer to pass a JSON object detailing the segmentation of the input audio buffer.
Regarding the native implementations, all major web browsers, like Opera, Safari, Chrome and Firefox, implement time stretching for use with the Audio tag, controlling the stretching factor with the \textit{playbackRate} attribute of the Audio tag/object. The stretching factor in the current implementations seems to be limited to the range $\alpha \in [0.5:4]$, where 0.5 is the slowest speed and 4 the fastest, meaning that web browsers perform an additional step in order to make the provided $\alpha$ comply with equation (4). Currently, we can only detail the implementations of Firefox and Chromium, due to the closed-source nature of Opera and Safari. In Firefox, time stretching is performed by the SoundTouch library; Chromium uses a custom WSOLA implementation. For both browsers, when slowing down, there is some stuttering (more noticeable in Firefox). When speeding up, namely for \textit{playbackRate} values greater than 1.2, there are some missing transients in percussive sounds.
\end{sloppypar}
\section{New Implementations}
\begin{sloppypar}
In this section, we detail our two implementations for time stretching, OLA-TS.js (modified OLA) and PhaseVocoder.js (Phase Vocoder with Identity Phase-Locking), as well as some helper classes we created to ease the integration of the time stretchers.
Both implementations operate on a single audio channel and can be included in a \textit{ScriptProcessor} or \textit{AudioWorker} for real-time interaction/processing, or used for batch processing, in a similar way to Vexwarp, in order to integrate into frameworks and applications like Flocking and EarSketch. They do not include pitch shifting capabilities, but that can easily be changed by coupling a re-sampler to the helper classes. In order to maintain a static memory footprint, we used an existing circular buffer implementation, CBuffer.\footnote{https://github.com/trevnorris/cbuffer} Even though our time stretchers implement (totally) different algorithms, both share a common API:
\begin{itemize}
\item \textit{process(Array inputFrame, CBuffer outputFrame)}: given a (mono) frame, performs a time stretching iteration and pushes $H_s$ samples into the output CBuffer.
\item \textit{get\_ha}: returns the current analysis hop size. This function calculates the increment to the ``read head" of the input signal, when playing an audio file.
\item \textit{get\_hs}: returns the current synthesis hop size. This function calculates the increment to the output signal position, which can be used to guide the cursor in the UI of an audio player using OLA-TS.js or PhaseVocoder.js as time stretchers.
\item \textit{clear\_buffers}: clears all internal buffers, like the overlapping buffer. This can be useful for audio players that need to create a noticeable stop in the transition to the next file in a playlist, in order to avoid using the phase of the previous song to adjust the phase of the next song.
\item \textit{set\_alpha(Number newAlpha)}: given the new stretching factor, it computes the new values for $H_s$, $H_a$ (both integers) and invokes the function pointed by \textit{overlap\_fn}.
\item \textit{get\_alpha}: returns the last specified stretching factor.
\item \textit{overlap\_fn}: a public field pointing to a function that, given a stretching factor $\alpha$, will return a new overlapping factor.
\item \textit{get\_real\_alpha}: some stretching factors do not allow $H_s$ and $H_a$ to be integers, which is a problem because the input signal ``read head" is an integer ($H_a$ is used to increment the ``read head"). When the developer uses \textit{set\_alpha} to specify a new stretching factor, both $H_s$ and $H_a$ are rounded to the closest integer, so there will be a difference between the specified $\alpha$ and the real $\alpha$. This difference can cause problems in use cases like a DJ application that automatically synchronizes two songs to a master tempo: over time, the divergence between the specified and real $\alpha$ can cause the beats of both songs to drift out of sync. We created this function to allow the developer to build controllers that adjust the speed of the audio players to circumvent this issue.
\end{itemize}
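The rounding issue behind \textit{get\_real\_alpha} can be sketched in a few lines (names and default parameter values are illustrative, not the actual OLA-TS.js/PhaseVocoder.js internals):

```javascript
// Quantize the hop sizes to integers, as set_alpha must, and report the
// effective ("real") stretching factor that results.
function quantizedAlpha(alpha, N = 4096, overlap = 4) {
  const Ha = Math.round(N / overlap);   // analysis hop
  const Hs = Math.round(alpha * Ha);    // synthesis hop, rounded
  return { Ha, Hs, realAlpha: Hs / Ha }; // realAlpha is what get_real_alpha reports
}
```

For example, requesting $\alpha = 1.001$ with $H_a = 1024$ yields $H_s = 1025$ and a real factor of $1025/1024 \approx 1.00098$: a small discrepancy, but enough to make two beat-synchronized songs drift apart over time.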
\subsection{OLA-TS.js}
OLA-TS.js diverges from the basic OLA algorithm as follows: the window has an exponent that is a function of the stretching factor, $W'(n) = W(n)^{\beta(\alpha)}$, and the overlapping factor is also a function of $\alpha$, $Ovl(\alpha)$. We include default functions for both the overlapping factor and the exponent; the exponent function can be redefined by changing the public field \textit{beta\_fn}. The default functions were designed through experimentation; both are series of step functions designed to minimize the modulation described in section 2.1.
Both OLA-TS.js and PhaseVocoder.js use the following formulas to define the analysis and synthesis hop sizes:
\begin{equation}
H_a = \frac{N}{Ovl(\alpha)}
\end{equation}
\begin{equation}
H_s = \alpha * H_a
\end{equation}
%
where $Ovl(\alpha)$ is the function defined in \textit{overlap\_fn}. In order to properly stretch an input signal, the developer should use a predetermined sequence of instructions. To make integration into other applications easier, we implemented helper classes to manage the buffering and the ``read heads" for the input buffers. Both implementations and their helper classes are public and open source, each in its own Git repository.\footnote{https://github.com/echo66/OLA-TS.js}\footnote{https://github.com/echo66/PhaseVocoderJS} Additionally, we implemented a small demo page where the user can drag \& drop several songs and play them simultaneously, controlling the volume and the stretching factor.
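The predetermined sequence of instructions a developer follows can be sketched with a simple batch driver over the common API. The \texttt{PassthroughStretcher} stub below merely stands in for a real stretcher such as OLA-TS.js or PhaseVocoder.js; both the stub and the driver are illustrative, not part of the released code:

```javascript
// Trivial stretcher stub exposing the common API described above.
class PassthroughStretcher {
  constructor(frameSize) { this.N = frameSize; this.alpha = 1; }
  set_alpha(a) { this.alpha = a; }
  get_ha() { return this.N; }                          // analysis hop
  get_hs() { return Math.round(this.alpha * this.N); } // synthesis hop
  process(inputFrame, outputFrame) {
    // A real stretcher would window, transform and overlap-add here;
    // this stub just emits Hs samples per input frame.
    for (let n = 0; n < this.get_hs(); n++) {
      outputFrame.push(inputFrame[n % inputFrame.length]);
    }
  }
}

// Batch driver: read one analysis frame per iteration, advance by Ha.
function stretchBuffer(stretcher, input) {
  const out = [];
  for (let pos = 0; pos + stretcher.get_ha() <= input.length;
       pos += stretcher.get_ha()) {
    stretcher.process(input.slice(pos, pos + stretcher.get_ha()), out);
  }
  return out;
}
```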
\subsection{PhaseVocoder.js}
PhaseVocoder.js uses, by default, fourier.js, an FFT library offering methods implemented in both asm.js\footnote{http://asmjs.org} and ``raw" JavaScript. To use a different FFT library, the developer can use the following methods: \textit{set\_stft\_fn(Function stftCallback)}, \textit{set\_istft\_fn(Function istftCallback)} and \textit{init\_fft\_fn(Function initCallback)}:
\begin{itemize}
\item \textit{stftCallback(Array inputFrame, Array windowFrame, Number wantedSize, Object out)}: \textit{inputFrame} is a sequence of samples; \textit{windowFrame} is the discretization of the window function; \textit{wantedSize} is the desired size for the real and imaginary arrays of the output\footnote{\textit{wantedSize} cannot be bigger than \textit{windowFrame.length}.}; \textit{out} is a JSON object with four arrays: real, imaginary, magnitude and phase, all of them describing the result of the forward Short Time Fourier Transform (STFT). This function will be invoked for each time frame being processed by PhaseVocoder.js.
\item \textit{istftCallback(Array real, Array imag, Array windowFrame, Array timeFrame)}: \textit{real} and \textit{imag} are the real and imaginary arrays describing a frequency frame; \textit{timeFrame} is the result of the inverse STFT. This function will be invoked when synthesizing a frequency frame, after the phase adaptation.
\item \textit{initCallback(Number frameSize)}: this function is invoked after the callback is added to a PhaseVocoder.js instance.
\end{itemize}
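To make the callback contract concrete, the following self-contained sketch implements an \textit{stftCallback} with a naive $O(N^2)$ DFT; a real integration would delegate these loops to an FFT library instead:

```javascript
// Sketch of an stftCallback matching the signature described above.
// The naive DFT is for illustration only; it is far too slow for real use.
function stftCallback(inputFrame, windowFrame, wantedSize, out) {
  const N = windowFrame.length;
  out.real = []; out.imag = []; out.magnitude = []; out.phase = [];
  for (let k = 0; k < wantedSize; k++) {
    let re = 0, im = 0;
    for (let n = 0; n < N; n++) {
      const x = inputFrame[n] * windowFrame[n]; // apply the analysis window
      const ang = (-2 * Math.PI * k * n) / N;
      re += x * Math.cos(ang);
      im += x * Math.sin(ang);
    }
    // The out object carries all four representations the text describes.
    out.real.push(re);
    out.imag.push(im);
    out.magnitude.push(Math.hypot(re, im));
    out.phase.push(Math.atan2(im, re));
  }
}
```

Because \textit{out} receives real, imaginary, magnitude and phase arrays, PhaseVocoder.js can perform its phase adaptation without recomputing any of these representations.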
\subsection{Helper Classes}
For both implementations, we created a set of helpers with a common API: \textit{BufferedTS} and \textit{WAAPlayer}.
In BufferedTS, we manage the output buffering of two time stretchers (i.e.: 2 channels), as well as the ``read head" position. For this class, we provide two (public) methods, (1) \textit{process(AudioBuffer oBuf)}, which writes the next output frame into \textit{oBuf}, and (2) \textit{set\_audio\_buffer(AudioBuffer iBuf)}, which defines the input audio buffer to be processed with the time stretchers, and two (public) read/write fields, \textit{position} and \textit{alpha}, allowing access to both the ``read head" and the stretch factor $\alpha$.
To ease integration with the WAA, we implemented \textit{WAAPlayer}, which integrates a \textit{BufferedTS} instance into a \textit{ScriptProcessor}.
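The pull-based pattern exposed by \textit{BufferedTS} can be illustrated with a toy stand-in that only copies samples (no actual stretching). The field and method names mirror the API described above, but the class itself is illustrative:

```javascript
// Toy illustration of the BufferedTS pull model: the consumer repeatedly
// asks for the next output frame. This stub copies samples verbatim and
// performs no time stretching.
class ToyBufferedTS {
  constructor(frameSize) {
    this.frameSize = frameSize;
    this.position = 0;   // "read head" into the input buffer
    this.alpha = 1.0;    // stretching factor (unused by this stub)
    this._input = null;
  }
  set_audio_buffer(iBuf) {
    this._input = iBuf;
    this.position = 0;
  }
  process(oBuf) {
    for (let i = 0; i < this.frameSize; i++) {
      oBuf[i] = this._input[(this.position + i) % this._input.length];
    }
    // A real stretcher would advance the read head by Ha, not frameSize.
    this.position += this.frameSize;
  }
}
```

A \textit{ScriptProcessor}'s \textit{onaudioprocess} handler maps naturally onto this model: each callback simply invokes \textit{process} with the node's output buffer.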
\section{Evaluation and Discussion}
\begin{table*}
\centering
\caption{Comparison of Time Stretching/Pitch Shift JavaScript and Native implementations}
\begin{tabular}{|c|c|c|c|} \hline
\textbf{Name} & \textbf{T. Stretch/P. Shift} & \textbf{Algorithm} & \textbf{Audio Artifacts}\\ \hline
SoundTouch.js & Both & WSOLA + Re-Sampling & Missing Transients \\ \hline
WAVES Audio library & Both & Granular Synthesis + Re-Sampling & Smeared Transients \\ \hline
pitchshift.js & Pitch Shift & Phase Vocoder & None \\ \hline
jungle.js & Pitch Shift & Delay-Line Modulation & Reverberation \\ \hline
Vexwarp & Time Stretch & Phase Vocoder & Metal Tunnel \\ \hline
tempo-sox.js & Time Stretch & WSOLA & Unknown \\ \hline
PhaseVocoder.JS & Time Stretch & Phase Vocoder & None \\ \hline
OLA-TS.JS & Time Stretch & Modified OLA & Some modulation in harm. structures \\ \hline
Firefox Audio & Time Stretch & WSOLA & Missing Transients \\ \hline
Chromium Audio & Time Stretch & WSOLA & None \\
\hline\end{tabular}
\end{table*}
In this section, we compare our implementations with the existing ones, reviewed in Section 3.
\begin{sloppypar}
Unlike Vexwarp, PhaseVocoder.js allows real-time, block-wise processing of the input signal, with minimal transient smearing and no metallic sound, thanks to the inclusion of \textit{vertical phase coherence}. Unfortunately, due to the number of FFTs used (for $\alpha>1$, there are 4 forward and 4 inverse FFTs per output buffer), its computational costs are greater than those of soundtouch.js. However, to avoid missing transients in soundtouch.js, one must decrease the frame size and/or increase the cross-correlation range, which increases soundtouch.js's computational costs and can, as a consequence, make PhaseVocoder.js competitive.
\end{sloppypar}
OLA-TS.js performs time stretching in $O(N)$ time, unlike PhaseVocoder.js ($O(N \log_{2} N)$) and soundtouch.js ($O(N^2)$), but can introduce noticeable modulation in harmonic structures, such as the voice of a human singer, for frame sizes smaller than 4096 samples at a sample rate of 44100 Hz. Additionally, it may require some manual experimentation with the overlap and beta parameters to reduce the modulation.\footnote{We provide two default functions to adjust both parameters.} Due to the recommended frame size (4096 samples), OLA-TS.js is adequate for applications with smooth adjustments of the stretching factor.
To help the reader grasp the main features and problems of each reviewed and/or discussed implementation, Table 1 lists, for each implementation, the effect(s) implemented, its algorithm and its audio artefacts.
\section{Time stretching and pitch shifting issues with the WAA}
Currently, the WAA offers two classes to work in the frequency domain: (1) \textit{AnalyserNode}, which allows the developer to obtain the magnitude spectra (but not the phase spectra) of a time sequence, and (2) \textit{PeriodicWave}, an interface to define a periodic waveform for an \textit{OscillatorNode} in order to perform synthesis using a method similar to the one described in \cite{moorer1976the}. Currently, there is no way to retrieve the full frequency description in a \textit{ScriptProcessor} or \textit{AudioWorker}. This forces developers to use JavaScript implementations of the FFT in order to implement high-quality time stretching with Phase Vocoders, instead of relying on native implementations exposed through a JavaScript API like the WAA. This leads to increased computational overhead because JavaScript is an interpreted language. The overhead can cause sudden audio drop-outs when using algorithms like the Phase Vocoder, Spectral Modelling or Percussive-Harmonic Separation \cite{Tachibana:2014:HSS:2719949.2719982} within a \textit{ScriptProcessor}.\footnote{In Google Chrome/Chromium, audio drop-outs occur when playing three or more songs; in Firefox, they appear with six or more.} In a Phase Vocoder, the bulk of the computational cost is due to the FFT. This absence has caused significant discussion within the WAA community.\footnote{https://github.com/WebAudio/web-audio-api/issues/468}
\section{Conclusions and Future Work}
In this paper, we presented the existing time stretching and pitch shifting implementations using web technologies, as well as two new implementations. Then, we compared the implementations regarding computational costs and the existence of audio artefacts. Additionally, we commented on the current state of the WAA specification regarding frequency operations and how that affects time stretching and pitch shifting.
Our next step regarding OLA-TS.js and PhaseVocoder.js will be to integrate them in creative frameworks and applications like Flocking and EarSketch. Additionally, we plan to integrate a re-sampler in the helper classes in order to provide pitch shifting. For PhaseVocoder.js, we plan to include new audio effects such as robotization and whisperization.
%ACKNOWLEDGMENTS are optional
%\section{Acknowledgments}
%
%This work was partly supported by national funds through FCT – Fundação para a Ciência e Tecnologia, under project EXCL/EEI-ESS/0257/2012.
%This work was supported by national funds through Fundação para a Ciência e a Tecnologia (FCT) with reference UID/CEC/50021/2013.
%
% The following two commands are all you need in the
% initial runs of your .tex file to
% produce the bibliography for the citations in your paper.
\bibliographystyle{abbrv}
\bibliography{sigproc} % sigproc.bib is the name of the Bibliography in this case
% You must have a proper ".bib" file
% and remember to run:
% latex bibtex latex latex
% to resolve all references
%
% ACM needs 'a single self-contained file'!
%
\end{document}