<!doctype html>
<html lang="en">
<head>
<!-- Required meta tags -->
<meta charset="utf-8">
<meta name="viewport" content="width=device-width, initial-scale=1, shrink-to-fit=no">
<!-- Bootstrap CSS -->
<link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.3.1/css/bootstrap.min.css" integrity="sha384-ggOyR0iXCbMQv3Xipma34MD+dH/1fQ784/j6cY/iJTQUOhcWr7x9JvoRxT2MZw1T" crossorigin="anonymous">
<title>An Automated Method to Correct Artifacts in Neural Text-to-Speech Models</title>
</head>
<body>
<div class="container">
<h1>An Automated Method to Correct Artifacts in Neural Text-to-Speech Models</h1>
<br>
<br>
<h2>Abstract</h2>
<p>
Recent advances in deep learning-based speech synthesis have yielded substantial progress in generating natural speech. Although these high-performing models are applied in various domains, mis-synthesized speech still needs to be corrected. Previous speech correction methods are inefficient because they require manual error specification, model retraining, or additional data. This paper presents a novel approach for detecting and correcting errors within the model, without additional resources or model retraining. Specifically, we propose a method for automatically identifying abnormal encoder vectors by examining the inherent limitations of the neural network encoders that contextualize input sentences. Additionally, we introduce a correction algorithm that reduces speech artifacts by eliminating the incorrect relationships among phonemes that produce abnormal encoder context vectors. Objective evaluation metrics, namely attention alignment error and Fréchet Wav2Vec Distance, along with subjective evaluation using the Comparative Mean Opinion Score, show significant improvements in the corrected speech. These findings underscore the need for technologies that can autonomously identify and correct flaws in speech synthesis models.
</p>
<br>
<h2>Artificial speech correction with the proposed method</h2>
<p>
Each script is provided with four types of sample audio: synthesized speech, local reference (Ours), global reference (Truncation Trick), and random reference. We conducted experiments on three types of test data: LJSpeech, low PMI LibriSpeech, and high PMI LibriSpeech. The abnormal part of each sentence is highlighted in red.
</p>
<br>
<h2>LJSpeech</h2>
<br>
<h5>
"LJ019-0368":"The <span style="color:red">latter</span> too was to be laid before the House of Commons.",
</h5>
<table>
<thead>
<tr>
<th>Synthesized Speech</th>
<th>Local reference (Ours)</th>
<th>Global reference (Truncation Trick)</th>
<th>Random reference</th>
</tr>
</thead>
<tbody>
<tr>
<td><audio controls="" ><source src="wavs/LJ019-0368_ori_3_0_0.wav"/></audio></td>
<td><audio controls="" ><source src="wavs/LJ019-0368_proposed_3_0_0.wav"/></audio></td>
<td><audio controls="" ><source src="wavs/LJ019-0368_mean_3_0_0.wav"/></audio></td>
<td><audio controls="" ><source src="wavs/LJ019-0368_rn_3_0_0.wav"/></audio></td>
</tr>
</tbody>
</table>
<h5>
"LJ050-0235":"It has also used other Federal law enforcement agents during Presidential visits to cities in which such agents are <span style="color:red">stationed</span>",
</h5>
<table>
<thead>
<tr>
<th>Synthesized Speech</th>
<th>Local reference (Ours)</th>
<th>Global reference (Truncation Trick)</th>
<th>Random reference</th>
</tr>
</thead>
<tbody>
<tr>
<td><audio controls="" ><source src="wavs/LJ050-0235_ori_3_0_0.wav"/></audio></td>
<td><audio controls="" ><source src="wavs/LJ050-0235_proposed_3_0_0.wav"/></audio></td>
<td><audio controls="" ><source src="wavs/LJ050-0235_mean_3_0_0.wav"/></audio></td>
<td><audio controls="" ><source src="wavs/LJ050-0235_rn_3_0_0.wav"/></audio></td>
</tr>
</tbody>
</table>
<h5>
"LJ050-0084":"or, <span style="color:red">quote</span>, other high government officials in the nature of a complaint coupled with an expressed or implied determination to use a means",
</h5>
<table>
<thead>
<tr>
<th>Synthesized Speech</th>
<th>Local reference (Ours)</th>
<th>Global reference (Truncation Trick)</th>
<th>Random reference</th>
</tr>
</thead>
<tbody>
<tr>
<td><audio controls="" ><source src="wavs/LJ050-0084_ori_3_0_0.wav"/></audio></td>
<td><audio controls="" ><source src="wavs/LJ050-0084_proposed_3_0_0.wav"/></audio></td>
<td><audio controls="" ><source src="wavs/LJ050-0084_mean_3_0_0.wav"/></audio></td>
<td><audio controls="" ><source src="wavs/LJ050-0084_rn_3_0_0.wav"/></audio></td>
</tr>
</tbody>
</table>
<h5>
"LJ048-0053":"It is the conclusion of the Commission that, even in the absence of Secret Service <span style="color:red">criteria</span>",
</h5>
<table>
<thead>
<tr>
<th>Synthesized Speech</th>
<th>Local reference (Ours)</th>
<th>Global reference (Truncation Trick)</th>
<th>Random reference</th>
</tr>
</thead>
<tbody>
<tr>
<td><audio controls="" ><source src="wavs/LJ048-0053_ori_3_0_0.wav"/></audio></td>
<td><audio controls="" ><source src="wavs/LJ048-0053_proposed_3_0_0.wav"/></audio></td>
<td><audio controls="" ><source src="wavs/LJ048-0053_mean_3_0_0.wav"/></audio></td>
<td><audio controls="" ><source src="wavs/LJ048-0053_rn_3_0_0.wav"/></audio></td>
</tr>
</tbody>
</table>
<h5>
"LJ043-0107":"Upon moving to New Orleans on April 24, <span style="color:red">1963,</span>",
</h5>
<table>
<thead>
<tr>
<th>Synthesized Speech</th>
<th>Local reference (Ours)</th>
<th>Global reference (Truncation Trick)</th>
<th>Random reference</th>
</tr>
</thead>
<tbody>
<tr>
<td><audio controls="" ><source src="wavs/LJ043-0107_ori_3_0_0.wav"/></audio></td>
<td><audio controls="" ><source src="wavs/LJ043-0107_proposed_3_0_0.wav"/></audio></td>
<td><audio controls="" ><source src="wavs/LJ043-0107_mean_3_0_0.wav"/></audio></td>
<td><audio controls="" ><source src="wavs/LJ043-0107_rn_3_0_0.wav"/></audio></td>
</tr>
</tbody>
</table>
<br>
<br>
<h2>Low PMI LibriSpeech</h2>
<br>
<h5>
"4122-157669-0057":"<span style="color:red">CHURL</span> UPON THY EYES I THROW ALL THE POWER THAT THIS CHARM DOTH OWE WHEN THOU WAKEST LET LOVE FORBID SLEEP HIS SEAT ON THY EYELID",
</h5>
<table>
<thead>
<tr>
<th>Synthesized Speech</th>
<th>Local reference (Ours)</th>
<th>Global reference (Truncation Trick)</th>
<th>Random reference</th>
</tr>
</thead>
<tbody>
<tr>
<td><audio controls="" ><source src="wavs/4122-157669-0057_ori_3_0_0.wav"/></audio></td>
<td><audio controls="" ><source src="wavs/4122-157669-0057_proposed_3_0_0.wav"/></audio></td>
<td><audio controls="" ><source src="wavs/4122-157669-0057_mean_3_0_0.wav"/></audio></td>
<td><audio controls="" ><source src="wavs/4122-157669-0057_rn_3_0_0.wav"/></audio></td>
</tr>
</tbody>
</table>
<h5>
"4179-25937-0046":"I AM NOT SO <span style="color:red">SURE</span> ABOUT SCHWARTZ I SAID THOUGHTFULLY",
</h5>
<table>
<thead>
<tr>
<th>Synthesized Speech</th>
<th>Local reference (Ours)</th>
<th>Global reference (Truncation Trick)</th>
<th>Random reference</th>
</tr>
</thead>
<tbody>
<tr>
<td><audio controls="" ><source src="wavs/4179-25937-0046_ori_3_0_0.wav"/></audio></td>
<td><audio controls="" ><source src="wavs/4179-25937-0046_proposed_3_0_0.wav"/></audio></td>
<td><audio controls="" ><source src="wavs/4179-25937-0046_mean_3_0_0.wav"/></audio></td>
<td><audio controls="" ><source src="wavs/4179-25937-0046_rn_3_0_0.wav"/></audio></td>
</tr>
</tbody>
</table>
<h5>
"8272-279789-0044":"SNOWDROP SHALL DIE SHE CRIED IF IT COSTS MY <span style="color:red">OWN</span> LIFE",
</h5>
<table>
<thead>
<tr>
<th>Synthesized Speech</th>
<th>Local reference (Ours)</th>
<th>Global reference (Truncation Trick)</th>
<th>Random reference</th>
</tr>
</thead>
<tbody>
<tr>
<td><audio controls="" ><source src="wavs/8272-279789-0044_ori_3_0_0.wav"/></audio></td>
<td><audio controls="" ><source src="wavs/8272-279789-0044_proposed_3_0_0.wav"/></audio></td>
<td><audio controls="" ><source src="wavs/8272-279789-0044_mean_3_0_0.wav"/></audio></td>
<td><audio controls="" ><source src="wavs/8272-279789-0044_rn_3_0_0.wav"/></audio></td>
</tr>
</tbody>
</table>
<h5>
"4042-12369-0039":"SOFT GOAT CHEESE TOME DE SAVOIE FRANCE SOFT PASTE GOAT OR COW OTHERS IN THE SAME CATEGORY <span style="color:red">ARE</span>",
</h5>
<table>
<thead>
<tr>
<th>Synthesized Speech</th>
<th>Local reference (Ours)</th>
<th>Global reference (Truncation Trick)</th>
<th>Random reference</th>
</tr>
</thead>
<tbody>
<tr>
<td><audio controls="" ><source src="wavs/4042-12369-0039_ori_3_0_0.wav"/></audio></td>
<td><audio controls="" ><source src="wavs/4042-12369-0039_proposed_3_0_0.wav"/></audio></td>
<td><audio controls="" ><source src="wavs/4042-12369-0039_mean_3_0_0.wav"/></audio></td>
<td><audio controls="" ><source src="wavs/4042-12369-0039_rn_3_0_0.wav"/></audio></td>
</tr>
</tbody>
</table>
<h5>
"5983-39668-0033":"YES COLBERT LITTLE COLBERT MAZARIN'S FACTOTUM THE SAME <span style="color:red">WELL</span>",
</h5>
<table>
<thead>
<tr>
<th>Synthesized Speech</th>
<th>Local reference (Ours)</th>
<th>Global reference (Truncation Trick)</th>
<th>Random reference</th>
</tr>
</thead>
<tbody>
<tr>
<td><audio controls="" ><source src="wavs/5983-39668-0033_ori_3_0_0.wav"/></audio></td>
<td><audio controls="" ><source src="wavs/5983-39668-0033_proposed_3_0_0.wav"/></audio></td>
<td><audio controls="" ><source src="wavs/5983-39668-0033_mean_3_0_0.wav"/></audio></td>
<td><audio controls="" ><source src="wavs/5983-39668-0033_rn_3_0_0.wav"/></audio></td>
</tr>
</tbody>
</table>
<br>
<br>
<h2>High PMI LibriSpeech</h2>
<br>
<h5>
"1968-145732-0016":"AND <span style="color:red">CARRIED</span> HIM HOME TO HIS HOUSE AND WAS EXCEEDINGLY KIND TO HIM HE GAVE HIM TO HIS WIFE",
</h5>
<table>
<thead>
<tr>
<th>Synthesized Speech</th>
<th>Local reference (Ours)</th>
<th>Global reference (Truncation Trick)</th>
<th>Random reference</th>
</tr>
</thead>
<tbody>
<tr>
<td><audio controls="" ><source src="wavs/1968-145732-0016_ori_3_0_0.wav"/></audio></td>
<td><audio controls="" ><source src="wavs/1968-145732-0016_proposed_3_0_0.wav"/></audio></td>
<td><audio controls="" ><source src="wavs/1968-145732-0016_mean_3_0_0.wav"/></audio></td>
<td><audio controls="" ><source src="wavs/1968-145732-0016_rn_3_0_0.wav"/></audio></td>
</tr>
</tbody>
</table>
<h5>
"4546-16781-0038":"REGARD FOR A PERSON IS THE MENTAL VIEW OR FEELING THAT SPRINGS FROM A SENSE OF HIS <span style="color:red">VALUE</span>",
</h5>
<table>
<thead>
<tr>
<th>Synthesized Speech</th>
<th>Local reference (Ours)</th>
<th>Global reference (Truncation Trick)</th>
<th>Random reference</th>
</tr>
</thead>
<tbody>
<tr>
<td><audio controls="" ><source src="wavs/4546-16781-0038_ori_3_0_0.wav"/></audio></td>
<td><audio controls="" ><source src="wavs/4546-16781-0038_proposed_3_0_0.wav"/></audio></td>
<td><audio controls="" ><source src="wavs/4546-16781-0038_mean_3_0_0.wav"/></audio></td>
<td><audio controls="" ><source src="wavs/4546-16781-0038_rn_3_0_0.wav"/></audio></td>
</tr>
</tbody>
</table>
<h5>
"844-133697-0063":"SHE WAS <span style="color:red">REASSURED</span> QUICKLY ENOUGH BY HER SENSE OF HIS GREAT GOOD MANNERS",
</h5>
<table>
<thead>
<tr>
<th>Synthesized Speech</th>
<th>Local reference (Ours)</th>
<th>Global reference (Truncation Trick)</th>
<th>Random reference</th>
</tr>
</thead>
<tbody>
<tr>
<td><audio controls="" ><source src="wavs/844-133697-0063_ori_3_0_0.wav"/></audio></td>
<td><audio controls="" ><source src="wavs/844-133697-0063_proposed_3_0_0.wav"/></audio></td>
<td><audio controls="" ><source src="wavs/844-133697-0063_mean_3_0_0.wav"/></audio></td>
<td><audio controls="" ><source src="wavs/844-133697-0063_rn_3_0_0.wav"/></audio></td>
</tr>
</tbody>
</table>
<h5>
"5183-29124-0006":"AS HE WAS POPULARLY CALLED FOR HE HAD BEEN A CLERGYMAN IN <span style="color:red">HIS DAY</span>",
</h5>
<table>
<thead>
<tr>
<th>Synthesized Speech</th>
<th>Local reference (Ours)</th>
<th>Global reference (Truncation Trick)</th>
<th>Random reference</th>
</tr>
</thead>
<tbody>
<tr>
<td><audio controls="" ><source src="wavs/5183-29124-0006_ori_3_0_0.wav"/></audio></td>
<td><audio controls="" ><source src="wavs/5183-29124-0006_proposed_3_0_0.wav"/></audio></td>
<td><audio controls="" ><source src="wavs/5183-29124-0006_mean_3_0_0.wav"/></audio></td>
<td><audio controls="" ><source src="wavs/5183-29124-0006_rn_3_0_0.wav"/></audio></td>
</tr>
</tbody>
</table>
<h5>
"7307-91998-0017":"HAD HE COME BECAUSE HE HAD HEARD OF THE BETROTHALS HE ADMITTED THAT IT WAS <span style="color:red">SO</span>",
</h5>
<table>
<thead>
<tr>
<th>Synthesized Speech</th>
<th>Local reference (Ours)</th>
<th>Global reference (Truncation Trick)</th>
<th>Random reference</th>
</tr>
</thead>
<tbody>
<tr>
<td><audio controls="" ><source src="wavs/7307-91998-0017_ori_3_0_0.wav"/></audio></td>
<td><audio controls="" ><source src="wavs/7307-91998-0017_proposed_3_0_0.wav"/></audio></td>
<td><audio controls="" ><source src="wavs/7307-91998-0017_mean_3_0_0.wav"/></audio></td>
<td><audio controls="" ><source src="wavs/7307-91998-0017_rn_3_0_0.wav"/></audio></td>
</tr>
</tbody>
</table>
<br>
<br>
</div>
</body>
</html>