- GloVe/CoVe representations of words in the passage, the question, and the answer (a couple of examples)
- Word-split-based EDA (stats)
- BERT-tokenizer-based EDA (stats)
- WordClouds
- Build a basic Q/A system
- GA Reader
- Literature reviews
## Data

### Statistics
#### Word/Char Python Split-Based

|                             | Task_1_train | Task_1_dev | Task_2_train | Task_2_dev |
|-----------------------------|--------------|------------|--------------|------------|
| Samples                     | 3227         | 837        | 3318         | 851        |
| Duplicated Examples         | 0            | 0          | 0            | 0          |
| Duplicated Articles         | 57           | 4          | 55           | 2          |
| Mean Char Count (Articles)  | 1548.47      | 1582.79    | 2448.93      | 2501.41    |
| Std Char Count (Articles)   | 905.301      | 992.333    | 1788.42      | 1809.55    |
| Max Char Count (Articles)   | 10177        | 9563       | 10370        | 9628       |
| Min Char Count (Articles)   | 162          | 255        | 169          | 247        |
| Mean Word Count (Articles)  | 262.44       | 268.608    | 417.727      | 427.048    |
| Std Word Count (Articles)   | 154.534      | 171.253    | 307.248      | 310.994    |
| Max Word Count (Articles)   | 1754         | 1641       | 1700         | 1651       |
| Min Word Count (Articles)   | 31           | 41         | 31           | 45         |
| Duplicated Questions        | 0            | 0          | 1            | 0          |
| Mean Char Count (Questions) | 137.826      | 137.455    | 148.234      | 149.385    |
| Std Char Count (Questions)  | 31.1171      | 30.6809    | 47.7494      | 46.9035    |
| Max Char Count (Questions)  | 389          | 387        | 443          | 413        |
| Min Char Count (Questions)  | 32           | 48         | 31           | 37         |
| Mean Word Count (Questions) | 24.6771      | 24.4695    | 26.8852      | 27.2127    |
| Std Word Count (Questions)  | 6.18453      | 6.04612    | 9.1822       | 9.16159    |
| Max Word Count (Questions)  | 73           | 73         | 83           | 83         |
| Min Word Count (Questions)  | 6            | 9          | 7            | 6          |
#### BertTokenizer Based

|                              | Task_1_train | Task_1_dev | Task_2_train | Task_2_dev |
|------------------------------|--------------|------------|--------------|------------|
| Mean Token Count (Articles)  | 332.603      | 341.452    | 528.58       | 539.303    |
| Std Token Count (Articles)   | 193.729      | 217.673    | 387.756      | 396.286    |
| Max Token Count (Articles)   | 2206         | 2043       | 2188         | 2272       |
| Min Token Count (Articles)   | 38           | 52         | 46           | 52         |
| Mean Token Count (Questions) | 28.6319      | 28.3441    | 30.9629      | 31.1657    |
| Std Token Count (Questions)  | 6.83765      | 6.75112    | 10.0225      | 9.98314    |
| Max Token Count (Questions)  | 84           | 84         | 91           | 89         |
| Min Token Count (Questions)  | 9            | 11         | 9            | 8          |
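For reference, a minimal sketch of how these token counts can be reproduced with the HuggingFace `transformers` BertTokenizer; the `bert-base-uncased` checkpoint is an assumption, since the table does not record which vocabulary was used:

```python
# Sketch of the BertTokenizer-based token counting; the bert-base-uncased
# checkpoint is an assumption, and [CLS]/[SEP] are not counted here.
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def token_count(text):
    # tokenizer.tokenize() returns WordPiece tokens without special tokens
    return len(tokenizer.tokenize(text))

print(token_count("The quick brown fox jumps over the lazy dog."))
```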
**Note:**
- Word counts are based on Python's `str.split()`, so all words (including stopwords) are counted.
- torch/nltk tokenizers could also be used to produce tokens and check their lengths.
- Character counts include spaces.
- No preprocessing was applied for either set of statistics.
- For questions, `@placeholder` is not removed. (A minimal sketch of the computation follows this list.)
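A minimal sketch of how these split-based statistics can be computed; the toy inputs and the sample-vs-population choice for the standard deviation are assumptions, not the original EDA script:

```python
# Minimal sketch of the split-based stats above; not the exact EDA script.
import statistics

def split_stats(texts):
    """Per-text char/word counts: str.split() keeps stopwords, len() keeps spaces."""
    chars = [len(t) for t in texts]          # spaces are counted
    words = [len(t.split()) for t in texts]  # stopwords are counted
    return {
        "mean_chars": statistics.mean(chars),
        "std_chars": statistics.stdev(chars),  # sample std (ddof=1) assumed
        "max_chars": max(chars),
        "min_chars": min(chars),
        "mean_words": statistics.mean(words),
        "std_words": statistics.stdev(words),
        "max_words": max(words),
        "min_words": min(words),
    }

# Toy inputs; in the real EDA these would be the Task_1/Task_2 articles or
# questions (with @placeholder left in place).
articles = [
    "The quick brown fox jumps over the lazy dog.",
    "A second, slightly longer example passage used only for illustration.",
]
print(split_stats(articles))
```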
### Most Frequent Words
**Note:** Tokens were produced with the NLTK tokenizer; only alphanumeric characters were kept, and stop words were removed (see the sketch below).
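A minimal sketch of that pipeline, assuming NLTK's `word_tokenize` and its default English stop-word list (the exact settings used for the tables are not recorded):

```python
# Sketch of the most-frequent-words pipeline: lowercase, strip
# non-alphanumerics, tokenize with NLTK, drop stop words, then count.
import re
from collections import Counter

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))

def most_frequent(texts, top_k=20):
    tokens = []
    for text in texts:
        cleaned = re.sub(r"[^a-z0-9 ]", " ", text.lower())  # alphanumerics only
        tokens.extend(t for t in word_tokenize(cleaned) if t not in STOP_WORDS)
    total = len(tokens)  # "Total Count" in the tables below
    return [(word, count, count / total)
            for word, count in Counter(tokens).most_common(top_k)]

print(most_frequent(["The police said the man told the BBC he would go."], top_k=5))
```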
#### Articles
|                    | Task_1_train | Task_1_dev | Task_2_train | Task_2_dev |
|--------------------|--------------|------------|--------------|------------|
| Unique Token Count | 41087        | 20447      | 53285        | 25779      |
| Total Count        | 482713       | 128258     | 775342       | 203877     |
| # | Task_1_train | Count | Frequency | Task_1_dev | Count | Frequency | Task_2_train | Count | Frequency | Task_2_dev | Count | Frequency |
|---|--------------|-------|-----------|------------|-------|-----------|--------------|-------|-----------|------------|-------|-----------|
| 0 | said | 7978 | 0.0165274 | said | 1940 | 0.0151258 | said | 9594 | 0.0123739 | said | 2524 | 0.01238 |
| 1 | mr | 2465 | 0.00510655 | mr | 701 | 0.00546555 | mr | 3549 | 0.00457733 | mr | 993 | 0.00487058 |
| 2 | would | 2194 | 0.00454514 | would | 593 | 0.00462349 | would | 3539 | 0.00456444 | would | 954 | 0.00467929 |
| 3 | also | 2019 | 0.00418261 | also | 506 | 0.00394517 | people | 3335 | 0.00430133 | people | 872 | 0.00427709 |
| 4 | people | 1897 | 0.00392987 | one | 486 | 0.00378924 | one | 3182 | 0.004104 | one | 814 | 0.0039926 |
| 5 | one | 1705 | 0.00353212 | people | 481 | 0.00375025 | also | 2784 | 0.00359067 | also | 675 | 0.00331082 |
| 6 | last | 1461 | 0.00302664 | first | 395 | 0.00307973 | says | 2201 | 0.00283875 | two | 562 | 0.00275656 |
| 7 | two | 1349 | 0.00279462 | last | 391 | 0.00304854 | two | 2112 | 0.00272396 | says | 547 | 0.00268299 |
| 8 | first | 1344 | 0.00278426 | new | 372 | 0.0029004 | years | 2084 | 0.00268785 | new | 545 | 0.00267318 |
| 9 | new | 1341 | 0.00277805 | told | 349 | 0.00272108 | new | 2081 | 0.00268398 | time | 532 | 0.00260942 |
| 10 | year | 1262 | 0.00261439 | two | 346 | 0.00269769 | time | 2064 | 0.00266205 | us | 531 | 0.00260451 |
| 11 | years | 1248 | 0.00258539 | us | 342 | 0.0026665 | could | 2002 | 0.00258209 | could | 517 | 0.00253584 |
| 12 | could | 1222 | 0.00253152 | years | 340 | 0.00265091 | last | 1957 | 0.00252405 | years | 516 | 0.00253094 |
| 13 | told | 1203 | 0.00249216 | year | 329 | 0.00256514 | first | 1944 | 0.00250728 | first | 509 | 0.0024966 |
| 14 | us | 1187 | 0.00245902 | time | 307 | 0.00239361 | us | 1820 | 0.00234735 | told | 493 | 0.00241812 |
| 15 | time | 1159 | 0.00240101 | could | 291 | 0.00226886 | year | 1768 | 0.00228028 | year | 462 | 0.00226607 |
| 16 | government | 1036 | 0.0021462 | bbc | 256 | 0.00199598 | like | 1727 | 0.0022274 | last | 455 | 0.00223174 |
| 17 | made | 940 | 0.00194733 | added | 246 | 0.00191801 | told | 1659 | 0.0021397 | bbc | 418 | 0.00205026 |
| 18 | bbc | 894 | 0.00185203 | made | 236 | 0.00184004 | police | 1504 | 0.00193979 | like | 408 | 0.00200121 |
| 19 | police | 892 | 0.00184789 | get | 227 | 0.00176987 | get | 1476 | 0.00190368 | government | 404 | 0.00198159 |
#### Questions
|                    | Task_1_train | Task_1_dev | Task_2_train | Task_2_dev |
|--------------------|--------------|------------|--------------|------------|
| Unique Token Count | 10032        | 4602       | 11098        | 4966       |
| Total Count        | 43683        | 11289      | 47297        | 12255      |
| # | Task_1_train | Count | Frequency | Task_1_dev | Count | Frequency | Task_2_train | Count | Frequency | Task_2_dev | Count | Frequency |
|---|--------------|-------|-----------|------------|-------|-----------|--------------|-------|-----------|------------|-------|-----------|
| 0 | placeholder | 3227 | 0.0738731 | placeholder | 837 | 0.074143 | placeholder | 3318 | 0.0701524 | placeholder | 851 | 0.069441 |
| 1 | said | 268 | 0.00613511 | new | 69 | 0.00611214 | new | 229 | 0.00484174 | new | 71 | 0.00579355 |
| 2 | year | 230 | 0.00526521 | says | 62 | 0.00549207 | year | 218 | 0.00460917 | two | 60 | 0.00489596 |
| 3 | new | 222 | 0.00508207 | year | 56 | 0.00496058 | two | 209 | 0.00441888 | says | 57 | 0.00465116 |
| 4 | says | 197 | 0.00450976 | us | 48 | 0.00425193 | one | 201 | 0.00424974 | year | 54 | 0.00440636 |
| 5 | two | 172 | 0.00393746 | said | 46 | 0.00407476 | said | 200 | 0.0042286 | said | 54 | 0.00440636 |
| 6 | first | 159 | 0.00363986 | first | 43 | 0.00380902 | man | 191 | 0.00403831 | us | 49 | 0.00399837 |
| 7 | world | 158 | 0.00361697 | two | 43 | 0.00380902 | police | 185 | 0.00391145 | people | 48 | 0.00391677 |
| 8 | one | 148 | 0.00338805 | world | 38 | 0.00336611 | people | 177 | 0.00374231 | one | 47 | 0.00383517 |
| 9 | man | 143 | 0.00327358 | uk | 35 | 0.00310036 | first | 174 | 0.00367888 | man | 46 | 0.00375357 |
| 10 | people | 141 | 0.0032278 | one | 34 | 0.00301178 | world | 168 | 0.00355202 | first | 44 | 0.00359037 |
| 11 | us | 129 | 0.00295309 | man | 32 | 0.00283462 | years | 154 | 0.00325602 | years | 37 | 0.00301918 |
| 12 | police | 127 | 0.00290731 | people | 32 | 0.00283462 | us | 147 | 0.00310802 | police | 36 | 0.00293758 |
| 13 | uk | 127 | 0.00290731 | police | 31 | 0.00274604 | uk | 133 | 0.00281202 | uk | 32 | 0.00261118 |
| 14 | wales | 118 | 0.00270128 | former | 31 | 0.00274604 | says | 130 | 0.00274859 | world | 31 | 0.00252958 |
| 15 | years | 115 | 0.0026326 | england | 31 | 0.00274604 | three | 113 | 0.00238916 | could | 31 | 0.00252958 |
| 16 | former | 112 | 0.00256393 | wales | 29 | 0.00256887 | england | 111 | 0.00234687 | city | 28 | 0.00228478 |
| 17 | government | 110 | 0.00251814 | league | 29 | 0.00256887 | old | 105 | 0.00222001 | found | 27 | 0.00220318 |
| 18 | could | 109 | 0.00249525 | years | 28 | 0.00248029 | could | 102 | 0.00215658 | may | 27 | 0.00220318 |
| 19 | city | 106 | 0.00242657 | three | 25 | 0.00221455 | last | 100 | 0.0021143 | bbc | 27 | 0.00220318 |
**Note:**
These splits are BERT-tokenizer based.
Process: take the passages, convert them to lowercase, remove all non-alphanumeric characters, tokenize, then remove stop words (sketched below).
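A minimal sketch of that process, again assuming the `bert-base-uncased` vocabulary and NLTK's English stop-word list (the exact tooling is not recorded):

```python
# Sketch of the described process: lowercase -> strip non-alphanumerics
# -> BERT-tokenize -> drop stop words. Checkpoint and stop list assumed.
import re

import nltk
from nltk.corpus import stopwords
from transformers import BertTokenizer

nltk.download("stopwords", quiet=True)
STOP_WORDS = set(stopwords.words("english"))
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")

def bert_tokens(passage):
    cleaned = re.sub(r"[^a-z0-9 ]", " ", passage.lower())  # lowercase, alphanumerics only
    return [t for t in tokenizer.tokenize(cleaned) if t not in STOP_WORDS]

print(bert_tokens("The police said the man told the BBC he would go."))
```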