-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathbasic_file_stats.Rmd
638 lines (443 loc) · 29.4 KB
/
basic_file_stats.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
---
title: "Clone detection stuff"
output:
html_document:
df_print: kable
toc: yes
toc_depth: 5
toc_float:
collapsed: no
html_notebook: default
pdf_document: default
word_document: default
---
# Setup
- 2-3 weeks to capture
-
- stats in terms of characters (in files)
-
- story, # of clones not enough
- samples files no clones, etc. etc.
+ quality analysis
- different clone rates for different projects
- clustering high cloned projects
+ more data about project
## AWS Instance
To connect to the AWS instance, you need to have the identity file (`amkey1.pem`) installed. You can then connect to the instance using the following command:
```
ssh -i ./ssh/amkey1.pem.txt [email protected]
```
The following command copies remote files on aws to local directory:
```
scp -i amkey1.pem [email protected]:/path_to_source_file .
```
Our dataset lives in `mnt/data1/data70k` and `mnt/data2/data70k2`. The tokenizer and sourcererCC are already installed in `/mnt/data2/peta`, however, it would be better for you to start from scratch, getting both the tokenizer and the sourcerer from github:
```
git clone https://github.com/reactorlabs/js-tokenizer.git
git clone https://github.com/Mondego/SourcererCC.git
```
## Setting up MySQL database
### MySQL on AWS
> Jakub, can you provided details on how is SQL installed on AWS?
### Setting up MySQL locally
Download MySQL or MariaDB using the default Linux (I can't speak for mac) packages. It will prompt you for a root password - and you can happily use that one for the script below if you ignore security - I do:-)
After it is installed, check that it has decent memory and that it can be populated from local files without restrictions, e.g. open file `/etc/mysql/mysql.conf.d/mysqld.conf` and append:
```
secure_file_priv=""
innodb_buffer_pool_size=10G # the more the better
innodb_log_file_size=512M
query_cache_size=0
```
> I have copied those from the internet and claim *no* real understanding of what they do - all I know the first one allows data loading, the second one sets memory.
Once you have updated the config file, run:
```
sudo service mysql restart
```
# Obtaining the data
## Tokenization
At this point the tokenizer is not entirely user friendly. There are few important things that you should setup before running it and you have to change these in the source code. Open `main.cpp` file and change number of threads for different worker types. So far the best setup I found is having 1 writer, and `N` mergers, crawlers and tokenizers, where `N` is the number of cores on the machine. Also make sure to check the values of worker queues otherwise you may run out of memory.
> See details of how tokenier works in one of the appendices.
To build the tokenizr, go to its folder, and execute the following:
```
mkdir build
cd build
cmake ..
make
```
Once the tokenizer has been built, you can ran it on your data. To tokenize, the first argument must be `-t`, followed by a directory to which all outputs will be dumped and then at least one, but possibly more root directories which will be scanned for git projects. So, if we use the AWS as an example and we want our output to be in `/mnt/data2/tokenizer_output`, run the following:
```
js-tokenizer -t /mnt/data2/tokenizer_output /mnt/data1/data70k /mnt/data2/data70k2
```
> Note that this may take hours, the tokenizer will report each second about its progress> When it is done, the final report might look like this: (this is actually output for the `processed_061116` dataset, see the appendix for more information about what the output actually does)
```
Crawler active 0 queue 0 done 336170 errors 0
Tokenizer active 0 queue 0 done 72898 errors 24
Merger active 0 queue 0 done 18940165 errors 0
Writer active 0 queue 0 done 18940371 errors 0
Files 18940371 tokenizer18940371 merger18940371 writer
Bytes 366805.24 tokenizer366805.24 merger366805.24 writer [MB]
Throughput 12.07 tokenizer 12.07 merger 12.07 writer [MB/s]
Unique tokens 38431688
Empty files 71819 (0.379185%)
Detected clones 17398449 (91.8591%)
```
If you want to copy this to your local folder, make sure to compress the whole output directory, such as:
```
tar -zcvf tokenizer_output.tar.gz /mnt/data2/tokenizer_output
```
## Validating the tokenizer results
> TODO
## Clone detection
Inside sourcererCC's directory go to `clone-detector` folder. Make sure the directories `input/dataset` and `input/query` exist there. Copy the `files_tokens/files-tokens-0.txt` file from the tokenizer's output to `input/dataset` and rename it to `blocks.file`:
```
cd clone-detector
mkdir input/dataset
mkdir input/query
cp /mnt/data2/tokenizer_output/files_tokens/files-tokens-0.txt input/dataset/blocks.file
```
Then we must decide on sharding. Sourcerer provides a python script to help with that, to use it, run the following from the `clone-detector` directory:
```
python ../scripts-data-analysis/pre-CC/step4/find-sharding-intervals.py input/dataset/blocks.file
```
This will find the best intervals and give you a setting for the `SHARD_MAX_NUM_TOKENS` in `sourcerer-cc.properties` file in the `clone-detector` directory. Update it, and also make sure to change the `MAX_TOKENS` variable in the same file.
> TODO I am not convinced the shard calculator works (they seem to confuse unique & total tokens, etc). Also I would rather have it produce specific number of intervals. We can easily recreate it in R.
Now we are ready to run the clone detector. The initialization and indexing do not benefit from parallelization, so execute the following:
```
./execute.sh 1
./runnodes.sh init 1
./runnodes.sh index 1
./mergeindexes.sh
```
> If you get an error about Java versin in the second command, open `runnodes.sh` and add `-Dbuild.compiler=javac1.7` to the ant commandline (this happens on the AWS). If you get an error that Java was not found, updated the `JAVA_HOME` variable: `export JAVA_HOME="/opt/jdk"`
Once indexing is done, we can start clone search phase using as many nodes as recommended by the shard analysis above, so for 3 nodes, we must run:
```
./execute.sh 3
./runnodes.sh search 3
```
> TODO what to do next, fill in when I get here
# Running this notebook
This notebook and its scripts provide all functionality required to run the analysis of Javascript tokenization and clone detection. It assumes that you have already installed all the required software and executed the tokenizer and clone detector, as described in previous chapters.
When done, please read this section carefully as it guides you through the necessary setup you must perform in order to be able to run the code below. All of the configucation variables can either be specified here in the notebook and saved before the execution, or you may create local file named `local.r` and set the variables in it. At the end of the setup section, the script will automatically check for its existence and source it if found, overriding anything that is set in the notebook itself.
In order to run code in this notebook, we must load the required R scripts for database access and data manipulation.
```{r}
source("clone_detection.r")
```
## Configuration
We must then set up the database connection, please make sure that these values correspond to what you need:
```{r}
MYSQL_HOST = "localhost"
# this user must be powerful enough to create databases, or the database must already exist
MYSQL_USER = "sourcerer"
MYSQL_PASSWORD = "js"
```
For the data loading and preprocessing, you may specify the override flag. During this phase the script checks that the data in the database correspond to the data on the disk by comparing the timestamps. If this situation is detected and `OVERRIDE` is set to `F` an error is printed, but the data in the database will not be touched. If it is `T`, then no error is printed, but the data is reloaded. And finally, if the flag is set to `"force"`, the data will be reloaded regardless of the timestamp. Setting the flag here has global validity, but you can also pass it to the loading and preprocessing functions to do a per-function update.
We also set the `DT_LIMIT` which the script uses to limit the sizes of the data tables when presented
```{r}
OVERWRITE = T
DT_LIMIT = 20
```
Next we have to prepare the database and tables and load the data produced by the tokenizer and sourcerer. For this, we must specify the root folder where the tokenizer output files are stored, and where the sourcerer CC's output for this particular set is stored. We must also name the database which will be created for the data. By default, the name of the output dir is used, but you may change this.
> By default, we assume you have copied sourcerer CC's output into the sourcerer directory of the tokenizer output, but you may change this if you want.
```{r}
TOKENIZER_OUTPUT_DIR = "tokenizer/output/files"
SOURCERER_OUTPUT_DIR = paste(TOKENIZER_OUTPUT_DIR, "sourcerer", sep = "/")
MYSQL_DB_NAME = tail(strsplit(TOKENIZER_OUTPUT_DIR, "/")[[1]], n = 1)
```
Finally, let's check if we have local configuration and use that one instead of the one supplied with the notebook for easier management:
```{r}
if (file.exists("local.r")) {
println("Using local cofiguration instead:")
source("local.r")
println(" MYSQL_HOST: ", MYSQL_HOST)
println(" MYSQL_DB_NAME: ", MYSQL_DB_NAME)
println(" MYSQL_USER: ", MYSQL_USER)
#println(" MYSQL_PASSWORD: ", MYSQL_PASSWORD) #-- we do not want the password to be in the document
println(" TOKENIZER_OUTPUT_DIR: ", TOKENIZER_OUTPUT_DIR)
println(" SOURCERER_OUTPUT_DIR: ", SOURCERER_OUTPUT_DIR)
println(" OVERWRITE: ", OVERWRITE)
println(" DT_LIMIT: ", DT_LIMIT)
}
println("Using database ", MYSQL_DB_NAME)
```
### Running this notebook
There are two options how you can run this notebook. You can use *RStudio*, which is better for interactivity and playing with the data, or you can use R's command line tools to generate the HTML w/o the need of RStudio and gui. In both cases the renderer runs all R code in this document and interleaves its results with the text.
#### Running from RStudio
Download the latest RStudio as specified in [R notebooks](http://rmarkdown.rstudio.com/r_notebooks.html) in RStudio, so make sure to follow their advice. Once installed, you may open the `Rmd` file and then either hit `CTRL-ALT-R` which makes RStudio rerun all the code segments in the IDE itself. Alternatively you may execute current code segment by pressing `CTRL+Enter`. If you want to produce the clean html document instead, click on the *Knit HTML* (`CTRL+SHIFT+K`) command.
#### Running from console
You can also create the HTML document directly from R, using the `rmarkdown::render` function. Before doing so, the *pandoc* must be [setup properly](https://github.com/rstudio/rmarkdown/blob/master/PANDOC.md).
> TODO I have not actually tested it as on my machine
## Preprocessing
Let's connect to the database server, create the database, load the data from tokenizer and sourcerer and preprocess it. These steps are only pefrormed if they are deemed necessary. The script uses timestamp data for the respective databases and reloads, or recreates the data only if necessary.
```{r}
# connect to the database and create the table if it does not exist
sql.connect(MYSQL_USER, MYSQL_PASSWORD, MYSQL_DB_NAME)
# load tokenizer data into the database
loadTokenizerData(TOKENIZER_OUTPUT_DIR)
# now load sourcererCC's data
#loadSourcererData(SOURCERER_OUTPUT_DIR)
# calculate non-empty files from the full statistics
calculateNonEmptyFiles()
# calculate clone groups as reported by the tokenizer
calculateTokenizerCloneGroups("tokenizer_clones", "tokenizer")
# calculate clone groups as reported by the sourcerer
#calculateCloneGroups("sourcerer_clones", "sourcerer")
# calculate project statistics
calculateProjectStats()
# calculate non-identical files from tokenizer's perspective
calculateOriginalFiles("tokenizer_originals", "files_full_stats", "tokenizer")
# calculate cummulative statistics for projects using only non-identical files (as tagged by tokenizer)
calculateProjectOriginals("tokenizer_project_stats", "tokenizer_originals")
# calculate non-identical files from tokenizer's perspective, oldest file from each group is not treated as original
calculateOriginalFiles("tokenizer_originals_clones", "files_full_stats", "tokenizer", include.oldest = F)
# calclate non-identical files from tokenizer's perspective, oldest files from group are treated as identical to the rest of the group
calculateProjectOriginals("tokenizer_clones_stats", "tokenizer_originals_clones")
# calculate non-identical files from sourcerer's perspective
#calculateOriginalFiles("sourcerer_originals", "files_stats", "sourcerer")
# calculate cummulative statistics for projects using only non-similar files (as tagged by sourcerer)
#calculateProjectOriginals("sourcerer_project_stats", "sourcerer_originals")
# calculate non-identical files from sourcerer's perspective, oldest file from each group is not treated as original
#calculateOriginalFiles("sourcerer_originals_clones", "files_full_stats", "sourcerer", include.oldest = F)
# calclate non-identical files from sourcerer's perspective, oldest files from group are treated as identical to the rest of the group
#calculateProjectOriginals("sourcerer_clones_stats", "sourcerer_originals_clones")
println("Total database size ", sum(sql.query("SHOW TABLE STATUS")$Data_length) / 1024 / 1024, " [Mb]")
```
# Results
We are now going to look at various simple statistics we can obtain from the tokenizer output. Where applicable, both summaries and histograms of the specified variables are shown. Note that most of the histogram scales are logarithmic.
## Summary
```{r}
files = list()
files$total = sql.tableStatus("files_full_stats")$length
files$unique = sql.tableStatus("files_stats")$length
# empty files are not part of the sourcererCC's input and must therefore be excluded
files$empty = sql.query("SELECT COUNT(*) FROM files_full_stats WHERE totalTokens=0")[[1]]
# files that are reported as not fully understood by the tokenizer
files$error = sql.query("SELECT COUNT(*) FROM files_full_stats WHERE errors>0")[[1]]
files$distinctFileHash = sql.query("SELECT COUNT(DISTINCT filehash) FROM files_full_stats")[[1]]
files$distinctTokenHash = sql.query("SELECT COUNT(DISTINCT tokensHash) FROM files_full_stats")[[1]]
files$nonCloned = files$unique - sql.tableStatus("sourcerer_clone_info")$length
projects = list()
projects$total = sql.tableStatus("bookkeeping_proj")$length
projects$unique = sql.query("SELECT COUNT(DISTINCT projectId) FROM tokenizer_clones_stats WHERE files > 0")[[1]]
projects$nonCloned = sql.query("SELECT COUNT(DISTINCT projectId) FROM sourcerer_clones_stats WHERE files > 0")[[1]]
projects$uniqueAndOriginals = sql.query("SELECT COUNT(DISTINCT projectId) FROM tokenizer_project_stats WHERE files > 0")[[1]]
projects$nonClonedAndOriginals = sql.query("SELECT COUNT(DISTINCT projectId) FROM sourcerer_project_stats WHERE files > 0")[[1]]
println("Total projects: ", projects$total)
println("Files processed: ", files$total)
println("Empty files: ", files$empty, " ", (files$empty / files$total) * 100, "%")
println("Error files: ", files$error, " ", (files$error / files$total) * 100, "%")
println("Distinct file hashes: ", files$distinctFileHash, " ", (files$distinctFileHash / files$total ) * 100, "%")
println("Distinct token hashes: ", files$distinctTokenHash, " ", (files$distinctTokenHash / files$total) * 100, "%")
println("Non-cloned files: ", files$nonCloned, " ", (files$nonCloned / files$total) * 100, "%")
println("Projects with unique files: ", projects$unique, " ", (projects$unique / projects$total) * 100, "%")
println("Projects non-cloned files: ", projects$nonCloned, " ", (projects$nonCloned / projects$total) * 100, "%")
println("Projects with unique or original files: ", projects$uniqueAndOriginals, " ", (projects$uniqueAndOriginals / projects$total) * 100, "%")
println("Projects non-cloned or original files: ", projects$nonClonedAndOriginals, " ", (projects$nonClonedAndOriginals / projects$total) * 100, "%")
```
> If the ratio of error files is high, then the tokenizer cannot be trusted to produce meaningful results, but small number of error files is totally ok - some of the javascript files are test cases that should error.
### Basic information about files
```{r}
# file size in bytes
x <- sql.query("SELECT bytes FROM files_full_stats")$bytes
summary(x)
logHist(x, main = "File size [B]", xlab = "Size [B]", ylab = "# of files")
# file size in lines of code (including empty and comments)
x <- sql.query("SELECT loc FROM files_full_stats")$loc
summary(x)
logHist(x, main = "File size [LOC]", xlab = "Lines", ylab = "# of files")
# # of total tokens in files
x <- sql.query("SELECT totalTokens FROM files_full_stats")$totalTokens
summary(x)
logHist(x, main = "Tokens in files", xlab = "# of tokens", ylab = "# of files")
# # of unique tokens in files (excluding repetitions)
x <- sql.query("SELECT uniqueTokens FROM files_full_stats")$uniqueTokens
summary(x)
logHist(x, main = "Unique Tokens per File", xlab = "# of tokens", ylab = "# of files")
```
> We believe the peak at 1 LOC is due to min.js files which are minified version of javascript with no extra whitespace, and therefore no line breaks so the entire file is a single line.
### Composition of files
```{r}
# % of comments in the files
x <- sql.query("SELECT (commentBytes/bytes) AS result FROM non_empty_files AS ne, files_full_stats AS fs WHERE ne.id = fs.id")$result
summary(x)
normalHist(x, main = "% of comment bytes in files", xlab = "% of bytes in comments")
# % of whitespace in the files
x <- sql.query("SELECT (whitespaceBytes/bytes) AS result FROM non_empty_files AS ne, files_full_stats AS fs WHERE ne.id = fs.id")$result
summary(x)
normalHist(x, main = "% of whitespace bytes in files", xlab = "% of bytes in whutespace")
# % of separators in the files
x <- sql.query("SELECT (separatorBytes/bytes) AS result FROM non_empty_files AS ne, files_full_stats AS fs WHERE ne.id = fs.id")$result
summary(x)
normalHist(x, main = "% of separator bytes in files", xlab = "% of bytes in separators")
# % of tokens in the files
x <- sql.query("SELECT (tokenBytes/bytes) AS result FROM non_empty_files AS ne, files_full_stats AS fs WHERE ne.id = fs.id")$result
summary(x)
normalHist(x, main = "% of token bytes in files", xlab = "% of bytes in tokens")
```
> Again the large number of files with 0 whitespace is due to minified javascript files. We will confirm this later by rerunning the tokenizer and excluding these.
### Basic Information about projects
x <- sql.query("SELECT ", field, " AS field FROM ", table)
x$field
```{r}
x <- sql.query("SELECT files FROM project_stats")$files
summary(x)
logHist(x, main = "Files per project" , xlab = "Size [B]", ylab = "# of projects")
x <- sql.query("SELECT bytes FROM project_stats")$bytes
summary(x)
logHist(x, main = "Sum of size of project's JS files [B]", xlab = "Size [B]", ylab = "# of projects")
x <- sql.query("SELECT loc FROM project_stats")$loc
summary(x)
logHist(x, main = "Sum of LOC of project JS files", xlab = "Lines", ylab = "# of projects")
```
Some of the projects are surprisingly large. Let us examine the largest projects out there:
```{r}
x <- sql.query("SELECT url, bytes, files FROM project_stats, bookkeeping_proj WHERE id=projectId ORDER BY bytes DESC LIMIT ", DT_LIMIT)
x$url <- sapply(x$url, unescape)
x
```
### Information about tokens
```{r}
# token's size
x <- sql.query("SELECT size FROM tokens")$size
summary(x)
# histogram over the entire range
logHist(x, main = "Token Size", xlab = "size [b]", ylab = "# of tokens")
# histogram of token sizes w/o extremes
p <- quantile(x, .98)
logHist(x[ x < p ], main = "Token Size (no extremes)", xlab = "size [b]", ylab = "# of tokens")
```
Let's now examine the longest tokens to see how they look like. The table shows the beginnings of the longest tokens, their size and # of occurences in the corpus.
```{r}
x <- sql.query("SELECT LEFT(text, 200) as text, size, count FROM tokens ORDER BY size DESC LIMIT ", DT_LIMIT)
x$text <- sapply(x$text, unescape50)
x
```
Let's also look at token frequencies:
```{r}
x <- sql.query("SELECT count FROM tokens")$count
summary(x)
logHist(x, main = "# of token uses", xlab = " # token uses", ylab = "# of tokens", base = 100)
```
> I have left out the detailed exploration of this graph because it was no longer adequate. Also there were no real findings from it at this point.
Let's look at the most frequently used tokens:
```{r}
x <- sql.query("SELECT LEFT(text, 200) as text, count FROM tokens ORDER BY count DESC LIMIT ", DT_LIMIT)
x$text <- sapply(x$text, unescape50)
x
```
## Data analysis of clones
Clones are classified into _clone groups_. A clone group is set of all files that are identical , or similar (if reported by sourcererCC). Each group has an `id`, which is just an `id` of any of file that belong to it (it does not matter which one). For each group we also calculate # of files and the id of the oldest file in the group by first commit date, which we assume to be the original.
> Note that while this idea will likely work well on the entire dataset, it is almost useless for the 70k dataset as it constitutes only about 3% of the entire dataset.
Also the clone groups for tokenizer and clone groups for sourcerer are not yet unified together so originality resulst for sourcerer are even less meaningful.
### Clones as reported by the tokenizer
```{r}
ctype = "tokenizer"
table.originals = paste(ctype, "_originals", sep = "")
table.clone_groups = paste(ctype, "_clone_groups", sep = "")
table.clone_info = paste(ctype, "_clone_info", sep = "")
table.project_stats = paste(ctype, "_project_stats", sep = "")
```
```{r}
# basic info about number of files, clone groups, etc.
displayCloneGroupsInfo(table.clone_groups, table.clone_info, "files_full_stats")
# files per clone group statistic
displayFilesPerCloneGroup(table.clone_groups)
# projects per clone group statistic
displayProjectsPerCloneGroup(table.clone_groups)
# Links to mostly cloned files, ordered by absolute number of copies
mostClonedFiles(table.clone_groups, countBy = "files")
# Links to mostly cloned files, ordered by # of projects which contain them
mostClonedFiles(table.clone_groups, countBy = "projects")
```
```{r}
originalContentPerProject(table.project_stats, "files")
originalContentPerProject(table.project_stats, "bytes")
```
Let's now look at the most original projects, ordered first by the # of original files in them (the oldest file from each group is assumed to be original as well), and then by the number of files they contain.
```{r}
mostOriginalProjects(table.project_stats, "files", DT_LIMIT)
```
Similarly, we can observe the least original projects:
```{r}
leastOriginalProjects(table.project_stats, "files", DT_LIMIT)
```
### Clones as reported by sourcerer
> These files are 70% or more compatible.
```{r}
ctype = "sourcerer"
table.originals = paste(ctype, "_originals", sep = "")
table.clone_groups = paste(ctype, "_clone_groups", sep = "")
table.clone_info = paste(ctype, "_clone_info", sep = "")
table.project_stats = paste(ctype, "_project_stats", sep = "")
```
```{r}
# basic info about number of files, clone groups, etc.
displayCloneGroupsInfo(table.clone_groups, table.clone_info, "files_full_stats")
# files per clone group statistic
displayFilesPerCloneGroup(table.clone_groups)
# projects per clone group statistic
displayProjectsPerCloneGroup(table.clone_groups)
# Links to mostly cloned files, ordered by absolute number of copies
mostClonedFiles(table.clone_groups, countBy = "files")
# Links to mostly cloned files, ordered by # of projects which contain them
mostClonedFiles(table.clone_groups, countBy = "projects")
```
```{r}
originalContentPerProject(table.project_stats, "files")
originalContentPerProject(table.project_stats, "bytes")
```
Let's now look at the most original projects, ordered first by the # of original files in them (the oldest file from each group is assumed to be original as well), and then by the number of files they contain.
```{r}
mostOriginalProjects(table.project_stats, "files", DT_LIMIT)
```
Similarly, we can observe the least original projects:
```{r}
leastOriginalProjects(table.project_stats, "files", DT_LIMIT)
```
# Appendices
## Appendix A - Tokenizer
The tokenizer searches given set of root directories recursively. Whenever it finds a git repository, it asks git for a list of all files in the repository with their first commit dates:
```
git log --format=\"format:%at\" --name-only --diff-filter=A
```
This result is cached in a file named `cdate.js.tokenizer.txt`. The tokenizer then looks at the files, and when it finds a javascript file (`*.js`, i.e. including `*.min.js` ones), it tokenizes it and outputs three statistics:
- `files_stats` which are file statistics as mandated by sourcererCC
- `files_tokens` which lists all tokens and number of their occurences in the file
- `files_full_stats` which lists all statistics the tokenizer can produce (this is more than what generic sourcererCC's tokenizer is capable of)
Furthermore the tokenizer does the following:
- converts each token in the file to a token id which it uses instead of the token string. This is useful for compressing the output format because the ids are smaller than token characters and also helps sourcererCC to deal with weird tokens in terms of character escaping and length (sourcererCC cannot handle very long tokens). The tokenizer computes total frequencies of all the tokens and outputs the mapping from token ids to token characters and frequencies at the end of its run into `tokens.txt` file.
- maintains a map of all files it has seen based on token's hash. Only emits file into `files_stats` and `files_tokens` (i.e. for sourcererCC to work on) if its identical clone has not been seen yet. `files_full_stats` contains all files. Clone pairs are stored in the `clones` directory
- if the file is empty, it does not send it to sourcererCC as empty files cause sourcerer to complain
- handles ASCII, UTF-8, and UTF-16 encodings
- ignores archive files
### Tokenization
The tokenizer attempts to parse JavaScript properly, including full support for character literals, numbers and patterns. The tokenizer does not parse direct html insertion in javascript properly, but this feature is deprecated. If the tokenizer is not able to parse the file properly, it increases the error number of that file and attempts to recover.
Separators (operators, braces, etc.), tokens (identifiers, keywords and literals), comments and whitespace groups are distinguished and depending on the settings, any of then can be treated as tokens (i.e. submitted for clone detection analysis). This is useful during validation phase.
### Different Workers
The tokenizer works in different stages to scale well to many cores. Number of threads per each class can be specified as well as maximum job queue size for different worker classes. This is important because if one class becomes a bottleneck and the previous ones get ahead of it, the memory requirements of the program grow accordingly. When thread wants to schedule a job for other worker and that job queue is full, the thread is paused until the queue allows insertion.
#### Crawler
Crawlers look recursively into the directories trying to find github projects. When one is found, they pass it as a job to tokenizer and continue crawling.
#### Tokenizer
Obtains list of all files in the project, and tokenizes the javascript ones. When a file is tokenized, it is passed as a job to merger.
#### Merger
> This is currently the bottleneck.
Mergers are responsible for (a) clone detection and (b) translation of token characters into unique token ids. When both are done, the tokenized file, now with ids instead of token characters is passed to writers.
#### Writers
Each writer has its own set of output streams (each writer does portions of *all* outputs generated by the tokenizer) and when it receives a file, it writes its statistics. Currently a single writer is enough to do the job.
## Appendix B - This notebook
The notebook creates plenty of tables, ingests data into them and then performs some preprocessing. All is done in SQL and then results are displayed in R so that the entire thing fits in memory. Key to relative performance is indexing on the databases, but frankly, it is still slow:( Especially the data ingestion and preprocessing.
If I were a better man, the code would have been better commented.
> TODO someone who really knows SQL may be able to speed up our queries.
#### Timestamps
In order not to recompute the various tables with data, the script uses timestamps. Each table has a timestamp associated with it and it is only recomputed if it's source's (file or another table) timestamp differs, or when `overwrite` is set to `"force"`
## Appendx C - Datasets
### aws_processed_061116
- taken on 6th November 2016
- raw data for the dataset are in `/mnt/data1/data70k` and `/mnt/data2/data70k2` on the aws. Tokenizer & sourcererCC results can be found in `/mnt/data2/peta/processed_061116` on the server as well.
- includes `*.min.js`
- tokenizer failed on file `/mnt/data2/data70k2/f/foxben/userscript/latest/monolith/46/109842.user.js`
- repaired the dataset
- sharding interval used was `274, 2401, 12000012`
- tokenization took ~ 9 hrs
- clone detection preprocessing took ~ 1 hour
- clone detection itself took ~ 6 hrs
- entire data ingestion, preprocessing and output took ~ 4:53 start