atom-finder

A Clojure library for finding Atoms of Confusion in C projects.

Contains facilities for:

  • Parsing C/C++ with Eclipse CDT library
  • Finding specific patterns in an AST
  • Traversing version histories through git
  • Parsing commit logs for bug/patch IDs

Output from this work formed the basis of our paper at the Mining Software Repositories 2018 conference: Prevalence of Confusing Code in Software Projects - Atoms of Confusion in the Wild.

Running from the command-line

If you would like to use this project as-is to find all occurrences of the 15 atoms of confusion described in our 2018 MSR paper, you can run our code from the command line as:

lein run dir1 dir2 > atoms.csv

This command loops over each of the directories provided (dir1 and dir2 in the example above) and prints CSV to standard output, redirected here into atoms.csv, with one row per atom in the following shape:

atom,file,line,offset,code
operator-precedence,nginx/src/misc/ngx_google_perftools_module.c,109,2669,&gptcf->profiles
post-increment,nginx/src/core/ngx_thread_pool.c,254,6422,task->id = ngx_thread_pool_task_id++
operator-precedence,nginx/src/core/ngx_thread_pool.c,122,3452,&tp->mtx

It is best to redirect this output to a file for further post-processing.
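
As a minimal sketch of such post-processing, the snippet below tallies how many rows each atom type has. It uses only clojure.string and is not part of atom-finder: parse-row and atom-counts are hypothetical helper names, and the code assumes fields are unquoted and that only the last column (code) may contain commas.

```clojure
(require '[clojure.string :as str])

;; Hypothetical helper: turn one CSV line into a map.
;; The limit of 5 keeps any commas inside the trailing `code` column intact.
(defn parse-row [line]
  (zipmap [:atom :file :line :offset :code]
          (str/split line #"," 5)))

;; Hypothetical helper: count rows per atom type, skipping the header line.
(defn atom-counts [csv-text]
  (->> (str/split-lines csv-text)
       rest
       (remove str/blank?)
       (map parse-row)
       (map :atom)
       frequencies))

;; The three sample rows shown above, inlined so the sketch runs on its own.
(def sample
  (str/join "\n"
            ["atom,file,line,offset,code"
             "operator-precedence,nginx/src/misc/ngx_google_perftools_module.c,109,2669,&gptcf->profiles"
             "post-increment,nginx/src/core/ngx_thread_pool.c,254,6422,task->id = ngx_thread_pool_task_id++"
             "operator-precedence,nginx/src/core/ngx_thread_pool.c,122,3452,&tp->mtx"]))

(atom-counts sample)
;; => {"operator-precedence" 2, "post-increment" 1}
```

On a real run the same function could be applied as (atom-counts (slurp "atoms.csv")).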

To dump every AST node (whether or not it is an atom of confusion) run:

lein with-profile dump-asts run

Project Structure Overview

The majority of interesting files in this project are located in the top-level src directory. Secondarily, test files are located in test and jars in resources. Under src there are several important directories:

  • atom_finder - Clojure files for parsing C/C++ files and searching for atoms
  • analysis - R files for statistically analyzing the results of the code mining
  • conf - Configuration variables to customize each runtime environment

The most important and most complicated directory is src/atom_finder, which contains all the code to analyze source code and repositories. The top level of atom_finder holds code that is specific to this project but reusable between different analyses. Below the top level are several other useful directories:

  • classifier - Every file in this directory is used to determine whether an individual AST node is a particular atom of confusion
  • questions - Every file in this directory corresponds to one of our (published or potential) research hypotheses. These files implicitly use the classifier infrastructure to observe patterns.
  • tree_diff - Tree diffing was a difficult enough problem that it took several iterations to get working. Each evolution is its own sub-namespace in this directory. Ultimately only difflib ended up being used.
  • util - The most reusable and general functions. Most of these files are potentially useful in other projects outside this one.

Working with Clojure

This project is primarily written in Clojure (JVM) and uses many Java libraries. In order to run this project you should install:

  • Leiningen - The Clojure build manager. This tool will automatically download the right version of Clojure, resolve all the necessary libraries, run tests, and execute the program.
  • One of Emacs/CIDER, Sublime/SublimeREPL, Nightcode, or anything that offers you a Clojure-centric workflow. The way one writes Clojure (and Lisp in general) is a bit more interactive than traditional development. It's important to be able to evaluate code as you write it.

After you've installed these tools, first run lein test to make sure everything is up and running. Then you should be able to develop in your editor, executing snippets of code as you go.

Using the framework to parse code

The first thing you might want to do is parse some C code. There are three main functions for doing this: parse-file, parse-source and parse-frag. All three take a String as an argument and return an IASTNode. parse-file and parse-source both require whole programs, the former accepting a filename as its argument and the latter a string containing the full code. parse-frag, on the other hand, can take any (read "many") partial program. For example:

(parse-file "gcc/testsuite/c-c++-common/wdate-time.c")  ;; => CPPASTTranslationUnit
(parse-source "int main() { 1 + 1; }")                  ;; => CPPASTTranslationUnit
(parse-frag "1 + 1")                                    ;; => CPPASTBinaryExpression

After you've parsed some code, you might reasonably want to see what it looks like:

(->> "gcc/testsuite/c-c++-common/wdate-time.c"
      parse-file
      (get-in-tree [2])
      print-tree)

Which should output:

[]  <SimpleDeclaration>                                      {:line 6, :off 238, :len 39}
[0]  <SimpleDeclSpecifier>                                   {:line 6, :off 238, :len 10}
[1]  <ArrayDeclarator>                                       {:line 6, :off 249, :len 27}
[1 0]  <Name>                                                {:line 6, :off 249, :len 9}
[1 1]  <ArrayModifier>                                       {:line 6, :off 258, :len 2}
[1 2]  <EqualsInitializer>                                   {:line 6, :off 261, :len 15}
[1 2 0]  <IdExpression>                                      {:line 6, :off 263, :len 13}
[1 2 0 0]  <Name>                                            {:line 6, :off 263, :len 13}
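
The bracketed vectors in the left column appear to read as paths of child indices from the printed root, so [1 2 0] would be the first child of the third child of the second child. A minimal sketch of that lookup follows, using plain nested maps instead of CDT's IASTNode. The :children key and the name get-in-tree-sketch are hypothetical, and this is not the library's implementation.

```clojure
;; Hypothetical stand-in for get-in-tree, operating on plain maps.
;; Follows a vector of child indices down from root and returns nil
;; if the path runs off the tree.
(defn get-in-tree-sketch [path root]
  (reduce (fn [node idx]
            (some-> node :children (nth idx nil)))
          root
          path))

;; A toy tree shaped like the print-tree output above.
(def decl
  {:type "SimpleDeclaration"
   :children [{:type "SimpleDeclSpecifier"}
              {:type "ArrayDeclarator"
               :children [{:type "Name"}
                          {:type "ArrayModifier"}
                          {:type "EqualsInitializer"
                           :children [{:type "IdExpression"
                                       :children [{:type "Name"}]}]}]}]})

(:type (get-in-tree-sketch [1 2 0] decl)) ;; => "IdExpression"
(:type (get-in-tree-sketch [] decl))      ;; => "SimpleDeclaration"
(get-in-tree-sketch [0 3] decl)           ;; => nil
```

The path comes first and the tree last so that the function slots into a ->> pipeline the same way (get-in-tree [2]) does above.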

Some other useful functions are:

print-tree     -> Prints a debug view of the tree structure of an AST plus metadata
write-tree     -> Takes an AST and returns the code that generated it (inverse parsing)
get-in-tree    -> Digs down into an AST to get at nested children
default-finder -> Takes a function that returns true/false for a single AST node, and runs it over an entire AST

Using the framework to find atoms of confusion

You may also be interested in finding where in software projects atoms of confusion live.

In the classifier namespace there are several functions for finding atoms. First, every type of atom has a classifier which can be applied to an AST node to determine whether it represents an atom of confusion.

(->> "x++" parse-expr post-*crement-atom?)      ;; => false
(->> "y = x++" parse-expr post-*crement-atom?)  ;; => true

Further, by applying the default-finder function, each classifier can be adapted to find each example of an atom in a piece of code.

(->> "x = (1, 2) && y = (3, 4)"
     parse-expr
     ((default-finder comma-operator-atom?))
     (map write-tree))
;; => ("1, 2" "3, 4")
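
The general pattern here, lifting a per-node predicate into a whole-tree search, can be sketched with tree-seq from clojure.core. This is only an illustration on plain maps under assumed names (finder-sketch, comma-node?, and the :kind, :code and :children keys are all hypothetical); the real default-finder works on CDT nodes and may be implemented differently.

```clojure
;; Hypothetical analogue of default-finder: takes a node predicate and
;; returns a function that collects every matching node in a tree.
(defn finder-sketch [pred]
  (fn [root]
    (filter pred (tree-seq :children :children root))))

;; A toy tree containing two comma nodes. It is not the actual parse
;; of the expression above.
(def toy-tree
  {:kind :assign
   :children [{:kind :id :code "x"}
              {:kind :and
               :children [{:kind :comma :code "1, 2"
                           :children [{:kind :literal :code "1"}
                                      {:kind :literal :code "2"}]}
                          {:kind :comma :code "3, 4"
                           :children [{:kind :literal :code "3"}
                                      {:kind :literal :code "4"}]}]}]})

;; Hypothetical classifier for the toy tree.
(defn comma-node? [node]
  (= :comma (:kind node)))

(->> toy-tree
     ((finder-sketch comma-node?))
     (map :code))
;; => ("1, 2" "3, 4")
```

Returning a function rather than the matches themselves is what explains the doubled parentheses in the pipeline: the inner call builds the finder, and the outer call applies it to the threaded tree.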

If you would like to find every atom in a piece of code you can use the helper function find-all-atoms in classifier.clj.

(->> "11 && 12 & 013"
     parse-expr
     find-all-atoms
     (map-values (partial map write-tree))
     (remove (comp empty? last))
     (into {}))
;; => {:operator-precedence ("11 && 12 & 013"), :literal-encoding ("12 & 013")}

Using the framework to answer questions

Beyond simply finding atoms of confusion, there's a fair amount of code to answer specific questions about how atoms of confusion relate to a codebase. Much of this code lives in the questions directory, and is very poorly documented. Sorry in advance.