\documentclass{pracamgr}
\usepackage{amsmath}
\usepackage{amssymb}
\usepackage[titletoc]{appendix}
\usepackage[backend=biber,style=numeric]{biblatex}
\usepackage{bold-extra}
\usepackage{booktabs}
\usepackage{caption}
\usepackage{changepage}
\usepackage{chngcntr}
\usepackage{enumitem}
\usepackage{fixltx2e}
\usepackage{float}
\usepackage[T1]{fontenc}
\usepackage[hidelinks]{hyperref}
\usepackage[utf8]{inputenc}
\usepackage{listings}
\usepackage[lighttt]{lmodern}
\usepackage{lstlinebgrd}
\usepackage{ltxtable}
\usepackage{multicol}
\usepackage{multirow}
\usepackage{pbox}
\usepackage{pgf}
\usepackage{pgffor}
\usepackage{polski}
\usepackage{soul}
\usepackage{tikz}
\usepackage{tikz-uml}
\usepackage{xcolor}
\usepackage{xspace}
\usetikzlibrary{arrows}
\usetikzlibrary{decorations.pathreplacing}
\usetikzlibrary{intersections}
\usetikzlibrary{patterns}
\usetikzlibrary{positioning}
\usetikzlibrary{shapes}
\author{Adam Wierzbicki}
\nralbumu{306441}
\title{Using Code History for Defect Prediction}
\tytulpl{Wykorzystanie historii kodu do predykcji błędów}
\kierunek{Informatyka}
\opiekun{
prof. dr. hab. Krzysztofa Stencla\\
Instytut Informatyki
}
\date{Czerwiec 2015}
\dziedzina{11.3 Informatyka}
\klasyfikacja{
D. Software\\
D.2. Software Engineering\\
D.2.8 Metrics\\
D.2.9 Management
}
\keywords{code history, defect, bug-proneness, prediction, repository, metrics, machine learning}
\addbibresource{bibliography.bib}
\input{utils/preamble.tex}
\input{utils/highlight.tex}
\input{utils/rectangle.tex}
\begin{document}
\maketitle
\begin{abstract}
Detection of software defects has become one of the major challenges in the field of automated software engineering. Numerous studies have revealed that mining data from repositories could provide a substantial basis for defect prediction. In this thesis I introduce my approach towards this problem relying on the analysis of source code history and machine learning algorithms. I describe in detail the proposed computational procedures and explain their underlying assumptions. Following the theoretical basis, I present the results of performed experiments which serve as an empirical assessment of the effectiveness of my methods.
\end{abstract}
\tableofcontents
\begingroup
\let\cleardoublepage\relax
\clearpage
\listoftables
\listoffigures
\clearpage
\lstlistoflistings
\endgroup
\chapter{Introduction}
\label{cha:introduction}
Ever since the legendary first bug in 1947\footnote{According to \cite{first_bug}, on the 9th of September, 1947 an investigation of the malfunctioning Harvard University Mark II Aiken Relay Calculator revealed a moth trapped between the points of relay \#70 in panel F. This event was reported in a log with the following statement: \textit{``First actual case of bug being found.''}}, the detection of software faults has been a crucial part of the quality assurance process. Undiscovered bugs have been the cause of many misfortunes, with the crash of the \$500-million space rocket Ariane 5 being the most spectacular example.\footnote{On the 4th of June, 1996 the rocket was launched by the European Space Agency from Kourou in French Guiana. It exploded only about 40 seconds after take-off due to an error in the inertial reference system -- conversion of a 64-bit float to a 16-bit signed integer failed because the number was larger than the largest storable value. \cite{ariane}} This work aims to contribute to the general improvement of software quality by investigating certain methods of error identification.
\section{Topic choice rationale}
\label{sec:topic_choice}
Various procedures have been applied to track down software defects before they can cause any problems. The most widely used techniques are testing and code review. Both are quite successful, but unfortunately also very arduous. In order to improve the performance of these methods and reduce the programmers' effort needed to conduct them, automatic debugging programs are being developed.
Such programs include a broad range of approaches towards detecting bugs. Some of them are run-time debuggers which help programmers analyse the execution of a program (either standalone as GDB \cite{gdb} or integrated with IDEs as \textsc{Microsoft Visual Studio Debugger} \cite{vs_debugger}). Others generate test cases using randomization and symbolic execution \cite{symbolic, puzzle}. Others support the reviewing process by highlighting potentially dangerous program parts.
Identification of such fault-prone elements could be performed using many different methods. There are tools which rely on hard-coded patterns of so-called ``bad code smells'' (\textit{e.g.} \textsc{FindBugs} \cite{findbugs}). Other ones incorporate statistical analysis and machine learning, using various software metrics and properties (\textit{e.g.} HATARI \cite{hatari}). In this work I focus on the latter approach, because I find it more interesting and less frequently encountered. I decided to use historical metrics\footnote{Software metrics based on code change history. For details see Section \ref{sec:historical}.} which have proven to be well-suited for this task \cite{merits, comparative, how_and_why}. With my research I hope to extend the knowledge about methods of bug prediction, which is not only of theoretical but also of practical value. This thesis makes the following contributions:
\begin{itemize}
\item A proposal of a novel method for bug prediction,
\item A prototype implementation of the proposed method,
\item An experimental evaluation of the implemented tool.
\end{itemize}
\section{Document structure}
\label{sec:structure}
\textbf{Chapter \ref{cha:introduction}} includes general introductory information for the thesis, topic choice rationale, and a sketch of the document structure.
\bpar{Chapter \ref{cha:overview}} contains a broad and comprehensive description of the problem of bug prediction. I try to define all terms relevant to this topic as precisely as possible. Then I present the general goals of the prediction process and its consecutive phases. I describe several kinds of sub-problems which have to be solved in order to successfully develop an error detection instrument.
\bpar{Chapter \ref{cha:approach}} presents my approach towards the problem. The proposed method is based on extraction of historical information from a version control system and application of a machine learning algorithm. I describe the general model of a changing method\footnote{In the object-oriented-programming sense}, which is the principal element of the whole procedure, as well as successive phases such as feature extraction, identification of bug-fixes, assignment of bug-proneness scores and training a regressor. In the last section of this chapter, I list four different machine learning algorithms which are evaluated in Chapter \ref{cha:experiments}.
\bpar{Chapter \ref{cha:implementation}} describes the \ca tool -- an implementation of the approach proposed in Chapter \ref{cha:approach}. I present all used technologies (languages and libraries), software architecture and the interface.
\bpar{Chapter \ref{cha:experiments}} contains a description of the experiments which were performed to evaluate the proposed approach and compare the performance of different algorithms listed in Section \ref{sec:machine_learning}. I explain the methodology of these experiments and the software configuration used to conduct them. I give an overview of the experimental data and finally present the results with a short commentary.
\bpar{Chapter \ref{cha:conclusions}} includes a general assessment of my approach to the problem of bug prediction. Further, I describe several threats to the validity of my research, and in the last section I present possible ways of continuing my work on this topic.
\chapter{Problem overview}
\label{cha:overview}
Simply put, bug prediction is a procedure aimed at automatic detection of faults in unreviewed and untested code, based on reports about previously discovered bugs. However, this deceptively plain definition hides a very convoluted and problematical process. In this chapter I would like to explore its meanders and formulate a more precise description by answering questions such as: \emph{\textbf{What is it} exactly that we are predicting?} (Section \ref{sec:dependent_variable}) \emph{What is \textbf{the scope} of our prediction?} (Section \ref{sec:unit_of_analysis}) \emph{How do we learn about \textbf{former defects}?} (Section \ref{sec:identification}) and \emph{What \textbf{kinds of information} can we utilize in our prediction?} (Section \ref{sec:observable_variables})
\section{Terminology}
\label{sec:terminology}
Before addressing the topic, I would like to define some concepts of major importance, namely: \emph{version control system}, \emph{software repository}, \emph{commit}, \emph{error}, \emph{bug-fix}, \emph{bug-tracker}, \emph{software metric}, and \emph{error model}. Brief explanations of how I understand these terms are given below.
\medskip
\begin{definition}{Version control system}
``A system capable of recording the changes made to a file or a set of files over a time period in such a way that it allows us to get back in time from the future to recall a specific version of that file'' \cite[p. 8]{git_book}. Usually, such a system is used to manage the source code of a software project. Examples of VCSs are: \textsc{Apache}\textsuperscript{TM} \textsc{Subversion}\textsuperscript{\textregistered} \cite{subversion}, \textsc{Git} \cite{git}, or \textsc{Mercurial} \cite{mercurial}.
\end{definition}
\begin{definition}{Software repository}
A directory containing the source code of a software project, which is managed by a version control system. It could be either local or remote (accessed via HTTPS or SSH protocol).
\end{definition}
\begin{definition}{Commit}
``An atomic collection of changes to files in a repository. It contains all recorded local modifications that lead to a new revision of the repository'' \cite{mercurial_wiki}. Apart from file changes, a commit contains meta-information such as: committer's name and e-mail address, date and time, and a description of changes. A commit is also often called a \emph{changeset}.
\end{definition}
\begin{definition}{Error}
``An incorrect step, process, or data definition. For example, an incorrect instruction in a computer program'' \cite[p. 31]{glossary}. An error is also called a \emph{bug}, \emph{fault}, or \emph{defect}.
\end{definition}
\begin{definition}{Bug-fix}
A commit created in order to repair a software defect. It is usually distinguished by an appropriate commit message describing the fixed error.
\end{definition}
\begin{definition}{Bug tracking system}
A system that ``stores and manages all bug status and information such as the module where a bug occurred, when a bug was found, the severity of a bug, comments describing a bug’s effect on the software, instructions on replicating a bug, who reported the bug, and whether the bug has been fixed yet'' \cite{adaptive}. Also called simply \emph{bug-tracker}. Examples of bug-trackers are: \textsc{Bugzilla} \cite{bugzilla} and \textsc{Jira} \cite{jira}.
\end{definition}
\begin{definition}{Software metric\footnote{I prefer the term ``measure'' over ``metric'' in this context, because it is more consistent with the mathematical understanding of measure and metric. However, ``software metric'' has become a widely used phrase in software engineering, so I will follow this convention.}}
``A quantitative measure of the degree to which a system, component, or process possesses a given attribute'' \cite[p. 47-48]{glossary}.
\end{definition}
\begin{definition}{Error model}
``In software evaluation, a model used to estimate or predict the number of remaining faults, required test time, and similar characteristics of a system'' \cite[p. 31]{glossary}.
\end{definition}
\section{Dependent variable}
\label{sec:dependent_variable}
Prediction in general may be described as discovering a relation between several observable variables and a dependent variable (which may also be called the \textit{target variable}). The discovered relation allows for estimating values of the target variable by analysing the observables. Trying to fit \textit{defect prediction} into this description seems problematic, because ``defect'' does not name any well-defined variable. Therefore, the first fundamental concern of bug prediction is specifying a variable which will both suit our theoretical framework and accommodate common-sense intuitions.
The most obvious candidate (used \textit{e.g.} by Janes \textit{et al.} in \cite{Janes}) is the number of defects per software unit\footnote{\cite[p. 31]{glossary} endorses the choice of this kind of variables by defining error prediction as ``a quantitative statement about \textbf{the expected number} or nature of faults in a system or component'' (emphasis added).} (for possible units see Section \ref{sec:unit_of_analysis}). However, as in many cases the numbers of bugs in software components are relatively small, a dichotomous boolean variable (indicating bug presence or absence) is used more often (\textit{e.g.} by Moser \textit{et al.} in \cite{comparative} or Giger \textit{et al.} in \cite{method-level}). More complex examples include \emph{fault density} -- number of faults divided by a code size measure (Tomaszewski \textit{et al.} \cite{Tomaszewski}), or expected number of repairs in some period of time (Hassan~\cite{complexity}).
As Arisholm \textit{et al.} reasonably state in \cite[p. 5]{systematic}, the choice of dependent variable should be determined by the further usage of the developed model. For assigning a software component for a detailed examination by a reviewer, a boolean variable is probably sufficient. On the other hand, creating a ranking or visualisation of the software quality requires numeric values, so number of defects or fault density would suit these purposes much better.
\section{Unit of analysis}
\label{sec:unit_of_analysis}
All of the variables defined in the above section are relative to some code segment. The size of such a segment can span from a single function up to a whole project -- every choice has its advantages and drawbacks. As nine of the ten most popular programming languages are object-oriented \cite{popularity1,popularity2}, the most commonly encountered units of analysis are also related to the OOP paradigm: module/package, class, or method/function.
Package-level prediction (used \textit{e.g.} by Jin \textit{et al.} in \cite{Jin}) can make use of structural software properties which cannot be computed for smaller units. However, the typical size of a package in some languages would make the corresponding fault-proneness score of very little practical value. Therefore, module- or package-level prediction is usually restricted to languages to which the class level does not apply, \textit{e.g.} Fortran, Pascal or Ada.
The class level (used \textit{e.g.} by Tomaszewski \textit{et al.} in \cite{Tomaszewski}) is the most popular choice among defect prediction studies. Probably the reason for this popularity is the fact that the class is the most natural and accustomed part of object-oriented software. In the case of non-object-oriented languages, the middle-sized unit of analysis is the file, which is easily separable and trackable with VCS machinery. The class shares the advantages of the file, since numerous software projects obey, either by virtue of conventional guidelines or due to language constraints, the \emph{one class per file} rule. The scope of a class or a file is also usually narrow enough to serve as an object of manual inspection in case of a high bug-proneness score.
Using method or function as the unit of analysis is a more complicated task, but it has been also successfully accomplished (\textit{e.g.} by Giger \textit{et al.} in \cite{method-level}). Contrary to higher-level units described above, which can be identified using solely lists of files, ascribing faults to certain methods requires examining detailed information about code changes. I consider the results of method-level prediction most useful for debugging due to their fine granularity.
The scope of prediction is limited not only \emph{spatially} (by which I understand focusing on particular software components, as described above in this section), but also \emph{temporally}. One may predict the number (or presence) of bugs in a software release, over some fixed period of time, or in a single commit. This choice should also be affected by the later utilisation of predicted values.
\section{Identification of bugs}
\label{sec:identification}
Prediction of new bugs has to be based on historical data about previously detected ones. However, identifying them is not a trivial task. It has become one of the major challenges in the discipline of \emph{mining software repositories}.
\subsection{Bug-fixes}
\label{sec:bug-fixes}
The first step towards identification of bugs is the recognition of bug-fixing commits. The traditionally used heuristic (applied \textit{e.g.} by Kim in \cite{adaptive}) involves searching for keywords such as ``bug'', ``issue'' or ``fix'' in commit messages. A more refined approach (used in \cite{Ekanayake, method-level, adaptive}), instead of relying on hard-coded patterns, utilizes information obtained from a bug-tracking system. Bug-trackers assign IDs to the tracked issues, which can later be referenced in change logs. As the vast majority of contemporary software projects use bug-trackers, searching for these IDs in commit messages can constitute a reliable method for detecting bug-fixes.
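To make this heuristic concrete, the following sketch scans a commit message for fix-related keywords and for Jira-style issue IDs. The keyword list and the ID format are illustrative assumptions only; real projects tune both to their own conventions.

```java
import java.util.regex.Pattern;

// A minimal sketch of the keyword heuristic for recognising bug-fixing
// commits. Both patterns below are illustrative assumptions: the keyword
// list and the "PROJ-123" issue-ID shape vary from project to project.
public class BugFixHeuristic {
    // Fix-related keywords, matched case-insensitively on word boundaries.
    private static final Pattern KEYWORDS = Pattern.compile(
            "\\b(bug|fix(e[sd])?|issue|defect)\\b", Pattern.CASE_INSENSITIVE);
    // A Jira-style issue ID such as "PROJ-123".
    private static final Pattern ISSUE_ID = Pattern.compile(
            "\\b[A-Z][A-Z0-9]+-\\d+\\b");

    public static boolean looksLikeBugFix(String commitMessage) {
        return KEYWORDS.matcher(commitMessage).find()
                || ISSUE_ID.matcher(commitMessage).find();
    }
}
```

In practice the issue-ID branch is the more reliable of the two, since an ID can be cross-checked against the bug-tracker's database, while bare keywords often appear in non-corrective commits.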
\subsection{Noise in error data}
\label{sec:noise}
Unfortunately, research has shown that the aforementioned methods are not quite successful. Herzig \textit{et al.} \cite{Herzig}, after analysing over 7,000 error reports, make the following statements:
\begin{itemize}
\item ``More than 40\% of issue reports are inaccurately classified'',
\item ``33.8\% of all bug reports do not refer to corrective code maintenance'',
\item ``Due to misclassifications, 39\% of files marked as defective actually have never had a bug''.
\end{itemize}
Bachmann \textit{et al.} in \cite{Bachmann} confirm the poor quality of error reporting (particularly in the \textsc{Apache} project) by declaring that ``only 47.6\% of bug fix related commits are documented in the bug tracking database''. Human failure and negligence make the error data obtained from repositories highly biased and untrustworthy, which in turn negatively affects bug localization and prediction \cites{Herzig, dealing, Kochnar}.
In order to improve the efficiency of fault detection, Wu \textit{et al.} have developed \textsc{ReLink} -- an automatic link\footnote{The term ``link'' refers here to the links between bug reports and code changes.} recovery algorithm \cite{ReLink}. It uses a machine learning approach with features such as:
\begin{itemize}
\item The interval between the bug-fixing time and the change-commit time,
\item Mapping between bug owners and change committers,
\item The similarity between bug reports and change logs.
\end{itemize}
Evaluated on five open-source projects, \textsc{ReLink} achieved up to 26.6\% more recall than the traditional heuristics.
\subsection{Bug-introducing changes}
\label{sec:bug-introducing}
The next stage, after identifying bug-fixes, is linking them with \emph{bug-introducing changes}. Only once this is achieved can one give the precise temporal location of an error. Several algorithms have been developed for this purpose. Probably the most popular one (used \textit{e.g.} by Kim \cite{adaptive} and Fukushima \textit{et al.} \cite{Fukushima}) is the SZZ algorithm developed by Śliwerski, Zimmermann \& Zeller in \cite{SZZ}. It relies on annotation graphs produced by the CVS\footnote{\textsc{Concurrent Versions System} is an open-source version control system \cite{CVS}.} \emph{annotate}\footnote{The \emph{annotate} command allows one to ``print the head revision of the trunk, together with information on the last modification for each line'' \cite{CVS_annotate}. Its counterpart in \textsc{Git} and SVN is \emph{blame}.} command. This approach was criticised in \cite{SZZ_revisited} by Williams \& Spacco, who propose a different technique using line-number maps and \textsc{DiffJ} -- a Java syntax-aware diff tool. Application of such algorithms allows one to conclude with high certainty which parts of the software, at which points in time, are to be considered fault-prone.
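The core SZZ step can be sketched schematically: the lines that a bug-fix removed or changed are traced back, via \emph{annotate}/\emph{blame} on the pre-fix revision, to the commits that last touched them. The data structures below are illustrative simplifications; a real implementation additionally filters out non-semantic lines (comments, whitespace) and commits made after the corresponding bug report.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// A schematic sketch of the core SZZ tracing step. Inputs are simplified
// stand-ins: the line numbers touched by the fix, and a per-line annotation
// of the pre-fix revision (line number -> id of the commit that last
// modified it, as `annotate`/`blame` would report).
public class SzzSketch {
    public static Set<String> bugIntroducingCommits(
            List<Integer> linesTouchedByFix,
            Map<Integer, String> blameOfPreFixRevision) {
        Set<String> suspects = new HashSet<>();
        for (int line : linesTouchedByFix) {
            String commit = blameOfPreFixRevision.get(line);
            if (commit != null) {  // lines added by the fix have no ancestor
                suspects.add(commit);
            }
        }
        return suspects;
    }
}
```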
\section{Observable variables}
\label{sec:observable_variables}
Prediction of the target variable has to be done through a measurement of some observable variables. A multitude of software metrics has been applied for this purpose. In this section I will present a condensed survey of the most common ones (and some less common, but interesting).
\subsection{Static metrics}
\label{sec:static}
Static metrics are ``measures of structural properties derived from the source code'' \cite[p.~5]{systematic}. Their essential property (responsible for the name) is that they can be computed using only a snapshot (revision) of the code. Some static metrics are specific to object-oriented software; others are more universal (\textit{e.g.} LOC or complexity). Examples of static metrics are listed in Table \ref{tab:static_metrics}. Note that metrics presented as low-level can also be used on higher levels through aggregation operations (\textit{e.g.} minimum, maximum, or average).
\LTXtable{\textwidth}{tables/static_metrics.tex}
\subsection{Historical metrics}
\label{sec:historical}
As the name suggests, historical metrics are obtained through the analysis of software history, which is stored in a version control system. Some of them are computed from raw change data (that is, a series of code revisions), while others rely on meta-information stored by the VCS, such as commit times, authors, and descriptions. Historical metrics are also called \emph{process metrics} or \emph{change metrics}\footnote{However, in \cite{Fukushima} this latter name is used for metrics which can be attributed to a \textit{change itself}, not to a software component.}. Table \ref{tab:historical_metrics} contains examples of such metrics.
\LTXtable{\textwidth}{tables/historical_metrics.tex}
\subsection{Micro-interaction metrics}
\label{sec:micro-interaction}
Micro-interaction metrics, proposed by Lee \textit{et al.} in \cite{micro_interaction}, track the behavioural interaction patterns of software developers. Usage of such metrics is motivated by research showing correlations between work habits and productivity \cite{LaToza}. They can be obtained from systems tracking developers' activities, such as \textsc{Mylyn} \cite{mylyn}, a task management plug-in for \textsc{Eclipse IDE}. Experimental results presented in \cite{micro_interaction} show that micro-interaction metrics outperform both static and historical metrics. However, the scope of their usage is very limited, due to the requirement of recording programmers' activities, which is not common in software projects. Examples of micro-interaction metrics are presented in Table \ref{tab:micro_interaction_metrics}.
\LTXtable{\textwidth}{tables/micro_interaction_metrics.tex}
\chapter{Proposed approach}
\label{cha:approach}
Having discussed the general goals and issues connected with defect prediction, I would like to present my own approach to this problem. I decided to perform the prediction at the method level, which is the finest granularity reached by state-of-the-art algorithms. Similarly to \cite{method-level}, I chose to use historical metrics based on fine-grained source code changes. However, instead of relying on sparse software releases as time scope boundaries, I developed a model allowing for a more continuous prediction.
Underlying assumptions and a general outline of the \textbf{model} is presented in Section \ref{sec:model}. Then, in Section \ref{sec:features}, a detailed summary of the model's \textbf{observable variables} is given. Section \ref{sec:bug-proneness}, on the other hand, describes the \textbf{target variable}. Section \ref{sec:machine_learning} explains four different \textbf{machine learning algorithms} which can be applied to the model.
\section{Model}
\label{sec:model}
\subsection{General assumptions}
\label{sec:general_assumptions}
Design of the further described model relies on the following fundamental assumptions:
\begin{enumerate}[label=(A\arabic*)]
\item Every method, at every moment, can be assigned a certain \textbf{\emph{bug-proneness}} level ranging between 0.0 and 1.0 (these values are arbitrary and do not denote any units).
\item \textbf{Immediately before a bug-fix}, the method to be fixed is \textbf{undoubtedly deficient}, therefore its bug-proneness equals 1.0.
\item \textbf{Immediately after a bug-fix}, the fixed method is \textbf{perfectly correct}, therefore its bug-proneness equals 0.0.
\end{enumerate}
The above presumptions may be criticized in many ways as not accurately reflecting reality. However, they were not intended to do so, but rather to form an idealization which allows for a reasonably simple model of the quality of software methods.
The bug-proneness parameter constitutes the target variable in my approach. As it is not precisely defined, one could use different methods of measuring this value. These are described later, in Section \ref{sec:bug-proneness}. An example of how bug-proneness score can change with succeeding commits is presented in Figure \ref{fig:bug_proneness}.
\begin{figure}[h]
\centering
\input{figures/bug_proneness.tex}
\caption{Bug-proneness alteration}
\label{fig:bug_proneness}
\end{figure}
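As a hypothetical illustration of assumptions A2 and A3, one simple scheme consistent with them assigns scores by linear interpolation over the commits between two adjacent bug-fixes: 0.0 immediately after the previous fix, rising to 1.0 immediately before the next one. This is only an example labelling, not necessarily the scheme actually used for measurement.

```java
// An illustrative labelling consistent with assumptions A2 and A3:
// bug-proneness grows linearly over the k commits that separate two
// adjacent bug-fixes. This linear shape is an assumption made for the
// sake of the example; other monotone curves would satisfy A2/A3 too.
public class BugProneness {
    // i = 0 .. k: position within a run of k commits ending in a bug-fix
    // (i = 0 is right after the previous fix, i = k right before the next).
    public static double linearScore(int i, int k) {
        if (k <= 0 || i < 0 || i > k) {
            throw new IllegalArgumentException("position out of range");
        }
        return (double) i / k;
    }
}
```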
\subsection{Instance construction}
\label{sec:instance_construction}
Since my aim is to follow the changes affecting the analysed method, model instances are constructed from succeeding commits. However, as the mentioned changes are \emph{cumulative}, each instance corresponds to a whole \emph{chunk} of commits, rather than to a single one. By virtue of assumptions A2 and A3, the maximal time scope of an instance is bounded by two adjacent bug-fixes. This means that instances range in size from single commits to continuous groups delimited by bug-fixes. An example of instance construction is presented in Figure \ref{fig:instance_construction}.
\begin{figure}[h]
\centering
\input{figures/instance_construction.tex}
\caption{Instance construction}
\label{fig:instance_construction}
\end{figure}
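The chunking described above can be sketched as follows. For simplicity, this sketch builds only the maximal, bug-fix-delimited chunks and represents each one as the list of its commit indices; the model also admits smaller sub-ranges within those bounds, and the real tool attaches feature values rather than bare indices.

```java
import java.util.ArrayList;
import java.util.List;

// A sketch of instance construction: commits affecting a method are
// scanned in chronological order, and every contiguous run is closed off
// at a bug-fix, so no chunk spans across a fix (assumptions A2/A3).
public class InstanceConstruction {
    public static List<List<Integer>> buildChunks(boolean[] isBugFix) {
        List<List<Integer>> chunks = new ArrayList<>();
        List<Integer> current = new ArrayList<>();
        for (int i = 0; i < isBugFix.length; i++) {
            current.add(i);
            if (isBugFix[i]) {          // a bug-fix closes the current chunk
                chunks.add(current);
                current = new ArrayList<>();
            }
        }
        if (!current.isEmpty()) {       // trailing commits with no fix yet
            chunks.add(current);
        }
        return chunks;
    }
}
```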
\section{Features}
\label{sec:features}
In order to be useful for prediction of the target variable (bug-proneness), the model incorporates several observable variables. I decided to use historical metrics (described in Section \ref{sec:historical}), which are considered superior to static measures (Section \ref{sec:static}) by many researchers \cite{merits, comparative, how_and_why} and at the same time are more widely applicable than micro-interaction metrics (Section \ref{sec:micro-interaction}).
As opposed to most works using historical metrics (\textit{e.g.} \cite{systematic, merits, micro_interaction, comparative, how_and_why}), which rely on standard source code differencing utilities\footnote{Such as \texttt{diff} and \texttt{diff3} provided by the \textsc{GNU diffutils} package \cite{diffutils}.}, I employ a syntax-aware \emph{change distilling} algorithm developed by Fluri \textit{et al.} in \cite{change_distilling}, described in more detail in the next section. A similar approach was taken by Giger \textit{et al.} in \cite{method-level}.
\subsection{Change distilling}
\label{sec:change_distilling}
The change distilling algorithm, presented in \cite{change_distilling}, aims to provide a smart, syntax-aware method of differencing source code files. It extracts \emph{fine-grained source code changes} (down to the level of single statements) by matching the \emph{abstract syntax trees}\footnote{Abstract syntax tree (AST) is the ``hierarchical syntactic structure'' of a sentence belonging to some formal language (\textit{e.g.} a programming language). ``Every internal node of a parse tree is labeled with a nonterminal symbol; every leaf is labeled with a terminal symbol. Every subtree of a parse tree describes one instance of an abstraction in the sentence.'' \cite[p. 121]{concepts} An abstract syntax tree is also called a \emph{parse tree}, for it is generated by an act of \emph{parsing}.} representing the compared files.
Leaves of the ASTs are matched using the \emph{bi-gram Dice coefficient} -- an adaptation of an ecological association measure (proposed by Dice in \cite{Dice}) to the measurement of textual similarity. The similarity of inner nodes is computed as a combination of node value similarities and subtree similarities. After matching the appropriate nodes, the algorithm calculates the minimal number of \emph{insert}, \emph{delete}, \emph{move}, and \emph{update} operations required to transform the first version of the AST into the other. These elementary operations constitute fine-grained source code changes, which are later classified using the taxonomy presented by Fluri and Gall in \cite{classifying}. A summary of change types is given in Table \ref{tab:fine_grained_changes}, while Snippets \ref{lst:condition_expression_change} to \ref{lst:return_type_insert} contain examples of classified changes.
\addtocounter{footnote}{1}
\footnotetext{``The significance level of a change is defined as the impact of the change on other source code entities, i.e., how likely is it that other source code entities have to be changed, when a certain change is applied.'' \cite{classifying}}
\addtocounter{footnote}{-1}
\LTXtable{\textwidth}{tables/fine_grained_changes.tex}
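As an illustration of the leaf-matching step described above, the bi-gram Dice coefficient can be sketched in a few lines of Java (the class and method names below are my own, not part of the \textsc{ChangeDistiller} API):

```java
import java.util.HashMap;
import java.util.Map;

// A sketch of the bi-gram Dice coefficient used for matching AST leaves.
public class BigramDice {

    // Collect the multiset of character bigrams of a string.
    private static Map<String, Integer> bigrams(String s) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + 1 < s.length(); ++i) {
            counts.merge(s.substring(i, i + 2), 1, Integer::sum);
        }
        return counts;
    }

    // Dice coefficient on bigram multisets: 2|X n Y| / (|X| + |Y|).
    // Returns 1.0 for identical strings, 0.0 for strings sharing no bigrams.
    public static double similarity(String a, String b) {
        Map<String, Integer> bigramsA = bigrams(a);
        Map<String, Integer> bigramsB = bigrams(b);
        int sizeA = 0;
        for (int count : bigramsA.values()) {
            sizeA += count;
        }
        int sizeB = 0;
        for (int count : bigramsB.values()) {
            sizeB += count;
        }
        if (sizeA + sizeB == 0) {
            return 0.0;
        }
        int intersection = 0;
        for (Map.Entry<String, Integer> entry : bigramsA.entrySet()) {
            intersection += Math.min(entry.getValue(),
                    bigramsB.getOrDefault(entry.getKey(), 0));
        }
        return 2.0 * intersection / (sizeA + sizeB);
    }

    public static void main(String[] args) {
        // "night" and "nacht" share only the bigram "ht".
        System.out.println(similarity("night", "nacht")); // prints 0.25
    }
}
```

For example, the classic pair ``night''/``nacht'' shares only one of eight bigrams, giving a similarity of 0.25.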
\begin{diff}{Condition expression change}{lst:condition_expression_change}
\begin{lstdiff}{2,10}
static int abs(int x) {
if (x < 0) {
return x;
} else {
return -x;
}
}
|\vfill\columnbreak|
static int abs(int x) {
if (x >= 0) {
return x;
} else {
return -x;
}
}
\end{lstdiff}
\end{diff}
\begin{diff}{Else-part insert}{lst:else_part_insert}
\begin{lstdiff}{12-14}
void printSqrt(int n) {
if (n < 0) {
System.out.print("-");
}
int sqrt = sqrt(abs(n));
System.out.print(sqrt);
}
|\vfill\columnbreak|
void printSqrt(int n) {
if (n < 0) {
System.out.print("-");
} else {
System.out.print("+");
}
int sqrt = sqrt(abs(n));
System.out.print(sqrt);
}
\end{lstdiff}
\end{diff}
\begin{diff}{Statement parent change}{lst:statement_parent_change}
\begin{lstdiff}{4,13}
private int foo(int a) {
if (this.isReady()) {
this.startFoo(a);
this.createX();
}
return this.getX();
}
|\vfill\columnbreak|
private int foo(int a) {
if (this.isReady()) {
this.startFoo(a);
}
this.createX();
return this.getX();
}
\end{lstdiff}
\end{diff}
\begin{diff}{Parameter type change}{lst:attribute_type_change}
\begin{lstdiff}{1,5}
int pow(int x, int y) {
...
}
|\vfill\columnbreak|
int pow(double x, int y) {
...
}
\end{lstdiff}
\end{diff}
\begin{diff}{Parameter ordering change}{lst:parameter_ordering_change}
\begin{lstdiff}{1,5}
void foo(int[] a, Blob b) {
...
}
|\vfill\columnbreak|
void foo(Blob b, int[] a) {
...
}
\end{lstdiff}
\end{diff}
\begin{diff}{Return type insert}{lst:return_type_insert}
\begin{lstdiff}{1,5}
void add(Object obj) {
...
}
|\vfill\columnbreak|
boolean add(Object obj) {
...
}
\end{lstdiff}
\end{diff}
\subsection{Feature engineering}
\label{sec:feature_engineering}
Based on fine-grained code changes (extracted by the algorithm described in the previous section) as well as commit meta-information, each instance of my model is ascribed a number of features. As instances represent groups of commits, most features are computed by an aggregation of parameters of single commits (\textit{e.g. via} averaging or summing). Some attributes, related to commits' authors, utilize not only local, but also global information. They require an evaluation of the authors' experience, which cannot be computed using only data related to a single method. All the features are presented in Table \ref{tab:features}.
\LTXtable{\textwidth}{tables/features.tex}
\section{Bug-proneness estimation}
\label{sec:bug-proneness}
For the sake of simplicity, I decided to use a plain textual match to identify the bug-fixing commits. Log messages are searched for one of the following strings: \texttt{"bug"}, \texttt{"fix"}, or \texttt{"issue"} (the matching is case-insensitive). The problem of identification of bug-fixes has been described in more detail in Section \ref{sec:bug-fixes}.
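This matching rule can be expressed in a few lines of Java (a minimal sketch; the actual implementation in \ca may differ in details):

```java
import java.util.Locale;

// A sketch of the case-insensitive keyword match used to flag bug-fixes.
public class BugFixMatcher {

    private static final String[] KEYWORDS = {"bug", "fix", "issue"};

    // A commit is considered a bug-fix if its log message contains
    // any of the keywords, regardless of letter case.
    public static boolean isBugFix(String logMessage) {
        String normalized = logMessage.toLowerCase(Locale.ROOT);
        for (String keyword : KEYWORDS) {
            if (normalized.contains(keyword)) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(isBugFix("Fixed NPE in parser (ISSUE-42)")); // prints true
        System.out.println(isBugFix("Add new feature"));                // prints false
    }
}
```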
After identifying the bug-fixes, defect-proneness has to be estimated for the other commits. As the assumptions put forward in Section \ref{sec:general_assumptions} do not impose any particular method of estimation, I propose three different strategies which could be used for this purpose:
\begin{enumerate}[label=(S\arabic*)]
\item \textbf{Linear} -- It is the simplest method, which assigns linearly ascending scores to successive commits. The common difference of this sequence is determined by the number of commits in a chunk ($n$).
\[ \underset{1 \leqslant i \leqslant n}{\forall} lin(i) = \frac{i}{n} \]
\item \textbf{Geometric} -- This strategy uses a geometric progression of bug-proneness scores. Since a geometric sequence cannot advance from 0.0 to 1.0 as ``smoothly'' as an arithmetic one, its common ratio needs to be specified by an additional parameter $\alpha$.
\[ \underset{1 \leqslant i \leqslant n}{\forall} geom(i) = \alpha^{n - i} \]
\item \textbf{Weighted} -- This method, based on sizes of commits (expressed in numbers of fine-grained changes), assigns higher bug-proneness increases to big commits and lower increases to smaller ones.
\[ \underset{1 \leqslant i \leqslant n}{\forall} wght(i) = \underset{1 \leqslant j \leqslant i}{\sum} size(j) \Biggm/ \underset{1 \leqslant j \leqslant n}{\sum} size(j) \]
\end{enumerate}
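The three formulas translate directly into code. The sketch below (my own illustration, not the tool's actual measure classes) computes the score of the $i$-th commit in a chunk of $n$ commits:

```java
// A sketch of the three bug-proneness estimation strategies.
public class BugProneness {

    // Linear strategy: lin(i) = i / n, for i = 1..n.
    public static double linear(int i, int n) {
        return (double) i / n;
    }

    // Geometric strategy: geom(i) = alpha^(n - i), for i = 1..n.
    public static double geometric(int i, int n, double alpha) {
        return Math.pow(alpha, n - i);
    }

    // Weighted strategy: the score of commit i is the fraction of all
    // fine-grained changes in the chunk contributed by commits 1..i.
    public static double weighted(int i, int[] sizes) {
        int prefix = 0;
        int total = 0;
        for (int j = 0; j < sizes.length; ++j) {
            if (j < i) {
                prefix += sizes[j];
            }
            total += sizes[j];
        }
        return (double) prefix / total;
    }

    public static void main(String[] args) {
        // Scores of the 3rd of 4 commits under each strategy:
        System.out.println(linear(3, 4));                       // prints 0.75
        System.out.println(geometric(3, 4, 0.5));               // prints 0.5
        System.out.println(weighted(3, new int[]{2, 2, 2, 2})); // prints 0.75
    }
}
```

Note that all three strategies assign the score 1.0 to the last commit of a chunk (the bug-fix itself).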
Table \ref{tab:bug_proneness_strategies} and corresponding Figure \ref{fig:bug_proneness_strategies} present examples of using all of the above strategies.
\begin{table}[H]
\centering
\caption{Bug-proneness estimation strategies}
\label{tab:bug_proneness_strategies}
\input{tables/bug_proneness_strategies.tex}
\end{table}
\begin{figure}[H]
\centering
\input{figures/bug_proneness_strategies.tex}
\caption{Bug-proneness estimation strategies}
\label{fig:bug_proneness_strategies}
\end{figure}
\section{Machine learning}
\label{sec:machine_learning}
The final step of the development of a defect prediction tool is training a classifier (which in our case should rather be called a \emph{regressor}, due to the continuous character of the bug-proneness parameter) that is later used for the classification of new instances. My approach incorporates four machine learning algorithms, described in the following sections (and evaluated and compared in Chapter~\ref{cha:experiments}).
\subsection{Decision tree}
\label{sec:decision_tree}
According to \cite[p. 263]{encyclopedia}, a decision tree is a simple, tree-structured classification model. Its inner nodes represent tests, edges correspond to possible test outcomes, while leaves contain potential values of the target variable. For each test, the outcomes are both exhaustive and mutually exclusive. Classification of a new case proceeds in a top-down manner: starting from the tree root, the classified instance is checked against successive conditions (progressing according to the results) until a leaf is reached, from which the result value is retrieved.
The learning algorithm for decision trees, called \emph{recursive splitting} (described in \cite[p.~264]{encyclopedia}), also takes a top-down approach. Beginning with a set of training instances, the best possible split of this set is chosen. The candidates for this choice are all values of attributes appearing in the training dataset. For a discrete attribute, the split has a trivial form -- one subset corresponding to one value. For a numerical one, an inequality test or some kind of discretization is required. The quality of the possible splits (or, more precisely, its inverse, called \emph{impurity}) is evaluated with measures such as \emph{information entropy} \cite{Shannon} or the \emph{Gini index} \cite{Gini}. After choosing the best split, the training set is appropriately divided and the algorithm proceeds recursively for each subset. The procedure terminates when the processed subset can no longer be split (either because all instances belong to one class, or due to their inseparability), or when the predefined maximum tree depth is reached.
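To make the split-quality evaluation concrete, the following sketch computes the Gini index of a class distribution and the weighted impurity of a candidate split (the class and method names are mine, not taken from any particular library):

```java
// A sketch of impurity computation for recursive splitting.
public class Impurity {

    // Gini index of a class distribution: 1 - sum(p_i^2).
    // A pure node (one class only) has impurity 0; for two classes
    // the maximal impurity is 0.5.
    public static double gini(int[] classCounts) {
        int total = 0;
        for (int count : classCounts) {
            total += count;
        }
        double sumSquares = 0.0;
        for (int count : classCounts) {
            double p = (double) count / total;
            sumSquares += p * p;
        }
        return 1.0 - sumSquares;
    }

    // Impurity of a candidate split: the average of the children's
    // impurities, weighted by the children's sizes.
    public static double splitImpurity(int[][] children) {
        int total = 0;
        double weighted = 0.0;
        for (int[] child : children) {
            int size = 0;
            for (int count : child) {
                size += count;
            }
            total += size;
            weighted += size * gini(child);
        }
        return weighted / total;
    }

    public static void main(String[] args) {
        // A perfect split of ten instances into two single-class children:
        System.out.println(splitImpurity(new int[][]{{5, 0}, {0, 5}})); // prints 0.0
    }
}
```

The learner picks the candidate split with the lowest such impurity.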
\begin{figure}[h]
\centering
\input{figures/decision_tree.tex}
\caption[Decision tree example]{Decision tree example\footnotemark}
\label{fig:decision_tree}
\end{figure}
\footnotetext{This tree describes the \emph{Play Golf} data set. ``P'' labels indicate weather suitable for playing golf, while ``N'' means unsuitable weather. This figure was copied from \cite{decision_trees}.}
\subsection{Random forest}
\label{sec:random_forest}
Random forest, introduced by Breiman in \cite{random_forests}, is an \emph{ensemble learning algorithm}, using the decision tree (see Section \ref{sec:decision_tree}) as the base classifier. It combines two important techniques: \emph{bagging} and \emph{random subspace method}.
Basically, a random forest is a set of decision trees. Each of them is trained on a separate random sample of the input training set. The output of such an ensemble for a test instance is obtained from the individual responses by computing their mode (in the case of classification) or average (regression). This method, called \emph{bagging} or \emph{bootstrap aggregating}, reduces the instability of the base classifier and improves its accuracy \cite{bagging}.
The \emph{random subspace method} increases the diversity between members of the ensemble by choosing random subspaces of the feature space to be used \cite[p.~828]{encyclopedia}. In the case of random forests, this means that each choice of a data split is restricted to a random subset\footnote{Typically of size $\sqrt{n}$ where $n$ is the total number of features.} of features (not all possibilities are taken into account) \cite{random_forests}.
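The two techniques can be sketched in isolation as follows (a toy illustration of the sampling and aggregation steps, not a full forest implementation):

```java
import java.util.Random;

// Sketches of the building blocks of a random forest:
// bootstrap sampling, output aggregation, and random feature subsets.
public class Bagging {

    // Bootstrap sample: n draws with replacement from n instances.
    public static int[] bootstrapIndices(int n, Random random) {
        int[] indices = new int[n];
        for (int i = 0; i < n; ++i) {
            indices[i] = random.nextInt(n);
        }
        return indices;
    }

    // Aggregate regression outputs of ensemble members by averaging.
    public static double aggregate(double[] memberPredictions) {
        double sum = 0.0;
        for (double prediction : memberPredictions) {
            sum += prediction;
        }
        return sum / memberPredictions.length;
    }

    // Random subspace method: pick ceil(sqrt(n)) distinct feature
    // indices uniformly at random via a partial Fisher-Yates shuffle.
    public static int[] randomSubspace(int numFeatures, Random random) {
        int k = (int) Math.ceil(Math.sqrt(numFeatures));
        int[] all = new int[numFeatures];
        for (int i = 0; i < numFeatures; ++i) {
            all[i] = i;
        }
        for (int i = 0; i < k; ++i) {
            int j = i + random.nextInt(numFeatures - i);
            int tmp = all[i];
            all[i] = all[j];
            all[j] = tmp;
        }
        int[] subset = new int[k];
        System.arraycopy(all, 0, subset, 0, k);
        return subset;
    }

    public static void main(String[] args) {
        System.out.println(aggregate(new double[]{1.0, 2.0, 3.0})); // prints 2.0
    }
}
```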
\subsection{Support vector machine}
\label{sec:svm}
A support vector machine is a classification model based on the construction of a hyperplane (or a set of hyperplanes) dividing the feature space of the data. The original form of the algorithm, proposed by Vapnik and Lerner in \cite{svm}, assumes linear separability of the classes. The hyperplane is constructed in a way which maximizes the margin, i.e. the distance between the hyperplane and the closest points of both classes.
Figure \ref{fig:svm} shows two examples of separation of the same set of instances in a two-dimensional space. While both lines separate the classes properly, only \textbf{A} is optimal in the aforementioned sense. The \emph{support vectors} (marked with larger points) are those which are placed closest to the separation line. By a formalization of the maximum-margin hyperplane problem and its transformation to the dual form, one can show that the resulting classifier relies only on these vectors \cite[pp. 942--944]{encyclopedia}.
\begin{figure}[h]
\centering
\input{figures/svm.tex}
\caption{SVM example}
\label{fig:svm}
\end{figure}
The original version of SVM is a linear classifier, which is a significant disadvantage when dealing with non-linearly separable classes. This issue was addressed by Boser \emph{et al.} in \cite{Boser} through the incorporation of \emph{kernel functions} to the model. This technique allows for a transformation of feature vectors into a high-dimensional space. However, by using the \emph{kernel trick}, no explicit representation of the transformed vectors is created. Another improvement to the algorithm, which enables SVM to handle mislabeled cases, is the \emph{soft margin} introduced by Cortes and Vapnik in \cite{Cortes}.
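The kernel trick replaces every inner product in the dual formulation with a kernel evaluation. A common choice is the RBF (Gaussian) kernel, sketched below next to the plain linear kernel (the $\gamma$ parameter and the names are illustrative):

```java
// Sketches of two kernel functions used with SVMs.
public class Kernels {

    // Linear kernel: the plain inner product <x, y>.
    public static double linear(double[] x, double[] y) {
        double dot = 0.0;
        for (int i = 0; i < x.length; ++i) {
            dot += x[i] * y[i];
        }
        return dot;
    }

    // RBF (Gaussian) kernel: exp(-gamma * ||x - y||^2). It corresponds
    // to an inner product in an infinite-dimensional feature space,
    // which is never constructed explicitly (the "kernel trick").
    public static double rbf(double[] x, double[] y, double gamma) {
        double squaredDistance = 0.0;
        for (int i = 0; i < x.length; ++i) {
            double diff = x[i] - y[i];
            squaredDistance += diff * diff;
        }
        return Math.exp(-gamma * squaredDistance);
    }

    public static void main(String[] args) {
        // Any vector has RBF similarity 1.0 with itself.
        System.out.println(rbf(new double[]{1, 2}, new double[]{1, 2}, 0.5)); // prints 1.0
    }
}
```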
\subsection{Neural network}
\label{sec:neural_net}
An (artificial) neural network is a learning algorithm modeled on the structure of neural connections in the human brain. A neural network is composed of simple nodes, known as \emph{neurons}, connected by edges. A neuron is a ``nonlinear, parameterized, bounded function'' \cite[p. 2]{neural_networks}. It has several inputs (variables) and exactly one output (value). A typical neuron can be described with the following equation:
\[y(x) = f\left(w_0 + \sum\limits_{i = 1}^{n} (x_i \cdot w_i) \right)\]
where $x_i$ are variables, $w_i$ are the corresponding weights, $w_0$ is an additional constant called the \emph{bias}, and $f$ is an \emph{activation function} (usually a \emph{sigmoid} function).
A neural net consists of several layers of neurons. Attribute values of classified instances are fed to the first layer. Then, successively, outputs of lower layers are propagated along the edges to the higher ones, up to the classifier output. Learning of a network is performed via adjustment of weights associated with the edges by the \emph{backpropagation} algorithm \cite[p. 31]{neural_networks}.
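The neuron equation above can be written out directly. The sketch below uses the logistic sigmoid as the activation function (a toy illustration, not the implementation used by \textsc{Weka} or \textsc{nnet}):

```java
// A sketch of a single artificial neuron with a logistic activation.
public class Neuron {

    // Logistic sigmoid: maps any real input into the interval (0, 1).
    public static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // y(x) = f(w0 + sum_i x_i * w_i), matching the equation above.
    public static double output(double[] x, double[] w, double bias) {
        double z = bias;
        for (int i = 0; i < x.length; ++i) {
            z += x[i] * w[i];
        }
        return sigmoid(z);
    }

    public static void main(String[] args) {
        // With zero net input the sigmoid sits at its midpoint.
        System.out.println(sigmoid(0.0)); // prints 0.5
    }
}
```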
\chapter{Implementation}
\label{cha:implementation}
The defect prediction method described in Chapter \ref{cha:approach} has been implemented in the form of an integrated tool -- \ca. It is open source Java software, licensed under the Apache License 2.0 \cite{apache}. The code repository can be found on \textsc{GitHub}: \url{https://github.com/Wiezzel/changeanalyzer}. The following sections describe the \textbf{languages and tools} used for the development (Section \ref{sec:languagess}), the \textbf{interface} (Section \ref{sec:interface}), and the \textbf{architecture} (Section \ref{sec:architecture}) of the program.
\section{Languages and tools}
\label{sec:languagess}
\subsection{Java}
Java \cite{java} is an object-oriented programming language, developed by James Gosling at \textsc{Sun Microsystems} and currently supported by \textsc{Oracle Corporation}. It has been chosen as the implementation language of \ca due to its portability, the wide availability of useful libraries (described below), and the author's experience in Java programming.
\begin{definition}{JGit}
\textsc{JGit} \cite{jgit} is an implementation of \textsc{Git} \cite{git} version control system in Java. It allows for programmatically performing such operations as inspecting a repository, tracking commits affecting a given file, or retrieving particular file revisions. \textsc{JGit} is used at the very beginning of the feature extraction process to obtain consecutive revisions of a source file, which are later compared by \textsc{ChangeDistiller}.
\end{definition}
\begin{definition}{ChangeDistiller}
\textsc{ChangeDistiller} \cite{change_distilling} is an \textsc{Eclipse} plugin (also available as a stand-alone library), developed in order to extract fine-grained source code changes by syntax-aware file differencing (for details see Section \ref{sec:change_distilling}). \textsc{ChangeDistiller} is incorporated in \ca to compare abstract syntax trees of successive file versions provided by \textsc{JGit} and to compute the numbers of changes of particular types (see Table \ref{tab:fine_grained_changes}).
\end{definition}
\begin{definition}{Weka}
The \textsc{Waikato Environment for Knowledge Analysis} (WEKA) \cite{weka} is an open source data mining software suite written in Java. It contains a broad collection of state-of-the-art machine learning algorithms and can be used either as a stand-alone application, or as a library. \ca utilizes \textsc{Weka} to train various predictive models (described in Section \ref{sec:machine_learning}) and predict the defect-proneness of unclassified methods.
\end{definition}
\begin{definition}{LIBSVM}
LIBSVM \cite{libsvm} is a library for \emph{support vector machines} (see Section \ref{sec:svm}) in C++ and Java. It supports classification (C-SVC, $\nu$-SVC), regression ($\varepsilon$-SVR, $\nu$-SVR) and distribution estimation (one-class SVM). LIBSVM bindings are provided for many languages and frameworks (\textit{e.g.} Python, R, Ruby). \ca uses a wrapper interface compatible with \textsc{Weka}~\cite{weka_libsvm}.
\end{definition}
\subsection{R}
R \cite{r} is a free, cross-platform software environment for statistical computations. It is not used for the implementation of \ca itself, but of auxiliary scripts (see Section \ref{sec:set-up}) developed for the evaluation of the algorithms presented in Section \ref{sec:machine_learning}.
\begin{definition}{foreign}
\textsc{foreign} \cite{foreign} is an R package containing functions for reading and writing data in formats specific to other software. It is used to import ARFF\footnote{\label{not:arff}\emph{Attribute-Relation File Format}. For details see \cite{arff}.} files generated by \textsc{Weka} library.
\end{definition}
\begin{definition}{cvTools}
\textsc{cvTools} \cite{cvTools} is an R package containing tools for cross-validation (see Section \ref{sec:methodology}) of classification and regression models. It is used to evaluate the performance of the machine learning algorithms described in Section \ref{sec:machine_learning}.
\end{definition}
\begin{definition}{rpart}
\textsc{rpart} \cite{rpart} is an R package containing an implementation of \emph{decision tree} algorithm described in Section \ref{sec:decision_tree}.
\end{definition}
\begin{definition}{randomForest}
\textsc{randomForest} \cite{randomForest} is an R package containing an implementation of \emph{random forest} classifier described in Section \ref{sec:random_forest}.
\end{definition}
\begin{definition}{e1071}
\textsc{e1071} \cite{e1071} is an R package providing a binding for LIBSVM and thus allowing for the use of \emph{support vector machines} in R.
\end{definition}
\begin{definition}{nnet}
\textsc{nnet} \cite{nnet} is an R package containing an implementation of \emph{neural network} classifier described in Section \ref{sec:neural_net}.
\end{definition}
\section{Interface}
\label{sec:interface}
\ca includes a command-line interface for users. Section \ref{sec:main_interface} presents the main functionalities offered by \ca as an integrated tool. Section \ref{sec:test_interface} describes a test interface, which provides more limited capabilities and was employed during system development and testing.
\subsection{Main interface}
\label{sec:main_interface}
The main interface is provided by the main class of the \ca runnable JAR. In order to use it, one must simply run the following command:
\begin{verbatim}
java -jar ChangeAnalyzer.jar <options>
\end{verbatim}
Of the options described below, exactly one input option (\texttt{extract} or \texttt{read}) and at least one output option (\texttt{save} or \texttt{classify}) must be chosen.
\subsubsection*{Input options}
\begin{itemize}[leftmargin=1.5in]
\item[\clopt{extract <path>}] Extract data from a \textsc{Git} repository under the given \texttt{<path>}. Extracted data can be saved to a file or used for method classification. When using this option, one of the below bug-estimation strategies has to be specified (for details see Section \ref{sec:bug-proneness}).
\begin{itemize}[leftmargin=0.5in]
\item[\clopt{linear <number>}] Use the \emph{linear} strategy, where the given \texttt{<number>} is the initial bug-proneness in each chunk.
\item[\clopt{geometric <number>}] Use the \emph{geometric} strategy, where the given \texttt{<number>} is the bug-proneness increase ratio.
\item[\clopt{weighted}] Use the \emph{weighted} strategy.
\end{itemize}
\item[\clopt{read <path>}] Read previously extracted data stored under the given \texttt{<path>}.
\end{itemize}
\subsubsection*{Output options}
\begin{itemize}[leftmargin=1.5in]
\item[\clopt{save <path>}] Save the extracted data under the given \texttt{<path>} in ARFF\footnote{See note \ref{not:arff} on page \pageref{not:arff}.} format.
\item[\clopt{classify <path>}] Predict the bug-proneness of all changed\footnote{By \emph{changed} I mean methods which have been affected by at least one code change since their last bug-fix.} methods and save classification results under the given \texttt{<path>} in CSV\footnote{\emph{Comma-Separated Values}. For details see \cite{csv}.} format.
\begin{itemize}[leftmargin=0.5in]
\item[\clopt{decision-tree}] Use the \emph{decision tree} classifier (see Section \ref{sec:decision_tree}).
\item[\clopt{random-forest}] Use the \emph{random forest} classifier (see Section \ref{sec:random_forest}).
\item[\clopt{svm}] Use the \emph{support vector machine} classifier (see Section \ref{sec:svm}).
\item[\clopt{neural-net}] Use the \emph{neural network} classifier (see Section \ref{sec:neural_net}).
\end{itemize}
\end{itemize}
\subsubsection*{Other options}
\begin{itemize}[leftmargin=1.5in]
\item[\clopt{help}] Display a list of available options with descriptions.
\end{itemize}
\subsection{Test interface}
\label{sec:test_interface}
The test interface provides a very limited functionality of extracting data from repositories and saving it. The format of extracted data is not compatible with the main interface -- it includes total numbers of fine-grained changes of all types (see Table \ref{tab:fine_grained_changes}) and three separate bug-proneness estimation columns: linear, geometric\textsubscript{0.5}, and weighted (for details see Section~\ref{sec:bug-proneness}).
In order to use the test interface, the following command has to be run:
\begin{verbatim}
java -cp ChangeAnalyzer.jar pl.edu.mimuw.changeanalyzer.ExtractAndSave \
<path1> <path2> ...
\end{verbatim}
It will result in extracting data from repositories under the provided paths and storing it in ARFF files in the current working directory. Each file is named after the last part of the corresponding repository path (if name collisions occur, files will be overwritten).
\section{Architecture}
\label{sec:architecture}
This section gives a broad overview of the architecture of \ca by briefly describing all system components. Subsequent sections contain information about individual packages, whose dependencies are presented in Figure \ref{fig:package_diagram}. The full class diagram is given in Figure~\ref{fig:class_diagram} on page \pageref{fig:class_diagram}.
Please note that the following description is not intended to serve as a complete technical documentation. For more details, visit the JavaDoc page: \url{http://wiezzel.github.io/changeanalyzer/}.
\begin{figure}[h!]
\centering
\input{figures/package_diagram.tex}
\caption{ChangeAnalyzer package diagram}
\label{fig:package_diagram}
\end{figure}
\begin{figure}[h!]
\begin{adjustwidth}{-2.5cm}{-2.5cm}
\centering
\input{figures/class_diagram.tex}
\caption{ChangeAnalyzer class diagram}
\label{fig:class_diagram}
\end{adjustwidth}
\end{figure}
\pack{Root}{root}{pl.edu.mimuw.changeanalyzer}
\noindent This package contains classes responsible for the interface of \ca (both main and test interface).
\subsubsection*{ChangeAnalyzer}
ChangeAnalyzer is the main class of the project. It provides the \texttt{main} method and several others, which compose the interface described in Section \ref{sec:main_interface}:
\begin{itemize}[noitemsep,topsep=1pt]
\item Extract data from a repository,
\item Read extracted data from an ARFF file,
\item Save extracted data to an ARFF file,
\item Classify instances (methods) with missing class attribute.
\end{itemize}
\subsubsection*{ChangeAnalyzerOptionParser}
This class is a parser of command-line options for \texttt{ChangeAnalyzer.main} method, based on Apache Commons\textsuperscript{TM} CLI2 library \cite{cli2}.
\subsubsection*{ExtractAndSave}
This class provides a test interface for performing classifier evaluation experiments (for details, see Section \ref{sec:test_interface}). It is capable of extracting data from repositories and saving unprocessed datasets.
\pack{Extraction}{extraction}{pl.edu.mimuw.changeanalyzer.extraction}
\noindent This package contains classes responsible for the extraction of method histories from software repositories.
\subsubsection*{RepoHistoryExtractor}
\texttt{RepoHistoryExtractor} is the master class of the extraction package. It contains a \loosett{ClassHistoryExtractor} and extracts all class histories and all commits from a repository.
\subsubsection*{ClassHistoryExtractor}
This class is responsible for the extraction of class histories. It holds a reference to a local \textsc{Git} repository. In order to extract a class history (containing the histories of all its methods), one has to provide the path of the .java file containing the class, relative to the repository root directory. The \textsc{ChangeDistiller} library is employed for this task.
\subsubsection*{AuthorInfo}
This class contains the following relevant information about a code author:
\begin{itemize}[noitemsep,topsep=1pt]
\item Name,
\item E-mail,
\item Total number of commits,
\item Total number of code changes.
\end{itemize}
\subsubsection*{AuthorInfoExtractor}
This class is responsible for gathering information about authors extracted from \texttt{CommitInfo} objects. It updates the numbers of commits and code changes assigned to an author according to the properties of their commits.
\subsubsection*{CommitInfo}
This class contains the following relevant information about a commit:
\begin{itemize}[noitemsep,topsep=1pt]
\item Author,
\item ID,
\item Timestamp,
\item Whether it is a bug-fix,
\item Number of code changes,
\item Number of changed methods.
\end{itemize}
\subsubsection*{CommitInfoExtractor}
This class extracts relevant information from \textsc{JGit} objects representing commits and stores it as \texttt{CommitInfo} objects. It updates the numbers of code changes and changed methods by analysing method histories provided by \textsc{ChangeDistiller}.
\subsubsection*{ClassHistoryWrapper}
This class wraps a collection of class histories and iterates over the histories of all their methods. It also allows for iteration over the histories of inner classes' methods.
\subsubsection*{MethodHistoryIterator}
\texttt{MethodHistoryIterator} is a static nested class of \texttt{ClassHistoryWrapper}. This iterator walks through all class histories by a \emph{depth-first search} on the composition graph and yields method histories. The next element to be returned by this iterator is always pre-loaded.
\subsubsection*{ExtractionUtils (interface)}
This is a utility class for repository information extraction, providing one static method for obtaining the HEAD of a repository.
\pack{Models}{models}{pl.edu.mimuw.changeanalyzer.models}
\noindent This package contains classes responsible for defining the data model and creation of its instances by processing of data extracted from software repositories.
\subsubsection*{DataSetBuilder (abstract)}
This is the abstract base class for dataset builders -- classes transforming method histories into model instances. Each builder class defines a set of attributes and provides methods to produce instances with these attributes. Created instances may represent commits, chunks, method snapshots, etc., depending on the model used by a particular builder.
\subsubsection*{ChunkDataSetBuilder (abstract)}
\extends{DataSetBuilder}
\noindent\texttt{ChunkDataSetBuilder} is a dataset builder that processes commits in chunks separated by bug-fixes. Defect-proneness scores are assigned by pluggable measures (strategies). Obtaining multiple scores for a single instance is possible by providing multiple bug-proneness measures to the builder.
\subsubsection*{DataSetProcessor (interface)}
This is an interface for objects processing datasets (i.e. modifying and/or removing existing features, as well as computing new ones). Typically, a dataset processor aggregates a few attribute processors (see the description of the attributes package on page \pageref{sec:attributes}).
\subsubsection*{DataSetProvider}
\texttt{DataSetProvider} is a top-level class responsible for retrieving datasets either by extracting them from repositories or by reading files with previously extracted data. A provider wraps functionalities provided by a \texttt{DataSetBuilder} and a \texttt{DataSetProcessor}, and additionally separates training instances from test instances.
\subsubsection*{ReadOnlyDataSetProvider}
\extends{DataSetProvider}
\noindent\texttt{ReadOnlyDataSetProvider} is a simple dataset provider incapable of extracting or processing data. It expects processed data with the class (target) attribute on the last position. Any attempts to call methods responsible for the unsupported operations result in throwing an exception.
\subsubsection*{DefaultDataSetProcessor}
\implements{DataSetProcessor}
\noindent This is a static nested class of \texttt{ReadOnlyDataSetProvider}. The only operation performed by this processor is marking the last attribute of a dataset as the class attribute.
\subsubsection*{ChangeCounter}
This is a simple class for counting code changes of different types. It holds a separate, internal counter for each change type.
\pack{Attributes}{attributes}{pl.edu.mimuw.changeanalyzer.models.attributes}
\noindent This package contains classes representing attributes of data models and operations which could be performed on them.
\subsubsection*{Attributes}
This is a class representing a set of model attributes.
\subsubsection*{AttributeProcessor (interface)}
\texttt{AttributeProcessor} is an interface for classes manipulating the attributes of a dataset.
\subsubsection*{SumAttributes}
\implements{AttributeProcessor}
\noindent This is an attribute processor class which adds to the given dataset a new attribute being the sum of the defined source attributes.
\subsubsection*{DeleteAttributes}
\implements{AttributeProcessor}
\noindent This is an attribute processor class which removes a group of defined attributes from the given dataset.
\subsubsection*{ReorderClassAttribute}
\implements{AttributeProcessor}
\noindent This is an attribute processor class which moves the class attribute to the last position.
\pack{Measures}{measures}{pl.edu.mimuw.changeanalyzer.models.measures}
\noindent This package contains bug-proneness measure classes representing estimation strategies defined in Section \ref{sec:bug-proneness}.
\subsubsection*{BugPronenessMeasure (interface)}
\texttt{BugPronenessMeasure} is an interface for classes assigning bug-proneness scores to model instances produced by a \texttt{ChunkDataSetBuilder}. A measure may be state-aware, in the sense that an assigned score depends not solely on the properties of a single method version, but on those of a whole chunk.
\subsubsection*{LinearMeasure}
\implements{BugPronenessMeasure}
\noindent This measure assigns bug-proneness score of 1.0 to the last commit of a chunk and linearly decreasing scores to the previous ones.
\subsubsection*{GeometricMeasure}
\implements{BugPronenessMeasure}
\noindent This measure assigns bug-proneness score of 1.0 to the last commit of a chunk and geometrically decreasing scores to the previous ones.
\subsubsection*{WeightedMeasure}
\implements{BugPronenessMeasure}
\noindent This measure computes bug-proneness scores proportionally to the numbers of fine-grained changes since the last bug-fix. It assigns each group of commits a score equal to the ratio of the number of changes in this group to the total number of changes in the whole chunk.
\pack{Standard implementations}{standard}{pl.edu.mimuw.changeanalyzer.models.standard}
\noindent This package contains standard implementations of abstract classes and interfaces defined in the models package.
\subsubsection*{StandardDataSetBuilder}
\extends{DataSetBuilder}
\noindent This data set builder class uses the commit group as a model. Each produced instance represents a group of subsequent commits starting with a commit directly succeeding a bug-fix. Groups range in size from a single commit up to a whole chunk.
\subsubsection*{StandardDataSetProcessor}
\implements{DataSetProcessor}
\noindent This is the standard dataset processor used by \ca for processing raw data sets built by a \texttt{StandardDataSetBuilder}. It performs the following operations:
\begin{itemize}[noitemsep,topsep=1pt]
\item Discard some unnecessary attributes,
\item Sum values of parameter change-related attributes into a new attribute,
\item Sum values of header change-related attributes into a new attribute.
\end{itemize}
\subsubsection*{StandardDataSetProvider}
\extends{DataSetProvider}
\noindent A \texttt{StandardDataSetProvider} uses a \texttt{StandardDataSetBuilder} and a \texttt{StandardDataSetProcessor} to create and process datasets.
\pack{Exceptions}{exceptions}{pl.edu.mimuw.changeanalyzer.exceptions}
\noindent This package contains exceptions thrown by \ca.
\subsubsection*{ChangeAnalyzerException}
This is the base class for all \ca exceptions.
\subsubsection*{ExtractionException}
\extends{ChangeAnalyzerException}
\noindent \texttt{ExtractionException} is an exception thrown when an error occurs during the extraction of information from a repository.
\subsubsection*{DataSetBuilderException}
\extends{ChangeAnalyzerException}
\noindent This is an exception thrown by the \texttt{DataSetBuilder} class. It indicates an error during the creation of a dataset.
\subsubsection*{ProcessingException}
\extends{ChangeAnalyzerException}
\noindent \texttt{ProcessingException} is an exception thrown when an error occurs during the processing of a dataset.
\subsubsection*{PredictionException}
\extends{ChangeAnalyzerException}
\noindent \texttt{PredictionException} is an exception thrown due to an error in performing the bug-proneness prediction.
\pack{Utilities}{utilities}{pl.edu.mimuw.changeanalyzer.util}
\noindent This package contains some universal, reusable utility classes.
\subsubsection*{LazyList<T>}
\texttt{LazyList} is a generic container class designed for storing sequences provided by disposable iterable objects. It wraps an ordinary list which is extended on demand by advancing the source iterator.
\subsubsection*{LazyIterator}
\texttt{LazyIterator} is an inner class of \texttt{LazyList} enabling iteration over its elements.
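The core idea of the lazy list can be sketched as follows. This is a minimal illustration of the mechanism described above, not \ca's actual implementation, and the class name is mine.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical sketch of the LazyList idea: elements are pulled from the
// disposable source iterator only when an index beyond the cached prefix
// is requested, so the source is traversed at most once and on demand.
public class LazyListSketch<T> {
    private final Iterator<T> source;
    private final List<T> cache = new ArrayList<>();

    public LazyListSketch(Iterator<T> source) {
        this.source = source;
    }

    public T get(int index) {
        // Extend the cached prefix just far enough to cover the index.
        while (cache.size() <= index && source.hasNext()) {
            cache.add(source.next());
        }
        return cache.get(index);
    }
}
```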
\chapter{Experiments}
\label{cha:experiments}
In order to evaluate the accuracy of different machine learning algorithms (presented in Section \ref{sec:machine_learning}) applied to the proposed data model (Section \ref{sec:model}), as well as the efficiency of the implementation, several experiments were performed. The data for analysis was obtained from open source Java projects. Experiments were conducted using both the \ca tool and a special purpose R script.
The following sections describe the general \textbf{methodology} (Section \ref{sec:methodology}), \textbf{experimental data} (Section \ref{sec:data}), \textbf{software set-up} (Section \ref{sec:set-up}), and \textbf{results} (Section \ref{sec:results}) followed by a short \textbf{commentary} (Section \ref{sec:commentary}).
\section{Methodology}
\label{sec:methodology}
In the intended ordinary use of \ca, all code changes followed by a bug-fix compose the training set, while the unfixed ones (lacking bug-proneness scores) make up the test set. However, predictions of defect-proneness made in such a situation are not verifiable until new fixes appear. Therefore, in order to evaluate the proposed predictive models, part of the training set has to be used as the test set.
A popular technique applied in such situations is called \emph{cross-validation}. As described in \cite[p. 249]{encyclopedia}, cross-validation starts with splitting the dataset into several subsets (also called \emph{folds}) $S_1, \ldots, S_n$ of approximately equal size. Then the learning algorithm is applied $n$ times, in the $i$-th step using fold $S_i$ as the test set and the union of the other folds as the training set. Evaluation scores from all steps are averaged. In order to reduce variability, cross-validation is often applied multiple times with different dataset partitions, averaging the results. In my experiments I used 20-fold cross-validation repeated 10 times for each dataset.
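The fold-splitting step can be sketched as below (the class name and round-robin assignment are illustrative choices of mine, not the exact partitioning used by the experimental script): indices are shuffled and then distributed into $k$ folds of approximately equal size.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical sketch of k-fold splitting for cross-validation:
// shuffle the instance indices, then assign them round-robin to k
// folds of approximately equal size. In step i of cross-validation,
// fold i serves as the test set and the rest as the training set.
public class FoldSketch {
    public static List<List<Integer>> folds(int n, int k, long seed) {
        List<Integer> indices = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            indices.add(i);
        }
        Collections.shuffle(indices, new Random(seed));
        List<List<Integer>> result = new ArrayList<>();
        for (int f = 0; f < k; f++) {
            result.add(new ArrayList<>());
        }
        for (int i = 0; i < n; i++) {
            result.get(i % k).add(indices.get(i));
        }
        return result;
    }
}
```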
Another important issue is the choice of an appropriate accuracy measure. I decided to utilize the widely applied \emph{root mean squared error} (RMSE), which is defined in \cite{rmse} with the following formula:
\[
RMSE = \sqrt{\frac{1}{n} \sum\limits_{i=1}^{n}\left(y_i - \hat{y}_i\right)^{2}}
\]
\noindent where $n$ is the number of test observations, $y$ is the vector of \emph{gold standard} values and $\hat{y}$ is the vector of predicted values.
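The formula above transcribes directly into code; the following is a minimal sketch (the class name is mine) that computes RMSE for two equal-length vectors.

```java
// Direct transcription of the RMSE formula: the square root of the
// mean of squared differences between gold standard and predicted values.
public class RmseSketch {
    public static double rmse(double[] gold, double[] predicted) {
        double sum = 0.0;
        for (int i = 0; i < gold.length; i++) {
            double diff = gold[i] - predicted[i];
            sum += diff * diff;
        }
        return Math.sqrt(sum / gold.length);
    }
}
```

For a perfect prediction the error is 0; for gold values $(0, 0)$ and predictions $(3, 4)$ it equals $\sqrt{25/2} \approx 3.536$.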
\section{Data}
\label{sec:data}
Ten large open-source Java projects were used as experimental data. They were cloned from their \textsc{GitHub} repositories on 23 February 2015. The list of projects, together with several size-related statistics, is presented in Table~\ref{tab:test_datasets}.
\begin{table}[H]
\centering
\caption{Test datasets}
\label{tab:test_datasets}
\input{tables/test_datasets.tex}
\end{table}
\section{Experimental set-up}
\label{sec:set-up}
Repositories containing the test data were cloned from \textsc{GitHub} into local directories. Then the test interface of \ca (see Section \ref{sec:test_interface}) was used to extract data from the repositories into ARFF files.
The data was further processed by an R script, which performed the following operations:
\begin{enumerate}[itemsep=0.5pt]
\item Read the data.
\item Pre-process the data.
\begin{enumerate}[label={\arabic{enumi}.\arabic*},topsep=0pt]
\item Remove unnecessary features.
\item Sum header changes\footnote{See note \ref{not:header_changes} on page \pageref{not:header_changes}.} into one attribute.
\item Sum parameter changes\footnote{See note \ref{not:parameter changes} on page \pageref{not:parameter changes}.} into one attribute.
\end{enumerate}
\item Run a 20-fold cross-validation 10 times for every dataset, every model, and every bug-proneness score.
\item Average the cross-validation results by datasets.
\end{enumerate}
A full code listing of the script can be found in Appendix \ref{cha:experimental_script}.
All experiments were performed on a single machine with 4\,GB of RAM and an \textsc{Intel}\textsuperscript{\textregistered} \textsc{Core}\textsuperscript{TM} 2 \textsc{Duo} T6570 CPU, running a 32-bit Windows XP operating system. Extraction times were measured using the system clock.
\section{Results}
\label{sec:results}
\subsection{Accuracy}
\label{sec:accuracy}
Table \ref{tab:accuracy} and the corresponding Figure \ref{fig:accuracy} present the results of the cross-validation, for all classifiers and bug-proneness estimation strategies, averaged over datasets. The results were rounded to three decimal places.
\begin{table}[h]
\caption{RMSE averaged by datasets}
\label{tab:accuracy}
\input{tables/accuracy.tex}
\end{table}
\begin{figure}[h]
\centering
\input{figures/accuracy.tex}
\caption{Classification accuracy}
\label{fig:accuracy}
\end{figure}
\subsection{Performance}
\label{sec:performance}
A comparison of project sizes (in numbers of classes) and the corresponding extraction times (in seconds) is presented in Table \ref{tab:performance} below. The extraction times were rounded to two decimal places. Figure \ref{fig:performance} additionally includes a plot of the following function, fitted to the data using \emph{linear regression}:
\[
time(size) = 0.49 \cdot size + 24.95
\]
\begin{table}[H]
\caption{Data extraction performance}
\label{tab:performance}
\centering
\input{tables/performance.tex}
\end{table}
\begin{figure}[H]
\centering
\input{figures/performance.tex}
\caption{Data extraction performance}
\label{fig:performance}
\end{figure}
\section{Commentary}
\label{sec:commentary}
The experimental results presented in the previous section clearly show that the random forest classifier achieves the best results of all the proposed machine learning algorithms. Regardless of the bug-proneness estimation strategy, the RMSE value for random forest was the lowest.
For every strategy, the performance order of the classifiers was the same, which confirms that the different ways of defining the defect-proneness parameter do not constitute multiple distinct problems, but rather different aspects of a single one.
Regarding the strategies, it is noticeable that the geometric one (with $\alpha = 0.5$) results in significantly higher error values than the other two. This behaviour may be caused by the large difference (0.5) between the bug-proneness scores assigned by this strategy to the last two commits in each chunk.
Time measurements show that the extraction of data from repositories is a rather computationally heavy process, approximately linearly dependent on the size of the analysed project. The extraction time ranges from about 1 minute for the smallest projects up to nearly 1 hour for the biggest one. Such values exclude the possibility of integrating the proposed method into an interactive tool, leaving only the option of batch processing.
\chapter{Conclusions}
\label{cha:conclusions}
In this thesis I have addressed the topic of identification and prediction of defects in software. Further, I have presented my approach to this problem and described the implementation of a software tool -- \textsc{ChangeAnalyzer} -- which realizes this approach. Finally, I have presented the results of experiments performed in order to provide a fair empirical basis for the assessment of the tool and its underlying methods. In this chapter I conclude my work by offering a judgement of the presented ideas, describing possible threats to the validity of my research and outlining the possibilities of its further extension.
\section{Assessment of approach}
\label{sec:assessment}
Due to the unique character of the target variable, the outcomes of the classifiers' evaluation are not comparable with the results presented in related work. In my opinion, they should be described as promising, but still far from perfect. An error rate exceeding 30\% of the target variable's range definitely calls for further improvement.
\section{Threats to validity}
\label{sec:threats}
\subsection{Abstract target variable}
\label{sec:abstract_variable}
The \emph{bug-proneness} parameter, defined in Section \ref{sec:general_assumptions}, contrary to most variables presented in Section \ref{sec:dependent_variable}, is not directly observable. In other words, such a choice of dependent variable may be criticized as lacking a clear empirical interpretation, which makes it scarcely employable in practical solutions.
However, this charge can be addressed by pointing out that \emph{bug-proneness} (under every computation strategy described in Section \ref{sec:bug-proneness}) can be interpreted as a measure of the \emph{temporal proximity of the next commit}, which is definitely information of practical value.
\subsection{Simple identification of bugs}
\label{sec:simple_identification}
\ca uses the simplest possible defect identification method, whose drawbacks have already been described in Section \ref{sec:identification}. Due to its inaccuracy, the intended gold standard, which serves as a basis for prediction, may significantly diverge from the \emph{true gold standard}.
\subsection{Usage of default parameters}
\label{sec:default_parameters}
The machine learning algorithms evaluated in the experiments were used mostly with the default parameter values defined by the authors of the respective libraries. These values may not be well suited to the specific character of the data processed by \ca. Therefore, the presented results may underestimate the possible performance of the used classifiers.
\subsection{Choice of test data}
\label{sech:choice_of_data}
The test data used for the experimental evaluation of the proposed approach was obtained exclusively from publicly available, open source projects. Such projects may have some distinct characteristics arising from their unrestrained organizational structure. Obtaining more representative results would also require analysing commercial software, developed using different methodologies.
Moreover, no cross-project prediction was performed. Numerous studies show that it is a significantly more challenging task than building a within-project model \cite{recalling, towards, cross-project}. As Zimmermann \textit{et al.} point out in \cite{cross-project}, universal models are of high value for new software projects, from which no sufficient amount of data to train a classifier can yet be extracted.
\subsection{Cost-insensitive evaluation}
\label{sec:cost-insensitive}
The evaluation of models was performed using the RMSE measure, which does not take into account the amount of effort that would be wasted in case of a misclassification. Estimation of the cost of potential corrective actions would provide a basis for a more robust assessment. However, the construction of an error measure which addresses this issue requires a deeper insight into the software development process. It also has to be mentioned that the possible consequences of the two different kinds of errors\footnote{Namely: \emph{false positives} and \emph{false negatives}. A false positive (also called a \emph{type I error}) is an indication of the presence of the predicted condition (in our case, a software defect) in the absence of this condition. On the other hand, a false negative (a \emph{type II error}) is a situation where the condition has not been discovered despite its occurrence.} may vary a lot. Arisholm \textit{et al.} in \cite{systematic} propose the \emph{cost-effectiveness} measure, which tries to embrace all the mentioned restrictions.
\section{Possible further work}
\label{sec:further_work}
\subsection{New features}
\label{sec:new_features}
Expanding the feature set used by \ca may improve its performance. Possible new attributes which could be introduced into the tool include:
\begin{itemize}
\item \textbf{Static code metrics}, such as number of method parameters or statements, cyclomatic complexity, or Halstead effort (for more examples, see Table \ref{tab:static_metrics}),
\item \textbf{Delta metrics}, used by Arisholm \textit{et al.} in \cite{systematic}, which represent the changes in static metrics over time,
\item \textbf{More historical metrics}, including all listed in Table \ref{tab:historical_metrics}.
\end{itemize}
\subsection{Advanced bug identification methods}
\label{sec:advanced_identification}
In order to address the issue raised in Section \ref{sec:simple_identification}, some improvements can be made to the defect identification procedure. It is possible to extend \ca to utilize information from a bug-tracking system, which is a more reliable data source than commit messages. Another possibility is integration with an external bug identification tool, such as \textsc{ReLink} \cite{ReLink}.
Independently of the above enhancements, which concern bug-fix identification, there is also room for improvement in the tracking of bug-introducing changes. By incorporating a tailored version of the SZZ algorithm (described in \cite{SZZ}), a new defect-proneness computation method could be implemented. Based on the matches provided by SZZ, increases of bug-proneness could be ascribed exclusively to commits affecting those parts of the code which are later modified by a bug-fix.
\subsection{Parameter tuning}
\label{sec:paramater_tuning}
The accuracy of the machine learning algorithms used by \ca can be improved by \emph{parameter tuning}. The cross-validation method used for the evaluation (see Section \ref{sec:methodology}) can also be employed for this purpose. For the tuning to be successful, it has to be performed individually for each analysed project. This, however, would largely impair the performance of the tool, so the scope of the tuning has to be limited by a prior analysis of a possibly broad range of software projects.
\subsection{On-line learning}
\label{sec:online_learning}
In its current shape, \ca does not support any form of incremental development of prediction models. After new training data is obtained (which happens after every bug-fix), the extraction of method histories and the training of a model need to be repeated. Such a design was chosen for the simplicity and brevity of the implementation, but it could be revised in further development. The ability to limit the scope of computation to the latest code changes would provide a great performance boost for \ca.
\printbibliography[heading=bibintoc]
\begin{appendices}
\chapter{Experimental script}
\label{cha:experimental_script}
\lstinputlisting[language=R,basicstyle=\ttfamily\small,commentstyle=\ttfamily\small,nolol]{experimental_script.r}
\end{appendices}
\end{document}