Supplementary Material for the Paper "Validation on Machine Reading Comprehension Software Without Annotated Labels: A Property-Based Method"
This is the supplementary material for the ESEC/FSE'21 research paper "Validation on Machine Reading Comprehension Software Without Annotated Labels: A Property-Based Method".
It contains the test case generation tool, the experimental replication package, and the detailed experimental results for the paper.
We implement a Python library to generate validation input sets with any one of the seven proposed MRs. Each sample in an input set is a pair of inputs (an eligible source input and its corresponding follow-up input).
All the code for this tool is stored in the `tool` directory.
- First, prepare the necessary dependencies for this tool. You can run `pip install -r requirements.txt` to set up the experiment environment.
- Use `from MT4MRC import MRs` to import the library, and instantiate a handler with `handler = MRs()`.
- Load a raw dataset (no labels needed), from which the tool will pick eligible source inputs. (The current version only supports the BoolQ dataset used in our experiments; you can extend it to other datasets by taking their data formats and fields into account. A sketch of the assumed input format is given after this list.)
- Iterate over all the samples in the dataset and use the handler to produce the corresponding follow-up cases with the given MR.
- Export the generated test cases.
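For reference, each line of a raw BoolQ-style jsonl file is a JSON object. The minimal sketch below is our own illustration of the fields we assume here (the standard BoolQ fields); please check the tool's code for the fields it actually reads. Note that the answer label is not required for test case generation.

import json

# Illustration only: one line of a raw BoolQ-style jsonl file, with the
# standard BoolQ fields. The "answer" label is not needed for generation.
example_line = json.dumps({
    "question": "is windermere the largest lake in england",
    "passage": "Windermere is the largest natural lake in England. ...",
    "answer": True,
    "title": "Windermere",
})
print(example_line)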
An example of obtaining eligible source inputs and follow-up inputs from the BoolQ test set with MR1-1:
from MT4MRC import MRs

handler = MRs()
file_path = "boolq_test/test.jsonl"
with open(file_path, "r", encoding="utf-8") as f:
    lines = f.readlines()

inputsets = []
for line in lines:
    cases = handler.generate(data=line, mr="1_1")
    if cases is not None:  # judge the eligibility of the sample in 'line'
        # cases[0] and cases[1] are the source input and follow-up input, respectively
        inputsets.append((cases[0], cases[1]))

# dump the generated inputs into jsonl files
with open("source.jsonl", "w", encoding="utf-8") as f_source, \
        open("follow-up.jsonl", "w", encoding="utf-8") as f_followup:
    for case in inputsets:
        print(case[0], file=f_source)
        print(case[1], file=f_followup)
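To build validation input sets for several MRs in one pass, the loop above can be wrapped in a small helper and repeated per MR identifier. The sketch below is our own illustration, not part of the library; only the identifier "1_1" is shown in this document, so the identifiers of the other six MRs accepted by MRs.generate should be taken from the library itself.

from MT4MRC import MRs

def generate_pairs(raw_path, mr_id, source_path, followup_path):
    """Generate (source, follow-up) input pairs for one MR and dump them to jsonl files."""
    handler = MRs()
    count = 0
    with open(raw_path, "r", encoding="utf-8") as f, \
            open(source_path, "w", encoding="utf-8") as f_source, \
            open(followup_path, "w", encoding="utf-8") as f_followup:
        for line in f:
            cases = handler.generate(data=line, mr=mr_id)
            if cases is None:  # the sample is not eligible for this MR
                continue
            print(cases[0], file=f_source)
            print(cases[1], file=f_followup)
            count += 1
    return count

# "1_1" is the only identifier shown in this guide; extend the list with the
# identifiers of the other six MRs accepted by MRs.generate.
for mr_id in ["1_1"]:
    n = generate_pairs("boolq_test/test.jsonl", mr_id,
                       f"source_{mr_id}.jsonl", f"follow-up_{mr_id}.jsonl")
    print(f"MR {mr_id}: {n} eligible pairs generated")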
We provide the code to replicate our experiments, including the scripts to build and train the four objective models and to validate the trained models with a generated test case set.
All the code for replication is stored in the `replicate` directory.
Since the four objective models have different architectures and training paradigms, we provide four independent scripts to build and train each of them. The usage of these scripts is as follows:
# RNN
python train_rnn.py --data_dir /path/to/dir_with_train.jsonl --output_dir /path/to/save_model
# BERT
python train_boolq_bert.py \
--model_type bert --model_name_or_path bert-base-cased \
--do_train --do_eval --do_lower_case \
--data_file /path/to/dir_with_train.jsonl \
--max_seq_length 256 --learning_rate 1e-5 --num_train_epochs 1000 --logging_steps 500 \
--per_gpu_eval_batch_size=8 --per_gpu_train_batch_size=8 \
--output_dir /path/to/save_model --tbname boolq_bert
# RoBERTa
python train_boolq_roberta.py \
--model_type roberta --model_name_or_path roberta-large \
--do_train --do_eval --do_lower_case \
--data_file /path/to/dir_with_train.jsonl \
--max_seq_length 256 --learning_rate 1e-5 --num_train_epochs 1000 --logging_steps 500 \
--per_gpu_eval_batch_size=8 --per_gpu_train_batch_size=8 \
--output_dir /path/to/save_model --tbname boolq_roberta
# T5
python train_t5.py --data_dir /path/to/dir_with_train.jsonl --output_dir /path/to/save_model
We also provide four independent scripts to validate the corresponding objective models. The usage of these scripts is as follows:
# RNN
python eval_rnn.py --mr MRID --data_dir /path/to/dir_with_source&followup.jsonl --model_dir /path/to/saved_model
# BERT
python eval_boolq_bert.py \
--mr MRID \
--model_type bert --model_name_or_path bert-base-cased \
--do_eval --do_lower_case \
--data_file /path/to/dir_with_source&followup.jsonl \
--per_gpu_eval_batch_size=8 \
--output_dir /path/to/saved_model
# RoBERTa
python eval_boolq_roberta.py \
--mr MRID \
--model_type roberta --model_name_or_path roberta-large \
--do_eval --do_lower_case \
--data_file /path/to/dir_with_source&followup.jsonl \
--per_gpu_eval_batch_size=8 \
--output_dir /path/to/saved_model
# T5
python eval_t5.py --mr MRID --data_dir /path/to/data --model_dir /path/to/saved_model
An example of evaluating T5 on the BoolQ dev set:
- Run `python eval_t5.py --mr 1-1 --data_dir boolq_val/MR1-1/T5 --model_dir /model/T5`.
- The script will output `0.5470459518599562`, i.e., a violation rate of 54.70%.
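For reference, the violation rate is the fraction of (source, follow-up) input pairs whose predictions violate the given MR. The sketch below is our own illustration of this computation over Boolean answers, not the project's evaluation code; it assumes the predictions have already been collected, and uses a hypothetical expect_flip switch for MRs that expect the follow-up answer to be reversed rather than kept.

def violation_rate(source_preds, followup_preds, expect_flip=False):
    """Fraction of (source, follow-up) pairs whose predictions violate the expected relation.

    expect_flip=False: the MR expects the two predictions to be identical.
    expect_flip=True:  the MR expects the follow-up prediction to be reversed.
    """
    assert len(source_preds) == len(followup_preds) and source_preds
    violations = 0
    for s, f in zip(source_preds, followup_preds):
        satisfied = (s != f) if expect_flip else (s == f)
        if not satisfied:
            violations += 1
    return violations / len(source_preds)

# Toy usage: 2 of 4 pairs break an equality-type MR -> 0.5
print(violation_rate([True, True, False, False], [True, False, False, True]))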
Due to limited space, the paper does not include all the detailed results for RQ2 and RQ4. Here we release the results for all four objective models, i.e., RNN, BERT, RoBERTa, and T5.
These detailed results are stored in the `figure` directory:
- figure/RQ2_full.png: the results of RQ2 on all four objective models.
- figure/RQ4_full.csv: the results of RQ4 on all four objective models.
If you find our paper useful, please cite it as:
@inproceedings{fse21-MT4MRC,
author = {Chen, Songqiang and Jin, Shuo and Xie, Xiaoyuan},
editor = {Spinellis, Diomidis and Gousios, Georgios and Chechik, Marsha and Penta, Massimiliano Di},
title = {Validation on Machine Reading Comprehension Software without Annotated
Labels: A Property-Based Method},
booktitle = {29th {ACM} Joint European Software Engineering Conference
and Symposium on the Foundations of Software Engineering, {ESEC/FSE} 2021, Athens,
Greece, August 23-28, 2021},
pages = {590--602},
publisher = {{ACM}},
year = {2021},
doi = {10.1145/3468264.3468569}
}