This repo contains two main parts for paper HyperINF:
- Mislabeled Data Detection
- Data Selection for LLM and VLM
In this task, we use the code of DataInf and compare our HyperINF with the privided baselines. Our implementation of HyperINF can be found in HyperINF/Mislabeled_Data_Detection/src/influence.py
.
You can run python Mislabeled_Data_Detection/test.py
to see the performance comparision among all the methods. We also provide the result of detection on COLA
dataset as an example in the image cola_r=16_detection_rate.pdf
. time_inverse.ipynb
is used for Mislabeled Data Detection illustration, and time cost for inverse matrix computation among Schulz and other popular algorithms.
We ultilize the code of Prismatic-VLM, please clone that repo first, and follow the instruction of it to build the environment and download the dataset (it could be large for VLM, so it will take a while to download).
For Data selection for LLM, we choose the datasets which are available in HuggingFace: QASC, PIQA, LogiQA and HellaSwag.
We modify the original codes to support only LLM and data selection for both LLM and VLM. We provide the modified codes in LLM_VLM_Finetune/scripts
and LLM_VLM_Finetune/prismatic
. You can add and replace the files in the original repo with the files in our repo.
-
The file
data_selection_vlm.py
is used for Data Selection for VLM, it will compute the gradients of val dataset and the influence score of each training data points, then sort the data points according to the influence score and save them. You can change thestage
in the config todata-pruning_llm
for using the language model's last layer to compute the influence score ordata-pruning_projector
for using the projector to compute the influence score. -
The file
pretrain_llm.py
is used for Finetuing LLM, anddata_selection_llm.py
is used for Data Selection for LLM. It is also modified from the original codes for supporting the data selection for LLM.
We provide all the baselines in our paper to select data, including DataInf
, LiSSA
, TracIN
and our approach HyperINF
. You can change the method
in the train_strategy.compute_training_samples_IF
to select different methods.