Commit

guide to codebase; add inputs & outputs
jhaber-zz committed Mar 13, 2023
1 parent eec75a8 commit b4d23bb
Showing 10 changed files with 177 additions and 95 deletions.
3 changes: 3 additions & 0 deletions .gitignore
100644 → 100755
@@ -127,3 +127,6 @@ dmypy.json

# Pyre type checker
.pyre/

# extra code
code/filter_tweets.py
51 changes: 49 additions & 2 deletions README.md
100644 → 100755
@@ -2,12 +2,59 @@

## Introduction

This is the replication repository for submission to [DATA 2023, the 12th International Conference on Data Science, Technology and Applications](https://data.scitevents.org/). (more info forthcoming)
This repository contains code for the submission to [DATA 2023, the 12th International Conference on Data Science, Technology and Applications](https://data.scitevents.org/) titled "Identifying High Quality Training Data for Misinformation Detection".


## Guide to Codebase

(forthcoming)
### [`1_phrase_sampling.py`](code/1_phrase_sampling.py)
* **Description**: Samples tweets containing specific keywords/phrases (see the sketch below)
* **Inputs**: list of keywords to search (required argument)
* **Outputs**: list of tweets containing the keywords (.csv file)
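
A minimal sketch of the phrase-matching step (the actual script wraps this logic in a Spark job class with command-line arguments; the `full_text` column name and file paths below are placeholders, not the script's real defaults):

```python
# Hypothetical sketch: keep only tweets whose text matches any of a set of phrases.
import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("phrase_sampling_sketch").getOrCreate()
tweets = spark.read.json("data/PROJECT/FILEPATH.json")  # placeholder input path

phrases = ["5g", "microchip"]   # example keywords
pattern = "|".join(phrases)     # OR matching; the script also supports AND via --phrase_conditional

sample = tweets.filter(f.lower(f.col("full_text")).rlike(pattern))
sample.select("id_str", "full_text").write.csv("output/phrase_sample", header=True)
```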

### [`2_process_mturk_results.ipynb`](code/2_process_mturk_results.ipynb)
* **Description**: Computes agreement scores for the hand-labeled sample from Mechanical Turk; checks for bad workers (see the sketch below).
* **Inputs**: list of tweets, each labeled by multiple workers, with filename format 'myth_{myth_name}_sample_{sample_size}_{date}-results.csv'.
* **Outputs**: N/A (agreement scores, worker validation)
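
A rough sketch of the agreement check, assuming standard MTurk export columns (`HITId`, `Answer.is_myth` are assumptions here); the notebook itself also inspects individual workers:

```python
# Hedged sketch: share of workers giving the modal answer for each HIT (task).
import pandas as pd

df = pd.read_csv("myth_5g_sample_500_20200601-results.csv")  # placeholder filename

def per_hit_agreement(frame, col="Answer.is_myth"):
    # For each HIT, the fraction of workers who gave the most common answer
    return frame.groupby("HITId")[col].agg(
        lambda answers: answers.value_counts(normalize=True).iloc[0]
    )

agreement = per_hit_agreement(df)
print("Mean per-HIT agreement:", round(agreement.mean(), 3))
print("Share of unanimous HITs:", round((agreement == 1.0).mean(), 3))
```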

### [`3_label_mturk_results.py`](code/3_label_mturk_results.py)
* **Description**: Converts MTurk results to the format [tweet_id, text, label]. Base input filepath(s) must be passed as a command-line argument; these will be appended with `-results.csv` to find input data and `-labeled.csv` to save output data (see the sketch below).
* **Inputs**: list of tweets, each labeled by multiple workers, with filename format 'myth_{myth_name}_sample_{sample_size}_{date}-results.csv'
* **Outputs**: list of validated labeled tweets with filename format 'myth_{myth_name}_sample_{sample_size}_{date}-labeled.csv'
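
A simplified sketch of the naming convention plus a majority-vote reduction (the `Input.*`/`Answer.*` column names are assumptions, not necessarily what the script uses):

```python
# Hedged sketch: read '<base>-results.csv', collapse worker labels, write '<base>-labeled.csv'.
import pandas as pd

def label_results(base_fp):
    results = pd.read_csv(f"{base_fp}-results.csv")
    labeled = (
        results.groupby("HITId")
        .agg(tweet_id=("Input.tweet_id", "first"),
             text=("Input.text", "first"),
             label=("Answer.is_myth", lambda s: s.mode().iloc[0]))  # majority vote
        .reset_index(drop=True)
    )
    labeled.to_csv(f"{base_fp}-labeled.csv", index=False)

label_results("myth_5g_sample_500_20200601")  # placeholder base filepath
```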

### [`4_train_classifiers.ipynb`](code/4_train_classifiers.ipynb)
* **Description**: Trains and evaluates classification models for misinformation detection: k-Nearest Neighbors, Random Forest, Decision Tree, Multinomial Naive Bayes, Logistic Regression, Support Vector Machine, and Multi-Layer Perceptron. Uses 10-fold cross-validation to select the best model (see the sketch below).
* **Inputs**: list of validated labeled tweets with filename format 'myth_{myth_name}_sample_{sample_size}_{date}-labeled.csv'. If negative cases are needed for a focal myth, labeled positives for other myths may also be borrowed (it is very unlikely that a tweet contains both myths)
* **Outputs**: ML models and vectorizers (with TF-IDF weighting)
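
A hedged sketch of this setup with scikit-learn (the filename, the 'text'/'label' column names, and the model subset shown are illustrative):

```python
# Sketch: TF-IDF features plus 10-fold cross-validation over several candidate classifiers.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

df = pd.read_csv("myth_5g_sample_500_20200601-labeled.csv")  # placeholder filename
X, y = df["text"], df["label"]

candidates = {
    "knn": KNeighborsClassifier(),
    "random_forest": RandomForestClassifier(),
    "naive_bayes": MultinomialNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

for name, clf in candidates.items():
    pipe = make_pipeline(TfidfVectorizer(stop_words="english"), clf)
    scores = cross_val_score(pipe, X, y, cv=10, scoring="f1_macro")  # 10-fold CV
    print(f"{name}: mean macro-F1 = {scores.mean():.3f}")
```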

### [`5_sample_tweets.ipynb`](code/5_sample_tweets.ipynb)
* **Description**: Uses classifiers trained on labeled tweets (about a myth vs. not) to filter tweets from March to August 2020 (especially April-May) to only those with a higher probability of being in the minority class than the majority class. To help the classifier learn to detect both classes amidst our imbalanced data, the new sample is predicted to be 90% minority class and 10% majority class. This sample will be used to select tweets for hand-coding that fall into minority classes, which are hard to capture from the first round of ML models. Data source is tweets with hashtags related to Covid-19 (see the sketch below).
* **Inputs**: unlabeled tweets; ML models to use for predictions
* **Outputs**: sample of unlabeled tweets--mostly predicted to be misinformation--for hand-labeling, with filename format '{myth_name}_sample_{sample_size}_{date}.csv'
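
A sketch of the sampling logic, assuming a fitted vectorizer and classifier saved to disk (e.g. with joblib) and unlabeled tweets with a 'text' column; the paths, names, and exact 90/10 mechanics are illustrative:

```python
# Hedged sketch: oversample tweets predicted to be in the minority (misinformation) class.
import joblib
import pandas as pd

vectorizer = joblib.load("models/5g_vectorizer.joblib")   # placeholder paths
clf = joblib.load("models/5g_logistic_regression.joblib")

tweets = pd.read_csv("unlabeled_covid_tweets.csv")        # placeholder filename
probs = clf.predict_proba(vectorizer.transform(tweets["text"]))[:, 1]  # assumes class 1 = myth

minority = tweets[probs > 0.5]    # predicted more likely misinformation than not
majority = tweets[probs <= 0.5]

n = 1000                          # target sample size
sample = pd.concat([
    minority.sample(min(len(minority), int(n * 0.9)), random_state=42),  # ~90% predicted minority
    majority.sample(min(len(majority), int(n * 0.1)), random_state=42),  # ~10% predicted majority
])
sample.to_csv("5g_sample_1000_20200815.csv", index=False)  # placeholder output name
```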

### [`utils.py`](code/utils.py)
* **Description**: Custom function for loading tweets from Google Cloud Storage (see the sketch below).
* **Inputs**: Storage bucket name and tweet filepaths
* **Outputs**: Pandas DataFrame containing tweets
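
An approximate sketch of this kind of helper (bucket and blob names are placeholders; the repo's `gcs_read_json_gz` may differ in signature and details):

```python
# Hedged sketch: read a gzipped JSON-lines file of tweets from GCS into a DataFrame.
import io

import pandas as pd
from google.cloud import storage

def load_tweets(bucket_name, blob_path, nrows=None):
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_path)
    data = blob.download_as_bytes()  # raw .json.gz bytes
    return pd.read_json(io.BytesIO(data), lines=True, compression="gzip", nrows=nrows)

df = load_tweets("my-tweet-bucket", "raw/covid/tweets_20200401.json.gz", nrows=1000)
```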

### [`models/`](models/) folder
Contains final models and vectorizers (with TF-IDF weighting) for each myth:
* 5G
* antibiotics
* disinfectants
* home_remedies
* hydroxychloroquine
* mosquitoes
* uv_light
* weather

Using the following model types:
* K-Nearest Neighbors
* Decision Tree
* Random Forest
* Multinomial Naive Bayes
* Logistic Regression
* Multi-Layer Perceptron


## Abstract
54 changes: 20 additions & 34 deletions code/1_phrase_sampling.py
@@ -1,11 +1,17 @@
#!/usr/bin/env python
# coding: utf-8

""" Sample tweets containing specific phrases
Usage
@inputs: list of keywords to search (required argument)
@outputs: list of tweets containing the keywords (.csv file)
@usage:
-----
```bash
python -m jobs.sampling.sample_phrases \
--input "data/election2020/#2020election_20190616.json" \
--input "data/PROJECT/FILEPATH.json" \
--output output/ \
--phrases \
"love" \
@@ -14,46 +14,22 @@
--preprocessing_choices 'stopwords' 'lowercase' 'stemmed'
```
```bash
spark-submit \
--py-files dist/shared.zip \
jobs/sampling/sample_phrases.py \
--input "data/election2020/#2020election_20190616.json" \
--output output/ \
--phrases \
"already" \
"cut the field" \
--limit 5 \
--lang "en" \
--phrase_conditional 'AND'
```
```bash
gcloud dataproc jobs submit pyspark \
--cluster "colton" \
--region "us-east4" \
--py-files "gs://build-artifacts/pyspark-shared.zip" \
gs://build-artifacts/pyspark-jobs/sampling/sample_phrases.py \
-- \
--input \
"gs://project_gun-violence/raw/hashtag/2017" \
--output output/ \
--phrases \
"efforts" \
"speak" \
"election" \
--limit 2000 \
--lang "en"
```
"""

###############################################
# Import packages
###############################################

import pyspark.sql.functions as f

from shared.base_job import BaseJob
from shared.job_helpers import construct_output_filename


###############################################
# Define functions
###############################################

class SamplePhrasesJob(BaseJob):

def __init__(self):
@@ -137,6 +119,10 @@ def process(self):
self.df = self.df.select("id_str", "date", self.args.target_attr)


###############################################
# Execute functions
###############################################

if __name__ == "__main__":

sp = SamplePhrasesJob()
12 changes: 9 additions & 3 deletions code/2_process_mturk_results.ipynb
@@ -6,7 +6,13 @@
"source": [
"# Process MTurk Results\n",
"\n",
"Performs a basic analysis of mTurk results for any results file."
"Computes agreement scores of hand-labeled sample using Mechnical Turk; checks for bad workers.\n",
"\n",
"## Inputs\n",
"List of labeled tweets repeated over multiple workers with filename format 'myth_{myth_name}_sample_{sample_size}_{date}-results.csv'\n",
"\n",
"## Outputs\n",
"N/A (agreement scores, worker validation)\n"
]
},
{
@@ -133,8 +139,8 @@
"# is_myth Answer Statistics\n",
"\n",
"def check_agree_on(col):\n",
" # Group the results by Task ID and their answer to the gun_violence question, then\n",
" # count the number of records in those groups. This determines how many answers\n",
" # Group the results by Task ID and their answer on whether tweet contains the focal myth, \n",
" # then count the number of records in those groups. This determines how many answers\n",
" # there were per choice, per task. Rename the count column.\n",
" df_gun_violence_counts = df.groupby(['HITId', col]).size().to_frame().reset_index()\n",
" df_gun_violence_counts.rename(columns={0: 'count'}, inplace=True)\n",
10 changes: 7 additions & 3 deletions code/3_label_mturk_results.py
@@ -4,15 +4,19 @@
'''
@title: Format converter
@description: Converts MTurk results to the format [tweet_id, text, label]. Base input filepath(s) must be passed as a command-line argument; these will be appended with `-results.csv` to find input data and `-labeled.csv` to save output data.
@usage: python3 3_label_mturk_results.py --input_fp DATA_FP1 DATA_FP2
@usage:
```bash
python3 3_label_mturk_results.py --input_fp DATA_FP1 DATA_FP2
```
@inputs: list of tweets, each labeled by multiple workers, with filename format 'myth_{myth_name}_sample_{sample_size}_{date}-results.csv'
@outputs: list of validated labeled tweets with filename format 'myth_{myth_name}_sample_{sample_size}_{date}-labeled.csv'
'''


###############################################
# Initialize
# Import packages
###############################################

# import packages
import numpy as np
import pandas as pd
import csv
16 changes: 12 additions & 4 deletions code/4_train_classifiers.ipynb
@@ -4,7 +4,15 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Train classification models for misinformation detection"
"# Train classification models for misinformation detection\n",
"\n",
"Trains k-Nearest Neighbors, Random Forest, Decision Tree, Multinomial Naive Bayes, Logistic Regression, Support Vector Machine, and Multi-Layer Perceptron. Uses 10-fold cross-validation to select the best model.\n",
"\n",
"## Inputs\n",
"list of validated labeled tweets with filename format 'myth_{myth_name}_sample_{sample_size}_{date}-labeled.csv'. If negative cases are needed for a focal myth, may also borrow from labeled positives for other myths (very unlikely a tweet contains both myths)\n",
"\n",
"## Outputs\n",
"ML models and vectorizers (with TF-IDF weighting)\n"
]
},
{
@@ -96,9 +104,9 @@
" (fp.endswith('labeled.csv'))\n",
" ]\n",
"# negative cases\n",
"neg_myth_fps = [fp for fp in glob.glob(join(DATA_DIR, 'negative_cases/*') if\n",
" (fp.endswith('labeled.csv'))\n",
" ]\n",
"neg_myth_fps = [fp for fp in glob.glob(join(DATA_DIR, 'negative_cases/*')) if\n",
" (fp.endswith('labeled.csv'))\n",
" ]\n",
"\n",
"# specify shorthand for focal myth\n",
"# use different, specific naming for multiple myths (e.g., myth1, myth2)\n",
8 changes: 6 additions & 2 deletions code/5_sample_tweets.ipynb
@@ -6,9 +6,13 @@
"source": [
"## Sample tweets for hand-labeling based on classification scores\n",
"\n",
"Uses classifiers trained on labeled tweets (about a myth vs. not) to filter tweets from March to August 2020 (especially April-May) to only those with a higher probability of being in the minority class than the majority class. This will be used to select tweets for hand-coding that fall into minority classes, which are hard to capture from the first round of ML models. Data source is tweets with hashtags related to Covid-19.\n",
"Uses classifiers trained on labeled tweets (about a myth vs. not) to filter tweets from March to August 2020 (especially April-May) to those with a higher probability of being in the minority class than the majority class. To help the classifier learn to detect both classes amidst our imbalanced data, the new sample is predicted to be 90% minority class and 10% minority class. This sample will be used to select tweets for hand-coding that fall into minority classes, which are hard to capture from the first round of ML models. Data source is tweets with hashtags related to Covid-19.\n",
"\n",
"Usage: `python3 6_filter_tweets.py -{myth prefix} {directory of classifier} {directory of vectorizer} -t {threshold lower bound} {threshold upper bound, < not <=} -nrand {number of files sampled from} -nrows {number of rows browsed in each file}`"
"### Inputs\n",
"Unlabeled tweets; ML models to use for predictions\n",
"\n",
"### Outputs\n",
"Sample of unlabeled tweets--mostly predicted to be misinformation--for hand-labeling, with filename format '{myth_name}_sample_{sample_size}_{date}.csv'\n"
]
},
{
31 changes: 31 additions & 0 deletions code/README.md
@@ -0,0 +1,31 @@
## Guide to Codebase

### [`1_phrase_sampling.py`](1_phrase_sampling.py)
* **Description**: Samples tweets containing specific keywords/phrases
* **Inputs**: list of keywords to search (required argument)
* **Outputs**: list of tweets containing the keywords (.csv file)

### [`2_process_mturk_results.ipynb`](2_process_mturk_results.ipynb)
* **Description**: Computes agreement scores for the hand-labeled sample from Mechanical Turk; checks for bad workers.
* **Inputs**: list of tweets, each labeled by multiple workers, with filename format 'myth_{myth_name}_sample_{sample_size}_{date}-results.csv'.
* **Outputs**: N/A (agreement scores, worker validation)

### [`3_label_mturk_results.py`](3_label_mturk_results.py)
* **Description**: Converts MTurk results to the format [tweet_id, text, label]. Base input filepath(s) must be passed as a command-line argument; these will be appended with `-results.csv` to find input data and `-labeled.csv` to save output data.
* **Inputs**: list of tweets, each labeled by multiple workers, with filename format 'myth_{myth_name}_sample_{sample_size}_{date}-results.csv'
* **Outputs**: list of validated labeled tweets with filename format 'myth_{myth_name}_sample_{sample_size}_{date}-labeled.csv'

### [`4_train_classifiers.ipynb`](4_train_classifiers.ipynb)
* **Description**: Trains and evaluates classification models for misinformation detection: k-Nearest Neighbors, Random Forest, Decision Tree, Multinomial Naive Bayes, Logistic Regression, Support Vector Machine, and Multi-Layer Perceptron. Uses 10-fold cross-validation to select the best model.
* **Inputs**: list of validated labeled tweets with filename format 'myth_{myth_name}_sample_{sample_size}_{date}-labeled.csv'. If negative cases are needed for a focal myth, labeled positives for other myths may also be borrowed (it is very unlikely that a tweet contains both myths)
* **Outputs**: ML models and vectorizers (with TF-IDF weighting)

### [`5_sample_tweets.ipynb`](5_sample_tweets.ipynb)
* **Description**: Uses classifiers trained on labeled tweets (about a myth vs. not) to filter tweets from March to August 2020 (especially April-May) to only those with a higher probability of being in the minority class than the majority class. To help the classifier learn to detect both classes amidst our imbalanced data, the new sample is predicted to be 90% minority class and 10% majority class. This sample will be used to select tweets for hand-coding that fall into minority classes, which are hard to capture from the first round of ML models. Data source is tweets with hashtags related to Covid-19.
* **Inputs**: unlabeled tweets; ML models to use for predictions
* **Outputs**: sample of unlabeled tweets--mostly predicted to be misinformation--for hand-labeling, with filename format '{myth_name}_sample_{sample_size}_{date}.csv'

### [`utils.py`](utils.py)
* **Description**: Custom function for loading tweets from Google Cloud Storage.
* **Inputs**: Storage bucket name and tweet filepaths
* **Outputs**: Pandas DataFrame containing tweets
70 changes: 23 additions & 47 deletions code/utils.py
100644 → 100755
@@ -1,3 +1,22 @@
#!/usr/bin/env python
# coding: utf-8

'''
@title: Tweet loader
@description: Custom function for loading tweets from Google Cloud Storage.
@inputs: storage bucket name and tweet filepaths
@outputs: Pandas DataFrame containing tweets
'''


###############################################
# Import packages
###############################################

import io
import glob
import os
@@ -6,56 +6,13 @@
import tqdm

from google.cloud import storage
client = storage.Client()

def read_file(file_path):
"""
Read a JSON spark file from given file path and return a dataframe
"""

file_path = file_path[:-1] if file_path[-1] == '/' else file_path

# Check if the given file_path is a directory or file
if os.path.isdir(file_path):

# list all JSON files in directory
files = glob.glob(file_path+"/*.json")

# read each JSON file into a dataframe, and append to the dataframe list
dfs = []
print("Reading JSON file")
for ix, f in enumerate(tqdm.tqdm(files)):
# print("Reading JSON file: {:04d}/{:04d}".format(ix+1, len(files)))
df = pd.read_json(f, lines=True)
dfs.append(df)

# combine dataframes
df = pd.concat(dfs)

elif os.path.isfile(file_path):
if is_spark_json(file_path):
df = pd.read_json(file_path, lines=True)
else:
df = pd.read_json(file_path, orient='records')
else:
raise ValueError(
"It is a special file (socket, FIFO, device file) or file not found")

return df

def is_spark_json(fp):
"""
Check if the given file path is spark json
"""
with open(fp, 'r') as f:
# line = f.readline().strip() # Very slow
for line in f:
if line[0] == '{' and line[-1] == '}':
return True
break
return False

###############################################
# Define helper function(s)
###############################################

client = storage.Client()
def gcs_read_json_gz(gcs_filepath, nrows=None):
# Validate input path
if not gcs_filepath.startswith("gs://") or not gcs_filepath.endswith(".json.gz"):
17 changes: 17 additions & 0 deletions models/README.md
@@ -0,0 +1,17 @@
Contains final models and vectorizers (with TF-IDF weighting) for each myth:
* 5G
* antibiotics
* disinfectants
* home_remedies
* hydroxychloroquine
* mosquitoes
* uv_light
* weather

Using the following model types:
* K-Nearest Neighbors
* Decision Tree
* Random Forest
* Multinomial Naive Bayes
* Logistic Regression
* Multi-Layer Perceptron
