Commit

guide to codebase; add inputs & outputs
jhaber-zz committed Mar 13, 2023
1 parent eec75a8 commit b4d23bb
Showing 10 changed files with 177 additions and 95 deletions.
3 changes: 3 additions & 0 deletions .gitignore
100644 → 100755
@@ -127,3 +127,6 @@ dmypy.json

# Pyre type checker
.pyre/

# extra code
code/filter_tweets.py
51 changes: 49 additions & 2 deletions README.md
100644 → 100755
@@ -2,12 +2,59 @@

## Introduction

This is the replication repository for submission to [DATA 2023, the 12th International Conference on Data Science, Technology and Applications](https://data.scitevents.org/). (more info forthcoming)
This repository contains code for the submission to [DATA 2023, the 12th International Conference on Data Science, Technology and Applications](https://data.scitevents.org/) titled "Identifying High Quality Training Data for Misinformation Detection".


## Guide to Codebase

(forthcoming)
### [`1_phrase_sampling.py`](code/1_phrase_sampling.py)
* **Description**: Samples tweets containing specific keywords/phrases (see the sketch below)
* **Inputs**: list of keywords to search (required argument)
* **Outputs**: list of tweets containing the keywords (.csv file)
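
A minimal sketch of the phrase-matching step (the actual script wraps this logic in a Spark job class with command-line arguments; the `full_text` column name and file paths below are placeholders, not the script's real defaults):

```python
# Hypothetical sketch: keep only tweets whose text matches any of a set of phrases.
import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("phrase_sampling_sketch").getOrCreate()
tweets = spark.read.json("data/PROJECT/FILEPATH.json")  # placeholder input path

phrases = ["5g", "microchip"]   # example keywords
pattern = "|".join(phrases)     # OR matching; the script also supports AND via --phrase_conditional

sample = tweets.filter(f.lower(f.col("full_text")).rlike(pattern))
sample.select("id_str", "full_text").write.csv("output/phrase_sample", header=True)
```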

### [`2_process_mturk_results.ipynb`](code/2_process_mturk_results.ipynb)
* **Description**: Computes agreement scores for the hand-labeled sample from Mechanical Turk; checks for bad workers (see the sketch below).
* **Inputs**: list of tweets, each labeled by multiple workers, with filename format 'myth_{myth_name}_sample_{sample_size}_{date}-results.csv'.
* **Outputs**: N/A (agreement scores, worker validation)
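
A rough sketch of the agreement check, assuming standard MTurk export columns (`HITId`, `Answer.is_myth` are assumptions here); the notebook itself also inspects individual workers:

```python
# Hedged sketch: share of workers giving the modal answer for each HIT (task).
import pandas as pd

df = pd.read_csv("myth_5g_sample_500_20200601-results.csv")  # placeholder filename

def per_hit_agreement(frame, col="Answer.is_myth"):
    # For each HIT, the fraction of workers who gave the most common answer
    return frame.groupby("HITId")[col].agg(
        lambda answers: answers.value_counts(normalize=True).iloc[0]
    )

agreement = per_hit_agreement(df)
print("Mean per-HIT agreement:", round(agreement.mean(), 3))
print("Share of unanimous HITs:", round((agreement == 1.0).mean(), 3))
```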

### [`3_label_mturk_results.py`](code/3_label_mturk_results.py)
* **Description**: Converts MTurk results to the format [tweet_id, text, label]. Base input filepath(s) must be passed as a command-line argument; these will be appended with `-results.csv` to find input data and `-labeled.csv` to save output data (see the sketch below).
* **Inputs**: list of tweets, each labeled by multiple workers, with filename format 'myth_{myth_name}_sample_{sample_size}_{date}-results.csv'
* **Outputs**: list of validated labeled tweets with filename format 'myth_{myth_name}_sample_{sample_size}_{date}-labeled.csv'
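
A simplified sketch of the naming convention plus a majority-vote reduction (the `Input.*`/`Answer.*` column names are assumptions, not necessarily what the script uses):

```python
# Hedged sketch: read '<base>-results.csv', collapse worker labels, write '<base>-labeled.csv'.
import pandas as pd

def label_results(base_fp):
    results = pd.read_csv(f"{base_fp}-results.csv")
    labeled = (
        results.groupby("HITId")
        .agg(tweet_id=("Input.tweet_id", "first"),
             text=("Input.text", "first"),
             label=("Answer.is_myth", lambda s: s.mode().iloc[0]))  # majority vote
        .reset_index(drop=True)
    )
    labeled.to_csv(f"{base_fp}-labeled.csv", index=False)

label_results("myth_5g_sample_500_20200601")  # placeholder base filepath
```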

### [`4_train_classifiers.ipynb`](code/4_train_classifiers.ipynb)
* **Description**: Trains and evaluates classification models for misinformation detection: k-Nearest Neighbors, Random Forest, Decision Tree, Multinomial Naive Bayes, Logistic Regression, Support Vector Machine, and Multi-Layer Perceptron. Uses 10-fold cross-validation to select the best model (see the sketch below).
* **Inputs**: list of validated labeled tweets with filename format 'myth_{myth_name}_sample_{sample_size}_{date}-labeled.csv'. If negative cases are needed for a focal myth, labeled positives for other myths may also be borrowed (it is very unlikely that a tweet contains both myths)
* **Outputs**: ML models and vectorizers (with TF-IDF weighting)
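
A hedged sketch of this setup with scikit-learn (the filename, the 'text'/'label' column names, and the model subset shown are illustrative):

```python
# Sketch: TF-IDF features plus 10-fold cross-validation over several candidate classifiers.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

df = pd.read_csv("myth_5g_sample_500_20200601-labeled.csv")  # placeholder filename
X, y = df["text"], df["label"]

candidates = {
    "knn": KNeighborsClassifier(),
    "random_forest": RandomForestClassifier(),
    "naive_bayes": MultinomialNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
}

for name, clf in candidates.items():
    pipe = make_pipeline(TfidfVectorizer(stop_words="english"), clf)
    scores = cross_val_score(pipe, X, y, cv=10, scoring="f1_macro")  # 10-fold CV
    print(f"{name}: mean macro-F1 = {scores.mean():.3f}")
```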

### [`5_sample_tweets.ipynb`](code/5_sample_tweets.ipynb)
* **Description**: Uses classifiers trained on labeled tweets (about a myth vs. not) to filter tweets from March to August 2020 (especially April-May) to only those with a higher probability of being in the minority class than the majority class. To help the classifier learn to detect both classes amidst our imbalanced data, the new sample is predicted to be 90% minority class and 10% majority class. This sample will be used to select tweets for hand-coding that fall into minority classes, which are hard to capture from the first round of ML models. Data source is tweets with hashtags related to Covid-19 (see the sketch below).
* **Inputs**: unlabeled tweets; ML models to use for predictions
* **Outputs**: sample of unlabeled tweets--mostly predicted to be misinformation--for hand-labeling, with filename format '{myth_name}_sample_{sample_size}_{date}.csv'
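
A sketch of the sampling logic, assuming a fitted vectorizer and classifier saved to disk (e.g. with joblib) and unlabeled tweets with a 'text' column; the paths, names, and exact 90/10 mechanics are illustrative:

```python
# Hedged sketch: oversample tweets predicted to be in the minority (misinformation) class.
import joblib
import pandas as pd

vectorizer = joblib.load("models/5g_vectorizer.joblib")   # placeholder paths
clf = joblib.load("models/5g_logistic_regression.joblib")

tweets = pd.read_csv("unlabeled_covid_tweets.csv")        # placeholder filename
probs = clf.predict_proba(vectorizer.transform(tweets["text"]))[:, 1]  # assumes class 1 = myth

minority = tweets[probs > 0.5]    # predicted more likely misinformation than not
majority = tweets[probs <= 0.5]

n = 1000                          # target sample size
sample = pd.concat([
    minority.sample(min(len(minority), int(n * 0.9)), random_state=42),  # ~90% predicted minority
    majority.sample(min(len(majority), int(n * 0.1)), random_state=42),  # ~10% predicted majority
])
sample.to_csv("5g_sample_1000_20200815.csv", index=False)  # placeholder output name
```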

### [`utils.py`](code/utils.py)
* **Description**: Custom function for loading tweets from Google Cloud Storage (see the sketch below).
* **Inputs**: Storage bucket name and tweet filepaths
* **Outputs**: Pandas DataFrame containing tweets
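
An approximate sketch of this kind of helper (bucket and blob names are placeholders; the repo's `gcs_read_json_gz` may differ in signature and details):

```python
# Hedged sketch: read a gzipped JSON-lines file of tweets from GCS into a DataFrame.
import io

import pandas as pd
from google.cloud import storage

def load_tweets(bucket_name, blob_path, nrows=None):
    client = storage.Client()
    blob = client.bucket(bucket_name).blob(blob_path)
    data = blob.download_as_bytes()  # raw .json.gz bytes
    return pd.read_json(io.BytesIO(data), lines=True, compression="gzip", nrows=nrows)

df = load_tweets("my-tweet-bucket", "raw/covid/tweets_20200401.json.gz", nrows=1000)
```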

### [`models/`](models/) folder
Contains final models and vectorizers (with TF-IDF weighting) for each myth:
* 5G
* antibiotics
* disinfectants
* home_remedies
* hydroxychloroquine
* mosquitoes
* uv_light
* weather

Using the following model types:
* K-Nearest Neighbors
* Decision Tree
* Random Forest
* Multinomial Naive Bayes
* Logistic Regression
* Multi-Layer Perceptron


## Abstract
54 changes: 20 additions & 34 deletions code/1_phrase_sampling.py
@@ -1,11 +1,17 @@
#!/usr/bin/env python
# coding: utf-8

""" Sample tweets containing specific phrases
Usage
@inputs: list of keywords to search (required argument)
@outputs: list of tweets containing the keywords (.csv file)
@usage:
-----
```bash
python -m jobs.sampling.sample_phrases \
--input "data/election2020/#2020election_20190616.json" \
--input "data/PROJECT/FILEPATH.json" \
--output output/ \
--phrases \
"love" \
@@ -14,46 +14,22 @@
--preprocessing_choices 'stopwords' 'lowercase' 'stemmed'
```
```bash
spark-submit \
--py-files dist/shared.zip \
jobs/sampling/sample_phrases.py \
--input "data/election2020/#2020election_20190616.json" \
--output output/ \
--phrases \
"already" \
"cut the field" \
--limit 5 \
--lang "en" \
--phrase_conditional 'AND'
```
```bash
gcloud dataproc jobs submit pyspark \
--cluster "colton" \
--region "us-east4" \
--py-files "gs://build-artifacts/pyspark-shared.zip" \
gs://build-artifacts/pyspark-jobs/sampling/sample_phrases.py \
-- \
--input \
"gs://project_gun-violence/raw/hashtag/2017" \
--output output/ \
--phrases \
"efforts" \
"speak" \
"election" \
--limit 2000 \
--lang "en"
```
"""

###############################################
# Import packages
###############################################

import pyspark.sql.functions as f

from shared.base_job import BaseJob
from shared.job_helpers import construct_output_filename


###############################################
# Define functions
###############################################

class SamplePhrasesJob(BaseJob):

def __init__(self):
@@ -137,6 +119,10 @@ def process(self):
self.df = self.df.select("id_str", "date", self.args.target_attr)


###############################################
# Execute functions
###############################################

if __name__ == "__main__":

sp = SamplePhrasesJob()
12 changes: 9 additions & 3 deletions code/2_process_mturk_results.ipynb
@@ -6,7 +6,13 @@
"source": [
"# Process MTurk Results\n",
"\n",
"Performs a basic analysis of mTurk results for any results file."
"Computes agreement scores of hand-labeled sample using Mechnical Turk; checks for bad workers.\n",
"\n",
"## Inputs\n",
"List of labeled tweets repeated over multiple workers with filename format 'myth_{myth_name}_sample_{sample_size}_{date}-results.csv'\n",
"\n",
"## Outputs\n",
"N/A (agreement scores, worker validation)\n"
]
},
{
@@ -133,8 +139,8 @@
"# is_myth Answer Statistics\n",
"\n",
"def check_agree_on(col):\n",
" # Group the results by Task ID and their answer to the gun_violence question, then\n",
" # count the number of records in those groups. This determines how many answers\n",
" # Group the results by Task ID and their answer on whether tweet contains the focal myth, \n",
" # then count the number of records in those groups. This determines how many answers\n",
" # there were per choice, per task. Rename the count column.\n",
" df_gun_violence_counts = df.groupby(['HITId', col]).size().to_frame().reset_index()\n",
" df_gun_violence_counts.rename(columns={0: 'count'}, inplace=True)\n",
10 changes: 7 additions & 3 deletions code/3_label_mturk_results.py
@@ -4,15 +4,19 @@
'''
@title: Format converter
@description: Converts MTurk results to the format [tweet_id, text, label]. Base input filepath(s) must be passed as a command-line argument; these will be appended with `-results.csv` to find input data and `-labeled.csv` to save output data.
@usage: python3 3_label_mturk_results.py --input_fp DATA_FP1 DATA_FP2
@usage:
```bash
python3 3_label_mturk_results.py --input_fp DATA_FP1 DATA_FP2
```
@inputs: list of tweets, each labeled by multiple workers, with filename format 'myth_{myth_name}_sample_{sample_size}_{date}-results.csv'
@outputs: list of validated labeled tweets with filename format 'myth_{myth_name}_sample_{sample_size}_{date}-labeled.csv'
'''


###############################################
# Initialize
# Import packages
###############################################

# import packages
import numpy as np
import pandas as pd
import csv
16 changes: 12 additions & 4 deletions code/4_train_classifiers.ipynb
@@ -4,7 +4,15 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"# Train classification models for misinformation detection"
"# Train classification models for misinformation detection\n",
"\n",
"Trains k-Nearest Neighbors, Random Forest, Decision Tree, Multinomial Naive Bayes, Logistic Regression, Support Vector Machine, and Multi-Layer Perceptron. Uses 10-fold cross-validation to select the best model.\n",
"\n",
"## Inputs\n",
"list of validated labeled tweets with filename format 'myth_{myth_name}_sample_{sample_size}_{date}-labeled.csv'. If negative cases are needed for a focal myth, may also borrow from labeled positives for other myths (very unlikely a tweet contains both myths)\n",
"\n",
"## Outputs\n",
"ML models and vectorizers (with TF-IDF weighting)\n"
]
},
{
@@ -96,9 +104,9 @@
" (fp.endswith('labeled.csv'))\n",
" ]\n",
"# negative cases\n",
"neg_myth_fps = [fp for fp in glob.glob(join(DATA_DIR, 'negative_cases/*') if\n",
" (fp.endswith('labeled.csv'))\n",
" ]\n",
"neg_myth_fps = [fp for fp in glob.glob(join(DATA_DIR, 'negative_cases/*')) if\n",
" (fp.endswith('labeled.csv'))\n",
" ]\n",
"\n",
"# specify shorthand for focal myth\n",
"# use different, specific naming for multiple myths (e.g., myth1, myth2)\n",
8 changes: 6 additions & 2 deletions code/5_sample_tweets.ipynb
@@ -6,9 +6,13 @@
"source": [
"## Sample tweets for hand-labeling based on classification scores\n",
"\n",
"Uses classifiers trained on labeled tweets (about a myth vs. not) to filter tweets from March to August 2020 (especially April-May) to only those with a higher probability of being in the minority class than the majority class. This will be used to select tweets for hand-coding that fall into minority classes, which are hard to capture from the first round of ML models. Data source is tweets with hashtags related to Covid-19.\n",
"Uses classifiers trained on labeled tweets (about a myth vs. not) to filter tweets from March to August 2020 (especially April-May) to those with a higher probability of being in the minority class than the majority class. To help the classifier learn to detect both classes amidst our imbalanced data, the new sample is predicted to be 90% minority class and 10% minority class. This sample will be used to select tweets for hand-coding that fall into minority classes, which are hard to capture from the first round of ML models. Data source is tweets with hashtags related to Covid-19.\n",
"\n",
"Usage: `python3 6_filter_tweets.py -{myth prefix} {directory of classifier} {directory of vectorizer} -t {threshold lower bound} {threshold upper bound, < not <=} -nrand {number of files sampled from} -nrows {number of rows browsed in each file}`"
"### Inputs\n",
"Unlabeled tweets; ML models to use for predictions\n",
"\n",
"### Outputs\n",
"Sample of unlabeled tweets--mostly predicted to be misinformation--for hand-labeling, with filename format '{myth_name}_sample_{sample_size}_{date}.csv'\n"
]
},
{
31 changes: 31 additions & 0 deletions code/README.md
@@ -0,0 +1,31 @@
## Guide to Codebase

### [`1_phrase_sampling.py`](1_phrase_sampling.py)
* **Description**: Samples tweets containing specific keywords/phrases
* **Inputs**: list of keywords to search (required argument)
* **Outputs**: list of tweets containing the keywords (.csv file)

### [`2_process_mturk_results.ipynb`](2_process_mturk_results.ipynb)
* **Description**: Computes agreement scores for the hand-labeled sample from Mechanical Turk; checks for bad workers.
* **Inputs**: list of tweets, each labeled by multiple workers, with filename format 'myth_{myth_name}_sample_{sample_size}_{date}-results.csv'.
* **Outputs**: N/A (agreement scores, worker validation)

### [`3_label_mturk_results.py`](3_label_mturk_results.py)
* **Description**: Converts MTurk results to the format [tweet_id, text, label]. Base input filepath(s) must be passed as a command-line argument; these will be appended with `-results.csv` to find input data and `-labeled.csv` to save output data.
* **Inputs**: list of tweets, each labeled by multiple workers, with filename format 'myth_{myth_name}_sample_{sample_size}_{date}-results.csv'
* **Outputs**: list of validated labeled tweets with filename format 'myth_{myth_name}_sample_{sample_size}_{date}-labeled.csv'

### [`4_train_classifiers.ipynb`](4_train_classifiers.ipynb)
* **Description**: Trains and evaluates classification models for misinformation detection: k-Nearest Neighbors, Random Forest, Decision Tree, Multinomial Naive Bayes, Logistic Regression, Support Vector Machine, and Multi-Layer Perceptron. Uses 10-fold cross-validation to select the best model.
* **Inputs**: list of validated labeled tweets with filename format 'myth_{myth_name}_sample_{sample_size}_{date}-labeled.csv'. If negative cases are needed for a focal myth, labeled positives for other myths may also be borrowed (it is very unlikely that a tweet contains both myths)
* **Outputs**: ML models and vectorizers (with TF-IDF weighting)

### [`5_sample_tweets.ipynb`](5_sample_tweets.ipynb)
* **Description**: Uses classifiers trained on labeled tweets (about a myth vs. not) to filter tweets from March to August 2020 (especially April-May) to only those with a higher probability of being in the minority class than the majority class. To help the classifier learn to detect both classes amidst our imbalanced data, the new sample is predicted to be 90% minority class and 10% majority class. This sample will be used to select tweets for hand-coding that fall into minority classes, which are hard to capture from the first round of ML models. Data source is tweets with hashtags related to Covid-19.
* **Inputs**: unlabeled tweets; ML models to use for predictions
* **Outputs**: sample of unlabeled tweets--mostly predicted to be misinformation--for hand-labeling, with filename format '{myth_name}_sample_{sample_size}_{date}.csv'

### [`utils.py`](utils.py)
* **Description**: Custom function for loading tweets from Google Cloud Storage.
* **Inputs**: Storage bucket name and tweet filepaths
* **Outputs**: Pandas DataFrame containing tweets
70 changes: 23 additions & 47 deletions code/utils.py
100644 → 100755
@@ -1,3 +1,22 @@
#!/usr/bin/env python
# coding: utf-8

'''
@title: Tweet loader
@description: Custom function for loading tweets from Google Cloud Storage.
@inputs: storage bucket name and tweet filepaths
@outputs: Pandas DataFrame containing tweets
'''


###############################################
# Import packages
###############################################

import io
import glob
import os
@@ -6,56 +6,13 @@
import tqdm

from google.cloud import storage
client = storage.Client()

def read_file(file_path):
"""
Read a JSON spark file from given file path and return a dataframe
"""

file_path = file_path[:-1] if file_path[-1] == '/' else file_path

# Check if the given file_path is a directory or file
if os.path.isdir(file_path):

# list all JSON files in directory
files = glob.glob(file_path+"/*.json")

# read each JSON file into a dataframe, and append to the dataframe list
dfs = []
print("Reading JSON file")
for ix, f in enumerate(tqdm.tqdm(files)):
# print("Reading JSON file: {:04d}/{:04d}".format(ix+1, len(files)))
df = pd.read_json(f, lines=True)
dfs.append(df)

# combine dataframes
df = pd.concat(dfs)

elif os.path.isfile(file_path):
if is_spark_json(file_path):
df = pd.read_json(file_path, lines=True)
else:
df = pd.read_json(file_path, orient='records')
else:
raise ValueError(
"It is a special file (socket, FIFO, device file) or file not found")

return df

def is_spark_json(fp):
"""
Check if the given file path is spark json
"""
with open(fp, 'r') as f:
# line = f.readline().strip() # Very slow
for line in f:
if line[0] == '{' and line[-1] == '}':
return True
break
return False

###############################################
# Define helper function(s)
###############################################

client = storage.Client()
def gcs_read_json_gz(gcs_filepath, nrows=None):
# Validate input path
if not gcs_filepath.startswith("gs://") or not gcs_filepath.endswith(".json.gz"):
17 changes: 17 additions & 0 deletions models/README.md
@@ -0,0 +1,17 @@
Contains final models and vectorizers (with TF-IDF weighting) for each myth:
* 5G
* antibiotics
* disinfectants
* home_remedies
* hydroxychloroquine
* mosquitoes
* uv_light
* weather

Using the following model types:
* K-Nearest Neighbors
* Decision Tree
* Random Forest
* Multinomial Naive Bayes
* Logistic Regression
* Multi-Layer Perceptron
