Merge pull request #50 from namiyousef/develop
Merging develop onto main
namiyousef authored Apr 18, 2022
2 parents 53e8769 + 8afb395 commit a37f047
Showing 75 changed files with 221,271 additions and 2 deletions.
41 changes: 41 additions & 0 deletions .github/workflows/python-package.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,41 @@
# This workflow will install Python dependencies, run tests and lint with a variety of Python versions
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions

name: Python package

on:
  push:
    branches: [ develop ]
  pull_request:
    branches: [ main ]

jobs:
  build:

    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        python-version: ["3.8", "3.9", "3.10"]

    steps:
    - uses: actions/checkout@v3
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v3
      with:
        python-version: ${{ matrix.python-version }}
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        python -m pip install flake8 pytest
        python -m pip install .
    - name: Lint with flake8
      run: |
        # stop the build if there are Python syntax errors or undefined names
        flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics \
          --exclude tests,experiments,pipeline
        # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
        flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
    - name: Test with pytest
      run: |
        python -m pytest tests
39 changes: 39 additions & 0 deletions .github/workflows/python-publish.yml
@@ -0,0 +1,39 @@
# This workflow will upload a Python Package using Twine when a release is created
# For more information see: https://help.github.com/en/actions/language-and-framework-guides/using-python-with-github-actions#publishing-to-package-registries

# This workflow uses actions that are not certified by GitHub.
# They are provided by a third-party and are governed by
# separate terms of service, privacy policy, and support
# documentation.

name: Upload Python Package

on:
  release:
    types: [published]

permissions:
  contents: read

jobs:
  deploy:

    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v3
    - name: Set up Python
      uses: actions/setup-python@v3
      with:
        python-version: '3.x'
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install build
    - name: Build package
      run: python -m build
    - name: Publish package
      uses: pypa/gh-action-pypi-publish@27b31702a0e7fc50959f5ad993c78deac1bdfc29
      with:
        user: __token__
        password: ${{ secrets.PYPI_API_TOKEN }}
20 changes: 20 additions & 0 deletions .gitignore
@@ -127,3 +127,23 @@ dmypy.json

# Pyre type checker
.pyre/

# Mac stuff
.idea
.DS_Store

# project stuff
data

# google cloud
.tmp.driveupload
.tmp.drivedownload
trest

# VS Code
.vscode

# cluster stuff
TMPDIR
gpu_script.sh
Library
Empty file added AUTHORS.md
Empty file.
63 changes: 63 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,63 @@
# Contributing Instructions

## Set-up

First, clone the repository locally. After entering the repository, create a new virtual environment and run:

`pip install -e .`

If you now run `pip list` you should see `ArgMiner` among your installed packages. You have just installed ArgMiner in a development (editable) environment, and its dependencies have been installed along with it.

To 'develop' in this environment, update your virtual env with any new packages using `pip install`. The key point is that these won't be persisted into ArgMiner itself; they will only be used by you locally. This helps keep the production version clean while our requirements change.

I've added a `requirements.txt` file in every folder. When you install something new, make sure you keep that file updated, e.g. using `pip freeze > your_name/requirements.txt`.

## Using Jupyter Notebooks

First, install `notebook` into your virtual environment using `pip install notebook`.

Now, unlike with plain Python, you won't be able to import anything from ArgMiner directly, because Jupyter Notebooks limit the path to the directory the notebook is run from. This means the notebook won't be able to 'see' the ArgMiner package. To fix this, you must add your virtual env as a kernel to Jupyter. First, run:

`pip install ipykernel`

Then:

`ipython kernel install --name=your_venv_name`

When you now go into Jupyter and select the kernel to run your notebook with, make sure it is the one you just installed. You should then be able to import ArgMiner as a package.

## Using Colab

Unfortunately, because of the way Colab works, using it with our custom environment is a bit difficult. In other projects we zipped the entire project each time it was updated, unzipped it on Colab, and then installed dependencies. This was problematic because it caused version mismatches.

Here I will describe a solution I've come up with for how we can do this as a team. It is not clean, but it should work for our purposes.

### Committing changes directly from Colab

When you open Colab, the default screen shows your 'Recents'. Navigate to the 'GitHub' tab and check the 'Include private repos' box. Then, in the search bar, find 'namiyousef/argument-mining' and select 'develop' as the branch. At some point you will be asked to connect your account to GitHub; make sure you do so when prompted.

Once that is done, you'll be able to see all the notebooks in the repository in the last state that they were pushed.

These notebooks don't exist locally, so if you make changes and try to save them, a pop-up will tell you that you don't have permission to save the notebook and offer to save a copy in Drive. Don't save it locally. Instead, click File -> Save a copy in GitHub (NOT GitHub Gist), update the commit message to something appropriate, and submit :)

### Syncing the repo with Google Drive

There are multiple ways to do this; to keep things simple I will go through a single method only. This assumes that you've already cloned the repository locally.

First, download Google Drive for desktop. Once this is done, sync the local copy of your repo with Google Drive. This will save the repository under 'Other computers' within Google Drive. Now go into the browser version of Google Drive, find the repository folder, right click, and choose 'Add shortcut to Drive'. This makes sure the folder is accessible through 'My Drive'.

### Installing the virtual environment

To make things easy, I've written some code that automatically installs ArgMiner as a package in your Colab session. Please find the relevant code in `experiments/yousef/test.ipynb`.

### Pointers

- To avoid issues, please make sure that your repository is always up to date (e.g. using `git fetch`) and that local changes are fully synced to Google Drive before you run notebooks against it.
- If you update your notebooks on Colab, your local repository will be out of sync. Make sure you fetch before persisting any local changes.

For now, there is no clear and consistent way to save things from Colab, so if you want to save anything please do it locally. As we play around with this setup further I'll find a way of saving things nicely.

## Making changes to ArgMiner
I've laid out the project this way so that we don't clutter the actual argminer package. The idea is that we keep experimenting in our own folders under `experiments`. When we are ready to persist new changes to argminer, we review them together. As of right now, this will be done in an ad-hoc fashion; there are no automated tests just yet.

## Tests
21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2022 Yousef, Federico, Qingyu, Changmao

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
111 changes: 109 additions & 2 deletions README.md
@@ -1,2 +1,109 @@
# nlp-placeholder-name
Repository for NLP project. Name to be changed when we decide on a project
# ArgMiner: End-to-end Argument Mining

![GitHub](https://img.shields.io/github/license/namiyousef/argument-mining)
![GitHub Workflow Status](https://img.shields.io/github/workflow/status/namiyousef/argument-mining/Python%20package)
![GitHub tag (latest by date)](https://img.shields.io/github/v/tag/namiyousef/argument-mining)
---

_argminer_ is a PyTorch-based package for argument mining on state-of-the-art datasets. It provides a high-level API for processing these datasets, applying different labelling strategies, augmenting them, training on them with models from [huggingface](https://huggingface.co/), and performing inference.

## Datasets

- Argument Annotated Essays [[1]](#1): a collection of 402 essays written by college students, with clauses classified as Premise, Claim and MajorClaim. This is the SOTA dataset in the field of argument mining.

- PERSUADE [[2]](#2): a newly released dataset from the Feedback Prize Kaggle [competition](https://www.kaggle.com/competitions/feedback-prize-2021/overview). This is a collection of over 15,000 argumentative essays written by U.S. students in grades 6-12. It contains the following argument types: Lead, Position, Claim, Counterclaim, Rebuttal, Evidence, and Concluding Statement.

Another important dataset in this field is ARG2020 [[3]](#3). It was made public around the time of the early-access release, so support for it does not yet exist; there are plans to implement it in the future.

## Installation
You can install the package directly from PyPI using pip:
```bash
pip install argminer
```
To install from the latest commit instead, use:
```bash
pip install argminer@git+https://[email protected]/namiyousef/argument-mining.git@develop
```

## Argminer features

Datasets in the field of argument mining are stored in completely different formats and use different labels. This makes it difficult to compare and contrast model performance across them under different configurations and processing steps.

The data processing API provides a standard method of processing these datasets end-to-end to enable fast and consistent experimentation and modelling.

- ADD DIAGRAM

### Labelling Strategies

SOTA argument mining methods treat chunking text into its argument types and classifying those chunks as an NER problem on long sequences. This means that each segment of text, with its associated label, is converted into a sequence of classes for the model to predict. To illustrate this we will use the following passage:

```python
"Natural Language Processing is the best field in Machine Learning. According to a recent poll by DeepLearning.ai its popularity has increased by twofold in the last 2 years."
```


Let's suppose for the sake of argument that the passage has the following labels:

**Sentence 1: Claim**
```python
sentence_1 = "Natural Language Processing is the best field in Machine Learning."
label_1 = "Claim"
```

**Sentence 2: Evidence**
```python
sentence_2 = "According to a recent poll by DeepLearning.ai its popularity has increased by twofold in the last 2 years."
label_2 = "Evidence"
```

From this we can create a vector with the associated label for each word. Thus the whole text would have an associated label sequence as follows:

```python
labels = ["Claim"]*10 + ["Evidence"]*18 # numbers are length of split sentences
```

With the NER style of labelling, you can modify the above to indicate whether a label marks the beginning of a chunk, the end of a chunk, or the inside of a chunk. For example, `"According"` would be labelled `"B-Evidence"` in sentence 2, and the subsequent words of the chunk would be labelled `"I-Evidence"`.
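The word-to-BIO conversion described above can be sketched in a few lines of plain Python (an illustrative helper, not the argminer API):

```python
def to_bio(word_labels):
    """Convert a flat sequence of word labels into BIO tags.

    The first word of each contiguous chunk gets a "B-" prefix;
    subsequent words of the same chunk get "I-".
    """
    bio = []
    prev = None
    for label in word_labels:
        if label == "O":
            bio.append("O")
        elif label == prev:
            bio.append(f"I-{label}")
        else:
            bio.append(f"B-{label}")
        prev = label
    return bio

# the example passage: 10 "Claim" words followed by 18 "Evidence" words
labels = ["Claim"] * 10 + ["Evidence"] * 18
bio = to_bio(labels)
assert bio[0] == "B-Claim"
assert bio[10] == "B-Evidence"
assert bio[11] == "I-Evidence"
```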

The above is easy when splitting on words, but it becomes more complicated once we consider how transformer models work. These tokenise inputs into subwords, and therefore the labels have to be extended. This raises further questions: if the word `"According"` is split into `"Accord"` and `"ing"`, do we label both of these as `"B-Evidence"`, or should we label `"ing"` as `"I-Evidence"`? Do we just ignore subtokens? Or do we label subtokens as a completely separate class?

We call a labelling strategy that keeps the subtoken labels the same as the start-token labels `"wordLevel"`, and a strategy that differentiates between them `"standard"`. Further, we provide the following labelling schemes:

- **IO:** Argument tokens are classified as `"I-{arg_type}"` and non-argument tokens as `"O"`
- **BIO:** Argument start tokens are classified as `"B-{arg_type}"`, all other argument tokens as `"I-{arg_type}"`. Non-argument tokens as `"O"`
- **BIEO:** Argument start tokens are classified as `"B-{arg_type}"`, argument end tokens as `"E-{arg_type}"` and all other argument tokens as `"I-{arg_type}"`. Non-argument tokens as `"O"`
- **BIXO:** The first token of an argument is classified as `"B-{arg_type}"` and other argument word-start tokens as `"I-{arg_type}"`; argument subtokens are classified as `"X"`. Non-argument tokens as `"O"`
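To make one of these schemes concrete, here is a toy BIEO tagger over word labels (an illustrative sketch, not the argminer processor; single-word chunks get `B-`, matching the note below the strategy list):

```python
def to_bieo(word_labels):
    """Convert word labels into BIEO tags.

    Chunk start -> "B-", chunk end -> "E-", interior -> "I-";
    for a single-word chunk "B-" is prioritised over "E-".
    """
    n = len(word_labels)
    tags = []
    for i, label in enumerate(word_labels):
        if label == "O":
            tags.append("O")
            continue
        is_start = i == 0 or word_labels[i - 1] != label
        is_end = i == n - 1 or word_labels[i + 1] != label
        if is_start:  # "B-" wins for single-word chunks
            tags.append(f"B-{label}")
        elif is_end:
            tags.append(f"E-{label}")
        else:
            tags.append(f"I-{label}")
    return tags

assert to_bieo(["Claim", "Claim", "Claim"]) == ["B-Claim", "I-Claim", "E-Claim"]
assert to_bieo(["Claim"]) == ["B-Claim"]
```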

Considering all combinations, the processor API provides functionality for the following strategies:

- standard_io
- wordLevel_io
- standard_bio
- wordLevel_bio
- standard_bieo*
- wordLevel_bieo
- standard_bixo**

\* `B-` labels are prioritised over `E-` labels, e.g. for a single-word sentence the word would be labelled as `B-`.

\*\* This method is not backed by the literature; it is something we thought would be interesting to examine. The intuition is that the `X` label captures grammatical elements of argument segments.
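The `wordLevel`/`standard` distinction can likewise be sketched for a single word's subtokens (a hypothetical helper, not the argminer API; the `"Accord"`/`"ing"` split is purely illustrative):

```python
def expand_to_subtokens(word_label, n_subtokens, strategy="standard"):
    """Expand one word's BIO label across its subword tokens.

    "wordLevel": every subtoken repeats the word's label.
    "standard": only the first subtoken keeps the "B-" prefix;
    the rest are demoted to "I-".
    """
    if strategy == "wordLevel" or not word_label.startswith("B-"):
        return [word_label] * n_subtokens
    return [word_label] + ["I-" + word_label[2:]] * (n_subtokens - 1)

# "According" -> ["Accord", "ing"] (an illustrative subword split)
assert expand_to_subtokens("B-Evidence", 2, "wordLevel") == ["B-Evidence", "B-Evidence"]
assert expand_to_subtokens("B-Evidence", 2, "standard") == ["B-Evidence", "I-Evidence"]
```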

### Data Augmentation (Adversarial Examples)


### Evaluation

# Quick Start
- TODO



# References
<a id="1">[1]</a>
Stab, C. and Gurevych, I. (2017). _Parsing Argumentation Structures in Persuasive Essays_. Computational Linguistics. DOI: 10.1162/COLI_a_00295

<a id="2">[2]</a>
Crossley, S. and The Learning Agency Lab. _PERSUADE corpus_. url: https://github.com/scrosseye/PERSUADE_corpus

<a id="3">[3]</a>
Alhindi, T. and Ghosh, D. (2021). _"Sharks are not the threat humans are": Argument Component Segmentation in School Student Essays_. Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications, pp. 210-222. url: https://aclanthology.org/2021.bea-1.22
2 changes: 2 additions & 0 deletions Useful link
@@ -0,0 +1,2 @@

paper summary: https://marmalade-marten-b45.notion.site/NLP-group-coursework-003529854c4f45258036859c20f94493
4 changes: 4 additions & 0 deletions argminer/__init__.py
@@ -0,0 +1,4 @@
import argminer


__version__ = '0.0.17'
36 changes: 36 additions & 0 deletions argminer/config.py
@@ -0,0 +1,36 @@
import os
import pandas as pd

# -- get email details
EMAIL = os.environ.get('EMAIL', '[email protected]')
EMAIL_PASSWORD = os.environ.get('EMAIL_PASSWORD', 'password')
EMAIL_RECIPIENTS = os.environ.get('EMAIL_RECIPIENTS', EMAIL)

# -- argument mining
PREDICTION_STRING_START_ID = 0
MAX_NORM = 10
# -- label maps
# TODO automate these...
LABELS_DICT = dict(
    TUDarmstadt=['MajorClaim', 'Claim', 'Premise'],
    Persuade=['Lead', 'Position', 'Claim', 'Counterclaim', 'Rebuttal', 'Evidence', 'Concluding Statement']
)
STRATEGIES = ['io', 'bio', 'bieo', 'bixo']

LABELS_MAP_DICT = {}
for dataset, labels in LABELS_DICT.items():
    LABELS_MAP_DICT[dataset] = {}
    for strategy in STRATEGIES:
        new_labels = ['O']
        if strategy == 'bixo':
            new_labels.append('X')
        for label in labels:
            if 'b' in strategy:
                new_labels.append(f'B-{label}')
            new_labels.append(f'I-{label}')
            if 'e' in strategy:
                new_labels.append(f'E-{label}')
        LABELS_MAP_DICT[dataset][strategy] = pd.DataFrame({
            'label_id': list(range(len(new_labels))),
            'label': new_labels
        })
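As a quick sanity check of the loop above, the `bio` strategy for the TUDarmstadt labels expands to `O` plus a `B-`/`I-` pair per class, with positional label ids (a standalone mirror of the loop, not an import from argminer):

```python
# Standalone mirror of the config loop for one dataset/strategy pair
# (TUDarmstadt + 'bio'); not an import from argminer.
labels = ['MajorClaim', 'Claim', 'Premise']
strategy = 'bio'

new_labels = ['O']
for label in labels:
    if 'b' in strategy:
        new_labels.append(f'B-{label}')
    new_labels.append(f'I-{label}')

assert new_labels == [
    'O',
    'B-MajorClaim', 'I-MajorClaim',
    'B-Claim', 'I-Claim',
    'B-Premise', 'I-Premise',
]
# label ids are positional: 'O' -> 0, 'B-MajorClaim' -> 1, and so on
```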
