Merge pull request #50 from namiyousef/develop
Merging develop onto main
namiyousef authored Apr 18, 2022
2 parents 53e8769 + 8afb395 commit a37f047
Showing 75 changed files with 221,271 additions and 2 deletions.
41 changes: 41 additions & 0 deletions .github/workflows/python-package.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,41 @@
# This workflow will install Python dependencies, run tests and lint with a variety of Python versions
# For more information see: https://help.github.com/actions/language-and-framework-guides/using-python-with-github-actions

name: Python package

on:
  push:
    branches: [ develop ]
  pull_request:
    branches: [ main ]

jobs:
  build:

    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        python-version: ["3.8", "3.9", "3.10"]

    steps:
    - uses: actions/checkout@v3
    - name: Set up Python ${{ matrix.python-version }}
      uses: actions/setup-python@v3
      with:
        python-version: ${{ matrix.python-version }}
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        python -m pip install flake8 pytest
        python -m pip install .
    - name: Lint with flake8
      run: |
        # stop the build if there are Python syntax errors or undefined names
        flake8 . --count --select=E9,F63,F7,F82 --show-source --statistics \
          --exclude tests,experiments,pipeline
        # exit-zero treats all errors as warnings. The GitHub editor is 127 chars wide
        flake8 . --count --exit-zero --max-complexity=10 --max-line-length=127 --statistics
    - name: Test with pytest
      run: |
        python -m pytest tests
39 changes: 39 additions & 0 deletions .github/workflows/python-publish.yml
@@ -0,0 +1,39 @@
# This workflow will upload a Python Package using Twine when a release is created
# For more information see: https://help.github.com/en/actions/language-and-framework-guides/using-python-with-github-actions#publishing-to-package-registries

# This workflow uses actions that are not certified by GitHub.
# They are provided by a third-party and are governed by
# separate terms of service, privacy policy, and support
# documentation.

name: Upload Python Package

on:
  release:
    types: [published]

permissions:
  contents: read

jobs:
  deploy:

    runs-on: ubuntu-latest

    steps:
    - uses: actions/checkout@v3
    - name: Set up Python
      uses: actions/setup-python@v3
      with:
        python-version: '3.x'
    - name: Install dependencies
      run: |
        python -m pip install --upgrade pip
        pip install build
    - name: Build package
      run: python -m build
    - name: Publish package
      uses: pypa/gh-action-pypi-publish@27b31702a0e7fc50959f5ad993c78deac1bdfc29
      with:
        user: __token__
        password: ${{ secrets.PYPI_API_TOKEN }}
20 changes: 20 additions & 0 deletions .gitignore
@@ -127,3 +127,23 @@ dmypy.json

# Pyre type checker
.pyre/

# Mac stuff
.idea
.DS_Store

# project stuff
data

# google cloud
.tmp.driveupload
.tmp.drivedownload
trest

# VS Code
.vscode

# cluster stuff
TMPDIR
gpu_script.sh
Library
Empty file added AUTHORS.md
Empty file.
63 changes: 63 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,63 @@
# Contributing Instructions

## Set-up

First, clone the repository locally. After entering the repository, create a new virtual environment and run:

`pip install -e .`

If you now run `pip list` you should see `ArgMiner` among your installed packages. You have just installed ArgMiner in a development (editable) environment, and its dependencies have been installed along with it.

To 'develop' in this environment, update your virtual env with any new packages using `pip install`. The key point is that these won't be persisted into ArgMiner itself; they will only be used by you locally. This helps keep the production version clean while our requirements change.

I've added a `requirements.txt` file in every folder. When you install something new, make sure you keep that file updated, e.g. using `pip freeze > your_name/requirements.txt`.

## Using Jupyter Notebooks

First, install `notebook` into your virtual environment using `pip install notebook`.

Now, unlike with plain Python, you won't be able to import anything from ArgMiner directly, because Jupyter Notebooks limit the path to the directory the notebook is run from. This means the notebook won't be able to 'see' the ArgMiner package. To fix this, you must add your virtual env as a kernel to Jupyter. First, run:

`pip install ipykernel`

Then:

`ipython kernel install --name=your_venv_name`

When you now go into Jupyter and select the kernel to run your notebook with, make sure it is the one you just installed. You should then be able to import ArgMiner as a package.

## Using Colab

Unfortunately, because of the way Colab works, using it with our custom environment is a bit difficult. In other projects we zipped the entire project each time it was updated, unzipped it on Colab, and then installed dependencies. This was problematic because it caused version mismatches.

Here I will describe a solution I've come up with for how we can do this as a team. It is not clean, but it should work for our purposes.

### Committing changes directly from Colab

When you open Colab, the default screen shows your 'Recents'. Navigate to the 'GitHub' tab and check the 'Include private repos' box. Then, in the search bar, find 'namiyousef/argument-mining' and select 'develop' as the branch. At some point you will be asked to connect your account to GitHub; make sure you do so when prompted.

Once that is done, you'll be able to see all the notebooks in the repository in the last state that they were pushed.

These notebooks don't exist locally, so if you make changes and try to save them, a pop-up will tell you that you don't have permission to save the notebook and offer to save a copy in Drive. Don't save it locally. Instead, click File -> Save a copy in GitHub (NOT GitHub Gist), update the commit message to something appropriate, and submit :)

### Syncing the repo with Google Drive

There are multiple ways to do this; to keep things simple I will go through a single method only. This assumes that you've already cloned the repository locally.

First, download Google Drive for desktop. Once this is done, sync the local copy of your repo with Google Drive. This will save the repository under 'Other computers' within Google Drive. Now go into the browser version of Google Drive, find the repository folder, right click, and choose 'Add shortcut to Drive'. This makes sure the folder is accessible through 'My Drive'.

### Installing the virtual environment

To make things easy, I've written some code that automatically installs ArgMiner as a package in your Colab session. Please find the relevant code in `experiments/yousef/test.ipynb`.

### Pointers

- To avoid issues, please make sure that your repository is always up to date (e.g. using `git fetch`) and that local changes are fully synced to Google Drive before you run notebooks against it.
- If you update your notebooks on Colab, your local repository will be out of sync. Make sure you fetch before persisting any local changes.

For now, there is no clear and consistent way to save things from Colab, so if you want to save anything please do it locally. As we play around with this setup further I'll find a way of saving things nicely.

## Making changes to ArgMiner
I've laid out the project this way so that we don't clutter the actual argminer package. The idea is that we keep experimenting in our own folders under `experiments`. When we are ready to persist new changes to argminer, we review them together. As of right now, this will be done in an ad-hoc fashion; there are no automated tests just yet.

## Tests
21 changes: 21 additions & 0 deletions LICENSE
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2022 Yousef, Federico, Qingyu, Changmao

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
111 changes: 109 additions & 2 deletions README.md
@@ -1,2 +1,109 @@
# nlp-placeholder-name
Repository for NLP project. Name to be changed when we decide on a project
# ArgMiner: End-to-end Argument Mining

![GitHub](https://img.shields.io/github/license/namiyousef/argument-mining)
![GitHub Workflow Status](https://img.shields.io/github/workflow/status/namiyousef/argument-mining/Python%20package)
![GitHub tag (latest by date)](https://img.shields.io/github/v/tag/namiyousef/argument-mining)
---

_argminer_ is a PyTorch-based package for argument mining on state-of-the-art datasets. It provides a high-level API for processing these datasets, applying different labelling strategies, augmenting them, training on them with models from [huggingface](https://huggingface.co/), and performing inference.

## Datasets

- Argument Annotated Essays [[1]](#1): a collection of 402 essays written by college students, with clauses classified as Premise, Claim and MajorClaim. This is the SOTA dataset in the field of argument mining.

- PERSUADE [[2]](#2): a newly released dataset from the Feedback Prize Kaggle [competition](https://www.kaggle.com/competitions/feedback-prize-2021/overview). This is a collection of over 15,000 argumentative essays written by U.S. students in grades 6-12. It contains the following argument types: Lead, Position, Claim, Counterclaim, Rebuttal, Evidence, and Concluding Statement.

Another important dataset in this field is ARG2020 [[3]](#3). It was made public around the time of the early-access release, so support for it does not yet exist; there are plans to implement it in the future.

## Installation
You can install the package directly from PyPI using pip:
```bash
pip install argminer
```
To install from the latest commit instead, use:
```bash
pip install argminer@git+https://[email protected]/namiyousef/argument-mining.git@develop
```

## Argminer features

Datasets in the field of argument mining are stored in completely different formats and use different labels. This makes it difficult to compare and contrast model performance across them under different configurations and processing steps.

The data processing API provides a standard method of processing these datasets end-to-end to enable fast and consistent experimentation and modelling.

- ADD DIAGRAM

### Labelling Strategies

SOTA argument mining methods treat chunking text into its argument types and classifying those chunks as an NER problem on long sequences. This means that each segment of text, with its associated label, is converted into a sequence of classes for the model to predict. To illustrate this we will use the following passage:

```python
"Natural Language Processing is the best field in Machine Learning. According to a recent poll by DeepLearning.ai its popularity has increased by twofold in the last 2 years."
```


Let's suppose for the sake of argument that the passage has the following labels:

**Sentence 1: Claim**
```python
sentence_1 = "Natural Language Processing is the best field in Machine Learning."
label_1 = "Claim"
```

**Sentence 2: Evidence**
```python
sentence_2 = "According to a recent poll by DeepLearning.ai its popularity has increased by twofold in the last 2 years."
label_2 = "Evidence"
```

From this we can create a vector with the associated label for each word. Thus the whole text would have an associated label sequence as follows:

```python
labels = ["Claim"]*10 + ["Evidence"]*18 # numbers are length of split sentences
```

With the NER style of labelling, you can modify the above to indicate whether a label marks the beginning of a chunk, the end of a chunk, or the inside of a chunk. For example, `"According"` would be labelled `"B-Evidence"` in sentence 2, and the subsequent words of the chunk would be labelled `"I-Evidence"`.
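The word-to-BIO conversion described above can be sketched in a few lines of plain Python (an illustrative helper, not the argminer API):

```python
def to_bio(word_labels):
    """Convert a flat sequence of word labels into BIO tags.

    The first word of each contiguous chunk gets a "B-" prefix;
    subsequent words of the same chunk get "I-".
    """
    bio = []
    prev = None
    for label in word_labels:
        if label == "O":
            bio.append("O")
        elif label == prev:
            bio.append(f"I-{label}")
        else:
            bio.append(f"B-{label}")
        prev = label
    return bio

# the example passage: 10 "Claim" words followed by 18 "Evidence" words
labels = ["Claim"] * 10 + ["Evidence"] * 18
bio = to_bio(labels)
assert bio[0] == "B-Claim"
assert bio[10] == "B-Evidence"
assert bio[11] == "I-Evidence"
```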

The above is easy when splitting on words, but it becomes more complicated once we consider how transformer models work. These tokenise inputs into subwords, and therefore the labels have to be extended. This raises further questions: if the word `"According"` is split into `"Accord"` and `"ing"`, do we label both of these as `"B-Evidence"`, or should we label `"ing"` as `"I-Evidence"`? Do we just ignore subtokens? Or do we label subtokens as a completely separate class?

We call a labelling strategy that keeps the subtoken labels the same as the start-token labels `"wordLevel"`, and a strategy that differentiates between them `"standard"`. Further, we provide the following labelling schemes:

- **IO:** Argument tokens are classified as `"I-{arg_type}"` and non-argument tokens as `"O"`
- **BIO:** Argument start tokens are classified as `"B-{arg_type}"`, all other argument tokens as `"I-{arg_type}"`. Non-argument tokens as `"O"`
- **BIEO:** Argument start tokens are classified as `"B-{arg_type}"`, argument end tokens as `"E-{arg_type}"` and all other argument tokens as `"I-{arg_type}"`. Non-argument tokens as `"O"`
- **BIXO:** The first token of an argument is classified as `"B-{arg_type}"` and other argument word-start tokens as `"I-{arg_type}"`; argument subtokens are classified as `"X"`. Non-argument tokens as `"O"`
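To make one of these schemes concrete, here is a toy BIEO tagger over word labels (an illustrative sketch, not the argminer processor; single-word chunks get `B-`, matching the note below the strategy list):

```python
def to_bieo(word_labels):
    """Convert word labels into BIEO tags.

    Chunk start -> "B-", chunk end -> "E-", interior -> "I-";
    for a single-word chunk "B-" is prioritised over "E-".
    """
    n = len(word_labels)
    tags = []
    for i, label in enumerate(word_labels):
        if label == "O":
            tags.append("O")
            continue
        is_start = i == 0 or word_labels[i - 1] != label
        is_end = i == n - 1 or word_labels[i + 1] != label
        if is_start:  # "B-" wins for single-word chunks
            tags.append(f"B-{label}")
        elif is_end:
            tags.append(f"E-{label}")
        else:
            tags.append(f"I-{label}")
    return tags

assert to_bieo(["Claim", "Claim", "Claim"]) == ["B-Claim", "I-Claim", "E-Claim"]
assert to_bieo(["Claim"]) == ["B-Claim"]
```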

Considering all combinations, the processor API provides functionality for the following strategies:

- standard_io
- wordLevel_io
- standard_bio
- wordLevel_bio
- standard_bieo*
- wordLevel_bieo
- standard_bixo**

\* `B-` labels are prioritised over `E-` labels, e.g. for a single-word sentence the word would be labelled as `B-`.

\*\* This method is not backed by the literature; it is something we thought would be interesting to examine. The intuition is that the `X` label captures grammatical elements of argument segments.
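The `wordLevel`/`standard` distinction can likewise be sketched for a single word's subtokens (a hypothetical helper, not the argminer API; the `"Accord"`/`"ing"` split is purely illustrative):

```python
def expand_to_subtokens(word_label, n_subtokens, strategy="standard"):
    """Expand one word's BIO label across its subword tokens.

    "wordLevel": every subtoken repeats the word's label.
    "standard": only the first subtoken keeps the "B-" prefix;
    the rest are demoted to "I-".
    """
    if strategy == "wordLevel" or not word_label.startswith("B-"):
        return [word_label] * n_subtokens
    return [word_label] + ["I-" + word_label[2:]] * (n_subtokens - 1)

# "According" -> ["Accord", "ing"] (an illustrative subword split)
assert expand_to_subtokens("B-Evidence", 2, "wordLevel") == ["B-Evidence", "B-Evidence"]
assert expand_to_subtokens("B-Evidence", 2, "standard") == ["B-Evidence", "I-Evidence"]
```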

### Data Augmentation (Adversarial Examples)


### Evaluation

# Quick Start
- TODO



# References
<a id="1">[1]</a>
Stab, C. and Gurevych, I. (2017). _Parsing Argumentation Structures in Persuasive Essays_. Computational Linguistics. DOI: 10.1162/COLI_a_00295

<a id="2">[2]</a>
Crossley, S. and The Learning Agency Lab. _PERSUADE corpus_. url: https://github.com/scrosseye/PERSUADE_corpus

<a id="3">[3]</a>
Alhindi, T. and Ghosh, D. (2021). _"Sharks are not the threat humans are": Argument Component Segmentation in School Student Essays_. Proceedings of the 16th Workshop on Innovative Use of NLP for Building Educational Applications, pp. 210-222. url: https://aclanthology.org/2021.bea-1.22
2 changes: 2 additions & 0 deletions Useful link
@@ -0,0 +1,2 @@

paper summary: https://marmalade-marten-b45.notion.site/NLP-group-coursework-003529854c4f45258036859c20f94493
4 changes: 4 additions & 0 deletions argminer/__init__.py
@@ -0,0 +1,4 @@
import argminer


__version__ = '0.0.17'
36 changes: 36 additions & 0 deletions argminer/config.py
@@ -0,0 +1,36 @@
import os
import pandas as pd

# -- get email details
EMAIL = os.environ.get('EMAIL', '[email protected]')
EMAIL_PASSWORD = os.environ.get('EMAIL_PASSWORD', 'password')
EMAIL_RECIPIENTS = os.environ.get('EMAIL_RECIPIENTS', EMAIL)

# -- argument mining
PREDICTION_STRING_START_ID = 0
MAX_NORM = 10
# -- label maps
# TODO automate these...
LABELS_DICT = dict(
    TUDarmstadt=['MajorClaim', 'Claim', 'Premise'],
    Persuade=['Lead', 'Position', 'Claim', 'Counterclaim', 'Rebuttal', 'Evidence', 'Concluding Statement']
)
STRATEGIES = ['io', 'bio', 'bieo', 'bixo']

LABELS_MAP_DICT = {}
for dataset, labels in LABELS_DICT.items():
    LABELS_MAP_DICT[dataset] = {}
    for strategy in STRATEGIES:
        new_labels = ['O']
        if strategy == 'bixo':
            new_labels.append('X')
        for label in labels:
            if 'b' in strategy:
                new_labels.append(f'B-{label}')
            new_labels.append(f'I-{label}')
            if 'e' in strategy:
                new_labels.append(f'E-{label}')
        LABELS_MAP_DICT[dataset][strategy] = pd.DataFrame({
            'label_id': list(range(len(new_labels))),
            'label': new_labels
        })
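As a quick sanity check of the loop above, the `bio` strategy for the TUDarmstadt labels expands to `O` plus a `B-`/`I-` pair per class, with positional label ids (a standalone mirror of the loop, not an import from argminer):

```python
# Standalone mirror of the config loop for one dataset/strategy pair
# (TUDarmstadt + 'bio'); not an import from argminer.
labels = ['MajorClaim', 'Claim', 'Premise']
strategy = 'bio'

new_labels = ['O']
for label in labels:
    if 'b' in strategy:
        new_labels.append(f'B-{label}')
    new_labels.append(f'I-{label}')

assert new_labels == [
    'O',
    'B-MajorClaim', 'I-MajorClaim',
    'B-Claim', 'I-Claim',
    'B-Premise', 'I-Premise',
]
# label ids are positional: 'O' -> 0, 'B-MajorClaim' -> 1, and so on
```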
