Application #19

Open · wants to merge 91 commits into base: main

Commits (91)
1e3dbff
added SVM classifier
MagMueller Oct 9, 2021
efd3421
added to classification.sh
MagMueller Oct 9, 2021
004eb57
Merge remote-tracking branch 'origin/main' into SupportVectorMachineC…
MagMueller Oct 9, 2021
637faec
add knn
MagMueller Oct 9, 2021
97c7e69
add svm
MagMueller Oct 9, 2021
e5a4e9b
Merge remote-tracking branch 'origin/main' into SupportVectorMachineC…
MagMueller Oct 9, 2021
7755e5d
Merge branch 'main' of https://github.com/avocardio/MLinPractice into…
MagMueller Oct 9, 2021
0525e22
Merge branch 'SupportVectorMachineClassifier' into main
MagMueller Oct 9, 2021
8f31962
spelling error
MagMueller Oct 9, 2021
58fbed5
don't use all data
MagMueller Oct 9, 2021
c09128c
safer limit
MagMueller Oct 9, 2021
d18496a
Merge branch 'lbechberger:main' into main
MagMueller Oct 11, 2021
f078050
implemented hash feature
MagMueller Oct 11, 2021
78b3fc0
Merge branch 'lbechberger:main' into main
avocardio Oct 11, 2021
3fa2beb
new docu. file
avocardio Oct 11, 2021
cd4810c
Merge branch 'main' of https://github.com/avocardio/MLinPractice
avocardio Oct 11, 2021
521132a
Update Documentation
avocardio Oct 11, 2021
6c8c4e3
added hash vector, but Cohen's kappa still 0.0
MagMueller Oct 11, 2021
b36b6e7
Merge branch 'main' of https://github.com/avocardio/MLinPractice
avocardio Oct 11, 2021
b5b598d
Wrong Docu
avocardio Oct 11, 2021
d0d66c8
Documentation update
avocardio Oct 11, 2021
4e1faf8
test for hash vector
MagMueller Oct 12, 2021
cf8f5a4
Merge remote-tracking branch 'origin/main' into hash_feature
MagMueller Oct 12, 2021
645d2bf
updated readme and add first try to documentation.md
MagMueller Oct 12, 2021
70f3928
spelling mistakes
MagMueller Oct 12, 2021
a0c9f2b
Merge pull request #1 from avocardio/hash_feature
MagMueller Oct 12, 2021
dd87c7b
filter out all languages except English, maybe later: translate
MagMueller Oct 12, 2021
f881a57
preprocess start
avocardio Oct 13, 2021
35bffee
preprocessing works now
MagMueller Oct 13, 2021
92c7321
now the output file looks correct
MagMueller Oct 13, 2021
9891b2e
Merge branch 'main' into preprocessing/english_tweets
MagMueller Oct 13, 2021
95efc6c
edit documentation
MagMueller Oct 13, 2021
b2bf9bb
edit other files for test run
MagMueller Oct 13, 2021
c879c9d
deleted file
avocardio Oct 13, 2021
7e806da
fix
avocardio Oct 14, 2021
e729203
small changes / fixes
avocardio Oct 14, 2021
9b5d213
commented out small dataset
avocardio Oct 14, 2021
3a13c43
renamed file, added emoji / link remover
avocardio Oct 15, 2021
3730089
small mistake
avocardio Oct 15, 2021
e1cd45d
hashtag_counts file (errors)
avocardio Oct 15, 2021
cc5d908
preprocessing done, edited string remover, it works now!!!
MagMueller Oct 15, 2021
2e36ac8
prettier
MagMueller Oct 15, 2021
66f6ba5
created sklearn pipeline
MagMueller Oct 16, 2021
c02c425
hash_vectorizer with SGDClassifier and
MagMueller Oct 17, 2021
99da159
more n_features
MagMueller Oct 17, 2021
fe0398f
best till now
MagMueller Oct 17, 2021
fe9966c
more data, tfidf and SGD
MagMueller Oct 17, 2021
a55a9d8
Linear SVC
MagMueller Oct 17, 2021
7528167
LogisticRegression
MagMueller Oct 17, 2021
faab8fc
LogisticRegression TfidfVectorizer
MagMueller Oct 17, 2021
23ded36
added tfidf_vectorize
MagMueller Oct 17, 2021
cc32cbb
test for tfidf, found mistake in output
MagMueller Oct 17, 2021
b22eee2
mistake was in test class, edit hash_vec test
MagMueller Oct 17, 2021
5dadbab
documentation for test
MagMueller Oct 17, 2021
fa1025c
Merge branch 'main' into classifier
MagMueller Oct 17, 2021
02d3259
new file with more than tweet feature.
MagMueller Oct 17, 2021
6e7161a
give model likes and retweets in training:
MagMueller Oct 17, 2021
750f9bd
hashtag counter working
avocardio Oct 18, 2021
4c13843
added emoji count file
avocardio Oct 18, 2021
a77c46d
Updated emoji remover for better filtering
avocardio Oct 18, 2021
b88d3ea
Updated emoji remover for better filtering
avocardio Oct 18, 2021
512d15d
working emoji count
avocardio Oct 18, 2021
a7be4a9
count instead of contains
avocardio Oct 18, 2021
7f8489c
add more classifier
MagMueller Oct 18, 2021
5892de6
remove trace
MagMueller Oct 18, 2021
87cbd5d
mirror changes for merge
MagMueller Oct 18, 2021
5209cdc
Merge branch 'main' into classifier
MagMueller Oct 18, 2021
0a429ad
Adjusted format for merge
shagemann2021 Oct 19, 2021
3d3862a
Change 'codes' to 'code'
shagemann2021 Oct 19, 2021
c51ac17
added .lower() for all words
avocardio Oct 19, 2021
6f372f7
word2vec feature WIP
avocardio Oct 20, 2021
13d82cc
working
avocardio Oct 20, 2021
2fe11d8
added time file
avocardio Oct 20, 2021
c94954b
fixed
avocardio Oct 20, 2021
7fa5243
Merge pull request #3 from avocardio/time
avocardio Oct 20, 2021
a876b22
Merge branch 'main' into word2vec
avocardio Oct 20, 2021
cf3b184
Merge pull request #4 from avocardio/word2vec
avocardio Oct 20, 2021
ae6edd5
Merge branch 'main' into emoji_count
avocardio Oct 20, 2021
851c0a2
Merge pull request #5 from avocardio/emoji_count
avocardio Oct 20, 2021
8941523
Merge branch 'main' into hashtag_count
avocardio Oct 20, 2021
e536d9a
Merge pull request #6 from avocardio/hashtag_count
avocardio Oct 20, 2021
3d046cc
Repeated args
avocardio Oct 20, 2021
613aaf6
Small fix, deleted print()
avocardio Oct 20, 2021
f43d15b
new metric
MagMueller Oct 20, 2021
87bde87
name fix for features.csv
avocardio Oct 22, 2021
c36bde5
rounded word2vec to 4
avocardio Oct 22, 2021
e62214d
removed feature: replies count
avocardio Oct 24, 2021
02b0903
documentation images
avocardio Oct 24, 2021
346b9c9
Image Test
avocardio Oct 24, 2021
4fe362a
Update Documentation.md
avocardio Oct 24, 2021
ec6f7e0
Extended and improved the application
shagemann2021 Oct 26, 2021
246 changes: 217 additions & 29 deletions Documentation.md

Large diffs are not rendered by default.

Binary file added Documentation/time_non_viral.png
Binary file added Documentation/time_viral.png
Binary file added Documentation/word_count_non_viral.png
Binary file added Documentation/word_count_viral.png
6 changes: 5 additions & 1 deletion README.md
@@ -19,6 +19,8 @@ conda install -y -q -c conda-forge gensim=4.1.2
conda install -y -q -c conda-forge spyder=5.1.5
conda install -y -q -c conda-forge pandas=1.1.5
conda install -y -q -c conda-forge mlflow=1.20.2
conda install -y -q -c conda-forge spacy
conda install -c conda-forge langdetect
```

You can double-check that all of these packages have been installed by running `conda list` inside of your virtual environment. The Spyder IDE can be started by typing `~/miniconda/envs/MLinPractice/bin/spyder` in your terminal window (assuming you use miniconda, which is installed right in your home directory).
@@ -91,6 +93,8 @@ The features to be extracted can be configured with the following optional parameters:
Moreover, the script supports importing and exporting fitted feature extractors with the following optional arguments:
- `-i` or `--import_file`: Load a configured and fitted feature extraction from the given pickle file. Ignore all parameters that configure the features to extract.
- `-e` or `--export_file`: Export the configured and fitted feature extraction into the given pickle file.
- `--hash_vec`: use sklearn's HashingVectorizer; the number of features for the hash vector can be configured via `HASH_VECTOR_N_FEATURES` in `util.py` (see the sketch below).
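For illustration, a minimal sketch of what this flag switches on. This is not the project's exact wiring; `HASH_VECTOR_N_FEATURES` is assumed to mirror the `n_features=2**22` used in `code/all_in_one.py`:

```
from sklearn.feature_extraction.text import HashingVectorizer

HASH_VECTOR_N_FEATURES = 2**22  # assumed value, mirroring code/all_in_one.py

# hash raw tweet text into a fixed-width sparse matrix; hashing is
# stateless, so the vectorizer needs no fitting and no stored vocabulary
vectorizer = HashingVectorizer(n_features=HASH_VECTOR_N_FEATURES,
                               strip_accents='ascii', stop_words='english',
                               ngram_range=(1, 3))
features = vectorizer.transform(["an example tweet about machine learning"])
print(features.shape)  # (1, 4194304)
```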

## Dimensionality Reduction

@@ -128,7 +132,7 @@ By default, this data is used to train a classifier, which is specified by one of the following optional arguments:
The classifier is then evaluated, using the evaluation metrics as specified through the following optional arguments:
- `-a` or `--accuracy`: Classification accuracy (i.e., percentage of correctly classified examples).
- `-k` or `--kappa`: Cohen's kappa (i.e., accuracy adjusted for the probability of random agreement).
- `--small 1000`: use only the first 1000 tweets (see the sketch below).
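To make the difference between the two metrics concrete, a small self-contained example with made-up labels (not project data):

```
from sklearn.metrics import accuracy_score, cohen_kappa_score

y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 1, 0]

# accuracy counts raw agreement; kappa discounts the agreement
# expected from guessing the majority class by chance
print(accuracy_score(y_true, y_pred))     # ~0.83
print(cohen_kappa_score(y_true, y_pred))  # ~0.57
```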

Moreover, the script supports importing and exporting trained classifiers with the following optional arguments (see the sketch after the list):
- `-i` or `--import_file`: Load a trained classifier from the given pickle file. Ignore all parameters that configure the classifier to use and don't retrain the classifier.
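A minimal sketch of the pickle round-trip behind these flags. The path matches the one used in `code/all_in_one.sh`; the stand-in model is only there to make the snippet self-contained:

```
import pickle
from sklearn.linear_model import LogisticRegression

classifier = LogisticRegression().fit([[0], [1]], [0, 1])  # stand-in model

# export: persist the trained classifier (what -e / --export_file does)
with open("data/classification/classifier.pickle", 'wb') as f_out:
    pickle.dump(classifier, f_out)

# import: reload it later instead of retraining (what -i / --import_file does)
with open("data/classification/classifier.pickle", 'rb') as f_in:
    classifier = pickle.load(f_in)
```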
183 changes: 183 additions & 0 deletions code/all_in_one.py
@@ -0,0 +1,183 @@
import argparse
import csv
import pickle


# feature_extraction
from sklearn.feature_extraction.text import HashingVectorizer, TfidfVectorizer

# feature_selection
from sklearn.feature_selection import SelectKBest, chi2

# dim_reduction
from sklearn.decomposition import NMF

# classifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.svm import LinearSVC, SVC

from sklearn.pipeline import Pipeline

# metrics
from sklearn.metrics import classification_report, cohen_kappa_score, accuracy_score, balanced_accuracy_score

import pandas as pd

# balancing
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler
from sklearn.model_selection import train_test_split

from collections import Counter

parser = argparse.ArgumentParser(description="all in one")
parser.add_argument("input_file", help="path to the input file")
parser.add_argument("-e", "--export_file",
                    help="export the trained classifier to the given location", default=None)

# evaluation
parser.add_argument("-a", "--accuracy", action="store_true",
                    help="evaluate using accuracy")
parser.add_argument("-k", "--kappa", action="store_true",
                    help="evaluate using Cohen's kappa")
parser.add_argument("--balanced_accuracy", action="store_true",
                    help="evaluate using balanced accuracy")
parser.add_argument("--classification_report", action="store_true",
                    help="evaluate using a classification report")

# dataset balancing
parser.add_argument("--balance", type=str,
                    help="choose between under- and oversampling", default=None)
parser.add_argument("--small", type=int,
                    help="use only a subset of the data", default=None)
# feature extraction
parser.add_argument("--feature_extraction", type=str,
                    help="choose a feature extraction algorithm", default=None)
# dimensionality reduction
parser.add_argument("--dim_red", type=str,
                    help="choose a dimensionality reduction algorithm", default=None)
# classifier
parser.add_argument("--classifier", type=str,
                    help="choose a classifier", default=None)

args = parser.parse_args()

# load data
# (an earlier pickle-based loader, kept for reference)
# with open(args.input_file, 'rb') as f_in:
#     data = pickle.load(f_in)
df = pd.read_csv(args.input_file, quoting=csv.QUOTE_NONNUMERIC,
                 lineterminator="\n")

if args.small is not None:
    # if limit is given
    max_length = len(df['label'])
    limit = min(args.small, max_length)
    df = df.head(limit)

# split data
input_col = 'preprocess_col'
X = df[input_col].to_numpy().reshape(-1, 1)
y = df["label"].to_numpy().ravel()

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state=42)

# balance data
if args.balance == 'over_sampler':
    over_sampler = RandomOverSampler(random_state=42)
    X_res, y_res = over_sampler.fit_resample(X_train, y_train)
elif args.balance == 'under_sampler':
    under_sampler = RandomUnderSampler(random_state=42)
    X_res, y_res = under_sampler.fit_resample(X_train, y_train)
else:
    X_res, y_res = X_train, y_train

print(f"Training target statistics: {Counter(y_res)}")
print(f"Testing target statistics: {Counter(y_test)}")


# assemble the pipeline as a list of (name, transformer/estimator) stages
my_pipeline = []

# feature extraction
if args.feature_extraction == 'HashingVectorizer':
    my_pipeline.append(('hashvec', HashingVectorizer(n_features=2**22,
                        strip_accents='ascii', stop_words='english', ngram_range=(1, 3))))
elif args.feature_extraction == 'TfidfVectorizer':
    my_pipeline.append(('tfidf', TfidfVectorizer(
        stop_words='english', ngram_range=(1, 3))))

# dimensionality reduction
if args.dim_red == 'SelectKBest(chi2)':
    my_pipeline.append(('dim_red', SelectKBest(chi2)))
elif args.dim_red == 'NMF':
    my_pipeline.append(('nmf', NMF()))

# classifier
if args.classifier == 'MultinomialNB':
    my_pipeline.append(('MNB', MultinomialNB()))
elif args.classifier == 'SGDClassifier':
    my_pipeline.append(('SGD', SGDClassifier(class_weight="balanced", n_jobs=-1,
                                             random_state=42, alpha=1e-07, verbose=1)))
elif args.classifier == 'LogisticRegression':
    my_pipeline.append(('LogisticRegression', LogisticRegression(class_weight="balanced", n_jobs=-1,
                                                                 random_state=42, verbose=1)))
elif args.classifier == 'LinearSVC':
    my_pipeline.append(('LinearSVC', LinearSVC(class_weight="balanced",
                                               random_state=42, verbose=1)))
elif args.classifier == 'SVC':
    # caution: SVC training time grows roughly quadratically with the number of samples
    my_pipeline.append(('SVC', SVC(class_weight="balanced",
                                   random_state=42, verbose=1)))

classifier = Pipeline(my_pipeline)
classifier.fit(X_res.ravel(), y_res)

# now classify the held-out test data and the (resampled) training data
prediction = classifier.predict(X_test.ravel())
prediction_train_set = classifier.predict(X_res.ravel())

# collect all evaluation metrics
evaluation_metrics = []
if args.accuracy:
    evaluation_metrics.append(("accuracy", accuracy_score))
if args.kappa:
    evaluation_metrics.append(("Cohen's kappa", cohen_kappa_score))
if args.balanced_accuracy:
    evaluation_metrics.append(("balanced accuracy", balanced_accuracy_score))
# compute and print them
for metric_name, metric in evaluation_metrics:
    print("    {0}: {1}".format(metric_name, metric(y_test, prediction)))

if args.classification_report:
    categories = ["Flop", "Viral"]
    print("Classification report, training set:")
    print(classification_report(y_res, prediction_train_set,
                                target_names=categories))
    print("Classification report, test set:")
    print(classification_report(y_test.ravel(), prediction,
                                target_names=categories))


# export the trained classifier if the user wants us to do so
if args.export_file is not None:
    with open(args.export_file, 'wb') as f_out:
        pickle.dump(classifier, f_out)
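For reference, a compact, self-contained version of the pipeline this script assembles for `--feature_extraction 'TfidfVectorizer' --classifier 'LogisticRegression'` (toy texts and labels for illustration; the real run reads the preprocessed CSV):

```
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

texts = ["funny cat video goes viral", "quarterly report attached",
         "you will not believe this trick", "meeting moved to 3pm"]
labels = [1, 0, 1, 0]  # 1 = viral, 0 = flop

clf = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', ngram_range=(1, 3))),
    ('LogisticRegression', LogisticRegression(class_weight="balanced",
                                              random_state=42)),
])
clf.fit(texts, labels)
print(clf.predict(["another viral cat video"]))  # likely [1] on this toy data
```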
27 changes: 27 additions & 0 deletions code/all_in_one.sh
@@ -0,0 +1,27 @@
#!/bin/bash

# create directory if not yet existing
mkdir -p data/all_in_one/

# earlier attempt: train and evaluate on the extracted-features pickle (training set)
#echo "  training set"
#python3 -m code.all_in_one data/feature_extraction/training.pickle -e data/classification/classifier.pickle --accuracy --kappa --balanced_accuracy --classification_report --small 10000

# raw input, with preprocessing
#python3 -m code.all_in_one data/preprocessing/split/training.csv -e data/classification/classifier.pickle --accuracy --kappa --balanced_accuracy --classification_report --hash_vectorizer #--count_vectorizer

# raw input, without preprocessing
#python3 -m code.all_in_one data/preprocessing/labeled.csv -e data/classification/classifier.pickle --accuracy --kappa --balanced_accuracy --classification_report --count_vectorizer #--hash_vectorizer #

# sklearn example
#python3 -m code.example_sklearn_pipeline data/preprocessing/split/training.csv


# earlier attempt: evaluate the imported classifier on the validation set
#echo "  validation set"
#python3 -m code.all_in_one data/feature_extraction/validation.pickle -i data/classification/classifier.pickle --accuracy --kappa --balanced_accuracy --small 10000

# don't touch the test set, yet, because that would ruin the final generalization experiment!

# new approach: pick one feature extractor (HashingVectorizer | TfidfVectorizer) and one classifier
# (SVC | SGDClassifier | LogisticRegression | LinearSVC | MultinomialNB);
# optional: --small 20000, --balance 'over_sampler';
# alternative inputs: data/preprocessing/split/training.csv, data/preprocessing/labeled.csv
python3 -m code.all_in_one data/preprocessing/preprocessed.csv -e data/classification/classifier.pickle --accuracy --kappa --balanced_accuracy --classification_report --classifier 'LogisticRegression' --feature_extraction 'TfidfVectorizer'