Added Tweet Count to README and Featurization

Updated the README to include the number of tweets used to build the vectorizer and FastText model, and added code to build those models from a given set of tweets.
infeco · Jan 10, 2019 · f4c0e98
1 parent 8671897

Showing 5 changed files with 2,287 additions and 0 deletions.
diff --git a/README.md b/README.md
@@ -47,6 +47,7 @@ While FastText provides several pre-trained word vector datasets trained on Wiki
 ### Featurization Models
 
 We provide the TF-IDF vectorizer built from a 1-percent sample of English tweets posted to Twitter and captured in Twitter's public sample stream between 2013 and 2016.
+This dataset contains 11,715,393 tweets.
 You can download this vectorizer here: [2013to2016_tfidf_vectorizer_20190109.pkl](http://obj.umiacs.umd.edu/trecis_2018/2013to2016_tfidf_vectorizer_20190109.pkl)
 
 We also provide our FastText-trained model on this same set of English tweets, which you can find here: [archived_text_sample_2013to2016_gensim_200.model.tgz](http://obj.umiacs.umd.edu/trecis_2018/archived_text_sample_2013to2016_gensim_200.model.tgz)
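
For readers who download the vectorizer linked in the README hunk above, the following is a minimal sketch of loading it. It assumes the `.pkl` file is a standard Python pickle, which the extension suggests but the README does not confirm. The helper name `load_vectorizer` is hypothetical and is not part of this repository.

```python
import pickle
from pathlib import Path


def load_vectorizer(path):
    """Unpickle and return whatever object is stored at `path`.

    Assumption: the file is a regular Python pickle. Only unpickle
    files from sources you trust, because unpickling can run
    arbitrary code.
    """
    with Path(path).open("rb") as handle:
        return pickle.load(handle)


# Example usage (assumes the file was downloaded to the working directory):
# vectorizer = load_vectorizer("2013to2016_tfidf_vectorizer_20190109.pkl")
```

If the pickled object is a scikit-learn `TfidfVectorizer` (an assumption; the README only calls it a TF-IDF vectorizer), unpickling requires a compatible scikit-learn installation, and new tweets could then be featurized with `vectorizer.transform([...])`.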