Sample Twitter data:
{
    "_id": "516799596153307136",
    "lang": "en",
    "plt": -5.799,
    "uid": "67763278",
    "tlt": -5.822,
    "cc": "BR",
    "f": "tw201492918305",
    "p": "a4ddc3856053f7e1",
    "flrs": 1014,
    "acr": {
        "$date": 1250900341000
    },
    "t": "@barrosmirella questão de ideias e conceitos. Você se definiu homofóbica nessa frase. Ngm precisa aceitar e/ou apoiar a homossexualidade*",
    "cr": {
        "$date": 1412049600000
    },
    "pln": -35.221,
    "tln": -35.229,
    "flng": 273
}
-
mongoimport --db test --collection tweets_collection --file tweets_collection.json
-
db.all_tweets.ensureIndex({ t: "text" })
-
Append "," to the end of every line:
sed 's/$/,/' all_tweets.json > all_tweets1.json
-
Convert the dumped JSON to CSV using json_to_csv.py; certain things then need to be edited out manually (see the sed cleanup below).
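json_to_csv.py is not reproduced in these notes; a minimal sketch of the idea, assuming the field names from the sample document above (the real script may pick different columns):

import csv
import json
import sys

# Columns to keep; an assumption based on the sample tweet above.
FIELDS = ["_id", "uid", "lang", "cc", "p", "t", "cr", "plt", "pln", "tlt", "tln"]

with open(sys.argv[1]) as infile, open(sys.argv[2], "w", newline="") as outfile:
    writer = csv.DictWriter(outfile, fieldnames=FIELDS)
    writer.writeheader()
    for line in infile:  # bsondump output: one JSON document per line
        doc = json.loads(line)
        writer.writerow({k: doc.get(k, "") for k in FIELDS})

Nested values such as cr come out as raw dict reprs (e.g. {u'$date': u'...'} under Python 2), which is exactly what the sed cleanup below removes.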
-
Removing the leftover dict substrings (double quotes are needed because the patterns contain single quotes; the second pattern anchors at end of line, where the '} actually sits):
sed -e "s/^{u'\$date': u'//" -e "s/'}\$//" data.csv > data1.csv
-
Make sure you drop the collection first.
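In the mongo shell (assuming the database and collection names used around this step):
use twitter
db.tweets_collection.drop()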
-
Use upload.sh to upload data.
cd data/Tagged/
sh ../upload.sh
cd ../Untagged/
sh ../upload.sh
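upload.sh itself is not reproduced here; a plausible sketch, assuming it simply imports every .json file in the current directory into the collection indexed below:

#!/bin/sh
# Assumed behaviour: import each JSON file into the tweets collection.
for f in *.json; do
    mongoimport --db twitter --collection tweets_collection --file "$f"
done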
- Ensure index
use twitter
db.tweets_collection.ensureIndex({ t: "text" })
exit
290,726 tweets unlocated in total.
- Dumping data from mongo:
mongodump --db test --collection tweets_collection -o /home/kaustubh/
Data is stored in .bson format under ~/test/.
- Convert bson to json:
bsondump tweets_collection.bson > tweets_collection.json
Stats:
40,524 tweets; 14,299 untagged.
Tweets from 2017-09-12 04:05:05.000Z to 2017-10-13 07:20:43.000Z
Min date query:
db.getCollection('tweets_collection').aggregate(
    [
        {
            $group: {
                _id: {},
                minDate: { $min: "$cr" }
            }
        }
    ]
);
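The max date comes from the same aggregation with $max:
db.getCollection('tweets_collection').aggregate(
    [
        {
            $group: {
                _id: {},
                maxDate: { $max: "$cr" }
            }
        }
    ]
);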
-
Day-wise analysis: flood count, dengue count, min, max, average. Run count_tweets.py for untagged and tagged separately, inside each folder (a sketch follows):
python ../count_tweets.py
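count_tweets.py is not reproduced in these notes; a rough sketch of the idea, assuming newline-delimited JSON on stdin with the cr and t fields from the sample document above:

import json
import sys
from collections import Counter
from datetime import datetime, timezone

flood, dengue = Counter(), Counter()
for line in sys.stdin:
    doc = json.loads(line)
    # cr is stored as {"$date": <milliseconds since epoch>}.
    day = datetime.fromtimestamp(doc["cr"]["$date"] / 1000, tz=timezone.utc).date()
    text = doc["t"].lower()
    if "flood" in text:
        flood[day] += 1
    if "dengue" in text:
        dengue[day] += 1

for name, counts in [("flood", flood), ("dengue", dengue)]:
    for day in sorted(counts):
        print(name, day, counts[day], sep="\t")
    if counts:
        per_day = counts.values()
        print(name, "min:", min(per_day), "max:", max(per_day),
              "avg:", sum(per_day) / len(per_day))
-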
Containing both words:
db.getCollection('all_tweets').find({$text: {$search: "\"flood\" \"dengue\""} }).count()
-
Containing flood, not dengue:
db.getCollection('all_tweets').find({$text: {$search: "flood -dengue"}}).count()
-
Containing flood or dengue:
db.getCollection('all_tweets').find({$text: {$search: "flood,dengue"} }).count()
265688
-
Locations mentioned and their frequency, for floods and dengue separately [done together for now]
-
Locations:
db.getCollection('all_tweets').distinct('loc')
962
# Redundant
db.all_tweets.aggregate([
    {
        $match: {
            loc: { $not: { $size: 0 } }
        }
    },
    { $unwind: "$loc" },
    {
        $group: {
            _id: { $toLower: '$loc' },
            count: { $sum: 1 }
        }
    },
    {
        $match: {
            count: { $gte: 1 }
        }
    },
    { $sort: { count: -1 } },
    { $limit: 100 }
]);
(Might need to copy-paste from the mongo shell, not Robo 3T.) Output saved in location_counts.json. Run location_tabs.py.
db.all_tweets.aggregate(
    { $group: { _id: '$loc', count: { $sum: 1 } } }
)
Run distinct_locations.py, then location_tabs.py (sketched below).
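location_tabs.py is not reproduced here; a minimal sketch, assuming location_counts.json holds one {"_id": <location>, "count": <n>} document per line (the shape produced by the $group above):

import json

rows = []
with open("location_counts.json") as f:
    for line in f:
        doc = json.loads(line)
        rows.append((doc["_id"], doc["count"]))

# Tab-separated, most frequent location first.
for loc, count in sorted(rows, key=lambda r: r[1], reverse=True):
    print(loc, count, sep="\t")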
-
Locations frequency 'wordcloud', urban vs rural, state-wise, heatmap (word-cloud sketch below).
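For the word cloud itself, something like this could work (assumes the third-party wordcloud package and the tab-separated output of the location_tabs.py sketch above, saved under a hypothetical filename):

from wordcloud import WordCloud

freqs = {}
with open("location_counts.tsv") as f:  # hypothetical file from location_tabs.py
    for line in f:
        loc, count = line.rstrip("\n").split("\t")
        freqs[loc] = int(count)

WordCloud(width=800, height=400).generate_from_frequencies(freqs).to_file("locations_wordcloud.png")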
-
All tagged = generated by us + actual geotagged. Actual geotagged: 1099.
-
db.getCollection('all_tweets').find({"f": {$eq: "1e222211"} }).count()
=> assigned location -
db.getCollection('all_tweets').find({"f": {$nin: [ "1e222211", ""] } }).count()
=> Actual geotagged -
% of tweets identified; test case: 1000 manually checked => how many were mistaken.
-
Why untagged, when they should have been tagged?
-
Improve? Stemming.
-
word2vec for locations?
-
Improve coverage? P, R, F, ROC, AUC? (sketch below)
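P/R/F etc. could be computed from the manually checked sample with scikit-learn; a sketch with hypothetical labels (1 = has a location, 0 = does not):

from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

y_true = [1, 0, 1, 1, 0]  # manual check (hypothetical)
y_pred = [1, 0, 0, 1, 1]  # tagger output (hypothetical)

print("P:", precision_score(y_true, y_pred))
print("R:", recall_score(y_true, y_pred))
print("F:", f1_score(y_true, y_pred))
# ROC/AUC really wants scores rather than hard 0/1 labels.
print("AUC:", roc_auc_score(y_true, y_pred))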
-
Random sample
db.getCollection('all_tweets_untagged').aggregate(
[ { $sample: { size: 200 } } ]
)
** Retweets?
** Remove non-English tweets - Spanish, etc. (queries sketched below)
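Both can be checked directly in the mongo shell; the ^RT convention and the lang field are assumptions based on the sample document above:
// Retweets conventionally start with "RT @".
db.all_tweets.find({t: /^RT @/}).count()
// Non-English tweets, via the lang field.
db.all_tweets.find({lang: {$ne: "en"}}).count()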
-
Get word-cloud data, and the frequency of locations that appear at least 3 times: distinct_locations.py (sketched below).
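distinct_locations.py is not reproduced here; a sketch of the frequency part, assuming documents with a loc array (as in the aggregation above), read as newline-delimited JSON on stdin:

import json
import sys
from collections import Counter

counts = Counter()
for line in sys.stdin:
    doc = json.loads(line)
    for loc in doc.get("loc", []):
        counts[loc.lower()] += 1

frequent = {loc: n for loc, n in counts.items() if n >= 3}
print(len(frequent), "locations appear at least 3 times")
for loc, n in sorted(frequent.items(), key=lambda kv: kv[1], reverse=True):
    print(loc, n, sep="\t")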
-
Print only tweet text from mongo:
db.Jan22_tweets.find({}, {t: 1, _id:0})
-
Export only tweet text:
mongoexport -d test -c Jan22_tweets -f t -o tweets_Jan22.txt
-
Sort hashtags by usage: hashtag_counter.py (sketched below)
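hashtag_counter.py is not reproduced here; a minimal sketch, assuming tweet text arrives one line at a time on stdin (e.g. from the mongoexport output above):

import re
import sys
from collections import Counter

counts = Counter()
for line in sys.stdin:
    counts.update(tag.lower() for tag in re.findall(r"#\w+", line))

for tag, n in counts.most_common():
    print(tag, n, sep="\t")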
-
Has a location (does "p" exist and is non-empty?):
db.getCollection('Jan22_tweets').find({"p": {$exists: true, "$ne": ""}, $text: {$search: "dengue"}})
-
No location:
db.getCollection('Jan22_tweets').find({"p": {$eq :""} ,$text: {$search: "dengue"}}).count()
Dump a large sample to a file by piping the shell through tee (shellBatchSize raises the print limit so all sampled documents land in out.txt):
mongo | tee out.txt
> DBQuery.shellBatchSize = 40000
> db.Jan30_tweets.aggregate(
    { $sample: { size: 40000 } }
)