Sample Twitter data:
{
    "_id": "516799596153307136",
    "lang": "en",
    "plt": -5.799,
    "uid": "67763278",
    "tlt": -5.822,
    "cc": "BR",
    "f": "tw201492918305",
    "p": "a4ddc3856053f7e1",
    "flrs": 1014,
    "acr": {
        "$date": 1250900341000
    },
    "t": "@barrosmirella questão de ideias e conceitos. Você se definiu homofóbica nessa frase. Ngm precisa aceitar e/ou apoiar a homossexualidade*",
    "cr": {
        "$date": 1412049600000
    },
    "pln": -35.221,
    "tln": -35.229,
    "flng": 273
}
-
mongoimport --db test --collection tweets_collection --file tweets_collection.json
-
db.all_tweets.ensureIndex({ t: "text" })
-
Append "," to the end of every line:
sed 's/$/,/' all_tweets.json > all_tweets1.json
-
Convert the dumped JSON to CSV using json_to_csv.py; certain things then need to be edited out manually (see the sed cleanup below).
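json_to_csv.py is not reproduced in these notes; a minimal sketch of the idea, assuming the field names from the sample document above (the real script may pick different columns):

import csv
import json
import sys

# Columns to keep; an assumption based on the sample tweet above.
FIELDS = ["_id", "uid", "lang", "cc", "p", "t", "cr", "plt", "pln", "tlt", "tln"]

with open(sys.argv[1]) as infile, open(sys.argv[2], "w", newline="") as outfile:
    writer = csv.DictWriter(outfile, fieldnames=FIELDS)
    writer.writeheader()
    for line in infile:  # bsondump output: one JSON document per line
        doc = json.loads(line)
        writer.writerow({k: doc.get(k, "") for k in FIELDS})

Nested values such as cr come out as raw dict reprs (e.g. {u'$date': u'...'} under Python 2), which is exactly what the sed cleanup below removes.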
-
Removing the leftover dict substrings (double quotes are needed because the patterns contain single quotes; the second pattern anchors at end of line, where the '} actually sits):
sed -e "s/^{u'\$date': u'//" -e "s/'}\$//" data.csv > data1.csv
-
Make sure you drop the collection first.
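In the mongo shell (assuming the database and collection names used around this step):
use twitter
db.tweets_collection.drop()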
-
Use upload.sh to upload data.
cd data/Tagged/
sh ../upload.sh
cd ../Untagged/
sh ../upload.sh
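upload.sh itself is not reproduced here; a plausible sketch, assuming it simply imports every .json file in the current directory into the collection indexed below:

#!/bin/sh
# Assumed behaviour: import each JSON file into the tweets collection.
for f in *.json; do
    mongoimport --db twitter --collection tweets_collection --file "$f"
done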
- Ensure index
use twitter
db.tweets_collection.ensureIndex({ t: "text" })
exit
290,726 tweets unlocated in total.
- Dumping data from mongo:
mongodump --db test --collection tweets_collection -o /home/kaustubh/
Data is stored in .bson format under ~/test/.
- Convert bson to json:
bsondump tweets_collection.bson > tweets_collection.json
Stats:
40,524 tweets; 14,299 untagged.
Tweets from 2017-09-12 04:05:05.000Z to 2017-10-13 07:20:43.000Z
Min date query:
db.getCollection('tweets_collection').aggregate(
    [
        {
            $group: {
                _id: {},
                minDate: { $min: "$cr" }
            }
        }
    ]
);
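The max date comes from the same aggregation with $max:
db.getCollection('tweets_collection').aggregate(
    [
        {
            $group: {
                _id: {},
                maxDate: { $max: "$cr" }
            }
        }
    ]
);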
-
Day-wise analysis: flood count, dengue count, min, max, average. Run count_tweets.py for untagged and tagged separately, inside each folder (a sketch follows):
python ../count_tweets.py
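count_tweets.py is not reproduced in these notes; a rough sketch of the idea, assuming newline-delimited JSON on stdin with the cr and t fields from the sample document above:

import json
import sys
from collections import Counter
from datetime import datetime, timezone

flood, dengue = Counter(), Counter()
for line in sys.stdin:
    doc = json.loads(line)
    # cr is stored as {"$date": <milliseconds since epoch>}.
    day = datetime.fromtimestamp(doc["cr"]["$date"] / 1000, tz=timezone.utc).date()
    text = doc["t"].lower()
    if "flood" in text:
        flood[day] += 1
    if "dengue" in text:
        dengue[day] += 1

for name, counts in [("flood", flood), ("dengue", dengue)]:
    for day in sorted(counts):
        print(name, day, counts[day], sep="\t")
    if counts:
        per_day = counts.values()
        print(name, "min:", min(per_day), "max:", max(per_day),
              "avg:", sum(per_day) / len(per_day))
-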
Containing both words:
db.getCollection('all_tweets').find({$text: {$search: "\"flood\" \"dengue\""} }).count()
-
Containing flood, not dengue:
db.getCollection('all_tweets').find({$text: {$search: "flood -dengue"}}).count()
-
Containing flood or dengue:
db.getCollection('all_tweets').find({$text: {$search: "flood,dengue"} }).count()
265688
-
Locations mentioned and their frequency, for floods and dengue separately [done together for now]
-
Locations:
db.getCollection('all_tweets').distinct('loc')
962
# Redundant
db.all_tweets.aggregate([
    {
        $match: {
            loc: { $not: { $size: 0 } }
        }
    },
    { $unwind: "$loc" },
    {
        $group: {
            _id: { $toLower: '$loc' },
            count: { $sum: 1 }
        }
    },
    {
        $match: {
            count: { $gte: 1 }
        }
    },
    { $sort: { count: -1 } },
    { $limit: 100 }
]);
(Might need to copy-paste from the mongo shell, not Robo 3T.) Output saved in location_counts.json. Run location_tabs.py.
db.all_tweets.aggregate(
    { $group: { _id: '$loc', count: { $sum: 1 } } }
)
Run distinct_locations.py, then location_tabs.py (sketched below).
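location_tabs.py is not reproduced here; a minimal sketch, assuming location_counts.json holds one {"_id": <location>, "count": <n>} document per line (the shape produced by the $group above):

import json

rows = []
with open("location_counts.json") as f:
    for line in f:
        doc = json.loads(line)
        rows.append((doc["_id"], doc["count"]))

# Tab-separated, most frequent location first.
for loc, count in sorted(rows, key=lambda r: r[1], reverse=True):
    print(loc, count, sep="\t")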
-
Locations frequency 'wordcloud', urban vs rural, state-wise, heatmap (word-cloud sketch below).
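For the word cloud itself, something like this could work (assumes the third-party wordcloud package and the tab-separated output of the location_tabs.py sketch above, saved under a hypothetical filename):

from wordcloud import WordCloud

freqs = {}
with open("location_counts.tsv") as f:  # hypothetical file from location_tabs.py
    for line in f:
        loc, count = line.rstrip("\n").split("\t")
        freqs[loc] = int(count)

WordCloud(width=800, height=400).generate_from_frequencies(freqs).to_file("locations_wordcloud.png")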
-
All tagged = generated by us + actual geotagged. Actual geotagged: 1099.
-
db.getCollection('all_tweets').find({"f": {$eq: "1e222211"} }).count()
=> assigned location -
db.getCollection('all_tweets').find({"f": {$nin: [ "1e222211", ""] } }).count()
=> Actual geotagged -
% of tweets identified; test case: 1000 manually checked => how many were mistaken.
-
Why untagged, when they should have been tagged?
-
Improve? Stemming.
-
word2vec for locations?
-
Improve coverage? P, R, F, ROC, AUC? (sketch below)
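P/R/F etc. could be computed from the manually checked sample with scikit-learn; a sketch with hypothetical labels (1 = has a location, 0 = does not):

from sklearn.metrics import precision_score, recall_score, f1_score, roc_auc_score

y_true = [1, 0, 1, 1, 0]  # manual check (hypothetical)
y_pred = [1, 0, 0, 1, 1]  # tagger output (hypothetical)

print("P:", precision_score(y_true, y_pred))
print("R:", recall_score(y_true, y_pred))
print("F:", f1_score(y_true, y_pred))
# ROC/AUC really wants scores rather than hard 0/1 labels.
print("AUC:", roc_auc_score(y_true, y_pred))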
-
Random sample
db.getCollection('all_tweets_untagged').aggregate(
[ { $sample: { size: 200 } } ]
)
** Retweets?
** Remove non-English tweets - Spanish, etc. (queries sketched below)
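Both can be checked directly in the mongo shell; the ^RT convention and the lang field are assumptions based on the sample document above:
// Retweets conventionally start with "RT @".
db.all_tweets.find({t: /^RT @/}).count()
// Non-English tweets, via the lang field.
db.all_tweets.find({lang: {$ne: "en"}}).count()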
-
Get word-cloud data, and the frequency of locations that appear at least 3 times: distinct_locations.py (sketched below).
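distinct_locations.py is not reproduced here; a sketch of the frequency part, assuming documents with a loc array (as in the aggregation above), read as newline-delimited JSON on stdin:

import json
import sys
from collections import Counter

counts = Counter()
for line in sys.stdin:
    doc = json.loads(line)
    for loc in doc.get("loc", []):
        counts[loc.lower()] += 1

frequent = {loc: n for loc, n in counts.items() if n >= 3}
print(len(frequent), "locations appear at least 3 times")
for loc, n in sorted(frequent.items(), key=lambda kv: kv[1], reverse=True):
    print(loc, n, sep="\t")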
-
Print only tweet text from mongo:
db.Jan22_tweets.find({}, {t: 1, _id:0})
-
Export only tweet text:
mongoexport -d test -c Jan22_tweets -f t -o tweets_Jan22.txt
-
Sort hashtags by usage: hashtag_counter.py (sketched below)
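hashtag_counter.py is not reproduced here; a minimal sketch, assuming tweet text arrives one line at a time on stdin (e.g. from the mongoexport output above):

import re
import sys
from collections import Counter

counts = Counter()
for line in sys.stdin:
    counts.update(tag.lower() for tag in re.findall(r"#\w+", line))

for tag, n in counts.most_common():
    print(tag, n, sep="\t")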
-
Has a location (does "p" exist and is non-empty?):
db.getCollection('Jan22_tweets').find({"p": {$exists: true, "$ne": ""}, $text: {$search: "dengue"}})
-
No location:
db.getCollection('Jan22_tweets').find({"p": {$eq :""} ,$text: {$search: "dengue"}}).count()
Dump a large sample to a file by piping the shell through tee (shellBatchSize raises the print limit so all sampled documents land in out.txt):
mongo | tee out.txt
> DBQuery.shellBatchSize = 40000
> db.Jan30_tweets.aggregate(
    { $sample: { size: 40000 } }
)