The steps below introduce the procedure for collecting data by scraping Weibo and processing it into an analyzable data set. The scraping process can be very time-consuming, so we recommend directly running the R script mentioned above to see the graphical and regression results.
pip install -r requirements.txt
git clone https://github.com/justjavac/weibo-trending-hot-search.git    # the keyword archives live in the archives/ folder of this repo
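After cloning, you can confirm the daily archive files are in place (assuming the repository keeps one file per date under archives/):

ls weibo-trending-hot-search/archives | head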
Run process_keywords.py to process the data; it will process all the keywords between start_date and end_date into the folder 'KWS'.
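How start_date and end_date are supplied depends on the script; assuming they are passed as command-line arguments (this is an assumption, check process_keywords.py), a run could look like:

python3 process_keywords.py 2022-05-09 2022-05-11    # hypothetical arguments: start_date end_date; output written to KWS/

If the dates are instead set as variables inside the script, edit them there and run it with no arguments.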
In the loop below, d is the day offset from the start date; each run launches 2 MPI processes and scrapes the 10 most negative topics within that day. Here the start date is '2022-05-09', and you can change it. For example, {0..2} means it will scrape the data from '2022-05-09' through '2022-05-11'.
for d in {0..0}    # day offset from the start date; use {0..2} to cover three days
do
  for i in {0..4}    # batch index passed to run_spider.py; the day's topics are presumably split across these runs
  do
    # launch one batch with 2 MPI processes, then wait for it to
    # finish before starting the next
    mpirun -n 2 python3 run_spider.py 2022-05-09 $d $i &
    pid=$!
    wait $pid
  done
done
Run 'run.sbatch'; you can change its configuration and run it multiple times to utilize the cluster.
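The file in the repository is the authoritative version, but as a rough sketch of what run.sbatch might contain (the job name, task count, and walltime below are assumptions; adjust them for your cluster), it wraps the scraping loop in Slurm directives and is submitted with sbatch run.sbatch:

#!/bin/bash
#SBATCH --job-name=weibo-scrape    # hypothetical job name
#SBATCH --ntasks=2                 # matches the mpirun -n 2 above
#SBATCH --time=12:00:00            # assumed walltime

# scrape day offset 0 from 2022-05-09, batch by batch
for i in {0..4}
do
  mpirun -n 2 python3 run_spider.py 2022-05-09 0 $i
done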
Run 'clean.sbatch' to clean the data; it will extract the important information, calculate the negative sentiment score, and save the results to the same folder. You can change its configuration and run it multiple times to utilize the cluster.
Here the 3 is both the number of days to clean, counting from the start date '2022-05-08', and the number of MPI processes it runs with.
mpirun -n 3 python3 clean.py 2022-05-08
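Following the same pattern (assuming, per the description above, that the -n count equals the number of days to clean), a full week starting from '2022-05-08' would be:

mpirun -n 7 python3 clean.py 2022-05-08    # 7 processes, one per day from 2022-05-08 through 2022-05-14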