
Project

The code needed for the econometric analysis is an R script named 'main_regression.R'.

The steps below describe the procedure for collecting data by scraping Weibo and processing it into an analyzable data set. The scraping process can be very time-consuming, so we recommend directly running the R script mentioned above to see the graphical and regression results.
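With R installed, the analysis can be launched from the command line (this assumes the required R packages are already installed):

Rscript main_regression.R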

Step 1: Configuration

Install the required packages

pip install -r requirements.txt

Clone the keyword archives (the daily trending-topic lists live in the repository's 'archives' folder)

git clone https://github.com/justjavac/weibo-trending-hot-search.git

Data sample: (screenshot omitted)

Step 2: Process the data

Run process_keywords.py to process the data; it will write all the keywords between start_date and end_date into the folder 'KWS'.
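For illustration only, a minimal sketch of what this step might do is given below; it assumes the cloned archives are daily files named by date and that the script simply copies each day's keyword list into 'KWS'. The real process_keywords.py may use a different file layout and interface.

# Hypothetical sketch of the keyword-processing step; the actual
# process_keywords.py may read the archives differently.
from datetime import date, timedelta
from pathlib import Path

start_date = date(2022, 5, 9)    # assumed start_date
end_date = date(2022, 5, 11)     # assumed end_date
archives = Path("archives")      # cloned keyword archives
out_dir = Path("KWS")
out_dir.mkdir(exist_ok=True)

day = start_date
while day <= end_date:
    src = archives / f"{day.isoformat()}.md"   # assumed file naming
    if src.exists():
        # Copy the day's keyword list into the KWS folder for the scraper.
        (out_dir / f"{day.isoformat()}.txt").write_text(
            src.read_text(encoding="utf-8"), encoding="utf-8"
        )
    day += timedelta(days=1)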

Step 3: Run the scraper

Configure the scraping loop shown below.

d is the day offset from the start date; each day's run uses 2 processes (mpirun -n 2) and scrapes the 10 most negative topics within that day. Here the start date is '2022-05-09', and you can change it. For example, {0..2} means it will scrape the data from '2022-05-09' to '2022-05-11'.

i indexes pairs of topics to scrape; each value of i covers 2 topics, so looping i over {0..4} scrapes 10 topics in total.

# Outer loop: day offsets from the start date (here only offset 0, i.e. 2022-05-09).
for d in {0..0}
do
    # Inner loop: each value of i scrapes a pair of topics, so {0..4} gives 10 topics.
    for i in {0..4}
    do
        mpirun -n 2 python3 run_spider.py 2022-05-09 $d $i &
        pid=$!
        wait $pid    # wait for the current pair to finish before starting the next
    done
done

Run 'run.sbatch'; you can change its configuration and run it multiple times to make better use of the cluster.
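On a SLURM cluster the batch file is submitted with sbatch, for example:

sbatch run.sbatch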

The results will be stored in the folder 'output'.

Step 4: Clean and process the data

Run 'clean.sbatch' to clean the data; it extracts the important information, calculates the negative sentiment score, and saves the results to the same folder. You can change its configuration and run it multiple times to make better use of the cluster.

Here the 3 is the number of days to clean, counting from the start date '2022-05-08', and the job runs with 3 processes.

mpirun -n 3 python3 clean.py 2022-05-08
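For illustration only, the negative sentiment score could be computed per post with a Chinese sentiment library such as SnowNLP; the actual model, file names, and column names used by clean.py may differ.

# Hypothetical illustration of negative-sentiment scoring; clean.py may
# use a different model or different column names.
import pandas as pd
from snownlp import SnowNLP

def negative_score(text: str) -> float:
    # SnowNLP.sentiments is the probability that the text is positive,
    # so one minus that value serves as a simple negativity score.
    return 1.0 - SnowNLP(text).sentiments

posts = pd.read_csv("output/2022-05-08_topic0.csv")   # assumed file name
posts["negative_score"] = posts["content"].astype(str).map(negative_score)  # assumed 'content' column
posts.to_csv("output/2022-05-08_topic0.csv", index=False)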

Step 5: Summarize the data

Run 'merge.py' to merge the cleaned data into one CSV file and translate the location names from Chinese to English.
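A rough sketch of this merging step is shown below; it assumes the cleaned files are CSVs in 'output' with a Chinese 'location' column, and the real merge.py likely uses its own column names, output file name, and a much fuller province mapping.

# Hypothetical sketch of merge.py: concatenate cleaned CSVs and map
# Chinese location names to English; the real mapping is much longer.
import glob
import pandas as pd

location_map = {"北京": "Beijing", "上海": "Shanghai", "广东": "Guangdong"}  # truncated example mapping

frames = [pd.read_csv(path) for path in glob.glob("output/*.csv")]
merged = pd.concat(frames, ignore_index=True)
merged["location"] = merged["location"].map(location_map).fillna(merged["location"])  # assumed column name
merged.to_csv("merged.csv", index=False)   # assumed output file name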

Run 'summary.py' to summarize the data; it writes a summary of the collected topics to 'summary.csv'.
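As an illustration, the summary might aggregate post counts and average negativity by topic, roughly as sketched below; the grouping keys and statistics are assumptions, not the script's documented behavior.

# Hypothetical sketch of summary.py; column names and statistics are assumptions.
import pandas as pd

merged = pd.read_csv("merged.csv")              # assumed merged file from the previous step
summary = (
    merged.groupby("topic")                     # assumed 'topic' column
          .agg(posts=("topic", "size"),
               mean_negative=("negative_score", "mean"))  # assumed score column
          .reset_index()
)
summary.to_csv("summary.csv", index=False)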

Troubleshooting

Please do not run too many sbatch jobs at the same time, as this may crash the limited cluster.

If you reduce 'DOWNLOAD_DELAY' in 'settings.py', the spider may be blocked by the website.
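For reference, the delay is a standard Scrapy setting in settings.py; the value shipped with this project may differ from the example below.

# settings.py (excerpt) -- the actual values in this repository may differ.
DOWNLOAD_DELAY = 3               # seconds between requests; lowering this risks being blocked
RANDOMIZE_DOWNLOAD_DELAY = True  # standard Scrapy option that jitters the delay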

Sometimes the scraper initially collects only one topic instead of 2 for unknown reasons; you can just stop it and run it again.

If the scraper keeps being blocked by the website, you can update 'cookies.txt' and 'user_agents.txt' to fix it.
