We have preprocessed COVID-19 datasets for the US, Italy, China, and India. The raw dataset for each country can be found here:
- US: JHU CSSE COVID-19 Dataset, link
- The date of the first confirmed case in each state comes from here
- Italy: Dati COVID-19 Italia, link
- China: JHU CSSE COVID-19 Dataset, link
- India: COVID19-India API, link
Population data were collected online. The populations of the US, Italy, China, and India are available at the links.
- Download raw files from the internet
  - Downloaded files are saved under dataset\country_name\raw_data and dataset\country_name\origin_data.
    # Country whose raw files you want to download.
    # Available: Country.US, Country.ITALY, Country.CHINA, Country.INDIA, Country.US_CONFIRMED
    country = Country.ITALY

    # Download the raw files
    raw_dict = download_raw_data(country)

    # Preprocess the raw files into the refined dataset
    origin_dict = get_origin_data(country)
- Preprocess dataset
  - You must download the raw files and build the refined dataset before preprocessing.
  - Preprocessed datasets are saved under dataset\country_name\preprocessed_data and dataset\country_name\sird_data.
  - Preprocessing settings are saved in settings\pre_info.csv and settings\sird_info.csv.
    # Country whose raw files you want to preprocess.
    # Available: Country.US, Country.ITALY, Country.CHINA, Country.INDIA
    country = Country.ITALY
    link_df = load_links(country)

    # Set the preprocessing conditions
    sird_info = PreprocessInfo(
        country=country,
        start=link_df['start_date'], end=link_df['end_date'],
        increase=True, daily=True, remove_zero=True,
        smoothing=True, window=5, divide=False,
        pre_type=PreType.SIRD)

    # Preprocess
    sird_dict = get_sird_dict(country, sird_info)
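The settings files mentioned above store preprocessing conditions as CSV. As a rough illustration of how such conditions could be written to and read back from a CSV file, here is a minimal sketch using only the standard library. `SettingsRow`, `write_settings`, and `read_settings` are hypothetical names, not part of this repository, and the real column layout of settings\pre_info.csv may differ. The field names follow the keyword arguments passed to PreprocessInfo.

```python
import csv
import io
from dataclasses import asdict, dataclass, fields


@dataclass
class SettingsRow:
    """Hypothetical stand-in for one row of preprocessing conditions."""
    country: str
    start: str
    end: str
    increase: bool
    daily: bool
    remove_zero: bool
    smoothing: bool
    window: int
    divide: bool
    pre_type: str


BOOL_FIELDS = ("increase", "daily", "remove_zero", "smoothing", "divide")


def write_settings(rows, handle):
    """Write one CSV line per settings row, preceded by a header line."""
    names = [f.name for f in fields(SettingsRow)]
    writer = csv.DictWriter(handle, fieldnames=names)
    writer.writeheader()
    for row in rows:
        writer.writerow(asdict(row))


def read_settings(handle):
    """Read rows back, converting CSV strings to the declared field types."""
    rows = []
    for record in csv.DictReader(handle):
        for name in BOOL_FIELDS:
            record[name] = record[name] == "True"
        record["window"] = int(record["window"])
        rows.append(SettingsRow(**record))
    return rows


# Round trip through an in-memory buffer.
# The dates are placeholders for illustration, not values from the dataset.
example = SettingsRow("ITALY", "2020-01-01", "2020-12-31",
                      True, True, True, True, 5, False, "SIRD")
buffer = io.StringIO()
write_settings([example], buffer)
buffer.seek(0)
restored = read_settings(buffer)
```

Keeping every condition in its own column makes it easy to check later which settings produced a given preprocessed file.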
- Get exact dataset for model
  - You can get a dataset for the NIPA model, the model for R0 estimation, or the SIRD model.
  - The dataset for each model is saved under results\model_name\
  - You must have the preprocessed dataset before getting the exact dataset for a model.
  - NIPA model
    country = Country.ITALY
    link_df = load_links(country)

    sird_info = PreprocessInfo(
        country=country,
        start=link_df['start_date'], end=link_df['end_date'],
        increase=True, daily=True, remove_zero=True,
        smoothing=True, window=5, divide=True,
        pre_type=PreType.SIRD)

    dataset_dict = get_dataset_for_sird_model(country, sird_info)
  - Model for R0 estimation
    country = Country.ITALY
    link_df = load_links(country)

    pre_info = PreprocessInfo(
        country=country,
        start=link_df['start_date'], end=link_df['end_date'],
        increase=True, daily=True, remove_zero=True,
        smoothing=True, window=5, divide=False,
        pre_type=PreType.PRE)

    test_info = PreprocessInfo(
        country=country,
        start=link_df['start_date'], end=link_df['end_date'],
        increase=True, daily=True, remove_zero=True,
        smoothing=True, window=5, divide=False,
        pre_type=PreType.TEST)

    dataset_dict = get_dataset_for_r0_model(country, pre_info, test_info)
  - SIRD model
    country = Country.ITALY
    link_df = load_links(country)

    pre_info = PreprocessInfo(
        country=country,
        start=link_df['start_date'], end=link_df['end_date'],
        increase=True, daily=True, remove_zero=True,
        smoothing=True, window=5, divide=False,
        pre_type=PreType.PRE)

    test_info = PreprocessInfo(
        country=country,
        start=link_df['start_date'], end=link_df['end_date'],
        increase=False, daily=True, remove_zero=True,
        smoothing=True, window=5, divide=False,
        pre_type=PreType.TEST)

    dataset_dict = get_dataset_for_sird_model(country, pre_info, test_info)
- The PreprocessInfo dataclass is used for passing preprocessing conditions. increase, daily, remove_zero, smoothing, window, and divide are the conditions used in the class.
  - increase: remove anomalies in increasing data.
  - daily: change cumulative data into daily data.
  - remove_zero: remove data below zero and fill the gap by interpolation.
  - smoothing, window: apply a moving average.
  - divide: divide data by its population.
  - pre_type: used to validate the conditions for the type of data. There are three types in PreType: PRE, SIRD, and TEST.
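To make the conditions above concrete, here is a minimal pure-Python sketch of what each step could look like on a short series. It is not the repository's implementation. All function names are hypothetical, the "increase" rule (clamping dips to the running maximum) is only one possible reading of "remove anomalies", and the trailing moving-average window is an assumption, since the README does not say whether the window is trailing or centered.

```python
from typing import List


def enforce_increasing(cumulative: List[float]) -> List[float]:
    """increase (assumed rule): clamp dips so the cumulative series never decreases."""
    fixed, running_max = [], float("-inf")
    for value in cumulative:
        running_max = max(running_max, value)
        fixed.append(running_max)
    return fixed


def to_daily(cumulative: List[float]) -> List[float]:
    """daily: turn cumulative counts into per-day increments."""
    if not cumulative:
        return []
    return [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]


def interpolate_negatives(values: List[float]) -> List[float]:
    """remove_zero: drop values below zero and fill the gaps by linear interpolation."""
    known = [(i, v) for i, v in enumerate(values) if v >= 0]
    if not known:
        return list(values)
    result = []
    for i, v in enumerate(values):
        if v >= 0:
            result.append(float(v))
            continue
        left = max((k for k in known if k[0] < i), default=None)
        right = min((k for k in known if k[0] > i), default=None)
        if left is None:            # gap at the start: copy the next valid value
            result.append(float(right[1]))
        elif right is None:         # gap at the end: copy the last valid value
            result.append(float(left[1]))
        else:                       # interior gap: interpolate between neighbours
            t = (i - left[0]) / (right[0] - left[0])
            result.append(left[1] + t * (right[1] - left[1]))
    return result


def moving_average(values: List[float], window: int) -> List[float]:
    """smoothing, window: trailing mean (assumed), shorter at the series start."""
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def divide_by_population(values: List[float], population: int) -> List[float]:
    """divide: express each value as a fraction of the population."""
    return [v / population for v in values]


# Toy cumulative series with one reporting dip (made-up numbers).
cumulative = [10, 12, 11, 15, 20]
daily = to_daily(cumulative)               # the dip becomes a negative daily count
cleaned = interpolate_negatives(daily)     # negative entry replaced by interpolation
smoothed = moving_average(cleaned, window=2)
```

Note that if the cumulative series is first passed through `enforce_increasing`, the daily differences can no longer be negative, so in this sketch the interpolation step only matters when that clamp is skipped.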
- Ebola
  - WHO, updated weekly from November 14, 2014 to May 11, 2016
    - Cumulative data on confirmed cases and deaths are provided as confirmed, probable, suspected, and total
    - Three regions: Guinea, Liberia, and Sierra Leone
  - Congo, updated daily from August 4, 2018 to July 11, 2020
    - The dataset is provided at two levels: national and health zone
    - Provided in CSV format, so no parsing step is needed
    - Confirmed cases, probable cases, and confirmed deaths are provided
  - KNOEMA
    - Viewing the data reportedly requires switching to a professional account, and the only usable data there is the WHO data, so more searching is probably needed
    - Regional WHO data on Ebola Cases in DR Congo/Guinea/Liberia/Sierra Leone
    - Provides total number of cases, suspected, probable, confirmed, and deaths