COVID_DataProcessor

Dataset

We provide preprocessed COVID-19 datasets for the US, Italy, China, and India. The raw dataset for each country can be found here:

US: JHU CSSE COVID-19 Dataset, link
- The date of the first confirmed case in each state is taken from here
Italy: Dati COVID-19 Italia, link
China: JHU CSSE COVID-19 Dataset, link
India: COVID19-India API, link

Population data were collected from online sources. The populations of the US, Italy, China, and India are available at the links.

How to use DataProcessor

Download raw files from the internet

Downloaded files are saved under dataset\country_name\raw_data and dataset\country_name\origin_data.

# Country whose raw files you want to download.
# Available: Country.US, Country.ITALY, Country.CHINA, Country.INDIA, Country.US_CONFIRMED
country = Country.ITALY
# Download the raw files
raw_dict = download_raw_data(country)
# Convert the raw files into the refined (origin) dataset
origin_dict = get_origin_data(country)

Preprocess dataset

You must download the raw files and build the refined dataset before preprocessing.
Preprocessed datasets are saved under dataset\country_name\preprocessed_data and dataset\country_name\sird_data.
Preprocessing settings are saved in settings\pre_info.csv and settings\sird_info.csv.

# Country whose refined dataset you want to preprocess.
# Available: Country.US, Country.ITALY, Country.CHINA, Country.INDIA
country = Country.ITALY
link_df = load_links(country)

# set preprocess conditions
sird_info = PreprocessInfo(country=country, start=link_df['start_date'], end=link_df['end_date'],
                           increase=True, daily=True, remove_zero=True,
                           smoothing=True, window=5, divide=False, pre_type=PreType.SIRD)

# preprocess
sird_dict = get_sird_dict(country, sird_info)
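
The folder layout described in the download and preprocess steps can be sketched as follows. This is only an illustration: the `Country` enum below is a hypothetical stand-in for the repository's own class, `stage_dirs` is not part of the repository, and using the enum member name as `country_name` is an assumption.

```python
from enum import Enum
from pathlib import PureWindowsPath


class Country(Enum):
    # Hypothetical stand-in for the repository's Country enum.
    US = "US"
    ITALY = "ITALY"
    CHINA = "CHINA"
    INDIA = "INDIA"


# Sub-folders named in this README, in the order they are produced.
STAGES = ("raw_data", "origin_data", "preprocessed_data", "sird_data")


def stage_dirs(country, root="dataset"):
    """Return the expected dataset\\country_name\\<stage> path for each stage.

    Assumption: country_name is the enum member name (e.g. ITALY).
    """
    return {stage: PureWindowsPath(root, country.name, stage) for stage in STAGES}


dirs = stage_dirs(Country.ITALY)
for stage, path in dirs.items():
    print(f"{stage:>18}: {path}")
```

PureWindowsPath is used only so the printed paths match the backslash style of this README on any operating system.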

Get the dataset for a model
- You can get the dataset for the NIPA model, the R0 estimation model, or the SIRD model.
- The dataset for each model is saved under results\model_name\.

You must have the preprocessed dataset before generating the dataset for a model.

NIPA model

country = Country.ITALY
link_df = load_links(country)

sird_info = PreprocessInfo(country=country, start=link_df['start_date'], end=link_df['end_date'],
                           increase=True, daily=True, remove_zero=True,
                           smoothing=True, window=5, divide=True, pre_type=PreType.SIRD)

dataset_dict = get_dataset_for_sird_model(country, sird_info)

R0_Estimation

country = Country.ITALY
link_df = load_links(country)

pre_info = PreprocessInfo(country=country, start=link_df['start_date'], end=link_df['end_date'],
                          increase=True, daily=True, remove_zero=True,
                          smoothing=True, window=5, divide=False, pre_type=PreType.PRE)

test_info = PreprocessInfo(country=country, start=link_df['start_date'], end=link_df['end_date'],
                           increase=True, daily=True, remove_zero=True,
                           smoothing=True, window=5, divide=False, pre_type=PreType.TEST)

dataset_dict = get_dataset_for_r0_model(country, pre_info, test_info)

SIRD model

country = Country.ITALY
link_df = load_links(country)

pre_info = PreprocessInfo(country=country, start=link_df['start_date'], end=link_df['end_date'],
                          increase=True, daily=True, remove_zero=True,
                          smoothing=True, window=5, divide=False, pre_type=PreType.PRE)
test_info = PreprocessInfo(country=country, start=link_df['start_date'], end=link_df['end_date'],
                           increase=False, daily=True, remove_zero=True,
                           smoothing=True, window=5, divide=False, pre_type=PreType.TEST)

dataset_dict = get_dataset_for_sird_model(country, pre_info, test_info)

Preprocessing Conditions

The PreprocessInfo dataclass passes the preprocessing conditions. It uses the conditions increase, daily, remove_zero, smoothing, window, and divide.
increase: remove anomalies from data that should be increasing.
daily: convert cumulative data into daily data.
remove_zero: remove values below zero and fill the gaps by interpolation.
smoothing, window: apply a moving average with the given window.
divide: divide the data by the population.
pre_type
- PreType has three types. pre_type is used to validate the conditions against the type of the data.
- PRE, SIRD, and TEST are available.
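
The conditions above can be illustrated with a minimal, dependency-free sketch. It is not the repository's implementation: `Conditions` is a hypothetical stand-in for PreprocessInfo, every helper name is made up for illustration, the order in which the steps run is an assumption, and `increase` is interpreted here as forcing a cumulative series to be non-decreasing.

```python
from dataclasses import dataclass


@dataclass
class Conditions:
    """Hypothetical stand-in for PreprocessInfo (documented flags only)."""
    increase: bool = True
    daily: bool = True
    remove_zero: bool = True
    smoothing: bool = True
    window: int = 5
    divide: bool = False


def enforce_increase(cumulative):
    # Assumed meaning of `increase`: raise each value to the running maximum.
    out, running = [], float("-inf")
    for value in cumulative:
        running = max(running, value)
        out.append(running)
    return out


def to_daily(cumulative):
    # The first day keeps its value; later days are day-to-day differences.
    return [cumulative[0]] + [b - a for a, b in zip(cumulative, cumulative[1:])]


def fill_negative(values):
    # Treat negative entries as missing and interpolate linearly between the
    # nearest valid neighbours; gaps at the edges copy the nearest valid value.
    valid = [i for i, v in enumerate(values) if v >= 0]
    if not valid:
        return list(values)
    out = list(values)
    for i, v in enumerate(values):
        if v >= 0:
            continue
        left = max((j for j in valid if j < i), default=None)
        right = min((j for j in valid if j > i), default=None)
        if left is None:
            out[i] = values[right]
        elif right is None:
            out[i] = values[left]
        else:
            t = (i - left) / (right - left)
            out[i] = values[left] + t * (values[right] - values[left])
    return out


def moving_average(values, window):
    # Trailing mean over at most `window` of the most recent points.
    out = []
    for i in range(len(values)):
        chunk = values[max(0, i - window + 1): i + 1]
        out.append(sum(chunk) / len(chunk))
    return out


def apply_conditions(series, cond, population=1):
    # Assumed order: increase, daily, remove_zero, smoothing, divide.
    data = list(series)
    if cond.increase:
        data = enforce_increase(data)
    if cond.daily:
        data = to_daily(data)
    if cond.remove_zero:
        data = fill_negative(data)
    if cond.smoothing:
        data = moving_average(data, cond.window)
    if cond.divide:
        data = [v / population for v in data]
    return data


example = apply_conditions([1, 3, 2, 5], Conditions(window=2, divide=True), population=2)
print(example)
```

Each helper can also be called on its own, which makes it easier to compare one step at a time against the files that the repository actually writes.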

We are going to use...

Ebola
- WHO, updated weekly from November 14, 2014 to May 11, 2016
  - Cumulative data on confirmed cases and deaths are provided as confirmed, probable, suspected, and total
  - Three regions: Guinea, Liberia, and Sierra Leone
- Congo, updated daily from August 4, 2018 to July 11, 2020
  - The dataset is provided in two parts: national and health zone
  - Provided in CSV format, so no parsing step is needed
  - Confirmed cases, probable cases, and confirmed deaths are provided
- KNOEMA
  - The data reportedly requires switching to a professional account to view, and the only usable data here is the WHO data, so we probably need to look further
  - Regional WHO data on Ebola Cases in DR Congo/Guinea/Liberia/Sierra Leone
    - Provides total number of cases, suspected, probable, confirmed, and deaths

DVL-Sejong/COVID_DataProcessor
