Extreme volatility in USA: Cumulative Number Of Performed CV Tests #189

Closed
DataOps-epam opened this issue Apr 15, 2022 · 5 comments

Comments

@DataOps-epam

Dear FIND,
Could you please check your time series "USA: Cumulative Number Of Performed CV Tests" here: https://raw.githubusercontent.com/dsbbfinddx/FINDCov19TrackerData/master/processed/coronavirus_tests.csv
We see extreme volatility around 04/15/2022 in the latest release.
Could you please advise on this?
Thanks.

#USA.COVID19.B@FIND

@findanna
Contributor

findanna commented Apr 19, 2022

Hello @DataOps-epam,

Thank you for your feedback.
Indeed, there are many revisions at the source for the USA data.

We will also check on our side whether we have an issue with the cumulative sum of the source data.
@angelicambg @benubah can you please check whether the code is still producing valid cumulative numbers for the USA?
I tried to reproduce this in R and did not get the same results, probably because there are many historical revisions at the source.

library(dplyr)

# Sum the cumulative tests reported by all states for each date
data_US <- readr::read_csv(
  "https://raw.githubusercontent.com/govex/COVID-19/master/data_tables/testing_data/time_series_covid19_US.csv",
  col_names = TRUE,
  show_col_types = FALSE) |>
  mutate(date = as.Date(date, format = "%m/%d/%Y")) |>
  select(date, tests_combined_total) |>
  group_by(date) |>
  # na.rm = TRUE silently skips states that have no value for a date
  summarise(cum_tests = sum(tests_combined_total, na.rm = TRUE))

tail(data_US, 20)
# # A tibble: 20 × 2
#    date       cum_tests
#    <date>         <dbl>
#  1 2022-03-29 852124868
#  2 2022-03-30 853232274
#  3 2022-03-31 854242710
#  4 2022-04-01 845136012
#  5 2022-04-02 845397095
#  6 2022-04-03 845637721
#  7 2022-04-04 846394199
#  8 2022-04-05 847525133
#  9 2022-04-06 848362052
# 10 2022-04-07 849031771
# 11 2022-04-08 849609426
# 12 2022-04-09 850821150
# 13 2022-04-10 851036553
# 14 2022-04-11 851595486
# 15 2022-04-12 854936478
# 16 2022-04-13 852836553
# 17 2022-04-14 854146287
# 18 2022-04-15 836263614
# 19 2022-04-16 778594875
# 20 2022-04-17 778666699

@benubah
Collaborator

benubah commented Apr 19, 2022

Yes, we've been working on this for the past few days. There were some revisions at the source, but some states are also missing. We can't reproduce this with R because the Python scraping tool uses pandas fillna() to replace NAs with the last known value.
At least KS and MN have been missing since April 15th. I have reported this to the source.

library(dplyr)

usa <- readr::read_csv("https://raw.githubusercontent.com/govex/COVID-19/master/data_tables/testing_data/time_series_covid19_US.csv", show_col_types = FALSE) |>
  select(date, state, tests_combined_total)

data_sum <-
  usa |>
  mutate(date = as.Date(date, "%m/%d/%Y")) |>
  # na.locf does something similar to pandas fillna(method = "ffill")
  # mutate(tests_combined_total = zoo::na.locf(tests_combined_total)) |>
  group_by(date) |>
  # without na.rm = TRUE, a single missing state makes that day's total NA
  summarize(tests_total = sum(tests_combined_total))

missing <- 
  usa |>
  filter(state %in% c("MN", "KS"),
         date %in% c("4/11/2022","4/12/2022","4/13/2022", "4/14/2022", "4/15/2022", "4/16/2022", "4/17/2022"))

print(missing)
#> # A tibble: 14 x 3
#>    date      state tests_combined_total
#>    <chr>     <chr>                <dbl>
#>  1 4/11/2022 KS                 2269976
#>  2 4/11/2022 MN                16379616
#>  3 4/12/2022 KS                 2269976
#>  4 4/12/2022 MN                16402859
#>  5 4/13/2022 KS                 2269976
#>  6 4/13/2022 MN                16402859
#>  7 4/14/2022 KS                 2269976
#>  8 4/14/2022 MN                16402859
#>  9 4/15/2022 KS                      NA
#> 10 4/15/2022 MN                      NA
#> 11 4/16/2022 KS                      NA
#> 12 4/16/2022 MN                      NA
#> 13 4/17/2022 KS                      NA
#> 14 4/17/2022 MN                      NA

We are working on a fix now at our end.
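
For readers following along, here is a minimal sketch of why the forward fill changes the national total. It is plain Python, not the project's actual scraper code (which is not shown in this thread). The function name `national_totals` and the placeholder state `XX` with its values are made up for illustration; the KS and MN figures are the ones from the table above. Carrying values forward separately for each state is also an assumption about how the fill is applied.

```python
from collections import defaultdict

# (date, state, cumulative tests); None marks a missing report.
# KS and MN values come from the table above. "XX" is a hypothetical
# placeholder standing in for all other states.
rows = [
    ("2022-04-14", "KS", 2269976),
    ("2022-04-14", "MN", 16402859),
    ("2022-04-14", "XX", 1000),
    ("2022-04-15", "KS", None),
    ("2022-04-15", "MN", None),
    ("2022-04-15", "XX", 1100),
]


def national_totals(rows, forward_fill):
    """Sum cumulative tests per date.

    With forward_fill=True, a missing value is replaced by the same
    state's last known value (similar in spirit to a per-state ffill).
    With forward_fill=False, missing values are skipped, which behaves
    like sum(..., na.rm = TRUE) in R.
    """
    last_seen = {}
    totals = defaultdict(int)
    # Process in date order so "last known value" is well defined.
    for date, state, value in sorted(rows, key=lambda r: (r[0], r[1])):
        if value is None and forward_fill:
            value = last_seen.get(state)
        if value is None:
            continue  # nothing reported and nothing to carry forward
        last_seen[state] = value
        totals[date] += value
    return dict(totals)


naive = national_totals(rows, forward_fill=False)
filled = national_totals(rows, forward_fill=True)
print("skip missing:", naive)
print("forward fill:", filled)
```

In this toy input the skipped-value total collapses on 2022-04-15 because KS and MN drop out of the sum, while the forward-filled total keeps their 2022-04-14 values. That is the kind of gap that can make an R reproduction disagree with the published series.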

@findanna
Contributor

Thank you @benubah, looking forward to the updates!

@seawaR
Collaborator

seawaR commented Apr 19, 2022

Hi @findanna,
Last month we realized that some states were no longer reporting and decided to take the last available value (as @benubah explained) to avoid drops in the cumulative tests, which would imply negative daily values (for instance, see the change between 2022-03-31 and 2022-04-01 in your example).
Some of these states may start publishing again, so we will work on an automatic weekly or monthly update that can help with this potential issue and with the regular revisions you mention. In the meantime, I will update the values from the beginning of the month.
However, what happened on 2022-04-15 seems to be a mistake at the source on that day only. Because the scraping process for all countries only takes values for the current day, we have manually corrected the values from 2022-04-14 to 2022-04-17. Please keep in mind that these values can change with revisions at the source.
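
As a rough illustration of the check implied here, the sketch below flags every day on which a cumulative total is lower than the previous day's. It is plain Python and only an assumption about how such a check could look, not the project's code; `find_drops` is a hypothetical helper name. The numbers are copied from the tail(data_US, 20) output earlier in the thread.

```python
# Cumulative totals copied from the tail(data_US, 20) output in this thread.
series = [
    ("2022-03-29", 852124868), ("2022-03-30", 853232274),
    ("2022-03-31", 854242710), ("2022-04-01", 845136012),
    ("2022-04-02", 845397095), ("2022-04-03", 845637721),
    ("2022-04-04", 846394199), ("2022-04-05", 847525133),
    ("2022-04-06", 848362052), ("2022-04-07", 849031771),
    ("2022-04-08", 849609426), ("2022-04-09", 850821150),
    ("2022-04-10", 851036553), ("2022-04-11", 851595486),
    ("2022-04-12", 854936478), ("2022-04-13", 852836553),
    ("2022-04-14", 854146287), ("2022-04-15", 836263614),
    ("2022-04-16", 778594875), ("2022-04-17", 778666699),
]


def find_drops(series):
    """Return (date, change) for each day whose cumulative total is
    lower than the previous day's total. A cumulative count should
    never decrease, so each hit points to a revision or missing data."""
    drops = []
    for (_, previous), (date, current) in zip(series, series[1:]):
        if current < previous:
            drops.append((date, current - previous))
    return drops


drops = find_drops(series)
for date, change in drops:
    print(date, change)
```

On this data the check flags 2022-04-01, 2022-04-13, 2022-04-15 and 2022-04-16. A scheduled job could run something like this after each scrape and trigger a refresh of historical values when it finds a drop.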

@findanna
Contributor

Thank you @seawaR and @benubah.
If it is possible to correct historical values so that we match the source, please do so.
