Extreme volatility in USA: Cumulative Number Of Performed CV Tests #189

DataOps-epam · 2022-04-15T20:47:22Z

Dear FIND,
Could you please check your time series "USA: Cumulative Number Of Performed CV Tests" here: https://raw.githubusercontent.com/dsbbfinddx/FINDCov19TrackerData/master/processed/coronavirus_tests.csv
We see extreme volatility around 04/15/2022 in the latest release.
Could you please advise on this?
Thanks.

#USA.COVID19.B@FIND

findanna · 2022-04-19T08:14:56Z

Hello @DataOps-epam,

thank you for your feedback.
Indeed, there are lots of revisions at the source for the USA data.

We will also check on our side whether we have an issue with the cumulative sum of the source data.
@angelicambg @benubah can you please check whether the code is still producing valid cumulative numbers for USA?
I tried to reproduce this in R and did not get the same results, probably because there are a lot of historical revisions at the source.

library(dplyr)

data_US <- readr::read_csv(
  "https://raw.githubusercontent.com/govex/COVID-19/master/data_tables/testing_data/time_series_covid19_US.csv",
  col_names = TRUE,
  show_col_types = FALSE) |>
  # dates in the source are formatted as month/day/year
  mutate(date = as.Date(date, format = "%m/%d/%Y")) |>
  select(date, tests_combined_total) |>
  group_by(date) |>
  # sum across states; NAs are dropped, so a missing state lowers the total
  summarise(cum_tests = sum(tests_combined_total, na.rm = TRUE))

tail(data_US, 20)
# # A tibble: 20 × 2
#    date       cum_tests
#    <date>         <dbl>
#  1 2022-03-29 852124868
#  2 2022-03-30 853232274
#  3 2022-03-31 854242710
#  4 2022-04-01 845136012
#  5 2022-04-02 845397095
#  6 2022-04-03 845637721
#  7 2022-04-04 846394199
#  8 2022-04-05 847525133
#  9 2022-04-06 848362052
# 10 2022-04-07 849031771
# 11 2022-04-08 849609426
# 12 2022-04-09 850821150
# 13 2022-04-10 851036553
# 14 2022-04-11 851595486
# 15 2022-04-12 854936478
# 16 2022-04-13 852836553
# 17 2022-04-14 854146287
# 18 2022-04-15 836263614
# 19 2022-04-16 778594875
# 20 2022-04-17 778666699

benubah · 2022-04-19T09:28:10Z

Yes, we've been working on this for the past few days. There were some revisions at the source, but there are also some missing states. We can't reproduce this with R because the Python scraping tool uses pandas fillna() to replace NAs with the last known value.
At least the KS and MN states have been missing since April 15th. I have reported this to the source.

library(dplyr)
#> 
usa <- readr::read_csv("https://raw.githubusercontent.com/govex/COVID-19/master/data_tables/testing_data/time_series_covid19_US.csv", show_col_types = FALSE) |>
  select(date, state, tests_combined_total) 

data_sum <-
  usa |>
  mutate(date = as.Date(date, "%m/%d/%Y")) |>
  #na.locf does something similar to pandas fillna(method = "ffill")
  #mutate(tests_combined_total = zoo::na.locf(tests_combined_total)) |>
  group_by(date) |>
  summarize(tests_total = sum(tests_combined_total)) 
  

missing <- 
  usa |>
  filter(state %in% c("MN", "KS"),
         date %in% c("4/11/2022","4/12/2022","4/13/2022", "4/14/2022", "4/15/2022", "4/16/2022", "4/17/2022"))

print(missing)
#> # A tibble: 14 x 3
#>    date      state tests_combined_total
#>    <chr>     <chr>                <dbl>
#>  1 4/11/2022 KS                 2269976
#>  2 4/11/2022 MN                16379616
#>  3 4/12/2022 KS                 2269976
#>  4 4/12/2022 MN                16402859
#>  5 4/13/2022 KS                 2269976
#>  6 4/13/2022 MN                16402859
#>  7 4/14/2022 KS                 2269976
#>  8 4/14/2022 MN                16402859
#>  9 4/15/2022 KS                      NA
#> 10 4/15/2022 MN                      NA
#> 11 4/16/2022 KS                      NA
#> 12 4/16/2022 MN                      NA
#> 13 4/17/2022 KS                      NA
#> 14 4/17/2022 MN                      NA
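
To illustrate the difference the forward fill makes, here is a minimal base-R sketch of a per-state "last known value" fill followed by the daily sum. It is not the scraping tool's code (that is Python and is not shown in this thread). `usa_toy` and `fill_forward` are hypothetical names made up for this example, and the toy values are copied from the `missing` output above.

```r
# Toy data shaped like the `missing` tibble above (values copied from that output).
usa_toy <- data.frame(
  date  = as.Date(rep(c("2022-04-13", "2022-04-14", "2022-04-15", "2022-04-16"), each = 2)),
  state = rep(c("KS", "MN"), times = 4),
  tests_combined_total = c(2269976, 16402859, 2269976, 16402859, NA, NA, NA, NA)
)

# Hypothetical helper: carry the last non-NA value forward; leading NAs stay NA.
fill_forward <- function(x) {
  idx <- cumsum(!is.na(x))        # position of the most recent non-NA value
  c(NA, x[!is.na(x)])[idx + 1]    # idx == 0 maps to the leading NA slot
}

# Fill within each state, in date order, so one state's value never leaks into another's.
usa_toy <- usa_toy[order(usa_toy$state, usa_toy$date), ]
usa_toy$tests_filled <- ave(usa_toy$tests_combined_total, usa_toy$state, FUN = fill_forward)

# Daily totals after filling.
totals <- aggregate(tests_filled ~ date, data = usa_toy, FUN = sum)
print(totals)
```

With the fill applied, each of the four toy dates sums to 18672835 (KS plus MN), whereas a plain sum with na.rm = TRUE would drop both states from April 15th onward.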

We are working on a fix now at our end.

findanna · 2022-04-19T09:38:36Z

Thank you @benubah, looking forward to the updates!

seawaR · 2022-04-19T15:46:00Z

Hi @findanna,
Last month we realized that some states were no longer reporting and decided to take the last available value (as @benubah explained) to avoid negative values in the cumulative tests (for instance, you can see one in your example between 2022-03-31 and 2022-04-01).
Some of these states may start publishing again, so we will work on an automatic weekly or monthly update that can help with this potential issue and with the regular revisions you mention. In the meantime, I will update the values from the beginning of the month.
However, what happened on 2022-04-15 seems to be a mistake at the source on that day only. Because the scraping process for all countries only takes values for the current day, we have manually corrected the values from 2022-04-14 to 2022-04-17. Please keep in mind that these values can change with revisions at the source.
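
The negative values mentioned above correspond to days on which the cumulative series goes down. A minimal base-R sketch for flagging such days, assuming a data frame with one row per date; `cum` and `drops` are hypothetical names, and the values are copied from the tail(data_US, 20) output earlier in this thread:

```r
# Seven consecutive days copied from the tail(data_US, 20) output in this thread.
cum <- data.frame(
  date = as.Date(c("2022-04-11", "2022-04-12", "2022-04-13", "2022-04-14",
                   "2022-04-15", "2022-04-16", "2022-04-17")),
  cum_tests = c(851595486, 854936478, 852836553, 854146287,
                836263614, 778594875, 778666699)
)

# Day-over-day change; the first day has no predecessor, so it is NA.
cum$daily_change <- c(NA, diff(cum$cum_tests))

# A cumulative count should never decrease, so any negative change is suspect.
drops <- cum[!is.na(cum$daily_change) & cum$daily_change < 0, ]
print(drops)
```

On these values the check flags 2022-04-13, 2022-04-15 and 2022-04-16. It only locates the drops; whether a drop is a source revision or a missing state still has to be checked per state.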

findanna · 2022-04-19T15:53:59Z

Thank you @seawaR and @benubah.
If it is possible to correct the historical values so that we match the source, please do so.

benubah closed this as completed in 7ef4c03 Apr 21, 2022
