-
Notifications
You must be signed in to change notification settings - Fork 13
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Outdated testcount data for Germany, incorrect Prophet forecast #10
Comments
Just had an idea: The RKI lab surveillance reports (see link in my issue report) indicate a test positivity rate for each Bundesland on page 5. It's a bit hard to read, but should be possible, maybe with some GIMP help. The test positivity rate combined with number of cases reported by the Bundesland in that week could be used to estimate the number of tests. Edit: Note the table on page 4 is not useful because it aggregates numbers for 10 weeks. Current positivity rate needs to be extracted from graphs on page 5. |
Hi @w-flo, From a modeling perspective it should be possible to build a more complete PyMC3 model, maybe using the |
Is the nation-wide data easier to get, such that it would be possible to integrate it in the nation-wide plot? |
@sdschulze I am not aware of any data source other than the weekly RKI reports. And those are just weekly numbers and based on essentially the same underlying data. As I wrote earlier, one could incorporate this in a more sophisticated model. |
It seems like the weekly test counts are published a lot more frequent than the daily test counts (which is kind of sad/hilarious?). So maybe one can:
|
@MaPePeR Yes. If you want to start, take a look at the Unfortunately I cannot share the CSV of daily German testcounts, but for comparison you can just use the data loader from Belgium (https://github.com/rtcovidlive/rtlive-global/blob/master/notebooks/Tutorial_dataloading.ipynb). By the way, there are many other countries from the OWID dataset that we could support. If anybody contributes a data loader for the UK, Ireland or Spain, that would be very welcome! |
I can give it a try. My experience with panda is minimal, though. I think I'm going to ignore them for now and only try to correct the country data. Other possibilities would be:
Biggest problem I see is making this correction somewhat transparent to the readers. What do you think? EDIT: At the moment I'm tending to implement the latest idea of predicting the weekly tests and the distribution of tests across weekdays and regions separately. But someone else should take a look at it and think about if that's really a good idea. |
I attempted that 3rd Idea, but the prediction of weekly average tests didn't work out as planned. The prediciton of the "region factor" also doesn't look promising and prophet raised warnings i don't understand. I kind of expected the prediction to look like the predicted tests for netherlands. They do not. (See |
@MaPePeR you'll probably want to set this kwarg to |
@MaPePeR in your cell |
For that part I don't want to predict the yellow zig-zag line, but the blue staircase, so I assume I cannot reproduce the forecast for netherlands: result_df, results = data_nl.forecast_NL(ourworldindata.create_loader_function("NL")(pandas.Timestamp('today')))
pyplot.plot(result_df.xs('all')['predicted_new_tests'], label='predicted')
pyplot.plot(result_df.xs('all')['new_tests'], label='tests')
pyplot.legend() So maybe there is something wrong with my prophet installation or they behave differently when called from a notepad? |
The plot in this comment looks like prophet did not converge. Your last plot here looks correct. On Rtlive.de/global we use Maybe I'm not seeing the problem you're referring to. |
I now tried:
I was not expecting a saw-like prediction. But I now realized, that this might be a wrong assumption from my part and one only does not see this saw on rtlive.de/global because of graph scaling. But I also realized, that my approach is wrong: I shouldn't use the prediction on data that is the same for all weekdays and then expect the prediction to also be the same on all weekdays. Instead I probably should use a predictor, that predicts the real weekly data instead of "fake" daily data. |
I've gone back to my first idea of scaling the prediction directly and I think it looks promising: https://github.com/MaPePeR/rtlive-global/blob/correct_prediction_with_weekly_sum/notebooks/Correct_daily_count_with_weekly_sum.ipynb Again, this is on belgian data, but simulating the german missing data. |
@MaPePeR indeed, this looks very promising! Do I understand it correctly that you used a shortened daily data timeseries from BE to train and afterwards scale the prediction with the weekly sums of the full series? Now to "get this into production" you'll have to make this a function with a very clearly defined signature (type hints + docstring). I suppose this function should take two dataframes - one with daily data and one with weekly sums? |
Yes, you understood correctly. I only shorten the daily data from BE to simulate the data situation we have in germany. At the moment I'm thinking if it would be more robust to don't base it on weeks and instead allow for arbitrary gaps in the data. Then it kind of depends if this method only should do the scaling or also the predicting. If it should not do the predicting itself, but act as a "postprocessing" step to the prediction it needs the predicted data and needs to know at which point the prediction starts/the known data ends. At the moment this can only be used for the |
I choose the 3-Series method signature and made the necessary changes and created yet another notebook in order to test it with belgian data again. Looks like it works, but this needs review and testing, as I wasn't able to test the real Changes to the Not sure, if and when I should make a PR. Should I include any of the notebooks? I feel like they are not clean enough to include them. |
I also now attempted to improve the region total testcounts in that same Notebook. The results also look very promising to me. Two attempts:
The region with the worst approximation: One of the better predictions: See notebook for all others. |
Thanks for your work on this! I'm unable to review it though since I don't know anything about the implementation of this project right now, so I hope project owners will take a look. :-) A similar approach could probably be used for each German state, but the weekly testcount report is not machine-readable (it's a PDF file) and they only publish the sum of tests for a sliding window of the past 10 weeks. By starting with their first report, testcount data for every single week could be inferred, I guess. Reports are at https://ars.rki.de/Content/COVID19/Main.aspx I guess the aggregate numbers in these PDF files won't necessarily match the "secret" daily testcount data the RKI provides rtlive.de with every couple of weeks, so there needs to be a state-specific scaling factor as well. Right now the testcount prediction for SH is way off (really small because number of tests was decreasing in the weeks before Christmas) so rtlive.de estimates severe undertesting and a high R_t value, while testcount estimation for SN might be too high, leading to a smaller R_t and probably too small estimated infection rate. I hope RKI sends out new daily testcount data to rtlive.de soon to fix this. |
Thanks for opening the PR @MaPePeR and thanks for the great discussion here everyone. |
@MaPePeR we have since received new data from the RKI. The ArcGIS database is currently unavailable (they seem to have technical difficulties), which is why https://Rtlive.de was not yet updated with this new data. We will check out your PR and compare the adjustments made by your function with the new data to see how much it could have improved the prediction. |
There seems to be a fluke right now, with the prediction about twice as high as the data. |
It's actually not a bug, but we just did not get around to update the plot with an informative legend. tl;dr: The dataset of daily testcounts we get from the RKI does not cover all tests, because not all laboratories report their testing volume. |
Thanks for the explanation! I'm not sure if it's related, but now the Rt values by RKI have disappeared from the plot. |
@sdschulze no, that's a new issue. See #17 |
Hi,
Thanks for the interesting analysis on rtlive.de!
I realize detailed testcount data is not publicly available for Germany, so it's probably difficult to get a data update.
However, I'm afraid the Prophet forecast is not a great method to get a forecast for current testcount based on 4 week old data. As you can see, rtlive.de indicates a growing number of tests for Germany and the Bundesländer in the past 4 weeks, even though RKI numbers published for those weeks clearly shows the opposite trend: number of tests has been going down in recent weeks. This probably results in underestimating the infection count and R_t.
I'm not sure if there's a good way to fix this unless RKI is willing to give you new data.
Maybe just copy the last week of available data for each Bundesland and scale it according to the change in country-wide test numbers published by RKI for each week? That would probably under-estimate tests in Bavaria and over-estimate tests in other Bundesländer, since Bavaria still has free tests for all, but since there is no public data on a Bundesland level other than https://ars.rki.de/Content/COVID19/Main.aspx (which doesn't appear to be very useful), I don't know a better workaround.
The text was updated successfully, but these errors were encountered: