This repository has been archived by the owner on Sep 28, 2023. It is now read-only.

normalise() should have consistent behaviour across DataSources: It should give mean=0 and std=1. #83

Open
JackKelly opened this issue Nov 24, 2021 · 3 comments

Comments

@JackKelly
Member

Describe the bug
In satellite and nwp data sources, normalise() does the right thing: it ensures that, on average, the means will be zero; and the std will be 1.

In gsp and pv data sources, normalise() rescales the values to be in the range [0, 1], which isn't exactly the same thing!
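To make the difference concrete, here is a minimal sketch of the two kinds of rescaling. The function names and sample values are made up for illustration and are not taken from the data sources. It also computes the statistics from the sample itself, whereas a real `normalise()` may well use precomputed constants.

```python
import statistics


def standardise(values):
    """Rescale so the result has mean 0 and (population) std 1."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def rescale_to_unit_range(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


raw = [0.0, 50.0, 100.0, 250.0]  # hypothetical values, illustration only

z = standardise(raw)            # mean 0, std 1; values can be negative
u = rescale_to_unit_range(raw)  # all values in [0, 1]; mean and std are whatever they happen to be
```

Both outputs are of order 1 in magnitude, but only the first has zero mean and unit std.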

Expected behavior
For any data source that's used as an input to the model, we probably want means to be zero and std to be 1.

For the target, we may sometimes want to re-scale to [0, 1] (if, for example, we're using a sigmoid output layer). But we should probably ignore that for now 🙂

@JackKelly JackKelly added the bug Something isn't working label Nov 24, 2021
@JackKelly JackKelly moved this to Todo in Nowcasting Nov 24, 2021
@peterdudfield
Collaborator

I know each definition of normalise is different, but it seems right that we normalise some data to be ~N(0, 1) and some to be in the range [0, 1].

@JackKelly
Member Author

yeah, sorry, I think you're right that it's probably fine!

I think what's harmful is if some inputs are, like, orders of magnitude larger than other inputs. Then the model might struggle to learn which inputs are most informative (because the ones which are numerically larger will be "shouting the loudest" even if they're not actually very informative). Sure, you're right, having some inputs be ~N(0, 1) and some be in the range [0, 1] is probably fine!

You know far more maths than me, so more than happy to do whatever you think is best!

@JackKelly JackKelly removed the bug Something isn't working label Nov 24, 2021
@peterdudfield
Collaborator

Might be good to put a pydantic validation check on the ~N(0, 1) ones, say |x| < 10. Would have to work out the probability of a correctly normalised value failing that, but I reckon it could be something like 1/(number of particles in the universe), so then we will truly catch only the real errors.
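A rough sketch of that check, using only the standard library. The function names are hypothetical, and the probability assumes the values really are N(0, 1). The check is a plain function that could be called from a pydantic validator, but it does not use pydantic itself.

```python
import math


def prob_abs_exceeds(limit):
    """P(|x| > limit) for x ~ N(0, 1), via the complementary error function."""
    return math.erfc(limit / math.sqrt(2))


def check_standardised(values, limit=10.0):
    """Raise ValueError if any value is outside (-limit, limit).

    Written as `not abs(v) < limit` so that NaN also fails the check.
    """
    for v in values:
        if not abs(v) < limit:
            raise ValueError(f"value {v!r} is outside (-{limit}, {limit})")
    return values


p = prob_abs_exceeds(10.0)  # per-value chance of a false alarm
```

For a limit of 10, `p` comes out at roughly 1.5e-23 per value. That is a lot larger than 1/(number of particles in the universe), but still small enough that a failure almost certainly means the data was not normalised as expected, even across billions of values.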
