Within Session splitter #664
base: develop
Conversation
Deleting unified_eval so it can be addressed in another PR. Working on tests.
Adding: figures for documentation
Signed-off-by: Bruna Junqueira Lopes <[email protected]>
# Conflicts: # moabb/evaluations/metasplitters.py # moabb/tests/metasplits.py
Add shuffle and random_state parameters to WithinSession
docs/source/images/withinsess.png
Better to represent only one session @brunaafl
# Conflicts: # moabb/tests/splits.py
# Conflicts: # moabb/evaluations/splitters.py
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @brunaafl,
Thanks for this PR, it looks good!
I left one comment regarding a test I think you should add.
```python
@pytest.mark.parametrize("shuffle", [True, False])
@pytest.mark.parametrize("random_state", [0, 42])
def test_within_session(shuffle, random_state):
    X, y, metadata = paradigm.get_data(dataset=dataset)
```
I think it is important to check that the split is the same when we load the data of only one or a few subjects: paradigm.get_data(dataset=dataset, subjects=[m, n...])
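A sketch of such a test (hypothetical: `paradigm`, `dataset`, the `split(y, metadata)` signature, and the assumption that subject 1's epochs come first in the full metadata and that folds are yielded subject by subject are all taken on faith from the surrounding test code):

```python
import numpy as np

def test_split_consistent_across_subject_subsets():
    # Folds computed when loading the full dataset.
    X, y, metadata = paradigm.get_data(dataset=dataset)
    splitter = WithinSessionSplitter(n_folds=5, random_state=42)
    folds_all = list(splitter.split(y, metadata))

    # Folds computed when loading subject 1 only.
    X1, y1, metadata1 = paradigm.get_data(dataset=dataset, subjects=[1])
    folds_one = list(splitter.split(y1, metadata1))

    # With subject 1 stored first, the first folds of the full split
    # should use exactly the same indices as the subject-only split.
    for (tr_all, te_all), (tr_one, te_one) in zip(folds_all, folds_one):
        assert np.array_equal(tr_all, tr_one)
        assert np.array_equal(te_all, te_one)
```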
A first batch of comments. Overall, it looks good, but I think it could be improved by adding the possibility to pass in a cv object, which would allow controlling the intra-session splits (for instance using TimeSeriesSplit, which makes sense in an online setting).
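For illustration, the usage this would enable might look like the following (the `cv` argument is hypothetical at this point in the discussion):

```python
from sklearn.model_selection import TimeSeriesSplit

# Hypothetical: once the splitter accepts a cross-validator, an ordered
# split such as TimeSeriesSplit would mimic an online setting where
# training epochs always precede test epochs in time.
splitter = WithinSessionSplitter(cv=TimeSeriesSplit(n_splits=5))
for train_ix, test_ix in splitter.split(y, metadata):
    ...  # train on train_ix, evaluate on test_ix
```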
images/withinsess.png
Why not use only the other version?
moabb/evaluations/splitters.py
```
Parameters
----------
n_folds : int
    Number of folds. Must be at least 2.
```
Suggested change:

```diff
- Number of folds. Must be at least 2.
+ Number of folds for the within-session k-fold split. Must be at least 2.
```
moabb/evaluations/splitters.py
```python
random_state: int = 42,
shuffle_subjects: bool = False,
shuffle_session: bool = True,
```
Default random_state should be None. Also, the convention would be to put it at the end of the argument list.
Suggested change:

```diff
- random_state: int = 42,
- shuffle_subjects: bool = False,
- shuffle_session: bool = True,
+ shuffle_subjects: bool = False,
+ shuffle_session: bool = True,
+ random_state: int = None,
```
moabb/evaluations/splitters.py
```
shuffle_session : bool, default=True
    Whether to shuffle each class's samples before splitting into batches.
    Note that the samples within each split will not be shuffled.
shuffle_subjects : bool, default=False
    Apply shuffle in mixing subjects and sessions; this parameter allows
    sample iterations of the splitter.
```
Do you think it is necessary to have both? I don't really see any use case where I would use only one of them. I would only have a single shuffle parameter.
moabb/evaluations/splitters.py
```python
self.n_folds = n_folds
self.shuffle_subjects = shuffle_subjects
self.shuffle_session = shuffle_session
self.random_state = check_random_state(random_state)
```
If you use it like that, this is not a random_state anymore but an rng.
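In scikit-learn's convention, the constructor stores the seed untouched and the generator is derived only where randomness is consumed. A minimal sketch of that pattern (parameter names follow the single-shuffle suggestion above, so they are hypothetical):

```python
from sklearn.utils import check_random_state

class WithinSessionSplitter:
    def __init__(self, n_folds=5, shuffle=True, random_state=None):
        self.n_folds = n_folds
        self.shuffle = shuffle
        # Keep the parameter as passed: a seed, a RandomState, or None.
        self.random_state = random_state

    def split(self, y, metadata):
        # Turn the seed into an rng only here, at the point of use.
        rng = check_random_state(self.random_state)
        ...
```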
moabb/evaluations/splitters.py
```python
for subject in subjects:
    subject_mask = metadata.subject == subject
    subject_indices = all_index[subject_mask]
    subject_metadata = metadata[subject_mask]
    sessions = subject_metadata.session.unique()

    # Shuffle sessions if required
    if self.shuffle_session:
        self.random_state.shuffle(sessions)

    for session in sessions:
        session_mask = subject_metadata.session == session
        indices = subject_indices[session_mask]
        group_y = y[indices]

        # Use StratifiedKFold with the group-specific random state
        cv = StratifiedKFold(
            n_splits=self.n_folds,
            shuffle=self.shuffle_session,
            random_state=self.random_state,
        )
        for ix_train, ix_test in cv.split(indices, group_y):
            yield indices[ix_train], indices[ix_test]
```
We talked a bit with @sylvchev, and I think the best would be to modify this to take a cv object in the constructor (the default would be StratifiedKFold), clone it with a different random seed for each (subject, session) group, and then yield the right indices. That way, we can do a real shuffle, also shuffling the groups from which we retrieve the next split. Would this make sense?
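A rough sketch of that structure, written as a method of the splitter (hypothetical; sklearn.base.clone expects a get_params method that CV splitters don't implement, so the sketch uses a deep copy with a reset seed instead):

```python
import copy

from sklearn.utils import check_random_state

def split(self, y, metadata):
    rng = check_random_state(self.random_state)
    # Assumes metadata has a RangeIndex aligned with y.
    for (subject, session), group in metadata.groupby(["subject", "session"]):
        # Hand each (subject, session) group a re-seeded copy of the
        # user-provided cross-validator.
        cv = copy.deepcopy(self.cv)
        cv.random_state = rng.randint(2**31 - 1)
        indices = group.index.to_numpy()
        for ix_train, ix_test in cv.split(indices, y[indices]):
            yield indices[ix_train], indices[ix_test]
```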
Makes sense, I'm working on that, thanks!
Great! Thank you so much! And thanks for your patience :)
Co-authored-by: Thomas Moreau <[email protected]> Signed-off-by: Bruna Junqueira Lopes <[email protected]>
Adding PseudoOnlineSplit (time series splitter). Fixing tests.
Hi @tomMoral, @bruAristimunha, @PierreGtch! I added the functionality to pass a metasplitter, such as TimeSeries/PseudoOnline. The way I designed this object, the metasplitter returns indices for the calibration and test sets. To ensure the splitter also returns indices for a train set, if needed, I was wondering if we could always have StratifiedKFold split train/test, and allow passing PseudoOnline as an inner_cv to further split the test set into calibration and test if wanted.
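For reference, the composition described above might be sketched like this (hypothetical: the PseudoOnlineSplit API and the meaning of the pairs it yields are assumptions based on the description):

```python
from sklearn.model_selection import StratifiedKFold

def split_with_calibration(y, inner_cv=None):
    # Outer loop: always produce train/test folds with StratifiedKFold.
    outer = StratifiedKFold(n_splits=5)
    for train_ix, test_ix in outer.split(y.reshape(-1, 1), y):
        if inner_cv is None:
            yield train_ix, test_ix
        else:
            # Hypothetical: the metasplitter (e.g. PseudoOnlineSplit) yields
            # (calibration, evaluation) positions within the test fold.
            for cal, ev in inner_cv.split(test_ix, y[test_ix]):
                yield train_ix, test_ix[cal], test_ix[ev]
```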
Hi @brunaafl, thanks for all the work you put in!! About the delay, this is all voluntary work, so no need to apologise :)
> I added the functionality to pass a metasplitter, such as TimeSeries/PseudoOnline. The way I designed this object, the metasplitter returns indices for the calibration and test sets. To ensure the splitter also returns indices for a train set, if needed, I was wondering if we could always have StratifiedKFold split train/test, and allow passing PseudoOnline as an inner_cv to further split the test set into calibration and test if wanted.
I’m not sure I understand your question. What is the difference between the train and the calibration sets?
I also left a few comments on the code.
```python
def __init__(
    self,
    cv=StratifiedKFold,
```
In the scikit-learn framework, cv is a cross-validator object, not a class. I think it would be best to stick to that convention; it would avoid instantiating it during the split call. You can have cv=None by default and instantiate cv=StratifiedKFold() in the __init__ method. Also, you can check the cv argument with sklearn's check_cv.
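Concretely, the constructor could look like this (a sketch; check_cv lives in sklearn.model_selection, but the class and method names are placeholders):

```python
from sklearn.model_selection import StratifiedKFold, check_cv

class WithinSessionSplitter:
    def __init__(self, cv=None, random_state=None):
        # Store a cross-validator *instance*, never a class; fall back to
        # a default StratifiedKFold() here in __init__.
        self.cv = cv if cv is not None else StratifiedKFold(n_splits=5)
        self.random_state = random_state

    def split(self, y, metadata):
        # check_cv validates the argument and returns a cross-validator,
        # choosing a stratified strategy for classification targets.
        cv = check_cv(self.cv, y, classifier=True)
        ...
```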
I understand the concern, but I'm a bit unsure how to implement it in the shuffle=True case, since I'm defining a different seed for each (subject, session). Is the suggestion to instantiate cv in the __init__ method in case it is not needed at split time, and to keep how it is being done otherwise?
Indeed, it's not easy. But you could, for example, make a wrapper around StratifiedKFold which would instantiate a different cv with a different seed for each subject/session. Also, I just noticed that at the moment, the seeds for each subject/session are chosen at random. We will not be able to have reproducible results this way. Instead, you could add a parameter global_seed to your wrapper and use, for each cv, random_state = global_seed + 10000 * subject_number + session_number (it's safe to say we will never have 10000 sessions) if global_seed is an integer, and None otherwise.
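A sketch of that deterministic seeding scheme (the wrapper name and its method are hypothetical; only the seed formula comes from the comment above):

```python
import copy

from sklearn.model_selection import StratifiedKFold

class SeededPerGroupCV:
    """Hand out a re-seeded copy of a cv prototype per (subject, session)."""

    def __init__(self, cv=None, global_seed=None):
        self.cv = cv if cv is not None else StratifiedKFold(n_splits=5, shuffle=True)
        self.global_seed = global_seed

    def for_group(self, subject_number, session_number):
        cv = copy.deepcopy(self.cv)
        if self.global_seed is not None:
            # Deterministic per-group seed; 10000 sessions per subject is a
            # safe upper bound, so no two groups share a seed.
            cv.random_state = (
                self.global_seed + 10000 * subject_number + session_number
            )
        else:
            cv.random_state = None
        return cv
```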
# Conflicts: # moabb/evaluations/metasplitters.py # moabb/evaluations/splitters.py
This PR is a follow-up on PR #624 and is related to issue #612. It includes just the implementation for the WithinSessionSplitter data splitter.