
Convert check new samples command to cron job #4394

Merged: 16 commits into dev on Oct 2, 2024

Conversation

hanars (Collaborator) commented Sep 26, 2024

No description provided.

@@ -22,7 +23,10 @@

 logger = logging.getLogger(__name__)

-GS_PATH_TEMPLATE = 'gs://seqr-hail-search-data/v3.1/{path}/runs/{version}/'
+GS_PATH_TEMPLATE = 'gs://seqr-hail-search-data/v3.1/{genome_version}/{dataset_type}/runs/{run_version}/_SUCCESS'
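For orientation, a minimal sketch (not the code from this PR) of how the new template could be used to detect a finished pipeline run: format the path with a genome version, dataset type, and run version, then check whether the _SUCCESS object exists. The run_succeeded helper and the use of the google-cloud-storage client are assumptions for illustration only.

# Hedged sketch, not this PR's implementation: check for a run's _SUCCESS marker.
from google.cloud import storage

GS_PATH_TEMPLATE = 'gs://seqr-hail-search-data/v3.1/{genome_version}/{dataset_type}/runs/{run_version}/_SUCCESS'

def run_succeeded(genome_version, dataset_type, run_version):
    # Format the gs:// path, split off the bucket name, and test whether the blob exists.
    path = GS_PATH_TEMPLATE.format(
        genome_version=genome_version, dataset_type=dataset_type, run_version=run_version,
    )
    bucket_name, blob_name = path[len('gs://'):].split('/', 1)
    return storage.Client().bucket(bucket_name).blob(blob_name).exists()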
Collaborator:

Can we make gs://seqr-hail-search-data a reference to HAIL_SEARCH_DATA_DIR?

Collaborator:

@jklugherz we should probably update the luigi piece to write a _SUCCESS file so the local pipeline has that functionality as well!

Collaborator Author (hanars):

HAIL_SEARCH_DATA_DIR is the relative local path for the mounted volume, not the gs path.

Collaborator:

We wanted this to work for local users though, right? So saved variants would be updated? I can't remember exactly.

Collaborator Author (hanars) commented Sep 30, 2024:

We did and we do, but we also need this to work for us. HAIL_SEARCH_DATA_DIR has to be a relative local path for the hail backend to work for anyone, and for local users that's where run metadata will live for seqr as well (assuming we add a volume mount; currently the seqr service is not accessing this data at all). But that local path cannot work for our seqr instance, which needs run metadata to be read from GCP directly.

I can think of two options here:

  1. For our seqr deployment, have HAIL_SEARCH_DATA_DIR be an env variable for both the hail backend and seqr, defined differently in each place (local path in the hail backend, gs path in seqr).
  2. Add an additional env variable to seqr, e.g. HAIL_SEARCH_RUN_METADATA_DIR, and use it here if defined, falling back to HAIL_SEARCH_DATA_DIR otherwise (a fallback along these lines is sketched just below).

In both cases local installs would work with only the one HAIL_SEARCH_DATA_DIR variable, so that should be straightforward for them at least.
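A minimal sketch of option 2's fallback, assuming a Django-settings-style module; this is illustrative only and is not what was ultimately merged (option 1 was chosen further down the thread).

import os

# Option 2 (illustrative, not merged): prefer a metadata-specific variable and
# fall back to HAIL_SEARCH_DATA_DIR when it is not set.
HAIL_SEARCH_DATA_DIR = os.environ.get('HAIL_SEARCH_DATA_DIR')
HAIL_SEARCH_RUN_METADATA_DIR = os.environ.get('HAIL_SEARCH_RUN_METADATA_DIR') or HAIL_SEARCH_DATA_DIR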

Collaborator:

Ok... I really don't love doubling down on weird behavior in our system to avoid a couple of hours of work (especially if we're going to do it anyway in the future).

Of the options presented I think defining HAIL_SEARCH_DATA_DIR differently in hail-search and seqr makes the most sense.

Collaborator Author (hanars) commented Sep 30, 2024:

I agree that not doubling down on weird behavior to save a few hours of work is a good engineering goal, but I also think avoiding scope creep is an important engineering goal. At the start of this PR, seqr was reading the metadata from GCP, and the scope of this work said nothing about changing where the data is read from. While it may be "only a couple of hours of work" to change the open source helm chart, the tgg helm chart, the airflow/pipeline code for how run directories are copied, and this spot in the code, those hours would have to be spread out over a few days of iterative discussion and PR review. In the meantime this PR stays open and I accrue merge conflicts.

When, in the course of building something, we discover a better approach that requires more work, I think it is better for team velocity to ship a reasonable engineering solution that keeps the work within the original scope, then create tickets to track the new work and prioritize them as part of our normal planning process. In this case, I also think we would actually want an engineering meeting, or at least a design plan, before changing what data the seqr pod does and does not have access to, as I am really not comfortable with the idea of mounting the full hail search data disk to the seqr pod.

Collaborator:

talked offline, on the same page now!

Contributor:

what did y'all decide to do here?

Collaborator:

We're going to define HAIL_SEARCH_DATA_DIR differently in the seqr and hail-search environments. I made a ticket to re-think the delivery mechanism between the pipeline and seqr.

hanars (Collaborator Author) commented Oct 1, 2024:

Note: should not clear cache if no projects updated!

hanars requested a review from bpblanken on October 1, 2024 at 21:12
@@ -156,6 +156,7 @@

 MEDIA_URL = '/media/'

 LOADING_DATASETS_DIR = os.environ.get('LOADING_DATASETS_DIR')
+HAIL_SEARCH_DATA_DIR = os.environ.get('HAIL_SEARCH_DATA_DIR')
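For the defaults question discussed below, a hedged sketch of the settings-level alternative that was not adopted (the team chose to default in seqr-helm instead); the '/hail-search-data' value is a hypothetical placeholder, not the chart's actual default.

import os

# Alternative discussed below and not adopted: default in Django settings
# rather than in seqr-helm. '/hail-search-data' is a hypothetical placeholder.
HAIL_SEARCH_DATA_DIR = os.environ.get('HAIL_SEARCH_DATA_DIR', '/hail-search-data')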
bpblanken (Collaborator) commented Oct 2, 2024:

👍

Do we want to have defaults for these here or just have them as defaults in seqr-helm?

Collaborator Author (hanars):

let's just have them defaulted in seqr-helm

Collaborator:

👍

Commits:
…mples
Correctly check airtable samples associated with multiple PDOs
hanars merged commit 2a0d731 into dev on Oct 2, 2024; 2 of 3 checks passed.