
Convert check new samples command to cron job #4394

Merged: 16 commits into dev on Oct 2, 2024

Conversation

hanars (Collaborator) commented Sep 26, 2024

No description provided.

@@ -22,7 +23,10 @@

 logger = logging.getLogger(__name__)

-GS_PATH_TEMPLATE = 'gs://seqr-hail-search-data/v3.1/{path}/runs/{version}/'
+GS_PATH_TEMPLATE = 'gs://seqr-hail-search-data/v3.1/{genome_version}/{dataset_type}/runs/{run_version}/_SUCCESS'
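For orientation, a minimal sketch (not the code from this PR) of how the new template could be used to detect a finished pipeline run: format the path with a genome version, dataset type, and run version, then check whether the _SUCCESS object exists. The run_succeeded helper and the use of the google-cloud-storage client are assumptions for illustration only.

# Hedged sketch, not this PR's implementation: check for a run's _SUCCESS marker.
from google.cloud import storage

GS_PATH_TEMPLATE = 'gs://seqr-hail-search-data/v3.1/{genome_version}/{dataset_type}/runs/{run_version}/_SUCCESS'

def run_succeeded(genome_version, dataset_type, run_version):
    # Format the gs:// path, split off the bucket name, and test whether the blob exists.
    path = GS_PATH_TEMPLATE.format(
        genome_version=genome_version, dataset_type=dataset_type, run_version=run_version,
    )
    bucket_name, blob_name = path[len('gs://'):].split('/', 1)
    return storage.Client().bucket(bucket_name).blob(blob_name).exists()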
Collaborator:

Can we make gs://seqr-hail-search-data a reference to HAIL_SEARCH_DATA_DIR?

Collaborator:

@jklugherz we should probably update the luigi piece to write a _SUCCESS file so the local pipeline has that functionality as well!

Collaborator Author (hanars):

HAIL_SEARCH_DATA_DIR is the relative local path for the mounted volume, not the gs path.

Collaborator:

We wanted this to work for local users though, right? So saved variants would be updated? I can't remember exactly.

Collaborator Author (hanars) commented Sep 30, 2024:

We did and we do, but we also need this to work for us. HAIL_SEARCH_DATA_DIR has to be a relative local path for the hail backend to work for anyone, and for local users that's where run metadata will live for seqr as well (assuming we add a volume mount; currently the seqr service is not accessing this data at all). But that local path cannot work for our seqr instance, which needs run metadata to be read from GCP directly.

I can think of two options here:

  1. For our seqr deployment, have HAIL_SEARCH_DATA_DIR be an env variable for both the hail backend and seqr, defined differently in each place (local path in the hail backend, gs path in seqr).
  2. Add an additional env variable to seqr, e.g. HAIL_SEARCH_RUN_METADATA_DIR, and use it here if defined, falling back to HAIL_SEARCH_DATA_DIR otherwise (a fallback along these lines is sketched just below).

In both cases local installs would work with only the one HAIL_SEARCH_DATA_DIR variable, so that should be straightforward for them at least.
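A minimal sketch of option 2's fallback, assuming a Django-settings-style module; this is illustrative only and is not what was ultimately merged (option 1 was chosen further down the thread).

import os

# Option 2 (illustrative, not merged): prefer a metadata-specific variable and
# fall back to HAIL_SEARCH_DATA_DIR when it is not set.
HAIL_SEARCH_DATA_DIR = os.environ.get('HAIL_SEARCH_DATA_DIR')
HAIL_SEARCH_RUN_METADATA_DIR = os.environ.get('HAIL_SEARCH_RUN_METADATA_DIR') or HAIL_SEARCH_DATA_DIR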

Collaborator:

Ok... I really don't love doubling down on weird behavior in our system to avoid a couple of hours of work (especially if we're going to do it anyway in the future).

Of the options presented I think defining HAIL_SEARCH_DATA_DIR differently in hail-search and seqr makes the most sense.

Collaborator Author (hanars) commented Sep 30, 2024:

I agree that not doubling down on weird behavior to save a few hours of work is a good engineering goal, but I also think avoiding scope creep is an important engineering goal. At the start of this PR, seqr was reading the metadata from GCP, and the scope of this work said nothing about changing where the data is read from. While it may be "only a couple of hours of work" to change the open source helm chart, the tgg helm chart, the airflow/pipeline code for how run directories are copied, and this spot in the code, those hours would have to be spread out over a few days of iterative discussion and PR review. In the meantime this PR stays open and I accrue merge conflicts.

When, in the course of building something, we discover a better approach that requires more work, I think it is better for team velocity to ship a reasonable engineering solution that keeps the work within the original scope, then create tickets to track the new work and prioritize them as part of our normal planning process. In this case, I also think we would actually want an engineering meeting, or at least a design plan, before changing what data the seqr pod does and does not have access to, as I am really not comfortable with the idea of mounting the full hail search data disk to the seqr pod.

Collaborator:

talked offline, on the same page now!

Contributor:

what did y'all decide to do here?

Collaborator:

We're going to define HAIL_SEARCH_DATA_DIR differently in the seqr and hail-search environments. I made a ticket to re-think the delivery mechanism between the pipeline and seqr.

hanars (Collaborator Author) commented Oct 1, 2024:

Note: should not clear cache if no projects updated!

hanars requested a review from bpblanken on October 1, 2024 at 21:12
@@ -156,6 +156,7 @@

 MEDIA_URL = '/media/'

 LOADING_DATASETS_DIR = os.environ.get('LOADING_DATASETS_DIR')
+HAIL_SEARCH_DATA_DIR = os.environ.get('HAIL_SEARCH_DATA_DIR')
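For the defaults question discussed below, a hedged sketch of the settings-level alternative that was not adopted (the team chose to default in seqr-helm instead); the '/hail-search-data' value is a hypothetical placeholder, not the chart's actual default.

import os

# Alternative discussed below and not adopted: default in Django settings
# rather than in seqr-helm. '/hail-search-data' is a hypothetical placeholder.
HAIL_SEARCH_DATA_DIR = os.environ.get('HAIL_SEARCH_DATA_DIR', '/hail-search-data')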
bpblanken (Collaborator) commented Oct 2, 2024:

👍

Do we want to have defaults for these here or just have them as defaults in seqr-helm?

Collaborator Author (hanars):

let's just have them defaulted in seqr-helm

Collaborator:

👍

Commits:
…mples
Correctly check airtable samples associated with multiple PDOs
hanars merged commit 2a0d731 into dev on Oct 2, 2024; 2 of 3 checks passed.