Convert check new samples command to cron job #4394
Conversation
@@ -22,7 +23,10 @@
 logger = logging.getLogger(__name__)

-GS_PATH_TEMPLATE = 'gs://seqr-hail-search-data/v3.1/{path}/runs/{version}/'
+GS_PATH_TEMPLATE = 'gs://seqr-hail-search-data/v3.1/{genome_version}/{dataset_type}/runs/{run_version}/_SUCCESS'
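The new template resolves to a `_SUCCESS` marker path per run. A minimal sketch of how it might be formatted (the run values below are hypothetical illustrations, not from the PR):

```python
GS_PATH_TEMPLATE = 'gs://seqr-hail-search-data/v3.1/{genome_version}/{dataset_type}/runs/{run_version}/_SUCCESS'

# Hypothetical example values for illustration only
success_path = GS_PATH_TEMPLATE.format(
    genome_version='GRCh38',
    dataset_type='SNV_INDEL',
    run_version='manual__2024-01-01',
)
print(success_path)
```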
Can we make gs://seqr-hail-search-data a reference to HAIL_SEARCH_DATA_DIR?
@jklugherz we should probably update the luigi piece to write a _SUCCESS file so the local pipeline has that functionality as well!
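A `_SUCCESS` marker in this style is just an empty file written once a run directory is complete (the Hadoop/Spark convention). A hedged sketch of what the luigi side might do; the helper name is hypothetical, not from the pipeline code:

```python
import os

def write_success_file(run_dir):
    # Hypothetical helper: drop an empty _SUCCESS marker into a completed
    # run directory so downstream readers can detect finished runs
    success_path = os.path.join(run_dir, '_SUCCESS')
    with open(success_path, 'w'):
        pass  # empty file; its existence is the signal
    return success_path
```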
HAIL_SEARCH_DATA_DIR is the relative local path for the mounted volume, not the gs path
We wanted this to work for local users though right? So saved variants would be updated? I can’t remember exactly.
We did and we do, but we also need this to work for us. HAIL_SEARCH_DATA_DIR has to be a relative local path for the hail backend to work for anyone, and for local users that's where run metadata will be for seqr as well (assuming we add a volume mount, as currently the seqr service is not accessing this data at all). But using that local path cannot work for our seqr instance, which needs run metadata to be read from GCP directly.
We have 2 options here that I can think of:
- For our seqr deployment, have HAIL_SEARCH_DATA_DIR be an env variable for both the hail backend and seqr, and have it be different in the two places (local path in the hail backend, gs path in seqr).
- Add an additional env variable to seqr like HAIL_SEARCH_RUN_METADATA_DIR, then here use that if defined and fall back to HAIL_SEARCH_DATA_DIR if not.
In both cases the local installs would work with only the one HAIL_SEARCH_DATA_DIR variable, so that should be straightforward for them at least.
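The fallback in the second option can be expressed in one line. A sketch assuming the env var names from this thread; the default value is hypothetical:

```python
import os

# Prefer the dedicated run-metadata dir if defined, otherwise fall back
# to the shared HAIL_SEARCH_DATA_DIR (variable names from the discussion)
HAIL_SEARCH_DATA_DIR = os.environ.get('HAIL_SEARCH_DATA_DIR', '/hail-search-data')  # hypothetical default
RUN_METADATA_DIR = os.environ.get('HAIL_SEARCH_RUN_METADATA_DIR') or HAIL_SEARCH_DATA_DIR
```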
Ok... I really don't love doubling down on weird behavior in our system to avoid a couple of hours of work (especially if we're going to do it anyway in the future).
Of the options presented, I think defining HAIL_SEARCH_DATA_DIR differently in hail-search and seqr makes the most sense.
I agree that not doubling down on weird behavior to save a few hours of work is a good engineering goal, but I also think avoiding scope creep is an important engineering goal. At the start of this PR, seqr was reading the metadata from GCP, and the scope of this work said nothing about changing where the data was read from. While it may be "only a couple of hours of work" to change the open source helm chart, the tgg helm chart, airflow/pipeline (to change how run directories are copied), and this spot of the code, those hours would have to be spread out over a few days of iterative discussion and PR review. In the meantime this PR stays open and I accrue merge conflicts.
When, in the course of building something, we discover a better way of doing it that requires more work, I think it is better for team velocity to ship a reasonable engineering solution that keeps the work within the original scope, then create tickets to track the new work and prioritize them as part of our normal planning process. In this case, I also think we would want an engineering meeting, or at least a design plan, before changing what data the seqr pod does and does not have access to, as I am really not comfortable with the idea of mounting the full hail search data disk to the seqr pod.
talked offline, on the same page now!
what did y'all decide to do here?
We're gonna go with defining HAIL_SEARCH_DATA_DIR differently in the seqr and hail-search environments. I made a ticket to re-think the delivery mechanism between the pipeline and seqr.
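Under the agreed approach, the same variable name carries a gs:// path in the seqr environment and a local mount path in hail-search. A sketch of how seqr-side code could distinguish the two; the helper name and example values are hypothetical:

```python
def is_gcs_path(path):
    # seqr reads run metadata directly from GCP when the configured
    # directory is a bucket path rather than a local volume mount
    return path is not None and path.startswith('gs://')

# Example values mirroring the two deployments (illustrative only)
seqr_dir = 'gs://seqr-hail-search-data/v3.1'
hail_backend_dir = '/hail-search-data'
```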
Note: should not clear cache if no projects updated!
@@ -156,6 +156,7 @@
 MEDIA_URL = '/media/'

 LOADING_DATASETS_DIR = os.environ.get('LOADING_DATASETS_DIR')
+HAIL_SEARCH_DATA_DIR = os.environ.get('HAIL_SEARCH_DATA_DIR')
👍 Do we want to have defaults for these here, or just have them as defaults in seqr-helm?
let's just have them defaulted in seqr-helm
👍