Script to create hdf trigger merge files from PyCBC Live output trigger files #4697

Merged
merged 9 commits into from
Jul 12, 2024

Conversation

@ArthurTolley ArthurTolley commented Apr 15, 2024

This script converts a large number of PyCBC Live trigger output files into a single HDF trigger merge file.

Standard information about the request

This is a: new feature

This change affects: the live search

This change changes: scientific output

Motivation

Using template fits in the PyCBC Live search (#4527) requires the fits to be made separately, before they are used. One of the inputs needed to make them is an HDF trigger merge file. This code produces that trigger merge file.

Contents

The PyCBC Live search outputs a trigger file every stride (8 s for Live, 1 s for Early Warning) containing any triggers found within the previous stride. This code collates those trigger files into a single HDF file. The trigger files can be specified in any of the following ways:

  • from a list of trigger files
  • from a directory containing subdirectories that contain trigger files (typically one subdirectory for each day)
  • from a start and end date
  • from a start date and a number of days
  • from a GPS start and end time
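The date-based options amount to listing one subdirectory per day. A minimal sketch of that step, assuming day subdirectories named `YYYY_MM_DD` and an inclusive end date (both are assumptions made for illustration, and `day_directories` is a hypothetical helper, not a function in the script):

```python
from datetime import date, timedelta
from pathlib import Path


def day_directories(trigger_dir, start, end=None, num_days=None):
    """Return one candidate subdirectory per day, in date order.

    Hypothetical helper: the YYYY_MM_DD layout and the inclusive
    end date are assumptions, not taken from the script.
    """
    if (end is None) == (num_days is None):
        raise ValueError("give exactly one of end or num_days")
    if num_days is None:
        # Inclusive range: start and end on the same day means one day
        num_days = (end - start).days + 1
    return [
        Path(trigger_dir) / (start + timedelta(days=i)).strftime("%Y_%m_%d")
        for i in range(num_days)
    ]


# Two days starting on 2024-04-11, as in the tests described further down
dirs = day_directories("/path/to/triggers", date(2024, 4, 11), num_days=2)
```

Each returned directory would then be searched for trigger files. Whether a missing day should be an error or be skipped is left open here.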

The template fit creation code expects the triggers to be in the offline trigger file format, so the Live triggers need to be converted to match it, creating new datasets where needed and ensuring that existing datasets follow the offline conventions. For example, PyCBC Live stores chisq as reduced chisq whereas the offline search does not, so the Live values are converted back to the offline convention.
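As a rough illustration of the chisq conversion, the sketch below multiplies each reduced value by its number of degrees of freedom. It assumes the reduced chisq is simply the raw chisq divided by `dof`, and that `dof` is already known per trigger. How `dof` relates to the stored `chisq_dof` dataset in each format is not covered here, and `unreduce_chisq` is a hypothetical name.

```python
def unreduce_chisq(reduced_chisq, dof):
    """Undo the division by the degrees of freedom, trigger by trigger.

    Assumption: reduced = raw / dof. The mapping between dof and the
    stored chisq_dof dataset is format-specific and not handled here.
    """
    if len(reduced_chisq) != len(dof):
        raise ValueError("reduced_chisq and dof must have the same length")
    return [r * d for r, d in zip(reduced_chisq, dof)]


raw = unreduce_chisq([1.0, 2.5], [10, 4])
```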

The triggers are also sorted by template_id, and region references are created from the template_id boundaries to allow rapid access by downstream code.
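The sorting step can be sketched with plain Python lists. This is only the index arithmetic, under the assumption that every dataset has one entry per trigger. The function names are hypothetical, and the HDF5 region references themselves are not created here.

```python
from bisect import bisect_left, bisect_right


def sort_by_template(datasets):
    """Reorder every per-trigger list so that template_id is non-decreasing."""
    template_id = datasets["template_id"]
    # sorted() is stable, so triggers within a template keep their order
    order = sorted(range(len(template_id)), key=template_id.__getitem__)
    return {name: [values[i] for i in order] for name, values in datasets.items()}


def template_slice(sorted_ids, template_id):
    """Slice covering all triggers of one template in the sorted arrays."""
    return slice(bisect_left(sorted_ids, template_id),
                 bisect_right(sorted_ids, template_id))


trigs = sort_by_template({
    "template_id": [2, 0, 2, 1],
    "snr": [5.5, 6.1, 7.2, 4.9],
})
snr_for_template_2 = trigs["snr"][template_slice(trigs["template_id"], 2)]
```

With NumPy arrays the same two steps are usually written with `numpy.argsort` and `numpy.searchsorted`.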

Links to any issues or associated PRs

Testing performed

I have taken two days' worth of PyCBC Live triggers, with H1, L1 and V1 triggers, and run the following commands to test:

From a file containing a list of trigger files:

python ../pycbc_live_collate_triggers \
    --trigger-file-method file \
    --list-of-trigger-files test_file_list.txt \
    --ifos H1 L1 V1 \
    --output-trigger-file-list test_file_list_collated.txt \
    --output-dir ./ \
    --output-file-name test_collated_triggers.hdf

From a directory containing subdirectories containing trigger files:

python ../pycbc_live_collate_triggers \
     --trigger-file-method dir \
     --trigger-dir /home/arthur.tolley/PyCBC_changes/collate_triggers_script/pycbc/bin/live/testing \
     --ifos H1 L1 V1 \
     --output-trigger-file-list test_file_list_collated_dir.txt \
     --output-dir ./ \
     --output-file-name test_collated_triggers_dir.hdf

From a start and end date:

python ../pycbc_live_collate_triggers \
     --trigger-file-method start-end-date \
     --trigger-dir /home/arthur.tolley/PyCBC_changes/collate_triggers_script/pycbc/bin/live/testing \
     --start-date 2024-04-11 \
     --end-date 2024-04-12 \
     --ifos H1 L1 V1 \
     --output-trigger-file-list test_file_list_collated_start-end.txt \
     --output-dir ./ \
     --output-file-name test_collated_triggers_start-end.hdf

From a start date and a number of days:

python ../pycbc_live_collate_triggers \
     --trigger-file-method start-num-days \
     --trigger-dir /home/arthur.tolley/PyCBC_changes/collate_triggers_script/pycbc/bin/live/testing \
     --start-date 2024-04-11 \
     --num-days 2 \
     --ifos H1 L1 V1 \
     --output-trigger-file-list test_file_list_collated_start-num.txt \
     --output-dir ./ \
     --output-file-name test_collated_triggers_start-num.hdf

From a start and end gps time:

python ../pycbc_live_collate_triggers \
     --trigger-file-method gps-start-end-time \
     --trigger-dir /home/arthur.tolley/PyCBC_changes/collate_triggers_script/pycbc/bin/live/testing \
     --gps-start-time 1396828816 \
     --gps-end-time 1396844864 \
     --ifos H1 L1 V1 \
     --output-trigger-file-list test_file_list_collated_gps.txt \
     --output-dir ./ \
     --output-file-name test_collated_triggers_gps.hdf
  • The author of this pull request confirms they will adhere to the code of conduct


@GarethCabournDavies GarethCabournDavies left a comment


A few points that I'm not sure of still need discussion (e.g. do we need template_hash sorting in this?), but this is broadly right.

I think the file-discovery arguments should be reworked and simplified, though. Something like #4354 should work, so I'll put in the work to have that pulled in here.


GarethCabournDavies commented May 2, 2024

We are using this for inputs to fit_by_template and bin_trigger_rates_dq, which both take stat-threshold as an input and discard anything below it. Does it then make sense to implement the same here? That could save:

  • storage
  • computational effort in the sorting stage

@GarethCabournDavies

In testing, this code is quite slow, so I thought about what its output is used for later: pycbc_fit_sngls_by_template and pycbc_bin_trigger_rates_dq at the moment.

They both apply cuts straight away. As a result, I think adding cuts here would be good, and we could drastically reduce:

  1. how long this takes to run
  2. how much space the files take up

This should mean that we can reuse a lot of the file-reading code from pycbc_live_single_significance_fits, which can be put into a module.
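The proposed cut can be sketched as a filter applied to every per-trigger dataset before sorting. The helper name, the precomputed `stat` list, and the use of `>=` at the threshold are assumptions for illustration, not the script's actual interface.

```python
def apply_threshold(datasets, stat, threshold):
    """Keep only triggers whose statistic is at or above the threshold.

    Hypothetical helper: stat holds one value per trigger, computed elsewhere.
    """
    keep = [i for i, value in enumerate(stat) if value >= threshold]
    return {name: [values[i] for i in keep] for name, values in datasets.items()}


trigs = {"template_id": [1, 0, 1], "snr": [4.0, 6.0, 8.0]}
# Using snr itself as the statistic, purely for the example
kept = apply_threshold(trigs, trigs["snr"], 6.0)
```

Cutting before the sort means the sort and the output file only ever see the surviving triggers, which is where the time and storage savings would come from.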

@GarethCabournDavies

I've just found a bug, but I'm not sure how best to fix it: when a template has no triggers, the boundaries are not set properly, so too few region references are made. As a result, the region-reference datasets have different lengths in different detectors:

/H1/chisq_dof_template   Dataset {729910}
/H1/chisq_template       Dataset {729910}
/H1/coa_phase_template   Dataset {729910}
/H1/end_time_template    Dataset {729910}
/H1/sg_chisq_template    Dataset {729910}
/H1/sigmasq_template     Dataset {729910}
/H1/snr_template         Dataset {729910}
/H1/template_duration_template Dataset {729910}
/L1/chisq_dof_template   Dataset {729913}
/L1/chisq_template       Dataset {729913}
/L1/coa_phase_template   Dataset {729913}
/L1/end_time_template    Dataset {729913}
/L1/sg_chisq_template    Dataset {729913}
/L1/sigmasq_template     Dataset {729913}
/L1/snr_template         Dataset {729913}
/L1/template_duration_template Dataset {729913}

I think I can find the bug / propose a fix fairly soon
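One way to guarantee one boundary pair per template, including templates with no triggers, is to loop over the template IDs in the bank rather than over the IDs that happen to appear in the triggers. A sketch under that assumption (not necessarily the fix adopted in this PR; `template_boundaries` is a hypothetical name, and template IDs are assumed to run from 0 to `n_templates - 1`):

```python
from bisect import bisect_left, bisect_right


def template_boundaries(sorted_ids, n_templates):
    """Return a (start, end) pair for every template in the bank.

    sorted_ids must already be sorted. A template without triggers
    gets an empty range (start == end) instead of being skipped.
    """
    return [
        (bisect_left(sorted_ids, tid), bisect_right(sorted_ids, tid))
        for tid in range(n_templates)
    ]


# Templates 1 and 3 have no triggers but still get an entry
bounds = template_boundaries([0, 0, 2], n_templates=4)
```

Because the output length depends only on `n_templates`, every detector would end up with the same number of entries regardless of which templates triggered.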

@ArthurTolley

@titodalcanton Gareth has added all the trigger finding stuff (and resolved a lot of the comments you had before).
Sorry for taking a while to get to those.
I think this is now at a stage where it could be merged.
We just need to profile and fully test this version.


GarethCabournDavies commented Jul 8, 2024

Profile:
image

time -v output:

	User time (seconds): 1245.77
	System time (seconds): 12.20
	Percent of CPU this job got: 98%
	Elapsed (wall clock) time (h:mm:ss or m:ss): 21:19.87
	Maximum resident set size (kbytes): 1026320
	Minor (reclaiming a frame) page faults: 502261
	Voluntary context switches: 106347
	Involuntary context switches: 2862
	File system inputs: 6165104
	File system outputs: 7937024

So approximately 20 minutes of run time and 1 GB of memory usage, and it looks like most of the time is spent in I/O.
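For reference, the summary figures follow directly from the `time -v` output above. The variable names are just labels for this check.

```python
# Values copied from the time -v output above
user_s, system_s = 1245.77, 12.20
elapsed_s = 21 * 60 + 19.87   # 21:19.87 wall clock
max_rss_kb = 1026320

# CPU share of the wall-clock time, and peak memory in GiB
cpu_percent = 100 * (user_s + system_s) / elapsed_s
max_rss_gib = max_rss_kb / 1024 ** 2

print(f"{elapsed_s / 60:.1f} min, {cpu_percent:.1f}% CPU, {max_rss_gib:.2f} GiB")
```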

Link to the profile image for zooming in, as the inline image seems to have disappeared: https://ldas-jobs.ligo.caltech.edu/~gareth.cabourndavies/pycbclive/collate_trigs/testing/collate_profile.png

Comment on lines 70 to 77
if args.output_trigger_file_list:
    logging.info(
        'Writing list of trigger files to %s ',
        args.output_trigger_file_list
    )
    with open(args.output_trigger_file_list, 'w') as f:
        for item in trigger_files:
            f.write("%s\n" % item)

@ArthurTolley is this still wanted or is it for testing only?


It was good for testing, because the file system can spend a lot of time looking for the files themselves, depending on how many directories are being searched, but I don't think we need it anymore.

n_triggers_cut[ifo],
)

with h5py.File(args.bank_file,'r') as bank_file:

Should this be moved to the start, so that you can avoid closing and reopening the output file?


Looks like it should

@GarethCabournDavies GarethCabournDavies self-requested a review July 9, 2024 09:32
@GarethCabournDavies GarethCabournDavies dismissed their stale review July 9, 2024 09:34

Changes implemented, but I have now contributed enough directly that I wouldn't feel comfortable acting as a reviewer.

@GarethCabournDavies

poke @titodalcanton on approval/final review for this


@titodalcanton titodalcanton left a comment


Looks good.

@titodalcanton titodalcanton merged commit e033ad8 into gwastro:master Jul 12, 2024
32 of 33 checks passed
yi-fan-wang pushed a commit to yi-fan-wang/pycbc that referenced this pull request Jul 15, 2024
…er files (gwastro#4697)

* combined trigger file list and collate triggers, untested

* working state

* removing print statement

* Cleaning up argparse and unused code

* Trigger file checking and ifo permutation

* Rework trigger file finding in pycbc_live_single_trigger_fits

* enumerate + continue indentation

* implement minor changes

---------

Co-authored-by: GarethCabournDavies <[email protected]>
prayush pushed a commit to prayush/pycbc that referenced this pull request Nov 21, 2024