Script to create hdf trigger merge files from PyCBC Live output trigger files #4697
Conversation
There should be discussion of certain points I'm not sure of (e.g. do we need template_hash sorting in this?), but this is broadly right.
I think the arguments regarding the file discovery should be reworked/simplified though. I think something like #4354 should work, so I'll put in the work for that to be pulled in here.
We are using this for inputs to fit_by_template and bin_trigger_rates_dq, which both take stat-threshold as input and discard anything below it - does it then make sense to implement the same here? That could save:
During testing, this code is quite slow, so I thought about the later uses; they both apply cuts straight away. As a result, I think adding cuts would be good here and we can drastically reduce
This should mean that we can reuse a lot of the file-reading code from
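A minimal sketch of what such a stat-threshold cut could look like at read time, assuming the triggers are held as a dict of aligned numpy arrays (the dataset names here are hypothetical, not taken from the script):

```python
import numpy as np

def apply_stat_threshold(triggers, stat, threshold):
    """Keep only triggers whose ranking statistic is at or above threshold.

    triggers: dict of dataset name -> numpy array, all the same length
    stat: numpy array of the ranking statistic, aligned with the datasets
    """
    keep = stat >= threshold
    return {name: values[keep] for name, values in triggers.items()}

# Hypothetical example data: four triggers, two pass the cut
triggers = {
    "end_time": np.array([1.0, 2.0, 3.0, 4.0]),
    "template_id": np.array([0, 1, 1, 2]),
}
stat = np.array([5.0, 7.5, 6.0, 8.0])
cut = apply_stat_threshold(triggers, stat, 6.5)
```

Applying the cut per input file, before concatenation, is what would drastically reduce memory use and output size.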
I've just found a bug, but I'm not sure how best to fix it: in the case that there are no triggers from a given template, the boundaries are not being set properly, so not enough region references are being made.
I think I can find the bug and propose a fix fairly soon.
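For illustration, one way to get correct boundaries even for templates with no triggers is numpy's searchsorted, which yields an empty region (start == end) for such templates; the array contents here are hypothetical:

```python
import numpy as np

# Trigger template ids, already sorted; note template 1 has no triggers
template_ids = np.array([0, 0, 2, 2, 2, 3])
n_templates = 4

# searchsorted gives one (start, end) pair per template in the bank,
# so an empty template simply gets start == end
starts = np.searchsorted(template_ids, np.arange(n_templates), side="left")
ends = np.searchsorted(template_ids, np.arange(n_templates), side="right")
```

Looping over the unique ids present in the triggers, by contrast, would skip empty templates and produce too few region references.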
DRAFT: changes to trigger collation script
@titodalcanton Gareth has added all the trigger-finding stuff (and resolved a lot of the comments you had before).
time -v output:
So approximately 20 minutes and 1 GB of memory usage, and it looks like most of the time is spent in I/O. Link, as the image seems to have disappeared for zooming in: https://ldas-jobs.ligo.caltech.edu/~gareth.cabourndavies/pycbclive/collate_trigs/testing/collate_profile.png
bin/live/pycbc_live_collate_triggers
if args.output_trigger_file_list:
    logging.info(
        'Writing list of trigger files to %s',
        args.output_trigger_file_list
    )
    with open(args.output_trigger_file_list, 'w') as f:
        for item in trigger_files:
            f.write("%s\n" % item)
@ArthurTolley is this still wanted or is it for testing only?
It was good for testing because the file system does spend a lot of time looking for the files themselves, depending on how many directories are being searched, but I don't think we need it anymore.
bin/live/pycbc_live_collate_triggers
        n_triggers_cut[ifo],
    )

with h5py.File(args.bank_file, 'r') as bank_file:
Should this be moved to the start, then you can avoid closing/reopening the output file?
Looks like it should
Changes implemented, but I have now contributed enough directly that I wouldn't feel comfortable as a reviewer.
Poke @titodalcanton for approval/final review on this.
Looks good.
…er files (gwastro#4697)
* combined trigger file list and collate triggers, untested
* working state
* removing print statement
* Cleaning up argparse and unused code
* Trigger file checking and ifo permutation
* Rework trigger file finding in pycbc_live_single_trigger_fits
* enumerate + continue indentation
* implement minor changes
Co-authored-by: GarethCabournDavies <[email protected]>
This script is used to convert a large number of PyCBC Live trigger output files into an hdf trigger merge file.
Standard information about the request
This is a: new feature
This change affects: the live search
This change changes: scientific output
Motivation
Using template fits in the PyCBC Live search (#4527) requires the template fits to be made separately from using them. One of the requirements for making these files is an hdf trigger merge. This code produces that trigger merge file.
Contents
The PyCBC Live search outputs a trigger file every stride (8 s for Live, 1 s for Early Warning) containing any triggers found within the previous stride. This code accepts several options for locating these trigger files and then collates them into a single hdf file:
- from a list of trigger files,
- from a directory containing subdirectories of trigger files (typically one subdirectory per day),
- from a start and end date,
- from a start date and a number of days,
- from a gps start and end time.
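As an illustration of the per-day discovery mode, here is a minimal sketch; the subdirectory naming (YYYY_MM_DD) and file pattern are hypothetical and may not match the real script's layout:

```python
import glob
import os
import tempfile
from datetime import date, timedelta

def find_trigger_files(top_dir, start, n_days, pattern="*.hdf"):
    """Collect trigger files from one subdirectory per day.

    Hypothetically assumes subdirectories named YYYY_MM_DD under top_dir;
    the real script's layout and naming may differ.
    """
    files = []
    for i in range(n_days):
        day = start + timedelta(days=i)
        subdir = os.path.join(top_dir, day.strftime("%Y_%m_%d"))
        files.extend(sorted(glob.glob(os.path.join(subdir, pattern))))
    return files

# Demonstrate on a throwaway directory tree with one file per day
top = tempfile.mkdtemp()
for name in ("2023_01_01", "2023_01_02"):
    os.makedirs(os.path.join(top, name))
    open(os.path.join(top, name, "triggers.hdf"), "w").close()

found = find_trigger_files(top, date(2023, 1, 1), 2)
```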
The template fit creation code expects the triggers to be in a certain format (the offline trigger file format), so the Live triggers need to be converted to match it, creating new datasets where needed and ensuring other datasets are correct. For example, PyCBC Live stores chisq as a reduced chisq whereas the offline search does not, so we convert these PyCBC Live triggers to the offline format.
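For illustration, the reduced-to-unreduced chisq conversion could look like the sketch below; the degrees-of-freedom convention (2 * chisq_dof - 2) is an assumption borrowed from the offline power chisq and should be checked against the actual codes:

```python
import numpy as np

def unreduce_chisq(reduced_chisq, chisq_dof):
    """Convert a reduced chi-squared back to the un-reduced value.

    Assumes (hypothetically) that the number of degrees of freedom is
    2 * chisq_dof - 2, as in the offline power chisq convention.
    """
    return reduced_chisq * (2 * chisq_dof - 2)

# Example: reduced chisq values with 10 chisq bins -> 18 degrees of freedom
chisq = unreduce_chisq(np.array([1.0, 2.0]), 10)
```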
The triggers are also sorted by template_id, and region references are created using the template_id boundaries to allow rapid access in later codes.
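A minimal sketch of the sort-and-reference step, assuming numpy and h5py; the dataset names (H1/snr, H1/template_boundaries) are hypothetical and the real merge file layout may differ:

```python
import os
import tempfile

import h5py
import numpy as np

# Triggers already sorted by template_id; boundaries found with
# searchsorted so templates with no triggers get an empty region
template_ids = np.array([0, 0, 1, 2, 2])
snr = np.array([5.0, 6.0, 7.0, 8.0, 9.0])
n_templates = 3

path = os.path.join(tempfile.mkdtemp(), "merge_sketch.hdf")
with h5py.File(path, "w") as f:
    dset = f.create_dataset("H1/snr", data=snr)
    starts = np.searchsorted(template_ids, np.arange(n_templates), side="left")
    ends = np.searchsorted(template_ids, np.arange(n_templates), side="right")
    # One region reference per template, pointing at its slice of the dataset
    refs = [dset.regionref[s:e] for s, e in zip(starts, ends)]
    f.create_dataset(
        "H1/template_boundaries",
        data=refs,
        dtype=h5py.special_dtype(ref=h5py.RegionReference),
    )

# Reading back one template's triggers via its region reference
with h5py.File(path, "r") as f:
    ref = f["H1/template_boundaries"][2]
    template_2_snr = f["H1/snr"][ref]
```

Dereferencing a stored region reference selects only that template's slice, so later codes can read one template's triggers without loading the whole dataset.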
Links to any issues or associated PRs
Testing performed
I have taken 2 days' worth of PyCBC Live triggers with H1, L1 & V1 triggers and run the following scripts to test:
From a file containing a list of trigger files:
From a directory containing subdirectories containing trigger files:
From a start and end date:
From a start date and a number of days:
From a start and end gps time: