
Export source parameters in a RAM aligned format #90

Open
HannoSpreeuw opened this issue Oct 31, 2024 · 7 comments
HannoSpreeuw commented Oct 31, 2024

This issue is also touched on here.

Currently, source parameters are returned from a call to the extract method of an image.ImageData instantiation as a sourcefinder.utility.containers.ExtractionResults instance, i.e. a list of extract.Detection instances. This is not a RAM aligned format.

The main problem with that way of collecting source parameters, however, is that the new, fast and vectorized way of computing them, a call to the extract.source_measurements_pixels_and_celestial_vectorised method, returns a number of Numpy ndarrays. Ideally one would concatenate the relevant columns of those ndarrays into a single new array, possibly in a format with headers such as a Pandas dataframe or an xarray.
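To sketch the idea (the column names, shapes and values below are purely illustrative, not PySE's actual output):

```python
import numpy as np
import pandas as pd

# Hypothetical per-parameter ndarrays, standing in for the columns
# returned by the vectorised measurement routine.
n_sources = 4
rng = np.random.default_rng(0)
ra = rng.uniform(0.0, 360.0, n_sources)
dec = rng.uniform(-90.0, 90.0, n_sources)
peak_flux = rng.uniform(0.01, 1.0, n_sources)

# Concatenate the columns into a single table with headers.
sources = pd.DataFrame({"ra": ra, "dec": dec, "peak_flux": peak_flux})
print(sources.shape)  # (4, 3)
```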

Implementing this has three consequences:

  1. A number of PySE unit tests will need to be adapted to accept the new way of collecting source parameters. Currently, when the VECTORIZED approach is used, the relevant source parameters are reconverted to the old format of a sourcefinder.utility.containers.ExtractionResults instance of a list of extract.Detection instances, in order to pass the unit tests. This is unfortunate.

  2. When the VECTORIZED approach is not used the old output format is still used. We should update that to the new format once the unit tests have been adapted, i.e. once we have completed 1.

  3. TraP will need to accept this new format. So this is also a TraP issue.

Implementing this feature should also fix #56

@HannoSpreeuw (Collaborator)
Echoing a question from @suvayu :

So let's say we put all the source parameters in one 2D array, with one row per source, does the order of the columns matter to you?
Just to be clear, each column would correspond to one value from extract.serialize().
That call yields 18 values, so you'd need 18 columns.

Or would you switch rows and columns?

Any preference wrt the type of output array? Numpy ndarray, xarray, Pandas dataframe....
Something with column headers would be less error prone, I reckon. So that would rule out a Numpy ndarray.

@tmillenaar you mentioned that you converted the extract.serialize() output to a Pandas dataframe for TraP internally. What does that Pandas dataframe look like?
Could you share a link to the TraP code for the initialisation of that Pandas dataframe?

@tmillenaar (Contributor)
@HannoSpreeuw I tried to find you on Tuesday but unfortunately missed you, so let's discuss here.

I do the serialize followed by a pandas DataFrame here:
https://git.astron.nl/RD/trap/-/blob/21c0448affc086dc9c472c53fbc39309efd852ef/trap/source_extraction.py#L166

Just to be clear, consider this function more of a proof of concept. I am not even convinced I will stick with pandas in the long run. At the moment I am leaning towards polars, but I have yet to spend some time figuring out what approach is most suitable. In terms of output from pyse I would prefer just numpy. That leaves room for options and does not force the use of a particular framework like xarray or pandas. If you do want to label the results you could even return a dictionary of 1d numpy arrays or a named tuple, but as long as the output of the function is properly documented I do not mind a 2d numpy array.

If you do decide to return as one numpy array, consider the following:
Often I will select certain columns to do an operation (say, for example, I select the coordinates of each source). Therefore I think it makes sense to align the data in memory by attribute, not by source. So in the form [ra1, ra2, ..., dec1, dec2, ..., flux1, flux2, ...] and not [ra1, dec1, flux1, ra2, dec2, flux2, ...]. Then it is more performant to extract, for example, all the peak flux values from the results.
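A minimal sketch of the two layouts, assuming 18 float64 parameters per source:

```python
import numpy as np

n_sources = 1000

# Attribute-major: shape (n_attributes, n_sources). In C order each
# row, i.e. one attribute for all sources, is contiguous in memory.
by_attribute = np.zeros((18, n_sources))
ra_contiguous = by_attribute[0]          # e.g. all ra values

# Source-major: shape (n_sources, 18). A single attribute is then a
# strided view that touches every 18th element.
by_source = np.zeros((n_sources, 18))
ra_strided = by_source[:, 0]

print(ra_contiguous.flags["C_CONTIGUOUS"])  # True
print(ra_strided.strides[0])                # 144, i.e. 18 * 8 bytes
```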

HannoSpreeuw commented Dec 4, 2024

I am hesitant towards exporting anything without headers, since it seems error prone.

Would a dict or named tuple of Numpy arrays still be optimal in the sense of being RAM aligned?

"align the data in memory by attribute, not by source" -> Interesting, but does that mean that it makes a measurable difference to your processing speed if you select a column instead of a row?

I had never heard of Polars, but their webpage looks slick, haha.

Another thing that came to mind: we now have the seventeen floats and one Boolean set in stone in extract.Detection.serialize; perhaps at some point we should offer the option to select and deselect parameters.
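A sketch of what such an opt-in selection could look like, with a hypothetical export_sources helper and illustrative column names (the real serializer yields 18 values):

```python
import numpy as np
import pandas as pd

# Illustrative subset of the 18 serialized parameters.
ALL_COLUMNS = ["ra", "dec", "peak_flux", "int_flux"]

def export_sources(arrays, parameters=None):
    """Hypothetical export helper with opt-in parameter selection."""
    table = pd.DataFrame(dict(zip(ALL_COLUMNS, arrays)))
    return table if parameters is None else table[list(parameters)]

data = [np.zeros(3) for _ in ALL_COLUMNS]
subset = export_sources(data, parameters=["ra", "dec"])
print(list(subset.columns))  # ['ra', 'dec']
```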

@tmillenaar (Contributor)
I agree that headers will make the data less error prone to work with.

"but does that mean that it makes a measurable difference for your processing speed if you select a column instead of a row" -> I decided to check. It hardly matters. Selecting the first consecutive 1/18th of an array or selecting every 18th element does not differ much in terms of performance, especially since we only have on the order of hundreds or thousands of sources. The difference is insignificant.

I tend to work with the data as if it is a collection of 18 1D arrays, one array for each attribute. That way you can also have different data types per attribute. I realize now that your implementation is fundamentally on a per-source basis, since each source has its own Detection class instance. So regardless we will have to loop over all detection objects in Python.
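A sketch of that dict-of-arrays layout, with illustrative attribute names and dtypes:

```python
import numpy as np

# Hypothetical per-attribute arrays; each attribute keeps its own
# dtype, so the boolean flag is not cast to float.
n_sources = 5
results = {
    "ra": np.zeros(n_sources, dtype=np.float64),
    "dec": np.zeros(n_sources, dtype=np.float64),
    "fit_ok": np.ones(n_sources, dtype=bool),
}
print(results["fit_ok"].dtype)  # bool
```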

To demonstrate how I work with the data: a nice example of the difference in approach can be found in the following function.

def distance_from(self, x, y):
    """Distance from center"""
    return ((self.x - x) ** 2 + (self.y - y) ** 2) ** 0.5

Here you do the calculation for one point. If I want this for all sources, I would have to do [s.distance_from(x, y) for s in sources]. What I tend to do instead is take an array source_x of length nr_sources and an array source_y of the same length. Then the equation would look like ((source_x - x) ** 2 + (source_y - y) ** 2) ** 0.5, which would return a numpy array with the distances. The looping is then done in numpy and not in Python.
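The loop-free version described above, sketched with np.hypot and made-up coordinates:

```python
import numpy as np

# Made-up source coordinates; in PySE these would come from the
# exported arrays rather than from Detection instances.
source_x = np.array([0.0, 3.0, 6.0])
source_y = np.array([0.0, 4.0, 8.0])

def distances_from(x, y):
    """Distance of every source from (x, y), without a Python loop."""
    return np.hypot(source_x - x, source_y - y)

print(distances_from(0.0, 0.0))  # [ 0.  5. 10.]
```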

@HannoSpreeuw (Collaborator)
I realize now that your implementation is fundamentally on a per-source basis since each source has it's own detection class instance.

Well, not if the user chooses the VECTORIZED approach - which precludes both force_beam and deblending.

In any case, we will export the same uniform, memory aligned format whether VECTORIZED is chosen or not.

Thanks for the example.
It is clear that with the new export format, all of your loops should be trivial to vectorize.
The extract.Detection.distance_from method will no longer work, unless you reconvert to the current format. Anyway, the absence of that method in the new format should not be much of an issue, since it is such a simple function to write yourself.

What we should probably add to the image.ImageData class is a reconversion routine, from the new, memory aligned format to the current way of recording source parameters, i.e. to a sourcefinder.utility.containers.ExtractionResults instance of a list of extract.Detection instances.
That reconversion method is more or less there, see lines 1154-1172 of image.py, but not yet as a separate function.
I will add that function.

For all the unit tests that extract sources, it would suffice to add a reconvert=True keyword argument to the ImageData.extract call, such that computed source parameters, if not yet in the current format, will be reconverted from the new format to the current format. You could also pursue a more rigorous approach, such that this reconversion is not necessary, but that would require substantial redesigning of the unit tests that extract sources.

The opposite, reconvert=False, should be the default, and any source parameters not yet collected in the new, memory aligned format - i.e. when force_beam==True or deblend_nthresh>0 has been chosen - should be converted using a new function that will probably have to do a little more than
sources = pd.DataFrame(extraction_results, columns=PYSE_OUT_COLUMNS).
I will write that function, too.
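A sketch of such a conversion function, using illustrative column names in place of the real PYSE_OUT_COLUMNS and plain tuples in place of serialized Detection rows:

```python
import numpy as np
import pandas as pd

# Illustrative column names standing in for the real PYSE_OUT_COLUMNS.
PYSE_OUT_COLUMNS = ["ra", "dec", "peak_flux"]

def extraction_results_to_dataframe(extraction_results):
    """Sketch: turn a list of per-source rows (e.g. the output of
    Detection.serialize()) into a labelled, column-aligned table."""
    return pd.DataFrame(
        np.asarray(extraction_results), columns=PYSE_OUT_COLUMNS
    )

old_format = [(10.0, -30.0, 0.5), (11.0, -31.0, 0.7)]
sources = extraction_results_to_dataframe(old_format)
print(sources.shape)  # (2, 3)
```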

@HannoSpreeuw (Collaborator)
Since we just want Numpy ndarrays with labels, xarray seems the simplest solution.
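A sketch of that xarray option: a single 2-D ndarray wrapped in a labelled DataArray, with illustrative parameter names:

```python
import numpy as np
import xarray as xr

# Illustrative parameter names; the real serializer yields 18 values.
params = ["ra", "dec", "peak_flux"]
values = np.zeros((4, len(params)))   # one row per source

sources = xr.DataArray(
    values,
    dims=("source", "parameter"),
    coords={"parameter": params},
)

ra = sources.sel(parameter="ra")      # label-based column selection
print(ra.shape)  # (4,)
```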

@HannoSpreeuw (Collaborator)
Alright, when I currently run PySE vectorized from master (which includes reconversion to the "old" format), I break three unit tests:

FAILED test/test_L15_12h_const.py::L15_12hConstCor::testFluxes - AssertionError: False is not true
FAILED test/test_image.py::TestSimpleImageSourceFind::testSingleSourceExtraction - AssertionError: 24.55954808032697 != 25.516190123153514 within 0 places (0.9566420428265445 difference)
FAILED test/test_source_measurements.py::SourceParameters::testAllParameters - AssertionError: False is not true

When I don't reconvert to the old format, I will break another five unit tests, so eight in total:

FAILED test/test_L15_12h_const.py::L15_12hConstObs::testNumSources - AssertionError: 0 != 1
FAILED test/test_L15_12h_const.py::L15_12hConstObs::testSourceProperties - AttributeError: 'bool' object has no attribute 'peak'
FAILED test/test_L15_12h_const.py::L15_12hConstCor::testNumSources - AssertionError: 0 != 5
FAILED test/test_L15_12h_const.py::L15_12hConstMod::testFluxes - IndexError: list index out of range
FAILED test/test_L15_12h_const.py::L15_12hConstMod::testNumSources - AssertionError: 0 != 1
FAILED test/test_image.py::TestSimpleImageSourceFind::testSingleSourceExtraction - AssertionError: 0 != 2
FAILED test/test_image.py::TestSimpleImageSourceFind::testWcsConversionConsistency - AssertionError: 0 != 3
FAILED test/test_source_measurements.py::SourceParameters::testAllParameters - AssertionError: 0 != 3969

Since there are only eight unit tests to fix, we can pursue a more rigorous approach, as mentioned above, and these could be our next steps:
