
Export source parameters in a RAM aligned format #90

Open
HannoSpreeuw opened this issue Oct 31, 2024 · 7 comments
HannoSpreeuw commented Oct 31, 2024

This issue is also touched on here.

Currently, source parameters are returned from a call to the extract method of an image.ImageData instantiation as a sourcefinder.utility.containers.ExtractionResults instance, i.e. a list of extract.Detection instances. This is not a RAM aligned format.

The main problem with that way of collecting source parameters, however, is that the new, fast and vectorized way of computing them, a call to the extract.source_measurements_pixels_and_celestial_vectorised method, returns a number of Numpy ndarrays. Ideally one would concatenate the relevant columns of those ndarrays into a single new array, possibly in a format with headers such as a Pandas dataframe or an xarray.
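To sketch the idea (the column names, shapes and values below are purely illustrative, not PySE's actual output):

```python
import numpy as np
import pandas as pd

# Hypothetical per-parameter ndarrays, standing in for the columns
# returned by the vectorised measurement routine.
n_sources = 4
rng = np.random.default_rng(0)
ra = rng.uniform(0.0, 360.0, n_sources)
dec = rng.uniform(-90.0, 90.0, n_sources)
peak_flux = rng.uniform(0.01, 1.0, n_sources)

# Concatenate the columns into a single table with headers.
sources = pd.DataFrame({"ra": ra, "dec": dec, "peak_flux": peak_flux})
print(sources.shape)  # (4, 3)
```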

Implementing this has three consequences:

  1. A number of PySE unit tests will need to be adapted to accept the new way of collecting source parameters. Currently, when the VECTORIZED approach is used, the relevant source parameters are reconverted to the old format of a sourcefinder.utility.containers.ExtractionResults instance of a list of extract.Detection instances, in order to pass the unit tests. This is unfortunate.

  2. When the VECTORIZED approach is not used the old output format is still used. We should update that to the new format once the unit tests have been adapted, i.e. once we have completed 1.

  3. TraP will need to accept this new format. So this is also a TraP issue.

Implementing this feature should also fix #56

@HannoSpreeuw (Collaborator)
Echoing a question from @suvayu :

So let's say we put all the source parameters in one 2D array, with one row per source, does the order of the columns matter to you?
Just to be clear, each column would correspond to one value from extract.serialize().
That call yields 18 values, so you'd need 18 columns.

Or would you switch rows and columns?

Any preference wrt the type of output array? Numpy ndarray, xarray, Pandas dataframe....
Something with column headers would be less error prone, I reckon. So that would rule out a Numpy ndarray.

@tmillenaar you mentioned that you converted the extract.serialize() output to a Pandas dataframe for TraP internally. What does that Pandas dataframe look like?
Could you share a link to the TraP code for the initialisation of that Pandas dataframe?

@tmillenaar (Contributor)
@HannoSpreeuw I tried to find you on Tuesday but unfortunately missed you, so let's discuss here.

I do the serialize followed by a pandas DataFrame here:
https://git.astron.nl/RD/trap/-/blob/21c0448affc086dc9c472c53fbc39309efd852ef/trap/source_extraction.py#L166

Just to be clear, consider this function more of a proof of concept. I am not even convinced I will stick with pandas in the long run. At the moment I am leaning towards polars, but I have yet to spend some time figuring out what approach is most suitable. In terms of output from pyse I would prefer just numpy. That leaves room for options and does not force the use of a particular framework like xarray or pandas. If you do want to label the results you could even return a dictionary of 1d numpy arrays or a named tuple, but as long as the output of the function is properly documented I do not mind a 2d numpy array.

If you do decide to return as one numpy array, consider the following:
Often I will select certain columns to do an operation (say, for example, I select the coordinates of each source). Therefore I think it makes sense to align the data in memory by attribute, not by source. So in the form [ra1, ra2, ..., dec1, dec2, ..., flux1, flux2, ...] and not [ra1, dec1, flux1, ra2, dec2, flux2, ...]. Then it is more performant to extract, for example, all the peak flux values from the results.
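A minimal sketch of the two layouts, assuming 18 float64 parameters per source:

```python
import numpy as np

n_sources = 1000

# Attribute-major: shape (n_attributes, n_sources). In C order each
# row, i.e. one attribute for all sources, is contiguous in memory.
by_attribute = np.zeros((18, n_sources))
ra_contiguous = by_attribute[0]          # e.g. all ra values

# Source-major: shape (n_sources, 18). A single attribute is then a
# strided view that touches every 18th element.
by_source = np.zeros((n_sources, 18))
ra_strided = by_source[:, 0]

print(ra_contiguous.flags["C_CONTIGUOUS"])  # True
print(ra_strided.strides[0])                # 144, i.e. 18 * 8 bytes
```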

HannoSpreeuw commented Dec 4, 2024

I am hesitant towards exporting anything without headers, since it seems error prone.

Would a dict or named tuple of Numpy arrays still be optimal in the sense of being RAM aligned?

"align the data in memory by attribute, not by source" -> Interesting, but does that mean that it makes a measurable difference to your processing speed if you select a column instead of a row?

I had never heard of Polars, but their webpage looks slick, haha.

Another thing that came to mind: we now have the seventeen floats and one Boolean set in stone in extract.Detection.serialize; perhaps at some point we should offer the option to select and deselect parameters.
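A sketch of what such an opt-in selection could look like, with a hypothetical export_sources helper and illustrative column names (the real serializer yields 18 values):

```python
import numpy as np
import pandas as pd

# Illustrative subset of the 18 serialized parameters.
ALL_COLUMNS = ["ra", "dec", "peak_flux", "int_flux"]

def export_sources(arrays, parameters=None):
    """Hypothetical export helper with opt-in parameter selection."""
    table = pd.DataFrame(dict(zip(ALL_COLUMNS, arrays)))
    return table if parameters is None else table[list(parameters)]

data = [np.zeros(3) for _ in ALL_COLUMNS]
subset = export_sources(data, parameters=["ra", "dec"])
print(list(subset.columns))  # ['ra', 'dec']
```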

@tmillenaar (Contributor)
I agree that headers will make the data less error prone to work with.

"but does that mean that it makes a measurable difference for your processing speed if you select a column instead of a row" -> I decided to check. It hardly matters. Selecting the first consecutive 1/18th of an array or selecting every 18th element does not differ much in terms of performance, especially since we only have on the order of hundreds or thousands of sources. The difference is insignificant.

I tend to work with the data as if it is a collection of 18 1D arrays, one array for each attribute. That way you can also have different data types per attribute. I realize now that your implementation is fundamentally on a per-source basis, since each source has its own Detection class instance. So regardless we will have to loop over all detection objects in Python.
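A sketch of that dict-of-arrays layout, with illustrative attribute names and dtypes:

```python
import numpy as np

# Hypothetical per-attribute arrays; each attribute keeps its own
# dtype, so the boolean flag is not cast to float.
n_sources = 5
results = {
    "ra": np.zeros(n_sources, dtype=np.float64),
    "dec": np.zeros(n_sources, dtype=np.float64),
    "fit_ok": np.ones(n_sources, dtype=bool),
}
print(results["fit_ok"].dtype)  # bool
```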

To demonstrate how I work with the data: a nice example of the difference in approach can be found in the following function.

def distance_from(self, x, y):
    """Distance from center"""
    return ((self.x - x) ** 2 + (self.y - y) ** 2) ** 0.5

Here you do the calculation for one point. If I want this for all sources, I would have to do [s.distance_from(x, y) for s in sources]. What I tend to do instead is take an array source_x of length nr_sources and an array source_y of the same length. Then the equation would look like ((source_x - x) ** 2 + (source_y - y) ** 2) ** 0.5, which would return a numpy array with the distances. The looping is then done in numpy and not in Python.
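The loop-free version described above, sketched with np.hypot and made-up coordinates:

```python
import numpy as np

# Made-up source coordinates; in PySE these would come from the
# exported arrays rather than from Detection instances.
source_x = np.array([0.0, 3.0, 6.0])
source_y = np.array([0.0, 4.0, 8.0])

def distances_from(x, y):
    """Distance of every source from (x, y), without a Python loop."""
    return np.hypot(source_x - x, source_y - y)

print(distances_from(0.0, 0.0))  # [ 0.  5. 10.]
```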

@HannoSpreeuw (Collaborator)
I realize now that your implementation is fundamentally on a per-source basis since each source has it's own detection class instance.

Well, not if the user chooses the VECTORIZED approach - which precludes both force_beam and deblending.

In any case, we will export the same uniform, memory aligned format whether VECTORIZED is chosen or not.

Thanks for the example.
It is clear that with the new export format, all of your loops should be trivial to vectorize.
The extract.Detection.distance_from method will no longer work, unless you reconvert to the current format. Anyway, the absence of that method in the new format should not be much of an issue, since it is such a simple function to write yourself.

What we should probably add to the image.ImageData class is a reconversion routine, from the new, memory aligned format to the current way of recording source parameters, i.e. to a sourcefinder.utility.containers.ExtractionResults instance of a list of extract.Detection instances.
That reconversion method is more or less there, see lines 1154-1172 of image.py, but not yet as a separate function.
I will add that function.

For all the unit tests that extract sources, it would suffice to add a reconvert=True keyword argument to the ImageData.extract call, such that computed source parameters, if not yet in the current format, will be reconverted from the new format to the current format. You could also pursue a more rigorous approach, such that this reconversion is not necessary, but that would require substantial redesigning of the unit tests that extract sources.

The opposite, reconvert=False, should be the default, and any source parameters not yet collected in the new, memory aligned format - i.e. when force_beam==True or deblend_nthresh>0 has been chosen - should be converted using a new function that will probably have to do a little more than
sources = pd.DataFrame(extraction_results, columns=PYSE_OUT_COLUMNS).
I will write that function, too.
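A sketch of such a conversion function, using illustrative column names in place of the real PYSE_OUT_COLUMNS and plain tuples in place of serialized Detection rows:

```python
import numpy as np
import pandas as pd

# Illustrative column names standing in for the real PYSE_OUT_COLUMNS.
PYSE_OUT_COLUMNS = ["ra", "dec", "peak_flux"]

def extraction_results_to_dataframe(extraction_results):
    """Sketch: turn a list of per-source rows (e.g. the output of
    Detection.serialize()) into a labelled, column-aligned table."""
    return pd.DataFrame(
        np.asarray(extraction_results), columns=PYSE_OUT_COLUMNS
    )

old_format = [(10.0, -30.0, 0.5), (11.0, -31.0, 0.7)]
sources = extraction_results_to_dataframe(old_format)
print(sources.shape)  # (2, 3)
```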

@HannoSpreeuw (Collaborator)
Since we just want Numpy ndarrays with labels, xarray seems the simplest solution.
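A sketch of that xarray option: a single 2-D ndarray wrapped in a labelled DataArray, with illustrative parameter names:

```python
import numpy as np
import xarray as xr

# Illustrative parameter names; the real serializer yields 18 values.
params = ["ra", "dec", "peak_flux"]
values = np.zeros((4, len(params)))   # one row per source

sources = xr.DataArray(
    values,
    dims=("source", "parameter"),
    coords={"parameter": params},
)

ra = sources.sel(parameter="ra")      # label-based column selection
print(ra.shape)  # (4,)
```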

@HannoSpreeuw (Collaborator)
Alright, when I currently run PySE vectorized from master (which includes reconversion to the "old" format), I break three unit tests:

FAILED test/test_L15_12h_const.py::L15_12hConstCor::testFluxes - AssertionError: False is not true
FAILED test/test_image.py::TestSimpleImageSourceFind::testSingleSourceExtraction - AssertionError: 24.55954808032697 != 25.516190123153514 within 0 places (0.9566420428265445 difference)
FAILED test/test_source_measurements.py::SourceParameters::testAllParameters - AssertionError: False is not true

When I don't reconvert to the old format, I will break another five unit tests, so eight in total:

FAILED test/test_L15_12h_const.py::L15_12hConstObs::testNumSources - AssertionError: 0 != 1
FAILED test/test_L15_12h_const.py::L15_12hConstObs::testSourceProperties - AttributeError: 'bool' object has no attribute 'peak'
FAILED test/test_L15_12h_const.py::L15_12hConstCor::testNumSources - AssertionError: 0 != 5
FAILED test/test_L15_12h_const.py::L15_12hConstMod::testFluxes - IndexError: list index out of range
FAILED test/test_L15_12h_const.py::L15_12hConstMod::testNumSources - AssertionError: 0 != 1
FAILED test/test_image.py::TestSimpleImageSourceFind::testSingleSourceExtraction - AssertionError: 0 != 2
FAILED test/test_image.py::TestSimpleImageSourceFind::testWcsConversionConsistency - AssertionError: 0 != 3
FAILED test/test_source_measurements.py::SourceParameters::testAllParameters - AssertionError: 0 != 3969

Since there are only eight unit tests to fix, we can pursue a more rigorous approach, as mentioned above, and these could be our next steps:
