Export source parameters in a RAM aligned format #90
Echoing a question from @suvayu: so let's say we put all the source parameters in one 2D array, with one row per source, does the order of the columns matter to you? Or would you swap rows and columns? Any preference regarding the type of output array: Numpy ndarray, xarray, Pandas dataframe? @tmillenaar, you mentioned that you converted the `extract.serialize()` output to a Pandas dataframe for TraP internally. What does that Pandas dataframe look like?
@HannoSpreeuw I tried to find you on Tuesday but unfortunately missed you, so let's discuss here. I do the serialize followed by a pandas DataFrame here: Just to be clear, consider this function more of a proof of concept. I am not even convinced I will stick with pandas in the long run. At the moment I am leaning towards Polars, but I have yet to spend some time figuring out which approach is most suitable. In terms of output from PySE I would prefer just numpy. That leaves room for options and does not force the use of a particular framework like xarray or pandas. If you do want to label the results, you could even return a dictionary of 1D numpy arrays or a named tuple, but as long as the output of the function is properly documented I do not mind a 2D numpy array. If you do decide to return one numpy array, consider the following:
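To make the options above concrete, here is a minimal sketch of the three output shapes being discussed: a single 2D array, a dict of labelled 1D arrays, and a named tuple. The attribute names (`peak`, `ra`, `dec`) and the `SourceParams` type are purely illustrative, not part of PySE.

```python
from typing import NamedTuple
import numpy as np

class SourceParams(NamedTuple):
    """Hypothetical named-tuple output: one 1D array per attribute."""
    peak: np.ndarray
    ra: np.ndarray
    dec: np.ndarray

n_sources = 100
rng = np.random.default_rng(0)

# Option A: one 2D array, one row per source, one column per parameter.
as_2d = rng.random((n_sources, 3))

# Option B: a dict of labelled 1D arrays (allows per-attribute dtypes).
as_dict = {"peak": as_2d[:, 0], "ra": as_2d[:, 1], "dec": as_2d[:, 2]}

# Option C: a named tuple of 1D arrays, giving attribute access.
as_tuple = SourceParams(peak=as_2d[:, 0], ra=as_2d[:, 1], dec=as_2d[:, 2])

assert as_tuple.peak.shape == (n_sources,)
```

Options B and C are "structure of arrays" layouts: each attribute is contiguous in memory, which is what is meant here by RAM aligned per attribute.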
I am hesitant towards exporting anything without headers, since that seems error prone. A dict or named tuple of Numpy arrays, would that still be optimal in the sense of RAM aligned, i.e. "align the data in memory by attribute, not by source"? I had never heard of Polars, but their webpage looks slick, haha. Another thing that came to my mind is that we now have the seventeen floats and one Boolean set in stone in `extract.Detection.serialize`; perhaps at some point we should offer the option to select and deselect parameters.
I agree that headers will make the data less error prone to work with. "But does that mean that it makes a measurable difference for your processing speed if you select a column instead of a row?" -> I decided to check. It hardly matters. Selecting the first consecutive 1/18th of an array or selecting every 18th element does not differ much in terms of performance, especially since we only have on the order of hundreds or thousands of sources. This difference is insignificant. I tend to work with the data as if it is a collection of 18 1D arrays, one array for each attribute. That way you can also have different data types per attribute. I realize now that your implementation is fundamentally on a per-source basis, since each source has its own Detection class instance. So regardless, we will have to loop over all detection objects in Python. To demonstrate how I work with the data: a nice example of the difference in approach can be found in the following function, Line 1128 in 44c415e
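The row-versus-column point above can be illustrated with NumPy array flags: in a C-ordered 2D array a row is contiguous while a column is strided, but for hundreds or thousands of sources either selection is tiny. The shapes below are illustrative, matching the 18 parameters mentioned in the thread.

```python
import numpy as np

n_sources, n_params = 1000, 18
params = np.random.default_rng(1).random((n_sources, n_params))

row = params[0, :]   # one source: contiguous in a C-ordered array
col = params[:, 0]   # one parameter: a view with a stride of 18 floats

# The column view is strided, not contiguous, but at these sizes the
# difference in access cost is negligible either way.
assert row.flags["C_CONTIGUOUS"]
assert not col.flags["C_CONTIGUOUS"]
assert col.strides == (n_params * params.itemsize,)
```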
Here you do the calculation for one point. If I want this for all sources, I would have to do …
Well, not if the user chooses the … In any case, we will export the same uniform, memory aligned format whether … Thanks for the example. What we should probably add to the … For all the unit tests that extract sources, it would suffice to add a … The opposite, …
Since we just want Numpy ndarrays with labels, xarray seems the simplest solution.
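As a minimal sketch of that idea, the per-attribute 1D arrays from the vectorised extraction could be wrapped in an `xarray.Dataset`, giving labelled access while keeping plain ndarrays underneath. The variable names here are illustrative, not PySE's actual parameter names.

```python
import numpy as np
import xarray as xr

n_sources = 4
ds = xr.Dataset(
    data_vars={
        "peak": ("source", np.ones(n_sources)),
        "ra": ("source", np.zeros(n_sources)),
        "dec": ("source", np.zeros(n_sources)),
    },
    coords={"source": np.arange(n_sources)},
)

# Labelled selection; .values still exposes the raw numpy ndarray.
assert ds["peak"].values.shape == (n_sources,)
```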
Alright, when I currently run PySE vectorized from master (which includes reconversion to the "old" format), I will break three unit tests:
When I don't reconvert to the old format, I will break another five unit tests, so eight in total:
Since there are only eight unit tests to fix, we can pursue a more rigorous approach, as mentioned above, and these could be our next steps:
Issue also touched on here.
Currently, source parameters are returned from a call to the `extract` method of an `image.ImageData` instantiation as a `sourcefinder.utility.containers.ExtractionResults` instance: a list of `extract.Detection` instances. This is not a RAM aligned format. But the main problem with that way of collecting source parameters is that the new, fast and vectorized way of computing these parameters is applied by calling the `extract.source_measurements_pixels_and_celestial_vectorised` method, which returns a number of Numpy ndarrays. Ideally one would want to concatenate the relevant columns of those ndarrays into a single new array, possibly of a new format with headers, such as a Pandas dataframe or an xarray.
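The concatenation step described above could be sketched like this: combine the separate per-attribute ndarrays returned by the vectorised measurement into one 2D array with a header list. The attribute names and arrays are hypothetical stand-ins for the actual return values.

```python
import numpy as np

# Suppose the vectorised measurement returned these per-attribute arrays.
peak = np.array([1.0, 2.0, 3.0])
ra = np.array([10.0, 20.0, 30.0])
dec = np.array([-5.0, -6.0, -7.0])

# Stack them as columns: one row per source, one column per parameter.
headers = ["peak", "ra", "dec"]
table = np.column_stack([peak, ra, dec])

assert table.shape == (3, 3)
assert table[1, headers.index("ra")] == 20.0
```

From here, attaching the headers via a Pandas DataFrame or an xarray Dataset is a thin wrapper around the same underlying array.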
Implementing this has three consequences:

1. A number of PySE unit tests will need to be adapted to accept the new way of collecting source parameters. Currently, when the `VECTORIZED` approach is used, the relevant source parameters are reconverted to the old format of a `sourcefinder.utility.containers.ExtractionResults` instance of a list of `extract.Detection` instances, in order to pass the unit tests. This is unfortunate.
2. When the `VECTORIZED` approach is not used, the old output format is still used. We should update that to the new format once the unit tests have been adapted, i.e. once we have completed 1.
3. TraP will need to accept this new format. So this is also a TraP issue.
Implementing this feature should also fix #56