Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Design for container for compressible instruments #790

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

cmarooney-stfc
Copy link
Collaborator

Design for container for compressible instruments

@mducle mducle added the DO_NOT_CI Skip CI on this PR label Mar 18, 2022
@oerc0122
Copy link
Collaborator

I've made a PR into this branch to amend some of the markdown.

In terms of comments I think we need to look at some form of hashing function to append to serializable, or a new hashable mixin, preferably the latter.

Given containers.Map only accept limited types as keys, it thereby means that we want to be able to uniquely map our objects into hashes to have consistent reference in containers.Map following the hashing, the map should handle the rest of the tracking and such.

The reasons for potentially attaching hashing to serializable might be:

  • verification of a correct sending or data package, but this seems unlikely addition.
  • Using the existing indep_fields to collect data for hashing

It is very likely, however, that we don't necessarily want the same information hashable as we do serialized, as data can be reconstructed from potentially less or at the very least different information than a hash.

The choice of hashing algorithm in completely unimportant at this point, and while MATLAB has no native hashing, it has direct access to Java which does and a custom hash is trivial to write.

@mducle
Copy link
Member

mducle commented Mar 18, 2022

I'm not sure what the serialization routine currently in use is, but the undocumented getByteStreamFromArray is very useful. With this you can then use a bit of java to calculate an MD5 hash of the object:

Engine = java.security.MessageDigest.getInstance('MD5');
Engine.update(getByteStreamFromArray(obj));
hash = typecast(Engine.digest, 'uint8');

This could be your unique key to prevent duplication; it would not matter what the type of data is...

E.g.

classdef unique_con < handle
    properties(Access=private)
        store_ = {};
        hash_ = [];
    end
    methods
        function [idx, hash] = is_in_container(self, object)
            % Returns the index in the container if the object is in it else empty
            Engine = java.security.MessageDigest.getInstance('MD5');
            Engine.update(getByteStreamFromArray(object));
            hash = typecast(Engine.digest, 'uint8')';
            if isempty(self.hash_)
                idx = [];
            else
                [~, ~, idx] = intersect(hash, self.hash_, 'rows');
            end
        end
        function self = add_object(self, object)
            [idx, hash] = self.is_in_container(object);
            if isempty(idx)
                self.hash_ = cat(1, self.hash_, hash);
                self.store_ = cat(1, self.store_, object);
            end
        end
    end
end

I've used a cell array here because it takes the least memory and works with the uint8 hashes. If you want to use a struct or containers.Map you must convert the hash to a char array / string with reshape(dec2hex(hash), 1, 32) or similar, since only char arrays are valid fieldnames / key names in Matlab (a disadvantage compared to Python ;)

Copy link
Member

@abuts abuts left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what is wrong with markdown here -- I am seeing this one well. Do not understand changes in the Markdown PR though, until they are merged so reviewing this one.

This document is highly incomplete and I would not agree with internal structure as the convenient way of containing objects. What would happen if you want to combine two Fermi chopper instruments e.g. MAPS and MARI runs? How would you deal with combining runs obtained on the same instrument, which have changed, e.g. old LET and new LET, where the detector coverage have changed? How would the internal structure of the container help you to achieve this purpose? What would happen if we want to combine inelastic X-ray scattering instrument from Diamond (and have written its description) and one of our instruments. The described container is completely useless.

  1. Usage scenarios.
    While common purpose of the container (keep instrument) is stated, its unclear why you would keep it?

1a) I would assume that one of the purposes would be to extract instruments contributed to a run.
So, when you do cut, you return objects contributed to the cut
I.e. you would have method, which return sub-container as the function of the run_id-s contributed to the cut.
How the structure would help to achieve this purpose?

1b) I know that Tobyfit uses some characteristics of the instrument. I am not sure what exactly it uses, but seen something, again, obtained as the function of a run-id. Which methods the container should have to support this and how it would provide for these methods? Would be these methods fast enough and occupy reasonable memory? (e.g. PDF as the function of run id)

1c) Are there any other usages for this container in Horace?

I think these questions should be answered or at least thought about before coding begins.

@abuts
Copy link
Member

abuts commented Mar 21, 2022

I would agree with Duc's design. This one of the possibilities to achieve the purpose of getting object as function of run-id. Do not like Java though. Its ugly and would it work if Matlab is launched with -nojvm? There are other ways of achieving the same purpose

The question if it is enough to get object as function of a run number or some other averages are requested should be discussed with Toby @tgperring and may be with Jackob @oerc0122

@oerc0122
Copy link
Collaborator

My markdown changes are so that comments can be applied to smaller segments of the code and don't break text-wrapping conventions. Besides a couple of minor wrapping of variable names with backticks, that's it.

@abuts
Copy link
Member

abuts commented Mar 22, 2022

After some thinking, would add more comments to Duc's (@mducle) design which is a good starting point.
it answers the questions (usage scenarios):

  1. store only unique objects in the container
  2. partially answers the question of getting objects as function of run id. To fully answer this question, a kind of map run_id->position in the container should be maintained.

concerns:

  1. Are java hashes the same for the objects with equivalent contents? If they work and fast, it would be ok.

  2. As a whole, the container is the storage for a comparable objects. Would it be better to separate comparison operators and the container? Then it would be freedom to identify appropriate comparison operator on the particular class, and to use the container as generic storage.


missing points (usage scenario):

How all this works with Tobyfit, is it convenient and is it efficient for obtaining what Tobyfit needs from instrument from this container? This question should be clarified with Toby (@tgperring) and may be Jacob (@oerc0122 )

@oerc0122
Copy link
Collaborator

In partially answering your concerns:

Are java hashes the same for the objects with equivalent contents?

It depends on how they're specified. It should, but we might find that for soem reason the byte stream includes (for example) the pointer to the data, which may be unique and then messes with the hash. It may be necessary to define a hashable mixin which provides the data to be hashed to uniquely identify an object.

If they work and fast, it would be ok.

In terms of speed, MD5 is widely used to determine whether files match and is sufficiently swift for that. Should it turn out it is not, we can look at alternative hashing schemes, but for now until we've determined it's performance critical (my guess is absolutely not given the cost of gen_sqw, I don't think we need to worry.

As a whole, the container is the storage for a comparable objects.

The objects don't have to be comparable, only the hashes if I'm understanding correctly.

Would it be better to separate comparison operators and the container?

I would say attaching the hashing algorithm to the container makes sense and means that the object represents a containerisation method. This means that should we decide to change the hashing method it is changed everywhere and the container contains all the information to store, recover and reproduce the hashes contained within it.

Then it would be freedom to identify appropriate comparison operator on the particular class, and to use the container as generic storage.

The comparison happens on the hash.

missing points (usage scenario):
How all this works with Tobyfit, is it convenient and is it efficient for obtaining what Tobyfit needs from instrument from this container? This question should be clarified with Toby (@tgperring) and may be Jacob (@oerc0122 )

Given this is a cell array (in this instance) or some other array or container, provided the data can be extracted, these can easily be coerced into the correct form for Tobyfit data for minimal cost compared to that of computing Tobyfit.

@abuts
Copy link
Member

abuts commented Mar 22, 2022

As a whole, the container is the storage for a comparable objects.

The objects don't have to be comparable, only the hashes if I'm understanding correctly.

Well, this is comparable objects, if the hashes for different objects are different and the hashes for the same objects are the same. I do not see any other definition of comparability. We want to ensure that equal objects are equal and ensure some order of non-equal objects to allow fast sorting.
So it is the container of comparable objects.

@abuts
Copy link
Member

abuts commented Mar 22, 2022

Would it be better to separate comparison operators and the container?

I would say attaching the hashing algorithm to the container makes sense and means that the object represents a containerisation

I would rather attach comparison operator to class or set of classes for one reason. We may want to ignore some properties of an class at comparison. E.g. very often we compare sqw objects with option ignore_string to avoid comparing internal filenames. In other situation we may want to include such option.

If a comparison should or should not include such a things would depend on the purpose, the classes are stored in the container.

@abuts
Copy link
Member

abuts commented Mar 22, 2022

Given this is a cell array (in this instance) or some other array or container, provided the data can be extracted, these can easily be coerced into the correct form for Tobyfit data for minimal cost compared to that of computing Tobyfit.

But cellarray of fully fledged objects occupy excessive memory, if you have them all expanded (e.g. full set of detectors for all runs). You should be able to extract your requested functor, without expanding the objects.

This is the operation I am currently not understand

@oerc0122
Copy link
Collaborator

Given this is a cell array (in this instance) or some other array or container, provided the data can be extracted, these can easily be coerced into the correct form for Tobyfit data for minimal cost compared to that of computing Tobyfit.

But cellarray of fully fledged objects occupy excessive memory, if you have them all expanded (e.g. full set of detectors for all runs). You should be able to extract your requested functor, without expanding the objects.

This is the operation I am currently not understand

The point of the operation is that only nUnique + 1 (currently reading) objects are in memory at any given time. MATLAB's copy-on-write mechanics mean that provided these objects are never changed, these copies remain compressed and Tobyfit will be modified to extract the required hashed value (i.e. detpar can be replaced with a hash on each SQW or even just a pointer [through copy-on-write] to its detpar)

@mducle
Copy link
Member

mducle commented Mar 22, 2022

Ok... I hacked together a quick class: https://gist.github.com/mducle/bc0e93d403ef9f30f605450335306dac

Edit: This uses the 128-bit MD5 hash as above. Another design choice is to truncate the hash to be a 64-bit integer and then the hash can be stored as a single number which might make searching faster... (or it could be used directly as the key in a containers.Map.)

li = let_instrument(5, 240, 80, 20, 1);
uc = unique_con; for ii = 1:100; uc(ii) = lm; end;
mi = merlin_instrument(180, 600, 'g');
uc = [uc unique_con(repmat({mm}, 1, 100))];
ss = struct(uc);

Edit2: Need to use struct(uc) to see the size below in whos because otherwise it just gives the pointer size.

gives:

>> uc

uc = 

Unique container with 200 elements and 2 unique elements

>> whos
  Name            Size             Bytes  Class                                          Attributes

  ii              1x1                  8  double                                                   
  li              1x1              15230  IX_inst_DGdisk                                           
  mi              1x1               6698  IX_inst_DGfermi                                          
  ss              1x1              24272  struct                                                   
  uc              1x1                  8  unique_con                                               

>> uc(55)

ans = 

  IX_inst_DGdisk with properties:

     mod_shape_mono: [1×1 IX_mod_shape_mono]
          moderator: [1×1 IX_moderator]
    shaping_chopper: [1×1 IX_doubledisk_chopper]
       mono_chopper: [1×1 IX_doubledisk_chopper]
          horiz_div: [1×1 IX_divergence_profile]
           vert_div: [1×1 IX_divergence_profile]
             energy: 5
               name: 'LET'
             source: [1×1 IX_source]

>> uc(155)

ans = 

  IX_inst_DGfermi with properties:

        moderator: [1×1 IX_moderator]
         aperture: [1×1 IX_aperture]
    fermi_chopper: [1×1 IX_fermi_chopper]
           energy: 180
             name: 'MERLIN'
           source: [1×1 IX_source]

@cmarooney-stfc
Copy link
Collaborator Author

cmarooney-stfc commented Mar 22, 2022

Responding to @abuts comments above 21/3/22 14.30. Alex, I paste your comment here (less the initial markdown stuff) so I can comment inline.

AB: This document is highly incomplete

CM: Yes indeed. As it has already generated a lot of comment, it would be silly to try and extend the document until replies are in. It appears to have been very succesful in generating thoughts, and as such has done its job. I think the word you want is "draft"

AB: and I would not agree with internal structure as the convenient way of containing objects. What would happen if you want to combine two Fermi chopper instruments e.g. MAPS and MARI runs?

CM: The present (draft) document assumes that the instruments are considered as monolithic objects, rather than breaking them down into their component parts. I will deal with the alternative in a moment. The SQW object design has always been that each run has a defined instrument. Previously it was part of the struct in header for the relevant run number; now it is one of the instruments in the cell array instruments of the Experiment-class experiment_info. If it is required that runs from MAPS and MARI are combined into one SQW, then (say) runs 1-10 will be on MAPS and will have the MAPS instrument, and (say) runs 11-20 each have the MARI instrument. At present these are uncompressed; if the instruments are compressed then the number of instruments will be reduced to 2, i.e. instruments{1}==MAPS, instruments{2}==MARI, and a separate instrument_index array will record instrument_index(1:10)==1, instrument_index(11-20)==2. All of this will be under the hood in the container object, and if the container is given the run number, it will return the relevant instrument.

If it is preferred to break down the instrument by component, then the containers can be used component-wise, so a container will store Fermi-choppers, indexed and compressed in the same manner, and the instrument classes will be converted to reference such containers.

AB: How would you deal with combining runs obtained on the same instrument, which have changed, e.g. old LET and new LET, where the detector coverage have changed?

CM: I am presuming that the old LET and new LET differ in their detector arrays from this description. In these descriptions I am using the shorthand "instrument" for "primary spectrometer" (as is already the case in the various IX_inst_* types in the code) and reserving "detector" for "secondary spectrometer" (as is already the case in the various IX_det* types in the code.) Currently and previously the SQW object contained only one detpar object and hence would presumably not cater for this scenario. The new Experiment-class experiment_info now contains an array of IX_detector_arrays, arbitrarily indexed and currently defaulting to element 1 for all runs. In this case I presume that the old and new LETs will each define a separate IX_detector_array, with detector_index values for each run pointing to detector_array(1) or detector_array(2) as appropriate. As with the instruments, the new detector array container could deal with this. However, as there is only one class type IX_detector_array and it is implemented as a Matlab array, that is currently not necessary, with a simple wrapper sufficing to do the indexing.

AB: How would the internal structure of the container help you to achieve this purpose?

CM: See previous replies.

AB: What would happen if we want to combine inelastic X-ray scattering instrument from Diamond (and have written its description) and one of our instruments.

CM: This is a very interesting extension. I am as yet unaware that it was proposed to use Horace for X-ray scattering as well as for neutron scattering, and would be grateful if you would point me to the design documents for this. As a particular issue I cannot visualise how the scattering ranges (r-theta-phi) would differ between X-ray and neutron, and it is more likely that there would be overlap between the ranges. Hence I would expect Horace to have to deal with this overlap. I would also want to know the extent to which the modelling codes (Euphonic, SpinW etc) would need to cater for different measurement techniques. Again, looking at a combined method design document would be useful.

AB: The described container is completely useless.

CM: This seems a very blanket comment. In the light of the above replies, you may wish to reconsider.

AB: Usage scenarios.
While common purpose of the container (keep instrument) is stated, its unclear why you would keep it?

1a) I would assume that one of the purposes would be to extract instruments contributed to a run.

CM: Yes, indeed as far as I know the only purpose, beyond encapsulating instrument compression.

AB: So, when you do cut, you return objects contributed to the cut
I.e. you would have method, which return sub-container as the function of the run_id-s contributed to the cut.
How the structure would help to achieve this purpose?

CM: Let's start with the sensible way of implementing this, which is to have instruments as handle classes. Then the duplicated container in the cut would only contain pointers to the instruments - no duplication, perfect compression. But copy-on-write will achieve the same effect; the duplicated container will not actually duplicate the instrument.

The additional scenario is that when everything is written to file, restoration from file will break the links (either handle or CoW) between cut and original SQW. I had agreed with Toby that the actual container of unique objects would sit outside the SQW objects and be restored from file separately, with only the indices retained in the SQW object; hence the container object's functions of storage and indexing would be split.

AB: 1b) I know that Tobyfit uses some characteristics of the instrument. I am not sure what exactly it uses, but seen something, again, obtained as the function of a run-id. Which methods the container should have to support this and how it would provide for these methods? Would be these methods fast enough and occupy reasonable memory? (e.g. PDF as the function of run id)

CM: The aim is to reference the unique instrument. Once it has been obtained, its use should be independent of the containment. The actual referencing - get run index, get instrument index, operates on integer indexes only and should not cause problems.

AB: I think these questions should be answered or at least thought about before coding begins.

CM: Agreed. Your responses to these points is appreciated. Thanks, Chris

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
DO_NOT_CI Skip CI on this PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants