Design for container for compressible instruments #790
I've made a PR into this branch to amend some of the markdown. In terms of comments I think we need to look at some form of hashing function to append to Given The reasons for potentially attaching hashing to
It is very likely, however, that we don't necessarily want the same information hashable as we do serialized, as data can be reconstructed from potentially less or at the very least different information than a hash. The choice of hashing algorithm is completely unimportant at this point, and while MATLAB has no native hashing, it has direct access to Java which does and a custom hash is trivial to write. |
I'm not sure what the serialization routine currently in use is, but the undocumented
This could be your unique key to prevent duplication; it would not matter what the type of data is... E.g.
    classdef unique_con < handle
        properties(Access=private)
            store_ = {};   % cell array of the unique objects
            hash_ = [];    % one 16-byte MD5 hash per row, aligned with store_
        end
        methods
            function [idx, hash] = is_in_container(self, object)
                % Returns the index in the container if the object is in it else empty
                Engine = java.security.MessageDigest.getInstance('MD5');
                Engine.update(getByteStreamFromArray(object));
                hash = typecast(Engine.digest, 'uint8')';
                if isempty(self.hash_)
                    idx = [];
                else
                    [~, ~, idx] = intersect(hash, self.hash_, 'rows');
                end
            end
            function self = add_object(self, object)
                % Stores the object only if its hash is not already present
                [idx, hash] = self.is_in_container(object);
                if isempty(idx)
                    self.hash_ = cat(1, self.hash_, hash);
                    % Index-assign so the object always occupies one new cell
                    self.store_{end+1, 1} = object;
                end
            end
        end
    end
I've used a |
Not sure what is wrong with markdown here -- I am seeing this one well. Do not understand changes in the Markdown PR though, until they are merged so reviewing this one.
This document is highly incomplete and I would not agree with internal structure as the convenient way of containing objects. What would happen if you want to combine two Fermi chopper instruments e.g. MAPS and MARI runs? How would you deal with combining runs obtained on the same instrument, which have changed, e.g. old LET and new LET, where the detector coverage has changed? How would the internal structure of the container help you to achieve this purpose? What would happen if we want to combine an inelastic X-ray scattering instrument from Diamond (and have written its description) and one of our instruments? The described container is completely useless.
- Usage scenarios.
While the common purpose of the container (keep instrument) is stated, it's unclear why you would keep it.
1a) I would assume that one of the purposes would be to extract instruments contributed to a run.
So, when you do cut, you return objects contributed to the cut
I.e. you would have method, which return sub-container as the function of the run_id-s contributed to the cut.
How the structure would help to achieve this purpose?
1b) I know that Tobyfit uses some characteristics of the instrument. I am not sure what exactly it uses, but I have seen something, again, obtained as a function of run-id. Which methods should the container have to support this, and how would it provide for these methods? Would these methods be fast enough and occupy reasonable memory? (e.g. PDF as a function of run id)
1c) Are there any other usages for this container in Horace?
I think these questions should be answered or at least thought about before coding begins.
I would agree with Duc's design. This is one of the possibilities to achieve the purpose of getting an object as a function of run-id. Do not like Java though. It's ugly and would it work if Matlab is launched with The question of whether it is enough to get an object as a function of run number, or whether some other averages are requested, should be discussed with Toby @tgperring and maybe with Jacob @oerc0122 |
My markdown changes are so that comments can be applied to smaller segments of the code and don't break text-wrapping conventions. Besides a couple of minor wrapping of variable names with backticks, that's it. |
After some thinking, would add more comments to Duc's (@mducle) design which is a good starting point.
concerns:
missing points (usage scenario): How does all this work with Tobyfit? Is it convenient and efficient for obtaining what Tobyfit needs from the instrument in this container? This question should be clarified with Toby (@tgperring) and maybe Jacob (@oerc0122) |
In partially answering your concerns:
It depends on how they're specified. It should, but we might find that for some reason the byte stream includes (for example) the pointer to the data, which may be unique and then messes with the hash. It may be necessary to define a
In terms of speed, MD5 is widely used to determine whether files match and is sufficiently swift for that. Should it turn out it is not, we can look at alternative hashing schemes, but for now until we've determined it's performance critical (my guess is absolutely not given the cost of
The objects don't have to be comparable, only the hashes if I'm understanding correctly.
I would say attaching the hashing algorithm to the container makes sense and means that the object represents a containerisation method. This means that should we decide to change the hashing method it is changed everywhere and the
The comparison happens on the hash.
Given this is a cell array (in this instance) or some other array or container, provided the data can be extracted, these can easily be coerced into the correct form for Tobyfit data for minimal cost compared to that of computing Tobyfit. |
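One way to picture the byte-stream concern raised in this thread is to hash a reduced copy of the object rather than the object itself. The following is only a sketch under stated assumptions: the struct fields (`name`, `ei`, `cache`) and the `key_fields` list are hypothetical stand-ins and are not taken from Horace; the hashing calls are the same Java `MessageDigest` and undocumented `getByteStreamFromArray` calls used in the `unique_con` example.

```matlab
% Two hypothetical instrument descriptions that differ only in a field
% ('cache') that should NOT affect identity. All field names are assumptions.
objs = {struct('name', 'MAPS', 'ei', 50, 'cache', [1 2 3]), ...
        struct('name', 'MAPS', 'ei', 50, 'cache', [9 9 9])};
key_fields = {'name', 'ei'};            % fields assumed to define identity

engine = java.security.MessageDigest.getInstance('MD5');
hashes = zeros(numel(objs), 16, 'uint8');
for ii = 1:numel(objs)
    % Build a reduced struct holding only the identity-defining fields
    reduced = struct();
    for jj = 1:numel(key_fields)
        reduced.(key_fields{jj}) = objs{ii}.(key_fields{jj});
    end
    % Hash the serialized reduced struct; digest() also resets the engine
    engine.update(getByteStreamFromArray(reduced));
    hashes(ii, :) = typecast(engine.digest, 'uint8')';
end

% Only the hashes are compared, never the objects themselves
same = isequal(hashes(1, :), hashes(2, :));
```

Whether the list of hashed fields should live on the container or on each class is exactly the design question under discussion here; the sketch takes no position on that.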
Well, these are comparable objects if the hashes for different objects are different and the hashes for the same objects are the same. I do not see any other definition of comparability. We want to ensure that equal objects are equal and ensure some order of non-equal objects to allow fast sorting. |
I would rather attach the comparison operator to a class or set of classes for one reason. We may want to ignore some properties of a class at comparison. E.g. very often we compare sqw objects with option If a comparison should or should not include such things would depend on the purpose for which the classes are stored in the container. |
But a cell array of fully fledged objects occupies excessive memory if you have them all expanded (e.g. full set of detectors for all runs). You should be able to extract your requested functor without expanding the objects. This is the operation I do not currently understand |
The point of the operation is that only |
Ok... I hacked together a quick class: https://gist.github.com/mducle/bc0e93d403ef9f30f605450335306dac
Edit: This uses the 128-bit MD5 hash as above. Another design choice is to truncate the hash to be a 64-bit integer and then the hash can be stored as a single number which might make searching faster... (or it could be used directly as the key in a
    li = let_instrument(5, 240, 80, 20, 1);
    uc = unique_con; for ii = 1:100; uc(ii) = li; end;
    mi = merlin_instrument(180, 600, 'g');
    uc = [uc unique_con(repmat({mi}, 1, 100))];
    ss = struct(uc);
Edit2: Need to use gives:
|
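The truncation idea mentioned above (keeping only 64 bits of the MD5 digest so that each hash is a single number) could look roughly like this. It is a sketch, not the gist's implementation: the string `'example object'` stands in for an instrument, and `containers.Map` is assumed here purely for illustration, since the thread does not say which keyed structure was meant.

```matlab
% Hash a stand-in object with MD5, as in the unique_con example
obj = 'example object';                        % hypothetical stand-in
engine = java.security.MessageDigest.getInstance('MD5');
engine.update(getByteStreamFromArray(obj));
md5 = typecast(engine.digest, 'uint8')';       % 1x16 uint8 (128 bits)

% Truncate: reinterpret the first 8 bytes as a single uint64
key64 = typecast(md5(1:8), 'uint64');

% Assumed keyed store: hash -> position of the object in 'store'
lookup = containers.Map('KeyType', 'uint64', 'ValueType', 'double');
store = {};
for attempt = 1:2                              % second pass is a duplicate
    if ~isKey(lookup, key64)
        store{end+1} = obj;                    %#ok<SAGROW>
        lookup(key64) = numel(store);
    end
end
```

Truncating to 64 bits raises the collision probability compared with the full 128-bit digest, so a production version would probably want an equality check on a hash match.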
Responding to @abuts comments above 21/3/22 14.30. Alex, I paste your comment here (less the initial markdown stuff) so I can comment inline.
AB: This document is highly incomplete
CM: Yes indeed. As it has already generated a lot of comment, it would be silly to try and extend the document until replies are in. It appears to have been very successful in generating thoughts, and as such has done its job. I think the word you want is "draft".
AB: and I would not agree with internal structure as the convenient way of containing objects. What would happen if you want to combine two Fermi chopper instruments e.g. MAPS and MARI runs?
CM: The present (draft) document assumes that the instruments are considered as monolithic objects, rather than breaking them down into their component parts. I will deal with the alternative in a moment. The SQW object design has always been that each run has a defined instrument. Previously it was part of the struct in header for the relevant run number; now it is one of the instruments in the cell array instruments of the Experiment-class experiment_info. If it is required that runs from MAPS and MARI are combined into one SQW, then (say) runs 1-10 will be on MAPS and will have the MAPS instrument, and (say) runs 11-20 will each have the MARI instrument. At present these are uncompressed; if the instruments are compressed then the number of instruments will be reduced to 2, i.e. instruments{1}==MAPS, instruments{2}==MARI, and a separate instrument_index array will record instrument_index(1:10)==1, instrument_index(11:20)==2. All of this will be under the hood in the container object, and if the container is given the run number, it will return the relevant instrument. If it is preferred to break down the instrument by component, then the containers can be used component-wise, so a container will store Fermi-choppers, indexed and compressed in the same manner, and the instrument classes will be converted to reference such containers.
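The compression described in the reply above (two unique instruments plus an `instrument_index` array over 20 runs) can be sketched as follows. This is an illustration only: character labels stand in for real instrument objects, `isequal` stands in for whatever comparison or hash the container ends up using, and `run_instruments` is a hypothetical name.

```matlab
% Hypothetical per-run instruments: runs 1-10 on MAPS, runs 11-20 on MARI.
% Strings stand in for instrument objects.
run_instruments = [repmat({'MAPS'}, 1, 10), repmat({'MARI'}, 1, 10)];

instruments = {};                                     % unique instruments
instrument_index = zeros(1, numel(run_instruments));  % run -> unique instrument
for run = 1:numel(run_instruments)
    % Look for an already-stored instrument equal to this run's instrument
    found = find(cellfun(@(u) isequal(u, run_instruments{run}), instruments), 1);
    if isempty(found)
        instruments{end+1} = run_instruments{run};    %#ok<SAGROW>
        found = numel(instruments);
    end
    instrument_index(run) = found;
end

% Given a run number, return the relevant instrument
inst_for_run_15 = instruments{instrument_index(15)};
```

The linear search is only for clarity; the hash-based lookup discussed earlier in the thread would replace the `cellfun` line.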
AB: How would you deal with combining runs obtained on the same instrument, which have changed, e.g. old LET and new LET, where the detector coverage has changed?
CM: I am presuming from this description that the old LET and new LET differ in their detector arrays. In these descriptions I am using the shorthand "instrument" for "primary spectrometer" (as is already the case in the various IX_inst_* types in the code) and reserving "detector" for "secondary spectrometer" (as is already the case in the various IX_det* types in the code). Currently and previously the SQW object contained only one detpar object and hence would presumably not cater for this scenario. The new Experiment-class experiment_info now contains an array of IX_detector_arrays, arbitrarily indexed and currently defaulting to element 1 for all runs. In this case I presume that the old and new LETs will each define a separate IX_detector_array, with detector_index values for each run pointing to detector_array(1) or detector_array(2) as appropriate. As with the instruments, the new detector array container could deal with this. However, as there is only one class type IX_detector_array and it is implemented as a Matlab array, that is currently not necessary, with a simple wrapper sufficing to do the indexing.
AB: How would the internal structure of the container help you to achieve this purpose?
CM: See previous replies.
AB: What would happen if we want to combine inelastic X-ray scattering instrument from Diamond (and have written its description) and one of our instruments.
CM: This is a very interesting extension. I was not aware that it had been proposed to use Horace for X-ray scattering as well as for neutron scattering, and would be grateful if you would point me to the design documents for this. As a particular issue I cannot visualise how the scattering ranges (r-theta-phi) would differ between X-ray and neutron, and it is more likely that there would be overlap between the ranges. Hence I would expect Horace to have to deal with this overlap. I would also want to know the extent to which the modelling codes (Euphonic, SpinW etc) would need to cater for different measurement techniques. Again, looking at a combined method design document would be useful.
AB: The described container is completely useless.
CM: This seems a very blanket comment. In the light of the above replies, you may wish to reconsider.
AB: Usage scenarios.
CM: Yes, indeed as far as I know the only purpose, beyond encapsulating instrument compression.
AB: So, when you do cut, you return objects contributed to the cut
CM: Let's start with the sensible way of implementing this, which is to have instruments as handle classes. Then the duplicated container in the cut would only contain pointers to the instruments: no duplication, perfect compression. But copy-on-write will achieve the same effect; the duplicated container will not actually duplicate the instrument. The additional scenario is that when everything is written to file, restoration from file will break the links (either handle or CoW) between cut and original SQW. I had agreed with Toby that the actual container of unique objects would sit outside the SQW objects and be restored from file separately, with only the indices retained in the SQW object; hence the container object's functions of storage and indexing would be split.
AB: 1b) I know that Tobyfit uses some characteristics of the instrument. I am not sure what exactly it uses, but seen something, again, obtained as the function of a run-id. Which methods the container should have to support this and how it would provide for these methods? Would be these methods fast enough and occupy reasonable memory? (e.g. PDF as the function of run id)
CM: The aim is to reference the unique instrument. Once it has been obtained, its use should be independent of the containment. The actual referencing (get run index, get instrument index) operates on integer indexes only and should not cause problems.
AB: I think these questions should be answered or at least thought about before coding begins.
CM: Agreed. Your responses to these points are appreciated. Thanks, Chris |
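To make the "cut returns the instruments that contributed" scenario concrete, here is a rough sketch that works on integer indices only, as suggested in the reply above. The variable names (`cut_runs`, `cut_instruments`, `cut_index`) are hypothetical, the strings stand in for instrument objects, and nothing here is taken from the Horace code base.

```matlab
% Compressed storage for 20 runs: two unique instruments plus an index array
instruments = {'MAPS', 'MARI'};                    % stand-ins for objects
instrument_index = [ones(1, 10), 2 * ones(1, 10)]; % run -> unique instrument

% Hypothetical list of runs that contributed pixels to a cut
cut_runs = [12, 15, 19];

% Which unique instruments are used by the cut, and where each cut run
% points within that reduced set
[used, ~, cut_index] = unique(instrument_index(cut_runs));
cut_instruments = instruments(used);   % sub-container for the cut
cut_index = cut_index(:)';             % row vector, one entry per cut run

% Lookup for the second run in the cut (run 15)
inst_for_second = cut_instruments{cut_index(2)};
```

If the unique objects are kept outside the SQW object, as proposed, only `used` and `cut_index` would need to travel with the cut.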