Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DPL: avoid TMessage usage #12757

Merged
merged 1 commit into from
Feb 28, 2024
Merged

DPL: avoid TMessage usage #12757

merged 1 commit into from
Feb 28, 2024

Conversation

ktf
Copy link
Member

@ktf ktf commented Feb 23, 2024

DPL: avoid TMessage usage

TMessage does not allow for non owned buffers, so we end up having
an extra buffer in private memory for (de)serializing. Using TBufferFile
directly allows to avoid that, so this moves the whole ROOT serialization
support in DPL to use it.

Copy link
Contributor

REQUEST FOR PRODUCTION RELEASES:
To request your PR to be included in production software, please add the corresponding labels called "async-" to your PR. Add the labels directly (if you have the permissions) or add a comment of the form (note that labels are separated by a ",")

+async-label <label1>, <label2>, !<label3> ...

This will add <label1> and <label2> and removes <label3>.

The following labels are available
async-2023-pbpb-apass
async-2023-pp-apass1
async-2022-pp-apass6
async-2022-pp-apass4
async-mc
async-data

@alibuild
Copy link
Collaborator

alibuild commented Feb 23, 2024

Error while checking build/O2/fullCI for eb21563 at 2024-02-24 01:12:

## sw/BUILD/O2-full-system-test-latest/log
Detected critical problem in logfile digi.log
digi.log:[40125:internal-dpl-clock]: [00:12:38][FATAL] error while setting up workflow in o2-sim-digitizer-workflow: boost::interprocess::lock_exception
[ERROR] Workflow crashed - PID 40126 (SimReader) did not exit correctly however it's not clear why. Exit code forced to 128.
[40125:internal-dpl-clock]: [00:12:38][FATAL] error while setting up workflow in o2-sim-digitizer-workflow: boost::interprocess::lock_exception
[ERROR] Workflow crashed - PID 40125 (internal-dpl-clock) did not exit correctly however it's not clear why. Exit code forced to 128.


## sw/BUILD/o2checkcode-latest/log
--
========== List of errors found ==========
++ GRERR=0
++ grep -v clang-diagnostic-error error-log.txt
++ grep ' error:'
++ GRERR=1
++ [[ 1 == 0 ]]
++ mkdir -p /sw/INSTALLROOT/70591ea34cbe78768012351a09fd771f6746d458/slc8_x86-64/o2checkcode/1.0-local667/etc/modulefiles
++ cat
--

Full log here.

@ktf
Copy link
Member Author

ktf commented Feb 23, 2024

@shahor02 @sawenzel This seems to work (including not having the memory spikes in private memory) in my unit tests, but fails when deserializing the digits. Can you remind me if you have any special massaging there? Do you construct TMessages yourself?

@ktf
Copy link
Member Author

ktf commented Feb 23, 2024

ok, i guess i need to do the same massaging i did elsewhere in Detector.cxx. Will try tomorrow. is that code DPL only?

@shahor02
Copy link
Collaborator

Which digits? I thought we use only PODs for them. I did not remember these TMessages in the Detector.cxx, are they actually used if sh.mem. is allowed?

@alibuild
Copy link
Collaborator

alibuild commented Feb 24, 2024

Error while checking build/O2/fullCI for 045edb4 at 2024-02-24 16:42:

## sw/BUILD/O2-full-system-test-latest/log
Detected critical problem in logfile digi.log
digi.log:[29553:internal-dpl-clock]: [15:42:04][FATAL] error while setting up workflow in o2-sim-digitizer-workflow: boost::interprocess::lock_exception
[ERROR] Workflow crashed - PID 29554 (SimReader) did not exit correctly however it's not clear why. Exit code forced to 128.
[29553:internal-dpl-clock]: [15:42:04][FATAL] error while setting up workflow in o2-sim-digitizer-workflow: boost::interprocess::lock_exception
[ERROR] Workflow crashed - PID 29553 (internal-dpl-clock) did not exit correctly however it's not clear why. Exit code forced to 128.
[ERROR]  - Device internal-dpl-clock: pid 29553 (exit 128)
[INFO]    - First error: [15:42:04][FATAL] error while setting up workflow in o2-sim-digitizer-workflow: boost::interprocess::lock_exception
[ERROR]  - Device SimReader: pid 29554 (exit 128)
[ERROR] SEVERE: Device internal-dpl-clock (29553) had at least one message above severity 5: error while setting up workflow in o2-sim-digitizer-workflow: boost::interprocess::lock_exception


## sw/BUILD/o2checkcode-latest/log
--
========== List of errors found ==========
++ GRERR=0
++ grep -v clang-diagnostic-error error-log.txt
++ grep ' error:'
++ GRERR=1
++ [[ 1 == 0 ]]
++ mkdir -p /sw/INSTALLROOT/f28b07d63338c54b4744af65d77d910c9275e01a/slc8_x86-64/o2checkcode/1.0-local1860/etc/modulefiles
++ cat
--

Full log here.

@ktf
Copy link
Member Author

ktf commented Feb 24, 2024

I think the issue is that my new method is not compatible with what produced by attachShmMessage & co. I will try a few possibilities later.

@alibuild
Copy link
Collaborator

alibuild commented Feb 24, 2024

Error while checking build/O2/fullCI for 8e51310 at 2024-02-27 01:30:

## sw/BUILD/O2-latest/log
c++: error: unrecognized command-line option '--rtlib=compiler-rt'
c++: error: unrecognized command-line option '--rtlib=compiler-rt'


## sw/BUILD/O2-full-system-test-latest/log
command o2-sim --seed 12345 -j 32 -n 100 -m PIPE ITS MFT FT0 FV0 FDD -g extgen -e TGeant4 --configKeyValues "GeneratorExternal.fileName=$O2_ROOT/share/Generators/external/QEDLoader.C;QEDGenParam.yMin=-7;QEDGenParam.yMax=7;QEDGenParam.ptMin=0.001;QEDGenParam.ptMax=1.;${SIMOPTKEY}" --run 310000 had nonzero exit code 1
[ERROR] SHUTTING DOWN DUE TO SIGNALED EXIT IN COMPONENT 543
[00:30:41][ERROR] Some error/interrupt occurred on socket during receive


## sw/BUILD/o2checkcode-latest/log
--
========== List of errors found ==========
++ GRERR=0
++ grep -v clang-diagnostic-error error-log.txt
++ grep ' error:'
++ GRERR=1
++ [[ 1 == 0 ]]
++ mkdir -p /sw/INSTALLROOT/699da6695c9b74f157fc8fbfa138c562b08a66f1/slc8_x86-64/o2checkcode/1.0-local695/etc/modulefiles
++ cat
--

Full log here.

@ktf
Copy link
Member Author

ktf commented Feb 25, 2024

i patched them, but now it seems to break simulation. i need to grep more for tmessage, i guess...

@ktf
Copy link
Member Author

ktf commented Feb 27, 2024

@shahor02 @davidrohr @chiarazampolli @knopers8 @Barthelemy This seems to work fine on my box, and I do not see the spike (in private memory) in the QC tasks when sending the histograms. Could you have a look?

The qc-file-sink still has spikes when writing to disk, though.

@knopers8
Copy link
Collaborator

The qc-file-sink still has spikes when writing to disk, though.

We cannot really avoid it probably. If an object was already saved before, we have to read it back, merge, and write the merged object. We only read what we need and we keep it in memory only for the duration of the run callback.

This being said, I am not certain that TDirectory::WriteObject does not create a temporary chunk of memory to serialize the object, the documentation does not seem to say anything about it.

@ktf
Copy link
Member Author

ktf commented Feb 28, 2024

This being said, I am not certain that TDirectory::WriteObject does not create a temporary chunk of memory to serialize the object, the documentation does not seem to say anything about it.

Yes, of course the merging has the issue you described, I was more worried about the extra buffer which might be needed when writing back. However that's certainly not something simple to solve.

TMessage does not allow for non owned buffers, so we end up having
an extra buffer in private memory for (de)serializing. Using TBufferFile
directly allows to avoid that, so this moves the whole ROOT serialization
support in DPL to use it.
@ktf
Copy link
Member Author

ktf commented Feb 28, 2024

All the changes in the new commit are just formatting related. Old commit had all tests passing. Merging so that we can do more extensive tests.

@ktf ktf merged commit 4f5a6c0 into AliceO2Group:dev Feb 28, 2024
12 of 13 checks passed
@ktf ktf deleted the pr12757 branch February 28, 2024 13:38
@ktf
Copy link
Member Author

ktf commented Feb 29, 2024

Something broke, but I am not sure what. 43b29ee worked just fine and the diff between that and 03228df is:

--- /tmp/zshbLBbcd	2024-02-29 08:46:50
+++ /tmp/zsh11DLlX	2024-02-29 08:46:50
@@ -1,7 +1,7 @@
-changeset:   43b29ee7dd6d684bfd82f50368b610ad9f01dea8  
-parent:      a97d37656887de69b2c64d43ca9f053e8db342b9
+changeset:   03228df7d530292972830710a1bcc5bbf53845c8  
+parent:      b23bd7b6b27197e515ea77d8d3f67768fab81903
 user:        Giulio Eulisse <[email protected]>
-date:        Tue, 27 Feb 2024 23:27:49 +0100
+date:        Wed, 28 Feb 2024 14:08:06 +0100
 
     DPL: avoid TMessage usage
 
@@ -62,7 +62,7 @@
 -        typename RSS::FairTMessage ftm(const_cast<char*>(ref.payload), payloadSize);
 +        typename RSS::FairInputTBuffer ftm(const_cast<char*>(ref.payload), payloadSize);
 +        ftm.InitMap();
-+        auto *classInfo = ftm.ReadClass();
++        auto* classInfo = ftm.ReadClass();
 +        ftm.SetBufferOffset(0);
 +        ftm.ResetMap();
          result.reset(static_cast<wrapped*>(ftm.ReadObjectAny(cl)));
@@ -161,8 +161,8 @@
 +  // of overhead, where the source embedded the pointer for the reallocation.
 +  // Notice this will break if the sender and receiver are not using the same
 +  // size for a pointer.
-+  FairInputTBuffer(char * data, size_t size)
-+    : TBufferFile(TBuffer::kRead, size-sizeof(char*), data + sizeof(char*), false, nullptr)
++  FairInputTBuffer(char* data, size_t size)
++    : TBufferFile(TBuffer::kRead, size - sizeof(char*), data + sizeof(char*), false, nullptr)
 +  {
 +  }
 +};
@@ -194,7 +194,7 @@
 -  static std::unique_ptr<T> deserialize(gsl::span<std::byte> buffer);
 -  template <typename T = TObject>
 -  static inline std::unique_ptr<T> deserialize(std::byte* buffer, size_t size);
-+  static inline std::unique_ptr<T> deserialize(FairInputTBuffer & buffer);
++  static inline std::unique_ptr<T> deserialize(FairInputTBuffer& buffer);
  };
  
 -inline void TMessageSerializer::serialize(FairTMessage& tm, const TObject* input,
@@ -223,7 +223,7 @@
  
  template <typename T>
 -inline std::unique_ptr<T> TMessageSerializer::deserialize(gsl::span<std::byte> buffer)
-+inline std::unique_ptr<T> TMessageSerializer::deserialize(FairInputTBuffer & buffer)
++inline std::unique_ptr<T> TMessageSerializer::deserialize(FairInputTBuffer& buffer)
  {
    TClass* tgtClass = TClass::GetClass(typeid(T));
    if (tgtClass == nullptr) {
@@ -333,7 +333,7 @@
 diff --git a/Framework/Core/src/TMessageSerializer.cxx b/Framework/Core/src/TMessageSerializer.cxx
 --- a/Framework/Core/src/TMessageSerializer.cxx
 +++ b/Framework/Core/src/TMessageSerializer.cxx
-@@ -9,7 +9,42 @@
+@@ -9,7 +9,44 @@
  // granted to it by virtue of its status as an Intergovernmental Organization
  // or submit itself to any jurisdiction.
  #include <Framework/TMessageSerializer.h>
@@ -343,7 +343,8 @@
  
  using namespace o2::framework;
 +
-+void* FairOutputTBuffer::embedInItself(fair::mq::Message& msg) {
++void* FairOutputTBuffer::embedInItself(fair::mq::Message& msg)
++{
 +  // The first bytes of the message are used to store the pointer to the message itself
 +  // so that we can reallocate it if needed.
 +  if (sizeof(char*) > msg.GetSize()) {
@@ -356,7 +357,8 @@
 +}
 +
 +// Reallocation function. Get the message pointer from the data and call Rebuild.
-+char *FairOutputTBuffer::fairMQrealloc(char *oldData, size_t newSize, size_t oldSize) {
++char* FairOutputTBuffer::fairMQrealloc(char* oldData, size_t newSize, size_t oldSize)
++{
 +  // Old data is the pointer at the beginning of the message, so the pointer
 +  // to the message is **stored** in the 8 bytes before it.
 +  auto* msg = *(fair::mq::Message**)(oldData - sizeof(char*));
@@ -371,19 +373,20 @@
 +  // sure the old message is not deleted until the new one is ready.
 +  // We need 8 extra bytes for the pointer to the message itself (realloc does not know about it)
 +  // and we need to copy 8 bytes more than the old size (again, the extra pointer).
-+  msg->Rebuild(newSize+8, fair::mq::Alignment{64});
-+  memcpy(msg->GetData(), oldMsg->GetData(), oldSize+8);
++  msg->Rebuild(newSize + 8, fair::mq::Alignment{64});
++  memcpy(msg->GetData(), oldMsg->GetData(), oldSize + 8);
 +
 +  return reinterpret_cast<char*>(msg->GetData()) + sizeof(char*);
 +}
 diff --git a/Framework/Core/test/test_DataRefUtils.cxx b/Framework/Core/test/test_DataRefUtils.cxx
 --- a/Framework/Core/test/test_DataRefUtils.cxx
 +++ b/Framework/Core/test/test_DataRefUtils.cxx
-@@ -21,17 +21,37 @@
+@@ -21,17 +21,38 @@
  
  using namespace o2::framework;
  
-+TEST_CASE("PureRootTest") {
++TEST_CASE("PureRootTest")
++{
 +  TBufferFile buffer(TBuffer::kWrite);
 +  TObjString s("test");
 +  buffer.WriteObject(&s);
@@ -391,12 +394,12 @@
 +  TBufferFile buffer2(TBuffer::kRead, buffer.BufferSize(), buffer.Buffer(), false);
 +  buffer2.SetReadMode();
 +  buffer2.InitMap();
-+  TClass *storedClass = buffer2.ReadClass();
++  TClass* storedClass = buffer2.ReadClass();
 +  // ReadClass advances the buffer, so we need to reset it.
 +  buffer2.SetBufferOffset(0);
 +  buffer2.ResetMap();
 +  REQUIRE(storedClass != nullptr);
-+  auto *outS = (TObjString*)buffer2.ReadObjectAny(storedClass);
++  auto* outS = (TObjString*)buffer2.ReadObjectAny(storedClass);
 +  REQUIRE(outS != nullptr);
 +  REQUIRE(outS->GetString() == "test");
 +}
@@ -515,12 +518,12 @@
 +  auto msg = transport->CreateMessage(strlen(buffer) + 8);
 +  FairOutputTBuffer msg2(*msg);
 +  // The buffer starts after 8 bytes.
-+  REQUIRE(msg2.Buffer() == (char*)msg->GetData()+8);
++  REQUIRE(msg2.Buffer() == (char*)msg->GetData() + 8);
 +  // The first 8 bytes of the buffer store the pointer to the message itself.
 +  REQUIRE(*(fair::mq::Message**)msg->GetData() == msg.get());
 +  // Notice that TBuffer does the same trick with the reallocation function,
 +  // so in the end the useful buffer size is the message size minus 16.
-+  REQUIRE(msg2.BufferSize() == (msg->GetSize()-16));
++  REQUIRE(msg2.BufferSize() == (msg->GetSize() - 16));
 +  // This will not fit the original buffer size, so the buffer will be expanded.
 +  msg2.Expand(100);
  }

@TimoWilken any idea of what might have happened?

@ktf
Copy link
Member Author

ktf commented Feb 29, 2024

I think the issue is fixed by #12781, and indeed it should only affect the tests (because in general we do not deserialise twice the same buffer). Not sure how come 43b29ee tested correctly though.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

4 participants