Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sporadic Testcase failure: pubsub_connection_test #1880

Open
KerstinKeller opened this issue Jan 8, 2025 · 9 comments
Open

Sporadic Testcase failure: pubsub_connection_test #1880

KerstinKeller opened this issue Jan 8, 2025 · 9 comments

Comments

@KerstinKeller
Copy link
Contributor

KerstinKeller commented Jan 8, 2025

Problem Description

15: [ RUN ] core_cpp_pubsub.TestPublisherIsSubscribedTiming
15: D:\a\ecal\ecal\ecal\tests\cpp\pubsub_test\src\pubsub_connection_test.cpp(220): error: Value of: publisher_seen_at_subscription_start
15: Actual: false
15: Expected: true
15:
15: [ FAILED ] core_cpp_pubsub.TestPublisherIsSubscribedTiming (1068 ms)

Also recently:

15: [ RUN ] core_cpp_pubsub.DynamicCreate
15: D:\a\ecal\ecal\ecal\tests\cpp\pubsub_test\src\pubsub_test.cpp(334): error: Value of: pub->Send(send_s)
15: Actual: false
15: Expected: true
15:
15: D:\a\ecal\ecal\ecal\tests\cpp\pubsub_test\src\pubsub_test.cpp(340): error: Expected equality of these values:
15: send_s.size()
15: Which is: 1024
15: g_callback_received_bytes
15: Which is: 0
15:
15: [ FAILED ] core_cpp_pubsub.DynamicCreate (6218 ms)

How to reproduce

current master

How did you get eCAL?

I don't know

Environment

eCAL System Information

No response

@Peguen
Copy link
Contributor

Peguen commented Jan 13, 2025

Also failed in V5 implementation on master build:

2025-01-10T17:52:19.7421456Z 16: [ RUN ] core_cpp_pubsub_v5.TestPublisherIsSubscribedTiming
2025-01-10T17:52:20.7687715Z 16: 1736531539743 ms | fv-az1695-630 | publisher_is_subscribed | 21757 | error | Could not load eCAL time sync module libecaltime-localtime.so
2025-01-10T17:52:20.7688586Z 16: /home/runner/work/ecal/ecal/ecal/tests/cpp/pubsub_v5_test/src/pubsub_connection_test.cpp:220: Failure
2025-01-10T17:52:20.7689116Z 16: Value of: publisher_seen_at_subscription_start
2025-01-10T17:52:20.7689392Z 16: Actual: false
2025-01-10T17:52:20.7689571Z 16: Expected: true
2025-01-10T17:52:20.7689734Z 16:
2025-01-10T17:52:20.7690546Z 16: [ FAILED ] core_cpp_pubsub_v5.TestPublisherIsSubscribedTiming (1027 ms)

@DownerCase
Copy link
Contributor

I may have seen this bug in the wild as well. Can reproduce a problem reliably with eCAL monitor (tui).

  1. Have at least two publishers
  2. Inspect one topic - works
  3. Look at a different one (deletes old sub, creates new one) - works
  4. Look at the original topic - broken
  5. At this point you will no longer be able to receive messages

Appears to be an issue with subscribing, destroying, and resubscribing to a topic.
Restarting just the monitor will allow it to work again (initially) so it seems to be localised to the receiving end.

@KerstinKeller
Copy link
Contributor Author

@DownerCase I cannot unfortunately reproduce the usecase with ecal mon tui.
Which platform did you see this on? What ecal.yaml configuration was loaded?

@DownerCase
Copy link
Contributor

DownerCase commented Jan 16, 2025

Platform:

  • Linux x86_64 (Arch linux),
  • gcc 14.2.1
  • master 5484c2b
  • No system or user installation of eCAL
    • (So no config installed outside of build tree that I can find)

Steps:

  • cmake --preset cli
  • cmake --build --preset cli
  • Run concurrently:
    • ./out/cli/build/bin/ecal_mon_tui
    • ./out/cli/build/bin/ecal_sample_person_snd
    • ./out/cli/build/bin/ecal_sample_minimal_snd
  • In the monitor:
    • Press enter to look at the messages from one topic - Works (messages visible after short delay)
    • Press down to switch to the other topic - Works (messages visible after short delay)
    • Press up to go back to the original topic - Broken (messages never populate)
      • The monitor indicates 1 local connection for the topic but no data goes through

Initial tests weren't loading a config (Specified yaml configuration path not valid:"ecal.yaml". Using default configuration.) but just tested with successful default config loading and problem persists.

Was also a warning about loading ecaltime plugin but even with it the problem persists.

I initially saw this with my Go eCAL monitor so its not specific to ecal_mon_tui

Edit: Just trying a Debug build to see if that unearths anything useful. 🤞
Edit: Ok, I enabled all logs with the config but don't see anything immediately. If you can't repro I can share them.

@KerstinKeller
Copy link
Contributor Author

Will try on a Unix platform tomorrow. Hopefully can reproduce, though there must be more to it, since we do have testcases for this scenario, too.

Can you run the whole testsuite, too? Do you see any kinds of issue there?

@DownerCase
Copy link
Contributor

DownerCase commented Jan 16, 2025

Can you run the whole testsuite, too? Do you see any kinds of issue there?

Just the one failure in somewhere different to the reports in this issue...

ecal/tests/cpp/registration_test_public/src/registration_gettopics.cpp:181: Failure
Expected equality of these values:
  number_publishers_seen
    Which is: 138
  0

ecal/tests/cpp/registration_test_public/src/registration_gettopics.cpp:182: Failure
Expected equality of these values:
  max_number_publishers_seen
    Which is: 1968
  max_publisher_count
    Which is: 2000

I ran the test_pubsub* tests on loop for a while and didn't find any test failures so not directly the same as your findings.

Edit: The registration_test_public tests sometimes all pass :/

@KerstinKeller
Copy link
Contributor Author

Recently on Github Actions:

13: [ RUN ] core_cpp_pubsub_callback_topid_id/TestFixture.OnePubSub/2
13: [eCAL] Log path does not exist: C:\Users\runneradmin\AppData\Local\eCAL\logs
13: [eCAL] Log path does not exist: C:\ProgramData\eCAL\logs
13: D:\a\ecal\ecal\ecal\tests\cpp\pubsub_test\src\pubsub_callback_topicid.cpp(140): error: Expected equality of these values:
13: callback_topic_id
13: Which is: STopicId(topic_id: SEntityId(entity_id: 0, process_id: 0, host_name: ), topic_name: )
13: pub_id
13: Which is: STopicId(topic_id: SEntityId(entity_id: 2658764291700, process_id: 6432, host_name: fv-az1783-444), topic_name: foo)
13:
13: D:\a\ecal\ecal\ecal\tests\cpp\pubsub_test\src\pubsub_callback_topicid.cpp(141): error: Expected equality of these values:
13: callback_datatype_info
13: Which is: 96-byte object <00-00 00-00 00-00 00-00 0F-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 0F-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 0F-00 00-00 00-00 00-00 00-00 00-00 00-00 00-00 00-32 37-2E 32-35 35-2E 00-00 00-00 00-00 00-00 0F-00 00-00 00-00 00-00>
13: datatype_info
13: Which is: 96-byte object <6D-79 74-79 70-65 00-00 0F-00 00-00 00-00 00-00 06-00 00-00 00-00 00-00 0F-00 00-00 00-00 00-00 74-65 73-74 00-00 00-00 00-33 39-2E 30-2E 30-2E 04-00 00-00 00-00 00-00 0F-00 00-00 00-00 00-00 64-65 73-63 00-00 00-00 03-00 00-00 FC-7F 00-00 04-00 00-00 00-00 00-00 0F-00 00-00 00-00 00-00>
13:
13: 1737053520011 ms | fv-az1783-444 | callback_topic_id | 6432 | error | CTCPReaderLayer - TCPPubSub (Error) -SubscriberSession ?->fv-az1783-444:51360: Error reading header length: The network connection was aborted by the local system.
13: 1737053520011 ms | fv-az1783-444 | callback_topic_id | 6432 | error | CTCPReaderLayer - TCPPubSub (Error) -Publisher ?: Error while waiting for subsriber: The I/O operation has been aborted because of either a thread exit or an application request.
13: [ FAILED ] core_cpp_pubsub_callback_topid_id/TestFixture.OnePubSub/2, where GetParam() = Config Layers: TCP (304 ms)

@KerstinKeller
Copy link
Contributor Author

@DownerCase the case which you are showing might be a performance issue.
We will be working on this issue post 6.0 by employing some new techniques for larger scale systems (>1000 pub / sub connections per system).

So about this testcase I would not be too worried, however all that is failing on Github Actions, I think there are really some underlying problems...

@DownerCase
Copy link
Contributor

Sounds good. Later on I was unable to get any tests to fail 🤷

I will try and figure the problem I'm seeing which seems different to the problem here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants