stack implementation and tests #4116

LeonidGoltsblat · 2024-10-27T23:27:01Z

This is a stack container implementation for the pjsip library (pj_stack), i.e. a singly linked list with First In Last Out (FILO) logic.
The stack is implemented as close as possible to the implementation of pj_list.
This implementation is thread-safe. The common (cross-platform) implementation uses an internal locking mechanism.
The implementation for the Windows platform instead uses the lock-free singly linked list built into Windows, which makes it exceptionally fast.

This is one of the basic mechanisms used in a series of subsequent pull requests with the overall goal of improving pjsip's performance.

The pull request contains tests that can be used as usage examples.

pjlib/src/pj/stack.c is the common stack implementation
pjlib/src/pj/stack_win32.c is the implementation for Windows platform

see more info in embedded documentation

- some changes in sln to build with v143 build tools (VS 2022)
- 2 new pjsystest project configuration to build as Debug-Dynamic and Release-Dynamic
- stack implementation and testing incorporated into pjlib and pjlib-test projects

sauwming · 2024-11-11T01:53:00Z

Thank you for the contribution. Perhaps you can tell us the reason of the introduction of this new data structure? (i.e. what future PR/feature do you plan to submit that will require the usage of stack?)

LeonidGoltsblat · 2024-11-12T21:43:02Z

Sorry for the delay in replying, I will definitely reply, but I will be busy for a couple more days.

LeonidGoltsblat · 2024-11-21T21:40:54Z

Sorry for the delay in replying.
Yes, I'm working on a few pull requests.
The main motivation for all of them is improving pjsip performance when servicing many calls at once:

iocp (ioqueue_winnt). Some tests now show a 10*n times performance improvement compared to the standard ioqueue_select
optimized udp_rtp transport (for example, a media stream change implemented as closing and reopening the stream should not force wait states on other rtp streams)
parallelized conference bridge (yes, I see new commits with "async conference bridge", maybe I'm late...)
concept of "external data source" for "file player" - API for reading playback data from virtually any source (for example from a BLOB field in a database). As a special case, I implemented asynchronous pre-emptive file reading for use in IVR scenarios with many simultaneous file playbacks
some "small" changes, like the ability to track lock wait duration and lock hold duration, and changes in pjsip algorithms to minimize the duration of such wait states.
I often use the "stack" implementation as a very quick alternative to "list", especially aggressively in the previous point.

sauwming · 2024-11-22T01:45:33Z

No worries at all about the delay, we are preparing for the final testing of 2.15 release so most likely, we can't merge this until the release anyway.

It's still not obvious to me from your examples when you need to use the stack? Typically in the SIP context, we use queue since we need to process the messages/events that come first (FIFO). In what specific case would you use LIFO?

sauwming · 2024-11-22T07:33:12Z

For iocp, I would like to invite you to check PR #4136 and let us know your feedback, such as do you also encounter similar issue; if yes, have you also fixed it in your code; will that PR potentially conflict with what you're doing, etc.
Thanks in advance.

LeonidGoltsblat · 2024-11-22T23:14:11Z

About iocp and PR #4136: I saw and fixed some issues with op_key reuse (WSAOVERLAPPED). I implemented a reference counting mechanism on key (OS HANDLE) that ensures that key is returned to free_list when iocp reports all pending operations complete (and we don't need closing_list now, only active_list and free_list). Of course, I should take a closer look at PR #4136 to comment more.
In iocp context, I use stack to save unused connect_op_keys (pj_ioqueue_connect() function prototype doesn't imply op_key, so I manage connect_op_keys internally). In this case, there is no difference between FIFO and LIFO, we can use any free op_key when we need a new one.
In other cases, the stack is used in a similar way: when we need to quickly get any instance of something. Another use case: reserving a free slot in a large array in a parallel multithreaded environment. All these cases are not actually SIP events, I'm using the stack as a fast container for additional data.
The only reason to use a stack rather than a list in all these cases is that Windows has a very efficient internal implementation that uses interlocked (atomic) operations rather than other "heavy" locks.

LeonidGoltsblat · 2024-11-26T00:09:26Z

I added stack_stress_test() as one of the typical examples of when we can use the stack (reserving an empty slot in a large array without having to lock the entire array).
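The slot-reservation idea can be sketched as follows. This is a minimal, portable illustration in the spirit of the lock-based cross-platform variant, not the pj_stack API: `free_slots`, `free_slots_reserve()` and `free_slots_release()` are hypothetical names, and a pthread mutex stands in for whatever lock the real implementation uses. The lock guards only the small stack of free indices, so the array of slots itself never has to be locked to find an empty entry.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define SLOT_COUNT 4
#define NO_SLOT    (-1)

/* Hypothetical names throughout; this is not the pj_stack API. */
typedef struct free_slots {
    pthread_mutex_t lock;      /* guards only the stack of free indices */
    int next[SLOT_COUNT];      /* next[i]: the free slot below i on the stack */
    int top;                   /* index of the top free slot, or NO_SLOT */
} free_slots;

/* Start with every slot free: push indices 0..SLOT_COUNT-1. */
static void free_slots_init(free_slots *fs)
{
    int i;
    pthread_mutex_init(&fs->lock, NULL);
    fs->top = NO_SLOT;
    for (i = 0; i < SLOT_COUNT; ++i) {
        fs->next[i] = fs->top;
        fs->top = i;
    }
}

/* Pop any free slot index; returns NO_SLOT when none is left. */
static int free_slots_reserve(free_slots *fs)
{
    int slot;
    pthread_mutex_lock(&fs->lock);
    slot = fs->top;
    if (slot != NO_SLOT)
        fs->top = fs->next[slot];
    pthread_mutex_unlock(&fs->lock);
    return slot;
}

/* Push a slot index back once the slot is no longer in use. */
static void free_slots_release(free_slots *fs, int slot)
{
    assert(slot >= 0 && slot < SLOT_COUNT);
    pthread_mutex_lock(&fs->lock);
    fs->next[slot] = fs->top;
    fs->top = slot;
    pthread_mutex_unlock(&fs->lock);
}
```

A caller reserves an index, fills the corresponding array entry without holding any lock, and releases the index when the slot becomes free again.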

decreased the repeat counter increased the number of threads

sauwming

Just need to clean up the project files and you should be mostly good to go.

For an example of a clean project patch, you can check PR 4132 (https://github.com/pjsip/pjproject/pull/4132/files#diff-0c444d946963ac7a6c002817133fdab6c0cef69c43bd85eff0dcd9c6419ed7ad)

LeonidGoltsblat · 2024-12-12T19:31:58Z

Just need to clean up the project files and you should be mostly good to go.

Cleaned! I.e. line endings fixed.
The "some irrelevant changes in the patch" were moved to #4212.

LeonidGoltsblat · 2024-12-12T20:42:24Z

@nanangizz

Just did a random search, I found liblfds, from a quick check on header file lfds711_list_addonly_singlylinked_ordered.h, it has 'insert after', 'insert at' (if those are what you mean by full functionality), haven't checked further if those ops are really lock-free.

Anything is possible, but when I said complexity, I meant this: if we insert "new" before or after "old", we have to make sure that "old" still exists in the container. I think this requires some external synchronization, and if so, I think we have to use this synchronization with all other operations.

Re: lock-free or thread-safety. It is perhaps nice, but IMO not so urgent for now. If you search list operation (e.g: pj_list_push_back()) in the library source for example and check mutex protection around it, it is actually very rare that the mutex protects only an operation on a list, usually it protects other states or multiple list operations (e.g: is_empty() & push_back()). Perhaps that's why so far we've never considered adding thread-safety feature to the list (IIRC, nor heard suggestion on it). Even it may introduce unnecessary overhead which eventually degrades performance.

Yes, there are many such examples but there are also opposite situations showing that external synchronization is not always optimal.

A quick search in a single file, pjsua_core.c (chosen only because I fixed "Merged request detected" in this file a few days ago):

pjsua_transport_create()

PJSUA_LOCK();
/* Find empty transport slot */
/* Create the transport */
/* Save the transport */

on_return:
PJSUA_UNLOCK();

We have an unnecessarily long lock, but with pj_stack it can be done like this:
/* Quickly! RESERVE an empty transport slot from the stack of empty slots */
/* Create the transport */
PJSUA_LOCK();
/* Save the transport */
PJSUA_UNLOCK();
Here we have a very short locking time.

I don't think we often call this function concurrently :) but pjsip uses such an algorithm everywhere, and in some cases this global locking may lead to serious performance degradation.

another example from the same file

mod_pjsua_on_rx_request()
PJSUA_LOCK();
if (rdata->msg_info.msg->line.req.method.id == PJSIP_INVITE_METHOD) {
processed = pjsua_call_on_incoming(rdata);
}
PJSUA_UNLOCK();
Do we really need to protect rdata->msg_info.msg->line.req.method.id under the global lock? Maybe it would be better like this:
if (rdata->msg_info.msg->line.req.method.id == PJSIP_INVITE_METHOD) {
PJSUA_LOCK();
processed = pjsua_call_on_incoming(rdata);
PJSUA_UNLOCK();
}
We have a lot of modules registered, each of which gets a global lock and then in most cases realizes that it wasn't the packet for it. I don't have measurements, but I think we may have a real performance penalty here.

These examples just show that external locking is not always a good choice, and internal locking is not always a bad choice.

sauwming

These should be all from me.

sauwming · 2024-12-13T04:50:36Z

pjlib/include/pj/stack.h

+
+#endif // PJ_STACK_IMPLEMENTATION
+
+#include <pj/stack.h>


Is this recursive self inclusion intentional?

mistake! Thanks!

sauwming · 2024-12-13T04:58:12Z

pjlib/include/pj/stack.h

+ * be aligned by 8 (for x86) or 16 (for x64) byte. 
+ * pjsip build system define PJ_POOL_ALIGNMENT macro to corresponding value.
+ * winnt.h define MEMORY_ALLOCATION_ALIGNMENT macro for this purpose.
+ * To use this macro in build system we recomend (this is optional) to add #include <windows.h> 


Perhaps better to put the "optional" note in the beginning of the paragraph, so it's clear to the reader.

sauwming · 2024-12-13T05:00:40Z

pjlib/include/pj/stack.h

+ * Stack in PJLIB is single-linked list with First In Last Out logic. 
+ * Stack is thread safe. Common PJLIB stack implementation uses internal locking mechanism so is thread-safe.
+ * Implementation for Windows platform uses locking free Windows embeded single linked list implementation.
+ * The performance of pj_stack implementation for Windows platform is 2-5x higher than cross-platform.


Putting numbers here without test data may not be wise.
So maybe just put "considerably" higher/faster.

sauwming · 2024-12-13T05:07:46Z

pjlib/src/pj/stack.c

+ * because the item count can be changed at any time by another thread.
+ * For Windows platform returns the number of entries in the stack modulo 65535. For example,
+ * if the specified stack contains 65536 entries, pj_stack_size returns zero.
+ *


Remove the Windows comment and @param and @return. Unnecessary here.
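
The example in the quoted doc comment (65536 entries reported as zero) is what a count that wraps at 16 bits would give, that is, a remainder modulo 65536 rather than 65535. A minimal sketch of that arithmetic, assuming the depth is reported through a 16-bit unsigned value; `reported_depth()` is a hypothetical helper, not part of pj_stack or the Windows API:

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: the value a caller would see if the true depth
 * were reported through a 16-bit unsigned counter (an assumption). */
static uint16_t reported_depth(uint32_t true_depth)
{
    /* The conversion keeps the low 16 bits: true_depth % 65536. */
    return (uint16_t)true_depth;
}
```

Under that assumption, 65535 entries are reported as 65535, 65536 as 0, and 65537 as 1.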

sauwming · 2024-12-13T05:11:20Z

pjlib/src/pj/stack_win32.c

+ * because the item count can be changed at any time by another thread.
+ * For Windows platform returns the number of entries in the stack modulo 65535. For example,
+ * if the specified stack contains 65536 entries, pj_stack_size returns zero.
+ *


Remove @param and @return doc.

sauwming · 2024-12-13T05:17:59Z

pjlib/src/pjlib-test/stack.c

+
+    for (i = 0; !rc && i < PJ_ARRAY_SIZE(tests); ++i) {
+        tests[i].state.pool = pool;
+        rc = stack_stress_test(&tests[i]);


perhaps we should by default disable the stress test for non-Windows platform (create a macro such as #HAS_STRESS_TEST PJ_WIN32).

It will only burden the CI machines and there's little point in stress testing a relatively slower implementation of regular LIFO list + mutex.

I think we need multithreaded testing for a thread-safe API, but currently the default implementation is cross-platform, so maybe Windows-only testing is enough.
Create a macro HAS_MT_STACK_STRESS_TEST
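
One way such a gate could look is sketched below. It assumes PJ_WIN32 is defined to a non-zero value only on Windows builds; `basic_stack_test()`, `stack_stress_test_all()` and `stack_test_entry()` are hypothetical stand-ins, not the functions in pjlib-test/stack.c.

```c
#include <assert.h>

/* Assumption: PJ_WIN32 is non-zero only when building for Windows. */
#if defined(PJ_WIN32) && PJ_WIN32 != 0
#  define HAS_MT_STACK_STRESS_TEST 1
#else
#  define HAS_MT_STACK_STRESS_TEST 0
#endif

/* Hypothetical stand-in for the single-threaded tests. */
static int basic_stack_test(void)
{
    return 0;
}

#if HAS_MT_STACK_STRESS_TEST
/* Hypothetical stand-in for the multithreaded stress tests. */
static int stack_stress_test_all(void)
{
    return 0;
}
#endif

/* Runs the basic tests everywhere and the stress tests only where the
 * macro enables them; returns 0 on success. */
static int stack_test_entry(void)
{
    int rc = basic_stack_test();
#if HAS_MT_STACK_STRESS_TEST
    if (rc == 0)
        rc = stack_stress_test_all();
#endif
    return rc;
}
```

Defining the macro from the command line or a config header would let a non-Windows builder opt in to the stress test when needed.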

all fixed!

These should be all from me

I hope this will be the case this time! :)

nanangizz · 2024-12-13T06:17:14Z

These examples just show that external locking is not always a good choice, and internal locking is not always a bad choice.

True, of course. I guess this may also answer your previous question?
Of course, we can add a "policy" to this object that will make it thread-safe or not, but is it really necessary?

LeonidGoltsblat · 2024-12-13T19:02:03Z

@nanangizz
FYI: Doubly Linked Lists
This is WDM, not the regular WinAPI, and only for Windows, but maybe your wish for a lock-free FIFO can be realized!

# Conflicts: # pjlib/build/pjlib.vcxproj.filters

LeonidGoltsblat · 2024-12-18T21:53:01Z

Hi!
New conflicts resolved. How about approving this PR?

LeonidGoltsblat · 2024-12-23T00:52:26Z

Hello, colleagues!
Please see the first pj_stack usage in #4230!

About pj_stack...
Using C11's built in atomic support, it's easy to implement a cross-platform lock-free LIFO (and maybe FIFO too). What do you think of C11?
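
For illustration, a lock-free LIFO built on C11 `<stdatomic.h>` might look like the sketch below. It is an example under stated assumptions, not code from this PR: `lf_stack`, `lf_push()` and `lf_pop()` are hypothetical names, nodes are identified by array index rather than by pointer, and a tag counter packed next to the top index is used to guard against the ABA problem. It is only lock-free where a 64-bit atomic compare-and-swap is lock-free, and the caller must never push an index that is already on the stack.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdint.h>

#define NODE_COUNT 8u
#define NIL        UINT32_MAX

/* Hypothetical type, not pj_stack: a LIFO of node indices.
 * head packs a modification tag (high 32 bits) with the top index
 * (low 32 bits); the tag changes on every successful update. */
typedef struct {
    _Atomic uint64_t head;
    _Atomic uint32_t next[NODE_COUNT];
} lf_stack;

static uint64_t lf_pack(uint32_t tag, uint32_t idx)
{
    return ((uint64_t)tag << 32) | idx;
}

static void lf_init(lf_stack *s)
{
    uint32_t i;
    atomic_init(&s->head, lf_pack(0u, NIL));
    for (i = 0; i < NODE_COUNT; ++i)
        atomic_init(&s->next[i], NIL);
}

static void lf_push(lf_stack *s, uint32_t idx)
{
    uint64_t old_head, new_head;
    assert(idx < NODE_COUNT);
    old_head = atomic_load(&s->head);
    do {
        /* Link the new node to the current top, then try to publish it. */
        atomic_store(&s->next[idx], (uint32_t)old_head);
        new_head = lf_pack((uint32_t)(old_head >> 32) + 1u, idx);
    } while (!atomic_compare_exchange_weak(&s->head, &old_head, new_head));
}

/* Returns the popped index, or NIL if the stack is empty. */
static uint32_t lf_pop(lf_stack *s)
{
    uint64_t old_head = atomic_load(&s->head);
    for (;;) {
        uint32_t idx = (uint32_t)old_head;
        uint32_t below;
        uint64_t new_head;
        if (idx == NIL)
            return NIL;
        below = atomic_load(&s->next[idx]);
        new_head = lf_pack((uint32_t)(old_head >> 32) + 1u, below);
        /* On failure old_head is refreshed and the loop retries. */
        if (atomic_compare_exchange_weak(&s->head, &old_head, new_head))
            return idx;
    }
}
```

Using indices plus a tag keeps the head in one 64-bit word; a pointer-based variant would need a double-width compare-and-swap or another reclamation scheme to get the same protection.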

bennylp · 2024-12-31T03:02:02Z

Hi @LeonidGoltsblat, first of all thanks for your contribution, and your time for submitting this. We're just back from (pjsip) office holiday, so coding is definitely slow this time of year. :)

I've read your submission, and here are my comments. As others have said, my first reaction is, why do we need this? What problem does it solve, or what enhancement does it offer? I don't see your patch is addressing any of these two questions quantitatively. For example, if it proposes significant speed improvements, I would like to be convinced with the numbers. First the theoretical performance improvement (e.g. how long to push()/pop() 1 billion elements), and maybe the projected actual improvement (e.g. it probably won't matter to have 5x speed improvement, if the improvement is only 50 nanoseconds and the whole operation (such as adding/removing ioqueue key) takes 10 usec).

On the other hand, this is quite a significant submission, as it modifies some of our core header files, adds third party copyright (i.e. you), and imposes certain alignment on all pool memory allocations. I would probably only check the alignment in the stack_win32.c, to avoid imposing changes on the global space. The naming between pj_stack_type and pj_stack_t is also confusing, I would probably use pj_stack_node_t for the node. Each of these is probably solvable, but again, we have to go back to the fundamental question, i.e. why do we need this. Also this only works on Windows.

So I tend to say we keep this aside for now until we can be convinced there is a real, significant usage for it.

LeonidGoltsblat · 2025-01-01T00:12:46Z

Hi Benny, Merry Christmas and thanks for the feedback! I'm currently preparing a PR with a parallel version of the conference bridge. Hopefully it'll only take a few days, then I'll be back to discuss and happy to answer any questions you may have.
Best regards, Leonid

bennylp · 2025-01-03T09:50:31Z

The parallel conf bridge sounds exciting! But if the purpose is to show another sample implementation where the lock free stack can be used, I'm afraid this is sounding more and more like the stack is a solution looking for a problem :)

LeonidGoltsblat · 2025-01-05T23:00:08Z

@bennylp Please see #4241
By default, the stack is not used there (unfortunately) 🙂.

sauwming · 2025-01-07T04:45:00Z

Just an idea, perhaps it's better renaming the data structure from stack to something like atomic_slist (to represent an atomic/interlocked singly linked list).

The reason is simple. Using stack to solve the various problems in the other PR seems kind of strange, and as @bennylp pointed out, sounds forced as if the stack is made to solve something it's not supposed to.

But with the more appropriate name, suddenly it just seems natural. To solve the problem of resource allocation (such as unused slots) stored in an array that requires synchronisation, it makes sense IMO to use an atomic_slist data structure.

And the data structure also opens the door to be used elsewhere in the library that currently uses the doubly-linked list declared in list.h but which can actually be safely changed into a singly linked version.

bennylp · 2025-01-07T06:18:05Z

I'm closing this for now, with the following suggestion if similar work is to be resubmitted in the future:

please provide justification of the feature (bug fix? speed improvement results?)
use (currently hypothetical :) pj_pool_aligned_alloc() to localize memory alignment requirement to specific data rather than all pool allocations
use naming like pj_atomic_slist to make it more natural.
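
As an illustration of localizing the alignment requirement to the data that needs it, the sketch below aligns only the list nodes. Since pj_pool_aligned_alloc() is hypothetical at this point in the thread, the example uses the standard C11 `aligned_alloc()` instead; `slist_node`, `NODE_ALIGNMENT` and `node_alloc_aligned()` are assumed names, and 16 is the x64 figure quoted in the review of stack.h.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Assumed alignment: the x64 value mentioned in the review. */
#define NODE_ALIGNMENT 16u

/* Hypothetical node type for an atomic singly linked list. */
typedef struct slist_node {
    struct slist_node *next;
    int payload;
} slist_node;

/* Allocate one node whose address is a multiple of NODE_ALIGNMENT.
 * aligned_alloc() expects the size to be a multiple of the alignment,
 * so the size is rounded up. Release the result with free(). */
static slist_node *node_alloc_aligned(void)
{
    size_t size = (sizeof(slist_node) + NODE_ALIGNMENT - 1u)
                  / NODE_ALIGNMENT * NODE_ALIGNMENT;
    return (slist_node *)aligned_alloc(NODE_ALIGNMENT, size);
}
```

Only the nodes pay for the stricter alignment; every other allocation keeps the default.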

LeonidGoltsblat · 2025-01-10T21:38:02Z

@bennylp

1. Can I expect the hypothetical pj_pool_aligned_alloc() function to become real?
2. atomic_slist (and potentially FIFO) can be implemented as thread-safe and cross-platform using C11's atomic support. What are your thoughts on adopting C11 now, in 2025?

bennylp · 2025-01-13T03:12:10Z

@LeonidGoltsblat

1. Yes.
2. We discussed this internally as recently as a couple of weeks ago.

The problem is that, unlike C++ standards, newer C standards are not well supported by compilers (especially msvc). Hence we stick to the gnu89 standard for now. Having said that, you can use C11 atomics as one of the implementation choices for the atomic_slist if you want, as long as there is an alternative implementation for people who don't have C11.

LeonidGoltsblat added 14 commits October 26, 2024 00:16

vs2022 compatible project files

c011722

stack implementation and tests

81996e6

code style fix

b2ab472

fix to Added more logs (pjsip#4098)

79062c1

remove Debug-Dynamic-Client and Release-Dynamic-Client configurations

267e1e0

use traditional select ioqueue instead of iocp

79f47fb

re-enabled lost stack test

95f49c1

Merge branch 'stack' into pj_stack

97ba6f3

untabify

24c959f

makefiles fixes

48f47a2

VS projects and solution files changes

dd3bdc3

- some changes in sln to build with v143 build tools (VS 2022)
- 2 new pjsystest project configuration to build as Debug-Dynamic and Release-Dynamic
- stack implementation and testing incorporated into pjlib and pjlib-test projects

define appropriate PJ_POOL_ALIGNMENT for Windows platform

52e0086

make files fix

6a331a1

stack implementation files added to aconfugure.*

5fec115

LeonidGoltsblat and others added 6 commits November 24, 2024 20:25

merge conflicts resolved

547b72d

merge conflicts resolved

61e69ef

Merge branch 'master' into pj_stack

722022d

added some comments about data alignment and macros to control alignment

44dd759

Merge remote-tracking branch 'myfork/pj_stack' into pj_stack

0588d57

stack_stress_test() added

f0eb2c4

LeonidGoltsblat and others added 3 commits November 28, 2024 23:55

fix pj_stack multithreading for non-windows platforms

246a20a

stack tests

5db76f6

decreased the repeat counter increased the number of threads

Merge branch 'pjsip:master' into pj_stack

1eea899

LeonidGoltsblat added 2 commits December 11, 2024 20:10

line ending

88e02f3

undone ..

445f8a1

sauwming reviewed Dec 12, 2024

LeonidGoltsblat added 3 commits December 12, 2024 19:44

line ending

9f38c20

rewind some unrelated changes

40f57cc

line ending

934ba4a

unrelated...

bb1307d

sauwming reviewed Dec 13, 2024

LeonidGoltsblat added 2 commits December 13, 2024 20:32

little formal changes

9841571

fix

a102249

sauwming approved these changes Dec 16, 2024

Merge remote-tracking branch 'github.com/master' into pj_stack

a424e39

# Conflicts: # pjlib/build/pjlib.vcxproj.filters

nanangizz requested a review from trengginas December 19, 2024 00:18

LeonidGoltsblat mentioned this pull request Dec 23, 2024

Lock free audio #4230

Closed

bennylp self-requested a review December 23, 2024 07:37

bennylp closed this Jan 7, 2025
