Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

stack implementation and tests #4116

Closed
wants to merge 39 commits into from

Conversation

LeonidGoltsblat
Copy link
Contributor

@LeonidGoltsblat LeonidGoltsblat commented Oct 27, 2024

This is a stack container implementation for the pjsib library (pj_stack), i.e. the single-linked list with First In Last Out logic (FILO).
The stack is implemented as close as possible to the implementation of pj_list.
This implementation is thread safe. Common implementation uses internal locking mechanism so is thread-safe.
But implementation for Windows platform uses locking free Windows embeded single linked list implementation, which makes this implementation exceptionally fast.

This is one of the basic mechanisms used in a series of subsequent pull requests with the overall goal of improving pjsip's performance.

The pull request contains tests that can be used as usage examples.

pjlib/src/pj/stack.c is the common stack implementation
pjlib/src/pj/stack_win32.c is the implementation for Windows platform

see more info in embedded documentation

@sauwming
Copy link
Member

Thank you for the contribution. Perhaps you can tell us the reason of the introduction of this new data structure? (i.e. what future PR/feature do you plan to submit that will require the usage of stack?)

@LeonidGoltsblat
Copy link
Contributor Author

Sorry for the delay in replying, I will definitely reply, but I will be busy for a couple more days.

@LeonidGoltsblat
Copy link
Contributor Author

Sorry for the delay in replying,
Yes, I'm working on a few poll requests..
The main motivation for all of them is improved pjsip performance when servicing many calls at once

  • iocp (ioqueue_winnt). Some tests are now showing 10*n times perfomance enhanced compare to standard ioqueue_select
  • optimized udp_rtp transport (for example media stream changes implemented as it's closing and reopen should not force waiting states for other rtp stream)
  • parallelized conference bridge (yes, I see new commits with "async conference bridge", maybe I'm late...)
  • concept of "external data source" for "file player" - API for reading playback data from virtually any source (for example from a BLOB field in a database). As a special case, I implemented asynchronous pre-emptive file reading for use in IVR scenarios with many simultaneous file playbacks
  • some "small" changes, like ability to track lock wait duration and lock hold duration, and changes in pjsip algorithms to minimize the duration of such wait states.
    I often use "stack" implementation as a very quick alternative to "list", espessialy agressively in the previous point.

@sauwming
Copy link
Member

No worries at all about the delay, we are preparing for the final testing of 2.15 release so most likely, we can't merge this until the release anyway.

It's still not obvious to me from your examples when you need to use the stack? Typically in the SIP context, we use queue since we need to process the messages/events that come first (FIFO). In what specific case would you use LIFO?

@sauwming
Copy link
Member

For iocp, I would like to invite you to check PR #4136 and let us know your feedback, such as do you also encounter similar issue; if yes, have you also fixed it in your code; will that PR potentially conflict with what you're doing, etc.
Thanks in advance.

@LeonidGoltsblat
Copy link
Contributor Author

LeonidGoltsblat commented Nov 22, 2024

About iocp and PR #4136: I saw and fixed some issues with op_key reuse (WSAOVERLAPPED). I implemented a reference counting mechanism on key (OS HANDLE) that ensures that key is returned to free_list when iocp reports all pending operations complete (and we don't need closing_list now, only active_list and free_list). Of course, I should take a closer look at PR #4136 to comment more.
In iocp context, I use stack to save unused connect_op_keys (pj_ioqueue_connect() function prototype doesn't imply op_key, so I manage connect_op_keys internally). In this case, there is no difference between FIFO and LIFO, we can use any free op_key when we need a new one.
In other cases, the stack is used in a similar way: when we need to quickly get any instance of something. Another use case: reserving a free slot in a large array in a parallel multithreaded environment. All these cases are not actually SIP events, I'm using the stack as a fast container for additional data.
The only reason to use a stack rather than a list in all these cases is that Windows has a very efficient internal implementation that uses interlocked (atomic) operations rather than other "heavy" locks.

@LeonidGoltsblat
Copy link
Contributor Author

I added stack_stress_test() as a one of the typical examples when we can use the stack (reserving an empty slot in a large array without having to lock the entire array).

Copy link
Member

@sauwming sauwming left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just need to clean up the project files and you should be mostly good to go.

For an example of a clean project patch, you can check PR 4132 (https://github.com/pjsip/pjproject/pull/4132/files#diff-0c444d946963ac7a6c002817133fdab6c0cef69c43bd85eff0dcd9c6419ed7ad)

@LeonidGoltsblat
Copy link
Contributor Author

Just need to clean up the project files and you should be mostly good to go.

cleaned! i.e. line ending fixed
some "some irreleveant changes in the patch" moved to #4212

@LeonidGoltsblat
Copy link
Contributor Author

@nanangizz

Just did a random search, I found liblfds, from a quick check on header file lfds711_list_addonly_singlylinked_ordered.h, it has 'insert after', 'insert at' (if those are what you mean by full functionality), haven't checked further if those ops are really lock-free.

Anything is possible, but when I said complexity, I meant this: if we insert "new" before or after "old", we have to make sure that "old" still exists in the container. I think this requires some external synchronization, and if so, I think we have to use this synchronization with all other operations.

Re: lock-free or thread-safety. It is perhaps nice, but IMO not so urgent for now. If you search list operation (e.g: pj_list_push_back()) in the library source for example and check mutex protection around it, it is actually very rare that the mutex protects only an operation on a list, usually it protects other states or multiple list operations (e.g: is_empty() & push_back()). Perhaps that's why so far we've never considered adding thread-safety feature to the list (IIRC, nor heard suggestion on it). Even it may introduce unnecessary overhead which eventually degrades performance.

Yes, there are many such examples but there are also opposite situations showing that external synchronization is not always optimal.

A little search in the single file pjsu_core.c (only because I have fixed "Merged request detected" in this file some days ago)

pjsua_transport_create()

PJSUA_LOCK();
/* Find empty transport slot */
/* Create the transport */
/* Save the transport */

on_return:
PJSUA_UNLOCK();

We have an unnecessary long lock, but with pj_stack this can be done it like this:
/* Quickly! RESERVE empty transport slot form the stack of empty slots /
/
Create the transport /
PJSUA_LOCK();
/
Save the transport */
PJSUA_UNLOCK();
Here we have very short locking time.

I don't think we often call this function concurrently :) but pjsip uses such an algorithm everywhere and in some cases this global locking may lead to serious perfomance degradation.

another example from the same file

mod_pjsua_on_rx_request()
PJSUA_LOCK();
if (rdata->msg_info.msg->line.req.method.id == PJSIP_INVITE_METHOD) {
processed = pjsua_call_on_incoming(rdata);
}
PJSUA_UNLOCK();
Do we really need to protect rdata->msg_info.msg->line.req.method.id under the global lock? May be better so:
if (rdata->msg_info.msg->line.req.method.id == PJSIP_INVITE_METHOD) {
PJSUA_LOCK();
processed = pjsua_call_on_incoming(rdata);
PJSUA_UNLOCK();
}
We have a lot of modules registered, each of which gets a global lock and then in most cases realizes that it wasn't the packet for it. I don't have measurements, but I think we may have a real performance penalty here.

These examples just show that external locking is not always a good choice, and internal locking is not always a bad choice.

Copy link
Member

@sauwming sauwming left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These should be all from me.


#endif // PJ_STACK_IMPLEMENTATION

#include <pj/stack.h>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this recursive self inclusion intentional?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

mistake! Thanks!

* be aligned by 8 (for x86) or 16 (for x64) byte.
* pjsip build system define PJ_POOL_ALIGNMENT macro to corresponding value.
* winnt.h define MEMORY_ALLOCATION_ALIGNMENT macro for this purpose.
* To use this macro in build system we recomend (this is optional) to add #include <windows.h>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps better to put the "optional" note in the beginning of the paragraph, so it's clear to the reader.

* Stack in PJLIB is single-linked list with First In Last Out logic.
* Stack is thread safe. Common PJLIB stack implementation uses internal locking mechanism so is thread-safe.
* Implementation for Windows platform uses locking free Windows embeded single linked list implementation.
* The performance of pj_stack implementation for Windows platform is 2-5x higher than cross-platform.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Putting numbers here without test data may not be wise.
So may be just put "considerably" higher/faster.

* because the item count can be changed at any time by another thread.
* For Windows platform returns the number of entries in the stack modulo 65535. For example,
* if the specified stack contains 65536 entries, pj_stack_size returns zero.
*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove the Windows comment and @param and @return. Unnecessary here.

* because the item count can be changed at any time by another thread.
* For Windows platform returns the number of entries in the stack modulo 65535. For example,
* if the specified stack contains 65536 entries, pj_stack_size returns zero.
*
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove @param and @return doc.


for (i = 0; !rc && i < PJ_ARRAY_SIZE(tests); ++i) {
tests[i].state.pool = pool;
rc = stack_stress_test(&tests[i]);
Copy link
Member

@sauwming sauwming Dec 13, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

perhaps we should by default disable the stress test for non-Windows platform (create a macro such as #HAS_STRESS_TEST PJ_WIN32).

It will only burden the CI machines and there's little point in stress testing a relatively slower implementation of regular LIFO list + mutex.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need multithreaded testing for thread safe api, but currently the default implementation is cross platform so may be just Windows testing is enough.
Create a macro HAS_MT_STACK_STRESS_TEST

all fixed!

These should be all from me

I hope this will be the case this time! :)

@nanangizz
Copy link
Member

These examples just show that external locking is not always a good choice, and internal locking is not always a bad choice.

True, of course. I guess this may also answer your previous question?
Of course, we can add a "policy" to this object that will make it thread-safe or not, but is it really necessary?

@LeonidGoltsblat
Copy link
Contributor Author

@nanangizz
FYI: Doubly Linked Lists
This is WDM, not regular WinApi, and only for Windows, but may be your desire to have a lock-free FIFO can be realized!

# Conflicts:
#	pjlib/build/pjlib.vcxproj.filters
@LeonidGoltsblat
Copy link
Contributor Author

Hi!
New conflicts resolved. How about approving this PR?

@LeonidGoltsblat
Copy link
Contributor Author

Hello, colleagues!
Please see the first pj_stack usage in #4230!

About pj_stack...
Using C11's built in atomic support, it's easy to implement a cross-platform lock-free LIFO (and maybe FIFO too). What do you think of C11?

@bennylp bennylp self-requested a review December 23, 2024 07:37
@bennylp
Copy link
Member

bennylp commented Dec 31, 2024

Hi @LeonidGoltsblat, first of all thanks for your contribution, and your time for submitting this. We're just back from (pjsip) office holiday, so coding is definitely slow this time of year. :)

I've read your submission, and here are my comments. As others have said, my first reaction is, why do we need this? What problem does it solve, or what enhancement does it offer? I don't see your patch is addressing any of these two questions quantitatively. For example, if it proposes significant speed improvements, I would like to be convinced with the numbers. First the theoretical performance improvement (e.g. how long to push()/pop() 1 billion elements), and maybe the projected actual improvement (e.g. it probably won't matter to have 5x speed improvement, if the improvement is only 50 nanoseconds and the whole operation (such as adding/removing ioqueue key) takes 10 usec).

On the other hand, this is quite a significant submission, as it modifies some of our core header files, adds third party copyright (i.e you), and imposes certain alignment on all pool memory allocations. I would probably only check the alignment in the stack_win32.c, to avoid imposing changes on the global space. The naming between pj_stack_type and pj_stack_t is also confusing, I would probably use pj_stack_node_t for the node. Each of this is probably solvable, but again, we have to go back to the fundamental question, i.e. why do we need this. Also this only works on Windows.

So I tend to say we keep this aside for now until we can be convinced there is a real, significant usage for it.

@LeonidGoltsblat
Copy link
Contributor Author

Hi Benny, Merry Christmas and thanks for the feedback! I'm currently preparing a PR with a parallel version of the conference bridge. Hopefully it'll only take a few days, then I'll be back to discuss and happy to answer any questions you may have.
Best regards, Leonid

@bennylp
Copy link
Member

bennylp commented Jan 3, 2025

The parallel conf bridge sounds exciting! But if the purpose is to show another sample implementation where the lock free stack can be used, I'm afraid this is sounding more and more like the stack is a solution looking for a problem :)

@LeonidGoltsblat
Copy link
Contributor Author

@bennylp Please see #4241
By default, the stack is not used there (unfortunately) 🙂.

@sauwming
Copy link
Member

sauwming commented Jan 7, 2025

Just an idea, perhaps it's better renaming the data structure from stack to something like atomic_slist (to represent an atomic/interlocked singly linked list).

The reason is simple. Using stack to solve the various problems in the other PR seems kind of strange, and as @bennylp pointed out, sounds forced as if the stack is made to solve something it's not supposed to.

But with the more appropriate name, suddenly it just seems natural. To solve the problem of resource allocation (such as unused slots) stored in an array that requires synchronisation, it makes sense IMO to use an atomic_slist data structure.

And the data structure also opens the door to be used elsewhere in the library that currently uses the doubly-linked list declared in list.h but which can actually be safely changed into a singly linked version.

@bennylp
Copy link
Member

bennylp commented Jan 7, 2025

I'm closing this for now, with the following suggestion if similar work is to be resubmitted in the future:

  • please provide justification of the feature (bug fix? speed improvement results?)
  • use (currently hypothetical :) pj_pool_aligned_alloc() to localize memory alignment requirement to specific data rather than all pool allocations
  • use naming like pj_atomic_slist to make it more natural.

@bennylp bennylp closed this Jan 7, 2025
@LeonidGoltsblat
Copy link
Contributor Author

@bennylp

  • Can I expect the hypothetical pj_pool_aligned_alloc() function to become real?
  • atomic_slist (and potentially FIFO) can be implemented as thread-safe and cross-platform using C11's atomic support. What are your thoughts on adopting C11 now, in 2025?

@bennylp
Copy link
Member

bennylp commented Jan 13, 2025

@LeonidGoltsblat

  • yes

  • we've discussed this internally as recent as couple of weeks ago.

    The problem is, unlike C++ standards, newer C standards are not well supported by compilers (especially msvc). Hence we stick to gnu89 standard for now. Having said that, you can use c11's atomic as a one of implementation choices for the atomic_slist if you want, as long as there is alternative implementation for people who don't have c11.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants