LKL latency/throughput #357

Open
speedingdaemon opened this issue Jul 11, 2017 · 12 comments

@speedingdaemon

I was reading this awesome paper written by Jerry: https://netdevconf.org/1.2/papers/jerry_chu.pdf
It mentions that the performance of LKL in user space did not beat that of the host OS. Is that still true?
Has anyone done more performance benchmarking to get additional latency/throughput numbers?
Curious what to expect from a TCP proxy service built on LKL.

@liuyuan10
Member

liuyuan10 commented Jul 12, 2017 via email

@speedingdaemon
Author

How would you guys compare the performance of LKL vs mTCP?

@linhua55

@speedingdaemon
Thanks to @thehajime, I integrated LKL into the rinetd proxy in order to use BBR on an OpenVZ VPS. You can refer to
https://github.com/linhua55/lkl_study
https://github.com/linhua55/rinetd

It uses the raw-socket backend, but with some modified code: OpenVZ's venet0 network interface is a "cooked" interface, so its raw packets have no MAC layer (14 bytes). It can't use AF_PACKET/SOCK_RAW; it can only use AF_PACKET/SOCK_DGRAM.

CPU usage is quite high, though. Recently I have been looking at using TPACKET_V2 (packet mmap) to achieve zero copy and reduce some of that CPU.
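
For reference, here is a minimal sketch of the SOCK_DGRAM ("cooked") variant of an AF_PACKET socket, which is roughly what a backend has to open on an interface like venet0 that carries no link-layer header; the interface name and error handling are illustrative, not taken from the actual patches:

```c
/* Minimal sketch: open an AF_PACKET socket in cooked (SOCK_DGRAM) mode so the
 * host kernel strips/prepends the link-layer header. This is what an interface
 * without a MAC layer (e.g. OpenVZ's venet0) requires instead of SOCK_RAW. */
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <net/if.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <sys/socket.h>

int open_cooked_packet_socket(const char *ifname)
{
	int fd = socket(AF_PACKET, SOCK_DGRAM, htons(ETH_P_ALL));
	if (fd < 0)
		return -1;

	struct sockaddr_ll sll;
	memset(&sll, 0, sizeof(sll));
	sll.sll_family = AF_PACKET;
	sll.sll_protocol = htons(ETH_P_ALL);
	sll.sll_ifindex = if_nametoindex(ifname);

	/* Bind to the interface; received payloads start at the IP header. */
	if (bind(fd, (struct sockaddr *)&sll, sizeof(sll)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```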

@speedingdaemon
Author

speedingdaemon commented Jul 17, 2017

@linhua55 @thehajime @tavip
I have an async server running on my blade.
Here is what I see:

  1. Server running on the host kernel as a user process (regular socket calls; I did not set the incoming FD to non-blocking) - 400 connections per second. I ran this variant only because of the issue mentioned in "Does LKL support non-blocking sockets?" #360.
  2. Server running on the host kernel as a user process (regular socket calls, with the incoming FD set to non-blocking; see the sketch after this list) - 820 connections per second.
  3. Server running on the host as a user process but linked against LKL (LKL socket calls; I did not set the incoming FD to non-blocking, see the same issue "Does LKL support non-blocking sockets?" #360) - 369 connections per second (about 92% of the host result from experiment 1 above).
  4. Server running on the host as a user process, but using the mTCP stack instead of LKL/host - 40k connections per second.
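
For reference, a minimal sketch of the non-blocking setup used in experiments 2 and 3 above: the host side is plain fcntl(), while the LKL side is assumed to go through LKL's generated syscall wrappers (lkl_sys_fcntl and the LKL_* constants are assumptions about the wrapper/constant names, not verified here):

```c
#include <fcntl.h>
#include <lkl.h>   /* assumed: LKL user-space API header providing lkl_sys_*() */

/* Host-kernel socket: the standard fcntl() dance. */
static int set_host_nonblocking(int fd)
{
	int flags = fcntl(fd, F_GETFL, 0);
	if (flags < 0)
		return -1;
	return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* LKL socket: the same idea routed through LKL's syscall wrappers.
 * lkl_sys_fcntl() and the LKL_* constants are assumed names. */
static int set_lkl_nonblocking(int fd)
{
	long flags = lkl_sys_fcntl(fd, LKL_F_GETFL, 0);
	if (flags < 0)
		return -1;
	return (int)lkl_sys_fcntl(fd, LKL_F_SETFL, flags | LKL_O_NONBLOCK);
}
```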

Do these numbers sound reasonable? I would have thought that an LKL-based server would be much faster than it turned out to be... If the throughput in experiment 3 above is accurate, why would anyone use LKL for HFT applications?

I am really hoping there is some initialization step I am missing when using LKL. I wish someone could help.

@tavip
Member

tavip commented Jul 18, 2017

I expect LKL to perform a bit worse than the host, so the host/LKL ratio seems right.

I am not sure why you think that LKL would be useful for HFT applications.

@speedingdaemon
Author

speedingdaemon commented Jul 18, 2017

Oh, sorry, my bad. I thought I had read that somewhere.
@tavip
How about applications that need high performance/throughput?
What is the use-case for LKL?
I would have assumed that a user-mode networking stack would have much higher throughput...

@speedingdaemon
Author

speedingdaemon commented Jul 18, 2017

Does this perf output look fine to folks?

 2.04%  myserver  [kernel.kallsyms]         [k] packet_recvmsg
 1.90%  myserver  [kernel.kallsyms]         [k] _raw_spin_lock_irqsave
 1.85%  myserver  [kernel.kallsyms]         [k] copy_user_enhanced_fast_string
 1.63%  myserver  [kernel.kallsyms]         [k] skb_release_head_state
 1.47%  myserver  liblkl.so                 [.] memset
 1.44%  myserver  liblkl.so                 [.] raid6_int8_xor_syndrome
 1.41%  myserver  liblkl.so                 [.] raid6_int8_gen_syndrome
 1.39%  myserver  liblkl.so                 [.] raid6_int4_xor_syndrome
 1.35%  myserver  liblkl.so                 [.] raid6_int2_xor_syndrome
 1.34%  myserver  liblkl.so                 [.] raid6_int4_gen_syndrome
 1.34%  myserver  liblkl.so                 [.] raid6_int2_gen_syndrome
 1.24%  myserver  liblkl.so                 [.] raid6_int1_xor_syndrome
 1.08%  myserver  liblkl.so                 [.] raid6_int1_gen_syndrome
 1.06%  myserver  liblkl.so                 [.] do_csum
 1.05%  myserver  liblkl.so                 [.] arch_local_irq_restore

Why do I see kernel.kallsyms entries in the top 4? I thought that with LKL I shouldn't be seeing kernel.kallsyms symbols at the top.

Also, what is all this raid6 stuff? Is it because of the 8 lkl_netdev_raw_create() calls I make in myserver?
Is there any way to get rid of these entries at the top to increase throughput?

@thehajime
Member

Why do I see kernel.kallsyms entries in the top 4? I thought that with LKL I shouldn't be seeing kernel.kallsyms symbols at the top.

The raw-socket (AF_PACKET) backend uses host-kernel system calls, so those functions are hit frequently when sending and receiving packets. The cost can be amortized by several techniques (e.g. bulk processing, segmentation offloading), but that requires modifying the backend itself.
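
One concrete example of the bulk-processing idea on the host side is draining several packets per syscall with recvmmsg() instead of one recvmsg() per packet; this is an illustration of the technique, not code from the current raw-socket backend:

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define BATCH 32
#define FRAME 2048

/* Drain up to BATCH packets from an AF_PACKET fd with a single syscall.
 * On success, returns the number of packets; msgs[i].msg_len holds each size. */
static int recv_batch(int fd, char bufs[BATCH][FRAME])
{
	struct mmsghdr msgs[BATCH];
	struct iovec iovs[BATCH];

	memset(msgs, 0, sizeof(msgs));
	for (int i = 0; i < BATCH; i++) {
		iovs[i].iov_base = bufs[i];
		iovs[i].iov_len = FRAME;
		msgs[i].msg_hdr.msg_iov = &iovs[i];
		msgs[i].msg_hdr.msg_iovlen = 1;
	}

	return recvmmsg(fd, msgs, BATCH, MSG_DONTWAIT, NULL);
}
```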

Also, what is all this raid6 stuff? Is it because of the 8 lkl_netdev_raw_create() calls I make in myserver?
Is there any way to get rid of these entries at the top to increase throughput?

See #301

This is due to the raid6 benchmark code (run for btrfs, I guess) in the boot phase, so you won't see better performance/throughput even if you disable the feature.

@speedingdaemon
Author

@thehajime

I don't mind modifying the backend. Can you please point me to the places that need to be modified?

Also, I saw that lkl_netdev_tap_init() takes an offload argument to enable certain offload functionality.
Why don't we have something similar for lkl_netdev_raw_create()?
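
For comparison, a rough sketch of how the tap backend is created with offload flags enabled, which is the kind of parameter the raw backend currently lacks; the lkl_netdev_tap_create() signature and the LKL_VIRTIO_NET_F_* flag names are assumptions based on the tap code and should be checked against tools/lkl/include/:

```c
#include <lkl.h>
#include <lkl_host.h>

#define BIT(x) (1ULL << (x))

/* Hypothetical helper: create a tap netdev advertising checksum and TSO
 * offloads to the LKL virtio-net device. The raw-socket backend would need
 * an equivalent offload parameter to get the same benefit. */
struct lkl_netdev *create_tap_with_offload(void)
{
	int offload = BIT(LKL_VIRTIO_NET_F_CSUM) |
		      BIT(LKL_VIRTIO_NET_F_GUEST_CSUM) |
		      BIT(LKL_VIRTIO_NET_F_GUEST_TSO4);

	return lkl_netdev_tap_create("tap0", offload);
}
```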

@dakami

dakami commented Jan 22, 2018

I spent some time poking at this. It might be faster to use a pthread spinlock instead of a mutex, but there's a dependency right now on having a recursive mutex which I haven't been able to untangle.

At the extreme, it may be possible to run lkl on top of Xenomai, which does in fact support recursive mutexes.

It would be nice if this particular design decision was optional, as there are quite a few threading libraries to explore that don't implement this feature.
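
For context, a minimal illustration of the recursive-mutex semantics in question, using only the standard pthread API: the same thread may take the lock again without deadlocking, which a pthread spinlock cannot offer.

```c
#include <pthread.h>

static pthread_mutex_t lock;

/* Initialize a recursive mutex; a pthread_spinlock_t has no recursive
 * variant, which is why it cannot be dropped in as a direct replacement. */
static void init_recursive_mutex(void)
{
	pthread_mutexattr_t attr;

	pthread_mutexattr_init(&attr);
	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
	pthread_mutex_init(&lock, &attr);
	pthread_mutexattr_destroy(&attr);
}

static void reentrant_path(void)
{
	pthread_mutex_lock(&lock);
	pthread_mutex_lock(&lock);   /* OK: same owner, lock count goes to 2 */
	pthread_mutex_unlock(&lock);
	pthread_mutex_unlock(&lock);
}
```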

@tavip
Member

tavip commented Jan 22, 2018

I have been working on moving the threading stuff out to host ops, which removes a lot of dependencies (including mutexes); see:

https://github.com/tavip/linux/commits/the-expanse

The main problem I am still facing is the latency optimizations (direct IRQs and syscalls), for which I have not yet found a model that allows us to move them to host ops.
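
To make the direction concrete, here is a rough sketch of what a host-ops table for the threading primitives could look like; the struct and field names below are hypothetical, not the actual lkl_host_operations layout in that branch:

```c
/* Hypothetical host-ops table for threading primitives; field names are
 * illustrative, not the real lkl_host_operations definition. */
struct my_host_thread_ops {
	/* Mutexes: the recursive flag makes the requirement discussed above
	 * a property of the host implementation (pthread, Xenomai, ...). */
	void *(*mutex_alloc)(int recursive);
	void (*mutex_lock)(void *mutex);
	void (*mutex_unlock)(void *mutex);
	void (*mutex_free)(void *mutex);

	/* Threads and semaphores used by the kernel's thread abstraction. */
	void *(*thread_create)(void (*fn)(void *), void *arg);
	void (*thread_join)(void *thread);
	void *(*sem_alloc)(int count);
	void (*sem_up)(void *sem);
	void (*sem_down)(void *sem);
};
```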

@dakami

dakami commented Jan 24, 2018

I tried removing the recursive flag from master and the kernel boot just froze, so something is certainly using it. I did see you were trying to let the host declare a jump instruction.

Where do the latency optimizations live?

In my mental model, syscalls just become function calls, and IRQs are just blocking tasks that interrupted the host in a way we caught (userfaultfd, signal) or nonblocking tasks in a list that can be serviced at the host's discretion (perhaps between functions via -finstrument-functions). But I haven't quite grokked what needs to be locked at this layer.
