LKL latency/throughput #357

Open
speedingdaemon opened this issue Jul 11, 2017 · 12 comments

@speedingdaemon

I was reading this awesome paper written by Jerry: https://netdevconf.org/1.2/papers/jerry_chu.pdf
It mentions that the performance of LKL in user space did not beat that of the host OS. Is that still true?
Has anyone done more performance benchmarking to get additional latency/throughput numbers?
Curious what to expect from a TCP proxy service built on LKL.

@liuyuan10
Member

liuyuan10 commented Jul 12, 2017 via email

@speedingdaemon
Author

How would you guys compare the performance of LKL vs mTCP?

@linhua55

@speedingdaemon
Thanks to @thehajime, I integrated LKL into the rinetd proxy in order to use BBR on an OpenVZ VPS. You can refer to
https://github.com/linhua55/lkl_study
https://github.com/linhua55/rinetd

It uses the raw-socket backend, but with some modified code: OpenVZ's venet0 network interface is a "cooked" interface, so its raw packets have no MAC layer (14 bytes). It can't use AF_PACKET/SOCK_RAW; it can only use AF_PACKET/SOCK_DGRAM.

CPU usage is quite high, though. Recently I have been looking at using TPACKET_V2 (packet mmap) to achieve zero copy and reduce some of that CPU.
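
For reference, here is a minimal sketch of the SOCK_DGRAM ("cooked") variant of an AF_PACKET socket, which is roughly what a backend has to open on an interface like venet0 that carries no link-layer header; the interface name and error handling are illustrative, not taken from the actual patches:

```c
/* Minimal sketch: open an AF_PACKET socket in cooked (SOCK_DGRAM) mode so the
 * host kernel strips/prepends the link-layer header. This is what an interface
 * without a MAC layer (e.g. OpenVZ's venet0) requires instead of SOCK_RAW. */
#include <string.h>
#include <unistd.h>
#include <arpa/inet.h>
#include <net/if.h>
#include <linux/if_ether.h>
#include <linux/if_packet.h>
#include <sys/socket.h>

int open_cooked_packet_socket(const char *ifname)
{
	int fd = socket(AF_PACKET, SOCK_DGRAM, htons(ETH_P_ALL));
	if (fd < 0)
		return -1;

	struct sockaddr_ll sll;
	memset(&sll, 0, sizeof(sll));
	sll.sll_family = AF_PACKET;
	sll.sll_protocol = htons(ETH_P_ALL);
	sll.sll_ifindex = if_nametoindex(ifname);

	/* Bind to the interface; received payloads start at the IP header. */
	if (bind(fd, (struct sockaddr *)&sll, sizeof(sll)) < 0) {
		close(fd);
		return -1;
	}
	return fd;
}
```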

@speedingdaemon
Author

speedingdaemon commented Jul 17, 2017

@linhua55 @thehajime @tavip
I have an async server running on my blade.
Here is what I see:

  1. Server running on the host kernel as a user process (regular socket calls; I did not set the incoming FD to non-blocking) - 400 connections per second. I ran this variant only because of the issue mentioned in "Does LKL support non-blocking sockets?" #360.
  2. Server running on the host kernel as a user process (regular socket calls, with the incoming FD set to non-blocking; see the sketch after this list) - 820 connections per second.
  3. Server running on the host as a user process but linked against LKL (LKL socket calls; I did not set the incoming FD to non-blocking, see the same issue "Does LKL support non-blocking sockets?" #360) - 369 connections per second (about 92% of the host result from experiment 1 above).
  4. Server running on the host as a user process, but using the mTCP stack instead of LKL/host - 40k connections per second.
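
For reference, a minimal sketch of the non-blocking setup used in experiments 2 and 3 above: the host side is plain fcntl(), while the LKL side is assumed to go through LKL's generated syscall wrappers (lkl_sys_fcntl and the LKL_* constants are assumptions about the wrapper/constant names, not verified here):

```c
#include <fcntl.h>
#include <lkl.h>   /* assumed: LKL user-space API header providing lkl_sys_*() */

/* Host-kernel socket: the standard fcntl() dance. */
static int set_host_nonblocking(int fd)
{
	int flags = fcntl(fd, F_GETFL, 0);
	if (flags < 0)
		return -1;
	return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* LKL socket: the same idea routed through LKL's syscall wrappers.
 * lkl_sys_fcntl() and the LKL_* constants are assumed names. */
static int set_lkl_nonblocking(int fd)
{
	long flags = lkl_sys_fcntl(fd, LKL_F_GETFL, 0);
	if (flags < 0)
		return -1;
	return (int)lkl_sys_fcntl(fd, LKL_F_SETFL, flags | LKL_O_NONBLOCK);
}
```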

Do these numbers sound reasonable? I would have thought that an LKL-based server would be much faster than it turned out to be... If the throughput in experiment 3 above is accurate, why would anyone use LKL for HFT applications?

I am really hoping there is some initialization step I am missing when using LKL. I wish someone could help.

@tavip
Member

tavip commented Jul 18, 2017

I expect LKL to perform a bit worse than the host, so the host/LKL ratio seems right.

I am not sure why you think that LKL would be useful for HFT applications.

@speedingdaemon
Author

speedingdaemon commented Jul 18, 2017

Oh, sorry, my bad. I thought I had read that somewhere.
@tavip
How about applications that need high performance/throughput?
What is the use-case for LKL?
I would have assumed that a user-mode networking stack would have much higher throughput...

@speedingdaemon
Author

speedingdaemon commented Jul 18, 2017

Does this perf output look fine to folks?

 2.04%  myserver  [kernel.kallsyms]         [k] packet_recvmsg
 1.90%  myserver  [kernel.kallsyms]         [k] _raw_spin_lock_irqsave
 1.85%  myserver  [kernel.kallsyms]         [k] copy_user_enhanced_fast_string
 1.63%  myserver  [kernel.kallsyms]         [k] skb_release_head_state
 1.47%  myserver  liblkl.so                 [.] memset
 1.44%  myserver  liblkl.so                 [.] raid6_int8_xor_syndrome
 1.41%  myserver  liblkl.so                 [.] raid6_int8_gen_syndrome
 1.39%  myserver  liblkl.so                 [.] raid6_int4_xor_syndrome
 1.35%  myserver  liblkl.so                 [.] raid6_int2_xor_syndrome
 1.34%  myserver  liblkl.so                 [.] raid6_int4_gen_syndrome
 1.34%  myserver  liblkl.so                 [.] raid6_int2_gen_syndrome
 1.24%  myserver  liblkl.so                 [.] raid6_int1_xor_syndrome
 1.08%  myserver  liblkl.so                 [.] raid6_int1_gen_syndrome
 1.06%  myserver  liblkl.so                 [.] do_csum
 1.05%  myserver  liblkl.so                 [.] arch_local_irq_restore

Why do I see kernel.kallsyms entries in the top 4? I thought that with LKL I shouldn't be seeing kernel.kallsyms symbols at the top.

Also, what is all this raid6 stuff? Is it because of the 8 lkl_netdev_raw_create() calls I make in myserver?
Is there any way to get rid of these entries at the top to increase throughput?

@thehajime
Member

Why do I see kernel.kallsyms entries in the top 4? I thought that with LKL I shouldn't be seeing kernel.kallsyms symbols at the top.

The raw-socket (AF_PACKET) backend uses host-kernel system calls, so those functions are hit frequently when sending and receiving packets. The cost can be amortized by several techniques (e.g. bulk processing, segmentation offloading), but that requires modifying the backend itself.
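
One concrete example of the bulk-processing idea on the host side is draining several packets per syscall with recvmmsg() instead of one recvmsg() per packet; this is an illustration of the technique, not code from the current raw-socket backend:

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>

#define BATCH 32
#define FRAME 2048

/* Drain up to BATCH packets from an AF_PACKET fd with a single syscall.
 * On success, returns the number of packets; msgs[i].msg_len holds each size. */
static int recv_batch(int fd, char bufs[BATCH][FRAME])
{
	struct mmsghdr msgs[BATCH];
	struct iovec iovs[BATCH];

	memset(msgs, 0, sizeof(msgs));
	for (int i = 0; i < BATCH; i++) {
		iovs[i].iov_base = bufs[i];
		iovs[i].iov_len = FRAME;
		msgs[i].msg_hdr.msg_iov = &iovs[i];
		msgs[i].msg_hdr.msg_iovlen = 1;
	}

	return recvmmsg(fd, msgs, BATCH, MSG_DONTWAIT, NULL);
}
```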

Also, what is all this raid6 stuff? Is it because of the 8 lkl_netdev_raw_create() calls I make in myserver?
Is there any way to get rid of these entries at the top to increase throughput?

See #301

This is due to the raid6 benchmark code (run for btrfs, I guess) in the boot phase, so you won't see better performance/throughput even if you disable the feature.

@speedingdaemon
Author

@thehajime

I don't mind modifying the backend. Can you please point me to the places that need to be modified?

Also, I saw that lkl_netdev_tap_init() takes an offload argument to enable certain offload functionality.
Why don't we have something similar for lkl_netdev_raw_create()?
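
For comparison, a rough sketch of how the tap backend is created with offload flags enabled, which is the kind of parameter the raw backend currently lacks; the lkl_netdev_tap_create() signature and the LKL_VIRTIO_NET_F_* flag names are assumptions based on the tap code and should be checked against tools/lkl/include/:

```c
#include <lkl.h>
#include <lkl_host.h>

#define BIT(x) (1ULL << (x))

/* Hypothetical helper: create a tap netdev advertising checksum and TSO
 * offloads to the LKL virtio-net device. The raw-socket backend would need
 * an equivalent offload parameter to get the same benefit. */
struct lkl_netdev *create_tap_with_offload(void)
{
	int offload = BIT(LKL_VIRTIO_NET_F_CSUM) |
		      BIT(LKL_VIRTIO_NET_F_GUEST_CSUM) |
		      BIT(LKL_VIRTIO_NET_F_GUEST_TSO4);

	return lkl_netdev_tap_create("tap0", offload);
}
```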

@dakami

dakami commented Jan 22, 2018

I spent some time poking at this. It might be faster to use a pthread spinlock instead of a mutex, but there's a dependency right now on having a recursive mutex which I haven't been able to untangle.

At the extreme, it may be possible to run lkl on top of Xenomai, which does in fact support recursive mutexes.

It would be nice if this particular design decision was optional, as there are quite a few threading libraries to explore that don't implement this feature.
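
For context, a minimal illustration of the recursive-mutex semantics in question, using only the standard pthread API: the same thread may take the lock again without deadlocking, which a pthread spinlock cannot offer.

```c
#include <pthread.h>

static pthread_mutex_t lock;

/* Initialize a recursive mutex; a pthread_spinlock_t has no recursive
 * variant, which is why it cannot be dropped in as a direct replacement. */
static void init_recursive_mutex(void)
{
	pthread_mutexattr_t attr;

	pthread_mutexattr_init(&attr);
	pthread_mutexattr_settype(&attr, PTHREAD_MUTEX_RECURSIVE);
	pthread_mutex_init(&lock, &attr);
	pthread_mutexattr_destroy(&attr);
}

static void reentrant_path(void)
{
	pthread_mutex_lock(&lock);
	pthread_mutex_lock(&lock);   /* OK: same owner, lock count goes to 2 */
	pthread_mutex_unlock(&lock);
	pthread_mutex_unlock(&lock);
}
```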

@tavip
Member

tavip commented Jan 22, 2018

I have been working on moving the threading stuff out to host ops, which removes a lot of dependencies (including mutexes); see:

https://github.com/tavip/linux/commits/the-expanse

The main problem I am still facing is the latency optimizations (direct IRQs and syscalls), for which I have not yet found a model that allows us to move them to host ops.
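
To make the direction concrete, here is a rough sketch of what a host-ops table for the threading primitives could look like; the struct and field names below are hypothetical, not the actual lkl_host_operations layout in that branch:

```c
/* Hypothetical host-ops table for threading primitives; field names are
 * illustrative, not the real lkl_host_operations definition. */
struct my_host_thread_ops {
	/* Mutexes: the recursive flag makes the requirement discussed above
	 * a property of the host implementation (pthread, Xenomai, ...). */
	void *(*mutex_alloc)(int recursive);
	void (*mutex_lock)(void *mutex);
	void (*mutex_unlock)(void *mutex);
	void (*mutex_free)(void *mutex);

	/* Threads and semaphores used by the kernel's thread abstraction. */
	void *(*thread_create)(void (*fn)(void *), void *arg);
	void (*thread_join)(void *thread);
	void *(*sem_alloc)(int count);
	void (*sem_up)(void *sem);
	void (*sem_down)(void *sem);
};
```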

@dakami

dakami commented Jan 24, 2018

I tried removing the recursive flag from master and the kernel boot just froze, so something is certainly using it. I did see you were trying to let the host declare a jump instruction.

Where do the latency optimizations live?

In my mental model, syscalls just become function calls, and IRQs are just blocking tasks that interrupted the host in a way we caught (userfaultfd, signal) or nonblocking tasks in a list that can be serviced at the host's discretion (perhaps between functions via -finstrument-functions). But I haven't quite grokked what needs to be locked at this layer.
