Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

net: remember the name of the lock chain (nftables) #2550

Open
wants to merge 2 commits into
base: criu-dev
Choose a base branch
from

Conversation

adrianreber
Copy link
Member

Using libnftables the chain to lock the network is composed of ("CRIU-%d", real_pid). This leads to around 40 zdtm tests failing with errors like this:

Error: No such file or directory; did you mean table 'CRIU-62' in family inet?
delete table inet CRIU-86

The reason is that as soon as a process is running in a namespace the real PID can be anything and only the PID in the namespace is restored correctly. Relying on the real PID does not work for the chain name.

Using the PID of the innermost namespace would lead to the chain be called 'CRIU-1' most of the time which is also not really unique.

The uniqueness of the name was always problematic. With this change all tests are working again which rely on network locking if the nftables backend is used for network locking.

@adrianreber adrianreber force-pushed the 2024-12-17-nftables-lock-name branch from c483710 to 0305093 Compare December 17, 2024 13:46
@avagin avagin requested a review from mihalicyn December 17, 2024 16:25
@mihalicyn
Copy link
Member

Hi Adrian!

Please, can you tell how and in which circumstances you've caught this issue?
It is not on our GitHub CI tests, right?

As far as I understand the idea of your fix is to ensure that we keep nftables table name in the inventory image file instead of dynamically recalculate it on restore (using root_item->pid->real). Am I right?

A first question I have after going through this is "How this has worked before?".

P.S. I'll take a closer look into this. I've spent not enough time yet to fully understand what's going on there.

@adrianreber
Copy link
Member Author

A first question I have after going through this is "How this has worked before?".

It probably never did. We are not running all the tests on a system without iptables with nftables locking backend. Only two or four tests are running with the nftables backend.

@adrianreber
Copy link
Member Author

Please, can you tell how and in which circumstances you've caught this issue?

I am trying to switch the default locking backend in Fedora and CentOS >= 10 to nftables from iptables because iptables is no longer installed by default.

As far as I understand the idea of your fix is to ensure that we keep nftables table name in the inventory image file instead of dynamically recalculate it on restore (using root_item->pid->real). Am I right?

Yes. The table name makes sense if the locking and unlocking happens in the same CRIU run, but between CRIU runs it does not work with the existing approach.

@mihalicyn
Copy link
Member

Ah, thanks for clarifications!

I wonder if we can do something like this:

$ git diff
diff --git a/criu/netfilter.c b/criu/netfilter.c
index 9e78dc4b0..c558f9bf1 100644
--- a/criu/netfilter.c
+++ b/criu/netfilter.c
@@ -299,7 +299,7 @@ int nftables_lock_connection(struct inet_sk_desc *sk)
 
 int nftables_get_table(char *table, int n)
 {
-       if (snprintf(table, n, "inet CRIU-%d", root_item->pid->real) < 0) {
+       if (snprintf(table, n, "inet CRIU-%d", root_item->ids->pid_ns_id) < 0) {
                pr_err("Cannot generate CRIU's nftables table name\n");
                return -1;
        }

Yes, it's not a forward-compatible change and will break restore of images which were dumped with an older CRIU. In this form only works for experimental purposes (and have to check for root_item->ids->has_pid_ns_id too). But I'm curious if it helps.

@mihalicyn
Copy link
Member

mihalicyn commented Dec 18, 2024

My idea is that instead of introducing a new field nft_lock_table in the inventory_entry just for a single purpose, we can use inventory_entry->root_ids->pid_ns_id or root_item->ids->pid_ns_id as a source of unique criu run id. And we already do something like that when generate criu_run_id value:

void util_init(void)
{
...
	criu_run_id = getpid();
	if (!stat("/proc/self/ns/pid", &statbuf))
		criu_run_id |= (uint64_t)statbuf.st_ino << 32;

@adrianreber
Copy link
Member Author

@mihalicyn I am happy to use whatever makes most sense.

What is pid_ns_id? Is that basically the inode of the PID NS? Or more? It still sounds like something we need to save somewhere in the image, right?

Yes, it's not a forward-compatible change and will break restore of images which were dumped with an older CRIU.

I don't think we have to worry about this. Currently it doesn't work at all.

Let me know which ID makes most sense and I can rework this PR. I think the important part is that is has to come from some value of the checkpoint image and not be generated during restore.

@adrianreber
Copy link
Member Author

@mihalicyn I think I understood your proposal now. The PR could be really simple as pid_ns_id is already in the image. Let me try it out.

@adrianreber
Copy link
Member Author

With this line it also passes all the zdtm test cases (besides a couple of tests which call iptables (which I did not install)) if I switch to the nftables locking backend:

if (snprintf(table, n, "inet CRIU-%d", root_item->ids->pid_ns_id) < 0) {

That brings it down to a one line change. Very good idea @mihalicyn. Thanks.

How long can the pid_ns_id be? Currently the variable table is set to 32 characters.

@adrianreber
Copy link
Member Author

@mihalicyn Tests are happy, but root_item->ids->pid_ns_id always returns 1 when running in the host PID namespace.

So that is not really a good idea I think as it not really unique.

@mihalicyn
Copy link
Member

mihalicyn commented Dec 18, 2024

Hey Adrian,

Is that basically the inode of the PID NS?

Yes, precisely.

It still sounds like something we need to save somewhere in the image, right?

We don't as we already have it in the image anyways.

I don't think we have to worry about this. Currently it doesn't work at all.

Are we 100% percent sure that it doesn't work and never worked in any circumstances?

How long can the pid_ns_id be? Currently the variable table is set to 32 characters.

Hmm, it's uint32 so in a string representation I guess it's like 10 chars.

Tests are happy, but root_item->ids->pid_ns_id always returns 1 when running in the host PID namespace.

That's my bad, actually, to get pid namespace inode number you need something like:

		ns = lookup_ns_by_id(root_item->ids->pid_ns_id, &pid_ns_desc);
		if (ns) {
			ns->kid // << this thing is inode number of pid namespace

But yes, I don't think that even with this change having pid_ns_id would be enough, I think we still need to add a new field to inventory_entry but my point is to make it universal for different cases like this one and name it criu_run_id or something like that. Also, to clearly document when it's unique and for what purposes must be used.

@mihalicyn
Copy link
Member

Also, we have inventory_entry.dump_uptime field we can consume to get a certain degree of uniqueness.

@adrianreber
Copy link
Member Author

Ah, okay. So let's use the criu_run_id and store it in the inventory.

I don't think we have to worry about this. Currently it doesn't work at all.

Are we 100% percent sure that it doesn't work and never worked in any circumstances?

I don't know. All tests with open TCP connections are just hanging during restore because the network locking cannot be disabled. According to zdtm it is so broken that it doesn't work currently.

Also, we have inventory_entry.dump_uptime field we can consume to get a certain degree of uniqueness.

As an additional field in the nft table name? "CRIU-%d-%" PRIx64 " ", criu_run_id, inventory_entry.dump_uptime)

Or instead of criu_run_id? Currently we collect the uptime rather late in the process of checkpointing. Definitely after the network locking. It seems to be only used during detect_pid_reuse() and that is never directly use, but it seems is only relevant during pre-dump and looking at the parent process. So it seems like we could move the uptime detection to an earlier point and then also use it in the network locking chain. During restore we would need to look at the criu_run_id from the checkpointing run and at the uptime of the checkpointing run. That could work.

@rst0git
Copy link
Member

rst0git commented Dec 18, 2024

A first question I have after going through this is "How this has worked before?".

It probably never did. We are not running all the tests on a system without iptables with nftables locking backend. Only two or four tests are running with the nftables backend.

Would it be possible to add a CI workflow or modify an existing one to run all tests with the nftables backend?

@adrianreber adrianreber force-pushed the 2024-12-17-nftables-lock-name branch from 0305093 to 30e76fd Compare December 20, 2024 10:12
@adrianreber
Copy link
Member Author

@mihalicyn What do you think about the latest version. This works in my tests just as good as the previous version. Now using criu_run_id as suggested.

@mihalicyn
Copy link
Member

Hi Adrian!

What do you think about the latest version. This works in my tests just as good as the previous version. Now using criu_run_id as suggested.

Looks great! The only thing that worries me is that idea behind criu_run_id was to make it relatively unique within one machine/system, but here we try to use it as "unique enough" for purpose of migration between different machines, right? It's not a question to you, but in general. Maybe, we should consider adding more randomness to it? And what is even more important is that as we add criu_run_id to the images we loose ability to increase it size (from uint64_t to something bigger) in the future.

Also, I tried to play with nftables-based locking mechanism on my machine, and found, that for host-flavor test we have no problem with locking:

sudo ./test/zdtm.py run --ignore-taint -t zdtm/static/socket-tcp-local -f h
userns is supported
The kernel is tainted: '12288'
=== Run 1/1 ================ zdtm/static/socket-tcp-local
==================== Run zdtm/static/socket-tcp-local in h =====================
Start test
Running zdtm/static/socket-tcp-local.hook(--post-start)
./socket-tcp-local --pidfile=socket-tcp-local.pid --outfile=socket-tcp-local.out
Running zdtm/static/socket-tcp-local.hook(--pre-dump)
State      Recv-Q     Send-Q         Local Address:Port          Peer Address:Port     Process     
LISTEN     0          1                    0.0.0.0:8880               0.0.0.0:*                    
ESTAB      0          0                  127.0.0.1:8880             127.0.0.1:56452                
ESTAB      0          0                  127.0.0.1:56452            127.0.0.1:8880                 
Run criu dump
Running zdtm/static/socket-tcp-local.hook(--pre-restore)
Run criu restore
=[log]=> dump/zdtm/static/socket-tcp-local/55/1/restore.log
------------------------ grep Error ------------------------
b'(00.006320) pie: 55: 55: Restored'
b'(00.006413) Running post-restore scripts'
b'(00.006475) pidfile: Wrote pid 55 to /home/alex/storage/dev/criu/test/zdtm/static/socket-tcp-local.pid (2 bytes)'
b'(00.006480) net: Unlock network'
------------------------ ERROR OVER ------------------------
Running zdtm/static/socket-tcp-local.hook(--post-restore)
State      Recv-Q     Send-Q         Local Address:Port          Peer Address:Port     Process     
LISTEN     0          1                    0.0.0.0:8880               0.0.0.0:*                    
ESTAB      0          0                  127.0.0.1:8880             127.0.0.1:56452                
ESTAB      0          0                  127.0.0.1:56452            127.0.0.1:8880                 
Check TCP images
Send the 15 signal to  55
Wait for zdtm/static/socket-tcp-local(55) to die for 0.100000
Running zdtm/static/socket-tcp-local.hook(--clean)
Removing dump/zdtm/static/socket-tcp-local/55
==================== Test zdtm/static/socket-tcp-local PASS ====================

While for -f ns/-f uns we have the problem you described. BUT! When -f ns/-f uns is used, it means that locking happens inside net namespace and we can't expect any issues with uniqueness of nftable table names, right? My point is that we can simplify things for the case when net namespace is dumped, while for the case when we don't use namespaces we can still use this simple pid-based approach. Tell me if I'm saying something stupid.

@mihalicyn
Copy link
Member

mihalicyn commented Dec 23, 2024

Just for the sake of demonstrating my point. Something like this:

--- a/criu/netfilter.c
+++ b/criu/netfilter.c
@@ -14,6 +14,7 @@
 #include "util.h"
 #include "common/list.h"
 #include "files.h"
+#include "namespaces.h"
 #include "netfilter.h"
 #include "sockets.h"
 #include "sk-inet.h"
@@ -299,7 +300,19 @@ int nftables_lock_connection(struct inet_sk_desc *sk)
 
 int nftables_get_table(char *table, int n)
 {
-       if (snprintf(table, n, "inet CRIU-%d", root_item->pid->real) < 0) {
+       int table_id = 0;
+
+       if (root_ns_mask & CLONE_NEWNET) {
+               table_id = 0;
+       } else if (!(root_ns_mask & CLONE_NEWPID)) {
+               table_id = root_item->pid->real;
+       } else {
+               // here we need something unique
+               // as we don't have a net namespace and table name conflict is possible
+               // also, we *do* have a pid namespace and root_item->pid->real makes no sense.
+       }
+
+       if (snprintf(table, n, "inet CRIU-%d", table_id) < 0) {
                pr_err("Cannot generate CRIU's nftables table name\n");
                return -1;
        }

fixes tests for -f h and -f ns/-f uns. At the same time, it's still broken if you make a setup where process use pid namespace and don't use net namespace.

@adrianreber
Copy link
Member Author

Also, I tried to play with nftables-based locking mechanism on my machine, and found, that for host-flavor test we have no problem with locking:

Right, because pid_real is restored correctly.

While for -f ns/-f uns we have the problem you described. BUT! When -f ns/-f uns is used, it means that locking happens inside net namespace and we can't expect any issues with uniqueness of nftable table names, right? My point is that we can simplify things for the case when net namespace is dumped, while for the case when we don't use namespaces we can still use this simple pid-based approach. Tell me if I'm saying something stupid.

You are right. It works for -f h. I am not sure that having different code paths for host network namespace and separate namespaces is such a good idea. It doesn't seem to break anything if we always use the criu_run_id. Having one less if is not much, but if we do not gain anything from it, we could leave it away.

Maybe, we should consider adding more randomness to it?

No real opinion here, but it might be good. No idea.

And what is even more important is that as we add criu_run_id to the images we loose ability to increase it size (from uint64_t to something bigger) in the future.

We can just deprecate the protobuf field at some point in the future and use a new one if we feel that is necessary.

criu/net.c Outdated
@@ -229,6 +229,8 @@ static const char *unix_conf_entries[] = {
"max_dgram_qlen",
};

extern char nft_lock_table[32];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess, we don't need this anymore.

criu/image.c Outdated
* information is needed to identify the name of the network
* locking table.
*/
dump_criu_run_id = he->dump_criu_run_id;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (he->has_dump_criu_run_id) {
	dump_criu_run_id = he->dump_criu_run_id;
}

criu/netfilter.c Outdated
{
if (snprintf(table, n, "inet CRIU-%d", root_item->pid->real) < 0) {
if (snprintf(table, n, "inet CRIU-%" PRIx64, id) < 0) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

	/*
	 * Keep compatibility with images
	 * without he->dump_criu_run_id field.
	 */
	if (!id) {
		if (!(root_ns_mask & CLONE_NEWPID)) {
			id = root_item->pid->real;
		} else {
			pr_err("Cannot generate CRIU's nftables table name because of issue #2550\n");
			return -1;
		}
		
	}

	if (snprintf(table, n, "inet CRIU-%" PRIx64, id) < 0) {

What do you think about this?

@mihalicyn
Copy link
Member

mihalicyn commented Dec 23, 2024

We can just deprecate the protobuf field at some point in the future and use a new one if we feel that is necessary.

I agree.

I am not sure that having different code paths for host network namespace and separate namespaces is such a good idea.

Yeah, I agree. I just wanted to ensure that I understood problem right and this code example is the best way to show different scenarios we have and when it works and when it doesn't.

But we still need some extra checks for compatibility reasons, IMHO.

In general, this PR looks great to me. Thanks for working on this, Adrian!

criu/net.c Outdated

if (nftables_get_table(table, sizeof(table)))
if (dump_criu_run_id == 0)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we need to introduce a boolean parameter here to determine if we are on restore/dump codepath as dump_criu_run_id == 0 might be in two different cases: when we deal with an old image (without dump_criu_run_id field) or if we are on the restore codepath.

@avagin
Copy link
Member

avagin commented Dec 30, 2024

A first question I have after going through this is "How this has worked before?".

It probably never did. We are not running all the tests on a system without iptables with nftables locking backend. Only two or four tests are running with the nftables backend.

It was broken by 9838d34.

@avagin
Copy link
Member

avagin commented Jan 8, 2025

And what is even more important is that as we add criu_run_id to the images we loose ability to increase it size (from uint64_t to something bigger) in the future.

We can just deprecate the protobuf field at some point in the future and use a new one if we feel that is necessary.

We can do that, but we are trying to avoid that. If we see the problem, it is better to handle it right away. The obvious solution here is to add a true uuid, isn't it?

@adrianreber
Copy link
Member Author

And what is even more important is that as we add criu_run_id to the images we loose ability to increase it size (from uint64_t to something bigger) in the future.

We can just deprecate the protobuf field at some point in the future and use a new one if we feel that is necessary.

We can do that, but we are trying to avoid that. If we see the problem, it is better to handle it right away. The obvious solution here is to add a true uuid, isn't it?

What do you exactly suggest? Can you be more specific? Not sure what you mean by true. I assume you mean it should be really unique, right? Should we base that on the existing criu_run_id?

@avagin
Copy link
Member

avagin commented Jan 13, 2025

We can do that, but we are trying to avoid that. If we see the problem, it is better to handle it right away. The obvious solution here is to add a true uuid, isn't it?

What do you exactly suggest? Can you be more specific? Not sure what you mean by true. I assume you mean it should be really unique, right? Should we base that on the existing criu_run_id?

I mean we can consider to use the standard way to generate uuid (libuuid).

@adrianreber
Copy link
Member Author

We can do that, but we are trying to avoid that. If we see the problem, it is better to handle it right away. The obvious solution here is to add a true uuid, isn't it?

What do you exactly suggest? Can you be more specific? Not sure what you mean by true. I assume you mean it should be really unique, right? Should we base that on the existing criu_run_id?

I mean we can consider to use the standard way to generate uuid (libuuid).

Okay, makes sense. Let me try to rework this PR to use libuuid. That additional dependency doesn't sound like a big problem as on my system even systemd requires libuuid (indirectly). So it will be available on almost all systems.

@adrianreber adrianreber force-pushed the 2024-12-17-nftables-lock-name branch from 30e76fd to ee03485 Compare January 23, 2025 17:45
@adrianreber
Copy link
Member Author

@avagin @mihalicyn I rewrote it using libuuid. Let me know if this is the right direction.

@adrianreber adrianreber force-pushed the 2024-12-17-nftables-lock-name branch 2 times, most recently from 327c7fa to 67434b8 Compare January 23, 2025 17:53
@adrianreber adrianreber force-pushed the 2024-12-17-nftables-lock-name branch from 67434b8 to b2cd209 Compare January 23, 2025 17:59
criu_run_id will be used in upcoming changes to create and remove
network rules for network locking. Instead of trying to come up with
a way to create unique IDs, just use an existing library.

libuuid should be installed on most systems as it is indirectly required
by systemd (via libmount).

Signed-off-by: Adrian Reber <[email protected]>
Using libnftables the chain to lock the network is composed of
("CRIU-%d", real_pid). This leads to around 40 zdtm tests failing
with errors like this:

Error: No such file or directory; did you mean table 'CRIU-62' in family inet?
delete table inet CRIU-86

The reason is that as soon as a process is running in a namespace the
real PID can be anything and only the PID in the namespace is restored
correctly. Relying on the real PID does not work for the chain name.

Using the PID of the innermost namespace would lead to the chain be
called 'CRIU-1' most of the time which is also not really unique.

With this commit the change is now named using the already existing CRIU
run ID. To be able to correctly restore the process and delete the
locking table, the CRIU run id during checkpointing is now stored in the
inventory as dump_criu_run_id.

Signed-off-by: Adrian Reber <[email protected]>
@adrianreber adrianreber force-pushed the 2024-12-17-nftables-lock-name branch from b2cd209 to e80fa95 Compare January 24, 2025 08:17
* The uuid_unparse() function converts the supplied UUID uu from the binary
* representation into a 36-byte string (plus trailing '\0')
*/
#define RUN_ID_HASH_LENGTH 36+1
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

brackets around 36+1 are required (37 should work too). Otherwise expressions like 2 * RUN_ID_HASH_LENGTH don't work as expected.

@@ -141,4 +141,4 @@ int check_mount_v2(void)
return 0;
}

uint64_t compel_run_id;
char compel_run_id[37];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RUN_ID_HASH_LENGTH

@avagin
Copy link
Member

avagin commented Jan 24, 2025

LGTM

@avagin avagin requested a review from mihalicyn January 24, 2025 18:41
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants