-
Notifications
You must be signed in to change notification settings - Fork 337
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use nxt_nitems() instead of sizeof() for strings (arrays) #1467
Conversation
ac000
commented
Oct 23, 2024
•
edited
Loading
edited
Well, it's been nearly two years since we attempted this patch. But things have changed, both internally and in the C world and larger programming world in general. I was reminded of this patch from Alex due to hitting this issue again. Given the current histrionics towards C (OK, not entirely without merit), anything that can help us write safer code and avoid whole classes of simple but possibly nasty bugs should not be simply rejected... So I think the time has come to give this another go, it is after all a really simple patch with benefits. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both ways work for length calculation. Personally, I prefer the current way since it's more concise. I'd like to leave this to team alignment.
BTW, review and tests are always effectively to catch potential issues like sizeof with string pointers, your catch is a good example.
Yes, both ways work (or not) to the same degree for some definition of work. That's not the point. The point is that the macro is very easy to misuse. With this patch you at least get shouted at by the compiler when you do misuse it (anything we can do to help the compiler help us has got to be good... especially in this day and age).
As for conciseness, I don't know... We would now have #define nxt_length(s) \
(nxt_nitems(s) - 1) vs #define nxt_length(s) \
(sizeof(s) - 1) OK so there is one extra level of abstraction. nxt_nitems() is simply #define nxt_nitems(x) \
(sizeof(x) / sizeof((x)[0])) Which is a well known construct and it's not like how to use it changes, it's a totally transparent change API wise. So whatever you may loose in conciseness I think is outweighed by the safety benefits. Talking of concise, why even use They both should only be used on nul-terminated strings. I see only one (hmm, OK, two) instances where nxt_length() is maybe better and that's if you have a string with embedded NUL's (or if you accidentally use it on a non nul-terminated string). strlen(3) though has one advantage... $ cat s.c #include <stdio.h>
#include <string.h>
#define nxt_length(s) (sizeof(s) -1)
int main(void)
{
char buf[] = {'1', '2', '3', '\0'};
char buf2[] = "123";
char buf3[] = {'1', '2', '3'};
char buf4[] = {'1', '2', '3', '\0', 'h', 'i', 'd', 'd', 'e', 'n', '\0'};
static const char *str = "123";
printf("strlen(buf) [%lu]\n", strlen(buf));
printf("strlen(buf2) [%lu]\n", strlen(buf2));
printf("strlen(buf3) [%lu]\n", strlen(buf3));
printf("strlen(buf4) [%lu]\n", strlen(buf4));
printf("strlen(str) [%lu]\n", strlen(str));
printf("nxt_length(buf) [%lu]\n", nxt_length(buf));
printf("nxt_length(buf2) [%lu]\n", nxt_length(buf2));
printf("nxt_length(buf3) [%lu]\n", nxt_length(buf3));
printf("nxt_length(buf4) [%lu]\n", nxt_length(buf4));
printf("nxt_length(str) [%lu]\n", nxt_length(str));
return 0;
}
So I'd say strlen(3) does just as good and is more concise, so by that argument we should just use that... For me the tide is turned to nxt_length()'s favour with this patch, as, even though it has the same issues as above, it at least does warn you when (not if) you do nxt_length(char *). Yes strlen(3) does the right thing with strlen(char *), but if you're looking at it from a safety point of view then a patched nxt_length() I think is better, yes it doesn't handle the char * case, but you get warned about that, (ironically you may just end up using strlen(3) in those cases), it does do the wrong thing with non nul-terminated strings, but strlen(3) is worse. In terms of embedded NUL's in strings, there is no right answer here... (which is why APIs that deal with such possibilities always take an explicit length parameter.)
Tests didn't catch the ternary bug, they didn't catch this latest bug. Yes, they were caught in review, will they next time? I'm willing to fight for this, as I don't see any downside and a real benefit... |
|
True, but strlen(3) has the advantage of working with char *, so it's swings and roundabouts... Besides strlen(3) is pretty well optimised these days... So given the choice between nxt_length() not invoking undefined behaviour (however who knows what chaos it could lead to if you start using this wrong result) when run on non nul-terminated strings or strlen(char *) doing the right thing, I'd probably take strlen(3) and it's not like we don't already use it anyway... (Not to mention nxt_length() is poorly named, length of what? (it's a rhetorical question...)) Given that we need to use strlen(3) in some cases. strlen(3) plus this patch seems like the best of both worlds. |
... and you say that 'you know not use nxt_length() on pointers', but you also know how easily (and unexpectedly) arrays decay to pointers... |
How easy would it be to overlook something like char *buf;
nxt_str_t str;
...
/* Having allocated and set buf */
nxt_set_str(&str, buf); Why not take defensive measures against such things? |
It seems like really, nxt_length() should be limited to static initialisations... |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just my 2 cents:
I think that having the extra warnings will make it easier to prevent issues introduced by those of us with less context when writing and using C. If we are ever to onboard new members to our team this will become an important point.
I am for the patch as is, or with the slight hypothetical impact from using strlen().
In fact if we can configure the warnings to be build breaking even better in my opinion.
That's currently the case as we compile with |
Regarding the (sizeof() -1) compile-time vs strlen(3) runtime evaluations. With optimisations enabled there are several cases where the strlen(3) will be turned into a runtime constant. E.g. #define STR "123"
...
/* These will always be runtime constants */
strlen(STR);
strlen("123");
/* These may be if the compiler can determine they won't change */
char buf[] = "123"
static const char *s= "123";
...
strlen(buf);
strlen(s); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
help us write safer code and avoid whole classes of simple but possibly nasty bugs
having the extra warnings will make it easier to prevent issues introduced by those of us with less context when writing and using C. If we are ever to onboard new members to our team this will become an important point.
I find these arguments from @ac000 and @avahahn compelling.
We're already hidden behind a layer of abstraction (the nxt_length
macro), so I'm not sure I understand fully the concerns against this specific patch.
Note I'm intentionally avoiding the |
In the Apache Software Foundation, votes are expressed as a number between -1 ("no") and +1 ("yes"). In this thread, I'm reading @ac000 and @avahahn both as "+1", summarized by @ac000's earlier comment "I'm willing to fight for this, as I don't see any downside and a real benefit" I'm reading @hongzhidao as something closer to "-0.1" from the comment "Personally, I prefer the current way since it's more concise. I'd like to leave this to team alignment." Basically: not your preference, but fine either way. Is that accurate, @hongzhidao? (Edit: I'd put myself at +0.9; I really like it, but I wouldn't die on that hill.) |
yep. |
sizeof() should never be used to get the size of an array. It is very unsafe, since arrays easily decay to pointers, and sizeof() applied to a pointer gives false results that compile and produce silent bugs. It's better to use nxt_items(), which implements sizeof() division, which recent compilers warn when used with pointers. This change would have caught a couple of bugs that were *almost* introduced First up is the _infamous_ ternary macro bug (yes, using the ternary operator in a macro is of itself a bad idea) nxt_str_set(&port, (r->tls ? "https://" : "http://")); which in the macro expansion runs: (&port)->length = nxt_length((r->tls ? : "https://" : "http://")); which evaluates to: port.length = sizeof(r->tls ? "https://" : "http://") - 1; which evaluates to: port.length = 8 - 1; Of course, we didn't want a compile-time-constant 8 there, but rather the length of the string. The above bug is not obvious to the untrained eye, so let's show some example programs that may give some more hints about the problem. $ cat sizeof.c #include <stdio.h> int main(void) { printf("%zu\n", sizeof("01")); printf("%zu\n", sizeof("012")); printf("%zu\n", sizeof(char *)); } $ cc -Wall -Wextra sizeof.c $ ./a.out 3 4 8 sizeof() returns the size in bytes of the array passed to it, which in case of char strings, it is equivalent to the length of the string + 1 (for the terminating '\0'). However, arrays decay very easily in C, and they decay to a pointer to the first element in the array. In case of strings, that is a 'char *'. When sizeof() is given a pointer, it returns the size of the pointer, which in most platforms is 8. The ternary operator (?) performs default promotions (and other nefarious stuff) that may surprise even the most experienced programmers. It contrasts the __builtin_choose_expr() GCC builtin [1], which performs almost equivalently, but without the unwanted effects of the ternary operator. [1]: <https://gcc.gnu.org/onlinedocs/gcc/Other-Builtins.html#index-_005f_005fbuiltin_005fchoose_005fexpr> $ cat ?.c #include <stdio.h> int main(void) { printf("%zu\n", sizeof("01")); printf("%zu\n", sizeof(__builtin_choose_expr(1, "01", "01"))); printf("%zu\n", sizeof(1 ? "01" : "01")); printf("%zu\n", sizeof(char *)); } $ cc -Wall -Wextra ?.c $ ./a.out 3 3 8 8 In the above program, we can see how the ternary operator (?) decays the array into a pointer, and makes it so that sizeof() will return a constant 8. As we can see, everything in the use of the macro would make it look like it should work, but the combination of some seemingly-safe side effects of various C features produces a completely unexpected bug. Second up is a more straight forward case of simply calling nxt_length() on a char * pointer. Like the above this will generally result in a length of 7. When you sit and think about it, you know very well sizeof(char *) is probably 8 these days (but may be some other value like 4). But when you're in the depths of code it's very easy to overlook this when all you're thinking about is to get the length of some string. Let's look at this patch in action $ cat sdiv.c #include <stdio.h> #define nxt_nitems(x) (sizeof(x) / sizeof((x)[0])) #define nxt_length(s) (nxt_nitems(s) - 1) #define nxt_unsafe_length(s) (sizeof(s) - 1) #define STR_LITERAL "1234567890" static const char *str_lit = "1234567890"; int main(void) { printf("[STR_LITERAL] nxt_unsafe_length(\"1234567890\") [%lu]\n", nxt_unsafe_length(STR_LITERAL)); printf("[STR_LITERAL] nxt_length(\"1234567890\") [%lu]\n", nxt_length(STR_LITERAL)); printf("[char * ] nxt_unsafe_length(\"1234567890\") [%lu]\n", nxt_unsafe_length(str_lit)); printf("[char * ] nxt_length(\"1234567890\") [%lu]\n", nxt_length(str_lit)); return 0; } First lets compile it without any flags $ make sdiv $ ./sdiv [STR_LITERAL] nxt_unsafe_length("1234567890") [10] [STR_LITERAL] nxt_length("1234567890") [10] [char * ] nxt_unsafe_length("1234567890") [7] [char * ] nxt_length("1234567890") [7] It compiled without error and runs, although with incorrect results for the two char *'s. Now lets build it with -Wsizeof-pointer-div (also enabled with -Wall) $ CFLAGS="-Wsizeof-pointer-div" make sdiv cc -Wsizeof-pointer-div nxt_nitems.c -o nxt_nitems sdiv.c: In function ‘main’: sdiv.c:3:44: warning: division ‘sizeof (const char *) / sizeof (char)’ does not compute the number of array elements [-Wsizeof-pointer-div] 3 | #define nxt_nitems(x) (sizeof(x) / sizeof((x)[0])) | ^ nxt_nitems.c:4:34: note: in expansion of macro ‘nxt_nitems’ 4 | #define nxt_length(s) (nxt_nitems(s) - 1) | ^~~~~~~~~~ nxt_nitems.c:22:16: note: in expansion of macro ‘nxt_length’ 22 | nxt_length(str_lit)); | ^~~~~~~~~~ nxt_nitems.c:10:20: note: first ‘sizeof’ operand was declared here 10 | static const char *str_lit = "1234567890"; | ^~~~~~~ So we now get a very loud compiler warning (coming from nxt_length(char *), nxt_unsafe_length() of course didn't trigger any warnings), telling us we're being daft. The good news is this didn't find any existing bugs! Let's keep it that way... Link: <https://stackoverflow.com/a/57537491> Cc: Andrew Clayton <[email protected]> Signed-off-by: Alejandro Colomar <[email protected]> Reviewed-by: Andrew Clayton <[email protected]> Tested-by: Andrew Clayton <[email protected]> [ Tweaked and expanded the commit message - Andrew ] Signed-off-by: Andrew Clayton <[email protected]>
Rebased with master...
|
Hi @ac000 ! It's great to see this change in. Nice! BTW, in the future, you'll be able to use a native C feature, thus removing the extra abstraction: https://thephd.dev/the-big-array-size-survey-for-c. And soon, we'll be able to write this: wchar_t *
wmemset(size_t n; wchar_t wcs[n], wchar_t wc, size_t n)
{
for (size_t i = 0; i < _Lengthof(wcs); i++)
wcs[i] = wc;
return wcs;
} Effectively using array parameters as arrays, with the implicit bounds safety compared to pointers. For nxt strings, that won't be so useful, since you don't pass arrays around, but structures. However, you'll be able to use a GCC feature that has been developed in parallel: Cheers, |
I've been keeping an eye on these things. Good stuff! |