-
Notifications
You must be signed in to change notification settings - Fork 13k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Reduce CString allocations in std as much as possible #93668
Conversation
Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @dtolnay (or someone else) soon. Please see the contribution instructions for more information. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm commenting here for the record instead of requesting a change, because espidf
just looks like and talks like a unix
target, so including specific changes may not belong in this PR.
cc @ivmarkov who can tell me if I'm wrong and may want to know about this change
library/std/src/sys/unix/fs.rs
Outdated
// https://docs.rs/compiler_builtins/latest/compiler_builtins/probestack/index.html | ||
// | ||
// Here we chose a typical PATH_MAX minus 2 typical cache lines. | ||
const MAX_STACK_ALLOCATION: usize = 4096 - 2 * 64; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On espidf
targets we usually have stack sizes in the order of 4-8k and in most cases this allocation would be an immediate stack overflow.
espidf
also uses SPIFFS for its underlying storage which by default has a 32 byte object name length limit (and no folder support at all).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even though PATH_MAX
may usually be 4k, I suppose that most paths will actually be much, much shorter. Since this is an optimization for the fast path, I think that it would be fine to use something like 256 bytes here and it would still apply most of the time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Kobzol Great point! I whipped up a quick program to analyze my computer's path length stats and got this:
min=17
max=368
mean=125
p50=118
p90=185
p99=244
p999=271
stddev=42
stdvar=1733
256
actually looks super reasonable given those stats, though I feel like we might want to go for 512 just to make sure almost no one ever allocates (since I marked run_with_cstr_allocating
as cold and never inline).
Source:
use histogram::Histogram;
use std::env;
use std::fs::{canonicalize, read_dir};
use std::path::Path;
fn main() {
let args = env::args().collect::<Vec<String>>();
let mut histogram = Histogram::new();
if args.len() == 2 {
log(&mut histogram, canonicalize(&args[1]).unwrap());
} else {
log(&mut histogram, canonicalize(".").unwrap());
}
println!(
"min={}\nmax={}\nmean={}\np50={}\np90={}\np99={}\np999={}\nstddev={}\nstdvar={}",
histogram.minimum().unwrap(),
histogram.maximum().unwrap(),
histogram.mean().unwrap(),
histogram.percentile(50.0).unwrap(),
histogram.percentile(90.0).unwrap(),
histogram.percentile(99.0).unwrap(),
histogram.percentile(99.9).unwrap(),
histogram.stddev().unwrap(),
histogram.stdvar().unwrap(),
)
}
fn log(histogram: &mut Histogram, path: impl AsRef<Path>) {
for dir in read_dir(path.as_ref()).unwrap() {
let entry = dir.unwrap();
histogram
.increment(entry.path().as_os_str().len() as u64)
.unwrap();
if entry.file_type().unwrap().is_dir() {
log(histogram, entry.path());
}
}
}
Wouldn't it be better to have a |
@hkratz Yeah, as mentioned in the PR description this code is nowhere near ready to be merged. I want to make sure this change is desirable before I put in actual effort. |
@dtolnay Can I get some feedback on whether or not this is desired? I just completed similar work in nix: nix-rust/nix#1656. |
Update: I've been analyzing memory allocations in one of my programs that does a lot of I/O and path allocations account for ~15% of all allocations in the program. That's huge for something that should be free! |
r? rust-lang/libs |
☔ The latest upstream changes (presumably #94079) made this pull request unmergeable. Please resolve the merge conflicts. |
The concept of this seems reasonable to me. I agree with the suggestion to reduce the stack allocation to 512 bytes, and I do think the |
Awesome! I created #96104 to get that started. |
Reminder: as discussed earlier, on MCU targets like |
@ivmarkov Some random googling says the cache line is 32 bytes on an ESP32, so 32 bytes seems reasonable. |
Ok, I looked into the SmallCStr idea and it actually already exists: https://github.com/rust-lang/rust/blob/master/compiler/rustc_data_structures/src/small_c_str.rs. Unfortunately, that's in the compiler so we'd have to bring over all of SmallVec into std to make things work which doesn't seem like a good idea. Given those issues, I'll continue with the functional approach and figure what the assembly looks like. Should hopefully have time to do this before the end of the week. |
@SUPERCILEX I have been prototyping a minimal SmallCString implementation without external dependencies. It works fine except for some unnecessary copying in the generated code. I will share the code in a draft PR shortly. |
As long as it's as efficient as single branch on "is bigger than X, then allocate, else use the stack" that works for me. One thing I just thought of: with the SmallCString approach, we're always going to use the same amount of stack space even if we don't need it. That probably doesn't matter, but I think the current approach allows decrementing %rsp only when we plan to use that space and avoid it if we're allocating. |
The reserved (stack) frame size is fixed for each function in Rust and the stack pointer is decremented accordingly in the beginning of the function (see e.g. Godbolt). So that would be the case for |
Nevermind + TIL then. 😁 Very cool, thanks for looking into that. |
@hkratz Actually, you inspired me to do the SmallCString thing since I realized we don't have to support mutability and can therefore have a much simpler implementation than SmallVec. Let me know if you have feedback on my implementation.
Check out these sweet-:peach: results!
Old:
New:
|
It seems like InvalidInput is expected: rust/src/test/ui/paths-containing-nul.rs Line 15 in 09ccb6c
|
@bors try @rust-timer queue |
@SUPERCILEX: 🔑 Insufficient privileges: not in try users |
The Miri subtree was changed cc @rust-lang/miri |
@joshtriplett take 2? :) |
@bors r+ rollup=never |
☀️ Test successful - checks-actions |
Oh my god, it finally happened!!! 🤩 Thanks everybody! |
Finished benchmarking commit (1b22541): comparison URL. Overall result: ❌ regressions - ACTION NEEDEDNext Steps: If you can justify the regressions found in this perf run, please indicate this with @rustbot label: +perf-regression Instruction countThis is a highly reliable metric that was used to determine the overall result at the top of this comment.
Max RSS (memory usage)ResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
CyclesResultsThis is a less reliable metric that may be of interest but was not used to determine the overall result at the top of this comment.
Footnotes |
Dang, I'm really not quite sure what to make of these benchmarks. They seem to be flip flopping. My own testing shows a ~4x improvement in instruction count and very minor improvement in real-time in an isolated micro-benchmark. Test code for _ in 0..10_000_000 {
let z = File::open("Cargo.toml").unwrap();
std::hint::black_box(&z);
} Hyperfine
Old
New
|
We don't have runtime benchmarks yet, these are compile time benchmarks. The results look mostly neutral. |
Whew, ok. |
The @rustbot labels: +perf-regression-triaged |
Reduce CString allocations in std as much as possible Currently, every operation involving paths in `fs` allocates memory to hold the path before sending it through the syscall. This PR instead uses a stack allocation (chosen size is somewhat arbitrary) when the path is short before falling back to heap allocations for long paths. Benchmarks show that the stack allocation is ~2x faster for short paths: ``` test sys::unix::fd::tests::bench_heap_path_alloc ... bench: 34 ns/iter (+/- 2) test sys::unix::fd::tests::bench_stack_path_alloc ... bench: 15 ns/iter (+/- 1) ``` For long paths, I couldn't find any measurable difference. --- I'd be surprised if I was the first to think of this, so I didn't fully flush out the PR. If this change is desirable, I'll make use of `run_with_cstr` across all platforms in every fs method (currently just unix open for testing). I also added an `impl From<FromBytesWithNulError>` which is presumably a no-no (or at least needs to be done in another PR). --- Also see nix-rust/nix#1655 with a bunch of discussion where I'm doing something similar.
Currently, every operation involving paths in
fs
allocates memory to hold the path before sending it through the syscall. This PR instead uses a stack allocation (chosen size is somewhat arbitrary) when the path is short before falling back to heap allocations for long paths.Benchmarks show that the stack allocation is ~2x faster for short paths:
For long paths, I couldn't find any measurable difference.
I'd be surprised if I was the first to think of this, so I didn't fully flush out the PR. If this change is desirable, I'll make use of
run_with_cstr
across all platforms in every fs method (currently just unix open for testing). I also added animpl From<FromBytesWithNulError>
which is presumably a no-no (or at least needs to be done in another PR).Also see nix-rust/nix#1655 with a bunch of discussion where I'm doing something similar.