Rust is fast at runtime — but not so much at compile time. That’s hardly news to anyone who's worked on a serious Rust codebase. There's a whole genre of blog posts dedicated to shaving seconds off cargo build.
At Feldera, we let users write SQL to define tables and views. Under the hood, we compile that SQL into Rust code — which is then compiled with rustc to a single binary that incrementally maintains all views as new data streams into tables.
We've pulled plenty of tricks in the past to speed up compilation: type erasure, aggressive code deduplication, limiting codegen lines. And that got us quite far. Recently, however, we started onboarding a new, large enterprise client with fairly complicated SQL. They wrote many very large programs with Feldera. One of them, for example, was 8,562 lines of SQL that the Feldera SQL-to-Rust compiler translates into roughly 100k lines of Rust code.
To be clear, this isn’t some massive monolith we’re compiling. We’re talking about ~100k lines of generated Rust. That’s peanuts compared to something like the Linux kernel — 40 million lines (which manages to compile in a few minutes).
And yet… this one program was taking around 25 minutes to compile on my machine. Worse, on our customer's setup it took about 45 minutes. And that was after we had already switched the generated code to dynamic dispatch and eliminated pretty much all monomorphization.
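For context, here is a minimal sketch (with made-up types, not Feldera's actual generated code) of what that switch means: a generic operator gets monomorphized into a fresh copy for every concrete type it is used with, while the type-erased version is compiled exactly once and dispatched through a vtable.

```rust
trait Operator {
    fn step(&mut self, input: &[i64]) -> Vec<i64>;
}

struct Sum;
impl Operator for Sum {
    fn step(&mut self, input: &[i64]) -> Vec<i64> {
        vec![input.iter().sum()]
    }
}

// Monomorphized: rustc stamps out (and LLVM optimizes) a separate copy of this
// function for every concrete `Op` it is called with.
fn run_generic<Op: Operator>(op: &mut Op, input: &[i64]) -> Vec<i64> {
    op.step(input)
}

// Type-erased: a single compiled copy, dispatched through a vtable at runtime.
fn run_dyn(op: &mut dyn Operator, input: &[i64]) -> Vec<i64> {
    op.step(input)
}

fn main() {
    let mut op = Sum;
    println!("{:?}", run_generic(&mut op, &[1, 2, 3]));
    println!("{:?}", run_dyn(&mut op, &[1, 2, 3]));
}
```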
Here's the log from the Feldera manager:
Almost all of that time is spent compiling Rust; the SQL-to-Rust translation takes about 1m40s. To make matters worse, the Rust build is the equivalent of a cargo release build, so it happens from scratch every time (except for cargo crate dependencies, which are already cached and re-used in the times we report here). Even the tiniest change in the input SQL kicks off a full rebuild of that giant program.
Of course, we tried debug builds too. Those cut the time down to ~5 minutes, but they're not usable in practice. Our customers care about actual runtime performance: once the SQL type-checks, they already know the generated Rust will compile successfully, and they're running real-time data pipelines where they want to see end-to-end latency and throughput. Debug builds are just too slow and misleading for that.
What's happening?
Here’s the frustrating part.
We're using rustc v1.83, and despite having a 64-core machine with 128 threads, Rust barely puts any of them to work. This becomes evident quickly when looking at htop during the compilation:
That’s right. One core at 100%, and the rest are asleep.
We can instrument the compilation of this crate by adding -Ztime-passes to RUSTFLAGS (this requires building with a nightly toolchain). It reveals that the majority of the time is spent in LLVM passes and codegen, which unfortunately are single-threaded:
Sometimes during these 30 minutes, Rust will spin up a few threads — maybe 3 or 4 — but it never fully utilizes the machine. Not even close.
I get it: parallelizing compilation is hard. But this isn't some edge case; looking at the program ourselves, we clearly saw plenty of opportunities to parallelize its compilation.
An aside: you might wonder whether increasing codegen-units in Cargo.toml would speed up these passes. In our experience, it didn't matter: it was set to the default of 16 for the times reported here, but we also tried values like 256 with the default LTO configuration (thin local LTO). That was somewhat confusing (at least to a non-rustc expert); I'd love to read an explanation for it.
What can we do about it?
Instead of emitting one giant crate containing everything, we tweaked our SQL-to-Rust compiler to split the output into many smaller crates, each encapsulating just a portion of the logic and depending on the others as needed, with a single top-level main crate pulling them all in.
The results were spectacular. Here's the same htop view after the change during compilation:
Beautiful. All the CPUs are now fully utilized all the time.
And it shows: the time to compile the Rust program is down to 2m10s!
How did we fix it?
In most Rust projects, splitting logic across dozens (or hundreds) of crates is impractical at best, a nightmare at worst. But in our case, it was surprisingly straightforward — thanks to how Feldera works under the hood.
When a user writes SQL in Feldera, we translate it into a dataflow graph: nodes are operators that transform data, and edges represent how data flows between them. Here's a small fragment of such a graph:
Since the Rust code is entirely auto-generated from this structure, we had total control over how to split it up.
Each operator becomes its own crate. Each crate exports a single function that builds one specific piece of the dataflow. They all follow the same predictable shape. The top-level main crate just wires them together.
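Here is a minimal, self-contained sketch of that shape (hypothetical names and a toy stand-in for the dataflow; the real generated code builds Feldera's dataflow circuits, and each module below would live in its own crate):

```rust
// What a generated operator crate conceptually exports: a single `build`
// function that contributes one operator to the dataflow. In the real output,
// each of these is its own crate; they are modules here only to keep the
// sketch runnable in one file.
mod op_filter_positive {
    pub fn build(stream: Vec<i64>) -> Vec<i64> {
        stream.into_iter().filter(|x| *x > 0).collect()
    }
}

mod op_double {
    pub fn build(stream: Vec<i64>) -> Vec<i64> {
        stream.into_iter().map(|x| x * 2).collect()
    }
}

// The top-level `main` crate does nothing but wire the generated pieces together.
fn main() {
    let input = vec![-1, 2, 3];
    let filtered = op_filter_positive::build(input);
    let doubled = op_double::build(filtered);
    println!("{:?}", doubled); // [4, 6]
}
```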
We still need to figure out how to name these crates. A simple but powerful approach is to hash the Rust code each crate contains and use that hash as the crate name.
This ensures two things:
a. We have unique crate names.
b. More importantly, incremental recompilation after SQL changes becomes incredibly effective.
Imagine the user tweaks the SQL code just slightly. What happens is that most of the operators (and their crates) stay identical (the hash doesn't change), and rustc can re-use most of the previously compiled artifacts. Any new code that gets added due to the change will end up generating a new crate (with a different hash).
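A minimal sketch of that naming scheme (illustrative only; std's DefaultHasher keeps the example dependency-free, whereas a real implementation would want a hash that is stable across compiler versions, such as SHA-256):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Illustrative only: derive a crate name from the generated Rust source.
fn crate_name_for(generated_source: &str) -> String {
    let mut hasher = DefaultHasher::new();
    generated_source.hash(&mut hasher);
    format!("op_{:016x}", hasher.finish())
}

fn main() {
    let source_a = "pub fn build() { /* operator A */ }";
    let source_b = "pub fn build() { /* operator B */ }";

    // Unchanged source => unchanged crate name => cached artifact can be reused.
    assert_eq!(crate_name_for(source_a), crate_name_for(source_a));
    // Different source => different crate name => only that crate is rebuilt.
    assert_ne!(crate_name_for(source_a), crate_name_for(source_b));

    println!("{}", crate_name_for(source_a));
}
```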
So how many crates are we talking about for that monster SQL program?
Let’s peek into the compiler directory inside the feldera container:
And then:
That’s right — 1,106 crates!
Sounds excessive? Maybe. But in the end this is what makes rustc much more effective.
Are we done?
Unfortunately, not quite. There are still some mysteries here. Given that we now fully utilize 128 threads (64 cores) for pretty much the entire compile time, we can do a back-of-the-envelope calculation for how long it should take: 25 min / 128 = 12 sec (or maybe 24 sec, since hyper-threads aren't real cores). Yet it takes 170s to compile everything. Of course, we can't expect linear speed-up in practice, but being 7x slower than that still seems excessive (these are all just parallel rustc invocations that run independently). Similar slowdowns also happen on laptop-grade machines with far less memory and far fewer cores, so it doesn't just affect very large machines.
Here are some thoughts on what might be happening, but we'd be happy to hear some more opinions on this:
- Contention on hardware resources (the system has more than enough memory but it might contend on caches)
- The file system is a bottleneck (we doubt it, since we also tried running this on a RAM-backed file system and it didn't make a difference, though there could still be contention on locks in the kernel's file-system code)
- Compiling each of the ~1k crates now repeats some steps that get amortized when using a single crate (true, but the slowdown doesn't show up if we compile with -j1: the individual crate compile times are much faster as long as they happen in sequence)
- Linking is the bottleneck now that we have ~1k crates? We use mold, and the total link time we see is only around 7 sec.
Conclusion
By simply changing how we generate Rust code under the hood, we’ve made Feldera’s compile times scale with your hardware instead of fighting it. What used to take 30–45 minutes now compiles in under 3 minutes, even for complex enterprise-scale SQL.
If you’re already pushing Feldera to its limits: thank you. Your workloads help us make the system better for everyone.