Cloudflare Raves About Performance Gains After Rust Rewrite (cloudflare.com) 52
"We've spent the last year rebuilding major components of our system," Cloudflare announced this week, "and we've just slashed the latency of traffic passing through our network for millions of our customers." (There's a 10ms cut in the median time to respond, plus a 25% performance boost as measured by CDN performance tests.) They replaced a 15-year-old system named FL (where they run security and performance features), and "At the same time, we've made our system more secure, and we've reduced the time it takes for us to build and release new products."
And yes, Rust was involved: We write a lot of Rust, and we've gotten pretty good at it... We built FL2 in Rust, on Oxy [Cloudflare's Rust-based next generation proxy framework], and built a strict module framework to structure all the logic in FL2... Built in Rust, [Oxy] eliminates entire classes of bugs that plagued our Nginx/LuaJIT-based FL1, like memory safety issues and data races, while delivering C-level performance. At Cloudflare's scale, those guarantees aren't nice-to-haves, they're essential. Every microsecond saved per request translates into tangible improvements in user experience, and every crash or edge case avoided keeps the Internet running smoothly. Rust's strict compile-time guarantees also pair perfectly with FL2's modular architecture, where we enforce clear contracts between product modules and their inputs and outputs...
It's a big enough distraction from shipping products to customers to rebuild product logic in Rust. Asking all our teams to maintain two versions of their product logic, and reimplement every change a second time until we finished our migration was too much. So, we implemented a layer in our old NGINX and OpenResty based FL which allowed the new modules to be run. Instead of maintaining a parallel implementation, teams could implement their logic in Rust, and replace their old Lua logic with that, without waiting for the full replacement of the old system.
Over 100 engineers worked on FL2 — and there was extensive testing, plus a fallback-to-FL1 procedure. But "We started running customer traffic through FL2 early in 2025, and have been progressively increasing the amount of traffic served throughout the year...." As we described at the start of this post, FL2 is substantially faster than FL1. The biggest reason for this is simply that FL2 performs less work [thanks to filters controlling whether modules need to run]... Another huge reason for better performance is that FL2 is a single codebase, implemented in a performance focussed language. In comparison, FL1 was based on NGINX (which is written in C), combined with LuaJIT (Lua, and C interface layers), and also contained plenty of Rust modules. In FL1, we spent a lot of time and memory converting data from the representation needed by one language, to the representation needed by another. As a result, our internal measures show that FL2 uses less than half the CPU of FL1, and much less than half the memory. That's a huge bonus — we can spend the CPU on delivering more and more features for our customers!
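The "performs less work" point can be pictured with a small sketch. This is not Cloudflare's code: the `Module` trait, the `Request` fields, and the `Waf` and `ImageResizing` modules are all hypothetical names chosen for illustration. The idea shown is only that each module exposes a cheap filter, and the pipeline skips the module's logic entirely when the filter says the request does not need it.

```rust
// Hypothetical sketch: none of these names come from Cloudflare's FL2 code.

/// The slice of request state this sketch looks at (assumed fields).
struct Request {
    host: String,
    waf_enabled: bool,
    image_resizing_enabled: bool,
}

/// A product module declares a cheap filter plus the work it does.
trait Module {
    fn name(&self) -> &'static str;
    /// Cheap check: does this request need the module at all?
    fn filter(&self, req: &Request) -> bool;
    /// The (potentially expensive) module logic.
    fn run(&self, req: &Request) -> String;
}

struct Waf;
struct ImageResizing;

impl Module for Waf {
    fn name(&self) -> &'static str {
        "waf"
    }
    fn filter(&self, req: &Request) -> bool {
        req.waf_enabled
    }
    fn run(&self, req: &Request) -> String {
        format!("waf checked {}", req.host)
    }
}

impl Module for ImageResizing {
    fn name(&self) -> &'static str {
        "image_resizing"
    }
    fn filter(&self, req: &Request) -> bool {
        req.image_resizing_enabled
    }
    fn run(&self, req: &Request) -> String {
        format!("resized images for {}", req.host)
    }
}

/// Runs only the modules whose filter matches; returns the names of those that ran.
fn run_pipeline(modules: &[Box<dyn Module>], req: &Request) -> Vec<&'static str> {
    let mut ran = Vec::new();
    for module in modules {
        if module.filter(req) {
            let _outcome = module.run(req); // skipped modules cost only the filter check
            ran.push(module.name());
        }
    }
    ran
}

fn main() {
    let modules: Vec<Box<dyn Module>> = vec![Box::new(Waf), Box::new(ImageResizing)];
    let req = Request {
        host: "example.com".to_string(),
        waf_enabled: true,
        image_resizing_enabled: false,
    };
    println!("modules run: {:?}", run_pipeline(&modules, &req));
}
```

In this sketch the trait also stands in for the "clear contracts" idea from the quote: a module can only see what the `Request` type exposes, and the compiler enforces that.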
Using our own tools and independent benchmarks like CDNPerf, we measured the impact of FL2 as we rolled it out across the network. The results are clear: websites are responding 10 ms faster at the median, a 25% performance boost. FL2 is also more secure by design than FL1. No software system is perfect, but the Rust language brings us huge benefits over LuaJIT. Rust has strong compile-time memory checks and a type system that avoids large classes of errors. Combine that with our rigid module system, and we can make most changes with high confidence...
We have long followed a policy that any unexplained crash of our systems needs to be investigated as a high priority. We won't be relaxing that policy, though the main cause of novel crashes in FL2 so far has been due to hardware failure. The massively reduced rates of such crashes will give us time to do a good job of such investigations. We're spending the rest of 2025 completing the migration from FL1 to FL2, and will turn off FL1 in early 2026. We're already seeing the benefits in terms of customer performance and speed of development, and we're looking forward to giving these to all our customers.
After that, when everything is modular, in Rust and tested and scaled, we can really start to optimize...!
Thanks to long-time Slashdot reader Beeftopia for sharing the article.
Rust is great but... (Score:4, Insightful)
The headline is misleading. Rust has nothing to do with the performance gains. They rewrote an entire system using what they learned from the previous version. Sometimes this is the right thing to do. It's not always easy to predict, but when folks start raving about the rewrite then it was probably a good decision.
Re: Rust is great but... (Score:2)
I agree, most performance issues are due to a misdirected architecture, not language.
Not all architectural performance bottlenecks were a problem initially, but as the system grows they become more and more noticeable.
Re: Rust is great but... (Score:4, Insightful)
People who say this usually haven't tried to write simultaneously multithreaded and concurrent applications in a systems language. Shit, Rust makes it easier to do that than even "easy" higher level languages that were specifically designed for it from the beginning, like Go.
Re: Rust is great but... (Score:1)
Re: Rust is great but... (Score:2)
Re: (Score:1)
And maybe you won't believe me, but my stuff works fine and without bugs.
There's a massive difference between the 500 line scripts you write, and projects like this. What you're doing, simply put, does not scale to this level. Period. There's a reason why, even in languages like python, tools like mypy exist.
This is why I like Python, because it can get straight to where you are going without running in circles.
In other words, every time you've tried to go outside of python, you find yourself running in circles. That's not a problem with other languages.
Re: (Score:2)
Re: Rust is great but... (Score:2)
Declaring variable types is running in circles. You know how to use C without types?
What languages are you still using that don't have implicit typing? Java 8? C# 3? Do tell.
And my scripts are more than 100 lines.
Which is even less than I was assuming, but ok.
I write full applications, I just use AI for the smaller scripts.
Oh...wait...You're the guy who wrote the macos calculator. It all makes sense now.
Re:Rust is great but... (Score:4, Interesting)
The headline is misleading.
At a minimum, the headline is certainly (and perhaps intentionally?) ambiguous. The "summary" - which probably includes the entire blog post - does make it pretty obvious that rust was not the reason for the speedup, although their choice of rust certainly makes prima facie sense.
I'm a little surprised that replacing a bunch of old disparate software that's basically hacked together with a scripting language (obligatory xkcd [xkcd.com]) with a new custom compiled job only resulted in a 25% speed-up.
Re: (Score:2, Informative)
You know, assumptions are a tricky thing to base your reasoning on. But, as a hint, CF blog posts tend to be on the longer side, with interesting technical details, so it's kind of sad to see that TFS does not include a link to the source - although it aligns with the tradition of not RTFA around here, so maybe-ok job EditorDavid? Anyway, for your reading pleasure, this appears to be the missing link [cloudflare.com].
And btw, this:
Re: (Score:1)
They cut CPU and memory use by at least 50%. The 25% gain is on the service as they measure it. Both are pretty impressive numbers.
Re:Rust is great but... (Score:4, Interesting)
Re: (Score:2)
I was going to suggest FORTRAN.
Replacing cast-iron bicycle with a titanium one (Score:4, Informative)
The headline is misleading. Rust has nothing to do with the performance gains. They rewrote an entire system using what they learned from the previous version. Sometimes this is the right thing to do. It's not always easy to predict, but when folks start raving about the rewrite then it was probably a good decision.
Having seen the difference between something in Go and Rust, the hype is real. Go is a dogshit slow language....much slower than Java, but a little faster than Python. I ported some Go testing utilities some dipshit at my company wrote to rust...VERY tangible difference...probably 25%. Had full confirmation from the team there was no loss in functionality. The go code was even well written, IMO, certainly no obvious explanations for the bad performance...it's just a shitty slow language...same with Python, JavaScript, etc. We've seen similar results porting node.js or Go garbage from old teams into Java.
If you don't care about performance or efficiency or cloud spend...write in whatever you want...when money is on the line?...it's pretty common your toy prototypes need to be rewritten in grown-up languages like Rust or Java or even C/C++. Facebook famously started on PHP and had to rewrite everything because the language couldn't handle a site that big. We ported a bloated boondoggle Python app to Java...gave us MASSIVE cloud spend bills...reduced to less than half when porting to Java. We used to require 12-20 instances and now it's like 4-8...response time is 1/3 of what it used to be, etc. Admittedly, a modest fraction of that is what you described...re-examining old functionality, but most of the latency was really just removing the Python overhead.
Replacing Lua with Rust is like replacing a heavy steel beach cruiser with a modern carbon-fiber or titanium racing bike. You WILL see a massive performance boost. You could claim the cyclist just got better and yeah...the operator makes a bigger difference than the tool...but....shitty tools are shitty tools. Lua...FFS...who the fuck would use that for mission critical infrastructure? Isn't Lua a kid-friendly scripting language? No scripting language should be in charge of anything you want performance and efficiency from.
Re: (Score:2)
To be fair there's a common way to compile Lua to JVM bytecode so it's likely just a Java front-end, not using the basic interpreter.
Back in the day there was a craze to port Lua, Ruby, Perl, Groovy(!), to run as Java front-ends. Not many got put into production outside of Lua.
However the real point here is that it's now "tell me why I shouldn't use Rust" time.
Moving ABI might be a reasonable objection for a small team but Cloudflare has over a hundred engineers on this so it's not a problem.
They get speed
Re: (Score:2)
Okay FP, but I think there is actually a term "second-system effect" to describe it. I just confirmed it's in the old jargon file. (I even had a dead tree version a long time ago...)
Re: (Score:1)
Stop using IP ranges primarily used to attack the internet and you won't have issues.
Re: (Score:2)
I have the same issues, in one of Comcast's blocks. If I do get a different address after restarting my cable modem, it's still going to be in the same /23 network, like it has been for the last 13 years.
Rust's faster than Lua, what a surprise (Score:4, Informative)
I mean, come on. Everybody knows that. They could've implemented the Lua parts in C as well, and then compare performance.
Re: (Score:2)
Yes, but that would have been a silly thing to do. Performance is not the only point here.
Re: (Score:2)
I mean, come on. Everybody knows that. They could've implemented the Lua parts in C as well, and then compare performance.
According to the article they used LuaJIT. I would not be surprised in their use case to get basically equivalent C performance.
They did state the main reason for the better performance: the new implementation has less logic, and by using Rust cohesively instead of mixing it with C/Lua components there is no need for "translation layers" between languages anymore.
They could have achieved the same by consolidating to C, but of course Rust brings additional important advantages for them.
Re: (Score:2)
Usually I see a headline like this and I'm like "yeah, the original code sucked and would have been made faster by rewriting it in the original language too".
And I'm 100% confident that this is the case here too. But Lua is slow as fuck and I'm sure that transition did help considerably.
I wonder if that's why it's been sucking less... (Score:2)
Re: (Score:2)
Probably some hilariously misguided attempt at damping vibrations. Or cosmic radiation or something. It's not full of copper, that's for sure, as the weight would do a proper number on the connections (there are matching RCA cables just hanging in the air).
Re:I wonder if that's why it's been sucking less.. (Score:4, Funny)
Those cables look like something that infected the spaceship in act one, and that were finally killed in act three.
Captchas just got a whole lot faster (Score:3)
and the ubiquitous surveillance of large swathes of the internet as well. Woohoo!
Except when sites are going down (Score:3)
Does "better performance" include inducing major outages or are those conveniently ignored?
Re: (Score:2)
Don't 10MS With the Zoho (Score:1)
Two obvious - and major - problems with this (Score:2)
1. It's more than likely that the original code simply wasn't well-written. First generation code often isn't.
2. Even if the original code was well-written, a rewrite is highly likely to produce improvements -- presuming that the authors of the second-generation code studied what already existed and thought carefully about its issues/problems.
In other words, I
Re: (Score:2)
First generation code is more often than not written with theoretical use-cases in mind. Second generation code is usually written with hindsight of how the 1st generation code was actually used in practice. You seem to call that "better", I'd rather use "more insightful".
Re: (Score:2)
Sometimes it is the language. I often rewrote Python code into C++ or D or (occasionally) C. There were always large gains, even when the logic stayed the same. (Though I've got to admit it often didn't.)
Switching from C to Rust might encourage the use of hash tables, which can often speed things up. (I don't like Rust, but for many purposes it's got a much better standard library than C does.)
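To illustrate the hash table point, here is a minimal sketch, assuming a made-up workload of looking up a counter by hostname. The function names `lookup_linear` and `lookup_hashed` are hypothetical. The linear version is what tends to get written when no hash table is at hand, while `std::collections::HashMap` ships with Rust's standard library.

```rust
use std::collections::HashMap;

/// Linear scan over key/value pairs: cost grows with the number of entries.
fn lookup_linear(pairs: &[(String, u32)], key: &str) -> Option<u32> {
    pairs
        .iter()
        .find(|(k, _)| k.as_str() == key)
        .map(|(_, v)| *v)
}

/// Hash table lookup: roughly constant time on average, whatever the table size.
fn lookup_hashed(index: &HashMap<String, u32>, key: &str) -> Option<u32> {
    index.get(key).copied()
}

fn main() {
    // Made-up data purely for illustration.
    let pairs: Vec<(String, u32)> = vec![
        ("a.example".to_string(), 1),
        ("b.example".to_string(), 2),
        ("c.example".to_string(), 3),
    ];
    // Build the index once, then reuse it for every lookup.
    let index: HashMap<String, u32> = pairs.iter().cloned().collect();

    println!("{:?}", lookup_linear(&pairs, "c.example"));
    println!("{:?}", lookup_hashed(&index, "c.example"));
}
```

Both functions return the same answer; the difference only shows up as the table grows and lookups become frequent, which is why profiling should decide whether the change is worth it.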
Performance tuning and profiling (Score:2)
There is only one reliable way to improve the speed of code: performance tuning / profiling. At least 90% of performance problems exist because nobody bothered to profile the code.
C / C++ are inherently faster than languages like Rust, because they play fast and loose with memory and references, leaving it to the programmer to make sure they properly release unused memory, and not try to reference unallocated memory. But those differences in inherent speed are vastly overwhelmed by poorly structured code. S
LuaJIT is inherently single-threaded . . . (Score:1)
. . . including its runtime, garbage collection, and compiler. For high-transaction-rate, highly parallel I/O use cases, such as Cloudflare's, that means a LuaJIT application, even when deployed across multiple processes running on the same box, will likely be hamstrung in comparison to even moderately competently written Rust, Java, or C.
Or am I missing something obvious?