adsharma 4 days ago [-]
> We plan to deliver improvements to [..] purging mechanisms
During my time at Facebook, I maintained a bunch of kernel patches to improve jemalloc purging mechanisms. It wasn't popular in the kernel or the security community, but it was more efficient on benchmarks for sure.
Many programs run multiple threads, allocate in one and free in the other. Jemalloc's primary mechanism used to be: madvise the page back to the kernel and then have it allocate it in another thread's pool.
One problem: this involves zeroing memory, which has an impact on cache locality and overall app performance. It's completely unnecessary if the page is being recirculated within the same security domain.
The problem was getting everyone to agree on what that security domain is, even if the mechanism was opt-in.
jcalvinowens 4 days ago [-]
I'm really surprised to see you still hawking this.
We did extensive benchmarking of HHVM with and without your patches, and they were proven to make no statistically significant difference in high level metrics. So we dropped them out of the kernel, and they never went back in.
I don't doubt for a second you can come up with specific counterexamples and microbenchmarks which show benefit. But you were unable to show an advantage at the system level when challenged on it, and that's what matters.
adsharma 3 days ago [-]
You probably weren't there when servers were running for many days at a time.
By the time you joined and benchmarked these systems, the continuous rolling deployment had taken over. If you're restarting the server every few hours, of course the memory fragmentation isn't much of an issue.
> But you were unable to show an advantage at the system level when challenged on it, and that's what matters.
You mean 5 years after I stopped working on the kernel and the underlying system had changed?
I don't recall ever talking to you on the matter.
jcalvinowens 3 days ago [-]
> By the time you joined and benchmarked these systems, the continuous rolling deployment had taken over
Nope, I started in 2014.
> I don't recall ever talking to you on the matter.
I recall. You refused to believe the benchmark results and made me repeat the test, then stopped replying after I did :)
adsharma 3 days ago [-]
The patches were written in 2011 and published in 2012. They did what they were supposed to at the time.
For the peanut gallery: this is a manifestation of an internal eng culture at fb that I wasn't particularly fond of. Celebrating that "I killed X" and partying about it.
You didn't reply to the main point: did you benchmark a server that was running for several days at a time? Reasonable people can disagree about whether this is a good deployment strategy or not. I tend to believe that there are many places which want to deploy servers and run them for days, if not months.
alexgartrell 3 days ago [-]
For the peanut gallery more: I worked with both of these guys at Meta on this.
The "servers are only on for a few hours" thing was like never true so I have no idea where that claim is coming from. The web performance test took more than a few hours to run alone and we had way more aggressive soaks for other workloads.
My recollection was that "write zeroes" just became a cheaper operation between '12 and '14.
A fun fact to distract from the awkwardness: a lot of the kernel work done in the early days was exceedingly scrappy. The port mapping stuff for memcached UDP before SO_REUSEPORT for example. FB binaries couldn't even run on vanilla linux a lot of the time. Over the next several years we put a TON of effort in getting as close to mainline as possible and now Meta is one of the biggest drivers of Linux development.
ot 3 days ago [-]
It's not just that zeroing got cheaper, but also we're doing a lot less of it, because jemalloc got much better.
If the allocator returns a page to the kernel and then immediately asks back for one, it's not doing its job well: the main purpose of the allocator is to cache allocations from the kernel. Those patches are pre-decay, pre-background purging thread; these changes significantly improve how jemalloc holds on to memory that might be needed soon. Instead, the zeroing out patches optimize for the pathological behavior.
Also, the kernel has since exposed better ways to optimize memory reclamation, like MADV_FREE, which is a "lazy reclaim": the page stays mapped to the process until the kernel actually needs it, so if we use it again before that happens, the whole unmapping/mapping is avoided, which saves not only the zeroing cost, but also the TLB shootdown and other costs. And without changing any security boundary. jemalloc can take advantage of this by enabling "muzzy decay".
However, the drawback is that system-level memory accounting becomes even more fuzzy.
(hi Alex!)
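jemalloc exposes these decay knobs through MALLOC_CONF. A hypothetical invocation (the option names are real jemalloc options; the binary name and the values are purely illustrative, not recommendations):

```shell
# Keep freed pages "dirty" for 10s, then MADV_FREE them and keep them
# "muzzy" for another 10s before fully purging back to the kernel.
MALLOC_CONF="dirty_decay_ms:10000,muzzy_decay_ms:10000" ./my_server
```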
menaerus 3 days ago [-]
I am trying to understand the reason behind why "zeroing got cheaper" circa 2012-2014. Do you have some plausible explanations that you can share?
Haswell (2013) doubled the store throughput to 32 bytes/cycle per core, and Sandy Bridge (2011) doubled the load throughput to the same, but the dataset being operated on at FB is most likely much larger than what L1+L2+L3 can fit, so I am wondering how much effect the vectorization engine might have had, since a bulk-zeroing operation over large datasets is going to be bottlenecked by single-core memory bandwidth anyway, which at the time was ~20 GB/s.
Perhaps the operation became cheaper simply because of moving to another CPU uarch with higher clock and larger memory bandwidth rather than the vectorization.
jcalvinowens 3 days ago [-]
My memory is that Ivy Bridge was when it started being different.
ahoka 3 days ago [-]
AVX maybe?
adsharma 3 days ago [-]
[ Edit: "servers" in this context meant the HHVM server processes, not the physical server which of course had a longer uptime ]
I think it's fair to say the hardware changed, the deployment strategy changed and the patches were no longer relevant, so we stopped applying them.
When I showed up, there were 100+ patches on top of a 2009 kernel tree. I reduced the size to about 10 or so critical patches, rebased them at a 6 months cadence over 2-3 years. Upstreamed a few.
Didn't go around saying those old patches were bad ideas and I got rid of them. How you say it matters.
alexgartrell 3 days ago [-]
The linked article says they decided to do CD in 2016 fwiw so that's not inconsistent with what I said.
You reduced the number of patches a lot and also pushed very hard to get us to 3.0 after we sat on 2.6.38 ~forever. Which was very appreciated, btw. We built the whole plan going forward based on this work.
I'm not arguing that anyone should be nice to anyone or not (it's a waste of breath when it comes to Linux). I'm just saying that the benchmarking was thorough and that contemporary 2014 hardware could zero pages fast.
yalok 3 days ago [-]
Tangentially, on this CD policy - it leads to really high p99s for a long tail of rare requests which don’t get reliable prewarming due to these frequent HHVM restarts…
1bpp 3 days ago [-]
This is why I always read the comments here.
genxy 3 days ago [-]
That is, wow, a story.
At what point did you realize how different fb engineering was from what you expected?
hedayet 3 days ago [-]
For me it happened around my first week after the bootcamp, so about 6 weeks from joining.
An important nuance: most Facebook engineers don't believe that Facebook/Meta will continue to grow next year, and that disbelief has been there since at least 2018 (when I joined).
Very few Facebook employees use their products outside of testing, which is a big contributor to that fear - they just can't believe that there are billions of people who would continue to use apps to post what they had for lunch!
And as a result of that lack of faith, most of them believe that Meta is a bubble that can burst at any point. Consequently, everyone works for the next performance review cycle, and most are just in a rush to capture as much money as they can before that bubble bursts.
specialist 3 days ago [-]
> don't believe that Facebook/Meta would continue to grow next year
Huh.
The time I worked at a hyper growth company, us working in the coal mine had much the same skepticism. Our growth rate seemed ridiculous, surely we're over building, how much longer can this last?!
Happily, the marketing research team regularly presented stuff to our department. They explained who our customers were, projected market sizes (regionally, internationally), projected growth rates, competitive analysis (incumbents and upstarts), etc.
It helped so much. And although their forecasts seemed unbelievable, we overperformed every year-over-year. Such that you sort of start to trust the (serious) marketing research types.
Salgat 3 days ago [-]
I'm personally appreciative of these comments. It's good when people make claims, get challenged, and both sides walk away with informative points made. It's entirely possible both sides here are correct and wrong in their own ways.
yalok 3 days ago [-]
Fwiw, this sounds like a healthy discourse - you don’t have to agree on everything, every approach has its merits, code that ends up shipping and supporting production wins the argument in some sense…
This is not special to Meta in any way, I observed it in any team which has more than 1 strong senior engineer.
menaerus 3 days ago [-]
No, calling out your ex colleague in public years after is not a "healthy discourse" ...
bigstrat2003 3 days ago [-]
There's nothing healthy about holding on to a work grudge from 10 years ago and then dragging it out in public. That's toxic AF.
debo_ 3 days ago [-]
This is literally how pretty much every conversation goes when you work with people close to the metal. It's a stylistic thing at this point.
For what it's worth, 20 years ago all programming newsgroups were like this. I grew my thick skin on alt.lang.perl lol
teiferer 3 days ago [-]
Except one is an employee and the other one is an ex employee. The bias this introduces is not just a minor nuance, it's what fuels the public conflict and causes everybody else to double check their popcorn reserves.
Of course technical discussions happen all the time at companies between competent people. But you don't do that in public, nor is this a technical debate: "I don't recall talking to you about it" - "I do, I did xyz then you ignored me" - "<changes subject>"
adsharma 3 days ago [-]
Important distinction yes. It also means I can't go back and check the thread on what was said and when. Nor do I want to.
Always good to talk face to face if you have strong feelings about something. When I said "talk" I meant literally face to face.
After spending a decade or so on lkml, everyone develops a thick skin. But mix that with the corporate environment of Facebook circa 2011, and being an ex-employee adds even more to the drama.
Having read through the comments here, I'm still of the opinion that any HW changes had a secondary effect and the primary contributor was a change in how HHVM/jemalloc interacted with MADV.
One more suggestion: evaluate more than one app and company wide profiling data to make such decisions.
One of the challenges in doing so is the large contingent of people who don't have an understanding of CPU uarch/counters and yet have a negative opinion of their usefulness to make decisions like this.
So the only tool you're left with is running large scale rack level tests in a close-to-prod env, which has its own set of problems and benefits.
menaerus 2 days ago [-]
Perf counters are only indicative of certain performance characteristics at the uarch level, and improving one or more of those aspects does not necessarily translate into measurable performance gains in E2E workloads deployed on a system.
That said, one of the comments above suggests that the HW change was a switch to Ivy Bridge, when zeroing memory became cheaper, which is a bit unexpected (to me). So you might be more right when you say that the improvement was the result of memory allocation patterns and jemalloc.
nullpoint420 3 days ago [-]
This is why I love hacker news. I learn so much from these moments.
danudey 3 days ago [-]
Like "never work at Meta unless you can out-toxic your coworkers".
__turbobrew__ 3 days ago [-]
Yea I knew meta was toxic, but publicly beefing over something over a decade ago is a whole other matter. I can’t even remember what I was working on 10 years ago, and even if I did I wouldn’t be bringing people down that much later.
baby 3 days ago [-]
The problem is a lot of very strong engineers are also very difficult to work with. I worked at Meta too and can tell you the other side of the coin is that people who were too toxic could get canned as well!
__turbobrew__ 3 days ago [-]
Yes, I have worked with the strong but arrogant/snarky engineers. Luckily most of them got canned or forced out because the environment they create around themselves more than negates the positive impact they have. The strongest engineers I have worked with are all humble and kind.
It is their loss, I cannot imagine letting a minor work quarrel live rent free in my head for over a decade. I feel bad enough when something is stuck in my mind for a week.
throwaway2037 3 days ago [-]
Yeah, I am loving the public mudslinging over shit from 10 years ago, like high school girls fighting. This is like the FAANG version of the TV show Suits. We can call it FAANGs and use Midjourney to create the cover art and give the actors vampire fangs.
On a more serious note, it seems like any hyper competitive company eventually spirals into an awful, toxic working env.
hedayet 3 days ago [-]
Inside Meta, engineers are one of the kindest group of people.
This thread would've been way more fun with a couple of middle managers and product managers in the mix ;-)
lmm 3 days ago [-]
Funny, I was thinking what a relief it was to see people making their arguments frankly like on the HN of 10+ years ago.
CamperBob2 3 days ago [-]
Like "Hey, I wonder if Conway's Law works both ways. Huh. Wow. It looks like that is indeed the case."
integricho 3 days ago [-]
I came here for the article, stayed for the drama.
vardump 3 days ago [-]
I wouldn't be surprised if both 'adsharma' and 'jcalvinowens' were right, just at different points in time, perhaps in a bit different context. Things change.
google234123 2 days ago [-]
I like your clocks!
asveikau 3 days ago [-]
Maybe I'm misreading, but considering it OK to leak memory contents across a process boundary because it's within a cgroup sounds wild.
adsharma 3 days ago [-]
It wasn't any cgroup. If you put two untrusting processes in a memory cgroup, there is a lot that can go wrong.
If you don't like the idea of memory cgroups as a security domain, you could tighten it to be a process. But kernel developers have been opposed to tracking pages on a per address space basis for a long time. On the other hand memory cgroup tracking happens by construction.
asveikau 3 days ago [-]
> across a process boundary
> within a cgroup
Note the complementary language usage here. You seem to have interpreted that as me writing that it didn't matter what cgroup they are in, which is an odd thing to claim that I implied. I meant within the same cgroup obviously.
Yes, you can read memory out of another process through other means... but you shouldn't be able to map pages, read them, and see what happened in another process. That's the wild part. It strikes me as asking for problems.
I was unaware of MAP_UNINITIALIZED, support for which was disabled by default and for good reason. Seems like it was since removed.
adsharma 3 days ago [-]
I was clarifying that there are CPU cgroups, network cgroups etc and the proposal touched only memory cgroups.
The people deploying it are free to restrict the cgroup to one process before requesting MAP_UNINITIALIZED if there is a concern around security. At that point the memory cgroup becomes a way to get around the page tracking restriction.
But I get why aesthetically this idea sounds icky to a lot of people.
genxy 3 days ago [-]
What metrics were improved by your patches?
adsharma 3 days ago [-]
Some more historical context. It wasn't a random optimization idea that I thought about in the shower and implemented the next day. Previous work on company wide profiling, where my contribution was low level perf_events plumbing:
The profiling clearly showed kernel functions doing memzero at the top of the profiles which motivated the change. The performance impact (A/B testing and measuring the throughput) also showed a benefit at the point the change was committed.
The change stopped being impactful sometime after 2013, when a JIT replaced the transpiler. I'm guessing likely before 2016 when continuous deployment came into play. But that was continuously deploying PHP code, not HHVM itself.
By the time the patches were reevaluated I was working on a Graph Database, which sounded a lot more interesting than going back to my old job function and defending a patch that may or may not be relevant.
I'm still working on one. Guilty as charged of carrying ideas in my head for 10+ years and acting on them later. Link in my profile.
genxy 3 days ago [-]
This kind of thing always struck me as something that the MMU and the memory controller could team up on. When you give back memory, you could not refresh it for some cycles. Or you could DMA the same page of zeros over all of it, so the CPU isn't involved in menial labor.
adsharma 2 days ago [-]
This is an old debate that goes back 25+ years. One of the differences in how Linux and FreeBSD handle the issue.
Linux developers believe that involving the CPU warms the caches and is a good thing.
bmenrigh 4 days ago [-]
I recently started using Microsoft's mimalloc (via an LD_PRELOAD) to better use huge (1 GB) pages in a memory intensive program. The performance gains are significant (around 20%). It feels rather strange using an open source MS library for performance on my Linux system.
There needs to be more competition in the malloc space. Between various huge page sizes and transparent huge pages, there are a lot of gains to be had over what you get from a default GNU libc.
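For reference, the LD_PRELOAD setup is roughly the following. The library path and binary name are examples, not the commenter's actual setup; MIMALLOC_RESERVE_HUGE_OS_PAGES is a real mimalloc option that pre-reserves 1 GiB huge pages at startup:

```shell
# Reserve two 1 GiB huge pages up front and interpose mimalloc over
# the system malloc for an unmodified binary; paths are illustrative.
MIMALLOC_RESERVE_HUGE_OS_PAGES=2 \
LD_PRELOAD=/usr/lib/libmimalloc.so ./memory_intensive_program
```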
skavi 4 days ago [-]
We evaluated a few allocators for some of our Linux apps and found (modern) tcmalloc to consistently win in time and space. Our applications are primarily written in Rust and the allocators were linked in statically (except for glibc). Unfortunately I didn't capture much context on the allocation patterns. I think in general the apps allocate and deallocate at a higher rate than most Rust apps (or more than I'd like at least).
Our results from July 2025:
rows are <allocator>: <RSS>, <time spent for allocator operations>
app1:
  glibc:          215,580 KB, 133 ms
  mimalloc 2.1.7: 144,092 KB, 91 ms
  mimalloc 2.2.4: 173,240 KB, 280 ms
  tcmalloc:       138,496 KB, 96 ms
  jemalloc:       147,408 KB, 92 ms

app2, bench1:
  glibc:          1,165,000 KB, 1.4 s
  mimalloc 2.1.7: 1,072,000 KB, 5.1 s
  mimalloc 2.2.4:
  tcmalloc:       1,023,000 KB, 530 ms

app2, bench2:
  glibc:          1,190,224 KB, 1.5 s
  mimalloc 2.1.7: 1,128,328 KB, 5.3 s
  mimalloc 2.2.4: 1,657,600 KB, 3.7 s
  tcmalloc:       1,045,968 KB, 640 ms
  jemalloc:       1,210,000 KB, 1.1 s

app3:
  glibc:          284,616 KB, 440 ms
  mimalloc 2.1.7: 246,216 KB, 250 ms
  mimalloc 2.2.4: 325,184 KB, 290 ms
  tcmalloc:       178,688 KB, 200 ms
  jemalloc:       264,688 KB, 230 ms
tcmalloc was from github.com/google/tcmalloc/tree/24b3f29.
i don't recall which jemalloc was tested.
hedora 4 days ago [-]
I’m surprised (unless they replaced the core tcmalloc algorithm but kept the name).
tcmalloc (thread caching malloc) assumes memory allocations have good thread locality. This is often a double win (less false sharing of cache lines, and most allocations hit thread-local data structures in the allocator).
Multithreaded async systems destroy that locality, so it constantly has to run through the exception case: A allocated a buffer, went async, the request wakes up on thread B, which frees the buffer, and has to synchronize with A to give it back.
Are you using async rust, or sync rust?
skavi 4 days ago [-]
modern tcmalloc uses per CPU caches via rseq [0]. We use async rust with multithreaded tokio executors (sometimes multiple in the same application). so relatively high thread counts.
How do you control which CPU your task resumes on? If you don't then it's still the same problem described above, no?
skavi 3 days ago [-]
on the OS scheduler side, i'd imagine there's some stickiness that keeps tasks from jumping wildly between cores. like i'd expect migration to be modelled as a non zero cost. complete speculation though.
tokio scheduler side, the executor is thread per core and work stealing of in progress tasks shouldn't be happening too much.
for all thread pool threads or threads unaffiliated with the executor, see earlier speculation on OS scheduler behavior.
packetlost 3 days ago [-]
Correct. The Linux scheduler has been NUMA aware + sticky for a while (which is more or less what this reduces to in common scenarios).
jhalstead 3 days ago [-]
> I’m surprised (unless they replaced the core tcmalloc algorithm but kept the name).
1. tcmalloc is actually the only allocator I tested which was not using thread local caches. even glibc malloc has tcache.
2. async executors typically shouldn’t have tasks jumping willy nilly between threads. i see the issue u describe more often with the use of thread pools (like rayon or tokio’s spawn_blocking). i’d argue that the use of thread pools isn’t necessarily an inherent feature of async executors. certainly tokio relies on its threadpool for fs operations, but io-uring (for example) makes that mostly unnecessary.
ComputerGuru 4 days ago [-]
That’s a considerable regression for mimalloc between 2.1 and 2.2 – did you track it down or report it upstream?
Edit: I see mimalloc v3 is out – I missed that! That probably moots this discussion altogether.
skavi 4 days ago [-]
nope.
codexon 4 days ago [-]
This is similar to what I experienced when I tested mimalloc many years ago. If it was faster, it wasn't faster by much, and had pretty bad worst cases.
pjmlp 4 days ago [-]
If you go into the Dr. Dobb's, The C/C++ Users Journal and BYTE digital archives, you'll find ads from companies whose product was basically a special-cased memory allocator.
Even toolchains like Turbo Pascal for MS-DOS, had an API to customise the memory allocator.
The one size fits all was never a solution.
adgjlsfhk1 4 days ago [-]
One of the best parts about GC languages is they tend to have much more efficient allocation/freeing because the cost is much more lumped together so it shows up better in a profile.
pjmlp 4 days ago [-]
Agreed, however there is also a reason why the best ones also pack multiple GC algorithms, like in Java and .NET, because one approach doesn't fit all workloads.
nevdka 4 days ago [-]
Then there’s perl, which doesn’t free at all.
hedora 4 days ago [-]
Perl frees memory. It uses refcounting, so you need to break heap cycles or it will leak.
(99% of the time, I find this less problematic than Java’s approach, fwiw).
wredcoll 3 days ago [-]
Unless this has changed recently, perl doesn't free memory to the kernel, only within its own process/vm.
cermicelli 4 days ago [-]
Freedom is overrated... :P
NooneAtAll3 4 days ago [-]
doesn't java also?
I heard that was a common complaint for minecraft
adgjlsfhk1 3 days ago [-]
Minecraft for somewhat silly reasons was largely stuck using Java8 for ~a decade longer than it should have which meant that it was using some fairly outdated GC algorithms.
NooneAtAll3 3 days ago [-]
"silly reasons" being Java breaking backwards compatibility
A decade seems a usual timescale for that, considering e.g. Python 2->3.
kbolino 3 days ago [-]
So much software was stuck on Java 8 and for so long that some of the better GC algorithms got backported to it.
xxs 4 days ago [-]
What do you mean - if Java returns memory to the OS? Which one: the Java heap, or the malloc/free done by the JVM itself?
cogman10 4 days ago [-]
Java is pretty greedy with the memory it claims. Especially historically it was pretty hard to get the JVM to release memory back to the OS.
To an outsider, that looks like the JVM heap just steadily growing, which is easy to mistake for a memory leak.
k_roy 4 days ago [-]
> Especially historically it was pretty hard to get the JVM to release memory back to the OS.
This feels like a huge understatement. I still have some PTSD around when I did Java professionally between like 2005 and 2014.
The early part of that was particularly horrible.
xxs 4 days ago [-]
Java has a quite strict max heap setting, it's very uncommon to let it allocate up to 25% of the system memory (the default). It won't grow past that point, though.
Barring bugs/native leaks, Java has very predictable memory allocation.
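A minimal sketch of pinning the heap explicitly (the flags are standard JVM options; the values and jar name are illustrative):

```shell
# Pin min and max heap so the JVM neither grows past this cap nor
# shrinks below it; without -Xmx the default cap is ~25% of physical RAM.
java -Xms2g -Xmx2g -jar app.jar
```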
NooneAtAll3 3 days ago [-]
we aren't talking about allocation, tho
we are talking about DEallocation
xxs 3 days ago [-]
it's a reply to:
"To an outsider, that looks like the JVM heap just steadily growing, which is easy to mistake for a memory leak."
I cut the part about it being possible to make the JVM return heap memory after compaction, but usually that's not done; i.e. if something grew once, it's likely to do it again.
adgjlsfhk1 4 days ago [-]
This only really ends up being a problem on Windows. On systems with proper virtual memory setups, the cost of unused memory is very low (since the OS can just page it out).
cogman10 3 days ago [-]
Unfortunately, the JVM and collectors like the JVM's play really badly with virtual memory. (Actually, G1 might play better; everything else does not.)
The issue is that through the standard course of a JVM application running, every allocated page will ultimately be touched. The JVM fills up new gen, runs a minor collection, moves old objects to old gen, and continues until old gen gets filled. When old gen is filled, a major collection is triggered and all the live objects get moved around in memory.
This natural action of the JVM means you'll see a sawtooth of used memory in a properly running JVM where the peak of the sawtooth occasionally hits the memory maximum, which in turn causes the used memory to plummet.
pjmlp 3 days ago [-]
Depends on which JVM, PTC and Aicas do alright with their real time GCs for embedded deployment.
cogman10 3 days ago [-]
I've never really used anything other than the OpenJDK and Azuls.
How does PTC and Aicas does GC? Is it ref counted? I'm guessing they aren't doing moving collectors.
pjmlp 3 days ago [-]
They are real time GCs, nothing to do with refcounting.
One of the founding members of Aicas is the author of "Hard Realtime Garbage Collection in Modern Object Oriented Programming Languages" book, which was done as part of his PhD.
snackbroken 3 days ago [-]
For video games it is pretty bad, because reading back a page from disk containing "freed" (from the application perspective, but not returned to the OS) junk you don't care about is significantly slower than the OS just handing you a fresh one. A 10-20ms delay is a noticeable stutter and even on an SSD that's only a handful of round-trips.
cogman10 3 days ago [-]
Games today should be using ZGC.
There's a lot of bad tuning guides for minecraft that should be completely ignored and thrown in the trash. The only GC setting you need for it is `-XX:+UseZGC`
For example, a number of the minecraft golden guides I've seen will suggest things like setting pause targets but also survivor space sizes. The thing is, the pause target is disabled when you start playing with survivor space sizes.
xxs 3 days ago [-]
Overall, if Java hits swap, it's a bad case. Windows is a special beast when it comes to 'swapping', even if you don't truly need it. On Linux, all (server) services run with swap off.
pjmlp 3 days ago [-]
Not used Windows Server that much?
CyberDildonics 4 days ago [-]
Any extra throughput is far overshadowed by trying to control pauses and by excessive heap allocation, because too much gets put on the heap. For anything interactive the options are usually fighting the GC or avoiding the GC.
bluGill 4 days ago [-]
When it works. Many programs in GC language end up fighting the GC by allocating a large buffer and managing it by hand anyway because when performance counts you can't have allocation time in there at all. (you see this in C all the time as well)
cogman10 4 days ago [-]
That's generally a bad idea. Not always, but generally.
It was a better idea when Java had the old mark and sweep collector. However, with the generational collectors (which are all Java collectors now, except for Epsilon) it's more problematic. Reusing buffers and objects in those buffers pretty much guarantees that buffer ends up in oldgen. That means to clear it out, the VM has to do more expensive collections.
The actual allocation time for most of Java's collectors is almost 0, it's a capacity check and a pointer bump in most circumstances. Giving the JVM more memory will generally solve issues with memory pressure and GC times. That's (generally) a better solution to performance problems vs doing the large buffer.
Now, that said, there certainly have been times where allocation pressure is a major problem and removing the allocation is the solution. In particular, I've found boxing to often be a major cause of performance problems.
drob518 3 days ago [-]
If your workload is very regular, you can still do better with an arena allocator. Within the arena, it uses the same pointer-bump allocation as Java normally uses, but then you can free the whole arena back to the start by resetting the pointer to its initial value. If you use the arena for servicing a single request, for instance, you then reset as soon as you're done with the request, setting you up with a totally free arena for the next request. That's more efficient than a GC. But it also requires your algorithm to fall into that pattern where you KNOW that you can and should throw everything from the request away. If you can't guarantee that, then modern collectors are pretty magical and tunable.
CyberDildonics 4 days ago [-]
If people didn't need to do it, they wouldn't generally do it. Not always, but generally.
cogman10 4 days ago [-]
People do stuff they shouldn't all the time.
For example, some code I had to clean up pretty early on in my career was a dev, for unknown reasons, reinventing the `ArrayList` and then using that invention as a set (doing deduplication by iterating over the elements and checking for duplicates). It was done in the name of performance, but it was never a slow part of the code. I replaced the whole thing with a `HashSet` and saved ~300 loc as a result.
This individual did that sort of stuff all over the code base.
CyberDildonics 3 days ago [-]
Reinventing data structures poorly is very common.
Heap allocation in Java is trivial and happens constantly. People typically do funky stuff with memory allocation because they have to, because the GC is causing pauses.
People avoid system allocators in C++ too, they just don't have to do it because of uncontrollable pauses.
cogman10 3 days ago [-]
> People typically do funky stuff with memory allocation because they have to
This same dev did things like putting what he deemed to be large objects (icons) into weak references to save memory. When the references were collected, they invariably had to be reloaded.
That was not the source of memory pressure issues in the app.
I've developed a mistrust of a lot of devs "doing it because we have to" when it comes to performance tweaks. It's not that a buffer is never the right thing to do, but it's not something I've had to reach for to solve GC pressure issues. Often, far simpler solutions, like pulling an allocation out of the middle of a loop or switching from boxed types to primitives, were all that was needed to relieve memory pressure.
The closest I've come to it is replacing code which would do an expensive and allocation heavy calculation with a field that caches the result of that calculation on the first call.
themanualstates 1 days ago [-]
"This same dev did things like putting what he deemed as being large objects (icons) into weak references to save memory. When the references were collected, invariably they had to be reloaded."
For .NET on iOS, the difference between managed and unmanaged objects is of particular concern. In the example you provide, the Icon Assets are objects from an Apple Framework, not managed by .NET. You might use them in the UIKit views for list items in a UIKit List View.
iOS creates and disposes these list view items independently of .NET managed code. Because the reference counts can't be updated across these contexts, you'll inevitably end up with dangling references. This memory can't be cleared, so inadvertently using strong references will cause a memory leak that grows until your app crashes.
The above still applies with different languages / frameworks of course, however the difference is less explicit from a syntax perspective IMHO
CyberDildonics 3 days ago [-]
I'm not sure why your rationale for how to deal with garbage-collected memory is based on a guy who didn't know standard data structures and your own gut feelings.
Any program that cares about performance is going to focus on minimizing memory allocation first. The difference in a GCed language like Java is that the problems manifest as GC pauses that may or may not be predictable. In a language like C++ you can skip the pauses and worry about the overall throughput.
cogman10 3 days ago [-]
Well, let me just circle back to the start of this comment chain.
> Many programs in GC language end up fighting the GC by allocating a large buffer and managing it by hand
That's the primary thing I'm contending with. This is a strategy for fighting the GC, but it's also generally a bad strategy. One that I think gets pulled more because someone heard of the suggestion and less because it's a good way to make things faster.
That guy I'm talking about did a lot of "performance optimizations" based on gut feelings and not data. I've observed that a lot of engineers operate that way.
But I've further observed that when it comes to optimizing for the GC, a large amount of problems don't need such an extreme measure like building your own memory buffer and managing it directly. In fact, that sort of a measure is generally counter productive in a GC environment as it makes major collections more costly. It isn't a "never do this" thing, but it's also not something that "many programs" should be doing.
I agree that many programs with a GC will probably need to change their algorithms to minimize allocations. I disagree that "allocating a large buffer and managing it by hand" is a technique that almost any program or library needs to engage in to minimize GCs.
CyberDildonics 3 days ago [-]
This is a strategy for fighting the GC, but it's also generally a bad strategy.
Allocating a large buffer is literally what an array or vector is. A heap allocator uses a heap structure and hops around in memory for every allocation and free. It gets worse the more allocations there are: the allocations are fragmented and in different parts of memory.
Allocating a large buffer takes care of all this whenever it's possible. It doesn't make sense to make lots of heap allocations when what you want is multiple items next to each other in memory and one heap allocation.
That guy I'm talking about did a lot of "performance optimizations" based on gut feelings and not data.
You need to let this go, that guy has nothing to do with what works when optimizing memory usage and allocation.
But I've further observed that when it comes to optimizing for the GC, a large amount of problems don't need such an extreme measure like building your own memory buffer and managing it directly.
Making an array of contiguous items is not an "extreme strategy", it's the most efficient and simplest way for a program to run. Other memory allocations can just be an extension of this.
I agree that many programs with a GC will probably need to change their algorithms to minimize allocations. I disagree that "allocating a large buffer and managing it by hand"
If you need the same amount of memory but need to minimize allocations, how do you think that is done? You make larger allocations and split them up. You keep saying "managing it by hand" as if there is something that has to be tricky or difficult. Using indices of an array is not difficult, and neither is handing out indices or ranges in small sections.
cogman10 3 days ago [-]
> A heap uses a heap structure and hops around in memory for every allocation and free.
Not in the JVM. And maybe this is ultimately what we are butting up against. After all, the JVM isn't all GCed languages, it's just one of many.
In the JVM, heap allocations are done via bump allocation. When a region is filled, the JVM performs a garbage collection which moves objects in the heap (it compacts the memory). It's not an actual heap structure for the JVM.
> It doesn't make sense to make lots of heap allocations when what you want is multiple items next to each other in memory and one heap allocation.
That is (currently) not possible to do in the JVM, barring primitives. When I create a `new Foo[128]` in the JVM, that creates an array big enough to hold 128 references to Foo, not 128 Foo objects. Those have to be allocated onto the heap separately. This is part of the reason why managing such an object pool is pointless in the JVM. You have to make the allocations anyway, and you are paying for the management cost of that pool.
The object pool is also particularly bad in the JVM because it stops the JVM from performing optimizations like scalarization. That's where the JVM can completely avoid a heap allocation altogether and instead pulls out the internal fields of the allocated object to hand off to a calling function. In order for that optimization to occur, an object can't escape the current scope.
I get why this isn't the same story if you are talking about another language like C# or go. There are still the negative consequences of needing to manage the buffer, especially if the intent is to track allocations of items in the buffer and to reassign them. But there is a gain in the locality that's nice.
> Using indices of an array is not difficult and neither is handing out indices or ranges to in small sections.
Easy to do? Sure. Easy to do fast? Well, no. That's entirely the reason why C++ has multiple allocators. It's the crux of the problem an allocator is trying to solve in the first place: "How can I efficiently give a chunk of memory back to the application?"
Obviously, it'll matter what your usage pattern is, but if it's at all complex, you'll run into the same problems that the general allocator hits.
CyberDildonics 3 days ago [-]
In the JVM, heap allocations are done via bump allocation.
If that were true then they wouldn't be heap allocations.
Then you make data structures out of arrays of primitives.
Easy to do? Sure. Easy to do fast? Well, no. That's entirely the reason why C++ has multiple allocators.
I don't know what this means. Vectors are trivial and if you hand out ranges of memory in an arena allocator you allocate it once and free it once which solves the heavy allocation problem. The allocator parameter in templates don't factor in to this.
cogman10 3 days ago [-]
> If that were true then they wouldn't be heap allocations.
"Heap" is a misnomer. It's not called that due to the classic CS "heap" data structure. It's called that for the same reason it's called a heap allocation in C++. Modern C++ allocators don't use a heap structure either.
How the JVM does allocations for all its collectors is in fact a bump allocator in the heap space. There are some weedsy details (for example, threads in the JVM have their own heap space for doing allocation, to avoid contention), but suffice it to say it ultimately translates into a region check and then a pointer bump. This is why the JVM is so fast at allocation, much faster than C++ can be. [1] [2]
> I don't know what this means.
JVM allocations are typically pointer bumps, adding a number to a register. There's really nothing faster than it. If you are implementing an arena then you've already lost in terms of performance.
Modern C++ allocators don't use a heap structure either.
"Yes, malloc uses a heap data structure to allocate memory dynamically for programs. The heap allows for persistent memory allocation that can be managed manually by the programmer."
"How Malloc Works with the Heap
Heap Data Structure: Malloc uses a heap data structure to manage memory. The heap is a region of a process's memory that is used for dynamic memory allocation.
Memory Management: When you call malloc, it searches the heap for a suitable block of memory that can accommodate the requested size. If found, it allocates that memory and returns a pointer to it."
How the JVM does allocations for all it's collectors is in fact a bump allocator in the heap space.
This doesn't make sense. It's one or the other. A heap isn't about getting more memory or mapping it into a process space, it is about managing the memory already in the process space and being able to free memory in a different order than you allocated it, then give that memory back out without system calls.
JVM allocations are typically pointer bumps, adding a number to a register.
I think you are mixing up mapping memory into a process (which is a system call not a register addition) and managing the memory once it is in process space.
The allocator frees memory and reuses it within a process. If freeing it was as simple as subtracting from a register then there would be no difference in speed between the stack and the heap and there would be no GC pauses and no GC complexity. None of these things are true obviously since java has been dealing with these problems for 30 years.
This is why the JVM is so fast at allocation, much faster than C++ can be
Java is slower than C++ and less predictable because you can't avoid the GC which is the whole point here.
The original point was that you have to either avoid the GC or fight the GC and a lot of what you have talked about is either not true or explains why someone has to avoid and fight the GC in the first place.
adgjlsfhk1 2 days ago [-]
You're wrong for like 6 different reasons.
Java does do bump pointer allocation. The key is that when GC runs, surviving objects get moved. The slow part of GC isn't the allocation (GCs generally have much faster allocators than malloc). The slow part is the barriers that the GC requires and the pauses.
CyberDildonics 2 days ago [-]
You're wrong for like 6 different reasons.
If that were true, you could have listed one that made sense in context. This person was saying that allocation was as fast as incrementing a register, while continually ignoring the fact that deallocation needs to happen, along with any organization of allocated memory.
Then they were ignoring that large allocations have big speed benefits for a reason.
Conflating Java moving a pointer, mapping memory into a process, sbrk, and arena allocation is going in circles, but the fundamentals remain: people need to fight the GC or work around it.
Allocations have a price and the first step to optimizing any program is avoiding that, but in GC languages you get pauses on top of your slow downs.
2 days ago [-]
jibal 3 days ago [-]
Right? "I had this one contingent experience and I've built my entire world view and set of practices around it."
drob518 3 days ago [-]
Premature optimization is the root of all evil.
m463 4 days ago [-]
I remember in the early days of web services, using the apache portable runtime, specifically memory pools.
If you got a web request, you could allocate a memory pool for it, then you would do all your memory allocations from that pool. And when your web request ended - either cleanly or with a hundred different kinds of errors, you could just free the entire pool.
it was nice and made an impression on me.
I think the lowly malloc probably has lots of interesting ways of growing and changing.
Sesse__ 3 days ago [-]
This is called “an arena” more generally, and it is in wide use across many forms of servers, compilers, and others.
jra_samba 4 days ago [-]
Look into talloc, used inside Samba (and other FLOSS projects like sssd). Exactly this.
pocksuppet 4 days ago [-]
In many cases you can also do better than using malloc e.g. if you know you need a huge page, map a huge page directly with mmap
Yes, if you want to use huge pages with arbitrary alloc/free, then use a third-party malloc. If your alloc/free patterns are not arbitrary, you can do even better. We treat malloc as a magic black box but it's actually not very good.
IshKebab 4 days ago [-]
I feel like the real thing that needs to change is we need a more expressive allocation interface than just malloc/realloc. I'm sure that memory allocators could do a significantly better job if they had more information about what the program was intending to do.
liuliu 4 days ago [-]
There are, look no further than jemalloc API surface itself:
You can also play tricks with inlining and constant propagation in C (especially on the malloc path, where the ground-truth allocation size is usually statically known).
Dylan16807 4 days ago [-]
I think some operating system improvements could get people motivated to use huge pages a lot better. In particular make them less fragile on linux and make them not need admin rights on windows. The biggest factor causing problems there is that neither OS can swap a 2MB page. So someone needs to care enough to fix that.
anthk 4 days ago [-]
I used mimalloc to run zenlisp under OpenBSD as it would clash with the paranoid malloc of base.
jeffbee 4 days ago [-]
Just out of curiosity are you getting 1GB huge pages on Xeon or some other platform? I always thought this class of page is the hardest to exploit, considering that the machine only has, if I recall correctly, one TLB slot for those.
bmenrigh 4 days ago [-]
Modern x86_64 has supported multiple page sizes for a long time. I'm on commodity Zen 5 hardware (9900X) with 128 GiB of RAM. Linux will still use a base page size of 4 KiB but also supports both 2 MiB and 1 GiB huge pages. You can pass something like `default_hugepagesz=2M hugepagesz=1G hugepages=16` to your kernel on boot to use 2 MiB pages by default but reserve 16 1 GiB pages for later use.
The nice thing about mimalloc is that there are a ton of configurable knobs available via env vars. I'm able to hand those 16 1 GiB pages to the program at launch via `MIMALLOC_RESERVE_HUGE_OS_PAGES=16`.
EDIT: after re-reading your comment a few times, I apologize if you already knew this (which it sounds like you did).
jeffbee 4 days ago [-]
Right but on Intel the 1G page size has historically been the odd one. For example Skylake-X has 1536 L2 shared TLB entries for either 4K or 2M pages, but it only has 16 entries that can be used for 1G pages. It wasn't unified until Cascade Lake. But Skylake-like Xeon is still incredibly common in the cloud so it's hard to target the later ones.
Dylan16807 4 days ago [-]
So for any process that's using less than 16GB, it's a significant performance boost. And most processes using more RAM, but not splitting accesses across more than 16 zones in rapid succession, will also see a performance boost.
My old Intel CPU only has 4 slots for 1GB pages, and that was enough to get me about a 20% performance boost on Factorio. (I think a couple percent might have been allocator change but the boost from forcing huge pages was very significant)
jeffbee 3 days ago [-]
That strikes me as a common hugepages win. People never believe you, though, when you say you can make their thing 20% faster for free.
menaerus 3 days ago [-]
Then it should be pretty easy to display that 20% "faster for free", no? But as always the devil is in the details. I experimented a lot with huge pages, and although in theory you should see the performance boost, the workloads I have been using to test this hypothesis did not end up with anything statistically significant/measurable. So, my conclusion was ... it depends.
Dylan16807 3 days ago [-]
Try a big factorio map just as a test case. It's a bit of an outlier on performance, in particular it's very heavy on memory bandwidth.
jeffbee 3 days ago [-]
Of course, it only helps workloads that exhibit high rates of page table walking per instruction. But those are really common.
menaerus 3 days ago [-]
Yes, I understand that. It is implied that there's a high TLB miss rate. However, I'm wondering if the penalty which we can quantify as O(4) memory accesses for 4-level page table, which amounts to ~20 cycles if pages are already in L1 cache, or ~60-200 cycles if they are in L2/L3, would be noticeable in workloads which are IO bound. In other words, would such workloads benefit from switching to the huge pages when most of the time CPU anyways sits waiting on the data to arrive from the storage.
jeffbee 3 days ago [-]
In a multi-tenant environment, yes. The faster they can get off the CPU and yield to some other tenant, the better it is.
tosti 3 days ago [-]
> commodity
> zen 5
> 128GiB
Are you from the future?
Dylan16807 2 days ago [-]
I'm not sure what point you're trying to make.
In the middle of last year, a 9900X was around $350 and 128GB of memory was also around $350. That's very easily "commodity" range.
tosti 2 days ago [-]
Damn. I feel old and must've missed that boat. Several other boats too, I guess.
Here I was thinking 16GiB is pretty good. I get to compile LibreOffice in an afternoon. QtWebEngine overnight.
Doesn't 128GiB make rowhammer much more feasible? You'd have 32GiB per DIMM.
Oh well
Dylan16807 2 days ago [-]
Two 64GiB DIMMs would be the more likely setup. The current CPUs strongly prefer having only one stick of DDR5 per channel.
The effectiveness of rowhammer depends on how well the manufacturer implemented target row refresh. But the internal ECC on DDR5 should help defend against it somewhat.
Personally I've been in the 24-32GiB range since 2013, and that's despite the fact that I'm still on DDR3.
sylware 4 days ago [-]
If there is so much performance difference among generic allocators, it means you need semantically optimized allocators (unless performance is actually not that important in the end).
Cloudef 4 days ago [-]
You are not wrong, and this is indeed what Zig is trying to push by making all std functions that allocate take an allocator parameter.
codexon 4 days ago [-]
Agreed mostly. Going from standard library to something like jemalloc or tcmalloc will give you around 5-10% wins which can be significant, but the difference between those generic allocators seem small. I just made a slab allocator recently for a custom data type and got speedups of 100% over malloc.
sylware 3 days ago [-]
Here you go.
vesselapi 3 days ago [-]
[dead]
codexon 4 days ago [-]
I've been using jemalloc for over 10 years and don't really see a need for it to be updated. It always holds up in benchmarks against any new flavor of the month malloc that comes out.
Last time I checked mimalloc, which was admittedly a while ago, probably 5 years, it was noticeably worse, and I saw a lot of people on their GitHub issues agreeing with me, so I just never looked at it again.
adgjlsfhk1 4 days ago [-]
Mimalloc v3 has just come out (about a month ago) and is a significant improvement over both v2 and v1 (what you likely last tested)
hrmtst93837 4 days ago [-]
Benchmarks age fast. Treating a ten-year-old allocator as done just because it still wins old tests is tempting fate, since distros, glibc, kernel VM behavior, and high-core alloc patterns keep moving and the failures usually show up as weird regressions in production, not as a clean loss on someone's benchmark chart.
codexon 4 days ago [-]
It still beat mimalloc when I checked 4-5 years ago.
imp0cat 4 days ago [-]
You really need to benchmark your workloads, ideally with the "big 3" (jemalloc, tcmalloc, mimalloc). They all have their strengths and weaknesses.
Jemalloc can usually keep the smallest memory footprint, followed by tcmalloc.
Mimalloc can really speed things up sometimes.
As usually, YMMV.
codexon 4 days ago [-]
I've benchmarked them every few years, they never seem to differ by more than a few percent, and jemalloc seems to fragment and leak the least for processes running for months.
Mimalloc made the claim that they were the fastest/best when they released and that didn't hold up to real world testing, so I am not inclined to trust it now.
ComputerGuru 4 days ago [-]
> Mimalloc made the claim that they were the fastest/best when they released and that didn't hold up to real world testing
That’s… ahistorical, at least so far as I remember. It wasn’t marketed as either of those; it was marketed as small/simple/consistent with an opt-in high-security mode, and then its performance bore out as a result of the first set of target features/design goals. It was mainly pushed as easy to adopt, easy to use, easy to statically link, etc.
> It was mainly pushed as easy to adopt, easy to use, easy to statically link, etc.
That is true of basically every single malloc replacement out there, that is not a uniquely defining feature.
HackerThemAll 4 days ago [-]
Look up the numbers in other comments above. When it comes to performance, Google's tcmalloc is unconquered.
imp0cat 3 days ago [-]
I tried all three, multiple times, and it depends.
Using the last workload tested as an example, mimalloc just consumed memory like crazy. It was probably leaking, as it was the stock version that comes in Debian, so probably quite old.
Tcmalloc and jemalloc were neck and neck when comparing app metrics (request duration etc. was quite similar), but jemalloc consistently used only about half the RAM that tcmalloc did.
Both custom allocators used way less RAM than the stock allocator though. Something like 10x (!) less. In the end the workload with jemalloc hovers somewhere around 4% of the memory limit. Not bad for one single package and an additional compile option to enable it.
One has to wonder if this is due to the global memory shortage. ("Oh, changing our memory allocator to be more efficient will yield $XXM in savings over the next year.")
bluGill 4 days ago [-]
Facebook gave talks about this years ago (10+). Nobody was allowed to share real numbers, but several Facebook employees were allowed to share that the company had measured savings from optimizations. Reading between the lines, a 0.1% efficiency improvement to some parts of Facebook would save them $100,000 a month (again, real numbers were never publicly shared, so there is a range; it can't be less than $20,000), and so they had teams of people whose job it was to find those improvements.
Most of the savings seemed to come from HVAC costs, followed by buying fewer computers and in turn fewer data centers. I'm sure these days saving memory is also a big deal, but it doesn't seem to have been then.
The above was already the case 10 years ago, so LLMs are at most another factor added on.
sethhochberg 4 days ago [-]
I don't have many regrets about having spent my career in (relatively) tiny companies by comparison, but it sure does sound fun to be on the other side for this kind of thing - the scale where micro-optimizations have macro impact.
In startups I've put more effort into squeezing blood from a stone for far less change, even if the change was proportionally more significant to the business. Sometimes it would be neat to say "something I did saved $X million or saved Y kWh of energy" or whatever.
Anon1096 3 days ago [-]
I've worked on optimizing systems in that ballpark range; memory is worth saving, but it isn't necessarily 1:1 with increasing revenue like CPU is. For CPU we have tables to calculate the infra cost savings (we're not really going to free up the server; more like the system is self-balancing, so it can run harder with the freed CPU), but for memory, as long as we can load in whatever we want (rec systems or AI models), we're in the clear, so the marginal headroom isn't as important. It's more of a side thing that people optimizing CPU also get wins in by chance, because the skill sets are similar.
alex1138 4 days ago [-]
I've heard of some people getting banned from FB to save memory space? Surely that can't be the case but I swear I've seen something like that
gzread 3 days ago [-]
There are some people who think they can beat the system by treating apps like Telegram and Discord as free cloud storage, and they certainly get banned to save storage space.
HackerThemAll 4 days ago [-]
> LLMs are at most another factor added on
At most...
Think 10x rather than 0.1x or 1x.
runevault 4 days ago [-]
On top of cost, they probably cannot get as much memory as they order in a timely fashion so offsetting that with greater efficiency matters right now.
loeg 4 days ago [-]
Yeah, identifying single-digit millions of savings out of profiles is relatively common practice at Meta. It's ~easy to come up with a big number when the impact is scaled across a very large numbers of servers. There is a culture of measuring and documenting these quantified wins.
foobarian 4 days ago [-]
Oooh maybe finally time for lovingly hand-optimized assembly to come back in fashion! (It probably has in AI workloads or so I daydream)
Nuzzerino 4 days ago [-]
With the reputation of that company, one can wonder a lot of backstories that are even more depressing than a memory shortage.
augusto-moura 4 days ago [-]
Not just the shortage; any improvement to the memory footprint of LLMs/electricity/servers is becoming much more valuable as time goes on. If you can get 10% faster, you can easily get a lead in the LLM race. The incentives to transparently improve performance are tremendous.
mathisfun123 4 days ago [-]
> changing our memory allocator
they've been using jemalloc (and employing "je") since 2009.
apatheticonion 3 days ago [-]
As an Australian who was just made redundant from a role that involved this type of low-level programming: I love working on these kinds of challenges.
I'm saddened that the job market in Australia is largely React CRUD applications and that it's unlikely I will find a role that lets me leverage my niche skill set (which is also my hobby)
amacneil 3 days ago [-]
I know this isn’t who’s hiring thread, but we are hiring in AU for low-level data processing and have interesting performance challenges.
Link in bio.
maxwindiff 3 days ago [-]
Love your product!
amacneil 3 days ago [-]
Thanks!
ajxs 3 days ago [-]
Speaking as an Australian that works on React CRUD applications because there's nothing else in the market, I've been reading through this thread thinking the exact same thing.
apatheticonion 3 days ago [-]
Google had some position open working on the kernel for ChromeOS, and Microsoft had some positions working on data center network drivers.
I applied for both and got ghosted, haha.
I also saw a government role as a security researcher. Involves reverse engineering, ghidra and that sort of thing. Super awesome - but the pay is extremely uncompetitive. Such a shame.
Other than that, the most interesting roles are in finance (like HFT) - where you need to juggle memory allocations, threads and use C++ (hoping I can pitch Rust but unlikely).
Sadly they have a reputation of having pretty rough cultures, uncompetitive salaries and it's all in-office
Not sure if it's the domain you're interested in, but there are quite a few HFT firms with offices in Australia.
The one I know of (IMC trading) does a lot of low level stuff like this and is currently hiring.
apatheticonion 3 days ago [-]
I'm actually looking at HFT companies. Hoping I find one that allows remote working - but looks like there are basically no remote roles going at the moment
apatheticonion 3 days ago [-]
I just tried to apply for IMC, the form on their careers page is broken. Looks like that's the first boss to defeat, haha
lukeh 3 days ago [-]
I hear you. Actually I read this thread because we’re using jemalloc in an embedded product. The only way I found to work on interesting problems here was to work for myself. (Having said that I think Apple might have some security research in Canberra? Years ago there was LinuxCare there and a lot of smart people. But that was in 2003…)
I remember I was a senior lead software engineer on a World Bank-funded startup project, and deployed Ruby with jemalloc in prod. There was a huge, noticeable gain in speed and memory efficiency. It saved us a lot of AWS costs compared to just using normal Ruby. This was 8 years ago; why haven't projects adopted it as the de facto standard yet?
kortex 3 days ago [-]
Usually lack of knowledge that such a thing exists, or just plain ol' momentum. Changing something long in production at established companies, even if there is a tangible benefit, can be a real challenge.
RegnisGnaw 4 days ago [-]
Is there a concise timelime/history of this? I thought jemalloc was 100% open source, why is Meta in control of it?
"Were I to reengage, the first step would be at least hundreds of hours of refactoring to pay off accrued technical debt."
Facebook's coding AIs to the rescue, maybe? I wonder how good all these "agentic" AIs are at dreaded refactoring jobs like these.
xxs 4 days ago [-]
Refactoring doesn't just mean artificial puff-up jobs; it very likely means internal changes and reorganization (hence the hundreds of hours).
There are not many engineers capable of working on memory allocators, so adding more burden by agentic stuff is unlikely to produce anything of value.
rvz 4 days ago [-]
> Facebook's coding AIs to the rescue, maybe? I wonder how good all these "agentic" AIs are at dreaded refactoring jobs like these.
No.
This is something you shouldn't allow coding agents anywhere near, unless you have expert-level understanding required to maintain the project like the previous authors have done without an AI for years.
kenferry 3 days ago [-]
Hm, I wonder.
I've done some work in this sort of area before, though not literally on a malloc. Yes you very much want to be careful, but ultimately it's the tests that give you confidence. Pound the heck out of it in multithreaded contexts and test for consistency.
rvz 3 days ago [-]
> ...but ultimately it's the tests that give you confidence. Pound the heck out of it in multithreaded contexts and test for consistency.
I don't think so.
Even on LLM generated code, it is still not enough and you cannot trust it. They can pass the tests and still cause a regression and the code will look seemingly correct, for example in this case study [0].
AI is more than happy to declare the test wrong and “fix it” if you’re not careful. And the cherry on top is that sometimes the test could be wrong or need updating due to changed behavior. So…
echelon 4 days ago [-]
If you filter the commits to the past five years, four of the top six committers are Meta employees. The other two might be as well, it just doesn't say that on their Github / personal website.
Asm2D 3 days ago [-]
It would be great if Meta was able to sustain to support more open source projects, especially those they benefit from.
For example they use AsmJit in a lot of projects (both internal and open-source) and it's now unmaintained because of funding issues. Maybe they have now internal forks too.
robertlagrant 3 days ago [-]
> With the leverage jemalloc provides however, it can be tempting to realize some short-term benefit. It requires strong self-discipline as an organization to resist that temptation and adhere to the core engineering principles.
This doesn't quite read properly to me. What does it actually mean, does anyone know?
gjm11 3 days ago [-]
I'm pretty sure it means something like this: "Because jemalloc is used all over the place in our systems that run at tremendous scale, some hack that improves its performance a little bit while degrading the longer-term maintainability of the code can look very appealing -- look, doing this thing will save us $X,000,000 per year! -- and it takes discipline to avoid giving in to that temptation and to insist on doing things properly even if sometimes it means passing up a chance to make the code 0.1% faster and 10% messier."
Surprised not to see any mention of the global memory supply shock. Would love to learn more about how that economic shift is pushing software priorities toward memory allocation for the first time in my (relatively young) career.
twodave 4 days ago [-]
While it may seem directly related, it's just not. These things are worked on regardless of how cheap or expensive RAM is, because optimizing memory footprint pretty much always leads to fewer machines leased, which is a worthwhile goal even for smaller shops.
jshorty 3 days ago [-]
That's useful to know, thank you.
4 days ago [-]
refulgentis 4 days ago [-]
There’s been shocks at hyperscaler scale, ex. this got yuge at Google for a couple years before ChatGPT
charcircuit 4 days ago [-]
Meta never abandoned jemalloc. https://github.com/facebook/jemalloc remained public the entire time. It's my understanding that Jason Evans, the creator of jemalloc, had ownership over the jemalloc/jemalloc repo which is why that one stopped being updated after he left.
kstrauser 4 days ago [-]
The repo's availability isn't related to whether it's still maintained.
charcircuit 4 days ago [-]
Meta still maintained it and actively pushed commits to it fixing bugs and adding improvements. From this blog post it sounds like they are increasing investment into it along with resurrecting the original repo. When the repo was archived Meta said that development on jemalloc would be focused towards Meta's own goals and needs as opposed to the larger ecosystem.
kstrauser 4 days ago [-]
I'm not directly involved enough to dig into the details here, but facebook/jemalloc currently says:
> This branch is 71 commits ahead of and 70 commits behind jemalloc/jemalloc:dev.
It looks like both have been independently updated.
Xylakant 4 days ago [-]
This looks a lot as if the facebook/jemalloc repo inserted a single commit 70 commits ago and then rebased the changes from the original repo on top. Because the commit SHAs of the rebased changes differ from the originals, you see this result.
masklinn 4 days ago [-]
The team probably sync'd the two after unarchiving the original.
xxs 4 days ago [-]
A few months back, some of the services switched to jemalloc for the Java VM. It took months (of memory dumps and tracing sys-calls) to blame the JVM itself for getting killed by the oom_killer.
Initially the idea was diagnostics; instead the problem disappeared on its own.
yxhuvud 4 days ago [-]
If you changed from glibc to jemalloc and that solved your issues, then you should blame glibc, not the JVM.
xxs 4 days ago [-]
Well, indeed - I thought that part was obvious reading it.
nubinetwork 4 days ago [-]
Someone should tell Bryan Cantrill, he'd probably be ecstatic...
starkparker 4 days ago [-]
(wrong thread)
gcr 4 days ago [-]
The URL of this story seems to have changed to a Meta press release. What are you quoting?
starkparker 4 days ago [-]
Sorry, misdirected the reply.
4 days ago [-]
rishabhjajoriya 4 days ago [-]
Large engineering orgs often underestimate how much CI pipelines amplify performance issues. Even small inefficiencies multiply when builds run hundreds of times a day.
thatoneengineer 4 days ago [-]
First impressions: LOL, the blunt commentary in the HN thread title compared to the PR-speak of the fb.com post.
Second thoughts: Actually the fb.com post is more transparent than I'd have predicted. Not bad at all. Of course it helps that they're delivering good news!
MBCook 4 days ago [-]
It’s still quite corporate-y, but other than the way of writing I agree it’s generally quite clear.
markstos 4 days ago [-]
How is the original author making out in the new arrangement?
I used jemalloc recently for ComfyUI/Wan and it’s literally magic. I’m surprised it doesn’t come that way by default.
jeffbee 3 days ago [-]
Allocators like that aren't the default for every process because they have higher startup costs. They are targeted to server workloads where startup cost doesn't matter, but it matters a lot if you're doing crud like starting millions of short-lived processes.
senderista 3 days ago [-]
I don't think glibc malloc makes an optimal set of tradeoffs for any scenario.
4 days ago [-]
lobf 4 days ago [-]
>We are committed to continuing to develop jemalloc development
From the Department of Redundancy Department.
agnishom 3 days ago [-]
Glad to see the commitment towards important non-LLM projects!
fhn 3 days ago [-]
this is just meta wanting people to test and fix. they don't care about it but they want you to
mywacaday 3 days ago [-]
The only option for cookies is to accept these terms and conditions, I thought implied consent was explicitly not allowed due to GDPR?
"To help personalize content, tailor and measure ads and provide a safer experience, we use cookies. By clicking or navigating the site, you agree to allow our collection of information on and off Facebook through cookies. Learn more, including about available controls:
I was recently debugging an app double-free segfault on my android 13 samsung galaxy A51 phone, and the internal stack trace pointed to jemalloc function calls (je_free).
larsberg 3 days ago [-]
scudo has been the default allocator for Android since Android 11, and we are hoping to make it mandatory for the few remaining places that don't use it. Using an allocator without memory protections in 2026 (especially after we have closed nearly all known performance gaps with jemalloc) is really not a great choice.
flippyhead 3 days ago [-]
Jemalloc and Jalad at Tanagra!
Nuzzerino 4 days ago [-]
> Building a software system is a lot like building a skyscraper: The product everyone sees is the top, but the part that keeps it from falling over is the foundation buried in the dirt and the scaffolding hidden from sight.
They should have just called it an ivory tower, as that's what they're building whenever they're not busy destroying democracy with OS Backdoor lobbyism or Cambridge Analytica shenanigans.
Edit: If every thread about any of Elon Musk's companies can contain at least 10 comments talking about Elon's purported crimes against humanity, threads about Zuckerberg's companies can contain at least 1 comment. Without reminders like this, stories like last week's might as well remain non-consequential.
fermentation 4 days ago [-]
Seems like they’d want to wait to commit until after the layoffs, right?
OsrsNeedsf2P 4 days ago [-]
I work in the space. This article would not have been published if the team responsible was on the chopping block
VoidWarranty 4 days ago [-]
[dead]
kubb 4 days ago [-]
It's just one team with like 4 people. They can lay off a lot of staff from Metaverse.
robutsume 3 days ago [-]
[dead]
openinstaclaw 3 days ago [-]
[dead]
rgupta1833 4 days ago [-]
[dead]
openclaw01 3 days ago [-]
[dead]
rwaksmunski 3 days ago [-]
jemalloc 5.2.1 vs mimalloc v3.2.8 in Rust software processing hundreds of Terabytes. Could not measure a meaningful difference, but mimalloc would release freed memory to the OS a lot sooner and therefore look nicer in top. That said, older mimalloc from default rust crate would cause memory corruption with large allocations >2Gb in about 5% of the cases. Stuck with battle hardened jemalloc for now.
torginus 3 days ago [-]
Mimalloc my beloved. The fact that jemalloc is this fiendishly complex allocator with a gazillion algorithms and approaches (and a huge binary), while mimalloc (a simple allocator with one bitmap-tracked pool per allocation size, and one pool collection per thread) holds its own, is one of the bigger wins in software simplicity in recent memory.
4 days ago [-]
unit149 3 days ago [-]
[dead]
lesscraft 4 days ago [-]
[dead]
genie3io 3 days ago [-]
[dead]
oncallthrow 4 days ago [-]
And the Oscar for most mealy-mouthed post of the year goes to…
dang 4 days ago [-]
We generally try to avoid corporate press releases for that reason, but is there a good third-party post to replace it with?
If you need to optimize the allocator you are doing it wrong.
ot 3 days ago [-]
That's a false dichotomy: you optimize both the application and the allocator.
A 0.5% improvement may not be a lot to you, but at hyperscaler scale it's well worth staffing a team to work on it, with the added benefit of having people on hand that can investigate subtle bugs and pathological perf behaviors.
sumtechguy 3 days ago [-]
Exactly. I can think of at least 5 different projects I have been on where a better allocator would have made a world of difference. I can also think of another 5 where it probably would have been a waste of time to even fiddle with.
On one project I spent a bunch of time optimizing the I/O write path. It was just using standard fwrite, but by staging items correctly it was an easy 10x speed win. Those optimizations sometimes stack up and count big. But it also had a few sharp edges, so use with care.
saagarjha 3 days ago [-]
Glad we have super slow allocators then
cbarrick 3 days ago [-]
Exactly. No need to engineer an allocator. You only live once!
During my time at Facebook, I maintained a bunch of kernel patches to improve jemalloc purging mechanisms. It wasn't popular in the kernel or the security community, but it was more efficient on benchmarks for sure.
Many programs run multiple threads, allocate in one and free in the other. Jemalloc's primary mechanism used to be: madvise the page back to the kernel and then have it allocate it in another thread's pool.
One problem: this involves zeroing memory, which has an impact on cache locality and overall app performance. It's completely unnecessary if the page is being recirculated within the same security domain.
The problem was getting everyone to agree on what that security domain is, even if the mechanism was opt-in.
https://marc.info/?l=linux-kernel&m=132691299630179&w=2
We did extensive benchmarking of HHVM with and without your patches, and they were proven to make no statistically significant difference in high level metrics. So we dropped them out of the kernel, and they never went back in.
I don't doubt for a second you can come up with specific counterexamples and microbenchmarks which show benefit. But you were unable to show an advantage at the system level when challenged on it, and that's what matters.
By the time you joined and benchmarked these systems, the continuous rolling deployment had taken over. If you're restarting the server every few hours, of course the memory fragmentation isn't much of an issue.
> But you were unable to show an advantage at the system level when challenged on it, and that's what matters.
You mean 5 years after I stopped working on the kernel and the underlying system had changed?
I don't recall ever talking to you on the matter.
Nope, I started in 2014.
> I don't recall ever talking to you on the matter.
I recall. You refused to believe the benchmark results and made me repeat the test, then stopped replying after I did :)
For the peanut gallery: this is a manifestation of an internal eng culture at fb that I wasn't particularly fond of. Celebrating that "I killed X" and partying about it.
You didn't reply to the main point: did you benchmark a server that was running several days at a time? Reasonable people can disagree about whether this is a good deployment strategy or not. I tend to believe that there are many places which want to deploy servers and run them for days, if not months.
The "servers are only on for a few hours" thing was like never true so I have no idea where that claim is coming from. The web performance test took more than a few hours to run alone and we had way more aggressive soaks for other workloads.
My recollection was that "write zeroes" just became a cheaper operation between '12 and '14.
A fun fact to distract from the awkwardness: a lot of the kernel work done in the early days was exceedingly scrappy. The port mapping stuff for memcached UDP before SO_REUSEPORT for example. FB binaries couldn't even run on vanilla linux a lot of the time. Over the next several years we put a TON of effort in getting as close to mainline as possible and now Meta is one of the biggest drivers of Linux development.
If the allocator returns a page to the kernel and then immediately asks back for one, it's not doing its job well: the main purpose of the allocator is to cache allocations from the kernel. Those patches are pre-decay, pre-background purging thread; these changes significantly improve how jemalloc holds on to memory that might be needed soon. Instead, the zeroing out patches optimize for the pathological behavior.
Also, the kernel has since exposed better ways to optimize memory reclamation, like MADV_FREE, which is a "lazy reclaim": the page stays mapped to the process until the kernel actually needs it, so if we use it again before that happens, the whole unmapping/mapping is avoided, which saves not only the zeroing cost, but also the TLB shootdown and other costs. And without changing any security boundary. jemalloc can take advantage of this by enabling "muzzy decay".
However, the drawback is that system-level memory accounting becomes even more fuzzy.
(hi Alex!)
Haswell (2013) doubled the store throughput to 32 bytes/cycle per core, and Sandy Bridge (2011) doubled the load throughput to the same, but the dataset being operated on at FB was most likely much larger than what L1+L2+L3 can fit, so I am wondering how much effect the vectorization engine might have had, since a bulk-zeroing operation over large datasets is going to be bottlenecked by single-core memory bandwidth anyway, which at the time was ~20 GB/s.
Perhaps the operation became cheaper simply because of moving to another CPU uarch with higher clock and larger memory bandwidth rather than the vectorization.
People got promoted for continuous deployment
https://engineering.fb.com/2017/08/31/web/rapid-release-at-m...
I think it's fair to say the hardware changed, the deployment strategy changed and the patches were no longer relevant, so we stopped applying them.
When I showed up, there were 100+ patches on top of a 2009 kernel tree. I reduced the size to about 10 or so critical patches, rebased them at a 6 months cadence over 2-3 years. Upstreamed a few.
Didn't go around saying those old patches were bad ideas and I got rid of them. How you say it matters.
You reduced the number of patches a lot and also pushed very hard to get us to 3.0 after we sat on 2.6.38 ~forever. Which was very appreciated, btw. We built the whole plan going forward based on this work.
I'm not arguing that anyone should be nice to anyone or not (it's a waste of breath when it comes to Linux). I'm just saying that the benchmarking was thorough and that contemporary 2014 hardware could zero pages fast.
At what point did you realize how different fb engineering was from what you expected?
An important nuance - most Facebook engineers don't believe that Facebook/Meta will continue to grow next year; and that disbelief has been there since as early as 2018 (when I joined).
very few facebook employees use their products outside of testing, which is a big contributor to that fear - they just can't believe that there are billions of people who would continue to use apps to post what they had for lunch!
And as a result of that lack of faith, most of them believe that Meta is a bubble and can burst at any point. Consequently, everyone works for the next performance review cycle, and most are just in rush to capture as much money as they could before that bubble bursts.
Huh.
The time I worked at a hyper growth company, us working in the coal mine had much the same skepticism. Our growth rate seemed ridiculous, surely we're over building, how much longer can this last?!
Happily, the marketing research team regularly presented stuff to our department. They explained who our customers were, projected market sizes (regionally, internationally), projected growth rates, competitive analysis (incumbents and upstarts), etc.
It helped so much. And although their forecasts seemed unbelievable, we over performed every year-over-year. Such that you sort of start to trust the (serious) marketing research types.
This is not special to Meta in any way, I observed it in any team which has more than 1 strong senior engineer.
For what it's worth, 20 years ago all programming newsgroups were like this. I grew my thick skin on alt.lang.perl lol
Of course technical discussions happen all the time at companies between competent people. But you don't do that in public, nor is this a technical debate: "I don't recall talking to you about it" - "I do, I did xyz then you ignored me" - "<changes subject>"
Always good to talk face to face if you have strong feelings about something. When I said "talk" I meant literally face to face.
Spending a decade or so on lkml, everyone develops a thick skin. But mix it with the corporate environment, Facebook 2011, being an ex-employee adds more to the drama.
Having read through the comments here, I'm still of the opinion that any HW changes had a secondary effect and the primary contributor was a change in how HHVM/jemalloc interacted with MADV.
One more suggestion: evaluate more than one app and company wide profiling data to make such decisions.
One of the challenges in doing so is the large contingent of people who don't have an understanding of CPU uarch/counters and yet have a negative opinion of their usefulness to make decisions like this.
So the only tool you're left with is to run large-scale, rack-level tests in a close-to-prod env, which has its own set of problems and benefits.
That said, one of the comments above suggests that the HW change was a switch to Ivy Bridge, when zeroing memory became cheaper, which is a bit unexpected (to me). So you might be more right when you say that the improvement was the result of memory allocation patterns and jemalloc.
It is their loss, I cannot imagine letting a minor work quarrel live rent free in my head for over a decade. I feel bad enough when something is stuck in my mind for a week.
On a more serious note, it seems like any hyper competitive company eventually spirals into an awful, toxic working env.
This thread would've been way more fun with a couple of middle managers and product managers in the mix ;-)
If you don't like the idea of memory cgroups as a security domain, you could tighten it to be a process. But kernel developers have been opposed to tracking pages on a per address space basis for a long time. On the other hand memory cgroup tracking happens by construction.
> within a cgroup
Note the complementary language usage here. You seem to have interpreted that as me writing that it didn't matter what cgroup they are in, which is an odd thing to claim that I implied. I meant within the same cgroup obviously.
Yes, you can read memory out of another process through other means.. but you shouldn't map pages, be able to read them and see what happened in another process. That's the wild part. It strikes me as asking for problems.
I was unaware of MAP_UNINITIALIZED, support for which was disabled by default and for good reason. Seems like it was since removed.
The people deploying it are free to restrict the cgroup to one process before requesting MAP_UNINITIALIZED if there is a concern around security. At that point the memory cgroup becomes a way to get around the page tracking restriction.
But I get why aesthetically this idea sounds icky to a lot of people.
https://research.google/pubs/google-wide-profiling-a-continu... https://engineering.fb.com/2025/01/21/production-engineering...
The profiling clearly showed kernel functions doing memzero at the top of the profiles which motivated the change. The performance impact (A/B testing and measuring the throughput) also showed a benefit at the point the change was committed.
This was when "facebook" was a ~1GB ELF binary. https://en.wikipedia.org/wiki/HipHop_for_PHP
The change stopped being impactful sometime after 2013, when a JIT replaced the transpiler. I'm guessing likely before 2016 when continuous deployment came into play. But that was continuously deploying PHP code, not HHVM itself.
By the time the patches were reevaluated I was working on a Graph Database, which sounded a lot more interesting than going back to my old job function and defending a patch that may or may not be relevant.
I'm still working on one. Guilty as charged of carrying ideas in my head for 10+ years and acting on them later. Link in my profile.
Linux developers believe that involving the CPU warms the caches and is a good thing.
There needs to be more competition in the malloc space. Between various huge page sizes and transparent huge pages, there are a lot of gains to be had over what you get from a default GNU libc.
Our results from July 2025:
rows are <allocator>: <RSS>, <time spent for allocator operations>
tcmalloc was from github.com/google/tcmalloc/tree/24b3f29. I don't recall which jemalloc was tested.
tcmalloc (thread caching malloc) assumes memory allocations have good thread locality. This is often a double win (less false sharing of cache lines, and most allocations hit thread-local data structures in the allocator).
Multithreaded async systems destroy that locality, so it constantly has to run through the exception case: A allocated a buffer, went async, the request wakes up on thread B, which frees the buffer, and has to synchronize with A to give it back.
Are you using async rust, or sync rust?
[0]: https://github.com/google/tcmalloc/blob/master/docs/design.m...
On the tokio scheduler side, the executor is thread-per-core, and work stealing of in-progress tasks shouldn't be happening too much.
For all thread pool threads or threads unaffiliated with the executor, see the earlier speculation on OS scheduler behavior.
Indeed, it's not the old gperftools version.
Blog: https://abseil.io/blog/20200212-tcmalloc
History / Diffs: https://google.github.io/tcmalloc/gperftools.html
1. tcmalloc is actually the only allocator I tested which was not using thread-local caches. Even glibc malloc has tcache.
2. Async executors typically shouldn't have tasks jumping willy-nilly between threads. I see the issue you describe more often with the use of thread pools (like rayon or tokio's spawn_blocking). I'd argue that the use of thread pools isn't necessarily an inherent feature of async executors. Certainly tokio relies on its threadpool for fs operations, but io-uring (for example) makes that mostly unnecessary.
Edit: I see mimalloc v3 is out – I missed that! That probably moots this discussion altogether.
Even toolchains like Turbo Pascal for MS-DOS, had an API to customise the memory allocator.
The one size fits all was never a solution.
(99% of the time, I find this less problematic than Java’s approach, fwiw).
I heard that was a common complaint for minecraft
A decade seems a usual timescale for that, considering e.g. Python 2->3.
To an outsider, that looks like the JVM heap just steadily growing, which is easy to mistake for a memory leak.
This feels like a huge understatement. I still have some PTSD around when I did Java professionally between like 2005 and 2014.
The early part of that was particularly horrible.
Barring bugs/native leaks, Java has very predictable memory allocation.
we are talking about DEallocation
"To an outsider, that looks like the JVM heap just steadily growing, which is easy to mistake for a memory leak."
I cut the part about it being possible to make the JVM return heap memory after compaction, but usually that's not done, i.e. if something grew once, it's likely to do it again.
The issue is that through the standard course of a JVM application running, every allocated page will ultimately be touched. The JVM fills up new gen, runs a minor collection, moves old objects to old gen, and continues until old gen gets filled. When old gen is filled, a major collection is triggered and all the live objects get moved around in memory.
This natural action of the JVM means you'll see a sawtooth of used memory in a properly running JVM where the peak of the sawtooth occasionally hits the memory maximum, which in turn causes the used memory to plummet.
How do PTC and Aicas do GC? Is it ref counted? I'm guessing they aren't doing moving collectors.
One of the founding members of Aicas is the author of "Hard Realtime Garbage Collection in Modern Object Oriented Programming Languages" book, which was done as part of his PhD.
There's a lot of bad tuning guides for minecraft that should be completely ignored and thrown in the trash. The only GC setting you need for it is `-XX:+UseZGC`
For example, a number of the minecraft golden guides I've seen will suggest things like setting pause targets but also survivor space sizes. The thing is, the pause target is disabled when you start playing with survivor space sizes.
It was a better idea when Java had the old mark-and-sweep collector. However, with the generational collectors (which are all Java collectors now, except for epsilon), it's more problematic. Reusing buffers and objects in those buffers pretty much guarantees that the buffer ends up in oldgen. That means that to clear it out, the VM has to do more expensive collections.
The actual allocation time for most of Java's collectors is almost 0, it's a capacity check and a pointer bump in most circumstances. Giving the JVM more memory will generally solve issues with memory pressure and GC times. That's (generally) a better solution to performance problems vs doing the large buffer.
Now, that said, there certainly have been times where allocation pressure is a major problem and removing the allocation is the solution. In particular, I've found boxing to often be a major cause of performance problems.
For example, some code I had to clean up pretty early on in my career was a dev, for unknown reasons, reinventing the `ArrayList` and then using that invention as a set (doing deduplication by iterating over the elements and checking for duplicates). It was done in the name of performance, but it was never a slow part of the code. I replaced the whole thing with a `HashSet` and saved ~300 loc as a result.
This individual did that sort of stuff all over the code base.
Heap allocation in Java is something trivial that happens constantly. People typically do funky stuff with memory allocation because they have to, because the GC is causing pauses.
People avoid system allocators in C++ too, they just don't have to do it because of uncontrollable pauses.
This same dev did things like putting what he deemed as being large objects (icons) into weak references to save memory. When the references were collected, invariably they had to be reloaded.
That was not the source of memory pressure issues in the app.
I've developed a mistrust for a lot of devs "doing it because we have to" when it comes to performance tweaks. It's not that a buffer is never the right thing to do, but it's not been something I had to reach for to solve GC pressure issues. Often times, far simpler solutions, like pulling an allocation out of the middle of a loop, or switching from boxed types to primitives, were all that was needed to relieve memory pressure.
The closest I've come to it is replacing code which would do an expensive and allocation heavy calculation with a field that caches the result of that calculation on the first call.
Well actually, this is what the Apple[1] docs instruct devs to do. https://developer.apple.com/library/archive/documentation/Co...
For .NET on iOS, the difference between managed and unmanaged objects is of particular concern. In the example you provide, the Icon Assets are objects from an Apple Framework, not managed by .NET. You might use them in the UIKit views for list items in a UIKit List View.
iOS creates and disposes these list view items independently of .NET managed code. Because the reference counts can't be updated across these contexts, you'll inevitably end up with dangling references. This memory can't be cleared, so inadvertently using strong references will cause a memory leak that grows until your app crashes.
The following is a great explainer in the context of Xamarin for iOS. https://thomasbandt.com/xamarinios-memory-pitfalls
The above still applies with different languages / frameworks of course, however the difference is less explicit from a syntax perspective IMHO
Any program that cares about performance is going to focus on minimizing memory allocation first. The difference between a GCed language like java is that the problems manifest as gc pauses that may or may not be predictable. In a language like C++ you can skip the pauses and worry about the overall throughput.
> Many programs in GC language end up fighting the GC by allocating a large buffer and managing it by hand
That's the primary thing I'm contending with. This is a strategy for fighting the GC, but it's also generally a bad strategy. One that I think gets pulled more because someone heard of the suggestion and less because it's a good way to make things faster.
That guy I'm talking about did a lot of "performance optimizations" based on gut feelings and not data. I've observed that a lot of engineers operate that way.
But I've further observed that when it comes to optimizing for the GC, a large amount of problems don't need such an extreme measure like building your own memory buffer and managing it directly. In fact, that sort of a measure is generally counter productive in a GC environment as it makes major collections more costly. It isn't a "never do this" thing, but it's also not something that "many programs" should be doing.
I agree that many programs with a GC will probably need to change their algorithms to minimize allocations. I disagree that "allocating a large buffer and managing it by hand" is a technique that almost any program or library needs to engage in to minimize GCs.
Allocating a large buffer is literally what an array or vector is. A heap uses a heap structure and hops around in memory for every allocation and free. It gets worse the more allocations there are. The allocations are fragmented and in different parts of memory.
Allocating a large buffer takes care of all this whenever it's at all possible. It doesn't make sense to make lots of heap allocations when what you want is multiple items next to each other in memory and one heap allocation.
> That guy I'm talking about did a lot of "performance optimizations" based on gut feelings and not data.
You need to let this go, that guy has nothing to do with what works when optimizing memory usage and allocation.
> But I've further observed that when it comes to optimizing for the GC, a large amount of problems don't need such an extreme measure like building your own memory buffer and managing it directly.
Making an array of contiguous items is not an "extreme strategy", it's the most efficient and simplest way for a program to run. Other memory allocations can just be an extension of this.
> I agree that many programs with a GC will probably need to change their algorithms to minimize allocations. I disagree that "allocating a large buffer and managing it by hand"
If you need the same amount of memory but need to minimize allocations, how do you think that is done? You make larger allocations and split them up. You keep saying "managing it by hand" as if there is something that has to be tricky or difficult. Using indices into an array is not difficult, and neither is handing out indices or ranges of small sections.
Not in the JVM. And maybe this is ultimately what we are butting up against. After all, the JVM isn't all GCed languages, it's just one of many.
In the JVM, heap allocations are done via bump allocation. When a region is filled, the JVM performs a garbage collection which moves objects in the heap (it compacts the memory). It's not an actual heap structure for the JVM.
> It doesn't make sense to make lots of heap allocations when what you want is multiple items next to each other in memory and one heap allocation.
That is (currently) not possible to do in the JVM, barring primitives. When I create a `new Foo[128]` in the JVM, that creates an array big enough to hold 128 references to Foo, not 128 Foo objects. Those have to be allocated onto the heap separately. This is part of the reason why managing such an object pool is pointless in the JVM: you have to make the allocations anyway, and you are paying the management cost of the pool on top.
The object pool is also particularly bad in the JVM because it stops the JVM from performing optimizations like scalarization. That's where the JVM can avoid a heap allocation altogether and instead pulls out the internal fields of the allocated object to hand off to a calling function. In order for that optimization to occur, an object can't escape the current scope.
I get why this isn't the same story if you are talking about another language like C# or Go. There are still the negative consequences of needing to manage the buffer, especially if the intent is to track allocations of items in the buffer and to reassign them. But there is a gain in locality that's nice.
> Using indices of an array is not difficult and neither is handing out indices or ranges to in small sections.
Easy to do? Sure. Easy to do fast? Well, no. That's entirely the reason why C++ has multiple allocators. It's the crux of the problem an allocator is trying to solve in the first place: "How can I efficiently give a chunk of memory back to the application?"
Obviously, it'll matter what your usage pattern is, but if it's at all complex, you'll run into the same problems that the general allocator hits.
If that were true then they wouldn't be heap allocations.
https://www.digitalocean.com/community/tutorials/java-jvm-me...
https://docs.oracle.com/en/java/javase/21/core/heap-and-heap...
> not possible to do in the JVM, barring primitives
Then you make data structures out of arrays of primitives.
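To illustrate the idea (a sketch of my own, not from the thread): instead of an array of references where each element is a separate heap object, you keep one primitive array per field, all contiguous. In Java this would be two `double[]` instead of a `Point[]`; the C equivalent looks like this.

```c
#include <assert.h>

/* Reference-array style (what `new Foo[128]` gives you in Java):
 * an array of pointers, each element allocated separately elsewhere,
 * so every access is a pointer chase with poor locality. */
typedef struct { double x, y; } point_t;
/* point_t *refs[128];  -- 128 pointers, 128 separate allocations */

/* "Arrays of primitives" style: one array per field, everything
 * contiguous, zero per-item allocations. */
typedef struct {
    double xs[128];
    double ys[128];
    int    len;
} points_t;

static void points_add(points_t *p, double x, double y) {
    p->xs[p->len] = x;
    p->ys[p->len] = y;
    p->len++;
}
```

Iterating over `xs` then walks memory linearly, which is the locality win being argued for.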
> Easy to do? Sure. Easy to do fast? Well, no. That's entirely the reason why C++ has multiple allocators.
I don't know what this means. Vectors are trivial, and if you hand out ranges of memory from an arena allocator, you allocate once and free once, which solves the heavy-allocation problem. The allocator parameter in templates doesn't factor into this.
"Heap" is a misnomer. It's not called that due to the classic CS "heap" datastructure. It's called that for the same reason it's called a heap allocation in C++. Modern C++ allocators don't use a heap structure either.
How the JVM does allocations for all its collectors is in fact a bump allocator in the heap space. There are some weedsy details (for example, threads in the JVM have their own heap region for doing allocation, to avoid contention), but suffice it to say it ultimately translates into a region check then a pointer bump. This is why the JVM is so fast at allocation, much faster than C++ can be. [1] [2]
> I don't know what this means.
JVM allocations are typically pointer bumps, adding a number to a register. There's really nothing faster than it. If you are implementing an arena then you've already lost in terms of performance.
[1] https://www.datadoghq.com/blog/understanding-java-gc/#memory...
[2] https://inside.java/2020/06/25/compact-forwarding/
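A simplified model of the bump allocation being described (this is a sketch of the general technique, not actual HotSpot code; reclamation is deliberately absent because in the JVM that is the collector's job, which moves survivors and resets the region):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* A thread-local allocation region: allocation is a bounds check
 * plus a pointer bump, nothing more. */
typedef struct {
    uint8_t *base;
    uint8_t *top;  /* next free byte */
    uint8_t *end;
} region_t;

static void *bump_alloc(region_t *r, size_t n) {
    n = (n + 7) & ~(size_t)7;       /* 8-byte alignment */
    if (r->top + n > r->end)
        return NULL;                /* region full: a real VM grabs a new
                                       region or triggers a collection */
    void *p = r->top;
    r->top += n;
    return p;
}
```

Note what is missing: there is no `bump_free`. Individual deallocation and reuse are exactly the parts this scheme pushes onto the garbage collector, which is where the other side of this argument (pauses, barriers) comes in.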
"Yes, malloc uses a heap data structure to allocate memory dynamically for programs. The heap allows for persistent memory allocation that can be managed manually by the programmer."
"How Malloc Works with the Heap
> How the JVM does allocations for all its collectors is in fact a bump allocator in the heap space.

This doesn't make sense. It's one or the other. A heap isn't about getting more memory or mapping it into a process's space; it is about managing the memory already in the process's space and being able to free memory in a different order than you allocated it, then give that memory back out without system calls.
https://www.geeksforgeeks.org/c/dynamic-memory-allocation-in...
https://en.wikipedia.org/wiki/C_dynamic_memory_allocation
> JVM allocations are typically pointer bumps, adding a number to a register.
I think you are mixing up mapping memory into a process (which is a system call not a register addition) and managing the memory once it is in process space.
The allocator frees memory and reuses it within a process. If freeing were as simple as subtracting from a register, there would be no difference in speed between the stack and the heap, and there would be no GC pauses and no GC complexity. None of these things are true, obviously, since Java has been dealing with these problems for 30 years.
> This is why the JVM is so fast at allocation, much faster than C++ can be
Java is slower than C++ and less predictable because you can't avoid the GC which is the whole point here.
The original point was that you have to either avoid the GC or fight the GC and a lot of what you have talked about is either not true or explains why someone has to avoid and fight the GC in the first place.
Java does do bump pointer allocation. The key is that when GC runs, surviving objects get moved. The slow part of GC isn't the allocation (GCs generally have much faster allocators than malloc). The slow part is the barriers that the GC requires and the pauses.
If that were true you could have listed one that made sense in context. This person was saying that allocation is as fast as incrementing a register while continually ignoring the fact that deallocation needs to happen, along with any organization of allocated memory.
Then they were ignoring that large allocations have big speed benefits for a reason.
Conflating Java moving a pointer, mapping memory into a process, sbrk, and arena allocation is going in circles, but the fundamental point remains: people need to fight the GC or work around it.
Allocations have a price, and the first step to optimizing any program is avoiding them, but in GC languages you get pauses on top of your slowdowns.
If you got a web request, you could allocate a memory pool for it, then you would do all your memory allocations from that pool. And when your web request ended - either cleanly or with a hundred different kinds of errors, you could just free the entire pool.
It was nice and made an impression on me.
I think the lowly malloc probably has lots of interesting ways of growing and changing.
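The per-request pool pattern described above can be sketched like this (from memory, not actual APR code; all names here are hypothetical): every allocation during the request comes out of one block, and one call releases everything at the end, no matter which error path the request took.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    char  *buf;
    size_t used, cap;
} arena_t;

static arena_t *arena_create(size_t cap) {
    arena_t *a = malloc(sizeof *a);
    a->buf  = malloc(cap);
    a->used = 0;
    a->cap  = cap;
    return a;
}

/* All per-request allocations come from the pool... */
static void *arena_alloc(arena_t *a, size_t n) {
    n = (n + 7) & ~(size_t)7;               /* 8-byte alignment */
    if (a->used + n > a->cap) return NULL;  /* a real pool would chain a new block */
    void *p = a->buf + a->used;
    a->used += n;
    return p;
}

/* ...and one call at the end of the request releases them all,
 * on success or on any of a hundred error paths. */
static void arena_destroy(arena_t *a) {
    free(a->buf);
    free(a);
}
```

The appeal is that cleanup logic collapses to a single `arena_destroy` in one place, instead of matching frees scattered across every exit path.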
Yes, if you want to use huge pages with arbitrary alloc/free, then use a third-party malloc. If your alloc/free patterns are not arbitrary, you can do even better. We treat malloc as a magic black box but it's actually not very good.
https://jemalloc.net/jemalloc.3.html
One thing to call out: sdallocx integrates well with C++'s sized delete semantics: https://isocpp.org/files/papers/n3778.html
The nice thing about mimalloc is that there are a ton of configurable knobs available via env vars. I'm able to hand those 16 1 GiB pages to the program at launch via `MIMALLOC_RESERVE_HUGE_OS_PAGES=16`.
EDIT: after re-reading your comment a few times, I apologize if you already knew this (which it sounds like you did).
My old Intel CPU only has 4 slots for 1GB pages, and that was enough to get me about a 20% performance boost on Factorio. (I think a couple percent might have come from the allocator change, but the boost from forcing huge pages was very significant.)
In the middle of last year, a 9900X was around $350 and 128GB of memory was also around $350. That's very easily "commodity" range.
Here I was thinking 16GiB is pretty good. I get to compile LibreOffice in an afternoon. QtWebEngine overnight.
Doesn't 128GiB make rowhammer much more feasible? You'd have 32GiB per DIMM.
Oh well
The effectiveness of rowhammer depends on how well the manufacturer implemented target row refresh. But the internal ECC on DDR5 should help defend against it somewhat.
Personally I've been in the 24-32GiB range since 2013, and that's despite the fact that I'm still on DDR3.
Last time I checked mimalloc, which was admittedly a while ago, probably 5 years, it was noticeably worse, and I saw a lot of people on their GitHub issues agreeing with me, so I just never looked at it again.
Jemalloc can usually keep the smallest memory footprint, followed by tcmalloc.
Mimalloc can really speed things up sometimes.
As usual, YMMV.
Mimalloc made the claim that they were the fastest/best when they released and that didn't hold up to real world testing, so I am not inclined to trust it now.
That’s… ahistorical, at least so far as I remember. It wasn’t marketed as either of those; it was marketed as small/simple/consistent with an opt-in high-security mode, and then its performance bore out as a result of the first set of target features/design goals. It was mainly pushed as easy to adopt, easy to use, easy to statically link, etc.
That is true of basically every single malloc replacement out there, that is not a uniquely defining feature.
Using the last workload tested as an example, mimalloc just consumed memory like crazy. It was probably leaking, as it was the stock version that comes in Debian, so probably quite old.
Tcmalloc and jemalloc were neck and neck when comparing app metrics (request duration etc. was quite similar), but jemalloc consistently used only about half the RAM of tcmalloc.
Both custom allocators used way less RAM than the stock allocator though. Something like 10x (!) less. In the end the workload with jemalloc hovers somewhere around 4% of the memory limit. Not bad for one single package and an additional compile option to enable it.
Jemalloc Postmortem - https://news.ycombinator.com/item?id=44264958 - June 2025 (233 comments)
Jemalloc Repositories Are Archived - https://news.ycombinator.com/item?id=44161128 - June 2025 (7 comments)
Most of the savings seemed to come from HVAC costs, followed by buying fewer computers and, in turn, fewer data centers. I'm sure these days saving memory is also a big deal, but it doesn't seem to have been then.
The above was already the case 10 years ago, so LLMs are at most another factor added on.
In startups I've put more effort into squeezing blood from a stone for far less change; even if the change was proportionally more significant to the business. Sometimes it would be neat to say "something I did saved $X million dollars or saved Y kWh of energy" or whatever.
At most... Think 10x rather than 0.1x or 1x.
they've been using jemalloc (and employing "je") since 2009.
I'm saddened that the job market in Australia is largely React CRUD applications and that it's unlikely I will find a role that lets me leverage my niche skill set (which is also my hobby).
Link in bio.
I applied for both and got ghosted, haha.
I also saw a government role as a security researcher. Involves reverse engineering, ghidra and that sort of thing. Super awesome - but the pay is extremely uncompetitive. Such a shame.
Other than that, the most interesting roles are in finance (like HFT) - where you need to juggle memory allocations, threads and use C++ (hoping I can pitch Rust but unlikely).
Sadly they have a reputation of having pretty rough cultures, uncompetitive salaries and it's all in-office
https://www.igalia.com/jobs/open/
The one I know of (IMC trading) does a lot of low level stuff like this and is currently hiring.
https://www.samba.org/samba/support/globalsupport.html
https://www.catalyst-au.net/
Facebook's coding AIs to the rescue, maybe? I wonder how good all these "agentic" AIs are at dreaded refactoring jobs like these.
There are not many engineers capable of working on memory allocators, so adding more burden by agentic stuff is unlikely to produce anything of value.
No.
This is something you shouldn't allow coding agents anywhere near, unless you have expert-level understanding required to maintain the project like the previous authors have done without an AI for years.
I've done some work in this sort of area before, though not literally on a malloc. Yes you very much want to be careful, but ultimately it's the tests that give you confidence. Pound the heck out of it in multithreaded contexts and test for consistency.
I don't think so.
Even with tests, LLM-generated code is still not enough and you cannot trust it. It can pass the tests, look seemingly correct, and still cause a regression, as in this case study [0].
[0] https://sketch.dev/blog/our-first-outage-from-llm-written-co...
For example they use AsmJit in a lot of projects (both internal and open-source) and it's now unmaintained because of funding issues. Maybe they have now internal forks too.
This doesn't quite read properly to me. What does it actually mean, does anyone know?
https://technology.blog.gov.uk/2015/12/11/using-jemalloc-to-...
> This branch is 71 commits ahead of and 70 commits behind jemalloc/jemalloc:dev.
It looks like both have been independently updated.
Initially the idea was diagnostics; instead the problem disappeared on its own.
Second thoughts: Actually the fb.com post is more transparent than I'd have predicted. Not bad at all. Of course it helps that they're delivering good news!
He's doing just fine. If you're looking for a story about a FAANG company not paying engineers well for their work, this isn't it.
When I preloaded jemalloc, memory remained at significantly lower levels and, more importantly, it was stable.
There seems to be no single correct solution to memory allocation; it depends on the workload.
From the Department of Redundancy Department.
"To help personalize content, tailor and measure ads and provide a safer experience, we use cookies. By clicking or navigating the site, you agree to allow our collection of information on and off Facebook through cookies. Learn more, including about available controls:
https://engineering.fb.com/privacy"
I was recently debugging an app double-free segfault on my android 13 samsung galaxy A51 phone, and the internal stack trace pointed to jemalloc function calls (je_free).
They should have just called it an ivory tower, as that's what they're building whenever they're not busy destroying democracy with OS Backdoor lobbyism or Cambridge Analytica shenanigans.
Edit: If every thread about any of Elon Musk's companies can contain at least 10 comments talking about Elon's purported crimes against humanity, threads about Zuckerberg's companies can contain at least 1 comment. Without reminders like this, stories like last week's might as well remain non-consequential.
https://hn.algolia.com/?dateRange=all&page=0&prefix=true&sor...
A 0.5% improvement may not be a lot to you, but at hyperscaler scale it's well worth staffing a team to work on it, with the added benefit of having people on hand that can investigate subtle bugs and pathological perf behaviors.
But as usual, there is an xkcd for that: https://xkcd.com/1205/
One project I spent a bunch of time optimizing the write path of I/O. It was just using standard fwrite. But by staging items correctly it was an easy 10x speed win. Those optimizations sometimes stack up and count big. But it also had a few edges on it, so use with care.