Discussion:
Stressing malloc(9)
Alan Somers
2024-04-19 22:23:51 UTC
TLDR;
How can I create a workload that causes malloc(9)'s performance to plummet?

Background:
I recently witnessed a performance problem on a production server.
Overall throughput dropped by over 30x. dtrace showed that 60% of the
CPU time was dominated by lock_delay as called by three functions:
printf (via ctl_worker_thread), g_eli_alloc_data, and
g_eli_write_done. One thing those three have in common is that they
all use malloc(9). Fixing the problem was as simple as telling CTL to
stop printing so many warnings, by tuning
kern.cam.ctl.time_io_secs=100000.

But even with CTL quieted, dtrace still reports ~6% of the CPU cycles
in lock_delay via g_eli_alloc_data. So I believe that malloc is
limiting geli's performance. I would like to try replacing it with
uma(9).

But on a non-production server, none of my benchmark workloads causes
g_eli_alloc_data to break a sweat. I can't get its CPU consumption to
rise higher than 0.5%. And that's using the smallest sector size and
block size that I can.

So my question is: does anybody have a program that can really stress
malloc(9)? I'd like to run it in parallel with my geli benchmarks to
see how much it interferes.

-Alan


Mark Johnston
2024-04-20 15:07:03 UTC
Post by Alan Somers
TLDR;
How can I create a workload that causes malloc(9)'s performance to plummet?
I recently witnessed a performance problem on a production server.
Overall throughput dropped by over 30x. dtrace showed that 60% of the
CPU time was dominated by lock_delay as called by three functions:
printf (via ctl_worker_thread), g_eli_alloc_data, and
g_eli_write_done. One thing those three have in common is that they
all use malloc(9). Fixing the problem was as simple as telling CTL to
stop printing so many warnings, by tuning
kern.cam.ctl.time_io_secs=100000.
But even with CTL quieted, dtrace still reports ~6% of the CPU cycles
in lock_delay via g_eli_alloc_data. So I believe that malloc is
limiting geli's performance. I would like to try replacing it with
uma(9).
What is the size of the allocations that g_eli_alloc_data() is doing?
malloc() is a pretty thin layer over UMA for allocations <= 64KB.
Larger allocations are handled by a different path (malloc_large())
which goes directly to the kmem_* allocator functions. Those functions
are very expensive: they're serialized by global locks and need to
update the pmap (and perform TLB shootdowns when memory is freed).
They're not meant to be used at a high rate.

My first guess would be that your production workload was hitting this
path, and your benchmarks are not. If you have stack traces or lock
names from DTrace, that would help validate this theory, in which case
using UMA to cache buffers would be a reasonable solution.
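If you want a synthetic stressor for that path, a minimal sketch would
be a few kernel threads hammering >64KB allocations so that every call
takes the malloc_large() route.  All the names below are made up, the
loop never exits, and it is only meant for a throwaway test kernel:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/kernel.h>
    #include <sys/kthread.h>
    #include <sys/malloc.h>
    #include <sys/proc.h>

    static MALLOC_DEFINE(M_STRESS, "stress_sketch", "malloc(9) stress buffers");

    static void
    malloc_stress_thread(void *arg __unused)
    {
            void *p;

            for (;;) {
                    /* 128KB is comfortably above the 64KB UMA cutoff. */
                    p = malloc(128 * 1024, M_STRESS, M_WAITOK);
                    free(p, M_STRESS);
                    maybe_yield();
            }
    }

    static void
    malloc_stress_start(void *arg __unused)
    {
            int i;

            /* A handful of threads is enough to pile up on the kmem locks. */
            for (i = 0; i < 4; i++)
                    kthread_add(malloc_stress_thread, NULL, NULL, NULL,
                        0, 0, "mallocstress%d", i);
    }
    SYSINIT(malloc_stress, SI_SUB_KTHREAD_IDLE, SI_ORDER_ANY,
        malloc_stress_start, NULL);

A few of those running alongside a geli benchmark should reproduce the
kind of lock_delay time you saw in production.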
Post by Alan Somers
But on a non-production server, none of my benchmark workloads causes
g_eli_alloc_data to break a sweat. I can't get its CPU consumption to
rise higher than 0.5%. And that's using the smallest sector size and
block size that I can.
So my question is: does anybody have a program that can really stress
malloc(9)? I'd like to run it in parallel with my geli benchmarks to
see how much it interferes.
-Alan
Alan Somers
2024-04-20 17:23:41 UTC
Post by Mark Johnston
Post by Alan Somers
TLDR;
How can I create a workload that causes malloc(9)'s performance to plummet?
I recently witnessed a performance problem on a production server.
Overall throughput dropped by over 30x. dtrace showed that 60% of the
CPU time was dominated by lock_delay as called by three functions:
printf (via ctl_worker_thread), g_eli_alloc_data, and
g_eli_write_done. One thing those three have in common is that they
all use malloc(9). Fixing the problem was as simple as telling CTL to
stop printing so many warnings, by tuning
kern.cam.ctl.time_io_secs=100000.
But even with CTL quieted, dtrace still reports ~6% of the CPU cycles
in lock_delay via g_eli_alloc_data. So I believe that malloc is
limiting geli's performance. I would like to try replacing it with
uma(9).
What is the size of the allocations that g_eli_alloc_data() is doing?
malloc() is a pretty thin layer over UMA for allocations <= 64KB.
Larger allocations are handled by a different path (malloc_large())
which goes directly to the kmem_* allocator functions. Those functions
are very expensive: they're serialized by global locks and need to
update the pmap (and perform TLB shootdowns when memory is freed).
They're not meant to be used at a high rate.
In my benchmarks so far, 512B. In the real application the size is
mostly between 4k and 16k, and it's always a multiple of 4k. But it's
sometimes great enough to use malloc_large, and it's those
malloc_large calls that account for the majority of the time spent in
g_eli_alloc_data. lockstat shows that malloc_large, as called by
g_eli_alloc_data, sometimes blocks for multiple ms.

But oddly, if I change the parameters so that g_eli_alloc_data
allocates 128kB, I still don't see malloc_large getting called. And
both dtrace and vmstat show that malloc is mostly operating on 512B
allocations. But dtrace does confirm that g_eli_alloc_data is being
called with 128kB arguments. Maybe something is getting inlined? I
don't understand how this is happening. I could probably figure it
out if I recompile with some extra SDT probes, though.
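Concretely, the probe I have in mind would be something like the
following (provider and probe names are invented, and I'm assuming the
size argument inside g_eli_alloc_data is called sz):

    #include <sys/param.h>
    #include <sys/sdt.h>

    /* Hypothetical provider/probe names, purely for illustration. */
    SDT_PROVIDER_DEFINE(geli);
    SDT_PROBE_DEFINE1(geli, , alloc_data, entry, "size_t");

    /*
     * Then, at the top of g_eli_alloc_data(), fire the requested size:
     *
     *      SDT_PROBE1(geli, , alloc_data, entry, sz);
     *
     * and from userland aggregate arg0 with a dtrace quantize() to see
     * exactly which request sizes are reaching malloc_large().
     */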
Post by Mark Johnston
My first guess would be that your production workload was hitting this
path, and your benchmarks are not. If you have stack traces or lock
names from DTrace, that would help validate this theory, in which case
using UMA to cache buffers would be a reasonable solution.
Would that require creating an extra UMA zone for every possible geli
allocation size above 64kB?
Post by Mark Johnston
Post by Alan Somers
But on a non-production server, none of my benchmark workloads causes
g_eli_alloc_data to break a sweat. I can't get its CPU consumption to
rise higher than 0.5%. And that's using the smallest sector size and
block size that I can.
So my question is: does anybody have a program that can really stress
malloc(9)? I'd like to run it in parallel with my geli benchmarks to
see how much it interferes.
-Alan
Alan Somers
2024-04-22 16:46:01 UTC
Post by Alan Somers
Post by Alan Somers
Post by Mark Johnston
Post by Alan Somers
TLDR;
How can I create a workload that causes malloc(9)'s performance to plummet?
I recently witnessed a performance problem on a production server.
Overall throughput dropped by over 30x. dtrace showed that 60% of the
CPU time was dominated by lock_delay as called by three functions:
printf (via ctl_worker_thread), g_eli_alloc_data, and
g_eli_write_done. One thing those three have in common is that they
all use malloc(9). Fixing the problem was as simple as telling CTL to
stop printing so many warnings, by tuning
kern.cam.ctl.time_io_secs=100000.
But even with CTL quieted, dtrace still reports ~6% of the CPU cycles
in lock_delay via g_eli_alloc_data. So I believe that malloc is
limiting geli's performance. I would like to try replacing it with
uma(9).
What is the size of the allocations that g_eli_alloc_data() is doing?
malloc() is a pretty thin layer over UMA for allocations <= 64KB.
Larger allocations are handled by a different path (malloc_large())
which goes directly to the kmem_* allocator functions. Those functions
are very expensive: they're serialized by global locks and need to
update the pmap (and perform TLB shootdowns when memory is freed).
They're not meant to be used at a high rate.
In my benchmarks so far, 512B. In the real application the size is
mostly between 4k and 16k, and it's always a multiple of 4k. But it's
sometimes great enough to use malloc_large, and it's those
malloc_large calls that account for the majority of the time spent in
g_eli_alloc_data. lockstat shows that malloc_large, as called by
g_eli_alloc_data, sometimes blocks for multiple ms.
But oddly, if I change the parameters so that g_eli_alloc_data
allocates 128kB, I still don't see malloc_large getting called. And
both dtrace and vmstat show that malloc is mostly operating on 512B
allocations. But dtrace does confirm that g_eli_alloc_data is being
called with 128kB arguments. Maybe something is getting inlined?
malloc_large() is annotated __noinline, for what it's worth.
Post by Alan Somers
I
don't understand how this is happening. I could probably figure it
out if I recompile with some extra SDT probes, though.
What is g_eli_alloc_sz on your system?
33kiB. That's larger than I expected. When I use a larger blocksize
in my benchmark, then I do indeed see malloc_large activity, and 11%
of the CPU is spent in g_eli_alloc_data.
I guess I'll add some UMA zones for this purpose. I'll try 256k and
512k zones, rounding up allocations as necessary. Thanks for the tip.
When I said "33kiB" I meant "33 pages", or 132 kB. And the solution
turns out to be very easy. Since I'm using ZFS on top of geli, with
the default recsize of 128kB, I'll just set
vfs.zfs.vdev.aggregation_limit to 128 kB. That way geli will never
need to allocate more than 128kB contiguously. ZFS doesn't even need
those big allocations to be contiguous; it's just aggregating smaller
operations to reduce disk IOPs. But aggregating up to 1MB (the
default) is overkill; any rotating HDD should easily saturate its
sequential write throughput with 128 kB operations. I'll add a
read-only sysctl for g_eli_alloc_sz too. Thanks Mark.
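For that sysctl I'm thinking of something like the following, assuming
g_eli_alloc_sz is a u_int and hanging the knob off the existing
kern.geom.eli tree (adjust the macro to match the variable's real type):

    #include <sys/param.h>
    #include <sys/kernel.h>
    #include <sys/sysctl.h>

    SYSCTL_DECL(_kern_geom_eli);
    extern u_int g_eli_alloc_sz;

    /* Read-only view of the per-I/O buffer size geli pre-computes. */
    SYSCTL_UINT(_kern_geom_eli, OID_AUTO, alloc_sz, CTLFLAG_RD,
        &g_eli_alloc_sz, 0,
        "Size of the data buffers that geli allocates for each I/O request");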

-Alan


Alan Somers
2024-04-21 23:47:41 UTC
Post by Alan Somers
Post by Mark Johnston
Post by Alan Somers
TLDR;
How can I create a workload that causes malloc(9)'s performance to plummet?
I recently witnessed a performance problem on a production server.
Overall throughput dropped by over 30x. dtrace showed that 60% of the
CPU time was dominated by lock_delay as called by three functions:
printf (via ctl_worker_thread), g_eli_alloc_data, and
g_eli_write_done. One thing those three have in common is that they
all use malloc(9). Fixing the problem was as simple as telling CTL to
stop printing so many warnings, by tuning
kern.cam.ctl.time_io_secs=100000.
But even with CTL quieted, dtrace still reports ~6% of the CPU cycles
in lock_delay via g_eli_alloc_data. So I believe that malloc is
limiting geli's performance. I would like to try replacing it with
uma(9).
What is the size of the allocations that g_eli_alloc_data() is doing?
malloc() is a pretty thin layer over UMA for allocations <= 64KB.
Larger allocations are handled by a different path (malloc_large())
which goes directly to the kmem_* allocator functions. Those functions
are very expensive: they're serialized by global locks and need to
update the pmap (and perform TLB shootdowns when memory is freed).
They're not meant to be used at a high rate.
In my benchmarks so far, 512B. In the real application the size is
mostly between 4k and 16k, and it's always a multiple of 4k. But it's
sometimes great enough to use malloc_large, and it's those
malloc_large calls that account for the majority of the time spent in
g_eli_alloc_data. lockstat shows that malloc_large, as called by
g_eli_alloc_data, sometimes blocks for multiple ms.
But oddly, if I change the parameters so that g_eli_alloc_data
allocates 128kB, I still don't see malloc_large getting called. And
both dtrace and vmstat show that malloc is mostly operating on 512B
allocations. But dtrace does confirm that g_eli_alloc_data is being
called with 128kB arguments. Maybe something is getting inlined?
malloc_large() is annotated __noinline, for what it's worth.
Post by Alan Somers
I
don't understand how this is happening. I could probably figure it
out if I recompile with some extra SDT probes, though.
What is g_eli_alloc_sz on your system?
33kiB. That's larger than I expected. When I use a larger blocksize
in my benchmark, then I do indeed see malloc_large activity, and 11%
of the CPU is spent in g_eli_alloc_data.

I guess I'll add some UMA zones for this purpose. I'll try 256k and
512k zones, rounding up allocations as necessary. Thanks for the tip.
Post by Alan Somers
Post by Mark Johnston
My first guess would be that your production workload was hitting this
path, and your benchmarks are not. If you have stack traces or lock
names from DTrace, that would help validate this theory, in which case
using UMA to cache buffers would be a reasonable solution.
Would that require creating an extra UMA zone for every possible geli
allocation size above 64kB?
Something like that. Or have a zone of maxphys-sized buffers (actually
I think it needs to be slightly larger than that?) and accept the
corresponding waste, given that these allocations are short-lived. This
is basically what g_eli_alloc_data() already does.
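A rough sketch of that approach, with made-up names and a guessed page
of slack on top of maxphys:

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/kernel.h>
    #include <sys/malloc.h>
    #include <vm/uma.h>

    static uma_zone_t g_eli_big_zone;

    static void
    g_eli_big_zone_init(void *arg __unused)
    {
            /* One page of slack for geli's per-request overhead (a guess). */
            g_eli_big_zone = uma_zcreate("g_eli_big_sketch",
                maxphys + PAGE_SIZE, NULL, NULL, NULL, NULL,
                UMA_ALIGN_CACHE, 0);
    }
    SYSINIT(g_eli_big_zone, SI_SUB_DRIVERS, SI_ORDER_ANY,
        g_eli_big_zone_init, NULL);

    static void *
    g_eli_big_alloc(size_t sz)
    {
            if (sz > maxphys + PAGE_SIZE)
                    return (NULL);  /* shouldn't happen for disk I/O */
            return (uma_zalloc(g_eli_big_zone, M_NOWAIT));
    }

    static void
    g_eli_big_free(void *p)
    {
            uma_zfree(g_eli_big_zone, p);
    }

Every allocation burns one maxphys-sized item regardless of the request
size, but since the buffers only live for the duration of a single I/O
the waste is bounded by the number of in-flight requests.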
Post by Alan Somers
Post by Mark Johnston
Post by Alan Somers
But on a non-production server, none of my benchmark workloads causes
g_eli_alloc_data to break a sweat. I can't get its CPU consumption to
rise higher than 0.5%. And that's using the smallest sector size and
block size that I can.
So my question is: does anybody have a program that can really stress
malloc(9)? I'd like to run it in parallel with my geli benchmarks to
see how much it interferes.
-Alan
Warner Losh
2024-04-23 21:58:53 UTC
On Tue, Apr 23, 2024 at 2:37 AM Alexander Leidinger
You're basically saying that it is not uncommon to have such large
allocations with the kernels we ship (even in releases).
Wouldn't it make sense to optimize the kernel to handle larger UMA
allocations?
Or do you expect this to be specific to ZFS, in which case it might
make more sense to ask the OpenZFS developers to reduce this default
setting?
Yes, both of those things are true. It might make sense to reduce the
setting's default value. OTOH, the current value is probably fine for
people who don't use geli (and possibly other transforms that require
allocating data). And it would also be good to optimize the kernel to
perform these allocations more efficiently. My best idea is to teach
g_eli_alloc_data how to allocate scatter/gather lists of 64k buffers
instead of contiguous memory. The memory doesn't need to be
contiguous, after all. But that's a bigger change, and I don't know
that I have the time for it right now.
-Alan
Do you have time to make a nice description of what would have to be
done in the wiki?
https://wiki.freebsd.org/IdeasPage
I've added the super-brief version to https://wiki.freebsd.org/WarnerLosh
which has my crazy ideas list...

Warner

Alan Somers
2024-04-22 22:05:15 UTC
Post by Alan Somers
When I said "33kiB" I meant "33 pages", or 132 kB. And the solution
turns out to be very easy. Since I'm using ZFS on top of geli, with
the default recsize of 128kB, I'll just set
vfs.zfs.vdev.aggregation_limit to 128 kB. That way geli will never
need to allocate more than 128kB contiguously. ZFS doesn't even need
those big allocations to be contiguous; it's just aggregating smaller
operations to reduce disk IOPs. But aggregating up to 1MB (the
default) is overkill; any rotating HDD should easily saturate its
sequential write throughput with 128 kB operations. I'll add a
read-only sysctl for g_eli_alloc_sz too. Thanks Mark.
-Alan
Setting this on one of my production machines that uses zfs behind
geli drops the load average quite materially with zero impact on
throughput that I can see (thus far). I will run this for a while but
it certainly doesn't appear to have any negatives associated with it
and does appear to improve efficiency quite a bit.
Great news! Also, FTR I should add that this advice only applies to
people who use HDDs. For SSDs zfs uses a different aggregation limit,
and the default value is already low enough.
-Alan


Alan Somers
2024-04-23 12:47:03 UTC
On Tue, Apr 23, 2024 at 2:37 AM Alexander Leidinger
Post by Alan Somers
Post by Alan Somers
When I said "33kiB" I meant "33 pages", or 132 kB. And the solution
turns out to be very easy. Since I'm using ZFS on top of geli, with
the default recsize of 128kB, I'll just set
vfs.zfs.vdev.aggregation_limit to 128 kB. That way geli will never
need to allocate more than 128kB contiguously. ZFS doesn't even need
those big allocations to be contiguous; it's just aggregating smaller
operations to reduce disk IOPs. But aggregating up to 1MB (the
default) is overkill; any rotating HDD should easily saturate its
sequential write throughput with 128 kB operations. I'll add a
read-only sysctl for g_eli_alloc_sz too. Thanks Mark.
-Alan
Setting this on one of my production machines that uses zfs behind
geli drops the load average quite materially with zero impact on
throughput that I can see (thus far). I will run this for a while but
it certainly doesn't appear to have any negatives associated with it
and does appear to improve efficiency quite a bit.
Great news! Also, FTR I should add that this advice only applies to
people who use HDDs. For SSDs zfs uses a different aggregation limit,
and the default value is already low enough.
You're basically saying that it is not uncommon to have such large
allocations with the kernels we ship (even in releases).
Wouldn't it make sense to optimize the kernel to handle larger UMA
allocations?
Or do you expect this to be specific to ZFS, in which case it might
make more sense to ask the OpenZFS developers to reduce this default
setting?
Yes, both of those things are true. It might make sense to reduce the
setting's default value. OTOH, the current value is probably fine for
people who don't use geli (and possibly other transforms that require
allocating data). And it would also be good to optimize the kernel to
perform these allocations more efficiently. My best idea is to teach
g_eli_alloc_data how to allocate scatter/gather lists of 64k buffers
instead of contiguous memory. The memory doesn't need to be
contiguous, after all. But that's a bigger change, and I don't know
that I have the time for it right now.
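Roughly what I have in mind, as a sketch (names are invented; the real
work would be teaching every consumer of the buffer to walk the chunk
list instead of assuming it's contiguous):

    #include <sys/param.h>
    #include <sys/systm.h>
    #include <sys/kernel.h>
    #include <sys/malloc.h>

    #define SG_CHUNK        (64 * 1024)     /* stays on malloc's cheap UMA path */

    static MALLOC_DEFINE(M_ELI_SG, "eli_sg_sketch", "geli scatter/gather sketch");

    struct eli_sg {
            int     sg_nchunks;
            void    *sg_chunks[];           /* each chunk is SG_CHUNK bytes */
    };

    static struct eli_sg *
    eli_sg_alloc(size_t sz)
    {
            struct eli_sg *sg;
            int i, n;

            n = howmany(sz, SG_CHUNK);
            sg = malloc(sizeof(*sg) + n * sizeof(void *), M_ELI_SG, M_WAITOK);
            sg->sg_nchunks = n;
            for (i = 0; i < n; i++)
                    sg->sg_chunks[i] = malloc(SG_CHUNK, M_ELI_SG, M_WAITOK);
            return (sg);
    }

    static void
    eli_sg_free(struct eli_sg *sg)
    {
            int i;

            for (i = 0; i < sg->sg_nchunks; i++)
                    free(sg->sg_chunks[i], M_ELI_SG);
            free(sg, M_ELI_SG);
    }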
-Alan

