Discussion:
Ryzen 9 7950X3D bulk -a times: adding an example with SMT disabled (so 16 hardware threads, not 32)
Mark Millard
2023-11-10 01:26:57 UTC
Reading some benchmark results for compilation activity that showed some
SMT vs. non-SMT examples, and also using my C++ variant of the old HINT
benchmark, I ended up curious how a non-SMT from-scratch bulk -a would
end up (ZFS context) compared to my prior SMT-based run.

I use a high-load-average style of bulk -a activity that has USE_TMPFS=all
involved. The system has 96 GiBytes of RAM (total across the 2 DIMMs).
The original, under-1.5-day run definitely had significant swap space use
(RAM+SWAP = 96 GiBytes + 364 GiBytes == 460 GiBytes == 471040 MiBytes).
The media was (and is) a PCIe-based Optane 905P 1.5T. ZFS is on a single
partition on the single drive, with ZFS used just for bectl reasons, not
the other typical use-ZFS reasons. I've not controlled the ARC size range
explicitly.
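
For anyone wanting to compare, the relevant totals can be checked with
standard FreeBSD tools, something like the following (the comments
describe what this sort of setup reports):

sysctl -n hw.ncpu     # logical CPUs: 32 with SMT, 16 with SMT disabled
sysctl -n hw.physmem  # physical RAM, reported in bytes
swapinfo -m           # swap device size and use, in MiBytes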

So less swap partition use is part of what contributes to the difference
in results.

The original bulk -a spent a couple of hours at the end just fetching
and building textproc/stardict-quick . I have not cleared out
/usr/ports/distfiles or updated anything since.

So fetch time is also a difference here.

SMT (32 hardware threads, original bulk -a):

[33:10:00] [32] [04:37:23] Finished emulators/libretro-mame | libretro-mame-20220124_1: Success
[35:36:51] [23] [03:44:04] Finished textproc/stardict-quick | stardict-quick-2.4.2_9: Success
. . .
[main-amd64-bulk_a-default] [2023-11-01_07h14m50s] [committing:] Queued: 34683 Built: 33826 Failed: 179 Skipped: 358 Ignored: 320 Fetched: 0 Tobuild: 0 Time: 35:37:55

Swap-involved MaxObs (Max Observed) figures:
173310Mi MaxObsUsed
256332Mi MaxObs(Act+Lndry+SwapUsed)
265551Mi MaxObs(Act+Wir+Lndry+SwapUsed)
(So 265551Mi of 471040Mi RAM+SWAP.)

Just-RAM MaxObs figures:
81066Mi MaxObsActive
(Given the complications of getting usefully comparable wired figures for ZFS (ARC): omit.)
94493Mi MaxObs(Act+Wir+Lndry)

Note: MaxObs(A+B+C) <= MaxObs(A)+MaxObs(B)+MaxObs(C), since the
individual maxima need not occur at the same time.

ALLOW_MAKE_JOBS=yes was used. No explicit restriction on PARALLEL_JOBS
or MAKE_JOBS_NUMBER (or analogous). So 32 builders allowed, each allowed
32 make jobs. This explains the high load averages of the bulk -a :

load averages . . . MaxObs: 360.70, 267.63, 210.84
(Those need not be all from the same time frame during the bulk -a .)
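
In poudriere.conf terms, the settings just described amount to roughly
the following fragment (a sketch of the relevant lines, not the
complete file):

USE_TMPFS=all
ALLOW_MAKE_JOBS=yes
# PARALLEL_JOBS is left at its default (the CPU count) and no
# MAKE_JOBS_NUMBER is set in the jail's make.conf, so each of the
# 32 builders may use up to 32 make jobs.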

As for the ports vintage:

# ~/fbsd-based-on-what-commit.sh -C /usr/ports/
6ec8e3450b29 (HEAD -> main, freebsd/main, freebsd/HEAD) devel/sdts++: Mark DEPRECATED
Author: Muhammad Moinur Rahman <***@FreeBSD.org>
Commit: Muhammad Moinur Rahman <***@FreeBSD.org>
CommitDate: 2023-10-21 19:01:38 +0000
branch: main
merge-base: 6ec8e3450b29462a590d09fb0b07ed214d456bd5
merge-base: CommitDate: 2023-10-21 19:01:38 +0000
n637598 (--first-parent --count for merge-base)
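
(That report boils down to roughly the following git queries; this is
just a sketch, not the actual fbsd-based-on-what-commit.sh script:)

cd /usr/ports
git log -1 --format='%h%d %s%nCommitDate: %ci'
git merge-base HEAD freebsd/main
git rev-list --first-parent --count $(git merge-base HEAD freebsd/main)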

I do have an environment that avoids various LLVM builds taking
as long to build:

llvm1[3-7] : no MLIR, no FLANG
llvm1[4-7] : use BE_NATIVE
other llvm* : use defaults (so, no avoidance)

I also prevent the builds from using strip on most of the install
materials built (not just toolchain materials).


non-SMT (16 hardware threads):

Note that one builder (math/fricas), the last still present, was
stuck, and I had to kill its processes to have it stop rather than
wait out my large timeout figures. The last builder to finish
normally was:

[39:48:10] [09] [00:16:23] Finished devel/gcc-msp430-ti-toolchain | gcc-msp430-ti-toolchain-9.3.1.2.20210722_1: Success

So, trying to place some bounds for comparing SMT (32 hw threads)
to non-SMT (16 hw threads):

33:10:00 SMT -> 39:48:10 non-SMT would be over 6.5 hrs longer for non-SMT
35:36:51 SMT -> 39:48:10 non-SMT would be over 4 hrs longer for non-SMT
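
(For the record, those deltas work out as follows; the tiny sh helper
to_s below is just a hypothetical name for the HH:MM:SS conversion,
not an existing tool:)

to_s() { echo "$1" | awk -F: '{print $1*3600 + $2*60 + $3}'; }
echo $(( $(to_s 39:48:10) - $(to_s 33:10:00) ))  # 23890 s == 6:38:10
echo $(( $(to_s 39:48:10) - $(to_s 35:36:51) ))  # 15079 s == 4:11:19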

As for SMT vs. non-SMT Maximum Observed figures:

SMT load averages . . . MaxObs: 360.70, 267.63, 210.84
non-SMT load averages . . . MaxObs: 152.89, 100.94, 76.28

Swap-involved MaxObs figures for SMT (32 hw threads) vs not (16):
173310Mi vs. 33003Mi MaxObsUsed
256332Mi vs. 117221Mi MaxObs(Act+Lndry+SwapUsed)
265551Mi vs. 124776Mi MaxObs(Act+Wir+Lndry+SwapUsed)

Just-RAM MaxObs figures for SMT (32 hw threads) vs not (16):
81066Mi vs. 69763Mi MaxObsActive
(Given the complications of getting usefully comparable wired figures for ZFS (ARC): omit.)
94493Mi vs. 94303Mi MaxObs(Act+Wir+Lndry)


===
Mark Millard
marklmi at yahoo.com



Mark Millard
2023-11-13 02:00:46 UTC
I've added a section for a plot for the 7950X3D to the end of:

https://github.com/markmi/acpphint/blob/master/Some_acpphint_curves_with_notes.md

It is from a C++ variant of the old HINT benchmark and, among other
things, shows RAM-caching consequences for the benchmark. The roughly
32 MiByte and roughly 96 MiByte cache sizes of the 2 CCDs are
observable.

I'll also note that, with the devices present (active and not), the
fully active 7950X3D system seems to draw 225 Watts .. 235 Watts at
the power cable under FreeBSD. Idle under FreeBSD: more like 96
Watts.

(No video card. 2 forms of Optane 905P 1.5TB, one active. One
Samsung 960 Pro 2TB, inactive. One Samsung 970 EVO Plus 2TB,
inactive. 96 GiBytes of RAM total across 2 DIMMs. Fans and
AIO cooling. Keyboard and mouse USB powered. USB3 Ethernet
dongle. Monitor connection.)


ThreadRipper 1950X "bulk -a" test in progress:

I'm running a from-scratch USE_TMPFS=all "bulk -a" on the
ThreadRipper 1950X (128 GiBytes of RAM). From what I've seen
so far, it looks likely to take over 72 hr, so 2x+ as long
as the 7950X3D. (Samsung 960 Pro 1TB system media and
Optane 900 480 GB swap space media are in use, 447 GiBytes of
swap as I remember. The ZFS partition on the 960 Pro has
ashift=14 .) It has a slightly modified copy of the ZFS from
the 7950X3D as far as starting content goes. It does have
openzfs-2.2 compatibility fully enabled for its pool,
including block cloning, unlike any other ZFS I have around
(openzfs-2.1-freebsd).
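
(For anyone wanting to compare pools, the properties involved can be
inspected with something like the following, using "zroot" as a
stand-in pool name:)

zpool get ashift zroot
zpool get compatibility zroot
zpool get feature@block_cloning zroot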

===
Mark Millard
marklmi at yahoo.com


