Post by Daniel Branisshi,
I'm running 13.2-STABLE on this particular host, which has about 200TB of
ZFS storage; the host also has some 132GB of memory.
kernel: pid 3212 (mountd), jid 0, uid 0, was killed: a thread waited
too long to allocate a page
rpcinfo shows it's still there, but
service mountd restart
fails.
The only solution is to reboot.
BTW, the only 'heavy' workload I can see is several rsync
processes.
Hi,
I seem to have run into something similar. I recently upgraded a 12.4 box
to 13.2p9. The box has 32G of RAM and runs ZFS. We do a lot of rsync work
to it monthly - the first month we did this on 13.2p9, we got a lot of
processes killed, all with a similar (but not identical) message, e.g.
pid 11103 (ssh), jid 0, uid 0, was killed: failed to reclaim memory
pid 10972 (local-unbound), jid 0, uid 59, was killed: failed to reclaim
memory
pid 3223 (snmpd), jid 0, uid 0, was killed: failed to reclaim memory
pid 3243 (mountd), jid 0, uid 0, was killed: failed to reclaim memory
pid 3251 (nfsd), jid 0, uid 0, was killed: failed to reclaim memory
pid 10996 (sshd), jid 0, uid 0, was killed: failed to reclaim memory
pid 3257 (sendmail), jid 0, uid 0, was killed: failed to reclaim memory
pid 8562 (csh), jid 0, uid 0, was killed: failed to reclaim memory
pid 3363 (smartd), jid 0, uid 0, was killed: failed to reclaim memory
pid 8558 (csh), jid 0, uid 0, was killed: failed to reclaim memory
pid 3179 (ntpd), jid 0, uid 0, was killed: failed to reclaim memory
pid 8555 (tcsh), jid 0, uid 1001, was killed: failed to reclaim memory
pid 3260 (sendmail), jid 0, uid 25, was killed: failed to reclaim memory
pid 2806 (devd), jid 0, uid 0, was killed: failed to reclaim memory
pid 3156 (rpcbind), jid 0, uid 0, was killed: failed to reclaim memory
pid 3252 (nfsd), jid 0, uid 0, was killed: failed to reclaim memory
pid 3377 (getty), jid 0, uid 0, was killed: failed to reclaim memory
This 'looks' like an 'out of RAM' type situation - but at the time, top showed:
last pid: 12622; load averages: 0.10, 0.24, 0.13
7 processes: 1 running, 6 sleeping
CPU: 0.1% user, 0.0% nice, 0.2% system, 0.0% interrupt, 99.7% idle
Mem: 4324K Active, 8856K Inact, 244K Laundry, 24G Wired, 648M Buf, 7430M Free
ARC: 20G Total, 8771M MFU, 10G MRU, 2432K Anon, 161M Header, 920M Other
15G Compressed, 23G Uncompressed, 1.59:1 Ratio
Swap: 8192M Total, 5296K Used, 8187M Free
Rebooting recovers it, and the rsync completed after the reboot -
which left us with:
last pid: 12570; load averages: 0.07, 0.14, 0.17  up 0+00:15:06  14:43:56
26 processes: 1 running, 25 sleeping
CPU: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
Mem: 39M Active, 5640K Inact, 17G Wired, 42M Buf, 14G Free
ARC: 15G Total, 33M MFU, 15G MRU, 130K Anon, 32M Header, 138M Other
14G Compressed, 15G Uncompressed, 1.03:1 Ratio
Swap: 8192M Total, 8192M Free
I've not seen any bug reports along these lines; in fact, there is very
little coverage at all of this specific error.
My only thought is to set a sysctl to limit ZFS ARC usage, i.e. to leave
more free RAM available to the rest of the system. During the rsync it was
'swapping' occasionally (a few K in, a few K out) - but it never ran out of
swap that I saw - and it certainly didn't look like a complete out-of-memory
scenario (which is what it felt like, with everything getting
killed).
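For what it's worth, a sketch of what capping the ARC would look like - the
tunable name vfs.zfs.arc_max is the standard FreeBSD/OpenZFS one, but the
16G figure below is purely illustrative, not a tested recommendation for
this box:

```shell
# /boot/loader.conf - cap the ZFS ARC at boot (16G here is an example value)
vfs.zfs.arc_max="16G"
```

On 13.x the same knob can also be changed at runtime without a reboot,
though the live sysctl takes a byte count rather than a suffixed value,
e.g. sysctl vfs.zfs.arc_max=17179869184 for 16 GiB.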
-Karl
--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de