Post by Daniel Branisshi,
I'm running 13.2-STABLE on this particular host, which has about 200TB of
ZFS storage; the host also has some 132GB of memory.
kernel: pid 3212 (mountd), jid 0, uid 0, was killed: a thread waited
too long to allocate a page
rpcinfo shows it's still there, but
service mountd restart
fails.
The only solution is to reboot.
BTW, the only 'heavy' workload I can see is several rsync
processes.
Hi,
I seem to have run into something similar. I recently upgraded a 12.4 box
to 13.2p9. The box has 32G of RAM and runs ZFS. We do a lot of rsync work
to it monthly - the first month we did this on 13.2p9, we got a lot of
processes killed, all with a similar (but not identical) message, e.g.
pid 11103 (ssh), jid 0, uid 0, was killed: failed to reclaim memory
pid 10972 (local-unbound), jid 0, uid 59, was killed: failed to reclaim
memory
pid 3223 (snmpd), jid 0, uid 0, was killed: failed to reclaim memory
pid 3243 (mountd), jid 0, uid 0, was killed: failed to reclaim memory
pid 3251 (nfsd), jid 0, uid 0, was killed: failed to reclaim memory
pid 10996 (sshd), jid 0, uid 0, was killed: failed to reclaim memory
pid 3257 (sendmail), jid 0, uid 0, was killed: failed to reclaim memory
pid 8562 (csh), jid 0, uid 0, was killed: failed to reclaim memory
pid 3363 (smartd), jid 0, uid 0, was killed: failed to reclaim memory
pid 8558 (csh), jid 0, uid 0, was killed: failed to reclaim memory
pid 3179 (ntpd), jid 0, uid 0, was killed: failed to reclaim memory
pid 8555 (tcsh), jid 0, uid 1001, was killed: failed to reclaim memory
pid 3260 (sendmail), jid 0, uid 25, was killed: failed to reclaim memory
pid 2806 (devd), jid 0, uid 0, was killed: failed to reclaim memory
pid 3156 (rpcbind), jid 0, uid 0, was killed: failed to reclaim memory
pid 3252 (nfsd), jid 0, uid 0, was killed: failed to reclaim memory
pid 3377 (getty), jid 0, uid 0, was killed: failed to reclaim memory
This 'looks' like an 'out of RAM' type situation - but at the time, top showed:
last pid: 12622; load averages: 0.10, 0.24, 0.13
7 processes: 1 running, 6 sleeping
CPU: 0.1% user, 0.0% nice, 0.2% system, 0.0% interrupt, 99.7% idle
Mem: 4324K Active, 8856K Inact, 244K Laundry, 24G Wired, 648M Buf, 7430M Free
ARC: 20G Total, 8771M MFU, 10G MRU, 2432K Anon, 161M Header, 920M Other
15G Compressed, 23G Uncompressed, 1.59:1 Ratio
Swap: 8192M Total, 5296K Used, 8187M Free
Rebooting recovers it, and the rsync completed after the reboot -
which left us with:
last pid: 12570; load averages: 0.07, 0.14, 0.17  up 0+00:15:06  14:43:56
26 processes: 1 running, 25 sleeping
CPU: 0.0% user, 0.0% nice, 0.0% system, 0.0% interrupt, 100% idle
Mem: 39M Active, 5640K Inact, 17G Wired, 42M Buf, 14G Free
ARC: 15G Total, 33M MFU, 15G MRU, 130K Anon, 32M Header, 138M Other
14G Compressed, 15G Uncompressed, 1.03:1 Ratio
Swap: 8192M Total, 8192M Free
I've not seen any bug reports along these lines; in fact, there is very
little coverage at all of this specific error.
My only thought is to set a sysctl to limit ZFS ARC usage, i.e. to leave
more free RAM available to the rest of the system. During the rsync it was
'swapping' occasionally (a few K in, a few K out) - but it never ran out of
swap that I saw - and it certainly didn't look like a complete out-of-memory
scenario (which is what it felt like, with everything getting
killed).
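For what it's worth, a sketch of what capping the ARC would look like - the
tunable name vfs.zfs.arc_max is the standard FreeBSD/OpenZFS one, but the
16G figure below is purely illustrative, not a tested recommendation for
this box:

```shell
# /boot/loader.conf - cap the ZFS ARC at boot (16G here is an example value)
vfs.zfs.arc_max="16G"
```

On 13.x the same knob can also be changed at runtime without a reboot,
though the live sysctl takes a byte count rather than a suffixed value,
e.g. sysctl vfs.zfs.arc_max=17179869184 for 16 GiB.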
-Karl
--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de