Upper limit for bwait()

Discussion:

Poul-Henning Kamp

2024-05-30 06:42:07 UTC

We could perhaps handle such lengthy buffer IO operations gracefully by
adding a timeout in bwait(), say 10 minutes. If the buffer IO has not
completed within a few minutes, it will probably never complete and/or
it is slowing down the entire system. So it is better to stop such
faulty IO operations.

I agree that the symptoms are bad, but disagree about putting a workaround
in bread(), because you get system corruption if the I/O operation
completes anyway after the timeout.
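
A toy model of that failure mode, in Python rather than kernel C so it can be run as-is. This is only a sketch: `Buffer`, `device_complete` and `naive_timeout_demo` are hypothetical names, not FreeBSD interfaces, and the "device" is just a function call standing in for a late DMA write.

```python
class Buffer:
    """Stand-in for a block of memory that was handed to a device."""
    def __init__(self):
        self.owner = None   # who currently believes they own the memory
        self.data = b""

def device_complete(buf, payload):
    # The device does not know the waiter gave up: it writes to the
    # address it was handed when the I/O was issued.
    buf.data = payload

def naive_timeout_demo():
    buf = Buffer()
    buf.owner = "reader A"        # reader A issues a read into buf
    # Reader A's wait times out; the buffer is released and reused.
    buf.owner = "writer B"
    buf.data = b"B's fresh data"
    # The stalled read finally completes, long after the timeout.
    device_complete(buf, b"A's stale block")
    return buf.owner, buf.data

owner, data = naive_timeout_demo()
print(owner, data)
```

The buffer now belongs to B but holds A's data, which is the corruption in question.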

The fundamental issue with timing out I/O is stopping the operation
in progress.

If you say "I'm not waiting for this any more", you have to sequester
the destination of the I/O operation until you have 100% confirmation
that the operation has either completed or been successfully neutered.
(As a policy choice, you may also want to write-protect the source.)

This is why hi-rel systems never allow direct(-mapped) I/O: By
insisting that data go through dedicated I/O buffers, failing buffers
can be sequestered as long as necessary, without complicating the
application logic.
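
A minimal sketch of the sequestering idea, assuming a fixed pool of dedicated I/O buffers. The class, state names and methods are hypothetical and only illustrate the rule that a buffer whose waiter gave up is not reused until the I/O is confirmed dead.

```python
from enum import Enum, auto

class State(Enum):
    FREE = auto()
    IN_FLIGHT = auto()
    SEQUESTERED = auto()   # waiter gave up, device may still write here

class BufferPool:
    def __init__(self, n):
        self.state = {i: State.FREE for i in range(n)}

    def start_io(self):
        # Hand out the first free buffer; sequestered ones are skipped.
        for i, s in self.state.items():
            if s is State.FREE:
                self.state[i] = State.IN_FLIGHT
                return i
        raise RuntimeError("no free buffers")

    def waiter_gave_up(self, i):
        # The caller stops waiting, but the memory stays off limits.
        if self.state[i] is State.IN_FLIGHT:
            self.state[i] = State.SEQUESTERED

    def io_confirmed_dead(self, i):
        # Call only once the driver confirms completion or abort.
        self.state[i] = State.FREE
```

The caller gets its error immediately, while the buffer stays out of circulation for as long as the hardware needs.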

Before Virtual Memory, the UNIX buffer-cache worked that way, and
MERT did that. (MERT = Early five-nines UNIX for telephone switches.)

Between "intelligent I/O controllers" with DMA access, virtual
memory and direct-mapped I/O, we /have/ to make sure the underlying
I/O operation is /guaranteed/ dead, before we wake up the thread.

The only place that can and should happen is in the device driver,
possibly assisted by infrastructure such as CAM.

You need to find out which device driver is ultimately responsible
for the hanging bread(), because that's where the timeout should
happen.

Poul-Henning
--
Poul-Henning Kamp | UNIX since Zilog Zeus 3.20
***@FreeBSD.ORG | TCP/IP since RFC 956
FreeBSD committer | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.

--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de

Warner Losh

2024-05-30 13:53:59 UTC

Hello,
There have been a few incidents reported on Juniper devices with FreeBSD,
where buffer IO operations sleep for more than 30 mins. Theoretically, this
can happen due to faulty hardware or in virtual platforms due to faulty
connection between guest and host, filesystem corruption, too many buffer
IO operations, and/or the host not responding for various reasons. When that
happens, as these buffer IO writes hold a lock before going to sleep, the
threads waiting for that lock starve for just as long. There is no upper
limit for this bwait() as of now. If that wait goes beyond 30 mins for a
sleeping thread OR 15 mins for a thread blocked on a turnstile, deadlkres
crashes the kernel, assuming a possible deadlock.
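
The two thresholds quoted above can be written down as a small check. This is a sketch using only the numbers from this message; the function and state names are hypothetical, and it does not reproduce how deadlkres is actually implemented.

```python
SLEEP_LIMIT_S = 30 * 60   # sleeping thread, per the description above
BLOCK_LIMIT_S = 15 * 60   # thread blocked on a turnstile

def watchdog_would_panic(state, stuck_for_s):
    """Return True if a thread stuck this long would trip the resolver."""
    if state == "sleeping":
        return stuck_for_s > SLEEP_LIMIT_S
    if state == "blocked":
        return stuck_for_s > BLOCK_LIMIT_S
    return False

# A proposed 10 minute bwait() limit stays under both thresholds.
print(watchdog_would_panic("sleeping", 10 * 60))
print(watchdog_would_panic("blocked", 10 * 60))
```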

Why isn't the I/O timing out? That's the real problem.

I think that's a terrible idea. Why aren't the I/Os timing out?

For now, since we had seen these instances only with BIO operations, I
have a patch to set this value only from bufwait(). Please find the patch
attached. I am not very sure if 10 mins is a good upper limit for all the
scenarios for bwait(). If it is, then we could just change msleep() in
bwait() to set a 10 mins upper limit by default.

I never see this on any of the thousands of systems I've used.

Please let me know if this approach works for all the use cases. If not,
is there a better alternative? And is 10 mins okay for a timeout?

Making sure that the I/Os time out.

And by that, I mean doing what we do in CAM. All the SIMs ensure that
transactions posted to the device will time out. Most of the SIMs create a
timeout per transaction; when it expires, the CCB is completed with a
timeout status. The periph drivers see this status and fail the I/O with a
timed-out status, or maybe retry it a couple of times, depending on the
hardware and its recovery methods (e.g. whether the timeout is due to the
drive, the link, the HBA, etc. will result in different recovery). NVMe
(nvd) does similar things: a timeout will cause the nvme card to be reset
and we try again, but eventually fail.
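
The shape of that policy (per-command timer, recover, retry a couple of times, then fail the I/O) can be sketched in a few lines. This is an illustration only, not CAM or nvme code: `run_with_retries`, `issue` and `reset` are hypothetical names, and `issue()` is assumed to return False when its per-command timer fired.

```python
def run_with_retries(issue, reset, retries=2):
    """Try a command up to 1 + retries times, then fail it.

    issue(): returns True if the command completed before its timer fired.
    reset(): recovery step run after every timeout (e.g. a controller reset).
    """
    for _attempt in range(retries + 1):
        if issue():
            return "ok"
        reset()
    # The upper layers see an error instead of sleeping forever.
    return "timed out"

# A device that never answers: three attempts, three resets, then failure.
resets = []
print(run_with_retries(lambda: False, lambda: resets.append("reset")))
print(len(resets))
```

The point is that the waiter in the buffer layer always gets woken, with success or with an error, because the layer that owns the hardware bounds every command.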

One might also wonder why 30s is the timeout for most of the commands. I
get that 'special' commands might need a longer timeout, but we likely
should look at lowering this somewhat. 15s is almost certainly safe. 10s is
probably safe. 5s will work, but you start to get P99.9999 outliers on
popular completely working spinning rust, and P99.9 on marginal drives, so
it can be a bit tricky to change (we'll have to phase it in). That could
make things a bit better in terms of worst-case recovery time.
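
To put the percentiles in perspective, here is a back-of-the-envelope calculation. It reads "P99.9999 outliers" as roughly one healthy I/O in a million exceeding the timeout, and the 100 IOPS load is purely an assumed figure for illustration.

```python
SECONDS_PER_DAY = 86400

def spurious_timeouts_per_day(iops, one_in):
    """Expected I/Os per day exceeding the timeout if 1 in `one_in` does."""
    return iops * SECONDS_PER_DAY / one_in

# Assumed load: 100 I/Os per second on a single drive.
print(spurious_timeouts_per_day(100, 1_000_000))  # P99.9999 tail
print(spurious_timeouts_per_day(100, 1_000))      # P99.9 tail
```

Under those assumptions a healthy drive would trip the shorter timeout a handful of times a day and a marginal one thousands of times, which is why such a change would need to be phased in.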

So why aren't the I/Os timing out? That is the real question here.

Warner

Thanks and Regards,
Kumara

Poul-Henning Kamp

2024-05-30 14:00:30 UTC

Post by Warner Losh
One might also wonder why 30s is the timeout for most of the commands.

There are Hierarchical Storage Systems where 30s is not enough:
You issue what you think is a disk-read, but somewhere behind the
scenes, an overloaded tape-robot has no empty drives.

But otherwise: 100% with Warner here.
--
Poul-Henning Kamp | UNIX since Zilog Zeus 3.20
***@FreeBSD.ORG | TCP/IP since RFC 956
FreeBSD committer | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
