Discussion:
Should close() release locks atomically?
Alan Somers
2023-06-23 19:00:36 UTC
The close() syscall automatically releases locks. Should it do so
atomically or is a delay permitted? I can't find anything in our man
pages or the Open Group specification that says.

The distinction matters when using O_NONBLOCK. For example:

fd = open(..., O_DIRECT | O_EXLOCK | O_NONBLOCK); //succeeds
// do some I/O
close(fd);
fd = open(..., O_DIRECT | O_EXLOCK | O_NONBLOCK); //fails with EAGAIN!

I see this error frequently on a heavily loaded system. It isn't a
typical thread race though; ktrace shows that only one thread tries to
open the file in question. From the ktrace, I can see that the final
open() comes immediately after the close(), with no intervening
syscalls from that thread. It seems that close() doesn't release the
lock right away. I wouldn't notice if I weren't using O_NONBLOCK.

Should this be considered a bug? If so I could try to come up with a
minimal test case. But it's somewhat academic, since I plan to
refactor the code in a way that will eliminate the duplicate open().

-Alan


--
Posted automagically by a mail2news gateway at muc.de e.V.
Please direct questions, flames, donations, etc. to news-***@muc.de
Konstantin Belousov
2023-06-23 20:02:59 UTC
What type of object is behind the fd? O_NONBLOCK affects the open itself.
We release the flock after the object's close method, but before close(2) returns.


Alan Somers
2023-06-23 20:11:34 UTC
This is a plain file on ZFS.


Alan Somers
2023-06-23 20:53:20 UTC
Can you write a self-contained example, and check the same issue e.g. on
tmpfs?
I just reproduced it on tmpfs. A minimal test case will take some more time...


Alan Somers
2023-06-24 15:29:01 UTC
I'm afraid that I haven't been successful in creating a minimal test
case. My original test case, while it reliably reproduces the
problem, is huge. I'm sorry, but I think I'm going to declare ENOTIME
and get back to the aforementioned refactoring.


Mark Murray
2023-06-28 17:05:30 UTC
Hi - have you tried using e.g. CReduce to get a testcase? I've used
No I haven't. I could try. But,
* Does creduce work on Rust, too?
* My failure is frequent but not 100% reliable. Will that prevent me from using creduce successfully?
I believe it does work on Rust, and other languages too.

The "not 100% reliable" issue will be problematic, but if you are prepared
to run for some time in order to force the failure, and if you are then prepared
to let CReduce chew on this for some time, you could be in luck!

M
--
Mark R V Murray
alan somers
2023-07-04 18:25:10 UTC
I've finally succeeded in writing a minimal test case. The critical
piece I was missing before was that other threads were forking in the
background. Even though the file is opened O_CLOEXEC, the child
process briefly keeps it locked. However, the file ought to get
unlocked whenever _either_ the parent calls close() or the child
calls fdcloseexec(). So I don't understand how it could fail to get
unlocked. I've posted the test case to Bugzilla. Let's move
discussion there.

https://bugs.freebsd.org/bugzilla/show_bug.cgi?id=272367

-Alan

