Discussion:
Strange hang when calling signal()
Gleb Popov
2018-08-24 14:53:28 UTC
I'm debugging a Qt test app that hangs when launching a QProcess.

The parent does the following:

QProcess p;
...
p.start();
p.waitForStarted(-1); // wait indefinitely

Under the hood starting the QProcess involves creating a pair of pipes and
forking:

qt_create_pipe(childStartedPipe);
...
pid_t childPid;
::forkfd(FFD_CLOEXEC, &childPid);

and waiting for it to be started is just ppoll()'ing on the pipe

pollfd pfd = qt_make_pollfd(childStartedPipe[0], POLLIN);
if (qt_poll_msecs(&pfd, 1, msecs) == 0) {
...

On the child side the code looks like

::signal(SIGPIPE, SIG_DFL);
...
qt_safe_close(childStartedPipe[0]);
...
qt_safe_execv(argv[0], argv);
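
For readers unfamiliar with this pattern, the handshake described above can be sketched in plain C. This is not Qt's code: spawn_and_wait_started() and started_pipe are hypothetical names, plain fork(2) and poll(2) stand in for forkfd() and qt_poll_msecs(), and reporting errno through the pipe is an assumption of the sketch.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/*
 * Hypothetical helper, not Qt code.
 * Returns 0 if the child reached a successful exec, the child's errno
 * if exec failed, and -1 if the pipe or fork could not be set up.
 */
int spawn_and_wait_started(char *const argv[])
{
    int started_pipe[2];
    int child_errno = 0;
    int status = 0;
    ssize_t n;
    pid_t pid;
    struct pollfd pfd;

    assert(argv != NULL && argv[0] != NULL);

    if (pipe(started_pipe) != 0)
        return -1;
    /* The write end closes by itself on a successful exec, so the
     * parent then sees end-of-file on the read end. */
    if (fcntl(started_pipe[1], F_SETFD, FD_CLOEXEC) != 0) {
        close(started_pipe[0]);
        close(started_pipe[1]);
        return -1;
    }

    pid = fork();
    if (pid < 0) {
        close(started_pipe[0]);
        close(started_pipe[1]);
        return -1;
    }
    if (pid == 0) {
        /* Child: same order of steps as in the excerpt above. */
        signal(SIGPIPE, SIG_DFL);
        close(started_pipe[0]);
        execv(argv[0], argv);
        /* Only reached if exec failed: report errno to the parent. */
        child_errno = errno;
        n = write(started_pipe[1], &child_errno, sizeof(child_errno));
        (void)n;
        _exit(127);
    }

    /* Parent: wait until the pipe becomes readable or is closed. */
    close(started_pipe[1]);
    pfd.fd = started_pipe[0];
    pfd.events = POLLIN;
    pfd.revents = 0;
    while (poll(&pfd, 1, -1) < 0 && errno == EINTR)
        continue;
    n = read(started_pipe[0], &child_errno, sizeof(child_errno));
    close(started_pipe[0]);

    /* Reap the child to keep the sketch tidy; this blocks until it exits. */
    while (waitpid(pid, &status, 0) < 0 && errno == EINTR)
        continue;

    if (n == 0)
        return 0;
    if (n == (ssize_t)sizeof(child_errno))
        return child_errno;
    return -1;
}
```

If the child blocks before reaching execv(), the write end of the pipe stays open and a poll() with an infinite timeout never returns, which matches the parent-side hang described here.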

So the problem is that after forking, the parent process hangs in the poll
and the child process hangs inside the signal() call. Here is the child's
backtrace:

#0 _umtx_op_err () at /usr/src/lib/libthr/arch/amd64/amd64/_umtx_op_err.S:37
#1 0x0000000802bd9571 in __thr_rwlock_rdlock (rwlock=0x802bf3200, flags=<optimized out>, tsp=<optimized out>) at /usr/src/lib/libthr/thread/thr_umtx.c:307
#2 0x0000000802be24c0 in _thr_rwlock_rdlock (flags=0, tsp=0x0, rwlock=<optimized out>) at /usr/src/lib/libthr/thread/thr_umtx.h:232
#3 _thr_rtld_rlock_acquire (lock=0x802bf3200) at /usr/src/lib/libthr/thread/thr_rtld.c:125
#4 0x000000080024e63b in rlock_acquire (lock=0x80025f0a0 <rtld_locks>, lockstate=0x7fffffffc678) at /usr/src/libexec/rtld-elf/rtld_lock.c:208
#5 0x00000008002472dd in _rtld_bind (obj=0x80027b000, reloff=4416) at /usr/src/libexec/rtld-elf/rtld.c:788
#6 0x000000080024404d in _rtld_bind_start () at /usr/src/libexec/rtld-elf/amd64/rtld_start.S:121
#7 0x0000000803d31a76 in QProcessPrivate::execChild (this=0x81a9716c0, workingDir=0x0, argv=0x81fde5760, envp=0x0) at io/qprocess_unix.cpp:537

Any idea what causes signal() to not return? I haven't extracted a minimal
repro yet, wanted to ask for any clues first.

The code in question is here:
https://github.com/qt/qtbase/blob/5.11/src/corelib/io/qprocess_unix.cpp
Relevant functions are QProcessPrivate::startProcess(),
QProcessPrivate::execChild(), QProcessPrivate::waitForStarted().

Thanks in advance.
Konstantin Belousov
2018-08-24 18:53:28 UTC
Post by Gleb Popov
Any idea what causes signal() to not return? I haven't extracted a minimal
repro yet, wanted to ask for any clues first.
The immediate, and not that useful, answer is that the child process is
trying to acquire the rtld bind lock (the call to signal() is being lazily
bound, see _rtld_bind in frame #5), and the lock is busy.

Some time ago there was a constant stream of bugs where a multithreaded
process forked and the child then tried to use services which require some
of the C runtime's internal locks. It did not help that POSIX allows most
of this breakage. Since the state of the parent process is usually not
determinate at the time of fork, another thread might have grabbed some of
those locks (and left internal structures inconsistent), and that state is
inherited by the child. Then there is nobody in the child to correct the
damage (restore consistency and unlock).
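
A minimal sketch of that failure mode, using an ordinary pthread mutex as a stand-in for a C runtime internal lock. The names demo_lock, holder_thread and child_sees_busy_lock are hypothetical, and the child calls pthread_mutex_trylock() only so that the sketch reports the stale lock instead of hanging; strictly speaking, POSIX only allows async-signal-safe calls in the child of a multithreaded process.

```c
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

static pthread_mutex_t demo_lock = PTHREAD_MUTEX_INITIALIZER;
static int locked_pipe[2];   /* holder -> main: "the lock is held" */
static int release_pipe[2];  /* main -> holder: "you may unlock now" */

/* Second thread: takes the lock and keeps it across the fork. */
static void *holder_thread(void *arg)
{
    char c = 'x';
    ssize_t n;
    int rc;

    (void)arg;
    rc = pthread_mutex_lock(&demo_lock);
    assert(rc == 0);
    (void)rc;
    n = write(locked_pipe[1], &c, 1);
    if (n == 1)
        n = read(release_pipe[0], &c, 1);
    (void)n;
    pthread_mutex_unlock(&demo_lock);
    return NULL;
}

/*
 * Returns 1 if the forked child found demo_lock busy, 0 if the child
 * could take it, and -1 on a setup error.
 */
int child_sees_busy_lock(void)
{
    pthread_t tid;
    char c = 'x';
    int status = 0;
    int result = -1;
    ssize_t n;
    pid_t pid;

    if (pipe(locked_pipe) != 0)
        return -1;
    if (pipe(release_pipe) != 0) {
        close(locked_pipe[0]);
        close(locked_pipe[1]);
        return -1;
    }
    if (pthread_create(&tid, NULL, holder_thread, NULL) != 0)
        return -1;

    /* Wait until the other thread really owns the lock, then fork. */
    n = read(locked_pipe[0], &c, 1);
    pid = (n == 1) ? fork() : -1;
    if (pid == 0) {
        /* Child: the holder thread does not exist here, but the
         * copied mutex still says "locked". Nobody will unlock it. */
        _exit(pthread_mutex_trylock(&demo_lock) == EBUSY ? 1 : 0);
    }

    /* Parent: let the holder finish, then collect the child's verdict. */
    n = write(release_pipe[1], &c, 1);
    (void)n;
    pthread_join(tid, NULL);
    close(locked_pipe[0]);
    close(locked_pipe[1]);
    close(release_pipe[0]);
    close(release_pipe[1]);

    if (pid > 0 && waitpid(pid, &status, 0) == pid && WIFEXITED(status))
        result = WEXITSTATUS(status);
    return result;
}
```

Had the child called pthread_mutex_lock() instead of the trylock variant, it would block forever, which is the same shape as the hang in the rtld bind lock.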

Since then, we started locking most of those locks in the parent around
fork(2); all the code is in lib/libthr/thread/thr_fork.c. In particular, we
lock rtld and malloc, and disable cancellation around fork. So if your
program used fork(2) but ended up with a broken rtld, that is worrying.
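
The same idea can be sketched at application level with pthread_atfork(3). This is not the thr_fork.c code, only an illustration under assumptions: guarded_lock, worker_thread and child_can_take_guarded_lock are hypothetical names, and the handlers simply take the lock before the fork and release it afterwards in both parent and child.

```c
#include <assert.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <time.h>
#include <unistd.h>

static pthread_mutex_t guarded_lock = PTHREAD_MUTEX_INITIALIZER;
static int held_pipe[2];  /* worker -> main: "the lock is held" */

/* Fork handlers: own the lock across fork, then drop it on both sides. */
static void before_fork(void)       { pthread_mutex_lock(&guarded_lock); }
static void after_fork_parent(void) { pthread_mutex_unlock(&guarded_lock); }
static void after_fork_child(void)  { pthread_mutex_unlock(&guarded_lock); }

/* Second thread: holds the lock for a short while, then releases it. */
static void *worker_thread(void *arg)
{
    struct timespec pause = { 0, 100 * 1000 * 1000 };  /* 100 ms */
    char c = 'x';
    ssize_t n;
    int rc;

    (void)arg;
    rc = pthread_mutex_lock(&guarded_lock);
    assert(rc == 0);
    (void)rc;
    n = write(held_pipe[1], &c, 1);
    (void)n;
    nanosleep(&pause, NULL);
    pthread_mutex_unlock(&guarded_lock);
    return NULL;
}

/*
 * Returns 1 if the forked child could take guarded_lock, 0 if the
 * child found it busy, and -1 on a setup error.
 */
int child_can_take_guarded_lock(void)
{
    static int handlers_installed;
    pthread_t tid;
    char c;
    int status = 0;
    int result = -1;
    ssize_t n;
    pid_t pid;

    if (!handlers_installed) {
        if (pthread_atfork(before_fork, after_fork_parent,
                           after_fork_child) != 0)
            return -1;
        handlers_installed = 1;
    }
    if (pipe(held_pipe) != 0)
        return -1;
    if (pthread_create(&tid, NULL, worker_thread, NULL) != 0)
        return -1;

    /* Fork while the worker still owns the lock: before_fork() makes
     * fork() wait until the worker has released it. */
    n = read(held_pipe[0], &c, 1);
    pid = (n == 1) ? fork() : -1;
    if (pid == 0)
        _exit(pthread_mutex_trylock(&guarded_lock) == 0 ? 1 : 0);

    pthread_join(tid, NULL);
    close(held_pipe[0]);
    close(held_pipe[1]);

    if (pid > 0 && waitpid(pid, &status, 0) == pid && WIFEXITED(status))
        result = WEXITSTATUS(status);
    return result;
}
```

Because the forking thread owns the lock at the moment of the fork, the child inherits it in a known state and can release it itself, so nothing is left locked by a thread that no longer exists.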

On the other hand, we do not do that for vfork(2). So yes, a minimal
reproduction case using only the bare libc/libthr API (i.e. without Qt)
would be the first step to diagnosing and maybe fixing this.
Eric van Gyzen
2018-08-24 19:10:36 UTC
Post by Konstantin Belousov
On the other hand, we do not do that for vfork(2).
We also do not do that for rfork(2). I think we should, but I have not
done any research into it.

Eric
Konstantin Belousov
2018-08-24 19:23:15 UTC
Post by Eric van Gyzen
We also do not do that for rfork(2). I think we should, but I have not
done any research into it.
I do not see how it would be correct to do that locking for vfork(2), because
the address space is shared between parent and child. Similarly, rfork(2)
sometimes leaves the address space shared too, and then we have the same
problem.

In essence, rfork(2) and vfork(2) do not mix with pthreads, and if you do mix
them, you should know well what you are doing.
Eric van Gyzen
2018-08-24 19:38:46 UTC
Post by Konstantin Belousov
I do not see how it would be correct to do that locking for vfork(2), because
the address space is shared between parent and child. Similarly, rfork(2)
sometimes leaves the address space shared too, and then we have the same
problem.
I wasn't suggesting that we do it for vfork(2). I was only suggesting
that we do it for rfork(2), and obviously only if !RFMEM (and certain
other flag conditions).

Eric
Gleb Popov
2018-10-03 12:18:12 UTC
I failed to extract a reproducing testcase from that test. The extracted
code works fine, so either I missed something, or some hidden conditions
must be fulfilled for the bug to appear.

Maybe I can try to dive into libc myself? Can you give me some guidance? Or
maybe you could just build KDevelop yourself? It is a matter of a handful of
commands.
Konstantin Belousov
2018-10-03 21:34:08 UTC
Post by Gleb Popov
I failed to extract a reproducing testcase from that test. The extracted
code works fine, so either I missed something, or some hidden conditions
must be fulfilled for the bug to appear.
Maybe I can try to dive into libc myself? Can you give me some guidance? Or
maybe you could just build KDevelop yourself? It is a matter of a handful of
commands.
Can you provide me with a minimal self-contained set of binaries and shared
libraries, and the instructions to reproduce the issue?
Konstantin Belousov
2018-10-04 08:21:37 UTC
Post by Gleb Popov
Post by Konstantin Belousov
Can you provide me with a minimal self-contained set of binaries and shared
libraries, and the instructions to reproduce the issue?
Pull in all needed dependencies
# pkg install kdevelop gmake cmake git
Get KDevelop sources
# git clone git://anongit.kde.org/kdevelop.git
Build the test executable
# mkdir kdevelop/build
# cd kdevelop/build
# cmake ..
# gmake -j4 test_qthelpplugin
Run the test executable and see that it hangs
# ./plugins/qthelp/tests/test_qthelpplugin testDefaultValue
This is exactly the opposite of 'minimal'. Provide me with a tarball which
has just the binary and the non-base shared libs needed to reproduce.

I am not going to install kde on my crash boxes.
Konstantin Belousov
2018-10-04 09:26:31 UTC
Post by Gleb Popov
Ok, can do that, but
# ldd ./plugins/qthelp/tests/test_qthelpplugin | wc -l
gives 286. Would that suit you?
As long as all the mess is extracted into a single directory and can be
removed with rm -rf, I am fine.
Gleb Popov
2018-10-04 18:32:41 UTC
Post by Konstantin Belousov
As long as all the mess is extracted into a single directory and can be
removed with rm -rf, I am fine.
I understand that you are reluctant to install KDE, but this "mess" can
also be removed with just

# pkg delete kdevelop
# pkg autoremove
# rm -rf ~/kdevelop

Anyway, I've copied all libraries from ldd output except base ones into a
separate dir, copied the test executable, set LD_LIBRARY_PATH to "."
aaaaaand the hang doesn't happen. Maybe I did something wrong?
