ksh coprocess tests are flakey #563

krader1961 · 2018-05-31T05:01:29Z

The ksh coprocess unit test has always been flakey as of mid 2017 when I started contributing to this project. But since the switch from AST Vmalloc to the system malloc the failure rate has increased dramatically. Specifically, it is now timing out roughly 25% of the time. Whether this is due to a use-after-free bug or a failure exposed only because the layout of malloc()'ed memory is different is TBD (to be determined). Simply increasing that unit test timeout value does not help (see PR #560).

The text was updated successfully, but these errors were encountered:

krader1961 · 2018-05-31T22:43:26Z

When the hang occurs there are four ksh processes running. One of them has an interesting stack (see below). It is stuck inside malloc() trying to acquire a lock. And frame #7 shows that this is being done inside a signal handler. And just before that signal we were inside a malloc() call. Oops!

#0  0xb77a2c31 in __kernel_vsyscall ()
#1  0xb7678b62 in __lll_lock_wait_private () at ../sysdeps/unix/sysv/linux/i386/lowlevellock.S:97
#2  0xb75f2f9f in _int_free (av=0xb7737780 <main_arena>, p=0x9f501b8, have_lock=0) at malloc.c:3962
#3  0x080fc9f0 in sfclose (f=0x9f501c0) at ../src/lib/libast/sfio/sfclose.c:150
#4  0x0809b403 in sh_close (fd=6) at ../src/cmd/ksh93/sh/io.c:448
#5  0x080a29e4 in job_reap (sig=17) at ../src/cmd/ksh93/sh/jobs.c:486
#6  0x080a2498 in job_waitsafe (sig=17, info=0xbfee080c, context=0xbfee088c) at ../src/cmd/ksh93/sh/jobs.c:325
#7  <signal handler called>
#8  __memset_ia32 () at ../sysdeps/i386/i686/multiarch/../memset.S:77
#9  0xb75f44b8 in alloc_perturb (n=<optimized out>, p=<optimized out>) at malloc.c:1880
#10 _int_malloc (av=av@entry=0xb7737780 <main_arena>, bytes=bytes@entry=32769) at malloc.c:3806
#11 0xb75f5fc5 in __GI___libc_malloc (bytes=32769) at malloc.c:2913
#12 0x0809ad4e in sh_iostream (shp=0x81e5280 <sh>, fd=14, fn=14) at ../src/cmd/ksh93/sh/io.c:313
#13 0x08062ace in sh_readline (shp=0x81e5280 <sh>, names=0x9f324b8, readfn=0x0, fd=14, flags=0, size=0, timeout=0) at ../src/cmd/ksh93/bltins/read.c:413
#14 0x08062911 in b_read (argc=2, argv=0x9f324b8, context=0x81e5608 <sh+904>) at ../src/cmd/ksh93/bltins/read.c:356
#15 0x080ec99c in sh_exec (shp=0x81e5280 <sh>, t=0x9f32468, flags=516) at ../src/cmd/ksh93/sh/xec.c:1233
#16 0x080ef69a in sh_exec (shp=0x81e5280 <sh>, t=0x9f47794, flags=516) at ../src/cmd/ksh93/sh/xec.c:1954
#17 0x080f04cb in sh_exec (shp=0x81e5280 <sh>, t=0x9f32388, flags=516) at ../src/cmd/ksh93/sh/xec.c:2150
#18 0x080ef69a in sh_exec (shp=0x81e5280 <sh>, t=0x9f4e1ac, flags=516) at ../src/cmd/ksh93/sh/xec.c:1954
#19 0x080efe67 in sh_exec (shp=0x81e5280 <sh>, t=0x9f315f8, flags=4) at ../src/cmd/ksh93/sh/xec.c:2084
#20 0x080b8bf1 in exfile (shp=0x81e5280 <sh>, iop=0x9f43638, fno=11) at ../src/cmd/ksh93/sh/main.c:536
#21 0x080b7dcb in sh_main (ac=3, av=0xbfee2b64, userinit=0x0) at ../src/cmd/ksh93/sh/main.c:335
#22 0x080dcfa4 in main (argc=3, argv=0xbfee2b64) at ../src/cmd/ksh93/sh/pmain.c:41

krader1961 · 2018-05-31T22:59:00Z

Prior to removing AST Vmalloc src/cmd/ksh93/include/jobs.h had the following block of code defining a vmbusy() function that is used in the function causing the deadlock to determine if it is safe to call job_reap(). If _std_malloc is true that function becomes #define vmbusy() 0; i.e., a no-op. So we have yet another example of a configurable option (use the system malloc) that should work but doesn't.

#if !_std_malloc
#   include <vmalloc.h>
#   ifdef vmlocked
#       define vmbusy() vmlocked(Vmregion)
#   else
#       if VMALLOC_VERSION >= 20130509L
#           define vmbusy()     (vmstat(Vmregion,0)!=0)
#       else
#           if VMALLOC_VERSION >= 20070911L
#               define vmbusy() (vmstat(0,0)!=0)
#           endif
#       endif
#   endif
#endif
#ifndef vmbusy
#   define vmbusy()     0
#endif

krader1961 · 2018-05-31T23:38:50Z

Note that we can't just do #define vmbusy() 1 (define it unconditionally true) because that keeps certain critical actions from occurring. So we have three options from easiest to hardest:

Modify job_waitsafe() to always assume reaping the job has to be deferred. It's not obvious if that would introduce other subtle failure modes due to two SIGCHLD signals being delivered before the first one is acted on.
Create wrappers for the libc malloc calls along these lines:

void *ast_malloc(size_t size) {
    vmbusy = true;
    void *p = malloc(size);
    vmbusy = false;
    return p;
}

Refactor the code to stop doing things in signal handlers that are not async safe.

krader1961 · 2018-05-31T23:51:29Z

Not surprisingly the simple solution, option #1 above, doesn't work. Deferring all job_reap() calls to after returning from the signal handler breaks a lot of unit tests. Which makes me wonder if the current mechanism of only deferring a small percentage is actually reliable.

DavidMorano · 2018-06-01T03:23:29Z

@krader1961

Refactor the code to stop doing things in signal handlers that are not async safe.

Refactoring the code to get unsafe stuff out of signal handlers was only ever the right thing to do. The C-language standard committee (ISO) says so. POSIX says so. CERT says so. And most other organizations which care about code correctness say so. The code should have been refactored years ago.

This whole discussion about the problem of putting non-async-safe code within a signal handler (talking about the KSH code base here) already happened -- over 10 years ago! Everyone can read the standards for themselves, but in the end, one either pretty much simply does the async-sig-safe things (defined by ISO or POSIX) or processes signals synchronously within the mainline code itself (many ways to do this). The code should -- finally -- be refactored.

krader1961 · 2018-06-01T03:36:13Z

The code should -- finally -- be refactored.

Patches are welcome. 😄

The only reason I'm doing any of this work is because:

a) This was the first UNIX shell I used three decades ago that didn't suck and it therefore is special to me.

b) I like trying to fix broken things.

DavidMorano · 2018-06-01T16:17:44Z

This was the first UNIX shell I used three decades ago that didn't suck and it therefore is special to me.
I like trying to fix broken things.
Thanks again for cleaning up the code as you have, and are continuing with. My early situation was similar: KSH was my first higher-feature shell.
Thanks.

krader1961 added the bug label May 31, 2018

kernigh mentioned this issue Jun 1, 2018

test failures in OpenBSD #483

Closed

krader1961 mentioned this issue Jun 1, 2018

Fix coprocess deadlock #565

Closed

krader1961 closed this as completed in 06e7fc4 Jun 1, 2018

krader1961 mentioned this issue Jun 20, 2018

What to do about all the places that ignore memory allocation failures #601

Closed

krader1961 self-assigned this Jul 28, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ksh coprocess tests are flakey #563

ksh coprocess tests are flakey #563

krader1961 commented May 31, 2018

krader1961 commented May 31, 2018 •

edited

Loading

krader1961 commented May 31, 2018

krader1961 commented May 31, 2018

krader1961 commented May 31, 2018

DavidMorano commented Jun 1, 2018

krader1961 commented Jun 1, 2018

DavidMorano commented Jun 1, 2018

ksh coprocess tests are flakey #563

ksh coprocess tests are flakey #563

Comments

krader1961 commented May 31, 2018

krader1961 commented May 31, 2018 • edited Loading

krader1961 commented May 31, 2018

krader1961 commented May 31, 2018

krader1961 commented May 31, 2018

DavidMorano commented Jun 1, 2018

krader1961 commented Jun 1, 2018

DavidMorano commented Jun 1, 2018

krader1961 commented May 31, 2018 •

edited

Loading