Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ksh coprocess tests are flakey #563

Closed
krader1961 opened this issue May 31, 2018 · 7 comments
Closed

ksh coprocess tests are flakey #563

krader1961 opened this issue May 31, 2018 · 7 comments
Assignees
Labels

Comments

@krader1961
Copy link
Contributor

The ksh coprocess unit test has always been flakey as of mid 2017 when I started contributing to this project. But since the switch from AST Vmalloc to the system malloc the failure rate has increased dramatically. Specifically, it is now timing out roughly 25% of the time. Whether this is due to a use-after-free bug or a failure exposed only because the layout of malloc()'ed memory is different is TBD (to be determined). Simply increasing that unit test timeout value does not help (see PR #560).

@krader1961 krader1961 added the bug label May 31, 2018
@krader1961
Copy link
Contributor Author

krader1961 commented May 31, 2018

When the hang occurs there are four ksh processes running. One of them has an interesting stack (see below). It is stuck inside malloc() trying to acquire a lock. And frame #7 shows that this is being done inside a signal handler. And just before that signal we were inside a malloc() call. Oops!

#0  0xb77a2c31 in __kernel_vsyscall ()
#1  0xb7678b62 in __lll_lock_wait_private () at ../sysdeps/unix/sysv/linux/i386/lowlevellock.S:97
#2  0xb75f2f9f in _int_free (av=0xb7737780 <main_arena>, p=0x9f501b8, have_lock=0) at malloc.c:3962
#3  0x080fc9f0 in sfclose (f=0x9f501c0) at ../src/lib/libast/sfio/sfclose.c:150
#4  0x0809b403 in sh_close (fd=6) at ../src/cmd/ksh93/sh/io.c:448
#5  0x080a29e4 in job_reap (sig=17) at ../src/cmd/ksh93/sh/jobs.c:486
#6  0x080a2498 in job_waitsafe (sig=17, info=0xbfee080c, context=0xbfee088c) at ../src/cmd/ksh93/sh/jobs.c:325
#7  <signal handler called>
#8  __memset_ia32 () at ../sysdeps/i386/i686/multiarch/../memset.S:77
#9  0xb75f44b8 in alloc_perturb (n=<optimized out>, p=<optimized out>) at malloc.c:1880
#10 _int_malloc (av=av@entry=0xb7737780 <main_arena>, bytes=bytes@entry=32769) at malloc.c:3806
#11 0xb75f5fc5 in __GI___libc_malloc (bytes=32769) at malloc.c:2913
#12 0x0809ad4e in sh_iostream (shp=0x81e5280 <sh>, fd=14, fn=14) at ../src/cmd/ksh93/sh/io.c:313
#13 0x08062ace in sh_readline (shp=0x81e5280 <sh>, names=0x9f324b8, readfn=0x0, fd=14, flags=0, size=0, timeout=0) at ../src/cmd/ksh93/bltins/read.c:413
#14 0x08062911 in b_read (argc=2, argv=0x9f324b8, context=0x81e5608 <sh+904>) at ../src/cmd/ksh93/bltins/read.c:356
#15 0x080ec99c in sh_exec (shp=0x81e5280 <sh>, t=0x9f32468, flags=516) at ../src/cmd/ksh93/sh/xec.c:1233
#16 0x080ef69a in sh_exec (shp=0x81e5280 <sh>, t=0x9f47794, flags=516) at ../src/cmd/ksh93/sh/xec.c:1954
#17 0x080f04cb in sh_exec (shp=0x81e5280 <sh>, t=0x9f32388, flags=516) at ../src/cmd/ksh93/sh/xec.c:2150
#18 0x080ef69a in sh_exec (shp=0x81e5280 <sh>, t=0x9f4e1ac, flags=516) at ../src/cmd/ksh93/sh/xec.c:1954
#19 0x080efe67 in sh_exec (shp=0x81e5280 <sh>, t=0x9f315f8, flags=4) at ../src/cmd/ksh93/sh/xec.c:2084
#20 0x080b8bf1 in exfile (shp=0x81e5280 <sh>, iop=0x9f43638, fno=11) at ../src/cmd/ksh93/sh/main.c:536
#21 0x080b7dcb in sh_main (ac=3, av=0xbfee2b64, userinit=0x0) at ../src/cmd/ksh93/sh/main.c:335
#22 0x080dcfa4 in main (argc=3, argv=0xbfee2b64) at ../src/cmd/ksh93/sh/pmain.c:41

@krader1961
Copy link
Contributor Author

Prior to removing AST Vmalloc src/cmd/ksh93/include/jobs.h had the following block of code defining a vmbusy() function that is used in the function causing the deadlock to determine if it is safe to call job_reap(). If _std_malloc is true that function becomes #define vmbusy() 0; i.e., a no-op. So we have yet another example of a configurable option (use the system malloc) that should work but doesn't.

#if !_std_malloc
#   include <vmalloc.h>
#   ifdef vmlocked
#       define vmbusy() vmlocked(Vmregion)
#   else
#       if VMALLOC_VERSION >= 20130509L
#           define vmbusy()     (vmstat(Vmregion,0)!=0)
#       else
#           if VMALLOC_VERSION >= 20070911L
#               define vmbusy() (vmstat(0,0)!=0)
#           endif
#       endif
#   endif
#endif
#ifndef vmbusy
#   define vmbusy()     0
#endif

@krader1961
Copy link
Contributor Author

Note that we can't just do #define vmbusy() 1 (define it unconditionally true) because that keeps certain critical actions from occurring. So we have three options from easiest to hardest:

  1. Modify job_waitsafe() to always assume reaping the job has to be deferred. It's not obvious if that would introduce other subtle failure modes due to two SIGCHLD signals being delivered before the first one is acted on.

  2. Create wrappers for the libc malloc calls along these lines:

void *ast_malloc(size_t size) {
    vmbusy = true;
    void *p = malloc(size);
    vmbusy = false;
    return p;
}
  1. Refactor the code to stop doing things in signal handlers that are not async safe.

@krader1961
Copy link
Contributor Author

Not surprisingly the simple solution, option #1 above, doesn't work. Deferring all job_reap() calls to after returning from the signal handler breaks a lot of unit tests. Which makes me wonder if the current mechanism of only deferring a small percentage is actually reliable.

@DavidMorano
Copy link

@krader1961

Refactor the code to stop doing things in signal handlers that are not async safe.

Refactoring the code to get unsafe stuff out of signal handlers was only ever the right thing to do. The C-language standard committee (ISO) says so. POSIX says so. CERT says so. And most other organizations which care about code correctness say so. The code should have been refactored years ago.

This whole discussion about the problem of putting non-async-safe code within a signal handler (talking about the KSH code base here) already happened -- over 10 years ago! Everyone can read the standards for themselves, but in the end, one either pretty much simply does the async-sig-safe things (defined by ISO or POSIX) or processes signals synchronously within the mainline code itself (many ways to do this). The code should -- finally -- be refactored.

@krader1961
Copy link
Contributor Author

The code should -- finally -- be refactored.

Patches are welcome. 😄

The only reason I'm doing any of this work is because:

a) This was the first UNIX shell I used three decades ago that didn't suck and it therefore is special to me.

b) I like trying to fix broken things.

@DavidMorano
Copy link

This was the first UNIX shell I used three decades ago that didn't suck and it therefore is special to me.
I like trying to fix broken things.
Thanks again for cleaning up the code as you have, and are continuing with. My early situation was similar: KSH was my first higher-feature shell.
Thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

2 participants