Ticket #1920 (new defect)

Opened 14 years ago

Last modified 14 years ago

Process.toNullEndedArray blocks indefinitely?

Reported by: MichaelZ Assigned to: community
Priority: major Milestone: 1.0
Component: Tango Version: 0.99.9 Kai
Keywords: Cc:

Description

This is a duplicate of my post to http://www.dsource.org/projects/tango/forums/topic/872 - no response there in ~6 weeks though :/

Hi,

In my application, the following code never returned from the call to Process.execute:

  Process process = new Process(true, command);

  scope (exit)
  {
    process.close();
  }

  process.redirect(Redirect.All | Redirect.ErrorToOutput);
  process.execute;

  ....

Looking into the (otherwise still running) processes with gdb I can see that the initiating thread of the "parent" Process is waiting on a read():

#0  0xffffe430 in __kernel_vsyscall ()
#1  0xb734abbb in read () from /lib/libpthread.so.0
#2  0x080a8587 in _D5tango2io6device6Device6Device4readMFAvZk ()
#3  0x080c2ad9 in _D5tango3sys7Process7Process7executeMFZC5tango3sys7Process7Process ()
...

(Presumably this is the pexec.source.input.read on line 1235 of Process.d). So far so standard - I gather it's waiting for some kind of response from the child.

The child process, however, does not look healthy: the exec has not yet happened, and its backtrace looks like this:

#0  0xffffe430 in __kernel_vsyscall ()
#1  0xb734a7b9 in __lll_lock_wait () from /lib/libpthread.so.0
#2  0xb7345ce0 in _L_lock_286 () from /lib/libpthread.so.0
#3  0xb7345705 in pthread_mutex_lock () from /lib/libpthread.so.0
#4  0x080c44f9 in _d_monitor_lock ()
#5  0x080990e0 in _d_monitorenter ()
#6  0x0809e96c in _D2rt2gc5basic3gcx2GC6mallocMFkkZPv ()
#7  0x0809e3e7 in gc_malloc ()
#8  0x0809cc6a in _d_newarrayT ()
#9  0x080c32d4 in _D5tango3sys7Process7Process16toNullEndedArrayFAAaZAPa ()
#10 0x080c2c62 in _D5tango3sys7Process7Process7executeMFZC5tango3sys7Process7Process ()
...

(Presumably this is the char*[] dest = new char*[src.length + 1]; on line 1869 of Process.d.)

... and it's been like this for about 2 hours.

It looks to me like the child process is trying to acquire some memory management mutex which was locked by some (other) parent thread when the fork happened, and thus was copied (not shared) into the child in a locked state? Thus, when the thread of the parent unlocked the mutex again, the child never knew? I'm afraid I'm a little rusty on what exactly is copied and what is shared over a fork call, but can't see how a mutex could be sensibly shared.

Am I on the wrong track completely, or is there really a problem here?

Note: this happens "rarely"; in the forum post I stated it occurred once. Since then we've seen it much more often, but it's still a long way from being reliably reproducible.
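The suspicion above (a mutex copied into the child in a locked state, with no thread left to unlock it) can be sketched in isolation. The following is a minimal illustration in C, since fork and the pthread mutex seen in the backtrace are POSIX C interfaces; it is not Tango's code. The names gc_lock, holder and child_sees_locked_mutex are hypothetical, gc_lock merely stands in for the GC's allocation lock, and the pipes exist only to make the timing deterministic. Note that calling pthread_mutex_trylock in the child is itself outside what POSIX guarantees after fork in a multi-threaded process; it is used here purely so that the demonstration reports the state instead of hanging.

```c
#include <errno.h>
#include <pthread.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the GC's allocation lock. */
static pthread_mutex_t gc_lock = PTHREAD_MUTEX_INITIALIZER;
static int ready_pipe[2];   /* holder -> main: "I own the lock now" */
static int release_pipe[2]; /* main -> holder: "you may unlock"     */

/* Second parent thread: takes the lock and keeps it across the fork. */
static void *holder(void *arg)
{
    char byte = 'x';
    (void)arg;
    pthread_mutex_lock(&gc_lock);
    if (write(ready_pipe[1], &byte, 1) == 1)
        (void)!read(release_pipe[0], &byte, 1); /* block until told to go on */
    pthread_mutex_unlock(&gc_lock);
    return NULL;
}

/* Returns 1 if the child found the lock held, 0 if it found it free,
 * and -1 if the experiment could not be set up. */
int child_sees_locked_mutex(void)
{
    pthread_t tid;
    char byte = 'x';
    int status = 0;
    int result = -1;
    pid_t pid;

    if (pipe(ready_pipe) != 0)
        return -1;
    if (pipe(release_pipe) != 0)
        return -1;
    if (pthread_create(&tid, NULL, holder, NULL) != 0)
        return -1;

    /* Wait until the other thread really owns the lock, then fork. */
    if (read(ready_pipe[0], &byte, 1) == 1) {
        pid = fork();
        if (pid == 0) {
            /* Only the forking thread exists in the child. The holder
             * thread was not copied, so nothing here can ever unlock
             * gc_lock; a plain pthread_mutex_lock would wait forever. */
            _exit(pthread_mutex_trylock(&gc_lock) == EBUSY ? 1 : 0);
        }
        if (pid > 0 && waitpid(pid, &status, 0) == pid && WIFEXITED(status))
            result = WEXITSTATUS(status);
    }

    /* Let the holder thread finish and clean up. */
    (void)!write(release_pipe[1], &byte, 1);
    pthread_join(tid, NULL);
    close(ready_pipe[0]);
    close(ready_pipe[1]);
    close(release_pipe[0]);
    close(release_pipe[1]);
    return result;
}
```

Calling child_sees_locked_mutex() from a small main (built with -pthread) should, on a typical Linux/glibc system, report that the child saw the lock as held even though the parent's holder thread releases it shortly afterwards. That matches the backtrace above, where the child sits in pthread_mutex_lock underneath gc_malloc.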

Change History

05/22/10 19:54:23 changed by kris

Sorry you didn't get a response to your other ticket; it's perhaps because nobody can currently see what's wrong?

05/27/10 17:26:58 changed by mwarning

Looks like a duplicate of #1906.

06/09/10 11:00:35 changed by MichaelZ

mwarning: Possibly, although (as I understand #1906) the exec'd process did actually run there (".... The subprocess starts correctly, produces output ..."), whereas in my case it gets stuck before the exec. Since initially reporting the problem, I've seen it occur many more times, and it always appeared to be exactly the same as above, i.e. before the exec.

06/09/10 11:32:10 changed by larsivi

Michael, did you try running your app via strace? It is noisy, but may tell you exactly what it is waiting for.

06/10/10 09:25:33 changed by MichaelZ

larsivi: I haven't tried that yet (it remains a very sporadic problem with no "obvious" way to reproduce it). Running the whole application via strace does not seem likely to be fruitful, since most of the (parent) application keeps going and the interesting output will presumably get lost in the "fog" of real output. However, I'll try attaching to the "stuck" child with strace if I see the problem again.

Last night I had a chance to follow up the suspicion I described above, and the Open Group specification for fork states: "the new process shall contain a replica of the calling thread and its entire address space, possibly including the states of mutexes and other resources. Consequently, to avoid errors, the child process may only execute async-signal-safe operations until such time as one of the exec functions is called" (from http://www.opengroup.org/onlinepubs/009695399/functions/fork.html; see also the 'Rationale' on the same page, where further comment on mutexes is made).

So... it seems that my suspicion was on the right track, and Tango shouldn't do memory allocation in the child process (or, alternatively, should allocate without taking the relevant mutexes, but I doubt that's viable). However, it's anything but clear to me how this could be avoided. The toNullEndedArray calls could presumably be moved to before the fork call, but that's almost certainly not enough; quite a bit of code before the actual exec looks like it could allocate implicitly: toStringz, for instance, or the various 'path ~=' lines in the execvpe implementation. For that matter, the same goes for any of the functions called: although perr.source.close() (for example) is apparently async-signal-safe today, can we assume it will be in the future?
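As a rough sketch of the "do all allocation before the fork" idea, here is the general pattern in C. It is an illustration under stated assumptions, not a patch for Process.d: to_null_ended_array and run_and_wait are hypothetical names, and execv with an absolute path is used so that no path search (and hence no path concatenation) is needed in the child. After fork, the child calls only execv and _exit, both of which POSIX lists as async-signal-safe.

```c
#include <stdlib.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical counterpart of toNullEndedArray. It allocates, so it
 * must only ever be called in the parent, before fork. */
static char **to_null_ended_array(const char *const *src, size_t count)
{
    char **dest = calloc(count + 1, sizeof *dest);
    size_t i;

    if (dest == NULL)
        return NULL;
    for (i = 0; i < count; i++)
        dest[i] = (char *)src[i]; /* execv does not modify the strings */
    return dest;                  /* calloc left dest[count] == NULL */
}

/* Runs args[0] (an absolute path) with the given arguments and returns
 * its exit status, or -1 if it could not be started or did not exit
 * normally. */
int run_and_wait(const char *const *args, size_t count)
{
    char **argv;
    int status = 0;
    int result = -1;
    pid_t pid;

    if (count == 0)
        return -1;

    /* Every allocation happens here, while all threads still exist. */
    argv = to_null_ended_array(args, count);
    if (argv == NULL)
        return -1;

    pid = fork();
    if (pid == 0) {
        /* Child: no malloc, no locks, nothing but exec and _exit. */
        execv(argv[0], argv);
        _exit(127); /* only reached if execv failed */
    }
    if (pid > 0 && waitpid(pid, &status, 0) == pid && WIFEXITED(status))
        result = WEXITSTATUS(status);

    free(argv);
    return result;
}
```

For example, calling run_and_wait with the arguments "/bin/sh", "-c", "exit 7" should return the shell's exit status. The same discipline would have to cover everything else between fork and exec, including the pipe and redirect handling, which is where the real work would lie.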

It looks to me like changing this would be a fairly substantial bit of work :-/ We've currently got a more-or-less adequate workaround by externally monitoring and, when necessary, restarting the application. However, that's pretty hackish, and surely not viable in all cases.

06/10/10 11:49:57 changed by larsivi

I see. I really appreciate the investigation - unfortunately I think you're probably currently closest to actually fixing (or improving on) the issue. Indeed, removing the need for memory allocations if possible is always something we look into in any case - using the stack instead if possible, for example.