
Ticket #1495 (new defect)

Opened 15 years ago

Last modified 14 years ago

Race condition when migrating fibers to different threads

Reported by: ccutrer Assigned to: kris
Priority: major Milestone: 2.0
Component: Tango Version: 0.99.8 Sean
Keywords: Cc:

Description

Tested on the following:

  • dmd/windows/tango 0.99.7 - works
  • ldc/mac/tango trunk + patches in #1462 - asserts in Thread.d:1349
  • ldc/linux-amd64/tango trunk - asserts in Thread.d:1349
  • dmd/linux-amd64/tango trunk - segfaults

Attachments

fibercrash.d (1.7 kB) - added by ccutrer on 02/23/09 21:35:18.
fiberstaterace.patch (1.7 kB) - added by ccutrer on 04/15/09 20:33:47.

Change History

02/23/09 21:35:18 changed by ccutrer

  • attachment fibercrash.d added.

02/24/09 07:23:30 changed by kris

  • owner changed from kris to fawzi.

03/29/09 13:31:30 changed by larsivi

  • milestone changed from 0.99.8 to 0.99.9.

04/15/09 20:28:25 changed by ccutrer

  • version changed from 0.99.7 Dominik to 0.99.8 Sean.
  • summary changed from Crash on non-dmd-windows compilers when migrating fibers to different threads to Race condition when migrating fibers to different threads.

Ah... I've determined that it *doesn't* work on dmd/windows/tango 0.99.8. It still crashes, but it just silently brings down one thread, and the other thread just sits there waiting. Attaching patch to fix...

04/15/09 20:33:18 changed by ccutrer

Doh... copy paste error. The explanation is that there is a race condition when Fiber.state == State.HOLD, but the Thread still has a reference to the Fiber's context. When you call Fiber.yield(), m_state is set to HOLD prior to switchOut(), and set to EXEC after switchOut() returns (i.e. the fiber has been called again). However, the Thread holds a reference to the Fiber's context until we call popContext() in switchIn() (which is where we return to after calling switchOut()). In that window another thread can see HOLD and call the fiber while the first thread still references its context.

So, the solution is to not set the state inside the fiber itself, but to set it to EXEC in switchIn() prior to calling pushContext(), and, after the Fiber has switched itself back out, to set it to HOLD after calling popContext(). This guarantees that when m_state == HOLD, no Thread has a reference to the fiber. When m_state == EXEC, we're either really executing, or we're in the middle of manipulating Thread state to associate/disassociate the fiber with the thread.
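
The ordering argument above can be checked with a small model. This is only a sketch in Python, not Tango source and not the attached patch: the step lists are an assumed reading of the description (one resume followed by one yield), and the names `violations`, `ORIGINAL_ORDER`, `PROPOSED_ORDER` and `thread_has_context` are hypothetical. It replays each ordering and records every step after which the state is HOLD while a thread still references the fiber's context.

```python
from enum import Enum


class State(Enum):
    HOLD = "HOLD"
    EXEC = "EXEC"


# Each step is (description, field, new value). Both orderings are a model of
# one resume/yield round trip as described in the ticket, not real Tango code.
ORIGINAL_ORDER = [
    ("switchIn: pushContext", "thread_has_context", True),
    ("fiber: m_state = EXEC", "state", State.EXEC),
    ("yield: m_state = HOLD", "state", State.HOLD),
    ("switchIn: popContext", "thread_has_context", False),
]

PROPOSED_ORDER = [
    ("switchIn: m_state = EXEC", "state", State.EXEC),
    ("switchIn: pushContext", "thread_has_context", True),
    ("switchIn: popContext", "thread_has_context", False),
    ("switchIn: m_state = HOLD", "state", State.HOLD),
]


def violations(steps):
    """Return the steps after which state is HOLD but a thread still
    references the fiber's context (the invariant from the ticket)."""
    state = State.HOLD          # a suspended fiber starts out in HOLD
    thread_has_context = False  # and no thread references its context
    broken = []
    for description, field, value in steps:
        if field == "state":
            state = value
        else:
            thread_has_context = value
        if state is State.HOLD and thread_has_context:
            broken.append(description)
    return broken


if __name__ == "__main__":
    print("original:", violations(ORIGINAL_ORDER))
    print("proposed:", violations(PROPOSED_ORDER))
```

Under these assumptions the original ordering breaks the invariant (including right after yield sets HOLD), while the proposed ordering never does. The model says nothing about memory visibility between threads; it only checks the order of the assignments.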

04/15/09 20:33:47 changed by ccutrer

  • attachment fiberstaterace.patch added.

04/15/09 20:59:38 changed by fawzi

Thanks, I will look into it. Thread switching is very "touchy", and I need to think it through calmly.

Somewhat unrelated: with my scheduler I have recently found a bug in the GC, something that can lock up, at least on Mac. suspendAll has a semaphore, and the call to it is not signal safe, so it may (seldom) lock when another thread is acquiring a lock and is then suspended. I am not yet fully sure about how to fix it.

04/15/09 21:11:07 changed by fawzi

I would say that what you say looks reasonable. Just a note: assuming that all assignments have been propagated when transferring a fiber, especially with NUMA, may not be fully justified...

04/16/09 20:25:24 changed by ccutrer

Hmm... interesting. I need to do some more research, but I'm running into an issue with a piece of code that allocates thousands of tiny fibers unexpectedly crashing. The crash appears to be in munmap, so my suspicion is that the GC is garbage collecting the fiber object before it has fully finished executing. Do you think this could be related to what you found above with the semaphore?

11/29/09 12:07:33 changed by fawzi

  • owner changed from fawzi to kris.

12/02/09 22:28:43 changed by kris

  • milestone changed from 0.99.9 to 2.0.