[Zope-dev] [Warning] Zope/ZEO clients: subprocesses can lead to non-deterministic message loss

Tim Peters tim.peters at gmail.com
Tue Jun 29 00:31:31 EDT 2004


[Dieter Maurer]
>>> The problem occurred in a ZEO client which called "asyncore.poll"
>>> in the forked subprocess. This "poll" deterministically
>>> stole ZEO server invalidation messages from the parent.

[Tim Peters]
>> I'm sorry, but this is still too vague to guess what happened.

[Dieter Maurer]
> Even when I sometimes make errors, my responses usually contain
> all relevant information.

I agree, but for whatever reason I'm having a very hard time following
this message thread.

>> - Which operating system was in use?

> The ZEO client application mentioned above is almost independent
> of the operating system -- apart from the fact that it uses
> "fork" (and therefore requires the OS to support it).

The OS is important because the semantics of fork() depend on the OS.

> Therefore, I did not mention that the application was running
> on Linux 2.

OK, so, e.g., the Solaris fork() semantics play no role in the actual
damage you saw.

>> - Which thread package?

> The application mentioned above does not use any thread.
> Therefore, it is independent of the thread package.
> If it used threads, they would be "LinuxThreads" (but it does not).

You said the app was a ZEO client, and, if that's so, it uses multiple
threads whether or not your part of the app creates threads of its
own.  For example, a ZEO client creates a new thread just to connect
to a ZEO server.  If this is a ZEO client that never connects to a ZEO
server, then perhaps threads are wholly irrelevant.

> There is no mystery at all that the application lost ZEO server
> invalidation messages. It directly follows from the fork
> semantics with respect to file descriptors.

I can believe that's the truth, but I confess I still don't see how.

> The problem I saw for wider Zope/ZEO client usage came solely
> from reading the Linux "fork" manual page, which indicates
> (or at least can be interpreted as saying) that child and parent have
> the same threads. There was no concrete observation that messages are
> lost/duplicated in this scenario.

Good!  Thanks.

> Meanwhile, I checked that "fork" under Linux with LinuxThreads
> behaves with respect to threads as dictated by the POSIX
> standard: the forked process has a single thread and
> does not inherit other threads from its parent.
>
> I will soon check how our Solaris version of Python behaves.
> If this, too, has only one thread, I will apologize for
> the premature warning...

Solaris offers (or imposes <0.9 wink>) choices that don't exist on
other platforms.  One Solaris choice is whether you link Python with
native Solaris threads, or with the Sun POSIX pthreads library. 
Another choice is whether you call Solaris fork() or Solaris fork1()
(note that Python exposes fork1() on platforms that have it -- fork1()
clones only the calling thread).  The dangerous combination is
Solaris threads + Solaris fork().  The other 3 combinations are
harmless in this respect.  Note that even using Solaris threads, it
doesn't follow that places where Linux calls fork() under the covers
are also places Solaris calls fork() under the covers.  For example,
Solaris system() calls Solaris vfork() under the covers, which differs
from Solaris fork() in several key respects (and also differs from
Solaris fork1()).  The most relevant way vfork() differs from fork()
under Solaris is that vfork() only clones the calling thread.
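
A minimal sketch of preferring fork1() where the platform offers it,
falling back to plain fork() elsewhere.  The names
fork_calling_thread_only and spawn are made up for illustration, and
this assumes the child should exit as soon as its task is done:

```python
import os

# os.fork1 exists only on platforms that provide fork1() (e.g. Solaris);
# everywhere else fall back to the ordinary os.fork.
fork_calling_thread_only = getattr(os, "fork1", os.fork)


def spawn(task):
    """Run task() in a child process and return the child's pid."""
    pid = fork_calling_thread_only()
    if pid == 0:
        status = 1
        try:
            task()
            status = 0
        finally:
            # Never fall through into the parent's code path.
            os._exit(status)
    return pid


pid = spawn(lambda: None)
_, wait_status = os.waitpid(pid, 0)
exit_code = os.WEXITSTATUS(wait_status)
```

On a platform without fork1() this is just fork(), which is fine
wherever fork() already clones only the calling thread.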

>> - In the ZEO client that called fork(), did it call fork() directly, or
>> indirectly as the result of a system() or popen() call?  Or what?

> The ZEO client has the basic structure:
>
>    while 1:
>        work_to_do = get_work(...)
>        for work in work_to_do:
>            pid = fork()
>            if pid == 0:
>                do_work(work)
>                # will not return
>        sleep(...)
>
> "do_work" opens a new ZEO connection.
> "get_work" and "do_work" use "asyncore.poll" to synchronize with incoming
> messages from ZEO -- no "asyncore.mainloop" around.
>
> The "poll" in "do_work" has stolen ZEO invalidation messages
> destined for the parent such that "get_work" has read old state
> and returned work items already completed. That is the problem
> I saw.

Well, don't do that then <wink>.

> All this is easy to understand, (almost) platform independent
> and independent of the thread library.

I still wouldn't say it's easy to understand.  While the thread that
calls fork isn't running an asyncore loop, it must still be the case
that asyncore in the parent has a non-empty map -- yes?  If it had an
empty map, the child processes would start with a clean slate (map),
and so wouldn't pick up socket traffic meant for the parent.

If that's so, it looks like just clearing asyncore's map in the child
(before do_work()) would solve the (main) problem.
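
Here is a rough sketch of what I mean, not ZEO code.  It uses a plain
dict (sock_map) as a stand-in for asyncore's socket map and a
socketpair as a stand-in for the ZEO connection; poll_once and run are
made-up names.  In a real client the analogous step would presumably be
asyncore.socket_map.clear() in the child, assuming the parent's
connection is registered in the default map:

```python
import os
import select
import socket


def poll_once(sock_map, timeout=0.5):
    """Read one pending message from each readable socket in sock_map."""
    received = []
    if not sock_map:
        return received
    readable, _, _ = select.select(list(sock_map.values()), [], [], timeout)
    for sock in readable:
        received.append(sock.recv(1024))
    return received


def run(clear_map_in_child):
    """Fork once; return what the parent's poll sees after the child ran."""
    server_end, client_end = socket.socketpair()
    # Stand-in for the socket map that the parent's poll consults.
    sock_map = {client_end.fileno(): client_end}
    server_end.sendall(b"invalidate oid 42")

    pid = os.fork()
    if pid == 0:
        try:
            # The child's copy of the map still names the parent's socket.
            if clear_map_in_child:
                sock_map.clear()
            poll_once(sock_map)
        finally:
            os._exit(0)  # never fall through into the parent's code

    os.waitpid(pid, 0)  # let the child poll first
    messages = poll_once(sock_map)
    server_end.close()
    client_end.close()
    return messages


stolen = run(clear_map_in_child=False)  # child reads the message first
kept = run(clear_map_in_child=True)     # parent still gets the message
```

Without the clear(), the child's poll reads the message off the shared
descriptor and the parent's poll finds nothing; with it, the child
starts from an empty map and the message is still there for the parent.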

> *Iff* a thread library lets a forked child inherit all threads
> then the problem I announced in this "Warning" thread can
> occur, as it then behaves similarly to my application
> above (with an automatic rather than an explicit "poll").

I still don't want to rush to generalizations; as above, even on
Solaris with native Solaris threads and clone-everything Solaris
fork(), system() should be harmless regardless.  I don't know about
popen() on Solaris, though; etc etc.

> It may well be that there is no thread library that does this.
> In your words: all thread implementations may be "sane"
> with respect to thread inheritance...

At least Solaris fork() with Solaris native threads is not sane in
this respect.  Solaris fork1() with Solaris native threads is sane,
ditto any flavor of Solaris fork with Sun pthreads.  And sanity is
relative <wink>.

