Multiprocessing.Process bug

Hello,

I am observing a rather strange bug involving Python's multiprocessing
library. I try to use (read-only) one graph instance from several
*multiprocessing.Process* instances. The graph is deserialized in the parent
process, and each child receives a reference to it. Each child then makes
simple, repetitive calls to *graph_tool.topology.shortest_distance*.
Everything works great: each child process runs as fast as it can. However,
when the main process executes the *hash_graphs* function shown below, every
child process hangs indefinitely. Note that *hash_graphs* is executed before
the children are started.

def hash_graphs(*args):
    """
    Provides an edge-based graph digest that can be used to invalidate an
    old cache.

    :type args: tuple of :class:`graph_tool.GraphView`
    :param args: the graphs to be hashed.

    :rtype: str
    :return: a hash digest of the input graphs.
    """
    graph_hash = hashlib.md5()
    for graph in args:
        graph_hash.update(gt.edge_endpoint_property(graph, graph.vp['id'],
                                                    "source").a.tobytes())
        graph_hash.update(gt.edge_endpoint_property(graph, graph.vp['id'],
                                                    "target").a.tobytes())
    return graph_hash.hexdigest()
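
To make the setup concrete, here is a condensed sketch of the pattern
described above (the file name and vertex indices are placeholders; the full
MWE is included further down in this thread):

import multiprocessing
import graph_tool as gt
import graph_tool.topology as gtt

class Worker(multiprocessing.Process):
    def __init__(self, graph, source):
        super(Worker, self).__init__()
        self.graph = graph    # reference inherited by the child via fork()
        self.source = source

    def run(self):
        while True:
            gtt.shortest_distance(self.graph,
                                  source=self.graph.vertex(self.source),
                                  weights=self.graph.ep['weight'])

if __name__ == '__main__':
    graph = gt.load_graph('graph.gt.gz')  # deserialized once, in the parent
    # hash_graphs(graph)                  # uncommenting this triggers the hang
    workers = [Worker(graph, v) for v in (0, 1, 2)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()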

I have packaged an MWE; it is available here:
https://drive.google.com/file/d/0B5GhhBKHOKOxVnpfYTBwNDZxODA/view?usp=sharing.
To run it, simply do:

tar xzf mwe.tar.gz

# run the buggy version
python3 -m mwe DO_HASH

# run as expected
python3 -m mwe

The buggy output looks like:

<MyProcess(MyProcess-1, started)> does shortest_distance from 452946 to 391015
<MyProcess(MyProcess-2, started)> does shortest_distance from 266188 to 207342
<MyProcess(MyProcess-3, started)> does shortest_distance from 514127 to 290838
<MyProcess(MyProcess-4, started)> does shortest_distance from 439705 to 87897
<MyProcess(MyProcess-5, started)> does shortest_distance from 223098 to 440593
<MyProcess(MyProcess-6, started)> does shortest_distance from 279880 to 368550
<MyProcess(MyProcess-7, started)> does shortest_distance from 108357 to 199593
<MyProcess(MyProcess-8, started)> does shortest_distance from 273888 to 275937

The expected output looks like:

<MyProcess(MyProcess-1, started)> does shortest_distance from 452946 to 391015
<MyProcess(MyProcess-2, started)> does shortest_distance from 266188 to 207342
<MyProcess(MyProcess-3, started)> does shortest_distance from 514127 to 290838
<MyProcess(MyProcess-5, started)> does shortest_distance from 223098 to 440593
<MyProcess(MyProcess-6, started)> does shortest_distance from 279880 to 368550
<MyProcess(MyProcess-1, started)> done.
<MyProcess(MyProcess-1, started)> does shortest_distance from 452946 to 391015
<MyProcess(MyProcess-2, started)> done.
<MyProcess(MyProcess-2, started)> does shortest_distance from 266188 to 207342
<MyProcess(MyProcess-4, started)> does shortest_distance from 439705 to 87897
<MyProcess(MyProcess-7, started)> does shortest_distance from 108357 to 199593
<MyProcess(MyProcess-3, started)> done.
<MyProcess(MyProcess-1, started)> done.
<MyProcess(MyProcess-3, started)> does shortest_distance from 514127 to 290838
<MyProcess(MyProcess-1, started)> does shortest_distance from 452946 to 391015
<MyProcess(MyProcess-8, started)> does shortest_distance from 273888 to 275937
<MyProcess(MyProcess-2, started)> done.
<MyProcess(MyProcess-2, started)> does shortest_distance from 266188 to 207342
<MyProcess(MyProcess-3, started)> done.
<MyProcess(MyProcess-3, started)> does shortest_distance from 514127 to 290838
<MyProcess(MyProcess-1, started)> done.
<MyProcess(MyProcess-1, started)> does shortest_distance from 452946 to 391015
<MyProcess(MyProcess-6, started)> done.
<MyProcess(MyProcess-6, started)> does shortest_distance from 279880 to 368550
<MyProcess(MyProcess-4, started)> done.
<MyProcess(MyProcess-4, started)> does shortest_distance from 439705 to 87897
<MyProcess(MyProcess-8, started)> done.
<MyProcess(MyProcess-8, started)> does shortest_distance from 273888 to 275937
<MyProcess(MyProcess-1, started)> done.
<MyProcess(MyProcess-1, started)> does shortest_distance from 452946 to 391015
<MyProcess(MyProcess-2, started)> done.
<MyProcess(MyProcess-2, started)> does shortest_distance from 266188 to 207342
<MyProcess(MyProcess-3, started)> done.
<MyProcess(MyProcess-3, started)> does shortest_distance from 514127 to 290838
<MyProcess(MyProcess-5, started)> done.
<MyProcess(MyProcess-5, started)> does shortest_distance from 223098 to 440593
<MyProcess(MyProcess-1, started)> done.
<MyProcess(MyProcess-1, started)> does shortest_distance from 452946 to 391015
<MyProcess(MyProcess-8, started)> done.
<MyProcess(MyProcess-8, started)> does shortest_distance from 273888 to 275937
<MyProcess(MyProcess-7, started)> done.
<MyProcess(MyProcess-7, started)> does shortest_distance from 108357 to 199593
<MyProcess(MyProcess-3, started)> done.
<MyProcess(MyProcess-3, started)> does shortest_distance from 514127 to 290838
...

How can I explain this behavior?

Best,
François.


For completeness, I include the MWE source here. Note that I fixed the
vertices so that the output is reproducible; however, selecting the vertices
at random leads to the same behavior.
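
For example, random (source, target) pairs could be drawn like this (a
sketch; numpy is assumed to be available, since graph-tool already depends
on it):

import numpy as np

def random_tests(graph, n_pairs, seed=None):
    """Draw n_pairs random (source, target) vertex index pairs from graph."""
    rng = np.random.RandomState(seed)
    n = graph.num_vertices()
    return [tuple(rng.randint(0, n, size=2)) for _ in range(n_pairs)]

# e.g. tests = random_tests(graph, 8)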

Best,
François.

import multiprocessing
import graph_tool as gt
import graph_tool.topology as gtt
import hashlib
import sys

class MyProcess(multiprocessing.Process):
    """
    A process that computes shortest paths and shortest distances in a
    graph-tool graph.
    """
    def __init__(self, graph, test):
        super(MyProcess, self).__init__()
        self.graph = graph
        self.test = test

    def run(self):
        while True:
            # The operation is repeated so that the bug is crystal clear.
            source, target = self.test
            source = self.graph.vertex(source)
            target = self.graph.vertex(target)

            # We start the work.
            print('{} does shortest_distance from {} to {}'.format(
                self, source, target))

            gtt.shortest_distance(self.graph,
                                  source=source,
                                  weights=self.graph.ep['weight'],
                                  max_dist=1400,
                                  pred_map=True)

            # We end the work.
            print('{} done.'.format(self))

def hash_graphs(*args):
    """
    Provides an edge-based graph digest that can be used to invalidate an
    old cache.

    :type args: tuple of :class:`graph_tool.GraphView`
    :param args: the graphs to be hashed.

    :rtype: str
    :return: a hash digest of the input graphs.
    """
    graph_hash = hashlib.md5()
    for graph in args:
        graph_hash.update(gt.edge_endpoint_property(graph, graph.vp['id'],
                                                    "source").a.tobytes())
        graph_hash.update(gt.edge_endpoint_property(graph, graph.vp['id'],
                                                    "target").a.tobytes())
    return graph_hash.hexdigest()

if __name__ == '__main__':

    # Deserialize the graph.
    graph = gt.load_graph('./mwe/graph.gt.gz')

    # Bug switch.
    if sys.argv[-1] == 'DO_HASH':
        graph_hash = hash_graphs(graph)

    # Repeatable inputs.
    tests = [(452946, 391015),
             (266188, 207342),
             (514127, 290838),
             (439705, 87897),
             (223098, 440593),
             (279880, 368550),
             (108357, 199593),
             (273888, 275937)]

    # Actual work.
    procs = [MyProcess(graph, tests[i]) for i in range(8)]

    for proc in procs:
        proc.start()

    for proc in procs:
        proc.join()


As a side note, the MWE I provide calls graph_tool.topology.shortest_distance
without specifying the target, but the logs refer to a target; this is
counterintuitive, sorry about that. However, the behavior is as described
above when one does specify the target to
graph_tool.topology.shortest_distance.
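
For reference, specifying the target inside the MWE's run() loop would look
like this (a sketch; the arguments are the same as above, with target added):

            # Same call as in the MWE, but with an explicit target vertex.
            dist, pred = gtt.shortest_distance(self.graph,
                                               source=source,
                                               target=target,
                                               weights=self.graph.ep['weight'],
                                               max_dist=1400,
                                               pred_map=True)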

François


I built the latest version *with debug enabled and OpenMP disabled*; the bug
is no longer present.

version: 2.19 (commit da041f33, Sat Nov 12 17:27:48 2016 +0100)
gcc version: 5.4.0
compilation flags: -I/usr/lib/python3/dist-packages/numpy/core/include
-I/usr/include -I/usr/include/python3.5m -fvisibility=default
-fvisibility-inlines-hidden -O3 -Wno-deprecated -Wno-unknown-pragmas
-ftemplate-depth-250 -ggdb3 -Wall -Wextra -ftemplate-backtrace-limit=0
-Wno-unknown-pragmas
install prefix: /usr/local
python dir: /usr/lib/python3/dist-packages
graph filtering: True
*openmp: False*
uname: Linux tungstene 4.4.0-45-generic #66-Ubuntu SMP Wed Oct 19 14:12:37
UTC 2016 x86_64

For the record, the version below has the bug:

version: *2.18* (commit 37996a0a, Thu Jul 21 17:56:14 2016 +0200)
gcc version: *5.3.1*
compilation flags: -Wdate-time -D_FORTIFY_SOURCE=2
-I/usr/include/python3.5m -I/usr/include
-I/usr/lib/python3/dist-packages/numpy/core/include -Wall -Wextra
-ftemplate-backtrace-limit=0 -flto=4 -ffunction-sections -fdata-sections
--std=gnu++14 -DNDEBUG -ftemplate-depth-250 -Wno-deprecated
-Wno-unknown-pragmas -O3 -fvisibility=default -fvisibility-inlines-hidden
-fopenmp -Wl,--gc-sections
install prefix: /usr
python dir: /usr/lib/python3/dist-packages
graph filtering: True
*openmp: True*
uname: Linux tungstene 4.4.0-45-generic #66-Ubuntu SMP Wed Oct 19 14:12:37
UTC 2016 x86_64

I'm building the HEAD version with OpenMP enabled to check whether it is
involved in the bug.

François.


I confirm that I reproduce the bug with the version below:

version: 2.19 (commit da041f33, Sat Nov 12 17:27:48 2016 +0100)
gcc version: 5.4.0
compilation flags: -I/usr/lib/python3/dist-packages/numpy/core/include
-I/usr/include -I/usr/include/python3.5m -fopenmp -fvisibility=default
-fvisibility-inlines-hidden -O3 -Wno-deprecated -Wno-unknown-pragmas
-ftemplate-depth-250 -DNDEBUG -Wall -Wextra -ftemplate-backtrace-limit=0
install prefix: /usr/local
python dir: /usr/lib/python3/dist-packages
graph filtering: True
*openmp: True*
uname: Linux tungstene 4.4.0-45-generic #66-Ubuntu SMP Wed Oct 19 14:12:37
UTC 2016 x86_64

François.


Disabling OpenMP at runtime with:

OMP_NUM_THREADS=1 python3 -m mwe DO_HASH

also works.
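
Equivalently, the OpenMP thread count can be restricted from inside the
script before the children are started (a sketch; it assumes your graph-tool
build exposes openmp_enabled/openmp_set_num_threads, which recent versions
do):

import graph_tool as gt

if gt.openmp_enabled():
    # Force graph-tool's internal OpenMP pool down to a single thread so
    # that fork()ed children do not inherit a multithreaded state.
    gt.openmp_set_num_threads(1)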
F.


I don't think this has anything to do with graph-tool. Mixing
multiprocessing (i.e. fork()) with openmp (i.e. threads) is a bad idea, and
in general does not work, since it leaves mutexes in an inconsistent state.

From Python's multiprocessing documentation:

    The parent process uses os.fork() to fork the Python interpreter. The
    child process, when it begins, is effectively identical to the parent
    process. All resources of the parent are inherited by the child
    process. Note that safely forking a multithreaded process is
    problematic.

You might have some luck with the other spawning modes.
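
For example, a minimal sketch using the 'spawn' start method instead of the
default fork (it assumes the graph object pickles cleanly, which graph-tool
graphs normally do; start-up is slower because the graph is serialized to
each child, but no threaded state is inherited):

import multiprocessing as mp
import graph_tool as gt
import graph_tool.topology as gtt

def worker(graph, source):
    # Under 'spawn' the child starts a fresh interpreter; the graph is
    # pickled and rebuilt here instead of being inherited through fork().
    gtt.shortest_distance(graph,
                          source=graph.vertex(source),
                          weights=graph.ep['weight'],
                          max_dist=1400)

if __name__ == '__main__':
    ctx = mp.get_context('spawn')
    graph = gt.load_graph('./mwe/graph.gt.gz')
    procs = [ctx.Process(target=worker, args=(graph, s))
             for s in (452946, 266188)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()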

Best,
Tiago