Parallel Processing

I want to use graph-tool with the multiprocessing library in Python 2.6, but I
keep running into issues when trying to share a full graph object. I can store
the graph in a multiprocessing.Namespace, but it doesn't keep the dicts of
properties. Example:

from graph_tool.all import Graph

def initNS(ns):
    """
        Build property maps for edges and vertices to hold our data.
        The graph vertices represent actors, whereas edges represent movies.
    """
    _g = Graph(directed=False)
    ns.graph = _g
    ns.edge_properties = {
        'genres': _g.new_edge_property("vector<string>"),
        'movieid': _g.new_edge_property("int"),
    }
    ns.vertex_properties = {
        'personid': _g.new_vertex_property("int32_t")
    }

    # Edges: register the maps as internal properties of the graph
    _g.edge_properties["genres"] = ns.edge_properties['genres']
    _g.edge_properties["movieid"] = ns.edge_properties['movieid']
    # Vertices
    _g.vertex_properties["personid"] = ns.vertex_properties['personid']

    # Store the graph again now that the internal properties are attached
    ns.graph = _g

##########
This initializes 'ns', which is a multiprocessing.Namespace. The
problem is that accessing, for example, ns.edge_properties[ * ] tells
me that the type isn't picklable. I tried to skip that and use
_g.edge_properties to access the maps instead, but those dicts aren't
carried over to the other processes in the pool, presumably because
they aren't picklable.

Any thoughts about how to fix this?

The problem is that property maps cannot be pickled independently, since
they need internal references to the graph object. However, the entire
graph can be pickled, together with its internal properties
(i.e. properties stored in g.vertex_properties and
g.edge_properties). So, instead of keeping the properties in
ns.edge_properties, for instance, you should keep them in
ns.graph.edge_properties.
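
As a minimal sketch of this advice, the snippet below builds a small graph
whose property maps are stored only as internal properties and passes it
through pickle, which is what multiprocessing does when it hands an object
to another process. The helper names build_graph and roundtrip and the
sample values are illustrative assumptions, not part of graph-tool, and
build_graph needs graph-tool installed.

```python
import pickle


def build_graph():
    """Build a small graph whose property maps are internal to the graph."""
    # Imported here so the rest of the module works without graph-tool.
    from graph_tool.all import Graph

    g = Graph(directed=False)
    # Internal properties live in the graph's own dicts.
    g.edge_properties["movieid"] = g.new_edge_property("int")
    g.vertex_properties["personid"] = g.new_vertex_property("int32_t")

    v1, v2 = g.add_vertex(), g.add_vertex()
    e = g.add_edge(v1, v2)
    g.vertex_properties["personid"][v1] = 1   # illustrative values
    g.edge_properties["movieid"][e] = 42
    return g


def roundtrip(obj):
    """Pickle and unpickle obj, as happens when it is sent to a worker."""
    return pickle.loads(pickle.dumps(obj, pickle.HIGHEST_PROTOCOL))
```

With graph-tool available, g2 = roundtrip(build_graph()) should give a copy
in which g2.edge_properties["movieid"] and g2.vertex_properties["personid"]
are still present, because they travel with the graph.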

(For those interested, I'm attempting to use the IMDbPy library to do
some graph analysis on the relationships among actors and movies. Each
process has its own db connection and tries to populate the graph
with actor and movie information in parallel, since it's a pretty large
and dense graph: somewhere in the neighborhood of 250,000+ vertices
for a depth of just three relationships.)

Note that graph-tool is not thread safe... So any access or modification
to the graph must be protected by a lock.

Cheers,
Tiago
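
One way to follow the locking advice above, sketched with the standard
threading module only. LockedGraph, add_edges and demo are hypothetical
names, and a plain list of (source, target) pairs stands in for a real
graph so the snippet runs without graph-tool.

```python
import threading


class LockedGraph:
    """Serialize every access to a shared graph-like object through one lock."""

    def __init__(self, graph):
        self._graph = graph
        self._lock = threading.Lock()

    def run(self, func, *args):
        """Call func(graph, *args) while holding the lock and return its result."""
        with self._lock:
            return func(self._graph, *args)


def add_edges(shared, edges):
    """Worker: add each (source, target) pair under the lock."""
    for source, target in edges:
        shared.run(lambda g, s, t: g.append((s, t)), source, target)


def demo(n_threads=4, n_edges=1000):
    # Stand-in for a real graph: a plain list of (source, target) pairs.
    shared = LockedGraph([])
    threads = [
        threading.Thread(
            target=add_edges,
            args=(shared, [(i, j) for j in range(n_edges)]),
        )
        for i in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared.run(len)
```

With a real graph the same wrapper would presumably be used as
shared.run(lambda g, s, t: g.add_edge(s, t), source, target), so that reads
and writes both go through the single lock.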

Tiago Peixoto wrote

Note that graph-tool is not thread safe... So any access or modification
to the graph must be protected by a lock.


Sorry for bringing this thread back from the dead, but I didn't want to start
a new one if it's a known thing.

Above you describe graph-tool as not thread-safe. What about the parallel
versions of the centralities, for example (the OpenMP versions)?

I ran some tests (I have to say I'm not sure if I'm missing something!) with
the modified BGL files, with and without the OpenMP pragmas. The results are
a bit different, both between the non-OpenMP and OpenMP builds and between
two runs of the OpenMP build.

Is this something known? Am I missing something?

Thanks

(I have included a CSV export of the values.) export_openmp_test.dat
<http://main-discussion-list-for-the-graph-tool-project.982480.n3.nabble.com/file/n4025209/export_openmp_test.dat>

Possible patch: "centrality" is shared between threads, so its update is
wrapped in a critical section. With this change my tests correlate perfectly
with the non-threaded version.

diff --git a/src/boost-workaround/boost/graph/betweenness_centrality.hpp b/src/boost-workaround/boost/graph/betweenness_centrality.hpp
index 32b56b77a0e5023a068e0b6bb0e9b4a156b17fb9..d4c3536d705b2465dbb020c3f15aef3c93e93ea6 100644
--- a/src/boost-workaround/boost/graph/betweenness_centrality.hpp
+++ b/src/boost-workaround/boost/graph/betweenness_centrality.hpp
@@ -364,7 +364,10 @@ namespace detail { namespace graph {
           }

           if (u != s) {
-            update_centrality(centrality, u, get(dependency, u));
+            #pragma omp critical(globalupdate)
+            {
+              update_centrality(centrality, u, get(dependency, u));
+            }
           }
       }
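
The idea behind the patch can be sketched as a Python analogy (this is not
the BGL code): several threads add their per-source contributions into one
shared centrality map, and only that update is guarded by a lock, much like
the omp critical block. The names accumulate, run and toy_dependency are
hypothetical, and the toy dependency values are made up for illustration.

```python
import threading


def accumulate(centrality, lock, sources, dependency_of):
    """Add the dependency of each source into the shared centrality map."""
    for s in sources:
        # The expensive per-source work runs outside the lock.
        dependency = dependency_of(s)
        for u, delta in dependency.items():
            if u != s:
                # Only the update of the shared map is serialized.
                with lock:
                    centrality[u] += delta


def run(n=50, n_threads=4):
    centrality = {u: 0.0 for u in range(n)}
    lock = threading.Lock()

    def toy_dependency(s):
        # Hypothetical stand-in: every vertex receives 1.0 from every source.
        return {u: 1.0 for u in range(n)}

    threads = [
        threading.Thread(
            target=accumulate,
            # Thread i handles sources i, i + n_threads, i + 2 * n_threads, ...
            args=(centrality, lock, range(i, n, n_threads), toy_dependency),
        )
        for i in range(n_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return centrality
```

Without the lock, the read-modify-write in centrality[u] += delta could
interleave between threads and lose updates, which is the kind of
run-to-run difference reported above.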

Yes, it seems like you have found a race condition. Your patch does not
completely solve it, but there is now a complete fix for this in the git
version. Please test it.

Cheers,
Tiago

I'll check the git version. But my fix surely did at least half the job; a
bit further down there is another update (divide by 2).
Thanks

Without a machine to check on, I just have a couple of comments.

The fix in git will probably cost a lot of speed. Based on my tests with
graphs big enough for the race to happen, I fixed it as follows:

You only need to wrap the "update_centrality()" calls (both vertex and
edge) and one function further down, "divide_by_two", near the
"if_undirected".

With these changes everything works fine, with no speed problem even on
big graphs.

Best
T

I don't think so... Making several entries into a critical omp region
will make things very slow. Furthermore most of the time is spent
finding the shortest paths, which is outside the critical region.

The "divide_by_two" operation at the end is outside the parallel region,
so it does not need to be protected.

Cheers,
Tiago