Extracting vertex subgraphs (based on property map)?

Hi,

Could you suggest what is the most efficient way for:

1) extracting vertex clusters based on values in vertex property map?
      - e.g., for iterating through all connected components in a
graph, based on result of label_components

  2) saving extracted subgraphs to a file?

Code that I am using is in [1] but for a large graph (~1M edges, 12k
clusters) it is taking days. Is there a more optimal way?

Can the function topology.mark_subgraph be of help here?
  - the name seems relevant but I do not know how it is used.

[1] https://gist.github.com/4627494 - function save_all_graphs is
called to save all the subgraphs

Many thanks,
Uldis

can you release the data / a part of it, let me see if i can optimize it. I
personally don't know of anyway besides property map iteration .-.

attachment.html (1.57 KB)

Hi,

Could you suggest what is the most efficient way for:

1) extracting vertex clusters based on values in vertex property map?
      - e.g., for iterating through all connected components in a
graph, based on result of label_components

You can improve the performance by avoiding repeated python function
calls for each vertex. One way to do this is to use array
expressions. For instance, instead of doing:

    gv1 = gt.GraphView(g, vfilt = lambda v: v_prop[v] == cl_num)

you could do:

    gv1 = gt.GraphView(g, vfilt = v_prop.a == cl_num)

The second option should be significantly faster for larger graphs.

  2) saving extracted subgraphs to a file?

Code that I am using is in [1] but for a large graph (~1M edges, 12k
clusters) it is taking days. Is there a more optimal way?

The problem with this approach is that since you have a number C of
clusters, and the creation / iteration of each filtered graph is O(N +
E), where N and E correspond to the size of the unfiltered graph, the
whole thing becomes O(C(N + E)), which can be very slow if C is in the
order of tens of thousands, as is your case. In this case the best
approach would be not to create filtered graphs, but rather, to create
your cluster graphs individually by hand, i.e.

     clusters = defaultdict(list)
     for v in g.vertices():
         clusters[v_prop[v]].append(v)

     for i, (c, vs) in enumerate(clusters.items()):
         cg = Graph()
         vmap = defaultdict(lambda: cg.add_vertex())
         for v in vs:
             u = vmap[v]
             for w in v.out_neighbours():
                 cg.add_edge(u, vmap[w])
         cg.save("cluster-%d.xml.gz" % i)

This approach is O(N + E), which would be much faster than the previous
one if C is sufficiently large, even though it is pure python.

Can the function topology.mark_subgraph be of help here?
  - the name seems relevant but I do not know how it is used.

This returns a property map which can be used to mask a subgraph, if it
is represented as a graph object. I does in essence the opposite of what
you want to achieve.

Cheers,
Tiago

Awesome! Thanks a lot, this solved the problem.

Where could I find more information about the v_prop.a == cl_num part?
It might be useful in other cases of working with graph-tool, but
first need to understand it. Is it a part of NumPy?

Cheers,
Uldis

Yes, v_prop.a returns a view of the property map as a Numpy array. Numpy
supports so-called "array expressions", which translates to simple C
loops. So when you do v_prop.a == cl_num, this returns an array of
boolean values, which is computed in a tight C loop. This array is used
as-is by the GraphView constructor, which uses it as a filter. Thus, no
python loop is ever performed. In the first version, however, a function
is passed to the constructor, which is called on each vertex to build a
filter, resulting in a much slower performance.

Cheers,
Tiago