bfs_search performance on big graphs

Ilkka_Rajala · May 15, 2015, 7:11am

Hi,

I stumbled on a bit of a performance bottleneck when trying bfs_search
on a big graph. The problem is with initialize_vertex function, which
is called for every vertex of the graph. Even if the function itself
is empty, there's a bunch of work that's done for each vertex in
VisitorWrapper.__getattr__, before that empty function is even called.

Just doing:

g = Graph()
g.add_vertex(10000000)
%timeit bfs_search(g, 0, BFSVisitor())

1 loops, best of 3: 37.6 s per loop

Now I have a graph with 350M vertices and some 550M edges. It takes 21
minutes of initialize_vertex calls that I don't need, before doing a
few millisecond search :-/

Seems like my only option is to compile a custom version, with
initialize_vertex disabled from all search functions, or implement BFS
myself in python.

Hopefully I'm doing something terribly wrong?

What I'm trying to do, is find all vertices reachable from a given
root vertex. Graph is undirected and not fully connected. I thought
BFS would be a convenient way to do that, but maybe there's a better
way to get that information? I'm not very familiar with graphs

- Ilkka

tiago · May 15, 2015, 7:28am

I stumbled on a bit of a performance bottleneck when trying bfs_search
on a big graph. The problem is with initialize_vertex function, which
is called for every vertex of the graph. Even if the function itself
is empty, there's a bunch of work that's done for each vertex in
VisitorWrapper.__getattr__, before that empty function is even called.

Indeed. In bfs_search the visitor is a Python class, and hence can be
very slow.

What I'm trying to do, is find all vertices reachable from a given
root vertex. Graph is undirected and not fully connected. I thought
BFS would be a convenient way to do that, but maybe there's a better
way to get that information?

BFS is correct, but the bfs_search() function is intended to be used
when one wants to do something special with the visitor class to
implement some more elaborate algorithm. If you just want to obtain
reachability, you can use the shortest_distance() function. Any vertex
with a distance larger than N is not reachable...

For example:

dist = shortest_distance(g, source=g.vertex(10))
u = GraphView(g, vfilt=dist.a < g.num_vertices())

The graph view u contains only the vertices reachable by the source
vertex 10.

One can also use the label_out_component() function, which is
algorithmically equivalent (and maybe even slightly faster):

comp = label_out_component(g, g.vertex(10))
u = GraphView(g, vfilt=comp)

Both of these approaches should be much faster than bfs_search().
(Note that both these functions use BFS internally, but without
Python interference.)

Best,
Tiago

Ilkka_Rajala · May 15, 2015, 3:22pm

Thank you. Component labeling is exactly what I was trying to do, but
didn't know what to look for.

What would be nice, is parallel version of label_components(), but I
think I can live with this

- Ilkka