Speeding up graph copy?

Zouhair · January 22, 2020, 5:54am

How are you doing steps 3.a and 3.e exactly?

Are you calling g.add_vertex() and g.remove_vertex() repeatedly?

Yes I am. I assumed that's not the best thing but compared to copying the
graph, adding then removing the vertices was performing better.

I investigated a different approach whereas I allow additional vertices to
exist as long as I keep track of the original number of vertices (n_orig):
For a set of candidate vertices of size m
- if graph size is not large enough, add new vertices
- create a graph view that only keeps (n_orig + m)
- add edges as needed (this relies on vertex properties)
- do graph computation
- when done, remove edges (but not vertices)

When finished with all candidate sets, remove the extra vertices.

Now I am only spending ~25% of the time in resizing the graph (around 20%
in remove_edges, despite having set_fast_edge_removal(True)). That's still
more than I'd like, but it's an improvement from the 80% I had with adding
/ removing vertices. If you have other suggestions on how to tackle this
that'd be great (for example, I tried without remove_edges, but the edges
that get added to the graphview end up showing up in the original graph. Is
there a way to avoid that or to filter such that only the original edges
are kept in the graphview?)

as a minimal working example:

import graph_tool.all as gt
import numpy as np

g = gt.Graph(directed=False)
g.add_vertex(10)
original_edges = [[0, 1], [0, 4]]
g.add_edge_list(original_edges)

g.set_fast_edge_removal(True)
n_orig = g.num_vertices()

def foo(g, m, skipEdgeRemoval = False):
    n_act = g.num_vertices()
    n_extra = n_act - (n_orig+m)

    if n_extra < 0:
        print(f'adding {-n_extra} vertices ...')
        g.add_vertex(-n_extra)
        n_extra = 0
        n_act = g.num_vertices()

    filt = np.ones(n_act)
    if n_extra > 0:
        filt[(-n_extra):] = 0

    gv = gt.GraphView(g, filt)
    #print(f'filt = {filt}')

    new_vtx_ids = np.arange(n, n+m)
    print(f'indices of new vertices = {new_vtx_ids}')

    #Check that all edges are expected
    for e in list(gv.edges()):
        if [e.source(), e.target()] not in original_edges:
            raise Exception(f'{e.source()} -> {e.target()} was not in
original graph')

    #Add a bunch of random edges connecting to new candidates
    edge_list = []
    for s, t in zip(np.random.choice(range(n_orig), m),
                    np.random.choice(new_vtx_ids, m)):
        edge_list.append([s,t])
        #also updates some vertex properties, but that's easy/fast to undo,
not done in this MWE

    gv.add_edge_list(edge_list)

    print('Current edges:')
    for e in list(gv.edges()):
        print(f'{e.source()} -> {e.target()}')

    #computation on new graph goes here ...
    pass

    #print('Cleanup ...')
    #If cleanup is not done, this fails as the edges that were added to the
graphview
    #end up being added to the original graph as well :s
    if not skipEdgeRemoval:
        for e in edge_list:
            gv.remove_edge(gv.edge(e[0], e[1]))

foo(g, 1)
foo(g, 4)
foo(g, 3)
foo(g, 5)
#above is all good
foo(g, 3, True) #skip cleanup
foo(g, 4) #fails

attachment.html (7.74 KB)

Zouhair · January 22, 2020, 7:17am

Update:
I realized that GraphView has an efilt as well, so I've added an
edge_property to keep track of which edges were part of the original graph.

Oddly enough, this didn't improve things as it seems like the bottle neck
wasn't in edge removal (which is supposed to be O(1) with
set_fast_edge_removal(True)), but rather with constructing the edge object
using g.edge(src, tgt).
I tried using gv.add_edge() in a loop and saving the result (instead of
using add_edge_list). That made setting the edge properties at the end
fast, but adding the edges one by one ends up being as slow as
reconstructing the edge object.
What seems to have worked is to instead add another edge property map to
keep track of the new edges, and set that with add_edge_list, then copy it
over to the property map that gets used to the filtering.

With all of this in place this has pushed the graph "copying" down to 5% of
the routine, which seems reasonable for now. If someone has other
suggestions I'm all ears

Thanks again,
-z

Here's the MWE using efilt implementation:

import graph_tool.all as gt
import numpy as np
g = gt.Graph(directed=False)
g.add_vertex(10)
original_edges = [[0, 1], [0, 4]]
g.add_edge_list(original_edges)
g.edge_properties['orig'] = g.new_edge_property('bool', val=True)
g.edge_properties['oorig'] = g.new_edge_property('bool', val=True)

g.set_fast_edge_removal(True)
n_orig = g.num_vertices()

def foo(g, m):
    n_act = g.num_vertices()
    n_extra = n_act - (n_orig+m)

    if n_extra < 0:
        print(f'adding {-n_extra} vertices ...')
        g.add_vertex(-n_extra)
        n_extra = 0
        n_act = g.num_vertices()

    filt = np.ones(n_act)
    if n_extra > 0:
        filt[(-n_extra):] = 0

    gv = gt.GraphView(g, vfilt=filt, efilt=g.edge_properties['orig'])
    #print(f'filt = {filt}')

    new_vtx_ids = np.arange(n, n+m)
    print(f'indices of new vertices = {new_vtx_ids}')

    #Check that all edges are expected
    for e in list(gv.edges()):
        if [e.source(), e.target()] not in original_edges:
            raise Exception(f'{e.source()} -> {e.target()} was not in
original graph')

    #Add a bunch of random edges connecting to new candidates
    edge_list = []
    for s, t in zip(np.random.choice(range(n_orig), m),
                    np.random.choice(new_vtx_ids, m)):
        edge_list.append([s,t,False])

    #Can't set this directly to 'orig' since we are filtering on it, and if
it's false
    #the edge won't be added to this view!
    gv.add_edge_list(edge_list, eprops = [gv.edge_properties['oorig']])

    print('Current edges:')
    for e in list(gv.edges()):
        print(f'{e.source()} -> {e.target()}')

    #computation on new graph goes here ...
    pass

    g.edge_properties['orig'] = g.edge_properties['oorig']

foo(g, 1)
foo(g, 4)
foo(g, 3)
foo(g, 5)
foo(g, 3)
foo(g, 4)

attachment.html (11.6 KB)

tiago · January 22, 2020, 3:11pm

The point here is that this is sub-optimal and can be avoided. Both the
g.add_vertex() and g.remove_vertex() functions should not be called
repeatedly. Graph.add_vertex() can be called with an optional argument
specifying the number of vertices to be added, which happens much faster
than calling it multiple times. Likewise, Graph.remove_vertex() can be
called with a list of vertices that need to be removed, which is
performed in a much more efficient way, specially if the vertices
removed happen to be the ones with the largest indexes.

Zouhair · January 22, 2020, 4:32pm

Sorry I wasn't clear: I was calling add_vertex and remove_vertex with
arguments

attachment.html (1.49 KB)