Fast method for duplicated edge removal

Giuseppe_Profiti · September 20, 2011, 2:36pm

Hello,
I'm using a cvsreader to parse a very big file and create an undirected
graph accordingly.
The file can contain duplicated edges (i.e. A B in one row, B A in another
one), so I'm checking
if g.edge(v1,v2)==None:
e = g.add_edge(v1,v2)
in order to discard them (v1 and v2 are vertices created from what it's read
from the file).
However the graph contains a lot of edges (few millions) and vertices (many
thousands), with a potentially high degree for the vertices, and it takes a
lot to process the data.
As far as I read in the soruce code, the Graph.edge() method checks all the
outgoing edges of the source node, but even if I check which node has the
highest degree, it takes a lot of time to build the graph.

Is there any other way to remove the duplicated edges? Maybe an edge filter
based on some lambda wizardry?

Thanks in advance,
Giuseppe

attachment.html (1.06 KB)

tiago · September 20, 2011, 7:09pm

Hi Giuseppe,

Hello,
I'm using a cvsreader to parse a very big file and create an
undirected graph accordingly.
The file can contain duplicated edges (i.e. A B in one row, B A in
another one), so I'm checking
if g.edge(v1,v2)==None:
e = g.add_edge(v1,v2)

in order to discard them (v1 and v2 are vertices created from what
it's read from the file).
However the graph contains a lot of edges (few millions) and vertices
(many thousands), with a potentially high degree for the vertices, and
it takes a lot to process the data.
As far as I read in the soruce code, the Graph.edge() method checks
all the outgoing edges of the source node, but even if I check which
node has the highest degree, it takes a lot of time to build the
graph.

Is there any other way to remove the duplicated edges? Maybe an edge
filter based on some lambda wizardry?

The library includes a 'remove_parallel_edges()' function which does
what you want, and is fast.

If you want to mask the parallel edges temporarily, you can use
filtering, as such:

l = label_parallel_edges(g)
g = GraphView(g, efilt=lambda e: l[e] == 0)

Cheers,
Tiago

Giuseppe_Profiti · September 22, 2011, 9:25am

Hi Tiago,

2011/9/20 Tiago de Paula Peixoto <tiago(a)skewed.de>

Hi Giuseppe,

[cut]

The library includes a 'remove_parallel_edges()' function which does
what you want, and is fast.

thanks for the hint, postponing the check after loading the data speeded up
the process by 10 times,
however the remove_parallel_edges() does not work in parallel (at least on
my machine) and then it's a bit too slow for my purposes.
I'm trying a different approach in order to speed things up.

If you want to mask the parallel edges temporarily, you can use
filtering, as such:

l = label_parallel_edges(g)
g = GraphView(g, efilt=lambda e: l[e] == 0)

Has label_parallel_edges the same complexity of remove_parallel_edges?

Thanks again for your help,
Giuseppe

attachment.html (1.37 KB)