Performance degrading drastically as more edges are processed

Hi,
I am computing a weight for each edge in a graph with a few million nodes and
about 50 million edges. Initially it was processing about 25,000 edges per
minute, so it should have finished in a bit over 30 hours. When I checked the
server after more than 72 hours, the process was still running, but very
slowly, at less than 1,000 edges per minute.
Is there anything I am doing wrong, or what could be the reason for such a
slowdown?

Here is the code:

import numpy as np

def compute_exclusivities(g):
    count = 0
    excl = g.new_edge_property("double")
    g.edge_properties["Exclusivity"] = excl
    # one row per edge: [source, target, label_id]
    edges = g.get_edges([g.edge_properties["label_id"]])
    for edge in edges:
        # out-edges of the source, with their label in column 2
        edges_of_head = g.get_out_edges(edge[0], [g.edge_properties["label_id"]])
        # count how many of them carry the same label as this edge
        count_edges_ofhead_sametype = len(np.where(np.where(edges_of_head == edge[2])[1] == 2)[0])
        # in-edges of the target, with their label in column 2
        edges_of_tail = g.get_in_edges(edge[1], [g.edge_properties["label_id"]])
        count_edges_oftail_sametype = len(np.where(np.where(edges_of_tail == edge[2])[1] == 2)[0])
        excl[g.edge(edge[0], edge[1])] = 1 / (count_edges_ofhead_sametype + count_edges_oftail_sametype - 1)
        count = count + 1
        if count % 1000 == 0:
            print(count, " exclusivities computed")

Thanks,
Ioana

It's hard to get to the bottom of things, since you are not even trying
to explain what you want to do. But looking very briefly at your code,
I can see the following problems:

   You should not do a property map lookup inside the loop; it's better
   to look it up once outside and keep it around:

      elabel = g.edge_properties["label_id"]

   Avoid looking up edge descriptors like this:

      g.edge(edge[0], edge[1])

   This takes time O(k), where k is the degree of an endpoint. Since you
   are using array-based iteration, you should instead request the
   property maps you need in the first call to g.get_edges(), as in the
   sketch below.
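
   Putting the two points together, a rough sketch (not a complete
   solution) might look like this:

      elabel = g.edge_properties["label_id"]   # hoisted out of the loop
      edges = g.get_edges([elabel])            # rows: [source, target, label_id]
      for source, target, label in edges:
          ...  # work on the array values directly, no g.edge(source, target)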

These are just some things that pop out, but it might be that there is
an altogether better way to achieve whatever it is you want to do.

Best,
Tiago

I am trying to compute a weight for each relation, which I call "Exclusivity".
Given that the graph has typed edges (with the edge type stored as an int in
the property map label_id), the Exclusivity of an edge is computed as
1 / (number of edges of the same type outgoing from the source node + number
of edges of the same type incoming to the target node - 1).
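
In other words, writing out_t(u) for the number of out-edges of u with type t
and in_t(v) for the number of in-edges of v with type t, an edge e = (u, v) of
type t gets

   Exclusivity(e) = 1 / (out_t(u) + in_t(v) - 1)

so if, say, u has 3 outgoing edges of type t and v has 4 incoming edges of
type t, then Exclusivity(e) = 1 / (3 + 4 - 1) = 1/6.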

I understand your suggestions, but regarding the second point, about not
looking up edge descriptors with g.edge(edge[0], edge[1]): how else can
that be achieved? As far as I understand, g.get_edges() will return the
value of the property map for each edge, but how can I set the property map
values?

Thanks a lot,
Ioana

Hi,
I am still fighting with the same problem:

Ioana K-Hulpus wrote

I am trying to compute a weight for each relation, which I call "Exclusivity".
Given that the graph has typed edges (with the edge type stored as an int in
the property map label_id), the Exclusivity of an edge is computed as
1 / (number of edges of the same type outgoing from the source node + number
of edges of the same type incoming to the target node - 1).

As advised by Tiago, I rewrote the code to:

1. not do property map lookup inside the loop

2. not look up edge descriptors with g.edge(source, target)

The new code:

import numpy as np

def compute_exclusivities(g):
    count = 0
    excl = g.new_edge_property("double")
    g.edge_properties["Exclusivity"] = excl
    lbl_id = g.edge_properties["label_id"]   # property map looked up once
    # one row per edge: [source, target, label_id, exclusivity]
    edges = g.get_edges([lbl_id, excl])
    for edge in edges:
        # out-edges of the source, with their label in column 2
        edges_of_head = g.get_out_edges(edge[0], [lbl_id])
        count_edges_of_head_same_type = len(np.where(np.where(edges_of_head == edge[2])[1] == 2)[0])
        # in-edges of the target, with their label in column 2
        edges_of_tail = g.get_in_edges(edge[1], [lbl_id])
        count_edges_of_tail_same_type = len(np.where(np.where(edges_of_tail == edge[2])[1] == 2)[0])
        # store the result in the last column of the array for now
        edge[3] = 1 / (count_edges_of_head_same_type + count_edges_of_tail_same_type - 1)
        count = count + 1
        if count % 1000 == 0:
            print(count, " exclusivities computed")
    # copy the computed column back into the property map in one go
    excl.get_array()[:] = edges[:, 3]

Now, I am not sure that excl.get_array() is the best solution. Is there any
other way to set the property map without providing an edge descriptor, or
an O(1) method for retrieving an edge descriptor?
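
The only alternative I could think of is to carry the edge index along and
write back through it; this is just a sketch (I am assuming here that
g.edge_index can be passed to get_edges() like any other edge property map):

edges = g.get_edges([lbl_id, g.edge_index])  # rows: [source, target, label_id, edge_index]
# ... fill a numpy array `values` with one exclusivity per row (hypothetical) ...
excl.a[edges[:, 3].astype(int)] = values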

In any case, the code has now been running for a week and has only processed
17 million of the 50 million edges; at the moment it handles about 2,000
edges per minute.

Can anyone help me find a faster solution?
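
To make the question concrete: would something along these lines be the right
direction? It is only an untested sketch; it replaces the per-edge
get_out_edges()/get_in_edges() calls with a single numpy pass that counts the
edges per (vertex, label) pair, and it again assumes that g.edge_index can be
handed to get_edges():

import numpy as np

def compute_exclusivities_sketch(g):
    lbl_id = g.edge_properties["label_id"]
    excl = g.new_edge_property("double")
    g.edge_properties["Exclusivity"] = excl

    # one row per edge: [source, target, label_id, edge_index]
    edges = g.get_edges([lbl_id, g.edge_index])
    src = edges[:, 0].astype(np.int64)
    tgt = edges[:, 1].astype(np.int64)
    lbl = edges[:, 2].astype(np.int64)
    eidx = edges[:, 3].astype(np.int64)

    # give every (vertex, label) pair a single integer key
    n_lbl = int(lbl.max()) + 1
    out_key = src * n_lbl + lbl
    in_key = tgt * n_lbl + lbl

    # for each edge: how many edges share its (source, label) / (target, label)
    _, out_inv, out_cnt = np.unique(out_key, return_inverse=True, return_counts=True)
    _, in_inv, in_cnt = np.unique(in_key, return_inverse=True, return_counts=True)

    # same formula as before, written back in one go through the edge index
    excl.a[eidx] = 1.0 / (out_cnt[out_inv] + in_cnt[in_inv] - 1)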

In my graph processing pipeline this is just an intermediate stage: I will
have to use the computed exclusivities to compute transfer probabilities
for Personalised PageRank, and I cannot afford such long computing times.

Thanks a lot!

-Ioana