All Pairs Shortest Path with limitations

Hi all,

I have a question on how you would tackle the following:

I have a graph and want to calculate all possible shortest paths but exclude a few nodes as starting / stopping positions.

I can not filter out those vertices, since paths should be allowed to pass them - just not start or terminate there.

Calculating all possible shortest paths manually is computationally infeasible.

Thanks a lot!

Stephan


Hi Stephan,

If the number of start-end pairs is not high you could compute all
shortest paths and then remove the ones you are not interested in.

Best,
Aleks

Hi Aleks!

I thank you for your answer! I've already tried that, but have not been successful speed-wise.

A full APSP of a very small test graph already takes around 200 ms.

The fastest I could get as described below runs in about 3.6 s. :confused:

The main reason is that shortest_distance() has a "single source, all targets" mode but no "single target, all sources" mode (the reverse) - as soon as "source" is empty, it defaults to APSP, neglecting "target".

Hence I have to do this part manually using shortest_distance(G, u, v).*

My bottleneck is the APSP calculation.

Cheers,

Stephan

*Since the graph is directed, we have the following: Say V is the set of all vertices and B ⊂ V the subset I want to exclude as sources / targets from APSP. Then, by inclusion-exclusion, the sum over the remaining pairs is
sum over (V\B) x (V\B) = sum(V x V) - sum(V x B) - sum(B x V) + sum(B x B).

Or in graph-tool terms:
- (V x V): shortest_distance(G)

- (V x B): shortest_distance(G, v, b) for b in B for v in V (horrible run-time)

- (B x V): shortest_distance(G, b, _) for b in B

- (B x B): shortest_distance(G, u, v) for u,v in B x B


Of course it does. Just reverse the graph with:

      GraphView(g, reversed=True)
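
For example, a minimal sketch of the trick (assuming G is the directed graph, t the target vertex, and w the weight edge property from the thread):

    from graph_tool.all import GraphView, shortest_distance

    # "Single target, all sources" on G is "single source, all targets"
    # on the reversed view; edge properties such as w are shared with G.
    G_rev = GraphView(G, reversed=True)
    dist_to_t = shortest_distance(G_rev, source=t, weights=w)
    # dist_to_t[v] is now the length of the shortest path v -> t in the original G.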

:partying_face: I didn't know that existed!

Thanks a ton, I'm really looking forward to testing it out tomorrow!

Cheers,
Stephan

This works like a charm!

With it, I'm down to 228 ms ± 1.31 ms from 3.6 s ( 82.9 ms ± 2.45 ms for shortest_distance(G) ).

It's still very expensive, as it's about 2.75 times the cost of a full APSP, but it's approaching something reasonable. The costly remaining part is the manual B x B.*

I tried to speed it up with an approximation that restricts G to B (since I know that B is connected), but graph-tool segfaults right away (I wrote another ticket for that). :confused:

If anyone else has any ideas on how to speed this up, I'd be very grateful!

Cheers,
Stephan

- - - o - - -

*Since the graph is directed, we have the following: Say V is the set of all vertices and B ⊂ V the subset I want to exclude as sources / targets from APSP. As sets, (V\B) x (V\B) = (V x V) \ [ (V x B) ∪ (B x V) ]; for the distance sums this means, by inclusion-exclusion,
sum over (V\B) x (V\B) = sum(V x V) - sum(V x B) - sum(B x V) + sum(B x B).

Or in graph-tool terms:
- (V x V): shortest_distance(G)
- (B x V): shortest_distance(G, b, _) for b in B
- (V x B): shortest_distance(GraphView(G, reversed=True), b, _) for b in B
- (B x B): shortest_distance(G, u, v) for u,v in B x B
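
A sketch of how these four terms could be combined in code (illustrative only: it assumes G is the directed graph, w its weight edge property, B an iterable of the excluded vertex descriptors, and that the graph is strongly connected so no distance is infinite; the function name is made up):

    from graph_tool.all import GraphView, shortest_distance

    def restricted_distance_sum(G, w, B):
        """Sum of d(u, v) over all u, v not in B, by inclusion-exclusion."""
        N = G.num_vertices()

        # (V x V): full APSP; get_2d_array turns the vector-valued map into an N x N array
        total = shortest_distance(G, weights=w).get_2d_array(list(range(N))).sum()

        G_rev = GraphView(G, reversed=True)
        for b in B:
            # (B x V): distances from b to every vertex
            total -= shortest_distance(G, source=b, weights=w).a.sum()
            # (V x B): distances from every vertex to b, via the reversed view
            total -= shortest_distance(G_rev, source=b, weights=w).a.sum()

        # (B x B): subtracted twice above, so add it back once (the expensive part)
        for b in B:
            for c in B:
                total += shortest_distance(G, source=b, target=c, weights=w)

        return total

The mean over the remaining pairs is then total / (G.num_vertices() - len(B))**2, if the zero diagonal is meant to be included, as in a plain mean over the full distance matrix.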

Hi Tiago and the list,

My algorithm is still _infeasibly_ slow, and > 90 % of the run-time comes from shortest_path().

Do you have any idea on how this could be sped up?

I have the graph G, the set of all nodes V, and a subset B ⊂ V.

I want to calculate or approximate the mean shortest distance between all pairs in (V\B) x (V\B), but on the complete graph - or in other words, calculate the mean of the all-pairs shortest distances ( mean(shortest_distance()) ) while excluding n ∈ B from being source or target.

Sub-sampling would be okay as well.

I currently don't see a way to do this fast with graph-tool besides writing a custom function but that probably exceeds my capacities. Am I missing something? Do you have any ideas?

Ideal would be something like a multi-threaded shortest_distance(g=G, source=V\B, target=V\B, weights=G.edge_properties['weight'], samples=N) with random sub-sampling of N pairs and source / target accepting lists.
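
In the meantime, a rough sketch of that sub-sampling idea with the existing API (illustrative only: `allowed` is assumed to be a NumPy array with the vertex indices of V\B, `w` the weight edge property, and N the number of sampled sources; all targets are assumed reachable):

    import numpy as np
    from graph_tool.all import shortest_distance

    rng = np.random.default_rng()
    sources = rng.choice(allowed, size=N, replace=False)

    acc, cnt = 0.0, 0
    for s in sources:
        # one single-source pass per sampled source, keeping only targets outside B
        d = shortest_distance(G, source=G.vertex(int(s)), weights=w).a
        acc += d[allowed].sum()
        cnt += len(allowed)

    mean_estimate = acc / cnt   # estimate of the mean over (V\B) x (V\B)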

Thanks a ton!

Stephan


Hi Stephan,

I've used graph-tool to compute a lot of shortest paths and used a dumb solution of just throwing it into the multiprocessing module from Python (with some lru_cache tricks). If you have enough memory and you don't call the main procedure often (which seems to be the case?) then the overhead of multiprocessing becomes insignificant. A benefit is that this is pretty easy to code and it scales perfectly to more cores (until you run out of memory).

Best,
Aleks

Hi Aleks!

The multiprocessing module is what I'm currently using as a workaround until something else pops up - I just throw it into a mp.Pool.

For the initial debugging / get-to-know phase, so to speak.

I don't know what exactly you've put into the multiprocessing module but I don't see how this could be of help in my case.

I'm running a Metropolis algorithm that takes hours on a very small graph, and each step depends on the previous one. On the graphs I'm actually aiming for, it would probably run for at least a week.

I also have a large parameter space and would like to try something like lipo*, but it's the same there: just 100 iterations already approach 2 weeks on the tiny graph, and I can't multiprocess it.

Cheers,
Stephan

*https://pypi.org/project/lipo/


Hi Stephan,

Hi Aleks!

The multiprocessing module is what I'm currently using as a workaround until something else pops up - I just throw it into a mp.Pool.

For the initial debugging / get-to-know phase, so to speak.

This is the first time I've heard of someone using multiprocessing for debugging. Usually it's the other way around :slight_smile:

I don't know what exactly you've put into the multiprocessing module but I don't see how this could be of help in my case.

You requested a multi-threaded shortest_distance and wanted to compute multiple paths and then get the mean distance? So my suggestion was to throw Pool at shortest_distance(g=G, source=V\B, target=V\B, weights=G.edge_properties['weight'], samples=N), which is trivial to do, e.g.

def wrap(node):
    return shortest_distance(g=G, source=node, target=node,
                             weights=G.edge_properties['weight'])

res = multiprocessing.Pool().map(wrap, random.choices(V, N))
do_some_processing_on_result(res)...

This is not optimal due to various overheads, however from Python it's the best you can get.
But if you know a bit of C++ and want to optimize some more, graph-tool already uses OpenMP, so making a parallel loop over vertices should be easy. Tiago keeps the code very readable and everything is logical, so finding a place to make your modification shouldn't take much time: src/graph/topology/graph_distance.cc · master · Tiago Peixoto / graph-tool · GitLab

I'm running a Metropolis algorithm that takes hours on a very small graph, and each step depends on the previous one. On the graphs I'm actually aiming for, it would probably run for at least a week.

I also have a large parameter space and would like to try something like lipo*, but it's the same there: just 100 iterations already approach 2 weeks on the tiny graph, and I can't multiprocess it.

You can for sure parallelize parameter optimization with dlib. Check the official docs (dlib C++ Library - global_function_search_abstract.h); however, it's already done in other frameworks. Listed in the lipo readme: Optuna: A hyperparameter optimization framework — Optuna 3.5.0 documentation
I've tested this one (~30 parameters): GitHub - facebookresearch/nevergrad: A Python toolbox for performing gradient-free optimization
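
As a rough illustration of how little code a parallel search takes with, e.g., Optuna (the parameter names, ranges, and run_metropolis_with_params below are placeholders for the actual evaluation):

    import optuna

    def objective(trial):
        # hypothetical parameters of the Metropolis run; adjust names and ranges
        alpha = trial.suggest_float("alpha", 0.0, 1.0)
        beta = trial.suggest_float("beta", 0.1, 10.0, log=True)
        return run_metropolis_with_params(alpha, beta)   # placeholder for the real evaluation

    study = optuna.create_study(direction="minimize")
    study.optimize(objective, n_trials=100, n_jobs=4)   # runs several trials in parallel
    print(study.best_params)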

Good luck,
Aleks

Hi Aleks!

I thank you for your long answer!

def wrap(node):
    return shortest_distance(g=G, source=node, target=node,
                             weights=G.edge_properties['weight'])

res = multiprocessing.Pool().map(wrap, random.choices(V, N))
do_some_processing_on_result(res)...

This is not optimal due to various overheads, however from Python it's the best you can get.

So you basically recommend full APSP manually?

In terms of run-time, this is the worst you can get, divided by the number of cores - or the worst nightmare of an HPC admin.

Just out of curiosity: Even on my small test-graph it's ~ 1200 times slower than full shortest_distance().

I'll have a look at the other resources!

Cheers,
Stephan

Hi Stephan,

Hi Aleks!

I thank you for your long answer!

> def wrap(node):
>     return shortest_distance(g=G, source=node, target=node,
>                              weights=G.edge_properties['weight'])
>
> res = multiprocessing.Pool().map(wrap, random.choices(V, N))
> do_some_processing_on_result(res)...
>
> This is not optimal due to various overheads, however from Python it's the best you can get.

So you basically recommend full APSP manually?

In terms of run-time, this is the worst you can get, divided by the number of cores - or the worst nightmare of an HPC admin.

I made a typo, so it should be `random.choices(V\B, N)`, but typo or not, I don't see how this is APSP? I am unsure which part gives HPC admins nightmares, but I'll trust you that this is a bad HPC solution.

Just out of curiosity: Even on my small test-graph it's ~ 1200 times slower than full shortest_distance().

I missed from your email that you have such a short execution time, so multiprocessing definitely makes no sense; sorry for the wasted time.

"Even on" in your sentence tells me you don't understand how this
works, so I'll explain. multiprocessing does binary serialization and deserialization, so small problems with short execution times are the worst possible case (exactly your test case). Additionally, I hope you didn't run this example on a machine with a very high core count, since that would only exacerbate the problem... In general, doing micro-optimizations from Python has a very narrow range where it's useful, since Python is too slow in comparison to C++. So you'll likely have to write some C++ if you need to go even faster.

Have fun,
Aleks