Questions on Parallelisation

Hello Tiago,

I am trying to outsource some of my calculations to the University's
cluster, as the compute times for some of my datasets are getting very
long. A couple of questions have arisen out of this, and I was wondering
whether you had any thoughts/advice on them:

1. The University normally limits the runtime of a given job to 12h and
suggests checkpointing to get around this, so that the calculation can
simply resume as a new job. Apparently the cluster has "DMTCP: Distributed
MultiThreaded CheckPointing" installed to allow checkpointing "for some
applications which do not have their own support for this". Is this
compatible with graph-tool, or are you aware of any other way in which I
could stop and restart a job? (What I am after is stopping and restarting
a graph-tool function that takes more than 12h to complete; pickling the
results of a completed calculation is already straightforward.) If there
is currently no way of doing this, is it something you might consider
adding at some point in the future?

2. As far as I understand, with OpenMP I can only ever use a single node
at a time, and am thus limited by the number of CPUs and the amount of RAM
that node supplies. To be able to use more CPUs I would need something
like MPI. Are there any plans to implement this at some point? (I realise
this may not be straightforward, but I was interested in your thoughts.)

3. I would sometimes find it useful to get a progress update from some of
the functions, in order to make a rough order-of-magnitude estimate of the
required compute time. Two use cases:
    a) Calculating the betweenness centrality of a large network can take
a long time. Having an idea of how many nodes have already been covered
would make it possible to extrapolate the remaining time roughly and see
whether the calculation is even feasible in the time available.
    b) When collecting data using mcmc_equilibrate this would potentially
be less meaningful, as the process is stochastic; however, information on
how many sweeps have been completed would still be useful for a rough
estimate of whether completion will take hours, days, weeks, or months.

Best wishes,

Philipp

Hello Tiago,

I am trying to outsource some of my calculations to the University's
cluster, as the compute times for some of my datasets are getting very
long. A couple of questions have arisen out of this, and I was wondering
whether you had any thoughts/advice on them:

1. The University normally limits the runtime of a given job to 12h and
suggests checkpointing to get around this, so that the calculation can
simply resume as a new job. Apparently the cluster has "DMTCP: Distributed
MultiThreaded CheckPointing" installed to allow checkpointing "for some
applications which do not have their own support for this". Is this
compatible with graph-tool, or are you aware of any other way in which I
could stop and restart a job? (What I am after is stopping and restarting
a graph-tool function that takes more than 12h to complete; pickling the
results of a completed calculation is already straightforward.) If there
is currently no way of doing this, is it something you might consider
adding at some point in the future?

DMTCP is the right solution to this problem, and it should work. I use it
myself without any issues.
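
That said, for MCMC runs there is also a manual alternative along the
lines you mention: break the run into short chunks and pickle the
partition between jobs, so that a fresh job can pick up where the previous
one stopped. A rough, untested sketch (the dataset and the "ckpt.pkl" file
name are just placeholders):

    import os, pickle
    import graph_tool.all as gt

    g = gt.collection.data["polblogs"]      # stand-in for the actual graph

    if os.path.exists("ckpt.pkl"):          # resume from the previous job
        with open("ckpt.pkl", "rb") as f:
            b = pickle.load(f)
        state = gt.BlockState(g, b=g.new_vp("int", vals=b))
    else:
        state = gt.BlockState(g)

    for chunk in range(1000):               # many short chunks, not one long run
        state.mcmc_sweep(niter=10)
        with open("ckpt.pkl", "wb") as f:   # save the current partition
            pickle.dump(list(state.get_blocks().a), f)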

2. As far as I understand, with OpenMP I can only ever use a single node
at a time, and am thus limited by the number of CPUs and the amount of RAM
that node supplies. To be able to use more CPUs I would need something
like MPI. Are there any plans to implement this at some point? (I realise
this may not be straightforward, but I was interested in your thoughts.)

There are no plans to implement MPI in graph-tool. It would require a
major redesign of the algorithms, and I have no interest in doing this.
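
Note that within a single node you can still use every core available; the
number of OpenMP threads is controlled by the OMP_NUM_THREADS environment
variable, or at runtime along these lines:

    import graph_tool.all as gt

    print(gt.openmp_enabled())       # True if compiled with OpenMP support
    gt.openmp_set_num_threads(16)    # e.g. one thread per core on the node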

3. I would sometimes find it useful to get a progress update from some of
the functions, in order to make a rough order-of-magnitude estimate of the
required compute time. Two use cases:
    a) Calculating the betweenness centrality of a large network can take
a long time. Having an idea of how many nodes have already been covered
would make it possible to extrapolate the remaining time roughly and see
whether the calculation is even feasible in the time available.

It would be feasible to implement this, but I'd rather keep the code
simple. It would also interfere with OpenMP, since the progress would need
to be computed by each thread but reported synchronously. That is a lot of
work for the minor convenience of a progress bar... Unless there is a very
elegant way of doing this, I'd rather not.
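
If all you need is an order-of-magnitude figure for betweenness, you can
estimate it yourself: Brandes' algorithm performs essentially one
shortest-path pass (plus an accumulation step) per source vertex, so
timing a few such passes and multiplying by the number of vertices gives a
crude estimate. A rough sketch (the dataset is a placeholder, and this
ignores the accumulation step and any OpenMP speedup):

    import time, random
    import graph_tool.all as gt

    g = gt.collection.data["polblogs"]      # stand-in for the actual graph
    sources = random.sample(list(g.vertices()), 10)

    start = time.time()
    for v in sources:                       # one shortest-path pass per source
        gt.shortest_distance(g, source=v)
    per_source = (time.time() - start) / len(sources)
    print("rough total: %.1f h" % (per_source * g.num_vertices() / 3600))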

    b) When collecting data using mcmc_equilibrate this would potentially
be less meaningful, as the process is stochastic; however, information on
how many sweeps have been completed would still be useful for a rough
estimate of whether completion will take hours, days, weeks, or months.

This you can get by simply passing "verbose=True".
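
For example:

    import graph_tool.all as gt

    g = gt.collection.data["polblogs"]      # stand-in for the actual graph
    state = gt.BlockState(g)
    gt.mcmc_equilibrate(state, wait=1000, mcmc_args=dict(niter=10),
                        verbose=True)       # prints a status line per iteration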

Best,
Tiago