Correlation Histogram

I am trying to obtain the correlation histogram for a graph of mine following
the example given in the manual. I run:

g = gt.load_graph('graph.gt')
gt.remove_parallel_edges(g)
h = gt.corr_hist(g, 'out', 'out')

My graph is relatively large at 12,238,931 vertices and 24,884,365 edges. My
problem is that as soon as I start the code it runs on 20 processes and
happily chomps through 252 GB of RAM before starting to spill over into
swap, making my machine incredibly slow.

I presume the RAM usage is linked to the parallel processing so presumably
could be tackled if I ran it using fewer processes. Is there any way of
reducing the RAM usage? Or would I need to implement the routine manually to
achieve this?

Best wishes,

Philipp

Ni! Hi Philipp,

That's a feature of OpenMP controlled by an environment variable:

OMP_NUM_THREADS

So you can, for example

export OMP_NUM_THREADS=4

before running your code.

.~´

Thanks! I presume this won't affect already-running processes, and is only
valid for as long as my PuTTY session is open, after which things revert to
normal?

Well, yes. But you can configure your shell to make it permanent.
Software Carpentry has some good tutorials on using the shell, for example:

http://swcarpentry.github.io/shell-extras/08-environment-variables.html
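For example, in bash you can set it for the current session only, or persist it by appending the export to your startup file (the exact file, e.g. ~/.bashrc, depends on your shell setup):

```shell
# Current session only:
export OMP_NUM_THREADS=4

# Persist for future sessions (bash):
echo 'export OMP_NUM_THREADS=4' >> ~/.bashrc
```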

You can also modify the environment from within Python, using the
"os.environ" dictionary. You'll just have to set the value of
'OMP_NUM_THREADS' before importing graph-tool, because OpenMP only reads
the value at import time.
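A minimal sketch; the key point is that the assignment happens before graph-tool is imported:

```python
import os

# OpenMP reads this variable once, when graph-tool is first imported,
# so it must be set before the import:
os.environ['OMP_NUM_THREADS'] = '4'

# import graph_tool.all as gt   # <-- only import graph-tool after this point
```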

[]s


graph-tool also provides some convenience functions for doing this from
inside python, independently of the environment variables:

  graph_tool.openmp_get_num_threads()
  graph_tool.openmp_set_num_threads()

  graph_tool.openmp_get_schedule()
  graph_tool.openmp_set_schedule()


Those functions were very useful (as was the tip about the environment
variable). I couldn't find them anywhere in the documentation though. Would
it be possible to add them?

Thank you for the help, alas, in this case even limiting the threads to 1
didn't work. The routine still uses up 252 GB of RAM and then carries on
into the swap. I suppose the network is simply too large...

Best,

Philipp

> Those functions were very useful (as was the tip about the environment
> variable). I couldn't find them anywhere in the documentation though. Would
> it be possible to add them?

It's now in git.

> Thank you for the help, alas, in this case even limiting the threads to 1
> didn't work. The routine still uses up 252 GB of RAM and then carries on
> into the swap. I suppose the network is simply too large...

I don't think this is necessarily true. The histogram constructs a DxD
matrix, where D is the largest degree in the network. This is probably why
you are running out of memory.
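As a quick sanity check on the scaling (the 8 bytes per bin is an assumption, just to illustrate the order of magnitude):

```python
# Rough size of a dense D x D histogram, assuming 8-byte counters per bin.
def dense_hist_bytes(D, bytes_per_bin=8):
    return D * D * bytes_per_bin

# For a maximum degree D around 10**5 this is already ~80 GB:
print(dense_hist_bytes(10**5) / 10**9)  # -> 80.0 (GB)
```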

Yup, that would explain it. D is in the order of 10^5 in this case. Would I
be able to write a more memory-efficient script manually with sparse
matrices or is the underlying code already fairly optimised in this regard?

Best,

Philipp

Of course, you could do a lot better with a sparse representation...
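A sparse version only needs memory proportional to the number of *distinct* degree pairs, rather than D*D. A minimal sketch (the `sparse_corr_hist` helper below is hypothetical, not part of graph-tool; with graph-tool you would, I believe, feed it `g.get_edges()` and `g.get_out_degrees(np.arange(g.num_vertices()))`):

```python
from collections import Counter

import numpy as np


def sparse_corr_hist(edges, out_deg):
    """Sparse out-out degree correlation histogram.

    edges:   (E, 2) array of (source, target) vertex indices
    out_deg: array mapping vertex index -> out-degree

    Returns a Counter {(deg_source, deg_target): count}, storing only
    the degree pairs that actually occur.
    """
    hist = Counter()
    for s, t in edges:
        hist[(out_deg[s], out_deg[t])] += 1
    return hist


# Toy example: edges 0 -> 1, 0 -> 2, 1 -> 2
edges = np.array([[0, 1], [0, 2], [1, 2]])
out_deg = np.array([2, 1, 0])
h = sparse_corr_hist(edges, out_deg)
```

The Python loop over 25 million edges will be slower than the native code, but it trades time for memory; binning the degrees logarithmically first would shrink the histogram further.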