Correlation Histogram

I am trying to obtain the correlation histogram for a graph of mine following
the example given in the manual. I run:

g = gt.load_graph('graph.gt')
gt.remove_parallel_edges(g)
h = gt.corr_hist(g, 'out', 'out')

My graph is relatively large at 12,238,931 vertices and 24,884,365 edges. My
problem is that as soon as I start the code it runs on 20 processes and
happily chomps through 252 GB of RAM before starting to spill over into
swap, making my machine incredibly slow.

I presume the RAM usage is linked to the parallel processing so presumably
could be tackled if I ran it using fewer processes. Is there any way of
reducing the RAM usage? Or would I need to implement the routine manually to
achieve this?

Best wishes,

Philipp

Ni! Hi Philipp,

That's a feature of OpenMP controlled by an environment variable:

OMP_NUM_THREADS

So you can, for example

export OMP_NUM_THREADS=4

before running your code.

.~´

Thanks! I presume this won't affect already-running processes, and is only
valid for as long as my PuTTY session is open, after which things revert to
normal?

Well, yes. But you can configure your shell to make it permanent.
Software Carpentry has some good tutorials on using the shell, for example:

http://swcarpentry.github.io/shell-extras/08-environment-variables.html
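For example, in bash you can set it for the current session only, or persist it by appending the export to your startup file (the exact file, e.g. ~/.bashrc, depends on your shell setup):

```shell
# Current session only:
export OMP_NUM_THREADS=4

# Persist for future sessions (bash):
echo 'export OMP_NUM_THREADS=4' >> ~/.bashrc
```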

You can also modify the environment from within Python, using the
"os.environ" dictionary. You'll just have to set the value of
'OMP_NUM_THREADS' before importing graph-tool, because OpenMP only reads
the value at import time.
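A minimal sketch; the key point is that the assignment happens before graph-tool is imported:

```python
import os

# OpenMP reads this variable once, when graph-tool is first imported,
# so it must be set before the import:
os.environ['OMP_NUM_THREADS'] = '4'

# import graph_tool.all as gt   # <-- only import graph-tool after this point
```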

[]s


graph-tool also provides some convenience functions for doing this from
inside python, independently of the environment variables:

  graph_tool.openmp_get_num_threads()
  graph_tool.openmp_set_num_threads()

  graph_tool.openmp_get_schedule()
  graph_tool.openmp_set_schedule()


Those functions were very useful (as was the tip about the environment
variable). I couldn't find them anywhere in the documentation though. Would
it be possible to add them?

Thank you for the help, alas, in this case even limiting the threads to 1
didn't work. The routine still uses up 252 GB of RAM and then carries on
into the swap. I suppose the network is simply too large...

Best,

Philipp

> Those functions were very useful (as was the tip about the environment
> variable). I couldn't find them anywhere in the documentation though. Would
> it be possible to add them?

It's now in git.

> Thank you for the help, alas, in this case even limiting the threads to 1
> didn't work. The routine still uses up 252 GB of RAM and then carries on
> into the swap. I suppose the network is simply too large...

I don't think this is necessarily true. The histogram constructs a DxD
matrix, where D is the largest degree in the network. This is probably why
you are running out of memory.
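As a quick sanity check on the scaling (the 8 bytes per bin is an assumption, just to illustrate the order of magnitude):

```python
# Rough size of a dense D x D histogram, assuming 8-byte counters per bin.
def dense_hist_bytes(D, bytes_per_bin=8):
    return D * D * bytes_per_bin

# For a maximum degree D around 10**5 this is already ~80 GB:
print(dense_hist_bytes(10**5) / 10**9)  # -> 80.0 (GB)
```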

Yup, that would explain it. D is in the order of 10^5 in this case. Would I
be able to write a more memory-efficient script manually with sparse
matrices or is the underlying code already fairly optimised in this regard?

Best,

Philipp

Of course, you could do a lot better with a sparse representation...
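A sparse version only needs memory proportional to the number of *distinct* degree pairs, rather than D*D. A minimal sketch (the `sparse_corr_hist` helper below is hypothetical, not part of graph-tool; with graph-tool you would, I believe, feed it `g.get_edges()` and `g.get_out_degrees(np.arange(g.num_vertices()))`):

```python
from collections import Counter

import numpy as np


def sparse_corr_hist(edges, out_deg):
    """Sparse out-out degree correlation histogram.

    edges:   (E, 2) array of (source, target) vertex indices
    out_deg: array mapping vertex index -> out-degree

    Returns a Counter {(deg_source, deg_target): count}, storing only
    the degree pairs that actually occur.
    """
    hist = Counter()
    for s, t in edges:
        hist[(out_deg[s], out_deg[t])] += 1
    return hist


# Toy example: edges 0 -> 1, 0 -> 2, 1 -> 2
edges = np.array([[0, 1], [0, 2], [1, 2]])
out_deg = np.array([2, 1, 0])
h = sparse_corr_hist(edges, out_deg)
```

The Python loop over 25 million edges will be slower than the native code, but it trades time for memory; binning the degrees logarithmically first would shrink the histogram further.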