I am about to start a project working on a graph with about 50M nodes and 1B edges, and I would like your opinion on whether this is feasible with graph_tool.
Can you please share your experience with graph examples that were comparable in size?
I mostly want to calculate centrality measures and I will need to apply (several) filters to isolate nodes with particular attributes.
In addition, I will be running on a supercomputer (so memory is NOT an issue) and, if I am lucky, they have installed/enabled the parallel version of the library.
Thank you very much,
You should be able to tackle graphs of this size, if you have enough
memory. For centrality calculations, graph-tool has pure-C++
parallel code, so you should see some good performance.
Graph filtering can also be done without involving Python loops, so it
should scale well too.
Just as an illustration, for the graph size you suggested:
In : g = random_graph(50000000, lambda: poisson(40), random=False, directed=False)
In : %time pagerank(g)
CPU times: user 3min 26s, sys: 44 s, total: 4min 10s
Wall time: 11.3 s
So, pagerank takes about 11 seconds on a machine with 32 cores (it would
have taken around 3-4 minutes in a single thread). And it takes about 50
GB of RAM to store the graph.
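
For anyone who wants to scale these figures to their own graph, here is a rough back-of-the-envelope sketch in plain Python. It only reuses the numbers quoted above. The variable names are mine, "50 GB" is assumed to mean decimal gigabytes, and CPU time divided by wall time is only an approximation of the parallel speedup (it ignores any overhead added by threading).

```python
# Figures quoted in the reply above.
num_vertices = 50_000_000
mean_degree = 40             # Poisson mean passed to random_graph
cpu_seconds = 4 * 60 + 10    # "total: 4min 10s"
wall_seconds = 11.3          # "Wall time: 11.3 s"
cores = 32
ram_bytes = 50e9             # "about 50 GB", assumed decimal gigabytes

# Undirected graph: every edge is counted in the degree of two vertices,
# so the expected edge count is N * <k> / 2.
num_edges = num_vertices * mean_degree // 2

# Approximate speedup: total CPU time over elapsed wall-clock time.
speedup = cpu_seconds / wall_seconds
efficiency = speedup / cores

# Rough memory cost per edge (vertex storage is folded into this number).
bytes_per_edge = ram_bytes / num_edges

print(f"expected edges: {num_edges:,}")
print(f"approx. speedup: {speedup:.1f}x on {cores} cores "
      f"(efficiency {efficiency:.0%})")
print(f"approx. memory per edge: {bytes_per_edge:.0f} bytes")
```

The expected edge count matches the 1B edges asked about, so the timing and memory figures above should be a fair first estimate for a graph of that size, although a real graph with a skewed degree distribution or extra property maps may behave differently.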