> Let's pause and appreciate the learning opportunity.

> After considerable amount of unsubstantiated speculation about bugs in
the code, and even the underlying soundness of the algorithm, the
problem turned out to be a basic misuse. One which was impossible to
guess from the information given.

> It's also a learning on my part, in fact about something that I thought
I already knew: It's a pointless exercise to try to troubleshoot

something without a _minimal_ and _complete_ example at hand.

I think you may have misunderstood me. Sorry if I wasn't communicating clearly. I made a typo in my second to last email swapping the numbers for "directed" and "undirected". I sent another email immediately after correcting the typo. Hope you saw that. I wasn't saying I'd been making a silly mistake this whole time (as far as I can tell...). I was saying that I accidentally set my (properly) directed graph to undirected and the performance skyrocketed.

Unfortunately, that performance improvement was just the result of a mistake on my part. I didn't realize graph-tool's behavior for setting a graph to undirected was to turn reciprocal edges into 2 multiedges. So, when I removed the sampled edges I was only removing one of the two multiedges (in the fraction of the randomly selected which had a reciprocal). I.e., it was basically pure data leakage. Once I corrected that error by only sampling edges which don't have a reciprocal the performance between undirected/directed mostly disappeared. So, performance still bad. To ensure it wasn't just my network or something to do with directedness I tried several simple undirected graphs from graph-tool's collection (where I don't need to worry about multiedges). All performed poorly.

Since I haven't seen this network reconstruction perform well on any of the graphs I've tested I've given up on this approach. `get_edges_prob` is working very well for me, so I'll just stick to that, despite its slowness.

Although, this possibility occurred to me: I wonder how the reconstruction would do if I fed back the probabilities that I get from my binary classifier using the method described in the "Incorporating Extrinsic Uncertainty Estimates" section of your paper. I could regard it as a final step in a link prediction flow. Seems plausible to me that with this extra information the reconstruction might work better. I'm assuming no one has tried this yet?

> The directed SBM does not generate reciprocal edges very well, hence it is not a good
predictor in this case.

Where can I read more about this? I assumed the SBM handles the directed case just fine, even if there are many reciprocal edges. Does this have implications for community detection as well?

> That is in fact very similar to the conditional probability computed by
get_edges_prob(). The latter is more accurate, yields an actual
normalized probability, and includes information from the priors, but
for sufficiently sparse and large graphs, both computations should coincide.

Hmm, I didn't realize this. Thanks.

Thanks for your help, as always

On Thu, Apr 16, 2020 at 2:47 AM Tiago de Paula Peixoto <tiago@skewed.de> wrote:

Am 16.04.20 um 02:54 schrieb Deklan Webster:
> I, totally by accident, think I discovered the issue! In the process of
> setting up the testing between the reconstructive method, get_edges_prob
> method, and the simple stacked topological similarity classifier I
> accidentally set the graph to undirected without setting it back to
> directed. The results appeared to be reasonable finally. I've gone back
> and tested on the graph I've been testing on, and indeed:
>
> Undirected: ~13/1000 correctly identified
> Directed: 488/1000 correctly identified
>
> A giant improvement! Of course, I'm relieved since this solves my
> problem (I can just, as least, set to undirected to compute this).

Let's pause and appreciate the learning opportunity.

After considerable amount of unsubstantiated speculation about bugs in
the code, and even the underlying soundness of the algorithm, the
problem turned out to be a basic misuse. One which was impossible to
guess from the information given.

It's also a learning on my part, in fact about something that I thought
I already knew: It's a pointless exercise to try to troubleshoot
something without a _minimal_ and _complete_ example at hand.

> But,
> now I'm wondering what's going on. Is there some predictable reason why
> the directed case is performing so much worse? Something to do with
> directed graphs and equlibration? But if it were to do with equlibration
> why did this issue never affect `get_edges_prob`'s performance? Could
> there be a bug in how graph-tool is handling the directed case? In the
> paper it seems you've worked with the directed case before, so I assume
> you would have noticed if there were.

I'm done with the wild guesses, this is not productive.

> To be sure, I tested this on the "polblogs" graph as well. Undirected:
> reconstructive performance is great, directed: terrible.

The polblogs network I actually have investigated a bit in the past.
Despite being directed, most edges of that network are actually
reciprocal, so in fact it is mostly an undirected network. The directed
SBM does not generate reciprocal edges very well, hence it is not a good
predictor in this case. The undirected model is a better fit.

> I've read the documentation. The docstring says "Inverse temperature".
> The relevant section of the docs say,
>
>> An alternative is to use a greedy algorithm based on a merge-split
> MCMC with zero temperature [peixoto-merge-split-2020]
> <https://graph-tool.skewed.de/static/doc/demos/inference/inference.html#peixoto-merge-split-2020>,
> which requires a single global state, and thus can reduce memory usage.
>
> Just from this I don't understand. Not setting beta to infinity has
> already reduced memory usage for me. I don't understand the significance
> of this being greedy or why it doesn't apply to my situation.If the
> explanation of all this is in one of the papers please point to it and
> I'll read it.

The documentation assumes a basic understanding of how MCMC works,
without which a proper use of the the methods described is not possible,
unfortunately.

The MCMC algorithm (for reconstruction) attempts to sample from a
distribution \propto P(A,b)^beta where A is the network, b is a
partition and beta is the inverse temperate parameter. Setting beta=1
means sampling from P(A,b), whereas beta=infinity means finding a single
(A,b) that maximizes the distribution. The choice beta=infinity also
breaks ergodicity and detailed balance, making it a greedy heuristic,
rather than a proper MCMC.

To compute marginal probabilities you need beta=1, i.e. sample from the
joint distribution, not maximize it.

The point about reducing memory applied only when comparing to
minimize_blockmodel_dl().

>> This kind of comparison has been done already in
> https://arxiv.org/abs/1802.10582 and https://arxiv.org/abs/1909.07578.
> The SBM approach is the single best classifier among the over a hundred
> they consider, which is marginally beat only by a stacking of around 40
> other predictors.
>
> I've read both these papers and referred to the latter twice in this
> thread. We've already discussed that these don't use either
> `get_edges_prob` nor the full reconstructive marginal edge probability
>
> On page 8 of the former, they're using this as their similarity for SBM:
> "s_ij = θ_i θ_j * l_gi,gj"

That is in fact very similar to the conditional probability computed by
get_edges_prob(). The latter is more accurate, yields an actual
normalized probability, and includes information from the priors, but
for sufficiently sparse and large graphs, both computations should coincide.

Best,
Tiago

--
Tiago de Paula Peixoto <tiago@skewed.de>

_______________________________________________
graph-tool mailing list
graph-tool@skewed.de
https://lists.skewed.de/mailman/listinfo/graph-tool