> You seem to be talking about the "probability of predicting a
> missing edge with a binary classifier", whereas I was referring to the
> scores themselves not being proper probabilities.

Indeed, that's what I was pointing out, but when you say this:

> Most typically, edge prediction is
> formulated as a binary classification task [7], in which each
> missing (or spurious) edge is attributed a “score” (which
> may or may not be a probability), so that those that reach a
> prespecified discrimination threshold are classified as true
> edges (or true nonedges)

I think this is misleading. This is not typical. As far as I can tell, very few people use unsupervised binary classification for link prediction. Most typically, edge prediction is formulated as a *supervised* binary classification task. From that setup, you can calibrate the scores into probabilities based on the classifier's ability to predict the removed edges.

> Indeed selecting the optimal threshold can be done by cross validation
> in a *supervised* setting, but even then this will depend in general on
> the fraction of removed edges, size of test set, etc.

I agree, this is a valid concern. If this is your objection to the *supervised* binary classification formulation, then I think that is the statement that should appear in the paper.

This could be settled to some degree experimentally: vary the fraction of removed edges, the test-set size, etc., and see how drastically the number of edges with probability > 0.5 changes. My hypothesis: not very much, assuming reasonable fractions.
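
To make that concrete, here's roughly the kind of experiment I have in mind (just a sketch under my own assumptions: the "football" graph and the single Jaccard feature are placeholders, and in practice I'd use my Twitter graph and a richer feature set):

    import numpy as np
    import graph_tool.all as gt
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.model_selection import train_test_split

    g = gt.collection.data["football"]   # placeholder; my real graph in practice
    rng = np.random.default_rng(42)

    def jaccard_feature(gr, pairs):
        # Single illustrative feature: Jaccard similarity of the endpoints.
        sim = gt.vertex_similarity(gr, "jaccard", vertex_pairs=pairs)
        return np.asarray(sim).reshape(-1, 1)

    for frac in (0.05, 0.1, 0.2):
        edges = g.get_edges()
        rng.shuffle(edges)
        n_rm = int(frac * len(edges))
        removed = [tuple(e) for e in edges[:n_rm]]

        # Training graph with the "removed" edges held out.
        gtrain = gt.Graph(directed=False)
        gtrain.add_vertex(g.num_vertices())
        gtrain.add_edge_list(edges[n_rm:])

        # Equal number of random non-edges as negatives.
        nonedges = []
        while len(nonedges) < n_rm:
            i, j = rng.integers(0, g.num_vertices(), 2)
            if i != j and g.edge(i, j) is None:
                nonedges.append((int(i), int(j)))

        pairs = removed + nonedges
        y = np.array([1] * len(removed) + [0] * len(nonedges))
        X = jaccard_feature(gtrain, pairs)

        Xtr, Xte, ytr, yte = train_test_split(X, y, test_size=0.3,
                                              stratify=y, random_state=0)
        # Calibrated supervised classifier; probabilities come from held-out folds.
        clf = CalibratedClassifierCV(RandomForestClassifier(n_estimators=200), cv=5)
        clf.fit(Xtr, ytr)
        p = clf.predict_proba(Xte)[:, 1]
        print(f"frac={frac}: {np.sum((p > 0.5) & (yte == 1))} of {np.sum(yte == 1)} "
              f"held-out removed edges get p > 0.5")

If the count of pairs crossing p > 0.5 stays roughly stable across the different fractions, that would support my hypothesis.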

> but in many scenarios
> that is applicable. There is a big difference between measuring a
> non-edge and not doing a measurement.

Yeah, I saw that part of the paper. Makes sense to me. But that situation seems most common in the scientific domain. In digital networks (my case: Twitter), everything is usually easy to measure. Again, if I just hand you a graph, how will this apply?

> Just an observation: these numbers do not mean much by themselves. MCMC
> equilibration times will vary a lot between datasets. To be sure your
> sampling is good you have to run diagnostics, e.g. running the chain
> multiple times and checking they give the same answer, and checking that
> your results are not changing by increasing the number of iterations.

> It's difficult to say. I would investigate the MCMC equilibration in
> more detail before declaring defeat.

Well, I have run it multiple times with different settings. To be sure, I just now ran it with the epsilon argument removed, wait=2000, multiflip on, and then 100k(!) sampling iterations. The results were pretty much the same.

I noticed that in the docs you now recommend setting beta=np.inf when using multiflip. What exactly does this do? (I plan to read the other paper you mention soon, which probably covers this.) I tried it and the chain equilibrated much faster, with many of the iterations involving no moves. The resulting entropy was lower, at ~48k versus the ~62k I get with beta at its default. Not sure what's going on there. When I run the sampling step, am I also supposed to set beta=np.inf? I tried this and got 0 results. I also tried equilibrating with beta=np.inf and then sampling with the default; the results were still pretty poor.
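
For reference, here is roughly what I'm running, paraphrased (construction of `state` with my n/x measurements is omitted, and the exact arguments may differ slightly from my script):

    import numpy as np
    import graph_tool.all as gt

    # `state` is my reconstruction state, constructed as before.

    # Equilibration: wait=2000, multiflip on, epsilon left at its default.
    # (This is also where I tried adding beta=np.inf to mcmc_args.)
    gt.mcmc_equilibrate(state, wait=2000, multiflip=True,
                        mcmc_args=dict(niter=10))

    # Sampling: 100k iterations, collecting the marginal graph.
    u = None
    def collect_marginals(s):
        global u
        u = s.collect_marginal(u)

    gt.mcmc_equilibrate(state, force_niter=100000, multiflip=True,
                        mcmc_args=dict(niter=10),
                        callback=collect_marginals)

    eprob = u.ep.eprob   # posterior edge probabilities on the marginal graph

If you spot anything obviously off in that pipeline, let me know.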

Are there any more tricks I should know about for evaluating how well the chain is equilibrated?
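
For context, the main check I've been doing (beyond re-running with different settings) is recording the entropy during equilibration and comparing independent runs, roughly like this (`make_state()` is a placeholder for my own state construction):

    import graph_tool.all as gt

    for run in range(3):
        state = make_state()   # placeholder: rebuild the reconstruction state from scratch
        entropies = []
        gt.mcmc_equilibrate(state, wait=2000, multiflip=True,
                            mcmc_args=dict(niter=10),
                            callback=lambda s: entropies.append(s.entropy()))
        print(f"run {run}: final entropy = {entropies[-1]:.1f}")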

I noticed that in your paper you didn't compare your reconstruction method's performance against any baseline. How do you know how well it's performing if you don't have a baseline? I'm currently pessimistic, given its performance on the graph I'm testing. Some of the graphs you tested on in the paper are loaded into graph-tool, right? It would be fairly easy to train a RandomForest (no fancy boosted trees necessary) on the stacked similarity measures from graph-tool (and maybe a few other simple features I have in mind...) and compare its performance against your reconstruction approach (at least for plain link prediction). Interested in this? Conjectures? I would be willing to do it for some of the moderately sized graphs.
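
Concretely, the baseline I have in mind is something like this (sketch only; `gtrain`, `pairs`, and `y` would be built as in the earlier snippet, and the exact list of similarity measures and extra features is still to be decided):

    import numpy as np
    import graph_tool.all as gt
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import cross_val_score

    def similarity_features(gr, pairs):
        # Stack graph-tool's built-in similarity measures as feature columns.
        sims = ["jaccard", "dice", "salton", "inv-log-weight"]
        cols = [np.asarray(gt.vertex_similarity(gr, s, vertex_pairs=pairs)) for s in sims]
        return np.column_stack(cols)

    def baseline_auc(gtrain, pairs, y):
        # 5-fold cross-validated AUC of a plain RandomForest on the stacked features.
        X = similarity_features(gtrain, pairs)
        clf = RandomForestClassifier(n_estimators=500, n_jobs=-1)
        return cross_val_score(clf, X, y, cv=5, scoring="roc_auc").mean()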

Thanks for your help, as always


On Tue, Apr 14, 2020 at 3:39 AM Tiago de Paula Peixoto <tiago@skewed.de> wrote:
On 14.04.20 at 02:28, Deklan Webster wrote:
> I found many minor typos, if you're interested in
> fixing them, but I also found some more confusing errors (I *think* I'm
> reading the latest version...):

Please send whatever typos you found by private email, so these can be
collected.

> In FIG 1 it says "probability q of missing edge", shouldn't that be
> "spurious edge"?

Yes.

> At one point you say "If an entry is not observed, it has a value of
> n_ij= 0, which is different from it being observed with n_ij > 0 as a
> nonedge x_ij= 0." Shouldn't the terminology for n_ij=0 be "unmeasured"
> not "unobserved"? That really confused me in the following paragraphs.

That's what is meant.

> I am doing something like 2 not something like 1. I agree that 1
> involves an arbitrary choice. 2 doesn't. With 2 I can calibrate
> probabilities which *do* have some meaning: they're based on my ability
> to predict the edges I removed. This is what the Nearly Optimal Link
> Prediction paper does.

Indeed selecting the optimal threshold can be done by cross validation
in a *supervised* setting, but even then this will depend in general on
the fraction of removed edges, size of test set, etc.

In any case, you are either being very generous in interpreting the
combination of arbitrary scores + threshold as "probabilities" with
"some meaning", or at the very least we are talking about different
things. You seem to be talking about the "probability of predicting a
missing edge with a binary classifier", whereas I was referring to the
scores themselves not being proper probabilities.

> Not sure what you mean by AUROC being 'relative'. As you say, you need a
> score threshold to make predictions in the end but computing AUROC
> itself doesn't require a threshold. There are other non-threshold-based
> metrics one can use.

The reason why the AUC does not need a threshold is because it is a
relative measure: it is the probability of a true positive having a
larger score than a true negative. It does not care about the absolute
values of the scores, only their relative values. If you multiply every
score by a constant, or add a constant to every score, you get the same AUC value.

This means that even if you have a perfect AUC value of 1, you are still
not ready to actually predict the edges until you find the threshold
that separates the true positives from the true negatives.

> For the equilibration I used
> wait=1000, epsilon=1e-5, multiflip=True. For sampling I used 5000
> iterations.

Just an observation: these numbers do not mean much by themselves. MCMC
equilibration times will vary a lot between datasets. To be sure your
sampling is good you have to run diagnostics, e.g. running the chain
multiple times and checking they give the same answer, and checking that
your results are not changing by increasing the number of iterations.

> So, slightly better, but still very poor. A binary classifier trained
> with just the simple similarity measures implemented in graph-tool (as
> described above) as features would give better results than this.
> Indeed, the conditional probability method is much much better (based on
> SHAP importance in my binary classifier). How can that be? Surely I'm
> missing something or making a mistake somewhere?

It's difficult to say. I would investigate the MCMC equilibration in
more detail before declaring defeat.

> After reading the paper I see what you meant now. If I understand
> correctly, this doesn't apply to my situation. Maybe I'm wrong? If
> someone says "Here is a real-world graph. Tell me what the missing edges
> are", this method doesn't appear applicable?

I'm not going to tell you what your situation is, but in many scenarios
that is applicable. There is a big difference between measuring a
non-edge and not doing a measurement. "The absence of evidence is not
evidence of absence." A good example is drug-drug interactions: A lack
of an effect observed when two drugs are taken together is very
different from not doing the experiment in the first place.

Sadly, this important distinction is often ignored when network datasets
are collected.

> Of the 1000 random edges, 981 were found in the marginal graphs.
> 271/1000 = 0.271 had probability > 0.5. Of the 2000 random non-edges 952
> were found in the marginal graphs, and only 12 had probability > 0.5.
>
> Assuming I did this right, this method seems to work much better! It
> didn't appear to be fooled by the unmeasured non-edges.

For one, the problem now became more balanced, as the numbers of true
positives and true negatives are comparable.

> Let's say I run this twice, once with alpha=50, beta=50 and once with
> alpha=10, beta=90. The former run will have more possible non-edges,
> right? Let's say that there are two non-edges, a and b, which were found
> in both runs. Is it not the case that if P(a) < P(b) in the first run,
> that it will also be so in the second run? Is this an incorrect
> intuition? In other words, will adjusting alpha and beta change the
> relative ordering of the probabilities (for the common ones)?

No, it should not change the relative ordering by much, although this
can happen for extreme choices, e.g. that make the inferred graph very
dense.

Best,
Tiago

--
Tiago de Paula Peixoto <tiago@skewed.de>

_______________________________________________
graph-tool mailing list
graph-tool@skewed.de
https://lists.skewed.de/mailman/listinfo/graph-tool