> It is so utterly simple, it is *exactly* like doing sums.

Oh, I see now. Thanks!

> (I discuss this in the PRX paper, which I get the impression you did not
yet bother to read.)

Actually, I did read some of it before my last message (the terrorist network example is pretty cool), but no, I didn't read it front to back. I apologize if it seemed like I wasn't putting in the effort to understand. I've now read it through, and I understood more than I thought I would. I found many minor typos (I can point them out if you're interested in fixing them), but I also found a few more confusing errors (I *think* I'm reading the latest version...):

In FIG 1 it says "probability q of missing edge", shouldn't that be "spurious edge"?

At one point you say "If an entry is not observed, it has a value of n_ij = 0, which is different from it being observed with n_ij > 0 as a nonedge x_ij = 0." Shouldn't the terminology for n_ij = 0 be "unmeasured" not "unobserved"? That really confused me in the following paragraphs.

"Inverson bracket" should be "Iverson bracket"

> Not without making an _arbitrary_ choice of normalization, which results
in an _arbitrary_ number of edges being predicted.

> Without saying how you normalized the scores to begin with, these values
are meaningless.

Okay, for clarity, here are two situations:

1) I take a similarity measure, say the "dice" similarity measure from graph-tool, and compute it for every vertex pair. Then, I pick an arbitrary cutoff and consider everything greater than that a "missing edge".

2) I turn this into a supervised classification problem by randomly removing edges, computing whatever features for every vertex pair (say, every similarity measure from graph-tool: dice, salton, hub-promoted, etc.), and then training a binary classifier using the removed edges, non-edges, and the corresponding features. (Setting aside the tedious details of constructing the train/test/validation splits properly from the edge/non-edge sets.)

I am doing something like 2), not something like 1). I agree that 1) involves an arbitrary choice; 2) doesn't. With 2) I can calibrate probabilities which *do* have some meaning: they're based on my ability to predict the edges I removed. This is what the Nearly Optimal Link Prediction paper does.
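For concreteness, here's a minimal sketch of what I mean by 2). It assumes g is the observed graph and that removed_pairs / nonedge_pairs are lists of (source, target) vertex indices for the held-out edges and the sampled non-edges (those variable names and the particular similarity measures are just illustrative):

    import numpy as np
    from graph_tool.all import vertex_similarity
    from lightgbm import LGBMClassifier

    SIMS = ["dice", "salton", "hub-promoted", "jaccard"]

    def pair_features(g, pairs):
        # one row per vertex pair, one column per similarity measure
        return np.column_stack([vertex_similarity(g, sim_type=s, vertex_pairs=pairs)
                                for s in SIMS])

    X = np.vstack([pair_features(g, removed_pairs), pair_features(g, nonedge_pairs)])
    y = np.concatenate([np.ones(len(removed_pairs)), np.zeros(len(nonedge_pairs))])

    clf = LGBMClassifier().fit(X, y)    # train/test/validation splitting omitted
    probs = clf.predict_proba(X)[:, 1]  # per-pair probability of being an edge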

To be specific, the probabilities I showed you come out of an LGBMClassifier from LightGBM. The LightGBM docs recommend calibrating these probabilities further using the procedure detailed at https://scikit-learn.org/stable/modules/calibration.html via https://scikit-learn.org/stable/modules/generated/sklearn.calibration.CalibratedClassifierCV.html, i.e. Platt scaling or isotonic regression. The references on the sklearn page are from 1999-2005, so I'm assuming there are better ways to calibrate probabilities by now.
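Concretely, something like this (continuing from X and y in the sketch above; isotonic vs. sigmoid and cv=5 are just illustrative choices):

    from lightgbm import LGBMClassifier
    from sklearn.calibration import CalibratedClassifierCV

    # 5-fold cross-validated isotonic calibration; method="sigmoid" is Platt scaling
    calibrated = CalibratedClassifierCV(LGBMClassifier(), method="isotonic", cv=5)
    calibrated.fit(X, y)
    calibrated_probs = calibrated.predict_proba(X)[:, 1]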

From the paper,

> Although this distribution can be used to
perform the same binary classification task, by using the
posterior marginal probabilities π_ij as the aforementioned
“scores,” it contains substantially more information. For
example, the number of missing and spurious edges (and,
hence, the inferred probabilities p and q) are contained in
this distribution and thus do not need to be prespecified.
Indeed, our method lacks any kind of ad hoc input, such as
a discrimination threshold [note that the threshold 1/2 in
the MMP estimator of Eq. (44) is a derived optimum, not an
input]. This property means that absolute assessments such
as the similarity of Eq. (45) can be computed instead of
relative ones such as the AUC.

Not sure what you mean by AUROC being 'relative'. As you say, you need a score threshold to make predictions in the end, but computing the AUROC itself doesn't require a threshold. There are other non-threshold-based metrics one could use as well.
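For example, with the labels and scores from the sketch above, the AUROC is computed directly from scores and labels, with no threshold anywhere:

    from sklearn.metrics import roc_auc_score

    # y: 1 for held-out edges, 0 for sampled non-edges; probs: any monotone score
    auroc = roc_auc_score(y, probs)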

Okay, I was curious, so I tested it out: I just removed 1000 edges (of 21k; the structure was not destroyed) at random from the graph and ran this marginal probability procedure with *defaults*: 1 measurement per edge, 1 observation per edge, 1 measurement per non-edge, 0 observations per non-edge, alpha, beta, mu, nu all 1. For the equilibration I used wait=1000, epsilon=1e-5, multiflip=True. For sampling I used 5000 iterations.
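In case I've set something up wrong, here is roughly what I ran (a sketch; g_obs is the graph after removing the 1000 edges, and I'm assuming the 5000 sampling iterations map onto force_niter):

    import graph_tool.all as gt

    n = g_obs.new_ep("int", 1)   # 1 measurement per observed edge
    x = g_obs.new_ep("int", 1)   # 1 positive observation per observed edge

    # non-edges get n_default=1 measurement and x_default=0 observations
    state = gt.MeasuredBlockState(g_obs, n=n, x=x, n_default=1, x_default=0)

    gt.mcmc_equilibrate(state, wait=1000, epsilon=1e-5, multiflip=True,
                        mcmc_args=dict(niter=10))

    u = None  # marginal graph
    def collect_marginals(s):
        global u
        u = s.collect_marginal(u)

    gt.mcmc_equilibrate(state, force_niter=5000, multiflip=True,
                        mcmc_args=dict(niter=10), callback=collect_marginals)

    eprob = u.ep.eprob  # posterior marginal edge probabilities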

In the resulting marginal graph, 2470 non-edges from the original graph appear, and 224 edges from the original graph don't appear. Of the 1000 edges I removed, only 181 appear in the marginal graph. Okay, but for the 181 edges that did appear, were the probabilities at least high? Here's the histogram of probabilities: https://i.imgur.com/kZcR0W1.png As you can see, almost all of them are low. How many exceed 0.5? Only 16. So of the 1000 edges I removed, it correctly identified 16/1000 = 0.016 of them.

> Given the true value of r, the optimal choice should be beta=[(1-r)/r]*
alpha, and making alpha -> infinity (in practice just a large-enough value).

Okay, so even though this is not realistic in an exploratory context, I'll try adjusting alpha and beta to be ideal. 1000 out of 21k is about 5%, so, using your formula, I'll try alpha=100, beta=1900. Here are the results:
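i.e., in the sketch above, passing (assuming fn_params is the right name for the missing-edge hyperparameters):

    # alpha/(alpha+beta) = 100/2000 = 0.05, roughly r = 1000/21000
    state = gt.MeasuredBlockState(g_obs, n=n, x=x, n_default=1, x_default=0,
                                  fn_params=dict(alpha=100, beta=1900))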

10850 non-edges from the original graph appear in the marginal graph, and 3 edges from the original graph were not found. 450/1000 removed edges appear in the marginal graph, and 25/1000 = 0.025 have probability > 0.5.

So, slightly better, but still very poor. A binary classifier trained with just the simple similarity measures implemented in graph-tool (as described above) as features would give better results than this. Indeed, the conditional probability method is much, much better (based on SHAP importance in my binary classifier). How can that be? Surely I'm missing something or making a mistake somewhere?

> That is not what I meant. Setting n_default=0 means saying that every
non-edge is non-observed. You should set n=0 only to the edges
'removed', and to a comparable number of actual non_edges.

After reading the paper, I see what you meant now. If I understand correctly, though, this doesn't apply to my situation. Maybe I'm wrong? If someone says "Here is a real-world graph; tell me what the missing edges are", this method doesn't appear to be applicable?

Although, I can test it out just to see if it performs any better:

I selected 1000 random edges, set their n and x to 0. Then, I selected 2000 random non-edges, added them to the graph, and set their n and x to 0.
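Concretely, this is roughly what I did (a sketch; g is the original graph, before any edges are touched):

    import random
    import graph_tool.all as gt

    n = g.new_ep("int", 1)   # every edge measured once...
    x = g.new_ep("int", 1)   # ...and observed once

    # mark 1000 existing edges as unmeasured (n = x = 0)
    for e in random.sample(list(g.edges()), 1000):
        n[e] = x[e] = 0

    # add 2000 true non-edges to the graph and mark them as unmeasured too
    N = g.num_vertices()
    added = 0
    while added < 2000:
        s, t = random.randrange(N), random.randrange(N)
        if s != t and g.edge(s, t) is None:
            e = g.add_edge(s, t)
            n[e] = x[e] = 0
            added += 1

    # all remaining non-edges stay "measured once, observed zero times"
    state = gt.MeasuredBlockState(g, n=n, x=x, n_default=1, x_default=0)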

Of the 1000 random edges, 981 were found in the marginal graph, and 271/1000 = 0.271 had probability > 0.5. Of the 2000 random non-edges, 952 were found in the marginal graph, and only 12 had probability > 0.5.

Assuming I did this right, this method seems to work much better! It didn't appear to be fooled by the unmeasured non-edges.

> Choosing alpha=beta=infinity means being 100% sure that the value of r
(probability of missing edges) is 1/2. Does that value make sense to
you? Did you remove exactly 1/2 of the edges?

I was playing with different values as I showed you. I'll be more explicit about my thought process:

Let's say I run this twice, once with alpha=50, beta=50 and once with alpha=10, beta=90. The former run will have more possible non-edges, right? Let's say there are two non-edges, a and b, which were found in both runs. Is it not the case that if P(a) < P(b) in the first run, it will also be so in the second run? Is this an incorrect intuition? In other words, will adjusting alpha and beta change the relative ordering of the probabilities (for the pairs common to both)?

Thanks for your help, as always

On Sat, Apr 11, 2020 at 7:02 AM Tiago de Paula Peixoto <tiago@skewed.de> wrote:
Am 11.04.20 um 07:24 schrieb Deklan Webster:
>> It's a sum in log-space, so it's incremental.
>
> Can you show me in code how to use logsumexp in such a way that I
> don't need to store 6,000,000 * k floats in memory where k is the number
> of sampling iterations?

It is so utterly simple, it is *exactly* like doing sums. Here is the
log sum of the values from 1 to 1000, without keeping anything in memory:

   import numpy
   from scipy.special import logsumexp

   res = -numpy.inf  # log(0)
   for x in range(1, 1001):  # the values 1 to 1000
       res = logsumexp([res, numpy.log(x)])


>> Most other link prediction methods only provide a ranking or scores of
>> the edges. In order to actually predict edges, all these methods require
>> you to determine how many edges should be actually placed.
>
> In your setup you still end up with a probability in the end. What
> probability cutoff you consider to be a "real missing edge" or not is
> still arbitrary, correct?

Not quite, at least not up to the choice of a loss function. For binary
classification using the 0-1 loss (there are not many alternatives) the
optimal threshold is 1/2.

(I discuss this in the PRX paper, which I get the impression you did not
yet bother to read.)

> Using a binary classifier and whatever link
> prediction features you also can get probabilities in the end.

Not without making an _arbitrary_ choice of normalization, which results
in an _arbitrary_ number of edges being predicted.

> For
> instance, and relating to the point of choosing that f ratio you
> mention, I counted some probabilities to get an estimate on the number
> of "actually missing" edges. I've confirmed that these probabilities are
> fairly well calibrated by removing edges and predicting, etc.
>
> 39 with probability above 0.80
> 143 with probability above 0.70
> 318 with probability above 0.60

Without saying how you normalized the scores to begin with, these values
are meaningless.

>> You should also try to consider using a different framework, where you
>> leave the missing edges "unobserved", instead of transforming them into
>> nonedges. The distinction is important, and changes the reconstruction
>> problem. You can achieve this with MeasuredBlockState by setting the
>> corresponding n value to zero.
>
> Do you mean setting n_default=0? Just tried that and it appears like
> almost all of my existing edges are missing from the marginal graph. Not
> sure what's going on there. I tried setting mu and nu to extreme values
> control this(?) but it doesn't appear to help much. With alpha=1,
> beta=1, mu=1, nu=9999: 20733 out of 20911 of my original edges are
> missing. What's happening here?

That is not what I meant. Setting n_default=0 means saying that every
non-edge is non-observed. You should set n=0 only to the edges
'removed', and to a comparable number of actual non_edges.

> Okay, so to summarize what's happening here: if alpha and beta are both
> 1 then p is uniform from [0,1], and it appears the model is favoring
> very low p for my graph. By setting alpha and beta I can control the
> distribution of p. By maintaining this alpha/(alpha+beta) ratio while
> increasing alpha and beta I can decrease the variance, which forces the
> p to be even higher (for the same mean), because it wants to go to the
> lower tail. That correct?

Right.

> Is there any reason not to set something very aggressive like alpha=500,
> beta=500? I just tested this is as a feature in my binary classifier
> with 5000 sampling iterations and it performed much worse than the
> conditional prob setup (with only 100 sampling iterations). I expected
> this marginal prob setup to work comparably to the conditional prob
> setup. Thoughts?

Choosing alpha=beta=infinity means being 100% sure that the value of r
(probability of missing edges) is 1/2. Does that value make sense to
you? Did you remove exactly 1/2 of the edges?

Given the true value of r, the optimal choice should be beta=[(1-r)/r]*
alpha, and making alpha -> infinity (in practice just a large-enough value).

Best,
Tiago

--
Tiago de Paula Peixoto <tiago@skewed.de>

_______________________________________________
graph-tool mailing list
graph-tool@skewed.de
https://lists.skewed.de/mailman/listinfo/graph-tool