It is so utterly simple, it is *exactly* like doing sums.

Oh, I see now. Thanks!

(I discuss this in the PRX paper, which I get the impression you did not yet bother to read.)

Actually, I did read some of it before my last message (the terrorist network example is pretty cool). But, no, I didn't read it front to back. I apologize if it seems I'm not putting in effort trying to understand. I just went ahead and read it. I understood more than I thought I would. I found many minor typos, if you're interested in fixing them, but I also found some more confusing errors (I *think* I'm reading the latest version...):

In FIG 1 it says "probability q of missing edge"; shouldn't that be "spurious edge"?

At one point you say "If an entry is not observed, it has a value of n_ij = 0, which is different from it being observed with n_ij > 0 as a nonedge x_ij = 0." Shouldn't the terminology for n_ij = 0 be "unmeasured" rather than "unobserved"? That really confused me in the following paragraphs.

"Inverson bracket" should be "Iverson bracket"

Not without making an _arbitrary_ choice of normalization, which results in an _arbitrary_ number of edges being predicted.

Without saying how you normalized the scores to begin with, these values are meaningless.

Okay, for clarity, here are two situations:

1) I take a similarity measure, say the "dice" similarity measure from graph-tool, and compute it for every vertex pair. Then, I pick an arbitrary cutoff and consider everything greater than that a "missing edge".

2) I turn this into a supervised classification problem by randomly removing edges, computing whatever features for every vertex pair (say, every similarity measure from graph-tool: dice, salton, hub-promoted, etc.), and then training a binary classifier using the removed edges, non-edges, and the corresponding features. (Setting aside the tedious details of properly constructing train, test, and validation sets from the edge/non-edge sets.)
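To make setup 2 concrete, here's a minimal, self-contained sketch: a toy graph, a hand-rolled Dice similarity as the single feature, and scikit-learn's LogisticRegression standing in for the LGBMClassifier. The graph, sample sizes, and classifier are all placeholders, not my actual pipeline:

```python
import random
from itertools import combinations
from sklearn.linear_model import LogisticRegression

random.seed(0)

# Toy graph as an adjacency-set dict; a real run would use graph-tool.
n = 30
edges = {tuple(sorted((i, (i + 1) % n))) for i in range(n)}
edges |= {tuple(sorted(random.sample(range(n), 2))) for _ in range(40)}

def adjacency(es):
    adj = {v: set() for v in range(n)}
    for a, b in es:
        adj[a].add(b)
        adj[b].add(a)
    return adj

def dice(adj, a, b):
    # Dice similarity: 2|N(a) ∩ N(b)| / (|N(a)| + |N(b)|)
    denom = len(adj[a]) + len(adj[b])
    return 2 * len(adj[a] & adj[b]) / denom if denom else 0.0

# Randomly "remove" edges: these become the positive training examples,
# and the features are computed on the remaining observed graph.
removed = set(random.sample(sorted(edges), len(edges) // 5))
adj = adjacency(edges - removed)

nonedges = [p for p in combinations(range(n), 2) if p not in edges]
negatives = random.sample(nonedges, len(removed))

X = [[dice(adj, a, b)] for a, b in list(removed) + negatives]
y = [1] * len(removed) + [0] * len(negatives)

clf = LogisticRegression().fit(X, y)
# predict_proba yields (uncalibrated) probabilities for unseen pairs
probs = clf.predict_proba([[dice(adj, a, b)] for a, b in nonedges])[:, 1]
```

No arbitrary score cutoff enters anywhere: the classifier is fit against the ground truth provided by the removed edges.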

I am doing something like 2, not something like 1. I agree that 1 involves an arbitrary choice; 2 doesn't. With 2 I can calibrate probabilities which *do* have some meaning: they're based on my ability to predict the edges I removed. This is what the Nearly Optimal Link Prediction paper does.

To be specific, the probabilities I showed you are being spit out of an LGBMClassifier from LightGBM. The LightGBM docs recommend calibrating these probabilities further using the procedure detailed here https://scikit-learn.org/stable/modules/calibration.html using https://scikit-learn.org/stable/modules/generated/sklearn.calibration.CalibratedClassifierCV.html, i.e. Platt scaling or isotonic regression. The references on the sklearn page are from 1999-2005, so I'm assuming there are better ways to calibrate probabilities by now.
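For reference, a minimal sketch of what that calibration step looks like, on synthetic data rather than my graph features (the base classifier, cv folds, and method choice are placeholders; method="sigmoid" is Platt scaling, "isotonic" the non-parametric alternative):

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the (features, removed-edge labels) data
X, y = make_classification(n_samples=2000, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

base = LogisticRegression(max_iter=1000)
# Platt scaling: fit a sigmoid on the base classifier's scores via cross-validation
cal = CalibratedClassifierCV(base, method="sigmoid", cv=5).fit(X_tr, y_tr)
probs = cal.predict_proba(X_te)[:, 1]
```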

From the paper,

Although this distribution can be used to perform the same binary classification task, by using the posterior marginal probabilities π_ij as the aforementioned "scores," it contains substantially more information. For example, the number of missing and spurious edges (and, hence, the inferred probabilities p and q) are contained in this distribution and thus do not need to be prespecified. Indeed, our method lacks any kind of ad hoc input, such as a discrimination threshold [note that the threshold 1/2 in the MMP estimator of Eq. (44) is a derived optimum, not an input]. This property means that absolute assessments such as the similarity of Eq. (45) can be computed instead of relative ones such as the AUC.

Not sure what you mean by AUROC being 'relative'. As you say, you need a score threshold to make predictions in the end, but computing AUROC itself doesn't require a threshold. There are other non-threshold-based metrics one can use.
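For instance, with toy labels and scores, the AUROC is computed directly from the raw scores, no threshold involved, and any monotone rescaling of the scores leaves it unchanged:

```python
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1, 0, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.2, 0.7]

# AUROC depends only on the ranking of the scores, not their scale
auc = roc_auc_score(y_true, scores)
auc_rescaled = roc_auc_score(y_true, [10 * s + 3 for s in scores])
```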

Okay, I was curious, so I tested it out: I removed 1000 edges (of 21k; the structure was not destroyed) at random from the graph and ran this marginal probability procedure with *defaults*: 1 measurement per edge, 1 observation per edge, 1 measurement per non-edge, 0 observations per non-edge, and alpha, beta, mu, nu all 1. For the equilibration I used wait=1000, epsilon=1e-5, multiflip=True. For sampling I used 5000 iterations.

In the resulting marginal graph, 2470 non-edges from the original graph appear, and 224 edges from the original graph don't appear. Of the 1000 edges I removed, only 181 appear in the marginal graph. Okay, but for the 181 edges that did appear, were the probabilities at least high? Here's the histogram of probabilities: https://i.imgur.com/kZcR0W1.png As you can see, almost all the probabilities are low. How many exceed 0.5? 16. So of the original 1000 edges I removed, it correctly identified 16/1000 = 0.016 of them.

Given the true value of r, the optimal choice should be beta = [(1-r)/r]*alpha, and making alpha -> infinity (in practice just a large-enough value).

Okay, so even though this is not realistic in an exploratory context, I'll try adjusting alpha and beta to be ideal. 1000 out of 21k is about 5%. So, using your formula, I'll try alpha=100, beta=1900. Here are the results:

10850 non-edges from the original graph appear in the marginal graph, and 3 edges from the original graph were not found. 450/1000 removed edges appear in the marginal graph, and 25/1000 = 0.025 have probability > 0.5.
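Spelling out that prior arithmetic, with r rounded to 5% as above:

```python
alpha = 100
r = 0.05                      # 1000 removed edges out of ~21k, rounded to 5%
beta = (1 - r) / r * alpha    # beta ≈ 1900, the value used in the run above
```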

So, slightly better, but still very poor. A binary classifier trained with just the simple similarity measures implemented in graph-tool (as described above) as features would give better results than this. Indeed, the conditional probability method is much, much better (based on SHAP importance in my binary classifier). How can that be? Surely I'm missing something or making a mistake somewhere?

That is not what I meant. Setting n_default=0 means saying that every non-edge is non-observed. You should set n=0 only to the edges 'removed', and to a comparable number of actual non-edges.

After reading the paper, I see what you meant now. If I understand correctly, this doesn't apply to my situation. Maybe I'm wrong? If someone says "Here is a real-world graph. Tell me what the missing edges are," this method doesn't appear to be applicable?

Although, I can test it out just to see if it performs any better:

I selected 1000 random edges and set their n and x to 0. Then I selected 2000 random non-edges, added them to the graph, and set their n and x to 0.

Of the 1000 random edges, 981 were found in the marginal graph; 271/1000 = 0.271 had probability > 0.5. Of the 2000 random non-edges, 952 were found in the marginal graph, and only 12 had probability > 0.5.

Assuming I did this right, this method seems to work much better! It didn't appear to be fooled by the unmeasured non-edges.

Choosing alpha=beta=infinity means being 100% sure that the value of r (probability of missing edges) is 1/2. Does that value make sense to you? Did you remove exactly 1/2 of the edges?

I was playing with different values, as I showed you. I'll be more explicit about my thought process:

Let's say I run this twice, once with alpha=50, beta=50 and once with alpha=10, beta=90. The former run will have more possible non-edges, right? Let's say that there are two non-edges, a and b, which were found in both runs. Is it not the case that if P(a) < P(b) in the first run, it will also be so in the second run? Is this an incorrect intuition? In other words, will adjusting alpha and beta change the relative ordering of the probabilities (for the common ones)?

Thanks for your help, as always
