Question about normalized RMI

minpark815 · June 4, 2025, 12:08am

Hello Professor Peixoto,

I was wondering whether it is OK for normalized RMI values to be much greater than 1. I am calling graph_tool.inference.reduced_mutual_information(x, y, norm=True), i.e. with the norm=True flag set.

Below I list the two environments I tested my code on, the partitions I used, and an example code snippet that reads the partitions in.

Personal mac environment

Graph tool version here is 2.97 (commit cd15650e, Wed Apr 16 19:28:42 2025 +0200) with Python 3.13.3 (main, Apr 8 2025, 13:54:08) [Clang 16.0.0 (clang-1600.0.26.6)] on darwin
Operating system here is Darwin 24.3.0 Darwin Kernel Version 24.3.0: Thu Jan 2 20:31:46 PST 2025; root:xnu-11215.81.4~4/RELEASE_ARM64_T8132 arm64. It’s an M4 MacBook Air running Sequoia 15.3.2.

Linux server environment

Graph tool version here is 2.77 (commit c6a3b9f3, Fri Aug 9 15:03:36 2024 -0300) with Python 3.10.15 (main, Sep 7 2024, 18:35:33) [GCC 9.4.0] on linux
Operating system here is Linux 5.14.0-427.50.1.el9_4.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Dec 18 13:06:23 EST 2024 x86_64 x86_64 x86_64 GNU/Linux.

Minimal working example:
estimated_partition.csv (322.2 KB)
groundtruth_partition.csv (301.8 KB)

The two uploaded CSV files are partitions in which each row is node_id,cluster_id, sorted by node ID. The code snippet at the bottom of this post reads each file into an array, and the two arrays are then passed to the RMI function. While reading the clustering files, I check that:

Both lists have the same length, meaning the same number of nodes
Both lists contain exactly one cluster ID per node

When I run the code snippet at the bottom of this post, I get the following output:

norm=false: 2.293921006280844
norm=true: 2.3954806656880083

import graph_tool.all as gt
import numpy as np

def read_clustering(filepath):
    """Read a node_id,cluster_id CSV into an array indexed by node ID."""
    # First pass: count the rows to size the array.
    # -1 marks "not yet assigned", so cluster IDs must be non-negative.
    with open(filepath, "r") as f:
        num_nodes = len(f.readlines())
    current_partition = np.full(num_nodes, -1)
    # Second pass: store each node's cluster ID and reject duplicate node IDs.
    with open(filepath, "r") as f:
        for line in f:
            node_id, cluster_id = line.strip().split(",")
            if current_partition[int(node_id)] != -1:
                print("duplicate node id in clustering file. aborting.")
                raise SystemExit(1)
            current_partition[int(node_id)] = int(cluster_id)

    # Every node must have received a cluster ID.
    for cluster_id in current_partition:
        if cluster_id == -1:
            print("missing cluster id information for a node. aborting.")
            raise SystemExit(1)
    return current_partition

groundtruth_partition = read_clustering("./groundtruth_partition.csv")
estimated_partition = read_clustering("./estimated_partition.csv")

# Both partitions must cover the same number of nodes.
if len(groundtruth_partition) != len(estimated_partition):
    print("partitions don't match in length. aborting.")
    raise SystemExit(1)

print(f"norm=false: {gt.reduced_mutual_information(groundtruth_partition, estimated_partition, norm=False)}")
print(f"norm=true: {gt.reduced_mutual_information(groundtruth_partition, estimated_partition, norm=True)}")
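As a possible cross-check that does not depend on graph-tool, the plain (unreduced) mutual information and the two partition entropies can be computed directly from the label arrays with NumPy. This is only a sketch: it is not the RMI, because it leaves out the correction term that RMI subtracts, and the function names `entropy_from_counts` and `plain_mutual_information` are hypothetical helpers of mine, not part of graph-tool. It uses natural logarithms; I have not verified that this matches the units of graph-tool's output.

```python
import numpy as np

def entropy_from_counts(counts):
    """Shannon entropy (in nats) of the distribution given by integer counts."""
    counts = np.asarray(counts, dtype=float)
    p = counts[counts > 0] / counts.sum()
    return float(-(p * np.log(p)).sum())

def plain_mutual_information(x, y):
    """Return (MI, H(x), H(y)) in nats for two label arrays of equal length.

    This is the ordinary mutual information, without the RMI correction.
    """
    x = np.asarray(x).ravel()
    y = np.asarray(y).ravel()
    if x.shape != y.shape:
        raise ValueError("partitions must have the same length")
    # Relabel both partitions to contiguous integers 0..R-1 and 0..S-1.
    _, xi = np.unique(x, return_inverse=True)
    _, yi = np.unique(y, return_inverse=True)
    num_y_groups = int(yi.max()) + 1
    # Encode each (x, y) label pair as one integer so that the non-empty
    # contingency table cells can be counted without a dense R x S table.
    pair_key = xi.astype(np.int64) * num_y_groups + yi
    _, joint_counts = np.unique(pair_key, return_counts=True)
    h_x = entropy_from_counts(np.bincount(xi))
    h_y = entropy_from_counts(np.bincount(yi))
    h_xy = entropy_from_counts(joint_counts)
    # I(x; y) = H(x) + H(y) - H(x, y)
    return h_x + h_y - h_xy, h_x, h_y
```

With the arrays from the snippet above, `mi, h_x, h_y = plain_mutual_information(groundtruth_partition, estimated_partition)` gives a plain MI that is bounded by `min(h_x, h_y)`, so `2 * mi / (h_x + h_y)` cannot exceed 1. That gives a reference point to compare the two RMI values against.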

I saw another post about RMI values, but it was mainly concerned with values being negative and with reconciling the different results that come from the different bit representation in Newman’s code.