Hello Professor Peixoto,
I was wondering whether it is OK for normalized RMI values to be much greater than 1. I am calling graph_tool.inference.reduced_mutual_information(x, y, norm=True), i.e. with the norm=True flag.
Below I list the two environments I tested my code on, the partitions I used, and an example code snippet that reads the partitions in.
Personal Mac environment
- Graph tool version here is 2.97 (commit cd15650e, Wed Apr 16 19:28:42 2025 +0200) with Python 3.13.3 (main, Apr 8 2025, 13:54:08) [Clang 16.0.0 (clang-1600.0.26.6)] on darwin
- Operating system here is Darwin 24.3.0 Darwin Kernel Version 24.3.0: Thu Jan 2 20:31:46 PST 2025; root:xnu-11215.81.4~4/RELEASE_ARM64_T8132 arm64. It’s an M4 MacBook Air running Sequoia 15.3.2.
Linux server environment
- Graph tool version here is 2.77 (commit c6a3b9f3, Fri Aug 9 15:03:36 2024 -0300) with Python 3.10.15 (main, Sep 7 2024, 18:35:33) [GCC 9.4.0] on linux
- Operating system here is Linux 5.14.0-427.50.1.el9_4.x86_64 #1 SMP PREEMPT_DYNAMIC Wed Dec 18 13:06:23 EST 2024 x86_64 x86_64 x86_64 GNU/Linux.
Minimal working example:
estimated_partition.csv (322.2 KB)
groundtruth_partition.csv (301.8 KB)
The two uploaded CSV files are partitions in which each row is node_id,cluster_id. Both files are sorted by node ID. In the code snippet at the bottom of this post, I read these two files into two arrays, which are then passed to the RMI function. As I read the clustering files, I check to make sure that
- Both lists have the same length, meaning same number of nodes
- Both lists contain exactly one cluster ID per node
When I run the code below, I get a result like this:
norm=false: 2.293921006280844
norm=true: 2.3954806656880083
import graph_tool.all as gt
import numpy as np

def read_clustering(filepath):
    num_nodes = 0
    with open(filepath, "r") as f:
        num_nodes = len(f.readlines())
    current_partition = np.full(num_nodes, -1)
    with open(filepath, "r") as f:
        for line in f:
            node_id,cluster_id = line.strip().split(",")
            if current_partition[int(node_id)] != -1:
                print("duplicate node id in clustering file. aborting.")
                exit(1)
            current_partition[int(node_id)] = int(cluster_id)
    for cluster_id in current_partition:
        if cluster_id == -1:
            print("missing cluster id information for a node. aborting.")
            exit(1)
    return current_partition

groundtruth_partition = read_clustering("./groundtruth_partition.csv")
estimated_partition = read_clustering("./estimated_partition.csv")
if len(groundtruth_partition) != len(estimated_partition):
    print("partitions don't match in length. aborting.")
    exit(1)
print(f"norm=false: {gt.reduced_mutual_information(groundtruth_partition, estimated_partition, norm=False)}")
print(f"norm=true: {gt.reduced_mutual_information(groundtruth_partition, estimated_partition, norm=True)}")
I saw another post about RMI values, but that post seemed more concerned with values being negative and with reconciling the differing results that come from the different bit representation in Newman’s code.
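For anyone reproducing this, here is a minimal sketch of an independent cross-check that could be run on the same two arrays. It is not part of my script above and does not use graph-tool. It computes ordinary (non-reduced) mutual information in nats and normalizes it by the mean of the two entropies, so its values are not directly comparable to RMI; it only confirms that the partitions themselves give a conventional normalized score between 0 and 1. The helper names `entropy`, `mutual_information` and `plain_nmi` are hypothetical, chosen for this illustration.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in nats) of a sequence of cluster labels."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())

def mutual_information(x, y):
    """Plain (non-reduced) mutual information, in nats, between two labelings."""
    n = len(x)
    joint = Counter(zip(x, y))      # contingency table as {(label_x, label_y): count}
    cx, cy = Counter(x), Counter(y)  # marginal cluster sizes
    return sum(
        (c / n) * math.log(c * n / (cx[a] * cy[b]))
        for (a, b), c in joint.items()
    )

def plain_nmi(x, y):
    """Mutual information normalized by the mean of the two entropies."""
    if len(x) != len(y):
        raise ValueError("partitions must have the same length")
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        # Both partitions put every node in a single cluster.
        return 1.0
    return 2 * mutual_information(x, y) / (hx + hy)
```

With the arrays from the snippet above, this could be called as `plain_nmi(groundtruth_partition.tolist(), estimated_partition.tolist())`.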