3. Use case: node embeddings#

The EDRep algorithm has a natural application to graph node embeddings. In fact, the probability matrix \(P\) can be interepreted as a row-normalized graph adjacency matrix or as the limiting probability distirbution of random walks on the graph. As such, EDRep can be efficiently adopted as an alternative to DeepWalk. Moreover, our framing allows us to easily create embeddings for weigthed and directed graphs which are often hard to deal with.

We hence provide a ready-to-use implementation of our algorithm for node embeddings. Specifically, we let \(P = \frac{1}{L}\sum_{a = 1}^L (D^{-1}A)^a\), where \(A\) is the graph adjacency matrix, \(D\) is the diagonal degree matrix and \(L\) is a parameter. In words, \(P_{ij}\) is the limiting probability for move from \(i\) to \(j\) in \(L\) steps or fewer.

The function NodeEmbedding allows us to create graph embeddings following this strategy. We compare the results on a community detection algorithm with the optimal Bethe-Hessian spectral clustering algorithm. This algorithm estimates the number of communities even if they are unknown at finds an optimal parametrization of the Bethe-Hessian matrix. The proposed EDRep embedding performs similarly to this optimal spectral clustering algorithm even if it is not capable of estimating the number of communities. From a computational standpoint, EDRep is faster than the SC algorithm, especially when there is a large number of communities (see what happens when changing the n_clusters parameter)

import numpy as np
from time import time

from src.dcsbm import *

from EDRep import NodeEmbedding

3.1. Generate a graph from the Degree-corrected stochastic block model#

We consider a graph with \(n = 60~000\) nodes and \(k\) communities, with average degree \(c = 10\) and \(c_{\rm out}\) average edges between different communities. We let the degree distribution follow a negative binomial.

n_clusters = 6
n = 60000

# degree distribution
theta = np.random.negative_binomial(3,0.1, n)+1
theta = theta/np.mean(theta)

# label vector
π = np.ones(n_clusters)/n_clusters
label = np.concatenate([[j for i in range(int(π[j]*n))] for j in range(n_clusters)])

c = 10
c_out = 6/n_clusters

C = matrix_C(c_out, c, 0., π)

# generate the adjacency matrix
A,  = adj(C,c, label, theta)

We now create the embedding. The function NodeEmbedding takes three main inputs (on top of the ones required by the EDRep algorithm)

  • A: the graph adjacency matrix. This has to be in scipy.sparse format and it must be non negative. It can be weighted and not symmetric

  • dim: this is the embedding dimension

  • lenght: this is the (optional) walk length parameter \(L\)

dim = 32
    
# get the EDRep embedding
t0 = time()
res = NodeEmbedding(A, dim, walk_length = 1)
print(f'EDRep execution time: {time() - t0}\n')

# get the SC clustering embedding
t0 = time()
X_SC = community_detection(A)
print(f'\nSC execution time: {time() - t0}')
Running the optimization for k = 1
EDRep execution time: 4.730544805526733

Number of clusters detected : 6

Estimating zeta : 02
SC execution time: 82.89865016937256
# Run kmeans on the obtained embeddings and print the performance
print(f'Performance for EDRep: {computeScore(res.X, label)}')
print(f'Performance for SC: {computeScore(X_SC, label)}')
Performance for EDRep: 0.8414014375831496
Performance for SC: 0.859684613269637