Quick start example

Users of scikit-hubness typically want to

  1. analyse whether their data show hubness

  2. reduce hubness

  3. perform learning (classification, regression, …)

The following example walks through all three steps on dexter, an example dataset from the text domain. Make sure scikit-hubness is installed first (see the installation instructions).

First, we load the dataset and inspect its size.

from skhubness.data import load_dexter
X, y = load_dexter()
print(f'X.shape = {X.shape}, y.shape={y.shape}')

Dexter is embedded in a high-dimensional space and could therefore be prone to hubness. We thus assess its actual degree of hubness.

from skhubness import Hubness
hub = Hubness(k=10, metric='cosine')
hub.fit(X)
k_skew = hub.score()
print(f'Skewness = {k_skew:.3f}')
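To make the skewness score more concrete, here is a minimal sketch of what it measures: the k-occurrence of a point is the number of times it appears among the k nearest neighbors of other points, and hubness is commonly quantified as the skewness of that distribution. The helper names (`cosine_distance`, `k_occurrence`, `skewness`) and the random toy data are assumptions for illustration. They are not part of scikit-hubness, and the library's own implementation may differ in details.

```python
import math
import random


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def k_occurrence(points, k):
    """Count how often each point is among the k nearest neighbors of the others."""
    n = len(points)
    counts = [0] * n
    for i in range(n):
        # Distances from point i to every other point (the point itself is excluded).
        dists = [(cosine_distance(points[i], points[j]), j)
                 for j in range(n) if j != i]
        dists.sort()
        for _, j in dists[:k]:
            counts[j] += 1
    return counts


def skewness(values):
    """Third standardized moment (population version); undefined for constant input."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


random.seed(0)
# Hypothetical toy data: 100 random points in 30 dimensions.
points = [[random.gauss(0.0, 1.0) for _ in range(30)] for _ in range(100)]
counts = k_occurrence(points, k=10)
print(f'k-skewness (toy data) = {skewness(counts):.3f}')
```

A symmetric k-occurrence distribution gives a skewness near zero. A long right tail, where a few points are neighbors of very many others, gives a large positive value.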

As a rule of thumb, skewness > 1.2 indicates significant hubness. Additional hubness indices are available, for example:

print(f'Robin hood index: {hub.robinhood_index:.3f}')
print(f'Antihub occurrence: {hub.antihub_occurrence:.3f}')
print(f'Hub occurrence: {hub.hub_occurrence:.3f}')
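As a rough illustration of what these indices express, the sketch below computes them from a list of k-occurrence counts. It assumes the commonly used definitions: the Robin Hood index is the share of neighbor slots that would have to be redistributed to make all counts equal, antihubs are points that never occur as a neighbor, and hubs are points whose count reaches a multiple of k. The function names, the `hub_size=2.0` threshold, and the counts themselves are assumptions made for this example, not the library's API.

```python
def robinhood_index(counts):
    """Share of all neighbor slots that would have to move to equalize the counts."""
    total = sum(counts)
    mean = total / len(counts)
    return 0.5 * sum(abs(c - mean) for c in counts) / total


def antihub_occurrence(counts):
    """Fraction of points that never appear in any neighbor list."""
    return sum(1 for c in counts if c == 0) / len(counts)


def hub_occurrence(counts, k, hub_size=2.0):
    """Fraction of all neighbor slots taken by hubs (count >= hub_size * k)."""
    slots_taken_by_hubs = sum(c for c in counts if c >= hub_size * k)
    return slots_taken_by_hubs / (k * len(counts))


# Hypothetical k-occurrence counts for 6 points with k = 2 (they sum to 6 * 2).
toy_counts = [0, 0, 1, 2, 4, 5]
toy_k = 2

print(f'Robin Hood index:   {robinhood_index(toy_counts):.3f}')
print(f'Antihub occurrence: {antihub_occurrence(toy_counts):.3f}')
print(f'Hub occurrence:     {hub_occurrence(toy_counts, toy_k):.3f}')
```

For these toy counts, the Robin Hood index is 5/12, the antihub occurrence is 1/3, and the hub occurrence is 3/4: two of the six points occupy three quarters of all neighbor slots.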

There is considerable hubness in dexter. Let’s see whether hubness reduction can improve kNN classification performance.

from sklearn.model_selection import cross_val_score
from skhubness.neighbors import KNeighborsClassifier

# vanilla kNN
knn_standard = KNeighborsClassifier(n_neighbors=5,
                                    metric='cosine')
acc_standard = cross_val_score(knn_standard, X, y, cv=5)

# kNN with hubness reduction (mutual proximity)
knn_mp = KNeighborsClassifier(n_neighbors=5,
                              metric='cosine',
                              hubness='mutual_proximity')
acc_mp = cross_val_score(knn_mp, X, y, cv=5)

print(f'Accuracy (vanilla kNN): {acc_standard.mean():.3f}')
print(f'Accuracy (kNN with hubness reduction): {acc_mp.mean():.3f}')

Accuracy was considerably improved by mutual proximity (MP). But did MP actually reduce hubness?

hub_mp = Hubness(k=10, metric='cosine',
                 hubness='mutual_proximity')
hub_mp.fit(X)
k_skew_mp = hub_mp.score()
print(f'Skewness after MP: {k_skew_mp:.3f} '
      f'(reduction of {k_skew - k_skew_mp:.3f})')
print(f'Robin hood: {hub_mp.robinhood_index:.3f} '
      f'(reduction of {hub.robinhood_index - hub_mp.robinhood_index:.3f})')

Yes: both the skewness and the Robin Hood index are lower after MP.
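For intuition about why MP helps, the following sketch implements a simple empirical variant of mutual proximity on a full distance matrix. Two points count as close only if, from the viewpoint of both, most other points are farther away, which makes neighborhood relations more symmetric. The function name `mutual_proximity_distances`, the normalization by the n - 2 remaining points, and the toy data are assumptions for illustration. scikit-hubness may use a different variant or an approximation.

```python
def mutual_proximity_distances(dist):
    """Turn a symmetric distance matrix into empirical MP secondary distances.

    dist is an n x n list of lists with n >= 3.
    """
    n = len(dist)
    secondary = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(x + 1, n):
            d_xy = dist[x][y]
            # Count the other points that are farther from x AND farther from y
            # than x and y are from each other.
            farther = sum(1 for j in range(n)
                          if j != x and j != y
                          and dist[x][j] > d_xy and dist[y][j] > d_xy)
            mp = farther / (n - 2)  # assumed normalization: the n - 2 other points
            secondary[x][y] = secondary[y][x] = 1.0 - mp
    return secondary


# Hypothetical toy data: four points on a line, the last one far away.
positions = [0.0, 1.0, 3.0, 10.0]
toy_dist = [[abs(a - b) for b in positions] for a in positions]
toy_secondary = mutual_proximity_distances(toy_dist)

for row in toy_secondary:
    print(' '.join(f'{value:.2f}' for value in row))
```

In the toy data, the first two points are mutual nearest neighbors and receive a secondary distance of 0, while the outlier stays at the maximum distance of 1 from every other point.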

The neighbor graph can also be created directly, with or without hubness reduction:

from skhubness.neighbors import kneighbors_graph
neighbor_graph = kneighbors_graph(X,
                                  n_neighbors=5,
                                  hubness='mutual_proximity')

You may want to precompute the graph like this to avoid recomputing it for subsequent hubness estimation and learning.