scikit-hubness: high-dimensional data mining

scikit-hubness is a Python package for analysis of hubness in high-dimensional data. It provides hubness reduction and approximate nearest neighbor search via a drop-in replacement for sklearn.neighbors.

Installation

Installation from PyPI

The current release of scikit-hubness can be installed from PyPI:

pip install scikit-hubness

Dependencies

All strict dependencies of scikit-hubness are automatically installed by pip. Some optional dependencies (certain ANN libraries) may not yet be available from PyPI. If you require one of these libraries, please refer to the library’s documentation for building instructions. For example, at the time of writing, puffinn was not available on PyPI. Building and installing is straightforward:

git clone https://github.com/puffinn/puffinn.git
cd puffinn
python3 setup.py build
pip install .

Installation from source

You can always grab the latest version of scikit-hubness directly from GitHub:

cd install_dir
git clone git@github.com:VarIr/scikit-hubness.git
cd scikit-hubness
pip install -e .

This is the recommended approach if you want to contribute to the development of scikit-hubness.

Supported platforms

scikit-hubness currently supports all major operating systems:

  • Linux

  • MacOS X

  • Windows

Note that not all approximate nearest neighbor algorithms used in scikit-hubness are available on all platforms. This is because we rely on third-party libraries, which in some cases are not available for all platforms. The table below indicates which libraries and algorithms are currently supported on each operating system. All exact nearest neighbor algorithms (as provided by scikit-learn) are available on all platforms.

library    algorithm      Linux   MacOS   Windows
nmslib     hnsw           x       x       x
annoy      rptree         x       x       x
ngtpy      nng            x       x
falconn    falconn_lsh    x       x
puffinn    lsh            x       x
sklearn    (all exact)    x       x       x

Quick start example

Users of scikit-hubness typically want to

  1. analyse whether their data show hubness

  2. reduce hubness

  3. perform learning (classification, regression, …)

The following example shows all these steps for an example dataset from the text domain (dexter). Please make sure you have installed scikit-hubness (see the installation instructions above).

First, we load the dataset and inspect its size.

from skhubness.data import load_dexter
X, y = load_dexter()
print(f'X.shape = {X.shape}, y.shape={y.shape}')

Dexter is embedded in a high-dimensional space and could thus be prone to hubness. Therefore, we assess the actual degree of hubness.

from skhubness import Hubness
hub = Hubness(k=10, metric='cosine')
hub.fit(X)
k_skew = hub.score()
print(f'Skewness = {k_skew:.3f}')

As a rule of thumb, skewness > 1.2 indicates significant hubness. Additional hubness indices are available, for example:

print(f'Robin hood index: {hub.robinhood_index:.3f}')
print(f'Antihub occurrence: {hub.antihub_occurrence:.3f}')
print(f'Hub occurrence: {hub.hub_occurrence:.3f}')

There is considerable hubness in dexter. Let’s see whether hubness reduction can improve kNN classification performance.

from sklearn.model_selection import cross_val_score
from skhubness.neighbors import KNeighborsClassifier

# vanilla kNN
knn_standard = KNeighborsClassifier(n_neighbors=5,
                                    metric='cosine')
acc_standard = cross_val_score(knn_standard, X, y, cv=5)

# kNN with hubness reduction (mutual proximity)
knn_mp = KNeighborsClassifier(n_neighbors=5,
                              metric='cosine',
                              hubness='mutual_proximity')
acc_mp = cross_val_score(knn_mp, X, y, cv=5)

print(f'Accuracy (vanilla kNN): {acc_standard.mean():.3f}')
print(f'Accuracy (kNN with hubness reduction): {acc_mp.mean():.3f}')

Accuracy was considerably improved by mutual proximity (MP). But did MP actually reduce hubness?

hub_mp = Hubness(k=10, metric='cosine',
                 hubness='mutual_proximity')
hub_mp.fit(X)
k_skew_mp = hub_mp.score()
print(f'Skewness after MP: {k_skew_mp:.3f} '
      f'(reduction of {k_skew - k_skew_mp:.3f})')
print(f'Robin hood: {hub_mp.robinhood_index:.3f} '
      f'(reduction of {hub.robinhood_index - hub_mp.robinhood_index:.3f})')

Yes!

The neighbor graph can also be created directly, with or without hubness reduction:

from skhubness.neighbors import kneighbors_graph
neighbor_graph = kneighbors_graph(X,
                                  n_neighbors=5,
                                  hubness='mutual_proximity')

You may want to precompute the graph like this, in order to avoid computing it repeatedly for subsequent hubness estimation and learning.
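
As a minimal sketch of such reuse (assuming a recent scikit-learn version that accepts sparse precomputed neighbor graphs, and a mode='distance' parameter as in scikit-learn's kneighbors_graph), the precomputed graph can be fed to a neighbors-based estimator with metric='precomputed':

from sklearn.neighbors import KNeighborsClassifier
from skhubness.neighbors import kneighbors_graph

# Precompute a hubness-reduced distance graph once, with more neighbors
# than any downstream estimator will request.
graph = kneighbors_graph(X, n_neighbors=30, mode='distance',
                         hubness='mutual_proximity')

# Recent scikit-learn estimators accept sparse precomputed neighbor graphs,
# so the graph can be reused instead of recomputing neighbors.
knn = KNeighborsClassifier(n_neighbors=5, metric='precomputed')
knn.fit(graph, y)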

User guide

Welcome to scikit-hubness! Here we describe the core functionality of the package (hubness analysis, hubness reduction, neighbor search), and provide several usage examples.

Core Concepts

There are three main parts of scikit-hubness. Before we describe the corresponding subpackages, we briefly introduce the hubness phenomenon itself.

The hubness phenomenon

Hubness is a phenomenon of intrinsically high-dimensional data, detrimental to data mining and learning tasks. It refers to the tendency of hub and antihub emergence in k-nearest neighbor graphs (kNNGs): hubs are objects that appear unusually often among the k-nearest neighbor lists of other objects, while antihubs hardly or never appear in these lists. Thus, hubs propagate their metainformation (such as class labels) widely within a kNNG. Conversely, information carried by antihubs is effectively lost. As a result, hubness leads to semantically distorted spaces that negatively impact a large variety of tasks.

Music information retrieval is a showcase example for hubness: it has been shown that recommendation lists based on music similarity scores tend to completely ignore certain songs (antihubs). On the other hand, other songs are recommended over and over again (hubs), sometimes even when they do not fit. Both effects are problematic: users are provided with unsuitable (hub) recommendations, while artists who (unknowingly) produce antihub songs may unjustly remain unknown.

The scikit-hubness package

scikit-hubness reflects our effort to make hubness analysis and hubness reduction readily available and easy-to-use for both machine learning researchers and practitioners.

The package builds upon scikit-learn. When feasible, its design decisions, code style, development practices, etc. are adopted, so that new users can work their way into scikit-hubness rapidly. Workflows, therefore, comprise the well-known fit, predict, and score methods.

Two subpackages offer complementary functionality to scikit-learn. The skhubness.analysis subpackage provides tools to assess whether data are prone to hubness (see the Hubness analysis section below).

The skhubness.neighbors subpackage acts as a drop-in replacement for sklearn.neighbors. It provides all of its functionality, and adds two major components:

  • transparent hubness reduction

  • approximate nearest neighbor (ANN) search

and combinations of both. From the coding point of view, this is achieved by adding a handful of new parameters to most classes (KNeighborsClassifier, RadiusNeighborsRegressor, NearestNeighbors, etc.).

  • hubness defines the hubness reduction algorithm used to compute the nearest neighbor graph (kNNG). Supported algorithms and corresponding parameter values are presented below, and are also available as a Python list in skhubness.reduction.hubness_algorithms (see the snippet after this list).

  • algorithm defines the kNNG construction algorithm, similarly to the way sklearn does it. That is, all of sklearn’s algorithms are available, and several approximate nearest neighbor algorithms are provided in addition. See below for a list of currently supported algorithms and their corresponding parameter values.
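
For instance, the list of supported hubness reduction identifiers can be inspected directly (a minimal sketch; the exact contents depend on the installed version):

from skhubness.reduction import hubness_algorithms
print(hubness_algorithms)  # e.g. ['mp', 'ls', 'dsl']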

By providing the two arguments above, you select algorithms for hubness reduction and nearest neighbor search, respectively. Most of these algorithms can be further tuned by individual hyperparameters. These are not made directly accessible in high-level classes like KNeighborsClassifier, because they differ from algorithm to algorithm and would lead to very long argument lists. Instead, two dictionaries

  • hubness_params and

  • algorithm_params

are available in all high-level classes to set the nested arguments for ANN and hubness reduction methods.

The following example shows how to perform approximate hubness estimation (1) without, and (2) with hubness reduction by local scaling in an artificial data set.

In part 1, we estimate hubness in the original data.

from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1_000_000,
                           n_features=500,
                           n_informative=400,
                           random_state=123)

from sklearn.model_selection import train_test_split
X_train, X_test = train_test_split(X, test_size=0.1, random_state=456)

from skhubness.analysis import Hubness
hub = Hubness(k=10,
              metric='euclidean',
              algorithm='hnsw',
              algorithm_params={'n_candidates': 100,
                                'metric': 'euclidean',
                                'post_processing': 2,
                                },
              return_value='robinhood',
              n_jobs=8,
              )
hub.fit(X_train)
robin_hood = hub.score(X_test)
print(robin_hood)
0.7873205555555555  # before hubness reduction

There is high hubness in this dataset. In part 2, we estimate hubness after reduction by local scaling.

hub = Hubness(k=10,
              metric='euclidean',
              hubness='local_scaling',
              hubness_params={'k': 7},
              algorithm='hnsw',
              algorithm_params={'n_candidates': 100,
                                'metric': 'euclidean',
                                'post_processing': 2,
                               },
              return_value='robinhood',
              verbose=2
              )
hub.fit(X_train)
robin_hood = hub.score(X_test)
print(robin_hood)
0.6614583333333331  # after hubness reduction

Approximate nearest neighbor search methods

Set the parameter algorithm to one of the following in order to enable ANN in most of the classes from skhubness.neighbors or Hubness:

  • ‘hnsw’ uses hierarchical navigable small-world graphs (provided by the nmslib library) in the wrapper class HNSW.

  • ‘lsh’ uses locality sensitive hashing (provided by the puffinn library) in the wrapper class PuffinnLSH.

  • ‘falconn_lsh’ uses locality sensitive hashing (provided by the falconn library) in the wrapper class FalconnLSH.

  • ‘nng’ uses ANNG or ONNG (provided by the NGT library) in the wrapper class NNG.

  • ‘rptree’ uses random projections trees (provided by the annoy library) in the wrapper class RandomProjectionTree.

Configure parameters of the chosen algorithm with algorithm_params. This dictionary is passed to the corresponding wrapper class. Take a look at their documentation to see which parameters are available for each individual class.
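
For example, a minimal sketch configuring HNSW through NearestNeighbors (using the n_candidates parameter that also appears in the examples above):

from sklearn.datasets import make_classification
from skhubness.neighbors import NearestNeighbors

X, _ = make_classification(n_samples=1_000, n_features=50, random_state=0)

# ANN search with HNSW; n_candidates controls how many neighbors the index
# retrieves per query before the final n_neighbors are returned.
ann = NearestNeighbors(n_neighbors=10,
                       algorithm='hnsw',
                       algorithm_params={'n_candidates': 100})
ann.fit(X)
neigh_dist, neigh_ind = ann.kneighbors(X)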

Hubness reduction methods

Set the parameter hubness to one of the following identifiers in order to use the corresponding hubness reduction algorithm:

  • ‘mp’ or ‘mutual_proximity’ use mutual proximity (Gaussian or empiric distribution) as implemented in MutualProximity.

  • ‘ls’ or ‘local_scaling’ use local scaling or NICDM as implemented in LocalScaling.

  • ‘dsl’ or ‘dis_sim_local’ use DisSim Local as implemented in DisSimLocal.

Variants and additional parameters are set with the hubness_params dictionary. Have a look at the individual hubness reduction classes for available parameters.
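
For example, a short sketch selecting the NICDM variant of local scaling (the same hubness_params also appear in the classification examples later in this guide):

from skhubness.neighbors import KNeighborsClassifier

# Local scaling in its NICDM variant, configured via hubness_params
knn = KNeighborsClassifier(n_neighbors=5,
                           hubness='ls',
                           hubness_params={'method': 'nicdm'})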

Approximate hubness reduction

Exact hubness reduction scales at least quadratically with the number of samples. To reduce computational complexity, approximate hubness reduction can be applied, as described in the paper “Fast approximate hubness reduction for large high-dimensional data” (ICBK2018, on IEEE Xplore, also available as technical report).

The general idea behind approximate hubness reduction works as follows:

  1. retrieve n_candidates-nearest neighbors using an ANN method

  2. refine and reorder the candidate list by hubness reduction

  3. return n_neighbors nearest neighbors from the reordered candidate list

The procedure is implemented in scikit-hubness by simply passing both algorithm and hubness parameters to the relevant classes.

Also consider passing algorithm_params={'n_candidates': n_candidates}. Make sure to set n_candidates high enough for high sensitivity (towards “good” nearest neighbors). Too large values may, however, lead to long query times. As a rule of thumb for this trade-off, you can start by retrieving ten times as many candidates as you need nearest neighbors.
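
Putting these pieces together, a minimal sketch of approximate hubness reduction with HNSW and local scaling, retrieving 100 candidates to return 10 neighbors:

from skhubness.neighbors import KNeighborsClassifier

# ANN candidate retrieval (step 1) + hubness reduction (step 2)
# + final n_neighbors selection (step 3), handled internally
knn = KNeighborsClassifier(n_neighbors=10,
                           algorithm='hnsw',
                           algorithm_params={'n_candidates': 100},
                           hubness='ls')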

Hubness analysis

You can use the skhubness.analysis subpackage to assess whether your data is prone to hubness. Currently, the Hubness class acts as a one-stop shop for hubness estimation. It provides several hubness measures that are all computed from a nearest neighbor graph (kNNG). More specifically, hubness is measured from k-occurrence, that is, how often an object occurs in the k-nearest neighbor lists of other objects (its reverse nearest neighbors). Traditionally, hubness has been measured by the skewness of the k-occurrence histogram, where higher skewness to the right indicates higher hubness (due to objects that appear very often as nearest neighbors). Recently, additional indices borrowed from inequality research have been proposed for measuring hubness, such as the Robin Hood index or the Gini index of k-occurrences, which may have more desirable properties w.r.t. large datasets and interpretability.

The Hubness class provides a variety of these measures. It is based on scikit-learn’s BaseEstimator, and thus follows scikit-learn principles. When a new instance is created, sensible default parameters are used, unless specific choices are made. Typically, the user may want to choose a parameter k to define the size of nearest neighbor lists, or metric, in case the default Euclidean distances do not fit the data well. The return_value parameter defines which hubness measures to use. VALID_HUBNESS_MEASURES provides a list of available measures. If return_value=='all', all available measures are computed. The algorithm parameter defines how to compute the kNN graph. This is especially relevant for large datasets, as it provides more efficient index structures and approximate nearest neighbor algorithms. For example, algorithm='hnsw' uses a hierarchical navigable small-world graph to compute the hubness measures in log-linear time (instead of quadratic).

Hubness uses fit and score methods to estimate hubness. In this fictional example, we estimate hubness in terms of the Robin Hood index in some large dataset:

>>> X = (some large dataset)
>>> hub = Hubness(k=10,
...               return_value='robinhood',
...               algorithm='hnsw')
>>> hub.fit(X)  # Creates the HNSW index
>>> hub.score()
0.56

A Robin Hood index of 0.56 indicates that 56% of all slots in nearest neighbor lists would need to be redistributed in order to obtain equal k-occurrence for all objects. We’d consider this rather high hubness.

In order to evaluate whether hubness reduction might be beneficial for downstream tasks (learning etc.), we can perform the same estimation with hubness reduction enabled. We use the same code as above, but add the hubness parameter:

>>> X = (some large dataset)
>>> hub = Hubness(k=10,
...               return_value='robinhood',
...               algorithm='hnsw',
...               hubness='local_scaling')
>>> hub.fit(X)
>>> hub.score()
0.35

Here, the hubness reduction method local scaling resulted in a markedly lower Robin Hood index.

Note that we used the complete data set X in the examples above. We can also split the data into some X_train and X_test:

>>> hub.fit(X_train)
>>> hub.score(X_test)
0.36

This is useful when you want to tune hyperparameters towards low hubness and prevent data leakage.

Hubness measures

The degree of hubness in a dataset is typically measured from its k-occurrence histogram \(O^k\). For an individual data object x, its k-occurrence \(O^k(x)\) is defined as the number of times x resides among the k-nearest neighbors of all other objects in the data set. In terms of network analysis, \(O^k(x)\) is the indegree of x in a directed kNN graph. It is also known as the reverse neighbor count.
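
As a minimal sketch of this definition (computed with scikit-learn on hypothetical toy data), \(O^k\) is simply a count over the kNN index lists:

import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X = rng.rand(100, 50)  # hypothetical toy data
k = 10

# k-nearest neighbors of each object (excluding the object itself)
neigh_ind = NearestNeighbors(n_neighbors=k).fit(X).kneighbors(return_distance=False)

# O^k(x): how often x occurs among the k-nearest neighbors of other objects
k_occurrence = np.bincount(neigh_ind.ravel(), minlength=X.shape[0])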

The following measures are provided in Hubness by passing the corresponding argument values (e.g. hubness='robinhood'):

  • ‘k_skewness’: Skewness, the third central moment of the k-occurrence distribution, as introduced by Radovanović et al. 2010

  • ‘k_skewness_truncnorm’: skewness of truncated normal distribution estimated from k-occurrence distribution.

  • ‘atkinson’: the Atkinson index of inequality, which can be tuned in order to be more sensitive towards antihubs or hubs.

  • ‘gini’: the Gini coefficient of inequality, defined as half of the relative mean absolute difference.

  • ‘robinhood’: the Robin Hood or Hoover index, which gives the amount that needs to be redistributed in order to obtain equality (e.g. the proportion of total income, so that there is equal income for all; or the number of nearest neighbor slots, so that all objects are among the k-nearest neighbors of others exactly k times). A minimal sketch of this computation follows the list.

  • ‘antihubs’: indices of antihubs in data set X (objects that are never among the nearest neighbors of other objects).

  • ‘antihub_occurrence’: proportion of antihubs in the data set (percentage of total objects, which are antihubs).

  • ‘hubs’: indices of hub objects x in data set X (with \(O^k(x) > \text{hub\_size} \cdot k\), where \(\text{hub\_size} = 2\) by default).

  • ‘hub_occurrence’: proportion of nearest neighbor slots occupied by hubs

  • ‘groupie_ratio’: proportion of objects with the largest hub in their neighborhood

  • ‘k_neighbors’: indices to k-nearest neighbors for each object

  • ‘k_occurrence’: reverse neighbor count for each object

  • ‘all’: return a dictionary containing all of the above
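
The Robin Hood index from the list above can be computed directly from a k-occurrence vector using the standard Hoover-index formula. A minimal sketch (Hubness computes this for you via return_value='robinhood'):

import numpy as np

def robinhood_index(k_occurrence):
    # Fraction of all neighbor slots that would need to be redistributed
    # so that every object has the same k-occurrence.
    k_occurrence = np.asarray(k_occurrence, dtype=float)
    return float(np.abs(k_occurrence - k_occurrence.mean()).sum()
                 / (2. * k_occurrence.sum()))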

Hubness reduction

The skhubness.reduction subpackage provides several hubness reduction methods. Currently, the supported methods are

  • Mutual proximity (independent Gaussian distance distribution), provided by MutualProximity with method='normal' (default),

  • Mutual proximity (empiric distance distribution), provided by MutualProximity with method='empiric',

  • Local scaling, provided by LocalScaling with method='standard' (default),

  • Non-iterative contextual dissimilarity measure, provided by LocalScaling with method='nicdm',

  • DisSim Local, provided by DisSimLocal,

which represent the most successful hubness reduction methods as identified in our paper “A comprehensive empirical comparison of hubness reduction in high-dimensional spaces”, KAIS (2019), DOI. This survey paper also comes with an overview of how the individual methods work.

There are two ways to perform hubness reduction in scikit-hubness:

  • implicitly, by setting the hubness parameter of the classes in skhubness.neighbors, or

  • explicitly, by using the hubness reduction classes in skhubness.reduction directly.

The former is the common approach if you simply want to improve your learning task by hubness reduction; most examples here do so as well. The latter may, however, be more useful for researchers who would like to investigate the hubness phenomenon itself.

All hubness reducers inherit from a common base class HubnessReduction. This abstract class defines two important methods: fit and transform, allowing previously unseen data to be transformed after the initial fit. Most hubness reduction methods do not operate on vector data, but manipulate pre-computed distances in order to obtain secondary distances. Therefore, fit and transform take neighbor graphs as input, instead of vectors. Have a look at their signatures:

@abstractmethod
def fit(self, neigh_dist, neigh_ind, X, assume_sorted, *args, **kwargs):
    pass  # pragma: no cover

@abstractmethod
def transform(self, neigh_dist, neigh_ind, X, assume_sorted, return_distance=True):
    pass  # pragma: no cover

The arguments neigh_dist and neigh_ind are two arrays representing the nearest neighbor graph, with shape (n_indexed, n_neighbors) during fit, and shape (n_query, n_neighbors) during transform. The i-th row in each array corresponds to the i-th object in the data set. The j-th column in neigh_ind contains the index of one of the k-nearest neighbors among the indexed objects, while the j-th column in neigh_dist contains the corresponding distance. Note that this is the same format as returned by scikit-learn’s kneighbors(return_distance=True) method.

This way, the user has full flexibility on how to calculate primary distances (Euclidean, cosine, KL divergence, etc). DisSimLocal (DSL) is the exception to this rule, because it is formulated specifically for Euclidean distances. DSL, therefore, also requires the training vectors in fit(..., X=X_train), and the test set vectors in transform(..., X=X_test). Argument X is ignored in the other hubness reduction methods.

When the neighbor graph is already sorted (lowest to highest distance), assume_sorted=True should be set, so that hubness reduction methods will not sort the arrays again, thus saving computational time.

Hubness reduction methods transform the primary distance graph, and return secondary distances. Note that for efficiency reasons, the returned arrays are not sorted. Please make sure to sort the arrays, if downstream tasks assume sorted arrays.
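
As a sketch of the explicit route (assuming LocalScaling accepts k and method as constructor parameters, analogous to the hubness_params shown above, and ignoring X as described), the neighbor graph from scikit-learn can be transformed directly:

import numpy as np
from sklearn.neighbors import NearestNeighbors
from skhubness.reduction import LocalScaling

rng = np.random.RandomState(0)
X_train, X_test = rng.rand(200, 50), rng.rand(20, 50)  # hypothetical data

# Primary distances: sorted kNN graphs from scikit-learn
nn = NearestNeighbors(n_neighbors=20).fit(X_train)
neigh_dist, neigh_ind = nn.kneighbors(X_train)

# Fit the reduction on the indexed objects ...
ls = LocalScaling(k=5, method='standard')
ls.fit(neigh_dist, neigh_ind, X=None, assume_sorted=True)

# ... then transform the neighbor graph of new queries into
# (unsorted) secondary distances and corresponding indices
neigh_dist_test, neigh_ind_test = nn.kneighbors(X_test)
sec_dist, sec_ind = ls.transform(neigh_dist_test, neigh_ind_test,
                                 X=None, assume_sorted=True)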

Nearest neighbors

The skhubness.neighbors subpackage provides several neighbors-based learning methods. It is designed as a drop-in replacement for scikit-learn’s neighbors. The package provides all functionality from sklearn.neighbors, and adds support for transparent hubness reduction, where applicable.

In addition, scikit-hubness provides approximate nearest neighbor (ANN) search, in order to support large data sets with millions of data objects and more. A list of currently provided ANN methods is given in the Approximate nearest neighbor search methods section above.

Hubness reduction and ANN search can be used independently or in conjunction, the latter yielding approximate hubness reduction. Users of scikit-learn will find that only minor modifications of their code are required to enable one or both of the above. We describe how to do so in the Core Concepts section above.

For general information and details about nearest neighbors, we refer to the excellent scikit-learn User Guide on Nearest Neighbors.

Examples

In this section, we provide usage examples for skhubness.

Example: Hubness reduction

These examples show how to perform hubness reduction in kNN classification in (nested) cross-validation and pipelines.

Example: skhubness in Pipelines

Estimators from scikit-hubness can, of course, be used in a scikit-learn Pipeline. In this example, we select the best hubness reduction method and several other hyperparameters in a grid search w.r.t. classification performance.

from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

from skhubness.neighbors import KNeighborsClassifier

# Not so high-dimensional data
X, y = make_classification(n_samples=1_000,
                           n_features=50,
                           n_informative=20,
                           n_classes=2,
                           random_state=3453)

X, X_test, y, y_test = train_test_split(X, y,
                                        test_size=100,
                                        stratify=y,
                                        shuffle=True,
                                        random_state=124)

# Pipeline of standardization, dimensionality reduction, and kNN classification
pipe = Pipeline([('scale', StandardScaler(with_mean=True, with_std=True)),
                 ('pca', PCA(n_components=20, random_state=1213)),
                 ('knn', KNeighborsClassifier(n_neighbors=10, algorithm='lsh', hubness='mp'))])

# Exhaustive search for best algorithms and hyperparameters
param_grid = {'pca__n_components': [10, 20, 30],
              'knn__n_neighbors': [5, 10, 20],
              'knn__algorithm': ['auto', 'hnsw', 'lsh', 'falconn_lsh', 'nng', 'rptree'],
              'knn__hubness': [None, 'mp', 'ls', 'dsl']}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1354)
search = GridSearchCV(pipe, param_grid, n_jobs=5, cv=cv, verbose=1)
search.fit(X, y)

# Performance on hold-out data
acc = search.score(X_test, y_test)
print(acc)
# 0.79

print(search.best_params_)
# {'knn__algorithm': 'auto',
#  'knn__hubness': 'dsl',
#  'knn__n_neighbors': 20,
#  'pca__n_components': 30}


Face recognition (Olivetti faces)

This dataset contains a set of face images taken between April 1992 and April 1994 at AT&T Laboratories Cambridge. Image data is typically embedded in very high-dimensional spaces, which might be prone to hubness.

import numpy as np
from sklearn.datasets import fetch_olivetti_faces
from sklearn.model_selection import cross_val_score, StratifiedKFold, RandomizedSearchCV

from skhubness import Hubness
from skhubness.neighbors import KNeighborsClassifier

# Fetch data and have a look
d = fetch_olivetti_faces()
X, y = d['data'], d['target']
print(f'Data shape: {X.shape}')
print(f'Label shape: {y.shape}')
# (400, 4096)
# (400,)

# The data is embedded in a high-dimensional space.
# Is there hubness, and can we reduce it?
for hubness in [None, 'dsl', 'ls', 'mp']:
    hub = Hubness(k=10, hubness=hubness, return_value='k_skewness')
    hub.fit(X)
    score = hub.score()
    print(f'Hubness (10-skew): {score:.3f} with hubness reduction: {hubness}')
# Hubness (10-skew): 1.972 with hubness reduction: None
# Hubness (10-skew): 1.526 with hubness reduction: dsl
# Hubness (10-skew): 0.943 with hubness reduction: ls
# Hubness (10-skew): 0.184 with hubness reduction: mp

# There is some hubness, and all hubness reduction methods can reduce it (to varying degree)
# Let's assess the best kNN strategy and its estimated performance.
cv_perf = StratifiedKFold(n_splits=5, shuffle=True, random_state=7263)
cv_select = StratifiedKFold(n_splits=5, shuffle=True, random_state=32634)

knn = KNeighborsClassifier(algorithm_params={'n_candidates': 100})

# specify parameters and distributions to sample from
param_dist = {"n_neighbors": np.arange(1, 26),
              "weights": ['uniform', 'distance'],
              "hubness": [None, 'dsl', 'ls', 'mp']}

# Inner cross-validation to select best hyperparameters (incl hubness reduction method)
search = RandomizedSearchCV(estimator=knn,
                            param_distributions=param_dist,
                            n_iter=100,
                            cv=cv_select,
                            random_state=2345,
                            verbose=1)

# Outer cross-validation to estimate performance
score = cross_val_score(search, X, y, cv=cv_perf, verbose=1)
print(f'Scores: {score}')
print(f'Mean acc = {score.mean():.3f} +/- {score.std():.3f}')

# Select model that maximizes accuracy
search.fit(X, y)

# The best model's parameters
print(search.best_params_)

# Does it correspond to the results of hubness reduction above?
# Scores: [0.95   0.9625 1.     0.95   0.925 ]
# Mean acc = 0.957 +/- 0.024
# {'weights': 'distance', 'n_neighbors': 23, 'hubness': 'mp'}


Example: Approximate hubness reduction

These examples show how to combine approximate nearest neighbor search and hubness reduction.

Example: Reusing index structures

This example shows how to reuse index structures. If you want to first estimate hubness, and then perform kNN, you can avoid recomputing the ANN index structure, which can be costly.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

from skhubness.analysis import Hubness
from skhubness.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=100_000,
                           n_features=500,
                           n_informative=400,
                           random_state=543)

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=0.01,
                                                    stratify=y,
                                                    shuffle=True,
                                                    random_state=2346)

# Approximate hubness estimation: Creates LSH index and computes local scaling factors
hub = Hubness(k=10,
              return_value='robinhood',
              algorithm='falconn_lsh',
              hubness='ls',
              random_state=2345,
              shuffle_equal=False,
              verbose=1)
hub.fit(X_train)

robin_hood = hub.score(X_test)
print(f'Hubness (Robin Hood): {robin_hood:.4f}')
# 0.9060

# Approximate hubness reduction for classification: Reuse index & factors
knn = KNeighborsClassifier(n_neighbors=10,
                           algorithm='falconn_lsh',
                           hubness='ls',
                           n_jobs=1)

knn.fit(hub.nn_index_, y_train)  # REUSE INDEX HERE
acc = knn.score(X_test, y_test)
print(f'Test accuracy: {acc:.3f}')
# 0.959


Example: Approximate hubness reduction

This example shows how to combine approximate nearest neighbor search and hubness reduction in order to perform approximate hubness reduction for large data sets.

from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

from skhubness.analysis import Hubness
from skhubness.neighbors import KNeighborsClassifier

# High-dimensional artificial data
X, y = make_classification(n_samples=1_000_000,
                           n_features=500,
                           n_informative=400,
                           random_state=543)

X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    test_size=10_000,
                                                    stratify=y,
                                                    shuffle=True,
                                                    random_state=2346)

# Approximate hubness estimation
hub = Hubness(k=10,
              return_value='robinhood',
              algorithm='hnsw',
              random_state=2345,
              shuffle_equal=False,
              n_jobs=-1,
              verbose=2)
hub.fit(X_train)
robin_hood = hub.score(X_test)
print(f'Hubness (Robin Hood): {robin_hood:.3f}')
# 0.944

# Approximate hubness reduction for classification
knn = KNeighborsClassifier(n_neighbors=10,
                           algorithm='hnsw',
                           hubness='ls',
                           n_jobs=-1,
                           verbose=2)

knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f'Test accuracy: {acc:.3f}')
# Test accuracy: 0.987


scikit-learn examples adapted for scikit-hubness

Examples concerning using skhubness.neighbors as drop-in replacement for sklearn.neighbors.

These examples are taken from scikit-learn and demonstrate the ease of transition from sklearn.neighbors to skhubness.neighbors. You will find that many examples require no more than modifying an import line, and/or adding one argument when instantiating an estimator.

Note that these examples are not intended to demonstrate improved learning performance due to hubness reduction (the data are rather low-dimensional).

Nearest Neighbors regression

Demonstrate the resolution of a regression problem using a k-Nearest Neighbor and the interpolation of the target using both barycenter and constant weights.

Hubness reduction of this low-dimensional dataset shows only small effects.

Adapted from https://scikit-learn.org/stable/auto_examples/neighbors/plot_regression.html

_images/sphx_glr_plot_regression_001.png


print(__doc__)

# Author: Alexandre Gramfort <alexandre.gramfort@inria.fr>
#         Fabian Pedregosa <fabian.pedregosa@inria.fr>
#
# License: BSD 3 clause (C) INRIA


# #############################################################################
# Generate sample data
import numpy as np
import matplotlib.pyplot as plt
from skhubness.neighbors import KNeighborsRegressor

np.random.seed(0)
X = np.sort(5 * np.random.rand(40, 1), axis=0)
T = np.linspace(0, 5, 500)[:, np.newaxis]
y = np.sin(X).ravel()

# Add noise to targets
y[::5] += 1 * (0.5 - np.random.rand(8))

# #############################################################################
# Fit regression model
n_neighbors = 5

f = plt.figure()
for i, weights in enumerate(['uniform', 'distance']):
    for j, hubness in enumerate([None, 'local_scaling']):
        knn = KNeighborsRegressor(n_neighbors,
                                  algorithm_params={'n_candidates': 39},
                                  weights=weights,
                                  hubness=hubness)
        y_ = knn.fit(X, y).predict(T)

        plt.subplot(2, 2, i * 2 + j + 1)
        f.set_figheight(15)
        f.set_figwidth(15)
        plt.scatter(X, y, c='k', label='data')
        plt.plot(T, y_, c='g', label='prediction')
        plt.axis('tight')
        plt.legend()
        plt.title(f"KNeighborsRegressor (k = {n_neighbors}, weights = '{weights}', hubness = '{hubness}')")

plt.tight_layout()
plt.show()


Nearest Centroid Classification

Sample usage of Nearest Centroid classification. It will plot the decision boundaries for each class.

Note that no hubness reduction is currently implemented for centroids. However, skhubness.neighbors retains all the features of sklearn.neighbors, in order to act as a full drop-in replacement.

Adapted from https://scikit-learn.org/stable/auto_examples/neighbors/plot_nearest_centroid.html

  • _images/sphx_glr_plot_nearest_centroid_001.png
  • _images/sphx_glr_plot_nearest_centroid_002.png

Out:

None 0.8133333333333334
0.2 0.82

print(__doc__)

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import datasets
from skhubness.neighbors import NearestCentroid

n_neighbors = 15

# import some data to play with
iris = datasets.load_iris()
# we only take the first two features. We could avoid this ugly
# slicing by using a two-dim dataset
X = iris.data[:, :2]
y = iris.target

h = .02  # step size in the mesh

# Create color maps
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

for shrinkage in [None, .2]:
    # we create an instance of Neighbours Classifier and fit the data.
    clf = NearestCentroid(shrink_threshold=shrinkage)
    clf.fit(X, y)
    y_pred = clf.predict(X)
    print(shrinkage, np.mean(y == y_pred))
    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold,
                edgecolor='k', s=20)
    plt.title("3-Class classification (shrink_threshold=%r)"
              % shrinkage)
    plt.axis('tight')

plt.show()


Nearest Neighbors Classification

Sample usage of Nearest Neighbors classification. It will plot the decision boundaries for each class.

Adapted from https://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html

  • _images/sphx_glr_plot_classification_001.png
  • _images/sphx_glr_plot_classification_002.png


import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import datasets
from skhubness.neighbors import KNeighborsClassifier

n_neighbors = 15

# import some data to play with
iris = datasets.load_iris()

# we only take the first two features. We could avoid this ugly
# slicing by using a two-dim dataset
X = iris.data[:, :2]
y = iris.target

h = .02  # step size in the mesh

# Create color maps
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

for hubness in [None, 'mutual_proximity']:
    # we create an instance of Neighbours Classifier and fit the data.
    clf = KNeighborsClassifier(n_neighbors,
                               hubness=hubness,
                               weights='distance')
    clf.fit(X, y)

    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold,
                edgecolor='k', s=20)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.title("3-Class classification (k = %i, hubness = '%s')"
              % (n_neighbors, hubness))

plt.show()


Dimensionality Reduction with Neighborhood Components Analysis

Sample usage of Neighborhood Components Analysis for dimensionality reduction.

This example compares different (linear) dimensionality reduction methods applied on the Digits data set. The data set contains images of digits from 0 to 9 with approximately 180 samples of each class. Each image is of dimension 8x8 = 64, and is reduced to a two-dimensional data point.

Principal Component Analysis (PCA) applied to this data identifies the combination of attributes (principal components, or directions in the feature space) that account for the most variance in the data. Here we plot the different samples on the first 2 principal components.

Linear Discriminant Analysis (LDA) tries to identify attributes that account for the most variance between classes. In particular, LDA, in contrast to PCA, is a supervised method, using known class labels.

Neighborhood Components Analysis (NCA) tries to find a feature space such that a stochastic nearest neighbor algorithm will give the best accuracy. Like LDA, it is a supervised method.

One can see that NCA enforces a clustering of the data that is visually meaningful despite the large reduction in dimension.

Adapted from https://scikit-learn.org/stable/auto_examples/neighbors/plot_nca_dim_reduction.html

  • _images/sphx_glr_plot_nca_dim_reduction_001.png
  • _images/sphx_glr_plot_nca_dim_reduction_002.png
  • _images/sphx_glr_plot_nca_dim_reduction_003.png


# License: BSD 3 clause

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

from skhubness.neighbors import (KNeighborsClassifier,
                                 NeighborhoodComponentsAnalysis)

print(__doc__)

n_neighbors = 3
random_state = 0

# Load Digits dataset
digits = datasets.load_digits()
X, y = digits.data, digits.target

# Split into train/test
X_train, X_test, y_train, y_test = \
    train_test_split(X, y, test_size=0.5, stratify=y,
                     random_state=random_state)

dim = len(X[0])
n_classes = len(np.unique(y))

# Reduce dimension to 2 with PCA
pca = make_pipeline(StandardScaler(),
                    PCA(n_components=2, random_state=random_state))

# Reduce dimension to 2 with LinearDiscriminantAnalysis
lda = make_pipeline(StandardScaler(),
                    LinearDiscriminantAnalysis(n_components=2))

# Reduce dimension to 2 with NeighborhoodComponentAnalysis
nca = make_pipeline(StandardScaler(),
                    NeighborhoodComponentsAnalysis(n_components=2,
                                                   random_state=random_state))

# Use a nearest neighbor classifier to evaluate the methods
knn = KNeighborsClassifier(n_neighbors=n_neighbors)

# Make a list of the methods to be compared
dim_reduction_methods = [('PCA', pca), ('LDA', lda), ('NCA', nca)]

# plt.figure()
for i, (name, model) in enumerate(dim_reduction_methods):
    plt.figure()
    # plt.subplot(1, 3, i + 1, aspect=1)

    # Fit the method's model
    model.fit(X_train, y_train)

    # Fit a nearest neighbor classifier on the embedded training set
    knn.fit(model.transform(X_train), y_train)

    # Compute the nearest neighbor accuracy on the embedded test set
    acc_knn = knn.score(model.transform(X_test), y_test)

    # Embed the data set in 2 dimensions using the fitted model
    X_embedded = model.transform(X)

    # Plot the projected points and show the evaluation score
    plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=y, s=30, cmap='Set1')
    plt.title("{}, KNN (k={})\nTest accuracy = {:.2f}".format(name,
                                                              n_neighbors,
                                                              acc_knn))
plt.show()


Face completion with multi-output estimators

This example shows the use of multi-output estimators to complete images. The goal is to predict the lower half of a face given its upper half.

The first column of images shows true faces. The next columns illustrate how extremely randomized trees, linear regression, ridge regression, and k nearest neighbors with or without hubness reduction complete the lower half of those faces.

Adapted from https://scikit-learn.org/stable/auto_examples/plot_multioutput_face_completion.html

_images/sphx_glr_plot_multioutput_face_completion_001.png


print(__doc__)

import numpy as np
import matplotlib.pyplot as plt

from sklearn.datasets import fetch_olivetti_faces
from sklearn.utils.validation import check_random_state

from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import RidgeCV

from skhubness.neighbors import KNeighborsRegressor

# Load the faces datasets
data = fetch_olivetti_faces()
targets = data.target

data = data.images.reshape((len(data.images), -1))
train = data[targets < 30]
test = data[targets >= 30]  # Test on independent people

# Test on a subset of people
n_faces = 5
rng = check_random_state(4)
face_ids = rng.randint(test.shape[0], size=(n_faces, ))
test = test[face_ids, :]

n_pixels = data.shape[1]
# Upper half of the faces
X_train = train[:, :(n_pixels + 1) // 2]
# Lower half of the faces
y_train = train[:, n_pixels // 2:]
X_test = test[:, :(n_pixels + 1) // 2]
y_test = test[:, n_pixels // 2:]

# Fit estimators
ESTIMATORS = {
    "Extra trees": ExtraTreesRegressor(n_estimators=10, max_features=32,
                                       random_state=0),
    "k-NN": KNeighborsRegressor(weights='distance'),
    "k-NN MP": KNeighborsRegressor(hubness='mp',
                                   hubness_params={'method': 'normal'},
                                   weights='distance'),
    "Linear regression": LinearRegression(),
    "Ridge": RidgeCV(),
}

y_test_predict = dict()
for name, estimator in ESTIMATORS.items():
    estimator.fit(X_train, y_train)
    y_test_predict[name] = estimator.predict(X_test)

# Plot the completed faces
image_shape = (64, 64)

n_cols = 1 + len(ESTIMATORS)
plt.figure(figsize=(2. * n_cols, 2.26 * n_faces))
plt.suptitle("Face completion with multi-output estimators", size=16)

for i in range(n_faces):
    true_face = np.hstack((X_test[i], y_test[i]))

    if i:
        sub = plt.subplot(n_faces, n_cols, i * n_cols + 1)
    else:
        sub = plt.subplot(n_faces, n_cols, i * n_cols + 1,
                          title="true faces")

    sub.axis("off")
    sub.imshow(true_face.reshape(image_shape),
               cmap=plt.cm.gray,
               interpolation="nearest")

    for j, est in enumerate(sorted(ESTIMATORS)):
        completed_face = np.hstack((X_test[i], y_test_predict[est][i]))

        if i:
            sub = plt.subplot(n_faces, n_cols, i * n_cols + 2 + j)

        else:
            sub = plt.subplot(n_faces, n_cols, i * n_cols + 2 + j,
                              title=est)

        sub.axis("off")
        sub.imshow(completed_face.reshape(image_shape),
                   cmap=plt.cm.gray,
                   interpolation="nearest")

plt.show()


Comparing Nearest Neighbors with and without Neighborhood Components Analysis

An example comparing nearest neighbors classification with and without Neighborhood Components Analysis.

It will plot the class decision boundaries given by a Nearest Neighbors classifier when using the Euclidean distance on the original features, versus using the Euclidean distance after the transformation learned by Neighborhood Components Analysis. The latter aims to find a linear transformation that maximises the (stochastic) nearest neighbor classification accuracy on the training set.

Adapted from https://scikit-learn.org/stable/auto_examples/neighbors/plot_nca_classification.html

  • _images/sphx_glr_plot_nca_classification_001.png
  • _images/sphx_glr_plot_nca_classification_002.png
  • _images/sphx_glr_plot_nca_classification_003.png
  • _images/sphx_glr_plot_nca_classification_004.png
  • _images/sphx_glr_plot_nca_classification_005.png
  • _images/sphx_glr_plot_nca_classification_006.png


# License: BSD 3 clause

import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

from skhubness.neighbors import (KNeighborsClassifier,
                                 NeighborhoodComponentsAnalysis)
import warnings
warnings.filterwarnings('ignore')

print(__doc__)

n_neighbors = 1

dataset = datasets.load_iris()
X, y = dataset.data, dataset.target

# we only take two features. We could avoid this ugly
# slicing by using a two-dim dataset
X = X[:, [0, 2]]

X_train, X_test, y_train, y_test = \
    train_test_split(X, y, stratify=y, test_size=0.7, random_state=42)

h = .01  # step size in the mesh

# Create color maps
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])

names = ['KNN',
         'NCA, KNN',
         'KNN, MP (normal)',
         'KNN, MP (empiric)',
         'KNN, LS (standard)',
         'KNN, LS (nicdm)',
        ]

classifiers = [Pipeline([('scaler', StandardScaler()),
                         ('knn', KNeighborsClassifier(n_neighbors=n_neighbors))
                         ]),
               Pipeline([('scaler', StandardScaler()),
                         ('nca', NeighborhoodComponentsAnalysis()),
                         ('knn', KNeighborsClassifier(n_neighbors=n_neighbors))
                         ]),
               Pipeline([('scaler', StandardScaler()),
                         ('knn', KNeighborsClassifier(n_neighbors=n_neighbors,
                                                      hubness='mutual_proximity',
                                                      hubness_params={'method': 'normal'}))
                         ]),
               Pipeline([('scaler', StandardScaler()),
                         ('knn', KNeighborsClassifier(n_neighbors=n_neighbors,
                                                      hubness='mutual_proximity',
                                                      hubness_params={'method': 'empiric'}))
                         ]),
               Pipeline([('scaler', StandardScaler()),
                         ('knn', KNeighborsClassifier(n_neighbors=n_neighbors,
                                                      hubness='local_scaling',
                                                      hubness_params={'method': 'standard'}))
                         ]),
               Pipeline([('scaler', StandardScaler()),
                         ('knn', KNeighborsClassifier(n_neighbors=n_neighbors,
                                                      hubness='local_scaling',
                                                      hubness_params={'method': 'nicdm'}))
                         ]),
               ]

x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                     np.arange(y_min, y_max, h))

for name, clf in zip(names, classifiers):

    clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)

    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light, alpha=.8)

    # Plot also the training and testing points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold, edgecolor='k', s=20)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.title("{} (k = {})".format(name, n_neighbors))
    plt.text(0.9, 0.1, '{:.2f}'.format(score), size=15,
             ha='center', va='center', transform=plt.gca().transAxes)

plt.show()


Manifold learning on handwritten digits: Locally Linear Embedding, Isomap…

An illustration of various embeddings on the digits dataset.

The RandomTreesEmbedding, from the sklearn.ensemble module, is not technically a manifold embedding method, as it learns a high-dimensional representation on which we apply a dimensionality reduction method. However, it is often useful to cast a dataset into a representation in which the classes are linearly-separable.

t-SNE will be initialized with the embedding that is generated by PCA in this example, which is not the default setting. It ensures global stability of the embedding, i.e., the embedding does not depend on random initialization.

Linear Discriminant Analysis, from the sklearn.discriminant_analysis module, and Neighborhood Components Analysis, from the sklearn.neighbors module, are supervised dimensionality reduction methods, i.e., they make use of the provided labels, contrary to the other methods.

Adapted from https://scikit-learn.org/stable/auto_examples/manifold/plot_lle_digits.html

  • _images/sphx_glr_plot_lle_digits_001.png
  • _images/sphx_glr_plot_lle_digits_002.png
  • _images/sphx_glr_plot_lle_digits_003.png
  • _images/sphx_glr_plot_lle_digits_004.png
  • _images/sphx_glr_plot_lle_digits_005.png
  • _images/sphx_glr_plot_lle_digits_006.png
  • _images/sphx_glr_plot_lle_digits_007.png
  • _images/sphx_glr_plot_lle_digits_008.png
  • _images/sphx_glr_plot_lle_digits_009.png
  • _images/sphx_glr_plot_lle_digits_010.png
  • _images/sphx_glr_plot_lle_digits_011.png
  • _images/sphx_glr_plot_lle_digits_012.png
  • _images/sphx_glr_plot_lle_digits_013.png
  • _images/sphx_glr_plot_lle_digits_014.png
  • _images/sphx_glr_plot_lle_digits_015.png
  • _images/sphx_glr_plot_lle_digits_016.png

Out:

Computing random projection
Computing PCA projection
Computing Linear Discriminant Analysis projection
Computing Isomap projection
Done.
Computing LLE embedding
Done. Reconstruction error: 1.63545e-06
Computing modified LLE embedding
Done. Reconstruction error: 0.360653
Computing Hessian LLE embedding
Done. Reconstruction error: 0.2128
Computing LTSA embedding
Done. Reconstruction error: 0.2128
Computing MDS embedding
Done. Stress: 135359737.175700
Computing MDS embedding from local scaling neighbors graph
Done. Stress: 89610.656628
Computing MDS embedding from mutual proximity graph
Done. Stress: 25752.919623
Computing Totally Random Trees embedding
Computing Spectral embedding
Computing t-SNE embedding
Computing NCA projection

# Authors: Fabian Pedregosa <fabian.pedregosa@inria.fr>
#          Olivier Grisel <olivier.grisel@ensta.org>
#          Mathieu Blondel <mathieu@mblondel.org>
#          Gael Varoquaux
#          Roman Feldbauer
# License: BSD 3 clause (C) INRIA 2011

print(__doc__)
from time import time

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import offsetbox
from sklearn import (manifold, datasets, decomposition, ensemble,
                     discriminant_analysis, random_projection)
from skhubness import neighbors

digits = datasets.load_digits(n_class=6)
X = digits.data
y = digits.target
n_samples, n_features = X.shape
n_neighbors = 30


# ----------------------------------------------------------------------
# Scale and visualize the embedding vectors
def plot_embedding(X, title=None):
    x_min, x_max = np.min(X, 0), np.max(X, 0)
    X = (X - x_min) / (x_max - x_min)

    plt.figure()
    ax = plt.subplot(111)
    for i in range(X.shape[0]):
        plt.text(X[i, 0], X[i, 1], str(y[i]),
                 color=plt.cm.Set1(y[i] / 10.),
                 fontdict={'weight': 'bold', 'size': 9})

    if hasattr(offsetbox, 'AnnotationBbox'):
        # only print thumbnails with matplotlib > 1.0
        shown_images = np.array([[1., 1.]])  # just something big
        for i in range(X.shape[0]):
            dist = np.sum((X[i] - shown_images) ** 2, 1)
            if np.min(dist) < 4e-3:
                # don't show points that are too close
                continue
            shown_images = np.r_[shown_images, [X[i]]]
            imagebox = offsetbox.AnnotationBbox(
                offsetbox.OffsetImage(digits.images[i], cmap=plt.cm.gray_r),
                X[i])
            ax.add_artist(imagebox)
    plt.xticks([]), plt.yticks([])
    if title is not None:
        plt.title(title)


# ----------------------------------------------------------------------
# Plot images of the digits
n_img_per_row = 20
img = np.zeros((10 * n_img_per_row, 10 * n_img_per_row))
for i in range(n_img_per_row):
    ix = 10 * i + 1
    for j in range(n_img_per_row):
        iy = 10 * j + 1
        img[ix:ix + 8, iy:iy + 8] = X[i * n_img_per_row + j].reshape((8, 8))

plt.imshow(img, cmap=plt.cm.binary)
plt.xticks([])
plt.yticks([])
plt.title('A selection from the 64-dimensional digits dataset')


# ----------------------------------------------------------------------
# Random 2D projection using a sparse random matrix
print("Computing random projection")
rp = random_projection.SparseRandomProjection(n_components=2, random_state=42)
X_projected = rp.fit_transform(X)
plot_embedding(X_projected, "Random Projection of the digits")


# ----------------------------------------------------------------------
# Projection onto the first 2 principal components

print("Computing PCA projection")
t0 = time()
X_pca = decomposition.TruncatedSVD(n_components=2).fit_transform(X)
plot_embedding(X_pca,
               "Principal Components projection of the digits (time %.2fs)" %
               (time() - t0))

# ----------------------------------------------------------------------
# Projection onto the first 2 linear discriminant components

print("Computing Linear Discriminant Analysis projection")
X2 = X.copy()
X2.flat[::X.shape[1] + 1] += 0.01  # perturb some entries slightly, so LDA does not hit a singular covariance matrix
t0 = time()
X_lda = discriminant_analysis.LinearDiscriminantAnalysis(n_components=2).fit_transform(X2, y)
plot_embedding(X_lda,
               "Linear Discriminant projection of the digits (time %.2fs)" %
               (time() - t0))


# ----------------------------------------------------------------------
# Isomap projection of the digits dataset
print("Computing Isomap projection")
t0 = time()
X_iso = manifold.Isomap(n_neighbors, n_components=2).fit_transform(X)
print("Done.")
plot_embedding(X_iso,
               "Isomap projection of the digits (time %.2fs)" %
               (time() - t0))


# ----------------------------------------------------------------------
# Locally linear embedding of the digits dataset
print("Computing LLE embedding")
clf = manifold.LocallyLinearEmbedding(n_neighbors, n_components=2,
                                      method='standard')
t0 = time()
X_lle = clf.fit_transform(X)
print("Done. Reconstruction error: %g" % clf.reconstruction_error_)
plot_embedding(X_lle,
               "Locally Linear Embedding of the digits (time %.2fs)" %
               (time() - t0))


# ----------------------------------------------------------------------
# Modified Locally linear embedding of the digits dataset
print("Computing modified LLE embedding")
clf = manifold.LocallyLinearEmbedding(n_neighbors, n_components=2,
                                      method='modified')
t0 = time()
X_mlle = clf.fit_transform(X)
print("Done. Reconstruction error: %g" % clf.reconstruction_error_)
plot_embedding(X_mlle,
               "Modified Locally Linear Embedding of the digits (time %.2fs)" %
               (time() - t0))


# ----------------------------------------------------------------------
# HLLE embedding of the digits dataset
print("Computing Hessian LLE embedding")
clf = manifold.LocallyLinearEmbedding(n_neighbors, n_components=2,
                                      method='hessian')
t0 = time()
X_hlle = clf.fit_transform(X)
print("Done. Reconstruction error: %g" % clf.reconstruction_error_)
plot_embedding(X_hlle,
               "Hessian Locally Linear Embedding of the digits (time %.2fs)" %
               (time() - t0))


# ----------------------------------------------------------------------
# LTSA embedding of the digits dataset
print("Computing LTSA embedding")
clf = manifold.LocallyLinearEmbedding(n_neighbors, n_components=2,
                                      method='ltsa')
t0 = time()
X_ltsa = clf.fit_transform(X)
print("Done. Reconstruction error: %g" % clf.reconstruction_error_)
plot_embedding(X_ltsa,
               "Local Tangent Space Alignment of the digits (time %.2fs)" %
               (time() - t0))

# ----------------------------------------------------------------------
# MDS  embedding of the digits dataset
print("Computing MDS embedding")
clf = manifold.MDS(n_components=2, n_init=1, max_iter=2000,
                   dissimilarity='euclidean', metric=True,
                   )
t0 = time()

X_mds = clf.fit_transform(X)
print("Done. Stress: %f" % clf.stress_)
plot_embedding(X_mds,
               "MDS embedding of the digits (time %.2fs)" %
               (time() - t0))

# ----------------------------------------------------------------------
# Hubness reduction (LS) + MDS  embedding of the digits dataset
print("Computing MDS embedding from local scaling neighbors graph")
clf = manifold.MDS(n_components=2, n_init=1, max_iter=2000,
                   dissimilarity='precomputed', metric=True,
                   )
t0 = time()
graph = neighbors.graph.kneighbors_graph(
    X, n_neighbors=X.shape[0]-1, mode='distance', hubness='local_scaling').toarray()
X_mds = clf.fit_transform(graph)
print("Done. Stress: %f" % clf.stress_)
plot_embedding(X_mds,
               "Hubness reduction (LS) - MDS embedding (time %.2fs)" %
               (time() - t0))

# ----------------------------------------------------------------------
# Hubness reduction (MP) + MDS  embedding of the digits dataset
print("Computing MDS embedding from mutual proximity graph")
clf = manifold.MDS(n_components=2, n_init=1, max_iter=2000,
                   dissimilarity='precomputed', metric=True,
                   )
t0 = time()
# use all other samples as neighbors, as in the local scaling block above
graph = neighbors.graph.kneighbors_graph(
    X, n_neighbors=X.shape[0]-1, mode='distance', hubness='mp').toarray()
X_mds = clf.fit_transform(graph)
print("Done. Stress: %f" % clf.stress_)
plot_embedding(X_mds,
               "Hubness reduction (MP) - MDS embedding (time %.2fs)" %
               (time() - t0))

# ----------------------------------------------------------------------
# Random Trees embedding of the digits dataset
print("Computing Totally Random Trees embedding")
hasher = ensemble.RandomTreesEmbedding(n_estimators=200, random_state=0,
                                       max_depth=5)
t0 = time()
X_transformed = hasher.fit_transform(X)
pca = decomposition.TruncatedSVD(n_components=2)
X_reduced = pca.fit_transform(X_transformed)

plot_embedding(X_reduced,
               "Random forest embedding of the digits (time %.2fs)" %
               (time() - t0))

# ----------------------------------------------------------------------
# Spectral embedding of the digits dataset
print("Computing Spectral embedding")
embedder = manifold.SpectralEmbedding(n_components=2, random_state=0,
                                      eigen_solver="arpack")
t0 = time()
X_se = embedder.fit_transform(X)

plot_embedding(X_se,
               "Spectral embedding of the digits (time %.2fs)" %
               (time() - t0))

# ----------------------------------------------------------------------
# t-SNE embedding of the digits dataset
print("Computing t-SNE embedding")
tsne = manifold.TSNE(n_components=2, init='pca', random_state=0)
t0 = time()
X_tsne = tsne.fit_transform(X)

plot_embedding(X_tsne,
               "t-SNE embedding of the digits (time %.2fs)" %
               (time() - t0))

# ----------------------------------------------------------------------
# NCA projection of the digits dataset
print("Computing NCA projection")
nca = neighbors.NeighborhoodComponentsAnalysis(n_components=2, random_state=0)
t0 = time()
X_nca = nca.fit_transform(X, y)

plot_embedding(X_nca,
               "NCA embedding of the digits (time %.2fs)" %
               (time() - t0))

plt.show()

Total running time of the script: (1 minute 24.114 seconds)


API Documentation

This is the API documentation for scikit-hubness.

Analysis: skhubness.analysis

The skhubness.analysis package provides methods for measuring hubness.

analysis.Hubness

Examine hubness characteristics of data.

analysis.VALID_HUBNESS_MEASURES

List of hubness measures supported by Hubness (a built-in mutable sequence).
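
For illustration, here is a minimal sketch of a hubness analysis on synthetic data; the Gaussian data and the choice of k are arbitrary, and VALID_HUBNESS_MEASURES is assumed to be importable from skhubness.analysis as listed above:

import numpy as np
from skhubness.analysis import Hubness, VALID_HUBNESS_MEASURES

# measures that Hubness can compute in this version
print(VALID_HUBNESS_MEASURES)

# high-dimensional i.i.d. Gaussian data are typically prone to hubness
rng = np.random.RandomState(42)
X = rng.normal(size=(1000, 500))

hub = Hubness(k=10, metric='euclidean')
hub.fit(X)
print(f'k-occurrence skewness: {hub.score():.3f}')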

Neighbors: skhubness.neighbors

The skhubness.neighbors package is a drop-in replacement for sklearn.neighbors, providing all of its features, while adding transparent support for hubness reduction and approximate nearest neighbor search.
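
As a hedged sketch of this drop-in usage (the algorithm and hubness string identifiers follow the supported-platforms table and the examples above; 'hnsw' requires the optional nmslib dependency):

import numpy as np
from skhubness.neighbors import NearestNeighbors

rng = np.random.RandomState(0)
X = rng.normal(size=(1000, 100))

# exact search, used exactly like sklearn.neighbors.NearestNeighbors
nn_exact = NearestNeighbors(n_neighbors=10)
nn_exact.fit(X)

# approximate search (HNSW via nmslib) combined with mutual proximity
# hubness reduction, both selected by string identifiers
nn_approx = NearestNeighbors(n_neighbors=10,
                             algorithm='hnsw',
                             hubness='mp')
nn_approx.fit(X)
dist, ind = nn_approx.kneighbors(X)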

neighbors.BallTree

BallTree for fast generalized N-point problems

neighbors.DistanceMetric

DistanceMetric class

neighbors.KDTree

KDTree for fast generalized N-point problems

neighbors.HNSW

Wrapper for using nmslib

neighbors.KNeighborsClassifier

Classifier implementing the k-nearest neighbors vote.

neighbors.KNeighborsRegressor

Regression based on k-nearest neighbors.

neighbors.FalconnLSH

Wrapper for using falconn LSH

neighbors.NearestCentroid

Nearest centroid classifier.

neighbors.NearestNeighbors

Unsupervised learner for implementing neighbor searches.

neighbors.NNG

Wrapper for ngtpy and NNG variants.

neighbors.PuffinnLSH

Wrap Puffinn LSH for scikit-learn compatibility.

neighbors.RadiusNeighborsClassifier

Classifier implementing a vote among neighbors within a given radius

neighbors.RadiusNeighborsRegressor

Regression based on neighbors within a fixed radius.

neighbors.RandomProjectionTree

Wrapper for using annoy.AnnoyIndex

neighbors.kneighbors_graph

Computes the (weighted) graph of k-Neighbors for points in X

neighbors.radius_neighbors_graph

Computes the (weighted) graph of Neighbors for points in X

neighbors.KernelDensity

Kernel Density Estimation.

neighbors.LocalOutlierFactor

Unsupervised Outlier Detection using Local Outlier Factor (LOF)

neighbors.NeighborhoodComponentsAnalysis

Neighborhood Components Analysis

Reduction: skhubness.reduction

The skhubness.reduction package provides methods for hubness reduction.

reduction.MutualProximity

Hubness reduction with Mutual Proximity.

reduction.LocalScaling

Hubness reduction with Local Scaling.

reduction.DisSimLocal

Hubness reduction with DisSimLocal.

reduction.hubness_algorithms

Supported hubness reduction algorithms
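
The reduction classes are usually selected indirectly, through the hubness parameter of the skhubness.neighbors estimators. A small sketch, assuming the reduction classes can be constructed with their defaults:

from skhubness import reduction

# string identifiers accepted by the hubness parameter
# of the skhubness.neighbors estimators
print(reduction.hubness_algorithms)

# the classes can also be instantiated directly
# (constructor defaults assumed here)
mp = reduction.MutualProximity()
ls = reduction.LocalScaling()
dsl = reduction.DisSimLocal()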

History of scikit-hubness

scikit-hubness builds upon previous software: the Hub-Toolbox. The original Hub-Toolbox was written for Matlab, and released alongside the first hubness reduction methods published in JMLR. In essence, it comprises methods to reduce hubness in distance matrices.

The Hub-Toolbox for Python3 is a port from Matlab to Python, which over the years gained several extensions and additional functionality, such as more hubness reduction methods (Localized Centering, DisSimLocal, mp-dissim, etc.), approximate hubness reduction, and more. The software was developed by hubness researchers for hubness research.

The new scikit-hubness package is rewritten from scratch with a different goal in mind: providing easy-to-use neighborhood-based data mining methods (classification, regression, etc.) with transparent hubness reduction. Building upon scikit-learn’s neighbors package, we provide a drop-in replacement called skhubness.neighbors, which offers all the functionality of sklearn.neighbors and adds approximate nearest neighbor search and hubness reduction on top.

This way, we think that machine learning researchers and practitioners (many of whom are fluent in scikit-learn) can quickly and effectively employ scikit-hubness in their existing workflows, and improve learning on their high-dimensional data.

Contributing

scikit-hubness is free open source software. Contributions from the community are highly appreciated. Even small contributions improve the software’s quality.

Even if you are not familiar with the programming languages and tools involved, you can contribute by reporting bugs or other problems as GitHub issues.

Git and branching model

We use git for version control (VCS), as do most projects nowadays. If you are not familiar with git, there are plenty of tutorials available, for example the GitHub Guides. All the important basics are covered in the GitHub Git handbook.

Development of scikit-hubness (mostly) follows the git flow branching model. There are two main branches: master and develop. For any changes, a new branch should be created. To add a new feature, fix a noncritical bug, etc., branch off develop. Only to fix a critical bug should you branch off master.

Workflow

In case of large changes to the software, please first get in contact with the authors for coordination, for example by filing an issue. If you want to fix small issues (typos in the docs, obvious errors, etc.) you can - of course - directly submit a pull request (PR).

  1. Create a fork of scikit-hubness in your GitHub account.

    Simply click the “Fork” button on https://github.com/VarIr/scikit-hubness.

  2. Clone your fork on your computer.

    $ git clone git@github.com:YOUR-ACCOUNT-GOES-HERE/scikit-hubness.git && cd scikit-hubness

  3. Add remote upstream.

    $ git remote add upstream git@github.com:VarIr/scikit-hubness.git

  4. Create feature/bugfix branch.

    In case of feature or noncritical bugfix: $ git checkout develop && git checkout -b featureXYZ develop

    In case of critical bug: $ git checkout -b bugfix123 master

  5. Implement feature/fix bug/fix typo/…

    Happy coding!

  6. Create a commit with meaningful message

    If you only modified existing files, simply $ git commit -am "descriptive message what this commit does (in present tense) here"

  7. Push to GitHub

    e.g. $ git push origin featureXYZ

  8. Create pull request (PR)

    Git will likely provide a link to directly create the PR. If not, click “New pull request” on your fork on GitHub.

  9. Wait…

    Several devops checks will be performed automatically (e.g. continuous integration (CI) with Travis, AppVeyor).

    The authors will get in contact with you, and may ask for changes.

  10. Respond to code review.

    If there were issues with continuous integration, or the authors asked for changes, please create a new commit locally, and simply push to GitHub again as you did before. The PR will be updated automatically.

  11. Maintainers merge the PR when all issues are resolved.

    Thanks a lot for your contribution!

Code style and further guidelines

  • Please make sure all code complies with PEP 8

  • All code should be documented sufficiently (functions, classes, etc. must have docstrings with a general description, parameters, and ideally return values, raised exceptions, notes, etc.)

  • Documentation style is NumPy format.

  • New code must be covered by unit tests using pytest.

  • If you fix a bug, please provide regression tests (fail on old code, succeed on new code).

  • It may be helpful to install scikit-hubness in editable mode for development. Once you have cloned the package, switch into the corresponding directory, and

    pip install -e .
    

    (don’t omit the trailing period). This way, any changes to the code are reflected immediately, so you don’t need to reinstall the package every time you change the code during development.

Testing

In scikit-hubness, we aim for high code coverage. As of September 2019, between 98% and 99% of all code lines are visited at least once when running the complete test suite. This is primarily to ensure:

  • correctness of the code (to some extent) and

  • maintainability (new changes don’t break old code).

When creating a new PR, ideally all code should be covered by tests. Sometimes this is not feasible, or only with large effort. Pull requests will likely be accepted as long as the overall code coverage at least does not decrease.

Unit tests are automatically performed for each PR using CI tools online. This may take some time, however. To run the tests locally, you need pytest installed. From the scikit-hubness directory, call

pytest skhubness/

to run all the tests. You can also restrict the tests to the subpackage you are working on, down to single tests. For example

pytest skhubness/reduction/tests/test_local_scaling.py --showlocals -v

only runs tests for hubness reduction with local scaling.

In order to check code coverage locally, you need the pytest-cov plugin.

pytest skhubness/reduction/ --cov=skhubness/reduction/

Changelog

Next release

Added or enhanced

  • Lower memory footprint for sparse targets in multilabel classification (previously converted to dense arrays) #61

Fixes

  • Hubness estimation could fail when ANN does not return enough neighbors #59

  • Heuristic to choose memory for Puffinn LSH.

0.21.2 - 2020-01-14

This is a maintenance release due to the publication in the Journal of Open Source Software.

0.21.1 - 2019-12-10

This is a bugfix release due to the recent update of scikit-learn to v0.22.

Fixes

  • Require scikit-learn v0.21.3.

    Until the necessary adaptations for v0.22 are completed, scikit-hubness will require scikit-learn v0.21.3.

0.21.0 - 2019-11-25

This is the first major release of scikit-hubness.

Added

  • Enable ONNG provided by NGT (optimized ANNG). Pass optimize=True to NNG.

  • User Guide: Description of all subpackages and common usage scenarios.

  • Examples: Various usage examples

  • Several tests

  • Classes inheriting from SupervisedIntegerMixin can be fit with an ApproximateNearestNeighbor or NearestNeighbors instance, thus reusing precomputed indices.

Changes

  • Use argument algorithm='nng' for ANNG/ONNG provided by NGT instead of 'onng'. Also set optimize=True in order to use ONNG.
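
    A short sketch of both options, with all other constructor arguments left at their defaults:

    from skhubness.neighbors import NNG, NearestNeighbors

    # ONNG directly via the wrapper class (ANNG plus index optimization)
    ann = NNG(optimize=True)

    # or select NGT through the estimator interface
    nn = NearestNeighbors(algorithm='nng')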

Fixes

  • DisSimLocal would previously fail when invoked as hubness='dis_sim_local'.

  • Hubness reduction would previously ignore verbose arguments under certain circumstances.

  • HNSW would previously ignore n_jobs on index creation.

  • Fix installation instructions for puffinn.

0.21.0a9 - 2019-10-30

Added

  • General structure for docs

  • Enable NGT OpenMP support on MacOS (in addition to Linux)

  • Enable Puffinn LSH also on MacOS

Fixes

  • Correct mutual proximity (empiric) calculation

  • Better handling of optional packages (ANN libraries)

Maintenance

  • streamlined CI builds

  • several minor code improvements

New contributors

  • Silvan David Peter

0.21.0a8 - 2019-09-12

Added

  • Approximate nearest neighbor search

    • LSH by an additional provider, puffinn (currently Linux only)

    • ANNG provided by ngtpy (Linux, MacOS)

    • Random projection forests provided by annoy (Linux, MacOS, Windows)

Fixes

  • Several minor issues

  • Several documentation issues

0.21.0a7 - 2019-07-17

The first alpha release of scikit-hubness to appear in this changelog. It already contains the following features:

  • Hubness estimation (exact or approximate)

  • Hubness reduction (exact or approximate)

    • Mutual proximity

    • Local scaling

    • DisSim Local

  • Approximate nearest neighbor search

Getting started

Get started with scikit-hubness in a breeze. Learn how to install the package and see all core functionality applied in a single quick start example.

User Guide

The User Guide introduces the main concepts of scikit-hubness. It explains how to analyze your data sets for hubness, and how to use the package to lift this curse of dimensionality. You will also find examples of how to use skhubness.neighbors for approximate nearest neighbor search (with or without hubness reduction).

API Documentation

The API Documentation provides detailed information about the implemented methods: method descriptions, parameters, references, examples, etc. Find all the information about specific modules and functions of scikit-hubness in this section.

History

A brief history of the package, and how it relates to the Hub-Toolboxes.

Development

There are several possibilities to contribute to this free open source software. We highly appreciate all input from the community, be it bug reports or code contributions.

Source code, issue tracking, discussion, and continuous integration are hosted on our GitHub page.

What’s new

To see what’s new in the latest version of scikit-hubness, have a look at the changelog.