scikit-hubness: high-dimensional data mining¶
scikit-hubness is a Python package for the analysis of hubness in high-dimensional data. It provides hubness reduction and approximate nearest neighbor search via a drop-in replacement for sklearn.neighbors.
Installation¶
Installation from PyPI¶
The current release of scikit-hubness
can be installed from PyPI:
pip install scikit-hubness
Dependencies¶
All strict dependencies of scikit-hubness
are automatically installed
by pip
. Some optional dependencies (certain ANN libraries) may not
yet be available from PyPI. If you require one of these libraries,
please refer to the library’s documentation for building instructions.
For example, at the time of writing, puffinn
was not available on PyPI.
Building and installing is straightforward:
git clone https://github.com/puffinn/puffinn.git
cd puffinn
python3 setup.py build
pip install .
Installation from source¶
You can always grab the latest version of scikit-hubness
directly from GitHub:
cd install_dir
git clone git@github.com:VarIr/scikit-hubness.git
cd scikit-hubness
pip install -e .
This is the recommended approach if you want to contribute to the development of scikit-hubness.
Supported platforms¶
scikit-hubness
currently supports all major operating systems:
Linux
MacOS X
Windows
Note that not all approximate nearest neighbor algorithms used in scikit-hubness are available on all platforms, because we rely on third-party libraries that are in some cases not available for all platforms.
The table below indicates which libraries and algorithms are currently supported on your operating system.
All exact nearest neighbor algorithms (as provided by scikit-learn) are available on all platforms.
library   | algorithm    | Linux | MacOS | Windows
--------- | ------------ | ----- | ----- | -------
nmslib    | hnsw         | x     | x     | x
annoy     | rptree       | x     | x     | x
ngtpy     | nng          | x     | x     |
falconn   | falconn_lsh  | x     | x     |
puffinn   | lsh          | x     | x     |
sklearn   | (all exact)  | x     | x     | x
Quick start example¶
Users of scikit-hubness typically want to
- analyse whether their data show hubness,
- reduce hubness, and
- perform learning (classification, regression, …).
The following example shows all these steps for an example dataset from the text domain (dexter).
Please make sure you have installed scikit-hubness (installation instructions).
First, we load the dataset and inspect its size.
from skhubness.data import load_dexter
X, y = load_dexter()
print(f'X.shape = {X.shape}, y.shape={y.shape}')
Dexter is embedded in a high-dimensional space and could thus be prone to hubness. Therefore, we assess the actual degree of hubness.
from skhubness import Hubness
hub = Hubness(k=10, metric='cosine')
hub.fit(X)
k_skew = hub.score()
print(f'Skewness = {k_skew:.3f}')
As a rule of thumb, skewness > 1.2 indicates significant hubness. Additional hubness indices are available, for example:
print(f'Robin hood index: {hub.robinhood_index:.3f}')
print(f'Antihub occurrence: {hub.antihub_occurrence:.3f}')
print(f'Hub occurrence: {hub.hub_occurrence:.3f}')
There is considerable hubness in dexter. Let's see whether hubness reduction can improve kNN classification performance.
from sklearn.model_selection import cross_val_score
from skhubness.neighbors import KNeighborsClassifier
# vanilla kNN
knn_standard = KNeighborsClassifier(n_neighbors=5,
metric='cosine')
acc_standard = cross_val_score(knn_standard, X, y, cv=5)
# kNN with hubness reduction (mutual proximity)
knn_mp = KNeighborsClassifier(n_neighbors=5,
metric='cosine',
hubness='mutual_proximity')
acc_mp = cross_val_score(knn_mp, X, y, cv=5)
print(f'Accuracy (vanilla kNN): {acc_standard.mean():.3f}')
print(f'Accuracy (kNN with hubness reduction): {acc_mp.mean():.3f}')
Accuracy was considerably improved by mutual proximity (MP). But did MP actually reduce hubness?
hub_mp = Hubness(k=10, metric='cosine',
hubness='mutual_proximity')
hub_mp.fit(X)
k_skew_mp = hub_mp.score()
print(f'Skewness after MP: {k_skew_mp:.3f} '
f'(reduction of {k_skew - k_skew_mp:.3f})')
print(f'Robin hood: {hub_mp.robinhood_index:.3f} '
f'(reduction of {hub.robinhood_index - hub_mp.robinhood_index:.3f})')
Yes!
The neighbor graph can also be created directly, with or without hubness reduction:
from skhubness.neighbors import kneighbors_graph
neighbor_graph = kneighbors_graph(X,
n_neighbors=5,
hubness='mutual_proximity')
You may want to precompute the graph like this in order to avoid computing it repeatedly for subsequent hubness estimation and learning.
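As a minimal sketch of such reuse (assumption: the estimator consuming the graph accepts a precomputed sparse distance matrix via metric='precomputed', as scikit-learn documents for its neighbors-based estimators; check your scikit-learn version):
from sklearn.neighbors import KNeighborsClassifier
from skhubness.neighbors import kneighbors_graph
# Precompute the hubness-reduced neighbor graph once, with (secondary) distances as values
graph = kneighbors_graph(X, n_neighbors=30, mode='distance', hubness='mutual_proximity')
# Reuse the sparse graph for learning; metric='precomputed' is plain scikit-learn functionality
knn = KNeighborsClassifier(n_neighbors=5, metric='precomputed')
knn.fit(graph, y)
print(knn.score(graph, y))  # resubstitution score, for illustration only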
User guide¶
Welcome to scikit-hubness!
Here we describe the core functionality of the package
(hubness analysis, hubness reduction, neighbor search),
and provide several usage examples.
Core Concepts¶
There are three main parts of scikit-hubness. Before we describe the corresponding subpackages, we briefly introduce the hubness phenomenon itself.
The hubness phenomenon¶
Hubness is a phenomenon of intrinsically high-dimensional data, detrimental to data mining and learning tasks. It refers to the emergence of hubs and antihubs in k-nearest neighbor graphs (kNNGs): Hubs are objects that appear unwontedly often among the k-nearest neighbor lists of other objects, while antihubs hardly or never appear in these lists. Thus, hubs propagate their metainformation (such as class labels) widely within a kNNG. Conversely, information carried by antihubs is effectively lost. As a result, hubness leads to semantically distorted spaces that negatively impact a large variety of tasks.
Music information retrieval is a showcase example for hubness: It has been shown that recommendation lists based on music similarity scores tend to completely ignore certain songs (antihubs), while other songs are recommended over and over again (hubs), sometimes even when they do not fit. Both effects are problematic: Users are provided with unsuitable (hub) recommendations, while artists who (unknowingly) produce antihub songs may unjustifiably remain unknown.
The scikit-hubness package¶
scikit-hubness
reflects our effort to make hubness analysis and
hubness reduction readily available and easy-to-use for both machine learning
researchers and practitioners.
The package builds upon scikit-learn.
When feasible, its design decisions, code style, development practices etc. are adopted, so that new users can find their way into scikit-hubness rapidly.
Workflows, therefore, comprise the well-known fit, predict, and score methods.
Two subpackages offer complementary functionality to scikit-learn
:
- skhubness.analysis allows estimating hubness in data
- skhubness.reduction provides hubness reduction algorithms
The skhubness.neighbors
subpackage, on the other hand, acts as a drop-in
replacement for sklearn.neighbors
. It provides all of its functionality,
and adds two major components:
- transparent hubness reduction
- approximate nearest neighbor (ANN) search
and combinations of both. From the coding point of view, this is achieved by adding a handful of new parameters to most classes (KNeighborsClassifier, RadiusNeighborsRegressor, NearestNeighbors, etc.):
- hubness defines the hubness reduction algorithm used to compute the nearest neighbor graph (kNNG). Supported algorithms and corresponding parameter values are presented here, and are available as a Python list in skhubness.reduction.hubness_algorithms.
- algorithm defines the kNNG construction algorithm, similarly to the way sklearn does it. That is, all of sklearn's algorithms are available, but several approximate nearest neighbor algorithms are provided in addition. See below for a list of currently supported algorithms and their corresponding parameter values.
By providing the two arguments above, you select algorithms
for hubness reduction and nearest neighbor search, respectively.
Most of these algorithms can be further tuned by individual hyperparameters.
These are not made directly accessible in high-level classes like KNeighborsClassifier, because they differ from algorithm to algorithm and would otherwise lead to very long argument lists.
Instead, two dictionaries, hubness_params and algorithm_params, are available in all high-level classes to set the nested arguments for hubness reduction and ANN methods, respectively.
The following example shows how to perform approximate hubness estimation (1) without, and (2) with hubness reduction by local scaling in an artificial data set.
In part 1, we estimate hubness in the original data.
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1_000_000,
n_features=500,
n_informative=400,
random_state=123)
from sklearn.model_selection import train_test_split
X_train, X_test = train_test_split(X, test_size=0.1, random_state=456)
from skhubness.analysis import Hubness
hub = Hubness(k=10,
metric='euclidean',
algorithm='hnsw',
algorithm_params={'n_candidates': 100,
'metric': 'euclidean',
'post_processing': 2,
},
return_value='robinhood',
n_jobs=8,
)
hub.fit(X_train)
robin_hood = hub.score(X_test)
print(robin_hood)
0.7873205555555555 # before hubness reduction
There is high hubness in this dataset. In part 2, we estimate hubness after reduction by local scaling.
hub = Hubness(k=10,
metric='euclidean',
hubness='local_scaling',
hubness_params={'k': 7},
algorithm='hnsw',
algorithm_params={'n_candidates': 100,
'metric': 'euclidean',
'post_processing': 2,
},
return_value='robinhood',
verbose=2
)
hub.fit(X_train)
robin_hood = hub.score(X_test)
print(robin_hood)
0.6614583333333331 # after hubness reduction
Approximate nearest neighbor search methods¶
Set the parameter algorithm
to one of the following in order to enable ANN in
most of the classes from skhubness.neighbors
or Hubness
:
- ‘hnsw’ uses hierarchical navigable small-world graphs (provided by the nmslib library) in the wrapper class HNSW.
- ‘lsh’ uses locality sensitive hashing (provided by the puffinn library) in the wrapper class PuffinnLSH.
- ‘falconn_lsh’ uses locality sensitive hashing (provided by the falconn library) in the wrapper class FalconnLSH.
- ‘nng’ uses ANNG or ONNG (provided by the NGT library) in the wrapper class NNG.
- ‘rptree’ uses random projection trees (provided by the annoy library) in the wrapper class RandomProjectionTree.
Configure parameters of the chosen algorithm with algorithm_params
.
This dictionary is passed to the corresponding wrapper class.
Take a look at their documentation to see which parameters are available for each individual class.
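For instance, a brief sketch of selecting and configuring the HNSW backend (parameter values follow the examples in this guide; X_train and X_test are assumed to be available):
from skhubness.neighbors import NearestNeighbors
# Choose the ANN backend via `algorithm`; `algorithm_params` is passed on to the wrapper class (here: HNSW)
ann = NearestNeighbors(n_neighbors=10,
                       algorithm='hnsw',
                       algorithm_params={'n_candidates': 100,
                                         'metric': 'euclidean',
                                         'post_processing': 2})
ann.fit(X_train)
neigh_dist, neigh_ind = ann.kneighbors(X_test)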
Hubness reduction methods¶
Set the parameter hubness
to one of the following identifiers
in order to use the corresponding hubness reduction algorithm:
- ‘mp’ or ‘mutual_proximity’ use mutual proximity (Gaussian or empiric distribution) as implemented in MutualProximity.
- ‘ls’ or ‘local_scaling’ use local scaling or NICDM as implemented in LocalScaling.
- ‘dsl’ or ‘dis_sim_local’ use DisSim Local as implemented in DisSimLocal.
Variants and additional parameters are set with the hubness_params
dictionary.
Have a look at the individual hubness reduction classes for available parameters.
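As a brief sketch, selecting mutual proximity with the empiric distance model (the 'method' key follows the MutualProximity variants documented in this guide; X_train, y_train, X_test, y_test are assumed to be available):
from skhubness.neighbors import KNeighborsClassifier
# Choose the hubness reduction method via `hubness`; `hubness_params` is passed on to the reduction class (here: MutualProximity)
knn = KNeighborsClassifier(n_neighbors=5,
                           hubness='mp',
                           hubness_params={'method': 'empiric'})
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))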
Approximate hubness reduction¶
Exact hubness reduction scales at least quadratically with the number of samples. To reduce computational complexity, approximate hubness reduction can be applied, as described in the paper “Fast approximate hubness reduction for large high-dimensional data” (ICBK2018, on IEEE Xplore, also available as technical report).
The general idea behind approximate hubness reduction is as follows:
- retrieve the n_candidates nearest neighbors using an ANN method,
- refine and reorder the candidate list by hubness reduction,
- return the n_neighbors nearest neighbors from the reordered candidate list.
The procedure is implemented in scikit-hubness by simply passing both
algorithm
and hubness
parameters to the relevant classes.
Also consider passing algorithm_params={'n_candidates': n_candidates}
.
Make sure to set n_candidates high enough to obtain high sensitivity (towards “good” nearest neighbors). Too large values may, however, lead to long query times. As a rule of thumb for this trade-off, you can start by retrieving ten times as many candidates as you need nearest neighbors.
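Putting this together, a minimal sketch of approximate hubness reduction for classification (parameter values are illustrative and follow the rule of thumb above; X_train, y_train, X_test, y_test are assumed to be available):
from skhubness.neighbors import KNeighborsClassifier
n_neighbors = 10
knn = KNeighborsClassifier(n_neighbors=n_neighbors,
                           algorithm='hnsw',                                     # ANN candidate retrieval
                           algorithm_params={'n_candidates': 10 * n_neighbors},  # rule of thumb: 10x
                           hubness='ls',                                         # refine candidates by local scaling
                           hubness_params={'k': 7})
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))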
Hubness analysis¶
You can use the skhubness.analysis
subpackage
to assess whether your data is prone to hubness.
Currently, the Hubness
class
acts as a one-stop-shop for hubness estimation.
It provides several hubness measures, which are all computed from a nearest neighbor graph (kNNG).
More specifically, hubness is measured from the k-occurrence, that is, how often an object occurs in the k-nearest neighbor lists of other objects (reverse nearest neighbors).
Traditionally, hubness has been measured by the skewness of the k-occurrence histogram,
where higher skewness to the right indicates higher hubness (due to objects that appear very
often as nearest neighbors).
Recently, additional indices borrowed from inequality research have been proposed for measuring hubness, such as the Robin Hood index or the Gini index of the k-occurrence distribution, which may have more desirable properties w.r.t. large datasets and interpretability.
The Hubness
class provides a variety of these measures.
It is based on scikit-learn’s BaseEstimator
, and thus follows scikit-learn principles.
When a new instance is created, sensible default parameters are used,
unless specific choices are made.
Typically, the user may want to choose a parameter k
to define the size
of nearest neighbor lists, or metric
, in case the default Euclidean distances
do not fit the data well.
Parameter return_value
defines which hubness measures to use.
VALID_HUBNESS_MEASURES
provides a list of available measures.
If return_value='all', all available measures are computed.
The algorithm
parameter defines how to compute the kNN graph.
This is especially relevant for large datasets, as it provides more efficient index
structures and approximate nearest neighbor algorithms.
For example, algorithm='hnsw'
uses a hierarchical navigable small-world graph
to compute the hubness measures in log-linear time (instead of quadratic).
Hubness
uses fit
and score
methods to estimate hubness.
In this fictional example, we estimate hubness in terms of the Robin Hood index in some large dataset:
>>> X = (some large dataset)
>>> hub = Hubness(k=10,
...               return_value='robinhood',
...               algorithm='hnsw')
>>> hub.fit(X) # Creates the HNSW index
>>> hub.score()
0.56
A Robin Hood index of 0.56 indicates that 56% of all slots in nearest neighbor lists would need to be redistributed in order to obtain equal k-occurrence for all objects. We'd consider this rather high hubness.
In order to evaluate whether hubness reduction might be beneficial for downstream tasks (learning etc.), we can perform the same estimation with hubness reduction enabled.
We use the same code as above, but add the hubness parameter:
>>> X = (some large dataset)
>>> hub = Hubness(k=10,
...               return_value='robinhood',
...               algorithm='hnsw',
...               hubness='local_scaling')
>>> hub.fit(X)
>>> hub.score()
0.35
Here, the hubness reduction method local scaling resulted in a markedly lower Robin Hood index.
Note that we used the complete data set X in the examples above.
We can also split the data into X_train and X_test:
>>> hub.fit(X_train)
>>> hub.score(X_test)
0.36
This is useful when you want to tune hyperparameters towards low hubness while preventing data leakage.
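For example, a small sketch of comparing hubness reduction methods by their Robin Hood index on held-out data (identifiers as listed in the hubness reduction section; X denotes the dataset at hand):
from sklearn.model_selection import train_test_split
from skhubness.analysis import Hubness
X_train, X_test = train_test_split(X, test_size=0.1, random_state=0)
for hubness in [None, 'mp', 'ls', 'dsl']:
    hub = Hubness(k=10, return_value='robinhood', hubness=hubness)
    hub.fit(X_train)
    print(f'{hubness}: {hub.score(X_test):.3f}')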
Hubness measures¶
The degree of hubness in a dataset is typically measured from its k-occurrence histogram \(O^k\). For an individual data object x, its k-occurrence \(O^k(x)\) is defined as the number of times x resides among the k-nearest neighbors of all other objects in the data set. In terms of network analysis, \(O^k(x)\) is the indegree of x in a directed kNN graph. It is also known as the reverse neighbor count.
The following measures are provided in Hubness by passing the corresponding argument values (e.g. return_value='robinhood'):
‘k_skewness’: Skewness, the third central moment of the k-occurrence distribution, as introduced by Radovanović et al. 2010
‘k_skewness_truncnorm’: skewness of truncated normal distribution estimated from k-occurrence distribution.
‘atkinson’: the Atkinson index of inequality, which can be tuned in order to be more sensitive towards antihub or hubs.
‘gini’: the Gini coefficient of inequality, defined as half of the relative mean absolute difference
‘robinhood’: the Robin Hood or Hoover index, which gives the amount that needs to be redistributed in order to obtain equality (e.g. the proportion of total income, so that there is equal income for all; or the number of nearest neighbor slots, so that all objects are among the k-nearest neighbors of others exactly k times).
‘antihubs’: indices of antihubs in data set X (objects that are never among the nearest neighbors of other objects).
‘antihub_occurrence’: proportion of antihubs in the data set (percentage of total objects, which are antihubs).
‘hubs’: indices of hub objects x in data set X (with \(O^k(x) > \text{hub_size} * k\), where \(\text{hub_size} = 2\) by default).
‘hub_occurrence’: proportion of nearest neighbor slots occupied by hubs
‘groupie_ratio’: proportion of objects with the largest hub in their neighborhood
‘k_neighbors’: indices to k-nearest neighbors for each object
‘k_occurrence’: reverse neighbor count for each object
‘all’: return a dictionary containing all of the above
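A small sketch of computing all measures at once (assumption: the returned dictionary is keyed by the identifiers listed above; X denotes the dataset at hand):
from skhubness.analysis import Hubness
hub = Hubness(k=10, return_value='all')
hub.fit(X)
measures = hub.score()   # dictionary with one entry per measure
print(measures['robinhood'], measures['k_skewness'])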
Hubness reduction¶
The skhubness.reduction
subpackage provides several hubness reduction methods.
Currently, the supported methods are
- Mutual proximity (independent Gaussian distance distribution), provided by MutualProximity with method='normal' (default),
- Mutual proximity (empiric distance distribution), provided by MutualProximity with method='empiric',
- Local scaling, provided by LocalScaling with method='standard' (default),
- Non-iterative contextual dissimilarity measure, provided by LocalScaling with method='nicdm',
- DisSim Local, provided by DisSimLocal,
which represent the most successful hubness reduction methods as identified in our paper “A comprehensive empirical comparison of hubness reduction in high-dimensional spaces”, KAIS (2019), DOI. This survey paper also comes with an overview of how the individual methods work.
There are two ways to perform hubness reduction in scikit-hubness:
- implicitly, using the classes in skhubness.neighbors (see User Guide: Nearest neighbors),
- explicitly, using the classes in skhubness.reduction.
The former is the common approach if you simply want to improve your learning task by hubness reduction; most examples here do so. The latter may, however, be more useful for researchers who would like to investigate the hubness phenomenon itself.
All hubness reducers inherit from a common base class HubnessReduction.
This abstract class defines two important methods, fit and transform, which allow transforming previously unseen data after the initial fit.
Most hubness reduction methods do not operate on vector data, but manipulate precomputed distances in order to obtain secondary distances.
Therefore, fit and transform take neighbor graphs as input instead of vectors.
Have a look at their signatures:
@abstractmethod
def fit(self, neigh_dist, neigh_ind, X, assume_sorted, *args, **kwargs):
    pass  # pragma: no cover

@abstractmethod
def transform(self, neigh_dist, neigh_ind, X, assume_sorted, return_distance=True):
    pass  # pragma: no cover
The arguments neigh_dist
and neigh_ind
are two arrays representing the nearest neighbor graph
with shape (n_indexed, n_neighbors)
during fit, and
shape (n_query, n_neighbors)
during transform.
The i-th row in each array corresponds to the i-th object in the data set.
The j-th column in neigh_ind
contains the index of one of the k-nearest neighbors among the indexed objects,
while the j-th column in neigh_dist
contains the corresponding distance.
Note that this is the same format as returned by scikit-learn's kneighbors(return_distance=True) method.
This way, the user has full flexibility in how to calculate primary distances (Euclidean, cosine, KL divergence, etc.).
DisSimLocal
(DSL) is the exception to this rule,
because it is formulated specifically for Euclidean distances.
DSL, therefore, also requires the training vectors in fit(..., X=X_train)
,
and the test set vectors in transform(..., X=X_test)
.
Argument X
is ignored in the other hubness reduction methods.
When the neighbor graph is already sorted (lowest to highest distance),
assume_sorted=True
should be set, so that hubness reduction methods
will not sort the arrays again, thus saving computational time.
Hubness reduction methods transform the primary distance graph and return secondary distances. Note that, for efficiency reasons, the returned arrays are not sorted. Please make sure to sort the arrays if downstream tasks assume sorted neighbor lists.
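A minimal sketch of this explicit workflow, assuming LocalScaling accepts the k and method parameters noted above and that X may be None for methods that ignore it (for DisSimLocal, pass X=X_train in fit and X=X_test in transform instead):
import numpy as np
from sklearn.neighbors import NearestNeighbors
from skhubness.reduction import LocalScaling
# Primary distances: a sorted kNN graph computed with scikit-learn (any metric works)
nn = NearestNeighbors(n_neighbors=25).fit(X_train)
neigh_dist, neigh_ind = nn.kneighbors(X_train)
# Fit local scaling on the training neighbor graph, then transform a (here: the same) neighbor graph
ls = LocalScaling(k=7, method='standard')
ls.fit(neigh_dist, neigh_ind, X=None, assume_sorted=True)
sec_dist, sec_ind = ls.transform(neigh_dist, neigh_ind, X=None, assume_sorted=True)
# Secondary distances come back unsorted; sort them for downstream tasks
order = np.argsort(sec_dist, axis=1)
sec_dist = np.take_along_axis(sec_dist, order, axis=1)
sec_ind = np.take_along_axis(sec_ind, order, axis=1)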
Nearest neighbors¶
The skhubness.neighbors
subpackage provides several neighbors-based learning methods.
It is designed as a drop-in replacement for scikit-learn’s neighbors
.
The package provides all functionality from sklearn.neighbors
,
and adds support for transparent hubness reduction, where applicable, including
- classification (e.g. KNeighborsClassifier),
- regression (e.g. RadiusNeighborsRegressor),
- unsupervised learning (e.g. NearestNeighbors),
- outlier detection (LocalOutlierFactor), and
- kNN graphs (kneighbors_graph).
In addition, scikit-hubness provides approximate nearest neighbor (ANN) search, in order to support large data sets with millions of data objects and more. A list of currently provided ANN methods is available here.
Hubness reduction and ANN search can be used independently or in conjunction, the latter yielding approximate hubness reduction. Users of scikit-learn will find that only minor modifications of their code are required to enable one or both of the above. We describe how to do so here.
For general information and details about nearest neighbors, we refer to the excellent scikit-learn User Guide on Nearest Neighbors.
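In the simplest case, migrating from scikit-learn therefore amounts to changing an import line and, optionally, adding one or two arguments, for example (X_train, y_train, X_test, y_test denote your data):
# before: from sklearn.neighbors import KNeighborsClassifier
from skhubness.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors=5,
                           hubness='mutual_proximity',  # enable hubness reduction
                           algorithm='hnsw')            # enable ANN search
knn.fit(X_train, y_train)
print(knn.score(X_test, y_test))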
Examples¶
In this section, we provide usage examples for skhubness
.
Example: Hubness reduction¶
These examples show how to perform hubness reduction in kNN classification in (nested) cross-validation and pipelines.
Example: skhubness in Pipelines¶
Estimators from scikit-hubness can, of course, be used in a scikit-learn Pipeline.
In this example, we select the best hubness reduction method and several other hyperparameters in a grid search w.r.t. classification performance.
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.model_selection import StratifiedKFold, train_test_split, GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from skhubness.neighbors import KNeighborsClassifier
# Not so high-dimensional data
X, y = make_classification(n_samples=1_000,
n_features=50,
n_informative=20,
n_classes=2,
random_state=3453)
X, X_test, y, y_test = train_test_split(X, y,
test_size=100,
stratify=y,
shuffle=True,
random_state=124)
# Pipeline of standardization, dimensionality reduction, and kNN classification
pipe = Pipeline([('scale', StandardScaler(with_mean=True, with_std=True)),
('pca', PCA(n_components=20, random_state=1213)),
('knn', KNeighborsClassifier(n_neighbors=10, algorithm='lsh', hubness='mp'))])
# Exhaustive search for best algorithms and hyperparameters
param_grid = {'pca__n_components': [10, 20, 30],
'knn__n_neighbors': [5, 10, 20],
'knn__algorithm': ['auto', 'hnsw', 'lsh', 'falconn_lsh', 'nng', 'rptree'],
'knn__hubness': [None, 'mp', 'ls', 'dsl']}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1354)
search = GridSearchCV(pipe, param_grid, n_jobs=5, cv=cv, verbose=1)
search.fit(X, y)
# Performance on hold-out data
acc = search.score(X_test, y_test)
print(acc)
# 0.79
print(search.best_params_)
# {'knn__algorithm': 'auto',
# 'knn__hubness': 'dsl',
# 'knn__n_neighbors': 20,
# 'pca__n_components': 30}
Face recognition (Olivetti faces)¶
This dataset contains a set of face images taken between April 1992 and April 1994 at AT&T Laboratories Cambridge. Image data is typically embedded in very high-dimensional spaces, which might be prone to hubness.
import numpy as np
from sklearn.datasets import fetch_olivetti_faces
from sklearn.model_selection import cross_val_score, StratifiedKFold, RandomizedSearchCV
from skhubness import Hubness
from skhubness.neighbors import KNeighborsClassifier
# Fetch data and have a look
d = fetch_olivetti_faces()
X, y = d['data'], d['target']
print(f'Data shape: {X.shape}')
print(f'Label shape: {y.shape}')
# (400, 4096)
# (400,)
# The data is embedded in a high-dimensional space.
# Is there hubness, and can we reduce it?
for hubness in [None, 'dsl', 'ls', 'mp']:
    hub = Hubness(k=10, hubness=hubness, return_value='k_skewness')
    hub.fit(X)
    score = hub.score()
    print(f'Hubness (10-skew): {score:.3f} with hubness reduction: {hubness}')
# Hubness (10-skew): 1.972 with hubness reduction: None
# Hubness (10-skew): 1.526 with hubness reduction: dsl
# Hubness (10-skew): 0.943 with hubness reduction: ls
# Hubness (10-skew): 0.184 with hubness reduction: mp
# There is some hubness, and all hubness reduction methods can reduce it (to varying degree)
# Let's assess the best kNN strategy and its estimated performance.
cv_perf = StratifiedKFold(n_splits=5, shuffle=True, random_state=7263)
cv_select = StratifiedKFold(n_splits=5, shuffle=True, random_state=32634)
knn = KNeighborsClassifier(algorithm_params={'n_candidates': 100})
# specify parameters and distributions to sample from
param_dist = {"n_neighbors": np.arange(1, 26),
"weights": ['uniform', 'distance'],
"hubness": [None, 'dsl', 'ls', 'mp']}
# Inner cross-validation to select best hyperparameters (incl hubness reduction method)
search = RandomizedSearchCV(estimator=knn,
param_distributions=param_dist,
n_iter=100,
cv=cv_select,
random_state=2345,
verbose=1)
# Outer cross-validation to estimate performance
score = cross_val_score(search, X, y, cv=cv_perf, verbose=1)
print(f'Scores: {score}')
print(f'Mean acc = {score.mean():.3f} +/- {score.std():.3f}')
# Select model that maximizes accuracy
search.fit(X, y)
# The best model's parameters
print(search.best_params_)
# Does it correspond to the results of hubness reduction above?
# Scores: [0.95 0.9625 1. 0.95 0.925 ]
# Mean acc = 0.957 +/- 0.024
# {'weights': 'distance', 'n_neighbors': 23, 'hubness': 'mp'}
Example: Approximate nearest neighbor search¶
This example shows how to perform approximate nearest neighbor search.
Retrieving GloVe word vectors¶
In this example we will retrieve similar words from GloVe embeddings with an ANNG graph.
Precomputed ground-truth nearest neighbors are available from ANN benchmarks.
# For this example, the `h5py` package is required in addition to the requirements of scikit-hubness.
# You may install it from PyPI by the following command (if you're in an IPython/Jupyter environment):
# !pip install h5py
import numpy as np
import h5py
from skhubness.neighbors import NearestNeighbors
# Download the dataset with the following command.
# If the dataset is already available in the current working dir, you can skip this:
# !wget http://ann-benchmarks.com/glove-100-angular.hdf5
f = h5py.File('glove-100-angular.hdf5', 'r')
# Extract the split and ground-truth
X_train = f['train']
X_test = f['test']
neigh_true = f['neighbors']
dist = f['distances']
# How many objects have we got?
for k in f.keys():
    print(f'{k}: shape = {f[k].shape}')
# APPROXIMATE NEAREST NEIGHBOR SEARCH
# In order to retrieve most similar words from the GLOVE embeddings,
# we use the unsupervised `skhubness.neighbors.NearestNeighbors` class.
# The (approximate) nearest neighbor algorithm is set to NNG by passing `algorithm='nng'`.
# We can pass additional parameters to `NNG` via the `algorithm_params` dict.
# Here we set `n_jobs=8` to enable parallelism.
# Create the nearest neighbor index
nn_plain = NearestNeighbors(n_neighbors=100,
algorithm='nng',
algorithm_params={'n_candidates': 1_000,
'index_dir': 'auto',
'n_jobs': 8},
verbose=2,
)
nn_plain.fit(X_train)
# Note that NNG must save its index. By setting `index_dir='auto'`,
# NNG will try to save it to shared memory, if available, otherwise to $TMP.
# This index is NOT removed automatically, as one will typically want to build an index once and use it often.
# Retrieve nearest neighbors for each test object
neigh_pred_plain = nn_plain.kneighbors(X_test,
n_neighbors=100,
return_distance=False)
# Calculate the recall per test object
recalled_plain = [np.intersect1d(neigh_true[i], neigh_pred_plain[i])
                  for i in range(len(X_test))]
recall_plain = np.array([recalled_plain[i].size / neigh_true.shape[1]
for i in range(len(X_test))])
# Statistics
print(f'Mean = {recall_plain.mean():.4f}, '
f'stdev = {recall_plain.std():.4f}')
# ANN with HUBNESS REDUCTION
# Here we set `n_candidates=1000`, so that for each query,
# 1000 neighbors will be retrieved first by `NNG`,
# that are subsequently refined by hubness reduction.
# Hubness reduction is performed by local scaling as specified with `hubness='ls'`.
# Creating the NN index with hubness reduction enabled
nn = NearestNeighbors(n_neighbors=100,
algorithm='nng',
algorithm_params={'n_candidates': 1_000,
'n_jobs': 8},
hubness='ls',
verbose=2,
)
nn.fit(X_train)
# Retrieve nearest neighbors for each test object
neigh_pred = nn.kneighbors(X_test,
n_neighbors=100,
return_distance=False)
# Measure recall per object and on average
recalled = [np.intersect1d(neigh_true[i], neigh_pred[i])
            for i in range(len(X_test))]
recall = np.array([recalled[i].size / neigh_true.shape[1]
for i in range(len(X_test))])
print(f'Mean = {recall.mean():.4f}, '
f'stdev = {recall.std():.4f}')
# If the second results are significantly better than the first,
# this could indicate that the chosen ANN method is more prone
# to hubness than exact NN, which might be an interesting research question.
Example: Approximate hubness reduction¶
These examples show how to combine approximate nearest neighbor search and hubness reduction.
Example: Reusing index structures¶
This example shows how to reuse index structures: if you want to first estimate hubness and then perform kNN classification, you can avoid recomputing the ANN index structure, which can be costly.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from skhubness.analysis import Hubness
from skhubness.neighbors import KNeighborsClassifier
X, y = make_classification(n_samples=100_000,
n_features=500,
n_informative=400,
random_state=543)
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=0.01,
stratify=y,
shuffle=True,
random_state=2346)
# Approximate hubness estimation: Creates LSH index and computes local scaling factors
hub = Hubness(k=10,
return_value='robinhood',
algorithm='falconn_lsh',
hubness='ls',
random_state=2345,
shuffle_equal=False,
verbose=1)
hub.fit(X_train)
robin_hood = hub.score(X_test)
print(f'Hubness (Robin Hood): {robin_hood:.4f}')
# 0.9060
# Approximate hubness reduction for classification: Reuse index & factors
knn = KNeighborsClassifier(n_neighbors=10,
algorithm='falconn_lsh',
hubness='ls',
n_jobs=1)
knn.fit(hub.nn_index_, y_train) # REUSE INDEX HERE
acc = knn.score(X_test, y_test)
print(f'Test accuracy: {acc:.3f}')
# 0.959
Example: Approximate hubness reduction¶
This example shows how to combine approximate nearest neighbor search and hubness reduction in order to perform approximate hubness reduction for large data sets.
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from skhubness.analysis import Hubness
from skhubness.neighbors import KNeighborsClassifier
# High-dimensional artificial data
X, y = make_classification(n_samples=1_000_000,
n_features=500,
n_informative=400,
random_state=543)
X_train, X_test, y_train, y_test = train_test_split(X, y,
test_size=10_000,
stratify=y,
shuffle=True,
random_state=2346)
# Approximate hubness estimation
hub = Hubness(k=10,
return_value='robinhood',
algorithm='hnsw',
random_state=2345,
shuffle_equal=False,
n_jobs=-1,
verbose=2)
hub.fit(X_train)
robin_hood = hub.score(X_test)
print(f'Hubness (Robin Hood): {robin_hood:.3f}')
# 0.944
# Approximate hubness reduction for classification
knn = KNeighborsClassifier(n_neighbors=10,
algorithm='hnsw',
hubness='ls',
n_jobs=-1,
verbose=2)
knn.fit(X_train, y_train)
y_pred = knn.predict(X_test)
acc = accuracy_score(y_test, y_pred)
print(f'Test accuracy: {acc:.3f}')
# Test accuracy: 0.987
scikit-learn examples adapted for scikit-hubness¶
Examples of using skhubness.neighbors as a drop-in replacement for sklearn.neighbors.
These examples are taken from scikit-learn and demonstrate the ease of transition
from sklearn.neighbors
to skhubness.neighbors
.
You will find that many examples require no more than modifying an import line,
and/or adding one argument when instantiating an estimator.
Note that these examples are not intended to demonstrate improved learning performance due to hubness reduction (the data are rather low-dimensional).
Nearest Neighbors regression¶
Demonstrate the resolution of a regression problem using a k-Nearest Neighbor and the interpolation of the target using both barycenter and constant weights.
Hubness reduction of this low-dimensional dataset shows only small effects.
Adapted from https://scikit-learn.org/stable/auto_examples/neighbors/plot_regression.html

print(__doc__)
# Author: Alexandre Gramfort <alexandre.gramfort@inria.fr>
# Fabian Pedregosa <fabian.pedregosa@inria.fr>
#
# License: BSD 3 clause (C) INRIA
# #############################################################################
# Generate sample data
import numpy as np
import matplotlib.pyplot as plt
from skhubness.neighbors import KNeighborsRegressor
np.random.seed(0)
X = np.sort(5 * np.random.rand(40, 1), axis=0)
T = np.linspace(0, 5, 500)[:, np.newaxis]
y = np.sin(X).ravel()
# Add noise to targets
y[::5] += 1 * (0.5 - np.random.rand(8))
# #############################################################################
# Fit regression model
n_neighbors = 5
f = plt.figure()
for i, weights in enumerate(['uniform', 'distance']):
    for j, hubness in enumerate([None, 'local_scaling']):
        knn = KNeighborsRegressor(n_neighbors,
                                  algorithm_params={'n_candidates': 39},
                                  weights=weights,
                                  hubness=hubness)
        y_ = knn.fit(X, y).predict(T)

        plt.subplot(2, 2, i * 2 + j + 1)
        f.set_figheight(15)
        f.set_figwidth(15)
        plt.scatter(X, y, c='k', label='data')
        plt.plot(T, y_, c='g', label='prediction')
        plt.axis('tight')
        plt.legend()
        plt.title(f"KNeighborsRegressor (k = {n_neighbors}, weights = '{weights}', hubness = '{hubness}')")

plt.tight_layout()
plt.show()
Nearest Centroid Classification¶
Sample usage of Nearest Centroid classification. It will plot the decision boundaries for each class.
Note that no hubness reduction is currently implemented for centroids. However, skhubness.neighbors retains all the features of sklearn.neighbors, in order to act as a full drop-in replacement.
Adapted from https://scikit-learn.org/stable/auto_examples/neighbors/plot_nearest_centroid.html
Out:
None 0.8133333333333334
0.2 0.82
print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import datasets
from skhubness.neighbors import NearestCentroid
n_neighbors = 15
# import some data to play with
iris = datasets.load_iris()
# we only take the first two features. We could avoid this ugly
# slicing by using a two-dim dataset
X = iris.data[:, :2]
y = iris.target
h = .02 # step size in the mesh
# Create color maps
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
for shrinkage in [None, .2]:
    # we create an instance of Neighbours Classifier and fit the data.
    clf = NearestCentroid(shrink_threshold=shrinkage)
    clf.fit(X, y)
    y_pred = clf.predict(X)
    print(shrinkage, np.mean(y == y_pred))

    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold,
                edgecolor='k', s=20)
    plt.title("3-Class classification (shrink_threshold=%r)"
              % shrinkage)
    plt.axis('tight')

plt.show()
Nearest Neighbors Classification¶
Sample usage of Nearest Neighbors classification. It will plot the decision boundaries for each class.
Adapted from https://scikit-learn.org/stable/auto_examples/neighbors/plot_classification.html
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import datasets
from skhubness.neighbors import KNeighborsClassifier
n_neighbors = 15
# import some data to play with
iris = datasets.load_iris()
# we only take the first two features. We could avoid this ugly
# slicing by using a two-dim dataset
X = iris.data[:, :2]
y = iris.target
h = .02 # step size in the mesh
# Create color maps
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
for hubness in [None, 'mutual_proximity']:
    # we create an instance of Neighbours Classifier and fit the data.
    clf = KNeighborsClassifier(n_neighbors,
                               hubness=hubness,
                               weights='distance')
    clf.fit(X, y)

    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
    y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
    xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
                         np.arange(y_min, y_max, h))
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light)

    # Plot also the training points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold,
                edgecolor='k', s=20)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.title("3-Class classification (k = %i, hubness = '%s')"
              % (n_neighbors, hubness))

plt.show()
Dimensionality Reduction with Neighborhood Components Analysis¶
Sample usage of Neighborhood Components Analysis for dimensionality reduction.
This example compares different (linear) dimensionality reduction methods applied on the Digits data set. The data set contains images of digits from 0 to 9 with approximately 180 samples of each class. Each image is of dimension 8x8 = 64, and is reduced to a two-dimensional data point.
Principal Component Analysis (PCA) applied to this data identifies the combination of attributes (principal components, or directions in the feature space) that account for the most variance in the data. Here we plot the different samples on the 2 first principal components.
Linear Discriminant Analysis (LDA) tries to identify attributes that account for the most variance between classes. In particular, LDA, in contrast to PCA, is a supervised method, using known class labels.
Neighborhood Components Analysis (NCA) tries to find a feature space such that a stochastic nearest neighbor algorithm will give the best accuracy. Like LDA, it is a supervised method.
One can see that NCA enforces a clustering of the data that is visually meaningful despite the large reduction in dimension.
Adapted from https://scikit-learn.org/stable/auto_examples/neighbors/plot_nca_dim_reduction.html
# License: BSD 3 clause
import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from skhubness.neighbors import (KNeighborsClassifier,
NeighborhoodComponentsAnalysis)
print(__doc__)
n_neighbors = 3
random_state = 0
# Load Digits dataset
digits = datasets.load_digits()
X, y = digits.data, digits.target
# Split into train/test
X_train, X_test, y_train, y_test = \
train_test_split(X, y, test_size=0.5, stratify=y,
random_state=random_state)
dim = len(X[0])
n_classes = len(np.unique(y))
# Reduce dimension to 2 with PCA
pca = make_pipeline(StandardScaler(),
PCA(n_components=2, random_state=random_state))
# Reduce dimension to 2 with LinearDiscriminantAnalysis
lda = make_pipeline(StandardScaler(),
LinearDiscriminantAnalysis(n_components=2))
# Reduce dimension to 2 with NeighborhoodComponentAnalysis
nca = make_pipeline(StandardScaler(),
NeighborhoodComponentsAnalysis(n_components=2,
random_state=random_state))
# Use a nearest neighbor classifier to evaluate the methods
knn = KNeighborsClassifier(n_neighbors=n_neighbors)
# Make a list of the methods to be compared
dim_reduction_methods = [('PCA', pca), ('LDA', lda), ('NCA', nca)]
# plt.figure()
for i, (name, model) in enumerate(dim_reduction_methods):
    plt.figure()
    # plt.subplot(1, 3, i + 1, aspect=1)

    # Fit the method's model
    model.fit(X_train, y_train)

    # Fit a nearest neighbor classifier on the embedded training set
    knn.fit(model.transform(X_train), y_train)

    # Compute the nearest neighbor accuracy on the embedded test set
    acc_knn = knn.score(model.transform(X_test), y_test)

    # Embed the data set in 2 dimensions using the fitted model
    X_embedded = model.transform(X)

    # Plot the projected points and show the evaluation score
    plt.scatter(X_embedded[:, 0], X_embedded[:, 1], c=y, s=30, cmap='Set1')
    plt.title("{}, KNN (k={})\nTest accuracy = {:.2f}".format(name,
                                                              n_neighbors,
                                                              acc_knn))
plt.show()
Face completion with multi-output estimators¶
This example shows the use of multi-output estimators to complete images. The goal is to predict the lower half of a face given its upper half.
The first column of images shows true faces. The next columns illustrate how extremely randomized trees, linear regression, ridge regression, and k nearest neighbors with or without hubness reduction complete the lower half of those faces.
Adapted from https://scikit-learn.org/stable/auto_examples/plot_multioutput_face_completion.html

print(__doc__)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.datasets import fetch_olivetti_faces
from sklearn.utils.validation import check_random_state
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import RidgeCV
from skhubness.neighbors import KNeighborsRegressor
# Load the faces datasets
data = fetch_olivetti_faces()
targets = data.target
data = data.images.reshape((len(data.images), -1))
train = data[targets < 30]
test = data[targets >= 30] # Test on independent people
# Test on a subset of people
n_faces = 5
rng = check_random_state(4)
face_ids = rng.randint(test.shape[0], size=(n_faces, ))
test = test[face_ids, :]
n_pixels = data.shape[1]
# Upper half of the faces
X_train = train[:, :(n_pixels + 1) // 2]
# Lower half of the faces
y_train = train[:, n_pixels // 2:]
X_test = test[:, :(n_pixels + 1) // 2]
y_test = test[:, n_pixels // 2:]
# Fit estimators
ESTIMATORS = {
"Extra trees": ExtraTreesRegressor(n_estimators=10, max_features=32,
random_state=0),
"k-NN": KNeighborsRegressor(weights='distance'),
"k-NN MP": KNeighborsRegressor(hubness='mp',
hubness_params={'method': 'normal'},
weights='distance'),
"Linear regression": LinearRegression(),
"Ridge": RidgeCV(),
}
y_test_predict = dict()
for name, estimator in ESTIMATORS.items():
    estimator.fit(X_train, y_train)
    y_test_predict[name] = estimator.predict(X_test)
# Plot the completed faces
image_shape = (64, 64)
n_cols = 1 + len(ESTIMATORS)
plt.figure(figsize=(2. * n_cols, 2.26 * n_faces))
plt.suptitle("Face completion with multi-output estimators", size=16)
for i in range(n_faces):
    true_face = np.hstack((X_test[i], y_test[i]))

    if i:
        sub = plt.subplot(n_faces, n_cols, i * n_cols + 1)
    else:
        sub = plt.subplot(n_faces, n_cols, i * n_cols + 1,
                          title="true faces")
    sub.axis("off")
    sub.imshow(true_face.reshape(image_shape),
               cmap=plt.cm.gray,
               interpolation="nearest")

    for j, est in enumerate(sorted(ESTIMATORS)):
        completed_face = np.hstack((X_test[i], y_test_predict[est][i]))

        if i:
            sub = plt.subplot(n_faces, n_cols, i * n_cols + 2 + j)
        else:
            sub = plt.subplot(n_faces, n_cols, i * n_cols + 2 + j,
                              title=est)
        sub.axis("off")
        sub.imshow(completed_face.reshape(image_shape),
                   cmap=plt.cm.gray,
                   interpolation="nearest")

plt.show()
Comparing Nearest Neighbors with and without Neighborhood Components Analysis¶
An example comparing nearest neighbors classification with and without Neighborhood Components Analysis.
It will plot the class decision boundaries given by a Nearest Neighbors classifier when using the Euclidean distance on the original features, versus using the Euclidean distance after the transformation learned by Neighborhood Components Analysis. The latter aims to find a linear transformation that maximises the (stochastic) nearest neighbor classification accuracy on the training set.
Adapted from https://scikit-learn.org/stable/auto_examples/neighbors/plot_nca_classification.html
# License: BSD 3 clause
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.colors import ListedColormap
from sklearn import datasets
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from skhubness.neighbors import (KNeighborsClassifier,
NeighborhoodComponentsAnalysis)
import warnings
warnings.filterwarnings('ignore')
print(__doc__)
n_neighbors = 1
dataset = datasets.load_iris()
X, y = dataset.data, dataset.target
# we only take two features. We could avoid this ugly
# slicing by using a two-dim dataset
X = X[:, [0, 2]]
X_train, X_test, y_train, y_test = \
train_test_split(X, y, stratify=y, test_size=0.7, random_state=42)
h = .01 # step size in the mesh
# Create color maps
cmap_light = ListedColormap(['#FFAAAA', '#AAFFAA', '#AAAAFF'])
cmap_bold = ListedColormap(['#FF0000', '#00FF00', '#0000FF'])
names = ['KNN',
'NCA, KNN',
'KNN, MP (normal)',
'KNN, MP (empiric)',
'KNN, LS (standard)',
'KNN, LS (nicdm)',
]
classifiers = [Pipeline([('scaler', StandardScaler()),
('knn', KNeighborsClassifier(n_neighbors=n_neighbors))
]),
Pipeline([('scaler', StandardScaler()),
('nca', NeighborhoodComponentsAnalysis()),
('knn', KNeighborsClassifier(n_neighbors=n_neighbors))
]),
Pipeline([('scaler', StandardScaler()),
('knn', KNeighborsClassifier(n_neighbors=n_neighbors,
hubness='mutual_proximity',
hubness_params={'method': 'normal'}))
]),
Pipeline([('scaler', StandardScaler()),
('knn', KNeighborsClassifier(n_neighbors=n_neighbors,
hubness='mutual_proximity',
hubness_params={'method': 'empiric'}))
]),
Pipeline([('scaler', StandardScaler()),
('knn', KNeighborsClassifier(n_neighbors=n_neighbors,
hubness='local_scaling',
hubness_params={'method': 'standard'}))
]),
Pipeline([('scaler', StandardScaler()),
('knn', KNeighborsClassifier(n_neighbors=n_neighbors,
hubness='local_scaling',
hubness_params={'method': 'nicdm'}))
]),
]
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
xx, yy = np.meshgrid(np.arange(x_min, x_max, h),
np.arange(y_min, y_max, h))
for name, clf in zip(names, classifiers):
    clf.fit(X_train, y_train)
    score = clf.score(X_test, y_test)

    # Plot the decision boundary. For that, we will assign a color to each
    # point in the mesh [x_min, x_max]x[y_min, y_max].
    Z = clf.predict(np.c_[xx.ravel(), yy.ravel()])

    # Put the result into a color plot
    Z = Z.reshape(xx.shape)
    plt.figure()
    plt.pcolormesh(xx, yy, Z, cmap=cmap_light, alpha=.8)

    # Plot also the training and testing points
    plt.scatter(X[:, 0], X[:, 1], c=y, cmap=cmap_bold, edgecolor='k', s=20)
    plt.xlim(xx.min(), xx.max())
    plt.ylim(yy.min(), yy.max())
    plt.title("{} (k = {})".format(name, n_neighbors))
    plt.text(0.9, 0.1, '{:.2f}'.format(score), size=15,
             ha='center', va='center', transform=plt.gca().transAxes)

plt.show()
Manifold learning on handwritten digits: Locally Linear Embedding, Isomap…¶
An illustration of various embeddings on the digits dataset.
The RandomTreesEmbedding, from the sklearn.ensemble module, is not technically a manifold embedding method, as it learns a high-dimensional representation on which we apply a dimensionality reduction method. However, it is often useful to cast a dataset into a representation in which the classes are linearly separable.
t-SNE will be initialized with the embedding that is generated by PCA in this example, which is not the default setting. It ensures global stability of the embedding, i.e., the embedding does not depend on random initialization.
Linear Discriminant Analysis, from the sklearn.discriminant_analysis module, and Neighborhood Components Analysis, from the sklearn.neighbors module, are supervised dimensionality reduction methods, i.e. they make use of the provided labels, contrary to other methods.
Adapted from https://scikit-learn.org/stable/auto_examples/manifold/plot_lle_digits.html
Out:
Computing random projection
Computing PCA projection
Computing Linear Discriminant Analysis projection
Computing Isomap projection
Done.
Computing LLE embedding
Done. Reconstruction error: 1.63545e-06
Computing modified LLE embedding
Done. Reconstruction error: 0.360653
Computing Hessian LLE embedding
Done. Reconstruction error: 0.2128
Computing LTSA embedding
Done. Reconstruction error: 0.2128
Computing MDS embedding
Done. Stress: 135359737.175700
Computing MDS embedding from local scaling neighbors graph
Done. Stress: 89610.656628
Computing MDS embedding from mutual proximity graph
Done. Stress: 25752.919623
Computing Totally Random Trees embedding
Computing Spectral embedding
Computing t-SNE embedding
Computing NCA projection
# Authors: Fabian Pedregosa <fabian.pedregosa@inria.fr>
# Olivier Grisel <olivier.grisel@ensta.org>
# Mathieu Blondel <mathieu@mblondel.org>
# Gael Varoquaux
# Roman Feldbauer
# License: BSD 3 clause (C) INRIA 2011
print(__doc__)
from time import time
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import offsetbox
from sklearn import (manifold, datasets, decomposition, ensemble,
discriminant_analysis, random_projection)
from skhubness import neighbors
digits = datasets.load_digits(n_class=6)
X = digits.data
y = digits.target
n_samples, n_features = X.shape
n_neighbors = 30
# ----------------------------------------------------------------------
# Scale and visualize the embedding vectors
def plot_embedding(X, title=None):
    x_min, x_max = np.min(X, 0), np.max(X, 0)
    X = (X - x_min) / (x_max - x_min)

    plt.figure()
    ax = plt.subplot(111)
    for i in range(X.shape[0]):
        plt.text(X[i, 0], X[i, 1], str(y[i]),
                 color=plt.cm.Set1(y[i] / 10.),
                 fontdict={'weight': 'bold', 'size': 9})

    if hasattr(offsetbox, 'AnnotationBbox'):
        # only print thumbnails with matplotlib > 1.0
        shown_images = np.array([[1., 1.]])  # just something big
        for i in range(X.shape[0]):
            dist = np.sum((X[i] - shown_images) ** 2, 1)
            if np.min(dist) < 4e-3:
                # don't show points that are too close
                continue
            shown_images = np.r_[shown_images, [X[i]]]
            imagebox = offsetbox.AnnotationBbox(
                offsetbox.OffsetImage(digits.images[i], cmap=plt.cm.gray_r),
                X[i])
            ax.add_artist(imagebox)
    plt.xticks([]), plt.yticks([])
    if title is not None:
        plt.title(title)
# ----------------------------------------------------------------------
# Plot images of the digits
n_img_per_row = 20
img = np.zeros((10 * n_img_per_row, 10 * n_img_per_row))
for i in range(n_img_per_row):
    ix = 10 * i + 1
    for j in range(n_img_per_row):
        iy = 10 * j + 1
        img[ix:ix + 8, iy:iy + 8] = X[i * n_img_per_row + j].reshape((8, 8))
plt.imshow(img, cmap=plt.cm.binary)
plt.xticks([])
plt.yticks([])
plt.title('A selection from the 64-dimensional digits dataset')
# ----------------------------------------------------------------------
# Random 2D projection using a random unitary matrix
print("Computing random projection")
rp = random_projection.SparseRandomProjection(n_components=2, random_state=42)
X_projected = rp.fit_transform(X)
plot_embedding(X_projected, "Random Projection of the digits")
#----------------------------------------------------------------------
# Projection on to the first 2 principal components
print("Computing PCA projection")
t0 = time()
X_pca = decomposition.TruncatedSVD(n_components=2).fit_transform(X)
plot_embedding(X_pca,
"Principal Components projection of the digits (time %.2fs)" %
(time() - t0))
# ----------------------------------------------------------------------
# Projection on to the first 2 linear discriminant components
print("Computing Linear Discriminant Analysis projection")
X2 = X.copy()
X2.flat[::X.shape[1] + 1] += 0.01 # Make X invertible
t0 = time()
X_lda = discriminant_analysis.LinearDiscriminantAnalysis(n_components=2).fit_transform(X2, y)
plot_embedding(X_lda,
"Linear Discriminant projection of the digits (time %.2fs)" %
(time() - t0))
# ----------------------------------------------------------------------
# Isomap projection of the digits dataset
print("Computing Isomap projection")
t0 = time()
X_iso = manifold.Isomap(n_neighbors, n_components=2).fit_transform(X)
print("Done.")
plot_embedding(X_iso,
"Isomap projection of the digits (time %.2fs)" %
(time() - t0))
# ----------------------------------------------------------------------
# Locally linear embedding of the digits dataset
print("Computing LLE embedding")
clf = manifold.LocallyLinearEmbedding(n_neighbors, n_components=2,
method='standard')
t0 = time()
X_lle = clf.fit_transform(X)
print("Done. Reconstruction error: %g" % clf.reconstruction_error_)
plot_embedding(X_lle,
"Locally Linear Embedding of the digits (time %.2fs)" %
(time() - t0))
# ----------------------------------------------------------------------
# Modified Locally linear embedding of the digits dataset
print("Computing modified LLE embedding")
clf = manifold.LocallyLinearEmbedding(n_neighbors, n_components=2,
method='modified')
t0 = time()
X_mlle = clf.fit_transform(X)
print("Done. Reconstruction error: %g" % clf.reconstruction_error_)
plot_embedding(X_mlle,
"Modified Locally Linear Embedding of the digits (time %.2fs)" %
(time() - t0))
# ----------------------------------------------------------------------
# HLLE embedding of the digits dataset
print("Computing Hessian LLE embedding")
clf = manifold.LocallyLinearEmbedding(n_neighbors, n_components=2,
method='hessian')
t0 = time()
X_hlle = clf.fit_transform(X)
print("Done. Reconstruction error: %g" % clf.reconstruction_error_)
plot_embedding(X_hlle,
"Hessian Locally Linear Embedding of the digits (time %.2fs)" %
(time() - t0))
# ----------------------------------------------------------------------
# LTSA embedding of the digits dataset
print("Computing LTSA embedding")
clf = manifold.LocallyLinearEmbedding(n_neighbors, n_components=2,
method='ltsa')
t0 = time()
X_ltsa = clf.fit_transform(X)
print("Done. Reconstruction error: %g" % clf.reconstruction_error_)
plot_embedding(X_ltsa,
"Local Tangent Space Alignment of the digits (time %.2fs)" %
(time() - t0))
# ----------------------------------------------------------------------
# MDS embedding of the digits dataset
print("Computing MDS embedding")
clf = manifold.MDS(n_components=2, n_init=1, max_iter=2000,
dissimilarity='euclidean', metric=True,
)
t0 = time()
X_mds = clf.fit_transform(X)
print("Done. Stress: %f" % clf.stress_)
plot_embedding(X_mds,
"MDS embedding of the digits (time %.2fs)" %
(time() - t0))
# ----------------------------------------------------------------------
# Hubness reduction (LS) + MDS embedding of the digits dataset
print("Computing MDS embedding from local scaling neighbors graph")
clf = manifold.MDS(n_components=2, n_init=1, max_iter=2000,
dissimilarity='precomputed', metric=True,
)
t0 = time()
graph = neighbors.graph.kneighbors_graph(
X, n_neighbors=X.shape[0]-1, mode='distance', hubness='local_scaling').toarray()
X_mds = clf.fit_transform(graph)
print("Done. Stress: %f" % clf.stress_)
plot_embedding(X_mds,
"Hubness reduction (LS) - MDS embedding (time %.2fs)" %
(time() - t0))
# ----------------------------------------------------------------------
# Hubness reduction (MP) + MDS embedding of the digits dataset
print("Computing MDS embedding from mutual proximity graph")
clf = manifold.MDS(n_components=2, n_init=1, max_iter=2000,
dissimilarity='precomputed', metric=True,
)
t0 = time()
graph = neighbors.graph.kneighbors_graph(
X, n_neighbors=1082, mode='distance', hubness='mp').toarray()
X_mds = clf.fit_transform(graph)
print("Done. Stress: %f" % clf.stress_)
plot_embedding(X_mds,
"Hubness reduction (MP) - MDS embedding (time %.2fs)" %
(time() - t0))
# ----------------------------------------------------------------------
# Random Trees embedding of the digits dataset
print("Computing Totally Random Trees embedding")
hasher = ensemble.RandomTreesEmbedding(n_estimators=200, random_state=0,
max_depth=5)
t0 = time()
X_transformed = hasher.fit_transform(X)
pca = decomposition.TruncatedSVD(n_components=2)
X_reduced = pca.fit_transform(X_transformed)
plot_embedding(X_reduced,
"Random forest embedding of the digits (time %.2fs)" %
(time() - t0))
# ----------------------------------------------------------------------
# Spectral embedding of the digits dataset
print("Computing Spectral embedding")
embedder = manifold.SpectralEmbedding(n_components=2, random_state=0,
eigen_solver="arpack")
t0 = time()
X_se = embedder.fit_transform(X)
plot_embedding(X_se,
"Spectral embedding of the digits (time %.2fs)" %
(time() - t0))
# ----------------------------------------------------------------------
# t-SNE embedding of the digits dataset
print("Computing t-SNE embedding")
tsne = manifold.TSNE(n_components=2, init='pca', random_state=0)
t0 = time()
X_tsne = tsne.fit_transform(X)
plot_embedding(X_tsne,
"t-SNE embedding of the digits (time %.2fs)" %
(time() - t0))
# ----------------------------------------------------------------------
# NCA projection of the digits dataset
print("Computing NCA projection")
nca = neighbors.NeighborhoodComponentsAnalysis(n_components=2, random_state=0)
t0 = time()
X_nca = nca.fit_transform(X, y)
plot_embedding(X_nca,
"NCA embedding of the digits (time %.2fs)" %
(time() - t0))
plt.show()
Total running time of the script: (1 minute 24.114 seconds)
API Documentation¶
This is the API documentation for scikit-hubness.
Analysis: skhubness.analysis¶
The skhubness.analysis package provides methods for measuring hubness.
Hubness: Examine hubness characteristics of data.
VALID_HUBNESS_MEASURES: Built-in mutable sequence listing the supported hubness measures.
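As a minimal sketch of using the Hubness estimator to report a specific measure (this assumes a return_value parameter that selects which measure score() reports; check the class docstring for the supported values):
import numpy as np
from skhubness.analysis import Hubness

X = np.random.rand(500, 300)               # toy high-dimensional data

# Assumption: return_value selects the measure reported by score(),
# e.g. the Robin Hood index instead of the default k-skewness.
hub = Hubness(k=10, metric='cosine', return_value='robinhood')
hub.fit(X)
print(f'Robin Hood index: {hub.score():.3f}')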
Neighbors: skhubness.neighbors¶
The skhubness.neighbors package is a drop-in replacement for sklearn.neighbors, providing all of its features while adding transparent support for hubness reduction and approximate nearest neighbor search.
BallTree: BallTree for fast generalized N-point problems.
DistanceMetric: DistanceMetric class.
KDTree: KDTree for fast generalized N-point problems.
HNSW: Wrapper for using nmslib.
KNeighborsClassifier: Classifier implementing the k-nearest neighbors vote.
KNeighborsRegressor: Regression based on k-nearest neighbors.
FalconnLSH: Wrapper for using falconn LSH.
NearestCentroid: Nearest centroid classifier.
NearestNeighbors: Unsupervised learner for implementing neighbor searches.
NNG: Wrapper for ngtpy and NNG variants.
PuffinnLSH: Wrap Puffinn LSH for scikit-learn compatibility.
RadiusNeighborsClassifier: Classifier implementing a vote among neighbors within a given radius.
RadiusNeighborsRegressor: Regression based on neighbors within a fixed radius.
RandomProjectionTree: Wrapper for using annoy.AnnoyIndex.
kneighbors_graph: Computes the (weighted) graph of k-neighbors for points in X.
radius_neighbors_graph: Computes the (weighted) graph of neighbors for points in X.
KernelDensity: Kernel Density Estimation.
LocalOutlierFactor: Unsupervised outlier detection using Local Outlier Factor (LOF).
NeighborhoodComponentsAnalysis: Neighborhood Components Analysis.
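As a minimal, hedged sketch of how these classes are used (parameter values are illustrative; algorithm='hnsw' requires the optional nmslib dependency):
import numpy as np
from skhubness.neighbors import NearestNeighbors

X = np.random.rand(1000, 50)              # toy data

# Approximate nearest neighbor search via the HNSW wrapper;
# exact search is used if algorithm is left at its default.
ann = NearestNeighbors(n_neighbors=10, algorithm='hnsw')
ann.fit(X)
neigh_dist, neigh_ind = ann.kneighbors(X[:5])
print(neigh_ind.shape)                    # (5, 10)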
Reduction: skhubness.reduction¶
The skhubness.reduction package provides methods for hubness reduction.
MutualProximity: Hubness reduction with Mutual Proximity.
LocalScaling: Hubness reduction with Local Scaling.
DisSimLocal: Hubness reduction with DisSimLocal.
hubness_algorithms: Supported hubness reduction algorithms.
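In typical use, a hubness reduction method is selected by name via the hubness argument of the skhubness.neighbors estimators; the following sketch assumes that method-specific options (here, the NICDM variant of local scaling) can be forwarded through a hubness_params dict:
import numpy as np
from skhubness.neighbors import KNeighborsClassifier

X = np.random.rand(500, 100)              # toy data
y = np.random.randint(2, size=500)        # toy binary labels

# Assumption: hubness_params forwards options to the reduction class,
# e.g. the 'nicdm' method of LocalScaling.
knn = KNeighborsClassifier(n_neighbors=5,
                           hubness='local_scaling',
                           hubness_params={'method': 'nicdm'})
knn.fit(X, y)
print(knn.score(X, y))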
History of scikit-hubness¶
scikit-hubness builds upon previous software: the Hub-Toolbox.
The original Hub-Toolbox was written for Matlab and released alongside the first hubness reduction methods published in JMLR.
In essence, it comprises methods to reduce hubness in distance matrices.
The Hub-Toolbox for Python3 is a port from Matlab to Python, which over the years gained several extensions and additional functionality, such as more hubness reduction methods (Localized Centering, DisSimLocal, mp-dissim, etc.), approximate hubness reduction, and more. The software was developed by hubness researchers for hubness research.
The new scikit-hubness package is rewritten from scratch with a different goal in mind: providing easy-to-use neighborhood-based data mining methods (classification, regression, etc.) with transparent hubness reduction.
Building upon scikit-learn’s neighbors package, we provide a drop-in replacement called skhubness.neighbors, which offers all the functionality of sklearn.neighbors and adds further functionality (approximate nearest neighbor search, hubness reduction).
This way, machine learning researchers and practitioners (many of whom are fluent in scikit-learn) can quickly and effectively employ scikit-hubness in their existing workflows, and improve learning on their high-dimensional data.
Contributing¶
scikit-hubness is free open source software. Contributions from the community are highly appreciated. Even small contributions improve the software’s quality.
Even if you are not familiar with the programming languages and tools involved, you can contribute by reporting bugs or other problems as a GitHub issue.
Git and branching model¶
We use git for version control, as do most projects nowadays. If you are not familiar with git, there are plenty of tutorials in the GitHub Guides. All the important basics are covered in the GitHub Git handbook.
Development of scikit-hubness (mostly) follows the git flow branching model. There are two main branches: master and develop. For any changes, a new branch should be created. To add a new feature, fix a noncritical bug, etc., branch off develop. Only for critical bug fixes should you branch off master.
Workflow¶
In case of large changes to the software, please first get in contact with the authors for coordination, for example by filing an issue. If you want to fix small issues (typos in the docs, obvious errors, etc.) you can - of course - directly submit a pull request (PR).
- Create a fork of scikit-hubness in your GitHub account.
Simply click “Fork” button on https://github.com/VarIr/scikit-hubness.
- Clone your fork on your computer.
$ git clone git@github.com:YOUR-ACCOUNT-GOES-HERE/scikit-hubness.git && cd scikit-hubness
- Add remote upstream.
$ git remote add upstream git@github.com:VarIr/scikit-hubness.git
- Create feature/bugfix branch.
In case of a feature or noncritical bugfix: $ git checkout develop && git checkout -b featureXYZ develop
In case of a critical bug: $ git checkout -b bugfix123 master
- Implement feature/fix bug/fix typo/…
Happy coding!
- Create a commit with a meaningful message
If you only modified existing files, simply
$ git commit -am "descriptive message what this commit does (in present tense) here"
- Push to GitHub
e.g. $ git push origin featureXYZ
- Create pull request (PR)
Git will likely provide a link to directly create the PR. If not, click “New pull request” on your fork on GitHub.
- Wait…
Several devops checks will be performed automatically (e.g. continuous integration (CI) with Travis, AppVeyor).
The authors will get in contact with you, and may ask for changes.
- Respond to code review.
If there were issues with continuous integration, or the authors asked for changes, please create a new commit locally, and simply push again to GitHub as you did before. The PR will be updated automatically.
- Maintainers merge PR, when all issues are resolved.
Thanks a lot for your contribution!
Code style and further guidelines¶
Please make sure all code complies with PEP 8.
All code should be documented sufficiently (functions, classes, etc. must have docstrings with a general description, parameters, ideally return values, raised exceptions, notes, etc.).
Documentation style is NumPy format (see the minimal sketch after this list).
New code must be covered by unit tests using pytest.
If you fix a bug, please provide regression tests (tests that fail on the old code and succeed on the new code).
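A minimal NumPy-style docstring could look like the following sketch (k_occurrence is a hypothetical helper used only for illustration, not part of the package API):
import numpy as np


def k_occurrence(neigh_ind, n_samples):
    """Count how often each sample occurs among the k-nearest neighbor lists.

    Parameters
    ----------
    neigh_ind : ndarray of shape (n_query, k)
        Indices of the k nearest neighbors of each query point.
    n_samples : int
        Total number of samples in the training set.

    Returns
    -------
    k_occurrence : ndarray of shape (n_samples,)
        Number of times each sample appears in ``neigh_ind``.
    """
    return np.bincount(neigh_ind.ravel(), minlength=n_samples)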
It may be helpful to install scikit-hubness in editable mode during development. When you have already cloned the package, switch into the corresponding directory, and
pip install -e .
(don’t omit the trailing period). This way, any changes to the code take effect immediately, without re-installing the package after every change.
Testing¶
In scikit-hubness, we aim for high code coverage. As of September 2019, between 98% and 99% of all code lines are visited at least once when running the complete test suite. This is primarily to ensure:
correctness of the code (to some extent) and
maintainability (new changes don’t break old code).
When creating a new PR, ideally all new code should be covered by tests. Sometimes this is not feasible, or only with great effort. Pull requests will likely be accepted if the overall code coverage at least does not decrease.
Unit tests are automatically performed for each PR using CI tools online. This may take some time, however. To run the tests locally, you need pytest installed. From the scikit-hubness directory, call
pytest skhubness/
to run all the tests. You can also restrict the tests to the subpackage you are working on, down to single tests. For example
pytest skhubness/reduction/tests/test_local_scaling.py --showlocals -v
only runs tests for hubness reduction with local scaling.
In order to check code coverage locally, you need the pytest-cov plugin.
pytest skhubness/reduction/ --cov=skhubness/reduction/
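If you are adding tests for new code, a minimal pytest-style sketch might look like this (the test name and the checked property are illustrative only):
import numpy as np
from skhubness import Hubness


def test_hubness_score_is_finite():
    """Hubness estimation on random data returns a finite skewness value."""
    rng = np.random.RandomState(123)
    X = rng.rand(100, 50)
    hub = Hubness(k=5)
    hub.fit(X)
    assert np.isfinite(hub.score())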
Changelog¶
Next release¶
…
Added or enhanced¶
Lower memory footprint for sparse targets in multilabel classification (previously converted to dense arrays) #61
Fixes¶
Hubness estimation could fail when ANN does not return enough neighbors #59
Heuristic to choose memory for Puffinn LSH.
0.21.2 - 2020-01-14¶
This is a maintenance release due to the publication in the Journal of Open Source Software.
0.21.1 - 2019-12-10¶
This is a bugfix release due to the recent update of scikit-learn to v0.22.
Fixes¶
Require scikit-learn v0.21.3.
Until the necessary adaptations for v0.22 are completed, scikit-hubness will require scikit-learn v0.21.3.
0.21.0 - 2019-11-25¶
This is the first major release of scikit-hubness.
Added¶
Enable ONNG provided by NGT (optimized ANNG). Pass optimize=True to NNG.
User Guide: Description of all subpackages and common usage scenarios.
Examples: Various usage examples.
Several tests.
Classes inheriting from SupervisedIntegerMixin can be fit with an ApproximateNearestNeighbor or NearestNeighbors instance, thus reusing precomputed indices.
Changes¶
Use argument algorithm='nng' for ANNG/ONNG provided by NGT instead of 'onng'. Also set optimize=True in order to use ONNG.
Fixes¶
DisSimLocal would previously fail when invoked as hubness='dis_sim_local'.
Hubness reduction would previously ignore verbose arguments under certain circumstances.
HNSW would previously ignore n_jobs on index creation.
Fix installation instructions for puffinn.
0.21.0a9 - 2019-10-30¶
Added¶
General structure for docs
Enable NGT OpenMP support on MacOS (in addition to Linux)
Enable Puffinn LSH also on MacOS
Fixes¶
Correct mutual proximity (empiric) calculation
Better handling of optional packages (ANN libraries)
Maintenance¶
Streamlined CI builds
Several minor code improvements
New contributors¶
Silvan David Peter
Getting started¶
Get started with scikit-hubness in a breeze. Find out how to install the package, and see all core functionality applied in a single quick start example.
User Guide¶
The User Guide introduces the main concepts of scikit-hubness. It explains how to analyze your data sets for hubness, and how to use the package to lift this curse of dimensionality. You will also find examples of how to use skhubness.neighbors for approximate nearest neighbor search (with or without hubness reduction).
API Documentation¶
The API Documentation provides detailed information about the implemented methods, including method descriptions, parameters, references, examples, etc. Find all the information about specific modules and functions of scikit-hubness in this section.
History¶
A brief history of the package, and how it relates to the Hub-Toolboxes.
Development¶
There are several ways to contribute to this free open source software. We highly appreciate all input from the community, be it bug reports or code contributions.
Source code, issue tracking, discussion, and continuous integration are hosted on our GitHub page.
What’s new¶
To see what’s new in the latest version of scikit-hubness, have a look at the changelog.