skhubness.analysis.Hubness

class skhubness.analysis.Hubness(k: int = 10, return_value: str = 'k_skewness', hub_size: float = 2.0, metric='euclidean', store_k_neighbors: bool = False, store_k_occurrence: bool = False, algorithm: str = 'auto', algorithm_params: Optional[dict] = None, hubness: Optional[str] = None, hubness_params: Optional[dict] = None, verbose: int = 0, n_jobs: int = 1, random_state=None, shuffle_equal: bool = True)

Examine hubness characteristics of data.

Parameters
k: int

Neighborhood size

return_value: str, default = “k_skewness”

Hubness measure returned by score(). By default, this is the skewness of the k-occurrence histogram. Use “all” to return a dict of all available measures, or check skhubness.analysis.VALID_HUBNESS_MEASURE for the available measures.

hub_size: float

Hubs are defined as objects with k-occurrence > hub_size * k.

metric: string, one of [‘euclidean’, ‘cosine’, ‘precomputed’]

Metric to use for distance computation. Currently, only Euclidean, cosine, and precomputed distances are supported.

store_k_neighbors: bool

Whether to save the k-neighbor lists. Requires O(n_test * k) memory.

store_k_occurrence: bool

Whether to save the k-occurrence. Requires O(n_test) memory.

algorithm: {‘auto’, ‘hnsw’, ‘lsh’, ‘ball_tree’, ‘kd_tree’, ‘brute’}, optional

Algorithm used to compute the nearest neighbors:

  • ‘hnsw’ will use HNSW

  • ‘lsh’ will use FalconnLSH

  • ‘ball_tree’ will use BallTree

  • ‘kd_tree’ will use KDTree

  • ‘brute’ will use a brute-force search.

  • ‘auto’ will attempt to decide the most appropriate algorithm based on the values passed to fit() method.

Note: fitting on sparse input will override the setting of this parameter, using brute force.

algorithm_params: dict, optional

Override default parameters of the NN algorithm. For example, with algorithm=’lsh’ and algorithm_params={‘n_candidates’: 100}, one hundred approximate neighbors are retrieved with LSH. If the hubness parameter is set, the candidate neighbors are further reordered with hubness reduction. Finally, n_neighbors objects are used from the (optionally reordered) candidates.

hubness: {‘mutual_proximity’, ‘local_scaling’, ‘dis_sim_local’, None}, optional

Hubness reduction algorithm

  • ‘mutual_proximity’ or ‘mp’ will use MutualProximity

  • ‘local_scaling’ or ‘ls’ will use LocalScaling

  • ‘dis_sim_local’ or ‘dsl’ will use DisSimLocal

If None, no hubness reduction will be performed (=vanilla kNN).

hubness_params: dict, optional

Override default parameters of the selected hubness reduction algorithm. For example, with hubness=’mp’ and hubness_params={‘method’: ‘normal’}, a mutual proximity variant is used that models distance distributions with independent Gaussians.

random_state: int, RandomState instance or None, optional

If int, random_state is the seed used by the random number generator. If a RandomState instance, random_state is the random number generator. If None, the random number generator is the RandomState instance used by np.random.

shuffle_equal: bool, optional

If true and metric=’precomputed’, shuffle neighbors with identical distances to avoid artifact hubness. NOTE: This is especially useful for secondary distance measures with a finite number of possible values, e.g. SNN or MP empiric.

n_jobs: int, optional

Number of processes for parallel computations.

  • 1: Don’t use multiprocessing.

  • -1: Use all CPUs.

Note that not all steps are currently parallelized.

verbose: int, optional

Level of output messages
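
To make the roles of k and hub_size concrete, the following is a minimal pure-Python sketch, not the library's implementation. It counts k-occurrences with a brute-force Euclidean search and flags hubs with the rule k-occurrence > hub_size * k stated above. The helper names k_nearest_neighbors, k_occurrence and find_hubs are hypothetical and exist only for this illustration.

```python
import math


def k_nearest_neighbors(points, k):
    """Return, for each point, the indices of its k nearest other points."""
    neighbors = []
    for i, p in enumerate(points):
        # Distances to all other points; the point itself is excluded.
        dists = [(math.dist(p, q), j) for j, q in enumerate(points) if j != i]
        dists.sort()  # ties are resolved by the lower index in this sketch
        neighbors.append([j for _, j in dists[:k]])
    return neighbors


def k_occurrence(neighbors, n_objects):
    """Count how often each object appears in the neighbor lists of the others."""
    counts = [0] * n_objects
    for neighbor_list in neighbors:
        for j in neighbor_list:
            counts[j] += 1
    return counts


def find_hubs(counts, k, hub_size=2.0):
    """Indices of objects with k-occurrence > hub_size * k."""
    return [i for i, c in enumerate(counts) if c > hub_size * k]


# Toy data: one central point and four outer points that all lie closer
# to the center than to each other.
points = [(0.0, 0.0), (2.0, 0.0), (-2.0, 0.0), (0.0, 2.0), (0.0, -2.0)]
k = 1
neighbors = k_nearest_neighbors(points, k)
counts = k_occurrence(neighbors, len(points))
hubs = find_hubs(counts, k, hub_size=2.0)
```

In this toy data set every outer point selects the center as its nearest neighbor, so the center collects most of the k-occurrence mass and is the only object exceeding the hub threshold.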

References

[1] Radovanović, M.; Nanopoulos, A. & Ivanovic, M. Hubs in space: Popular nearest neighbors in high-dimensional data. Journal of Machine Learning Research, 2010, 11, 2487-2531.

[2] Feldbauer, R.; Leodolter, M.; Plant, C. & Flexer, A. Fast approximate hubness reduction for large high-dimensional data. IEEE International Conference on Big Knowledge, 2018.

Attributes
k_skewness: float

Hubness, measured as skewness of k-occurrence histogram [1]

k_skewness_truncnorm: float

Hubness, measured as the skewness of a truncated normal distribution fitted to the k-occurrence histogram

atkinson_index: float

Hubness, measured as the Atkinson index of k-occurrence distribution

gini_index: float

Hubness, measured as the Gini index of k-occurrence distribution

robinhood_index: float

Hubness, measured as Robin Hood index of k-occurrence distribution [2]

antihubs: int

Indices of antihubs

antihub_occurrence: float

Proportion of antihubs in data set

hubs: int

Indices of hubs

hub_occurrence: float

Proportion of k-nearest neighbor slots occupied by hubs

groupie_ratio: float

Proportion of objects with the largest hub in their neighborhood

k_occurrence: ndarray

Reverse neighbor count for each object

k_neighbors: ndarray

Indices to k-nearest neighbors for each object
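
The distribution-based measures listed above can be illustrated with a short pure-Python sketch. It uses the textbook definitions of skewness (third standardized moment, population form), the Gini index and the Robin Hood index, applied to a list of k-occurrence values. Whether the library uses exactly these normalizations is an assumption here, and the function names are hypothetical.

```python
def skewness(counts):
    """Third standardized moment of the k-occurrence values (population form)."""
    n = len(counts)
    mean = sum(counts) / n
    m2 = sum((c - mean) ** 2 for c in counts) / n
    m3 = sum((c - mean) ** 3 for c in counts) / n
    # Undefined (division by zero) if all k-occurrence values are identical.
    return m3 / m2 ** 1.5


def gini_index(counts):
    """Sum of absolute differences over all ordered pairs / (2 * n * sum)."""
    n = len(counts)
    total_diff = sum(abs(a - b) for a in counts for b in counts)
    return total_diff / (2 * n * sum(counts))


def robinhood_index(counts):
    """Half the summed absolute deviation from the mean, relative to the total."""
    mean = sum(counts) / len(counts)
    return 0.5 * sum(abs(c - mean) for c in counts) / sum(counts)


# Example k-occurrence: one dominant object and three never-chosen ones.
counts = [4, 1, 0, 0, 0]
measures = {
    "k_skewness": skewness(counts),
    "gini_index": gini_index(counts),
    "robinhood_index": robinhood_index(counts),
}
```

For a perfectly even k-occurrence (every object chosen equally often) the Gini and Robin Hood indices in this sketch are 0; they grow as a few objects absorb more of the neighbor slots.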

__init__(k: int = 10, return_value: str = 'k_skewness', hub_size: float = 2.0, metric='euclidean', store_k_neighbors: bool = False, store_k_occurrence: bool = False, algorithm: str = 'auto', algorithm_params: Optional[dict] = None, hubness: Optional[str] = None, hubness_params: Optional[dict] = None, verbose: int = 0, n_jobs: int = 1, random_state=None, shuffle_equal: bool = True)

Initialize self. See help(type(self)) for accurate signature.

Methods

__init__([k, return_value, hub_size, …])

Initialize self.

fit(X[, y])

Fit indexed objects.

get_params([deep])

Get parameters for this estimator.

score([X, y, has_self_distances])

Estimate hubness in a data set.

set_params(**params)

Set the parameters of this estimator.

fit(X, y=None) → skhubness.analysis.estimation.Hubness

Fit indexed objects.

Parameters
X: {array-like, sparse matrix}, shape (n_samples, n_features) or (n_query, n_indexed) if metric==’precomputed’

Training data vectors or distance matrix, if metric == ‘precomputed’.

y: ignored
Returns
self:

Fitted instance of Hubness
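
When a precomputed distance matrix contains many identical values, a deterministic sort always prefers the same (for example, lowest-index) objects among the ties, which can create artificial hubs. The shuffle_equal parameter addresses this. The idea can be sketched in pure Python as follows; this is an illustration under that assumption, not the library's code, and neighbors_with_shuffled_ties is a hypothetical helper.

```python
import random


def neighbors_with_shuffled_ties(distances_row, k, rng):
    """Pick k neighbor indices from one row of a distance matrix,
    ordering candidates with identical distances at random."""
    indices = list(range(len(distances_row)))
    # Shuffle first. Python's sort is stable, so candidates with equal
    # distances keep their shuffled (random) relative order.
    rng.shuffle(indices)
    indices.sort(key=lambda j: distances_row[j])
    return indices[:k]


rng = random.Random(0)
row = [1.0, 1.0, 1.0, 1.0, 5.0]  # four tied candidates and one distant object
picks = [neighbors_with_shuffled_ties(row, k=1, rng=rng)[0] for _ in range(200)]
distinct_first_neighbors = sorted(set(picks))
```

Across repeated queries the nearest-neighbor slot is spread over the tied candidates instead of always going to index 0, while the distant object is never selected.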

get_params(deep=True)

Get parameters for this estimator.

Parameters
deep: bool, default=True

If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns
params: mapping of string to any

Parameter names mapped to their values.

score(X: Optional[numpy.ndarray] = None, y=None, has_self_distances: bool = False) → Union[float, dict]

Estimate hubness in a data set.

Hubness is estimated from the distances between all objects in X to all objects in Y. If Y is None, all-against-all distances between the objects in X are used. If self.metric == ‘precomputed’, X must be a distance matrix.

Parameters
X: ndarray, shape (n_query, n_features) or (n_query, n_indexed)

Array of query vectors, or distance matrix, if self.metric == ‘precomputed’

y: ignored
has_self_distances: bool, default = False

Whether a precomputed distance matrix contains self distances, which need to be excluded.

Returns
hubness_measure: float or dict

Return the hubness measure as indicated by return_value. Additional hubness indices are provided as attributes (e.g. robinhood_index). If return_value is ‘all’, a dict of all hubness measures is returned.
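
The effect of has_self_distances can be illustrated with a small pure-Python sketch. It assumes a square precomputed distance matrix in which entry dist[i][i] is the distance of object i to itself; that layout and the helper k_neighbors_from_distances are assumptions made for this example, not the library's implementation.

```python
def k_neighbors_from_distances(dist, k, has_self_distances=False):
    """Indices of the k nearest objects per row of a precomputed distance matrix.

    If has_self_distances is True, dist[i][i] is treated as the distance of
    object i to itself and is skipped when selecting neighbors.
    """
    neighbors = []
    for i, row in enumerate(dist):
        candidates = [
            j for j in range(len(row)) if not (has_self_distances and j == i)
        ]
        candidates.sort(key=lambda j: row[j])
        neighbors.append(candidates[:k])
    return neighbors


dist = [
    [0.0, 1.0, 4.0],
    [1.0, 0.0, 2.0],
    [4.0, 2.0, 0.0],
]
# Without the flag, the zero self distance wins and every object is its own neighbor.
with_self = k_neighbors_from_distances(dist, k=1)
# With the flag, the diagonal is skipped and true neighbors are found.
without_self = k_neighbors_from_distances(dist, k=1, has_self_distances=True)
```

If self distances were left in, every object would occupy one of its own neighbor slots, which flattens the k-occurrence distribution and hides hubness.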

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters
**params: dict

Estimator parameters.

Returns
self: object

Estimator instance.