skhubness.analysis.Hubness

class skhubness.analysis.Hubness(k: int = 10, return_value: str = 'k_skewness', hub_size: float = 2.0, metric='euclidean', store_k_neighbors: bool = False, store_k_occurrence: bool = False, algorithm: str = 'auto', algorithm_params: Optional[dict] = None, hubness: Optional[str] = None, hubness_params: Optional[dict] = None, verbose: int = 0, n_jobs: int = 1, random_state=None, shuffle_equal: bool = True)

Examine hubness characteristics of data.

Parameters

k: int

Neighborhood size

return_value: str, default = “k_skewness”

Hubness measure returned by score(). By default, this is the skewness of the k-occurrence histogram. Use “all” to return a dict of all available measures, or check skhubness.analysis.VALID_HUBNESS_MEASURE for the available measures.

hub_size: float

Hubs are defined as objects with k-occurrence > hub_size * k.
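To make the hub criterion concrete, here is a minimal pure-Python sketch that counts k-occurrences with a brute-force Euclidean search and then applies the threshold k-occurrence > hub_size * k. It is not the library's implementation; the helper name k_occurrence and the random toy data are assumptions made for illustration.

```python
import math
import random
from collections import Counter


def k_occurrence(points, k):
    """Count how often each point appears in the k-NN lists of the other points."""
    n = len(points)
    counts = Counter()
    for i, query in enumerate(points):
        # Distances from the query to every other point (the query itself is excluded).
        dists = [(math.dist(query, points[j]), j) for j in range(n) if j != i]
        for _, j in sorted(dists)[:k]:
            counts[j] += 1
    return [counts[j] for j in range(n)]


# Hypothetical toy data: 200 Gaussian points in 50 dimensions.
rng = random.Random(0)
points = [[rng.gauss(0.0, 1.0) for _ in range(50)] for _ in range(200)]

k, hub_size = 5, 2.0
occ = k_occurrence(points, k)

# Hubs are the objects whose k-occurrence exceeds hub_size * k.
hubs = [i for i, n_k in enumerate(occ) if n_k > hub_size * k]
print(f"{len(hubs)} hubs, largest k-occurrence: {max(occ)}")
```

Every object contributes exactly k neighbor slots, so the k-occurrences always sum to k * n and their mean is k. A larger hub_size therefore selects only objects that appear far more often than average.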

metric: string, one of [‘euclidean’, ‘cosine’, ‘precomputed’]

Metric to use for distance computation. Currently, only Euclidean, cosine, and precomputed distances are supported.

store_k_neighbors: bool

Whether to save the k-neighbor lists. Requires O(n_test * k) memory.

store_k_occurrence: bool

Whether to save the k-occurrence. Requires O(n_test) memory.

algorithm: {‘auto’, ‘hnsw’, ‘lsh’, ‘ball_tree’, ‘kd_tree’, ‘brute’}, optional

Algorithm used to compute the nearest neighbors:

‘hnsw’ will use HNSW
‘lsh’ will use FalconnLSH
‘ball_tree’ will use BallTree
‘kd_tree’ will use KDTree
‘brute’ will use a brute-force search.
‘auto’ will attempt to decide the most appropriate algorithm based on the values passed to fit() method.

Note: fitting on sparse input will override the setting of this parameter, using brute force.

algorithm_params: dict, optional

Override default parameters of the NN algorithm. For example, with algorithm=‘lsh’ and algorithm_params={‘n_candidates’: 100}, one hundred approximate neighbors are retrieved with LSH. If the parameter hubness is set, the candidate neighbors are further reordered with hubness reduction. Finally, n_neighbors objects are used from the (optionally reordered) candidates.

hubness: {‘mutual_proximity’, ‘local_scaling’, ‘dis_sim_local’, None}, optional

Hubness reduction algorithm

‘mutual_proximity’ or ‘mp’ will use MutualProximity
‘local_scaling’ or ‘ls’ will use LocalScaling
‘dis_sim_local’ or ‘dsl’ will use DisSimLocal

If None, no hubness reduction will be performed (=vanilla kNN).

hubness_params: dict, optional

Override default parameters of the selected hubness reduction algorithm. For example, with hubness=‘mp’ and hubness_params={‘method’: ‘normal’}, a mutual proximity variant is used, which models distance distributions with independent Gaussians.
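As a rough illustration of what "independent Gaussians" can mean here, the sketch below fits one normal distribution to each object's distances and turns a primary distance d(x, y) into the secondary distance 1 - P(X > d) * P(Y > d). This follows the general mutual proximity idea only. The function name mutual_proximity_normal is hypothetical, and the library's estimator may differ in its details.

```python
import math
import random
from statistics import NormalDist, fmean, pstdev


def mutual_proximity_normal(D):
    """Secondary distances from a square distance matrix D (list of lists)."""
    n = len(D)
    # Model each object's distances to all other objects with a single Gaussian.
    models = []
    for i in range(n):
        others = [D[i][j] for j in range(n) if j != i]
        models.append(NormalDist(fmean(others), pstdev(others)))

    D_mp = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = D[i][j]
            # Under independence: probability that a random object is farther
            # than d from both i and j.
            p = (1.0 - models[i].cdf(d)) * (1.0 - models[j].cdf(d))
            D_mp[i][j] = D_mp[j][i] = 1.0 - p
    return D_mp


# Hypothetical toy data: 30 Gaussian points in 10 dimensions.
rng = random.Random(1)
X = [[rng.gauss(0.0, 1.0) for _ in range(10)] for _ in range(30)]
D = [[math.dist(a, b) for b in X] for a in X]
D_mp = mutual_proximity_normal(D)
```

The result is symmetric and bounded in [0, 1]. A pair receives a small secondary distance only when the two objects are close relative to both of their own distance distributions.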

random_state: int, RandomState instance or None, optional

If int, random_state is the seed used by the random number generator. If RandomState instance, random_state is the random number generator. If None, the random number generator is the RandomState instance used by np.random.

shuffle_equal: bool, optional

If True and metric == ‘precomputed’, shuffle neighbors with identical distances to avoid artifact hubness. NOTE: This is especially useful for secondary distance measures with a finite number of possible values, e.g. SNN or MP empiric.
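The effect of tied distances can be reproduced with a small sketch. It assumes one simple way of breaking ties at random (shuffle the candidates, then apply a stable sort), which is not necessarily how the library implements shuffle_equal. The helper names and the all-ties distance matrix are made up for illustration.

```python
import random


def knn_from_distances(dist_row, k, exclude, rng=None):
    """Return k neighbor indices for one row of a precomputed distance matrix."""
    candidates = [j for j in range(len(dist_row)) if j != exclude]
    if rng is not None:
        # Randomize the order first; the stable sort below then breaks ties randomly.
        rng.shuffle(candidates)
    candidates.sort(key=lambda j: dist_row[j])
    return candidates[:k]


n, k = 100, 5
# Extreme case of a distance with few possible values: every pair is tied at 1.0.
D = [[0.0 if i == j else 1.0 for j in range(n)] for i in range(n)]


def max_occurrence(rng):
    """Largest k-occurrence over all objects."""
    counts = [0] * n
    for i in range(n):
        for j in knn_from_distances(D[i], k, exclude=i, rng=rng):
            counts[j] += 1
    return max(counts)


without_shuffle = max_occurrence(None)
with_shuffle = max_occurrence(random.Random(0))
print(without_shuffle, with_shuffle)
```

Without shuffling, the lowest indices win every tie and appear in the neighbor list of every other object, so they look like hubs although the distances carry no such information. With shuffling, the neighbor slots are spread over all objects.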

n_jobs: int, optional

Number of processes for parallel computations.

1: Don’t use multiprocessing.
-1: Use all CPUs.

Note that not all steps are currently parallelized.

verbose: int, optional

Level of output messages

References

1: Radovanović, M.; Nanopoulos, A. & Ivanovic, M. Hubs in space: Popular nearest neighbors in high-dimensional data. Journal of Machine Learning Research, 2010, 11, 2487-2531
2: Feldbauer, R.; Leodolter, M.; Plant, C. & Flexer, A. Fast approximate hubness reduction for large high-dimensional data. IEEE International Conference on Big Knowledge, 2018

Attributes

k_skewness: float: Hubness, measured as skewness of k-occurrence histogram [1]
k_skewness_truncnorm: float: Hubness, measured as skewness of truncated normal distribution fitted with k-occurrence histogram
atkinson_index: float: Hubness, measured as the Atkinson index of k-occurrence distribution
gini_index: float: Hubness, measured as the Gini index of k-occurrence distribution
robinhood_index: float: Hubness, measured as Robin Hood index of k-occurrence distribution [2]
antihubs: int: Indices of antihubs
antihub_occurrence: float: Proportion of antihubs in data set
hubs: int: Indices of hubs
hub_occurrence: float: Proportion of k-nearest neighbor slots occupied by hubs
groupie_ratio: float: Proportion of objects with the largest hub in their neighborhood
k_occurrence: ndarray: Reverse neighbor count for each object
k_neighbors: ndarray: Indices of the k-nearest neighbors for each object
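Several of these measures are simple functions of the k-occurrence values. The sketch below uses textbook formulas (population skewness, Robin Hood index as half the relative mean absolute deviation, Gini index as the normalized mean absolute difference). It also assumes that antihubs are objects with a k-occurrence of zero, which the text above does not state. The function names are hypothetical and the library's estimators may differ in detail.

```python
import statistics


def k_skewness(occ):
    """Third standardized moment of the k-occurrence values (population form)."""
    mean = statistics.fmean(occ)
    std = statistics.pstdev(occ)
    return sum((x - mean) ** 3 for x in occ) / (len(occ) * std ** 3)


def robinhood_index(occ):
    """Share of neighbor slots that must be redistributed to equalize all counts."""
    mean = statistics.fmean(occ)
    return sum(abs(x - mean) for x in occ) / (2 * sum(occ))


def gini_index(occ):
    """Mean absolute difference over all pairs, normalized by twice the mean."""
    n = len(occ)
    return sum(abs(a - b) for a in occ for b in occ) / (2 * n * sum(occ))


def hub_summary(occ, k, hub_size=2.0):
    """Hub and antihub statistics; antihubs are assumed to have k-occurrence 0."""
    n = len(occ)
    hubs = [i for i, x in enumerate(occ) if x > hub_size * k]
    antihubs = [i for i, x in enumerate(occ) if x == 0]
    return {
        "hubs": hubs,
        "antihubs": antihubs,
        "antihub_occurrence": len(antihubs) / n,
        # Fraction of all k * n neighbor slots that are filled by hubs.
        "hub_occurrence": sum(occ[i] for i in hubs) / (k * n),
    }


# Toy k-occurrence for n = 8 objects and k = 2 (the counts sum to k * n = 16).
occ = [0, 0, 1, 1, 2, 2, 3, 7]
summary = hub_summary(occ, k=2)
print(k_skewness(occ), robinhood_index(occ), gini_index(occ), summary)
```

In this toy example a single object fills 7 of the 16 neighbor slots, which gives a positive skewness, and two of the eight objects never occur as a neighbor.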

__init__(k: int = 10, return_value: str = 'k_skewness', hub_size: float = 2.0, metric='euclidean', store_k_neighbors: bool = False, store_k_occurrence: bool = False, algorithm: str = 'auto', algorithm_params: Optional[dict] = None, hubness: Optional[str] = None, hubness_params: Optional[dict] = None, verbose: int = 0, n_jobs: int = 1, random_state=None, shuffle_equal: bool = True): Initialize self. See help(type(self)) for accurate signature.

Methods

`__init__`([k, return_value, hub_size, …])	Initialize self.
`fit`(X[, y])	Fit indexed objects.
`get_params`([deep])	Get parameters for this estimator.
`score`([X, y, has_self_distances])	Estimate hubness in a data set.
`set_params`(**params)	Set the parameters of this estimator.

fit(X, y=None) → skhubness.analysis.estimation.Hubness

Fit indexed objects.

Parameters

X: {array-like, sparse matrix}, shape (n_samples, n_features) or (n_query, n_indexed) if metric == ‘precomputed’: Training data vectors or distance matrix, if metric == ‘precomputed’.
y: ignored

Returns

self: Fitted instance of Hubness

get_params(deep=True)

Get parameters for this estimator.

Parameters

deep: bool, default=True: If True, will return the parameters for this estimator and contained subobjects that are estimators.

Returns

params: mapping of string to any: Parameter names mapped to their values.

score(X: Optional[numpy.ndarray] = None, y=None, has_self_distances: bool = False) → Union[float, dict]

Estimate hubness in a data set.

Hubness is estimated from the distances between the query objects in X and the indexed objects passed to fit(). If X is None, all-against-all distances between the indexed objects are used. If self.metric == ‘precomputed’, X must be a distance matrix.

Parameters

X: ndarray, shape (n_query, n_features) or (n_query, n_indexed): Array of query vectors, or distance matrix, if self.metric == ‘precomputed’
y: ignored
has_self_distances: bool, default = False: Whether a precomputed distance matrix contains self distances, which need to be excluded.
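Why self distances matter can be seen in a short sketch. It assumes that the self distance is the smallest entry of its row, so skipping the first sorted entry removes it. The helper k_neighbors and its flag are hypothetical and only mirror the idea behind has_self_distances.

```python
def k_neighbors(dist_row, k, has_self_distance=False):
    """Indices of the k smallest distances, optionally skipping the self distance."""
    order = sorted(range(len(dist_row)), key=dist_row.__getitem__)
    # Assumption: the self distance (usually 0) sorts first in its row.
    start = 1 if has_self_distance else 0
    return order[start:start + k]


# Square precomputed matrix: row i holds the distances from object i to all objects,
# including the zero distance to itself on the diagonal.
D = [
    [0.0, 1.0, 4.0],
    [1.0, 0.0, 2.0],
    [4.0, 2.0, 0.0],
]

naive = [k_neighbors(row, k=1) for row in D]
corrected = [k_neighbors(row, k=1, has_self_distance=True) for row in D]
print(naive, corrected)
```

If the self distances are not excluded, every object becomes its own nearest neighbor. All k-occurrences are then inflated by one and the k-th neighbor slot is lost to the object itself.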

Returns

hubness_measure: float or dict: The hubness measure indicated by return_value. Additional hubness indices are provided as attributes (e.g. robinhood_index). If return_value is ‘all’, a dict of all hubness measures is returned.

set_params(**params)

Set the parameters of this estimator.

The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.

Parameters

**params: dict: Estimator parameters.

Returns

self: object: Estimator instance.