skhubness.analysis.Hubness

class skhubness.analysis.Hubness(k: int = 10, return_value: str = 'k_skewness', hub_size: float = 2.0, metric='euclidean', store_k_neighbors: bool = False, store_k_occurrence: bool = False, algorithm: str = 'auto', algorithm_params: Optional[dict] = None, hubness: Optional[str] = None, hubness_params: Optional[dict] = None, verbose: int = 0, n_jobs: int = 1, random_state=None, shuffle_equal: bool = True)
Examine hubness characteristics of data.
 Parameters
 k: int
Neighborhood size
 return_value: str, default = 'k_skewness'
Hubness measure to return by score(). By default, this is the skewness of the k-occurrence histogram. Use 'all' to return a dict of all available measures, or check skhubness.analysis.VALID_HUBNESS_MEASURE for available measures.
 hub_size: float
Hubs are defined as objects with k-occurrence > hub_size * k.
 metric: string, one of ['euclidean', 'cosine', 'precomputed']
Metric to use for distance computation. Currently, only Euclidean, cosine, and precomputed distances are supported.
 store_k_neighbors: bool
Whether to save the k-neighbor lists. Requires O(n_test * k) memory.
 store_k_occurrence: bool
Whether to save the k-occurrence. Requires O(n_test) memory.
 algorithm: {'auto', 'hnsw', 'lsh', 'ball_tree', 'kd_tree', 'brute'}, optional
Algorithm used to compute the nearest neighbors:
'hnsw' will use HNSW
'lsh' will use FalconnLSH
'ball_tree' will use BallTree
'kd_tree' will use KDTree
'brute' will use a brute-force search.
'auto' will attempt to decide the most appropriate algorithm based on the values passed to the fit() method.
Note: fitting on sparse input will override the setting of this parameter, using brute force.
 algorithm_params: dict, optional
Override default parameters of the NN algorithm. For example, with algorithm='lsh' and algorithm_params={'n_candidates': 100}, one hundred approximate neighbors are retrieved with LSH. If the hubness parameter is set, the candidate neighbors are further reordered with hubness reduction. Finally, n_neighbors objects are used from the (optionally reordered) candidates.
 hubness: {'mutual_proximity', 'local_scaling', 'dis_sim_local', None}, optional
Hubness reduction algorithm:
'mutual_proximity' or 'mp' will use MutualProximity
'local_scaling' or 'ls' will use LocalScaling
'dis_sim_local' or 'dsl' will use DisSimLocal
If None, no hubness reduction will be performed (=vanilla kNN).
 hubness_params: dict, optional
Override default parameters of the selected hubness reduction algorithm. For example, with hubness='mp' and hubness_params={'method': 'normal'}, a mutual proximity variant is used that models distance distributions with independent Gaussians.
 random_state: int, RandomState instance or None, optional
If int, random_state is the seed used by the random number generator; if a RandomState instance, random_state is the random number generator; if None, the random number generator is the RandomState instance used by np.random.
 shuffle_equal: bool, optional
If True and metric='precomputed', shuffle neighbors with identical distances to avoid artifact hubness. NOTE: This is especially useful for secondary distance measures with a finite number of possible values, e.g. SNN or MP empiric.
 n_jobs: int, optional
Number of processes for parallel computations.
1: Don't use multiprocessing.
-1: Use all CPUs.
Note that not all steps are currently parallelized.
 verbose: int, optional
Level of output messages
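
The effect described for shuffle_equal can be illustrated with a small standalone sketch. It does not use skhubness and is not the library's implementation; the helper knn_from_row and the distance values are hypothetical. With a precomputed measure that takes only a few distinct values, a deterministic sort always prefers the lowest indices among ties, whereas shuffling before a stable sort breaks ties at random.

```python
import random


def knn_from_row(row, k, rng=None):
    """Return the indices of the k smallest distances in one row.

    Hypothetical helper for illustration only. If rng is given, the
    candidate order is shuffled first, so that ties are broken at random.
    """
    order = list(range(len(row)))
    if rng is not None:
        rng.shuffle(order)
    # list.sort is stable: tied distances keep their (possibly shuffled) order.
    order.sort(key=lambda j: row[j])
    return order[:k]


# Made-up distances from one query to five indexed objects; four are tied.
row = [0.5, 0.5, 0.5, 0.5, 1.0]

# Without shuffling, objects 0 and 1 win every tie and would collect
# reverse neighbors purely because of their position.
fixed = knn_from_row(row, k=2)  # always [0, 1]

# With shuffling, the tied objects 0..3 take turns as neighbors.
rng = random.Random(0)
picks = [tuple(knn_from_row(row, 2, rng)) for _ in range(200)]
```

Object 4 is never selected in either case, because its distance is strictly larger than the tied ones.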
References
[1] Radovanović, M.; Nanopoulos, A. & Ivanovic, M. Hubs in space: Popular nearest neighbors in high-dimensional data. Journal of Machine Learning Research, 2010, 11, 2487-2531.
[2] Feldbauer, R.; Leodolter, M.; Plant, C. & Flexer, A. Fast approximate hubness reduction for large high-dimensional data. IEEE International Conference on Big Knowledge (2018).
 Attributes
 k_skewness: float
Hubness, measured as skewness of the k-occurrence histogram [1]
 k_skewness_truncnorm: float
Hubness, measured as skewness of a truncated normal distribution fitted to the k-occurrence histogram
 atkinson_index: float
Hubness, measured as the Atkinson index of the k-occurrence distribution
 gini_index: float
Hubness, measured as the Gini index of the k-occurrence distribution
 robinhood_index: float
Hubness, measured as the Robin Hood index of the k-occurrence distribution [2]
 antihubs: int
Indices to antihubs
 antihub_occurrence: float
Proportion of antihubs in data set
 hubs: int
Indices to hubs
 hub_occurrence: float
Proportion of k-nearest neighbor slots occupied by hubs
 groupie_ratio: float
Proportion of objects with the largest hub in their neighborhood
 k_occurrence: ndarray
Reverse neighbor count for each object
 k_neighbors: ndarray
Indices to k-nearest neighbors for each object
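
To make the quantities behind these attributes concrete, here is a minimal standalone sketch that computes k-occurrence, its skewness, hubs, and antihubs for a toy data set. It uses only the Python standard library and is not the skhubness implementation. The helper names, the toy points, the population form of the skewness, and the definition of antihubs as objects with zero k-occurrence are assumptions made for illustration.

```python
from math import dist


def k_occurrence(points, k):
    """Count how often each point is among the k nearest neighbors of the others."""
    n = len(points)
    counts = [0] * n
    for i in range(n):
        # Rank all other points by Euclidean distance to point i (self excluded).
        others = sorted(
            (j for j in range(n) if j != i),
            key=lambda j: dist(points[i], points[j]),
        )
        for j in others[:k]:
            counts[j] += 1
    return counts


def skewness(values):
    """Third standardized moment (population form, an assumption here)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5 if m2 > 0 else 0.0


# Toy data: one central point, four points around it, and one outlier.
points = [(0.0, 0.0), (1.0, 0.0), (-1.2, 0.0), (0.0, 1.5), (0.0, -1.8), (6.0, 5.0)]
k, hub_size = 1, 2.0

occ = k_occurrence(points, k)                              # [4, 1, 0, 1, 0, 0]
k_skew = skewness(occ)                                     # about 1.414
hubs = [i for i, c in enumerate(occ) if c > hub_size * k]  # [0]
antihubs = [i for i, c in enumerate(occ) if c == 0]        # [2, 4, 5]
# Share of all n * k neighbor slots that point to a hub.
hub_occurrence = sum(occ[i] for i in hubs) / (len(points) * k)
antihub_occurrence = len(antihubs) / len(points)
```

The central point is the nearest neighbor of four other points, so its k-occurrence of 4 exceeds hub_size * k = 2 and it is the only hub; the resulting right-skewed k-occurrence distribution is what a positive k_skewness indicates.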

__init__(k: int = 10, return_value: str = 'k_skewness', hub_size: float = 2.0, metric='euclidean', store_k_neighbors: bool = False, store_k_occurrence: bool = False, algorithm: str = 'auto', algorithm_params: Optional[dict] = None, hubness: Optional[str] = None, hubness_params: Optional[dict] = None, verbose: int = 0, n_jobs: int = 1, random_state=None, shuffle_equal: bool = True)
Initialize self. See help(type(self)) for accurate signature.
Methods
__init__([k, return_value, hub_size, …]): Initialize self.
fit(X[, y]): Fit indexed objects.
get_params([deep]): Get parameters for this estimator.
score([X, y, has_self_distances]): Estimate hubness in a data set.
set_params(**params): Set the parameters of this estimator.

fit(X, y=None) → skhubness.analysis.estimation.Hubness
Fit indexed objects.
 Parameters
 X: {array-like, sparse matrix}, shape (n_samples, n_features) or (n_query, n_indexed) if metric == 'precomputed'
Training data vectors, or distance matrix if metric == 'precomputed'.
 y: ignored
 Returns
 self:
Fitted instance of Hubness

get_params(deep=True)
Get parameters for this estimator.
 Parameters
 deep: bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
 Returns
 params: mapping of string to any
Parameter names mapped to their values.

score(X: Optional[numpy.ndarray] = None, y=None, has_self_distances: bool = False) → Union[float, dict]
Estimate hubness in a data set.
Hubness is estimated from the distances between all objects in X and all objects in Y. If Y is None, all-against-all distances between the objects in X are used. If self.metric == 'precomputed', X must be a distance matrix.
 Parameters
 X: ndarray, shape (n_query, n_features) or (n_query, n_indexed)
Array of query vectors, or distance matrix if self.metric == 'precomputed'.
 y: ignored
 has_self_distances: bool, default = False
Whether a precomputed distance matrix contains self distances, which need to be excluded.
 Returns
 hubness_measure: float or dict
Return the hubness measure as indicated by return_value. Additional hubness indices are provided as attributes (e.g. robinhood_index). If return_value is 'all', a dict of all hubness measures is returned.
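
The role of has_self_distances can be sketched without the library. The following standalone example assumes a square precomputed matrix in which row i and column i refer to the same object, so that the self distance sits on the diagonal; how skhubness locates self distances internally is not stated here, and the helper k_neighbors_precomputed is hypothetical.

```python
def k_neighbors_precomputed(D, k, has_self_distances=False):
    """Indices of the k nearest neighbors for each row of a distance matrix D.

    Hypothetical helper for illustration only. Assumes that, when
    has_self_distances is True, D is square with self distances on the diagonal.
    """
    neighbors = []
    for i, row in enumerate(D):
        candidates = range(len(row))
        if has_self_distances:
            # Skip the object's distance to itself, which would always rank first.
            candidates = (j for j in candidates if j != i)
        ranked = sorted(candidates, key=lambda j: row[j])
        neighbors.append(ranked[:k])
    return neighbors


# Made-up symmetric distance matrix with zeros on the diagonal.
D = [
    [0.0, 1.0, 4.0],
    [1.0, 0.0, 2.0],
    [4.0, 2.0, 0.0],
]

with_self = k_neighbors_precomputed(D, k=1)                              # [[0], [1], [2]]
without_self = k_neighbors_precomputed(D, k=1, has_self_distances=True)  # [[1], [0], [1]]
```

If the self distances are left in, every object is its own nearest neighbor and all k-occurrences are identical, which hides any hubness in the data.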

set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it's possible to update each component of a nested object.
 Parameters
 **params: dict
Estimator parameters.
 Returns
 self: object
Estimator instance.
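
The <component>__<parameter> naming scheme can be illustrated with a short standalone sketch that separates an estimator's own parameters from those addressed to nested components. This is a simplified illustration and not the actual set_params code; the function split_nested_params and the component name "hub" are hypothetical.

```python
def split_nested_params(params):
    """Separate plain parameters from '<component>__<parameter>' entries.

    Hypothetical helper for illustration only.
    """
    own, nested = {}, {}
    for key, value in params.items():
        # Split at the first double underscore, if there is one.
        name, sep, sub_key = key.partition("__")
        if sep:
            nested.setdefault(name, {})[sub_key] = value
        else:
            own[key] = value
    return own, nested


own, nested = split_nested_params({"k": 5, "hub__k": 10, "hub__metric": "cosine"})
# own    -> {"k": 5}
# nested -> {"hub": {"k": 10, "metric": "cosine"}}
```

For a standalone Hubness instance, only plain names such as k or metric apply; the double-underscore form matters once the estimator is a step inside a nested object such as a pipeline.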