skhubness.analysis.Hubness
- class skhubness.analysis.Hubness(k: int = 10, return_value: str = 'k_skewness', hub_size: float = 2.0, metric='euclidean', store_k_neighbors: bool = False, store_k_occurrence: bool = False, algorithm: str = 'auto', algorithm_params: Optional[dict] = None, hubness: Optional[str] = None, hubness_params: Optional[dict] = None, verbose: int = 0, n_jobs: int = 1, random_state=None, shuffle_equal: bool = True)
Examine hubness characteristics of data.
- Parameters
- k: int
Neighborhood size
- return_value: str, default = “k_skewness”
Hubness measure returned by score(). By default, this is the skewness of the k-occurrence histogram. Use “all” to return a dict of all available measures, or check skhubness.analysis.VALID_HUBNESS_MEASURE for available measures.
- hub_size: float
Hubs are defined as objects with k-occurrence > hub_size * k.
- metric: string, one of [‘euclidean’, ‘cosine’, ‘precomputed’]
Metric to use for distance computation. Currently, only Euclidean, cosine, and precomputed distances are supported.
- store_k_neighbors: bool
Whether to save the k-neighbor lists. Requires O(n_test * k) memory.
- store_k_occurrence: bool
Whether to save the k-occurrence. Requires O(n_test) memory.
- algorithm: {‘auto’, ‘hnsw’, ‘lsh’, ‘ball_tree’, ‘kd_tree’, ‘brute’}, optional
Algorithm used to compute the nearest neighbors:
‘hnsw’ will use HNSW
‘lsh’ will use FalconnLSH
‘ball_tree’ will use BallTree
‘kd_tree’ will use KDTree
‘brute’ will use a brute-force search.
‘auto’ will attempt to decide the most appropriate algorithm based on the values passed to the fit() method.
Note: fitting on sparse input will override the setting of this parameter, using brute force.
- algorithm_params: dict, optional
Override default parameters of the nearest-neighbor (NN) algorithm. For example, with algorithm=‘lsh’ and algorithm_params={‘n_candidates’: 100}, one hundred approximate neighbors are retrieved with LSH. If the hubness parameter is set, the candidate neighbors are further reordered with hubness reduction. Finally, n_neighbors objects are used from the (optionally reordered) candidates.
- hubness: {‘mutual_proximity’, ‘local_scaling’, ‘dis_sim_local’, None}, optional
Hubness reduction algorithm
‘mutual_proximity’ or ‘mp’ will use MutualProximity
‘local_scaling’ or ‘ls’ will use LocalScaling
‘dis_sim_local’ or ‘dsl’ will use DisSimLocal
If None, no hubness reduction will be performed (=vanilla kNN).
- hubness_params: dict, optional
Override default parameters of the selected hubness reduction algorithm. For example, with hubness=‘mp’ and hubness_params={‘method’: ‘normal’}, a mutual proximity variant is used that models distance distributions with independent Gaussians.
- random_state: int, RandomState instance or None, optional
If int, random_state is the seed used by the random number generator. If a RandomState instance, random_state is the random number generator. If None, the random number generator is the RandomState instance used by np.random.
- shuffle_equal: bool, optional
If True and metric=‘precomputed’, shuffle neighbors with identical distances to avoid artifact hubness. Note: This is especially useful for secondary distance measures with a finite number of possible values, e.g. SNN or MP empiric.
- n_jobs: int, optional
Number of processes for parallel computations.
1: Don’t use multiprocessing.
-1: Use all CPUs.
Note that not all steps are currently parallelized.
- verbose: int, optional
Level of output messages
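The k and hub_size parameters can be illustrated with a minimal pure-Python sketch. It does not call skhubness. The neighbor lists are made-up data, the helper k_occurrence is a hypothetical name, and treating antihubs as objects with a k-occurrence of zero is an assumption based on the usual definition rather than on the library source.

```python
from collections import Counter


def k_occurrence(neighbors, n_objects):
    """Count how often each object appears in the k-neighbor lists of others."""
    counts = Counter(idx for nbrs in neighbors for idx in nbrs)
    return [counts.get(i, 0) for i in range(n_objects)]


# Hypothetical k-neighbor lists for 6 objects with k = 2
k = 2
hub_size = 2.0
neighbors = [[2, 1], [2, 0], [0, 1], [2, 0], [2, 1], [2, 0]]

occ = k_occurrence(neighbors, n_objects=6)

# Hubs: k-occurrence > hub_size * k (here: more than 4 appearances)
hubs = [i for i, n in enumerate(occ) if n > hub_size * k]

# Antihubs (assumed definition): objects that never appear as a neighbor
antihubs = [i for i, n in enumerate(occ) if n == 0]

# Share of all neighbor slots that are filled by hubs
hub_occurrence = sum(occ[i] for i in hubs) / (k * len(neighbors))

print(occ, hubs, antihubs, hub_occurrence)
```

With these lists, object 2 fills 5 of the 12 neighbor slots and is the only hub, while objects 3, 4, and 5 never appear in any list.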
References
- 1
Radovanović, M.; Nanopoulos, A. & Ivanovic, M. Hubs in space: Popular nearest neighbors in high-dimensional data. Journal of Machine Learning Research, 2010, 11, 2487-2531
- 2
Feldbauer, R.; Leodolter, M.; Plant, C. & Flexer, A. Fast approximate hubness reduction for large high-dimensional data. IEEE International Conference on Big Knowledge (2018).
- Attributes
- k_skewness: float
Hubness, measured as skewness of k-occurrence histogram [1]
- k_skewness_truncnorm: float
Hubness, measured as skewness of a truncated normal distribution fitted to the k-occurrence histogram
- atkinson_index: float
Hubness, measured as the Atkinson index of k-occurrence distribution
- gini_index: float
Hubness, measured as the Gini index of k-occurrence distribution
- robinhood_index: float
Hubness, measured as Robin Hood index of k-occurrence distribution [2]
- antihubs: int
Indices of antihubs
- antihub_occurrence: float
Proportion of antihubs in data set
- hubs: int
Indices of hubs
- hub_occurrence: float
Proportion of k-nearest neighbor slots occupied by hubs
- groupie_ratio: float
Proportion of objects with the largest hub in their neighborhood
- k_occurrence: ndarray
Reverse neighbor count for each object
- k_neighbors: ndarray
Indices of the k-nearest neighbors of each object
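As a rough illustration of three of the measures listed above, the following sketch computes skewness, the Gini index, and the Robin Hood index from a k-occurrence list using textbook formulas. The function names are hypothetical, the input is made-up data, and the library may use different estimators (for example, a bias-corrected skewness), so the values need not match skhubness output exactly.

```python
def skewness(values):
    """Third standardized moment (biased estimator)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


def gini_index(values):
    """Mean absolute difference over all pairs, divided by twice the mean."""
    n = len(values)
    total = sum(values)
    diff_sum = sum(abs(a - b) for a in values for b in values)
    return diff_sum / (2 * n * total)


def robinhood_index(values):
    """Share of occurrences that must be moved to reach a uniform distribution."""
    mean = sum(values) / len(values)
    return sum(abs(v - mean) for v in values) / (2 * sum(values))


# Hypothetical k-occurrence counts for 6 objects with k = 2
occ = [4, 3, 5, 0, 0, 0]

skew = skewness(occ)
gini = gini_index(occ)
robin = robinhood_index(occ)
print(skew, gini, robin)
```

All three values are zero or undefined for a perfectly uniform k-occurrence and grow as a few objects take up more of the neighbor slots.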
- __init__(k: int = 10, return_value: str = 'k_skewness', hub_size: float = 2.0, metric='euclidean', store_k_neighbors: bool = False, store_k_occurrence: bool = False, algorithm: str = 'auto', algorithm_params: Optional[dict] = None, hubness: Optional[str] = None, hubness_params: Optional[dict] = None, verbose: int = 0, n_jobs: int = 1, random_state=None, shuffle_equal: bool = True)
Initialize self. See help(type(self)) for accurate signature.
Methods
- __init__([k, return_value, hub_size, …]): Initialize self.
- fit(X[, y]): Fit indexed objects.
- get_params([deep]): Get parameters for this estimator.
- score([X, y, has_self_distances]): Estimate hubness in a data set.
- set_params(**params): Set the parameters of this estimator.
- fit(X, y=None) → skhubness.analysis.estimation.Hubness
Fit indexed objects.
- Parameters
- X: {array-like, sparse matrix}, shape (n_samples, n_features) or (n_query, n_indexed) if metric == ‘precomputed’
Training data vectors, or distance matrix if metric == ‘precomputed’.
- y: ignored
- Returns
- self:
Fitted instance of Hubness
- get_params(deep=True)
Get parameters for this estimator.
- Parameters
- deep: bool, default=True
If True, will return the parameters for this estimator and contained subobjects that are estimators.
- Returns
- params: mapping of string to any
Parameter names mapped to their values.
- score(X: Optional[numpy.ndarray] = None, y=None, has_self_distances: bool = False) → Union[float, dict]
Estimate hubness in a data set.
Hubness is estimated from the distances between the query objects in X and the indexed objects passed to fit(). If X is None, all-against-all distances between the indexed objects are used. If self.metric == ‘precomputed’, X must be a distance matrix.
- Parameters
- X: ndarray, shape (n_query, n_features) or (n_query, n_indexed)
Array of query vectors, or distance matrix if self.metric == ‘precomputed’
- y: ignored
- has_self_distances: bool, default = False
Whether a precomputed distance matrix contains self distances, which need to be excluded.
- Returns
- hubness_measure: float or dict
Return the hubness measure as indicated by return_value. Additional hubness indices are provided as attributes (e.g. robinhood_index_()). If return_value is ‘all’, a dict of all hubness measures is returned.
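Why has_self_distances matters can be seen in a small pure-Python sketch: in a square precomputed distance matrix, every object is at distance zero from itself, so it would fill the first neighbor slot unless that entry is skipped. The matrix and the helper k_nearest are hypothetical, and the library's own exclusion logic may be implemented differently.

```python
def k_nearest(dist_row, k, exclude=None):
    """Return indices of the k smallest distances, optionally skipping one index."""
    candidates = [(d, j) for j, d in enumerate(dist_row) if j != exclude]
    candidates.sort()
    return [j for _, j in candidates[:k]]


# Hypothetical symmetric 4 x 4 distance matrix with a zero diagonal
D = [
    [0.0, 1.0, 4.0, 3.0],
    [1.0, 0.0, 2.0, 5.0],
    [4.0, 2.0, 0.0, 6.0],
    [3.0, 5.0, 6.0, 0.0],
]
k = 2

# Self distances kept: every object is its own nearest neighbor
with_self = [k_nearest(row, k) for row in D]

# Self distances excluded: only other objects can be neighbors
without_self = [k_nearest(row, k, exclude=i) for i, row in enumerate(D)]

print(with_self)
print(without_self)
```

Keeping the diagonal gives every object one guaranteed occurrence, which flattens the k-occurrence distribution and can mask hubness.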
- set_params(**params)
Set the parameters of this estimator.
The method works on simple estimators as well as on nested objects (such as pipelines). The latter have parameters of the form <component>__<parameter> so that it’s possible to update each component of a nested object.
- Parameters
- **params: dict
Estimator parameters.
- Returns
- self: object
Estimator instance.
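The <component>__<parameter> naming scheme can be sketched in a few lines of plain Python. The helper split_param and the step name "hubness" are hypothetical and only illustrate how such a name separates into a component and a parameter; this is not the estimator's own code.

```python
def split_param(name):
    """Split 'component__parameter' into (component, parameter).

    Names without a double underscore belong to the estimator itself,
    which is signalled here by a component of None.
    """
    head, delim, tail = name.partition("__")
    if delim:
        return head, tail
    return None, head


# A plain parameter of the estimator
simple = split_param("k")

# A parameter of a nested component named "hubness" (hypothetical step name)
nested = split_param("hubness__k")

# Deeper nesting: only the first component is split off at each level
deep = split_param("pipe__hubness__k")

print(simple, nested, deep)
```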