Options of Outlier Detection Methods

Author

SEOYEON CHOI

Published

August 16, 2023

from sklearn.neighbors import LocalOutlierFactor

LOF[@breunig2000lof]

LocalOutlierFactor(
    n_neighbors=20,
    *,
    algorithm='auto',
    leaf_size=30,
    metric='minkowski',
    p=2,
    metric_params=None,
    contamination='auto',
    novelty=False,
    n_jobs=None,
)

Parameter	Description	Default Value
n_neighbors	Number of neighbors to use by default for kneighbors queries. If n_neighbors is larger than the number of samples provided, all samples will be used.	20
algorithm	Algorithm used to compute the nearest neighbors: ‘ball_tree’ uses BallTree; ‘kd_tree’ uses KDTree; ‘brute’ uses a brute-force search; ‘auto’ attempts to choose the most appropriate algorithm based on the values passed to the fit method. Note: fitting on sparse input overrides this setting and uses brute force.	‘auto’
leaf_size	Leaf size passed to BallTree or KDTree. This can affect the speed of the construction and query, as well as the memory required to store the tree. The optimal value depends on the nature of the problem.	30
metric	Metric to use for distance computation. Default is “minkowski”, which results in the standard Euclidean distance when p = 2.	‘minkowski’
p	Parameter for the Minkowski metric from sklearn.metrics.pairwise.pairwise_distances. When p = 1, this is equivalent to using manhattan_distance (l1), and euclidean_distance (l2) for p = 2. For arbitrary p, minkowski_distance (l_p) is used.	2
metric_params	Additional keyword arguments for the metric function.	None
contamination	The amount of contamination of the data set, i.e. the proportion of outliers in the data set. When fitting, this is used to define the threshold on the scores of the samples.	‘auto’
novelty	By default, LocalOutlierFactor is only meant to be used for outlier detection (novelty=False). Set novelty to True if you want to use LocalOutlierFactor for novelty detection.	False
n_jobs	The number of parallel jobs to run for neighbors search. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors.	None
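As a minimal sketch of how these options are used, the example below fits LocalOutlierFactor in both modes on synthetic data with one planted outlier. The data, the seed, and the variable names are assumptions made for illustration only.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

# Synthetic data (illustrative assumption): 100 inliers plus one far-away point.
rng = np.random.default_rng(0)
X_inliers = rng.normal(size=(100, 2))
X = np.vstack([X_inliers, [[8.0, 8.0]]])

# Outlier detection (novelty=False): label the same data the model is fitted on.
lof = LocalOutlierFactor(n_neighbors=20, contamination="auto")
labels = lof.fit_predict(X)                 # 1 = inlier, -1 = outlier
lof_scores = -lof.negative_outlier_factor_  # larger value = more abnormal

# Novelty detection (novelty=True): fit on clean data, then label unseen points.
lof_novelty = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_inliers)
new_labels = lof_novelty.predict(np.array([[0.0, 0.0], [8.0, 8.0]]))
```

With novelty=False the labels come from fit_predict on the training data; with novelty=True, predict is meant for new observations only.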

from pyod.models.knn import KNN

kNN[@ramaswamy2000efficient]

KNN(
    contamination=0.1,
    n_neighbors=5,
    method='largest',
    radius=1.0,
    algorithm='auto',
    leaf_size=30,
    metric='minkowski',
    p=2,
    metric_params=None,
    n_jobs=1,
    **kwargs,
)

Parameter	Description	Default
contamination	Proportion of outliers in the data set, used to define the threshold on the decision function.	0.1
n_neighbors	Number of neighbors to use for k neighbors queries.	5
method	Method for kNN detection: ‘largest’, ‘mean’, or ‘median’.	‘largest’
radius	Range of parameter space for radius_neighbors queries.	1.0
algorithm	Algorithm to compute nearest neighbors: ‘auto’, ‘ball_tree’, ‘kd_tree’, or ‘brute’.	‘auto’
leaf_size	Leaf size passed to BallTree, affecting construction/query speed and memory.	30
metric	Metric for distance computation, from scikit-learn or scipy.spatial.distance.	‘minkowski’
p	Parameter for Minkowski metric, equivalent to manhattan_distance (l1) for p = 1 and euclidean_distance (l2) for p = 2.	2
metric_params	Additional keyword arguments for the metric function.	None
n_jobs	Number of parallel jobs for neighbors search. -1 uses all CPU cores.	1
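The three values of method differ only in how the distances to the k nearest neighbors are reduced to one score. The sketch below illustrates this with scikit-learn's NearestNeighbors; it is a simplified re-implementation for illustration, not PyOD's own code, and the helper name knn_scores is hypothetical.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_scores(X, n_neighbors=5, method="largest"):
    """Illustrative kNN outlier score (hypothetical helper, not the PyOD API)."""
    nn = NearestNeighbors(n_neighbors=n_neighbors).fit(X)
    # Calling kneighbors() without arguments excludes each point from its own neighbors.
    dist, _ = nn.kneighbors()
    if method == "largest":   # distance to the k-th nearest neighbor
        return dist[:, -1]
    if method == "mean":      # average distance to the k nearest neighbors
        return dist.mean(axis=1)
    if method == "median":    # median distance to the k nearest neighbors
        return np.median(dist, axis=1)
    raise ValueError(f"unknown method: {method}")

# Toy 1-D data: the last point sits far from the rest.
X_toy = np.array([[0.0], [1.0], [2.0], [4.0], [10.0]])
largest = knn_scores(X_toy, n_neighbors=3, method="largest")
mean = knn_scores(X_toy, n_neighbors=3, method="mean")
median = knn_scores(X_toy, n_neighbors=3, method="median")
```

For the point at 10, the three nearest neighbors lie at distances 6, 8, and 9, so the score is 9 for ‘largest’, about 7.67 for ‘mean’, and 8 for ‘median’.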

from pyod.models.cblof import CBLOF

CBLOF[@he2003discovering]

CBLOF(
    n_clusters=8,
    contamination=0.1,
    clustering_estimator=None,
    alpha=0.9,
    beta=5,
    use_weights=False,
    check_estimator=False,
    random_state=None,
    n_jobs=1,
)

Parameter	Description	Default
n_clusters	Number of clusters to form and centroids to generate.	8
contamination	Amount of contamination in the data set, proportion of outliers. Used to define threshold.	0.1
clustering_estimator	Base clustering algorithm for data clustering. Requires fit() and predict(). Default is KMeans.	None
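alpha	Coefficient for deciding small and large clusters: the fraction of all samples that the large clusters should cover together.	0.9
beta	Coefficient for deciding small and large clusters: the minimum size ratio between two consecutive clusters (sorted by decreasing size) that marks the boundary between large and small clusters.	5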
use_weights	Use cluster sizes as weights in outlier score calculation.	False
check_estimator	Check if base estimator is consistent with sklearn standard.	False
random_state	Seed for random number generator.	None
n_jobs	Number of jobs to run in parallel for both fit and predict. -1 uses all CPU cores.	1
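To make the role of alpha and beta concrete, the sketch below splits cluster sizes into large and small groups using the two conditions described in the CBLOF paper [@he2003discovering]. It is an assumption-laden illustration: the helper split_clusters is hypothetical, and PyOD's implementation may combine the two conditions differently.

```python
import numpy as np

def split_clusters(sizes, alpha=0.9, beta=5):
    """Split cluster sizes into (large, small); hypothetical helper for illustration."""
    sizes = np.sort(np.asarray(sizes))[::-1]   # sort by decreasing size
    n_samples = sizes.sum()
    cumulative = np.cumsum(sizes)
    for b in range(1, len(sizes)):
        # Condition 1: the first b clusters cover at least alpha of all samples.
        covers = cumulative[b - 1] >= alpha * n_samples
        # Condition 2: cluster b is at least beta times larger than cluster b + 1.
        drops = sizes[b - 1] / sizes[b] >= beta
        if covers or drops:
            return sizes[:b], sizes[b:]
    # No boundary found: treat every cluster as large.
    return sizes, sizes[:0]

large, small = split_clusters([30, 600, 20, 350])
large_beta, small_beta = split_clusters([500, 60, 50])
```

In the first call the two biggest clusters already hold 95% of the samples, so they are the large clusters; in the second call the boundary is set by the size ratio 500 / 60, which exceeds beta.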

from sklearn import svm

OCSVM[@manevitz2001one]

svm.OneClassSVM(
    *,
    kernel='rbf',
    degree=3,
    gamma='scale',
    coef0=0.0,
    tol=0.001,
    nu=0.5,
    shrinking=True,
    cache_size=200,
    verbose=False,
    max_iter=-1,
)

Parameter	Description	Default
kernel	Specifies the kernel type to be used in the algorithm. Options: ‘linear’, ‘poly’, ‘rbf’, ‘sigmoid’, ‘precomputed’. Default: ‘rbf’.	‘rbf’
degree	Degree of the polynomial kernel function (‘poly’). Non-negative. Ignored by other kernels.	3
gamma	Kernel coefficient for ‘rbf’, ‘poly’, and ‘sigmoid’. ‘scale’ (default), ‘auto’, or a non-negative float.	‘scale’
coef0	Independent term in kernel function. Significant in ‘poly’ and ‘sigmoid’.	0.0
tol	Tolerance for stopping criterion.	1e-3
nu	Upper bound on the fraction of training errors and lower bound on the fraction of support vectors. Must be in the interval (0, 1].	0.5
shrinking	Whether to use the shrinking heuristic.	True
cache_size	Size of the kernel cache (in MB).	200
verbose	Enable verbose output. May not work well in multithreaded contexts.	False
max_iter	Hard limit on solver iterations. -1 for no limit.	-1
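The effect of nu can be checked empirically. The sketch below fits OneClassSVM on synthetic Gaussian data (an illustrative assumption, as are the seed and variable names) and measures the fraction of training points labeled as outliers and the fraction of support vectors.

```python
import numpy as np
from sklearn import svm

# Synthetic training data (illustrative assumption).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 2))

nu = 0.1
ocsvm = svm.OneClassSVM(kernel="rbf", gamma="scale", nu=nu).fit(X_train)

train_labels = ocsvm.predict(X_train)                   # 1 = inlier, -1 = outlier
outlier_fraction = float(np.mean(train_labels == -1))   # expected to be roughly nu or below
support_fraction = len(ocsvm.support_) / len(X_train)   # expected to be roughly nu or above

# A point far from the training data should be labeled as an outlier.
far_label = ocsvm.predict(np.array([[8.0, 8.0]]))[0]
```

Because the solver stops at a numerical tolerance (tol), the observed fractions only approximate the bounds that nu describes.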

from pyod.models.mcd import MCD

MCD[@hardin2004outlier]

MCD(
    contamination=0.1,
    store_precision=True,
    assume_centered=False,
    support_fraction=None,
    random_state=None,
)

Parameter	Description	Default
contamination	float in (0., 0.5), optional (default=0.1) The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.	0.1
store_precision	bool Specify if the estimated precision is stored.	True
assume_centered	bool If True, the support of the robust location and the covariance estimates is computed, and a covariance estimate is recomputed from it, without centering the data. Useful for data whose mean is approximately, but not exactly, equal to zero. If False, the robust location and covariance are directly computed with the FastMCD algorithm without additional treatment.	False
support_fraction	float, 0 < support_fraction < 1 The proportion of points to be included in the support of the raw MCD estimate. Default is None, which implies that the minimum value of support_fraction will be used within the algorithm: [n_sample + n_features + 1] / 2	None
random_state	int, RandomState instance or None, optional (default=None) If int, random_state is the seed used by the random number generator; If RandomState instance, random_state is the random number generator; If None, the random number generator is the RandomState instance used by np.random.	None
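The parameters above mirror those of scikit-learn's MinCovDet estimator. As a sketch of the underlying idea, and assuming the detector scores points by their robust Mahalanobis distance, the example below uses MinCovDet directly on synthetic data with five planted outliers. The data and variable names are illustrative assumptions.

```python
import numpy as np
from sklearn.covariance import MinCovDet

# Synthetic data (illustrative assumption): 100 inliers and 5 clustered outliers.
rng = np.random.default_rng(0)
X = np.vstack([
    rng.normal(size=(100, 2)),
    rng.normal(loc=10.0, scale=0.1, size=(5, 2)),
])

mcd = MinCovDet(
    store_precision=True,
    assume_centered=False,
    support_fraction=None,   # let the algorithm pick the minimum support size
    random_state=0,
).fit(X)

# Squared Mahalanobis distances under the robust location and covariance.
robust_dist = mcd.mahalanobis(X)
top5 = np.argsort(robust_dist)[-5:]   # indices of the five most distant points
```

Because the location and covariance are estimated from the most concentrated subset of points, the planted outliers do not distort the estimate and end up with the largest distances.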

from pyod.models.feature_bagging import FeatureBagging

FeatureBagging[@lazarevic2005feature]

FeatureBagging(
    base_estimator=None,
    n_estimators=10,
    contamination=0.1,
    max_features=1.0,
    bootstrap_features=False,
    check_detector=True,
    check_estimator=False,
    n_jobs=1,
    random_state=None,
    combination='average',
    verbose=0,
    estimator_params=None,
)

Parameter	Description	Default
base_estimator	The base estimator to fit on random subsets of the dataset. If None, base estimator is LOF detector.	None
n_estimators	The number of base estimators in the ensemble.	10
contamination	Amount of contamination in the data set, proportion of outliers. Used to define threshold.	0.1
max_features	Number of features to draw from X to train each base estimator.	1.0
bootstrap_features	Whether features are drawn with replacement.	False
check_detector	If True, check if base estimator is consistent with pyod standard.	True
check_estimator	If True, check if base estimator is consistent with sklearn standard. Deprecated in pyod 0.6.9. Replaced by check_detector.	False
n_jobs	Number of jobs to run in parallel for both fit and predict.	1
random_state	Seed used by random number generator.	None
combination	Method of combination: ‘average’ for average scores, ‘max’ for maximum scores.	‘average’
verbose	Controls the verbosity of the building process.	0
estimator_params	List of attributes to use as parameters when instantiating a new base estimator.	None
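The idea behind feature bagging can be sketched with scikit-learn alone: fit a base detector (LOF here, matching the default base estimator) on random feature subsets and combine the scores. This is a simplified illustration, not PyOD's implementation; the helper name and the subset-size rule (between d/2 and d-1 features, following [@lazarevic2005feature]) are assumptions.

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def feature_bagging_scores(X, n_estimators=10, combination="average", random_state=0):
    """Illustrative feature-bagging scores (hypothetical helper, not the PyOD API)."""
    rng = np.random.default_rng(random_state)
    n_features = X.shape[1]
    all_scores = []
    for _ in range(n_estimators):
        # Draw between d/2 and d-1 features without replacement.
        k = int(rng.integers(n_features // 2, n_features))
        cols = rng.choice(n_features, size=k, replace=False)
        lof = LocalOutlierFactor(n_neighbors=20).fit(X[:, cols])
        all_scores.append(-lof.negative_outlier_factor_)  # larger = more abnormal
    all_scores = np.vstack(all_scores)
    if combination == "average":
        return all_scores.mean(axis=0)
    if combination == "max":
        return all_scores.max(axis=0)
    raise ValueError(f"unknown combination: {combination}")

# Synthetic data (illustrative assumption): one point is extreme in every feature.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(100, 6)), np.full((1, 6), 8.0)])
avg_scores = feature_bagging_scores(X, combination="average")
max_scores = feature_bagging_scores(X, combination="max")
```

Both calls use the same seed, so they score the same feature subsets and differ only in the combination step.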

from pyod.models.abod import ABOD

ABOD[@kriegel2008angle]

ABOD(contamination=0.1, n_neighbors=5, method='fast')

Parameter	Description	Default
contamination	float in (0., 0.5), optional (default=0.1) The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.	0.1
n_neighbors	int, optional (default=5) Number of neighbors to use by default for k neighbors queries.	5
method	str, optional (default=‘fast’) Method for ABOD: ‘fast’ for fast ABOD using n_neighbors only, ‘default’ for original ABOD using all training points (could be slower).	‘fast’
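The contamination parameter, which recurs in most of the detectors listed here, turns continuous outlier scores into binary labels. A minimal sketch of that step, assuming the threshold is the (1 - contamination) percentile of the training scores; the helper name is hypothetical:

```python
import numpy as np

def labels_from_scores(scores, contamination=0.1):
    """Threshold scores at the (1 - contamination) percentile (illustrative helper)."""
    threshold = np.percentile(scores, 100 * (1 - contamination))
    labels = (scores > threshold).astype(int)   # 1 = outlier, 0 = inlier
    return labels, threshold

# Toy scores 0, 1, ..., 99: the ten largest should be flagged.
scores = np.arange(100, dtype=float)
labels, threshold = labels_from_scores(scores, contamination=0.1)
```

With contamination=0.1, roughly the top 10% of scores are labeled as outliers, regardless of how abnormal they are in absolute terms.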

from alibi_detect.od import IForest

IForest[@liu2008isolation]

IForest(
    threshold: float = None,
    n_estimators: int = 100,
    max_samples: Union[str, int, float] = 'auto',
    max_features: Union[int, float] = 1.0,
    bootstrap: bool = False,
    n_jobs: int = 1,
    data_type: str = 'tabular',
)

Parameter	Description	Default
threshold	Threshold used for outlier score to determine outliers.	None
n_estimators	Number of base estimators in the ensemble.	100
max_samples	Number of samples to draw from training data to train each base estimator. If int, draw ‘max_samples’ samples. If float, draw ‘max_samples * number of samples’ samples. If ‘auto’, max_samples = min(256, number of samples).	auto
max_features	Number of features to draw from training data to train each base estimator. If int, draw ‘max_features’ features. If float, draw ‘max_features * number of features’ features.	1.0
bootstrap	Whether to fit individual trees on random subsets of the training data, sampled with replacement.	False
n_jobs	Number of jobs to run in parallel for ‘fit’ and ‘predict’.	1
data_type	Optionally specify the data type (tabular, image, or time-series). Added to metadata.	tabular
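The max_samples rules can be verified with scikit-learn's IsolationForest, which exposes the same n_estimators, max_samples, max_features, bootstrap, and n_jobs options. The sketch below assumes the detector above delegates to that estimator, which is worth checking against the installed version; the data and variable names are illustrative.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data (illustrative assumption): 1000 inliers plus one far-away point.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(1000, 3)), [[8.0, 8.0, 8.0]]])

# 'auto' resolves to min(256, n_samples); a float is a fraction of n_samples.
forest_auto = IsolationForest(n_estimators=100, max_samples="auto", random_state=0).fit(X)
forest_half = IsolationForest(n_estimators=100, max_samples=0.5, random_state=0).fit(X)

# score_samples returns higher values for normal points, so negate it.
iforest_scores = -forest_auto.score_samples(X)
```

After fitting, the resolved subsample size is available as max_samples_ on each forest.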

from pyod.models.hbos import HBOS

HBOS[@goldstein2012histogram]

HBOS(n_bins=10, alpha=0.1, tol=0.5, contamination=0.1)

Parameter	Description	Default
n_bins	int or string, optional (default=10) The number of bins. “auto” uses the Birgé-Rozenholc method for automatic selection of the optimal number of bins for each feature.	10
alpha	float in (0, 1), optional (default=0.1) The regularizer for preventing overflow.	0.1
tol	float in (0, 1), optional (default=0.5) The parameter to decide the flexibility when dealing with samples falling outside the bins.	0.5
contamination	float in (0., 0.5), optional (default=0.1) The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.	0.1
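A rough sketch of the histogram-based score may help clarify n_bins and alpha: each feature gets its own histogram, and a point's score grows when it falls into low-density bins. This simplified version uses fixed-width bins and ignores tol; it is an illustration under those assumptions, not PyOD's implementation, and the helper name is hypothetical.

```python
import numpy as np

def hbos_scores(X, n_bins=10, alpha=0.1):
    """Simplified histogram-based outlier score (illustrative helper)."""
    X = np.asarray(X, dtype=float)
    scores = np.zeros(X.shape[0])
    for j in range(X.shape[1]):
        density, edges = np.histogram(X[:, j], bins=n_bins, density=True)
        # Using only the interior edges maps every value to a bin index in 0..n_bins-1.
        idx = np.digitize(X[:, j], edges[1:-1])
        # alpha keeps the logarithm finite and bounded for near-empty bins.
        scores += -np.log2(density[idx] + alpha)
    return scores

# Synthetic data (illustrative assumption): one point is extreme in every feature.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(200, 3)), [[6.0, 6.0, 6.0]]])
hbos = hbos_scores(X, n_bins=10, alpha=0.1)
```

Features are scored independently and summed, which is what makes the method fast but blind to outliers that are unusual only in combination.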

from pyod.models.sos import SOS

SOS[@janssens2012stochastic]

SOS(contamination=0.1, perplexity=4.5, metric='euclidean', eps=1e-05)

Parameter	Description	Default
contamination	float in (0., 0.5), optional (default=0.1) The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.	0.1
perplexity	float, optional (default=4.5) A smooth measure of the effective number of neighbors. Perplexity is similar to parameter `k` in kNN algorithm (number of nearest neighbors). Perplexity range: 1 to n-1, where `n` is number of samples.	4.5
metric	str, default ‘euclidean’ Metric used for distance computation. Can use any metric from scipy.spatial.distance. Valid values: ‘euclidean’, [‘braycurtis’, ‘canberra’, ‘chebyshev’, …]. See scipy.spatial.distance documentation for details.	‘euclidean’
eps	float, optional (default=1e-5) Tolerance threshold for floating point errors.	1e-5

from pyod.models.so_gaal import SO_GAAL

SO_GAAL[@liu2019generative]

SO_GAAL(
    stop_epochs=20,
    lr_d=0.01,
    lr_g=0.0001,
    momentum=0.9,
    contamination=0.1,
)

Parameter	Description	Default
contamination	float in (0., 0.5), optional (default=0.1) The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.	0.1
stop_epochs	int, optional (default=20) The number of epochs of training. Total epochs equals three times stop_epochs.	20
lr_d	float, optional (default=0.01) The learning rate of the discriminator.	0.01
lr_g	float, optional (default=0.0001) The learning rate of the generator.	0.0001
momentum	float, optional (default=0.9) The momentum parameter for SGD.	0.9

from pyod.models.mo_gaal import MO_GAAL

MO_GAAL[@liu2019generative]

MO_GAAL(
    k=10,
    stop_epochs=20,
    lr_d=0.01,
    lr_g=0.0001,
    momentum=0.9,
    contamination=0.1,
)

Parameter	Description	Default
contamination	float in (0., 0.5), optional (default=0.1) The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function.	0.1
k	int, optional (default=10) The number of sub generators.	10
stop_epochs	int, optional (default=20) The number of epochs of training. Total epochs equals three times stop_epochs.	20
lr_d	float, optional (default=0.01) The learning rate of the discriminator.	0.01
lr_g	float, optional (default=0.0001) The learning rate of the generator.	0.0001
momentum	float, optional (default=0.9) The momentum parameter for SGD.	0.9

from pyod.models.lscp import LSCP

LSCP[@zhao2019lscp]

LSCP(
    detector_list,
    local_region_size=30,
    local_max_features=1.0,
    n_bins=10,
    random_state=None,
    contamination=0.1,
)

Parameter	Description	Default
detector_list	List, length must be greater than 1 Base unsupervised outlier detectors from PyOD. Requires fit and decision_function methods.	-
local_region_size	int, optional (default=30) Number of training points to consider in each iteration of local region generation process (30 by default).	30
local_max_features	float in (0.5, 1.), optional (default=1.0) Maximum proportion of number of features to consider when defining local region (1.0 by default).	1.0
n_bins	int, optional (default=10) Number of bins to use when selecting the local region.	10
random_state	RandomState, optional (default=None) A random number generator instance to define the state of the random permutations generator.	None
contamination	float in (0., 0.5), optional (default=0.1) The amount of contamination of the data set, i.e. the proportion of outliers in the data set. Used when fitting to define the threshold on the decision function (0.1 by default).	0.1