Repository of the paper:
H. O. Marques, L. Swersky, J. Sander, A. Zimek and R. J. G. B. Campello.
On the Evaluation of Outlier Detection and One-Class Classification:
A Comparative Study of Algorithms, Model Selection, and Ensembles.
DAMI (2023)
This repository is intended only to provide the source code used in our experiments and instructions on how to use it. For details about the algorithms, their properties, and parameters, consult our supplementary material.
After downloading, you can add the PRTools5, dd_tools, and GLPKmex* toolboxes to the MATLAB search path using the command addpath:
addpath('path/to/prtools');
addpath('path/to/dd_tools');
addpath('path/to/GLPKmex');
*GLPKmex is only needed to use the LP classifier.
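As an optional sanity check, you can confirm that the toolboxes are visible on the path; the probe functions below are merely representative functions we assume each toolbox provides:
% each call should print a non-empty path if the corresponding toolbox was added correctly
which prdataset   % PRTools5
which gendatoc    % dd_tools
which glpk        % GLPKmex (only needed for the LP classifier)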
Above, we provide the original sources of all datasets used in our experiments. For convenience, we also make them available, ready to use in MATLAB, here.
Reading datasets
After downloading, you can load the datasets into the MATLAB workspace using the command load; just make sure PRTools5 has already been added to the path:
load('Datasets/Real/iris.mat');
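The examples below assume the loaded dataset is stored in a variable named data; displaying it gives a quick summary of the prdataset:
data          % prints a summary of the prdataset (objects, features, classes)
size(data)    % number of objects and number of features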
Scatterplots
After loading the dataset, you can plot it using the command scatterd to get a feeling for its distribution.
Since this dataset is 4-dimensional, we first project it onto a 2D space using PCA, with the command pcam:
iris2d = data*pcam(data,2);
scatterd(iris2d);
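If you want to reuse the same projection later (e.g., on held-out data), you can keep the trained PCA mapping and apply it with the * operator; a small sketch equivalent to the one-liner above:
w_pca = pcam(data, 2);    % train the 2D PCA mapping once
iris2d = data*w_pca;      % apply the mapping (same result as above)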

Creating one-class datasets
Since Iris is a multi-class dataset, we have to transform it into a one-class dataset. This is done with the dd_tools command oc_set; you only need to specify which class(es) will form the inlier (a.k.a. target) class.
Setting class 1 as the inlier class:
oc_data = oc_set(iris2d, [1]);
scatterd(oc_data, 'legend');
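More than one class can be merged into the target class as well; for instance, a sketch setting classes 1 and 2 as inliers (leaving the remaining class as outliers):
oc_data = oc_set(iris2d, [1 2]);
scatterd(oc_data, 'legend');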

Holdout
To partition the data into training and test sets, we can use the command gendat. In the example below, we use 80% of the dataset for training and hold out the remaining 20% for testing:
[train, test] = gendat(oc_data, 0.8);
The algorithms provided here follow the dd_tools pattern. Usually, the first parameter is the training dataset, the second is the fraction of the training data that may be misclassified (rejected) during training, and the third is the algorithm's parameter. Note that some algorithms have no parameter, while others have more than one.
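For illustration, a sketch with a non-zero rejection fraction (the value 0.05 is arbitrary); the call below allows 5% of the training targets to be misclassified during training:
w = gmm_dd(target_class(train), 0.05, 1);   % reject up to 5% of the training targets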

Gaussian Mixture Model (GMM) [7]
w = gmm_dd(target_class(train), 0, 1);
scatterd(oc_data, 'legend');
plotc(w)

Parzen Window (Parzen) [8]
w = parzen_dd(target_class(train), 0, 0.25);
scatterd(oc_data, 'legend');
plotc(w)

Support Vector Data Description (SVDD) [9]
The SVDD implementation here relies on LIBSVM [21], so first compile the LIBSVM MEX files:
mex -setup;
make
For general troubleshooting, read the LIBSVM README file.
w = libsvdd(target_class(train), 0, 1);
scatterd(oc_data, 'legend');
plotc(w)

Linear Programming (LP) [10] (requires GLPKmex, see above)
w = lpdd(target_class(train), 0, 0.25);
scatterd(oc_data, 'legend');
plotc(w)

Local k-Nearest Neighbors (kNN_local) [11]
w = lknndd(target_class(train), 0, 1);
scatterd(oc_data, 'legend');
plotc(w)

Auto-Encoder (AE) [12]
w = autoenc_dd(target_class(train), 0, 10);
scatterd(oc_data, 'legend');
plotc(w)

Deep SVDD (DSVDD) [13]
For DSVDD, we use the authors' Python implementation; we made some small adjustments so it can communicate with MATLAB and encapsulated it to follow the same pattern used by the dd_tools classifiers. Since the implementation is in Python, make sure you have a compatible version of Python and all the required packages installed. The list of required packages can be found here. Also, make sure your Python environment is set up in MATLAB. If not, check this out.
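On MATLAB R2019b or later, a quick way to check (and, if necessary, select) the Python interpreter MATLAB will use is the pyenv command; the path below is only a placeholder:
pe = pyenv                                % shows the currently configured Python version and executable
% pyenv('Version', '/path/to/python3');   % uncomment and adjust to point MATLAB to a specific interpreter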
pathToSAD = fileparts('path/to/Deep-SAD-PyTorch/src/main.py');
insert(py.sys.path, int32(0), pathToSAD)
w = dsvdd(target_class(train), 0, 8);
scatterd(oc_data, 'legend');
plotc(w)

k-Nearest Neighbors (kNN) [14]
w = knndd(target_class(train), 0, 1);
scatterd(oc_data, 'legend');
plotc(w)

Local Outlier Factor (LOF) [15]
w = lof(target_class(train), 0, 10);
scatterd(oc_data, 'legend');
plotc(w)

Local Correlation Integral (LOCI) [16]
w = loci(target_class(train), 0, 0.1);
scatterd(oc_data, 'legend');
plotc(w)

Global-Local Outlier Score from Hierarchies (GLOSH) [17]
The GLOSH implementation here is in Java; add the jar file to the MATLAB Java path and import its classes before use:
javaaddpath Algorithms/GLOSH/GLOSHDD.jar
import ca.ualberta.cs.hdbscanstar.*
w = gloshdd(target_class(train), 0, 5);
scatterd(oc_data, 'legend');
plotc(w)

Isolation Forest (iForest) [18]
w = iforest_dd(target_class(train), 0, 256, 60);
scatterd(oc_data, 'legend');
plotc(w)

Angle-Based Outlier Detection (ABOD) [19]
w = abof_dd(target_class(train), 0);
scatterd(oc_data, 'legend');
plotc(w)

Subspace Outlier Degree (SOD) [20]
w = sod(target_class(train), 0, 10);
scatterd(oc_data, 'legend');
plotc(w)

Once the classifier is trained, we can compute its performance using different measures. We use the following performance measures in our experiments:
% ROC AUC
dd_auc(dd_roc(test*w));
% Precision at n
dd_precatn(test*w);
% Matthews correlation coefficient (MCC)
dd_mcc(test*w);

Cross-validation
Performance can also be estimated via cross-validation, using the dd_tools command dd_crossval. The example below computes the mean ROC AUC of a GMM over 10 folds:
nrfolds = 10;
err = zeros(nrfolds, 1);
I = nrfolds;
for j=1:nrfolds
%x - training set, z - test set
[x,z,I] = dd_crossval(train, I);
%training
w = gmm_dd(x, 0, 1);
%test
err(j) = dd_auc(dd_roc(z*w));
end
mean(err)

When no labeled outliers are available, artificial data can be generated to estimate the classifier error. Below we illustrate three such strategies: self-adaptive data shifting (SDS) [26], perturbation, and uniform object generation [28].

Self-Adaptive Data Shifting (SDS) [26]
Generation of data:
[sds_targets, sds_outliers] = sds(target_class(train));
Classifier error:
% Error on target class
err_t = dd_error(sds_targets*w);
% Error on outlier class
err_o = dd_error(sds_outliers*w);
% classifier error
err_sds = err_t(1) + err_o(2);
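Since err_sds does not require labeled outliers, it can also be used for model selection. A minimal sketch, assuming (as in the examples above) that the third argument of the classifier is the hyperparameter being tuned and using an arbitrary candidate grid:
params = [1 2 3 4 5];              % hypothetical candidate values
errs = zeros(numel(params), 1);
for j = 1:numel(params)
    w = gmm_dd(target_class(train), 0, params(j));
    err_t = dd_error(sds_targets*w);
    err_o = dd_error(sds_outliers*w);
    errs(j) = err_t(1) + err_o(2);   % same SDS error as above
end
[~, best] = min(errs);
best_param = params(best)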

Perturbation
Generation of data:
nrinst = 20;
pert_targets = perturbation(target_class(train), nrinst, 0.5);
Classifier error:
% Error on target class (cross-validation without outliers)
nrfolds = 10;
err_t = zeros(nrfolds, 1);
I = nrfolds;
for j = 1:nrfolds
%x - training set, z - test set
[x,z,I] = dd_crossval(target_class(train), I);
%training
w = gmm_dd(x, 0, 1);
%test
err_xval = dd_error(z, w);
err_t(j) = err_xval(1);
end
% Error on outlier class (perturbed data)
err_o = zeros(nrinst, 1);
for j = 1:nrinst
err_pert = dd_error(pert_targets{j}*w);
err_o(j) = err_pert(2);
end
% classifier error
err_pert = mean(err_t) + mean(err_o);

Uniform object generation [28]
Generation of data:
unif_targets = gendatout(target_class(train), 100000);   % artificial outliers distributed uniformly around the targets
Classifier error:
% Error on target class (cross-validation without outliers)
nrfolds = 10;
err_t = zeros(nrfolds, 1);
I = nrfolds;
for j = 1:nrfolds
%x - training set, z - test set
[x,z,I] = dd_crossval(target_class(train), I);
%training
w = gmm_dd(x, 0, 1);
%test
err_xval = dd_error(z, w);
err_t(j) = err_xval(1);
end
% Error on outlier class (uniform data)
err_o = dd_error(unif_targets*w);
% classifier error
err_unif = mean(err_t) + err_o(2);

Ensembles (reciprocal rank fusion) [29]
The rankings produced by different detectors can be combined into an ensemble. In the example below, we combine the test-set rankings of GMM, kNN, and LOF using reciprocal rank fusion (RRF_dd):
ranks = zeros(size(test,1),3);
%training GMM
w = gmm_dd(target_class(train), 0, 1);
wx = test*w;
ranks(:,1) = +wx(:,1);   % '+' extracts the numeric scores from the prdataset
%training KNN
w = knndd(target_class(train), 0, 1);
wx = test*w;
ranks(:,2) = +wx(:,1);
%training LOF
w = lof(target_class(train), 0, 10);
wx = test*w;
ranks(:,3) = +wx(:,1);
% Combining rankings
ranks = tiedrank(ranks);
w = RRF_dd(train, 0, ranks);
dd_auc(dd_roc(test*w));
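To see whether the fusion pays off, you may also want to compare the ensemble AUC with that of its individual members on the same test set; a small sketch (the detectors are retrained here only for clarity):
members = {gmm_dd(target_class(train), 0, 1), ...
           knndd(target_class(train), 0, 1), ...
           lof(target_class(train), 0, 10)};
for j = 1:numel(members)
    fprintf('Member %d AUC: %.4f\n', j, dd_auc(dd_roc(test*members{j})));
end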
[1] D. M. J. Tax: DDtools, the Data Description Toolbox for Matlab. Version 2.1.3, Delft University of Technology, 2018
[2] R. P. W. Duin, P. Juszczak, P. Paclik, E. Pekalska, D. de Ridder, D. M. J. Tax, S. Verzakov: PRTools: A Matlab Toolbox for Pattern Recognition. Version 5.4.2, Delft University of Technology, 2018
[3] A. Zimek, M. Gaudet, R. J. G. B. Campello, J. Sander: Subsampling for Efficient and Effective Unsupervised Outlier Detection Ensembles. SIGKDD, 2013.
[4] D. Dua, C. Graff: UCI Machine Learning Repository. University of California, 2019.
[5] K. Y. Yeung, C. Fraley, A. Murua, A. E. Raftery, W. L. Ruzzo: Model-Based Clustering and Data Transformations for Gene Expression Data. Bioinformatics, 2001.
[6] K. Y. Yeung, M. Medvedovic, R. E. Bumgarner: Clustering Gene-Expression Data with Repeated Measurements. Genome Biology, 2003.
[7] C. M. Bishop: Pattern Recognition and Machine Learning. Springer, 2006.
[8] E. Parzen: On Estimation of a Probability Density Function and Mode. The Annals of Mathematical Statistics, 1962.
[9] D. M. J. Tax, R. P. W. Duin: Support Vector Data Description. Machine Learning, 2004.
[10] E. Pekalska, D. M. J. Tax, R. P. W. Duin: One-Class LP Classifiers for Dissimilarity Representations. NIPS, 2002.
[11] D. de Ridder, D. M. J. Tax, R. P. W. Duin: An Experimental Comparison of One-Class Classification Methods. ASCI, 1998.
[12] N. Japkowicz, C. Myers, M. A. Gluck: A Novelty Detection Approach to Classification. IJCAI, 1995.
[13] L. Ruff, N. Görnitz, L. Deecke, S. A. Siddiqui, A. Binder, E. Müller, M. Kloft: Deep One-Class Classification. ICML, 2018.
[14] S. Ramaswamy, R. Rastogi, K. Shim: Efficient Algorithms for Mining Outliers from Large Data Sets. SIGMOD, 2000.
[15] M. M. Breunig, H. Kriegel, R. T. Ng, J. Sander: LOF: Identifying Density-Based Local Outliers. SIGMOD, 2000.
[16] S. Papadimitriou, H. Kitagawa, P. B. Gibbons, C. Faloutsos: LOCI: Fast Outlier Detection using the Local Correlation Integral. ICDE, 2003.
[17] R. J. G. B. Campello, D. Moulavi, A. Zimek, J. Sander: Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection. TKDD, 2015.
[18] F. T. Liu, K. M. Ting, Z. Zhou: Isolation-Based Anomaly Detection. TKDD, 2012.
[19] H. Kriegel, M. Schubert, A. Zimek: Angle-Based Outlier Detection in High-Dimensional Data. SIGKDD, 2008.
[20] H. Kriegel, P. Kröger, E. Schubert, A. Zimek: Outlier Detection in Axis-Parallel Subspaces of High Dimensional Data. PAKDD, 2009.
[21] C.-C. Chang, C.-J. Lin: LIBSVM: A Library for Support Vector Machines. TIST, 2011.
[22] E. Schubert, A. Zimek: ELKI: A large open-source library for data analysis. ELKI Release 0.7.5, CoRR arXiv 1902.03616, 2019.
[23] G. O. Campos, A. Zimek, J. Sander, R. J. G. B. Campello, B. Micenková, E. Schubert, I. Assent, M. E. Houle: On the Evaluation of Unsupervised Outlier Detection: Measures, Datasets, and an Empirical Study. DAMI, 2016.
[24] B. W. Matthews: Comparison of the Predicted and Observed Secondary Structure of T4 Phage Lysozyme. BBA, 1975.
[25] J. Han, M. Kamber, J. Pei: Data Mining: Concepts and Techniques. Morgan Kaufmann, 2011.
[26] S. Wang, Q. Liu, E. Zhu, F. Porikli, J. Yin: Hyperparameter Selection of One-Class Support Vector Machine by Self-Adaptive Data Shifting. Pattern Recognition, 2018.
[27] H. O. Marques: Evaluation and Model Selection for Unsupervised Outlier Detection and One-Class Classification. PhD thesis, University of São Paulo, 2011.
[28] D. M. J. Tax, R. P. W. Duin: Uniform Object Generation for Optimizing One-class Classifiers. JMLR, 2001.
[29] G. V. Cormack, C. L. A. Clarke, S. Büttcher: Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods. SIGIR, 2009.