Repository of the paper:
H. O. Marques, L. Swersky, J. Sander, A. Zimek and R. J. G. B. Campello.
On the Evaluation of Outlier Detection and One-Class Classification:
A Comparative Study of Algorithms, Model Selection, and Ensembles.
DAMI (2023)
This repository is intended only to provide the source code used in our experiments and instructions on how to use it. For details about the algorithms, their properties, and parameters, consult our supplementary material.
After downloading, you can add the PRTools5, dd_tools, and GLPKmex* toolboxes to the MATLAB search path using the command addpath:
addpath('path/to/prtools');
addpath('path/to/dd_tools');
addpath('path/to/GLPKmex');
*GLPKmex is only needed to use the LP classifier.
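As an optional sanity check, you can confirm that the toolboxes are visible on the path; the probe functions below are merely representative functions we assume each toolbox provides:
% each call should print a non-empty path if the corresponding toolbox was added correctly
which prdataset   % PRTools5
which gendatoc    % dd_tools
which glpk        % GLPKmex (only needed for the LP classifier)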
Above, we provide the original sources of all datasets used in our experiments. For convenience, we also make them available, ready to use in MATLAB, here.
Reading datasets
After downloading, you can load the datasets into the MATLAB workspace using the command load; just make sure PRTools5 has already been added to the path:
load('Datasets/Real/iris.mat');
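The examples below assume the loaded dataset is stored in a variable named data; displaying it gives a quick summary of the prdataset:
data          % prints a summary of the prdataset (objects, features, classes)
size(data)    % number of objects and number of features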
Scatterplots
After loading the dataset, you can plot it using the command scatterd to get a feeling for its distribution.
Since this dataset is 4-dimensional, we first project it onto a 2D space using PCA, with the command pcam:
iris2d = data*pcam(data,2);
scatterd(iris2d);
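If you want to reuse the same projection later (e.g., on held-out data), you can keep the trained PCA mapping and apply it with the * operator; a small sketch equivalent to the one-liner above:
w_pca = pcam(data, 2);    % train the 2D PCA mapping once
iris2d = data*w_pca;      % apply the mapping (same result as above)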

Creating one-class datasets
Since Iris is a multi-class dataset, we have to transform it into a one-class dataset. This is done with the dd_tools command oc_set; you only need to specify which class(es) will form the inlier (a.k.a. target) class.
Setting class 1 as the inlier class:
oc_data = oc_set(iris2d, [1]);
scatterd(oc_data, 'legend');
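More than one class can be merged into the target class as well; for instance, a sketch setting classes 1 and 2 as inliers (leaving the remaining class as outliers):
oc_data = oc_set(iris2d, [1 2]);
scatterd(oc_data, 'legend');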

Holdout
To partition the data into training and test sets, we can use the command gendat. In the example below, we use 80% of the dataset for training and hold out the remaining 20% for testing:
[train, test] = gendat(oc_data, 0.8);
The algorithms provided here follow the dd_tools pattern. Usually, the first parameter is the training dataset, the second is the fraction of the training data that may be misclassified (rejected) during training, and the third is the algorithm's parameter. Note that some algorithms have no parameter, while others have more than one.
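For illustration, a sketch with a non-zero rejection fraction (the value 0.05 is arbitrary); the call below allows 5% of the training targets to be misclassified during training:
w = gmm_dd(target_class(train), 0.05, 1);   % reject up to 5% of the training targets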

Gaussian Mixture Model (GMM) [7]
w = gmm_dd(target_class(train), 0, 1);
scatterd(oc_data, 'legend');
plotc(w)

Parzen Window (Parzen) [8]
w = parzen_dd(target_class(train), 0, 0.25);
scatterd(oc_data, 'legend');
plotc(w)

Support Vector Data Description (SVDD) [9]
The SVDD implementation here relies on LIBSVM [21], so first compile the LIBSVM MEX files:
mex -setup;
make
For general troubleshooting, read the LIBSVM README file.
w = libsvdd(target_class(train), 0, 1);
scatterd(oc_data, 'legend');
plotc(w)

Linear Programming (LP) [10] (requires GLPKmex, see above)
w = lpdd(target_class(train), 0, 0.25);
scatterd(oc_data, 'legend');
plotc(w)

Local k-Nearest Neighbors (kNN_local) [11]
w = lknndd(target_class(train), 0, 1);
scatterd(oc_data, 'legend');
plotc(w)

Auto-Encoder (AE) [12]
w = autoenc_dd(target_class(train), 0, 10);
scatterd(oc_data, 'legend');
plotc(w)

Deep SVDD (DSVDD) [13]
For DSVDD, we use the authors' Python implementation; we made some small adjustments so it can communicate with MATLAB and encapsulated it to follow the same pattern used by the dd_tools classifiers. Since the implementation is in Python, make sure you have a compatible version of Python and all the required packages installed. The list of required packages can be found here. Also, make sure your Python environment is set up in MATLAB. If not, check this out.
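On MATLAB R2019b or later, a quick way to check (and, if necessary, select) the Python interpreter MATLAB will use is the pyenv command; the path below is only a placeholder:
pe = pyenv                                % shows the currently configured Python version and executable
% pyenv('Version', '/path/to/python3');   % uncomment and adjust to point MATLAB to a specific interpreter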
pathToSAD = fileparts('path/to/Deep-SAD-PyTorch/src/main.py');
insert(py.sys.path, int32(0), pathToSAD)
w = dsvdd(target_class(train), 0, 8);
scatterd(oc_data, 'legend');
plotc(w)

k-Nearest Neighbors (kNN) [14]
w = knndd(target_class(train), 0, 1);
scatterd(oc_data, 'legend');
plotc(w)

Local Outlier Factor (LOF) [15]
w = lof(target_class(train), 0, 10);
scatterd(oc_data, 'legend');
plotc(w)

Local Correlation Integral (LOCI) [16]
w = loci(target_class(train), 0, 0.1);
scatterd(oc_data, 'legend');
plotc(w)

Global-Local Outlier Score from Hierarchies (GLOSH) [17]
The GLOSH implementation here is in Java; add the jar file to the MATLAB Java path and import its classes before use:
javaaddpath Algorithms/GLOSH/GLOSHDD.jar
import ca.ualberta.cs.hdbscanstar.*
w = gloshdd(target_class(train), 0, 5);
scatterd(oc_data, 'legend');
plotc(w)

Isolation Forest (iForest) [18]
w = iforest_dd(target_class(train), 0, 256, 60);
scatterd(oc_data, 'legend');
plotc(w)

Angle-Based Outlier Detection (ABOD) [19]
w = abof_dd(target_class(train), 0);
scatterd(oc_data, 'legend');
plotc(w)

Subspace Outlier Degree (SOD) [20]
w = sod(target_class(train), 0, 10);
scatterd(oc_data, 'legend');
plotc(w)

Once the classifier is trained, we can compute its performance using different measures. We use the following performance measures in our experiments:
% ROC AUC
dd_auc(dd_roc(test*w));
% Precision at n
dd_precatn(test*w);
% Matthews correlation coefficient (MCC)
dd_mcc(test*w);

Cross-validation
Performance can also be estimated via cross-validation, using the dd_tools command dd_crossval. The example below computes the mean ROC AUC of a GMM over 10 folds:
nrfolds = 10;
err = zeros(nrfolds, 1);
I = nrfolds;
for j=1:nrfolds
%x - training set, z - test set
[x,z,I] = dd_crossval(train, I);
%training
w = gmm_dd(x, 0, 1);
%test
err(j) = dd_auc(dd_roc(z*w));
end
mean(err)

When no labeled outliers are available, artificial data can be generated to estimate the classifier error. Below we illustrate three such strategies: self-adaptive data shifting (SDS) [26], perturbation, and uniform object generation [28].

Self-Adaptive Data Shifting (SDS) [26]
Generation of data:
[sds_targets, sds_outliers] = sds(target_class(train));
Classifier error:
% Error on target class
err_t = dd_error(sds_targets*w);
% Error on outlier class
err_o = dd_error(sds_outliers*w);
% classifier error
err_sds = err_t(1) + err_o(2);
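Since err_sds does not require labeled outliers, it can also be used for model selection. A minimal sketch, assuming (as in the examples above) that the third argument of the classifier is the hyperparameter being tuned and using an arbitrary candidate grid:
params = [1 2 3 4 5];              % hypothetical candidate values
errs = zeros(numel(params), 1);
for j = 1:numel(params)
    w = gmm_dd(target_class(train), 0, params(j));
    err_t = dd_error(sds_targets*w);
    err_o = dd_error(sds_outliers*w);
    errs(j) = err_t(1) + err_o(2);   % same SDS error as above
end
[~, best] = min(errs);
best_param = params(best)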

Perturbation
Generation of data:
nrinst = 20;
pert_targets = perturbation(target_class(train), nrinst, 0.5);
Classifier error:
% Error on target class (cross-validation without outliers)
nrfolds = 10;
err_t = zeros(nrfolds, 1);
I = nrfolds;
for j = 1:nrfolds
%x - training set, z - test set
[x,z,I] = dd_crossval(target_class(train), I);
%training
w = gmm_dd(x, 0, 1);
%test
err_xval = dd_error(z, w);
err_t(j) = err_xval(1);
end
% Error on outlier class (perturbed data)
err_o = zeros(nrinst, 1);
for j = 1:nrinst
err_pert = dd_error(pert_targets{j}*w);
err_o(j) = err_pert(2);
end
% classifier error
err_pert = mean(err_t) + mean(err_o);

Uniform object generation [28]
Generation of data:
unif_targets = gendatout(target_class(train), 100000);   % artificial outliers distributed uniformly around the targets
Classifier error:
% Error on target class (cross-validation without outliers)
nrfolds = 10;
err_t = zeros(nrfolds, 1);
I = nrfolds;
for j = 1:nrfolds
%x - training set, z - test set
[x,z,I] = dd_crossval(target_class(train), I);
%training
w = gmm_dd(x, 0, 1);
%test
err_xval = dd_error(z, w);
err_t(j) = err_xval(1);
end
% Error on outlier class (uniform data)
err_o = dd_error(unif_targets*w);
% classifier error
err_unif = mean(err_t) + err_o(2);

Ensembles (reciprocal rank fusion) [29]
The rankings produced by different detectors can be combined into an ensemble. In the example below, we combine the test-set rankings of GMM, kNN, and LOF using reciprocal rank fusion (RRF_dd):
ranks = zeros(size(test,1),3);
%training GMM
w = gmm_dd(target_class(train), 0, 1);
wx = test*w;
ranks(:,1) = +wx(:,1);   % '+' extracts the numeric scores from the prdataset
%training KNN
w = knndd(target_class(train), 0, 1);
wx = test*w;
ranks(:,2) = +wx(:,1);
%training LOF
w = lof(target_class(train), 0, 10);
wx = test*w;
ranks(:,3) = +wx(:,1);
% Combining rankings
ranks = tiedrank(ranks);
w = RRF_dd(train, 0, ranks);
dd_auc(dd_roc(test*w));
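To see whether the fusion pays off, you may also want to compare the ensemble AUC with that of its individual members on the same test set; a small sketch (the detectors are retrained here only for clarity):
members = {gmm_dd(target_class(train), 0, 1), ...
           knndd(target_class(train), 0, 1), ...
           lof(target_class(train), 0, 10)};
for j = 1:numel(members)
    fprintf('Member %d AUC: %.4f\n', j, dd_auc(dd_roc(test*members{j})));
end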
[1] D. M. J. Tax: DDtools, the Data Description Toolbox for Matlab. Version 2.1.3, Delft University of Technology, 2018
[2] R. P. W. Duin, P. Juszczak, P. Paclik, E. Pekalska, D. de Ridder, D. M. J. Tax, S. Verzakov: PRTools: A Matlab Toolbox for Pattern Recognition. Version 5.4.2, Delft University of Technology, 2018
[3] A. Zimek, M. Gaudet, R. J. G. B. Campello, J. Sander: Subsampling for Efficient and Effective Unsupervised Outlier Detection Ensembles. SIGKDD, 2013.
[4] D. Dua, C. Graff: UCI Machine Learning Repository. University of California, 2019.
[5] K. Y. Yeung, C. Fraley, A. Murua, A. E. Raftery, W. L. Ruzzo: Model-Based Clustering and Data Transformations for Gene Expression Data. Bioinformatics, 2001.
[6] K. Y. Yeung, M. Medvedovic, R. E. Bumgarner: Clustering Gene-Expression Data with Repeated Measurements. Genome Biology, 2003.
[7] C. M. Bishop: Pattern Recognition and Machine Learning. Springer, 2006.
[8] E. Parzen: On Estimation of a Probability Density Function and Mode. The Annals of Mathematical Statistics, 1962.
[9] D. M. J. Tax, R. P. W. Duin: Support Vector Data Description. Machine Learning, 2004.
[10] E. Pekalska, D. M. J. Tax, R. P. W. Duin: One-Class LP Classifiers for Dissimilarity Representations. NIPS, 2002.
[11] D. de Ridder, D. M. J. Tax, R. P. W. Duin: An Experimental Comparison of One-Class Classification Methods. ASCI, 1998.
[12] N. Japkowicz, C. Myers, M. A. Gluck: A Novelty Detection Approach to Classification. IJCAI, 1995.
[13] L. Ruff, N. Görnitz, L. Deecke, S. A. Siddiqui, A. Binder, E. Müller, M. Kloft: Deep One-Class Classification. ICML, 2018.
[14] S. Ramaswamy, R. Rastogi, K. Shim: Efficient Algorithms for Mining Outliers from Large Data Sets. SIGMOD, 2000.
[15] M. M. Breunig, H. Kriegel, R. T. Ng, J. Sander: LOF: Identifying Density-Based Local Outliers. SIGMOD, 2000.
[16] S. Papadimitriou, H. Kitagawa, P. B. Gibbons, C. Faloutsos: LOCI: Fast Outlier Detection using the Local Correlation Integral. ICDE, 2003.
[17] R. J. G. B. Campello, D. Moulavi, A. Zimek, J. Sander: Hierarchical Density Estimates for Data Clustering, Visualization, and Outlier Detection. TKDD, 2015.
[18] F. T. Liu, K. M. Ting, Z. Zhou: Isolation-Based Anomaly Detection. TKDD, 2012.
[19] H. Kriegel, M. Schubert, A. Zimek: Angle-Based Outlier Detection in High-Dimensional Data. SIGKDD, 2008.
[20] H. Kriegel, P. Kröger, E. Schubert, A. Zimek: Outlier Detection in Axis-Parallel Subspaces of High Dimensional Data. PAKDD, 2009.
[21] C.-C. Chang, C.-J. Lin: LIBSVM: A Library for Support Vector Machines. TIST, 2011.
[22] E. Schubert, A. Zimek: ELKI: A large open-source library for data analysis. ELKI Release 0.7.5, CoRR arXiv 1902.03616, 2019.
[23] G. O. Campos, A. Zimek, J. Sander, R. J. G. B. Campello, B. Micenková, E. Schubert, I. Assent, M. E. Houle: On the Evaluation of Unsupervised Outlier Detection: Measures, Datasets, and an Empirical Study. DAMI, 2016.
[24] B. W. Matthews: Comparison of the Predicted and Observed Secondary Structure of T4 Phage Lysozyme. BBA, 1975.
[25] J. Han, M. Kamber, J. Pei: Data Mining: Concepts and Techniques. Morgan Kaufmann, 2011.
[26] S. Wang, Q. Liu, E. Zhu, F. Porikli, J. Yin: Hyperparameter Selection of One-Class Support Vector Machine by Self-Adaptive Data Shifting. Pattern Recognition, 2018.
[27] H. O. Marques: Evaluation and Model Selection for Unsupervised Outlier Detection and One-Class Classification. PhD thesis, University of São Paulo, 2011.
[28] D. M. J. Tax, R. P. W. Duin: Uniform Object Generation for Optimizing One-class Classifiers. JMLR, 2001.
[29] G. V. Cormack, C. L. A. Clarke, S. Büttcher: Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods. SIGIR, 2009.