occ

On the Evaluation of Outlier Detection and One-Class Classification

Repository of the paper:

H. O. Marques, L. Swersky, J. Sander, A. Zimek and R. J. G. B. Campello. 
On the Evaluation of Outlier Detection and One-Class Classification: 
A Comparative Study of Algorithms, Model Selection, and Ensembles. 
DAMI (2023)

This repository is intended only to provide the source code used in our experiments and instructions on how to use it. For details about the algorithms, their properties, and their parameters, consult our supplementary material.

Toolboxes

Importing toolboxes

After downloading, you can add the PRTools5, dd_tools, and GLPKmex* toolboxes to the MATLAB search path using the addpath command:

    addpath('path/to/prtools');
    addpath('path/to/dd_tools');
    addpath('path/to/GLPKmex');

*GLPKmex is only needed to use the LP classifier.


Datasets

Manipulating datasets

The original sources of all datasets used in our experiments are provided above. For convenience, we also make them available ready to use in MATLAB here.
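As a minimal sketch of how one of the ready-to-use datasets can be loaded and turned into a one-class problem (the file name and the variable name x are placeholders, and it is assumed the .mat file stores a PRTools dataset with class labels):

    % Minimal sketch: load a ready-to-use dataset and relabel it for one-class classification.
    % The file name and the variable name 'x' are placeholders; it is assumed the .mat file
    % stores a PRTools dataset with class labels.
    load('path/to/dataset.mat');   % loads the dataset, here assumed to be the variable x
    a = oc_set(x, 1);              % dd_tools: mark class 1 as target, all other objects as outliers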

Algorithms

The algorithms provided here follow the dd_tools calling convention. Usually, the first argument is the training dataset, the second is the fraction of the training data that can be misclassified (rejected) during training, and the third is the algorithm's parameter. Note that some algorithms have no parameter, while others have more than one. A small sketch of this convention is given below.
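The following is only an illustrative sketch of that calling convention, assuming PRTools and dd_tools are on the MATLAB path; the PRTools banana data and the chosen parameter values are stand-ins, not recommendations:

    % Minimal sketch of the dd_tools calling convention (assumes PRTools and dd_tools
    % are on the MATLAB path; the banana data and parameter values are only stand-ins).
    a  = oc_set(gendatb([60 60]), 1);  % one-class dataset: class 1 as target, class 2 as outlier
    w1 = gauss_dd(a, 0.1);             % Gaussian data description: no additional parameter
    w2 = knndd(a, 0.1, 5);             % k-NN data description: one parameter (k = 5)
    w3 = svdd(a, 0.1, 4);              % SVDD: one parameter (width of the Gaussian kernel)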

Measures

Once the classifier is trained, we can compute its performance using different measures. We use the following performance measures in our experiments:
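As a minimal sketch of how one such measure can be computed with dd_tools (ROC AUC is used here only as an example; a and w1 are the dataset and trained classifier from the sketch above):

    % Minimal sketch: ROC AUC of a trained one-class classifier with dd_tools.
    % 'a' and 'w1' come from the previous sketch; ROC AUC is only one example measure.
    e   = dd_roc(a * w1);   % map the data through the classifier and compute the ROC curve
    auc = dd_auc(e);        % area under the ROC curve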

Model Selection

Ensembles
