DIODE - DIstance-based Outlier DEtection

Department of Computer Science - Universidade Federal de Minas Gerais
Department of Computer Science and Engineering - The Ohio State University



contact: meira AT dcc.ufmg.br
Av. Antonio Carlos, 6627 - Pampulha
31270-010 Belo Horizonte, MG, Brazil
+55-31-3409-5840




About the DIODE framework

The DIstance-based Outlier DEtection (DIODE) framework was used to run the experiments and to evaluate the strategies that improve the effectiveness of pruning comparisons among objects. The techniques implemented in DIODE rely on partitioning the data, both to prune and to rank objects. DIODE supports evaluating several optimizations, in isolation or in combination.
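
As an illustration of the kind of pruning referred to above, the following minimal sketch ranks objects by the distance to their k-th nearest neighbor and abandons a candidate as soon as it can no longer enter the top-n. It is a generic nested-loop formulation, not DIODE's implementation: the function name top_n_outliers and its parameters are assumptions, and the partition-based optimizations are omitted.

```python
import heapq
import math


def top_n_outliers(points, k, n):
    """Return up to n (score, index) pairs, where score is the distance
    from a point to its k-th nearest neighbor (larger = more outlying)."""
    top = []        # min-heap holding the best n outliers found so far
    cutoff = 0.0    # score of the weakest outlier currently in the top-n
    for i, p in enumerate(points):
        neighbors = []  # max-heap (negated distances) of the k closest so far
        pruned = False
        for j, q in enumerate(points):
            if i == j:
                continue
            d = math.dist(p, q)
            if len(neighbors) < k:
                heapq.heappush(neighbors, -d)
            elif d < -neighbors[0]:
                heapq.heapreplace(neighbors, -d)
            # The current k-th neighbor distance is an upper bound on the
            # final score, so the scan can stop once it drops below cutoff.
            if len(neighbors) == k and -neighbors[0] < cutoff:
                pruned = True
                break
        if pruned or len(neighbors) < k:
            continue
        score = -neighbors[0]
        if len(top) < n:
            heapq.heappush(top, (score, i))
        elif score > top[0][0]:
            heapq.heapreplace(top, (score, i))
        if len(top) == n:
            cutoff = top[0][0]
    return sorted(top, reverse=True)
```

The earlier strong outliers are found, the higher the cutoff rises and the more candidates are discarded after only a few distance computations, which is why the processing order of objects matters for this style of pruning.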

The baseline method in DIODE employs a clustering preprocessing step (Bisecting K-means). The framework can also compute summary statistics about the clusters or partitions (e.g., radius, diameter, density), similarly to the BIRCH algorithm. Further, the framework was implemented on a distributed and parallel platform (the Anthill environment), which enables outlier detection in very large databases.
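
To make the BIRCH-style summary statistics concrete, the sketch below derives the centroid, radius, and diameter of one cluster from the three quantities BIRCH keeps per cluster (the number of points, their linear sum, and their squared sum). The function name cluster_summary and the returned dictionary layout are assumptions for illustration and are not taken from the DIODE source code; density is left out because its exact definition in DIODE is not stated here.

```python
import math


def cluster_summary(points):
    """Summarize a cluster using BIRCH-style aggregates (N, LS, SS)."""
    n = len(points)
    dim = len(points[0])
    # Linear sum (one value per dimension) and squared sum (a scalar).
    ls = [sum(p[d] for p in points) for d in range(dim)]
    ss = sum(x * x for p in points for x in p)
    ls_sq = sum(v * v for v in ls)

    centroid = [v / n for v in ls]
    # Radius: root mean squared distance from the members to the centroid.
    radius = math.sqrt(max(ss / n - ls_sq / (n * n), 0.0))
    # Diameter: root mean squared pairwise distance within the cluster.
    if n > 1:
        diameter = math.sqrt(max((2 * n * ss - 2 * ls_sq) / (n * (n - 1)), 0.0))
    else:
        diameter = 0.0
    return {"n": n, "centroid": centroid, "radius": radius, "diameter": diameter}
```

Because all three statistics depend only on N, LS, and SS, two summaries can be merged by adding their aggregates, without revisiting the underlying points.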

The DIODE source code can be downloaded here. More details about the Anthill environment are described on this page.

 


 

Publications:

Distance-based Outlier Detection: Consolidation and Renewed Bearing
Gustavo H. Orair, Carlos H. C. Teixeira, Wagner Meira Jr., Ye Wang, Srinivasan Parthasarathy
International Conference on Very Large Data Bases (VLDB '10), Singapore

Um Algoritmo Eficiente para Detecção de Exceções em Bases Reais de Alta Dimensionalidade (in Portuguese)
Carlos H. C. Teixeira, Gustavo H. Orair, Wagner Meira Jr.
Revista de Iniciação Científica - Belém/Brazil (REIC/SBC '08)

Awards:

Um Algoritmo Eficiente para Detecção de Exceções em Bases Reais de Alta Dimensionalidade (In Portuguese)
Carlos H. C. Teixeira, Gustavo H. Orair, Wagner Meira Jr.

Scientific Merit Award, given by the Brazilian Computer Society, July 2008 (second place in the Scientific Initiation Contest of 2008)

Datasets:

Government Auctions : This database contains records of purchases made by various Brazilian government institutions. Details on this dataset can be found here. [Download]

KddCup1999 : This data set contains a set of records that represent connections to a military computer network where there have been multiple intrusions and attacks by unauthorized users. The raw binary TCP data from the network has been processed into features such as connection duration, protocol type, number of failed logins, and so forth. This data set was obtained from here.

Forest CoverType : This database contains the forest cover type for 30 x 30 meter cells, obtained from the US Forest Service (USFS) for the Rocky Mountain Region. This data can be found here.

Uniform30D : This is a synthetic database with 30 dimensions in which the attribute values were drawn at random from a uniform distribution over the range (-0.5, 0.5). [Download]
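
For readers who want a comparable dataset, a table with the same shape as Uniform30D could be produced along the following lines. This is only a sketch: the function name generate_uniform, the seed parameter, and the row count used below are assumptions, and the output will not reproduce the original file.

```python
import random


def generate_uniform(num_rows, num_dims=30, low=-0.5, high=0.5, seed=None):
    """Generate rows whose attributes are drawn uniformly between low and high."""
    rng = random.Random(seed)  # local generator, so the global state is untouched
    return [
        [rng.uniform(low, high) for _ in range(num_dims)]
        for _ in range(num_rows)
    ]


# Example: 1000 objects with 30 attributes each (row count chosen arbitrarily).
data = generate_uniform(1000, seed=42)
```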

ClusteredData : This synthetic database was generated from several well-defined uniform and Gaussian distributions in a multi-dimensional space, with all attributes in the range (-2, 2). [Download]

ClusteredData with noise : This database consists of the ClusteredData database augmented with a few noisy objects (almost 0.1% of the objects) that follow a uniform distribution over the range (-2, 2). The ClusteredData and ClusteredData with noise databases contain well-defined clusters and are used to evaluate the impact of noise on the algorithms' performance. [Download]
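
The noise injection described for ClusteredData with noise can be sketched as follows, assuming the clustered objects are already available as a list of equal-length rows. The function name add_uniform_noise and its parameters are hypothetical; the default fraction of 0.001 and the bounds of -2 and 2 follow the description above.

```python
import random


def add_uniform_noise(points, fraction=0.001, low=-2.0, high=2.0, seed=None):
    """Return the original points followed by uniformly distributed noise objects."""
    rng = random.Random(seed)
    num_dims = len(points[0])
    num_noise = round(len(points) * fraction)  # about 0.1% of the objects by default
    noise = [
        tuple(rng.uniform(low, high) for _ in range(num_dims))
        for _ in range(num_noise)
    ]
    return list(points) + noise
```

Keeping the noise objects at the end of the returned list makes it easy to check afterwards whether a detector ranks them among its top outliers.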

Last Update 08-08-2010