A Comparative Study for Outlier Detection Methods in High Dimensional Text Data

Outlier detection aims to find data samples that differ significantly from the other samples. Various outlier detection methods have been proposed and have been shown to detect anomalies in many practical problems. However, in high dimensional data, conventional outlier detection methods often behave unexpectedly due to a phenomenon called the curse of dimensionality. In this paper, we compare and analyze outlier detection performance in various experimental settings, focusing on text data with dimensions typically in the tens of thousands. Experimental setups were simulated to compare the performance of outlier detection methods in unsupervised versus semi-supervised mode and on uni-modal versus multi-modal data distributions. The performance of outlier detection methods based on dimension reduction is compared, and a discussion of using k-NN distance in high dimensional data is also provided. Experimental comparison across these settings can provide insights into the application of outlier detection methods to high dimensional data.
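As a rough illustration of why k-NN distance can lose discriminative power as the dimension grows, the sketch below scores each point by the Euclidean distance to its k-th nearest neighbor and compares how widely those scores spread in 2 versus 1000 dimensions. It is not the experimental setup of the paper: the data are uniform random points rather than text vectors, and the function names (`knn_distance_scores`, `relative_spread`, `demo`) and parameter values are assumptions chosen only for this example.

```python
import math
import random


def knn_distance_scores(points, k):
    """Score each point by the Euclidean distance to its k-th nearest neighbor."""
    scores = []
    for i, p in enumerate(points):
        # Distances from p to every other point, smallest first.
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(dists[k - 1])
    return scores


def relative_spread(scores):
    """(max - min) / min: how far the largest score stands out from the smallest."""
    return (max(scores) - min(scores)) / min(scores)


def demo(dim, n=100, k=5, seed=0):
    """Relative spread of k-NN distance scores for n uniform random points."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    points = [[rng.random() for _ in range(dim)] for _ in range(n)]
    return relative_spread(knn_distance_scores(points, k))


# Hypothetical settings: a low-dimensional and a high-dimensional case.
spread_low = demo(dim=2)
spread_high = demo(dim=1000)
print(f"relative spread, dim=2:    {spread_low:.3f}")
print(f"relative spread, dim=1000: {spread_high:.3f}")
```

Under these assumptions the scores in the high-dimensional case should lie much closer together than in the low-dimensional case, which is one way the curse of dimensionality can make a distance-based outlier score harder to threshold.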

Language: English

Publication frequency: 4 times per year
Journal subjects: Computer Science, Databases and Data Mining, Artificial Intelligence

Cheong Hee Park

Published: 28 Nov 2022

Page range: 5-17

Received: 22 Jun 2022

Accepted: 19 Oct 2022

DOI: https://doi.org/10.2478/jaiscr-2023-0001

Keywords: Curse of dimensionality, Dimension reduction, High dimensional text data, Outlier detection

© 2023 Cheong Hee Park, published by Sciendo

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 License.
