Development of a Distributed Outlier Detection Method Based on the Alternating Direction Method of Multipliers

Outlier detection is one of the most studied probems in data mining, and involves the identification of erroneous or “unusual” data points within a dataset. Outlier detection (OD) finds applications in discovery of novel information, bank frauds, system intrusions [12,14], as well as in data cleaning for machine learn-ing models that are sensitive to the presence of out-liers [28]. Formally, outlier detection can be defined as the problem of finding patterns in the data with unexpected behavior [7]. Outliers are often referred to as anomalies, exceptions, discordant observations, and various other similar terms [3,21].

Outlier detection is an example of a problem that can be solved using a class of classifiers known as one-class classifiers (OCC), a term proposed by Moya [18]. It solves the problem of finding the boundary that encompasses an entire class, given a dataset that could be contaminated with a small amount of anomalous data.

Tax and Duin [25] describe an OCC called Support Vector Data Description (SVDD). It aims to find the hypersphere with the minimum radius that encloses most data points in a dataset. The data points inside the hypersphere are considered to belong to the target class, while those outside are considered anomalous.

Currently, handling large volumes of data, known as big data, represents a challenge for outlier detec-tion algorithms. The resources of a single computer may not be sufficient to achieve efficient performance for a particular algorithm. Centralized algorithms are becoming less competitive in meeting the time demands of modern applications. Furthermore, due to the increasing sizes of datasets, they are frequently stored in distributed environments [3].

The Alternating Direction Method of Multipliers (ADMM) belongs to the category of distributed opti-mization and is suitable for solving many machine learning problems [1,4], especially those that can be formulated or transformed into convex optimization problems [5].

Based on these elements, the main objective of this research is to develop a distributed outlier detection algorithm based on solving the SVDD problem using ADMM. This research aims to contribute to outlier detection in large volumes of data by enabling distributed processing. Moreover, the proposed method does not require centralized data, making it applicable to distributed datasets.

2.

Concepts and Basic Notation

2.1.

ADMM

ADMM is a decomposition-coordination technique in which solutions to small local subproblems are coordinated to find the solution to a larger global problem [4]. Depending on the application, it is relatively simple to implement in distributed environments, making it applicable to big data problems [4, 10].

ADMM solves problems of the form: 1 $\begin{array}{l} minimize f (x) + g (z) \\ subject to: A x + B z = c \end{array}$

Where x ∈ ℝⁿ, z ∈ ℝ^m, A ∈ ℝ^p×n, B ∈ ℝ^p×m, and c ∈ ℝ^p. With f and g convex functions. The minimum value of problem (1) would be: $p^{*} = \min (f (x) + g (z) ∣ A x + B z = c)$

Where min refers to the minimum. The (augmented) Lagrangian for this problem would be: $L_{p} (x, z, y) = f (x) + g (z) + y^{T} (A x + B z - c) + \frac{ρ}{2} ‖ A x + B z - c ‖_{2}^{2}$

And the solution by ADMM consists of the iterations: 2 $x^{k + 1} : = \underset{x}{argmin} L_{p} (x, z^{k}, y^{k})$ 3 $z^{k + 1} : = \underset{z}{argmin} L_{p} (x^{k + 1}, z, y^{k})$ 4 $y^{k + 1} : = y^{k} + ρ (A x^{k + 1} + B z^{k + 1} - c)$ with ρ > 0. As can be seen, the variables x and z are updated alternately, hence the name of the algorithm. The variable y is called the dual variable.

ADMM can be written in a different, often more convenient form, known as the scaled form: 5 $x^{k + 1} : = \underset{x}{argmin} (f (x) + \frac{ρ}{2} {‖ A x + B z^{k} - c + u^{k} ‖}_{2}^{2})$ 6 $z^{k + 1} : = \underset{z}{argmin} (g (z) + \frac{ρ}{2} {‖ A x^{k + 1} + B z - c + u^{k} ‖}_{2}^{2})$ 7 $u^{k + 1} : = u^{k} + A x^{k + 1} + B z^{k + 1} - c$ where $u = \frac{1}{ρ} y$ . The dual variable is scaled.

Consensus problems are part of the field of distributed optimization and have historically been related to the ADMM. The consensus optimization problem consists of a single global variable with the objective and constraint terms divided into N parts: 8 $\begin{array}{l} minimize \sum_{i = 1}^{N} f_{i} (x_{i}) \\ subject to : x_{i} - z = 0 i = 1 \dots N \end{array}$

This is called the global consensus problem.

A common variation of the problem (8) is the introduction of a regularization term in the objective function g, which represents a regularization term: 9 $\begin{array}{l} minimize \sum_{i = 1}^{N} f_{i} (x_{i}) + g (z) \\ subject to: x_{i} - z = 0 i = 1 \dots N \end{array}$

The solution by ADMM to problem (9) would be (in the scaled form): 10 $x_{i}^{k + 1} : = \underset{x_{i}}{argmin} (f_{i} (x_{i}) + \frac{ρ}{2} {‖ x_{i} - z^{k} + u_{i}^{k} ‖}_{2}^{2})$ 11 $z^{k + 1} : = \underset{z}{argmin} (g (z) + \frac{N ρ}{2} {‖ z - {\bar{x}}^{k + 1} - {\bar{u}}^{k} ‖}_{2}^{2})$ 12 $u_{i}^{k + 1} : = u_{i}^{k} + x_{i}^{k + 1} - z^{k + 1}$ where x_i are the local primal variables at each node, u_i are the local dual variables, z is the consensus (global) variable, and $\bar{x}$ and $\bar{u}$ are the averages of the primal and dual variables, respectively.

For ADMM with consensus, the residuals are in the form of vectors: $\begin{array}{l} r^{k} & = (x_{1}^{k} - z^{k}, \dots, x_{N}^{k} - z^{k}), \\ s^{k} & = - ρ (z^{k} - z^{k - 1}, \dots, z^{k} - z^{k - 1}) \end{array}$ whose quadratic norms are: ${‖ r^{k} ‖}_{2}^{2} = \sum_{i = 1}^{N} {‖ x_{i}^{k} - z^{k} ‖}_{2}^{2}, {‖ s^{k} ‖}_{2}^{2} = N ρ^{2} {‖ z^{k} - z^{k - 1} ‖}_{2}^{2}$

2.2.

SVDD

SVDD, proposed in [25], is an unsupervised learn-ing method that is very useful for data description and anomaly detection. To describe the dataset, this model finds a hypersphere that encloses most of the data, minimizing the possibility of accepting anoma-lous data within it.

SVDD is formulated as an optimization problem in the following way: Given a set of points x_i ∈ ℝ^d, i = 1…n: 13 $\begin{array}{l} \underset{R, a, ξ_{i}}{minimize} R^{2} + C \sum_{i = 1}^{n} ξ_{i} \\ \begin{array}{l} subject to: & {‖ x_{i} - a ‖}^{2} \leq R^{2} + ξ_{i}, i = 1 \dots n \\ ξ_{i} \geq 0, i = 1 \dots n \end{array} \end{array}$

Where a is the center of the hypersphere, R is the radius, C is a control parameter, and ξ_i are slack variables. The distance from x_i to the center need not be less than or equal to R², and these large distances are penalized with margin errors ξ_i for each x_i. The parameter C controls the trade-off between the volume of the hypersphere and the errors, i.e., it controls the number of points included in the hypersphere.

Another approach to the data description problem can be found in [23,24]. Itis based on finding a hyperplane that separates the dataset from the origin with the maximum possible margin. The formulation would be: 14 $\begin{array}{l} \underset{w, ρ, ξ_{i}}{minimize} \frac{1}{2} ‖ w ‖^{2} - ρ + \frac{1}{v N} \sum_{i = 1}^{N} ξ_{i} \\ subject to: w \cdot x_{i} \geq ρ - ξ_{i} \forall i ξ_{i} \geq 0 \end{array}$

Where w is the weight vector of the hyperplane, ρ is the separation margin, x_i are the instances of the dataset, N is the number of instances, ξ_i are slack variables to account for possible errors, and v ∈ (0, 1) is a regularization parameter that indicates the fraction of data points that should be separated (equivalent to the C parameter). Because of this parameter, this method is known as v-SVC.

Although the hyperplane is not a closed boundary, it provides solutions comparable to the original SVDD problem when the data is preprocessed to unit norm [25].

3.

Related Work

3.1.

Convex Optimization in OCC and Outlier Detection

Convex analysis methods have found increasing application in outlier detection and related areas. OCC can benefit from the robust algebraic and geometric approximations provided by convex analysis and optimization, allowing efficient solution computation through mathematical optimization [26].

Kernel Mean Matching (KMM) [13] is a method that assists in outlier detection by checking whether the training and test sets follow the same distribution. Its solution is formulated as a quadratic programming problem. A similar task to KMM can be performed with Least Square Importance Fitting (LSIF) [15]. In this case, it is also a quadratic programming problem directly related to the least squares problem.

The convex hull is one of the most commonly used convex analyses approaches in OD. This approach aims to find the smallest convex set that encloses a given set of points. Due to its inherent problem formulation, the convex hull has been used in OCC as a method for outlier detection [2,6].

In [8], the convex modeling of the SVDD prob-lem and its solution using Lagrange multipliers are analyzed. In addition, [23] presents an OCC based on support vector machines, where the model corresponds to a quadratic programming problem related to SVDD. These approaches are of interest in the present research.

In the field of convex optimization, many works focus on decomposition methods and decentralized algorithms [4,11,17], which naturally lend themselves to be approached from the perspective of distributed optimization algorithms.

3.2.

Distributed Optimization for Outlier Detection

When working with OD in big data, valuable information can be discovered and it is applicable in various domains. A fundamental challenge of OD in distributed systems is to minimize communication between nodes while maintaining the effectiveness of an algorithm [3]. In [14], an algorithm for OD based on data neighborhood is developed with a distributed and input stream focused approach. The algorithm is based on LOCI (Local Correlation Integral), which allows the detection of contextual outliers, i.e., instances that are considered outliers for a particular subset of the dataset. This work highlights the power of parallel processing for OD with input data streams.

In [22], an implementation for real-time OD using the PySpark variant of Spark for Python is proposed. In this work, the training data is stored in the cloud and a cluster-based solution is presented. Instances outside the formed clusters are considered outliers. The approach also includes a scheduler that periodically retrains the algorithm, allowing for reevaluation of decisions previously made on instances. Considering that distributed OD is a relatively unexplored area, [28] proposes Sparx, a new algorithm based on the xStream algorithm.

xStream was originally designed for a single processor, but this proposes a Map-Reduce design using Apache Spark in Python. The environment used does not exchange information between nodes, and the data is decentralized.

In [29,30], the Python library PyOD is proposed, which provides numerous OD algorithms. The library analyzes various characteristics of these algorithms, including their suitability for multicore processing. PyOD aims to provide a comprehensive set of OD techniques, making it easier for researchers and practitioners to experiment with and apply different outlier detection algorithms in Python.

4.

Solving the SVDD Problem with Consensus-Based ADMM

Based on the analysis discussed in 2.1 and 2.2, the following solution to the SVDD problem using consensus-based ADMM is proposed: 15 $w_{i}^{k + 1} : = \underset{w_{i}}{argmin} \sum_{j = 1}^{N_{i}} {[ρ_{i}^{k} - w_{i} x_{i j}]}_{+} + \frac{α_{1}}{2} {‖ w_{i} - w^{k} + u_{i}^{k} ‖}^{2}$ 16 $ρ_{i}^{k + 1} : = \underset{ρ_{i}}{argmin} \sum_{j = 1}^{N_{i}} {[ρ_{i} - w_{i}^{k + 1} x_{i j}]}_{+} + \frac{α_{2}}{2} {‖ ρ_{i} - ρ^{k} + v_{i}^{k} ‖}^{2}$ 17 $w^{k + 1} : = \underset{w}{argmin} \frac{1}{2 λ} ‖ w ‖^{2} + \frac{J α_{1}}{2} {‖ w - {\bar{w}}^{k + 1} - {\bar{u}}^{k} ‖}^{2}$ 18 $ρ^{k + 1} : = \underset{ρ}{argmin} - \frac{ρ}{2 λ} + \frac{J α_{2}}{2} {‖ ρ - {\bar{ρ}}^{k + 1} - {\bar{v}}^{k} ‖}^{2}$ 19 $u_{i}^{k + 1} : = u_{i}^{k} + w_{i}^{k + 1} - w^{k + 1}$ 20 $v_{i}^{k + 1} : = v_{i}^{k} + ρ_{i}^{k + 1} - ρ^{k + 1}$

In this case, i = 1…J, where J is the number of nodes and N_i is the number of instances handled by node i. The values of α₁ and α₂ are assumed to be greater than 0. The λ parameter acts as a penalty parameter.

4.1.

Variable Updates

The minimizations are performed using gradients (or subgradients if necessary) and would be as follows: 21 $w_{i}^{k + 1} = w^{k} - u_{i}^{k} + \frac{1}{α_{1}} \sum_{j = 1}^{N} I_{w_{i} x_{i j} < ρ_{i}^{k}} (w_{i}) x_{i j}$ where $I_{A} (x) = {\begin{array}{l} 1, & x \in A \\ 0, & x \notin A \end{array}$ 22 $ρ_{i}^{k + 1} = {\begin{array}{l} ρ^{k} - v_{i}^{k}, & ρ^{k} - v_{i}^{k} \leq \min (\vec{a}) \\ ρ^{k} - v_{i}^{k} - \frac{N}{α_{2}}, & ρ^{k} - v_{i}^{k} - \frac{N}{α_{2}} > \max (\vec{a}) \\ ρ^{k} - v_{i}^{k} - \frac{N_{t}}{α_{2}}, & otherwise \end{array}$ 23 $w^{k + 1} = \frac{J α_{1}}{\frac{1}{λ} + J α_{1}} ({\bar{w}}^{k + 1} + {\bar{u}}^{k})$ 24 $ρ = \frac{1}{2 λ α_{2} J} + ({\bar{ρ}}^{k + 1} + {\bar{v}}^{k})$ 25 $u_{i}^{k + 1} : = u_{i}^{k} + w_{i}^{k + 1} - w^{k + 1}$ 26 $v_{i}^{k + 1} : = v_{i}^{k} + ρ_{i}^{k + 1} - ρ^{k + 1}$

Based on the proposed updates, the solution algorithm would be as shown in Algorithm 1:

Algorithm 1

Distributed SVDD using Consensus-based ADMM

Input: X, λ, J, α₁, α₂

Output: w, ρ

/* Initialize variables for w */

1 $w_{i} = \vec{0}$ , w = 0, $u = \vec{0}$

/* Initialize variables for ρ */

2 $ρ_{i} = \vec{1}$ , ρ = 1, $v = \vec{1}$

3 while stop do

/* Update w_i at each node */

4 $w_{i}^{k + 1} = argmin w_{i} \sum_{j = 1}^{N_{i}} {[ρ_{i}^{k} - w_{i} x_{i j}]}_{+} + \frac{α_{1}}{2} {‖ w_{i} - w^{k} + u_{i}^{k} ‖}_{2}^{2}$

5 /* Update ρ_i at each node */

6 $ρ_{i}^{k + 1} = argmin ρ_{i} {\sum_{j = 1}^{N_{i}} [ρ_{i} - w_{i}^{k + 1} x_{i j}]}_{+} + \frac{α_{2}}{2} {(ρ_{i} - ρ^{k} + v_{i}^{k})}^{2}$

7 Update consensus variables */

8 Update w; Update ρ;

9 /* Update dual variables */

10 $u_{i}^{k + 1} = u_{i}^{k} + w_{i}^{k + 1} - w^{k + 1}$

11 $v_{i}^{k + 1} = v_{i}^{k} + ρ_{i}^{k + 1} - ρ^{k + 1}$

12 /* Update history, objective function, dual and primal residuals, and check termination criteria */

13 ;

14 Update ϵ_primal, ϵ_dual, r_primal, r_dual, and objective function;

15 if stop then

16 break

The variable stop models the proposed stopping conditions. The most general condition is the maximum number of iterations, which ensures that the algorithm stops.

It also checks whether the residuals do not exceed a tolerance value that combines absolute and relative variants. Finally, an early stopping criterion is introduced, where the algorithm is stopped if there is no improvement of the best value obtained so far for a given number of iterations.

Given the characteristics of consensus ADMM for the distributed architecture, one node should be used where the results are centralized and aggregated. The flow of the algorithm in each iteration would be to update the parameters w_i and b_i at each node independently using its own dataset and the values of w, b and the dual variables from the previous iteration. Once the w_i and b_i values of each node have been obtained, they are centralized and aggregated synchronously to give the w and b of the current iteration. Finally, the dual variables are updated. This workflow was implemented using a mapreduce methodology.

5.

Results and Discussion

5.1.

Datasets

To evaluate the performance of the algorithm, we use five datasets: three synthetic, and two real. These datasets vary in the number of instances and dimensions. The synthetic datasets are generated based on probability distributions in the plane, which facilitates the construction of graphs to understand the behavior of the algorithm. The datasets are distributed equitably among the processing nodes.

Table 1 briefly describes the five datasets, the instances row defines the total number of elements in each dataset, while the outliers row defines how many of those instances are anomalous. The three synthetic sets corresponding to Experiments 1, 2 and 3 are constructed in the same way, the instances are (x,y) coordinates where most of the data is uniformly distributed around the points (-2,-2) and (2,2) with radius 1. The rest of the instances are normally distributed around the points (-2,-2) and (2,2) with radius 1. The remaining instances are normally distributed in the rectangle defined from (-4,-4) to (4,4) and are considered anomalous. Experiment 4 is performed on the Glass dataset, which has nine continuous features related to the concentrations of metals in the material. It is initially a classification problem with six classes, but class number 6 is in the minority and instances belonging to this class are considered anomalous.¹ Experiment 5 is carried out with a real dataset related to behavior reports for a physical system: Statlog (shuttle). For the experiments, a version is taken in which the main task is to distinguish the reports that are considered anomalous from those that are not.²

Table 1.

Experimental Results

Parameters	Experiment 1	Experiment 2	Experiment 3	Experiment 4	Experiment 5
Instances	420	2020	300500	214	46464
Dimension	2	2	2	9	9
Type	Synthetic	Synthetic	Synthetic	Real	Real
Outliers	20	20	500	6	878
Nodes	4	8	2	4	4
AUC-ROC OCSVM	0.95	0.92	0.95	0.76	0.83
AUC-ROC SGD-OC	0.94	0.90	0.96	0.76	0.83
AUC-ROC ADMM-OC	0.95	0.91	0.95	0.76	0.81

Preprocess The main step in data preprocessing is the use of the Nystroem approximation [19,27]. This step not only helps to reduce the dimensionality of the kernel matrix, but also allows for modeling the nonlinearity of the data. The kernel used is the Radial Basis Function (RBF) kernel [20].

In the experiments conducted, it was also found to be effective to scale the values before applying the kernel. For this purpose, the Robust Scaler from scikitlearn is used, which is more effective than standard scaling because the Robust Scaler is not sensitive to the presence of outliers [19].

Other common preprocessing steps can also apply if necessary, such as data type transformation and missing value handling [16].

5.2.

Experimental Setup

The same methodology was used for all experiments: A labeled dataset is taken, with one class representing the outliers. The labels are stored and removed from the dataset. The classifier is trained on the unlabeled data, then used to classify the data, and the performance is evaluated using the original stored labels.

This process is performed for three One Class Support Vector Machine classifiers, two of which are available in the scikitlearn library: OCSVM and SGD-OC. The third classifier is the one proposed in this research (called ADMM-oc).

Another important aspect is that for the ADMM-oc and SGD-oc methods the data shown in the AUC-ROC in Table 1 are the result of the average of 10 runs in the first four experiments and 5 runs in the fifth experiment. This process is done because both methods are stochastic.

5.3.

Metrics

The OD problem can be viewed as a two-class classification problem, where one class represents the outliers and the other class represents the “normal” instances [21]. Even methods that return a score or degree can be reduced to a two-class problem by set-ting a threshold for the data, above which they are considered anomalous [7]. Therefore, scoring metrics for binary classification problems are valid. An important consideration is the typical class imbalance in such problems, which may cause certain metrics to be unrepresentative of the quality of a given algorithm. Therefore, the AUC-ROC metric is evaluated because it is not sensitive to imbalanced data [9,12].

5.4.

Validation

Table 1 presents the results of the experiments, with a description of the datasets and some parameters used for each experiment. It can be observed that the AUC-ROC values remain similar to the other two methods compared, regardless of the variation in the datasets.

During Experiment 2, studies were conducted to verify the algorithm’s convergence. The results showed that the ADMM-OC algorithm achieves good convergence within a maximum of 20 iterations, as can be seen in Figures 1 and 2.

Considering the two-dimensionality of the dataset used in Experiment 3, a comparison of the contour plots of the classifiers is proposed in Figure 3. As can be observed, the contour plots are almost identical, which supports the similar AUC-ROC values obtained.

Visually the outermost contour line would repre-sent the boundary between anomalous and normal instances.

Based on the experimental study, it can be con-cluded that the algorithm maintains good effectiveness compared to other similar methods. The influence of data preprocessing, particularly the Nystroem approximation, on the performance of the algorithm was observed. For two-dimensional datasets, the results were evaluated graphically, visually confirming the effectiveness values obtained. Finally, a detailed analysis of the algorithm’s iterations allowed verification of aspects related to convergence and the consensus process.

6.

Conclusion

Outlier detection is a fundamental area of data mining. It not only contributes to data cleaning, but also to the discovery of new knowledge. However, its application to large volumes of data remains a challenge today. The SVDD problem is useful for outlier detection and modeling it as an optimization problem allows approaching it using iterative methods such as ADMM. The consensus variant of ADMM was used to model the solution of the SVDD problem. The proposed algorithm was implemented. The form of the consensus method allowed this implementation to have a distributed approach. To validate the model, a series of experiments were carried out using synthetic and real datasets with different numbers of instances and dimensions, which verified the competitiveness of the proposed algorithm with respect to other existing ones, obtaining AUC-ROC values in an acceptable range. The graphical analysis of the contour lines in the two-dimensional sets provided clarity to the results obtained.

As future work, it is considered to extend the comparison of the proposed algorithm with other distributed algorithms for OD. It is also proposed to validate the algorithm on non equal distribution datasets.

Language:: English

Publication timeframe:: 4 times per year
Journal Subjects:: Computer Sciences, Artificial Intelligence, Engineering, Electrical Engineering, Control Engineering, Metrology and Testing, Mechanical Engineering, Fundamentals of Mechanical Engineering

Journal RSS Feed

Development of a Distributed Outlier Detection Method Based on the Alternating Direction Method of Multipliers

Alejandro Cespón Ferriol

Héctor R González Diez

Carlos A. Morell Pérez

Published Online: Jun 26, 2025

Page range: 1 - 7

Received: Feb 29, 2024

Accepted: Mar 27, 2025

DOI: https://doi.org/10.14313/jamris-2025-009

Keywordsoutlier detection, big data, method of multipliers

© 2025 Alejandro Cespón Ferriol et al., published by Sciendo

This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.

Keywords
outlier detection, big data, method of multipliers