Open Access

A Novel Variance Reduction Proximal Stochastic Newton Algorithm for Large-Scale Machine Learning Optimization

  
Dec 31, 2024

Cite
Download Cover

Introduction

In recent years, the field of machine learning has seen a surge in the importance of composite optimization problems. These problems, characterized by the sum of a smooth convex function and a non-smooth convex regularizer, arise in various applications ranging from regression models to classification tasks. The challenges in solving such problems are twofold. First, the large number of samples leads to high computational costs in calculating function values and gradients. Second, the optimization often occurs in high-dimensional spaces, further complicating the process [13].

Traditional approaches to solving these problems have evolved significantly over time, from full gradient descent methods to more sophisticated stochastic and variance-reduced algorithms [45]. Despite these advancements, there remains a need for algorithms that can more effectively balance computational efficiency with convergence speed, especially in the context of large-scale machine learning problems with nonsmooth regularizers [67].

In this paper, we introduce a novel algorithm : the Variance Reduction Proximal Stochastic Newton Algorithm (SNVR). SNVR combines the strengths of variance reduction techniques with the proximal Newton method, offering a powerful new approach to solving composite optimization problems. Our algorithm builds upon the ideas of stochastic average gradient methods but incorporates them into a proximal Newton framework.

The SNVR algorithm offers several key advantages: 1. It leverages the fast convergence properties of Newton-type methods. 2. It incorporates variance reduction techniques to mitigate the noise inherent in stochastic methods. 3. It maintains the ability to handle non-smooth regularizers through the use of proximal operators.

The remainder of this paper is organized as follows: Section 2 presents a literature review of recent advancements in optimization algorithms for machine learning. Section 3 describes the SNVR algorithm in detail. Section 4 presents the theoretical analysis and convergence properties of SNVR. Section 5 provides numerical results demonstrating the algorithm's performance on real-world datasets. Finally, Section 6 concludes the paper and discusses potential future research directions.

LITERATURE Research

Recent years have seen significant advancements in optimization algorithms for large-scale machine learning problems. This section provides an overview of key developments within the last five years, focusing on stochastic methods, variance reduction techniques, and quasiNewton approaches.

Stochastic Quasi-Newton Methods, Guo et al. (2023) [8] provided a comprehensive overview of stochastic quasi-Newton methods for large-scale machine learning. Their work highlighted the importance of balancing convergence speed, computational cost, and memory usage. The authors emphasized the need for further research into developing more efficient and scalable stochastic quasi-Newton methods, particularly for high-dimensional problems. Convergence Analysis for Non - strongly Convex Functions, Zhang et al. (2020) [9] made significant progress in understanding the convergence properties of Stochastic Gradient Descent (SGD) for non-strongly convex smooth optimization problems. Their novel analysis proved that SGD can achieve linear convergence under specific conditions, establishing a connection between the smoothness of the objective function and the convergence rate. This work provided valuable insights into the behavior of SGD in a broader class of optimization problems.

Variance Reduction Techniques Variance reduction has emerged as a crucial approach for improving the efficiency of stochastic optimization methods. Sinha et al. (2021) [10] conducted a comprehensive review of various variance reduction techniques used in deep learning. Their work discussed the strengths and weaknesses of each method, including their applicability, computational complexity, and impact on convergence. This review serves as a valuable guide for researchers and practitioners in selecting appropriate variance reduction techniques for specific deep learning tasks. Asynchronous Parallel Methods, As the scale of machine learning problems continues to grow, asynchronous parallel optimization methods have gained attention. Qianqian et al. (2021) [11] proposed an asynchronous parallel stochastic quasi-Newton method that combines the benefits of quasiNewton updates with asynchronous parallel processing. Their approach leverages a novel communication mechanism to ensure consistency and stability in parameter updates across multiple processors, resulting in significant speedups in training times compared to traditional methods.

Block Coordinate Descent with Variance Reduction, Gower et al. (2018) [12] introduced a new variance reduction technique for Stochastic Block Coordinate Descent (SBCD) methods. Their approach significantly reduces the variance in gradient estimates, achieving convergence rates comparable to full gradient methods. Y. Chen et al. [13] addresses the challenge of optimizing non-convex functions, which frequently arise in machine learning tasks such as deep learning. This work offers substantial improvements in efficiency and scalability for large-scale optimization problems, particularly those with block structure.

These recent advancements in optimization algorithms for machine learning have paved the way for more efficient and scalable methods. However, there is still room for improvement, particularly in developing algorithms that can effectively handle non-smooth regularizers while maintaining fast convergence rates. Our proposed Variance Reduction Proximal Stochastic Newton Algorithm (SNVR) aims to address these challenges by combining the strengths of variance reduction techniques with the proximal Newton method.

Mathematical model
Problem Formulation

These papers consider the following composite optimization problem: mindw=F(w)+Ω(w)=i=1Nfi(w)+Ω(w)\mathop {{{\min }_d}}\limits_{w \in } = F(w) + \Omega (w) = \sum\limits_{i = 1}^N {{f_i}} (w) + \Omega (w) f(x) is a smooth convex function composed of individual smooth convex functions fi (w)(i = 1,2,…, N)

Ω(w) is a convex, potentially non-smooth regularization function.

w ∈ ℝd is the parameter vector to be optimized. This formulation encompasses a wide range of machine learning problems, including regularized least squares, logistic regression, and support vector machines.

Assumptions

The component functions fi(.) are strongly convex, and their gradient functions satisfy the L-Lipschitz condition.

The Hessian matrix 2 fi(w) is bounded for any non-empty subset S. μvwfi(v)fi(w)fi(w)T(vw)vw\matrix{ {\mu v - w} \hfill \cr { \le {f_i}(v) - {f_i}(w) - \nabla {f_i}{{(w)}^T}(v - w)} \hfill \cr { \le v - w} \hfill \cr }

Here μ>0 and L > 0 are constants.

The Hessian matrix 2 fi(w) is bounded for any non-empty subset S. Specifically, there exist constants λ1 and λ2 such that, λ1Id2fi(w)λ2Id{\lambda _1}{I_d} \le {\nabla ^2}{f_i}(w) \le {\lambda _2}{I_d}

Key Lemmas
Lemma 3.1

Let w* be the optimal solution of the problem (1). Then. E F(wk)F(w) 24L[ ϕ(wk)ϕ(w*)+ϕ(wk+1)+ϕ(w*) ]{\left\| {\nabla F\left( {{w_k}} \right) - \nabla F(w)} \right\|^2} \le 4L\left[ {\phi \left( {{w_k}} \right) - \phi \left( {{w_*}} \right) + \phi \left( {{w_{k + 1}}} \right) + \phi \left( {{w_*}} \right)} \right]

Lemma 3.2

Let ϕ(w) = F(w)+Ω(w), and assume that F(w) is L-Lipschitz continuous. Let wk+1=proxαH(wkαH1gk){w_{k + 1}} = {\mathop{\rm prox}\nolimits} _\alpha ^H\left( {{w_k} - \alpha {H^{ - 1}}{g_k}} \right), where gk = ∇F(w), α is the step size, and 0 < α ≤1/L. Then, ϕ(wk) ϕ(wk+1)+gkTH(wkwk+1)+Δ(wk+1,wk)+12 gk H12\phi \left( {{w_k}} \right) \ge \phi \left( {{w_{k + 1}}} \right) + g_k^TH\left( {{w_k} - {w_{k + 1}}} \right) + \Delta \left( {{w_{k + 1}},{w_k}} \right) + {1 \over 2}\left\| {{g_k}} \right\|_{{H^{ - 1}}}^2

Where, Δ(wk+1,wk)= Ω(wk+1)Ω(wk)Ω(wk)T(wk+1wk)\Delta \left( {{w_{k + 1}},{w_k}} \right) = \Omega \left( {{w_{k + 1}}} \right) - \Omega \left( {{w_k}} \right) - \nabla \Omega {\left( {{w_k}} \right)^T}\left( {{w_{k + 1}} - {w_k}} \right)

Main Convergence Theorem:

Let w* = arg minwϕ(w), 0 < α ≤ 16λ1/λ22, and assume that the assumptions in Section 3 hold. Then, E[ ϕ(wk+1)ϕ(w*) ]ρ*[ ϕ(wk)ϕ(w*) ]\left[ {\phi \left( {{w_{k + 1}}} \right) - \phi \left( {{w_*}} \right)} \right] \le {\rho ^*}\left[ {\phi \left( {{w_k}} \right) - \phi \left( {{w_*}} \right)} \right]

Where ρ*=(1+7Lαλ2λ1)<1{\rho ^*} = \left( {1 + {{7L\alpha {\lambda _2}} \over {{\lambda _1}}}} \right) < 1.

Let: { w=wk,w+=wk+1,v=vk,g=gk,u=w*H=Hk1,Δ=Δ(wk+1,wk)wk+1=proxαH( wkαHk1F(w) \left\{ {\matrix{ {w = {w_k},{w_ + } = {w_{k + 1}},v = {v_k},g = {g_k},u = {w_*}} \hfill \cr {H = H_k^{ - 1},\Delta = \Delta \left( {{w_{k + 1}},{w_k}} \right)} \hfill \cr {{{\mathord{\buildrel{\lower3pt\hbox{$\scriptscriptstyle\smile$}}\over w} }_{k + 1}} = {\mathop{\rm prox}\nolimits} _\alpha ^H\left( {{w_k} - \alpha H_k^{ - 1}\nabla F(w)} \right.} \hfill \cr } } \right.

Then by Lemma 3.1 we can obtain: E wk+1w* Hk2E wkw* Hk2+2αE[ ϕ(wk+1)ϕ(w*) ]+Lα2λ22[ ϕ(wk)ϕ(w*)+ϕ(wk+1)ϕ(w*) ]\matrix{ {\left\| {{{\mathord{\buildrel{\lower3pt\hbox{$\scriptscriptstyle\smile$}}\over w} }_{k + 1}} - {w_*}} \right\|_{{H_k}}^2} \hfill \cr { \le \left\| {{w_k} - {w_*}} \right\|_{{H_k}}^2 + 2\alpha \left[ {\phi \left( {{w_{k + 1}}} \right) - \phi \left( {{w_*}} \right)} \right]} \hfill \cr { + L{\alpha ^2}\lambda _2^2\left[ {\phi \left( {{w_k}} \right) - \phi \left( {{w_*}} \right) + \phi \left( {{w_{k + 1}}} \right) - \phi \left( {{w_*}} \right)} \right]} \hfill \cr }

EΔ(wk+1,wk)=0\Delta \left( {{w_{k + 1}},{w_k}} \right) = 0, we have: E wk+1w* Hk2E wkw* Hk2+2αE[ ϕ(wk+1)ϕ(w*) ]+8Lα2λ22[ ϕ(wk)ϕ(wz)+ϕ(wk+1)ϕ(w*) ]\matrix{ {\left\| {{w_{k + 1}} - {w_*}} \right\|_{{H_k}}^2} \hfill \cr { \le \left\| {{w_k} - {w_*}} \right\|_{{H_k}}^2 + 2\alpha \left[ {\phi \left( {{w_{k + 1}}} \right) - \phi \left( {{w_*}} \right)} \right]} \hfill \cr { + 8L{\alpha ^2}\lambda _2^2\left[ {\phi \left( {{w_k}} \right) - \phi \left( {{w_z}} \right) + \phi \left( {{w_{k + 1}}} \right) - \phi \left( {{w_*}} \right)} \right]} \hfill \cr }

By strong convexity, we know: E wk+1w* Hk2E wkw* Hk2+2αE[ ϕ(wk+1)ϕ(w*) ]+16Lα2λ22[ ϕ(wk)ϕ(w*) ]\matrix{ {\left\| {{{\mathord{\buildrel{\lower3pt\hbox{$\scriptscriptstyle\smile$}}\over w} }_{k + 1}} - {w_*}} \right\|_{{H_k}}^2} \hfill \cr { \le \left\| {{w_k} - {w_*}} \right\|_{{H_k}}^2 + 2\alpha \left[ {\phi \left( {{w_{k + 1}}} \right) - \phi \left( {{w_*}} \right)} \right]} \hfill \cr { + 16L{\alpha ^2}\lambda _2^2\left[ {\phi \left( {{w_k}} \right) - \phi \left( {{w_*}} \right)} \right]} \hfill \cr }

Therefore, these works have: E[ ϕ(wk+1)ϕ(w*) ](1+7Lαλ2λ1)[ ϕ(wk)ϕ(w*) ]\matrix{ {\left[ {\phi \left( {{w_{k + 1}}} \right) - \phi \left( {{w_*}} \right)} \right]} \hfill \cr { \le \left( {1 + {{7L\alpha {\lambda _2}} \over {{\lambda _1}}}} \right)\left[ {\phi \left( {{w_k}} \right) - \phi \left( {{w_*}} \right)} \right]} \hfill \cr }

Since 0<α16λ1λ220 < \alpha \le {{16{\lambda _1}} \over {\lambda _2^2}}, This work have ρ*=(1+7Lαλ2λ1)<1{\rho ^*} = \left( {1 + {{7L\alpha {\lambda _2}} \over {{\lambda _1}}}} \right) < 1 which implies that the SNVR algorithm converges linearly.

Algorithm implementation

The Variance Reduction Proximal Stochastic Newton Algorithm (SNVR) represents a sophisticated approach to solving large-scale machine learning optimization problems. At its core, SNVR combines the strengths of variance reduction techniques with the power of Newton's method, all while maintaining the ability to handle non-smooth regularizers through proximal operations.

The algorithm begins with an initialization phase, where an initial point is chosen, and key parameters such as batch size, convergence threshold, and step size are set. A crucial step in this phase is the computation and storage of the full gradient and the inverse of the Hessian matrix at the initial point. This forms the foundation for the subsequent iterative process.

The heart of SNVR lies in its main loop, where the algorithm iteratively refines the solution until convergence. In each iteration, a subset of the data is randomly selected, enabling the algorithm to work efficiently with large datasets. This stochastic approach is key to the algorithm's scalability.

A distinguishing feature of SNVR is its use of a variance-reduced gradient estimate. By maintaining a memory of previously computed gradients and updating only a subset in each iteration, SNVR achieves lower variance in its gradient estimates compared to standard stochastic gradient methods. This variance reduction technique is crucial for the algorithm's stability and fast convergence.

Stochastic Newton Variance

Reduced (SNVR) Algorithm

Require: Initial point W0, batch size b, tolerance , learning rate α, maximum iterations M

Ensure: Optimal solution W*

1: Initialize k = 0

2: Calculate and store the gradients ∇f1(w0), ∇f2(w0),…,∇fN(w0)

3: Compute the Hessian matrix H at w0 and its inverse H−1

4: Calculate w1=proxαH(w0αH1f(w0)){w_1} = {\mathop{\rm prox}\nolimits} _\alpha ^H\left( {{w_0} - \alpha {H^{ - 1}}\nabla f\left( {{w_0}} \right)} \right)

5: Set k = 1

6: while |ϕ(Wk+1) - ϕ(Wk)|>∈| do

7: Randomly select a subset Sk of size b from the set {1, 2, …, N }

8: Compute the gradients ∇fi (Wk,) for iSk

9: Calculate vk= ∇fs1(Wk) – ∇(fs1)(Wk–1) + ∇f(Wk–1)

10: Compute the Hessian matrix H at Wk and its inverse H−1

11: Calculate wk+1=proxαH(wkαk1H1vk1){w_{k + 1}} = {\mathop{\rm prox}\nolimits} _\alpha ^H\left( {{w_k} - \alpha _k^{ - 1}{H^{ - 1}}{v_{k - 1}}} \right)

12: Update the gradients ∇fi(Wk+1 for the updated subset Sk

13: Set k = k + 1

14: end while

15: return Wk+1

The algorithm then computes the Hessian matrix at the current point, incorporating second-order information into the optimization process. This Newton-type update allows SNVR to make more informed steps towards the optimal solution, particularly beneficial in regions where the objective function has high curvature.

The parameter update step employs a proximal operator, which is essential for handling the nonsmooth regularizer term in the objective function. This operator allows SNVR to effectively navigate the optimization landscape even in the presence of non-differentiable components.

After each update, the algorithm checks for convergence based on the change in the parameter values. This process continues until the algorithm converges or reaches a maximum number of iterations.

The SNVR algorithm's unique combination of variance reduction, Newton-type updates, and proximal operations positions it as a powerful tool for tackling complex optimization problems in machine learning.

Experimental process and results

To validate the theoretical properties and assess the practical performance of the SNVR algorithm, we conducted a comprehensive set of numerical experiments. Our study focused on a regularized least squares problem, a fundamental task in machine learning with wide-ranging applications.

The experiment utilized the "Heart" dataset, a real-world dataset consisting of 600 samples, each with 13 features. This dataset, while modest in size, presents a challenging optimization problem due to its high-dimensional feature space and the potential for complex relationships between features.

In our experimental setup, we carefully tuned the SNVR algorithm's parameters to balance performance and computational efficiency. The regularization parameter was set to a small value (10−5) to prevent overfitting while still allowing the model to capture the underlying patterns in the data. We limited the maximum number of iterations to 20, which proved more than sufficient for SNVR to converge to an optimal solution.

To provide a comprehensive evaluation, we compared SNVR against three state-of-the-art optimization algorithms: Proximal Gradient Descent (ProxGD), Proximal Stochastic Gradient Descent (proxSGD), and Proximal Stochastic Variance Reduced Gradient (ProxSVRG). This selection of algorithms represents a spectrum of approaches, from deterministic to stochastic, and from first-order to variance-reduced methods.

The results of our experiments were striking. SNVR demonstrated remarkable convergence behavior, with the objective function value decreasing rapidly in the initial iterations and then fine-tuning to reach the optimal value. The algorithm achieved convergence in just 5 iterations, significantly outpacing its competitors.

A detailed analysis of the convergence trajectory revealed that SNVR achieved a substantial 10.5% reduction in the objective function value within the first two iterations. This rapid initial progress highlights the algorithm's ability to quickly identify promising directions in the parameter space. The subsequent iterations saw a more gradual improvement, with the algorithm refining its solution to achieve an additional 0.2% reduction, ultimately reaching the optimal value of 0.1919.

When compared to other methods, SNVR's superior performance became even more evident. ProxSVRG, the next best performer, required 14 iterations to reach the same optimal value, taking 180% more iterations than SNVR. The first-order methods, ProxSGD and ProxGD, both exhausted the maximum allowed iterations (20) without achieving the optimal solution quality of SNVR.

The comparative analysis also revealed interesting insights into the trade-offs between convergence speed and solution quality. While ProxSVRG matched SNVR in terms of the final objective function value, it required significantly more iterations to do so. On the other hand, proxSGD and ProxGD demonstrated a clear tradeoff between speed and accuracy, with ProxGD, in particular, showing suboptimal performance in both aspects.

These results underline the effectiveness of SNVR in balancing rapid convergence with highquality solutions. The algorithm's ability to achieve optimal results in fewer iterations not only demonstrates its theoretical strengths but also highlights its practical value for large-scale machine learning tasks where computational efficiency is crucial.

The experimental outcomes provide strong empirical support for the theoretical properties of SNVR and suggest its potential as a powerful optimization tool for a wide range of machine learning applications. The algorithm's performance on the "Heart" dataset indicates that it could be particularly beneficial in scenarios requiring rapid, high-quality optimization, such as real-time machine learning systems or large-scale data analysis in resource-constrained environments.

Analysis of Results

The detailed experimental process and intermediate results provide several insights:

Convergence Efficiency: As illustrated in Figure 1. SNVR consistently outperforms other methods in terms of convergence speed, reaching near-optimal values in fewer iterations. This is particularly evident in the objective function value chart, where SNVR's curve shows the steepest decline.

Optimization-Generalization Trade-off: Figure 2. the training loss vs. test accuracy graph for SNVR demonstrates its ability to effectively balance between fitting the training data and generalizing to unseen data. This suggests that SNVR is less prone to overfitting compared to methods that might continue to decrease training loss without improving test accuracy.

Stability: The consistency of SNVR's performance across multiple runs (as indicated by the small variance in results) suggests that it is more robust to initial conditions and stochastic fluctuations compared to other methods.

Computational Considerations (Figure 3.): While SNVR has a higher per-iteration computational cost due to its use of second-order information, its rapid convergence often results in lower overall computational time to reach the optimal solution. This trade-off is particularly beneficial for problems where the cost of data access or function evaluations is high relative to the cost of algorithm computations.

These results highlight SNVR's potential as a powerful optimization tool for machine learning tasks, particularly in scenarios where rapid, high-quality convergence is crucial. The algorithm's ability to efficiently navigate the optimization landscape, as evidenced by the intermediate results, makes it well-suited for a wide range of applications, from real-time learning systems to large-scale data analysis in resource-constrained environments.

Figure 1.

Convergence analysis for various algorithms

Figure 2.

Training Loss Vs Test accuracy chart for SVNR algorithm.

Figure 3.

Computational time comparison for different algorithms

Conclusions

The Variance Reduction Proximal Stochastic Newton Algorithm (SNVR) is a novel optimization method designed for large-scale machine learning applications. By effectively combining variance reduction techniques with the proximal Newton method, SNVR minimizes composite functions consisting of a smooth convex component and a non-smooth convex regularizer. SNVR achieves linear convergence rates, surpassing existing optimization approaches. This is particularly beneficial for high-dimensional problems and large datasets. Numerical experiments on the "heart" dataset consistently demonstrate SNVR’s superiority to state-of-the-art methods like ProxGD, proxSGD, and ProxSVRG in terms of convergence speed and solution quality. SNVR offers 180-300% faster convergence over existing methods. SNVR’s ability to handle nonsmooth regularizers while maintaining computational efficiency makes it a versatile tool for various machine learning tasks, ranging from regression to complex classification e.g., real-time machine learning systems, large-scale data analysis in resource-constrained environments.

Future research directions include exploring SNVR’s applications in other domains, evaluating its performance on larger scale problems and diverse datasets, and investigating potential modifications to further enhance its efficiency or adaptability. In sum up, the Variance Reduction Proximal Stochastic Newton Algorithm is a valuable addition to the optimization toolkit for large-scale machine learning problems, offering significant theoretical guarantees and practical benefits.

Language:
English
Publication timeframe:
4 times per year
Journal Subjects:
Computer Sciences, Computer Sciences, other