1 Introduction
Efficient methods for solving the optimization problem
$$ \begin{align*}\min_{x\in\mathbb{R}^n} F(x), \qquad (1.1)\end{align*} $$
where $F:\mathbb {R}^n \rightarrow \mathbb {R}$ is a nonsmooth and nonconvex function, play a vital role in various domains. Numerous studies have been conducted to develop methods for solving this optimization problem under reasonable assumptions [Reference Burke, Lewis and Overton3]. The focus of these studies mostly lies in developing strategies to find an optimal solution, denoted by $x^*$, which satisfies the first-order optimality condition
$$ \begin{align*}0\in \partial F(x^*),\end{align*} $$
where $\partial F(x)$ is the Clarke subdifferential of $F$ at $x$ [Reference Bolte, Daniilidis, Lewis and Shiota1]. Note that solving $0\in \partial F(x^*)$ amounts to finding a zero of the set-valued operator $\partial F$. In addition to the methods based on subdifferentials, penalty function methods offer an alternative approach for solving optimization problems. These techniques introduce a penalty term into the original objective function and then employ an iterative method where, at each step, a subproblem is solved within a predefined region [Reference Chen and Zhou6, Reference Mayeli14, Reference Selesnick16].
Motivated by [Reference Mayeli14], the main objective of this paper is to propose a novel approach for solving a class of nonsmooth and nonconvex optimization problems, using the Bregman distance as a key component in constructing a regularized method based on the majorization–minimization (MM) approach. Specifically, we focus on solving problem (1.1) with
in which $y \in \mathbb R^m$ is given, $A \in \mathbb {R}^{m\times n}$, $\lambda>0$, and $f_{\alpha }$ is the Moreau envelope given by
The author in [Reference Mayeli14] proposes a regularization of the form
to ensure that the subproblem generated at each step of her proposed method has a unique solution. The author then uses this regularization to define a majorizer function
On this basis, the author presents an MM method that begins with an initial point $x^0 \in \mathbb {R}^n$. Subsequent iterates are defined by
where $B_{\epsilon /2^k}(x^{k})$ represents a ball centered at the point $x^{k}$ , with a radius equal to $\epsilon /2^k$ . Note that (1.4) can be obtained when $f^m_{\alpha }(x,w)$ replaces $f_{\alpha }$ in (1.2).
The main idea behind the MM approach is to transform a challenging optimization problem into a sequence of simpler and well-behaved subproblems. To solve the subproblems (1.5), various methods have been proposed in the literature, including those described in [Reference Dang and Lan7, Reference Ruszczyński15]. These subproblems are indeed well-behaved and easier to solve because their objective functions are strictly convex and their feasible sets are bounded and convex. By solving the generated subproblems iteratively, a sequence of solutions is obtained that progressively approximates a solution to the original problem. Under certain assumptions, this sequence converges to a point that satisfies the optimality conditions of the original problem.
It is worth noting that the MM method has numerous applications across various fields. Among them, we can mention signal and image processing, support vector machines, nonnegative matrix factorization, and DNA sequence analysis. For a more complete list of applications, the readers can consult [Reference Hunter and Lange9, Reference Sun, Babu and Palomar17].
In this paper, we generalize the MM method presented in [Reference Mayeli14] by substituting a general Bregman distance, to be defined in Section 2, for the quadratic regularization term given by (1.3).
The rest of this paper is organized as follows. In Section 2, we present some basic facts that will be used in this paper as well as our generalized MM method. In Section 3, we establish our convergence analysis. We present some numerical results in Section 4.
2 Preliminaries and basic facts
In this section, we describe Bregman distances and their properties, as well as the generalized MM method.
2.1 Bregman functions and distances
Definition 2.1 Let S be an open and convex subset of $\mathbb {R}^n$ , and let $\bar {S} $ be its closure. Consider a convex real-valued function $\phi $ defined on $\bar {S}$ , and let $D:\bar {S}\times S \to \mathbb {R}$ be defined as
$$ \begin{align*}D(x,w)=\phi(x)-\phi(w)-\nabla\phi(w)^T(x-w).\end{align*} $$
We say that $\phi $ is a Bregman function and D is its distance induced by $\phi $ (see [Reference Bregman2]) if the following conditions hold.
H1: $\phi $ is continuously differentiable on S.
H2: $\phi $ is strictly convex and continuous on $\bar {S}$ .
H3: For every $\theta \in \mathbb {R}$ , the partial level sets $\Gamma _1(w,\theta )=\{x\in \bar {S}: D(x,w)\leq \theta \}$ and $\Gamma _2(x,\theta )=\{w\in S : D(x,w)\leq \theta \}$ are bounded for all $w\in S$ and $x\in \bar {S}$ , respectively.
H4: If $\{w^k\}_{k=0}^{\infty }\subset S$ converges to $w^*$ , then $D(w^*,w^k)$ converges to 0.
H5: If $\{x^k\}_{k=0}^{\infty }\subset \bar {S}$ and $\{w^k\}_{k=0}^{\infty }\subset S$ are sequences such that $\{x^k\}_{k=0}^{\infty }$ is bounded, $\lim \limits _{k\to \infty } w^k=w^*$ and $D(x^k,w^k)=0$ , then $\lim \limits _{k\to \infty } x^k=w^*$ .
Note that when $\{x^k\}_{k=0}^{\infty }$ and $w^*$ lie in S, H4 and H5 hold automatically as consequences of H1–H3. Moreover, because $\phi $ is strictly convex, we have $D(x,w)\geq 0$ for all $x\in \bar {S}$ and $w\in S$ , with $D(x,w)=0$ if and only if $x=w$ .
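To make the definition concrete, here is a minimal Python sketch (our own illustration, not part of the original text; the names `bregman_distance`, `phi`, and `grad_phi` are ours) of the Bregman distance $D(x,w)=\phi (x)-\phi (w)-\nabla \phi (w)^T(x-w)$ for a differentiable convex $\phi $:

```python
import numpy as np

def bregman_distance(phi, grad_phi, x, w):
    """D(x, w) = phi(x) - phi(w) - <grad phi(w), x - w>."""
    return phi(x) - phi(w) - np.dot(grad_phi(w), x - w)

# phi(x) = 0.5*||x||^2 gives D(x, w) = 0.5*||x - w||^2.
phi = lambda v: 0.5 * np.dot(v, v)
grad_phi = lambda v: v

x = np.array([1.0, 2.0])
w = np.array([0.0, 0.0])
d_xw = bregman_distance(phi, grad_phi, x, w)   # 0.5*||x - w||^2 = 2.5
```

For this choice of $\phi $, the nonnegativity of D and the property $D(x,x)=0$ noted above are immediate.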
In addition to the Bregman distance, there exists another class of distances defined on the positive orthant of $\mathbb {R}^n$ ,
$$ \begin{align*}\mathbb{R}^n_{++}=\{x\in\mathbb{R}^n : x_i>0,\ i=1,\ldots,n\},\end{align*} $$
which is formally stated next [Reference Iusem, Svaiter and Teboulle11].
Definition Let $\phi :\mathbb {R}_{++}\rightarrow \mathbb {R}$ be a strictly convex function. A $\phi $ -divergence is a function $d_{\phi }: \mathbb {R}_{++}^n\times \mathbb {R}_{++}^n\rightarrow \mathbb {R}$ defined at $\left (x,y\right )$ as
$$ \begin{align*}d_{\phi}(x,y)=\sum\limits_{i=1}^n y_i \phi\left(\dfrac{x_i}{y_i}\right).\end{align*} $$
Some important examples known in the literature are given below. We encourage interested readers to consult [Reference Butnariu and Iusem4, Reference Censor and Zenios5, Reference Iusem10, Reference Iusem, Svaiter and Teboulle11] for further elaboration on Bregman distances and $\phi $ -divergences, as well as more examples.
• Let $S =\mathbb {R}^n$ and $\phi (x) = x^T M x$ , where M is an $n\times n$ symmetric and positive definite matrix. In this case, define $D(x,w) = \left (x - w\right )^T M\left (x - w\right ) = \|x - w\|^2_M$ . In particular, if M is the identity matrix, then $D(x,w) = \|x - w\|^2$ reduces to the Euclidean distance squared.
• Let $\phi (t)= t \log (t) -t + 1$ with the convention $0\log (0)=0$ . In this case, the Kullback–Leibler divergence is defined as
$$ \begin{align*}d_{\phi}(x,y)=\sum\limits_{i=1}^n y_i \phi\left(\dfrac{x_i}{y_i}\right)=\sum\limits_{i=1}^n \left[x_i \log\left(\dfrac{x_i}{y_i}\right) + y_i - x_i\right].\end{align*} $$
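As a quick numerical sanity check (our own illustration, not from the paper), the $\phi $ -divergence with this $\phi $ can be evaluated and compared against the expanded Kullback–Leibler form above:

```python
import numpy as np

def phi(t):
    """phi(t) = t*log(t) - t + 1, defined for t > 0."""
    return t * np.log(t) - t + 1.0

def kl_divergence(x, y):
    """phi-divergence d_phi(x, y) = sum_i y_i * phi(x_i / y_i)."""
    return np.sum(y * phi(x / y))

x = np.array([0.2, 0.3, 0.5])
y = np.array([0.25, 0.25, 0.5])
# The expanded form sum_i [x_i*log(x_i/y_i) + y_i - x_i] must agree.
expanded = np.sum(x * np.log(x / y) + y - x)
```

Both expressions coincide, and the divergence is nonnegative with $d_{\phi }(y,y)=0$.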
It is worth emphasizing that the Bregman distance may not be a distance in the usual sense of the term: in general, it is not symmetric and fails to satisfy the triangle inequality, as discussed in [Reference Butnariu and Iusem4].
Definition Let $w\in {\mathbb {R}}^n$ be given. The function $g(\cdot ,w):\mathbb {R}^n \to \mathbb {R}$ is called a local majorizer of a function $h:\mathbb {R}^n\to \mathbb {R}$ at w if
$$ \begin{align*}g(x,w)\geq h(x) \quad\text{for all } x \text{ in a neighborhood of } w\end{align*} $$
and
$$ \begin{align*}g(w,w)=h(w).\end{align*} $$
We also say that $g(\cdot ,w):\mathbb {R}^n \to \mathbb {R}$ minorizes h at w when $-g(\cdot ,w):\mathbb {R}^n \to \mathbb {R}$ majorizes $-h$ at w.
Geometrically, our objective is to ensure that the functions h and g are tangent to each other at the point w. In addition, both h and g should have directional derivatives at w. Moreover, for any direction d with $\|d\|$ small, the directional derivative of $g(\cdot ,w)$ at w in the direction d should equal that of h at w in the direction d; that is, $\nabla g(w;d, w)= \nabla h(w;d)$ for any small $\|d\|$ .
In the classical MM method, the majorizer $g^{k-1}$ is defined as $g^{k-1} = g(\cdot ,x^{k-1})$ , where g is a function that is tangent to the objective function at $x^{k-1}$ . This majorizer is then minimized over a convex set $\Omega $ to obtain the next iterate $x^k$ [Reference Hunter and Lange9]. That is, $x^k = \underset {x \in \Omega }{\mathrm {argmin}}\; g(x, x^{k-1})$ , provided that $x^k$ exists. Then $g^k$ is defined as $g^k = g(\cdot ,x^{k})$ . When the sequence of minimizers $\{x^k\}_{k=0}^{\infty }$ exists, we have the descending property
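The classical MM loop described above can be illustrated on a tiny one-dimensional problem (our own example, not from the paper): minimizing $h(x)=\tfrac {1}{2}(x-b)^2+\lambda |x|$ using the standard quadratic majorizer $|x|\leq x^2/(2|w|)+|w|/2$ , which is tangent to $|x|$ at any $w\neq 0$ . The descent property is visible in the recorded objective values.

```python
def mm_l1(b, lam, x0, iters=200):
    """Classical MM for h(x) = 0.5*(x - b)**2 + lam*|x|.

    Majorizer (tangent to h at w != 0):
        g(x, w) = 0.5*(x - b)**2 + lam*(x**2/(2*abs(w)) + abs(w)/2).
    Minimizing g(., w) in closed form gives x = b*|w| / (|w| + lam).
    """
    h = lambda x: 0.5 * (x - b) ** 2 + lam * abs(x)
    x, values = x0, [h(x0)]
    for _ in range(iters):
        x = b * abs(x) / (abs(x) + lam)   # argmin of the majorizer at w = x
        values.append(h(x))
    return x, values

x_star, values = mm_l1(b=2.0, lam=0.5, x0=5.0)
# values is nonincreasing (the descent property), and x_star approaches
# the soft-threshold solution b - lam = 1.5 for this b > lam > 0.
```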
2.2 The generalized MM method
In our generalized majorization–minimization (GMM) method, which incorporates the Bregman distance, we introduce, respectively, the Moreau envelope, the regularization, and the majorizer function as
It is worth noting that, due to the nonnegativity of D stated in Definition 2.1, the function $F^M$ acts as a local majorizer of F. With this in mind, we can now formally present our method as outlined below.
GMM Method:
Initialize $x^0 \in \mathbb {R}^n$ and set $\gamma _m>\alpha $ .
For $k=0,1,2,\ldots ,$ compute
$x^{k+1}=\underset {x\in B_{\epsilon /2^k}(x^k)}{\mathrm {argmin}}\; F^M(x,x^k)$ .
end
We emphasize that in our method, we impose the condition $\gamma _m> \alpha $ to ensure that
$$ \begin{align*}\frac{1}{2}A^TA+\lambda(\gamma_{m}-\alpha)\nabla^2 \phi(x^k)\succ 0\end{align*} $$
holds, meaning that $\frac {1}{2}A^TA+\lambda (\gamma _{m}-\alpha )\nabla ^2 \phi (x^k)$ is a positive definite matrix. We will prove in Lemma 3 that this condition guarantees that $F^M(\cdot ,x^k)$ has a unique minimizer within $B_{\epsilon /2^k}(x^k)$ .
3 Convergence analysis of GMM method
This section commences by establishing the foundation for our convergence analysis.
Lemma Let $x \in \mathbb {R}^n$ , $A \in \mathbb {R}^{m\times n}$ and $\lambda> 0$ . The function $F: \mathbb {R}^n \to \mathbb {R}$ defined in (1.2) with $f_{\alpha }$ given by (2.1) is convex if the condition
is satisfied, where $\nabla ^2\phi $ is the Hessian of $\phi $ that defines the Bregman distance. Moreover, F is strictly convex if
Proof We can write
where $g(x,z)$ is affine in x defined as
Therefore, F is convex if $\frac {1}{2}A^T A -\lambda {\alpha } \nabla ^2 \phi (x)$ is positive semidefinite for every x, and F is strictly convex if this matrix is positive definite.
Lemma The function $f^m_{\alpha }$ minorizes $f_{\alpha }$ for every w. Moreover, if $\|d\|$ is small, then it holds that $\nabla f_{\alpha }(w,d) = \nabla f_{\alpha }^m(w,w,d).$
Proof The fact that $f^m_{\alpha }$ minorizes $f_{\alpha }$ can be inferred from (2.3). However, to establish the equality in terms of the directional derivative, a further proof is required. We can write
To finalize the proof, we only need to show that $\liminf _{\theta \rightarrow 0^+}\gamma _{m} D(w+\theta d, w)=0$ . This can be established based on Definition 2.1 and the fact that $\|d\|$ is small.
The following lemma provides a sufficient condition that ensures the convexity of $F^M(\cdot , w)$ for every w.
Lemma $F^M$ is a local majorizer of F at w. Moreover, $F^M(\cdot , w)$ is convex if
holds and that $F^M(\cdot , w)$ is strictly convex if
holds.
Proof Using (2.3), it is clear that the function $F^M(\cdot ,w)$ is a local majorizer for F. To prove that $F^M(\cdot ,w)$ is convex, we expand $F^M$ and obtain
where
Since g is affine in x, $F^M(\cdot ,w)$ is convex if $\frac {1}{2}A^T A +\lambda {(\gamma _m - \alpha )} \nabla ^2 \phi (x)$ is positive semidefinite. Moreover, $F^M(\cdot ,w)$ is strictly convex if $\frac {1}{2}A^T A +\lambda {(\gamma _m - \alpha )} \nabla ^2 \phi (x)$ is positive definite.
We remark that if $\gamma _m>\alpha $ and H2 hold, then $\frac {1}{2}A^TA + \lambda (\gamma _{m} - \alpha )\nabla ^2 \phi \succ 0$ holds, implying that $F^M(\cdot ,w)$ is strictly convex.
The next result states that the GMM method is well-defined and converges.
Theorem The sequence $\{x^k\}_{k=0}^{\infty }$ generated by the GMM method converges.
Proof Since
we must have
which implies that the sequence is bounded. Therefore, by the Bolzano–Weierstrass theorem, $\{x^k\}_{k=0}^{\infty }$ has an accumulation point $x^*$ . Consider a subsequence $\{x^{k_n}\}$ of $\{x^k\}_{k=0}^{\infty }$ such that $x^{k_n} \rightarrow x^*$ . Fix k and let $k_n> k$ . Since $x^{j+1}\in B_{\epsilon /2^j}(x^{j})$ for every j, we can write
$$ \begin{align*}\|x^{k_n}-x^{k}\| \leq \sum\limits_{j=k}^{k_n-1}\|x^{j+1}-x^{j}\| \leq \sum\limits_{j=k}^{\infty}\dfrac{\epsilon}{2^{j}} = \dfrac{\epsilon}{2^{k-1}}.\end{align*} $$
Therefore, $\{x^k\}_{k=0}^{\infty }$ converges to $x^*$ .
The following lemma presents a sufficient condition that ensures the strong convexity of the majorizer function $F^M$ .
Lemma $F^M$ is $a$ -strongly convex if it holds that
Proof As in Lemma 3, expand $F^M(x,w)$ . Then, one can write $F^M(x,w)$ as
where $g(x,z,w)$ is the affine function in x given by (3.1). Therefore, $F^M(\cdot ,w)$ is $a$ -strongly convex in x if (3.2) holds.
Next, we recall a technical lemma that plays a crucial role in the convergence result. This lemma is well-established and applicable to a broad range of functions, provided they possess a local minimizer.
Lemma If f is $a$ -strongly convex on a set C and $\bar {x}$ is a local minimizer of f on C, then
$$ \begin{align*}f(x)\geq f(\bar{x})+\frac{a}{2}\|x-\bar{x}\|^2 \quad \text{for all } x\in C.\end{align*} $$
Proof See Lemma B5 in [Reference Mairal13].
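The growth bound of this lemma, in its standard form $f(x)\geq f(\bar {x})+\tfrac {a}{2}\|x-\bar {x}\|^2$ , can be checked numerically on an $a$ -strongly convex quadratic; this is our own sanity check, not part of the original text:

```python
import numpy as np

rng = np.random.default_rng(1)
Q = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
a = np.linalg.eigvalsh(Q).min()          # strong convexity modulus of f
b = np.array([1.0, -1.0])

f = lambda x: 0.5 * x @ Q @ x - b @ x    # a-strongly convex quadratic
x_bar = np.linalg.solve(Q, b)            # its (unique) minimizer

# Gap f(x) - f(x_bar) - (a/2)*||x - x_bar||^2 should be nonnegative.
gaps = [f(x_bar + d) - f(x_bar) - 0.5 * a * np.dot(d, d)
        for d in rng.standard_normal((100, 2))]
```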
We now proceed to establish our main result, which guarantees the convergence of the GMM method.
Theorem Assume that (3.2) holds and the sequence $\{x^k\}_{k=0}^{\infty }$ converges to $\bar {x}$ . Then $\bar {x}$ is a stationary point for F; that is, $\nabla F(\bar {x},d) \geq 0$ for every small $\|d\|$ . In particular, if $\{x^k\}_{k=0}^{\infty }$ is generated by the GMM method and at each step the majorizer function $F^M(\cdot ,x^k)$ is $a$ -strongly convex, then $\{x^k\}_{k=0}^{\infty }$ converges to a stationary point of F.
Proof Fix x. Since f is continuous and $x^k\rightarrow \bar {x}$ , we can use (2.2) to conclude that
Therefore, it follows from (1.2) and (2.3) that
On the other hand, apply Lemma 3.6 and use the majorization property to obtain
Note that the first inequality is a direct consequence of the definition of the GMM method. Recall that $x^{k+1}$ is always a local minimizer of $F^M$ . By taking the limit on both sides of the aforementioned inequalities as $k \rightarrow \infty $ , we can express them as
or
For sufficiently small $\|d\|$ , we obtain
On the other hand, H3 implies that there exists $\beta>0$ such that
Therefore, inequalities (3.3) and (3.4) yield
As a result,
Dividing both sides of the above inequality by $\theta $ yields
Taking the limit as $\theta \rightarrow 0$ , we find that
In particular, if $\{x^k\}_{k=0}^{\infty }$ is generated by the GMM method, then $\bar {x}$ represents a stationary point of F. Using the descending property, the stationary point is a local minimum.
4 Application
In this section, we assess the effectiveness of our proposed method by applying the GMM method to a lasso regression model [Reference Hastie, Tibshirani and Wainwright8]. To evaluate the performance of our approach, we utilize the widely recognized Credit dataset, available in the ISLR package in R [Reference James, Witten, Hastie and Tibshirani12]. This dataset consists of 400 observations, where we consider $y=$ Balance (average credit card balance in dollars) as the dependent variable and Rating (credit rating), Income (income in $10,000s), and Limit (credit limit) as the independent variables.
To estimate the lasso coefficients of the linear model $y = Ax$ , where $x^T=\left [\begin {array}{cccc} x_1 & x_2 & x_3 & x_4 \end {array}\right ]$ , we construct a matrix A with dimensions $400 \times 4$ . The first column of A contains only $1$ ’s (the intercept), whereas the remaining three columns correspond to the observed variables Rating, Income, and Limit, respectively. The lasso coefficients, denoted by $\left (\hat x_L\right )^T=\left [\begin {array}{cccc} \hat {x}_1 & \hat {x}_2 & \hat {x}_3 & \hat {x}_4 \end {array}\right ]$ , are obtained by solving the minimization problem
To determine the regularization parameter $\lambda $ , we employ the glmnet package in R and utilize cross-validation.
To present the numerical results, we examine the function $F^M$ defined in (1.4). We evaluate $F^M$ using the MM method in [Reference Mayeli14], which utilizes the regularization form given in (1.3), and the GMM method, which employs the Bregman distance
where M is a $4 \times 4$ diagonal matrix with diagonal entries 10, 11, 12, and 13. In iteration k, the function $F^M(\cdot ,x^k)$ is minimized within the ball $B_{\epsilon /2^k}(x^k)$ . Both the MM method and the GMM method are initialized with the parameters $\gamma _m=2$ and $\alpha =0$ , an initial ball radius of $\epsilon =10^{100}$ , and an initial point $x^0$ obtained from the ordinary least-squares fit computed with the lm function in R. We use a stopping tolerance of $10^{-4}$ to determine convergence. The least-squares coefficients obtained from the lm function, the lasso coefficients obtained from the glmnet function, and the estimates produced by the MM method and the GMM method are presented in Table 1. Remarkably, our proposed method yields estimators that closely align with the usual regression coefficients obtained from R. Moreover, our numerical findings indicate that our proposed method achieves convergence significantly faster than the MM method.
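For readers who wish to experiment, a lasso problem of the form above can be solved with a standard proximal gradient (ISTA) iteration; this is our own illustration on synthetic data standing in for the Credit dataset, and it is not the MM or GMM method of the paper:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t*||.||_1."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def lasso_ista(A, y, lam, iters=5000):
    """Proximal gradient (ISTA) for 0.5*||y - A x||^2 + lam*||x||_1."""
    L = np.linalg.norm(A, 2) ** 2        # Lipschitz constant of the smooth part
    x = np.zeros(A.shape[1])
    for _ in range(iters):
        x = soft_threshold(x - A.T @ (A @ x - y) / L, lam / L)
    return x

# Synthetic stand-in for the 400 x 4 design: intercept plus 3 predictors.
rng = np.random.default_rng(42)
A = np.column_stack([np.ones(400), rng.standard_normal((400, 3))])
x_true = np.array([2.0, 1.5, 0.0, -1.0])
y = A @ x_true + 0.1 * rng.standard_normal(400)
x_lasso = lasso_ista(A, y, lam=5.0)
# The zero coefficient is essentially zeroed out and the others are
# mildly shrunk toward zero, as expected of the lasso.
```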
Acknowledgment
The work of Zeinab Mashreghi is funded by a grant from the Natural Sciences and Engineering Research Council of Canada (NSERC).