Frontiers in Applied Mathematics and Statistics (Front. Appl. Math. Stat.), ISSN 2297-4687, Frontiers Media S.A. doi: 10.3389/fams.2017.00009

Original Research

Semi-Stochastic Gradient Descent Methods

Jakub Konečný* and Peter Richtárik

School of Mathematics, University of Edinburgh, Edinburgh, United Kingdom
Edited by: Darinka Dentcheva, Stevens Institute of Technology, United States
Reviewed by: Yu Du, Rutgers University, United States; Vladimir Shikhman, Technische Universität Chemnitz, Germany; Uday V. Shanbhag, Pennsylvania State University, United States
*Correspondence: Jakub Konečný kubo.konecny@gmail.com
This article was submitted to Optimization, a section of the journal Frontiers in Applied Mathematics and Statistics
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
In this paper we study the problem of minimizing the average of a large number of smooth convex loss functions. We propose a new method, S2GD (Semi-Stochastic Gradient Descent), which runs for one or several epochs, in each of which a single full gradient and a random number of stochastic gradients are computed, following a geometric law. For strongly convex objectives, the method converges linearly. The total work needed for the method to output an ε-accurate solution in expectation, measured in the number of passes over data, is proportional to the condition number of the problem and inversely proportional to the number of functions forming the average. This is achieved by running the method with the number of stochastic gradient evaluations per epoch proportional to the conditioning of the problem. The SVRG method of Johnson and Zhang arises as a special case. To illustrate our theoretical results: S2GD needs only the workload equivalent to about 2.1 full gradient evaluations to find a 10^{−6}-accurate solution for a problem with 10^{9} functions and a condition number of 10^{3}.
Keywords: stochastic gradient, variance reduction, empirical risk minimization, linear convergence, convex optimization

Funding: Engineering and Physical Sciences Research Council (EP/G036136/1, EP/I017127/1, EP/K02325X/1), 10.13039/501100000266

1. Introduction
Many problems in data science (e.g., machine learning, optimization, and statistics) can be cast as loss minimization problems of the form
min_{x ∈ ℝ^d} f(x),    (1)
where
f(x) := (1/n) ∑_{i=1}^{n} f_i(x).
Here d typically denotes the number of features / coordinates, n the number of examples, and f_{i}(x) is the loss incurred on example i. That is, we are seeking to find a predictor x ∈ ℝ^{d} minimizing the average loss f(x). In big data applications, n is typically very large; in particular, n ≫ d.
Note that this formulation includes the more typical formulation of L2-regularized objectives, f(x) = (1/n) ∑_{i=1}^{n} f̃_i(x) + (λ/2)‖x‖². We hide the regularizer inside the functions f_i(x) for the sake of simplicity of the resulting analysis.
1.1. Motivation
Let us now briefly review two basic approaches to solving problem (1).
Gradient Descent. Given x_k ∈ ℝ^d, the gradient descent (GD) method sets
x_{k+1} = x_k − h f′(x_k),
where h is a stepsize parameter and f′(x_k) is the gradient of f at x_k. We will refer to f′(x) as the full gradient. In order to compute f′(x_k), we need to compute the gradients of n functions. Since n is big, it is prohibitive to do this at every iteration.
Stochastic Gradient Descent (SGD). Unlike gradient descent, stochastic gradient descent [1, 2] instead picks a random i (uniformly) and updates
x_{k+1} = x_k − h f_i′(x_k).
Note that this strategy drastically reduces the amount of work that needs to be done in each iteration (by the factor of n). Since
E(f_i′(x_k)) = f′(x_k),
we have an unbiased estimator of the full gradient. Hence, the gradients of the component functions f_{1}, …, f_{n} will be referred to as stochastic gradients. A practical issue with SGD is that consecutive stochastic gradients may vary a lot or even point in opposite directions. This slows down the performance of SGD. On balance, however, SGD is preferable to GD in applications where low accuracy solutions are sufficient. In such cases usually only a small number of passes through the data (i.e., work equivalent to a small number of full gradient evaluations) are needed to find an acceptable x. For this reason, SGD is extremely popular in fields such as machine learning.
In order to improve upon GD, one needs to reduce the cost of computing a gradient. In order to improve upon SGD, one has to reduce the variance of the stochastic gradients. In this paper we propose and analyze a Semi-Stochastic Gradient Descent (S2GD) method. Our method combines GD and SGD steps and reaps the benefits of both algorithms: it inherits the stability and speed of GD and at the same time retains the work-efficiency of SGD.
1.2. Brief literature review
Several recent papers, e.g., Richtárik and Takáč [3], Roux et al. [4], Schmidt et al. [5], Shalev-Shwartz and Zhang [6], and Johnson and Zhang [7] proposed methods which achieve similar variance-reduction effect, directly or indirectly. These methods enjoy linear convergence rates when applied to minimizing smooth strongly convex loss functions.
The method in Richtárik and Takáč [3] is known as Random Coordinate Descent for Composite functions (RCDC), and can be either applied directly to Equation (1), or to a dual version of Equation (1). Unless specific conditions on the problem structure are met, application to the primal directly is not as computationally efficient as its dual version^{1}. Application of a coordinate descent method to the dual formulation of Equation (1) is generally referred to as Stochastic Dual Coordinate Ascent (SDCA) [9]. The algorithm in Shalev-Shwartz and Zhang [6] exhibits this duality, and the method in Takáč et al. [10] extends the primal-dual framework to the parallel/mini-batch setting. Parallel and distributed stochastic coordinate descent methods were studied in Richtárik and Takáč [11], Fercoq and Richtárik [12], and Richtárik and Takáč [13].
Stochastic Average Gradient (SAG) by Roux et al. [4], is one of the first SGD-type methods, other than coordinate descent methods, which were shown to exhibit linear convergence. The method of Johnson and Zhang [7], called Stochastic Variance Reduced Gradient (SVRG), arises as a special case in our setting for a suboptimal choice of a single parameter of our method. The Epoch Mixed Gradient Descent (EMGD) method, Zhang et al. [14], is similar in spirit to SVRG, but achieves a quadratic dependence on the condition number instead of a linear dependence, as is the case with SDCA, SAG, SVRG and with our method.
Earlier works of Friedlander and Schmidt [15], Deng and Ferris [16], and Bastin et al. [17] attempt to interpolate between GD and SGD and decrease variance by varying the sample size. These methods, however, do not achieve the kind of improvements obtained by the recent methods above. For partially related classical work on semi-stochastic approximation methods we refer^{2} the reader to the papers of Marti and Fuchs [18, 19], which focus on general stochastic optimization.
1.3. Outline
We start in Section 2 by describing two algorithms: S2GD, which we analyze, and S2GD+, which we do not analyze, but which exhibits superior performance in practice. We then move to summarizing some of the main contributions of this paper in Section 3. Section 4 is devoted to establishing expectation and high probability complexity results for S2GD in the case of a strongly convex loss. The results are generic in that the parameters of the method are set arbitrarily. Hence, in Section 5 we study the problem of choosing the parameters optimally, with the goal of minimizing the total workload (# of processed examples) sufficient to produce a result of specified accuracy. In Section 6 we establish high probability complexity bounds for S2GD applied to a non-strongly convex loss function. Discussion of efficient implementation for sparse data is in Section 7. Finally, in Section 8 we perform very encouraging numerical experiments on real and artificial problem instances. A brief conclusion can be found in Section 9.
2. Semi-stochastic gradient descent
In this section we describe two novel algorithms: S2GD and S2GD+. We analyze the former only. The latter, however, has superior convergence properties in our experiments.
The following two assumptions constitute a basic setting for smooth convex optimization, under which the analysis of such methods is typically presented first^{3}. We assume throughout the paper that the functions f_i are convex and L-smooth.
Assumption 1. The functions f_1, …, f_n have Lipschitz continuous gradients with constant L > 0 (in other words, they are L-smooth). That is, for all x, z ∈ ℝ^d and all i = 1, 2, …, n,
f_i(z) ≤ f_i(x) + ⟨f_i′(x), z − x⟩ + (L/2)‖z − x‖².
(This implies that the gradient of f is Lipschitz with constant L, and hence f satisfies the same inequality.)
In one part of the paper (Section 4) we also make the following additional assumption:
Assumption 2. The average loss f is μ-strongly convex, μ > 0. That is, for all x, z ∈ ℝ^{d},
f(z) ≥ f(x) + ⟨f′(x), z − x⟩ + (μ/2)‖z − x‖².
(Note that, necessarily, μ ≤ L.)
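Both assumptions are easy to verify numerically on a quadratic whose Hessian spectrum lies in [μ, L]; the instance below is an illustrative sketch, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative quadratic f(x) = 0.5*x^T H x with eigenvalues of H in [mu, L]
d, mu, L = 5, 0.5, 4.0
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
H = Q @ np.diag(np.linspace(mu, L, d)) @ Q.T

f = lambda x: 0.5 * x @ H @ x
f_prime = lambda x: H @ x

x, z = rng.standard_normal(d), rng.standard_normal(d)
# Bregman gap f(z) - f(x) - <f'(x), z - x>; for a quadratic it equals
# 0.5*(z - x)^T H (z - x), hence lies between the two bounds
gap = f(z) - f(x) - f_prime(x) @ (z - x)
sq = np.dot(z - x, z - x)
assert mu / 2 * sq <= gap <= L / 2 * sq   # Assumption 2 and Assumption 1
```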
2.1. S2GD
Algorithm 1 (S2GD) depends on three parameters: the stepsize h, a constant m limiting the number of stochastic gradients computed in a single epoch, and a parameter ν ∈ [0, μ], where μ is the strong convexity constant of f. In practice, ν would be a known lower bound on μ. Note that the algorithm also works without any knowledge of the strong convexity parameter—this is the case ν = 0.
Algorithm 1: Semi-Stochastic Gradient Descent (S2GD)

parameters: m = max # of stochastic steps per epoch, h = stepsize, ν = lower bound on μ
for j = 0, 1, 2, … do
    g_j ← (1/n) ∑_{i=1}^{n} f_i′(x_j)
    y_{j,0} ← x_j
    Let t_j ← t with probability (1 − νh)^{m−t}/β for t = 1, 2, …, m
    for t = 0 to t_j − 1 do
        Pick i ∈ {1, 2, …, n}, uniformly at random
        y_{j,t+1} ← y_{j,t} − h (g_j + f_i′(y_{j,t}) − f_i′(x_j))
    end for
    x_{j+1} ← y_{j,t_j}
end for
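A minimal executable sketch of Algorithm 1 follows. The ridge-regression losses, problem sizes, and parameter values are illustrative assumptions; ν is set to the regularization parameter λ, which is a valid lower bound on μ here since every f_i is λ-strongly convex.

```python
import numpy as np

rng = np.random.default_rng(2)

# Illustrative instance (not from the paper):
# f_i(x) = 0.5*(a_i^T x - b_i)^2 + (lam/2)*||x||^2
n, d, lam = 500, 20, 0.1
A = rng.standard_normal((n, d))
A /= np.linalg.norm(A, axis=1, keepdims=True)   # unit rows: each f_i is (1+lam)-smooth
b = A @ rng.standard_normal(d)

def grad_i(x, i):
    return (A[i] @ x - b[i]) * A[i] + lam * x

def full_grad(x):
    return A.T @ (A @ x - b) / n + lam * x

def s2gd(x, h, m, nu, epochs):
    """Sketch of Algorithm 1 (S2GD)."""
    for _ in range(epochs):
        g = full_grad(x)                        # g_j: one pass over the data
        t_vals = np.arange(1, m + 1)
        w = (1.0 - nu * h) ** (m - t_vals)      # P(t_j = t) = (1 - nu*h)^(m-t) / beta
        t_j = rng.choice(t_vals, p=w / w.sum())
        y = x.copy()
        for _ in range(t_j):
            i = rng.integers(n)                 # two stochastic gradients per step
            y -= h * (g + grad_i(y, i) - grad_i(x, i))
        x = y
    return x

f = lambda z: 0.5 * np.mean((A @ z - b) ** 2) + lam / 2 * z @ z
x0 = np.zeros(d)
x_out = s2gd(x0.copy(), h=0.1, m=2 * n, nu=lam, epochs=12)
```

Note that the inner step uses the same random index i for both stochastic gradients, which is what makes the correction term reduce variance rather than add noise.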
The method has an outer loop, indexed by epoch counter j, and an inner loop, indexed by t. In each epoch j, the method first computes g_{j}—the full gradient of f at x_{j}. Subsequently, the method produces a random number t_{j} ∈ [1, m] of steps, following a geometric law, where
β := ∑_{t=1}^{m} (1 − νh)^{m−t},    (4)
with only two stochastic gradients computed in each step^{4}. For each t = 0, …, t_j − 1, the stochastic gradient f_i′(x_j) is subtracted from g_j, and f_i′(y_{j,t}) is added to g_j, which ensures that one has
E(gj+fi′(yj,t)-fi′(xj))=f′(yj,t),
where the expectation is with respect to the random variable i.
Hence, the algorithm is stochastic gradient descent—albeit executed in a nonstandard way (compared to the traditional implementation described in the introduction).
Note that for all j, the expected number of iterations of the inner loop, E(t_{j}), is equal to
ξ = ξ(m, h) := (1/β) ∑_{t=1}^{m} t (1 − νh)^{m−t}.    (5)
Also note that ξ ∈ [(m + 1)/2, m), with the lower bound attained for ν = 0, and the upper bound approached as νh → 1.
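The quantity ξ(m, h) and the bounds above are easy to check numerically; the values of m, h, and ν below are arbitrary illustrative choices.

```python
import numpy as np

def xi(m, h, nu):
    """xi(m, h) = (1/beta) * sum_{t=1}^m t * (1 - nu*h)^(m - t)."""
    t = np.arange(1, m + 1)
    w = (1.0 - nu * h) ** (m - t)
    return (t * w).sum() / w.sum()

m = 100
# nu = 0: all weights equal, so t_j is uniform and xi = (m + 1)/2
assert np.isclose(xi(m, h=0.1, nu=0.0), (m + 1) / 2)
# as nu*h -> 1 the law concentrates on t = m, so xi approaches m from below
assert (m + 1) / 2 <= xi(m, h=0.19, nu=5.0) < m
```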
2.2. S2GD+
We also implement Algorithm 2, which we call S2GD+. In our experiments, the performance of this method is superior to all methods we tested, including S2GD. However, we do not analyze the complexity of this method and leave this as an open problem.
Algorithm 2: S2GD+

parameters: α ≥ 1 (e.g., α = 1)
1. Run SGD for a single pass over the data (i.e., n iterations); output x
2. Starting from x_0 = x, run a version of S2GD in which t_j = αn for all j
In brief, S2GD+ starts by running SGD for 1 epoch (1 pass over the data) and then switches to a variant of S2GD in which the number of the inner iterations, t_{j}, is not random, but fixed to be n or a small multiple of n.
The motivation for this method is the following. It is common knowledge that SGD is able to progress much more in one pass over the data than GD (where this would correspond to a single gradient step). However, the very first step of S2GD is the computation of the full gradient of f. Hence, by starting with a single pass over data using SGD and then switching to S2GD, we obtain a superior method in practice^{5}.
3. Summary of results
In this section we summarize some of the main results and contributions of this work.
Complexity for strongly convex f. If f is strongly convex, S2GD needs
W = O((n + κ) log(1/ε))    (6)
work (measured as the total number of evaluations of the stochastic gradient, accounting for the full gradient evaluations as well) to output an ε-approximate solution (in expectation or in high probability), where κ = L/μ is the condition number. This is achieved by running S2GD with stepsize h = O(1/L), j = O(log(1/ε)) epochs (this is also equal to the number of full gradient evaluations) and m = O(κ) (this is also roughly equal to the number of stochastic gradient evaluations in a single epoch). The complexity results are stated in detail in Sections 4 and 5 (see Theorems 4, 5 and 6; see also Equations 26 and 27).
Comparison with existing results. This complexity result (Equation 6) matches the best-known results obtained for strongly convex losses in recent work such as Roux et al. [4], Johnson and Zhang [7], and Zhang and Mahdavi [14]. Our treatment is most closely related to Johnson and Zhang [7], and contains their method (SVRG) as a special case. In Table 1 we summarize our results in the strongly convex case with other existing results for different algorithms.
We should note that the rate of convergence of Nesterov's algorithm [21] is a deterministic result. The EMGD and S2GD results hold with high probability (see Theorem 5 for a precise statement). Complexity results for stochastic coordinate descent methods are also typically analyzed in the high probability regime [3]. The remaining results hold in expectation. The notion of κ is slightly different for SDCA, which requires explicit knowledge of the strong convexity parameter μ to run the algorithm. In contrast, the other methods do not algorithmically depend on it, and their convergence rates can thus adapt to any additional local strong convexity.
Complexity for convex f. If f is not strongly convex, then we propose that S2GD be applied to a perturbed version of the problem, with strong convexity constant μ = O(L/ε). An ε-accurate solution of the original problem is recovered with arbitrarily high probability (see Theorem 8 in Section 6). The total work in this case is
W = O((n + L/ε) log(1/ε)),
that is, Õ(1/ϵ), which is better than the standard rate of SGD.
Optimal parameters. We derive formulas for optimal parameters of the method which (approximately) minimize the total workload, measured in the number of stochastic gradients computed (counting a single full gradient evaluation as n evaluations of the stochastic gradient). In particular, we show that the method should be run for O(log(1/ε)) epochs, with stepsize h = O(1/L) and m = O(κ). No such results were derived for SVRG in Johnson and Zhang [7].
One epoch. Consider the case when S2GD is run for 1 epoch only, effectively limiting the number of full gradient evaluations to 1, while choosing a target accuracy ϵ. We show that S2GD with ν = μ needs
O(n+(κ/ε)log(1/ε))
work only (see Table 2). This compares favorably with the optimal complexity in the ν = 0 case (which reduces to SVRG), where the work needed is
O(n + κ/ε²).
For two epochs one could similarly argue that an ε decrease is needed in each epoch, resulting in complexity O(n + (κ/ε) log(1/ε)). This is already better than the general rate of SGD, O(1/ε).
Special cases. GD and SVRG arise as special cases of S2GD, for m = 1 and ν = 0, respectively^{6}.
Low memory requirements. Note that SDCA and SAG, unlike SVRG and S2GD, need to store all gradients fi′ (or dual variables) throughout the iterative process. While this may not be a problem for a modest sized optimization task, this requirement makes such methods less suitable for problems with very large n.
S2GD+. We propose a “boosted” version of S2GD, called S2GD+, which we do not analyze. In our experiments, however, it performs vastly better than all other methods we tested, including GD, SGD, SAG, and S2GD. S2GD alone is better than both GD and SGD if a highly accurate solution is required. The performance of S2GD and SAG is roughly comparable, even though in our experiments S2GD turned out to have an edge.
Table 1. Comparison of performance of selected methods suitable for solving Equation (1).

Algorithm            | Complexity/work
---------------------|------------------------
Nesterov's algorithm | O(√κ n log(1/ε))
EMGD                 | O((n + κ²) log(1/ε))
SAG                  | O(max{n, κ} log(1/ε))
SDCA                 | O((n + κ) log(1/ε))
SVRG                 | O((n + κ) log(1/ε))
S2GD                 | O((n + κ) log(1/ε))

The complexity/work is measured in the number of stochastic gradient evaluations needed to find an ε-solution.
Table 2. Summary of complexity results and special cases.

Parameters                           | Method                    | Complexity
-------------------------------------|---------------------------|-------------------------
ν = μ, j = O(log(1/ε)), m = O(κ)     | Optimal S2GD              | O((n + κ) log(1/ε))
m = 1                                | GD                        | —
ν = 0                                | SVRG [7]                  | O((n + κ) log(1/ε))
ν = 0, j = 1, m = O(κ/ε²)            | Optimal SVRG with 1 epoch | O(n + κ/ε²)
ν = μ, j = 1, m = O((κ/ε) log(1/ε))  | Optimal S2GD with 1 epoch | O(n + (κ/ε) log(1/ε))

Condition number: κ = L/μ if f is μ-strongly convex, and κ = 2L/ε if f is not strongly convex and ε ≤ L.
4. Complexity analysis: strongly convex loss
For the purpose of the analysis, let
F_{j,t} := σ(x_1, x_2, …, x_j; y_{j,1}, y_{j,2}, …, y_{j,t})
be the σ-algebra generated by the relevant history of S2GD. We first isolate an auxiliary result.
Lemma 3. Consider the S2GD algorithm. For any fixed epoch number j, the following identity holds:
E(f(x_{j+1})) = (1/β) ∑_{t=1}^{m} (1 − νh)^{m−t} E(f(y_{j,t−1})).
Proof. By the tower law of conditional expectations and the definition of x_{j+1} in the algorithm, we obtain
We now state and prove the main result of this section.
Theorem 4. Let Assumptions 1 and 2 be satisfied. Consider the S2GD algorithm applied to solving problem (1). Choose 0 ≤ ν ≤ μ, 0 < h < 1/(2L), and let m be sufficiently large so that
c := (1 − νh)^m / (β μ h (1 − 2Lh)) + 2(L − μ)h / (1 − 2Lh) < 1.    (9)
Then we have the following convergence in expectation:
E(f(x_j) − f(x_*)) ≤ c^j (f(x_0) − f(x_*)).    (10)
Before we proceed to proving the theorem, note that in the special case with ν = 0, we recover the result of Johnson and Zhang [7] (with a minor improvement in the second term of c where L is replaced by L − μ), namely
c = 1/(μh(1 − 2Lh)m) + 2(L − μ)h/(1 − 2Lh).    (11)
If we set ν = μ, then c can be written in the form (see Equation 4)
c = (1 − μh)^m / ((1 − (1 − μh)^m)(1 − 2Lh)) + 2(L − μ)h / (1 − 2Lh).    (12)
Clearly, the latter c is a major improvement on the former one. We shall elaborate on this further later.
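To see the improvement concretely, one can evaluate both expressions for c numerically; the values of L, μ, h, and m below are illustrative assumptions.

```python
def c_factor(nu, mu, L, h, m):
    """Convergence factor c of Theorem 4 as a function of the parameters."""
    beta = sum((1 - nu * h) ** (m - t) for t in range(1, m + 1))
    return ((1 - nu * h) ** m / (beta * mu * h * (1 - 2 * L * h))
            + 2 * (L - mu) * h / (1 - 2 * L * h))

L, mu = 1.0, 0.01                      # illustrative: condition number kappa = 100
h, m = 0.1, 2000                       # h < 1/(2L)
c_svrg = c_factor(0.0, mu, L, h, m)    # nu = 0 recovers the SVRG factor
c_s2gd = c_factor(mu, mu, L, h, m)     # nu = mu
assert c_s2gd < c_svrg < 1
```

With these constants the ν = μ factor is roughly half of the ν = 0 one, i.e., the same per-epoch budget buys a visibly faster linear rate.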
Proof. It is well-known [21, Theorem 2.1.5] that since the functions f_{i} are L-smooth, they necessarily satisfy the following inequality:
Let G_{j,t} := g_j + f_i′(y_{j,t−1}) − f_i′(x_j) be the direction of the update at the j-th iteration of the outer loop and the t-th iteration of the inner loop. Taking expectation in Equation (7) with respect to i, conditioned on the σ-algebra F_{j,t−1}, we obtain^{7}
Finally, we can analyze what happens after one iteration of the outer loop of S2GD, i.e., between two computations of the full gradient. By summing up inequalities Equation (17) for t = 1, …, m, with inequality t multiplied by (1 − νh)^{m−t}, we get the left-hand side
Since we have established linear convergence of expected values, a high probability result can be obtained in a straightforward way using Markov inequality.
Theorem 5. Consider the setting of Theorem 4. Then, for any 0 < ρ <1, 0 < ε <1 and
j ≥ log(1/(ερ)) / log(1/c),
we have
P( (f(x_j) − f(x_*)) / (f(x_0) − f(x_*)) ≤ ε ) ≥ 1 − ρ.
Proof. This follows directly from Markov inequality and Theorem 4:
This result will also be useful when treating the non-strongly convex case.
5. Optimal choice of parameters
The goal of this section is to provide insight into the choice of parameters of S2GD; that is, the number of epochs (equivalently, full gradient evaluations) j, the maximal number of steps in each epoch m, and the stepsize h. The remaining parameters (L, μ, n) are inherent in the problem and we will hence treat them in this section as given.
In particular, ideally we wish to find parameters j, m and h solving the following optimization problem:
min_{j,m,h} W̃(j, m, h) := j(n + 2ξ(m, h)),    (20)
subject to
E(f(x_j) − f(x_*)) ≤ ε (f(x_0) − f(x_*)).    (21)
Note that W̃(j, m, h) is the expected work, measured by the number of stochastic gradient evaluations, performed by S2GD when running for j epochs. Indeed, the evaluation of g_j is equivalent to n stochastic gradient evaluations, and each epoch further computes on average 2ξ(m, h) stochastic gradients (see Equation 5). Since (m + 1)/2 ≤ ξ(m, h) < m, we can simplify and solve the problem with ξ set to the conservative upper estimate ξ = m.
In view of Equation (10), accuracy constraint Equation (21) is satisfied if c (which depends on h and m) and j satisfy
c^j ≤ ε.    (22)
We therefore instead consider the parameter fine-tuning problem:
min_{j,m,h} W(j, m, h) := j(n + 2m)  subject to  c ≤ ε^{1/j}.    (23)
In the following we (approximately) solve this problem in two steps. First, we fix j and find (nearly) optimal h = h(j) and m = m(j). The problem reduces to minimizing m subject to c ≤ ε^{1/j} by fine-tuning h. While in the ν = 0 case it is possible to obtain a closed-form solution, this is not possible for ν > 0.
However, it is still possible to obtain a good formula for h(j), leading to an expression for a good m(j) which depends on ε in the correct way. We then plug the formula for m(j) obtained this way back into Equation (23), and study the quantity W(j, m(j), h(j)) = j(n + 2m(j)) as a function of j, over which we optimize at the end.
Theorem 6 (Choice of parameters). Fix the number of epochs j ≥ 1, error tolerance 0 < ε < 1, and let Δ = ε^{1/j}. If we run S2GD with the stepsize
Proof. We only need to show that c ≤ Δ, where c is given by Equation (12) for ν = μ and by Equation (11) for ν = 0. We denote the two summands in the expressions for c by c_1 and c_2. We choose h and m so that both c_1 and c_2 are smaller than Δ/2, resulting in c_1 + c_2 = c ≤ Δ.
The stepsize h is chosen so that
c_2 := 2(L − μ)h / (1 − 2Lh) = Δ/2,
and hence it only remains to verify that c_1 = c − c_2 ≤ Δ/2. In the ν = 0 case, m(j) is chosen so that c − c_2 = Δ/2. In the ν = μ case, c − c_2 = Δ/2 holds for m = log((2/Δ + 2κ − 1)/(κ − 1)) / log(1/(1 − H)), where H = (4(κ − 1)/Δ + 2κ)^{−1}. We only need to observe that c decreases as m increases, and apply the inequality log(1/(1 − H)) ≥ H.
□
We now comment on the above result:
Workload. Notice that for the choice of parameters j^{*}, h = h(j^{*}), m = m(j^{*}) and any ν ∈ [0, μ], the method needs log(1/ε) computations of the full gradient (note this is independent of κ), and O(κ log (1/ε)) computations of the stochastic gradient. This result, and special cases thereof, are summarized in Table 2.
Simpler formulas for m. If κ ≥ 2, we can use, instead of Equation (25), the following (slightly worse but) simpler expressions for m(j), obtained from Equation (25) by using the bounds 1 ≤ κ − 1, κ − 1 ≤ κ and Δ < 1 in appropriate places (e.g., 8κ/Δ < 8κ/Δ², κ/(κ−1) ≤ 2 < 2/Δ²):

m ≥ m̃(j) := (6κ/Δ) log(5/Δ)  if ν = μ;    m ≥ m̃(j) := 20κ/Δ²  if ν = 0.    (27)
Optimal stepsize in the ν = 0 case. Theorem 6 does not claim to have solved problem (23); the problem in general does not have a closed form solution. However, in the ν = 0 case a closed-form formula can easily be obtained:
h(j) = 1 / ((4/Δ)(L − μ) + 4L),    m ≥ m(j) := 8(κ − 1)/Δ² + 8κ/Δ.    (28)
Indeed, for fixed j, Equation (23) is equivalent to finding h that minimizes m subject to the constraint c ≤ Δ. In view of Equation (11), this is equivalent to searching for h > 0 maximizing the quadratic h → h(Δ − 2(ΔL+L − μ)h), which leads to Equation (28).
Note that both the stepsize h(j) and the resulting m(j) are slightly larger in Theorem 6 than in Equation (28). This is because in the theorem the stepsize was, for simplicity, chosen to satisfy c_2 = Δ/2, and hence is (slightly) suboptimal. Nevertheless, the dependence of m(j) on Δ is of the correct (optimal) order in both cases. That is, m(j) = O((κ/Δ) log(1/Δ)) for ν = μ and m(j) = O(κ/Δ²) for ν = 0.
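The parameter formulas above are straightforward to evaluate; the sketch below computes the closed-form ν = 0 choice of Equation (28), the simpler bounds m̃(j) of Equation (27), and the resulting workload j(n + 2m). The problem constants used in the usage note are illustrative.

```python
from math import log

def params_nu0(j, eps, L, mu):
    """Closed-form stepsize and m for nu = 0 (Equation 28)."""
    Delta = eps ** (1.0 / j)
    kappa = L / mu
    h = 1.0 / (4.0 * (L - mu) / Delta + 4.0 * L)
    m = 8.0 * (kappa - 1.0) / Delta**2 + 8.0 * kappa / Delta
    return h, m

def m_simple(j, eps, L, mu, nu_is_mu=True):
    """Simpler (slightly worse) bounds m~(j) of Equation (27); assumes kappa >= 2."""
    Delta = eps ** (1.0 / j)
    kappa = L / mu
    return 6.0 * kappa / Delta * log(5.0 / Delta) if nu_is_mu else 20.0 * kappa / Delta**2

def work(j, n, m):
    """Expected total work W(j) = j*(n + 2m), in stochastic gradient evaluations."""
    return j * (n + 2.0 * m)
```

For example, with n = 10^{9}, L = 1, μ = 10^{−3} (so κ = 10^{3}) and ε = 10^{−6}, two epochs with ν = μ give W ≈ 2.2n, in the same ballpark as the 2.12n reported in Table 3 (which uses the sharper Equation 25 rather than the simplified m̃).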
Stepsize choice. In cases when one does not have a good estimate of the strong convexity constant μ with which to determine the stepsize via Equation (24), one may choose a suboptimal stepsize that does not depend on μ and derive similar results to those above. For instance, one may choose h = Δ/(6L).
In Table 3 we provide comparison of work needed for small values of j, and different values of κ and ε. Note, for instance, that for any problem with n = 10^{9} and κ = 10^{3}, S2GD outputs a highly accurate solution (ε = 10^{ − 6}) in the amount of work equivalent to 2.12 evaluations of the full gradient of f!
Table 3. Comparison of work sufficient to solve a problem with n = 10^{9}, and various values of κ and ε.

κ = 10^{3}:

     ε = 10^{−3}                 ε = 10^{−6}                 ε = 10^{−9}
 j   W_μ(j)   W_0(j)        j   W_μ(j)   W_0(j)        j   W_μ(j)   W_0(j)
 1   1.06n*   17.0n         1   116n     10^{7}n       2   7.58n    10^{4}n
 2   2.00n    2.03n*        2   2.12n*   34.0n         3   3.18n*   51.0n
 3   3.00n    3.00n         3   3.01n    3.48n*        4   4.03n    6.03n
 4   4.00n    4.00n         4   4.00n    4.06n         5   5.01n    5.32n*
 5   5.00n    5.00n         5   5.00n    5.02n         6   6.00n    6.09n

κ = 10^{6}:

     ε = 10^{−3}                 ε = 10^{−6}                 ε = 10^{−9}
 j   W_μ(j)   W_0(j)        j   W_μ(j)   W_0(j)        j   W_μ(j)   W_0(j)
 2   4.14n    35.0n         4   8.29n    70.0n         5   17.3n    328n
 3   3.77n*   8.29n         5   7.30n*   26.3n         8   10.9n*   32.5n
 4   4.50n    6.39n*        6   7.55n    16.5n         10  11.9n    21.4n
 5   5.41n    6.60n         8   9.01n    12.7n*        13  14.3n    19.1n*
 6   6.37n    7.28n         10  10.8n    13.2n         20  21.0n    23.5n

κ = 10^{9}:

     ε = 10^{−3}                 ε = 10^{−6}                 ε = 10^{−9}
 j   W_μ(j)   W_0(j)        j   W_μ(j)   W_0(j)        j   W_μ(j)   W_0(j)
 6   378n     1,293n        13  737n     2,409n        15  1,251n   4,834n
 8   358n*    1,063n        16  717n*    2,126n        24  1,076n   3,189n
 11  376n     1,002n*       19  727n     2,025n        30  1,102n   3,018n
 15  426n     1,058n        22  752n     2,005n*       32  1,119n   3,008n*
 20  501n     1,190n        30  852n     2,116n        40  1,210n   3,078n

The work was computed using formula (23), with m(j) as in Equation (27). The notation W_ν(j) indicates which parameter ν was used; the smallest (optimal) value in each column is marked with an asterisk.
6. Complexity analysis: convex loss
If f is convex but not strongly convex, we define f̂_i(x) := f_i(x) + (μ/2)‖x − x_0‖² for small enough μ > 0 (we shall see below how the choice of μ affects the results), and consider the perturbed problem
min_{x ∈ ℝ^d} f̂(x),    (29)
where
f̂(x) := (1/n) ∑_{i=1}^{n} f̂_i(x) = f(x) + (μ/2)‖x − x_0‖².
Note that f^ is μ-strongly convex and (L + μ)-smooth. In particular, the theory developed in the previous section applies. We propose that S2GD be instead applied to the perturbed problem, and show that an approximate solution of Equation (29) is also an approximate solution of Equation (1) (we will assume that this problem has a minimizer).
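The reduction is mechanical: wrap each stochastic gradient with the perturbation term. The logistic-type losses and constants below are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(3)

# Losses f_i(x) = log(1 + exp(a_i x)) are smooth and convex, but not
# strongly convex; the data below is illustrative.
n = 50
a = rng.standard_normal(n)

def grad_i(x, i):
    return a[i] / (1.0 + np.exp(-a[i] * x))   # f_i'(x)

mu, x0 = 1e-3, 0.0                            # perturbation size and center

def grad_i_hat(x, i):
    """Gradient of the perturbed loss f^_i(x) = f_i(x) + (mu/2)(x - x0)^2."""
    return grad_i(x, i) + mu * (x - x0)

# At the center x0 the perturbation gradient vanishes
assert np.isclose(grad_i_hat(x0, 0), grad_i(x0, 0))
```

S2GD can then be run on `grad_i_hat` exactly as before, with strong convexity parameter μ available by construction.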
Let x̂_* be the (necessarily unique) solution of the perturbed problem (29). The following result describes an important connection between the original problem and the perturbed problem.
Lemma 7. If x̂ ∈ ℝ^d satisfies f̂(x̂) ≤ f̂(x̂_*) + δ, where δ > 0, then
f(x̂) ≤ f(x_*) + (μ/2)‖x_0 − x_*‖² + δ.
Proof. The statement is almost identical to Lemma 9 in Richtárik and Takáč [3]; its proof follows the same steps with only minor adjustments. □
We are now ready to establish a complexity result for non-strongly convex losses.
Theorem 8. Let Assumption 1 be satisfied. Choose μ > 0, 0 ≤ ν ≤ μ, stepsize 0 < h < 1/(2(L + μ)), and let m be sufficiently large so that
ĉ := (1 − νh)^m / (β μ h (1 − 2(L + μ)h)) + 2Lh / (1 − 2(L + μ)h) < 1.
Pick x_0 ∈ ℝ^d and let x̂_0 = x_0, x̂_1, …, x̂_j be the sequence of iterates produced by S2GD as applied to problem (29). Then, for any 0 < ρ < 1, 0 < ε < 1 and
j ≥ log(1/(ερ)) / log(1/ĉ),
we have
P( f(x̂_j) − f(x_*) ≤ ε(f(x_0) − f(x_*)) + (μ/2)‖x_0 − x_*‖² ) ≥ 1 − ρ.    (33)
In particular, if we choose μ = ϵ < L and parameters j^{*}, h(j^{*}), m(j^{*}) as in Theorem 6, the amount of work performed by S2GD to guarantee Equation (33) is
W(j^*, h(j^*), m(j^*)) = O((n + L/ε) log(1/ε)),    (34)
which consists of O(log(1/ε)) full gradient evaluations and O((L/ε) log(1/ε)) stochastic gradient evaluations.
where the first inequality follows from f≤f^, and the second one from optimality of x_{*}. Hence, by first applying Lemma 7 with x^=x^j and δ = ε(f(x_{0})−f(x_{*})), and then Theorem 5, with c ← ĉ, f←f^, x0←x^0, x*←x^*, we obtain
The second statement follows directly from the second part of Theorem 6 and the fact that the condition number of the perturbed problem is κ = (L + ε)/ε ≤ 2L/ε. □
7. Implementation for sparse data
In our sparse implementation of Algorithm 1, described in this section and formally stated as Algorithm 3, we make the following structural assumption:
Assumption 9. The loss functions arise as the composition of a univariate smooth loss function ϕ_i and an inner product with a data point/example a_i ∈ ℝ^d:
f_i(x) = ϕ_i(a_i^T x),  i = 1, 2, …, n.
In this case, f_i′(x) = ϕ_i′(a_i^T x) a_i.
This is the structure in many cases of interest, including linear or logistic regression.
Algorithm 3: Semi-Stochastic Gradient Descent (S2GD) for sparse data; “lazy” updates

parameters: m = max # of stochastic steps per epoch, h = stepsize, ν = lower bound on μ
for j = 0, 1, 2, … do
    g_j ← (1/n) ∑_{i=1}^{n} f_i′(x_j)
    y_{j,0} ← x_j
    χ^{(s)} ← 0 for s = 1, 2, …, d    ⊳ Store when a coordinate was last updated
    Let t_j ← t with probability (1 − νh)^{m−t}/β for t = 1, 2, …, m
    for t = 0 to t_j − 1 do
        Pick i ∈ {1, 2, …, n}, uniformly at random
        for s ∈ nnz(a_i) do
            y_{j,t}^{(s)} ← y_{j,t}^{(s)} − (t − χ^{(s)}) h g_j^{(s)}    ⊳ Update what will be needed
            χ^{(s)} ← t
        end for
        y_{j,t+1} ← y_{j,t} − h (ϕ_i′(a_i^T y_{j,t}) − ϕ_i′(a_i^T x_j)) a_i    ⊳ A sparse update
    end for
    for s = 1 to d do    ⊳ Finish all the “lazy” updates
        y_{j,t_j}^{(s)} ← y_{j,t_j}^{(s)} − (t_j − χ^{(s)}) h g_j^{(s)}
    end for
    x_{j+1} ← y_{j,t_j}
end for
A natural question one might want to ask is whether S2GD can be implemented efficiently for sparse data.
Let us first take a brief detour and look at SGD, which performs iterations of the type:
x_{j+1} ← x_j − h ϕ_i′(a_i^T x_j) a_i.    (35)
Let ω_{i} be the number of nonzero features in example a_{i}, i.e., ωi=def‖ai‖0≤d. Assuming that the computation of the derivative of the univariate function ϕ_{i} takes O(1) amount of work, the computation of ∇f_{i}(x) will take O(ω_{i}) work. Hence, the update step Equation (35) will cost O(ω_{i}), too, which means the method can naturally speed up its iterations on sparse data.
The situation is not as simple with S2GD, which for loss functions of the type described in Assumption 9 performs inner iterations as follows:
y_{j,t+1} ← y_{j,t} − h (g_j + ϕ_i′(a_i^T y_{j,t}) a_i − ϕ_i′(a_i^T x_j) a_i).    (36)
Indeed, note that g_j = f′(x_j) is in general fully dense even for sparse data {a_i}. As a consequence, the update in Equation (36) might be as costly as d operations, irrespective of the sparsity level ω_i of the active example a_i. However, we can use the following “lazy/delayed” update trick. We split the update of the y vector into two parts: immediate and delayed. Assume index i = i_t was chosen at inner iteration t. We immediately perform the update
ỹ_{j,t+1} ← y_{j,t} − h (ϕ_{i_t}′(a_{i_t}^T y_{j,t}) − ϕ_{i_t}′(a_{i_t}^T x_j)) a_{i_t},    (37)
which costs O(ω_{i_t}). Note that we have not computed y_{j,t+1}. However, we “know” that
y_{j,t+1} = ỹ_{j,t+1} − h g_j,
without having to actually compute the difference. At the next iteration, we are supposed to perform update Equation (36) for i = i_{t+1}:
as we never computed y_{j,t+1}. However, here lies the trick: as a_{i_{t+1}} is sparse, we only need to know those coordinates s of y_{j,t+1} for which a_{i_{t+1}}^{(s)} is nonzero. So, just before we compute the sparse part of the update Equation (37), we perform the update
y_{j,t+1}^{(s)} ← ỹ_{j,t+1}^{(s)} − h g_j^{(s)}
for coordinates s for which ait+1(s) is nonzero. This way we know that the inner product appearing in Equation (38) is computed correctly (despite the fact that y_{j,t+1} potentially is not!). In turn, this means that we can compute the sparse part of the update in Equation (37).
We now continue as before, again only computing ỹ_{j,t+2}. However, this time we have to be more careful, as it is no longer true that
y_{j,t+2} = ỹ_{j,t+2} − h g_j.
We need to remember, for each coordinate s, the last iteration counter t for which a_{i_t}^{(s)} ≠ 0. This way we know how many times we “forgot” to apply the dense update −h g_j^{(s)}. These updates are then applied in a just-in-time fashion, just before they are needed.
Algorithm 3 (sparse S2GD) performs the lazy updates described above. It produces exactly the same result as Algorithm 1 (S2GD), but is much more efficient for sparse data, as an iteration picking example i costs only O(ω_i). This comes with a memory overhead of only O(d) (represented by the vector χ ∈ ℝ^d).
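The lazy-update bookkeeping is easiest to trust after checking it against the dense updates on a small instance. Everything below (squared losses, sparsity pattern, sizes, stepsize) is an illustrative assumption; a production implementation would additionally use sparse dot products.

```python
import numpy as np

rng = np.random.default_rng(4)

# Sparse instance of Assumption 9: f_i(x) = phi_i(a_i^T x), phi_i(t) = 0.5*(t - b_i)^2
n, d = 200, 50
A = rng.standard_normal((n, d)) * (rng.random((n, d)) < 0.1)  # ~90% zeros
b = rng.standard_normal(n)
phi_prime = lambda t, i: t - b[i]

h = 0.01
x = rng.standard_normal(d)              # outer iterate x_j
g = A.T @ (A @ x - b) / n               # full gradient g_j (dense)
idx = rng.integers(n, size=100)         # fixed sequence of sampled examples
t_j = len(idx)

# Dense reference (Equation 36): y <- y - h*(g + (phi'(a_i.y) - phi'(a_i.x))*a_i)
y_dense = x.copy()
for i in idx:
    y_dense -= h * (g + (phi_prime(A[i] @ y_dense, i)
                         - phi_prime(A[i] @ x, i)) * A[i])

# Lazy version: defer the dense -h*g part, applying it just-in-time per coordinate
y = x.copy()
chi = np.zeros(d, dtype=int)            # when each coordinate was last synced
for t, i in enumerate(idx):
    s = np.nonzero(A[i])[0]             # nnz(a_i)
    y[s] -= (t - chi[s]) * h * g[s]     # catch up only the coordinates we need
    chi[s] = t
    # A[i] @ y is correct: all nonzero coordinates of a_i are now synced
    y[s] -= h * (phi_prime(A[i] @ y, i) - phi_prime(A[i] @ x, i)) * A[i, s]
y -= (t_j - chi) * h * g                # flush the remaining lazy updates

assert np.allclose(y, y_dense)
```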
8. Numerical experiments
In this section we conduct computational experiments to illustrate some aspects of the performance of our algorithm. In Section 8.1 we consider the least squares problem with synthetic data, to compare the practical performance with the theoretical bound on convergence in expectation. We demonstrate that for both SVRG and S2GD, the practical rate is substantially better than the theoretical one. In Section 8.2 we compare the S2GD algorithm on several real datasets with other algorithms suitable for this task. We also provide an efficient implementation of the algorithm, as described in Section 7, for the case of logistic regression, in the MLOSS repository^{8}.
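For reference, the overall structure of the method being tested (one full gradient per epoch plus a geometrically distributed number of inner stochastic steps) can be sketched as follows. This is a simplified rendering of Algorithm 1 under our own naming, not the tuned implementation used in the experiments.

```python
import numpy as np

def s2gd(grad_i, x0, n, h, m, nu, epochs, seed=0):
    """A simplified sketch of S2GD (Algorithm 1).

    grad_i : grad_i(i, w) returns the gradient of f_i at w
    m      : maximum number of inner steps per epoch
    nu     : lower bound on mu (nu = 0 recovers SVRG)
    """
    rng = np.random.default_rng(seed)
    x = x0.copy()
    for _ in range(epochs):
        # full gradient at the epoch's reference point x_j
        g = sum(grad_i(i, x) for i in range(n)) / n
        # epoch length drawn with P(t) proportional to (1 - nu*h)^(m - t)
        w = (1.0 - nu * h) ** (m - np.arange(1, m + 1))
        t_len = rng.choice(np.arange(1, m + 1), p=w / w.sum())
        y = x.copy()
        for _ in range(t_len):
            i = rng.integers(n)
            # variance-reduced stochastic step
            y -= h * (grad_i(i, y) - grad_i(i, x) + g)
        x = y
    return x
```

Note that setting nu = 0 makes the epoch-length distribution uniform over {1, …, m}, which is the SVRG special case discussed in the text.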
8.1. Comparison with theory
Figure 1 presents a comparison of the theoretical rate and practical performance on a larger problem with artificial data, with a condition number we can control (and choose to be poor). In particular, we consider the L2-regularized least squares problem with
$$f_i(x) = \tfrac{1}{2}(a_i^T x - b_i)^2 + \tfrac{\lambda}{2}\|x\|^2,$$
where a_i ∈ ℝ^d, b_i ∈ ℝ, and λ > 0 is the regularization parameter.
Figure 1. Least squares with n = 10^{5}, κ = 10^{4}. Comparison of the theoretical result and practical performance for the cases ν = μ (solid red line) and ν = 0 (dashed blue line).
We consider an instance with n = 100,000, d = 1,000 and κ = 10,000. We run the algorithm with both parameters ν = λ (our best estimate of μ) and ν = 0. Recall that the latter choice leads to the SVRG method of [7]. We chose the parameters m and h as a (numerical) solution of the work-minimization problem (20), obtaining m = 261,063 and h = 1/(11.4L) for ν = λ, and m = 426,660 and h = 1/(12.7L) for ν = 0. The practical performance is obtained after a single run of the S2GD algorithm.
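The exact data-generation procedure for this instance is not spelled out in the text; one simple hypothetical way to build least-squares data with a prescribed condition number κ = L/μ, using μ = λ as the strong convexity estimate, is to rescale all rows to a common norm (the function name `make_instance` and the construction itself are our own illustration):

```python
import numpy as np

def make_instance(n, d, kappa, lam, seed=0):
    """Generate L2-regularized least-squares data with condition number
    kappa = L / mu, where mu = lam and L = max_i ||a_i||^2 + lam.
    Hypothetical construction, for illustration only."""
    rng = np.random.default_rng(seed)
    A = rng.standard_normal((n, d))
    # rescale every row so that ||a_i||^2 = (kappa - 1) * lam,
    # hence L = ||a_i||^2 + lam = kappa * lam, and L / mu = kappa
    A *= np.sqrt((kappa - 1) * lam) / np.linalg.norm(A, axis=1, keepdims=True)
    # labels from a random planted model plus small noise
    b = A @ rng.standard_normal(d) + 0.01 * rng.standard_normal(n)
    return A, b
```

Any construction in which the largest row norm is controlled relative to λ would serve the same purpose of pinning down κ.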
The figure demonstrates linear convergence of S2GD in practice, with the convergence rate being significantly better than the already strong theoretical result. Recall that the bound is on the expected function values. We observe convergence to machine precision in work equivalent to evaluating the full gradient only 40 times. Needless to say, neither SGD nor GD achieves such speed. Our method is also an improvement over [7], both in theory and practice.
8.2. Comparison with other methods
The S2GD algorithm can be applied to several classes of problems. We perform experiments on L2-regularized logistic regression for binary classification, an important problem used in many applications, on several datasets. The functions f_{i} in this case are:
$$f_i(x) = \log\left(1 + \exp\left(l_i a_i^T x\right)\right) + \tfrac{\lambda}{2}\|x\|^2,$$
where l_{i} is the label of the i-th training example a_{i}. In our experiments we set the regularization parameter λ = O(1/n), so that the condition number is κ = O(n), which is among the most ill-conditioned settings used in practice. We added a (regularized) bias term to all datasets.
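For concreteness, a single f_i and its gradient under the convention written above (with the label's sign folded into l_i) can be implemented as follows; the function name `logistic_fi` is our own illustrative choice.

```python
import numpy as np

def logistic_fi(x, a, l, lam):
    """Value and gradient of f_i(x) = log(1 + exp(l * a^T x)) + (lam/2)||x||^2.

    x : parameter vector, a : feature vector, l : (signed) label, lam : lambda.
    """
    z = l * (a @ x)
    # log(1 + e^z) computed stably as logaddexp(0, z)
    val = np.logaddexp(0.0, z) + 0.5 * lam * (x @ x)
    # d/dz log(1 + e^z) = sigmoid(z); chain rule gives sigmoid(z) * l * a
    sigma = 1.0 / (1.0 + np.exp(-z))
    grad = sigma * l * a + lam * x
    return val, grad
```

Such a closed-form gradient is what the stochastic steps of S2GD evaluate twice per inner iteration (once at y_{j,t} and once at x_j).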
All the datasets we used, listed in Table 4, are freely available^{9} benchmark binary classification datasets.
Table 4. Datasets used in the experiments.

Dataset     Training examples (n)   Variables (d)   L        μ       κ
ijcnn       49,990                  23              1.23     1/n     61,696
rcv1        20,242                  47,237          0.50     1/n     10,122
real-sim    72,309                  20,959          0.50     1/n     36,155
url         2,396,130               3,231,962       128.70   100/n   3,084,052
In the experiment, we compared the following algorithms:
SGD: Stochastic Gradient Descent. After various experiments, we decided to use a variant with constant step-size that gave the best practical performance in hindsight.
L-BFGS: A publicly-available limited-memory quasi-Newton method that is suitable for broader classes of problems. We used a popular implementation by Mark Schmidt^{10}.
SAG: Stochastic Average Gradient, Schmidt et al. [5]. This is the most important method to compare to, as it also achieves linear convergence using only stochastic gradient evaluations. Although the method has been analyzed for stepsize h = 1/16L, we experimented with various stepsizes and chose the one that gave the best performance for each problem individually.
SDCA: Stochastic Dual Coordinate Ascent, where we used an approximate solution to the one-dimensional dual step, as in Section 6.2 of Shalev-Shwartz and Zhang [6].
S2GDcon: The S2GD algorithm with a conservative stepsize choice, i.e., following the theory. We set m = O(κ) and h = 1/10L, which is approximately the value one would obtain from Equation (24).
S2GD: The S2GD algorithm, with stepsize that gave the best performance in hindsight. The best value of m was between n and 2n in all cases, but optimal h varied from 1/2L to 1/10L.
Note that SAG needs to store n gradients in memory in order to run. In the case of relatively simple functions, one can store only n scalars, as the gradient of f_{i} is always a multiple of a_{i}. In comparing with SAG, we are implicitly assuming that our memory limitations allow us to do so. Although not included in Algorithm 1, we could also store the stochastic gradients used to compute the full gradient, which would mean we only have to compute a single stochastic gradient per inner iteration (instead of two).
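The observation that the gradient of f_i is a scalar multiple of a_i (plus the regularizer contribution) can be sketched as follows, assuming squared loss for concreteness; the class name and interface are our own illustration, not part of SAG itself.

```python
import numpy as np

class ScalarGradientTable:
    """Maintain the average of n stored loss gradients using only n scalars,
    exploiting that for linear models the loss gradient of f_i is
    phi_i'(a_i^T x) * a_i. Sketch for squared loss phi_i(z) = 0.5*(z - b_i)^2."""

    def __init__(self, A, b):
        self.A, self.b = A, b
        self.s = np.zeros(len(b))        # stored scalars phi_i'(a_i^T x)
        self.avg = np.zeros(A.shape[1])  # (1/n) * sum_i s_i * a_i

    def update(self, i, x):
        """Refresh example i's stored gradient at the current point x."""
        new_s = self.A[i] @ x - self.b[i]  # phi_i'(a_i^T x) for squared loss
        # replace i's contribution to the average, never storing full gradients
        self.avg += (new_s - self.s[i]) / len(self.b) * self.A[i]
        self.s[i] = new_s
```

The memory cost is n scalars plus one d-vector for the running average, rather than n dense d-vectors.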
We plot the results of these methods, as applied to the datasets above, in Figure 2, for the first 15–30 passes through the data (i.e., an amount of work equivalent to 15–30 full gradient evaluations).
Figure 2. Practical performance for logistic regression on the following datasets: ijcnn, rcv1 (first row), real-sim, url (second row).
There are several remarks we would like to make. First, our experiments confirm the insight from Schmidt et al. [5] that for these types of problems, reduced-variance methods consistently exhibit substantially better performance than the popular L-BFGS algorithm.
The performance gap between S2GDcon and S2GD differs from dataset to dataset. A possible explanation can be found in an extension of SVRG to the proximal setting by Xiao and Zhang [22], released after the first version of this paper was posted to arXiv (i.e., after December 2013). Instead of Assumption 1, where all loss functions are assumed to share the same constant L, the authors of [22] assume that each loss function f_{i} has its own constant L_{i}. Subsequently, they sample proportionally to these quantities, as opposed to sampling uniformly. In our case, L=maxiLi. This weighted sampling has an impact on the convergence: one obtains a dependence on the average of the quantities L_{i}, not on their maximum.
The number of passes through the data seems a reasonable way to compare performance, but some algorithms may need more time than others to perform the same number of passes. In this sense, S2GD should in fact be faster than SAG, due to the following property: while SAG updates the test point after each evaluation of a stochastic gradient, S2GD makes no updates during the evaluation of the full gradient. This claim is supported by computational evidence: SAG needed about 20–40% more time than S2GD to perform the same number of passes through the data.
Finally, in Table 5 we provide the time it took the algorithms to produce these plots on a desktop computer with an Intel Core i7 3610QM processor and 2 × 4 GB of DDR3 1,600 MHz memory. The numbers for the url dataset are not representative, as the algorithm needed extra memory, which slightly exceeded the memory of our computer.
Table 5. Time (in seconds) required to produce the plots in Figure 2.

Algorithm   ijcnn   rcv1   real-sim   url
S2GDcon     0.25    0.43   1.01       125.53
S2GD        0.29    0.49   1.02       54.04
SAG         0.41    0.73   1.87       71.74
L-BFGS      0.15    0.67   0.76       309.14
SGD         0.39    0.57   1.54       62.73
SDCA        0.33    0.38   1.10       126.32
8.3. Boosted variants of S2GD and SAG
In this section we study the practical performance of the boosted methods, namely S2GD+ (Algorithm 2) and a variant of SAG suggested by its authors [5, Section 4.2].
SAG+ is a simple modification of SAG, where one does not divide the sum of the stochastic gradients by n, but by the number of training examples seen during the run of the algorithm, which has the effect of producing larger steps at the beginning. The authors claim that this method performed better in practice than a hybrid SG/SAG algorithm.
We have observed that, in practice, starting SAG from a point close to the optimum leads to an initial “away jump.” Eventually, the method exhibits linear convergence. In contrast, S2GD converges linearly from the start, regardless of the starting position.
Figure 3 shows that S2GD+ consistently improves over S2GD, while SAG+ does not always improve: sometimes it performs essentially the same as SAG. Although S2GD+ is overall a superior algorithm, one should note that this comes at the cost of having to choose a stepsize parameter for the SGD initialization. If one chooses these parameters poorly, S2GD+ could perform worse than S2GD. The other three algorithms can work well without any parameter tuning.
Figure 3. Practical performance of the boosted methods on the datasets ijcnn, rcv1 (first row), real-sim, url (second row).
9. Conclusion
We have developed a new semi-stochastic gradient descent method (S2GD) and analyzed its complexity for smooth convex and strongly convex loss functions. Our method needs only O((κ/n) log(1/ε)) work, measured in units equivalent to the evaluation of the full gradient of the loss function, where κ = L/μ if the loss is L-smooth and μ-strongly convex, and κ ≤ 2L/ε if the loss is merely L-smooth.
Our results in the strongly convex case match or improve on a few very recent results, while at the same time generalizing and simplifying the analysis. Additionally, we proposed S2GD+—a method which equips S2GD with an SGD pre-processing step—which in our experiments exhibits superior performance to all methods we tested. We leave the analysis of this method as an open problem.
Author contributions
All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.
Funding
The work of both authors was supported by the Centre for Numerical Algorithms and Intelligent Software (funded by EPSRC grant EP/G036136/1 and the Scottish Funding Council). Both authors also thank the Simons Institute for the Theory of Computing, UC Berkeley, where this work was conceived and finalized. The work of PR was also supported by the EPSRC grant EP/I017127/1 (Mathematics for Vast Digital Resources) and EPSRC grant EP/K02325X/1 (Accelerated Coordinate Descent Methods for Big Data Problems).
Conflict of interest statement
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
References

1. Nemirovski A, Juditsky A, Lan G, Shapiro A. Robust stochastic approximation approach to stochastic programming.
2. Zhang T. Solving large scale linear prediction problems using stochastic gradient descent algorithms.
3. Richtárik P, Takáč M. Iteration complexity of randomized block-coordinate descent methods for minimizing a composite function.
4. Roux NL, Schmidt M, Bach FR. A stochastic gradient method with an exponential convergence rate for finite training sets.
5. Schmidt M, Le Roux N, Bach F. Minimizing finite sums with the stochastic average gradient.
6. Shalev-Shwartz S, Zhang T. Stochastic dual coordinate ascent methods for regularized loss minimization.
7. Johnson R, Zhang T. Accelerating stochastic gradient descent using predictive variance reduction.
8. Csiba D, Richtárik P. Coordinate descent face-off: primal or dual? arXiv preprint arXiv:1605.08982 (2016).
9. Hsieh CJ, Chang KW, Lin CJ, Keerthi SS, Sundarajan S. A dual coordinate descent method for large-scale linear SVM.
10. Takáč M, Bijral A, Richtárik P, Srebro N. Mini-batch primal and dual methods for SVMs.
11. Richtárik P, Takáč M. Parallel coordinate descent methods for big data optimization.
12. Fercoq O, Richtárik P. Smooth minimization of nonsmooth functions with parallel coordinate descent methods. arXiv preprint arXiv:1309.5885 (2013).
13. Richtárik P, Takáč M. Distributed coordinate descent method for learning with big data.
14. Zhang L, Mahdavi M, Jin R. Linear convergence with condition number independent access of full gradients.
15. Friedlander MP, Schmidt M. Hybrid deterministic-stochastic methods for data fitting.
16. Deng G, Ferris MC. Variable-number sample-path optimization.
17. Bastin F, Cirillo C, Toint PL. Convergence theory for nonconvex stochastic programming with an application to mixed logit.
18. Marti K, Fuchs E. On solutions of stochastic programming problems by descent procedures with stochastic and deterministic directions.
19. Marti K, Fuchs E. Rates of convergence of semi-stochastic approximation procedures for solving stochastic optimization problems.
20. Konečný J, Liu J, Richtárik P, Takáč M. Mini-batch semi-stochastic gradient descent in the proximal setting.
21. Nesterov Y.
22. Xiao L, Zhang T. A proximal stochastic gradient method with progressive variance reduction.
^{1}The question of whether or when primal or dual version is better has recently been studied in Csiba and Richtárik [8] to which we refer the reader for further details.
^{2}We thank Zaid Harchaoui who pointed us to these papers a few days before we posted our work to arXiv.
^{3}Since the first version of our work, our proposed algorithm has been extended to apply to a broader class of functions in Konečný et al. [20].
^{4}It is possible to get away with computing only a single stochastic gradient per inner iteration, namely fi′(yj,t), at the cost of having to store in memory fi′(xj) for i = 1, 2, …, n. This, however, can be impractical for big n.
^{5}Using a single pass of SGD as an initialization strategy was already considered in Roux et al. [4]. However, the authors claim that their implementation of vanilla SAG did not benefit from it. S2GD does benefit from such an initialization because, in theory, it starts with a (costly) full gradient computation.
^{6}While S2GD reduces to GD for m = 1, our analysis does not say anything meaningful in the m = 1 case—it is too coarse to cover this case. This is also the reason behind the empty space in the “Complexity” box column for GD in Table 2.
^{7}For simplicity, we suppress the E(· | F_{j,t−1}) notation here.
^{8}http://mloss.org/software/view/556/
^{9}Available at http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets/.