Edited by: Yiming Ying, University at Albany, United States

Reviewed by: Shiyin Qin, Beihang University, China; Shao-Bo Lin, Wenzhou University, China

This article was submitted to Mathematics of Computation and Data Science, a section of the journal Frontiers in Applied Mathematics and Statistics

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

We consider the problem of estimating the arithmetic average of a finite collection of real vectors stored in a distributed fashion across several compute nodes subject to a communication budget constraint. Our analysis does not rely on any statistical assumptions about the source of the vectors. This problem arises as a subproblem in many applications, including reduce-all operations within algorithms for distributed and federated optimization and learning. We propose a flexible family of randomized algorithms exploring the trade-off between expected communication cost and estimation error. Our family contains the full-communication and zero-error method on one extreme, and an ϵ-bit communication method, with correspondingly larger estimation error, on the opposite extreme.

We address the problem of estimating the arithmetic mean of n vectors, X_{1}, …, X_{n} ∈ ℝ^{d}, stored in a distributed fashion across n compute nodes, subject to a constraint on the communication cost. That is, we wish to estimate the average X̄ = (1/n) ∑_{i=1}^{n} X_{i}.

In particular, we consider a star network topology with a single server at the centre and n nodes connected to it. Each node i sends an encoded (possibly lossy and randomized) version of its vector X_{i} to the server, and the server subsequently decodes the received messages to estimate the average X̄.

The purpose of the encoding operation is to compress the vector so as to save on communication cost, which is typically the bottleneck in practical applications.

To better illustrate the setup, consider the naive approach in which all nodes send their vectors without performing any encoding operation, followed by the application of a simple averaging decoder by the server. This results in zero estimation error at the expense of the maximum communication cost of ndr bits, where r is the number of bits needed to represent a single floating point number.

This operation appears as a computational primitive in numerous cases, and its communication cost can be reduced at the expense of accuracy. Our proposal for balancing accuracy and communication is relevant in practice for any application that uses such a distributed averaging primitive.

The distributed mean estimation problem was recently studied in a statistical framework, where it is assumed that the vectors X_{i} are independent and identically distributed samples from some underlying distribution. In such a setup, the goal is to estimate the true mean of the underlying distribution.

In contrast, we do not make any statistical assumptions on the source of the vectors, and study the trade-off between expected communication cost and mean squared error of the estimate. Arguably, this setup is a more robust and accurate model of the distributed mean estimation problems arising as subproblems in applications such as reduce-all operations within algorithms for distributed and federated optimization, where the vectors {X_{i}} correspond to updates to a global model/variable. In such cases, the vectors evolve throughout the iterative process in a complicated pattern, typically approaching zero as the master algorithm converges to optimality. Hence, their statistical properties change over time, and fixed statistical assumptions are not satisfied in practice.

For instance, when training a deep neural network model in a distributed environment, the vector X_{i} corresponds to a stochastic gradient based on a minibatch of data stored on node i, and the average of these gradients is used to update the model.

In this paper we propose a parametric family of randomized methods for estimating the mean X̄, with parameters being a collection of probabilities p_{ij} for i = 1, …, n and j = 1, …, d, and node centers μ_{i} ∈ ℝ for i = 1, …, n.^{1}

To illustrate our results, consider the special case presented in Example 7, in which we choose to communicate a single bit per element of X_{i} only. We then obtain an MSE of (⌈log d⌉ + r − 1)·(1/n²) ∑_{i} ‖X_{i} − μ_{i}1‖², with μ_{i} ∈ ℝ being the average of the elements of X_{i}, and 1 the all-ones vector in ℝ^{d}. Note that this bound improves upon the performance of the method of Suresh et al. at a comparable communication cost.

Table 1 | Summary of achievable communication cost C_{α,β} and estimation error MSE_{α,γ}, for various choices of the probability p. Here R = (1/n²) ∑_{i} ‖X_{i} − μ_{i}1‖².

| Protocol | p | C_{α,β} | MSE_{α,γ} |
|---|---|---|---|
| Example 5 (Full) | 1 | n(d(⌈log d⌉ + r) + r) | 0 |
| Example 6 (Log) | 1/log d | n(d(⌈log d⌉ + r)/log d + r) | (log d − 1)·R |
| Example 7 (1-bit) | 1/(⌈log d⌉ + r) | n(d + r) | (⌈log d⌉ + r − 1)·R |
| Example 9 (below 1-bit) | ϵ/(⌈log d⌉ + r) | n(ϵd + r) | ((⌈log d⌉ + r)/ϵ − 1)·R |

While the above already improves upon the state of the art, the improved results are in fact obtained for a suboptimal choice of the parameters of our method (constant probabilities p_{ij}, and node centers fixed to the mean μ_{i}). One can decrease the MSE further by optimizing over the probabilities and/or node centers (see section 6). However, apart from a very-low-communication regime in which we have a closed-form expression for the optimal probabilities, the problem needs to be solved numerically, and hence we do not have expressions for how much improvement is possible. We illustrate the effect of fixed and optimal probabilities on the trade-off between communication cost and MSE experimentally on a few selected datasets in section 6 (see Figure 1).

Figure 1 | Trade-off between communication cost and estimation error for vectors X_{i} with entries drawn in an i.i.d. fashion from Gaussian, Laplace, and χ^{2} distributions, from left to right. The black cross marks the performance of binary quantization (Example 4).

In section 2 we formalize the concepts of encoding and decoding protocols. In section 3 we describe a parametric family of randomized (and unbiased) encoding protocols and give a simple formula for the mean squared error. Subsequently, in section 4 we formalize the notion of communication cost, and describe several communication protocols, which are optimal under different circumstances. We give simple instantiations of our protocol in section 5, illustrating the trade-off between communication cost and accuracy. In section 6 we address the question of the optimal choice of parameters of our protocol. Finally, in section 7 we comment on possible extensions we leave to future work.

In this work we consider (randomized) encoding and communication protocols. Node i encodes its vector X_{i} using the encoding protocol, which we denote α, and sends the result Y_{i} = α(X_{i}) to the server using a communication protocol β; by β(Y_{i}) we denote the number of bits that need to be transferred under β. The server then estimates the average X̄ by applying a decoding protocol γ to the received messages.

The objective of this work is to study the trade-off between the (expected) number of bits that need to be communicated, and the accuracy of the resulting estimate of X̄.

In this work we focus on encoders which are unbiased, in the following sense.

Definition 2.1 (Unbiased and Independent Encoder): We say that the encoder α is unbiased if E_{α}[α(X_{i})] = X_{i} for all i. We say that it is independent if α(X_{i}) is independent from α(X_{j}) for all i ≠ j.

Example 1 (Identity Encoder): Consider the identity encoder α(X_{i}) = X_{i}. It is both unbiased and independent. However, this encoder does not lead to any savings in communication.

Other examples of unbiased and independent encoders include the protocols introduced in section 3, as well as existing techniques from the literature.

We now formalize the notion of accuracy of estimating the average X̄.

Definition 2.2 (Estimation Error / Mean Squared Error): The estimation error (mean squared error) of the pair (α, γ) is the quantity MSE_{α,γ} = E_{α,γ}[‖X̄ − γ(Y_{1}, …, Y_{n})‖²].

To illustrate the above concept, we now give a few examples:

The next example generalizes the identity encoder and averaging decoder.

Let Q: ℝ^{d} → ℝ^{d} be linear and invertible. Then we can set α(X_{i}) = QX_{i} and γ(Y_{1}, …, Y_{n}) = Q^{−1}((1/n) ∑_{i=1}^{n} Y_{i}), so that the decoder recovers (1/n) ∑_{i} X_{i} = X̄ exactly,

and hence the MSE of (α, γ) is zero.

We shall now prove a simple result for unbiased and independent encoders used in subsequent sections.

Lemma 2.3 (Unbiased and Independent Encoder + Averaging Decoder): If the encoder α is unbiased and independent, and γ is the averaging decoder, then MSE_{α,γ} = (1/n²) ∑_{i=1}^{n} E_{α}[‖α(X_{i}) − X_{i}‖²].

Proof: Note that E_{α}[Y_{i}] = X_{i} for all i, so that MSE_{α,γ} = E_{α}[‖(1/n) ∑_{i}(Y_{i} − X_{i})‖²] (*)= (1/n²) ∑_{i} E_{α}[‖Y_{i} − X_{i}‖²],

where (*) follows from unbiasedness and independence. □
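The identity in Lemma 2.3 can be verified numerically. The sketch below uses a toy additive-Gaussian-noise encoder (not one of the protocols studied in this paper) purely as an example of an unbiased, independent encoder:

```python
import numpy as np

# Toy unbiased, independent encoder: add independent zero-mean Gaussian noise
# (variance sigma_i^2 per entry) to node i's vector, then apply the
# averaging decoder at the server.
rng = np.random.default_rng(4)
n, d = 3, 4
X = rng.normal(size=(n, d))
sigma = np.array([0.5, 1.0, 2.0])

trials = 200_000
noise = rng.normal(size=(trials, n, d)) * sigma[None, :, None]
estimates = (X[None] + noise).mean(axis=1)        # averaging decoder
emp_mse = ((estimates - X.mean(axis=0)) ** 2).sum(axis=1).mean()
theory = d * (sigma ** 2).sum() / n ** 2          # (1/n^2) sum_i E||alpha(X_i) - X_i||^2
print(emp_mse, theory)
```

The empirical MSE matches the formula of Lemma 2.3, since here E‖α(X_{i}) − X_{i}‖² = dσ_{i}².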

One may wish to define the encoder as a combination of two or more separate encoders: α(X_{i}) = α_{2}(α_{1}(X_{i})). See Suresh et al., where α_{1} is a random rotation and α_{2} is binary quantization.

Let X_{i} = (X_{i}(1), …, X_{i}(d)) denote the entries of vector X_{i}. In addition, with each node i we associate a parameter μ_{i} ∈ ℝ. We refer to μ_{i} as the center of data at node i, or simply the node center.

We shall define two encoding protocols: the first produces an encoded vector Y_{i} whose support is of random size, while the second produces Y_{i} with support of a fixed size.

With each pair (i, j) we associate a parameter 0 < p_{ij} ≤ 1, representing a probability. The collection of parameters {p_{ij}, μ_{i}} defines an encoding protocol α as follows: Y_{i}(j) = (X_{i}(j) − μ_{i})/p_{ij} + μ_{i} with probability p_{ij}, and Y_{i}(j) = μ_{i} with probability 1 − p_{ij}.  (1)

We could allow p_{ij} to be zero, in which case we would have Y_{i}(j) = μ_{i} with probability 1. This raises issues such as potential lack of unbiasedness, which can be resolved, but only at the expense of a larger-than-reasonable notational overload.
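For concreteness, a minimal Python sketch of this encoder: with probability p_{ij} it keeps coordinate j, rescaled by 1/p_{ij} around the node center (the rescaling that makes the encoder unbiased and matches the error formula in Lemma 3.2); otherwise it transmits μ_{i}.

```python
import numpy as np

def encode(x, p, mu, rng):
    """Keep coordinate j (rescaled by 1/p[j] around the centre mu) with
    probability p[j]; otherwise send the node centre mu."""
    keep = rng.random(x.shape) < p
    y = np.full_like(x, mu)
    y[keep] = (x[keep] - mu) / p[keep] + mu
    return y

rng = np.random.default_rng(0)
d = 5
x = rng.normal(size=d)
p = np.full(d, 0.25)
mu = float(x.mean())

# Monte Carlo check of unbiasedness and of the per-coordinate error
# (1/p_j - 1)(x_j - mu)^2.
samples = np.stack([encode(x, p, mu, rng) for _ in range(100_000)])
bias = np.abs(samples.mean(axis=0) - x).max()
emp_var = ((samples - x) ** 2).mean(axis=0)
theory = (1 / p - 1) * (x - mu) ** 2
print(bias, np.abs(emp_var - theory).max())  # both close to 0
```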

In the rest of this section, let γ be the averaging decoder (Example 2). Since γ is fixed and deterministic, we shall for simplicity write _{α}[·] instead of _{α, γ}[·]. Similarly, we shall write _{α}(·) instead of _{α, γ}(·).

We now prove two lemmas describing properties of the encoding protocol α. Lemma 3.1 states that the protocol yields an unbiased estimate of the average

Lemma 3.1 (Unbiasedness): The encoder α defined in (1) is unbiased. That is, E_{α}[α(X_{i})] = X_{i} for all i, and hence E_{α}[Ȳ] = X̄, where Ȳ = (1/n) ∑_{i} Y_{i}.

Proof: For every i and j we have E_{α}[Y_{i}(j)] = p_{ij}((X_{i}(j) − μ_{i})/p_{ij} + μ_{i}) + (1 − p_{ij})μ_{i} = X_{i}(j),

and the claim is proved. □

Lemma 3.2 (Mean Squared Error): Let α = α(p_{ij}, μ_{i}) be the encoder defined in (1). Then MSE_{α,γ} = (1/n²) ∑_{i=1}^{n} ∑_{j=1}^{d} (1/p_{ij} − 1)(X_{i}(j) − μ_{i})².  (2)

For any i and j we have E_{α}[(Y_{i}(j) − X_{i}(j))²] = (1/p_{ij} − 1)(X_{i}(j) − μ_{i})².

It suffices to substitute the above into (3). □

Here we propose an alternative encoding protocol, one with deterministic support size. As we shall see later, this results in deterministic communication cost.

Let σ_{k}(d) denote the set of all subsets of {1, …, d} of cardinality k. The encoder picks a set S_{i} ∈ σ_{k}(d) uniformly at random and sets Y_{i}(j) = (d/k)(X_{i}(j) − μ_{i}) + μ_{i} for j ∈ S_{i}, and Y_{i}(j) = μ_{i} otherwise.  (4)

Note that by design, the size of the support of Y_{i} is always |S_{i}| = k. This is in contrast with protocol (1) with p_{ij} = k/d, for which the support size equals k in expectation only.

As for the data-dependent protocol, we prove basic properties. The proofs are similar to those of Lemmas 3.1 and 3.2 and we defer them to Appendix

Lemma 3.3 (Unbiasedness): The encoder α defined in (4) is unbiased. That is, E_{α}[α(X_{i})] = X_{i} for all i, and hence E_{α}[Ȳ] = X̄.

Lemma 3.4 (Mean Squared Error): Let α = α(k, μ_{i}) be the encoder defined in (4). Then MSE_{α,γ} = (1/n²)(d/k − 1) ∑_{i=1}^{n} ‖X_{i} − μ_{i}1‖².
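The fixed-support-size encoder can be sketched similarly; the d/k rescaling below is the choice that makes the encoder unbiased (each coordinate is selected with probability k/d):

```python
import numpy as np

def encode_fixed(x, k, mu, rng):
    """Pick a uniformly random subset S of k coordinates; rescale those by
    d/k around the node centre mu, and set the remaining coordinates to mu."""
    d = x.size
    s = rng.choice(d, size=k, replace=False)
    y = np.full_like(x, mu)
    y[s] = (d / k) * (x[s] - mu) + mu
    return y

rng = np.random.default_rng(1)
d, k = 6, 2
x = rng.normal(size=d)
mu = float(x.mean())

# Monte Carlo check: unbiasedness, and per-node error (d/k - 1)||x - mu*1||^2.
samples = np.stack([encode_fixed(x, k, mu, rng) for _ in range(100_000)])
bias = np.abs(samples.mean(axis=0) - x).max()
emp_err = ((samples - x) ** 2).sum(axis=1).mean()
theory = (d / k - 1) * ((x - mu) ** 2).sum()
print(bias, emp_err, theory)
```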

Having defined the encoding protocols α, we need to specify the way the encoded vectors Y_{i} = α(X_{i}), for i = 1, …, n, are communicated to the server. We write β(Y_{i}) to denote the (expected) number of bits that are communicated by node i. Since Y_{i} = α(X_{i}) is in general not deterministic, β(Y_{i}) can be a random variable.

Definition 4.1 (Communication Cost): The communication cost of the protocol (α, β) is the total expected number of bits communicated to the server: C_{α,β} = ∑_{i=1}^{n} E_{α}[β(Y_{i})].

Given X_{i}, a good communication protocol is able to encode Y_{i} = α(X_{i}) using only a few bits. Throughout, let r denote the number of bits used to represent a single floating point number, and let S_{i} = {j : Y_{i}(j) ≠ μ_{i}} denote the support of the encoded vector Y_{i}.

In the rest of this section we describe several communication protocols β and calculate their communication cost.

Represent Y_{i} = α(X_{i}) naively as a dense vector of d floating point values, so that β(α(X_{i})) = dr and the total communication cost is C_{α,β} = ndr.

We will use a single variable for every element of the vector Y_{i}, which does not have constant size. The first bit decides whether the value represents μ_{i} or not. If yes, that is the end of the variable; if not, the next r bits represent the value Y_{i}(j). Additionally, we need to communicate the node center μ_{i} itself, which takes r bits.^{2} The expected cost is therefore

where 1_{e} is the indicator function of the event e.

In the special case when p_{ij} = p for all i and j, the expected communication cost reduces to C_{α,β} = n(d + r + pdr).

We can represent Y_{i} as a sparse vector; that is, as a list of pairs (j, Y_{i}(j)) for the elements j with Y_{i}(j) ≠ μ_{i}. The number of bits needed to represent each pair is ⌈log d⌉ + r: ⌈log d⌉ bits for the index j and r bits for the value Y_{i}(j). Additionally, we have to communicate the value of μ_{i} to the server, which takes r bits.

Summing up over the nodes, the expected communication cost is C_{α,β} = ∑_{i=1}^{n} ((⌈log d⌉ + r) ∑_{j=1}^{d} p_{ij} + r).

In the special case when p_{ij} = p for all i and j, this reduces to C_{α,β} = n(pd(⌈log d⌉ + r) + r).

Let the support of Y_{i} consist of the indices j_{1} < j_{2} < ⋯ < j_{k}. Further, let us denote j_{0} = 0. We can then use a variant of variable-length quantity encoding: instead of the absolute indices j_{t}, we communicate the gaps j_{t} − j_{t−1}, each encoded with a variable number of bits, which reduces the cost whenever consecutive support indices are close to each other.
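A minimal sketch of such gap-based variable-length encoding (byte-aligned for simplicity, whereas the analysis above counts individual bits):

```python
def varint(n: int) -> bytes:
    """Variable-length quantity: 7 payload bits per byte; the high bit of a
    byte signals that another byte follows."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)
        else:
            out.append(byte)
            return bytes(out)

def encode_support(indices):
    """Delta-encode sorted support indices j_1 < ... < j_k (with j_0 = 0):
    transmit the gaps j_t - j_{t-1}, each as a varint, so that small gaps
    cost a single byte regardless of the dimension d."""
    out, prev = bytearray(), 0
    for j in indices:
        out += varint(j - prev)
        prev = j
    return bytes(out)

print(len(encode_support([3, 10, 11, 100])))  # 4 gaps, each < 128 -> 4 bytes
```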

We now describe a sparse communication protocol compatible with the fixed-length encoder defined in (4). Note that the set S_{i} is selected independently of the values of X_{i}; the values Y_{i}(j) for j ∈ S_{i} are rescaled entries of X_{i} and do not depend on any probabilities p_{ij}.

In particular, we represent Y_{i} as a vector containing the list of the values Y_{i}(j) for j ∈ S_{i}, ordered by j, together with a representation of the set S_{i} and the node center μ_{i} (using r bits). Since |S_{i}| = k deterministically, the communication cost is deterministic as well.

In the case of the variable-size-support encoding protocol (1) with p_{ij} = k/d for all i and j, the expected support size is k, and the expected communication cost matches that of the fixed-size protocol.

If the elements of Y_{i} take only two different values, it suffices to communicate a single bit per element together with the two values themselves, i.e., d + 2r bits per node.

In the above, we have presented several communication protocols of different complexity. However, it is not possible to claim that any one of them is the most efficient: which communication protocol is best depends on the specifics of the encoding protocol used. Consider the extreme case of the encoding protocol (1) with p_{ij} = 1 for all i and j: the encoded vectors are fully dense, and the naive protocol is cheaper than any sparse representation.

However, in the interesting case of a small communication budget, the sparse communication protocols are the most efficient. Therefore, in the following sections, we focus primarily on optimizing performance under these protocols.

In this section, we highlight several instantiations of our protocols, recovering existing techniques and formulating novel ones. We comment on the resulting trade-offs between communication cost and estimation error.

We start by recovering an existing method, which turns every element of the vector X_{i} into a particular binary representation.

By setting μ_{i} = min_{j} X_{i}(j) and p_{ij} = (X_{i}(j) − μ_{i})/(max_{j} X_{i}(j) − μ_{i}) (assuming X_{i} is not a constant vector, so that the denominator is nonzero), we exactly recover the quantization algorithm proposed in Suresh et al.

Using the formula (2) for the encoding protocol α, we get

This exactly recovers the MSE bound established in Suresh et al. The associated communication cost is a single bit per element of X_{i}, plus two real-valued scalars (11).
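For concreteness, the recovered binary quantization can be sketched directly; this is a sketch of the scheme as described above (each element becomes the maximum with probability proportional to its distance from the minimum), not reference code from Suresh et al.:

```python
import numpy as np

def binary_quantize(x, rng):
    """Each coordinate becomes x_max with probability
    (x_j - x_min) / (x_max - x_min), and x_min otherwise; unbiased."""
    lo, hi = x.min(), x.max()
    p = (x - lo) / (hi - lo)
    return np.where(rng.random(x.shape) < p, hi, lo)

rng = np.random.default_rng(2)
x = rng.normal(size=4)
samples = np.stack([binary_quantize(x, rng) for _ in range(100_000)])
print(np.abs(samples.mean(axis=0) - x).max())  # unbiased: close to 0
```

Each encoded vector takes one bit per element plus the two scalars x_min and x_max, matching the cost stated above.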

Now we move to comparing the communication costs and estimation error of various instantiations of the encoding protocols, utilizing the deterministic sparse communication protocol and uniform probabilities.

For the remainder of this section, let us only consider instantiations of our protocol in which p_{ij} = p for all i and j, and the node centers are set to the averages μ_{i} = (1/d) ∑_{j} X_{i}(j).

The properties of the following examples follow from Equations (2) to (10). When considering the communication costs of the protocols, keep in mind that the trivial benchmark is C_{α,β} = ndr with MSE_{α,γ} = 0, attained by communicating the vectors X_{i} exactly.

In this case (p = 1), the encoding protocol is lossless, which ensures MSE_{α,γ} = 0.

This protocol order-wise matches the MSE of the method of Suresh et al.; however, where their method communicates a constant number of bits per element of X_{i}, this protocol communicates roughly (⌈log d⌉ + r)/log d bits per element of X_{i}. Finally, note that the factor log d − 1 in the MSE grows only slowly with the dimension d.

This protocol communicates in expectation a single bit per element of X_{i} (plus the additional r bits for the node center μ_{i}), with MSE given in Table 1.

This alternative protocol attains in expectation exactly a single bit per element of X_{i}, with a slightly more complicated MSE expression.

This protocol attains the MSE of the protocol in Example 4 while at the same time communicating on average significantly less than a single bit per element of X_{i}.

We summarize these examples in Table 1.

Using the deterministic sparse protocol, there is an obvious lower bound on the communication cost: the nr bits needed to communicate the node centers. This cost can be avoided by choosing a data-independent node center μ_{i}, such as 0; setting p close to zero then drives the cost to C_{α,β} = ϵ for arbitrarily small ϵ, at the cost of an exploding estimation error.

Note that all of the above examples have random communication costs. What we present is the expected communication cost.

Here we consider (α, β, γ), where α = α(p_{ij}, μ_{i}) is the encoder defined in (1), β is the associated sparse communication protocol, and γ is the averaging decoder. Recall from Lemma 3.2 and (8) that the mean squared error and communication cost are given by:

Having these closed-form formulae as functions of the parameters {p_{ij}, μ_{i}}, we can now ask questions such as:

Given a communication budget, which encoding protocol has the smallest mean squared error?

Given a bound on the mean squared error, which encoder suffers the minimal communication cost?

Let us now address the first question; the second question can be handled in a similar fashion. In particular, consider the optimization problem

where the optimization variables are the probabilities p_{ij} and the node centers μ_{i}, and the constraint bounds the communication cost.

Note that while the constraints in (14) are convex (they are linear), the objective is not jointly convex in {p_{ij}, μ_{i}}. However, the objective is convex in {p_{ij}} and convex in {μ_{i}}. This suggests a simple alternating minimization approach:

Fix the probabilities and optimize over the node centers.

Fix the node centers and optimize over probabilities.

These two steps are repeated until a suitable convergence criterion is reached. Note that the first step has a closed-form solution. Indeed, the problem decomposes across the node centers into n separate problems of the form min_{μ_{i}} ∑_{j} (1/p_{ij} − 1)(X_{i}(j) − μ_{i})², whose solution is the weighted average μ_{i} = ∑_{j} w_{ij}X_{i}(j)/∑_{j} w_{ij} with weights w_{ij} = 1/p_{ij} − 1.

The second step does not have a closed form solution in general; we provide an analysis of this step in section 6.1.
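The closed-form first step can be sketched in a few lines; the weighted-average formula below follows by setting the derivative of ∑_{j} (1/p_{j} − 1)(x_{j} − μ)² with respect to μ to zero:

```python
import numpy as np

def optimal_center(x, p):
    """Minimize sum_j (1/p_j - 1)(x_j - mu)^2 over mu: the minimizer is the
    weighted average of x with weights w_j = 1/p_j - 1."""
    w = 1.0 / p - 1.0
    return float(np.dot(w, x) / w.sum())

x = np.array([1.0, 2.0, 10.0])
p = np.array([0.9, 0.9, 0.1])
mu = optimal_center(x, p)
# Coordinates that are rarely transmitted (small p_j) pull the centre towards
# themselves, since they are replaced by mu most of the time.
print(mu)
```

Note that with uniform probabilities the weights are equal, so the optimal center reduces to the plain average of the entries of X_{i}.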

It is possible to construct an upper bound on the objective which is jointly convex in {p_{ij}, μ_{i}}. We may therefore instead optimize this upper bound by a suitable convex optimization algorithm.

Instead of a single global budget, one may prescribe per-node budgets B_{1}, …, B_{n} and require node i to communicate at most B_{i} bits in expectation.

Let the node centers μ_{i} be fixed. Problem (14) (or, equivalently, step 2 of the alternating minimization method described above) then takes the form

Let S = {(i, j) : X_{i}(j) ≠ μ_{i}}. Notice that as long as the budget allows us to set p_{ij} = 1 for all (i, j) ∈ S and p_{ij} = 0 for all (i, j) ∉ S,^{3} we obtain MSE_{α,γ} = 0. Hence, we can without loss of generality assume that the budget is small enough for this not to be possible.

While we are not able to derive a closed-form solution to this problem in general, we can formulate upper and lower bounds on the optimal estimation error, given a bound on the communication cost.

Theorem 6.1 (MSE-Optimal Protocols Subject to a Communication Budget): Consider problem (17) and fix any communication budget. Then the optimal probabilities are proportional to the quantities a_{ij}, clipped at 1,

and the mean squared error satisfies the bounds

where a_{ij} = |X_{i}(j) − μ_{i}| and S = {(i, j) : a_{ij} ≠ 0}.

p_{ij} = min{1, a_{ij}/θ},

where θ > 0 is a normalization constant chosen so that the communication budget holds with equality. In the regime where no constraint p_{ij} ≤ 1 is active, the optimal solution satisfies a_{ij}/p_{ij} = θ > 0 for all (i, j) ∈ S,

where we additionally require p_{ij} ≤ 1 for all (i, j); the bounds on the optimal MSE then follow from the Cauchy–Schwarz inequality applied to (∑_{ij} a_{ij})².
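The structure suggested by Theorem 6.1 (probabilities proportional to a_{ij} = |X_{i}(j) − μ_{i}|, clipped at 1) can be sketched as follows; the clipping-and-redistribution loop is an assumption about how the cap is handled, and the budget is assumed smaller than the number of nonzero entries:

```python
import numpy as np

def optimal_probabilities(a, budget):
    """p proportional to the scores a = |X_i(j) - mu_i|, clipped at 1:
    entries that hit the cap are fixed at 1 and the rule is re-applied to
    the remaining entries, so that sum(p) equals the given budget."""
    a = np.asarray(a, dtype=float)
    p = np.zeros_like(a)
    free = a > 0
    remaining = float(budget)
    while True:
        theta = a[free].sum() / remaining
        cand = a / theta
        capped = free & (cand >= 1.0)
        if not capped.any():
            p[free] = cand[free]
            return p
        p[capped] = 1.0
        remaining -= capped.sum()
        free &= ~capped

a = np.array([4.0, 1.0, 1.0, 0.5])
p = optimal_probabilities(a, budget=2.5)
print(p, p.sum())  # p = [1, 0.6, 0.6, 0.3], sum = 2.5
```

The uncapped entries keep a common ratio a_{ij}/p_{ij} = θ, matching the optimality condition in the theorem.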

To illustrate the trade-offs between communication cost and estimation error (MSE) achievable by the protocols discussed in this section, we present simple numerical examples in Figure 1, for several strategies of choosing the parameters p_{ij} and μ_{i}. In particular, we consider:

uniform probabilities p_{ij} = p for all i and j, with average node centers,

optimal probabilities p_{ij} with average node centers (green dotted line),

optimal probabilities with optimal node centers, obtained via the alternating minimization approach described above (red solid line).

In order to put a scale on the horizontal axis, we fixed a value of r, the number of bits used to represent a floating point number, and generated vectors X_{i} with entries drawn in an i.i.d. fashion from Gaussian, Laplace, and chi-squared (χ^{2}(2)) distributions, respectively. As we can see, in the case of non-symmetric distributions, it is not necessarily optimal to set the node centers to averages.

As expected, for fixed node centers, optimizing over probabilities results in improved performance, across the entire trade-off curve. That is, the curve shifts downwards. In the first two plots based on data from symmetric distributions (Gaussian and Laplace), the average node centers are nearly optimal, which explains why the red solid and green dotted lines coalesce. This can be also established formally. In the third plot, based on the non-symmetric chi-squared data, optimizing over node centers leads to further improvement, which gets more pronounced with increased communication budget. It is possible to generate data where the difference between any pair of the three trade-off curves becomes arbitrarily large.

Finally, the black cross represents performance of the quantization protocol from Example 4. This approach appears as a single point in the trade-off space due to lack of any parameters to be fine-tuned.

In this section we outline further ideas worth consideration. However, we leave a detailed analysis to future work.

We can generalize the binary encoding protocol (1) to an encoding protocol with multiple candidate values, in which each element X_{i}(j) is encoded using one of several values rather than only the rescaled value and the node center μ_{i}.

Let the collection of parameters be extended accordingly; we refer to the resulting encoding protocol as (21).

It is straightforward to generalize Lemmas 3.1 and 3.2 to this case. We omit the proofs for brevity.

Lemma 7.1 (Unbiasedness): The encoder α defined in (21) is unbiased. That is, E_{α}[α(X_{i})] = X_{i} for all i, and hence E_{α}[Ȳ] = X̄.

Lemma 7.2 (Mean Squared Error): Let α be the encoder defined in (21). Then the mean squared error admits a closed-form expression analogous to (2).

We expect the generalized protocol to offer a better trade-off between communication cost and estimation error, at the price of a more involved parameter optimization problem.

Following the idea proposed in Suresh et al., one can consider an encoder α_{Q} which arises as the composition of a random rotation Q, applied to X_{i} for all i, followed by the encoder α defined in (1): that is, Z_{i} = QX_{i} and Y_{i} = α(Z_{i}).

With this protocol we associate the decoder γ_{Q}(Y_{1}, …, Y_{n}) = Q^{−1}((1/n) ∑_{i=1}^{n} Y_{i}).

This approach is motivated by the following observation: a random rotation can be identified by a single random seed, which is easy to communicate to the server without the need to communicate all floating point entries defining Q. On the other hand, the multiplication by Q, and by Q^{−1} in particular, can incur a significant computational overhead. The randomized Hadamard transform used in Suresh et al. mitigates this, since it can be applied in O(d log d) time. The MSE of the composite protocol is the MSE of (1) evaluated on the rotated data Z_{i} = QX_{i} instead of {X_{i}}; the outer expectation is taken over the choice of Q. The goal is to transform the data {X_{i}} into new data {Z_{i}} with better MSE, in expectation.
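A dense-matrix sketch of the rotation idea (a QR-based Haar-random rotation is used here for clarity; the actual proposal of Suresh et al. relies on the structured randomized Hadamard transform, which is much cheaper to apply):

```python
import numpy as np

def random_rotation(d, rng):
    """Haar-random orthogonal matrix via QR decomposition of a Gaussian
    matrix; the sign fix makes the distribution uniform over rotations."""
    q, r = np.linalg.qr(rng.normal(size=(d, d)))
    return q * np.sign(np.diag(r))

rng = np.random.default_rng(3)
d = 8
Q = random_rotation(d, rng)
x = rng.normal(size=d)
z = Q @ x          # encode the rotated data Z_i = Q X_i ...
x_back = Q.T @ z   # ... and decode with Q^{-1} = Q^T after averaging
print(np.abs(x_back - x).max())  # close to 0
```

Since only the random seed identifying Q needs to be shared, the rotation adds essentially no communication cost.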

From now on, for simplicity assume the node centers are set to the average, i.e., μ_{i} = (1/d) ∑_{j} X_{i}(j). For x ∈ ℝ^{d}, define

where p_{ij} = p for all i and j.

It is interesting to investigate whether choosing Q to be a suitable random rotation can decrease this quantity in expectation.

If so, composing the encoder with the rotation yields a smaller MSE.

This is the case for the quantization protocol proposed in Suresh et al.

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

JK acknowledges support from Google via a Google European Doctoral Fellowship. Work done while at University of Edinburgh, currently at Google. PR acknowledges support from Amazon, and the EPSRC Grant EP/K02325X/1, Accelerated Coordinate Descent Methods for Big Data Optimization and EPSRC Fellowship EP/N005538/1, Randomized Algorithms for Extreme Convex Optimization.

In this section we provide proofs of Lemmas 3.3 and 3.4, describing properties of the encoding protocol α defined in (4). For completeness, we also repeat the statements.

Lemma A.1 (Unbiasedness): The encoder α defined in (4) is unbiased. That is, E_{α}[α(X_{i})] = X_{i} for all i, and hence E_{α}[Ȳ] = X̄.

Proof: For any j we have E_{α}[Y_{i}(j)] = (k/d)((d/k)(X_{i}(j) − μ_{i}) + μ_{i}) + (1 − k/d)μ_{i} = X_{i}(j),

and the claim is proved. □

Lemma A.2 (Mean Squared Error): Let α = α(k, μ_{i}) be the encoder defined in (4). Then MSE_{α,γ} = (1/n²)(d/k − 1) ∑_{i=1}^{n} ‖X_{i} − μ_{i}1‖².

Further, for any i and j we have E_{α}[(Y_{i}(j) − X_{i}(j))²] = (d/k − 1)(X_{i}(j) − μ_{i})².

It suffices to substitute the above into (A1). □

^{1}See Remark 4.

^{2}The distinction here is because μ_{i} can be chosen to be data independent, such as 0, in which case we do not have to communicate it at all (i.e., the additional r bits are not needed).

^{3}We interpret 0/0 as 0 and do not worry about infeasibility. These issues can be properly formalized by allowing p_{ij} to be zero in the encoding protocol and in (17). However, handling this singular situation requires a notational overload which we are not willing to pay.