
Edited by: Ke Shi, Old Dominion University, United States

Reviewed by: Jianjun Wang, Southwest University, China; Alex Cloninger, University of California, San Diego, United States

This article was submitted to Mathematics of Computation and Data Science, a section of the journal Frontiers in Applied Mathematics and Statistics

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Dealing with massive data is a challenging task for machine learning. An important aspect of machine learning is function approximation. In the context of massive data, some of the commonly used tools for this purpose are sparsity, divide-and-conquer, and distributed learning. In this paper, we develop a very general theory of approximation by networks, which we have called eignets, to achieve local, stratified approximation. The very massive nature of the data allows us to use these eignets to solve inverse problems, such as finding a good approximation to the probability law that governs the data and finding the local smoothness of the target function near different points in the domain. In fact, we develop a wavelet-like representation using our eignets. Our theory is applicable to approximation on a general locally compact metric measure space. Special examples include approximation by periodic basis functions on the torus, zonal function networks on a Euclidean sphere (including smooth ReLU networks), Gaussian networks, and approximation on manifolds. We construct pre-fabricated networks so that no data-based training is required for the approximation.

Rapid advances in technology have led to the availability of massive data and the need to analyze it. The problem arises in almost every area of life, from medical science to homeland security to finance. An immediate problem in dealing with a massive data set is that it cannot be stored in computer memory; we therefore have to deal with the data piecemeal, keeping access to external memory to a minimum. The other challenge is to devise efficient numerical algorithms to overcome difficulties, for example, in using the customary optimization problems in machine learning. On the other hand, the very availability of a massive data set should also lead to opportunities to solve some problems heretofore considered unmanageable. For example, deep learning often requires a large amount of training data, which, in turn, helps us to figure out the granularity in the data. Apart from deep learning, distributed learning is also a popular way of dealing with big data. A good survey with the taxonomy for dealing with massive data was recently conducted by Zhou et al. [

As pointed out in Cucker and Smale [, a common assumption is that the data is sampled from a probability distribution μ^{*} supported on a smooth, compact, and connected Riemannian manifold; for simplicity, one often assumes even that μ^{*} is the Riemannian volume measure for the manifold, normalized to be a probability measure. Following (e.g., [

A bottleneck in this theory is the computation of the eigendecomposition of a matrix, which is necessarily huge in the case of big data. Kernel-based methods have been used also in connection with approximation on manifolds (e.g., [

It is also possible that the manifold hypothesis does not hold, and there is a recent work [

Our motivation comes from some recent works on distributed learning by Zhou et al. [

The highlights of this paper are the following.

In order to avoid an explicit, data-dependent eigendecomposition, we introduce the notion of an eignet, which generalizes several radial basis function and zonal function networks. We construct pre-fabricated eignets, whose linear combinations can be constructed just by using the noisy values of the target function as the coefficients, to yield the desired approximation.

Our theory generalizes the results in a number of examples used commonly in machine learning, some of which we will describe in section 2.

The use of optimization methods, such as empirical risk minimization, has an intrinsic difficulty: the minimizer of this risk may have no connection with the approximation error. There are also other problems, such as local minima, saddle points, and speed of convergence, that need to be taken into account, and the massive nature of the data makes this an even more challenging task. Our results do not depend upon any kind of optimization in order to determine the necessary approximation.

We develop a theory for local approximation using eignets so that only a relatively small amount of data is used to approximate the target function on any ball of the space, the data being sub-sampled using a distribution supported on a neighborhood of that ball. The accuracy of approximation adjusts itself automatically to the local smoothness of the target function on the ball.

In typical machine learning algorithms, it is customary to assume a prior on the target function, called a smoothness class in approximation-theory parlance. Our theory demonstrates clearly how massive data can actually help to solve the inverse problem of determining the local smoothness of the target function, using a wavelet-like representation based solely on the data.

Our results allow one to solve the inverse problem of estimating the probability density from which the data is chosen. In contrast to the statistical approaches that we are aware of, there is no limitation on how accurate the approximation can be asymptotically in terms of the number of samples; the accuracy is determined entirely by the smoothness of the density function.
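The density-estimation idea can be sketched numerically on the circle (the setting of Example 2.1): with a low-pass trigonometric kernel Φ_n, the sample average (1/M) Σ_j Φ_n(x, x_j) estimates σ_n applied to the density, and the accuracy is governed by the smoothness of the density rather than by a fixed statistical rate. The piecewise-linear filter h below is an illustrative choice, not the paper's exact smooth low-pass function.

```python
import numpy as np

rng = np.random.default_rng(0)

def phi_n(x, y, n):
    """Phi_n(x, y) = sum_{|k| < n} h(|k|/n) e^{ik(x-y)}
                   = 1 + 2 sum_{k=1}^{n-1} h(k/n) cos(k(x-y))."""
    k = np.arange(1, n)
    h = np.clip(2.0 - 2.0 * k / n, 0.0, 1.0)   # illustrative low-pass filter
    return 1.0 + 2.0 * np.cos(np.outer(x - y, k)) @ h

# M samples drawn from the uniform density f0 = 1 (w.r.t. normalized measure)
M, n = 20_000, 16
samples = rng.uniform(0.0, 2.0 * np.pi, M)
grid = np.linspace(0.0, 2.0 * np.pi, 8, endpoint=False)

# Monte Carlo estimate of sigma_n(f0): should be close to f0 = 1 everywhere
f_hat = np.array([phi_n(x, samples, n).mean() for x in grid])
```

With a smoother density the same estimator converges at a rate limited only by the density's smoothness, which is the point of the contrast drawn above.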

All our estimates are given in terms of probability of the error being small rather than the expected value of some loss function being small.

This paper is abstract, theoretical, and technical. In section 2, we present a number of examples that are generalized by our set-up. The abstract set-up, together with the necessary definitions and assumptions, is discussed in section 3. The main results are stated in section 4 and proved in section 8. The proofs require a great deal of preparation, which is presented in sections 5–7. The results in these sections are not all new. Many of them are new only in some nuance. For example, we have proved in section 7 the quadrature formulas required in the construction of our pre-fabricated networks in a probabilistic setting, and we have substituted the estimates on gradients used in our previous works by a certain Lipschitz condition, which makes sense without a differentiability structure on the manifold. Our Theorem 7.1 generalizes most of our previous results in this direction, with the exception of [

In this paper, we aim to develop a unifying theory applicable to a variety of kernels and domains. In this section, we describe some examples which have motivated the abstract theory to be presented in the rest of the paper. In the following examples,

Let 𝕋^{q} = ℝ^{q}/(2πℤ^{q}) be the q-dimensional torus. The distance between x = (x_{1}, ⋯, x_{q}) and y = (y_{1}, ⋯, y_{q}) is the geodesic distance, and the system {exp(ik · x): k ∈ ℤ^{q}} is orthonormal with respect to the Lebesgue measure normalized to be a probability measure on 𝕋^{q}. We recall that the periodization of a function G: ℝ^{q} → ℝ is defined formally by G^{○}(x) = Σ_{m ∈ ℤ^{q}} G(x + 2πm); by the Poisson summation formula, the k-th Fourier coefficient of G^{○} on 𝕋^{q} is the same as the Fourier transform of G at k. This Fourier coefficient will be denoted by

Periodization of the Gaussian.
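A minimal numerical sketch of this periodization (the width parameter t is an assumed choice): because of the Gaussian decay, the lattice sum converges extremely fast, so a small truncation already gives a 2π-periodic function to machine precision.

```python
import numpy as np

def periodized_gaussian(x, t=1.0, terms=8):
    """G°(x) = sum_{m in Z} exp(-(x - 2*pi*m)^2 / (4t)), truncated at |m| <= terms.
    The neglected tail is O(exp(-c * terms^2)), negligible even for small terms."""
    m = np.arange(-terms, terms + 1)
    return np.exp(-np.subtract.outer(x, 2.0 * np.pi * m) ** 2 / (4.0 * t)).sum(axis=-1)

x = np.linspace(0.0, 2.0 * np.pi, 5)
vals = periodized_gaussian(x)        # positive, 2*pi-periodic values
```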

Periodization of the Hardy multiquadric^{1}

The cube [−1, 1]^{q} can be thought of as a quotient space of 𝕋^{q} in which all points of the form {(ε_{1}θ_{1}, ⋯, ε_{q}θ_{q})}, ε_{j} ∈ {−1, 1}, are identified. A function on [−1, 1]^{q} can then be lifted to 𝕋^{q}, and this lifting preserves all the smoothness properties of the function. Our set-up below includes [−1, 1]^{q}, where the distance and the measure are defined via the mapping to the torus, and suitably weighted Jacobi polynomials are considered to be the orthonormalized family of functions. In particular, if f is a function on [−1, 1]^{q} with an expansion in the p_{k}'s, where the p_{k}'s are tensor product, orthonormalized, Chebyshev polynomials, then the coefficients in this expansion have the same asymptotic behavior as the Fourier coefficients of the lifted function.

Let 𝕊^{q} denote the unit sphere of ℝ^{q+1}. The dimension of 𝕊^{q} as a manifold is q, and the geodesic distance on 𝕊^{q} and the volume measure μ^{*} are normalized so that μ^{*} is a probability measure. We refer the reader to Müller [ for details. Restrictions to 𝕊^{q} of polynomials of degree < n in q + 1 variables are called spherical polynomials of degree < n. The space of restrictions to 𝕊^{q} of homogeneous harmonic polynomials of degree ℓ is denoted by ℍ_{ℓ} with dimension d_{ℓ}. There is an orthonormal basis of ℍ_{ℓ} that satisfies an addition formula

where ω_{q−1} is the volume of 𝕊^{q−1}, and p_{ℓ} is the degree-ℓ ultraspherical polynomial, so that the family {p_{ℓ}} is orthonormalized with respect to the weight (1 − t^{2})^{(q−2)/2} on (−1, 1). A zonal function on the sphere has the form

In particular, formally,

It is shown in Müller [

It is shown in Mhaskar et al. [

The smooth ReLU function is t ↦ log(1 + e^{t}).
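The zonal-network structure above can be sketched concretely for q = 2, where the addition formula involves Legendre polynomials: a zonal network depends on its input only through inner products with the centers. The Legendre coefficients b of the activation φ and the network weights below are hypothetical, chosen only for illustration.

```python
import numpy as np
from numpy.polynomial import legendre

def zonal_network(x, centers, a, b):
    """Evaluate x -> sum_k a[k] * phi(<x, centers[k]>) on the sphere S^2,
    where phi(t) = sum_l b[l] P_l(t) is given by its Legendre coefficients b."""
    t = centers @ x                       # inner products <x, x_k>, all in [-1, 1]
    return a @ legendre.legval(t, b)

rng = np.random.default_rng(0)
centers = rng.normal(size=(10, 3))
centers /= np.linalg.norm(centers, axis=1, keepdims=True)   # project onto S^2
a = rng.normal(size=10)                   # hypothetical coefficients
x = np.array([0.0, 0.0, 1.0])
val = zonal_network(x, centers, a, b=np.array([0.5, 0.3, 0.1]))
```

With b = (1, 0, 0, …), φ ≡ 1 and the network collapses to the sum of its coefficients, a convenient sanity check.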

Let 𝕏 be a smooth, compact, and connected Riemannian manifold, μ^{*} be the Riemannian volume measure normalized to be a probability measure, {λ_{k}} be the sequence of eigenvalues of the (negative) Laplace-Beltrami operator on 𝕏, and ϕ_{k} be the eigenfunction corresponding to the eigenvalue λ_{k}; in particular, ϕ_{0} ≡ 1. This example, of course, includes Examples 2.1–2.3. An eignet in this context has the form

□
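On the circle, where λ_k = |k| and the φ_k are the trigonometric monomials, an eignet kernel G(x, y) = Σ_k b(λ_k) φ_k(x) φ_k(y) and the corresponding eignet x ↦ Σ_j a_j G(x, y_j) can be sketched directly. The mask b(λ) = exp(−λ²) and the centers/coefficients are assumed choices; any fast-decaying mask gives a smooth kernel.

```python
import numpy as np

def eignet_kernel(x, y, b, K=40):
    """G(x, y) = sum_{|k| <= K} b(|k|) e^{ik(x-y)}
               = b(0) + 2 sum_{k=1}^{K} b(k) cos(k(x-y)).
    Fast decay of the mask b makes G a smooth, symmetric kernel."""
    k = np.arange(1, K + 1)
    d = np.subtract.outer(np.atleast_1d(x), np.atleast_1d(y))
    return b(0) + 2.0 * np.cos(d[..., None] * k) @ b(k)

b = lambda lam: np.exp(-np.asarray(lam, dtype=float) ** 2)   # assumed mask
centers = np.array([0.0, 1.0, 4.0])                          # hypothetical y_j
coeffs = np.array([1.0, -0.5, 2.0])                          # hypothetical a_j

# the eignet x -> sum_j a_j G(x, y_j)
eignet = lambda x: eignet_kernel(x, centers, b) @ coeffs
```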

Let 𝕏 = ℝ^{q}, ρ be the ℓ^{∞} norm on 𝕏, and μ^{*} be the Lebesgue measure. For any multi-integer k, the Hermite function ϕ_{k} is defined via the generating function

The system {ϕ_{k}} is orthonormal with respect to μ^{*}, and satisfies

where Δ is the Laplacian operator. As a consequence of the so-called Mehler identity, one obtains [

A Gaussian network is a network of the form
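Concretely, a Gaussian network x ↦ Σ_j a_j exp(−|x − x_j|²/h²) with prefabricated centers jh on a grid can approximate a smooth function using the samples f(jh) directly as coefficients, with no training. The normalization 1/√π below comes from Σ_j exp(−((x − jh)/h)²) ≈ √π. This quasi-interpolation sketch is illustrative, not the paper's exact construction.

```python
import numpy as np

h = 0.1
grid = np.arange(-3.0, 3.0 + h / 2, h)       # prefabricated centers jh

def gaussian_network(f, x):
    """x -> (1/sqrt(pi)) * sum_j f(jh) * exp(-((x - jh)/h)^2):
    samples of f serve directly as network coefficients (no optimization)."""
    return np.exp(-((x - grid) / h) ** 2) @ f(grid) / np.sqrt(np.pi)

err = abs(gaussian_network(np.sin, 0.37) - np.sin(0.37))   # small, O(h^2)
```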

Let 𝕏 be a connected, locally compact metric space with metric ρ. For

If

{

For a Borel measure ν on 𝕏 (signed or positive), we denote by |ν| its total variation measure defined for Borel subsets

where the supremum is over all countable measurable partitions ^{2}

The symbol ^{p}(ν, _{p, ν, K} < ∞, with the usual convention that two functions are considered equal if they are equal |ν|-almost everywhere on _{0}(

We fix a non-decreasing sequence {λ_{k}} with λ_{0} = 0 and λ_{k} ↑ ∞ as k → ∞, a measure μ^{*} on 𝕏, and a system of orthonormal functions {ϕ_{k}} ⊂ C_{0}(𝕏).

It is convenient to write Π_{n} = {0} if n ≤ 0 and Π_{∞} = ⋃_{n>0}Π_{n}. It will be assumed in the sequel that Π_{∞} is dense in C_{0} (and, thus, in every X^{p}, 1 ≤ p < ∞). We define the degree of approximation of f from Π_{∞} as

_{n}} _{n}} _{1}, _{2} > 0

^{*}({

_{1}, κ_{2} > 0

_{n} ⊂ 𝕏 _{n})

_{n} ⊆ 𝕂_{m} for all

_{n} = 𝕏 for all ^{*} is a probability measure, and ϕ_{0} ≡ 1. □

Let 𝕏 be a smooth, compact, and connected Riemannian manifold, μ^{*} be the Riemannian volume measure normalized to be a probability measure, {λ_{k}} be the sequence of eigenvalues of the (negative) Laplace-Beltrami operator on 𝕏, and ϕ_{k} be the eigenfunction corresponding to the eigenvalue λ_{k}; in particular, ϕ_{0} ≡ 1. If the condition (3.2) is satisfied, then

_{n} = 𝔹(

In the rest of this paper, we assume 𝕏 to be a data space. Different theorems will require some additional assumptions, two of which we now enumerate. Not every theorem will need all of these; we will state explicitly which theorem uses which assumptions, apart from 𝕏 being a data space.

The first of these deals with the product of two diffusion polynomials. We do not know of any situation where it is not satisfied, but we are not able to prove it in general.

^{*} ≥ 1

_{n},

_{n}, then _{0} for some _{2n}. So, the product assumption holds trivially. The strong product assumption does not hold. However, if _{n}, then

ϕ_{k} are eigenfunctions of a more general elliptic operator. Since the results in these two papers are qualitatively similar, we will comment on Lu et al. [

In this remark only, let _{k}, λ_{j} < _{An}(2, ϕ_{k}ϕ_{j}) [see (3.6) below for definition] with

While this gives some insight into the product assumption, the results are inconclusive about the product assumption as stated. Also, it is hard to verify whether the conditions mentioned in the paper are satisfied for a given manifold.

In Lu et al. [_{k}, ϕ_{j} ∈ Π_{n}, _{An} for any _{k}ϕ_{j}} is ^{2}), the result is meaningful only if 0 < δ < 1 and ϵ ≥ ^{1−1/δ}.

In Geller and Pesenson [

In our results in section 4, we will need the following condition, which serves the purpose of gradient in many of our earlier theorems on manifolds.

_{n} > 0

_{n} =

We define next the smoothness classes of interest here.

We find it convenient to denote by X^{p} the space X^{p} = L^{p}(𝕏) if 1 ≤ p < ∞, and X^{∞} = C_{0}(𝕏).

^{p}(𝕏)

_{γ,p,w} _{Wγ,p,w} < ∞.

_{0} ∈ 𝕏_{γ,p,w}(_{0}) _{γ,p,w}.

_{γ,p} are available in terms of constructive properties of the functions, such as the number of derivatives, estimates on certain moduli of smoothness or ^{∞} coincides with the class of infinitely differentiable functions vanishing at infinity. □

We can now state another assumption that will be needed in studying local approximation.

^{∞}

_{k}, _{k} ∈ 𝕏.

_{k, r}(

We record some obvious observations about the partition of unity, omitting the simple proof.

_{1}, ⋯ _{1}, κ_{2}_{1}_{2}
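A concrete partition of unity on an interval can be built by normalizing translates of a C^∞ bump: each ψ_k is supported near its center, ψ_k ≥ 0, and Σ_k ψ_k = 1 wherever the translates cover. The unit spacing and unit support radius below are illustrative choices.

```python
import numpy as np

def bump(t):
    """A C-infinity function supported on (-1, 1)."""
    t = np.asarray(t, dtype=float)
    out = np.zeros_like(t)
    inside = np.abs(t) < 1.0
    out[inside] = np.exp(-1.0 / (1.0 - t[inside] ** 2))
    return out

centers = np.arange(-6, 7)                    # spacing 1, support radius 1
x = np.linspace(-3.0, 3.0, 601)
w = np.stack([bump(x - c) for c in centers])  # overlapping translates, sum > 0
psi = w / w.sum(axis=0)                       # psi_k >= 0 and sum_k psi_k = 1
```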

We end this section by defining a kernel that plays a central role in this theory.

Let

If

The following proposition recalls an important property of these kernels. Proposition 3.2 is proven in Maggioni and Mhaskar [

In the sequel, let

We will omit the mention of _{n}(_{n}(^{*}. In particular,

where for

.

In this section, we describe the terminology involving measures.

_{R, d}

For example, ^{*} itself is in _{0} with _{0}(𝕏) then the measure ^{*} is _{0} with

_{n}} _{n}|(𝕏)}

_{n}} _{n}|(𝕏)}

_{n}

In the case when 𝕏 is compact, a well-known result, Tchakaloff's theorem [

_{n}} is an admissible quadrature measure sequence, then

_{a} yields admissible quadrature measures of order ^{q} (in fact, of [−^{q} for an appropriate _{1}. □

The notion of an eignet defined below is a generalization of the various kernels described in the examples in section 2.

^{*} = ^{*}(^{*}

_{k} ∈ 𝕏.

^{q} × ℝ^{q} is a smooth kernel, with λ_{k} = |_{1}, ϕ_{k} as in Example 2.5, and

_{0} :[0, ∞) → ℝ satisfy |_{0}(_{1}(_{1} as stipulated in that definition. The function _{2} = _{1} is then a smooth mask and so is _{1}. Let _{0}(_{2}(_{1}(_{2} and once with _{1} to obtain a corresponding result for _{0} with different constants. For this reason, we will simplify our presentation by assuming the apparently restrictive conditions stipulated in Definition 3.10. In particular, this includes the example of the smooth ReLU network described in Example 2.3. □

_{0}(𝕏 × 𝕏)

_{n}(ν) with _{n} in a constructive manner with the number of neurons as stipulated in that theorem. □

In this section, we assume the Bernstein-Lipschitz condition (Definition 3.4) in all the theorems. We note that the measure ^{*} may not be a probability measure. Therefore, we use an auxiliary function _{0} to define a probability measure as follows. Let _{0} ∈ _{0}(𝕏), _{0} ≥ 0 for all ^{*} is 0-regular, and ^{*} being the marginal distribution of

It is easy to verify using Fubini's theorem that if

Let _{n}} be an admissible product quadrature sequence in the sense of Definition 3.9. We define [cf. (3.20)]

where ^{*} is as in Definition 3.10.
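On the circle, an admissible quadrature measure is simply the equispaced rule, which integrates trigonometric polynomials of degree < N exactly; discretizing σ_n with it reproduces every diffusion polynomial that the filter leaves untouched. The piecewise-linear filter h is again an illustrative stand-in for the paper's smooth low-pass function.

```python
import numpy as np

n, N = 8, 64
nodes = 2.0 * np.pi * np.arange(N) / N       # equispaced nodes, weights 1/N
k = np.arange(1, n)
h = np.clip(2.0 - 2.0 * k / n, 0.0, 1.0)     # filter equal to 1 on [0, 1/2]

def sigma_n(f, x):
    """(1/N) * sum_j Phi_n(x, y_j) f(y_j): the integral defining sigma_n(f)
    replaced by the equispaced quadrature rule (exact for degree < N)."""
    phi = 1.0 + 2.0 * np.cos(np.outer(x - nodes, k)) @ h
    return phi @ f(nodes) / N

# degree-2 input is inside the flat part of the filter, so it is reproduced
val = sigma_n(lambda t: np.cos(2.0 * t), 0.7)
```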

_{n} are prefabricated independently of the data. The network

Our first theorem describes local function recovery using local sampling. We may interpret it in the spirit of distributed learning as in Chui et al. [_{n} using the function values themselves as the coefficients. The networks 𝔾_{n} have essentially the same localization property as the kernels Φ_{n} (cf. Theorem 8.2).

_{0} ∈ 𝕏 and ^{∞} _{0}, 3_{0}, _{0} = ψ/𝔪, _{0}_{γ, ∞},

_{1}, ⋯, _{M}} is a random sample from some probability measure supported on 𝕏, _{0}(_{j})/s with each _{j}, then the probability of selecting points outside of the support of _{0} is 0. This leads to a sub-sample
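This sub-sampling step can be sketched as rejection sampling: each sample x_j is kept with probability f_0(x_j)/s, so points outside supp(f_0) are kept with probability 0, and the kept points follow the renormalized density proportional to f_0. The density f_0 below (supported on [2, 4]) and the bound s are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)

def f0(x):
    """Hypothetical weight, supported on [2, 4] and bounded by s = 1."""
    return np.where((x > 2.0) & (x < 4.0), np.sin(np.pi * (x - 2.0) / 2.0) ** 2, 0.0)

s = 1.0
M = 50_000
x = rng.uniform(0.0, 2.0 * np.pi, M)          # sample from tau (here uniform)
keep = rng.uniform(0.0, 1.0, M) < f0(x) / s   # accept with probability f0(x)/s
sub = x[keep]                                 # sub-sample inside supp(f0)
```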

Next, we state two inverse theorems. Our first theorem gives the accuracy of estimation of the density _{0} using eignets instead of positive kernels.

_{0} ∈ _{γ, ∞}, and

_{0}. □

The following theorem gives a complete characterization of the local smoothness classes using eignets. In particular, Part (b) of the following theorem gives a solution to the inverse problem of determining what smoothness class the target function belongs to near each point of 𝕏. In theory, this leads to a

_{0} ∈ _{0}(𝕏), _{0}(_{0} ∈ 𝕏, 0 < δ < 1. For each _{j} is a random sample from τ with

_{0}_{γ,∞}(_{0}) _{0}

_{0} _{0}_{γ, ∞,ϕ0}(_{0}).

We prove a lower bound on ^{*}(𝔹(

In order to prove the proposition, we recall a lemma, proved in Mhaskar [

_{d}, _{1}:[0, ∞) → [0, ∞)

Proof.

Let

The Gaussian upper bound (3.3) shows that for

Using Lemma 5.1 with ^{*},

Therefore, denoting in this proof only κ_{0} = ‖ϕ_{0}‖_{∞}, we obtain that

We now choose ^{2} so that _{4}. The estimate is clear for _{4} <

Next, we prove some results about the system {ϕ_{k}}.

_{n})

Proof.

The estimate (5.6) follows from a Tauberian theorem [

In particular,

□

Next, we prove some properties of the operators σ_{n} and diffusion polynomials. The following proposition follows easily from Lemma 5.1 and Proposition 3.2 (cf. [

and

The following lemma is well-known; a proof is given in Mhaskar [

_{1}, ν), (Ω_{2}, τ) _{1} × Ω_{2} → ℝ

_{2} → ℝ,

_{1},

_{n/2}, then σ_{n}(

^{p}

P_{n}(_{n/2} is verified easily using the fact that ^{*} in place of |ν| and 0 in place of

The estimate (5.14) follows using Lemma 5.3. The estimate (5.15) is now routine to prove. □

_{n}, 1 ≤

Proof.

Therefore, a use of Lemma 5.3 shows that

We use

_{n}, 0 <

Proof. μ^{*} is assumed to be a probability measure, but this assumption was not used in this proof. The second estimate follows easily from Proposition 5.3. □

_{1}, _{2} ∈ Π_{n}, 1 ≤

P_{n}, 1 ≤

Now, the product assumption implies that for _{k}, λ_{j} <

where

shows that (5.21) is valid for all

□

In the sequel, we write

We note that

It is clear from Theorem 5.1 that for any

with convergence in the sense of ^{p}.
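The expansion f = σ_1(f) + Σ_{j≥1} (σ_{2^j}(f) − σ_{2^{j−1}}(f)) telescopes, so its partial sums are just σ_{2^J}(f); on the circle, with a filter equal to 1 on [0, 1/2], the partial sum therefore reproduces any trigonometric polynomial exactly once 2^J exceeds twice its degree. A numerical check (same illustrative filter as before):

```python
import numpy as np

N = 256
nodes = 2.0 * np.pi * np.arange(N) / N

def sigma(f, x, n):
    """Quadrature-discretized sigma_n(f)(x) on the circle (empty filter for n <= 1
    leaves only the mean, i.e., sigma_1 is projection onto constants)."""
    k = np.arange(1, n)
    h = np.clip(2.0 - 2.0 * k / n, 0.0, 1.0)
    d = np.subtract.outer(np.atleast_1d(x), nodes)
    phi = 1.0 + 2.0 * (np.cos(d[..., None] * k) @ h)
    return phi @ f(nodes) / N

f = lambda t: np.cos(3.0 * t) + 0.5 * np.sin(t)
x = np.linspace(0.0, 2.0 * np.pi, 7)

# wavelet-like pieces: sigma_1(f) and tau_j = sigma_{2^j}(f) - sigma_{2^{j-1}}(f)
parts = [sigma(f, x, 1)] + [sigma(f, x, 2 ** j) - sigma(f, x, 2 ** (j - 1))
                            for j in range(1, 6)]
recon = np.sum(parts, axis=0)       # telescopes to sigma_32(f), which equals f here
```

The sizes of the pieces τ_j near a point are exactly what the local smoothness characterization in section 6 measures.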

^{p}, _{0} ∈ 𝕏. We assume the partition of unity and the product assumption

_{0},

_{0}

_{γ, p,ϕ0}(_{0}).

_{γ, p}(_{0}), _{0}

_{0} ≡ 1. So, the statements (b) and (c) in Theorem 6.1 provide necessary and sufficient conditions for _{γ, p}(_{0}) in terms of the local rate of convergence of the globally defined operator σ_{n}(_{j}, respectively. In the Hermite case (Example 3.2), it is shown in Mhaskar [_{γ, p,ϕ0} if and only if _{γ, p}. Therefore, the statements (b) and (c) in Theorem 6.1 provide similar necessary and sufficient conditions for _{γ, p}(_{0}) in this case as well. □

The proof of Theorem 6.1 is routine, but we sketch it for the sake of completeness.

Proof.

Part (a) is easy to prove using the definitions.

In the rest of this proof, we fix ^{∞} be supported on 𝔹. Then there exists

Further, Lemma 5.4 yields a sequence

Hence,

Thus, _{γ, p,ϕ0} for every ϕ ∈ ^{∞} supported on 𝔹, and part (b) is proved.

To prove part (c), we observe that there exists _{γ, p}. Using partition of unity [cf. Proposition 3.1(a)], we find _{0}, 2_{0}, _{0}, 2

Recalling that ψ(

This proves part (c). □

Let {Ψ_{n}:𝕏×𝕏→ℝ} be a family of kernels (not necessarily symmetric). With a slight abuse of notation, we define, when possible, for any measure ν with bounded total variation on 𝕏,

and

As usual, we will omit the mention of ν when ν = ^{*}.

Let {Ψ_{n}:𝕏 × 𝕏 → ℝ} be a sequence of kernels (not necessarily symmetric) with the property that both of the following functions of

_{0},

_{0}

_{γ, p,ϕ0}(_{0}).

_{γ, p}(_{0}), _{0}

P_{n}; _{n}(_{p} is decreasing rapidly. □

The purpose of this section is to prove the existence of admissible quadrature measures in the general set-up as in this paper. The ideas are mostly developed already in our earlier works [

If

If _{ϵ}(

In particular, by replacing

_{k} satisfying

_{n},

_{1}, ⋯, _{M} such that |_{k}| ≤ 2_{k},

_{n} = min(1/_{2n}).

In order to prove Theorem 7.1, we first recall the following theorem [^{*} is a probability measure, but this fact is not required in the proof. It is required only that ^{*}(𝔹(^{q} for 0 <

(_{y} ⊆ 𝔹(

(

(_{1} ⊆ 𝕂 be a compact subset. Then

Proof.

We observe first that it is enough to prove this theorem for sufficiently large values of _{n},

In this proof, we will write _{y}} of 𝕂 as in Theorem 7.2. The volume property implies that each _{y} contains at least one element of _{y}} as {_{k}}, so that _{k} ∈ _{k}, _{2n} such that each _{k} ⊂ 𝔹(_{k}, 36δ) and _{n},

We now let _{k} = 0,

The next step is to prove that if δ ≤ _{2n}), then

In this part of the proof, the constants denoted by _{1}, _{2}, ⋯ will retain their value until (7.9) is proved. Let _{k} ⊂ 𝔹(_{k}, 36δ), there are at most

Next, since ^{2jr/δ)q}. Using Proposition 3.2 and the fact that

Since _{ϵ}_{7}(ϵ)/_{2n}) so that, in (7.10),

Next, we observe that for any _{n},

We therefore conclude, using (7.9), that

Together with (7.8), this leads to (7.4). From the definition of _{k} = 0 if

Having proved part (a), the proof of part (b) is by now a routine application of the Hahn-Banach theorem [cf. [

We now equip ℝ^{N} with the norm ^{*} on ^{*} of ^{*} to ℝ^{N}, which, in turn, can be identified with a vector _{k} = 0 if ^{*} is an extension of ^{*}. The preservation of norms shows that |_{k}| ≤ 2_{k} if _{k}| = 0 = _{k}. This completes the proof of part (b). □

Part (c) of Theorem 7.1 follows immediately from the first two parts and the following lemma.

^{*} be a probability measure on^{*})

_{1}, ⋯, _{M}} ^{*}

P_{1}, ⋯, _{M}})>ϵ, then there exists at least one _{1}, ⋯, _{M}} = ∅. For every _{j} to be equal to 1 if _{j} ∈ 𝔹(

Since

We set the right-hand side above to δ and solve for

We assume the set-up as in section 4. Our first goal is to prove the following theorem.

^{*}, _{1}, _{2}, such that if _{1}, ϵ_{1}), ⋯, (_{M}, ϵ_{M})} is a random sample from τ, then

In order to prove this theorem, we record an observation. The following lemma is an immediate corollary of the Bernstein-Lipschitz condition and Proposition 5.3.

_{n},

Proof.

Let

Then in view of (4.2), ^{*} in place of ν,

Therefore, the Bernstein concentration inequality (B.1) implies that for any

We now note that _{j}, _{n}. Taking a finite set

Then (8.3) leads to

We set the right-hand side above equal to _{1}, _{2}). □

Before proving results regarding eignets, we record the continuity and smoothness of a “smooth kernel”

^{∞}.

P^{−s}^{*}) ≤ ^{−s}^{*} ≥ 1 and ^{*}Λ^{−s−r−1}

In this proof, let ^{q},

Using the Schwarz inequality, we conclude that

In particular, since _{0}(𝕏) (and in fact, _{0}(𝕏 × 𝕏)) and (8.5) holds with

So, there exists

Hence, for any

This shows that

In view of the convexity inequality,

Estimates (8.8) and (8.10) lead to

In turn, this implies that ^{p} for all

A fundamental fact relating the kernels Φ_{n} and the pre-fabricated eignets 𝔾_{n} is the following theorem.

_{n}}

P_{n} = _{n, x} by

By definition, _{k}/_{k} >

Since ^{*}

Therefore, for

Using Proposition 8.1 (used with Λ = ^{*}

In view of Proposition 5.4 and Proposition 5.2, we see that for any

We now conclude from (8.14) that

Since {^{*}

The theorems in section 4 all follow from the following basic theorem.

^{∞}(𝕏),

Proof.

Proof.

We observe that with the choice of _{0} as in this theorem,

Proof.

This follows directly from Theorem 8.3 by choosing

Proof.

In view of Theorem 8.3, our assumptions imply that for each

Consequently, with probability ≥ 1 − δ, we have for each

Hence, the theorem follows from Theorem 6.1. □

The original contributions presented in the study are included in the article/supplementary material; further inquiries can be directed to the corresponding author/s.

The author confirms being the sole contributor of this work and has approved it for publication.

The author declares that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Let 𝕏 be a compact and connected smooth _{i, j}(^{i, j}(

where |

Then ^{2}. Therefore, Hörmander's theorem [

In turn, [

Then [

We need the following basic facts from probability theory. Proposition B.1(a) below is a reformulation of Boucheron et al. [

_{1}, ⋯, _{M} be independent real-valued random variables such that for each j, |_{j}| ≤

_{1}, ⋯, _{M} _{k} = 1) =
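A quick simulation illustrating the role of (B.1): for independent, centered variables bounded by R with variance bound V, the Bernstein inequality controls the deviation probability of the empirical mean by 2 exp(−Mt²/(2(V + Rt/3))), and the observed frequency of large deviations stays below that bound. The constants here are the textbook ones, used only for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

M, trials, t = 100, 20_000, 0.3
R, V = 1.0, 1.0                          # |X_j| <= R and Var(X_j) <= V (Rademacher)
X = rng.choice([-1.0, 1.0], size=(trials, M))
deviation = np.abs(X.mean(axis=1))
empirical = np.mean(deviation >= t)      # empirical estimate of P(|mean| >= t)
bernstein = 2.0 * np.exp(-M * t * t / (2.0 * (V + R * t / 3.0)))
```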

^{1}A Hardy multiquadric is a function of the form ^{q}. It is one of the oft-used functions in the theory and applications of radial basis function networks. For a survey, see the paper [

^{2}|ν|−ess sup_{x ∈ 𝕂}|