
Edited by: Slavica Jonic, IMPMC, Sorbonne Universités - CNRS UMR 7590, UPMC Univ Paris 6, MNHN, IRD UMR 206, France

Reviewed by: Carlos Oscar Sanchez Sorzano, National Center of Biotechnology (CSIC), Spain; Ion Grama, University of Southern Brittany, France

*Correspondence: Julio A. Kovacs

This article was submitted to Biophysics, a section of the journal Frontiers in Molecular Biosciences

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) or licensor are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

Our development of a Fast (Mutual) Information Matching (FIM) of molecular dynamics time series data led us to the general problem of how to accurately estimate the probability density function of a random variable, especially in cases of very uneven samples. Here, we propose a novel Balanced Adaptive Density Estimation (BADE) method that effectively optimizes the amount of smoothing at each point. To do this, BADE relies on an efficient nearest-neighbor search which results in good scaling for large data sizes. Our tests on simulated data show that BADE exhibits equal or better accuracy than existing methods, and visual tests on univariate and bivariate experimental data show that the results are also aesthetically pleasing. This is due in part to the use of a visual criterion for setting the smoothing level of the density estimate. Our results suggest that BADE offers an attractive new take on the fundamental density estimation problem in statistics. We have applied it to molecular dynamics simulations of membrane pore formation. We also expect BADE to be generally useful for low-dimensional applications in other statistical application domains such as bioinformatics, signal processing and econometrics.

R^{*}-tree

One of the most popular non-parametric density estimation methods is kernel density estimation (KDE), in which the estimate is an average of kernel functions K: ℝ^{d} → ℝ centered at the sample points, with c_{d} a normalizing constant that depends on the dimension d. The bandwidth may be fixed, or it may depend on the sample point X_{j} (“sample point estimator”) or on the test point x (“balloon estimator”).
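As a point of reference, a fixed-bandwidth Gaussian KDE can be sketched in a few lines; the Gaussian kernel and the bandwidth value below are illustrative choices, not the estimator developed later in this paper:

```python
import numpy as np

def kde_fixed(x_grid, samples, h):
    """Fixed-bandwidth 1-D Gaussian KDE: f(x) = (1/(n h)) * sum_j K((x - X_j)/h)."""
    x = np.asarray(x_grid, dtype=float)[:, None]    # shape (m, 1)
    s = np.asarray(samples, dtype=float)[None, :]   # shape (1, n)
    z = (x - s) / h
    k = np.exp(-0.5 * z**2) / np.sqrt(2.0 * np.pi)  # Gaussian kernel values
    return k.mean(axis=1) / h

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=2000)   # illustrative sample
grid = np.linspace(-4.0, 4.0, 81)
dens = kde_fixed(grid, data, h=0.3)      # density values on the grid
```

Note how the same h is used everywhere: too small a value produces spurious bumps in the tails, too large a value washes out sharp peaks, which is precisely the tension that variable-bandwidth methods address.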

Originally, we adopted a fixed-bandwidth KDE approach in our recent application to Fast (Mutual) Information Matching (FIM) of molecular dynamics time series data (Kovacs and Wriggers).

The situation with regard to variable-bandwidth KDE methods is less well developed. In fact, it has not been easy to make significant performance improvements by allowing the bandwidth to vary from point to point (Farmen and Marron).

One of the earliest alternative approaches to improving the performance of variable bandwidth estimators was proposed by Sain and Scott.

Attempts at alleviating the mentioned limitations include a class of methods that use convex combinations (i.e., linear combinations with non-negative coefficients adding up to 1) or mixtures of densities of certain types, such as the method of Vapnik and Mukherjee.

Several other interesting ideas have also been put forward, for instance by Katkovnik and Shmulevich.

Motivated by the various limitations of previous methods, here we propose a novel approach, which we call “BADE” (for Balanced Adaptive Density Estimation), that offers several desirable features, including good scaling for large data sizes (sublinear complexity in the number of data points n).

Let P be a probability density function, P: ℝ^{d} → ℝ. Let Σ_{P} be the covariance matrix of the corresponding distribution.

Unlike most of the previous approaches, we do not use a kernel-based estimation approach. Instead, the basic idea is the following: for each probe point x ∈ ℝ^{d} where we want to estimate the density, we determine the set N_{k}(x) of its k nearest sample points, and base the estimate on the number and spread of the points in this set.

Of course, this expression is very reminiscent of the original proposal of Loftsgaarden and Quesenberry, which estimates the density at x from the number of neighbors k and the volume occupied by N_{k}(x).
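The classical Loftsgaarden–Quesenberry estimator that this idea echoes can be sketched as follows (1-D case; the uniform test sample and the choice of k are illustrative):

```python
import numpy as np

def knn_density_1d(x, samples, k):
    """Loftsgaarden-Quesenberry k-NN estimate: f(x) ~ k / (n * V_k(x)),
    where V_k(x) = 2 * r_k(x) is the length of the smallest interval
    around x containing the k nearest sample points."""
    s = np.asarray(samples, dtype=float)
    r_k = np.sort(np.abs(s - x))[k - 1]   # distance to the k-th nearest neighbor
    return k / (s.size * 2.0 * r_k)

rng = np.random.default_rng(0)
data = rng.uniform(0.0, 1.0, size=5000)   # true density is 1 on [0, 1]
est = knn_density_1d(0.5, data, k=50)     # should be close to 1
```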

The basic idea described above still suffers from a number of drawbacks. First, as with the method of Loftsgaarden and Quesenberry, the raw estimate is not smooth: the set N_{k}(x) changes abruptly as x moves, since sample points enter and leave N_{k}(x) discontinuously.

We note that, due to the exponential decay of the Gaussian, only the points of N_{k}(x) closest to x contribute appreciably, which leads to the notion of an “effective” number of neighbors k_{e}(x).

A second drawback of our basic idea concerns how large the neighbor set N_{k}(x) should be, i.e., how to choose k. Two natural extremes are:

Volume = const. This would yield a histogram-like estimate, with the same resolution everywhere.

Volume = const/f. This is equivalent to keeping the number of neighbors k constant, since k ≈ n f · Volume.

We found that neither of these extremes produces good density estimates: a constant volume is essentially like a histogram: it will not resolve sharp enough peaks, and will yield zero in regions where the sample points are widely spread; a constant k, on the other hand, produces overly large neighborhoods in low-density regions, oversmoothing the tails.

However, the geometric mean of both extremes offers a good compromise:

Volume = const/√f

For each probe point, the size of N_{k}(x) is thus determined by requiring the volume of N_{k}(x) to match this compromise value, through a constant C_{2} (which depends on n).

(C_{0} = 3.59)

The constant C_{2} in Equation (8) depends on the sample size n.

We write C_{2} in terms of a new constant C_{0}:

It turns out that C_{0} does not depend on n. To determine C_{0}, we ourselves examined by eye the density estimates resulting from an array of values of C_{0}, for various simulated densities. For each density, we chose the value of C_{0} that yielded a density estimate that did not look undersmoothed. Even though this visual criterion might seem rather subjective, the resulting values of C_{0} were consistent across the simulated densities.

It is interesting to compare the expression for C_{0} for dimensions 1 and 2.

A more theoretical justification of the expression for C_{0} would probably be related to how the human visual system processes information. One possible approach could be the addition of a regularization term that would emulate visual perception. An intriguing link to the standard MISE theory in kernel density estimation is that the optimal bandwidth, in the 1-dimensional case, is proportional to n^{−1/5}, which mirrors the dependence on n that appears in our expression for C_{2} in terms of C_{0}.

We emphasize that this visual criterion was used only as a premise to determine the optimal dependence (on n) of the constant C_{2}. This optimal dependence is determined once and for all; the user does not need to make any choices. However, the user could, with discretion, vary the coefficients in the formula for C_{0} (Equation 11), to obtain density estimates with a greater or lesser amount of smoothing than that provided by the values in Equation (11). As a rule of thumb, our visual tests (not shown) suggest keeping the variation within a factor of 2 of the stated values.
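To illustrate the flavor of this balancing act, the following sketch chooses a per-point k from a constant-k pilot pass so that the neighborhood volume behaves like const/√f. The pilot rule, the value of k0, and the clipping bounds are illustrative assumptions, not the paper's calibrated constants:

```python
import numpy as np

def adaptive_knn_density_1d(x_grid, samples, k0):
    """Two-pass k-NN density estimate.  Pass 1: constant-k pilot.
    Pass 2: k(x) proportional to sqrt(pilot density), so that the
    neighborhood volume scales like const / sqrt(f(x))."""
    s = np.asarray(samples, dtype=float)
    n = s.size

    def knn_est(x, k):
        r_k = np.sort(np.abs(s - x))[k - 1]   # distance to k-th nearest neighbor
        return k / (n * 2.0 * r_k)

    pilot = np.array([knn_est(x, k0) for x in x_grid])
    # k ~ sqrt(f): smaller neighborhoods in the tails than constant-k,
    # larger than constant-volume.
    ks = np.clip((k0 * np.sqrt(pilot / pilot.max())).astype(int), 2, n)
    return np.array([knn_est(x, int(k)) for x, k in zip(x_grid, ks)])

rng = np.random.default_rng(0)
data = rng.normal(0.0, 1.0, size=4000)
grid = np.linspace(-4.0, 4.0, 101)
dens = adaptive_knn_density_1d(grid, data, k0=120)
```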

To further improve the visual appeal of the density estimate given by Equation (5), we added an optional smoothing step to our method. The smoothing procedure was inspired by that of Brewer. Given the covariance matrices Σ(x_{j}) at the sample points x_{j} (omitting for clarity the subindex that indicated the number of nearest neighbors used), we define the smoothed precision matrices by:

Thus, the contribution of each covariance matrix is in accordance with the value of the Gaussian function defined by it at each of the grid points. This equation shows that the smoothing can be considered local, in the sense that points x_{j} where Σ(x_{j}) is large (where the density is low), or which are far from x_{i}, contribute little; only points that are close to x_{i} and have a small Σ(x_{j}) will contribute significantly to the smoothed precision matrix at x_{i}.
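A 1-D sketch of this weighting scheme, where the covariance matrices reduce to scalar variances, might look as follows; the averaging of precisions weighted by each point's own Gaussian is the idea being illustrated, and the exact normalization of the paper's Equation (12) may differ:

```python
import numpy as np

def smooth_precisions_1d(grid_pts, centers, variances):
    """At each grid point x_i, average the precisions 1/sigma_j^2,
    weighting each by the Gaussian that (x_j, sigma_j^2) defines,
    evaluated at x_i.  Far-away points, and points with large variance
    (low density), contribute little."""
    x = np.asarray(grid_pts, dtype=float)[:, None]     # (m, 1)
    c = np.asarray(centers, dtype=float)[None, :]      # (1, n)
    v = np.asarray(variances, dtype=float)[None, :]    # (1, n)
    w = np.exp(-0.5 * (x - c) ** 2 / v) / np.sqrt(v)   # Gaussian weights
    return (w / v).sum(axis=1) / w.sum(axis=1)         # weighted mean precision

# Sanity check: with equal variances, the smoothed precision must equal
# the common precision 1/0.5 = 2, regardless of the weights.
out = smooth_precisions_1d([0.0, 1.0], [0.0, 2.0, -1.0], [0.5, 0.5, 0.5])
```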

Since both the smoothing step just described and the main step (Equation 5) are local, we see that our method does not suffer from the non-locality issues that affect, for instance, one version of Abramson's square-root method (basically, extreme tail sample points affect the density estimate elsewhere too much; see Terrell and Scott).

Finally, we also need the smoothed version of the “effective” number of neighbors k_{e}(x):

Then, the smoothed version of the density estimate is given by

We analyze separately the two steps of our method: the main estimator (Equation 5) and the (optional) covariance-smoothing step (Section 2.4).

The first, main step requires the incremental retrieval, for each test point, of its nearest neighbors. For this we use the R^{*}-tree data structure (Beckmann et al.), retrieving a number of neighbors of order O(n^{1/2}) (Loftsgaarden and Quesenberry).
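For comparison, the naive retrieval strategy simply sorts all points by distance to each probe. The snippet below sketches it in 2-D, together with the neighbor covariance that the estimator consumes; the data and the choice of k are illustrative:

```python
import numpy as np

def nearest_neighbors_naive(probe, pts, k):
    """O(n log n)-per-probe retrieval by sorting all points by distance
    to the probe.  The paper replaces this with an R*-tree to obtain
    incremental, much cheaper retrieval for large n."""
    d2 = np.sum((pts - probe) ** 2, axis=1)   # squared distances to probe
    idx = np.argsort(d2)[:k]
    return pts[idx], np.sqrt(d2[idx])

rng = np.random.default_rng(0)
pts = rng.normal(size=(5000, 2))
nbrs, dists = nearest_neighbors_naive(np.zeros(2), pts, k=50)
cov = np.cov(nbrs.T)   # covariance of the neighbor set used by the estimator
```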

Figure shows timing results for data sizes up to 10^{8}. The data sets were artificially generated to simulate a bimodal distribution, shown in Figure . The approaches compared are: (a) FIM; (b) BADE-RST, using the R^{*}-tree to retrieve nearest neighbors; (c) BADE-naive, using a naive way to retrieve nearest neighbors (i.e., by sorting the data points according to their distances to each probe point). We can see that FIM and BADE-RST have very similar asymptotics. In fact, FIM has a complexity of

As for the second step (covariance smoothing), Equations (12) and (14) tell us that the cost will be

In order to evaluate the accuracy of BADE, we performed statistics of the ISE (Integrated Squared Error) for simulated samples taken from known distributions (Figures
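The ISE itself is straightforward to compute once an estimate has been evaluated on a grid; a minimal sketch, with an assumed standard-normal reference density rather than the paper's test densities, is:

```python
import numpy as np

def ise(true_pdf, est_values, grid):
    """Integrated squared error between estimated density values on a
    grid and the true density, by a simple Riemann sum."""
    diff = est_values - true_pdf(grid)
    return np.sum(diff ** 2) * (grid[1] - grid[0])

normal_pdf = lambda x: np.exp(-0.5 * x**2) / np.sqrt(2.0 * np.pi)
grid = np.linspace(-5.0, 5.0, 501)
err_zero = ise(normal_pdf, normal_pdf(grid), grid)         # exact match
err_shift = ise(normal_pdf, normal_pdf(grid - 1.0), grid)  # shifted estimate
```

Repeating this over many simulated samples gives the ISE statistics (mean, spread) used to compare estimators.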

Also, we considered some real data sets to compare the density estimates of BADE with those of previous methods (Figures

The three univariate simulated densities, all Gaussian mixtures, on which we tested our method are shown in Figure
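Samples from such mixture test densities can be generated component by component; the particular weights, means, and widths below are illustrative, not the paper's three test mixtures:

```python
import numpy as np

def sample_gaussian_mixture(n, weights, means, sds, rng):
    """Draw n samples from a 1-D Gaussian mixture by first choosing a
    component for each sample, then drawing from that component."""
    w = np.asarray(weights, dtype=float)
    comp = rng.choice(len(w), size=n, p=w / w.sum())   # component labels
    return rng.normal(np.asarray(means)[comp], np.asarray(sds)[comp])

rng = np.random.default_rng(0)
data = sample_gaussian_mixture(10000, [0.6, 0.4], [-2.0, 2.0], [0.5, 1.0], rng)
```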

We compared the ISE statistics of our method, for each of the above three densities, with those of Hazelton.

Even though the differences in accuracy seem to be small in some cases, even a small consistent difference can be considered significant in this problem, as it has been difficult to make performance improvements in density estimation even when moving from fixed-bandwidth to variable-bandwidth methods (Terrell and Scott).

We tested our method on three univariate real data sets. Although not related to our intended application domain of molecular dynamics, the three data sets are widely used in the relevant statistics literature so that we can compare results easily with those from other methods:

Results for the Old Faithful data set are shown in Figure , together with the effective number of neighbors k_{e} as a function of the probe point and the spread of N_{k}(x). We see that regions of large k_{e} correspond to regions of small Σ_{k}, and vice versa.

The density estimates for the suicide data set are shown in Figure . The growth of k_{e} is slow in the long tail, consistent with the large Σ_{k} values in that region.

The Hidalgo stamp comparison between BADE and LS is shown in Figure 8. In contrast with the Old Faithful example, here BADE's estimate looks more smoothed than LS's, but otherwise the position and number of modes are the same for both methods. This is interesting in connection with the results of the analysis carried out by Brewer.

The three bivariate simulated densities, all Gaussian mixtures, on which we tested our method are shown in Figure

Results of ISE statistics comparing our method with that of Zougab et al.

We tested our method on two bivariate real data sets, and compared the results with those from other methods. Again we chose data from outside our intended application in molecular dynamics to better compare with the available statistics literature:

Results for the bivariate Old Faithful data set are shown in Figure , together with k_{e} and the area of the covariance ellipse defined by Σ_{k} (before the covariance-smoothing step). As in the univariate case, we see that regions of large k_{e} correspond to regions of small Σ_{k}, and vice versa.

Figure caption: k_{e} on a grid of size 100 × 100, and Σ_{k} on the same grid. Both k_{e} and Σ_{k} are the ones before applying the covariance-smoothing step.

The density estimates for the UNICEF data set, computed with our method and BABM (Zougab et al.), are shown together with k_{e} and Σ_{k} (before covariance smoothing).

Figure caption: k_{e} on a grid of size 100 × 100, and Σ_{k} on the same grid. Both k_{e} and Σ_{k} are the ones before applying the covariance-smoothing step.

We have implemented a novel adaptive density-estimation approach suitable for our statistical evaluation of membrane simulations in Wriggers et al.

Unlike most well-known density estimation methods, ours is not based on kernels. Rather, it estimates the density at a given point directly, using the information contained in the sets of nearest sample points.

We note that, especially in the context of fixed-bandwidth kernels, the covariance matrix could be considered a parameter which depends on the data. However, since in our approach it is not a fixed value, but rather a function of the point, we do not call it a parameter. Rather, the parameters are the coefficients in Equation (11), which are fixed (except for the optional smoothing variation) and do not depend on the data.

BADE is well suited for large data sizes. Methods that center a kernel function at each sample point become very expensive as the data size grows. Instead, BADE relies only on nearest-neighbor information, whose average required number grows only as O(n^{1/2}).

Our method is free of restrictions on the bandwidth matrices, such as diagonal or scalar. In fact, we are no longer dealing with “bandwidth” matrices, but covariance matrices of sets of nearest neighbors.

BADE has been defined for data of any dimension; however, we have worked out the constants and made tests only for dimensions 1 and 2. It is most efficient in low dimensions, due to the need to compute nearest neighbors. For this, it takes advantage of the R^{*}-tree data structure (Beckmann et al.). In higher dimensions, the R^{*}-tree data structure becomes less efficient due to the increasing relative volume of the “corners” of the hyperrectangles, and so better-adapted data structures would be preferable in this case (see Hjaltason and Samet).

Our method was validated, both in the univariate and the bivariate settings, by ISE analyses on simulated densities. These analyses consisted of generating a number of simulated samples (500 for the univariate case, 100 for the bivariate case) and measuring the integrated squared error (ISE) between the density estimated from each sample and the actual density function. The ISE statistics were compared with similar results from previous approaches that were among the best available. In most cases we obtained lower errors, and in the remaining few cases the performance was virtually identical.

The apparent synergy between objective (low ISE) and subjective (visual appeal) criteria in our algorithm is a curious phenomenon that has also been observed by other researchers, such as Farmen and Marron.

The optional covariance-smoothing step in BADE yields very visually appealing density estimates, as our real-data examples show, but it is not strictly necessary if all that is needed is a density estimate for further calculations. For instance, one of the applications for which we need bivariate density estimates is the computation of Mutual Information. In this case we do not need visually appealing functions, and thus we can save significant compute time by skipping the smoothing step.

At this time the algorithm is implemented as a C program. It will be freely disseminated as part of release 1.5 of our software package

The mathematical theory of BADE was designed by JK. Experimental test data sets used in this work were prepared by CH. The larger project (including the accompanying paper) was supervised by WW. The paper was written by JK and WW.

This work was supported in part by the Frank Batten endowment and by National Institutes of Health grant R01GM62968 (to WW).

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

Beckmann, N., Kriegel, H.-P., Schneider, R., and Seeger, B. (1990). The R^{*}-tree: an efficient and robust access method for points and rectangles. In Proceedings of the 1990 ACM SIGMOD International Conference on Management of Data.