Edited by: Tomas Halenka, Charles University, Czechia

Reviewed by: Nathaniel K. Newlands, Agriculture and Agri-Food Canada (AAFC), Canada; Xander Wang, University of Prince Edward Island, Canada

This article was submitted to Interdisciplinary Climate Studies, a section of the journal Frontiers in Earth Science

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

The Indus watershed is a highly populated region that contains parts of India, Pakistan, China, and Afghanistan. Changes in precipitation patterns and rates of glacial melt have significantly impacted the region in recent years, and climate change is projected to result in further serious human and environmental consequences. To understand the climate dynamics of the Indus watershed and surrounding regions, reanalysis and satellite data from products such as APHRODITE-2, TRMM, ERA5, and MERRA-2 are often used, yet these products are not always in agreement regarding critical variables such as precipitation. Here we objectively evaluate the level of agreement between precipitation from these four products. Because these data are on different spatial scales, we propose a low-rank spatio-temporal dynamic linear model for precipitation that integrates information from each of the above climate products. Specifically, we model each data source as the combination of a modified shared process, a discrepancy process, and Gaussian noise. We define the shared process at a high spatial resolution that can be upscaled according to the resolution of the observed data. Our proposed model's shared process provides a cohesive picture of monthly precipitation in the Indus watershed from 2000 to 2009, while the product-specific discrepancies provide insight into how and where the products differ from one another.

The Indus Watershed is a region that is particularly susceptible to the consequences of a changing climate (e.g., Immerzeel et al.,

One of the variables fundamental to many of the most pressing scientific questions for the climate of the Indus Watershed is precipitation. Due to the complexity of the physical processes governing precipitation, it is also a fairly difficult variable to effectively measure and model accurately in remote parts of the world, and especially in complex terrain (e.g., IPCC,

Taking some inspiration from these climate ensemble approaches, this study uses a spatio-temporal Bayesian statistical model that offers a novel way to analyze the differences and commonalities between four commonly used precipitation products in the Indus watershed region by modeling their discrepancies and associated uncertainty. Precipitation output from these same four products was also assimilated into a new monthly-resolved precipitation product for the years 2000–2009, which can be used in future analyses. We realize this shared product at a 0.25 × 0.25° spatial resolution, while realizing each data source's discrepancy process at its native spatial resolution. The decade spanning 2000–2009 was chosen as the temporal domain to provide a reasonable number of years with which to compare currently available products and test the viability of the method presented in this paper. A monthly temporal resolution was selected in order to capture precipitation climatology and seasonality for comparison to climate models. However, the methods presented in this paper could reasonably be applied to any spatial and temporal domain and resolution, provided sufficient computing resources and appropriate input data.

It is also important to note that the research presented in this paper focuses on a statistical model that combines existing precipitation data into a new product representing statistical consensus among the input data (along with the identification of discrepancies), rather than on a new climate model that incorporates the physics and dynamics of climate systems into its output. While an understanding of such physical processes is critical to the study of climate, this paper is focused instead on statistically analyzing the output of models and data products that were built with those processes in mind.

In section 2 of this paper we introduce the data products used in our analysis and discuss some of their important features. We introduce and specify our statistical model in section 3. Model results are included in section 4, and further discussion of those results is contained in section 5. We conclude our paper in section 6 with noteworthy observations from our analysis and suggestions for how our work might be used in the future.

We selected four datasets for this analysis: the Asian Precipitation—Highly-Resolved Observational Data Integration Toward Evaluation of Extreme Events (APHRODITE-2) product (Yatagai et al.,

Summary of data products used in this analysis.

Product | Spatial resolution (° lat × ° lon) | Temporal resolution | Data source |
APHRODITE-2 | 0.25 × 0.25 | Daily | Interpolated rain gauge |
TRMM | 0.25 × 0.25 | Daily | Satellite |
ERA5 | 0.25 × 0.25 | Hourly | Reanalysis |
MERRA-2 | 0.5 × 0.625 | Monthly | Reanalysis |

The APHRODITE-2 precipitation product uses daily rain gauge readings across Asia and statistically interpolates between them to produce spatially gridded precipitation estimates. TRMM is NASA's precipitation product, produced by combining precipitation estimates from multiple satellites with some limited precipitation gauge data. Thus TRMM is predominantly based on remote observations, while APHRODITE-2 is solely based on

Each of these precipitation products has idiosyncrasies in its estimates of precipitation that derive from how it was constructed. However, each is based on observations and/or physical properties, and therefore represents a “plausible” representation of the system, albeit with varying degrees of uncertainty. While one cannot know what the “truth” is regarding precipitation for the Indus watershed by analyzing these four data products, we assess how they compare to one another: where they seem to be in greatest agreement, and where their estimates diverge more sharply from one another. Below is a summary assessment of average precipitation for the region bounded by 63.5–84°E and 27–40°N as estimated by each product.

As can be seen in

Average monthly anomaly (deviation from mean of all products) in precipitation (in mm) for Indus Watershed region for APHRODITE-2, ERA5, MERRA-2, and TRMM. Month 1 corresponds to January 2000.

While less dramatic, there are slight differences in the general trend in precipitation among these products. TRMM and ERA5 seem to have downward trends in precipitation relative to the group average, while MERRA-2 and APHRODITE-2 seem to be increasing in precipitation over time relative to the mean.
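The anomaly used in this comparison (each product's deviation from the mean of all products in a given month) is simple to compute. The following is a minimal sketch with synthetic numbers; the function name `monthly_anomalies` and the three-month values are illustrative assumptions and are not taken from the products themselves.

```python
import numpy as np

def monthly_anomalies(series_by_product):
    """Return each product's deviation from the across-product mean, month by month."""
    names = list(series_by_product)
    stacked = np.vstack([series_by_product[n] for n in names])  # (products, months)
    group_mean = stacked.mean(axis=0)                           # mean of all products
    return {n: stacked[i] - group_mean for i, n in enumerate(names)}

# Synthetic regional-average precipitation (mm) for three months; NOT real data.
demo = {
    "APHRODITE-2": np.array([30.0, 42.0, 55.0]),
    "TRMM": np.array([34.0, 40.0, 61.0]),
    "ERA5": np.array([46.0, 58.0, 80.0]),
    "MERRA-2": np.array([26.0, 36.0, 52.0]),
}
anomalies = monthly_anomalies(demo)
```

By construction the anomalies sum to zero across products in every month, so a positive value only indicates that a product is wetter than the group average, not that it is wetter than the truth.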

All of the above is illustrative of the fact that there are noteworthy differences between the data products commonly used to assess precipitation statistics and dynamics, and used as input to hydrological and glaciological models. These differences could have the potential to significantly impact climate assessments, uncertainty quantification, and cultural impact statements, and thus it is important to seriously consider the anomalies and noteworthy features of a model-derived data set prior to its further use.

In order to facilitate this, we provide a statistically sound, model-based framework with which to model the discrepancies between these data products while taking into account the spatial and temporal dependence intrinsic to this type of data. Simultaneously, we provide a new data product that can act as a “consensus product,” probabilistically borrowing strength and spatial structure from each of APHRODITE-2, ERA5, MERRA-2, and TRMM.

As an additional note, there are two aspects of the data we considered prior to modeling: namely, that we were working with areal data, and that the data products we used have differing spatial support.

The output of the four products for precipitation (APHRODITE-2, TRMM, ERA5, and MERRA-2) is areal data. Areal data, unlike point data, are indexed for an entire spatial region, rather than at a specific observation point. Areal data are common in realms such as public health and government, where data might be recorded at a city, county, or state level (Waller and Gotway,

While each value of an estimated variable in a gridded climate product is indexed with specific latitude and longitude coordinates, the value is actually given for the entire rectangular region (rectangular with respect to the coordinate grid and ignoring the Earth's sphericity) centered about the provided coordinates. The region's geographical size is determined by the product's resolution. This means that each areal observation within the MERRA-2 product, which has a resolution of 0.5 × 0.625°, covers a region that is 5 times the size (in square degrees) of the regions modeled by the other three products used in this analysis, which have 0.25 × 0.25° resolutions.
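The factor of 5 follows directly from the grid resolutions. As a quick check, the sketch below computes cell sizes in square degrees, ignoring the Earth's sphericity as in the text; the helper name `cell_area_deg2` is an illustrative assumption.

```python
def cell_area_deg2(lat_res_deg, lon_res_deg):
    """Area of one grid cell in square degrees (flat coordinate grid)."""
    return lat_res_deg * lon_res_deg

merra2_cell = cell_area_deg2(0.5, 0.625)    # MERRA-2 cell
quarter_cell = cell_area_deg2(0.25, 0.25)   # APHRODITE-2, TRMM, and ERA5 cells
ratio = merra2_cell / quarter_cell          # how many 0.25-degree cells fit in one MERRA-2 cell
```

Because 0.625 is not a multiple of 0.25, MERRA-2 cells do not align with whole 0.25° cells in longitude, which is one reason a realignment step is needed.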

The second aspect of this data we considered in our model is that each product is realized on a different grid of locations and at varying resolution. This leads us to what is sometimes referred to as the “change of support problem” (Waller and Gotway,

Let _{j} and

To induce a model for

where _{Z} corresponding to the number of spatial areal units associated with the shared process, _{t} to the spatial scale of data product

As mentioned in the introduction and inherent in the notation above, each spatial data product is defined on different set of areal units at varying spatial resolutions. To realign the grid associated with the shared precipitation process to the native resolution of the

such that _{Z} and _{j}. We note that the operator

In this application we have _{Z} = 4,346, resulting in approximately 521,000 correlated values of the shared surface ^{3}, while the memory burden scales by ^{2} making it impractical for data sets of size

While there are various computationally tractable methods to model spatial data through the approximation of a Gaussian process model (e.g., Higdon, _{ij}} be an _{ij} = 1 if areal units ^{⊥}^{⊥} correspond to positive spatial dependence where ^{⊥}^{⊥} associated with positive eigenvalues such that

In this analysis, we chose to use a number of eigenvectors that accounts for at least 60% of the structural variability in the Moran operator, which is calculated by cumulatively summing the non-negative eigenvalues of the Moran operator and identifying a cutoff. This results in using

Using the above basis function expansion, we set _{t} = _{Z}_{jt}_{j}_{jt} + _{j}_{jt} such that

where _{j} is the design matrix for the _{t} is the shared effect of the covariates on precipitation, _{j} is the _{j} × _{Z} realignment matrix (see above), _{Z} is the Moran eigenvector basis with associated coefficients _{t} for the shared precipitation surface _{t} defined on areal units _{jt} represents the discrepancy between the effect of the covariates (_{jt}) on data product _{t}, and _{j} is the Moran eigenvector basis on areal units _{jt} correspond to spatially structured discrepancy between data product

such that (4) shows that the spatial precipitation can be similarly viewed as a fixed effect, a basis function expansion of a spatial random effect and uncorrelated white noise.
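The realignment step referred to above can be illustrated with a one-dimensional toy example. The sketch below assumes that each row of the realignment matrix holds area-overlap weights that sum to one, so that multiplying by it averages fine cells into coarse cells. The source describes the operator only in outline, so this weighting, the function name `realignment_matrix`, and the grid edges are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np

def realignment_matrix(fine_edges, coarse_edges):
    """Row i holds the fraction of coarse cell i covered by each fine cell."""
    n_coarse, n_fine = len(coarse_edges) - 1, len(fine_edges) - 1
    H = np.zeros((n_coarse, n_fine))
    for i in range(n_coarse):
        lo_c, hi_c = coarse_edges[i], coarse_edges[i + 1]
        for k in range(n_fine):
            # Length of the intersection of coarse cell i and fine cell k.
            overlap = min(hi_c, fine_edges[k + 1]) - max(lo_c, fine_edges[k])
            H[i, k] = max(overlap, 0.0) / (hi_c - lo_c)
    return H

# Toy longitude strip: ten 0.25-degree cells and four 0.625-degree cells.
fine_edges = np.arange(0.0, 2.5 + 1e-9, 0.25)
coarse_edges = np.arange(0.0, 2.5 + 1e-9, 0.625)
H = realignment_matrix(fine_edges, coarse_edges)

shared_fine = np.arange(10, dtype=float)  # synthetic shared surface on the fine grid
shared_coarse = H @ shared_fine           # the same surface upscaled to the coarse grid
```

With weights of this form a spatially constant surface stays constant after upscaling, which is the behavior one would want from an averaging operator.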

Spatial eigenvectors from decomposed Moran operator.
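To make the construction of the Moran eigenvector basis concrete, the following is a minimal sketch in the spirit of the Hughes and Haran approach: project the adjacency matrix onto the orthogonal complement of the design matrix, eigen-decompose the result, and keep the leading eigenvectors with positive eigenvalues up to a cumulative share (60% here, matching the cutoff described above). The rook-neighbor adjacency, the intercept-only design matrix, the 6 × 6 grid, and all function names are illustrative assumptions, not the authors' code.

```python
import numpy as np

def rook_adjacency(n_rows, n_cols):
    """Binary adjacency for a regular grid; cells sharing an edge are neighbors."""
    n = n_rows * n_cols
    A = np.zeros((n, n))
    for r in range(n_rows):
        for c in range(n_cols):
            i = r * n_cols + c
            if c + 1 < n_cols:
                A[i, i + 1] = A[i + 1, i] = 1.0
            if r + 1 < n_rows:
                A[i, i + n_cols] = A[i + n_cols, i] = 1.0
    return A

def moran_basis(adjacency, X, var_fraction=0.6):
    """Leading positive-eigenvalue eigenvectors of the Moran operator."""
    n = adjacency.shape[0]
    # Projection onto the orthogonal complement of the column space of X.
    P_perp = np.eye(n) - X @ np.linalg.solve(X.T @ X, X.T)
    moran_op = P_perp @ adjacency @ P_perp
    moran_op = (moran_op + moran_op.T) / 2.0       # guard against round-off asymmetry
    eigvals, eigvecs = np.linalg.eigh(moran_op)
    order = np.argsort(eigvals)[::-1]              # largest eigenvalue first
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    positive = eigvals[eigvals > 1e-10]            # positive spatial dependence only
    cum_share = np.cumsum(positive) / positive.sum()
    q = int(np.searchsorted(cum_share, var_fraction)) + 1
    return eigvecs[:, :q], eigvals[:q]

A = rook_adjacency(6, 6)
X = np.ones((36, 1))            # intercept-only design, for illustration
M, lam = moran_basis(A, X)
```

The retained columns of `M` are orthonormal and orthogonal to the columns of `X`, which is what allows the spatial random effect to be estimated without competing with the fixed effects.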

To this point, we have considered only the spatial and not the temporal dimension of building a model for precipitation over time. With regard to temporal correlation, we employ a spatial dynamic linear model (DLM) (Petris et al.,

with

Notably, we chose to implement the DLM structure only for the shared process. Thus, we do not explicitly enforce temporal dependence in the discrepancy surfaces. However, because the discrepancy processes define deviations from the shared process, the two processes are correlated in the posterior, to a degree dictated by the amount of temporal smoothness present in the data. Further, in the initial stages of this research, we also specified a DLM for the discrepancies but found that such a model appeared to be computationally intractable. After assuming temporal independence in the discrepancy surfaces, the model showed considerably improved identifiability and posterior mixing, solidifying our choice to implement a DLM exclusively in the shared process.
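The contrast between a temporally dependent shared process and temporally independent discrepancies can be illustrated with a small simulation. The sketch below assumes a first-order random-walk evolution for the shared basis coefficients, which is one common DLM form; the evolution equation actually used by the authors is not recoverable from the text here, so the random walk, the variance values, and the names are all illustrative assumptions.

```python
import numpy as np

def simulate_shared_coefficients(n_times, n_basis, evolution_sd, rng):
    """Random-walk state evolution: state_t = state_{t-1} + Gaussian innovation."""
    state = np.zeros((n_times, n_basis))
    state[0] = rng.normal(0.0, 1.0, n_basis)
    for t in range(1, n_times):
        state[t] = state[t - 1] + rng.normal(0.0, evolution_sd, n_basis)
    return state

rng = np.random.default_rng(0)
n_months, n_basis = 120, 5   # 120 months as in 2000-2009; 5 basis functions is arbitrary

# Shared-process coefficients evolve smoothly through time.
shared_coef = simulate_shared_coefficients(n_months, n_basis, 0.1, rng)

# Discrepancy coefficients are drawn independently for every month.
discrepancy_coef = rng.normal(0.0, 1.0, (n_months, n_basis))
```

In this toy setting, month-to-month changes in `shared_coef` are small relative to its overall spread, while `discrepancy_coef` has no memory from one month to the next.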

Equations (3) and (5) provide a model that allowed us to estimate a spatially and temporally correlated shared precipitation process as well as spatially and temporally correlated estimates for product-specific discrepancy processes resolved monthly and at the native resolution of each product. More concretely, _{Z}_{t} + _{Z}_{t} (where _{Z} is the design matrix of the shared product), corresponds to the shared precipitation process at time _{j}_{jt} + _{j}_{jt} corresponds to the discrepancy process of data product _{t} and _{t} are identified using each of our data products, while the discrepancy parameters, _{jt} and _{jt}, are unique to each product and conditionally independent of each other

In order to estimate all model parameters, we choose to implement Bayesian model fitting via Markov chain Monte Carlo, a class of algorithms with considerable literature regarding its theory and implementation (e.g., Casella and George,

_{jt} departs from that recommended by Hughes and Haran. However, given that the basis functions in

One fact about using Moran bases to capture spatial variability is that each column of _{TRMM} =

The combination of normally distributed model parameters with inverse-gamma distributed variances means that—given

In the interest of brevity and focus, we omit the detailed notation for our full conditional distributions here, but the computational implementation of the Gibbs sampler for our admittedly complex Bayesian linear model is fairly standard within the Bayesian modeling literature (see for example Gelman et al., _{jt} and _{jt} can be updated in parallel from their complete conditional distribution to improve computational efficiency.
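For readers unfamiliar with this style of sampler, the following is a minimal sketch of a Gibbs sampler for a much simpler model: a single Bayesian linear regression with a normal prior on the coefficients and an inverse-gamma prior on the error variance. It is meant only to show the conjugate updates that such a model permits; it is not the authors' sampler, and the prior settings (`tau2`, `a`, `b`), the synthetic data, and the burn-in and thinning choices are illustrative assumptions.

```python
import numpy as np

def gibbs_linear_model(y, X, n_iter=2000, tau2=100.0, a=2.0, b=1.0, seed=0):
    """Gibbs sampler for y = X beta + noise, beta ~ N(0, tau2 I), sigma2 ~ IG(a, b)."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    XtX, Xty = X.T @ X, X.T @ y
    sigma2 = 1.0
    beta_draws = np.empty((n_iter, p))
    sigma2_draws = np.empty(n_iter)
    for it in range(n_iter):
        # beta | sigma2, y is multivariate normal.
        cov = np.linalg.inv(XtX / sigma2 + np.eye(p) / tau2)
        mean = cov @ (Xty / sigma2)
        beta = rng.multivariate_normal(mean, cov)
        # sigma2 | beta, y is inverse-gamma(a + n/2, b + SSE/2);
        # drawn as the reciprocal of a gamma variate with matching rate.
        resid = y - X @ beta
        sigma2 = 1.0 / rng.gamma(a + n / 2.0, 1.0 / (b + 0.5 * resid @ resid))
        beta_draws[it], sigma2_draws[it] = beta, sigma2
    return beta_draws, sigma2_draws

# Synthetic data with known coefficients (2, -1) and noise SD 0.5.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=200)])
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=200)

beta_draws, sigma2_draws = gibbs_linear_model(y, X)
beta_hat = beta_draws[500::5].mean(axis=0)   # drop burn-in, then thin by 5
sigma2_hat = sigma2_draws[500::5].mean()
```

Because each update is a draw from a standard distribution, blocks of parameters that are conditionally independent given the rest can be sampled at the same time, which is the property that makes parallel updates possible.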

Because our model is in effect identifying 600 unique but interconnected surfaces (one shared surface and four discrepancies for each of 120 time states), fitting this model is computationally expensive, although the computational burden is relatively modest in comparison to the computing-intensive climate models used to produce the data utilized in this analysis. We fit this model using the software R on a Dell PowerEdge R740 server with 2 x Intel(R) Xeon(R) Gold 6132 CPU @ 2.60GHz and 128GB of RAM. With the previously specified number of spatial eigenvectors used per surface, it required approximately 17 days to obtain 55,000 posterior draws, which were thinned by a factor of 50 due to memory limitations. In spite of high autocorrelation due to the close relationship between the shared surface and the discrepancies, our model parameters appear to have successfully converged based on an analysis of their trace plots and other commonly applied Bayesian convergence heuristics.

In this section, we look at several figures containing data and model output. The majority of these figures will be for the month of August 2000, which is chosen to be illustrative of the results for a single month. For

The precipitation products for August 2000 (in mm).

Shared precipitation surface (in mm) for August 2000 on left. Corresponding uncertainty surface on right. Uncertainty characterized using posterior mean standard deviation (in mm).

Discrepancy surfaces for August 2000 (in mm). Red regions indicate areas where the product modeled higher precipitation than shared surface, while blue regions indicate lower precipitation relative to shared surface.

Uncertainty associated with noiseless data approximation for August 2000 (in mm).

_{Z}_{t} + _{Z}_{t}). Also shown in

When we examine the uncertainty shown in

An advantage of the Bayesian modeling approach is the flexibility and ease of quantifying uncertainty for different elements of our model.
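As an illustration of how such uncertainty summaries are typically obtained from posterior draws, the sketch below maps draws of basis coefficients to draws of a surface and then takes the pointwise mean and standard deviation. The basis, the coefficient draws, and the function name `summarize_surface` are synthetic stand-ins and assumptions; the 1,100 draws simply mirror 55,000 draws thinned by a factor of 50.

```python
import numpy as np

def summarize_surface(coef_draws, basis):
    """Pointwise posterior mean and SD of a surface defined as basis @ coefficients."""
    surface_draws = coef_draws @ basis.T   # (draws, cells)
    return surface_draws.mean(axis=0), surface_draws.std(axis=0, ddof=1)

rng = np.random.default_rng(42)

# Stand-in orthonormal basis on 50 areal units with 4 basis functions.
basis = np.linalg.qr(rng.normal(size=(50, 4)))[0]

# Stand-in posterior draws of the 4 coefficients (synthetic, not model output).
coef_draws = rng.normal(loc=[3.0, -1.0, 0.5, 0.0], scale=0.2, size=(1100, 4))

post_mean, post_sd = summarize_surface(coef_draws, basis)
```

The same two lines of post-processing apply to any quantity that is a function of the sampled parameters, which is why uncertainty for shared surfaces, discrepancies, and their averages can all be reported from a single set of draws.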

As previously mentioned, we have similar plots to those shown in

Ten-year averaged shared product (in mm) for winter months (December/January/February) on left. Ten-year averaged shared product (in mm) for summer months (June/July/August) on right. Note that the two plots contained in this figure use different scales.

Averaged discrepancies (in mm) for the months of December, January, and February across 10 years.

Averaged discrepancies (in mm) for the months of June, July, and August across 10 years.
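Seasonal averages of this kind can be computed from the 120 monthly surfaces by grouping on calendar month. The sketch below assumes that index 0 corresponds to January 2000 and groups purely by calendar month, so December 2009 is averaged together with the January and February values of earlier years; whether the authors handled the winter season across year boundaries in this way is not stated, and the function name and data are illustrative.

```python
import numpy as np

def seasonal_mean(monthly, months_of_year):
    """Average monthly surfaces (time, cells) over the given calendar months (1-12)."""
    calendar_month = np.arange(monthly.shape[0]) % 12 + 1   # index 0 = January 2000
    selected = np.isin(calendar_month, months_of_year)
    return monthly[selected].mean(axis=0)

rng = np.random.default_rng(7)
monthly = rng.gamma(shape=2.0, scale=20.0, size=(120, 6))   # synthetic values in mm

winter = seasonal_mean(monthly, [12, 1, 2])   # December/January/February
summer = seasonal_mean(monthly, [6, 7, 8])    # June/July/August
```

Each seasonal mean here averages 30 monthly surfaces (3 months in each of 10 years).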

In discussing our model's output and the underlying data products, we wish to highlight that any discussion of a product's “wet” or “dry” tendency refers to departures from the consensus of data products as characterized by the model's shared product. Such a descriptor does not represent an objective measure of model output as compared to observed data. Thus, any conclusions drawn regarding a product's relative tendency in this paper should not be taken to mean that a product is necessarily inaccurate in its estimate of precipitation; rather, that it differs from the consensus of the products analyzed here.

In

In analyzing our model discrepancies, we also assessed the correlation between the magnitude of the discrepancies and elevation. Unsurprisingly, we found a small positive correlation between elevation and absolute discrepancy (

The comparisons made in the previous paragraphs are illustrative of the types of observations and conclusions made possible via our modeled shared product and discrepancies. Any number of questions related to the trends and differences found in these products over this time period could be explored and answered, with uncertainty quantification, using the output of our model. Additionally, the framework presented in this article can be extended to different time periods or to the integration of additional data sets.

The shared product introduced in this article captures spatial and temporal structure from each of the four data products due to the manner in which each source of data is incorporated into the overall model, making it a valuable reference point by which to judge the similarities and dissimilarities of the products used to inform it. Because of the incorporation of spatial and temporal dependencies within our model, along with the natural approach to assessing posterior uncertainty that the Bayesian methodological framework supplies, we are provided with model output that is considerably more nuanced and inferentially rich than a weighted average or simple mean at all locations. Based on our analysis of the discrepancies, our model shows a general dry tendency in MERRA-2 relative to the shared product, a finding which coincides with trends identified in the exploratory data analysis of section 2 and was illustrated in

An additional observation we make about our model, and the shared surface in particular, is that it is a fairly smooth process relative to the data used to inform it. This spatial smoothness is to be expected to an extent, given that each of the four products has unique local behavior that we would not expect to appear with the same magnitude in the shared process. However, we are also conscious of the manner in which low-rank Gaussian process approximations (such as the approach of Hughes and Haran used here) are often criticized for over-smoothing data (Datta et al.,

This may raise questions about the overall utility of our shared product for use in other analyses. We are of the opinion that this product is most useful for synoptic scale studies of precipitation variability and trends. For models that require precipitation data on a local and highly refined scale, it is likely that the product produced in our analysis will not be suitable due to its smoothness. Instead, one of the existing precipitation products should be chosen with consideration for its idiosyncrasies as discussed in this article.

In spite of our approach's potential disadvantages as discussed above, a valuable element of the shared product is that it provides an intuitive comparison point for our modeled discrepancies. Due to the shared product's central behavior among the products used in our analysis, comparison between products and basic interpretation of discrepancies is made simpler. The uncertainties we estimate as part of this model also provide us with a valuable way to discern if observed differences between products (particularly within their discrepancies) are statistically meaningful.

The methods used here can easily facilitate the incorporation of additional data sources. At the beginning of section 3 of this paper we specify that

Given the challenges related to modeling and measuring precipitation in the Indus watershed it is difficult to know the “truth” about precipitation in the region. Thus, when some products are referred to as “over-” or “under-estimating” precipitation, this is meant relative to the consensus of the products used in this analysis. It is entirely plausible that a product which appears to be an outlier when estimating precipitation for a particular region is in fact the most accurate of them all, a possibility which should not be discounted.

That said, in our analysis we found that MERRA-2 tended to be dry and ERA5 tended to be wet relative to the shared product. These tendencies are present in both winter and summer and are most notable in the regions with high precipitation. A notable idiosyncrasy of ERA5 was its consistent propensity to estimate more precipitation than the consensus across the Tibetan Plateau during the monsoon season.

Our analysis also produced a shared product for precipitation that assimilated spatial and temporal structure from APHRODITE-2, TRMM, ERA5, and MERRA-2. This product will be available for download and usage at NSIDC, along with all discrepancy surfaces and uncertainty estimates discussed in this article. Given the product's relative smoothness, it will likely be most useful for larger-scale studies of precipitation variability and trends. This product is also valuable as a reference point for understanding the discrepancies in our model.

The methodology presented in this article can be extended to incorporate additional data sets, and should scale reasonably well for other spatio-temporal resolutions and domains. While precipitation was the focus of our analysis, a similar model could be applied to other climate variables such as temperature.

Our model provides a cohesive statistical framework for understanding the shared structure of spatially and temporally varying data products, while simultaneously providing us with discrepancy surfaces and uncertainty estimates that allow us to understand how those products differ from one another and the consensus.

Publicly available datasets were analyzed in this study. This data can be found here: APHRODITE-2:

MC did the majority of data collection, model implementation, coding, and writing for this paper. MH advised MC during the development of the statistical model used in this paper. SR, CR, and WC are co-PIs on the grant that funded this research and provided motivation and initial direction on the project, as well as regular guidance on research direction. All authors contributed extensively to the review and revision of this paper.

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.