This article was submitted to Mathematics of Computation and Data Science, a section of the journal Frontiers in Applied Mathematics and Statistics
This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.
Statistical learning theory provides the foundation for applied machine learning and its many successful applications in computer vision, natural language processing and other scientific domains. The theory, however, does not account for the unique challenges of performing statistical learning in geospatial settings. For instance, it is well known that model errors cannot be assumed to be independent and identically distributed in geospatial (a.k.a. regionalized) variables due to spatial correlation; and trends caused by geophysical processes lead to covariate shifts between the domain where the model was trained and the domain where it will be applied, which in turn undermine classical learning methodologies that rely on random samples of the data. In this work, we introduce the
Classical learning theory [
Among these methods derived under classical assumptions (more on this later), those for estimating the generalization (or prediction) error of learned models on unseen samples are crucial in practice [
The literature on generalization error estimation methods is vast [
Major assumptions are involved in the derivation of the estimation methods listed above. The first is the assumption that samples come from independent and identically distributed (i.i.d.) random variables. It is well known that spatial samples are not i.i.d., and that spatial correlation needs to be modeled explicitly with geostatistical theory. Even though the sample mean of the empirical error used in those methods is an unbiased estimator of the prediction error regardless of the i.i.d. assumption, the precision of the estimator can degrade considerably with non-i.i.d. samples.
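This loss of precision can be seen in a small simulation: the sample mean of correlated draws remains unbiased, yet its variance is far above the i.i.d. rate of 1/n. The sketch below uses an AR(1) sequence as a hypothetical stand-in for spatially correlated errors; the lag-one correlation `rho` and all sizes are illustrative choices, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
n, rho, n_trials = 200, 0.9, 2000

def ar1_sample(n, rho, rng):
    """One realization of a unit-variance AR(1) sequence with lag-one correlation rho."""
    x = np.empty(n)
    x[0] = rng.standard_normal()
    for t in range(1, n):
        x[t] = rho * x[t - 1] + np.sqrt(1 - rho ** 2) * rng.standard_normal()
    return x

# Monte Carlo variance of the sample mean: i.i.d. vs. correlated samples
var_iid = np.var([rng.standard_normal(n).mean() for _ in range(n_trials)])
var_ar1 = np.var([ar1_sample(n, rho, rng).mean() for _ in range(n_trials)])
# both sample means are unbiased (centered at 0), but the correlated one is
# far less precise: the ratio var_ar1 / var_iid is well above 1
```

Intuitively, correlated samples carry less information than their count suggests, so the same number of samples yields a much noisier error estimate.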
Motivated by the necessity to leverage non-i.i.d. samples in practical applications, and by evidence that a model’s performance is affected by spatial correlation [
Unlike the estimation methods proposed in the 1970s, which use random splits of the data, these methods split the data based on spatial coordinates and on what the authors called “dead zones”. This set of heuristics for creating data splits avoids configurations in which the model is evaluated on samples that are too near (
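A minimal sketch of such a spatial split, assuming rectangular blocks and a Euclidean buffer playing the role of the “dead zone” (the block size, buffer radius, and helper names are our own illustrative choices, not the cited authors’ implementation):

```python
import numpy as np

def block_folds(coords, block_size):
    """Assign each sample to a rectangular spatial block (one fold per block)."""
    ids = np.floor(coords / block_size).astype(int)
    _, folds = np.unique(ids[:, 0] * 10_000 + ids[:, 1], return_inverse=True)
    return folds

def split_with_dead_zone(coords, folds, test_fold, buffer):
    """Hold out one block for testing; keep for training only the samples
    farther than `buffer` from every test sample (the "dead zone")."""
    test = folds == test_fold
    d = np.linalg.norm(coords[:, None] - coords[test][None], axis=-1)
    train = (~test) & (d.min(axis=1) > buffer)
    return np.where(train)[0], np.where(test)[0]

rng = np.random.default_rng(0)
coords = rng.uniform(0, 100, size=(300, 2))       # sample locations in a 100 x 100 region
folds = block_folds(coords, block_size=25.0)      # 4 x 4 grid of blocks
train_idx, test_idx = split_with_dead_zone(coords, folds, test_fold=0, buffer=10.0)
```

The buffer guarantees a minimum distance between training and test samples, which limits the leakage of information through spatial correlation.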
All methods for estimating generalization error in classical learning theory, including the methods listed above, rely on a second major assumption: that the distribution of unseen samples to which the model will be applied equals the distribution of samples over which the model was trained. This assumption is very unrealistic for various applications in the geosciences, which involve quite heterogeneous (i.e., variable) and heteroscedastic (i.e., with varying variability) processes [
Very recently, an alternative to classical learning theory, known as transfer learning theory, has been proposed to deal with the more difficult problem of learning under shifts in distributions and learning tasks [
Of particular interest in this work, the covariate shift problem is a type of transfer learning problem in which the samples to which the model is applied have a distribution of covariates that differs from the distribution of covariates over which the model was trained [
The importance weights used in importance-weighted cross-validation are ratios between the target (or test) probability density and the source (or train) probability density of covariates. Density ratios are useful in a much broader set of applications, including two-sample tests, outlier detection, and distribution comparison. For that reason, the problem of density ratio estimation has become a general statistical problem [
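The role of the density ratio can be illustrated with a toy one-dimensional example: reweighting source samples by w(x) = p_tgt(x)/p_src(x) recovers expectations under the target distribution. The Gaussian source/target pair below is purely illustrative.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) evaluated at x."""
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, size=100_000)               # draws from the source N(0, 1)
w = gauss_pdf(x, 1.0, 1.0) / gauss_pdf(x, 0.0, 1.0)  # w(x) = p_tgt(x) / p_src(x)

f = lambda t: t ** 2                                 # any statistic of interest
plain = f(x).mean()                                  # estimates E_src[f] = 1
weighted = (w * f(x)).mean()                         # estimates E_tgt[f] = 1 + 1^2 = 2
```

The weighted mean targets the distribution where the model will be applied, even though all samples come from the source.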
In this work, we introduce
The paper is organized as follows. In
In this section, we define the elements of statistical learning in geospatial settings. We discuss the covariate shift and spatial correlation properties of the problem, and illustrate how they affect the involved feature spaces.
Consider a sample space
For example,
Examples of source and target spatial domains.
In order to define the geostatistical learning problem, we need to understand the joint probability distribution of features for all locations in a spatial domain
Regardless of the stationarity assumptions involved in modeling these processes, we can assume that inside
Whereas the pointwise stationarity assumption may be reasonable inside a given spatial domain, the assumption of spatial independence of features is rarely defensible in practice. Additionally, pointwise stationarity often does not transfer from a source domain
We have introduced the notion of spatial domain
• Mining: The task of segmenting a mineral deposit from drill-hole samples using a set of features is a spatial learning task. It assumes the segmentation result to be a
• Agriculture: The task of identifying crops from satellite images is a spatial learning task. Locations that have the same crop type
• Petroleum: The task of segmenting formations from seismic data is a spatial learning task because these formations are large-scale
Many more examples of spatial learning tasks exist, and others are yet to be proposed. Given the concepts introduced above, we are now ready for the main definition of this section:
There are considerable differences between the classical definition of transfer learning [
Having understood the main differences between classical and geostatistical learning, we now focus our attention on a specific type of geostatistical transfer learning problem, and illustrate some of the unique challenges caused by spatial dependence.
Assume that the two spatial domains are different
Let
The property is based on the idea that the underlying true function
Covariate shift. The true underlying function
In the geosciences, it is very common to encounter problems with covariate shift due to the great variability of natural processes. Whenever a model is 1) learned using labels provided by experts on a spatial domain
Another important and often ignored issue with geospatial data is spatial dependence, which we illustrate next. As mentioned earlier, the closer two locations
Besides serving as a tool for diagnosing spatial correlation in geostatistical learning problems, variograms can also be used to simulate spatial processes with theoretical correlation structure. In
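As a rough illustration of the diagnostic use of variograms, the sketch below computes the classical (Matheron) empirical variogram from scattered samples; the simulated field, bin count, and lag range are arbitrary choices for the example, not the processes used in the paper.

```python
import numpy as np

def empirical_variogram(coords, values, n_lags=8, max_lag=3.0):
    """Classical (Matheron) variogram estimator: half the mean squared
    difference of values, binned by pairwise distance."""
    n = len(values)
    d = np.linalg.norm(coords[:, None] - coords[None, :], axis=-1)
    sq = (values[:, None] - values[None, :]) ** 2
    iu = np.triu_indices(n, k=1)                 # count each pair once
    d, sq = d[iu], sq[iu]
    bins = np.linspace(0, max_lag, n_lags + 1)
    lags, gammas = [], []
    for lo, hi in zip(bins[:-1], bins[1:]):
        m = (d >= lo) & (d < hi)
        if m.any():
            lags.append(d[m].mean())
            gammas.append(sq[m].mean() / 2)
    return np.array(lags), np.array(gammas)

rng = np.random.default_rng(0)
coords = rng.uniform(0, 10, size=(400, 2))
values = np.sin(coords[:, 0]) + 0.1 * rng.standard_normal(400)  # spatially structured field
lags, gammas = empirical_variogram(coords, values)
# gamma rises with lag: nearby samples are more alike than distant ones
```

A variogram that rises with lag before leveling off at a sill is the signature of spatial correlation; a flat variogram (pure nugget) indicates no spatial structure.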
Impact of spatial correlation in feature space. Two Gaussian processes
Similar deformations are observed when the two processes
Impact of spatial correlation in feature space with correlated processes. Similar to
Spatial correlation may have different impact in source and target domains
Having defined geostatistical learning problems, and their covariate shift and spatial correlation properties, we now turn to a general definition of the generalization error of learning models in geospatial settings. We review an importance-weighted approximation of a related generalization error based on pointwise stationarity assumptions, and the use of an efficient importance-weighted cross-validation method for error estimation.
Consider a geostatistical learning problem
In the expected value of
Unlike the classical definition of generalization error, the definition above for geostatistical learning problems relies on a spatial loss function
More specifically, we consider pointwise learning with families that are made of a single learning model
Although pointwise learning with a single model is a very simple type of geostatistical learning, it is by far the most widely used approach in the geospatial literature. We acknowledge this fact, and consider an empirical approximation of the pointwise expected risk in
An empirical approximation of the pointwise expected risk of a model
Our goal is to find the pointwise model that minimizes the empirical risk approximation
Alternatively, our goal is to rank a collection of models
In order to achieve the stated goals, we need to 1) estimate the importance weights in the empirical risk approximation, and 2) remove the dependence of the approximation on a specific dataset. These two issues are addressed in the following sections.
The empirical approximation of the risk
Efficient methods for density ratio estimation that perform well with high-dimensional features have been proposed in the literature. In this work we consider a fast method named Least Squares Importance Fitting (LSIF) [
This quadratic optimization problem with linear inequality constraints can be solved very efficiently with modern optimization software [
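For illustration, a simplified least-squares fit of the density ratio admits a closed-form solution if the non-negativity constraints are dropped and negative coefficients are clipped afterwards (a uLSIF-style variant). The kernel width, regularization strength, and number of centers below are illustrative; this sketch is not the constrained solver used in the paper.

```python
import numpy as np

def gaussian_kernel(X, C, sigma):
    """Gaussian kernel matrix between the rows of X and the kernel centers C."""
    d2 = ((X[:, None, :] - C[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def fit_ratio(X_src, X_tgt, sigma=0.7, lam=1e-2, n_centers=50, seed=0):
    """Least-squares fit of r(x) = p_tgt(x) / p_src(x) as a kernel expansion
    centered on target samples, solved in closed form without the
    non-negativity constraints (negative coefficients are clipped)."""
    rng = np.random.default_rng(seed)
    C = X_tgt[rng.choice(len(X_tgt), size=min(n_centers, len(X_tgt)), replace=False)]
    Phi_src = gaussian_kernel(X_src, C, sigma)
    Phi_tgt = gaussian_kernel(X_tgt, C, sigma)
    H = Phi_src.T @ Phi_src / len(X_src)      # source second moments of the basis
    h = Phi_tgt.mean(axis=0)                  # target first moments of the basis
    alpha = np.maximum(np.linalg.solve(H + lam * np.eye(len(C)), h), 0.0)
    return lambda X: gaussian_kernel(X, C, sigma) @ alpha

rng = np.random.default_rng(1)
X_src = rng.normal(0.0, 1.0, size=(2000, 1))   # source covariates
X_tgt = rng.normal(0.5, 1.0, size=(2000, 1))   # shifted target covariates
r = fit_ratio(X_src, X_tgt)
w = r(X_src)                                   # importance weights on the source
```

The fitted ratio is larger where the target is denser relative to the source, and the weights average to roughly one over the source samples, as the true ratio does.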
In order to remove the dependence of the empirical risk approximation on the dataset, we use importance-weighted cross-validation (IWCV) [
The main difference in the IWCV procedure is the weights that multiply each sample. The regularization exponent
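A sketch of the procedure under these conventions (the learner interface, the 0-1 loss, and the hypothetical weights are our own illustrative choices):

```python
import numpy as np

def iwcv_error(fit, X, y, weights, n_folds=5, gamma=0.5, seed=0):
    """Importance-weighted K-fold cross-validation: held-out losses are
    averaged with weights w_i**gamma. gamma=0 recovers plain CV; gamma=1
    fully corrects for covariate shift at the cost of higher variance."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    errors = []
    for k in range(n_folds):
        test = folds[k]
        train = np.concatenate([f for j, f in enumerate(folds) if j != k])
        predict = fit(X[train], y[train])
        loss = (predict(X[test]) != y[test]).astype(float)   # 0-1 loss
        errors.append(np.average(loss, weights=weights[test] ** gamma))
    return float(np.mean(errors))

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))
y = (X[:, 0] > 0).astype(float)
w = np.exp(X[:, 0])   # hypothetical density-ratio weights, larger where the target is denser

# hypothetical learner: a fixed threshold on the first feature (perfect for this y)
fit_threshold = lambda Xtr, ytr: (lambda Xte: (Xte[:, 0] > 0).astype(float))
err = iwcv_error(fit_threshold, X, y, w)
```

The exponent interpolates between the low-variance plain CV estimate and the shift-corrected but noisier fully weighted estimate.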
In the rest of the paper, we combine IWCV with LSIF into a method for estimating generalization error that we term
In this section, we perform experiments to assess estimators of generalization error under varying covariate shifts and spatial correlation lengths. We consider Cross-Validation (CV), Block Cross-Validation (BCV) and Density Ratio Validation (DRV), which all rely on the same cross-validatory mechanism of splitting data into folds.
First, we use synthetic Gaussian process data and simple labeling functions to construct geostatistical learning problems for which learning models have a known (via geostatistical simulation) generalization error. In this case, we assess the estimators in terms of how well they estimate the actual error under various spatial distributions. Second, we demonstrate how the estimators are used for model selection in a real application with well logs from New Zealand, which can be considered to be a dataset of moderate size in this field.
Let
Three possible shift configurations. The target distribution is “inside” the source distribution
Geostatistical learning problems with
Given a shift parameterized by
The first configuration in
To efficiently simulate multiple spatial samples of the processes over a regular grid domain with
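As a minimal stand-in for such simulations, the sketch below draws realizations of a zero-mean Gaussian process on a small regular 1-D grid by Cholesky factorization of an exponential covariance; the grid size and correlation length are illustrative, and FFT-based methods (e.g., circulant embedding) scale far better on large grids.

```python
import numpy as np

def simulate_gp_1d(n_points, n_samples, corr_length, seed=0):
    """Unconditional simulation of a zero-mean Gaussian process on a regular
    1-D grid via Cholesky factorization of an exponential covariance."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0, 1, n_points)
    cov = np.exp(-np.abs(x[:, None] - x[None, :]) / corr_length)
    L = np.linalg.cholesky(cov + 1e-10 * np.eye(n_points))  # tiny jitter for stability
    return x, L @ rng.standard_normal((n_points, n_samples))

x, Z = simulate_gp_1d(n_points=200, n_samples=100, corr_length=0.1)
# each column Z[:, j] is one realization; nearby grid nodes are highly
# correlated while nodes far apart are nearly independent
```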
To fully specify the geostatistical learning problem, we need to specify a learning task. The task consists of predicting a binary variable
Labeling function
Having defined the problem, we proceed and specify learning models in order to investigate the different estimators of generalization error. We choose two models that are based on different prediction mechanisms [
These two models
The experiment proceeds as follows. For each shift
To facilitate the visualization of the results, we introduce shift functions
We plot the true generalization error of the models as a function of the different covariate shifts in
Generalization error of learning models versus covariate shift functions. Among all shift functions, the novelty factor is the only function that groups shift configurations along the horizontal axis. Models behave similarly in terms of generalization error for the given dataset size (
Among the three shift functions, the novelty factor is the only function that groups shift configurations along the horizontal axis. In this case, configurations deemed easy (i.e., where the target distribution is
The two models behave similarly in terms of generalization error for the given dataset size (i.e.,
We plot the CV, BCV and DRV estimates of generalization error versus covariate shift (i.e., novelty factor) in the top row of
Estimates of generalization error for various shifts (i.e., novelty factor) and various correlation lengths. The box plots for the
First, we emphasize that the CV and BCV estimates remain constant as a function of covariate shift. This is expected given that these estimators do not make use of the target distribution. The DRV estimates increase with covariate shift as expected, but do not follow the same rate of increase of the true (or actual) generalization error obtained with Monte Carlo simulation. Second, we emphasize in the box plots for the
In order to better visualize the trends in the estimates, we smooth the scatter plots with locally weighted regression per correlation length in the top row of
Trends of generalization error for different estimators
From the figure, there exists a gap between the DRV estimates and the actual generalization error of the models for all covariate shifts. This gap is expected given that the target distribution may be very different from the source distribution, particularly in
Unlike the previous experiment with synthetic Gaussian process data and known generalization error, this experiment consists of applying the CV, BCV and DRV estimators to a real dataset of well logs prepared in-house [
The dataset consists of 407 wells in the Taranaki basin, including the main geophysical logs and reported geological formations. The basin comprises an area of about
Curated dataset with 407 wells in the Taranaki basin, New Zealand. The basin comprises an area of about
We split the wells into onshore and offshore locations in order to introduce a geostatistical learning problem with covariate shift. The problem consists of predicting the rock formation from well logs offshore after learning a model with well logs and reported (i.e., manually labeled) formations onshore. The well logs considered are gamma ray (GR), spontaneous potential (SP), density (DENS), compressional sonic (DTC) and neutron porosity (NEUT). We eliminate locations with missing values for these logs and investigate a balanced dataset with the two most frequent formations—Urenui and Manganui. We normalize the logs and illustrate the covariate shift property by comparing the scatter plots of onshore and offshore locations in
Distribution of main geophysical logs onshore (gray) and offshore (purple) centered by the mean and divided by the standard deviation. Visible covariate shift in the scatter and contour plots.
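The normalization step described above (each set centered by its mean and divided by its standard deviation) amounts to a per-column z-score. The sketch below uses random stand-in data, not the actual well logs:

```python
import numpy as np

def zscore(X):
    """Center each column by its mean and scale by its standard deviation."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

rng = np.random.default_rng(0)
onshore = rng.normal(50.0, 10.0, size=(1000, 5))   # stand-ins for GR, SP, DENS, DTC, NEUT
offshore = rng.normal(60.0, 15.0, size=(800, 5))
on_n, off_n = zscore(onshore), zscore(offshore)
# after normalization each log has zero mean and unit variance within its set,
# so remaining differences between the two sets reflect distribution shape
```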
We set the hyperparameters of the error estimators based on variography and according to available computational resources. In particular, we set blocks for the BCV estimator with sides
In
Estimates of generalization error with different estimators for the onshore-to-offshore problem. The CV estimator produces estimates that are the most distant from the actual target error due to covariate shift and spatial correlation. None of the estimators is capable of ranking the models correctly. They all select complex models with low generalization ability.

Among the three estimators of generalization error, the CV estimator produces estimates that are the most distant from the target error, with a tendency to underestimate the error. The BCV estimator produces estimates that are higher than the CV estimates, and consequently closer to the target error in this case. The DRV estimator produces the closest estimates for most models; however, like the CV estimator, it fails to approximate the error for models like KNeighbors and DecisionTree that are overfitted to the source distribution. The three estimators fail to rank the models under covariate shift and spatial correlation. Overfitted models with low generalization ability are incorrectly ranked at the top of the list, and the best models, which are simple “linear” models, appear at the bottom. We compare these results with the results obtained for the problem without covariate shift in
Estimates of generalization error with different estimators for the problem without covariate shift. The BCV estimator produces estimates that are the most distant from the actual target error due to bias from its systematic selection of folds. All estimators are capable of ranking the models in the absence of covariate shift.

From
In this work, we introduce
We propose experiments with spatial data to compare estimators of generalization error, and illustrate how these estimators fail to rank models under covariate shift and spatial correlation. Based on the results of these experiments, we share a few remarks related to the choice of estimators in practice:
• The apparent quality of the BCV estimator is falsified in the QQ plots of
• The CV estimator is not adequate for geostatistical learning problems that show various forms of covariate shift. Situations without covariate shift are rare in geoscientific settings, and since the DRV estimator works reasonably well in both situations (i.e., with and without shift), it is recommended instead.
• Nevertheless, both the CV and DRV estimators suffer from a serious issue with overfitted models, in which case they largely underestimate the generalization error. For risk-averse applications where one needs to be careful about the generalization error of the model, the BCV estimator can provide more conservative results.
• None of the three estimators were capable of ranking models correctly under covariate shift and spatial correlation. This is an indication that one needs to be skeptical about interpreting similar rankings available in the literature.
Finally, we believe that this work can motivate methodological advances in learning from geospatial data, including research on new estimators of geostatistical generalization error as opposed to pointwise generalization error, and more explicit treatments of spatial coordinates of samples in learning models.
All concepts and methods developed in this paper are made available in the GeoStats.jl project [
Experiments of this specific work can be reproduced with the following scripts:
The datasets presented in this study can be found in the same repository with the code:
JH: Conceptualization, methodology, software, formal analysis, investigation, visualization, writing—original draft; MZ: Methodology, validation; BC: Data curation, validation; BZ: Methodology, validation, supervision.
All authors were employed by the company IBM Research.