
Edited by: Jacques Demongeot, Université Grenoble Alpes, France

Reviewed by: Yunlong Feng, University at Albany, United States; Alex Jung, Aalto University, Finland

This article was submitted to Mathematical Biology, a section of the journal Frontiers in Applied Mathematics and Statistics

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.


A sentinel network,^{1} monitoring the concentration of SARS-CoV-2 in wastewater at the inlet of wastewater treatment plants (WWTPs), has been deployed in France.^{2}

The sources of uncertainty in such a monitoring system are numerous:

sampling variance (although sampling at the WWTPs is integrated over 24 h),

variance from the Reverse Transcriptase (RT) and quantitative or digital Polymerase Chain Reaction (qPCR or dPCR) processes used to measure the concentration of SARS-CoV-2 RNA genes in the samples,

uncertainties in other analytical steps in the laboratories, such as virus concentration, genome extraction, and the presence of inhibitors.

Using the raw viral load measurements directly to monitor the pandemic can, therefore, be misleading, as a large variation in the measured concentration can be due either to a real variation in virus concentration or to a quantification error. These data therefore need pre-processing in order to provide an accurate estimate of the real daily amount of virus arriving at each WWTP and to evaluate the uncertainty of this estimate, as underlined by previous studies.

A solution for estimating the underlying true concentrations (from which the noisy measurements are derived) is to exploit the time dependence of the successive measurements. This temporal dependence is due to the (never observed) underlying process, while the measurement noises from one time step to another are assumed to be independent. A natural way of exploiting this time dependence to denoise the signal is Kalman filtering.

Concentration measurements provided by the laboratories are, however, censored: concentrations below a quantification threshold cannot be measured precisely, a situation that standard Kalman smoothing does not handle.

Data from the monitoring network also contain occasional outliers, i.e., aberrant measurements that are not representative of the underlying viral load and must be detected.

In this study, we focus on the one-dimensional setting and propose a Smoother adapted to Censored data with OUtliers (SCOU) that addresses both the censoring and the outlier-detection problems through discretization of the state space of the monitored quantities and, more generally, permits a very high degree of modeling flexibility.

The proposed model and its implementation are described in section 2. Section 3 provides an illustration and validation of our approach using numerical simulations. Section 4 gives an example of application to real data from the monitoring network.

Our method is based on the following state-space model:

X_t = η X_{t−1} + δ + ε_{X,t}, with ε_{X,t} ~ 𝒩(0, σ²),

Y*_t = X_t + ε_{Y,t}, with ε_{Y,t} ~ 𝒩(0, τ²) when O_t = 0 (the distribution of ε_{Y,t} is modified when the measurement is an outlier, i.e., when O_t = 1),

Y_t = Y*_t if Y*_t ≥ ℓ, and Y_t is reported as censored otherwise,     (1)

where:

X_t ∈ ℝ is the real quantity at time t; (X_t)_{t ∈ {1, …, n}} is the vector of real quantities (to be recovered).

Y_t ∈ ℝ is the measurement at time t; Y_t is generally only partially observed (values below ℓ are censored). Y*_t is an accessory latent variable corresponding to a non-censored version of Y_t.

η ∈ ℝ, δ ∈ ℝ, σ ∈ ℝ^{+}, and τ ∈ ℝ^{+} are parameters (to be estimated).

ℓ is the threshold below which censorship applies.^{3}

O_t ∈ {0, 1} is, for any t ∈ {1, …, n}, the indicator variable of the event “Y_t is an outlier”.

Diagram of the proposed auto-regressive model. (X_t)_{t ∈ {1, …, n}} is the underlying auto-regressive process (to be recovered), Y_t are the measurements, O_t is the indicator variable for the event “Y_t is an outlier”, ε_{X,t} are the innovations, and ε_{Y,t} are the measurement errors.
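To make the model concrete, it can be simulated in a few lines. The sketch below is ours, not the article's code: it assumes, for illustration only, that an outlying measurement has an inflated error standard deviation (out_scale × τ), and it reports censored values (Y* < ℓ) as missing; the function name and default rates are illustrative.

```python
import numpy as np

def simulate_scou(n, eta, delta, sigma, tau, ell, p_out=0.05,
                  out_scale=5.0, seed=0):
    """Simulate Model (1): AR(1) signal, noisy measurements,
    outliers (here: inflated-variance noise) and left-censoring."""
    rng = np.random.default_rng(seed)
    x = np.empty(n)
    x[0] = delta / (1.0 - eta) if abs(eta) < 1.0 else 0.0  # start at the stationary mean
    for t in range(1, n):
        x[t] = eta * x[t - 1] + delta + rng.normal(0.0, sigma)
    o = rng.random(n) < p_out                        # outlier indicators O_t
    sd = np.where(o, out_scale * tau, tau)           # illustrative outlier model
    y_star = x + rng.normal(0.0, 1.0, size=n) * sd   # non-censored measurements Y*_t
    y = np.where(y_star >= ell, y_star, np.nan)      # NaN marks a censored value
    return x, y, o

x, y, o = simulate_scou(n=150, eta=0.9, delta=0.1, sigma=0.3, tau=0.2, ell=0.4)
```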

We are interested in the distribution of each X_t conditionally on all the available measurements, i.e., the smoothing distribution.

This distribution can be computed from the forward (F) and backward (B) quantities, defined as:

F_t(x) = p(X_t = x, Y_1, …, Y_t),     (2)

B_t(x) = p(Y_{t+1}, …, Y_n | X_t = x),     (3)

so that p(X_t = x | Y_1, …, Y_n) ∝ F_t(x) B_t(x).

For a linear Gaussian model without censoring, these quantities can be computed exactly and in closed form (this is the classical Kalman smoother), but censoring and outliers make these exact computations intractable.

We have, hence, developed a discrete version of those computations, which makes it easy to adapt to many possible extensions of the model (censorship, but also outliers and possibly heteroscedasticity, for instance). The set of values that can be taken by X_t is discretized into a regular grid with step Δ.

The transition matrix, π, is calculated as follows:

π(x, x′) ∝ φ_{σ²}(x′ − η x − δ),

where φ_{σ²} denotes the density of the 𝒩(0, σ²) distribution; each row of π is then normalized so that it sums to one.
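Under the Gaussian AR(1) transition, the discretized transition matrix can be sketched as follows (the grid bounds and step below are illustrative, and row-normalization stands in for exact integration over grid cells):

```python
import numpy as np

def transition_matrix(grid, eta, delta, sigma):
    """Discretized AR(1) transition matrix: pi[i, j] is proportional to the
    N(0, sigma^2) density evaluated at grid[j] - (eta * grid[i] + delta),
    with each row renormalized to sum to one."""
    diff = grid[None, :] - (eta * grid[:, None] + delta)
    dens = np.exp(-0.5 * (diff / sigma) ** 2)      # unnormalized Gaussian density
    return dens / dens.sum(axis=1, keepdims=True)  # row-stochastic matrix

grid = np.arange(-2.0, 2.0 + 1e-9, 0.1)  # grid with discretization step Delta = 0.1
pi = transition_matrix(grid, eta=0.9, delta=0.0, sigma=0.3)
```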

Let e_t(x) = p(Y_t | X_t = x) denote the emission probability of observation t given that the underlying quantity takes the value x; for a censored observation, e_t(x) = p(Y*_t < ℓ | X_t = x).

where 1_{A = a} is the indicator variable for the event {A = a} and φ_{τ²} is the density of the 𝒩(0, τ²) distribution.

The F and B quantities, defined in Equations (2) and (3), can then be calculated recursively from π and the emission probabilities e_t.

The likelihood of the observations, p(Y_1, …, Y_n) = Σ_x F_n(x), is obtained as a by-product of the forward recursion.

Numerical tricks, like logarithmic re-scaling, are moreover used to calculate these quantities while avoiding numerical underflow.
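The forward-backward recursions with logarithmic re-scaling can be sketched as below. This is a simplified version, not the article's implementation: it ignores the outlier component of the emission (plain Gaussian errors), uses a uniform prior on the grid, and encodes a censored measurement as NaN, with emission probability Φ((ℓ − x)/τ).

```python
import numpy as np
from math import erf, sqrt

def _logsumexp(a, axis):
    """log(sum(exp(a))) along an axis, computed without underflow."""
    m = np.max(a, axis=axis, keepdims=True)
    return (m + np.log(np.sum(np.exp(a - m), axis=axis, keepdims=True))).squeeze(axis)

def log_emission(y, grid, tau, ell):
    """Log-likelihood of one measurement for every grid value.
    NaN stands for a censored measurement, whose likelihood is
    P(Y* < ell | X = x) = Phi((ell - x) / tau)."""
    if np.isnan(y):
        cdf = np.array([0.5 * (1.0 + erf((ell - x) / (tau * sqrt(2.0)))) for x in grid])
        return np.log(np.clip(cdf, 1e-300, None))
    return -0.5 * ((y - grid) / tau) ** 2 - np.log(tau * sqrt(2.0 * np.pi))

def smooth(y, grid, log_pi, tau, ell):
    """Discrete forward-backward smoother in log scale; returns the
    smoothing distribution p(X_t = x | Y_1, ..., Y_n) over the grid."""
    n, m = len(y), len(grid)
    le = np.array([log_emission(v, grid, tau, ell) for v in y])
    logF = np.empty((n, m))
    logB = np.zeros((n, m))                     # B_n is constant by convention
    logF[0] = -np.log(m) + le[0]                # uniform prior on the grid
    for t in range(1, n):                       # forward pass, Equation (2)
        logF[t] = _logsumexp(logF[t - 1][:, None] + log_pi, axis=0) + le[t]
    for t in range(n - 2, -1, -1):              # backward pass, Equation (3)
        logB[t] = _logsumexp(log_pi + (le[t + 1] + logB[t + 1])[None, :], axis=1)
    post = logF + logB
    post -= _logsumexp(post, axis=1)[:, None]   # normalize each time step
    return np.exp(post)

# Toy usage: five measurements, the third one censored (NaN):
grid = np.linspace(-2.0, 2.0, 41)
diff = grid[None, :] - 0.9 * grid[:, None]
pi = np.exp(-0.5 * (diff / 0.3) ** 2)
pi /= pi.sum(axis=1, keepdims=True)
y = np.array([0.5, 0.6, np.nan, 0.4, 0.55])
post = smooth(y, grid, np.log(pi), tau=0.2, ell=0.3)
```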

The parameters η, δ, σ, τ, and the outlier probability are estimated by maximizing the likelihood of the observations, which is computed from the forward quantities.

For any t < n, the probability that X_t (once discretized) takes the value x, conditionally on the observations and on X_{t+1}, is proportional to F_t(x) π(x, X_{t+1}).^{4}

We similarly show that the conditional distribution of X_n given all the observations is proportional to F_n.

Thus, to simulate a trajectory from the conditional distribution of (X_1, …, X_n) given the observations, one first draws X_n and then, backward in time, each X_t given the simulated X_{t+1}.
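The backward simulation just described can be sketched as follows; here F is assumed to hold the forward quantities on the natural (non-log) scale, and the function name is ours:

```python
import numpy as np

def backward_sample(F, pi, rng):
    """Draw one trajectory (as grid indices) from p(X_1..X_n | Y_1..Y_n):
    X_n is drawn from the normalized F_n, then each X_t is drawn, backward
    in time, with probabilities proportional to F_t(x) * pi[x, x_{t+1}]."""
    n, m = F.shape
    idx = np.empty(n, dtype=int)
    w = F[-1] / F[-1].sum()                  # X_n ~ normalized F_n
    idx[-1] = rng.choice(m, p=w)
    for t in range(n - 2, -1, -1):           # X_t | X_{t+1} ∝ F_t(x) pi[x, x_{t+1}]
        w = F[t] * pi[:, idx[t + 1]]
        w = w / w.sum()
        idx[t] = rng.choice(m, p=w)
    return idx

# Toy usage with stand-in forward quantities and a Gaussian transition matrix:
rng = np.random.default_rng(1)
m = 11
grid = np.linspace(0.0, 1.0, m)
diff = grid[None, :] - grid[:, None]
pi = np.exp(-0.5 * (diff / 0.2) ** 2)
pi /= pi.sum(axis=1, keepdims=True)
F = rng.random((20, m)) + 1e-3               # arbitrary positive forward array
traj = backward_sample(F, pi, rng)
```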

The probability for each observation Y_t to be an outlier (knowing all the observations) can also be computed from these quantities.

An outlier is detected if this posterior probability exceeds a given threshold (e.g., one half).
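Under a mixture emission model, the posterior outlier probability can be obtained by averaging the mixture responsibility over the smoothing distribution of X_t. The sketch below is illustrative: the assumption that an outlier has an inflated error standard deviation (out_scale × τ), the prior outlier probability p_out, and the function name are ours, not necessarily the article's exact choices.

```python
import numpy as np

def outlier_posterior(y, grid, post_x, tau, p_out, out_scale=5.0):
    """P(O_t = 1 | Y_1..Y_n) for each observation, assuming an outlying
    measurement has Gaussian error with sd out_scale * tau; post_x[t, i]
    is the smoothing distribution p(X_t = grid[i] | Y_1..Y_n)."""
    def norm_pdf(v, sd):
        return np.exp(-0.5 * (v / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))

    probs = np.empty(len(y))
    for t, yt in enumerate(y):
        f_in = norm_pdf(yt - grid, tau)               # regular measurement density
        f_out = norm_pdf(yt - grid, out_scale * tau)  # inflated-variance density
        ratio = p_out * f_out / ((1.0 - p_out) * f_in + p_out * f_out)
        probs[t] = np.sum(post_x[t] * ratio)          # average over p(X_t | Y)
    return probs

# Toy usage: smoothing distribution concentrated at x = 0.5; the second
# measurement (3.0) is far from any plausible state value:
grid = np.linspace(0.0, 1.0, 11)
post_x = np.zeros((2, 11))
post_x[:, 5] = 1.0
probs = outlier_posterior(np.array([0.5, 3.0]), grid, post_x, tau=0.1, p_out=0.05)
```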

The proposed estimation is first evaluated on artificial data. The simulations are designed to study the ability of the algorithm to produce a good estimation of the parameters, to identify correctly the outliers, and to adequately predict the conditional distribution of the underlying process.

Artificial data are simulated according to Model (1) with 150 time steps.

We carry out five experiments:

· (1) with no censoring, no outliers, δ = 0, η = 1, and Δ = 0.02;

· (2) same setting as (1) with Δ = 0.1;

· (3) same setting as (1) with Δ = 0.7;

· (4) with Δ = 0.1 and ℓ set, for each simulated data set, such that 16% of the measurements are censored (medium censoring level);

· (5) same setting as (4) but with 31% of censored data (high censoring level).

Each experiment is replicated 100 times. For each simulation, we compare our approach, SCOU, to:

a 2-parameter Kalman smoother implemented in the DLM R package;

a moving-average smoother;

the Locally Estimated Scatterplot Smoothing (LOESS) method.

As in comparable studies, performances are measured by the Root Mean Squared Error (RMSE) with respect to the true underlying signal.

The first, second, and third experiments (no outliers, no censoring, δ = 0, and η = 1) show, as expected, that the 2-parameter Kalman smoother implemented in the DLM R package and our method give identical results for Δ = 0.02. The two methods show close performances for Δ = 0.1 and a substantial degradation when Δ = 0.7, as shown by the RMSE values reported below.

Root Mean Squared Error (RMSE) obtained for the prediction of the true underlying signal by the 2-parameter exact smoother and by our method on data sets simulated with no outliers, no censoring, δ = 0, and η = 1, and with varying values of the discretization step Δ. As expected, in this ideal setting, the 2-parameter Kalman smoother implemented in the DLM R package and our method give identical results for Δ = 0.02.

In the following, we focus on the two other experiments (medium and high censoring levels, experiments 4 and 5).

Simulated data with 16% of censored data and the corresponding smoothed estimate (medium censoring level).

Parameter estimates with our method (σ, τ, η, δ) and with a 2-parameter Kalman smoother (σ-dlm and τ-dlm) for 100 replicates of the simulation experiment (150 time steps, fixed outlier rate, medium censoring level).

The posterior probability that each observation Y_t is an outlier is also well estimated, which allows a reliable detection of the outliers.

Finally, we evaluate the ability of our method to predict the conditional distribution of the underlying process X_t.

Root Mean Squared Errors for the prediction of the underlying signal X_t.

As for the variance prediction, the coverage rates of the 95% prediction intervals of our method, derived from the predicted distributions of X_t, are close to their nominal level.

Coverage rates of the prediction intervals for the prediction of X_t.

The developed smoothing method aims to provide an estimate of the actual amount of viral genome arriving at each WWTP and to assess the uncertainty of this estimation.

The concentration measurements provided by the laboratories are first adjusted for the daily inflow volume of each WWTP.

The direct application of SCOU to the flow-adjusted measurements shows heteroscedasticity of the residuals (estimated by the difference between the measurements and the smoothed signal): the measurement noise increases with the level of the signal.

Hence, for this data set, the measurements are log-transformed before smoothing.

We now illustrate the application of SCOU to real data from the monitoring network.

The smoothing parameters are estimated separately for each of the 190 monitored WWTPs.

The uncertainty of the monitoring system is evaluated through the parameter τ, whose estimated distribution over the 190 monitored WWTPs quantifies the measurement noise of the system.

Importantly, the resulting smoothed signal is well correlated with the logarithm of the local COVID-19 incidence rates, and this correlation is most of the time greatly enhanced by the proposed smoothing step, as depicted in the figure below.

Correlation of the raw and smoothed signals with the logarithm of the local COVID-19 incidence rates, for each WWTP.

In order to produce comparable values from one WWTP to another, the smoothed signals are further normalized before dissemination.

We developed a method to smooth one-dimensional time series consisting of successive censored measurements with outliers, when the associated measurement uncertainty is not known and the measured quantities have an auto-regressive nature. By discretizing the state space of the monitored quantities, the proposed method has the advantage of being easily adaptable to the specificities of the data (such as measurement censoring and the occurrence of outliers). An experiment on artificial data validates the proposed inference and prediction method. Our method has then been successfully applied to data generated by the French wastewater monitoring network during the COVID-19 pandemic.

The proposed method could be further developed. First, the underlying auto-regressive process could be replaced by a more flexible model of the signal dynamics.

Another way to proceed would be to deduce, from other available epidemiological data, the shape of the signal to be found in wastewater (and thus an adequate smoothing), based on a fine modeling of the whole pathway of SARS-CoV-2 from the human population to wastewater, such as those proposed in the literature.

The datasets presented in this study can be found in online repositories. The names of the repository/repositories and accession number(s) can be found at:

The

YM, J-MM, VM, LM, and SW raised the scientific problem. MC contributed to the design of the algorithm, performed the experiments on artificial and real data, and wrote the manuscript. GN contributed to the design of the algorithm and coordinated the experiments and the writing of the manuscript. NC and SW prepared the data provided by the monitoring network.

This study was carried out within the

The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

All claims expressed in this article are solely those of the authors and do not necessarily represent those of their affiliated organizations, or those of the publisher, the editors and the reviewers. Any product that may be evaluated in this article, or claim that may be made by its manufacturer, is not guaranteed or endorsed by the publisher.

We would like to thank the WWTPs and the laboratories that contribute day to day to the monitoring network.

^{1}

^{2}In November 2021, representing more than one third of the French population.

^{3}In practice, ℓ can vary from one day to another, for instance if one works on quantities that correspond to the multiplication of concentrations (with a detection limit) by a fluctuating volume. This can be taken into account within our method with no additional cost.

^{4}Hint: decompose the joint probability p(X_t = x, X_{t+1}, Y_1, …, Y_n) using the Markov property of the model.