Edited by: Aristides (Aris) Moustakas, Natural History Museum of Crete, University of Crete, Greece

Reviewed by: Ling Xue, Harbin Engineering University, China; Martin Kröger, ETH Zürich, Switzerland

This article was submitted to Social Physics, a section of the journal Frontiers in Physics

This is an open-access article distributed under the terms of the Creative Commons Attribution License (CC BY). The use, distribution or reproduction in other forums is permitted, provided the original author(s) and the copyright owner(s) are credited and that the original publication in this journal is cited, in accordance with accepted academic practice. No use, distribution or reproduction is permitted which does not comply with these terms.

In December 2019, novel coronavirus disease (COVID-19) emerged in Wuhan, Hubei Province, China, and spread to the rest of China and overseas. The emergence of this virus coincided with the Spring Festival Travel Rush in China. The total number of COVID-19 cases in Wuhan by 23 January 2020 can be estimated from the cases reported in other cities/regions and the population flow data between Wuhan and these cities/regions. We built a model to estimate this total from the number of cases detected outside Wuhan in China, assuming that cases exported from Wuhan were less likely to be underreported in other cities/regions. We employed population flow data from different sources between Wuhan and other cities/regions up to 23 January 2020. The total number of cases in Wuhan was determined by maximum log-likelihood estimation and Akaike Information Criterion (AIC) weights. We estimated 8 679 (95% CI: 7 701, 9 732) total COVID-19 cases in Wuhan by 23 January 2020, based on combined data from Tencent and Baidu. The source of population flow data affects the estimate of the total number of COVID-19 cases in Wuhan before the city lockdown. A comprehensive analysis based on different data sources is needed to overcome the bias of any single source.

Frontiers Media SA remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

In December 2019, a cluster of patients with pneumonia of unknown causes was reported in Wuhan, Hubei Province, China [

At the early stage of this outbreak, the cases might have been severely underreported due to the lack of diagnostic kits and insufficient screening for all suspected cases [

In this study, we aimed to estimate the number of COVID-19 cases in Wuhan, based on the cases exported from Wuhan to other cities/regions in mainland China and on different sources of population flow data between Wuhan and these cities/regions. We tested the impact of different sources of population flow data on the estimated number of cases in Wuhan before the city lockdown, and combined the sources to overcome the bias of any single one. The estimates were made up to 23 January 2020 (before the suspension of public transportation in Wuhan). We assumed that the cases exported from Wuhan were less likely to be underreported in other cities/regions in mainland China, as stringent temperature screening was implemented at airports and railway stations.

We obtained the daily number of inbound and outbound domestic passengers traveling by air, train, or road to/from Wuhan from two data sources:

(1) Tencent's LBS (location-based services) database (see:

(2) Baidu map database (see:

For each source separately, we divided the population flow data evenly to obtain the average daily population flow.

The geographical distribution of exported COVID-19 cases in China. The figure shows the number of reported COVID-19 cases in China; the dark gray area indicates regions with zero COVID-19 cases as of 23 January 2020. Red paths show routes from Wuhan to other cities/regions.

As shown in

$p_i$ is the probability of detecting any exported cases from Wuhan in city/region $i$.

The probability $p_i$ can be derived from dividing daily outbound passengers of Wuhan to city/region

Then, we used the cases exported from Wuhan to estimate the total number of COVID-19 cases infected in Wuhan (λ). Based on the data obtained from each city/region, we obtained λ by maximum likelihood estimation.

In Equation (4), $x_i$ represents the number of cases exported from Wuhan and detected in city/region $i$, and $p_i$ is the probability of finding any exported cases from Wuhan in city/region $i$.
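The maximum likelihood step can be illustrated with a minimal sketch. It assumes that the number of cases detected in each destination is binomially distributed with λ trials and a destination-specific detection probability, and that the 95% interval is a profile-likelihood interval; the text here does not reproduce Equation (4), so both the likelihood form and the interval method are assumptions. The function names and the input numbers are hypothetical and are not the study's data.

```python
import math


def binom_logpmf(k, n, p):
    """Log binomial pmf, computed with log-gamma to avoid overflow."""
    return (math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
            + k * math.log(p) + (n - k) * math.log1p(-p))


def log_likelihood(lam, cases, probs):
    """Sum of binomial log-likelihoods over destinations for a candidate lam."""
    return sum(binom_logpmf(x, lam, p) for x, p in zip(cases, probs))


def estimate_lambda(cases, probs, lam_max=20_000):
    """Grid-search MLE of lam with an assumed 95% profile-likelihood interval."""
    grid = range(max(cases), lam_max + 1)
    ll = [log_likelihood(lam, cases, probs) for lam in grid]
    best = max(ll)
    # Keep every lam whose log-likelihood is within 1.92 of the maximum
    # (half the 95% chi-square quantile with one degree of freedom).
    inside = [lam for lam, v in zip(grid, ll) if v >= best - 1.92]
    return grid[ll.index(best)], inside[0], inside[-1]


# Hypothetical exported-case counts and detection probabilities (assumptions).
cases = [10, 5, 2]
probs = [0.002, 0.001, 0.0004]
lam_hat, ci_low, ci_high = estimate_lambda(cases, probs)
```

With small detection probabilities the estimate lands close to the total number of exported cases divided by the sum of the probabilities, which gives a quick sanity check on the grid search.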

We assumed a catchment population of 19 million traveling through the airport, railway stations, and highways in Wuhan, and an average delay of 10 days, which accounts for the reported time interval between infection and case detection [
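As an illustration of how these assumptions could enter the detection probability, the sketch below takes the average daily outbound flow to a destination, divides it by the catchment population, and multiplies by the 10-day window. This product form follows the approach commonly used in travel-based case estimation and is an assumption here, since the article's own equation is not reproduced in this text. The function name and the example flows are hypothetical.

```python
# Hypothetical helper: the names and the exact formula are assumptions.
CATCHMENT_POPULATION = 19_000_000  # catchment population stated in the text
DETECTION_WINDOW_DAYS = 10         # average delay stated in the text


def detection_probability(daily_outbound,
                          catchment=CATCHMENT_POPULATION,
                          window_days=DETECTION_WINDOW_DAYS):
    """Probability that a single Wuhan case is detected in one destination.

    Assumed form: (daily travellers to the destination / catchment population)
    multiplied by the number of days over which a case could travel.
    """
    if daily_outbound < 0:
        raise ValueError("daily_outbound must be non-negative")
    return daily_outbound * window_days / catchment


# Hypothetical average daily flows from Wuhan to two destinations.
daily_flows = {"city_A": 3_800, "city_B": 950}
probabilities = {city: detection_probability(n) for city, n in daily_flows.items()}
```

Because the probability scales linearly with the daily flow, a data source that reports systematically larger flows yields larger probabilities and therefore a smaller estimate of λ for the same exported-case counts.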

To overcome the bias from different sources of data, we first evaluated the correlation between the two datasets to determine whether there is an apparent inconsistency or discrepancy between them. The Spearman's rank correlation coefficient between the Baidu and Tencent data for the same 24 cities/regions is 0.75, which means that the two sources are correlated at the 99.99% confidence level. We assumed a linear relationship between the Baidu data and the Tencent data and tested the null hypothesis $H_0$ that α = 0. In Equation (6), $N_{\mathrm{Baidu}}$ and $N_{\mathrm{Tencent}}$ represent the population flow from Baidu and Tencent.
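The consistency check described above can be sketched as follows: a Spearman rank correlation between the two flow series, then an ordinary least-squares line. The sketch assumes the line has the form Tencent = α · Baidu + β, which matches the axes of the comparison figure but is not confirmed by the text; the significance test of α = 0 is not included. All function names and flow values are hypothetical.

```python
import math


def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        for k in range(start, end + 1):
            ranks[order[k]] = (start + end) / 2 + 1
        start = end + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def fit_line(x, y):
    """Ordinary least squares for y = alpha * x + beta."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    alpha = (sum((a - mx) * (b - my) for a, b in zip(x, y))
             / sum((a - mx) ** 2 for a in x))
    return alpha, my - alpha * mx


# Hypothetical flows for four destinations (not the study's data).
baidu = [10_000, 20_000, 30_000, 40_000]
tencent = [2_300, 3_250, 4_300, 5_240]
rho = spearman(baidu, tencent)
alpha, beta = fit_line(baidu, tencent)
```

A rank correlation is a reasonable first check because it only asks whether the two sources order the destinations similarly, regardless of the scale on which each platform counts travelers.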

The estimated coefficient α equals 0.10 and β equals 1 272, and we rejected $H_0$ at the 99% confidence level, which suggests that the two sources of data are likely to have a linear relationship. Both sets of data are therefore likely to be reasonable. We then applied the Akaike Information Criterion (AIC) to $x_{\mathrm{est},i}$, the estimated number of exported cases, which follows a binomial distribution (Equation 2), based on Baidu and Tencent data; see Equation (7). To estimate the number of cases exported from Wuhan, the model used the total number of COVID-19 cases infected in Wuhan (λ) estimated from Equation (4). $p_i$ is the probability of finding any exported cases from Wuhan in city/region $i$.

Because the Baidu and Tencent data show a significant linear relationship, which indicates that the two sources corroborate each other and that the general pattern of the data is plausible, we weighted (Equation 8) and combined (Equation 9) the estimated numbers of cases from Baidu and Tencent based on their AIC values to obtain the final estimate.

In Equations (8) and (9), $w_s$ and $\mathrm{AIC}_s$ represent the weight of the estimated number of COVID-19 cases infected in Wuhan and the AIC value for source $s$; $\lambda_s$ is the estimate of the total number of cases from Equation (4), based on source $s$.
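The weighting and combination steps can be sketched as below. Equations (8) and (9) are not reproduced in this text, so the sketch assumes weights inversely proportional to each source's AIC value and a weighted sum of the per-source estimates. That form is an assumption, chosen because it reproduces the weights and the combined estimate listed in the result tables of this article when the first AIC column of each source group is paired with the first estimate column of the same group. It is not the standard Akaike weight based on exp(-ΔAIC/2), which would put nearly all weight on the lower-AIC source. The source labels and function names are hypothetical.

```python
def inverse_aic_weights(aic_by_source):
    """Weights proportional to 1 / AIC, normalised to sum to 1 (assumed form)."""
    inverse = {source: 1.0 / aic for source, aic in aic_by_source.items()}
    total = sum(inverse.values())
    return {source: value / total for source, value in inverse.items()}


def combine_estimates(estimates, weights):
    """Weighted sum of the per-source estimates of the total case number."""
    return sum(weights[source] * estimates[source] for source in estimates)


# AIC values and per-source estimates taken from the first column of each
# source group in the tables; the labels "source_1"/"source_2" are placeholders.
aic_values = {"source_1": 137.8, "source_2": 147.1}
estimates = {"source_1": 4969, "source_2": 13251}

weights = inverse_aic_weights(aic_values)

# The tabulated combined value matches weights rounded to three decimals.
rounded_weights = {source: round(w, 3) for source, w in weights.items()}
combined = combine_estimates(estimates, rounded_weights)
```

Under this assumed form the two weights stay close to 50% whenever the AIC values are of similar magnitude, so the combined estimate sits near the midpoint of the two single-source estimates.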

Comparison between the population flow data from Baidu and Tencent. The y-axis shows the population flow between Wuhan and other cities/regions according to Tencent data. The x-axis shows the population flow between Wuhan and other cities/regions according to Baidu data.

Based on the data sourced from Tencent and Baidu, we estimated the total number of cases in Wuhan, λ (

Log maximum likelihood estimation for λ, based on data sourced from Baidu and Tencent.

AIC value and calculated weight of the final estimate for different sources of population data (under different assumptions of θ, the probability that an unspecified case reported in other cities/regions is an exported case from Wuhan).

AIC value | 137.8 | 137.5 | 136.6 | 147.1 | 146.7 | 144.6 |

Weight | 51.6% | 51.6% | 51.4% | 48.4% | 48.4% | 48.6% |

Summary table of estimated total number of cases infected in Wuhan (including cases exported from Wuhan to other cities/regions) and number of cases in Wuhan (excluding cases exported from Wuhan to other cities/regions) by 23 January 2020, from different sources of data.

Estimated total number of cases infected in Wuhan (95% CI) | 4 969 (4 426, 5 554) | 4 819 (4 284, 5 395) | 4 635 (4 111, 5 201) | 13 251 (11 811, 14 803) | 12 855 (11 437, 14 384) | 12 371 (10 981, 13 872) | 8 977 (8 000, 10 031) | 8 708 (7 746, 9 746) | 8 395 (7 450, 9 415) |

Estimated number of cases in Wuhan by 23 January 2020 (95% CI) | 4 672 (4 129, 5 257) | 4 531 (3 996, 5 107) | 4 358 (3 834, 4 924) | 12 950 (11 510, 14 502) | 12 563 (11 145, 14 092) | 12 090 (10 700, 13 592) | 8 679 (7 701, 9 732) | 8 418 (7 456, 9 456) | 8 116 (7 171, 9 137) |

A recent study by Imai et al. estimated a total of 4 000 (95% CI: 1 000–9 700) cases on 18 January 2020 [

Estimates of the population outflow provided by Baidu and Tencent differ substantially, which leads to significantly different results. Nevertheless, the Baidu and Tencent data show a significant linear relationship, which means that the patterns of the two sources are largely consistent. One possible reason for the difference is that institutions define the population flow from one city to another differently. Methods that include people who travel to other cities through Wuhan may yield much larger figures than methods that count only people who originally depart from Wuhan. Multiple round trips may also affect the count. Another possible reason is that Baidu and Tencent cannot track the entire population flow, since not everyone uses mobile phone software from Baidu and Tencent.

Imai et al. suggested that further improving the definition and testing of COVID-19 cases and further expanding the scope of epidemic monitoring would narrow the gap between the estimated number and the officially reported cases. According to our results, population flow statistics also play a significant role in the estimation. At present, many studies use data from the Baidu and Tencent platforms [

Different sources of population flow data affect the estimates of the total number of COVID-19 cases in Wuhan before the city lockdown. We built a reproducible model that combines differing sets of population flow data to estimate the number of COVID-19 cases more reasonably. We estimated 8 679 (95% CI: 7 701, 9 732) total COVID-19 cases in Wuhan by 23 January 2020, based on combined data from Tencent and Baidu. Which data source yields the most reliable estimate is not yet clear, although estimates based on a single source of data are likely to be biased. A comprehensive analysis based on different statistics is needed before any conclusions are reached.

Publicly available datasets were analyzed in this study. This data can be found here:

All authors listed have made a substantial, direct and intellectual contribution to the work, and approved it for publication.

DH received a grant from Alibaba (China) Co. Ltd., Collaborative Research grant. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.

The Supplementary Material for this article can be found online at: