Statistical modelling is pivotal in assessing the intensity of a stochastic process. The novel coronavirus disease demanded proactive measures to understand the severity of disease spread and to plan its control accordingly. We propose estimation of the reproduction number as a crucial factor for monitoring the random dynamics of Covid-19 in India. In the present paper, semi-parametric regression based on penalized splines, embedded in a Bayesian formulation, is used to estimate the reproduction number while incorporating the effects of underreporting and delay in reporting on the actual number of daily occurrences. Markov chain Monte Carlo approximations are used to perform a simulation study and thereby to assess the impact of the reporting probability and of misspecification of the delay pattern on the potential for further sustenance of the pandemic. For a weekly reporting cycle, the proposed penalized spline Bayesian framework fits the empirical data most closely under a two-day delay in reporting with approximately half of the actual cases being reported. The present paper is a contribution towards estimation of the true daily reproduction number of Covid-19 incidences in its next generation cycle.
It is difficult to capture the exact evolution and dynamics of any pandemic through deterministic modelling. The appearance of a novel pathogen leading to a sustained pandemic therefore calls for accounting of piecewise temporal changes over short time spans. The existing count of infectees and the transmission time of infection are the two key factors in assessing the growth rate of the infection in a specified population. Transmission and propagation of any novel communicable pathogen are quantified through statistical models, which are validated on short-history data curated from the current time chain of the pandemic. The average number of infection transmissions generated by a single currently active infected unit at a given time t is called the reproduction number ($R_t$). In simple words, $R_t$ represents the number of secondary infections that one infected individual can spread further at time t. $R_t < 1$ indicates that the epidemic is under control and approaching a disease-free state, while $R_t > 1$ indicates that the control measures lack efficacy, leading to the possibility of an endemic transforming into an epidemic. During the progression of any epidemic, the expected rise or decline in the reproduction number explains the time-based renewals. These estimates are therefore vital for future medical assistance and planning in terms of manpower, medical machinery, medicines and health care infrastructure.
A spline is a flexible mathematical function that represents changing data through piecewise polynomials. Splines have been an important tool for addressing various mathematical problems in approximation theory and in numerical analysis.
The modelling of Covid-19 data creates an essential need to trace the sample path of its evolution appropriately. Rapid changes in the trend of daily Covid-19 incidence are modelled using semiparametric regression in conjunction with versatile penalized splines to estimate $R_t$.
This approach is motivated by the fact that $R_t$ is viewed as an autoregressive entity determined completely by its present value and the transmission period only. This idea has also been studied through a likelihood approach.
During any epidemic outbreak, regional governmental bodies are the sole source of incidence data on infection and mortality. Such data sets are prone to misreporting because of uncertainties in the real-time reporting of test outcomes and because of the limited capacity of testing facilities, usually when the pandemic is close to its peak. These limitations are classified as delay in reporting and underreporting respectively. The length of the diagnostic procedure, the limited number of testing centres, and hospital holidays, including restricted-capacity functioning during weekends, are sources of delay in reporting, while misreporting may arise from false negative tests and from ignorance among the population during the early stages of the pandemic. The distribution of the time elapsed between the onset of infection in an infector and its occurrence in the infectee is assumed to be known and is termed the serial (or generation) interval distribution. Use of Bayesian inference for the semi-parametric regression model brings flexibility to the estimation process through sequential updating with new infectee counts. In the present work, we adopt a Bayesian semiparametric regression model with splines to estimate $R_t$ along with the delay and reporting probabilities. The obtained Bayes estimates are expected to indicate the degree of success of the epidemic control strategies.
Our paper is organised as follows. Section 2 describes the statistical model for step-wise inclusion of underreported active cases and delay in the reporting process. In Section 3, we analyse coronavirus disease (COVID-19) incidence data for India and estimate the daily reproduction number and the delay parameter for the duration of the outbreak from 15th March 2020 up to 13th April 2020 (https://www.covid19india.org/). We conduct a simulation study to assess and validate the proposed method of Bayes estimation of $R_t$ after adjusting for the impact of misspecification of delay patterns and different reporting probabilities. Section 4 is devoted to discussion. Section 5 concludes the study and explores avenues for further research.
2. Methods
The methodology rests on four broad concepts. First, observed and actual case counts are assumed to follow Poisson distributions. Second, the epidemic renewal equation is incorporated to reflect the progression of disease spread. Third, semi-parametric spline regression is used to model $R_t$ while incorporating the effects of underreporting and delay structures; the model is fitted under the Bayesian paradigm, and the resulting scenarios are compared through the Deviance Information Criterion (DIC). Fourth, the $R_t$ simulated under various scenarios is compared with the estimated $R_t$ using the mean average squared error (MASE) and its components.
2.1 Notations and assumptions
Consider the actual number of new cases on each of the T days of an epidemic. For every day t = 1, 2, 3, …, T, this count consists of a reported part and an unreported part, so the actual count is the sum of the two. During any epidemic, the total count of reported cases is always less than the actual count. This underreporting at time t and the delay in reporting are responsible for underestimation when inference rests on the reported counts alone.
The reporting probability (τ) may be fixed or time dependent and is referred to as the thinning parameter. Suppose the generation interval has a maximum length of r days. The distribution of the time interval between the infection times of an infected case and its infector is then represented by an ordered set of probabilities, termed the generation interval probabilities, one for each lag i = 1, 2, …, r. The counts of individuals infected i days earlier, for i = 1, 2, …, r, are the quantities weighted by these probabilities.
Assume that all individuals infected on a given day have the same capacity for further transmission and that this capacity changes with each generation. The average number of infections at time t is then defined for t = 2, 3, …, r, while the average number infected in the first generation remains deterministic. Additionally, delay probabilities account for the percentage of cases on day i that are reported on day t. More specifically, a delay probability captures the delay structure: it is the proportion of the cases reported on day t that actually belong to day i. For example, if 100 cases are reported on day t and the delay probability is 0.4, then 40 of those cases actually belong to day i and are reported on day t. The observed count of infectees reflects both delay and underreporting. Owing to the delay structure, the observed count would exceed the actual count for some t. For constant values of τ and the delay probabilities over time T, the shape of the epidemic curve remains the same and only its position shifts. When these parameters vary over a time period, the epidemic curve changes shape. Both types of cases, observed and actual, are assumed to follow Poisson distributions with different means. The observed cases, which embody the underreporting and delay structures, are of primary analytical importance for this study and are used for estimation and inference.
2.2 Building distributional structures
Since the Poisson process is a renewal counting process, the total count of infected individuals on day t is assumed to follow a Poisson law with a day-specific mean. The probability of observing this count, conditional on the past prevalence, is given as,
(1)
More specifically, the conditioning is on the process prior to the current renewal point. The reported cases are assumed to follow a Binomial law given the actual cases; accounting for data augmentation of the actual prevalence due to underreporting, their probability is expressed as,
(2)
Thus, the observed effective mean prevalence reduces to the actual mean scaled by the reporting probability τ, without disturbing the existing correlation dynamics in the chain of the infectee and the infected, as follows,
(3)
Thus, (3) includes the impact of underreporting in the data. Further, incorporating the impact caused by the delay probabilities in (3), we have
(4)
where the mean in (4) represents the average number of infectees, that is, the mean number of observed cases. This mean includes three components, namely the reproduction number, the underreporting parameter and the delay parameter, which are to be estimated.
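The chain from actual to observed counts described in (1) to (4) can be sketched as follows. The notation is an assumption introduced only for illustration and is not taken from the text: $N_t$ for the actual count, $M_t$ for the reported count, $\mu_t$ for the Poisson mean of $N_t$, and $\delta_{i,t}$ for the share of day-$i$ cases reported on day $t$. The first display is the standard Poisson thinning identity; the last line is one plausible way of layering a delay structure on top of it.

```latex
% Assumed notation: N_t actual count, M_t reported count, mu_t = E(N_t | past),
% tau reporting probability, delta_{i,t} share of day-i cases reported on day t.
\begin{align*}
P(M_t = m)
  &= \sum_{n \ge m} \frac{e^{-\mu_t}\mu_t^{\,n}}{n!}\binom{n}{m}\tau^{m}(1-\tau)^{n-m} \\
  &= \frac{e^{-\mu_t}(\tau\mu_t)^{m}}{m!}\sum_{k \ge 0}\frac{\{(1-\tau)\mu_t\}^{k}}{k!}
   = \frac{e^{-\tau\mu_t}(\tau\mu_t)^{m}}{m!},
\end{align*}
so that $M_t \sim \mathrm{Poisson}(\tau\mu_t)$. With a delay structure,
\begin{equation*}
E(M_t \mid \text{past}) = \tau \sum_{i \le t} \delta_{i,t}\,\mu_i .
\end{equation*}
```

Under this sketch, thinning rescales the mean by τ without changing the Poisson form, which is why underreporting alone shifts the level of the epidemic curve but not its shape.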
Symbolically, the effective mean due to underreporting is weighted by the accumulated cases in the time interval [i, t] that are reported at time t. Physically, the cases arising from temporal underreporting and from delay in reporting are often not distinguishable. Hence, to resolve this identifiability issue, a composite link function L is used to map the actual mean to the observed mean. Next, we describe the different delay structures as follows.
(i) One-day delay: A fraction of the new cases on day t is reported on day t + 1.
(ii) Two-day delay: Fractions of the new cases on day t are reported on days t + 1 and t + 2 respectively.
(iii) Weekend delay: Owing to the weekly off on Saturday and Sunday at the reporting health centres, no cases are reported on these days. Wednesday, Thursday and Friday have the same delay pattern as in case (i) above. Monday and Tuesday report an additional fraction carried over from the preceding Saturday and Sunday.
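The three delay structures listed above can be sketched as delay matrices. This is a minimal illustration, not the authors' implementation: the function names, the convention that `D[i, t]` is the share of day-i cases reported on day t, the weekday coding, and the rule that Saturday reports move to Monday and Sunday reports move to Tuesday are all assumptions made for the example.

```python
import numpy as np


def one_day_delay(T, frac):
    """D[i, t]: share of day-i cases reported on day t (same day or next day)."""
    D = np.zeros((T, T))
    for i in range(T):
        D[i, i] = 1.0 - frac
        if i + 1 < T:
            D[i, i + 1] = frac
    return D


def two_day_delay(T, frac1, frac2):
    """Shares frac1 and frac2 of day-i cases are reported on days i+1 and i+2."""
    D = np.zeros((T, T))
    for i in range(T):
        D[i, i] = 1.0 - frac1 - frac2
        if i + 1 < T:
            D[i, i + 1] = frac1
        if i + 2 < T:
            D[i, i + 2] = frac2
    return D


def weekend_delay(T, frac, first_weekday=0):
    """One-day delay with no reporting on Saturday or Sunday.

    Assumption: reports due on Saturday move to Monday and reports due on
    Sunday move to Tuesday. Weekdays are coded 0 = Monday, ..., 6 = Sunday.
    Reports pushed beyond the horizon T are dropped.
    """
    D = one_day_delay(T, frac)
    for t in range(T):
        if (first_weekday + t) % 7 in (5, 6):
            if t + 2 < T:
                D[:, t + 2] += D[:, t]
            D[:, t] = 0.0
    return D


def expected_reports(actual_mean, tau, D):
    """Mean reported count per day: thin by tau, then spread over reporting days."""
    return tau * (np.asarray(actual_mean, dtype=float) @ D)
```

For instance, `expected_reports(mu, 0.5, weekend_delay(30, 0.4))` would give the mean observed counts for a 30-day series of actual means `mu` when half of the cases are reported and the weekend pattern applies.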
We use a penalized spline model to allow $R_t$ to change over time t. Furthermore, it is well known that penalized splines can be embedded in the linear mixed model framework, where the smoothing parameter is selected by estimating the random-effects variance component.
We model the stochastic $R_t$ in the epidemic renewal equation (5) by using a penalized spline as,
(5)
(6)
Here the model involves a vector of regression coefficients and an ordered set of fixed knots indexed by j = 1, 2, 3, …, E. The spline component in (6) represents the non-parametric part, which is amenable to mathematical computation. The statistical description of the considered model and its estimation through computational software are well documented.
To ensure the desired flexibility, the number of knots E should be sufficiently large. The number of knots is fixed at 15, based on past implementations of similar models. The vector of random coefficients is assumed to be normally distributed, with a covariance structure specified through a matrix with prescribed entries. The class of splines used in the present research is well formulated and documented in past studies. We model log $R_t$ instead of $R_t$ because $R_t$ cannot be negative and can take any value between 0 and infinity, whereas the right-hand side of equation (6) can be negative after estimation of the model parameters.
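The log-scale spline model described above can be sketched as follows. This is an illustration under stated assumptions rather than the model as fitted: the cubic radial basis `|t - knot|**3` is one common choice for penalized splines in mixed-model form and is assumed here, as are the equally spaced knots, the function names, and all coefficient values.

```python
import numpy as np


def radial_cubic_basis(t, knots):
    """Matrix with entries |t_i - knot_j|**3 (assumed basis, for illustration)."""
    t = np.asarray(t, dtype=float)
    knots = np.asarray(knots, dtype=float)
    return np.abs(t[:, None] - knots[None, :]) ** 3


def reproduction_curve(t, knots, beta0, beta1, u):
    """R_t from a linear trend plus spline terms on the log scale."""
    Z = radial_cubic_basis(t, knots)
    log_r = beta0 + beta1 * np.asarray(t, dtype=float) + Z @ np.asarray(u, dtype=float)
    # Exponentiating keeps R_t positive even when the linear predictor is negative.
    return np.exp(log_r)


T, E = 30, 15                               # 30 days, 15 knots
days = np.arange(1, T + 1)
knots = np.linspace(1, T, E + 2)[1:-1]      # equally spaced interior knots (assumption)
penalty = radial_cubic_basis(knots, knots)  # knot-by-knot matrix of the same basis

rng = np.random.default_rng(0)
u = rng.normal(scale=1e-5, size=E)          # hypothetical random coefficients
r_t = reproduction_curve(days, knots, beta0=0.5, beta1=-0.01, u=u)
```

Working on the log scale is what allows the unconstrained linear predictor to take negative values while the implied reproduction number stays in the interval from 0 to infinity.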
We assign diffuse priors and postulate a flat Beta(1,1) prior for τ and for each delay probability. The choice of priors was made after studying applications of the considered model under the Bayesian setup in past studies.
Considering E = 3 with a fixed generation distribution, we estimate the time-varying $R_t$ with fixed τ for t = 1, …, 30 days. The smoothed estimates of $R_t$ thus obtained are plotted in Fig. 1. We use R version 4.1.0 and OpenBUGS to execute the proposed formulation for estimation of $R_t$ and the associated parameters, and use the Deviance Information Criterion (DIC) to assess model adequacy and complexity (Tables 1 to 3) under fixed τ. The posterior mean and posterior standard deviation are obtained from Markov chain Monte Carlo (MCMC) approximations. We run the MCMC simulations for $10^5$ iterations and attain convergence with respect to MCMC error (<0.05). Each simulation is repeated 1000 times to obtain the corresponding posterior estimate. First, we generate epidemic data on the basis of a “reference”, which we refer to as the true underlying delay pattern, under τ = 0.5. Next, we estimate $R_t$ on the basis of four other delay patterns. Five combinations, or scenarios, of underreporting and delay structures are considered.
(i) Underreporting, one-day delay.
(ii) Underreporting, two-day delay.
(iii) Underreporting, weekend delay.
(iv) Underreporting, no delay.
(v) No underreporting with no delay.
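The data-generating step described above (actual cases from a renewal process, then thinning by τ, then a delay pattern) can be sketched as below. This is a simplified illustration and not the simulation code used in the study: the function name, the seeding of the epidemic with `n0` cases, the generation interval weights and the delay convention (`delay[i, t]` is the share of day-i cases reported on day t) are all assumptions.

```python
import numpy as np


def simulate_epidemic(r_t, gen_probs, tau, delay, n0=10, seed=0):
    """Simulate actual, reportable and observed daily counts."""
    rng = np.random.default_rng(seed)
    r_t = np.asarray(r_t, dtype=float)
    gen_probs = np.asarray(gen_probs, dtype=float)
    T = len(r_t)

    # Renewal step: new cases are Poisson with mean R_t times the
    # generation-interval-weighted sum of earlier cases.
    actual = np.zeros(T, dtype=np.int64)
    actual[0] = n0
    for t in range(1, T):
        lags = np.arange(1, min(len(gen_probs), t) + 1)
        pressure = np.sum(gen_probs[lags - 1] * actual[t - lags])
        actual[t] = rng.poisson(r_t[t] * pressure)

    # Underreporting: each actual case is reported with probability tau.
    reportable = rng.binomial(actual, tau)

    # Delay: spread the reportable cases of day i over reporting days;
    # the extra cell collects cases that fall beyond the horizon.
    observed = np.zeros(T, dtype=np.int64)
    for i in range(T):
        row = np.asarray(delay[i], dtype=float)
        never = max(0.0, 1.0 - row.sum())
        observed += rng.multinomial(reportable[i], np.append(row, never))[:T]
    return actual, reportable, observed


T = 30
r_true = np.linspace(2.5, 0.9, T)                       # hypothetical declining R_t
weights = [0.2, 0.5, 0.3]                               # hypothetical generation interval
one_day = 0.6 * np.eye(T) + 0.4 * np.eye(T, k=1)        # 40% reported a day late
actual, reportable, observed = simulate_epidemic(r_true, weights, 0.5, one_day)
```

Repeating such a simulation over many seeds and refitting under each candidate delay pattern is the kind of design in which misspecification effects can be measured.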
Fig. 1. Smoothed mean estimates of the reproduction numbers for the one-day, two-day and weekend delay patterns with reporting probability of τ = 0.5.
We perform the next stage of simulation to assess the impact of assuming a wrong delay pattern on the estimation of $R_t$. We calculate the mean average squared error (MASE) over N = 1000 replications and T = 30 days to assess model efficacy. The MASE and its bias-variance components for $R_t$ are obtained for each true delay pattern vis-à-vis the other considered delay-underreporting combinations (Table 4 and Figs. 2 to 6). We conduct a sensitivity analysis to examine the impact of varying the reporting probability over 0.15, 0.2, 0.4 and 0.6 on the estimates (Table 5). Estimates of τ are plotted under one-day, two-day and weekend delay misspecifications in Figs. 7 to 9.
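A MASE with its bias-variance split can be computed as in the sketch below. The exact formula is an assumption: squared errors are averaged over all replications and days, and the variance uses the divisor N so that squared bias and variance add up to the MASE exactly. The function and argument names are hypothetical.

```python
import numpy as np


def mase_decomposition(estimates, truth):
    """Return (MASE, squared bias, variance) for replicated R_t estimates.

    estimates: array of shape (N, T), one row per simulation replicate.
    truth:     array of shape (T,), the true R_t used to generate the data.
    """
    estimates = np.asarray(estimates, dtype=float)
    truth = np.asarray(truth, dtype=float)
    mase = np.mean((estimates - truth) ** 2)
    mean_estimate = estimates.mean(axis=0)          # average estimate per day
    squared_bias = np.mean((mean_estimate - truth) ** 2)
    variance = np.mean(estimates.var(axis=0))       # divisor N, so the split is exact
    return mase, squared_bias, variance


# Tiny worked case: two replicates over two days.
mase, bias2, var = mase_decomposition([[1.0, 2.0], [3.0, 2.0]], [2.0, 2.0])
```

In the worked case the replicates average to the truth, so the squared bias is zero and the whole MASE of 0.5 is variance.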
Table 4. MASE and its bias-variance decomposition for the estimates of the reproduction numbers based on the sensitivity analysis of different delay patterns.
Table 1 exhibits an overall decrease in the estimated one-day delay probability with increasing reporting probability. The DIC is best when the reporting probability is one half of the actual occurrence (i.e., τ = 0.5).
Table 2 displays a mixed trend in the estimated delay probabilities as the reporting probability increases under the two-day set-up. Again, the DIC is lowest when the number of reported cases is half the number of actual cases.
The estimates of the delay probabilities are the same for all the considered reporting probabilities. Overall, among all the delay structures considered, the lowest DIC is seen for τ = 0.5 under the two-day delay set-up for the considered data set.
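To make the DIC comparison concrete, the sketch below computes a DIC for a Poisson likelihood from posterior draws of the daily means. It is an illustration only: the study obtained DIC from OpenBUGS, whose plug-in point may differ from the posterior mean of the Poisson means used here, and the function names are hypothetical.

```python
import numpy as np
from math import lgamma


def poisson_deviance(counts, means):
    """Deviance, i.e. minus twice the Poisson log-likelihood."""
    counts = np.asarray(counts, dtype=float)
    means = np.asarray(means, dtype=float)
    log_factorials = np.array([lgamma(c + 1.0) for c in counts])
    log_lik = np.sum(counts * np.log(means) - means - log_factorials)
    return -2.0 * log_lik


def dic(counts, mean_draws):
    """Return (DIC, effective number of parameters p_D).

    mean_draws has shape (n_draws, T): one row of Poisson means per posterior draw.
    DIC = mean deviance + p_D, with p_D = mean deviance - deviance at the mean.
    """
    mean_draws = np.asarray(mean_draws, dtype=float)
    deviances = np.array([poisson_deviance(counts, m) for m in mean_draws])
    mean_deviance = deviances.mean()
    deviance_at_mean = poisson_deviance(counts, mean_draws.mean(axis=0))
    p_d = mean_deviance - deviance_at_mean
    return mean_deviance + p_d, p_d
```

A lower DIC indicates a better trade-off between fit (the mean deviance) and complexity (p_D), which is the sense in which the scenarios are ranked.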
Table 4 shows the corresponding MASE and its bias-variance decomposition for estimates of the reproduction numbers based on different delay and reporting patterns. Accounting for delay and underreporting is seen to increase the accuracy of the estimates of $R_t$. The lowest MASE is obtained for the correct delay pattern, which validates the model. Largest MASEs are obtained for misspecification in . In general, models with underreporting show less bias than those which incorrectly ignore underreporting. The MASEs for the two considered scenarios with no-delay structures are closest to each other.
Table 5 yields smaller MASEs for high reporting probabilities (0.4 and 0.6) under the two-day and weekend delay patterns than under the one-day delay pattern. Figs. 2 to 6 show the estimated daily reproduction numbers for different combinations of reference and fitted delay patterns. Figs. 7 to 9 show the estimated reproduction numbers under different delay patterns for different values of the reporting probability. MASE is seen to be highest for and lowest for for all the reporting probability situations. MASEs are seen to decrease with increase in reporting probability for and
4. Discussion
Estimation of the reproduction number, usually referred to as “$R_0$” in most studies, has been approached through a number of classical statistical methods in the recent past in India, including the exponential growth rate model. For Covid-19 in China, estimation of the reproduction number has been carried out using statistical techniques based on maximum likelihood estimation, SEIRD, and MCMC. The estimated reproduction number for Covid-19 ranged from 2 to 7 in most of the studies. Renewal theory is well known for describing time-based random phenomena observed in real-life applications. In the context of epidemic modelling, however, renewal theory based applications are limited in number. Embedding the renewal equation into spline regression to capture the dynamics of the pandemic caused by the novel coronavirus under different scenarios is the soul of this research. More specifically, the present research estimates the reproduction number while incorporating the effects of underreporting and delay in reporting through an innovative approach of penalized spline regression with a Bayesian toolkit, which has not yet been explored in the context of Covid-19 in India.
We note that there is large variability in the estimation of the reproduction number for the first few observations. Information is often scarce at the beginning of any epidemic outbreak, hence the curve shows higher values for the initially recorded cases. It can also imply that the initial infectees are responsible for a larger number of secondary cases in the absence of awareness, control measures, or both.
DIC, being a measure of model adequacy and complexity, identified the best-case scenario for the underreporting and delay parameters. Evidence from the analysis (Table 2) shows that, among all the considered delay structures and reporting probabilities, the two-day delay pattern with a reporting probability of 50% is the scenario that follows the considered dataset most closely. Figs. 2 to 6 show that mis-specifying the delay pattern for , , , and has a moderate impact on the estimated trend for the reproduction number. Note that the structure of the pattern is an extension of the pattern. There is a substantial impact when misspecifying the pattern. The simulation-based MASE showed how the model estimates vary when a particular scenario is misspecified, and reflected the stability and suitability of the adopted modelling procedure. The simulation study (Tables 4 and 5) justified the modelling strategy adopted for the stated problem under different cases of misspecification of the underreporting and delay structures.
5. Conclusion
Reporting $R_t$ instead of daily cases alerts both the government and the public to start and regulate preventive measures accordingly. In the present paper, modelling the logarithmic transform of $R_t$ through penalized splines under a Bayesian setup has generated stronger evidence in favour of the proposed model, as is evident from the discussion. The present paper follows the Bayesian paradigm under diffuse priors by treating the reporting fractions, reporting delays and their interactive influences distinctly. The time-dependent $R_t$ values are seen to be influenced by reporting fractions and systematic delays in reporting. The present study concludes that observed data accompanied by 50% underreporting and a two-day lag in recording the incidence of Covid-19 cases are closest to the empirical data. A precise estimate of $R_t$, therefore, is crucial in formulating corrective or preventive policies for handling the epidemic and in judging the efficacy of interventions such as lockdown, quarantine and other related control measures. Future work could include a stochastic serial distribution, correlated reproduction numbers and a delay pattern different from the weekly logistics studied in the present paper. Also, the propagation of the pandemic is considered for a closed population in the present work; relaxing this assumption would help in understanding the advantage and necessity of imposing movement restrictions across spatial contours. Such a study could be useful in understanding cost-benefit bargains in economic activities. $R_t$ can also be computed geographically, i.e., for each state, district or any other administrative boundary, to understand the similarities and differences in the dynamics of a pandemic like Covid-19 through geospatial analysis and local-level covariates responsible for the pandemic situation.
Sources of funding
The first author gratefully acknowledges IoE grant from University of Delhi.
Declaration of competing interest
The authors declare that they have no conflict of interest.
Acknowledgements
The authors express their gratitude to all the anonymous reviewers and the editor for their constructive comments and suggestions, which resulted in immense improvement of the original manuscript.
Estimating the effective reproduction number for pandemic influenza from notification data made publicly available in real time: a multi-country analysis for influenza A/H1N1v 2009.
Impact of COVID-19 epidemic curtailment strategies in selected Indian states: an analysis by reproduction number and doubling time with incidence modelling.