Estimation of missing values in aggregate level spatial data

Open Access. Published: October 12, 2020. DOI: https://doi.org/10.1016/j.cegh.2020.10.003

      Abstract

      Background

      Data can be missing when a survey fails to collect information from certain regions due to feasibility issues, which can pose problems for spatial analysis.

      Objective

      The present study aims to estimate missing aggregate level public health spatial data by utilizing the information from neighbouring regions and accounting for spatial autocorrelation inherently present in the data.

      Methodology

      Data were simulated for fixed values of various parameters in spatial regression models under low and high autocorrelation scenarios in the dependent and independent variables. In the dependent variable, 5%–25% of values were assumed to be missing. Stochastic regression imputation was performed using five spatial regression models, namely the spatial lag model, spatial error model, spatial Durbin model, spatial Durbin error model and spatial lag of X model. The performance of these models was also compared using data from the Annual Health Survey 2012-13.

      Results

      The simulation analysis revealed that for any amount of missing values in the data, irrespective of whether the other variables in the regression model are spatially autocorrelated or not, if autocorrelation in the variable with missing values is high, stochastic regression imputation performed using spatial lag model, spatial Durbin model and spatial Durbin error model gives accurate estimates of missing values. If the autocorrelation is low, in addition to these three models, spatial lag X model was also found to be effective in estimating the missing values.

      Conclusion

      The proposed mechanism results in optimal imputation of missing values in spatial data, which can yield complete data useful for public health professionals for effective interventions.

      1. Introduction

      Missing data is a common problem in many public health surveys. The problem arises when the data essential for an analysis exist, but some part of it is not in the analyst's possession (Bennett, Haining and Griffith 1984). Data can also be missing from a secondary source when a survey fails to collect information from certain regions because of feasibility or logistic issues, which creates a missing at random mechanism (Allison 2001). Missing values limit the accuracy of models that predict the occurrence of events such as disease, death or disability, which in turn hinders public health surveillance from targeting effective interventions (Rubin 2004).
      In many situations a public health researcher wants to know health indicators such as the infant mortality rate, the prevalence of anaemia or the child marriage rate in a particular population using data from available sources. For a few populations, however, these estimates may not be available because surveys were not conducted in those areas for logistic reasons. It is therefore important to estimate these indices for the missing populations from the information available for nearby populations, since such measures are spatially correlated across neighbouring places. Many methods exist for handling missing data; the commonly used ones are mean imputation, hot-deck imputation, multiple imputation and expectation-maximization imputation (Little and Rubin 2019). Missing values in spatial data, however, must be handled differently because of the interdependence (autocorrelation) among spatial data points. Several approaches have been developed in this direction (Munoz, Lesser and Smith 2010; Li, Li and Li 2013; Griffith and Liau 2020); nonetheless, most of these techniques focus on obtaining the best estimate of the parameters of interest rather than on estimating the missing values themselves.
      The present study aims to estimate missing values in aggregate level spatial data by utilizing the information from neighbouring regions and accounting for the spatial autocorrelation inherently present in the data. To that end, we propose an imputation method that explicitly accounts for the spatial structure of the data. The study also assesses how accurately missing values can be estimated at aggregate level when stochastic imputation is based on various spatial regression models, under different levels of autocorrelation and different proportions of missing values. After an accurate imputation, the completed data can be used to identify predictors of the event under study and to design appropriate intervention programs.

      2. Methods

      2.1 Spatial regression models

      Spatial regression models (LeSage and Pace 2009; Golgher and Voss 2016), namely the spatial lag model (SLM), spatial error model (SEM), spatial Durbin model (SDM), spatial Durbin error model (SDEM) and spatial lag of X model (SLX), were fitted in order to impute the missing values using the stochastic regression imputation method. The spatial regression models under comparison are described below. Consider the general nesting spatial model (Elhorst 2014),

      $$y = \rho W y + X\beta + W X\theta + \lambda W u + X\gamma + \varepsilon \tag{1}$$

      $$x_k = \delta_k W x_k + v_k \tag{2}$$


      where $y$ is an $N \times 1$ vector of values of the dependent variable and $W$ is an $N \times N$ neighbours weights matrix whose elements satisfy $w_{ij} > 0$ for all neighbouring units $i$ and $j$ ($i \neq j$) and $w_{ij} = 0$ otherwise. $X$ is an $N \times K$ matrix of $K$ independent variables, $\beta$ is a $K \times 1$ vector of regression coefficients, $\rho$ is the scalar autoregressive parameter of the dependent variable, $\theta$ is a $K \times 1$ vector of spatial spill-over parameters, $u$ is an $N \times 1$ vector of spatially autocorrelated disturbances and $\lambda$ is the scalar autoregressive parameter of the disturbances. The parameter vector $\gamma$ specifies the correlation between $X$ and the disturbance vector $u$, and $\delta_k$ represents the autocorrelation in the $k$th independent variable. $v_k$ and $\varepsilon$ are independent and randomly distributed disturbances following $N(0, \sigma_v^2)$ and $N(0, \sigma_\varepsilon^2)$, and $x_k$ is the $k$th column vector of $X$ for the $k = 1, \ldots, K$ independent variables.
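      The reduced form of this model can be written as (Rüttenauer 2019),

      $$y = (I_N - \rho W)^{-1}\left[(I_N - \delta_k W)^{-1} v_k \beta_k + W (I_N - \delta_k W)^{-1} v_k \theta_k + (I_N - \lambda W)^{-1}\left((I_N - \delta_k W)^{-1} v_k \gamma_k + \varepsilon\right)\right] \tag{3}$$

      where each term indexed by $k$ is understood to be summed over the $K$ independent variables.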


      When θ=0 and λ=0 in (1), the model takes the form of the spatial lag model (SLM). The SLM assumes that the dependent variable in one unit is influenced by the spatially weighted dependent variable of neighbouring units. The spatial error model (SEM) is another form of spatial model, obtained when ρ=0 and θ=0. This specification assumes that the spatial association among the units is produced by unobserved features, which are either random or follow a spatial pattern, and which are independent of the explanatory variables included in the model. The spatial Durbin model (SDM) is obtained when autocorrelation is contributed by both the dependent variable and the independent variables, i.e. when λ=0. A specification that directly models the spatial spill-over effects by including the spatially lagged independent variables in the regression equation is known as the spatial lag of X (SLX) model, obtained when ρ=0 and λ=0. The model that combines the specifications of SEM and SLX is known as the spatial Durbin error model (SDEM), obtained when ρ=0.
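The zero restrictions described above can be summarised in a small lookup, shown here as a minimal Python sketch. The names `RESTRICTIONS`, `spatial_terms`, `row_standardise` and `spatial_lag` are illustrative and do not come from the study, and row-standardising W is an assumption, since the text does not state how the weights were scaled.

```python
# Zero restrictions on equation (1) that give each nested specification.
# True means the parameter is fixed at zero in that model.
RESTRICTIONS = {
    "SLM":  {"rho": False, "theta": True,  "lambda": True},
    "SEM":  {"rho": True,  "theta": True,  "lambda": False},
    "SDM":  {"rho": False, "theta": False, "lambda": True},
    "SLX":  {"rho": True,  "theta": False, "lambda": True},
    "SDEM": {"rho": True,  "theta": False, "lambda": False},
}

TERMS = {"rho": "rho*W*y", "theta": "W*X*theta", "lambda": "lambda*W*u"}


def spatial_terms(model):
    """Return the spatial terms of equation (1) kept by the named model."""
    return [TERMS[name] for name, is_zero in RESTRICTIONS[model].items()
            if not is_zero]


def row_standardise(adjacency):
    """Scale each row of a 0/1 adjacency matrix so that it sums to one
    (assumed scaling; rows without neighbours stay at zero)."""
    weights = []
    for row in adjacency:
        total = sum(row)
        weights.append([a / total if total else 0.0 for a in row])
    return weights


def spatial_lag(weights, values):
    """Compute W*values: the weighted average of each unit's neighbours."""
    return [sum(w * v for w, v in zip(row, values)) for row in weights]


if __name__ == "__main__":
    # Three units on a line: unit 1 neighbours units 0 and 2.
    chain = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
    W = row_standardise(chain)
    for name in RESTRICTIONS:
        print(name, spatial_terms(name))
    print(spatial_lag(W, [1.0, 2.0, 3.0]))
```

Calling `spatial_terms("SDEM")` lists the spill-over and error-lag terms, which mirrors the description of SDEM as a combination of SEM and SLX.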

      2.2 Stochastic regression imputation

      One of the simplest and most commonly used imputation techniques in the non-spatial context is stochastic regression imputation. This method uses ordinary least squares (OLS) regression to predict the missing values and adds a normally distributed residual term to each predicted value. It restores the loss in variability and reduces the bias associated with regression imputation (Little and Rubin 2019). The stochastic regression imputation is based on complete cases and is given as,

      $$\hat{y}_{ik} = \sum_{j=1}^{k-1} \hat{\beta}_j x_{ij} + z_i$$

      where $z_i$ is a random draw from $N(0, \sigma^2)$ and $\sigma^2$ is assumed to be constant for a given value of $x_j$. The variable with missing values, $Y_k$, is the variable of interest. It is treated as the dependent variable of the stochastic regression imputation model, and the auxiliary variables $(x_1, \ldots, x_{k-1})$ used to estimate the missing values are treated as the independent variables.
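The non-spatial version of this procedure can be illustrated with a minimal Python sketch. It assumes a single auxiliary variable and includes an intercept, neither of which is specified in the formula above; the function names `fit_simple_ols` and `stochastic_regression_impute` are hypothetical, and missing values are represented by `None`.

```python
import random


def fit_simple_ols(x, y):
    """Fit y = a + b*x by least squares on complete cases.
    Returns the intercept, slope and residual standard deviation."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    sigma = (sum(r ** 2 for r in residuals) / (n - 2)) ** 0.5
    return intercept, slope, sigma


def stochastic_regression_impute(x, y, rng):
    """Replace each None in y by the regression prediction plus a
    random draw z_i from N(0, sigma^2); observed values are kept."""
    complete = [(xi, yi) for xi, yi in zip(x, y) if yi is not None]
    intercept, slope, sigma = fit_simple_ols(
        [xi for xi, _ in complete], [yi for _, yi in complete])
    completed = []
    for xi, yi in zip(x, y):
        if yi is None:
            completed.append(intercept + slope * xi + rng.gauss(0.0, sigma))
        else:
            completed.append(yi)
    return completed


if __name__ == "__main__":
    rng = random.Random(42)  # fixed seed so the draw is reproducible
    x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
    y = [2.1, 3.9, None, 8.2, 9.8, None]
    print(stochastic_regression_impute(x, y, rng))
```

In the spatial variants compared in this study, the OLS fit would be replaced by the fitted spatial model, while the step of adding a random residual to each prediction stays the same.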
      In the present study, each missing value in the dependent variable was estimated by stochastic regression imputation based on the five spatial models described above. The performance of these five spatial models in accurately estimating the missing values was compared using the root mean square error. For comparison, stochastic regression imputation based on OLS was also performed. Both simulated data and real data were used to assess the performance of these six models (SLM, SEM, SDM, SDEM, SLX and OLS) in accurately estimating the missing values in aggregate level spatial data. The simulation studied the performance of these models at varying percentages of missing values and for varied levels of autocorrelation in the dependent and independent variables. Finally, the performance of the models was assessed using real data from the Annual Health Survey 2012–13, published by the Ministry of Home Affairs, Government of India (Annual Health Survey 2012–13).

      2.3 Data simulation

      Data simulation was performed using the spdep package in R to generate the variables, and the missing data were imputed at aggregate level using stochastic regression imputation based on ordinary least squares (OLS), SLM, SEM, SDM, SDEM and SLX. The geographical space chosen was the map of nine states of India, namely Bihar, Chhattisgarh, Jharkhand, Madhya Pradesh, Odisha, Rajasthan, Uttarakhand, Uttar Pradesh and Assam, aggregated at the district level. From the shape file of these 284 districts, the neighbourhood matrix W was constructed using queen contiguity weights, which define neighbours as areas that share a boundary point or some portion of their boundary. The simulated data for the 284 districts included three independent variables, for which the regression coefficient vector was fixed arbitrarily at β=(33.53)T. The disturbances $v_k$ and $\varepsilon$ were generated as independent and randomly distributed normal variables with mean zero and variances fixed at $\sigma_v^2 = \sigma_\varepsilon^2 = 1$. The spatial spill-over parameter vector was fixed at θ=(111)T. The autocorrelation in the disturbance vector u and the omitted variable bias were fixed at λ={0.05} and $\gamma = (0, 0, 0)^T$ respectively. To generate the dependent variable with low and high autocorrelation, the scalar parameter ρ was fixed at ρ=0.2 and ρ=0.8 respectively. Similarly, the autocorrelation in the independent variables was set to δ=0.1 and δ=0.6 to generate independent variables with low and high autocorrelation respectively. The variable of interest (dependent variable) was finally simulated following the reduced form of the general nesting spatial model given in equation (3). The simulation resulted in four datasets, each of size 284, representing the four situations given below.
      • a)
        High autocorrelation in both dependent and independent variables
      • b)
        Low autocorrelation in both dependent and independent variables
      • c)
        High autocorrelation in dependent variable and low autocorrelation in independent variables
      • d)
        Low autocorrelation in dependent variable and high autocorrelation in independent variables
      The autocorrelation and Pearson's correlation coefficient values of the simulated dependent and independent variables are given in Table 1.
      Table 1. Spatial autocorrelation and Pearson's correlation coefficient values of the simulated variables.
      Scenario | Measure | Y | X1 | X2 | X3
      High-High | Correlation with Y | | −0.50 | 0.63 | −0.75
      High-High | Autocorrelation | 0.89 | 0.48 | 0.58 | 0.65
      Low-Low | Correlation with Y | | −0.47 | 0.62 | −0.58
      Low-Low | Autocorrelation | 0.25 | 0.10 | 0.11 | 0.01
      High-Low | Correlation with Y | | −0.44 | 0.49 | −0.42
      High-Low | Autocorrelation | 0.77 | 0.07 | 0.11 | 0.05
      Low-High | Correlation with Y | | −0.56 | 0.65 | −0.56
      Low-High | Autocorrelation | 0.28 | 0.60 | 0.53 | 0.61
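The data-generating process of equation (3) can be sketched in plain Python as follows. This is an illustration under stated assumptions, not the study's spdep code: a small regular grid stands in for the 284 districts, W is row-standardised, the inverse $(I_N - aW)^{-1}b$ is approximated by fixed-point iteration, γ is set to zero, and the values passed for β, θ and λ are placeholders rather than the values used in the study. All function names are hypothetical.

```python
import random


def queen_weights(rows, cols):
    """Row-standardised queen contiguity weights for a rows x cols grid:
    cells sharing an edge or a corner are neighbours (assumed layout)."""
    n = rows * cols
    W = [[0.0] * n for _ in range(n)]
    for r in range(rows):
        for c in range(cols):
            nbrs = [(r + dr) * cols + (c + dc)
                    for dr in (-1, 0, 1) for dc in (-1, 0, 1)
                    if (dr, dc) != (0, 0)
                    and 0 <= r + dr < rows and 0 <= c + dc < cols]
            for j in nbrs:
                W[r * cols + c][j] = 1.0 / len(nbrs)
    return W


def lag(W, v):
    """Spatial lag W*v."""
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def solve_spatial(W, coef, b, iters=300):
    """Approximate z = (I - coef*W)^{-1} b by iterating z <- coef*W*z + b,
    which converges for |coef| < 1 when W is row-standardised."""
    z = list(b)
    for _ in range(iters):
        z = [coef * wz + bi for wz, bi in zip(lag(W, z), b)]
    return z


def simulate(W, rho, delta, beta, theta, lam, rng):
    """Draw X and y following equations (2) and (3) with gamma = 0."""
    n = len(W)
    # x_k = (I - delta*W)^{-1} v_k with v_k ~ N(0, 1)
    X = [solve_spatial(W, delta, [rng.gauss(0.0, 1.0) for _ in range(n)])
         for _ in beta]
    # u = (I - lam*W)^{-1} eps with eps ~ N(0, 1)
    u = solve_spatial(W, lam, [rng.gauss(0.0, 1.0) for _ in range(n)])
    rhs = list(u)
    for xk, bk, tk in zip(X, beta, theta):
        wxk = lag(W, xk)
        rhs = [r + bk * a + tk * b for r, a, b in zip(rhs, xk, wxk)]
    y = solve_spatial(W, rho, rhs)
    return y, X


if __name__ == "__main__":
    rng = random.Random(2020)
    W = queen_weights(6, 6)
    # High-high scenario: rho = 0.8, delta = 0.6 (placeholder beta, theta, lam).
    y, X = simulate(W, 0.8, 0.6, [1.0, 1.0, 1.0], [1.0, 1.0, 1.0], 0.05, rng)
    print(len(y), len(X))
```

Switching `rho` between 0.2 and 0.8 and `delta` between 0.1 and 0.6 reproduces the four scenario combinations listed above on the toy grid.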
      The data were simulated 1000 times for each of the above four scenarios of autocorrelation among the variables. In each simulated dataset, a fixed proportion of values in the dependent variable was assumed to be missing according to the missing at random (MAR) mechanism. Under MAR, the probability that a value of the dependent variable (Y) is missing depends on the values of other variable(s) (Little and Rubin 2019). To generate missing values under MAR in the dependent variable, a fixed proportion of locations was randomly selected from among those locations where the value of the independent variable X1 was very low (less than the first quartile of X1). In the selected locations, the values of the dependent variable were assumed to be missing. This was repeated for missing proportions of 5%, 10%, 15%, 20% and 25% in the dependent variable in each simulated dataset. For each proportion of missing values and each level of autocorrelation in the variables, the missing values were estimated using stochastic regression imputation based on ordinary least squares (OLS), SLM, SEM, SDM, SDEM and SLX.
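The selection step described above can be sketched as follows. This is a minimal illustration with hypothetical names (`first_quartile`, `impose_mar`); it assumes that the stated proportion refers to the whole dataset and that the first quartile is computed by linear interpolation, neither of which is spelled out in the text.

```python
import random


def first_quartile(values):
    """First quartile by linear interpolation between order statistics
    (an assumed definition; other quantile rules differ slightly)."""
    s = sorted(values)
    pos = 0.25 * (len(s) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (pos - lo) * (s[hi] - s[lo])


def impose_mar(y, x1, proportion, rng):
    """Set a share of y to None, choosing locations at random among
    those where x1 lies below its first quartile."""
    n_missing = round(proportion * len(y))
    q1 = first_quartile(x1)
    candidates = [i for i, v in enumerate(x1) if v < q1]
    if n_missing > len(candidates):
        raise ValueError("not enough locations below the first quartile")
    chosen = set(rng.sample(candidates, n_missing))
    return [None if i in chosen else yi for i, yi in enumerate(y)]


if __name__ == "__main__":
    rng = random.Random(7)
    x1 = [rng.gauss(0.0, 1.0) for _ in range(40)]
    y = [rng.gauss(0.0, 1.0) for _ in range(40)]
    for share in (0.05, 0.10, 0.15, 0.20, 0.25):
        masked = impose_mar(y, x1, share, rng)
        print(share, sum(v is None for v in masked))
```

Because missingness depends only on the observed X1 and not on Y itself, the resulting pattern is consistent with the MAR definition given above.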
      The performance of the above models in accurately estimating the missing values under each scenario was compared using the root mean squared error (RMSE). For regions $i = 1, \ldots, w_j$ with missing values, where $w_j$ is the total number of regions whose values are to be imputed in the $j$th simulated dataset, the RMSE for the variable $Y$ is given as,

      $$\mathrm{RMSE}_j[\hat{Y}] = \sqrt{\frac{\sum_{i=1}^{w_j} (\hat{Y}_{ij} - Y_{ij})^2}{w_j}}$$

      where $\hat{Y}_{ij}$ is the imputed value of the variable $Y$ in region $i$ for a given simulated dataset ($j = 1, 2, \ldots, 1000$), a given model and a fixed proportion of missing values in the dataset, and $Y_{ij}$ is the observed value of the variable $Y$ in region $i$ of the $j$th simulated dataset.
      The mean of the RMSE (mRMSE) and the standard deviation of the RMSE (sdRMSE) for a given model and fixed proportion of missing values were calculated as,

      $$\mathrm{mRMSE}[\hat{Y}] = \frac{\sum_{j=1}^{1000} \mathrm{RMSE}_j[\hat{Y}]}{1000}$$

      $$\mathrm{sdRMSE}[\hat{Y}] = \sqrt{\frac{\sum_{j=1}^{1000} \left(\mathrm{RMSE}_j[\hat{Y}] - \mathrm{mRMSE}[\hat{Y}]\right)^2}{1000 - 1}}$$


      The mRMSE was computed for the imputations performed using each of the five spatial regression models and OLS model, on the data with 5%–25% proportions of missing values in the dependent variable. The mRMSE values were compared between the models and the model with lowest mRMSE in each scenario was considered the best in estimating the missing values in the dependent variable.
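The three summary measures can be computed with a few lines of Python, shown here as a small sketch with hypothetical function names and made-up toy numbers; it uses three replicates instead of 1000 only to keep the example short.

```python
import math


def rmse(imputed, observed):
    """Root mean squared error over the regions that were imputed."""
    w = len(imputed)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(imputed, observed)) / w)


def mean_and_sd(rmse_values):
    """Mean (mRMSE) and sample standard deviation (sdRMSE, divisor J - 1)
    of the RMSE values across J simulated datasets."""
    j = len(rmse_values)
    mean = sum(rmse_values) / j
    sd = math.sqrt(sum((r - mean) ** 2 for r in rmse_values) / (j - 1))
    return mean, sd


if __name__ == "__main__":
    # Toy replicates: (imputed values, true values) for the missing regions.
    replicates = [
        ([10.2, 11.8, 9.5], [10.0, 12.0, 9.0]),
        ([8.9, 10.4, 11.1], [9.0, 10.0, 11.5]),
        ([12.3, 9.7, 10.0], [12.0, 10.0, 10.4]),
    ]
    values = [rmse(imp, obs) for imp, obs in replicates]
    print(mean_and_sd(values))
```

Only the regions whose values were set to missing enter the RMSE, which matches the definition of $w_j$ above.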

      2.4 Application on data from Annual Health Survey (AHS) 2012-13

      In reality, missing values can occur for regions where data on certain events or diseases cannot be collected because of constraints such as safety or logistics. To reflect such situations, the proposed imputation techniques were applied to data obtained from the AHS for 284 districts of India. The variables were chosen to reflect two scenarios: (i) high autocorrelation in both dependent and independent variables (high-high) and (ii) low autocorrelation in the dependent variable and high autocorrelation in the independent variables (low-high). For the first scenario, the dependent variable was the total fertility rate (TFR) of a district and the independent variables were the percentage of mothers who received any antenatal check-up and the unmet need for spacing in a district. For the second scenario (low-high), the dependent variable was the percentage of women aged 15–19 years who were already mothers or pregnant at the time of the survey, and the independent variables were the total fertility rate and the female literacy rate in a district. In the real data analysis, overall missingness rates of 5%, 10%, 15%, 20% and 25% of observations were imposed on the dependent variable according to the MAR mechanism, using the same procedure as for the simulated data. The variables considered for the analysis and the values of their spatial autocorrelation and Pearson's correlation coefficient are listed in Table 2.
      Table 2. Spatial autocorrelation and Pearson's correlation coefficient values of the variables retrieved from Annual Health Survey (AHS 2012-13) data.
      Scenario | Variable | Correlation with Y | Autocorrelation
      High-High | Y - Total fertility rate | | 0.63
      High-High | X1 - Mothers who received any antenatal check-up (%) | −0.60 | 0.58
      High-High | X2 - Unmet need for spacing (%) | 0.57 | 0.58
      Low-High | Y - Women aged 15–19 years already mothers/pregnant at the time of survey (%) | | 0.44
      Low-High | X1 - Total fertility rate | −0.13 | 0.63
      Low-High | X2 - Female literacy rate | 0.29 | 0.62

      3. Results

      3.1 Comparison of various spatial regression models using simulated data

      Stochastic regression imputation with each of the six models was applied to the simulated data across two dimensions: (i) the four combinations of autocorrelation levels in the dependent and independent variables listed in Table 1, and (ii) the various proportions of missing values in the dependent variable. Fig. 1 compares the mean RMSE values of the six models across these settings. When the autocorrelation in both dependent and independent variables was high (Fig. 1(a)), imputation using SEM had a higher mRMSE than the remaining five models. The mRMSE for SEM was even higher than that for the OLS model, which does not account for autocorrelation. The lowest, and mutually similar, mRMSE values were produced by SLM, SDM and SDEM. When the autocorrelation in both dependent and independent variables was low (Fig. 1(b)), all the models performed more or less similarly in terms of mRMSE; however, imputations using SDM, SDEM and SLX were found to be better, with the lowest mRMSE. With high autocorrelation in the dependent variable and low autocorrelation in the independent variables (Fig. 1(c)), the pattern of mRMSE was similar to that of situation (a), although the mRMSE of imputation using SEM was reduced. Imputation using SLM, SDM and SDEM had the lowest mRMSE when the autocorrelation was low in the dependent variable and high in the independent variables (Fig. 1(d)). In this case, OLS and SEM performed similarly.
      Overall, the simulation analysis revealed that when autocorrelation is high in the variable with missing values, stochastic regression imputation using the spatial lag model, spatial Durbin model and spatial Durbin error model gives accurate and almost identical estimates of the missing values, irrespective of whether the independent variables in the regression model are spatially autocorrelated, and this result was consistent across the proportions of missing values. When the autocorrelation is low in the variable with missing values, stochastic regression imputation using the spatial lag of X model was found to be as accurate as these three models. It can also be noted that stochastic regression imputation using the OLS model and SEM had the highest mRMSE in all the scenarios considered in the simulation study.
      Fig. 1. Mean of RMSE (mRMSE) of imputations performed on simulated data using stochastic regression based on different models for various proportions of missing values in dependent variable.
      The means and standard deviations of the RMSE values obtained for the 1000 simulated datasets and for the real data analysis in each scenario are given in the supplementary material (Table 3 and Table 4).

      3.2 Comparison of various spatial regression models using real data from AHS 2012-13

      Fig. 2 presents a line graph of the mRMSE obtained from imputations performed by stochastic regression using the different models. As with the simulated data, imputation using SLM, SDM and SDEM had the lowest mRMSE when the autocorrelation was high in both dependent and independent variables. This pattern remained consistent across the proportions of missing values. The same three models also performed well when the data had low autocorrelation in the dependent variable and high autocorrelation in the independent variables. Stochastic regression imputation using the OLS model had the highest mRMSE in all the scenarios considered.
      Fig. 2. Mean of RMSE (mRMSE) of imputations performed on real data using stochastic regression based on (a) high autocorrelation in both dependent and independent variables and (b) low autocorrelation in dependent variable and high autocorrelation in independent variables, for various proportions of missing values in dependent variable.

      4. Discussion and conclusions

      Missing data can be a problem, especially when the researcher relies on secondary data for analysis. In many cases the data are incomplete for reasons such as feasibility or safety. Various imputation techniques are available, but they are inefficient when the data values are interdependent. Such data need to be handled differently during imputation, and several methods have been developed and applied for this purpose. Baker et al. (2014) compared mean imputation, imputation using a multivariate normal prior distribution and imputation using a conditional autoregressive prior distribution to impute missing values in incomplete survey data while accounting for spatial autocorrelation. Song et al. (2018) developed a Bayesian hierarchical modelling framework to impute missing values in the official socioeconomic statistics dataset of China by taking spatial autocorrelations and temporal trends into account. Likewise, the present study focuses on such scenarios and proposes a simple technique: stochastic regression imputation using spatial regression models rather than classical ordinary least squares regression. This technique will help to accurately estimate missing values in aggregate level public health spatial data, which can increase the power and precision of the estimates in further analysis.
      Though the present study is the first of its kind to attempt to estimate missing values in aggregate level spatial data, it has a few limitations. The performance of the models was evaluated for missing proportions of up to 25% only. The missing not at random mechanism was not considered, since the classical stochastic regression imputation technique has been shown to be efficient only when data are missing at random. Missing values in categorical variables were also not considered in the present study and offer scope for future research.
      The study shows that stochastic regression imputation using the spatial lag model, spatial Durbin model and spatial Durbin error model performed consistently better across the proportions of missing values and across the combinations of autocorrelation in dependent and independent variables. We recommend using stochastic regression imputation based on one of these three models to estimate missing data, especially when the data have inherent spatial autocorrelation. The complete data obtained with the proposed method can then be used by public health professionals for effective interventions.

      Funding

      The authors acknowledge the funding support for this study by Department of Science and Technology, Government of India (Sanction order No. NRDMS/01/122/015).

      Availability of data and material

      The datasets used for data analysis are obtainable from the corresponding author on request.

      Declaration of competing interest

      The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

      Appendix A. Supplementary data

      The following are the Supplementary data to this article:

      References

        • Bennett R.J., Haining R.P., Griffith D.A. The problem of missing data on spatial surfaces. Ann Assoc Am Geogr. 1984 Mar; 74: 138-156
        • Allison P.D. Missing Data. Sage Publications, 2001 Aug 13
        • Rubin D.B. Multiple Imputation for Nonresponse in Surveys. John Wiley & Sons, 2004 Jun 9
        • Little R.J., Rubin D.B. Statistical Analysis with Missing Data. John Wiley & Sons, 2019 Apr 23
        • Munoz B., Lesser V.M., Smith R.A. Applying multiple imputation with geostatistical models to account for item nonresponse in environmental data. J Mod Appl Stat Methods. 2010; 9: 27
        • Li L., Li Y., Li Z. Efficient missing data imputing for traffic flow by considering temporal and spatial dependence. Transport Res C Emerg Technol. 2013 Sep 1; 34: 108-120
        • Griffith D.A., Liau Y.T. Imputed spatial data: cautions arising from response and covariate imputation measurement error. Spatial Statist. 2020 Feb 3: 100419
        • LeSage J.P., Pace R.K. Introduction to Spatial Econometrics. Chapman & Hall/CRC, 2009
        • Golgher A.B., Voss P.R. How to interpret the coefficients of spatial models: spillovers, direct and indirect effects. Spatial Demogr. 2016 Oct 1; 4: 175-205
        • Elhorst J.P. Spatial Econometrics: From Cross-Sectional Data to Spatial Panels. Springer, Heidelberg, 2014
        • Rüttenauer T. Spatial regression models: a systematic comparison of different model specifications using Monte Carlo experiments. Socio Methods Res. 2019 Jun 9; 0049124119882467
        • General R. Annual Health Survey-2012–13. Ministry of Home Affairs, Government of India, New Delhi, 2012
        • Baker J., White N., Mengersen K. Missing in space: an evaluation of imputation methods for missing data in spatial analysis of risk factors for type II diabetes. Int J Health Geogr. 2014 Dec 1; 13: 47
        • Song C., Yang X., Shi X., Bo Y., Wang J. Estimating missing values in China's official socioeconomic statistics using progressive spatiotemporal Bayesian hierarchical modeling. Sci Rep. 2018 Jul 3; 8: 1-3