Preferring Box-Cox transformation, instead of log transformation to convert skewed distribution of outcomes to normal in medical research

Open Access. Published: April 14, 2022. DOI: https://doi.org/10.1016/j.cegh.2022.101043

      Abstract

      Background

      When dealing with a skewed outcome, researchers often use the log transformation to convert the data to normal and then apply commonly used statistical methods such as the t-test and linear regression. However, log-transformed data are not always normal. In such situations, the Box-Cox transformation (BCT) can be used to transform skewed data to normal. A problem arises, however, when researchers want to predict the outcome on the original scale. The aim of this paper is therefore to demonstrate the use of BCT for a skewed outcome and to predict the outcome on the original scale using a regression method.

      Materials and method

      Data from the Cost of Pyelonephritis in Type-2 Diabetes (COPID) study were used to demonstrate BCT and the back transformation method. The study was conducted among patients admitted to the general medical wards of a tertiary care hospital in south India. BCT was applied to total cost to convert it to normal. Multiple linear regression was used, and the predicted values were back transformed to the original scale.

      Results

      The estimated lambda was −0.36. After BCT, total cost was approximately normal (p-value = 0.621). The residual plots suggested that the errors are normally distributed with constant variance. The median (IQR) of the observed total cost was 57694 (42405, 98621), whereas that of the predicted total cost was 58317 (44270, 95375).

      Conclusion

      When the data are skewed, the log transformation is not appropriate in all scenarios. BCT brings the transformed data close to a normal distribution, and the outcome can be back transformed to the original scale given the covariates.


      1. Introduction

      In research data, the outcome variables can be binary (yes/no), categorical (mild/moderate/severe) or continuous. Researchers commonly use the t-test, ANOVA and linear regression models when analysing continuous outcome data. These methods assume that the outcome variable is continuous and normally distributed. Some biological values, such as serum creatinine and blood pressure, generally follow a normal distribution, so the above methods can be used. However, other data, such as serum triglyceride, cost of hospital expenses, and duration of hospital stay, generally do not follow a normal distribution. These variables therefore need a transformation that converts the skewed data to normal (no longer in original units).
      The usual way of dealing with skewed continuous outcome data is to transform them, for example with a log or square root transformation. For positively skewed outcomes, the log transformation (log-normal distribution) is currently used in many settings, such as breath analysis (Cimermanová, 2007), the study of mean carbon monoxide (Olsson, 2005), and medical cost data of diabetics and intensive care unit expenses (Chen and Zhou, 2006; Thomas et al., 2015).
      In a two-group comparison, the challenge is to back transform the mean difference between the two groups from the transformed scale, because that difference is not on the same scale as the original unit of measurement. A common solution to this problem is the generalised pivotal approach (GPA) (Krishnamoorthy and Mathew, 2003). This method assumes that the log-transformed data follow a normal distribution. Devika et al. recommended GPA for handling skewed data in hypothesis testing (Shanmuga Sundaram et al., 2014).
      However, the log transformation is helpful only if the variable is strictly positive and its range spans more than one order of magnitude (10^1); if the range of a variable is considerably less than one order of magnitude, any transformation of that variable is unlikely to be helpful (Weisberg, 2005).
      Researchers often do not validate the assumption of normality in the dependent variable. Log-transformed data may not be normally distributed, or previously right-skewed data may end up left-skewed (Thomas et al., 2015; Peter et al., 2016).
      In such a situation, the Box-Cox transformation (BCT) can be used to transform skewed outcome data (Box and Cox, 1964).
      The log transformation is in fact a special case of BCT, so BCT is more general than the log transformation or GPA. BCT brings skewed data closer to normal than the other methods mentioned above. Osborne (2010) discussed five transformations (square root, log, inverse, arcsine, and Box-Cox) and highlighted their merits and demerits.
      Osborne (2010) showed that, of these methods, BCT works better than the other transformations, because it draws on a range of power transformations to improve normalization and variance equalization for both positively and negatively skewed variables. Agarwal and Pant (2009) noted that Box-Cox transformation and weighted least squares techniques are used when the error variance is not constant (heteroscedasticity), and developed an algorithm to find the weighting parameter in the weighted least squares model.
      However, the use of BCT is restricted because it usually requires covariate(s). To resolve this, Dag et al. (2014) proposed a method to find lambda (the BCT parameter that transforms non-normal data to normal, on a transformed scale) without any covariates. They estimated lambda using seven well-known goodness-of-fit tests and implemented the method in the publicly available R package “AID” (Asar et al., 2017).
      The main goal of the transformation is to carry out group comparisons or regression analysis with covariates such as intervention, gender, and age, which can be done on the transformed data. A problem arises, however, when we want to predict the outcome given the covariates after fitting the model: the predictions are on the transformed scale, not the original scale. To predict the outcome on the original scale, Taylor (1986) suggested an approximate estimator given covariates.
      In this paper, we focus on the role of BCT to convert skewed outcome into normal (by finding approximate lambda value) and also to predict the outcome in original scale given covariates, using regression method.

      2. Methods

      The methods section has three parts. The first describes the statistical method of BCT and the second section presents the back transformation of the fitted value of the outcome variable (total cost) in original scale given the covariates. The third section describes the data which is used to demonstrate the above two sections.

      2.1 Box-Cox Transformation

      The dependent (continuous) variable is denoted Y and the independent variables X = (1, x1, x2, …, xk). Box and Cox (1964) introduced the following model to transform a skewed distribution to a normal distribution on a transformed scale.
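      $$Y_{BC}(Y,\lambda) = X\beta + \sigma e \tag{1}$$

      where

      $$Y_{BC}(Y,\lambda) = \begin{cases} \dfrac{Y^{\lambda} - 1}{\lambda} & \text{if } \lambda \neq 0 \\ \log(Y) & \text{if } \lambda = 0 \end{cases} \tag{2}$$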


      Here, X is the covariate matrix including the intercept,
      β = (β0, β1, β2, …, βk) is the vector of regression coefficients,
      σ is the standard deviation of the random error, and
      e is the random error, which follows a standard normal distribution.
      In equation (2), log(Y) is the special case of BCT obtained when lambda (λ) is zero.
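As an illustration of equation (2), the transformation and its inverse can be sketched in a few lines. The authors' own code is in R; the Python below is only a minimal sketch, and the function names `box_cox` and `inv_box_cox` are chosen here for illustration.

```python
import math

def box_cox(y, lam):
    """Box-Cox transform of a strictly positive value y, as in equation (2)."""
    if y <= 0:
        raise ValueError("Box-Cox requires y > 0")
    if lam == 0:
        return math.log(y)
    return (y ** lam - 1.0) / lam

def inv_box_cox(z, lam):
    """Invert the transform; only defined where 1 + lam * z > 0."""
    if lam == 0:
        return math.exp(z)
    base = 1.0 + lam * z
    if base <= 0:
        raise ValueError("1 + lam * z must be positive")
    return base ** (1.0 / lam)

# Median total cost and lambda as reported in the abstract.
cost = 57694.0
z = box_cox(cost, -0.36)
print(z, inv_box_cox(z, -0.36))
```

Note that `inv_box_cox` alone maps a value back to the original scale but does not correct for the retransformation of the mean.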
      The parameter lambda (λ) is frequently estimated by maximum likelihood (Box and Cox, 1964): the model is fitted to the transformed data for a range of lambda values, and the optimal lambda is the one that maximises the likelihood. The drawback of this method is that the likelihood, and therefore lambda, changes whenever the covariates change. To handle this, we estimated lambda with the goodness-of-fit statistics method suggested by Asar et al. (2017), using the “AID” package. This package is used in the R code provided in the supplementary file. The input in our example is the outcome variable “total cost”.
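The maximum likelihood idea described above can be sketched as a grid search over lambda. This is not the goodness-of-fit procedure of the “AID” package that the paper actually uses; it is a simplified intercept-only profile likelihood, written in Python for illustration. The function names, the grid from −2 to 2, and the example costs are all assumptions.

```python
import math

def box_cox(y, lam):
    """Box-Cox transform of a strictly positive value y."""
    return math.log(y) if lam == 0 else (y ** lam - 1.0) / lam

def profile_loglik(values, lam):
    """Profile log-likelihood of lambda for an intercept-only normal model."""
    n = len(values)
    z = [box_cox(v, lam) for v in values]
    mean_z = sum(z) / n
    var_z = sum((zi - mean_z) ** 2 for zi in z) / n  # ML variance estimate
    jacobian = (lam - 1.0) * sum(math.log(v) for v in values)
    return -0.5 * n * math.log(var_z) + jacobian

def estimate_lambda(values, grid=None):
    """Return the grid value of lambda with the highest profile log-likelihood."""
    if grid is None:
        grid = [i / 100.0 for i in range(-200, 201)]  # assumed search range
    return max(grid, key=lambda lam: profile_loglik(values, lam))

# Hypothetical right-skewed costs, for illustration only.
costs = [31000.0, 42000.0, 45000.0, 57000.0, 61000.0, 98000.0, 240000.0]
print(estimate_lambda(costs))
```

With covariates in the model, the variance term would be the residual variance from the regression on the transformed outcome, which is why the estimate changes when the covariates change.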
      The transformation in equation (2) is applicable only to positive values, i.e. Y > 0. If the data contain negative values, we recommend the Yeo-Johnson family of transformations (Yeo and Johnson, 2000), which places no sign restriction on the response variable. This procedure is available in the “car” R package (Fox and Weisberg, 2019).
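For readers who want to see how the Yeo-Johnson family handles negative values, the standard definition can be sketched as follows. This is an illustrative Python version, not the “car” implementation, and the function name `yeo_johnson` is an assumption.

```python
import math

def yeo_johnson(y, lam):
    """Yeo-Johnson power transform, defined for any real y."""
    if y >= 0:
        if lam == 0:
            return math.log1p(y)                      # log(y + 1)
        return ((y + 1.0) ** lam - 1.0) / lam
    if lam == 2:
        return -math.log1p(-y)                        # -log(1 - y)
    return -((1.0 - y) ** (2.0 - lam) - 1.0) / (2.0 - lam)

# With lam = 1 the transform leaves both positive and negative values unchanged.
print(yeo_johnson(3.0, 1.0), yeo_johnson(-3.0, 1.0))
```

For non-negative y this is the Box-Cox transform applied to y + 1, which is why zeros no longer cause a problem.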

      2.2 Back transforming the predicted value using an approximate estimator

      Taylor (1986) gave an approximate estimator to back transform the predicted value to the original scale after a linear model is fitted using BCT.
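      An approximate estimator for the conditional mean E(Y|X0) is given by

      $$\hat{Y}_a = \widehat{E(Y \mid X_0)} = \left(1 + \hat{\lambda} X_0 \hat{\beta}\right)^{1/\hat{\lambda}} \left\{ 1 + \frac{\hat{\sigma}^2 \left(1 - \hat{\lambda}\right)}{2 \left(1 + \hat{\lambda} X_0 \hat{\beta}\right)^2} \right\} \tag{3}$$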
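Equation (3) is simple to evaluate once the fitted value on the transformed scale, the residual variance, and lambda are available. The sketch below is an illustrative Python version and not the authors' “BCT_BT()” function; the name `retransformed_mean` and the numeric inputs are hypothetical.

```python
def retransformed_mean(xb, sigma2, lam):
    """Approximate E(Y | X0) from equation (3).

    xb     : fitted value X0 * beta-hat on the Box-Cox scale
    sigma2 : estimated residual variance on the Box-Cox scale
    lam    : estimated lambda (must be non-zero for this formula)
    """
    if lam == 0:
        raise ValueError("equation (3) is not defined for lambda = 0")
    base = 1.0 + lam * xb
    if base <= 0:
        raise ValueError("1 + lambda * xb must be positive")
    naive = base ** (1.0 / lam)                        # plain inverse transform
    correction = 1.0 + sigma2 * (1.0 - lam) / (2.0 * base ** 2)
    return naive * correction

# Hypothetical fitted value and residual variance, with the paper's lambda.
print(retransformed_mean(xb=2.72, sigma2=0.0001, lam=-0.36))
```

The factor in braces comes from a second-order expansion of the inverse transform around the fitted value, so for lambda = 1 it equals 1 and the estimator reduces to the plain inverse transform.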


      R software version 4.0.1 was used to perform the analysis. For ease of use, we created a function called “BCT_BT()” which combines Sections 2.1 and 2.2 and produces the results. The source code for the “BCT_BT()” function is given as a separate file (“source_file”). The researcher should load the source file using the command “source” before running the “BCT_BT()” function. The R code, along with detailed instructions, is given as supplementary material (Appendix A).

      2.3 Data description

      The data used in this study come from a prospective observational economic study (COPID study: Cost of Pyelonephritis in Type-2 Diabetes). The outcome is the total cost of hospital expenses (direct and indirect). The study was conducted among patients admitted to the general medical wards of a tertiary care hospital in south India. Participants were recruited between March 2016 and July 2017. The outcome variable, total cost, was right skewed. The independent variables used in the multiple linear regression analysis were gender (0 = female, 1 = male), age, ischemic heart disease (IHD) (0 = No, 1 = Yes), diabetes mellitus (DM) (0 = No, 1 = Yes), duration of hospital stay, and number of days of intravenous antibiotics.

      3. Results

      Original data: Table 1 presents the descriptive statistics of the demographic variables such as gender and age. There were 92 patients, with a mean ± SD age of 55.80 ± 16.54. Among them, 53 (57.61%) were female and 39 (42.39%) were male. There were 15 (16.30%) IHD patients and 61 (66.30%) DM patients. The median (IQR) hospital stay was 10 (7, 15) and the median (IQR) number of days of intravenous antibiotics was 14 (7, 14). The median (IQR) of total cost was 57694 (42367, 98756).
      Table 1. Distribution of demographic variables.

      | Variables | n (%) |
      |---|---|
      | Gender: Female | 53 (57.61) |
      | Gender: Male | 39 (42.39) |
      | Ischemic Heart Disease: No | 77 (83.70) |
      | Ischemic Heart Disease: Yes | 15 (16.30) |
      | Diabetes Mellitus: No | 31 (33.70) |
      | Diabetes Mellitus: Yes | 61 (66.30) |

      | Variables | Mean ± SD / Median (IQR) |
      |---|---|
      | Age | 55.80 ± 16.54 |
      | Duration of hospital stay | 10 (7, 15) |
      | No. of days of intravenous antibiotics | 14 (7, 14) |
      | Total cost | 57694 (42405, 98621) |

      Normality assumption checking (histogram and Q-Q plot): Fig. 1(a) and (c) present the histogram and the Q-Q plot of the total cost, respectively. These plots suggest that the total cost is not normally distributed. The estimated lambda value was −0.36. After the transformation, the total cost was approximately normally distributed. The Shapiro-Wilk test was used to check the normality assumption; the non-significant p value (0.621) gave no evidence against normality of the transformed data. Fig. 1(b) and (d) present the histogram and Q-Q plot of total cost after BCT, respectively. In Fig. 1(d), all the points lie within the confidence band, which suggests that the data are normal. Therefore, we used BCT to transform the total cost and used the transformed value as the dependent variable in the regression model.
      Fig. 1. Histogram and Q-Q plot of total cost.
      Note: Lambda = −0.36; P value = 0.621 (Shapiro-Wilk test)
      Regression: Table 2 presents the results of the multiple linear regression analysis. Hospital stay and number of days of intravenous antibiotics were significantly associated with the transformed total cost, with regression coefficients (95% CI) of 0.00117 (0.00088, 0.00147) and 0.00029 (0.00004, 0.00054), respectively. Fig. 2 presents the residual plots of the linear regression model. The residuals vs. fitted plot suggests that the variance is almost constant, and most of the points in the Q-Q plot lie on the diagonal line, which suggests that the errors follow a normal distribution. The Breusch-Pagan test for heteroscedasticity (p value = 0.657) showed no sufficient evidence that heteroscedasticity is present in the model.
      Table 2. Multiple linear regression analysis results (after transformation).

      | Variables | Regression coefficient (95% CI) | P value |
      |---|---|---|
      | Age | 0.000005 (−0.000099, 0.00011) | 0.920 |
      | Gender | 0.00086 (−0.00246, 0.00419) | 0.607 |
      | Hospital stay | 0.00117 (0.00088, 0.00147) | <0.001 |
      | IHD | 0.00142 (−0.00295, 0.00579) | 0.522 |
      | No. of days of intravenous antibiotics | 0.00029 (0.00004, 0.00054) | 0.022 |
      | DM | −0.00043 (−0.00396, 0.00309) | 0.807 |

      Fig. 2. Residual plots for model adequacy checking.
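To make the heteroscedasticity check concrete, the following Python sketch implements the studentized (Koenker) form of the Breusch-Pagan test for a model with a single covariate. The paper's model has six covariates and does not state which variant of the test was used, so this is only an assumed, simplified illustration; the function names and the example data are hypothetical.

```python
import math

def simple_ols(x, y):
    """Least-squares intercept and slope for one covariate."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    slope = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
    return my - slope * mx, slope

def r_squared(x, y):
    """Coefficient of determination of the simple regression of y on x."""
    a, b = simple_ols(x, y)
    my = sum(y) / len(y)
    ss_tot = sum((yi - my) ** 2 for yi in y)
    ss_res = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    return 1.0 - ss_res / ss_tot

def breusch_pagan_single(x, y):
    """Return (LM statistic, p value) for one covariate."""
    a, b = simple_ols(x, y)
    sq_resid = [(yi - a - b * xi) ** 2 for xi, yi in zip(x, y)]
    # LM = n * R^2 from regressing squared residuals on the covariate.
    lm = max(0.0, len(x) * r_squared(x, sq_resid))
    # Upper tail of a chi-square with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(lm / 2.0))
    return lm, p_value

# Hypothetical data whose error spread grows with x.
x = [float(i) for i in range(1, 21)]
y = [xi + 0.5 * xi * (-1) ** i for i, xi in enumerate(x, start=1)]
print(breusch_pagan_single(x, y))
```

A large p value, as reported in the text, means the squared residuals show no detectable trend with the covariates.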
      Back transformation: The predicted total cost was back transformed using equation (3), which gives the outcome on the original scale, where it is not necessarily normal. Table 3 summarises the observed and the back-transformed predicted total cost in original units. The median (IQR) of the observed total cost is 57694 (42405, 98621), whereas that of the back-transformed predicted total cost is 58317 (44270, 95375). The RMSE between the observed and back-transformed total cost is 221603.2.
      Table 3. Summary of total cost and predicted total cost.

      | | Total cost | Back transformed predicted cost |
      |---|---|---|
      | 1st quartile | 42405 | 44270 |
      | Median | 57694 | 58317 |
      | Mean | 88330 | 118829 |
      | 3rd quartile | 98621 | 95375 |
      | SD | 87561.08 | 258076.8 |
      | RMSE | | 221603.2 |
      | Bias | | −30498.6 |
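The RMSE and bias in Table 3 compare observed and back-transformed predicted costs. A minimal Python sketch of both measures is given below. The sign convention for bias (mean observed minus mean predicted) is inferred from the table, where 88330 − 118829 is close to the reported −30498.6, and the example values are hypothetical.

```python
import math

def rmse(observed, predicted):
    """Root mean squared error between paired observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

def bias(observed, predicted):
    """Mean of (observed - predicted); negative when predictions run high."""
    n = len(observed)
    return sum(o - p for o, p in zip(observed, predicted)) / n

# Hypothetical observed and predicted costs, for illustration only.
obs = [42000.0, 57000.0, 98000.0]
pred = [44000.0, 58000.0, 95000.0]
print(rmse(obs, pred), bias(obs, pred))
```

Because RMSE squares the errors, a few very large prediction errors can dominate it even when the medians agree closely.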

      4. Discussion

      When analysing continuous outcome variables, we often encounter skewed outcomes. Traditionally, researchers have used the log transformation to convert a skewed distribution to normal (Cimermanová, 2007; Olsson, 2005; Chen and Zhou, 2006). Researchers who log-transform a skewed outcome presume that the transformed data are normal; however, this is not always true, and the normality assumption is often not tested after the log transformation.
      The difference between two groups, such as new intervention vs. standard of care, remains on the log-transformed scale, and the back-transformed difference cannot be related to the original unit. To avoid this limitation, GPA is recommended, as it provides the difference on the original scale (Krishnamoorthy and Mathew, 2003). However, an important constraint of GPA is that the log-transformed data must follow a normal distribution. BCT brings the transformed data close to a normal distribution, and on back transformation we can obtain the difference on the original scale through regression. Taylor (1986) proposed an approximate estimator and reported that the estimate (for back transformation) is accurate except when the BCT parameter (lambda) is near zero.
      Back transforming the fitted value from a Box-Cox regression gives an approximately unbiased estimate of the original-scale median at that predictor value. The study data had ten observations flagged as outlying values (according to the box plot). Even without excluding them from the analyses, we were able to transform the data to normal and back transform to the original scale. Nevertheless, our suggestion is to remove outlying values before estimating lambda and fitting the regression.
      Osborne (2010) demonstrated the use of the Box-Cox transformation on three real data sets and showed that it not only transforms the data but can also increase the effect size in the analysis, up to a maximum correlation of over 80%. However, findings from real data may not be generalizable unless data with various degrees of skewness are simulated and evaluated with various methods of transformation. The author also provided SPSS syntax for the computation, and we suggest that researchers try the SPSS code as well.
      In our experience, however, using SPSS syntax is more labour-intensive than using R code. The method does not require advanced computing skills, and we have provided R code as a supplementary file to facilitate the learning process. This tutorial article will be useful for researchers who would like to analyse outcome data that are not normally distributed. To find the best lambda, the first step is to estimate λ and check the four regression model assumptions: normality, linearity, homoscedasticity, and independence. To obtain the best fit, we may have to drop some covariates and validate the assumptions again. This process would improve lambda.

      5. Limitations

      BCT is not recommended if the outcome data contain even a small proportion of zero or negative values. It may also be inappropriate when zeros outnumber the other observations (zero-inflated data) and for mixture distributions. When λ is near zero and σ is large, estimating the conditional mean is much harder and the estimator is not accurate; in this situation we suggest using the log-normal distribution.
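For the log-normal case, no approximation is needed, because the conditional mean has a standard closed form. Under the model of equation (1) with λ = 0, that is log(Y) = Xβ + σe with e standard normal, the mean on the original scale is:

```latex
E(Y \mid X_0) = \exp\!\left( X_0 \beta + \frac{\sigma^{2}}{2} \right)
```

Exponentiating X0β alone gives the conditional median rather than the mean, so the factor exp(σ²/2) matters most when σ is large.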
      Moreover, the back transformation method depends on various factors, such as the presence of significantly correlated predictors and outliers in the outcome variable. The method performed quite well when the data had more correlated variables.

      6. Conclusion

      For continuous outcome data that are not normally distributed, the Box-Cox transformation is recommended as an option for assessing the outcome variable in comparisons of two or more groups. The back-transformed (predicted) outcome variable is still not normal. However, the whole methodology is based on regression, and it is therefore more versatile for controlling confounders and predicting the outcome given covariates.

      Funding

      This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.

      Declaration of competing interest

      None of the authors have conflicts of interest to report.

      Acknowledgements

      Nil.

      Appendix A. Supplementary data

      The following are the Supplementary data to this article:

      References

        • Cimermanová K. Estimation of confidence intervals for the log-normal means and for the ratio and the difference of log-normal means with application of breath analysis. Meas Sci Rev. 2007 Jan 1; 7.
        • Olsson U. Confidence intervals for the mean of a log-normal distribution. J Stat Educ. 2005 Jan 1; 13.
        • Chen Y.-H., Zhou X.-H. Interval estimates for the ratio and difference of two lognormal means. Stat Med. 2006; 25: 4099-4113.
        • Thomas K., Peter J.V., Christina J., et al. Cost-utility in medical intensive care patients. Rationalizing ongoing care and timing of discharge from intensive care. Ann Am Thorac Soc. 2015 Jul; 12: 1058-1065.
        • Krishnamoorthy K., Mathew T. Inferences on the means of lognormal distributions using generalized p-values and generalized confidence intervals. J Stat Plann Inference. 2003 Jul 1; 115: 103-121.
        • Shanmuga Sundaram D., Jeyaseelan L., George S., Zachariah G. Analysis strategy for comparison of skewed outcomes from biological data: A recent development. 2014 Dec 1: 16-20.
        • Weisberg S. Transformations. In: Applied Linear Regression. Third ed. Wiley-Interscience, Hoboken, N.J.; 2005 (Wiley series in probability and statistics).
        • Peter J.V., Thomas K., Jeyaseelan L., et al. Cost of intensive care in India. Int J Technol Assess Health Care. 2016 Jan; 32: 241-245.
        • Box G.E.P., Cox D.R. An analysis of transformations. J Roy Stat Soc B. 1964; 26: 211-252.
        • Osborne J. Improving your data transformations: applying the Box-Cox transformation. Pract Assess Res Eval [Internet]. 2019 Nov 23; 15.
        • Agarwal G.G., Pant R. Regression model with power transformation weighting: application to peak expiratory flow rate. J Reliabil Stat Stud. 2009 Aug 10: 52-59.
        • Dag O., Asar O., Ilk O. A methodology to implement Box-Cox transformation when no covariate is available. Commun Stat Simulat Comput. 2014 Jan 1; 43: 1740-1759.
        • Asar Ö., Ilk O., Dag O. Estimating Box-Cox power transformation parameter via goodness-of-fit tests. Commun Stat Simulat Comput. 2017 Jan 2; 46: 91-105.
        • Taylor J.M.G. The retransformed mean after a fitted power transformation. J Am Stat Assoc. 1986 Mar 1; 81: 114-118.
        • Yeo I., Johnson R.A. A new family of power transformations to improve normality or symmetry. Biometrika. 2000 Dec 1; 87: 954-959.
        • Fox J., Weisberg S. An R Companion to Applied Regression [Internet]. Third ed. Sage; 2019.