When dealing with a skewed outcome, researchers often apply a log transformation to make the data approximately normal and then use common statistical tests such as the t-test or linear regression. However, log-transformed data are not always normal. In such situations, the Box-Cox transformation (BCT) can be used to transform skewed data towards normality. A further problem arises when the researcher wants to predict the outcome on the original scale. The aim of this paper is therefore to demonstrate the use of BCT for a skewed outcome and to predict the outcome on the original scale using a regression method.
Materials and method
The Cost of Pyelonephritis in Type-2 Diabetes (COPID) study data were used to demonstrate the BCT and the back-transformation method. The study was conducted among patients admitted to the general medical wards of a tertiary care hospital in south India. The BCT was applied to total cost to make it approximately normal. A multiple linear regression model was fitted and the predicted values were back-transformed to the original scale.
The estimated lambda was −0.36. After BCT, total cost was approximately normal (p-value = 0.621). The residual plots suggested that the errors were normally distributed with constant variance. The median (IQR) of the observed total cost was 57694 (42405, 98621), whereas that of the predicted total cost was 58317 (44270, 95375).
When the data are skewed, the log transformation is not appropriate in all scenarios. BCT, by contrast, generally brings the outcome close to a normal distribution, and the predicted outcome can be back-transformed to the original scale given the covariates.
In research data, outcome variables can be binary (yes/no), categorical (mild/moderate/severe) or continuous. Researchers commonly use the t-test, ANOVA and linear regression models when analysing continuous outcome data. These methods assume that the outcome variable is continuous and normally distributed. Some biological measurements, such as serum creatinine and blood pressure, generally follow a normal distribution, so the above methods can be used. However, other data, such as serum triglyceride, cost of hospital expenses, and duration of hospital stay, do not generally follow a normal distribution. These variables therefore need a transformation that converts the skewed data to normality (on a scale that is no longer the original unit).
The usual approach to skewed continuous outcome data is to apply a transformation such as the log or square root. To handle a positively skewed outcome, the log transformation (log-normal distribution) is currently used in many situations such as the application of breath analysis,
In a two-group comparison, the challenge is to back-transform the mean difference between the two groups from the transformed scale, because that difference is not on the same scale as the original unit of measurement. A common solution to this problem is the generalised pivotal approach (GPA).
However, the log transformation is helpful only if the variable is strictly positive and its range spans more than one order of magnitude (10^1). If the range is considerably less than one order of magnitude, any transformation of that variable is unlikely to help.
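The rule of thumb above can be checked quickly before a transformation is chosen. The following is a minimal Python sketch, assuming strictly positive values; the function name `orders_of_magnitude` and the sample costs are hypothetical and are not taken from the COPID data.

```python
import math


def orders_of_magnitude(values):
    """Return log10(max / min), i.e. how many orders of magnitude the data span."""
    smallest = min(values)
    if smallest <= 0:
        raise ValueError("values must be strictly positive")
    return math.log10(max(values) / smallest)


# Hypothetical hospital costs, for illustration only
costs = [1200.0, 4500.0, 38000.0, 150000.0]
span = orders_of_magnitude(costs)
print(f"range spans {span:.2f} orders of magnitude")
if span > 1.0:
    print("a log or Box-Cox transformation may be worth trying")
else:
    print("a transformation is unlikely to help much")
```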
Often researchers do not validate the assumption of normality in the dependent variable. Log-transformed data may not be normally distributed or the previously right-skewed data may end up as left-skewed.
Moreover, the log transformation is a special case of the BCT (λ = 0), so BCT is more general than the log transformation or the GPA. The BCT usually brings skewed data closer to normality than the other methods mentioned above. Osborne (2010) discussed five transformations (square root, log, inverse, arcsine, and Box-Cox) and highlighted their merits and demerits.
Among these methods, Osborne (2010) showed that BCT works better than the other transformations. BCT uses a range of power transformations to improve normalization and variance equalization for both positively and negatively skewed variables. Agarwal et al. noted that the Box-Cox transformation and weighted least squares are used when the error variance is not constant (heteroscedasticity), and they developed an algorithm to find the weighting parameter in the weighted least squares model.
Asar et al. proposed a method to find lambda (the BCT parameter, which transforms non-normal data to normality at the cost of the original unit) without any covariates. They estimated lambda using seven well-known goodness-of-fit tests and implemented the method in a publicly available R package, “AID”.
The main goal of the transformation is to carry out group comparisons or regression analyses with covariates such as intervention, gender, and age, which can be done using the transformed data. However, a problem arises when we want to predict the outcome given the covariates after fitting the model: the prediction is on the transformed scale, not the original scale. To predict the outcome on the original scale, Taylor et al. suggested an approximate estimator given the covariates.
In this paper, we focus on the role of BCT in converting a skewed outcome to normality (by estimating the lambda value) and in predicting the outcome on the original scale given covariates, using a regression method.
The methods section has three parts. The first describes the statistical method of BCT and the second section presents the back transformation of the fitted value of the outcome variable (total cost) in original scale given the covariates. The third section describes the data which is used to demonstrate the above two sections.
2.1 Box-Cox Transformation
The dependent (continuous) variable is denoted by Y and the independent variables by X = (1, x1, x2, …, xk). Box and Cox introduced a transformation model that maps a skewed distribution to a normal distribution on a transformed scale rather than the original scale.
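For reference, the Box-Cox family and the associated linear model are usually written as below. This is the standard textbook form, given as a sketch in the notation of this section; the symbols β (regression coefficients), ε (error term) and σ² (error variance) are assumptions, since they are not defined in the text.

```latex
\begin{aligned}
Y^{(\lambda)} &=
\begin{cases}
\dfrac{Y^{\lambda} - 1}{\lambda}, & \lambda \neq 0, \\[6pt]
\log Y, & \lambda = 0,
\end{cases} \\[6pt]
Y^{(\lambda)} &= X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^{2}).
\end{aligned}
```

The case λ = 0 is the limit of the power form as λ → 0, which is why the log transformation is a special case of this family.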
The model is fitted to the transformed data for a range of lambda values, and the optimal lambda is the one that maximizes the likelihood. The drawback of this method is that the likelihood, and hence lambda, changes whenever the covariates change. To handle this, we used the goodness-of-fit method suggested by Asar et al. (2017), implemented in the “AID” package, to estimate lambda.
This package is used in the R code provided in the supplementary file. The input is the outcome variable, which in our example is “total cost”.
The above transformation (equation 2) is applicable only to positive values, i.e. Y > 0. If the data contain negative values, we recommend the Yeo-Johnson (2000) family of transformations, which places no restriction on the sign of the response variable.
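The likelihood-based search described above can be illustrated for the simplest case of no covariates (an intercept-only model). The following Python sketch is not the goodness-of-fit procedure of the “AID” package; it is a plain grid search over the profile log-likelihood, and the function names, the grid limits and the simulated data are all hypothetical.

```python
import math
import random


def boxcox(y, lam):
    """Box-Cox transform of a single strictly positive value."""
    if y <= 0:
        raise ValueError("Box-Cox requires y > 0")
    if abs(lam) < 1e-12:
        return math.log(y)
    return (y ** lam - 1.0) / lam


def profile_loglik(ys, lam):
    """Profile log-likelihood of lambda for an intercept-only normal model."""
    n = len(ys)
    z = [boxcox(y, lam) for y in ys]
    mean_z = sum(z) / n
    var_z = sum((v - mean_z) ** 2 for v in z) / n  # ML estimate of the variance
    log_jacobian = (lam - 1.0) * sum(math.log(y) for y in ys)
    return -0.5 * n * math.log(var_z) + log_jacobian


def estimate_lambda(ys, lo=-2.0, hi=2.0, step=0.01):
    """Return the grid value of lambda with the largest profile log-likelihood."""
    n_steps = int(round((hi - lo) / step))
    grid = [lo + i * step for i in range(n_steps + 1)]
    return max(grid, key=lambda lam: profile_loglik(ys, lam))


# Simulated right-skewed data (log-normal), for illustration only
random.seed(1)
sample = [random.lognormvariate(10.0, 0.6) for _ in range(200)]
print("estimated lambda:", round(estimate_lambda(sample), 2))
```

With covariates, the variance term would be replaced by the residual variance of the regression fitted at each lambda, which is why the estimate can change when the covariate set changes.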
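For readers unfamiliar with the Yeo-Johnson family, its piecewise definition can be sketched in a few lines of Python. This is an illustration of the published definition and is not part of the supplementary R code; the function name `yeo_johnson` and the example values are hypothetical.

```python
import math


def yeo_johnson(y, lam):
    """Yeo-Johnson power transform of a single value (any sign allowed)."""
    if y >= 0:
        if abs(lam) < 1e-12:
            return math.log1p(y)                  # log(y + 1)
        return ((y + 1.0) ** lam - 1.0) / lam
    if abs(lam - 2.0) < 1e-12:
        return -math.log1p(-y)                    # -log(-y + 1)
    return -(((1.0 - y) ** (2.0 - lam)) - 1.0) / (2.0 - lam)


# Hypothetical values that include zero and negatives
for value in (-3.0, 0.0, 2.5):
    print(value, "->", round(yeo_johnson(value, 0.5), 4))
```

For non-negative values the transform is the Box-Cox transform of y + 1, and λ = 1 leaves the data unchanged.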
Taylor et al. gave an approximate estimator for back-transforming the predicted value to the original scale after a linear model has been fitted to the Box-Cox transformed outcome.
This approximate estimator of the conditional mean is referred to below as equation (3).
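One way to arrive at an estimator of this kind is a second-order Taylor (delta-method) expansion of the inverse Box-Cox transformation around the fitted value. The expression below is a sketch under the assumptions of λ ≠ 0 and normally distributed errors on the transformed scale; the symbols g, \hat{\mu}, \hat{\beta} and \hat{\sigma}^{2} are notation introduced here for illustration and may differ from the notation of Taylor et al.

```latex
\begin{aligned}
g(z) &= (1 + \lambda z)^{1/\lambda}, \qquad
g''(z) = (1 - \lambda)\,(1 + \lambda z)^{1/\lambda - 2}, \qquad
\hat{\mu} = X\hat{\beta}, \\[6pt]
E[Y \mid X] &\approx g(\hat{\mu}) + \frac{\hat{\sigma}^{2}}{2}\, g''(\hat{\mu})
= (1 + \lambda \hat{\mu})^{1/\lambda}
\left[ 1 + \frac{\hat{\sigma}^{2}\,(1 - \lambda)}{2\,(1 + \lambda \hat{\mu})^{2}} \right].
\end{aligned}
```

The first factor is the naive back transformation of the fitted value; the bracketed term corrects for the curvature of the inverse transformation and grows as 1 + λ\hat{\mu} approaches zero or as \hat{\sigma}^{2} increases.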
R version 4.0.1 was used to perform the analysis. For ease of use, we created a function called “BCT_BT()” which implements both sections one and two and produces the results. The source code for the “BCT_BT()” function is given as a separate file (“source_file”). The researcher should load this file with the “source” command before running “BCT_BT()”. The R code, along with detailed instructions, is given as supplementary material (Appendix A).
2.3 Data description
The data used in this study come from a prospective observational economic study (COPID study: Cost of Pyelonephritis in Type-2 Diabetes). The outcome is the total cost of hospital expenses (direct and indirect). The study was conducted among patients admitted to the general medical wards of a tertiary care hospital in south India. Participants were recruited between March 2016 and July 2017. The outcome variable, total cost, was right-skewed. The independent variables used in the multiple linear regression analysis were gender (0 = female, 1 = male), age, ischemic heart disease (IHD) (0 = No, 1 = Yes), diabetes mellitus (DM) (0 = No, 1 = Yes), duration of hospital stay, and number of days of intravenous antibiotics.
Original data: Table 1 presents the descriptive statistics of demographic variables such as gender and age. There were 92 patients, with a mean ± SD age of 55.80 ± 16.54. Among them, 53 (57.61%) were female and 39 (42.39%) were male. The numbers of IHD and DM patients were 15 (16.30%) and 61 (66.30%) respectively. The median (IQR) of hospital stay was 10
Normality assumption checking (histogram and Q-Q plot): Fig. 1(a) and (c) present the histogram and the Q-Q plot of total cost respectively. These plots suggested that total cost was not normally distributed. The estimated lambda value was −0.36. After the transformation, total cost was approximately normally distributed. The Shapiro-Wilk test was used to check the normality assumption; the non-significant p value (p value = 0.621) gave no evidence against normality of the transformed data. Fig. 1(b) and (d) present the histogram and Q-Q plot of total cost after BCT respectively. In Fig. 1(d), all the points lie within the confidence band, which suggests that the data are normal. Therefore, we used the BCT to transform total cost and used the transformed value as the dependent variable in the regression model.
Regression: Table 2 presents the results of the multiple linear regression analysis, in which hospital stay and number of days of intravenous antibiotics were significantly associated with the transformed total cost, with regression coefficients (95% CI) of 0.00117 (0.00088, 0.00147) and 0.00029 (0.00004, 0.00054) respectively. Fig. 2 presents the residual plots of the linear regression model. The residuals vs. fitted plot suggested that the variance was almost constant, and most of the points in the Q-Q plot lie on the diagonal line, suggesting that the errors follow a normal distribution. The Breusch-Pagan test for heteroscedasticity (p value = 0.657) showed no sufficient evidence that heteroscedasticity is present in the model.
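The idea behind the heteroscedasticity check can be sketched for the simplest case of one predictor. The Python code below uses Koenker's studentized variant of the Breusch-Pagan statistic (n times the R² from regressing the squared residuals on the predictor), which may differ from the exact variant used in the analysis; the function names and the data are hypothetical, and a real analysis with several covariates would use a statistical package.

```python
import math


def simple_ols(x, y):
    """Least-squares intercept and slope for a single predictor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return my - slope * mx, slope


def r_squared(x, y):
    """Coefficient of determination of the simple regression of y on x."""
    intercept, slope = simple_ols(x, y)
    my = sum(y) / len(y)
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - my) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def breusch_pagan_one_predictor(x, y):
    """Return (LM statistic, p value) for the studentized Breusch-Pagan test."""
    intercept, slope = simple_ols(x, y)
    sq_resid = [(yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y)]
    lm = max(0.0, len(x) * r_squared(x, sq_resid))
    # Upper tail of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(lm / 2.0))
    return lm, p_value


# Hypothetical data whose error spread grows with x
x = [float(i) for i in range(1, 41)]
y = [2.0 * xi + 0.5 * xi * (1 if i % 2 == 0 else -1) for i, xi in enumerate(x)]
lm, p = breusch_pagan_one_predictor(x, y)
print(f"LM = {lm:.2f}, p value = {p:.4g}")
```

A small p value indicates that the squared residuals vary with the predictor; a large one, as reported above, gives no evidence against constant variance.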
Table 2. Multiple linear regression analysis results (after transformation).
Back transformation: The predicted total cost was back-transformed using equation (3), which gives the outcome on the original scale, where it is not necessarily normal. Table 3 summarizes the observed and the back-transformed predicted total cost in the original unit. The median (IQR) of the observed total cost was 57694 (42405, 98621), whereas that of the back-transformed predicted total cost was 58317 (44270, 95375). The RMSE between the observed and back-transformed total cost was 221603.2.
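The back-transformation and RMSE steps can be sketched as follows. This Python code is not the “BCT_BT()” function from the supplementary material; it applies a second-order approximation to the conditional mean, and the function names, fitted values, residual variance and observed costs are hypothetical. Only λ = −0.36 is taken from the results above.

```python
import math


def inverse_boxcox(z, lam):
    """Naive inverse of the Box-Cox transform."""
    if abs(lam) < 1e-12:
        return math.exp(z)
    base = 1.0 + lam * z
    if base <= 0:
        raise ValueError("1 + lambda * z must be positive")
    return base ** (1.0 / lam)


def back_transform_mean(fitted, lam, sigma2):
    """Second-order approximation to E[Y | x] on the original scale.

    fitted: prediction on the transformed scale
    sigma2: residual variance on the transformed scale
    """
    base = 1.0 + lam * fitted
    correction = 1.0 + sigma2 * (1.0 - lam) / (2.0 * base ** 2)
    return inverse_boxcox(fitted, lam) * correction


def rmse(observed, predicted):
    """Root mean squared error between two equal-length sequences."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


lam = -0.36                      # estimated lambda reported in the results
fitted = [2.70, 2.72, 2.74]      # hypothetical predictions on the transformed scale
sigma2 = 1e-5                    # hypothetical residual variance
observed = [41000.0, 60000.0, 140000.0]  # hypothetical observed costs

predicted = [back_transform_mean(f, lam, sigma2) for f in fitted]
print("predicted cost:", [round(p) for p in predicted])
print("RMSE:", round(rmse(observed, predicted), 1))
```

Because λ is negative here, fitted values close to −1/λ make 1 + λ·fitted small, so small changes on the transformed scale correspond to large changes in the predicted cost.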
Table 3. Summary of total cost and predicted total cost.
Skewed outcomes are common in the analysis of continuous outcome variables. Traditionally, researchers have used the log transformation to convert a skewed distribution to a normal one.
When researchers use the log transformation for a skewed outcome, they presume that the transformation will make the skewed data normal; however, this is not always true, and the normality assumption is sometimes not tested after the log transformation.
The difference between two groups, such as a new intervention vs. standard of care, remains on the log-transformed scale, and the back-transformed difference cannot be related to the original unit. To avoid this limitation, the GPA is recommended, as it provides the difference on the original scale.
However, an important constraint of the GPA is that the log-transformed data should follow a normal distribution. The BCT yields transformed data that are approximately normally distributed, and on back transformation the difference can be obtained on the original scale using the regression method. Taylor et al. proposed an approximate estimator and reported that the estimate (for back transformation) is accurate except when the BCT parameter (lambda) is near zero.
Back-transforming the fitted value of a Box-Cox regression gives an approximately unbiased estimate of the original-scale median at that predictor value. The study data had ten observations identified as outlying values (according to the box plot). Even without excluding them from the analyses, we were able to transform the data to normality and back-transform the predictions to the original scale. Nevertheless, our suggestion would be to remove the outlying values before estimating lambda and fitting the regression.
Osborne (2010) demonstrated the use of the Box-Cox transformation on three real data sets and suggested that the method not only transforms the data but also increases the effect size in the analysis, up to a maximum correlation of over 80%.
However, findings from real data may not be generalizable unless data with various degrees of skewness are simulated and evaluated with various transformation methods. The author also provided SPSS syntax for the computation, and we suggest that researchers try the SPSS code as well.
However, in our experience, using SPSS syntax is more labour-intensive than using R code. The method does not require advanced computing skills, and we have provided R code as a supplementary file to facilitate learning. This tutorial article will be useful for researchers who would like to analyse outcome data that are not normally distributed. To find the best lambda, the first step is to estimate λ and check the four regression model assumptions: normality, linearity, homoscedasticity, and independence. To obtain the best fit, we may have to drop some covariates and validate the assumptions again. This process may improve lambda.
If the outcome data contain a small proportion of zero or negative values, BCT is not recommended. As a further limitation, if there are more zeros than other observations (zero-inflated data), BCT may not be appropriate. BCT may also be inappropriate for mixture distributions. When λ is near zero and σ is large, estimating the conditional mean is much harder, and the estimator is not accurate in this situation; we therefore suggest using the log-normal distribution when λ is near zero.
Moreover, the performance of the back-transformation method depends on factors such as the presence of significantly correlated predictors and outliers in the outcome variable. The method performed well when the data contained more correlated variables.
To analyse continuous outcome data that are not normally distributed, the Box-Cox transformation is recommended as an option for assessing the outcome variable in comparisons of two or more groups. The back-transformed (predicted) outcome variable is still not normal. However, because the whole methodology is based on regression, it is versatile in controlling for confounders and predicting the outcome given covariates.
This research did not receive any specific grant from funding agencies in the public, commercial, or not-for-profit sectors.
Declaration of competing interest
None of the authors have conflicts of interest to report.
Appendix A. Supplementary data