In a previous post on missing data imputation, we reviewed the underlying assumptions and limitations of last observation carried forward (LOCF). This method was widely used in the past because it is straightforward to apply and easy to understand, and because of the often incorrect assumption that it is a conservative approach. Over the years, more sophisticated methods for handling missing longitudinal data have been developed that have advantages over single imputation methods such as LOCF, baseline observation carried forward (BOCF), or worst observation carried forward (WOCF). We will briefly review two of these methods, mixed models for repeated measures (MMRM) and multiple imputation (MI).
Both MMRM and MI assume that any missing data are ‘missing at random’ (MAR). Unfortunately, the statistical terminology for missing data is not particularly helpful for researchers and is often misunderstood. Data are MAR when, conditional on the observed data, the probability that a value is missing is independent of the unobserved values. So what does this mean? Let’s consider an example. Suppose a double-blind placebo-controlled study is conducted for an antihypertensive drug where patients aged 20 to 75 years have biweekly study visits during an 8-week treatment period. During the study, many of the patients discontinue early due to adverse events, and a majority of these patients are over 50. A concern is raised that because more of the older patients discontinued, and their blood pressures are generally expected to be higher and less controlled than those of the younger patients, the missing blood pressure measurements would have been higher on average than the observed measurements. Does this mean the blood pressure data are not missing at random? Not necessarily.
The fact that the distributions of missing and observed blood pressures are expected to differ only means that the data are not ‘missing completely at random’ (MCAR), another term in the missing data nomenclature. In contrast, MAR means there may be systematic differences between the missing and observed blood pressures, but these can be explained by other observed variables. In this example, age is the variable that explains the expected differences between the missing and observed blood pressures. So the question becomes, among patients of a similar age, would you expect the distributions of missing and observed blood pressures to be similar? If the answer is yes, then the missing blood pressure measurements are MAR, conditional on age.
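To make the distinction concrete, here is a minimal simulation sketch of the blood pressure example. Everything in it is a hypothetical assumption chosen for illustration: the sample size, the linear relationship between age and systolic blood pressure, and the dropout probabilities (40% over age 50, 10% otherwise). Because dropout depends only on age and never on blood pressure itself, the simulated data are MAR conditional on age.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is reproducible
n = 20000  # hypothetical sample size, far larger than a real trial

# Hypothetical data-generating model: systolic BP rises with age.
age = rng.uniform(20, 75, size=n)
sbp = 110 + 0.6 * (age - 20) + rng.normal(0, 10, size=n)

# Dropout depends only on age (an observed variable), never on BP itself,
# so the missingness is MAR conditional on age.
p_dropout = np.where(age > 50, 0.40, 0.10)
missing = rng.random(n) < p_dropout


def mean_gap(mask):
    """Mean BP of the would-be-missing values minus mean of the observed ones."""
    return sbp[mask & missing].mean() - sbp[mask & ~missing].mean()


overall_gap = mean_gap(np.ones(n, dtype=bool))  # all patients pooled
older_gap = mean_gap(age > 50)                  # within the older stratum
younger_gap = mean_gap(age <= 50)               # within the younger stratum

print(f"overall gap:    {overall_gap:.1f} mmHg")
print(f"gap, age > 50:  {older_gap:.1f} mmHg")
print(f"gap, age <= 50: {younger_gap:.1f} mmHg")
```

Pooled over all patients, the missing values are clearly higher on average than the observed ones, so the data are not MCAR. Within each age group the gap shrinks to roughly zero, which is what MAR conditional on age looks like. In a real study the missing values are of course never available for this comparison, which is why MAR is an assumption rather than something that can be verified from the data.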
One approach for analyzing MAR data is MMRM, which is used when the response is continuous and measured repeatedly over time. This method does not explicitly impute the missing values, but rather assumes that the subject’s missing data after withdrawal would have followed the trend of his or her own treatment group. For categorical responses and count data, generalized linear mixed models (GLMM) can be used. MMRM and GLMM are applicable under MCAR or MAR assumptions, and can be performed using SAS PROC MIXED and PROC GLIMMIX, respectively. Under MAR, if the model is correctly specified with predictor variables that explain the expected differences between missing and observed values, these methods provide an unbiased estimate of the treatment effect that would have been observed had all patients completed treatment.
MI, another widely used method for handling missing values that are assumed MAR, involves three steps:
1) the missing data are imputed m times to generate m complete data sets
2) each of the m complete data sets is analyzed using the pre-specified statistical method for the endpoint
3) the results from the m complete data sets are combined for inference
The type of imputation method chosen in Step 1 depends on the pattern of missingness (monotone or arbitrary) and the imputed variable type (continuous, binary or ordinal, or nominal) and can be applied using SAS PROC MI. Step 3 can be performed using SAS PROC MIANALYZE. Details of these steps are well documented elsewhere and will not be repeated here. An important advantage of multiple imputation over single imputation is that, by combining the parameter estimates across the imputed data sets, one obtains not only a single point estimate but also a standard error that takes into account the uncertainty of the imputation process. From these statistics, valid confidence intervals for the parameters can be derived.
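As an illustration of the combining step, the sketch below pools m estimates using the standard combining rules commonly known as Rubin's rules: the pooled estimate is the average of the m estimates, and the total variance is the average within-imputation variance plus (1 + 1/m) times the between-imputation variance. The function name `pool_rubin` and the input numbers are made up for this example, and the sketch is not meant to reproduce everything PROC MIANALYZE reports (for instance, degrees of freedom are omitted).

```python
import math
from statistics import mean, variance


def pool_rubin(estimates, std_errors):
    """Combine m point estimates and their standard errors (Rubin's rules)."""
    m = len(estimates)
    if m < 2 or len(std_errors) != m:
        raise ValueError("need at least two estimates, each with a standard error")

    pooled = mean(estimates)                     # pooled point estimate
    within = mean([se ** 2 for se in std_errors])  # average within-imputation variance
    between = variance(estimates)                # between-imputation variance (m - 1 divisor)
    total = within + (1 + 1 / m) * between       # total variance of the pooled estimate
    return pooled, math.sqrt(total)


# Made-up treatment-effect estimates (mmHg) from m = 3 imputed data sets.
estimates = [-5.0, -6.0, -4.0]
std_errors = [1.0, 1.0, 1.0]

pooled_estimate, pooled_se = pool_rubin(estimates, std_errors)
print(f"pooled estimate: {pooled_estimate:.2f}, pooled SE: {pooled_se:.2f}")
```

Note that the pooled standard error is larger than any of the individual standard errors. The extra amount comes from the between-imputation term, which is how the uncertainty about the imputed values enters the final inference.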
Both MMRM and MI are based on the assumption that data are MAR. However, there are many types of dropout and missing data scenarios, and one cannot exclude the possibility that some of the data are neither MAR nor MCAR, but rather ‘missing not at random’ (MNAR). For this reason, sensitivity analyses that assess how robust the results are to the method of handling missing data should almost always be included as part of a comprehensive analysis, particularly when there is a large amount of missing data or an imbalance in missing data among the treatment groups. More to come on this topic at a later time.