Monday, September 03, 2007

Missing Data

I have been having fun on Deltoid recently (here and here). One annoying issue is whether or not the method by which SG replicated the L1 CMR estimates is obviously correct and/or the only reasonable approach. I don't think it is. (Given that it matches the formula in the spreadsheet distributed by Les Roberts, I am happy to assume that it matches the approach used in the paper.) Although I am a layman when it comes to demography, it seems obvious that any statistician would question whether just adding up all the deaths and dividing by person-months is the best way to estimate the crude mortality rate for Iraq. Such a procedure ignores the fact that non-response varies across clusters.

Consider a simple example in which you have two clusters with 50 attempted interviews in each, using a one-year look-back period. In cluster A, you interview all 50 households. There are 10 people in each house and a total of 20 deaths. The CMR in cluster A is then 4% (20 deaths divided by 500 person-years). But, in cluster B, only 10 households agree to be interviewed. The other 40 refuse. There are also 10 people in each of the 10 households. There is one death, giving a CMR of 1% for cluster B (1 death divided by 100 person-years).

Question: What is the best estimate of the CMR for the whole country given this data from two clusters? Answer: It depends! Certainly, the formula cited as obvious by SG, Robert Chung and others at Deltoid is not clearly the best answer (although I agree that it is a reasonable one). To see why, note that this formula just adds up the total deaths (20 + 1 = 21) and divides by the total person-years in the population that agreed to be interviewed (500 + 100 = 600) to give a CMR of 3.5% (21/600). In other words, the overall estimate is very close to the estimate for cluster A because the actual sample size from cluster A is so much larger than that for cluster B, even though the sampling plan called for equal sample sizes.

Ignoring non-response causes you to weight clusters with higher response rates more heavily even though, a priori, there is no particularly good reason to do so.
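To make that implicit weighting concrete, here is a small sketch using the numbers from the two-cluster example. The variable names are mine, chosen for illustration; this is not taken from the spreadsheet or the paper. It shows that the pooled figure is the same thing as a weighted average of the per-cluster CMRs, with weights proportional to the person-years actually observed.

```python
# Figures from the hypothetical two-cluster example (not real survey data).
deaths = {"A": 20, "B": 1}
person_years = {"A": 500, "B": 100}

total_person_years = sum(person_years.values())

# Weight each cluster implicitly receives when deaths and person-years are pooled.
weights = {c: person_years[c] / total_person_years for c in person_years}

# Per-cluster crude mortality rates.
cluster_cmr = {c: deaths[c] / person_years[c] for c in deaths}

# Weighted average of cluster CMRs; algebraically equal to 21/600.
pooled = sum(weights[c] * cluster_cmr[c] for c in deaths)

print(weights)
print(pooled)
```

Cluster A ends up with five sixths of the weight and cluster B with one sixth, purely because more households in A agreed to be interviewed.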

A different approach would be to treat the two clusters as independent and just average the resulting CMRs from each cluster. Such an approach would give 2.5% (the average of 4% and 1%). Now, the difference between 2.5% and 3.5% isn't that big. Indeed, the differences in the two approaches for the Lancet data are even smaller. But the idea that one is clearly right and the other wrong is just stupid. It all depends on what you think the cause of the non-response is.
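For anyone who wants to check the arithmetic, here is a minimal sketch of the two estimators side by side, applied to the made-up two-cluster example. The function names `pooled_cmr` and `mean_of_cluster_cmrs` are hypothetical labels of my own, not anything from the Lancet paper or its spreadsheet.

```python
def pooled_cmr(clusters):
    """Total deaths divided by total person-years across all clusters."""
    total_deaths = sum(deaths for deaths, _ in clusters)
    total_person_years = sum(py for _, py in clusters)
    return total_deaths / total_person_years


def mean_of_cluster_cmrs(clusters):
    """Unweighted average of the per-cluster CMRs (each cluster counts equally)."""
    rates = [deaths / py for deaths, py in clusters]
    return sum(rates) / len(rates)


# (deaths, person-years) for cluster A and cluster B in the example.
clusters = [(20, 500), (1, 100)]

print(f"pooled: {pooled_cmr(clusters):.1%}")
print(f"mean of cluster CMRs: {mean_of_cluster_cmrs(clusters):.1%}")
```

The two functions agree whenever every cluster contributes the same number of person-years, so the gap between them is driven entirely by the unequal response.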

(A further complication is that household sizes may differ across clusters. In that case, it makes sense to either sum populations or to weight the clusters by the total population in each cluster. But the main point I am making here has to do with non-response.)