I sent a draft of my article on confidence intervals in L1 to the Lancet last year. My goal was to get them to force Roberts et al. to answer some questions. The Lancet editor thought it best to consider my article for publication. That was fine with me, although I agree with the eventual rejection on the grounds that my paper is too late and too narrow to merit inclusion in the Lancet. For those who care, here are the reviewers' thoughts, most of which I agree with (especially with regard to my silly mistake in the formula for covariance). I publish them here (with permission) since other Lancet aficionados may find them interesting, especially the APPENDIX at the end, which seems to reveal some secrets of the original peer review for L1.

Manuscript reference number: THELANCET-D-07-05168

Title: Comments on the Confidence Intervals of Roberts et al. (2004)

Dear Dr Kane,

Many thanks for submitting your manuscript to The Lancet. Clearly we felt we should seek advice on this and following external peer review, several editors here have discussed the manuscript. Essentially the reviewers all advise us that they are comfortable with the Roberts et al analysis but less comfortable with the arguments you advance. Our decision is, therefore, that we should not publish your submission.

The reviewers' comments and some editorial points that may be of interest to you are presented in the paragraphs below. I hope you find these comments helpful.

Reviewer #1: The main message of this paper is that there is a disagreement among the three 95% confidence intervals for the variables CMRpost, CMRpre and RR = CMRpost / CMRpre, published by Roberts et al. in The Lancet 364: 1857-1864 (2004). To show this, the author assumes that the published 95% confidence intervals for the variables CMRpost and CMRpre are correct, and then argues that the 95% confidence interval for the variable RR is not. An argument based on the calculation of the probability Pr[CMRpost - CMRpre < 0] is used.

My first comment is: I do not think the author needs to use this argument.

All we need is a distribution for RR and to derive from it the 95% confidence interval. But what statistical model should be considered for the variable RR? Roberts et al. (page 1859, second column) seem to assume a log-linear regression model but nothing is said about the regressors they used.

In this situation the simplest thing to do is to consider a normal distribution for the CMRs, as done in the paper. Under normality I have computed the confidence interval for RR, and find that it does not coincide with that given by Roberts et al.

Indeed, assuming, as in the paper, that CMRpre follows the normal distribution N(5, 0.66^2) and CMRpost is N(12.3, 5.56^2) distributed, for which the 95% confidence intervals for CMRpre and CMRpost are the published (3.7, 6.3) and (1.4, 23.2), respectively, and assuming that the variables CMRpre and CMRpost are uncorrelated, the density of RR, whose values are denoted by w, turns out to be

$$
f(w) = \frac{1}{2\pi\,(0.66)(5.56)} \int_{-\infty}^{\infty} |t| \, \exp\left\{ -\frac{(tw - 12.3)^2}{2(5.56^2)} - \frac{(t - 5)^2}{2(0.66^2)} \right\} dt.
$$

This density looks like that given in Figure 4 on page 10 of the paper.

From this density the 95% confidence interval for RR is (0.28, 4.99), which does not coincide with that of Roberts et al.
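
The (0.28, 4.99) interval can be checked without evaluating the integral. What follows is a minimal sketch, not the reviewer's own computation. It assumes the same independent normal distributions for CMRpre and CMRpost, ignores the negligible probability that CMRpre is zero or negative, and uses the fact that, for positive CMRpre, RR <= w exactly when CMRpost - w * CMRpre <= 0. The function names are illustrative only.

```python
from math import sqrt
from statistics import NormalDist

# Normal approximations quoted in the review (assumed independent).
MU_POST, SD_POST = 12.3, 5.56
MU_PRE, SD_PRE = 5.0, 0.66


def ratio_cdf(w):
    """Approximate Pr[CMRpost / CMRpre <= w], treating CMRpre as positive."""
    # CMRpost - w * CMRpre is normal; the ratio is <= w when this is <= 0.
    mean = MU_POST - w * MU_PRE
    sd = sqrt(SD_POST ** 2 + (w * SD_PRE) ** 2)
    return NormalDist(mean, sd).cdf(0.0)


def ratio_quantile(p, lo=-10.0, hi=50.0, tol=1e-9):
    """Invert ratio_cdf by bisection (it is increasing on [lo, hi])."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if ratio_cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


lower = ratio_quantile(0.025)
upper = ratio_quantile(0.975)
print(f"Equal-tailed 95% interval for RR: ({lower:.2f}, {upper:.2f})")
```

Under these assumptions the equal-tailed limits come out at roughly 0.28 and 4.99, in line with the reviewer's figures.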

Furthermore, if the confidence intervals for CMRpre and CMRpost were constructed with a t-distribution with 3 degrees of freedom, the confidence interval for RR would be wider than the one above.

Therefore, I essentially agree with the inconsistency of the three confidence intervals as it is claimed in the paper. Having said that, here is my second question.

Second question: What is the message after the finding? If the message is that the paper by Roberts et al. is quite obscure as regards the statistical methodology used, I completely agree. But if the message is that the data support the null hypothesis that mortality in Iraq is unchanged simply because the 95% confidence interval on f(w) contains the point w = 1, I disagree.

The reason for my disagreement is that distinguishing between estimation and testing was strongly recommended by Jeffreys (1961, pp. 245-249), mainly because the methods commonly used for estimation are typically not suitable for hypothesis testing. In particular, your hidden jump between the confidence interval and the non-existence of evidence against the null has no justification. We need to carry out a test (preferably a Bayesian test) based on the whole data set before accepting or rejecting a null.

A minor question. Is there something Bayesian in this paper? In my opinion it is a frequentist paper with no Bayesian thinking at all. Well, after all, what the paper proves is that a frequentist confidence interval is suspect.

A final comment: To cook the data with the intuition is an extremely bad and dangerous practice. It is sometimes used when your statistical model is clearly a wrong model unable to give you a sensible answer.

Reviewer #2:

1. I do not think this piece is suitable as a Lancet article, as it is essentially a statistical discussion of a previous paper. It might be suitable as a letter. However, I understood that you have a policy of only accepting correspondence on previous Lancet papers within a limited period after publication. The Roberts paper was published over two years ago, and has since been superseded by the later Burnham study in 2006. I am not convinced that it makes sense to publish this material now, especially as it is not based on any new facts that have emerged since the original publication.

2. In any case, I believe that the main statistical argument put forward by Kane is incorrect. He uses the confidence interval for the post-invasion mortality rate, as reported by Roberts, to deduce that the confidence interval for the relative risk is too narrow. The wide confidence interval for post-invasion mortality (especially when Falluja is included in the analysis) is based on cluster-sample methodology and correctly takes account of between-cluster variability. This is reflected in the very large design effect (29.3 including Falluja, 2.0 excluding Falluja). However the relative risk estimate is based on comparing pre-invasion and post-invasion mortality within clusters. This is a before-and-after comparison, and can be thought of as a kind of "matched analysis". If pre- and post-invasion rates are correlated, this will provide a more precise estimate of the relative risk than is implied by looking at the precision of the pre- and post-invasion rates individually. I assume that this is the reason why Roberts' confidence interval for the RR is narrower than implied by Kane's calculations. In fact Kane explicitly states that he is assuming zero covariance, which does not seem appropriate given this study design. (In any case, his formula for CMR does not seem correct - the last term in the variance should be -2 x Cov(pre,post), I think.)
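
To illustrate the role of the covariance term, here is a small sketch of Var(post - pre) = Var(post) + Var(pre) - 2 x Cov(pre,post). It reuses the normal approximations from Reviewer #1's comments (means 5 and 12.3, standard deviations 0.66 and 5.56). The correlation values are hypothetical, chosen only to show the direction of the effect, and are not estimates from the survey data; the function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

# Normal approximations quoted by Reviewer #1.
MU_PRE, SD_PRE = 5.0, 0.66
MU_POST, SD_POST = 12.3, 5.56


def diff_sd(rho):
    """SD of CMRpost - CMRpre when the two rates have correlation rho."""
    cov = rho * SD_PRE * SD_POST
    return sqrt(SD_POST ** 2 + SD_PRE ** 2 - 2.0 * cov)


def prob_no_increase(rho):
    """Pr[CMRpost - CMRpre < 0], assuming the difference is normal."""
    return NormalDist(MU_POST - MU_PRE, diff_sd(rho)).cdf(0.0)


for rho in (0.0, 0.5, 0.9):  # hypothetical correlations, for illustration
    print(f"rho={rho:.1f}  sd={diff_sd(rho):.2f}  "
          f"Pr[diff<0]={prob_no_increase(rho):.3f}")
```

A positive correlation shrinks the standard deviation of the difference, and a negative one would enlarge it.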

3. In my original review of the Roberts paper, I pointed out that their confidence interval on the RR did not make allowance for the possible variation in relative risk between clusters. In response to this, they reanalysed their data (with assistance from Prof Zeger at Johns Hopkins who is a world expert on the analysis of correlated data!) and the confidence interval in the published paper does make allowance for this source of variation.

4. Kane also states that this overall relative risk estimate (including the Falluja data) is the main finding of the Roberts paper. I would question this. I think the main take-home message of the paper was their estimate of nearly 100,000 excess deaths post-invasion. This is actually based on the excess risk estimated after excluding the data from Falluja. And the authors give a wide confidence interval for this, with a lower confidence limit of 8,000 excess deaths which is much more conservative than the estimate based on the full dataset including Falluja.

5. As mentioned above, this study has in any case been largely superseded by the larger survey carried out in Iraq in 2006 by the same team and published in The Lancet in October 2006. This confirms the earlier findings, but gives much more accurate mortality estimates and provides stronger evidence that there has been a true increase in mortality. The relevance of a statistical critique of the earlier paper at this point is therefore a bit questionable.

6. Finally, one of the most compelling aspects of the evidence presented by Roberts and his team is the dramatic change in cause of death and the age and sex distribution of deaths. Kane makes no mention of these findings.

Reviewer #4:

Report for Lancet on paper/research letter by David Kane on: Confidence intervals of Roberts et al. (2004)

Main considerations:

1. Original fast track referees commented extensively, see APPENDIX, on the issue raised by Kane; but allowed that it be addressed either analytically or in Discussion - including by discussion of how cluster sampling re mortality could be improved in a future study.

2. In his Abstract and Introduction, Kane selectively cites 2.5 fold increase (95% CI 1.6-4.2) whereas, as acknowledged in footnote 2, the Findings section in Roberts et al (2004) also cited sensitivity analysis: 'If we exclude the Falluja data, the risk of death is 1.5-fold (1.1-2.3) after the invasion. We estimate that 98,000 more deaths than expected (8,000 - 194,000) happened after the invasion outside of Falluja and far more if the outlier Falluja cluster is included'. In so doing, Roberts et al. (2004) had addressed the methodological concerns of referees by at least one of the options offered to them. Importantly, their Findings exposed readers to the considerable method-variation.

3. Thus, Kane is covering ground that Lancet, its referees and the authors had traversed - at issue now are 3 points:

a) should Lancet readers have been confronted with Table 1 comparison of CMR versus CMR* to re-present to them why (1.1-2.3) needed to be in the frame besides (1.6-4.2)?

b) should Lancet readers have been confronted with confidence limits, such as in Table 2, which were nonsense albeit the presented point estimate was valid?

c) Do Kane's analyses via assumptions of i) normality and independence, or ii) unimodality and independence add quantitative understanding of why the lower bound of 1.6 is too high?

My answers are:

A. that Lancet and Roberts et al. sufficiently addressed a) by including method sensitivity in the Finding, not just in text of their paper.

B. On b), Roberts et al. might have under-scored even more so why they refrained from providing non-applicable confidence limits because a sceptic, such as Kane, may consider that their exclusion was to blind the reader to methodological problems. However, Lancet published a commentary which specifically addressed the issue of extrapolation from sampled clusters that may have differed in their potential for exposure to hostilities; and how other intelligence sources might be brought to bear to improve the estimation.

C. The final question relates to Kane's own contribution (with or without Michael Spagat, who is currently acknowledged, as co-author). Unfortunately, Dr Kane's estimate is merely an illustration of the problem that the referees alluded to: namely, if the confidence interval is derived from 'rest+Falluja', it is so wide (because mice and elephant) that the uncertainty, as displayed in Dr Kane's Figure 3, is too great for any definite inference to be drawn. Yet, if Dr Kane had applied the same analyses to 'the rest' (as Roberts et al. did), then his p8 distribution (last line) would read N(2.9, 1.35^2), and, as reported by Roberts et al., would show statistically significant excess mortality in the non-Falluja clusters (cf 95% CI for RR of 1.1-2.3) . . . to which Falluja can only add.
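
The significance claim for 'the rest' can be checked with a few lines. This is a rough sketch that assumes N(2.9, 1.35^2) describes the post-minus-pre difference in CMR with Falluja excluded, and that a two-sided 95% normal interval is the relevant yardstick; neither assumption is spelled out in the report itself, and the variable names are illustrative.

```python
from statistics import NormalDist

# Assumed distribution of CMRpost - CMRpre outside Falluja: N(2.9, 1.35^2).
diff = NormalDist(mu=2.9, sigma=1.35)

# Two-sided 95% normal interval for the difference.
z = NormalDist().inv_cdf(0.975)
lower = diff.mean - z * diff.stdev
upper = diff.mean + z * diff.stdev

# Probability mass at or below zero (no increase in mortality).
p_below_zero = diff.cdf(0.0)

print(f"95% interval: ({lower:.2f}, {upper:.2f})")
print(f"Pr[difference < 0] = {p_below_zero:.3f}")
```

Under these assumptions the interval lies entirely above zero, which is consistent with the reviewer's statement.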

Neither referees nor Roberts et al (2004) nor Lancet were in any doubt about the criticality of the second reported CI in Findings.

Since 2004, the same research team has conducted a follow-up analysis with more clusters. Despite the longer recall interval, there is good central agreement for the 14 month period post March 2003 with the earlier findings by Roberts et al.

SEE ALSO: article in the June 2007 issue of Significance by Scott Zeger, who gave statistical advice to Roberts et al.

APPENDIX

A.1 'no statistical test comparing pre and post conflict CMRs; CIs overlap for CMRs (did compare RR, however, which was significantly different)';

A.2 'unclear why one should remove Falluja cluster if authors felt methodology done correctly. As mentioned above, should have sampled more clusters and fewer persons per cluster to improve methodology and avoid this issue', and related later comment;

A.3 'need to compare violent deaths in various clusters with external events that occurred in Iraq to help validate and interpret data'.

B. no mentions.

C.1 'The main weakness of the paper lies in the extrapolation from the CMR to a total number of deaths for the whole country. Cluster surveys still need to be validated as a tool to estimate mortality, and design effects are known to be high due to the clustering of violent deaths. Extrapolating the results to the whole country cannot be done without great caution';

C.2 'Excluding the Falluja data narrows down the mortality rate confidence intervals, but what happened in Falluja is not an exception in Iraq and should be taken into account. On the other hand, keeping the data in gives such an important design effect that CI are too wide to make any conclusion. Similarly for the increased relative risk and the extrapolated number of deaths: excluding Falluja gives CIs on the limit of significance';

C.3 'The authors (correctly) discuss the particularity of Falluja extensively, but do not take this into account when extrapolating the figures for the whole country. The quarter of million deaths are presented as a fact';

C.4 'The results of this survey clearly illustrate the limitations of cluster sampling as a method for estimating mortality, as is explained in $9 of the Limitations section'.

D.1 'There is one important statistical issue related to the cluster sampling that I do not think they have addressed adequately, and this concerns the precision of their estimate of excess mortality. To obtain this estimate, they first obtain the mortality rate ratio for the post-invasion versus pre-invasion periods for the sampled households, together with a 95% CI for this rate ratio . . . valid point estimate of excess mortality . . . mortality rate ratio takes no account of the variation in the period effect BETWEEN clusters. This CI will validly estimate the precision of the rate ratio WITHIN THE SELECTED CLUSTERS . . . the remainder of this report is an excellent tutorial/explanatory note on this precise issue . . . and ends as follows:

D.2 'I would suggest that either the analysis is repeated so that the excess mortality estimates have confidence limits in which correct allowance is made for the cluster sample design. Or this limitation (the use of a 'fixed effects' model) should be explicitly mentioned in the discussion. The emphasis would then be on the fact that these are VALID POINT ESTIMATES, and are very substantial, and on the striking change in causation of mortality. But there would be a clearer recognition that the excess mortality estimates at national level are very imprecise'.

E.1 Information which conveys details about the sampling scheme and its resultant selection of clusters is suggested in Table 1. Table 1 also includes the basic paired data per selected cluster: person-months of observation before & after conflict; total births before & after conflict before & after conflict. Table 2 gives additional cause-specific information about deaths before & after conflict.

E.2 by comparison, authors' Figures 1, 2 and 3 do not sufficiently respect study design and obscure actual numbers, Figure 3 seems to convey that there were no deaths [non-violent, or violent] in Falluja in the 14.5 months before conflict.

E.3 From the methods as described, I am unable to check what allowance(s) have been made (and where) for design effects in computing standard errors for paired versus unpaired estimates.

E.4 What RR for total mortality (after versus before) was survey designed to be powerful to detect [if this genuinely was a prior consideration]?

E.5 Improvements to sample scheme could/should be related to how authors have tried to deal with 'Falluja' effect . . . For the future, stratified sampling might be proposed for stratum a= a priori heavily-conflicted Governorates & stratum b= other Governorates . . . Even in the present case, this stratified approach could be explored posthoc in deciding how to multiply up from the sample results for Governorates which were/were not a priori known to be heavily-conflicted (rather than ignore Falluja & apply non-Falluja rates to all Iraq) - see also specific comments, P7, last paragraph;

E.6 UNCLEAR how/if design effect was allowed for in relative risk CIs.

The above comments were made on the penultimate draft. There was a further round of refereeing by one or more of A to E.

Thanks for sending us this critique.

Best regards

Stuart

The Lancet