Thursday, March 29, 2007

Pedersen Quote

Did Jon Pedersen really say this?


I know many, many people who have served in Iraq, worked in Iraq providing aid, contracts, or working on Iraq issues inside and outside the government over the entire course of the war. Not one of them thinks the Lancet numbers can be correct. As Jon Pedersen, the lead researcher for the UNDP household cluster survey conducted in 2004 (the only other, and much larger, study to use the Lancet method) recently stated, the Lancet numbers are "high, and probably way too high. I would accept something in the vicinity of 100,000 but 600,000 is too much"


I have seen a quote along these lines before, but can't find an authoritative citation. The link given does not work for me.

Saturday, March 17, 2007

Random Numbers

A central concern of critics is the relationship between the US authors of the paper and the Iraqi survey teams. What did the US authors tell the Iraqis to do in terms of the exact survey procedures? What did the Iraqis actually do? What did the Iraqis tell the US authors that they did? How can the US authors (and the rest of us) know for sure what the Iraqis did?

A small but revealing example of this involves the sampling procedures. The main paper reports:


Sampling followed the same approach used in 2004, except that selection of survey sites was by random numbers applied to streets or blocks rather than with global positioning units (GPS), since surveyors felt that being seen with a GPS unit could put their lives at risk.

...

As a first stage of sampling, 50 clusters were selected systematically by Governorate with a population proportional to size approach, on the basis of the 2004 UNDP/Iraqi Ministry of Planning population estimates (table 1). At the second stage of sampling, the Governorate’s constituent administrative units were listed by population or estimated population, and location(s) were selected randomly proportionate to population size. The third stage consisted of random selection of a main street within the administrative unit from a list of all main streets. A residential street was then randomly selected from a list of residential streets crossing the main street. On the residential street, houses were numbered and a start household was randomly selected.


The "Human Cost" paper reports the same.


Selection of households to be interviewed must be completely random to be sure the results are free of bias. For this survey, all households had an equal chance of being selected. A series of completely random choices were made. First the location of each of the 50 clusters was chosen according the geographic distribution of the population in Iraq. This is known as the first stage of sampling in which the governates (provinces) where the survey would be conducted were selected. This sampling process went on randomly to select the town (or section of the town), the neighborhood, and then the actual house where the survey would start. This was all done using random numbers. Once the start house was selected, an interview was conducted there and then in the next 39 nearest houses.


A perfectly sensible procedure. Random numbers are, indeed, widely used in surveys. But is this actually what happened? Consider Gilbert Burnham's recent speech at MIT. Watch from 1:07 to 1:10 in the video. Some crank (i.e., me) is trying to understand precisely what the procedure was.

According to Burnham, the team did not use random numbers! Instead, he mentioned two approaches. One was to write the names of the candidate streets on pieces of paper and then "randomly" select among them by hand. For selecting the specific house, Burnham reports:


Once they selected the streets, then they numbered the houses on that street from one to whatever the end of that street was. And then they randomly, using serial numbers on money, they randomly selected a start number and started with that house, and from that they went to the nearest front door, the nearest front door, nearest front door, nearest front door, until they had a total of 40 houses.


"[S]erial numbers on money" is not the same thing as "random numbers," as any statistician will tell you. First, there is no guarantee that the serial numbers on currency are random. Who knows if there are more 1's than 7's on Iraqi (or US) currency? Second, using serial numbers makes it much easier for an unscrupulous interviewer to cheat.

Now, I don't actually worry that this caused a major problem. Putting street names on a piece of paper and picking one out of a hat is a fairly random process. Once you have picked a street, I wouldn't think that it matters much which house you start with. Even a malicious interviewer would have trouble, I would think, knowing which house to pick, whatever answer he might "want" to get.

But note that this might be a concern. The interviewers went to a neighborhood and, before starting the survey --- before picking the start house? --- told the local kids about the survey. From those kids, one could learn which house suffered several deaths. One could check this by seeing if there is a tendency for houses near the start of the survey in a given cluster to have higher death rates than houses at the end of the survey.
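Here is a rough sketch of that check, assuming the household-level data were ever released; the file name and the columns (cluster, visit_order, deaths) are hypothetical.

    # Sketch of the start-house check, assuming (hypothetically) released data
    # with columns: cluster, visit_order, deaths.
    import pandas as pd
    from scipy.stats import spearmanr

    df = pd.read_csv("lancet2_households.csv")  # hypothetical file

    # Rank households by the order in which they were visited within each cluster.
    df["order_in_cluster"] = df.groupby("cluster")["visit_order"].rank()

    # If interviewers steered start houses toward known deaths, deaths should be
    # concentrated early in each cluster's sequence (a negative correlation here).
    rho, p = spearmanr(df["order_in_cluster"], df["deaths"])
    print("Spearman rho = %.3f, p = %.3f" % (rho, p))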

Anyway, my complaint is that this is another example of the methodology described in the article not being accurate. Don't claim to use random numbers while actually using some other process. And, if the article is incorrect in its claim about the process used to select which houses to interview, what else is it incorrect about?

The Lancet ought to publish a correction.

Friday, March 16, 2007

Timing Comments

Kevin Wagner sent in these comments on the timing issue.


There are at least three claims about the Lancet study's procedures and the feasibility of its methods to which Les Roberts has given answers that would entail he either was misinformed about some of the fundamental aspects of his survey or was willing to play fast and loose with his rhetoric to dissuade inquiry.

The first two instances come from an article by the journal Nature, a copy of which is cited here:

http://psychoanalystsopposewar.org/blog/2007/03/01/nature-on-iraq-mortality-study/

1) "Roberts and Gilbert Burnham, also at Johns Hopkins, say local people were asked to identify pockets of homes away from the centre; the Iraqi interviewer says the team never worked with locals on this issue."

To my knowledge, this has never been addressed further by Roberts. It's an outright contradiction of the authors' claim with the interviewer's. It raises the questions of how much the study's authors accurately knew about their interviewers' behavior and also potential bias introduced from a failure to follow the study's methods on the parts of the interviewers.

2) "The US authors subsequently said that each team split into two pairs, a workload that is "doable", says Paul Spiegel, an epidemiologist at the United Nations High Commission for Refugees in Geneva, who carried out similar surveys in Kosovo and Ethiopia. After being asked by Nature whether even this system allowed enough time, author Les Roberts of Johns Hopkins said that the four individuals in a team often worked independently. But an Iraqi researcher involved in the data collection, who asked not to be named because he fears that press attention could make him the target of attacks, told Nature this never happened. Roberts later said that he had been referring to the procedure used in a 2004 mortality survey carried out in Iraq with the same team (L. Roberts et al. Lancet 364, 1857–1864; 2004)."

This clarification is problematic. Apparently, in response to a follow up question on the 2006 study, Roberts replies with something relevant only to 2004, i.e. that the interviewers worked independently. They did work independently in 2004 and did not [so he, Burnham and the interviewer claim] in 2006. Moreover, it is damaging to his case for the 2006 study's feasibility. In 2004, he had six effective interviewers. In 2006, he had four two-person teams, i.e., four effective interviewers. How would responding that he effectively had two more interviewers in 2004 than in 2006 prove that his 2006 timeframe was feasible?

3) Last is Roberts' response to queries on the interview timeframe and his reply to Tim Lambert on the same.

http://www.radioopensource.org/les-roberts-weighs-in-on-lancet-controversy/

He states:

"The two main criticisms which were in both the *Nature* article and *The Times* article are completely without merit. They said there wasn't enough time to have done the interviews. We had eight interviewers working ten hour days for 49 days, they had two hours in the field to ask each household five questions. They had time."

They had eight interviewers who, he says, worked ten-hour days for 49 days. 8 interviewers x 10 hours x 49 days / 2000 houses in the sample = 1.96 hours per household.

However, as noted before, those 8 worked in two-person teams: 4 x 10 x 49 / 2000 = 0.98 hours per household.

In any case, how is the total time that was devoted to the entire project supposed to relate to the actual time they spent in the field? One to one correspondence? Every second they had for the project was spent in the field? Short of that, what does Roberts mean when he says they had 2 hours in the field per household?

http://scienceblogs.com/deltoid/2007/03/london_times_hatchet_job_on_la.php

Then Tim Lambert says he got a reply from Roberts in a post in the above blog entry:

"I asked Roberts and he says he agrees with Burnham about the interview length: 15-20 minutes."

So, again, what exactly is Roberts claiming when he says the interviewers had two hours in the field per household to ask five questions? How is this claim accurate or relevant? Interviews took 20 minutes, on the high side, according to Roberts. To what is "two hours in the field" supposed to relate? Did the interviewers really have an additional hour and 40 minutes per household in the field as his claim implies?


I am not sure how much I worry about this issue, especially since Burnham has told us that, on occasion, individual interviewers would work alone (although we have no idea how often that occurred). It is further evidence that the US authors have trouble knowing exactly what the interviewers did. Related comments here.
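For what it is worth, Kevin's back-of-the-envelope timing arithmetic is easy to reproduce. This uses his round numbers (2,000 households rather than the actual 1,849) and says nothing about how the time was actually spent.

    # Field hours per household under the two staffing assumptions in Kevin's comment.
    interviewer_hours = 10 * 49   # ten-hour days over a 49-day field period
    households = 2000             # round number; the actual sample was 1,849

    for label, units in [("8 interviewers working independently", 8),
                         ("4 two-person teams", 4)]:
        print("%s: %.2f hours per household"
              % (label, units * interviewer_hours / households))
    # Neither figure separates time at the door from travel, street listing, and
    # other field work, which is Kevin's underlying question.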

Burnham Presentation: Notes and Comments

I attended Gilbert Burnham's recent presentation at MIT. Tim Lambert provides a link to the video. I have not reviewed the video, but here are some notes that I made during the talk. (These might be incorrect; double-check the video.)



  1. The work of the authors was split up about as you might expect. Burnham organized the project; Lafta ran the survey; Doocy did the statistics; Roberts wrote the article. None of the authors besides Lafta was ever in Iraq.


  2. He mentioned something about the confidence intervals of the first survey going to infinity if the Fallujah data were included. I am not sure if I understood this point correctly. Recall that everyone assumes that the results of Lancet I would have been even stronger if the Fallujah data were included. Thus, excluding Fallujah was "conservative." But perhaps this only applies to the mean of the distribution. It could be (?) that including an outlier widens the confidence intervals so much that one can no longer reject the null hypothesis of no increase in excess mortality. I think that I or someone else made this point during the discussion 2 years ago, but I am too lazy to look for a citation. If this is correct, then it strikes me as something that the readers of Lancet I deserved to know. I should come back to this point later. My best guess now is that I simply misheard Burnham and that he was just talking about the mean estimate going very high, i.e., to infinity.


  3. Burnham mentioned that he wanted to oversample Fallujah if it were chosen again. And, mirabile dictu, it was chosen. But isn't that a little suspicious? Again, my basic hypothesis is not so much that the surveyors were malicious as that they gave the Americans what they thought the Americans wanted. So, they knew that Burnham wanted to check out Fallujah again, so they ensured (?) that Fallujah was "randomly" selected for Lancet II. (Does this mean that all the sampling was done by the Iraqis?) Also, if Fallujah were purposely oversampled, then one would need to adjust the estimates for this fact. Not easy to do! Did Doocy do this? How? Given that Burnham chose Fallujah to be oversampled because he knew it was much more violent than other parts of Iraq, one would need to do some adjustment. I guess that Doocy could use the extra samples just to estimate Anbar more accurately and then combine the Anbar estimate with the rest of the country. But I don't recall any discussion of this in the paper.


  4. The authors wanted to get the results of the survey published well before the election in order to avoid the controversy which plagued Lancet I, but they weren't able to do so because of fund-raising, ethical reviews, and the like. This makes little sense to me. Raising money and dealing with IRBs is tough and time-consuming, but the survey work was complete by July 10, 2006. So, by that date, all (?) the issues of money and ethics were done. Lancet I went from survey completion to publication in one month. At the same rate, Lancet II could have been published in August. Now, one month is a very quick time from survey to publication, so no one would expect such a result. Indeed, getting to press within 3 months is still unusual. But I still suspect that Lancet editor Richard Horton likes October publication dates in even years for these articles.


  5. All the interviewers were from (the same, I think) community medicine center in Baghdad.


  6. Tim Lambert comments: "The IBC made vociferous attacks on the studies because they want to defend their methods, and Les Roberts suggests that IBC are trying to stop the donations from drying up." I thought that this was a low blow, and an unusual comment in an academic seminar. But, it could also be true. Burnham made clear that this was Roberts's opinion, not his.


  7. There was useful background on the conduct of the survey. Once the team had picked the street (house?) to start at, it would tell all the children in the neighborhood (who had gathered to see the strangers in their white coats) what the survey was about. The children would spread the word around the neighborhood, alleviating suspicion and making everyone comfortable.


  8. Tim Lambert comments: "They will soon release the data (with identifying material removed) to other researchers." I hope so! There still seemed (to me) to be some hedging on this, that only "qualified" researchers would be allowed to see the data, that they (or their institutions) would be expected to testify somehow to data security. With luck, this won't be a problem, but I have my doubts.


  9. Burnham claimed that the reason the interviewers asked for death certificates in only 87% of the reported deaths was that they "forgot." This strikes me as an implausible (but testable) claim. Some blogospherians had speculated that the 13% of cases where this wasn't done reflected deliberate choices made to avoid danger or trauma. Could you ask for a death certificate with a grief-stricken mother wailing on the front step? The claim is testable because we should see a pattern in the cases where no certificate was asked for: the lapses should be concentrated in one or two teams and should occur early in the fieldwork rather than later. Checking this is an example of the analysis we will be able to do once the data are available (a sketch of this check appears after these notes). Related to this is the issue of the survey form itself. Has it been made publicly available? It should be! Also, any competently designed form would feature a check box for this (and every other) question. If it is on the form, then how could one forget 13% of the time?


  10. One of the biggest surprises (to me) was Burnham's admission that the teams operated independently at times. I think that the drill was that a team of 4 would go to a cluster, pick a main street, then a side street, then a house. Two sub-teams of two would start knocking on doors. So each sub-team would have 20 houses to finish in the cluster. But, Burnham said, these two-person teams would sometimes (how often?) operate independently. In other words, as one person "finished up" an interview in one house, her partner would leave and start on the next house himself. This procedure helps to alleviate concerns that there wasn't "enough time" to do the surveys, but it does mean that if only one of the eight interviewers were malicious, she would have the opportunity to make her results whatever she wanted. This is another reason for looking closely at the raw data. Mortality rates should be similar for all four members of the one team. (The other team, operating in different clusters, might have different results.)


  11. He mentioned the difficulty of getting good population estimates for Fallujah. He cited 500,000 as a pre-war figure and 200,000 after all the US military activity.


  12. The slides make clear that the main-street bias (MSB) issue is less serious than one might suppose (and that the methodology write-up in the paper is misleading --- unintentionally, I think). But the whole issue is quite complex and I hope to return to it later.




Again, these are from my messy notes. Check the video for what Burnham actually said.
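As a follow-up to note 9, here is a sketch of the death-certificate check, assuming the raw data are eventually released; the file name and the columns (team, interview_date, certificate_asked) are hypothetical.

    # Sketch of the check from note 9: do the cases where no certificate was
    # requested cluster on particular teams or early in the fieldwork?
    import pandas as pd

    deaths = pd.read_csv("lancet2_deaths.csv")  # hypothetical file

    # Share of reported deaths where a certificate was requested, by team.
    print(deaths.groupby("team")["certificate_asked"].mean())

    # The same share by week of fieldwork; "forgetting" suggests early lapses.
    deaths["week"] = pd.to_datetime(deaths["interview_date"]).dt.isocalendar().week
    print(deaths.groupby("week")["certificate_asked"].mean())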

Thursday, March 15, 2007

SSS Removal

Thanks to a commenter who mentioned this link documenting the original removal of my SSS post.


18 October 2006
REMOVED: A case for fraud?

Amy Perfors

David Kane's most recent guest post about the Lancet study has been removed. Since this is not a normal practice for us, explanations for why (and why we posted it in the first place) are below the fold.

Why remove it? The tone is unacceptable, the facts are shoddy, and the ideas are not endorsed by myself, the other authors on the sidebar, or the Harvard IQSS.

Why post it in the first place, given this? Here I admit to an error in judgment on my part. I see my job as head of the Author's Committee as doing the somewhat mundane and boring tasks of coordinating and inspiring our posters, not exercising editorial control. I was uncomfortable with the post even before putting it up, but I also hate censorship, and -- since I don't know this field or this study very well -- I couldn't say with complete confidence that my discomfort was totally justified. I decided to err on the side of expressing something I was uncomfortable with, rather than stifling it. Again, that was probably an error with regards to this post, and I apologize. It was not up to the standards we aspire to here, and does not reflect our views.

Posted by Amy Perfors at October 18, 2006 03:45 PM


I think that there were comments to this post (before it was, in turn, removed) but don't know if they are preserved anywhere. Note that my original title included a question mark. I do not know whether the Lancet results are correct, but if there is fraud, it is most likely to be found in the (lack of) quality of the raw data, not in any arcana of cluster calculations. The statistics here are fine.

Tuesday, March 13, 2007

Child Mortality

Comment threads at Crooked Timber are often interesting. Daniel Davies claims:


The thing is, that there is a “marker” in the Times article – as in, a statement that is not true and that is obviously not true to anyone who has read the article. It is in the following paragraph:

Dr Richard Garfield, an American academic who had collaborated with the authors on an earlier study, declined to join this one because he did not think that the risk to the interviewers was justifiable. Together with Professor Hans Rosling and Dr Johan Von Schreeb at the Karolinska Institute in Stockholm, Dr Garfield wrote to The Lancet to insist there must be a “substantial reporting error” because Burnham et al suggest that child deaths had dropped by two thirds since the invasion. The idea that war prevents children dying, Dr Garfield implies, points to something amiss.


This is not true. As table 2 of the study shows, infant mortality remained constant in the survey (when you adjust for the greater number of months in the post-war recall period) while child deaths increased substantially. They did not drop by two thirds, or indeed drop at all. Von Schreeb, Rosling and Garfield did not say they dropped either (presumably because they have read the survey). They said that the crude estimate of under-15 mortality was substantially lower than other estimates of under-5 mortality in Iraq, and that this implied that there may have been substantial under-reporting of child deaths. They then suggested that this reporting error might lead to additional uncertainty in the estimates of roughly the same size as the sampling error – +/- 30%. Note that, for bonus hack points, the “plus” sign in “+/- 30%” is not ornamental, and to treat Von Schreeb et al as providing evidence that the study was an overestimate is Kaplan’s Fallacy. This is my reason for believing that Anjana Ahuja didn’t read the research; it’s an error that could easily have been made in transcribing notes of a half-understood conversation but couldn’t have been made at all if you read the articles.


"ragout" writes:


On the question of whether the Times article is a “bad piece of science journalism,” I much prefer the Times’ version to Daniel’s. Specifically, Daniel summarizes Garfield and other critics as saying “that the crude estimate of under-15 mortality was substantially lower than other estimates of under-5 mortality in Iraq.”

But Daniel’s version is misleading. The critics were not quoting mortality rates as such, which would be deaths of the under-15 per kid under-15. If the critics had really compared under-15 mortality to under-5 mortality, as Daniel says, the critics would indeed be foolish.

But since the critics are prominent scientists, they certainly did not do anything so foolish. Instead, they compared under-15 deaths per birth in the Lancet study to under-5 deaths per birth in another study. The critics’ rely on the fact that, as a matter of logic, if there are X deaths of kids under 15, there must be less than X deaths of kids under 5.

The Lancet study has 36 under-15 deaths per 1000 births, and another pre-war study has 100 or so under-5 deaths per 1000 births. It follows that the Lancet study found an under-5 death rate less than 1/3 of the pre-war study. This is exactly what the Times article says, and Daniel obscures.

Second, Daniel claims that “infant mortality remained constant in the [Lancet] survey.” But as far as I can see there is no data in the paper to calculate pre and post war infant mortality. The paper just reports total births, not pre and post war births. Daniel, without telling the reader, is implicitly assuming that the birth rate remained constant (which hardly seems consistent with the drastic increase in violence).


I don't have the energy to dive into this one right now. Ragout seems to get the better of it, but Davies is almost always correct (in my experience) in this sort of analysis. First pass, it seems like Davies is correct to criticize the wording of the original news article but that ragout (and Garfield?) are correct on the substance. Is this point worth exploring further? Perhaps.
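For reference, the arithmetic behind ragout's comparison is simple; the figures are the rough ones quoted in his comment, not my own estimates.

    # Comparing the quoted child-death figures on a deaths-per-birth basis.
    lancet_under15_per_1000_births = 36    # figure quoted by ragout
    prewar_under5_per_1000_births = 100    # rough pre-war benchmark he cites

    # Every under-5 death is also an under-15 death, so the survey's implied
    # under-5 rate can be no higher than its under-15 rate.
    ratio = lancet_under15_per_1000_births / prewar_under5_per_1000_births
    print("implied under-5 rate is at most %.0f%% of the pre-war benchmark"
          % (100 * ratio))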

If the Lancet II survey team "made up" large portions of the data, then we would expect to find all sorts of anomalies like this. It is very hard (I would guess!) to make up data that "hangs together" and is consistent with other known information. Or could the difference be the fault of the previous survey? Or could it be due to chance?

Monday, March 12, 2007

ILCS Response Rates

A common response to concerns about the high response rate achieved in Lancet II is to note that the Iraq Living Conditions Survey 2004 (ILCS) reported a response rate of 98.5%. The technical sophistication and competence of ILCS has been universally praised.


In each governorate, 1,100 households were selected for interview, with the exception of Baghdad, where 3,300 households were selected. The sample thus consisted of 22,000 households. Of these, 21,668 were actually interviewed.


In fact, the ILCS response rate (98.5%) is higher than that for Lancet II (98.3%). So, what is the problem?

The issue is that ILCS was conducted in a much more thorough fashion than Lancet II.


COSIT staff were extensively trained in implementing the survey tool by researchers from the Fafo AIS. The first round of training took place in Amman, Jordan during the first three weeks of February 2003. Core staff from COSIT’s offices in each governorate were present, in addition to administrative staff from the headquarters in Baghdad. Training of local staff was subsequently conducted at six locations within Iraq during the first two weeks of March 2003 by COSIT’s core staff under supervision from Fafo.

Fieldwork started on March 22, 2004, and was completed by May 25, 2004. Data collection in the Governorates of Erbil and Dahouk were implemented and completed in August 2004.

After each selected PSU had been mapped and listed, interviewers were sent to the 10 selected households. Interviewers were organized in teams of five, with individual supervisors who continuously provided guidance and checked the quality of all incoming interviews. When necessary, interviewers were sent back to the households to reconfirm information. Furthermore, supervisors from COSIT’s headquarters in Baghdad and Fafo staff also visited the interviewer teams.

Upon completion of the interviews, the information was sent to the governorate office for registration and inspection, then to the Baghdad main office for coding and data entry. During the data entry process, extensive quality control was implemented, and questionnaires were sent back to the field for re-interviewing or update both by COSIT’s Baghdad office and by Fafo headquarters in Oslo.

Completed data files were continuously sent to Fafo’s headquarters in Oslo, Norway, where further quality checks were implemented. In instances where problems arose, direct communication was made with COSIT. Several times during the fieldwork, COSIT arranged meetings with its offices’ heads in order to inform them of problems that had surfaced and resolve them.


In other words, ILCS interviewers went back to the sample households again and again and again. This is quite different from the procedure in Lancet II. In that case, a cluster was visited on just one day. In fact, it appears that houses were just checked one time. What good fortune that there was almost always someone (head of house or spouse) at home!

What we need, obviously, is more information about the initial ILCS samples. How many households had someone at home on the first visit? How many immediately agreed to participate? Only this level of detail will tell us if the final ILCS response rate is relevant in evaluating the reliability of the Lancet II sample.

Kurdish Speakers?

One aspect of the debate that has always confused me is the issue of Kurdish speakers. My (correct?) understanding is that Iraq has a large Kurdish population which does not speak Arabic and/or is unlikely to participate in a survey conducted by a non-Kurdish speaker. References welcome! Lancet II reports that all the interviewers spoke English and Arabic. It does not mention Kurdish. Does this mean that none of the interviewers spoke Kurdish? That is what I would expect. Does that matter?

Consider the trouble that the ILCS went through to translate its survey into both Arabic and Kurdish.


The questionnaires reflect the nature of the survey. Two questionnaires were used: one general questionnaire for each household, for which the household head or a member of the household with knowledge of all members was the respondent; and one targeted questionnaire used to interview women of the household aged between 15 and 54 years. The first questionnaire dealt with housing and infrastructure, household economy, basic demography, and the education, health, and labour force characteristics of the household members; the second focused on the women’s reproductive history and children’s health.

Three versions of the questionnaires—one Arabic and two Kurdish—were used in the field. Although the questionnaires were developed in English, they were translated twice—once into Arabic or Kurdish, then back again into English—in order to verify the translations and ensure that all members of the survey team had a common understanding.

Compared to many surveys, the questionnaires were quite long, with a median interviewing time of 83 minutes. Fifty percent of the interviews lasted between 60 and 105 minutes.


Either ILCS wasted a lot of time and money unnecessarily translating the survey into Kurdish, or there is no way that Lancet II got a 98% response rate without using Kurdish speakers. Or is there a third possibility?

Data Validity

This article on Lancet II and the news reactions thereto is interesting.


As a biostatistician, the Bloomberg School's Zeger has thought a lot about the study. "I am so impressed by Gil because he was able to conduct a scientific survey on a shoestring budget under very difficult circumstances," he says. He does not dismiss all concerns about the methodology. "It was the best science that could be done under the circumstances. We're always making decisions absent scientific-quality data — that's public health practice." But he draws an important distinction between practice and science. "We tend to have a different standard for scientific research. This study was on the research end. It was published in a scientific journal. There are a lot of aspects that are below the reporting standards you would have if you were doing a U.S. clinical trial, for example: the documentation for each case, the ability to reproduce the results, detailed information about how everything was done. I think it would be useful for the school and the public health community to think through these kinds of issues.

"[But] it's absolutely appropriate, on very limited resources, to go into a place like Iraq and make an estimate of excess mortality to use in planning and making decisions. My own sense is I would rather err on the side of generating potentially useful data, with all of the caveats. I think noisy data is better than no data." Zeger notes that the tests of the data's validity, built into the second survey at his recommendation, all checked out. He admits the numbers are hard to grasp, especially the study's estimate that from June 2005 to June 2006, Iraqis were dying at a rate of 1,000 per day. "That's a lot of bodies," he says. "I have a hard time getting my mind around that. But as a scientist, what do you do? That's the number."


I certainly agree that noisy data is better than no data, but only if we have access to all the details of where that noisy data comes from. What "validity" checks is Zeger talking about?

Links Related to Moore

Here are some links related to the unimpressive Steven E. Moore Wall Street Journal op-ed.

Stats.org
Tim Lambert

The cluster comments made by Moore seem wrong (as demonstrated by Lambert and others). But this point, at least, is not unreasonable.


With so few cluster points, it is highly unlikely the Johns Hopkins survey is representative of the population in Iraq. However, there is a definitive method of establishing if it is. Recording the gender, age, education and other demographic characteristics of the respondents allows a researcher to compare his survey results to a known demographic instrument, such as a census.

Dr. Roberts said that his team's surveyors did not ask demographic questions. I was so surprised to hear this that I emailed him later in the day to ask a second time if his team asked demographic questions and compared the results to the 1997 Iraqi census. Dr. Roberts replied that he had not even looked at the Iraqi census.

And so, while the gender and the age of the deceased were recorded in the 2006 Johns Hopkins study, nobody, according to Dr. Roberts, recorded demographic information for the living survey respondents. This would be the first survey I have looked at in my 15 years of looking that did not ask demographic questions of its respondents. But don't take my word for it --- try using Google to find a survey that does not ask demographic questions.


I don't know of a survey that doesn't ask demographic questions, but I haven't really looked. Why would demographics be useful? First, it is (perhaps) a check against randomization mistakes and/or fraudulent data. If the demographics of the survey don't match the demographics of the population, then something is wrong. (But if the match isn't too far off, then the concern isn't that great.) Second, you could use demographics to adjust the survey results. If, say, Sunnis made up 50% of the survey but are, we think, at most 25% of the population, then you would want to adjust the results accordingly.
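To make the adjustment idea concrete, here is a toy post-stratification sketch built on the hypothetical 50%-versus-25% example above; the group names, shares, and rates are all made up.

    # Toy post-stratification: down-weight a group that is over-represented in
    # the sample relative to its (assumed) population share.
    sample_share = {"group_a": 0.50, "group_b": 0.50}      # shares in the survey
    population_share = {"group_a": 0.25, "group_b": 0.75}  # assumed true shares

    weights = {g: population_share[g] / sample_share[g] for g in sample_share}
    print("weights:", weights)   # group_a down-weighted to 0.5, group_b up to 1.5

    group_rates = {"group_a": 0.10, "group_b": 0.02}       # hypothetical rates
    adjusted = sum(population_share[g] * group_rates[g] for g in group_rates)
    print("adjusted rate: %.3f" % adjusted)                 # 0.040 versus a raw 0.060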

I wonder why the survey form did not include demographics. Unlike Moore, I doubt that the reason is nefarious. But it is endlessly annoying that the Lancet authors refuse to even release the questions or actual survey forms.

UPDATE: Lambert points out that plenty of demographic data (age and gender of household residents) was collected. Moore's response:


Despite Les' eloquent response, he has yet to reveal any comparison of demographic information for the 2006 survey to the 1997 Iraqi census, the 2003 update to that census, the 2004 UNDP/ILCS survey or any other demographic instrument.


Seems like Roberts/Lambert is correct on this one. It still drives me nuts that we don't have access to the raw data.

IRI Response Rate

It appears that Brookings' excellent Iraq Index project may allow me to track every nationwide poll conducted in Iraq in 2006. If so, I could collect the response rates for each and then see whether or not Lancet II is an outlier.

One example is this poll by the International Republican Institute from March 2006. The accompanying Powerpoint slides report a response rate of 93% (2,804 out of 3,000). This is higher than the rates we have seen from WPO but lower than that of Lancet II. I will contact IRI to get more details.

More Plausible Response Rates

As a follow up to this post, consider this January 2006 poll conducted for World Public Opinion.org. Methodological details (pdf):


The survey was designed and analyzed by the Program on International Policy Attitudes for WorldPublicOpinion.org. Field work was conducted through D3 Systems and its partner KA Research in Iraq. Face-to-face interviews were conducted among a national random sample of 1,000 Iraqi adults 18 years and older. An over sample of 150 Iraqi Sunni Arabs from predominantly Sunni Arab provinces (Anbar, Diyalah and Salah Al-Din) was carried out to provide additional precision with this group. The total sample thus was 1,150 Iraqi adults. The data were weighted to the following targets (Shia Arab, 55%, Sunni Arab 22%, Kurd 18%, other 5%) in order to properly represent the Iraqi ethnic/religious communities.

The sample design was a multi-stage area probability sample conducted in all 18 Iraqi provinces including Baghdad. Urban and rural areas were proportionally represented. A total of 5 sampling points (4 urban and 1 rural) of the 116 employed were replaced for security reasons with substitutes in the same province and urban/rural classification. Among all the cases drawn into the sample, a 94% contact rate and 74% completion rate were achieved.


The contact and completion rates are almost identical to those of the November 2006 survey. Again, how can these rates be so much lower than those for Lancet II? Now, it could still be that the surveyors employed by WPO (D3 and KA) are not actually surveying people but are just filling out the forms themselves. Anyone doing this is well advised to report plausible contact/completion rates rather than 100%, even though high participation is what the client "wants." Implausibly high contact/completion rates are the first thing that a wise client checks as a guard against fraud.

But if we assume that D3 and KA have done a proper job --- in nationwide surveys which bracket Lancet II --- it becomes hard to understand what magic the Lancet survey teams (composed of physicians who, I think, do no other survey work outside of the two Lancet articles) performed in order to produce such high rates. How are the Lancet survey teams able to reach 99% of the intended households while D3/KA can only contact 94%? How are the Lancet teams able to convince virtually everyone they meet (a completion rate over 99%) to finish the survey, while D3/KA can't persuade more than 3/4 of the people they meet to finish? Are D3/KA survey teams rude or scary or socially awkward? Are the Lancet teams more friendly or engaging or persuasive?

Perhaps. Yet I think that one survey team is not telling the whole truth and has tried mightily to give its bosses what the bosses want to hear.

Plausible Response Rates

To the extent that I (and others) find the 98.3% response rate for Lancet II completely implausible, it behooves us to highlight similar surveys with more believable response rates. This report from World Public Opinion.org (WPO) provides such an example, especially appealing because it featured a nationwide sample and was conducted in September 2006, just a few months after the fieldwork for Lancet II. The headline, "Baghdad Shias Believe Killings May Increase Once U.S.-led Forces Depart but Large Majorities Still Support Withdrawal Within a Year," indicates that this is hardly the work of crazed Neocons.

What was the response rate for this poll? Details here (pdf):


The survey was designed and analyzed by the Program on International Policy Attitudes for WorldPublicOpinion.org. Field work was conducted through D3 Systems and its partner KA Research in Iraq. Face-to-face interviews were conducted among a national random sample of 1,000 Iraqi adults 18 years and older. An over sample of 150 Iraqi Sunni Arabs from predominantly Sunni Arab provinces (Anbar, Diyalah and Salah Al-Din) was carried out to provide additional precision with this group. The total sample thus was 1,150 Iraqi adults. The data were weighted to the following targets (Shia Arab, 55%, Sunni Arab 22%, Kurd 18%, other 5%) in order to properly represent the Iraqi ethnic/religious communities.

The sample design was a multi-stage area probability sample conducted in all 18 Iraqi provinces including Baghdad. Urban and rural areas were proportionally represented. Only one rural sampling point of the 115 employed were replaced for security reasons with substitutes in the same province and urban/rural classification. Among all the cases drawn into the sample, a 93% contact rate and 72% completion rate were achieved.


Since response rate is contact rate times completion (or participation) rate, the total response rate for this survey is 67%. (It could be that the completion rate here is actually from the whole sample, in other words, the 72% figure is the one we should use.) How can it be that Lancet II could achieve such a dramatically higher response rate? Note that both aspects of the overall response rate were lower for this survey. WPO had more trouble finding the individuals which it wanted to survey (99% versus 93%) and, having found those individuals, it had much more trouble convincing them to participate in the survey (99% versus either 72% or, if completion includes the contact problems, 77%).
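The composition of the two rates is trivial, but to keep the comparison straight (using the WPO figures quoted above):

    # Overall response rate = contact rate x completion rate (WPO figures above).
    contact_rate = 0.93
    completion_rate = 0.72   # of those contacted, on the reading used here
    print("overall response rate: %.0f%%" % (100 * contact_rate * completion_rate))
    # Prints 67%, against roughly 98% for Lancet II.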

If Lancet II had reported contact and participation rates more like those of the WPO, I would be much less suspicious of their results.

Sunday, March 11, 2007

More Response Rate Details

As a follow up to this post on response rates, I want to dive into the details of one of the polls that Kieran Healy cites because it helps to highlight just how implausible a 98.3% response rate for Lancet II is. Healy, among other examples, cites a Gallup poll (pdf) with a 97% response rate. He implies that, if Gallup (regularly?) gets a 97% response rate, then a 98% rate for Lancet II is perfectly plausible.

Here (from page 11) are all the details we have on this 97% rate.


Face-to-face interviews were conducted among 1,178 adults who resided in urban areas within the governorate of Baghdad. Interviews were carried out between August 28 and September 4. The response rate was 97 percent; 3 percent of those selected refused to participate in the study.

A probability-based sample was drawn utilizing 1997 census data. Census districts were utilized as primary sampling units (PSUs). A total of 122 PSUs were chosen using probability-proportional-to-size methods. About 10 interviews, one per household, were conducted at each location. Interviewers were given all relevant address details for each PSU. Within each selected household, respondents were selected using the Kish method.

For the results based on this sample, one can say with 95% confidence that the margin of error is approximately ± 2.7%.


Now, it would be nice to have more details on this survey, and, in time, I hope to find someone with lots of experience conducting polls in Iraq specifically and in third-world countries in general. For now, there are several issues to keep in mind.

Recall that non-response generally falls into two categories: failure to contact and, given that someone has been contacted, a refusal to participate.

Yet Gallup only provides us information about the response rate for people who were not absent. In other words, we have no information on the contact rate. We just know the participation rate. (It could be that Gallup is being sloppy and that this 3% who "refused to participate in the study" includes those who were not home.)

So the correct comparison of this 97% rate is not with the 98.3% response rate for Lancet II but the 99.2% participation rate.

Again, the difference between 97% and 99.2% may not seem large. And it isn't.

First, I just want to point out that the Lancet II response rates are higher than those of any other survey (with one possible exception to be addressed later). Second, one could just as easily say that the refusal rate for this Gallup poll is more than three times higher than the refusal rate for Lancet II. Why would that be? Why would households be so much less willing to participate in this Gallup poll than in Lancet II? Third, the closer one gets to 100% participation, the more difficult it is to make progress. There is a much larger difference between 97% and 99% participation rates than between 60% and 62% rates because the marginal 2% increase becomes much harder to achieve the closer you get to 100%. Fourth, certain aspects of Lancet II should make it harder to achieve a high participation rate. For example, this Gallup poll did not require the presence of the head of the household or spouse. Any resident adult would do.

Until we know more details about opinion polling in Iraq, it is tough to know what to do with very high response rates, rates much higher than anything we see in the US. But it is still a mystery why the rates for Lancet II should be so much higher than any other survey.

Iraqanalysis

Iraqanalysis.org seems to be a useful site. I especially liked their thoughts on survey response rates. See the link for details.

Terminology

In looking hard at the response rate for Lancet II, it is helpful to use some terminology. Recall (from page 4):


[A] final sample of 1849 households in 47 randomly selected clusters. In 16 (0·9%) dwellings, residents were absent; 15 (0·8%) households refused to participate.


I am unable to find web-based standard definitions for survey terms (suggestions welcome). So, I define the contact rate as the percentage of households (of those that the interviewers attempted to contact) in which residents were not absent. In this survey, the contact rate was 1833/1849, or 99.1%. The participation rate is the percentage of households (of those contacted) which agree to participate in the survey. In this case, it is 1818/1833, or 99.2%. The response rate is then the number of participating households divided by the number of households at which contact was attempted, or 1818/1849 = 98.3%.
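Using these definitions, the three rates follow directly from the household counts quoted above; this is just a quick check of the arithmetic.

    # Lancet II contact, participation, and response rates from the quoted counts.
    attempted = 1849
    absent = 16
    refused = 15

    contacted = attempted - absent       # 1833
    participated = contacted - refused   # 1818

    print("contact rate:       %.1f%%" % (100 * contacted / attempted))    # 99.1%
    print("participation rate: %.1f%%" % (100 * participated / contacted)) # 99.2%
    print("response rate:      %.1f%%" % (100 * participated / attempted)) # 98.3%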

One confusing aspect is that the Lancet authors do not explain how they handled the head of household issue. On page 2 we have "The survey purpose was explained to the head of household or spouse, and oral consent was obtained." This implies that either the head of household or the spouse was the source of the information. This makes sense since such people are likely to have knowledge of the inhabitants over the last couple of years. But how often were the household head and spouse both absent? It would seem unlikely that this never happened. Did the interviewers code this as "residents" being "absent" and include these in those 16 cases? Or did they just get information from whatever adult was present?

I do not think that this is a critical issue, but there is value in getting all the details correct.

Saturday, March 10, 2007

Response Rate

Imagine that we conduct 100 independent polls, each of 10,000 people, in the US. The polls are all supposed to use the exact same procedure. Will all the response rates be exactly the same?

Of course not. One poll might have a 60% response rate and another a 64% response rate, not because of fraud or malfeasance but just because of random chance. In fact, we would expect there to be a distribution of response rates, perhaps centered around 60% with a 2% standard deviation. Although the poll with a 64% response rate sits above the vast majority of the other polls, at least one poll out of the 100 has to be the highest, just as one will be the lowest. An extreme result is no proof of fraud. This is all the more so since there is no human way to ensure that all 100 polls use the exact same procedure. At the very least, different individuals will be conducting the polls, or the same individuals will be conducting the polls on different days.

But what if the results of 99 of the polls produce a nice normal distribution centered on 60% with a 2% standard deviation but the 100th poll features a 99% response rate? What would be a reasonable conclusion?

First, this could just be random. Perhaps poll response rates are fat-tailed, and so extreme results are to be expected. Second, this could just be an honest mistake. Perhaps the interviewers in the 100th poll mismarked the forms. Perhaps the forms were marked correctly but there was an error in the automatic reader. Third, this excessively high response rate might be evidence of fraud, might indicate that the interviewers for this poll did not bother to interview anyone and just filled out the forms themselves. Without more information, it is hard to know which of these three explanations is correct or if there is something else going on.

Readers can judge for themselves, but if anyone reports a 99% response rate for a US poll, I think that the second and third explanations (honest mistake or fraud) are the most likely. I can find no evidence that poll response rates are fat-tailed. For determining the usefulness of the poll results, it doesn't really matter whether the problem is a mistake or a fraud. In either case, the results of the poll are not reliable.
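A quick simulation makes the point, under the assumed normal distribution of response rates (mean 60%, standard deviation 2%); these are the numbers from the thought experiment above, not from any real set of polls.

    # How often does a batch of 100 polls with N(60%, 2%) response rates contain
    # one at 99% or higher?
    import random

    random.seed(0)
    trials, hits = 100_000, 0
    for _ in range(trials):
        if max(random.gauss(0.60, 0.02) for _ in range(100)) >= 0.99:
            hits += 1
    print("share of simulated batches with any rate >= 99%%: %g" % (hits / trials))
    # Effectively zero: 99% is roughly 20 standard deviations above the mean.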

It should be obvious how this theoretical concern relates to Lancet II. I argue that the 99% response rate is ludicrously high, way higher than the rate for almost all polls on almost all subjects in almost all countries in the world. Kieran Healy takes me to task and writes:


Kane says, “I can not find a single example of a survey with a 99%+ response rates in a large sample for any survey topic in any country ever.” I googled around a bit looking for information on previous Iraqi polls and their response rates. It took about two minutes. Here is the methodological statement for a poll conducted by Oxford Research International for ABC News (and others, including Time and the BBC) in November of 2005. The report says, “The survey had a contact rate of 98 percent and a cooperation rate of 84 percent for a total response rate of 82 percent.” Here is one from the International Republican Institute, done in July. The PowerPoint slides for that one say that “A total sample of 2,849 valid interviews were obtained from a total sample of 3,120 rendering a response rate of 91 percent.” And here is a report put out in 2003 by the former Coalition Provisional Authority, summarizing surveys conducted by the Office of Research and Gallup. In the former, “The overall response rate was 89 percent, ranging from 93% in Baghdad to 100% in Suleymania and Erbil.” In the latter, “Face-to-face interviews were conducted among 1,178 adults who resided in urban areas within the governorate of Baghdad … The response rate was 97 percent.” So much for Iraqi surveys with extraordinary response rates being hard to find.


See the original post for links. Now, it is hard to know what to make of this. Healy finds the results for 4 polls. Their response rates are 82%, 91%, 89% and 97%. The average here is 89.75%. Let's round up to 90%.

So, I claim to not be able to find any poll with a response rate higher than 99% (the response rate in Lancet II). Healy claims that I am wrong and, for evidence, cites 4 polls with response rates lower than 99%. Am I missing something? Isn't he just providing further evidence for my concerns? If, of the hundreds (?) of polls conducted in Iraq, Lancet II features the highest response rate, isn't that cause for concern? (Note that Healy, in an earlier portion of the same post, cites a poll with a 100% response rate. I hope to return to that specific example at a later date.)

Again, one poll will, by definition, have the highest response rate. A priori, there is no reason why Lancet II might not be that poll. But it is a bit worrying that the poll with the most controversial result of any poll conducted in Iraq in the last 2 years would also have the highest response rate of any poll. What are the odds of that? If response rate and controversy are independent, then this would be a surprising result. If they are correlated (perhaps people are more likely to want to participate in a poll about death rates than in a poll on less controversial topics), then this is to be expected.

In any event, the annoyance comes when someone like Henry Farrell on Crooked Timber writes:


I don’t have very much respect for David Kane (it isn’t me who was accused of fraud). What bugs me as much as the initial offensive accusation is that he never to my knowledge apologized afterwards or sought to retract his accusation (if I’d done something similar, god forbid, I hope that I’d have apologized abjectly to the offended parties; I’d likely have disappeared entirely from public debate immediately thereafter).


What accusation does Henry want me to retract? The main points of my initial post were a) that any problems with Lancet II are likely to lie with the interviewers, not with the specific clustering formulas and other arcana used by the authors in their statistical analysis, and b) that there is some evidence that the response rate for Lancet II is excessively high. Why such concerns make me untouchable is unclear.

A Case for Fraud

Note: I originally published this post on the Social Science Statistics blog at Harvard in October 2006. (I am an Institute Fellow at IQSS, the organization behind SSS.) It was attacked and denounced by, among others, Tim Lambert at Deltoid and Kieran Healy at Crooked Timber. I believe that the version below is identical to the one which was published (and removed), but it might not be exact. Since academics should be responsible for their prose, I republish it here.

SSS is an interesting blog, which I occasionally contribute to (example here). I think that it would be fun if they tackled more controversial topics, but I respect Gary King's judgment that this is not their primary mission.

In retrospect, I should have followed Gary's advice to tone down the language a bit.

--------------
The latest Lancet survey of Iraqi mortality, Burnham et al (2006), has come in for criticism. (See the Wikipedia entry for links. See here, here, here and here for criticism.) Daniel Davies is correct when he writes:


This is the question to always keep at the front of your mind when arguments are being slung around (and it is the general question one should always be thinking of when people talk statistics). How Would One Get This Sample, If The Facts Were Not This Way? There is really only one answer - that the study was fraudulent. It really could not have happened by chance. If a Mori poll puts the Labour party on 40% support, then we know that there is some inaccuracy in the poll, but we also know that there is basically zero chance that the true level of support is 2% or 96%, and for the Lancet survey to have delivered the results it did if the true body count is 60,000 would be about as improbable as this. Anyone who wants to dispute the important conclusion of the study has to be prepared to accuse the authors of fraud, and presumably to accept the legal consequences of doing so.


Assume, for a moment, that fraud occurred. How is it most likely to have happened? We can be fairly certain that the editors and authors did not do anything so crude as to lie about the numbers. If there is fraud, it derives from the Iraqi survey teams themselves. Consider the issues that I raised about Roberts et al (2004), the first Lancet study.


The central problem with the Lancet study was that it was conducted by people who, before the war started, were against the war, people who felt that the war was likely to increase civilian casualties and who, therefore, had an expectation/desire (unconscious or otherwise) to find the result that they found.

Consider the Iraqis who did the actual door-to-door surveying. Do you think that they appreciated having such a well paying job? Do you think that they hoped for more such work? If you were them, would you be tempted to shade the results just a little so that the person paying you was happy?


We know very little about these Iraqi teams. Besides monetary incentives to give the Lancet authors the answers they wanted, the Iraqis may have had political reasons as well. The paper reports (page 2) that:


The two survey teams each consisted of two female and two male interviewers, with the field manager (RL) serving as supervisor. All were medical doctors with previous survey and community medicine experience and were fluent in English and Arabic.


The field manager (RL) is Riyadh Lafta, an author of both papers. Now Lafta could be the most honest and disinterested scientist in all the world. Or he could be a partisan hack. There is almost no way for outsiders to judge. But were all the interviewers Sunni? (None of them seemed to speak Kurdish.) Were any former members of the Baath Party? Among highly educated doctors, party membership was common, even somewhat compulsory. It is unseemly to even raise these sorts of questions, and I agree that the names of the interviewers should not be released for safety reasons. But the entire paper hangs on their credibility. How can anyone know that they are telling the truth? The paper goes on:


A 2-day training session was held. Decisions on sampling sites were made by the field manager. The interview team were given the responsibility and authority to change to an alternate location if they perceived the level of insecurity or risk to be unacceptable. In every cluster, the numbers of households where no-one was at home or where participation was refused were recorded.


This is key. The interviewers could, at their discretion, change the location of the sample. How many times did they do this? We are not told and the authors refuse to release the underlying data or answer questions about their methodology. Again, as a matter of procedure, this may be a perfectly fine way to conduct the study. Safety concerns are paramount. But there is no way for any outsider to know how "random" the sampling actually was without access to more detailed information.

From page 4:


[A] final sample of 1849 households in 47 randomly selected clusters. In 16 (0·9%) dwellings, residents were absent; 15 (0·8%) households refused to participate.


Here, finally, is a hard number that we can use to evaluate the likelihood of fraud by the survey teams. If it is typical in such surveys to have such high (99%+) contact and response rates, then there is much less to worry about. But if such a level of cooperation is uncommon, if we can't find a single similar survey with anywhere near this level of compliance, then we should be suspicious. And, once we are suspicious of the underlying data, there is no reason to waste time on the arcana of calculating confidence intervals for cluster sampling. Unreliable data means useless results.

A commentator at Crooked Timber writes:


I have a stats background, and I’ve made a living conducting market and social research surveys for more than 25 years.

...

I’m also very worried about the fieldwork itself. I believe the reported refusal rate was 0.8% (I can’t find this in the report itself, so feel free to correct me). This is simply not believable. I have never conducted a survey with anything like a refusal rate that low, and before anyone talks about cultural differences, there are many non-cultural reasons for people to refuse to participate. If my survey was in a war-zone, I would expect refusal rates to be higher than normal.


One anonymous blog commentator is hardly an authority, but the point he raises is a factual one. What is the typical response rate for surveys of this kind? What is the highest response rate that has ever been recorded in such a survey, in any country on any topic?

In the context of US opinion polling, Mark Blumenthal reports:


The most comprehensive report on response and cooperation rates for news media polls I am aware of was compiled in 2003 by three academic survey methodologists: Jon Krosnick, Allyson Holbrook and Alison Pfent. In a paper presented at the 2003 AAPOR Conference, Krosnick and his colleagues analyzed the response rates from 20 national surveys contributed by major news media pollsters. They found response rates that varied from a low of 4% to a high of 51%, depending on the survey and method of calculation.


That the response rates in Burnham et al (2006) would be clear evidence of fraud if they had been reported in the context of US polling is not dispositive, since Iraq is different from the US and face-to-face interviewing is different from telephone polling. Wikipedia claims a 40%–50% response rate for household surveys, without providing a source.

I cannot find a single example of a survey with a 99%+ response rate in a large sample, for any survey topic, in any country, ever. (If you come across such an example, please post it below.) Assume, for a moment, that there are no such examples, that no survey anywhere has ever had such a high response rate. If so, there are three possibilities.

1) The survey teams provided fraudulent data.

2) There is something different about this survey team or about Iraq at this time which makes this situation different from any other survey ever undertaken.

3) The high response rate is a once-in-a-lifetime freak event. It would not be repeated even if the same survey team took another survey.

I do not think that either 2 or 3 is very likely. Fraud in surveys, on the other hand, is all too common.
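To put a rough number on how implausible a 0.8% refusal rate is under ordinary conditions, here is a back-of-the-envelope calculation. The 5% baseline refusal rate is my assumption, for illustration only (if anything it is generous, given the US figures Blumenthal cites); nothing in the paper supplies it.

from scipy.stats import binom

contacted = 1864   # completed interviews plus refusals, per the figures above
refusals = 15
baseline = 0.05    # assumed "ordinary" refusal rate; illustrative, not from the paper

# Probability of 15 or fewer refusals in 1,864 contacts
# if the true refusal rate were 5%.
p = binom.cdf(refusals, contacted, baseline)
print(f"P(<= {refusals} refusals | {baseline:.0%} baseline refusal rate) = {p:.1e}")

The resulting probability is vanishingly small, which is just a formal way of restating possibility 3: you do not get a refusal rate this low by luck if anything like ordinary refusal behavior applies.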

Tuesday, March 06, 2007

Hash of It

I couldn't leave this comment in this Deltoid thread, so I am putting it here.

Kevin writes:


As for the kudos to people who have "fought this lonely fight", is it seemly to be congratulating yourself thusly? (I take it that accusing people of fraud counts as fighting; it certainly smacks of looking for a fight.) And what do you propose to do with the data? If you are right and the whole thing was cooked, do you suppose Burnham et al made such a hash of it that the data will incriminate them?


1) I am congratulating the people besides me who have sought to make this data public. Do you have a problem with those efforts? I have spent much time e-mailing the authors (often cc'ing Tim Lambert) trying to get the data. If you think that I am lying about that, you can check with Tim.

2) I propose to do several things with the data (and the computer code). First, replicate the results. (I will be presenting a paper at JSM on these and related issues.) Second, I want to examine the data for fraud. For starters, if you believe the data, you should believe that 2,000 civilians were killed in pre-war bombing. Details here. I will also look for anomalous patterns. What if one interviewer recorded 5 times as many deaths as any other? This might happen by chance, or it might be a sign of fraud by this one interviewer. Without the data, no one can know for sure.
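To illustrate the sort of interviewer-level check I have in mind, here is a sketch. The counts are invented placeholders (the real interviewer-level data have not been released); the test simply asks whether deaths are spread across interviewers far more unevenly than chance alone would produce.

from scipy.stats import chisquare

# Hypothetical deaths recorded by each of eight interviewers; placeholder
# numbers only, since the actual interviewer-level data are not public.
deaths_by_interviewer = [28, 31, 25, 30, 27, 29, 26, 145]

# Simplistic null hypothesis: every interviewer records roughly the same
# number of deaths. A chi-square goodness-of-fit test flags gross imbalances,
# such as one interviewer reporting roughly five times as many as the others.
stat, p_value = chisquare(deaths_by_interviewer)
print(f"chi-square = {stat:.1f}, p = {p_value:.2g}")

A tiny p-value here would not prove fraud; clusters differ in violence, and one interviewer may simply have drawn the worst neighborhoods. But it would tell us where to look, and that is exactly why the raw data matter.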

When I was one of the few (or the only one) calling for data access and voicing suspicion of the raw data (not of Burnham sitting in Jordan inputting things on a computer), you might just ignore me. But when more than one person (Spagat, Hicks, et al) has concerns, wouldn't you say that data access is important?

3) I do not think that Burnham is guilty of fraud. He seems an honest, well-intentioned guy. But how does he know that the pile of data that was handed to him is accurate? How do you know that it is accurate? Isn't it possible that one of the Iraqi interviewers (out of laziness or malice or a desire to give the rich Americans answers that make them happy) provided inaccurate data?

Monday, March 05, 2007

UK Times Article

Interesting article from the UK Times Online.


The statistics made headlines all over the world when they were published in The Lancet in October last year. More than 650,000 Iraqis – one in 40 of the population – had died as a result of the American-led invasion in 2003. The vast majority of these “excess” deaths (deaths over and above what would have been expected in the absence of the occupation) were violent. The victims, both civilians and combatants, had fallen prey to airstrikes, car bombs and gunfire.

Body counts in conflict zones are assumed to be ballpark – hospitals, record offices and mortuaries rarely operate smoothly in war – but this was ten times any other estimate. Iraq Body Count, an antiwar web-based charity that monitors news sources, put the civilian death toll for the same period at just under 50,000, broadly similar to that estimated by the United Nations Development Agency.

The implication of the Lancet study, which involved Iraqi doctors knocking on doors and asking residents about recent deaths in the household, was that Iraqis were being killed on an horrific scale. The controversy has deepened rather than evaporated. Several academics have tried to find out how the Lancet study was conducted; none regards their queries as having been addressed satisfactorily. Researchers contacted by The Times talk of unreturned e-mails or phone calls, or of being sent information that raises fresh doubts.

Iraq Body Count says there is “considerable cause for scepticism” and has complained that its figures had been misleadingly cited in The Lancet as supporting evidence.

One critic is Professor Michael Spagat, a statistician from Royal Holloway College, University of London. He and colleagues at Oxford University point to the possibility of “main street bias” – that people living near major thoroughfares are more at risk from car bombs and other urban menaces. Thus, the figures arrived at were likely to exceed the true number. The Lancet study authors initially told The Times that “there was no main street bias” and later amended their reply to “no evidence of a main street bias”.

Professor Spagat says the Lancet paper contains misrepresentations of mortality figures suggested by other organisations, an inaccurate graph, the use of the word “casualties” to mean deaths rather than deaths plus injuries, and the perplexing finding that child deaths have fallen. Using the “three-to-one rule” – the idea that for every death, there are three injuries – there should be close to two million Iraqis seeking hospital treatment, which does not tally with hospital reports.

“The authors ignore contrary evidence, cherry-pick and manipulate supporting evidence and evade inconvenient questions,” contends Professor Spagat, who believes the paper was poorly reviewed. “They published a sampling methodology that can overestimate deaths by a wide margin but respond to criticism by claiming that they did not actually follow the procedures that they stated.” The paper had “no scientific standing”. Did he rule out the possibility of fraud? “No.”

If you factor in politics, the heat increases. One of The Lancet authors, Dr Les Roberts, campaigned for a Democrat seat in the US House of Representatives and has spoken out against the war. Dr Richard Horton, editor of The Lancet, is also antiwar. He says: “I believe this paper was very thoroughly reviewed. Every piece of work we publish is criticised – and quite rightly too. No research is perfect. The best we can do is make sure we have as open, transparent and honest a debate as we can. Then we'll get as close to the truth as possible. That is why I was so disappointed many politicians rejected the findings of this paper before really thinking through the issues.”

Knocking on doors in a war zone can be a deadly thing to do. But active surveillance – going out and measuring something – is regarded as a necessary corrective to passive surveillance, which relies on reports of deaths (and, therefore, usually produces an underestimate).

Iraq Body Count relies on passive surveillance, counting civilian deaths from at least two independent reports from recognised newsgathering agencies and leading English-language newspapers (The Times is included). So Professor Gilbert Burnham, Dr Les Roberts and Dr Shannon Doocy at the Centre for International Emergency, Disaster and Refugee Studies, Johns Hopkins Bloomberg School of Public Health, Maryland, decided to work through Iraqi doctors, who speak the language and know the territory.

They drafted in Professor Riyadh Lafta, at Al Mustansiriya University in Baghdad, as a co-author of the Lancet paper. Professor Lafta supervised eight doctors in 47 different towns across the country. In each town, says the paper, a main street was randomly selected, and a residential street crossing that main street was picked at random.

The doctors knocked on doors and asked residents how many people in that household had died. A person needed to have been living at that address for three months before a death for it to be included. It was deemed too risky to ask if the dead person was a combatant or civilian, but they did ask to see death certificates. More than nine out of ten interviewees, the Lancet paper claims, were able to produce death certificates. Out of 1,849 households contacted, only 15 refused to participate. From this survey, the epidemiologists estimated the number of Iraqis who died after the invasion as somewhere between 393,000 and 943,000. The headline figure became 650,000, of which 601,000 were violent deaths. Even the lowest figure would have raised eyebrows.

Dr Richard Garfield, an American academic who had collaborated with the authors on an earlier study, declined to join this one because he did not think that the risk to the interviewers was justifiable. Together with Professor Hans Rosling and Dr Johan Von Schreeb at the Karolinska Institute in Stockholm, Dr Garfield wrote to The Lancet to insist there must be a “substantial reporting error” because Burnham et al suggest that child deaths had dropped by two thirds since the invasion. The idea that war prevents children dying, Dr Garfield implies, points to something amiss.

Professor Burnham told The Times in an e-mail that he had “full confidence in Professor Lafta and full faith in his interviewers”, although he did not directly address the drop in child mortality. Dr Garfield also queries the high availability of death certificates. Why, he asks, did the team not simply approach whoever was issuing them to estimate mortality, instead of sending interviewers into a war zone?

Professor Rosling told The Times that interviewees may have reported family members as dead to conceal the fact that relatives were in hiding, had fled the country, or had joined the police or militia. Young men can also be associated with several households (as a son, a husband or brother), so the same death might have been reported several times.

Professor Rosling says that, despite e-mails, “the authors haven’t provided us with the information needed to validate what they did”. He would like to see a live blog set up for the authors and their critics so that the matter can be clarified.

Another critic is Dr Madelyn Hsaio-Rei Hicks, of the Institute of Psychiatry in London, who specialises in surveying communities in conflict. In her letter to The Lancet, she pointed out that it was unfeasible for the Iraqi interviewing team to have covered 40 households in a day, as claimed. She wrote: “Assuming continuous interviewing for ten hours despite 55C heat, this allows 15 minutes per interview, including walking between households, obtaining informed consent and death certificates.”

Does she think the interviews were done at all? Dr Hicks responds: “I’m sure some interviews have been done but until they can prove it I don’t see how they could have done the study in the way they describe.”

Professor Burnham says the doctors worked in pairs and that interviews “took about 20 minutes”. The journal Nature, however, alleged last week that one of the Iraqi interviewers contradicts this. Dr Hicks says: “I have started to suspect that they [the American researchers] don’t actually know what the interviewing team did. The fact that they can’t rattle off basic information suggests they either don’t know or they don’t care.”

And the corpses? Professor Burnham says that, according to reports, mortuaries and cemeteries have run out of space. He says that the Iraqi team has asked for data to remain confidential because of “possible risks” to both interviewers and interviewees.