
I originally wrote this for a statistical literacy course. It is not original, in the sense that it draws from other such ‘do’ and ‘don’t’ lists from various places as well as my own experience. I’m sure it could be improved, so comments, suggestions and debate are very welcome.

 

The seven mortal statistical sins

Numbers are often an essential way to present clear and concise evidence for an argument. However, they can be used well or badly. The following rules set out some common errors.

1 No measurement is perfect but some measures are more perfect than others (measurement error).

Numbers are only as good as the people who produce them. Any number depends upon the clear definition of what is measured and how it is measured. Very often it is difficult or impossible to precisely measure what is actually wanted (i.e. to obtain a valid measurement) in a way that will give consistent results whenever it is repeated (a reliable measurement).

Numbers are more robust if they are based on definitions and means of measurement that are widely agreed, and whose strengths and weaknesses are well understood. Where there is controversy, the definition or measurement method used should be made clear. The source of numbers should always be provided.

Comparisons over time or across different groups can only be made if the measurement method stays the same. Extra care has to be taken when information from different sources is used to make such comparisons. Definitions used by different organisations rarely coincide exactly. Even within the same survey instrument, question wording may change over time or new response categories may be added. Results can be influenced by the context within which a question is asked (including what questions have come before it). Comparisons across countries or language groups pose special problems.

Often it is sensible to report a range within which the true value of a measurement is thought to lie, but without both upper and lower limits, such ranges become meaningless. ‘Up to 99% of people’ includes the number zero; ‘as few as 1%’ does not rule out 99%.

Orders of magnitude matter. It is easy to misplace a decimal point or confuse a million with a billion, and thus get a number wildly wrong. Numbers should be presented with some readily recognisable comparison that makes their magnitude comprehensible, and also makes the detection of such errors more likely.

2 Percentages or proportions have a base (denominator) which must be stated.

Percentages express numbers as a fraction of 100. If what that 100 comprises is not stated, then the meaning of the percentage will be unclear or misleading. Growth rates will depend upon the base year from which growth is measured. It is easy to confuse different groups of people on which percentages are based. E.g. does ‘working women’ refer to all women who do work, paid or unpaid; those currently in the labour force; those in employment; employees; employees working full-time hours, and so on? Note too the ambiguity in English phrases like ‘the percentage of working women (who …)’: does the percentage refer to the fact that they are women, the fact that they work, or the fact that they do both?
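
As a quick illustration (a Python sketch with invented counts, not real labour market data) of how the choice of denominator changes the headline figure:

    # Invented counts, purely to illustrate how the denominator changes the percentage.
    full_time_women_employees = 8_000_000
    all_women_employees = 13_000_000       # full-time plus part-time
    women_in_labour_force = 15_000_000     # employees plus unemployed and seeking work
    all_adult_women = 26_000_000

    numerator = full_time_women_employees
    for label, base in [("women employees", all_women_employees),
                        ("women in the labour force", women_in_labour_force),
                        ("all adult women", all_adult_women)]:
        print(f"Full-time women employees as % of {label}: {100 * numerator / base:.0f}%")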

When the base is itself a percentage, as often happens when change is discussed, this presents two further problems.

The first is the confusion of absolute and relative change. If the growth rate rises from 2% to 3%, that is a 50% increase, not a 1% increase, though it is better expressed as a ‘one percentage point increase’ in the rate of growth. In this context the absolute change probably gives a better sense of what is happening than the relative change.

The second is the multiplication of the margin of error that comes from calculating relative change on the basis of small numbers that themselves have a margin of error. E.g. a survey may show that over a period of time the number of people in a particular category has increased from 5% to 15%. This could, correctly, be described as a 200% increase. However, it starts from such a small base that the impression created is misleading. The obverse, that the number of people not in this category has declined from 95% to 85%, suggests a much more modest change, and one that will be less influenced by error in the original data because the absolute size of the base is larger: a few percentage points either way make much less difference to 95% than to 5%.
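
A minimal Python sketch of both points, using the figures from the text:

    # Growth rate rises from 2% to 3%: one percentage point, but a 50% relative increase.
    old_rate, new_rate = 2.0, 3.0
    print(new_rate - old_rate)                      # 1.0 (percentage points)
    print(100 * (new_rate - old_rate) / old_rate)   # 50.0 (% relative increase)

    # A category grows from 5% to 15%: a 200% relative rise from a small base,
    # while the complementary category only falls from 95% to 85%.
    old_share, new_share = 5.0, 15.0
    print(100 * (new_share - old_share) / old_share)                          # 200.0
    print(100 * ((100 - new_share) - (100 - old_share)) / (100 - old_share))  # about -10.5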

Incidence and prevalence are often confused. Incidence is a time-based measure: the proportion of those ‘at risk’ who experience an event within a given time period: 10% of people caught a cold in 2009; 2% of motorists had an accident in 2009. Prevalence refers to a state of affairs at a point in time: on 1st December 2009, 3% of people had a cold; on 1st December 2009, 24% of motorists had ever been involved in an accident.

3 The average may not be the same as ‘typical’, and will not be universal.

Averages summarise a lot of information in a single number. This makes them very useful, but their limitations should also be borne in mind. Averages may describe the most typical condition, but they may also describe a highly atypical mid-point between two or more very different conditions. Wherever there is variety, many cases may not be close to the ‘average’ and a few cases may be very far from it. This need not make such cases either ‘abnormal’ or unusual.

Distributions around an average may not be symmetrical. If there are a small number of cases with very high or very low values, this can drag the average up or down. When this is the case the ‘median’, the value of the case with the middle value when all cases are ranked, gives a better guide. Earnings are typically skewed in this way, so that substantially fewer than 50% of earners earn above ‘average’ earnings, but the level of  ‘median’ earnings will divide earners into two equally sized groups.
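
A small simulation (in Python, with made-up, log-normally distributed ‘earnings’) showing how skew pulls the mean above the median, so that well under half of cases lie above the mean:

    import random
    import statistics

    random.seed(1)
    # Made-up earnings drawn from a right-skewed (log-normal) distribution.
    earnings = [random.lognormvariate(10, 0.7) for _ in range(10_000)]

    mean = statistics.mean(earnings)
    median = statistics.median(earnings)
    share_above_mean = sum(e > mean for e in earnings) / len(earnings)

    print(f"mean:   {mean:,.0f}")
    print(f"median: {median:,.0f}")                                  # noticeably below the mean
    print(f"share earning above the mean: {share_above_mean:.0%}")   # well under 50%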

4 Highly unusual events may be fairly common

The probability that an event will occur depends not only upon what the chances are of it occurring in a given situation, but also upon the number of such situations (the base). The chances of winning the lottery are very low, but since millions buy tickets each week, there are regular winners. The occurrence of an unusual or unexpected event is not, in itself, evidence that some special factor must have caused it, especially if there are many situations in which it might occur (the ‘Texas sharpshooter’ fallacy). Many events and states of affairs follow an approximately ‘normal’ distribution in which fewer cases are found the further one travels from the value typical of the ‘average’ case. Unfortunately there have been several miscarriages of justice in which people have been convicted because it has been wrongly supposed that the chances of a particular event (e.g. a death) occurring by chance were so small as to point towards the culpability of the defendant. The problem is that any unique ‘event’, whether common or not, has the same unimaginably small chance of occurring. Thus the probability of holding any particular lottery number is the same, and it is also the probability of holding the winning number. What is different is the probability of holding a number that is not the winning one!
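
A toy Python calculation (with invented jackpot odds and sales figures) of why a ‘very unlikely’ event is nevertheless seen regularly once the number of trials is large:

    # Hypothetical odds of 1 in 14 million per ticket, 30 million tickets sold per draw.
    p_win = 1 / 14_000_000
    tickets = 30_000_000

    # Probability that at least one ticket wins this particular draw.
    p_at_least_one_winner = 1 - (1 - p_win) ** tickets
    print(f"{p_at_least_one_winner:.2f}")   # roughly 0.88: the 'very unlikely' win is almost routine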

Repeated measures of the same phenomenon regress towards the mean, showing spurious improvement or deterioration. Because no measurement is perfect, it contains some element of random error. To the extent that results towards the extreme ends of a scale (e.g. the ‘best’ and ‘worst’ performers) contain more of such error, repeating the measurement of performance is likely to lead to results less far from the mean, even if there has been no change in the underlying value of the characteristic that is being measured. This should always be taken into account when analysing the performance of e.g. ‘failing’ schools or hospitals, accident blackspots, and so on.
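
A minimal simulation of regression to the mean: the underlying ‘skill’ of each case never changes, yet the worst scorers on a noisy first measurement appear to improve when measured again:

    import random
    import statistics

    random.seed(2)
    n = 10_000
    true_skill = [random.gauss(0, 1) for _ in range(n)]
    # Two noisy measurements of the same, unchanged, underlying skill.
    score_1 = [s + random.gauss(0, 1) for s in true_skill]
    score_2 = [s + random.gauss(0, 1) for s in true_skill]

    # Select the 500 'worst performers' on the first measurement...
    worst = sorted(range(n), key=lambda i: score_1[i])[:500]
    # ...and compare their average score on the two occasions.
    print(statistics.mean(score_1[i] for i in worst))   # very low
    print(statistics.mean(score_2[i] for i in worst))   # much closer to the overall mean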

5 Correlation is not causation

Natural sciences and medicine frequently use randomised control trials to get evidence about cause and effect. If, on average, the two groups in the experiment are the same to start with (randomised), and only one group is subjected to the experimental condition, any difference between this experimental group and the control group must, on average, be caused by the experimental condition. Evidence of cause and effect in human affairs is much harder to produce because only observation is usually possible, not experiments. We can observe correlations between conditions (sex and earnings; age and religious belief; unemployment and crime; social class and voting preference, etc.) but this is not evidence, in itself, of causation. It is stronger (but by no means conclusive) evidence of causation if it can be shown that, aside from the characteristics under discussion, the different groups in what is thought to be the causal category (e.g. men and women; young and old; employed and unemployed) are otherwise similar in terms of any relevant characteristic. This is what social scientists or economists mean when they refer to ‘control’. In the absence of such control, correlations may simply be ‘spurious’: the product of another, prior, causal factor. For example, there is a high cross-country correlation between the number of mobile phones in a country and the rate of infant mortality: more phones are associated with fewer infant deaths. It would be foolish, however, to think that mobile phones save infant lives: both are the results of a prior factor, the level of economic development.
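
A sketch of a spurious correlation: two simulated variables (stand-ins for ‘phones per head’ and ‘infant survival’) are generated only from a shared prior factor (‘development’), yet correlate strongly with each other:

    import random
    import statistics   # statistics.correlation needs Python 3.10+

    random.seed(3)
    n = 200
    development = [random.gauss(0, 1) for _ in range(n)]
    # Each outcome depends only on 'development' plus its own independent noise.
    phones = [d + random.gauss(0, 0.5) for d in development]
    infant_survival = [d + random.gauss(0, 0.5) for d in development]

    # A strong positive correlation, despite no direct causal link between the two.
    print(statistics.correlation(phones, infant_survival))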

Observational or experimental studies rarely, if ever, claim to discover ‘the’ cause of a condition or state of affairs. Usually such claims concern the possible size of one or more contributory causes among many.  

6 Surveys are a product of their samples.

Sampling makes it possible to get information about a population that is usually too large and expensive to measure directly.  But it can do so only if the sample has been systematically selected: usually by random selection. ‘Convenience’ samples, especially those in which members of the sample select themselves in some way, describe little more than the sample itself. Many ‘surveys’ used to promote products or publications take this form and have no more than propaganda value.

A ‘selection effect’ also operates when a group of people or things apparently defined by one characteristic is also defined in whole or in part by another one, either by dint of the method of their selection, or because of a strong correlation between the two characteristics. Selection effects can be extremely powerful. A recent, prominent example is given by Ben Goldacre, who has drawn attention to the way in which studies of the effect of pharmaceutical drugs are much less likely to be published if the result of the study is that the drug has no effect. Journal editors prefer to report what they think of as substantive results rather than non-results. The effect of this is to bias public knowledge of any drug towards the conclusion that the drug is effective. Studies with positive results are selected for publication, and then the assumption tends to be made that these published studies comprise all studies that have been undertaken.

The likely accuracy of estimates of the characteristics of populations obtained from random samples depends upon the relevant number in the sample, rather than in the population. Thus estimates about small sub-sections of the population (e.g. teenagers; single mothers; widowers; the self-employed; a minority ethnic group) may be liable to large errors. Surveys may also suffer from response bias if a substantial proportion of people choose not to respond to the survey, and there is reason to think that their characteristics may differ from those who choose to respond.
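
A rough sketch of why precision depends on the number of cases in the (sub)sample: the usual approximate 95% margin of error for a proportion shrinks with the square root of the sample size (the sample sizes below are invented):

    import math

    def margin_of_error(p, n):
        """Approximate 95% margin of error for proportion p from a simple random sample of size n."""
        return 1.96 * math.sqrt(p * (1 - p) / n)

    print(f"{margin_of_error(0.30, 10_000):.1%}")  # about 0.9 percentage points for the full sample
    print(f"{margin_of_error(0.30, 150):.1%}")     # about 7 percentage points for a small subgroup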

7 Significance is not substance

When working with random samples, economists and other social scientists often test any finding they obtain by calculating the probability that it is a result of chance sampling variation rather than a pattern that actually exists in the population. Conventionally a level of 5% probability is chosen, sometimes referred to as ‘statistical significance’.   In this context significance means neither ‘important’ nor ‘substantial’: it just describes how unlikely it is that such a finding could have occurred randomly. It also means that up to around one in twenty ‘results’ are due to chance sampling variation, but, of course, we cannot know which ones. This is why replication is an important part of both natural and social scientific research.
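
A small simulation of the ‘one in twenty’ point: many studies comparing two groups drawn from the same population will still produce roughly 5% ‘significant’ differences at the 5% level:

    import random
    import statistics

    random.seed(4)

    def fake_study(n=100):
        """Two groups drawn from the *same* population, so any 'difference' is pure chance."""
        a = [random.gauss(0, 1) for _ in range(n)]
        b = [random.gauss(0, 1) for _ in range(n)]
        # Approximate z-test for a difference in means (population sd known to be 1 here).
        z = (statistics.mean(a) - statistics.mean(b)) / ((2 / n) ** 0.5)
        return abs(z) > 1.96   # 'significant at the 5% level'

    false_positives = sum(fake_study() for _ in range(2_000))
    print(false_positives / 2_000)   # close to 0.05, even though nothing real is going on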

© John MacInnes University of Edinburgh 2013.

 

Public (non)Opinion

A lovely reminder of the perils of ‘opinion polling’ from a small experiment conducted by Public Policy Polling in their Nov/Dec poll in the US. As well as asking voters for their opinion of the (actually existing) Simpson-Bowles deficit reduction plan (40% had a view), they asked voters about the ‘Panetta-Burns’ plan. 25% had a view (including 40% of self-classified ‘very liberal’ voters). Since the Panetta-Burns plan was a figment of the pollsters’ imagination, this tells us something about respondents’ desire to furnish desirable answers. I suppose it might also be construed as evidence of ESP and the priority of effects over causes were such a plan ever to come into existence (but this is unlikely given that neither surname used to concoct the fake plan was taken from a serving Senator). The poll is here. Comment on it from the Washington Post here and Paul Krugman here.

Public Spending

Some nice data on UK public spending in 2011-12, and visualisations of the same, at

http://www.guardian.co.uk/news/datablog/2012/dec/04/government-spending-department-2011-12

“Chocolate Consumption, Cognitive Function, and Nobel Laureates”
Franz H. Messerli, M.D.
N Engl J Med 2012; 367:1562-1564

All journals like publicity: it increases readership, impact, subscriptions and reputation. Presumably this is why the New England Journal of Medicine published a piece of statistical nonsense about the link between chocolate consumption and Nobel prizes.

The NEJM is one of the medical world’s top journals, with a high impact rating. It is difficult, and prestigious, to get published in it. That is presumably why it felt unable to give space (even a letter of 175 words) to any of the replies it received to this piece of statistical garbage.

Either the editors of the NEJM failed to notice statistical howlers any half competent undergraduate could be relied upon to spot, or they took a cynical decision to devote precious space to a piece which they knew well was scientific rubbish, believing (absolutely correctly as it turned out) that it would bring them a publicity windfall.

Is it churlish to criticise what could or should have been seen as a ‘bit of fun’ or lighthearted nonsense? I don’t think so. One only needs to google the story to see that the papers which carried it – even the serious or broadsheet titles – usually took it at face value: it was written by a doctor, after all, and published in a scientific journal. (The inability of journalists to discern good data from bad is another story for another day.) This has a corrosive and insidious effect. It helps cultivate the perception that statistics can prove anything, because that is what bad statistics can always claim to do. This in turn helps to reinforce the perception that all statistics are rubbish, including the robust ones produced by careful research, of the kind the NEJM depends upon in all its other pages, and upon which we all ultimately rely not only in evidence-based medicine, but in evidence-based everything.

The NEJM should be ashamed of debasing statistics in this way. However, this cloud has one, small, silver lining. The article is a wonderful example that can be used to teach students about the ecological fallacy, prior variables, causation and correlation, and measurement error. The article has these in spades. A paper based upon a (rejected) reply to the NEJM by colleagues at Edinburgh (ChocolateSerialKillers_WintersRoberts) goes through some of them, and inter alia points out that similar evidence points to a link between chocolate consumption and serial killers. I’ve also produced comments on the NEJM study that can be used in tutorials or seminars.

Available at the OECD’s ‘Better Life Index‘ site, this index is part of the move, in the spirit of the Stiglitz commission, to broaden evaluations of economic performance beyond GDP, of which Robert Kennedy famously commented:

does not allow for the health of our children, the quality of their education or the joy of their play. It does not include the beauty of our poetry or the strength of our marriages, the intelligence of our public debate or the integrity of our public officials. It measures neither our wit nor our courage, neither our wisdom nor our learning, neither our compassion nor our devotion to our country, it measures everything in short, except that which makes life worthwhile

The OECD Better Life Index measures few of these things either, but it does have a variety of indicators and, what is especially noteworthy from a QM teaching perspective, it allows individuals to weight each indicator according to their own preferences and compare countries on that basis.

The results are of limited use for analysis, but they could be a good way to teach about weights, or the issues involved in constructing summary indicators.
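
A minimal sketch of the weighting idea behind such an index (the indicator scores and weights below are invented, not OECD figures): the same scores give different country rankings once users apply their own weights.

    # Invented indicator scores (0-10) for two countries, and two sets of user weights.
    scores = {
        "Country A": [9, 4, 5],   # income, environment, work-life balance
        "Country B": [6, 8, 7],
    }

    def weighted_index(values, weights):
        """Weighted average of indicator scores, with weights normalised to sum to one."""
        return sum(v * w for v, w in zip(values, weights)) / sum(weights)

    income_first = [5, 1, 1]
    quality_of_life_first = [1, 5, 5]

    for country, vals in scores.items():
        print(country,
              round(weighted_index(vals, income_first), 2),
              round(weighted_index(vals, quality_of_life_first), 2))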

The website is interactive and has a range of information about the data used to construct each element of the index, country rankings on each of these, and discussion of the nature of measurement, as well as the data itself, downloadable as Excel files.

I’ve just read David Hand’s Statistics: A Very Short Introduction (2008, Oxford UP).
Statistics introductions, especially when they are not subject specific, tend to be uninspiring and formula driven. This one is not: it is a good short read that helps give students context for what they might do with quants.

Richard Alldritt of the UKSA has an excellent blog entry ‘Numbers are Not Enough’ at the RSS’s (beta) StatsUserNet site on the need to have a good grasp of the ‘metadata’ when using official statistics, and even more so, the ‘open data’ that will increasingly be released by central and local authorities.

Dilnot and Blastland have an excellent chapter in The Tiger That Isn’t about the way measurement error, gaming, definitions and the whole process of the social construction of data conspire to transform data from what we might think is a transparent representation of the obvious into anything but. They put it extremely well (pp. 15, 158):

“we can establish a simple rule. If it has been counted it has been defined, and that will almost always have meant squeezing reality into boxes that don’t fit… The idealised perception of where numbers come from is that someone measures something, the figure is accurate and goes straight in the database. That is about as far from the truth as it is possible to get.”

Stat-JR

The Bristol Centre for Multilevel Modelling has released the beta version of Stat-JR. The software has two features that may make it attractive for teaching.

For those teaching advanced techniques the software can sit on top of other packages (such as SPSS or Stata) and use their features within a single command language (so that students do not need to learn whole new packages in order to execute new techniques).

However the (for me) really exciting bit of this software is the ‘ebook’ interface that it offers. It is possible to author ebooks with dynamic pages populated from datasets. The dynamic ebook page receives instructions from a reader, which it then executes, posting results back to the page. This makes it a useful tool for building teaching materials, since the Stat-JR ebook can sit on top of SPSS or other applications. Interactive learning materials can be updated with later releases of data sets or releases of statistics software, or with different examples suited to different discipline backgrounds, with much less effort needed to tailor materials to different audiences or take account of other changes.

From the CMLM message:

At present whilst the software is a beta release we are only distributing it in (renewable) 30-day limited licence form but it is our intention after a period to release fully when, as with our MLwiN software, Stat-JR will be free to UK academics with potentially a small one off fee to non-academics and non-UK users. Note that currently, as with MLwiN, Stat-JR is a Microsoft Windows only piece of software.
If you would like to test out the software and give us feedback then
for more details on the software, its documentation and how you can download it please visit http://www.bristol.ac.uk/cmm/research/estat/downloads/index.html and fill in a request form for a download.
Best wishes,

The Stat-JR team.

I find it useful to get students thinking about quantitative evidence by examining how ignorant we are of the order of magnitude of numbers that nevertheless feature in highly visible public debates. Most probably have some idea that the incarceration rate is higher in the US than the UK. But how many people does either jurisdiction lock up, and what ‘should’ the rate be?
Plenty of good data at

http://www.prisonstudies.org/info/worldbrief/

The US locks up more people per head than anywhere else, with an incarceration rate of 0.73%. It has 5% of the world’s population but about 25% of the world’s prisoners. With 2.27m prisoners it has almost as many as the combined total for Russia and China.

And which part of the UK has fewer prisoners per head than any other? Step forward… Northern Ireland, with a rate of 99 per 100,000 population, compared to 155 for England & Wales and 157 for Scotland.
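
The rates above are the same kind of number on different scales; a quick conversion (using an approximate US population figure for the period) makes them directly comparable:

    # Converting between 'per cent of population' and 'per 100,000 population'.
    us_prisoners = 2_270_000
    us_population = 311_000_000   # approximate figure for the period

    print(f"US: {100 * us_prisoners / us_population:.2f}% "
          f"= {100_000 * us_prisoners / us_population:.0f} per 100,000")

    # Northern Ireland's 99 per 100,000 on the percentage scale:
    print(f"NI: {99 / 100_000:.3%} = 99 per 100,000")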

Daniel Kahneman’s Thinking Fast and Slow has many relevant ideas for statistics teaching, regardless of how far one agrees or not with the details of ‘Prospect Theory’, behavioral economics or all of Kahneman’s arguments about the psychology of cognition.
The most important insight is his (to my mind convincing) experimental demonstrations of the manifold forms of the ‘What You See Is All There Is’ bias in cognition, together with a plausible account of its evolutionary origins. Quantitative, mostly statistical, evidence that goes beyond individual observation, together with effortful ‘System 2’ logical thought that follows axioms of probability or arithmetical calculation, are the only possible correctives to WYSIATI. This ought to be a stronger selling point for statistics. ‘Fear of Stats’ probably has an element of ‘discomfort of undermining cherished intuitions’ to it.
His account of regression to the mean and pilot instructors (picked up by Dilnot and Blastland in The Tiger That Isn’t) is probably a good point to start with when introducing statistics. I also like the way he introduces the idea of correlation as consequent to that of regression, reversing the order of most statistics texts.