I originally wrote this for a statistical literacy course. It is not original, in the sense that it draws from other such ‘do’ and ‘don’t’ lists from various places as well as my own experience. I’m sure it could be improved, so comments, suggestions and debate are very welcome.
The seven mortal statistical sins
Numbers are often an essential way to present clear and concise evidence for an argument. However, they can be used well or badly. The following rules set out some common errors.
1 No measurement is perfect but some measures are more perfect than others (measurement error).
Numbers are only as good as the people who produce them. Any number depends upon the clear definition of what is measured and how it is measured. Very often it is difficult or impossible to precisely measure what is actually wanted (i.e. to obtain a valid measurement) in a way that will give consistent results whenever it is repeated (reliable measurements).
Numbers are more robust if they are based on definitions and means of measurement that are widely agreed, and whose strengths and weaknesses are well understood. Where there is controversy, the definition or measurement method used should be made clear. The source of numbers should always be provided.
Comparisons over time or across different groups can only be made if the measurement method stays the same. Extra care has to be taken when information from different sources is used to make such comparisons. Definitions used by different organisations rarely coincide exactly. Even within the same survey instrument, question wording may change over time or new response categories may be added. Results can be influenced by the context within which a question is asked (including what questions have come before it). Comparisons across countries or language groups pose special problems.
Often it is sensible to report a range within which the true value of a measurement is thought to lie, but without both upper and lower limits, such ranges become meaningless. ‘Up to 99% of people’ includes the number zero; ‘as few as 1%’ does not rule out 99%.
Orders of magnitude matter. It is easy to misplace a decimal point or confuse a million with a billion, and thus get a number wildly wrong. Numbers should be presented with some readily recognisable comparison that makes their magnitude comprehensible, and also makes the detection of such errors more likely.
2 Percentages or proportions have a base (denominator) which must be stated.
Percentages express numbers as a fraction of 100. If what that 100 comprises is not stated, then the meaning of the percentage will be unclear or misleading. Growth rates will depend upon the base year from which growth is measured. It is easy to confuse different groups of people on which percentages are based. E.g. does ‘working women’ refer to all women who do work, paid or unpaid; those currently in the labour force; those in employment; employees; employees working full time hours, and so on? Note too the ambiguity in English phrases like ‘the percentage of working women (who …)’: Does the % refer to the fact that they are women? or the fact that they work? or the fact that they do both?
When the base is itself a percentage, as often happens when change is discussed, this presents two further problems.
The first is the confusion of absolute and relative change. If the growth rate rises from 2% to 3%, that is a 50% increase not 1%, but better expressed as a ‘one percentage point increase’ in the rate of growth. In this context the absolute change probably gives a better sense of what is happening than the relative change.
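The two ways of describing the same change can be set side by side with a little arithmetic. The following is only an illustrative sketch; the function name `describe_change` is an assumption introduced here, not something from the original course material.

```python
def describe_change(old_pct, new_pct):
    """Return (percentage-point change, relative change in %) between two percentages."""
    # Absolute change: simple difference, measured in percentage points.
    point_change = new_pct - old_pct
    # Relative change: the difference expressed as a share of the starting value.
    relative_change = (new_pct - old_pct) / old_pct * 100
    return point_change, relative_change


# The growth-rate example from the text: a rise from 2% to 3%.
points, relative = describe_change(2.0, 3.0)
print(f"{points:.1f} percentage point change, {relative:.0f}% relative change")
```

Reporting both figures together, and saying which is which, removes the ambiguity.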
The second is the multiplication of the margin of error contained in calculating relative change on the basis of small numbers that themselves have a margin of error. E.g. a survey may show that over a period of time the proportion of people in a particular category has increased from 5% to 15%. This could, correctly, be described as a 200% increase (a tripling). However it is from such a small base that the impression created is misleading. The obverse, that the proportion of people not in this category has declined from 95% to 85%, suggests a much more modest change, and one that will be less influenced by error in the original data because the absolute size of the base is larger: a few percentage points either way make much less difference to 95% than to 5%.
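How strongly a small base amplifies error can be checked numerically. The sketch below assumes, purely for illustration, that each survey estimate could be out by two percentage points in either direction; the margin and the function names are hypothetical. It compares the range of relative changes for the 5% to 15% group with that for its 95% to 85% complement.

```python
def relative_change(old_pct, new_pct):
    """Relative change in %, measured against the starting value."""
    return (new_pct - old_pct) / old_pct * 100


def relative_change_range(old_pct, new_pct, margin):
    """Smallest and largest relative change if both estimates may be off by +/- margin."""
    candidates = [
        relative_change(old_pct + d_old, new_pct + d_new)
        for d_old in (-margin, margin)
        for d_new in (-margin, margin)
    ]
    return min(candidates), max(candidates)


MARGIN = 2.0  # assumed margin of error in percentage points (hypothetical)

in_low, in_high = relative_change_range(5.0, 15.0, MARGIN)
out_low, out_high = relative_change_range(95.0, 85.0, MARGIN)

print(f"In category:     point estimate {relative_change(5.0, 15.0):.0f}%, "
      f"range {in_low:.0f}% to {in_high:.0f}%")
print(f"Not in category: point estimate {relative_change(95.0, 85.0):.1f}%, "
      f"range {out_low:.1f}% to {out_high:.1f}%")
```

The same two-point uncertainty produces a range hundreds of percentage points wide on the small base, but only a few points wide on the large one.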
Incidence and prevalence are often confused. Incidence is a time-based measure: the proportion of those ‘at risk’ who experience an event within a given time period, e.g. 10% of people caught a cold in 2009; 2% of motorists had an accident in 2009. Prevalence refers to a state of affairs at a point in time: on 1st December 2009 3% of people currently had a cold; on 1st December 2009, 24% of motorists had ever been involved in an accident.
3 The average may not be the same as ‘typical’, and will not be universal.
Averages summarise a lot of information in a single number. This makes them very useful, but their limitations should also be borne in mind. Averages may describe the most typical condition, but they may also describe a highly atypical mid-point between two or more very different conditions. Wherever there is variety, many cases may not be close to the ‘average’ and a few cases may be very far from it. This need not make such cases either ‘abnormal’ or unusual.
Distributions around an average may not be symmetrical. If there are a small number of cases with very high or very low values, this can drag the average up or down. When this is the case the ‘median’, the value of the case with the middle value when all cases are ranked, gives a better guide. Earnings are typically skewed in this way, so that substantially fewer than 50% of earners earn above ‘average’ earnings, but the level of ‘median’ earnings will divide earners into two equally sized groups.
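A small numerical illustration may help here. The earnings figures below are invented for the purpose (eleven hypothetical earners, amounts in thousands) and are not drawn from any real data set; the point is only to show how one very high value pulls the mean away from the median.

```python
from statistics import mean, median

# Hypothetical annual earnings in thousands; one very high earner skews the distribution.
earnings = [14, 16, 18, 19, 21, 22, 24, 26, 29, 35, 250]

avg = mean(earnings)
mid = median(earnings)

# Count how many earners sit above each summary figure.
above_avg = sum(1 for e in earnings if e > avg)
above_mid = sum(1 for e in earnings if e > mid)

print(f"mean = {avg:.1f}, median = {mid}")
print(f"{above_avg} of {len(earnings)} earn above the mean")
print(f"{above_mid} of {len(earnings)} earn above the median")
```

In this made-up group only the single top earner is above the ‘average’, while the median leaves five earners on either side of it.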
4 Highly unusual events may be fairly common
The probability that an event will occur depends not only upon what the chances are of it occurring in a given situation, but also upon the number of such situations (the base). The chances of winning the lottery are very low, but since millions buy tickets each week, there are regular winners. The occurrence of an unusual or unexpected event is not, in itself, evidence that some special factor must have caused it, especially if there are many situations in which it might occur (the ‘Texas sharpshooter’ fallacy). Many events and states of affairs follow an approximately ‘normal’ distribution in which fewer cases are found the further one travels from the value typical of the ‘average’ case. Unfortunately there have been several miscarriages of justice in which people have been convicted because it was wrongly supposed that the probability of a particular event (e.g. a death) occurring by chance was so small as to point towards the culpability of the defendant. The problem is that any unique ‘event’, whether common or not, has the same unimaginably small chance of occurring. Thus the probability of having each individual lottery number is the same. It is also the probability of having the winning number. What is different is the probability of having a number that is not the winning one!
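The lottery point can be put into numbers. The sketch below makes several assumptions that are not in the original text: a draw in which six numbers are chosen from 59, twenty million tickets sold per week, and every ticket an independent random pick. Under those assumptions the chance that at least one ticket wins is 1 - (1 - p)^n, where p is the chance for a single ticket and n the number of tickets.

```python
from math import comb, expm1, log1p

# Assumed draw format (hypothetical): choose 6 numbers from 59.
COMBINATIONS = comb(59, 6)
P_SINGLE_TICKET = 1 / COMBINATIONS

# Assumed weekly ticket sales (hypothetical), each an independent random pick.
TICKETS_PER_WEEK = 20_000_000


def prob_at_least_one_winner(p_single, n_tickets):
    """1 - (1 - p)^n, computed with log1p/expm1 to stay accurate for tiny p."""
    return -expm1(n_tickets * log1p(-p_single))


p_week = prob_at_least_one_winner(P_SINGLE_TICKET, TICKETS_PER_WEEK)

print(f"Chance for one ticket: 1 in {COMBINATIONS:,}")
print(f"Chance that somebody wins in a given week: {p_week:.1%}")
print(f"Expected weeks with a winner per year: {52 * p_week:.1f}")
```

An event that is almost impossible for any one ticket holder is therefore routine once the number of opportunities is large enough.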
Repeated measures of the same phenomenon regress towards the mean, showing spurious improvement or deterioration. Because no measurement is perfect, it contains some element of random error. To the extent that results towards the extreme ends of a scale (e.g. the ‘best’ and ‘worst’ performers) contain more of such error, repeating the measurement of performance is likely to lead to results less far from the mean, even if there has been no change in the underlying value of the characteristic that is being measured. This should always be taken into account when analysing the performance of e.g. ‘failing’ schools or hospitals, accident blackspots, and so on.
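Regression towards the mean is easy to reproduce by simulation. The sketch below is a toy model with invented parameters: 10,000 hypothetical units (think of schools) each have a fixed true score, and every measurement adds independent random error. Nothing changes between the two measurements, yet the ‘worst’ and ‘best’ tenths on the first measurement both move back towards the middle on the second.

```python
import random
from statistics import mean

rng = random.Random(42)  # fixed seed so the illustration is repeatable
N = 10_000

# Hypothetical model: a fixed true score per unit plus fresh random error each time.
true_scores = [rng.gauss(50, 5) for _ in range(N)]
first = [t + rng.gauss(0, 5) for t in true_scores]
second = [t + rng.gauss(0, 5) for t in true_scores]

# Rank units by their first measurement and pick the bottom and top 10%.
ranked = sorted(range(N), key=lambda i: first[i])
worst = ranked[: N // 10]
best = ranked[-(N // 10):]

worst_first = mean(first[i] for i in worst)
worst_second = mean(second[i] for i in worst)
best_first = mean(first[i] for i in best)
best_second = mean(second[i] for i in best)

print(f"'Worst' 10%: first {worst_first:.1f}, second {worst_second:.1f}")
print(f"'Best' 10%:  first {best_first:.1f}, second {best_second:.1f}")
```

The apparent ‘improvement’ of the worst group and ‘deterioration’ of the best group arise here from measurement error alone, since the true scores never change.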
5 Correlation is not causation
Natural sciences and medicine frequently use randomised controlled trials to get evidence about cause and effect. If, on average, two groups in the experiment are the same to start with (randomised), and only one group is subjected to the experimental condition, any difference between this experimental group and the control group must, on average, be caused by the experimental condition. Evidence of cause and effect in human affairs is much harder to produce because only observation is usually possible, not experiments. We can observe correlations between conditions (sex and earnings; age and religious belief; unemployment and crime; social class and voting preference etc.) but this is not evidence, in itself, of causation. It is stronger (but by no means conclusive) evidence of causation if it can be shown that, aside from the characteristics under discussion, the different groups in what is thought to be the causal category (e.g. men and women; young and old; employed and unemployed) are otherwise similar in terms of any relevant characteristic. This is what social scientists or economists mean when they refer to ‘control’. In the absence of such control, correlations may simply be ‘spurious’: the product of another, prior, causal factor. For example there is a high cross-country correlation between the number of mobile phones in a country and the rate of infant mortality: more phones are associated with fewer infant deaths. It would be foolish, however, to think that mobile phones saved infant lives: both are the results of a prior factor: the level of economic development.
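The mobile phone example can be mimicked with simulated data. Everything in the sketch below is invented for illustration: 500 hypothetical countries, a ‘development’ score that drives both phone ownership and infant mortality, and no direct link at all between phones and mortality. One simple way of ‘controlling’ for development, used here as an assumption rather than as the only method, is to remove its linear effect from both variables and correlate what is left.

```python
import random
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def residuals(ys, xs):
    """What is left of ys after removing a least-squares straight-line fit on xs."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [(y - my) - slope * (x - mx) for x, y in zip(xs, ys)]


rng = random.Random(1)  # fixed seed so the illustration is repeatable

# Hypothetical countries: development drives both variables; phones do not affect mortality.
development = [rng.uniform(0, 10) for _ in range(500)]
phones = [10 * d + rng.gauss(0, 10) for d in development]
mortality = [60 - 5 * d + rng.gauss(0, 5) for d in development]

raw_r = pearson(phones, mortality)
controlled_r = pearson(residuals(phones, development),
                       residuals(mortality, development))

print(f"Phones vs mortality, raw correlation:          {raw_r:.2f}")
print(f"Phones vs mortality, development controlled:   {controlled_r:.2f}")
```

The strong raw correlation shrinks to roughly zero once the prior factor is taken into account, which is what a spurious correlation looks like.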
Observational or experimental studies rarely, if ever, claim to discover ‘the’ cause of a condition or state of affairs. Usually such claims concern the possible size of one or more contributory causes among many.
6 Surveys are a product of their samples.
Sampling makes it possible to get information about a population that is usually too large and expensive to measure directly. But it can do so only if the sample has been systematically selected: usually by random selection. ‘Convenience’ samples, especially those in which members of the sample select themselves in some way, describe little more than the sample itself. Many ‘surveys’ used to promote products or publications take this form and have no more than propaganda value.
A ‘selection effect’ also operates when a group of people or things apparently defined by one characteristic is also defined in whole or in part by another one, either by dint of the method of their selection, or because of a strong correlation between the two characteristics. Selection effects can be extremely powerful. A recent, prominent example is given by Ben Goldacre, who has drawn attention to the way in which studies of the effect of pharmaceutical drugs are much less likely to be published if the result of the study is that the drug has no effect. Journal editors prefer to report what they think of as substantive results rather than non-results. The effect of this is to bias public knowledge of any drug towards the conclusion that the drug is effective. Studies with positive results are selected for publication, and then the assumption tends to be made that these published studies comprise all studies that have been undertaken.
The likely accuracy of estimates of the characteristics of populations obtained from random samples depends upon the relevant number in the sample, rather than the population. Thus estimates about small sub-sections of the population (e.g. teenagers; single mothers; widowers; the self-employed; a minority ethnic group) may be liable to large errors. Surveys may also suffer from response bias if a substantial proportion of people choose not to respond to the survey, and there is reason to think that their characteristics may differ from those who choose to respond.
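The dependence on sample size rather than population size can be seen from the usual approximate margin of error for a proportion from a simple random sample, 1.96 × sqrt(p(1 - p)/n) at the conventional 95% level. The sketch below is illustrative only: the sample sizes, the population sizes and the function name `margin_of_error` are assumptions, and the optional finite population correction is included just to show how little the population size matters.

```python
from math import sqrt


def margin_of_error(p, n, population=None, z=1.96):
    """Approximate 95% margin of error, in percentage points, for a sample proportion.

    Assumes a simple random sample. If `population` is given, the standard
    finite population correction is applied.
    """
    se = sqrt(p * (1 - p) / n)
    if population is not None:
        se *= sqrt((population - n) / (population - 1))
    return z * se * 100


# Whole sample of 1,000 versus a sub-group of 50 within it (hypothetical sizes).
print(f"n = 1000: +/- {margin_of_error(0.5, 1000):.1f} points")
print(f"n = 50:   +/- {margin_of_error(0.5, 50):.1f} points")

# Same sample size, very different population sizes (hypothetical).
print(f"n = 1000 from 100,000:    +/- {margin_of_error(0.5, 1000, 100_000):.2f} points")
print(f"n = 1000 from 60,000,000: +/- {margin_of_error(0.5, 1000, 60_000_000):.2f} points")
```

A sub-group of 50 respondents carries a margin of error several times larger than the full sample of 1,000, whatever the size of the population it represents.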
7 Significance is not substance
When working with random samples, economists and other social scientists often test any finding they obtain by calculating how probable a result at least as strong would be if it were produced by chance sampling variation alone, with no corresponding pattern actually existing in the population. Conventionally a threshold of 5% probability is chosen, and findings that pass it are referred to as ‘statistically significant’. In this context significance means neither ‘important’ nor ‘substantial’: it just describes how unlikely it is that such a finding could have occurred randomly. It also means that where no real pattern exists, around one in twenty tests will still yield a ‘significant’ result through chance sampling variation, but, of course, we cannot know which ones. This is why replication is an important part of both natural and social scientific research.
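The ‘one in twenty’ figure can be illustrated by simulating studies in which, by construction, there is nothing to find. The sketch below assumes 2,000 hypothetical studies, each comparing two random samples of 50 drawn from the same population, and uses a simple two-sided z-test with known variance as a stand-in for whatever test a real study would use. All of these choices are assumptions made for illustration.

```python
import random
from statistics import NormalDist, mean

rng = random.Random(2013)  # fixed seed so the illustration is repeatable
STANDARD_NORMAL = NormalDist()


def p_value_two_samples(a, b):
    """Two-sided p-value for a difference in means, assuming known unit variance."""
    z = (mean(a) - mean(b)) / (1 / len(a) + 1 / len(b)) ** 0.5
    return 2 * (1 - STANDARD_NORMAL.cdf(abs(z)))


N_STUDIES, SAMPLE_SIZE, ALPHA = 2000, 50, 0.05

false_alarms = 0
for _ in range(N_STUDIES):
    # Both groups come from the same population: any difference is pure chance.
    group_a = [rng.gauss(0, 1) for _ in range(SAMPLE_SIZE)]
    group_b = [rng.gauss(0, 1) for _ in range(SAMPLE_SIZE)]
    if p_value_two_samples(group_a, group_b) < ALPHA:
        false_alarms += 1

rate = false_alarms / N_STUDIES
print(f"'Significant' results with no real effect: {false_alarms} of {N_STUDIES} ({rate:.1%})")
```

The share of false alarms should come out close to the 5% threshold, and nothing in any single study reveals whether it is one of them.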
© John MacInnes University of Edinburgh 2013.