MODELING FOR DATA SCIENCE

One death is a tragedy; one million is a statistic. – Joseph Stalin

Much of the vocabulary used in the field of statistics can mislead people who interpret the words in their everyday sense. One of the worst offenders is “significant,” the misinterpretation of which has led to countless mistakes in reading statistical results. Even more basic is the misleading nature of the word “statistics” itself.

The historical root of “statistics” is the word “state,” as in “the State of Germany.” About 200 years ago, as European principalities were being assembled into the familiar states of Italy, Germany, the US, and the UK, state leaders wanted to know what resources were available to them: population, industrial and agricultural output, exports and imports, etc. Such facts about the state were called “statistics.”

The 1840 by-laws of the American Statistical Association start, “The operations of this Association shall be principally directed to the Statistics of the United States … and not be confined to any particular part of the country. Foreign statistics may occasionally be considered.” The first few documents collected by the organization include “Statistical Forms for taking the 6th Census [1840] of the United States”; “pamphlets relative to Portland, Maine”; “Statistical tables of Massachusetts”; “State of the Banks in the U.S.”, and “Steam Engines in the U.S.”. 1

Facts are undeniably important to making sense of the world. For instance, here are some contemporary statistics in the early sense of the word. As reported by the news magazine The Economist, in 2016 there were 560,000 violent deaths around the world. 68% of these were murders; war accounted for 18%. Latin America had 38% of the recorded murders in 2016, even though its population is just 8% of the world total.2 In the everyday use of the word, a “statistic” refers to such a fact.

The professional meaning of “statistics” is different: not the facts themselves but the area of scientific practice relating to the collection, analysis, and interpretation of data. Many statistical professionals have decided to operate under the name data science, thereby avoiding the confusion of statistics-as-facts with statistics-as-science. Other terms you may encounter include data analysis, business analytics, or statistical modeling.

One key concept for understanding statistics-as-science is variation across the units of observation. For the present, imagine that the unit of observation is a country and that the observations themselves are each country’s population size and its count of violent deaths in 2016. One more kind of observation is relevant to The Economist’s violent-deaths example: whether the country is considered to be part of Latin America.

Needless to say, countries differ in these observed values: there is country-to-country variation. Indeed, the word variable is used in statistics to describe an observable of a given sort. In the violent-deaths example, the variables are population size (a number), number of violent deaths in 2016 (a number), and whether or not the country is part of Latin America (a yes/no indicator). Saying that there is variation in population among the countries is a wordy way of saying the obvious: that different countries have populations of different sizes. For instance, China and France are very different in population size. The same holds for the number of violent deaths: this number varies from one country to another. For instance, the United States has a very different number of violent deaths than the Netherlands. And there’s variation from country to country in the Latin America variable: some countries are in Latin America (a “yes” value for the indicator) and others aren’t (a “no” value).

A major goal of statistics is to explain or account for variation in one variable using the variation in other variables. For instance, what accounts for the variation in the number of violent deaths from one country to another? It’s reasonable to think that population size plays a role: the US and the Netherlands have different numbers of violent deaths in part because the populations of the two countries are so very different. The report in The Economist introduces another possible explanation for the country-by-country variation in violent deaths: whether the country is in Latin America. To judge from the statistics presented – 38% of recorded murders occurring in a set of countries with only 8% of the world population – Latin America is a dangerous place.
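The comparison implied by those two percentages can be made explicit with a little arithmetic. The sketch below uses only the two shares quoted from The Economist (38% of recorded murders, 8% of world population). The helper name `relative_rate` is hypothetical, and the calculation assumes that both shares refer to the same set of countries.

```python
def relative_rate(event_share: float, population_share: float) -> float:
    """Per-capita rate inside a region divided by the rate everywhere else.

    Hypothetical helper: both arguments are shares of a world total,
    expressed as fractions strictly between 0 and 1.
    """
    if not (0 < event_share < 1 and 0 < population_share < 1):
        raise ValueError("shares must be strictly between 0 and 1")
    inside = event_share / population_share
    outside = (1 - event_share) / (1 - population_share)
    return inside / outside


# Shares quoted in the text for Latin America in 2016.
murder_share = 0.38
population_share = 0.08

# Rate relative to the world average (which includes Latin America itself).
vs_world_average = murder_share / population_share

# Rate relative to the rest of the world (Latin America excluded).
vs_rest_of_world = relative_rate(murder_share, population_share)

print(f"{vs_world_average:.2f} times the world-average rate")
print(f"{vs_rest_of_world:.2f} times the rate in the rest of the world")
```

Both ratios restate the same two facts. Neither says why the rate is higher, which is the question that explaining variation is meant to address.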

By framing questions in terms of explaining variation, statistics enables us to examine other possibilities and to tailor the explanation to the particular purpose for working with data in the first place. In particular, you can think about the violent-death situation using other units of observation and other explanatory variables that may more directly address the questions of interest.

Imagine that, instead of an individual country, the unit of observation were an individual person. This would give about 7,300,000,000 units – the world population – rather than the roughly 200 countries. What might be the variables? The variation to be explained is in a yes/no variable: whether the person suffered a violent death in 2016. This varies from person to person for the obvious reason that not everybody died a violent death. The value of the variable is “yes” for some people and “no” for the others (who constitute the vast majority).
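With a person as the unit of observation, the natural summary of a yes/no variable is the proportion of “yes” values. As a rough sketch, the calculation below uses only the two figures already quoted in the text (560,000 violent deaths and a world population of about 7,300,000,000); the variable names are illustrative, and the result is only as precise as those rounded inputs.

```python
world_population = 7_300_000_000  # units of observation, as quoted in the text
violent_deaths = 560_000          # people whose value is "yes" (2016 figure from the text)

# Proportion of units whose violent-death variable is "yes".
proportion_yes = violent_deaths / world_population

# The same quantity on two scales that are easier to read.
per_100k = proportion_yes * 100_000
one_in = world_population / violent_deaths

print(f"proportion 'yes': {proportion_yes:.6f}")
print(f"about {per_100k:.1f} per 100,000 people")
print(f"about 1 person in {one_in:,.0f}")
```

The proportion is tiny, which is what “the vast majority” amounts to in numbers: nearly all of the variation to be explained sits in a very small fraction of the units.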

And what variables might we use to explain the violent-death variable? There are many. Some that come to mind are age, sex, whether the person lives in a war zone, and whether the person is engaged in illegal activity (particularly the drug trade). Less obvious ones are mentioned in The Economist article: whether or not the person lives in a city, the population growth rate of the city, and whether the person lives in a “developing” country.3 Still others: whether the person owns a gun or lives with someone who does, and the level of police corruption or the degree of religious hatred in the person’s country or district.

In using these variables to explain the violent-death variable, we might well discover that a violent death is strongly associated with these factors: being a male in the 15-25 age range, living in a war zone, being involved with illegal drugs, living in an area of high unemployment, and so on.

What about the statistical “fact” that the violent-death rate is high in Latin America? This might have nothing at all to do with the characteristics typically associated with Latin America: e.g. Spanish and Portuguese being primary languages, a large indigenous and subsistence rural population. It might even be found that a Latin American culture is associated with a lower risk of violent death once one accounts for the other major factors.
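How can a group show a higher overall rate yet a lower risk once another factor is accounted for? The sketch below shows that the arithmetic allows it. Every number in it is invented purely for illustration: the groups “A” and “B” and the “factor” are abstract placeholders, not data about any real region.

```python
# Invented counts for illustration only (NOT real data):
# (group, stratum) -> (number of people, number of events)
counts = {
    ("A", "factor present"): (8_000, 80),
    ("A", "factor absent"): (2_000, 2),
    ("B", "factor present"): (2_000, 24),
    ("B", "factor absent"): (8_000, 12),
}


def rate(people: int, events: int) -> float:
    """Events per person."""
    return events / people


def overall_rate(group: str) -> float:
    """Rate for a group, ignoring the other factor (the crude comparison)."""
    people = sum(n for (g, _), (n, _e) in counts.items() if g == group)
    events = sum(e for (g, _), (_n, e) in counts.items() if g == group)
    return rate(people, events)


# Crude comparison: one rate per group.
crude = {group: overall_rate(group) for group in ("A", "B")}

# Adjusted comparison: one rate per group within each stratum.
within = {key: rate(*value) for key, value in counts.items()}

print("overall:", crude)
for stratum in ("factor present", "factor absent"):
    print(stratum, "A:", within[("A", stratum)], "B:", within[("B", stratum)])
```

In these made-up counts, group A has the higher overall rate only because more of its members have the factor. Within each stratum, group A's rate is the lower one. Whether anything similar holds for real regions is a question for real data.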

A statistic (singular) is a fact. Any set of facts might be informative or misleading. The discipline of statistics, that is, statistics as a science, is the business of constructing, evaluating, and interpreting explanations. Major methods of statistics relate to quantifying variation, constructing mathematical models of the relationships among variables, and putting those models in the context of the uncertainty introduced by imperfect and incomplete observations. In your journey to master statistics, you’ll be learning conventions for organizing data and communicating reasoning and conclusions, mathematical tools for quantifying relationships, and, most fundamentally, the habits of mind of finding, exploring, and accounting for variation.


  1. Source: “Organization & Proceedings of the American Statistical Association 1838-1872,” p. 12

  2. Source: “Solving murder”, The Economist, April 7, 2018, p. 9

  3. It turns out that cities in developing countries tend to grow at a high rate as rural people move to the city in search of employment.