From a Reuters column:
“Data analysis is very similar to performing magic. With great skill you can pull things together and create the perception of surprising relationships.”
“[L]ack of talent is not just an impediment; it’s a potential source of danger.”
“Often what’s most interesting isn’t the statistical relationship itself, but the data that was required to find it.”
Statistician Andrew Gelman makes an insightful remark, one worth keeping in mind not only when reading published scientific papers:
Levitt buttresses his argument with the statement, “Chris Goodall [the person who made the walking/driving comparison] is no right-wing nut; he is an environmentalist and author of the book How to Live a Low-Carbon Life.” How relevant is this? Even a “right-wing nut” could make a good point, right?
More to the point, I think we have to be careful about automatically trusting “crossover” arguments. Do we have to believe something, just because it comes from somebody who we wouldn’t expect to say it? I worry that this sort of crossover appeal is so appealing that otherwise-skeptical commentators (such as Levitt) forget their usual skepticism.
“Over the weekend of Apple’s April 3 release of the iPad, 73% of circulated tweets were favorable toward the iPad, but 26% expressed disappointment that the iPad could not replace the iPhone, according to a study.”
If you’re not careful, you might conclude that sentiment towards the iPad was largely favorable. But that conclusion would likely be biased.
This is the point that Harvard Business Review’s Kate Crawford makes in a recent article, “The Hidden Biases in Big Data.” With a data sample, it is always critical to ask whether the sample is representative of the target population.
Thus, considering the iPad sentiment example, a key question is: are the people who tweeted about Apple’s iPad over that weekend (the sample) representative of all the people who have, or even could have, interacted with the iPad during that time (the target population)?
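To make the sample-versus-population distinction concrete, here is a minimal simulation (all numbers are invented for illustration, not taken from the study): if people who liked the iPad were much more likely to tweet about it than people who didn’t, the tweeting sample would look far more favorable than the population it was drawn from.

```python
import random

random.seed(0)

# Invented numbers for illustration only: a population of 100,000 people
# who interacted with the iPad, of whom 55% felt favorable.
population = [1] * 55_000 + [0] * 45_000   # 1 = favorable, 0 = unfavorable

# Suppose enthusiasts tweet far more often: favorable people tweet with
# probability 0.10, unfavorable people with probability 0.03 (also invented).
tweets = [s for s in population
          if random.random() < (0.10 if s else 0.03)]

pop_rate = sum(population) / len(population)
sample_rate = sum(tweets) / len(tweets)
print(f"favorable in population: {pop_rate:.0%}")
print(f"favorable among tweeters: {sample_rate:.0%}")
# The tweeting sample looks much more favorable than the population,
# purely because of who chose to tweet.
```

Under these assumed rates the tweeting sample comes out roughly 80% favorable even though only 55% of the population is — the gap is entirely an artifact of who selected into the sample.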
Some excerpts from the article:
- Hidden biases in both the collection and analysis stages present considerable risks, and are as important to the big-data equation as the numbers themselves.
- Data and data sets are not objective; they are creations of human design.
- We get a much richer sense of the world when we ask people the why and the how, not just the “how many”.
Read the article here.
Steve Lohr reflects on the promise of Big Data, citing the growing buzz, but also its first big failure:
“Many of the Big Data techniques of math modeling, predictive algorithms and artificial intelligence software were first widely applied on Wall Street.” And what happened there we all know.
A chief scientist from an ad-targeting startup:
“You can fool yourself with data like you can’t with anything else. I fear a Big Data bubble.”
“A major part of managing Big Data projects, he says, is asking the right questions: How do you define the problem? What data do you need? Where does it come from? What are the assumptions behind the model that the data is fed into? How is the model different from reality?”
“Models do not just predict, but they can make things happen,” says Rachel Schutt, who taught a data science course this year at Columbia.
A concern is that “the algorithms that are shaping my digital world are too simple-minded, rather than too smart.”
Read at The New York Times.
To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of.
From an article on Facebook’s Data Science Team at MIT’s Technology Review: experiments with million-plus sample sizes, data doubling every year, and a possible new product offering, selling data-management tools to other big businesses?
Below is a link to a great article on the science of statistics by George E. P. Box, in which he states and develops his famous dictum that “all models are wrong”, qualifying it with the fact of their usefulness: “all models are wrong, but some are useful” (though this phrasing does not appear verbatim in the article). Among its themes:
- the importance of focusing on both theory and practice in statistical work, especially academic
- the ability to “devise simple but evocative models”
- not falling in love with your models (borrowing the metaphor of Pygmalion from Francis Bacon)
- resisting the temptations of “cookbookery” and “mathematistry”
Box first describes the scientific method as a continual iteration between theory and practice (deduction and induction), and then illustrates this with vivid examples from Fisher’s creative, hands-on work in all sorts of practical data analysis.
- One important idea is that science is a means whereby learning is achieved, not by mere theoretical speculation on the one hand, nor by the undirected accumulation of practical facts on the other, but rather by a motivated iteration between theory and practice
- Since all models are wrong the scientist cannot obtain a “correct” one by excessive elaboration.
- Since all models are wrong the scientist must be alert to what is importantly wrong.
- In the inferential stage, the analyst acts as a sponsor for the model […] Having completed the analysis, however, he must switch his role from sponsor to critic
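As a toy illustration of the dictum (my own sketch, not from Box’s article): fit a straight line, a knowingly wrong model, to data generated from y = x², and it remains useful within the range it was fit on, while becoming importantly wrong far outside it.

```python
# Fit a straight line (a knowingly wrong model) to data from y = x**2.
xs = [1.0 + 0.1 * i for i in range(11)]   # x in [1.0, 2.0]
ys = [x * x for x in xs]                  # the true, nonlinear process

# Ordinary least squares for y = a + b*x, computed by hand.
n = len(xs)
mx = sum(xs) / n
my = sum(ys) / n
b = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) \
    / sum((x - mx) ** 2 for x in xs)
a = my - b * mx

# Within the fitted range, the wrong model is still useful...
err_in = max(abs((a + b * x) - x * x) for x in xs)
# ...but extrapolated far outside it, it is importantly wrong.
err_out = abs((a + b * 10) - 10 * 10)
print(f"max error inside [1, 2]: {err_in:.3f}")
print(f"error at x = 10: {err_out:.2f}")
```

Over [1, 2] the line misses by at most about 0.15, a perfectly serviceable approximation; at x = 10 it is off by more than 70. The model was wrong everywhere, but only the second error is importantly wrong.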
In such areas as sociology, psychology, education, and even, I sadly say, engineering, investigators who are not themselves statisticians sometimes take mathematistry seriously. Overawed by what they do not understand, they mistakenly distrust their own common sense and adopt inappropriate procedures devised by mathematicians with no scientific experience.
- One by one, the various crises which the world faces become more obvious and the need for hard facts on which to take sensible action becomes inescapable
- Mathematics artfully employed…
Search for “science and statistics box” and look for a PDF.
JSTOR has it at http://www.jstor.org/stable/2286841