“Data is always lying to you… but, we can fix it, sometimes, maybe.”
That’s how Patrick Ball, director of research for the Human Rights Data Analysis Group, opens his podcast, “Understanding Patterns of Mass Violence with Data and Statistics.” Published earlier this month, the podcast is part of Databites, a speaker series by Data & Society, a research institute in New York City that focuses on the social and cultural issues arising from data-centric technological development.
The idea that observable data are the same as patterns of behavior is a “naïve model,” Ball says, adding that there are methods for dealing with selection bias.
“In human rights data collection, we don’t usually know what we don’t know. That’s a problem. If you don’t know what you don’t know, how do you know if what you don’t know is systematically different from what you do know?” he asks.
During the 45-minute podcast, Ball explains how data about mass violence can seem to offer insights into patterns such as whether violence is getting better or worse over time. He explores whether data are representative of reality and how statistical patterns can reflect the way the data were collected rather than changes in the real-world phenomena the data purport to represent.
“We turn to data analysis as a check, as a reality check on our prejudices, on our presuppositions, on the things we think are true,” Ball says. “We look to data analysis to figure out if we got the story right. But, if we build data analysis that just reinforces the story we’ve already told ourselves, we’ve not only learned nothing, we’ve anti-learned. We’ve developed certainty about the wrong conclusion. Bad data analysis is much worse than no data analysis.”
Using analyses of killings in Iraq, homicides committed by police in the U.S., killings in the conflict in Syria and homicides in Colombia, Ball contrasts patterns in raw data with estimated total patterns of violence. He points to biases in raw data that can be corrected through estimation and explains why the correction matters.
“Raw data, no matter how big, is not a good basis for a story,” Ball explains. “It doesn’t matter if we’re talking about Truth Commission testimonies, UN investigations, press articles, crowdsourcing, SMS streams, NGO documentation, social media feeds, perpetrator records, government archives, state agency records, refugee camp records, raw data is good for cases. It helps you know about an individual story of violence. But, if you aggregate it into a database, and think you’re getting the patterns, you’re not going to get the patterns. What you’re going to get is a beautiful graph about how the data was collected, and that’s probably not what you want. … Technology and big data tend to amplify bias. The reason is that technology gives you more information about the stuff you were able to capture in the first place, but it doesn’t address the problem that some of those areas are just dead zones in terms of information. Some locations, some ethnicities, some perpetrator types, some kinds of violence, those are just going to be dead zones for you. … Statistics is generally about comparisons.”
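Ball's point about getting "a beautiful graph about how the data was collected" can be seen in a toy calculation. The sketch below is not from the podcast: the period names, event counts and coverage rates are all hypothetical, chosen only to show how flat real-world violence can look like a sharp rise once documentation improves.

```python
# Toy illustration with hypothetical numbers (not from Ball's talk).
# True violence is identical in both periods; only documentation changes.
true_events = {"year_1": 100, "year_2": 100}

# Assumed share of events that get documented in each period.
coverage = {"year_1": 0.2, "year_2": 0.6}


def documented_counts(true_events, coverage):
    """Return the number of events that end up in the raw database."""
    return {
        period: round(true_events[period] * coverage[period])
        for period in true_events
    }


raw = documented_counts(true_events, coverage)

# Change suggested by the raw database vs. change in the real world.
apparent_change = raw["year_2"] / raw["year_1"]
true_change = true_events["year_2"] / true_events["year_1"]

print("documented events:", raw)
print("apparent change in violence:", apparent_change)
print("true change in violence:", true_change)
```

In this made-up case the raw counts suggest violence tripled, while the real level did not change at all. The graph would be tracking coverage, not violence.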
He argues that there are three ways to good statistics — a perfect census, proof of all the data or a random sample.
“We’ve got to get the story right,” Ball emphasizes. “We want to hold the powerful accountable. We need accountability in human rights cases so that people know when to speak of their loved ones in the past tense. We need accountability for human rights violations because it’s the only thing that will stop the cycle of violence. It’s the only thing that will stop the powerful from repression against the less powerful. And, if we are going to do data analysis on behalf of projects that work for justice against inequality, against environmental degradation, or for any of the other causes that we need to support, we have to be right. To do otherwise is malpractice.”
Data & Society’s other podcasts include:
- “Weapons of Math Destruction”
- “On Digital Passageways and Borders”
- “Ebola and the Law of Disaster Experimentation”
- “Balancing Privacy Obligations and Research Aims in a Learning Health Care System”
- “Genetic Coercion”
- “When Algorithms Become Culture”
The talks originally aired as live webcasts and remain available in a video archive. In response to popular demand, the videos were also released as audio-only recordings so that more people could access them, says Seth Young, director of communications for Data & Society. “We hope the additional distribution through iTunes, Google Play, etc. helps these talks reach a broader audience, too.”
Banner image credit: Justin Norman
Editor’s note: Watchworthy Wednesday posts highlight interesting DML resources and appear in DML Central every Wednesday. Any tips for future posts are welcome. Please comment below or send email to email@example.com.