Sandia LabNews

Can there be too much data?


Philip Kegelmeyer uses Bayes’ rule, a theorem of probability theory, to evaluate the bozometer results. In cautioning about the potential pitfalls of relying too heavily on big data, Philip notes that if a bozometer worked with 99.9 percent accuracy, it would return more than 300,000 false readings when sampling the US population — 300,000 people incorrectly deemed to be bozos. (Photo by Dino Vournas)

Knowledge is power, but too much knowledge — in the form of data — can be a bad thing. “More information doesn’t always lead to better decisions,” says Philip Kegelmeyer (8900). “In fact, sometimes the two can be anti-correlated.”

An expert in machine learning, Philip has spent a lot of time pondering the dangers and opportunities in “big data” — essentially, data sets so large and complex that they can only be processed on a supercomputer. He’s given numerous presentations on whether, and how, personal data can enhance national security data analysis.

“Does all that data make a difference? Is it worth the privacy concerns?” he asks. “Big data is tricky. It can help or hurt your analysis, depending on how you use it.”

To understand these issues, Philip says you first have to appreciate how data can influence, or fail to influence, human decision-making. “The leading theory in evolutionary psychology is that intelligence evolved to win arguments, not to arrive at the truth. So in a roomful of people, the opinion of the most charismatic person often wins out,” he says. “That’s fairly depressing, and a good argument for thinking carefully about how data and judgment interact.”

The base rate fallacy

One way data can lead us astray is the base rate fallacy — an error in thinking in which we fail to account for how common or rare something actually is before weighing the evidence in front of us.

Philip gives the example of a bozometer that can accurately detect bozos 99.9 percent of the time. “I point it at you and it says you are a bozo. But are you really? The very counterintuitive answer depends on who else I test. This is not solely about you and the accuracy of the instrument,” he says.

On a pre-selected group of 2,000 people, of whom 1,000 are known bozos, the device will correctly identify 999 of the bozos while raising just one false alarm. But add a lot of untargeted data — the rest of the US population of approximately 300 million people — and you now have 300,000 false alarms.

“If you know there are only about 2,000 bozos in the entire data set, 99.9 percent accuracy isn’t so great,” says Philip. “The chances that you are really a bozo become quite small. This is the danger of adding untargeted data to any analytic.”

Even an analytic with 99.99 percent accuracy would still turn up 30,000 false alarms. “So you either need an incredibly accurate analytic, or a situation in which a high false alarm rate is acceptable,” he says. “This can work in the medical community, when medical tests are given to a broad population to screen for critical conditions. In this situation, a high false alarm rate may be tolerable.”
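
The arithmetic behind the bozometer comes straight from Bayes’ rule. Below is a minimal Python sketch of that calculation, assuming the article’s figures: 99.9 percent accuracy applied to both detection and false alarms, and roughly 2,000 real bozos in a US population of about 300 million. The function name and the sensitivity/specificity framing are illustrative choices, not anything taken from Philip’s presentations.

```
# Bayes' rule: P(bozo | positive reading), under the article's assumed figures.
def posterior_bozo(prior, sensitivity=0.999, specificity=0.999):
    p_positive = prior * sensitivity + (1.0 - prior) * (1.0 - specificity)
    return prior * sensitivity / p_positive

# Targeted group: 1,000 known bozos out of 2,000 people tested.
print(posterior_bozo(prior=1000 / 2000))         # ~0.999: an alarm is almost surely right

# Untargeted data: about 2,000 bozos hiding among roughly 300 million people.
print(posterior_bozo(prior=2000 / 300_000_000))  # ~0.007: almost every alarm is false

# Expected false alarms when the whole population is scanned.
print(round((1 - 0.999) * 300_000_000))          # 300,000
```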

On the flip side, extra untargeted data can fill in connections and help you understand the importance of those connections. Philip invents the example of Abe and Abigail, who are both people of interest and have both been seen in Damascus. With additional flight information, you’d learn that they both frequently fly into Yemen and their time in Damascus almost always overlaps by a day.

“Without broad data, that is all you have and those facts seem very suggestive,” Philip explains. “But if you look at the entire set of normal flight records for that region, you might learn, for example, that 80 percent of all travel to Yemen goes through Damascus, most of that travel requires an overnight stay for refueling, and that 90 percent of that travel happens in three months of the year. With this additional, non-specific data, the odds that any two random travelers to Yemen would be in Damascus at the same time go way up.”
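
The effect of those base rates is easy to check with a rough Monte Carlo. The Python sketch below simulates two unrelated travelers under the figures Philip quotes (80 percent of Yemen travel routed through Damascus with an overnight stay, 90 percent of travel packed into three months); the 90-day window, the same-day overlap test, and the evenly spread comparison case are assumptions invented for this illustration.

```
import random

DAYS_IN_YEAR = 365
PEAK_WINDOW = 90   # assumed length of the "three months" that carry 90 percent of travel

def damascus_day(seasonal=True):
    """Day a traveler overnights in Damascus, or None if routed elsewhere."""
    if random.random() > 0.80:                 # 80 percent of Yemen travel goes via Damascus
        return None
    if seasonal and random.random() < 0.90:    # 90 percent of trips fall in the peak window
        return random.randrange(PEAK_WINDOW)
    return random.randrange(DAYS_IN_YEAR)

def overlap_rate(seasonal, trials=200_000):
    """Estimate the chance two unrelated travelers share a Damascus night."""
    hits = 0
    for _ in range(trials):
        a, b = damascus_day(seasonal), damascus_day(seasonal)
        if a is not None and a == b:
            hits += 1
    return hits / trials

print("spread evenly over the year:", overlap_rate(seasonal=False))  # about 0.64 / 365, i.e. ~0.2 percent
print("with the seasonal crunch:   ", overlap_rate(seasonal=True))   # several times higher
```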

This is an example of how large amounts of properly used data, even if the vast bulk of that data is about people who are not of security interest, can enhance national security data analysis. Such data, he explains, is useful in providing context for what is normal and what is truly unique, as in the case of Abe and Abigail’s travel patterns. “The human mind prefers simple stories. The value of bulk data is that it can tell us when the stories are too simple, when we should look deeper and not trust our first impressions,” he says.

Mining blog posts to predict violence

Philip led the 2008-2010 Networks Grand Challenge LDRD that demonstrated the power of big data. The project dug into the question of why certain events sparked violent protests. In 2005, the publication of editorial cartoons depicting the Islamic prophet Muhammad in the Danish newspaper Jyllands-Posten set off worldwide protests, violent demonstrations, and riots, which were blamed for the deaths of hundreds of people.

“This wasn’t the first or last time that these cartoons were published, so why such an extreme reaction that one time?” asks Philip. “We looked at blog postings and comments and how the information travels across the web and developed an algorithm that can predict, based on multilingual text analysis, if an event will spark deadly violence.”

The project took in a lot of data by continuously scanning blogs in multiple languages and analyzing the aggregated, voluntarily public text with keyword extraction, text clustering, and sentiment analysis. “The prediction capability comes from looking at what is a ‘normal’ response to incendiary events in the news,” says Philip. “Our algorithm can tell us if the response will lead to violence, but it can’t tell us when, where, or by whom that violence will occur.”
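
The project’s actual multilingual analytic isn’t spelled out here, so the Python snippet below is only a toy, English-only stand-in for the general idea: score the text response to an event and ask how far it sits from the range of responses to “normal” past events. The tiny anger lexicon, the z-score comparison, and the sample posts are all invented for illustration.

```
from statistics import mean, stdev

ANGER_WORDS = {"outrage", "burn", "revenge", "attack", "riot", "death"}

def anger_score(posts):
    """Average fraction of words per post drawn from the anger lexicon."""
    scores = []
    for post in posts:
        words = post.lower().split()
        scores.append(sum(w.strip(".,!") in ANGER_WORDS for w in words) / max(len(words), 1))
    return mean(scores)

# Baseline: reactions to earlier, comparable events that did not turn violent.
normal_events = [
    ["this cartoon is an outrage", "a boycott would send a message"],
    ["disappointing coverage, write to the editor", "peaceful protest planned saturday"],
    ["such an outrage, but not worth a riot", "ignore the provocation and move on"],
]
baseline = [anger_score(posts) for posts in normal_events]

# New event: does the response fall far outside the normal range?
new_event = ["burn it down in revenge", "attack the embassy", "riot until they pay"]
z = (anger_score(new_event) - mean(baseline)) / stdev(baseline)
print(f"z-score versus normal responses: {z:.1f}")  # a large positive value flags the event
```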

Can you trust your data?

Philip has a complicated relationship with data — he doesn’t always trust it. “People can fall in love with their data, to the point that they are blind to the idea that an adversary can manipulate data,” he says.

He cites a major metropolitan police department that implemented a computer-based system to assign police officers to the neighborhoods with the most illegal drug activity. A college student arrested for possession of marijuana might not trigger an increase in police presence, but violence among cocaine dealers would. The program worked great, until police officers began seeing disparities between the computer program’s assessment of the neighborhoods and what they saw on the streets.

It turned out that a drug gang had started bribing a data entry clerk in the police department, a scheme that went undetected for a year before the gang got too ambitious. At first the clerk only flagged the arrests of the gang doing the bribing as less violent, but eventually they had the clerk flag the arrests of a rival gang as more violent.

It all soon unraveled on the witness stand. “And it’s not like the tampering was subtle,” explains Philip. “They were able to track the problems with the data back to the very day the bribery started.”
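
Tampering that crude leaves a step change in the data, which even a simple analysis can localize. The Python sketch below runs a generic split-point scan, not anything the department actually used, over a synthetic daily series in which the fraction of arrests flagged as violent drops at day 60; the scan points back to roughly the day the shift began, much as investigators traced the real records to the start of the bribery.

```
import random

random.seed(1)
daily_violent_fraction = (
    [random.gauss(0.30, 0.03) for _ in range(60)] +   # before the bribery
    [random.gauss(0.18, 0.03) for _ in range(120)]    # after: certain arrests downgraded
)

def best_split(series, margin=10):
    """Day whose before/after means differ most: a crude change-point estimate."""
    best_day, best_gap = None, 0.0
    for day in range(margin, len(series) - margin):   # skip edges with too little data
        before = sum(series[:day]) / day
        after = sum(series[day:]) / (len(series) - day)
        if abs(before - after) > best_gap:
            best_day, best_gap = day, abs(before - after)
    return best_day, best_gap

day, gap = best_split(daily_violent_fraction)
print(f"largest shift at day {day}; means differ by {gap:.2f}")  # lands near day 60
```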

Unfortunately, adversaries also have far more sophisticated methods of sapping or suborning the critical use of data analytics, on which Sandia and many other research institutions, government agencies, and companies rely.

“Through understanding our methods, adversaries seek to produce data that is evolving, incomplete, deceptive, and otherwise custom-defined to defeat analysis,” he says. “We can’t prevent this. In fact, we frequently depend on data over which adversaries have extensive influence.”

To address this problem, Philip is now leading another LDRD project, Counter Adversarial Data Analysis (CADA), which seeks to develop and assess novel data analysis methods to counter that adversarial influence.

“We are trying to understand if an adversary can know how we are using data and if they can actually change our data,” Philip explains. “How paranoid should we be that this could happen, and what can we do to remediate the situation? The bottom line is that big data can be powerful, but only if you understand the inherent weaknesses and tradeoffs. You can’t just take data at face value.”
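
One way to make “how paranoid should we be” concrete is to measure how far a small slice of adversary-controlled records can drag an analytic, and whether a more robust statistic resists the pull. The Python sketch below is a simple thought experiment along those lines, not a CADA method or result; the score distribution, the 5 percent contamination rate, and the injected values are all assumptions.

```
import random
from statistics import mean, median

random.seed(0)
honest = [random.gauss(50, 10) for _ in range(950)]   # legitimate records
poisoned = honest + [100.0] * 50                      # adversary injects 5 percent extreme scores

def trimmed_mean(values, trim=0.05):
    """Mean after discarding the top and bottom `trim` fraction of values."""
    values = sorted(values)
    k = int(len(values) * trim)
    return mean(values[k:len(values) - k])

for name, stat in [("mean", mean), ("median", median), ("trimmed mean", trimmed_mean)]:
    print(f"{name:13s} clean = {stat(honest):.1f}   poisoned = {stat(poisoned):.1f}")
# The plain mean moves by roughly +2.5 points; the robust statistics move by only a fraction of that.
```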