The big data problem isn’t just about handling petabytes of information, asking the right questions, or avoiding false correlations (like recognizing that just because drownings rise at the same time ice cream sales do, banning ice cream won’t reduce drownings).
It’s also about handling data responsibly. And so far, we’re not doing as well with that as we could be.
First, Target worked out how to tell you’re pregnant before your family does, then decided to disguise its creepy marketing by mixing irrelevant coupons in with the baby offers. Then Facebook ran research to find out whether good news makes you depressed by showing some people more bad news, and discovered that no, we’re generous enough to respond to positive posts with more positivity.
But if companies keep using information about us in creepy ways instead of responsible ones, maybe we’ll stop being generous enough to share it. And that could mean losing out on more efficient transport, cleaner cities, cheaper power, earlier detection of dangerous drug interactions and the onset of depression, and hundreds of other advances we can get by applying machine learning to big data.
It’s time for a big data code of conduct.
Facebook’s dubious research is problematic for lots of reasons. For one thing, Facebook’s policy on what it would do with your data didn’t mention research until four months after it ran the experiment. Facebook’s response was essentially “everyone does it,” “we don’t have to call it research if it’s about making the service better,” and other weasel-worded corporate comments. And the researcher’s apology was more about having caused anxiety by explaining the research badly than about having manipulated what appeared in timelines, because Facebook manipulates what you see in your timeline all the time. Of course, it usually does that to make things better, not to see what your Pavlovian reaction to positive or negative updates is. The fact that Facebook can’t see that one is optimizing information and the other is treating users as lab rats, and that the difference matters, says it needs a far better ethics policy on how it mines user data for research.
Plus, Facebook has enough data that it shouldn’t have needed to manipulate the timelines in the first place; if its sentiment analysis was good enough to tell the difference between positive and negative posts (which is doubtful given how basic it was and how poor sentiment analysis tools are at detecting sarcasm), it should have been able to find users who were already seeing more positive or more negative updates than most users and simply track how positive or negative their posts were afterwards. When you have a hypothesis, you experiment on your data, not your users.
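To make that concrete, here is a minimal sketch of the observational approach in Python: instead of changing anyone’s feed, you split users by how positive the posts they were already seeing happened to be, then compare the sentiment of what each group posted afterwards. The data frame, column names and cutoffs are invented purely for illustration.

```python
# Sketch only: the data, column names and thresholds are hypothetical.
import pandas as pd

# Hypothetical per-user averages: sentiment of posts each user saw in week 1,
# and sentiment of the posts they wrote in week 2.
users = pd.DataFrame({
    "user_id":             [1, 2, 3, 4, 5, 6],
    "feed_sentiment_w1":   [0.62, 0.15, -0.40, 0.55, -0.30, 0.05],
    "posted_sentiment_w2": [0.50, 0.10, -0.25, 0.45, -0.10, 0.00],
})

# Naturally occurring exposure groups; no one's timeline is manipulated.
saw_positive = users[users["feed_sentiment_w1"] > 0.3]
saw_negative = users[users["feed_sentiment_w1"] < -0.1]

print("Saw mostly positive posts, later posted:",
      saw_positive["posted_sentiment_w2"].mean())
print("Saw mostly negative posts, later posted:",
      saw_negative["posted_sentiment_w2"].mean())
```

An observational comparison like this still needs care with confounders, but it tests the same hypothesis without turning users into test subjects.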
Experimenting on your data rather than your users is how Eric Horvitz at Microsoft Research has run studies to predict whether you’re likely to develop depression, whether two drugs are interacting badly, whether a cholera epidemic is about to break out, and whether people are getting used to cartel violence in Mexico.
Looking at the language in public Twitter feeds, how often people tweet, at what time of day, and how those patterns change, Horvitz’s team was able to predict with 70% accuracy who was going to suffer depression (which might help people get treatment and reduce the suicide rate from depression). Not only did the team use information people were already sharing, it asked people’s permission to look at it.
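As a rough illustration of that kind of prediction, the sketch below trains a simple classifier on behavioral signals of the sort described above: posting frequency, late-night activity and negative language. The feature set, toy data and choice of logistic regression are assumptions made for the example, not the team’s published method; a cross-validated accuracy score is how a figure like 70% would typically be reported.

```python
# Sketch only: feature set, toy data and model are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Hypothetical per-user features:
# [tweets per day, share of tweets posted between 1am and 5am, rate of negative words]
X = np.array([
    [12.0, 0.05, 0.02],
    [ 3.0, 0.40, 0.15],
    [ 8.0, 0.10, 0.04],
    [ 2.0, 0.55, 0.20],
    [15.0, 0.02, 0.01],
    [ 4.0, 0.35, 0.18],
    [ 9.0, 0.08, 0.03],
    [ 1.5, 0.50, 0.22],
])
# Toy labels: 1 = later reported depression, 0 = did not.
y = np.array([0, 1, 0, 1, 0, 1, 0, 1])

model = LogisticRegression()
scores = cross_val_score(model, X, y, cv=4)  # cross-validated accuracy
print("Mean accuracy:", scores.mean())
```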