Note: I have re-written this post in response to comments from biostatistician Thomas Lumley below.
It made headlines around the world: Facebook ‘likes’ can reveal users’ politics, sexual orientation, IQ. According to Michal Kosinski, the lead researcher, information on "gender, race, political views, religion, sexual orientation, personality, IQ and so on," can be extracted from the knowledge that a person likes Lady Gaga or Harley-Davidsons.
The study noted that many of the most predictive "likes" weren't obvious ones. For example, fewer than five per cent of users labelled as "gay" were connected with gay groups such as the "No H8 campaign." Instead, likes such as "Britney Spears" and "Desperate Housewives" were "moderately indicative of being gay."
Meanwhile, the "likes" most correlated with high intelligence were thunderstorms, The Colbert Report, science and curly fries...
How accurate were they? According to one report:
...researchers could tell Democrats and Republicans apart in 85% of the cases; black and white people apart in 95% of the cases; and homosexual and heterosexual men apart in 88% of the cases.
That sounds impressive, doesn't it? But just how accurate were the authors' predictions?
The chart on the right is taken from the original article. Seventy-five to 88 percent accuracy for sexual orientation sounds pretty impressive. But what does that actually mean?
The authors coded people in their sample as "lesbian" if they were female, and chose "women" in response to Facebook's "interested in" question. Gay men were identified in a parallel way. Using this methodology, 4.3 percent of males in the sample were categorized as gay by the authors, while 2.4 percent of females were lesbian.
Because the number of people identified in the sample as gay or lesbian was so low, the simple prediction rule "not gay or lesbian" is highly accurate. "Not lesbian" correctly predicts the sexual orientation of 97.6 percent of women in the authors' sample; "not gay" correctly predicts the sexual orientation of 95.7 percent of the men.
How much better were the authors able to do than this? The "accuracy" number reported in the chart above, and picked up by the media, is an Area Under the Curve (AUC) statistic. This measures "the probability that a classifier will rank a randomly chosen positive instance higher than a randomly chosen negative one." It is closely related to the Gini coefficient. A score of 0.5 means the classifier ranks positives and negatives no better than chance; a score close to 1 means it almost always ranks positives above negatives.
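To make the distinction concrete, here is a small sketch in Python. The scores are made up for illustration (overlapping normal distributions, not the study's data or model); it computes AUC directly as the pairwise ranking probability, and compares it with the accuracy of the trivial "everyone is straight" rule:

```python
import random

random.seed(0)

# Hypothetical classifier scores: positives tend to score higher than
# negatives, but the two groups overlap. These numbers are invented for
# illustration; they are not the paper's data.
positives = [random.gauss(1.5, 1.0) for _ in range(50)]    # 5% of 1000
negatives = [random.gauss(0.0, 1.0) for _ in range(950)]

# AUC = probability that a randomly chosen positive outscores a randomly
# chosen negative (ties count as half a win).
wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
auc = wins / (len(positives) * len(negatives))
print(f"AUC: {auc:.2f}")

# Accuracy of the trivial rule "predict everyone negative":
base_rate_accuracy = len(negatives) / (len(positives) + len(negatives))
print(f"'Everyone straight' accuracy: {base_rate_accuracy:.1%}")
```

A classifier like this can post an AUC well above 0.8 while the do-nothing rule is already 95 percent accurate, which is exactly the gap between the headline number and the practical one.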
Thomas Lumley, in the comments below, does some simulations based on the AUC numbers given in the paper to figure out how often the authors would be expected to predict a man's sexual orientation correctly:
With 5% gay, using a prediction threshold of 0.5 in the logistic regression model, I can get a total error rate of 4.8%, with about 1% of people predicted to be gay. That's made up of about 0.45% of the population falsely predicted to be gay, and just under 4.4% falsely predicted to be straight. It's not a terribly impressive improvement over chance, but it is an improvement (and I chose the threshold in a separate sample from the one I used to estimate the accuracy).
So, in a sample of 1000 people, 50 of whom are gay, the authors would flag about 10 people as gay. Of these, about 4 would actually be straight, and the rest would be gay. An overwhelming majority - 88 percent - of gay people are not identified by this prediction method, and about 40 percent of those identified as gay are actually straight. It's better than one would do trying to guess who is gay by saying eeny-meeny-miny-moe. However, the total error rate is almost as high as one would obtain using a simple "everyone is straight" decision rule.
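Lumley's percentages can be turned into a back-of-the-envelope confusion matrix. The sample size of 1000 is hypothetical; the rates are the ones quoted from his comment above:

```python
# Back-of-the-envelope arithmetic using the error rates quoted from
# Lumley's comment, scaled to a hypothetical sample of 1000 men.
n = 1000
gay = 0.05 * n                     # 50 men actually gay
false_pos = 0.0045 * n             # straight men falsely predicted gay
false_neg = 0.044 * n              # gay men falsely predicted straight
predicted_gay = (gay - false_neg) + false_pos  # men the rule flags as gay

total_error_rate = (false_pos + false_neg) / n
share_missed = false_neg / gay                 # gay men the rule misses
false_discovery = false_pos / predicted_gay    # flagged men who are straight

print(f"total error rate: {total_error_rate:.2%}")            # ~4.85%
print(f"gay men not identified: {share_missed:.0%}")          # 88%
print(f"flagged men who are straight: {false_discovery:.0%}")
```

Note that the 4.85 percent total error rate sits right next to the 5 percent error rate of never flagging anyone at all.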
By changing the prediction threshold, the authors could identify more potentially gay people, but a higher percentage of those would be straight, and the error rate would increase.
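That tradeoff can be sketched with the same kind of made-up overlapping score distributions used earlier (again, an illustration, not the paper's model). As the threshold drops, more people are flagged, but a growing share of the flagged group is straight, because straight men vastly outnumber gay men in the sample:

```python
import random

random.seed(1)

# Invented scores: 50 gay men with higher average scores, 950 straight
# men, overlapping distributions. Not the study's data.
pos = [random.gauss(1.5, 1.0) for _ in range(50)]
neg = [random.gauss(0.0, 1.0) for _ in range(950)]

for t in (2.5, 2.0, 1.5, 1.0):
    tp = sum(s >= t for s in pos)   # gay men correctly flagged
    fp = sum(s >= t for s in neg)   # straight men wrongly flagged
    flagged = tp + fp
    frac_straight = fp / flagged if flagged else float("nan")
    print(f"threshold {t}: flagged {flagged:4d}, "
          f"of whom {frac_straight:.0%} are straight")
</```[remove]```>
```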
The presumption in medical research is that false negatives are much more costly than false positives: it's much better to take out a healthy appendix or two than to have a patient's appendix rupture. Yet that is not necessarily true in advertising. I get irritated when the New York Times' "recommended for you" list includes every article on same-sex anything, or when a list of hot blonde/lusty firefighter "singles in your area" appears at the right of my screen. In advertising the total error rate matters, and false positives may be as costly as false negatives.
As an aside, the accuracy of identifying sexuality from the "interested in" Facebook option is questionable. I took a look at the pages of my own gay and lesbian Facebook friends, and asked a friend's son to do the same. Not a single one of these GLBT friends - whether out or closeted - revealed on their profile that they were interested in people of the same gender.
The same issue - an error rate not much better than a simple decision rule's - arises with many of the other characteristics analyzed in the paper. Ninety percent of the sample were Christian; by always guessing "Christian," one would be correct nine times out of ten. Only 21 percent of the sample take drugs; by always guessing "no drugs," one would have a 79 percent success rate.
It is true that the authors did much better than a simple decision rule when predicting a person's gender. They had a prediction accuracy of 93 percent on gender, whereas 60 percent of their sample were female. They also did relatively well on guessing whether someone was Caucasian or African American. But I don't think "scientists are able to guess people's gender accurately based on their Facebook likes" would have gathered many headlines - it's not that challenging to do.
The moral of the story is that it's sensible to be at least somewhat Bayesian, and to take all available information into account when making predictions.