My rating: 79/100
See Book Notes for other books I have read. If you like my notes, go buy the book!
Tagline: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are
Introduction: The Outlines of a Revolution
On Google, the top complaint about marriage is not having enough sex.
But the places with the highest racist search rates included upstate New York, western Pennsylvania, eastern Ohio, industrial Michigan and rural Illinois, along with West Virginia, southern Louisiana, and Mississippi.
Economists and other social scientists are always hunting for new sources of data, so let me be blunt: I am now convinced that Google searches are the most important dataset ever collected on the human psyche.
Part 1: Data, Big and Small
Part 2: The Powers of Big Data
Chapter 2: Was Freud Right?
I found that food’s being shaped like a phallus did not give it more likelihood of appearing in dreams than would be expected by its popularity. This theory of Freud’s is falsifiable – and, at least according to my look at the data, false.
Of the top hundred searches by men on PornHub, one of the most popular porn sites, sixteen are looking for incest-themed videos.
And women? Nine of the top hundred searches by women on PornHub are for incest-themed videos.
Big data allows us to finally see what people really want and really do, not what they say they want and say they do.
Chapter 3: Data Reimagined
Goldman and other financial firms paid tens of millions of dollars to get access to fiber-optic cables that reduced the time information travels from Chicago to New Jersey by just four milliseconds (from 17 to 13).
Service: Google Correlate (shut down in 2019 due to low usage)
One day, I put the United States unemployment rate from 2004 through 2011 into Google Correlate. Of the trillions of Google searches during that time, what do you think turned out to be the most tightly connected to unemployment? The highest during the period I searched – and these terms do shift – was “Slutload.” That’s right, the most frequent search was for a pornographic site.
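My note: the core idea of a tool like Google Correlate, as I understand it, is to rank search terms by how closely their volume over time tracks a target series. Below is a minimal sketch of that idea using the Pearson correlation. The function names and all numbers are made up for illustration; this is not Google's actual method or data.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def rank_by_correlation(target, candidates):
    """Return (term, r) pairs sorted from most to least correlated with target."""
    scored = [(term, pearson(target, series)) for term, series in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Hypothetical values, NOT real unemployment or search data.
unemployment = [5.0, 6.0, 8.0, 9.5, 9.0]
search_volume = {
    "term_a": [10, 12, 16, 19, 18],  # exactly twice the target, so r is 1
    "term_b": [30, 28, 20, 15, 17],  # falls as the target rises, so r is negative
    "term_c": [7, 7, 8, 7, 8],       # only loosely related to the target
}

ranking = rank_by_correlation(unemployment, search_volume)
for term, r in ranking:
    print(f"{term}: r = {r:.2f}")
```

With trillions of searches to choose from, some term will correlate tightly with almost any series by chance, which is one reason the top term "does shift" from period to period.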
We can predict whether a man and woman will go on a second date based on how they speak on the first date.
One of the ways a man signals that he is attracted is obvious: he laughs at a woman’s jokes. Another is less obvious: when speaking, he limits the range of his pitch. There is research that suggests a monotone voice is often seen by women as masculine.
A woman signals her interest by varying her pitch, speaking more softly, and taking shorter turns talking. A woman is unlikely to be interested when she uses hedge words and phrases such as “probably” or “I guess.”
Fellas, if a woman is hedging her statements on any topic – if she “sorta” likes her drink or “kinda” feels chilly or “probably” will have another hors d’oeuvre – you can bet that she is “sorta” “kinda” “probably” not into you.
A woman is likely to be interested when she talks about herself.
Now, how can men and women communicate in order to get a date interested in them? The data tells us that there are plenty of ways a man can talk to raise the chances a woman likes him. Women like men who follow their lead. Perhaps surprisingly, a woman is more likely to report a connection if a man laughs at her jokes and keeps the conversation on topics she introduces rather than constantly changing the subject to those he wants to talk about. Women also like men who express support and sympathy. If a man says, “That’s awesome” or “That’s really cool,” a woman is significantly more likely to report a connection. Likewise if he uses phrases such as “That’s tough” or “You must be sad.”
Men are more likely to report clicking with a woman who talks about herself.
Women use the word “tomorrow” far more often than men do. Adding the letter “o” to the word “so” (as in “soooo”) is one of the most feminine linguistic traits.
What gets shared, positive or negative articles? “Content is more likely to become viral the more positive it is.”
Many people, particularly Marxists, have viewed American journalism as controlled by rich people or corporations with the goal of influencing the masses, perhaps to push people toward their political views. The owners of the American press, instead, are primarily giving the masses what they want so that the owners can become even richer.
The days of structured, clean, simple, survey-based data are over.
Chapter 4: Digital Truth Serum
People will admit more if they are alone than if others are in the room with them. On sensitive topics, every survey method will elicit substantial misreporting.
Americans search for “porn” more than they search for “weather”.
Adults with children are 3.6 times more likely to tell Google they regret their decisions than are adults without children.
Countrywide, I estimate – using data from Google searches and Google AdWords – that about 5 percent of male porn searches are for gay-male porn. And how does this vary in different parts of the country? Overall, there are more gay porn searches in tolerant states compared to intolerant states.
Fully 20 percent of videos watched by women on PornHub are lesbian.
It turns out that wives suspect their husbands of being gay rather frequently. They demonstrate that suspicion in the surprisingly common search: “Is my husband gay?” “Gay” is 10 percent more likely to complete searches that begin “Is my husband …” than the second-place word, “cheating.” It is eight times more common than “an alcoholic” and ten times more common than “depressed.”
Among the top PornHub searches by women is a genre of pornography that, I warn you, will disturb many readers: sex featuring violence against women. Fully 25 percent of female searches for straight porn emphasize the pain and/or humiliation of the woman – “painful anal crying,” “public disgrace,” and “extreme brutal gangbang,” for example. Five percent look for nonconsensual sex – “rape” or “forced” sex – even though these videos are banned on PornHub. And search rates for all these terms are at least twice as common among women as among men. If there is a genre of porn in which violence is perpetrated against a woman, my analysis of the data shows that it almost always appeals disproportionately to women.
There are sixteen times more complaints about a spouse not wanting sex than about a married partner not being willing to talk. There are five and a half times more complaints about an unmarried partner not wanting sex than an unmarried partner refusing to text back.
Men’s anxieties [are generally about] how well endowed they are.
Do women care about penis size? Rarely, according to Google searches. For every search women make about a partner’s phallus, men make roughly 170 searches about their own.
Men make as many searches looking for ways to perform oral sex on themselves as they do how to give a woman an orgasm. (This is among my favorite facts in Google search data.)
My note on pg 131 regarding Obama’s reaction to the mass shooting in San Bernardino, California, Dec 2 2015: Sometimes doing nothing is more effective than doing something/anything.
Things that we think are working can have the exact opposite effect from the one we expect.
The overwhelming majority of black Americans think they suffer from prejudice. … On the other hand, very few white Americans will admit to being racist.
White Americans may mean well, this theory goes, but they have a subconscious bias, which influences their treatment of black Americans. Academics invented an ingenious way to test for such a bias. It is called the implicit association test.
implicit prejudice against young girls: Parents are two and a half times more likely to ask “Is my son gifted?” than “Is my daughter gifted?”
In the United States, the chances that two people visiting the same news site have different political views is about 45 percent. In other words, the internet is far closer to perfect desegregation than perfect segregation.
Facebook data is overwhelmingly biased, making it the worst data for determining what people really like.
The majority of content on the internet is nonpornographic. For instance, of the top ten most visited websites, not one is pornographic.
Don’t trust what people tell you; trust what they do.
When we lecture angry people, the search data implies their fury can grow. But subtly provoking people’s curiosity, giving new information, and offering new images of the group that is stoking their rage may turn their thoughts in different, more positive directions.
Chapter 5: Zooming In
My prediction is that we will find that many of our adult behaviors and interests, even those that we consider fundamental to who we are, can be explained by the arbitrary facts of when we were born and what was going on in certain key years while we were young.
American women in the top 1 percent of income live, on average, ten years longer than American women in the bottom 1 percent of income. For men, the gap is fifteen years.
More rich people in a city means the poor there live longer. Poor people in New York City, for example, live a lot longer than poor people in Detroit.
Your chances of achieving notability were highly dependent on where you were born.
Being born in San Francisco County, Los Angeles County, or New York City all offered among the highest probabilities of making it to Wikipedia.
Suburban counties, unless they contained major college towns, performed far worse than their urban counterparts.
On weekends with a popular violent movie, the economists found, crime dropped.
Search rates for “suicide” peak at 12:36 am. The data shows that the hours between 2 and 4 am are prime time for big questions: What is the meaning of consciousness? Does free will exist? Is there life on other planets? The popularity of these questions late at night may be a result, in part, of cannabis use. Search rates for “how to roll a joint” peak between 1 and 2 am.
Chapter 6: All the World’s a Lab
Facebook now runs a thousand A/B tests per day, which means that a small number of engineers at Facebook start more randomized, controlled experiments in a given day than the entire pharmaceutical industry starts in a year.
On the effectiveness of ads: But it’s not just that they work. The ads were incredibly effective. The average movie in our sample paid about $3 million for a Super Bowl ad slot. They got $8.3 million in increased ticket sales, a 2.8 to 1 return on their investment.
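My note: the 2.8 to 1 figure is just the quoted sales lift divided by the quoted ad cost. A quick check of the arithmetic, using only the two numbers from the passage (the variable names are mine):

```python
ad_cost = 3.0e6             # average Super Bowl ad slot, from the quote
added_ticket_sales = 8.3e6  # increased ticket sales, from the quote

# Ratio of extra sales to ad spend; this is what rounds to "2.8 to 1".
return_ratio = added_ticket_sales / ad_cost

# Extra sales minus the ad cost. Note this is revenue, not profit.
net_gain = added_ticket_sales - ad_cost

print(f"return ratio: {return_ratio:.2f} to 1")
print(f"net gain: ${net_gain:,.0f}")
```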
How a country responds to losing a leader.
Many economists previously leaned toward the view that leaders largely were impotent figureheads pushed around by external forces. Not so, according to Jones and Olken’s analysis of nature’s experiment.
Economists found that neighbors of lottery winners are significantly more likely to go bankrupt.
Regression discontinuity – a quasi-experimental pretest-posttest design that elicits the causal effects of interventions by assigning a cutoff or threshold above or below which an intervention is assigned. By comparing observations lying closely on either side of the threshold, it is possible to estimate the average treatment effect in environments in which randomisation is unfeasible.
Students on either side of the cutoff ended up with indistinguishable AP scores and indistinguishable SAT scores and attended indistinguishably prestigious universities.
My note: that is, the effect of going to a great high school versus a merely average one? Literally no difference.
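My note: to make regression discontinuity concrete for myself, here is a rough sketch on simulated data. It fits a separate straight line to the observations just below and just above the cutoff and reports the gap between the two lines at the cutoff. Everything here is an assumption for illustration: the cutoff of 60, the bandwidth of 10, the outcome formula, and the function names. It is not the study's actual data or estimator.

```python
import random

def fit_line(points):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope

def rd_estimate(data, cutoff, bandwidth):
    """Estimated jump in the outcome at the cutoff.

    Uses only observations within `bandwidth` of the cutoff and fits one
    line per side, with x re-centered so each intercept is the fitted
    value exactly at the cutoff.
    """
    below = [(x - cutoff, y) for x, y in data if cutoff - bandwidth <= x < cutoff]
    above = [(x - cutoff, y) for x, y in data if cutoff <= x <= cutoff + bandwidth]
    intercept_below, _ = fit_line(below)
    intercept_above, _ = fit_line(above)
    return intercept_above - intercept_below

def simulate(true_effect, n=20000, cutoff=60.0, seed=0):
    """Hypothetical students: entrance-exam score and a later test score."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        exam = rng.uniform(0, 100)
        admitted = exam >= cutoff  # the school admits everyone at or above the cutoff
        later = 400 + 3 * exam + (true_effect if admitted else 0) + rng.gauss(0, 20)
        data.append((exam, later))
    return data

# One world where the elite school adds nothing, one where it adds 50 points.
no_effect = rd_estimate(simulate(true_effect=0.0), cutoff=60.0, bandwidth=10.0)
big_effect = rd_estimate(simulate(true_effect=50.0), cutoff=60.0, bandwidth=10.0)
print(f"estimated jump, no true effect:   {no_effect:.1f}")
print(f"estimated jump, 50-point effect:  {big_effect:.1f}")
```

The finding quoted above corresponds to the first case: students just above and just below the line look the same afterwards, so the estimated jump is close to zero.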
Part 3: Big Data: Handle With Care
Chapter 7: Big Data, Big Schmata? What It Cannot Do
the curse of dimensionality
Before you bet your life savings on Coin 391, you would want to see how it does over the next couple of years. Social scientists call this an “out-of-sample” test.
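My note: a small simulation of why the out-of-sample test matters. I am assuming the setup behind "Coin 391" is many coins flipped daily and compared with whether the market went up or down; the coin count, day counts, and seed below are arbitrary choices of mine. With enough coins, the best one looks skilled on the days used to pick it, purely by chance, and then falls back toward 50 percent on fresh days.

```python
import random

def accuracy(predictions, outcomes):
    """Fraction of days on which the prediction matched the outcome."""
    return sum(p == o for p, o in zip(predictions, outcomes)) / len(outcomes)

def coin_experiment(n_coins=1000, n_days_in=500, n_days_out=500, seed=391):
    rng = random.Random(seed)
    total = n_days_in + n_days_out
    # True means "market up" (or "coin says up"); everything is pure chance.
    market = [rng.random() < 0.5 for _ in range(total)]
    coins = [[rng.random() < 0.5 for _ in range(total)] for _ in range(n_coins)]

    # In-sample: score every coin on the first block of days and keep the best.
    in_scores = [accuracy(c[:n_days_in], market[:n_days_in]) for c in coins]
    best = max(range(n_coins), key=lambda i: in_scores[i])

    # Out-of-sample: score only the chosen coin on days it was not picked on.
    out_score = accuracy(coins[best][n_days_in:], market[n_days_in:])
    return best, in_scores[best], out_score

best_coin, in_sample, out_of_sample = coin_experiment()
print(f"best coin: {best_coin}")
print(f"in-sample accuracy:     {in_sample:.3f}")
print(f"out-of-sample accuracy: {out_of_sample:.3f}")
```

The more variables you test (the curse of dimensionality), the more impressive the best in-sample result will look, and the less it means.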
The things we can measure are often not exactly what we care about.
In fact, in his book The Signal and the Noise, Nate Silver estimates that the Oakland A’s, a data-driven organization profiled in Moneyball, were giving up eight to ten wins per year in the mid-nineties because of their lousy defense. The solution is not always more Big Data. A special sauce is often necessary to help Big Data work best: the judgement of humans and small surveys, what we might call small data.
“School districts realize they shouldn’t be focusing solely on test scores,” says Thomas Kane, a professor of education at Harvard. A three-year study by the Bill & Melinda Gates Foundation bears out the value in education of both big and small data. The authors analyzed whether test-score-based models, student surveys, or teacher observations were best at measuring which teachers most improved student learning. When they put the three measures together into a composite score, they got the best results. “Each measure adds something of value,” the report concluded.
Chapter 8: Mo Data, Mo Problems? What We Shouldn’t Do
Recently, three economists looked for ways to predict the likelihood of whether a borrower would pay back a loan.
Terms used in loan applications by people most likely to pay back: debt-free, after-tax, graduate, lower interest rate, minimum payment
Terms used in loan applications by people most likely to default: God, will pay, hospital, promise, thank you
Someone who mentions God was 2.2 times more likely to default. This was among the single highest indicators that someone would not pay back.
Google searches related to criminal activity do correlate with criminal activity.
Conclusion: How Many People Finish Books?
Certainly, if you turned math into a game, students would have more fun, learn more, and do better on tests. Right? Wrong. Students who were taught fractions via a game tested worse than those who learned fractions in a more standard way.
The days of academics devoting months to recruiting a small number of undergrads to perform a single test will come to an end. Instead, academics will utilize digital data to test a few hundred or a few thousand ideas in just a few seconds. We’ll be able to learn a lot more in a lot less time.