Using Amazon’s MTurk to harvest 50 million Facebook profiles, and manipulate people

You may have heard the story that 50+ million Facebook user accounts were harvested by Cambridge Analytica (CA). Or maybe you heard it was Global Science Research (GSR); or was it Dr. Aleksandr Kogan, or that Nix guy?

Still, you may have heard it was an inside job, facilitated by Facebook staff, or a data breach from lax security. Or was it carried out with Peter Thiel’s data-mining company Palantir, or the daughter of Eric Schmidt, Google’s former CEO and Burning Man buddy?

What few people were discussing was Amazon’s role in facilitating the data access. Specifically, their microtask site called Mechanical Turk.

Few people know about the online community that chronicles what happened, which anyone can access. It tells a story from the perspective of Amazon’s microtask workers, who granted access to their Facebook accounts.

Within this community, you’ll hear from the people who participated and granted access to GSR. Some enjoyed the process. Others expressed disgust. But what’s bizarre is that over the course of a few years, 16% of the commenters consistently warned others that this project was exposing people’s identities, which violates Amazon’s privacy practices.

In this article, I’ll explain some of the platforms used, share my perspective on the issues, and talk about psychometric profiling, for evil or good.

Overview of the data mining process

I’ll first start by walking you through the data mining process before we get to the different platforms.

This is my understanding of how Dr. Kogan collected user profiles from Facebook:

  • Step 1. Target Mechanical Turk micro-task workers (Turkers) with Facebook accounts
  • Step 2. Ask the Turkers to complete a psychometric survey
  • Step 3. Ask the Turkers to grant access to their personal Facebook account
  • Step 4. Have Turkers reveal their identity
  • Step 5. Download data from the Turkers’ Facebook profiles, and their friends’ data too
  • Step 6. Withhold paying Turkers till they grant access to their Facebook profile
  • Step 7. Pay compliant Turkers; or invite non-compliant Turkers to complete the task
  • Step 8. Repeat about 200,000 times
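As a back-of-envelope check on how these steps could add up, the sketch below divides the headline figure of 50 million profiles by my ballpark of 200,000 Turkers. Both numbers are rough estimates, the function name is just for illustration, and the calculation ignores friends shared between Turkers, so treat the result as an order of magnitude only.

```python
def implied_friends_per_turker(total_profiles, turkers):
    """Average number of profiles each participating Turker would need to expose."""
    return total_profiles / turkers


# Rough estimates from this article, not measured values.
print(implied_friends_per_turker(50_000_000, 200_000))  # 250.0
```

In other words, each Turker who granted access would have needed to expose about 250 friends on average for the totals to line up.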

This is the process as I understand it. Next, I’ll discuss Mechanical Turk, and describe how researchers use them.


What’s Mechanical Turk

Amazon’s Mechanical Turk (MTurk) is a crowdsourcing platform where people pay virtual workers to perform micro-tasks.

The name may sound ridiculous, but it’s quite witty. The original Mechanical Turk was a chess-playing machine built in the 18th century, that dazzled people. However, it was a total hoax, with a person hidden inside the machine, controlling each chess move. In essence, the original Mechanical Turk was a bogus machine, powered by human intelligence.

Amazon’s Mechanical Turk works in a similar way, where their workers perform micro-tasks that are difficult for computers to perform. Typically they’re tasks that require human judgment, such as detecting sarcasm in a tweet or classifying photos.

The people who request work are “Requesters”, while those who do the work, are “Turkers”. Each microtask is called a Human Intelligence Task, otherwise known as a HIT.

MTurk offers an API for on-demand HITs. With Amazon’s massive pool of Turkers, Requesters are guaranteed a steady supply of workers, eager to perform HITs at the right price.

How academics use MTurk
Several years ago, scholars started using Mechanical Turk for research, including psychological questionnaires, experimental designs, and more. They found MTurk was a great place to recruit participants.

You might be thinking—hey, isn’t a population of micro-taskers um…sort of biased? That’s what I thought when my friend Don Steiny first told me about it. However, Mechanical Turk may be the world’s largest supplier of participants for scientific studies.

Given the widespread use by scholars, many academics evaluated whether MTurk was even a credible source for participant recruitment. Numerous studies report Turkers’ demographics and psychographics, and compare them to US census data. The scientific verdict was that Turkers offer a reasonably representative sample of the US population, at a fair price. It gave researchers access to panels that were previously only available through massive polling companies, and in the early days, it was dirt cheap.

However, there are some major pitfalls and risks attached to studies in MTurk. First, you need to be street-smart, because MTurk is not a survey panel group; it’s a microtask portal. It’s full of Turkers who aim to make as much money, as fast as they can.

From my experience, about 20% of respondents only skim your survey questions, and some completely fake it. So to use it well, you need to get good at setting traps to catch cheaters, and there are many scientific papers on how to catch inauthentic responses. If you don’t do this, your data will be junk.
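As a rough sketch of what such traps can look like, the Python below screens responses with two common checks: an instructed-response item (for example, "Select 'strongly agree' for this question") and a minimum completion time. The field names, the trap answer, and the 120-second threshold are all made up for illustration; they are not taken from any particular study.

```python
def is_suspect(response, expected_answer="strongly agree", min_seconds=120.0):
    """Return True if the response fails the trap item or was finished implausibly fast."""
    answer = response["attention_answer"].strip().lower()
    failed_trap = answer != expected_answer
    too_fast = response["seconds_taken"] < min_seconds
    return failed_trap or too_fast


def screen(responses):
    """Split responses into (kept, flagged) lists, preserving order."""
    kept, flagged = [], []
    for response in responses:
        (flagged if is_suspect(response) else kept).append(response)
    return kept, flagged


# Made-up example data: three workers, two of whom trip a check.
sample = [
    {"worker_id": "W1", "attention_answer": "Strongly agree", "seconds_taken": 600},
    {"worker_id": "W2", "attention_answer": "agree", "seconds_taken": 600},
    {"worker_id": "W3", "attention_answer": "strongly agree", "seconds_taken": 45},
]
kept, flagged = screen(sample)
print([r["worker_id"] for r in kept])     # ['W1']
print([r["worker_id"] for r in flagged])  # ['W2', 'W3']
```

Flagged responses are better reviewed by hand than rejected automatically, since a rejection costs the Turker both pay and reputation.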

Using MTurk for research is a bit like shopping in the most dangerous part of town. You’ll find the best deals, if you’re street-smart, and can get out.

Perhaps the most relevant point of this story is that MTurk also lets you recruit people with a Facebook account. My guess is that GSR would have used this feature to target Turkers with a Facebook account.


Amazon’s management run MTurk like absentee landlords, and the interface feels like a ghetto property that hasn’t been updated in 20 years.

They keep MTurk working and intervene to fix problems, but they don’t appear to value resolving Turker disputes. Since there’s no real Turk-Police, Turkers can be subject to all sorts of abuse, with few options for escalating or resolving them.

A few years back, this situation got so bad that Turkers set up Turkopticon, a plugin and community where Turkers could comment on Requesters and HITs. It’s a place where those who have experienced something good or bad can share their views with others.

Turkopticon is independent of Amazon, so the staff at Amazon do not necessarily know what goes on in there. Though I’d be shocked if they didn’t monitor it.

The blame game and risk management
It’s easy to use the Turkopticon data to point the finger-of-shame at Dr. Kogan. So here’s my perspective on where the finger needs to be pointed.

Dr. Aleksandr Kogan set up a company, GSR, which carried out the data mining work, and sold that data to CA. In the fallout of the scandal, many of the actors turned on each other and started playing the blame game.

CA blamed Dr. Kogan and GSR for violating Facebook policies. Dr. Kogan argued that CA was intimately involved in the process, and was just scapegoating him.

What’s the legal basis for this bi-directional scapegoating? I assume that each company defined separate corporate entities, and signed contracts, which limited their legal liability.

I worked with some amazing people during my 7 years in the United Nations. However, one unethical manager once advised me to always bring in contractors for risky projects. If things work out, you can steal their credit; but if they fail, you can blame them. Either way, your job is safe.

I totally disagree with the ethics here but know that many corporate arrangements have this type of risk mitigation strategy in mind.

The Turkopticon data only refers to Dr. Kogan and GSR. However, I have a hard time believing that CA, and its parent company, had no clue what was going on.

It’s beyond my comprehension how staff at corporations that specialize in data services had no clue about common data collection norms and terms at Amazon and Facebook. This is so far beyond my comprehension that I can only speculate that perhaps Dr. Kogan was a patsy, who was willing to go first into the risky minefield.

In other words, CA was happy to contract GSR for data extraction services, as a separate legal entity, which mitigated their corporate risk. But what made this deal sweeter, is they were in a position to scapegoat GSR for any breach and keep the data too.

I suspect CA’s managers threw Dr. Kogan under the bus to protect their reputation, and vice versa, Dr. Kogan threw them under the bus to protect his. One said, “I didn’t know they gave us unethically obtained data”; the other said, “I didn’t know they used our data for unethical purposes.”

Apart from the breach of Amazon’s rules, there was nothing illegal about the data collection process. It was legal and within the terms of use. Facebook did not have a problem with GSR or CA working independently, but when GSR handed the Facebook data to CA, that’s when Facebook intervened.

My point is this: don’t use this evidence to single out Dr. Kogan alone. The Turkopticon evidence places the smoking gun in his hand, but I have a hard time believing that he pulled the trigger on his own.

Mechanical Turk’s requirement to protect worker identity

Finally, MTurk has several rules, but a key one is the requirement that you never ask a Turker to reveal their personal identity. When I first heard that MTurk was used to collect the Facebook data, it didn’t make sense, because the minute a Turker logs into Facebook, you’ve linked their anonymous Turker ID to them.

Even if the study had been carried out by Cambridge University (which it wasn’t), this would have still been a problem, as you’re not supposed to reveal Turker identities under any circumstances.

As far as I can tell, the only actual violation was against Amazon’s rules.

Analysis of Turkopticon

I found two accounts in Turkopticon, linked to the Facebook collection process. One is from Dr. Kogan’s personal account with 104 Turker responses; the other is for GSR with 49 Turker responses. Together, there are 153 comments, running from 5 May 2013 to 15 December 2015.

To access these accounts, register at Turkopticon. Then use these exact search terms to find the two accounts: “Aleksandr Kogan” and “Global Science Research”.

For this analysis, I merged the data from both accounts, Aleksandr Kogan and Global Science Research. The Turkers ranked items on a 5-point scale, which I averaged and turned into percentages.
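The conversion from ratings to percentages can be sketched as follows. This is an illustration with made-up ratings, not the actual Turkopticon data, and it assumes the simplest mapping: the mean rating divided by the scale maximum of 5. A rescaled variant, where a rating of 1 maps to 0%, would give lower figures. The function and variable names are hypothetical.

```python
from statistics import mean


def rating_to_percent(ratings, scale_max=5):
    """Average ratings on a 1..scale_max scale and express the mean as a percentage of scale_max."""
    if not ratings:
        raise ValueError("no ratings to average")
    return mean(ratings) / scale_max * 100


def merge_and_score(*accounts):
    """Pool the ratings from several accounts, then score the pooled list."""
    merged = [rating for account in accounts for rating in account]
    return rating_to_percent(merged)


# Made-up pay-fairness ratings for two accounts (not real data).
kogan_pay = [4, 3, 5, 2]
gsr_pay = [4, 4]
print(round(merge_and_score(kogan_pay, gsr_pay), 1))  # 73.3
```

Pooling before averaging weights each comment equally, so the account with more comments has more influence on the merged score.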

When interpreting the responses, remember that this is a synthesis of the two accounts, so it’s possible that some of the responses were for work that Dr. Kogan was performing independently.

Also, people tend to comment on these types of sites when they have bad experiences, so the picture may be more negative than average. Disputes may also have been settled by email outside these discussions, so you won’t have the full picture, only a partial view.

Finally, we have no idea if Amazon or GSR staff knew about these discussions in Turkopticon.

Overall impressions

Overall, Turkers appeared to hold an okay, but not great impression of Dr. Kogan or GSR’s HITs.

Both profiles are full of complaints about not being paid, a lengthy process, and being asked to redo surveys. Some people reported having good experiences. It sounds like GSR must have had someone manually process work at key times, so that some people may have been paid immediately, while others had to wait.

Over time, Turker experience improved a bit, but their complaints of violating MTurk’s privacy requirements remained constant, over a few years.

Overall, GSR was probably overloaded beyond capacity, as they received a 48% approval rating for responding, and 60% rating for approving work, which was possibly the biggest complaint. Fairness in approving or rejecting work was 85%, which I guess is ok.

I’ve seen estimates that Turkers make about $6 per hour. Many in the academic community felt the low fees were exploitative, so there was a move by researchers to pay minimum wage as fair compensation.

If you visit the Turkopticon accounts, you’ll see that GSR was paying small fees. The workers rated his pay fairness at 71%, which is okay-ish for the platform.

Here’s what they had to say:

Survey on wellbeing for $2. Can’t recall how long it took but I think it was under 30 minutes. Approved next day
Jun 12 2014 | hs
Account: Global Science Research

I remembered this guy and almost didn’t click on this latest HIT, because he was so disorganized in a prior communication, IMO. Now he wants to download my personal FB info? Get lost.
Feb 25 2014 | Giraffe
Account: Aleksandr Kogan

HIT took 4 mins and paid .50, approved/paid 8 days later.
Dec 11 2013 | Thomas
Account: Aleksandr Kogan

Paid within 48 hours, pay could definitely be better but over all it was a great HIT
Jun 25 2013 | sf1…@t…
Account: Aleksandr Kogan

This guy says that he has no records that I completed his survey. I did everything as it said in the hit even pasted the confirmation code. I am so sure because I paste the codes on a word document before submiting anything. Then he says he couldn’t find it, lol. I see he does the same to everyone that works on his HITs. He is a professor at a University? God only knows what he is teaching his students. For me he just another SCAMMER that rejects our work so he doesn’t have to pay us.
Jun 22 2013 | msor…@b…
Account: Aleksandr Kogan

Quality of scientific instruments

One of the Turkers felt that Dr. Kogan’s tools weren’t up to scratch, and took pot-shots at his questionnaire.

When interpreting this, remember that MTurk has a large population of people who do lots of surveys. After a few years, some of these Turkers develop a keen eye for high-quality questionnaires.

Below is a fair point about an error in the ordinal ranking of scale categories. This would cause some damage, but it would probably be manageable bias if it was not too widespread.

Here’s what one Turker had to say.

…Also, the survey was carelessly done, which makes me believe it’s all a setup to collect information about a person for really cheap. For example, on a scale from “netural” to “strongly agree”, the order should be neutral=>slightly agree=>agree=>strongly agree. The survey listed it as neutral=> agree=>slightly agree=>strongly agree. The survey sometimes did this for the “disagree” scale, too.
May 12 2014 | AIW
Account: Aleksandr Kogan

Transparent representation

Both accounts used real names, so it appears that GSR and Dr. Kogan were acting transparently. In other words, they were representing themselves, and the Turkers were able to track down the study.

There were a few references to Cambridge University early on, then nothing, and later to the University of Toronto. Everything looked fairly transparent.

Here are some examples, that show they were transparent and could be identified:

I barely got through the survey in time. I seem to be fortunate in that I got paid. I’ve experienced something like this before. So many hits were rejected that it was likely the rejections were some kind of mistake. When enough of us complained the rejections were withdrawn and we got paid, so complaining may get the requester to look at the hits again. Aleksandr Kogan is a professor in the Psychology Dept. at the University of Cambridge.
Jun 21 2013 | Humpty Dumpty
Account: Aleksandr Kogan

University of Cambridge 30-35 minute psychology study investigating personality and the evaluation of media clips
HIT pending after 15 days
EDIT 8/25 Started another survey and was informed in the survey that MS Silverfish needs to be downloaded & had to return
$1.20 / 45m
Jun 21 2013 | capecoralhobo
Account: Aleksandr Kogan

This guy says that he has no records that I completed his survey. I did everything as it said in the hit even pasted the confirmation code. I am so sure because I paste the codes on a word document before submiting anything. Then he says he couldn’t find it, lol. I see he does the same to everyone that works on his HITs. He is a professor at a University? God only knows what he is teaching his students. For me he just another SCAMMER that rejects our work so he doesn’t have to pay us.
Jun 22 2013 | msor…@b…
Account: Aleksandr Kogan

“University of Toronto study”
27 minutes for $1.88, approved the same day I did it and paid by the next day.
This review was edited by the author Tue Nov 03 11:11 PST.
Nov 03 2015 | kimadagem
Account: Global Science Research

The breach of trust

Dr. Kogan made this promise to safeguard Turkers’ identities with this statement: “Please note that we take several precautions to ensure all of your data stays anonymous and safe, and it will only be used for research purposes.”

This is perhaps the most important promise that was made to the public, the promise that gave users confidence, that allows them to trust and share their personal data. This is a standard type of commitment used in academic research.

This was the commitment, on which the breach of trust hinges. However, Turkopticon only has one reference to this promise, on 3 Mar 2014. After this date, I didn’t see any other references to it. I assume they removed it, and put it on the app, which I have not yet seen.

It makes no sense to use this ethics statement in MTurk, because MTurk’s policy forbids revealing Turkers’ identities, so there should be nothing to protect.

Here’s the reference to the promise.

Violates MTurk Terms of service, this requester explicitly states “provide our app access to your Facebook so we can download some of your data—some demographic data, your likes, your friends list, whether your friends know one another, and some of your private messages. Please note that we take several precautions to ensure all of your data stays anonymous and safe, and it will only be used for research purposes. You will receive $.50 for participating. ”
Mar 03 2014 | joysinger
Account: Aleksandr Kogan

Chaotic process

There are numerous complaints, of messy management, with lots of lost payments, rejected work, and asking the Turkers to redo surveys that they claim to have completed.

Overall, the speed of communication was rated 48%. But in some cases, the Turkers received fair treatment and commented on it. It sounds like they were a bit overwhelmed.

“Thank you for being interested in our research! However, when we compared the MTurk ID entered at the end of the survey to the records on MTurk itself, we did not find your record of participation. There might be some technical problems. Or you may enter your MTurk ID with typo(s). If you still wish to get your Amazon Mechanical Turk credit, please re-do the survey (as soon as possible)…”

Asking me to do the survey twice, so I assume he’s trying to pad his results. I copied and pasted my MTurk ID, so pretty sure it was fine. Others have gotten this message as well.
Feb 11 2014 | soulo…@y…
Account: Aleksandr Kogan

Using a stress-inducing dark pattern

There’s some evidence that they employed dark patterns at the beginning, to pressure the Turker community into giving up access to Facebook.

If you haven’t heard of “dark patterns”, they’re interactive design patterns that use digital psychology in a manipulative way. My new definition of a dark pattern is an interactive design pattern that triggers moral outrage in a sizeable population of users.

It sounds like early iterations of the survey employed a dark pattern, where people started the survey, and then mid-way through, they were told they’d have to give up access to their Facebook profile.

According to the Turkers, Dr. Kogan’s team combined bait and switch with loss aversion. Here’s how this dark pattern works. The Turker decides to complete a survey for money, starts it, and then midway through they’re asked to grant access to their Facebook account. At this point, the person has to either lose their time and money, or violate their privacy.

For anyone opposed to this, psychologically they’ve been placed in a loss-averse dilemma, where they could choose to get punished by losing money, or they could choose to get punished by losing their privacy. Have your pick. Enjoy.

For those who didn’t care about their Facebook privacy, there’s no dilemma. But for those who cared, there’s a good chance this type of design pattern would trigger a stress-inducing dilemma. This may explain all the moral outrage, in the comments.

Evidence suggests the GSR team employed this dark design pattern in the beginning, and then later replaced it, with an incentive system, offering to pay bonuses, for anyone who opted to provide Facebook access.

Here’s an example of the earlier bait and switch tactic:

Survey on personality, wellbeing, demographics + facebook
Without warning in the description the hit ask you to violate mturk tos by giving them access to your facebook account. Since I had to return the hit I was not paid for my time so giving them a 1 for payment. Giving them a 1 for being fair since obviously they were not. A 1 for promptly paying since they did not pay me for my time lost due to their violation of mturk tos.
Jun 09 2014 | DisplayName
Account: Aleksandr Kogan


Violating Amazon’s MTurk anonymity principle

Turkopticon has a button for Turkers to press, which adds a badge to their comment, “VIOLATES MTURK TERMS OF SERVICE”.

Overall, 16% of the Turkers who commented clicked on this button, highlighting violations of Amazon’s privacy practices. In total, there were 25 complaints out of the 153 comments, over 2.5 years, running from 21 June 2013 to 15 Dec 2015.
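The 16% figure follows from the counts in this analysis: 25 flagged comments measured against the 153 comments across both accounts. A quick check, with a helper name made up for this sketch:

```python
def flag_share(flagged, total_comments):
    """Percentage of comments that carried the terms-of-service violation badge."""
    return flagged / total_comments * 100


print(f"{flag_share(25, 153):.1f}%")  # 16.3%
```

Note that the base is the Turkers who left a comment in Turkopticon, not everyone who took the HIT.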

In MTurk, it’s a BIG-no-no to do anything that forces Turkers to reveal their identity. Numerous Turkers pointed this out, flagged it, complained, argued, and sent warnings to others.

This is clear evidence that Amazon’s privacy requirements were violated for a few years. Whether Amazon or GSR knew about that, they can always play the plausible deniability card by saying “I didn’t know”.

I’d ballpark guess that 200,000 Turkers were involved in the Facebook data collection job. About 16% of those who commented in Turkopticon complained. And given that Turkers normally send emails to discuss problems, I have a hard time believing that nobody in Amazon or GSR was aware of the violation.

Here is a discussion thread, that broke out:

Thread, with Turker wanting Dr. Kogan banned
VIOLATES MTURK POLICIES. Asked to access my facebook and download their software. I HOPE AMT bans all turkers that have reviewed this hit by admitting they too have violated the policies by participating.
May 10 2014 | ASU ALL DAY!

Banning all people who decided to take the survey. That’s a bit extreme don’t you think?
May 10 2014 | alpine

IF you WILLINGLY & KNOWINGLY participate in a crime in this country, will they arrest you? Will they fine you and give you jail time? How is this any different. Unless your one of those that turns a blind eye to crime, so long as you can get in on the action. A POLICY is a POLICY, and if turkers participate in violations, then they RUIN the legitimate experiences for those turkers that DO FOLLOW THE RULES. SO NO ITS NOT EXTREME. ITS RATHER APPROPRIATE.
May 13 2014 | ASU ALL DAY!
Account: Aleksandr Kogan

Nothing on Facebook is private anyway. If you want privacy, don’t create a Facebook account. Good HIT. Approved next day.
Jun 09 2014

Privacy and moral disgust

In 2015 the Guardian published an investigative journalism article about GSR using MTurk to drive Facebook data mining, for political advertising.

Amazon leadership say they had no idea what was happening before, and stopped GSR once they learned about it. But what I do know is that Amazon set up a system where Turkers can be abused, with little power to stop that abuse. Because of this, the Facebook situation happened.

Since Dr. Kogan’s data mining was perfectly legitimate in Facebook until he handed the data to a 3rd party, Amazon was the only company whose policies were blatantly violated from day one.

I suspect there would have been knowledge of what was happening, but it wasn’t a huge deal, because it’s actually common practice.

Don’t ask, don’t tell
Despite all the finger-pointing about unsavory privacy practices, my experience in data-driven technology is that it operates like the policy “Don’t ask, don’t tell”. People and companies do it all the time, but don’t like to talk about it.

Profiling is no longer sneaky spycraft. It’s widespread and institutionalized. Full Contact gives you 500 FREE profiles every month via their API. Or if you have over $100K per year, try established companies like Oracle and Adobe, who will sell you people’s device cookie IDs so you can send them programmatic ads.

Media are spying on you…to survive
On the publishing side, what I find ironic is that some of the most moralistic articles on protecting privacy are published on news sites loaded with identity trackers. Many are specifically tied to programmatic advertising platforms that offer high degrees of targeting.

Just by being on these news sites, there’s a good chance that you’ll be added to a programmatic advertising list of cookie IDs, and then if you authenticate, you may be added to a list with personal details, that’s more valuable for them to sell. Get the Ghostery plugin if you want to take a look.

I don’t think this is a bad thing because it helps pay their bills. After all, the media produces content, and advertisers pay them to advertise. I don’t mind being tracked to help media companies pay the bills, because I’d rather help maintain what’s left of our independent media.

Facebook harvesting your contacts for growth
Also, I don’t believe companies like Facebook would have existed without data mining massive volumes of University contact lists in their upstart years. A recent memo leaked from Facebook, focused on ruthless growth, even with questionable growth techniques.

I also don’t believe they can grow without similar tactics. I consider Facebook’s “messenger” app to be the darkest of the dark patterns. I suspect they selected the name “messenger” to confuse people over the native app called “messaging”. They finally added a small logo, after keeping it ambiguously unbranded for some time. I see messenger as a chameleon trojan horse, that slowly moves towards its goal, your contact list.

Google scraping, to take over the planet
The world’s biggest data miner is Google. Their PageRank algorithm works because it extracts judgment from other people’s data. Google’s crawler is not a “crawler”. It doesn’t walk around your site. It extracts data from your site, and stores that data in a database.

Normally, Google can scrape you, but you cannot scrape Google. However, I met the head of an advertising intelligence company, and the first thing I asked her was, “How does your company even exist because you’re using data scraped from Google?”

She told me that Google likes what they’re doing, and though they’re in violation of Google’s terms, Google tolerates them because they add value to its products.

Free-flowing data
Tim Berners-Lee, who invented the World Wide Web, considered HTML a second-rate language. His original vision was to make the markup language more like XML, so that all web data could be exchanged freely, in a semantic web.

This never fully worked out according to plan, but RSS, APIs, mashup services, and great scraping tools have enabled numerous innovations that can only exist in an environment of open data. Many of the big tech companies owe their existence to this ecosystem, but once in power, they try to close those data doors to others.

Data acquisition operates a bit like the French civil law system. If no rule explicitly forbids it, then it’s okay and you’re probably safe, legally. But if people feel moral disgust, then you have failed to pass an ethical line. Ideally, you want to pass both lines.

I suspect the data access issue would have been ignored if GSR didn’t leave the Facebook ecosystem, and use the data for politics.

For those who feel a visceral disgust reaction to Donald Trump, it’s difficult to accept that their data was used to help get him elected. Those who support Trump may feel the same way about the privacy breach, but they won’t be left feeling that their data was used for self-harm.

Perhaps this is why there’s less talk about broken laws and more about broken trust.

  1. Rshila 4 April 2018 at 00:03

    Amazing! So much we do t know, thanks for the behind the scenes

  2. Gillian 4 April 2018 at 06:34

    That was a great analysis, Brian! Thanks for sharing it.

  3. Arash Samimi 4 April 2018 at 09:46

    Very insightful. This is one of the few unbiased and evidence-based articles that I have read about the recent data breach. Thanks Brian.

