Crowdsourcing Speech Ratings with Amazon’s Mechanical Turk

Tara McAllister Byun

DOI: 10.1044/cred-meas-tth-001/wHr_ub8Tvss

An overview and advice for adding crowdsourced data collection methods to the CSD research toolbox.

Crowdsourcing actually has been used in a lot of research fields, but it hasn’t made its way into speech pathology just yet. So I teamed up with a few other researchers in speech pathology who had experience with crowdsourcing to try to get the word out a little bit.

I became familiar with it in linguistics, but it’s also used extensively in psychology. The idea is that you’re doing a behavioral experiment, so you want to get some sort of data, like linguistic grammaticality judgments, or maybe you want to look at a priming effect in psychology. And you have to recruit subjects. You have to bring them into your lab, you have to pay them, you have to go through the whole process, and it can be very challenging. It’s slow and labor intensive to bring a large number of participants into your laboratory.

So someone in psychology said, “Wouldn’t it be nice if I could put my experiment online and recruit a lot of people online to do my behavioral experiment, my priming task, or to give me a linguistic judgment or a perceptual rating?” And, in fact, there are established platforms for crowdsourcing.

An Overview of Amazon’s Mechanical Turk

There are a few different ones, but the one that I use, and the one that’s most commonly used in research, is Amazon’s crowdsourcing platform. It has an interesting name: it’s called “Amazon’s Mechanical Turk.”

So the original Mechanical Turk was a supposed chess-playing automaton in the late 1700s. It was claimed to be a robot that could play chess and defeat chess masters, but of course, at that time, the technology didn’t exist for that. It was later revealed that there was a human chess master hidden in the works of the machine, operating it. It looked like a robot, but it was actually a human controlling it secretly.

And when Amazon developed its crowdsourcing platform they called it Mechanical Turk because it acts like a computer interface. You as the experimenter feel like you’re interacting with a computer and sending commands and getting your data processed by a computer, but there’s actually human intelligence running the show behind the scenes. They call it, “artificial, artificial intelligence.” It looks like a computer, but it’s really humans doing it.

You can post a task on Mechanical Turk, and there’s a huge population of individuals who work on Mechanical Turk who will accept a job and perform a task for you, and you pay them, usually a small amount of money, for these tasks. The tasks tend to be short, repetitive, simple things. And they are usually things that humans do better than computers, like identifying objects in photographs or, in our case, rating speech sounds: things that you need humans to do, and for which you’d really love to have access to a large pool of individuals.

Using Crowdsourcing for Speech Ratings

I’ve been using it for speech data rating. In my treatment research, I collect speech samples before, during, and after treatment. And I need to get them all rated by blinded listeners in order to tell whether my participants are making progress over the course of their treatment. I’ve had challenges in terms of how hard it is to find expert raters, how long it takes to get data rated by them, and how expensive it can be. So, I was wondering, would it be possible to get ratings from listeners on this crowdsourcing platform?

Of course, I don’t expect the average person on Amazon’s Mechanical Turk to have the level of expertise of a speech pathologist, in terms of giving a perceptual judgment to a child’s /r/ sound. But the really interesting thing about crowdsourcing is you can take a non-expert and compare them to an expert, and at the individual level, the expert is definitely going to outperform the non-expert. But if you take a crowd of non-experts, and aggregate across their responses, then that can actually converge with the type of response that you get from an expert because the noise cancels out. It’s what people describe as “the wisdom of crowds.” If you have enough non-expert judgments, it can perform like an expert’s rating.
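
As a rough illustration of why the noise cancels out, here is a minimal sketch in Python. It is not taken from any of the studies discussed here, and it rests on simplifying assumptions: every listener makes a binary judgment, every listener is independently correct with the same probability, and the crowd's answer is a simple majority vote. The function name `majority_accuracy` and the 0.7 accuracy figure are hypothetical and chosen only for illustration.

```python
from math import comb


def majority_accuracy(n_listeners: int, p_correct: float) -> float:
    """Probability that a strict majority of n independent listeners is correct.

    Assumes each listener is right with the same probability p_correct and
    that listeners do not influence one another (a simplification).
    """
    if n_listeners % 2 == 0:
        raise ValueError("use an odd number of listeners to avoid tied votes")
    needed = n_listeners // 2 + 1  # smallest number of votes that forms a majority
    return sum(
        comb(n_listeners, k) * p_correct**k * (1 - p_correct) ** (n_listeners - k)
        for k in range(needed, n_listeners + 1)
    )


if __name__ == "__main__":
    # Hypothetical individual accuracy of 0.7 for a non-expert listener.
    for n in (1, 3, 9, 25):
        print(f"{n:>2} listeners: majority correct with probability "
              f"{majority_accuracy(n, 0.7):.3f}")
```

Under these assumptions, the majority's accuracy rises as listeners are added, even though no individual listener gets any better. Real listeners are unlikely to be fully independent or equally accurate, so this should be read as intuition rather than as a prediction for any particular data set.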

We were wondering, can we get a large number of naive listeners on Mechanical Turk to rate our stimuli and approximate the performance of our expert listeners? And, in fact, I did a study using children’s /r/ sounds that compared a sample of expert listeners and a significantly larger sample of naive listeners recruited through Amazon’s Mechanical Turk. We found that if we had a sample of at least nine naive listeners from Mechanical Turk, and we aggregated across their responses, then their aggregated ratings converged with the standard that you would normally expect for publication in our literature.
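
To make "aggregating across responses" concrete, here is a minimal sketch in Python of one way it could be done for a binary rating task. The token names, the ratings, the expert labels, and the helper functions `aggregate` and `percent_agreement` are all hypothetical; the data are made up for illustration and this is not necessarily the aggregation or agreement measure used in the study described above.

```python
# Hypothetical data: nine naive listeners rate each token as
# 1 ("correct /r/") or 0 ("incorrect /r/").
naive_ratings = {
    "token_01": [1, 1, 0, 1, 1, 1, 0, 1, 1],
    "token_02": [0, 0, 1, 0, 0, 0, 0, 1, 0],
    "token_03": [1, 0, 1, 1, 0, 1, 1, 0, 1],
}

# Hypothetical expert judgments for the same tokens.
expert_labels = {"token_01": 1, "token_02": 0, "token_03": 1}


def aggregate(ratings):
    """Return (proportion of listeners rating 'correct', majority label)."""
    proportion = sum(ratings) / len(ratings)
    return proportion, int(proportion > 0.5)


def percent_agreement(crowd_labels, expert):
    """Share of tokens where the crowd's majority label matches the expert's."""
    matches = sum(crowd_labels[token] == expert[token] for token in expert)
    return matches / len(expert)


# Collapse each token's nine ratings into a single crowd label.
crowd_labels = {token: aggregate(r)[1] for token, r in naive_ratings.items()}

if __name__ == "__main__":
    for token, ratings in naive_ratings.items():
        proportion, label = aggregate(ratings)
        print(f"{token}: {proportion:.2f} rated correct, majority label {label}")
    print(f"agreement with expert: {percent_agreement(crowd_labels, expert_labels):.2f}")
```

The proportion of listeners rating a token as correct can also be kept as a continuous score rather than collapsed to a majority label, which preserves information about how borderline a token sounded to the crowd.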

Bringing Crowdsourcing into the CSD Toolbox

So, it does seem to be possible to get ratings from these crowdsourcing platforms that are comparable to the ones that we would actually get from trained speech pathologists. That is pretty remarkable when you think about the fact that you’re taking individuals who don’t have particular expertise in or knowledge of this task, but just by combining responses from a large number of individuals, you get performance that exceeds what any one of those individuals could have contributed.

Suzanne Adlof and Michelle Moore, who are co-presenting with me, and I feel strongly about this. We have looked into the crowdsourcing literature in linguistics and psychology, and we’ve really been struck by how well-established it is there.

It’s just a tool in the toolbox if you’re a behavioral psychologist or a linguist. If you need something rated, if you need someone to generate a pool of words that you might want to use as stimuli, one of the alternatives you’ll think of is, “why don’t we put it on Mechanical Turk?” And there’s no reason that researchers in communication disorders can’t benefit from it in exactly the same way. Because many of the questions we’re asking are precisely the same questions. So, we just want to get the word out that this is another tool you can consider that could really make our research faster and easier.

Case Studies and Getting Started

The following materials from “Bringing the Crowdsourcing Revolution to Research in Communication Disorders,” a seminar session presented at the 2014 ASHA National Convention, are available under the “Supplemental” tab on this page.

An overview of the benefits and points to consider when using crowdsourcing in research, as well as step-by-step screenshots for navigating Amazon’s Mechanical Turk.

Michelle W. Moore, West Virginia University

Study purpose: To investigate lexical stress influence on word class-specific deficits in expressive-based aphasia

Experimental tasks: Single word reading, sentence completion

Tara McAllister Byun, New York University

Study purpose: What is the level of agreement between crowdsourced ratings of speech and ratings obtained from more experienced listeners?

Experimental tasks: Binary rating task of speech accuracy

Suzanne Adlof, University of South Carolina

Study purpose: Which contexts are most “nutritious” for vocabulary instruction?

Experimental tasks: Rating context informativeness

Websites of Interest

Amazon Mechanical Turk (main website)

Experimental Turk: A blog on social science experiments on Amazon Mechanical Turk (A researcher-managed blog tracking evidence and tutorials for using Amazon Mechanical Turk as an online subject pool for experiments.)

Further Reading

Crump, M.J.C., McDonnell, J.V., Gureckis, T.M. (2013). Evaluating Amazon’s Mechanical Turk as a tool for experimental behavioral research. PLoS ONE, 8. 10.1371/journal.pone.0057410.

Mason, W., & Suri, S. (2012). Conducting behavioral research on Amazon’s Mechanical Turk. Behavior Research Methods, 44, 1–23.

McAllister Byun, T., Halpin, P.F., Szeredi, D. (2015). Online crowdsourcing for efficient rating of speech: A validation study. Journal of Communication Disorders. 10.1016/j.jcomdis.2014.11.003.

Paolacci, G., Chandler, J., & Ipeirotis, P.G. (2010). Running experiments on Amazon Mechanical Turk. Judgment and Decision Making, 5, 411-419.

Gibson, E., Piantadosi, S., & Fedorenko, K. (2011). Using Mechanical Turk to obtain and analyze English acceptability judgments. Language and Linguistics Compass, 5, 509–524. 10.1111/j.1749-818X.2011.00295.x

Sprouse, J. (2011). A validation of Amazon Mechanical Turk for the collection of acceptability judgments in linguistic theory. Behavior Research Methods, 43, 155-167.


Tara McAllister Byun
New York University

The content of this page is based on selected clips from a video interview conducted at the ASHA Convention.

Additional digested resources and references for further reading were selected and implemented by CREd Library staff.

Copyright © 2015 American Speech-Language-Hearing Association