Luis von Ahn, 29
Using “captchas” to digitize books
Carnegie Mellon University
Luis von Ahn is a pioneer of "captchas"--those strings of distorted characters that websites force you to recognize and type in order to establish that you are a person and not a malevolent computer. But he finds the technology's success a mixed blessing. "At first I was feeling quite proud of myself," says von Ahn, a 2006 MacArthur "genius grant" recipient who created captchas (an acronym for "completely automated public Turing test to tell computers and humans apart") for Yahoo in 2000 to thwart automated e-mail account registration, a tool of spammers. "But then I was feeling bad, because every time you solve a captcha, you waste 10 seconds." People around the world solve an estimated 60 million captchas every day; at 10 seconds apiece, that adds up to more than 150,000 wasted hours daily.
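The wasted-time figure follows directly from the two numbers quoted above. As a quick check, here is a minimal Python sketch of the arithmetic; the constant and function names are illustrative, and the only inputs are the 60 million captchas per day and 10 seconds per captcha cited in the text.

```python
# Figures quoted in the article; the names are illustrative only.
CAPTCHAS_PER_DAY = 60_000_000
SECONDS_PER_CAPTCHA = 10
SECONDS_PER_HOUR = 3600


def wasted_hours_per_day(captchas_per_day, seconds_each):
    """Convert a daily captcha count into hours of human time spent."""
    return captchas_per_day * seconds_each / SECONDS_PER_HOUR


hours = wasted_hours_per_day(CAPTCHAS_PER_DAY, SECONDS_PER_CAPTCHA)
print(f"{hours:,.0f} hours per day")  # prints "166,667 hours per day"
```

The result, roughly 167,000 hours, is consistent with the "more than 150,000" cited.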
Von Ahn, an assistant professor of computer science, is a leader in using human skills to make computers work better. For example, he created an online game in which players identify elements in photographs; their answers help improve image-search algorithms. He's now trying to put captchas to work in one of the epic efforts of the information age: digitizing millions of old books and making them searchable online.
An estimated 8 percent of words in these old books can't be read by the optical character recognition (OCR) software used to scan them. Von Ahn has teamed with the nonprofit Internet Archive to use captchas to help interpret those words. After all, he says, "while you are solving a captcha, you are solving a task that computers can't perform." So he created a tool, called "recaptcha," that pairs an unknown word with a known one. He distorts them both and puts a line through them--standard techniques for creating captchas. A user must decipher both captchas to access a site. The accurate typing of the known word serves the security purpose of captchas and adds a measure of confidence that the unknown word was identified correctly and can be used in place of the OCR's gibberish. Volunteers have begun deploying recaptchas, and the technique has been used to decipher two million words for the Internet Archive's book digitization effort. Recaptchas tap the joint power of people, networks, and computers in a way that should have a big impact, says Brewster Kahle, an Internet entrepreneur and cofounder of the archive: "It is like an army of ants building the Taj Mahal."
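The pairing logic described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not von Ahn's actual implementation: the function names, the case-insensitive comparison, and the rule that two matching readings are enough to accept a word are all hypothetical choices made for the example. The article says only that typing the known word correctly "adds a measure of confidence" in the unknown one.

```python
from collections import Counter


def normalize(text):
    """Compare answers case-insensitively and ignore stray whitespace (an assumption)."""
    return text.strip().lower()


def submit(known_word, typed_known, typed_unknown, votes):
    """Check the known word; if it is right, record the user's reading of the unknown word.

    Returns True when the user passes the security check.
    """
    if normalize(typed_known) != normalize(known_word):
        return False  # failed the control word, so the other answer is discarded
    votes[normalize(typed_unknown)] += 1
    return True


def resolve(votes, min_agreement=2):
    """Accept a reading once enough users agree on it (the threshold is hypothetical)."""
    if not votes:
        return None
    word, count = votes.most_common(1)[0]
    return word if count >= min_agreement else None


# Three users see the known word "morning" paired with a word the OCR could not read.
votes = Counter()
passed_first = submit("morning", "morning", "upon", votes)
passed_second = submit("morning", "mourning", "apon", votes)  # fails the control word
passed_third = submit("morning", "Morning", "upon", votes)
accepted = resolve(votes)
print(accepted)
```

Because the second user mistypes the known word, that user's guess at the unknown word is never counted; only answers from users who pass the security check contribute to the transcription.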
This image illustrates the difficulty that optical-character-recognition software can have in interpreting the content of older books. Luis von Ahn's recaptcha project is designed to help replace the OCR gibberish with the actual words.