Headlines have claimed AIs outperform humans at ‘reading comprehension,’ but in reality they’ve got a long way to go
Computers are built to process data, but there’s a particular form of information so rich and dense in meaning that it’s beyond the full comprehension of even the most advanced AI. It’s also one that you and I process intuitively and deal in every day: language.
Understanding the written and spoken word is a big and important challenge for computer scientists. This month, a small milestone was passed when a pair of teams from Microsoft and Alibaba independently created AI programs that can outperform humans in a reading comprehension test. As you might expect, this news resulted in a flurry of coverage, with headlines like “Robots can now read better than humans, putting millions of jobs at risk,” and “Computers are getting better than humans at reading.”
But of course, it’s not as simple as that.
Technically, these headlines aren’t wrong. But, like a lot of coverage of artificial intelligence, they exploit ambiguities to exaggerate things to the point that they become incredibly misleading. (It’s ironic, considering the subject at hand is reading comprehension.) Computers can now outperform humans at reading, it’s true, but only at one very specific and constrained task — which even the creators say was never designed to capture the full complexity of what we understand as “reading.”
As is often the case in AI, the test is actually a dataset, compiled by a group of Stanford University computer scientists that includes Percy Liang and Pranav Rajpurkar. It’s called the Stanford Question Answering Dataset (or SQuAD for short), and consists of more than 100,000 pairs of questions and answers based on 536 paragraph-length Wikipedia excerpts. Test-takers, human or machine, read an excerpt and then answer questions on it.
On the surface, SQuAD looks formidable. The queries are wide-ranging, taking in everything from historical trivia (“When did Martin Luther die?”) to pop culture (“What enemy of Doctor Who is also a Time Lord?”) and basic chemistry (“What is needed to make combustion happen?”). The source paragraphs are equally dense, focusing on arcane topics like the legislative protocol of the European Union and the concept of civil disobedience.
Faced with SQuAD’s questions, humans get around 82.3 percent of questions right. Alibaba and Microsoft’s AIs edged out this score, just — getting 82.4 percent and 82.6 percent respectively. That’s close, but a win’s a win.
But while these questions and topics look intimidating, the test itself is easy. Think about it like this: for each question, the computers and humans know that the answer has to be in the source paragraph somewhere — and not just the answer, but the exact wording. Asking “Whose authority did Luther’s theology oppose?” seems tough, but when the source text includes the sentence “[Luther’s] theology challenged the authority and office of the Pope,” it doesn’t look quite so bad. You don’t need to understand what “authority” is, you just need to look for basic grammatical components, like the subject and object of a sentence.
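To make the “exact wording” point concrete, here is a minimal Python sketch. The record layout and field names are simplified assumptions for illustration, not SQuAD’s official file format, and the context sentence is the Luther example quoted above with the bracketed name filled in.

```python
# Simplified, hypothetical record: field names are illustrative only,
# not the official SQuAD schema.
record = {
    "context": "Luther's theology challenged the authority and office of the Pope.",
    "question": "Whose authority did Luther's theology oppose?",
    "answer_text": "the Pope",
}


def locate_answer(record):
    """Return (start, end) character offsets of the answer inside the context."""
    start = record["context"].find(record["answer_text"])
    if start == -1:
        # In this kind of test the answer must appear verbatim in the passage.
        raise ValueError("answer is not a verbatim span of the context")
    return start, start + len(record["answer_text"])


start, end = locate_answer(record)
print(record["context"][start:end])  # prints "the Pope"
```

Because the answer is always a slice of the passage, the task comes down to choosing a start and an end position rather than composing a reply from scratch.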
All this is expected, explain Pranav Rajpurkar and Percy Liang. “A lot of these models use pattern matching to arrive at an answer,” Rajpurkar tells The Verge. This includes Alibaba and Microsoft’s latest efforts, both of which use deep learning to analyze sample completed tests, and from this sift out common methods of answering the questions. Liang explains: “For example, if you ask when was someone born and you have a passage describing their life, the algorithm will just spot the ‘when’ in the question and look for any dates in the passage.”
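The shortcut Liang describes can be caricatured in a few lines of Python. This is a toy sketch of the idea only, not how the Microsoft or Alibaba systems work; the function name, the date pattern, and the sample passages are all assumptions made for illustration.

```python
import re

MONTHS = (
    "January|February|March|April|May|June|July|"
    "August|September|October|November|December"
)
# Matches "February 18, 1546" or a bare four-digit year such as "1483".
DATE_PATTERN = re.compile(rf"\b(?:(?:{MONTHS}) \d{{1,2}}, )?\d{{4}}\b")


def answer_when_question(question, passage):
    """Toy heuristic: for a 'when' question, return the first date-like span."""
    if "when" not in question.lower().split():
        return None  # this sketch only handles 'when' questions
    match = DATE_PATTERN.search(passage)
    return match.group(0) if match else None


question = "When did Martin Luther die?"
print(answer_when_question(question, "Luther died on February 18, 1546."))
# prints "February 18, 1546"
print(answer_when_question(
    question, "Luther was born in 1483 and died on February 18, 1546."
))
# prints "1483", the first date in the passage rather than the right one
```

The second call shows why this kind of matching is brittle: the heuristic has no notion of what “die” means, so any earlier date in the passage wins.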
These sorts of methods are self-evidently successful, but, like many forms of artificial intelligence, they’re also easily tricked. Since helping create SQuAD, Liang and his colleague Robin Jia made a version of the test that includes so-called “adversarial examples” designed to trip up the computer. Here, that means adding extra information to each paragraph.
So, if you have a question asking, “What’s the name of the quarterback who was 38 in Super Bowl XXXIII?” you just make sure the source text mentions two quarterbacks (who are identified as having different numbered jerseys), and the computer is stumped. Liang summarizes current AI’s performance in these tests by saying: “It’s kind of like when you have a student who can do well at tests without recognizing any of the subject material.”
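A rough Python sketch shows how extra text can derail a surface-level matcher. The word-overlap scorer here is a stand-in chosen for simplicity, not the method Jia and Liang used or the models they tested, and the players, jersey numbers, and sentences are invented for the example.

```python
import re


def tokenize(text):
    """Lowercase a string and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def best_sentence(question, passage):
    """Toy matcher: pick the sentence sharing the most words with the question."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", passage) if s.strip()]
    question_words = tokenize(question)
    return max(sentences, key=lambda s: len(question_words & tokenize(s)))


# All names and numbers below are made up for illustration.
question = "What is the name of the quarterback who wore number 12 in the final?"
passage = (
    "Quarterback Alex Reed wore number 12 in the final. "
    "The game was played in heavy rain."
)
distractor = (
    "The quarterback who wore number 7 in the final, "
    "whose name is Sam Cole, was the backup."
)

print(best_sentence(question, passage))
# picks the sentence about Alex Reed
print(best_sentence(question, passage + " " + distractor))
# picks the distractor about Sam Cole, which echoes more of the question's wording
```

A human reader spots the mismatched jersey number at once; the matcher only counts shared words, so the sentence that mimics the question’s phrasing wins.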
Yoav Goldberg, a lecturer at Bar Ilan University who specializes in natural language processing, says the mistake is thinking of SQuAD as something akin to a school test, rather than a tool intended to help computer scientists. “SQuAD was not designed to be a realistic assessment of ‘reading comprehension’ in the popular sense,” Goldberg tells The Verge over email. “It was designed as a benchmark for machine learning methods, and the human evaluation was performed to assess the quality of the dataset, not the humans’ abilities.” The fault, in other words, lies with the media and PR for interpreting it as something more than this.
Goldberg also notes that the baseline the computers are being measured against doesn’t really capture humanity at its finest. The 82.3 percent accuracy score comes from workers recruited via Amazon’s Mechanical Turk (standard practice in computer science), who are paid a few cents per question and have to answer under a time limit. “So maybe they weren’t really doing their best,” suggests Goldberg.
Liang adds, “Just to paint the spectrum a little bit: when you take the SATs or whatever, those are much, much harder than SQuAD questions. Even elementary school reading comprehensions are harder, because they often include questions like ‘Why did X do this?’ and ‘If this person had not gone to school what would have happened?’ So they’re a lot more interpretive. We’re not even tackling those more open-ended types of questions.”
Even with these caveats, the performance of Alibaba and Microsoft’s programs deserves recognition. “Before SQuAD, if you’d asked whether computers could ever do reading comprehension on Wikipedia factoid questions as well as humans, you wouldn’t have been able to say yes definitively,” says Rajpurkar. Goldberg adds that it’s still impressive how, in just a few years, AI powered by deep learning has quickly outclassed earlier methods.
And being able to extract this sort of data, even with only a surface-level understanding, could be useful in a number of domains — from better search engines to software that digs through long documents for lawyers and doctors. Alibaba, best known for its huge online shopping portfolio, says it’s already using this tech to help field customer service inquiries.
So what’s next for computer reading? Will AI ever be able to understand language in the same way that humans do? Researchers in the field are not making any predictions. On the face of it, understanding text fully requires so much quintessentially human knowledge that machines may take decades and decades to match us. However, the history of AI shows that problem-solving methods initially derided as “cheating” or “hacky” can soon combine to create something unexpectedly powerful.
“Practically, I think these systems are going to be really useful,” says Liang. “But in terms of the grand intellectual challenge, can we get computers to understand, that’s a completely different question.” And the way forward? That’s clear at least, says Liang: harder tests.