Yann LeCun says the next frontier in machine vision is software that learns just by observing the world.
Five years ago, researchers made a sudden leap in the accuracy of software that can interpret images. The technology behind it, artificial neural networks, underpins the recent boom in artificial intelligence. It is why Google and Facebook now let you search inside your photos, and it has unlocked new applications for facial recognition.
Yann LeCun, director of Facebook’s AI research group and a professor at New York University, helped pioneer the use of neural networks for machine vision. He says there’s still progress to be made—and that it could lead to software with common sense.
Just how good is machine vision now?
If you have an image with a dominant object in it, and the name of the game is to give the category of the object—that just works. As long as you have enough data, on the order of 1,000 objects per category, we can recognize very specific objects like cars of a particular brand or plants of a particular species or dogs of a particular breed. We can also recognize more abstract categories, like whether images are landscapes, sunsets, weddings, or birthday parties. Just five years ago it wasn’t clear this problem was completely solvable. But that doesn’t mean vision is solved.
What’s an important problem that isn’t “solved” yet?
People have been playing for a number of years with the idea of generating captions or descriptions for images and video. There have been, on the face of it, impressive demonstrations, [but] those are not as impressive as they look. Their expertise is limited to whatever universe we train them on. Show most of these systems images with other types of objects, or unusual situations they’ve never seen, and they will say complete garbage about them. They don’t have common sense.
What’s the connection between vision and common sense?
It depends who you talk to. Even within Facebook there are people with different opinions on this. You could interact with an intelligent system purely with language. The problem is that language is a very low-bandwidth channel. Much of the information that gets through language does so only because humans have a lot of background knowledge with which to interpret it.
Other people think that the only way to provide enough information to an AI system is to ground it in visual perception, [which] is much, much higher in information content than language. If you tell a machine “This is a smartphone,” “This is a steamroller,” “There are certain things you can move by pushing and others you cannot,” perhaps the machine will learn basic knowledge about how the world works. Kind of like how babies learn.
Babies learn a lot about the world without explicit instruction, though.
One of the things we really want to do is get machines to acquire the very large number of facts that represent the constraints of the real world just by observing it through video or other channels. That’s what would allow them to acquire common sense, in the end. These are things that animals and babies learn in the first few months of life—you learn a ridiculously large amount about the world just by observation. There are a lot of ways that machines are currently fooled easily because they have very narrow knowledge of the world.
What progress is being made on getting software to learn by observation?
We are very interested in the idea that a learning system should be able to predict the future. You show it a few frames of video and it tries to predict what’s going to happen next. If we can train a system to do this, we think we’ll have developed techniques at the root of an unsupervised learning system. That is where, in my opinion, a lot of interesting things are likely to happen. The applications for this are not necessarily in vision; it’s a big part of our effort to make progress in AI.
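To make the idea of predicting what happens next from a few frames of video concrete, here is a minimal toy sketch. It is not the system LeCun's group built. It assumes a made-up "video" of one bright pixel sliding across a one-dimensional frame, and it swaps in a plain least-squares linear predictor where real work would use a neural network. The names WIDTH, CONTEXT, make_clip, build_pairs, fit_predictor and predict_next are all hypothetical. The point it illustrates is that no human labels are needed: the training target is simply the frame that comes next.

```python
import numpy as np

WIDTH = 8     # pixels per 1-D frame (assumed toy size)
CONTEXT = 2   # how many past frames the predictor is shown (assumption)


def make_clip(start, length, width=WIDTH):
    """Toy video: one bright pixel moving right by one position per frame, wrapping around."""
    frames = np.zeros((length, width))
    for t in range(length):
        frames[t, (start + t) % width] = 1.0
    return frames


def build_pairs(clips, context=CONTEXT):
    """Cut clips into (past frames, next frame) pairs. The 'label' is just the future frame."""
    inputs, targets = [], []
    for clip in clips:
        for t in range(context, len(clip)):
            inputs.append(clip[t - context:t].ravel())
            targets.append(clip[t])
    return np.array(inputs), np.array(targets)


def fit_predictor(inputs, targets):
    """Least-squares linear map from flattened past frames to the next frame."""
    weights, *_ = np.linalg.lstsq(inputs, targets, rcond=None)
    return weights


def predict_next(weights, past_frames):
    """Predict the next frame from the last CONTEXT frames."""
    return past_frames.ravel() @ weights


# Train on clips that start at every position, using only the raw frames.
clips = [make_clip(start, length=6) for start in range(WIDTH)]
X, Y = build_pairs(clips)
W = fit_predictor(X, Y)

# Show the model two frames (pixel at positions 3, then 4) and ask for the third.
test_clip = make_clip(start=3, length=CONTEXT + 1)
predicted = predict_next(W, test_clip[:CONTEXT])
print("predicted bright pixel:", int(np.argmax(predicted)))
print("actual bright pixel:   ", int(np.argmax(test_clip[CONTEXT])))
```

Real video is far harder than this toy case, because many different futures can plausibly follow the same few frames, which a simple linear fit like this one cannot represent.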