In the Loop: Alexei Efros

Alexei (Alyosha) Efros is an associate professor of robotics and computer science at CMU and last year received the Finmeccanica Career Development Chair. A native of St. Petersburg, Russia, Efros came to the United States in 1989, earning his bachelor's degree from the University of Utah and his master's and Ph.D. from the University of California at Berkeley.

A member of the CMU faculty since 2004, he is a recipient of the CVPR Best Paper Award, a National Science Foundation CAREER award, Sloan and Guggenheim fellowships, and a SIGGRAPH Significant New Researcher Award.

Efros recently spoke to Link Managing Editor Jason Togyer about his early computing experience, his career path and his current research in computer vision.

Your father is a physicist. Growing up, did you feel pressure to go into a scientific career?

My father did put pressure on me in one way. He said, "Don't become a physicist!" (Laughs.) Science was certainly the preferred path. When I was growing up in the Soviet Union it was already the time of perestroika, but there was still a feeling that it was better to go into a technical field, because it gave people a way to be independent of the authorities, which was very difficult in the arts or humanities. I think that's one of the reasons the Soviet Union produced so many great scientists--other paths were so constrained.

What was your first computer programming experience?

Through my father, when I was 13, I got one of the first Soviet personal computers, called the Elektronika BK-0010. It was kind of like a Commodore, but with 32K of RAM.

My biggest stroke of luck was that the computer needed a tape recorder to load or store programs, and I didn't have a tape recorder. I say "luck" because that meant I didn't have any games that I could play, and I didn't have any software, so I had to write my own. Even more "lucky" was that the computer would overheat after three hours! I couldn't just write a game and keep playing it, because it would die after three hours. I learned to code very quickly, and I think that pushed me in the direction of computer science. If I had started by playing games, I don't know that I would have had the perseverance to become a programmer.

I eventually developed an interest in artificial intelligence--creating computers that would not just do a bunch of things really fast, but that could reason and understand.

Why did you specialize in computer vision instead?

As I got older, I realized artificial intelligence was a very difficult problem, and the idea that computers would some day reason or write poetry seemed like such a big leap that I might not achieve realistic results in my lifetime.

In computer vision, we're creating computers that understand and recognize objects and scenes. While that's still an extremely hard problem, it's also the kind of problem where you can see very immediate results. By running your algorithm on a new image, you can immediately see if it's working or not. This can be very frustrating, but also very satisfying when things actually work.

Is computer vision a problem more of detection or processing?

Computer vision is really two fields--measurement and understanding. Computer vision as measurement means using cameras to sense something objective about the world, such as light intensity or the distance to an object. That's a very precise, well-defined problem, where increasing the resolution of a camera, for instance, immediately translates into better results.

Computer vision as understanding is a much less well-defined problem--most of the digital cameras now on the market already have better resolution than the human eye, but that's not really helping us in terms of understanding.

Understanding in terms of what?

In terms of telling a computer to look at a picture and "find a car on this street," or "find a cup on this table," or "find a chair in this room." Those questions require you first to define, "What is a cup?" or a car or a chair. You know one when you see one, but to come up with a visual definition of a chair--which may come in many different shapes and sizes--is very hard to do.

In a way, it's a question of psychology and philosophy. You cannot separate the questions from the fact that humans are asking these questions. It's humans, not computers, who are interested in the physical world--cups and cars and chairs.

Is it a problem of putting things into categories?

In some sense. But the problem is that old ideas of categorization, taxonomies, and so forth, going back to Aristotle and Socrates, don't seem to model the real world terribly well. Wittgenstein, for instance, said that while we all understand the idea of "games" as a category, it appears to be impossible to come up with a list of properties that would apply to all games. On this view, categories aren't formal definitions but rather groups of examples within a particular context.

Then how can we come up with a model that computers can use to understand the visual world?

We may not be able to come up with a good intrinsic model, but with all of the data we can collect, we may be able to come up with a phenomenological model so that a computer might be able to predict, for instance, "what's going to happen next" within a given context. Humans, of course, do this all the time--we are amazingly good at predicting the future: "Can I cross the street now, or will I be hit by a car?"

I believe the answer lies in using huge amounts of visual data to build connections between these examples and their visual context. One of my areas of focus is trying to use that data to let computers discover their own understanding of the visual world, without any human help.

So we're trying to move away from linguistic definitions toward more direct ways of describing things, in terms of their relationships with their environment and with a particular task. After all, vision, unlike language, is common to almost all animals. A mouse doesn't need to know that something is called a "cat," but it better be able to predict what that something is going to do next!
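
[Editor's note: As a rough, hypothetical sketch of the data-driven idea Efros describes--not his actual system--the core operation is retrieving visually similar examples from a large unlabeled collection. The descriptor and helper names below are illustrative assumptions.]

    # A minimal, hypothetical sketch of data-driven visual matching: represent
    # each image as a simple feature vector and let "understanding" come from
    # retrieving the most similar examples in a large, unlabeled collection.
    import numpy as np

    def tiny_descriptor(image, size=32):
        """Downsample a 2-D grayscale array to a normalized feature vector
        (a stand-in for whatever descriptor a real system would use)."""
        h, w = image.shape
        rows = np.linspace(0, h - 1, size).astype(int)
        cols = np.linspace(0, w - 1, size).astype(int)
        v = image[np.ix_(rows, cols)].astype(float).ravel()
        v -= v.mean()
        n = np.linalg.norm(v)
        return v / n if n > 0 else v

    def nearest_examples(query, collection, k=5):
        """Indices of the k images in `collection` most similar to `query`."""
        q = tiny_descriptor(query)
        dists = [np.linalg.norm(q - tiny_descriptor(img)) for img in collection]
        return np.argsort(dists)[:k]

No labels are involved here: whatever structure the computer "discovers" comes only from which images end up near each other.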

Are you inventing a new visual language?

It's not that grandiose, but yes, it's a kind of non-verbal vocabulary--trying to understand the world in terms of vision and action instead of verbally, connecting things visually in much the same way that we now connect things with words.

And who knows what that's going to be useful for? If it works, it would get us closer to a visual understanding of the physical world that could aid in the navigation of autonomous vehicles, or in finding photos of something on the Web by doing a visual query.
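
[Editor's note: Continuing the same hypothetical sketch, "connecting things visually the way we connect things with words" can be pictured as linking each photo to its most visually similar neighbors, so a visual query becomes a lookup in that graph. The descriptors are assumed to be precomputed vectors, one per photo.]

    # Hypothetical sketch: link every photo to its k most visually similar
    # neighbors; a "visual query" is then just a walk through these links.
    import numpy as np

    def visual_links(descriptors, k=3):
        """descriptors: (n, d) array, one row per photo -> {photo: neighbors}."""
        diffs = descriptors[:, None, :] - descriptors[None, :, :]
        d2 = (diffs ** 2).sum(axis=-1)        # pairwise squared distances
        np.fill_diagonal(d2, np.inf)          # a photo is not its own neighbor
        return {i: list(np.argsort(row)[:k]) for i, row in enumerate(d2)}

    # Toy usage with random "photos" described by 16-dimensional vectors.
    rng = np.random.default_rng(0)
    links = visual_links(rng.normal(size=(5, 16)))
    print(links[0])   # the photos most visually similar to photo 0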

Have you circled back to your original interest--building computers that reason?

I don't expressly say that, but I've always had in the back of my mind this grand goal. You might say I'm a cognitive scientist who happens to be working in a School of Computer Science because I want to build a computational model of how the brain works. We'll see how well that goes!

For More Information:
Jason Togyer | 412-268-8721 | jt3y@cs.cmu.edu