Why it pays to study Psychology: Lessons from Computer Vision
Date Posted:
December 16, 2020
Date Recorded:
December 12, 2020
Speaker(s):
Alyosha Efros, University of California, Berkeley
SVRHM Workshop 2020
PRESENTER: Alyosha has been an influential figure in computer vision in general. I think most of us, definitely myself and my generation, when we grew up and went to CVPR, wanted to be like Alyosha. So it's really an honor to have him here. He was advised by Jitendra Malik at Berkeley, and is now also faculty at Berkeley after being at CMU. And he'll be speaking about why it pays to read the perception literature: confessions of a computer scientist. So Alyosha, the stage is yours. Take it away.
ALEXEI ALYOSHA EFROS: All right. Thank you very much. Thank you for inviting me to speak here. Well, I'm the last dog, so I'm going to keep it light. And also, I am a computer scientist, a computer vision scientist. I don't actually do any perceptual experiments, but I'm a huge fan. And indeed, I was thinking that I would do something a little bit different and just look at all of the ways that reading the perception [INAUDIBLE] psychology literature has shaped my career and kind of guided me in doing my work. OK?
So basically, yeah, this is going to be a little bit of a rambling love letter to you guys. OK? So I started with texture. This was the first thing that Jitendra Malik told me I should look at: texture. And I immediately got excited, because it is this magical problem that's not just a computer problem, it's a perceptual problem.
So you have a bunch of samples that don't look like each other at all on the pixel level. To a computer, at L2 distance, they're all very different. And yet a human looks at them and they all look like they are drawn from the same distribution. OK? So even the definition of texture really is a perceptual definition. And texture analysis is really all about this: there is some true infinite texture out there in the sky, and all you have is a couple of samples, or some samples. And the question you want to ask is, are they drawn from the same texture or from different textures?
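To make that gap between pixel-level and distribution-level similarity concrete, here is a minimal sketch (not from the talk; the synthetic stripe images and the gradient-orientation histogram are just assumptions for the example) showing two samples of the "same" texture that are far apart in L2 yet close in a crude texture statistic:

```python
# Two patches cut from the same texture can be far apart pixel-wise (L2),
# yet close in a simple texture statistic such as a histogram of gradient
# orientations. Toy illustration only.
import numpy as np


def l2_distance(a, b):
    """Plain pixel-wise distance; sensitive to exact pixel placement."""
    return np.linalg.norm(a.astype(float) - b.astype(float))


def orientation_histogram(patch, bins=8):
    """A toy texture descriptor: gradient-magnitude-weighted orientation histogram."""
    gy, gx = np.gradient(patch.astype(float))
    angles = np.arctan2(gy, gx)
    hist, _ = np.histogram(angles, bins=bins, range=(-np.pi, np.pi),
                           weights=np.hypot(gx, gy))
    return hist / (hist.sum() + 1e-8)


rng = np.random.default_rng(0)
# Two samples "drawn from the same texture": noisy vertical stripes,
# shifted relative to each other so the pixels do not line up.
x = np.linspace(0, 8 * np.pi, 64)
stripes = np.sin(x)[None, :].repeat(64, axis=0)
a = stripes + 0.1 * rng.standard_normal((64, 64))
b = np.roll(stripes, 7, axis=1) + 0.1 * rng.standard_normal((64, 64))

print("pixel L2 distance:", l2_distance(a, b))                    # large
print("texture-statistic distance:",
      np.abs(orientation_histogram(a) - orientation_histogram(b)).sum())  # small
```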
And so my first influence from psychology was, of course, Bela Julesz, who is really the father of texture. I mean, he also did the random dot stereogram thing, but I don't have stereo, so I don't care about that. But on texture, he was really very influential for me. And so for those of you who maybe don't remember, he was really the one who focused on this idea of preattentive discrimination.
So he noticed that there are some patterns that just pop out at you and are easily discriminable from the background, and other patterns that are very hard to discriminate even though, in terms of pixel-level differences, they are just as different. And so he basically argued that the things you cannot distinguish preattentively are what constitute the same texture. OK?
And so my first work was on texture synthesis, which is kind of a related question: if you have one sample from this texture in the sky, how can you generate another sample that a human would think is drawn from the same texture? And we did a paper on this, non-parametric texture sampling, and we got some good results.
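Here is a heavily simplified sketch of the idea behind non-parametric texture synthesis, in the spirit of the approach described above but not the actual algorithm: the real method grows the texture outward from a seed along the boundary of the filled region and uses Gaussian-weighted neighborhood distances, while this toy version fills in raster order and is very slow. All names and parameters here are assumptions for illustration.

```python
# Grow an output image pixel by pixel, each time copying a pixel from the
# sample texture whose already-known neighborhood best matches the
# partially filled neighborhood being synthesized.
import numpy as np


def synthesize(sample, out_size=64, win=11, seed=0):
    rng = np.random.default_rng(seed)
    half = win // 2
    h, w = sample.shape
    out = np.zeros((out_size, out_size))
    known = np.zeros_like(out, dtype=bool)

    # Seed the output with a small patch copied from the sample.
    sy, sx = rng.integers(0, h - win), rng.integers(0, w - win)
    out[:win, :win] = sample[sy:sy + win, sx:sx + win]
    known[:win, :win] = True

    # Pre-extract every candidate neighborhood from the sample.
    cands = np.array([sample[i:i + win, j:j + win]
                      for i in range(h - win) for j in range(w - win)])

    pad_out = np.pad(out, half)
    pad_known = np.pad(known, half)
    for y in range(out_size):
        for x in range(out_size):
            if known[y, x]:
                continue
            nbhd = pad_out[y:y + win, x:x + win]
            mask = pad_known[y:y + win, x:x + win]
            # Masked sum-of-squared-differences against every candidate.
            ssd = (((cands - nbhd) ** 2) * mask).sum(axis=(1, 2))
            best = rng.choice(np.argsort(ssd)[:5])   # sample among near-best matches
            out[y, x] = cands[best][half, half]
            known[y, x] = True
            pad_out[y + half, x + half] = out[y, x]
            pad_known[y + half, x + half] = True
    return out


# Usage (assumes a grayscale numpy array as the texture sample):
# new_texture = synthesize(sample_gray)
```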
And one thing that I found kind of interesting is that by mistake I ran it on an image that turned out not to be a texture image; it was a real image. And this was an extrapolation result. And I was kind of surprised that it worked. It still kind of worked. OK? And then, reading further, I happened upon it, or maybe I actually heard it at one of the workshops: Simon Thorpe came and gave this wonderful talk on rapid visual classification. I think the task was, is there an animal in this image? And he was doing it at something like 150 milliseconds. So really preattentive, exactly like Bela Julesz's regime: preattentive, feed-forward, not much time to do anything really. Right?
And what Simon's lab had shown is that people were really good at this, surprisingly good at doing what you would think is a fundamentally deep and semantic task. And they were just doing it like nothing. OK? And of course, Nancy Kanwisher also did cool follow-up work on that, and it reaffirmed these results.
And so there was already this feeling that there was something about texture processing that was more than meets the eye. And in my lab, back when I was a grad student, Laura Walker Renninger and Jitendra Malik were looking at scene classification. Their idea was basically to do a really stupid texture discriminator. So what they did is they just built a dictionary of patches, textons, and then they looked at the histogram of these patches. And then they basically just did nearest neighbor on these histograms of patches. A really very basic, very low-level texture kind of question.
And they thought, how far could they go? How well could they do? And what they found on this scene discrimination task is that they were-- oh, I don't have them here. Well, on the x-axis there are different scene types, like bedroom, forest, beach, et cetera. And what they found is that, compared to human subjects, they were basically doing pretty well just on texture. They were pretty much on the level of humans at 50 milliseconds. OK?
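A rough sketch of that kind of texton-histogram pipeline might look like the following (hypothetical and much cruder than the Renninger-Malik system, which built its textons from filter-bank responses rather than raw patches): cluster small patches into a dictionary, describe each image by its histogram over the dictionary, and classify with nearest neighbor.

```python
# Bag-of-textons scene classification sketch: k-means dictionary of small
# patches, per-image texton histograms, nearest-neighbor classification.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import KNeighborsClassifier


def extract_patches(img, patch=5, stride=4):
    """Flattened grayscale patches sampled on a regular grid."""
    ps = [img[i:i + patch, j:j + patch].ravel()
          for i in range(0, img.shape[0] - patch, stride)
          for j in range(0, img.shape[1] - patch, stride)]
    return np.array(ps)


def texton_histogram(img, kmeans):
    """Normalized histogram of texton assignments for one image."""
    ids = kmeans.predict(extract_patches(img))
    hist = np.bincount(ids, minlength=kmeans.n_clusters).astype(float)
    return hist / hist.sum()


def train(train_imgs, train_labels, n_textons=64):
    all_patches = np.vstack([extract_patches(im) for im in train_imgs])
    kmeans = KMeans(n_clusters=n_textons, n_init=4).fit(all_patches)
    hists = np.array([texton_histogram(im, kmeans) for im in train_imgs])
    clf = KNeighborsClassifier(n_neighbors=1).fit(hists, train_labels)
    return kmeans, clf


def predict(img, kmeans, clf):
    return clf.predict([texton_histogram(img, kmeans)])[0]
```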
And so this really got me thinking. And then later on, there was another work also on just doing texture. Notice this. This is 2001. OK? And I'll just play this work from the Microsoft lab.
[VIDEO PLAYBACK]
ALEXEI ALYOSHA EFROS: This is John Winn and colleagues. And here they basically just compute texture within this boundary, and then they do nearest neighbor in terms of the texton representation. OK? So the stupidest, simplest thing you can think of, right? It doesn't know anything about the boundary of the cow. It doesn't even know about the whole idea of a cow. It's basically doing cowness or grassness. It's totally doing texture recognition. But you show it to an undergrad and they'll be like, wow, the computer is recognizing objects, right?
And this really made me realize how amazing-- I mean, look at this. Yeah. It just works. And you don't even need the full-- it has no idea that this is like many bikes. It's not [INAUDIBLE] But it doesn't matter because it looks good. And so this really opened my eyes. And I realized a lot of what we think of as object recognition is really just that. It's really just texture recognition.
And so when, later on, there were all of these examples with neural networks from Gatys and others, where you scramble your image into some texture mumbo-jumbo and it still works perfectly fine, I wasn't surprised. To me, this was obvious because of all of this texture perception work that I had read about 20 years earlier.
[END PLAYBACK]
And so this is why I was really focused on moving away from just texture recognition and really thinking about scene understanding: not just recognition, but really parsing the scene into meaningful parts. And it is really all about complexity, because if you look at a natural scene, you cannot do it without dealing with occlusions. You cannot do it without reasoning about 3D. You cannot do it without really trying to understand what it is that makes a scene.
And luckily, I found that psychologists, again, were thinking about this. So for example, Hawk and colleagues were talking about different types of ill-formed scenes: type one is a well-formed scene, and then there are a bunch of different ways it can go berserk, right? And then Irv Biederman had this wonderful paper.
So with Biederman, of course, most people think of geons. And I think psychologists are really into geons. From a computer vision perspective, eh. I mean, they're nice in theory, but it's just impossible to actually use them in practice. But Biederman's work on understanding well-formed scenes, that's a beautiful paper. If you haven't read it: just a beautiful, beautiful paper.
And so Biederman basically had this whole theory about what it is that makes a well-formed scene. What do you need for a scene to hang together, to be a coherent thing? And he hypothesized five parts: support, size, interposition, position, and likelihood of appearance.
And luckily, I was reading this paper while I was also in Brussels. And I went to the Magritte museum. And I realized that Magritte apparently read Biederman, because he had all of these things in his paintings. So here is a violation of support. OK. Here is a violation of size. Here is a violation of interposition. And here, just everything goes bad: position, probability, size, everything. OK?
And so then I realized that we really need to deal with all of these things, and if we can get there, then maybe we can try to understand a scene. And this is what I convinced my very first PhD student, Derek Hoiem, to take on. And he was super brave, because I thought that it was like 20 years' worth of effort. And he didn't know better, so he took it on.
And he basically looked at trying to understand the 3D spatial layout from a single image: basically, think about the three-dimensional structure of the scene, like the support surface, the vertical surfaces, their orientations, and also connect it with occlusions, figure-ground relationships, and also object sizes and interposition. Try to do all of this together. OK? And of course, the main challenge is that from a single image, this is ill-defined. Given this image, it could be this, it could be this, it could be this; there is an infinite number of interpretations. And the trick is, of course, to get the right one, to find the right interpretation, or the most likely interpretation.
So how is it that we can figure out that this one is not likely, and this one is more likely? I'm just going to briefly summarize his work. He basically talked about labeling these geometric classes, different types of surfaces. And then we had a way to pop it up, basically like a paper pop-up: given these geometric labels, find where to fold and where to cut, make it into a very coarse planar 3D model, and then be able to basically walk into an image, from just a single image. OK?
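As a toy sketch of just the geometric-labeling step (hypothetical, and far simpler than the actual features and classifiers used in that line of work), one could label image segments as "support", "vertical", or "sky" from a few crude cues with a standard classifier:

```python
# Toy geometric-class labeler: per-segment color and position cues fed to a
# random forest. Assumed training data: (image, segment mask, label) triples
# with labels 0 = support surface, 1 = vertical surface, 2 = sky.
import numpy as np
from sklearn.ensemble import RandomForestClassifier


def segment_features(img, seg_mask):
    """Crude per-segment cues: mean RGB and normalized vertical centroid."""
    ys, xs = np.nonzero(seg_mask)
    mean_rgb = img[ys, xs].mean(axis=0) / 255.0   # assumes an (H, W, 3) image
    v_pos = ys.mean() / img.shape[0]
    return np.concatenate([mean_rgb, [v_pos]])


def train_geometric_labeler(triples):
    X = np.array([segment_features(im, m) for im, m, _ in triples])
    y = np.array([lbl for _, _, lbl in triples])
    return RandomForestClassifier(n_estimators=100).fit(X, y)
```

Once every segment carries one of these labels, the "pop-up" step described above amounts to fitting the creases where vertical surfaces meet the support surface and folding the image up like a children's pop-up book.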
And it's a very coarse model, but it captures the percept of 3D in a way that a lot of the more sophisticated models do not. So here are some more examples of that. This is really old stuff. This is, yeah, 2005, but still cool. And it's an interesting historical note that basically at the same time, Andrew Ng and his students were also interested in understanding 3D from a single image. But they went the completely quantitative route: they were basically predicting depth at every pixel.
And I would say that the perception-informed view, I think, was the better way to go. And in fact, if you look at [INAUDIBLE], who was the first author of this other, competing work, in a couple of years he actually came around to our view. He basically added qualitative 3D understanding to his system. So we kind of got ahead because we read more perception stuff.
Now, this is not always going to work. So here is an example of things not working. And it's not working because the previous model did not really deal with occlusion boundaries. OK? And so next, Derek thought, OK, we need to figure out depth ordering at occlusions, again in a qualitative way. And of course, then you have to go to all of the literature and figure it out. And of course, I made Derek read all of that literature: boundary ownership and all that stuff.
And it's interesting, again, to connect this to the old classic work in computer vision, something like the Waltz algorithm, where the idea was that you're going to find your boundaries, you're going to label them, and then there is this beautiful, gorgeous junction-propagation algorithm. For those of you who are in computer vision, or of an older generation, you probably had to do it in your class. So there are different types of junctions, and then you propagate labels through the junctions, and you get this beautiful result afterward. OK?
And it's just a gorgeous, beautiful story. OK? The only problem is it just never worked. It worked on these hand-drawn things; it never worked in the real world. And everybody was kind of puzzled: why doesn't it work in the real world? So people tried to find all these T-junctions and then blah, blah, blah. And I just happened to stumble upon a wonderful little paper by Josh McDermott which basically made me realize we were just on completely the wrong track. He was looking at T-junctions in real images and at how people perceive T-junctions.
And he showed that locally, from a little patch, humans could not tell T-junctions, or really any other kind of junction. It was really only when they got the whole context that they were able to realize those were the correct type of junction. So what we were doing in computer vision was completely backwards. We were starting by trying to find junctions and then trying to understand the image from there, whereas it looked like humans were doing it the other way around: they were first getting some notion about regions and boundaries, and then the junctions were the output, not the input.
And so that made us just go in this direction. Basically, we started with slowly going from edges to boundaries, and then trying to figure out which were the occlusion boundaries, in conjunction with also reasoning about qualitative depth and these kinds of surface representations. OK? And we were able to slowly, slowly converge upon a representation that would give us boundaries with a labeling of foreground and background, and also with a range of depth. OK?
And the final work of Derek's thesis was basically to put everything together: surfaces, occlusion boundaries, and objects and their scales. And it was actually very much inspired by the intrinsic image work of Barrow and Tenenbaum.
So that was all fine. But the problem was that surfaces were just not good enough to really represent the scene, because they were just all these paper cutouts. There was no meat on this thing. And so we realized that we really needed to go from surfaces to volumes. And again, this is where we were very much inspired by Biederman and colleagues. And we tried to figure out how to inject both geometric volumetric constraints and physical constraints, like stability, all into the same framework. And this was the work of Abhinav Gupta, who back then was a starting postdoc in my lab.
And so we had a very cute algorithm that starts with little blocks, basically LEGO blocks, and makes them represent a given image, again from a single real image. And so we were able to do these kinds of 3D parse graphs with relationships like in front of, above, heavy, light. And again, this is 10 years ago. I'm kind of surprised how well we were able to do it. And we were even able to get some sort of 3D renderings from real images. I think I have actually not seen any other paper that's able to do something like this.
So this is still a hard problem. And I think we were able to make as much progress as we did because we were aware of all of this work from Irv Biederman and colleagues.
And around that time, I had kind of my next inspiration. And this one came from Aude Oliva: her wonderful work on the capacity of visual memory. I know that Aude gave a talk here, but unfortunately she doesn't talk about this stuff anymore, and the young kids, they don't know it. And so I'm going to make sure-- public service here-- to make sure that everybody knows, because it's beautiful, beautiful work.
So of course, in 1973, Standing already showed that humans have this amazing capacity for remembering images: something like 83% recognition of 10,000 images. OK? But Standing did forest versus beach. So it wasn't clear how much information was being stored in memory. Is it just one label? Or is it every pixel? Right?
And so what Aude did is she basically redid this experiment but really tried to push the humans harder, by showing a lot of very similar objects, and also looking at the same object in different states, and also having similar classes of objects but not the same instance. And the results were spectacular. So she replicated Standing's results for the same object class. But she also showed that even for different exemplars, people were able to distinguish one member of a class from another: they were able to tell apart one remote control from a different remote control. Or they were able to tell apart different states of the same object.
So it was just incredible how much we remember visually. And so this really was inspirational for my data-driven kick, where James Hays and I thought, OK, let's try to do something with lots and lots of data. We thought, OK, how about hole filling? So here is something we don't want; we are going to get rid of it in Photoshop. And now we are going to download two million images from Flickr. And this is 2007, so this is a long time ago. And we're just going to see if we can use some other image to fill in the hole. And ta-da, it actually works.
And the cool thing is, it really works because of all this data. First, James tried it on 20,000 images, which back in 2007 was a huge amount of data, and it just totally didn't work. Here are the nearest neighbors, and they're not that near. But then he just kept downloading, and when he got to 2 million, boom, it just worked. And so here are some other examples. Yeah. And we also did it for geolocalization with the same idea. OK?
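A bare-bones sketch of this data-driven hole-filling idea might look like the following (hypothetical and drastically simplified; the actual system used GIST-style scene descriptors over millions of images, plus seam finding and Poisson blending rather than a naive paste):

```python
# Data-driven hole filling sketch: describe each scene with a tiny
# thumbnail, find the nearest scene in a large collection, and paste the
# matching region into the hole.
import numpy as np


def scene_descriptor(img, size=16):
    """A crude gist-like descriptor: a low-resolution grayscale thumbnail."""
    h, w = img.shape[:2]
    ys = np.arange(size) * h // size
    xs = np.arange(size) * w // size
    gray = img.mean(axis=2) if img.ndim == 3 else img
    return gray[np.ix_(ys, xs)].ravel()


def fill_hole(query, hole_mask, database):
    """Replace the hole with pixels from the most similar database image.

    Assumes database images have been resized to the query's resolution.
    """
    q = scene_descriptor(query)
    dists = [np.linalg.norm(q - scene_descriptor(im)) for im in database]
    match = database[int(np.argmin(dists))]
    out = query.copy()
    out[hole_mask] = match[hole_mask]   # naive paste; no seam finding or blending
    return out
```

The key design point the talk emphasizes is not the matching machinery but the scale of the database: with only tens of thousands of images the nearest neighbors are not near enough, while with millions of scenes a plausible match for almost any hole exists.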
But do humans really remember every single pixel? I asked Aude when we were in Paris. We happened to be together in Paris on our sabbaticals in 2012, and we had a wonderful time drinking wine and talking about philosophy. And Aude was like, ha, funny you should mention that. I actually have an experiment on this, but I didn't think anybody cared, so I never published it.
So this is such a cool experiment, at least to me. I'm just going to play it; Aude gave me the slides, so we're actually going to do it. OK? I think we still have a little bit of time. It's hard to do on Zoom, but the idea is: Aude is going to show you lots of images, or a bunch of images. When you see the same image twice, you're going to clap. It really works when you can hear other people clapping. But OK. Here we go.
[VIDEO PLAYBACK]
[END PLAYBACK]
ALEXEI ALYOSHA EFROS: OK. So of course, you noticed that some of these images are basically just random textures that don't mean anything. And when you do it in an audience, it's very clear: people do really well on normal images, and they are basically at chance on these random, meaningless textures. It's a very, very clean result. A very powerful result.
And so this made me think: wait a minute, it's not just that you're remembering every single pixel. You're really remembering something that's meaningful to you. So you're remembering on some natural image manifold. OK? And that really got me into this whole self-supervised representation learning thing. The whole self-supervised learning, I got this idea from Aude in 2012 in Paris. And all of these papers that my lab has since put out are the result of this realization that we really need to learn a good representation that sits on the data manifold. Because otherwise, we're not going to make it work.
OK. So do I have more time? Or am I done?
PRESENTER: You can keep going, Alyosha. Yeah.
ALEXEI ALYOSHA EFROS: OK. One last thing from all this self-supervised stuff. This is probably the one that may be most interesting to you guys. This is work we did with Andrew Owens. And this was a self-supervised learning work, again inspired by a perceptual experiment, the McGurk effect. I'm sure most of you know--
[VIDEO PLAYBACK]
- Bah, bah, bah, bah, bah, bah.
[END PLAYBACK]
ALEXEI ALYOSHA EFROS: So he's saying bah.
[VIDEO PLAYBACK]
- Bah, bah, bah, bah, fah, fah, fah, fah, fah. fah.
[END PLAYBACK]
ALEXEI ALYOSHA EFROS: Right? Now he's saying fah, with an F. But actually, the audio is exactly the same; it's just that the video is different. OK? So this is just a beautiful illustration of a very tight coupling between audio and visual processing. And of course, in computer vision, we never do this. It's always separate.
And so we decided we were going to learn a joint audio-visual representation. And the way we did it is our usual self-supervised trick: we try to bring the real pairs close and push the fake ones apart. So a real pair is just a video and its corresponding audio, and a fake-- oh, sorry.
[MUSIC PLAYING]
This is real. And for a fake, we just take some random other audio from [INAUDIBLE]. OK? Now, the problem is this is too easy. It's not going to work very well because it's too easy to tell. So what we did instead is we took the same audio but just shifted it a little bit. OK? And then--
[MUSIC PLAYING]
Now the system needs to really work hard to figure out that there is a shift. It needs to find correspondences and figure out that there is a shift. And so we trained the system, and after several weeks of training, we were able to get a representation that knew where the sound was coming from.
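The training objective described here can be sketched schematically as follows (a hypothetical, tiny stand-in for the actual deep multisensory network: the module names, layer sizes, and shift amount are all assumptions for illustration): two encoders embed a video clip and an audio clip, and a classifier is trained to say whether the audio is temporally aligned with the video or shifted.

```python
# Self-supervised audio-visual synchronization sketch (PyTorch).
# video: (B, 3, T, H, W) clips; audio: (B, 1, samples) waveforms.
import torch
import torch.nn as nn


class AVSyncNet(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        # Tiny stand-in encoders; real models are much deeper.
        self.video_enc = nn.Sequential(
            nn.Conv3d(3, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1), nn.Flatten(), nn.Linear(16, dim))
        self.audio_enc = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=9, stride=4, padding=4), nn.ReLU(),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(), nn.Linear(16, dim))
        self.classifier = nn.Linear(2 * dim, 1)   # aligned vs. shifted

    def forward(self, video, audio):
        v = self.video_enc(video)                 # (B, dim)
        a = self.audio_enc(audio)                 # (B, dim)
        return self.classifier(torch.cat([v, a], dim=1)).squeeze(1)


def training_step(model, video, audio, shift, opt):
    """Positives are aligned pairs; negatives are the same audio, time-shifted."""
    shifted = torch.roll(audio, shifts=shift, dims=-1)
    logits = torch.cat([model(video, audio), model(video, shifted)])
    labels = torch.cat([torch.ones(len(video)), torch.zeros(len(video))])
    loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()


# Usage sketch:
# model = AVSyncNet()
# opt = torch.optim.Adam(model.parameters(), lr=1e-4)
# loss = training_step(model, video_batch, audio_batch, shift=8000, opt=opt)
```

Using the same audio, merely shifted, as the negative is the key design choice: the network cannot solve the task from global audio statistics alone and is forced to find fine-grained audio-visual correspondences.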
[VIDEO PLAYBACK]
[END PLAYBACK]
ALEXEI ALYOSHA EFROS: OK? And we even did a cute little demo of separating on-screen from off-screen sound. So this is--
[VIDEO PLAYBACK]
- Were able to show to the rest of the world the unshakable Japan, US alliance.
- Donald, thank you so much.
- Thank--
[END PLAYBACK]
ALEXEI ALYOSHA EFROS: And then we can keep just the audio that has evidence in the pixels.
[VIDEO PLAYBACK]
- [NON-ENGLISH SPEECH] Donald, thank you so much.
[END PLAYBACK]
ALEXEI ALYOSHA EFROS: Or we could do the other one.
[VIDEO PLAYBACK]
- Were able to show to the rest of the world the unshakable Japan US alliance. Thank--
[END PLAYBACK]
ALEXEI ALYOSHA EFROS: OK? And I have many, many more examples, but I'm out of time. So in conclusion: in computer vision, I have this reputation. People say, oh, he's very creative, he has lots of out-of-the-box ideas. And frankly, this is all bullshit. The dirty truth is I just read and get inspired by human perception research. And you can be, too. So that's basically how it is. So thank you. And also thanks to all the wonderful psychologists that I got inspiration from. Thank you.