Do deep networks see the way we do? Qualitative and quantitative differences
Date Posted:
December 16, 2020
Date Recorded:
December 12, 2020
Speaker(s):
S.P. Arun, Indian Institute of Science
SVRHM Workshop 2020
Description:
SP Arun's homepage - https://sites.google.com/site/visionlabiisc/
PRESENTER: I am personally very excited to introduce my graduate school mentor, the amazing SP Arun. Arun received his PhD from Johns Hopkins University with Professors Ken Johnson and Andreas Andreou, and then received his postdoctoral training at Carnegie Mellon University with Carl Olson. He's currently an associate professor at the Centre for Neuroscience at the Indian Institute of Science in Bangalore. His work spans human psychophysics, single-cell physiology in macaques, computational modeling, and, more recently, human brain imaging. Today, he will talk to us about "Do Deep Networks See the Way We Do? Qualitative and Quantitative Differences."
SP ARUN: Thanks a lot for this invitation. It's really nice to be participating in this meeting all the way from India. It's almost 11:00 PM here, but all the talks have been exciting so far, so I'm still awake. This is just a picture of our institute, to inspire you or interest you in perhaps coming here someday, when it's possible to travel in person. It's a beautiful country and a beautiful place to work.
So let me start by highlighting what everybody else has been talking about: the question of why compare deep networks and brains? And, of course, there are very simple answers to this. We want to understand how brains represent visual information. We'd also like to understand how deep networks represent visual information, because they tend to be very opaque, and it's hard to understand what exactly is going on inside them. And, of course, any insight that we get from using deep networks to understand brains should eventually lead to improvements in performance or to new insights.
And so, hopefully, these are really two sides of the same coin. That's what we've been working on with various approaches in our lab. So let me start by highlighting the fact that the task of object detection actually happens in a pretty similar way in machines and humans.
And you can start with this image, and you can extract features from the image using various kinds of computer vision models. And you have a classifier, and then, finally, at the end of the day, you can conclude something like [INAUDIBLE]. Yeah?
But if you think about how this happens in the brain, there's actually a very similar sort of sequence. Here's Arturo here. And then the brain looks at it, and there's a stage of visual feature extraction in the visual cortex. And several speakers before me have highlighted what's happening in the ventral visual pathway. Then there's some kind of binary decision, and then, if my assumption is correct, then Arturo is from Peru.
And so we perform a very similar sequence of stages: feature extraction, then a binary decision. The question, then, is how we can understand the way deep networks, or any computer vision model, or any visual information processing system, is doing it.
You could start by comparing performance. And all of you know that deep networks have been pretty impressive in their performance on various visual tasks. But still the question remains, can they do better? Do they approach human performance? What exactly is the difference?
And so what we started, or what I'm going to tell you about is different ways of comparing deep networks and human vision. And we started off by saying that if you take a very, very simple visual task, then chances are that a machine or a human being could do this the same way, and the performance would be really at ceiling, like 95%, 100%. And there's no sort of scope to investigate differences between the two.
So what we thought is that maybe we should actually try to compare machines and humans on a hard task. If you make machines and humans perform a hard task, then perhaps we can start looking at the patterns of errors and get deeper insights into what's going on. So we started with this sort of task-- this is [? Haresh, ?] who is a post-doc in the lab.
And we started with a task that's based on a version of the task I just showed you with [INAUDIBLE] face, which involves looking at people from different parts of India. It turns out that people from different parts of India have slightly differing facial features, and people from India can actually categorize them pretty well.
And so here's just a sampling of the faces we took from our database. These are all North Indian faces, and these are all South Indian faces, and these are the regions that they are coming from. The point is really not that we want to do something very specific to Indian faces; the idea is that this is a hard task even for humans to do.
So, for example, even for people, some faces are difficult to categorize and some faces are very easy to categorize. And you can see an entire range of variation over here.
So then the question is, can we mimic this performance with a computer vision algorithm? So if you take this face database that we built up from scratch by taking photographs and crawling the internet, you can extract deep features. As you all know, you can extract features using standard [INAUDIBLE]. And we also calculated some manual features in order to compare the performance of various classifiers based on different parts of the face. The way these manual features are extracted is to register each face using automatically detected landmarks and then extract various patterns of intensity over here, or features like the size of the eyes, the size of the mouth, the size of the nose, and so on.
And so the idea here is that we're extracting features from different subsets or different parts of the face. The simple question we started with was, given that we have this database, can we learn the distinction between these two regions of India? And here is a snapshot of the performance. Humans are about 65% correct on this task on the database that we tested.
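As a rough illustration of this classification setup, here is a minimal sketch in Python; the features and labels are random placeholders standing in for the actual face database, and the linear classifier is only one plausible choice, not necessarily the one used in the study.

```python
# Minimal sketch of the face-classification setup described above.
# Random placeholder data stands in for the real deep or manual face features.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
n_faces, n_features = 400, 128
features = rng.normal(size=(n_faces, n_features))  # stand-in for face features
labels = rng.integers(0, 2, size=n_faces)          # 0 = North Indian, 1 = South Indian

# Cross-validated accuracy of a linear classifier on these features,
# to be compared against the ~65% human accuracy mentioned above.
clf = LinearSVC(max_iter=10000)
accuracy = cross_val_score(clf, features, labels, cv=10).mean()
print(f"Cross-validated accuracy: {accuracy:.2f}")
```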
And I want to highlight, of course, that a bunch of different algorithms were tested in the study. And one of the relevant ones is a [INAUDIBLE] that was trained on face recognition. You can see that the performance of the deep network is actually similar to that of humans.
OK, but what is interesting is that all the features and all the models that we tested on this particular task actually had very, very different patterns of errors from humans. In other words, what is shown here is the correlation between human accuracy and model accuracy across faces-- the bar on the right-hand side shows that humans are very, very consistent with each other. If you take two groups of humans, you get a very high correlation between which faces are accurately classified and which faces are not. But if you look at all the other computational models, including deep networks, all of them showed very low correlation with human patterns of accuracy, which means that the faces humans find very easy to categorize as being from a particular part of India are not the same faces that computer models find easy.
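A minimal sketch of this error-pattern comparison, assuming per-face accuracies are available (the arrays below are random placeholders): split the human subjects into two groups, correlate their per-face accuracies to get human consistency, and correlate human and model per-face accuracies to see how well the model matches human error patterns.

```python
# Sketch of the error-pattern analysis: correlate per-face accuracy between
# two halves of the human subjects (consistency) and between humans and a model.
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
acc_humans_group1 = rng.random(200)  # fraction correct per face, subject group 1
acc_humans_group2 = rng.random(200)  # fraction correct per face, subject group 2
acc_model = rng.random(200)          # per-face accuracy of a model's classifier

human_consistency, _ = pearsonr(acc_humans_group1, acc_humans_group2)
model_human, _ = pearsonr((acc_humans_group1 + acc_humans_group2) / 2, acc_model)
print(f"human-human r = {human_consistency:.2f}, model-human r = {model_human:.2f}")
```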
To make matters even worse-- actually, with the hope of making it better-- we trained computer vision models to predict human accuracy. And, even after learning to predict human accuracy, the models were not able to capture human accuracy on this task, suggesting that humans are potentially using very different face representations compared to a variety of computer vision models. Now, we went a step further-- and here's the advantage of using manual features, because deep networks are black boxes operating on the [INAUDIBLE], so you don't really know what the judgment is based on. To illustrate which features we might be using for classification, we looked at the performance of classifiers trained on subsets of the features we extracted: features from the eyes, the nose, the mouth, or the contour of the face. And we wanted to know whether, based on the features from the eyes, nose, mouth, or contour, we can classify a face as being from a particular part of India.
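The feature-subset analysis can be sketched in the same way; the column groupings and data below are hypothetical, simply to show the idea of training the same kind of classifier on each facial region separately.

```python
# Sketch of the feature-subset analysis: train one classifier per facial region
# and compare cross-validated accuracy. Data and column groupings are placeholders.
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
features = rng.normal(size=(400, 80))  # stand-in for manual face features
labels = rng.integers(0, 2, size=400)  # 0 = North Indian, 1 = South Indian
subsets = {"eyes": slice(0, 20), "nose": slice(20, 35),
           "mouth": slice(35, 55), "contour": slice(55, 80)}

for region, cols in subsets.items():
    acc = cross_val_score(LinearSVC(max_iter=10000),
                          features[:, cols], labels, cv=10).mean()
    print(f"{region:8s} cross-validated accuracy: {acc:.2f}")
```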
And it turns out we found that the mouth region was the most predictive of the region of India to which the face belonged. Just to confirm this in a human experiment, we took a bunch of faces and obscured different portions of the face-- the same black bar superimposed on the eyes, the nose, or the mouth. And we repeated the experiment where we asked humans to categorize these faces as belonging to the northern or the southern part of India. It turns out that obscuring the mouth region actually reduced performance the most, suggesting that, just as we found that the mouth is the most informative about the category of the face, humans also seem to be relying most on the mouth to categorize the face.
OK, so, to summarize what I've shown you so far, basically, if you compare performance on hard tasks, you can find that deep networks and humans show very systematic differences in error patterns. So the question is, then, can we go beyond comparing performance? And this is [INAUDIBLE], who is a former PhD student from here.
And so you come back to this diagram. Maybe you want to compare performance. And now the question is, can we go a step before that and compare features? And the question then becomes, how do we compare features between deep networks and brains? And I think some of the speakers before me have alluded to this, so I'll go a little bit faster.
Imagine that you wanted to compare the deep network with brain representations. You can imagine that neurons, or units in the deep network, might be representing a bunch of features, and it could simply be that the features in the deep network and the features used by the brain are rotations of each other.
So there's no direct correspondence between the units of the deep network and the neurons that you record from in the brain. The point is that, instead of comparing features, it's actually much more meaningful to compare distances. So rather than comparing features, we want to compare distances between objects, and that becomes a much more valid basis of comparison.
So, to compare distances, you've got to have ways of measuring distances, and we used a particular kind of visual task to measure distances in perception. Just to give you an example, I'm showing you an image which contains a leopard inside it. This is an actual photograph that we took, and you can see that the leopard is difficult to find because it's similar to its surroundings.
And this principle can be taken back into the lab, where you can convert it into a visual search experiment. You can see here that there are many identical items, but one of them is different. And you can find the odd one out pretty easily here, which means that these two images are far apart in perception.
And visual search is a really nice task because it's very natural. You don't have to ask subjects to generate a similarity rating or compute some notion of similarity. You just have to ask them to find the odd one out. And you can objectively characterize performance rather than looking at subjective ratings.
So these two objects are far apart. And if you do this experiment for many, many pairs, you can see that here the odd one out is a little bit harder to find-- this is the odd one out over here-- and that means that these two objects are close together.
So, in this manner, we compiled a large data set of [INAUDIBLE] distances between objects in humans by looking at the reciprocal of reaction time in a visual search task. As you can imagine, if it takes you a long time to find one object in a field of the other, those two objects are actually similar. So the reciprocal of reaction time becomes a measure of dissimilarity. And so we measured perceived distances between lots of pairs of objects from lots of human participants.
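In code, the mapping from search reaction times to perceived dissimilarities is just a reciprocal; the numbers below are illustrative, not actual data.

```python
# Sketch: convert visual-search reaction times into perceived dissimilarities.
# Illustrative numbers only; longer search time means a more similar pair.
reaction_times = {
    ("leopard", "grass"): 2.8,  # hard search -> similar pair
    ("leopard", "car"): 0.7,    # easy search -> dissimilar pair
}
perceived_distance = {pair: 1.0 / rt for pair, rt in reaction_times.items()}
for pair, d in perceived_distance.items():
    print(pair, f"dissimilarity = 1/RT = {d:.2f} per second")
```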
And so the question is, can we now compare the representations that we get from perception in humans with distances in computational models? Here, it's pretty straightforward. You can just access the features used by various computer vision models and compute distances by simply taking the Euclidean distance between the feature vectors activated by these images.
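A minimal sketch of this comparison might look as follows, with random arrays standing in for the model features and the measured 1/RT distances.

```python
# Sketch: Euclidean distances between model feature vectors for all image pairs,
# correlated against perceived (1/RT) distances. Placeholder data throughout.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import pearsonr

rng = np.random.default_rng(0)
n_images, n_units = 50, 4096
model_features = rng.normal(size=(n_images, n_units))   # stand-in for model features
perceived = rng.random(n_images * (n_images - 1) // 2)  # stand-in for 1/RT distances

model_distances = pdist(model_features, metric="euclidean")  # one value per image pair
r, _ = pearsonr(model_distances, perceived)
print(f"Correlation between model and perceived distances: r = {r:.2f}")
```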
And so, just to summarize what we have, I'm showing here the perceived distance plotted against the predicted distance. This is actually a combination model that combines the best of a variety of different computer vision models, including deep networks. And what we find is a very nice correlation across these points.
What we were intrigued by was whether there would be systematic differences for the points that lie well above or below the [INAUDIBLE]. And it turns out that, actually, there were very systematic differences. There were a bunch of object pairs where the distance was overestimated-- that is, the observed perceptual distance was actually less than the predicted distance, so the computer vision models overestimated the perceptual distance. And there were a bunch of image pairs where computer models actually underestimated the perceptual distance.
And it turns out there are a lot of systematic differences. One of the differences we were intrigued by was this idea of symmetry. What we found is that symmetric objects are actually a lot more distinctive in perception compared to what you see in computer vision models. So if deep networks, or computer vision models in general, are not treating symmetric objects as distinctive, then maybe they're not learning symmetry from the natural images or the data sets that they're being trained on.
So, as a proof of principle, we can then ask, can we include symmetry features in a deep network and actually improve performance? This is the pipeline by which we established this: we took images, we extracted symmetry features by computing symmetry across many different two-dimensional axes, and then we trained a classifier.
And we have a regular deep network over here, and we get the output of the deep network classifier. We then merge the two classifiers and ask whether the combination of the two actually improves performance. And these are the results: we find, in a cross-validated sense, that there's a significant improvement in performance on [INAUDIBLE], using several different deep networks and on several cross-[INAUDIBLE].
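A toy sketch of this classifier-combination idea is shown below: one classifier trained on deep-network features, one on symmetry features, with their class probabilities averaged. Everything here is a placeholder; the study's actual architectures, symmetry features, and merging rule may differ.

```python
# Toy sketch: merge a deep-feature classifier with a symmetry-feature classifier
# by averaging their predicted class probabilities. All data are placeholders.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n_images = 600
labels = rng.integers(0, 10, size=n_images)    # 10 placeholder object classes
deep_feats = rng.normal(size=(n_images, 512))  # stand-in for deep-network features
sym_feats = rng.normal(size=(n_images, 32))    # stand-in for symmetry-across-axes features

train, test = slice(0, 500), slice(500, None)
clf_deep = LogisticRegression(max_iter=1000).fit(deep_feats[train], labels[train])
clf_sym = LogisticRegression(max_iter=1000).fit(sym_feats[train], labels[train])

# Average the two classifiers' probabilities and take the most likely class.
proba = 0.5 * clf_deep.predict_proba(deep_feats[test]) + \
        0.5 * clf_sym.predict_proba(sym_feats[test])
merged_accuracy = (proba.argmax(axis=1) == labels[test]).mean()
print(f"Merged-classifier accuracy: {merged_accuracy:.2f}")
```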
So the point here is that if you compare object representations between deep networks and human perception, we find systematic, but fixable, biases. The idea is that whatever biases we've observed between humans and deep networks can potentially be fixed, and that, as a proof of principle, fixing them should also improve performance. These results have recently been published in the journal [INAUDIBLE].
So the next question that we wanted to ask-- and we have a few minutes to wrap up-- is can we now go beyond quantitative differences? So far, what we've been talking about is that deep networks show a nice correlation with object representations. But these are all quantitative differences, and we wondered whether we could [INAUDIBLE] think about instances where there are qualitative differences between deep networks and human perception.
And so when we started looking around and thinking about this problem, it turns out that psychologists and vision scientists have been looking at these kinds of qualitative differences for a while now. Here's one good example.
This is the Thatcher effect. All of you have probably noticed that these two faces look really different when they're viewed upright, but if you invert them, they look very similar to each other. And so the Thatcher effect can be restated not as an effect in terms of the perception of a single face, but in terms of the distances between the upright faces and the inverted faces. The idea is that these two upright faces look very, very different from each other, whereas the same two faces, when inverted, actually look very, very similar.
So we can start recasting very popular perceptual phenomena into statements about the underlying representation of distance. The Thatcher effect is simply a statement that the distance between two upright faces is larger than the distance between the same two inverted faces. And now we have a tool to compare and ask whether deep networks also show the Thatcher effect.
So you can go about saying that, well, if the perceptual representation is like this, we can compute a Thatcher index, which is simply the difference divided by the sum of the two distances. And we can now compare this Thatcher index for deep networks as well. So you can take the deep network, look at the feature activation in every layer, and then compute the distance between two upright faces and two inverted faces. And now you can characterize whether the deep networks actually experience the Thatcher effect as a function of the different layers of the deep network.
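A minimal sketch of the Thatcher index computed from a network layer's activations, with random vectors standing in for the actual face activations:

```python
# Sketch of the Thatcher index: (d_upright - d_inverted) / (d_upright + d_inverted),
# computed on a layer's activations. Positive values match human perception.
import numpy as np

def thatcher_index(up_a, up_b, inv_a, inv_b):
    d_up = np.linalg.norm(up_a - up_b)     # distance between the two upright faces
    d_inv = np.linalg.norm(inv_a - inv_b)  # distance between the two inverted faces
    return (d_up - d_inv) / (d_up + d_inv)

rng = np.random.default_rng(0)
acts = [rng.normal(size=4096) for _ in range(4)]  # stand-ins for layer activations
print(f"Thatcher index: {thatcher_index(*acts):+.2f}")
```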
And so here are the results. A Thatcher index greater than 0 means the network is similar to human perception; a Thatcher index less than 0 means it's qualitatively different, because the two upright faces are no longer more distant than the two inverted faces.
And what we can see is three different deep networks shown here for comparison. VGG16 is the standard VGG architecture trained on ImageNet object classification. VGG-face is the VGG architecture trained on face recognition. And [? VGG-rand ?] is just the randomly initialized VGG network. You can see here that only VGG-face actually shows a [INAUDIBLE] Thatcher effect, and VGG16, in fact, shows virtually no Thatcher effect.
And so this is all fairly sensible. We then started comparing a whole bunch of different perceptual phenomena, asking whether each one could be captured in a deep network and at what layer it is captured. And here's an example of a particular phenomenon that actually is not captured by the deep network: the global advantage effect.
All of you know this colloquially as the effect where we see the forest before we see the trees. The classic finding, reported many decades ago, is that humans can report the global shape of this kind of object faster than the local shape. And we realized that, again, this is a statement about the distances between various objects in this object representation.
So what we did was to reason that the distance between two shapes which have identical local shape but differ in global shape might be larger than the distance between two shapes that have identical global shape but differ in local shape. That is how we captured this idea of a global advantage effect. We can then compute a global advantage index, which is simply the difference divided by the sum of the global distance and the local distance. And now we can compute this global advantage index as a function of layers in the VGG network.
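The global advantage index can be sketched the same way, again with placeholder activations standing in for the network's responses to the hierarchical stimuli:

```python
# Sketch of the global advantage index: (d_global - d_local) / (d_global + d_local),
# where d_global compares stimuli differing only in global shape and d_local
# compares stimuli differing only in local shape. Positive = global advantage.
import numpy as np

def global_advantage_index(glob_a, glob_b, loc_a, loc_b):
    d_global = np.linalg.norm(glob_a - glob_b)  # pair differing in global shape
    d_local = np.linalg.norm(loc_a - loc_b)     # pair differing in local shape
    return (d_global - d_local) / (d_global + d_local)

rng = np.random.default_rng(1)
acts = [rng.normal(size=4096) for _ in range(4)]  # stand-ins for layer activations
print(f"Global advantage index: {global_advantage_index(*acts):+.2f}")
```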
And this is what we found, and it's quite dramatic. We were quite surprised by it, because what you see here is that the randomly initialized VGG architecture actually shows a global advantage, whereas the trained VGG architecture, which is trained on ImageNet for object classification, actually shows a local advantage. This is consistent with a number of studies coming out now showing that VGG [INAUDIBLE] deep networks are actually sensitive to local texture and local features rather than the global shape of an object.
To summarize what we found in this study, we looked at a number of different perceptual phenomena and simply asked whether each phenomenon is present in deep networks or not. We got a bunch of yes answers and a bunch of no answers. And what we think is common to a lot of the no answers is that they involve more nuanced processing of objects, like 3D processing, part decomposition, shadow processing, the global advantage effect, and so on.
And so what we think is that this list of properties will tell us what kinds of training are sufficient to produce these properties in a deep network, and, on the flip side, also help us elucidate what kinds of computations are required at different stages in the brain as well. So, to summarize what I've shown you here, comparing perceptual phenomena can reveal qualitative similarities and differences. And this is a study also published in [INAUDIBLE].
So I want to end by thanking our lab. This is our lab at the moment. You can follow us on our website. And these are the funding agencies that are paying us to do what we'd gladly pay [INAUDIBLE]. Thank you.