Model-agnostic Measure of Generalization Difficulty
Date Posted:
May 5, 2023
Date Recorded:
February 14, 2023
CBMM Speaker(s):
Akhilan Boopathy
Description:
CBMM Research Meeting: Akhilan Boopathy, MIT graduate student in the Fiete Lab
Abstract: The measure of a machine learning algorithm is the difficulty of the tasks it can perform, and sufficiently difficult tasks are critical drivers of strong machine learning models. However, quantifying the generalization difficulty of machine learning benchmarks has remained challenging. We propose what is to our knowledge the first model-agnostic measure of the inherent generalization difficulty of tasks. Our inductive bias complexity measure quantifies the total information required to generalize well on a task minus the information provided by the data. It does so by measuring the fractional volume occupied by hypotheses that generalize on a task given that they fit the training data. It scales exponentially with the intrinsic dimensionality of the space over which the model must generalize but only polynomially in resolution per dimension, showing that tasks which require generalizing over many dimensions are drastically more difficult than tasks involving more detail in fewer dimensions. Our measure can be applied to compute and compare supervised learning, reinforcement learning and meta-learning generalization difficulties against each other. We show that applied empirically, it formally quantifies intuitively expected trends, e.g. that in terms of required inductive bias, MNIST < CIFAR10 < ImageNet and fully observable Markov decision processes (MDPs) < partially observable MDPs. Further, we show that classification of complex images < few-shot meta-learning with simple images. Our measure provides a quantitative metric to guide the construction of more complex tasks requiring greater inductive bias, and thereby encourages the development of more sophisticated architectures and learning algorithms with more powerful generalization capabilities.
PRESENTER: So today we have Akhilan Boopathy from Ila Fiete's lab. Some of you know him. Akhilan went to MIT to get his degree in electrical engineering and computer science. And right now, he is a PhD student with interests at the intersection of machine learning/AI and neuroscience.
And I think the hope of his work, or what he hopes to achieve, is to make more robust architectures that can be challenged with more real-world applications. So today, he's going to tell us about this model-agnostic measure of generalization difficulty.
AKHILAN BOOPATHY: Thanks for the intro. Yeah, so this is titled model-agnostic measure of generalization difficulty. Work done in collaboration with several others, Kevin, Jaedong, Shu, Asaad, and my advisor, Ila.
So before going into the content of the talk, first I'll give the motivation for this project. Then I'll talk about the distinction between sample complexity and a concept I'll call inductive bias complexity. And I'll get into what that means later on.
Then, I'll go into how you actually quantify inductive bias complexity. And then, finally, I'll go into our experimental results and conclude. And I'm happy to take questions at any point.
But first, just to motivate this project. Over the past 10 years or so, we've seen a lot of progress in machine learning. It has been driven by the availability of benchmarks. So for example, ImageNet was pretty important in the development of large scale convnets, at least from an engineering perspective.
And then, if we look at progress in deep reinforcement learning, a lot of progress there has been driven by the availability of video games like those from Atari, or even two-player games like chess or Go. And then, for large scale transformers, progress has been driven by the availability of large language corpora, Wikipedia or other sources on the internet.
So given that we historically know that good tasks and benchmarks are super important for progress in the field, this raises the question, how do we know what a good benchmark is? Specifically, how do we know what are the good benchmarks that encourage the development of more generalizable architectures and learning algorithms? And so that's the question we're going to try to answer in this project.
So to answer this question, we're going to first look at this distinction between two concepts, sample complexity and this concept I'll call inductive bias complexity. And I'll go into what that means in just a few slides. But first, well, let's imagine that we're in a typical machine learning setting. So we have some set of training data.
Typically, the training data is fed through some learning process to produce a trained model at the end. And the learning process typically includes a lot of inductive biases. It can be the model architecture we're using, the learning rule we're using, the initialization of the model even, and various other hyperparameters. And people generally think that both the training data and the inductive biases are important to generalize well.
Now, one question you can ask is, how could we measure how much training data or inductive bias is needed to generalize? One way to answer this question is to look at something we'll call the hypothesis space. Now, what exactly do I mean by hypothesis space here?
It's easiest to imagine what a hypothesis space is in a regression setting. So imagine that we're trying to perform regression on maybe these red points here. And the orange curves represent different hypotheses for how these red points are modeled.
Now, you may imagine that there are basically two ways we could split up the hypothesis space. One is we could categorize hypotheses by whether or not they fit the training data. So for example, these two hypotheses on the left here, both of them fit the training data. So the orange curves go through the red points. And then, this hypothesis on the right does not.
Another way we could split up the hypotheses is by whether or not they fit our inductive biases. So both of these two curves on the right-- yes, on your left. They look nice and smooth. They maybe fit what our notions-- our intuitive notions of what a well-generalizing function should look like.
Meanwhile, this other hypothesis on the right, it goes through the training points but it has these really high frequency fluctuations that are higher frequency than what-- the kinds of patterns we observe in the data. So we'd probably expect that those kinds of functions don't generalize very well. So those don't fit our inductive bias.
And at the intersection of these two, that's where we'd expect our well-generalizing hypothesis to be. So they need to both fit the training data and fit our inductive biases.
So sample complexity just tells us how much training data we need to generalize. So here, the top row shows one setting-- one learning setting-- where we have a small amount of training data. And with this small amount of training data, we're not able to find the true hypothesis. There's just not enough information for us to find what the true function is.
When we increase the amount of training data, we can find the true hypothesis. And so sample complexity here is just the minimum amount of training data we need to find the true hypothesis or get close to it.
Inductive bias complexity is defined basically the same way, but for inductive biases instead of the training data. So here, now we're imagining that our training data is fixed. We have a fixed number of training points. And now, we're varying aspects of the learning process. So in this top row here, we're imagining a setting where we don't have many constraints on the model architecture, for example.
In this setting, we produce a hypothesis that fits the training points, but doesn't do so in a way that we'd expect to generalize well. It has these weird non-smoothnesses, or it's just very spiky in general. And we don't expect that to perform well.
But if we add some more constraints on our learning algorithm or architecture, for example, then we do get a well-generalizing hypothesis. And so this inductive bias complexity is just the minimum amount of inductive bias we need to add in order to get a well-generalizing set of hypotheses.
Now, in general, we might expect that there's a trade-off between these two quantities. If we have a low amount of training data, we'll probably need a lot of inductive bias to generalize on the task. And conversely, if we have little inductive bias, we'll need a lot of training data to generalize.
So just to really nail down the distinction between these two quantities. Sample complexity is specific to a set of inductive biases. So different architectures will probably have different sample complexity because maybe an architecture with more inductive bias will require fewer training samples to generalize on. In the same way, inductive bias complexity is specific to a training set. So if I have more training data, then I'll probably need less inductive bias.
Sample complexity tells us how much information is needed from the training data's perspective--the information the training data provides in order to generalize. Inductive bias complexity tells us how much information comes from us as model designers. So how much information do we put into the architecture, the learning [INAUDIBLE], initialization, and so on? Sample complexity can be used to evaluate different models. So we can say model A is better than model B because model A requires fewer samples to generalize with. In the same way, inductive bias complexity can be used to compare different tasks. So we can say task A is better than task B because task A requires more inductive bias from us as model designers.
Now, sample complexity, people have shown in the generalization literature, correlates with measures of model class complexity. So people have shown things like you can upper bound the number of samples you need to generalize with some measures of model class complexity or [INAUDIBLE] complexity, with more constrained model classes meaning that you need fewer training samples to generalize.
Analogously, we'll see that inductive bias complexity correlates with qualitative notions of task difficulty. And we'll get into this in the experimental results. And then, finally, sample complexity is relatively easy to quantify. So you can just take a model, train it on a task, and then see how many training samples it needs to generalize well.
Inductive bias complexity, on the other hand, is harder to quantify. And actually, to our knowledge, no quantitative measures of this exist yet. There are plenty of qualitative ways to assess inductive bias, but nothing quantitative so far.
So, the specific question we're going to try and answer here is, how can we actually quantify inductive bias? And we're going to ask an information theoretic question. How much information does the inductive bias provide about the correct hypothesis?
And it turns out this is-- there's a nice information theoretic answer to this question. The information content in the inductive bias is just related to the amount by which the inductive bias shrinks the hypothesis space. So the expression for this information content is here. It's negative log of some fraction. The denominator of the fraction is the volume of the full set of hypotheses that fits the training data.
And the numerator here is the volume of hypotheses that also generalize well. So essentially, this is saying that if we have this very large space of interpolating hypotheses, the ones that fit the training data, but only a very small fraction of them generalize well, then we need to put in a lot of information as a model designer to specify that well-generalizing set.
And so then, this fraction is going to be small and negative log of the fraction will be very large. And so that is the amount of information we need to provide. And conversely, if most of the hypotheses that fit the training data generalize well, then we don't really need to provide much information.
And so this metric is something we can really think about as the difficulty of a task, at least, the difficulty to generalize on a task. Because it's telling us how much information we as model designers need to put in to generalize. Now, a couple of things to note about this definition.
First of all, it's quite general. We can use it to compare inductive biases across many different domains ranging from supervised classification to reinforcement learning to meta learning and we'll go into all of these domains in our experimental results.
And another thing to note is that this requires quantifying sizes or volumes in the hypothesis space. So first of all, it requires setting a hypothesis space for a task. And then, it requires computing volumes in that space. So how we'll do that is what I'll go into next. Yeah, but also happy to take your questions at any point. [INAUDIBLE]
All right, I'll move on. So our first step to actually being able to quantify what we saw earlier is just to write out that expression more formally and mathematically. So here I'm defining task difficulty, or this inductive bias complexity measure, as negative log of, now, a probability.
So this probability is just the conditional probability that, given that a hypothesis interpolates the training set, it generalizes well, up to some error rate epsilon. So just walking through the notation a little bit here.
Theta is a hypothesis, which is a random variable. And p and q represent the test and training sets respectively. And e of theta comma p, or e of theta comma q, represents the error of hypothesis theta when applied to the distributions p and q respectively.
So this is just saying that-- this expression here on the right is just saying that we interpolate the training set with zero error on the training set. And we're looking at the probability that we generalize up to error rate epsilon.
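(For reference, here is one way to typeset the definition just described, using the notation from the talk; the paper's exact formulation, including the prior over theta, may differ in its details.)

```latex
% Task difficulty / inductive bias complexity as described verbally above:
% theta is a hypothesis drawn from the hypothesis space, q the training set,
% p the test distribution, e(theta, .) the error of theta, epsilon the allowed test error.
\[
  \text{task difficulty}
  \;=\;
  -\log \Pr_{\theta}\!\bigl[\, e(\theta, p) \le \epsilon \;\bigm|\; e(\theta, q) = 0 \,\bigr]
\]
```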
Now, to actually be able to compute this quantity, we're going to make several approximations and bounds on it. The first step here is going to be to assume some kind of relationship between the training and test distributions.
So to understand what we're going to do here, imagine that we have a true hypothesis and we have some interpolating hypothesis. We know that because the interpolating hypothesis interpolates the training data, they're exactly equal at training points. So they're exactly equal here. And they're probably approximately close to each other in some region around the training point.
And so what this allows us to say is that if a test point is pretty close to a training point, then there's probably relatively little generalization error. And more formally, we can bound the generalization error at a test point based on the distance between the training and test distributions.
And so that allows us to lower bound the probability that we generalize as follows. So the probability that we generalize, given that we interpolate the training data, is lower bounded by this expression here, which basically says that our hypothesis lands within some radius of the optimal hypothesis, which is the true hypothesis, the one that generalizes perfectly.
And here, this fraction is that radius. So if we take a look at this fraction, in the numerator, we have epsilon, which is the desired error rate. And that makes sense. If we are allowing for a larger error rate on our test set, then it will be easier to generalize. It's easier to land within the radius of theta star, and we can more easily guarantee generalization.
Similarly, in the denominator, there is this expression Wpq, which is the Wasserstein distance between the training and test distributions, which is just some measure of distributional distance between these two distributions. Now, if our training and test distributions are really close together, then this Wpq term will be very small. And so that makes this radius quite big.
And that, again, says that it's easy to generalize because you can easily land within this radius ball of theta star. And that guarantees good generalization.
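(A sketch of the bound just described, with the constants and regularity factors that the talk glosses over omitted; W_{p,q} denotes the Wasserstein distance between training and test distributions.)

```latex
% Landing within a ball of radius epsilon / W_{p,q} around the true hypothesis theta*
% guarantees test error at most epsilon (up to omitted constants), so:
\[
  \Pr\!\bigl[\, e(\theta, p) \le \epsilon \;\bigm|\; e(\theta, q) = 0 \,\bigr]
  \;\ge\;
  \Pr\!\Bigl[\, \lVert \theta - \theta^{*} \rVert \le \tfrac{\epsilon}{W_{p,q}} \;\Bigm|\; e(\theta, q) = 0 \,\Bigr]
\]
```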
All right, so if we just take the bound from the previous slide, apply a negative log to both sides, we get now an upper bound on our task difficulty expression. So the task difficulty is now upper bounded by negative log of this probability here, the probability that we land within this radius of theta star. And just for notational convenience, I've put the-- I've expressed this condition as this big theta q. This is just the set of all interpolating hypotheses.
Now, if we actually want to quantify what this is, we still need to make some more assumptions. Specifically, we need to assume something about the hypothesis space we're working with. So this still requires assuming what the hypothesis space is, because we need to compute the volume of this sphere or ball around theta star.
So what is the hypothesis space? A hypothesis space is something we can think about as a prior over possible functions we might want to consider when trying to generalize on a task. And the first thing to note is that if we're considering just the set of interpolating hypotheses on a task, considering all possible functions is probably not something that's reasonable to do.
So for instance, let's say we have this set of training points on the left. Maybe this orange curve is a reasonable hypothesis. Probably something we might want to consider when trying to solve the task, this regression problem. But this hypothesis over here is probably not reasonable. So this is the same hypothesis, but it has this-- it takes a different value at one point. It has this weird discontinuity. And it's probably not something anyone would consider when trying to solve this regression problem.
It turns out that if we consider all hypotheses, including the ones with these weird discontinuities on the left, then an infinite amount of information would be required to generalize on basically any problem. And this is because if we include the hypotheses like the one on the left here, the hypothesis space becomes infinite dimensional. And specifying a well-generalizing set within that infinite dimensional space requires an infinite amount of information. And so that won't be a useful measure to us.
Instead, we're going to restrict our hypothesis space to something finite dimensional, but still quite general and without imposing many restrictions on the space. So how do we do this?
We're first going to start out by imagining we have just a 1D input. So here, imagine that the task input just lies on some bounded interval of the real line going from minus R to R. And let's say we're performing regression of some function going from this interval to the real line.
Now, a reasonable hypothesis for this task might be this nice smooth curve here. An unreasonable hypothesis might be one that includes some really high frequency fluctuations. So let's say that we want to include functions like the first and exclude functions like the second.
One natural way to parameterize the hypothesis space to allow this separation is to just parameterize the hypothesis as a Fourier series. So in this Fourier series, we have some linear combination of some Fourier basis functions. So here, e to some constant times omega x. Omega is the frequency of the basis function. X is the input to the task.
And these theta sub omegas are the hypothesis space dimensions. They are the coefficients that correspond to different dimensions in the hypothesis space. And one thing that's important to note here is that we have a finite number of terms in this Fourier series. And that's because we've cut off the maximum frequency that functions can take here.
So we've provided some restriction on the hypothesis space. And that actually allows us to make the hypothesis space finite dimensional. And so the dimensionality just scales here with this big M.
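(A minimal Python sketch of the 1D construction just described: hypotheses as truncated Fourier series with maximum frequency M, fit to training points by least squares. The target function and all values are illustrative, not from the talk.)

```python
# A minimal sketch of the 1D hypothesis space described above: hypotheses are
# truncated Fourier series on [-R, R] with maximum frequency M, so the space
# has 2*M + 1 dimensions (one coefficient per retained frequency).
import numpy as np

def fourier_features(x, M, R):
    """Real Fourier basis (constant, cosines, sines) up to frequency M on [-R, R]."""
    cols = [np.ones_like(x)]
    for k in range(1, M + 1):
        cols.append(np.cos(np.pi * k * x / R))
        cols.append(np.sin(np.pi * k * x / R))
    return np.stack(cols, axis=1)            # shape (n_points, 2M + 1)

R, M, n = 1.0, 5, 20
rng = np.random.default_rng(0)
x_train = rng.uniform(-R, R, size=n)
y_train = np.tanh(3 * x_train)               # a smooth, "reasonable" target (illustrative)

Phi = fourier_features(x_train, M, R)
theta, *_ = np.linalg.lstsq(Phi, y_train, rcond=None)   # coefficients theta_omega

print("hypothesis-space dimension:", Phi.shape[1])       # 2M + 1; grows linearly with M in 1D
```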
So this is maybe one reasonable way to define a hypothesis space and just a simple one dimensional case. But what about for more general problems? Like most of the time, we're not working with inputs that are one dimensional, but we're looking at high dimensional inputs that lie on some complicated manifold. How do we deal with that?
So we can borrow this idea from a paper in 2020, which basically does exactly this. So what they do is they consider regression on low dimensional manifolds. And they provide some sample complexity bounds for performing regression on these manifolds.
And they set a hypothesis space based on eigenfunctions of a certain operator on that manifold. It's just called the Laplace-Beltrami operator. And it turns out that the eigenfunctions of this operator generalize Fourier basis functions. So if you apply this operator to R^n, for example, Euclidean space, it will produce your usual exponential basis functions.
Now, what they show here is that if you construct the hypothesis space with these eigenfunctions, the dimensionality scales exponentially with the intrinsic dimensionality of the input data to a task. So the scaling is big M-- where big M is the maximum frequency, the same maximum frequency that we saw in the previous slide-- raised to the small m, which is the intrinsic dimensionality of the data. For image classification, that would be the dimensionality of the low dimensional manifold on which images lie.
Now, to understand where this big M to the small m scaling comes from, imagine that we're setting up our hypothesis space as follows. So we're parameterizing it, again, as just some linear combination of some orthogonal basis functions. And now, they're more generally just the eigenfunctions of that operator I mentioned in the previous slide.
Now, the reason we see this big M to the small m scaling is-- well, I mean, one intuitive way to understand it is to consider the two dimensional case. So imagine that we're working with an input in R2. Now, if we take the Fourier transform of that input space, we'll get a two dimensional frequency space.
And if we imagine that we restrict our functions on the input space to have some maximum frequency, then the only frequencies that are really relevant to our task are going to lie within some radius of the origin in this frequency space.
And so, in our Fourier series, we'd have a discrete number or a finite number of frequencies because we'd only want to consider the discrete frequencies within that radius of the origin.
The number of such frequencies is just going to be-- it's going to scale as M squared. It's going to basically scale as the area of the circle, which scales as M squared. And similarly, if we're working in a three dimensional input space, then we'll have M cubed frequencies. And in general, if we're working in a small m dimensional space, we're going to have big M to the small m number of frequencies.
And so, then, we'll have that many orthogonal basis functions here. And then, that many coefficients over here-- coefficients theta. And that will determine the dimensionality of the hypothesis space we're working with.
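(A quick numeric sketch of the counting argument above: the number of integer frequency vectors within radius M in m dimensions grows roughly like M to the m. Values are illustrative.)

```python
# A quick numeric check of the M^m scaling described above: the number of integer
# frequency vectors omega in Z^m with ||omega|| <= M grows roughly like the volume
# of an m-ball of radius M, i.e. proportional to M^m.
import itertools

def num_frequencies(M, m):
    """Count lattice frequencies within Euclidean radius M of the origin in m dimensions."""
    return sum(
        1
        for omega in itertools.product(range(-M, M + 1), repeat=m)
        if sum(w * w for w in omega) <= M * M
    )

for m in (1, 2, 3):
    counts = [num_frequencies(M, m) for M in (5, 10, 20)]
    print(f"m={m}: counts for M=5,10,20 -> {counts}")
# Doubling M multiplies the count by roughly 2^m: ~2x for m=1, ~4x for m=2, ~8x for m=3.
```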
AUDIENCE: But this is bad, right?
AKHILAN BOOPATHY: This is bad in terms of--
AUDIENCE: This is the curse of dimensionality.
AKHILAN BOOPATHY: Yeah, this is exactly--
AUDIENCE: If m is anything like what people usually use-- like, I don't know, 1,000 dimensions for CIFAR images-- you have big M to the 1,000, say, 10 to the 1,000.
AKHILAN BOOPATHY: Absolutely. Yeah, so this is exactly the curse of dimensionality. Although, I think--
AUDIENCE: You know how many protons are in the universe?
AKHILAN BOOPATHY: I'm not sure, actually.
AUDIENCE: It's only 10 to the 80.
AKHILAN BOOPATHY: Oh, wow. OK. Yeah, this will be--
AUDIENCE: Much bigger.
AKHILAN BOOPATHY: Much bigger. Yeah. So one thing that's important to note here is the small m is the intrinsic dimensionality of the task. For CIFAR, for example, it wouldn't be the total number of pixels. It would be the dimensionality of the low dimensional manifold on which the CIFAR images lie. But you're still right. This will be a very large number, as we'll see experimentally.
AUDIENCE: And also, I mean, I guess one thing to note is that, actually, you want as big a hypothesis space as possible, right? Because you don't want to specify the model to any great extent. You're just asking, of the space of all hypotheses, how much information do you need to include in your specific models in order to solve the task well, starting from all possible models-- all possible even remotely reasonable models.
So we don't want to artificially reduce the hypothesis space at the moment, I guess. That's one way to respond, I guess.
AKHILAN BOOPATHY: Yeah, I agree. That makes sense. Here we're including every possible hypothesis that's reasonable. And that might be very, very large at this point. We haven't included any real inductive biases. Anything significant at this point.
OK, so moving on from here. The next step in our calculations will be to determine the dimensionality of the interpolating hypothesis space. So this is just the set of hypotheses that interpolate the training data, not the full set of all hypotheses.
And it will turn out that, basically, every training point we're given reduces the dimensionality of the hypothesis space by a fixed amount. And this is easy to see if we're just dealing with a linear hypothesis space. Because for each training point, we can just adjust one parameter of our model to fit that training point if we have a one dimensional output.
In general, the same logic applies for general hypothesis spaces. For every training point we're given, we can just adjust some fixed number of parameters in the hypothesis space to fit that new point. And so the reduction in dimensionality just becomes n times d, where n is the training set size and d is the output dimensionality of our task. For a regression problem, that would just be the output dimensionality of the function we're trying to regress.
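(A small sketch of the dimension-counting claim above for a linear hypothesis space: fitting n generic training points with d-dimensional outputs removes n times d dimensions. Sizes are arbitrary illustrative values.)

```python
# With D basis functions and d outputs, the full space has D*d coefficients, and
# fitting n (generic) points leaves a D*d - n*d dimensional set of interpolating hypotheses.
import numpy as np

rng = np.random.default_rng(0)
D, d, n = 50, 3, 10                   # basis functions, output dims, training points
Phi = rng.normal(size=(n, D))         # features of the n training inputs

# Interpolating coefficient matrices Theta (D x d) satisfy Phi @ Theta = Y.
# Each output column is independent, so the solution set has dimension
# d * (D - rank(Phi)) = D*d - n*d for generic (full-rank) Phi.
nullity = D - np.linalg.matrix_rank(Phi)
print("full hypothesis space dim :", D * d)          # 150
print("interpolating subspace dim:", d * nullity)    # 150 - 30 = 120, i.e. reduced by n*d
```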
All right, so going back to our bound. We can plug in all our approximations and we get a final expression for task difficulty. So the first step here is going to be just noting that this is a probability that we land within some ball around theta star.
So this probability just scales with the volume of the ball. And the volume of the ball is just the radius raised to the dimensionality times some constants, which we'll ignore for now. So if we just plug that in and take a negative log of both sides, the exponent comes down in front. So we get the dimensionality of the interpolating hypothesis space, and then the log of the radius here.
And then, the final approximation we're going to make is on Wpq, which recall is the distance between the training and test distributions. Now, it'll turn out that this distance scales as n to the minus 1 over m. I mean, intuitively, this is because if we have more and more training points, then we're going to more densely cover the support of the distribution that we're sampling from.
And so the distance between the training and test distributions is going to be closer together. And that makes sense because here if we increase n because of this negative sign in the exponent the Wpq will go down.
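(A small Monte Carlo sketch of the n to the minus 1 over m scaling just mentioned, using the unit square as a stand-in for the data manifold; the setup and constant are illustrative.)

```python
# As more training points are sampled from the data manifold, the typical distance
# from a test point to its nearest training point shrinks like n^(-1/m).
# Here the "manifold" is just the unit square (m = 2).
import numpy as np

rng = np.random.default_rng(0)
m = 2
test = rng.uniform(size=(500, m))

for n in (100, 400, 1600, 6400):
    train = rng.uniform(size=(n, m))
    # distance from each test point to its nearest training point
    dists = np.linalg.norm(test[:, None, :] - train[None, :, :], axis=-1).min(axis=1)
    print(f"n={n:5d}  mean NN distance = {dists.mean():.4f}   n^(-1/m) = {n ** (-1 / m):.4f}")
# Each 4x increase in n roughly halves the nearest-neighbor distance, matching n^(-1/2).
```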
All right, so after all of these approximations, we can get a final bound on the task difficulty. And this isn't the exact expression. I've removed some terms and constants just to simplify things. But this should get the main point across. Just to go through what all these terms are here.
Big M, again, is the maximum frequency of hypotheses we want to consider. D is the relevant output dimensionality of the task. n is our training set size. C is just some constant. Epsilon is the desired error rate on the test set. And small m is the intrinsic data dimensionality.
So just to note some trends of how this bound varies with these different constants or variables. Task difficulty decreases with training set size, which we expect. We expect there's some trade-off between training set size and inductive bias. And the exact trend, whether it's linear or logarithmic, depends on the values of the other variables.
Task difficulty decreases with error rate. That also makes sense. If we're allowing a larger error rate on the test set, then we don't need to provide as much inductive bias to solve the task.
Task difficulty increases with the maximum frequency of functions that we want to consider. And this maximum frequency we can really think about as the relevant resolution of functions on the data. So if we need to, for example, perform really fine grained classification on an input space, then we might want to consider really high frequency functions on that space. And so the hypothesis space will become bigger, and so the task difficulty will become bigger.
And then, finally, task difficulty increases exponentially with the intrinsic data dimensionality. So this is, again, the curse of dimensionality, where if we need to generalize over more dimensions of variation, then that's going to drastically increase the amount of inductive bias we need to provide in order to generalize well.
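(To make the trends above concrete, here is a rough sketch of the simplified bound, assembled only from the terms described in the last few slides: the interpolating-space dimension times the negative log of the radius. The paper's exact expression includes terms and constants omitted here, and the values of M, m, d, and C below are purely hypothetical.)

```python
# Rough sketch of the simplified bound described above, in bits:
#   difficulty ~ (M^m - n*d) * log2( C * n^(-1/m) / eps )
# i.e. the dimension of the interpolating hypothesis space times the bits needed per
# dimension to land within the radius-(eps / W_pq) ball, with W_pq ~ C * n^(-1/m).
# This drops terms and constants from the full expression; M, m, d, C are hypothetical.
import math

def task_difficulty_bits(M, m, n, d, eps, C=1.0):
    interpolating_dim = M ** m - n * d                    # hypothesis-space dim minus n*d
    bits_per_dim = math.log2(C * n ** (-1.0 / m) / eps)   # -log2 of the radius eps / W_pq
    return interpolating_dim * bits_per_dim

# Increasing the training set by orders of magnitude barely changes the bound,
# because the M^m term dominates -- the trade-off trend discussed above.
for n in (1.2e6, 1.2e7, 1.2e8):
    print(f"n = {n:.1e}: difficulty ~ {task_difficulty_bits(10, 40, n, 10, 0.1):.2e} bits")
```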
All right, if there's no other questions, I'll move on to our experimental results. In our experimental results, we basically-- oh, yeah.
AUDIENCE: So the task difficulty, when you were saying that you can play with the difficulty by adjusting the maximum frequency you allow, the way I thought about that is that it's an implicit assumption that the data is somewhat smooth. That's why you need only lower frequencies. But that's not the case for all tasks, right?
AKHILAN BOOPATHY: That's a good question. I guess my claim would be that all tasks have some limit that you might want to consider. Even very fine grained tasks will have some relevant resolution. It just might be very small. But that's just something we could plug into M.
AUDIENCE: I guess, in all of this-- what you've derived so far, you're implicitly assuming that your data is generated from some distribution, and that's the setting that you're working in, right? So your data is not being assembled in a deterministic way or in a particular way. You're getting it randomly. And all of the samples are being drawn from some distribution. Is that right?
AKHILAN BOOPATHY: Right. So in this expression, we do assume i.i.d. data. But just going back here, the place where we assume i.i.d. is just in this approximation of Wpq. So we could also assume, for example, that the samples are not drawn i.i.d., and we would just need to make a modification to our approximation here. We could, for example, assume the training and test distributions are very far apart. And that would just make this term very large.
OK, onto our experimental results. And then, we can circle back later for further questions. So the first thing we did is quantify the task difficulty of various image classification benchmarks. So on the y-axis here is our task difficulty. And this is the task difficulty to achieve state of the art accuracy on these data sets. But one thing that's important to note is the numbers are essentially the same, even if we fix the accuracy level we're trying to achieve. So like if we fix 90% as the accuracy level, the numbers are essentially the same.
First thing to note: data sets we would intuitively think are more complicated come out as more difficult. ImageNet is very difficult, CIFAR-10 and SVHN are slightly less difficult, and MNIST is very easy in comparison. But the other thing to note is that, even for MNIST, we require 10 to the 16 or so bits of information in the inductive bias to solve the task.
So this is typically much larger than even the number of parameters in the models one would actually use to classify MNIST. This inductive bias corresponds not to the parameters of the model, but to the specification of the model architecture itself, the learning rule, and all these other various factors that go into specifying this well-generalizing set of hypotheses that we're starting from.
And I'll go into why these numbers are so big in a bit more detail a few slides down. We also see that for ImageNet, we require 10 to the 40-something bits of information just from the inductive bias to solve the task. Yeah.
AUDIENCE: Sorry. Quick question. Maybe this is a diversion. But when I was following the presentation, you defined task difficulty, and you were discussing lower bounds and upper bounds on that so far.
And then, you just got through presenting upper bounds and now you're jumping to an empirical result and I'm not getting the bridge to how you even make this plot. And why it isn't just a lower bound and an upper bound given the earlier presentation?
AKHILAN BOOPATHY: So that's a great question. This is actually just the upper bound. We only compute this upper bound.
[BACKGROUND CHATTER]
AKHILAN BOOPATHY: Somebody.
[BACKGROUND CHATTER]
PRESENTER: Sorry, we're not working.
AUDIENCE: I didn't mean-- it accidentally unmuted. I'm sorry. I'm at the airport. I apologize for the background sound.
AUDIENCE: Oh, so she wasn't-- OK. So you pick some parameters to compute this, things like M. I didn't see how you were choosing all of that to get to this plot. It seems like there's a bunch of choices there. Can you speak to that?
AKHILAN BOOPATHY: That's a good question.
AUDIENCE: It's empirical-- that's experimenter defined, right? Or modeler defined, isn't it? If I'm understanding.
AKHILAN BOOPATHY: That's true. So we pick M based on the task. We can set a relevant resolution that we might want to consider. We just set it based on the interclass distance between different classes. So when classes are really far apart-- for MNIST, they might be far apart-- then this relevant max frequency is probably going to be smaller.
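(A sketch of one plausible way to implement the rule just described, setting M inversely proportional to the smallest interclass distance; the exact procedure and constant used in the paper may differ, and the data below are synthetic.)

```python
# One plausible reading of the answer above: classes that sit closer together require
# resolving finer detail, i.e. a larger maximum frequency M. The proportionality
# constant and the two-Gaussian data here are purely illustrative.
import numpy as np

def max_frequency_from_data(X, labels, scale=1.0):
    """Set M inversely proportional to the smallest distance between points of different classes."""
    X, labels = np.asarray(X), np.asarray(labels)
    diffs = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    different_class = labels[:, None] != labels[None, :]
    min_interclass = diffs[different_class].min()
    return int(np.ceil(scale / min_interclass))

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, size=(50, 2)),     # class 0
               rng.normal(1.0, 0.1, size=(50, 2))])    # class 1, well separated
labels = np.array([0] * 50 + [1] * 50)
print("estimated M:", max_frequency_from_data(X, labels))  # grows if the classes overlap more
```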
AUDIENCE: It's a funny coupling: the hypothesis space you're willing to consider is based on the data that someone decided to provide you. That's what you're saying now, which seems a little circular somehow.
AKHILAN BOOPATHY: Well, ultimately, we want the task difficulty bound to be a property of the data set we're considering, right? So it's not circular in that sense. It's circular only in the sense that the base hypothesis space that we're considering is a property of the task.
But that's totally reasonable in my view because we can't consider the space of all hypotheses. That's just not tractable. And it is reasonable to think that tasks which require more fine-grained classification or fine-grained distinctions between data points would be harder to solve.
AUDIENCE: OK. So in other words, you're trying to show upper bounds.
AKHILAN BOOPATHY: Yeah, these are all upper bounds.
AUDIENCE: How do you get lowercase m, the intrinsic dimensionality?
AKHILAN BOOPATHY: I mean, there are several techniques to estimate it. We use a nearest-neighbors-based estimator to estimate it for these data sets.
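(The answer mentions a nearest-neighbors-based estimator of intrinsic dimensionality. Below is a sketch of one standard such estimator, the Levina-Bickel maximum-likelihood estimator, which may not be the exact tool used; the synthetic data are just a sanity check.)

```python
# Levina-Bickel MLE of intrinsic dimension from k nearest-neighbor distances.
import numpy as np

def intrinsic_dimension_mle(X, k=10):
    """Average the per-point MLE of intrinsic dimension over the data set."""
    X = np.asarray(X, dtype=float)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    dists.sort(axis=1)
    nn = dists[:, 1:k + 1]                          # drop the zero self-distance (column 0)
    # per-point estimate: (k-1) / sum_j log(T_k / T_j), with T_j the j-th NN distance
    log_ratios = np.log(nn[:, -1:] / nn[:, :-1])
    m_hat = (k - 1) / log_ratios.sum(axis=1)
    return m_hat.mean()

# Sanity check: a 2D plane embedded in 10 ambient dimensions should give roughly 2.
rng = np.random.default_rng(0)
latent = rng.normal(size=(500, 2))
X = latent @ rng.normal(size=(2, 10))
print("estimated intrinsic dimension:", round(intrinsic_dimension_mle(X), 2))
```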
AUDIENCE: Question.
AKHILAN BOOPATHY: Yeah.
AUDIENCE: Yeah, actually, it just comes with the last question. I mean, determining the dimensionality of the manifold in a data set, I mean, that's a difficult mathematical problem. And as you say, there are many different ways to do it.
So it seems to me that there are many different ways to do m and many different ways to determine capital M. And your assumptions on what to use to determine those might drastically alter your results. I mean, the equation seems very intuitive. To some extent, it makes sense.
But to actually apply it, it seems like the things that you're plugging in are almost as hard to determine as the thing you're trying to determine from the outset.
AKHILAN BOOPATHY: That's a great comment. I think one thing that's important to note is not all of these quantities are very relevant to the final number we get. So it turns out that most of these are not really going to change the final scale of the inductive bias measure that we get. The most important one turns out to be the small m, which is the intrinsic data dimensionality, which as you pointed out, is very difficult to estimate often for these image classification data sets.
But I mean, our goal is not really to find a very precise measure of this inductive bias complexity. It's rather to find the general trends of how different data sets can be compared in this measure, and then show how task difficulty changes as we vary different properties of the task.
So at this stage, it's not really designed to be quantitatively precise. We could, for example, do more quantitatively precise estimation of this intrinsic dimensionality and then find more quantitatively precise estimates here if we wanted to.
AUDIENCE: How does the measure you're computing-- if you can go back to that-- this looks like the logarithm of a generalization bound you could get from using Rademacher complexity or VC dimension, something like that. These qualitative takeaways I think you could get from those bounds as well. So how does this differ from that?
AKHILAN BOOPATHY: Yeah, that's also a great question. The thing we're trying to approximate here is not sample complexity, right? It's this amount of inductive bias information we need from a model designer's perspective in order to solve the task. And it's really no surprise that there's an exponential scaling with intrinsic data dimensionality for both sample complexity and inductive bias complexity.
It would be weird if there weren't the same kind of scaling, right? Because one wouldn't expect that adding more intrinsic dimensions of variation should affect only one but not the other. So in my view, it's not really a surprise that this expression is very similar to what we see in generalization bounds. But again, we are measuring a different quantity.
All right, I'll move on to our next result. So here, we show the amount of inductive bias contained in different architectures when applied to different data sets. So each of these plots on the x-axis shows the inductive bias information for these three different data sets.
And for MNIST, linear models tend to provide the least inductive bias. Some more complicated convolutional networks provide more inductive bias. And similarly, for ImageNet and CIFAR-10, more complicated vision transformer models provide more inductive bias than some simpler convnets.
Now, one thing that's important to take away from this set of plots is that the same architecture, or almost the same architecture, can provide very different amounts of inductive bias when applied to different tasks. So here we have a vision transformer applied to CIFAR-10 and it provides a relatively small amount of inductive bias. Actually, orders of magnitude less inductive bias than the same vision transformer architecture applied to ImageNet.
And so what this is essentially saying is that a data set can extract more information from an architecture if the data set itself is more complex. So different data sets can extract different levels of information from a fixed architecture.
All right, next plot. Here, we're plotting the trade-off between the training data and inductive bias. The amount of training data is on the x-axis, and the y-axis is our inductive bias measure. And this is on ImageNet. So here, we're just varying the number of training samples that we're sampling from ImageNet.
And as you can see, there's a trade-off. So if we have more training samples, we need less inductive bias. But one thing-- if you look at the axes of this plot, one thing you'll notice is that even if we increase the amount of training data by many orders of magnitude on ImageNet, we still only reduce the task difficulty by a very tiny amount relative to the scale of this plot.
So we start from maybe 2.5 times 10 to the 41 bits of information and it goes down to maybe 2.4. What this is essentially saying is that, even if we drastically increase the amount of training data, we'll still need about the same amount of inductive bias to solve the task. So data is providing very little information relative to the inductive biases that go into solving something like ImageNet or generalizing on something like ImageNet.
AUDIENCE: But that's just because ImageNet images are 224 by 224 versus something like 30 by 30 for CIFAR or MNIST, right?
AKHILAN BOOPATHY: Oh, I guess two points on that. First of all, this is all based on the intrinsic dimensionality of ImageNet or CIFAR-10 or MNIST. So yeah, I mean, it's not just based on the extrinsic dimensionality. And the second comment is, I mean, we see the same pattern for MNIST. Even if you drastically increase the number of training samples, inductive bias goes down less than you would expect.
AUDIENCE: Yeah, but from your expression, the task difficulty seems to depend only logarithmically on the number of samples versus--
[BACKGROUND CHATTER]
AUDIENCE: Exponentially on the dimensionality, right? So from the-- so most of the contribution to your measure is the dimensionality of the--
AKHILAN BOOPATHY: Exactly. Yeah. Yeah, that's exactly right. All right, so so far--
AUDIENCE: If I could interject here just to ask a question. I mean, I'm having a tough time with the y-axis along the lines of what Tommy said. Yeah, it doesn't go down very much, but that number is too big for me to comprehend. I mean, many, many orders of magnitude more than the number of atoms in the universe.
I don't know how to interpret that y-axis that way. I mean, what that really means.
AKHILAN BOOPATHY: That's a wonderful question. I'll get to that comment in three slides, I think. Yeah.
AUDIENCE: Well, mine was the same version of the same question, just said another way. You have many orders of magnitude of data increase on the x-axis having essentially zero effect on whatever you're measuring on the y-axis at some huge number. I don't know what that means.
I would like to think more data samples are going to make tasks easier. But you're saying many orders of magnitude more data leaves it just as hard, as I read this plot. It doesn't make a lot of intuitive sense.
AKHILAN BOOPATHY: Yeah, I guess, hold on to that thought. I'll get to it in three slides. OK, so we can also do the same thing for other domains. Here, we did it for reinforcement learning. And these are continuous control tasks from the [INAUDIBLE] framework. And again, we see that more intuitively simple tasks, like Reacher, which is a very simple arm task where you have to move in a 2D box and reach a particular location, are simpler than these more complex locomotion tasks.
And importantly, we can-- we get similar numbers for task difficulty and same orders of magnitude as the image classification task. So this is saying that even though these tasks are in very different domains, we can compare them along this unified metric.
And another interesting observation: tasks with noisier observations require a lot more inductive bias to solve. So to assess this, we took the Cart-Pole task, which is an inverted pendulum control task. In this task, the reinforcement learning agent gets four observations-- I think it's the position, velocity, and then the angular position and angular velocity-- and it's asked to make some control action at the bottom of the pole to control it.
And so the original Cart-Pole task is very simple, so it requires relatively little inductive bias to solve. But as we added noise to the state observations, first of all, the number of observations required to accurately determine the state of this inverted pendulum increases.
And so that corresponds to a huge increase in the inductive bias complexity of the task. And intuitively, the way to think about this is, if we add noise, that increases the intrinsic dimensionality of the task and that makes the task harder.
All right, and this is the last results slide before I circle back to the comment about why the scales are very large. Finally, we looked at meta learning tasks. So here, I've plotted task difficulty for ImageNet versus Omniglot. And Omniglot has this very huge task difficulty of 10 to the 140-something bits of information.
And this is maybe counterintuitive because ImageNet has these really rich images, and Omniglot is just a meta learning data set where we're given a set of alphabets and you need to generalize-- few-shot learn-- on an unseen test alphabet where you're only given a small number of samples.
And all of these images are very simple, MNIST-looking. But still, the task difficulty of the data set is very large. And the reason is just that ImageNet images lie on a pretty low dimensional manifold-- I think less than 50 dimensions or something, probably.
Whereas, in Omniglot, you need to generalize over a much higher dimensional space of alphabets, which turns out to have a much higher intrinsic dimension. And so that explains why this is much larger.
All right, but finally, getting to this question of how we can understand these very large scales of task difficulty. One way to think about this is that, first of all, we're starting with a hypothesis space that's really massive. It includes basically all-- the space of all band limited functions. So we're taking a space of all functions, cutting off the very, very high frequency ones. But this is still a very, very large space.
It's likely we think that functions which are even computable on standard hardware-- standard computing hardware are themselves a very, very tiny portion of this full hypothesis space. And even within those, the ones that are accessible by relatively small sized neural networks are themselves a small portion of that set of computable functions.
Now, for the set of functions that fit our training data, if we just sample one randomly, most of them are probably not going to be computable on standard hardware. Maybe they're not smooth enough, they don't have enough compositionality, they're just not easily accessible by us.
And we might expect that the set of models that generalize well are the ones that are accessible by these relatively small neural networks. And remember that the way we measured inductive bias was based on this fraction-- the fraction of this purple box to this red box, the fraction of well-generalizing models to this full set of training-interpolating models.
And it's natural that this fraction is going to be very, very tiny in most cases, just because our hypothesis space is very massive. And that will lead to a very, very large number for inductive bias. Essentially, what this is saying is that a lot of the inductive bias doesn't come from us directly as model designers, but maybe just from the choice of the computing infrastructure we're using, or even the fact that we're only able to write a reasonably sized amount of code to actually implement one of these models.
So a lot of the inductive bias we're putting in just comes from these constraints that don't even-- that start even way before we define the architecture of a model. And that's why we think these numbers are so large.
AUDIENCE: Could you give an example? I mean, functions computable on standard hardware-- obviously, mathematicians come up with all kinds of wacky functions. But I mean, what's a function we might be interested in that's not computable on standard hardware?
AKHILAN BOOPATHY: One easy way to construct such a function: basically, imagine you take the full set of band limited functions on, like, a 100 dimensional space and just randomly pick one. The amount of information you need to just specify that one function is probably larger than the size of the universe or something.
So it's not even possible to express those functions just because the space is so massive. Obviously, you can specify particular functions in that space. But specifying an arbitrary hypothesis in that space is not feasible, we think.
AUDIENCE: I don't understand how this connects to the high values, though. Can you make that intuition?
AKHILAN BOOPATHY: Yeah, I mean, so remember that the high values are just negative log of the fraction here-- the fraction of the purple box to the red box. And most of this red box lies outside of the functions computable on standard hardware. So that's going to be a very, very large space.
And the purple box is probably going to lie within this orange box of functions computable on standard hardware or expressible by relatively small neural networks. So already, without making any really strong assumptions about our models, we have that this ratio is very, very tiny.
So even just starting out, we have that the amount of inductive bias that goes into solving these tasks is going to be very massive.
AUDIENCE: So then you-- so I get that point. And that's the large number. But then, you're saying, any of the choices we make don't really matter. The number of samples we collect, the number of whatever biases we think we want to build into an architecture kind of don't matter. Is that a summary-- in that relative scale sense.
AKHILAN BOOPATHY: In the relative scale sense, these choices do matter. For example, here we showed that different architectures provide different levels of inductive bias. And there are some significant differences between them.
If you just ignore the scale, you can measure these differences and say, this provides-- convolutional networks provide this much more inductive bias than linear networks on MNIST. So you can say something.
AUDIENCE: Is this your last slide? No, sorry. The one before?
AKHILAN BOOPATHY: This one?
AUDIENCE: Because I have something to say about this. But maybe you finish.
AKHILAN BOOPATHY: Oh, I have a conclusion slide.
AUDIENCE: Sorry, I thought you were done. Yeah, go ahead.
AKHILAN BOOPATHY: Yeah, I mean, this is just a summary of what I talked about. So we found that there's a trade-off between inductive biases and training data. And then, we found some way to quantify the amount of inductive bias based on this negative log of [INAUDIBLE] quantity.
We found that inductive biases for typical tasks are very, very massive-- the amount of inductive bias required to generalize on them. And then, finally, we found that tasks requiring generalization over many dimensions of variation require more inductive bias, so the curse of dimensionality. And a good way to construct tasks is to add more dimensions of variation to the task, or consider tasks with many dimensions of variation. Yeah, but that's my conclusion.
[APPLAUSE]
AUDIENCE: If you go back to the other slide. So I think there is a memo that I wrote recently. And basically, it says that, in practice, high dimensional functions do not exist. So there are several ways to say this in your article.
I think the numbers you presented are, say, in the right order of magnitude for shallow networks. Now, it turns out, if the results in the [INAUDIBLE] are correct, that computable functions in the Turing sense-- and there are some subtleties there, because we are speaking about continuous functions.
So for continuous functions, I have to assume that bounded first derivatives exist. But then, there should always be a compositionally sparse representation of a computable function. If it's computable in polynomial time, these sparse representations have a finite number-- at most a polynomial number-- of nodes. You can think of it as a graph, a function graph where the various nodes are functions.
So the overall graph is a function of functions of functions. Each of the constituent functions depend on a relatively small number of variables. So if you do this, the dimensionality of the overall function is essentially bounded by the dimensionality of the biggest constituent function.
AKHILAN BOOPATHY: I see.
AUDIENCE: So the numbers you have are probably going to be reduced by many orders of magnitude for the compositional version of it. But I expect the results to be correct, the ones you mentioned.
AKHILAN BOOPATHY: I see. Yeah.
AUDIENCE: But the numbers to be much more acceptable.
AKHILAN BOOPATHY: Yeah. That makes sense.
AUDIENCE: But there is one consequence here. I think there is a bias, which is to find this sparse representation. Every function has many compositional representations; it's not unique. Some are sparse and are good.
And how do you find them, right? For instance, for a convolutional network, you assume a certain sparse representation, which is given by the architecture of the network, which is sparse. Every neuron receives only 3 by 3 inputs or whatever.
If you know this sparse representation, or you assume it, then you're OK. But if not-- and this is maybe the case for transformers-- the real task is to find a good compositional and sparse representation. And that would be the main prior information, the main bias. So essentially, the architecture of the network, if you want.
Anyway, something to be discussed. I'm not saying this is necessary or correct, but I hope so.
AUDIENCE: For the table of numbers you had for the architectures, how did you derive them?
AKHILAN BOOPATHY: Yeah, so that's a great question. So what we did is we just computed the generalization performance of each of these models. And then we can plug that into our formula from earlier as the error rate here.
AUDIENCE: Oh, I see.
AKHILAN BOOPATHY: And then, that will tell us how much information the model provides about the task.
[SCUFFING NOISES]
AUDIENCE: I have a question on how you treated the noise. Because if you assume that you have noisy observations, I guess the way you define the hypothesis space is as interpolating all the data points. But it looks like you might define the hypothesis space differently if you assume that the data is noisy-- you're not trying to pass through the data.
AKHILAN BOOPATHY: That's a good point. That's a great point. I mean, so it turns out it's possible to define the same kind of inductive bias measure even if you don't interpolate the training points. And I think the measure is probably very similar to this.
Yeah, it's a good point that if we know the training points are noisy, then we probably don't want to interpolate them. And so, yeah, I mean, but I mean qualitatively it doesn't affect our results significantly.
AUDIENCE: So do those architectures actually interpolate the training set?
AKHILAN BOOPATHY: No. In general, I don't think-- I mean, for example, linear networks probably can't interpolate on this, I don't think.
AUDIENCE: Does that require you to adjust the additional terms that you have on?
AKHILAN BOOPATHY: Yeah, that's a good question. I think, in principle, it does. But in practice, the numbers don't change very much. Again, the main factor behind the scale of these numbers is the intrinsic dimensionality of the data. So it's just a small adjustment.
AUDIENCE: This is a variant of Tommy's question-- maybe it's the same or not. But this plot seems to imply that the more you go towards what you consider to be real world questions there on the right, which we know are solvable to some degree, at least by some natural systems-- however you want to define that.
I can't tell how I should read this plot. I'm looking at it as, like, oh, they're all the same? But on the scales you're showing, they're actually quite different. Are you in a regime where you're so far out that the upper bound is so big it becomes meaningless in a way? That's one read of this plot. I'm not sure if that's what you want us to take from this.
The space is so big now that everything becomes-- everything you've unpacked becomes a little less applicable. Or is this, no, I still see there's a difference across the architectures there that could be measured. It's just the scale you're plotting on. I don't have error bars. I don't know what you want me to take from it.
AKHILAN BOOPATHY: Yeah, and that's a good question. I think the main useful thing that this measure provides is a way to compare different tasks. So I mean, we already know how to compare models. We can just plot the performance of the different models. And so this doesn't tell us too much constructively about the models themselves.
What this does tell us is that ImageNet is probably a much better task than CIFAR-10 just because you need to put in a lot more constraints on the hypothesis space to solve it. And so it's really on the task side that it's providing some useful guide.
AUDIENCE: So when you say the task, you mean everything about it-- how many data points? If I take MNIST and just keep increasing the number of data points, will that drive it up to be harder than ImageNet?
AKHILAN BOOPATHY: Well, more data points make the task easier.
AUDIENCE: OK, I guess I'm not getting that intuition, because I seem to be unlocking a bigger and bigger function space by your setup of M, in a way. But maybe it's controlled in ways that I'm not seeing, right? It seemed like you were tying M to that sampling.
AKHILAN BOOPATHY: Yeah, so M--
AUDIENCE: It's reasonable, in effect, that as I drive up M, I can basically unlock a bigger space. As I drive up the number of data points, I can unlock a bigger M, which then goes up, maybe bounded by that small m, which maybe you'd say is bounded at some point. There's a true intrinsic dimensionality, in effect. Is that the way to think about that?
AKHILAN BOOPATHY: So big M is controlled by the relevant resolution of the data. So for MNIST, the relevant resolution is just the spacing between different classes. So if you make the classes really close together, and you need to make some very fine grained pixel level decisions to determine which class something is in, that would increase the difficulty of MNIST.
So that would be the way to drive up task difficulty.
AUDIENCE: I'm still in function mode, the way you presented it. There are no classes to me; there are just functions to fit. If I get more samples, I can then fit higher frequency functions, because they're more reasonable now, potentially. And then, I have this infinite space I can keep unlocking with finer and finer sampling.
AKHILAN BOOPATHY: I wouldn't say that M really strongly depends on the number of samples. Empirically, the way we compute M is we just plot out all the training points of MNIST. And then, the interclass distance is the-- corresponds, basically, to how we compute maximum frequency.
So even if we increase the training set size by 10 times or so, the interclass distance is probably not going to change very much. It might decrease a little bit.
AUDIENCE: OK, I'm just not connecting that to the math that you showed with the 2D case and the M and the samples on a grid. That connection is not there for me, but maybe we should talk offline. I thought you're fitting functions over 2D spaces-- fitting through the points and what's reasonable or not. And that was the setup, right?
So if I keep sampling in that 2D space, I can keep making higher and higher frequency functions. I didn't connect classes to that idea. I didn't see the connection there. But I'm just maybe missing it. So sorry.
AKHILAN BOOPATHY: Maybe we can chat offline.
AUDIENCE: Yeah [INAUDIBLE]. Thank you.
PRESENTER: Thank you for--
[APPLAUSE]