Building and Training Deep Learning Models in PyTorch
Date Posted:
November 16, 2023
Date Recorded:
November 8, 2023
Speaker(s):
Valmiki Kothare, MIT.
Description:
BCS Computational Tutorial Series with Valmiki Kothare, MIT.
In this tutorial, we will use deep learning on EEG and EMG mice data to predict sleep stages (Wakefulness, REM, Non-REM). We will walk through an example Jupyter Notebook in which we load a dataset, preprocess it, build a "residual-attention" network, train our model, and validate our performance on withheld data. In the process of going through the notebook, we will discuss briefly how to run this on OpenMind and how to parallelize training across multiple GPUs, as well as the reasoning behind the network architecture choice and the basic theory of the attention/transformer layer.
Google Colab Notebook - https://colab.research.google.com/drive/1zbib_Cv4v1t9q7Wqb_w-R4wI9xGPIh2...
VALMIKI KOTHARE: First, to run any Jupyter Notebook, you need a server to run it on. So this can be just your local laptop, or, if you want to use a GPU or a cluster, you'll have to start the server on the cluster and then connect whatever runtime you have to it.
So, in my case, I am using OpenMind. So I SSH into OpenMind. And I can do this for you here. And, in the folder that I'm working in, I've created a batch file to schedule a Slurm job. Slurm is the job scheduler used on OpenMind to acquire resources. And I will show you the batch file here.
So this file is used to acquire the given resources. And you can change how many GPUs you want, for example, and how many threads, essentially, cores of the CPU you're using, constrain the type of GPU you want, and set a couple of other constraints, for example OS constraints and memory constraints.
So all of this is on the Wiki. And I've linked that in the Colab notebook for you guys to check out how to do this more specifically for whatever needs you have. This essentially activates a Conda environment and boots up a Jupyter server. So then you'll call this batch script using sbatch jupyter.sh or whatever you want to call your file.
And it will-- and I've already done this. So it'll show up here when you type the command squeue -u with your username. And then an output file that you can specify, I've called it jupyter.out, will be created. And, of course, it's nonsensical now because I've been using it for a while. But-- all right, that is unnecessary. Sorry, it's just very long.
Let me start this again. I'll just take you through this process anyway because the server crashed. And I'll cancel that other job that we were using.
And then, in your output file, you're going to get-- see, it's now started a Jupyter server. You have to give it about 30 seconds. And it will provide you with a link that you can use to connect to your Jupyter server.
So here it is. This second link is the link you want to use. The first tells you what node you're running on. So we're running on node 61. Now we're going to do something called SSH tunneling. This is going to connect our local computer to the node through the host node.
And I've created a batch-- sorry, just a batch script to do this, which essentially allows me to just input my port that I've specified in my batch script and the node. But, for reference, this is going to be in this tutorial how to use Jupyter Notebook on OpenMind. And it just tells you the commands you need to run for SSH forwarding and for starting up your server.
So now, I've done this and set up the port forwarding. Now I can connect to the server through Google Colab. You can do this a number of ways. If you prefer using Jupyter Notebook or Jupyter Lab, just paste the URL straight into your browser, and it will take you to the Jupyter Lab IDE here. Wi-Fi is not great, but-- [INAUDIBLE] loads. There we go.
STUDENT: [INAUDIBLE] question?
VALMIKI KOTHARE: Yes.
STUDENT: [INAUDIBLE]
VALMIKI KOTHARE: Yes, because, I mean, you can run stuff on Colab using the hosted runtime. But because I can't give you the data, and there's no way of hosting the data that the student gave me for you guys to use, you're not going to be able to train the model on his data.
So you're welcome to run the script, but it's not going to do much for you. If you would like, you can download the AccuSleep data set, which is-- I've linked above. It's huge. It's a lot of data. And it's what we're using to pre-train this model. But, yeah, for now, I'd say, just follow along through the code, but we're not going to be able to run it.
STUDENT: [INAUDIBLE] download both [INAUDIBLE]?
VALMIKI KOTHARE: Yeah, so, essentially-- and we'll get into this. But what I did was I used the 24-hour recordings for training and the four-hour recordings for testing and validation. So, yeah, this is if you want to use it in Jupyter Lab. But I prefer using either VS Code or Google Colab. And so, to do so, you'll open your notebook, click Connect to local runtime, and then paste that URL in.
And it'll connect here. And you can check, for example, what kind of GPU you have by typing a bang, which executes a command in Bash, followed by nvidia-smi. As you can see, we have four GPUs allocated to us.
I'm going to have to restart this. So I've linked a couple of resources here for you guys to use. The Wiki, of course, is going to have a bunch of resources if you want to use OpenMind. And, specifically, the Jupyter Notebook tutorial will teach you how to do what I just did in more detail.
There's also the website, which has some slides on basic OpenMind tutorials, and then the data set here that we're going to be using. So now that we're here, a bit of context on the problem that we're going to be trying to solve. This problem is known as sleep staging. And the goal is to use EEG and EMG data, specifically of mice, in our case, to predict what stage a mouse-- what stage of sleep a mouse is in. It's like sleep scoring.
So the labeled data is labeled with either wake, REM, or non-REM sleep for given two-and-a-half-second periods. At every two-and-a-half seconds, there's a label. And this was human-labeled. So this is a lot of data. And the student who came to me, [INAUDIBLE], graciously let me use this for the tutorial.
But he wanted to use a model pre-trained on these AccuSleep data for his own data set. And so, this is the overarching goal of what we're trying to do here today. But a lot of the principles apply to other time series data, for example.
So, basically, we're going to install any requirements. A good practice is to have something called a requirements.txt file in your repository, which specifies all the necessary Python packages you may need.
And an interesting thing about Google Colab is that the notebook that you have that you're using is stored and hosted on Google Drive. But everything else, if you connect to a local runtime, for example, is hosted elsewhere. In my case, it's hosted on OpenMind, where I-- and the current working directory is the folder that I started this batch script in.
Google Colab offers a bunch of different services. You can also-- if you buy time using their hosted runtimes, you can store your files or your data on Google Drive and then just import them in.
And then, for a sanity check, we check that we are on the node that we-- what's going on here? All right. It should be working fine, but we're having trouble connecting here.
Perfect. So I've already installed all of these packages. That's why it says "requirement already satisfied." And we are on the correct node that we requested. So here we import a bunch of packages into Python, namely NumPy and pandas. But, mostly, we're going to be working with PyTorch tensors.
And then, this package called Accelerate. So Accelerate is a newer package created by Hugging Face, which allows us to very easily port our PyTorch code to distributed training. So now we can use multiple GPUs to train faster than we would with a single GPU, or even multiple nodes with multiple GPUs on them, depending on whatever server you may have.
And this is quite useful if you're working with a lot of data, or if you're working with data where a whole batch, for example, can't fit onto a single GPU; you can use multiple GPUs and split the batches up. So the rest of these-- go ahead.
STUDENT: [INAUDIBLE]
VALMIKI KOTHARE: I like raw PyTorch. It's just a personal preference. I see no reason not to use PyTorch Lightning if you've been using it before. But, really, the structure-- the code you write is fairly similar. And the differences are quite minor. So it's a preference thing.
I think Accelerate is super easy if you already have PyTorch code. But, say, if you've already implemented it in PyTorch Lightning, which, for people that don't know, is a higher-abstraction wrapper for PyTorch, then, I mean, I would say go ahead and use it. But I don't have a particular recommendation either way. It's really your workflow.
STUDENT: [INAUDIBLE]
VALMIKI KOTHARE: Yeah, it's very easy. And I'll show you how to do that here. So here we import all these packages that we may need. And we'll go through a couple of them as we go along. And here's where we start the data loading and preprocessing stage.
So I've written a couple of helper functions here. And, namely, this class, SleepDataset. So PyTorch implements the Dataset class. And to implement a custom dataset, you simply create an inherited class. And, here, I'm calling it SleepDataset.
So this will read our data from wherever we've stored it on our server. And it will structure it in such a way that we can index into it as if it were an array, for example. And this allows us to very easily pull samples from our data set for training. So there are a couple of ways you could do this. You could leave your entire data set on disk and only load pieces of it when you explicitly need them.
And this is a good practice if your data set is huge and doesn't fit in your RAM and, instead, you need to keep it on disk until you need a batch, for example. And then you read it from your disk. But it turns out that this data set is small enough to fit in the 20 gigs of memory that I requested.
So I load all of it with a couple functions here. Basically, I won't go into too much detail. If you really would like to, you can go into the code here and see. But what I've done is I run this preprocess script if I haven't done so before.
And what this does for me is downsampling, bandpass filtering, and scaling. And all of these are necessary in the EEG domain, where the data tends to be noisy and you'll want to clean it up a bit before you make predictions on it.
But you don't really want to do this all the time. You don't want to do this, for example, every time you load the data. Because it will, obviously, be a pain in the ass to sit there and wait maybe 10 minutes for it to load 20 gigs of data into RAM and preprocess all of that on CPU.
So, instead, we preprocess it beforehand. And then we load those preprocessed files afterwards. So now that I've done this in the past, I can always just call this function, get preprocessed file. And it'll give us the EEG, EMG, and the labels for each of the directories we specify.
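As a rough illustration of what a preprocessing step like this might look like, here is a minimal sketch using SciPy. The sampling rates, filter order, and band edges are placeholders, not the values from my script:

    from scipy.signal import butter, sosfiltfilt, decimate

    def preprocess_signal(x, fs_in=1000, fs_out=250, band=(0.5, 40.0)):
        """Downsample, bandpass filter, and scale one raw EEG/EMG trace.
        All rates and band edges here are illustrative placeholders."""
        # Downsample to the target rate (decimate applies an anti-aliasing filter first)
        x = decimate(x, fs_in // fs_out)

        # Zero-phase bandpass filter to remove slow drift and high-frequency noise
        sos = butter(4, band, btype="bandpass", fs=fs_out, output="sos")
        x = sosfiltfilt(sos, x)

        # Scale to zero mean and unit variance so amplitudes are comparable across recordings
        return (x - x.mean()) / (x.std() + 1e-8)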
The way I structure this is that all the EEGs I've collected for different collection periods are stored in lists, along with the labels. And I create this mapping over, essentially, the range of total samples that we have, where a single sample is a two-and-a-half-second window anywhere in the data set. You pick a window, and this is your sample.
And this is what you want to predict: based on this data alone, what stage of sleep is the mouse in? We do a couple of different things by extending that window into the past. So we say, look at 10 seconds: what stage were they in during the last two-and-a-half seconds? Giving this context helps with temporal dependencies; for example, a mouse can't move from wakefulness straight into REM in one jump.
And so, having a larger context will get rid of misclassifications. And these considerations are all domain-specific. You have to know your data well and be able to make these assumptions. So if you want to do real-time prediction, you can't be making predictions based on future time windows because, obviously, you won't have seen them until later. So these considerations have to be made for time series data specifically.
But that's a little too specific. We'll skip ahead for now. This is the data set class. It loads all your data into RAM if you want it to, and then you have to implement two functions specifically: the __len__ function, which gives you the total number of samples you have, and the __getitem__ function, which gets a single item based on an index between zero and the length of your data set. So, again, that index mapping makes this easier. But you can implement this any number of ways.
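For reference, a minimal sketch of that pattern, assuming the windowed EEG/EMG arrays and labels are already in memory; the names are illustrative, not the notebook's exact code:

    import torch
    from torch.utils.data import Dataset

    class SleepDataset(Dataset):
        """Wraps preprocessed EEG/EMG windows so samples can be indexed like an array."""
        def __init__(self, eeg, emg, labels):
            # eeg, emg: arrays of shape (num_windows, window_length); labels: (num_windows,)
            self.eeg = torch.as_tensor(eeg, dtype=torch.float32)
            self.emg = torch.as_tensor(emg, dtype=torch.float32)
            self.labels = torch.as_tensor(labels, dtype=torch.long)

        def __len__(self):
            # Total number of 2.5-second windows in the data set
            return len(self.labels)

        def __getitem__(self, idx):
            # Stack the two probes into a (channels, time) tensor for one window
            x = torch.stack([self.eeg[idx], self.emg[idx]])
            return x, self.labels[idx]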
Now we start getting into just loading everything, and initializing everything, and starting the model design. So these are the parameters that I've specified for our model and for our data and our training paradigm. We'll go through them specifically later. But, basically, it's a good practice to have a config file.
You can store this in a .json or any kind of dictionary that specifies all your configurations so that you can save them later on and reference them in the future if you want to compare different models that you've trained and you want to see, oh, why did this model perform better than another?
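For example, something like this; the keys and values below are illustrative placeholders, not my actual settings:

    import json

    config = {
        "sample_rate": 250,        # Hz after downsampling (placeholder)
        "window_sec": 2.5,         # length of one labeled window
        "context_windows": 4,      # how much past context to include (placeholder)
        "batch_size": 32,
        "learning_rate": 1e-3,
        "num_epochs": 10,
        "num_attention_blocks": 2,
    }

    # Save the config next to your checkpoints so you can see later exactly what this run used
    with open("config.json", "w") as f:
        json.dump(config, f, indent=2)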
Now we can run these. So this is the AccuSleep data that we're working with. One is an EEG probe. The other is an EMG probe. And then, this is what we're calling the homemade data. This is Mathias's data that he acquired. And we've picked an EEG probe and an EMG probe of his to match.
We've downsampled all to the same frequency. We've done all the same preprocessing steps. So, theoretically, you want your test data, the data you're hoping to predict on, to be as close in distribution as possible to the training data. It'll lead to, in general, better results.
And then, I do one more thing here. I create a data set and a data loader to get the input shape, the input size to the model. If, for example, you need a specific output shape and you need to use a linear layer to contract the size of a tensor, you're going to need to know the input shape beforehand. And an easy way to do that is just getting the size of the input from the data loader.
So this data loader class essentially wraps the data set class. And you can see, we take sample data, which is our data set in as a parameter for data loader. And the data loader collates each sample into a batch. So you can specify a batch size to train your model on.
Generally, this will be somewhere between 16 and 32, maybe even 64; it depends, and you can play around with that number. But this data loader essentially does this for us and will shuffle the data for us if we want. So it can pick random samples from anywhere in the data set and collate them together.
A couple of different parameters can be specified. So, generally, you're going to want to define a function called get_dataloaders in this Accelerate paradigm, because you need to initialize your data loaders, your model, and your optimizer all in the Accelerate training function.
And so, having a helper function like this that you can save to another file if you want to is a good practice. And it essentially just creates your data set, creates the data loaders for each of these data sets, and returns them.
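A minimal sketch of such a helper, assuming a Dataset class like the SleepDataset sketch above; the argument names and defaults are illustrative:

    from torch.utils.data import DataLoader

    def get_dataloaders(train_dataset, val_dataset, batch_size=32, num_workers=4):
        """Wrap the datasets in DataLoaders, shuffling the training set so each
        batch mixes samples from anywhere in the recordings."""
        train_loader = DataLoader(train_dataset, batch_size=batch_size,
                                  shuffle=True, num_workers=num_workers)
        val_loader = DataLoader(val_dataset, batch_size=batch_size,
                                shuffle=False, num_workers=num_workers)
        return train_loader, val_loader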
And the same goes for fine-tuning. So another thing that we're going to be looking at is this concept of pre-training and fine-tuning. Pre-training is using some large data set, which may be agnostic to the task you're working on, to train a model to get a good representation of the kind or modality of data you're working with. And then, you can fine-tune on, say, your own data.
And this generally will make for more accurate predictions on your data, because if your data is far out of distribution from the training data, you're going to get poor results. And that's what I've seen here, using Mathias's data on a model trained only on the AccuSleep data.
So now, we're getting into the-- yes. It will, to some degree. The benefit of doing this is largely speed. You will be able to converge faster using the fine-tuned model. Or, rather, what you're saying is true if you don't freeze some parameters of the model. So you could freeze, say, the first half of the pre-trained model and then only optimize the latter weights for the fine-tuning.
And doing so is faster because you're optimizing over fewer parameters, and the model should already have some, for example, spatial context developed through the pre-training process. So, yeah, it is largely for speed benefits. But you could train even just the last layer, and that is often enough to get the performance you need from a fine-tuned model.
So now we'll get into the actual model design choices. So this data is time series data. And, in recent times, attention has often been used for it. In the past, these sleep scoring models used basic statistical methods and logistic regression in the beginning and then slowly moved to convolutional nets.
And now, as an experiment, I wanted to try using attention for it, because attention has been known to be beneficial in training over, say, some time window or context. The idea of attention is that you attend to different parts of the input; in the context of LLMs, you attend to different words with different importance based on their similarity to each other.
In the same way, you can attend to different time periods in a time series, for example, a large spike in EMG, which would indicate the animal moved a lot. This kind of attention to these very time-specific instances is what makes, theoretically, attention well-suited for time series data.
And so, I was inspired by audio diffusion. So, for people that aren't familiar, diffusion models are what have been used in DALL-E 2, for example: generative models that generate very detailed and intricate images after being pre-trained on a huge corpus of image data for image generation. And, in the audio field, they've started using attention layers to try and achieve a similar effect and improve performance. So this is where I got the idea.
Whether or not it is essential that you have attention in this model is not clear. You can play around with it; I haven't really played around with it. But, in the parameters, I specify how many of my blocks have attention implemented and how many of them are just residual convolutional blocks. And that's the general feeling in machine learning: you just need to try stuff and figure out what works for your own task, because there's no free lunch, as we know.
So we'll go through this model. Essentially, the basic structure is alternating residual and attention blocks that attend across the temporal dimension, followed by an out block, which is essentially just two linear layers that constrict the output to a tensor of length three, for our three classes.
Anyone can stop me if they're confused about some of the neural network modules that we're using, residual, and just ask me questions. Because I don't know the experience in the room. So I can clarify anything that people are confused about. Yes.
I think this-- so the attention mechanism is not precisely how you may think of what biologically or behaviorally what attention is. So it is implemented in-- I'll just do-- [LAUGHS]
STUDENT: Is this the format [INAUDIBLE] that is a good blog post that explains--
VALMIKI KOTHARE: Yeah. And there are a couple of explanations of it. But, basically-- this is not a great explanation. There. This is the mechanism for attention. So the part of attention that makes it learned is these three linear blocks here. So you're not just taking the input and attending to itself using a similarity metric, which is what plain self-attention would be. You are first applying some linear transformation to it and then attending.
And this learned linear transformation is what makes the attention learnable. So it's not necessarily attending to a specific portion of your signal simply because it is similar to another portion of your signal; it is doing so because it has learned that, in this context, the weight for a spike is important, if you're comprehending my meaning.
In a way, I mean, that's all ML really is: feature extraction, but not in a manual way. So I'm struggling a bit to explain this intuitively without getting into a ton of math.
STUDENT: [INAUDIBLE]
VALMIKI KOTHARE: Yes.
STUDENT: [INAUDIBLE]
VALMIKI KOTHARE: Yes. Yeah. The real benefit of this is not really that, oh, it can pay attention to different portions. It does do this, but it's really the receptive field, because you are able to attend across the entire sequence at once. Whereas, in convolutional blocks, you need several layers to get to a point where a token at the end can influence predictions from a token at the beginning. So this speeds up convergence, because the receptive field is much larger; it's the entire sequence length.
So let's go back to the model here. So the residual block is the first portion of this. And it's simply, and I will also bring up an image here for context, a sequence of two convolutional layers with what we call a residual connection.
The idea is that, by having this residual connection throughout the network, and you do so in the attention module as well, you allow gradients to flow easily from the output to the beginning of the network without something called the vanishing gradient, which happens when a network is super deep and the gradient dies out towards the beginning of the network when doing backprop from the loss.
There are countless theories as to why this works well. It just works. And that's why we're using it. The layers in this diagram here are one-dimensional convolutions. So we're doing convolution across the temporal dimension of our signal, across two channels. And we're increasing the number of channels each time.
This is generally supported in the literature. Just by increasing the number of channels, you increase the number of learnable parameters. And you can play around with these. Here we have the forward function.
So to implement any module in PyTorch, you need to implement only two methods, the init and the forward. The initialization function will initialize any layers with learnable parameters that you may need. And the forward function defines how the input gets translated into the output.
So here we're downsampling our input, performing normalization and an activation function of our choice, I believe I've chosen ReLU or some variant of it, then our convolution, and then the same sequence again. And then we add the residual, which is this connection from the input to the output, and return it.
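A minimal sketch of a 1-D residual block in this style; the exact normalization, activation, kernel size, and any downsampling in my actual block may differ:

    import torch.nn as nn

    class ResidualBlock1d(nn.Module):
        """Two 1-D convolutions with a skip connection from input to output."""
        def __init__(self, in_channels, out_channels, kernel_size=7):
            super().__init__()
            pad = kernel_size // 2
            self.norm1 = nn.BatchNorm1d(in_channels)
            self.conv1 = nn.Conv1d(in_channels, out_channels, kernel_size, padding=pad)
            self.norm2 = nn.BatchNorm1d(out_channels)
            self.conv2 = nn.Conv1d(out_channels, out_channels, kernel_size, padding=pad)
            self.act = nn.ReLU()
            # 1x1 convolution so the skip path matches the new channel count
            self.skip = nn.Conv1d(in_channels, out_channels, kernel_size=1)

        def forward(self, x):
            # x: (batch, channels, time)
            residual = self.skip(x)
            x = self.conv1(self.act(self.norm1(x)))
            x = self.conv2(self.act(self.norm2(x)))
            return x + residual   # gradients can flow straight through this addition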
This is a very basic residual block. And it's the foundation of a lot of, well, image classification models. And you can use it also for audio or time series classification. Then we have the attention block. Yes.
STUDENT: [INAUDIBLE]
VALMIKI KOTHARE: And here, let me bring it up to here. So you can see, here I specify this linear layer, q, k, v. And we take as input x and perform, after normalization, this linear transformation on it. What this does is take our input and create three linearly transformed copies of it.
This is called self-attention because q, k, and v are derived from the same signal. So they're all derived from x, our input. But they are not identical, because the linear transformation performed on each of them is different; there are different weights for each of the three, q, k, and v.
And once that's done, you just perform standard scaled dot-product attention. And you do this across multiple heads, meaning, in our case, you're performing attention across, say, eight different heads that you've split your input across. So you divide q into eight even pieces, and the same with k and v, and then perform eight separate attentions across all of these.
And then, you follow this same pattern of projecting, normalizing, and then performing a feedforward. So this is the basic transformer block, what people have come to associate with the transformer: the scaled dot-product attention followed by, I'll pull it up here, the feedforward. It's the basic encoder that you see in most LLMs.
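A minimal sketch of that block, assuming the input is a (batch, time, channels) tensor; the head count, norm placement, and feedforward width in my actual model may differ:

    import torch.nn as nn
    import torch.nn.functional as F

    class SelfAttentionBlock(nn.Module):
        """Multi-head self-attention followed by a feedforward, each with a residual connection."""
        def __init__(self, dim, num_heads=8):
            super().__init__()
            assert dim % num_heads == 0
            self.num_heads = num_heads
            self.norm1 = nn.LayerNorm(dim)
            self.qkv = nn.Linear(dim, 3 * dim)    # one linear layer producing q, k, and v
            self.proj = nn.Linear(dim, dim)
            self.norm2 = nn.LayerNorm(dim)
            self.ff = nn.Sequential(nn.Linear(dim, 4 * dim), nn.ReLU(), nn.Linear(4 * dim, dim))

        def forward(self, x):
            # x: (batch, time, dim)
            b, t, d = x.shape
            q, k, v = self.qkv(self.norm1(x)).chunk(3, dim=-1)   # three learned transforms of x
            # Split each of q, k, v into num_heads pieces: (batch, heads, time, dim // heads)
            q, k, v = (z.view(b, t, self.num_heads, -1).transpose(1, 2) for z in (q, k, v))
            out = F.scaled_dot_product_attention(q, k, v)        # PyTorch >= 2.0
            out = out.transpose(1, 2).reshape(b, t, d)
            x = x + self.proj(out)              # residual connection around the attention
            x = x + self.ff(self.norm2(x))      # residual connection around the feedforward
            return x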
And so, I instantiate all of these in our model, and PyTorch allows you to use nested module instantiation. So if you have several smaller modules, you can create a bigger module simply by instantiating them inside the larger module and then calling them in your forward pass.
There are a couple of nuances that you have to worry about. If you instantiate a module inside a list, just a standard Python list, it won't be registered as a learnable parameter of the model. So you have to use something called a ModuleList, which tells PyTorch that this module list is an attribute of our model and that all the parameters inside it must be registered.
And this happens in a recursive way. So if you have a ModuleList in your attention block, then when you create your model block that has an attention block in its ModuleList, it will also register all the parameters in the ModuleList of the attention block. And this is all done automatically. So as long as you keep some basic best practices in mind, you won't have to worry about this most of the time.
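To make that concrete, a small sketch of the difference; the block classes are the illustrative ones from the sketches above:

    import torch.nn as nn

    class SleepModel(nn.Module):
        def __init__(self, blocks):
            super().__init__()
            # Wrong: submodules kept in a plain Python list are NOT registered,
            # so model.parameters() (and therefore the optimizer) would never see their weights.
            # self.blocks = list(blocks)

            # Right: nn.ModuleList registers every submodule, recursively.
            self.blocks = nn.ModuleList(blocks)

        def forward(self, x):
            for block in self.blocks:
                x = block(x)
            return x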
Yeah, and then, basically, the last part. So, again, I think I do two residual blocks without attention and then two blocks with residual and attention. The idea is that you want to develop some context and expand your embedding size to, say, 32 channels or something, and then perform attention. I mean, the idea is that the larger the embedding, the more heads you can split it into for multi-headed attention; you can't do this with, say, only two channels.
And then, the output block. If you're doing any kind of classification, you need your model to end up with an output of the size of the number of labels you're trying to classify. So, in our case, it's just REM, non-REM, and wakefulness. If you were doing a human study, you would need REM, wakefulness, and then non-REM stages one, two, and three, so that would be five.
After this, the rest is handled by our optimizer and our criterion. So the loss function we're using, as is typical for multi-class classification, is cross-entropy loss. The idea is that the output of this model is the probability that your input corresponds to any given class. So the class with the highest probability, obviously, is what you classify the input as. And this loss function aims to maximize the probability of the correct class.
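In code, that part is only a few lines; this is a standalone sketch with stand-in tensors rather than the notebook's real model output:

    import torch
    import torch.nn as nn

    criterion = nn.CrossEntropyLoss()     # softmax plus negative log-likelihood in one call

    logits = torch.randn(32, 3)           # stand-in for the model output: one score per class
    labels = torch.randint(0, 3, (32,))   # stand-in integer labels for the three stages
    loss = criterion(logits, labels)      # pushes probability mass toward the correct class
    preds = logits.argmax(dim=-1)         # the highest-scoring class is the prediction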
So now we're going to get into training. So we've defined these functions, pre-train and fine-tune. They're essentially nearly identical; they just use different data loaders and data sets. Going through quickly: we always need to instantiate the model, the optimizer, and the data loaders. And then we iterate over the number of epochs and, inside this, iterate over the entire data set.
So here's where we're going to get into the process of converting your code to Accelerate and how you can do distributed training. This is very simple. I'll mark it with a little comment; I should have done this before, but that's OK.
These four lines here are what you have to add to allow your code to do accelerated training. The first one is something that you only have to do if you're running in a Jupyter notebook, it's some bug that I found, so you don't have to worry about it generally.
But, otherwise, this project config just specifies what directory you want to save your model to and whether you want automated checkpoint naming. There are a couple of other parameters you can specify, but this is the important part: the Accelerator.
So, conceptually, what the Accelerator does is split the data set across however many GPUs you have. So if you have four GPUs, you're going to be splitting your data into four equally sized chunks and then doing training on each of those separately.
This theoretically allows you to train your model four times faster. Of course, it doesn't happen exactly; you don't get a 4x speed boost, because there is some overhead with communication between the different threads. But you should see significant improvements.
And Accelerate does this automatically using this prepare function. So you pass in your data loaders, your model, and your optimizer. And it splits your data loaders across the GPUs and threads and copies your model across all of them. So this is something known as data parallelization. And this is distinct from model parallelization.
So, for most cases, people who want a speed boost or want to be able to fit large data onto multiple GPUs will use this method. Model parallelization is where your model is too big to fit on a single GPU. This makes it hard-- impossible to train on just one GPU.
That's a more advanced technique because then you have to manually split your model across several GPUs and communicate between those GPUs within one iteration, within one forward pass and your backward pass. It's possible. And if you have specific needs for that, you can definitely contact me and I'll try and help implement that for you guys. But, in the meantime, I feel like this will cover the majority of the use cases. And it's extremely easy to do.
So, afterwards, your code remains nearly identical. You just change what was loss.backward() in your original PyTorch code to accelerator.backward(loss). What this does is synchronize the backward pass between all the different GPUs you're using and synchronize the optimization step, so that you can train across multiple GPUs and update your model parameters in a cohesive way. You're not working on shared memory simultaneously across multiple threads.
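Putting those pieces together, a minimal sketch of the converted training loop; the model, loaders, and hyperparameters are assumed placeholders, not the notebook's actual ones:

    import torch
    import torch.nn as nn
    from accelerate import Accelerator

    def train(model, train_loader, num_epochs=10, lr=1e-3):
        accelerator = Accelerator()          # sets up the distributed context in one line
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        criterion = nn.CrossEntropyLoss()

        # prepare() shards the dataloader across devices and wraps the model and optimizer
        model, optimizer, train_loader = accelerator.prepare(model, optimizer, train_loader)

        for epoch in range(num_epochs):
            for x, y in train_loader:        # batches arrive already on the right device
                optimizer.zero_grad()
                loss = criterion(model(x), y)
                accelerator.backward(loss)   # replaces loss.backward()
                optimizer.step()
            accelerator.print(f"epoch {epoch}: last batch loss {loss.item():.4f}")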
I can get into some more details of these steps here. They're generally pretty well documented for any PyTorch tutorials you're doing. Conceptually, you are just inputting your sample, your batch into the model, getting some predictions from the model, passing it into your criterion, which we specified as cross-entropy loss, and optimizing based on this backward pass through the model.
So the loss, if poor, will push your model in the direction of lowering this loss function. I mean, it's known as backpropagation. Again, I don't know the experience in the room. So maybe-- let me think. So I can go a little more conceptual here then.
So you're trying to optimize a loss function. This loss is how well your model performs. It's a single scalar value that determines whether your model is making the correct predictions or not. It's different from accuracy, for example, because you can pass a gradient through it, meaning all the operations you did in your model and in your loss function are differentiable.
This allows you to backpropagate through the model to determine the gradients. So you establish a gradient for each parameter relative to your loss function. And this gradient tells you in which direction the parameter needs to change in order to improve the performance of the model, or improve the loss; in this case, decrease the loss.
And so, this is the point of calculating the loss here and then performing a backward pass. This backward pass calculates the gradients for every parameter in your model. And then, the optimizer performs a step in the correct direction of the gradient that has been calculated. Is that-- yes.
STUDENT: [INAUDIBLE]
VALMIKI KOTHARE: Yeah, so, in this case, because we can store a single batch on each GPU, all we're doing is the whole backprop process on each GPU individually and then updating the parameters serially.
STUDENT: [INAUDIBLE]
VALMIKI KOTHARE: I mean, you can think of it that way. You can also think of it as performing backprop on, say, four batches, one for each of the four GPUs. Think of it as one GPU calculating the first batch, then the next batch, then the next batch, then the next batch. It's just backprop. It's the same math as doing it serially.
But, instead, you're doing all the backpropagation and the forward passes at the same time. So it saves time in that manner. But when you're actually changing the model parameters, obviously, you're going to have to do this serially, because you can't change the parameters four times in parallel. There's only one model to do this on.
If your batch was too large to fit on a single GPU, what you could do is something called gradient accumulation, where you calculate-- you do the backward pass once. Don't do the optimization step. Wait, and then, the next GPU does their backward pass.
And suppose each batch for each GPU is eight, and you have a theoretical batch size of 16, now this accumulated gradient is your total batch gradient. And you can now do the optimization step as if it were-- you'd done a single backward pass for the whole batch. Does that make sense?
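A plain-PyTorch sketch of that idea (Accelerate also has a built-in helper for this, but the mechanics are the same); the names and step counts are illustrative:

    def train_with_accumulation(model, train_loader, optimizer, criterion, accumulation_steps=2):
        """Accumulate gradients over several micro-batches before one optimizer step,
        so, for example, two micro-batches of 8 behave like a single batch of 16."""
        optimizer.zero_grad()
        for i, (x, y) in enumerate(train_loader):
            loss = criterion(model(x), y) / accumulation_steps   # average over the micro-batches
            loss.backward()                                      # gradients add up across calls
            if (i + 1) % accumulation_steps == 0:
                optimizer.step()                                 # one update for the whole "batch"
                optimizer.zero_grad()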
STUDENT: [INAUDIBLE]
VALMIKI KOTHARE: True, but it's only useful if you have a very large batch size that can't fit on your GPUs. You see what I'm saying? And, yeah, I agree that it's optimal, but it doesn't actually give you any speedup if your batch size fits on a single GPU. Because you're still doing everything in parallel except for the updates to the weights of the model, which are the least computationally expensive part.
That's why I add this line here: disable if not on the main process. Because you're spawning multiple processes to run each of these instances of training on a GPU, you need to use a special print function. So, for any output, you want to use accelerator.print.
For any logging, for example using Weights & Biases, you're going to use accelerator.log. And this will automatically call the log function or the print function on only the main process, only one process, so you don't have conflicting messages or multiple prints from each one. Make sense?
STUDENT: [INAUDIBLE]
VALMIKI KOTHARE: Yeah, but they're all 10% done because they're all synchronized. You see what I'm saying? So, even keeping the bar for only this one process, it will still be 10% done for everything. If you have a print or a log that depends on every process, or on some output from every process, you use accelerator.gather.
And as long as the input is a torch tensor, then it will concatenate every instance across the multiple processes into a single tensor. And then you can perform some accumulation step on it, like a mean or a sum.
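For reference, a small sketch of those calls as they might sit inside the training loop sketched earlier; the variable names are assumptions:

    # Inside the accelerated training loop, after computing logits for a batch (x, y):
    preds = logits.argmax(dim=-1)

    # Gather predictions and labels from every process into one tensor
    all_preds = accelerator.gather(preds)
    all_labels = accelerator.gather(y)
    accuracy = (all_preds == all_labels).float().mean().item()

    # Print and log only once, from the main process, instead of once per GPU
    accelerator.print(f"batch accuracy: {accuracy:.3f}")
    accelerator.log({"batch_accuracy": accuracy})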
Let's see here. So I've explained conceptually what is happening here in the main training step. We calculate some metrics for accuracy by seeing how many of the classifications are correct. And we log all of these to Weights & Biases. So Weights & Biases is, essentially, a super useful logging service that you can use.
And I can run it so you can see what it looks like. It allows you to log, say, the performance of your model over-- what? [INAUDIBLE]
This is our model training. I've created just a progress bar so you can see how fast it's running and the progress of a single epoch. And then, weights and biases keeps track of whatever you want to log. So if you log the loss of your model, it will track that, your accuracy. This run just started. But for this run, which I was training earlier, it will log whatever you'd like.
I really like this and prefer it to other tools like TensorBoard just because it-- I mean, you can watch it from anywhere. At home, if you start a training run from the office, you can check the performance and terminate it if you see that you're getting poor performance. And, in this case, I was.
But, for example, on a longer run, which I'll show you, this is what you should expect to see: a drop in the loss function, and this is quite fast, an increase in accuracy. And then, I log some other things, like total accuracy across an epoch and total loss across an epoch.
STUDENT: [INAUDIBLE]
VALMIKI KOTHARE: Oh.
STUDENT: [INAUDIBLE]
VALMIKI KOTHARE: Oh, yeah. So that's the thing, it's just a link. So you can open it on any browser. So you don't need to open it-- as long as you call-- using Accelerate, you use Accelerate and-- let me find the code snippet-- here. You initialize the tracker with the Accelerator. And if you're just using Weights & Biases and you're not using Accelerate, it's as easy as wandb.init, with your project name and whatever config you have.
And then, it'll work just like TensorBoard. You log whatever dictionary you want, let's say epoch loss or accuracy, and it'll show up here. So it's, yeah, it's way easier. I would definitely use that over TensorBoard. Weights & Biases, wandb.
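A sketch of the setup for both cases; the project name, config, and logged values are placeholders:

    # Plain Weights & Biases, without Accelerate
    import wandb

    config = {"learning_rate": 1e-3, "batch_size": 32}     # placeholder config
    wandb.init(project="sleep-staging", config=config)
    wandb.log({"epoch_loss": 0.0, "epoch_accuracy": 0.0})  # placeholder values

    # With Accelerate, let the Accelerator own the tracker instead
    from accelerate import Accelerator

    accelerator = Accelerator(log_with="wandb")
    accelerator.init_trackers("sleep-staging", config=config)
    accelerator.log({"epoch_loss": 0.0, "epoch_accuracy": 0.0})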
So I've trained this model already. So I'm going to stop this. Now we're going to fine-tune our model. And, as I said, this is just training your model on new data, maybe specific to your use case. In the context of LLMs, pre-training is self-supervised training on a huge corpus of general text data.
And then, you can fine-tune on some downstream task, like semantic understanding or something, or on some hand-labeled data that you have. In our context, it's much the same. The process is the same. And I have some parameters to specify whether I want to train the whole model or just the last block of the model, for example. It will run identically.
And if you're using the Accelerator, once you've incorporated it into your code, and you're doing this in a Jupyter notebook, you use this function, notebook_launcher, from Accelerate. If you're doing it in your CLI, if you have a Python script with all of this code in it, then you'll call accelerate launch in your CLI with the name of your file, as well as the parameters that you want to specify.
And I forgot to mention that, to use Accelerate, you first have to specify the configuration of your machine. And it's as simple as calling accelerate launch-- oh, sorry, I'm wrong, this won't work. You call accelerate config.
And, say, I want to use four GPUs on a single machine. It gives you the options to do that. You can also manually input a file, but this is easier. So I'll do multi-GPU training. I'll train on one node. And I don't want any of these optimizations, four GPUs, all of them, and then it saves the configuration for you.
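For reference, a sketch of how launching looks from a notebook; the train function and its arguments are the illustrative ones from the training sketch above, and from a script you would instead run accelerate config followed by accelerate launch:

    from accelerate import notebook_launcher

    # Spawns one process per GPU and runs the training function in each of them.
    # args is the tuple of positional arguments passed to that function.
    notebook_launcher(train, args=(model, train_loader), num_processes=4)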
And then, yeah, fine-tuning is identical. So I'm not going to go into details here. I just use the-- Mathias's homemade data instead of the AccuSleep data. And I launch it from here. And then we can do our testing. So might have to rerun some of these.
So it turns out it didn't save the checkpoint. [INAUDIBLE] One second, sorry. I guess we're going to run-- we are going to run the fine-tuning. And it'll go pretty quickly because our fine-tuning data is much smaller.
The accuracy is very poor, which I'm confused about, because I just trained this. Unless I'm calculating it wrong. I'm going to assume I calculated it wrong and stop this.
So we've, here, yeah, we've loaded our model. And when we call accelerator.save_state at the end of each epoch, we save our model into a checkpoint folder. This checkpoint folder can be used to load the entire state of the training loaders and the model if we, say, get preempted in our Slurm job, or the program crashes, or our server crashes; then we can start from where we left off.
It also saves the model so that we can load it later on for inference. And that's what we're going to do now. So we load the model, send it to one of the GPUs we've acquired, and load the state dictionary from the checkpoint we have.
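A sketch of the save-and-reload pattern, continuing the illustrative names from the earlier sketches; the checkpoint directory is a placeholder, and the name of the weights file that save_state writes depends on your Accelerate version, so check what actually lands in the folder:

    import torch

    # During training, inside the accelerated loop, at the end of each epoch:
    # accelerator.save_state("checkpoints/latest")   # model, optimizer, and loader state
    # To resume an interrupted run with the same setup:
    # accelerator.load_state("checkpoints/latest")

    # Later, for inference on a single GPU, rebuild the architecture and load the weights:
    device = torch.device("cuda:0")
    model = SleepModel(blocks).to(device)            # blocks: the same list of blocks used in training
    state_dict = torch.load("checkpoints/latest/pytorch_model.bin", map_location=device)
    model.load_state_dict(state_dict)
    model.eval()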
Then we evaluate the model by doing only the forward pass and accumulating all the predictions into a list here that we can then run a classification report on. So it looks pretty good. We're getting an accuracy of 83%, with 92% weighted, given that REM and non-REM occur significantly less than wakefulness, and a confusion matrix here.
So these are a couple of ways that you can evaluate the performance of your model against, say, other iterations that you may try. One is accuracy, obviously. Weighted accuracy is useful when, let's say, you have a data set that skews heavily towards a single label. Having a weighted or balanced accuracy metric will tell you whether or not you're making too many misclassifications of your very low-occurrence label.
These other metrics, the precisions and F1 scores, generally come in more handy for binary classification. And then, our confusion matrix tells us how many true positives and false positives we have. And, generally, you want this diagonal to be, well, significantly higher than the rest.
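A sketch of that evaluation step with scikit-learn; the model, loader, device, and label ordering are assumptions, not the notebook's exact code:

    import torch
    from sklearn.metrics import balanced_accuracy_score, classification_report, confusion_matrix

    model.eval()
    all_preds, all_labels = [], []
    with torch.no_grad():                      # forward pass only, no gradients
        for x, y in val_loader:
            logits = model(x.to(device))
            all_preds.append(logits.argmax(dim=-1).cpu())
            all_labels.append(y)

    preds = torch.cat(all_preds).numpy()
    labels = torch.cat(all_labels).numpy()

    # Label order 0/1/2 = wake/non-REM/REM is an assumption here
    print(classification_report(labels, preds, target_names=["wake", "non-REM", "REM"]))
    print("balanced accuracy:", balanced_accuracy_score(labels, preds))
    print(confusion_matrix(labels, preds))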
So that brings us to the end of this. And I know this was kind of rushed and a bit of just walking through code, I guess. But I'm really happy to entertain questions anywhere you got confused. And, yeah, maybe even take suggestions for improvements for following tutorials, because this is my first one.
So OK. So, yeah, you can use it any time. I'd say it might be good practice at this point to just incorporate it into your code from the beginning, because you can run it with a single GPU. You can even use it to run on a CPU, and your code doesn't change at all.
But when you want to-- you may want specifically to leverage the capability to train across multiple GPUs and multiple nodes if your data set is huge and your training time is getting prohibitively long. So if it's taking a day to train an epoch, that's probably telling you that you might want to speed things up. And if you've done everything else, then using multiple GPUs will help.
It's also super useful if your batch size is too large to fit on a GPU. So a GPU has a certain amount of video memory, VRAM, on it that you can store your tensors on. And, at some point, well, you have to store your model, you have to store your optimizer state, and you have to store the entire batch that you're using on the GPU.
And if your batch is very large or if your input data is very large, you may only be able to store one or two samples instead of an entire batch of 16 samples onto your GPU. And then it becomes useful to train-- to use multiple GPUs, say, each one has two samples on it.
And then, you do gradient accumulation, which I was referring to before, calculating the backward pass for each, not performing the optimizer step until you've done so for 16 total samples across, say, eight GPUs, and then using this accumulated gradient to perform the optimization step.
So this is where it's super useful: where your data is very large on a per-sample basis, or where the data set size is such that you need to decrease your training time a lot. Does that make sense?
STUDENT: Yeah.
VALMIKI KOTHARE: Yeah. That's kind of-- if your model now gets to a prohibitively large size, then you run into the problem of fitting your entire model on the GPU, and then you have to move to model parallelism, which, again, I was referring to before. You have to split your model itself across multiple GPUs and perform computations between them. So you don't get any speed boost from that, but it lets you train an LLM, for example. Yes.
STUDENT: Do you recommend a resource for understanding some of these trade-offs? We're talking a lot about [INAUDIBLE] strengths or maybe more-- get more out of the analysis given certain circumstances. Is there a place, like a repo or somewhere, where you feel like someone's done a good comparison of all these approaches?
VALMIKI KOTHARE: Are you talking specifically like model design choices?
STUDENT: Yeah, model design choices, yeah, what kind of layer you use. I'm sure that the literature is massive. Is there a nice compendium somewhere?
VALMIKI KOTHARE: Not that I've found. And, in fact, I'm going to look into that, because I feel like that would be a great resource to have. I think the problem is that it is very domain-specific.
So unless you are working in a very well-researched, well-documented domain like image classification, for which, of course, there are pre-trained models and standards that have been developed, you're going to have a hard time finding a resource that tells you definitively, oh, try this technique, try this. Because it may even depend on your data and how it differs from someone else's data.
I can't say I know of anything off the top of my head like that. But, again, it is domain-specific. So if you're doing computer vision, yeah, there are tons of resources like that, tons of blog posts that do comparisons for image classification.
And, generally, the approach I've found works best is reading the literature, as big as it is. This domain, for example, has a couple of influential papers that say that convolutional or even residual networks perform better than standard statistical approaches, like logistic regression or hand-crafted features. So, yeah, I can't recommend a specific one. But I will look, because I think people could benefit from that a lot.
STUDENT: [INAUDIBLE] for learning [INAUDIBLE] there [INAUDIBLE].
VALMIKI KOTHARE: Yeah, 100%. Yeah. I would recommend-- let me think about this. I feel like the way I learned was just the PyTorch website has great tutorials for it, for just getting a handle of this structure of data loading, preprocessing, and then model instantiation. It'll get you through all the basics there.
Beyond that, I don't know that there's a great tutorial site as much as just finding-- yeah, whatever you're working on, if you can find-- if you can see someone else who's worked on it before, that's the best way, I think, of learning this kind of thing. Because it is really a lot of experimentation and trying a couple different things before you get the performance you're looking for.
So in the beginning, what I did was I acquired resources using a batch script. So here are my-- I've stored my batch script in the folder I'm working in, this sleep staging folder, here.
So this specifies all the resources I want. I wanted four threads for each of the four GPUs I'm using, 20 gigabytes of memory. And then I call-- I start my Jupyter server here. And then I will call sbatch for this. And once it runs, it will output-- let me-- hopefully this works. [INAUDIBLE] add Jupyter [INAUDIBLE].
Yeah, right here. So all of this won't be there when you first start it. This is what will print. It'll print, essentially, the URL that you're going to use to connect either your browser, using Jupyter Notebook, or Google Colab to it. You're going to copy this. And I've also linked the tutorial for this in the Wiki.
But you set up SSH port forwarding by doing-- this will take a minute-- ssh -L. And then you're going to SSH in. And you will use the port that you specified in your batch script to link your local computer to that node through the host node, which is necessary because you can't SSH directly into that node. Yeah, and this tutorial will go through everything you need. It has--
STUDENT: [INAUDIBLE]
VALMIKI KOTHARE: Yes, yes, yes. This sets up port forwarding and tunneling. And then, once you have that, all you do is-- and I don't know which you prefer, but you can just paste that link in here, and it'll open up Jupyter Notebook. Or, if you want to use Jupyter-- or Google Colab, you'll paste that link in here.
STUDENT: You just [INAUDIBLE]?
VALMIKI KOTHARE: That's it. Yeah. Yeah, and if you have any questions doing this, running this, come to me. My email is just valmiki@mit.edu. And feel free to ask me questions.
Yeah, I'm here. I guess I didn't introduce myself in the beginning, but I'm here as a resource for anyone doing computation or machine learning in the Department. Yeah. Thanks for coming, guys.