I’ve been slowly working through the excellent Practical Deep Learning for Coders course offered by fast.ai. If you’re interested in machine learning, deep learning in particular, this course is a great place to start.
In the beginning of the course the authors urge against trying to create your own server, and instead recommend getting started with one of several ready-made online Jupyter environments. This is good advice if you don’t have the experience, time, or desire to create your own environment. However, if you come from a technical background and are already comfortable with AWS and Linux, building your own cloud-based ML box can be a worthwhile endeavor.
The road to becoming an effective deep learning practitioner includes learning how to work your way through the “weird stuff on the edges” of deep learning – like breaking and fixing (again and again) your local Python environment, getting a Jupyter server running, and spinning up a GPU server to train models. In my experience with real-world projects, it’s often the “weird stuff on the edges” that takes the most time and energy because it wasn’t part of the plan, and we didn’t see it coming. So with that philosophy in mind, let’s build a powerful GPU server in AWS and get up and running with fastai!
Who is this tutorial for?
If you understand AWS VPCs and subnets, Linux, GPU drivers, Python and pip, and want to roll your own cloud-based ML rig, this tutorial is for you. If that last sentence sounded like alphabet soup, it’s probably not worth spending time on this. Your main goal is to study deep learning, so if you don’t already know AWS and Linux, don’t take this side quest – it will only lead to frustration. Just jump into a ready-made Google Colab notebook and get started with fastai. You can come back later if you want to learn the nitty-gritty sysadmin bits.
We’ll use AWS’s Elastic Compute Cloud (EC2) to create our server. The high-level steps are:
- Create a g4dn.xlarge EC2 instance with an NVIDIA GPU and a 60GB root volume
- Install NVIDIA drivers for CUDA support
- Install Jupyter Lab
- Set up fastai and the fastbook course
I’ll explain each step in more detail below.
Create an EC2 instance with an NVIDIA GPU
Fastai uses the PyTorch deep learning library, which in turn uses NVIDIA CUDA for GPU acceleration. AWS has several options for EC2 instances with NVIDIA GPUs. In this tutorial, we’ll create a g4dn.xlarge, which has 4 vCPUs, 16GB of RAM, and a single NVIDIA T4 GPU with 16GB of video RAM. While this is the smallest instance in the g4dn family, it’s enough to speed through the training sections of the fastai notebooks because it has the same NVIDIA T4 GPU as its more powerful siblings, and training happens on the GPU, not the CPU.
At the time of this writing, a g4dn.xlarge costs $0.526 / hour in the AWS us-east-1 region, but be sure to check the current prices in your region. It’s also a very good idea to shut down your instance when you’re not using it to avoid unnecessary costs. Here’s a simple script to list, start, and stop EC2 instances using Python: https://github.com/jason-weddington/ec2-manager.
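If you’d rather script this yourself with boto3, a minimal sketch might look like the following. The `summarize_instances` helper and the `region` default are my own illustrative choices, not necessarily how the linked script is structured:

```python
def summarize_instances(response):
    """Flatten a describe_instances response into (id, state) pairs.

    Operates on the plain dict shape boto3 returns, so it can be
    exercised without AWS credentials.
    """
    return [
        (inst["InstanceId"], inst["State"]["Name"])
        for reservation in response["Reservations"]
        for inst in reservation["Instances"]
    ]

def list_instances(region="us-east-1"):
    import boto3  # imported lazily so the helper above works offline
    ec2 = boto3.client("ec2", region_name=region)
    return summarize_instances(ec2.describe_instances())

def stop_instances(instance_ids, region="us-east-1"):
    import boto3
    ec2 = boto3.client("ec2", region_name=region)
    ec2.stop_instances(InstanceIds=instance_ids)
```

A stopped instance no longer accrues the hourly compute charge, though the attached EBS volume still bills for storage.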
First, log in to your AWS Console and navigate to EC2:
Now, launch a new instance:
We’ll use Ubuntu Server 20.04 LTS, the current LTS release. Be sure to create an x86 instance, not Arm:
In the Choose an Instance Type screen, filter for the g4dn family, choose the smallest one and click Next: Configure Instance Details:
In Configure Instance Details, select a VPC with an attached internet gateway, so that your new instance will be able to access the internet. If you don’t know what that means, first think hard about whether this exercise is a good use of your time, because it might not be. If you want to continue, go read about VPCs and internet gateways, and come back when you have (at least) one of each.
Now click Next: Add Storage.
In the Add Storage section, you’ll need to increase the size of the root volume from the default 8GB in order to have enough space for NVIDIA drivers and CUDA. For my instance, I used a 60GB volume and it’s currently 64% full after a complete build, including NVIDIA drivers, CUDA, and fastai. I recommend at least 60GB so you have enough free space to play around with other projects without having to build a new server or add more storage.
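As a quick sanity check on that sizing, here’s the arithmetic behind the recommendation:

```python
volume_gb = 60
used_fraction = 0.64  # observed usage after a complete build

used_gb = volume_gb * used_fraction        # space consumed by drivers, CUDA, fastai
free_gb = volume_gb - used_gb              # headroom left for other projects

print(f"used: {used_gb:.1f} GB, free: {free_gb:.1f} GB")
```

Roughly 38GB goes to the build itself, leaving about 21GB of headroom; the default 8GB volume would fill up before the drivers finished installing.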
Add some tags if it will help you keep track of the instance, and then move on to Next: Configure Security Group.
In the Configure Security Group step, you’ll need to specify a security group that allows incoming SSH traffic. We’ll use SSH to configure the instance, and later we’ll tunnel over SSH to access Jupyter Lab. If it makes sense for your situation, consider limiting source traffic to your public IP.
Keep in mind that you’ll need to update this from time to time if you have a dynamic public IP. Choose from your existing security groups, or create a new one. Then click Review and Launch:
Once your instance has started, you’ll also need to associate an Elastic IP address so that you can SSH to the server. Allocate a new Elastic IP and associate it with your new instance if needed. You should now be able to SSH to the server and install the latest updates:
```shell
ssh -i <path to private key> ubuntu@<elastic ip address>
sudo apt update
sudo apt -y upgrade
sudo apt -y autoremove
```
Install NVIDIA Drivers for CUDA Support
Once your instance is up and running, the next step is to install NVIDIA drivers with NVIDIA CUDA support. There are a few ways to do this; I’ll cover two methods:
Option 1 – Online install using package manager:
Follow the Ubuntu LTS instructions in the Package Managers section in the link below. Use the `--no-install-recommends` option in step 5 for a lean install, since we’re running a headless server with no GUI.
At the time of writing, the above link only mentions Ubuntu LTS 16.04 and 18.04, but I’ve tested these instructions on 20.04 and they work fine because of the $distribution environment variable set in step 2.
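For reference, step 2 derives that variable from `/etc/os-release`, which is why the same commands adapt to 20.04 automatically. A rough sketch of what it does (with a hedged fallback for machines without that file):

```shell
# Derive the distribution string NVIDIA's repo URLs expect (e.g. "ubuntu2004").
# Sketch of NVIDIA's step 2; the fallback value is only for illustration.
if [ -r /etc/os-release ]; then
  distribution=$(. /etc/os-release; echo "$ID$VERSION_ID" | sed -e 's/\.//g')
else
  distribution="ubuntu2004"
fi
echo "$distribution"
```

On a fresh 20.04 instance this yields `ubuntu2004`, which matches a published repo path, so no manual edits are needed.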
Option 2 – Local driver repository:
Navigate to https://www.nvidia.com/Download/Find.aspx and fill in the options for the NVIDIA T4 GPU in your g4dn.xlarge instance:
Download the .deb package to your local machine, and use SCP to copy it into the EC2 instance:
scp -i <path to private key> <nvidia driver file name>.deb ubuntu@<elastic ip address>:/home/ubuntu
Then SSH to the EC2 instance and (attempt to) install the .deb package:
sudo dpkg -i nvidia-driver-local-repo-ubuntu2004-460.32.03_1.0-1_amd64.deb
You’ll get a message complaining that the CUDA GPG key is not installed. Follow the instructions in that message to install the key, then re-run the dpkg command above to install the driver repository package.
At this point, we’ve only added the repo. Next, we install the driver:
```shell
sudo apt update
sudo apt -y install cuda-drivers --no-install-recommends
# after a reboot, nvidia-smi should list the T4 GPU
```
Confirm That PyTorch Can Use CUDA
This should be all we need, but before moving on, it’s worth confirming that PyTorch can use CUDA. Install pip3 and then install PyTorch with pip:
```shell
sudo apt -y install python3-pip
pip3 install --user torch
```
Then run `python3`, import torch, and call `torch.cuda.is_available()`, which should return `True`.
If CUDA is not available, review the steps above and try to figure out where things went wrong. One of us missed a step. It might have been me (sorry), but if you’ve made it this far, you’re probably smart enough to figure it out.
Install Jupyter Lab
Installing Jupyter Lab and generating a default config is easy. We’ll also set a password so we can connect remotely:
```shell
pip3 install --user jupyterlab
jupyter lab --generate-config
jupyter lab password
```
We’re going to tunnel over SSH, so we don’t need to configure anything special for Jupyter Lab. Don’t run Jupyter just yet; we still need to clone the fastbook repo.
Set Up the Fastai Course
Ok, this is where the fun starts. The entire fastai course is available on GitHub as Jupyter notebooks. (I know, right?) So, clone the repo wherever you like to keep your git things and cd into fastbook.
IMPORTANT: as of this writing, I’ve been unable to successfully train ResNet for image classification in the latest versions of fastai, torch, and torchvision. The symptom is that the model appears to train, but accuracy never increases. When used for inference, the model’s predictions are exactly backward. For example, on the cats and dogs task, the model thinks every dog is a cat and every cat is a dog. To work around this issue, let’s install specific versions of each dependency. The versions below worked for me, but if you run into the problem I’ve just described, play around with the versions of fastai, torch, and torchvision.
```shell
git clone https://github.com/fastai/fastbook.git
cd fastbook
sudo apt -y install graphviz
pip3 install --user torch==1.7.1 torchvision==0.8.2
pip3 install --user fastai==2.1.10 fastbook graphviz
```
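If you prefer, the same pins can live in a requirements file so that rebuilding the environment later is a single command. This `requirements.txt` is my own addition, not part of the course repo:

```text
# requirements.txt -- pinned versions that worked for this build
torch==1.7.1
torchvision==0.8.2
fastai==2.1.10
fastbook
graphviz
```

Install everything with `pip3 install --user -r requirements.txt`.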
Now, start Jupyter Lab. Consider using screen or tmux so you can detach the terminal that is running Jupyter Lab. I’ll use screen for this:
```shell
screen -S lab
jupyter lab
# detach with Ctrl-A d; reattach later with: screen -r lab
```
Next, forward port 8888 on your local machine to port 8888 on the EC2 instance over SSH:
```shell
ssh -i <path to private key> -N -f -L localhost:8888:localhost:8888 ubuntu@<elastic ip address>
```
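If you tunnel often, an SSH config entry saves retyping the flags. The host alias `ml-box` is just an illustrative name; substitute your own key path and Elastic IP:

```text
# ~/.ssh/config
Host ml-box
    HostName <elastic ip address>
    User ubuntu
    IdentityFile <path to private key>
    LocalForward 8888 localhost:8888
```

With that in place, `ssh ml-box` logs you in with the port forward active, and `ssh -N -f ml-box` establishes the tunnel in the background.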
Connect to Jupyter Lab:
On your local machine, point your browser to: http://localhost:8888/lab
If all went well, you should see an authentication page where you can enter the password you created when installing Jupyter Lab. Open the first notebook, 01_intro.ipynb, and run the first cell. You may get an error that pip was not found:
This is because we’re not in the conda environment the notebook expects. That’s fine, since we’ve already installed fastbook above; you can also change the command to `pip3` to get it to work:
At this point, we should be good to go. To check that everything is working, let’s train a model. Scroll down to the “Running Your First Notebook” section and run the code cell that starts with “# id first_training”:
This cell downloads a pre-trained ResNet-34 model and then uses transfer learning to fine-tune the network on the example problem of identifying cats.
That’s it! If you got this far and saw the progress indicators as the model trained, you’re all set.