
Use a cloud virtual machine as a Jupyter notebook server

Introduction

Jupyter notebooks are interactive coding environments that also hold documentation in a readable, non-code format. This means we can use Jupyter notebooks to develop and test code and legibly explain what is happening, all in the same place. So far so good; and there are lots of great videos online showing this environment in action.

If you would like to build a Jupyter notebook environment (a notebook server) on a cloud machine: You’re in the right place. The directions here are “from the ground up” and serve a second purpose as well, which is (we at CloudBank hope) to de-mystify building on and using the cloud.

Before we begin, however: Remember that cloud providers like to do things for us “as a service”. You can get into Jupyter notebooks in this manner in Azure or on the Google Cloud Platform or on Amazon Web Services or on the IBM Cloud etcetera. If you choose to go this way there are two things to check on: The cost for the value-added service and whether the results of your work will be transferable to other environments / platforms.

That said, we’ll proceed with the Do It Yourself On The Cloud approach. Here is an example of a notebook server browser interface.

Binder as sandbox Jupyter notebook server

What is the plan?

Learning to compute on the cloud includes understanding machine images: Snapshots of an entire operating system including installed software, customization, code and data. This CloudBank solution serves two purposes: It introduces machine images and it demonstrates using a cloud virtual machine (herein VM) as a traditional desktop--possibly quite a powerful one--for running a Jupyter notebook server. The interface to this working environment will be via a web browser on our Local machine. We move information back and forth securely from Local to Cloud using an ssh tunnel.

Why instructions are necessary

At a high level we are configuring a research computer. Because it is on the cloud there is some extra vocabulary involved and some extra steps, starting with securing a cloud account. There is enough complexity to merit a walk-through. Here’s the narrative:

A VM may be in one of two states: Started or Stopped. We pay for a VM by the hour: Only when it is Started. Stopped is like having the power turned off: You can resume using it later without loss of data. Terminating a VM means the VM no longer exists: Everything is gone.

Starting a VM from an image recreates the machine in its state when the image was created. When you do this you have a choice of which type of VM to use for the image. You can use a cheap low-power VM if you do not need a lot of computing power; say you just want to write and test some code. Or you can choose a powerful (more expensive) VM if you have some heavy computation to do.

Notice that you may start many VMs from a single image. Your collaborators may use an image to start their own VMs as well. This means that customized work environments are easy to replicate and share.
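For readers who prefer a command line to the console, the idea of one image feeding several VMs of different sizes can be sketched with the AWS CLI. This is a sketch under stated assumptions and not part of the procedure on this page: the AMI ID is a made-up placeholder, hedylamarr is the example key name used further down, t3.small stands in for "a cheap low-power VM", and the script only prints the commands rather than running them, so no AWS credentials are needed to try it.

```shell
#!/usr/bin/env bash
# Sketch: compose (but do not run) AWS CLI commands that would start VMs
# of different sizes from the same machine image.
# ASSUMPTIONS: the AMI ID is a made-up placeholder; instance types are examples.

ami_id="ami-0123456789abcdef0"   # hypothetical image ID
key_name="hedylamarr"            # example keypair name
region="us-west-2"               # AWS Region code for Oregon

launch_command() {
    # $1 is the instance type, e.g. t3.small (cheap) or c5.large (more compute)
    echo "aws ec2 run-instances --region $region --image-id $ami_id --instance-type $1 --key-name $key_name --count 1"
}

small_cmd="$(launch_command t3.small)"
large_cmd="$(launch_command c5.large)"

echo "$small_cmd"
echo "$large_cmd"
```

The only thing that changes between the two printed commands is the instance type; the image, and therefore the installed software and data, is the same.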

Key concepts

Walk-through

Again these use AWS as the “example cloud” but these notes will apply broadly to any cloud. The short version describes what to do with scant attention to specifics. The extended version is more comprehensive.

Short version
Extended version
From my own computer log on to the AWS EC2 instance
On the EC2 instance (VM) mount any added storage drives
Install the Jupyter Lab notebook server
Configure the machine for research
Create an image (AMI) of this Virtual Machine
Share the AMI with other AWS accounts
Terminate the VM
Notes

left off here

what follows is a copy paste of the original student hands-on for restoring and using a VM. As such it can be reduced down to a much smaller section here.

CloudBank Solution Repo: Research Computing and Cloud Images

This tutorial introduces you, the researcher or student, to using a virtual machine image on the cloud as a basis for research computing. This page describes using an image that has already been built for you. If you want to see how to do the preparatory build: Look into the creating_an_image sub-folder.

You may be familiar with a zip or tar file containing an entire directory. A machine image is analogous; think of it as a zip file of the entire computer’s contents from operating system to home directory to code to data files. The idea is that once a cloud Virtual Machine (VM) is configured for use: It can be stored as a machine image. This image is then un-packed back onto a Virtual Machine. The image is used to create a working research computing environment for a scientist to use.

Our specific example here takes advantage of the Jupyterlab notebook server. This page is a tutorial for going from a pre-built Linux environment with Jupyterlab installed as an image and reconstituting it on the AWS cloud as a Virtual Machine. We will use a secure tunnel called an ssh tunnel to enable you the researcher to connect to the Jupyterlab server through your web browser.

In the sub-folder called bootstrap we go through the process of building this machine image.

In the sub-folder called waterhackweek we go from the rebuilt Jupyterlab VM to cloning a repository of notebooks on the VM for a particular research topic.

Important remark on cost management

In what follows on this page: The VM has a single disk drive (filesystem) with a fairly small capacity of 32 Gigabytes. However it is common practice to create images that bundle large datasets. The sub-folder bootstrap tutorial spins up a Virtual Machine, for example, that has 200 GB of block storage (disk drive) volume mounted as two data filesystems. This allows the image builder to include a moderately large dataset with the image and provide some working space as well.

Our point of emphasis here -- as with all cloud resource allocation actions -- is that one should understand and track the costs associated with cloud resources. ‘Block’ or ‘disk’ storage -- for example -- runs about $20 per month for 200 GB.
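As a rough illustration of tracking such costs, the figure quoted above works out to about $0.10 per GB per month. The sketch below applies that rate to the two disk sizes mentioned on this page. It assumes cost scales linearly with volume size, which is a simplification; actual rates vary by provider, Region and storage type.

```shell
#!/usr/bin/env bash
# Sketch: estimate monthly block-storage cost from the figure quoted above
# ($20 per month for 200 GB).
# ASSUMPTION: cost scales linearly with size; check current provider pricing.

monthly_storage_cost() {
    # $1 is the volume size in GB
    awk -v gb="$1" 'BEGIN { rate = 20 / 200; printf "%.2f", gb * rate }'
}

cost_32="$(monthly_storage_cost 32)"
cost_200="$(monthly_storage_cost 200)"

echo "32 GB volume:  about \$$cost_32 per month"
echo "200 GB volume: about \$$cost_200 per month"
```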

Again our main admonition: The cloud is very powerful but it is important to understand and manage the cost of using it.

Outline

For Windows Users

A brief interruption anticipating a possible issue: Further down we are going to build something called an “ssh tunnel” to use our Virtual Machine as a Jupyter notebook server. If you are doing this on a PC running Windows: No problem, that’s perfectly feasible but Windows does not natively make this Linux-y step trivial. So be ready: We will introduce the idea of installing a small Linux bash shell on the Windows PC. It is a bit of a “yet another step, really??” situation but on balance it can help save time and avoid frustration. Instructions are here.

Connecting to a Jupyterlab server built on a cloud Virtual Machine

Suppose we would like to visually explore some (ocean) data. This data took years to collect and months to bring together in one location. Hopefully it takes less than an hour to deploy and connect to a Jupyterlab notebook server on the cloud.

Prerequisites: Cloudbank credentials to connect to the cloud and an available bash shell.

We are using in this case the AWS (Amazon Web Services) cloud. You will log on to the AWS console, start a Virtual Machine (called an EC2 instance) and on that machine start a Jupyterlab server. If you were starting from nothing you would be installing Python packages and importing datasets. Our objective here is to avoid all of that by using a pre-built environment stored on the cloud as an image. Once you have identified this image you can start it on virtually any size machine; from a small cheap one costing, say, $0.04 per hour to a very powerful computer that might cost $2.40 per hour or more. Cloud users choose a computer based on computational needs.
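To make the price range concrete, here is a small sketch that multiplies the two hourly rates mentioned above by hours of use. The usage patterns are assumptions chosen for illustration: 160 hours approximates a month of working days at 8 hours each, and 720 hours is a 30-day month with the VM never stopped.

```shell
#!/usr/bin/env bash
# Sketch: compare VM cost for different hourly rates and usage patterns.
# ASSUMPTIONS: rates are the examples from the text; hours are illustrative.

hours_cost() {
    # $1 is the hourly rate in dollars, $2 is the number of hours Started
    awk -v rate="$1" -v hours="$2" 'BEGIN { printf "%.2f", rate * hours }'
}

small_work="$(hours_cost 0.04 160)"     # small VM, working hours only
large_work="$(hours_cost 2.40 160)"     # large VM, working hours only
large_always="$(hours_cost 2.40 720)"   # large VM left running all month

echo "small VM, 160 hours: \$$small_work"
echo "large VM, 160 hours: \$$large_work"
echo "large VM, 720 hours: \$$large_always"
```

The gap between the last two lines is the cost of forgetting to stop a large VM when it is not in use.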

Once the computer is running (with everything pre-installed) you will create an ssh tunnel. This is a secure connection that associates a local address on your local computer with the Jupyterlab server running on the cloud VM. By connecting through this tunnel the cloud VM becomes the backing engine for exploring the data.

The procedure is presented in 13-or-so steps with interspersed comments. Upon completion you will have your own data science research environment.

Procedure

  1. Log on to the AWS console using your credentials; and be sure to set your Region (upper right drop-down) to Oregon.
     Sidebar: You can choose Services (upper left) to see a listing of services, i.e. what you can do on the AWS cloud.

  2. Navigate to Amazon Machine Image choices (AMIs): Services > Compute > EC2. Then choose AMIs from the left sidebar.

  3. In the upper left drop down menu of the AMI pane select ‘Private Images’ (it may say ‘Owned by Me’ or ‘Public Images’ by default). When ‘Private Images’ is selected, you should see an AMI listed called jupyter1-cb. Select this AMI by clicking on it so a blue dot appears at the left edge of the table.

  4. Choose Actions > Launch. Choose a VM type c5.large (you may have to scroll down a bit to find this). Choose Review and Launch at the bottom right.

WARNING: If this tutorial is a class-sized activity there is a potential collision scenario. Let’s take a moment to outline this and how to resolve it.

The AWS EC2 Launch Wizard goes through seven steps. Step 6 involves choosing a Security Group (SG). The SG is given a name by default, so a classroom full of people will all get the same default name as they proceed, and that collision will obstruct the next couple of steps. To avoid it, click on the Security Group step (step 6) and give the Security Group a unique name. As with the keypair and instance names below, the best choice for a Security Group name is simply yourname. In our instructions we use the name hedylamarr as an example of your name. Now you can proceed to step 7 of the wizard, which is the next step in this procedure.

  5. Choose Launch at the bottom right. In the ensuing keypair dialog choose Create new keypair and name it hedylamarr. This will be the key to identifying your instance and logging in to it in what follows. Download the keypair file you generated; then click Launch Instance in the dialog box.

  6. Continue in the AWS console: Click View Instances at the lower right. This takes you to a table of EC2 instances (Virtual Machines), where one instance is listed per row. Therefore one of the rows in this table should be your instance.

Class strategy remark: You now have a Virtual Machine (“EC2 instance”) on the public cloud. If you are in a class with many people doing the same thing at the same time it can be difficult to identify which instance is your VM. Once you identify your instance: Name it using yourname.

  7. Locate your instance by the key name: In the table of instances scroll right to find the Key name column. Scan down this column for the keypair you used to identify the row for your EC2 instance. Return to the left-most Name column of this row -- the row for your instance -- and hover your cursor to bring up a pencil icon in the “Name” column. Click on that and name your instance appropriately.

  8. Once the instance status dot changes to green (running): Note its IP address in the instance table. Here we take this to be 12.23.34.45 as an example.

Now we are ready to log in to this VM or EC2 instance from a bash shell. Our main goal on the instance is to start a ‘quiet’ Jupyterlab session. There will not initially be a browser interface; that comes a little further down. Once the Jupyterlab server starts it will print a token, a long string of letters and numbers used to authenticate; so be prepared to copy that to a text editor for a few minutes.

  9. On your computer: Open a bash shell and ensure the keypair file hedylamarr.pem is present in your working directory. Issue a chmod command to restrict the keypair file to owner read-only permission: chmod 400 hedylamarr.pem.

If running Windows on your local computer: You may need to install or enable the native bash shell. As an alternative you can install an Ubuntu bash shell. In either case it is useful to realize that the home directory of this shell is not the same location as the Windows User home directory. If hedylamarr.pem was downloaded to C:\Users\hedylamarr\Downloads: Move it to your bash home directory, for example using this sequence of commands:

bash> mv /mnt/c/Users/hedylamarr/Downloads/hedylamarr.pem ~
bash> cd ~
bash> chmod 400 hedylamarr.pem

The ssh command insists that the authentication keypair file hedylamarr.pem have User read-only permission. Files in the Windows User area are not amenable to chmod, so we relocate this file and proceed in bash from ~. These details can be very frustrating, so we go to the trouble of elaborating them here.
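To see what chmod 400 actually does, the following sketch applies it to a throwaway file standing in for hedylamarr.pem and reads the permission mode back before and after. It assumes the GNU version of stat, which is what an Ubuntu bash shell provides; the stat on macOS takes different flags.

```shell
#!/usr/bin/env bash
# Sketch: show the effect of chmod 400 on a throwaway file (not a real key).
# ASSUMPTION: GNU coreutils `stat` (standard on Ubuntu).

demo_key="$(mktemp)"              # stand-in for hedylamarr.pem
chmod 644 "$demo_key"             # a typical mode for a freshly downloaded file
before="$(stat -c '%a' "$demo_key")"

chmod 400 "$demo_key"             # owner may read; no write or execute for anyone
after="$(stat -c '%a' "$demo_key")"

echo "mode before: $before"
echo "mode after:  $after"
ls -l "$demo_key"                 # the listing should begin with -r--------

rm -f "$demo_key"
```

If the mode does not change after chmod, the file is probably sitting in a location (such as the Windows User area) where Linux permissions are not honored.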

  10. In bash run ssh -i hedylamarr.pem ubuntu@12.23.34.45 to connect to your EC2 instance. (Use the correct IP address.) Respond yes at the prompt to complete the login. Note that your username is ubuntu and you have the ability to run root commands using sudo.

If you have configured the image using (for example) the tutorial in the bootstrap sub-folder: You may wish to spend some time looking around to ensure the instance is configured as expected. For example if there are additional data disks you can check to see these have been mounted properly.

  11. On the command line of your EC2 instance issue (jupyter lab --no-browser --port=8889) & to start the server. After about one minute this command should produce multiple lines of output including a token string.

...token=ae948dc6923848982349fbc48a2938d4958f23409eea427

Copy this string to a text editor for later use. If you lose this token just log on to the instance where it is running and issue jupyter notebook list to see it. Issue exit to log out of your cloud instance.
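Copying the token by hand works fine; for those who would rather not, here is a minimal sketch of extracting it from a line of output with sed. The sample line is the elided example shown above, and the pattern assumes the token consists only of lowercase hexadecimal characters, as it does in that example.

```shell
#!/usr/bin/env bash
# Sketch: pull the token out of one line of Jupyterlab start-up output.
# ASSUMPTION: the token is the run of hex characters after "token=".

sample_line='...token=ae948dc6923848982349fbc48a2938d4958f23409eea427'

# -n with the p flag prints only lines where the substitution succeeded
token="$(printf '%s\n' "$sample_line" | sed -n 's/.*token=\([0-9a-f]*\).*/\1/p')"

echo "token: $token"
```

The same sed expression could be applied to a saved copy of the server output, if you chose to redirect that output to a file when starting the server.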

Note: The Linux structure (command) & causes command to run as a background job. This allows you to log out of your instance, leaving it to function as a Jupyterlab server.
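The background-job behavior can be tried safely with a stand-in for the server. In this sketch a short sleep followed by a log message plays the role of the Jupyterlab command; the names logfile, job_pid and result are illustrative only.

```shell
#!/usr/bin/env bash
# Sketch: run a stand-in "server" as a background job using ( ... ) &
# The sleep-and-echo job is a placeholder for the real Jupyterlab command.

logfile="$(mktemp)"
(sleep 1; echo "server ready" > "$logfile") &
job_pid=$!        # process ID of the most recent background job

echo "started background job $job_pid; the shell is free for other commands"

wait "$job_pid"   # block until the job finishes (needed only for this demo)
result="$(cat "$logfile")"
echo "$result"

rm -f "$logfile"
```

Whether a background job survives logout can depend on shell settings. If you find the server stops when you exit, prefixing the command with nohup is a common remedy.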

Note: The above token is used the first time connecting to the Jupyterlab server. It may not be used in subsequent sessions but it is worth keeping it around. If it is lost for some reason: Re-start the Jupyterlab server on the VM and re-copy the token.

  12. In the bash shell issue ssh -N -f -i hedylamarr.pem -L localhost:7005:localhost:8889 ubuntu@12.23.34.45. (Make appropriate substitutions.)

This ssh command creates an ssh tunnel to the EC2 instance running the Jupyterlab server. You associate a location on your own computer (localhost:7005) with the connection point on the EC2 instance (12.23.34.45:8889). The number trailing the colon, called a port, acts as a qualifier for the connection.
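Since "make appropriate substitutions" is where mistakes tend to creep in, this sketch assembles the same tunnel command from named parts so that each piece is visible. The command is printed rather than run, and the variable names are illustrative; the values are the examples used on this page.

```shell
#!/usr/bin/env bash
# Sketch: build the ssh tunnel command from named parts and print it.
# Values are the examples from this page; substitute your own.

key_file="hedylamarr.pem"
local_port=7005            # port on your own computer (the browser connects here)
remote_port=8889           # port the Jupyterlab server listens on, on the VM
vm_address="12.23.34.45"   # example IP address of the EC2 instance

# -N: do not run a remote command
# -f: go to the background once the connection is set up
# -L: forward the local port to host:port as seen from the VM
tunnel_cmd="ssh -N -f -i $key_file -L localhost:$local_port:localhost:$remote_port ubuntu@$vm_address"

echo "$tunnel_cmd"
echo "then browse to: localhost:$local_port"
```

Note that the second localhost in the -L argument is resolved on the VM side: it means "port 8889 on the instance itself", which is where the Jupyterlab server was started.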

  13. In your browser address bar type localhost:7005. This should change to localhost:7005/lab?. When prompted for the token paste in the token string you copied earlier. You should now see a Jupyterlab environment. When using the instructional image for this tutorial there should be a notebook listed on the left called chlorophyll.ipynb. You may click on this notebook to open it and explore the contents.

Idiosyncratic Cloud Notes