NoSQL
0. Introduction
In this tutorial we set up a database (an 'empty container'). We then proceed to write records to this database: specifically, information about the periodic table of elements. In a subsequent tutorial, we’ll build an API providing public access to this database, without the need for Azure account credentials. This is in effect a toy model of creating an open data resource that you might share with other researchers.
0.1 Concept of operations and authentication
Conceptually we have two bases of operation in the Azure cloud: The Azure portal and the Azure VM that we started
up in the first Lab. Both of these appear to us via our laptop: The portal is seen in a web browser; and the VM
is seen via VS Code Server which functions as our development environment. VS Code Server is communicating with the
Azure VM via ssh (secure shell) traffic on port 22 of the VM. So both the VM and the NoSQL database that we will
build below exist in the Azure cloud.
Underlying this construction process (building cyberinfrastructure) is the notion of authentication. In
the case of the Azure VM the authentication is by means of a keypair .pem file. This is pointed to by the config
file Host entry seen in VS Code running on our laptop. The Host entry also has the ip address of the VM and
the Username azureuser; so this is all the information needed to connect to and authenticate on the Azure VM.
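For reference, such a Host entry in the laptop's ~/.ssh/config file might look like the sketch below. The host name, IP address, and key filename here are placeholders for illustration, not values to copy:

```
Host azure-workstation
    HostName 20.0.0.1              # the VM's public IP address (placeholder)
    User azureuser
    IdentityFile ~/.ssh/mykey.pem  # the keypair .pem file (placeholder)
```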
In VS Code: Clicking the lower left >< symbol initiates the startup of VS Code Server connected
to this Azure VM. VS Code Server looks like VS Code; but it runs on the cloud machine.
If we build a cloud database: It will also require authentication. Authentication to access the NoSQL database from the Azure portal is by means of logging in to the portal with our NetID. That is sufficient. Authentication to access the database from the VM is also necessary. Just because the VM is on the Azure cloud does not mean it can go messing around with the database. In the procedure below there is a step where we get an access key from the database (using the portal) and place that key in a file on the VM (using VS Code Server). This key is then used by a Python program running on the VM to authenticate access to the database. Specifically the Python program will write data into the database.
0.1.1 More than you wanted to know about security
On VS Code Server we will run a Python script (on the Azure VM) that communicates with the database service
via a secure communication protocol called HTTPS. This is analogous to the secure shell (ssh) protocol
that we use on port 22. By convention HTTPS uses port 443. So the Azure VM is sending information packets
out to the internet which are routed (as if by magic) back to the database service inside the Azure cloud,
specifically to port 443. We could just as easily run the code that talks to the database service from our
own laptop.
If we were inclined to be more secure: There is a mechanism for creating a private endpoint for the database service internal to Azure. This would still work with our Azure VM but we could turn off the public access point.
0.2 Objective of this Lab
We will now build a NoSQL database (pronounced “no-sequel”) service on top of Azure’s “Cosmos DB” service. A NoSQL database is great for storing freeform data that does not fit easily into tables with rigidly defined rows and columns. To explore further, check out JavaScript Object Notation (JSON) document structure. Also worth noting: There are sub-categories of NoSQL. We will be using the Document type: Each element in the periodic table will become a self-contained JSON document. There are also Dictionary, Family and Graph flavors of NoSQL.
- Document NoSQL is self-contained JSON documents; has structural awareness / queryable fields
- Dictionary flavor is simpler: key-value pairs with no hierarchical structure to the value
- Family flavor supports complex data entities through flexible expansion of ‘column’ notion
- Graph contains nodes and edges: Strong support for relationships within the data
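To make the Document flavor concrete, here is a sketch of what one periodic-table entry might look like as a self-contained JSON document. The field names are illustrative; the actual columns come from the CSV file used later in this lab:

```python
import json

# A hypothetical element record as a self-contained JSON document.
# Queryable fields (e.g. Phase, Period) live at the top level.
hydrogen = {
    "id": "1",            # Cosmos DB documents need a string 'id'
    "Element": "Hydrogen",
    "Symbol": "H",
    "AtomicNumber": 1,
    "AtomicMass": 1.008,
    "Phase": "Gas",
    "Period": 1,          # the element's row in the periodic table
}

# Documents serialize to and from JSON text without any fixed schema.
print(json.dumps(hydrogen, indent=2))
```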
This guide assumes that you’ve completed the VM Workstation tutorial. Working from the Azure VM is not, strictly speaking, required: You could just as well follow this procedure from your laptop, for example in Windows WSL. However, sticking with the standardized Azure VM approach will make it easier for course staff to help you debug if something goes amiss.
Once the database is populated with elements from the periodic table, we will use the portal interface to examine some query patterns: Which elements are in the gas phase at STP, and so on.
1. Get your environment ready
Portal and workstation
We’ll start by making sure our workstation VM is turned on. Open a web browser and log in to the Azure web portal.
Go to the virtual machines dashboard by using the search bar at the top:

Select your workstation VM from the list, and if it’s stopped, click the Start button:

Keep this window open in the background; we’ll be using it again in a minute.
VSCode
Open VSCode and make sure your window is remotely connected to your cloud virtual machine. If it is, you’ll see the VM’s public IP address in the bottom left blue box:

If not, click the blue >< button and open an SSH connection to the workstation as shown in the workstation tutorial.
Having trouble with your VM?
If you can’t connect VSCode to your VM, check out these troubleshooting guides from the Workstation tutorial:
2. Create an empty database
Now we’re ready to create our database!
Go back to the web portal and search for Cosmos. Open up the dashboard for Azure Cosmos DB:

From here, create a new database by clicking the + Create button:

Azure supports a number of different database technologies, all of which are provided
with the brand name “Cosmos DB”. Today, we’ll be making a NoSQL document store, which
they call “Cosmos DB for NoSQL”. Click the Create button under the Cosmos DB for NoSQL
heading:

We’ll be presented with a configuration page to select various database options. Choose the following:
- Subscription: Your course staff should announce to you what subscription to use. If you’re not sure, ask them.
- Resource group: Choose the pre-existing resource group that contains your UW NetID in the name. It will look something like rg-amlclass-[YOUR UW NETID].
- Account Name: Name your account ______-periodic-db, where the blank ______ is replaced with your UW NetID. In general, the account can be named whatever we want, but this format makes it easier for course staff to keep track of.
- Location: The general rule here is to place your database in the same location as the VMs and other resources that access it – this can make database accesses faster, and in some cases make things cheaper. For us, though, it doesn’t matter. Leave it with its default value.
- Apply Free Tier Discount: Do Not Apply

When you’re done, click Next: Global Distribution > at the bottom.
On this page, choose the following options:
- Geo-Redundancy: Disable. This option requests that Azure backs our data up in different physical locations around the world. Though good practice for important data, our exercise today isn’t that important ;) .
- Multi-region Writes: Disable
When you’re done, click Next: Networking >.
There are no options we need to change here, but in general, this is how we would limit database access to the outside world. This is really important if our database stored personal data for human beings, or trade secrets for our employer. In cases like this, only VMs within our cloud account are allowed to access the database, and we can carefully control how the outside world accesses those. Ask your course staff if you’re interested in learning more.
Now, click Review + Create, and then finally the blue Create button.
After a minute, the database should get created and we’ll be presented with a blue Go to resource button. Click it:

The portal will bring us to a quickstart page, but we’re not going to follow those instructions. Instead, select the Data Explorer option on the left:

From this page, we’ll be able to see and edit the contents of the database. We’re going to create a new container to contain our periodic table data. Click the New Container button:

A configuration menu will slide in from the right. Enter the following options:
- Database id: Select Create new, and name it periodic-db
- Container id: Call it elements
- Partition key: Type out /Period. Note the beginning / and capitalization; these are important. The partition key refers to a piece of data in each of our database entries that will arrange them into groups. The idea is to choose a key such that, in an average day reading from our database, no group will be accessed much more than any other group. Here we’re choosing a chemical element’s period (row number in the periodic table) as the partition key. It’s not necessarily the best partition key, but for our project it’s good enough.
- Container Max RU/s: Enter 1000. This number is an estimate of how much data will be read and written from our database in a given second. The units here are called “RU”s (“request units”). Reading one entry from the database equates roughly to 1 RU, while writing one entry equates to 5 RUs. This value directly influences how much our database will cost (higher max RU/s, higher cost).
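As a back-of-envelope check on that 1000 figure, here is a tiny helper of our own (not an Azure API) that applies the rough 1-RU-per-read / 5-RU-per-write rule:

```python
def estimated_ru_per_second(reads_per_sec: float, writes_per_sec: float) -> float:
    """Rough RU/s estimate: ~1 RU per point read, ~5 RU per write."""
    return reads_per_sec * 1.0 + writes_per_sec * 5.0

# E.g. 500 reads/s plus 100 writes/s just fits a 1000 RU/s cap:
print(estimated_ru_per_second(500, 100))  # 1000.0
```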

When you’re done, click OK at bottom of the sidebar.
If all went well, we should see an empty database:

Let’s put some stuff in it!
3. Populate the database
Here’s our strategy: We’re going to download a free/open resource file: a collection of chemical
element data in CSV format (CSV = comma-separated values). We’ll write a Python script to get a
local copy of this CSV file and then insert that data into our cloud database. Before running
Python we will take a moment to create a virtual environment using the built-in Python venv module.
To get started, head back to your remote VSCode window. If there is not already a terminal available:
Create a new terminal using the Terminal -> New Terminal menu option:

Make a new directory called db-populate with this command:
mkdir db-populate
Then enter it with the command:
cd db-populate
Getting the data
Now let’s download the periodic table data
using the wget command. This is a terminal command that downloads files in the manner of your web browser.
The format of the wget command is:
wget [URL OF FILE TO DOWNLOAD] -O [NAME TO SAVE THE FILE AS] (note -O is a capital letter oh)
For our periodic table data, use these arguments for DOWNLOAD URL and FILE NAME:
- URL: https://gist.githubusercontent.com/speters33w/e5b1246d6859f29c4f02a299714d4c20/raw/a78d747534b915c19c5fb6d1fac0df6a77d62452/Periodic%2520Table%2520of%2520Elements.csv
- Filename: periodic-table.csv
When we open the periodic-table.csv file from the files bar on the left: We should see a bunch of element data…
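If you prefer the terminal to the file viewer, a quick pandas check can sanity-check the download once pandas is installed (see the next section). This is a sketch; it assumes only the file downloaded above:

```python
import os
import pandas as pd

def preview(path: str, n: int = 3) -> pd.DataFrame:
    """Load the CSV and report its shape before we ship it to the cloud."""
    df = pd.read_csv(path)
    print(f"{len(df)} rows; columns include: {list(df.columns)[:6]}")
    return df.head(n)

# Only attempt the preview if the file has actually been downloaded.
if os.path.exists("periodic-table.csv"):
    print(preview("periodic-table.csv"))
```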

Let’s load it into our database!
Installing Python modules
As noted, we now create a Python virtual environment (“venv”) as a localized context for our database work. We will then jump into this environment to install the various Python libraries we need. This environment business is an organizational practice: If we have multiple projects to work on, setting up each project with its own environment helps avoid a “kitchen sink” approach where we have a global environment with lots and lots of libraries installed. Not too surprising: The kitchen sink approach can lead to mutual incompatibilities between libraries, which means time lost to debugging. Avoiding this by setting up specialized environments is an example of compartmentalization of resources, which is considered a best practice in research software engineering.
Start by running this command to create a new environment:
python3 -m venv populate-env
And then run this command to enter it:
source populate-env/bin/activate
You should see the environment name appear before the regular stuff in the terminal prompt:

In general, you’ll only need to create the environment once, but you might need to “activate” it every time you open a new terminal. If you’re not sure, look for that (populate-env) text before the rest of the command prompt. If you don’t see it, run that activate command above.
Next, create a new file named requirements.txt. Do this by right-clicking the db-populate folder in the explorer bar on the left, and selecting New File.... Name the file requirements.txt:

(if the explorer bar isn’t visible, try running the terminal command code ~)
In the new file’s text area, paste in this text and save:
pandas
python-dotenv
azure-core
azure-cosmos
Each line names a different Python library that needs to be installed for our code to run:
- pandas is used to open and process CSV files
- python-dotenv is used to load “secret” data like usernames and passwords that we don’t want to put directly into our code
- azure-core and azure-cosmos are used to interact with the Azure cloud and our Cosmos DB, respectively.
Finally, use pip to install the libraries listed by requirements.txt using the following command line:
pip3 install -r requirements.txt
If all goes well, you should see a message like this:

Writing our code
Create another new file, and name this one process.py.
Plunk this code into it:
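A minimal sketch of such a script is below; treat it as illustrative rather than the course's exact listing. It assumes the periodic-db database and elements container created earlier, plus the ACCOUNT_HOST and ACCOUNT_KEY variables set up in the next section. The helper name row_to_doc is our own; the azure-cosmos calls (CosmosClient, get_database_client, get_container_client, upsert_item) are real library API:

```python
# process.py -- read the periodic-table CSV and upsert each row into the
# Cosmos DB 'elements' container. A sketch, not the course's exact listing.
import json
import os
import sys

import pandas as pd

DATABASE_NAME = "periodic-db"    # names chosen earlier in this lab
CONTAINER_NAME = "elements"

def row_to_doc(row) -> dict:
    """Turn one CSV row (a pandas Series) into a JSON-ready document."""
    doc = json.loads(row.dropna().to_json())  # plain Python types only
    doc["id"] = str(doc["AtomicNumber"])      # Cosmos requires a string id
    return doc

def main(csv_path: str) -> None:
    # Imported here so row_to_doc() is usable without the Azure SDK.
    from dotenv import load_dotenv
    from azure.cosmos import CosmosClient

    load_dotenv()  # pulls ACCOUNT_HOST / ACCOUNT_KEY from the .env file
    client = CosmosClient(os.environ["ACCOUNT_HOST"],
                          credential=os.environ["ACCOUNT_KEY"])
    container = (client.get_database_client(DATABASE_NAME)
                       .get_container_client(CONTAINER_NAME))

    for _, row in pd.read_csv(csv_path).iterrows():
        doc = row_to_doc(row)
        container.upsert_item(doc)  # insert-or-update, keyed on 'id'
        print("Added", doc.get("Element"))

if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv[1])
```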
You should just be able to paste the above code into the file and save it (though your course staff encourage you to read through it!). Finally, in the terminal, run your code with the following command:
python3 process.py periodic-table.csv
If all goes well, you should get this error:

We’ve gotta tell the python script how to log in to our database! But how? UW NetID and password? Way too insecure. A hacker could register you for the wrong classes.
URIs and Keys
Make a new file called .env, and inside it put the text:
ACCOUNT_HOST=
ACCOUNT_KEY=
The ACCOUNT_HOST variable is going to point to where on the web our database is located. The ACCOUNT_KEY is, effectively, a very long password to that database. Let’s open the Azure portal to retrieve that information.
On the Azure portal, go back to the Cosmos DB dashboard and open your database from the list. Find and open the Keys page from the menu on the left (1) . Now copy the URI with the copy button (2) and paste it after ACCOUNT_HOST= in the .env file. Click the little eye icon next to PRIMARY KEY to view the database’s access key (3), and then copy that (4). Paste it after ACCOUNT_KEY= in the .env file:

When you’re done, save the .env file and try running the script again. If all goes well, you should see messages that all the various elements were added to the database:

We can confirm this by going to the Azure portal, opening our database’s dashboard, and going back to Data Explorer:

From here, if we open the periodic-db entry under the NOSQL API list, and then choose Items, we should see a listing of all the elements in our database:

Yay! π©βπ¬βοΈπ¨βπ¬π§ͺπ§βπ¬βοΈ
4. Querying the database
In the text box next to SELECT * FROM c, try entering a filter expression into the box like:
WHERE c.Phase = "Gas"
Click the Apply Filter button to see only the elements that are gaseous at STP:

You can write complex SQL queries to make arbitrarily selective filters. Here’s a refined query that only shows gasses with atomic masses higher than 35:
WHERE c.Phase = "Gas" AND c.AtomicMass > 35
This query language, based on SQL, is quite flexible. Here is an example of a query that will show you all element names that don’t end with the letter n or the suffix ium:
WHERE NOT (c.Element LIKE "%n" OR c.Element LIKE "%ium")
You can sort results using an ORDER BY clause:
ORDER BY c.AtomicNumber
And choose the number of results with an OFFSET/LIMIT clause (the below skips the first 6 results and returns the next 2):
ORDER BY c.AtomicNumber OFFSET 6 LIMIT 2
For more detail about how to refine these filters, see the official documentation.
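These same filters can also be run programmatically with the azure-cosmos library, using the credentials set up earlier. Below is a sketch: the helper names gas_query and run_query are our own illustrative inventions; only the query_items call is real library API.

```python
# A sketch of running the portal's gas filter from Python.
def gas_query(min_mass: float):
    """Build a parameterized Cosmos SQL query matching the portal filter."""
    query = ("SELECT * FROM c WHERE c.Phase = @phase "
             "AND c.AtomicMass > @mass ORDER BY c.AtomicNumber")
    params = [{"name": "@phase", "value": "Gas"},
              {"name": "@mass", "value": min_mass}]
    return query, params

def run_query(container, min_mass: float = 35):
    """Execute against an azure-cosmos container client (as in process.py)."""
    query, params = gas_query(min_mass)
    # Cross-partition: we filter on Phase, not the /Period partition key.
    return list(container.query_items(query=query, parameters=params,
                                      enable_cross_partition_query=True))
```

Parameterized queries (the @phase / @mass placeholders) are preferred over pasting values into the query string, for the same injection-safety reasons as in ordinary SQL.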
π Challenge:
- Write a query that identifies the most massive element that was discovered before the year 1900.