Creating a NoSQL database on Azure

0. Introduction

In this tutorial we set up a database (an ‘empty container’). We then proceed to write records to this database: specifically, information about the periodic table of elements. In a subsequent tutorial, we’ll build an API providing public access to this database, without needing Azure account credentials. This is in effect a toy model of creating an open data resource that you might share with other researchers.

0.1 Concept of operations and authentication

Conceptually we have two bases of operation in the Azure cloud: the Azure portal and the Azure VM that we started up in the first Lab. Both of these appear to us via our laptop: the portal is seen in a web browser, and the VM is seen via VS Code Server, which functions as our development environment. VS Code Server communicates with the Azure VM via ssh (secure shell) traffic on port 22 of the VM. So both the VM and the NoSQL database that we will build below exist in the Azure cloud.

Underlying this construction process (building cyberinfrastructure) is the notion of authentication. In the case of the Azure VM, the authentication is by means of a keypair .pem file. This is pointed to by the config file Host entry seen in VS Code running on our laptop. The Host entry also has the IP address of the VM and the username azureuser; so this is all the information needed to connect to and authenticate on the Azure VM. In VS Code: clicking the lower left >< symbol initiates the startup of VS Code Server connected to this Azure VM. VS Code Server looks like VS Code, but it runs on the cloud machine.
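For reference, a config file Host entry looks something like the sketch below. The Host alias, IP address, and key filename here are placeholders; yours will match your own VM and keypair:

Host my-azure-workstation
    HostName 12.34.56.78
    User azureuser
    IdentityFile ~/.ssh/my-keypair.pem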

If we build a cloud database: It will also require authentication. Authentication to access the NoSQL database from the Azure portal is by means of logging in to the portal with our NetID. That is sufficient. Authentication to access the database from the VM is also necessary. Just because the VM is on the Azure cloud does not mean it can go messing around with the database. In the procedure below there is a step where we get an access key from the database (using the portal) and place that key in a file on the VM (using VS Code Server). This key is then used by a Python program running on the VM to authenticate access to the database. Specifically the Python program will write data into the database.

0.1.1 More than you wanted to know about security

On VS Code Server we will run a Python script (on the Azure VM) that communicates with the database service via a secure communication protocol called HTTPS. This is analogous to the secure shell (ssh) protocol that we use on port 22. By convention HTTPS uses port 443. So the Azure VM is sending information packets out to the internet which are routed (as if by magic) back to the database service inside the Azure cloud, specifically to port 443. We could just as easily run the code that talks to the database service from our own laptop.
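If you’re curious, you can see port 443 in action with a tiny sketch. The hostname below is a placeholder (once your database exists, its real hostname is the URI from the portal minus the https:// prefix); the script simply opens a TCP connection to port 443, which is the first step a real client takes before negotiating TLS and speaking HTTPS:

#!/usr/bin/env python3
import socket

# Placeholder hostname: substitute your own database account's hostname
HOST = "YOUR-DATABASE-NAME.documents.azure.com"

# Open a plain TCP connection to the HTTPS port (443). A real client
# would then negotiate TLS and exchange HTTPS requests over this socket.
with socket.create_connection((HOST, 443), timeout=5):
    print("Reached {:} on port 443".format(HOST))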

If we were inclined to be more secure: There is a mechanism for creating a private endpoint for the database service internal to Azure. This would still work with our Azure VM but we could turn off the public access point.

0.2 Objective of this Lab

We commence to build a NoSQL database (pronounced “no-sequel”) service on top of Azure’s “Cosmos DB” service. A NoSQL database is great for storing freeform data that does not fit easily into tables with rigidly defined rows and columns. To explore further, check out JavaScript Object Notation (JSON) document structure. Also worth noting: there are sub-categories of NoSQL. We will be using the Document type: each element in the periodic table will become a self-contained JSON document. There are also Key-Value (“dictionary”), Column-Family, and Graph flavors of NoSQL.
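To make the Document idea concrete: by the end of this Lab, each element will be stored as a self-contained document roughly along these lines (hydrogen shown here; the field names come from the CSV dataset we download below, and the exact set of fields varies a little from element to element):

{
    "id": "Hydrogen",
    "Element": "Hydrogen",
    "AtomicNumber": 1,
    "AtomicMass": 1.008,
    "Phase": "Gas",
    "Natural": true,
    "Radioactive": false
}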

This guide assumes that you’ve completed the VM Workstation tutorial. Working from the Azure VM is not, strictly speaking, required; you could just as well do this procedure from your laptop, for example in Windows WSL. However, sticking with the standardized Azure VM approach will make it easier for course staff to help you debug if something goes amiss.

Once the database is populated with elements from the periodic table, we will use the portal interface to examine some query patterns: which elements on the table are in the gas phase at STP, and so on.

1. Get your environment ready

Portal and workstation

We’ll start by making sure our workstation VM is turned on. Open a web browser and log in to the Azure web portal.

Go to the virtual machines dashboard by using the search bar at the top:

Select your workstation VM from the list, and if it’s stopped, click the Start button:

Keep this window open in the background; we’ll be using it again in a minute.

VSCode

Open VSCode and make sure your window is remotely connected to your cloud virtual machine. If it is, you’ll see the VM’s public IP address in the bottom left blue box:

If not, click the blue >< button and open an SSH connection to the workstation as shown in the workstation tutorial.

⛔ Having trouble with your VM? ⛔

If you can’t connect VSCode to your VM, check out these troubleshooting guides from the Workstation tutorial:

2. Create an empty database

Now we’re ready to create our database!

Go back to the web portal and search for Cosmos. Open up the dashboard for Azure Cosmos DB:

From here, create a new database by clicking the + Create button:

Azure supports a number of different database technologies, all of which are provided with the brand name “Cosmos DB”. Today, we’ll be making a NoSQL document store, which they call “Cosmos DB for NoSQL”. Click the Create button under the Cosmos DB for NoSQL heading:

We’ll be presented with a configuration page to select various database options. Choose the following:

When you’re done, click Next: Global Distribution > at the bottom.

On this page, choose the following options:

When you’re done, click Next: Networking >.

There are no options we need to change here, but in general, this is where we would limit database access from the outside world. That becomes really important if our database stores personal data about human beings, or trade secrets for our employer. In cases like those, only VMs within our cloud account would be allowed to access the database, and we would carefully control how the outside world accesses those VMs. Ask your course staff if you’re interested in learning more.

Now, click Review + Create, and then finally the blue Create button.

After a minute, the database should get created and we’ll be presented with a blue Go to resource button. Click it:

The portal will bring us to a quickstart page, but we’re not going to follow those instructions. Instead, select the Data Explorer option on the left:

From this page, we’ll be able to see and edit the contents of the database. We’re going to create a new container to hold our periodic table data. Click the New Container button:

A configuration menu will slide in from the right. Enter the following options:

When you’re done, click OK at the bottom of the sidebar.

If all went well, we should see an empty database:

Let’s put some stuff in it!

3. Populate the database

Here’s our strategy: we’re going to download a free/open resource file, a collection of chemical element data in CSV format (CSV = comma-separated values). We first fetch a local copy of this CSV file in the terminal; we then write a Python script to insert that data into our cloud database. Before running any Python we will take a moment to create an isolated environment using the built-in Python venv module.

To get started, head back to your remote VSCode window. If there is not already a terminal available: Create a new terminal using the Terminal -> New Terminal menu option:

Make a new directory called db-populate with this command:

mkdir db-populate

Then enter it with the command:

cd db-populate

Getting the data

Now let’s download the periodic table data using the wget command. This is a terminal command that downloads files in the manner of your web browser. The format of the wget command is:

wget [URL OF FILE TO DOWNLOAD] -O [NAME TO SAVE THE FILE AS] 

(note -O is a capital letter oh)

For our periodic table data, use these arguments for DOWNLOAD URL and FILE NAME:

https://gist.githubusercontent.com/speters33w/e5b1246d6859f29c4f02a299714d4c20/raw/a78d747534b915c19c5fb6d1fac0df6a77d62452/Periodic%2520Table%2520of%2520Elements.csv
periodic-table.csv
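Putting those pieces together, the full command is (all on one line):

wget https://gist.githubusercontent.com/speters33w/e5b1246d6859f29c4f02a299714d4c20/raw/a78d747534b915c19c5fb6d1fac0df6a77d62452/Periodic%2520Table%2520of%2520Elements.csv -O periodic-table.csv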

When we open the periodic-table.csv file from the files bar on the left: We should see a bunch of element data…

Let’s load it into our database!

Installing Python modules

As noted, we now create a Python virtual environment (“venv”) as a localized context for our database work. We will then jump into this environment to install the various Python libraries we need. This environment business is an organizational practice: if we have multiple projects to work on, setting up each project with its own environment helps avoid a “kitchen sink” approach where we have a single global environment with lots and lots of libraries installed. Not too surprisingly, the kitchen sink approach can lead to mutual incompatibilities between libraries, which means time lost to debugging. Avoiding this by setting up specialized environments is an example of compartmentalization of resources, and is considered a best practice in research software engineering.

Start by running this command to create a new environment:

python3 -m venv populate-env

And then run this command to enter it:

source populate-env/bin/activate

You should see the environment name appear before the regular stuff in the terminal prompt:
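For example (your username, hostname, and path will differ):

(populate-env) azureuser@workstation:~/db-populate$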

In general, you’ll only need to create the environment once, but you might need to “activate” it every time you open a new terminal. If you’re not sure, look for that (populate-env) text before the rest of the command prompt. If you don’t see it, run that activate command above.

Next, create a new file named requirements.txt. Do this by right-clicking the db-populate folder in the explorer bar on the left, and selecting New File.... Name the file requirements.txt:

(if the explorer bar isn’t visible, try running the terminal command code ~)

In the new file’s text area, paste in this text and save:

pandas
python-dotenv
azure-core
azure-cosmos

Each line names a different Python library that needs to be installed for our code to run: pandas (reads the CSV file into a table we can loop over), python-dotenv (loads secrets like our database key from a .env file), and azure-core / azure-cosmos (the client libraries for talking to Cosmos DB).

Finally, use pip to install the libraries listed by requirements.txt using the following command line:

pip3 install -r requirements.txt

If all goes well, you should see a message like this:

Writing our code

Create another new file, and name this one process.py.

Plunk this code into it:

#!/usr/bin/env python3
import sys
import os
import pprint

import dotenv
import pandas as pd
import azure.cosmos.cosmos_client as cosmos_client

# Load secret data like our database key from a ".env" file
dotenv.load_dotenv()

def fail_error(msg):
    """Print a nice red error to the screen and then exit the script"""
    sys.stderr.write("\033[1;31m{:}\033[0m\n".format(msg))
    sys.exit(2)

# Crash the script intentionally if the .env file doesn't contain a
# database URL and key:
try:
    HOST = os.environ['ACCOUNT_HOST']
    MASTER_KEY = os.environ['ACCOUNT_KEY']
except KeyError:
    fail_error("Get your database's account URL and key and set them in the ACCOUNT_HOST / ACCOUNT_KEY environment variables")

# Specify the database name and container name we want to work with:
DATABASE_ID = "periodic-db"
CONTAINER_ID = "elements"

def dataframe_to_dicts(df):
    """Function to loop through rows in a spreadsheet and spit them
    out as Python dictionaries/NoSQL-style 'documents'"""
    for record in df.to_dict(orient='records'):
        yield {k:v for k,v in record.items() if not pd.isna(v)}

def load_data(filename):
    """Function to load periodic table data from a file and 'normalize'
    it so that datapoints that are either true or false are actually 
    represented by python 'True' and 'False' values (rather than the
    words yes/no) """
    with open(filename, "r") as f:
        data = pd.read_csv(f, true_values=["yes"])
    for col in ["Radioactive", "Natural", "Metal", "Nonmetal", "Metalloid"]:
        data[col] = data[col].fillna(False)
    return dataframe_to_dicts(data)

if __name__=="__main__":
    # Make sure the script has a valid data filename specified:
    if len(sys.argv) != 2:
        fail_error("Specify input data file to import")
    if not os.path.isfile(sys.argv[1]):
        fail_error("Couldn't find specified input file. Check path/spelling for typos.")

    # Open a connection to our cloud database account. Select the
    # database and container specified at the top of this script.
    client = cosmos_client.CosmosClient(HOST, {'masterKey': MASTER_KEY})
    db = client.get_database_client(DATABASE_ID)
    container = db.get_container_client(CONTAINER_ID)

    # Try to loop through each entry in the CSV file, and add it to
    # our cloud database with the 'create_item' action
    try:
        for record in load_data(sys.argv[1]):
            # Select the element's name as its unique ID in the database
            record["id"] = record["Element"]
            container.create_item(body=record)
            print("Added {:} ({:}) to db".format(record["Element"], record["AtomicNumber"]))
    except pd.errors.ParserError:
        fail_error("Input file is not in CSV format")

You should just be able to paste the above code into the file and save it (though your course staff encourage you to read through it!). Finally, in the terminal, run your code with the following command:

python3 process.py periodic-table.csv

If all goes well, you should get this error:

We’ve gotta tell the Python script how to log in to our database! But how? UW NetID and password? Way too insecure. A hacker could register you for the wrong classes.

URIs and Keys

Make a new file called .env, and inside it put the text:

ACCOUNT_HOST=
ACCOUNT_KEY=

The ACCOUNT_HOST variable is going to point to where on the web our database is located. The ACCOUNT_KEY is, effectively, a very long password to that database. Let’s open the Azure portal to retrieve that information.

On the Azure portal, go back to the Cosmos DB dashboard and open your database from the list. Find and open the Keys page from the menu on the left (1) . Now copy the URI with the copy button (2) and paste it after ACCOUNT_HOST= in the .env file. Click the little eye icon next to PRIMARY KEY to view the database’s access key (3), and then copy that (4). Paste it after ACCOUNT_KEY= in the .env file:
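When filled in, the .env file will look something like this (these values are placeholders; the URI format shown is standard for Cosmos DB accounts, but use the actual values you copied from the portal):

ACCOUNT_HOST=https://YOUR-DATABASE-NAME.documents.azure.com:443/
ACCOUNT_KEY=aVeryLongRandomLookingKeyCopiedFromThePortal==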

When you’re done, save the .env file and try running the script again. If all goes well, you should see messages that all the various elements were added to the database:

We can confirm this by going to the Azure portal, opening our database’s dashboard, and going back to Data Explorer:

From here, if we open the periodic-db entry under the NOSQL API list, and then choose Items, we should see a listing of all the elements in our database:

Yay! πŸ‘©β€πŸ”¬βš—οΈπŸ‘¨β€πŸ”¬πŸ§ͺπŸ§‘β€πŸ”¬βš›οΈ

4. Querying the database

In the text box next to SELECT * FROM c, try entering a filter expression like:

WHERE c.Phase = "Gas"

Click the Apply Filter button to see only the elements that are gaseous at STP:

You can write complex SQL queries to make arbitrarily selective filters. Here’s a refined query that only shows gases with atomic masses higher than 35:

WHERE c.Phase = "Gas" AND c.AtomicMass > 35

This query language, based on SQL, is quite flexible. Here is an example of a query that will show you all element names that don’t end with the letter n or the suffix ium:

WHERE NOT (c.Element LIKE "%n" OR c.Element LIKE "%ium")

You can sort results using an ORDER BY clause:

ORDER BY c.AtomicNumber

And choose the number of results with an OFFSET/LIMIT clause (the query below skips the first 6 results and returns the next 2; with the ordering above, that means the elements with atomic numbers 7 and 8):

ORDER BY c.AtomicNumber OFFSET 6 LIMIT 2

For more detail about how to refine these filters, see the official documentation.
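These queries aren’t limited to the portal: the same azure-cosmos library we used in process.py can run them from the VM. Here is a minimal sketch (assuming the same .env file, database name, and container name as process.py) using the container client’s query_items method:

#!/usr/bin/env python3
import os

import dotenv
import azure.cosmos.cosmos_client as cosmos_client

# Reuse the same credentials that process.py reads from the .env file
dotenv.load_dotenv()
client = cosmos_client.CosmosClient(os.environ['ACCOUNT_HOST'],
                                    {'masterKey': os.environ['ACCOUNT_KEY']})
container = client.get_database_client("periodic-db").get_container_client("elements")

# Same filter as above: elements that are gaseous at STP
QUERY = 'SELECT * FROM c WHERE c.Phase = "Gas"'
for item in container.query_items(query=QUERY, enable_cross_partition_query=True):
    print("{:} ({:})".format(item["Element"], item["AtomicNumber"]))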

πŸ† Challenge:

  • Write a query that identifies the most massive element that was discovered before the year 1900.