We include this page to describe common meta-practice and meta-design: what is useful, and how one learns to use it. The theme is collaborative data science, irrespective of cloud computing. Research computing methods advocacy is a broader topic that includes the cloud but many other tools, technologies, and methods besides.
These topics may seem daunting, but they tend to make sense with some dedicated time and effort. As with research, there is no substitute for clearing away distractions and focusing for an extended period (from minutes to months, depending) on developing new skills. The good news is that we are in a “boom” era for documentation, tutorials, knowledge bases and debugging tools. For example it is now common practice – given a bug in some code – to paste the error message into a browser and shortly arrive at a good solution to the problem on Stack Overflow. YouTube abounds with walk-throughs and tutorials; and resource books such as the Python Data Science Handbook can be found online at no cost.
Suppose you are interested in analyzing dolphin vocalizations. You may find a training series on YouTube; or you may create one. As a more concrete example, we found a training resource on learning to work with regular expressions: super powerful if you work with text interpretation and parsing, say in NLP. So we put that link below. If you find a good resource: tell us so we can put it here!
The point: Linux may not be your OS of choice, but in terms of open-source affinity and getting stuff done, there is a lot to recommend its use.
The point: cf Linux. By ‘the shell’ we mean ‘the Linux command line interpreter program’.
The point: Want a safe, stable backup of your code base? Learn some sort of software versioning. The best one is still terrible. It is called git. The only good thing about it is that you can be reasonably proficient with a cheat sheet of 12 commands.
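The core of that cheat sheet fits in a few lines. As a sketch (the repository name, file name, and identity below are placeholders), a minimal git workflow looks like:

```shell
# a throwaway repository to practice the core commands on; all names
# here are placeholders
git init demo                              # create a new repo in ./demo
git -C demo config user.name  "Your Name"  # one-time identity setup
git -C demo config user.email "you@example.com"
echo "print('hello')" > demo/analysis.py
git -C demo add analysis.py                # stage the new file
git -C demo commit -m "first commit"       # record a snapshot
git -C demo log --oneline                  # view the history so far
```

From there, `git push` and `git pull` move commits to and from a backup site such as GitHub.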
The point: cf git. That safe, stable backup can live on a safe, stable website called GitHub. Your projects become repositories there; and you can make them public or private. GitHub is free or cheap.
The point: You can develop programs as text files… or you can use an abstraction layer called a Jupyter notebook as a development environment. In the latter case you can break your code up into smaller blocks and run them piecemeal. This fragmentation even supports interspersed documentation; nicely formatted human-readable text.
The point: If you configure your data science computing environment by installing specialized libraries, you can also automate that process. Doing so makes reproducing your environment much simpler, at the cost of some time invested up front.
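One common way to automate this is a conda `environment.yml` file checked into your repository. A minimal sketch (the environment name, packages, and versions here are illustrative, not a recommendation):

```yaml
# environment.yml — illustrative sketch; names and versions are examples
name: my-analysis
channels:
  - conda-forge
dependencies:
  - python=3.10
  - numpy
  - pandas
  - jupyterlab
```

Anyone can then recreate the environment with `conda env create -f environment.yml`; pip users accomplish the same thing with a `requirements.txt` and `pip install -r requirements.txt`.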
The point: A simple bit of reproducibility is provided at no cost by Binder, a service that will replicate and execute your Jupyter notebooks. It is a very low-power copy of your computing environment, so you will not be able to use it to calculate big things like the mass of the proton; but it is automatic and it does execute Jupyter notebooks. Again at the cost of some time invested.
The point: If you travel down the Jupyter notebooks path you may wish to staple some of them together as a sort of virtual book. For this purpose the Jupyter people have invented Jupyter Book.
The point: cf git. If you are working within your own code base, you create the code base in some location (say on your own computer) from a GitHub backup or canonical version using git clone. If on the other hand you want to begin working from someone else’s repository (and you are not on clone terms) you use a fork.
I wonder if I got that right.
- Pickle: An easy way to share data structures across notebooks. Here is an introductory blog.
- Notebooks tend to grow until they become unwieldy. A natural step is to break them down into smaller conceptual blocks.
- Pickle facilitates ‘picking up where the previous notebook left off’.
- Warning: Not necessarily cross-platform
- Rejoice: Checkpointing! Cloud cost savings!
- Up your game to JSON!
- What’s the benefit? What’s the drawback?
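The pickle hand-off between notebooks can be sketched in a few lines. The file names and the `results` dictionary below are hypothetical, standing in for whatever your first notebook actually computes:

```python
import json
import pickle

# hypothetical intermediate result produced by "notebook one"
results = {"species": "bottlenose", "call_count": 42}

# notebook one: save the data structure to disk
with open("results.pkl", "wb") as f:
    pickle.dump(results, f)

# notebook two: pick up where the previous notebook left off
with open("results.pkl", "rb") as f:
    restored = pickle.load(f)

print(restored["call_count"])  # → 42

# the JSON upgrade: human-readable and cross-language, but limited to
# simple types (no arbitrary Python objects)
with open("results.json", "w") as f:
    json.dump(results, f)
```

That last pair of lines is the trade: pickle round-trips almost any Python object but is Python-specific; JSON is portable across platforms and languages but only handles basic types.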
The point: There is both line magic (%) and cell magic (%%) in Jupyter notebooks that will accomplish interesting and useful meta-tasks. Simple example: Time a cell to see how long it takes to execute.
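The timing example looks like this. The magic lines only work inside a notebook cell, so they appear here as comments; the `timeit` module call is the plain-Python equivalent (the statement being timed is an arbitrary example):

```python
import timeit

# plain-Python equivalent of the %timeit line magic: time a statement
elapsed = timeit.timeit("sum(range(1000))", number=10_000)
print(f"10,000 runs took {elapsed:.3f} seconds")

# in a notebook cell you would instead write, for example:
#   %timeit sum(range(1000))   # line magic: time this one statement
#   %%time                     # cell magic, on the first line of the
#                              #   cell: time everything in the cell
```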
The point: For a group of researchers it can be useful to centralize Jupyter notebook server environments as a management service. The mechanism to do this is called JupyterHub. However there is a simplified variation worth learning about as well, called The Littlest JupyterHub. It is not twee.
The point: Much of the Jupyter notebook support we describe above can be done on one’s own machine… up until we get to large infrastructure deployments like JupyterHubs, where a set of servers is needed. We can rent these on the cloud. The trick is to have them automatically scale up and down according to human demand so as to minimize the cost to operate. And we want to understand how they can be made reliable so that they do not lose our work: this is the persistence problem.
Now we begin to poke our nose in the door of data science sub-disciplines, often associated with forms of machine learning.
- The Jupyter environment as noted is suitable for many data science sub-disciplines. Natural Language Processing (NLP) is no exception.
- [This tutorial video provides an excellent introduction to the topic]
A regular expression is a formalism for parsing text. Learn how to do it here.
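To give the flavor, here is a small regular-expression sketch in Python. The log line and the field names are hypothetical, imagined for the dolphin-vocalization example above:

```python
import re

# hypothetical annotation line: a date, a time, and a call label
line = "2021-06-01 14:22:05 click_train_A amplitude=0.82"

# three capture groups: date, time, and a word-character label
pattern = r"(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) (\w+)"

match = re.match(pattern, line)
if match:
    date, time, label = match.groups()
    print(date, label)  # → 2021-06-01 click_train_A
```

Each parenthesized group pulls out one field; `\d{4}` means “exactly four digits” and `\w+` means “one or more word characters”, which is why the label stops at the space before `amplitude=0.82`.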
- AWS-to-Azure via AzCopy and via Azure Data Factory
- GCP-to-Azure via AzCopy and via Azure Data Factory
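As a sketch of the AzCopy route: a recursive copy from an S3 bucket into an Azure Blob container looks like the following. The bucket, account, and container names are placeholders, the destination needs an authorization such as a SAS token, and AzCopy must be able to read your AWS credentials, so this is not runnable as-is:

```shell
# placeholders throughout; AzCopy reads AWS credentials from the
# AWS_ACCESS_KEY_ID / AWS_SECRET_ACCESS_KEY environment variables
azcopy copy \
  "https://s3.amazonaws.com/my-source-bucket/" \
  "https://myaccount.blob.core.windows.net/my-container/?<SAS-token>" \
  --recursive
```

Azure Data Factory accomplishes the same transfer as a managed pipeline rather than a command-line tool, which suits scheduled or very large movements of data.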