Science Gateways and Dataset Dissemination

The cloud can be a big help in making a dataset available to others in your field. The primary challenge is dealing with ongoing storage fees and the egress charges that cloud platforms levy when your data is downloaded. There are a few strategies for dealing with this:

Encouraging cloud-native, storage-adjacent computation.
Taking advantage of cheaper object storage, which can be binned into tiers and discounted based on how often the data is accessed.
Taking advantage of vendor-specific discount programs for publicly hosted scientific data.

The rest of this article will go into these strategies in detail.
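As a rough illustration of the second strategy, binning objects into cheaper storage tiers by access frequency, here is a minimal sketch. The tier names, per-GB prices, and access thresholds are all hypothetical placeholders, not any vendor's actual pricing; substitute the numbers from your provider's current price list.

```python
# Hypothetical monthly storage prices in USD per GB. These are placeholders
# for illustration only; real prices vary by vendor, region, and date.
TIER_PRICE_PER_GB_MONTH = {
    "hot": 0.02,
    "cool": 0.01,
    "archive": 0.004,
}


def choose_tier(reads_per_month: float) -> str:
    """Bin an object into a tier by how often it is read.

    The thresholds are illustrative assumptions: read at least monthly
    means "hot", at least yearly means "cool", otherwise "archive".
    """
    if reads_per_month >= 1:
        return "hot"
    if reads_per_month >= 1 / 12:
        return "cool"
    return "archive"


def monthly_storage_cost(objects) -> float:
    """Sum the monthly storage cost for (size_gb, reads_per_month) pairs."""
    total = 0.0
    for size_gb, reads_per_month in objects:
        tier = choose_tier(reads_per_month)
        total += size_gb * TIER_PRICE_PER_GB_MONTH[tier]
    return total


if __name__ == "__main__":
    # Three 1000 GB objects with different access patterns.
    dataset = [(1000, 10), (1000, 0.5), (1000, 0)]
    print(f"Tiered cost:   ${monthly_storage_cost(dataset):.2f} per month")
    all_hot = sum(size for size, _ in dataset) * TIER_PRICE_PER_GB_MONTH["hot"]
    print(f"All-hot cost:  ${all_hot:.2f} per month")
```

Keep in mind that this sketch only counts storage fees. Archive-style tiers commonly add retrieval fees and minimum storage durations, so rarely read data is a good fit for them while data that is suddenly in demand is not.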

Data storage and costing

TODO: object glacier storage

Discount programs

TODO

Data APIs

TODO: zero to API solution

Storage-adjacent computation

The approach that many cloud-hosted gateways take to disseminating data is to provide their users with an experimentation platform, usually a JupyterHub. This way, rather than every user of the dataset downloading what they need to their own storage, they simply run their code or use tools hosted on cloud machines that sit next to the central dataset and can read it without paying egress charges.
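To see why this matters for cost, here is a back-of-the-envelope sketch comparing the two models. The user count, data volumes, and egress price are hypothetical, and the sketch assumes that reads from compute in the same cloud region as the data carry no transfer charge, which you should confirm for your provider. It also leaves out the cost of running the hosted compute itself.

```python
def egress_cost(n_users: int, gb_leaving_cloud_per_user: float,
                egress_price_per_gb: float) -> float:
    """Monthly egress bill if each user pulls the given volume out of the cloud."""
    return n_users * gb_leaving_cloud_per_user * egress_price_per_gb


if __name__ == "__main__":
    # All numbers below are hypothetical placeholders for illustration.
    n_users = 200
    egress_price_per_gb = 0.10  # USD per GB, not a real vendor price

    # Download model: every user copies a 500 GB subset to their own storage.
    download_model = egress_cost(n_users, 500, egress_price_per_gb)

    # Storage-adjacent model: analysis runs next to the data, and only small
    # results (plots, summary tables) of about 0.1 GB per user leave the cloud.
    adjacent_model = egress_cost(n_users, 0.1, egress_price_per_gb)

    print(f"Download model:         ${download_model:,.2f} per month")
    print(f"Storage-adjacent model: ${adjacent_model:,.2f} per month")
```

Under these assumed numbers the egress bill shrinks by several orders of magnitude, because only results leave the cloud rather than the raw data.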

For help getting things set up, check out our CloudBank Solutions for setting up a JupyterHub or running a hosted web application.

Case studies

Analysis Ready Data in the Cloud: Charles Stern of the Pangeo Project shows us the power of making coding environments available on the cloud right next to where the datasets are stored.
Data storage for neuroscience at a massive scale: Dr. Satra Ghosh introduces us to the DANDI Archive, a thorough archive of neuroscience data made available through the use of distributed cloud services.