Step 5: Handling Your Data Pipeline
Who’s involved
Instrument scientist (code version control, tool support from NCAS Data Manager)
Present Action
Reason
Raw data are rarely suitable for further use as they stand, so they need to be worked up. In this step you establish a documented, version-controlled processing chain that takes the data through to a product ready for sharing. Keeping your processing code on GitHub means the code is backed up and the provenance of your output files can be traced back to the code that produced them. This helps resolve issues and builds confidence in your work, because the processing is transparent and fully traceable.
This processing may take place outside JASMIN (e.g. in a proprietary tool on a Windows machine), but its output should eventually feed into the provided delivery routes.
These notes cover how to use your GitHub repository, how to get referenceable versions of your code to cite in your outputs, and how to use the NCAS Group Work Space when you need it in your data pipeline.
Workflow
Back up your workflow with GitHub repositories
- Finding and using your repo (done for you in step 2)
- The GitHub repository can be found online at https://github.com/ncasuk/<instrument name>-software. You may need to ask NCAS IT for the relevant permissions to add files to the repository - make sure to provide your GitHub username
- It should be linked to the relevant folder on JASMIN (i.e. /gws/pw/j07/ncas_obs_vol{1,2}/software/<instrument-name>/) - this link can be confirmed by running
git status
in the folder on JASMIN
- If this returns an error, such as “fatal: not a git repository”, contact NCAS IT
- Any code written to process and format data should be included in this repository. Files added to the folder on JASMIN should be backed up to GitHub with the following commands from within the instrument software folder:
git add <name of file(s)>
git commit -m "Insert useful message"
git push
- Any files added to the GitHub repository online can be pulled down to JASMIN using the
git pull
command
- It is a good idea to update the README file with a brief step-by-step guide to how you processed the data from raw format to final version. In it you can reference and link to any external or proprietary software you used, and list any requirements or dependencies needed to run your code. This aids the traceability of the data and reminds you how you processed it if you later have to reprocess it or process new data.
- For a more detailed guide to using Git and GitHub, visit https://docs.github.com/en/get-started
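The backup commands above can be wrapped in a small shell function. This is a minimal sketch, not an NCAS-provided tool: the name backup_changes is hypothetical, and the initial "git pull --ff-only" is an added assumption that you want to pick up any changes made online before pushing.

```shell
# backup_changes: hypothetical helper, not part of Git or any NCAS tool.
# Usage: backup_changes "commit message" file [file ...]
backup_changes() {
    if [ "$#" -lt 2 ]; then
        echo "usage: backup_changes \"message\" file [file ...]" >&2
        return 2
    fi
    message=$1
    shift

    # Stop early if this folder is not linked to a Git repository.
    git rev-parse --is-inside-work-tree >/dev/null 2>&1 || {
        echo "not a git repository: contact NCAS IT" >&2
        return 1
    }

    # Assumption: bring in changes made on GitHub before pushing.
    git pull --ff-only || return 1
    # Stage only the named files, commit them, and send the commit to GitHub.
    git add -- "$@" || return 1
    git commit -m "$message" || return 1
    git push
}
```

Run it from within the instrument software folder, for example: backup_changes "Add calibration step" process_data.py (the file name here is only an illustration).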
Tagged releases and published code (in Zenodo)
When you have a working version of your code that you are using to process your data, it is a good idea to create a tagged release with a version number. This produces a snapshot of the repository that you can refer to, rather than just the main branch of the repository which may continue to evolve in time. These releases can also be linked to Zenodo, providing you with a DOI for your software.
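Version numbers for releases are commonly written as MAJOR.MINOR.PATCH (semantic versioning): the major number changes for incompatible changes, the minor number for backwards-compatible additions, and the patch number for bug fixes. As an illustration only, the sketch below works out the next version number from the current one. The function name next_version is hypothetical, and it assumes tags of the form v1.2.3.

```shell
# next_version: hypothetical helper; assumes tags look like v1.2.3 or 1.2.3.
# Usage: next_version <current tag> <major|minor|patch>
next_version() {
    current=${1#v}      # drop a leading "v" if present
    part=$2             # which number to increase

    # Split "1.2.3" on the dots into the positional parameters.
    old_ifs=$IFS
    IFS=.
    set -- $current
    IFS=$old_ifs
    major=$1
    minor=$2
    patch=$3

    case $part in
        major) major=$((major + 1)); minor=0; patch=0 ;;
        minor) minor=$((minor + 1)); patch=0 ;;
        patch) patch=$((patch + 1)) ;;
        *) echo "unknown part: $part" >&2; return 1 ;;
    esac

    # The result is always written with a leading "v".
    echo "v$major.$minor.$patch"
}
```

For example, a bug fix to code released as v1.2.3 would give "next_version v1.2.3 patch", which prints v1.2.4.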
- Visit https://zenodo.org and log in (or sign up if you don’t already have an account).
- Link your GitHub account to Zenodo. This should be done automatically if you log in to Zenodo through GitHub, otherwise on Zenodo:
- Click the dropdown icon next to your username in the top right corner
- Click on “Linked accounts”
- Click on “Connect” next to GitHub
- Choose the repository you want to link to Zenodo:
- Click on GitHub in the Settings, or in the dropdown menu from the top right corner
- Find the repository in the list and flip the switch to on (if you have access to a lot of repositories, it may take a short while for them all to load)
- On GitHub, create a release of the repository (Zenodo will not archive, or assign a DOI to, releases made before the switch was flipped in the previous step):
- Log in to GitHub and visit https://github.com/ncasuk/<instrument name>-software
- Click on “Releases” on the right-hand side of the page, and then “Create a new release”
- Add a meaningful release title and description.
- Create a tag for this release. Click on “Choose a tag” just above the “Release title” box, type the name of the tag you wish to create, then click on “Create new tag: on publish”. Note the “Tagging suggestions” and “Semantic versioning” guidance on that page. Use this tag name, where appropriate, in the netCDF global attribute fields to record the version of the code used to produce each file.
- Click on the “Publish release” button at the bottom of the page
- You should see this release on both GitHub and Zenodo (it may take a few minutes to reach Zenodo, especially if this is the first release of the repository it is capturing). Zenodo will provide two DOIs - one that points specifically to this release, and one that will always point to the latest release for that repository
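Once a release exists, the tag name can be read back from the repository when filling in the netCDF global attributes. The following is a sketch under stated assumptions: provenance_attributes is a hypothetical helper, and the attribute names it prints are assumptions, so substitute the names your data standard specifies.

```shell
# provenance_attributes: hypothetical helper, not part of Git or any NCAS tool.
# Usage: provenance_attributes <repository URL> [tag]
# The attribute names printed below are assumptions; adjust to your standard.
provenance_attributes() {
    repo_url=$1
    # Use the tag given as the second argument, or ask Git to describe the
    # checked-out code: the exact tag if it is a tagged release, otherwise the
    # nearest tag plus commit details, with "-dirty" if there are
    # uncommitted changes.
    version=${2:-$(git describe --tags --always --dirty)} || return 1
    printf 'processing_software_url=%s\n' "$repo_url"
    printf 'processing_software_version=%s\n' "$version"
}
```

Called with the repository URL and a tag such as v1.0.0, it prints one name=value line per attribute. Called without a tag from inside the instrument software folder, it falls back to "git describe", which prints the bare tag name only when the checked-out code is exactly the tagged release.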