OHBM Hackathon 2021 - TrainTrack
🧠💻

Reproducible Workflows

Stephan Heunis ("Atlantis" slot)
@fMRwhy jsheunis Psychoinformatics lab
Institute of Neuroscience and Medicine, Brain & Behavior (INM-7)


Şeyma Bayrak ("Rising Sun" slot)
sheyma
Otto Hahn Group Cognitive Neurogenetics


Slides:
https://tinyurl.com/ohbm-2021-handson-repro

Agenda

  • Overview (8 min)
  • Practical aspects (7 min)
  • Step 1: share data and code (30 min)
  • Step 2: software environment and requirements (30 min)
  • -- BREAK (30 min) --
  • Step 3: cloudy containers (30 min)
  • Step 4: interplanetary sci-comm (20 min)
  • Step 5: reproducible data management (20 min)
  • More tools (5 min)


Total running time: 3 hours

Overview

You've just published a paper that calculated the cortical thickness for a group of research participants. You've run some statistical tests and visualized the results.

Overview

Soon after, a researcher in your field sends you an email:

Overview

Our goal for this hands-on session is to guide you through the use of several tools for creating a reproducible workflow so that you can go...

from reacting like this:

to reacting like this:

Overview

There are many routes to follow, and many tools to help you on your way. We are going to take a step-wise, "Galaxy brain" approach:

Overview

By the end of this session, you should be able to do the following STEPS:

  1. Set up a `requirements.txt` file that specifies package requirements
  2. Specify and set up a virtual environment to install requirements
  3. Share code, installation, and running instructions via GitHub
  4. Transform your code into a Jupyter notebook
  5. Set up your code repository to run in the cloud with Binder
  6. Understand how containers can play a role in this context
  7. Understand the benefits of data management with DataLad


Questions? 💬

Practical aspects

  • You can follow along without having any specific software or tools installed on your machine.
  • You just need a good internet connection and your browser.
  • However, if you have the required software/packages installed locally, you are welcome (and encouraged) to run the tasks on your system (we'll provide instructions).
  • We use myBinder as a cloud-based tool to demonstrate and run some of the tasks in the workshop, but we'll teach you all about it.

Practical aspects

  • Hands-on!
    • You will be doing a lot of tasks yourself (together with all of us)
    • The laptop icon (💻) indicates a task for you
    • The eyes icon (👀) means we'll demonstrate something
    • The speech bubble icon (💬 ) indicates time for discussion/opinions
    • Feel free to ask questions at any time (via chat, voice, or video)
    • Live polls, please participate!
  • Useful links:

Practical aspects

Who are you?

http://etc.ch/j6my

Practical aspects

Step 1: share data and code

Why don't we just send them the data and code "via" a download link?

Step 1: share data and code

Why don't we just send them the data and code "via" a download link?

💻 Try it yourself:

Step 1: share data and code

💬 Why don't we just send them the data and code "via" a download link?

  • You have to create a link and send an email every time someone requests it
  • There are no instructions included for running the analysis
  • The person might not have the correct software or package versions installed
  • They might be able to work out from the code which packages are required, but installing them could interfere with their existing environment and break things down the line.

Step 1: share data and code

So they tried running the script on their machine, and then...


💬 What went wrong? What should we have done?

Step 1: share data and code

To prevent this issue (or similar issues), while still allowing others to run the code on their machines, we need to share:

1. The required packages → requirements.txt
2. The Python version → a virtual environment
3. Instructions for how to use these to successfully run the script → README
4. The data, the code, and all of the above in an accessible location → GitHub

Step 1: share data and code

We'll start with adding our data and code to Github:

In your browser (👀 💻)

  1. Create a GitHub account if you don't already have one
  2. Create a new repository ("ohbm-handson-test") with a README and MIT License
  3. Upload the code and data directories to your repository
    • GitHub's browser upload only accepts up to 100 files at a time, so you may need to upload the data files in batches
    • Take care to keep the correct directory structure
  4. Commit (i.e. save) these uploads

Step 1: share data and code

We'll start with adding our data and code to Github:

From the command line (💻)

  1. Create a GitHub account if you don't already have one
  2. Create a new repository ("ohbm-handson-test") with a README and MIT License
  3. Copy the repository URL
  4. Clone the repository to your machine, and navigate to it
  5. Copy the code and data directories to your repository
  6. Commit the changes to git
  7. Push the changes to GitHub

Step 1: share data and code

We'll start with adding our data and code to Github:

From the command line (💻)

            
                #!/bin/bash

                ROOTDIR=[where-you-want-to-save-the-repo]
                REPOURL=[insert-your-repo-URL]
                REPONAME=[insert-your-repo-name]
                CONTENTDIR=[insert-path-to-paper_data_and_code-directory]

                cd $ROOTDIR
                git clone $REPOURL
                cd $REPONAME
                
                cp -R $CONTENTDIR/* .
                git add --all
                git commit -m "add data and code to repo"
                git push origin main
            
        

Step 1: share data and code

Next, we'll add instructions to the README:

In your browser (👀 💻)

  1. Click on the edit button (🖊️) next to "README" on the repo's main page
  2. Write in your own words: the content of the repo and how to run the analysis
    • Editing is done in Markdown format (cheat sheet)
    • You can "Preview" your changes while editing
    • If you don't want to use your own words, use the content from our example
  3. Commit (i.e. save) these changes

Step 1: share data and code

Next, we'll add instructions to the README:

From the command line (💻)

  1. Open the "README.md" file in your favourite text editor
  2. Write in your own words: the content of the repo and how to run the analysis
    • Editing is done in Markdown format (cheat sheet)
    • If you don't want to use your own words, use the content from our example
  3. Commit the changes to git
  4. Push the changes to GitHub

Step 1: share data and code

Next, we'll add instructions to the README:

From the command line (💻)

            
                #!/bin/bash
                # After editing and saving the README file
                # Make sure you are located in the repo's root directory

                git add README.md
                git commit -m "add description to readme"
                git push origin main
            
        

Step 1: share data and code

We have not yet specified the software or package requirements and we have not explained how to set up a virtual environment. We'll address these as our next main step in the hands-on session.


1. The required packages → requirements.txt ❓
2. The Python version → a virtual environment ❓
3. Instructions for how to use these to successfully run the script → README ✅
4. The data, the code, and all of the above in an accessible location → GitHub ✅

Well done on achieving your first milestone towards creating a reproducible workflow! 🥳🥳🥳

Step 1: share data and code

"Galaxy brain" update:

Step 2: software environment and requirements

After sharing the news about the public GitHub repo with our colleague, we get the following reply:



Now we'll focus on requirements and virtual environments

Step 2: software environment and requirements

Introducing requirements.txt

  • Single file to capture required Python packages
  • Makes installation straightforward with pip:
  •                 
                        pip install -r requirements.txt
                    
                
  • In the file, you can specify the required packages (check the script's imports) and, optionally, their versions:
  •                 
                        matplotlib==3.2.2
                        numpy>=1.16.5
                        pandas
                        nibabel
                        nilearn>=0.7.1
                        scikit-learn
                        brainspace
                    
                

💻 Now, create your own requirements.txt file and add it to your GitHub repo (either in your browser or via git).
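If you're not sure which versions you currently have installed, pip itself can help you draft the file. Note that pip freeze lists everything in the active environment, so trim the output down to the packages the script actually imports:

                    #!/bin/bash
                    # list the exact versions installed in the currently active environment
                    pip freeze > requirements.txt
                    # inspect the file and keep only the packages the analysis script needs
                    cat requirements.txt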

Step 2: software environment and requirements

Introducing requirements.txt

NOTE: not all packages/tools required for an analysis are Python packages, and even those that are might not be available on the Python Package Index (PyPI). This means that installing everything with pip and a requirements.txt file might not be sufficient:

  • Check whether some packages/tools might have to be installed differently:
    • APT: for managing system-level packages (retrieval, configuration and installation) on Debian-based Linux systems
    • conda: Package, dependency and environment management for any language running on Windows, macOS and Linux
  • Add extra installation instructions to README
  •                 
                        git clone https://github.com/MICA-MNI/BrainStat.git
                        cd BrainStat
                        python3 setup.py build
                        python3 setup.py install
                    
                

Step 2: software environment and requirements

Introducing Virtual Environments

  • requirements.txt helps a lot, but what if the colleague already has some of these packages installed? 💬
  • Installing new packages, or different versions of already-installed packages, can interfere with their local Python setup and cause errors.
XKCD

Step 2: software environment and requirements

Introducing Virtual Environments

It would be great if colleagues could install our package requirements in a clean and isolated environment ==> virtual environments!
  • virtualenv
  •                 
                        #!/bin/bash
                        pip install virtualenv #install the package
                        virtualenv --python=python3 mypythonenv #create a new virtual environment
                        source mypythonenv/bin/activate #activate the virtual environment
                        # now install your packages with pip and do the analysis
                        deactivate #deactivate the virtual environment
                    
                
  • miniconda
  •                 
                        #!/bin/bash
                        # first install Miniconda with the installer for your operating system (see link)
                        conda create -n mypythonenv python=3.6
                        conda activate mypythonenv
                        # now install your packages with conda and/or pip and do the analysis
                        conda deactivate #deactivate the virtual environment
                    
                

Step 2: software environment and requirements

Introducing Virtual Environments

💻 So let's tell our colleague what to do next. Update your repo's README to include instructions for setting up a virtual environment and installing all required packages in this environment. Include:

  • An intro sentence to mention that a virtual environment can be used to run the code
  • Instructions on how to install the virtual environment manager
  • Instructions on how to create and activate the virtual environment
  • Instructions on how to install all packages using requirements.txt
  • Lastly, instructions on how to run the analysis (if not already included)
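Putting it together, the setup instructions in the README might boil down to something like this sketch (the environment name and the analysis script name are placeholders; adapt them to your repo):

                    #!/bin/bash
                    # create and activate a clean virtual environment
                    pip install virtualenv
                    virtualenv --python=python3 mypythonenv
                    source mypythonenv/bin/activate
                    # install the required packages
                    pip install -r requirements.txt
                    # run the analysis (script name is a placeholder)
                    python [insert-your-analysis-script].py
                    # deactivate the environment when done
                    deactivate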

Step 2: software environment and requirements

Congrats!
You've achieved Level 2 of being a Reproducible Workflows Magician!

1. The required packages → requirements.txt ✅
2. The Python version → a virtual environment ✅
3. Instructions for how to use these to successfully run the script → README ✅
4. The data, the code, and all of the above in an accessible location → GitHub ✅

Step 2: software environment and requirements

So we send the update to our colleague:

Step 2: software environment and requirements

"Galaxy brain" update:

Step 2: software environment and requirements

Well done!!

Gifer.com

Step 3: cloudy containers

So you decide to chill out now that everything should be working, right? Unfortunately, our feeling of accomplishment is short-lived, because Professor Important Dude is back with another question...

Step 3: cloudy containers

And then you realise... many things. At the same time:

  • Damn, you forgot about the OpenGL/LibGL requirement
  • You really don't want to have another Zoom call in this pandemic
  • You absolutely don't want to have multiple Zoom calls forever


So you cry out in desperation: Why can't it just work on their machines?!

Step 3: cloudy containers

Reddit

Step 3: cloudy containers

Introducing containers

Step 3: cloudy containers

Introducing containers

Rachael Ainsworth
But we don't want to have to explain how to install yet another tool. Ideally, we can run everything in the cloud...

Step 3: cloudy containers

Introducing Binder

Scriberia

Step 3: cloudy containers

💻 In this hands-on session, we will follow these steps to make our data, code, and computational environment accessible via Binder:

  1. Code and data in a public GitHub repository ✅
  2. Specify environment configuration:
    • environment.yml for conda ❓
    • requirements.txt for Python/PIP ✅
    • apt.txt for Unix-based software ❓
  3. Add any extra tasks to postBuild
  4. Generate a link to a Binder environment ❓
  5. Add the Binder link to your repo's README ❓

Step 3: cloudy containers

  • We already have a requirements.txt file with most packages
  • 💻 Remember the LibGL issue? We can let Binder install that with apt.txt:
  •                 
                        libgl1-mesa-dev
                        xvfb
                    
                
  • 💻 And remember that we have separate instructions for installing BrainStat? We can add those lines of code to postBuild:
  •                 
                        #!/bin/bash
                        # install BrainStat from source
                        git clone https://github.com/MICA-MNI/BrainStat.git
                        cd BrainStat
                        python3 setup.py build
                        python3 setup.py install
                        cd ..
                        # set up a virtual display (Xvfb) so that headless rendering works
                        export DISPLAY=:99.0
                        which Xvfb
                        Xvfb :99 -screen 0 1024x768x24 > /dev/null 2>&1 &
                        sleep 3
                        exec "$@"
                    
                

Step 3: cloudy containers

Status update:

  1. Code and data in a public GitHub repository ✅
  2. Specify environment configuration ✅
    • NOTE: using environment.yml instead of requirements.txt to specify the configuration is also possible; it all depends on which packages are available via which distribution service. For an example of setting up our repository with environment.yml, see the conda-env branch of the repo, and see the Binder documentation for the full range of supported configuration files. A minimal environment.yml is sketched below.
  3. Add any extra tasks to postBuild
  4. Generate a link to a Binder environment ❓
  5. Add the Binder link to your repo's README ❓
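For reference, the contents of a minimal environment.yml for this repository might look roughly like the sketch below (the Python version and channel choice are assumptions; the package list mirrors our requirements.txt):

                    name: ohbm-handson-test
                    channels:
                      - conda-forge
                    dependencies:
                      - python=3.8
                      - matplotlib
                      - numpy
                      - pandas
                      - nibabel
                      - nilearn
                      - scikit-learn
                      - pip
                      - pip:
                        - brainspace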


💻 Now we still have to "Binderize" everything!

Step 3: cloudy containers

💻👀 mybinder.org

Step 3: cloudy containers

Status update:

  1. Code and data in a public GitHub repository ✅
  2. Specify environment configuration ✅
  3. Add any extra tasks to postBuild
  4. Generate a link to a Binder environment ✅
  5. 💻 Add the Binder link to your repo's README
    • While the Binder is building, add the Binder badge and link to your README (see the sketch below)
    • Once the Binder build is successful, it should open a Jupyter environment
    • Now we have a complete environment for reproducing your results IN THE CLOUD!!!
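A quick way to add the badge from the command line is sketched here (replace [insert-your-github-username] with your own handle; the badge URL format is the one that mybinder.org generates for you):

                    #!/bin/bash
                    # append the Binder badge to the README, then commit and push
                    BADGE='[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/[insert-your-github-username]/ohbm-handson-test/HEAD)'
                    echo "$BADGE" >> README.md
                    git add README.md
                    git commit -m "add Binder badge to readme"
                    git push origin main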

Step 3: cloudy containers

We can't hold our excitement, so we send a quick email to notify the colleague:

Step 3: cloudy containers

"Galaxy brain" update:

STEP 4: Interplanetary sci-comm

So, you're busy (rightfully) thinking that you have done a great job of making your work more reproducible, but then ... it looks like we spoke too soon. Our VIP friend has another request...

STEP 4: Interplanetary sci-comm

So firstly, it works for them!!! (Always try and celebrate the wins in academia, however big or small.)

Secondly, let's take a deeper look into this notebook thing.
And what's with Jupiter?

NASA SpacePlace

STEP 4: Interplanetary sci-comm

"The Jupyter Notebook is an open-source web application that allows you to create and share documents that contain live code, equations, visualizations and narrative text. Uses include: data cleaning and transformation, numerical simulation, statistical modeling, data visualization, machine learning, and much more."
Jupyter.org

STEP 4: Interplanetary sci-comm

Jupyter notebooks:

  • Can be installed (amongst other methods) with conda and pip
  •                 
                        conda install -c conda-forge notebook
                        pip install notebook
                    
                
  • Can work with kernels for Python, R, Octave, Julia and many more!
  • Form part of the bigger Jupyter ecosystem (including JupyterLab and JupyterHub)
  • If on a public repository, can be viewed statically with nbviewer
  • If on Binder, can be interacted with and edited in the cloud
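If you are working locally rather than on Binder, starting the notebook server looks like this (assuming the notebook package is installed in your active virtual environment):

                    #!/bin/bash
                    # start the notebook server from the repository's root directory;
                    # it opens in your browser at http://localhost:8888
                    jupyter notebook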

STEP 4: Interplanetary sci-comm

💻 To make our work more accessible and intuitive, we will translate the current Python script into a Jupyter notebook by dividing it into code and Markdown cells, and running these cells to generate the visualisation outputs:

  1. Open the Binder link
  2. Click on New > Python 3
  3. Use the editing functions to add/edit/move cells
    • Use Markdown cells for narrative content (as in the README)
    • Use code cells for Python code
  4. 👀 Have a look at this static example for guidance

STEP 4: Interplanetary sci-comm

Show and tell! 💻👀💬

STEP 4: Interplanetary sci-comm

This is what you have achieved!
Juliette Taka, Logilab and the OpenDreamKit project

STEP 4: Interplanetary sci-comm

"Galaxy brain" update:

STEP 5: Reproducible data management

In many cases we work with large datasets, and multiple analysis pipelines, each with its own set of software and environment requirements, form part of the full research workflow. Managing the full reproducibility of such workflows can be challenging (or impossible) without the right tools, which should allow:
  1. Data version control
  2. Provenance capture


Introducing

STEP 5: Reproducible data management

in brief

  • A command-line tool with Python API
  • Built on top of Git and git-annex
  • Allows...
    • ... version-controlling arbitrarily large content,
    • ... easily sharing and obtaining data (note: no data hosting!),
    • ... (computationally) reproducible data analysis,
    • ... and much more
  • Completely domain-agnostic
  • Available for all major operating systems (Linux, macOS/OSX, Windows): installation instructions
  • Detailed documentation: DataLad Handbook
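To give a flavour of the basic workflow, here is a minimal sketch (paths, commit messages, and the analysis script name are illustrative):

                    #!/bin/bash
                    datalad create my-analysis             # create a new DataLad dataset
                    cd my-analysis
                    cp -R [path-to-paper_data_and_code]/* .
                    datalad save -m "add data and code"    # version-control everything (large files handled by git-annex)
                    # re-run the analysis with provenance capture
                    datalad run -m "run cortical thickness analysis" "python code/[insert-your-analysis-script].py"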

STEP 5: Reproducible data management

DataLad version control helps you get...
From this: To this:
www.phdcomics.com; www.linode.com

STEP 5: Reproducible data management

DataLad SUMMARY - Nesting and Consumption

  1. A DataLad dataset is a folder/directory with files
  2. Subdirectories and their content can be part of the superdataset, or they can be DataLad datasets themselves (nested subdatasets)
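A minimal sketch of nesting in practice (the subdataset URL is just an example of a published DataLad dataset):

                    #!/bin/bash
                    # inside an existing superdataset, register another dataset as a subdataset
                    datalad clone -d . https://github.com/OpenNeuroDatasets/ds000001.git inputs/ds000001
                    # operations can then run recursively across dataset boundaries, e.g.
                    datalad get -r inputs/ds000001/sub-01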


STEP 5: Reproducible data management

DataLad SUMMARY - Nesting and Consumption

  • A modular structure makes individual components (with their respective provenance) reusable.
  • Nesting can flexibly link all components and allows recursive operations across dataset boundaries
  • Read all about this in the chapter on YODA principles

STEP 5: Reproducible data management

DataLad SUMMARY - Computational reproducibility

  • Code may produce different results or fail with different software
  • Datasets can store & share software environments and execute code inside these software containers
  • DataLad extension: datalad-container

datalad containers-run
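A rough sketch of how this looks in practice, assuming Singularity is available to handle the docker:// image URL (the image and names below are only examples):

                    #!/bin/bash
                    pip install datalad-container          # extension providing the containers-* commands
                    # register a container image with the dataset
                    datalad containers-add my-env --url docker://python:3.8-slim
                    # execute the analysis inside the registered container, with provenance capture
                    datalad containers-run -n my-env -m "run analysis in container" "python code/[insert-your-analysis-script].py"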

STEP 5: Reproducible data management

DataLad SUMMARY - Getting started

Read the DataLad handbook
An interactive, hands-on crash-course (free and open source)
Check out or use public DataLad datasets, e.g., from OpenNeuro

                    $ datalad clone ///openneuro/ds000001
                    [INFO   ] Cloning http://datasets.datalad.org/openneuro/ds000001 [1 other candidates] into '/tmp/ds000001'
                    [INFO   ] access to 1 dataset sibling s3-PRIVATE not auto-enabled, enable with:
                    | 		datalad siblings -d "/tmp/ds000001" enable -s s3-PRIVATE
                    install(ok): /tmp/ds000001 (dataset)
                    
                    $ cd ds000001
                    $ ls sub-01/*
                    sub-01/anat:
                    sub-01_inplaneT2.nii.gz  sub-01_T1w.nii.gz
                    
                    sub-01/func:
                    sub-01_task-balloonanalogrisktask_run-01_bold.nii.gz
                    sub-01_task-balloonanalogrisktask_run-01_events.tsv
                    sub-01_task-balloonanalogrisktask_run-02_bold.nii.gz
                    sub-01_task-balloonanalogrisktask_run-02_events.tsv
                    sub-01_task-balloonanalogrisktask_run-03_bold.nii.gz
                    sub-01_task-balloonanalogrisktask_run-03_events.tsv
                 
💻 Walk through the DataLad tutorial on Binder
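As a teaser: right after cloning, the files listed above are only lightweight placeholders; DataLad fetches their actual content on demand (and can drop it again to save space):

                    $ datalad get sub-01/anat/sub-01_T1w.nii.gz
                    $ datalad drop sub-01/anat/sub-01_T1w.nii.gz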

STEP 5: Reproducible data management

"Galaxy brain" update:

Summary of achievements

This is what you can do now!

  1. Set up a `requirements.txt` file that specifies package requirements ✅
  2. Specify and set up a virtual environment to install requirements ✅
  3. Share code, installation, and running instructions via GitHub ✅
  4. Transform your code into a Jupyter notebook ✅
  5. Set up your code repository to run in the cloud with Binder ✅
  6. Understand how containers can play a role in this context ✅
  7. Understand the benefits of data management with DataLad ✅

Other resources for reproducible workflows

Acknowledgements / Sources