diff --git a/.github/workflows/deploy.yml b/.github/workflows/deploy.yml
new file mode 100644
index 0000000..c1194c1
--- /dev/null
+++ b/.github/workflows/deploy.yml
@@ -0,0 +1,43 @@
+name: Build and deploy slides
+
+on:
+  pull_request:
+    branches: [ "main" ]
+  push:
+    branches: [ "main" ]
+
+  # Allows manual run
+  workflow_dispatch:
+
+jobs:
+  # Builds slides with quarto and deploys them to a branch
+  build:
+    runs-on: ubuntu-latest
+
+    steps:
+      - name: Checkout code
+        uses: actions/checkout@v4
+
+      - name: Set up Quarto
+        uses: quarto-dev/quarto-actions/setup@v2
+
+      - name: Render Quarto Project
+        run: |
+          cd src
+          quarto render slides.qmd
+          cd ../
+
+      - name: Test pages build
+        if: github.ref != 'refs/heads/main'
+        uses: JamesIves/github-pages-deploy-action@v4
+        with:
+          branch: test-pages
+          folder: src
+          dry-run: true
+
+      - name: Deploy pages for main
+        if: github.ref == 'refs/heads/main'
+        uses: JamesIves/github-pages-deploy-action@v4
+        with:
+          branch: gh-pages
+          folder: src
diff --git a/.gitignore b/.gitignore
new file mode 100644
index 0000000..725afe5
--- /dev/null
+++ b/.gitignore
@@ -0,0 +1,2 @@
+*.html
+src/slides_files
diff --git a/src/dependencies.qmd b/src/dependencies.qmd
new file mode 100644
index 0000000..a241419
--- /dev/null
+++ b/src/dependencies.qmd
@@ -0,0 +1,43 @@
+# Dependencies
+
+## Dependencies
+
+- All software has dependencies
+- Some are more obvious than others:
+  - Data/input
+  - Packages/libraries e.g. numpy, Eigen
+  - System libraries
+  - Compiler/Interpreter
+- If your code can't run without it, it's a dependency!
+
+## How to discover dependencies
+
+- Some dependencies may be "implicit"
+- For example, you may have a library installed on your system
+- Since the code "just works", you may not be aware of the dependency
+- To find these, try running on a different system (or multiple) and see what breaks
+
+## How to declare dependencies
+
+- List them in a tracked file in the repository
+  - e.g. add a "Dependencies" section to your README.md
+- Specify:
+  - Versions of each dependency e.g. numpy 2.3.9
+  - Where/how to acquire the dependency
+
+## Dependency metadata
+
+- There are automated ways of resolving dependencies
+- Usually language/tool specific
+- Some tools automatically update dependency metadata
+  - e.g. Rust's cargo, Julia's Pkg, uv for Python
+  - Project file: Dependencies and compatible versions
+  - Lock file: Records the exact version (plus other metadata e.g. source) of *every*
+    dependency you are using
+  - Important to track both - lock files record the exact environment you use
+
+## System dependencies
+
+- Conda
+- Docker
+- Nix/Guix
diff --git a/src/documentation.qmd b/src/documentation.qmd
new file mode 100644
index 0000000..c3bcfcf
--- /dev/null
+++ b/src/documentation.qmd
@@ -0,0 +1,30 @@
+# Documentation
+
+## Documentation
+
+- Not all information can be conveyed in code
+- We need to tell other people how to use our projects
+- And sometimes ourselves!
+- Documentation covers anything outside of the code/metadata
+
+## README
+
+- Markdown file at the project root
+- Should contain:
+  - Description of project
+  - Dependencies
+  - Instructions on building/running
+
+## Comments
+
+- Comments in code are another form of documentation
+- Comments should:
+  - Explain *why* the code is doing something
+  - Give context that is external to the scope
+
+## Generating Docs
+
+- Use tools that generate docs from source code
+- Single source of truth
+- Comments/Docstrings embedded in code
+- Reduce separation between code and docs
diff --git a/src/fair_principles.qmd b/src/fair_principles.qmd
new file mode 100644
index 0000000..70754eb
--- /dev/null
+++ b/src/fair_principles.qmd
@@ -0,0 +1,24 @@
+# FAIR Principles
+
+---
+
+- Findable: Software, and its metadata, are easy for humans and machines to
+  find.
+
+---
+
+- Accessible: Software, and its metadata, are retrievable via standardised
+  protocols.
+
+---
+
+- Interoperable: Software interoperates with other software by exchanging
+  data and/or metadata, and/or through interaction via application
+  programming interfaces (APIs), described through standards.
+
+---
+
+- Reusable: Software is both usable (can be executed) and reusable (can be
+  understood, modified, built upon, or incorporated into other software).
+
+See: https://www.nature.com/articles/s41597-022-01710-x
diff --git a/src/introduction.qmd b/src/introduction.qmd
new file mode 100644
index 0000000..0f2bd65
--- /dev/null
+++ b/src/introduction.qmd
@@ -0,0 +1,29 @@
+## What is reproducibility?
+
+For this course we will take the following definition:
+
+- *Reproducible*:
+  Performing the same analysis on the same data produces the same results
+
+## Why is reproducibility important?
+
+In the context of scientific computing/analysis, we want to be able to:
+
+- Verify our own results
+- Verify the results of others
+
+By making our work reproducible, we ensure that both of these things are not just
+possible, but straightforward.
+
+## Additional benefits
+
+- Safely implement changes
+- Can perform workflow on different inputs more easily
+- Simpler for new team members to get started
+- Better collaboration
+
+## Where do we go from here...
+
+Throughout the rest of this session, we will walk through the steps that we can
+take to go from an ad hoc collection of scripts to a reproducible scientific
+workflow!
diff --git a/src/introduction_walkthrough.qmd b/src/introduction_walkthrough.qmd
new file mode 100644
index 0000000..7208b3a
--- /dev/null
+++ b/src/introduction_walkthrough.qmd
@@ -0,0 +1,7 @@
+## A likely scenario
+
+- You have just joined a new research group as a Student/Researcher/PI.
+- The group use a custom pipeline/setup to perform their data analysis/simulations.
+- You try to get the setup working on your local system/a new HPC system and...
+  *It doesn't work!*
+
diff --git a/src/slides.qmd b/src/slides.qmd
new file mode 100644
index 0000000..ecfa7f9
--- /dev/null
+++ b/src/slides.qmd
@@ -0,0 +1,57 @@
+---
+title: Reproducibility in Scientific Computing
+
+format:
+  revealjs:
+    theme: night
+    logo: https://iccs.cam.ac.uk/sites/default/files/iccs_ucam_combined_reverse_colour.png
+
+authors:
+  - name: Jack Franklin
+  - name: Marion Weinzierl
+---
+
+{{< include introduction.qmd >}}
+
+{{< include version_control.qmd >}}
+
+{{< include dependencies.qmd >}}
+
+{{< include testing.qmd >}}
+
+{{< include documentation.qmd >}}
+
+{{< include fair_principles.qmd >}}
+
+# Conclusion/Outlook
+
+## Reproducibility is important
+
+Primary benefits:
+- Confidence in scientific results
+- Peer review/cross analysis
+
+Additional benefits:
+- Allows for code reuse
+- Better collaboration
+
+## Ingredients for reproducibility:
+
+- Version Control
+- Dependency Metadata
+- Public Accessibility
+
+## Even better if
+
+- Testing for:
+  - Verification
+  - Regression checks
+
+## Make it easy!
+
+- When starting from scratch, much easier to implement these as you go
+- For a large project:
+  - Add to VC
+  - Document dependencies
+  - Follow best practice for new code
+  - Implement small improvements whenever modifying existing code
diff --git a/src/testing.qmd b/src/testing.qmd
new file mode 100644
index 0000000..e5f30e6
--- /dev/null
+++ b/src/testing.qmd
@@ -0,0 +1,45 @@
+# Testing
+
+## Testing
+
+- Important to test code
+- Check that code does what it should
+- Test on inputs outside of the "normal" range
+- Verify that results of code do not change
+- Can also be used to check dependency changes
+
+## Unit tests
+
+- Test the smallest logical unit of the code
+- Ensure each component works as intended
+- Test functions for known results
+- Compare to previously produced results
+
+## Integration tests
+
+- Test that components work together
+- Try to have a range of complexity of tests
+- Can use previous results to validate model
+- Ensure no regression of results
+
+## Adding tests to a project
+
+- Often we inherit large projects with no unit tests
+- How do we improve test coverage in this case?
+
+## Adding tests to a project
+
+ 1. Create integration tests - use previous results or create "golden outputs"
+ 2. Identify and extract parts of the code which can be split apart
+ 3. Create unit tests for the new functions
+ 4. Run the integration tests to ensure results have not changed
+ 5. Repeat 2-4 until all code has unit tests
+
+- Whenever you change a part of the code, try to use this method
+- Code coverage will slowly improve, with less extra work
+
+## Automating tests (CI etc)
+
+- Automate testing to ensure tests pass for every commit
+- Also useful for tests that can take a long time/need lots of resources
+- If hosting code on e.g. GitHub, GitLab etc, can use Continuous Integration (CI)
diff --git a/src/version_control.qmd b/src/version_control.qmd
new file mode 100644
index 0000000..f8ac915
--- /dev/null
+++ b/src/version_control.qmd
@@ -0,0 +1,51 @@
+# Version Control
+
+## Version Control
+
+- The first thing we should do is move our project into version control (VC)
+- This way we never lose the original state of the project
+- We can then try things without worrying about breaking anything!
+- This will also benefit any later development, so the sooner the better
+
+## What to add to VC
+
+- DON'T do this:
+``` bash
+git add .
+```
+
+- Our repository should only contain:
+  - Code/scripts
+  - Documentation
+  - Metadata
+  - i.e. just text files
+
+There will be some exceptions to this rule, but for the vast majority of cases
+it will be true.
+
+## What to add to VC
+
+- Large datafiles should be hosted separately (e.g. on Zenodo)
+- External dependencies should be declared
+  - e.g. link to Zenodo dataset in docs and code
+- Use .gitignore to automatically ignore any unwanted files
+  - e.g. build outputs
+
+## Aside - testing with worktrees
+
+- git worktrees are like "local clones" of a repository
+- Create a worktree:
+``` bash
+git worktree add -b <new-branch> <path>
+```
+- Will make a new directory, with only files that are tracked
+- Can use as a cleanroom to ensure all dependencies are there
+- For more info: `git worktree add --help`
+
+## What to do next?
+
+- The repository can then also be hosted on a remote service (e.g. GitHub, GitLab, Codeberg, Bitbucket)
+- This will make collaboration with other people a lot easier!
+- It will also mean that any work done can be accessed by collaborators
+
+