Reproducible Research

January 17, 2024

Motivation and Rationale

If you are going to take the time to build computational tools for humanistic research, you want others to be able to reproduce what you did. You want others to be able to verify your results, but also to use the tools you built to further their own projects.

Reproducible research may be an unfamiliar concept to many humanists. In the Digital Humanities, its aim is to ensure that other people can use and modify the computational tools you have taken the time to build. It is motivated by several questions:

Will others be able to run my code?
Will others know that my code works and does what I say it does?
Will others be able to adapt, modify and contribute to my code?

There are several factors that contribute to producing reproducible research. They include:

Organizing your project so that others can understand it
Using a version control system to facilitate collaboration on the project
Recording your environment so others can run your code on their machine

Reproducible research

Reproducible research is a moving target in the humanities. While the social sciences have recently reckoned with a so-called “replication crisis,” the humanities are only beginning to think about how their research can be reproducible. As the humanities increasingly work with large data sets and computational tools that exceed what a third-party observer can verify manually, we need to agree upon best practices that will ensure our peers can trust the validity of our results.

As digital humanists, we can learn several lessons from the social sciences and hard sciences to avoid a “replication crisis” in the humanities. A big step towards producing more reproducible research is writing better code that others can reuse to produce the same results.

Reproducibility comes in multiple forms:

Someone else wants to download my data and code to verify my results independently
Someone else wants to use my code on new data to produce their own research
Someone wants to modify my data and code to test edge cases in my results

Some key aspects of reproducible research include:

publication of the raw underlying data used to achieve the results;
clear documentation of the steps taken to achieve the results;
open source release of the code used for data gathering, analysis, and other steps;
separating code based on function (i.e. modular code development) so others can interpret and reuse your code;
documentation of key decisions and changes in the project (i.e. version control);
means to ensure the tools do what they are supposed to (i.e. tests, code review).

In addition, reproducible research follows a set of community-defined best practices to ensure that your project can be understood and used by others. These practices may evolve over time, but the lessons here offer basic principles that can guide the development of reproducible research in the Digital Humanities.

Resources

Rik Peels, “Replicability and replication in the humanities.” Research Integrity and Peer Review 4 (2019). https://doi.org/10.1126/science.aac4716.
Joseph Flanagan, “Reproducible research: Strategies, tools, and workflows.” Studies in Variation, Contacts and Change in English, eds. Turo Hiltunen, Joe McVeigh, Tanja Säily (Helsinki: Research Unit for Variation, Contacts and Change in English, 2017). https://varieng.helsinki.fi/series/volumes/19/flanagan/

Organizing your project

Your project will involve a number of components. These may include raw data, processed data, documentation, source code, and code for dependencies. You may be working as part of a team, or you may be writing code that others will use in the future. In either case, your project should be organized in a consistent, predictable way, so that someone unfamiliar with it can quickly find what they need, whether that is a dataset or a function in your code.

File structures and directories

This lesson from CodeRefinery has some recommendations about how you might choose to organize your project. The important points for humanist researchers are the following:

Each project should have its own folder
The structure of your project should be consistent. Another researcher should be able to understand your file structure without you explaining it to them.
Your project should have a README file that explains how to run the project on a different machine.
Different parts of the project should be in different files and/or directories.

Your project organization might look something like this:

impressive-project/
├── data/
│   ├── README.md
│   ├── metadata.csv
│   └── texts/
│       ├── book_a.txt
│       ├── book_b.txt
│       └── ...
├── code/
├── tests/
├── doc/
│   ├── index.rst
│   └── ...
├── .gitignore
├── CITATION.cff
├── CONTRIBUTING.md
├── dependencies.md
├── LICENSE
└── README.md

By using a clear organizational structure like this one, someone unfamiliar with your project will quickly be able to understand what your project is doing and how they might go about reproducing your work or building upon it for their own purposes.
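A layout like this can also be created by a short script, so that every new project starts from the same skeleton. The following is only a minimal sketch: the function name `scaffold_project` and the two lists are hypothetical, and the file names simply mirror the example tree above.

```python
from pathlib import Path

# Hypothetical layout mirroring the example tree; adjust the names to your project.
DIRECTORIES = ["data/texts", "code", "tests", "doc"]
FILES = [
    "data/README.md",
    "data/metadata.csv",
    "doc/index.rst",
    ".gitignore",
    "CITATION.cff",
    "CONTRIBUTING.md",
    "dependencies.md",
    "LICENSE",
    "README.md",
]


def scaffold_project(root):
    """Create the empty folders and placeholder files for a new project."""
    root = Path(root)
    for directory in DIRECTORIES:
        # parents=True also creates intermediate folders such as data/
        (root / directory).mkdir(parents=True, exist_ok=True)
    for name in FILES:
        # touch() creates an empty file; the contents of an existing file are left unchanged
        (root / name).touch(exist_ok=True)
    return root


if __name__ == "__main__":
    scaffold_project("impressive-project")
```

Running the script twice is harmless, since existing folders and files are left in place.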

Workflow

To get the results from your project, you will run multiple steps. These steps may take a haphazard or intuitive form that is difficult for others to understand:

[Image: xkcd comic on helping older adults troubleshoot their computer problems]

This CodeRefinery lesson provides several strategies for recording your workflow. Here are some important considerations for humanists.

Manual workflow

A manual workflow might mean writing several discrete functions to perform each task. You may then run each function as needed from the command line or as part of a script. This may be useful for tasks you do not need to run very often. For instance, if you were trying to find the most frequently occurring words in a single book, this workflow might make sense. It may also make sense in more experimental projects where you want to spend time analyzing the output at each stage.
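As an illustration of the single-book case, a manual workflow could consist of a few small functions that you call one at a time. This is a sketch under assumptions: the function names are hypothetical, and the tokenizer is deliberately crude (it treats any run of letters, digits, or underscores as a word).

```python
import re
from collections import Counter


def read_text(path):
    """Load one plain-text book from disk."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


def tokenize(text):
    """Lowercase the text and split it into rough word tokens."""
    return re.findall(r"\w+", text.lower())


def most_common_words(tokens, n=10):
    """Return the n most frequent tokens together with their counts."""
    return Counter(tokens).most_common(n)
```

In an interactive session you might call `read_text("data/texts/book_a.txt")`, look at the result, pass it to `tokenize`, and only then decide how many words to request from `most_common_words`. Being able to inspect each intermediate result is the main appeal of working this way.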

However, for large projects, this may become unwieldy. Imagine trying to analyze 500 books. Or imagine trying to OCR, clean, analyze, and graph an entire run of an 18th-century newspaper. In these cases, a manual workflow would take too much time and be too prone to error.

Scripted workflow

When dealing with large amounts of data, tasks you need to run frequently, or tasks you want to automate for others to run, it makes sense to turn to a scripted solution. The CodeRefinery lesson on workflow management provides examples of using Bash or Snakemake to automate your workflow. Here is the same example reproduced in Python:

#to-do: write out this example
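As a rough idea of what a scripted workflow can look like (this is not a reproduction of the CodeRefinery example), the sketch below processes every text file in a folder and writes a single results file. The function names, the `data/texts` input folder, and the `results/top_words.csv` output path are all assumptions made for illustration.

```python
import csv
import re
from collections import Counter
from pathlib import Path


def count_words(path):
    """Count rough word frequencies in one plain-text file."""
    text = Path(path).read_text(encoding="utf-8").lower()
    return Counter(re.findall(r"\w+", text))


def run_workflow(text_dir, output_csv, top_n=10):
    """Process every .txt file in text_dir and write one CSV of top words."""
    output_csv = Path(output_csv)
    output_csv.parent.mkdir(parents=True, exist_ok=True)
    with output_csv.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["book", "word", "count"])
        # sorted() keeps the processing order stable from run to run
        for path in sorted(Path(text_dir).glob("*.txt")):
            for word, count in count_words(path).most_common(top_n):
                writer.writerow([path.name, word, count])


if __name__ == "__main__":
    # Hypothetical input and output locations
    run_workflow("data/texts", "results/top_words.csv")
```

Because the whole pipeline runs from one entry point, the same command works for five books or five hundred, and someone else can rerun it without knowing the order of the individual steps.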

Recording your workflow

Regardless of how you organize your workflow, you should provide clear documentation for others to reproduce your steps.

Your project should include a README that provides clear instructions for running each step in your workflow. The README should also contain information about installation and the type of data required for the project.

You may additionally want to provide a graphical representation of your project’s workflow. This can help others see how the pieces fit together, and may also explain what each function in your workflow does.
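One lightweight way to produce such a diagram is to describe the steps in code and emit a text description that a graph tool can render. The sketch below writes Graphviz DOT text; the step names and the function `workflow_to_dot` are hypothetical, and it assumes your workflow can be described as simple "this step feeds that step" pairs.

```python
# Hypothetical workflow steps; each pair is (step, the step that follows it).
STEPS = [
    ("OCR page images", "Clean text"),
    ("Clean text", "Count words"),
    ("Count words", "Plot results"),
]


def workflow_to_dot(steps, name="workflow"):
    """Render (source, target) pairs as a Graphviz DOT digraph."""
    # rankdir=LR lays the steps out from left to right
    lines = [f"digraph {name} {{", "    rankdir=LR;"]
    for source, target in steps:
        lines.append(f'    "{source}" -> "{target}";')
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(workflow_to_dot(STEPS))
```

If Graphviz is installed, saving the printed text as `workflow.dot` and running `dot -Tpng workflow.dot -o workflow.png` should produce an image you can include in your README.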

TO-DO: Find suitable visualization of a workflow.

Additional resources

This list is still under review.

Project organization:
- Python package structure
- Basics of Packaging Python Programs
Workflow management:
- Carpentries lesson
- Singularity Intro Docs
Common dependency managers intro lessons and/or docs:
