Thinking About a Reproducible Research Topic

Over the past 4 years, my lab has made a concerted effort to make all of the code associated with a paper publicly available in our lab’s GitHub project directory. This started because I was paranoid that people wouldn’t believe our analysis. I rationalized that you may not like how we did an analysis, but you should at least be able to see where the results came from. Our current process isn’t perfect, but it’s far better than what we were doing 5 years ago and it will probably be better in another 5 years. We’ve done a lot of learning over the years and it hasn’t always been easy. I have noticed that more people are reproducible research curious. I also suspect that many people don’t know where to look for other examples of reproducible papers and are overwhelmed about where to start. One of the newer features in GitHub is the ability to create topics to link repositories. I would like to propose a common topic to mark the repositories that scientists are using to make their papers more reproducible, and to see if there is any interest in adopting it. My proposal is to use reproducible-paper.

From my own experience, I know that our process of generating reproducible papers has evolved with each paper. The initial effort was an IPython notebook that demonstrated what we did. Because the dataset was so huge, it wasn’t practical (or I didn’t know how) to execute the code directly on our computer cluster. Then we started putting all of our R code into our R markdown documents and writing our papers from scratch as R markdown (e.g. Baxter et al.). This got painful fast because reformatting even a small part of the text took a long time while R re-ran all of the analytical code. Now, our process has evolved to a point where we write our manuscripts as R markdown documents, but all of the computationally heavy lifting is done in separate scripts that are automated and controlled using Make (e.g. Sze et al.). Because R markdown is the last step in generating a manuscript, any number you see in one of our papers has code behind it that shows how the number was calculated and where the underlying data came from. The same is true for all of our tables and figures.
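
To give a feel for why moving the heavy lifting out of the manuscript helps, here is a minimal Python sketch of the timestamp rule that Make applies: a step is re-run only when its output is missing or older than one of its inputs. This is an illustration of the idea, not our actual pipeline. The function name needs_rebuild and the file names in the usage note are hypothetical.

```python
import os


def needs_rebuild(target, dependencies):
    """Return True if `target` is missing or older than any dependency.

    This mirrors the basic timestamp comparison Make uses to decide
    whether a rule has to be re-run. `target` and `dependencies` are
    file paths chosen by the caller (hypothetical in this sketch).
    """
    if not os.path.exists(target):
        # No output yet, so the step has to run at least once.
        return True
    target_mtime = os.path.getmtime(target)
    # Re-run if any input was modified after the output was written.
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)
```

For example, needs_rebuild("results.tsv", ["analysis.R", "raw_data.tsv"]) would return False after an edit that only touched the manuscript text, so the expensive analysis step would be skipped and only the final document would be rendered again.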

Because everything is done using version control (i.e. git), it is straightforward to make all of that code available in a GitHub repository. There are multiple reasons for using version control, but perhaps the most obvious benefit is that we can post any of our repositories to GitHub for free and others can come and see how we’ve done our analysis. If it helps, you can think of a repository as being like an electronic notebook that you can copy to your own computer to re-run another person’s analysis. The feedback we’ve gotten since making our repositories public has been very positive. People have asked why we did X and thanked us for posting the code to do Y because they want to do Y’. Reviewers have commented that they looked at our repository to investigate a point that wasn’t clear in the text. Most important for me as a PI is that I am in a better position to oversee the analyses that my trainees are doing. I can directly inspect their code, run it myself, make comments, and encourage them to try different things. Because we are academics, GitHub currently allows us to have as many repositories as we want in our group account, and they can be private (i.e. only we can see them) or public (i.e. anyone can see them). GitHub is also nice because we can make use of its issue trackers, webpage hosting, and other features.

As I said earlier, one of the newer features of GitHub is that you can add a topic to your repository so others can find repositories with the same topic. A challenge that I know we faced, and suspect others face as well, is that it is hard to find people who are trying to automate their published analyses and make them reproducible. I would greatly benefit from seeing how others are doing their analyses. For example, we haven’t waded into using containers and I am not entirely clear how we would. Also, because we are limited to including files that are no more than 100 MB in a repository, we have primarily included code and not data in these repositories. I would love to see how others are incorporating new tools into their work and how they are getting around similar problems. Also, I have found that I code and annotate my code and repository differently now that I think that someone may someday look at what I’ve done. If a topic tag brings more eyeballs to my work, then maybe we’ll do a better job of coding and explaining the decisions we made in designing our analyses. More importantly, I think that using this tool to aggregate repositories will be instructive to beginners who are reproducible research curious and will help lift the overall quality and reproducibility of data analysis.
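
If you would rather look up tagged repositories from a script than from the GitHub website, something like the following Python sketch should work. It assumes GitHub's public repository search endpoint and its topic: search qualifier. The function names build_topic_query and find_repositories are my own, and unauthenticated requests are rate limited, so treat this as a starting point rather than a finished tool.

```python
import json
import urllib.parse
import urllib.request

# Public GitHub REST endpoint for searching repositories.
SEARCH_URL = "https://api.github.com/search/repositories"


def build_topic_query(topic):
    """Build the search URL for repositories tagged with `topic`."""
    query = urllib.parse.urlencode({"q": f"topic:{topic}"})
    return f"{SEARCH_URL}?{query}"


def find_repositories(topic):
    """Return "owner/name" strings for the first page of matching repositories.

    Makes one unauthenticated network request; only the first page of
    results is read in this sketch.
    """
    request = urllib.request.Request(
        build_topic_query(topic),
        headers={"Accept": "application/vnd.github+json"},
    )
    with urllib.request.urlopen(request) as response:
        payload = json.load(response)
    return [item["full_name"] for item in payload["items"]]
```

Calling find_repositories("reproducible-paper") would list whichever public repositories currently carry the proposed topic.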

What do you think? Any interest in a concerted effort to add a common topic to the repositories where you are trying to make your research reproducible? Does reproducible-paper work for you? I’ve gone ahead and added the topic tag to my lab’s repositories so you can see what this would look like for yourself.