Best Practices for Reproducible Research
Essential workflows for maintaining high-quality bioinformatics analyses
Why Best Practices Matter
Your most important collaborator is your future self 6 months from now.
Document everything as if you’ll need to re-run the analysis later (you will), or explain it to someone else (you will).
Maintaining good practices from day one:
- Saves time in the long run
- Prevents errors and lost work
- Enables reproducibility
- Facilitates collaboration
- Makes your research more impactful
Analysis Catalogue
What is an Analysis Catalogue?
An analysis catalogue is a structured record of every analysis you perform. It’s your lab notebook for bioinformatics - tracking what you did, when, why, and what you found.
Why Keep One?
- Remember what you did: Months later, you’ll need to recall exact parameters
- Avoid duplication: Quickly check if you’ve already run similar analyses
- Share with collaborators: Clear documentation helps others understand your work
- Publication requirements: Journals require detailed methods - your catalogue has it all
- Troubleshoot issues: When something goes wrong, trace back your steps
Analysis Catalogue Template
Below is a template you can adapt. Keep it in a spreadsheet (Excel, Google Sheets) or as a CSV file in your project directory.
Copy this table structure to create your own analysis catalogue:
| ID | Name | Description | Link | Data | Status | Notes | Report/Output |
|---|---|---|---|---|---|---|---|
| anAD001 | GWAS APOE variants | Test association between APOE variants and AD age of onset in cohort | /scripts/gwas/run_gwas_apoe.sh | /data/cohort1/genotypes_QC.vcf (n=5,432: 2,100 cases, 3,332 controls) | Complete | Removed 3 outlier samples, see QC log | anAD001_report |
| anAD002 | RNA-seq differential expression | Identify differentially expressed genes in AD vs control brain tissue | /scripts/rnaseq/deseq2_analysis.R | /data/rnaseq/batch2/fastq/ (n=40: 20 AD, 20 controls, frontal cortex) | In Progress | Waiting for additional samples | - |
| anAD003 | Variant calling WGS | Call germline variants from 30x WGS data, focus on dementia genes | /workflows/wgs_pipeline.snakefile | /data/wgs/batch1/ (n=100, 30x coverage, Illumina NovaSeq) | On Hold | Need more compute resources | - |
| anFTD001 | C9orf72 repeat expansion | Screen for C9orf72 hexanucleotide repeat expansions in FTD cohort | /scripts/str/c9orf72_screen.sh | /data/pcr_data/c9_screening.csv (n=250 FTD patients) | Complete | 15 patients positive for expansion | anFTD001_summary.xlsx |
Field Descriptions:
- ID: Short unique identifier with project prefix (e.g., anAD001, anFTD002, anLBD001)
- Name: Short descriptive name of the analysis
- Description: Brief explanation of what you’re analyzing and why
- Link: Path to main script/workflow, or URL to documentation
- Data: Path to input data + brief description (sample size, data type, source)
- Status: Current state (Complete, In Progress, On Hold, Failed, Planned)
- Notes: Important details, issues encountered, decisions made
- Report/Output: Link to results folder, report file, or final output location
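If you keep the catalogue as a CSV in the project directory (as suggested above), starting one is a one-off command. A minimal sketch; the file name `analysis_catalogue.csv` and the appended entry are illustrative, not prescribed:

```shell
# Create the catalogue with the column header (example file name)
cat > analysis_catalogue.csv <<'EOF'
ID,Name,Description,Link,Data,Status,Notes,Report/Output
EOF

# Append a row as each new analysis is planned (fields mirror the template above)
echo 'anAD005,GWAS batch 3 QC,Sample-level QC before association testing,/scripts/qc/run_qc.sh,/data/batch3/ (n=800),Planned,,' >> analysis_catalogue.csv
```

Because it is plain text, a CSV catalogue also diffs cleanly under version control alongside the rest of the project.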
Example Filled Catalogue Entry
Here’s what a completed entry looks like:
ID: anAD004
Name: WGS Variant Calling - Batch 1
Description: Called germline variants from WGS data (30x coverage) in 100
dementia patients to identify rare pathogenic variants in
known dementia genes (PSEN1, APP, MAPT, GRN, etc.)
Link: /scripts/variant_calling/wgs_pipeline_v2.3.snakefile
Data: /data/wgs/raw_fastq/batch1/ (n=100: 75 AD, 15 FTD, 10 LBD;
Illumina NovaSeq 6000, 30x coverage, paired-end 150bp)
Status: Complete
Notes: Sample WGS_045 failed QC (coverage 18x), excluded from analysis.
Re-ran BQSR after initial Ti/Tv ratio was low (1.8). Used hg38
reference, GATK v4.2.6, filtered to 45 dementia genes.
Final n=99 samples passed QC.
Report/Output: /results/wgs/batch1_variants/anAD004_final_report.html
Key finding: 12 rare pathogenic/likely pathogenic variants
identified, 8 previously reported in ClinVar
What to include in the Data column:
- File path or location: Where the input data is stored
- Sample size: Number of individuals/samples (n=X)
- Breakdown: Cases vs controls, disease subtypes
- Data type: WGS, WES, RNA-seq, genotyping array, etc.
- Sequencing details: Platform, coverage, read length (if relevant)
- Tissue/source: Brain region, blood, etc.
Examples of good Data entries:
- `/data/genotypes/cohort_A.bed` (n=1,200: 600 AD cases, 600 controls; Illumina GSAMD v3 array)
- `/data/rnaseq/hippocampus/batch3/` (n=50: 25 AD, 25 controls; 50M reads/sample, PE100)
- `/data/methylation/frontal_cortex.csv` (n=80, Illumina EPIC array, 850K CpGs)
- UKDRI WGS Cohort, Batch 2: `/project/wgs_b2/` (n=200, 30x WGS, NovaSeq)
- Public data: GEO:GSE12345 (n=36, Affymetrix microarray, temporal cortex)
Tips for Maintaining Your Catalogue
- Update in real-time: Fill in fields as you go, not after the fact
- Be specific: Future you won’t remember vague descriptions
- Version everything: Note exact software versions
- Link to everything: Include full paths to data, scripts, results
- Note failures too: Document what didn’t work and why
- Regular reviews: check statuses and update priorities weekly
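For the weekly review, a plain-text catalogue can be summarised straight from the shell. This sketch builds a throwaway example file first so the command is self-contained; note the `cut` assumes no commas inside fields:

```shell
# Tiny illustrative catalogue (fields elided with "..")
cat > analysis_catalogue.csv <<'EOF'
ID,Name,Description,Link,Data,Status,Notes,Report/Output
anAD001,GWAS APOE variants,..,..,..,Complete,..,..
anAD002,RNA-seq DE,..,..,..,In Progress,..,..
anAD003,WGS calling,..,..,..,On Hold,..,..
EOF

# Weekly review: count analyses per status (column 6), skipping the header
cut -d',' -f6 analysis_catalogue.csv | tail -n +2 | sort | uniq -c
```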
Project Folder Structure
Detailed guidance on folder organisation now lives on Reproducible Project Structure.
The short version is:
- use the same layout across projects
- keep raw and processed data separate
- keep scripts separate from results
- document where data lives and how outputs are generated
- make the repository understandable to someone joining later
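As a sketch of those principles, a minimal skeleton can be created in one go; the directory names below are illustrative, not the prescribed layout:

```shell
# Same layout for every project: raw and processed data kept apart,
# scripts kept apart from results, a README documenting the rest
mkdir -p my_project/data/raw my_project/data/processed \
         my_project/scripts my_project/results my_project/docs
touch my_project/README.md

# Protect raw data from accidental modification
chmod -R a-w my_project/data/raw
```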
For a new project, the best sequence is:
- Start with Start Here
- Apply the structure on Reproducible Project Structure
- Keep the repository documented and version controlled from the first commit
Documentation Best Practices
README Files
Every project needs a good README. Treat it as the front door to the project: it should help a new person understand what the repository is, where to start, and where to find the important pieces.
Essential sections:
- Project overview and goals
- Data sources and description
- How to run the analysis
- Software requirements
- Key results/conclusions
- Authors and dates
Example README template:
# Dementia GWAS Analysis - APOE Region
## Overview
Genome-wide association study focused on APOE region in 5,000 AD cases
and 10,000 controls to identify rare protective variants.
## Data
- **Source:** UKDRI Dementia Cohort
- **Platform:** Illumina GSAMD v3 Array
- **QC:** Standard QC applied (call rate >95%, HWE p>1e-6, MAF >0.01)
## Analysis Pipeline
```bash
# 1. Quality control
plink --bfile raw_data --geno 0.05 --mind 0.05 --hwe 1e-6 --maf 0.01 --make-bed --out qc_pass
# 2. Association testing
plink --bfile qc_pass --logistic --covar age_sex.txt --out gwas_results
# 3. Annotation
./scripts/annotate_variants.R gwas_results.assoc.logistic
```
## Key Results
- Genome-wide significant signal at rs429358 (APOE ε4, p=2.3e-125)
- Novel suggestive signal at rs75627662 (p=3.2e-7)
- See `results/figures/manhattan_plot.png`
## Requirements
See `environment.yml` for full list
- PLINK v1.9
- R v4.2 with ggplot2, data.table
## Author
Your Name (your.email@ukdri.ac.uk)
Date: 2025-10-16
Inline Code Comments
Good commenting practices:
# BAD: Obvious comment
x = x + 1 # increment x
# GOOD: Explains WHY
x = x + 1 # adjust for 0-based indexing
# BAD: No context
threshold = 5e-8
# GOOD: Clear reasoning
threshold = 5e-8 # genome-wide significance threshold for GWAS
# EXCELLENT: Document complex logic
# Filter variants: keep only those with:
# - MAF > 0.01 (common variants for this analysis)
# - INFO score > 0.8 (well-imputed)
# - HWE p > 1e-6 (not deviated in controls)
filtered_vars = variants[
(variants['MAF'] > 0.01) &
(variants['INFO'] > 0.8) &
(variants['HWE_P'] > 1e-6)
]
Version Control with Git
Even if working alone, version control is essential for tracking changes, reverting mistakes, and understanding your analysis evolution.
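Before the first commit, it also helps to tell Git what not to track: version the code and documentation, not the data they consume or the results they produce. A hypothetical `.gitignore` along those lines; the paths and extensions are examples to adapt:

```shell
# Hypothetical .gitignore: keep large data and generated outputs out of git
# (directory names and file extensions here are examples, not a standard)
cat > .gitignore <<'EOF'
data/raw/
data/processed/
results/
*.bam
*.vcf.gz
EOF
```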
Basic Git workflow:
# Initialize repository
cd my_project
git init
# Add files
git add scripts/analysis.py
git add README.md
# Commit with meaningful message
git commit -m "Add initial variant filtering script"
# Continue working...
# Make changes, then:
git add scripts/analysis.py
git commit -m "Fix bug in MAF filtering, now correctly filters < 0.01"
# View history
git log
# Create branch for experimental analysis
git branch experimental_method
git checkout experimental_method
# ... work on experimental method ...
# Merge back if successful
git checkout main
git merge experimental_method
Commit message best practices:
# BAD
git commit -m "fixed stuff"
git commit -m "update"
# GOOD
git commit -m "Fix MAF calculation bug causing incorrect filtering"
git commit -m "Add PCA plot generation to QC script"
git commit -m "Update README with new analysis pipeline steps"
Analysis Reproducibility Checklist
Before considering an analysis complete, verify:
- The catalogue entry is up to date (status, notes, links to outputs)
- The README describes the data, how to run the analysis, and software requirements
- Exact software versions are recorded
- All scripts are committed to version control
- Every tool and resource used is cited
The New Starter Test
Before you consider a repository ready to hand over, ask:
Could someone who does not know this project find the right place to start quickly, with no help from me?
If the answer is no, the repository usually needs one or more of:
- a README that says what this is and how to start
- a clearer folder structure
- better onboarding or project notes
Referencing & Citations
Why Proper Citation Matters
- Give credit to tool developers
- Allow others to reproduce your work
- Required by publishers
- Helps track method usage/impact
What to Cite
Always cite:
- Analysis software and tools
- Reference genomes and databases
- Published methods/algorithms
- R packages and Python libraries
- Workflow managers
- Pre-processing pipelines
How to Find Citations
For bioinformatics tools:
- Check tool’s documentation or website
- Look in GitHub README
- Search on PubMed for the tool name
- Check tool's `--cite` or `--citation` flag
- Use Bioconda:
conda search <tool> --info
Example:
# Many tools have citation info
samtools --version # includes citation
gatk --version # includes citation
Citation Management
Use reference managers:
- Zotero (free, open-source)
- Mendeley (free)
- EndNote (institutional license often available)
- Papers (Mac)
Bioinformatics-specific tip: Create a collection/folder specifically for software/tools cited in your analyses.
Example Citations Section
In your methods:
## Software and Tools
Quality control was performed using FastQC v0.11.9 (Andrews, 2010)
and MultiQC v1.12 (Ewels et al., 2016). Reads were aligned to the
GRCh38 reference genome using BWA-MEM v0.7.17 (Li & Durbin, 2009).
Variant calling was performed using GATK v4.2.6 (McKenna et al., 2010),
and variants were annotated with VEP v106 (McLaren et al., 2016).
Statistical analyses were conducted in R v4.2.0 (R Core Team, 2022)
using the following packages: ggplot2 v3.3.6 (Wickham, 2016),
data.table v1.14.2 (Dowle & Srinivasan, 2021).
References:
Andrews, S. (2010). FastQC: a quality control tool for high throughput
sequence data.
Ewels, P., Magnusson, M., Lundin, S., & Käller, M. (2016). MultiQC:
summarize analysis results for multiple tools and samples in a single
report. Bioinformatics, 32(19), 3047-3048.
Li, H., & Durbin, R. (2009). Fast and accurate short read alignment
with Burrows-Wheeler transform. Bioinformatics, 25(14), 1754-1760.
[etc.]
Quick Reference: Common Tool Citations
Alignment:
- BWA: Li & Durbin (2009, 2010)
- Bowtie2: Langmead & Salzberg (2012)
- STAR: Dobin et al. (2013)

Variant Calling:
- GATK: McKenna et al. (2010), Van der Auwera et al. (2013)
- FreeBayes: Garrison & Marth (2012)
- BCFtools: Danecek et al. (2021)

File Processing:
- SAMtools: Li et al. (2009), Danecek et al. (2021)
- BEDtools: Quinlan & Hall (2010)
- Picard: Broad Institute

Annotation:
- VEP: McLaren et al. (2016)
- ANNOVAR: Wang et al. (2010)
- SnpEff: Cingolani et al. (2012)

Workflows:
- Snakemake: Mölder et al. (2021)
- Nextflow: Di Tommaso et al. (2017)
Additional Resources
Templates and Tools
- Cookiecutter Data Science: Project template generator
- Snakemake profiles: Workflow config templates
- Awesome README: README examples
Reproducibility Guides
- British Ecological Society Guide to Reproducible Code
- rOpenSci Reproducibility Guide
- The Turing Way: Handbook for reproducible data science
Summary
✅ Keep an analysis catalogue tracking every analysis
✅ Use consistent folder structures for all projects
✅ Document everything with READMEs and comments
✅ Version control with Git from day one
✅ Cite all tools and resources properly
✅ Test reproducibility before sharing or publishing
Your future self will thank you!
Next Steps
- Download the analysis catalogue template and start using it today
- Set up your next project with Reproducible Project Structure
- Review Tutorials & Workshops for more on reproducibility
- Explore Tools & Setup for workflow managers