Essential Tools & Setup
Genomics and bioinformatics software you need to know
Overview
This page covers essential tools for genomics analysis, particularly focused on dementia research applications. Each tool includes a description, use cases, and installation instructions.
Many bioinformatics tools can be installed via package managers like conda or mamba, which handle dependencies automatically. We recommend setting up a conda environment for your projects.
Setting Up Your Environment
Conda/Mamba Setup
# Install Miniconda (if not already installed)
wget https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
# Install mamba (faster alternative to conda)
conda install -c conda-forge mamba
# Create a bioinformatics environment
mamba create -n bioinfo python=3.10
conda activate bioinfo
Core Genomics Tools
Sequence Alignment & Mapping
BWA (Burrows-Wheeler Aligner)
What it does: Fast alignment of short DNA sequences to a reference genome. Industry standard for DNA-seq alignment.
Use cases: Whole genome sequencing, exome sequencing, variant calling
Installation:
# Via conda
mamba install -c bioconda bwa
# Or from source
git clone https://github.com/lh3/bwa.git
cd bwa
make
Basic usage:
# Index reference genome
bwa index reference.fasta
# Align paired-end reads
bwa mem reference.fasta read1.fq read2.fq > aligned.sam
Bowtie2
What it does: Ultra-fast alignment tool for sequencing reads, especially good for reads >50bp.
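Use cases: ChIP-seq and ATAC-seq alignment; RNA-seq when aligning to a transcriptome (Bowtie2 is not splice-aware, so use a spliced aligner such as STAR for genome alignment)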
Installation:
mamba install -c bioconda bowtie2
STAR (Spliced Transcripts Alignment to a Reference)
What it does: Fast RNA-seq aligner that handles splicing. Designed specifically for aligning RNA-seq reads to genomes.
Use cases: RNA-seq analysis, identifying splice junctions, gene expression quantification
Installation:
mamba install -c bioconda star
Basic usage:
# Generate genome index
STAR --runMode genomeGenerate --genomeDir genome_index \
--genomeFastaFiles reference.fasta --sjdbGTFfile annotations.gtf
# Align reads
STAR --genomeDir genome_index --readFilesIn read1.fq read2.fq \
--outFileNamePrefix sample_ --outSAMtype BAM SortedByCoordinate
File Processing & Manipulation
SAMtools
What it does: Swiss army knife for manipulating SAM/BAM/CRAM files. Essential for sorting, indexing, filtering, and viewing alignment files.
Use cases: Post-alignment processing, file format conversion, basic statistics, variant calling prep
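Filtering alignments usually relies on the bitwise FLAG field (column 2 of each SAM record), which `samtools view -f` and `-F` select on. As a rough illustration, the sketch below decodes a FLAG value into readable names using plain shell arithmetic. The function name `decode_flag` and the short labels are invented for this example; the bit values follow the SAM specification.

```shell
# Decode a SAM FLAG integer into a comma-separated list of set bits.
# decode_flag and the label names are illustrative, not part of SAMtools.
decode_flag() {
    flag=$1
    out=""
    for pair in 1:paired 2:proper_pair 4:unmapped 8:mate_unmapped \
                16:reverse 32:mate_reverse 64:first_in_pair 128:second_in_pair \
                256:secondary 512:qc_fail 1024:duplicate 2048:supplementary; do
        bit=${pair%%:*}     # text before the colon: the bit value
        name=${pair#*:}     # text after the colon: the label
        if [ $(( flag & bit )) -ne 0 ]; then
            out="${out:+$out,}$name"
        fi
    done
    printf '%s\n' "$out"
}

# A typical FLAG for the first read of a properly paired alignment
decode_flag 99
```

With this in mind, a command such as `samtools view -F 4` reads as "exclude records whose unmapped bit (4) is set".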
Installation:
mamba install -c bioconda samtools
Common commands:
# Convert SAM to BAM
samtools view -b input.sam > output.bam
# Sort BAM file
samtools sort input.bam -o sorted.bam
# Index BAM file
samtools index sorted.bam
# View alignment statistics
samtools flagstat sorted.bam
# Extract specific region
samtools view sorted.bam chr1:1000000-2000000
BCFtools
What it does: Utilities for variant calling and manipulating VCF/BCF files. Part of the SAMtools suite.
Use cases: Variant calling, filtering variants, merging VCF files, format conversion
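To make "filtering variants" concrete: a quality filter such as `QUAL>20` keeps the header lines and any record whose sixth column (QUAL) exceeds the threshold. The sketch below imitates that with awk on a tiny made-up VCF. The file name `example.vcf`, its two records, and the helper `filter_qual` are assumptions for illustration only; for real data, BCFtools handles compressed input, missing values, and far richer expressions.

```shell
# Build a two-record VCF purely for demonstration (values are invented).
printf '##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n' > example.vcf
printf 'chr1\t100\t.\tA\tG\t50\t.\t.\nchr1\t200\t.\tC\tT\t10\t.\t.\n' >> example.vcf

# filter_qual FILE MIN: print header lines plus records with QUAL > MIN.
# Records with a missing QUAL (".") are dropped in this sketch.
filter_qual() {
    awk -F '\t' -v min="$2" '
        /^#/ { print; next }              # keep all header lines
        $6 != "." && $6 + 0 > min + 0     # keep records above the threshold
    ' "$1"
}

filter_qual example.vcf 20
```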
Installation:
mamba install -c bioconda bcftools
Basic usage:
# Call variants
bcftools mpileup -f reference.fasta sorted.bam | bcftools call -mv -O v -o variants.vcf
# Filter variants by quality
bcftools filter -i 'QUAL>20' variants.vcf -o filtered.vcf
# Extract sample statistics
bcftools stats variants.vcf
BEDtools
What it does: Powerful toolkit for genome arithmetic - comparing, manipulating, and annotating genomic features in BED, VCF, BAM, and GFF formats.
Use cases: Finding overlapping regions, calculating coverage, extracting sequences, intersecting genomic intervals
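The core idea behind interval intersection is small: BED intervals are 0-based and half-open, so two intervals on the same chromosome overlap when each one starts before the other ends. The sketch below applies that test with a naive all-against-all comparison in awk. The file names, the coordinates, and the helper `intersect_naive` are invented for illustration, and this is not how BEDtools is implemented; it only shows the arithmetic.

```shell
# Two tiny BED files created only for this example.
printf 'chr1\t100\t200\nchr1\t300\t400\nchr2\t50\t80\n' > example_a.bed
printf 'chr1\t150\t350\nchr2\t90\t120\n' > example_b.bed

# intersect_naive A B: print the overlapping portion of each A interval
# with each B interval (quadratic time, fine for toy inputs only).
intersect_naive() {
    awk 'BEGIN { FS = OFS = "\t" }
         NR == FNR {                      # first file read: store B intervals
             n++; chrom[n] = $1; start[n] = $2 + 0; stop[n] = $3 + 0
             next
         }
         {
             a_start = $2 + 0; a_stop = $3 + 0
             for (i = 1; i <= n; i++) {
                 # Half-open intervals overlap if each starts before the other ends.
                 if ($1 == chrom[i] && a_start < stop[i] && start[i] < a_stop) {
                     s = (a_start > start[i]) ? a_start : start[i]
                     e = (a_stop < stop[i]) ? a_stop : stop[i]
                     print $1, s, e
                 }
             }
         }' "$2" "$1"
}

intersect_naive example_a.bed example_b.bed
```

Here the chr2 intervals do not overlap (50-80 ends before 90-120 starts), so only the two chr1 overlaps are reported.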
Installation:
mamba install -c bioconda bedtools
Example operations:
# Intersect two BED files
bedtools intersect -a file1.bed -b file2.bed > overlap.bed
# Calculate coverage
bedtools coverage -a regions.bed -b alignments.bam > coverage.txt
# Extract FASTA sequences from BED intervals
bedtools getfasta -fi genome.fa -bed regions.bed -fo output.fa
Variant Calling
GATK (Genome Analysis Toolkit)
What it does: Comprehensive toolkit for variant discovery, particularly SNPs and indels. Gold standard for germline variant calling.
Use cases: Germline variant calling (SNPs/indels), somatic variant calling, copy number variation, genotyping
Installation:
# Via conda
mamba install -c bioconda gatk4
# Or download directly
wget https://github.com/broadinstitute/gatk/releases/download/4.x.x.x/gatk-4.x.x.x.zip
unzip gatk-4.x.x.x.zip
Key GATK workflow steps:
# Mark duplicates
gatk MarkDuplicates -I sorted.bam -O marked.bam -M metrics.txt
# Base quality score recalibration
gatk BaseRecalibrator -I marked.bam -R reference.fasta \
--known-sites dbsnp.vcf -O recal_data.table
gatk ApplyBQSR -I marked.bam -R reference.fasta \
--bqsr-recal-file recal_data.table -O recalibrated.bam
# Call variants
gatk HaplotypeCaller -R reference.fasta -I recalibrated.bam \
-O variants.vcf
FreeBayes
What it does: Bayesian genetic variant detector. Good alternative to GATK, simpler to use for basic variant calling.
Use cases: SNP/indel discovery, population genetics, simpler variant calling workflows
Installation:
mamba install -c bioconda freebayes
VarScan
What it does: Variant detection in massively parallel sequencing data. Particularly good for somatic mutation detection.
Use cases: Somatic mutations, copy number alterations, comparing tumor/normal samples
Installation:
mamba install -c bioconda varscan
Variant Annotation
VEP (Variant Effect Predictor)
What it does: Determines the effect of variants (SNPs, insertions, deletions) on genes, transcripts, and protein sequence. From Ensembl.
Use cases: Functional annotation, predicting consequences of variants, prioritizing variants
Installation:
mamba install -c bioconda ensembl-vep
# Download cache (required for offline use)
vep_install -a cf -s homo_sapiens -y GRCh38
Basic usage:
vep -i variants.vcf -o annotated.vcf --cache --assembly GRCh38 \
--everything --vcf --force_overwrite
SnpEff
What it does: Genetic variant annotation and effect prediction. Simpler alternative to VEP.
Use cases: Variant annotation, functional effect prediction
Installation:
mamba install -c bioconda snpeff
# Download database
# The Bioconda package installs the launcher as snpEff (case-sensitive)
snpEff download GRCh38.99
ANNOVAR
What it does: Efficient annotation of genetic variants with diverse databases.
Use cases: Variant annotation, integrating multiple annotation sources
Installation: Download from ANNOVAR website
Quality Control
FastQC
What it does: Quality control tool for high throughput sequence data. Generates reports on read quality, GC content, adapter contamination, etc.
Use cases: Pre- and post-processing QC, identifying problems with sequencing data
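As a small illustration of one metric in such a report, overall GC content can be estimated directly from a FASTQ file, because in a well-formed four-line-per-record FASTQ the sequence is always the second line of each record. The sketch below does this with awk. The file `example.fastq`, its two reads, and the helper `gc_percent` are made up for the example; FastQC itself computes per-read distributions and much more.

```shell
# A two-read FASTQ created only for this example.
printf '@r1\nGGCC\n+\nIIII\n@r2\nATAT\n+\nIIII\n' > example.fastq

# gc_percent FILE: percentage of G/C bases across all reads.
# Assumes exactly four lines per record (no wrapped sequences).
gc_percent() {
    awk 'NR % 4 == 2 {
             total += length($0)            # count bases before modifying the line
             gc += gsub(/[GCgc]/, "")       # gsub returns the number of G/C removed
         }
         END { if (total > 0) printf "%.1f\n", 100 * gc / total }' "$1"
}

gc_percent example.fastq
```

For compressed input, the same function can read from a pipe, for example `gzip -dc sample.fastq.gz | gc_percent -`.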
Installation:
mamba install -c bioconda fastqc
Usage:
# Run on single file
fastqc sample.fastq.gz
# Run on multiple files (FastQC expects the output directory to exist)
mkdir -p qc_results
fastqc *.fastq.gz -o qc_results/
MultiQC
What it does: Aggregates results from multiple bioinformatics analyses into a single report. Works with FastQC, STAR, GATK, and many other tools.
Use cases: Summarizing QC across multiple samples, project-wide quality reports
Installation:
mamba install -c bioconda multiqc
Usage:
# Run in directory containing analysis results
multiqc .
Trimmomatic
What it does: Flexible read trimming tool for Illumina NGS data. Removes adapters and low-quality bases.
Use cases: Pre-processing raw reads, adapter removal, quality trimming
Installation:
mamba install -c bioconda trimmomatic
Long-Read Sequencing
Minimap2
What it does: Versatile sequence alignment program for long reads (PacBio, Oxford Nanopore). Very fast.
Use cases: Long-read alignment, genome assembly, RNA isoform detection
Installation:
mamba install -c bioconda minimap2
Canu
What it does: Long-read assembler for high-noise single-molecule sequencing.
Use cases: De novo genome assembly from PacBio/Nanopore reads
Installation:
mamba install -c bioconda canu
Structural Variation
Manta
What it does: Structural variant caller for Illumina sequencing data.
Use cases: Detecting deletions, insertions, inversions, duplications, translocations
Installation:
mamba install -c bioconda manta
Delly
What it does: Integrated structural variant prediction method.
Use cases: SV detection in germline and somatic contexts
Installation:
mamba install -c bioconda delly
RNA-seq Specific Tools
Quantification
Salmon
What it does: Ultra-fast transcript quantification from RNA-seq data. Uses lightweight mapping (quasi-mapping/selective alignment) rather than traditional full alignment.
Use cases: Gene/transcript expression quantification
Installation:
mamba install -c bioconda salmon
featureCounts (from Subread package)
What it does: Counts reads mapping to genomic features (genes, exons, etc.).
Use cases: RNA-seq read counting for differential expression analysis
Installation:
mamba install -c bioconda subread
RSEM
What it does: Accurate transcript quantification from RNA-seq data.
Use cases: Isoform-level expression estimation
Installation:
mamba install -c bioconda rsem
Differential Expression
DESeq2
What it does: R package for differential gene expression analysis using count data.
Use cases: Finding differentially expressed genes between conditions
Installation (in R):
if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("DESeq2")
edgeR
What it does: R package for differential expression analysis of RNA-seq data.
Use cases: Differential expression, particularly for datasets with biological replication
Installation (in R):
BiocManager::install("edgeR")
Workflow Management
Nextflow
What it does: Workflow management system for data-driven computational pipelines.
Use cases: Building reproducible, scalable bioinformatics pipelines
Installation:
# Via conda
mamba install -c bioconda nextflow
# Or direct download
curl -s https://get.nextflow.io | bash
Snakemake
What it does: Python-based workflow management system.
Use cases: Creating reproducible and scalable data analyses
Installation:
mamba install -c conda-forge -c bioconda snakemake
Containerization
Docker
What it does: Platform for developing, shipping, and running applications in containers.
Use cases: Reproducible environments, sharing analyses
Installation: See Docker documentation
Singularity/Apptainer
What it does: Container platform designed for HPC environments. Unlike Docker, it runs containers as the invoking user without a root-owned daemon, which suits shared clusters.
Use cases: Running containers on HPC clusters
Installation:
mamba install -c conda-forge singularity
Dementia-Relevant Databases & Tools
PGS Catalog
What it does: Repository of polygenic scores (PGS) for various traits and diseases, including neurodegenerative conditions.
Use cases: Calculating polygenic risk scores for dementia and related traits
Access: PGS Catalog
GWAS Catalog
What it does: Curated collection of genome-wide association studies.
Use cases: Finding known genetic associations with dementia, AD, other neurodegenerative diseases
Access: GWAS Catalog
Ensembl
What it does: Genome browser and database for vertebrate genomes.
Use cases: Genome annotation, variant annotation, accessing reference genomes
Access: Ensembl
AlzGene
What it does: Field synopsis of genetic association studies in Alzheimer’s disease.
Use cases: AD genetics research, known genetic risk factors
Access: AlzGene
Tips for Tool Management
Always document which version of each tool you used. Results can vary between versions.
Best Practices
Use environment management (conda, mamba) to keep projects isolated
Document versions in your README and analysis catalogue
Create environment files for reproducibility:
conda env export > environment.yml
Test on small datasets first before running large analyses
Read the documentation - most tools have excellent docs with examples
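One lightweight way to document versions is to capture them in a file alongside each analysis. The sketch below is a minimal example of that idea: it assumes each listed tool prints its version with `--version` (true for samtools, bcftools, and bedtools, but not for every tool; `bwa`, for instance, does not accept that flag), and the function name `record_versions` and output file `tool_versions.tsv` are arbitrary choices for illustration.

```shell
# record_versions TOOL...: print "tool<TAB>version" for each tool on PATH,
# or mark it as not installed instead of failing.
record_versions() {
    for tool in "$@"; do
        if command -v "$tool" >/dev/null 2>&1; then
            # Keep only the first line of the version banner.
            version=$("$tool" --version 2>&1 | head -n 1)
            printf '%s\t%s\n' "$tool" "$version"
        else
            printf '%s\tNOT INSTALLED\n' "$tool"
        fi
    done
}

# Write a version record next to the analysis outputs.
record_versions samtools bcftools bedtools > tool_versions.tsv
```

Committing `tool_versions.tsv` together with `environment.yml` gives a quick human-readable record in addition to the full environment export.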
Quick Reference Commands
# Check if tool is installed
which bwa
samtools --version
# List conda environments
conda env list
# Activate environment
conda activate bioinfo
# Export environment for reproducibility
conda env export > environment.yml
# Create environment from file
conda env create -f environment.yml
Next Steps
Now that you know what tools are available:
- Set up your computing environment with the essential tools
- Explore Genomics Fundamentals to understand the data types
- Try hands-on Tutorials & Workshops
- Establish Best Practices for your analyses