Genomics Fundamentals

Core concepts for dementia research

Introduction to Genomics in Dementia Research

Understanding the genetic basis of dementia is crucial for developing new diagnostics and therapeutics. This section covers fundamental genomics concepts with applications to dementia research.

Note: Why Genomics Matters for Dementia

Genetic factors play significant roles in both familial and sporadic forms of dementia. Genomics helps us understand disease mechanisms, identify at-risk individuals, and discover therapeutic targets.


Genomic Data Types

Sequencing Technologies

Whole Genome Sequencing (WGS)

What it measures: Complete DNA sequence of an organism

Applications in dementia research:

  • Discovering rare variants
  • Structural variation detection
  • Non-coding region analysis
  • Complete genetic risk profiling

Typical coverage: 30x for germline studies
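
As a rough worked example of what 30x means, mean coverage is the total number of sequenced bases divided by the genome size. The sketch below assumes a genome of about 3.1 billion bases and 150 bp reads; both constants are illustrative assumptions, not values from this guide.

```python
import math

# Assumed values for illustration; adjust to your assembly and platform.
HUMAN_GENOME_SIZE = 3_100_000_000  # approximate haploid human genome, in bases
READ_LENGTH = 150                  # a common short-read length, in bases


def mean_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Expected mean depth: total sequenced bases divided by genome size."""
    return n_reads * read_length / genome_size


def reads_needed(target_coverage: float, read_length: int, genome_size: int) -> int:
    """Number of individual reads needed to reach a target mean depth."""
    return math.ceil(target_coverage * genome_size / read_length)


n_reads = reads_needed(30, READ_LENGTH, HUMAN_GENOME_SIZE)
print(f"Reads needed for 30x: {n_reads:,}")
```

Under these assumptions the count refers to individual reads, so a paired-end run needs half as many read pairs. Real yields must be higher to allow for duplicates and unmapped reads.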


Whole Exome Sequencing (WES)

What it measures: Protein-coding regions (~1-2% of the genome)

Applications:

  • Finding disease-causing variants in genes
  • Cost-effective for coding variant discovery
  • Useful for familial dementia studies

Typical coverage: 100x or higher


Targeted/Panel Sequencing

What it measures: Specific genes or regions of interest

Applications:

  • Clinical diagnostic panels (e.g., AD/FTD gene panels)
  • Validating variants found in WGS/WES
  • Large cohort studies of known risk loci

RNA Sequencing (RNA-seq)

What it measures: Gene expression levels, transcript isoforms

Applications in dementia:

  • Differential gene expression in disease vs. control
  • Identifying dysregulated pathways
  • Alternative splicing analysis
  • Single-cell RNA-seq for cell-type-specific changes

Long-Read Sequencing

Technologies: PacBio, Oxford Nanopore

Advantages:

  • Better for structural variants
  • Phasing variants
  • Resolving repetitive regions (like C9orf72 repeats)
  • Detecting epigenetic modifications

Genomic Data Formats

FASTQ

What it is: Raw sequencing reads with quality scores

Structure:

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65

Each read occupies four lines:

  • Sequence identifier (line starts with @)
  • DNA sequence
  • Separator (line starts with +)
  • ASCII-encoded quality values, one character per base
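
The four-line layout can be parsed in a few lines of Python. This is a minimal sketch that assumes Phred+33 encoding (the usual encoding for current Illumina data); the function name `parse_fastq_record` is made up for this example.

```python
def parse_fastq_record(lines):
    """Parse one four-line FASTQ record into id, sequence, and Phred scores.

    Assumes Phred+33 encoding: quality = ASCII code of the character minus 33.
    """
    header, sequence, separator, quality = (line.rstrip("\n") for line in lines)
    if not header.startswith("@"):
        raise ValueError("FASTQ header must start with '@'")
    if not separator.startswith("+"):
        raise ValueError("third line must start with '+'")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    phred = [ord(ch) - 33 for ch in quality]
    return {"id": header[1:], "sequence": sequence, "phred": phred}


# The example record shown above.
record = parse_fastq_record([
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
])
print(record["id"], record["phred"][:5])
```

For real files, an established parser (for example Biopython or pysam) is a safer choice than hand-rolled code.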

SAM/BAM/CRAM

What it is: Aligned sequence data

  • SAM: Text format (human-readable)
  • BAM: Binary compressed SAM (smaller, faster)
  • CRAM: Further compressed (reference-based)

Key fields:

  • Read name, flag, chromosome, position
  • CIGAR string (alignment representation)
  • Mapping quality, mate pair info

Example SAM line:

READ1  99  chr1  12345  60  100M  =  12445  200  ACGT...  IIII...
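
To make the flag and CIGAR fields in this example concrete, the sketch below decodes the FLAG bits and computes how many reference bases the alignment spans. The bit meanings follow the SAM specification; the helper names are hypothetical.

```python
import re

# Bit meanings of the SAM FLAG field.
SAM_FLAGS = [
    (0x1, "paired"),
    (0x2, "proper_pair"),
    (0x4, "unmapped"),
    (0x8, "mate_unmapped"),
    (0x10, "reverse_strand"),
    (0x20, "mate_reverse_strand"),
    (0x40, "first_in_pair"),
    (0x80, "second_in_pair"),
    (0x100, "secondary"),
    (0x200, "qc_fail"),
    (0x400, "duplicate"),
    (0x800, "supplementary"),
]

# CIGAR operations that consume reference bases.
REF_CONSUMING = set("MDN=X")


def decode_flag(flag: int) -> list:
    """Return the names of all bits set in a SAM flag."""
    return [name for bit, name in SAM_FLAGS if flag & bit]


def reference_span(cigar: str) -> int:
    """Number of reference bases covered by an alignment."""
    operations = re.findall(r"(\d+)([MIDNSHP=X])", cigar)
    return sum(int(length) for length, op in operations if op in REF_CONSUMING)


fields = "READ1 99 chr1 12345 60 100M = 12445 200".split()
flag, pos, cigar = int(fields[1]), int(fields[3]), fields[5]
print(decode_flag(flag))
print(pos + reference_span(cigar) - 1)  # last reference base covered by the read
```

Flag 99 therefore describes the first read of a properly paired alignment whose mate maps to the reverse strand.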

VCF (Variant Call Format)

What it is: Standard format for variant data

Contains:

  • Chromosome, position, reference allele
  • Alternate allele(s)
  • Quality scores
  • Genotype information
  • Annotations (if added)

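Example (rs429358, one of the two SNPs that define the APOE ε alleles; the position shown is the GRCh37/hg19 coordinate):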

#CHROM  POS     ID      REF  ALT  QUAL  FILTER  INFO
chr19   45411941  rs429358  T   C   100   PASS   ...
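
A data line like this can be mapped onto the header columns with plain string handling. The sketch below is illustrative only; `parse_vcf_record` is a hypothetical helper, and because the INFO value is elided in the example, the VCF missing-value marker "." stands in for it.

```python
def parse_vcf_record(header_line: str, data_line: str) -> dict:
    """Map one tab-delimited VCF data line onto its header column names."""
    columns = header_line.lstrip("#").rstrip("\n").split("\t")
    values = data_line.rstrip("\n").split("\t")
    record = dict(zip(columns, values))
    record["POS"] = int(record["POS"])        # VCF positions are 1-based
    record["ALT"] = record["ALT"].split(",")  # multiple ALT alleles are comma-separated
    record["QUAL"] = None if record["QUAL"] == "." else float(record["QUAL"])
    return record


header = "\t".join(["#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"])
# INFO is elided in the example above; "." (missing) is used here as a placeholder.
line = "\t".join(["chr19", "45411941", "rs429358", "T", "C", "100", "PASS", "."])

rec = parse_vcf_record(header, line)
print(rec["CHROM"], rec["POS"], rec["REF"], rec["ALT"])
```

Production code would normally rely on a dedicated library such as pysam or cyvcf2, which also handle the header metadata and per-sample genotype columns.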

BED

What it is: Simple format for genomic regions/features

Structure:

chr1  1000  2000  feature1  100  +
chr1  3000  4000  feature2  200  -

Columns: chromosome, start (0-based), end (exclusive), then optionally name, score, strand

Uses:

  • Defining regions of interest
  • Gene annotations
  • Coverage calculations
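
BED coordinates are 0-based and half-open, which is a frequent source of off-by-one errors when combining BED with 1-based formats such as VCF or GFF. The following sketch, with hypothetical helper names, shows the arithmetic on the first example line.

```python
def parse_bed_line(line: str) -> dict:
    """Parse a BED line with 3 to 6 whitespace-separated columns."""
    fields = line.split()
    feature = {"chrom": fields[0], "start": int(fields[1]), "end": int(fields[2])}
    # Optional columns are added only when present.
    for key, value in zip(("name", "score", "strand"), fields[3:6]):
        feature[key] = value
    return feature


def bed_length(feature: dict) -> int:
    """Length of a half-open interval is simply end - start."""
    return feature["end"] - feature["start"]


def to_one_based(feature: dict) -> tuple:
    """Convert to 1-based inclusive coordinates (as used by VCF and GFF)."""
    return feature["start"] + 1, feature["end"]


feature = parse_bed_line("chr1  1000  2000  feature1  100  +")
print(bed_length(feature), to_one_based(feature))
```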

GFF/GTF

What it is: Gene annotation formats

Contains:

  • Gene, transcript, exon coordinates
  • Gene IDs, names, biotypes
  • Used by RNA-seq quantification tools

Common Analysis Workflows

Variant Calling Pipeline

Raw FASTQ reads
    ↓
Quality Control (FastQC)
    ↓
Read Trimming (Trimmomatic/cutadapt)
    ↓
Alignment to Reference (BWA/Bowtie2)
    ↓
BAM Processing (SAMtools: sort, index)
    ↓
Mark Duplicates (GATK/Picard)
    ↓
Base Quality Recalibration (GATK)
    ↓
Variant Calling (GATK/FreeBayes)
    ↓
Variant Filtering
    ↓
Variant Annotation (VEP/SnpEff)
    ↓
Downstream Analysis

RNA-seq Analysis Pipeline

Raw FASTQ reads
    ↓
Quality Control (FastQC)
    ↓
Trimming (if needed)
    ↓
Alignment (STAR/HISAT2) OR Pseudo-alignment (Salmon/Kallisto)
    ↓
Quantification (featureCounts/Salmon)
    ↓
Quality Control (MultiQC)
    ↓
Differential Expression (DESeq2/edgeR)
    ↓
Pathway Analysis
    ↓
Visualization

GWAS Concepts

What is GWAS?

A genome-wide association study (GWAS) identifies genetic variants associated with traits or diseases by testing variants across the genomes of many individuals.

Key Components

SNP Arrays/Genotyping

  • Common variants (MAF > 1%)
  • Millions of SNPs tested simultaneously
  • Cost-effective for large cohorts

Statistical Associations

  • Testing each SNP for association with disease
  • P-value threshold: typically 5×10⁻⁸ (genome-wide significance)
  • Multiple testing correction
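
The 5×10⁻⁸ threshold is commonly explained as a Bonferroni correction of α = 0.05 for roughly one million effectively independent common-variant tests. The sketch below reproduces that arithmetic; the figure of one million tests is an assumption of this example.

```python
def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return alpha / n_tests


def genome_wide_significant(p_values, threshold: float = 5e-8) -> list:
    """Keep only p-values below the genome-wide threshold."""
    return [p for p in p_values if p < threshold]


# Assumption: about 1,000,000 independent tests across the genome.
threshold = bonferroni_threshold(0.05, 1_000_000)
print(threshold)
print(genome_wide_significant([1e-9, 5e-8, 0.01]))
```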

Manhattan Plots

  • Visualize GWAS results
  • X-axis: chromosomal position
  • Y-axis: -log10(p-value)
  • Peaks indicate associated loci
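
The plot coordinates can be derived as follows: each variant's x value is its position plus the combined length of all preceding chromosomes, and its y value is -log10(p). This is a sketch with toy chromosome lengths and hypothetical names; plotting itself is left to a library such as matplotlib.

```python
import math


def manhattan_coordinates(results, chrom_lengths):
    """Convert (chrom, pos, p_value) tuples into (x, y) plot coordinates.

    chrom_lengths must list chromosomes in plotting order. p-values must be > 0.
    """
    offsets, running_total = {}, 0
    for chrom, length in chrom_lengths.items():
        offsets[chrom] = running_total
        running_total += length
    return [(offsets[chrom] + pos, -math.log10(p)) for chrom, pos, p in results]


# Toy lengths and results for illustration only (not real genome sizes).
toy_lengths = {"chr1": 1000, "chr2": 800}
toy_results = [("chr1", 500, 0.01), ("chr2", 100, 5e-8)]

points = manhattan_coordinates(toy_results, toy_lengths)
print(points)
```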

GWAS in Dementia Research

Major GWAS findings for Alzheimer’s:

  • APOE region (chromosome 19) - strongest signal
  • Immune-related genes (TREM2, CR1, CD33)
  • Lipid metabolism genes (CLU, ABCA7)
  • Endocytosis genes (BIN1, PICALM)

Polygenic Risk Scores (PRS)

  • Combines effects of many variants
  • Predicts individual genetic risk
  • Useful for risk stratification in clinical studies
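
In its simplest form, a PRS is a weighted sum: each variant's effect size (for example, the log odds ratio from a GWAS) multiplied by the number of effect alleles the person carries. The sketch below uses made-up variant IDs and weights; real scores also involve variant selection, LD handling, and normalization that are not shown here.

```python
def polygenic_risk_score(weights: dict, dosages: dict) -> tuple:
    """Sum of effect size x effect-allele dosage over the variants available.

    Returns (score, number of variants used). Variants missing from the
    individual's dosages are skipped, so always check how many were used.
    """
    score, n_used = 0.0, 0
    for variant, beta in weights.items():
        dosage = dosages.get(variant)
        if dosage is None:
            continue
        score += beta * dosage
        n_used += 1
    return score, n_used


# Hypothetical weights and genotypes (dosage = 0, 1, or 2 effect alleles).
toy_weights = {"rs_a": 0.2, "rs_b": -0.1, "rs_c": 0.05}
toy_dosages = {"rs_a": 2, "rs_b": 1}

score, n_used = polygenic_risk_score(toy_weights, toy_dosages)
print(round(score, 3), n_used)
```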

Linkage Disequilibrium (LD)

What it is: Non-random association of alleles at different loci

Why it matters:

  • GWAS signals often tag multiple correlated variants
  • Fine-mapping needed to identify causal variants
  • Different LD patterns across populations
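
LD between two biallelic loci is usually summarized with D, D′, and r². The sketch below computes them from phased haplotype counts (alleles A/a at the first locus, B/b at the second); the counts are toy values and the function name is hypothetical.

```python
def ld_statistics(n_AB: int, n_Ab: int, n_aB: int, n_ab: int) -> dict:
    """Compute D, |D'|, and r^2 from the four haplotype counts."""
    total = n_AB + n_Ab + n_aB + n_ab
    if total == 0:
        raise ValueError("no haplotypes supplied")
    p_AB = n_AB / total
    p_A = (n_AB + n_Ab) / total
    p_B = (n_AB + n_aB) / total
    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denom == 0:
        raise ValueError("both loci must be polymorphic")
    d = p_AB - p_A * p_B  # deviation from independence
    if d >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    return {"D": d, "D_prime": abs(d) / d_max, "r2": d * d / denom}


stats = ld_statistics(40, 10, 10, 40)
print(stats)
```

In practice, tools such as PLINK estimate these statistics from genotype data, where haplotype phase has to be inferred.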

Tools for LD analysis:

  • PLINK
  • LDLink (online)
  • Ensembl browser

Functional Genomics

Understanding Variant Effects

Not all variants are equal:

Variant Types by Impact:

  • High impact: Stop gained/lost, frameshift
  • Moderate impact: Missense, in-frame indels
  • Low impact: Synonymous variants
  • Modifier: Intronic, intergenic
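
These tiers can be expressed as a lookup table. The sketch below uses Sequence Ontology consequence terms of the kind reported by annotators such as VEP and SnpEff; it covers only the categories listed above and is not a complete mapping.

```python
# Partial mapping from consequence term to impact tier (illustrative subset).
IMPACT_BY_CONSEQUENCE = {
    "stop_gained": "HIGH",
    "stop_lost": "HIGH",
    "frameshift_variant": "HIGH",
    "missense_variant": "MODERATE",
    "inframe_insertion": "MODERATE",
    "inframe_deletion": "MODERATE",
    "synonymous_variant": "LOW",
    "intron_variant": "MODIFIER",
    "intergenic_variant": "MODIFIER",
}

# Tiers ordered from least to most severe.
IMPACT_RANK = ["MODIFIER", "LOW", "MODERATE", "HIGH"]


def most_severe_impact(consequences):
    """Return the most severe tier among a variant's consequence terms.

    Terms missing from the table are ignored; returns None if none are known.
    """
    impacts = [IMPACT_BY_CONSEQUENCE[c] for c in consequences if c in IMPACT_BY_CONSEQUENCE]
    if not impacts:
        return None
    return max(impacts, key=IMPACT_RANK.index)


print(most_severe_impact(["intron_variant", "missense_variant"]))
```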

Regulatory Variants

  • Affect gene expression
  • Located in enhancers, promoters
  • Can be more common than coding variants
  • Harder to interpret than coding variants, but important: most GWAS signals lie in non-coding regions

Expression QTLs (eQTLs)

What they are: Genetic variants associated with gene expression levels

Applications in dementia:

  • Connect GWAS variants to affected genes
  • Understand mechanisms
  • eQTL resources that include brain tissue (GTEx, BrainSeq)
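
At its core, an eQTL test regresses a gene's expression on genotype dosage (0, 1, or 2 copies of an allele). The sketch below fits that line by ordinary least squares on toy data. It is deliberately unadjusted; real analyses add covariates such as ancestry components and batch, and use dedicated software.

```python
def eqtl_fit(dosages, expression) -> tuple:
    """Least-squares slope and intercept of expression on genotype dosage."""
    n = len(dosages)
    if n != len(expression) or n < 2:
        raise ValueError("need two equally sized samples with n >= 2")
    mean_x = sum(dosages) / n
    mean_y = sum(expression) / n
    sxx = sum((x - mean_x) ** 2 for x in dosages)
    if sxx == 0:
        raise ValueError("genotype has no variation in this sample")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(dosages, expression))
    slope = sxy / sxx  # expression change per additional allele
    return slope, mean_y - slope * mean_x


# Toy data: six individuals, hypothetical normalized expression values.
toy_dosages = [0, 0, 1, 1, 2, 2]
toy_expression = [5.0, 5.2, 6.1, 5.9, 7.0, 7.2]

slope, intercept = eqtl_fit(toy_dosages, toy_expression)
print(round(slope, 3), round(intercept, 3))
```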

Multi-Omics Integration

Combining data types for comprehensive understanding:

  • Genomics + Transcriptomics: Which variants affect gene expression?
  • Genomics + Proteomics: Protein-level effects of variants
  • Genomics + Epigenomics: DNA methylation, histone modifications
  • Single-cell multi-omics: Cell-type-specific effects

Population Genetics Considerations

Ancestry and Dementia Risk

Why it matters:

  • Allele frequencies vary by ancestry
  • APOE ε4 prevalence differs across populations
  • PRS trained on European ancestry may not transfer well

Best Practices:

  • Include diverse populations in studies
  • Report ancestry-specific results
  • Consider population stratification
  • Use ancestry-informed analyses

Family-Based Studies

Approaches:

  • Linkage analysis for rare familial forms
  • Segregation analysis
  • Parent-offspring trios for de novo mutations

Advantages:

  • More power for rare variants
  • Phasing information
  • Controls for population structure

Reference Genomes

Human Reference Assembly

Current versions:

  • GRCh38/hg38: Current standard (2013, regularly updated)
  • GRCh37/hg19: Previous version (still widely used)

Important: Check Your Reference!

Always verify which reference genome was used. Coordinates differ between versions, so mixing them misplaces variants and annotations.
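
One quick sanity check is to read the chromosome 1 length from a file's contig header, since it differs between the two assemblies (249,250,621 bases in GRCh37 and 248,956,422 in GRCh38). The sketch below applies this to VCF header lines; the function name is hypothetical, and a file without contig lines yields "unknown".

```python
import re

# chr1 length in each of the two assemblies discussed above.
CHR1_LENGTH_TO_BUILD = {
    249_250_621: "GRCh37/hg19",
    248_956_422: "GRCh38/hg38",
}

# Matches the contig line for chromosome 1, named either "1" or "chr1".
CONTIG_CHR1 = re.compile(r"^##contig=<ID=(?:chr)?1,(?:.*?,)?length=(\d+)")


def guess_build(vcf_header_lines) -> str:
    """Guess the reference build from VCF ##contig header lines."""
    for line in vcf_header_lines:
        match = CONTIG_CHR1.match(line)
        if match:
            return CHR1_LENGTH_TO_BUILD.get(int(match.group(1)), "unknown")
    return "unknown"


example_header = [
    "##fileformat=VCFv4.2",
    "##contig=<ID=chr1,length=248956422>",
    "##contig=<ID=chr10,length=133797422>",
]
print(guess_build(example_header))
```

The same idea works for BAM/CRAM files, whose @SQ header lines record sequence lengths.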


Reference Files Needed

For variant calling you’ll need:

  • Reference FASTA file (.fasta or .fa)
  • Index files (.fai, .dict)
  • Known variants (dbSNP, 1000 Genomes)
  • Gene annotations (GTF/GFF)

Where to download:

  • NCBI, Ensembl, UCSC Genome Browser
  • GATK Resource Bundle
  • Illumina iGenomes

Quality Metrics

Sequencing Quality

Key metrics to check:

  • Q30: Percentage of bases with quality score ≥ 30 (99.9% accuracy)
  • Coverage: Average depth across genome/exome
  • Uniformity: Even coverage distribution
  • Duplication rate: PCR duplicates (should be <20%)
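
The link between Q30 and 99.9% accuracy comes from the Phred definition, Q = -10 · log10(P_error). A minimal sketch of the conversion and of the Q30 percentage, using hypothetical function names:

```python
def phred_to_error_probability(q: float) -> float:
    """Probability that a base call with Phred quality q is wrong."""
    return 10 ** (-q / 10)


def percent_q30(phred_scores) -> float:
    """Percentage of bases with quality score >= 30."""
    if not phred_scores:
        raise ValueError("no quality scores supplied")
    return 100 * sum(q >= 30 for q in phred_scores) / len(phred_scores)


for q in (10, 20, 30, 40):
    print(q, phred_to_error_probability(q))

print(percent_q30([40, 35, 30, 20]))
```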

Alignment Quality

  • Mapping rate: % reads that align (should be >90%)
  • Properly paired: For paired-end data
  • Insert size distribution: Expected range for library
  • Chimeric reads: Abnormally high suggests contamination

Variant Quality

  • Transition/transversion ratio (Ti/Tv): ~2.0-2.1 for WGS; higher (around 3.0) for WES
  • Het/hom ratio: Expected ranges vary by analysis type
  • Mendelian errors: For family data
  • Known variant concordance: Compare to dbSNP
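
Ti/Tv is straightforward to compute from a list of single-nucleotide variants: transitions are purine-to-purine (A↔G) or pyrimidine-to-pyrimidine (C↔T) changes, and all other substitutions are transversions. A minimal sketch with toy variants:

```python
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def titv_ratio(snvs) -> float:
    """Transition/transversion ratio from (ref, alt) pairs.

    Indels and multi-base alleles are skipped.
    """
    ti = tv = 0
    for ref, alt in snvs:
        ref, alt = ref.upper(), alt.upper()
        if len(ref) != 1 or len(alt) != 1 or ref == alt:
            continue
        if (ref, alt) in TRANSITIONS:
            ti += 1
        else:
            tv += 1
    if tv == 0:
        raise ValueError("no transversions observed; ratio undefined")
    return ti / tv


toy_snvs = [("A", "G"), ("C", "T"), ("T", "C"), ("G", "A"), ("A", "C"), ("G", "T"), ("AT", "A")]
print(titv_ratio(toy_snvs))
```

A ratio far below the expected range often points to an excess of false-positive calls.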

Ethical Considerations

APOE Genotyping

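Important: Ethical Note

APOE genotype is a strong risk factor for AD, but it is neither necessary nor sufficient for disease. Because disclosure can affect participants and their relatives, obtain specific consent and plan for genetic counseling before returning APOE results.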


Incidental Findings

When sequencing, you may discover:

  • Pathogenic variants unrelated to dementia
  • Carrier status for recessive conditions
  • Pharmacogenetic variants

Best practices:

  • Follow institutional policies
  • Consider secondary findings policies (ACMG guidelines)
  • Prepare return of results protocols

Data Management

Storage Requirements

Typical file sizes:

  • Raw FASTQ (30x WGS): ~100-200 GB per sample
  • BAM file: ~30-50 GB per sample
  • VCF file: 1-10 GB (depending on the number of samples)

Tips:

  • Compress files (gzip, bgzip)
  • Use CRAM instead of BAM when possible
  • Archive raw data, keep processed files accessible
  • Plan for backups!
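
For planning purposes, the per-sample figures above scale linearly with cohort size. The sketch below estimates a storage range for FASTQ plus BAM files; the cohort size of 500 is a made-up example, and 1 TB is taken as 1000 GB.

```python
def cohort_storage_tb(n_samples: int, per_sample_gb: tuple) -> tuple:
    """Scale a (low, high) per-sample size in GB to a cohort total in TB."""
    low_gb, high_gb = per_sample_gb
    return n_samples * low_gb / 1000, n_samples * high_gb / 1000


# Per-sample ranges taken from the list above (30x WGS).
FASTQ_GB = (100, 200)
BAM_GB = (30, 50)

per_sample = (FASTQ_GB[0] + BAM_GB[0], FASTQ_GB[1] + BAM_GB[1])
low_tb, high_tb = cohort_storage_tb(500, per_sample)  # hypothetical 500-sample cohort
print(f"{low_tb:.0f}-{high_tb:.0f} TB before backups")
```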

Data Sharing

Public repositories:

  • dbGaP: Controlled access for human data
  • European Genome-phenome Archive (EGA)
  • Sequence Read Archive (SRA): Raw sequencing data

Considerations:

  • Patient consent for data sharing
  • De-identification requirements
  • Institutional data sharing policies

Dementia Genomics Resources

Databases

Consortia

  • IGAP (International Genomics of Alzheimer’s Project)
  • ADGC (Alzheimer’s Disease Genetics Consortium)
  • CHARGE (Cohorts for Heart and Aging Research in Genomic Epidemiology)

Next Steps

Now that you understand the fundamentals:

  1. Practice with Tutorials & Workshops
  2. Set up Essential Tools for your analyses
  3. Implement Best Practices for reproducibility
  4. Explore additional Resources for deeper learning

Further Reading

Tip: Recommended Papers
  • Lambert et al. (2013) “Meta-analysis of 74,046 individuals identifies 11 new susceptibility loci for Alzheimer’s disease”
  • Kunkle et al. (2019) “Genetic meta-analysis of diagnosed Alzheimer’s disease identifies new risk loci and implicates Aβ, tau, immunity and lipid processing”
  • Bellenguez et al. (2022) “New insights into the genetic etiology of Alzheimer’s disease and related dementias”