What is Metagenomics?
- The study of metagenomes: genetic material recovered directly from environmental samples.
- Also known as environmental genomics or community genomics.
- Genomic analysis of microbial DNA taken directly from uncultured communities.
- Identifies novel microbial genes for metabolic pathways (e.g., energy, carbon, and nitrogen metabolism).
- Relies on bioinformatics tools to process the data.

History of Metagenomics
- 1970s: Carl Woese proposed ribosomal RNA genes as molecular markers for classifying life.
- Automated Sanger sequencing revolutionized the study of microbes.
- Advances in molecular techniques opened access to the "new uncultured world" of microbial communities.
- Impactful techniques:
  - Polymerase chain reaction (PCR)
  - rRNA gene cloning and sequencing
  - Fluorescence in situ hybridization (FISH)
  - Denaturing gradient gel electrophoresis (DGGE)
  - Restriction fragment length polymorphism (RFLP) and terminal RFLP (T-RFLP)

Metagenomics Importance
- Organisms can be studied directly in their natural environments, bypassing the need for isolation and culture.
- Especially advantageous for viral metagenomics, because the hosts are often difficult to cultivate.
- The resulting genomic information advances research in forensic science and biomedical fields.

Historical Background: Limitations of Culture Methods
1. The Great Plate Count Anomaly (Staley and Konopka, 1985)
   - 1 g of soil can contain up to 4,000 different species, but fewer than 1% are readily culturable with known methods.
   - Culturability is extremely low for natural bacterial populations.
   - Many organisms cannot adapt to artificial and restrictive laboratory conditions.
   - Total microbial diversity exceeds cultivable diversity by 2-3 orders of magnitude.
2. Loss of Realistic Prokaryotic Biodiversity
   - Traditional cultivation methods can lose major portions of a microbial community.
   - Closely related bacterial species often require very different culture conditions.
   - Phylogenetic studies of environmental DNA (e.g., Pace et al., 1986) revealed numerous microbial groups that cultivation had never detected.
3. Limitation of Finding Novel Genes and Proteins
   - Culture methods yield few novel genes and proteins compared with metagenomic studies.

Metagenomics Basic Strategies
1. Sample of Environment
2. Extraction of Metagenome:
   - Ectopic extraction
   - Plasmid extraction
3. Metagenomic Library Construction:
   - Carriers: fosmid, cosmid, BAC, $\lambda$ phage
   - Hosts: E. coli, Streptomyces, Pseudomonas, etc.
4. Metagenomic Library Screening:
   - Sequence-based screening
   - Non-sequence-based screening
   - New gene activity expression

Metagenomic DNA Libraries Creation (Principle)
- Environmental samples: collect diverse samples.
- DNA extraction: obtain DNA from both cultivable and non-cultivable bacteria.
- Cloning: insert DNA fragments into vectors.
- Representation of initial biodiversity: create a metagenomic DNA library.
- Screening for enzymes: identify novel genes and activities.
  - Sequence-based metagenomics: ORF prediction, sequence comparison against databases, and homology-based enzyme mining. Yields biodiversity insights, thousands of new genes, and high-throughput annotation.
  - Functional metagenomics: colony or phage activity screening on chemical substrates. Yields active genes, novel activities, and new protein families.

Next-Generation Sequencing (NGS) Technologies
- Sanger sequencing had a great impact but is limited to $\approx 96$ sequences per run, each $\approx 650$ bp long.
- NGS platforms sequence millions of DNA molecules in parallel, with yields and read lengths that vary by platform.

Different Sequencing Platforms
- Short-read: Illumina, Ion Torrent
- Long-read: PacBio SMRT, Nanopore
- Each employs its own chemistry (e.g., sequencing by synthesis, sequencing by ligation).
- Used for whole-genome sequencing, targeted sequencing, and transcriptomics.

Short-read Sequencing
- Illumina: the most widely used platform; "sequencing by synthesis".
- Pyrosequencing: older platform; detects the pyrophosphate released during synthesis.
- Ion Torrent: semiconductor-based; detects the hydrogen ions released during DNA synthesis.
- SOLiD: "sequencing by ligation"; uses fluorescently labeled probes that are ligated to the DNA strand.

Illumina Sequencing Steps
1. Nucleic Acid Extraction
   - Isolate the genetic material.
   - Quality control: check purity (UV spectrophotometry) and quantify (fluorometric methods).
2. Library Preparation
   - DNA fragmentation (mechanical shearing, enzymatic digestion, or transposon-based).
   - End repair and A-tailing (adding an adenine nucleotide to the 3' ends).
   - Ligation of adapters (synthetic DNA sequences) to the DNA fragments.
   - The adapters provide binding sites for the sequencing primers.
3. Cluster Generation by Bridge Amplification
   - The DNA library is loaded onto a flow cell with small lanes.
   - The flow cell has oligonucleotide primers covalently attached to its surface.
   - DNA fragments bind to complementary primers and undergo bridge amplification.
   - Each amplified bridge creates a cluster.
   - The process finishes when each DNA spot has enough copies to give a strong signal.
4. Sequencing by Synthesis (SBS)
   - A mix of fluorescently labeled nucleotides (dATP, dCTP, dGTP, dTTP) is added and serves as the building blocks for the growing strand.
   - Each nucleotide emits fluorescence when it is incorporated, which identifies the base.
5. Data Analysis
   - Images are converted into base sequences by analyzing the fluorescent signals.
   - The sequences are processed and analyzed using bioinformatics tools.
   - Analysis identifies sequence variants, maps gene locations, and enables downstream analyses.
   - The data are interpreted in terms of pathways, biomarkers, and gene functions, translating raw data into biological insights.
   - Some instruments have built-in analysis software.

Pyrosequencing
1. Sample Preparation
   - DNA is extracted (mechanical disruption, chemical lysis).
   - DNA is fragmented (restriction enzymes, mechanical methods).
2. PCR Amplification
   - The DNA template is prepared by amplifying the region of interest using PCR.
   - The fragmented DNA is amplified using a biotinylated primer.
   - Biotin labelling covalently attaches biotin to a molecule; the biotinylated molecule can then be detected or captured through its high-affinity binding to streptavidin or avidin.
   - The PCR product contains one biotinylated strand, which is used as the template.
   - The biotin-tagged single-stranded DNA is isolated using streptavidin-coated beads.
   - The template DNA is hybridized with a sequencing primer and then added to the pyrosequencing reaction.
3. Sequencing Reaction
   - Reagents (template DNA, enzymes, substrates) are loaded.
   - Adding a nucleotide initiates the reaction.
   - DNA polymerase incorporates the complementary nucleotide, releasing pyrophosphate (PPi).
   - ATP sulfurylase converts PPi to ATP in the presence of adenosine 5'-phosphosulfate (APS).
   - Luciferase uses the generated ATP to convert luciferin to oxyluciferin, producing light.
   - Apyrase degrades the remaining ATP and the unincorporated nucleotides.
   - Light intensity is detected by a CCD camera and recorded as peaks.
   - Reactions:
     - DNA polymerase: DNA template + dNTP (complementary) $\rightarrow$ DNA product + PPi + H$^+$
     - ATP sulfurylase: PPi + APS $\rightarrow$ ATP + SO$_4^{2-}$
     - Luciferase: ATP + luciferin + O$_2$ $\rightarrow$ AMP + PPi + oxyluciferin + CO$_2$ + light
     - Apyrase: unincorporated dNTP (or ATP) + 2 H$_2$O $\rightarrow$ dNMP (or AMP) + 2 Pi
4. Sequence Analysis
   - Generates pyrograms: graphical representations of the light signals.
   - Each light peak corresponds to an added nucleotide.
   - The signals are analyzed to determine the nucleotide sequence.
   - Multiple fragments are assembled into a complete DNA sequence using bioinformatics.

Nanopore Sequencing (Long Reads)
1. DNA Extraction and Library Preparation
   - Extract genetic material from the samples.
   - For ultra-long reads, use methods that preserve high-molecular-weight DNA (e.g., spin column, magnetic bead, phenol-chloroform).
   - Fragment the extracted DNA (physical shearing, enzymatic digestion).
   - Optionally, size-select for specific fragment lengths.
   - Repair the fragment ends for accurate sequencing.
   - Attach adaptors to the fragment ends so the DNA can bind motor proteins and nanopores.
2. Sequencing Process
   - The library is introduced into a flow cell containing nanopores and chambers filled with ionic solution.
   - High concentrations of potassium chloride (KCl) or lithium chloride (LiCl) serve as the primary ionic solutions.
   - A constant voltage applied across the flow cell drives an ionic current through the nanopores.
   - The DNA is mixed with motor proteins that unwind the double helix and feed one strand through the nanopore.
   - The single-stranded DNA passing through the nanopore disrupts the ionic current.
   - Each nucleotide causes a specific change in current, and the resulting signals are detected by a patch-clamp amplifier.
3. Data Analysis
   - Base-calling algorithms translate the raw signal data into nucleotide sequences (A, C, G, T).
   - In general, base callers interpret platform signals (electrical current changes here, fluorescence colors on optical platforms) to infer the base order.
   - Error correction refines the sequence data for accuracy.
   - The sequenced data are aligned to a reference genome, and genome assembly is performed.
   - Structural variants and repetitive regions are detected after assembly and alignment.

Oxford Nanopore Technologies (ONT)
- Aims to enable the analysis of any living thing, by any person, in any environment.
- Works in extreme conditions (e.g., -5°C in Antarctica, high humidity in the Democratic Republic of the Congo).
- Used for field studies (e.g., geothermal microbes in Iceland, sequenced using solar power).

ONT Basic Workflow
1. Extract DNA/RNA from the sample.
2. Prepare the sample(s) for sequencing.
3. Run sequencing.
4. Convert 'squiggles' to bases ('basecalling').
5. Clean up the fragments.
6. 'Assemble' the fragments to make a whole genome.
7. Downstream processing (e.g., phylogenetic tree, gene identification).

ONT Devices and Throughput

| Device | No. of flow cells per run | Throughput (channels) | Theoretical maximum output | Cost |
| --- | --- | --- | --- | --- |
| Flongle | 1 | 126 | 2.8 Gb | From \$90 (per flow cell) |
| MinION | 1 | 512 | 50 Gb | From \$1,000 (starter pack) |
| Mk1C | 1 | 512 | 50 Gb | From \$4,000 (starter pack) |
| GridION | 5 | 5 × 512 | 250 Gb | From \$49,000 (starter pack) |
| PromethION | 24 or 48 | 10,700+ | 2,596+ Gb | From \$100,455 (starter pack) |

Different Kits for ONT
- Rapid barcoding kit
- Ligation kit
- PCR-based kits
- Automated kit prep

Comparison of Sequencing Technologies

| Feature | Illumina (SBS) | Pyrosequencing | Nanopore Sequencing |
| --- | --- | --- | --- |
| Read Length | Short (50-300 bp) | Short (typically 200-500 bp) | Ultra-long (up to 4 Mb) |
| Throughput | Very high (Tb per run) | Low to medium (Mb per run) | High (Gb to Tb per run) |
| Accuracy | High (Q30: 99.9%) | Moderate (Q20-Q30) | Moderate (85-99%, improving) |
| Error Type | Substitution errors | Indels in homopolymer regions | Indels in homopolymer regions |
| Principle | Sequencing by synthesis (fluorescence) | Sequencing by synthesis (luminescence from PPi release) | Detects electrical current changes as DNA passes through a pore |
| Cost per Gb | Lowest | Moderate to high | Moderate (decreasing) |
| Time per Run | Hours to days | Hours | Real-time (minutes to hours) |
| Initial Setup Cost | High | Moderate | Low (e.g., MinION) |
| Applications | WGS, RNA-Seq, ChIP-Seq, metagenomics, variant calling | Targeted sequencing, SNP detection, methylation analysis | De novo assembly, structural variant detection, direct RNA sequencing, field sequencing |
| Advantages | High accuracy, high throughput, low cost per base | Relatively fast, good for short targeted sequences | Long reads, real-time data, portable devices, direct RNA/DNA sequencing |
| Disadvantages | Short reads, not ideal for highly repetitive regions | Lower throughput, homopolymer issues, enzyme-dependent | Higher error rate (improving), requires high-molecular-weight DNA |
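
The accuracy figures in the comparison mix two notations: Phred quality scores (Q20, Q30) and percentages (85-99%). The two are related by the standard Phred definition $Q = -10 \log_{10} P$, where $P$ is the probability that a base call is wrong. The sketch below converts between them; the function names are illustrative and do not come from any sequencing toolkit.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Per-base error probability for a Phred score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def phred_to_accuracy(q: float) -> float:
    """Per-base accuracy (1 - P) for a Phred score."""
    return 1.0 - phred_to_error_prob(q)


def accuracy_to_phred(accuracy: float) -> float:
    """Phred score equivalent to a per-base accuracy given as a fraction."""
    if not 0.0 <= accuracy < 1.0:
        raise ValueError("accuracy must be in [0, 1)")
    return -10 * math.log10(1.0 - accuracy)


if __name__ == "__main__":
    # Phred scores quoted for Illumina and pyrosequencing
    for q in (20, 30):
        print(f"Q{q} -> {phred_to_accuracy(q):.2%} per-base accuracy")
    # Percentage range quoted for nanopore reads
    for acc in (0.85, 0.99):
        print(f"{acc:.0%} accuracy -> about Q{accuracy_to_phred(acc):.1f}")
```

Read this way, the 85-99% range quoted for nanopore reads spans roughly Q8 to Q20, which places its upper end at the lower bound quoted for pyrosequencing.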