Bioinformatics & Cell Biology Basics

Shared 11/29/2025•203 views


Cheatsheet Content

1. Bioinformatics Fundamentals
Definition: Use of computers and computational tools to understand biology.
Core Tasks: Storage, retrieval, analysis, and interpretation of biological data using software and databases.
Origin: Term coined by Paulien Hogeweg and Ben Hesper (1970).
Key Application Areas: Drug discovery and development; personalized medicine; disease surveillance and epidemiology; designing better genetically modified organisms (GMOs); forensic analysis; bioweapons (dual-use concern).

2. Cells and Basic Chemistry
Cells: Basic structural and functional units of life.
Cell Composition: Proteins, lipids (fats), carbohydrates (sugars), nucleic acids (DNA, RNA).
Elemental Makeup: Primarily C, H, O, N, P, S; other essential minerals.

3. Types of Cells
Prokaryotic Cells: No nucleus; DNA in cytoplasm as a single circular chromosome. No membrane-bound organelles. Examples: Bacteria, Archaea.
Eukaryotic Cells: Have a nucleus containing DNA. Have membrane-bound organelles (mitochondria, ER, Golgi, etc.). Examples: Plants, animals, fungi, protists.

4. Cell Geometry and Shape
Shape Factors: Cell type (prokaryotic vs eukaryotic), function.
Prokaryotes: Cell wall determines shape. Cocci – spherical; Bacilli – rod-shaped; Spirilla – spiral-shaped.
Eukaryotes: Cytoskeleton + plasma membrane maintain shape.
Function-based shapes: RBC (biconcave), neurons (long extensions), muscle cells (elongated).
Surface appendages: Flagella (long, usually few per cell) and cilia (short, usually many per cell).

5. Human Cell – Nucleus and Chromosomes
Nucleus: Stores the "blueprint" of the organism. Contains chromosomes, nucleolus, nucleoplasm, nuclear membrane.
Chromosomes: Thread-like "noodles" of chromatin in non-dividing cells; condense into visible structures during mitosis.
Chromatin = DNA wrapped around histone proteins (beads-on-a-string).
Humans: 23 pairs (46 total) in somatic cells; gametes have 23 single chromosomes. 22 pairs autosomes + 1 pair sex chromosomes (XX or XY).
Each cell (except gametes) has all 23 pairs, but only a subset of genes is active depending on cell type. Chromosome Functions: Carry genes for brain development, immune response, skeletal development, metabolism, pigmentation, disease-related genes. Sex Chromosomes (23rd pair): Control sexual development and sex-linked traits; XX = female, XY = male. 6. Diploid vs Haploid Diploid (2n): Cells with pairs of chromosomes (somatic cells: 46 in humans). Haploid (n): Cells with one copy of each chromosome (gametes: 23 in humans). 7. Human Formation and Meiosis Gamete Formation: Male (sperm) and female (ovum) gametes formed from diploid precursor cells via meiosis, each carrying 23 chromosomes (haploid). Fertilization: Sperm (23) + Ovum (23) $\to$ Zygote (46 chromosomes, 23 pairs). Development: Zygote divides by mitosis to form blastocyst, then embryo; every new cell copies the same set of chromosomes. 8. Cell Differentiation and Gene Activation Blastocyst develops into embryo; cells specialize into tissues and organs. All cells have identical DNA, but different genes are turned on/off based on: Signals from neighboring cells Chemical gradients Position and timing in the embryo Gene Activation Control: Epigenetic mechanisms (DNA methylation, histone modification). 9. DNA – Structure and Role Location: Nucleus, packaged as chromatin and chromosomes. Function Analogy: DNA (instruction book in central library) $\to$ RNA (copy) $\to$ Cytoplasm (laboratory) where proteins are built. DNA Composition: Polymer of nucleotides. Each nucleotide = pentose sugar (deoxyribose), phosphate group, nitrogenous base. DNA sugar is deoxyribose (one less oxygen than RNA sugar). Nitrogenous Bases: Purines (A, G) and Pyrimidines (C, T). Base Pairing Rules: A pairs with T, G pairs with C (complementary via hydrogen bonds). DNA Structure: Two antiparallel strands of nucleotides, twisted into a double helix. 10. 
DNA Sequencing Determining the exact order of nucleotide bases (A, T, C, G) in a DNA molecule. 11. Gene and Genome Gene: Continuous segment of DNA that encodes instructions for a specific protein or functional RNA. Coding strand and template strand; template read 3' $\to$ 5' to synthesize RNA 5' $\to$ 3'. Coding vs Non-coding DNA: Coding DNA: Contains genes that produce proteins or functional RNAs. Non-coding DNA: Does not code proteins but regulates gene activity (on/off switches, regulatory elements). Within Genes: Exons: Coding segments that remain in mature mRNA and encode protein. Introns: Non-coding segments removed during splicing; help regulate which exons are included. Genome: Complete set of genetic material of an organism (all genes + non-coding DNA). Includes protein-coding genes and regulatory/other non-coding regions. 12. RNA and its Types Location: Both nucleus and cytoplasm. Features: Usually single-stranded, uses ribose sugar, uses Uracil (U) instead of Thymine (T). Role: Acts as intermediary using DNA instructions to build proteins. Three Main Functional RNAs: mRNA (messenger RNA): Carries gene's coding sequence from nucleus to cytoplasm. rRNA (ribosomal RNA): Structural and catalytic component of ribosomes; helps identify start codon (AUG) and reads codons in triplets. tRNA (transfer RNA): Brings specific amino acids to ribosome by matching its anticodon to mRNA codons. 13. Central Dogma, Transcription, Translation, Gene Expression Central Dogma: Information flow DNA $\to$ RNA $\to$ Protein. Gene Expression: Using a gene to produce a functional product (protein/RNA); involves transcription, translation, and regulation. Transcription (in nucleus) RNA polymerase binds near promoter regions (e.g., TATA box) 25–35 bp upstream of gene start; promoter must be active. Steps: DNA double helix unwinds locally. Free RNA nucleotides pair with template DNA bases to synthesize pre-mRNA (primary transcript). 
Pre-mRNA undergoes splicing: introns removed, exons joined $\to$ mature mRNA. Alternative Splicing: Same gene can produce multiple protein variants by including/excluding different exons. Translation (in cytoplasm) Mature mRNA exits nucleus and binds to ribosomes. Ribosome reads mRNA in codons (triplets of bases). Start Codon: AUG (codes methionine). Stop Codons: UAA, UAG, UGA (signal termination). Genetic Code: Mapping from codons to amino acids and stops. tRNA brings matching amino acids; ribosome links them into a polypeptide chain, which then folds into a functional protein. Degenerate: Most amino acids encoded by multiple codons. Many mutations in the third base of a codon do not change the amino acid (silent mutations). 14. Proteins, Amino Acids, Peptides, Proteome Why the body needs proteins Building and repairing tissues, muscle growth. Enzymes and many hormones. Immune defense and antibody production. Transport and storage (e.g., hemoglobin, albumin). Backup energy source, structural support. Protein Sources: Synthesized by cells (endogenous), obtained from diet (exogenous). Amino Acids Monomers/building blocks of proteins. Human body uses 20 different amino acids. 9 essential (from food). 11 non-essential (synthesized by cells). Each amino acid has a side chain (R-group); 20 different R-groups define 20 amino acids. Peptides and Polypeptides Amino acids connect via peptide bonds to form chains. Short chains = peptides; long chains = polypeptides (proteins). Changing sequence or length yields different proteins. Protein Structure and Denaturation Protein function depends on its final 3D folded structure. Heat or chemicals can alter folding (denaturation) without changing amino acid order, resulting in loss of function. Proteome Entire set of proteins produced by an organism, tissue, or cell at a particular time under specific conditions. Properties: Dynamic: changes with environment, signals, disease states. 
Context-dependent: different cell types express different subsets of proteins. Represents functional output of the genome. 15. Molecular Mutation and its Consequences Molecular Mutation: Change in DNA sequence/structure at molecular level. Types of Mutations: Substitution: one base replaced. Insertion: extra base(s) added. Deletion: base(s) removed. Duplication: DNA segment repeated. Inversion: DNA segment reversed. Effects: May alter protein sequence or prevent correct protein formation. Can cause genetic disorders and cancers, or create beneficial variation for evolution. Many mutations are neutral or silent (no effect on protein). 16. Open Reading Frame (ORF) & Algorithm Reading Frame Basics Definition: Way of dividing DNA/RNA into consecutive, non-overlapping codons (triplets). A change in reading frame changes the codons and thus the amino acid sequence. Both strands (coding and complementary) are read 5' to 3'; each has 3 possible frames depending on the starting position (shift = 0, 1, 2). 6 total reading frames: 3 on each of 2 strands. Central Dogma Context for ORFs DNA $\to$ pre-mRNA (has introns/exons) $\to$ splicing $\to$ mature mRNA $\to$ translation. Only mature mRNA is read by the ribosome. Start/Stop Codons Start Codon (DNA): ATG; (RNA): AUG. Stop Codons (DNA): TAA, TAG, TGA; (RNA): UAA, UAG, UGA. Reading always in 5' $\to$ 3' direction. Open Reading Frame (ORF) Definition Definition: Continuous stretch from a start codon to a stop codon, without internal stop codons. The longer the ORF, the higher the chance it is protein-coding. Objective: Find all ORFs, usually the longest one is functionally important. Calculating the 6 Reading Frames For a given DNA strand (Forward): +0 frame: starts at base 0 +1 frame: starts at base 1 +2 frame: starts at base 2 For the reverse complement strand: –0, –1, –2 (as above, starting on bases 0, 1, 2 after reverse-complementing). Frame examples (for S = ATGAAGTGACCTTAG): +0: ATG AAG TGA ... +1: TGA AGT GAC ... 
+2: GAA GTG ACC ...
Compute reverse complement, then same +0, +1, +2 frames.

ORF Algorithm (Stepwise for Exams)
Read Input: DNA sequence $S$, length $n$; min ORF length $L_{min}$.
Strand Setup: Forward: $S$ as given. Reverse: reverse $S$, then take the complement (A $\leftrightarrow$ T, C $\leftrightarrow$ G).
Frame Setup: For each strand, make 3 frames (shifts 0, 1, 2).
ORF Search in Each Frame: Go codon-by-codon. When a start codon (ATG) is found, mark its position. Continue in the same frame until a stop codon (TAA/TAG/TGA). Extract from start to stop (inclusive). If ORF length $\ge L_{min}$, record: strand, frame, start/end, length, sequence. Repeat for all starts/stops in all frames.
Result: List ORFs by position, frame, strand, sequence, length.

Notes on ORFs
Complementary strand ORFs may or may not be functionally relevant.
An ORF ends at the first in-frame stop codon after its start codon; a stretch that runs past an in-frame stop to a later stop codon is not a valid ORF.
If a stop codon is present without a preceding in-frame start codon, it is not an ORF.

Example (summarized)
$S$ = 5'-ATGAAGTGACCTTAG-3'
Forward frames: +0: ATG AAG TGA CCT TAG; +1: TGA AGT GAC ...; +2: GAA GTG ACC ...
Frame +0 contains one ORF: ATG AAG TGA.
Reverse complement: 5'-CTAAGGTCACTTCAT-3' (then take the same three frames, starting at bases 0, 1, 2).

17. Chromosomes & oriC: Prokaryotes vs Eukaryotes
Q1: What is oriC?
A1: oriC (replication origin) is a specific genomic region where DNA replication begins.
Q2: Compare prokaryotic and eukaryotic chromosomes with respect to oriC and genome size.
Prokaryotes: Chromosome is usually a single, circular DNA molecule. Typically has a single oriC. Smaller genome size, making oriC simpler to locate.
Eukaryotes: Chromosomes are linear. Have multiple origins of replication per chromosome. Large genome size (long base-pair length), making it complex to locate all origins.
Q3: Why is oriC identification important for gene therapy?
A3: For long-term gene therapy (introducing functional genes into host chromosomes to correct defective genes), the inserted gene must be replicated along with the host genome.
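Aside on section 16: the stepwise ORF algorithm can be sketched in Python roughly as follows. This is a minimal illustration, not a reference implementation. The names find_orfs, reverse_complement, and min_len are assumptions made here, coordinates are 0-based on the strand being read, and only the first ATG before each stop is reported (nested in-frame starts are ignored).

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}


def reverse_complement(dna):
    """Reverse the strand, then complement each base."""
    return "".join(COMPLEMENT[base] for base in reversed(dna))


def find_orfs(dna, min_len=0):
    """Scan all six reading frames for ATG...stop stretches.

    Returns tuples (strand, frame, start, end, sequence); start is
    inclusive and end is exclusive, both measured on the strand read.
    """
    orfs = []
    for strand, seq in (("+", dna), ("-", reverse_complement(dna))):
        for frame in range(3):
            start = None  # index of the currently open ATG, if any
            for i in range(frame, len(seq) - 2, 3):
                codon = seq[i:i + 3]
                if start is None and codon == "ATG":
                    start = i
                elif start is not None and codon in STOP_CODONS:
                    orf = seq[start:i + 3]  # start to stop, inclusive
                    if len(orf) >= min_len:
                        orfs.append((strand, frame, start, i + 3, orf))
                    start = None
    return orfs


print(find_orfs("ATGAAGTGACCTTAG"))
# [('+', 0, 0, 9, 'ATGAAGTGA')]
```

For the example sequence S, only forward frame +0 yields an ORF; the trailing TAG in that frame has no open start codon, so it is ignored.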
Understanding the replication origin and mechanism is crucial to ensure the therapeutic gene is faithfully copied and passed to daughter cells. 18. Bacterial (Prokaryotic) DNA Elements Q1: Define plasmid. A1: A plasmid is a circular, double-stranded DNA molecule separate from the main bacterial chromosome. It contains non-essential but often beneficial genes. Q2: What is the difference between a bacterial chromosome and a plasmid? Bacterial Chromosome: Circular, double-stranded DNA containing essential genes for survival and reproduction. Duplicated and passed to daughter cells during division. Plasmid: Circular, double-stranded DNA, separate from chromosome. Contains non-essential but beneficial genes (e.g., antibiotic resistance). Replicates independently and can be transferred between cells/species. Q3: Discuss plasmids and their applications in biotechnology. A3: Plasmids are small, circular DNA molecules found in bacteria, separate from the main chromosome. They carry genes that confer advantageous traits, like antibiotic resistance. Plasmids can replicate independently and are transferable between bacteria, even across species (e.g., via conjugation). In biotechnology, plasmids are widely used as vectors to introduce foreign genes into cells. For example, they are engineered to carry the human insulin gene, which is then expressed by bacteria to produce insulin for medical use. 19. Origin of Replication (OriC) & DnaA Motif Q1: Define a motif. A1: A motif is a short, recurring pattern in DNA, RNA, or protein sequences that has a specific function and is often recognized and bound by specific proteins. Q2: What is a DnaA box? A2: A DnaA box (or DnaA motif) is a specific DNA motif present at the oriC region in bacteria. The DnaA protein binds to this motif to initiate DNA replication. Q3: What is the typical length of a DnaA box? A3: Typically, a DnaA box is a 9 base-pair (bp) motif. 
Q4: Explain the role of DnaA protein and DnaA box in bacterial DNA replication initiation. A4: The DnaA box is a specific 9-bp DNA motif found clustered within the oriC region of bacterial chromosomes. The DnaA protein binds to these DnaA boxes. This binding triggers the local unwinding of the DNA double helix at the oriC. This initial unwinding allows for the recruitment of other replication machinery, such as helicase and DNA polymerase, which further unwind the DNA and begin the replication process. Q5: Describe the sequence features of oriC in bacteria. A5: Bacterial oriC regions are characterized by the presence of multiple, clustered DnaA box motifs, which are typically 9 base pairs long. These motifs serve as binding sites for the DnaA initiator protein. The clustering of these motifs facilitates the cooperative binding of DnaA, which is essential for initiating the unwinding of the DNA helix and forming the replication bubble. 20. k-mer: Definition, Counting, and Frequent k-mers Q1: Define k-mer. A1: A k-mer is a substring of length $k$ in a DNA (or RNA) sequence. For example, if $k=3$, "ACT" is a 3-mer. Q2: Explain how frequent k-mer analysis can be used to detect potential DnaA boxes. A2: DnaA boxes are specific 9-mer motifs that occur frequently and are clustered near the oriC region. By applying a frequent k-mer analysis (e.g., for $k=9$) to a candidate genomic region, we can identify k-mers that appear with high frequency. These highly frequent 9-mers are then considered potential candidates for DnaA boxes, providing a clue to the location of oriC. 
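As a quick illustration of A2, frequent k-mer analysis can be sketched with collections.Counter on a toy string. The function names and the toy sequence are assumptions chosen for illustration; a real analysis would run with $k=9$ on a candidate region near oriC.

```python
from collections import Counter


def kmer_counts(dna, k):
    """Count every overlapping substring of length k."""
    return Counter(dna[i:i + k] for i in range(len(dna) - k + 1))


def most_frequent_kmers(dna, k):
    """Return the set of k-mers that share the highest count."""
    counts = kmer_counts(dna, k)
    if not counts:  # k longer than the sequence
        return set()
    top = max(counts.values())
    return {kmer for kmer, count in counts.items() if count == top}


toy = "ACGTTGCATGTCGCATGATGCATGAGAGCT"
print(sorted(most_frequent_kmers(toy, 4)))
# ['CATG', 'GCAT']
```

Both reported 4-mers occur three times in the toy string; every other 4-mer occurs at most twice.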
Pseudocode (Python): Generate all k-mers in a DNA string
def generate_kmers(dna, k):
    # Slide a window of length k across the sequence.
    kmers = []
    for i in range(len(dna) - k + 1):
        kmers.append(dna[i:i+k])
    return kmers

Pseudocode (Python): Count occurrences of a pattern in a DNA string
def count_pattern(dna, pattern):
    # Overlapping occurrences are counted.
    count = 0
    for i in range(len(dna) - len(pattern) + 1):
        if dna[i:i+len(pattern)] == pattern:
            count += 1
    return count

Pseudocode (Python): Find single most frequent k-mer
def frequent_kmer(dna, k):
    kmer_counts = {}
    for i in range(len(dna) - k + 1):
        kmer = dna[i:i+k]
        kmer_counts[kmer] = kmer_counts.get(kmer, 0) + 1
    if not kmer_counts:
        return None
    # Ties are broken in favor of the k-mer that appears first in dna.
    return max(kmer_counts, key=kmer_counts.get)

Pseudocode (Python): Find set of all highly frequent k-mers (all k-mers with max frequency)
def frequent_kmers_all_max(dna, k):
    kmer_counts = {}
    for i in range(len(dna) - k + 1):
        kmer = dna[i:i+k]
        kmer_counts[kmer] = kmer_counts.get(kmer, 0) + 1
    if not kmer_counts:
        return set()
    max_count = max(kmer_counts.values())
    freq_kmers = set()
    for kmer, count in kmer_counts.items():
        if count == max_count:
            freq_kmers.add(kmer)
    return freq_kmers

21. Pattern Matching Problem
Q1: What is the pattern matching problem?
A1: Given a genome sequence and a specific pattern (k-mer), the pattern matching problem aims to find all starting positions where the pattern occurs within the genome.
Q2: What is meant by a clump in DNA sequences?
A2: A clump refers to a region in a DNA sequence where a specific k-mer appears many times within a relatively small genomic window. This clustering of occurrences may indicate a functional significance.
Q3: Explain how pattern matching and clump detection help identify candidate oriC regions.
A3: Pattern matching allows us to locate all occurrences of a specific k-mer (like a potential DnaA box) in a genome.
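The pattern matching step just described can be sketched as a single scan that collects start positions. The function name and the toy genome are illustrative assumptions; positions are 0-based and overlapping occurrences are reported.

```python
def pattern_positions(genome, pattern):
    """Return every 0-based index at which pattern starts in genome."""
    k = len(pattern)
    return [i for i in range(len(genome) - k + 1)
            if genome[i:i + k] == pattern]


print(pattern_positions("GATATATGCATATACTT", "ATAT"))
# [1, 3, 9]
```

Positions 1 and 3 overlap, which is why a sliding comparison is used here rather than a non-overlapping search.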
Once these positions are known, clump detection can be applied to identify regions where these k-mers are unusually close together. A clump of specific k-mers in a region suggests a possible replication initiation site, analogous to the clustered DnaA binding sites in oriC. Q4: Why is a single motif-based search for oriC unreliable across different bacterial species? A4: DnaA box sequences, while functionally similar, can vary across different bacterial species (i.e., they have different consensus sequences). A motif (e.g., a specific 9-mer like "ATGATCAAG") that functions as a DnaA box in one species might not be the DnaA box in another. Relying solely on a single fixed motif would lead to inaccurate oriC identification in diverse bacterial genomes. 22. Clump Finding Problem Q1: Define the clump finding problem. A1: The clump finding problem is to identify all k-mers that occur at least $t$ times within any sliding window of length $L$ in a given genome sequence. Q2: Describe the clump finding problem and its application in oriC detection. A2: The clump finding problem seeks to locate k-mers that show localized enrichment (appear frequently within a defined window). In the context of oriC detection, the oriC region is known to contain clusters of DnaA boxes or other regulatory motifs. By identifying k-mers that form clumps, we can pinpoint genomic regions that are potentially the oriC, as these clumps might correspond to the clustered DnaA binding sites. Q3: Discuss limitations of the clump-finding approach for detecting DnaA boxes. A3: While clump finding can identify regions with enriched k-mers, it has limitations for specifically detecting DnaA boxes. It might identify hundreds of k-mers forming clumps, making it difficult to discern which one corresponds to the actual DnaA box. Furthermore, not all clumps are functionally relevant, and a clump alone doesn't provide enough structural signal to definitively identify oriC. 
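The clump finding problem from Q1 (k-mers occurring at least $t$ times within some window of length $L$) can be sketched as below. This is one possible approach under stated assumptions: it records the start positions of each k-mer and checks whether any $t$ consecutive occurrences fit inside one window. The function name and toy genome are hypothetical, and the window length is assumed not to exceed the genome length.

```python
from collections import defaultdict


def find_clumps(genome, k, window, t):
    """Return k-mers with at least t occurrences inside some window."""
    positions = defaultdict(list)
    for i in range(len(genome) - k + 1):
        positions[genome[i:i + k]].append(i)

    clumps = set()
    for kmer, starts in positions.items():
        # t occurrences beginning at starts[j] .. starts[j + t - 1] fit in
        # one window when the span from the first start to the end of the
        # last occurrence is at most the window length.
        for j in range(len(starts) - t + 1):
            if starts[j + t - 1] + k - starts[j] <= window:
                clumps.add(kmer)
                break
    return clumps


toy = "ATGCATGAATGTTTTCCCCATG"
print(find_clumps(toy, k=3, window=11, t=3))
# {'ATG'}
```

In the toy genome, ATG starts at positions 0, 4, 8, and 19; the first three fit in an 11-base window, so ATG forms a (3-mer, 11, 3) clump, but not with a window of 10.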
This approach often needs to be combined with other methods for accurate DnaA box identification. 23. Bacterial Chromosomes, Leading/Lagging Strands & GC Content Q1: What are oriC and terC? A1: oriC is the origin of replication, where DNA replication begins. terC is the terminus of replication, where DNA replication ends. Q2: Define leading and lagging strands. A2: In DNA replication, the leading strand is synthesized continuously in the 5' $\to$ 3' direction, towards the replication fork. The lagging strand is synthesized discontinuously in short segments (Okazaki fragments) in the 5' $\to$ 3' direction, away from the replication fork. Q3: Explain why one half of a bacterial chromosome is G-rich and the other C-rich. A3: During bidirectional DNA replication from oriC, the leading strand is synthesized continuously, while the lagging strand is synthesized discontinuously. Due to inherent biases in the replication machinery, the leading strand tends to accumulate more G nucleotides, and the lagging strand tends to accumulate more C nucleotides. Since the two replication forks move in opposite directions from oriC to terC, one half of the circular chromosome will predominantly be replicated as a leading strand, becoming G-rich, while the other half will be replicated as a lagging strand, becoming C-rich. Q4: Describe how strand bias arises during replication. A4: Strand bias, leading to differences in G/C content, arises from the asymmetric nature of DNA replication. The leading strand is synthesized smoothly, while the lagging strand requires repeated priming and synthesis of Okazaki fragments. The enzymes involved in these processes, along with potential differences in repair mechanisms or mutational pressures, can lead to a slight preference for incorporating certain bases (like G on the leading strand and C on the lagging strand) on each of the newly synthesized strands. 
This cumulative bias over the length of the replichore (the DNA replicated by one fork) results in one half of the chromosome being G-rich and the other C-rich. 24. GC Skew and GC-skew Plot Q1: Define GC skew. A1: GC skew is a measure of the imbalance between the number of guanine (G) and cytosine (C) nucleotides in a DNA sequence. It is typically defined as $(G - C) / (G + C)$. Q2: Explain how GC skew is used to identify oriC in bacterial genomes. A2: In bacterial genomes, the GC skew changes significantly at the oriC and terC regions due to strand compositional bias during replication (leading strand becomes G-rich, lagging strand becomes C-rich). When a cumulative GC skew plot is generated along the genome, the oriC typically corresponds to a sharp decrease, reaching a distinct trough (minimum) in the cumulative skew curve. This characteristic pattern allows for the identification of potential oriC locations. Q3: Discuss strand bias and how GC skew analysis reflects replication origin and terminus. A3: Strand bias refers to the unequal distribution of nucleotides between the leading and lagging strands due to asymmetric replication. In bacteria, the leading strand tends to accumulate more Gs, and the lagging strand more Cs. GC skew quantifies this bias. A cumulative GC skew plot effectively integrates these local biases. The point where replication initiates (oriC) marks the transition from one leading/lagging strand pair to another, causing a significant shift in the cumulative GC skew. This is typically observed as a sharp downward slope and a minimum value (trough) at oriC. Conversely, the replication terminus (terC) often correlates with a peak (maximum) in the cumulative GC skew, where the strand bias reverses as the two replication forks meet. 25. oriC Finding Algorithm (Combined Structural + Motif Signals) Q1: Describe the algorithm to find oriC using GC skew and frequent k-mers. 
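As groundwork for this question, the GC skew of section 24 can be computed as sketched below. Two conventions are shown, and both are assumptions about how the plot is built: gc_skew gives the normalized $(G - C)/(G + C)$ value for one sequence or window, while cumulative_skew keeps a running count (+1 for G, -1 for C), which is one common way to draw the cumulative curve. Positions are counted as the number of bases read so far.

```python
def gc_skew(seq):
    """(G - C) / (G + C) for one sequence or window; 0.0 if no G or C."""
    g, c = seq.count("G"), seq.count("C")
    return (g - c) / (g + c) if g + c else 0.0


def cumulative_skew(genome):
    """Running G-minus-C count; entry i covers the first i bases."""
    skew = [0]
    for base in genome:
        step = 1 if base == "G" else -1 if base == "C" else 0
        skew.append(skew[-1] + step)
    return skew


def min_skew_positions(genome):
    """All positions where the cumulative skew reaches its minimum."""
    skew = cumulative_skew(genome)
    lowest = min(skew)
    return [i for i, value in enumerate(skew) if value == lowest]


print(gc_skew("GGGC"))  # 0.5
toy = "TAAAGACTGCCGAGAGGCCAACACGAGTGCTAGAACGAGGGGCGTAAACGCGGGTCCGAT"
print(min_skew_positions(toy))  # [11, 24]
```

On a real bacterial genome the minimum of the cumulative curve is only a candidate for oriC; the toy string merely exercises the arithmetic.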
A1:
(1) Compute GC-skew array: Calculate the cumulative GC skew for each position in the genome.
(2) Find candidate oriC region: Identify the position in the genome where the cumulative GC skew is at its minimum (this indicates the structural signal of oriC).
(3) Restrict local window: Define a window of a specific length (e.g., $L$) centered around the minimum cumulative GC skew position.
(4) Find all k-mers in this window: Within this local window, generate all k-mers (typically $k=9$ for DnaA boxes) and count their occurrences.
(5) Identify most frequent k-mers: The k-mers that appear with the highest frequency within this window are considered candidate DnaA boxes (motif signal).
(6) Final oriC detection: The oriC is identified as the region around the minimum cumulative GC skew that is enriched with these most frequent k-mers.
Q2: Explain the roles of structural and motif signals in oriC detection.
A2: The structural signal, provided by GC skew analysis, reflects the large-scale compositional asymmetry of the bacterial chromosome due to replication. The minimum in the cumulative GC skew plot points to the general vicinity of oriC. The motif signal, derived from frequent k-mer analysis, identifies the specific, short DNA patterns (like DnaA boxes) that are known to bind initiator proteins and are clustered within oriC. Combining both signals allows for more accurate identification: the GC skew narrows down the search, and the frequent k-mers confirm the presence of the specific functional elements expected at an oriC.
Q3: Discuss three approaches to oriC detection: Frequent k-mers / pattern matching, Clump finding, and GC skew + frequent k-mers. Compare their advantages and limitations.
A3: Frequent k-mers / Pattern Matching: Advantages: Simple to implement, can identify specific known motifs.
Limitations: Highly sensitive to the exact motif sequence, which can vary between species; a single motif might not be sufficient; doesn't account for clustering or large-scale genomic features. Clump Finding: Advantages: Identifies regions where any k-mer is unusually frequent, reflecting the clustered nature of DnaA boxes. Less dependent on a single, fixed motif sequence. Limitations: Can return many false positives (non-functional clumps); doesn't distinguish between functional motifs and other repetitive sequences; computationally more intensive than simple pattern matching; still lacks a global structural signal. GC Skew + Frequent k-mers: Advantages: Combines a global structural signal (GC skew minimum) with local motif enrichment. The GC skew effectively narrows down the search space for oriC, and the frequent k-mers then identify the specific DnaA box candidates within that refined region. This integrated approach significantly improves accuracy and robustness. Limitations: Requires accurate GC skew calculation; the choice of window size for k-mer analysis around the skew minimum can influence results; still relies on the assumption that DnaA boxes are the most frequent k-mers in the oriC region. 26. Sequence Analysis – Basics Q1: Define sequence analysis. A1: Sequence analysis is the examination of DNA, RNA, or protein sequences to understand their structure, function, and evolutionary relationships. Q2: Mention two applications of sequence analysis. A2: Predicting the function of unknown genes/proteins by comparing them with known ones. Detecting conserved (important) regions that evolution has preserved. Q3: Explain how sequence analysis helps infer gene function. A3: If a newly discovered gene has a sequence similar to a gene with a known function (e.g., an enzyme or a structural protein) in another organism, it is highly probable that the new gene performs a similar function. 
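Aside on section 25: the combined procedure (GC skew minimum, then frequent k-mers in a window around it) can be sketched as below. This is a rough illustration under assumptions: the function name, the default window of 500 bases, and the toy genome are all hypothetical, the cumulative skew is a running G-minus-C count, and only the first minimum is used.

```python
from collections import Counter


def candidate_dnaa_boxes(genome, k=9, window=500):
    """Return (min_skew_position, most frequent k-mers near it)."""
    # Structural signal: first position of the minimum cumulative skew.
    skew, lowest, min_pos = 0, 0, 0
    for i, base in enumerate(genome, start=1):
        skew += (base == "G") - (base == "C")
        if skew < lowest:
            lowest, min_pos = skew, i

    # Motif signal: count k-mers in a window centered on that position.
    start = max(0, min_pos - window // 2)
    region = genome[start:start + window]
    counts = Counter(region[i:i + k] for i in range(len(region) - k + 1))
    if not counts:
        return min_pos, set()
    top = max(counts.values())
    return min_pos, {kmer for kmer, count in counts.items() if count == top}


toy = "CATCTACCTTGATTGATTGAGGAGTG"
print(candidate_dnaa_boxes(toy, k=4, window=16))
# (8, {'TTGA'})
```

As the limitations in A3 note, the result depends on the window size, and the most frequent k-mer in the window is only a candidate, not a confirmed DnaA box.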
This inference is based on the principle that sequence similarity often implies functional and evolutionary relatedness. 27. Benefits / Applications of Sequence Analysis Q1: List any four benefits of sequence analysis. A1: Understanding genetic information and its link to biological function. Identifying genetic variants associated with diseases. Discovering new genes and proteins. Studying evolutionary relationships among organisms. Q2: Discuss the role of sequence analysis in disease study/drug discovery. A2: In disease study, sequence analysis helps identify genetic mutations or variations linked to specific diseases (e.g., SNPs, indels). This can lead to early diagnosis, risk assessment, and personalized treatment strategies. In drug discovery, understanding the sequence and structure of disease-causing proteins (targets) allows for the rational design of drugs that can bind to and modulate their activity. It also aids in identifying potential off-target effects and optimizing drug efficacy. 28. Key Aspects of Sequence Analysis Q1: Name two key aspects of sequence analysis. A1: Motif finding Sequence alignment Q2: Short note on motif finding / gene prediction / phylogenetic analysis (any one). A2 (Motif Finding): Motif finding involves discovering short, recurring patterns within DNA, RNA, or protein sequences that are often functionally significant. These motifs can be binding sites for transcription factors, active sites of enzymes, or structural elements. Identifying motifs helps in understanding gene regulation, protein function, and evolutionary conservation. 29. Sequence Alignment – Basic Ideas Q1: Define identity. A1: Identity refers to the exact same residues (nucleotide or amino acid) found at corresponding positions when two sequences are aligned. Q2: Define similarity. 
A2: Similarity includes not only identical residues but also amino acids that are different but share similar biochemical properties (e.g., both are hydrophobic, or both are positively charged), suggesting they might perform similar roles. Q3: Differentiate identity and similarity. A3: Identity is a strict measure counting only exact matches at aligned positions. Similarity is a broader measure that includes identical matches and also accounts for conservative substitutions where different residues have similar chemical properties, implying a conserved function despite a sequence change. Q4: Differentiate pairwise and multiple sequence alignment. A4: Pairwise alignment compares exactly two sequences to find regions of similarity. Multiple sequence alignment (MSA) aligns more than two sequences simultaneously to identify conserved regions across a larger group, often used for phylogenetic analysis or motif discovery. Q5: Explain the need for sequence alignment in bioinformatics. A5: Sequence alignment is fundamental in bioinformatics because it allows us to infer functional, structural, and evolutionary relationships between biological sequences. By lining up sequences and identifying regions of similarity, we can deduce shared ancestry (homology), predict the function of unknown genes/proteins, locate conserved domains, and understand the impact of mutations. 30. Identity and Similarity – Formula and Idea Q1: Write the formula for identity. A1: Identity(A,B) = (Number of identical elements in both A and B) / (Length of alignment). Q2: Why are mismatch costs more relevant in proteins than in DNA/RNA? A2: Mismatch costs are more nuanced and relevant in protein alignment because amino acids have diverse biochemical properties (e.g., charge, hydrophobicity, size). A mismatch between two chemically similar amino acids might be less detrimental to protein function than a mismatch between two very different ones. 
DNA/RNA nucleotides, having less diverse properties, typically treat all mismatches equally or with less distinction. Q3: Explain identity and similarity with an example alignment. A3: Consider aligning two protein segments: Sequence 1: A L V S G Sequence 2: A I V T G Here, 'A', 'V', 'G' are identical (3 matches). If the alignment length is 5, Identity = 3/5 = 60%. For similarity, 'L' (Leucine) and 'I' (Isoleucine) are both hydrophobic amino acids, so their mismatch might be considered "similar" or a conservative substitution. 'S' (Serine) and 'T' (Threonine) are both polar, uncharged amino acids, also considered similar. So, for similarity, we might count all 5 positions as similar, giving 100% similarity, even if only 60% identity. 31. Homology, Orthologs, Paralogs Q1: Define homology. A1: Homology refers to the relationship between two biological sequences (DNA, RNA, or protein) that share a common evolutionary ancestor. It is a qualitative (yes/no) statement, not a measure of percentage. Q2: Distinguish orthologs and paralogs. A2: Orthologs are homologous genes in different species that diverged from a common ancestral gene due to a speciation event. They often retain similar functions. Paralogs are homologous genes within the same species that arose from a gene duplication event. They can evolve new or different functions over time. Q3: Explain homology, orthologs and paralogs with examples. A3: Homology: Human hemoglobin and chimpanzee hemoglobin are homologous because they both descended from a common ancestral hemoglobin gene. Orthologs: The human $\alpha$-hemoglobin gene and the mouse $\alpha$-hemoglobin gene are orthologs. They diverged when the human and mouse lineages split from a common ancestor, and both function in oxygen transport. Paralogs: The human $\alpha$-hemoglobin gene and the human $\beta$-hemoglobin gene are paralogs. 
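Aside on section 30: the identity and similarity figures for the A L V S G / A I V T G example can be reproduced with a short sketch. The grouping of amino acids in SIMILAR_GROUPS is a simplified assumption made only for illustration; real tools score conservative substitutions with a substitution matrix rather than fixed groups.

```python
# Hypothetical property groups (one-letter codes), for illustration only.
SIMILAR_GROUPS = [set("AVLIM"), set("ST"), set("KRH"),
                  set("DE"), set("FYW"), set("NQ")]


def percent_identity(a, b):
    """Exact matches divided by alignment length, as a percentage."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")
    matches = sum(x == y and x != "-" for x, y in zip(a, b))
    return 100.0 * matches / len(a)


def percent_similarity(a, b):
    """Identical or same-group residues divided by alignment length."""
    if len(a) != len(b):
        raise ValueError("aligned sequences must have equal length")

    def similar(x, y):
        if x == "-" or y == "-":  # a gap is never similar to anything
            return False
        return x == y or any(x in g and y in g for g in SIMILAR_GROUPS)

    hits = sum(similar(x, y) for x, y in zip(a, b))
    return 100.0 * hits / len(a)


print(percent_identity("ALVSG", "AIVTG"))    # 60.0
print(percent_similarity("ALVSG", "AIVTG"))  # 100.0
```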
They arose from a gene duplication event in an ancestral genome (an ancient duplication that long predates the human lineage) and have evolved to perform slightly different but related functions within the hemoglobin complex. 32. Basic Sequence Alignment Model Q1: Explain the term “residue”. A1: In molecular biology, "residue" refers to an individual monomeric unit of a polymer. For DNA/RNA, the residues are nucleotides (A, C, G, T in DNA; A, C, G, U in RNA). For proteins, the residues are amino acids. Q2: What is an indel? How does it appear in alignment? A2: An indel (insertion/deletion) is a type of mutation where one or more nucleotides are either inserted into or deleted from a DNA sequence. In a sequence alignment, an indel appears as a gap (represented by a dash '-') in one sequence, aligned opposite a residue in the other sequence. Q3: Describe different types of mutations and how they reflect in sequence alignments. A3: Substitution: A single nucleotide is replaced by another. In alignment, this appears as a mismatch (e.g., A aligned with G). Insertion/Deletion (Indel): One or more nucleotides are added or removed. In alignment, this appears as a gap in one sequence opposite residues in the other. No Change: The sequences are identical at a position. In alignment, this appears as a match. 33. Scoring Function for Alignments Q1: Give an example of a simple scoring scheme. A1: A simple scoring scheme for an alignment might be: $+1$ for each match, $-1$ for each mismatch, and $-2$ for each gap. The total score $F = (\text{number of matches} \times 1) + (\text{number of mismatches} \times (-1)) + (\text{number of gaps} \times (-2))$. Q2: Why do we penalize gaps in sequence alignment? A2: Gaps are penalized because insertions or deletions are generally rarer evolutionary events than single-base substitutions. Penalizing them prevents alignments from having too many gaps, which could artificially increase the number of matches and lead to biologically unrealistic alignments.
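The simple scheme from Q1 ($+1$ match, $-1$ mismatch, $-2$ per gap position) and the identity formula from section 30 can be checked by hand on small alignments. The sketch below is only an illustration: the function name `score_alignment` and its parameters are hypothetical, the two inputs are assumed to be already aligned (equal length, with '-' marking a gap), and every gap position is charged the same linear penalty.

```python
def score_alignment(a, b, match=1, mismatch=-1, gap=-2):
    """Score two pre-aligned, equal-length strings; '-' marks a gap.

    Returns (total_score, identity), where identity is the number of
    identical columns divided by the alignment length.
    """
    if len(a) != len(b) or not a:
        raise ValueError("aligned sequences must be non-empty and equal in length")
    matches = mismatches = gaps = 0
    for x, y in zip(a, b):
        if x == "-" or y == "-":
            gaps += 1          # a column with a gap in either sequence
        elif x == y:
            matches += 1       # identical residues
        else:
            mismatches += 1    # substitution
    total = matches * match + mismatches * mismatch + gaps * gap
    return total, matches / len(a)


# Protein example from section 30: 3 matches, 2 mismatches
print(score_alignment("ALVSG", "AIVTG"))      # (1, 0.6)
# A gapped DNA alignment: 5 matches, 1 mismatch, 1 gap, so F = 5 - 1 - 2
print(score_alignment("GGATCGA", "GG-TCCA")[0])  # 2
```

Real aligners usually replace the flat mismatch score with a substitution matrix for proteins and often use separate gap-opening and gap-extension penalties; this sketch keeps the flat scheme from Q1 on purpose.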
Q3: Explain the role of scoring functions in distinguishing true homology from random similarity. A3: Scoring functions assign numerical values to matches, mismatches, and gaps. A higher total alignment score indicates a more biologically plausible alignment. By optimizing this score, algorithms find the alignment that maximizes similarity while minimizing the evolutionary events (substitutions and indels) needed to explain it. This helps distinguish genuinely homologous sequences (which tend to have high scores) from sequences that might appear similar by random chance but would require many unlikely evolutionary events (resulting in low scores). 34. Substitution Matrices (High-level; BLOSUM introduced) Q1: Expand BLOSUM. A1: BLOSUM stands for BLOcks SUbstitution Matrix. Q2: Why do we use substitution matrices in protein alignment? A2: We use substitution matrices in protein alignment because not all amino acid mismatches are equally likely or equally detrimental to protein function. Substitution matrices (like BLOSUM) assign scores based on the observed frequencies of amino acid substitutions in evolutionarily related proteins, reflecting the biochemical similarity of amino acids. Aligning a Leucine with an Isoleucine (similar properties) should receive a higher score than aligning a Leucine with a Proline (very different properties). Q3: Explain how BLOSUM matrices are constructed and interpreted. A3: BLOSUM matrices are constructed by analyzing highly conserved, ungapped blocks of aligned protein sequences. The frequencies of amino acid substitutions within these blocks are used to derive log-odds scores for aligning any two amino acids. For a BLOSUM $x$ matrix (e.g., BLOSUM62), sequences within a block that are $x\%$ or more identical are clustered and weighted as a single sequence, so the substitution counts reflect sequences sharing less than $x\%$ identity; a lower $x$ therefore gives a matrix suited to more divergent sequences. Interpretation: Positive scores: Indicate that the substitution is observed more often than expected by chance and is likely tolerated (or is an identity match). Higher positive scores mean more common.
Negative scores: Indicate that the substitution is observed less often than expected by chance, suggesting it is biochemically disfavored or detrimental. Zero scores: Neutral, occurring as often as expected by chance. For example, BLOSUM62 is suitable for comparing moderately distant protein sequences, as it is derived from blocks in which sequences 62% or more identical are clustered together. 35. Global vs Local Alignment – Concept Q1: Differentiate global and local alignment. A1: Global alignment attempts to align the entire length of both sequences, from end to end, including any regions that don't match well. Local alignment, in contrast, identifies and aligns only the best matching subsequences (substrings) within the two sequences, ignoring poorly matching regions. Q2: When would you prefer local alignment over global alignment? A2: Local alignment is preferred when comparing divergent sequences, sequences of significantly different lengths, or when searching for conserved domains, motifs, or small functional regions within larger sequences. It's particularly useful when only a part of one sequence is expected to be homologous to a part of another, such as finding a specific protein domain in a much longer protein. Q3: Explain global and local alignment strategies with suitable examples. A3: Global Alignment (Needleman-Wunsch): Aims to align two sequences entirely. Sequence A: G G A T C G A Sequence B: G G - T C C A This is suitable for closely related genes of similar length, like comparing the same gene from two closely related species. Local Alignment (Smith-Waterman): Finds the most similar regions within two sequences. Sequence X: A T G C A C G T A T G C A G C T A G G A C A T Sequence Y: C G T A T G C A G Here, only a short, highly conserved region (CGTATGCAG) is aligned, even though the surrounding sequences are very different. The same strategy is ideal for finding a specific protein domain (e.g., a kinase domain) within a much larger, otherwise unrelated protein. 36.
Protein Structure Basics Q1: Define peptide bond; explain N-terminus and C-terminus. A1: A peptide bond is a covalent amide bond formed by a dehydration reaction between the carboxyl group of one amino acid and the amino group of another. Each polypeptide has an N-terminus (or amino terminus), which is the end with a free amino group ($\text{NH}_2$), and a C-terminus (or carboxyl terminus), which is the end with a free carboxyl group ($\text{COOH}$). Polypeptides are synthesized and written from N-terminus to C-terminus. Q2: Differentiate peptide, polypeptide, and protein. A2: A peptide is a short chain of amino acids linked by peptide bonds. A polypeptide is a longer, unbranched chain of amino acids. A protein is one or more polypeptides folded into a specific three-dimensional functional structure. All proteins are polypeptides, but not all polypeptides are functional proteins (they must fold correctly). Q3: Describe primary, secondary, tertiary, and quaternary structure of proteins. A3: Primary Structure: The linear sequence of amino acids in a polypeptide chain, held together by covalent peptide bonds. Secondary Structure: Local, regular folding patterns of the polypeptide backbone, primarily stabilized by hydrogen bonds between backbone atoms. Common types include $\alpha$-helices and $\beta$-pleated sheets. Tertiary Structure: The overall three-dimensional shape of a single polypeptide chain, resulting from interactions between amino acid side chains (R-groups), including hydrophobic interactions, ionic bonds, hydrogen bonds, and disulfide bridges. Quaternary Structure: The arrangement and association of multiple polypeptide chains (subunits) to form a functional protein complex. Stabilized by similar non-covalent interactions as tertiary structure. Q4: Explain $\alpha$-helix and $\beta$-pleated sheet with stabilizing interactions. A4: $\alpha$-helix: A coiled, spiral-like structure where the polypeptide backbone forms a right-handed helix. 
It is stabilized by hydrogen bonds formed between the backbone carbonyl oxygen (C=O) of residue $i$ and the backbone amide hydrogen (N-H) of the residue four positions ahead in the sequence ($i$ to $i+4$). $\beta$-pleated sheet: A sheet-like structure formed by two or more polypeptide strands lying side-by-side. It is stabilized by hydrogen bonds between the backbone carbonyl oxygens and amide hydrogens on adjacent strands. These strands can be parallel or anti-parallel. Q5: What are $\Phi$ and $\Psi$ angles? Describe Ramachandran plot and allowed regions. A5: $\Phi$ (phi) and $\Psi$ (psi) are the two main dihedral angles (rotation angles) in the polypeptide backbone. $\Phi$ represents the rotation around the N-$\text{C}\alpha$ bond, and $\Psi$ represents the rotation around the $\text{C}\alpha$-C bond. The Ramachandran plot is a graph that visualizes the sterically allowed combinations of $\Phi$ and $\Psi$ angles for amino acid residues in a polypeptide chain, showing regions where stable secondary structures (like $\alpha$-helices and $\beta$-sheets) are found. Allowed regions: These are areas on the plot where there are no steric clashes between atoms. $\alpha$-helix: Typically found around $\Phi \approx -60^\circ$, $\Psi \approx -45^\circ$. $\beta$-sheet: Typically found around $\Phi \approx -120^\circ$, $\Psi \approx +120^\circ$. Q6: Define $\beta$-turn and give its role in protein structure. A6: A $\beta$-turn (or hairpin turn) is a short, compact segment of a polypeptide chain, typically involving four amino acid residues, that causes the polypeptide backbone to make an abrupt reversal in direction. Its primary role is to connect adjacent secondary structure elements, especially two anti-parallel $\beta$-strands in a $\beta$-sheet, allowing the protein to fold into a compact globular shape. Q7: List methods to determine protein structure and mention databases PDB, SCOP.
A7: Common experimental methods to determine protein structure include X-ray crystallography, Nuclear Magnetic Resonance (NMR) spectroscopy, and Cryo-electron microscopy (Cryo-EM). Important protein structure databases are: PDB (Protein Data Bank): A worldwide repository for the 3D structural data of large biological molecules, including proteins and nucleic acids. SCOP (Structural Classification of Proteins): A database that classifies protein structures based on their evolutionary and structural relationships. Q8: Explain statistical method for secondary structure prediction and the concept of propensity. A8: Statistical methods for secondary structure prediction aim to predict whether a particular amino acid residue is likely to be in an $\alpha$-helix, $\beta$-sheet, or turn based on its intrinsic preference. The concept of propensity quantifies this preference. For each amino acid, a propensity value is calculated for each secondary structure type (e.g., Propensity$_\alpha$(i) for residue $i$ in an $\alpha$-helix). This value is typically the ratio of the fraction of occurrences of amino acid $i$ that lie in that secondary structure to the fraction of all residues that lie in that secondary structure, derived from known protein structures; a value above 1 means the amino acid favors that structure, and a value below 1 means it disfavors it. Regions are then predicted as secondary structures based on these aggregated propensity values. Q9: Outline the Chou–Fasman algorithm and its parameters P(a), P(b), P(turn), f(i)…f(i+3). A9: The Chou–Fasman algorithm is a historical method for predicting protein secondary structure based on statistical propensities. It assigns conformational parameters to each amino acid: P(a): Propensity for forming an $\alpha$-helix. P(b): Propensity for forming a $\beta$-sheet. P(turn): Propensity for forming a $\beta$-turn. Additionally, it uses four turn parameters, $f(i)$, $f(i+1)$, $f(i+2)$, $f(i+3)$, which represent the observed frequencies of specific amino acids at each of the four positions within a $\beta$-turn.
The algorithm scans the protein sequence with a sliding window, calculates average propensities within the window, and uses decision rules and these parameters to identify potential $\alpha$-helical, $\beta$-sheet, and turn regions. 37. RNA Structure Basics Q1: Describe basic RNA structural elements (helices, loops, bulges, junctions). A1: RNA secondary structure is composed of several basic elements: Helices (or Stems): Double-stranded regions formed by complementary base pairing (e.g., A-U, G-C). Loops: Single-stranded regions at the end of a helix (hairpin loop) or connecting two helices (internal loop or bulge). Bulges: Loops where one strand of a helix has extra unpaired bases. Junctions (or Multi-branch Loops): Regions where three or more helices meet. Q2: Explain complementary and non-complementary base pairs in RNA and their energies. A2: Complementary Base Pairs: These are standard Watson-Crick pairs (A-U, G-C) and the G-U wobble pair, which form hydrogen bonds and stabilize RNA secondary structure. The energies below are simplified, approximate per-pair values. A-U: Forms 2 hydrogen bonds ($\approx$ 2 kcal/mol). G-C: Forms 3 hydrogen bonds ($\approx$ 3 kcal/mol), making it the most stable pair. G-U wobble: Forms 2 hydrogen bonds in a non-Watson-Crick geometry and is the weakest of the three ($\approx$ 1 kcal/mol). Non-complementary Base Pairs: All other base combinations that do not form stable hydrogen bonds. These typically contribute to single-stranded regions like loops or bulges. Q3: Why is RNA/protein structure important in drug design? A3: Knowing the 3D structure of RNA or proteins is crucial for rational drug design. Drugs often work by binding specifically to target molecules (proteins or RNA) in a "lock-and-key" fashion. Understanding the target's structure allows medicinal chemists to design molecules that fit precisely into binding pockets, maximizing efficacy and minimizing off-target effects. This accelerates the drug discovery process and leads to more potent and selective therapies.
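Using the complementary pairs listed in Q2 (A-U, G-C, and the G-U wobble), the base-pair maximization idea that the next question (Q4) describes can be sketched in a few lines. This is a minimal, unoptimized illustration rather than a reference implementation: the names `PAIRS` and `max_base_pairs` are hypothetical, indices are 0-based instead of the 1-based $S(1, n)$ used in the text, and the optional `min_loop` parameter (minimum number of unpaired bases enclosed by a pair) is an added assumption that defaults to 0 so that the behavior matches the plain description.

```python
# Complementary pairs from Q2: Watson-Crick pairs plus the G-U wobble.
PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def max_base_pairs(seq, min_loop=0):
    """Return the maximum number of nested base pairs in an RNA string.

    S[i][j] holds the best count for seq[i..j] (inclusive, 0-based).
    Cells with i >= j stay 0, which serves as the base case.
    """
    n = len(seq)
    if n == 0:
        return 0
    S = [[0] * n for _ in range(n)]
    for span in range(1, n):                  # fill by increasing j - i
        for i in range(n - span):
            j = i + span
            # Case: i unpaired, or j unpaired.
            best = max(S[i + 1][j], S[i][j - 1])
            # Case: i pairs with j (if complementary and loop is long enough).
            if (seq[i], seq[j]) in PAIRS and j - i > min_loop:
                best = max(best, S[i + 1][j - 1] + 1)
            # Case: bifurcation into two independent sub-structures.
            for k in range(i + 1, j):
                best = max(best, S[i][k] + S[k + 1][j])
            S[i][j] = best
    return S[0][n - 1]


print(max_base_pairs("GGGAAAUCC"))  # 3
```

The triple loop makes this cubic in sequence length, and the function returns only the count; recovering the actual pairs would need a traceback over `S`, which is omitted here to keep the sketch short.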
Q4: Define S(i, j) and describe the base-pair maximization DP algorithm conceptually. A4: In the context of RNA secondary structure prediction using dynamic programming, $S(i, j)$ represents the maximum number of base pairs that can be formed within the RNA subsequence from index $i$ to index $j$. The base-pair maximization dynamic programming algorithm aims to find the RNA secondary structure with the maximum number of paired bases. It works by building a score matrix (DP table) where each cell $(i, j)$ stores $S(i, j)$. The table is filled iteratively, typically along diagonals of increasing subsequence length, by considering two main cases for the bases at positions $i$ and $j$: If bases $i$ and $j$ are non-complementary, $S(i, j)$ is the maximum of $S(i+1, j)$ and $S(i, j-1)$ (leaving $i$ or $j$ unpaired) and of the bifurcations $S(i, k) + S(k+1, j)$ over all split points $k$ ($i < k < j$), which cover the case where $i$ pairs with another base. If bases $i$ and $j$ are complementary, $S(i, j)$ can be $1 + S(i+1, j-1)$ (representing the pair $(i, j)$ plus the maximum pairs in the interior subsequence), or the maximum of $S(i, j-1)$ or $S(i+1, j)$ (if $i$ or $j$ is unpaired), or the maximum over various bifurcations. Both cases are summarized by $S(i, j) = \max\{ S(i+1, j-1) + \delta(i, j),\ S(i+1, j),\ S(i, j-1),\ \max_{i < k < j} [S(i, k) + S(k+1, j)] \}$, where $\delta(i, j) = 1$ if bases $i$ and $j$ are complementary and $0$ otherwise. After filling the entire matrix, the cell $S(1, n)$ (for a sequence of length $n$) gives the maximum number of base pairs for the full sequence. A traceback procedure then reconstructs the actual base-paired structure. Q5: Discuss limitations of base-pair maximization for predicting the most stable RNA structure. A5: While base-pair maximization is a useful conceptual algorithm, it has significant limitations for predicting the most energetically stable RNA structure: Ignores Loop Energies: It only considers maximizing base pairs and does not explicitly account for the energetic costs/benefits of forming different types of loops (hairpin, internal, bulge, multibranch). Large, unstable loops might be formed if they lead to an extra base pair. Simplified Energy Model: It uses a very simplistic energy model (each base pair contributes equally, usually +1).
Real RNA folding is driven by complex thermodynamic factors, where loop entropies, stacking interactions, and non-canonical base pairs all play a crucial role. Over-prediction of Base Pairs: By solely maximizing base pairs, the algorithm might predict structures that are sterically unfavorable or less stable due to the formation of many small, strained loops. More advanced algorithms, like those based on minimum free energy, address these limitations by incorporating detailed energy models for various structural elements. 38. Multiple Sequence Alignment (MSA) Q1: What is the goal of Multiple Sequence Alignment (MSA)? A1: The goal of MSA is to compare multiple protein or DNA sequences together to detect structural/functional similarity, identify conserved regions, and infer evolutionary relationships. Q2: Why is MSA more powerful than pairwise alignment for detecting relationships? A2: Pairwise alignment can miss relationships when sequence similarity is weak or when only short, highly conserved regions exist. Aligning many sequences simultaneously can reveal subtle patterns and conserved motifs that are not evident in pairwise comparisons, providing stronger evidence for shared ancestry or function. Q3: What is a consensus sequence in the context of MSA? A3: A consensus sequence is a representative sequence derived from an MSA, where each position in the consensus sequence typically corresponds to the most frequent residue (or base) found at that column in the alignment. It highlights the most common pattern across the aligned sequences. 39. Scoring MSA and Entropy Q1: How is entropy used to measure variability in an MSA? A1: Entropy measures the variability or randomness at each column of an MSA. A high entropy value indicates that the column is highly variable, with many different residues present. A low entropy value (approaching zero) indicates a highly conserved column, where most or all sequences have the same residue. 
This helps identify functionally important regions that are under selective pressure. (Formula: $H = -\sum p_i \log_2 p_i$, where $p_i$ is the frequency of residue $i$ in the column.) Q2: What is the typical aim in entropy-based scoring for MSA, and what are its limitations? A2: The typical aim is to find an alignment that minimizes entropy, emphasizing highly conserved columns. However, a limitation is that entropy only considers residue frequencies and does not incorporate evolutionary models or phylogenetic relationships between sequences. Therefore, minimizing entropy alone does not directly guarantee biological correctness or optimality from an evolutionary perspective. 40. Progressive MSA and CLUSTAL Q1: Why are heuristic methods used for MSA instead of exact optimal methods? A1: Exact optimal MSA is computationally very expensive, with complexity growing exponentially with the number of sequences. Heuristic methods, like progressive alignment, are used to balance alignment quality with computational feasibility, providing reasonably good alignments in a practical timeframe. Q2: Describe the general principle of progressive alignment. A2: Progressive alignment builds an MSA by first performing all-pairwise alignments to compute a distance matrix. This matrix is then used to construct a guide tree (often a phylogenetic tree). Sequences or groups of sequences are then progressively aligned according to the branching order of this guide tree, starting with the most closely related pairs and gradually adding more distant sequences or groups. Q3: What is a major drawback of progressive alignment? A3: A major drawback is that early alignment errors (often called "seed errors") made during the alignment of closely related sequences can propagate through the entire alignment process. These errors are difficult to correct later in the process, potentially leading to suboptimal overall alignments. Q4: Briefly explain how CLUSTALW (or CLUSTAL$\Omega$) works. 
A4: CLUSTALW is a classic progressive MSA program. It first computes all pairwise alignments to determine sequence similarities and constructs a distance matrix. This matrix is then used to build a guide tree (e.g., using the neighbor-joining method), which dictates the order of alignment. Finally, sequences and existing alignment blocks are progressively aligned according to the guide tree, with an option for iterative refinement to improve the alignment quality. 41. Phylogenetic Trees: Basics Q1: Define phylogeny. A1: Phylogeny is the evolutionary history and relationships among species, populations, or genes. Q2: Why is molecular phylogeny generally more reliable than phenotype-based phylogeny? A2: Phenotype-based phylogeny can be misled by convergent evolution (e.g., wings in birds and bats evolving independently), where similar traits arise from different evolutionary paths. Molecular phylogeny, which compares DNA/RNA or protein sequences, is generally more reliable because sequence-level changes are less directly subject to visible selection pressures than phenotypic traits, providing a more direct record of evolutionary divergence. Q3: Differentiate between a cladogram, phylogram, and chronogram. A3: Cladogram: Shows only the branching order and evolutionary relationships; branch lengths are not meaningful. Phylogram: Branch lengths are proportional to the amount of evolutionary change (e.g., number of substitutions) that has occurred along that branch. Chronogram: Branch lengths are proportional to the actual time since divergence. 42. Rooted vs Unrooted Trees Q1: What is the difference between a rooted and an unrooted phylogenetic tree? A1: A rooted tree has a single node, called the root, which represents the most recent common ancestor (MRCA) of all the taxa in the tree. It shows the direction of evolution and the order of branching. 
An unrooted tree shows the evolutionary relationships and connectivity among taxa but does not specify a common ancestor or the direction of evolution. It indicates how taxa are related but not which lineage is ancestral. Q2: Why does the number of possible rooted/unrooted trees for $n$ taxa grow very fast? A2: As the number of taxa ($n$) increases, the number of possible ways to arrange these taxa into different tree topologies (branching patterns) increases combinatorially. For example, for 3 taxa, there's only 1 unrooted tree but 3 rooted trees. For 4 taxa, there are 3 unrooted and 15 rooted trees. In general, for $n \ge 3$ taxa there are $(2n-5)!!$ unrooted and $(2n-3)!!$ rooted bifurcating trees, where $!!$ denotes the product of odd numbers down to 1. This rapid growth makes exhaustive searching for the optimal tree computationally intractable for even moderately large $n$. 43. Data Types for Tree Construction Q1: What are the two main molecular data views used for tree construction? A1: The two main molecular data views are character-based data and distance-based data. Q2: Explain character-based data in molecular phylogenetics. A2: In character-based methods, each aligned position in a sequence alignment is treated as a separate character, and the nucleotide or amino acid at that position is its "state." These methods analyze the character states directly; Maximum Parsimony, for example, looks for the tree that explains the observed patterns with the fewest evolutionary changes. Q3: Explain distance-based data in molecular phylogenetics. A3: Distance-based methods use a single numerical value (distance) to represent the overall pairwise difference or dissimilarity between sequences. The input for these methods is typically a distance matrix, which summarizes all pairwise sequence distances (e.g., Hamming distance, or evolutionary distances accounting for substitution models). 44. UPGMA Algorithm Q1: What is UPGMA and what assumption does it make? A1: UPGMA (Unweighted Pair Group Method with Arithmetic Mean) is a classic distance-based clustering algorithm used for building rooted phylogenetic trees.
It assumes a molecular clock, meaning that the rate of evolution is roughly constant across all lineages. Q2: Outline the key steps of the UPGMA algorithm. A2: (1) Start with a distance matrix for all taxa. (2) Find the two taxa (or clusters) with the smallest distance between them. (3) Cluster these two into a new group, and place their parent node at a height equal to half the distance between them. (4) Recompute the distances between this new group and all remaining taxa/groups as the arithmetic mean of the pairwise distances between their members (so larger clusters contribute in proportion to their size). For example, when A and B are single taxa, $d_{(AB),C} = (d_{AC} + d_{BC}) / 2$. (5) Repeat steps 2-4 until all taxa are grouped into a single tree. 45. Maximum Parsimony (MP) Q1: What is the biological principle behind Maximum Parsimony? A1: The biological principle behind Maximum Parsimony is that evolution tends to proceed with the fewest possible changes. Therefore, among all possible phylogenetic trees, the most parsimonious tree is the one that requires the minimum number of mutational events (substitutions) to explain the observed sequence data. This is often related to "Occam's razor." Q2: How does Maximum Parsimony work to build a phylogenetic tree? A2: Maximum Parsimony typically works by: (1) Considering all possible unrooted tree topologies for the given taxa. (2) For each tree topology, reconstructing the ancestral character states at the internal nodes. (3) Calculating the total number of evolutionary changes (substitutions) required for that tree to explain the observed sequences (its parsimony score). (4) Selecting the tree (or trees) with the smallest total number of changes as the most parsimonious, and thus the optimal, phylogenetic tree. 46. Informative vs Uninformative Sites in MP Q1: Define an informative site in Maximum Parsimony. A1: An informative site is a position in a sequence alignment that provides useful phylogenetic signal for distinguishing among different tree topologies.
Specifically, it must have at least two different character states, and each of these states must appear in at least two different taxa. Q2: Why are uninformative sites discarded in MP scoring? A2: Uninformative sites (e.g., constant sites where all taxa have the same character, or sites where a change can be explained by a single mutation regardless of tree topology) do not help differentiate between competing tree topologies. Since their contribution to the total parsimony score would be the same for all trees, they are ignored to reduce computational complexity without affecting the selection of the most parsimonious tree. 47. Maximum Parsimony Workflow Q1: Outline the general workflow for building a tree using Maximum Parsimony. A1: (1) Enumerate Topologies: Generate all possible unrooted tree topologies for the given set of taxa. (2) Identify Informative Sites: Classify alignment positions into informative and uninformative sites; only informative sites are used for scoring. (3) Compute Parsimony Score: For each candidate tree topology, determine the minimum number of substitutions required at each informative site to explain the observed character states, and sum these changes to get a total parsimony score for the tree. (4) Select Optimal Tree: Choose the tree (or trees) with the smallest total parsimony score as the Maximum Parsimony tree.
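The workflow above can be tried on a toy four-taxon alignment, for which there are exactly three unrooted topologies. The sketch below makes several assumptions that are not in the text: trees are written as nested tuples of taxon names, the per-site minimum number of changes is computed with Fitch's algorithm (a standard small-parsimony method, named here only as one possible choice), and the alignment and all function names are invented for illustration.

```python
from collections import Counter


def is_informative(column):
    """True if at least two states each occur in at least two taxa."""
    counts = Counter(column)
    return sum(1 for c in counts.values() if c >= 2) >= 2


def fitch(tree, states):
    """Return (candidate_state_set, changes) for one site on a binary tree.

    `tree` is a taxon name (leaf) or a 2-tuple of subtrees;
    `states` maps each taxon name to its character at this site.
    """
    if isinstance(tree, str):
        return {states[tree]}, 0
    left_set, left_cost = fitch(tree[0], states)
    right_set, right_cost = fitch(tree[1], states)
    common = left_set & right_set
    if common:                                   # children can agree: no change
        return common, left_cost + right_cost
    return left_set | right_set, left_cost + right_cost + 1  # one substitution


def parsimony_score(tree, alignment):
    """Sum Fitch changes over the informative sites of the alignment."""
    taxa = list(alignment)
    total = 0
    for pos in range(len(alignment[taxa[0]])):
        states = {t: alignment[t][pos] for t in taxa}
        if is_informative(states.values()):
            total += fitch(tree, states)[1]
    return total


# Hypothetical toy alignment (columns 1 and 3 are uninformative).
alignment = {"A": "ACGTA", "B": "ACGTT", "C": "GCATA", "D": "GCACT"}
topologies = {
    "AB|CD": (("A", "B"), ("C", "D")),
    "AC|BD": (("A", "C"), ("B", "D")),
    "AD|BC": (("A", "D"), ("B", "C")),
}
scores = {name: parsimony_score(t, alignment) for name, t in topologies.items()}
print(scores)                        # {'AB|CD': 4, 'AC|BD': 5, 'AD|BC': 6}
print(min(scores, key=scores.get))   # AB|CD
```

Each tuple roots the unrooted topology on its internal branch; the Fitch count does not depend on where the root is placed, so this is enough for comparing topologies. Enumerating every topology, as done here, is only feasible for a handful of taxa.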
