Sample exam questions from previous years 1. Next-gen sequencing
(3pts) To confidently call a base when sequencing all of the exons from an individual, one must get at least 8x coverage. At 13x coverage, what fraction of bases are sequenced at least 8 times? Assume the size of the exome is 45 million bases.

(3pts) At heterozygous positions, one allele is maternal and the other is paternal. If a position is sequenced 8 times, what is the probability that all 8 reads come from the mother? What would this mean?

(2pts) What are the advantages, if any, of using paired-end reads to perform genome sequencing? Exome sequencing?

(2pts) Your friend, Seeann Vee, wants to map structural variation in cancer. She decides to use microarrays instead of sequencing because they are less costly. What classes of structural variants would she miss with this approach?

(3pts) A company claims to have developed a sequencing technology with perfect accuracy and 10,000 base pair reads generated from sheared genomic DNA. Because of this, they claim that they can sequence a human genome at 1x average coverage and get good results (e.g. high consensus accuracy and good coverage of the genome), instead of at the normal 30-40x. Assuming the accuracy is indeed perfect, provide two reasons why they will need more than 1x coverage.

(3pts) You need to sequence a library of S. cerevisiae strains that are tagged by a molecular barcode. Assume that 1) there are 6000 strains in the library, 2) each strain is uniformly represented (this is the control population), and 3) each read gives you the sequence of 1 barcode. How many reads do you need to ensure a 95% probability that each barcode is sequenced at least once?

(2pts) Oxford Nanopore just announced that they have a sequencing technology that has a 4% error rate and a 100,000 bp read length and works on single DNA molecules. How might these long reads impact the sequencing of human genomes (name at least one way)?
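The coverage, allele, and barcode questions above each reduce to a short probability calculation. The sketch below is one hedged way to set them up, not an official answer key. It assumes that per-base read depth is Poisson-distributed with mean equal to the average coverage (the usual Lander-Waterman idealization), that each read at a heterozygous site samples either allele with probability 1/2, and that barcode sampling is uniform so a Poisson approximation to the coupon-collector problem applies. All function names are illustrative.

```python
import math


def fraction_covered_at_least(k, mean_coverage):
    """P(depth >= k) when depth ~ Poisson(mean_coverage) (modelling assumption)."""
    p_below = sum(
        math.exp(-mean_coverage) * mean_coverage**i / math.factorial(i)
        for i in range(k)
    )
    return 1.0 - p_below


def prob_all_reads_one_allele(n_reads, p_allele=0.5):
    """Probability that every one of n_reads comes from one specified allele."""
    return p_allele**n_reads


def reads_to_see_every_barcode(n_barcodes, confidence):
    """Approximate number of reads so that all barcodes appear at least once.

    Uses P(all seen) ~= exp(-n * exp(-r / n)), a Poisson approximation to the
    coupon-collector problem (assumes uniform, independent sampling).
    """
    r = n_barcodes * math.log(n_barcodes / -math.log(confidence))
    return math.ceil(r)


if __name__ == "__main__":
    print(fraction_covered_at_least(8, 13))
    print(prob_all_reads_one_allele(8))
    print(reads_to_see_every_barcode(6000, 0.95))
```

Under this model the exome size does not change the fraction itself; it only scales the fraction into an expected number of bases.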
(2pts) The Oxford technology is a single-molecule technology. How would you expect the accuracy of the method to change with the position in the read? How does it change in Illumina technology?

2. Synthetic Biology

A) (2 points) Which of the following statements about synthetic genetic circuits is true?
A. Synthetic circuits rival natural circuits in their complexity.
B. Genetic elements exhibit excellent modularity in new contexts.
C. DNA synthesis capabilities are sufficient for synthetic circuits.
D. Tuning of promoter strengths and mRNA stability is rarely needed for circuit elements.

B) (3 points) Which of the following network features were not optimized to generate an artemisinin biosynthetic pathway in microbes?
A. Ribosome binding site sequences.
B. Enzyme active site configurations.
C. Protein stoichiometry in enzymatic complexes.
D. Codon compatibility with host organism.

C) (3 points) You are planning to initiate a synthetic biology project in which you will create variants of an antibiotic synthetic pathway in order to produce a diverse library of modified antibiotic molecules. Suggest a host organism to use and give a justification for your selection.

D) (2 points) Name one technique for generating a diverse library of genes when you have access to a family of related genes, and a technique for when you only have one gene.

2. You are working on a bioremediation project and are attempting to engineer an enzyme to degrade the toxic chemical leemealone. You discover in the literature a bacterial protein (leftalonase) capable of converting leftalone, a molecule related to leemealone, to harmless products. No other enzymes have yet been reported to carry out similar chemical reactions. You decide that you will use the directed evolution of bacteria to generate the enzyme you desire, and you clone the leftalonase gene into a bacterial expression vector.

A. (1 pt) Describe a simple petri-dish based selection for an active leemealonase.

B. (1 pt) Describe one way to generate variation in the leftalonase gene as input for your selection.

C. (1 pt) How might you go about finding starting material for directed evolution by DNA shuffling without leaving the comforting glow of your computer screen?

D. (2 pts) Assume you have no access to computers. How might you obtain a diverse supply of promising genetic material with only the knowledge of the DNA sequence of leftalonase, a coupon for two free oligonucleotides, and a sample of soil? Where would you obtain the soil most likely to be of use to you?

3. Homology

A) (4 pts) Your colleague Professor Arthur Loggi has found a mysterious peptide sequence from a Na’vi sample. The Na’vi genome is only 50% complete. He asks you to look for an ortholog of this peptide in the human genome because you have a program that can do so by finding reciprocal best BLAST hits. Please tell Professor Loggi what the key differences between orthologs and paralogs are, and why your program may or may not work in this case. Make sure Professor Loggi understands the caveats of using best BLAST hits to find orthologs, and suggest some alternatives to using best BLAST hits.
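As a point of reference for the reciprocal-best-hit approach mentioned in the question, here is a minimal sketch of the pairing step only. It assumes the top hit for each protein has already been computed in both directions (for example from BLAST output) and is supplied as plain dictionaries. The function name and the identifiers in the example are hypothetical.

```python
def reciprocal_best_hits(best_a_to_b, best_b_to_a):
    """Return (a, b) pairs where a's best hit is b and b's best hit is a.

    best_a_to_b: dict mapping each genome-A protein to its top hit in genome B.
    best_b_to_a: dict mapping each genome-B protein to its top hit in genome A.
    """
    pairs = []
    for a, b in best_a_to_b.items():
        if best_b_to_a.get(b) == a:
            pairs.append((a, b))
    return sorted(pairs)


if __name__ == "__main__":
    # Hypothetical identifiers, for illustration only.
    navi_to_human = {"navi_pep1": "human_X", "navi_pep2": "human_Y"}
    human_to_navi = {"human_X": "navi_pep1", "human_Y": "navi_pep3"}
    print(reciprocal_best_hits(navi_to_human, human_to_navi))
```

The pairing is only as good as its inputs: if a gene is missing from one genome's hit table, no reciprocal pair can be reported for it, and a different gene may take its place as the best hit.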
B) You run your program for Professor Loggi using the peptide sequence:

RVVNLVPSFWVLDATYKNYAINYNCDVTYKLY

The top three alignments you get by BLASTing the sequence against the human protein database are:

Yours: R V V N L V P S - - F W V L D A T Y K N Y A I N Y N C D V T Y K L Y
SEQ 1: Q F F P L M P P A P Y W I L A T D Y E N L P L V Y S C T T F F W L F

Yours: R V V N L V P S - - F W V L D A T Y K N Y A I N Y N C D V T Y K L Y
SEQ 2: Q F F P L M P P A P Y W I L D A T Y K N Y A L V Y S C T T F F W L F

Yours: R V V N L V P S - - F W V L D A T Y K N Y A I N Y N C D V T Y K L Y
SEQ 3: R V V P L M P S A P Y W I L D A T Y K N Y A L V Y S C D V T Y K L F

(4 pts) Calculate the score of each of these three alignments using the default parameters for BLAST, which use the BLOSUM62 substitution matrix (shown below), a gap opening penalty of -11, and a gap extension penalty of -1. For each score you calculate, also compute the P-value of getting a score greater than or equal to it, assuming an EVD (extreme value distribution):

$P(S \geq x) = 1 - \exp\left[-e^{-\lambda (x - \mu)}\right]$

Use λ = 0.693 and µ = 50. Finally, give Professor Loggi a recommendation.
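The scoring procedure in this question can be sketched in a few lines. The code below is a hedged illustration rather than BLAST itself. It assumes the alignment is given as two equal-length gapped strings and that a gap of length k is charged gap_open + k * gap_extend; some courses charge gap_open + (k - 1) * gap_extend instead, so check which convention is intended. The substitution scores are passed in as a dictionary, and the toy example uses only three BLOSUM62 entries. The function names are illustrative.

```python
import math


def alignment_score(query, subject, sub, gap_open=11, gap_extend=1):
    """Score a pairwise alignment given as two equal-length gapped strings.

    sub maps (residue, residue) pairs to substitution scores.
    A run of k gap columns costs gap_open + k * gap_extend (assumed convention).
    """
    if len(query) != len(subject):
        raise ValueError("aligned sequences must have equal length")
    score = 0
    in_gap = False
    for q, s in zip(query, subject):
        if q == "-" or s == "-":
            if not in_gap:
                score -= gap_open  # charged once per gap run
                in_gap = True
            score -= gap_extend  # charged for every gap column
        else:
            in_gap = False
            # The matrix is symmetric, so look the pair up in either order.
            score += sub[(q, s)] if (q, s) in sub else sub[(s, q)]
    return score


def evd_pvalue(x, lam=0.693, mu=50.0):
    """P(S >= x) = 1 - exp(-exp(-lam * (x - mu))), computed stably via expm1."""
    return -math.expm1(-math.exp(-lam * (x - mu)))


if __name__ == "__main__":
    # Three BLOSUM62 entries, enough for this toy alignment only.
    sub = {("R", "R"): 5, ("F", "F"): 6, ("V", "I"): 3}
    s = alignment_score("R--FV", "RAPFI", sub)
    print(s, evd_pvalue(s))
```

To score the three alignments in the question, the full BLOSUM62 table would need to be loaded into `sub` in the same (residue, residue) form.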
    A  R  N  D  C  Q  E  G  H  I  L  K  M  F  P  S  T  W  Y  V
A   4 -1 -2 -2  0 -1 -1  0 -2 -1 -1 -1 -1 -2 -1  1  0 -3 -2  0
R  -1  5  0 -2 -3  1  0 -2  0 -3 -2  2 -1 -3 -2 -1 -1 -3 -2 -3
N  -2  0  6  1 -3  0  0  0  1 -3 -3  0 -2 -3 -2  1  0 -4 -2 -3
D  -2 -2  1  6 -3  0  2 -1 -1 -3 -4 -1 -3 -3 -1  0 -1 -4 -3 -3
C   0 -3 -3 -3  9 -3 -4 -3 -3 -1 -1 -3 -1 -2 -3 -1 -1 -2 -2 -1
Q  -1  1  0  0 -3  5  2 -2  0 -3 -2  1  0 -3 -1  0 -1 -2 -1 -2
E  -1  0  0  2 -4  2  5 -2  0 -3 -3  1 -2 -3 -1  0 -1 -3 -2 -2
G   0 -2  0 -1 -3 -2 -2  6 -2 -4 -4 -2 -3 -3 -2  0 -2 -2 -3 -3
H  -2  0  1 -1 -3  0  0 -2  8 -3 -3 -1 -2 -1 -2 -1 -2 -2  2 -3
I  -1 -3 -3 -3 -1 -3 -3 -4 -3  4  2 -3  1  0 -3 -2 -1 -3 -1  3
L  -1 -2 -3 -4 -1 -2 -3 -4 -3  2  4 -2  2  0 -3 -2 -1 -2 -1  1
K  -1  2  0 -1 -3  1  1 -2 -1 -3 -2  5 -1 -3 -1  0 -1 -3 -2 -2
M  -1 -1 -2 -3 -1  0 -2 -3 -2  1  2 -1  5  0 -2 -1 -1 -1 -1  1
F  -2 -3 -3 -3 -2 -3 -3 -3 -1  0  0 -3  0  6 -4 -2 -2  1  3 -1
P  -1 -2 -2 -1 -3 -1 -1 -2 -2 -3 -3 -1 -2 -4  7 -1 -1 -4 -3 -2
S   1 -1  1  0 -1  0  0  0 -1 -2 -2  0 -1 -2 -1  4  1 -3 -2 -2
T   0 -1  0 -1 -1 -1 -1 -2 -2 -1 -1 -1 -1 -2 -1  1  5 -2 -2  0
W  -3 -3 -4 -4 -2 -2 -3 -2 -2 -3 -2 -3 -1  1 -4 -3 -2 11  2 -3
Y  -2 -2 -2 -3 -2 -1 -2 -3  2 -1 -1 -2 -1  3 -3 -2 -2  2  7 -1
V   0 -3 -3 -3 -1 -2 -2 -3 -3  3  1 -2  1 -1 -2 -2  0 -3 -1  4

C) (2 pts) In primates, the amino acid histidine is 8% of the total amino acid content of their proteomes. In Na’vi-related species, histidine is 2% of the total proteome. Assume that the probability of observing two histidines aligned to each other in alignments of orthologs from within both groups of species is the same. If you were to construct two BLOSUM matrices, one from primates and one from Na’vi-related species, in which group would the score for aligning histidine to histidine be higher? Why?

4. Epigenetics

A) (2pts) Your colleague Professor Eugene Mathew Lateed generated a genome-wide DNA methylation map for normal colon cells using MRE-seq and MeDIP-seq. In an intergenic region, he found an interesting locus. This locus is about 20kb.
On one end of the locus, there is a 2kb CpG-rich stretch that has both intermediate MRE-seq and MeDIP-seq signals. The remaining 18kb has a high level of MeDIP-seq signal. Based on what you learned in class, why might you suspect that this region encodes a novel gene?

(2 pts) You decide to look at histone modification patterns across this region for more evidence. There are several genome-wide datasets available for this cell type: H3K4me1, H3K4me3, H3K27me3, H3K9me3, H3K36me3, and H3K9Ac. Which histone mark would you investigate for this locus and why? Suggest at least one other source of data that may help you, and explain why you think it may help.

B) You decide to use bisulfite sequencing to validate the methylation status of the 2kb region that has both intermediate MRE-seq and MeDIP-seq signal. Bisulfite treatment converts unmethylated C to T. You do this experiment in both normal colon cells and a colon cancer cell line. Below are the data after aligning reads from bisulfite-treated DNA to the region from both normal colon and colon cancer cell lines. For simplicity, we only consider one strand.
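Because bisulfite treatment converts unmethylated C to T, the methylation level at a CpG can be estimated as the share of covering reads that still show C at that position. The following is a minimal sketch of that bookkeeping on a toy template (not the exam data). It assumes single-strand, ungapped, error-free reads and places each read at the first offset where it is consistent with the template. The helper names are illustrative.

```python
def place_read(template, read):
    """Return the first ungapped offset where the read is consistent with the
    template, allowing a template C to appear as T (bisulfite conversion)."""
    for offset in range(len(template) - len(read) + 1):
        window = template[offset:offset + len(read)]
        if all(r == t or (t == "C" and r == "T") for t, r in zip(window, read)):
            return offset
    return None


def cpg_methylation(template, reads):
    """Percent methylation per CpG site, keyed by the 0-based position of the C.

    Methylation = reads retaining C at the site / reads covering the site.
    """
    sites = [i for i in range(len(template) - 1) if template[i:i + 2] == "CG"]
    methylated = {i: 0 for i in sites}
    covered = {i: 0 for i in sites}
    for read in reads:
        offset = place_read(template, read)
        if offset is None:
            continue  # reads that cannot be placed are skipped in this sketch
        for i in sites:
            if offset <= i < offset + len(read):
                covered[i] += 1
                if read[i - offset] == "C":
                    methylated[i] += 1
    return {i: 100.0 * methylated[i] / covered[i] for i in sites if covered[i]}


if __name__ == "__main__":
    toy_template = "ACGTCGA"  # CpG sites at positions 1 and 4
    toy_reads = ["ACGTTGA", "ATGTTGA", "ACGTCGA"]
    print(cpg_methylation(toy_template, toy_reads))
```

Reads that show a mix of converted and unconverted CpGs are counted site by site here; looking at whole reads instead is what reveals methylated or unmethylated haplotypes.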
Normal colon cells:
Template: GATCGTGCACGATCTCGGCAATTCGGGATGCCGGCTCGTCACCGGTCGCT
Reads:    GATTGTGTATGATTTTGGTAATTTGGG
          GATTGTGTATGATTTTGGTAATTTGGG
          GATTGTGTATGATTTTGGTAATTTGGG
          GATCGTGTACGATTTCGGTAATTCGGG
          GATCGTGTACGATTTCGGTAATTCGGG
          GATCGTGTACGATTTCGGTAATTCGGG
          GATTGTGTATGATTTCGGTAATTCGGG
          GATCGTGTACGATTTCGGTAATTCGGG
          GATTGTGTATGATTTCGGTAATTCGGG
          GATCGTGTACGATTTCGGTAATTCGGG
                GTATGATTTTGGTAATTTGGGATGTTGGTTTG
                GTACGATTTCGGTAATTCGGGATGTCGGTTCG
                GTACGATTTCGGTAATTCGGGATGTCGGTTCG
                GTACGATTTCGGTAATTCGGGATGTCGGTTCG
                GTACGATTTCGGTAATTCGGGATGTCGGTTCG
                GTATGATTTTGGTAATTTGGGATGTTGGTTTG
                GTATGATTTTGGTAATTTGGGATGTTGGTTTG
                GTATGATTTTGGTAATTTGGGATGTTGGTTTG
                GTATGATTTTGGTAATTTGGGATGTTGGTTTG
                GTATGATTTTGGTAATTTGGGATGTTGGTTTG
                                 CGGGATGTCGGTTCGTTATCGGTCGTT
                                 CGGGATGTCGGTTCGTTATCGGTCGTT
                                 CGGGATGTCGGTTCGTTATCGGTCGTT
                                 CGGGATGTCGGTTCGTTATCGGTCGTT
                                 TGGGATGTTGGTTTGTTATTGGTTGTT
                                 TGGGATGTTGGTTTGTTATTGGTTGTT
                                 CGGGATGTCGGTTCGTTATCGGTCGTT
                                 TGGGATGTTGGTTTGTTATTGGTTGTT
                                 TGGGATGTTGGTTTGTTATTGGTTGTT
                                 TGGGATGTTGGTTTGTTATTGGTTGTT

Colon cancer cell line:
Template: GATCGTGCACGATCTCGGCAATTCGGGATGCCGGCTCGTCACCGGTCGCT
Reads:    GATCGTGTACGATTTCGGTAATTCGGG
          GATCGTGTACGATTTCGGTAATTCGGG
          GATCGTGTACGATTTCGGTAATTCGGG
          GATCGTGTACGATTTCGGTAATTCGGG
          GATCGTGTACGATTTCGGTAATTCGGG
          GATCGTGTACGATTTCGGTAATTCGGG
          GATCGTGTACGATTTCGGTAATTCGGG
          GATCGTGTACGATTTCGGTAATTCGGG
          GATCGTGTACGATTTCGGTAATTCGGG
          GATCGTGTACGATTTCGGTAATTCGGG
                GTACGATTTCGGTAATTCGGGATGTCGGTTCG
                GTACGATTTCGGTAATTCGGGATGTCGGTTCG
                GTACGATTTCGGTAATTCGGGATGTCGGTTCG
                GTACGATTTCGGTAATTCGGGATGTCGGTTCG
                GTACGATTTCGGTAATTCGGGATGTCGGTTCG
                GTACGATTTCGGTAATTCGGGATGTCGGTTCG
                GTACGATTTCGGTAATTCGGGATGTCGGTTCG
                GTACGATTTCGGTAATTCGGGATGTCGGTTCG
                GTACGATTTCGGTAATTCGGGATGTCGGTTCG
                GTACGATTTCGGTAATTCGGGATGTCGGTTCG
                                 CGGGATGTCGGTTCGTTATCGGTCGTT
                                 CGGGATGTCGGTTCGTTATCGGTCGTT
                                 CGGGATGTCGGTTCGTTATCGGTCGTT
                                 CGGGATGTCGGTTCGTTATCGGTCGTT
                                 CGGGATGTCGGTTCGTTATCGGTCGTT
                                 CGGGATGTCGGTTCGTTATCGGTCGTT
                                 CGGGATGTCGGTTCGTTATCGGTCGTT
                                 CGGGATGTCGGTTCGTTATCGGTCGTT

(2 pts) Calculate the level of methylation, defined as percentage methylated, of the 8 CpG sites in both normal cells and in
cancer cells.

(2 pts) Professor Lateed suspects this locus may be imprinted. Does the bisulfite data support his idea? In other words, is there evidence for an extended methylated or unmethylated haplotype across this region? Describe how you might obtain additional support for the imprinted status of this promoter.

(2 pts) If this novel gene plays a role in tumorigenesis, do you think it is a tumor suppressor gene or an oncogene? Why? Propose at least one mechanism that could lead to the observed change in the cancer cell line.

5. Expression profiling

A) (2 pts) What are the two kinds of generally accepted RNA editing?

B) (3 pts) What are three kinds of normalization that we discussed? [Hint: there were two for microarrays and one in the context of RNAseq.] What kind of biases do they each correct for?

C) (1 pt) Under what circumstances does correlation fail to detect co-regulated genes?

D) (1 pt) What is an advantage of using Gene Ontologies for analyzing gene lists?

E) (1 pt) What is a disadvantage?

F) (1 pt) What is the difference between biological variation and technical variation?

G) (1 pt) You have 200 cancer samples and decide to cluster them with a k-means clustering algorithm. You use k=3. The algorithm returns that 100 of the samples are in cluster A, 72 in B, and 28 in C. What can you conclude from this analysis about the different subtypes of cancer within your sample?

1. (3 pts) Compare and contrast the advantages of microarrays vs. RNAseq (0.5 pts per difference between platforms).

2. (3 pts) When should a Fisher’s exact test be used? Describe two examples of analytical situations where a Fisher’s exact test could be used.

3. (3 pts) What does FPKM stand for? What normalization does it provide? Calculate the FPKM for Your Favorite Gene 1 (YFG1), given that:
Yfg1 has three exons of 500 bp each and 2 introns of 1000 bp, and codes for a protein of 125 a.a. (thus it has substantial UTR regions).
Yfg1 has no alternative splicing.
Yfg1 has a 57% GC content.
Your library has 3 million reads in it, and is a paired-end library.
3000 reads map to YFG1.
Pairs are perfectly matched.

4. (1 pt) What is your favorite kind of clustering, and why? Mention at least two kinds, and why you like one more than the other.

6. DNA Protein interaction: motif finding

Consider a peculiar family of modular transcription factors that recognize a single base pair per module. For two members of this family, TF-1 and TF-2, the base pair frequencies across all known binding sites have been determined:

Frequency of:    A     C     G     T
TF-1            1/2   1/4    0    1/4
TF-2             0    1/4   1/2   1/4

Orthologs of TF-1 and TF-2 proteins are found in different genomes with wildly different base frequencies in the sense strands of promoter regions. The base frequencies for two such genomes, G-1 and G-2, are given here:

Background frequency of:    A     C     G     T
G-1                        1/4   1/4   1/4   1/4
G-2                        1/4   1/2   1/8   1/8

Using the equation

$I_{seq} = \sum_{j} \sum_{b} f(b, j) \log_2 \frac{f(b, j)}{p(b)}$

where b is the base at a particular position j in the binding motif and p(b) is the background frequency of that base, answer the following questions.

(5 pts) Does the binding module TF-1 have higher information content in one genome or the other?

(5 pts) Which transcription factor module contains more information in genome 2?
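The information-content equation above can be evaluated directly from the two frequency tables. The sketch below is one way to do so and should be read as an illustration rather than an answer key. It treats each module as a single motif position and uses the usual convention that a term with f(b, j) = 0 contributes 0. The function and variable names are illustrative.

```python
import math


def information_content(motif_freqs, background):
    """Sum over positions j and bases b of f(b, j) * log2(f(b, j) / p(b)).

    motif_freqs: list of dicts, one per motif position, mapping base -> f(b, j).
    background:  dict mapping base -> p(b).
    Terms with f(b, j) == 0 contribute 0 (the 0 * log 0 = 0 convention).
    """
    total = 0.0
    for position in motif_freqs:
        for base, f in position.items():
            if f > 0:
                total += f * math.log2(f / background[base])
    return total


# Frequencies copied from the tables in the question; one position per module.
TF1 = [{"A": 1 / 2, "C": 1 / 4, "G": 0, "T": 1 / 4}]
TF2 = [{"A": 0, "C": 1 / 4, "G": 1 / 2, "T": 1 / 4}]
G1 = {"A": 1 / 4, "C": 1 / 4, "G": 1 / 4, "T": 1 / 4}
G2 = {"A": 1 / 4, "C": 1 / 2, "G": 1 / 8, "T": 1 / 8}

if __name__ == "__main__":
    for tf_name, tf in (("TF-1", TF1), ("TF-2", TF2)):
        for g_name, g in (("G-1", G1), ("G-2", G2)):
            print(tf_name, g_name, information_content(tf, g))
```

Individual terms can be negative when a base is rarer in the motif than in the background, which is why the same module can score differently against the two genomes.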