Current Protocols in H - Medical Life Sciences
PolyPhred Analysis Software for Mutation Detection from Fluorescence-Based Sequence Data
UNIT 7.16
Kate T. Montgomery,1 Oleg Iartchouck,1 Li Li,2 Stephanie Loomis,1 Vanessa Obourn,1 and Raju Kucherlapati1
1 Harvard Medical School - Partners Healthcare Center for Genetics and Genomics, Boston, Massachusetts
2 Albany Medical College, Albany, New York
ABSTRACT
The ability to search for genetic variants that may be related to human disease is one of the most exciting consequences of the availability of the sequence of the human genome. Large cohorts of individuals exhibiting certain phenotypes can be studied and candidate genes resequenced. However, the challenge of analyzing sequence data from many individuals with accuracy, speed, and economy is great. This unit describes one set of software tools: Phred, Phrap, PolyPhred, and Consed. Coverage includes the advantages and disadvantages of these analysis tools, details for obtaining and using the software, and the results one may expect. The software is being continually updated to permit further automation of mutation analysis. Currently, however, at least some manual review is required if one wishes to identify 100% of the variants in a sample set. Curr. Protoc. Hum. Genet. 59:7.16.1-7.16.21. © 2008 by John Wiley & Sons, Inc.
Keywords: DNA sequencing • mutation identification • SNPs • indels • sequence traces • Consed • Phred • Phrap • PolyPhred
INTRODUCTION
The purpose of this unit is to describe, in detail, the use of one software application, PolyPhred (Nickerson et al., 1997; Stephens et al., 2006), to identify mutations or variations among individuals as exhibited in fluorescence-based DNA sequence data. UNIT 7.9 describes in detail the protocols for obtaining DNA sequences representing particular exons or regions of the genome via PCR amplification and sequencing of the desired regions.
The success of a mutation detection project of any size is critically dependent upon the ability of the investigators to analyze the data obtained in an accurate and timely fashion. Even a modest project (e.g., resequencing one gene with 20 exons in 10 individuals, with each exon read in both directions) will require a minimum of 400 sequence traces that have to be examined and compared to the normal sequence. Chromatograms generated for a large project may number in the thousands. It is impractical to review this quantity of data manually, examining and comparing individual chromatograms. Such an approach would be both time-consuming and subject to human error. Similarly, aligning and comparing the simple text files of the sequences is inappropriate, because one must know the quality of the sequence data before relying upon the called bases. In response to the enormous increase in the use of sequencing to identify significant pathogenic mutations in both the research and clinical environments, several academic groups and companies have developed software applications to facilitate data review. The challenge is to create a program into which one can simply import sequence data, and the software will identify, list, and export all variants from a standard reference sequence. The program must be able to identify the dual peaks of heterozygotes and differentiate them from background noise, and also recognize homozygous variants as they differ from a reference sequence. It must also be able to display and enable a reviewer to characterize more complicated variants, including insertions and deletions (indels).
Current Protocols in Human Genetics 7.16.1-7.16.21, October 2008. Published online October 2008 in Wiley Interscience (www.interscience.wiley.com). DOI: 10.1002/0471142905.hg0716s59. Copyright © 2008 John Wiley & Sons, Inc. Searching Candidate Genes for Sequence Variation: Mutations and Polymorphisms, Supplement 59.
Happily, significant progress has been made in the development of automated sequence analysis software in recent years. This unit describes, in considerable detail, one of the available options, a suite of programs including Phred (Ewing et al., 1998; Ewing and Green, 1998), Phrap, PolyPhred (Nickerson et al., 1997; Stephens et al., 2006), and Consed (Gordon et al., 1998; Gordon, 2004). The unit describes how to obtain the programs and how to use them for resequencing projects on a UNIX server.
STRATEGIC PLANNING
Selection of Software
The selection of software to assist in sequence analysis is one of the critical decisions to be made during the strategic planning of a project whose goal is mutation detection via fluorescence-based sequencing. The factors to consider are (1) performance of the program, (2) cost and availability of the program, (3) difficulty of program setup and use, and (4) suitability to the size and complexity of the expected work. Brief descriptions of a few of the available programs are found in the Commentary section, and the use of one program (Mutation Surveyor) is described in UNIT 10.8. Further evaluations and comments are available online. Several programs fulfill parts of the challenge of automated analysis as outlined above quite effectively, but none are yet able to fully analyze chromatograms with complete accuracy and report findings without some level of manual intervention. However, the need for manual review is greatly reduced by the applications described, and by others available at the present time. The purpose of this unit is to describe the use of Phred, Phrap, PolyPhred, and Consed at Harvard Medical School – Partners Center for Genetics and Genomics (HPCGG; http://www.hpcgg.org/) for the analysis of all fluorescence-based sequencing projects in our research laboratory.
This suite of programs was originally developed during the Human Genome Project by Phil Green, Brent Ewing, David Gordon, LaDeanna Hillier, Deborah Nickerson, and others. Total automation of very large survey projects in the research environment using the newest version of PolyPhred (v.5.04) has been demonstrated to be highly efficient (Stephens et al., 2006), if some low level of error is acceptable. For clinical applications, where no level of error is acceptable, at least one round of manual review is still necessary.
The performance of Phred, Phrap, PolyPhred, and Consed is exceptionally reliable. The programs run quickly, require no user interactions while running, offer many options for customization, and can handle virtually unlimited quantities of data. The graphical user interface, Consed, provides many ways to review the data and export information. In addition, the background information extracted by the programs from the data files or chromatograms is all available in text files, and information can be obtained by queries addressed to these files by external programs in the UNIX environment.
The documentation and learning tools are very good, though they may seem complex and time-consuming in the beginning. The programs are available without cost to academic laboratories, from the groups that developed them.
The primary disadvantages to selecting this software option are that (1) the programs require a UNIX or Linux server, Mac OS X.X, or one of the other platforms described in the documentation, and (2) the installation and original setup require more sophisticated computer knowledge than most other candidate programs. Notwithstanding the technical challenges of setting up and becoming familiar with Phred, Phrap, PolyPhred, and Consed, they offer many advantages, especially for large projects.
It takes only a few minutes to set up a project, place data in the correct folder, run the program, and open the assembly in Consed. One can then use a navigation tool to jump from one variant to the next, or quickly scan the sequence data representing one amplicon of up to 100 individuals (or more), and determine whether there are any variants. If many genes with many amplicons are to be reviewed, all data still go into one folder, where the user can move through each of the amplicons one by one.
Overview of PolyPhred
PolyPhred is designed to compare fluorescence-based sequence chromatograms (UNIT 7.9) from different individuals and to identify differences among sequences (Nickerson et al., 1997; Stephens et al., 2006). The algorithm upon which it is based has recently been updated (v.5.04) to facilitate the automation of single-nucleotide polymorphism (SNP) identification, and under certain ideal conditions it is highly accurate. The ideal conditions include clean sequence with reads available in both directions from four or more individuals. The program determines a consensus (or takes a reference sequence as the consensus), applies a tag at positions where polymorphisms appear, and applies a quality score to each tag. These tags can be exported, easily reviewed, and modified. The genotype of each individual at each polymorphic site is recorded in a text file, the polyphred.out file. The discussion below provides details for using PolyPhred version 5.04 in a UNIX environment to review data quickly and accurately. In our experience, v.5.04 identifies almost all of the SNPs found by manual review, and exhibits a very low level of false positives. When v.5.04 was tested by the Nickerson group, 93% of all SNPs were found in 47 to 90 patients, and 100% of the high-frequency SNPs were found in the same group.
There were no false positives, and the overall accuracy was 99.9% (Stephens et al., 2006). Once familiar with the application, an advanced user might wish to modify certain parameters to fit the particular scope and demands of a project, and to take advantage of the automation to save time. The Nickerson group continues to develop this program, with goals of making it even more reliable in the automated format and expanding it to include the automated identification of indels, a function now available in a Beta version, PolyPhred v.6.11 (Bhangale et al., 2006).
PolyPhred is not a stand-alone program, but is integrated with the sequence analysis programs mentioned above, originally developed by Phil Green’s group for the Human Genome Project. These programs are: Phred (Ewing and Green, 1998; Ewing et al., 1998), Phrap, Cross-match, Swat, and Consed (Gordon et al., 1998; Gordon, 2004). A full introduction and description of Phred, Phrap, and Consed is available at http://www.phrap.org/phredphrap/general.html, while a description of PolyPhred v.5.04 is available at http://droog.mbt.washington.edu/PolyPhred.html. Special instructions for other platforms are in the documentation, and a complete Tour of Consed, the graphical user interface, is provided.
Each of the elements of the integrated set of programs has a particular function. Phred provides base calls, peak information, and quality scores for sequence data from a fluorescence-based sequencing machine. The quality scores are a representation of the likelihood of error for any given base. A Phred score of 20 means that the base call has a 1 in 100 probability of being incorrect (99% accuracy), while a Phred score of 40 means that the base call has a 1 in 10,000 probability of being incorrect (99.99% accuracy). Phred can also be used independently for base calling and to generate quality scores for sequence data. In this capacity only, it can be run in the Microsoft Windows environment.
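The two examples above (Q20 = 1 error in 100, Q40 = 1 error in 10,000) follow from the standard Phred definition, Q = -10 log10(P), where P is the estimated probability that the base call is wrong. The short Python sketch below only illustrates that arithmetic; the function names are our own and are not part of Phred.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Error probability for a Phred quality score: P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred quality score for an error probability: Q = -10 * log10(P)."""
    return -10 * math.log10(p)


# Reproduce the two examples given in the text.
for q in (20, 40):
    p = phred_to_error_prob(q)
    print(f"Q{q}: about 1 error in {round(1 / p):,} base calls "
          f"({(1 - p) * 100:.2f}% accuracy)")
```

Because the scale is logarithmic, every increase of 10 in the Phred score corresponds to a tenfold drop in the expected error rate.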
Phrap, Cross-match, and Swat are used to screen out vector sequences and to provide sequence alignments of similar sequences. PolyPhred uses the information from Phred and Phrap to identify and tag putative heterozygous and homozygous variants from the consensus. Consed, a graphical user interface, is then used to view and edit the resulting assembly. Consed is described at http://www.genome.washington.edu/consed/consed.html (Gordon et al., 1998; Gordon, 2004). It is strongly recommended that new users take the “Quick Tour” provided with Consed, which will introduce all of its features. At a minimum, the documentation should be carefully reviewed to identify the tools that will be useful for the intended project. Much of Consed’s complexity is related to shotgun sequencing, whole-genome assemblies, and autofinish. The tools for resequencing and mutation identification projects are less difficult to master. Once the project is set up and Consed is running, there is a help guide that can be accessed from the program.
One of the major benefits of using PolyPhred for the detection of sequence variation, especially when many samples are to be examined, is the ease with which data files can be managed in the UNIX environment, either manually or using perl scripts. Once chromatograms are deposited in the appropriate directory, the full suite of programs can be triggered to run on the server completely in the background. The user merely awaits completion of the assembly so the results can be reviewed. It is possible (though not necessary) to view the progress of the program while it is running; when it is complete, it will tell the user that the data can now be viewed in Consed.
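As an illustration of running the suite in the background, the Python sketch below starts an analysis command from inside a project's edit directory and sends the progress messages to a log file. It is a minimal sketch under several assumptions: the command name phredPhrap.poly (the modified script described later in this unit) is taken to be on the user's path, the directory is assumed to be named edit_dir, and the function name and log file name are hypothetical.

```python
import subprocess
from pathlib import Path


def start_analysis(edit_dir, command=("phredPhrap.poly",), log_name="phredPhrap.log"):
    """Start the analysis command from inside edit_dir without waiting for it.

    The command name and log file name are assumptions for illustration;
    substitute whatever your installation actually uses.
    """
    edit_dir = Path(edit_dir)
    if edit_dir.name != "edit_dir" or not edit_dir.is_dir():
        raise ValueError(f"{edit_dir} is not an existing edit_dir")
    # Progress messages that would normally scroll past on screen go to the log.
    with open(edit_dir / log_name, "w") as log:
        proc = subprocess.Popen(
            list(command), cwd=edit_dir, stdout=log, stderr=subprocess.STDOUT
        )
    return proc  # call proc.wait() or proc.poll() later to check completion


# Hypothetical usage:
#   proc = start_analysis("/path/to/PROJECT_A/edit_dir")
#   ...do other work...
#   if proc.poll() is not None: print("analysis finished; open Consed")
```

Returning the process handle rather than blocking mirrors the behavior described above: the user can check progress if desired, but does not have to.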
Another benefit is that the text output of the programs is readily available, and can be queried by external scripts to extract any information not available directly through Consed. Thus, it is possible to customize the information derived from these programs when projects demand it. For example, when new data are entered into a project at HPCGG, we extract the Phred quality scores for each base in the region of interest, for each read. These scores are used to determine if a read “passes” or “fails” our criteria for quality, and this pass/fail information is incorporated into several e-mails sent to the members of the sequencing group. The e-mails provide a summary report of the success rate of the entire run, as well as whether each read passes. They provide real-time assessment of the quality of the output from any computer with e-mail access, so that problems can be identified in a timely fashion.
Finally, data from 100 or more individuals can be reviewed by scrolling through a single computer screen. This allows the user to recognize the presence of a variant very quickly. Alternatively, Consed will list the variants in the single contig or in the entire assembly if multiple exons or amplicons have been sequenced. This list can be used to navigate through the assembly to view all variant positions. With the highly accurate SNP detection algorithm now in place, nearly full automation is possible.
At HPCGG, an automatic data processing pipeline is initiated when data collection is completed on the ABI 3730xl. A series of perl scripts is triggered that direct necessary operations, such as transfer of chromatograms to the UNIX system, name trimming (.ab1 files have long unwieldy names), and subsequent sorting and transfer to the correct project folder where analysis by Phred, Phrap and PolyPhred is initiated.
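The pass/fail check on Phred quality scores described above can be sketched in Python as follows. This is not the HPCGG script: the thresholds (Q20, 90% of bases) are hypothetical, the region of interest is given here in read coordinates for simplicity, and the parser assumes the usual phd layout in which each line between BEGIN_DNA and END_DNA holds a base, its quality score, and its peak position.

```python
def read_phd_qualities(phd_text):
    """Return the per-base Phred scores from the text of one phd file.

    Assumes lines between BEGIN_DNA and END_DNA look like: "<base> <quality> <peak>".
    """
    qualities = []
    in_dna = False
    for line in phd_text.splitlines():
        line = line.strip()
        if line == "BEGIN_DNA":
            in_dna = True
        elif line == "END_DNA":
            break
        elif in_dna and line:
            qualities.append(int(line.split()[1]))
    return qualities


def read_passes(qualities, start, end, min_q=20, min_fraction=0.9):
    """Pass a read if enough bases in positions start..end (1-based, inclusive)
    reach min_q. Both thresholds are illustrative, not published criteria."""
    region = qualities[start - 1:end]
    if not region:
        return False
    good = sum(1 for q in region if q >= min_q)
    return good / len(region) >= min_fraction


# Tiny made-up example in phd-like layout.
example = """BEGIN_SEQUENCE demo_read
BEGIN_DNA
a 8 5
c 35 17
g 40 29
t 42 41
END_DNA
END_SEQUENCE
"""
quals = read_phd_qualities(example)
print(quals, read_passes(quals, start=2, end=4))
```

A production version would map the region of interest from reference coordinates onto each read and would collect the per-read results into the summary report.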
E-mail reports of the Phred quality scores are sent to the sequencing group, and their arrival indicates that analysis is complete and results can be reviewed using Consed. This kind of pipeline is ideal for high-volume sequencing projects and when bioinformatics support is available. However, since every laboratory has a slightly different computer environment and not everyone has adequate support for automating these processes, the explanation below assumes manual project setup, file transfer, and program initiation.
Obtaining and Installing the Programs
PolyPhred is available for a number of different operating platforms, such as UNIX, Linux, MacOSX, and others listed in the documentation. It is not available for Microsoft Windows, and therefore, it is important to have IT support for one of these other platforms. However, the system is very stable once it is downloaded and installed in a UNIX or Linux environment, so the need for support is not extensive. Phred and Phrap and their associated components can be obtained via e-mail by following instructions at http://www.phrap.org/phredphrapconsed.html. An academic user agreement from David Gordon (gordon@genome.washington.edu) must be signed in order to receive permission to download Consed from http://bozeman.genome.washington.edu/consed/consed.html. PolyPhred is obtained by following instructions at http://droog.mbt.washington.edu/PolyPhred.html. The programs are free to academic laboratories, while commercial users must obtain a license (license@u.washington.edu) prior to obtaining them. Usually a System Administrator will download and perform the actual installation, following the online instructions. The various elements will be placed in particular directories in the UNIX environment as the instructions specify.
PC users must also install x-windows (http://www.starnet.com/products/xwin32) or similar software to access the UNIX/Linux environment (see Basic Protocol, Materials). Alternatively, the user can access the programs directly through a UNIX terminal.
A perl script called phredPhrap, which runs phred-swat-crossmatch-phrap in the proper sequence, is provided with the program suite. This script must be run from the edit_dir, and it must be able to find the other directories as described above. Certain modifications to this script are required to direct it to include PolyPhred, and other changes may be implemented to customize certain details for Consed. These modifications are described in the program support files. Most significantly, one line of the phredPhrap script must be changed from $bUsingPolyPhred = 0 to $bUsingPolyPhred = 1. The modified perl script can be saved as phredPhrap.poly.
In order to analyze data, the user must go to the edit_dir of the project and type phredPhrap.poly in the command window. All chromatogram data that have been placed in the chromat_dir will be analyzed. Phd files (text files that contain information extracted from the chromatograms) will be created for each read and will subsequently appear in the phd_dir. If chromatograms have been added since the last phredPhrap.poly command, they will not have corresponding phd files or poly files; the program will recognize this, analyze the new data, create phd files and poly files for the new data, and update the assembly. If no new data have been added, there is no need to run phredPhrap.poly, and the command consed will display the existing assembly.
When the programs are installed, a sample set of data is available for a “Quick Tour” of Consed. This tour provides a very detailed exploration of the capabilities of the program, and it is recommended that users view it.
However, it includes a lot of information that will not be needed for simple resequencing projects, so a basic guide for mutation identification is provided here.
BASIC PROTOCOL: USING PolyPhred FOR SEQUENCE ANALYSIS AND MUTATION DETECTION
When sequencing is performed in your own laboratory, transfer data files from the data collection computer to a Windows desktop computer when the run is completed, for temporary storage and review. The files may be ABI chromatogram data (.abi) files or the equivalent from other sequence analyzers. Working with data directly on the collection machines is not recommended, and data should be removed from them frequently for storage elsewhere. The array pictures and raw data files should be reviewed briefly to check for any obvious problems. The length of read and quality of the array and sequencing controls should be checked to be sure they meet standards.
When sequencing is performed by an outside core facility, the facility will provide chromatograms so you can import them into whatever sequence analysis program you have selected. They may also provide text files of the sequence, but be aware that those files are not reliable for mutation detection! The user must review the actual data in order to determine its quality and whether there are artifacts or miscalled bases.
PolyPhred runs under UNIX. A basic understanding of UNIX commands is required, sufficient to navigate through directories, make directories, and move, remove, and copy files. A short list of commands is below. A novice should spend a little time learning navigation in the UNIX environment.
ls = list contents of current directory
pwd = requests information on present working directory
cd = change directory (must give pathway or subdirectory of pwd, or it will take you back to your home directory)
mkdir <dir name> = make a new directory in this directory
cp = copy (cp <file> to a new place <full pathway to desired location>)
mv = move (mv <file> to a new place <full pathway to desired location>)
Materials
High-speed access to the Internet for Web-based tools and information, especially access to genome databases
A UNIX server for running the mutation identification software, or other type as described below. The programs can be run on Mac OSX.X or LINUX, but they do not run under Microsoft Windows.
A directory on the UNIX computer where data analysis will be performed, referred to here as ANALYSIS DIRECTORY. The user will have full privileges to read, write, and execute here, and the sequence analysis programs will be available. Within this directory, all projects will have their own folders. A Project may consist of one gene with multiple exons, or many genes. The program will align like sequences into “contigs.”
A Windows PC terminal with an X-terminal emulator installed and running, to interface with the UNIX System. One option is X-Win32, which may be downloaded from http://www.starnet.com/products/xwin32/. There is a free trial version, and various types of licenses may be purchased at reasonable cost. Other options are Exceed, Reflection X, and OpenNT.X.
A three-button mouse or a mouse capable of emulating a three-button mouse. A two-button mouse with a scroll button is easily adapted to Consed.
Current versions of Phred, Phrap (and its associated programs), Consed, and PolyPhred, installed in the user’s pathway of the UNIX machine.
A sample set of data is provided by the software developers, and if this is downloaded, there is a tutorial that is very useful in developing a complete understanding of how to use the programs.
Experimental sequence data equivalent to .ab1 files or scf files generated by ABI 3730xl or other fluorescence-based automated sequencer (see also Beckman Instruments, LI-COR Life Sciences, or Amersham Biosciences analyzers). Such data are generally known as “chromatograms.”
NOTE: In the following directions, all commands to be typed in the UNIX command window are in italics, while directories are bold. All UNIX commands are case sensitive, and spaces are not allowed within file or directory names. Underscore is frequently used in place of a space.
Set up the project
A basic file structure is required for data analysis, and this is created by the user before depositing the data or running the programs. First, a project directory is created in a user directory that has access to the programs. Four directories are then required to be created within the project directory. The project directory name can be anything, but should be unique and specific to the particular data to be analyzed. The four directories within PROJECT_A must be named edit_dir, phd_dir, chromat_dir, and poly_dir.
1. Log in to the UNIX server and go to (cd) the ANALYSIS DIRECTORY.
2. Set up a project folder or directory called PROJECT_A, as follows:
a. mkdir PROJECT_A (Type the command mkdir, followed by a space and the name of the directory. A space always follows a command in UNIX.)
b. cd PROJECT_A and make the four directories required for all Phred-Phrap-PolyPhred-Consed analysis, using the mkdir command: mkdir chromat_dir phd_dir edit_dir poly_dir (space between directory names). The same thing would be accomplished by typing mkdir followed by the desired new directory name four times.
c. Transfer your dataset (.ab1 or equivalent chromatogram files) into the chromat_dir. Use File Transfer Protocol (ftp) from the collection computer or other Windows-based data storage location. Or, cp or mv files to the desired location from a UNIX environment. Warning! If you use the mv command incorrectly, forgetting to give the full path of the destination, the files may be renamed or moved to an unintended location, and may be lost. cp is safer.
3. Set up the project-specific control chromatogram files. Name these files as gene_exonX.ref or gene_exonX.pseudo so you will recognize them.
a. You may use high-quality chromatograms representing the normal sequence of the PCR product of each amplicon to be sequenced, forward and reverse.
b. Alternatively, you can follow the directions in the Consed documentation for creating a pseudo-chromatogram representing the template sequence used to design your primers. The command is “sudophred,” and it is run from the edit directory. The input is a text file of the template sequences, in fasta format with names following the > symbol on the line above the sequence. The files created will go automatically to the appropriate directories (phd_dir and chromat_dir). Example: sudophred genename.fasta.
c. If desired, use external programs to create user-defined tags in the control files when setting up a project. At HPCGG, we add coding sequence tags, PCR primer tags, and SNP tags from dbSNP (http://www.ncbi.nlm.nih.gov/SNP/). The Consed documentation shows how to do this in ADDING TAGS FROM OTHER PROGRAMS. Tags are added to the phd files based upon their numerical position in the template sequence. This will require some perl script facility. Tags can also be added by the user, while working in Consed, to individual reads, reference or pseudo sequences, or the consensus. The application of these tags is described below (Working with Consed, 7). This requires no programming expertise but is somewhat time-consuming.
d. Your project is set up.
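For laboratories that script their setup, the directory layout from steps 1 and 2 can be sketched in Python as below. This is an illustrative sketch, not part of the software suite: the function names are our own, the four directory names are written with underscores on the assumption that spaces are not allowed in names, and the *.ab1 pattern is only a default. Files are copied rather than moved, in keeping with the warning in step 2c.

```python
import shutil
import tempfile
from pathlib import Path

# Directory names required inside every project folder (assumed underscore spelling).
REQUIRED_DIRS = ("chromat_dir", "phd_dir", "edit_dir", "poly_dir")


def set_up_project(analysis_dir, project_name):
    """Create the project folder and its four required directories."""
    if any(ch.isspace() for ch in project_name):
        raise ValueError("project names must not contain spaces; use underscores")
    project = Path(analysis_dir) / project_name
    for name in REQUIRED_DIRS:
        (project / name).mkdir(parents=True, exist_ok=True)
    return project


def copy_chromatograms(source_dir, project, pattern="*.ab1"):
    """Copy (not move) chromatograms into chromat_dir; return the copied names."""
    dest = Path(project) / "chromat_dir"
    copied = []
    for path in sorted(Path(source_dir).glob(pattern)):
        shutil.copy2(path, dest / path.name)
        copied.append(path.name)
    return copied


# Demonstration in a throwaway directory standing in for the ANALYSIS DIRECTORY.
with tempfile.TemporaryDirectory() as analysis_dir:
    project = set_up_project(analysis_dir, "PROJECT_A")
    print(sorted(p.name for p in project.iterdir()))
```

Rejecting names with spaces up front avoids creating a project that the command-line tools would later have trouble addressing.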
The data files you have added to the chromat_dir are ready to be analyzed. See UNIT 7.9 for protocols to generate data files, including recommended naming conventions for chromatograms: PROJECT_gene-exon#.sample.direction
Remember that a new project must have four directories, but all may be empty except for the chromat_dir.
Working with PolyPhred and Consed: data analysis and review
The full documentation for Consed and the “Quick Tour” of Consed are available at http://bozeman.mbt.washington.edu/consed/distributions/README.15.0.txt. Once you have installed the programs on your system, the Consed documentation is available from the Help menu.
4. Go to (cd) the edit_dir of the project you wish to analyze. Type the command phredPhrap.poly into the UNIX command window. This perl script is included with the program package, slightly modified as described above for PolyPhred. It directs the sequential running of the component programs. It must always be typed from within the edit_dir of the project to be analyzed.
5. Wait while the full suite of programs, including phred-phrap-crossmatch, swat, and polyphred, runs. They will generate a number of text files in the various directories created (edit_dir, phd_dir, and poly_dir). These files contain information extracted from your chromatograms. The steps are chronicled on the screen. When complete, a message in the command window will tell you that you may now run Consed on a particular .ace file.
a. The phredPhrap.poly command results in full analysis of all the data files you have put in the chromat_dir.
Each of the components of the suite of programs performs its function as described above: base calling and base quality score assignment, sequence comparison and alignment of similar sequences, and assembly, yielding multiple contigs, contig consensus determination, and polymorphism tagging and scoring.
b. The programs all run independently in the background. There is no interaction with the user. At HPCGG, our automated pipeline is triggered as soon as the analyzing machines complete their runs. The data are copied to appropriate UNIX directories (according to project name) and processed. Quality control scripts are run and the results are distributed by e-mail to the sequencing group.
c. Since the programs are run on a UNIX server, data for many projects can be processed at the same time, each from a different edit_dir within a different Project folder.
d. The output is simple text files which will appear in the edit_dir, phd_dir, and poly_dir. Consed will use these files to provide a graphical interface to the user to review the data. Some of the files are very long, but they can be used by external scripts to obtain information about the results. For example, the polyphred.out file in the edit_dir contains the specific positions where polymorphisms are seen in each contig, and the genotypes of all individuals at that position. The same information is displayed graphically in Consed.
6. View the results. From the edit_dir, type the command consed. A warning may appear in the command line window: <no ./.consedrc file so no project-specific resources– that’s ok>. Ignore this. Two windows will appear on your X-terminal screen. One will be called Ace Files and the other Consed Main Window.
7. Select an Ace file by double-clicking on it. The newest file is on top. An Ace file in the edit_dir is absolutely required for Consed.
The empty boxes in the Consed Main Window will then be populated with the selected Ace file name, a list of Contigs (overlapping reads), and a Read List. A large warning box may also appear referring to templates; ignore it. The Consed Main Window is shown in Figure 7.16.1.
Figure 7.16.1 Consed Main Window. Drop-down menus applying to the whole project are on top, then a box listing the .ace file being reviewed. Below the .ace file are several action buttons and then the Contig List and Read List. Two boxes are available for performing searches based on read name, and two action buttons that will either Show a contig or read highlighted in one of the lists or Close All Windows. The Quit Consed button (on the upper right) will exit the program after prompting for saving your changes. The Help button (on the top right) will give you full documentation.
a. The contig list is arranged by number of reads in the contig, lowest to highest.
b. The read list is arranged alpha-numerically.
c. Clicking on either a contig or a read will bring up an Aligned Reads (AR) window display (Figure 7.16.2, a sample AR window).
8. Basic Consed tools and navigation:
a. The Consed Main Window. Figure 7.16.1 shows the critical elements and actions available in the Consed Main Window. This is the primary window from which you select the parts of the assembly you want to view.
i. Drop-down menus applying to the whole project are on top: File, Navigate, Info, and Options. Left-click on the topic to see the choices. Choices are fully described in the Consed Documentation, available through the Help option, to the right. See below for use of the Options menu.
ii. The name of the assembly .ace file being displayed is in the box below.
Figure 7.16.2 Aligned Reads Window, PolyPhred v.5.04. Read names are on the left, and arrows indicate the direction of each read. Here, forward is on the top and reverse on the bottom. The quality of bases is indicated by the shading, from dark gray to white. The red bar on the pseudo sequence is a coding sequence tag added by the user, while the column of blue tags is applied by PolyPhred to indicate a position where a polymorphism is identified. The red tag on the consensus means this polymorphism carries a polyPhredRank1 tag, with a quality score of 99, and the pink tags indicate heterozygous bases. For color version of this figure see http://www.currentprotocols.com.

iii. Several action buttons appear next. The most relevant for this discussion are the String Search button, for finding particular sequences within the assembly, and the Quit Consed button. The others are explained fully in the Help documentation.
iv. A list of the contigs in the assembly appears in the large Contig List box, in ascending order by number of reads in each contig.
v. All the reads in the assembly are shown in the Read List. Failed sequences do not appear.
vi. The Find Reads box allows you to search for all reads with a particular string in their name (for example, an exon or a gene). The result will be a new box listing all reads that contain the string in their name.
vii. To quit Consed, use the Quit Consed button on the upper right. You will be prompted to save changes if you wish. Unless you made an editing mistake, save the assembly. Each save generates a new .ace file, with a new number. When Consed is run again, you will usually select the newest .ace file; the others may be removed, as they are very big.
b. View a contig in an Aligned Reads (AR) window (Figure 7.16.2):
i.
Read names on the left side indicate Project gene exon.samplename.primer (.F or .R).
ii. Arrows indicate the direction of each read; reads are complemented and aligned for viewing in Consed. The default view groups forward and reverse reads separately, above or below the line. (We usually change this from the Options menu; see step 8c.)
iii. Shades of gray to white indicate the quality (Phred score) of bases; whitest is best. The numerical Phred score of a base is shown in the bottom box if you left-click on the base of interest. Scores are displayed as follows: 0 to 39, shades of gray; 40 to 97, bright white; 98, edited (unsure); 99, edited (sure).
iv. Upper-case bases are high quality and lower-case bases are lower quality (a cruder scale than the numerical Phred scores or the shading).
v. The consensus sequence is the top line, determined by the program from the data.
vi. Red bases differ from the consensus.
vii. If user-defined tags were entered, they are seen on the pseudo sequence. Example: the red bar is a coding sequence tag.
viii. When PolyPhred recognizes a polymorphism, a column of tags is applied (see position 558), blue for homozygotes and pink for heterozygotes. Mouse over a pink tag to see the alleles in the box at the bottom of the window.
ix. PolyPhred also tags the consensus with a color-coded rank tag: red = Rank1, score of 99, highest quality. The range is Rank1 to Rank6.
c. Revise the view format options. Go to the Consed Main Window and left-click on Options at the top. Select General Preferences from the menu. A General Preferences box appears, in which you can modify default settings. Change the “Display reads alphabetically or by strand . . .” option from “strand” to “alpha,” then left-click Apply and Dismiss (bottom left). This is the only change we routinely make in the default view. The effect is shown in Figure 7.16.3, where the forward and reverse reads of each individual are next to each other.
This makes it easier to identify artifacts not present in both directions. Changes made from the Options menu are temporary and will not be applied the next time Consed is run.
d. Show protein translation. Click on the Misc box at the top of the AR window. Multiple options appear; select Show Top Strand Protein Translation. All three reading frames for the entire sequence are shown, with potential start and stop codons highlighted. Use this display format, because it facilitates the characterization of variants. The user may tag actual start and stop codons and coding positions.
e. View different parts of the contig. Scroll through the contig in the AR window by moving the scroll buttons on the bottom or side of the window. Other methods are described in the Quick Tour or Consed documentation.
f. View traces. Middle-click on a base of one of your sequences and a Trace Window will appear. You can scroll through the sequence using the scroll button on the bottom. You can change the magnitude or breadth of the peaks using the two slide buttons on the left. The top line of sequence is the consensus. The second line is the edit line, where you can edit the calls for this chromatogram, or tag a base or region using the middle button. The third line is the Phred base call. You can view four chromatograms at one time to compare data; this number can be modified in the Options window.
g. Edit a base. If you are sure Phred has called an incorrect base, you can change it. Middle-click on the read in the AR window. The trace will appear. Left-click on the base you wish to change. Type the correct base. If you are sure, you can make it high quality by typing it in upper case. The Phred score will reflect your selection. When you save the assembly (see step 8m), this change will be recorded in a new version of the phd file for the read (phd.2). However, the previous phd file (phd.1) will also be retained, and the chromatogram will never be altered.
h.
Find a sequence. String Search allows you to find a sequence: left-click on a base in the AR window and slide the cursor over 10 to 15 bases while holding the mouse button down. The bases will turn yellow. Release. Left-click the Search for String box, upper left. A Search for String window appears. Use the middle mouse button to paste the highlighted string into the Query string field. A box will appear that shows every place in the assembly where the string appears.

Figure 7.16.3 (A) Aligned reads window after changing the view from “strand” to “alpha” in the Options list. This view facilitates review by placing forward and reverse reads next to one another. The red tag at the top indicates the polymorphism is Rank1, with a quality score of 99. Two individuals are shown with heterozygous tags (pink) and two with homozygous tags (blue). (B) The chromatograms showing heterozygous and homozygous individuals in both directions. See fuchsia arrows in panel A for these data (021.F at the top, then 021.R, 019.F, and 019.R). For color version of this figure see http://www.currentprotocols.com.

i. Complement a contig. It is convenient to review the sequence with the forward reads going from left to right, or in a sense direction. If the assembly does not do this, you can click on the Compl Cont button above the sequence in the AR window.
j. Join contigs. If two contigs overlap but are not joined, highlight and string search part of the overlapping sequence. A new box appears, listing two contigs. Double-click on each and the AR window for each appears, with the cursor over the first base in the string. In each window, click on the Compare Cont button above the reads. The sequences will appear in a Compare Contigs window. Click on Align, and scroll to review the alignment. If it looks valid, click on Join Contigs.
Your assembly will be modified to place these two contigs into one.
Phrap assemblies are sometimes incorrect, and may fail to put all overlapping reads in one contig. You will need to review each amplicon quickly to be sure no errors have occurred. Example: two individuals have a 3-bp deletion relative to the others and to the normal (pseudo) sequence; they may be placed in a different contig, but you can combine them. Or, two exon templates may overlap, but the overlap may be very small. You can join them to facilitate review of the intronic sequence.
k. Tear contigs. Right-click on a base near the position where you want to tear a contig into two separate contigs. Choose Tear contig at this consensus position. A box will appear where you select the contig you want each read to fall into after the separation. Select Do Tear at the bottom of the box.
Occasionally, contigs are misassembled, sometimes for no apparent reason. This tool allows you to fix them.
l. Remove reads. Reads may be removed from a contig. The most common reason to do this is that poor quality interferes with the analysis. Right-click on a base in the offending read, and choose Put read xxxxxx.xx.F into its own contig. This read will now appear as a single-read contig in the Consed Main Window. See the Quick Tour documentation for more complicated read removal and addition.
m. Save assembly. You must periodically save the changes you have made if you want to have them available the next time you open the assembly. If the program crashes without saving, you may lose your changes. The assembly can be saved from the File menu in either the Consed Main Window or the Aligned Reads window, upper left in both cases. Left-click on the menu and choose Save Assembly. This creates a new .ace file, which you will be able to access the next time you open the Project.
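Because every save writes another numbered .ace file into the edit_dir, it can help to list the versions before deciding which to keep. The following Python sketch is only an illustration: it assumes file names of the form <name>.ace.<number> and a single project per directory (check the naming on your own system), and the helper name newest_and_older is hypothetical. It reports file names only and deletes nothing.

```python
import re

# Assumption: saved assemblies are named "<stem>.ace.<number>".
ACE_VERSION = re.compile(r"^(.+\.ace)\.(\d+)$")


def newest_and_older(filenames):
    """Split versioned .ace file names into (newest, list of older ones)."""
    versioned = []
    for name in filenames:
        match = ACE_VERSION.match(name)
        if match:
            versioned.append((int(match.group(2)), name))
    if not versioned:
        return None, []
    versioned.sort()  # ascending by numeric version, so 10 sorts after 2
    newest = versioned[-1][1]
    older = [name for _, name in versioned[:-1]]
    return newest, older


# Hypothetical directory listing, for illustration only.
listing = ["proj.fasta.screen.ace.1", "proj.fasta.screen.ace.2",
           "proj.fasta.screen.ace.10", "polyphred.out"]
newest, older = newest_and_older(listing)
print("keep:", newest)
print("candidates for removal:", older)
```

Sorting on the numeric suffix rather than on the file name matters once a project passes nine saves, because a plain alphabetical sort would place version 10 before version 2.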
You can also open an older .ace file to see the data before edits. Ace files are big, and you should not accumulate too many in a project. Most old .ace files are not useful, so use the UNIX command rm <filename> to delete them from the edit_dir. You will probably be prompted to confirm removal.
9. Add tags manually to control or pseudo files. Type the designation you have given to identify your control or pseudo sequences in the Find Reads box of the Consed Main Window (example: pseudo). Press Enter. A new window will appear with a list of all the pseudo sequences in the assembly, in alphanumeric order. Double-click on the first. An Aligned Reads window will appear. Apply coding sequence tags to your pseudo or control sequences manually, if you have not used an external script to do so.
a. Coding sequence tags may be added to indicate the most important regions for review.
i. Locate the first base of the coding sequence using string search (refer to the UCSC template files used in primer design, UNIT 7.9). Middle-click on this base in the pseudo sequence. A Trace Window will appear.
ii. Middle-click on the first base of the coding sequence in the “edt” line and, holding the button, slide the cursor along for a few bases. They will turn yellow. Release, and a window appears, “What to Do with Selection.”
iii. Choose add tag. A new window appears, Select Tag Type. Note the variety of options available. Choose Coding Sequence, and select “ok” at the bottom of the window. A new red tag covering the highlighted bases will be seen on the pseudo sequence.
iv. To locate the last base of the coding sequence, string search for the last 15 bases of the exon as seen in the UCSC reference. Note the position of the last base in the box that appears. Then right-click on the tag you created for the beginning of the coding sequence (see iii), and choose “Tag: coding sequence show more info?” in the drop-down menu.
The coding sequence box will appear, where you can change the End Unpadded Consensus Position to the number noted for the last coding base. You may also type a note in the comment box (e.g., Exon 3). Then choose Save Changes. Your tag will now cover the entire coding sequence for this exon. If you added a note, it will appear in the box at the bottom of the AR window when your mouse is over the tag.
v. Tag primers in the pseudo sequence. String search for a forward primer sequence. Middle-click on the relevant pseudo sequence to bring up the Trace window, and highlight the forward primer sequence while holding the middle button down. Choose add tag, and select Forward Primer from the list that appears. Click OK at the bottom of the tag list box. Many types of tags are described in the list, among them forward and reverse primer. Different tags are displayed in different colors.
b. Manual addition of tags is time-consuming, but it is very useful to mark the coding sequence and the forward and reverse primers. Tags are saved in the phd files when you save the assembly, and will be permanently available to you. You can also add other relevant tags, for start codons, known mutations, stop codons, etc. These phd files with tags can be copied with the same templates to other projects where you are sequencing the same genes.
10. Recognizing mutations: common patterns and problems.
a. Clean data: Figure 7.16.2 shows good clean data as represented in an AR window in Consed. Most of the bases are bright white, and one can be confident that there are no hidden variants in these traces, except those marked.
b. SNP: If a SNP is present in your data, it will look like the one at position 558, shaded gray, in Figure 7.16.2. Note that this appears in the forward and reverse sequence of the same sample.
In one case, Phred has called a C, in the other a T. Both are lower case (low quality) because of the secondary peak. The pink tag means heterozygote, and if you hold your mouse over this tag (no click), it will tell you it is heterozygote TC tag data 99 (highest quality).
c. Indel: Heterozygous small insertions and deletions have a very clear pattern in the AR window in PolyPhred and in the corresponding trace windows. The sequence in each direction is clean until the deletion or insertion, then it is double, as both alleles are read (Fig. 7.16.4). The traces for a normal individual (forward only, top) and for an individual with a heterozygous deletion (forward and reverse) are shown in Figure 7.16.5. To determine the exact nature of the indel:
i. Write the wild-type sequence, from 10 bases to the left of the position where the sequence begins to fail to ∼20 bases after this point.
CCTTGCCACGCTAGCTTTCTGACATC.....
ii. From the forward read, write the variant sequence immediately below the wild-type sequence: copy the wild-type bases up to the point where the mixed sequence begins, then at each mixed position write the base that appears in addition to the wild-type base.
CCTTGCCACGCTAGCTTTCTGACATC.....
CCTTGCCACGCAGCTTTCTGACATCC.....
From this you can see that the secondary sequence is like the wild-type with one T deleted.
iii. To confirm, read the sequence in the reverse direction, right to left, and line it up with the others.
CCTTGCCACGCTAGCTTTCTGACATC.....
CCTTGCCACGCAGCTTTCTGACATCC.....
...CCTTGCCACGCAGCTTTCTGACATC

Figure 7.16.4 Aligned Reads window showing four individuals with heterozygous single-base deletions. The characteristic pattern for indels is clear sequence to the point of the deletion or insertion, then an immediate loss of good sequence. The arrows indicate the direction of the reads.
PolyPhred v.5.04 does not tag the indels, so manual review is required to identify them. For color version of this figure see http://www.currentprotocols.com.

d. Longer deletions that fall within the amplicon: These are harder to see, especially if they are so long that both ends do not fit in one window. They may be characterized exactly as shown above. To avoid missing them, always check the beginning and end of sequence that appears to be double when the poor sequence extends for a long distance in both directions. The hallmark is that each direction will start as a clean sequence, then become double. If, for example, the deletion is 20 bp long, the double sequence in both reads will overlap for ∼20 bases. But if the indel is 150 bases, the double sequence may appear to be a failed sequence, and you may miss a critical variant.
e. Contaminated sequence: In contrast to the situation described above for indels, double sequence can also appear when two or more PCR products or two primers are present in a sequencing reaction. In either case, the traces will be double or messy in the beginning, often in both directions, but may clear up somewhere before the end. This may be explained by:
i. Contaminated universal sequencing primers, each of which anneals to the PCR product and extends a different product that incorporates dye terminators.
ii. Two or more PCR products that are both extended by a single primer and incorporate dye terminators.
iii. Clean sequence appearing later on because one of the templates is shorter than the other, so that only the longer product is read toward the end.
11. Review your data. Open Consed and review all contigs; record variants and failures.
a. Each contig should contain the control or pseudo sequence from one exon and all the reads you have deposited in the chromat_dir for that exon.
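The expectation in step 11a can also be checked outside Consed once you have a contig's read list as text. The sketch below is a hedged illustration rather than part of the software suite: it assumes, hypothetically, that every sample read name ends in ".<sample>.<F or R>", and the function name missing_reads and the example names are made up for this example. Adapt the parsing to your own naming convention.

```python
def missing_reads(read_names, expected_samples):
    """Return sorted (sample, direction) pairs absent from one contig's reads.

    Assumes each sample read name ends in ".<sample>.<F or R>"; names that do
    not fit (for example a pseudo sequence) are ignored.
    """
    seen = set()
    for name in read_names:
        parts = name.split(".")
        if len(parts) >= 3 and parts[-1] in ("F", "R"):
            seen.add((parts[-2], parts[-1]))
    expected = {(sample, direction)
                for sample in expected_samples
                for direction in ("F", "R")}
    return sorted(expected - seen)


# Hypothetical read list for one exon; sample 021 lacks its reverse read.
reads = ["GENE1_ex3.019.F", "GENE1_ex3.019.R",
         "GENE1_ex3.021.F", "pseudo_GENE1_ex3"]
print(missing_reads(reads, ["019", "021"]))
```

A read reported as missing here has usually failed and been left out of the assembly, so it is a prompt to go back to the chromatograms rather than evidence about the sample itself.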
Figure 7.16.5 Traces showing a single-base deletion. The top panel is the forward trace of a normal individual, the middle panel shows the forward sequence of an individual with a deletion, and the bottom panel shows the reverse sequence of the same individual. For color version of this figure see http://www.currentprotocols.com.

b. Manual review:
i. Review the contigs for faulty assembly; tear apart those that are incorrect, and join those that belong together. If two exons are close to each other and their templates overlap, they may be in the same contig. For manual review, we review each gene, exon by exon, beginning with the first, characterize each mutation, and record the results in an Excel sheet.
ii. Failed reads may not appear, so check to see that all individuals are present for each exon.
iii. Check to be sure every amplicon is present. If all reads fail for one exon, the pseudo sequence may not appear, and it is easy to miss the fact that the exon is not present.
c. Automated review:
i. Using the Navigate menu in the Consed Main Window, navigate tags in all contigs for the polyPhredRank1 to polyPhredRank6 tags.
ii. Click on one of the polyPhredRank tags. A box will appear with a list you can click through, jumping right to the tag, in whatever contig it may be. You must do one Rank at a time.
iii. Click through all Ranks to determine whether the tagged variants are real or not. Record the results.
iv. This gives a quick view of all the mutations found by PolyPhred and, depending upon your dataset and the size of your project, may be very informative. However, you may also need to scan each contig for variants missed by PolyPhred, and also for indels, which currently are not tagged in v.5.04.
12.
Editing bases in Consed: If you are sure Phred has called an incorrect base, you can change it (see step 8g above). Middle-click on the base to be changed, which will bring up a trace window. Edits are made in the trace window. Left-click on the base in the “edt” line of the sequence and type in the changed base. Changes are saved in a new version of the phd file that contains the information extracted from each chromatogram. The changes are written to a new version of the Ace file when the project is saved, and will be seen if the new Ace file is opened at a later date. The chromatograms themselves are never modified.
If your purpose is primarily to identify variants, it is not necessary to edit most bases. Editing is time-consuming, and unless a change affects a variant it is not worth doing.
13. Save Assembly before closing to save any changes.

COMMENTARY
Background Information
Progress in sequencing technology and in knowledge of the genome has made it technically and economically feasible to screen large numbers of samples for mutations in many different genes. UNIT 7.9 presents detailed protocols for generating sequence data by automated fluorescence-based sequencing. This unit describes sequence data review and mutation identification using one of several available programs, PolyPhred v.5.04. Our group has had great success with this particular program. Development of the program continues, and a new version, as yet untested in our hands, is available for indel detection (PolyPhred v.6.11, http://droog.gs.washington.edu/get_poly6.html), described in Bhangale et al. (2006). Several additional computer programs that aid in the analysis of electronic chromatogram files (also known as electropherograms, .abi or .scf files) from an ABI (or other) automated sequencer are briefly noted below. Some are commercially available, while others are free to academic laboratories.
Trace analysis software from other sources
For most mutation identification projects, it is clearly advantageous to use some form of sequence analysis software for the task of reviewing the quality of the data and identifying sequence variations, and there are many options available. Below is a discussion of four of these options from the perspective of the end user; details of programming algorithms are not discussed. The first three (Sequencher, DNASTAR, and Mutation Surveyor) are commercially available and run on either a Macintosh or PC/Windows platform. These programs seem to be well supported and widely used for sequence analysis and mutation identification. They each have graphical reporting capacities that are attractive. The Staden package for sequence analysis, described in Staden (1994), includes Gap4 and (like PolyPhred) has recently become available without cost (http://staden.sourceforge.net/). The new version can be run on UNIX or PC platforms, but (like PolyPhred) it seems complex if you have no prior experience with the package. The choice of software for mutation detection will depend on personal preference, local experience, and the availability of computer resources, but it is a choice that should be made early in the planning stages because it is a critical element for success.

Sequencher 4.7
Sequencher is a sequence assembly software package available from Gene Codes (http://www.genecodes.com/; info@genecodes.com). It is easy to install and runs on either Macintosh or PC/Windows computers. We have no current experience with Sequencher, but it is one of the programs most commonly mentioned as being used by the laboratories for whom we perform sequencing. A review of its features online and through the Demo program, downloaded from Gene Codes for testing purposes, shows that it performs many of the functions required for mutation analysis.
Sequence data are imported in the form of chromatograms from automated sequencers, such as ABI, LI-COR, or Pharmacia/ALF systems. A user-defined Reference Sequence can also be imported and used to align experimental data. The data are assembled with the “assemble contigs” command and an overview of the assembly is available. Reads can be trimmed based on program-defined quality scores and also to match the Reference Sequence. The user can observe the sequence and the aligned chromatograms and inspect them for mutations. The simultaneous evaluation of bidirectional traces of a control sample and one with a suspected mutation will allow the identification of most mutations. It is possible to navigate quickly through the ambiguous bases or places where the data do not agree with the Reference Sequence, so base-calling errors can be edited, and heterozygous or homozygous mutations identified and annotated. There is a translation function as well, so each reading frame can be viewed. Sequencher also has an automated feature for identifying heterozygote variations in which “secondary peaks” are called. The user can specify the minimum lower peak height as a percentage of the upper, then quickly scan and edit base pairs with secondary peaks that are suggestive of a heterozygote variation. However, the program notes say this function calls some positions that are not true heterozygotes, and also fails to capture some, largely due to background noise or sequence data artifacts. Once the contig is edited, a Variance Table can be generated, indicating where differences exist between the consensus sequence (representing your edited experimental data) and the Reference Sequence.
A graphical summary report can be produced of features you have identified, showing base or amino acid changes and their locations. The major advantage of Sequencher is that it is easy to install and use on Macintosh or PC/Windows computers, after consulting the user’s manual with its short tutorial. It is clearly useful for sequence assembly, especially for relatively small projects, and is superior to visual inspection of printed chromatograms for detecting heterozygote mutations. However, compared to PolyPhred, it is relatively labor-intensive to import a set of data, assemble it, and edit it. Another consideration is the current cost of Sequencher, approximately $3000 for a single academic user, with support and upgrades for 1 year. In contrast, PolyPhred is free to academic laboratories that have the ability to implement it. A free demo version of Sequencher can be obtained upon request from Gene Codes for those interested in evaluating it. More information on Sequencher can be found on the Gene Codes Web site (http://www.genecodes.com/).

Lasergene from DNASTAR
Lasergene (http://www.dnastar.com/products/lasergene.php) is a Windows/Mac program package including seven modules for the analysis and management of DNA and protein sequences. Individual modules for particular functions may be purchased, most notably SeqMan Pro, designed for sequence assembly and SNP discovery. Site licenses are also available for multiple users. The Web site describes the various elements and the details of SeqMan Pro. We have not used this program, but the overview suggests that it includes many of the same features as Sequencher and Mutation Surveyor. Data can be in multiple formats, including ABI files and SCF files, and the program can handle large numbers of reads, so it is not limited (as Mutation Surveyor is currently) to 400.
In addition, it can manage phrap assemblies, which might be useful if you were using phredPhrap as a primary sequence analysis tool but wanted to use some of the attractive graphical reporting facility of SeqMan Pro.

Mutation Surveyor and Mutation Explorer
These programs for sequence analysis, available from SoftGenetics, are described in detail at http://www.softgenetics.com/download/Mutation_Surveyor_Sheet.pdf. Mutation Surveyor is recommended for discovery, while Mutation Explorer is for clinical applications, though the difference between them is not obvious. Both run in the PC/Windows environment. Mutation Surveyor is available in different size capacities; the smallest can handle up to 48 lanes of sequence, while the largest handles 400, and there is a network version as well as stand-alone versions. A detailed description of the implementation of Mutation Surveyor is in UNIT 10.8. A comprehensive external review of Mutation Surveyor by the National Genetics Reference Laboratory, UK, is available at http://www.softgenetics.com/download/NGRL_TechnologyAssessment.pdf. This review describes the program’s characteristics and gives a very informative picture of its capacity. Version 3.01 of Mutation Surveyor was released in December 2006, adding two new functions. It now has the ability to mask vector sequences if cloned fragments are being sequenced, and it has an improved capacity to identify known SNPs from the GenBank database or from the laboratory. The user can activate a Negative SNP function so these SNPs are shown on the screen with experimental data, even if they are not seen in the sample. Mutation Surveyor is used in the Laboratory for Molecular Medicine at HPCGG. This group finds its approach to mutation identification to be extremely effective, and its reporting capacity makes it very helpful.
However, as noted in the external review, it cannot yet be used in a clinical setting without additional manual review. This program offers quite a few more features than Sequencher, but is correspondingly more costly. Like Sequencher, it is rather labor-intensive to upload the appropriate files for analysis, and does not run well in a truly automated fashion.

Critical Parameters and Troubleshooting
Because heterozygous point mutations and small insertions or deletions are the most difficult to identify, this unit illustrates how such mutations can be located using PolyPhred v.5.04. Homozygous mutations are also observed when a known reference sequence is incorporated in your assembly. There are several caveats that must be recognized when performing sequence analysis of the type described here, and there are certain hallmarks that the reviewer may observe in sequence data that indicate potential problems in data production that should be addressed. The most important caveats and hallmarks are discussed below.
Sequencing alone is not useful for identifying heterozygous individuals carrying large deletions or rearrangements in one chromosome, because in such cases only the normal or existing allele will be amplified. If one observes a homozygous variant, it is not possible to exclude a deletion unless the individual also exhibits a heterozygous variant in the same amplicon. It is useful to analyze the entire amplicon rather than just the coding sequence, because such confirmatory heterozygous SNPs may occur outside the exon. This is standard practice at HPCGG. In a broader sense, sequencing never proves the presence of two alleles unless heterozygous bases are seen. However, loss of heterozygosity may be especially worth investigating if the sequence data show multiple exons without any evidence of heterozygosity.
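The caveat about homozygous variants can be turned into a simple bookkeeping check on a table of recorded genotypes. The sketch below is only an illustration under stated assumptions: the tuple layout and the genotype labels "hom_ref", "het", and "hom_var" are hypothetical, not a PolyPhred or Consed output format, and the function name is made up for this example. It lists each sample and amplicon in which a homozygous variant was recorded without any heterozygous call in the same amplicon, which are the cases where a deletion of the other allele cannot be excluded from the sequence data alone.

```python
def unsupported_homozygotes(calls):
    """Return sorted (sample, amplicon) pairs with a homozygous variant
    but no heterozygous call in the same amplicon.

    calls: iterable of (sample, amplicon, position, genotype) tuples, where
    genotype is "hom_ref", "het", or "hom_var" (labels assumed for this sketch).
    """
    with_het = set()
    with_hom_var = set()
    for sample, amplicon, _position, genotype in calls:
        if genotype == "het":
            with_het.add((sample, amplicon))
        elif genotype == "hom_var":
            with_hom_var.add((sample, amplicon))
    return sorted(with_hom_var - with_het)


# Hypothetical genotype records, for illustration only.
calls = [
    ("S1", "exon2", 558, "hom_var"),
    ("S1", "exon2", 610, "het"),      # two alleles demonstrated for S1
    ("S2", "exon2", 558, "hom_var"),  # no heterozygous position for S2
    ("S3", "exon2", 558, "hom_ref"),
]
print(unsupported_homozygotes(calls))
```

A pair reported by this check is not evidence of a deletion; it only marks where the data cannot distinguish a true homozygote from amplification of a single allele.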
In the case of tumors that may be contaminated with normal tissue, variants in tumor cells may be masked by the amplification of small quantities of DNA from normal cells. UNIT 10.9 describes the sequencing of the EGF receptor in tumor cell populations, where the presence of normal cells may be a problem. It is possible that variant peaks may be much smaller than the normal heterozygous peaks (ratio should be 1:1) when tumor cells are mixed with normal cells, and so background must be very low for proper interpretation. As described in UNIT 7.9, the primers used to amplify regions of the genome must not span a SNP or indel, or amplification may fail or be biased. While dbSNP continues to add newly reported SNPs that can inform the placement of primers, many novel SNPs occur in every individual, so all data must be viewed with that caveat. We have seen instances where a primer fails to amplify one allele and thus a mutation has been missed. We are only aware of this when there are overlapping amplicons and the SNP under the primer is seen in the product from the flanking primer pair. It is important to view the data with a critical eye, and to note any unusual characteristics. For instance, if a heterozygous SNP occurs in every sample, there is a possibility that DNAs may have been cross-contaminated, very likely with one or more that are actually homozygous for the variant. This is most likely when many DNAs are stored in a plate, and they are accessed many times. Such results should be confirmed with clean samples. Another problem that is often first noted during sequence analysis is that primers or whole amplicons are not unique in the genome. The data may look very “dirty,” with many background peaks and potential heterozygous positions throughout in all individuals. The color in Consed is not bright white (see Basic Protocol, step 10a). 
We have seen this to be true for several exons of a gene, while all the rest of the amplicons/exons are of high quality, and have determined that fragments of some genes are duplicated in other places. The amplicons in question, as well as the primers, should be checked to confirm that they are unique, using the BLAT tool at UCSC (http://genome.cse.ucsc.edu/cgi-bin/hgBlat?command=start&org=Human&db=hg18&hgsid=105162860). BLAT is described in UNIT 7.9.

Anticipated Results and Time Considerations

The most critical factor for efficient and accurate mutation identification using PolyPhred or any other software application is to obtain high-quality data. The protocols for doing this are in UNIT 7.9. It is also important that PolyPhred be set up on a platform with sufficient power and space to run fairly intensive computational programs. Once this is accomplished, and a pipeline is established to move the data files to the appropriate folders, one can anticipate that data processing will move smoothly and will be very rapid.

Users will require a few sessions with a UNIX expert to learn the basics of the UNIX command window, and then a few sessions with the “Quick Tour” of Consed or with an experienced user to become proficient. Fortunately, even a very inexperienced user cannot create any irrecoverable errors within Consed and in a properly managed UNIX infrastructure. The user will rapidly learn the few tools and commands that are repeated often, and will understand the structure and appreciate the scope of the software.
Data review and reporting may be more or less time-consuming, depending upon several factors including (1) data quality, (2) the number of amplicons in a project, (3) the number of individuals to be sequenced for each amplicon, (4) the number of variants present, and (5) the method of reporting the results.

If the data are all clean, the PolyPhred assembly is likely to be completely correct, and each amplicon can be scanned quickly, individual samples accounted for, and variants seen and characterized with very little difficulty. For poor-quality data, the time required can increase dramatically. Thus, it is very important to optimize primer design, PCR reactions, sequencing reactions, and analyzer run parameters ahead of time.

The HPCGG has had projects where >100 amplicons are sequenced in a small number of samples (two to ten), as well as projects where a much smaller number of amplicons are sequenced in hundreds of individuals. Projects with many amplicons are more time-consuming than those with few, even though the actual number of reads may be equal, because each amplicon will form one contig, which must be individually opened and reviewed. If all the reads fall into one or a few amplicons/contigs, the work of reviewing is accomplished more quickly.

The number of variants that appear in a project may also vary, depending on the experimental design. In a candidate gene search, most (or all) of the amplicons reviewed may have no significant or novel variants. Alternatively, in a project where one has a selected group of patients with one disease frequently caused by mutations in one gene, every sample may have one or more significant variants. The most time-consuming part of data analysis is the characterization of novel mutations.

One of the most complicated issues is to determine how to report results efficiently and with complete clarity.
PolyPhred does not have a graphical or spreadsheet-based reporting tool that can be readily used for clinical or even research reporting. HPCGG uses an Excel format in which every variant observed is reported across the top, in the order in which the variants occur within the gene. It is essential to provide the exon or intron within which the variant occurs, and a 9-base string for each SNP, or 4 bases flanking each indel or duplication. This string should be unique, and should allow the unequivocal identification of the variant. In addition, one may wish to provide the genotype for each individual at this position; for coding sequence variants, it is useful to determine the nucleotide and codon numbers and changes. We report every variant within the amplicon, even those that are far from the coding sequence. Finally, for publication or presentation of results, the Human Genome Variation Society’s recommendations (http://www.hgvs.org/mutnomen/) should be consulted to determine the correct nomenclature for the variant.

In summary, the time required for a project is extremely variable depending upon all the factors discussed, most notably the quality of data, the number of amplicons, the number of individuals, and the detail of the reports to be generated.

Literature Cited

Bhangale, T.R., Stephens, M., and Nickerson, D.A. 2006. Automating resequencing-based detection of insertion-deletion polymorphisms. Nat. Genet. 38:1457-1462.

Ewing, B. and Green, P. 1998. Base-calling of automated sequencer traces using phred. II. Error probabilities. Genome Res. 8:186-194.

Ewing, B., Hillier, L., Wendl, M., and Green, P. 1998. Base-calling of automated sequencer traces using phred. I. Accuracy assessment. Genome Res. 8:175-185.

Gordon, D. 2004. Viewing and editing assembled sequences using consed. Curr. Protoc. Bioinformatics 11.2.1-11.2.43.

Gordon, D., Abajian, C., and Green, P. 1998. Consed: A graphical tool for sequence finishing. Genome Res. 8:195-202.

Nickerson, D.A., Tobe, V.O., and Taylor, S.L. 1997. PolyPhred: Automating the detection and genotyping of single nucleotide substitutions using fluorescence-based resequencing. Nucleic Acids Res. 25:2745-2751.

Staden, R. 1994. Staden: Comparing sequences. Methods Mol. Biol. 25:155-170.

Stephens, M., Sloan, J.S., Robertson, P.D., Scheet, P., and Nickerson, D.A. 2006. Automating sequence-based detection and genotyping of SNPs from diploid samples. Nat. Genet. 38:375-381.