Barton, G. J. and Sternberg, (1987)
Transcription
Barton, G. J. and Sternberg, (1987)
Protein Engineeringvol.l no.2 pp.89-94, 1987 Evaluation and improvementsin the automatic alignment of protein sequences from chicken alpha and beta haemoglobin,varying the usersuppliedgap-penaltyparameterswill producea numberof difLaboratory of Molecular Biology, Department of Crystallography, Birkbeck ferent optimalalignments(Fitch and Smith, 1983). College, Malet Street, London WCIE 7HX, UK workershave notedthat alignmentsobtainedby the Several 'To whom reprint requestsshould be sent superpositionof homologoushigh-resolutionprotein structures The accuracy of protein sequencealignment obtained by can be quite different to tlose producedby automaticsequence applyrng a commonly used global sequencecomparison basedmethods(e.g.Dickersonet al., 1976;Chothiaand[.esk, algorithm is assessed. Alignmentsbasedon the superposition 1982;Delbaereet al.,1979).In this paper,the qualityof an of the three-dimensionalstructures are used as a standard automaticallyobtained sequencealignment is systematically for testing the automatic, sequence-based methods. Alignassessed by referenceto alignmentsbasedon structuresuperments obtained from the global comparisonof five pairs of position. Furthermore,in order to model more effectivelythe homologousprotein sequencesstudied gave54Voagreement observedevolutionarypreferencefor insertions/deletions to ocoverall for residuesin secondarystructures. The inclusionof cur in the loop regionswhich join secondarystructures(e.g. information about the secondarystructure of one of the pro. Perutzet al., 1965),we haveintroduceda secondarystructure teins in order to limit the number of gapsinserted in regions function(Q) whichhastheeffectof reducingthepenaldependent of secondary structure, improved this figure to 68Vo. A structuralregions.A similar apty for a gap in non-secondary similarity scoreof greater than six standard deviation units proachhasbeenindependently appliedby A.M.Lesk, M.lrvitt suggeststhat an alignment which is greater than TSVocorand C.Chothia(I*sk et al., 1986)to align proteinswithin the rect within secondarystrucfural regionscan be obtainedauto globin and serineproteinasefamilies. matically for the pair of sequences. In thispaperfive pain of structurallyhomologous proteinswere Key words: alignment/protein/sequence/secondary/structure consideredand the sequencealignmentstestedover residues whichareclearlyin homologoussecondary structures.Theinclusion of Q derived from a knowledgeof the threedimensional Introduction structureof one protein in eachpair resultedin an overall improvementin alignmentaccuracy. Automatic sequencecomparisons(e.g. Needlemanand Wunsch, GeoffreyJ.Bartonl and Michael J.E.Sternberg 1970; Sellers, 1974) are used to searchfor homology between a newly determinedsequenceand one in the data bank. Recent developmentshave aimed at identi$ing local homologies (e.g. Sellers, 1979: Goad and Kanehisa, 1982; Boswell and Mclachlan, 1984)or at increasingspeedand reducing memory requirements(e.g. Gotoh, 1982; Taylor, 1986; Fickett, 1984; Wilbur and Lipman, 1983). This paper concentrateson the accuracy of the alignment. From a sequencealignment of two (or more) proteins, conserved regions are identified which may provide pointers to functionally and/or structurally important regions of the molecules. This sequenceinformation alone can be the basis for studies such as site-directedmutagenesis(Winter and Fersht, 1984) or the selectionof peptidesagainstwhich antibodiesare raised (Sutcliffe et al., 1983). Furthermore, a newly determined sequencemay show sequencehomology with a protein whose structure has been obtained crystallographically and this can lead to a threedimensional atomic model by model-building techniques(e.9. Browne et aL, 1969;Blundell et al., 1983;Traverset al., l9M). Most sequencecomparison methodsare basedon dynamic programming algorithms such as those originally applied in molecular biology by Needlemanand Wunsch (1970) and formalized by Sellers (1974) and Waterman et al. (1976). These methods carry out a global comparison of the sequencesand the result is an optimal score for the comparisonand an alignment with that score. Although the score may be valuable for identi$ing homology againsta background of randomized sequences, the actual sequencealignment may not be correct. Even for two very short sequences( ( 20 amino acids when translated) taken @ IRL Press Limited, Oxford, England Materials and methods Proteinsaligned Five pairsof proteinsanddomains,for whichalignments based on three-dimensional structuresareavailable,weretakenastest data.For eachpair ofproteinsa setofunambiguously assigned equivalentresidueswereselectedfrom centralportionsof homologoussecondarystructures(TableI). The percentage accuracy of eachautomaticsequencealignmentwas calculatedfrom the numberof aminoacidpairswithin the definedzoneswhich were alignedas in the structuralalignment. Needlemanand Wunschalgoithm The computerprogram written for this study implementeda variant of the Needlemanand Wunsch(1970)algorithm. (i) A matrix of amino acid pair scoresD is chosen:in its simplestform this may indicatea scoreof 1 for identity and 0 for all otherstates,while moresophisticated systemsincorporate informationaboutconservative substitutions by usinga weighting scheme.In this studythe MDM259matrix was used(Dayhoff, 1972, 1978)with a constantof 8 addedto removeall negative terms. Preliminarystudiesshowedthis pair scorematrix to be superior to either identity or genetic code types. Alternative matrices(e.g.Mclachlan,1972;FengetaL, 1985)althoughnot includedin this studymaybe expectedto performat leastaswell as the MDM256matrix (Fenger al., 1985). (ii) The proteinsequences are definedasA1,^,B1,nwherem z{, B, respectively. andn arethe numberof residuesin sequence (iii) A matrix R.,n is generatedby referenceto D whereeach 89 GJ.Barton and M.J.E.Sternberg Table I. Proteinpain usedto test alignmentmethods Protein/domainA (abbreviation) (numberof residues) Protein/domainB (abbreviation) (numberof residues) Immunoglobulinlight chain variablercgion(FABVL) (103) Immunoglobulinheavychain variableregion(FABVH) (l17) Immunoglobulinheavychain variableregion(FABVH) (ll7) Immunoglobulinlight chain constantregion(FABCL) (105) Plastocyanin(PLASTO) (99) Source of structural alignment Number of residues taken for test Number of zones Cohenet a/. (1981) 4l Azurin (AZURIN) (128) Cohener aJ. (1981) Chothiaand Lesk (1982) 38 48 7 Human alphahaemoglobin (HAHU) (l4l) Root noduleleghaemoglobin (r.EGHE)(156) Lesk and Chothia(1980) Trypsin (P2PTN) (223) (PIEST) (24O) Tosyl-elastase M.Zvelebil (personalcommunication) 100 63 ll 7 For eachproEin pah, a numberof zoneswerc identified from the central regionsof homologousbeta-shandsand alpha-helices(FABVL:FABVH, Tyr8l-Tyr86:Tyr93-Asn98; Ser64-Ile70:Asn76l-erfi2; His34-Gln39:Tyr33-Arg38; Ser58-Ser62:Thr68-Asn72; lruzt-Pro8:l,eu4{1y8,Thrl9-Gly24:Serl9-Val24; Thr68-Asn72:Gly5l-Thr55; Tyr33-Val37:Thr38-Lys43; Val9-Thr97:VallOGSerl11.FABVH:FABCL, l,eu,l-Gly8:Val8-Prol2; Serl9-Va124:Thr2,1-Ile29; Alal3-Ile2l:Glnl,t-Val22; Tyr93-Asr08:Tyr8zl-Thr89; Vall0GSerlll:Val95-Val9. PI-ASTO:AZURIN,Asp2-Gly6:Sert-Gln8; AsnTGl:u82:AIa67-Leu73; Me92-Asn9:Metl2l-Lys128. Val$Asp42:Val49-Ser51;Gly67-I*u74:Lys92-Yal99: Gly78-Cys84:Glul06Cysl12; Glu25-Asn32:Lys27-Ser34; Pro77-His89:Asp89-Serl0l; Thr38-Thr4l:Ata39,Asf,2;Ala53-Val70:Pro6GAla77; HAHU:LEGH, Pro4-Lys16:Glu54lul7;Ala2l-Ser35:1e22-fle36; Trp34-Ser37:Trp39:Thr42; His23-Ile30:Hi98-Ile35; Glnl5-Asn19:Gln15-Glnl9; ho95-Hisl12:AsplM-Yall24: holl9Thrl37:Glul3l-Lysl49. P2PTN:P1EST, Met86-Leu90:Ala95-ku9;Cysl lGGlyl20:Cysl27-Gly13l; Lysl36Alal,l0:Glnlzl6l,eul50; Metl6GAlal63:Met172-Alal75; Gln63-Val72:Gln7GVal79; Automaticalignmentswere tesEdby considering how many Gly204-Lys208:Thr22l-Arg225). Prol8GCysl83:Prol9l-CysllX; Lysl8GTrpl93:Ala201-Phe208; residueswithin thesezoneswere equivalencedas expectedfrom the structur€basedalignment.The secondarystructuredependentfunction @ was derivedby referenceto the frst sequencein eachpair suchthat Q = Q, within eachzoneand Q = h outsidethe zone. elementR;u representsthe scorefor z4;versusB;. (iv) R.,n is actedon to generateS.,n whereeachelementSrj holdsthe maximumscorefor a comparisonof Ai,mwith 87,2. (v) Suitablepointersare recordedin (iv) to enablean alignment with the maximum score for 241,.v€rSUS J|l,n to be generated. In order to limit the total numbrof gapsintroduced(residues in onesequence alignedwith blanl$), a gap-penaltyis subtracted during the processof generatingS..n whenevera gap is introduced.We followed the recommendations of Fitch and Smith (1983) by using a gap-penaltyfunction having both lengthdependentand length-independent terms of the form: P:GtxL+Gz whereZ is the length of gap and Gl ar\dG2 are userdefined constants.The programalso allows gapsat the endsof the sequences(terminalgaps)to be optionallyweighted,a featurenot inherent to the standard Needlemanand Wunsch (1970) algorithm. hteruion of gap-pernltyfunaion to includesecondarystructural infornntion The secondarystructuredependentfunctionQ modifiesthe gap penaltyfunction to the form: P"r:Qx(GlxL+G2) where 0 = I s I and the suffix ss denotesthe inclusion of secondarystructuralinformation. In its most generalform, Q may be derivedfrom a propertyof the sequence which exhibits a maximumfor regionslikely to be involvedin secondarystructures or other conservedregions,and a minimum for regions likely to be subjectto greatervariability. One miglrt therefore denveQ from a secondarystructurcpredictionprcfile (e.g. Garrier et aI., 1978;Chouand Fasman,1977),a smoothedprofile (e.g.[.evitt, 1976),or a profileof likebasedon hydrophobicity ly buriedresidues(e.g. Janin, 1979).For the purposesof this study,however,a stepfunctionwasderivedthrougha knowledge of the tertiary structureof one of the pair of proteinssuchthat structure,andQ : Q - 1.0(0r) in regionsof clearsecondary 90 612 0 l 2 3 . ! ! t 3 ! 0 0 0 t 1 2 2 3 ! 3 9 9 0 o 0 ! 3 ! ! t 1 2 a e 2 3 ! 3 a 3 2 ! 2 t 2 t 2 t 2 2 t 2 2 z 9 ! 3 2 2 2 2 2 9 9 9 qt !r !o !e !2 ze ze 5 3t !0 l! t2 tr lr ? ! t t 6 l ! t ! t t ! r l t 6 1 6 r t t ! l ! l ! 9 t 6 t 6 r t l ! l r t a lo 16 t6 t! l! l! l! 0 0 I 2 3 . Gs I 6 ? E 9 t0 ! 0 2 1 !t ll l r l t t2 !r t 2 1 3 !! !o ts !3 ! 6 1 6 ! 6 ! 6 3 6 !6 2 t 1 6 16 t a 5 a | | I t0 t2 29 z9 29 19 21 It 2t 2t 2? et 2t It 2t Zt It 29 lt 29 tt tt t! tt r! a, 29 2' 29 2t 21 tt tt tt lt t! 29 2t 2e tt It tt rt lt 29 2t 2t tt la tl t! tr ? t t t0 16 36 ta !5 36 36 t5 16 !6 !6 36 !a t6 !6 !6 It It Ir lt G7 - 6 tol t2 tt 36 36 !6 3! 36 !6 t6 36 !6 t6 36 36 16 IC !6 !c 3C lc 36 36 !6 16 36 rc !6 !6 rc lbl !r aa 36 16 llt 36 16 !6 t6 !c lrt 16 lrl 16 t6 t6 36 !6 36 t6 36 t6 !c ,a !6 16 !6 t5 36 !6 16 16 !6 !6 t6 !c 3c 16 t6 !6 l6 !6 !6 t6 t6 29 29 !0 16 36 !t ?2 3! 36 36 Fig, l. Re.sultof varying constantsGt and Gz for a comparisonof FABVL with FABVH. (a) Standardalgorithm (no penaltyfor end gaps).O) As (s) but including secondarystructuralinformation (Qt : 0.25, q : 1.0). Each scorercpresentsthe total numberof correctly equivalencedresidues within the sevenspecifiedzones.All numben are thereforeout of a maximumof 41 (seeTable I and Figure 2). 0.25 elsewhere(Q) (seelegendto TableI). The valueof 0.25 wasfoundin a preliminary studyto be optimal for all five pairs of proteins,by varyingQ2ftom0 to 0.75 in stepsof 0.25.Note that a valueof Q, : Qt: 1.0 setsPss: P. Q thereforehas the effect of making the formation of a gap more likely in the regionslinking secondarystructures. Automatic dignment of protein scqucmcl l -----A----- Z ? 6 V V O L L T E O O i -----A----- | P B P O S P B L V O O V A R P P O S O O R T V L I t -------B----T I S C T E L T C T -----I -------B --------- I ---------C- I O V | 9 g 8 O S S N T I F O B i ; ; ;t------c-----; ; ; ; ;I ; ; i 3 x iI [ I ; ; v i F i l S 3 l 3 + b ! --------E-------- t----D-----t L T I P F L H R N S N R A V R T F g V 1 . LI V r-----D----! --------F----------- A A D V Y Y Y Y C C | -------F------ Z z S v V o A N O - N D D E T O A S R A 5 C K S C L I R D I I N t P o S P V o S L C A P E s v n i O o R r A O P L I T L L ' C E } I t S K | I --------E-------A A T L A I T R L s N o F 6 L ------E---------| t O L O P P P O O T R A E A A D E D T i -------c------i V F O O O T K L T V L R V A S V I . OI O O S L V T t------c------l R C | F v O L O A V T D A O R R T A Y I H V K I . IY Y Y T I I V i -------c-------l . I S S A T I - AI N O F S L R L | i --------E-------- L i ------c------- R R K T I t -----A----L T O P L E o B t-----A-----l V L t ------B-----T I S C B L T c l------B------l Y I V T T F F l O v S s Y S o H H S N I 6 T F N O T N B O B A D T tbl t -----D----B V A X S t l L v n i r t t -----D----- T p A A t I -------F-----D Y Y C O S V Y Y C A R N i i -------F------ L S N lol | Y D l ' I I O 6 O s L v O T AA AE D D | t -- ----o------R S L R V F O G O T N L t ' l OO O S L V T D V I A O C ------o------i ! Tv VS SL R Itg.2. The effect of includingsecondarysFucturalinformationinto the alignmentof FABVL (top sequence)with FABVH (bottom sequence).The seven = 3' Q, = zoies A-G correspondto bela-strandsin dre immunoglobulindomains(Cohenet al., l98l). (a) Standardalgorithm (no penaltyfor endlgaPs,$ = = (9" F 0.25) the C and 1.0, information structural O)' secondary Qt On including are incorrect. F strands and D and of the C 5). The alignments strandalignmentsare correctedbut D is still displacedby two residues. Alignrnmtscarried out runs usingthe following For eachpair of proteinsequences, gap and weighting, end for values Qy werc performed: 0, Run End gaPs weighted? I 2 3 4 Yes Yes o o N N Q, Qt I 0.25 I 0.25 G1and G2werevariedfrom 0 to l0 For eachmn, parameters in integerstepsproviding l2l comparisonsfor analysis.A preliminarystudyindicatedthat little moreinformationwasgained by explorng G/G2 spaceto Gr : Gz: 20. In addition, werecarhomologytestsfor eachpair of sequences conventional ried out by randomizingthe sequences100times, obtainingan andexpressalignmentscorefor eachrandompair of sequences in standard ing the scoreobtainedfrom the original sequences comparisons. randomized of the the mean from units deviation Resuttsard discussion of the standnrdNeedlemanand Wunschalign,4n assesstnent mmt ncthod andWunschalgorithm Thesensitivityof the standardNeedleman to changesin G1 and G2 is illustratedby Figure la, whilst Figure2a illustratesone possiblealignment.Figure la shows the resultof 121alignmentscarriedout on two immunoglobulin variabledomains(FABVL, FABVH) with Q : QL: 1.0 and no penaltyfor end gaps.Even the best alignmenthasonly 32 whilst the out of the41 selectedresiduescorrectlyequivalenced worst scores16/41.Although the valuesof G1 and G2 corto thesealignmentsaredistant,thescoresof 16border responding of 31, 30, 32 md 18/41.A changeof oneunit in on-regions G1 and/or G2 canthereforelead to a great reductionin alignmentquality. Furthermore,this patternof scoresis not common to all five protein pairs examinedindicating that a universal recommendationfor gap-penaltyvalues is not possible. The choiceof G1and G2 availablemay be, however,considerably studiedhere, reducedsincefor every pair of protein sequences at leastoneexampleof thebestalignmentobainablefor the pair may resultwith a valueof G1 : 0 (e'g. seeFigure la, val99s of lZt+t). This findingdiffersfrom thatof FirchandSmith(1983) who showedthat the expectedalignmentof two shortsequences from chickenalphaandbetahaemoglobincould only be obtained by including a length dependentgap-term(G1). Figure3 includesthe resultof runs for Q1= 1.0, with and andWunsch withoutend-gapweighting.The standardNeedleman algorithm doesnot explicitly include gap weightsfor terminal gaps.However,it is naturalto weightterminalgapswhenaligfng proteinsequences sincegenerallythereareequivalent homologous the end gapsleadsto an imWeighting the ends. near residues 9l G.J.Barton and M.J.E.Sternberg l- Sso g 40 uJ o = 3 0 ur E H20 0 ! : 1 0 0 ' 2 5I 0 0 ? 5 ENO6APS: W FABVl FAB\IX UW 'l 0 06 10 025 w u FAB\IX FABGl w ,t.00.25 ,100.25 1.0 0.25 1.0 025 w u w PLASTO AZURIN w u HAHU LEGH 1.0 0.25 .'0 025 w w uw P2PTN P TEST Fig. 3. Summaryof alignmentaccuracyfor five pairsof proteins/domains. Blocksrepres€nt the meanof l2l alignmentsfor Gs = 0- 10, G2 = 0- l0 in integersteps.Small circles and small squaresindicatethe best and worst alignmentobtainedin each l2l comparisons,respectively.Hatchedareasare the resultsof the standardalgorithm(Q, = Qt : 1.0), plain areasthe resultof includingsecondary structuralinformation(0" : 1.0, Qt = 0.25). W = endgapspenalized,UW = end-gaps not penalized. provementin meanalignmentquality for all five pairs of se(54-62%).In addition,thebestalignmentobtainedfor quences eachpairimprovedfor P2PTNversusPIESTandPLASTOversus AZURIN, whilst it stayedthe samefor FABVL versus FABVH, HAHU versusLEGH and FABVH versusFABCL. In all five pairs,theworstalignmentobainedimproved(32-51% overall). Furthermorefor FABVH versusFABCL alignments in whichno residueswerecorrectlyequivalenced wereremoved. TheproteinpairsFABVL versusFABVH, HAHU andLEGH and P2PTN versus PIEST score much better overall (90 as against39% with weightedend gaps)than FABVH versus FABCL and PLASTOversusAZURIN. In order to align the lattertwo pairscorrectly,a long gapmustbe introducedand/or the algorithmmustignoreregionsof poor similarity. A long gap will only beallowedby a Needleman andWunschtypealgorithm if the sequencesimilarity betweeninsertedresidues,and the residuesborderingthe potentialgap is very weak. Normally, severalshortergapswill be introducedinsteadin order to match regionsof theinsertionwith segments of theothersequence. This smearingof the alignmentis inherentto the global alignment method,althoughadoptionof an algorithmwhich weightsstrings of insertionsor deletionslower thanmultiple isolatedsingleinsertionsand deletionsmay overcomethis deficiency(Krushkal and Sankoff,1983). Given any hvo sequencesto align for which the threedimensionalstructuresareunknown.onewould like to discover 92 Table II. Resultof similarity tests Comparison Gl G2 Percent correctly aligned Significance score (SD unirs) P2PTNversusPIEST HAHU versusLEGHE FABVL versusFABVH PLSTO versusAZURIN FABVH versusFABCL 0 0 0 0 0 5 9 6 98 92 78 50 26 17.5 6.5 6.1 5.1 1.7 2 'Percentcorrectly aligned' refers to the percentage of residuesaligned within the selectedzonesas expectedfrom the structurebasedalignment. Values for Gt ^nd Gz were taken to give the best alignmentpossible(see Figure 3). The significancescorewas calculatedas follows: a value (V) was calculatedfor the comparisonof the two sequences;valuesfor the comparisonof 100 randomizedsequences of the samecompositionand length were then obtainedand the mean(M) and standarddeviation(SD) calculated;the significancescorc quotedis equal to (V - M)ISD. how successful the alignment is likely to be. The scores obtain- (asdescribedin Materials ed by a conventional testfor relatedness andmettrods)shouldfollow thesamefrendsasthestmchrre-based testusedhereincludingany deficienciesin thetreatmentofgaps asdiscussedabove.Tabletr illustratesthe resultsof suchhomology testsand showsa correlationbetweenthe percentageof residueswithin secondarystructurescorrectly aligned,and the similarity score(which is basedon the whole sequence).This Automaticalignmentof proteinsequences suggeststhat an alignment is likely to be good if a score greater than 6 SD can be obtained. However, preliminary trials (data not included), have shown that the significancescore obtained for a comparisoncan vary by * I SD dependingon the particular valuesof gap-penaltiesused.We suggestthat a signihcancescore greater than 7 SD for a comparison means that most of the residueswithin secondarystructureswill be aligned correctly. A score of below 5 SD indicatesthat the alignment is poor, Inclusion of secondarystructural information into the alignment when one X-rq) structure is known The effect ofintroducing secondarystructural information from one sequenceof known X-ray structure is illustrated in Figure lb. Not only has the maximum alignment score increasedto 36/41, but the methodhasconvergedon this valuewith increasing G1 and G2. This pattern is observed for FABVL versus FABVH, HAHU versusAZURIN and P2PTN versusPIEST, with and without end-gapweightingand indicatesthat the observed dependenceupon G1 and G2 is reducedwhen reliable secondary structural information is included from one sequence.Figure 2b illustratesone alignmentfor FABVL versus FABVH with 36/41 residuescorrectly equivalenced,this correspondsto the correct alignmentof A, B, C, E, F and G strands.However, the standardmethod using the same values of G1 and G2 misalignedthe C-, D- and F-strandsgiving a scoreof only 29141 (Figure 2a). The result of including secondarystructural information into the alignmentof all five protein pairs is presentedin Figure 3. There is an improvement in mean alignment accuracy on including Qt = 0.25 from 54 to 68% overall (unweightedend gaps)and 62 to 70% (weightedend gaps).The best alignment obtainedfor eachpair also increases,except for FABVH versus FABCL wherethe valuesstay the same.No alignmentshaving zero residuescorrectly equivalencedwere obtainedwhen secondary structural information was included. The improvementsin alignment within secondarystructural regionsare of value in providing a more reliable automaticstarting point for protein model-building studies. They are also valuablewhen aligning whole families of proteins since less manual intervention is required. Such multiple alignmentshelp to increaseunderstandingof the structuraland functional importanceof particular residuepositions. However, Needlemanand Wunsch type algorithms cannot easily be extended to the due to prosimultaneousalignmentof more than three sequences hibitive memory and computer-timerequirements(Murata e/ a/., 1985).Taylor (1986)has developedan algorithmwhich aligns 'templates' correspondingto conserved on the basisof predefined regions (e.g. beta-strands,alpha-helices)in the proteins. This methodalthoughnot as inherentlyflexible as the Needlemanand Wunsch approach, allows multiple alignments which conserve structural motifs to be carried out in a reasonabletime. Conclusions In this study we have uged sequencealignments based on the superpositionof known three-dimensionalprotein structuresas a standardagainstwhich to test: (i) The Needlemanand Wunsch (1970) global sequencecomparisonmethod, and (ii) an extended Needlemanand Wunsch algorithm which includes information basedon known secondarystructures. The following are the main findings. (i) If the user defined length-dependentand length-independent gap-penaltiesare varied, a number of different alignmentscan be obtained, many of which are only partly correct, or sometimes completely different to the alignment expectedfrom the superposition of three-dimensionalstructures. (ii) The best alignment using the standard Needleman and Wunsch alignment for each pair of proteins studiedmay be ob: tained without including a length dependentgap-penalty. (iii) Protein sequencecomparisonswhere one sequencehas a long insertion, or non-structurally homologousregion are not well aligned by this method (38% overall). Treating the sequences in sectionsbordered by known key residuesmay help the overall alignment for these protein pairs (Smith et al., l98l). (iv) The standarddeviation used in Needlemanand Wunsch homology testscorrelatesvery well with the quality of alignment as indicatedby referenceto alignmentsbasedon structuresuperposition. For those proteins tested, a score greater than 6 SD suggeststtrat the alignment(s) obtained for the two sequenceswill be good (>75% correct within secondarystructures). (v) The modification of the gappenalty function to include information about the positions of secondarystructuresfrom one of the proteins being aligned, and thus limit the number of gaps introducedin helix/sheetregions,improvesthe overall alignment quality and reducesthe sensitivity to changesin length-dependent and length-independentgap-penaltyconstants. The problems of aligning sequenceswhere there are regions of poor structuralhomology,or large insertions(e.g. immunogobulin variableversusconstantdomains),is unlikely to be overcome by global methodsusing a simple gap-penaltyfunction as describedhere. The inclusion of secondarystructural information from a knowledgeof the three-dimensionalstructureof one ofthe proteins to be aligned by using a variable gap-penaltycan provide, however, improvementsin alignmentaccuracy.In particular, for protein sequenceswhich though distantly related in evolution do not exhibit large insertions relative to each other (e.g. human alpha-haemoglobinversus root nodule leghaemoglobin) a useful improvement in alignment accuracycan be obtained. The inclusion of secondarystructural information into the alignment is therefore to be recommendedwheneverpossible. Acknowledgements We would hke to thank Dr J.Thornton for her comments on the manuscript, Mark6ta Zvelebil for her alignment of trypsin with elastaseand Professor T.L.Blundell for his continuedsupport. This work was supportedby the Science and Engineering ResearchCouncil. References L. and Pearl,L.H. (1983) Naure, 3M, 273 -27 5. Blundell,T.L., Sibanda,B. Boswell,D.R. and Mcl-achlan,A.D. (1984)Nucleic Acids Res., 12, 457-46/.. Browne,W.J., North,A.C.T., Phillips,D.C., Brew,K., Vanaman,T.C. and Hill,R.C. (1969) J. Mol. Biol., 120, 97 -120. Chothia,C. and [.esk,A.M. (1982\ J. Mol. Biol.,tffi,3@-323. C h o u , P . Y . a n d F a s m a n , G . D (. 1 9 7 7 )J . M o l . B i o l . , l l 5 , 1 3 5 - 1 7 5 . Cohen,F.E.,Novotny,J.,Sternberg,M.J.E.,Campbell,D.G.and Williams,A.F. (1981\ Biochem. J., f95, 3l -40. Dayhoff,M.O. (1972) ln Dayhoff,M.O. (d.), Atlas of Protein kqunce and Strucare. National Biomedical Research Foundation Washington, DC, Vol. 5, pp.89-110. and StrucDayhoff,M.O. (1978)In Dayhoff,M.O. (d.), Atlas of Protein Sequence rure. National Biomedica.lResearchFoundation Washington, DC, Vol. 5, suppl. 3., pp. 34s-3s8. N.G. (1979)Nature,279, 165- 168. D. andJames,M. T.J., Brayer.G. Delbaere,L. Timkovich,R.and Almassy,R.l.(1976)J. Mol. Biol., 100, Dickerson,R.E., 4'13-491. andDoolittle,R.F.(1985)J. Mol. Ewl., 21, I 12- 125. Feng,D.F., Johnson,M.S. (1984)NucleicAcidsRes.,12,175-l'19. Fickett,J.W. Fitch,W.M.andSmith.T.F.(1983)Proc.Natl.Acad.Sci.UM, 80, 1382- 1386. (1978)./.Mol. Biol.,l20,9'l-12O. andRobson,B. Gamier,J., Osguthoqpe,D.J. (1982)NucleicAcidsRes.,10,247-263. Goad,W.B.andKanehisa,M.I. Gotoh,O.(1982)J. Mol. Biol., 162, 705-708. 93 G.J.Barton and M.J.E.Stcrnbery Janin,J.(1979)Nuurc, 2Tl, 491-4E2' Krushkal,J.B.ard SankotrD. (1983)In SankotrD. andKrushkal'J.B'(eds)'Tine WarpsStringHits and Macronnlecales:the Theoryand Praaice of SeEtence Addison-Wesley,p. 296-299. C.omparison. (1980)J. Mol. Biol.,136'225-270Lesk.A.M.andChothia,C. Irsk,A.M., lrvitt,M. and Chothia,C.(1986)Prot. hg.' 1,71-78Levitt,M.(1976)J. MoL Biol.,104, 59-107. . (1972\J. Mol. Biol., &, 417-437 . Mct achlan,A.D NdL Acad. Sri. USA, ad $rssnran,J.L.(1985)Pr?oc. Murata,M.,Richardson,J.S. n,3037-3s77. andWunsch,C.D.(1970)J' Mol. 8io1.,8,43-453. Needleman,S.B. 59- lO7' Perutr,M.F.,IGndrew,J.C.ard WaEon,H.C.(1965)J. Mol. Biol., 1114, -193Muh.,zlt,787 Appl. J. Sellers,P.Onq SIAM Sellen,P.(199) Proc. Nul. Acad. Sci' USA,76'3MI' andFirch,W.M.(1981)J.MoL Ewl.' lt,38-'16' Smith,T.F.,Wat€nnan,M.S. 219, M., Grcen,N.andlrrner,R.A. (1983)Science, Sutcliffe,J.G.,Shinnick,T. ffi-ffi. Taylor,W.R.(19E6).L MoL Biol., lBE, 233-258. andBodmer,W.F'(1984)Naure, Travers,P.,Blundell,T.L.,Stemberg,M.J.E. 310.235-238. Smith,T.F.andBeyer,W.A.On6, Mv. Mdh., m, 367-387 . Waterman,M.S., Wilbur,W.J.andLipman,D.J.(1983)Prw. Natl-Acad.Sci-USA,n,726-1n. Winter.G.andFersht,A.R.(l9V) TrendsBiotech.,2,ll5-119. Receivedon August27, 1986 94