Regular Expressions and Pattern Matching
james.wasmuth@ed.ac.uk

Regular Expression (regex):
a separate language for constructing patterns.
used in most programming languages.
very powerful in Perl.

Pattern Match:
using a regex to search data and look for a match.

Overview:
how to create regular expressions
how to use them to match and extract data
biological context

So Why Regex?
Parse files of data and information:
    fasta
    embl / genbank format
    html (web pages)
    user input to programs
Check format
Find illegal characters (validation)
Search for sequence motifs

Simple Patterns
Place the regex between a pair of forward slashes (/ /).
try:
    #!/usr/bin/perl
    while (<STDIN>) {
        if (/abc/) {
            print "1 >> $_";
        }
    }
Run the script.
Type in something that contains abc:
    abcfoobar
Type in something that doesn't:
    fgh cba foobar ab c foobar
The print statement runs only if abc is matched within the typed input.

Simple Patterns (2)
Can also match strings from files.
genomes_desc.txt contains a few lines of text with information about three genomes.
try:
    #!/usr/bin/perl
    open IN, "<genomes_desc.txt" or die;
    while (<IN>) {
        if (/elegans/) {    #match lines with this regex
            print;          #print lines with match
        }
    }
Parses each line in turn.
Looks for elegans anywhere in the line ($_).

Flexible matching
There are many characters with special meanings: metacharacters.
star (*) matches zero or more instances of the preceding character
    /ab*c/ => 'a' followed by zero or more 'b' followed by 'c'
    => abc or abbbbbbbc or ac
plus (+) matches at least one instance
    /ab+c/ => 'a' followed by one or more 'b' followed by 'c'
    => abc or abbc or abbbbbbbbbbbbbbc NOT ac
question mark (?) matches zero or one instance
    /ab?c/ => 'a' followed by 0 or 1 'b' followed by 'c'
    => abc or ac

More General Quantifiers
Match a character a specific number of times, or within a range.
{x} will match exactly x instances.
    /ab{3}c/ => abbbc
{x,y} will match between x and y instances.
    /a{2,4}bc/ => aabc or aaabc or aaaabc
{x,} will match x or more instances.
    /abc{3,}/ => abccc or abccccccccc or ab followed by any longer run of c

More metacharacters
dot (.) matches any single character, including tab (\t) and space, but not newline (\n).
    /a.*c/ => 'a' followed by any number of any characters followed by 'c'

Escaping
But I want to use these symbols in my regex!?!
To use a * , + , ? or . in the pattern as a literal character rather than a metacharacter, 'escape' it with a backslash.
    /C\. elegans/ => C. elegans only
    /C. elegans/ => 'C' followed by any character: Ca , Cb , C3 , C> , C. , etc...
The 'delimiter' of the regex, forward slash "/", and the 'escape' character, backslash "\", are also metacharacters. These need to be escaped if required in the regex.
Important when trying to match URLs and email addresses.
    /joe\.bloggs\@darwin\.co\.uk/
    /www\.envgen\.nox\.ac\.uk\/biolinux\.html/

Using metacharacters
The file nemaglobins.embl contains 21 embl database entries that each contain a globin protein within their sequence.
try:
    #!/usr/bin/perl
    $count = 0;
    open IN, "<nemaglobins.embl" or die;
    while (<IN>) {
        if (/AC   .*/) {    #that's three spaces
            print;
            $count++;
        }
    }
    print "total=$count\n";

Grouping Patterns
Can group patterns in parentheses "()".
Useful when coupled with quantifiers:
    /elegans+/ => eleganssssssssssssss
    /(elegans)+/ => eleganselegans...elegans (elegans repeated 1, 2, ... n times)
    /eleg(ans){4}/ => elegansansansans (ans repeated 4 times)

Alternatives
Want either this pattern or that pattern. Two ways:
1.) the vertical bar '|': either the left side matches or the right side matches
    /(human|mouse|rat)/ => any string containing human or mouse or rat.
Combine with previous examples:
    /Fugu( |\t)+rubripes/ matches if Fugu and rubripes are separated by any mixture of spaces and tabs
2.) a character class is a list of characters within '[]'. It will match any single character within the class.
    /[wxyz1234\t]/ => any of the nine.
a range can be specified with '-'
    /[w-z1-4\t]/ => as above
to match a literal hyphen, put it first (or last) in the class
    /[-a-zA-Z]/ => any letter or a hyphen
negate a class with '^' as its first character
    /[^z]/ => any character except z
    /[^abc]/ => any character except a or b or c

Other Shortcuts
    \d => any digit [0-9]
    \w => any "word" character [A-Za-z0-9_]
    \s => any white space [\t\n\r\f ]
    \D => any character except a digit [^\d]
    \W => any character except a "word" character [^\w]
    \S => any character except a white space [^\s]
Can use any of these in conjunction with quantifiers:
    /\s*/ => any amount of white space

Using alternatives to find a hydrophobic region...
try:
    open IN, "< nippo_sigpept.fsa" or die;
    while (<IN>) {
        if (/>/) {          #a header line
            $count++;       #keep running total of sequence number
        } else {            #not a header
            if (/[VILMFWCA]{8,}/) {
                $match++;
            }
        }
    }
    print "Hydrophobic region found in $match sequences from $count\n";
Could also have used /(V|I|L|M|F|W|C|A){8,}/

Binding Operator Revisited?
So far we have been matching against $_
The binding operator "=~" matches the pattern on the right against the string on the left.
Usually add the m operator (optional).
    $sumthing = 'Ascaris suum is a nematode';
    if ($sumthing =~ m/suum.*nematode/) {
        print "this organism infects pigs!\n";
    }

Anchors
/pattern/ will match anywhere in the string.
Use anchors to hold the pattern to a point in the string.
caret "^" (shift 6) marks the beginning of the string, while dollar "$" marks the end of the string.
    /^elegans/ => elegans only at start of string. Not C. elegans.
    /Canis$/ => Canis only at end of string. Not Canis lupus.
    /^\s*$/ => a blank line.
"$" also matches just before a trailing newline character "\n".
N.B. compare the use of "^" as an anchor with its use in a character class.

Anchors (2)
Word Boundary
\b matches the start or end of a word.
    /\bmus\b/ would match mus but not musculus
    /la\b/ => Drosophila but not Plasmodium
    /\btes/ => Comamonas testosteroni but not Pan troglodytes
\b is not upset by a newline: \n is not a word character, so a word at the end of a line still ends at a boundary.
Be careful with full stops: they're characters too!
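
The anchor and word-boundary examples above can be tried out with a short script. This is only a sketch: the sub names (starts_with_elegans, ends_with_canis, is_blank, has_word_mus) are made up for this illustration and are not part of the course files. Each sub returns 1 if its pattern matches and 0 if it does not.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper names, chosen for this illustration only.
# Each wraps one pattern from the Anchors slides.
sub starts_with_elegans { return $_[0] =~ /^elegans/ ? 1 : 0 }   # ^ anchors at the start
sub ends_with_canis     { return $_[0] =~ /Canis$/   ? 1 : 0 }   # $ anchors at the end
sub is_blank            { return $_[0] =~ /^\s*$/    ? 1 : 0 }   # only white space, or empty
sub has_word_mus        { return $_[0] =~ /\bmus\b/  ? 1 : 0 }   # mus as a whole word

my @samples = ("elegans", "C. elegans", "Canis\n", "Canis lupus",
               "   \n", "mus", "musculus");

for my $text (@samples) {
    (my $shown = $text) =~ s/\n/\\n/g;    # make newlines visible when printing
    printf "%-14s start=%d end=%d blank=%d word=%d\n",
        "'$shown'",
        starts_with_elegans($text),
        ends_with_canis($text),
        is_blank($text),
        has_word_mus($text);
}
```

Note the "Canis\n" sample: it still counts as ending in Canis, because "$" is allowed to match just before a trailing newline.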
Memory Variables
Able to extract sections of the pattern match and store them in variables.
Anything matched inside parentheses "()" is written into a special variable.
The first instance is $1, the second $2, the fourth $4 and so on.
Extract from file:
    Organism: Homo sapiens
    ...
Extract from Perl script:
    while ($line = <IN>) {
        if ($line =~ m/Organism:\s(\w+)\s(\w+)/) {    #quantifier goes inside the parentheses
            $genus = $1;      #stores Homo
            $species = $2;    #stores sapiens
        }
    }

Substitutions
Able to replace a pattern within a string with another string.
Use the "s" operator
    s/abc/xyz/ => find abc and replace with xyz
By default only the first instance of a match is replaced.
Using the 'g' modifier (global) will find and replace all instances.
    $line = 'abccdcbabc';
    $line =~ s/abc/xyz/g;
    print $line;    #produces xyzcdcbxyz

1 Run dna2rna.pl
2 Now look at dna2rna.pl

dna2rna.pl
    #!/usr/bin/perl
    print "Enter DNA sequence\n";
    while ($line = <STDIN>) {
        chomp $line;    #remove trailing \n
        if ($line =~ m/[^AGCT]/i) {    #'i' modifier makes the match case insensitive
            print "your sequence contained an invalid nucleotide: $&\nPlease try again\n";
            #'$&' is a special variable which stores what the
            #regular expression matched. Don't worry about it for now.
        } else {
            $line =~ s/t/u/g;    #replace all lower case 't'
            $line =~ s/T/U/g;    #replace all upper case 'T'
            print "The RNA sequence is:\n$line\n";
            print "Try again or ctrl C to quit\n";
        }
    }

EMBL file revisited
Using shortcuts and anchors to help make it more robust:
    if (/AC   .*/) {    #that's three spaces
can be rewritten as:
    if (/^AC\s{3}(.*)\n$/) {    #more certain to return what you want
        $accession = $1;         #now have info stored to use later.
    }

Now It's Your Turn :o)
nemaglobins.embl contains entries for complete cds of nematode sequences.
For each entry print the ACcession, OrganiSm name and AGCT content of the SeQuence.
Output should read:
    Accession: AC00000 <tab> Species: Toxocara canis <newline>
    A: 34 G: 65 C: 24 T: 75 <newline><newline>
Hints:
The lines of interest are AC, OS, and SQ.
Three regular expressions - one for each query.
Use a series of if and elsif statements to search for the regular expressions.
Print when matched.
Bonus point - remove the semi-colon from the accession id.
Shout if you need help.
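
If you get stuck, here is one possible shape for the exercise, shown as a sketch rather than the model answer. It makes several assumptions: the sub name report_entry is invented for this example, the three sample lines are made up (they are not taken from nemaglobins.embl), and the SQ line is assumed to list its counts in the usual EMBL order "n A; n C; n G; n T;". It works on a list of lines instead of reading the file, so reading nemaglobins.embl entry by entry is still left to you.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not from the slides): takes the lines of ONE entry
# and returns the report text in the format asked for in the exercise.
sub report_entry {
    my @lines = @_;
    my ($accession, $species, %count);
    for my $line (@lines) {
        if ($line =~ /^AC\s{3}(\w+);/) {         # capture stops before the ';'
            $accession = $1;
        } elsif ($line =~ /^OS\s{3}(.*\S)/) {    # organism name, no trailing space
            $species = $1;
        } elsif ($line =~ /^SQ\s{3}.*?(\d+) A; (\d+) C; (\d+) G; (\d+) T;/) {
            # assumes the counts appear in the order A, C, G, T
            @count{qw(A C G T)} = ($1, $2, $3, $4);
        }
    }
    return "Accession: $accession\tSpecies: $species\n"
         . "A: $count{A} G: $count{G} C: $count{C} T: $count{T}\n\n";
}

# Made-up lines for illustration; a real entry has many more line types.
my @entry = (
    "AC   X00000;\n",
    "OS   Toxocara canis\n",
    "SQ   Sequence 200 BP; 34 A; 24 C; 65 G; 75 T; 2 other;\n",
);

print report_entry(@entry);
```

Putting the semi-colon outside the parentheses in the AC pattern is one way to earn the bonus point: it has to be present for the match, but it is not captured into $1.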