LWP::Simple - Purdue Genomics Wiki
22 – 26 February
Week 7

• Topics
  ○ Subroutines
  ○ Complex data structures
  ○ Internet agents / programming
• Reading
  ○ CPAN libwww-perl (LWP)
  ○ LWP cookbook
    − http://search.cpan.org/~gaas/libwww-perl-6.04/lwpcook.pod
  ○ Perl Cookbook (available on Safari)
  ○ Review HTML Forms
    − http://www.w3schools.com/html/html_forms.asp

Biol 59500-033 - Practical Biocomputing

Subroutines

Main program
• my $answer = multiply( $x, $y );

Subroutine
• sub multiply {            # not named "times": times is a Perl built-in
      my ( $x, $y ) = @_;   # $a and $b are avoided: sort reserves them
      my $answer = $x * $y;
      return $answer;
  }

Complex Data Structures

# hash of hashes
# 1. gene information
my %gene = (
    At5g04870 => { gene => "cpk1",  begin => 1416783,  end => 1420338,  xsome => 5 },
    At1g18890 => { gene => "cpk10", begin => 6522764,  end => 6525962,  xsome => 1 },
    At4g21940 => { gene => "cpk15", begin => 11640802, end => 11643762, xsome => 4 },
);
# 1.1 print gene info sorted by systematic name (e.g., At5g04870 )
# 1.2 print gene info sorted by chromosome
# 1.3 print gene info sorted by gene length

Complex Data Structures

# 2. array of hashes. this corresponds to the info in a fasta file
my @sequence = (
    { name => "seqa",
      doc  => "sequence of gene a",
      seq  => "CGCATCGTATCCGATCGTAGCCTGCATCGTATGCTA" },
    { name => "wtfin",
      doc  => "no one knows what this gene does",
      seq  => "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN" },
    { name => "lookase1",
      doc  => "related to ADHD",
      seq  => "CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCC" },
);
# 2.1 print the name of each sequence
# 2.2 print out info in alphabetical order of sequence name
# 2.3 print out information in sequence (seq) length order

Complex Data Structures

# hash of arrays
# 3. location of cities ( latitude, longitude )
my %location = (
    Montgomery   => [ 32.3615,  -86.2791 ],
    Little_Rock  => [ 34.736,   -92.3311 ],
    Phoenix      => [ 33.5284, -112.076  ],
    Sacramento   => [ 38.5556, -121.469  ],
    Denver       => [ 39.7263, -104.965  ],
    Hartford     => [ 41.7626,  -72.6886 ],
    Dover        => [ 39.1619,  -75.5268 ],
    Tallahassee  => [ 30.4518,  -84.2728 ],
    Atlanta      => [ 33.7595,  -84.4032 ],
    Des_Moines   => [ 41.5909,  -93.6209 ],
    Boise        => [ 43.6137, -116.238  ],
    Springfield  => [ 39.7833,  -89.6504 ],
    Indianapolis => [ 39.7909,  -86.1477 ],
    Topeka       => [ 39.0392,  -95.6895 ],
    Frankfort    => [ 38.1973,  -84.8631 ],
    Baton_Rouge  => [ 30.4581,  -91.1402 ],
);
# 3.1 print the locations of Phoenix, Topeka, and Atlanta
# 3.2 print all the cities west of Springfield
# 3.3 print the cities in east to west order

Complex Data Structures

# array of arrays
# 4. some Euclidean x, y, z coordinates
my @coord = (
    [ -66.838, -0.754, -25.764 ],
    [ -67.651, -1.371, -24.677 ],
    [ -67.424, -0.595, -23.384 ],
    [ -68.320,  0.089, -22.888 ],
    [ -67.234, -2.829, -24.467 ],
    [ -66.691, -3.521, -25.710 ],
    [ -67.718, -3.597, -26.826 ],
    [ -67.130, -4.281, -28.051 ],
    [ -68.089, -4.324, -29.188 ],
    [ -66.213, -0.711, -22.847 ],
    [ -65.842, -0.029, -21.614 ],
    [ -65.325,  1.368, -21.944 ],
    [ -64.130,  1.565, -22.165 ],
    [ -64.763, -0.831, -20.883 ],
);
# 4.1 sort by z coordinate
# 4.2 find the center of these coordinates ( ave_x, ave_y, ave_z )
# 4.3 find all coordinates within 2.0 of the center

Complex Data Structures

# 5. hash of hashes of arrays of arrays
# this one is more complicated and only intended for those who feel the above
# is trivial. note that loc is an array of the beginning and ending positions of
# the gene on the chromosome, and exon is an array of arrays of the beginning
# and ending position of each exon; the exon coordinates are an offset from the
# beginning of the gene given in loc.
my %gene = (
    At5g04870 => { gene  => "cpk1",
                   loc   => [ 1416783, 1420338 ],
                   exon  => [ [ 1001, 1809 ], [ 2171, 2314 ], [ 2400, 2552 ] ],
                   xsome => 5 },
    At1g18890 => { gene  => "cpk10",
                   loc   => [ 6522764, 6525962 ],
                   exon  => [ [ 1001, 1298 ], [ 1540, 2693 ] ],
                   xsome => 1 },
    At4g21940 => { gene  => "cpk15",
                   loc   => [ 11640802, 11643762 ],
                   exon  => [ [ 1001, 2379 ], [ 2497, 2640 ], [ 2736, 2888 ],
                              [ 3050, 3165 ], [ 3321, 3488 ] ],
                   xsome => 4 },
);
# 5.1 list each gene and its exons in alphabetical order (by the "gene" key)
# 5.2 list the genes and their locations in order of the number of exons
# 5.3 list the genes and their locations in order of the longest exon in each gene

Internet Programming

• CPAN libwww-perl (LWP)
• LWP cookbook
  – http://search.cpan.org/~gaas/libwww-perl-6.04/lwpcook.pod
• Perl Cookbook (available on Safari)

Internet Programming
Internet packages

• LWP::Simple
  ○ Simple fetching of web pages and "GET" method forms
• LWP::UserAgent
  ○ More complicated fetching of "POST" method forms, uses HTTP::Request and HTTP::Response
• HTTP::Request
  ○ Create HTTP formatted requests
• HTTP::Response
  ○ Parse HTTP formatted responses
• URI::URL
  ○ methods for handling URLs
• HTML
  ○ methods for handling HTML formatted files

Internet Programming
wget

• wget is available to fetch webpages on most unix systems

use strict;
my $url = "http://plantsp.genomics.purdue.edu";
my $content = `wget -q -O - $url`;    # -q: quiet; -O -: write the page to STDOUT so the backticks capture it

Internet Programming
LWP Package

• Short for libwww-perl
• LWP::Simple
• get($url)
  ○ The get() function will fetch the document identified by the given URL and return it. It returns undef if it fails. The $url argument can be either a simple string or a reference to a URI object.
• head($url)
  ○ Get document headers. Returns the following 5 values if successful: ($content_type, $document_length, $modified_time, $expires, $server)
  ○ Returns an empty list if it fails. In scalar context returns TRUE if successful.
• getprint($url)
  ○ Get and print a document identified by a URL. The document is printed to the selected default filehandle for output (normally STDOUT) as data is received from the network. If the request fails, then the status code and message are printed on STDERR. The return value is the HTTP response code.
• getstore($url, $file)
  ○ Gets a document identified by a URL and stores it in the file. The return value is the HTTP response code.
• mirror($url, $file)

Internet Programming
LWP::Simple

• Getting a web page
• Most basic, little more than wget

use strict;
use LWP::Simple;
my $url = "http://plantsp.genomics.purdue.edu";
my $content = get( $url );

• What if something goes wrong?

Internet Programming
LWP::Simple

• Getting a web page
• Checking for errors, better than wget

use strict;
use LWP::Simple;
my $url = "http://plantsp.genomics.purdue.edu";
my $content = get( $url );            # declared outside the test so it stays in scope
unless ( defined $content ) { die "unable to access $url\n\n"; }    # test for success

• Inconvenient
  ○ Have to alter code each time
  ○ I get bored typing http://

Internet Programming
LWP::Simple

• Getting a web page
• More useful with getopt
  ○ Doesn't hard code the URL
  ○ Supplies the http:// prefix

use strict;
use Getopt::Std;
use LWP::Simple;

my $option = {};
getopts( 'u:', $option );             # the colon means -u takes a value

my $url = "http://plantsp.genomics.purdue.edu";                  # default URL
if ( $$option{u} ) { $url = $$option{u}; }
unless ( $url =~ m{^https?://}i ) { $url = "http://".$url; }     # add http:// prefix if missing

my $content = get( $url );
unless ( defined $content ) { die "unable to access $url\n\n"; } # test for success

Internet Programming
LWP::Simple

• LWP::Simple works well with REST-based web services
• NCBI E-utilities (eutils, http://www.ncbi.nlm.nih.gov/books/NBK25500/)
  Provided by NCBI to
  ○ search databases (esearch)
  ○ download summaries (esummary)
  ○ download complete entries (efetch)
  ○ upload UIDs to NCBI server for later processing (epost)
  ○ query Entrez (egquery)
  ○ trace links in entries (elink)
  ○ examine database statistics and fields (einfo)
  ○ retrieve spelling suggestions (espell)
• Base URL: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/xxx.fcgi

Internet Programming
LWP::Simple

• esearch
  ○ url: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi
  ○ parameters:
    − db – databases to search (pubmed, protein, nucleotide, genome, etc)
    − term – search term
    − usehistory – y|n, store the results of search on server
  ○ http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=pubmed&term=science[journal]+AND+breast+cancer
    note: no spaces in term

Internet Programming
LWP::Simple

• efetch
  ○ url: http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi
  ○ parameters:
    − db – databases to search (pubmed, protein, nucleotide, genome, etc)
    − id – uid list (e.g., &id=15718680,157427902,119703751)
    − rettype – retrieval type, varies with database: Abstract or MEDLINE from PubMed, or GenPept or FASTA from protein
    − retmode – e.g., text, HTML or XML
    − retstart – sequential index of the first record to be retrieved
    − retmax – total number of records from the input set to be retrieved
    − WebEnv – specifies the Web Environment that contains the UID list to be provided as input to EFetch
    − query_key – specifies which of the UID lists attached to the given Web Environment will be used as input to EFetch
  ○ efetch.fcgi?db=protein&retmode=text&rettype=fasta&id=15718680,157427902,119703751

Internet Programming
LWP::Simple

• simple esearch script

#!/usr/bin/perl
################################################################################
#
# Use NCBI
# eutil service to retrieve protein sequences
#
# Gribskov Admin Feb 26, 2013
################################################################################
use strict;
use Getopt::Std;
use LWP::Simple;

# base URL for NCBI eutil services
my $BASE = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/';

my $database = "protein";
my $query    = "arsenite reductase AND arabidopsis";
my $search   = $BASE."esearch.fcgi?db=$database&term=$query";
print "searching $search...\n\n";

my $result = get $search;
print "$result\n";

exit 0;

Internet Programming
LWP::Simple

searching http://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi?db=protein&term=arsenite reductase AND arabidopsis...

<?xml version="1.0" ?>
<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD eSearchResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSearch_020511.dtd">
<eSearchResult>
  <Count>5</Count>
  <RetMax>5</RetMax>
  <RetStart>0</RetStart>
  <IdList>
    <Id>28868893</Id>
    <Id>62286622</Id>
    <Id>410092315</Id>
    <Id>409760340</Id>
    <Id>28852132</Id>
  </IdList>
  <TranslationSet>
    <Translation>
      <From>arsenite reductase</From>
      <To>arsenite reductase[Protein Name] OR (arsenite[All Fields] AND reductase[All Fields])</To>
    </Translation>
    <Translation>
      <From>arabidopsis</From>
      <To>"Arabidopsis"[Organism] OR arabidopsis[All Fields]</To>
    </Translation>
  </TranslationSet>
  <TranslationStack>
    <TermSet>
      <Term>arsenite reductase[Protein Name]</Term>
      <Field>Protein Name</Field>
      <Count>32</Count>
      <Explode>N</Explode>
    </TermSet>
    <TermSet>
      <Term>arsenite[All Fields]</Term>
      <Field>All Fields</Field>
      <Count>94001</Count>
      <Explode>N</Explode>
    </TermSet>
    <TermSet>
      <Term>reductase[All Fields]</Term>
      <Field>All Fields</Field>
      <Count>1646624</Count>
      <Explode>N</Explode>
    </TermSet>
    <OP>AND</OP>
    <OP>GROUP</OP>
    <OP>OR</OP>
    <OP>GROUP</OP>
    <TermSet>
      <Term>"Arabidopsis"[Organism]</Term>
      <Field>Organism</Field>
      <Count>0</Count>
      <Explode>N</Explode>
    </TermSet>
    <TermSet>
      <Term>arabidopsis[All Fields]</Term>
      <Field>All Fields</Field>
      <Count>1005396</Count>
      <Explode>N</Explode>
    </TermSet>
    <OP>OR</OP>
    <OP>GROUP</OP>
    <OP>AND</OP>
  </TranslationStack>
  <QueryTranslation>(arsenite reductase[Protein Name] OR (arsenite[All Fields] AND reductase[All Fields])) AND ("Arabidopsis"[Organism] OR arabidopsis[All Fields])</QueryTranslation>
</eSearchResult>

Internet Programming
LWP::Simple

#!/usr/bin/perl
################################################################################
#
# Use NCBI eutil service to retrieve protein sequences
#
# Gribskov Admin Feb 26, 2013
################################################################################
use strict;
use Getopt::Std;
use LWP::Simple;

# base URL for NCBI eutil services
my $BASE = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/';

my $database = "protein";
my $query    = "arsenite reductase AND arabidopsis";
my $search   = $BASE."esearch.fcgi?db=$database&term=$query";
print "searching $search...\n\n";

my $result = get $search;
#print "$result\n";

# get IDs
my ( $ids ) = $result =~ /<IdList>(.*)<\/IdList>/s;
$ids =~ s/<\/?Id>//g;
my @idlist = split " ", $ids;
print "idlist:@idlist\n";

exit 0;

Internet Programming
LWP::Simple

#!/usr/bin/perl
################################################################################
#
# Use NCBI eutil service to retrieve protein sequences
#
# Gribskov Admin Feb 26, 2013
################################################################################
use strict;
use Getopt::Std;
use LWP::Simple;

# base URL for NCBI eutil services
my $BASE = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/';

my $database = "protein";
my $query    = "arsenite reductase AND arabidopsis";
#my $search = $BASE."esearch.fcgi?db=$database&term=$query&usehistory=y";
my $search   = $BASE."esearch.fcgi?db=$database&term=$query";
print "searching $search...\n\n";

my $result = get $search;
#print "$result\n";

# get IDs
my ( $ids ) = $result =~ /<IdList>(.*)<\/IdList>/s;
$ids =~ s/<\/?Id>//g;
my @idlist = split " ", $ids;
print "idlist:@idlist\n";

# retrieve with efetch
my $idstring = join ",", @idlist;
my $fetch = $BASE."efetch.fcgi?db=protein&retmode=text&rettype=fasta&id=$idstring";
print "fetch:$fetch\n";
my $sequence = get $fetch;
print $sequence;

Internet Programming
LWP::Simple

>gi|28868893|ref|NP_791512.1| arsenate reductase [Pseudomonas syringae pv. tomato str. DC3000]
MTDLTLYHNPRCTKSRGALELLQARGLTPDIILYLETPPDAGTLHDLLGKLGISARQLLRTGEDDYKQLN
LADPSLSDEQLVAAMAAHPKLIERPILVAGNKAVIGRPPENILELLP
>gi|62286622|sp|Q8GY31.1|CDC25_ARATH RecName: Full=Dual specificity phosphatase Cdc25; AltName: Full=Arath;CDC25; AltName: Full=Arsenate reductase 2; AltName: Full=Sulfurtransferase 5; Short=AtStr5
MGRSIFSFFTKKKKMAMARSISYITSTQLLPLHRRPNIAIIDVRDEERNYDGHIAGSLHYASGSFDDKIS
HLVQNVKDKDTLVFHCALSQVRGPTCARRLVNYLDEKKEDTGIKNIMILERGFNGWEASGKPVCRCAEVP
CKGDCA
>gi|410092315|ref|ZP_11288844.1| arsenate reductase [Pseudomonas viridiflava UASWS0038]
MTDLTLYHNPRCTKSRGALELLQARGLSPDVVLYLETPPDAAQLRELLGKLGISARQLLRTGEDDYKQLN
LADASLSDEQLIAAMAAHPKLIERPILVVGDKAVIGRPPENVLELLP
>gi|409760340|gb|EKN45494.1| arsenate reductase [Pseudomonas viridiflava UASWS0038]
MTDLTLYHNPRCTKSRGALELLQARGLSPDVVLYLETPPDAAQLRELLGKLGISARQLLRTGEDDYKQLN
LADASLSDEQLIAAMAAHPKLIERPILVVGDKAVIGRPPENVLELLP
>gi|28852132|gb|AAO55207.1| arsenate reductase [Pseudomonas syringae pv. tomato str. DC3000]
MTDLTLYHNPRCTKSRGALELLQARGLTPDIILYLETPPDAGTLHDLLGKLGISARQLLRTGEDDYKQLN
LADPSLSDEQLVAAMAAHPKLIERPILVAGNKAVIGRPPENILELLP

Internet Programming
LWP::Simple

• Large retrievals with Eutils
  ○ NCBI allows the results of a large query to be stored on their server and used in other queries, using the usehistory=y parameter with esearch
  ○ multiple sets of sequences can then be retrieved in chunks using
    − retstart – index of first sequence to retrieve
    − retmax – number of sequences to retrieve
  ○ NCBI recommends setting retmax = 500 to avoid having an adverse impact on their services

Internet Programming
LWP::Simple

• esearch
  ○ additional information with &usehistory=y

<?xml version="1.0" ?>
<!DOCTYPE eSearchResult PUBLIC "-//NLM//DTD eSearchResult, 11 May 2002//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/eSearch_020511.dtd">
<eSearchResult>
  <Count>5</Count>
  <RetMax>5</RetMax>
  <RetStart>0</RetStart>
  <QueryKey>1</QueryKey>
  <WebEnv>NCID_1_3419506_165.112.9.24_5555_1362406574_334475218</WebEnv>
  <IdList>
    <Id>28868893</Id>
    <Id>62286622</Id>
    <Id>410092315</Id>
    <Id>409760340</Id>
    <Id>28852132</Id>

Internet Programming
LWP::Simple

#!/usr/bin/perl
################################################################################
#
# Use NCBI eutil service to retrieve protein sequences
#
# Gribskov Admin Feb 26, 2013
################################################################################
use strict;
use Getopt::Std;
use LWP::Simple;

# base URL for NCBI eutil services
my $BASE = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/';

my $database = "protein";
my $query    = "arsenite reductase AND arabidopsis";
my $search   = $BASE."esearch.fcgi?db=$database&term=$query&usehistory=y";
print "searching for $query ...\n\n";

my $result = get $search;
print $result;

# get number of matches, WebEnv and query_key
my ( $webenv ) = $result =~
    /<WebEnv>(\S+)<\/WebEnv>/s;
my ( $query_key ) = $result =~ /<QueryKey>(\d+)<\/QueryKey>/s;
my ( $matches ) = $result =~ /<Count>(\d+)<\/Count>/s;
print "WebEnv:$webenv query_key:$query_key matches: $matches\n";

# retrieve with efetch
my $retmax = 2;
my $retstart = 0;
while ( $retstart < 6 ) {
    my $fetch = $BASE."efetch.fcgi?db=protein";
    $fetch .= "&retmode=text&rettype=fasta";
    $fetch .= "&retmax=$retmax&retstart=$retstart";
    $fetch .= "&WebEnv=$webenv&query_key=$query_key";
    print "start=$retstart query:$fetch\n";
    my $sequence = get $fetch;
    print $sequence;
    $retstart += $retmax;
}

exit 0;

Internet Programming
LWP::Simple

WebEnv:NCID_1_2501752_130.14.22.76_5555_1362406846_1531808803 query_key:1 matches: 5
start=0 query:http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&retmode=text&rettype=fasta&retmax=2&retstart=0&WebEnv=NCID_1_2501752_130.14.22.76_5555_1362406846_1531808803&query_key=1
>gi|28868893|ref|NP_791512.1| arsenate reductase [Pseudomonas syringae pv. tomato str. DC3000]
MTDLTLYHNPRCTKSRGALELLQARGLTPDIILYLETPPDAGTLHDLLGKLGISARQLLRTGEDDYKQLN
LADPSLSDEQLVAAMAAHPKLIERPILVAGNKAVIGRPPENILELLP
>gi|62286622|sp|Q8GY31.1|CDC25_ARATH RecName: Full=Dual specificity phosphatase Cdc25; AltName: Full=Arath;CDC25; AltName: Full=Arsenate reductase 2; AltName: Full=Sulfurtransferase 5; Short=AtStr5
MGRSIFSFFTKKKKMAMARSISYITSTQLLPLHRRPNIAIIDVRDEERNYDGHIAGSLHYASGSFDDKIS
HLVQNVKDKDTLVFHCALSQVRGPTCARRLVNYLDEKKEDTGIKNIMILERGFNGWEASGKPVCRCAEVP
CKGDCA
start=2 query:http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&retmode=text&rettype=fasta&retmax=2&retstart=2&WebEnv=NCID_1_2501752_130.14.22.76_5555_1362406846_1531808803&query_key=1
>gi|410092315|ref|ZP_11288844.1| arsenate reductase [Pseudomonas viridiflava UASWS0038]
MTDLTLYHNPRCTKSRGALELLQARGLSPDVVLYLETPPDAAQLRELLGKLGISARQLLRTGEDDYKQLN
LADASLSDEQLIAAMAAHPKLIERPILVVGDKAVIGRPPENVLELLP
>gi|409760340|gb|EKN45494.1| arsenate reductase [Pseudomonas viridiflava UASWS0038]
MTDLTLYHNPRCTKSRGALELLQARGLSPDVVLYLETPPDAAQLRELLGKLGISARQLLRTGEDDYKQLN
LADASLSDEQLIAAMAAHPKLIERPILVVGDKAVIGRPPENVLELLP
start=4 query:http://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi?db=protein&retmode=text&rettype=fasta&retmax=2&retstart=4&WebEnv=NCID_1_2501752_130.14.22.76_5555_1362406846_1531808803&query_key=1
>gi|28852132|gb|AAO55207.1| arsenate reductase [Pseudomonas syringae pv. tomato str. DC3000]
MTDLTLYHNPRCTKSRGALELLQARGLTPDIILYLETPPDAGTLHDLLGKLGISARQLLRTGEDDYKQLN
LADPSLSDEQLVAAMAAHPKLIERPILVAGNKAVIGRPPENILELLP

Internet Programming
LWP::Simple

• Drawbacks to eutils script
  ○ must change the program for every different search
  ○ no progress report while running
  ○ no error checking
• Enhancing flexibility with getopt

Internet Programming
LWP::Simple

#!/usr/bin/perl
################################################################################
#
# Use NCBI eutil service to retrieve sequences
#
# Gribskov Admin Feb 26, 2013
################################################################################
use strict;
use Getopt::Std;
use LWP::Simple;

# base URL for NCBI eutil services
my $BASE = 'http://eutils.ncbi.nlm.nih.gov/entrez/eutils/';
my $DEFAULT_INTERVAL = 10000;
my $DEFAULT_DATABASE = 'protein';

# Getopt::Std stops at the first non-option argument, so options precede the query
my $USAGE = qq{eutil.pl [options] <query_string>
    -h             this usage message
    -d <string>    NCBI database to search (default=$DEFAULT_DATABASE)
    -i <int>       interval for reporting progress (default=$DEFAULT_INTERVAL)
};

# command line options
my $option = {};
getopts( 'd:hi:', $option );

# help
if ( $$option{h} ) {
    print "$USAGE\n";
    exit 1;
}

my $database = $DEFAULT_DATABASE;
if ( $$option{d} ) { $database = $$option{d}; }

my $interval = $DEFAULT_INTERVAL;
if ( $$option{i} ) { $interval = $$option{i}; }

my $query = $ARGV[0];

#------------------------------------------------------------------------------
# main program
#------------------------------------------------------------------------------
print STDERR "eutil.pl\n";
print STDERR "    report interval: $interval\n";
print STDERR "    database: $database\n";
print STDERR "    query: $query\n\n";

my $search = $BASE."esearch.fcgi?db=$database&term=$query&usehistory=y";
print STDERR "searching for $query ...\n\n";    # STDERR keeps status messages out of the FASTA output

my $result = get $search;

# get number of matches,
# WebEnv and query_key
my ( $webenv ) = $result =~ /<WebEnv>(\S+)<\/WebEnv>/s;
my ( $query_key ) = $result =~ /<QueryKey>(\d+)<\/QueryKey>/s;
my ( $matches ) = $result =~ /<Count>(\d+)<\/Count>/s;
print STDERR "matches: $matches WebEnv:$webenv query_key:$query_key\n\n";

# retrieve with efetch
my $retmax = 500;
my $retstart = 0;
while ( $retstart < $matches ) {
    my $fetch = $BASE."efetch.fcgi?db=$database";    # same database as the search, not hard-coded
    $fetch .= "&retmode=text&rettype=fasta";
    $fetch .= "&retmax=$retmax&retstart=$retstart";
    $fetch .= "&WebEnv=$webenv&query_key=$query_key";
    my $sequence = get $fetch;
    print $sequence;
    $retstart += $retmax;
    unless ( $retstart % $interval ) { print STDERR "    $retstart sequences retrieved\n"; }
}

exit 0;

Internet Programming
Running servers without using the web form

• LWP::UserAgent
• Blast @ PlantsP
  ○ Find homologous sequences using BLAST
    http://xplantsp.genomics.purdue.edu/cgibin/blast_tmpl_soap.cgi?db=PlantsP

Internet Programming
LWP::UserAgent

$ua->agent( $product_id )
• Get/set the product token that is used to identify the user agent on the network. The agent value is sent as the "User-Agent" header in the requests. The default is the string returned by the _agent() method (see below).
• If the $product_id ends with space then the _agent() string is appended to it.
• The user agent string should be one or more simple product identifiers with an optional version number separated by the "/" character. Examples are:
• $ua->agent('Checkbot/0.4 ' . $ua->_agent);
  $ua->agent('Checkbot/0.4 ');    # same as above
  $ua->agent('Mozilla/5.0');
  $ua->agent("");                 # don't identify

$ua->_agent
• Returns the default agent identifier. This is a string of the form "libwww-perl/#.##", where "#.##" is substituted with the version number of this library.

$ua->from( $email_address )
• Get/set the e-mail address for the human user who controls the requesting user agent.
  The address should be machine-usable, as defined in RFC 822. The from value is sent as the "From" header in the requests. Example:
• $ua->from('gaas@cpan.org');
• The default is to not send a "From" header. See the default_headers() method for the more general interface that allows any header to be defaulted.

$ua->max_size( $bytes )
• Get/set the size limit for response content. The default is undef, which means that there is no limit. If the returned response content is only partial, because the size limit was exceeded, then a "Client-Aborted" header will be added to the response. The content might end up longer than max_size as we abort once appending a chunk of data makes the length exceed the limit. The "Content-Length" header, if present, will indicate the length of the full content and will normally not be the same as length($res->content).

$ua->timeout( $secs )
• Get/set the timeout value in seconds. The default timeout() value is 180 seconds, i.e. 3 minutes.
• The request is aborted if no activity on the connection to the server is observed for timeout seconds. This means that the time it takes for the complete transaction and the request() method to actually return might be longer.

Internet Programming
LWP::UserAgent - REQUEST METHODS

$ua->get( $url , $field_name => $value, ... )
• This method will dispatch a GET request on the given $url. Further arguments can be given to initialize the headers of the request. These are given as separate name/value pairs. The return value is a response object.

$ua->head( $url , $field_name => $value, ... )
• This method will dispatch a HEAD request on the given $url. Otherwise it works like the get() method described above.

$ua->post( $url, \%form )
$ua->post( $url, \@form )
$ua->post( $url, \%form, $field_name => $value, ... )
• This method will dispatch a POST request on the given $url, with %form or @form providing the key/value pairs for the fill-in form content.
  Additional headers and content options are the same as for the get() method.

$ua->request( $request, $content_file )
• This method will dispatch the given $request object. Normally this will be an instance of the HTTP::Request class, but any object with a similar interface will do. The return value is a response object. See the HTTP::Request manpage and the HTTP::Response manpage for a description of the interface provided by these classes.
• The request() method will process redirects and authentication responses transparently. This means that it may actually send several simple requests via the simple_request() method described below.

$ua->simple_request( $request )
• This method dispatches a single request and returns the response received. Arguments are the same as for request() described above.
• The difference from request() is that simple_request() will not try to handle redirects or authentication responses. The request() method will in fact invoke this method for each simple request it sends.

$ua->redirect_ok( $prospective_request, $response )
• This method is called by request() before it tries to follow a redirection to the request in $response. This should return a TRUE value if this redirection is permissible. The $prospective_request will be the request to be sent if this method returns TRUE.
• The base implementation will return FALSE unless the method is in the object's requests_redirectable list, FALSE if the proposed redirection is to a "file://..." URL, and TRUE otherwise.
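
The LWP::UserAgent methods described above can be sketched in a few lines. This is a minimal illustration, not the course's script: the product name BiocompDemo, the e-mail address, the example.org URL, the form fields, and the helper fetch_or_die are all hypothetical. The request is built and inspected but not sent, so the sketch runs without network access.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use HTTP::Request::Common qw( POST );

# configure the user agent (all values here are made up for illustration)
my $ua = LWP::UserAgent->new;
$ua->agent( 'BiocompDemo/0.1 ' );      # trailing space: the default libwww-perl id is appended
$ua->from( 'student@example.edu' );    # sent as the From header
$ua->timeout( 30 );                    # give up after 30 seconds without activity

# hypothetical helper: send a request, return the body or die with the status line
sub fetch_or_die {
    my ( $agent, $request ) = @_;
    my $response = $agent->request( $request );    # follows redirects transparently
    die "request failed: ", $response->status_line, "\n"
        unless $response->is_success;
    return $response->decoded_content;
}

# build a POST request from form key/value pairs; nothing is sent yet
my $request = POST 'http://www.example.org/cgi-bin/form.cgi',
    [ name => 'cpk1', xsome => 5 ];

print $request->method, " ", $request->uri, "\n";
print $request->content, "\n";         # the URL-encoded form body

# to actually send it (needs network access and a real form URL):
# print fetch_or_die( $ua, $request );
```

Checking is_success and reporting status_line gives the error checking that a bare LWP::Simple get() cannot: the caller sees the HTTP status instead of only undef.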

Internet Programming
Running servers without using the web form

• LWP::UserAgent
• Blast @ PlantsP
  ○ Find homologous sequences using BLAST
    http://xplantsp.genomics.purdue.edu/cgibin/blast_tmpl_soap.cgi?db=PlantsP

Internet programming
Form information

• Chrome
  – web developer extension
• Firefox
  ○ firebug
  ○ web developer
• Safari
  ○ web inspector
  ○ firebug

Internet Programming
LWP::UserAgent

• Blast search form
• Gets from user
  ○ DATALIB
  ○ SEQUENCE
• Defaults
  ○ PROGRAM
  ○ UNGAPPED_ALIGNMENT
  ○ FSET
  ○ EXPECT
  ○ DESCRIPTIONS
  ○ ALIGNMENTS
• Hidden
  ○ db

Internet Programming
Finding Form variables in page source

• <FORM>
  ○ may be more than one
• <INPUT>
  ○ box
  ○ radio button
  ○ checkbox
• <SELECT>
  ○ pulldown menu
• <TEXTAREA>
  ○ a large box for text

Internet Programming
BLAST@PlantsP

• DB
  ○ hidden -> "plantsp"
• PROGRAM
  ○ Select ->
  ○ <option value=blastp SELECTED> blastp (prot. vs prot.)
    <option value=blastn > blastn (DNA vs DNA)
    <option value=blastx> blastx (transl. DNA vs prot.)
    <option value=tblastn> tblastn (prot. vs transl. DNA)
    <option value=tblastx> tblastx (transl. DNA vs transl. DNA)
• DATALIB
  ○ Select
  ○ <OPTION VALUE=ap>----- Protein databases -----
    <OPTION VALUE=tigr_osa5prot> Rice Proteins (TIGR release 5 - 01/24/2007)
    <OPTION VALUE=physco_pro> Physcomitrella proteins (JGI v1.1 - March 2007)
    <OPTION VALUE=selmo1_pro> Selaginella proteins (JGI v1.0 - March 2007 (released 10/31))
    <OPTION VALUE=all_pro> All Plants(P+T+Ubq) Proteins (Purdue - 28 Jan 2008)
    <OPTION VALUE=tair_ath8prot> Arabidopsis proteins (TAIR release 8 - 2008-05-16)
    <OPTION VALUE=vp_090116> Viridiplantae proteins (All viridiplantae proteins - 01/16/2009)
    <OPTION VALUE=PlantProteinDB> Plant Protein (Combined plant protein database - 2009-0313)
    <OPTION VALUE=an>----- DNA databases -----
    <OPTION VALUE=osa_indica> Indica (chinese) Rice Genomic Sequence (Yu et al. - 9/5/2002)
    …

Internet Programming
HTML Forms

<INPUT TYPE=checkbox NAME=UNGAPPED_ALIGNMENT VALUE=is_set> Perform ungapped alignment
<BR>
The query sequence is <INPUT TYPE=checkbox NAME=FSET VALUE=isset CHECKED> filtered for low complexity regions by default.
<BR>
<TEXTAREA NAME=SEQUENCE ROWS=6 COLS=80 VALUE=></TEXTAREA>
Expect Cutoff
<select name=EXPECT>
<option> 1e-100
<option> 1e-50
<option> 1e-25
<option> 1e-20
<option> 1e-15
<option> 1e-10
<option> 1e-5
<option> 1.0
<option selected> 10.0
<option> 100.0
<option> 500.0
<option> 1000.0
</select>

Internet Programming
Finding Form Fields by Element Info

Internet Programming
LWP::UserAgent

• BLAST @ PlantsP

use strict;
use HTTP::Request::Common qw( POST );
use LWP::UserAgent;

my $site   = "http://plantsp.genomics.purdue.edu/";
my $target = "plantsp/cgi-bin/blast_basic.cgi";

my $seq = "
MAKNVMQLAILSTQRVVLLLWLLHAPAAADAALTTVAGCPSKCGDVDIPLPFGIGDHCAW
ESFDVVCNESFSPPRPHTGNIEIKEISVEAGEMRVYTPVADQCYNSSSTSAPGFGASLEL
TAPFLLAQSNEFTAIGCNTVAFLDGRNNGSYSTGCITTCGSVEAAAQNGEPCTGLGCCQV
PSIPPNLTTLHISWNDQGFLNFTPIGTPCSYAFVAQKDWYNFSRQDFGPVGSKDFITNST";
$seq =~ s/\s+//g;                  # remove the line breaks from the sequence

my $library = "pp_active_prots";

my $agent   = LWP::UserAgent->new();
my $request = POST $site.$target,
    [ DATALIB            => $library,
      SEQUENCE           => $seq,
      PROGRAM            => "blastp",
      UNGAPPED_ALIGNMENT => 1,
      FSET               => 1,
      EXPECT             => 10.0,
      DESCRIPTIONS       => 10,
      ALIGNMENTS         => 10,
      db                 => "plantsp"
    ];
my $response = $agent->request( $request );
print $response->as_string;

Internet Programming
ClustalW2 @ EBI

• Multiple sequence alignment
• http://www.ebi.ac.uk/Tools/msa/clustalw2/

Internet Programming
ClustalW2 @ EBI

Internet Programming
TmHmm 2.0

• Predict transmembrane helices
• http://www.cbs.dtu.dk/services/TMHMM/

Internet Programming
TmHmm - Page info

Internet Programming
Psort

• Prediction of sorting signals
• http://wolfpsort.seq.cbrc.jp/

Internet Programming
PSort Output

• Formatted HTML

Internet Programming
PSort Output

• HTML Source
• Requires "Screen Scraping"

Internet Programming
Removing HTML Tags

• A naive approach
  ( my $text = $html ) =~ s/<[^>]*>//g;    # copy $html, then strip tags from the copy
• Fails for
  ○ <IMG SRC="foo.gif" ALT="A foo in its natural habitat">
  ○ <IMG SRC="foo.gif" ALT="A > B">
  ○ <!-- <a comment> -->
  ○ <script>if (a<b && a>c)</script>
  ○ etc...

Internet Programming
Parsing HTML

• Use HTML package to find or remove tags
• better, but complicated

use HTML::Parser;
my $tree = HTML::Parser->new(
    start_h => [ sub { print shift, "\n" }, "tag" ],
    text_h  => [ sub { print shift; print " ", shift, "\n" }, "line, dtext" ]
);
$tree->parse_file( "origins_life.htm" );

Internet Programming
Parsing HTML

• Parser package uses "event handlers"
• event => [ subr, information ]
  ○ Event types:
    − text
    − start
    − end
    − declaration
    − comment
    − process
    − start_document
    − end_document
    − default
• subr is a subroutine to process the information
  ○ for example a subroutine: sub{ print shift }

Internet Programming
Parsing HTML

• information types
  ○ attr - returns a reference to a hash of attribute name/value pairs
  ○ @attr - basically the same as attr, but keys and values are returned as individual arguments and the original sequence is preserved
  ○ attrseq - returns a reference to an array of attribute names
  ○ column - returns the column number of the start of the event
  ○ dtext - returns the decoded text
  ○ event - returns the event name
  ○ length - returns the number of bytes of the source text
  ○ line - returns the line number of the start of the event
  ○ skipped_text - returns the concatenated text of all the events that have been skipped since the last time an event was reported
  ○ tagname - returns the element name
  ○ tokens - returns a reference to an array of token strings.
    The strings are exactly as they were found in the original text; no decoding or case changes are applied.
  ○ text - returns the source text (including markup element delimiters)

Internet Programming
Checking Document Links

• HTML::LinkExtor
  ○ Get all the links in a document and possibly process each
  ○ $parser = HTML::LinkExtor->new( $function, $url );
  ○ $parser->links returns the list of links found so far and clears it. Each element is an array reference with the type of link and the attribute-value pairs from the tag
  ○ $function is normally undef, but can be a reference to a function that you want to act on every link

<A HREF=http://www.perl.com/>Home</A>
<IMG SRC="images/big/jpg" LOWSRC="images/big-lowres.jpg">

• $parser->links returns
  [ [ a,   href => http://www.perl.com/ ],
    [ img, src => "images/big/jpg", lowsrc => "images/big-lowres.jpg" ] ]

Internet Programming
Checking Document Links

• LinkExtor

use strict;
use HTML::LinkExtor;
use LWP::Simple qw( get head );

my $base_url = shift || die "usage $0 <start_url>\n";
my $content = get( $base_url );
defined $content || die "unable to access $base_url\n";

my $parser = HTML::LinkExtor->new();
$parser->parse( $content );
my @links = $parser->links;

print "base URL: $base_url\n\n";
foreach my $linkref ( @links ) {
    my @linklist = @$linkref;               # $linkref is a reference to a list
    my $type = shift @linklist;
    my ( $attr, $value ) = @linklist;
    # print "type: $type @linklist\n";
    # print "    attr: $attr value:$value\n";
    if ( $value =~ m{^(?:ftp|https?):}i ) { # only absolute ftp, http and https links
        if ( head( $value ) ) {
            print "$value is OK\n";
        } else {
            print "$value is BAD\n";
        }
    }
}
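
The link checker above tests only links that are already absolute. As a hedged sketch of the second constructor argument described earlier, the following passes a base URL to HTML::LinkExtor so that relative links are returned as absolute URLs. The page source, the example.org base URL, and the @found array are invented for illustration; the HTML is parsed from a string, so no network access is needed.

```perl
use strict;
use warnings;
use HTML::LinkExtor;

# hypothetical page source and base URL, for illustration only
my $base_url = 'http://www.example.org/docs/';
my $html = <<'END_HTML';
<html><body>
<a href="intro.html">Intro</a>
<a href="http://www.perl.com/">Perl</a>
<img src="images/big.jpg" lowsrc="images/big-lowres.jpg">
</body></html>
END_HTML

# no callback (undef); with a base URL every link is made absolute
my $parser = HTML::LinkExtor->new( undef, $base_url );
$parser->parse( $html );
$parser->eof;                               # flush anything still buffered

my @found;
foreach my $linkref ( $parser->links ) {
    my ( $tag, %attr ) = @$linkref;         # tag name, then attribute => URL pairs
    foreach my $name ( sort keys %attr ) {  # a tag may carry more than one link
        push @found, "$tag $name $attr{$name}";
    }
}
print "$_\n" foreach @found;
```

Unpacking each element into a tag name and a hash covers tags such as IMG that carry several link attributes, which the ( $attr, $value ) assignment in the slide's script would miss. Each absolute URL could then be passed to head() as in that script.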