The Comprehensive Engineering Guide to WebQL
Background

PURPOSE:
(1) To provide engineers with the knowledge base necessary to succeed as WebQL application developers
(2) To provide more extensive examples than the documentation provides
(3) To illustrate how WebQL goes beyond what Oracle, SQL, and other database programming languages can do

PREREQUISITE:
(1) Familiarity with basic HTML tags like <table>, <tr>, <td>, <form>, etc.
(2) Familiarity with Regular Expressions like [^0-9]{2,} and [A-Z0-9]+?
(3) Familiarity with javascript; no need to be an expert at any of (1) through (3)
(4) An understanding of calling functions to perform computations on strings and numbers

PREFACE: This book will get software developers excited about web programming. Because it can load any ODBC-driven database as well as any internet page, WebQL has no limit for applications ranging from competitive pricing to business intelligence. This guide is organized into 4 chapters, 5 appendices, and a Foreword; Chapter 4 will not be available for public consumption. The appendices do not use WebQL and should be read by engineers inexperienced in HTML, javascript, and/or Regular Expressions. Chapter 1 is a short introduction to what the WebQL product is and how the Interactive Developer Environment for Windows looks. Chapter 2 goes through coding basics, and Chapter 3 shows how to crawl around the web and grab data out of every nook and cranny. Chapter 4 shows how to build industrial-strength WebQL applications for a hosted web-mining service. As a teacher, I chose an active first-person voice for this guide, a choice some technical writing experts would debate. The American Institute of Physics' Style Manual (3rd edition, 1978, p. 14), an authority on science and engineering writing, encourages the active first person. Second person alienates the reader, and third person is not as good at keeping the reader's attention.

CHAPTER 1: Introductions

1.0 What is WebQL?

WebQL is a table-select programming language modeled after Structured Query Language (SQL) as used in products such as Microsoft SQL Server. A table-select programming language joins tables together in different ways, filters the results as desired, and writes the results to a database. WebQL does everything Microsoft SQL does and more, and it is built to answer the limitless web programming questions of today.

1.1 What can WebQL do?

Running on a schedule, WebQL can target internet HTML source files, crawl them in real time, use Regular Expressions to extract data out of them, and then write the data to a formatted database. WebQL interacts with any ODBC-driven database, and it reads information from both local and non-local HTML, XML, CSV, txt, pdf, and other file formats. A local file is stored on our machine or intranet, whereas a non-local file requires a fetch over the internet, which WebQL can perform just as a web browser like Microsoft Internet Explorer can. A WebQL program is something like artificial intelligence for a browser. Using table-select programming statements, WebQL surfs through websites and submits HTML forms with whatever values we want. WebQL can read in any database or databases, manipulate the data to look any way we want, and then write the data to any database, create any XML file with XSLT transforms, or create any HTML file with CSS stylesheets. This guide demonstrates how to create custom DHTML environments from WebQL data extracted from the internet.
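To preview where we are headed, here is a minimal sketch of a WebQL query in the select/into/from/within shape introduced in Chapter 2. The URL and the output file name are placeholders, and exact keyword spellings may differ from this sketch.

-- A minimal sketch: load a page, itemize its links with the links
-- Data Translator, and write the results to a CSV file.
select URL            as LinkURL,
       clean(CONTENT) as LinkText
into 'links_report.csv'
from links
within 'http://www.example.com'

Everything in this sketch (the select statement, the links Data Translator, the into destination) is covered in detail in Chapter 2.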
WebQL utilizes the internet as another database ready to be tapped, opening a new horizon for real-time business applications. WebQL's ability to mechanically crawl a website and submit forms allows thousands of hours of manual clicking and copying to be done by a computer program in hundreds of minutes.

1.2 What does WebQL look like?

WebQL Studio is an Interactive Developer's Environment (IDE) for Microsoft Windows. The IDE allows us to write code, run code, illustrate data flow, debug all network transmissions, sort data, view data as browser or text, etc. Before we write any code, let's get a feel for the environment in which we will be working.

Picture 1.2.0: A screenshot of the WebQL IDE

As code warriors, we want to write code and then push the play button. When we click the play icon (or press F5), the active WebQL code window compiles and an Execution Window is launched. If the play button isn't clickable, then our active WebQL window isn't a Code Window (plain text); it's probably an Execution Window. Picture 1.2.0 has one text window and two Execution Windows cascaded behind. Typically, a text window corresponds to a single specific Execution Window.

The Execution Window consists of a default viewer and 7 tabs. Remember that a table is merely rows and columns of data that can be displayed in the default viewer. A table consists of rows (or "records") and columns (or "fields"). A query is a text file of code that, when run, has a corresponding Execution Window with a default viewer. The following diagram is a quick recap of what we know so far about WebQL:

Picture 1.2.0.5: Quick sketch of WebQL Studio

Below is what an Execution Window looks like inside the WebQL Studio IDE:

Picture 1.2.1: A screenshot of the Execution Window in the IDE

The output table in the default viewer is typically used by the programmer for debugging purposes. Output data sets are usually written to CSV files or ODBC-driven databases on delivery to customers. A stereotypical customer of a WebQL hosted application is a pricing analyst wanting to mine a large volume of prices from a competitor's website. Picture 1.2.1 happens to show bank rates instead of prices, although the concept is the same. If we need to see WebQL's reading and writing abilities immediately, Chapter 2.6 discusses them.

The 7 tabs in the lower half of the Execution Window are outlined as follows:

(1) Messages. These are great for pinpointing mistakes in Regular Expressions. Regular Expressions in WebQL are a tool used to match patterns against the HTML source of a webpage and extract targeted information with extraction buffers denoted by parentheses. Messages itemize every page request, every page load, most Regular Expression patterns, and other information. Message verbosity is controlled through View > Options > Environment and is illustrated in Picture 1.2.1.5. The most verbose level (Diagnostics) is recommended.

Picture 1.2.1.5: Setting Message verbosity for debugging (Diagnostics)

Picture 1.2.1 gives an example of the Messages tab with "Diagnostics" set for message verbosity. We might need to lower message verbosity if our WebQL program makes tens of thousands of requests, in order to lower virtual memory (VM) consumption; once a certain VM threshold is exceeded, the WebQL run-time environment crashes. From Picture 1.2.1, the first message states which WebQL code window is being run.
The second message is the queuing of the first page request; the third message is the first page request being made; the fourth message is the completed page load. The fifth message is a Regular Expression pattern with one extraction buffer that extracts a redirection link (in a browser this is done automatically because a browser is javascript enabled). The sixth message is the delaying of the second page request that is a result of the extraction of the redirection link. Eventually, the second page request will be queued, then requested, then loaded just like the first page request. 9 CONFIDENTIAL (2) Statistics. Statistics are for monitoring how much network traffic is being generated. These stats enable statements like, “My 1 page of code generated 300,000 clicks that ripped 10 billion bytes of data.” These data can also be used to infer other stats like successful page loads per minute. Picture 1.2.2: The Statistics tab of the Execution Window (3) Data Flow. This tool is very useful in pinpointing a problem in data filtering. Suppose our code loads 10 data tables from various databases, then joins and filters them throughout 3 pages of code, and then writes the data to a CSV output file. We push the play button, the code runs, and there is no output generated. Code that runs and produces no output is most likely filtering records when we didn’t mean to filter them all. If we look at the Data Flow, we should be able to figure out where the data got cut-off. We can even click in the Data Flow where we think the problem is and the corresponding point in the code will pop-up. 10 CONFIDENTIAL Picture 1.2.3: The Data Flow tab of the Execution Window In Picture 1.2.3, the flow of records appears just fine because the box titled ASSEMBLE has the same number of records derived from the parent boxes RATES1 and RATES2. RATES1 and RATES2 are most likely data tables being assembled side-by-side like in Example 3.5.4. (4 and 5) Text vs. Browser. If our code selects fields into the default viewer of the Execution Window, then we can view those fields as plain text (4) or as a Microsoft Internet Explorer browser views them (5). Image files are not saved locally for the sake of browser viewing, so some images do not appear in the browser view like Picture 1.2.5. 11 CONFIDENTIAL Picture 1.2.4: The Text View tab of a SOURCE_CONTENT field Picture 1.2.5: The Browser View tab of that same field 12 CONFIDENTIAL (6) Output Files. This tab lists independent of directory any and all of the output files produced during query execution. We can view an output file as it is being created by the WebQL query. If the code is going to run for hours, we should check the output files early to make sure they look the way we want. Picture 1.2.6: The Output Files tab The output files for harvesting the bank rates, branches, and ATM locations are in CSV format while the news and stocks information are in both custom HTML output and plain text CSV format. (7) Network. The Network tab is one of the newer enhancements to the IDE. Every single outgoing request with headers is itemized along with each individual response with complete HTML source. Every single character of incoming and outgoing network transmission is revealed here. Picture 1.2.7: The Network tab 13 CONFIDENTIAL If our crawler makes thousands and thousands of page requests, we are better to turn off the Network tab feature because storing every outgoing and incoming byte could cause WebQL’s virtual memory to overflow, which is at 2GB. 
We can turn on and off several features in the same location in the WebQL IDE. Picture 1.2.7.5: Toggle-able features in the WebQL IDE From left to right, the features are Messages tab auto-scrolling, default viewer autoscrolling, Data Flow tab on/off, Network tab on/off. The Network tab is also referred to as a proxy. A proxy is best understood by a diagram: Picture 1.2.8: Seeing the Network tab as a proxy A proxy monitors every piece of outgoing communication and the corresponding response to the PC. The outgoing and incoming arrows in Picture 1.2.8 are analogous to those in the left half of Picture 1.2.7. 14 CONFIDENTIAL 1.3 How do websites and the internet work? The internet can be thought of as a large number of computers connected together both with and without wires. A website is a collection of files on a computer (or “server”) connected to the internet that creates communication sequences with another computer operated by a (usually) human user. In addition to a server that houses a collection of files, a website could have a large database of information attached to it. A website is symbolic of artificial intelligence because a human user is interacting with computers to achieve some sort of automated goal. That goal could be buying an airline ticket, downloading photographs, or learning, among other automated goals. The internet is used by a human typing and clicking on his machine thereby making it communicate with another machine. As internet users, we need to know how to work our machine but in the course of achieving our automated goal, we really don’t care about how our machine talks to a website nor how the software (web browser) on our machine works. Picture 1.3.1: Giving meaning to user interactions and server interactions Sending information to a website from our computer allows a website to take that information and create any file in return. How the file returned is conjured up is hidden from us. The hidden behind-the-scenes tasks can only happen when there are server 15 CONFIDENTIAL interactions. A quick example of “hidden behind-the-scenes tasks” would be a welcoming message for a website. We click a link to a website (user interaction), which causes a request for a page to be sent from our computer browser to a website (server interaction). There is code written on the website that checks to see if the current day is a holiday. If the current day is a holiday, then the page returned to the browser on our computer has a welcome message, “Have a nice holiday” otherwise the page returned to the browser on our computer has a welcome message, “Have a nice day.” No matter how many user interactions we make and even if we view the source code for the browser page, we cannot figure out the exact code that creates the welcome message—we only know what the welcome message is. Again, this type of “hidden behind-the-scenes tasks” can only happen when there are server interactions. Some user interactions cause server interactions while others don’t. Suppose we download a Boggle game that does cool things like highlights the letters as we click them and draws lines with arrows on them. We can try a weboggle game at http://weboggle.shackworks.com/. The letter highlighting and arrow drawing do not cause server interactions. When the game is downloaded through the browser, the browser page source code contains everything it needs to play the highlighting and arrow-drawing games. What doesn’t get downloaded is the gigantic dictionary of words that is needed to verify any guess. 
Verifying a guess causes a server interaction. The database of words in the dictionary is a part of a website and a part of the “hidden behindthe-scenes tasks” that makes the internet more effective in achieving our automated goal, which is entertainment in the case of internet Boggle. The process of programming subtle server interactions to do things like verify the spelling of an English word without loading an entirely new page from the server is an example of AJAX and is a popular style of writing web code in 2006. Teaching the details of AJAX is not a goal of this guide. The collection of files on a server that constitute a website contains files typically written in 3 or 4 different programming languages. The source code of files that makes it to our computer typically contains 2 different programming languages. 16 CONFIDENTIAL Picture 1.3.2: Typical programming languages in the internet PHP is server scripting. Javascript is client scripting. Why, then, do server pages contain javascript? Javascript executes in a javascript-enabled browser. The collection of files on a server has javascript in them to be sent to the user to interact with in a browser. PHP can write (or “echo”) javascript or HTML. Suppose that on a holiday, a javascript holiday game gets written by PHP on a server and passes it to the browser for a user to play with. To better understand how we use the internet, we must walk through the 5-step process of what happens when we command our browser to load files from the internet: 1) We click a link or else type a Universal Resource Locator (URL) into the browser. 2) The server replies with the appropriate file. Before the appropriate file is sent back our browser, the file that we want on the server contains HTML, PHP, mySQL, and javascript. The PHP code embedded throughout the HTML and javascript gets executed and the results are sent back to the computer browser. The source code of the file in the browser is the result after the PHP code has been executed. In addition to being embedded throughout an HTML/javascript source file on the server, the PHP code can actually write javascript functions or more HTML for us to interact with in the browser. The PHP code can also use mySQL to access a large database of photographs and/or product listings to be displayed for us in the browser. 17 CONFIDENTIAL 3) After getting a reply in the browser, we can interact with the HTML and javascript to set up an airline ticket purchase or, more generally, to enable the process of achieving our automated goal. Sometimes javascript games only cause user interactions where as buying an airline ticket causes javascript to submit forms that cause server interactions. Because javascript runs on our computer (the “client”) and causes both user and server interactions, javascript is considered client-sided scripting. 4) Eventually from a form submission or another click, our user interactions are again going to cause server interactions. Let’s say we send over dates and times for airplane tickets to the server. The information that we send fetches us another page (or “file”) from the collection of files comprising the website. Before we see the results page that we want, the PHP and mySQL are executed. The PHP runs and uses mySQL to grab the appropriate airline tickets out of a database and the results are sent back to the browser on our machine. Because we have no idea on how the PHP code runs on their database and server, PHP and mySQL are considered server-sided scripting. 
5) Sometimes our automated goal requires several more steps, which cause more information to be sent to the server. Go back to step 3).

Picture 1.3.3: Where client-sided and server-sided scripting are executed

Clicking around the internet and achieving our automated goals require the harmonious interaction of at least 3, and probably 4 or more, programming languages. More web programming languages and concepts like Flash are being developed all the time, so there isn't really a limit to the intensity and fun of scripting on either the client side or the server side. Intense scripting is not a goal of this engineering guide; however, having a feel for how internet pages talk to a PC and back will help make us better WebQL programmers. WebQL programming works off of what a browser sees in terms of source code, so we typically see only 2 programming languages when we write WebQL spider code (javascript and HTML).

CHAPTER 2: Coding Basics

2.0 The select statement

The WebQL programming language is a table-select programming language. Our code will look like nothing more than standardized statements that select tables and join them together. All sample code outside of IDE snapshots will use Verdana font to differentiate it from prose. An entire select statement is called a node.

select 'Hello World!'   --Example 2.0.0

Above is our first WebQL query. Notice how the statement selects a table that has one row and one column containing a text string. Because no output destination is specified, the assumed destination is the default viewer in the Execution Window. All text on a line appearing after -- is a comment and does not influence the code. Another way to comment large segments of text is to begin the comment with /* and end it with */.

We can name (or "alias") the table (or "database") and fields (or "columns") any way we want, and we can add as many fields as we want; Example 2.0.1 (an IDE snapshot) does exactly that. The default names for nodes are SELECTn, where n - 1 is the number of select statements that precede the node in the file. The default names for fields are EXPRn, where n is the corresponding column in the table for the expression. Using specific aliases that we choose, like in Example 2.0.1, makes the code easy to read. Now that we have gotten a feel for how to run something and get output, let's play with a few functions so we feel at home as software developers.

A comprehensive index of functions is available through the IDE help index; go to Help > Contents > Index. In addition to functions, the help index lists all keywords in the WebQL programming language. The help index is illustrated in Picture 2.0.0.

Picture 2.0.0: Hunting through the help index for functions and keywords

In this guide, Appendix V is a WebQL Giant Function Table that serves as a quick reference for WebQL's most popular functions. From Example 2.0.2, the syntax for if conditions is clear (use <> for unequal); html_to_text is a function that removes all HTML tags; replace is a function that is a bit more complicated. replace takes arguments that alternate string, Regular Expression, string, Regular Expression, string, and so on. The first Regular Expression is matched against the first string and its matches are replaced by the second string; the second Regular Expression is matched against the first string and its matches are replaced by the third string, and so on. The Regular Expression S{10} means match the letter S exactly 10 times.
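Example 2.0.2 itself appears only as an IDE snapshot, so here is a hedged sketch in the same spirit. The literal strings are invented placeholders, and the if(condition, value-if-true, value-if-false) form is assumed from the description above; only the <>, html_to_text, and replace behaviors follow the text directly.

-- A sketch in the spirit of Example 2.0.2.
select if('10' <> '11', 'unequal', 'equal')        as Compared,
       html_to_text('<b>WebQL</b> is <i>fun</i>')  as NoTags,
       replace('SSSSSSSSSS rocks, NE1 can see that',
               'S{10}', 'WebQL',     -- S{10}: the letter S exactly 10 times
               'NE1',   'anyone')    as Replaced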
NE1 is also a Regular Expression but does not use any special Regular Expression characters. Replace is a powerful function like extract_pattern, which will be seen later on. 23 CONFIDENTIAL Instead of generating strings explicitly in the WebQL code like Examples 2.0.1 and 2.0.2, more commonly we will need to read input from a CSV file. CSV stands for “Comma Separated Values” that are organized into rows (or “records”). A CSV file is a table, which is a database. Our next example introduces the concepts of both datasources and Data Translators. In this case, the datasource is the CSV file and the Data Translator is CSV—the mechanism that will allow us to select columns out of the datasource. Suppose we want to read as input a 2-column file containing a Yes/No column for alcoholic beverage availability for some particular markets of a given airline, with the corresponding airport codes for the flight. Commercial airport codes are always 3 characters in length, so an origin/destination combo should be of length 6. 24 CONFIDENTIAL The proper way to read this code is, “Apply the CSV Data Translator to the input.csv file. Next, make a table that performs checks and adjustments to column one (C1) and column two (C2) as written.” For an unknown reason, row 2 column 2 of the input file did not contain 6 characters after trimming outside whitespace, so our own error was triggered. The CSV Data Translator is one of the most basic of all Data Translators in WebQL. C1, C2, …, Cn are Pseudocolumns of the CSV Data Translator. The next section is titled Data Translators and Pseudocolumns. The CSV Data Translator has only one other selectable Pseudocolumn. ROW_ID is the other Pseudocolumn of the CSV Data Translator that represents the sequential row of input from the CSV file. In this case, the ROW_ID is naturally displayed along the left side of the Execution Window. If we want to manipulate ROW_ID later on, then we will have to select the ROW_ID Pseudocolumn explicitly as a field and alias it if we want to. We will see that in Example 2.2.1. There exist several parts to a select statement. So far, we’ve seen select, from, within. into, where, group by, sort by, and having are the other major components. Remember that as WebQL application developers, our code will always be lots of select statements (also known as “tables,” “databases,” or “nodes”) that end up getting joined together and filtered in creative ways to make a final node or set of final nodes that write to local or non-local destinations. The general format of a select statement is heuristically the following: select <these fields> into <this destination> from <this Data Translator applied to…> within <…this file/db/source> where <this and that are true or false or matching or equal, etc.> 25 CONFIDENTIAL Our code will always look like this at every node and can potentially be missing any and all parts except for the selected fields. 2.1 Data Translators and Pseudocolumns This section on Data Translators and Pseudocolumns is intended to introduce various techniques for extracting data out of a datasource. WebQL is effective when using HTML pages as datasources. Later on, we will learn how to crawl these sources and grab any information that we want along the way. The first Pseudocolumns that we will discuss are independent of any Data Translator. Basically, in any select statement at any point in the code, these fields are fair game to be selected. 
The truth is that there are exceptions, one being that RECORD_ID is sometimes not a selectable field in an outer node (see Chapter 2.4 for inner-outer node relationships). Some of these Pseudocolumns only make sense when we specify a datasource, and some are particularly useful when the datasource is an HTTP URL. A datasource follows the word within of the select statement, or it follows the word from when no Data Translator is specified.

Table 2.1.0: Pseudocolumns always available to select

Pseudocolumn       Description
CHUNK_ID           The sequential ID of the current chunk of data. An HTML page can be sliced into segments ("chunks") if the CONVERT USING Document Retrieval Option is used. The value is NULL if we are not using a chunking converter.
CONTAINER_URL      The URL of the document from which WebQL extracted the current record. The value is NULL if the source document does not have a parent (not a WebQL parent, but an HTML parent).
CRAWL_DEPTH        The depth of the current source document within the current crawl. The value is NULL if the current document was not produced by a crawl.
ERROR              The error of the current fetch. ERROR is NULL if the fetch succeeds. Use this field with the "with errors" Document Retrieval Option.
FETCH_ID           The sequential ID of the current fetch. This ID is global across a query and is always increasing.
REQUEST_URL        The URL initially visited to load the document, prior to any redirection.
REQUEST_POST_DATA  The POST data initially submitted to load the current document, prior to any redirection.
PARENT_URL         The URL of the document from which the link to the source document was extracted within the current crawl. The value is NULL if the current document was not produced by a crawl.
RECORD_ID          The sequential ID of the current record within the current node.
SOURCE_CONTENT     The content of the document from which the current record was extracted.
SOURCE_POST_DATA   The POST data submitted to request the current document.
SOURCE_RECORD_ID   The sequential ID of the current record within the current document.
SOURCE_TITLE       The title of the document from which the current record was extracted.
SOURCE_TYPE        The MIME type of the document from which the current record was extracted.
SOURCE_URL         The URL of the document from which the current record was extracted.

A table similar to Table 2.1.0 is available through the IDE help index (look for "Pseudocolumns"). A lot of these do not make sense yet because we haven't been selecting anything from the internet, so let's select all of these fields on a crawl of a webpage to a certain depth and see how they vary. There are simple Document Retrieval Options that we use to crawl a URL; the details of the major Document Retrieval Options are in Chapter 2.5. When we say that we are going to "crawl" a URL to depth k, we are going to load the URL and all links on the page (to depth 2), and all links on the pages that get loaded (to depth 3), etc. Example 2.1.1 crawls http://www.yahoo.com to depth 2 and views all Pseudocolumns from Table 2.1.0 (a sketch of such a query appears below). All null output columns are smashed to fit the image.

In addition to all of these Pseudocolumns, we also have the option of selecting the Pseudocolumns of a Data Translator if we choose to use one. A Data Translator is a tool that creates Pseudocolumns out of a datasource. A webpage is an example of a datasource.
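Here is the hedged sketch of the Example 2.1.1 crawl promised above. It uses no Data Translator, only a representative subset of the always-available Pseudocolumns; the wording of the crawl Document Retrieval Option is an assumption (the real spelling is covered with the other options in Chapter 2.5).

-- Crawl yahoo's homepage to depth 2 and watch how the
-- always-available Pseudocolumns vary from fetch to fetch.
select FETCH_ID, CRAWL_DEPTH, PARENT_URL, REQUEST_URL, SOURCE_URL,
       SOURCE_TITLE, SOURCE_TYPE, RECORD_ID, SOURCE_RECORD_ID
from crawl of 'http://www.yahoo.com' to depth 2   -- option spelling assumed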
The best way to conceptualize a Data Translator is as a lens that converts a datasource into rows and columns.

Picture 2.1.0: Data Translators as lenses

The Pseudocolumns that we want to select determine which Data Translator (or "lens") we want to use. Picture 2.1.0 shows any page from the internet as the targeted datasource. We can think of Data Translators as a set of differing lenses through which to see a datasource. Using multiple Data Translators in a single node can cause unpredictable behavior, which suggests that separating the Data Translators across multiple nodes is a safer and cleaner way to code; we will see that in the next section on joining nodes.

Let's review each Data Translator and the associated Pseudocolumns. This will take a while. We'll begin by listing the Data Translators with their associated Pseudocolumns in a table.

Table 2.1.1: Pseudocolumns of various Data Translators

DATA TRANSLATOR  PSEUDOCOLUMNS
images           URL, CAPTION
links            URL, CONTENT, OUTER_CONTENT
table rows       Cx, TABLE_PATH, TABLE_ID, TABLE_CONTENT, ROW_ID, ROW_CONTENT, COLUMN_COUNT
table columns    Rx, TABLE_PATH, TABLE_ID, TABLE_CONTENT, COLUMN_ID
table cells      CELL, TABLE_PATH, TABLE_ID, TABLE_CONTENT, ROW_ID, ROW_CONTENT, COLUMN_ID
pattern          ITEMx, COLUMN_COUNT
upattern         ITEMx, COLUMN_COUNT
empty            (none)
CSV              Cx, ROW_ID
TSV              Cx, ROW_ID
forms            FORM_ID, FORM_URL, FORM_METHOD, FORM_CONTENT, CONTROL_NAME, CONTROL_TYPE, CONTROL_DEFAULT, CONTROL_VALUES
table values     <Pseudocolumns vary by page>, Cx
tables           TABLE_PATH, TABLE_ID, TABLE_CONTENT, RxCk
sentences        SENTENCE
pages            PAGE, CONTENT
paths            ITEMx
full links       URL, TYPE, AS_APPEARS
lines            LINE, CONTENT
mail             MESSAGE_ID, FOLDER_PATH, SENDER_NAME, SENDER_ADDR, SUBJECT, SEND_TIME, RECIPIENT_NAME, RECIPIENT_ADDR, BODY, HTML_BODY, HEADERS, TO, CC, BCC, ATTACHMENTS
headers          NAME, VALUE
RSS              TITLE, LINK, DESCRIPTION
snapshot         IMAGE
values           <Pseudocolumns vary by page>
text_table_rows  TABLE_ID, TYPE, ROW_ID, COLUMN_COUNT, DELIMITER, Cx

Combining powerful functions, creative filtering tricks, and Data Translators, small amounts of code can perform surprising tasks that we would never even think of unless somebody told us. For example, if we want every image off of yahoo's homepage that doesn't come from the yahoo.com server, it's just a few characters (a sketch of this query appears below). If we want every non-yahoo.com domain image off of yahoo's homepage plus every non-yahoo.com domain image off of every page that we can link to off of yahoo's homepage, then it's only a few more characters to type. In this case, there are thousands of such images, but less than a thousand of them are unique. I will use the unique feature to eliminate repeated records. The records consist of 2 fields: URL and CAPTION. Can repeated images still exist even when unique is used? If the same image is coming from 2 different servers or has 2 different file names, the URL field will not match, and thus unique won't eliminate the duplicate.

If we only want a count of these images, or a count of how many times an image is repeated by URL, we can use the count Aggregate Function. Aggregate Functions are explained in Chapter 2.3, and Examples 2.3.1-2 follow the previous 2 examples.

The links Data Translator is very similar to the images Data Translator. Instead of a URL and a CAPTION (the CAPTION is the alt text of the <IMG> HTML tag), the links Data Translator has URL, CONTENT, and OUTER_CONTENT as Pseudocolumns. The links Data Translator itemizes every HTML anchor on the page.
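Before looking at how the links Data Translator copes with javascript anchors, here is the sketch promised above of the non-yahoo.com image query. The unique keyword placement and the not matching operator are assumptions based on this chapter's descriptions.

-- Every image on yahoo's homepage whose URL is not served from yahoo.com.
select unique URL, CAPTION
from images
within 'http://www.yahoo.com'
where URL not matching 'yahoo\.com'
-- Crawling the homepage to depth 2 instead of loading it alone turns this
-- into the second query described above (every linked page's images too).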
Some HTML anchors call javascript functions instead of providing URLs, so the URL is a javascript function call that WebQL does not perform. In such a case, we must find the definition of the javascript function being called and do in WebQL what the javascript is doing. Suppose an anchor tag calls a javascript function with one argument: a directory/file location string that must be form-submitted to trigger the page/file download we want. Here is an example of one such anchor tag:

<a href="javascript:GetMyBaseURL2('path1')"><font face="times" size=-2>click here for file</font></a>

Here the value of the URL Pseudocolumn is:

javascript:GetMyBaseURL2('path1')

The value of CONTENT is:

<font face="times" size=-2>click here for file</font>

and the value of OUTER_CONTENT is the entire anchor tag binding, including the anchor tag itself. In QL2's Regular Expression language, OUTER_CONTENT is represented by the extraction buffer

(<a\s*[^>]*>.*?</a>)

Hopefully Regular Expressions are familiar. If not, we will learn more about them later in Table 2.1.2; in addition, Appendix III is a quick reference for building Regular Expressions. Suppose that preceding this anchor tag in the HTML code is the javascript function GetMyBaseURL2:

function GetMyBaseURL2(myStr) {
  f.myDirection.value = myStr;
  f.submit();
}

The first thing we must do is extract the 'path1' file/path name by using a Regular Expression. We then submit the form named f with the value of the myDirection field set to the name of the file/path that got extracted. Here, we are trying to conceptualize how sometimes we can get lucky with the links Data Translator and immediately find a page URL that we want, and the rest of the time we have to jump through javascript hoops just to fetch a page. Here's an example of an HTML page where we can't use the links Data Translator to find a link to follow because of a javascript form submission:

Picture 2.1.0.4: Clicking links that perform javascript

The code to crawl one of these links uses a pattern Data Translator to extract the desired path/direction from the HTML page source in Picture 2.1.0.5 and then submits the form named f. We need to save the file in Picture 2.1.0.4 locally to 'c:\Ex2.1.3.5.html' before running the code in Example 2.1.3.5. This example shows how to submit a form when the form submission is the result of a javascript function call. In this example (Example 2.1.3.5), the submitting Document Retrieval Option uses a robust Regular Expression to match the HTML form named f. The concepts of joining nodes together and form submission are covered in more detail later in Chapter 2; the code is provided now to complete the example. The Document Retrieval Option cache forms is used so that the forms in Ex2.1.3.5.html are loaded once in the node getdirection and zero times in the node submitform. Here is the source code of the Ex2.1.3.5.html file, available for download at http://www.geocities.com/ql2software/Ex2.1.3.5.html.

Picture 2.1.0.5: HTML source code of Ex2.1.3.5.html

Moving on to a more basic example using the links Data Translator, suppose that we want every link off of a page, but we only want to see the non-javascript links. Expedia is a website that is very javascript-intense, perhaps unnecessarily so. The only two fields that we select are URL and OUTER_CONTENT; there is no need to select the CONTENT Pseudocolumn because we have the CONTENT if we have the OUTER_CONTENT.
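The Expedia query just described appears only as an IDE snapshot; a hedged sketch of it follows, with the not matching operator again assumed.

-- Every non-javascript link off of Expedia's homepage.
select URL, OUTER_CONTENT
from links
within 'http://www.expedia.com'
where URL not matching '^javascript:'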
Sometimes loading Expedia's homepage requires us to add a cookie and sometimes it doesn't. If we are ever having trouble getting the right page out of Expedia, add "adding cookie 'jscript=1;'" after the URL. This feature is one of many powerful Document Retrieval Options, which are covered in Chapter 2.5.

Moving on to CSV and TSV, these Data Translators allow us to select the columns out of CSV and TSV files as if the files are data tables (they are). Again, CSV stands for "Comma Separated Values" and TSV stands for "Tab Separated Values". Besides selecting out column n with Cn, the only other Pseudocolumn is ROW_ID, which is the row of a record from the CSV/TSV file. We will see lots of CSV file interaction in this guide; we could have just as easily used TSV instead.

The next Data Translators to be discussed are pattern and upattern. upattern is optimized for handling double-byte characters on international sites; on an American website that uses English, pattern should suffice. The concept of a pattern Data Translator is to craft extraction buffers (symbolized by parentheses) within Regular Expressions to extract wildcard text strings, often prices of products or services that change daily or hourly. The first wildcard pattern that we'll learn is <html>(.+?)</html>, which means extract any character 1 or more times between the HTML tags. Effectively, the extraction buffer strips the begin and end HTML tags off of an HTML file. Below is a table of Regular Expression wildcards and what they mean.

Table 2.1.2: Commonly Used RegEx in WebQL

RegEx              Explanation
.                  match any 1 character
.*                 match as many characters as possible 0 or more times
.+                 match as many characters as possible, but at least 1 or more
.*?                match as few characters as possible 0 or more times
.+?                match as few characters as possible, but at least 1 or more
[^>]*              match as many characters as possible up until the end of an HTML tag
[^<]*              match as many characters as possible up until the beginning of an HTML tag
(?:A|B|C|D|F)      match one character A, B, C, D, or F
(?:A|B|C|D|F)?     optionally match one character A, B, C, D, or F
<tr[^>]*>.*?</tr>  match any HTML table row
colou?r            match the word color in American or British English
20[0-9]{2}         match any year 2000-2099
Reali(?:s|z)e      match the word realize in American or British English

The parentheses in the grouping expression (?:) do not represent an extraction buffer; they represent a group. The vertical bars symbolize disjunction: for example, (?:bear|cub) matches the word bear or cub. A group can end with a ? just like a .* can. There is more information on Regular Expressions in Appendix III.

Let's create an example. We'll implement the SOURCE_TITLE Pseudocolumn using the pattern Data Translator (a sketch of the query appears below). The first and only extraction buffer in the Regular Expression corresponds to item1 in the data selection. If a second extraction buffer existed in the Regular Expression, then it would correspond to item2 in the data selection. The pattern implemented says, "Give me all characters between the HTML <title> tags and then immediately match the remainder of the page." The pattern Data Translator is faster with the .* at the end because the <title> HTML tag is not looked for beyond the first occurrence. Regular Expressions of the pattern Data Translator in WebQL can have as many extraction buffers as we want and can become as complicated as we want. Try not to write anything that is too big of a headache to look at.
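Here is the promised sketch of the title-extraction query; it is short enough not to be a headache. The way the Regular Expression is attached to the pattern Data Translator is an assumption, while the expression itself follows the description above.

-- Re-implement SOURCE_TITLE with a single extraction buffer.
select item1 as PageTitle,
       SOURCE_TITLE                        -- for comparison with the extracted value
from pattern '<title>(.+?)</title>.*'      -- attachment syntax assumed
within 'http://www.yahoo.com'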
Try to write expressions in small pieces then piece them together into something bigger for the sake of debugging. The next example is at the upper bound of what is acceptable for Regular Expression cluddyness. The first thing to notice is where the extraction buffers are and how many there are. The buffer that starts first is item1, the buffer that starts second is item2, etc. Why did this query produce only one output record into the default viewer considering that Orbitz 41 CONFIDENTIAL probably has dozens of tables on every page? We ended the pattern with .* which we know means match as many characters as possible 0 or more times, thus that .* matches the vast majority of the page, so we actually end up getting the first table, the first row in the first table, and the first cell in the first row of the first table. It’s not surprising that the first cell is some text and links welcoming us to sign in or register. If we take away the .* at the end, then what are the extraction buffers extracting? It looks like we are getting more tables than before. The truth is that we are getting the content of every table on the top layer that has at least one row containing at least one 42 CONFIDENTIAL column, and we are actually getting the content of that row and that column associated with every such table. There happen to be 5 in this case. We also used the ignore children Document Retrieval Option which cuts off any of Orbitz’s child source pages. Major travel sales corporations load advertisements and other information not of interest through child pages; we chose to ignore them. Another technique using pattern is to use lots of pattern Data Translators all ending with the .* trick described in Example 2.1.5. This way, every translator generates only 1 record, so there is no confusion over inconsistent behavior in record spreading. 43 CONFIDENTIAL We notice that the first price is $0.00, which is the amount in our shopping cart. pattern2 tells us if the phrase “Office Supplies” appears anywhere case-insensitive on the page. The FirstAncor (with anchor spelled wrong) is the inside of the first <a …> </a> HTML tag, which is an image of some sort in this case. Consider using two pattern Data Translators where one of them extracts 3 records and the other extracts 2 records. What should happen? We are going to have holes in our data table in the default viewer and wherever else we are writing them. This also ties into 44 CONFIDENTIAL to a major theme for this guide about how using multiple Data Translators in the same node is almost never a good idea. The only general case exception to that rule is using pattern and empty together. Suppose a website lists sale prices a certain way on their homepage. If nothing on the homepage is on sale, then the pattern doesn’t match. If the pattern Data Translator doesn’t match anything on the page, then a null table (0 records) is produced. If we want to trigger an error message instead of a null table then we use the empty Data Translator with the pattern Data Translator like this: Even though the pattern does not match the page anywhere, the empty Data Translator forces a single record to appear and nvl converts the NULL item1 into the string ‘No 45 CONFIDENTIAL Sale Price’. nvl is a function that returns its first argument unless its first argument is null, in which case nvl returns its second argument. Moving on to the forms Data Translator, forms is great at helping us program with the submitting Document Retrieval Option. 
To use submitting, we want to find out what values we can submit without hunting and pecking through an HTML source. The forms Data Translator is perfect for that task. This very short segment of code tells us everything we are going to need about submitting one of the 2 forms on www.hotels.com. CONTROL_NAME represents the variables on the form that we may submit. Nodes that use the submitting Document Retrieval Option are reviewed extensively in Chapter 2.5. For now, all we need to know 46 CONFIDENTIAL is that this program that uses the forms Data Translator gives us the form break-down that we need to successfully submit forms in Chapter 2.5. We are moving on to some of the table Data Translators, namely table rows, table columns, and table cells. To get a feel for what the Pseudocolumns do for us, we can write 3 queries similar to Example 2.1.9. Using the select * trick will trigger every Pseudocolumn out of the specified Data Translator that is not italicized in Table 2.1.1. We use select * only in this case or in the case when we don’t care what we’re selecting (believe it or not, that does happen— see Picture 4.1.2). 47 CONFIDENTIAL Below are the Execution Windows for these 3 examples. Picture 2.1.1: Default viewer pics of Examples 2.1.10-12 We notice that the TABLE_PATH, TABLE_ID, ROW_ID, and COLUMN_ID cells all overlap each other. We should be getting a feel for how we could use any one of the 3 Data Translators to perform the same data extraction task. 48 This row/column/table CONFIDENTIAL information is useful for filtering, but how are these Data Translators applied most effectively for data extraction? Because numeric pricing and interest rate applications are hot right now in the post-2000 internet era, we’ll pull every data cell out of an HMTL table using table cells to exemplify the power of the table cells Data Translator. To keep this guide complete without the need of a computer, here is a browser screen image of the table located at http://www.geocities.com/ql2software/tableFun.html. 49 CONFIDENTIAL Picture 2.1.1.5: tableFun.html Several examples throughout this guide will refer to this table. Taking the example one step further, suppose we had a page where we wanted prices out of all 275 price cells, then ‘\$’ on the filter is a good idea because ‘\d+’ will match too many cells. 50 CONFIDENTIAL table rows and table columns are slightly different. Directly referencing columns out of the table rows Data Translator and directly referencing rows out of the table columns Data Translator are where these two Data Translators are most effective. The two-column stock table at http://www.geocities.com/ql2software/tableFun.html is quickly formatted into selectable columns. Huge data tables with 10 or more columns embedded in HTML sources that are 3 or more pages long are equally easily parsed out by the Cx fields of the table rows Data Translator. 51 CONFIDENTIAL Similarly, we can target different information by selecting rows out of the table columns Data Translator. Here, we don’t bother to alias the fields for writing column headings—we will let the rows transpose themselves into columns. We use the clean function to remove HTML and outlying whitespace. Whether using Cx column selectors with table rows or Rx row selectors with table columns, a great way to debug the translator completely is to look at both the italicized Pseudocolumns and non-italicized Pseudocolumns of Table 2.1.1 simultaneously. 
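As a hedged sketch of that debugging idea, using the tableFun.html page pictured earlier (the column-selector syntax follows the Cx convention described above):

-- Select the Cx column selectors and the descriptive Pseudocolumns of
-- table rows side by side, so each extracted column can be compared
-- against the row and table it came from.
select C1, C2,
       ROW_ID, ROW_CONTENT, TABLE_ID, TABLE_PATH
from table rows
within 'http://www.geocities.com/ql2software/tableFun.html'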
52 CONFIDENTIAL For a given ROW_CONTENT, to what do the Cx column selectors correspond? What is the TABLE_CONTENT for a given ROW_CONTENT? These types of questions can be answered by using this idea of selecting both italicized and non-italicized Pseudocolumns out of a Data Translator. The notion of selecting both italicized and non-italicized Pseudocolumns can be applied to any Data Translator. We’ve learned a lot about Data Translators in WebQL so far, but the truth is that we’ve not yet completed discussing half of them! We can jump to Chapter 2.2 on joining nodes if we feel like learning the rest of the Data Translators later. 53 CONFIDENTIAL Let’s continue by discussing two other table-related Data Translators, text_table_rows and tables. Let’s look at Pseudocolumns of tables. We’ve targeted the tables on Expedia’s homepage that have forms on them, and we are looking at the Browser View tab of TABLE_ID=9 in the Execution Window in Example 2.1.17. Notice how the TABLE_PATH and TABLE_ID correspond to the 3 previously mentioned table-related Data Translators (table rows, table columns, and table cells—see Examples 2.1.10-12). 54 CONFIDENTIAL tables is also an effective Data Translator when we explicitly reference a cell by the RxCk Pseudocolumn. We easily cut a stock index and its tick out of an HTML table in the above example with the tables Data Translator. Using the text_table_rows Data Translator, we get quite a different picture of the Expedia page than any other table-related Data Translator. 55 CONFIDENTIAL The way the data come out through this Data Translator suggest that HTML pages are not where text_table_rows is most effective. Again, we are using the trick of select * plus manually referencing columns simultaneously to fully debug the Data Translator on the page. text_table_rows works best on files that represent data tables that use a | or / or a similar character to delimit the columns in the table. Moving on, the sentences Data Translator has only one Pseudocolumn, SENTENCE. 56 CONFIDENTIAL WebQL is quick and powerful at calculating information about characters per sentence in this example. This particular example uses mathematical Aggregate Functions to column-wise calculate numbers. We’ll learn in detail about Aggregate Functions in Chapter 2.3. Cleaning the SOURCE_CONTENT with the convert using Document Retrieval Option changes the source page into only the words that we see in a browser and also removes any redundant whitespace. In addition to the links Data Translator discussed in Example 2.1.4, there is also a full links Data Translator. To get every link off of a page including flash files and stylesheets, we must use full links that has slightly different Pseudocolumns than the links Data Translator. 57 CONFIDENTIAL Looking at the default viewer in the Execution Window, we notice the difference between URL and AS_APPEARS. Record 110 shows that the way the link appears in the HTML already has the appropriate URL base attached to it automatically in the URL Pseudocolumn. There are a total of 12 different link types including Anchor, Base, Detected, Flash, Form, Frame, Image, Object, RSS, Script, Stylesheet, and Unknown. Next is the snapshot Data Translator. It captures *.bmp browser-based screenshots of whatever datasource we specify. 58 CONFIDENTIAL We now see how easy it is to automatically get a glimpse of our top 10 or even 500 competitors’ websites quickly and easily. The page and lines Data Translators are great for plain text documents and pdf files. 
Here, we have a pdf file by page of an HKN CS GRE review guide: 59 CONFIDENTIAL We could easily apply a filter for pages matching topics (substrings) and/or Regular Expressions. Applying the lines Data Translator to the same pdf file, we can filter to get the line number of every line on the topic of recursion. 60 CONFIDENTIAL Immediately in our 41 page pdf study guide, we see where the topic of recursion is mentioned. To better understand the paths Data Translator, we can study XPath, which is a querying language for extracting data out of XML. Appendix IV contains a link to information on XML and XPath. This next example implements the links Data Translator using the paths Data Translator. 61 CONFIDENTIAL There is no need to worry about being unfamiliar with XPath because the most popular websites for business applications are done in HTML. RSS is a standard XML format for news circulation; it’s not surprising that there is an RSS Data Translator in WebQL. To grab data like news links straight out of an RSS URL, use the RSS Data Translator. The Pseudocolumns are TITLE, LINK, and DESCRIPTION. 62 CONFIDENTIAL The RSS Data Translator gives us the power to target and accurately filter news links by the hundreds within seconds. The data are directly ripped out of the XML in the Slashdot RSS page source. How many cookies do websites try to tattoo to our browser the instant we hit them? We can uncover this sort of information quickly using the headers Data Translator. Any cookie setting occurs in the header of an internet request. 63 CONFIDENTIAL This particular megastore for piping, fasteners, etc. sets dozens of cookies when we hit their homepage. We could build an application that hits hundreds of popular websites and does cookie analysis. We could look at the number of cookies, the length of the cookies, etc. Moving to the next Data Translator, mail is another file-specific Data Translator such as CSV, TSV, and RSS. It gives quite a variety of Pseudocolumns to select from within pst files. Suppose we had a large collection of emails in pst format, we could easily filter them by subject or by phrases (Regular Expressions) that match the body text of the messages. If the emails we want to look at work on the pst system then the Pseudocolumns are self-explanatory and there is no need for an example. The 15 Pseudocolumns of the mail Data Translator are MESSAGE_ID, FOLDER_PATH, SENDER_NAME, SENDER_ADDR, SUBJECT, SEND_TIME, RECIPIENT_NAME, RECIPIENT_ADDR, BODY, HTML_BODY, HEADERS, TO, CC, BCC, and ATTACHMENTS. 64 CONFIDENTIAL The final two Data Translators are values and table values. The Pseudocolumns for these two Data Translators are similar, but table values also has column referencing ability C1, C2, … . Recall Picture 2.1.1.5 for these two examples. Creatively, this example shows how to cut a subsection of a table without using a where filter. Using the values Data Translator, we can pick out certain stock information based on the row’s name column, which is the first column of the table. 65 CONFIDENTIAL Now we have all stock ticks we asked for aliased as their indexes, which effectively give us structure and control over numbers floating through HTML source pages. Overall, we have seen a wide variety of ways to convert datasources (including webpages) in to rows and columns of data through Data Translators. Some of the Data Translators are customizable (such as pattern and upattern) and the others aren’t (such as sentences and links). 
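One more hedged sketch before moving on: the RSS Data Translator described above, pointed at a news feed. The feed URL is a placeholder, and the matching operator and the (?i) case-insensitivity flag are assumptions.

-- Rip news links straight out of an RSS feed and filter them by keyword.
select TITLE, LINK, DESCRIPTION
from RSS
within 'http://rss.slashdot.org/Slashdot/slashdot'   -- placeholder feed URL
where TITLE matching '(?i)linux'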
Given that we know how to effectively translate page sources from the internet and select Pseudocolumns out of those Data Translators to create tables, we will now learn how to connect tables together in Chapter 2.2 and even create navigational threads in Chapter 3. 2.2 Joining Nodes Let’s expand and modify Example 2.0.3 to have a child node that filters errors. 66 CONFIDENTIAL Now that more than one node appears (remember that a node is a table which is a database), which one gets outputted to the default viewer? To avoid confusion, we specify which one, which is the node InputNoError in this case. The relationship between GetInput and InputNoError is that GetInput is the parent and InputNoError is the child. A parent merely passes on what it selects to its child. In the process of passing the fields of the parent’s table to the child, the records (or “rows”) can be filtered similar to the way that records get filtered in the parent node. When queries get more advanced and involve complicated branching schematics, using where filters both in the node itself and between nodes is extremely convenient. We have now seen both methods of using where filters. As a child, InputNoError can reference any of 67 CONFIDENTIAL the parents’ fields’ aliases in its node. InputNoError also has the ability to re-alias any data it selects—it doesn’t in this example. A child can have any number of parents, and parents can have any number of children. Suppose a child has 2 parents. One parent’s table has 6 records and the other has 5 records. Given that no filtering takes place, how many records will the child have? The answer is the product of all parents, which is 30 in this case. In the example below, we illustrate a cross product (spatial cross product) and use a nifty join trick called spread. So we don’t have to create CSV or other input files, we make arrays and smear them into columns of data using join spread—it is in the IDE help index (HelpÆContentsÆIndex). The example also uses union, which has not been discussed yet. union is used when we want tasks to operate completely independent and ignorant of each other. The process of making an array of 6 elements has nothing to do with the process of making an array of 5 elements, so when we start the process of making the second array, we use union. After smearing the arrays into single columns of data, the columns are joined thus the spatial cross product of the rows is generated in the child node myDemo. If our code writes to the same database in different nodes and we need the writing to occur in a certain order, then wait union can be used instead of union. 68 CONFIDENTIAL 69 CONFIDENTIAL Another method of joining queries is parent wait join. If a child node must follow up a page load made by a parent before the parent processes its next request, then parent wait join can be used; however, inner-outer queries (or “views”) are a better way of depth-first navigation. Some advanced WebQL applications submit forms and go 8 clicks or more deep into a site to retrieve data. Suppose such a task must be performed 10,000 times. If a breadth-first approach to form submission and crawling is applied, then the server-sided session timer could timeout (see Picture 2.4.1). Imagine going 7 clicks deep into a travel site for a fare basis code for a flight, and doing that for 50,000 flights in a row. 
If we submit the search form 50,000 times, then navigate one more click for the 50,000 pages that come back, we won’t be able to get to any of the data because by the time we advance to the second click on the first search, hours could have passed and the session has already timed out. These issues are discussed in detail in Chapter 3. For a quick-fix that enables depth-first fetching, we use parent wait join, but if we are going to try to parent wait join 4 or more nodes together to trigger a desired navigational effect, then we aren’t going about solving the problem the right way. Navigational threads are discussed in Chapter 3 and see the text below Picture 4.1.1 for the right way. The words parent wait can also be used to precede union join, source join, and output source join, which I will outline next. union join stacks the records of 2 or more nodes of equal column width on top of each other. 70 CONFIDENTIAL table1 has 6 records and table2 has 4 records. union joining the tables gives 10 records. Notice how the referencing works in a union join. The only aliases that matter are those of the parent that is listed first. Node table1’s aliases are all that matter in the union join. source join is straight-forward method of joining nodes. The SOURCE_CONTENT of a child node is the same as the parent node. If we must chunk an HTML page into pieces we might what to do that in a node different than the one we loaded the page in. When we say “chunk an HTML page” we mean creating multiple SOURCE_CONTENT records out of a single SOURCE_CONTENT record (see Example 2.5.2). This is when source join is appropriate. We can achieve the same effect using output source join. output source join allows us to select a column or columns from the parent to 71 CONFIDENTIAL be processed as SOURCE_CONTENT records by the child. If we select the SOURCE_CONTENT of an HTML page as a column and then output source join to that column, we achieve the same effect as source join. We can also implement filters on these join techniques. Example 2.2.4 illustrates two equivalent branching techniques to fetch both the link and image URLs from a single page load of www.cnn.com. Whenever using the output source join or source join technique we must write the from/within source syntax to symbolize that our datasource for the node is the same data source as the parent node. Remember that the word source follows the word from when no Data Translator is specified and follows the word within when a Data Translator is used. 72 CONFIDENTIAL There also exist the techniques of outer joins which are either a right outer join, a left outer join, or a full outer join. For outer joins, we specify both a left and right parent. 73 CONFIDENTIAL To see the rest of the picture, here are the right.csv and full.csv files: 74 CONFIDENTIAL Similar to the left.csv file, the full.csv file is the same, thus all three tables are the same because there are no null records. For an example, see Picture 4.3.1. The last way to connect two nodes is with minus. Here’s a tricky piece of code that selects only the dynamic links on the page http://www.news.com. 75 CONFIDENTIAL Of the 234 links on the page, only 9 appear to be dynamically generated. Using the manual table building code of Example 2.2.5, we can show how minus works with nodes without the web. 76 CONFIDENTIAL The node Out takes all records from node table1 and deletes any that contain the number 2 in the first column named myNums1. The result is written into the default viewer. 
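Because the tables of Example 2.2.5 exist only as IDE snapshots, here is a hedged sketch of the minus idea using small literal tables. The node-naming convention (name: select ...) and the use of union join to build the literal rows are assumptions; only the role of minus follows the description above.

table1:                          -- a small one-column table
  select 1 as myNums1 union join select 2 union join select 3 union join select 4
table2:                          -- the records to subtract
  select 2 as myNums2
Out:
  select myNums1 from table1
  minus
  select myNums2 from table2
-- Out keeps 1, 3, and 4: any record containing a 2 is deleted because
-- it also appears in table2.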
Now that we've seen how to connect nodes in various ways, we can string together long navigational threads that simulate clicking into a website. Navigation is covered in Chapter 3.

2.3 Aggregating Data by Column and Grouping (Aggregate Functions)

Let's go back to the identification of non-yahoo.com domain images on http://www.yahoo.com in Example 2.1.2. Suppose a count of such images is all that is needed rather than the URLs of the images. We will now introduce the most basic Aggregate Function: count. What makes a function an Aggregate Function is that it operates column-wise on a table rather than row-wise. count counts something column-wise. Depending on what Data Translator we choose to use, count will count different things. We were using the images Data Translator in Example 2.1.2, so image counting is what we'll do. So we see that 14 images are all that come from outside of yahoo.com.

What can be said of the uniqueness of these images? We'll investigate that in this next example, where a Pseudocolumn of the images Data Translator will appear both inside and outside of the Aggregate Function count in the selected fields. When this happens, every field not bound by an Aggregate Function must appear in a group by expression. Let's look at the code and output to try to understand what's going on: As we can see, only one image on http://www.yahoo.com from outside of yahoo.com is a repeat, and it's repeated only once.

Now that we have seen Aggregate Functions, we know we didn't quite get the whole story when the select statement was outlined in Chapter 2.0. Here is a heuristic sketch of a select statement that uses Aggregate Functions:

select <these fields>, <these aggregate fields>
into <this destination>
from <this Data Translator applied to…>
within <…this file/db/source>
where <this and that are true or false or matching or equal, etc. for these fields>
group by <these fields>
having <this and that true or false or matching or equal, etc. for these aggregate fields>

Taking the previous example of itemizing repeated images, we can now add an Aggregate Function filter with a having clause that gives us the non-yahoo.com domain images that are repeated at least once. We do indeed get the result that we anticipated: only one image fits such criteria. having is the third and final record-filtering technique. We now know how to filter twice with where and once with having, effectively all in one select statement.

Now that we've seen count used, what are the rest of the Aggregate Functions? Most Aggregate Functions are math-related. In addition to count, there are min, max, avg, sum, and stddev; we don't need to go over those. The last and possibly the best Aggregate Function is gather. gather turns a column into an array. There are sub-tables for Aggregate Functions and array functions in the WebQL Giant Function Table, which is Appendix V.

We will give 2 last examples to complete this lesson on Aggregate Functions. First, we will use all of the math Aggregate Functions besides count effectively; second, we will show how to quickly turn a column of data into an element of a row using gather. We will also add a slight curveball to the next example by applying the links Data Translator to not just one HTML source but 3. We do that by listing the sources in an array. The functions, Aggregate Functions, and grouping expression used execute once for each page load.
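The code listing for this example is a screenshot that is not reproduced in this transcription. As a rough sketch of its shape, following the heuristic select statement above (the three source URLs are illustrative, and the length function and the array-style listing of sources are assumptions not confirmed by the guide):

select SOURCE_URL, min(length(URL)), max(length(URL)), avg(length(URL)), sum(length(URL)), stddev(length(URL))
from links
within <an array of the 3 HTML sources>
group by SOURCE_URL

Because SOURCE_URL is the one selected field not bound by an Aggregate Function, it must appear in the group by expression, and each of the 3 page loads then produces exactly one summary record.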
If we are loading information out of a database or the web (which I should have you convinced by now is nothing more than a database), then the mathematical Aggregate Functions compute for each database. There also exist array versions of the functions: array_sum, array_avg, array_min, etc. The results of the actual example are quite impressive. Coming up with this code could take between 5 and 15 minutes, and it took 1 second to run. How many man-hours would it take if HTML source pages were loaded out of browsers and the URL lengths were counted and tallied by hand? Is the computer more accurate in this case than a human?

Moving on to the last example, we will use the Aggregate Function gather to pull out the 12th table cell that appears on www.yahoo.com. The clean function strips all HTML tags and eliminates redundant or outlying whitespace. Because all the Pseudocolumns (namely, item1) of the pattern Data Translator stay within Aggregate Functions (namely, gather), there is no need for a group by expression. Without question, Aggregate Functions are a powerful tool in selecting and reprocessing a data table. Aggregate Functions introduce grouping expressions that enable additional data filtering techniques such as a having filter. Aggregate Functions are crucial in the development of advanced applications such as the one depicted by Picture 3.4.6.

2.4 Inner-Outer Query Relationship

One of the most useful coding styles in WebQL separates code segments (or "subroutines") into separate files. These source code files, ending with .wqv, are called a view or an Inner Query. Inner Queries are useful because every field selected by the outer parent is global inside a view. This is the Global Field Theorem. A node can select any and every field selected by its outer parent. The selection of the outer parent is also called input. Suppose we want to keep a count of processes by batch. Notice that the node writeReport, out of nowhere, selects mybatches successfully. mybatches is nowhere in the parent of writeReport (namely, process1), but it is in the outer parent named batches. The unnamed node in Example 2.4.1 is selecting all fields from the default viewer of the inner query and putting them into the default viewer of the outer query, which is in the Execution Window.

Inner-outer relationships are separate queries, and we can create aliasing that is similar between them (for example, fetch1 could be the name of a node in both the inner and the outer query). Rewriting this example as two queries illustrates the trivial difference of using separate files vs. using parentheses in creating inner-outer query relationships. Example 2.4.2b is batches.wqv. There is no conflict of variable names in the node writeReport because the code specifies input.mybatches (or "my outer parent's mybatches") and batch1.mybatches, which is in the immediate parent of writeReport. The batch1 node on the left in Example 2.4.2a is clearly not being referenced in the second field of writeReport because writeReport is connected to the code on the left only through the input, which is the field mybatches of the node batches. Notice that Example 2.4.2a is stray code in an unsaved file while batches.wqv in Example 2.4.2b is saved. When calling a view, the last saved version is used. Make sure to save a view before running the outer file.
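Examples 2.4.1 and 2.4.2 are shown only as screenshots. As a rough sketch of the referencing rule they illustrate (the node-header style and the from clause are assumptions; the two field references are exactly as described above), the writeReport node distinguishes its two sources of mybatches like this:

writeReport:
select input.mybatches, batch1.mybatches
from batch1

input always refers to the selection of the outer parent, so even though both fields are called mybatches there is no conflict between the outer parent's field and the field of the immediate parent batch1.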
Later on, we will use Inner Queries to navigate down paths (effectively, "click") through websites and extract data. Inner Queries are great for depth-first crawling instead of breadth-first crawling. Suppose it takes 3 clicks to get to the price on a website for each of 3 different customer searches supplied via a CSV file. Sometimes, we must go 3 clicks deep for the first search, then 3 clicks deep for the second search, and so on (Picture 2.4.1), instead of going 1 click deep for all 3 searches, then 2 clicks deep for all 3 searches, and finally 3 clicks deep for all 3 (Picture 2.4.2). The reason is that our session established in Picture 2.4.2 could time out before the inbound flights are selected, especially if there are hundreds or thousands of input line items in the customer's CSV file. Sometimes it doesn't matter whether we use breadth-first or depth-first navigation, but when we need depth-first navigation, we look to the Inner Query, or view.

Picture 2.4.1: Depth-first navigation achieved by inner-outer query relationships

Picture 2.4.1 is ideal for when tens, hundreds, or thousands of flights are being fetched. To achieve this navigational order of fetches, inner-outer queries must be used for each of the 3 fetches. A detailed example of creating depth-first navigation is Example 3.5.6.

Picture 2.4.2: Breadth-first navigation achieved by writing all 3 fetches in the same query

For all the reasons stated, breadth-first navigation is not recommended.

2.5 Document Retrieval Options

The most powerful networking accessory commands are called Document Retrieval Options. These options apply to all document sources whether they are local or non-local; however, they are most useful when trying to access an HTML datasource from the internet. So far, we have used crawl of a URL to depth in Example 2.1.3. Imagine how many pages are loaded in a crawl to depth k. If the average page load has 50 links on it with a small standard deviation, then a crawl to depth k would be approximately 50^(k-1) clicks. We should be able to see how some sloppy code could cause all kinds of unnecessary internet traffic. crawl of can also be used in conjunction with following if for a simple way to continue clicking a next-20-results-type button. We can also use circular referencing with union joins to repeatedly click Next/Next/Next/Next to scrape all of the desired results, which will be discussed in detail in Chapter 3. One of our favorite examples below shows how following if works. We will learn to greatly appreciate being able to repeatedly click the next button in a one-node effort using following if rather than implementing a 3-node URL-trapping circular reference (see Picture 3.5.3.5 and Example 3.6.2).

Here we crawl for the first 30 Google results for our desired search terms. How easily could we read in a user-defined (or customer-defined) list of search terms from an input file instead? We could also use an input variable to crawl to a dynamic crawl depth. crawl of / following if is like a Data Translator that has Pseudocolumns (CONTAINER_URL, CRAWL_DEPTH, SOURCE_CONTENT, SOURCE_URL, URL, URL_CONTENT), but it's not a Data Translator. These Pseudofields can only be referenced in the section of the code with the crawl of phrase, after the statement following if. Checking various conditions on these Pseudofields can allow for targeted custom crawling, such as limiting the search results to the first 30 results.
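The Google example itself appears only as a screenshot. As a rough sketch of the shape of such a node (the search URL, the depth bound, the Regular Expression, and the exact placement of the crawl of and following if clauses are assumptions based only on the description above):

select URL, SOURCE_URL, CRAWL_DEPTH
from links
within crawl of 'http://www.google.com/search?q=webql'
following if CRAWL_DEPTH < 3 and URL matching 'start=(10|20)'

Because Google returns 10 results per page, only following the next-page links whose query strings contain start=10 or start=20 limits the crawl to roughly the first 30 results; a customer-supplied search term or a dynamic depth could be substituted from an input file.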
Document Retrieval Options are also powerful for dividing the SOURCE_CONTENT field of an HTML source into pieces known as chunks. Suppose we want to scrape Foreign Currency Yields on investments in Singapore. A page copied from a bank website is hosted on Geocities, so the page structure won't change after the publication of this document. Here is a view of the webpage from a later date so we have an appreciation for the data extraction task at hand:

Picture 2.5.0a: The first half of the page for Example 2.5.2

Picture 2.5.0b: The second half of the page for Example 2.5.2

Here's the start of the code for scraping the page. Go to the URL in the node doredirect2 in a browser if you want to see the text source: The first thing to notice is that the records that are going to be extracted from the page chunks also need to know what the minimum deposits are for the various currencies. This information is only presented once on a page that we want 9 chunks out of. We must extract the minimum deposit information before chunking the page with the convert using Document Retrieval Option. In conjunction with the function text_to_array, convert using changes the SOURCE_CONTENT Pseudocolumn into whatever we specify. In this case, SOURCE_CONTENT gets changed into an array of sub-sources, which increases the number of SOURCE_CONTENT records processed by the node.

The rest of the code is included below. It uses an Inner Query to process the chunkpage node's chunked source. Notice that the node mydata is writing to multiple destinations. That's not a problem; separate the destinations with a comma. mydata also makes use of the decode function and datetime_to_text operating on the global variable now, which is our current moment in Unix time. Notice that the pattern Data Translator p1 in the translate node uses multiple extraction buffers; we can have as many as we want. We are also making use of multiple pattern Data Translators with the .* end trick discussed earlier in Chapter 2.1. We know that the node translate will only produce one record per page chunk. translate selects the currency name and the source for rate extraction. Running the code produces all 7 rate categories for each of the 9 currencies presented, so 63 records are manufactured by the mydata node after every page chunk is processed. We should also make note of the technique of combining two columns of data from different tables side by side. The node assemble puts rate1 and rate2 side by side from different tables by joining the nodes rates1 and rates2 together and filtering the records based on equal RECORD_IDs (see also Example 3.5.4). decode and if are also used effectively in Example 2.5.2.

Picture 2.5.1: Execution Window of Example 2.5.2

Notice in the messaging how the pattern Data Translators hit the page chunks separately. The month fractions symbolize weeks, and some FCY data was not available when the page loaded, such as the Yen rates. This is a great application for harvesting data to feed a corporate database with its own number-crunching and report-making software. Sometimes WebQL is best used as a web crawler that harvests and cleans data for another company's database applications. We can produce those applications as well. In Example 2.5.2, the page of interest is actually triggered out of a javascript redirect.
Remember that WebQL doesn't do javascript, in a tradeoff for programming authority, so triggering this page is a multi-node process that requires manually cutting a javascript redirect and concatenating it onto a base URL. We see below that the Foreign Currency Fixed Deposits page is properly loaded after doing some manual work on the location.href redirect:

Picture 2.5.1.5: The Browser View of the SOURCE_CONTENT of Example 2.5.3

Next, let's take a look at the submitting Document Retrieval Option. submitting lets us submit values to forms and load the pages that get returned. Now we can submit searches on anybody's website and analyze the results. All we do is submit the city Tokyo into the appropriate form value. We can figure out that form value either by looking at the http://www.hotels.com page source (either through WebQL or View → Source in a browser) or by using the forms Data Translator with the select * trick discussed in Example 2.1.9. Notice how the string "Tokyo" could just as easily be a running list of input from a customer-generated CSV file.

Instead of submitting directly to a form by number, we can say

submitting values v1 for 'var1' if form_content matching myMatcher

Thus, if there exists something in the HTML form that is unique, we can identify our desired form submission by a substring match or Regular Expression match in the HTML of the form. This technique can be used to submit to one, some, or all forms on a page. myMatcher is a Regular Expression that we can make as robust or as parochial as we want. Similarly, a form has an action URL (or "target") that can be used to identify a form for submission.

submitting values v1 for 'var1' values v2 for 'var2' to form target myTarget

If we know the action URL of the form we are submitting to, which can be easily derived from the form analysis technique of Example 2.1.9, then we can just specify a substring of the action URL as a target, myTarget. myTarget is not an expression; however, myMatcher is. myTarget is a substring text literal. To give us an idea of how hectic some form submissions can be, take a look at this file from an advanced application:

Picture 2.5.2: Code from an advanced application exemplifying submitting

This code that hits Expedia does a great job of making use of all kinds of Document Retrieval Options. We see some that we know and some that we have not seen before. cache forms is a very powerful Document Retrieval Option. If we want to submit to a form dozens or hundreds of times, we don't want to reload that form every time we want to submit it. The idea is that we load a form once, then submit it 100 times rather than loading it 100 times and submitting it once for each load. cache forms is better exemplified in Example 2.1.3.5. Caching the forms in Picture 2.5.2 enables a node appearing later in the code to make use of all forms loaded in the node in Picture 2.5.2.

post is a major Document Retrieval Option. post lets us explicitly hit a URL that typically requires a form submission or a click by the user. In Picture 2.5.2, the explicit post statement is there to immediately trigger the advanced car search form on Expedia. unguarded is another Document Retrieval Option. WebQL by default will never reload the same data unless the Document Retrieval Option unguarded is used. If we must use unguarded, then there probably isn't optimal efficiency in the code.
If we load a page and pass the SOURCE_CONTENT down through several nodes, then why would we need to reload anything? unguarded is needed in this example because sometimes a customer's input can overlap with itself because of an input algorithm for this particular query. The convert using Document Retrieval Option is quite complicated in this example. Some searches for specific vehicle types also show other vehicles at the location; we don't want "other vehicles at this location" in our output report.

rewrite using is the last Document Retrieval Option that we will discuss. If we are using submitting, and for an undiagnosable reason the URL is getting things added on to it, or the wrong thing added on to it, then we can rewrite it just before the page is loaded. Suppose when we submit our form "&Back.x=1&Back.y=1" gets added to the URL when we need "&Next.x=1&Next.y=1" to be part of our URL destination. This trick will work:

rewrite URL using replace(URL, 'Back\.', 'Next.')

A similar maneuver is documented in Picture 2.5.2. Remember that the second argument to replace is a Regular Expression and the third argument is a string. If artifacts are showing up in the Post Data, then we can also rewrite it just before the page is loaded.

rewrite post_data using post_data || '&myVar1=myVal1&myVar2=myVal2'

Here, a couple of variables weren't showing up in the form submission, so we tacked them on to the end of the post_data string using the string concatenation operator ||. These kinds of rewrites are needed infrequently, but they make the process of debugging code a lot less annoying by enabling workarounds.

2.6 Read/Write Features

So far, we've loaded HTML internet pages and selected data out of them into the default viewer and/or CSV text files. WebQL has the ability to read almost any datasource and write to almost any file type, especially in the domain of spreadsheets and business applications. The syntax for reading from or writing to a database is

table@connection

There is an Access database file at http://www.geocities.com/ql2software/db1.mdb that contains an employee list named 'people' with a couple of errors in it. First, there are repeated employees; second, the names of the fields of the table (Name, Age, Job, Title) are duplicated as the first row in the table. We will save this file to our local machine (in the 'c:\' directory) and create a node that cleans the database up and outputs it into the default viewer. The db1.mdb file has 21 records in it, and we should be satisfied that we have fewer records in the default viewer after removing duplicates and the erroneous first row.

WebQL can do a couple of tricks writing data to a destination. The first trick is the HTML trick. We can output this employee list into an HTML table to give it color, shape, or style. Notice that we are using the same code as Example 2.6.1, but instead we are writing an HTML file header, then appending the employee list as HTML rows, and then appending the end of the HTML file. The HTML file looks like this:

Picture 2.6.1: HTML1.html created in Example 2.6.2

Another powerful writing technique allows XSLT transforms to be applied to XML. Using the same example as before, we can have our own stylesheet that looks better than Picture 2.6.1. Notice how the coding is easier in Example 2.6.3 than in Example 2.6.2.
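Examples 2.6.1 through 2.6.3 share the same cleanup select; only the destination treatment changes. Since the listings are screenshots not reproduced here, the shared query is sketched below. The select unique form, the quoting of the connection, and the <> comparison are assumptions; only the table@connection syntax is given explicitly above.

select unique Name, Age, Job, Title
from people@'c:\db1.mdb'
where Name <> 'Name'

select unique drops the repeated employees, and the where filter drops the erroneous first row whose values are just the field names. In Example 2.6.2 the destination becomes an HTML file assembled from a header, the rows, and a footer; in Example 2.6.3 it becomes an XML file with an XSLT stylesheet applied at write time.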
Studying XML and XSLT stylesheets enables us to make great-looking reports in WebQL without doing much more than calling those files at write time. The final mention on WebQL destination writing involves the WebQL built-in pivot table effect called pivot. Specify the word pivot at the end of the destination to make use of this Document Write Option. The third column of our data ends up getting spread as column headings, so the number of columns in a WebQL pivot table is 2 plus the number of unique items in column 3 of the select statement. The best way to get comfortable with the pivot option is to use it on a familiar dataset and look at the results. Try switching the columns around and see what kind of effect it has on the look of the data. Be aware that a pivot is not always a good idea. The HTML application developed in Picture 3.4.5 was created using the pivot write option.

2.7 Virtual Tables

In addition to the Pseudocolumns of the various Data Translators seen so far, WebQL has Virtual Tables that have their own Pseudocolumns. The purpose of Virtual Tables is to allow developers to reference lists of product features and capabilities within the language in an up-to-the-minute fashion. The above query reveals what WebQL Virtual Tables are available to select from. These universal Virtual Tables that can always be selected from begin with "WEBQL$". Now we can further dissect these tables. We see the 1000+ different types of file encoding understood by WebQL. We can use the select * trick on every Virtual Table to get the details summarized below.

Table 2.7.1: Virtual Tables always available as datasources

VIRTUAL TABLE         DESCRIPTION                      PSEUDOCOLUMNS
WEBQL$DATASOURCE      lists known data resources       NAME, DESCRIPTION
WEBQL$ENCODING        lists all known encodings        NAME, CANONICAL_NAME
WEBQL$LOCALE          lists all locale identifiers     ID, LANGUAGE, COUNTRY, VARIANT
WEBQL$OPTION          lists all server options         NAME, DEFAULT, VALUE
WEBQL$TIMEZONE        lists all known timezones        ID, BASE_OFFSET_HOUR, BASE_OFFSET_MINUTE, NAME, ABBREV, DST_NAME, DST_ABBREV
WEBQL$URL_SCHEME      lists all known protocols        NAME, USES_NETWORK, DEFAULT_PORT
WEBQL$VIRTUAL_TABLE   lists all known Virtual Tables   SCHEMA, NAME

An example of making use of a Virtual Table would be to look up all POP URL schemes known by WebQL. The results should satisfy any seasoned web programmer.

2.8 Sorting Data

There is one last piece of the select statement that we have yet to discuss. In addition to clicking and sorting data in the default viewer of the Execution Window, we can sort the data in a node with a sort by clause. Sorting happens after data filtering, so sort by should be at the end of a node. Sorting expressions can be creative and involve true/false. In Example 2.8.1, the list of numbers 1 through 10 is sorted by odds first, then evens. Now, we'll sort multiples of 4 in descending order, and then multiples of 3 in ascending order. We see how the multiples of 4 appear at the beginning and the multiples of 3 appear at the end. Sorting is not the most important feature because databases and programs like Excel usually have their own sorting mechanisms. When we need complicated sorting algorithms sequentially applied to a table of data, WebQL is a viable tool with even more flexibility than Excel or Access. Sorting causes a node to wait until all records in it have been generated before displaying or writing the results, because the results must be sorted.
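As a sketch of the kind of true/false sorting expression Example 2.8.1 uses (the mod operator, the trailing tie-breaker, and the ordering of false ahead of true are all assumptions, since the guide does not list WebQL's arithmetic operators or its boolean sort order):

select myNum
from <the table of the numbers 1 through 10>
sort by myNum mod 2 = 0, myNum

If the expression evaluates to false for the odd numbers and false sorts ahead of true, the odds come out first and the second sorting expression orders each half numerically; if the engine orders booleans the other way, negating the expression flips the result.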
This blocking behavior also means that sort by cannot exist as a part of a node in a circular loop. unique also cannot be used in a circular loop. We are (finally) through our instruction on coding basics in Chapter 2. We are moving on to navigation in Chapter 3.

CHAPTER 3: Navigating and Crawling the Web

3.0 The Navigational State

Navigating through the internet in WebQL can be tricky. To make learning easier, the idea of a Pseudocolumn 3-tuple called a Navigational State is defined. At any point in time on the internet, the page that we are looking at is primarily the result of 3 different things.

Picture 3.0.1: The Trevor J Buckingham Navigational State

It's true that the page that we are looking at through a browser could also be impacted by headers and cookies, but these three fields characterize "where we are and where we can go on the web" more than any other information. Is there anything interesting about these fields? They are all Pseudocolumns! We don't have to do anything fancy to get the information besides select the SOURCE_URL, SOURCE_POST_DATA, and SOURCE_CONTENT. The major theorem to understand here is that as we crawl the web in a parent node, a child node can always advance the navigation one way or another if it knows its parent's Navigational State. A given webpage is always triggered by a URL, and it might need Post Data as well. Some URLs load pages successfully without Post Data. Together, these two components of the Navigational State answer, "Where are we?" The HTML source enables us to answer, "Where can we go?" And if we know where we are and where we have the potential to go, then we are navigating; hence our realization of the Navigational State is the Pseudocolumn 3-tuple SOURCE_URL, SOURCE_POST_DATA, SOURCE_CONTENT.

Notice how, because GetLink1 knows the Navigational State of its parent GetNavState1, GetLink1 is capable of navigating anywhere we could click on the site with a mouse. GetLink1 chooses the 5th search result, and then fetches the associated Navigational State in its child node GetNavState2. As it turns out, we only need part of the Navigational State in GetNavState1 (the SOURCE_CONTENT) because the page selected by GetNavState1 is not required to provide a form in GetNavState2. The link selected in GetLink1 is a stand-alone URL without post data. See Picture 3.5.3.6 for a node similar to GetNavState2 where a form is needed from a paternal node. Which database is displayed in the default viewer of the Execution Window? Because the SOURCE_URL of GetNavState1 is http://www.yahoo.com, it must be GetNavState2. Because we didn't pick a destination for any of the data that we selected, we should not anticipate any particular data in the default viewer of the Execution Window. Further, we know it's not GetNavState1 in the default viewer because GetNavState1 has 2 records: one for the http://www.yahoo.com page fetch and one for the form submission.

3.1 Circular Navigation

Circular Navigation is an effective application of Circular Referencing. The concept of Circular Referencing is first illustrated below. The counter starts at zero and then runs in a circle until the field cnt reaches 10. To achieve the same result in 2 nodes, we can union join a node to itself recursively. Although WebQL allows for recursive nodes, it does not allow for recursive views. Perhaps the node called SECOND in the dataflow should have an arrow to itself.
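Both counter listings are screenshots. As a structural sketch of the two-node version (the node headers, the as aliasing, and the exact placement of the union join are assumptions based only on the descriptions above):

FIRST:
select 0 as cnt

SECOND:
select cnt + 1 as cnt
from FIRST union join SECOND
where cnt < 10

SECOND keeps feeding its own output back into itself until the where filter stops producing records, which is why the dataflow arguably deserves an arrow from SECOND back to itself.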
The circular constructs in the previous 2 examples work great for navigating. Sometimes the "next" button that shows the next page of results from a form submission is javascript-driven. Remember that WebQL does not understand javascript, a tradeoff that sacrifices convenience for programming power. Using pattern and/or upattern Data Translators to manually extract links, a 4-node loop can be crafted to elegantly crawl a site. Let's take Example 2.5.1 and, instead of using crawl of, use a 4-node loop to achieve the same effect. Notice that the data don't come out in the same order in the Execution Window below as when we used crawl of back in Example 2.5.1.

Picture 3.1.1: The Execution Window with Data Flow tab of Example 3.1.3

Although the loop appears to involve only 3 names, ManageCrawlGoogleA, ManageCrawlGoogleB, FirstCrawlGoogle, and ButFirstCrawlGoogle are all involved in the notion of Circular Navigating. FirstCrawlGoogle is the trigger node, and the other 3 run in a circle.

3.2 The Fetch

Now that we have experience crawling around the web and running in circles, we can ease ourselves into this realization.

Picture 3.2.1: The Diagram of a Fetch

A fetch is a URL (optionally with Post Data) and the resulting loaded page (these together are symbolized by the Navigational State) with all cookies, headers, and other Pseudocolumns. It's true that the headers contain the URL and the Post Data, but to easily select those values and move around the web, we need the URL and Post Data in the Navigational State. Many of the Pseudocolumns of Table 2.1.0 can be thought of as Pseudocolumns of a fetch. A "request" is the outgoing headers and cookies (which include the URL and Post Data); a "download" is a server's reply to our request; and together these concepts are a "fetch". Fetches and all network transmissions are debugged down to the character through a proxy that is viewable through the Network tab of the Execution Window.

Picture 3.2.2: A Header with URL and Post Data viewed in the Network tab

Not every smiley face comprises an entire fetch. A fetch can have child page loads, frame loads, image loads, etc. Again, be sure to make use of any and all of the Pseudocolumns outlined in Table 2.1.0 whenever we fetch a page. Some people confuse the notion of what a Navigational State is and what a fetch is. A Navigational State is a 3-tuple of Pseudocolumns. A child node can always advance navigating (clicking) through a website if it knows its parent's Navigational State. This is the Navigator Theorem of Web Crawling. We don't need to know everything about a fetch to advance clicking; we only need to know some or all of the Navigational State. Notice that a fetch isn't only 3 Pseudocolumns. A fetch is a request header (or series of request headers) with the corresponding page loads that allows a programmer a diagnostic view of the web through WebQL Pseudocolumns.

3.3 Debugging Form Submissions

Form submission is more tedious than difficult in WebQL, primarily because WebQL is not javascript-enabled. If a form submission is javascript-intense, the best way to submit the form is to mimic Internet Explorer. In WebQL, go to View → Network Monitor. Set the Network Monitor on a specific port, such as 211.
Picture 3.3.1: WebQL Studio IDE View → Network Monitor

Next, in Internet Explorer, go to Tools → Internet Options → Connections → LAN Settings → Use a Proxy (check) → Advanced.

Picture 3.3.2: LAN Settings in MSIE

Set the SOCKS proxy as "localhost" on port 211.

Picture 3.3.3: Advanced Proxy Settings in MSIE

We now have separate proxies set up for both the browser and the WebQL program. Recall that WebQL always has a proxy: the Network tab. The goal is to use Document Retrieval Options to make the WebQL proxy look enough like the browser's proxy that the page that gets loaded in a browser is the same page that gets loaded in WebQL. Again, this is more of a tedious task than anything else.

Let's say we want to scrape prices out of hotels.com for a given locale. The first thing we want to do is repeat Example 2.1.9, which uses the select * trick with the forms Data Translator on http://www.hotels.com. From that information, we can figure out what form to submit and which variables to submit along with it. Recalling Example 2.5.4, the form to submit is form 2 and the only required variable to submit is usertypedcity. If dates other than the default dates or occupants other than the default occupants are required, then those variables can be submitted in the same fashion as usertypedcity.

Picture 3.3.4: select * from http://www.hotels.com

Since Example 2.5.4 was written, http://www.hotels.com changed their site. From this page load, what should we do to load the page we want? If we go to http://www.hotels.com in a browser, how is the page different? Looking at the Network tab, here is how we formed the request in WebQL:

Picture 3.3.5: The Outgoing Request for hotels.com in WebQL

Setting the browser proxy and requesting www.hotels.com gives us:

Picture 3.3.6: The Outgoing Request for hotels.com in MSIE

Notice how there appear to be dozens more outgoing requests than what we saw when we tried to load the same page in WebQL. Those outgoing requests are triggered both by javascript and by HTML tags when the homepage is loaded. The page is successfully loaded in MSIE because MSIE is javascript-enabled. Based on Picture 3.3.4, we need to make http://www.hotels.com think that we are javascript-enabled even though we aren't. Looking at Pictures 3.3.4 and 3.3.6, what should we do? My guess is to add ?js=1 onto the URL. The variable &zz= might not be vital to the page load, so we might not need to worry about it. Let's see if that works…

Picture 3.3.7: select * from http://www.hotels.com?js=1

Sure enough, we get the page and form we want by adding the query ?js=1 onto the URL. Query in this sense means "the text after and including the ? in a URL" rather than a WebQL query. Applying the forms Data Translator to this page gives us: Notice how Example 3.3.1 differs from Example 2.1.9. Being able to quickly handle site updates makes us better WebQL programmers. Now, let's re-implement Example 2.5.4 to accommodate the http://www.hotels.com site change. Looking at the Browser View of the Execution Window, we have successfully re-implemented Example 2.5.4 to reflect the site changes to http://www.hotels.com. To improve the code even more, we can make the form submission robust by referencing a form action URL substring as a target rather than by the sequential number of the form on the page.
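As a sketch of that more robust submission (this is not the exact code in the screenshots; the target substring is a placeholder that would be read off the action URL reported by the forms analysis in Example 3.3.1):

select SOURCE_URL, SOURCE_CONTENT
from 'http://www.hotels.com?js=1'
submitting values 'Tokyo' for 'usertypedcity'
to form target '<substring of the form's action URL>'

If hotels.com later reorders the forms on its homepage, the submission still finds the right one, because the match is against the form's action URL rather than its position on the page.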
Notice that regardless of what form number we want, we submit the correct form based on a substring match of the form's action URL (or "target"). As we've seen in this chapter, navigating is a tedious task of debugging analysis more than anything complicated or difficult. WebQL is more flexible and better fit to crawl the web than any other web-harvesting product.

3.4 Looking Ahead to Advanced Programs

Since this version of the Guide excludes Chapter 4, a few screenshots of some advanced crawlers are included here. This content isn't exactly the same as what appears in Chapter 4.

Picture 3.4.1: The Data Flow of an Advanced WebQL Program

Picture 3.4.2: The *.csv Output of an Advanced WebQL Program (previous page)

Picture 3.4.3: Part of the source code of an Advanced WebQL Inner Query

Picture 3.4.4: Custom DHTML Output Done by an Advanced WebQL Program

Picture 3.4.5: Custom DHTML Output Done by an Advanced WebQL Program

Picture 3.4.6: Custom DHTML Output Done by an Advanced WebQL Program

Picture 3.4.7: Custom DHTML Output Done by an Advanced WebQL Program

3.5 Other Coding Techniques and Strategies

On top of the coding styles used so far, there are a few other tricks worth mentioning. We learned early on how to filter using where both between and within nodes. We can also use a pattern extractor twice in a single node by using the pattern Data Translator followed up by the extract_pattern(str1, expr1) function. Let's look at an example. Instead of writing a single overcomplicated pattern to extract the first link that appears in the <h2> tags, we can just grab what is between the <h2> tags with the pattern Data Translator and then match an anchor/source pattern via the extract_pattern function to pull out the URL. \b in Regular Expressions is a word border, which could be a white space, a comma, a period, etc. Sometimes writing two simple patterns is more effective than writing one complicated pattern. Sometimes we need two extremely complicated patterns to get the data that we are looking for. It turns out that we can just as easily use the extract_URL(str1) function to achieve the same result. Nevertheless, Example 3.5.1 suffices to show how multiple patterns can be used consecutively in a single node. We can also now imagine cases where we would compose a function call extract_URL(extract_pattern(str1, expr1)), which we could see as using 3 patterns consecutively.

Another common coding technique is to capture a Navigational State in a node and then branch off of it to trap errors. In addition to any network error or timeout, we must watch for site-triggered errors such as "No products matched your search." The best way to do this is to capture the Navigational State, then use join and output source join with expression checks to see if data were found or an error occurred. Remember that no matter how many child nodes collect data or trap errors, we can always add another child node to extract a link and advance clicking through the webpage.

Picture 3.5.1: A Segment of Code from an Inner Query

Clearly, we select a Navigational State in the node fetch1. We also match a pattern to see how many results come from our search, which uses the input fields searcher and method for form values. We include the empty Data Translator for the case when the pattern doesn't match.
We use the outer-parent (input) field myrec to debug the form page load in one branch, and then trap errors in three other branches. Later on in the code, we can union join all of the error branches into a single node that writes all errors to whatever destination we want.

Picture 3.5.2: Combining the error branches

Perhaps the most powerful coding technique is to branch off in all directions to load pricing/product information, union join those sources together, and then pattern-crunch all of the source pages simultaneously in one node.

Picture 3.5.3: Data scraping node getdata

The node getdata is output source joined to the SOURCE_CONTENT of the fetchmanager node. The node fetchmanager is a collection of HTML source pages captured by the various branches. A node that writes output can be joined to getdata. Suppose fetchmanager is part of a circular reference that submits a form in the first source selection and then sequentially clicks through the "next" button of results in the second source selection until all results are exhausted. This is the most effective coding structure for scraping an unknown number of results N > n, where n is the number of results displayed per page.

Picture 3.5.3.5: Diagram of circular navigation with a pattern-crunch

Sometimes the node for selecting second and later sources is also a form submission, in which case the forms must be cached in both source-selecting nodes in order to prevent repeated form loading. In this case, the node for selecting second and later sources uses both the post Document Retrieval Option and the submitting Document Retrieval Option. Below is a segment of code that shows an ordinary node for selecting second and later sources. This example is particularly good because it requires the addition of a Referrer URL in order to load the proper page. Also, random delays between 1 and 5 seconds are added before each fetch. Document Retrieval Options are discussed in Chapter 2.5 and can be used creatively with functions to enhance the navigational capabilities of WebQL.

Picture 3.5.3.6: Advancing navigation when the "next" button is a form submission

Recreating Picture 3.5.3.5 with appropriate node aliases gives us:

Picture 3.5.3.7: Sound node aliases for circular navigation with a pattern-crunch

This technique of pseudocoding circular navigational threads with node aliases makes us quicker and more effective WebQL programmers. Taking it one step further, node diagrams can be used to contrive view diagrams for massive programs that resemble the Data Flow in Picture 3.4.1 and the application set in Picture 4.1.0e. Finally, choosing node aliases carefully makes the process of updating or modifying another person's code much less tedious.

As computer programmers, we know that looping structures can sometimes be avoided by using arrays and array functions. each is a powerful function and works great for generating HTML. Suppose price_array is an array of prices and cheapest is the value of the lowest price in the array; we can quickly generate the proper number of HTML table cells for an HTML row and highlight the cheapest value:

'<tr>' || array_to_text(each(price_array, '<td class=price1>' || if(each_item = cheapest, '<font color=red>' || each_item || '</font>', each_item) || '</td>'), '') || '</tr>'

The next trick in Chapter 3 is also the most important for programs that crawl websites.
Using the WebQL Network Monitor (View → Network Monitor) as a proxy for a browser and the Network tab of the Execution Window as a proxy for WebQL (recall Pictures 3.3.6 and 3.3.5, respectively), we can do a side-by-side proxy-to-proxy comparison of all outgoing and incoming traffic.

Picture 3.5.4: Proxy-to-Proxy Comparison

Because browsers are javascript-enabled and WebQL isn't, the proxies often vary significantly. WebQL can do what the browser does by imitating the javascript, or it can work around the javascript. If we want the browser's effect in WebQL, we sometimes have to work to make it happen (see Chapter 3.6 for examples).

Moving on, the function diff_arrays(array1, array2) is an advanced technique that can be used to detect changes in a page over time or to report the differences between 2 arrays. The function each is also used in combination with diff_arrays.

Picture 3.5.5: file1.txt and file2.txt

Given file1.txt and file2.txt, we can use blank spaces to separate the words into arrays and then call the diff_arrays function. diff_arrays returns an array of 2-element arrays where the first element is either +, -, or NULL and the second element is a word from the above files. We can then call the each function and decode the first element as +, -, or NULL and color-code the second element accordingly. Notice that the length of the diff_arrays result is the size of words1 (10) plus the number of differences (2), totaling 12. Because the function each returns an array, array_to_text is used with a single white space as the array element delimiter. diff_arrays returns an array of 2-element arrays, so each_item of the each function is a 2-element array. Depending on what the first element is, the color of the second element is determined. The pages and lines Data Translators can be used to monitor/compare files by line or by page, rather than by word as in Example 3.5.3.

To better see how changes in an HTML page can be tracked, we will do a similar example for the diff_html function. In addition to diff_html, highlight_html will be used so that we as programmers feel abstracted from the underlying details of how these functions work. diff_html is a function that takes two HTML text files as arguments and returns an array of 2-element arrays, just like diff_arrays does. highlight_html adds the necessary HTML to give background color (argument 3) and text color (argument 2) to a given string of HTML text. highlight_html also puts a border around images that change. Let's look at the HTML sources with their corresponding browser view.

Picture 3.5.6: file1.html

The HTML files are quick and simple. They contain a table with a single non-terminated row tag with three columns containing text, an image, and text, respectively. Suppose that the text and images are dynamic and we wish to track the differences. Suppose that yesterday's website is file1.html and today's website is file2.html.

Picture 3.5.7: file2.html

These files are available for download at
http://www.geocities.com/ql2software/file1.html
http://www.geocities.com/ql2software/file2.html

Here is the code that will perform the diff_html: Notice how the similar text has no change to it, while the differences are colored based on what file they appear in, with a beige background. Also notice how the change in images is shown by a highlighted border around the images. Below is the final piece of the puzzle: the text view of the "diffed" file.
Picture 3.5.8: The computed difference of file1.html and file2.html (previous page)

We see how the highlight_html function adds background color and text color via the style HTML tag attribute. Clearly, the diff_html function in combination with highlight_html does a lot of grunt work with minimal effort from the programmer.

Changing pace, sometimes we have tables of data in two different nodes with the same number of records that we want to "paste" together side by side like this (recall Example 2.5.2):

Picture 3.5.9: The side-by-side combination of tables

If we join 2 tables of length 4, what do we know about the number of records in the child node, assuming there is no record filtering? The answer is 4*4=16, but we want a table that still has only 4 records in it. Suppose we have a table with names and ages in Node1 and job titles in Node2. All we have to do is add the RECORD_ID Pseudocolumn, and then filter where the RECORD_IDs are equal. The final table result is written into the default viewer and contains only the 4 records that we want, which is the equivalent of pasting the tables in Node1 and Node2 side by side.

Sometimes in WebQL we need to submit forms; some are easier than others. Let's look at an example of a more difficult form submission. Suppose we want to scrape Comcast's prices for their services in several different geographical areas. To begin that process, we should look at the page in a browser.

Picture 3.5.10: Comcast's homepage-localization-form

From our experience in a browser, the form seems easy, just like a login or search box. Naturally, the next step is to use the select * trick through the forms Data Translator.

Picture 3.5.11: Curveballs in form analysis

To our surprise, the forms Data Translator did not find any forms on the page. There could be any one of a number of reasons for that. We need to look at the page source (SOURCE_CONTENT) so we have more information.

Picture 3.5.12: More curveballs in form analysis

The first thing that we notice is that the SOURCE_URL is different from the URL that we specify in the code and different from the page loaded in the Messages tab of the Execution Window. This could be the result of a redirect or a refresh, but we aren't sure yet. Looking at the Browser View tab of the SOURCE_CONTENT field, we see that the page that our browser loads in Picture 3.5.10 is different from what WebQL is loading.

Picture 3.5.13: Getting the browser view of the unknown page in the Execution Window

Apparently, the page that is getting loaded doesn't have any forms on it, just a link to go back to the homepage. This could be the page that gets loaded when javascript is not enabled or when cookies aren't enabled, but we need further investigation (remember that WebQL is not javascript-enabled). Let's look at the Network tab and see if the proxy tells us anything.

Picture 3.5.14: Hunting through the Network tab of the Execution Window

After checking that our first outgoing request is consistent with our code, we look at the reply from Comcast. In the reply, we notice a refresh META tag that appears to be responsible for forwarding us on to the http://www.comcast.com/NoScript.html URL. We found the refresh META tag by doing a control+F find on "NoScript.html".
Immediately, we use the ignore refresh Document Retrieval Option to see if we get the right page:

Picture 3.5.15: Browser View tab of the Execution Window

Looking back at Picture 3.5.10 and Picture 3.5.15, the forms appear to match, along with the URL. We can go on with a form analysis of the page using the select * trick with the forms Data Translator:

Picture 3.5.16: Even more curveballs in form analysis

Now that we have the right form identified as form 2, we wonder why the form viewed through the browser has fields that don't match the form values. Also, the CONTROL_TYPE of each CONTROL_NAME is hidden instead of text. Our hypothesis is that javascript is somehow manipulating the input fields through a browser to work the forms with hidden variable CONTROL_TYPEs. What is great about WebQL is that we don't really need to fuss through all of the javascript and form-triggering mechanisms of the Localize.ashx page source to get the page we want; rather, we'll just analyze the behavior of Microsoft Internet Explorer (MSIE) through a proxy and try to copy the behavior.

The next step is to set a proxy on the browser window of Picture 3.5.10. In MSIE, go to Tools → Internet Options → Connections → LAN Settings → (check) Use a Proxy → Advanced. Set the SOCKS proxy to "localhost" on port 211. In WebQL, go to View → Network Monitor and set the port at 211. In MSIE, enter the address "2300 N Commonwealth Ave" in the Street address field, "3B" in the Apt. number field, and "60614" in the ZIP field and click continue. In addition to a new internet page being loaded in our browser, we see the proxy light up with traffic.

Picture 3.5.17: Getting to the next form

We see the URL change in our browser window, and we get another form that requires us to submit additional information. In particular, we must select our location as either "Chicago NW (Area 3)" or "Chicago (Area 1), IL". Let's look at a proxy to see exactly how this page was arrived at, because the CONTROL_NAMEs in form 2 in Picture 3.5.16 do not sufficiently complete all of the variables in the action URL in the browser window above, which is debugged through the Network Monitor proxy below:

Picture 3.5.18: Proxy analysis of moving from Picture 3.5.10 to Picture 3.5.17

The proxy capture in Picture 3.5.18 is in two pieces to show the entire query string on the action URL of the form submission (the "query string" of a URL is what comes after and including the "?"). Again, the form variables for form 2 in Picture 3.5.16 do not contain all of the variables showing up in the query string, so we have a couple of options as software developers:
1) Hack out the URL manually and skip every step, including loading the form.
2) Load and submit the form and rewrite the query string with the rewrite using Document Retrieval Option.
Let's go with option 1 to illustrate the skill set that we have developed as web programmers. Let's create the URL in Picture 3.5.17 explicitly and view the SOURCE_CONTENT.

Picture 3.5.19: Verifying that our URL-hacked form matches Picture 3.5.17

Sure enough, by looking at the Browser View tab of the SOURCE_CONTENT in the Execution Window, we see that we are successfully getting to our desired point of navigation by manually creating the URL that's pinpointed in the proxy in Picture 3.5.18.
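As a sketch of option 1 (the path and parameter names below are placeholders, since the real query string appears only in the proxy capture of Picture 3.5.18, which is not reproduced in this transcription; the point is simply that the URL is assembled with the || concatenation operator and fetched directly):

select SOURCE_URL, SOURCE_CONTENT
from 'http://www.comcast.com/<path to Localize.ashx>?<street parameter>=' ||
     replace('2300 N Commonwealth Ave', ' ', '+') ||
     '&<apartment parameter>=3B&<zip parameter>=60614'

replace is used here only to turn the spaces in the street address into a URL-safe form; the angle-bracketed pieces would be filled in from the proxy analysis.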
We can now do a form analysis of this navigational checkpoint, but remember that we didn't make much use of our last form analysis to get to this point of navigation because we proxy-hacked the URL. Before we do another form analysis, we will do a proxy analysis of selecting the "Chicago NW (Area 3)" radio button. Clear the history of our proxy by right-clicking the left half of the Network Monitor and selecting "clear history". Now, select the "Chicago NW (Area 3)" radio button in Picture 3.5.17 and watch the proxy light up. Here is the page in a browser:

Picture 3.5.20: Comcast.com localized homepage

Finally, we have arrived in a browser at the localized homepage. We see "60614 Chicago NW" in the upper right corner, so we know we have arrived at the right page. By looking at the page source using View → Source in the browser, we are able to match the page with one of the responses in the proxy analysis. In particular, the 19453-byte response from Comcast.com is the localized homepage, and our outgoing request to get the page is analyzed below:

Picture 3.5.21: Using a proxy to help load Picture 3.5.20 (previous page)

When we went to the page in Picture 3.5.17 in the process of localizing our browser, we were required to submit additional information in order to successfully load the localized homepage in Picture 3.5.20. Because that information (specifically, a radio button for "Chicago NW (Area 3)") in no way appears in the URL captured by the proxy (Picture 3.5.21) that triggered the localized page, hacking out that URL will probably not get us the page that we need. We are better off doing a form analysis of the form in Picture 3.5.17.

Picture 3.5.22: Form analysis of the radio button in Picture 3.5.17

Looking at the default viewer of the Execution Window, ZipCodeFranchiseMapID is the value that we need to submit to advance to the localized homepage. Our strategy will be to extract the ZipCodeFranchiseMapIDs with a pattern and then submit them individually to form 3. Because the addresses that we are ultimately going to look up are going to come from a giant input file, let's create an input handler for comma-separated input. Putting the code together arrives at:

Picture 3.5.23: The first part of the code to Example 3.5.6

The strategy for the code in Example 3.5.6 comes from the need to create a depth-first navigational scheme (recall Picture 2.4.1). The first task is to fetch the input data from the input.csv file. Because each input must be processed independently, we create an inner-outer query relationship. We then use the pattern Data Translator to extract the ZipCodeFranchiseMapIDs, and because the ZipCodeFranchiseMapIDs must be processed independently, we create another inner-outer query relationship. Sometimes the first address that we submit sufficiently loads the localized homepage (and thus the ZipCodeFranchiseMapID is null), which explains the node outer2a. We will be able to better appreciate inner-outer relationships after this example and Examples 3.6.1 and 3.6.2.
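Structurally, Example 3.5.6 nests one inner query inside another so that each address, and then each ZipCodeFranchiseMapID, is processed one at a time. A skeleton of that shape is sketched below; everything in angle brackets is a placeholder (the full listing is in Picture 3.5.23), but the nesting of the parenthesized inner queries follows the description above:

<read one line of comma-separated input from input.csv>

(
  <fetch the Comcast page with ignore refresh and submit the address fields from input>

  <extract the ZipCodeFranchiseMapIDs with a pattern Data Translator>

  (
    <submit input.ZipCodeFranchiseMapID to form 3 and select the localized homepage>
  )
)

Each parenthesized block is an inner query: the outer one runs once per input address, and the inner one runs once per extracted ZipCodeFranchiseMapID, which is exactly the depth-first ordering of Picture 2.4.1.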
Running the code in Example 3.5.6 with the following input:

Picture 3.5.24: The input file to Example 3.5.6

produces the following Execution Window:

Picture 3.5.25: The Execution Window to Example 3.5.6

We see above that the Browser View tab of the Execution Window to Example 3.5.6 appears to be properly loading the various localized homepages, so that the special offers and prices for different geographical locations can be scraped for an analyst. Clearly, submitting a form to get to a localized homepage (such as this example of Comcast.com in Picture 3.5.10) can require meticulous debugging.

3.6 Learning by Example

Now that we have seen the theory behind navigation and circular navigation, we can apply this knowledge to develop a web spider. The example that we will discuss allows us to search for products at http://www.techdepot.com and report the prices. Of course, websites that sell products change on a regular basis, so this walkthrough could be out of date by the time you read it; however, it will still be a useful tool for developing a strategy for scraping information out of any website. The first thing we must do is analyze the form from the techdepot homepage.

Picture 3.6.1: The form analysis of http://www.techdepot.com

We see that there are 2 forms. Form 1 appears to be a search box that we can submit keywords to, and the second form appears to be an email address submission for an email list that we can subscribe to. Form 1 is the form we want to submit values to for the variable Keyword. The next step is to create an input list of keywords to be submitted in the form submission. We will use a basic inner-outer relationship so that the input records (keywords) are processed independently and can be freely selected throughout the Inner Query that does the form submission.

Picture 3.6.2: Creating the inner-outer input relationship

We should interpret the code in Picture 3.6.2 as being a query with a single node fetch1 such that the outer parent of the query contains the search inputs, which are "Dell 720" and "Sony Vaio sz." Remember that inside of the parentheses, every field selected in the outer parent is global. Thus, any node inside of the parentheses knows about the field myinput, and only one value of myinput will be processed by the inner query at a time. Instead of submitting to the FORM_ID, we chose to submit to the form target, which is a substring match of the FORM_URL, sometimes called an action URL. The only variable that appears to be essential for a proper form submission is Keyword. Looking at the code in Picture 3.6.2, we see that the default viewer of the outer query happens to be the default viewer of the inner query. The inner query writes the Navigational State, along with any errors, into its default viewer, and outer1 selects all of those fields into its own default viewer, which is what we see in the Execution Window below:

Picture 3.6.3: Getting past the first form submission

In the browser view of the SC (SOURCE_CONTENT) field, we see a message stating that results have been found. Next, we need to create a navigation scheme that will click the "next button" until all pages of results are exhausted. As it turns out, the search for "Dell 720" has only 1 page of results whereas the search for "Sony Vaio sz" has 4 pages of results. For this particular website, clicking for the next page of results involves javascript.
We will have to simulate the javascript in WebQL so that we can reach the next page of results. The javascript function NextPage is a form submission that advances us to whatever page we specify as its argument. We know this from looking at the function in the source code, which can be seen through the Text View tab of the Execution Window. Looking through the source code of the first results page of the “Sony Vaio sz” search shows us the javascript function called NextPage:

function NextPage(iPage){
document.frmNav.Page.value=iPage
document.frmNav.searchtype.value='nav'
document.frmNav.submit()
}

Given this function, our strategy will be to identify the maximum results page number based upon the page numbers towards the bottom right corner of the website, and then submit the frmNav form over and over for each page number.

Picture 3.6.4: Results pages with a black double-arrow

The search “HP” came up with 200 pages of results. Determining what this last page is can be tricky because the soft grey vs. black double-arrow is different for searches that have 5 or fewer pages of results:

Picture 3.6.5: Results pages with a grey double-arrow

After looking at the source code, we need to write a pattern robust enough to handle either case. Here are the two relevant excerpts of code:

(Picture 3.6.4) <a href="javascript:NextPage(200)">200</a></td><td class="td"><a href="javascript:NextPage(6)">>></a></td></tr></table>

(Picture 3.6.5) <a href="javascript:NextPage(4)">4</a></td><td class="td"><font color="#aaaaaa">>></font></td></tr></table>

Our goal is to create a pattern Data Translator that will extract the number 200 in the first case (Picture 3.6.4) and the number 4 in the second case (Picture 3.6.5).

Picture 3.6.6: Getting the pattern Data Translator correct

We see that our choice of pattern works in the output of the default viewer:

Picture 3.6.7: Pages 2-200 for the search “HP”

We have successfully created the arrays that we need to scrape out the rest of the results to our searches. We will remove the “HP” search for now because if we can crawl 3 additional pages of results in the “Sony Vaio sz” search, we will assume that the mechanism will work for n additional pages of results. The next step in the process of developing this web spider is to do a form analysis of the first page of results. The best way to do this is to search for “Sony Vaio sz” through a browser, then do a View→Source to see the source code in a text editor. Save the text file as c:\test.html. Now use the select * trick through the forms Data Translator to learn about the forms on c:\test.html.

Picture 3.6.8: Form analysis of first page of search results

Looking through all of the form variables for forms 2 and 4, we are not sure which form the NextPage function works off of. We know that the name of the form is frmNav, but we don’t know its number until we do further investigation.

Picture 3.6.9: Form analysis of first page of search results

The underlying uncertainty of which form to submit stems from the fact that both form 2 and form 4 contain control variables consistent with those in the NextPage javascript function, namely Page and searchtype. The next move we should make is to do a text search for “<form ” through the source code of the first page of results to check each form name and see which one is frmNav.
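In WebQL the extraction happens in the pattern Data Translator shown in Picture 3.6.6, but the Regular Expression idea can be tested on the two excerpts in plain Python. One pattern that handles both cases is to match only the NextPage links whose visible text repeats the page argument (the double-arrow link never does) and then take the largest number; this is a sketch of one workable pattern, not necessarily the exact one used in Picture 3.6.6.

import re

def last_page(html: str) -> int:
    # Links like <a href="javascript:NextPage(200)">200</a> repeat the page number as their
    # visible text; the ">>" double-arrow link does not, so the backreference excludes it.
    pages = re.findall(r'NextPage\((\d+)\)">\1<', html)
    return max(int(p) for p in pages) if pages else 1

black = '<a href="javascript:NextPage(200)">200</a></td><td class="td"><a href="javascript:NextPage(6)">>></a></td></tr></table>'
grey = '<a href="javascript:NextPage(4)">4</a></td><td class="td"><font color="#aaaaaa">>></font></td></tr></table>'
print(last_page(black), last_page(grey))   # prints: 200 4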
Picture 3.6.10: The second match on the page for the “<form ” search

Using the find feature of our text editor, we see that the form that we want to submit is the second form, not the fourth. Looking back at Picture 3.6.8, we notice that the searchtype form variable already has the value “nav”, so we should only need to explicitly submit the Page variable. If we have all of the page variables that we need in an array, there is no need for a circular loop, so our fetchmanager will union join fetch1 and fetch2 without looping. Let’s see if it works:

Picture 3.6.11: Writing the second fetch and moving the fetchmanager

Picture 3.6.12: Verifying that we got all results

We indeed get every page that we want because of the second fetch node fetch2. There is one record for the “Dell 720” search and four records for the “Sony Vaio sz” search. We are able to get the pages we want (and only the pages we want) without having to write a circular reference. The next two steps are to write an error report and to scrape out the prices, product descriptions, product numbers, and any other information we want about each product, which can be called a pattern-crunch. To write the error report, we join to the node outer1 where the URLERROR field is not null. We will write the input search string, the error, and a timestamp into a CSV file called errors.csv. To handle the prices and other information we want to extract, we will output source join to the SC (SOURCE_CONTENT) field of outer1 and write a series of Regular Expressions to match and extract the targeted information. We will also use the convert using Document Retrieval Option to chunk the page moments before we pattern-crunch it. Here’s all of the code to complete the example:

Picture 3.6.13: The first part of Example 3.6.1

Every product listing is an HTML table row that begins with “<td class="product">”, which will serve as our chunking converter. Although the product listings fit tightly in a browser, the HTML source code between two product listings is extensive.
Below is the HTML source code of a product listing from its beginning up until the next product: <td class="product"><a href="http://www.TechDepot.com/product.asp?productid=4523329&ii d=1250&Hits=2000&HKeyword=HP">HP Consumer - nx6310 - Core Solo T1300 1.66 GHz - 15</a></td> <td nowrap rowspan="2" align="right" valign="top" class="price"> $892.95<IMG src=/Assets/images/clear.gif height=1 width=1></td> </tr> <tr> <td colspan="2"><img src="/Assets/images/clear.gif" alt="" width="9" height="5" border="0"></td> </tr> <tr> <td colspan="2" rowspan="2" valign="top"> <table width="365" border="0" cellspacing="0" cellpadding="0"> <tr> <td colspan="2" class="bullet">• Intel Core Solo T1300 <td rowspan="3" align="right" valign="bottom" class="bullet">Platform: PC</td></tr><tr><td colspan="2" class="bullet"> • 512 MB / 60 GB / 1.66 GHz </tr> <tr><td colspan="2" class="bullet"> • Windows XP Professional </tr> <tr> <td colspan="3"><img src="/Assets/images/clear.gif" alt="" width="9" height="3" border="0"></td> </tr> <tr><td colspan="3"><table border="0" cellspacing="0" cellpadding="0"><tr> 201 CONFIDENTIAL <td><table width="289" border="0" cellspacing="0" cellpadding="0"> <tr> <td class="td" align="left" valign="top">sku#S5523329 </td> <td class="td" align="left" valign="top">mfr#PZ903UA#ABA</td> <td class="td" align="right" valign="top"><img src="/Assets/images/Results_stock_Check.gif" alt="" width="12" height="13" border="0"> In-stock</td></td> <td valign="top"><img src="/Assets/images/clear.gif" alt="" width="5" height="1" border="0"></td></tr> <tr> <td><img src="/Assets/images/clear.gif" alt="" width="85" height="1" border="0"></td> <td><img src="/Assets/images/clear.gif" alt="" width="122" height="1" border="0"></td> <td colspan="2"><img src="/Assets/images/clear.gif" alt="" width="90" height="1" border="0"></td> </tr> </table></td> <td align="right" valign="top"><a href="Javascript:GotoProd(4523329,1250)"><img onClick="Javascript:GotoProd(4523329,1250)" src="/Assets/images/Results_details_box.gif" alt="" width="71" height="15" border="0"></a></td> </tr></table></td></tr> </table></td> </tr> </table></td> </tr> <tr> <td colspan="3" align="right"><img src="/Assets/images/ccccccpixel.gif" alt="" width="384" height="1" vspace="10" border="0"></td> </tr> </table></td> </tr> <TR><TD><table width="464" border="0" cellspacing="0" cellpadding="0"> <tr> <td><input type="checkbox" name="compare" value="4119381"></td> <td> <a href="http://www.TechDepot.com/product.asp?productid=4119381&ii d=1250&Hits=2000&HKeyword=HP"><img 202 CONFIDENTIAL src="http://images.techdepot.com/comassets/productsmall/CNET/I40 9904.jpg" alt="" width="70" height="70" border="0"></a></td> <td> <table width="369" border="0" cellspacing="0" cellpadding="0"> <tr> <td class="product"> Here is the browser view of the same product listing: Picture 3.6.13.5: A techdepot product listing in a browser It appears that the first anchor tag after our chunking converter provides us with the product title. Extracting it is derived in Data Translator p1 seen in the rest of the code to Example 3.6.1 below. Included also are Data Translators p2, p3, and p4, which extract the price, sku part number, and manufacturer’s part number with shipping information, respectively. Error reporting is implemented, as well. 
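For readers who want to see the chunk-then-extract idea outside of WebQL, here is a minimal Python sketch written against the excerpt above: split the page on the “<td class="product">” chunking converter, then pull the title, price, sku, and manufacturer part number out of each chunk. The Regular Expressions below are illustrative guesses fitted to this one listing; the real work is done by the convert using option and the Data Translators p1 through p4 in Example 3.6.1.

import re

def crunch(source_content: str):
    rows = []
    # Drop everything before the first listing; each remaining chunk starts one product.
    for chunk in source_content.split('<td class="product">')[1:]:
        title = re.search(r'<a href="[^"]*">([^<]+)</a>', chunk)   # first anchor after the converter
        price = re.search(r'class="price">\s*\$([\d,.]+)', chunk)
        sku = re.search(r'sku#([^<\s]+)', chunk)
        mfr = re.search(r'mfr#([^<\s]+)', chunk)
        rows.append({
            "product": title.group(1).strip() if title else None,
            "price": price.group(1).replace(",", "") if price else None,
            "sku": sku.group(1) if sku else None,
            "mfr": mfr.group(1) if mfr else None,
        })
    return rows

Applied to the listing above, this yields the title “HP Consumer - nx6310 - Core Solo T1300 1.66 GHz - 15”, the price 892.95, sku S5523329, and mfr PZ903UA#ABA, which is the same information the Data Translators target.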
Here’s the output that we get in the Execution Window:

Picture 3.6.14: The Execution Window of Example 3.6.1

We are happy to see that the data comes out clean and ready to be dumped into a database or report-making computer program. Although it takes time and money to develop this kind of code, it takes a lot more time and a lot more money to pay a person to click through the site and manually write down the numbers and/or type them into Excel.

Let’s move on to another example by writing a similar crawler for http://www.superwarehouse.com. Again, we start by debugging the forms on the homepage:

Picture 3.6.15: Form debugging on the homepage of Superwarehouse

We see that WebQL finds more than one form such that FORM_ID = 1. Even if this is something that we’ve never seen before, we can work around the impediment by using a form target for form submission as we do in Example 3.6.1. The variable keyword is the only form variable that we need, so we are ready to create an inner-outer query relationship with input from the user. We will make the input a CSV file rather than an array to create variety.

Picture 3.6.16: Creating an input processing node and an input file

Running this code creates the following Execution Window:

Picture 3.6.17: Default viewer of Navigational States of search results

Because of the way that 9 source files appear for 3 fetches (5 in the first fetch, 2 in the second fetch, and 2 in the third fetch), we should decide if using the ignore children Document Retrieval Option is appropriate. Investigating the source files suggests that it is, because the child pages of each fetch are advertisements that we don’t care about. Inserting the ignore children Document Retrieval Option and rerunning the code creates the following Execution Window:

Picture 3.6.18: Default viewer of Navigational States of search results

We now have fetch1 working the way we want because there are no child pages being loaded, and we have 3 SOURCE_CONTENT records: 1 for the form load, 1 for the “HP Laserjet” search, and 1 for the “12A5950” search. The next step is to pass the second and third records of fetch1 into a fetchmanager that ultimately holds all of the results pages. To get all of the results pages, we must investigate the “Next >>” link of the “HP Laserjet” search and figure out how to pull out the rest of the results. We can do this by looking at the Text View tab of the SC (SOURCE_CONTENT) field. Using the Ctrl+F text finder, we find out that clicking the “Next >>” link is a javascript form submission.

<a href="javascript:nextResults('21', '40', '', '' );"

We will use this text (with Regular Expressions for the digits) as a filter for the getNextLink node from Picture 3.5.3.7. This is a good filter because the function call does not happen in the search results page source for a single page of results like “12A5950”. The function exists in the page source, but it is never called.
Let’s take a look at the nextResults function:

<script language="JavaScript">
/* submit to next result set */
function nextResults (NextStart, NextEnd, odrBy, srtTp) {
var theForm = document.search;
var jsOdrBy; jsOdrBy=odrBy;
var jsSrtTp; jsSrtTp=srtTp;
if ((jsOdrBy != '')&&(jsSrtTp != ''))
theForm.action = theForm.action+'&resultSet=nextset&Start='+NextStart+'&End='+NextEnd+'&ybRdroVan='+jsOdrBy+'&ybTrsVan='+jsSrtTp;
else
theForm.action = theForm.action+'&resultSet=nextset&Start='+NextStart+'&End='+NextEnd;
theForm.submit();
}
function swapIMG(oImg) { //se script.js T.
if (oImg.width!=68) { //max size
oImg.width=oImg.width;
} else {
oImg.width=68;
}
}
</script>

Because clicking the “Next >>” link causes a form submission, our next best move is to do a form analysis of the first results page. The best way to do that is to copy the text out of the SC (row 2) cell of Picture 3.6.18 and paste it into a text editor. Save the file as c:\test.html and use the select * trick with the forms Data Translator.

Picture 3.6.19: Form analysis of the first page of search results

We now have the form debugged, but given the function nextResults, we still aren’t exactly sure which form is being submitted. Recall Picture 3.5.4. We need to set up a proxy for a browser to debug the form submission when we click the “Next >>” link. The first thing we should do is go to the http://www.superwarehouse.com website and submit a search for “HP Laserjet”. In WebQL, we will use View→Network Monitor and set whatever port we want. We must set the same port in Microsoft Internet Explorer by going to Tools→Internet Options→Connections→LAN Settings→(check) Use a Proxy→Advanced. Insert the name “localhost” for the SOCKS proxy on the same port and click OK. Finally, click the “Next >>” button in MSIE and watch for a POST statement in the proxy.

Picture 3.6.20: Proxy analysis of a javascript form submission

Looking at the circled POST variables (left circle), the form that gets submitted when we click the “Next >>” link is form 1; however, because the action URL manufactured by nextResults works for form 1 or form 3, submitting form 3 might work, too. The right circle shows what URL is manufactured by the javascript function nextResults. We know how to submit POST variables using the submitting Document Retrieval Option, and we know how to extend a URL using the rewrite using Document Retrieval Option. Pulling it all together gives us code like this:

Picture 3.6.21: WebQL spider with a circular referencing fetchmanager

As it turns out, submitting form 3 instead of form 1 still works because, ultimately, in their system manufacturing the proper action URL matters more than which POST variables get submitted. We now have all 21 results pages for the “HP Laserjet” search (20 results per page; 419 total results) and we have 1 results page for the “12A5950” search. Here is the Execution Window:

Picture 3.6.22: Navigational States of the node fetchmanager

We can look at the Text View tab of any one of the SC (SOURCE_CONTENT) fields and begin the process of extracting the data with a pattern-crunching node. The pattern-crunching node is output source joined to the SC field of the fetchmanager. The pattern-crunching node will convert each SC field into as many chunks as there are results on the page plus 1 (16 results create 17 chunks).
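To make the simulation concrete, here is a small Python sketch of what submitting nextResults over and over amounts to: manufacture the same kind of action URL (appending resultSet, Start, and End, 20 results at a time, as in the call nextResults('21', '40', ...)) and POST the form for each page. The base action URL and POST variables below are placeholders; in the real query they come from the form analysis in Picture 3.6.19 and the proxy analysis in Picture 3.6.20.

import requests

# Placeholder action URL; the real one comes from the form/proxy analysis.
base_action = "http://www.superwarehouse.com/search.cfm?keyword=HP+Laserjet"
form_data = {}          # the real POST variables are the circled values in Picture 3.6.20
total_results, page_size = 419, 20

pages = []
for start in range(21, total_results + 1, page_size):        # 21, 41, 61, ... like nextResults('21','40',...)
    end = start + page_size - 1
    action = base_action + "&resultSet=nextset&Start=%d&End=%d" % (start, end)
    pages.append(requests.post(action, data=form_data).text)

With 419 total results and 20 per page, the loop issues 20 additional fetches, which together with the first results page accounts for the 21 pages noted above.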
We will also combine all of the 216 CONFIDENTIAL error traps into an error report. Hunting through the SC field suggests that a particular table cell tag should be used as a chunking delimiter: <td align="center" valign="top" style="padding-top:3px; paddingleft:3px;padding-bottom:3px;padding-right:3px;" class="border_rtBttm"> The part ‘class=”border_rtBttm”’ appears only in this table cell tag that starts the listing of a new product. After searching for how many times ‘class=”border_rtBttm”’ appears in the SC field, it will suffice as our chunking converter. The remainder of Regular Expressions can easily be derived now that the page is chunked. Looking at the unabridged HTML source code of a product listing relative to our chunking converter, we have: <td align="center" valign="top" style="padding-top:3px; paddingleft:3px;padding-bottom:3px;padding-right:3px;" class="border_rtBttm"> <a href="http://www.superwarehouse.com/HP_1,500_Sheet_Feeder_for_ LaserJet_4200_and_4300_Series/Q2444B/p/435756"> <img onload="swapIMG(this)" src="http://www.superwarehouse.com/images/products/hp4250feeder _thn.jpg" border="0" align="top" alt="HP 1,500 Sheet Feeder for LaserJet 4200 and 4300 Series" class="thn"> 217 CONFIDENTIAL </a> </td> <td class="border_bottom" valign="top" style="padding-top:3px; padding-left:3px;padding-bottom:3px;padding-right:3px;"><div class="swhtxt2"><a href="http://www.superwarehouse.com/HP_1,500_Sheet_Feeder_for_ LaserJet_4200_and_4300_Series/p/435756" class="boxLink2">HP 1,500 Sheet Feeder for LaserJet 4200 and 4300 Series</a></div></td> <td valign="top" class="border_bottom" style="padding-top:3px; padding-left:3px;padding-bottom:3px;padding-right:3px;"><div class="swhtxt2">Q2444B</div></td> <td valign="top" class="border_bottom" style="padding-top:3px; padding-left:3px;padding-bottom:3px;padding-right:3px;"><div class="OrangeTxtSMbld">$478.99</div></td> <td valign="top" class="border_bottom" style="padding-top:3px; padding-left:3px;padding-bottom:3px;padding-right:3px;"><div class="swhtxt2">In Stock</div></td> <td valign="top" class="border_bottom" style="padding-top:3px; padding-left:3px;padding-bottom:3px;padding-right:3px;"><a href="http://cart.superwarehouse.com/index.cfm?fuseaction=cart.add &productID=435756"><img src="images/bbuynow.gif" width="50" height="18" border="0"></a></td> <td align="left" width="71"> </td> </tr> <tr> 218 CONFIDENTIAL <td align="center" valign="top" style="padding-top:3px; paddingleft:3px;padding-bottom:3px;padding-right:3px;" class="border_rtBttm"> Here is the browser perspective of the code: Picture 3.6.22.5: Browser view of a superwarehouse product listing Seeing that every item of interest has an </div> tag after it, we can construct one pattern that wild cards everything between the </div> tags and cleans the data. Implementing the Regular Expression and adding the error report completes the code for Example 3.6.2: 219 CONFIDENTIAL Picture 3.6.23: The first of half the code to Example 3.6.2 220 CONFIDENTIAL 221 CONFIDENTIAL Example 3.6.2 extracts all prices for a given search string stipulated in the in.csv file. For our inputs, we get 400+ results for “HP Laserjet” and 1 result for “12A5950”. Looking at the accurate results in the Execution Window, we see the value in WebQL because employing a person to click and copy all of these prices, product names, item codes, etc. 
out of a browser must cost more than having one of our web spiders scrape out the information:

Picture 3.6.24: The Execution Window for the given input of Example 3.6.2

An analysis of crawling deeper into the site for more specific information shows little improvement in the data set. At the current depth of our crawl, we are able to extract the price, product title, item code, and availability. Crawling one level deeper creates (for our given input) over 400 additional fetches, whereas the current version of the query makes only 23 fetches (1 for the homepage, 1 for the “12A5950” search, and 21+ for the “HP Laserjet” search). Is it worth increasing the number of fetches 20-fold for the additional information that is available to be extracted? Looking in a browser at the additional information besides the product name, item code, price, and availability, we see an item summary and some specs.

Picture 3.6.25: Deciding whether or not to go deeper

Upon further investigation, the summary and features aren’t essential to a pricing application because the analyst knows what the product is based on the product title, and most of the specs are in the product title. At an additional level of depth, the information appears to be too redundant to justify increasing the number of fetches made by the query by a factor of 20. This concludes Example 3.6.2. Clearly, there are several steps to writing a web spider. The process is roughly the same for any site regardless of whether or not we use circular navigation. The heuristic explanation is to submit forms and click until we get to results pages. We use those results pages to get additional results pages, which may or may not require circular navigation. Finally, we pattern-crunch and error report on all results pages, creating output and error files that are easily customizable for market demands.

CHAPTER 4: Developing Applications for the QL2 Client Center

4.0 QL2 Client Center Introduction

The QL2 Client Center is a storage and processing powerhouse co-located with InterNap in the KIRO/Fischer building of downtown Seattle, Washington. In 2004, the system rotated 13 processing servers with a redundant 2 terabytes of disk. The QL2 Client Center is built and managed by Dave Haube. QL2 Software is a business that both produces and sells software. The products that QL2 offers are the license of WebQL Studio and/or the license of custom queries developed with WebQL Studio, ready to run on the Red Hat Linux-driven QL2 Client Center.

Picture 4.0.0: QL2 business sketch

In 2005, the vast majority of revenue came from building custom applications and hosting them on the QL2 Client Center. As an example, let’s look at an application set on the QL2 Client Center that a customer deals with on a regularly scheduled basis. Go to http://client.QL2.com. The QL2 Client Center is a java application that generates a website and manages large-scale data extraction tasks for customers. Log in as ‘demo’ with ‘demo’ as the password. We can click around the Client Center as we please and use the question marks for help and explanations. In the top right area of the screen, click on the “Cars” tab. In the row of tabs that appears below it, click on the “Results” tab.
Picture 4.0.1: A screenshot of the carrental results tab on the QL2 Client Center

We are going to look at a rental car pricing output data set that was pulled out of multiple travel sites by multiple WebQL queries cued by the Client Center. Click on the rcout.csv output file or the rctables.htm file associated with any of the customer data collection runs. Notice how much data was ripped out of the sites in real-time, formatted, HTML-reprocessed, and delivered to the customer in a matter of a couple of hours.

4.1 Navigate and Branch

The coding style used by most applications on the QL2 Client Center is “Navigate and Branch.” Think of the code (also called a “web spider”) crawling from one Navigational State to the next, all the while branching right and left to extract data and/or trap errors along the way.

Picture 4.1.0a: Node-by-Node Diagram of “Navigate and Branch”

Notice how the backbone of this navigation schematic is the join from node Bn to node A(n+1). Depending on the particular application being written, the relationship from Bn to A(n+1) might need to be an inner-outer query relationship. If an inner-outer query relationship is needed to enable depth-first navigation like in Picture 2.4.1, Bn would be the node with all input variables to the inner query that begins with node A(n+1). Making regular parent relationships into inner-outer parent relationships can be inferred from Example 2.4.1. The function of the B nodes in Picture 4.1.0a is to drive the Navigational States down whatever set of clicks we want to simulate. For the errors, we union join all of the error trapping nodes together so that we write all of the errors in a single node called D0.

Picture 4.1.0b: Node Diagram for error collection

Union joining all of the error traps together and then writing them in a single node is a good idea because sometimes we want to modify the way we write error reports, and we would rather make a change once in an error collection node than make changes to potentially hundreds of error trapping nodes. Sometimes we want to crawl around a site and extract information as we crawl, and then ultimately write all of the extracted data to a specific destination in a single node. In this case, the B and C nodes in Picture 4.1.0a are combined:

Picture 4.1.0c: Modified version of “Navigate and Branch”

There’s no reason why we cannot extract a link with the pattern Data Translator and extract targeted information with another pattern Data Translator in the same node. If A1 and A2 require form submissions, remember to use the cache forms Document Retrieval Option in A1. We can infer the syntax of submitting a second form from Picture 3.5.3.6. Referring back to Picture 3.5.3.5, circular navigation with a pattern-crunch is merely a modified version of Navigate and Branch where node A2 navigates (clicks the “next” button) repeatedly until all pages of targeted information are exhausted. If our where filters are creative enough, we can complete the diagram with a single arrow, and node BC1 is then the pattern-crunching and next-link-selecting node.

Picture 4.1.0d: Modified version of “Navigate and Branch”

This tight loop does not have a fetchmanager node like Picture 3.5.3.5. Different programmers prefer different coding styles. Using a fetchmanager spreads the process of looping across 4 nodes and makes the code more versatile when websites get updated and the code needs to change.
On the other hand, tight loops require less code and get the job done. We can produce more effective diagrams by using special blocks to symbolize navigational looping structures. When we create a navigational looping structure, all we care about in terms of a game plan is the fetchmanager, because all errors and data collection will branch off of the fetchmanager. We can union join fetchmanagers into sourcegatherers and then branch off of the sourcegatherers. Let’s see an example by making a game plan for scraping an office supply website. We will allow a user to provide input in 2 ways. The first is colon-separated product category links to follow and the second is search terms to be submitted into a search box. Clicking through the site to a product category arrives at a set of results that may have more than one page. The same is true if you enter a certain part number as a search term. The process of taking the first results page and triggering the rest of the results pages can be combined. We will use two different views, each of which gets us to the first results page, then we’ll dump the first results pages from both views into the same circular referencing routine to pull out the rest of the pages.

Picture 4.1.0e: Modified version of “Navigate and Branch” with loop blocks and Views

A View is a query that has a default viewer. A node can select out of that default viewer. We can union 2 such nodes and then union join them into a circular looping fetchmanager. The fetchmanager should have the navigational states of results pages in it, at which point we branch for data scraping and error trapping. If we need minute details, we can extract the links associated with each individual product in the pattern-crunch, and then fetch the link and go after whatever information we want in the next page load. Suppose that there are 100 results to a search displayed 15 at a time. The circular looping fetchmanager should have 7 pages of results in it from 7 fetches. To go after the fine print of each product, we need to do 100 additional fetches, which is about 14 times the amount of network traffic. A cost/benefit analysis should be done for each level of depth past the results pages that the spider crawls to. Going deeper and deeper into a site for every individual product creates a lot more code and a lot more traffic for sometimes minimal improvement in the output data set (see Picture 3.6.25). Picture 4.1.0e is the outline of a CyberPrice query, which is the roughest and toughest query set running on the QL2 Client Center. For an engineering executive summary, selling applications in the fashion of Pictures 4.1.0a through 4.1.0e is the multimillion dollar idea of QL2 Software, Inc. Putting lots of nodes together in this style results in Data Flow diagrams that look like this:

Picture 4.1.1: Data Flow view of an application from the QL2 Client Center

In this particular query, we use a launch file cued by the Client Center to preprocess the input provided by a customer through a browser. Every launch file starts with “launch_”. Once input has been preprocessed, the launch file calls an Inner Query or view. Each input record to the view uses Navigate and Branch to collect data, trap errors, report site outages, etc. After the first input record to the view completes processing, the second record goes. This aids in the process of depth-first navigation (recall Picture 2.4.1).
Again, the general concept of a Client Center application is to first write a launch file that preps input from the user.

Picture 4.1.2: A typical launch file from the QL2 Client Center

The field FILE_BASE is actually a field created at the command line when the Client Center cues a launch file. In 2005, the WebQL Studio IDE was still only a Windows product, so the Red Hat Linux-driven Client Center actually calls everything from a command prompt. At that point, variables can be passed into launch files. We can create these variables by updating the appropriate *.ddl file in base/src/hosting/schema/. The appropriate *.ddl file corresponds to the type of Client Center application we are building (Airfare, Carrental, Hotel, etc.). The final node of this launch file is interesting because this is the bizarre case where we really don’t care what we are selecting, so select * is used. Basically, all of the file/db writing occurs in the Inner Query digikey_getone.wqv, so the selecting/writing of that final unnamed node in the launch file is irrelevant. The style of the first node, getinput, has evolved slightly since this query was last updated.

Picture 4.1.3: (left) Modifications to the first node; (right) the view read_in.wqv

The Inner Query read_in.wqv on the right shows how the user input has been standardized across several queries. Calling the view read_in.wqv in the launch file makes changing the way 20 queries read in data a matter of making 1 code update rather than 20 code updates. In late October of 2004, we released CyberPrice, which is one of the more impressive WebQL application sets currently running on the QL2 Client Center. After the launch file preps the data, each record is input into a view (or “Inner Query”) that could call its own view or views. Here is a segment of code out of the view that corresponds to Picture 4.1.2.

Picture 4.1.4: A segment of a typical Inner Query file from the QL2 Client Center

Clearly, Navigate and Branch is the coding technique being used. The necessary components of the Navigational States are selected in nodes fetch1 and fetch2a. The visible branches of fetch1 are urle1, cnterr, and getmylinks, whereas fetch2a has visible branches urle2a and method1a.

4.2 Client Center Diagnostic Tools

The QL2 Client Center has several ways of analyzing a query’s performance on the fly. Suppose Orbitz changes the location of a price, and it forces us to update our code. The time between the site changing and our delivering the update should be minimized, and always less than 48 hours, ideally less than 24 hours. When a query is failing to produce output or if excessive errors are being triggered, we want to be able to see that without opening the 40 megabyte text output file and hunting for certain records. The Client Center has a diagnostics page that lists the various data extraction tasks being run on our hosted hardware, and the pages are sorted by application type (like Airfare, Carrental, etc.). The numbers displayed show the input, output, and error counts, which can raise a red flag when something is wrong. Basically, we don’t want a $100,000+ customer running something for 8 hours with no data at the end because the price of every product changed from using a <TH> tag to a <TD> tag. The diagnostic pages give us the quick lead on where to look when we are supervising query performance.
Picture 4.2.1: A screenshot of the Diagnostics Page of the QL2 Client Center

We can sense and fix problems with the Client Center in other ways as well, but this is the quickest and easiest way to get acquainted with finding and fixing problems. Sometimes we don’t notice problems until customers point them out. We will need to log in as an administrator to get to this page and click System→Diagnostics. Each record in Picture 4.2.1 represents a “Run” on the Client Center that corresponds to a batch of customer-defined inputs. The Inputs column, of which only the “s” is visible on the left in Picture 4.2.1, represents the number of input line items. Err represents the number of output line items that are an error, and Raw represents all line-item output. These stats will fill in the browser window upon completion. QSRC are Work Unit stats that evolve as the job runs in real-time. Q is the number of queued Work Units. S is the number of stopped Work Units. R stands for currently running and C stands for completed. If we click deeper into the Diagnostics Page to the right of the query start time and duration, we can view the Run in detail. A “Run” is a batch of customer input that is currently running, already complete, or queued to run on our hosted-processing hardware. Customer input is sometimes thousands of input line items. Instead of sequentially running all input line items together, the Client Center will break the Run into mini-batches of 20 or fewer called Work Units and run them in parallel. Some Client Center applications only have 4 inputs per Work Unit, such as cruises. A given Work Unit can run only 1 query and operates on a single thread. In the WebQL Windows IDE, we can adjust our thread settings by going to View→Options→Network.

Picture 4.2.1.5: WebQL Network Options

“Maximum number of running requests” represents the number of threads; the Client Center operates with single-threaded Work Units. The reason that Client Center Work Units operate on a single thread is that some applications require assurance that requests will follow a particular order, and that order can be guaranteed only when WebQL navigates on a single thread. When we are extracting links that we know are independent of the post data and cookies, we can “up the firepower” and run 15 or more threads simultaneously. Depending on how large a customer is, customers have the right to run different numbers of Work Units. If a customer’s run uses 100 line items on 1 query and 25 on another, the Client Center will break the run down into 5 Work Units on launch_query1 and 2 Work Units on launch_query2. There will be 20 inputs per Work Unit on launch_query1, and 20 inputs on the first launch_query2 Work Unit and 5 inputs on the second launch_query2 Work Unit. Suppose that the customer’s account is authorized to use 5 Work Units simultaneously. In this case, 5 of the 7 Work Units will immediately begin running even if they all go to the same site. The number of Work Units is a representation of the amount of power a customer has on our data center. In 2005, 10 Work Units cost approximately $100,000 per year to license. A Run is always assigned a Run ID, and a Work Unit is always assigned a Work Unit ID, so if we spot a problem and want to discuss it with another worker, referencing it by the Run ID / Work Unit ID combo is the quickest and easiest way.
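The Work Unit arithmetic is easy to reproduce. A minimal Python sketch of how a Run’s line items split into Work Units of at most 20 inputs (the usual batch size; some application sets, like cruises, use a smaller one) looks like this:

import math

def work_units(line_items: int, batch_size: int = 20):
    # Assumes line_items > 0; every Work Unit holds batch_size inputs except possibly the last.
    units = math.ceil(line_items / batch_size)
    return [batch_size] * (units - 1) + [line_items - batch_size * (units - 1)]

print(work_units(100))   # [20, 20, 20, 20, 20] -> 5 Work Units on launch_query1
print(work_units(25))    # [20, 5]              -> 2 Work Units on launch_query2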
Picture 4.2.2: A browser window of a Run by Work Unit (System→Diagnostics→View)

Every Work Unit is cued by a go.sh file and has an input file called in.csv. run.txt stores the name of the processing server assigned to the Work Unit, and the text of the Messages tab of the WebQL Execution Window is stored in run.log. Any files created by the query appear to the right of the run.log file.

4.3 Post Processing

For customers, probably the most important issue is Data Integrity. Data Integrity means that every line item of input provided by the customer through the QL2 Client Center has accurate output data or an error reported in the delivered output file set. The sites on the internet can have any one of many problems with them at any given moment, especially if a site is bombarded with traffic. Sometimes customer input on a hosted query does not produce any output or any errors, so we must force an error to appear. After a set of Work Units runs, a Post Processing Work Unit starts and runs a Post Processing Query. Every Post Processing Query’s name begins with “pp_”. The major idea behind Post Processing is to manipulate all output, good and bad (with errors), out of the same file shared across all Work Units, called raw.csv. In every query, when we write output, we always write to raw.csv, whether it’s an error or not. Post Processing then delivers the custom-tailored output files to the customer. One of the default jobs of Post Processing is to analyze the total Run’s input relative to its output and create error messages when an input does not have corresponding output or errors. This is done most easily with left outer join. Picture 4.3.1 shows the standard procedures for Post Processing on the left and the node that generates errors for input and output not reconciling on the right. The code on the left is the entire pp_default.wql file and the code on the right is a section of rawfile.wqv (a complete code citation comes later). Both nodes RAW_INPUT and RAW_OUTPUT have a field called INPUT_ID. All records get filtered out by the where clause except for those that appear only in RAW_INPUT. This is the most creative use of left outer join so far. From the above, we see that the essence of Post Processing is 3 different steps.

Picture 4.3.2: The 3 step outline of pp_default.wql

Whether we are writing queries for Airfare, Carrental, CyberPrice, or any other QL2 Client Center application set, our default Post Processing will follow this model. These 3 steps are represented by the views rawfile.wqv, stdout.wqv, and stderr.wqv. Oftentimes the standard output format is not what the customer wants; in that case, we would write the view customer1out.wqv and then also write pp_customer1.wql as the Post Processing Query. Here’s the rest of the file rawfile.wqv, from the top:

Picture 4.3.3: The rest of rawfile.wqv

To provide a complete code citation, stdout.wqv and stderr.wqv are included as well.

Picture 4.3.4: stderr.wqv

Picture 4.3.5: stdout.wqv

In addition to these data-manipulation routines, DHTML and interactive environments with javascript are a way of massaging data into reports that pricing analysts can handle. The idea with DHTML is that we write to an HTML text file all throughout the code, starting with <HTML><HEAD> and building up tables with rows, cells, and so on.
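As a rough Python stand-in for that idea (the real reports are written by WebQL code like the segment in Picture 4.3.6 below, and the records here are made up), the report is just a text file that we keep appending table rows to as output records flow through:

rows = [
    ("Hertz", "Economy", 31.99),     # made-up records standing in for raw.csv output
    ("Avis", "Economy", 34.50),
]

with open("rctables.htm", "w") as report:
    report.write("<HTML><HEAD><TITLE>Car Rental Report</TITLE></HEAD><BODY>\n")
    report.write("<TABLE border=1><TR><TH>Vendor</TH><TH>Class</TH><TH>Price</TH></TR>\n")
    for vendor, car_class, price in rows:
        report.write("<TR><TD>%s</TD><TD>%s</TD><TD>$%.2f</TD></TR>\n" % (vendor, car_class, price))
    report.write("</TABLE></BODY></HTML>\n")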
In the WebQL code itself, we end up with segments that look like this:

Picture 4.3.6: A segment of WebQL code that generates HTML

An HTML expert who is also a WebQL expert can do amazing things, considering that all the customer provides is an input file on the Client Center. Imagine setting up a car rental shopping list to deliver a link to a clickable report in an email inbox every morning between 9 and 10am that looks like this:

Picture 4.3.7: One of many clickable HTML reports made in Post Processing

As a business, we provide output for a customer’s input. The overall diagram of how data flows on the Client Center was the recipe for the success of QL2 Software.

Picture 4.3.8: Schematic diagram of input to output for a given application set on the QL2 Client Center

The customer’s input is provided through a web browser and used to cue Work Units that each run a given query. The raw.csv output files for the Work Units are concatenated together and passed to the Post Processing Query. WebQL converts the compiled output into customized files for the user. Each application set on the Client Center has its own maximum line items per Work Unit and its own Post Processing Query, with customized versions for customers as needed. In terms of source code, the application sets are stored in base/src/hosting/scripts.

4.4 Managing Customer Accounts

Not every customer runs every query in a given application set. The check boxes for enabling queries for a customer can be found on the Client Center through an administrative account at System→Organizations by clicking on an organization (customer or prospect) and on an application set. In addition to enabling queries, input variables to queries are defined per customer login at the same page location. Each customer account is an organization composed of sub-accounts. This setup allows settings to be made for an organization as a whole, while the individual logins (sub-accounts) can override those settings on a login-by-login basis.

Picture 4.4.1: System→Organizations→<customer>→Cars

Activating queries and setting input variables are handled for an organization in Picture 4.4.1. This interface is somewhat self-explanatory, and we can usually figure anything out when clicking around the Client Center by using the “?” quick-help links.

4.5 CVS/Cygwin

The QL2 system of developing code, testing it on a staging server, and then releasing to the live server is done through a program called CVS, which is part of the complete download package of Cygwin. Go to Cygwin.com, download the download manager, and do a “full” download of all components. This could take a while. After completing the install, you will need to set a series of environment variables. There is an employee “how to” guide on Cygwin, which should be used in addition to Chapter 4.5.

4.6 The Source Repository

The Source Repository is the collection of queries for QL2 Client Center customers. The queries are launched from the Red Hat Linux WebQL command line. As developers, we are most concerned with the base/src/hosting/scripts directory and its subdirectories. Once we have our username and password, have fully installed Cygwin and set the environment variables on our machine, and have done the CVS cobase command, we type interactions with the Source Repository at the DOS command line:

C:\>cd base/src/hosting/scripts
C:\base\src\hosting\scripts>cvs up -d

This will update all scripts onto our machine from the source repository server (Radish).
In the scripts directory, there is a series of directories representing the various application tabs in the Client Center, such as carrental, cyberprice, vacations, etc. If we have to add files (launch_xx.wql, xx.wqv) for an application to run under the carrental tab, continue from above:

C:\base\src\hosting\scripts>cd carrental/pricing
C:\base\src\hosting\scripts\carrental\pricing>cvs add launch_xx.wql xx.wqv
C:\base\src\hosting\scripts\carrental\pricing>cvs commit launch_xx.wql xx.wqv
C:\base\src\hosting\scripts\carrental\pricing>./dist-new nutmeg carrental

The above set of commands adds, commits (version 1.0), and distributes the files to the Nutmeg staging server. Once the query has passed all testing on Nutmeg, continue at the DOS command line:

C:\base\src\hosting\scripts\carrental\pricing>cvs tag -F release launch_xx.wql xx.wqv
C:\base\src\hosting\scripts\carrental\pricing>./dist-new master carrental

Now the files have been distributed to the live server that customers interact with. We can run a test on the live site through an administrative account. We must make sure that the query that we are trying to run is activated in the account we want to run the query in (recall Chapter 4.4 and Picture 4.4.1). These commands can be remembered as a 5-step process.

Picture 4.6.1: The 5 steps to releasing a file

In addition to these 5 commands, the database administrator must make an update that we cue by committing the updates.txt file and the *.ddl file of the specific application tab (carrental.ddl in this case). These files are located in the base/src/hosting/schema directory. Again, recall Picture 4.4.1. Making new variables and queries appear so that they can be activated for a given customer account is done through the *.ddl files and the updates.txt file. The way to add variables and queries is illustrated in the following pictures:

Picture 4.6.2: A sample of queries and variables in the carrental.ddl file

Picture 4.6.3: A sample of updates in the updates.txt file

We must make sure to update the schema directory before modifying and committing the files:

C:\base\src\hosting\schema>cvs up

By not specifying a file name, we update the entire schema directory in one command. After modifying the files, we then commit them:

C:\base\src\hosting\schema>cvs commit updates.txt carrental.ddl

We can now alert a database administrator that we are ready for a database update for both the staging server (Nutmeg) and the live site (Master).

4.7 Making Queries Industry Strength

We have already written queries to scrape prices out of http://www.techdepot.com and http://www.superwarehouse.com. In order to sell the application for big money on the QL2 Client Center, some enhancements must be made to the code. For example, the error trapping that is done in Examples 3.6.1-2 is insufficient. Another improvement is an input handler. Finally, we will normalize all errors and output into a single file format that is ready for post processing, which is outlined in Chapter 4.3. Let’s start by creating an input handler for both Example 3.6.1 and Example 3.6.2 to work from. The reason for creating an input handler is to ease the process of updating all queries in our application set simultaneously. Suppose we add 18 additional queries to the 2 that we already have and then change the way that input is processed, such as column 3 having the search terms instead of column 2.
If we didn’t have an input handler, then we would have to update all 20 pieces of code to handle the input properly. With the input handler, we just update it once and all 20 pieces of code use it. This example is similar to Picture 4.1.3.

Picture 4.7.1: Implementing an input handler with a FILE_BASE

On an industry-strength data center, input files are directory-specific and require a FILE_BASE to locate the appropriate input file. The FILE_BASE will also be needed for directory-specific output files throughout the code. There will be an error trap added in the reader.wqv file once the error enhancements are implemented. The next step is to determine what the standardized output will be between the two software agents, techdepot and superwarehouse. Looking at the current versions of output:

Picture 4.7.2: Deriving a common output file format

We see that the output formats are somewhat similar. We will have to include the FILE_BASE in our data writing, but first we need to derive the common output format. The itemcode field of Example 3.6.2 can be merged with the sku field of Example 3.6.1. Typically, there are two different types of product numbers for products sold on the internet. One is the website’s part number (or “sku”) and the other is the manufacturer’s part number. Some sites use only one type of part number. In the process of merging applications for different websites, we need to take both types of product numbers into consideration in the common output format. The techdepot code has both sku and manufacturer’s part numbers. Another difference between the two applications is that the superwarehouse code includes the inputID along with the output. Because we will also include the site name in the common output table, the table will require 9 common fields (SITE, INPUTID, SEARCHSTRING, SKU, MANUF, PRODUCT, PRICE, AVAILABLE, TIMESTAMP). Our ultimate goal is to combine errors and output into one table, so let’s add 3 more common fields for errors (MESSAGE, ERRORID, RERUN) for a total of 12 fields. The RERUN field is a 1 or 0/Null based on whether or not the error can potentially be corrected by rerunning the input for the given SEARCHSTRING. Let’s call the 12-field table raw.csv. We can tell if a record in the table is an output item or an error item based on whether or not the MESSAGE field is null. This is important in post processing, which is covered in Chapter 4.3. Implementing the 12-field format makes the code look like this:

Picture 4.7.3: Making the tables similar with 12 fields

The INPUTID field will need to be added to Example 3.6.1 when the input is read in through the reader.wqv file for the code to compile. Before we worry about compiling anything, we need to implement error traps and controls. In addition to every HTTP error associated with a fetch, we must also write error traps for responses like “Zero Products Found for your Search.” Further, we need to take into consideration that some searches will have thousands of results. What should happen in these cases? Should there be a limit of 100 results per search? To create a fetch ceiling, let’s say that we will only scrape out a maximum of 3 pages of results per search. Implementing the error updates recreates Example 3.6.1 as Example 4.7.1:

Picture 4.7.4: The first part of the code to Example 4.7.1

We are now able to see the adjustments made to reader.wqv and how they have impacted the code.
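To keep the target format in mind while reading the code, here is roughly what raw.csv records could look like. The header follows the 12-field list derived above; the output row borrows values from the techdepot listing in Picture 3.6.13.5, and the error row, including its MESSAGE, ERRORID, and RERUN values, is purely hypothetical:

SITE,INPUTID,SEARCHSTRING,SKU,MANUF,PRODUCT,PRICE,AVAILABLE,TIMESTAMP,MESSAGE,ERRORID,RERUN
techdepot,1,HP,S5523329,PZ903UA#ABA,HP Consumer - nx6310 - Core Solo T1300 1.66 GHz - 15,892.95,In-stock,2005-11-14 09:12:05,,,
superwarehouse,2,badpart123,,,,,,2005-11-14 09:12:41,Zero Products Found for your Search,2,1

A record is an output item when MESSAGE is empty and an error item when it is not, which is exactly the test that post processing relies on.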
Adding lots of error traps often causes us to be more specific in the references we make, as in the node columninput. The need to include “select1.” in the FILE_BASE reference is indeed because of an error trap that gets implemented in the inner query of the outer1 node. Here’s some more of the code:

Picture 4.7.5: The second part of the code to Example 4.7.1

Picture 4.7.5 shows us the new error trapping. Here is more of the code:

Picture 4.7.6: The third part of the code to Example 4.7.1

Picture 4.7.7: The final part of the code to Example 4.7.1

Running the input from Picture 4.7.4 produces the following output viewed through Microsoft Excel:

Picture 4.7.8: The output of Example 4.7.1

The output seems ready for post processing. Notice that we’ve stripped the dollar signs and commas off of the prices. We can make similar enhancements to Example 3.6.2 and recreate it as Example 4.7.2. The next step in that process is to implement the error traps needed for an industry-strength web-crawler and write our output to the common 12-field raw.csv format that gets the output ready for post processing. Picture 4.7.9 is the start of the superwarehouse query:

Picture 4.7.9: The first part of the code to Example 4.7.2

We see that the start is very similar to that of Example 4.7.1.

Picture 4.7.10: The second part of the code to Example 4.7.2

We see some variations from Example 4.7.1 and some similarities. Let’s look at more of the code:

Picture 4.7.11: The third part of the code to Example 4.7.2

Finally, here’s the last of the code:

Picture 4.7.12: The final part of the code to Example 4.7.2

Running the code with the input seen in Picture 4.7.9 produces the following output file viewed in Microsoft Excel:

Picture 4.7.13: The output of Example 4.7.2

The Excel view of the raw.csv file suggests that the crawler is working as desired by giving up to 3 pages of results (60 results) for each search term. We have now gone through an example of creating an application set in Examples 4.7.1-2. We could easily add 10 or more applications that work in conjunction with reader.wqv. Through post processing, we can deliver different customized output and error files to different customers. The idea is to develop an application set for a given market (airfare, car rental, vacation packages, etc.) and get lots of customers to run the same code to surf and scrape the sites, and then each customer has its own post processing code that delivers customized output and error files (recall Picture 4.3.8). Clearly, there are quite a few details to writing an industry-strength web-crawler, especially one with elaborate error controls. Elaborate error controls include both fetch-related network errors and errors triggered out of the site, such as “Zero results found.” Even when we feel like one of our crawlers is complete, oftentimes new customer input triggers behaviors in the website that we have not yet seen, and we have to update the code.

Appendix I: HTML knowledge for WebQL

To effectively code a WebQL web spider, basic knowledge of HTML tags is beneficial. HTML stands for HyperText Markup Language. HTML pages are plain text files that get interpreted by a browser into what we see when we click around the internet.
Nearly every page on the web is built in the structure of tables, with rows containing cells, which in turn may contain a table. Being able to position images, links, and text for a browser using <table>, <tr>, and <td> tags is a skill of every HTML developer. Over time, websites have evolved into detailed works of art and graphic design. A website that would take 3 guys and $100,000 to develop in the mid-to-late 1990s is a matter of a 4-digit contract for a single person in 2005. Luckily, building the intricate details of modern HTML is not a requirement to be an expert WebQL application developer. We just need to know HTML basics. Here is a rough sketch of an HTML page that contains a table.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<TITLE>Prices and Taxes in 3 Currencies</TITLE>
</HEAD>
<BODY background="liberty.gif">
<TABLE align=center cellpadding=3 cellspacing=3 bgcolor=blue>
<tr>
<th colspan=3 bgcolor=red align=center><b>Prices and Taxes in 3 Currencies</b></th>
</tr>
<tr>
<td><font size=+2><b>$$</b></font></td>
<td><font size=+2 color=white><b>55.75</b></font></td>
<td><font size=+2>5.52</font></td>
</tr>
<tr>
<td><font size=+2><b>Yen</b></font></td>
<td><font size=+2 color=white><b>6500</b></font></td>
<td><font size=+2>622</font></td>
</tr>
<tr>
<td><font size=+2><b>British Pound</b></font></td>
<td><font size=+2 color=white><b>30</b></font></td>
<td><font size=+2>2.88</font></td>
</tr>
</TABLE>
</BODY>
</HTML>
<!-- Example I.1 -->

Looking at this code through a browser, we can figure out how table rows <tr>, table cells <td>, and table headers <th> fit inside a table. We also notice the table row/cell relationships.

Picture I.1: The HTML table seen through MSIE

The background image file is tiled along the background of the browser window, and the table is horizontally positioned in the center and vertically at the top. The HTML file is available for download at http://www.geocities.com/ql2software/pages/myPrices.html. As mentioned before, crafting HTML is often a game of making tables inside of tables.
Consider this code: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> <HTML> <HEAD> <TITLE>FCY rates for USD deposit </TITLE> </HEAD> <body background=”liberty.gif”> <TABLE cellSpacing=1 cellPadding=0 width=430 border=3> <TBODY> <TR vAlign=top align=left> <TD> <TABLE cellSpacing=1 cellPadding=2 278 CONFIDENTIAL width="100%" border=0> <TBODY> <TR vAlign=top align=middle> <TD> <B>Time</B></TD></TR> <TR vAlign=top align=middle> <TD>1 day</TD></TR> <TR vAlign=top align=middle> <TD>1 week</TD></TR> <TR vAlign=top align=middle> <TD>1 mth</TD></TR> <TR vAlign=top align=middle> <TD>2 mths</TD></TR> <TR vAlign=top align=middle> <TD>3 mths</TD></TR> <TR vAlign=top align=middle> <TD>6 mths</TD></TR> <TR vAlign=top align=middle> <TD>12 mths</TD> </TR></TBODY></TABLE></TD> <TD> <TABLE cellSpacing=1 cellPadding=2 width="100%" border=0> <TBODY> <TR vAlign=top align=middle> <TD><B>< 10'000 USD</B> </TD></TR> <TR vAlign=top align=middle> <TD>0.0000</TD></TR> <TR vAlign=top align=middle> <TD>0.0000</TD></TR> <TR vAlign=top align=middle> <TD>2.7150</TD></TR> <TR vAlign=top align=middle> <TD>2.8450</TD></TR> <TR vAlign=top align=middle> <TD>3.0350</TD></TR> <TR vAlign=top align=middle> <TD>3.2750</TD></TR> <TR vAlign=top align=middle> <TD>3.5250</TD> </TR></TBODY></TABLE></TD> </TD> </TR></TBODY> 279 CONFIDENTIAL </TABLE> </BODY> </HTML> <!-- Example I.2> We notice that the table is not centered this time and that some table cells actually contain tables. <TBODY> and </TBODY> are tags that symbolize the beginning and end to a table body. Notice how the table has a border but does not have a background. Imagine what the code looks like through a browser and then take a look below. Picture I.2: MSIE view of the nested tables in Example I.2 The table in this HTML code has 1 row and 2 cells. Each cell contains a table of 8 rows each with 1 cell width. If familiar with Data Translators in WebQL, we can apply the table rows, table cells, and table columns Data Translators to the page sources of 280 CONFIDENTIAL Examples I.1-2. The source code of the table of Example I.2 is available at http://www.geocities.com/QL2software/pages/myRates.html. Looking at TABLE_ID, ROW_ID, and COLUMN_ID Pseudocolumns, we can try to reprocess both tables by rotating the vertical columns into rows and rows into columns. A few other tags besides table-related tags are worth mentioning. <BR> is a line break that forces text to appear on different lines. Line1<BR>Line2 is HTML code that will put Line2 physically below Line1 on a page. Similarly, <NOBR> This text stays on one line.</NOBR> is a tag-set that prevents text from breaking into different lines no matter how smashed the browser window gets. <td><IMG src=”myDirectory/myImage.jpg”></td> is merely placing an image in a table cell. If we wanted to extract the image in a WebQL program, we can get the URL by either using the images Data Translator or by using a Regular Expression pattern Data Translator to extract the image’s URL extension and concatenate it with the domain of the SOURCE_URL Pseudocolumn. If we aren’t yet familiar with Data Translators and Pseudocolumns, they are explained in detail in Chapter 2.1. When hunting through HMTL page sources for a particular location in the page, we should get used to using text find (cntrl+F). Looking at a web page from a browser’s view, we should be able to use a text string in the finder to get us where we want to be in the HTML source. 281 CONFIDENTIAL The HTML form tag is another tag that is helpful to be familiar with. 
The HTML form tag is another tag that is helpful to be familiar with. We will leave the code exactly as it was found in the page source.

<form action="index.cfm?handler=data.processsearchb" class="formstyle" method="post" name="headersearch">
<tr>
<td rowspan="2" width="150" background="images\spacer.gif"> </td>
<td colspan="2" width="250" background="images\spacer.gif">
<font color="#FFFFFF"><b>PART # / KEYWORD</b></font></td>
<td rowspan="2" width="150" background="images\spacer.gif"><img align="right" border="0" src="images/Same-Day_Shipping.gif" alt="Same Day Shipping / No Minimum Order"></td>
</tr>
<tr>
<td align="right" background="images\spacer.gif">
<input type="text" name="keyword" size="45" ID="Text1"></td>
<td><input type="image" src="images/layout/header/btnSearchBGC1A68B2.gif" ID="Image1" NAME="Image1" align="bottom">
<input type="hidden" name="isANewSearchLimitSearch" value="false">
</td>
</tr>
</form>

The class of the form named headersearch is a reference to a stylesheet that can do things like set a background color and font size/color automatically in one stylesheet reference. An HTML stylesheet is a *.CSS file that is referenced for colors and styles throughout an HTML source page. A form has various inputs that determine which page is loaded next. What are the inputs to this form? What tags most likely appear shortly before the beginning and after the end of the form? Probably <TABLE…> and </TABLE>.

Now that we have some fundamental knowledge about HTML, we are better equipped to use WebQL for large-volume data extraction tasks. The HTML source code of a page is just tables and tags with limited structure; WebQL comes in and extracts the information out of the source code and gives it both structure and format for a database or archive. Only this basic knowledge of how HTML is a tag language that tells a browser how to present tables of information is required to be a successful WebQL Application Developer. If an HTML tag looks interesting, we can type it into Google to learn more about it.

Appendix II: javascript knowledge for WebQL

Basic knowledge of javascript, in addition to HTML, is beneficial for WebQL web-application development. Javascript is a client-side scripting language that can create dynamic changes in HTML, cause a redirect, create a game for a user to play, and/or submit a form. Javascript code is started and ended by the <SCRIPT…> and </SCRIPT> HTML tags.

<HTML>
<HEAD>
<TITLE>Redirecting...</TITLE>
<SCRIPT language=javascript>
function crawlFurther() {
  window.location = "http://www.geocities.com/ql2software/redirect/index.html";
}
</SCRIPT>
</HEAD>
<BODY onLoad="javascript:crawlFurther()">
</BODY>
</HTML>

This HTML file is at http://www.geocities.com/ql2software/pages/redirect.html. As the page loads, the window location is immediately moved to a different subdirectory via javascript. We should land at a page that looks like this:

Picture II.1: The page with a redirected body URL

Notice that the browser URL and the URL in the function crawlFurther aren't the same, but the page is the same if we load the http://www.geocities.com/ql2software/redirect/index.html file directly. If we view the source of Picture II.1, we will not see any reference to the message, only the URL in the crawlFurther javascript function.

Picture II.2: The body URL accessed directly

If the relocation is a dynamic link that includes a sessionID and/or other information, then we won't be able to load the body URL directly. We will have to cut the URL out of the source page that calls the redirect, which is the URL in the crawlFurther function, and then load it.
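"Cutting the URL out" is just an extraction buffer applied to the script text. Here is a plain-javascript sketch; the string below is pasted from the crawlFurther function above, and the buffer grabs whatever window.location is set to.

var pageSource =
  'function crawlFurther() { window.location = ' +
  '"http://www.geocities.com/ql2software/redirect/index.html"; }';

var redirect = pageSource.match(/window\.location\s*=\s*"([^"]+)"/);
if (redirect) {
  console.log(redirect[1]);   // the body URL we would load next
}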
The view of the source code of the page represented by the body URL does contain the message, "This is the redirected homepage."

Javascript can also be used to submit forms on a page. Submitting a form involves a user entering information in a browser window and clicking the "go" or "search" button. Whether we are buying something with our credit card or looking for flights, we are submitting a form, most likely activated by javascript. Let's go over an example of such a form.

Picture II.3: The 2005 Expedia car search form

This form comprises about 350 lines of source code, available at http://www.geocities.com/ql2software/pages/carsform.html. Not visible in the browser window above is the green "Search" button off to the right. Here's the HTML of the button:

<INPUT class=GoButton id=A1034_3 onclick=FS() type=submit value=Search>

We see that the button calls the function FS, which calls QFS. The necessary javascript to see what's happening is pasted in below. Notice that the function SF() submits the form instantiated in function FS().

function QFS(wt, postshop, dprt, dest, pinf) {
  var rate = 0;
  if (1 == wt) {rate = (true == postshop) ? 58 : 250;}
  else if (4 == wt) {rate = (true == postshop) ? 159 : 410;}
  else if (3 == wt) {rate = (true == postshop) ? 1491 : 1450;}
  else if (9 == wt) {rate = (true == postshop) ? 331 : 2500;}
  else if (6 == wt) {rate = (true == postshop) ? 2500 : 2000;}
  else if (28 == wt) {rate = (true == postshop) ? 500 : 500;}
  QualifiedForSurvey(wt, postshop, 45, rate, dprt, dest, pinf);
}

function TEK(a,evt){
  var keycode;
  if (window.event){ keycode = window.event.keyCode; evt = window.event;}
  else if(evt) {keycode = evt.which;}
  else {return true;}
  if(13==keycode){evt.cancelBubble = true; evt.returnValue = false; eval(a);}
}

function FS() {
  QFS(3, true);
  var f = getObj("MainForm");
  if (getObj('rateoption1').checked) { f.vend.value = f.vend1.value; }
  if (getObj('rateoption2').checked) { f.vend.value = f.vend2.value; }
  if (getObj('rateoption3').checked) { f.vend.value = f.vend3.value; }
}

function SF() {
  FS();
  window.external.AutoCompleteSaveForm(f);
  f.submit();
}

Here's the MainForm tag only.

<FORM onkeypress="TEK('SF()')" id=MainForm name=MainForm action=/pub/agent.dll?qscr=cars&itid=&itdx=&itty= method=post ?>

The great part about WebQL is that we don't need to understand every detail of the javascript; we just need to know what matters and what doesn't for the sake of submitting a form. Most of the time, the input for a form comes from a CSV file, which can be thousands of records (or "rows") long. We should try to figure out which form variables need to be submitted before clicking the "go" button. Hunting through the HTML form and the associated javascript calls is a great way to do it. Another technique involves using the forms Data Translator as in Example 2.1.9, which is an even easier way of identifying the form variables. We can also set up a proxy on our browser to see which variables and values get submitted when a form is submitted by a user click. We must make sure that we are submitting the variables in a similar fashion in the WebQL Studio IDE, which has a built-in proxy under the Network tab of the Execution Window (see Picture 1.2.7). The most important thing to understand is that information is returned to the user by a web server when the user submits a form. The form is built with HTML and sometimes uses javascript to pass the information from the user to the web server.
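Whatever the javascript does along the way, what ultimately travels to the server is a set of name=value pairs posted to the form's action URL. Here is a plain-javascript sketch of building that post string; the action is taken from the MainForm tag above, while the field names and values are made-up placeholders rather than the real Expedia variables.

var action = '/pub/agent.dll?qscr=cars&itid=&itdx=&itty=';   // MainForm's action from above
var fields = {                       // hypothetical names/values for illustration only;
  vend: 'XX',                        // the real list comes from the form inputs and the proxy
  pickupDate: '12/01/2005',
  dropoffDate: '12/03/2005'
};

// Build the URL-encoded post body, the same string a proxy would show us.
var postData = Object.keys(fields).map(function (name) {
  return encodeURIComponent(name) + '=' + encodeURIComponent(fields[name]);
}).join('&');

console.log('POST ' + action);
console.log(postData);   // vend=XX&pickupDate=12%2F01%2F2005&dropoffDate=12%2F03%2F2005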
WebQL automates this process and takes hundreds of man-hours of clicking and turns them into hundreds of minutes of web harvesting done by a computer.

Appendix III: Regular Expression Knowledge for WebQL

Regular Expressions are a way of matching text, perhaps pages long, against small expressions. Basic knowledge of HTML from the prior appendices will help. For the sake of examples, let's invent a function called 'matching' that has the syntax

myString matching myExpression

If the expression fits anywhere into myString, then the function evaluates to TRUE; otherwise the function evaluates to FALSE. If true, then we say myExpression "matches" myString. Suppose the source code text of our favorite homepage on the internet is stored in a text string called myFavHomepage; then

myFavHomepage matching '<HTML>.*?</HTML>'

returns TRUE if myFavHomepage has a start HTML tag followed somewhere on the page by an end HTML tag. The phrase .*? means "match anything until, but possibly match nothing." We notice that some HTML tags have more to them than just their name. Consider the image tag

<IMG src="/myDirectory/myPic.GIF" alt="Beauty at its best!">

How could we ambiguously match any image tag with a Regular Expression? From what we've learned so far,

<IMG.*?>

should be all we need. Believe it or not, that works! There is a "tighter" way of writing the expression that says, "match as many characters as possible that aren't greater-than characters until we hit a greater-than character":

<IMG[^>]*>

Brackets symbolize a character class, and the caret symbolizes negation. Thus, "match a character that is not a digit" is written

[^0-9]

and "match a character that is an English letter" is

[A-Z]

To "match one or more letters, as many as possible" is

[A-Z]+

while "match one or more letters, as few as possible" is

[A-Z]+?

With what we've learned, how would we match any HTML table?

<TABLE[^>]*>.*?</TABLE>

Character classes can also be used to extract prices. A price probably has digits, commas, and a period, so

[0-9,.]+

is a good pattern, but not the best. The pattern we have selected would also match 99.99.953.40,33,4, which is not a price. If the price for sure starts with a dollar sign and ends with cents, then

\$\d*\.\d{2}

is a better pattern. The period and the $ are actually reserved characters, so they must be set off by a backslash '\'. The period alone stands for "match any character," which we could infer from the first example, so to match a period character explicitly, we must use a backslash in the expression. '\d' symbolizes the character class [0-9], and the squiggly brackets say "match exactly this many." Notice that there need not be any dollar digits matched at all, only cents, and that there is no space between the dollar sign and the price. The dollar sign alone without the backslash symbolizes the end of a line. Thus, "match the start of the first table cell tag until the end of the same line" in the source file is represented by

<td.*?$

"Match the first table cell tag until the final end-of-line marker" would then be

<td.*$

WebQL is powerful because it allows us to use extraction buffers with Regular Expressions to pinpoint and extract data from an HTML source.
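We can try these patterns out anywhere a Regular Expression engine is handy. Here is a quick plain-javascript check of the expressions from this appendix; the test strings are made up, and the point is simply which pattern matches what.

var tag   = '<IMG src="/myDirectory/myPic.GIF" alt="Beauty at its best!">';
var price = 'Sale price: $ is not here, but $1299.99 is.';
var junk  = '99.99.953.40,33,4';

console.log(/<IMG[^>]*>/.test(tag));          // true  - matches any image tag
console.log(/[0-9,.]+/.test(junk));           // true  - too loose, matches the junk string
console.log(/\$\d*\.\d{2}/.test(junk));       // false - the junk has no $ followed by cents
console.log(price.match(/\$\d*\.\d{2}/)[0]);  // "$1299.99"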
Let's invent another function called extract_pattern(text1, pattern1). The idea is that pattern1 is a Regular Expression containing an extraction buffer denoted by parentheses. For example, if myFavHomepage is our favorite homepage, then

extract_pattern(myFavHomepage, '<TITLE>(.*?)</TITLE>')

will extract the title of myFavHomepage. Suppose we load a page that has a product with a price on it. Call the page myProductPage.

extract_pattern(myProductPage, '<td class=price>\s*\$\s*([\d.,]+)\s*</td>')

is an example of a thorough expression that won't mistakenly grab something other than what we are looking for. We notice that the table cell tag must be style-sheeted as a price cell, and the pattern also takes into account a dollar sign and the potential for empty whitespace on either side of the dollar sign and on either side of the price itself. We are much more effective WebQL programmers now that we have a fundamental understanding of Regular Expressions.

Appendix IV: Other Readings

There aren't many readings recommended to enhance our WebQL programming ability, for several reasons. First, a comprehensive book on Regular Expressions is great knowledge to have, but no book focuses on applying Regular Expressions to HTML sources, which is one of the most effective ways to apply the WebQL programming language. Recommended is the book that I looked at, which is in Perl. Second, if HTML surfing and slashing is one of the best ways to use WebQL, then knowing HTML is beneficial. Once upon a time, HTML books were popular, but if there is anything about an HTML tag that we don't know, we can just type it into Google and figure it out. The same is true for javascript. There are a couple of online resources listed that include the HTML 4.0 Specification text and an award-winning reference guide. There are also one javascript and one XML reference listed. Finally, WebQL is a programming language with only a dozen or so experts coding with it in the year 2005. Given that, this is the first book that explains how to code with WebQL from an engineering perspective. This limits the number of other readings relevant to the developer.

1) Friedl, J. E. F. Mastering Regular Expressions. ISBN: 1-56592-257-3
2) Full text of the HTML 4.0 Specification: http://www.w3.org/TR/1998/REC-html40-19980424/
3) The HTML 4.0 Reference: http://www.htmlhelp.com/reference/html40/
4) Javascript Tutorial: http://www.w3schools.com/js/default.asp
5) XML/XSLT Tutorial: http://www.w3schools.com/xml/default.asp

Appendix V: WebQL Function Tables

This quick reference is to help us find functions, sorted by function type. We can get additional help with functions by name by checking Help > Contents > Index.
WEBQL GIANT FUNCTION TABLE

String Functions

after(str1, expr1): returns whatever occurs after the point where expr1 matches str1; returns null if expr1 does not match str1
assemble(str1a, str1b, str2a, str2b, …): returns a string of '&'||strna||'='||strnb
before(str1, expr1): returns whatever occurs before the point where expr1 matches str1; returns null if expr1 does not match str1
between(str1, expr1, expr2): returns whatever text occurs before expr2 matches str1 and after expr1 matches str1; returns null if either expression does not match str1
str1 case matching expr1: returns true if expr1 matches str1 case sensitive, false otherwise
replace(str0, expr1, str1, ...): returns str0 with all matching characters of expr1 to str0 replaced by str1; we can have as many exprn and strn combos as we want
chr(num1): returns the unicode character associated with num1
chr_id(str1): returns the integer number associated with unicode character str1
clean(str1): returns str1 with all HTML and repeated whitespace removed
coalesce(str1, str2, …): returns the first non-null str
creation_date(str1): returns the creation date/time for the file specified in the path str1
datetime_to_text(date1, str1, str2, str3): returns the text version of date1 in format str1 in timezone str2 with locale str3; only date1 is a required argument and can be NOW
diff_html(str1, str2): returns an array symbolizing differences between the HTML in str1 and str2 (see help)
expression_match(sexpr1, str1): returns true if str1 meets the requirements of the search expression sexpr1 (see help)
extract_case_pattern_array(str1, expr1): case-sensitive version of extract_pattern_array
extract_pattern(str1, expr1): returns the contents of the extraction buffer of expr1 on str1; if expr1 does not match then null is returned
extract_pattern_array(str1, expr1): returns an array of extraction buffer matches of expr1 on str1; returns null if expr1 does not match str1
extract_url(str1): returns the first URL in the HTML of str1
extract_value(str1, str2): returns the value of the field str2 in the HTML of str1
file_size(str1): returns the size in bytes of the file represented by the path str1
highlight_html(str1, num1): returns the HTML str1 with all colors specified by num1 highlighted with tags added
html_encode(str1): returns a string that is the HTML encoded version of str1
html_to_text(str1): returns a string that has the HTML removed from str1
html_to_xml(str1): returns a string that gives well-formed XML to the HTML in str1
in(str1, array1): returns true if str1 is an element of array1; returns false otherwise
instr(str1, chr1): returns the integer position of chr1 in str1
str1 is null: returns true if str1 is null, false otherwise; str1 is not null works too
last_modified_date(str1): returns the last modified date of the file in the path specified by str1
length(str1): returns the number of characters in str1
load(str1): returns the string of the file specified in the path str1
load_lines(str1): returns an array of the lines of the file specified in the path str1
lookup(array1, str1): returns the position of str1 in array1; returns null if str1 is not in array1
lower(str1): returns str1 converted to lowercase
str1 matching expr1: returns true if expr1 matches str1 case insensitive, false otherwise
md5(str1): returns the MD5-hashed version of str1
normalize_html(str1): returns well-formed HTML around the segment of HTML in str1
nullif(str1, str2): returns null if str1 = str2, otherwise str1
number_to_text(num1, str1, str2): returns num1 as a string with str1 an optional format string and str2 an optional locale (see help)
nvl(str1, str2): returns str1 unless str1 is null, in which case str2 is returned
replace_links(str1, str2, array1): replaces links in the HTML of str1, optionally specifying an origin URL str2 and optionally specifying replacements array1 (see help)
studio_action(str1, str2): creates a special output field with a clickable link that specifies a type in str1 and a segment of code to execute in the body str2 (see help)
substr(str1, num1, num2): returns a new string starting at position num1 and continuing to the end of the string unless the optional num2 is given as the length of the substring
str1 || str2: returns a string of str1 concatenated with str2
text_to_datetime(str1, str2, str3, str4): returns a date object of text interpreted by the format str2 with optional timezone str3 and optional locale str4 (see help)
text_to_number(str1, str2, str3): returns a number from str1 that is optionally formatted by str2 with optional locale str3 (see help)
transform_xml(str1, str2): returns the transform of the XML of str1 by the stylesheet represented by str2
trim(str1): returns str1 without any external whitespace
upper(str1): returns str1 converted to uppercase
ureplace(str0, expr1, str1, ...): same as replace but uses a smaller character set (faster expressions)
url_change_type(str1, str2, str3): returns a URL string str1 with str2 as the file type conversion and the optional str3 as the default page name (see help)
url_decode(str1): returns the decoded URL string str1
url_domain(str1): returns the domain out of the URL string str1
url_encode(str1): returns the URL encoding of the string str1
url_file(str1): returns the file+path of the URL str1
url_filename(str1): returns the file name only of the URL str1
url_host(str1): returns the host of the URL str1
url_make_absolute(str1, str2): returns the URL of the path str1 applied to the URL base str2 (see help)
url_make_relative(str1, str2): returns the URL of str1 relative to the base URL str2 (see help)
url_parameter(str1, str2): returns the parameter associated with the value str2 in the variable string at the end of the URL str1
url_port(str1): returns the port number of the URL str1
url_query(str1): returns the query string portion of URL str1 (after the ?)
url_reparent(str1, str2, str3): returns the URL str1 with the old prefix str2 replaced by the new prefix str3 (see help)
url_scheme(str1): returns the protocol of the URL str1
xml_encode(str1): returns the XML encoded version of str1
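Before moving on to the numeric functions, here is what a few of the extraction-oriented string functions above boil down to, sketched in plain javascript. These mimic the behavior described in the table; they are not the WebQL engine itself.

// Rough javascript equivalents of after, before, and clean as described above.
function after(str, expr)  { var m = str.match(expr); return m ? str.slice(m.index + m[0].length) : null; }
function before(str, expr) { var m = str.match(expr); return m ? str.slice(0, m.index) : null; }
function clean(str)        { return str.replace(/<[^>]*>/g, '').replace(/\s+/g, ' ').trim(); }

var s = '<b>Price:</b> $12.50 <i>per unit</i>';
console.log(after(s, /Price:<\/b>/));   // " $12.50 <i>per unit</i>"
console.log(before(s, /\$\d/));         // "<b>Price:</b> "
console.log(clean(s));                  // "Price: $12.50 per unit"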
Numeric Functions

abs(num1): returns the absolute value of the number num1
arcos(num1): returns cos^-1(num1) in radians
arcsin(num1): returns sin^-1(num1) in radians
arctan(num1): returns tan^-1(num1) in radians
ceil(num1): returns the least integer greater than or equal to num1
cos(num1): returns the cosine of num1 radians
deg(num1): returns num1 radians in terms of degrees
exp(num1): returns e to the power num1
floor(num1): returns the greatest integer less than or equal to num1
hex(num1): returns the decimal value of the hex value num1
log(num1): returns the value of log base 10 of num1
mod(num1, num2): returns the remainder of dividing num1 by num2
oct(num1): returns the decimal value of the octal value num1
pow(num1, num2): returns num1 to the num2 power
rad(num1): returns the number of radians represented by num1 degrees
random(array1): returns a random element of array1; "random" alone generates a number between 0 and 1
round(num1, num2): returns num1 rounded to num2 decimal places
sin(num1): returns the sine of num1 radians
sqrt(num1): returns the square root of num1
tan(num1): returns the tangent of num1 radians
to_number(str1): returns the number of the given number-string
trunc(num1, num2): returns num1 truncated to num2 decimal places

Array Functions

array_avg(array1): returns the average of the elements of array1
array_max(array1): returns the value of the greatest element of array1
array_min(array1): returns the value of the least element of array1
array_sum(array1): returns the value of the sum of the elements of array1
array_to_text(array1, str1): returns all array1 elements connected as a string with str1 between each array1 element
array_unique(array1): returns a reduced-size or same-size array consisting of only the unique elements of array1
chisquare(array1): returns a string describing the CHI^2 of array1
diff_arrays(array1, array2): returns the array-difference of array1 and array2 (see help)
each(array1, A(each_item)): returns an array involving A on each_item of array1
flatten(array1): returns a modified version of array1 with all structural array nesting removed
merge(array1, array2): returns a single array of array1 concatenated with array2
reverse(array1): returns the elements of array1 in reverse order
sample(array1, num1): returns an array of size num1 that is a random sample of array1's elements
sequence(num1, num2, num3): returns an array starting at num1 incrementing by num3 up to num2
shuffle(array1): returns an array that contains the randomly shuffled elements of array1
size(array1): returns the number of elements of array1
slice(array1, num1, num2): returns a smaller array consisting of array1 indexed at num1 up until array1 indexed at num2
text_to_array(str1, str2): returns an array of the elements of str1 delimited by str2

Aggregate Functions

avg(col1): returns the average value of all records in col1
chisquare(col1): returns a string describing the CHI^2 of col1
count(col1): returns the count of all records in col1
count(unique col1): returns the count of all unique records in col1
gather(col1): returns the records of col1 in an array
gather_map(col1, col2): returns an array with col1n paired with col2n
geometric_mean(col1): returns the geometric mean of the records in col1
harmonic_mean(col1): returns the harmonic mean of the records in col1
max(col1): returns the maximum value of the records in col1
mean(col1): returns the mean value of the records in col1
median(col1): returns the median value of the records in col1
min(col1): returns the minimum value of the records in col1
mode(col1): returns the mode of the values in col1
sample_range(col1): returns the range of values (max - min) of col1
stddev(col1): returns the standard deviation of the records in col1
sum(col1): returns the sum of all records in col1
variance(col1): returns the variance of the records in col1

Document Retrieval Options

accepting text: specifies accepted content types; images only with text='image/*', any with text='*/*'
adding cookie text: manually adds the cookie in string text
adding header text: manually adds HTTP headers to a request specified in text
allow partial: entire request does not fail when a child page load fails
anonymize text: anonymizes a request with text as the anonymization key
cache forms: caches forms associated with a fetch
forget forms: clears memory of forms associated with a fetch
ignore option1: disregards option1, which is one of children, compound, encoding, errors, keepalive, redirect, refresh, truncation
convert using A: makes the source_content of a fetch A
converting from text: specifies the MIME type of a retrieved document as text, such as 'text/html'
converting to text: specifies the MIME type that a retrieved document is converted to before Data Translators are applied
forget cookies: manually drops all cookies added by a fetch
crawl of URL1 to depth N: automatically crawls every link on URL1 to clicking depth N
delay num1: automatically delays a fetch by num1 seconds
crawl of URL1 following if bool1: crawls URL1 so long as bool1 is true (see Chapter 2.5)
head: causes an HTTP request to use the HEAD method
independent: allows for repeat fetches
inline text: allows for text to be your datasource
post text: posts the post data string in text
proxy text: specifies an IP address or location in text
retry num1: retries a failed request repeatedly, num1 times or fewer if the request succeeds
rewrite ps1 using A: rewrites Pseudocolumn ps1 just before a fetch using the expression A
submitting values str1a for str1b to form k: submits form k in addition to loading a request; submits values strna for strnb for as many n as we want
submitting values str1a for str1b if bool1: submits every form where bool1 is true; submits values strna for strnb for as many n as we want
timeout num1: forces a timeout error if the request takes longer than num1 seconds
unguarded: enables a fetch if the fetch is a repeat
user agent text: makes text the user agent for fetching a page
via browser: loads the page as MSIE would
with errors: includes data communication errors when the "error" Pseudocolumn is selected

Document Write Options

append: WebQL adds to the existing file rather than overwriting the file
encoding text: specifies text as the file character set to be used when writing, such as 'utf-16'
type file: specifies the file write type, like text, excel, xml, etc.
fix at num1: specifies the file width as num1
pivot: makes a pivot table view of the written records
raw: writes data to a file without file conversions
stylesheet URL1: specifies a CSS/XSLT stylesheet located at URL1 with which to write the file
transform str1: uses the str1 XSLT stylesheet when writing the file
truncate: opposite of append; always writes over the file even if another node has written to it
with headings: includes aliases as column headers at write-time

Conditional Operators

case when bool1 then A else B end: returns A when bool1 is true, B otherwise; we can have as many when/thens as we want
decode(Key, A1, A2, B1, B2, …, …, Z): if A1=Key then A2, else if B1=Key then B2, ..., else Z
if(bool1, A, B): if bool1 is true then return A, otherwise return B

The Comprehensive Engineering Guide to WebQL
Problem Set Exercises with Solutions

CHAPTER 1

P1.0.1 WebQL is modeled after ______. WebQL is a _________ programming language.
WebQL is modeled after Microsoft "SQL". WebQL is a "table select" programming language ("database-style" is also acceptable).

P1.1.1 Which of the following does WebQL support?
a) read/write on an ODBC-driven database
b) read/write XML with XSLT transforms
c) read/write HTML with CSS stylesheets
d) all of the above
d)

P1.2.1 What happens in the WebQL Studio IDE when you push F5 when a *.wql file is active?
The code compiles and an Execution Window launches. --OR-- The query is run.

P1.2.2 The Network tab serves as a _______. Draw a diagram.
Proxy.

P1.2.3 Pick 2 other Execution Window tabs and describe what they do.
(1) Statistics: keeps track of requests, fetches, bytes in/out, and other network stats
(2) Messages: logs request queuing, loading, etc., as well as pattern extractions and joins
(3) Browser View: views a cell as a browser would
(4) Text View: views a cell as plain text
(5) Output Files: collection of any and all output files generated during query execution
(6) Data Flow: shows the node-to-node flow of data by illustration

P1.3.1 Based on what you saw in Chapter 1.3, how could you write a website with online financial transactions (like Orbitz) in such a way that WebQL Regular Expressions cannot extract the prices?
When the page request for prices comes to the server, every ticket price is assigned the correct price image with a random file name stored in a database. When the random file name is submitted back to the server for purchase, the file name is looked up in the database so the customer is charged the proper amount. Because WebQL sees only what a browser can see, the random price image filenames are meaningless unless we can either (1) see the database of price/image name pairings or (2) use an image-to-text interpreter to tell us what the image is showing.

CHAPTER 2

P2.0.1 What is the result of Select If(1,1,0), If(nvl(null,0),0,1), If(after('Hello','k'),0,1) ?
(1,1,1)

P2.1.1 Write a query that extracts the name of every file loaded by the www.foxnews.com homepage and stores the filenames in an array.
It's not surprising that the first file is a stylesheet.

P2.1.2 Write a query that will extract all bold text from the www.cnn.com homepage.

P2.2.1 Suppose node1 has 4 records, node2 has 2 records, and node3 has 7 records. If all 3 nodes are joined together unfiltered, then how many records are in the child node?
4*2*7=56

P2.2.2 Write a small segment of code that generates the 125 3-tuples (1-5, 1-5, 1-5) in the default viewer of the Execution Window.
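Purely to illustrate the record arithmetic from P2.2.1 behind this exercise, here is the same unfiltered cross join sketched in plain javascript (not WebQL): three sources of 5 records each join into 5*5*5 = 125 records.

var one = [1, 2, 3, 4, 5];
var tuples = [];
for (var a = 0; a < one.length; a++)
  for (var b = 0; b < one.length; b++)
    for (var c = 0; c < one.length; c++)
      tuples.push([one[a], one[b], one[c]]);   // every combination of the three sources

console.log(tuples.length);             // 125
console.log(tuples[0], tuples[124]);    // [1,1,1] ... [5,5,5]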
P2.3.1 Modify your query for P2.1.2 by counting the number of times every bold phrase appears on the www.cnn.com homepage. Filter the results by keeping only bold phrases that have a count greater than 1.

P2.4.1 This code produces the following output: What are the values of RECORD_ID?
(1,1,1)

P2.5.1 Of the CONTROL_NAME variables for the Slashdot.com login form, which ones do we need to submit when using the submitting Document Retrieval Option?
4, 1

P2.5.2 Explain what the Document Retrieval Options do in the above segment of code.
with errors: include network errors
retry 2 with delay 5: repeatedly retry loading a page up to 2 times when there is an error, until the page loads successfully; wait 5 seconds after each failure before retrying
timeout 100: force a timeout at 100 seconds
cache forms: caches forms so the page does not need to be reloaded to submit the form over and over
ignore children: filter out all child pages associated with the URL
rewrite…: fetch google instead of yahoo
convert…: make the SOURCE_CONTENT only what is after the <BODY…> HTML tag, or do no source conversion if there is no <BODY…> HTML tag

P2.6.1 What is the syntax of reading/writing to a db?
table@connection

P2.7.1 Use WebQL Virtual Tables to figure out if MS Access file format is accepted.
*.mdb is accepted.

CHAPTER 3

P3.1.1 Write a circular referencing scheme to generate a table of numbers demonstrating a Fibonacci sequence in the default viewer of the Execution Window. Maintain 3 fields: 1 for the nth number in the sequence, 1 for the (n+1)th number in the sequence, and 1 field for n itself. Make a table for n <= 20.
1 is the first number, 1 is the second number, 2 is the third number, 3 is the fourth number, 5 is the fifth number, 8 is the sixth number, …, and the nth plus the (n+1)st is the (n+2)nd number, etc.

P3.1.2 Write a Circular Referencing scheme to construct a table of values x, y=exp(x) in the default viewer of the Execution Window for x in [0, 10] at .05 intervals. There should be 201 records in the default viewer.

P3.1.3 Do P3.1.2 without using Circular Referencing.

P3.5.1 Write a web spider that submits item codes into the search box at http://www.abcink.com and crawls all pages of results to extract the price, sale price, coupon code, item code, and item description. Be sure to add a column for a time stamp for every price.
Above is the first half of the code to solve P3.5.1. The rest of the code is below. Here is the output from the default viewer for the given input in the node arrayinput. The above output completes P3.5.1.

The Future of WebQL

Since 2000, WebQL has evolved substantially in terms of Document Retrieval Options and Data Translators. More and more functions and Aggregate Functions have been added as well. WebQL has an edge over SQL in the creative application of Regular Expressions to HTML page sources. WebQL is also the most flexible product for web crawling and bulk form submission, making the product best equipped to handle the data demands of the future. The future of WebQL as an educational tool could take a shape similar to Matlab. Adding Aggregate Functions Plot(col1, col2) and Surface3D(col1, col2, col3) that return *.bmp images of graphs would be the first step. The future of WebQL as a web crawler involves allowing the developer to call javascript functions selectively as a Document Retrieval Option.
Currently, all javascript in an HTML page load must be handled manually in WebQL if the javascript effect is desired.

Ethically, many people question the idea of writing a web crawler. Some crawlers go out of control and generate too much traffic. Similarly, hitting a company's site with an overload of traffic can slow down the company's systems unfairly. Some web sites specifically say in a user agreement that the site's information cannot be reused for commercial purposes. Despite these problems, the computer science world should accept web-crawl programming both educationally and professionally. Making full use of communications resources and driving internet bandwidth technology higher and cheaper should not be seen as a problem at a university or corporation.

About the Author

Trevor J. Buckingham has been a university-level tech instructor since age 19. Lab instruction for CS61A with Brian Harvey and GSI instruction for Math 55 with Fraydoun Rezakhanlou were performed at Berkeley in 1999-2000, and exam writing for EE215 with Ceon Ramon was done at the University of Washington in 2003. After studying Electrical Engineering and Computer Sciences at the University of California 1998-2001, Trevor and 4 others started building a data center that has become QL2 Software in Pioneer Square, Seattle. After working full-time, Trevor studied Aeronautics at the graduate level with Ly and Rysdyk and Business in the PTC graduate program with Kim and EVPartners at the University of Washington 2003-2004. He is currently a long-distance runner in Chicago, Illinois, where he authored The Comprehensive Engineering Guide to WebQL. Trevor is one of the founders of the Engineering Entrepreneur Partnership, which hosts an annual TPC golf event at Snoqualmie Ridge.