The Comprehensive Engineering Guide to WebQL
Background

PURPOSE:
(1) To provide engineers with the knowledge base necessary to succeed as WebQL application developers
(2) To provide more extensive examples than the documentation provides
(3) To illustrate how WebQL goes beyond what Oracle, SQL, and other database programming languages can do

PREREQUISITE:
(1) Familiarity with basic HTML tags like <table>, <tr>, <td>, <form>, etc.
(2) Familiarity with Regular Expressions like [^0-9]{2,} and [A-Z0-9]+?
(3) Familiarity with javascript; no need to be an expert at any of (1) through (3)
(4) An understanding of calling functions to perform computations on strings and numbers

PREFACE: This book will get software developers excited about web programming. Because it can load any ODBC-driven database as well as any internet page, WebQL has no limit for applications ranging from competitive pricing to business intelligence. This guide is organized into 4 chapters, 5 appendices, and a Foreword; Chapter 4 will not be available for public consumption. The appendices do not use WebQL and should be read by engineers inexperienced in HTML, javascript, and/or Regular Expressions. Chapter 1 is a short introduction to what the WebQL product is and how the Interactive Developer Environment for Windows looks. Chapter 2 goes through coding basics, and Chapter 3 shows how to crawl around the web and grab data out of every nook and cranny. Chapter 4 shows how to build industrial-strength WebQL applications for a hosted web-mining service. As a teacher, I chose an active first-person voice for this guide, a choice some technical writing experts would debate. The American Institute of Physics' Style Manual (3rd edition, 1978, p. 14), an authority on science and engineering writing, encourages the active first person. Second person alienates the reader, and third person is not as good at keeping the reader's attention.

CHAPTER 1: Introductions

1.0 What is WebQL?

WebQL is a table-select programming language modeled after Structured Query Language (SQL) as used in products such as Microsoft SQL Server. A table-select programming language joins tables together in different ways, filters the results as desired, and writes the results to a database. WebQL does everything Microsoft SQL does and more, and it is built to answer the limitless web programming questions of today.

1.1 What can WebQL do?

Running on a schedule, WebQL can target internet HTML source files, crawl them in real time, use Regular Expressions to extract data out of them, and then write the data to a formatted database. WebQL interacts with any ODBC-driven database, and it reads information from both local and non-local HTML, XML, CSV, txt, pdf, and other file formats. A local file is stored on our machine or intranet, whereas a non-local file requires a fetch over the internet, which WebQL can perform just as a web browser like Microsoft Internet Explorer can. A WebQL program is something like artificial intelligence for a browser. Using table-select programming statements, WebQL surfs through websites and submits HTML forms with whatever values we want. WebQL can read in any database or databases, manipulate the data to look any way we want, and then write the data to any database, create any XML file with XSLT transforms, or create any HTML file with CSS stylesheets. This guide demonstrates how to create custom DHTML environments from WebQL data extracted from the internet.
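To preview where we are headed, here is a minimal sketch of a WebQL query in the select/into/from/within shape introduced in Chapter 2. The URL and the output file name are placeholders, and exact keyword spellings may differ from this sketch.

-- A minimal sketch: load a page, itemize its links with the links
-- Data Translator, and write the results to a CSV file.
select URL            as LinkURL,
       clean(CONTENT) as LinkText
into 'links_report.csv'
from links
within 'http://www.example.com'

Everything in this sketch (the select statement, the links Data Translator, the into destination) is covered in detail in Chapter 2.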
WebQL utilizes the internet as another database ready to be tapped, opening a new horizon for real-time business applications. WebQL's ability to mechanically crawl a website and submit forms allows thousands of hours of manual clicking and copying to be done by a computer program in hundreds of minutes.

1.2 What does WebQL look like?

WebQL Studio is an Interactive Developer's Environment (IDE) for Microsoft Windows. The IDE allows us to write code, run code, illustrate data flow, debug all network transmissions, sort data, view data as browser or text, etc. Before we write any code, let's get a feel for the environment in which we will be working.

Picture 1.2.0: A screenshot of the WebQL IDE

As code warriors, we want to write code and then push the play button. When we click the play icon (or press F5), the active WebQL code window compiles and an Execution Window is launched. If the play button isn't clickable, then our active WebQL window isn't a Code Window (plain text); it's probably an Execution Window. Picture 1.2.0 has one text window and two Execution Windows cascaded behind. Typically, a text window corresponds to a single specific Execution Window.

The Execution Window consists of a default viewer and 7 tabs. Remember that a table is merely rows and columns of data that can be displayed in the default viewer. A table consists of rows (or "records") and columns (or "fields"). A query is a text file of code that, when run, has a corresponding Execution Window with a default viewer. The following diagram is a quick recap of what we know so far about WebQL:

Picture 1.2.0.5: Quick sketch of WebQL Studio

Below is what an Execution Window looks like inside the WebQL Studio IDE:

Picture 1.2.1: A screenshot of the Execution Window in the IDE

The output table in the default viewer is typically used by the programmer for debugging purposes. Output data sets are usually written to CSV files or ODBC-driven databases on delivery to customers. A stereotypical customer of a WebQL hosted application is a pricing analyst wanting to mine a large volume of prices from a competitor's website. Picture 1.2.1 happens to show bank rates instead of prices, although the concept is the same. If we need to see WebQL's reading and writing abilities immediately, Chapter 2.6 discusses them.

The 7 tabs in the lower half of the Execution Window are outlined as follows:

(1) Messages. These are great for pinpointing mistakes in Regular Expressions. Regular Expressions in WebQL are a tool used to match patterns against the HTML source of a webpage and extract targeted information with extraction buffers denoted by parentheses. Messages itemize every page request, every page load, most Regular Expression patterns, and other information. Message verbosity is controlled through View > Options > Environment and is illustrated in Picture 1.2.1.5. The most verbose level (Diagnostics) is recommended.

Picture 1.2.1.5: Setting Message verbosity for debugging (Diagnostics)

Picture 1.2.1 gives an example of the Messages tab with "Diagnostics" set for message verbosity. We might need to lower message verbosity if our WebQL program makes tens of thousands of requests, in order to lower virtual memory (VM) consumption; once a certain VM threshold is exceeded, the WebQL run-time environment crashes. From Picture 1.2.1, the first message states which WebQL code window is being run.
The second message is the queuing of the first page request; the third message is the first page request being made; the fourth message is the completed page load. The fifth message is a Regular Expression pattern with one extraction buffer that extracts a redirection link (in a browser this is done automatically because a browser is javascript enabled). The sixth message is the delaying of the second page request that is a result of the extraction of the redirection link. Eventually, the second page request will be queued, then requested, then loaded just like the first page request. 9 CONFIDENTIAL (2) Statistics. Statistics are for monitoring how much network traffic is being generated. These stats enable statements like, “My 1 page of code generated 300,000 clicks that ripped 10 billion bytes of data.” These data can also be used to infer other stats like successful page loads per minute. Picture 1.2.2: The Statistics tab of the Execution Window (3) Data Flow. This tool is very useful in pinpointing a problem in data filtering. Suppose our code loads 10 data tables from various databases, then joins and filters them throughout 3 pages of code, and then writes the data to a CSV output file. We push the play button, the code runs, and there is no output generated. Code that runs and produces no output is most likely filtering records when we didn’t mean to filter them all. If we look at the Data Flow, we should be able to figure out where the data got cut-off. We can even click in the Data Flow where we think the problem is and the corresponding point in the code will pop-up. 10 CONFIDENTIAL Picture 1.2.3: The Data Flow tab of the Execution Window In Picture 1.2.3, the flow of records appears just fine because the box titled ASSEMBLE has the same number of records derived from the parent boxes RATES1 and RATES2. RATES1 and RATES2 are most likely data tables being assembled side-by-side like in Example 3.5.4. (4 and 5) Text vs. Browser. If our code selects fields into the default viewer of the Execution Window, then we can view those fields as plain text (4) or as a Microsoft Internet Explorer browser views them (5). Image files are not saved locally for the sake of browser viewing, so some images do not appear in the browser view like Picture 1.2.5. 11 CONFIDENTIAL Picture 1.2.4: The Text View tab of a SOURCE_CONTENT field Picture 1.2.5: The Browser View tab of that same field 12 CONFIDENTIAL (6) Output Files. This tab lists independent of directory any and all of the output files produced during query execution. We can view an output file as it is being created by the WebQL query. If the code is going to run for hours, we should check the output files early to make sure they look the way we want. Picture 1.2.6: The Output Files tab The output files for harvesting the bank rates, branches, and ATM locations are in CSV format while the news and stocks information are in both custom HTML output and plain text CSV format. (7) Network. The Network tab is one of the newer enhancements to the IDE. Every single outgoing request with headers is itemized along with each individual response with complete HTML source. Every single character of incoming and outgoing network transmission is revealed here. Picture 1.2.7: The Network tab 13 CONFIDENTIAL If our crawler makes thousands and thousands of page requests, we are better to turn off the Network tab feature because storing every outgoing and incoming byte could cause WebQL’s virtual memory to overflow, which is at 2GB. 
We can turn on and off several features in the same location in the WebQL IDE. Picture 1.2.7.5: Toggle-able features in the WebQL IDE From left to right, the features are Messages tab auto-scrolling, default viewer autoscrolling, Data Flow tab on/off, Network tab on/off. The Network tab is also referred to as a proxy. A proxy is best understood by a diagram: Picture 1.2.8: Seeing the Network tab as a proxy A proxy monitors every piece of outgoing communication and the corresponding response to the PC. The outgoing and incoming arrows in Picture 1.2.8 are analogous to those in the left half of Picture 1.2.7. 14 CONFIDENTIAL 1.3 How do websites and the internet work? The internet can be thought of as a large number of computers connected together both with and without wires. A website is a collection of files on a computer (or “server”) connected to the internet that creates communication sequences with another computer operated by a (usually) human user. In addition to a server that houses a collection of files, a website could have a large database of information attached to it. A website is symbolic of artificial intelligence because a human user is interacting with computers to achieve some sort of automated goal. That goal could be buying an airline ticket, downloading photographs, or learning, among other automated goals. The internet is used by a human typing and clicking on his machine thereby making it communicate with another machine. As internet users, we need to know how to work our machine but in the course of achieving our automated goal, we really don’t care about how our machine talks to a website nor how the software (web browser) on our machine works. Picture 1.3.1: Giving meaning to user interactions and server interactions Sending information to a website from our computer allows a website to take that information and create any file in return. How the file returned is conjured up is hidden from us. The hidden behind-the-scenes tasks can only happen when there are server 15 CONFIDENTIAL interactions. A quick example of “hidden behind-the-scenes tasks” would be a welcoming message for a website. We click a link to a website (user interaction), which causes a request for a page to be sent from our computer browser to a website (server interaction). There is code written on the website that checks to see if the current day is a holiday. If the current day is a holiday, then the page returned to the browser on our computer has a welcome message, “Have a nice holiday” otherwise the page returned to the browser on our computer has a welcome message, “Have a nice day.” No matter how many user interactions we make and even if we view the source code for the browser page, we cannot figure out the exact code that creates the welcome message—we only know what the welcome message is. Again, this type of “hidden behind-the-scenes tasks” can only happen when there are server interactions. Some user interactions cause server interactions while others don’t. Suppose we download a Boggle game that does cool things like highlights the letters as we click them and draws lines with arrows on them. We can try a weboggle game at http://weboggle.shackworks.com/. The letter highlighting and arrow drawing do not cause server interactions. When the game is downloaded through the browser, the browser page source code contains everything it needs to play the highlighting and arrow-drawing games. What doesn’t get downloaded is the gigantic dictionary of words that is needed to verify any guess. 
Verifying a guess causes a server interaction. The database of words in the dictionary is a part of a website and a part of the “hidden behindthe-scenes tasks” that makes the internet more effective in achieving our automated goal, which is entertainment in the case of internet Boggle. The process of programming subtle server interactions to do things like verify the spelling of an English word without loading an entirely new page from the server is an example of AJAX and is a popular style of writing web code in 2006. Teaching the details of AJAX is not a goal of this guide. The collection of files on a server that constitute a website contains files typically written in 3 or 4 different programming languages. The source code of files that makes it to our computer typically contains 2 different programming languages. 16 CONFIDENTIAL Picture 1.3.2: Typical programming languages in the internet PHP is server scripting. Javascript is client scripting. Why, then, do server pages contain javascript? Javascript executes in a javascript-enabled browser. The collection of files on a server has javascript in them to be sent to the user to interact with in a browser. PHP can write (or “echo”) javascript or HTML. Suppose that on a holiday, a javascript holiday game gets written by PHP on a server and passes it to the browser for a user to play with. To better understand how we use the internet, we must walk through the 5-step process of what happens when we command our browser to load files from the internet: 1) We click a link or else type a Universal Resource Locator (URL) into the browser. 2) The server replies with the appropriate file. Before the appropriate file is sent back our browser, the file that we want on the server contains HTML, PHP, mySQL, and javascript. The PHP code embedded throughout the HTML and javascript gets executed and the results are sent back to the computer browser. The source code of the file in the browser is the result after the PHP code has been executed. In addition to being embedded throughout an HTML/javascript source file on the server, the PHP code can actually write javascript functions or more HTML for us to interact with in the browser. The PHP code can also use mySQL to access a large database of photographs and/or product listings to be displayed for us in the browser. 17 CONFIDENTIAL 3) After getting a reply in the browser, we can interact with the HTML and javascript to set up an airline ticket purchase or, more generally, to enable the process of achieving our automated goal. Sometimes javascript games only cause user interactions where as buying an airline ticket causes javascript to submit forms that cause server interactions. Because javascript runs on our computer (the “client”) and causes both user and server interactions, javascript is considered client-sided scripting. 4) Eventually from a form submission or another click, our user interactions are again going to cause server interactions. Let’s say we send over dates and times for airplane tickets to the server. The information that we send fetches us another page (or “file”) from the collection of files comprising the website. Before we see the results page that we want, the PHP and mySQL are executed. The PHP runs and uses mySQL to grab the appropriate airline tickets out of a database and the results are sent back to the browser on our machine. Because we have no idea on how the PHP code runs on their database and server, PHP and mySQL are considered server-sided scripting. 
5) Sometimes our automated goal requires several more steps, which cause more information to be sent to the server. Go back to step 3).

Picture 1.3.3: Where client-sided and server-sided scripting are executed

Clicking around the internet and achieving our automated goals require the harmonious interaction of at least 3, and probably 4 or more, programming languages. More web programming languages and concepts like Flash are being developed all the time, so there isn't really a limit to the intensity and fun of scripting on either the client side or the server side. Intense scripting is not a goal of this engineering guide; however, having a feel for how internet pages talk to a PC and back will help make us better WebQL programmers. WebQL programming works off of what a browser sees in terms of source code, so we typically see only 2 programming languages when we write WebQL spider code (javascript and HTML).

CHAPTER 2: Coding Basics

2.0 The select statement

The WebQL programming language is a table-select programming language. Our code will look like nothing more than standardized statements that select tables and join them together. All sample code outside of IDE snapshots will use Verdana font to differentiate it from prose. An entire select statement is called a node.

select 'Hello World!'   --Example 2.0.0

Above is our first WebQL query. Notice how the statement selects a table that has one row and one column containing a text string. Because no output destination is specified, the assumed destination is the default viewer in the Execution Window. All text on a line appearing after -- is a comment and does not influence the code. Another way to comment large segments of text is to begin the comment with /* and end it with */.

We can name (or "alias") the table (or "database") and fields (or "columns") any way we want, and we can add as many fields as we want; Example 2.0.1 (an IDE snapshot) does exactly that. The default names for nodes are SELECTn, where n - 1 is the number of select statements that precede the node in the file. The default names for fields are EXPRn, where n is the corresponding column in the table for the expression. Using specific aliases that we choose, like in Example 2.0.1, makes the code easy to read. Now that we have gotten a feel for how to run something and get output, let's play with a few functions so we feel at home as software developers.

A comprehensive index of functions is available through the IDE help index; go to Help > Contents > Index. In addition to functions, the help index lists all keywords in the WebQL programming language. The help index is illustrated in Picture 2.0.0.

Picture 2.0.0: Hunting through the help index for functions and keywords

In this guide, Appendix V is a WebQL Giant Function Table that serves as a quick reference for WebQL's most popular functions. From Example 2.0.2, the syntax for if conditions is clear (use <> for unequal); html_to_text is a function that removes all HTML tags; replace is a function that is a bit more complicated. replace takes arguments that alternate string, Regular Expression, string, Regular Expression, string, and so on. The first Regular Expression is matched against the first string and its matches are replaced by the second string; the second Regular Expression is matched against the first string and its matches are replaced by the third string, and so on. The Regular Expression S{10} means match the letter S exactly 10 times.
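Example 2.0.2 itself appears only as an IDE snapshot, so here is a hedged sketch in the same spirit. The literal strings are invented placeholders, and the if(condition, value-if-true, value-if-false) form is assumed from the description above; only the <>, html_to_text, and replace behaviors follow the text directly.

-- A sketch in the spirit of Example 2.0.2.
select if('10' <> '11', 'unequal', 'equal')        as Compared,
       html_to_text('<b>WebQL</b> is <i>fun</i>')  as NoTags,
       replace('SSSSSSSSSS rocks, NE1 can see that',
               'S{10}', 'WebQL',     -- S{10}: the letter S exactly 10 times
               'NE1',   'anyone')    as Replaced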
NE1 is also a Regular Expression but does not use any special Regular Expression characters. Replace is a powerful function like extract_pattern, which will be seen later on. 23 CONFIDENTIAL Instead of generating strings explicitly in the WebQL code like Examples 2.0.1 and 2.0.2, more commonly we will need to read input from a CSV file. CSV stands for “Comma Separated Values” that are organized into rows (or “records”). A CSV file is a table, which is a database. Our next example introduces the concepts of both datasources and Data Translators. In this case, the datasource is the CSV file and the Data Translator is CSV—the mechanism that will allow us to select columns out of the datasource. Suppose we want to read as input a 2-column file containing a Yes/No column for alcoholic beverage availability for some particular markets of a given airline, with the corresponding airport codes for the flight. Commercial airport codes are always 3 characters in length, so an origin/destination combo should be of length 6. 24 CONFIDENTIAL The proper way to read this code is, “Apply the CSV Data Translator to the input.csv file. Next, make a table that performs checks and adjustments to column one (C1) and column two (C2) as written.” For an unknown reason, row 2 column 2 of the input file did not contain 6 characters after trimming outside whitespace, so our own error was triggered. The CSV Data Translator is one of the most basic of all Data Translators in WebQL. C1, C2, …, Cn are Pseudocolumns of the CSV Data Translator. The next section is titled Data Translators and Pseudocolumns. The CSV Data Translator has only one other selectable Pseudocolumn. ROW_ID is the other Pseudocolumn of the CSV Data Translator that represents the sequential row of input from the CSV file. In this case, the ROW_ID is naturally displayed along the left side of the Execution Window. If we want to manipulate ROW_ID later on, then we will have to select the ROW_ID Pseudocolumn explicitly as a field and alias it if we want to. We will see that in Example 2.2.1. There exist several parts to a select statement. So far, we’ve seen select, from, within. into, where, group by, sort by, and having are the other major components. Remember that as WebQL application developers, our code will always be lots of select statements (also known as “tables,” “databases,” or “nodes”) that end up getting joined together and filtered in creative ways to make a final node or set of final nodes that write to local or non-local destinations. The general format of a select statement is heuristically the following: select <these fields> into <this destination> from <this Data Translator applied to…> within <…this file/db/source> where <this and that are true or false or matching or equal, etc.> 25 CONFIDENTIAL Our code will always look like this at every node and can potentially be missing any and all parts except for the selected fields. 2.1 Data Translators and Pseudocolumns This section on Data Translators and Pseudocolumns is intended to introduce various techniques for extracting data out of a datasource. WebQL is effective when using HTML pages as datasources. Later on, we will learn how to crawl these sources and grab any information that we want along the way. The first Pseudocolumns that we will discuss are independent of any Data Translator. Basically, in any select statement at any point in the code, these fields are fair game to be selected. 
The truth is that there are exceptions, one being that RECORD_ID is sometimes not a selectable field in an outer node (see Chapter 2.4 for inner-outer node relationships). Some of these Pseudocolumns only make sense when we specify a datasource, and some are particularly useful when the datasource is an HTTP URL. A datasource follows the word within of the select statement, or it follows the word from when no Data Translator is specified.

Table 2.1.0: Pseudocolumns always available to select

Pseudocolumn       Description
CHUNK_ID           The sequential ID of the current chunk of data. An HTML page can be sliced into segments ("chunks") if the CONVERT USING Document Retrieval Option is used. The value is NULL if we are not using a chunking converter.
CONTAINER_URL      The URL of the document from which WebQL extracted the current record. The value is NULL if the source document does not have a parent (not a WebQL parent, but an HTML parent).
CRAWL_DEPTH        The depth of the current source document within the current crawl. The value is NULL if the current document was not produced by a crawl.
ERROR              The error of the current fetch. ERROR is NULL if the fetch succeeds. Use this field with the "with errors" Document Retrieval Option.
FETCH_ID           The sequential ID of the current fetch. This ID is global across a query and is always increasing.
REQUEST_URL        The URL initially visited to load the document, prior to any redirection.
REQUEST_POST_DATA  The POST data initially submitted to load the current document, prior to any redirection.
PARENT_URL         The URL of the document from which the link to the source document was extracted within the current crawl. The value is NULL if the current document was not produced by a crawl.
RECORD_ID          The sequential ID of the current record within the current node.
SOURCE_CONTENT     The content of the document from which the current record was extracted.
SOURCE_POST_DATA   The POST data submitted to request the current document.
SOURCE_RECORD_ID   The sequential ID of the current record within the current document.
SOURCE_TITLE       The title of the document from which the current record was extracted.
SOURCE_TYPE        The MIME type of the document from which the current record was extracted.
SOURCE_URL         The URL of the document from which the current record was extracted.

A table similar to Table 2.1.0 is available through the IDE help index (look for "Pseudocolumns"). A lot of these do not make sense yet because we haven't been selecting anything from the internet, so let's select all of these fields on a crawl of a webpage to a certain depth and see how they vary. There are simple Document Retrieval Options that we use to crawl a URL; the details of the major Document Retrieval Options are in Chapter 2.5. When we say that we are going to "crawl" a URL to depth k, we are going to load the URL and all links on the page (to depth 2), and all links on the pages that get loaded (to depth 3), etc. Example 2.1.1 crawls http://www.yahoo.com to depth 2 and views all Pseudocolumns from Table 2.1.0 (a sketch of such a query appears below). All null output columns are smashed to fit the image.

In addition to all of these Pseudocolumns, we also have the option of selecting the Pseudocolumns of a Data Translator if we choose to use one. A Data Translator is a tool that creates Pseudocolumns out of a datasource. A webpage is an example of a datasource.
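Here is the hedged sketch of the Example 2.1.1 crawl promised above. It uses no Data Translator, only a representative subset of the always-available Pseudocolumns; the wording of the crawl Document Retrieval Option is an assumption (the real spelling is covered with the other options in Chapter 2.5).

-- Crawl yahoo's homepage to depth 2 and watch how the
-- always-available Pseudocolumns vary from fetch to fetch.
select FETCH_ID, CRAWL_DEPTH, PARENT_URL, REQUEST_URL, SOURCE_URL,
       SOURCE_TITLE, SOURCE_TYPE, RECORD_ID, SOURCE_RECORD_ID
from crawl of 'http://www.yahoo.com' to depth 2   -- option spelling assumed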
The best way to conceptualize a Data Translator is as a lens that converts a datasource into rows and columns.

Picture 2.1.0: Data Translators as lenses

The Pseudocolumns that we want to select determine which Data Translator (or "lens") we want to use. Picture 2.1.0 shows any page from the internet as the targeted datasource. We can think of Data Translators as a set of differing lenses through which to see a datasource. Using multiple Data Translators in a single node can cause unpredictable behavior, which suggests that separating the Data Translators across multiple nodes is a safer and cleaner way to code; we will see that in the next section on joining nodes.

Let's review each Data Translator and the associated Pseudocolumns. This will take a while. We'll begin by listing the Data Translators with their associated Pseudocolumns in a table.

Table 2.1.1: Pseudocolumns of various Data Translators

DATA TRANSLATOR  PSEUDOCOLUMNS
images           URL, CAPTION
links            URL, CONTENT, OUTER_CONTENT
table rows       Cx, TABLE_PATH, TABLE_ID, TABLE_CONTENT, ROW_ID, ROW_CONTENT, COLUMN_COUNT
table columns    Rx, TABLE_PATH, TABLE_ID, TABLE_CONTENT, COLUMN_ID
table cells      CELL, TABLE_PATH, TABLE_ID, TABLE_CONTENT, ROW_ID, ROW_CONTENT, COLUMN_ID
pattern          ITEMx, COLUMN_COUNT
upattern         ITEMx, COLUMN_COUNT
empty            (none)
CSV              Cx, ROW_ID
TSV              Cx, ROW_ID
forms            FORM_ID, FORM_URL, FORM_METHOD, FORM_CONTENT, CONTROL_NAME, CONTROL_TYPE, CONTROL_DEFAULT, CONTROL_VALUES
table values     <Pseudocolumns vary by page>, Cx
tables           TABLE_PATH, TABLE_ID, TABLE_CONTENT, RxCk
sentences        SENTENCE
pages            PAGE, CONTENT
paths            ITEMx
full links       URL, TYPE, AS_APPEARS
lines            LINE, CONTENT
mail             MESSAGE_ID, FOLDER_PATH, SENDER_NAME, SENDER_ADDR, SUBJECT, SEND_TIME, RECIPIENT_NAME, RECIPIENT_ADDR, BODY, HTML_BODY, HEADERS, TO, CC, BCC, ATTACHMENTS
headers          NAME, VALUE
RSS              TITLE, LINK, DESCRIPTION
snapshot         IMAGE
values           <Pseudocolumns vary by page>
text_table_rows  TABLE_ID, TYPE, ROW_ID, COLUMN_COUNT, DELIMITER, Cx

Combining powerful functions, creative filtering tricks, and Data Translators, small amounts of code can perform surprising tasks that we would never even think of unless somebody told us. For example, if we want every image off of yahoo's homepage that doesn't come from the yahoo.com server, it's just a few characters (a sketch of this query appears below). If we want every non-yahoo.com domain image off of yahoo's homepage plus every non-yahoo.com domain image off of every page that we can link to off of yahoo's homepage, then it's only a few more characters to type. In this case, there are thousands of such images, but less than a thousand of them are unique. I will use the unique feature to eliminate repeated records. The records consist of 2 fields: URL and CAPTION. Can repeated images still exist even when unique is used? If the same image is coming from 2 different servers or has 2 different file names, the URL field will not match, and thus unique won't eliminate the duplicate.

If we only want a count of these images, or a count of how many times an image is repeated by URL, we can use the count Aggregate Function. Aggregate Functions are explained in Chapter 2.3, and Examples 2.3.1-2 follow the previous 2 examples.

The links Data Translator is very similar to the images Data Translator. Instead of a URL and a CAPTION (the CAPTION is the alt text of the <IMG> HTML tag), the links Data Translator has URL, CONTENT, and OUTER_CONTENT as Pseudocolumns. The links Data Translator itemizes every HTML anchor on the page.
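Before looking at how the links Data Translator copes with javascript anchors, here is the sketch promised above of the non-yahoo.com image query. The unique keyword placement and the not matching operator are assumptions based on this chapter's descriptions.

-- Every image on yahoo's homepage whose URL is not served from yahoo.com.
select unique URL, CAPTION
from images
within 'http://www.yahoo.com'
where URL not matching 'yahoo\.com'
-- Crawling the homepage to depth 2 instead of loading it alone turns this
-- into the second query described above (every linked page's images too).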
Some HTML anchors call javascript functions instead of providing URLs, so the URL is a javascript function call that WebQL does not perform. In such a case, we must find the definition of the javascript function being called and do in WebQL what the javascript is doing. Suppose an anchor tag calls a javascript function with one argument: a directory/file location string that must be form-submitted to trigger the page/file download we want. Here is an example of one such anchor tag:

<a href="javascript:GetMyBaseURL2('path1')"><font face="times" size=-2>click here for file</font></a>

Here the value of the URL Pseudocolumn is:

javascript:GetMyBaseURL2('path1')

The value of CONTENT is:

<font face="times" size=-2>click here for file</font>

and the value of OUTER_CONTENT is the entire anchor tag binding, including the anchor tag itself. In QL2's Regular Expression language, OUTER_CONTENT is represented by the extraction buffer

(<a\s*[^>]*>.*?</a>)

Hopefully Regular Expressions are familiar. If not, we will learn more about them later in Table 2.1.2; in addition, Appendix III is a quick reference for building Regular Expressions. Suppose that preceding this anchor tag in the HTML code is the javascript function GetMyBaseURL2:

function GetMyBaseURL2(myStr) {
  f.myDirection.value = myStr;
  f.submit();
}

The first thing we must do is extract the 'path1' file/path name by using a Regular Expression. We then submit the form named f with the value of the myDirection field set to the name of the file/path that got extracted. Here, we are trying to conceptualize how sometimes we can get lucky with the links Data Translator and immediately find a page URL that we want, and the rest of the time we have to jump through javascript hoops just to fetch a page. Here's an example of an HTML page where we can't use the links Data Translator to find a link to follow because of a javascript form submission:

Picture 2.1.0.4: Clicking links that perform javascript

The code to crawl one of these links uses a pattern Data Translator to extract the desired path/direction from the HTML page source in Picture 2.1.0.5 and then submits the form named f. We need to save the file in Picture 2.1.0.4 locally to 'c:\Ex2.1.3.5.html' before running the code in Example 2.1.3.5. This example shows how to submit a form when the form submission is the result of a javascript function call. In this example (Example 2.1.3.5), the submitting Document Retrieval Option uses a robust Regular Expression to match the HTML form named f. The concepts of joining nodes together and form submission are covered in more detail later in Chapter 2; the code is provided now to complete the example. The Document Retrieval Option cache forms is used so that the forms in Ex2.1.3.5.html are loaded once in the node getdirection and zero times in the node submitform. Here is the source code of the Ex2.1.3.5.html file, available for download at http://www.geocities.com/ql2software/Ex2.1.3.5.html.

Picture 2.1.0.5: HTML source code of Ex2.1.3.5.html

Moving on to a more basic example using the links Data Translator, suppose that we want every link off of a page, but we only want to see the non-javascript links. Expedia is a website that is very javascript-intense, perhaps unnecessarily so. The only two fields that we select are URL and OUTER_CONTENT; there is no need to select the CONTENT Pseudocolumn because we have the CONTENT if we have the OUTER_CONTENT.
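The Expedia query just described appears only as an IDE snapshot; a hedged sketch of it follows, with the not matching operator again assumed.

-- Every non-javascript link off of Expedia's homepage.
select URL, OUTER_CONTENT
from links
within 'http://www.expedia.com'
where URL not matching '^javascript:'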
Sometimes loading Expedia's homepage requires us to add a cookie and sometimes it doesn't. If we are ever having trouble getting the right page out of Expedia, add "adding cookie 'jscript=1;'" after the URL. This feature is one of many powerful Document Retrieval Options, which are covered in Chapter 2.5.

Moving on to CSV and TSV, these Data Translators allow us to select the columns out of CSV and TSV files as if the files are data tables (they are). Again, CSV stands for "Comma Separated Values" and TSV stands for "Tab Separated Values". Besides selecting out column n with Cn, the only other Pseudocolumn is ROW_ID, which is the row of a record from the CSV/TSV file. We will see lots of CSV file interaction in this guide; we could have just as easily used TSV instead.

The next Data Translators to be discussed are pattern and upattern. upattern is optimized for handling double-byte characters on international sites; on an American website that uses English, pattern should suffice. The concept of a pattern Data Translator is to craft extraction buffers (symbolized by parentheses) within Regular Expressions to extract wildcard text strings, often prices of products or services that change daily or hourly. The first wildcard pattern that we'll learn is <html>(.+?)</html>, which means extract any character 1 or more times between the HTML tags. Effectively, the extraction buffer strips the begin and end HTML tags off of an HTML file. Below is a table of Regular Expression wildcards and what they mean.

Table 2.1.2: Commonly Used RegEx in WebQL

RegEx              Explanation
.                  match any 1 character
.*                 match as many characters as possible 0 or more times
.+                 match as many characters as possible, but at least 1 or more
.*?                match as few characters as possible 0 or more times
.+?                match as few characters as possible, but at least 1 or more
[^>]*              match as many characters as possible up until the end of an HTML tag
[^<]*              match as many characters as possible up until the beginning of an HTML tag
(?:A|B|C|D|F)      match one character A, B, C, D, or F
(?:A|B|C|D|F)?     optionally match one character A, B, C, D, or F
<tr[^>]*>.*?</tr>  match any HTML table row
colou?r            match the word color in American or British English
20[0-9]{2}         match any year 2000-2099
Reali(?:s|z)e      match the word realize in American or British English

The parentheses in the grouping expression (?:) do not represent an extraction buffer; they represent a group. The vertical bars symbolize disjunction: for example, (?:bear|cub) matches the word bear or cub. A group can end with a ? just like a .* can. There is more information on Regular Expressions in Appendix III.

Let's create an example. We'll implement the SOURCE_TITLE Pseudocolumn using the pattern Data Translator (a sketch of the query appears below). The first and only extraction buffer in the Regular Expression corresponds to item1 in the data selection. If a second extraction buffer existed in the Regular Expression, then it would correspond to item2 in the data selection. The pattern implemented says, "Give me all characters between the HTML <title> tags and then immediately match the remainder of the page." The pattern Data Translator is faster with the .* at the end because the <title> HTML tag is not looked for beyond the first occurrence. Regular Expressions of the pattern Data Translator in WebQL can have as many extraction buffers as we want and can become as complicated as we want. Try not to write anything that is too big of a headache to look at.
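Here is the promised sketch of the title-extraction query; it is short enough not to be a headache. The way the Regular Expression is attached to the pattern Data Translator is an assumption, while the expression itself follows the description above.

-- Re-implement SOURCE_TITLE with a single extraction buffer.
select item1 as PageTitle,
       SOURCE_TITLE                        -- for comparison with the extracted value
from pattern '<title>(.+?)</title>.*'      -- attachment syntax assumed
within 'http://www.yahoo.com'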
Try to write expressions in small pieces then piece them together into something bigger for the sake of debugging. The next example is at the upper bound of what is acceptable for Regular Expression cluddyness. The first thing to notice is where the extraction buffers are and how many there are. The buffer that starts first is item1, the buffer that starts second is item2, etc. Why did this query produce only one output record into the default viewer considering that Orbitz 41 CONFIDENTIAL probably has dozens of tables on every page? We ended the pattern with .* which we know means match as many characters as possible 0 or more times, thus that .* matches the vast majority of the page, so we actually end up getting the first table, the first row in the first table, and the first cell in the first row of the first table. It’s not surprising that the first cell is some text and links welcoming us to sign in or register. If we take away the .* at the end, then what are the extraction buffers extracting? It looks like we are getting more tables than before. The truth is that we are getting the content of every table on the top layer that has at least one row containing at least one 42 CONFIDENTIAL column, and we are actually getting the content of that row and that column associated with every such table. There happen to be 5 in this case. We also used the ignore children Document Retrieval Option which cuts off any of Orbitz’s child source pages. Major travel sales corporations load advertisements and other information not of interest through child pages; we chose to ignore them. Another technique using pattern is to use lots of pattern Data Translators all ending with the .* trick described in Example 2.1.5. This way, every translator generates only 1 record, so there is no confusion over inconsistent behavior in record spreading. 43 CONFIDENTIAL We notice that the first price is $0.00, which is the amount in our shopping cart. pattern2 tells us if the phrase “Office Supplies” appears anywhere case-insensitive on the page. The FirstAncor (with anchor spelled wrong) is the inside of the first <a …> </a> HTML tag, which is an image of some sort in this case. Consider using two pattern Data Translators where one of them extracts 3 records and the other extracts 2 records. What should happen? We are going to have holes in our data table in the default viewer and wherever else we are writing them. This also ties into 44 CONFIDENTIAL to a major theme for this guide about how using multiple Data Translators in the same node is almost never a good idea. The only general case exception to that rule is using pattern and empty together. Suppose a website lists sale prices a certain way on their homepage. If nothing on the homepage is on sale, then the pattern doesn’t match. If the pattern Data Translator doesn’t match anything on the page, then a null table (0 records) is produced. If we want to trigger an error message instead of a null table then we use the empty Data Translator with the pattern Data Translator like this: Even though the pattern does not match the page anywhere, the empty Data Translator forces a single record to appear and nvl converts the NULL item1 into the string ‘No 45 CONFIDENTIAL Sale Price’. nvl is a function that returns its first argument unless its first argument is null, in which case nvl returns its second argument. Moving on to the forms Data Translator, forms is great at helping us program with the submitting Document Retrieval Option. 
To use submitting, we want to find out what values we can submit without hunting and pecking through an HTML source. The forms Data Translator is perfect for that task. This very short segment of code tells us everything we are going to need about submitting one of the 2 forms on www.hotels.com. CONTROL_NAME represents the variables on the form that we may submit. Nodes that use the submitting Document Retrieval Option are reviewed extensively in Chapter 2.5. For now, all we need to know 46 CONFIDENTIAL is that this program that uses the forms Data Translator gives us the form break-down that we need to successfully submit forms in Chapter 2.5. We are moving on to some of the table Data Translators, namely table rows, table columns, and table cells. To get a feel for what the Pseudocolumns do for us, we can write 3 queries similar to Example 2.1.9. Using the select * trick will trigger every Pseudocolumn out of the specified Data Translator that is not italicized in Table 2.1.1. We use select * only in this case or in the case when we don’t care what we’re selecting (believe it or not, that does happen— see Picture 4.1.2). 47 CONFIDENTIAL Below are the Execution Windows for these 3 examples. Picture 2.1.1: Default viewer pics of Examples 2.1.10-12 We notice that the TABLE_PATH, TABLE_ID, ROW_ID, and COLUMN_ID cells all overlap each other. We should be getting a feel for how we could use any one of the 3 Data Translators to perform the same data extraction task. 48 This row/column/table CONFIDENTIAL information is useful for filtering, but how are these Data Translators applied most effectively for data extraction? Because numeric pricing and interest rate applications are hot right now in the post-2000 internet era, we’ll pull every data cell out of an HMTL table using table cells to exemplify the power of the table cells Data Translator. To keep this guide complete without the need of a computer, here is a browser screen image of the table located at http://www.geocities.com/ql2software/tableFun.html. 49 CONFIDENTIAL Picture 2.1.1.5: tableFun.html Several examples throughout this guide will refer to this table. Taking the example one step further, suppose we had a page where we wanted prices out of all 275 price cells, then ‘\$’ on the filter is a good idea because ‘\d+’ will match too many cells. 50 CONFIDENTIAL table rows and table columns are slightly different. Directly referencing columns out of the table rows Data Translator and directly referencing rows out of the table columns Data Translator are where these two Data Translators are most effective. The two-column stock table at http://www.geocities.com/ql2software/tableFun.html is quickly formatted into selectable columns. Huge data tables with 10 or more columns embedded in HTML sources that are 3 or more pages long are equally easily parsed out by the Cx fields of the table rows Data Translator. 51 CONFIDENTIAL Similarly, we can target different information by selecting rows out of the table columns Data Translator. Here, we don’t bother to alias the fields for writing column headings—we will let the rows transpose themselves into columns. We use the clean function to remove HTML and outlying whitespace. Whether using Cx column selectors with table rows or Rx row selectors with table columns, a great way to debug the translator completely is to look at both the italicized Pseudocolumns and non-italicized Pseudocolumns of Table 2.1.1 simultaneously. 
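As a hedged sketch of that debugging idea, using the tableFun.html page pictured earlier (the column-selector syntax follows the Cx convention described above):

-- Select the Cx column selectors and the descriptive Pseudocolumns of
-- table rows side by side, so each extracted column can be compared
-- against the row and table it came from.
select C1, C2,
       ROW_ID, ROW_CONTENT, TABLE_ID, TABLE_PATH
from table rows
within 'http://www.geocities.com/ql2software/tableFun.html'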
52 CONFIDENTIAL For a given ROW_CONTENT, to what do the Cx column selectors correspond? What is the TABLE_CONTENT for a given ROW_CONTENT? These types of questions can be answered by using this idea of selecting both italicized and non-italicized Pseudocolumns out of a Data Translator. The notion of selecting both italicized and non-italicized Pseudocolumns can be applied to any Data Translator. We’ve learned a lot about Data Translators in WebQL so far, but the truth is that we’ve not yet completed discussing half of them! We can jump to Chapter 2.2 on joining nodes if we feel like learning the rest of the Data Translators later. 53 CONFIDENTIAL Let’s continue by discussing two other table-related Data Translators, text_table_rows and tables. Let’s look at Pseudocolumns of tables. We’ve targeted the tables on Expedia’s homepage that have forms on them, and we are looking at the Browser View tab of TABLE_ID=9 in the Execution Window in Example 2.1.17. Notice how the TABLE_PATH and TABLE_ID correspond to the 3 previously mentioned table-related Data Translators (table rows, table columns, and table cells—see Examples 2.1.10-12). 54 CONFIDENTIAL tables is also an effective Data Translator when we explicitly reference a cell by the RxCk Pseudocolumn. We easily cut a stock index and its tick out of an HTML table in the above example with the tables Data Translator. Using the text_table_rows Data Translator, we get quite a different picture of the Expedia page than any other table-related Data Translator. 55 CONFIDENTIAL The way the data come out through this Data Translator suggest that HTML pages are not where text_table_rows is most effective. Again, we are using the trick of select * plus manually referencing columns simultaneously to fully debug the Data Translator on the page. text_table_rows works best on files that represent data tables that use a | or / or a similar character to delimit the columns in the table. Moving on, the sentences Data Translator has only one Pseudocolumn, SENTENCE. 56 CONFIDENTIAL WebQL is quick and powerful at calculating information about characters per sentence in this example. This particular example uses mathematical Aggregate Functions to column-wise calculate numbers. We’ll learn in detail about Aggregate Functions in Chapter 2.3. Cleaning the SOURCE_CONTENT with the convert using Document Retrieval Option changes the source page into only the words that we see in a browser and also removes any redundant whitespace. In addition to the links Data Translator discussed in Example 2.1.4, there is also a full links Data Translator. To get every link off of a page including flash files and stylesheets, we must use full links that has slightly different Pseudocolumns than the links Data Translator. 57 CONFIDENTIAL Looking at the default viewer in the Execution Window, we notice the difference between URL and AS_APPEARS. Record 110 shows that the way the link appears in the HTML already has the appropriate URL base attached to it automatically in the URL Pseudocolumn. There are a total of 12 different link types including Anchor, Base, Detected, Flash, Form, Frame, Image, Object, RSS, Script, Stylesheet, and Unknown. Next is the snapshot Data Translator. It captures *.bmp browser-based screenshots of whatever datasource we specify. 58 CONFIDENTIAL We now see how easy it is to automatically get a glimpse of our top 10 or even 500 competitors’ websites quickly and easily. The page and lines Data Translators are great for plain text documents and pdf files. 
Here, we have a pdf file by page of an HKN CS GRE review guide: 59 CONFIDENTIAL We could easily apply a filter for pages matching topics (substrings) and/or Regular Expressions. Applying the lines Data Translator to the same pdf file, we can filter to get the line number of every line on the topic of recursion. 60 CONFIDENTIAL Immediately in our 41 page pdf study guide, we see where the topic of recursion is mentioned. To better understand the paths Data Translator, we can study XPath, which is a querying language for extracting data out of XML. Appendix IV contains a link to information on XML and XPath. This next example implements the links Data Translator using the paths Data Translator. 61 CONFIDENTIAL There is no need to worry about being unfamiliar with XPath because the most popular websites for business applications are done in HTML. RSS is a standard XML format for news circulation; it’s not surprising that there is an RSS Data Translator in WebQL. To grab data like news links straight out of an RSS URL, use the RSS Data Translator. The Pseudocolumns are TITLE, LINK, and DESCRIPTION. 62 CONFIDENTIAL The RSS Data Translator gives us the power to target and accurately filter news links by the hundreds within seconds. The data are directly ripped out of the XML in the Slashdot RSS page source. How many cookies do websites try to tattoo to our browser the instant we hit them? We can uncover this sort of information quickly using the headers Data Translator. Any cookie setting occurs in the header of an internet request. 63 CONFIDENTIAL This particular megastore for piping, fasteners, etc. sets dozens of cookies when we hit their homepage. We could build an application that hits hundreds of popular websites and does cookie analysis. We could look at the number of cookies, the length of the cookies, etc. Moving to the next Data Translator, mail is another file-specific Data Translator such as CSV, TSV, and RSS. It gives quite a variety of Pseudocolumns to select from within pst files. Suppose we had a large collection of emails in pst format, we could easily filter them by subject or by phrases (Regular Expressions) that match the body text of the messages. If the emails we want to look at work on the pst system then the Pseudocolumns are self-explanatory and there is no need for an example. The 15 Pseudocolumns of the mail Data Translator are MESSAGE_ID, FOLDER_PATH, SENDER_NAME, SENDER_ADDR, SUBJECT, SEND_TIME, RECIPIENT_NAME, RECIPIENT_ADDR, BODY, HTML_BODY, HEADERS, TO, CC, BCC, and ATTACHMENTS. 64 CONFIDENTIAL The final two Data Translators are values and table values. The Pseudocolumns for these two Data Translators are similar, but table values also has column referencing ability C1, C2, … . Recall Picture 2.1.1.5 for these two examples. Creatively, this example shows how to cut a subsection of a table without using a where filter. Using the values Data Translator, we can pick out certain stock information based on the row’s name column, which is the first column of the table. 65 CONFIDENTIAL Now we have all stock ticks we asked for aliased as their indexes, which effectively give us structure and control over numbers floating through HTML source pages. Overall, we have seen a wide variety of ways to convert datasources (including webpages) in to rows and columns of data through Data Translators. Some of the Data Translators are customizable (such as pattern and upattern) and the others aren’t (such as sentences and links). 
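One more hedged sketch before moving on: the RSS Data Translator described above, pointed at a news feed. The feed URL is a placeholder, and the matching operator and the (?i) case-insensitivity flag are assumptions.

-- Rip news links straight out of an RSS feed and filter them by keyword.
select TITLE, LINK, DESCRIPTION
from RSS
within 'http://rss.slashdot.org/Slashdot/slashdot'   -- placeholder feed URL
where TITLE matching '(?i)linux'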
Given that we know how to effectively translate page sources from the internet and select Pseudocolumns out of those Data Translators to create tables, we will now learn how to connect tables together in Chapter 2.2 and even create navigational threads in Chapter 3. 2.2 Joining Nodes Let’s expand and modify Example 2.0.3 to have a child node that filters errors. 66 CONFIDENTIAL Now that more than one node appears (remember that a node is a table which is a database), which one gets outputted to the default viewer? To avoid confusion, we specify which one, which is the node InputNoError in this case. The relationship between GetInput and InputNoError is that GetInput is the parent and InputNoError is the child. A parent merely passes on what it selects to its child. In the process of passing the fields of the parent’s table to the child, the records (or “rows”) can be filtered similar to the way that records get filtered in the parent node. When queries get more advanced and involve complicated branching schematics, using where filters both in the node itself and between nodes is extremely convenient. We have now seen both methods of using where filters. As a child, InputNoError can reference any of 67 CONFIDENTIAL the parents’ fields’ aliases in its node. InputNoError also has the ability to re-alias any data it selects—it doesn’t in this example. A child can have any number of parents, and parents can have any number of children. Suppose a child has 2 parents. One parent’s table has 6 records and the other has 5 records. Given that no filtering takes place, how many records will the child have? The answer is the product of all parents, which is 30 in this case. In the example below, we illustrate a cross product (spatial cross product) and use a nifty join trick called spread. So we don’t have to create CSV or other input files, we make arrays and smear them into columns of data using join spread—it is in the IDE help index (HelpÆContentsÆIndex). The example also uses union, which has not been discussed yet. union is used when we want tasks to operate completely independent and ignorant of each other. The process of making an array of 6 elements has nothing to do with the process of making an array of 5 elements, so when we start the process of making the second array, we use union. After smearing the arrays into single columns of data, the columns are joined thus the spatial cross product of the rows is generated in the child node myDemo. If our code writes to the same database in different nodes and we need the writing to occur in a certain order, then wait union can be used instead of union. 68 CONFIDENTIAL 69 CONFIDENTIAL Another method of joining queries is parent wait join. If a child node must follow up a page load made by a parent before the parent processes its next request, then parent wait join can be used; however, inner-outer queries (or “views”) are a better way of depth-first navigation. Some advanced WebQL applications submit forms and go 8 clicks or more deep into a site to retrieve data. Suppose such a task must be performed 10,000 times. If a breadth-first approach to form submission and crawling is applied, then the server-sided session timer could timeout (see Picture 2.4.1). Imagine going 7 clicks deep into a travel site for a fare basis code for a flight, and doing that for 50,000 flights in a row. 
If we submit the search form 50,000 times, then navigate one more click for the 50,000 pages that come back, we won’t be able to get to any of the data because by the time we advance to the second click on the first search, hours could have passed and the session has already timed out. These issues are discussed in detail in Chapter 3. For a quick-fix that enables depth-first fetching, we use parent wait join, but if we are going to try to parent wait join 4 or more nodes together to trigger a desired navigational effect, then we aren’t going about solving the problem the right way. Navigational threads are discussed in Chapter 3 and see the text below Picture 4.1.1 for the right way. The words parent wait can also be used to precede union join, source join, and output source join, which I will outline next. union join stacks the records of 2 or more nodes of equal column width on top of each other. 70 CONFIDENTIAL table1 has 6 records and table2 has 4 records. union joining the tables gives 10 records. Notice how the referencing works in a union join. The only aliases that matter are those of the parent that is listed first. Node table1’s aliases are all that matter in the union join. source join is straight-forward method of joining nodes. The SOURCE_CONTENT of a child node is the same as the parent node. If we must chunk an HTML page into pieces we might what to do that in a node different than the one we loaded the page in. When we say “chunk an HTML page” we mean creating multiple SOURCE_CONTENT records out of a single SOURCE_CONTENT record (see Example 2.5.2). This is when source join is appropriate. We can achieve the same effect using output source join. output source join allows us to select a column or columns from the parent to 71 CONFIDENTIAL be processed as SOURCE_CONTENT records by the child. If we select the SOURCE_CONTENT of an HTML page as a column and then output source join to that column, we achieve the same effect as source join. We can also implement filters on these join techniques. Example 2.2.4 illustrates two equivalent branching techniques to fetch both the link and image URLs from a single page load of www.cnn.com. Whenever using the output source join or source join technique we must write the from/within source syntax to symbolize that our datasource for the node is the same data source as the parent node. Remember that the word source follows the word from when no Data Translator is specified and follows the word within when a Data Translator is used. 72 CONFIDENTIAL There also exist the techniques of outer joins which are either a right outer join, a left outer join, or a full outer join. For outer joins, we specify both a left and right parent. 73 CONFIDENTIAL To see the rest of the picture, here are the right.csv and full.csv files: 74 CONFIDENTIAL Similar to the left.csv file, the full.csv file is the same, thus all three tables are the same because there are no null records. For an example, see Picture 4.3.1. The last way to connect two nodes is with minus. Here’s a tricky piece of code that selects only the dynamic links on the page http://www.news.com. 75 CONFIDENTIAL Of the 234 links on the page, only 9 appear to be dynamically generated. Using the manual table building code of Example 2.2.5, we can show how minus works with nodes without the web. 76 CONFIDENTIAL The node Out takes all records from node table1 and deletes any that contain the number 2 in the first column named myNums1. The result is written into the default viewer. 
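Because the tables of Example 2.2.5 exist only as IDE snapshots, here is a hedged sketch of the minus idea using small literal tables. The node-naming convention (name: select ...) and the use of union join to build the literal rows are assumptions; only the role of minus follows the description above.

table1:                          -- a small one-column table
  select 1 as myNums1 union join select 2 union join select 3 union join select 4
table2:                          -- the records to subtract
  select 2 as myNums2
Out:
  select myNums1 from table1
  minus
  select myNums2 from table2
-- Out keeps 1, 3, and 4: any record containing a 2 is deleted because
-- it also appears in table2.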
Now that we've seen how to connect nodes in various ways, we can string together long navigational threads that simulate clicking into a website. Navigation is covered in Chapter 3.

2.3 Aggregating Data by Column and Grouping (Aggregate Functions)

Let's go back to the identification of non-yahoo.com domain images on http://www.yahoo.com in Example 2.1.2. Suppose a count of such images is all that is needed rather than the URLs of the images. We will now introduce the most basic Aggregate Function: count. What makes a function an Aggregate Function is that it operates column-wise on a table rather than row-wise. count counts something column-wise. Depending on what Data Translator we choose to use, count will count different things. We were using the images Data Translator in Example 2.1.2, so image counting is what we'll do. So we see that 14 images are all that come from outside of yahoo.com.

What can be said of the uniqueness of these images? We'll investigate that in this next example, where a Pseudocolumn of the images Data Translator will appear both inside and outside of the Aggregate Function count in the selected fields. When this happens, every field not bound by an Aggregate Function must appear in a group by expression. Let's look at the code and output to try to understand what's going on: As we can see, only one image on http://www.yahoo.com from outside of yahoo.com is a repeat, and it's repeated only once.

Now that we have seen Aggregate Functions, we know we didn't quite get the whole story when the select statement was outlined in Chapter 2.0. Here is a heuristic sketch of a select statement that uses Aggregate Functions:

select <these fields>, <these aggregate fields>
into <this destination>
from <this Data Translator applied to…>
within <…this file/db/source>
where <this and that are true or false or matching or equal, etc. for these fields>
group by <these fields>
having <this and that true or false or matching or equal, etc. for these aggregate fields>

Taking the previous example of itemizing repeated images, we can now add an Aggregate Function filter with a having clause that gives us the non-yahoo.com domain images that are repeated at least once. We do indeed get the result that we anticipated: only one image fits such criteria. having is the third and final record-filtering technique. We now know how to filter twice with where and once with having, effectively all in one select statement.

Now that we've seen count used, what are the rest of the Aggregate Functions? Most Aggregate Functions are math-related. In addition to count, there are min, max, avg, sum, and stddev; we don't need to go over those. The last and possibly the best Aggregate Function is gather. gather turns a column into an array. There are sub-tables for Aggregate Functions and array functions in the WebQL Giant Function Table, which is Appendix V.

We will give 2 last examples to complete this lesson on Aggregate Functions. First, we will use all of the math Aggregate Functions besides count effectively; second, we will show how to quickly turn a column of data into an element of a row using gather. We will also add a slight curveball to the next example by applying the links Data Translator to not just one HTML source but 3. We do that by listing the sources in an array. The functions, Aggregate Functions, and grouping expression used execute once for each page load.
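The code listing for this example is a screenshot that is not reproduced in this transcription. As a rough sketch of its shape, following the heuristic select statement above (the three source URLs are illustrative, and the length function and the array-style listing of sources are assumptions not confirmed by the guide):

select SOURCE_URL, min(length(URL)), max(length(URL)), avg(length(URL)), sum(length(URL)), stddev(length(URL))
from links
within <an array of the 3 HTML sources>
group by SOURCE_URL

Because SOURCE_URL is the one selected field not bound by an Aggregate Function, it must appear in the group by expression, and each of the 3 page loads then produces exactly one summary record.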
If we are loading information out of a database or the web (which I should have you convinced by now is nothing more than a database), then the mathematical Aggregate Functions compute for each database. There also exist array versions of the functions: array_sum, array_avg, array_min, etc. The results of the actual example are quite impressive. Coming up with this code could take between 5 and 15 minutes, and it took 1 second to run. How many man-hours would it take if HTML source pages were loaded out of browsers and the URL lengths were counted and tallied by hand? Is the computer more accurate in this case than a human?

Moving on to the last example, we will use the Aggregate Function gather to pull out the 12th table cell that appears on www.yahoo.com. The clean function strips all HTML tags and eliminates redundant or outlying whitespace. Because all the Pseudocolumns (namely, item1) of the pattern Data Translator stay within Aggregate Functions (namely, gather), there is no need for a group by expression. Without question, Aggregate Functions are a powerful tool in selecting and reprocessing a data table. Aggregate Functions introduce grouping expressions that enable additional data filtering techniques such as a having filter. Aggregate Functions are crucial in the development of advanced applications such as the one depicted by Picture 3.4.6.

2.4 Inner-Outer Query Relationship

One of the most useful coding styles in WebQL separates code segments (or "subroutines") into separate files. These source code files, ending with .wqv, are called a view or an Inner Query. Inner Queries are useful because every field selected by the outer parent is global inside a view. This is the Global Field Theorem. A node can select any and every field selected by its outer parent. The selection of the outer parent is also called input. Suppose we want to keep a count of processes by batch. Notice that the node writeReport, out of nowhere, selects mybatches successfully. mybatches is nowhere in the parent of writeReport (namely, process1), but it is in the outer parent named batches. The unnamed node in Example 2.4.1 is selecting all fields from the default viewer of the inner query and putting them into the default viewer of the outer query, which is in the Execution Window.

Inner-outer relationships are separate queries, and we can create aliasing that is similar between them (for example, fetch1 could be the name of a node in both the inner and the outer query). Rewriting this example as two queries illustrates the trivial difference of using separate files vs. using parentheses in creating inner-outer query relationships. Example 2.4.2b is batches.wqv. There is no conflict of variable names in the node writeReport because the code specifies input.mybatches (or "my outer parent's mybatches") and batch1.mybatches, which is in the immediate parent of writeReport. The batch1 node on the left in Example 2.4.2a is clearly not being referenced in the second field of writeReport because writeReport is connected to the code on the left only through the input, which is the field mybatches of the node batches. Notice that Example 2.4.2a is stray code in an unsaved file while batches.wqv in Example 2.4.2b is saved. When calling a view, the last saved version is used. Make sure to save a view before running the outer file.
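Examples 2.4.1 and 2.4.2 are shown only as screenshots. As a rough sketch of the referencing rule they illustrate (the node-header style and the from clause are assumptions; the two field references are exactly as described above), the writeReport node distinguishes its two sources of mybatches like this:

writeReport:
select input.mybatches, batch1.mybatches
from batch1

input always refers to the selection of the outer parent, so even though both fields are called mybatches there is no conflict between the outer parent's field and the field of the immediate parent batch1.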
Later on, we will use Inner Queries to navigate down paths (effectively, "click") through websites and extract data. Inner Queries are great for depth-first crawling instead of breadth-first crawling. Suppose it takes 3 clicks to get to the price on a website for each of 3 different customer searches supplied via a CSV file. Sometimes, we must go 3 clicks deep for the first search, then 3 clicks deep for the second search, and so on (Picture 2.4.1), instead of going 1 click deep for all 3 searches, then 2 clicks deep for all 3 searches, and finally 3 clicks deep for all 3 (Picture 2.4.2). The reason is that our session established in Picture 2.4.2 could time out before the inbound flights are selected, especially if there are hundreds or thousands of input line items in the customer's CSV file. Sometimes it doesn't matter whether we use breadth-first or depth-first navigation, but when we need depth-first navigation, we look to the Inner Query, or view.

Picture 2.4.1: Depth-first navigation achieved by inner-outer query relationships

Picture 2.4.1 is ideal for when tens, hundreds, or thousands of flights are being fetched. To achieve this navigational order of fetches, inner-outer queries must be used for each of the 3 fetches. A detailed example of creating depth-first navigation is Example 3.5.6.

Picture 2.4.2: Breadth-first navigation achieved by writing all 3 fetches in the same query

For all the reasons stated, breadth-first navigation is not recommended.

2.5 Document Retrieval Options

The most powerful networking accessory commands are called Document Retrieval Options. These options apply to all document sources whether they are local or non-local; however, they are most useful when trying to access an HTML datasource from the internet. So far, we have used crawl of a URL to depth in Example 2.1.3. Imagine how many pages are loaded in a crawl to depth k. If the average page load has 50 links on it with a small standard deviation, then a crawl to depth k would be approximately 50^(k-1) clicks. We should be able to see how some sloppy code could cause all kinds of unnecessary internet traffic. crawl of can also be used in conjunction with following if for a simple way to continue clicking a next-20-results-type button. We can also use circular referencing with union joins to repeatedly click Next/Next/Next/Next to scrape all of the desired results, which will be discussed in detail in Chapter 3. One of our favorite examples below shows how following if works. We will learn to greatly appreciate being able to repeatedly click the next button in a one-node effort using following if rather than implementing a 3-node URL-trapping circular reference (see Picture 3.5.3.5 and Example 3.6.2).

Here we crawl for the first 30 Google results for our desired search terms. How easily could we read in a user-defined (or customer-defined) list of search terms from an input file instead? We could also use an input variable to crawl to a dynamic crawl depth. crawl of / following if is like a Data Translator that has Pseudocolumns (CONTAINER_URL, CRAWL_DEPTH, SOURCE_CONTENT, SOURCE_URL, URL, URL_CONTENT), but it's not a Data Translator. These Pseudofields can only be referenced in the section of the code with the crawl of phrase, after the statement following if. Checking various conditions on these Pseudofields can allow for targeted custom crawling, such as limiting the search results to the first 30 results.
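The Google example itself appears only as a screenshot. As a rough sketch of the shape of such a node (the search URL, the depth bound, the Regular Expression, and the exact placement of the crawl of and following if clauses are assumptions based only on the description above):

select URL, SOURCE_URL, CRAWL_DEPTH
from links
within crawl of 'http://www.google.com/search?q=webql'
following if CRAWL_DEPTH < 3 and URL matching 'start=(10|20)'

Because Google returns 10 results per page, only following the next-page links whose query strings contain start=10 or start=20 limits the crawl to roughly the first 30 results; a customer-supplied search term or a dynamic depth could be substituted from an input file.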
Document Retrieval Options are also powerful for dividing the SOURCE_CONTENT field of an HTML source into pieces known as chunks. Suppose we want to scrape Foreign Currency Yields on investments in Singapore. A page copied from a bank website is hosted on Geocities, so the page structure won't change after the publication of this document. Here is a view of the webpage from a later date so we have an appreciation for the data extraction task at hand:

Picture 2.5.0a: The first half of the page for Example 2.5.2

Picture 2.5.0b: The second half of the page for Example 2.5.2

Here's the start of the code for scraping the page. Go to the URL in the node doredirect2 in a browser if you want to see the text source: The first thing to notice is that the records that are going to be extracted from the page chunks also need to know what the minimum deposits are for the various currencies. This information is only presented once on a page that we want 9 chunks out of. We must extract the minimum deposit information before chunking the page with the convert using Document Retrieval Option. In conjunction with the function text_to_array, convert using changes the SOURCE_CONTENT Pseudocolumn into whatever we specify. In this case, SOURCE_CONTENT gets changed into an array of sub-sources, which increases the number of SOURCE_CONTENT records processed by the node.

The rest of the code is included below. It uses an Inner Query to process the chunkpage node's chunked source. Notice that the node mydata is writing to multiple destinations. That's not a problem; separate the destinations with a comma. mydata also makes use of the decode function and datetime_to_text operating on the global variable now, which is our current moment in Unix time. Notice that the pattern Data Translator p1 in the translate node uses multiple extraction buffers; we can have as many as we want. We are also making use of multiple pattern Data Translators with the .* end trick discussed earlier in Chapter 2.1. We know that the node translate will only produce one record per page chunk. translate selects the currency name and the source for rate extraction. Running the code produces all 7 rate categories for each of the 9 currencies presented, so 63 records are manufactured by the mydata node after every page chunk is processed. We should also make note of the technique of combining two columns of data from different tables side by side. The node assemble puts rate1 and rate2 side by side from different tables by joining the nodes rates1 and rates2 together and filtering the records based on equal RECORD_IDs (see also Example 3.5.4). decode and if are also used effectively in Example 2.5.2.

Picture 2.5.1: Execution Window of Example 2.5.2

Notice in the messaging how the pattern Data Translators hit the page chunks separately. The month fractions symbolize weeks, and some FCY data was not available when the page loaded, such as the Yen rates. This is a great application for harvesting data to feed a corporate database with its own number-crunching and report-making software. Sometimes WebQL is best used as a web crawler that harvests and cleans data for another company's database applications. We can produce those applications as well. In Example 2.5.2, the page of interest is actually triggered out of a javascript redirect.
Remember that WebQL doesn't do javascript, in a tradeoff for programming authority, so triggering this page is a multi-node process that requires manually cutting a javascript redirect and concatenating it onto a base URL. We see below that the Foreign Currency Fixed Deposits page is properly loaded after doing some manual work on the location.href redirect:

Picture 2.5.1.5: The Browser View of the SOURCE_CONTENT of Example 2.5.3

Next, let's take a look at the submitting Document Retrieval Option. submitting lets us submit values to forms and load the pages that get returned. Now we can submit searches on anybody's website and analyze the results. All we do is submit the city Tokyo into the appropriate form value. We can figure out that form value either by looking at the http://www.hotels.com page source (either through WebQL or View → Source in a browser) or by using the forms Data Translator with the select * trick discussed in Example 2.1.9. Notice how the string "Tokyo" could just as easily be a running list of input from a customer-generated CSV file.

Instead of submitting directly to a form by number, we can say

submitting values v1 for 'var1' if form_content matching myMatcher

Thus, if there exists something in the HTML form that is unique, we can identify our desired form submission by a substring match or Regular Expression match in the HTML of the form. This technique can be used to submit to one, some, or all forms on a page. myMatcher is a Regular Expression that we can make as robust or as parochial as we want. Similarly, a form has an action URL (or "target") that can be used to identify a form for submission.

submitting values v1 for 'var1' values v2 for 'var2' to form target myTarget

If we know the action URL of the form we are submitting to, which can be easily derived from the form analysis technique of Example 2.1.9, then we can just specify a substring of the action URL as a target, myTarget. myTarget is not an expression; however, myMatcher is. myTarget is a substring text literal. To give us an idea of how hectic some form submissions can be, take a look at this file from an advanced application:

Picture 2.5.2: Code from an advanced application exemplifying submitting

This code that hits Expedia does a great job of making use of all kinds of Document Retrieval Options. We see some that we know and some that we have not seen before. cache forms is a very powerful Document Retrieval Option. If we want to submit to a form dozens or hundreds of times, we don't want to reload that form every time we want to submit it. The idea is that we load a form once, then submit it 100 times rather than loading it 100 times and submitting it once for each load. cache forms is better exemplified in Example 2.1.3.5. Caching the forms in Picture 2.5.2 enables a node appearing later in the code to make use of all forms loaded in the node in Picture 2.5.2.

post is a major Document Retrieval Option. post lets us explicitly hit a URL that typically requires a form submission or a click by the user. In Picture 2.5.2, the explicit post statement is there to immediately trigger the advanced car search form on Expedia. unguarded is another Document Retrieval Option. WebQL by default will never reload the same data unless the Document Retrieval Option unguarded is used. If we must use unguarded, then there probably isn't optimal efficiency in the code.
If we load a page and pass the SOURCE_CONTENT down through several nodes, then why would we need to reload anything? unguarded is needed in this example because sometimes a customer's input can overlap with itself because of an input algorithm for this particular query. The convert using Document Retrieval Option is quite complicated in this example. Some searches for specific vehicle types also show other vehicles at the location; we don't want "other vehicles at this location" in our output report.

rewrite using is the last Document Retrieval Option that we will discuss. If we are using submitting, and for an undiagnosable reason the URL is getting things added on to it, or the wrong thing added on to it, then we can rewrite it just before the page is loaded. Suppose when we submit our form "&Back.x=1&Back.y=1" gets added to the URL when we need "&Next.x=1&Next.y=1" to be part of our URL destination. This trick will work:

rewrite URL using replace(URL, 'Back\.', 'Next.')

A similar maneuver is documented in Picture 2.5.2. Remember that the second argument to replace is a Regular Expression and the third argument is a string. If artifacts are showing up in the Post Data, then we can also rewrite it just before the page is loaded.

rewrite post_data using post_data || '&myVar1=myVal1&myVar2=myVal2'

Here, a couple of variables weren't showing up in the form submission, so we tacked them on to the end of the post_data string using the string concatenation operator ||. These kinds of rewrites are needed infrequently, but they make the process of debugging code a lot less annoying by enabling workarounds.

2.6 Read/Write Features

So far, we've loaded HTML internet pages and selected data out of them into the default viewer and/or CSV text files. WebQL has the ability to read almost any datasource and write to almost any file type, especially in the domain of spreadsheets and business applications. The syntax for reading from or writing to a database is

table@connection

There is an Access database file at http://www.geocities.com/ql2software/db1.mdb that contains an employee list named 'people' with a couple of errors in it. First, there are repeated employees; second, the names of the fields of the table (Name, Age, Job, Title) are duplicated as the first row in the table. We will save this file to our local machine (in the 'c:\' directory) and create a node that cleans the database up and outputs it into the default viewer. The db1.mdb file has 21 records in it, and we should be satisfied that we have fewer records in the default viewer after removing duplicates and the erroneous first row.

WebQL can do a couple of tricks writing data to a destination. The first trick is the HTML trick. We can output this employee list into an HTML table to give it color, shape, or style. Notice that we are using the same code as Example 2.6.1, but instead we are writing an HTML file header, then appending the employee list as HTML rows, and then appending the end of the HTML file. The HTML file looks like this:

Picture 2.6.1: HTML1.html created in Example 2.6.2

Another powerful writing technique allows XSLT transforms to be applied to XML. Using the same example as before, we can have our own stylesheet that looks better than Picture 2.6.1. Notice how the coding is easier in Example 2.6.3 than in Example 2.6.2.
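Examples 2.6.1 through 2.6.3 share the same cleanup select; only the destination treatment changes. Since the listings are screenshots not reproduced here, the shared query is sketched below. The select unique form, the quoting of the connection, and the <> comparison are assumptions; only the table@connection syntax is given explicitly above.

select unique Name, Age, Job, Title
from people@'c:\db1.mdb'
where Name <> 'Name'

select unique drops the repeated employees, and the where filter drops the erroneous first row whose values are just the field names. In Example 2.6.2 the destination becomes an HTML file assembled from a header, the rows, and a footer; in Example 2.6.3 it becomes an XML file with an XSLT stylesheet applied at write time.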
Studying XML and XSLT stylesheets enables us to make great-looking reports in WebQL without doing much more than calling those files at write time. The final mention on WebQL destination writing involves the WebQL built-in pivot table effect called pivot. Specify the word pivot at the end of the destination to make use of this Document Write Option. The third column of our data ends up getting spread as column headings, so the number of columns in a WebQL pivot table is 2 plus the number of unique items in column 3 of the select statement. The best way to get comfortable with the pivot option is to use it on a familiar dataset and look at the results. Try switching the columns around and see what kind of effect it has on the look of the data. Be aware that a pivot is not always a good idea. The HTML application developed in Picture 3.4.5 was created using the pivot write option.

2.7 Virtual Tables

In addition to the Pseudocolumns of the various Data Translators seen so far, WebQL has Virtual Tables that have their own Pseudocolumns. The purpose of Virtual Tables is to allow developers to reference lists of product features and capabilities within the language in an up-to-the-minute fashion. The above query reveals what WebQL Virtual Tables are available to select from. These universal Virtual Tables that can always be selected from begin with "WEBQL$". Now we can further dissect these tables. We see the 1000+ different types of file encoding understood by WebQL. We can use the select * trick on every Virtual Table to get the details summarized below.

Table 2.7.1: Virtual Tables always available as datasources

VIRTUAL TABLE         DESCRIPTION                      PSEUDOCOLUMNS
WEBQL$DATASOURCE      lists known data resources       NAME, DESCRIPTION
WEBQL$ENCODING        lists all known encodings        NAME, CANONICAL_NAME
WEBQL$LOCALE          lists all locale identifiers     ID, LANGUAGE, COUNTRY, VARIANT
WEBQL$OPTION          lists all server options         NAME, DEFAULT, VALUE
WEBQL$TIMEZONE        lists all known timezones        ID, BASE_OFFSET_HOUR, BASE_OFFSET_MINUTE, NAME, ABBREV, DST_NAME, DST_ABBREV
WEBQL$URL_SCHEME      lists all known protocols        NAME, USES_NETWORK, DEFAULT_PORT
WEBQL$VIRTUAL_TABLE   lists all known Virtual Tables   SCHEMA, NAME

An example of making use of a Virtual Table would be to look up all POP URL schemes known by WebQL. The results should satisfy any seasoned web programmer.

2.8 Sorting Data

There is one last piece of the select statement that we have yet to discuss. In addition to clicking and sorting data in the default viewer of the Execution Window, we can sort the data in a node with a sort by clause. Sorting happens after data filtering, so sort by should be at the end of a node. Sorting expressions can be creative and involve true/false. In Example 2.8.1, the list of numbers 1 through 10 is sorted by odds first, then evens. Now, we'll sort multiples of 4 in descending order, and then multiples of 3 in ascending order. We see how the multiples of 4 appear at the beginning and the multiples of 3 appear at the end. Sorting is not the most important feature because databases and programs like Excel usually have their own sorting mechanisms. When we need complicated sorting algorithms sequentially applied to a table of data, WebQL is a viable tool with even more flexibility than Excel or Access. Sorting causes a node to wait until all records in it have been generated before displaying or writing the results, because the results must be sorted.
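As a sketch of the kind of true/false sorting expression Example 2.8.1 uses (the mod operator, the trailing tie-breaker, and the ordering of false ahead of true are all assumptions, since the guide does not list WebQL's arithmetic operators or its boolean sort order):

select myNum
from <the table of the numbers 1 through 10>
sort by myNum mod 2 = 0, myNum

If the expression evaluates to false for the odd numbers and false sorts ahead of true, the odds come out first and the second sorting expression orders each half numerically; if the engine orders booleans the other way, negating the expression flips the result.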
This blocking behavior also means that sort by cannot exist as a part of a node in a circular loop. unique also cannot be used in a circular loop. We are (finally) through our instruction on coding basics in Chapter 2. We are moving on to navigation in Chapter 3.

CHAPTER 3: Navigating and Crawling the Web

3.0 The Navigational State

Navigating through the internet in WebQL can be tricky. To make learning easier, the idea of a Pseudocolumn 3-tuple called a Navigational State is defined. At any point in time on the internet, the page that we are looking at is primarily the result of 3 different things.

Picture 3.0.1: The Trevor J Buckingham Navigational State

It's true that the page that we are looking at through a browser could also be impacted by headers and cookies, but these three fields characterize "where we are and where we can go on the web" more than any other information. Is there anything interesting about these fields? They are all Pseudocolumns! We don't have to do anything fancy to get the information besides select the SOURCE_URL, SOURCE_POST_DATA, and SOURCE_CONTENT. The major theorem to understand here is that as we crawl the web in a parent node, a child node can always advance the navigation one way or another if it knows its parent's Navigational State. A given webpage is always triggered by a URL, and it might need Post Data as well. Some URLs load pages successfully without Post Data. Together, these two components of the Navigational State answer, "Where are we?" The HTML source enables us to answer, "Where can we go?" And if we know where we are and where we have the potential to go, then we are navigating; hence our realization of the Navigational State is the Pseudocolumn 3-tuple SOURCE_URL, SOURCE_POST_DATA, SOURCE_CONTENT.

Notice how, because GetLink1 knows the Navigational State of its parent GetNavState1, GetLink1 is capable of navigating anywhere we could click on the site with a mouse. GetLink1 chooses the 5th search result, and then fetches the associated Navigational State in its child node GetNavState2. As it turns out, we only need part of the Navigational State in GetNavState1 (the SOURCE_CONTENT) because the page selected by GetNavState1 is not required to provide a form in GetNavState2. The link selected in GetLink1 is a stand-alone URL without post data. See Picture 3.5.3.6 for a node similar to GetNavState2 where a form is needed from a paternal node. Which database is displayed in the default viewer of the Execution Window? Because the SOURCE_URL of GetNavState1 is http://www.yahoo.com, it must be GetNavState2. Because we didn't pick a destination for any of the data that we selected, we should not anticipate any particular data in the default viewer of the Execution Window. Further, we know it's not GetNavState1 in the default viewer because GetNavState1 has 2 records: one for the http://www.yahoo.com page fetch and one for the form submission.

3.1 Circular Navigation

Circular Navigation is an effective application of Circular Referencing. The concept of Circular Referencing is first illustrated below. The counter starts at zero and then runs in a circle until the field cnt reaches 10. To achieve the same result in 2 nodes, we can union join a node to itself recursively. Although WebQL allows for recursive nodes, it does not allow for recursive views. Perhaps the node called SECOND in the dataflow should have an arrow to itself.
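Both counter listings are screenshots. As a structural sketch of the two-node version (the node headers, the as aliasing, and the exact placement of the union join are assumptions based only on the descriptions above):

FIRST:
select 0 as cnt

SECOND:
select cnt + 1 as cnt
from FIRST union join SECOND
where cnt < 10

SECOND keeps feeding its own output back into itself until the where filter stops producing records, which is why the dataflow arguably deserves an arrow from SECOND back to itself.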
The circular constructs in the previous 2 examples work great for navigating. Sometimes the "next" button that shows the next page of results from a form submission is javascript-driven. Remember that WebQL does not understand javascript, a tradeoff that sacrifices convenience for programming power. Using pattern and/or upattern Data Translators to manually extract links, a 4-node loop can be crafted to elegantly crawl a site. Let's take Example 2.5.1 and, instead of using crawl of, use a 4-node loop to achieve the same effect. Notice that the data don't come out in the same order in the Execution Window below as when we used crawl of back in Example 2.5.1.

Picture 3.1.1: The Execution Window with Data Flow tab of Example 3.1.3

Although the loop appears to involve only 3 names, ManageCrawlGoogleA, ManageCrawlGoogleB, FirstCrawlGoogle, and ButFirstCrawlGoogle are all involved in the notion of Circular Navigating. FirstCrawlGoogle is the trigger node, and the other 3 run in a circle.

3.2 The Fetch

Now that we have experience crawling around the web and running in circles, we can ease ourselves into this realization.

Picture 3.2.1: The Diagram of a Fetch

A fetch is a URL (optionally with Post Data) and the resulting loaded page (these together are symbolized by the Navigational State) with all cookies, headers, and other Pseudocolumns. It's true that the headers contain the URL and the Post Data, but to easily select those values and move around the web, we need the URL and Post Data in the Navigational State. Many of the Pseudocolumns of Table 2.1.0 can be thought of as Pseudocolumns of a fetch. A "request" is the outgoing headers and cookies (which include the URL and Post Data); a "download" is a server's reply to our request; and together these concepts are a "fetch". Fetches and all network transmissions are debugged down to the character through a proxy that is viewable through the Network tab of the Execution Window.

Picture 3.2.2: A Header with URL and Post Data viewed in the Network tab

Not every smiley face comprises an entire fetch. A fetch can have child page loads, frame loads, image loads, etc. Again, be sure to make use of any and all of the Pseudocolumns outlined in Table 2.1.0 whenever we fetch a page. Some people confuse the notion of what a Navigational State is and what a fetch is. A Navigational State is a 3-tuple of Pseudocolumns. A child node can always advance navigating (clicking) through a website if it knows its parent's Navigational State. This is the Navigator Theorem of Web Crawling. We don't need to know everything about a fetch to advance clicking; we only need to know some or all of the Navigational State. Notice that a fetch isn't only 3 Pseudocolumns. A fetch is a request header (or series of request headers) with the corresponding page loads that allows a programmer a diagnostic view of the web through WebQL Pseudocolumns.

3.3 Debugging Form Submissions

Form submission is more tedious than difficult in WebQL, primarily because WebQL is not javascript-enabled. If a form submission is javascript-intense, the best way to submit the form is to mimic Internet Explorer. In WebQL, go to View → Network Monitor. Set the Network Monitor on a specific port, such as 211.
Picture 3.3.1: WebQL Studio IDE View → Network Monitor

Next, in Internet Explorer, go to Tools → Internet Options → Connections → LAN Settings → Use a Proxy (check) → Advanced.

Picture 3.3.2: LAN Settings in MSIE

Set the SOCKS proxy as "localhost" on port 211.

Picture 3.3.3: Advanced Proxy Settings in MSIE

We now have separate proxies set up for both the browser and the WebQL program. Recall that WebQL always has a proxy: the Network tab. The goal is to use Document Retrieval Options to make the WebQL proxy look enough like the browser's proxy that the page that gets loaded in a browser is the same page that gets loaded in WebQL. Again, this is more of a tedious task than anything else.

Let's say we want to scrape prices out of hotels.com for a given locale. The first thing we want to do is repeat Example 2.1.9, which uses the select * trick with the forms Data Translator on http://www.hotels.com. From that information, we can figure out what form to submit and which variables to submit along with it. Recalling Example 2.5.4, the form to submit is form 2 and the only required variable to submit is usertypedcity. If dates other than the default dates or occupants other than the default occupants are required, then those variables can be submitted in the same fashion as usertypedcity.

Picture 3.3.4: select * from http://www.hotels.com

Since Example 2.5.4 was written, http://www.hotels.com changed their site. From this page load, what should we do to load the page we want? If we go to http://www.hotels.com in a browser, how is the page different? Looking at the Network tab, here is how we formed the request in WebQL:

Picture 3.3.5: The Outgoing Request for hotels.com in WebQL

Setting the browser proxy and requesting www.hotels.com gives us:

Picture 3.3.6: The Outgoing Request for hotels.com in MSIE

Notice how there appear to be dozens more outgoing requests than what we saw when we tried to load the same page in WebQL. Those outgoing requests are triggered both by javascript and by HTML tags when the homepage is loaded. The page is successfully loaded in MSIE because MSIE is javascript-enabled. Based on Picture 3.3.4, we need to make http://www.hotels.com think that we are javascript-enabled even though we aren't. Looking at Pictures 3.3.4 and 3.3.6, what should we do? My guess is to add ?js=1 onto the URL. The variable &zz= might not be vital to the page load, so we might not need to worry about it. Let's see if that works…

Picture 3.3.7: select * from http://www.hotels.com?js=1

Sure enough, we get the page and form we want by adding the query ?js=1 onto the URL. Query in this sense means "the text after and including the ? in a URL" rather than a WebQL query. Applying the forms Data Translator to this page gives us: Notice how Example 3.3.1 differs from Example 2.1.9. Being able to quickly handle site updates makes us better WebQL programmers. Now, let's re-implement Example 2.5.4 to accommodate the http://www.hotels.com site change. Looking at the Browser View of the Execution Window, we have successfully re-implemented Example 2.5.4 to reflect the site changes to http://www.hotels.com. To improve the code even more, we can make the form submission robust by referencing a form action URL substring as a target rather than by the sequential number of the form on the page.
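As a sketch of that more robust submission (this is not the exact code in the screenshots; the target substring is a placeholder that would be read off the action URL reported by the forms analysis in Example 3.3.1):

select SOURCE_URL, SOURCE_CONTENT
from 'http://www.hotels.com?js=1'
submitting values 'Tokyo' for 'usertypedcity'
to form target '<substring of the form's action URL>'

If hotels.com later reorders the forms on its homepage, the submission still finds the right one, because the match is against the form's action URL rather than its position on the page.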
Notice that regardless of what form number we want, we submit the correct form based on a substring match of the form's action URL (or "target"). As we've seen in this chapter, navigating is a tedious task of debugging analysis more than anything complicated or difficult. WebQL is more flexible and better fit to crawl the web than any other web-harvesting product.

3.4 Looking Ahead to Advanced Programs

Since this version of the Guide excludes Chapter 4, a few screenshots of some advanced crawlers are included here. This content isn't exactly the same as what appears in Chapter 4.

Picture 3.4.1: The Data Flow of an Advanced WebQL Program

Picture 3.4.2: The *.csv Output of an Advanced WebQL Program (previous page)

Picture 3.4.3: Part of the source code of an Advanced WebQL Inner Query

Picture 3.4.4: Custom DHTML Output Done by an Advanced WebQL Program

Picture 3.4.5: Custom DHTML Output Done by an Advanced WebQL Program

Picture 3.4.6: Custom DHTML Output Done by an Advanced WebQL Program

Picture 3.4.7: Custom DHTML Output Done by an Advanced WebQL Program

3.5 Other Coding Techniques and Strategies

On top of the coding styles used so far, there are a few other tricks worth mentioning. We learned early on how to filter using where both between and within nodes. We can also use a pattern extractor twice in a single node by using the pattern Data Translator followed up by the extract_pattern(str1, expr1) function. Let's look at an example. Instead of writing a single overcomplicated pattern to extract the first link that appears in the <h2> tags, we can just grab what is between the <h2> tags with the pattern Data Translator and then match an anchor/source pattern via the extract_pattern function to pull out the URL. \b in Regular Expressions is a word border, which could be a white space, a comma, a period, etc. Sometimes writing two simple patterns is more effective than writing one complicated pattern. Sometimes we need two extremely complicated patterns to get the data that we are looking for. It turns out that we can just as easily use the extract_URL(str1) function to achieve the same result. Nevertheless, Example 3.5.1 suffices to show how multiple patterns can be used consecutively in a single node. We can also now imagine cases where we would compose a function call extract_URL(extract_pattern(str1, expr1)), which we could see as using 3 patterns consecutively.

Another common coding technique is to capture a Navigational State in a node and then branch off of it to trap errors. In addition to any network error or timeout, we must watch for site-triggered errors such as "No products matched your search." The best way to do this is to capture the Navigational State, then use join and output source join with expression checks to see if data were found or an error occurred. Remember that no matter how many child nodes collect data or trap errors, we can always add another child node to extract a link and advance clicking through the webpage.

Picture 3.5.1: A Segment of Code from an Inner Query

Clearly, we select a Navigational State in the node fetch1. We also match a pattern to see how many results come from our search, which uses the input fields searcher and method for form values. We include the empty Data Translator for the case when the pattern doesn't match.
We use the outer-parent (input) field myrec to debug the form page load in one branch, and then trap errors in three other branches. Later on in the code, we can union join all of the error branches into a single node that writes all errors to whatever destination we want.

Picture 3.5.2: Combining the error branches

Perhaps the most powerful coding technique is to branch off in all directions to load pricing/product information, union join those sources together, and then pattern-crunch all of the source pages simultaneously in one node.

Picture 3.5.3: Data scraping node getdata

The node getdata is output source joined to the SOURCE_CONTENT of the fetchmanager node. The node fetchmanager is a collection of HTML source pages captured by the various branches. A node that writes output can be joined to getdata. Suppose fetchmanager is part of a circular reference that submits a form in the first source selection and then sequentially clicks through the "next" button of results in the second source selection until all results are exhausted. This is the most effective coding structure for scraping an unknown number of results N > n, where n is the number of results displayed per page.

Picture 3.5.3.5: Diagram of circular navigation with a pattern-crunch

Sometimes the node for selecting second and later sources is also a form submission, in which case the forms must be cached in both source-selecting nodes in order to prevent repeated form loading. In this case, the node for selecting second and later sources uses both the post Document Retrieval Option and the submitting Document Retrieval Option. Below is a segment of code that shows an ordinary node for selecting second and later sources. This example is particularly good because it requires the addition of a Referrer URL in order to load the proper page. Also, random delays between 1 and 5 seconds are added before each fetch. Document Retrieval Options are discussed in Chapter 2.5 and can be used creatively with functions to enhance the navigational capabilities of WebQL.

Picture 3.5.3.6: Advancing navigation when the "next" button is a form submission

Recreating Picture 3.5.3.5 with appropriate node aliases gives us:

Picture 3.5.3.7: Sound node aliases for circular navigation with a pattern-crunch

This technique of pseudocoding circular navigational threads with node aliases makes us quicker and more effective WebQL programmers. Taking it one step further, node diagrams can be used to contrive view diagrams for massive programs that resemble the Data Flow in Picture 3.4.1 and the application set in Picture 4.1.0e. Finally, choosing node aliases carefully makes the process of updating or modifying another person's code much less tedious.

As computer programmers, we know that looping structures can sometimes be avoided by using arrays and array functions. each is a powerful function and works great for generating HTML. Suppose price_array is an array of prices and cheapest is the value of the lowest price in the array; we can quickly generate the proper number of HTML table cells for an HTML row and highlight the cheapest value:

'<tr>' || array_to_text(each(price_array, '<td class=price1>' || if(each_item = cheapest, '<font color=red>' || each_item || '</font>', each_item) || '</td>'), '') || '</tr>'

The next trick in Chapter 3 is also the most important for programs that crawl websites.
Using the WebQL Network Monitor (View → Network Monitor) as a proxy for a browser and the Network tab of the Execution Window as a proxy for WebQL (recall Pictures 3.3.6 and 3.3.5, respectively), we can do a side-by-side proxy-to-proxy comparison of all outgoing and incoming traffic.

Picture 3.5.4: Proxy-to-Proxy Comparison

Because browsers are javascript-enabled and WebQL isn't, the proxies often vary significantly. WebQL can do what the browser does by imitating the javascript, or it can work around the javascript. If we want the browser's effect in WebQL, we sometimes have to work to make it happen (see Chapter 3.6 for examples).

Moving on, the function diff_arrays(array1, array2) is an advanced technique that can be used to detect changes in a page over time or to report the differences between 2 arrays. The function each is also used in combination with diff_arrays.

Picture 3.5.5: file1.txt and file2.txt

Given file1.txt and file2.txt, we can use blank spaces to separate the words into arrays and then call the diff_arrays function. diff_arrays returns an array of 2-element arrays where the first element is either +, -, or NULL and the second element is a word from the above files. We can then call the each function and decode the first element as +, -, or NULL and color-code the second element accordingly. Notice that the length of the diff_arrays result is the size of words1 (10) plus the number of differences (2), totaling 12. Because the function each returns an array, array_to_text is used with a single white space as the array element delimiter. diff_arrays returns an array of 2-element arrays, so each_item of the each function is a 2-element array. Depending on what the first element is, the color of the second element is determined. The pages and lines Data Translators can be used to monitor/compare files by line or by page, rather than by word as in Example 3.5.3.

To better see how changes in an HTML page can be tracked, we will do a similar example for the diff_html function. In addition to diff_html, highlight_html will be used so that we as programmers feel abstracted from the underlying details of how these functions work. diff_html is a function that takes two HTML text files as arguments and returns an array of 2-element arrays, just like diff_arrays does. highlight_html adds the necessary HTML to give background color (argument 3) and text color (argument 2) to a given string of HTML text. highlight_html also puts a border around images that change. Let's look at the HTML sources with their corresponding browser view.

Picture 3.5.6: file1.html

The HTML files are quick and simple. They contain a table with a single non-terminated row tag with three columns containing text, an image, and text, respectively. Suppose that the text and images are dynamic and we wish to track the differences. Suppose that yesterday's website is file1.html and today's website is file2.html.

Picture 3.5.7: file2.html

These files are available for download at
http://www.geocities.com/ql2software/file1.html
http://www.geocities.com/ql2software/file2.html

Here is the code that will perform the diff_html: Notice how the similar text has no change to it, while the differences are colored based on what file they appear in, with a beige background. Also notice how the change in images is shown by a highlighted border around the images. Below is the final piece of the puzzle: the text view of the "diffed" file.
Picture 3.5.8: The computed difference of file1.html and file2.html (previous page)

We see how the highlight_html function adds background color and text color via the style HTML tag attribute. Clearly, the diff_html function in combination with highlight_html does a lot of grunt work with minimal effort from the programmer.

Changing pace, sometimes we have tables of data in two different nodes with the same number of records that we want to "paste" together side by side like this (recall Example 2.5.2):

Picture 3.5.9: The side-by-side combination of tables

If we join 2 tables of length 4, what do we know about the number of records in the child node, assuming there is no record filtering? The answer is 4*4=16, but we want a table that still has only 4 records in it. Suppose we have a table with names and ages in Node1 and job titles in Node2. All we have to do is add the RECORD_ID Pseudocolumn, and then filter where the RECORD_IDs are equal. The final table result is written into the default viewer and contains only the 4 records that we want, which is the equivalent of pasting the tables in Node1 and Node2 side by side.

Sometimes in WebQL we need to submit forms; some are easier than others. Let's look at an example of a more difficult form submission. Suppose we want to scrape Comcast's prices for their services in several different geographical areas. To begin that process, we should look at the page in a browser.

Picture 3.5.10: Comcast's homepage-localization-form

From our experience in a browser, the form seems easy, just like a login or search box. Naturally, the next step is to use the select * trick through the forms Data Translator.

Picture 3.5.11: Curveballs in form analysis

To our surprise, the forms Data Translator did not find any forms on the page. There could be any one of a number of reasons for that. We need to look at the page source (SOURCE_CONTENT) so we have more information.

Picture 3.5.12: More curveballs in form analysis

The first thing that we notice is that the SOURCE_URL is different from the URL that we specify in the code and different from the page loaded in the Messages tab of the Execution Window. This could be the result of a redirect or a refresh, but we aren't sure yet. Looking at the Browser View tab of the SOURCE_CONTENT field, we see that the page that our browser loads in Picture 3.5.10 is different from what WebQL is loading.

Picture 3.5.13: Getting the browser view of the unknown page in the Execution Window

Apparently, the page that is getting loaded doesn't have any forms on it, just a link to go back to the homepage. This could be the page that gets loaded when javascript is not enabled or when cookies aren't enabled, but we need further investigation (remember that WebQL is not javascript-enabled). Let's look at the Network tab and see if the proxy tells us anything.

Picture 3.5.14: Hunting through the Network tab of the Execution Window

After checking that our first outgoing request is consistent with our code, we look at the reply from Comcast. In the reply, we notice a refresh META tag that appears to be responsible for forwarding us on to the http://www.comcast.com/NoScript.html URL. We found the refresh META tag by doing a control+F find on "NoScript.html".
Immediately, we use the ignore refresh Document Retrieval Option to see if we get the right page:

Picture 3.5.15: Browser View tab of the Execution Window

Looking back at Picture 3.5.10 and Picture 3.5.15, the forms appear to match, along with the URL. We can go on with a form analysis of the page using the select * trick with the forms Data Translator:

Picture 3.5.16: Even more curveballs in form analysis

Now that we have the right form identified as form 2, we wonder why the form viewed through the browser has fields that don't match the form values. Also, the CONTROL_TYPE of each CONTROL_NAME is hidden instead of text. Our hypothesis is that javascript is somehow manipulating the input fields through a browser to work the forms with hidden variable CONTROL_TYPEs. What is great about WebQL is that we don't really need to fuss through all of the javascript and form-triggering mechanisms of the Localize.ashx page source to get the page we want; rather, we'll just analyze the behavior of Microsoft Internet Explorer (MSIE) through a proxy and try to copy the behavior.

The next step is to set a proxy on the browser window of Picture 3.5.10. In MSIE, go to Tools → Internet Options → Connections → LAN Settings → (check) Use a Proxy → Advanced. Set the SOCKS proxy to "localhost" on port 211. In WebQL, go to View → Network Monitor and set the port at 211. In MSIE, enter the address "2300 N Commonwealth Ave" in the Street address field, "3B" in the Apt. number field, and "60614" in the ZIP field and click continue. In addition to a new internet page being loaded in our browser, we see the proxy light up with traffic.

Picture 3.5.17: Getting to the next form

We see the URL change in our browser window, and we get another form that requires us to submit additional information. In particular, we must select our location as either "Chicago NW (Area 3)" or "Chicago (Area 1), IL". Let's look at a proxy to see exactly how this page was arrived at, because the CONTROL_NAMEs in form 2 in Picture 3.5.16 do not sufficiently complete all of the variables in the action URL in the browser window above, which is debugged through the Network Monitor proxy below:

Picture 3.5.18: Proxy analysis of moving from Picture 3.5.10 to Picture 3.5.17

The proxy capture in Picture 3.5.18 is in two pieces to show the entire query string on the action URL of the form submission (the "query string" of a URL is what comes after and including the "?"). Again, the form variables for form 2 in Picture 3.5.16 do not contain all of the variables showing up in the query string, so we have a couple of options as software developers:
1) Hack out the URL manually and skip every step, including loading the form.
2) Load and submit the form and rewrite the query string with the rewrite using Document Retrieval Option.
Let's go with option 1 to illustrate the skill set that we have developed as web programmers. Let's create the URL in Picture 3.5.17 explicitly and view the SOURCE_CONTENT.

Picture 3.5.19: Verifying that our URL-hacked form matches Picture 3.5.17

Sure enough, by looking at the Browser View tab of the SOURCE_CONTENT in the Execution Window, we see that we are successfully getting to our desired point of navigation by manually creating the URL that's pinpointed in the proxy in Picture 3.5.18.
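As a sketch of option 1 (the path and parameter names below are placeholders, since the real query string appears only in the proxy capture of Picture 3.5.18, which is not reproduced in this transcription; the point is simply that the URL is assembled with the || concatenation operator and fetched directly):

select SOURCE_URL, SOURCE_CONTENT
from 'http://www.comcast.com/<path to Localize.ashx>?<street parameter>=' ||
     replace('2300 N Commonwealth Ave', ' ', '+') ||
     '&<apartment parameter>=3B&<zip parameter>=60614'

replace is used here only to turn the spaces in the street address into a URL-safe form; the angle-bracketed pieces would be filled in from the proxy analysis.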
We can now do a form analysis of this navigational checkpoint, but remember that we didn't make much use of our last form analysis to get to this point of navigation because we proxy-hacked the URL. Before we do another form analysis, we will do a proxy analysis of selecting the "Chicago NW (Area 3)" radio button. Clear the history of our proxy by right-clicking the left half of the Network Monitor and selecting "clear history". Now, select the "Chicago NW (Area 3)" radio button in Picture 3.5.17 and watch the proxy light up. Here is the page in a browser:

Picture 3.5.20: Comcast.com localized homepage

Finally, we have arrived in a browser at the localized homepage. We see "60614 Chicago NW" in the upper right corner, so we know we have arrived at the right page. By looking at the page source using View → Source in the browser, we are able to match the page with one of the responses in the proxy analysis. In particular, the 19453-byte response from Comcast.com is the localized homepage, and our outgoing request to get the page is analyzed below:

Picture 3.5.21: Using a proxy to help load Picture 3.5.20 (previous page)

When we went to the page in Picture 3.5.17 in the process of localizing our browser, we were required to submit additional information in order to successfully load the localized homepage in Picture 3.5.20. Because that information (specifically, a radio button for "Chicago NW (Area 3)") in no way appears in the URL captured by the proxy (Picture 3.5.21) that triggered the localized page, hacking out that URL will probably not get us the page that we need. We are better off doing a form analysis of the form in Picture 3.5.17.

Picture 3.5.22: Form analysis of the radio button in Picture 3.5.17

Looking at the default viewer of the Execution Window, ZipCodeFranchiseMapID is the value that we need to submit to advance to the localized homepage. Our strategy will be to extract the ZipCodeFranchiseMapIDs with a pattern and then submit them individually to form 3. Because the addresses that we are ultimately going to look up are going to come from a giant input file, let's create an input handler for comma-separated input. Putting the code together arrives at:

Picture 3.5.23: The first part of the code to Example 3.5.6

The strategy for the code in Example 3.5.6 comes from the need to create a depth-first navigational scheme (recall Picture 2.4.1). The first task is to fetch the input data from the input.csv file. Because each input must be processed independently, we create an inner-outer query relationship. We then use the pattern Data Translator to extract the ZipCodeFranchiseMapIDs, and because the ZipCodeFranchiseMapIDs must be processed independently, we create another inner-outer query relationship. Sometimes the first address that we submit sufficiently loads the localized homepage (and thus the ZipCodeFranchiseMapID is null), which explains the node outer2a. We will be able to better appreciate inner-outer relationships after this example and Examples 3.6.1 and 3.6.2.
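Structurally, Example 3.5.6 nests one inner query inside another so that each address, and then each ZipCodeFranchiseMapID, is processed one at a time. A skeleton of that shape is sketched below; everything in angle brackets is a placeholder (the full listing is in Picture 3.5.23), but the nesting of the parenthesized inner queries follows the description above:

<read one line of comma-separated input from input.csv>

(
  <fetch the Comcast page with ignore refresh and submit the address fields from input>

  <extract the ZipCodeFranchiseMapIDs with a pattern Data Translator>

  (
    <submit input.ZipCodeFranchiseMapID to form 3 and select the localized homepage>
  )
)

Each parenthesized block is an inner query: the outer one runs once per input address, and the inner one runs once per extracted ZipCodeFranchiseMapID, which is exactly the depth-first ordering of Picture 2.4.1.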
Running the code in Example 3.5.6 with the following input:

Picture 3.5.24: The input file to Example 3.5.6

produces the following Execution Window:

Picture 3.5.25: The Execution Window to Example 3.5.6

We see above that the Browser View tab of the Execution Window to Example 3.5.6 appears to be properly loading the various localized homepages, so that the special offers and prices for different geographical locations can be scraped for an analyst. Clearly, submitting a form to get to a localized homepage (such as this example of Comcast.com in Picture 3.5.10) can require meticulous debugging.

3.6 Learning by Example

Now that we have seen the theory behind navigation and circular navigation, we can apply this knowledge to develop a web spider. The example that we will discuss allows us to search for products at http://www.techdepot.com and report the prices. Of course, websites that sell products change on a regular basis, so this walkthrough could be out of date by the time you read it; however, it will still be a useful tool for developing a strategy for scraping information out of any website. The first thing we must do is analyze the form from the techdepot homepage.

Picture 3.6.1: The form analysis of http://www.techdepot.com

We see that there are 2 forms. Form 1 appears to be a search box that we can submit keywords to, and the second form appears to be an email address submission for an email list that we can subscribe to. Form 1 is the form we want to submit values to for the variable Keyword. The next step is to create an input list of keywords to be submitted in the form submission. We will use a basic inner-outer relationship so that the input records (keywords) are processed independently and can be freely selected throughout the Inner Query that does the form submission.

Picture 3.6.2: Creating the inner-outer input relationship

We should interpret the code in Picture 3.6.2 as being a query with a single node fetch1 such that the outer parent of the query contains the search inputs, which are "Dell 720" and "Sony Vaio sz." Remember that inside of the parentheses, every field selected in the outer parent is global. Thus, any node inside of the parentheses knows about the field myinput, and only one value of myinput will be processed by the inner query at a time. Instead of submitting to the FORM_ID, we chose to submit to the form target, which is a substring match of the FORM_URL, sometimes called an action URL. The only variable that appears to be essential for a proper form submission is Keyword. Looking at the code in Picture 3.6.2, we see that the default viewer of the outer query happens to be the default viewer of the inner query. The inner query writes the Navigational State, along with any errors, into its default viewer, and outer1 selects all of those fields into its own default viewer, which is what we see in the Execution Window below:

Picture 3.6.3: Getting past the first form submission

In the browser view of the SC (SOURCE_CONTENT) field, we see a message stating that results have been found. Next, we need to create a navigation scheme that will click the "next button" until all pages of results are exhausted. As it turns out, the search for "Dell 720" has only 1 page of results whereas the search for "Sony Vaio sz" has 4 pages of results. For this particular website, clicking for the next page of results involves javascript.
We will have to simulate the javascript in WebQL so that we can reach the next page of results. The javascript function NextPage is a form submission that advances us to whatever page we specify as its argument. We know this from looking at the function in the source code, which can be seen through the Text View tab of the Execution Window. Looking through the source code of the first results page of the “Sony Vaio sz” search shows us the javascript function called NextPage:

function NextPage(iPage){
document.frmNav.Page.value=iPage
document.frmNav.searchtype.value='nav'
document.frmNav.submit()
}

Given this function, our strategy will be to identify the maximum results page number based upon the page numbers towards the bottom right corner of the website, and then submit the frmNav form over and over for each page number.

Picture 3.6.4: Results pages with a black double-arrow

The search “HP” came up with 200 pages of results. Determining what this last page is can be tricky because the soft grey vs. black double-arrow is different for searches that have 5 or fewer pages of results:

Picture 3.6.5: Results pages with a grey double-arrow

After looking at the source code, we need to write a pattern robust enough to handle either case. Here are the two relevant excerpts of code:

(Picture 3.6.4) <a href="javascript:NextPage(200)">200</a></td><td class="td"><a href="javascript:NextPage(6)">>></a></td></tr></table>

(Picture 3.6.5) <a href="javascript:NextPage(4)">4</a></td><td class="td"><font color="#aaaaaa">>></font></td></tr></table>

Our goal is to create a pattern Data Translator that will extract the number 200 in the first case (Picture 3.6.4) and the number 4 in the second case (Picture 3.6.5).

Picture 3.6.6: Getting the pattern Data Translator correct

We see that our choice of pattern works in the output of the default viewer:

Picture 3.6.7: Pages 2-200 for the search “HP”

We have successfully created the arrays that we need to scrape out the rest of the results to our searches. We will remove the “HP” search for now because if we can crawl 3 additional pages of results in the “Sony Vaio sz” search, we will assume that the mechanism will work for n additional pages of results. The next step in the process of developing this web spider is to do a form analysis of the first page of results. The best way to do this is to search for “Sony Vaio sz” through a browser, then do a View→Source to see the source code in a text editor. Save the text file as c:\test.html. Now use the select * trick through the forms Data Translator to learn about the forms on c:\test.html.

Picture 3.6.8: Form analysis of first page of search results

Looking through all of the form variables for forms 2 and 4, we are not sure which form the NextPage function works off of. We know that the name of the form is frmNav, but we don’t know its number until we do further investigation.

Picture 3.6.9: Form analysis of first page of search results

The underlying uncertainty of which form to submit stems from the fact that both form 2 and form 4 contain control variables consistent with those in the NextPage javascript function, namely Page and searchtype. The next move we should make is to do a text search for “<form ” through the source code of the first page of results to check each form name and see which one is frmNav.
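In WebQL the extraction happens in the pattern Data Translator shown in Picture 3.6.6, but the Regular Expression idea can be tested on the two excerpts in plain Python. One pattern that handles both cases is to match only the NextPage links whose visible text repeats the page argument (the double-arrow link never does) and then take the largest number; this is a sketch of one workable pattern, not necessarily the exact one used in Picture 3.6.6.

import re

def last_page(html: str) -> int:
    # Links like <a href="javascript:NextPage(200)">200</a> repeat the page number as their
    # visible text; the ">>" double-arrow link does not, so the backreference excludes it.
    pages = re.findall(r'NextPage\((\d+)\)">\1<', html)
    return max(int(p) for p in pages) if pages else 1

black = '<a href="javascript:NextPage(200)">200</a></td><td class="td"><a href="javascript:NextPage(6)">>></a></td></tr></table>'
grey = '<a href="javascript:NextPage(4)">4</a></td><td class="td"><font color="#aaaaaa">>></font></td></tr></table>'
print(last_page(black), last_page(grey))   # prints: 200 4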
Picture 3.6.10: The second match on the page for the “<form ” search

Using the find feature of our text editor, we see that the form that we want to submit is the second form, not the fourth. Looking back at Picture 3.6.8, we notice that the searchtype form variable already has the value “nav”, so we should only need to explicitly submit the Page variable. If we have all of the page variables that we need in an array, there is no need for a circular loop, so our fetchmanager will union join fetch1 and fetch2 without looping. Let’s see if it works:

Picture 3.6.11: Writing the second fetch and moving the fetchmanager

Picture 3.6.12: Verifying that we got all results

We indeed get every page that we want because of the second fetch node fetch2. There is one record for the “Dell 720” search and four records for the “Sony Vaio sz” search. We are able to get the pages we want (and only the pages we want) without having to write a circular reference. The next two steps are to write an error report and to scrape out the prices, product descriptions, product numbers, and any other information we want about each product, which can be called a pattern-crunch. To write the error report, we join to the node outer1 where the URLERROR field is not null. We will write the input search string, the error, and a timestamp into a CSV file called errors.csv. To handle the prices and other information we want to extract, we will output source join to the SC (SOURCE_CONTENT) field of outer1 and write a series of Regular Expressions to match and extract the targeted information. We will also use the convert using Document Retrieval Option to chunk the page moments before we pattern-crunch it. Here’s all of the code to complete the example:

Picture 3.6.13: The first part of Example 3.6.1

Every product listing is an HTML table row that begins with “<td class="product">”, which will serve as our chunking converter. Although the product listings fit tightly in a browser, the HTML source code between two product listings is extensive.
Below is the HTML source code of a product listing from its beginning up until the next product: <td class="product"><a href="http://www.TechDepot.com/product.asp?productid=4523329&ii d=1250&Hits=2000&HKeyword=HP">HP Consumer - nx6310 - Core Solo T1300 1.66 GHz - 15</a></td> <td nowrap rowspan="2" align="right" valign="top" class="price"> $892.95<IMG src=/Assets/images/clear.gif height=1 width=1></td> </tr> <tr> <td colspan="2"><img src="/Assets/images/clear.gif" alt="" width="9" height="5" border="0"></td> </tr> <tr> <td colspan="2" rowspan="2" valign="top"> <table width="365" border="0" cellspacing="0" cellpadding="0"> <tr> <td colspan="2" class="bullet">• Intel Core Solo T1300 <td rowspan="3" align="right" valign="bottom" class="bullet">Platform: PC</td></tr><tr><td colspan="2" class="bullet"> • 512 MB / 60 GB / 1.66 GHz </tr> <tr><td colspan="2" class="bullet"> • Windows XP Professional </tr> <tr> <td colspan="3"><img src="/Assets/images/clear.gif" alt="" width="9" height="3" border="0"></td> </tr> <tr><td colspan="3"><table border="0" cellspacing="0" cellpadding="0"><tr> 201 CONFIDENTIAL <td><table width="289" border="0" cellspacing="0" cellpadding="0"> <tr> <td class="td" align="left" valign="top">sku#S5523329 </td> <td class="td" align="left" valign="top">mfr#PZ903UA#ABA</td> <td class="td" align="right" valign="top"><img src="/Assets/images/Results_stock_Check.gif" alt="" width="12" height="13" border="0"> In-stock</td></td> <td valign="top"><img src="/Assets/images/clear.gif" alt="" width="5" height="1" border="0"></td></tr> <tr> <td><img src="/Assets/images/clear.gif" alt="" width="85" height="1" border="0"></td> <td><img src="/Assets/images/clear.gif" alt="" width="122" height="1" border="0"></td> <td colspan="2"><img src="/Assets/images/clear.gif" alt="" width="90" height="1" border="0"></td> </tr> </table></td> <td align="right" valign="top"><a href="Javascript:GotoProd(4523329,1250)"><img onClick="Javascript:GotoProd(4523329,1250)" src="/Assets/images/Results_details_box.gif" alt="" width="71" height="15" border="0"></a></td> </tr></table></td></tr> </table></td> </tr> </table></td> </tr> <tr> <td colspan="3" align="right"><img src="/Assets/images/ccccccpixel.gif" alt="" width="384" height="1" vspace="10" border="0"></td> </tr> </table></td> </tr> <TR><TD><table width="464" border="0" cellspacing="0" cellpadding="0"> <tr> <td><input type="checkbox" name="compare" value="4119381"></td> <td> <a href="http://www.TechDepot.com/product.asp?productid=4119381&ii d=1250&Hits=2000&HKeyword=HP"><img 202 CONFIDENTIAL src="http://images.techdepot.com/comassets/productsmall/CNET/I40 9904.jpg" alt="" width="70" height="70" border="0"></a></td> <td> <table width="369" border="0" cellspacing="0" cellpadding="0"> <tr> <td class="product"> Here is the browser view of the same product listing: Picture 3.6.13.5: A techdepot product listing in a browser It appears that the first anchor tag after our chunking converter provides us with the product title. Extracting it is derived in Data Translator p1 seen in the rest of the code to Example 3.6.1 below. Included also are Data Translators p2, p3, and p4, which extract the price, sku part number, and manufacturer’s part number with shipping information, respectively. Error reporting is implemented, as well. 
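For readers who want to see the chunk-then-extract idea outside of WebQL, here is a minimal Python sketch written against the excerpt above: split the page on the “<td class="product">” chunking converter, then pull the title, price, sku, and manufacturer part number out of each chunk. The Regular Expressions below are illustrative guesses fitted to this one listing; the real work is done by the convert using option and the Data Translators p1 through p4 in Example 3.6.1.

import re

def crunch(source_content: str):
    rows = []
    # Drop everything before the first listing; each remaining chunk starts one product.
    for chunk in source_content.split('<td class="product">')[1:]:
        title = re.search(r'<a href="[^"]*">([^<]+)</a>', chunk)   # first anchor after the converter
        price = re.search(r'class="price">\s*\$([\d,.]+)', chunk)
        sku = re.search(r'sku#([^<\s]+)', chunk)
        mfr = re.search(r'mfr#([^<\s]+)', chunk)
        rows.append({
            "product": title.group(1).strip() if title else None,
            "price": price.group(1).replace(",", "") if price else None,
            "sku": sku.group(1) if sku else None,
            "mfr": mfr.group(1) if mfr else None,
        })
    return rows

Applied to the listing above, this yields the title “HP Consumer - nx6310 - Core Solo T1300 1.66 GHz - 15”, the price 892.95, sku S5523329, and mfr PZ903UA#ABA, which is the same information the Data Translators target.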
Here’s the output that we get in the Execution Window:

Picture 3.6.14: The Execution Window of Example 3.6.1

We are happy to see that the data comes out clean and ready to be dumped into a database or report-making computer program. Although it takes time and money to develop this kind of code, it takes a lot more time and a lot more money to pay a person to click through the site and manually write down the numbers and/or type them into Excel.

Let’s move on to another example by writing a similar crawler for http://www.superwarehouse.com. Again, we start by debugging the forms on the homepage:

Picture 3.6.15: Form debugging on the homepage of Superwarehouse

We see that WebQL finds more than one form such that FORM_ID = 1. Even if this is something that we’ve never seen before, we can work around the impediment by using a form target for form submission as we do in Example 3.6.1. The variable keyword is the only form variable that we need, so we are ready to create an inner-outer query relationship with input from the user. We will make the input a CSV file rather than an array to create variety.

Picture 3.6.16: Creating an input processing node and an input file

Running this code creates the following Execution Window:

Picture 3.6.17: Default viewer of Navigational States of search results

Because of the way that 9 source files appear for 3 fetches (5 in the first fetch, 2 in the second fetch, and 2 in the third fetch), we should decide if using the ignore children Document Retrieval Option is appropriate. Investigating the source files suggests that it is, because the child pages of each fetch are advertisements that we don’t care about. Inserting the ignore children Document Retrieval Option and rerunning the code creates the following Execution Window:

Picture 3.6.18: Default viewer of Navigational States of search results

We now have fetch1 working the way we want because there are no child pages being loaded, and we have 3 SOURCE_CONTENT records: 1 for the form load, 1 for the “HP Laserjet” search, and 1 for the “12A5950” search. The next step is to pass the second and third records of fetch1 into a fetchmanager that ultimately holds all of the results pages. To get all of the results pages, we must investigate the “Next >>” link of the “HP Laserjet” search and figure out how to pull out the rest of the results. We can do this by looking at the Text View tab of the SC (SOURCE_CONTENT) field. Using the Ctrl+F text finder, we find out that clicking the “Next >>” link is a javascript form submission.

<a href="javascript:nextResults('21', '40', '', '' );"

We will use this text (with Regular Expressions for the digits) as a filter for the getNextLink node from Picture 3.5.3.7. This is a good filter because the function call does not happen in the search results page source for a single page of results like “12A5950”. The function exists in the page source, but it is never called.
Let’s take a look at the nextResults function:

<script language="JavaScript">
/* submit to next result set */
function nextResults (NextStart, NextEnd, odrBy, srtTp) {
var theForm = document.search;
var jsOdrBy; jsOdrBy=odrBy;
var jsSrtTp; jsSrtTp=srtTp;
if ((jsOdrBy != '')&&(jsSrtTp != ''))
theForm.action = theForm.action+'&resultSet=nextset&Start='+NextStart+'&End='+NextEnd+'&ybRdroVan='+jsOdrBy+'&ybTrsVan='+jsSrtTp;
else
theForm.action = theForm.action+'&resultSet=nextset&Start='+NextStart+'&End='+NextEnd;
theForm.submit();
}
function swapIMG(oImg) { //se script.js T.
if (oImg.width!=68) { //max size
oImg.width=oImg.width;
} else {
oImg.width=68;
}
}
</script>

Because clicking the “Next >>” link causes a form submission, our next best move is to do a form analysis of the first results page. The best way to do that is to copy the text out of the SC (row 2) cell of Picture 3.6.18 and paste it into a text editor. Save the file as c:\test.html and use the select * trick with the forms Data Translator.

Picture 3.6.19: Form analysis of the first page of search results

We now have the form debugged, but given the function nextResults, we still aren’t exactly sure which form is being submitted. Recall Picture 3.5.4. We need to set up a proxy for a browser to debug the form submission when we click the “Next >>” link. The first thing we should do is go to the http://www.superwarehouse.com website and submit a search for “HP Laserjet”. In WebQL, we will use View→Network Monitor and set whatever port we want. We must set the same port in Microsoft Internet Explorer by going to Tools→Internet Options→Connections→LAN Settings→(check) Use a Proxy→Advanced. Insert the name “localhost” for the SOCKS proxy on the same port and click OK. Finally, click the “Next >>” button in MSIE and watch for a POST statement in the proxy.

Picture 3.6.20: Proxy analysis of a javascript form submission

Looking at the circled POST variables (left circle), the form that gets submitted when we click the “Next >>” link is form 1; however, because the action URL manufactured by nextResults works for form 1 or form 3, submitting form 3 might work, too. The right circle shows what URL is manufactured by the javascript function nextResults. We know how to submit POST variables using the submitting Document Retrieval Option, and we know how to extend a URL using the rewrite using Document Retrieval Option. Pulling it all together gives us code like this:

Picture 3.6.21: WebQL spider with a circular referencing fetchmanager

As it turns out, submitting form 3 instead of form 1 still works because, ultimately, in their system manufacturing the proper action URL matters more than which POST variables get submitted. We now have all 21 results pages for the “HP Laserjet” search (20 results per page; 419 total results) and we have 1 results page for the “12A5950” search. Here is the Execution Window:

Picture 3.6.22: Navigational States of the node fetchmanager

We can look at the Text View tab of any one of the SC (SOURCE_CONTENT) fields and begin the process of extracting the data with a pattern-crunching node. The pattern-crunching node is output source joined to the SC field of the fetchmanager. The pattern-crunching node will convert each SC field into as many chunks as there are results on the page plus 1 (16 results create 17 chunks).
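To make the simulation concrete, here is a small Python sketch of what submitting nextResults over and over amounts to: manufacture the same kind of action URL (appending resultSet, Start, and End, 20 results at a time, as in the call nextResults('21', '40', ...)) and POST the form for each page. The base action URL and POST variables below are placeholders; in the real query they come from the form analysis in Picture 3.6.19 and the proxy analysis in Picture 3.6.20.

import requests

# Placeholder action URL; the real one comes from the form/proxy analysis.
base_action = "http://www.superwarehouse.com/search.cfm?keyword=HP+Laserjet"
form_data = {}          # the real POST variables are the circled values in Picture 3.6.20
total_results, page_size = 419, 20

pages = []
for start in range(21, total_results + 1, page_size):        # 21, 41, 61, ... like nextResults('21','40',...)
    end = start + page_size - 1
    action = base_action + "&resultSet=nextset&Start=%d&End=%d" % (start, end)
    pages.append(requests.post(action, data=form_data).text)

With 419 total results and 20 per page, the loop issues 20 additional fetches, which together with the first results page accounts for the 21 pages noted above.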
We will also combine all of the 216 CONFIDENTIAL error traps into an error report. Hunting through the SC field suggests that a particular table cell tag should be used as a chunking delimiter: <td align="center" valign="top" style="padding-top:3px; paddingleft:3px;padding-bottom:3px;padding-right:3px;" class="border_rtBttm"> The part ‘class=”border_rtBttm”’ appears only in this table cell tag that starts the listing of a new product. After searching for how many times ‘class=”border_rtBttm”’ appears in the SC field, it will suffice as our chunking converter. The remainder of Regular Expressions can easily be derived now that the page is chunked. Looking at the unabridged HTML source code of a product listing relative to our chunking converter, we have: <td align="center" valign="top" style="padding-top:3px; paddingleft:3px;padding-bottom:3px;padding-right:3px;" class="border_rtBttm"> <a href="http://www.superwarehouse.com/HP_1,500_Sheet_Feeder_for_ LaserJet_4200_and_4300_Series/Q2444B/p/435756"> <img onload="swapIMG(this)" src="http://www.superwarehouse.com/images/products/hp4250feeder _thn.jpg" border="0" align="top" alt="HP 1,500 Sheet Feeder for LaserJet 4200 and 4300 Series" class="thn"> 217 CONFIDENTIAL </a> </td> <td class="border_bottom" valign="top" style="padding-top:3px; padding-left:3px;padding-bottom:3px;padding-right:3px;"><div class="swhtxt2"><a href="http://www.superwarehouse.com/HP_1,500_Sheet_Feeder_for_ LaserJet_4200_and_4300_Series/p/435756" class="boxLink2">HP 1,500 Sheet Feeder for LaserJet 4200 and 4300 Series</a></div></td> <td valign="top" class="border_bottom" style="padding-top:3px; padding-left:3px;padding-bottom:3px;padding-right:3px;"><div class="swhtxt2">Q2444B</div></td> <td valign="top" class="border_bottom" style="padding-top:3px; padding-left:3px;padding-bottom:3px;padding-right:3px;"><div class="OrangeTxtSMbld">$478.99</div></td> <td valign="top" class="border_bottom" style="padding-top:3px; padding-left:3px;padding-bottom:3px;padding-right:3px;"><div class="swhtxt2">In Stock</div></td> <td valign="top" class="border_bottom" style="padding-top:3px; padding-left:3px;padding-bottom:3px;padding-right:3px;"><a href="http://cart.superwarehouse.com/index.cfm?fuseaction=cart.add &productID=435756"><img src="images/bbuynow.gif" width="50" height="18" border="0"></a></td> <td align="left" width="71"> </td> </tr> <tr> 218 CONFIDENTIAL <td align="center" valign="top" style="padding-top:3px; paddingleft:3px;padding-bottom:3px;padding-right:3px;" class="border_rtBttm"> Here is the browser perspective of the code: Picture 3.6.22.5: Browser view of a superwarehouse product listing Seeing that every item of interest has an </div> tag after it, we can construct one pattern that wild cards everything between the </div> tags and cleans the data. Implementing the Regular Expression and adding the error report completes the code for Example 3.6.2: 219 CONFIDENTIAL Picture 3.6.23: The first of half the code to Example 3.6.2 220 CONFIDENTIAL 221 CONFIDENTIAL Example 3.6.2 extracts all prices for a given search string stipulated in the in.csv file. For our inputs, we get 400+ results for “HP Laserjet” and 1 result for “12A5950”. Looking at the accurate results in the Execution Window, we see the value in WebQL because employing a person to click and copy all of these prices, product names, item codes, etc. 
out of a browser must cost more than having one of our web spiders scrape out the information:

Picture 3.6.24: The Execution Window for the given input of Example 3.6.2

An analysis of crawling deeper into the site for more specific information shows little improvement in the data set. At the current depth of our crawl, we are able to extract the price, product title, item code, and availability. Crawling one level deeper creates (for our given input) over 400 additional fetches, whereas the current version of the query makes only 23 fetches (1 for the homepage, 1 for the “12A5950” search, and 21+ for the “HP Laserjet” search). Is it worth increasing the number of fetches 20-fold for the additional information that is available to be extracted? Looking in a browser at the additional information besides the product name, item code, price, and availability, we see an item summary and some specs.

Picture 3.6.25: Deciding whether or not to go deeper

Upon further investigation, the summary and features aren’t essential to a pricing application because the analyst knows what the product is based on the product title, and most of the specs are in the product title. At an additional level of depth, the information appears to be too redundant to justify increasing the number of fetches made by the query by a factor of 20. This concludes Example 3.6.2. Clearly, there are several steps to writing a web spider. The process is roughly the same for any site regardless of whether or not we use circular navigation. The heuristic explanation is to submit forms and click until we get to results pages. We use those results pages to get additional results pages, which may or may not require circular navigation. Finally, we pattern-crunch and error report on all results pages, creating output and error files that are easily customizable for market demands.

CHAPTER 4: Developing Applications for the QL2 Client Center

4.0 QL2 Client Center Introduction

The QL2 Client Center is a storage and processing powerhouse co-located with InterNap in the KIRO/Fischer building of downtown Seattle, Washington. In 2004, the system rotated 13 processing servers with a redundant 2 terabytes of disk. The QL2 Client Center is built and managed by Dave Haube. QL2 Software is a business that both produces and sells software. The products that QL2 offers are the license of WebQL Studio and/or the license of custom queries developed with WebQL Studio, ready to run on the Red Hat Linux-driven QL2 Client Center.

Picture 4.0.0: QL2 business sketch

In 2005, the vast majority of revenue came from building custom applications and hosting them on the QL2 Client Center. As an example, let’s look at an application set on the QL2 Client Center that a customer deals with on a regularly scheduled basis. Go to http://client.QL2.com. The QL2 Client Center is a java application that generates a website and manages large-scale data extraction tasks for customers. Log in as ‘demo’ with ‘demo’ as the password. We can click around the Client Center as we please and use the question marks for help and explanations. In the top right area of the screen, click on the “Cars” tab. In the row of tabs that appears below it, click on the “Results” tab.
Picture 4.0.1: A screenshot of the carrental results tab on the QL2 Client Center

We are going to look at a rental car pricing output data set that was pulled out of multiple travel sites by multiple WebQL queries cued by the Client Center. Click on the rcout.csv output file or the rctables.htm file associated with any of the customer data collection runs. Notice how much data was ripped out of the sites in real-time, formatted, HTML-reprocessed, and delivered to the customer in a matter of a couple of hours.

4.1 Navigate and Branch

The coding style used by most applications on the QL2 Client Center is “Navigate and Branch.” Think of the code (also called a “web spider”) crawling from one Navigational State to the next, all the while branching right and left to extract data and/or trap errors along the way.

Picture 4.1.0a: Node-by-Node Diagram of “Navigate and Branch”

Notice how the backbone of this navigation schematic is the join from node Bn to node A(n+1). Depending on the particular application being written, the relationship from Bn to A(n+1) might need to be an inner-outer query relationship. If an inner-outer query relationship is needed to enable depth-first navigation like in Picture 2.4.1, Bn would be the node with all input variables to the inner query that begins with node A(n+1). Making regular parent relationships into inner-outer parent relationships can be inferred from Example 2.4.1. The function of the B nodes in Picture 4.1.0a is to drive the Navigational States down whatever set of clicks we want to simulate. For the errors, we union join all of the error trapping nodes together so that we write all of the errors in a single node called D0.

Picture 4.1.0b: Node Diagram for error collection

Union joining all of the error traps together and then writing them in a single node is a good idea because sometimes we want to modify the way we write error reports, and we would rather make a change once in an error collection node than make changes to potentially hundreds of error trapping nodes. Sometimes we want to crawl around a site and extract information as we crawl, and then ultimately write all of the extracted data to a specific destination in a single node. In this case, the B and C nodes in Picture 4.1.0a are combined:

Picture 4.1.0c: Modified version of “Navigate and Branch”

There’s no reason why we cannot extract a link with the pattern Data Translator and extract targeted information with another pattern Data Translator in the same node. If A1 and A2 require form submissions, remember to use the cache forms Document Retrieval Option in A1. We can infer the syntax of submitting a second form from Picture 3.5.3.6. Referring back to Picture 3.5.3.5, circular navigation with a pattern-crunch is merely a modified version of Navigate and Branch where node A2 navigates (clicks the “next” button) repeatedly until all pages of targeted information are exhausted. If our where filters are creative enough, we can complete the diagram with a single arrow, and node BC1 is then the pattern-crunching and next-link-selecting node.

Picture 4.1.0d: Modified version of “Navigate and Branch”

This tight loop does not have a fetchmanager node like Picture 3.5.3.5. Different programmers prefer different coding styles. Using a fetchmanager spreads the process of looping across 4 nodes and makes the code more versatile when websites get updated and the code needs to change.
On the other hand, tight loops require less code and get the job done. We can produce more effective diagrams by using special blocks to symbolize navigational looping structures. When we create a navigational looping structure, all we care about in terms of a game plan is the fetchmanager, because all errors and data collection will branch off of the fetchmanager. We can union join fetchmanagers into sourcegatherers and then branch off of the sourcegatherers. Let’s see an example by making a game plan for scraping an office supply website. We will allow a user to provide input in 2 ways. The first is colon-separated product category links to follow and the second is search terms to be submitted into a search box. Clicking through the site to a product category arrives at a set of results that may have more than one page. The same is true if you enter a certain part number as a search term. The process of taking the first results page and triggering the rest of the results pages can be combined. We will use two different views, each of which gets us to the first results page, then we’ll dump the first results pages from both views into the same circular referencing routine to pull out the rest of the pages.

Picture 4.1.0e: Modified version of “Navigate and Branch” with loop blocks and Views

A View is a query that has a default viewer. A node can select out of that default viewer. We can union 2 such nodes and then union join them into a circular looping fetchmanager. The fetchmanager should have the navigational states of results pages in it, at which point we branch for data scraping and error trapping. If we need minute details, we can extract the links associated with each individual product in the pattern-crunch, and then fetch the link and go after whatever information we want in the next page load. Suppose that there are 100 results to a search displayed 15 at a time. The circular looping fetchmanager should have 7 pages of results in it from 7 fetches. To go after the fine print of each product, we need to do 100 additional fetches, which is about 14 times the amount of network traffic. A cost/benefit analysis should be done for each level of depth past the results pages that the spider crawls to. Going deeper and deeper into a site for every individual product creates a lot more code and a lot more traffic for sometimes minimal improvement in the output data set (see Picture 3.6.25). Picture 4.1.0e is the outline of a CyberPrice query, which is the roughest and toughest query set running on the QL2 Client Center. For an engineering executive summary, selling applications in the fashion of Pictures 4.1.0a through 4.1.0e is the multimillion dollar idea of QL2 Software, Inc. Putting lots of nodes together in this style results in Data Flow diagrams that look like this:

Picture 4.1.1: Data Flow view of an application from the QL2 Client Center

In this particular query, we use a launch file cued by the Client Center to preprocess the input provided by a customer through a browser. Every launch file starts with “launch_”. Once input has been preprocessed, the launch file calls an Inner Query or view. Each input record to the view uses Navigate and Branch to collect data, trap errors, report site outages, etc. After the first input record to the view completes processing, the second record goes. This aids in the process of depth-first navigation (recall Picture 2.4.1).
Again, the general concept of a Client Center application is to first write a launch file that preps input from the user.

Picture 4.1.2: A typical launch file from the QL2 Client Center

The field FILE_BASE is actually a field created at the command line when the Client Center cues a launch file. In 2005, the WebQL Studio IDE was still only a Windows product, so the Red Hat Linux-driven Client Center actually calls everything from a command prompt. At that point, variables can be passed into launch files. We can create these variables by updating the appropriate *.ddl file in base/src/hosting/schema/. The appropriate *.ddl file corresponds to the type of Client Center application we are building (Airfare, Carrental, Hotel, etc.). The final node of this launch file is interesting because this is the bizarre case where we really don’t care what we are selecting, so select * is used. Basically, all of the file/db writing occurs in the Inner Query digikey_getone.wqv, so the selecting/writing of that final unnamed node in the launch file is irrelevant. The style of the first node, getinput, has evolved slightly since this query was last updated.

Picture 4.1.3: (left) Modifications to the first node; (right) the view read_in.wqv

The Inner Query read_in.wqv on the right shows how the user input has been standardized across several queries. Calling the view read_in.wqv in the launch file makes changing the way 20 queries read in data a matter of making 1 code update rather than 20 code updates. In late October of 2004, we released CyberPrice, which is one of the more impressive WebQL application sets currently running on the QL2 Client Center. After the launch file preps the data, each record is input into a view (or “Inner Query”) that could call its own view or views. Here is a segment of code out of the view that corresponds to Picture 4.1.2.

Picture 4.1.4: A segment of a typical Inner Query file from the QL2 Client Center

Clearly, Navigate and Branch is the coding technique being used. The necessary components of the Navigational States are selected in nodes fetch1 and fetch2a. The visible branches of fetch1 are urle1, cnterr, and getmylinks, whereas fetch2a has visible branches urle2a and method1a.

4.2 Client Center Diagnostic Tools

The QL2 Client Center has several ways of analyzing a query’s performance on the fly. Suppose Orbitz changes the location of a price, and it forces us to update our code. The time between the site changing and our delivering the update should be minimized, and always less than 48 hours, ideally less than 24 hours. When a query is failing to produce output or if excessive errors are being triggered, we want to be able to see that without opening the 40 megabyte text output file and hunting for certain records. The Client Center has a diagnostics page that lists the various data extraction tasks being run on our hosted hardware, and the pages are sorted by application type (like Airfare, Carrental, etc.). The numbers displayed show the input, output, and error counts, which can raise a red flag when something is wrong. Basically, we don’t want a $100,000+ customer running something for 8 hours with no data at the end because the price of every product changed from using a <TH> tag to a <TD> tag. The diagnostic pages give us the quick lead on where to look when we are supervising query performance.
Picture 4.2.1: A screenshot of the Diagnostics Page of the QL2 Client Center

We can sense and fix problems with the Client Center in other ways as well, but this is the quickest and easiest way to get acquainted with finding and fixing problems. Sometimes we don’t notice problems until customers point them out. We will need to log in as an administrator to get to this page and click System→Diagnostics. Each record in Picture 4.2.1 represents a “Run” on the Client Center that corresponds to a batch of customer-defined inputs. The Inputs column, of which only the “s” is visible on the left in Picture 4.2.1, represents the number of input line items. Err represents the number of output line items that are an error, and Raw represents all line-item output. These stats will fill in the browser window upon completion. QSRC are Work Unit stats that evolve as the job runs in real-time. Q is the number of queued Work Units. S is the number of stopped Work Units. R stands for currently running and C stands for completed. If we click deeper into the Diagnostics Page to the right of the query start time and duration, we can view the Run in detail. A “Run” is a batch of customer input that is currently running, already complete, or queued to run on our hosted-processing hardware. Customer input is sometimes thousands of input line items. Instead of sequentially running all input line items together, the Client Center will break the Run into mini-batches of 20 or fewer called Work Units and run them in parallel. Some Client Center applications only have 4 inputs per Work Unit, such as cruises. A given Work Unit can run only 1 query and operates on a single thread. In the WebQL Windows IDE, we can adjust our thread settings by going to View→Options→Network.

Picture 4.2.1.5: WebQL Network Options

“Maximum number of running requests” represents the number of threads; the Client Center operates with single-threaded Work Units. The reason that Client Center Work Units operate on a single thread is that some applications require assurance that requests will follow a particular order, and that order can be guaranteed only when WebQL navigates on a single thread. When we are extracting links that we know are independent of the post data and cookies, we can “up the firepower” and run 15 or more threads simultaneously. Depending on how large a customer is, customers have the right to run different numbers of Work Units. If a customer’s run uses 100 line items on 1 query and 25 on another, the Client Center will break the run down into 5 Work Units on launch_query1 and 2 Work Units on launch_query2. There will be 20 inputs per Work Unit on launch_query1, and 20 inputs on the first launch_query2 Work Unit and 5 inputs on the second launch_query2 Work Unit. Suppose that the customer’s account is authorized to use 5 Work Units simultaneously. In this case, 5 of the 7 Work Units will immediately begin running even if they all go to the same site. The number of Work Units is a representation of the amount of power a customer has on our data center. In 2005, 10 Work Units cost approximately $100,000 per year to license. A Run is always assigned a Run ID, and a Work Unit is always assigned a Work Unit ID, so if we spot a problem and want to discuss it with another worker, referencing it by the Run ID / Work Unit ID combo is the quickest and easiest way.
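The Work Unit arithmetic is easy to reproduce. A minimal Python sketch of how a Run’s line items split into Work Units of at most 20 inputs (the usual batch size; some application sets, like cruises, use a smaller one) looks like this:

import math

def work_units(line_items: int, batch_size: int = 20):
    # Assumes line_items > 0; every Work Unit holds batch_size inputs except possibly the last.
    units = math.ceil(line_items / batch_size)
    return [batch_size] * (units - 1) + [line_items - batch_size * (units - 1)]

print(work_units(100))   # [20, 20, 20, 20, 20] -> 5 Work Units on launch_query1
print(work_units(25))    # [20, 5]              -> 2 Work Units on launch_query2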
Picture 4.2.2: A browser window of a Run by Work Unit (System→Diagnostics→View)

Every Work Unit is cued by a go.sh file and has an input file called in.csv. run.txt stores the name of the processing server assigned to the Work Unit, and the text of the Messages tab of the WebQL Execution Window is stored in run.log. Any files created by the query appear to the right of the run.log file.

4.3 Post Processing

For customers, probably the most important issue is Data Integrity. Data Integrity means that every line item of input provided by the customer through the QL2 Client Center has accurate output data or an error reported in the delivered output file set. The sites on the internet can have any one of many problems with them at any given moment, especially if a site is bombarded with traffic. Sometimes customer input on a hosted query does not produce any output or any errors, so we must force an error to appear. After a set of Work Units runs, a Post Processing Work Unit starts and runs a Post Processing Query. Every Post Processing Query’s name begins with “pp_”. The major idea behind Post Processing is to manipulate all output, good and bad (with errors), out of the same file shared across all Work Units, called raw.csv. In every query, when we write output, we always write to raw.csv, whether it’s an error or not. Post Processing then delivers the custom-tailored output files to the customer. One of the default jobs of Post Processing is to analyze the total Run’s input relative to its output and create error messages when an input does not have corresponding output or errors. This is done most easily with left outer join. Picture 4.3.1 shows the standard procedures for Post Processing on the left and the node that generates errors for input and output not reconciling on the right. The code on the left is the entire pp_default.wql file and the code on the right is a section of rawfile.wqv (a complete code citation comes later). Both nodes RAW_INPUT and RAW_OUTPUT have a field called INPUT_ID. All records get filtered out by the where clause except for those that appear only in RAW_INPUT. This is the most creative use of left outer join so far. From the above, we see that the essence of Post Processing is 3 different steps.

Picture 4.3.2: The 3 step outline of pp_default.wql

Whether we are writing queries for Airfare, Carrental, CyberPrice, or any other QL2 Client Center application set, our default Post Processing will follow this model. These 3 steps are represented by the views rawfile.wqv, stdout.wqv, and stderr.wqv. Oftentimes the standard output format is not what the customer wants; in that case, we would write the view customer1out.wqv and then also write pp_customer1.wql as the Post Processing Query. Here’s the rest of the file rawfile.wqv, from the top:

Picture 4.3.3: The rest of rawfile.wqv

To provide a complete code citation, stdout.wqv and stderr.wqv are included as well.

Picture 4.3.4: stderr.wqv

Picture 4.3.5: stdout.wqv

In addition to these data-manipulation routines, DHTML and interactive environments with javascript are a way of massaging data into reports that pricing analysts can handle. The idea with DHTML is that we write to an HTML text file all throughout the code, starting with <HTML><HEAD> and building up tables with rows, cells, and so on.
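As a rough Python stand-in for that idea (the real reports are written by WebQL code like the segment in Picture 4.3.6 below, and the records here are made up), the report is just a text file that we keep appending table rows to as output records flow through:

rows = [
    ("Hertz", "Economy", 31.99),     # made-up records standing in for raw.csv output
    ("Avis", "Economy", 34.50),
]

with open("rctables.htm", "w") as report:
    report.write("<HTML><HEAD><TITLE>Car Rental Report</TITLE></HEAD><BODY>\n")
    report.write("<TABLE border=1><TR><TH>Vendor</TH><TH>Class</TH><TH>Price</TH></TR>\n")
    for vendor, car_class, price in rows:
        report.write("<TR><TD>%s</TD><TD>%s</TD><TD>$%.2f</TD></TR>\n" % (vendor, car_class, price))
    report.write("</TABLE></BODY></HTML>\n")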
In the WebQL code itself, we end up with segments that look like this:

Picture 4.3.6: A segment of WebQL code that generates HTML

An HTML expert who is also a WebQL expert can do amazing things, considering that all the customer provides is an input file on the Client Center. Imagine setting up a car rental shopping list to deliver a link to a clickable report in an email inbox every morning between 9 and 10am that looks like this:

Picture 4.3.7: One of many clickable HTML reports made in Post Processing

As a business, we provide output for a customer’s input. The overall diagram of how data flows on the Client Center was the recipe for the success of QL2 Software.

Picture 4.3.8: Schematic diagram of input to output for a given application set on the QL2 Client Center

The customer’s input is provided through a web browser and used to cue Work Units that each run a given query. The raw.csv output files for the Work Units are concatenated together and passed to the Post Processing Query. WebQL converts the compiled output into customized files for the user. Each application set on the Client Center has its own maximum line items per Work Unit and its own Post Processing Query, with customized versions for customers as needed. In terms of source code, the application sets are stored in base/src/hosting/scripts.

4.4 Managing Customer Accounts

Not every customer runs every query in a given application set. The check boxes for enabling queries for a customer can be found on the Client Center through an administrative account at System→Organizations by clicking on an organization (customer or prospect) and on an application set. In addition to enabling queries, input variables to queries are defined per customer login at the same page location. Each customer account is an organization composed of sub-accounts. This setup allows settings to be made for an organization as a whole, while the individual logins (sub-accounts) can override those settings on a login-by-login basis.

Picture 4.4.1: System→Organizations→<customer>→Cars

Activating queries and setting input variables are handled for an organization in Picture 4.4.1. This interface is somewhat self-explanatory, and we can usually figure anything out when clicking around the Client Center by using the “?” quick-help links.

4.5 CVS/Cygwin

The QL2 system of developing code, testing it on a staging server, and then releasing to the live server is done through a program called CVS, which is part of the complete download package of Cygwin. Go to Cygwin.com, download the download manager, and do a “full” download of all components. This could take a while. After completing the install, you will need to set a series of environment variables. There is an employee “how to” guide on Cygwin, which should be used in addition to Chapter 4.5.

4.6 The Source Repository

The Source Repository is the collection of queries for QL2 Client Center customers. The queries are launched from the Red Hat Linux WebQL command line. As developers, we are most concerned with the base/src/hosting/scripts directory and its subdirectories. Once we have our username and password, have fully installed Cygwin and set the environment variables on our machine, and have done the CVS cobase command, we type interactions with the Source Repository at the DOS command line:

C:\>cd base/src/hosting/scripts
C:\base\src\hosting\scripts>cvs up -d

This will update all scripts onto our machine from the source repository server (Radish).
In the scripts directory, there is a series of directories representing the various application tabs in the Client Center, such as carrental, cyberprice, vacations, etc. If we have to add files (launch_xx.wql, xx.wqv) for an application to run under the carrental tab, continue from above:

C:\base\src\hosting\scripts>cd carrental/pricing
C:\base\src\hosting\scripts\carrental\pricing>cvs add launch_xx.wql xx.wqv
C:\base\src\hosting\scripts\carrental\pricing>cvs commit launch_xx.wql xx.wqv
C:\base\src\hosting\scripts\carrental\pricing>./dist-new nutmeg carrental

The above set of commands adds, commits (version 1.0), and distributes the files to the Nutmeg staging server. Once the query has passed all testing on Nutmeg, continue at the DOS command line:

C:\base\src\hosting\scripts\carrental\pricing>cvs tag -F release launch_xx.wql xx.wqv
C:\base\src\hosting\scripts\carrental\pricing>./dist-new master carrental

Now the files have been distributed to the live server that customers interact with. We can run a test on the live site through an administrative account. We must make sure that the query that we are trying to run is activated in the account we want to run the query in (recall Chapter 4.4 and Picture 4.4.1). These commands can be remembered as a 5-step process.

Picture 4.6.1: The 5 steps to releasing a file

In addition to these 5 commands, the database administrator must make an update that we cue by committing the updates.txt file and the *.ddl file of the specific application tab (carrental.ddl in this case). These files are located in the base/src/hosting/schema directory. Again, recall Picture 4.4.1. Making new variables and queries appear so that they can be activated for a given customer account is done through the *.ddl files and the updates.txt file. The way to add variables and queries is illustrated in the following pictures:

Picture 4.6.2: A sample of queries and variables in the carrental.ddl file

Picture 4.6.3: A sample of updates in the updates.txt file

We must make sure to update the schema directory before modifying and committing the files:

C:\base\src\hosting\schema>cvs up

By not specifying a file name, we update the entire schema directory in one command. After modifying the files, we then commit them:

C:\base\src\hosting\schema>cvs commit updates.txt carrental.ddl

We can now alert a database administrator that we are ready for a database update for both the staging server (Nutmeg) and the live site (Master).

4.7 Making Queries Industry Strength

We have already written queries to scrape prices out of http://www.techdepot.com and http://www.superwarehouse.com. In order to sell the application for big money on the QL2 Client Center, some enhancements must be made to the code. For example, the error trapping that is done in Examples 3.6.1-2 is insufficient. Another improvement is an input handler. Finally, we will normalize all errors and output into a single file format that is ready for post processing, which is outlined in Chapter 4.3. Let’s start by creating an input handler for both Example 3.6.1 and Example 3.6.2 to work from. The reason for creating an input handler is to ease the process of updating all queries in our application set simultaneously. Suppose we add 18 additional queries to the 2 that we already have and then change the way that input is processed, such as column 3 having the search terms instead of column 2.
If we didn’t have an input handler, then we would have to update all 20 pieces of code to handle the input properly. With the input handler, we just update it once and all 20 pieces of code use it. This example is similar to Picture 4.1.3.

Picture 4.7.1: Implementing an input handler with a FILE_BASE

On an industry-strength data center, input files are directory-specific and require a FILE_BASE to locate the appropriate input file. The FILE_BASE will also be needed for directory-specific output files throughout the code. There will be an error trap added in the reader.wqv file once the error enhancements are implemented. The next step is to determine what the standardized output will be between the two software agents, techdepot and superwarehouse. Looking at the current versions of output:

Picture 4.7.2: Deriving a common output file format

We see that the output formats are somewhat similar. We will have to include the FILE_BASE in our data writing, but first we need to derive the common output format. The itemcode field of Example 3.6.2 can be merged with the sku field of Example 3.6.1. Typically, there are two different types of product numbers for products sold on the internet. One is the website’s part number (or “sku”) and the other is the manufacturer’s part number. Some sites use only one type of part number. In the process of merging applications for different websites, we need to take both types of product numbers into consideration in the common output format. The techdepot code has both sku and manufacturer’s part numbers. Another difference between the two applications is that the superwarehouse code includes the inputID along with the output. Because we will also include the site name in the common output table, the table will require 9 common fields (SITE, INPUTID, SEARCHSTRING, SKU, MANUF, PRODUCT, PRICE, AVAILABLE, TIMESTAMP). Our ultimate goal is to combine errors and output into one table, so let’s add 3 more common fields for errors (MESSAGE, ERRORID, RERUN) for a total of 12 fields. The RERUN field is a 1 or 0/Null based on whether or not the error can potentially be corrected by rerunning the input for the given SEARCHSTRING. Let’s call the 12-field table raw.csv. We can tell if a record in the table is an output item or an error item based on whether or not the MESSAGE field is null. This is important in post processing, which is covered in Chapter 4.3. Implementing the 12-field format makes the code look like this:

Picture 4.7.3: Making the tables similar with 12 fields

The INPUTID field will need to be added to Example 3.6.1 when the input is read in through the reader.wqv file for the code to compile. Before we worry about compiling anything, we need to implement error traps and controls. In addition to every HTTP error associated with a fetch, we must also write error traps for responses like “Zero Products Found for your Search.” Further, we need to take into consideration that some searches will have thousands of results. What should happen in these cases? Should there be a limit of 100 results per search? To create a fetch ceiling, let’s say that we will only scrape out a maximum of 3 pages of results per search. Implementing the error updates recreates Example 3.6.1 as Example 4.7.1:

Picture 4.7.4: The first part of the code to Example 4.7.1

We are now able to see the adjustments made to reader.wqv and how they have impacted the code.
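To keep the target format in mind while reading the code, here is roughly what raw.csv records could look like. The header follows the 12-field list derived above; the output row borrows values from the techdepot listing in Picture 3.6.13.5, and the error row, including its MESSAGE, ERRORID, and RERUN values, is purely hypothetical:

SITE,INPUTID,SEARCHSTRING,SKU,MANUF,PRODUCT,PRICE,AVAILABLE,TIMESTAMP,MESSAGE,ERRORID,RERUN
techdepot,1,HP,S5523329,PZ903UA#ABA,HP Consumer - nx6310 - Core Solo T1300 1.66 GHz - 15,892.95,In-stock,2005-11-14 09:12:05,,,
superwarehouse,2,badpart123,,,,,,2005-11-14 09:12:41,Zero Products Found for your Search,2,1

A record is an output item when MESSAGE is empty and an error item when it is not, which is exactly the test that post processing relies on.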
Adding lots of error traps often causes us to be more specific in the references we make, as in the node columninput. The need to include “select1.” in the FILE_BASE reference is indeed because of an error trap that gets implemented in the inner query of the outer1 node. Here’s some more of the code:

Picture 4.7.5: The second part of the code to Example 4.7.1

Picture 4.7.5 shows us the new error trapping. Here is more of the code:

Picture 4.7.6: The third part of the code to Example 4.7.1

Picture 4.7.7: The final part of the code to Example 4.7.1

Running the input from Picture 4.7.4 produces the following output viewed through Microsoft Excel:

Picture 4.7.8: The output of Example 4.7.1

The output seems ready for post processing. Notice that we’ve stripped the dollar signs and commas off of the prices. We can make similar enhancements to Example 3.6.2 and recreate it as Example 4.7.2. The next step in that process is to implement the error traps needed for an industry-strength web-crawler and write our output to the common 12-field raw.csv format that gets the output ready for post processing. Picture 4.7.9 is the start of the superwarehouse query:

Picture 4.7.9: The first part of the code to Example 4.7.2

We see that the start is very similar to that of Example 4.7.1.

Picture 4.7.10: The second part of the code to Example 4.7.2

We see some variations from Example 4.7.1 and some similarities. Let’s look at more of the code:

Picture 4.7.11: The third part of the code to Example 4.7.2

Finally, here’s the last of the code:

Picture 4.7.12: The final part of the code to Example 4.7.2

Running the code with the input seen in Picture 4.7.9 produces the following output file viewed in Microsoft Excel:

Picture 4.7.13: The output of Example 4.7.2

The Excel view of the raw.csv file suggests that the crawler is working as desired by giving up to 3 pages of results (60 results) for each search term. We have now gone through an example of creating an application set in Examples 4.7.1-2. We could easily add 10 or more applications that work in conjunction with reader.wqv. Through post processing, we can deliver different customized output and error files to different customers. The idea is to develop an application set for a given market (airfare, car rental, vacation packages, etc.) and get lots of customers to run the same code to surf and scrape the sites, and then each customer has its own post processing code that delivers customized output and error files (recall Picture 4.3.8). Clearly, there are quite a few details to writing an industry-strength web-crawler, especially one with elaborate error controls. Elaborate error controls include both fetch-related network errors and errors triggered out of the site, such as “Zero results found.” Even when we feel like one of our crawlers is complete, oftentimes new customer input triggers behaviors in the website that we have not yet seen, and we have to update the code.

Appendix I: HTML knowledge for WebQL

To effectively code a WebQL web spider, basic knowledge of HTML tags is beneficial. HTML stands for HyperText Markup Language. HTML pages are plain text files that get interpreted by a browser into what we see when we click around the internet.
Nearly every page on the web is built in the structure of tables, with rows containing cells, which in turn may contain a table. Being able to position images, links, and text for a browser using <table>, <tr>, and <td> tags is a skill of every HTML developer. Over time, websites have evolved into detailed works of art and graphic design. A website that would take 3 guys and $100,000 to develop in the mid-to-late 1990s is a matter of a 4-digit contract for a single person in 2005. Luckily, building the intricate details of modern HTML is not a requirement to be an expert WebQL application developer. We just need to know HTML basics. Here is a rough sketch of an HTML page that contains a table.

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<TITLE>Prices and Taxes in 3 Currencies</TITLE>
</HEAD>
<BODY background="liberty.gif">
<TABLE align=center cellpadding=3 cellspacing=3 bgcolor=blue>
<tr>
<th colspan=3 bgcolor=red align=center><b>Prices and Taxes in 3 Currencies</b></th>
</tr>
<tr>
<td><font size=+2><b>$$</b></font></td>
<td><font size=+2 color=white><b>55.75</b></font></td>
<td><font size=+2>5.52</font></td>
</tr>
<tr>
<td><font size=+2><b>Yen</b></font></td>
<td><font size=+2 color=white><b>6500</b></font></td>
<td><font size=+2>622</font></td>
</tr>
<tr>
<td><font size=+2><b>British Pound</b></font></td>
<td><font size=+2 color=white><b>30</b></font></td>
<td><font size=+2>2.88</font></td>
</tr>
</TABLE>
</BODY>
</HTML>
<!-- Example I.1 -->

Looking at this code through a browser, we can figure out how table rows <tr>, table cells <td>, and table headers <th> fit inside a table. We also notice the table row/cell relationships.

Picture I.1: The HTML table seen through MSIE

The background image file is tiled along the background of the browser window, and the table is horizontally positioned in the center and vertically at the top. The HTML file is available for download at http://www.geocities.com/ql2software/pages/myPrices.html. As mentioned before, crafting HTML is often a game of making tables inside of tables.
Consider this code: <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"> <HTML> <HEAD> <TITLE>FCY rates for USD deposit </TITLE> </HEAD> <body background=”liberty.gif”> <TABLE cellSpacing=1 cellPadding=0 width=430 border=3> <TBODY> <TR vAlign=top align=left> <TD> <TABLE cellSpacing=1 cellPadding=2 278 CONFIDENTIAL width="100%" border=0> <TBODY> <TR vAlign=top align=middle> <TD> <B>Time</B></TD></TR> <TR vAlign=top align=middle> <TD>1 day</TD></TR> <TR vAlign=top align=middle> <TD>1 week</TD></TR> <TR vAlign=top align=middle> <TD>1 mth</TD></TR> <TR vAlign=top align=middle> <TD>2 mths</TD></TR> <TR vAlign=top align=middle> <TD>3 mths</TD></TR> <TR vAlign=top align=middle> <TD>6 mths</TD></TR> <TR vAlign=top align=middle> <TD>12 mths</TD> </TR></TBODY></TABLE></TD> <TD> <TABLE cellSpacing=1 cellPadding=2 width="100%" border=0> <TBODY> <TR vAlign=top align=middle> <TD><B>< 10'000 USD</B> </TD></TR> <TR vAlign=top align=middle> <TD>0.0000</TD></TR> <TR vAlign=top align=middle> <TD>0.0000</TD></TR> <TR vAlign=top align=middle> <TD>2.7150</TD></TR> <TR vAlign=top align=middle> <TD>2.8450</TD></TR> <TR vAlign=top align=middle> <TD>3.0350</TD></TR> <TR vAlign=top align=middle> <TD>3.2750</TD></TR> <TR vAlign=top align=middle> <TD>3.5250</TD> </TR></TBODY></TABLE></TD> </TD> </TR></TBODY> 279 CONFIDENTIAL </TABLE> </BODY> </HTML> <!-- Example I.2> We notice that the table is not centered this time and that some table cells actually contain tables. <TBODY> and </TBODY> are tags that symbolize the beginning and end to a table body. Notice how the table has a border but does not have a background. Imagine what the code looks like through a browser and then take a look below. Picture I.2: MSIE view of the nested tables in Example I.2 The table in this HTML code has 1 row and 2 cells. Each cell contains a table of 8 rows each with 1 cell width. If familiar with Data Translators in WebQL, we can apply the table rows, table cells, and table columns Data Translators to the page sources of 280 CONFIDENTIAL Examples I.1-2. The source code of the table of Example I.2 is available at http://www.geocities.com/QL2software/pages/myRates.html. Looking at TABLE_ID, ROW_ID, and COLUMN_ID Pseudocolumns, we can try to reprocess both tables by rotating the vertical columns into rows and rows into columns. A few other tags besides table-related tags are worth mentioning. <BR> is a line break that forces text to appear on different lines. Line1<BR>Line2 is HTML code that will put Line2 physically below Line1 on a page. Similarly, <NOBR> This text stays on one line.</NOBR> is a tag-set that prevents text from breaking into different lines no matter how smashed the browser window gets. <td><IMG src=”myDirectory/myImage.jpg”></td> is merely placing an image in a table cell. If we wanted to extract the image in a WebQL program, we can get the URL by either using the images Data Translator or by using a Regular Expression pattern Data Translator to extract the image’s URL extension and concatenate it with the domain of the SOURCE_URL Pseudocolumn. If we aren’t yet familiar with Data Translators and Pseudocolumns, they are explained in detail in Chapter 2.1. When hunting through HMTL page sources for a particular location in the page, we should get used to using text find (cntrl+F). Looking at a web page from a browser’s view, we should be able to use a text string in the finder to get us where we want to be in the HTML source. 281 CONFIDENTIAL The HTML form tag is another tag that is helpful to be familiar with. 
The HTML form tag is another tag that is helpful to be familiar with. We will leave the code exactly as it was found in the page source.

<form action="index.cfm?handler=data.processsearchb" class="formstyle" method="post" name="headersearch">
<tr>
<td rowspan="2" width="150" background="images\spacer.gif"> </td>
<td colspan="2" width="250" background="images\spacer.gif">
<font color="#FFFFFF"><b>PART # / KEYWORD</b></font></td>
<td rowspan="2" width="150" background="images\spacer.gif"><img align="right" border="0" src="images/Same-Day_Shipping.gif" alt="Same Day Shipping / No Minimum Order"></td>
</tr>
<tr>
<td align="right" background="images\spacer.gif">
<input type="text" name="keyword" size="45" ID="Text1"></td>
<td><input type="image" src="images/layout/header/btnSearchBGC1A68B2.gif" ID="Image1" NAME="Image1" align="bottom">
<input type="hidden" name="isANewSearchLimitSearch" value="false">
</td>
</tr>
</form>

The class of the form named headersearch is a reference to a stylesheet that can do things like set a background color and font size/color automatically in one stylesheet reference. An HTML stylesheet is a *.CSS file that is referenced for colors and styles throughout an HTML source page. A form has various inputs that determine which page is loaded next. What are the inputs to this form? What tags most likely appear shortly before the beginning and after the end of the form? Probably <TABLE…> and </TABLE>.

Now that we have some fundamental knowledge about HTML, we are better equipped to use WebQL for large-volume data extraction tasks. The HTML source code of a page is just tables and tags with limited structure; WebQL comes in and extracts the information out of the source code and gives it both structure and format for a database or archive. Only this basic knowledge of how HTML is a tag language that tells a browser how to present tables of information is required to be a successful WebQL Application Developer. If an HTML tag looks interesting, we can type it into Google to learn more about it.

Appendix II: javascript knowledge for WebQL

Basic knowledge of javascript, in addition to HTML, is beneficial for WebQL web-application development. Javascript is a client-side scripting language that can create dynamic changes in HTML, cause a redirect, create a game for a user to play, and/or submit a form. Javascript code is started and ended by the <SCRIPT…> and </SCRIPT> HTML tags.

<HTML>
<HEAD>
<TITLE>Redirecting...</TITLE>
<SCRIPT language=javascript>
function crawlFurther() {
  window.location = "http://www.geocities.com/ql2software/redirect/index.html";
}
</SCRIPT>
</HEAD>
<BODY onLoad="javascript:crawlFurther()">
</BODY>
</HTML>

This HTML file is at http://www.geocities.com/ql2software/pages/redirect.html. As the page loads, the window location is immediately moved to a different subdirectory via javascript. We should land at a page that looks like this:

Picture II.1: The page with a redirected body URL

Notice that the browser URL and the URL in the function crawlFurther aren't the same, but the page is the same if we load the http://www.geocities.com/ql2software/redirect/index.html file directly. If we view the source of Picture II.1, we will not see any reference to the message, only the URL in the crawlFurther javascript function.

Picture II.2: The body URL accessed directly

If the relocation is a dynamic link that includes a sessionID and/or other information, then we won't be able to load the body URL directly. We will have to cut the URL out of the source page that calls the redirect, which is the URL in the crawlFurther function, and then load it.
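"Cutting the URL out" is just an extraction buffer applied to the script text. Here is a plain-javascript sketch; the string below is pasted from the crawlFurther function above, and the buffer grabs whatever window.location is set to.

var pageSource =
  'function crawlFurther() { window.location = ' +
  '"http://www.geocities.com/ql2software/redirect/index.html"; }';

var redirect = pageSource.match(/window\.location\s*=\s*"([^"]+)"/);
if (redirect) {
  console.log(redirect[1]);   // the body URL we would load next
}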
The view of the source code of the page represented by the body URL does contain the message, "This is the redirected homepage."

Javascript can also be used to submit forms on a page. Submitting a form involves a user entering information in a browser window and clicking the "go" or "search" button. Whether we are buying something with our credit card or looking for flights, we are submitting a form, most likely activated by javascript. Let's go over an example of such a form.

Picture II.3: The 2005 Expedia car search form

This form comprises about 350 lines of source code, available at http://www.geocities.com/ql2software/pages/carsform.html. Not visible in the browser window above is the green "Search" button off to the right. Here's the HTML of the button:

<INPUT class=GoButton id=A1034_3 onclick=FS() type=submit value=Search>

We see that the button calls the function FS, which calls QFS. The necessary javascript to see what's happening is pasted in below. Notice that the function SF() submits the form instantiated in function FS().

function QFS(wt, postshop, dprt, dest, pinf) {
  var rate = 0;
  if (1 == wt) {rate = (true == postshop) ? 58 : 250;}
  else if (4 == wt) {rate = (true == postshop) ? 159 : 410;}
  else if (3 == wt) {rate = (true == postshop) ? 1491 : 1450;}
  else if (9 == wt) {rate = (true == postshop) ? 331 : 2500;}
  else if (6 == wt) {rate = (true == postshop) ? 2500 : 2000;}
  else if (28 == wt) {rate = (true == postshop) ? 500 : 500;}
  QualifiedForSurvey(wt, postshop, 45, rate, dprt, dest, pinf);
}

function TEK(a,evt){
  var keycode;
  if (window.event){ keycode = window.event.keyCode; evt = window.event;}
  else if(evt) {keycode = evt.which;}
  else {return true;}
  if(13==keycode){evt.cancelBubble = true; evt.returnValue = false; eval(a);}
}

function FS() {
  QFS(3, true);
  var f = getObj("MainForm");
  if (getObj('rateoption1').checked) { f.vend.value = f.vend1.value; }
  if (getObj('rateoption2').checked) { f.vend.value = f.vend2.value; }
  if (getObj('rateoption3').checked) { f.vend.value = f.vend3.value; }
}

function SF() {
  FS();
  window.external.AutoCompleteSaveForm(f);
  f.submit();
}

Here's the MainForm tag only.

<FORM onkeypress="TEK('SF()')" id=MainForm name=MainForm action=/pub/agent.dll?qscr=cars&itid=&itdx=&itty= method=post ?>

The great part about WebQL is that we don't need to understand every detail of the javascript; we just need to know what matters and what doesn't for the sake of submitting a form. Most of the time, the input for a form comes from a CSV file, which can be thousands of records (or "rows") long. We should try to figure out which form variables need to be submitted before clicking the "go" button. Hunting through the HTML form and the associated javascript calls is a great way to do it. Another technique involves using the forms Data Translator as in Example 2.1.9, which is an even easier way of identifying the form variables. We can also set up a proxy on our browser to see which variables and values get submitted when a form is submitted by a user click. We must make sure that we are submitting the variables in a similar fashion in the WebQL Studio IDE, which has a built-in proxy under the Network tab of the Execution Window (see Picture 1.2.7). The most important thing to understand is that information is returned to the user by a web server when the user submits a form. The form is built with HTML and sometimes uses javascript to pass the information from the user to the web server.
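Whatever the javascript does along the way, what ultimately travels to the server is a set of name=value pairs posted to the form's action URL. Here is a plain-javascript sketch of building that post string; the action is taken from the MainForm tag above, while the field names and values are made-up placeholders rather than the real Expedia variables.

var action = '/pub/agent.dll?qscr=cars&itid=&itdx=&itty=';   // MainForm's action from above
var fields = {                       // hypothetical names/values for illustration only;
  vend: 'XX',                        // the real list comes from the form inputs and the proxy
  pickupDate: '12/01/2005',
  dropoffDate: '12/03/2005'
};

// Build the URL-encoded post body, the same string a proxy would show us.
var postData = Object.keys(fields).map(function (name) {
  return encodeURIComponent(name) + '=' + encodeURIComponent(fields[name]);
}).join('&');

console.log('POST ' + action);
console.log(postData);   // vend=XX&pickupDate=12%2F01%2F2005&dropoffDate=12%2F03%2F2005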
WebQL automates this process and takes hundreds of man-hours of clicking and turns them into hundreds of minutes of web harvesting done by a computer.

Appendix III: Regular Expression Knowledge for WebQL

Regular Expressions are a way of matching text, perhaps pages long, against small expressions. Basic knowledge of HTML from the prior appendices will help. For the sake of examples, let's invent a function called 'matching' that has the syntax

myString matching myExpression

If the expression fits anywhere into myString, then the function evaluates to TRUE; otherwise the function evaluates to FALSE. If true, then we say myExpression "matches" myString. Suppose the source code text of our favorite homepage on the internet is stored in a text string called myFavHomepage; then

myFavHomepage matching '<HTML>.*?</HTML>'

returns TRUE if myFavHomepage has a start HTML tag followed somewhere on the page by an end HTML tag. The phrase .*? means "match anything until, but possibly match nothing." We notice that some HTML tags have more to them than just their name. Consider the image tag

<IMG src="/myDirectory/myPic.GIF" alt="Beauty at its best!">

How could we ambiguously match any image tag with a Regular Expression? From what we've learned so far,

<IMG.*?>

should be all we need. Believe it or not, that works! There is a "tighter" way of writing the expression that says, "match as many characters as possible that aren't greater-than characters until we hit a greater-than character":

<IMG[^>]*>

Brackets symbolize a character class, and the caret symbolizes negation. Thus, "match a character that is not a digit" is written

[^0-9]

and "match a character that is an English letter" is

[A-Z]

To "match one or more letters, as many as possible" is

[A-Z]+

while "match one or more letters, as few as possible" is

[A-Z]+?

With what we've learned, how would we match any HTML table?

<TABLE[^>]*>.*?</TABLE>

Character classes can also be used to extract prices. A price probably has digits, commas, and a period, so

[0-9,.]+

is a good pattern, but not the best. The pattern we have selected would also match 99.99.953.40,33,4, which is not a price. If the price for sure starts with a dollar sign and ends with cents, then

\$\d*\.\d{2}

is a better pattern. The period and the $ are actually reserved characters, so they must be set off by a backslash '\'. The period alone stands for "match any character," which we could infer from the first example, so to match a period character explicitly, we must use a backslash in the expression. '\d' symbolizes the character class [0-9], and the squiggly brackets say "match exactly this many." Notice that there need not be any dollar digits matched at all, only cents, and that there is no space between the dollar sign and the price. The dollar sign alone without the backslash symbolizes the end of a line. Thus, "match the start of the first table cell tag until the end of the same line" in the source file is represented by

<td.*?$

"Match the first table cell tag until the final end-of-line marker" would then be

<td.*$

WebQL is powerful because it allows us to use extraction buffers with Regular Expressions to pinpoint and extract data from an HTML source.
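We can try these patterns out anywhere a Regular Expression engine is handy. Here is a quick plain-javascript check of the expressions from this appendix; the test strings are made up, and the point is simply which pattern matches what.

var tag   = '<IMG src="/myDirectory/myPic.GIF" alt="Beauty at its best!">';
var price = 'Sale price: $ is not here, but $1299.99 is.';
var junk  = '99.99.953.40,33,4';

console.log(/<IMG[^>]*>/.test(tag));          // true  - matches any image tag
console.log(/[0-9,.]+/.test(junk));           // true  - too loose, matches the junk string
console.log(/\$\d*\.\d{2}/.test(junk));       // false - the junk has no $ followed by cents
console.log(price.match(/\$\d*\.\d{2}/)[0]);  // "$1299.99"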
Let's invent another function called extract_pattern(text1, pattern1). The idea is that pattern1 is a Regular Expression containing an extraction buffer denoted by parentheses. For example, if myFavHomepage is our favorite homepage, then

extract_pattern(myFavHomepage, '<TITLE>(.*?)</TITLE>')

will extract the title of myFavHomepage. Suppose we load a page that has a product with a price on it. Call the page myProductPage.

extract_pattern(myProductPage, '<td class=price>\s*\$\s*([\d.,]+)\s*</td>')

is an example of a thorough expression that won't mistakenly grab something other than what we are looking for. We notice that the table cell tag must be style-sheeted as a price cell, and the pattern also takes into account a dollar sign and the potential for empty whitespace on either side of the dollar sign and on either side of the price itself. We are much more effective WebQL programmers now that we have a fundamental understanding of Regular Expressions.

Appendix IV: Other Readings

There aren't many readings recommended to enhance our WebQL programming ability, for several reasons. First, a comprehensive book on Regular Expressions is great knowledge to have, but no book focuses on applying Regular Expressions to HTML sources, which is one of the most effective ways to apply the WebQL programming language. Recommended is the book that I looked at, which is in Perl. Second, if HTML surfing and slashing is one of the best ways to use WebQL, then knowing HTML is beneficial. Once upon a time, HTML books were popular, but if there is anything about an HTML tag that we don't know, we can just type it into Google and figure it out. The same is true for javascript. There are a couple of online resources listed that include the HTML 4.0 Specification text and an award-winning reference guide. There are also one javascript and one XML reference listed. Finally, WebQL is a programming language with only a dozen or so experts coding with it in the year 2005. Given that, this is the first book that explains how to code with WebQL from an engineering perspective. This limits the number of other readings relevant to the developer.

1) Friedl, J. E. F. Mastering Regular Expressions. ISBN: 1-56592-257-3
2) Full text of the HTML 4.0 Specification: http://www.w3.org/TR/1998/REC-html40-19980424/
3) The HTML 4.0 Reference: http://www.htmlhelp.com/reference/html40/
4) Javascript Tutorial: http://www.w3schools.com/js/default.asp
5) XML/XSLT Tutorial: http://www.w3schools.com/xml/default.asp

Appendix V: WebQL Function Tables

This quick reference is to help us find functions, sorted by function type. We can get additional help with functions by name by checking Help > Contents > Index.
WEBQL GIANT FUNCTION TABLE

String Functions

after(str1, expr1): returns whatever occurs after the point where expr1 matches str1; returns null if expr1 does not match str1
assemble(str1a, str1b, str2a, str2b, …): returns a string of '&'||strna||'='||strnb
before(str1, expr1): returns whatever occurs before the point where expr1 matches str1; returns null if expr1 does not match str1
between(str1, expr1, expr2): returns whatever text occurs before expr2 matches str1 and after expr1 matches str1; returns null if either expression does not match str1
str1 case matching expr1: returns true if expr1 matches str1 case sensitive, false otherwise
replace(str0, expr1, str1, ...): returns str0 with all matching characters of expr1 to str0 replaced by str1; we can have as many exprn and strn combos as we want
chr(num1): returns the unicode character associated with num1
chr_id(str1): returns the integer number associated with unicode character str1
clean(str1): returns str1 with all HTML and repeated whitespace removed
coalesce(str1, str2, …): returns the first non-null str
creation_date(str1): returns the creation date/time for the file specified in the path str1
datetime_to_text(date1, str1, str2, str3): returns the text version of date1 in format str1 in timezone str2 with locale str3; only date1 is a required argument and can be NOW
diff_html(str1, str2): returns an array symbolizing differences between the HTML in str1 and str2 (see help)
expression_match(sexpr1, str1): returns true if str1 meets the requirements of the search expression sexpr1 (see help)
extract_case_pattern_array(str1, expr1): case-sensitive version of extract_pattern_array
extract_pattern(str1, expr1): returns the contents of the extraction buffer of expr1 on str1; if expr1 does not match then null is returned
extract_pattern_array(str1, expr1): returns an array of extraction buffer matches of expr1 on str1; returns null if expr1 does not match str1
extract_url(str1): returns the first URL in the HTML of str1
extract_value(str1, str2): returns the value of the field str2 in the HTML of str1
file_size(str1): returns the size in bytes of the file represented by the path str1
highlight_html(str1, num1): returns the HTML str1 with all colors specified by num1 highlighted with tags added
html_encode(str1): returns a string that is the HTML encoded version of str1
html_to_text(str1): returns a string that has the HTML removed from str1
html_to_xml(str1): returns a string that gives well-formed XML to the HTML in str1
in(str1, array1): returns true if str1 is an element of array1; returns false otherwise
instr(str1, chr1): returns the integer position of chr1 in str1
str1 is null: returns true if str1 is null, false otherwise; str1 is not null works too
last_modified_date(str1): returns the last modified date of the file in the path specified by str1
length(str1): returns the number of characters in str1
load(str1): returns the string of the file specified in the path str1
load_lines(str1): returns an array of the lines of the file specified in the path str1
lookup(array1, str1): returns the position of str1 in array1; returns null if str1 is not in array1
lower(str1): returns str1 converted to lowercase
str1 matching expr1: returns true if expr1 matches str1 case insensitive, false otherwise
md5(str1): returns the MD5-hashed version of str1
normalize_html(str1): returns well-formed HTML around the segment of HTML in str1
nullif(str1, str2): returns null if str1 = str2, otherwise str1
number_to_text(num1, str1, str2): returns num1 as a string with str1 an optional format string and str2 an optional locale (see help)
nvl(str1, str2): returns str1 unless str1 is null, in which case str2 is returned
replace_links(str1, str2, array1): replaces links in the HTML of str1, optionally specifying an origin URL str2 and optionally specifying replacements array1 (see help)
studio_action(str1, str2): creates a special output field with a clickable link that specifies a type in str1 and a segment of code to execute in the body str2 (see help)
substr(str1, num1, num2): returns a new string starting at position num1 and continuing to the end of the string unless the optional num2 is given as the length of the substring
str1 || str2: returns a string of str1 concatenated with str2
text_to_datetime(str1, str2, str3, str4): returns a date object of text interpreted by the format str2 with optional timezone str3 and optional locale str4 (see help)
text_to_number(str1, str2, str3): returns a number from str1 that is optionally formatted by str2 with optional locale str3 (see help)
transform_xml(str1, str2): returns the transform of the XML of str1 by the stylesheet represented by str2
trim(str1): returns str1 without any external whitespace
upper(str1): returns str1 converted to uppercase
ureplace(str0, expr1, str1, ...): same as replace but uses a smaller character set (faster expressions)
url_change_type(str1, str2, str3): returns a URL string str1 with str2 as the file type conversion and the optional str3 as the default page name (see help)
url_decode(str1): returns the decoded URL string str1
url_domain(str1): returns the domain out of the URL string str1
url_encode(str1): returns the URL encoding of the string str1
url_file(str1): returns the file+path of the URL str1
url_filename(str1): returns the file name only of the URL str1
url_host(str1): returns the host of the URL str1
url_make_absolute(str1, str2): returns the URL of the path str1 applied to the URL base str2 (see help)
url_make_relative(str1, str2): returns the URL of str1 relative to the base URL str2 (see help)
url_parameter(str1, str2): returns the parameter associated with the value str2 in the variable string at the end of the URL str1
url_port(str1): returns the port number of the URL str1
url_query(str1): returns the query string portion of URL str1 (after the ?)
url_reparent(str1, str2, str3): returns the URL str1 with the old prefix str2 replaced by the new prefix str3 (see help)
url_scheme(str1): returns the protocol of the URL str1
xml_encode(str1): returns the XML encoded version of str1
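Before moving on to the numeric functions, here is what a few of the extraction-oriented string functions above boil down to, sketched in plain javascript. These mimic the behavior described in the table; they are not the WebQL engine itself.

// Rough javascript equivalents of after, before, and clean as described above.
function after(str, expr)  { var m = str.match(expr); return m ? str.slice(m.index + m[0].length) : null; }
function before(str, expr) { var m = str.match(expr); return m ? str.slice(0, m.index) : null; }
function clean(str)        { return str.replace(/<[^>]*>/g, '').replace(/\s+/g, ' ').trim(); }

var s = '<b>Price:</b> $12.50 <i>per unit</i>';
console.log(after(s, /Price:<\/b>/));   // " $12.50 <i>per unit</i>"
console.log(before(s, /\$\d/));         // "<b>Price:</b> "
console.log(clean(s));                  // "Price: $12.50 per unit"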
Numeric Functions

abs(num1): returns the absolute value of the number num1
arcos(num1): returns cos^-1(num1) in radians
arcsin(num1): returns sin^-1(num1) in radians
arctan(num1): returns tan^-1(num1) in radians
ceil(num1): returns the least integer greater than or equal to num1
cos(num1): returns the cosine of num1 radians
deg(num1): returns num1 radians in terms of degrees
exp(num1): returns e to the power num1
floor(num1): returns the greatest integer less than or equal to num1
hex(num1): returns the decimal value of the hex value num1
log(num1): returns the value of log base 10 of num1
mod(num1, num2): returns the remainder of dividing num1 by num2
oct(num1): returns the decimal value of the octal value num1
pow(num1, num2): returns num1 to the num2 power
rad(num1): returns the number of radians represented by num1 degrees
random(array1): returns a random element of array1; "random" alone generates a number between 0 and 1
round(num1, num2): returns num1 rounded to num2 decimal places
sin(num1): returns the sine of num1 radians
sqrt(num1): returns the square root of num1
tan(num1): returns the tangent of num1 radians
to_number(str1): returns the number of the given number-string
trunc(num1, num2): returns num1 truncated to num2 decimal places

Array Functions

array_avg(array1): returns the average of the elements of array1
array_max(array1): returns the value of the greatest element of array1
array_min(array1): returns the value of the least element of array1
array_sum(array1): returns the value of the sum of the elements of array1
array_to_text(array1, str1): returns all array1 elements connected as a string with str1 between each array1 element
array_unique(array1): returns a reduced-size or same-size array consisting of only the unique elements of array1
chisquare(array1): returns a string describing the CHI^2 of array1
diff_arrays(array1, array2): returns the array-difference of array1 and array2 (see help)
each(array1, A(each_item)): returns an array involving A on each_item of array1
flatten(array1): returns a modified version of array1 with all structural array nesting removed
merge(array1, array2): returns a single array of array1 concatenated with array2
reverse(array1): returns the elements of array1 in reverse order
sample(array1, num1): returns an array of size num1 that is a random sample of array1's elements
sequence(num1, num2, num3): returns an array starting at num1 incrementing by num3 up to num2
shuffle(array1): returns an array that contains the randomly shuffled elements of array1
size(array1): returns the number of elements of array1
slice(array1, num1, num2): returns a smaller array consisting of array1 indexed at num1 up until array1 indexed at num2
text_to_array(str1, str2): returns an array of the elements of str1 delimited by str2

Aggregate Functions

avg(col1): returns the average value of all records in col1
chisquare(col1): returns a string describing the CHI^2 of col1
count(col1): returns the count of all records in col1
count(unique col1): returns the count of all unique records in col1
gather(col1): returns the records of col1 in an array
gather_map(col1, col2): returns an array with col1n paired with col2n
geometric_mean(col1): returns the geometric mean of the records in col1
harmonic_mean(col1): returns the harmonic mean of the records in col1
max(col1): returns the maximum value of the records in col1
mean(col1): returns the mean value of the records in col1
median(col1): returns the median value of the records in col1
min(col1): returns the minimum value of the records in col1
mode(col1): returns the mode of the values in col1
sample_range(col1): returns the range of values (max - min) of col1
stddev(col1): returns the standard deviation of the records in col1
sum(col1): returns the sum of all records in col1
variance(col1): returns the variance of the records in col1

Document Retrieval Options

accepting text: specifies accepted content types; images only with text='image/*', any with text='*/*'
adding cookie text: manually adds the cookie in string text
adding header text: manually adds HTTP headers to a request specified in text
allow partial: entire request does not fail when a child page load fails
anonymize text: anonymizes a request with text as the anonymization key
cache forms: caches forms associated with a fetch
forget forms: clears memory of forms associated with a fetch
ignore option1: disregards option1, which is one of children, compound, encoding, errors, keepalive, redirect, refresh, truncation
convert using A: makes the source_content of a fetch A
converting from text: specifies the MIME type of a retrieved document as text, such as 'text/html'
converting to text: specifies the MIME type that a retrieved document is converted to before Data Translators are applied
forget cookies: manually drops all cookies added by a fetch
crawl of URL1 to depth N: automatically crawls every link on URL1 to clicking depth N
delay num1: automatically delays a fetch by num1 seconds
crawl of URL1 following if bool1: crawls URL1 so long as bool1 is true (see Chapter 2.5)
head: causes an HTTP request to use the HEAD method
independent: allows for repeat fetches
inline text: allows for text to be your datasource
post text: posts the post data string in text
proxy text: specifies an IP address or location in text
retry num1: retries a failed request repeatedly, num1 times or fewer if the request succeeds
rewrite ps1 using A: rewrites Pseudocolumn ps1 just before a fetch using the expression A
submitting values str1a for str1b to form k: submits form k in addition to loading a request; submits values strna for strnb for as many n as we want
submitting values str1a for str1b if bool1: submits every form where bool1 is true; submits values strna for strnb for as many n as we want
timeout num1: forces a timeout error if the request takes longer than num1 seconds
unguarded: enables a fetch if the fetch is a repeat
user agent text: makes text the user agent for fetching a page
via browser: loads the page as MSIE would
with errors: includes data communication errors when the "error" Pseudocolumn is selected

Document Write Options

append: WebQL adds to the existing file rather than overwriting the file
encoding text: specifies text as the file character set to be used when writing, such as 'utf-16'
type file: specifies the file write type, like text, excel, xml, etc.
fix at num1: specifies the file width as num1
pivot: makes a pivot table view of the written records
raw: writes data to a file without file conversions
stylesheet URL1: specifies a CSS/XSLT stylesheet located at URL1 with which to write the file
transform str1: uses the str1 XSLT stylesheet when writing the file
truncate: opposite of append; always writes over the file even if another node has written to it
with headings: includes aliases as column headers at write-time

Conditional Operators

case when bool1 then A else B end: returns A when bool1 is true, B otherwise; we can have as many when/thens as we want
decode(Key, A1, A2, B1, B2, …, …, Z): if A1=Key then A2, else if B1=Key then B2, ..., else Z
if(bool1, A, B): if bool1 is true then return A, otherwise return B

The Comprehensive Engineering Guide to WebQL
Problem Set Exercises with Solutions

CHAPTER 1

P1.0.1 WebQL is modeled after ______. WebQL is a _________ programming language.
WebQL is modeled after Microsoft "SQL". WebQL is a "table select" programming language ("database-style" is also acceptable).

P1.1.1 Which of the following does WebQL support?
a) read/write on an ODBC-driven database
b) read/write XML with XSLT transforms
c) read/write HTML with CSS stylesheets
d) all of the above
d)

P1.2.1 What happens in the WebQL Studio IDE when you push F5 when a *.wql file is active?
The code compiles and an Execution Window launches. --OR-- The query is run.

P1.2.2 The Network tab serves as a _______. Draw a diagram.
Proxy.

P1.2.3 Pick 2 other Execution Window tabs and describe what they do.
(1) Statistics: keeps track of requests, fetches, bytes in/out, and other network stats
(2) Messages: logs request queuing, loading, etc., as well as pattern extractions and joins
(3) Browser View: views a cell as a browser would
(4) Text View: views a cell as plain text
(5) Output Files: collection of any and all output files generated during query execution
(6) Data Flow: shows the node-to-node flow of data by illustration

P1.3.1 Based on what you saw in Chapter 1.3, how could you write a website with online financial transactions (like Orbitz) in such a way that WebQL Regular Expressions cannot extract the prices?
When the page request for prices comes to the server, every ticket price is assigned the correct price image with a random file name stored in a database. When the random file name is submitted back to the server for purchase, the file name is looked up in the database so the customer is charged the proper amount. Because WebQL sees only what a browser can see, the random price image filenames are meaningless unless we can either (1) see the database of price/image name pairings or (2) use an image-to-text interpreter to tell us what the image is showing.

CHAPTER 2

P2.0.1 What is the result of Select If(1,1,0), If(nvl(null,0),0,1), If(after('Hello','k'),0,1) ?
(1,1,1)

P2.1.1 Write a query that extracts the name of every file loaded by the www.foxnews.com homepage and stores the filenames in an array.
It's not surprising that the first file is a stylesheet.

P2.1.2 Write a query that will extract all bold text from the www.cnn.com homepage.

P2.2.1 Suppose node1 has 4 records, node2 has 2 records, and node3 has 7 records. If all 3 nodes are joined together unfiltered, then how many records are in the child node?
4*2*7=56

P2.2.2 Write a small segment of code that generates the 125 3-tuples (1-5, 1-5, 1-5) in the default viewer of the Execution Window.
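Purely to illustrate the record arithmetic from P2.2.1 behind this exercise, here is the same unfiltered cross join sketched in plain javascript (not WebQL): three sources of 5 records each join into 5*5*5 = 125 records.

var one = [1, 2, 3, 4, 5];
var tuples = [];
for (var a = 0; a < one.length; a++)
  for (var b = 0; b < one.length; b++)
    for (var c = 0; c < one.length; c++)
      tuples.push([one[a], one[b], one[c]]);   // every combination of the three sources

console.log(tuples.length);             // 125
console.log(tuples[0], tuples[124]);    // [1,1,1] ... [5,5,5]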
P2.3.1 Modify your query for P2.1.2 by counting the number of times every bold phrase appears on the www.cnn.com homepage. Filter the results by keeping only bold phrases that have a count greater than 1.

P2.4.1 This code produces the following output: What are the values of RECORD_ID?
(1,1,1)

P2.5.1 Of the CONTROL_NAME variables for the Slashdot.com login form, which ones do we need to submit when using the submitting Document Retrieval Option?
4, 1

P2.5.2 Explain what the Document Retrieval Options do in the above segment of code.
with errors: include network errors
retry 2 with delay 5: repeatedly retry loading a page up to 2 times when there is an error, until the page loads successfully; wait 5 seconds after each failure before retrying
timeout 100: force a timeout at 100 seconds
cache forms: caches forms so the page does not need to be reloaded to submit the form over and over
ignore children: filter out all child pages associated with the URL
rewrite…: fetch google instead of yahoo
convert…: make the SOURCE_CONTENT only what is after the <BODY…> HTML tag, or do no source conversion if there is no <BODY…> HTML tag

P2.6.1 What is the syntax of reading/writing to a db?
table@connection

P2.7.1 Use WebQL Virtual Tables to figure out if MS Access file format is accepted.
*.mdb is accepted.

CHAPTER 3

P3.1.1 Write a circular referencing scheme to generate a table of numbers demonstrating a Fibonacci sequence in the default viewer of the Execution Window. Maintain 3 fields: 1 for the nth number in the sequence, 1 for the (n+1)th number in the sequence, and 1 field for n itself. Make a table for n <= 20.
1 is the first number, 1 is the second number, 2 is the third number, 3 is the fourth number, 5 is the fifth number, 8 is the sixth number, …, and the nth plus the (n+1)st is the (n+2)nd number, etc.

P3.1.2 Write a Circular Referencing scheme to construct a table of values x, y=exp(x) in the default viewer of the Execution Window for x in [0, 10] at .05 intervals. There should be 201 records in the default viewer.

P3.1.3 Do P3.1.2 without using Circular Referencing.

P3.5.1 Write a web spider that submits item codes into the search box at http://www.abcink.com and crawls all pages of results to extract the price, sale price, coupon code, item code, and item description. Be sure to add a column for a time stamp for every price.
Above is the first half of the code to solve P3.5.1. The rest of the code is below. Here is the output from the default viewer for the given input in the node arrayinput. The above output completes P3.5.1.

The Future of WebQL

Since 2000, WebQL has evolved substantially in terms of Document Retrieval Options and Data Translators. More and more functions and Aggregate Functions have been added as well. WebQL has an edge over SQL in the creative application of Regular Expressions to HTML page sources. WebQL is also the most flexible product for web crawling and bulk form submission, making the product best equipped to handle the data demands of the future. The future of WebQL as an educational tool could take a shape similar to Matlab. Adding Aggregate Functions Plot(col1, col2) and Surface3D(col1, col2, col3) that return *.bmp images of graphs would be the first step. The future of WebQL as a web crawler involves allowing the developer to call javascript functions selectively as a Document Retrieval Option.
Currently, all javascript in an HTML page load must be handled manually in WebQL if the javascript effect is desired.

Ethically, many people question the idea of writing a web crawler. Some crawlers go out of control and generate too much traffic. Similarly, hitting a company's site with an overload of traffic can slow down the company's systems unfairly. Some web sites specifically say in a user agreement that the site's information cannot be reused for commercial purposes. Despite these problems, the computer science world should accept web-crawl programming both educationally and professionally. Making full use of communications resources and driving internet bandwidth technology higher and cheaper should not be seen as a problem at a university or corporation.

About the Author

Trevor J. Buckingham has been a university-level tech instructor since age 19. Lab instruction for CS61A with Brian Harvey and GSI instruction for Math 55 with Fraydoun Rezakhanlou were performed at Berkeley in 1999-2000, and exam writing for EE215 with Ceon Ramon was done at the University of Washington in 2003. After studying Electrical Engineering and Computer Sciences at the University of California 1998-2001, Trevor and 4 others started building a data center that has become QL2 Software in Pioneer Square, Seattle. After working full-time, Trevor studied Aeronautics at the graduate level with Ly and Rysdyk and Business in the PTC graduate program with Kim and EVPartners at the University of Washington 2003-2004. He is currently a long-distance runner in Chicago, Illinois, where he authored The Comprehensive Engineering Guide to WebQL. Trevor is one of the founders of the Engineering Entrepreneur Partnership, which hosts an annual TPC golf event at Snoqualmie Ridge.