3 - Content Grabber

Transcription

3 - Content Grabber

User Guide
© 2014 - 2015 Sequentum
Content Grabber
© 2014 - 2015 Sequentum
All rights reserved. No parts of this work may be reproduced in any form or by any means - graphic, electronic, or
mechanical, including photocopying, recording, taping, or information storage and retrieval systems - without the written
permission of the publisher.
Products that are referred to in this document may be either trademarks and/or registered trademarks of the respective
owners. The publisher and the author make no claim to these trademarks.
While every precaution has been taken in the preparation of this document, the publisher and the author assume no
responsibility for errors or omissions, or for damages resulting from the use of information contained in this document or
from the use of programs and source code that may accompany it. In no event shall the publisher and the author be liable
for any loss of profit or any other commercial damage caused or alleged to have been caused directly or indirectly by this
document.
Printed: May 2015
Contents
3
Table of Contents
Foreword
0
Part I Introduction
7
1 Editions
...................................................................................................................................
& Licensing
7
2 What...................................................................................................................................
is Web Scraping?
10
3 Web Scraping
...................................................................................................................................
with Content Grabber
11
4 Web Scraping
...................................................................................................................................
Limitations
12
5 Web Scraping
...................................................................................................................................
Techniques
15
HTML Content
.......................................................................................................................................................... 16
Dynamic..........................................................................................................................................................
Websites
17
XPath and
..........................................................................................................................................................
Selection Techniques
18
Regular Expressions
.......................................................................................................................................................... 18
6 Converting
...................................................................................................................................
Visual Web Ripper Projects
19
Part II Quick Start with Content Grabber
22
1 Installing
...................................................................................................................................
Content Grabber
22
2 Content
...................................................................................................................................
Grabber Basics
24
3 Exploring
...................................................................................................................................
the Main Window
27
4 Simple
...................................................................................................................................
vs Expert Layout
29
5 Building
...................................................................................................................................
Your First Agent
30
Choosing..........................................................................................................................................................
a Start URL
32
Select the
..........................................................................................................................................................
Content to Capture
33
Refine Your
..........................................................................................................................................................
Data (optional)
40
Output Data
..........................................................................................................................................................
Format
42
Test Your..........................................................................................................................................................
Agent
43
Scheduling
.......................................................................................................................................................... 45
Run Your..........................................................................................................................................................
Agent
46
Part III Content Grabber Editor
48
1 Customizing
...................................................................................................................................
the Editor Layout
48
2 Expert
...................................................................................................................................
Layout Features
51
3 Working
...................................................................................................................................
With the Web Browser
59
4 Command
...................................................................................................................................
Configuration
62
5 Selection
...................................................................................................................................
Tools and Short-cut Keys
69
6 Navigating
...................................................................................................................................
an Agent
73
7 Testing/Debugging
...................................................................................................................................
an Agent
76
Using the..........................................................................................................................................................
Debugger
76
Using Logging
.......................................................................................................................................................... 81
8 Scheduling
................................................................................................................................... 82
9 Copying
...................................................................................................................................
& Moving Agents
85
10 Exporting
...................................................................................................................................
& Importing Agents
86
© 2014 - 2015 Sequentum
3
4
Content Grabber
11 Automated
...................................................................................................................................
Agent Backups
87
12 Template
...................................................................................................................................
Libraries
88
Part IV Selection Techniques
89
1 Selection
...................................................................................................................................
XPaths
90
2 Optimizing
...................................................................................................................................
Selections
97
3 List Selections
................................................................................................................................... 98
4 Selection
...................................................................................................................................
Anchors
101
5 Editing
...................................................................................................................................
XPaths Manually
102
Part V Agent Commands
104
1 Container
...................................................................................................................................
Commands
108
2 Capture
...................................................................................................................................
Commands
109
Web Element
..........................................................................................................................................................
Content
111
Download
..........................................................................................................................................................
Image
114
Download
..........................................................................................................................................................
Video
119
Download
..........................................................................................................................................................
Document
123
Download
..........................................................................................................................................................
Screenshot
129
Calculated
..........................................................................................................................................................
Value
134
Page Attribute
.......................................................................................................................................................... 135
Data Value
.......................................................................................................................................................... 136
3 Action
...................................................................................................................................
Commands
137
Agent .......................................................................................................................................................... 145
Navigate
..........................................................................................................................................................
Link
151
Navigate
..........................................................................................................................................................
URL
152
Navigate
..........................................................................................................................................................
Pagination
155
Set Form
..........................................................................................................................................................
Field
160
Action Configuration
.......................................................................................................................................................... 163
Action
.........................................................................................................................................................
Types
164
Browser
.........................................................................................................................................................
Target
165
Action
.........................................................................................................................................................
Events
167
Browser
.........................................................................................................................................................
Activities
177
4 List ...................................................................................................................................
Commands
185
Data List
.......................................................................................................................................................... 186
Web Element
..........................................................................................................................................................
List
188
5 Group
...................................................................................................................................
Commands
189
Group .......................................................................................................................................................... 190
Group in
..........................................................................................................................................................
Page Area
191
6 Other
...................................................................................................................................
Commands
192
Execute..........................................................................................................................................................
Script
192
Transform
..........................................................................................................................................................
Page
197
Refresh..........................................................................................................................................................
Document
199
Close Browser
.......................................................................................................................................................... 200
7 Web...................................................................................................................................
Forms
200
8 Command
...................................................................................................................................
Library
204
Part VI Error Handling
207
© 2014 - 2015 Sequentum
Contents
5
1 Agent
...................................................................................................................................
Error Handling
208
2 Error
...................................................................................................................................
Logs and Notifications
210
Part VII Extracting Data From Non-HTML Documents
214
Part VIII Data
219
1 Database
...................................................................................................................................
Connections
219
2 The ...................................................................................................................................
Internal Database
220
3 Using
...................................................................................................................................
Data Input
223
Data Providers
.......................................................................................................................................................... 223
Input Parameters
.......................................................................................................................................................... 230
Agent Data
.......................................................................................................................................................... 234
4 Exporting
...................................................................................................................................
Data
235
Selecting
..........................................................................................................................................................
an Export Target
236
Exporting
..........................................................................................................................................................
Downloaded Images and Files
238
Character
..........................................................................................................................................................
Encoding
238
Changing
..........................................................................................................................................................
the Default Data Structures
239
Exporting
..........................................................................................................................................................
Data with Scripts
249
5 Distributing
...................................................................................................................................
Data
249
Email and
..........................................................................................................................................................
FTP distribution
250
Scripting
..........................................................................................................................................................
Data Distribution
251
Part IX Anonymous Web Scraping
252
Part X CAPTCHA & IP Blocking
253
1 CAPTCHA
...................................................................................................................................
Blocking
253
2 IP Blocking
...................................................................................................................................
& Proxy Servers
259
Part XI Improving Agent Performance and Reliability
264
1 Using
...................................................................................................................................
MS SQL as Internal Database
264
2 Using
...................................................................................................................................
the Static Parser Browser
265
3 Optimizing
...................................................................................................................................
Agent Commands
267
Part XII Building Self-Contained Agents
270
1 Customizing
...................................................................................................................................
the User Interface
273
2 Using
...................................................................................................................................
Input Data
276
3 Using
...................................................................................................................................
a Self-Contained Agent
277
Part XIII Running Agents from the Command-Line
279
1 Using
...................................................................................................................................
the Content Grabber runtime
281
Part XIV Scripting
283
1 Script
...................................................................................................................................
Languages
284
2 Script
...................................................................................................................................
Library
288
3 Assembly
...................................................................................................................................
References
289
© 2014 - 2015 Sequentum
5
6
Content Grabber
4 Script
...................................................................................................................................
Utilities
290
5 Script
...................................................................................................................................
Template Library
300
6 Agent
...................................................................................................................................
Initialization Scripts
302
7 Data...................................................................................................................................
Export Scripts
306
8 Data...................................................................................................................................
Distribution Scripts
314
9 Content
...................................................................................................................................
Transformation Scripts
316
10 Data...................................................................................................................................
Input Scripts
321
11 Image
...................................................................................................................................
OCR Scripts
326
12 Convert
...................................................................................................................................
Document to HTML Scripts
330
13 Custom
...................................................................................................................................
Scripts
335
Part XV Programming Interface
341
1 Building
...................................................................................................................................
a Desktop Application
342
Visual Studio
..........................................................................................................................................................
Configuration
350
Distributing
..........................................................................................................................................................
Your Application
353
2 Building
...................................................................................................................................
a Web Application
353
Using the
..........................................................................................................................................................
Content Grabber Agent Service
354
Using Simple
..........................................................................................................................................................
Web Requests
365
3 Sessions
................................................................................................................................... 373
4 API Reference
................................................................................................................................... 375
Index
397
© 2014 - 2015 Sequentum
Introduction
1
7
Introduction
Congratulations on your purchase of Content Grabber. We value you as a customer
and we want to help you increase your web-scraping productivity. We have put years
of effort into designing and developing Content Grabber. Our software can be a great
help in simplifying the web-scraping process, increasing your efficiency, and optimizing
the performance. We have the best web-scraping solution on the market and strive to
exceed your expectations.
You can learn more in these sections:
Editions and Licensing
What is Web Scraping?
Web Scraping with Content Grabber
Web Scraping Limitations
Web Scraping Techniques
We have designed this comprehensive user guide to help you get up and running
quickly with Content Grabber, and then learn the advanced features so that you can
get your job done more efficiently. After Installing Content Grabber, we recommend
that you get familiar with Understanding the Concept, Exploring the Main Window and
Building Your First Agent.
We welcome your input and feedback, so please send us an email at:
info@ContentGrabber.com.au. We are keen to know how you are using Content
Grabber, and welcome any inquiry from you or your colleagues.
1.1
Editions & Licensing
The Content Grabber software comes in three editions, each of which requires a
license. Please see the license agreement in the Content Grabber installation folder
for the full license agreement. In this article we highlight the most important
differences among the Content Grabber editions.
© 2014 - 2015 Sequentum
8
Content Grabber
These are the current license editions of Content Grabber:
Professional Edition
Premium Edition
Server Edition
This user guide describes all features for all editions of the Content Grabber software.
Some sections of this guide present features that are only available in specific
versions, and these features may not be available in the software edition that you are
using. We explain these differences below.
Professional Edition
If a current-license copy of the Professional Edition of the software is running on a
computer, then a single user may create, edit, and run web-scraping agents without
restrictions.
The Professional Edition can create self-contained agents that you can distribute
and users can run royalty-free. These self-contained agents use a standard user
interface that display a promotional message for Content Grabber, and it isn't
possible to edit the display templates to remove this message.
It is not possible to open any agents in the Professional Edition that were built in the
Premium Edition.
The Content Grabber Runtime cannot run or edit web-scraping agents built with the
Professional Edition. This also means that the entire API and the Script Library is
unavailable to agents built with the Professional Edition.
The runagent.exe command-line program will only run agents built with the Premium
Edition (so long as that edition of the software is on the computer).
© 2014 - 2015 Sequentum
Introduction
9
Premium Edition
The Professional Edition is the most feature-rich version of the Content Grabber
software suite.
As with the Professional Edition, if a current-license copy of the Premium Edition of
the software is running on a computer, then a single user may create, edit, and run
web-scraping agents without restrictions.
With the Premium Edition, you can use the API and the Script Library.
The Premium Edition can create self-contained agents that you can distribute and
users can run royalty-free. These self-contained agents use a standard user interface
that displays a promotional message for Content Grabber, but the Premium Edition
gives you the ability to change the display templates to remove this message.
You can open agents in the Premium Edition that were built in the Professional
Edition, but the only features that will function are those that are available in the
Professional Edition.
Server Edition
The Server Edition of Content Grabber is intended for use in production environments
for running your agents and carrying out simple maintenance. It is not a developer
license and is priced accordingly.
The Server Edition cannot create new agents, but it can run and edit agents built with
either the Premium Edition or the Professional Edition.
Restrictions for the Server Edition:
With this edition, you may not change the agent command structure, or alter an
agent in a way that would change the output data structure.
It is not possible to add, delete, move, or copy commands in this edition, nor can
© 2014 - 2015 Sequentum
10
Content Grabber
it modify the Disabled and Export command properties.
This edition cannot export agents or create self-contained agents.
This edition cannot edit script libraries, though it can run agents that use script
libraries.
Content Grabber Runtime
The Content Grabber Runtime cannot change the agent command structure, or
modify the agent in a way that would change the output data structure. Also the
Runtime cannot add, delete, move or copy commands, nor can it change the
Disabled and Export command properties.
With the Premium Edition, you can create and distribute agents royalty-free with the
Content Grabber Runtime. Please be aware that the Content Grabber Runtime
has its own license and restrictions.
The Content Grabber Runtime cannot run or edit web-scraping agents built with the
Professional Edition. This also means that the entire API and the Script Library is
unavailable to agents built with the Professional Edition.
An agent built with the Professional Edition that you later edit with the Premium
Edition cannot be run with the Content Grabber Runtime.
1.2
What is Web Scraping?
We provide this brief introduction for those who are new to web-scraping. If you are
familiar with web-scraping, then you many want to begin by reading the article
Installing Content Grabber.
Web-scraping is the process of extracting data from websites and storing that data in
a structured, easy-to-use format. The value of a web-scraping tool like Content
Grabber is that you can easily specify and collect large amounts of source data that
may be very dynamic (data that changes very frequently).
© 2014 - 2015 Sequentum
Introduction
11
Usually, data available on the Internet has little or no structure and is only viewable
with a web browser. Elements such as text, images, video, and sound are built into a
web page so that they are presentable in a web browser. It can be very tedious to
manually capture and separate this data, and can require many hours of effort to
complete. With Content Grabber, you can automate this process and capture website
data in a fraction of the time that it would take using other methods.
Web-scraping software interacts with websites in the same way as you do when using
your web browser. However, in addition to displaying the data in a browser on your
screen, web-scraping software saves the data from the web page to a local file or
database.
You can configure web-scraping agents to run on multiple websites, and you can
schedule each agent to run automatically. It's easy to configure your agent to run as
frequently as you like (hourly, daily, weekly, monthly) to ensure that you are capturing
the very latest data.
1.3
Web Scraping with Content Grabber
With Content Grabber, you can automatically harvest data from a website and deliver
the content as structured data in multiple database formats (Oracle, SQLServer, My
SQL, OLE DBE), or in other formats such as Excel spreadsheets, CSV or XML files.
Content Grabber can also extract data from highly dynamic websites where most
other extraction tools are incapable. It can process AJAX-enabled websites, submit
© 2014 - 2015 Sequentum
12
Content Grabber
forms repeatedly to cover all possible input values, and manage website logins.
Web-scraping technology is transforming the Internet into a structured data source,
and Content Grabber is opening up numerous business opportunities for both
corporations and individuals. The following is just a small sample of how web-scraping
technology is optimizing and enabling new businesses:
Price Comparison Portals / Mobile
Locate the highest-ranking keywords
Apps
of your competitors on all major
Collaborative lists (home
search engines
foreclosures, job boards, & tourist
Background Checking
attractions)
Confirm the integrity of business
News & Content Aggregation
partners
Competitive price monitoring
Monitor online sources for copyright
Monitor dealers for price compliance
infringement
Track inventory on retailer websites
Sales Lead Generation
Social media and brand monitoring
Content migration (CMS & CRM).
Content Grabber is a powerful, visual web-scraping tool that can do all of this and
much more. We provide a comprehensive user guide to help you get up and running
quickly. After Installing Content Grabber, we recommend that you look at Content
Grabber Basics, and then get familiar with Exploring the Main Window and then
Building Your First Agent.
1.4
Web Scraping Limitations
Web-scraping can be challenging if you want to mine data from complex, dynamic
websites. If you're new to web-scraping, then we recommend that you begin with an
easy website: one that is mostly static and has little, if any, AJAX or JavaScript.
After you get familiar with the navigation paths for your target website, you need to
identify a good start URL. Sometimes this is simply the start URL of the website, but
often the best URL is the one for a sub-page—such as a product listing. Once you
© 2014 - 2015 Sequentum
Introduction
13
have this URL, you’ll need to copy it and then paste it into the address bar of Content
Grabber.
NOTE: Some websites allow navigation without any corresponding
change in the visible URL. In such cases, you may not have a start
URL that points directly to your start webpage, and so you’ll need to
add preliminary steps to your agent to navigate to that webpage.
Web-scraping can be also challenging if you don't have the proper tools. Largely,
you're completely at the mercy of the target website, and that website can change at
anytime - without notice. Or, it may contain faulty JavaScript that causes it to crash
and exhibit surprising behavior. The server that hosts the website may crash, or the
website may undergo maintenance. Many potential problems can occur during a
lengthy web-scraping session, and you have very little influence on any of them.
Content Grabber offers an array of advanced error-handling and stability features that
can help you manage many of the problems that a web-scraping agent is likely to
encounter.
In addition to the unreliable websites, another challenge is that some web-scraping
tasks are especially difficult to complete - including the following:
Extracting data from complex websites
Extracting data from websites that use deterrents
Extracting huge amounts of data
Extracting data from non-HTML content.
Extracting Data From Complex Websites
If you are developing web-scraping agents for a large number of different websites,
you will probably find that around 50% of the websites are very easy, 30% are
modest in difficulty, and 20% are very challenging. For a small percentage, it will be
effectively impossible to extract meaningful data. It may take two weeks or more for a
© 2014 - 2015 Sequentum
14
Content Grabber
web-scraping expert to develop an agent for such a website, so the cost of developing
the agent is likely to outweigh the value of the data you might be able to extract.
Extracting Data From Websites Using Deterrents
Web-scraping will always be challenging for any website with active deterrents in
place. If it is necessary to login to access the content that you want to extract, then
the website can always cancel your account and make it impractical to create new
accounts.
Some websites uses browser fingerprinting to identify and block your access to the
website. Fingerprinting uses JavaScript to make a positive identification by examining
your browser and computer specifications, and thereby making it impossible to
circumvent.
Another method for websites that are wary of crawlers or scrapers is the use of
CAPTCHA. Content Grabber includes tools you can use to overcome CAPTCHA
protection, but you'll incur additional costs to get a 3rd-party to do automatic
CAPTCHA processing. See CAPTCHA Blocking for more information.
The most common protection technique is using your IP address to identify and block
your access to a website. You can usually circumvent this technique by using a proxy
rotation service, which hides your actual IP address and uses a new IP address every
time you request a web page from a website. See IP Blocking & Proxy Servers for
more information.
NOTE: Ethically and legally, we recommend that you avoid websites that are actively
taking measures to block your access, even if you are able to circumvent the
protection.
Extracting Huge Amounts of Data
A web-scraping tool must actually visit a web page to extract data from it.
Downloading a web page takes time, and it could take weeks and months to load and
extract data from millions of web pages. For example, it's virtually impossible to
© 2014 - 2015 Sequentum
Introduction
15
extract all product data from Amazon.com, since there are too many web pages.
Extracting Data From Non-HTML Content
Some websites are built entirely in Flash, which is a small-footprint software
application that runs in the web browser. Content Grabber can only work with HTML
content, so it can only extract the Flash file. However, it can't interact with the Flash
application or extract data from within the Flash application.
Many websites provide data in the form of PDF files and other file formats. Though it
cannot directly extract data from such files, Content Grabber can easily download
those files and convert the files into an HTML document using 3rd-party converters to
extract data from the conversion output. The document conversion happens very
quickly in real-time, so it will seem as though you are performing a direct extraction.
It's important to realize that PDF documents and most file formats don't contain
content that is easily convertible into structured HTML. To do that, you can use the
Regular Expressions feature of Content Grabber to resolve the conversion output.
1.5
Web Scraping Techniques
Content Grabber makes it easy to extract data from most websites without requiring
much prior knowledge about web-scraping techniques. However, you'll be able to build
better web-scraping agents if you know some basic techniques. Some very difficult
websites will require in-depth knowledge, but this user guide can help you gain more
understanding and direct you to additional resources.
The following topics are important if you want to become proficient at web-scraping,
but they are not necessary a prerequisite for successful use of Content Grabber on all
websites. Click the links to learn more about each topic:
HTML Content - Web pages are driven by HTML, which is the basic language
for building websites.
Dynamic Websites - It can be challenging to perform data extraction on dynamic
websites. So, it's good to have a general understanding of how JavaScript
© 2014 - 2015 Sequentum
16
Content Grabber
works, since it is found on most dynamic websites.
XPath and Selection Techniques - Most web scraping tools extract data from a
website by selecting web elements on the web page. XPath is a language that
manages the web selection.
Regular Expressions - XPath can select a web element such as a paragraph of
text, but you may have interest only in a small part of the web element content.
Regular Expressions is a language for extracting small bits of text from a
larger text element.
1.5.1
HTML Content
HTML stands for HyperText Markup Language - the standard markup language for
creating web pages. It consists of content that is defined in an HTML document by
tags that appear in brackets, such as <html>. Typically, these tags are seen in pairs,
with one on each end of the content that they represent (such as <h1> and </h1>). The
first tag in a pair is the start tag, and the second tag is the end tag (also known as
opening tags and closing tags). Some tags that represent empty elements don't come
in pairs, such as <img>.
The purpose of a web browser is to read HTML documents and compose them into
visible web pages. The browser does not display the HTML tags, but rather interprets
the tags and displays content on the page that corresponds to that tag. HTML
describes the structure of a web page semantically, with cues for presentation. This
distinguishes it as a markup language rather than a programming language.
HTML elements are the building blocks of any website, including embedded images
and objects, and also interactive forms. It provides the structure for a page by
denoting structural semantics for text such as headings, paragraphs, lists, links,
quotes, and other items. It can also contain scripts written in languages such as
JavaScript - which controls the behavior of HTML web pages.
Content Grabber uses XPath to select specific HTML tags and then extracts content
© 2014 - 2015 Sequentum
Introduction
17
from those tags. An HTML tag can contain both text and attributes. For example, a
HTML tag that displays an image will contain a scr attribute that specifies the URL of
the image to display. Content Grabber can extract both tag text and tag attributes,
and may perform certain actions on the content it extracts. For example, it may
extract the scr attribute from an <image> HTML tag and then use the URL to
download the image.
There are many websites that have HTML tutorials. Here is one example:
http://www.w3schools.com/html/html_intro.asp
1.5.2
Dynamic Websites
Using HTML script, a client-side dynamic web page will continue to load more content
after the initial content loads and the page elements are available to the user. The
most common language for client-side scripting is JavaScript, and it may use AJAX
(Asynchronous JavaScript and XML) to load additional content onto a web page
asynchronously. It may also modify existing content on a web page, such as enabling
or disabling content when you click on particular web elements.
To extract data correctly, Content Grabber needs to detect any dynamic changes on
a web page. For example, if you want to extract any additional data that AJAX loads
onto a web page, then you'll want to configure Content Grabber to wait on AJAX to
finish processing the new content before it can start extracting it.
Content Grabber is excellent at the automatic detection of dynamic changes.
However, sometimes JavaScript behaves unusually, and you may need to make
adjustments to properly extract dynamic content. For example, Content Grabber can
detect when JavaScript completes an AJAX load of dynamic content. But it cannot
detect exactly when the JavaScript is done and so it will simply wait for a few
milliseconds. If the JavaScript takes an unusually long time to display the dynamic
content, you may need to use the timeout feature of Content Grabber to insert a short
interval for the JavaScript to display the dynamic content (typically a few additional
milliseconds).
© 2014 - 2015 Sequentum
18
Content Grabber
Familiarity with JavaScript can make it much easier to configure a web-scraping agent
to extract data from dynamic websites when Content Grabber is unable configure the
agent automatically. You can learn more from various JavaScript tutorials available on
the web, such as:
http://www.w3schools.com/js/
1.5.3
XPath and Selection Techniques
Proper selection technique is a critical aspect of web-scraping. The most basic
selection technique is to point-and-click on elements in the web browser panel, which
is the easiest way to add commands to an agent. XPath is a common syntax for
selecting elements in HTML and XML documents. Each time you click on an element in
the web browser, Content Grabber works in the background to calculate the selection
XPath.
Content Grabber has a variety of tools that help you create precise XPaths without
needing to know XPath syntax at all. If you want finer control over element selection,
it's worthwhile to learn some specifics about XPath syntax. Although this user guide
doesn't include an XPath tutorial, there are many tutorials available on the web. We
recommend this as a good place to learn more about XPath:
www.w3schools.com/XPath/xpath_syntax.asp
1.5.4
Regular Expressions
With Regular Expressions, you can write expressions that look for specific character
sequences within strings and then extract small text strings out of larger ones.
Content Grabber uses XPath to select web elements on a web page, and then
extracts content from those web elements. You may only want some parts of the
content extraction, or you may want to transform it. For example, a single web
element may contain the entire address of a company, but you may want to extract
© 2014 - 2015 Sequentum
Introduction
19
the content into separate elements such as street address, city, zip code and state.
You can use Regular Expressions to split the address text into separate text strings.
There are many tutorial websites that teach Regular Expressions. Here is one
example:
http://www.regular-expressions.info/reference.html
1.6
Converting Visual Web Ripper Projects
Visual Web Ripper is another web scraping tool published by
Sequentum. Content Grabber can open Visual Web Ripper projects
and convert them into Content Grabber agents. This is NOT a fully
automated conversion, and many agents will need manual
adjustments after conversion.
To convert a Visual Web Ripper project, simply open the project file
in Content Grabber as if it was a normal Content Grabber agent.
Content Grabber will then ask if you want to convert the project file
into an agent. The Visual Web Ripper project will be left intact, so
you don't need to make a copy of the Visual Web Ripper project.
First open the Visual Web Ripper project
as if it was a normal agent.
© 2014 - 2015 Sequentum
20
Content Grabber
You will be asked to convert the Visual Web Ripper project to an agent.
Content Grabber will attempt to add all required commands to the
converted agent, but it will not be able to set all command properties
correctly, so we recommend you test all commands in a converted
agent to make sure they work correctly.
The following list includes features that will NOT be converted from a
Visual Web Ripper project. This is not a complete list, and there may
be other aspects of a project that will not be converted correctly.
Visual Web Ripper scripts will not be converted, except for very
simple Content Transformation scripts.
OleDB and Oracle database connections will not be converted. All
other database connections will be converted into shared
connections.
The Private Proxy Switch option is not available in Content
Grabber.
PAC Proxy configuration will not be converted.
© 2014 - 2015 Sequentum
Introduction
Most Page Transformation elements will require post
configuration.
The Tag Name attribute is not available in Content Grabber.
The property SaveDataMethod will require post configuration
when set to anything other than Default.
Many Back templates will require post configuration.
Multiple Page Navigation templates working together will require
post configuration.
Page Navigation templates using Dynamic Links or List of Links will
sometimes require post configuration
All actions will be converted into actions with the property Detect
Action set to true. All other action configuration, including wait
elements, will be ignored.
Form Field elements using the Start Index and Count properties
will often require post configuration.
Form Field Lookup Data Sources will not be converted.
Project user agent settings will not be converted.
CAPTCHA configuration will not be converted.
Document conversion will not be converted.
Document download on form submit will not be converted.
Scheduling configuration will not converted.
Notification configuration will not be converted.
Duplicate checks will sometimes require post configuration
Many rarely used options in projects, templates and contents don't
have corresponding options in Content Grabber and will therefore
not be converted.
© 2014 - 2015 Sequentum
21
22
2
Content Grabber
Quick Start with Content Grabber
We want to help you get up and running, so this chapter outlines the basics to help
you quickly install the software, build a simple agent, and then test that agent. Further
along in the guide, we explain more details and explore the advanced features. For
now, let's see how easy it is to build an agent.
In this chapter, you'll learn:
Installing Content Grabber
Content Grabber Basics
Exploring the Main Window
Simple vs Expert Layout
Building Your First Agent
2.1
Installing Content Grabber
Content Grabber is a Windows-based software application. Windows 7, Windows 8,
Windows 2008 R2 and Windows 2012 are the supported versions of Windows. It is
important to note that while it is not supported, the Content Grabber runtime works on
Windows XP+ and the Content Grabber editor works on Windows Vista+. While a
Macintosh version is not available, a number of our users utilize Parallel's windows
emulation software to successfully run Content Grabber.
Installing Content Grabber is easy. Simply follow these steps:
1. Download the setup file from http://www.sequentum.com/asdr34ert/setup.zip
2. Find the setup.zip file on your local disk (likely to be your Downloads folder).
3. Open the zip file and then double-click the setup.exe file to launch it.
4. After a moment, you'll see the first panel of the Content Grabber setup wizard.
Click the Next button to continue with the installation.
© 2014 - 2015 Sequentum
23
5. Click the option to Accept the Agreement, and then click the Next button.
6. In the next panel, you can change the installation location. If you accept the
default location, simply click Next. Otherwise, click the Browse button to
choose a different installation folder, and then choose the folder and click the
Open button to close the pop-up, and then click the Next button.
7. In the next panel you can change the name of the folder that will appear in the
Windows Start Menu. Click Next and, in the next panel, click the Install button
to complete the installation.
8. Locate the Content Grabber item in the Windows Start Menu and then launch
it.
© 2014 - 2015 Sequentum
24
2.2
Content Grabber
Content Grabber Basics
Web-scraping tools generally use macros or configuration methods, and follow a
sequential list of commands. The macro approach is more user friendly and
automatically records the actions of a user in a browser. However, there are typically
restrictions on accessing the code behind the agent. The configuration approach
allows the user to directly configure each part of the agent. They can introduce more
code structure, controls, data refinements, or add their own naming conventions.
Content Grabber gives you the option to either follow the simple macro automation
methods, or to take direct control over the treatment of each element and command
within your agent.
Content Grabber Agent Development
With Content Grabber, you can visually browse the website and click on the data
elements in the order that you want to collect them. Based on the content elements
selected, Content Grabber will automatically determine the corresponding action type
and provide default names for each command as it builds the agent for you.
© 2014 - 2015 Sequentum
25
Content Grabber main screen - building CarPoint Agent
A Content Grabber agent is a collection of commands which are executed in serial
until completed. The commands can either be actions (such as a jump to a URL) or
data capture commands (e.g. capture text). These commands are recorded in order
of execution in the Agent Explorer panel of the Content Grabber screen.
Agent Explorer panel with New Agent commands
If you want to make other adjustments or gain more control of your commands, you
can make changes in the Configure Agent Command panel.
Configure Agent Command panel
You can also add new commands to your agent, or configure existing ones. To do
this, you simply click twice on any web element (content item) and the Content
Grabber Message window will appear. From here you can select the command type
you want and add it to the Agent Explorer.
© 2014 - 2015 Sequentum
26
Content Grabber
Content Grabber Message window pop-up
Content Grabber Data Outputs
After you have finished building you Agent and run it for the first time, Content Grabber
saves the data locally in a structured database format. Content Grabber can export
extracted web data as a report or to numerous different database types. Data output
options include CSV, Excel, XML, SQL Server, MySQL, Oracle and OleDB.
Content Grabber's Data Configuration window
You can also use a Content Grabber export script to completely customize the data
export to your own database structures.
Scheduling
© 2014 - 2015 Sequentum
27
Content Grabber provides an Agent scheduling facility that enables you to
automatically run your agent at predetermined time slots whenever you need it to run.
This can be done every hour, every day, month, year and so on.
Content Grabber's Scheduling Window
2.3
Exploring the Main Window
In this section, we give a brief overview of the Content Grabber window.
© 2014 - 2015 Sequentum
28
Content Grabber
Address Bar
The Address Bar is where you enter the URL of the page that is the start of your
web-scraping agent. This is what we call the Start URL.
Web browser Panel with Data-Capture Box
The Content Grabber web browser appears in the main panel. The special feature of
this integral web browser is this: as you move the mouse and pass over a web
element, you’ll see a yellow rectangle envelop that element. When you stop over one
element and click on it twice, the Command pop-up window will appear—prompting
you to choose the command that you want to insert into the agent for this data
element.
Agent Explorer Panel
As you select data elements in the browser panel, the corresponding Command item
will appear in the Agent Explorer panel. The Agent Explorer is used to view agent
© 2014 - 2015 Sequentum
29
commands for the type of web page loaded in the selected browser tab. When you
select a new browser tab with another web page, a new set of commands will be
displayed in the Agent Explorer.
You can use the Agent Explorer to view, edit or execute commands.
Agent Configuration Panel
In this panel, you can edit an agent command, execute a command, or add a subcommand. Further along in this guide, we’ll explain how to explore and edit each
command.
2.4
Simple vs Expert Layout
Content Grabber includes a vast number of advanced features which can be
overwhelming for a new user. Content Grabber allows you to hide some advanced
features or to show all the features available. The default layout setting is the
Simplified Layout which excludes some advanced features from the Content Grabber
menus.
You can switch between the Simplified Layout and the Expert Layout from the
Application Settings menu at the top of the Content Grabber application.
Content Grabber's Application Settings menu with Simplified Layout enabled
You can also show or hide the advanced features using the radio buttons displayed on
Content Grabber's start-up screen.
© 2014 - 2015 Sequentum
30
Content Grabber
Show or hide advanced feature selection on start-up screen
After switching to Expert Layout you will see a number of additional menu items
displaying. To switch back to Simplified Layout you choose Simplified Layout under
the Expert Layout option within the Application Settings menu.
Content Grabber's Application Settings menu with Expert Layout enabled
For more detail on the different Expert Layout options and their functions, refer to
Expert Layout Features within the Content Grabber Editor section of this manual.
2.5
Building Your First Agent
The diagram shows the key steps for building a web scraping agent. Below we
provide links the other topics which explain each of these steps in more detail.
© 2014 - 2015 Sequentum
31
Basic web scraping Agent creation process
Here in this section, we cover the identification of data elements on your target
website and create a web-scraping agent. We'll work through each of the above steps
with examples that match common web-scraping usage, so you can get comfortable
building your own agents.
We cover these topics in this section:
Choosing a Start URL
Select the Data to Capture
Refine your data (optional)
Output Data Format
© 2014 - 2015 Sequentum
32
Content Grabber
Test Your Agent
Scheduling
Run your Agent
After you get comfortable creating an agent, you can learn more about the Content
Grabber Editor.
2.5.1
Choosing a Start URL
The Start URL is the place where you begin data collection and corresponds to the
starting point for your web-scraping agent.
In the following sections, we'll use the Cruise Direct website for our example.
http://www.cruisedirect.com
Note: In this example we start from the Cruise Direct home page, however, if the data
you require is not located on a website home page, you can start the agent from a
website sub-page. This approach will make the agent more efficient, so it’s worth
taking the time to be more specific.
1. We start by pasting the start web page url from the target website (http://
wwwcruisedirect.com) into Content Grabber's Address Bar.
2. Next click the Blue Play button to load the Cruise Direct home page.
Note: you can also press the Enter key to load the Cruise Direct home page.
© 2014 - 2015 Sequentum
33
Content Grabber with Cruise Direct home page loaded
In the next section, Select the Data to Capture, we will continue use the Cruise Direct
website data for our example.
2.5.2
Select the Content to Capture
In the previous section, we selected our Start URL and loaded the web page into
Content Grabber. Next, you can select the data you want to capture and start building
your web scraping agent. In our Cruise Direct example, we plan to search for
available cruise vacations and then extract the cruise name, the cruise line,
destination, departing place, ports of call, and the different prices.
1. Firstly we need to perform a search to retrieve the data for the available cruises.
To do this, we select the orange Search button element with the mouse, then click
one more time to display the Content Grabber Message window.
© 2014 - 2015 Sequentum
34
Content Grabber
Content Grabber Message Window
2. From the message window, we choose the Click on the Web Element option to
add a new command to the agent that will execute the search and display the
search results on a new web page. Notice that Content Grabber has added our first
command to Agent Explorer - in this case to execute the search and display the
search results.
Agent Explorer with new Search command linked to new Search page
3. We are now ready to add commands to our agent to extract the cruise data. As we
have a number of data elements in tables, we will use a list to simplify the extraction
for us. To capture a data element, move your mouse precisely over the data
element you want, until you see the data-capture box around it. We start by
selecting the first cruise name.
© 2014 - 2015 Sequentum
35
First Cruise Line data element selected within Content Grabber
4. Then click List in the Configure Agent Command panel to activate the list
selection mode.
Activating List selection mode from the Configure Agent Command panel
5. In list selection mode, we can add web data elements into the list by clicking
similar data elements. Now we'll click on the second cruise name and you will see
Content Grabber has selected the remaining data elements on the page. Note: if
any cruise data elements remain unselected, simply click on these to add to the
list.
© 2014 - 2015 Sequentum
36
Content Grabber
Second Cruise Line data element selected while in List selection mode
6. We now click Save to save the list and exit list selection mode. The Web Element
list command defines the list area, so any elements within this area are now
included within the list.
7. To capture the cruise name text, we click on any selected element to display the
Content Grabber Message window. From the Content Grabber Message
window, choose the Capture Text option to add the web element command to
capture the cruise names. We have now added new web element list and web
element commands to the Agent Explorer and Content Grabber has set default
names for these commands.
8. To edit the names for the commands, we click the respective Edit icons and set the
names of the commands to ‘Search List’ and ‘Cruise Name’. Then click the Green
Tick to save.
© 2014 - 2015 Sequentum
37
Agent Explorer with new Search List and Cruise Name commands
9. Now we plan to extract the individual cruise data elements from each table. So
firstly click on the cruise line data element. Content Grabber now automatically
selects all the cruise line information because it is already defined as a list.
10. Next, click on the cruise line data element one more time to display the Content
Grabber Message window. Now choose the Capture Text option from the
Content Grabber Message window to add the command to the Agent so we can
capture the individual cruise line data elements.
11. After that, we click the Edit icon to change the name of the command to ‘Cruise
Line’ then save it.
12. Now we do the same for the Destination and Departing From data elements, then
set the respective names of the commands to ‘Destination’ and ‘Departing From’
and save them.
Agent Explorer with new Capture Text commands
13. To extract the ‘Ports of Call’ data we will utilize Content Grabber's content
transformation method. Refer to Refine Your Data to see how this is done.
© 2014 - 2015 Sequentum
38
Content Grabber
14. We also want to capture all the price information in the pricing table, so as before,
we select the first element (View Itinerary) in the pricing table from the list. Content
Grabber will then select all of the first elements throughout the list.
15. Click one more time to display the Content Grabber Message Window. Then
choose the Capture Text option to add the command to the Agent. Before we
change the default command name, we plan to add all the individual price details to
the Agent, then we will change all the names at once.
16. Do steps 14 and 15 again for the ‘inside’, ‘outside’, ‘balcony’ and ‘suite’ data
elements to add 4 new commands to the Agent.
17.Now we will change the names of the commands to ‘Departing Date’, ‘Inside’,
‘Outside’, ‘Balcony’ and ‘Suite’ by clicking the respective Edit icons and then save.
Agent Explorer showing all Capture Text commands
18. Thus far we have created the Agent to extract all the cruise information on the first
page. We need to set it up to iterate through all the search result pages. To do this,
we need to use the Follow Pagination command to follow each of the
pages.Scroll down the page and select the ‘Next 10’ link. Then click one more time
on the selected element to display the Content Grabber Message window.
© 2014 - 2015 Sequentum
39
Content Grabber Message windows with Follow Pagination option selected
19. Now we choose the Follow Pagination option to add the pagination command to
the Agent.
Content Grabber has added the pagination command to the Agent and loads the
next page on the second browser tab.
20. When we click on the pagination command we can see all the search list
commands inside of the pagination command. This means our agent will now iterate
through all the search result pages to extract this information.
Agent Explorer showing the contents of the Pagination command
21.We have now finished building the Agent so we should save it. To save the Agent,
© 2014 - 2015 Sequentum
40
Content Grabber
choose File > Save in the Content Grabber menu, and then enter the Agent Name
“cruisedirect”. Then click the Save button to commit your changes.
Saving the "cruisedirect" Agent
In the next section, Refine Your Data we use Content Grabber's Content
Transformation method to retrieve the Ports of Call data.
2.5.3
Refine Your Data (optional)
In the previous section, we added commands to our agent to capture all the Cruise
Direct content elements we require, except for the Ports of Call data. Due to the way
the information is displayed for Ports of Call, we cannot select only the Ports of Call
data element by clicking on it. Instead, we will select all the cruise information, then
use Content Grabber’s transformation method to extract only the Ports of Call detail.
1. Firstly select the cruise information then use the mouse to expand the
Configure Agent Command panel so we can see all the transformation fields.
2. Next, we scroll down the capture sample window until we can see the Ports of
Call information. Now we select only the ports of call information. You should
© 2014 - 2015 Sequentum
41
notice that the Transformation Script button has now changed to a Generate
Transformation button.
Selecting text to Transform in the Configure Agent Command panel
3. Now click on the Generate Transformation button, and we can now see only
the ports of call information in the Transformed window.
Transformed text in the Configure Agent Command panel
4. We will now click the Add Command button to add the transformation
command to the Agent.
5. Click the Edit icon, then change the default command name to ‘Ports of Call’
then save it.
© 2014 - 2015 Sequentum
42
Content Grabber
Agent Explorer showing all the Agent commands
In the next section, Output Data Format we look at the data output formats available
for the extracted web data and show how to change and configure a new export
target.
2.5.4
Output Data Format
Content Grabber can export extracted web data as a report or to numerous different
database types. Data output options include CSV, Excel, XML, SQL Server, MySQL,
Oracle and OleDB.
You can also use a Content Grabber export script to completely customize the data
export to your own database structures. This is useful when you want the data
updates to be dynamically reflected such as an online website / portal.
Content Grabber can export data to Excel 2003+ and take advantage of features in
Excel 2007+ such as outlining and embedded images.
Data is exported automatically to your chosen export destination when a data
extraction project completes, so you don't have to export data manually. However, you
can always export extracted data manually at any time to any export destination.
The following steps show how easy it is to choose the data export type you want to
use.
© 2014 - 2015 Sequentum
43
1. We start by clicking on Content Grabber's Data menu and then clicking the
Change Export Target button.
Content Grabber's Data menu
2. After clicking the Change Export Target button the Data Configuration Window
displays. This window allows you to change and configure a new export target. The
default option is Excel 2003.
Click the Export target drop-down list box to display the different report and
database export options available. Then choose the format you want to use. You
can also change the default destination folder location for where the data file(s) are
output.
Content Grabber's Data Configuration window
In the next section, Test Your Agent we use Content Grabber's Debugging features to
test the "cruisedata" Agent is functioning as expected.
2.5.5
Test Your Agent
Once you have finished developing your agent, it is important to test it to ensure the
© 2014 - 2015 Sequentum
44
Content Grabber
correct data is being extracted and in the format you require. Content Grabber has a
sophisticated debugging engine which enables you to carefully analyze all aspects of
your agent and the data being extracted. It can also help you to pin-point any trouble
spots in your agent code so you can quickly resolve them. For more detail on the
Debugger features available in Content Grabber refer to Testing/Debugging an Agent.
Now let's run the Cruise Direct agent to check the commands are working properly.
To do the test run, click the ‘Debug’ menu option at the top left of the screen and then
the click ‘Start’ arrow button to start the debugging.
Content Grabber Debugger running the Cruise Direct Agent
During the debugging, we can observe and check that Content Grabber goes through
each of the commands in sequence and processes each of the web pages to extract
the required data.
Part way through running the agent, we will click the ‘Stop’ button to stop the
debugging. Then click ‘View Export Data’ to check the output results are correct.
© 2014 - 2015 Sequentum
45
Content Grabber's default view of the export data
To see the export data in Excel, we can simply click on the ‘Open Exported
Spreadsheet’ button to open the Excel spreadsheet.
Viewing the Cruise Direct export results in an Excel Spreadsheet
In the next section, Scheduling you will learn how to set up an agent so it can be
automatically run at intervals of your choosing.
2.5.6
Scheduling
© 2014 - 2015 Sequentum
46
Content Grabber
Once you have completed your agent, you will be able to access this facility. From
the Agent Settings menu at the top of the Content Grabber application. Simply Click
on the Schedule menu option to display the Scheduling window.
For details on how to configure Content Grabber's Scheduling feature, or if you'd like
to learn how to use Windows Task Scheduler with your Agents, refer to Scheduling
in the Content Grabber Editor section.
In the next section, Run Your Agent we will show you how you run your finished webscraping Agent.
2.5.7
Run Your Agent
Now that we have finished developing and testing the Cruise Direct Agent it is ready to
use.
To run your agent, you simply click on the Run menu at the top left of the Content
Grabber application then click the Run Agent arrow selection.
© 2014 - 2015 Sequentum
47
Content Grabber's Run menu selections
Note: if you have already scheduled your Agent to run at a later time or date, you can
just leave it on your Internet enabled PC or Server and it will run automatically. Refer
to Scheduling for more detail.
© 2014 - 2015 Sequentum
48
3
Content Grabber
Content Grabber Editor
The Content Grabber Editor enables you to easily see and manipulate your web
scraping agent as your build it. This section of the manual looks at various aspects of
the Content Grabber interface to help ensure you understand and get the most out of
the Content Grabber application.
In this chapter, you'll learn:
Customizing the Editor Layout
Expert Layout Features
Working With the Web Browser
Command Configuration
Selection Tools and Short-cut Keys
Navigating an Agent
Testing/Debugging an Agent
Scheduling
Copying & Moving Agents
Exporting & Importing Agents
Automated Agent Backups
Template Libraries
3.1
Customizing the Editor Layout
Content Grabber's editor layout includes a number of workspace windows which can
be moved and re-sized to suit your development layout preferences. To help highlight
this feature and demonstrate how you can configure the Editor Layout, we will
reposition the Agent Explorer panel on the left of the Editor Layout screen.
© 2014 - 2015 Sequentum
49
Content Grabber's default start screen with Agent Explorer on the upper right side of the screen
Now by simply positioning the cursor at the top label of the Agent Explorer panel and
then clicking and holding the mouse button down, you can drag the panel to a new
position on the screen. When you are dragging the Agent Explorer panel, you will
notice a number of small docking objects appear with small arrows on them. These
are the available docking areas to where you can drag and lock the panel into its new
position.
Moving the Agent Explorer panel with the mouse - docking positions are displayed
© 2014 - 2015 Sequentum
50
Content Grabber
When you position the cursor over a docking position, you will notice the area changes
color to blue. This means the panel is in position and you can let go of the mouse
button to dock the Agent Explorer panel into this new position.
Docking position selected at the left of the screen - indicated by blue highlight
Once the mouse button is released, the Agent Explorer panel is relocated to the left of
the Editor Layout screen.
Agent Explorer panel repositioned to the left of the Editor Layout screen
© 2014 - 2015 Sequentum
3.2
51
Expert Layout Features
When using Expert Layout there will be additional options in most menu selections. In
this section we will have a brief look at what the Expert Layout options are and what
they are used for.
Note: You can switch between Simplified Layout and Expert Layout from the
Application Settings menu at the top of the Content Grabber screen.
Application Settings menu
The Application Settings Menu includes the following Expert Layout options.
Content Grabber's Application Settings menu with Expert Layout enabled
Application Proxy Source
Specifies a default source of proxies when
an Agent's proxy source is set to
Application.
Proxy List
Opens the Proxy List Window where you
can add or remove proxy servers when the
application proxy source is set to Proxy List.
Database Connections
Default database connection configurations
for all agents on the current computer. By
using computer wide database connections,
you avoid having to configure database
connections for each agent.
© 2014 - 2015 Sequentum
52
Content Grabber
New Connection
Adds a new database connection for use in
all agents on the computer being used.
Edit
Edit or remove the selected database
connection.
Command Library
Opens the Command Library that allows
you to view and insert commands from the
library into your agent.
Agent Library
Opens the Agent Library that allows you to
view agents in the library, and create new
agents based on agents within the library.
Export Libraries
Exports all libraries to a single file that can
be imported into Content Grabber on a
different computer. You can also use this
feature to backup your libraries.
Import Libraries
Imports libraries from a file that has
previously been exported from Content
Grabber. Note: Importing libraries overrides
all existing libraries.
Selection menu
© 2014 - 2015 Sequentum
53
The entire Selection (Selection XPath) menu is only visible in Expert Layout. The menu
options are as follows.
Content Grabber's Selection XPath menu
Refresh Browser
Refreshes the selection in the web browser
If the selection XPath is edited manually, then
a refresh is required to update the selection in
the web browser.
Selected XPath
A selection can consist of multiple XPaths.
The web browser will mark the web elements
chosen by the selected XPath. Choose View
All to mark the web elements selected by all
the XPaths in the selection.
Add Selection XPath
Adds an XPath to the selection.
A selection can consist of multiple XPaths,
which allows you to select web elements in
different areas of a web page.
Remove XPath Selection
Removes the currently selected XPath from
the selection.
Note: A selection must consist of at least one
XPath, so an XPath can only be removed if
© 2014 - 2015 Sequentum
54
Content Grabber
the web selection has more than one XPath.
Selection XPath
The current selection XPath.
Note: After editing the XPath, you must press
Refresh Browser to update the selection in
the web browser.
Run menu
The Run menu has two additional options for Internal Data when in Expert Layout.
These menu options are as follows.
Content Grabber's Run menu with Expert Layout enabled
View Log
Opens the Log Window that shows all the entries from the
last agent run.
Note: you must turn on logging and then run the agent before
you can view log data.
View Internal Data
View the internal data extracted by the agent. You must run
the agent before you can view internal data.
The internal data is a "raw" representation of the extracted
data, and contains many data fields that are only used
internally by Content Grabber.
© 2014 - 2015 Sequentum
Data menu
The Data menu has three additional options when in Expert Layout. One
additional Export Data option and two new Internal Data options. These menu
Content Grabber's Data menu with Expert Layout enabled
Regenerate Export Data
Generates and exports existing data.
If you change the export target, you can press this
button to export the data to the new export target
without having to extract the data again.
View Internal Data
View the internal data extracted by the agent. You
must run the agent before you can view internal
data.
The internal data is a "raw" representation of the
extracted data, and contains many data fields that
are only used internally by Content Grabber.
Change Internal Database
Content Grabber saves data to an internal
database while it's extracting data.
The default internal database is a SLQLite
database, but you can change this database to
SQL Server or MySQL.
© 2014 - 2015 Sequentum
55
56
Content Grabber
Debug menu
The Debug menu has a number of changes and additional menu options. The menu
Content Grabber's Debug menu with Expert Layout enabled
Execute Next Command
Executes the next command and then pauses
debugging.
Execute Next Action
Executes the next action command and then
pauses debugging.
An action command is a command that executes
an action such as loading a new web page.
Debug Speed
Set the debug speed to slow down or increase the
debugging speed. You can slow down the
debugging speed to help you see what is
happening while the agent is running.
Debug Set
If the selected command selects a list of web
elements or data items, you can specify a range of
list items that should be processed by the
debugger.
© 2014 - 2015 Sequentum
View Internal Data
View the internal data extracted by the agent. You
must run the agent before you can view internal
data.
The internal data is a "raw" representation of the
extracted data, and contains many data fields that
are only used internally by Content Grabber.
Enable Debug Logging
Turns on logging while debugging the agent.
Log HTML
Logs the full HTML of all pages in addition to other
log details.
View Log
Opens the Log Window that shows all the entries
from the last agent run.
Note: you must turn on logging and then run the
agent before you can view log data.
Agent Settings menu
The Agent Settings menu has several additional menu options available in Expert
Layout. These menu options are as follows.
Content Grabber's Agent Settings menu with Expert Layout enabled
© 2014 - 2015 Sequentum
57
58
Content Grabber
Proxy Source
Specifies where Content Grabber should
look for proxies when running an agent.
Proxy Rotation Interval
Specifies the proxy rotation interval when
using a list of proxy servers.
Proxy List
Opens the Proxy List Window where you
can add or remove proxy servers when the
application proxy source is set to Proxy
List.
Activate Proxy
After changing the proxy source, use this
button to start using the chosen proxy
configuration in the editor.
Input Parameters
Input parameters are named values
assigned to an agent when the agent
starts.
Initialization Script
An initialization script is executed before
the agent starts. This script allows you to
set certain agent parameters, such as
database connection strings or start URLs.
Assembly References
A list of .NET assemblies used for
scripting. These can be standard .NET
© 2014 - 2015 Sequentum
59
assemblies or custom assemblies.
Refresh IntelliSense
Refreshes IntelliSense in all script editors.
Use this button to refresh IntelliSense after
adding or removing assembly references.
3.3
Working With the Web Browser
Content Grabber uses an embedded version of Internet Explorer as its web browser.
The web browser has been greatly modified to suit the purpose of web scraping, but it
essentially works the same way as the standard version of Internet Explorer you have
installed on your computer. If a website does not work properly in Internet Explorer, it
will also not work properly in Content Grabber.
Web Browser Selection Mode
When you click on links and buttons in a normal web browser, you will normally
perform some sort of action, such as loading a new web page. Content Grabber
intercepts all actions in its web browser, and when you click on a web element, it
marks the element as selected instead of performing the default web browser
action.
When you click on a web element that has already been selected, Content Grabber
will give you an option to add an agent command that performs an action on the
selected web element.
© 2014 - 2015 Sequentum
60
Content Grabber
The available options, and the order of the options available, depends on what type
of web element you have selected. For example, if you have clicked on a selected
link element, the first option will be to navigate the link and open a new web page.
You can change what happens when you click on a selected web element by setting
the Add Command Mode in the Application Settings ribbon menu.
You have the following three options.
Add
Description
Command
Mode
© 2014 - 2015 Sequentum
Always Add
Content Grabber automatically adds an agent
Command
command that performs the most appropriate
61
action for the selected web element, without first
displaying the Add Command dialog.
Confirm
The Add Command dialog is displayed when you
Before
click on a selected web element, which lets you
Adding
decide what type action you want to perform on
Command
the web element. This is the default behavior.
Never Add
Nothing happens when you click on a selected
Command
web element. You always have to manually
configure your agent commands.
Web Browser Navigation Mode
Sometimes it's useful to navigate in the Content Grabber web browser the same way
as a normal web browser. For example, you may load a URL, but then want to
navigate to a particular web page and start the data extraction from that page.
You can switch the web browser from selection mode to navigation mode by clicking
the Navigate in Web Browser button in the application menu.
Once you have reached the web page where you want to start extracting data, you
can set the current URL as the start URL by clicking the check icon next to the URL
address bar.
Important: Content Grabber must be able to load the start web page directly from
the start URL. If the start web page cannot be loaded directly by using the start URL,
© 2014 - 2015 Sequentum
62
Content Grabber
then you must choose another start web page, and then use agent commands to
navigate to the web page where you want data extraction to start.
Disabling Web Browser Events
Some websites display or hide web elements when certain events occur in the web
browser, such as when the move moves over a web element, or when an input field
loses focus. Sometimes it can be difficult to select such dynamic web elements,
because they may show and hide as you move the mouse around.
You can click the Disable Web Browser Events button in the application menu to
block web browser events, so that dynamic web elements do not show and hide as
you move your mouse for example. If you want to "freeze" the web page when your
mouse is over a certain web element, then you can use the CTRL+D shortcut key to
disable events, so you don't have to move the mouse away from the web element to
click the Disable Web Browser Events button.
3.4
Content Grabber agents consist of commands that are executed to navigate a website
and extract content.
You can configure a command manually by selecting a web element in the Content
Grabber web browser and then use the configuration window to specify and configure
the action that should be performed on the selected web element.
© 2014 - 2015 Sequentum
63
You can also let Content Grabber configure a command automatically by clicking on a
selected web element, and then choose from a list of appropriate actions for the
selected web element type.
© 2014 - 2015 Sequentum
64
Content Grabber
Please see Agent Commands for more information about configuring commands.
Configuring Multiple Commands
Automatically
Content Grabber can automatically configure multiple commands to extract content
from whole areas of a web page, such as complete HTML tables or search results.
To configure multiple commands automatically, select the entire page area you want
to extract data from, such as a HTML table, and then click on the selected area again
to open the Add Command dialog window.
© 2014 - 2015 Sequentum
65
If you have selected an area with more than one web element, you will get an option
to Capture All Web elements. This option will add a default capture command for
each web element in the selected area. The image below shows an example of
commands added using the Capture All Web Elements option.
© 2014 - 2015 Sequentum
66
Content Grabber
Content Grabber will often add more capture commands than you need, and you will
have to manually delete the commands that are not needed.
Content Grabber will automatically try to assign names to the added commands, but in
most cases you will have to change the names to make them more meaningful.
Capturing HTML Tables or Lists
If you have selected more than one web element in the web browser, you will get an
option to Capture List or Table on the Add Command dialog window. This option
will add a command for your selected list, and within the list it will add a default
capture command for each web element.
If you have selected a single web element that contains a list or a HTML table, you
will also get an option to Capture List or Table on the Add Command dialog
window. This option will locate the first list or table within the selected web element,
and add a list command to your agent that iterates through each row in the table or
list, and within the list it will add a default capture command for each web element.
After selecting the Capture List or Table option from the Add Command dialog
© 2014 - 2015 Sequentum
67
window, you will get a new dialog window where you can select the type of table or
list you have selected.
You can choose one of the following table or list types.
Table or
Description
List Type
Table With
A table where the first row contains header text.
Header Row
The header text will not be extracted, but will be
used to name the capture commands. The header
text must be unique for each column.
Table
A table without header text. The capture
Without
commands will use auto generated names.
Header Text
Table of
© 2014 - 2015 Sequentum
A table with two columns where the first column
68
Content Grabber
Name/Value
contains the value name and the second column
Pairs
contains the value. The value names will not be
extracted, but will be used to name the capture
commands. The value names must be unique.
List
Any simple list of content. A capture command will
be added for each web element in the list. The
capture commands will use auto generated
names.
After selecting the table or list type, Content Grabber will add a list command with
capture sub-commands. The image below shows an example of commands added
using the Capture List or Table option.
Content Grabber will often add more capture commands than you need, and you will
have to manually delete the commands that are not needed.
Content Grabber will automatically try to assign names to the added commands, but
if you have chosen the List or Table Without Header Text option, you may have to
change the names to make them more meaningful.
© 2014 - 2015 Sequentum
3.5
69
Selection Tools and Short-cut Keys
Content Grabber includes a handful of selection tools and short-cut keys that can
significantly save you time when developing an agent. It is useful to understand how
these work to improve your productivity. This section of the manual looks at these
tools and explains how they are used. Also included are details on short-cut keys for
further optimizing development time.
Most of the menu buttons / icons for the selection tools can be found at the bottom of
the Configure Agent Command panel.
Bottom of Configure Agent Command panel showing selection tools
List Selection
The list selection is used to select common elements on a page and save then as a
list. It can be text, titles, photos, captions, tabular data etc.
To use this function, simply select a web element in the browser then click the List
button at the bottom of the Configure Agent Command panel. Clicking the List button
enables List Selection Mode.
To collect multiple elements, select the next web element and Content Grabber will
automatically add all the similar elements on the page to the list. While in List
Selection Mode, you can continue to add items to your list by clicking on the elements
if they weren't selected automatically. You can also click a selected item again to
remove it from the list.
© 2014 - 2015 Sequentum
70
Content Grabber
List Selection Mode enabled
Once you have selected all your content elements into the list, click the Save button to
save the list and exit List Selection Mode.
Shift Short-cut key
Instead of clicking the List button you can also enable List Selection Mode using the
Shift key.
To do this, hold down the Shift key on your keyboard, click the first element (such as
the first title in the list) and—continuing to hold down the Shift key—click the mouse
on the very next element (e.g. the next title in the list).
When you have finished selecting elements for you list, simply release the Shift key to
save the list.
Anchor Selection
The Anchor Selection function is useful when you want to extract content that changes
© 2014 - 2015 Sequentum
71
position on different pages. For example, you may extract data from a table that
displays product properties, such as width, height and depth, but some properties
may not apply to all products, so the location of each property may vary depending on
how many properties are available for a certain product. You can use an anchor
selection to tie the property value to the property name, so the property value you are
extracting is always correct, no matter where it's located in the table.
To use this function, select one or more elements in the web browser then click the
Anchor button at the bottom of the Configure Agent Command panel. This will
enable Anchor Selection Mode. While in Anchor Selection Mode, click the web
element you want to anchor your selection to.
Anchor Selection Mode enabled
Once you have made your selection, click the Save button to save the Anchor and
© 2014 - 2015 Sequentum
72
Content Grabber
exit Anchor Selection Mode.
Ctrl Short-cut key
Instead of clicking the Anchor button, you can also enable Anchor Selection Mode
using use the Control (Ctrl) key.
To do this, hold down the Ctrl key on your keyboard, click the first element and
continuing to hold down the Crtl key. Next, click on the web element you want to
Anchor this content element to. Once complete, release the Ctrl key to save the
Anchor.
Command Naming Short-cut (Alt Key)
When creating a new Command in Agent Explorer, Content Grabber will automatically
name the command for you. This is typically based on the naming conventions used in
the HTML behind the web page.
If you hold down the Alt key when clicking on any text element in the browser, Content
Grabber will name the new command with the text you click on.
Expand Selection
The Expand Selection tool is useful when only part of a web element has been
selected (e.g. a multiple row and column table) and you want to expand the selection.
To use this function, simply click the Expand Selection icon at the bottom of the
Configure Agent Command panel to expand the current selection in the web
browser. Click again if you need to expand the selection further.
Clear Selection
© 2014 - 2015 Sequentum
73
The Clear Selection tool clears the current content element selection in the web
browser.
To use this function, simply click the Clear Selection icon at the bottom of the
Configure Agent Command panel to clear the current selection in the web browser.
Exact Selection
The Exact Selection tool is generally only used by advanced users. Use this option to
generate selection XPaths with as much detail as possible. Exact selections are likely
to fail if a website changes slightly, so only use this option if you understand the
consequences.
To use this function, select the content element on the browser page you want
positional information for and then click the Expand Selection icon at the bottom of
the Configure Agent Command panel. The detailed positional data is then displayed
in the Selection XPath window at the top of the Content Grabber screen.
This option is best used to create selection XPaths that can be used as a starting
point when manually creating XPaths. Refer to Editing XPaths Manually for more
detail.
3.6
Navigating an Agent
An agent loads and processes a number of different types of web pages, and each
type of web page has a separate web browser tab in the agent editor.
© 2014 - 2015 Sequentum
74
Content Grabber
The above image shows the browser tabs for an agent that processes three different
types of web pages.
An agent may process thousands of web pages, but you only need to define
commands for each different type of web page. All web pages with the same layout
are considered the same type of web page. For example, if you are extracting data
from a product catalog, all product detail pages are considered a single type of web
page, and you therefore only need to define commands for a single product detail
page.
Agent Explorer
The Agent Explorer is used to view agent commands for the type of web page loaded
in the selected browser tab. When you select a new browser tab with another web
page, a new set of commands will be displayed in the Agent Explorer.
You can use the Agent Explorer to view, edit or execute commands. If you execute a
command that opens a new web page in a new web browser tab, Content Grabber
will automatically switch web browser tab and load the new web page.
© 2014 - 2015 Sequentum
75
When you open a new agent, Content Grabber will automatically execute the Agent
command which loads the first web page into the first web browser tab, but it will not
execute all the other commands, so it will not load web pages into any other web
browser tabs. You can still switch browser tabs to view and edit the associated
commands, but you will not be able to view the associated web page until you execute
the command that loads a web page into the browser tab.
You generally need to execute all agent commands if you want to make sure web
pages are loaded into all web browser tabs. It can be a very tedious task to manually
execute all commands in an agent, so you can click the Execute All Commands
button in the application menu to automatically execute all commands in the agent.
You can also execute all commands from start up until a specific command. Right-click
on a command in the Agent Explorer and select Execute From Start to Here from
the context menu.
If you just want to execute a set of commands on the web page in the selected web
browser tab, then you can right-click on a command and select Execute to Here from
the context menu. This is often used when you want to submit a web form, but don't
want to manually execute all the Form Field commands required to set the form field
values. Since you just want to set form field values on the current web page, there's
© 2014 - 2015 Sequentum
76
Content Grabber
no need to execute commands in other web browser tabs.
3.7
Testing/Debugging an Agent
The agent debugger is an essential tool when trying to locate and correct issues in an
agent. Running an agent through the debugger is normally the first thing you do after
you have built your agent.
The debugger will warn you of any potential issues while it's running an agent, and
allow you to pause and correct any issues.
An agent runs significantly slower through the debugger, and the debugger does not
automatically manage JavaScript memory leaks and hanging/crashing web browsers,
so you should only use the debugger for testing, and not for full agent runs.
Refer to Using the Debugger for more detail.
Logging
Logging is used to collect detailed information about the web scraping process and
can be used both when debugging and running an agent. Logging includes links to the
web pages that are being processed, so it's easy to pinpoint pages that may be
causing problems.
Refer to Using Logging for more detail.
3.7.1
Using the Debugger
The agent debugger is used to test your agent, and to find and correct issues such as
missing content elements.
When building an agent, you often base the design of the agent on just a few web
pages, and then rely on Content Grabber to execute your commands on all similar
web pages it encounters.
© 2014 - 2015 Sequentum
77
For example, if you are extracting data from a product catalog, you may select the
product name and price on a single product detail page and add agent commands to
extract those two web elements. Content Grabber will then use these two agent
commands to extract product name and price for all product detail pages in the
product catalog. This works fine in most cases, but some products maybe on special,
and the product detail pages for those products may display the price differently.
Content Grabber may be unable to pick up the discount prices and the price for some
products may therefore be missing. Even if you know some products have a discount
and need special attention, it may be difficult to find such a product page in a large
catalog.
The agent debugger can help you correct issues where an agent is unable to locate
content, because it stops and warns you when content is missing and allows you to
correct the agent command that caused the error. In the product catalog example, you
would be able to correct the agent command that extracts the price, so it includes a
selection for the discounted price.
Controlling the Debugger
Use the Start button in the Debug ribbon tab to start the debugger.
The agent will run directly in the agent editor, so you will be able to see the web
pages being loaded into the web browser. The debugger will mark the content that is
being processed in the web browser, and it will also highlight the command that is
being executed, in the Agent Explorer panel.
Once the agent is running in the debugger, you can pause or stop the debugger at any
time. If you pause the debugger, you can use the Next Command button to execute
© 2014 - 2015 Sequentum
78
Content Grabber
one command at a time, and the Next Action button to execute commands until the
debugger reaches an action command.
Setting the Debug Speed
The debugger will mark the content that is being processed in the web browser, which
is helpful when trying to work out how the agent is processing the website, but the
debugger may run so fast that it's impossible to see what's going on. You can slow
down the debugger to make it much easier to follow the process.
You can change the debug speed before you start the debugger or while the
debugger is running.
Debugging a Data Subset
Agent list commands may process large data sets, such as a long list of Start URLs
or a long list of form field input data. If you want to test an agent with data from the
end of a list, it may take hours for the agent to reach the data you want to test.
Instead of waiting for the agent to reach the data you want to test, you can specify
the subset of data you want to include in the debugging session.
The Debug Set option is only available when you select a list command in the Agent
Explorer. This setting is automatically saved to the agent command and will apply to
the command every time you debug the agent. This option has no effect when you run
an agent, only when you debug the agent.
© 2014 - 2015 Sequentum
79
Debug From a Specific Command
Sometimes you may want to only debug a small part of your agent. If you have a
complex agent that processes a large website, it may take a long time before the
debugger reaches certain part of your agent, so instead of starting from the beginning,
you can select the agent command where the debugger should start.
When you start debugging from a specific command, the web browser tab that is
associated with the command must have a web page loaded. You cannot start
debugging from a blank web page.
Disable a Command While Debugging
When debugging a complex agent that processes a large website, it's often useful to
exclude parts of the agent, so you only execute the part of the agent you want to test.
you can do this by disabling commands while debugging.
© 2014 - 2015 Sequentum
80
Content Grabber
When you disable a container command, all sub-commands are also automatically
disabled.
This setting is automatically saved to the agent command, so the command will be
disabled every time you debug the agent until you enable the command again. This
setting has no effect when you run an agent, only when you debug the agent.
Viewing Debug Data
Data collected while debugging an agent is saved to a separate data store, so it
doesn't overwrite data that is collected when you run the agent. The debugger will
export data to your chosen export target, but it will never distribute data. If you are
exporting data to a database, the debugger will create separate data tables for the
debug data. If you are exporting data to a file format, the files will be written to the
agent's debug data folder.
Important: If you are using a script to export data, you are responsible for managing
debug data if you want to separate debug data from normal data.
© 2014 - 2015 Sequentum
81
Please see the Data topic for more information about agent data.
3.7.2
Using Logging
Logging can be used when debugging or running an agent to collect information about
the web scraping process. Logging can be set to three different detail levels, low,
medium and high. Low detail level only logs errors, medium detail level logs errors and
warnings, and high detail level logs errors, warnings and general progress information.
Debug logging is always set to high detail level and cannot be changed.
Debug log data is stored in a separate data store, so it doesn't overwrite or get mixed
with normal log data.
Debug logging can be turned on in the Debug ribbon menu.
Normal logging can be turned on when running an agent on the Run Agent screen.
The Run Agent screen is not displayed when running an agent using the Run button in
the application menu, so you will not be able to configure logging when running an
agent this way.
Logging Raw HTML
Content Grabber automatically logs direct links to processed web pages when logging
© 2014 - 2015 Sequentum
82
Content Grabber
is turned on, so it's normally easy to view specific web pages that have been
processed. For example, if an error or warning appears in the log, you can simply
click on the associated URL to open the web page and see if there is anything special
about the page that may cause an error.
Sometimes it's not possible to open a web page using a direct URL. For example,
some websites implement CAPTCHA protection, which is web pages that appear
randomly to ask the user to enter a verification code. If a CAPTCHA page is retrieved
instead of a normal web page, your agent is likely to encounter errors, but if you click
on the associated URL later, you may not get the CAPTCHA page because it appears
randomly. In this case it may be difficult to determine what is causing the error, but
you can use the Log HTML feature to log the raw HTML of all processed web pages,
and this will allow you to view the CAPTCHA page. Please see the CAPTCHA topic
for more information about CAPTCHA.
Error Handling
Please see the Error Handling topic for more information about error handling,
notifications and logging.
3.8
Scheduling
Content Grabber Scheduling
If you have an Agent loaded in Content Grabber, you will be able to access the
Scheduling feature. From the Agent Settings menu at the top of the Content Grabber
application. Simply Click on the Schedule menu option to display the Scheduling
window.
© 2014 - 2015 Sequentum
83
The Enable schedule checkbox allows you to easily turn scheduling on or off for the
Agent. Click the Enable schedule checkbox so you can begin configuring the
schedule details.
The Run interval fields allow you to select how often your agent will run. This can be
set in seconds, minutes, hours or days. Start date and Start time enable to
preciously control from when the agent will start executing.
The log feature allows you to record when the agent runs. The log level setting you
choose (i.e. none, low, medium, high) will determine the level of detail that is recorded
about each run. For instance, you may want to track all error messages or issues
encountered by the running agent so you might choose "high". However, be mindful
that a high level of logging may slow the Agent performance.
Content Grabber also includes some scheduling security features. If you don't want
your Agent to run when to aren't logged on to your computer, simply click the Run
only if user is logged on check box.
Using Windows Task Scheduler
Should you want to manage multiple agents from one place or are looking for
additional controls, or calendar functionality, you can utilize the Windows Task
Scheduler instead. To do this, simply click on the Open Windows Scheduler button
© 2014 - 2015 Sequentum
84
Content Grabber
on the Scheduling window.
Windows Task Scheduler is used to create and manage common tasks that your
computer will carry out automatically at the times you specify. In this instance the task
will be to run a web-scraping Agent(s) at the scheduled time(s) you require.
To setup a basic Agent schedule task, simply go to the Action menu and select the
Basic Task Wizard. This wizard will lead you through the required steps to setup your
schedule task.
Windows Task Scheduler - Using the Basic Task Wizard
For more advanced scheduling options or settings such as multiple task actions or
triggers, use the Create Task command in the Actions menu.
© 2014 - 2015 Sequentum
85
Windows Task Scheduler - Create Task Command (for advanced options)
Tasks are stored in folders in the Task Scheduler Library. To view or perform an
operation on an individual task, select it in the Task Scheduler Library and click on a
command in the Action menu.
3.9
Copying & Moving Agents
Moving an agent
You can move an agent by simply moving the agent folder to a new location. The
agent will retain its unique identifier, so the agent will work as before although it is
now in a new location.
Copying an agent
To copy an agent, simply choose File > Save As in the Content Grabber menu.
The new agent will get a unique identifier - an identifier that is different from the
original agent.
WARNING: We strongly recommend that you do not simply copy the agent from
one folder to another on your disk. The result will be two agents having the
same identifier, since the copy will retain the original identifier.
© 2014 - 2015 Sequentum
86
Content Grabber
Content Grabber stores information about each agent in its database, and can
only store one entry per identifier. If you do have two agents on your computer
with the same identifier, the active agent will be the agent that you open most
recently in the Content Grabber editor. In these circumstances, you risk a
duplication of effort or a loss of work.
3.10
Exporting & Importing Agents
Exporting and Importing agents allows you to easily move your web scraping agents
between machines. You can also export your agent for back-up to another device.
Exporting Agents
To export your agent, first make sure you save your agent and have it open. Then
from the File menu, select Export Agent. The following Export Agent pop-up window
will appear on your screen.
Export Agent window
Content Grabber's Export Agent facility enables you to package up all the associated
files and folders into one file in a manner that is similar to a zip file. The default
location for where agents are exported is the Documents > Content Grabber >
Agents folder on your PC.
You can also elect to Include internal data and/or Include export data into the
export file by selecting the corresponding check box. Once you have made your
selection, simply click the Export button.
© 2014 - 2015 Sequentum
87
For detail on the Create self-contained agent check box, refer to Building a Selfcontained Agent.
Importing Agents
You can only import agents that have previously been exported from Content Grabber.
To do this, select Import Agent from the File menu. The following Export Agent popup window will appear on your screen.
Import Agent Open window - to select agent file for import
Simply select the Content Grabber agent file you wish to Import and then click the
Open button.
All the folder structures and associated files for the agent will then be unpacked into
the Documents folder on your machine.
3.11
Automated Agent Backups
Whenever you save an agent, Content Grabber automatically makes a backup for you
which you can restore your agent from should you have any issues.
To see the different backups available to restore from, click on the File menu and you
© 2014 - 2015 Sequentum
88
Content Grabber
will see the agent backup listed under this menu.
Restore from Content Grabber's Automated Backups
Content Grabber also saves a backup restore file for you should the application close
unexpectedly (e.g should the application process terminate). This will be available for
you to restore from so you do not lose your latest development work.
3.12
Template Libraries
A template library contains building blocks that you can use to build your agents. A
building block can be an entire agent, a group of commands, or a script. Content
Grabber includes a large number of predefined building blocks, but you can also add
your own - either to the Agent Template Library or the Command Template Library.
For example, if you need to create many agents that have the same structure but will
point to different websites, then you could add your own agent template and derive it
from the first agent you create.
We encourage you to learn more about the Agent Template Library and the Command
Template Library.
© 2014 - 2015 Sequentum
4
89
Mastering the essential selection techniques is a critical aspect of web-scraping.
When you point-and-click on content in the web browser to create agent commands,
you are using the most basic selection technique. In addition to the simple point-andclick feature, Content Grabber provides a range of tools to help you make precise
selections, including:
XPath Editor - gives you the ability to manipulate the
selection XPath
Tree View Window - gives a more precise view of a
web page
List Tool - helps you select a list of web page
elements
Anchor Tool - helps you apply conditions to the
selection XPath.
It is important to realize that you have to be explicit with Content Grabber. You have
to be deliberate and specific when selecting each of the HTML elements that you want
to capture. Sometimes it can be confusing, as when you want to capture the elements
of a search-result listing and each entry has a heading. In some cases, the heading
will be a link and in other cases it will be plain text, so the HTML for the headings may
look something like this:
<h1><a href="http://website.com">Heading as a link</a></h1>
<h1><span>Heading as plain text</span></h1>
If the first two headings are links, then only the link headings will be chosen - not the
headings that are plain text. To extract the text of all the headings in this example, you
might use the first two headings to create a list selection. As with many software
applications, you have to be explicit with Content Grabber. It does not somehow know
that you want to extract all the headings, and its default function will be to assume that
© 2014 - 2015 Sequentum
90
Content Grabber
you are trying to select only the links. In such a case, you would need to change the
selection so that it selects the <h1> tag instead of the link tag - before you create the
list.
4.1
Selection XPaths
Each time you click on content in the web browser panel, Content Grabber does
some processing in the background to calculate the selection XPath. XPath is a
common syntax for selecting in XML and HTML documents. Content Grabber uses a
standard implementation that supports XPath v1.0 syntax, and also supports a range
of new methods specifically designed to make web scraping easier. Content Grabber
has a range of tools that helps you create a precise XPath, without the need to know
the syntax. Eventually, you may find the need to fine-tune the XPath manually, and
then you will need to learn the XPath syntax.
Each time you make a selection in the web browser, you can view the selection
XPath either in the Selection ribbon menu or on the status bar (see the figures
below). Note that the Selection ribbon menu is only available if you are using the
Expert user interface layout.
The selection XPath in the Selection ribbon menu
The selection XPath is shown on the status bar
The XPath in the figure above is:
//div[@class='feature featureLast']/h3[1]/a[1]
It contains these selection steps:
© 2014 - 2015 Sequentum
91
1. Selects all <div> tags within the webpage having the class attribute value feature
featureLast.
2. Also selects all child <h3> tags from the selection in step 1 (/h3[1]).
3. Then selects all child <a> tags from the selection in step 2 (/a[1]).
Read more in the XPath and Selection Techniques article.
To learn more about XPath, we recommend that you
consult a good reference guide such as this one:
www.w3schools.com/XPath/xpath_syntax.asp
Content Grabber XPath Functions
Content Grabber has a set of non-standard functions you can use in
XPath selections:
Function
Description
bool equals(string
Returns True if the two strings are
source, string target)
equal. The comparison is case
insensitive.
bool fuzzy-match(string
Uses the Jaro/Winkler distance
source, string target,
algorithm to determine if two strings
[double tolerance])
match. This function first splits each
string into words and then compares
words from each string. If one string
contains more words than the other
string, the additional words are not
considered when calculating the
distance.
Tolerance must be between 0 and
1. A tolerance of 1 means an exact
match. A tolerance of 0.85 is used if
© 2014 - 2015 Sequentum
92
Content Grabber
tolerance is not specified.
bool fuzzy-match(string
Uses the Jaro/Winkler distance
source, string target,
algorithm to determine if two strings
double
match. This function first splits each
singleWordTolerance,
string into words and then compares
int
words from each string.
wordMismatchesAllowe
d, [bool
If isMatchAllSourceWords is true, all
isMatchAllSourceWords
words in the source string will be
], [bool
matches.
isMatchAllTargetWords
])
If isMatchAllTargetWords is true, all
words in the target string will be
matches.
if both isMatchAllSourceWords and
isMatchAllTargetWords are true, all
words in the string containing the
most words will be matched.
if both isMatchAllSourceWords and
isMatchAllTargetWords are false, all
words in the string containing the
least words will be matched.
SingleWordTolerance must be
between 0 and 1. A tolerance of 1
means an exact match. If a word
match is less than the
SingleWordTolerance value, a word
mismatch is recorded.
© 2014 - 2015 Sequentum
WordMismatchesAllowed is the
number of words that are allowed to
mismatch before the function returns
false.
bool not-
Returns False if the string contains
whitespace(string
only white space characters.
source)
add-bookmark()
Adds the current node to an internal
list of bookmarks.
bool has-parent-
Returns True if a parent of the
bookmark()
current node is in the bookmark list.
NodeSet parent-
Returns the first parent node that is
bookmark()
in the bookmark list.
string html()
Returns the HTML of the current
node.
string inner-html()
Returns the inner HTML of the
current node.
string uniqueid()
Returns a unique ID.
string url()
Returns any URL property of the
current node.
string image-url()
Returns any image URL of the
current node.
string email()
Returns any email address of the
current node.
string flash-url()
Returns any flash URL of the current
node.
string tag-text()
Returns the text of the current node
excluding text from any child nodes.
© 2014 - 2015 Sequentum
93
94
Content Grabber
string find-data(string
Returns the extracted data for a
commandName)
command that has already been
processed. If the command does
not exist in the current container
command, the function will keep
searching in parent containers.
string get-data(string
Returns the extracted data for a
commandName)
command that has already been
processed. The command must
exist in the current container
command.
string get-input-
Returns input data as a string value
data(string
from the specified data column in
commandName, string
the data provided by the specified
columnName)
command. The specified command
must be a data provider.
string get-input-
Returns input data as a string value
data(string
from the specified data column in
columnName)
the data provided by the last data
provider parent command.
string get-input-data()
Returns input data from the first
data column that contains a string
value in the data provided by the last
data provider parent command.
string get-global-
Returns the data entry with the
data(string name)
specified name from the global data
dictionary which includes input
parameters. The data entry must be
a string value or a simple value type
that can be converted to a string
value.
© 2014 - 2015 Sequentum
int node-
Returns the position of a specific
position([nodeSet])
node among all nodes with the same
parent node. If no node is given,
then this is the position of the root
node.
When choosing elements inside a
web element list, the current list
element is the root node. So a call
to node-position() would return the
position of the current list element.
The index is not zero based, so the
first index is 1.
NodeSet root([int
Returns the root node with a specific
nodeIndex])
index. The index must be greater
than 1. If no index is specified the
current root node is returned.
When selecting elements inside a
element is the root node. So, a call
to root() would return the current list
element, and a call to root(2) would
return the second list element.
int root-index()
Returns the index of the current root
node.
to root-index() would return the
© 2014 - 2015 Sequentum
95
96
Content Grabber
index of the current list element.
first index is 1.
int last-root-index()
Returns the index of the last root
node.
to last-root-index() would return
the index of the last list element.
first index is 1.
int root-position()
Returns the position of the current
root node. This position is relative to
all nodes with the same parent
node.
to root-position() would return the
position of the current list element.
first index is 1.
NodeSet root-siblings()
Returns following siblings of the
current root node, but stops when it
encounters another root node.
© 2014 - 2015 Sequentum
97
This function is equivalent to the
following selection:
root()/following-sibling::*[root-index()
=last-root-index()
or
position()<root-position(root-index()+1)root-position()]
to root-siblings() would return
siblings of the current list element.
4.2
Optimizing Selections
An excessively long XPath is unlikely to work well if the target web page changes even slightly. Every step in an XPath must match an HTML tag on the web page, and
it will fail if any of these tags move to another location or go away entirely.
When you click on an HTML element in the web browser, Content Grabber will
automatically attempt to create an optimal selection XPath by making it as short as
possible. For example, the full selection XPath path to a <div> HTML element could
be as follows:
DIV[1]/DIV[5]/TABLE[2]/TBODY[1]/TR[2]/TD[1]/DIV
If the DIV tag has an ID attribute with a unique value
listView, the optimal XPath is:
//DIV[@id='listView']
© 2014 - 2015 Sequentum
98
Content Grabber
Content Grabber will examine the entire web page for a DIV tag having the ID
attribute value listView. This XPath example is very robust and not sensitive to future
page changes, and it will work as long as this element exists on the web page - even
if the rest of the page changes.
The design of Content Grabber is to prefer ID attributes for optimizing XPaths,
because the expectation is that any given ID will be unique on a web page. However,
sometimes websites use IDs that are unique to a specific content element, such as a
product ID, and such IDs are not appropriate for use in an XPath. If you extract data
from a product catalog by using an XPath to extract the product title from all product
detail pages, then you do not want the XPath to depend on a specific product ID,
such as we have in the case below:
//H1[@id='sku_245865']
Such an XPath would work for only one specific product and not for the others.
Content Grabber will avoid using IDs that look like identifiers, such as a product
identifier, but if it turns out wrong, then your only option is to optimize the XPath
manually.
If you need to optimize a selection XPath manually, you can use the Exact Selection
tool. This tool generates selection XPaths with as much detail as possible, and then
you can manually remove the detail that is causing problems - such as an ID attribute
containing a product identifier.
The Exact Selection tool
4.3
List Selections
List selections are important, since you will often want to extract a
list of web content or follow a list of links.
© 2014 - 2015 Sequentum
You can create list selections by editing the XPath manually, or by
using the Content Grabber List Tool. You can start the List Tool by
selecting a web element and then holding down the SHIFT key while
selecting more web elements that should be included in the list.
Once you release the SHIFT key the list selection will be complete.
You can also start the List Tool by selecting a web element and then
click on List toolbar button.
The List Selection toolbar button
You can now select more web elements that should be included in
your list. You can exclude web elements from the list by clicking on
web elements that are already selected. You will need to click the
Save button in the List mode window to complete the list.
The List Selection window
Content Grabber will not just add web elements you click on to the
list, but also other web elements of the same type that are naturally
part of the same list on the page. For example, all rows in a HTML
© 2014 - 2015 Sequentum
99
100
Content Grabber
table are naturally part of the same list, so if you click on the first
row in the table and then SHIFT-click on the second row, Content
Grabber will create a list of all rows in the table, and not just the two
rows you have clicked on.
You can only add web elements of the same type to a list, and all
web elements in a list must naturally belong to the same list. For
example, if you have a list of products in one area of a page, and
another list of products in another area of the same page, you may
not be able to add them to the same list.
Content Grabber will automatically attempt to create a natural list of
the same type of web elements. If you want to create a list of table
rows for example, you can select the first table row, and then
SHIFT-click anywhere inside the second table row to create the list.
Even if you select a table column, or any other child web element, in
the second row, Content Grabber will know that it needs to create a
list of table rows. It's important to notice that it DOES matter exactly
which child web element you click on in the second table row, since
Content grabber will use this information as a selection filter, and
exclude all web elements from the list that don't contain this child
web element. Content Grabber may also use the text or other
properties of the child element to filter the list, but it will only do so if
the filter does not filter away web elements you have specifically
clicked on to add to the list.
Web Element Siblings
A list of web elements will automatically also include all siblings of
the web elements. For example, you may have a list of table rows
where the first row contains a product name and the next couple of
rows contain details about that product, and then another title row
followed by more rows containing details about the second product,
© 2014 - 2015 Sequentum
and so on. In this case you should create a list of the title rows only,
and the siblings of the title rows containing product details will then
be available to agent commands that references the list selection.
For example, if a Web Element List command uses the list selection,
then sub-commands can capture content within the sibling web
elements.
4.4
Selection Anchors
Selection anchors are used to anchor a selected web element to
another element, or to filter a list by a child element.
The most common use of selection anchors is when you have a list
of name/value pairs on a page and want to extract the value for a
specific name, but that specific name/value pair is not always
located in the same position. For example, when extracting product
specifications from a table, some products may have size and
weight in the first two rows, but other products may have model
and memory in the first two rows, so if you extract size based on
position, you will extract size for some products, but model for other
products. To overcome this problem, you can anchor the selection of
the size value to the text of the web element containing the size
name.
You can anchor a selection by CTRL-clicking on the anchor element,
or by clicking on the Anchor toolbar button and then selecting the
anchor element. When using CTRL to select an anchor, the anchor
will always be a text anchor, but when using the Anchor toolbar
button, you can choose many different kinds of anchors.
© 2014 - 2015 Sequentum
101
102
Content Grabber
The Anchor toolbar button
An anchor works as a filter when used on a list selection, and the
anchor element must be selected inside the list. When using CTRL to
select the anchor element, the anchor will filter the list so that only
web elements that contain a child web element similar to the anchor
element will be included in the list. When using the Anchor toolbar
button, you can chose many different kinds of anchors to filter a list.
The Anchor window
4.5
Editing XPaths Manually
Content Grabber will usually create an optimal selection XPath, but sometimes you
may find it necessary to fine-tune the XPath manually. It can be difficult, however, to
create an XPath from scratch. If you want to manually edit an XPath, we recommend
that you first let Content Grabber generate the XPath and then edit it.
As we show in the figures below, you can use either the status bar XPath editor or
the Selection ribbon menu to manually edit an XPath. Note that the Selection ribbon
menu is only available if you are using the Expert user interface layout.
© 2014 - 2015 Sequentum
The status bar XPath editor
The Selection Ribbon menu
© 2014 - 2015 Sequentum
103
104
5
Content Grabber
Agent Commands
Content Grabber web-scraping agents usually contain a list of commands that execute
sequentially. Sometimes, commands will execute in parallel (simultaneously). Each
agent command performs an action, such as loading a new page or capturing content.
Some commands can contain sub-commands. These commands are called container
commands and they execute all their sub-commands one or more times. If a container
command executes its sub-commands more than one time, then the command is also
called a list command. A list command iterates through a list of data elements or web
elements and executes all its sub-commands once for each element in the list.
The Agent Explorer shows all commands in an agent
The most important command in a agent is the Agent command itself. This is the first
command that executes. Since it contains all other commands for the agent, it is called
a container command. The Agent Command loads the Start URL, the point at which
data extraction starts. The Agent Command also controls many other important
aspects of the agent, such as data export.
Some agent commands have a corresponding web selection that chooses an element
from the current web page. One or more of these selections are put to use when the
© 2014 - 2015 Sequentum
Agent Commands
105
command executes. A Navigate Link command, for example, selects a link on the
current web page, which will open a new web page.
See The Content Grabber Editor topic to learn how to use the Agent Explorer to add
and configure agent commands.
Command Types
We classify agent commands into four types, according to
function:
Capture Commands
List Commands
Action Commands
Container Commands
Capture commands do nothing but capture web content. Container, List, and Action
commands may function as one or more types simultaneously. There are some
special commands that don't fall into any of these 4 categories, such as the Execute
Script command which simply runs a .NET script.
Agent Command Properties
All commands have a set of properties that govern the function of the command. The
following is a set of basic properties common to all agent commands.
Property
Description
Name
The name of the command.
ID
The internal ID of the command.
Disabled
The default value for this property is False. The
system will ignore the command if this property
set to True.
© 2014 - 2015 Sequentum
106
Content Grabber
Export
A command with this property set to False
(default) will not save any data output. This
includes all sub-commands that are part of a
container command.
Recursion
A command with this property set to False
Disabled
(default) will only execute one time in case there
is a recursive loop.
Notify On
The default value for this property is True, which
Critical
means that a notification email is sent at the end
Error
of an agent run if this command encounters a
critical error. Critical errors include page load
errors and mandatory web selections that are
missing.
Debug
The system will ignore this command during
Disabled
debugging if this property set to False (default).
Debug
The default value for this property is False.
Break
During debugging, the system will break at this
Point
command if this value is True.
Web Selection Properties
All commands that perform web selection will have these properties:
Property
Description
Paths
Contains one or more selection paths, each of
which points to a specific web element on a
web page. These paths may point to elements
in various locations on the web page.
A path corresponds to a web page selection.
© 2014 - 2015 Sequentum
Agent Commands
107
The value of the Selection Mission Option
property (below) determines how Content
Grabber will handle any missing selection.
Selection
The value of this property specifies the action
Missing
of Content Grabber if this selection does not
Option
exist in the current page. You can set this
property to one of the following:
Default Option - The software will decide
the most appropriate action depending on
the type of command.
Optional Selection - The selection is
optional, so the agent will continue executing
any sub-commands.
Ignore Command when Selection Missing
- The agent will skip this command and also
skip all sub-commands.
Log Error and Ignore Command when
Selection Missing - The agent will skip this
command and all sub-commands, and will
also log an error.
Selection Path Properties
A web selection can consist of multiple selection paths, each of which selects web
elements in different locations on a web page. Each selection path will have these
properties:
Property
Description
Path
The selection XPath. See the topic XPath and
Selection Techniques for more information
about XPaths.
© 2014 - 2015 Sequentum
108
Content Grabber
Wait
Specify an XPath here that you want the agent
XPath
to finish loading before proceeding with the
other commands. This option is useful when
selecting content that loads asynchronously often after the main elements of the page have
finished loading.
Wait
Specifies how long to wait for the Wait XPath
Timeout
web selection element to load.
Seconds
Wait For
If you set this property to True, the agent only
Refresh
waits for the wait XPath to refresh. Does not
Only
verify that the selection Path exists. This
setting is useful if the Path is optional.
5.1
Container Commands
Container Commands contain one or more sub-commands. After the container
command executes, all of the sub-commands execute at least once. If a container
command executes its sub-commands more than once, then the command is also a
List Command. A list command iterates through a list of data elements or web
element, and then executes all its sub-commands - once for each element in the list.
In this chapter, we explain all of the container commands:
Agent
Navigate Link
Navigate URL
Set Form Field
Data List
Web Element List
Group Commands
Group Commands in Page Area
© 2014 - 2015 Sequentum
Agent Commands
109
Container Properties
All commands have a set of properties that govern the
function of the command. The following is a set of basic
properties common to all container commands.
Property
Description
Command
Links to another container command where
Link
processing will continue. The targeted container
command will be executed, so it's normally best
to link to a group command that does nothing,
so it's clear what happens after the link.
Dependen
The action of the dependent command will be
t
executed before the current command is
Command
executed. Dependent commands of the
dependent will also be executed, so a
dependent command can cause many
commands to be executed.
Dependent commands are often used when
processing web forms, because you may need
to ensure some form fields are set properly
before setting other form fields.
Repeat
Repeats the command action while the
While
command selection exists. This can be used
Selection
when a command is deleting all items from a list,
is Valid
so the command will keep deleting items as long
as there are items in the list.
5.2
Capture Commands
Capture commands capture data from a web page or from an input data provider,
and saves the captured data to output. No other type of command can save data to
© 2014 - 2015 Sequentum
110
Content Grabber
output, so you need a capture command for each data field you want in your output
data.
It is important to understand that some commands don't capture anything. For
example, the Set Form Field command does not capture the input value that
corresponds to a form field, but simply identifies the field. To capture data, you need
to use one of the capture commands to acquire the input value and save it to data
output.
These are all of the Capture commands available in Content Grabber:
Web Element Content
Download Image
Download Document
Download Screenshot
Calculated Value
Page Attribute
Data Value
Capture Command Properties
All commands have a set of properties that govern the
function of the command. The following is a set of basic
properties common to all capture commands.
Property
Description
ContentTra
Please see the Content Transformation Script
nsformation
topic for more information.
Script
DataType
The type of the captured data. This should be
set to ShortText or LongText. Other data
© 2014 - 2015 Sequentum
Agent Commands
111
types are not supported, but may be used
internally.
DesignValu
The value to use for this capture command in
e
the agent editor. This value can be important
when testing scripts in the editor if the scripts
depend on captured data.
IsUserDefin
The design value for this command is user
edDesignVa
defined instead of set automatically when the
lue
command is saved.
Use Data
Captures a data value instead of an attribute of
Value
the selected web element. The web element is
ignored when this option is true.
5.2.1
Data
Specifies the input data to use when the Use
Consumer
Data Value property is set to true.
Web Element Content
Typically, the Web Element Content command is the most common, since it is the
command that captures text content from the target web page. The command allows
you to choose which type of text to extract from a particular web element.
The properties shown in the following table are available to all Web Element Content
commands:
Text Option
Description
Text
This is the text that displays in the web browser, and is the
most common choice (it's also the default).
HTML
This option will extract the entire HTML of the chosen web
element, including the HTML of any child elements.
Inner HTML
The entire HTML of all child elements of the selected web
element, but not the tag HTML of the chosen element itself.
© 2014 - 2015 Sequentum
112
Content Grabber
Tag Text
The text of the selected web element, excluding the text of
any child elements.
Unique ID
A unique identifier. The web selection is ignored if this option
is selected.
Default File
Returns any file name specified by a response from a web
Name
server. If no file name is specified, extracts a file name from
a href attribute, or a unique identifier if the href does not not
exist.
This attribute is mostly relevant to file capture commands.
In addition to the default options given above, some web elements may have other
attributes available, such as Class, Name, ID, Value, URL, etc. If a web element has
an attribute that is not shown in the default drop down box, then you can simply enter
the name of the attribute you want to extract.
The figure below shows the Command Properties panel after choosing Web
Element Content from the New Command drop-down:
© 2014 - 2015 Sequentum
Agent Commands
113
Properties
The following properties are specific to Web Element Content commands and are
listed in the HTML Capture section on the command property tab:
Property
HTML Attribute
Description
The HTML attribute to extract from the selected web
element.
Concatenate
Concatenates content if multiple web elements are
Multiple Web
selected. Only the first web element will be used if
Elements
this property is set to false.
Concatenate
The separator to use between content from multiple
Content Separator
web elements. This property is only applicable when
Concatenate Multiple Web Elements is set to true.
The following properties are available in the Web Selection section on the Properties
tab:
Property
Selection
Description
The selection XPath for the chosen web element.
See the Web Selection Properties article for more
information.
Extracting URLs deserves a special mention. That's because it's more common to
extract a link URL instead of navigating to the link, or to extract an image URL
instead of downloading the image itself. If you want to extract a URL from a web
element, simply select the element and extract the URL or Image URL attribute. You
can also enter the actual HTML tag attributes: src for images and href for links.
However, the URL and Image URL attributes automatically convert any relative URLs
to absolute URLs-which is best in most cases.
© 2014 - 2015 Sequentum
114
Content Grabber
Content Transformation Script
The Web Element Content command allows you to use regular expressions or a
.NET script, to transform the extracted content. In most cases we recommend that
you write expressions or use a script to clean the data that you extract. You can also
separate data-such as the elements of a postal address-into separate fields.
Example: Consider a case in which you want to extract product data that includes a
price of $400. You could use a transformation script to strip off the "$" character and
leave only the numeric value.
Please see the Content Transformation Script topic for more information.
These common properties are also available for this command:
Capture Command Propertie
5.2.2
Download Image
The Download Image command extracts an image from a web page. The command
will download an image, and then save it to the file system, send it to a database, or
embed it into Excel - depending on your chosen export target. The web selection path
for this command normally points to the image itself, but it could also point to a web
element that contains a URL that links to the image.
The figure below shows the Command Properties panel after choosing Download
Image from the New Command drop-down:
© 2014 - 2015 Sequentum
Agent Commands
115
Data Fields
If the agent is saving the image to a database, then by default this command will
generate two data fields: one for the image data and another for the name of the
image. If the agent is saving the image to the file system, the command will generate
only one data field containing the full file path to the image. The command property
Export URL can be used to also generate a data field that contains the image URL.
The Common tab in the Configure Agent Command panel has three tabs:
File URL - contains the URL for the image.
File Name - contains the name of the downloaded image.
OCR tab - check the box if you want to convert an image into text.
We explain the details of each below.
© 2014 - 2015 Sequentum
116
Content Grabber
File URL
The entry in this tab determines the specific URL for the image, and the agent uses
this URL to download the image at run time. You can choose the HTML attribute that
the command should extract to get the image URL. The default value is Image URL,
which extracts the src HTML attribute.
Click the Transformation Script button click to enter regular .NET expressions or
write a .NET script that will transform the image URL to meet your requirements. This
is often useful when you want to extract a large image, but it's easier to select a
corresponding thumbnail image. You may then be able to transform the thumbnail
image URL into the large image URL, and the agent will then use the transformed
image URL to download the large image.
See the Content Transformation Script topic for more information about content
transformation scripts.
You can choose the HTML attribute that the command should extract to get the image
URL. The default value is Image URL, which extracts the src HTML attribute.
File Name
The entry in this tab contains the file name. From the drop-down menu you can choose
the HTML attribute that you want to use as the name.
Click the Transformation Script button to enter regular .NET expressions or write a
.NET script that will transform the image name to meet your requirements. See the
Content Transformation Script topic for more information about content transformation
scripts.
Use the Data Value option to specify that an agent data value will be used as file
name. The agent data can come from a data provider, an input parameter or
captured data.
© 2014 - 2015 Sequentum
Agent Commands
117
Use the Detect File Extension option to specify if agent should try and detect
the file type of the downloaded document, or if a transformation script or a data
value will provide a file name that includes a file extension.
OCR
Using the OCR tab, you have the option to convert the image into text. For example,
you might need to convert CAPTCHA images into text so that the agent can bypass
CAPTCHA blocking websites. See the topic CAPTCHA Blocking for more information.
This tab gives you the option to export both the image file and the converted text, or
just the converted text. To convert the image, you'll need to check the box and then
enter a script to call an external OCR service. Content Grabber does not include an
OCR feature, but allows you to integrate with 3rd party services by using this script
feature. For more information, see the topic Image OCR Scripts.
Command Properties
The following command properties are specific to the Download Image command,
and are found in the Convert Image section on the Command property tab:
Attribute
Description
Convert to
Uses a script to convert the image to text
Text
and export the text. An common use of this
property is for CAPTCHA processing.
Image
A script used to transform an image into
Conversion
text.
Script
Export
Converted
Image
© 2014 - 2015 Sequentum
Exports the image along with the text.
118
Content Grabber
The following properties are available in the File Capture section on the command
property tab:
Attribute
File Name
Description
A script that transforms the file name.
Transformation
Script
Export URL
Specifies how to export the file URL.
Use Original
When possible, this property contains the
File Name
original file name.
Fixed File
Adds a fixed file extension to the
Extension
downloaded file.
File Name
The web element attribute to use as file
Attribute
name for the image file.
Auto Detect File Automatically detects the file extension of
Extension
the image file. Uncheck this box if you want
the file name transformation script to set the
file extension.
Use File Name
Uses a data value as file name instead
Data Value
of an attribute of the selected web
element.
File Name Data
Specifies the input data to use when
Consumer
Use Data Value as File Name is set to
true.
The following properties are available in the HTML Capture section on the Properties
tab:
Property
Description
© 2014 - 2015 Sequentum
Agent Commands
HTML Attribute
119
element.
Concatenate
Multiple Web
Elements
Concatenate
Content Separator
tab:
Property
Selection
Description
information.
5.2.3
Download Video
The Download Video command will download a video from a website and save it to
your local disk or a database. The most common video formats are mp4, ogg and
Flash. Content Grabber cannot extract content from within a video (such as Flash
objects), but can only extract the video file as a whole.
The command has a corresponding web selection that will normally select the link to
the video. Optionally, the command can extract data from a web element and then
© 2014 - 2015 Sequentum
120
Content Grabber
calculate the direct video URL.
Video from the New Command drop-down:
Data Fields
If the agent is saving the video to a database, then by default this command will
generate two data fields: one data field will contain the video binary data and another
will contain the name of the video. If you're saving the video to the file system, the
command will generate only one data field containing the full file path to the video file.
You can also use the property Export URL to generate a data field that contains the
video URL.
The Common tab in the Configure Agent Command panel has two tabs:
File URL - contains the URL for the video.
File Name - contains the name of the downloaded video.
© 2014 - 2015 Sequentum
Agent Commands
121
File URL
The entry in this tab determines the specific URL for the video, and the agent uses this
URL to download the video at run time. You can choose the HTML attribute that the
command should extract to get the video URL. The default value is Video URL, which
will cause the command to look at the entire tag HTML and extract the first
appropriate video URL.
.NET script that will transform the document URL to meet your requirements.
See the Content Transformation Script topic for more information about content
File Name
The entry in this tab contains the file name. From the drop-down menu, you can
choose the HTML attribute that you want to use as the name.
.NET script that will transform the document name to meet your requirements. See
the Content Transformation Script topic for more information about content
captured data.
Command Properties
© 2014 - 2015 Sequentum
122
Content Grabber
The following properties are available in the File Capture section on the command
property tab:
Attribute
Description
File Name
Transformation
Script
Export URL
Use Original
File Name
original file name.
Fixed File
Adds a fixed file extension to the
Extension
downloaded file.
File Name
The web element attribute to use as file
Attribute
name for the file.
Auto Detect File
Automatically detects the file extension of
Extension
the file name transformation script to set
the file extension.
Use File Name
Data Value
element.
File Name Data
Consumer
true.
tab:
Property
Description
© 2014 - 2015 Sequentum
Agent Commands
HTML Attribute
123
element.
Concatenate
Multiple Web
Elements
Concatenate
Content Separator
tab:
Property
Selection
Description
information.
5.2.4
Download Document
The Download Document command extracts a document from a web page. The
command will download a document, and then save it to the file system or send it to a
database - depending on your chosen export target.
The web selection path for this command normally points to the document link itself,
but it could also point to a web element that contains information from which the
© 2014 - 2015 Sequentum
124
Content Grabber
document URL can be derived by using Content Transformation.
Document from the New Command drop-down:
Data Fields
If the agent is saving the document to a database, then by default this command
will generate two data fields: one for the document binary data and another for
the name of the document. If the agent is saving the document to the file system,
then the command will generate only one data field containing the full file path to
the document. The command property Export URL can be used to also generate
a data field that contains the document URL.
The Common tab in the Configure Agent Command panel has three tabs:
© 2014 - 2015 Sequentum
Agent Commands
125
File URL - contains the URL for the image.
File Name - contains the name of the downloaded image.
Convert to HTML - specifies if the downloaded document should be converted
to HTML.
File URL
The entry in this tab determines the specific URL for the image, and the agent
uses this URL to download the document at run time.
You can choose the HTML attribute that the command should extract to get the
document URL. The default value is URL, which extracts the href HTML attribute
(if the chosen web element is a link).
Click the Transformation Script button to enter regular .NET expressions or
write a .NET script that will transform the document URL to meet your
requirements. See the Content Transformation Script topic for more information
about content transformation scripts.
URL. The agent data can come from a data provider, an input parameter or
captured data.
File Name
Click the Transformation Script button to enter regular .NET expressions or
write a .NET script that will transform the document name to meet your
requirements. See the Content Transformation Script topic for more information
about content transformation scripts.
© 2014 - 2015 Sequentum
126
Content Grabber
captured data.
Convert To HTML
A downloaded document can be converted into a HTML page as it's being
downloaded, and a URL command can later be used to open the HTML page.
Capture commands can then be used to extract data from the HTML page.
Please see Extracting Data From Non-HTML Documents for more information.
Click To Download
Check this box when no direct URL is available for the document, but it is
necessary to download the document by clicking on a web element - such as a
button. When you enable this option:
The File URL tab becomes unavailable
© 2014 - 2015 Sequentum
Agent Commands
Content Grabber will assign a unique identifier as the file name
You will have access to the Action configuration tab where you can fine tune
the behavior of the action.
See the topic Action Commands for more information.
Command Properties
The following command properties are specific to Download Document
commands and are listed in the Action section on the command property tab.
Attribute
Description
File Download
Action configuration when the Download Method
Action
property is set to Click to Download. See Action
Command Properties for more information.
Download Method
Specifies how the document is downloaded. The
file can be downloaded using the extracted
URL, using an action on the selected web
element, or as a result of an action from a
parent from field.
The following properties are available in the Convert Document section on the
command property tab.
Attribute
Description
Convert to HTML
Uses a script to convert the downloaded
document to a HTML page which can be
processed later using a URL command.
Conversion Script
A script used to transform the downloaded
document into a HTML page.
Export Converted
© 2014 - 2015 Sequentum
Exports the downloaded document in addition to
127
128
Content Grabber
Document
generating and exporting the converted HTML
page.
The following properties are listed in the File Capture section on the command
property tab.
Attribute
Description
File Name
A script used to transform the file name
Transformation
attribute used to name the download file.
Script
Export URL
Use Original File
Uses the file name of the document when
Name
possible.
Fixed File
Adds a fixed file extension to the downloaded
Extension
file.
File Name Attribute
The web element attribute to use as file name
for the downloaded file.
Auto Detect File
Automatically detects the file extension of the
Extension
downloaded file. Clear this option if you want a
file name transformation script to set the file
extension.
Use File Name
Uses a data value as file name instead of an
Data Value
attribute of the selected web element.
File Name Data
Specifies the input data to use when Use Data
Consumer
Value as File Name is set to true.
The following properties are available in the HTML Capture section on the
command property tab.
© 2014 - 2015 Sequentum
Agent Commands
Property
HTML Attribute
129
Description
element.
Concatenate
Multiple Web
Elements
Concatenate
Content Separator
tab:
Property
Selection
Description
information.
The following common command properties are also available to the Download
Document command.
Command Properties
5.2.5
Download Screenshot
The Download Screenshot command extracts a screenshot of the entire webpage or
the chosen web element. The command will save the screenshot, save it to the file
system or send it to a database - depending on your chosen export target. The web
selection path for this command normally points to the document itself, but it could
also point to a web element that contains a URL that links to the document.
© 2014 - 2015 Sequentum
130
Content Grabber
Screenshot from the New Command drop-down:
Data Fields
If the agent is saving the screenshot to a database, then by default this command will
generate two data fields: one for the screenshot data and another for the name of the
screenshot. If the agent is saving the screenshot to the file system, the command will
generate only one data field containing the full file path to the screenshot.
The Common tab in the Configure Agent Command panel has two tabs:
File Name tab - contains the name of the downloaded image.
OCR tab - check the box if you want to convert an image into text.
© 2014 - 2015 Sequentum
Agent Commands
131
File Name
.NET script that will transform the screenshot name to meet your requirements. See
the Content Transformation Script topic for more information about content
captured data.
OCR
Using the OCR tab, you have the option to convert the screenshot into text. This tab
gives you the option to export both the screenshot file and the converted text, or just
the converted text. To convert the screenshot, you'll need to check the box and then
enter a script to call an external OCR service. Content Grabber does not include an
OCR feature, but allows you to integrate with 3rd party services by using this script
feature. For more information, see Image OCR Scripts.
Command Properties
The following properties are available in the Convert Image section in the Properties
tab:
Attribute
Description
Convert to Text
Uses a script to convert the downloaded
image to text and exports the text. This
can be used for CAPTCHA processing.
Image
© 2014 - 2015 Sequentum
A script used to transform the screenshot
132
Content Grabber
Conversion
image into text.
Script
Export
Exports the converted screenshot image in
Converted
addition to the text.
Image
The following properties are available in the File Capture section in the Properties
tab:
Attribute
Description
File Name
Transformation
Script
Export URL
Use Original
File Name
original file name.
Fixed File
Adds a fixed file extension to the file.
Extension
File Name
The web element attribute to use as the
Attribute
file name for the screenshot image file.
Auto Detect File
Automatically detects the file extension of
Extension
the file name transformation script to set
the file extension.
Use File Name
Data Value
element.
File Name Data
Consumer
© 2014 - 2015 Sequentum
Agent Commands
133
true.
tab:
Property
HTML Attribute
Description
element.
Concatenate
Multiple Web
Elements
Concatenate
Content Separator
tab:
Property
Selection
Description
information.
© 2014 - 2015 Sequentum
134
5.2.6
Content Grabber
Calculated Value
This command will save a static value to a data output. You can specify the value at
design time, or as an Input Parameter value that will be set at run time.
.NET script that will transform the value to meet your requirements. See the Content
Transformation Script topic for more information about content transformation scripts.
Command Properties
The following properties are available in the Calculated Value section in the
Properties tab:
Attribute
Description
Value
The command will export this
value.
Use Input
The command exports the
Parameter
chosen input value to data
output.
Input
The command will export this
Parameter
as the name of the input
Name
parameter.
© 2014 - 2015 Sequentum
Agent Commands
135
5.2.7
Page Attribute
Use this command to extract general attributes from the current web page, such a
page URL or page meta data.
A command that will extract the URL of the current web page
.NET script that will transform the attribute value to meet your requirements. See the
scripts.
Command Properties
The following properties are available in the Page Attribute section on the Properties
tab:
Attribute
Description
Attribute Name
The name of the page attribute to capture.
The default value is Page URL, which
© 2014 - 2015 Sequentum
136
Content Grabber
captures the URL of the current web page.
The attribute name can also be the name of
page meta data, such as page title,
keywords or description.
5.2.8
Data Value
Use this command to extract data from a provider and save it to a data output. The
command can save values from data provider commands, such as URL commands or
Form Field commands. For example, if you are using a Set Form Field command to
provide input values to a web form, then these input values will not automatically save
to a data output unless you use a Data Value command.
See Using Data Input for more information about data providers.
This command extracts the URL given by the parent agent command
.NET script that will transform the attribute value to meet your requirements. See the
© 2014 - 2015 Sequentum
Agent Commands
137
scripts.
Command Properties
The following properties are available in the Data Value section of the Properties
tab:
Attribute
Description
Data Consumer
This property specifies which data provider
and data column to extract data from. See
the topic Using Data Input for more
information.
5.3
Action Commands
An Action command executes an action in the web browser, such as opening a new
web page. Most action commands are also container commands, and normally
contain the commands that should execute after the web browser action. If, for
example, an action command opens a page in a new web browser, the subcommands will execute in the context of that new web page.
© 2014 - 2015 Sequentum
138
Content Grabber
These are the Action commands available in Content Grabber:
Agent
Navigate Link
Navigate URL
Navigate Pagination
Set Form Field
Below we summarize each of the configuration tabs for Action commands, and
provide links to each of the subtopics.
Configuration of an Action Command
Explore the options and properties that you can configure for a command by taking
these simple steps:
1. Clicking once on a web element in the browser panel.
2. Locate the New Command drop-down in the Configure Agent Command.
3. Choose a command from the drop-down.
4. Explore the tabs: Common, Action, Data and Properties.
5. In the Common tab, uncheck the Use default input box to reveal more
options.
6. In the Action tab, uncheck the Discover Action box to reveal more options.
In many web-scraping scenarios, the default functionality will be quite sufficient.
Should you find the need for more flexibility and control, you can learn how to
configure all aspects and properties in the Action Configuration section.
Action Command Properties
The following is the set of basic properties common to all action commands:
Property
Description
© 2014 - 2015 Sequentum
Agent Commands
Editor Action
Specifies a URL to use when performing an
action in the editor. See below for Editor
Action Properties.
Scroll Until End
Set this value to True if you want the
of Page
command to scroll to the end of the web
page after an action. Continues scrolling
until no further scrolling is possible, and
waits for the calls to complete if scrolling
triggers AJAX calls.
Limit Number
Set this value to True if you want to limit the
of Scrolls
number of scrolls to a number that you
specify in Maximum Number of Scrolls,
which can be useful if a page can scroll
indefinitely.
Maximum
If Limit Number of Scrolls = True, this
Number of
value is the maximum number of scrolls.
Scrolls
Discover
Configures action properties automatically
Action
when the command first executes.
Timeouts
Specifies timeout values for the action. See
below for Timeout properties.
Action Type
Specifies the action to perform when clicking
on a web element.
Activities
Specifies how this action should wait for
browser activities to complete.
Events
A list of events to fire on the selected web
element.
Default Events
Unless you set this to False, the default list
of events will fire on the chosen web
© 2014 - 2015 Sequentum
139
140
Content Grabber
element.
JQuery
Specifies what action events should be fired
using JQuery.
Add Force
Adds an If-Modified-Since header to the
Refresh Header
web request to make sure the web page is
not retrieved from cache.
Block Known
The web page will not load content from
Ad Servers
known ad servers, such as
ad.doubleclick.net. This speeds up
processing slightly.
Target Browser
Specifies the web browser where a new
web page will load - either in a New
browser, the Current browser, or the
Parent.
Redirect First
Redirects the first request to a new browser
Request
window when TargetBrowser is set to New,
even if the first request is coming from a
frame within the current browser window. If
this property is set to false, requests from
frames within the current browser window
will not be redirected.
Set this property to true if a link opens a
page in a frame, but you want to open the
page in a new browser tab.
Clear Cookies
Clears cookies for the current URL before
executing the action.
Clear Session
Clears the web browser session before
executing the action.
© 2014 - 2015 Sequentum
Agent Commands
Error Handling
Specifies how the agent should react when
an error occurs during execution. The
default reaction is to exit the command, but
the other options are available:
None - No error handling.
Exit Command - Exits the command.
Retry Command - attempts another
execution of the command.
Stop Agent - Stops the agent
completely.
Restart and Continue Agent - Restarts
the agent and then continues at previous
point of execution.
Choose None if you want the agent to
continue executing sub-commands after an
error occurs. You can handle the error in
sub-commands by using the script
parameter IsParentCommandActionError.
Error Retry
Applicable when you set Error Handling =
Count
Retry Command. This is the number of
times that the agent should retry the
command when an error occurs during
execution.
Reuse Browser
If a page is loaded in a new browser tab,
Tab
the browser is reused the next time the
same page is loaded. This option is only
relevant for full page load actions.
Browser Mode
Specifies the type of a new web browser,
which can be one of the following types:
© 2014 - 2015 Sequentum
141
142
Content Grabber
Default
Full Dynamic Browser
Optimized Dynamic Browser
Static Browser
HTML Parser.
The Static Browser and the HTML Parser
are faster because they don't execute
JavaScript, but may not work on many
websites that utilize JavaScript.
Editor Action Properties
These are the editor properties:
Property
Description
Use Specific
Specifies whether to load a direct URL
URL
when an action executes.
URL
Specifies the URL to load when an action
action executes.
Timeout Properties
The table below lists all the timeout properties.
NOTE: Any changes to the timeout values will override the default values.
Property
Description
Page Interactive
The number of milliseconds that the
Timeout
command will wait for a page load to
become interactive.
Page Completed
© 2014 - 2015 Sequentum
Agent Commands
Timeout
command will wait for a page load to
complete after the page has become
interactive.
Ajax Completed
Timeout
command will wait for an AJAX call to
complete.
Frame
Completed
command will wait for frame content to
Timeout
complete loading.
Page Load
Timeout
command will wait for a page to start
loading.
Ajax Start
Timeout
command will for an AJAX call to start.
Asynchronous
Timeout
command will wait for an asynchronous
action to complete.
Discover First
Activity Timeout
command will wait for the first activity when
discovering new activities.
Discover
Activity Timeout
command will wait for the next activity
when discovering new activities.
Ajax Content
Render Delay
command will wait for AJAX loaded content
to render on a web page.
Ajax Content
Render Delay
command will wait for AJAX loaded content
After Scroll
to render on a web page after triggering a
scroll.
© 2014 - 2015 Sequentum
143
144
Content Grabber
Wait for
Waits for external frames to load.
External Frames
Otherwise ignores external frames.
Use
Waits twice as long as specified by the
Conservative
default timeout values. This can be a quick
Default Timeout
way to test if issues with an action are
caused by timeout values being too short.
Default timeouts are used when discovering
activities, and when scrolling a page.
Activity Properties
Choose an activity in the Activities property, which launches this pop-up window.
Here you can edit activity properties.
This table lists all the activity timeout properties.
Property
Description
© 2014 - 2015 Sequentum
Agent Commands
Discover Action
145
Automatically configures action properties
when executing an action. This is slightly
slower than specifying the activities.
Activity Type
The type of the activity, which you can
choose to be one of the following:
Activity Status
Page
XPath
Refresh
Frames
Ajax
Externals
Timeout
Download
The status of the activity, which you can
choose to be one of the following:
Timeout
Default
Interactive
Loading
Completed
The number of milliseconds to wait for the
activity.
URL Match
If the activity type is a Page load, the page
URL must match this regular expression.
Wait XPath
If the activity type is XPath, this property
specifies the XPath to wait for.
Required
If True, an action fails if a
required activity never
occurs.
5.3.1
Agent
The Agent command is the first command that executes in an agent; all other
commands are sub-commands. So, only one Agent command can exist in an agent.
The Agent command loads the start URL, which is the first point of data extraction,
and also contains all common agent properties (including data export configuration).
© 2014 - 2015 Sequentum
146
Content Grabber
The Agent command uses a data provider that provides one or more start URLs, and
the command will execute once for each of these URLs.
This Agent command uses a simple data provider to
load a single static start URL
NOTE: The Agent command derives from the Navigate URL command,
which loads one or more URLs.
The configuration screen for the Agent command has four tabs: Common, Action,
Data, and Properties. We explain the Properties in the sections below. In the
Common tab, you can edit the command name and (optionally) customize the data
provider properties.
CSV Data: If you leave a check in the Use Default Input box, the command will
provide simple CSV data and use that data as input. Simple CSV data consists of
values that you enter directly into the command, so no external CSV is necessary.
You can uncheck the Use Default Input box and choose the Data Provider that will
provide the start URLs. The default data provider is a simple data provider that
provides a list of static URLs. You can populate the data provider directly by entering
the start URLs in the URLs input box.
© 2014 - 2015 Sequentum
Agent Commands
147
Use the Action tab to control how the web browser loads the start URLs. See Action
Configuration for more information.
Use the Data tab to set the data provider that provides the start URLs. Read more in
Using Data Input.
Command Properties
The following properties are specific to Agent commands:
Attribute
Description
Directory
The directory that contains all the agent files.
This is a read-only property and will be set
automatically when you save the agent.
User Agent
The user agent that the web browser sends to
target the web server when requesting web
pages.
Process Restart Condition
These properties govern the conditions under which an agent process would restart.
Click Process Restart Conditions in the Properties tab to reveal these properties:
Attribute
Description
Restart On
Restarts an agent process at a specific time interval that you
Time Interval
can edit.
Restart Time
The number of minutes between each restart - if an agent
Interval
process is restarting on time intervals.
Restart On
Restarts an agent process at a specific maximum memory
Memory Usage
usage limit. Use this option to clear JavaScript memory leaks
and ensure that the process does not run out of memory.
© 2014 - 2015 Sequentum
148
Content Grabber
Max Memory
If Restart on Memory Usage = True, the maximum
Usage
permissible memory usage (in MB) before an agent process
will restart.
Pause Before
The number of minutes that you want the agent to pause
Restarting
between stopping and restarting an agent process.
Data
Attribute
Description
Data
The data provider that provides start URLs for
Provider
the agent. See Using Data Input for more
information.
Database
Common database connections that are
Connection
available to the agent. See the Data section for
s
more information.
Export
The destination where data extractions will go.
Target
See Exporting Data and Distributing Data for
more information.
Input
A list of input values that you can specify when
Parameters
running an agent from the command line or from
the API. See Input Parameters for more
information.
Internal
The internal database that will store the data
Database
during extraction. For more information, see the
article, The Internal Database.
Proxy
Attribute
Description
Proxy
See IP Blocking & Proxy Servers
© 2014 - 2015 Sequentum
Agent Commands
Configuratio
for more information.
n
Scheduling
Attribute
Description
Email
See Error Logs and Notifications
Notification
for more information.
Scheduling
See Scheduling for more
information.
Scripting
Attribute
Description
Assembly
See Assembly References for more
References
information.
Initialization
See Agent Initialization Scripts for more
Script
information.
Self-Contained Agents
Attribute
Description
Author
See Building Self-Contained
Details
Agents for more information.
Sessions
Attribute
Description
Cleanup
See Sessions for more information.
External
Session
© 2014 - 2015 Sequentum
149
150
Content Grabber
Data
Support
Sessions
Remove
Internal
Session
Data
Immediately
User Interface
Attribute
Description
Max Tree
The maximum number of nodes that will
Expansions
automatically expand in the Tree View window.
Max Tree
The maximum number of chosen nodes in the
Selections
Tree View window. Setting this value too high
may affect the performance of the editor.
XPath Factory
Attribute
Description
Max Inner
The maximum number of inner nodes to examine
Prospects
when optimizing a selection XPath. Setting this
value too high may affect the performance of the
editor.
The following common command properties are also available to the
Agent command:
Navigate URL Command Properties
Data List Command Properties
© 2014 - 2015 Sequentum
Agent Commands
List Command Properties
Container Command Properties
Command Properties
5.3.2
Navigate Link
This command executes an action, such as a mouse click, on a web element.
You can apply the Navigate Link command to any kind of web element,
including links, buttons, and menus.
The figure below shows the Configure Agent Command panel after
choosing Navigate Link from the New Command drop-down:
The configuration screen for the Navigate Link command has three tabs.
Common, Action, and Properties. We explain the Properties in the sections
below. Use the Common tab to set the command name. Use the Action tab to
control how the command chooses the web element and loads new content
into the web browser. See Action Configuration for more information.
Properties
The properties in the tables below are available to Navigate Link commands.
© 2014 - 2015 Sequentum
151
152
Content Grabber
Link
Attribute
Description
Worker
Number of worker threads to use when this
Threads
command opens links in a new web browser.
Each worker thread will use a new separate
web browser.
Repeat Link Action
Attribute
Description
Repeat Link
Repeats the link action until the chosen web
Action
element no longer exists, or the action no longer
results in a page change.
Max.
The maximum number of times the link action
Repeats
will repeat. A value of zero means that there is
no maximum to the number of repetitions.
Navigate Link command:
Command Properties
5.3.3
Navigate URL
This command opens a new URL and loads a web page into the web browser.
You can specify a single URL, a list of URLs, or intake URLs from an input
data source such as a database or CSV file.
© 2014 - 2015 Sequentum
Agent Commands
choosing Navigate URL from the New Command drop-down:
The configuration screen for the Agent command has four tabs: Common,
Action, Data, and Properties. We explain the Properties in the sections
below. In the Common tab, you can edit the command name and, optionally,
customize the data provider properties.
You can uncheck the Use default input box and choose the Data provider
that will provide the start URLs. The default data provider is a simple data
provider that provides a list of static URLs. You can populate the data provider
directly by entering the start URLs in the URLs input box.
Use the Action tab to control how the web browser loads the start URLs.
See Action Configuration for more information. Use the Data tab to set the
data provider that provides the start URLs. Read more in Using Data Input.
The Navigate URL command can use the data provider configured on the
Data tab, or it can use a data provider configured by a parent command. The
Data tab is not available if the command uses a data provider from a parent
command. See Using Data Input for more information about data providers.
© 2014 - 2015 Sequentum
153
154
Content Grabber
Posting Data and Custom Header
The Navigate URL command supports URLs with custom parameters that
allow you to post data and add custom headers, which is especially useful if
you want to load a web page that is the result of a web-form submission.
Sometimes, you can request such a web page by putting the form field values
on the URL as normal query parameters; but other times the website requires
that you post the form field values.
URL Parameters
You can use the following two URL parameters to specify post data
and custom headers.
Note: The Post parameter must come before the Header
parameter. Each parameter is optional.
Parameter
Description
??Post=
Specifies data that should be posted to the
website.
??Headers=
Specifies additional headers that should be
sent to the website.
NOTE: This parameter must come after
the ??post parameter.
Example
The URL in this example would post par2=val2&par3=val3 to the
website, while par1=val1 is just a normal query parameter.
http://www.sequentum.com?par1=val1??
post=par2=val2&par3=val3??headers
© 2014 - 2015 Sequentum
Agent Commands
=Content-Type: application/x-www-form-urlencoded
NOTE: When posting value pairs to a website, you must also
add the header Content-Type: application/x-www-form-urlencoded.
Otherwise the website will not understand the format of the
posted data.
Navigate URL Command Properties
The following properties are available to Navigate URL commands:
Attribute
Description
Data
Specifies the data provider, input parameter, or
Consumer
agent data that provides the URLs for the
command.
Worker
Threads
command opens URLs in a new web browser.
web browser.
Navigate URL command:
Command Properties
5.3.4
Navigate Pagination
Typically, pagination is the feature that gives the user the ability to navigate
among multiple pages. The most common example of pagination is a website
© 2014 - 2015 Sequentum
155
156
Content Grabber
that displays a search results listing that spans multiple pages. This command
follows all pagination links on a web page, and will process a specific
container command after opening each pagination link.
choosing Navigate Pagination from the New Command drop-down:
Pagination takes several different forms, but in nearly all cases it will
fall under one of the following categories:
Next Page link/button
List/sequence of numeric links
List/sequence of text links.
We cover each of these in the sections below.
Single Next Page Link
We recommend that you use the Single Next Link pagination type on a
website that uses a single link to move to the next page. In addition to a
single Next Page link, a website may provide a sequence of pagination links.
In such cases, you should still choose this pagination type, which will ignore
any of the other pagination links. It's only necessary that Content Grabber can
move to the next page in the sequence. So, it's more efficient to identify the
© 2014 - 2015 Sequentum
Agent Commands
Next Page link.
Content Grabber will continue to follow the Next Page link until it no longer
appears on the next results page - or it becomes inactive. In very rare cases,
a website may continue to display an active Next Page link even when there
are no more pages available, and may link back to the first page in the
sequence or continue to load the last page.
You can check the box for Use Page Count and then configure the agent to
stop the pagination command when it reaches the maximum number of pages.
The Page Count property requires you to specify the Capture Commands that
delivers the maximum page number. Most websites that display a search
result will also display the total number of available pages in the search result,
so you can choose the Web Element Content command to capture the total
number of pages. Then, use this command to place a limit on the number of
pages that the pagination command will process.
If you want to stop pagination on a fixed page, you can add a Calculated
Value command that contains the fixed page number, and then use that
command as the command that delivers the maximum page number.
List of Numeric Links
This pagination type is the best choice for a website that only displays a
sequence of pagination links but lacks a Next Page link. Page set navigation
links are often found together with the numeric links. For example, if a you find
10 numeric links, the page set navigator links allows a user to navigate to the
next 10 numeric links or the previous 10 numeric links. This pagination type
requires one web selection for the numeric links and, if applicable, another
selection for the page set navigator link that moves to the next set of numeric
links.
When Content Grabber processes a list of numeric links, it will extract a
© 2014 - 2015 Sequentum
157
158
Content Grabber
specific attribute of the chosen web elements and then convert the content into
numeric values. If the agent cannot convert the content into a numeric value, it
ignores the corresponding link. Content Grabber expects a sequence of links,
so it will continue to the next page number in the sequence after it processes
the current page number. You can use a content transformation script to
convert the content into a number. For example, if the agent cannot convert
the text attribute of a pagination link directly into a number, it may look for the
page number in the URL attribute. In this case, a transformation script would
be necessary to transform the URL into a number.
List of Text Links
Use this pagination type on a website that features a list of pagination links
but none of the attributes of the link elements will convert into sequential
numbers. A common application of this pagination type is on a website that
displays items according to the first letter of the item name. In cases like this,
the pagination would consist of all the letters in the alphabet (A-Z), and when
a user clicks on a letter, all items starting with that letter will appear in the
listing.
Navigate Pagination Command Properties
The following properties are available to Navigate Pagination
commands:
Attribute
Description
Page
Number
command opens links in a new web browser.
Attribute
web browser.
Page
A script used to transform the page number
Number
content. The page number content is extracted
Transformati
from the web element attribute specified by
© 2014 - 2015 Sequentum
Agent Commands
on Script
PageNumberAttributeName.
Navigate
Navigates directly to a specific pagination page
Directly to
when resuming an agent. It's not always
Page
possible to navigate directly to a pagination
page. For example, this is not possible for
websites using AJAX pagination.
Pagination
Specifies the type of pagination.
Type
Use Max.
Specifies a command used to capture the
Page
maximum number of pages in pagination. The
Number
pagination command will stop following links
Command
when the number of pages reaches this value.
Max. Page
Uses a command to capture the maximum
Number
number of pages in pagination. The pagination
Command
command will stop following links when this
number of pages is reached.
Use Next
The selection property includes a selection for
Pagination
a link to the next pagination set.
Set
Pagination
The command that is executed for each
Command
pagination link.
Link
These properties are also available to the Navigate Pagination
command:
Command Properties
© 2014 - 2015 Sequentum
159
160
5.3.5
Content Grabber
Set Form Field
Use this command to set input fields in a web form. The command supports the
following types of web form fields:
Input fields
Select boxes
Text area fields
Check boxes
Radio buttons.
Note: Some web sites use advanced JavaScript to make web
elements emulate standard form fields, and the Set Form
Field command does not support those web elements.
Use a normal Navigate Link command to click on a submit button to submit the web
form. The submit button can be a standard web form submit button or any other web
element that uses JavaScript to submit the web form. A Navigate Link command is not
required if one of the form fields automatically submits the web form when an input
value changes.
default data provider for standard
text box
default data provider for
radio button
default data provider for a
select box
The configuration screen for the Set Form Field command has four tabs - Common,
© 2014 - 2015 Sequentum
Agent Commands
161
Action, Data and Properties. Use the Common tab to set the command name and
select the data provider that provides the input values for the form field. The default
data provider for a standard input form field or a radio button is a simple data provider
that provides the input values. The input values for a radio button or check box must
be either On or Off. The default data provider for a select box form field is a selection
data provider. The default selection data provider will attempt to select all appropriate
values in the drop-box (select-box), and the Set Form Field command uses those
values as input values.
Use the Action tab to control how the web browser loads the URLs. See Action
Configuration for more information.
Use the Data tab to configure a data provider that can provide the form field input
values. The Set Form Field command can use the data provider configured on the
Data tab, or it can use a data provider configured by a parent command. The Data
tab is not available if the command uses a data provider from a parent command.
See Using Data Input for more information about data providers.
Command Properties
The following properties are available to Set Form Field commands:
Attribute
Description
Data
Specifies the data provider, input parameter, or
Consumer
agent data that provides the input values for the
command.
Set Select
If the chosen form field is a select box, then
Box Text
each of the input values will be set as a text
attribute for each corresponding select options.
Set Value
Choose True if you always want the command
to set the form field value. If you leave it as
Default, the command will set the form field
© 2014 - 2015 Sequentum
162
Content Grabber
value if it is set to fire events - unless the form
field is a checkbox or a radio button. We
recommend that you leave this property set to
Default unless an event function - such as
sendtext() - is setting the form field value.
Only
Only executes the command action if the form
Execute
field value changes. That is if the current input
Action on
value is different from the current form field
Value
value.
Change
Handle Web
When interacting with a file
Browser
upload web browser dialog box,
Dialogs
the command will cause a
clicking of a dialog button having
the name that is in the value of
this property (&OPEN).
Download
Attempts to download a document when the
Document
form field value changes, and uses the
Command
Download Document command to capture and
export the document.
These properties are also available to the Set Form Field command:
Command Properties
© 2014 - 2015 Sequentum
Agent Commands
5.3.6
Action Configuration
By default, all action commands are set to Discover action settings (the
Discover action check box on the Action tab is checked). The action settings
will automatically be configured when you execute the command the first
time. The default action settings are usually quite sufficient, so you will rarely
need to worry about the Action configuration tab at all. However, you can
fine-tune the action settings to achieve better performance, and you may
need to adjust the action settings to get the command working correctly.
After choosing a New Command type from the drop-down in the Configure
Agent Command panel, click the Action tab.
By default, all action commands are set to automatically discover action
settings when you execute the command for the first time. The settings are
usually suitable for most scenarios, so you will rarely need to worry about the
Action configuration tab at all. After some experience, you may want to finetune these settings to achieve better performance, and sometimes it may be
necessary to adjust the action settings to get the command to function
according to your precise requirements.
There are several configuration tabs that are available for Action commands in
the Configure Agent Command panel, including:
Action Types - Specifies the type of action, which can be a Fire
© 2014 - 2015 Sequentum
163
164
Content Grabber
Event, URL, or No Action. Not all action types are available for all
action commands.
Browser Target - Specifies the web browser in which the new
content should load.
Action Events - For action set to Fire Events only, this is a list of
events that will fire on the chosen web element.
Browser Activities - Specifies the browser activities for which the
command will wait before the action is complete.
5.3.6.1
Action Types
Action commands can execute one of two types, or no action at all:
Fire Events - The command fires events, such as a mouse click, on the selected
web element.
Load URL - The command loads a new page into the web browser using a direct
URL.
No Action - The command does not execute an action. This is only relevant for
form fields which may execute an action when an input values is assigned, but
often execute no action at all.
This action command will fire events
© 2014 - 2015 Sequentum
Agent Commands
165
Use Conservative Timeouts
An action command uses default timeout values to determine how
long it should wait for activities, such as how long it should wait for a
new page to start loading. Default timeout values are only used
when an action command is discovering activities or when it is
scrolling a web page to discover new content.
The default timeout values will work in most situations, but some
websites may be very slow and require longer timeouts. You can
increase the timeouts manually, but it can sometimes be hard to
determine exactly which timeouts to increase. The option Use
Conservative Timeouts causes an action command to wait twice
as long as specified by the default timeout values. This can be a
quick way to test if issues with an action are caused by timeout
values being too short.
Scroll to End of Page
Some web pages load additional content when you scroll the page either downward or to the right. To extract all content from such
pages, you need to include an action that scrolls down to the end of
the page, so all content is available to the agent.
When you set the option Scroll to end of page, you will be able to
limit the number of times the command scrolls to the end of a page
to load new content. This can be important, since some pages will
continue loading new content for a single page until Content Grabber
finally runs out of memory.
5.3.6.2
Browser Target
An action command often loads new content. With the Browser action type, you can
configure how content loads into a new web browser, the current web browser, or the
© 2014 - 2015 Sequentum
166
Content Grabber
parent web browser. You can also specify a different browser mode, as we explain
below.
This Navigate Link command will open a web page in a new web browser
Uncheck the Discover Action box and then click the Browser tab to view these for
the Target Browser:
New. The action command loads content into a new web browser, and all subcommands will operate in the new browser. To use this option, the action
command must load a completely new page. Asynchronous actions, such as AJAX
calls, cannot load content into a new web browser.
Current. The action command loads content into the current web browser, and all
sub-commands will operate in the new browser. Asynchronous actions, such as
AJAX calls, can only load content into the current web browser.
Parent. Some older websites may require child browsers to load content into the
parent browser in order to function correctly, so this option should only be chosen
in such cases. If an action command loads content into a new web browser, that
new web browser becomes a child browser of the current web browser, and
actions in the child browser can direct content into the parent browser. To use this
option, the action command must load a completely new page.
© 2014 - 2015 Sequentum
Agent Commands
167
Browser Mode
If you leave the default for the Target Browser (New), you can choose the
browser mode:
Default - The new web browser will be exactly the same type as the parent
browser.
Full Dynamic Browser. The browser functions as a standard web browser, and it
will download images, and execute JavaScript and all objects such as Flash.
Optimized Dynamic Browser. The browser executes JavaScript, but will not
download images or execute objects such as Flash. Although the browser still
permits extraction of images and Flash, these will not display in the browser.
Static Browser - The web browser does not execute JavaScript at all.
Consequently, the browser is faster and more reliable than the other browsers, but
does not work on websites that rely on JavaScript to function correctly.
HTML Parser - the command does not start a new browser. Instead, the web
page simply downloads and runs through a HTML parser. Although the HTML
parser does not execute JavaScript and does not load frames, it is faster and
more reliable than the web browsers. However, the parser does not work on
websites that rely on JavaScript, and the parser may also be unable to submit
some web forms (even when they don't rely on JavaScript).
Note: You must reopen browser tab for any change in this
setting to take effect.
5.3.6.3
Action Events
If a command action type is set to fire events, then you can specify the events that
should be fired on the selected web element.
© 2014 - 2015 Sequentum
168
Content Grabber
The Events tab
In most cases, you can check the Use default events check box, which will fire all
appropriate events that the web element supports. In special cases, you may want to
remove some of the default events. For example, an action may try to open a dropdown box for an input form field.
Firing the focus or click event on the input form field may cause the drop-down box to
open, but the blur event may cause the drop-down box to close. In that case, firing all
the default events would open the drop-down box but quickly close it again. To prevent
this, uncheck the Use default events box and remove the blur event from the list.
Content Grabber uses standard Internet Explorer functionality to fire events, but it can
also use the JQuery library - if JQuery is available on the web page. Set the option
Use JQuery if available to fire events using JQuery. This option is designed for
websites that are built for Internet Explorer version 8 or earlier, and has no effect on
most newer websites.
Supported Events and Functions
Content Grabber supports all the default events for the chosen web element and
some custom events and functions. The following list includes only the most common
events, and is not a complete list of all available events. Please see a JavaScript
© 2014 - 2015 Sequentum
Agent Commands
169
reference guide for all available events.
Event Name
Description
mousedown
Emulates the press of a mouse button onto the
chosen web element without releasing the
button.
mouseup
Following a mousedown event, emulates
releasing a mouse button onto the chosen web
element.
click
In immediate succession, emulates a press
and a release of a mouse button on the chosen
web element .
keydown
Emulates pressing a key in relation to the
chosen web element without releasing it.
keyup
Emulates releasing a key in relation to the
chosen web element .
keypress
In immediate succession, emulates a press
and a release of a key in relation to the chosen
web element.
focus
Emulates bringing input focus to the chosen
web element.
blur
Emulates removing input focus to the chosen
web element.
change
Emulates changing the input value of the
chosen web element.
The following list includes custom functions that can be used along with the standard
events:
Function
© 2014 - 2015 Sequentum
Description
170
Content Grabber
exec(JavaS
Executes a JavaScript on the selected web
cript)
element.
Example 1:
exec($(element).unbind('blur'))
This example removes all blur events
from the chosen web element. This
example requires JQuery to be
available on the web page, but the
exec function works on non-JQuery
JavaScript as well.
Example 2:
exec(element.click())
This example fires the click event on
the selected web element.
Example 3:
exec(window.history.back())
This example moved the current page
back to the previous page.
The variable element is always defined as
the selected web element.
unbind(Eve
Removes all events of type Event from the
nt)
selected web element. This function requires
JQuery to be available on the web page.
Example:
unbind(blur)
© 2014 - 2015 Sequentum
Agent Commands
This example is equivalent to calling
exec($(element).unbind('blur')) or
exec($(element).off('blur'))
click()
Calls the click function on the chosen web
element. In most cases, this function is
equivalent to using the event click in most
cases. The function is specific to Internet
Explorer and maybe required on some very
old websites, but in most cases you should
use the click event instead.
delay(millise
Pauses execution of the command for a
conds)
specific number of milliseconds.
Example:
delay(2000)
This example inserts a delay of 2
seconds.
Important note: The activity timeouts
include any event delay. So, if you have a
single activity that waits 500 milliseconds,
and all events take longer than 500
milliseconds to fire, then the action will time
out before all the events have had time to
fire.
setinputtext
The function inserts text into the chosen form
()
field, and is only compatible with Form Field
commands. If the form field is a select box, the
setinputtext
function selects the option with the text
(text)
attribute equal to the specified text.
If this function call contains no text, the function
© 2014 - 2015 Sequentum
171
172
Content Grabber
inserts the input data for the Form Field
command.
Example: sendinputtext("hotels")
Typically, a Form Field command sets the
value of the chosen form field and then fires
the specifies events. This function gives you
the ability to fire events before setting the
value of a form field.
NOTE: Often, this function is used when the
corresponding Form Field command property
Set Value is set to false.
sendtext()
This function activates the web browser, sets
the focus to the chosen web element, and
sendtext(te
sends text to that element. This will trigger the
xt)
same events as standard user input and may
be necessary for some very complex
websites-sites on which it can be very difficult
to determine the exact order of events.
If the function call contains no text, the function
inserts the input data for the Form Field
command.
Example: sendtext("hotels")
This will send the text "hotels" to the
chosen web element.
Some complex websites may open a drop-
© 2014 - 2015 Sequentum
Agent Commands
down when a user enters text into an input
box, and the drop-down box may close
automatically when the focus moves away
from the input box. When you are working in
the Content Grabber designer, the input focus
automatically moves off the web browser after
an action executes, so you may need to
remove the blur events from the chosen web
element.
Example:
unbind(blur)
sendtext("hotels")
NOTE: the unbind function requires JQuery to
be available on the web page.
sendkey(ke
This function activates the web browser, sets
ycode)
the focus to the chosen web element, and
sends a key code to that element. This will
trigger the same events as standard user input
and maybe required for some very complex
websites where it can be very difficult to
determine the exact order of events.
Example:
sendkey(down)
sendkey(return)
This will send two instructions to the chosen
web element: one instruction that is equivalent
to a press of the down arrow key and another
instruction that is equivalent to a press of the
Return key.
© 2014 - 2015 Sequentum
173
174
Content Grabber
Some complex websites may open a dropdown when a user enters text into an input
box, and the drop-down box may close
automatically when the focus moves away
from the input box. When you are working in
the Content Grabber designer, the input focus
automatically moves off the web browser after
an action executes, so you may need to
remove the blur events from the chosen web
element.
Example:
unbind(blur)
sendkey(down)
NOTE: the unbind function requires JQuery to
be available on the web page.
scroll()
This function scrolls downward on the page
containing the chosen web element, to ensure
scroll(perce
adequate coverage for pages that load content
ntage)
dynamically.
Often, the function call is used along with the
option Repeat Link Action, which will repeat a
downward scroll to the bottom-until the scroll
no longer loads new content.
Example 1:
scroll()
This example will scroll all the way to the
bottom of the content. Optionally, you can
specify an the amount to scroll in terms of
percentage.
© 2014 - 2015 Sequentum
Agent Commands
Example 2:
scroll(50)
This example will scroll half way down through
the content.
windowscro
This function scrolls down to the bottom of the
ll()
page containing the chosen web element, to
ensure adequate coverage for pages that load
windowscro
content dynamically.
ll
(percentage
If the web browser contains multiple frames,
)
you should choose a web element within the
frame to which you want to scroll. The chosen
web element has no other influence on this
function.
Often, the function call is used along with the
option Repeat Link Action, which will repeat a
downward scroll to the bottom - until the scroll
no longer loads new content.
The Scroll to End of Page action option
combines the windowscroll event with the
action option Repeat Link Action, so this
option can be used as an alternative to the
windowscroll function.
Example 1:
windowscroll()
This example will scroll all the way to the
bottom of the web page. Optionally, you can
specify an the amount to scroll in terms of
© 2014 - 2015 Sequentum
175
176
Content Grabber
percentage.
Example 2:
windowscroll(50)
This example will scroll half way down the web
page.
Back()
Move back to the previous page by calling the
JavaScript function window.history.back()
If the action command is a Form Field, then the input value is available for all event
functions by using the variable [input]. For example, you can call the function sendtext
to set the input value of a form field:
sendtext([input])
Or, the input value could be set using JQuery:
exec($(element).val('[input]'))
If you set a form field value using any of these event functions, you may want to
uncheck the form field box Set Value, so that the form field value cannot be set
automatically, but rather only by an event function.
Events and Default Actions
When you interact with web elements in a web browser, such as
clicking on a link or setting focus on a form field, the associated
event handlers will be executed, but the browser will also execute a
default action on the selected web element. For example, if you click
on a link, the browser will navigate to a new page, and if you click on
a form field, the input focus will be set on the form field.
When Content Grabber fires an event on a web element, the
© 2014 - 2015 Sequentum
Agent Commands
177
associated event handlers will be executed, but the default action will
only be executed for click events. This is sometimes important,
because some websites may expect the default action to be
executed. For example, if you fire the focus event on a web element,
the associated focus event handlers will be executed, but the focus
will not be moved to the selected web element. Normally you don't
want the focus to change when processing a website, since it may
interrupt other work you are doing on the computer, but some
websites may not work correctly if the focus does not change. In this
case you can execute a function on the selected web element
instead of firing an event. You can execute any supported function
on a web element by simply specifying a function call like focus()
instead of just the event name focus.
5.3.6.4
Browser Activities
Typically, you have no concern about the sequence of complex activities during the
loading of a web page, since you simply wait for the content that you want to see. The
most critical content on a web page will likely load far in advance of the time that you
actually get around to viewing a specific part of the web page. Usually, all features
function correctly as you fill in web forms or click links.
However, it's very different with web-scraping agents, since these agents are very
fast. An agent will attempt to process a web page as quickly as possible and continue
onto the next page. A web-scraping agent is so fast that it could easily start
processing a page before all of the essential content loads. So, it's important that you
configure an action command to wait for all important browser activities to complete
and all the content loads before web page processing begins.
When an action command executes, it waits for certain activities to complete in the
web browser. For example, if a command executes a click on a link, it may wait for a
page to load or an AJAX call to complete. Some actions may result in a very complex
© 2014 - 2015 Sequentum
178
Content Grabber
set of activities. An action may load a new page that then uses AJAX to load
additional dynamic content onto the page.
Discovering Activities
By default, action commands are set to automatically discover activities. After a
command fires the action events, it will monitor all activity in the web browser and wait
for the critical activity to complete. Then the command will wait for less than a second
and, if no new activity starts, it will consider the action to be complete.
In Content Grabber, you can leave the default for automatic discovery, or uncheck the
Discover Activities box and then use the buttons to manually specify which activities
the command will discover:
Manually configuring for discovery of activities
Automatic activity discovery is a bit slower than manual configuration. That's because
automatic detection pauses and waits for any additional activity. So, automatic
discovery may take a long time when running on a poorly built website, since there
may be long delays between activities. For example, after the initial page load, it may
take several seconds for dynamic content to load into a new web page. With
© 2014 - 2015 Sequentum
Agent Commands
179
automatic discovery, the action command will only wait for a short moment, and may
miss some activities if there is a significant delay.
Alternatively, you can choose to manually configure the activities and set the wait
timeout values. Or, you can change the default wait timeout values for the action
command.
Activities
Each activity has a corresponding timeout: this is the number of milliseconds the action
command will wait for the activity to start. If an activity is required, the action will fail if
the activity does not start within this time period. If an activity is not required, the
command will wait for the next activity.
To manually configure activities, do the following:
1. Locate the Activities tab within the Action tab of the Configure Agent
Command panel.
2. Uncheck the Discover Activities box to enable manually configuration of activity
discovery.
3. Click on the buttons at the bottom of the panel to add one or more activities.
4. Edit the segments for each activity by (a) deleting segments that you don't want
and (b) changing the numeric value for the timeout. Optionally, you can make
some segments required.
Page Load
The Page activity waits for a full page load to complete. Click the Page button to load
the default wait times for each of the activity segments:
© 2014 - 2015 Sequentum
180
Content Grabber
Activity
Delay Description
Segment
Required
Default Value
by Default (milli
seconds)
Loading
Wait for the web page to start loading.
Y
1,000
Interactive
Additional wait time for cases in which
Y
30,000
N
5,000
the web page may not be done
loading, but Content Grabber can
operate on the page. Sometimes a
web page may get stuck in this mode,
because an image won't load
correctly. In this case Content Grabber
can still continue and extract data from
the web page.
Completed
Wait for the page to complete loading.
We recommend that you do not
configure this activity as a required
activity, since in most cases it's best to
allow Content Grabber to proceed in
case a web page becomes stuck in
Interactive mode.
AJAX
This activity waits for one or more AJAX calls to complete. Note that several AJAX
calls that load simultaneously are part of the same activity. Click the AJAX button to
load the default wait times for each of the activity segments:
© 2014 - 2015 Sequentum
Agent Commands
Activity
Delay Description
Segment
Required
181
Default Value
by Default (milli
seconds)
Loading
Wait for one or more AJAX calls to
N
1,000
N
5,000
load.
Interactive
Wait for all AJAX calls to complete.
Any new AJAX call that starts after
this segment is a new activity.
Timeout
This activity simply pauses any action for a specific number of milliseconds, and is
usually put in place to wait for JavaScript to dynamically render content on a web
page. For example, a JavaScript may use AJAX to retrieve content from a web
server, and then consume additional time processing the content before displaying it
on the page. The action command can wait for the AJAX call to complete, but it
cannot wait for any processing afterward. Normally, JavaScript processes so quickly
that no waiting is necessary, but the command may sometimes need to wait a short
period of time.
Page Refresh
This activity waits for a web page refresh. Click the Refresh button to load the default
wait times for each of the activity segments:
© 2014 - 2015 Sequentum
182
Content Grabber
Activity
Delay Description
Segment
Required
Default Value
by Default (milli
seconds)
Loading
Wait for the web page to start loading.
Y
1,000
Interactive
Additional wait time for cases in which
Y
30,000
N
5,000
the web page may not be done
loading, but Content Grabber can
operate on the page. Sometimes a
web page may get stuck in this mode,
because an image won't load
correctly. In this case Content Grabber
can still continue and extract data from
the web page
Completed
Wait for the page to finish loading. We
recommend that you do not configure
this activity as a required activity, since
in most cases it's best to allow Content
Grabber to proceed in case a web
page becomes stuck in Interactive
mode.
Externals
This activity waits for one or more external pages to load in frames or iframes.
Several pages loading simultaneously are considered the same activity.
NOTE: This wait activity is for external frames - frames that
originate from a different domain than the main page. Use the
Frames activity to wait for frame pages from the same domain
as that of the main page.
© 2014 - 2015 Sequentum
Agent Commands
183
Click the Externals button to load the default wait times for each of the activity
segments:
Activity
Delay Description
Segment
Required
Default Value
by Default (milli
seconds)
Loading
Wait for one or more pages to start
Y
1,000
N
5,000
loading.
Interactive
Wait for all pages to finish loading.
New frame pages may start loading
after this, but that is part of a new
activity.
File Download
This activity waits for a file to download. You can specify a mime type to identify the
file to download in case an action triggers multiple page loads. In most cases, the
"application" mime type is the correct choice. Click the Download button to load the
default wait time into the activity sequence.
Web Element Selection
This activity waits for a specific web element to load onto a page. You specify the
web element with an XPath. Click the Selection button to load the default wait time
into the activity sequence.
NOTE: Content Grabber will parse the web page to perform
this action, and may decrease performance significantly. If
possible, avoid using this activity.
Browser Activity Screen
© 2014 - 2015 Sequentum
184
Content Grabber
This feature shows all browser activities that occur after the current action executes.
You can use this information to determine potential issues with the configuration of the
action. Use the Activity button on the Content Grabber status bar to open the
Browser Activity screen, as shown in the figure below:
The Activity button on the status bar
Critical activities have dark coloring and other activities have light coloring. A blue row
appears in the sequence at the point where the command recognizes completion of
the action. Activities that occur after the action completes may not necessarily indicate
a problem. If the agent does not work as you expect, then you may need to
reconfigure your action in such a way that it waits for some or all of those activities.
Activity is seen after the action completes, which may not be a problem
Taking a closer look, you can examine the Timing values on the Browser Activity
pop-up window, and then work out appropriate timeout values for the post-completion
activities. For example, if dynamic changes are causing problems, then you can look
at the timing of the last dynamic change. We would recommend subtracting the time
© 2014 - 2015 Sequentum
Agent Commands
185
for Action Completed from the time of the last Dynamic Page Change action. The
result is the minimum number of additional milliseconds that the command needs to
wait. This would ensure that all dynamic page changes complete before the entire
action completes. Simply insert a Timeout activity to insert a delay on the Activities
tab configuration screen, as we explain in the section above.
Post-completion events
As we wrap up this article, let's also notice that the sequence shown in the figure
above indicates two potential problems. We see both a Page Loading and many
Dynamic Page Changes after the action completes. In this case, we suggest two
remedies. First, increase the amount of time the command waits for the page load to
start. Second, insert a Timeout activity to allow JavaScript to accommodate the
dynamic changes.
5.4
List Commands
A List command is a special type of container command that iterates through a set of
elements and executes each of its sub-commands once for each element. The
element list can be taken from an input source, such as a CSV file, or it can come
from a web selection that selects multiple web elements on the current web page.
The following commands are list commands:
Agent
Data List
Set Form Field
Web Element List
The Web Element List command uses a web selection to generate its element list.
The Agent, Set Form Field and Data List commands use an input data provider to
generate their element lists. A command is not considered a list command if the input
data provider is set to None. For example, a Set Form Field command that uses
input data from a different command would have its own data provider set to None,
and therefore is not a List command in that case. You can learn more by clicking the
© 2014 - 2015 Sequentum
186
Content Grabber
links to any of the subsections above.
These properties are common to all list commands:
Property
Description
List Start
The index of the first item to use in a list when
Index
debugging.
List Count
The number of list items to use when debugging.
Setting a value of zero will cause the command
to use all list items while debugging.
5.4.1
Data List
This command provides input data to sub-commands. The command (and all
sub-command) will execute once for each input data value. The figure below
shows the Command Properties panel after choosing Data List from the New
Command drop-down:
© 2014 - 2015 Sequentum
Agent Commands
These are properties specific to data list commands:
Property
Description
Data
The data source or CSV file that will provide
Provider
the input data to the command and all subcommands. See Using Data Input for more
information about data providers.
Data
Specifies the action to take when data is
Missing
missing. Possible actions are:
Action
Default - The default behavior is to ignore the
command if the data provider provides no
data, and it will not process any subcommands.
© 2014 - 2015 Sequentum
187
188
Content Grabber
Optional Data - The agent will continue to
process sub-commands if the data provider
provides no data.
Ignore Command when Data is Missing The agent will ignore the command if the data
provider provides no data, and it will not
process any sub-commands.
5.4.2
Web Element List
This command selects a list of elements on a web page. The parent command and all
of its sub-commands will execute once for each web element in the list, and the web
selection of each sub-command will be relative to the current element in the list.
NOTE: All sub-commands will have the same parent selection.
The figure below shows the Configure Agent Command panel after choosing Web
Element List from the New command drop-down:
As we write above, the web selection of each sub-command will be relative to the
current element in the list, but some commands can break the web-element list. If a
command breaks a parent web-element list, then the sub-commands will no longer be
relative to the web-element list. It's important to realize that any command that loads a
new web page will always break a web-element list - since the elements in the webelement list don't exist on the new web page.
© 2014 - 2015 Sequentum
Agent Commands
189
A command that loads a partial web page, or makes changes to a web page, may
break a web-element list if the Break HTML Area property has been set to True.
This property is specific to data list commands:
Property
Description
Editor Index
Specifies the index of the list element to use in
the editor. This property is only relevant at
design time, and has no use at run time or
when debugging.
These are properties specific to web-element list commands:
Command Properties
5.5
Group Commands
Group commands are simple container commands that you can use to place
commands together - which can be a much easier arrangement for navigating and
maintaining an agent. You may, for example, want to group together several
commands that have various bits of contact data into a Group command and give it
the name "Content Information".
There are other uses for Group commands. Consider this restriction: you can link only
one other container command to a container command, such as Navigate Pagination.
So, you would need to use a Group command if you want to link to three container
commands together.
© 2014 - 2015 Sequentum
190
Content Grabber
These are the Group commands:
Group
Group in Page Area
5.5.1
Group
Use this command to group commands together. Although the Group command
performs no action, it can be useful in the following cases:
To group commands that logically belong together. This makes it easier to
navigate and maintain an agent. For example, four capture commands used to
extract street address, city, zip code and state could be grouped into a Group
command having the name "Address".
When adding commands to the template library for future reuse.
When you want to link a container command to more than one other command.
The Navigate Pagination command is an example of a container command that
often needs to link to more than one other command.
Group Command Properties
These properties apply to all Group commands:
Command Properties
© 2014 - 2015 Sequentum
Agent Commands
5.5.2
191
Group in Page Area
Use a Group Commands in Page Area command to group commands together that
have the same web selection. The command is similar to the Group command, but it
has a corresponding web selection, and any web selection of the sub-commands will
be relative to that selection.
Typically, you would choose this command when you need to split a single web
element into separate fields. For example, a single web element may contain the
entire address of a company, and several capture commands with the same web
selection will be necessary to extract and split the address into separate fields such
as street address, city, zip code, and state. If you group all the capture commands
into a single Group Commands in Page Area command, then you only need to select
the web element once. Also, if the page location of the address changes, you only
need to change the web selection of the Group Commands in Page Area command.
You can also add this command to your Template Library, so that you can reuse the
command in another agent and only need to change the web selection.
Command Properties
These properties apply to all Group Commands in Page Area
commands:
Command Properties
© 2014 - 2015 Sequentum
192
5.6
Content Grabber
Other Commands
Click on the links below to learn more about the other commands that are available in
Content Grabber:
Execute Script
Transform Page
Refresh Document
Close Browser
5.6.1
Execute Script
This command executes a custom .NET script, which is useful in these
scenarios:
Control the command execution flow
Check for duplicates
Manual interaction with the agent web browser.
Each script command has a corresponding web element selection, and this element
has a corresponding entry in the custom script. At run time, if the selection does not
exist on the web page, the web element will be NULL in the script. You can check for
NULL elements in the script and execute custom functionality if a NULL is found.
The figure below shows the Configure Agent Command panel after choosing
Execute Script from the New Command drop-down:
© 2014 - 2015 Sequentum
Agent Commands
193
Predefined Scripts
Click the Script type drop-down to choose a script containing typical functionality.
You can use a script just as it is given in the template, or click the Edit Custom Script
button to launch a pop-up window to edit the script according to your requirements.
Duplicate Data
Choose Duplicate Data in the Script type drop-down to select a script that will
examine the data extraction up to a specific point, and check for a matching data row.
If a duplicate is found, the agent exits a specified container command.
If you want to check for duplicates in the data extraction from a previous agent run,
you must ensure that the agent does not replace data every time it runs. This is done
on the Change Internal Database window - which you can open from the Data menu.
© 2014 - 2015 Sequentum
194
Content Grabber
Exit Command
This script exits a container command when a specific condition is
met.
Retry Command
This script retries a container command when a specific condition is met. You can
control the number of times that the agent will retry a container command by editing
the default script and using the RetryCount value in the script parameters.
© 2014 - 2015 Sequentum
Agent Commands
195
Retry Agent and Continue
This script restarts the entire agent and continues where it left off when a
specified condition is met.
Typically, you would use this script to completely reset a web browser after an error
occurs. The agent continues exactly from the point before the reset, so that if an error
keeps occurring at this point, the agent will keep restarting in an infinite loop. To avoid
this problem, you can set the status Failed on the current data row, so the agent will
move to the next data item when it restarts. This code snippet example sets the data
row status to Failed.
args.DataRow.SetStatus(RowStatus.Failed)
Pause Agent
© 2014 - 2015 Sequentum
196
Content Grabber
This script pauses the agent when a specified condition is met. The agent will display
the web browser and allow a user to manually interact with the web browser. The
user can press the Continue button to continue processing.
Script Return Value
An Execute Script command can control the command execution flow by its return
value. The command returns an instance of a class CustomScriptReturn. This class
has the following 4 public static methods:
Method
Description
static
The agent will retry the specific
CustomScriptReturn
container command. The container
RetryContainer(IContain
command must be a parent of the
er container)
current command.
static
The agent will exit the specific
CustomScriptReturn
container command. The container
ExitContainer(IContai
command must be a parent of the
ner container)
current command.
static
The agent pauses and displays an
CustomScriptReturn
agent web browser, which allows
Pause()
a user to interact with the web
browser before continuing
© 2014 - 2015 Sequentum
Agent Commands
197
processing.
static
The agent continues its normal
CustomScriptReturn
executing flow.
Empty()
5.6.2
Transform Page
This command transforms part of a web page to facilitate the capture of data. You
can choose from the following types of transformations:
Transform HTML
Insert HTML
Replace HTML
Denormalize HTML Table
Use the Transform Page command to transform the structure of a web page into a
form that will facilitate easier data extraction. The command uses a transformation
script to transform existing HTML, or it can insert or replace HTML on a web page.
The corresponding web selection for this command is the HTML of the chosen web
element that will be transformed.
© 2014 - 2015 Sequentum
198
Content Grabber
The default transformation for this command is Transform HTML, which denormalizes
an HTML table. When using the Denormalize HTML Table transformation, the web
selection must be an HTML table. The transformation will repeat some table cells and
rows to make sure that all rowspan and colspan attributes in the table are set to
one.
The figure below depicts an HTML table having some cells that span multiple rows.
You can use the Denormalize HTML Table transformation to ensure that no cells in
the table spans more than one row, which makes it much easier to extract data from
the table.
© 2014 - 2015 Sequentum
Agent Commands
199
Choose the Denormalize HTML Table transformation on this table to ensure that no cells span
more than one row
5.6.3
Refresh Document
When Content Grabber loads a new web page into the web browser, it parses the
dynamic web page and generates a equivalent static version of the page. Agent
commands will work on the static version of the web page, since it's much faster and
more reliable. All action commands will cause a refresh of the static version of the
page, since action commands are likely to either load a new web page or modify the
existing page.
The Refresh Document command is used to force a refresh of the static version of
the web page, so that it corresponds to the dynamic version loaded in the web
browser. Usually, Content Grabber will be able to refresh the document when
necessary, so you'll only need to use this command in special cases. For example,
consider the case where a child web page refreshes the parent web page. After
processing the child web page, you may need to perform a Refresh Document
command before continuing to process the parent web page.
© 2014 - 2015 Sequentum
200
5.6.4
Content Grabber
Close Browser
Content Grabber often opens new web pages in a new browser window, and will
always open a web page in a new window if the web page design specifies the
opening of a child window. When Content Grabber finishes processing a page in a
child window, it will leave the browser window open and reuse the window for the next
similar child page. Some websites will close a child window after use, and will only
work correctly if the child window does indeed close.
Use the Close Browser command to force a browser window to close, so that the
website will create a new browser window.
5.7
Web Forms
Content Grabber can submit web forms repeatedly - for any combination of input
© 2014 - 2015 Sequentum
Agent Commands
201
values. You can specify input values as a static list of values, or you can feed the
values from an input data source - such as a database or a CSV file.
Remember, you can use Set Form Field commands to set form fields, and Navigate
Link commands to submit web forms. Some web forms don't have a separate Submit
button, but instead submit the form when a form field value changes. We recommend
that you use a Navigate Link command in such cases. Set Form Field commands
perform actions, such as loading a new web page.
The figure below depicts an example configuration for an agent that submits a web
form:
Typically, Set Form Field commands that submit a web form are nested. So, if the
first command is iterating through a list of input values, then all succeeding Set Form
Field commands will execute once for each input value. The last command in a web
form configuration is typically a Navigate Link command, and since it's nested along
with all the preceding Set Form Field commands, it executes once for each
combination of input values.
File Download
You can use a web form to download a file, such as an Excel or PDF file. The file
download may start when you click the form Submit button, or it may start when you
change a form field value.
If a file download starts when you click a Submit button, you should use a Download
© 2014 - 2015 Sequentum
202
Content Grabber
Document command instead of a Navigate Link command. The Download Document
command should select the Submit button and have the File Download Action
property set to Click to Download.
If a file download starts when you change a form field value, you should use a
Download Document command instead of a Navigate Link command. Also, the Form
Field command that triggers the document download most have the option Download
Document command set to the Download Document. The Download Document
command must be a direct sub-command of the Form Field command that triggers the
file download.
Uploading File
Some web forms contain file upload boxes that allow users to upload files to a
website. Scripting cannot modify file upload boxes, so it's necessary to process these
differently than other web form input boxes.
Content Grabber will automatically detect file upload form fields and will try to handle
the file upload dialog box that appears when clicking on the file upload form field. For
the Set Form Field command that handles the file upload form field, the input value
should be the full local file path to the file you want to upload.
When the file upload dialog box appears, Content Grabber will automatically enter the
file path and click on the button that selects the file. This happens so fast that the
dialog box won't appear. Content Grabber needs to click on the button that selects the
file, and it does that by looking for a button having an "Open" label. This is the correct
button for English language versions of Windows, but will not be the correct button for
other language versions of Windows. In such cases, you may need to change the Set
Form Field command property Handle Web Browser Dialog. The default value is
&OPEN, which means that the agent will be looking for the text "Open" on the dialog
button. The text &O is shown as O and means the button can be activated by pressed
the keyboard combination ALT + O.
© 2014 - 2015 Sequentum
Agent Commands
203
Website Login Forms
Content Grabber can process website login forms the same way it handles any other
web forms. An agent that processes a login web form would normally have a Set
Form Field command for the Username, and another Set Form Field command for
the Password, and a Navigate Link command that clicks on the login Submit button.
Basic Windows Authentication
Some websites use basic Windows authentication, and they will display a Windows
login box. Content Grabber gives you the ability to set the Username and Password
for basic Windows authentication by editing the Agent Command and then setting the
Username and Password in the Basic Windows Authentication of the properties
tab.
After setting the basic Windows authentication properties, you must reload your agent
for the properties to take effect. Basic Windows authentication does not work for
websites that have proxies, and also does not work with browser mode set to HTML
Parser.
NOTE: Internet Explorer no longer allows you to embed
authentication directly in the URL like this: http://
username:password@www.domain.com/index.html
© 2014 - 2015 Sequentum
204
5.8
Content Grabber
Command Library
Content Grabber contains an agent template library and a command template library
for easy reuse of agents and commands. Some templates come along with the
installation of Content Grabber software, but you can also add your own templates.
Exporting and Importing User Libraries
You can export and import any user templates from one computer to another.
Agent Templates
When creating a new agent, you can derive the new agent from one of your own
agent templates or one of the standard templates.
© 2014 - 2015 Sequentum
Agent Commands
205
You can add any of your agents to the template library by right-clicking on the agent in
the Agent Explorer window and then selecting Add to Agent Template Library from
the context menu.
After adding a new agent template, the template will become available in My
Templates when creating a new agent.
Command Templates
You can add all container commands to the template library. Any sub-commands for
a container command will go into that library, so that when you later insert the
command from the library into a new agent, the insert task will include the entire
command structure.
Add container commands to the library by right-clicking on a command in the Agent
Explorer window and selecting Add to Template Library from the context menu.
© 2014 - 2015 Sequentum
206
Content Grabber
Command templates are categorized to make it easier to find specific templates.
When you add a new template, you can place it in an existing category or enter a new
category name.
You can insert a command template into an agent by doing the following:
1. In the Agent Explorer window, right-click the container command that will be the
parent command.
2. Choose Insert Template from Library from the context menu.
© 2014 - 2015 Sequentum
Error Handling
6
207
Error Handling
Web scraping is frequently and notoriously unreliable, because you must contend with
many external factors over which you have no control. A target website may fail
because of defects in the web application, or there may be problems with the Internet
connectivity anywhere between you and the target website. These problems may
seem negligible when you are browsing a few websites with a normal web browser,
but a web-scraping agent is capable of navigating more web pages in a few hours
than a human can view in an entire year. So, small glitches can become significant
problems that inhibit reliable data extraction. You can minimize these factors with error
handling, especially for your critical web-scraping tasks.
Error handling can be difficult to implement properly. In cases where your agent only
collects non-critical data, you may decide to skip error handling and simply re-run the
agent. Error handling is only important when an agent needs to deliver reliable data
every time it runs.
For web scraping, there are two aspects to error handling: Agent error handling and
Error logs and notifications. Agent error handling happens automatically, when specific
errors occur during the execution of an agent. For example, an agent could retry a
command if it fails to load a new web page. Error logs and notifications can warn an
administrator when an agent encounters trouble and requires attention. Consider the
case in which a target website has a new layout and the agent may no longer be able
to extract data correctly. You can configure error handling for this agent so that an
email notification will be sent to the administrator - who can decide how to update the
agent to accommodate the new layout.
We encourage you to learn more about error handling in these sections:
Agent error handling
Error logs and notifications
© 2014 - 2015 Sequentum
208
6.1
Content Grabber
Agent Error Handling
You can configure an agent to react to errors in two different ways. You can configure
the error handling properties for an action command, or you can use a custom script
to watch for an error condition and take appropriate action.
You can configure an agent to react to errors in two different ways. You can
configure the error handling properties for an action command, or you can use
a custom script to watch for an error condition and take appropriate action.
Action Command Error Handling
These are the Error Handling properties:
Error
Description
Handling
Exit
The agent will exit the action command and
Command
continue executing the next command. The
agent will skip all sub-commands of the action
command.
Retry
The agent will retry the action command a
Command
specified number of times, and if the action
command does not succeed, it will skip all subcommands of the action command and continue
executing the next command. Set the property
Retry Count to specify the number of retries. If
Retry Count is set to zero the agent will keep
retrying the command indefinitely.
Restart Agent
The agent will restart and continue where it left
and Continue
off. This option is useful if an error puts the
website into a state where the agent cannot
continue.
Stop Agent
The agent will stop.
© 2014 - 2015 Sequentum
Error Handling
No Error
The agent will not handle the error, but will
Handling
continue to execute the sub-commands of the
209
action command. Use this option if you want to
handle the error in a custom script later in the
process.
Error Handling In An Execute Script Command
You can use Execute Script Commands to implement advanced error handling, such
as in these cases.
A required element is missing from a web page
An error message is displayed on a web page
An error occurs during the execution of an action command.
You can write a custom script in C# or VB.NET, or you can use the default script
options to generate a default script without writing any .NET code.
This default script retries a parent command
if a selection is missing
An Execute Script Command has a corresponding web selection, and the script can
use this web selection to check for any errors. The script also is given the parameter
IsParentCommandActionError, which is set to True if the parent action command
© 2014 - 2015 Sequentum
210
Content Grabber
encounters an error while executing the command action.
6.2
Error Logs and Notifications
You can use error notifications to warn when an agent encounters an error, so that the
administrator can take appropriate action. The administrator can use the logs to get
more information about the errors.
Error Notifications
Notifications are email messages that are sent to a specific email address when
certain conditions occur. The notification configuration screen is available by clicking
the Agent Settings > Notifications menu.
This agent sends email notifications if it encounters any
critical errors
An email notification can be sent when an agent finishes, but never during agent
execution or when debugging an agent. You can choose to send email notifications
when one or more of these conditions are met:
Condition
Description
© 2014 - 2015 Sequentum
Error Handling
Always send
A notification email will always be sent
notifications when
when the agent finishes, whether an error
an agent completes occurs or not.
a run
Send notification
An email notification will be sent when
on critical errors
critical errors occur. A critical error will
appear as red color on the log screen.
These critical errors include page load
errors and also situations where required
web selections are missing from a web
page.
Send notification
A notification email will be sent when an
on low number of
agent loads less than the minimum
page loads
number of pages that you specify in Low
page number threshold. Page loads
include all command actions, including
content loaded asynchronously using
AJAX and content changes triggered by
synchronous JavaScript.
A script can trigger notifications by calling one of these methods on
the script parameter object:
Method
Description
void Notify(bool
Triggers notification at the end of agent
alwaysNotify =
execution. If alwaysNotify is set to
true)
False, this method only triggers a
notification if you configure the agent to
Send notification on critical errors
(see above).
void Notify(string
© 2014 - 2015 Sequentum
Triggers notification at the end of an
211
212
Content Grabber
message, bool
agent execution, and adds a message to
alwaysNotify =
the notification email. If alwaysNotify is
false)
set to False, this method only triggers a
notification if you configure the agent to
Send notification on critical errors
(see above).
Error Logs
Logging is off by default. If you want to save log information, you must first enable
logging before you run an agent or debug it. Simply locate the Debug section in the
Properties tab, and set the Debug Disabled property to True.
A Content Grabber agent send log data to a database table by default. The database
table supports the log viewer in the Content Grabber editor, which you can access
by clicking the View Log button in either the Run ribbon menu or the Debug ribbon
menu.
The critical errors in this log are red in color
URL Column
The URL column in the Agent Debug Log window shows links to all web pages that
© 2014 - 2015 Sequentum
Error Handling
213
the agent has processed, and you can double-click on any of these links to open the
web page in your default browser. For any failures on data extraction, this is an easy
way to open the web page and check if the web page layout is different from the
expected layout.
NOTE: Double-clicking on a link to a sub-page only works if the website allows
direct links into the sub-pages. Many websites don't allow direct links into subpages without going through other pages first. Read below to learn how to
directly access the HTML for sub-pages.
HTML Column
If a website does not allow direct links into sub-pages, you can configure an agent to
save the entire HTML of all processed pages. When you configure the agent to log all
HTML, the HTML column will contain buttons that you can click to view the raw HTML
in the Content Grabber HTML viewer. The raw HTML does not include style sheets
and other support files that are necessary to render a complete web page properly,
but it will often show you enough information to determine what is causing an error
(such a CAPTCHA page).
© 2014 - 2015 Sequentum
214
7
Content Grabber
Extracting Data From Non-HTML Documents
Websites generally provide most of their content in HTML format, but some websites
may also provide content in other formats - such as PDF or Microsoft Word
documents. Since Content Grabber can only process HTML documents, it will simply
download any non-HTML document. Content Grabber can help you extract text and
images from within a PDF or Word document by converting such documents into
HTML.
To have Content Grabber convert your non-HTML document, you will need to provide
an external document converter. The Content Grabber public website provides a list of
open source programs that you can use for this purpose. Please remember that we
don't support these tools, and you must comply with the license for any conversion
tool.
Limitations
The design of most file formats, including PDF and Word files, doesn't include easeof-conversion to HTML. So, the conversion output is considerably more difficult to
manage than standard HTML. In many cases, you'll have to select the entire HTML
page and then use Regular Expressions to extract the target content.
Installing a Document Converter
Content Grabber uses a custom script that makes a call to an external document
converter, and you can configure this script to call any type of program. The default
script can handle the two document converters currently available for download on
the Content Grabber website.
Installing the PDF To HTML Converter
Follow these steps to install the PDF-to-HTML document converter:
1. Download the pdftohtml.zip file from the Content Grabber website:
© 2014 - 2015 Sequentum
215
http://www.ContentGrabber.com/Resources/Tools.aspx
2. Extract the content of the zip file into the default Content Grabber Converters
folder, My Documents\Content Grabber\Converters. You can also copy the converter
into the corresponding Public Documents folder if you need the converter to be
available for all users on the computer.
3. The direct path to the document converter should now be:
My Documents\Content Grabber\Converters\pdftohtml\pdftohtml.exe
Installing the Docx To HTML Converter
Follow these steps to install the docx to HTML document converter:
1. Download the docxtohtml.zip file from the Content Grabber
website:
http://www.ContentGrabber.com/Resources/Tools.aspx
2. Extract the content of the zip file into the default Content Grabber Converters
folder, My Documents\Content Grabber\Converters. You can also copy the converter
into the corresponding Public Documents folder if you need the converter to be
available for all users on the computer.
3. The direct path to the document converter should now be:
My Documents\Content Grabber\Converters\docxtohtml
\docxtohtml.exe
Using a Document Converter
You'll need to add the custom script (that calls the document converters) to a
Download Document command, and that same command must also perform the
download of the document.
© 2014 - 2015 Sequentum
216
Content Grabber
The default conversion script looks like this:
using System;
using System.IO;
using Sequentum.ContentGrabber.Api;
public class Script
{
//See help for a definition of ConvertDocumentToHtmlArguments.
public static bool ConvertDocumentToHtml(ConvertDocumentToHtmlArguments args)
{
if(args.DocumentType=="pdf")
ScriptUtils.ExecuteCommandLine(@"Converters\pdftohtml\pdftohtml.exe",
args.DocumentFilePath, args.HtmlFilePath, "-noframes");
else if(args.DocumentType=="docx")
ScriptUtils.ExecuteCommandLine(@"Converters\docxtohtml\docxtohtml.exe",
args.DocumentFilePath, args.HtmlFilePath, "");
if(!File.Exists(args.HtmlFilePath))
return false;
return true;
}
}
Extracting Content From a Converted Document
After converting a document to HTML, that document goes into the agent data folder
and you can use a Navigate URL command to open the HTML document. The
© 2014 - 2015 Sequentum
217
Download Document command that did the conversion will also store the path to the
document, and then the Navigate URL command can use that path to get the file URL
to the HTML document.
An agent with a URL command that links to a
converted HTML document
The Navigate URL command uses data from the Download Document command, so
the Download Document command must execute first. Also, both the commands must
have the same parent command, or the Navigate URL command must be a child
command of the command that contains the Download Document command.
You can execute the Navigate URL command in the editor to open the converted
HTML document, but you must first execute the Download Document command to
make sure the HTML document is available.
A Navigate URL command using data captured by a
© 2014 - 2015 Sequentum
218
Content Grabber
Download Document command
© 2014 - 2015 Sequentum
Data
8
219
Data
Content Grabber manages three types of data:
Input data - is used to provide input values to an agent, such as input values for a
web form or start URLs. See the topic Using Data Input for more information.
Internal data - An agent stores data extraction elements in an internal database
while the agent is running. When the agent completes, this internal data will
become external data. For more information, see the topic The Internal Database.
External data - this is data that goes out to a specific export target. See the
topic Exporting Data for more information about exporting data.
Distributing Data
If an agent is exporting data to a particular file type, such as a spreadsheet, then the
agent can also send the file as an attachment to an email message to one or more
recipients. The agent can also FTP the files to a remote computer. See the topic
Distributing Data for more information.
We also recommend these topics:
The Internal Database
Using Data Input
Exporting Data
Distributing Data
8.1
Many professionals who use Content Grabber will normally involve the use of one or
more of their database servers, and a web-scraping agent can both import and export
data to multiple servers. To increase performance and reliability, you can also
configure an agent to have its primary database on an external server instead of using
the default SQLite file database.
© 2014 - 2015 Sequentum
220
Content Grabber
Development Environments
As you gain proficiency with agent development and deployment, you may find it best
to use separate computing environments for the various stages: a developer
environment, a testing environment, and a production environment. Each of these
environments will typically have its own database, and so the database connections in
your agent will need to change when you move an agent from development into
testing, and then from testing into production.
Unique Database Names on Your Network
To make it easier to move agents between different computer environments, you can
configure network database connections - each of which has a unique name on the
network. Using such names will allow for an automatic switch to a new connection if
the agent moves to another computer on the network. Also, we recommend that you
take this approach, even if you are only using a single computer to design and run
agents, since you won't have to configure a new database connection each time you
create an agent.
Another benefit of using unique-name database connections is the agent will contain
only the name of the connection. This design gives additional security when you send
agents to others, since the agent will contain no database connection information.
If you don't configure a network database connection, then the connection will be
known only to the agent in the location that you configure the agent. You can use the
database connection anywhere in the agent, but it will be unknown to other agents.
8.2
The Internal Database
Content Grabber uses two layers of internal data, internal data and export data. While
the agent is running, it continuously saves data in an internal format. When the agent
finishes, it converts the internal data into export data, and then it sends this data to the
export target so it becomes external data.
© 2014 - 2015 Sequentum
Data
221
The default internal database is a SQLite database, but this can be changed SQL
Server or MySQL. Oracle and OleDB are not currently supported as internal
databases.
Internal Data
Content Grabber stores internal data in the database - which contains all the data
extraction elements for any agent, but also contains the data which corresponds to the
settings for the agent properties. For example, if an agent stops, crashes, or
otherwise fails, the internal data is used to start the agent again exactly at the point of
failure. Also, the internal data will always contain a data table for each container
command in the agent and at least one data field for each capture command in the
agent. You can always view the internal data by clicking the Data > View Internal
Data menu.
If you configure an agent to add new data to the existing data, then the internal data
store will continue to grow larger. Otherwise, the internal data store will be recreated
every time the agent is run.
Export Data
The agent stores the export data in the internal database, and this data doesn't
contain any of the property data for the agent itself. Also, the export data is more
readable than the internal data. The export data will usually not contain a data table
for each container command in the agent, and some capture commands may not have
corresponding data fields.
Export data is always overwritten, so you cannot add data to existing export data.
You can only add data to existing internal data. You can always view the internal data
by clicking the Data > View Export Data menu.
External Data
© 2014 - 2015 Sequentum
222
Content Grabber
External data is the same as the export data, and the agent delivers it to an external
data store chosen by the user. Typically, external data is overwritten, except when
you choose Export last data segment only and you are exporting to a database. In
that case, the operation simply adds the data to the external database tables. You
can also use a Data Export Script to customize the export process which will allow
you to update or add to existing data. The export data is set in a fixed format, but you
have some options for changing this format. You need to use an Data Export Script if
you want to automatically create or configure your own data structures within this
data.
Internal Database
The default internal database is a SQLite file database, but you can change it to
either a SQL Server or MySQL database. You can make this change on the Internal
Database screen.
Important Notice About SQL Server and MySQL
WARNING: It is important to ensure that no two agents that
use the same internal database have the same agent name.
Each time you run an agent, Content Grabber verifies the tables in the internal
database. If the tables don't exist or are invalid, it removes and recreates all tables for
that agent. If any changes have been made to the agent, there may be some
inconsistencies. Instead of removing all tables, it removes all tables starting with the
specific agent name.
All internal data tables have the following naming format:
VWR INTERNAL [PROJECT_NAME] [TEMPLATE_NAME]
When Content Grabber removes a table for a specific agent, it removes all
tables starting with the name:
© 2014 - 2015 Sequentum
Data
223
VWR INTERNAL [PROJECT_NAME]
This approach creates an obvious problem when two agents use the same internal
database and also start with the same name, because Content Grabber would
remove the tables for both projects when it should only remove tables for one project.
It is important to ensure that no two agents that use the same internal database have
the same agent name.
8.3
Using Data Input
Content Grabber can take input data from a wide variety of data sources, such as
databases and CSV files. Though you can use input data for many different purposes,
its most common use is for loading URLs or inputting values for web forms.
We explain each of the types of input data in these sections:
Data providers
Input parameters
Agent data
8.3.1
Data Providers
You can assign a data provider to any Data List command and can load data from the
following data sources:
Simple
SQL Server
CSV
MySQL
Selection
Oracle
Script
OleDB
© 2014 - 2015 Sequentum
224
Content Grabber
Data List commands include the following:
Agent
Set Form Field
Navigate URL
Data List
A list command will iterate through all the data rows given by the data provider and
execute each of the sub-commands once for each row.
Consuming Data
Consumer commands use the data that comes from the Data Provider. The
consumer must be a sub-command of the command providing the data. Data
consumers include the following types of commands:
Agent
Set Form Field
Navigate URL
Data Value
Most data providers are also data consumers and, consequently, many commands will
retrieve data automatically. For example, an agent command will normally provide at
least one start URL to itself - as shown in the figure.
© 2014 - 2015 Sequentum
Data
225
You can choose the Data Provider on the Common tab of the Configure Agent
Command panel. If multiple data columns are available, then you can choose a
specific Data Column. The Default data column is simply the first data column that
the data provider presents.
A data provider will often provide multiple data columns for data sets that correspond
to a web form. An example would be an agent that searches for air fares, which would
submit a list of departure and destination cities to a web form. The data provider
would then provide two columns: one for departure city and one for destination city.
Consuming Data in a Script
A script can consume data from a data provider as long as the script corresponds
with a command that is a sub-command of the data provider. Most scripts provide
easy access to input data from the script arguments. The following example retrieves
the current data from the data column DepartureCity in the CityData data provider.
args.GetInputData("CityData").GetStringValue("DepartureCity");
See the Scripting article for more information.
Simple Data Provider
© 2014 - 2015 Sequentum
226
Content Grabber
Choosing Simple for the Data Provider will cause the agent to extract data from a
CSV data set that the agent designer enters at design time. The agent contains the
CSV data internally, so that there is no dependency on external files.
The CSV data set may contain multiple rows, each having multiple comma-separated
data values. The data follows the same formatting rules as standard CSV data. So,
you should enclose a data value in double-quotes if it contains a comma or single
quote, and use two double quotes to escape any double quotes in a data value.
The following example contains a header row, although a header row is not
mandatory. If the data contains a header row, then you need to check the Input data
includes header row box on the Data tab.
Departure City,Destination City
Sydney,Brisbane
Sydney,Melbourne
CSV Data Provider
Choosing CSV for the Data Provider is similar to the Simple data provider, but uses
an external CSV file.
We recommend that you choose the CSV Data Provider for large CSV files, since the
© 2014 - 2015 Sequentum
Data
227
CSV data provider will perform much better than the Simple data provider for large
quantities of data.
For a CSV provider, you can choose the value Separator and the text
Encoding of the CSV file.
You can place CSV files anywhere on your computer, but we recommend you place
them in the default input data folder for your agent. Later, if you want to export your
agent, then you can include these files along with the export.
Selection Data Provider
Choosing Selection for the Data Provider will extract data from elements on the web
page. The data provider uses the selection XPath of the data provider command and
then adds a relative XPath to find the web elements that you include in the selection.
You can choose which HTML attributes of the web elements to include as data.
© 2014 - 2015 Sequentum
228
Content Grabber
The choice of Selection for the Data Provider is almost exclusively for use with dropdown menus in web forms. A form field command selects the drop down on the web
page, and a relative XPath selects the option elements within that drop down. The
drop-down options then become available to the form field command, which can
iterate through the options and ensure that the web form submission is done for each
item in the drop down. Although this is somewhat tedious, the benefit is more precise
control over which options you want to submit with the web form.
Script Data Provider
Choose Script for the Data Provider for full customization of the agent input data.
This option provides a .NET data table containing the input data, which may contain
multiple data columns and rows.
© 2014 - 2015 Sequentum
Data
229
Content Grabber provides some standard .NET libraries which makes it easier to
generate .NET data tables for a variety of input data. See the topic Data Input Scripts
for more information and sample code.
Database Data Providers
Choose a Database for the Data Provider to work any of these
database connections:
SQL Server
Oracle
MySQL
OleDB
In Content Grabber, you can share database connections among all agents on a
computer. We recommend that you read more about create new database
connections.
© 2014 - 2015 Sequentum
230
Content Grabber
A database data provider requires execution of a SQL Select statement on the
database connection to retrieve the input data. The following example selects the two
data columns, DepatureCity and DestinationCity, from a table City.
SELECT DepatureCity, DestinationCity FROM City
8.3.2
Input Parameters
Input parameters are named values that get their value assignments when an agent
starts. You can specify these as command-line parameters when running an agent
from the command line, or specify them in a GUI window when running the agent from
the Windows desktop. Any data consumer command within the agent can use these
parameters.
© 2014 - 2015 Sequentum
Data
231
You can configure input parameters to get value assignments at runtime by
using one of these methods:
A GUI window
Command-line parameters
The Content Grabber API
Input Parameters with a GUI Window
When running an agent from the Content Grabber editor and the agent has any input
parameters, then the Runtime Input Parameters window will automatically appear.
Runtime Input Parameters
The user running the agent can enter runtime values for the input parameters and can
even add new parameters that are only defined during the current execution of the
© 2014 - 2015 Sequentum
232
Content Grabber
agent. An agent will always use the default input parameters when debugging. The
Runtime Input Parameters window will not appear when starting a debugging
session, so you will need to change the default values if you want to debug the agent
with different input parameters.
Specifying Input Parameters on the Command Line
When running an agent from the command line, you can specify input parameters as
command-line arguments. It's unnecessary to specify all of the input parameters. If
you don't specify a parameter, the call to run the agent will use a default value for that
parameter. See the topic Running Agent from the Command-Line for more
information.
The following example runs an agent named Sequentum on the command line, and
then sets the input parameters DepartureCity and DestinationCity.
RunAgent.exe Sequentum -DepartureCity "Sydney" -DestinationCity "Melbourne"
Specifying Input Parameters Using the Content
Grabber API
When running an agent using the API, you can specify input parameters using the
AgentSettings class in the Content Grabber API. See the topic Programming
Interface for more information.
The following example uses the API to add a set of input parameters and
then run the agent:
AgentApi api = new AgentApi(@"C:\Users\Public\Documents\Content Grabber\Agents\
qantasApiTest\qantasApiTest.scg");
AgentSettings settings = new AgentSettings();
settings.InputParameters.Add("from", "SYD");
settings.InputParameters.Add("to", "MEL");
settings.InputParameters.Add("departure_date", DateTime.Now.ToString("yyyy-MM-dd"));
© 2014 - 2015 Sequentum
Data
233
settings.InputParameters.Add("return_date", DateTime.Now.ToString("yyyy-MM-dd"));
settings.InputParameters.Add("travel_class", "ECO");
settings.InputParameters.Add("adults", "1");
settings.InputParameters.Add("children", "0");
settings.LogLevel = AgentLogLevel.High;
settings.IsLogToFile = true;
api.RunAgent(settings);
Using Input Parameters
Any data consumer command, such as the following, can use input
parameters.
Agent
Set Form Field
Navigate URL
Data Value
Using Input Parameters
Using Input Parameters in a Script
Most scripts provide easy access to input parameters from the script arguments. See
the topic Scripting for more information. The following example retrieves the input
parameter DepartureCity:
args.GlobalData.GetString("DepartureCity");
© 2014 - 2015 Sequentum
234
8.3.3
Content Grabber
Agent Data
Agent data is data that is currently being extracted by the agent. For example, data
extracted by a capture command could be used as input to a form field command.
Agent data can be used by data consumer commands, but the data consumer must
be a sub-command of the capture command which is extracting the data. Else, it must
have the same parent command as the capture command. See Agent Data for more
information.
Agent data is the data that the agent is currently extracting, and it can be put to use
by other commands at runtime. If another command uses such data, then it must
execute after the command that extracts the data. Otherwise, the data wouldn't be
available. Also, the command using the data must be a sub-command or have the
same parent command as the command that extracts the data.
Using Agent Data
Agent data can be used by any data consumer command. Data consumers
include the following types of commands:
Agent
Set Form Field
© 2014 - 2015 Sequentum
Data
Navigate URL
235
Data Value
CAPTCHA Protection
Agent data is often used when resolving CAPTCHA images. You can automatically
process a CAPTCHA page by using an OCR service to convert the CAPTCHA image
into text, and then use that text in a form field command to set the CAPTCHA input
box.
A CAPTCHA page or section appears on some websites to protect access to a
secure area of the website. A CAPTCHA image displays a sequence of numbers and
characters that a website user must type into an input box in order to continue to the
secure area. This process ensures the website user is a real human and not an
automated agent. See the topic CAPTCHA Blocking for more information.
8.4
Exporting Data
While performing a data extraction, an agent saves data to an internal data store and
then exports that data to a specific export target when the agent completes or if the
user manually stops the agent.
Content Grabber supports many different types of data stores such as databases,
CSV files, XML files and spreadsheets. You can use a data export script to gain
complete control over the entire export process - including the export of data to data
stores that Content Grabber doesn't support.
We recommend that you also read these topics:
Selecting an Export Target
Exporting Downloaded Images and Files
Character Encoding
Changing the Default Data Structures
© 2014 - 2015 Sequentum
236
Content Grabber
Scripting Data Export
8.4.1
Selecting an Export Target
Content Grabber can export data to many different data formats. By default, data will
export to an Excel spreadsheet, but you can change the export target by clicking the
menu Data > Data Export.
Content Grabber supports the following export targets:
None - data extractions will only remain in the internal database, and the agent
won't export the data. This is a useful option if you want to work directly on the
internal database and have no need for exporting. NOTE: This option is for expert
users only.
XML - the data will export to a single XML file.
SQL Server - the data will export to one or more SQL Server database tables.
MySQL - the data will export to one or more MySQL database tables.
Oracle - the data will export to one or more Oracle database tables.
OleDB - the data will export to one or more database tables. You can use any
database that has an OleDB provider.
CSV - the data will export to one or more CSV files. The default character
encoding is ASCII, but you can specify another type of encoding.
Microsoft Excel 2003 - the data will export to a single Excel spreadsheet, but it
may save the data into more than one worksheet. The agent will save any images
to the disk.
Microsoft Excel 2007 and later - the data will export to a single Excel
spreadsheet. You can save multiple data tables to a single worksheet using
grouping and outlining. The agent will save any images to the disk or embed them
into the worksheet.
Content Grabber always stores data in the internal database before exporting the
data to the chosen output format. This means you can always export data to a new
format without having to extract the data again.
© 2014 - 2015 Sequentum
Data
237
When exporting the data to a database, Content Grabber automatically creates new
database tables if they don't exist, and automatically deletes and re-creates existing
database tables if they do not have the correct number of columns.
Exporting Data to Oracle
Content Grabber supports Oracle databases but only through the Oracle OleDB
provider. You must first install the Oracle client software before the OleDB provider
will work. The OleDB provider given by Microsoft supports Oracle only up to version
8i, so you should use the OleDB provider from Oracle.
Use an Oracle OleDB connection string such as this one:
Provider=OraOLEDB.Oracle;Data Source=//localhost:1521/OracleTest;
User Id=test;Password=test;
The Data Source parameter in the connection string can be set to TNS, but
only if you configure the TNS configuration file properly.
Exporting Data to MS Access
Content Grabber is a 32-bit application and will only be able to communicate with a
32-bit OleDB provider. If you have installed the 64-bit version of Microsoft Office,
then you won't be able to use the 32-bit OleDB provider, and Content Grabber won't
be able to export data to Microsoft Access. If you are using the 32-bit version of
Office, you can use a connection string such as this:
Provider=Microsoft.ACE.OLEDB.12.0;Data Source=
C:\Data\MyDatabase.accdb;Persist Security Info=False;
You must also change the Parameters option from "@" to [@].
© 2014 - 2015 Sequentum
238
8.4.2
Content Grabber
Exporting Downloaded Images and Files
An agent will often download images and files during a data extraction. By default, the
agent saves these files in the internal database. However, you can configure the agent
to save all files to disk by unchecking the Embed files in database box on the
Internal Database configuration screen. When writing files to disk, the agent will store
the full path to each file in the internal database.
If you are exporting data and there are corresponding files in the internal database,
then these files will also export to the external database. If you uncheck the Embed
files in database box, then only the file paths will export to the external database. If
you are exporting data to a file format that doesn't support embedded files, such as
Excel 2003, then only the file paths will export to the external database and the files
will be written to disk.
Typically, Content Grabber will use the name of a file when saving it to disk. However,
since a website could potentially use the same name for two different files, Content
Grabber will set the file name with a unique identifier (GUID) if the file already exists
on disk.
You can also use a file name transformation script to name the files, and configure the
script to use some of the other extracted data elements for the file name. For
example, you could name a product image with the product name or product ID.
8.4.3
Character Encoding
Although the user interface text is English-only, Content Grabber can extract data from
websites in any language or encoding. If you have problems with the encoding of your
data extraction, then the problem is most likely with your export target. For example, if
you are exporting data to a MySQL database that uses Latin1 encoding, then you'll
find a question mark in the place of each Unicode character.
In accordance with CSV specifications, CSV output files have ASCII encoding and do
not support Unicode. However, if you're using a CSV import tool that can read CSV
© 2014 - 2015 Sequentum
Data
239
files with other character encodings, then you can specify the encoding when choosing
the CSV export target.
MySQL Character Encoding
Many users install their MySQL databases with Latin1 character encoding and then
encounter some characters appearing as question marks. If you want to export
Unicode characters to a MySQL database, you'll need to use Unicode character
encoding such as UTF-8. If you configure your MySQL database to use Unicode, you
must also specify the encoding when configuring the Export Target for the agent.
8.4.4
Changing the Default Data Structures
All container commands have a set of properties that you can use to specify how to
export data from sub-containers:
Property
Description
Sub-
Specifies how to export data from multiple sub-
Container
containers. You can add data together and
Export
then merge with data from parent containers,
Method
or simply merge the data and then merge that
data set with data from parent containers.
Alternatively, you can separate all the data
from the various sub-containers, which will also
separate the data from the data in parent
containers. Non-list containers have the option
Always Merge Data set to True by default.
The setting of this property has no effect on
sub-containers with the option Always Merge
Data set to True.
Sub-
Specifies how a data export from a sub-
Container
container will merge with data from the
Merge
corresponding parent container. This option has
© 2014 - 2015 Sequentum
240
Content Grabber
Method
no effect if the data from sub-containers is
separate. Non-list containers have an option
Always Merge Data set to True by default.
The setting of this property has no effect on
sub-containers with the option Always Merge
Data set to True.
Export Rows
Exports data rows from a list container as
in Parent
columns in the parent data table.
Columns
Separator
The separator character which marks each row
when exporting multiple rows in a single
column.
Name
A capture command that will provide the data
Command
for column names. This property is optional,
but must be set together with a command
name for the Value Command property.
Value
A capture command that will provide the data
Command
for column values. This property is optional, but
must be set together with a command name for
the Name Command property.
Sub-container Export Rules
Data exports from sub-containers will conform to these rules:
1. Data from non-list containers will merge with data from parent containers. A nonlist container that contains a list container will function as a list container. A nonlist container with the Always Merge Data property set to False will also function
as a list container. A list container with Export Rows in Parent Columns set to
Multiple Columns or Single Columns is equivalent to a non-list container.
2. Data from all list containers will combine or separate according to the Sub-
© 2014 - 2015 Sequentum
Data
241
Container Export Method setting.
3. If the data is separate in the previous step, then no further action is taken.
Otherwise, the resulting data will merge or combine with the parent data
according to the Sub-Container Merge Method setting.
Sub-Container Export Method
This method controls how data will export from sub-containers and has the following
four options:
Default
Add Horizontally
Separate
Add Vertically
Data from non-list containers will always merge with data from parent containers when
the option Always Merge Data is set to True. The Sub-Container Export Method
setting has no effect on these non-list containers.
Default Export Method
When using this method, data from sub-containers will be separate in some
circumstances - depending on the structure of your commands. By default, Content
Grabber separates data if the parent container contains more than one list container,
or if a single list-container generates significantly less data fields than the parent
container. For example, consider an agent that extracts product data from a website such as Amazon.com. One container command may extract product properties such
as product name, description, and price; and a sub-container may extract images
belonging to the product. The parent container extracts more data fields than the subcontainer, so by default the data from the sub-container will be separate from the data
of the parent container.
If the Default Export method does not separate data from sub-containers, it means
that there is only one list-container. Consequently, the export result will depend on the
Sub-Container Merge Method property, which will merge data with the parent data
(by default).
© 2014 - 2015 Sequentum
242
Content Grabber
Separate Export Method
When using this method, data from each sub-container will separate, which also
means the data will be separate from data in the parent containers. If, for example,
the data exports to a database, then the data from each sub-container will go into
separate database tables.
The figures below show a simple agent having one list container, along with the data
export it generates. By default, the Movie List container will merge its data with the
data from the imdb container, and the result will be a single data table containing the
entire data extraction.
A simple agent with one list command
© 2014 - 2015 Sequentum
Data
243
The above agent extracts this data by default. Data from the Movie List container merges with
data from the "imdb" container.
In this example, you can separate the data from the Movie List command by setting
the property Sub-Container Export Method to Separate.
The data is exported to separate data tables when the property Sub-Container Export method is set to Separate.
Add Horizontally Export Method
When using this export method, data from sub-commands will combine horizontally as
shown in the example below. The resulting data will combine or append to the parent
data according to the setting of the Sub-Container Merge Method.
Consider an example in which two different sub-containers generate the following data
sets. Here is the data for the first sub-container:
Column1
Column2
Data1
Data2
Data3
Data4
Data5
Data6
Data7
Data8
Data9
Data10
© 2014 - 2015 Sequentum
244
Content Grabber
Here is the data from the second sub-container:
Column3
Column4
Column5
Data11
Data12
Data13
Data14
Data15
Data16
Data17
Data18
Data19
This would be the result of a merge of these two data sets:
Column1
Column2
Column3
Column4
Column5
Data1
Data2
Data11
Data12
Data13
Data3
Data4
Data14
Data15
Data16
Data5
Data6
Data17
Data18
Data19
Data7
Data8
Data9
Data10
Add Vertically Export Method
When using this method, data from sub-commands will combine vertically as shown in
the example below. The resulting data set will merge or combine with the parent data
according to the Sub-Container Merge Method setting. If we apply this method to
the example data sets for the first and second sub-container above, this will be the
result:
Column1
Column2
Data1
Data2
Data3
Data4
Data5
Data6
Data7
Data8
Data9
Data10
Column3
© 2014 - 2015 Sequentum
Data
Data11
Data12
Data13
Data14
Data15
Data16
Data17
Data18
Data19
245
Sub-Container Merge Method
This method controls how data from sub-containers will merge with parent data.
Choose from these settings:
Default
Merge
Add.
The setting for this method has no effect when data from sub-containers is separate.
It also has no effect on non-list containers having the Always Merge Data property
set to True.
By default, Content Grabber will always merge data from sub-containers with parent
data, and this is nearly always the best action. When parent data merges with child
data, each data row in the parent data table will repeat for each corresponding child
data row - as shown in the example below. First, consider this data set for the parent
container:
ID Column
Column 1
Column 2
ID1
Data1
Data2
ID2
Data3
Data4
ID3
Data5
Data6
Here is the data from the sub-container:
Parent ID Column
© 2014 - 2015 Sequentum
Column 3
246
Content Grabber
ID1
Data7
ID1
Data8
ID2
Data9
ID2
Data10
ID3
Data11
If we apply this method to a merge of the data sets for the parent and sub-container
above, this will be the result:
New ID
Column1
Column2
Column3
ID1
Data1
Data2
Data7
ID2
Data1
Data2
Data8
ID3
Data3
Data4
Data9
ID4
Data3
Data4
Data10
ID5
Data5
Data6
Data11
Column
One situation where the Sub-Container Merge Method property should be set to
Add is the case in which a Pagination command links to a non-list container. If you
don't set the property to Add, then the non-list container will generate one row in the
parent data table by default, and the Navigate Pagination command (which is a list
container) will merge its data with the parent data. The result will be that the non-list
container will generate one row that will be overwritten, as we can see in yet another
example below.
This data set could come from a non-list container:
IDC Column
Column 1
Column 2
ID1
Data1
Data2
This data set could come from a Pagination container:
© 2014 - 2015 Sequentum
Data
Parent ID
Column 1
Column 2
ID1
Data3
Data4
ID1
Data5
Data6
ID1
Data7
Data8
ID1
Data9
Data10
ID1
Data11
Data12
247
Column
If we apply this method to a merge of the data sets for the non-list container and the
Pagination container above, this will be the result:
New ID Column
Column 1
Column 2
ID1
Data3
Data4
ID1
Data5
Data6
ID1
Data7
Data8
ID1
Data9
Data10
ID1
Data11
Data12
Export Rows to Parent Columns
Use this property to save all of the list-container data in the table of the parent. You
can control exactly what data will save and how it will save to the parent table by
setting these properties: Export Rows to Parent Columns, Name Command, Value
Command and Separator.
This figure represents a container command having a configuration to export data
rows to multiple columns in the parent data table:
© 2014 - 2015 Sequentum
248
Content Grabber
The Export Rows to Parent Columns property specifies how the data should save
into the parent table. You can choose one of these three methods:
None - The data will not save to the parent table, which is the default behavior.
Multiple Columns - Each data row that the list container generates will
generate additional columns in the parent table.
Single Column - The content from all data rows will concatenate into a single
column in the parent table. The data concatenates, but also separates by a
specific character that is given in the Separator property.
When saving data in multiple columns in the parent table, the column names in the
parent table will be the same as the column names in the child table, but with an
additional column. For example, if the column name in the child table is image, then
column names in the parent table will be as follows:
image_1, image_2, image_3, ....
If you save different kinds of data in the parent table, such as name/value properties,
then you may want the columns in the parent table to have names that correspond to
the content in the child table. For example, consider the following data from a child
table:
Name
Value
Width
100
Height
50
Depth
25
You may want this child table to generate the following parent table columns:
Width
Height
Depth
100
50
25
© 2014 - 2015 Sequentum
Data
249
You can achieve this by setting the Name Command property to the name of the
command that will extract the values you want to use as column names. Also, set the
Value Command property to the name of the command that will extract the values
you wish to use as column values.
Limitations
Adding columns to the parent table can lead to a different number of columns in the
parent table every time you run an agent. This is usually only desirable when exporting
data to Excel spreadsheets, and may sometimes be useful when exporting data to
CSV files.
8.4.5
Exporting Data with Scripts
With data export scripts, you can customize the export process for extraction data
sets. An export script is ideal for exporting to special data structures in a database, or
exporting data to a third-party or custom-data store.
An export script gives much more control than the standard export process because
the script can control when to insert, update, or delete data in the export target. See
the topic Data Export Scripts for more information.
8.5
Distributing Data
If the export target for an agent is set to a file format such as a spreadsheet, then you
can configure an agent to send a data export as a message attachment to one or
more email destinations. Alternatively, you can FTP the data to a remote server, or
have the agent use a custom script to distribute the export.
Read these topics for more information:
Email and FTP distribution
Scripting Data Distribution
© 2014 - 2015 Sequentum
250
8.5.1
Content Grabber
Email and FTP distribution
If the export target for an agent is set to a file format, you can configure an agent to
send a data export as a message attachment to one or more email destinations.
Alternatively, you can FTP the data to a remote server.
Email Distribution
When an agent exports data through email, it attaches the data files to the email in a
single compressed file or as individual data files. In the Email Delivery panel, you'll
need to specify the Server, Login, and Email Addresses.
Email distribution of exported data
FTP Distribution
When an agent uses FTP to copy data to a remote server, the agent can copy the
data files in a single compressed file or as individual data files. In the FTP Delivery
window, you must specify the remote Server address, Username, and Password.
The remote server must host an FTP service.
© 2014 - 2015 Sequentum
Data
251
FTP distribution of exported data
8.5.2
Scripting Data Distribution
An agent can use a custom script to distribute data export files. A data distribution
script doesn't need to actually distribute the data, but can be used to perform custom
processing tasks after data export is complete.
Please see the topic Data Distribution Scripts for more information.
© 2014 - 2015 Sequentum
252
9
Content Grabber
Anonymous Web Scraping
Many users prefer to browse the web in private. Some website owners do not want
web scraping tools to extract data from their websites, and their administrators may
attempt to block access to their website for any non-human users. Content Grabber
will behave exactly like a normal Internet Explorer user when your agent uses a Full
Dynamic Browser.
When you visit a website with a web browser or a web scraping tool such as Content
Grabber, the website owner can record your IP address and attempt to identify you
with the intent to block your access to the website.
If you do not want a website owner to be able to identify you when you visit a website,
you can use a proxy server to hide your IP address. The proxy server will visit the
website for you. There are many different types of proxy servers, but Content
Grabber supports only HTTP proxy servers. It does not support other types of proxy
servers, such as SOCKS proxies.
See the topic CAPTCHA & IP Blocking for more information about configuring proxy
servers in Content Grabber.
© 2014 - 2015 Sequentum
CAPTCHA & IP Blocking
10
253
Some website owners do not want web-scraping agents to extract data from their
websites. They may try to block access to their website if they believe the user is not
human.
CAPTCHA and IP blocking are the two most common ways of blocking web scraping
agents. You can configure a Content Grabber agent to manually or automatically
bypass CAPTCHA, and also configure an agent to use proxy servers - which prevents
IP blocking.
Read these topics for more information:
CAPTCHA Blocking
IP Blocking & Proxy Server
10.1
CAPTCHA Blocking
A website can implement CAPTCHA blocking by using a web form that the user must
submit to gain access to any restricted areas of the site. The web form is usually quite
simple, consisting of an image element and a text box element. The image displays
some characters which the user must enter into the text box in the exact sequence as
given in the image. A human user can read the text in the CAPTCHA image, but a
web-scraping agent requires special character recognition software to successfully
discern the characters in the image.
© 2014 - 2015 Sequentum
254
Content Grabber
A typical registration form with CAPTCHA blocking
Content Grabber performs both manual and automatic data extraction from websites
that implement CAPTCHA blocking. Automatic data extraction requires an account
with a third-party CAPTCHA recognition service and, typically, there is a small fee for
processing each CAPTCHA image. Manual data extraction is free, but requires you to
manually decode CAPTCHA images while running a data extraction agent.
Manual CAPTCHA Configuration
You have two options when configuring manual CAPTCHA. The
easiest one to configure is the option we describe below. The other
option uses the same approach as automatic CAPTCHA
configuration, but instead of using a script to resolve the CAPTCHA,
a window is displayed allowing the user to manually resolve the
CAPTCHA.
Manual CAPTCHA processing is easy to configure, but requires you
to manually decode CAPTCHA images while the agent is running.
The agent will pause and display the browser window where a user
can view the CAPTCHA image and enter the CAPTCHA text in the
text box. You can configure the agent to automatically submit the
web form after the user has entered the CAPTCHA text, or you can
reply on the user to submit the form while the agent is paused.
© 2014 - 2015 Sequentum
If CAPTCHA blocking is part of a larger registration form, you can
process the CAPTCHA part manually and let the agent process the
rest of the form automatically. In this case you should let the agent
submit the form automatically rather than relying on the user to
submit the form while the agent is paused.
Follow these steps to pause an agent when a CAPTCHA image is
displayed:
1. Add an Execute Script command to your agent.
2. Select the CAPTCHA image element in the web browser. This
sets the command's web selection.
3. Select the default script type Pause Agent.
4. Select the default script condition If Selection Exists.
5. Save the command.
This command will pause the agent when the CAPTCHA image
element exists on the web page, and allow a user to enter the
CAPTCHA text.
You must add this command to all locations in the agent where
CAPTCHA blocking could be encountered.
Important: Manual CAPTCHA processing relies on a human user to
decode the CAPTCHA image, so an agent using manual CAPTCHA
configuration cannot be run from the scheduler or API, or any other
fully automated way.
Automatic CAPTCHA Configuration
Automatic CAPTCHA processing requires an account with a third
party CAPTCHA recognition service. The third party recognition
service must provide a .NET API and you must add an OCR script
© 2014 - 2015 Sequentum
255
256
Content Grabber
that uses this API to call the service. See the section below for two
examples of CAPTCHA recognition services.
Follow these steps to configure an agent for automatic CAPTCHA
processing:
1. Add a new Group Commands command. This group of
commands will handle CAPTCHA.
2. Add an Execute Script command to the group. This command will
skip CAPTCHA processing if the CAPTCHA image doesn't exist in
the web page.
i. Select the CAPTCHA image in the web browser. This sets the
command's web selection.
ii. Select the default script type Exit Command.
iii. Select the default command Parent Command.
iv. Select the default condition If Selection Missing.
3. Add a Download Image command to the group. This command
will download the CAPTCHA image and use an OCR script to
decode the image into plain text.
ii. Open the OCR tab and check the option Convert image to
text.
iii. Add an OCR script. See below for script examples. If you
want to manually resolve the CAPTCHA, you can check the
option Convert image manually in which case you should not
specify a script.
4. Add a Set Form Field command to the group. This command will
use the converted image text to set the CAPTCHA text box.
i. Select the CAPTCHA form field on the web page. This sets
the command's web selection.
© 2014 - 2015 Sequentum
ii. Clear the command option Use default input.
iii. Set the data provider to Captured Data and select the
CAPTCHA image command from step 3.
5. Add a Navigate Link command to the group. This command will
submit the CAPTCHA form.
i. Select the CAPTCHA form submit button. This sets the
6. Add an Execute Script command to the group. This command will
retry CAPTCHA processing if the CAPTCHA image still exists on
the web page. If the CAPTCHA image still exists we assume it's
because the CAPTCHA recognition service decoded the
CAPTCHA image incorrectly and we'll try again.
ii. Select the default script type Retry Command.
iii. Select the default command Parent Command.
iv. Select the default condition If Selection Exists.
Automatic CAPTCHA configuration
The command library includes the group of commands listed above.
Select the Automatic CAPTCHA command from the library and add
it to your agent. After you have added the group command from the
© 2014 - 2015 Sequentum
257
258
Content Grabber
library, you need to set the web selection for all the commands that
require a web selection.
You must add this group of command to all locations in the agent
where CAPTCHA blocking could be encountered.
CAPTCHA OCR Scripts
Content Grabber includes the API and standard OCR scripts to call
the following CAPTCHA recognition services.
http://www.deathbycaptcha.com
http://bypasscaptcha.com
At the time of writing, the Death by CAPTCHA service charges
US$6.95 for 5000 CAPTCHAs and Bypass CAPTCHA charges
US$34 for 5000 CAPTCHAs. We are not affiliated with these
companies in any way and don't charge any additional fees for these
services.
The following OCR script uses the Death by CAPTCHA service to
decode CAPTCHA images.
public static string
ConvertImageToText(ConvertImageToTextArguments args)
{
string captcha =
DeathByCaptchaService.DecodeCaptcha(args.Image, "login",
"password");
return captcha;
}
The login and password in the script above is provided by Death by
© 2014 - 2015 Sequentum
259
CAPTCHA.
The following OCR script uses the Bypass CAPTCHA service to
decode CAPTCHA images.
public static string
ConvertImageToText(ConvertImageToTextArguments args)
{
string captcha =
BypassCaptchaService.DecodeCaptcha(args.Image, "key");
return captcha;
}
The key in the script above is provided by Bypass CAPTCHA.
Troubleshooting
If CAPTCHA fails when entering a correct CAPTCHA it may be
caused by the following issue.
Content Grabber cannot directly get the CAPTCHA image from the
web browser, so it downloads the image a second time, and that
may be disallowed by the web server, especially if you are using a
proxy rotation service where a new IP address maybe assigned for
the image download. To overcome this problem, you can use a
Download Screenshot command instead of a Download Image
command, in which case a second image download is not required.
10.2
IP Blocking & Proxy Servers
When you visit a website with a web browser or a web scraping tool, such as Content
Grabber, the website owner can record your IP address and may be able to use this
information to identify you or block your access to the website.
If you do not want a website owner to be able to identify you while you visit a website,
© 2014 - 2015 Sequentum
260
Content Grabber
you can use a proxy server to hide your IP address. When you use a proxy server,
you do not visit the target website directly, but instead request that the proxy server
visit the website for you.
There are many different types of proxy server, but Content Grabber supports only
HTTP proxy servers. It does not support other types of proxy servers, such as
SOCKS proxies.
How to Configure Proxy Servers
After you have purchased proxy server access or found freely available proxy servers
on the web, you will receive one or more proxy server IP addresses and possibly a
username and password to access the proxies. You need to enter this information into
the Content Grabber agent by selecting Agent Settings from the ribbon menu, or you
can setup default proxies in the Application Settings ribbon menu.
Agent proxy settings
Application proxy settings
The proxy menus allow you to set the Proxy source and Rotation interval options.
You can select one of the following proxy sources:
Proxy source
Default
Description
The agent will use the proxy configured in
© 2014 - 2015 Sequentum
261
Internet Explorer, or no proxies if no proxies
have been configured in Internet Explorer.
Application
The agent will use the default proxy settings
configured in the Application Settings menu.
Proxy List
The agent will use a specific list of proxies.
Click the Proxy List button to add proxies to the
list.
Gateway
Use this proxy setting in Application Settings if
you must connect to a specific proxy to access
the Internet.
The rotation interval is the number of pages the agent will load before switching to the
next proxy.
The Proxy List screen is used to specify a list of proxies and set the proxy verification
option:
Add proxies on the Proxy list screen
© 2014 - 2015 Sequentum
262
Content Grabber
You must specify the Proxy Address in the following format, including the port
number:
206.118.215.245:60099
In the example above, 206.118.215.245 is the IP address of the proxy server and
60099 is the port number.
Proxy Verification
Content Grabber can automatically verify if a proxy is available before switching to the
proxy. This allows agents to switch to the next proxy if a proxy is not available, and
thereby avoid stopping the agent prematurely just because a proxy is unavailable.
Proxy
Description
verification
option
Verify proxy
The agent will verify a proxy before switching to
before use
the proxy.
Verification
The number of seconds the agent will wait for a
timeout
successful proxy verification.
Importing Proxy Servers
If you are using a large number of proxy servers, it can be tedious to add them all to
each project. You can use the Import Proxies button to import a list of proxies from a
CSV file. The CSV file must have the following format:
Proxy Address, Username, Password
The Username and Password are optional columns.
CSV Example 1:
© 2014 - 2015 Sequentum
263
proxy
173.244.220.185:8800
50.21.10.78:8800
134.22.166.242:8800
CSV Example 2:
proxy, username, password
173.244.220.185:8800, user1, pass1
50.21.10.78:8800, user1, pass1
134.22.166.242:8800, user1, pass1
Proxy Gateway
Some company network configurations require all users to access the Internet through
a specific proxy. Use the Gateway proxy option in Application Settings to open the
Proxy Gateway screen.
Some companies require force Internet access
only through a proxy gateway
The Proxy address can specify a proxy directly or it can be a URL to a proxy
configuration script (PAC).
© 2014 - 2015 Sequentum
264
11
Content Grabber
Improving Agent Performance and Reliability
It is important to understand that the complexity of web scraping increases
significantly when extracting large amounts of data. The time it takes to perform
certain operations may take so long that the whole web scraping process becomes
impractical. Web scraping is unreliable in nature, and the longer a web scraping
project runs, the higher the chance something fails and interrupts the web scraping
process.
The following topics list a few steps you can take to increase performance and
reliability. Performance tuning is not for the novice user, and if you have no IT
experience, you may not be able to implement the recommendations in this topic.
Using MS SQL as Internal Database
Using the Static Parser Browser
Optimizing Agent Commands
11.1
Using MS SQL as Internal Database
A Content Grabber agent saves all extracted data to an internal database while
running. The default internal database is a SQLite file database, and file databases
don't perform as well as real databases, such as SQL Server or MySQL. You may
get better performance on some operations by switching to a real database server
depending on your system and the data you're extracting.
We recommend that you change the internal database to SQL Server Express if you
don't already have an existing database server. SQL Server Express is a free
database server from Microsoft and can contain up to 10GB of data in a single
database. You can download SQL Server Express from this URL:
© 2014 - 2015 Sequentum
265
http://www.microsoft.com/sqlserver/en/us/editions/express.aspx
Read the SQL Server Express documentation to learn how to install and operate the
database server.
Once you have installed SQL Server Express and added a database, you can change
the internal database from the menu Data > Change Internal Database:
For example, you can improve performance by choosing SQL Server from the
Internal Database drop-down:
11.2
Using the Static Parser Browser
When Content Grabber extracts data from a web page, it first loads the web page
into a web browser. The web browser parses and renders the web page, and
executes any JavaScript the page contains. This is a very safe approach, since
Content Grabber uses an embedded version of Internet Explorer as it's web browser.
Therefore if the target website is working in Internet Explorer, Content Grabber can
usually extract data from the website. However, the approach is also slow and may
cause instability.
© 2014 - 2015 Sequentum
266
Content Grabber
If you have been using Internet Explorer to browse the web, you may have sometimes
experienced problems, such as hanging websites or program crashes. This may occur
very rarely (say once a year), so it may not be a problem during normal usage of
Internet Explorer. When Content Grabber uses Internet Explorer to browse a website,
it may access more web pages in a few hours than you access in a year, so stability
issues are magnified significantly.
The main source of website instability is JavaScript. A website developer can use
JavaScript to implement dynamic features on the website, but JavaScript bugs may
lead to memory leaks, hanging websites or even program crashes.
All action commands in a Content Grabber agent, that open a new web browser, can
be configured to open a specific type of web browser. The default browser is an
embedded version of Internet Explorer, but you can change this to a Static Parser.
The Static Parser does not use Internet Explorer at all, and it completely ignores
JavaScript, so it's generally much more reliable.
JavaScript is always single threaded, so many operations cannot be performed
simultaneously when using Internet Explorer web browsers. Since the Static Parser
does not execute JavaScript, it can often process web pages much faster than an
Internet Explorer web browser.
Many websites don't work properly if JavaScript is disabled, so the Static Parser will
not work for all websites, but many websites can be partly processed with a Static
Parser, so you should always switch to a Static Parser if a particular web page can
be processed without JavaScript.
© 2014 - 2015 Sequentum
267
Configure an Action command to use a specific web
browser type
If you want an agent to use the Static Parser by default, then you can set the web
browser type on the Agent Settings > Dynamic Browser > Static Parser:
11.3
Optimizing Agent Commands
Action commands such as Navigate Link and Navigate URL commands are much
slower to execute than commands that simply capture content, so it's important to try
and limit the number of action commands in your agents. If you can access a target
web page directly with a direct URL, rather than clicking multiple links, then you should
always opt for the direct URL.
Many dynamic websites show and hide content when you click on buttons or links.
Content Grabber can extract both visible and hidden content, so it's often not
necessary to add an action command that simply makes content visible.
Optimizing Browser Activities
© 2014 - 2015 Sequentum
268
Content Grabber
An agent spends most of its time waiting for web pages to load, and Content Grabber
cannot always determine exactly when a page load or page action has completed, so
it sometimes ends up waiting longer than it has to. If the agent starts extracting data
too early, it may not be able to extract the data correctly, since some data may not
have loaded onto the web page yet.
If an agent command executes an action that results in multiple browser activities,
such as multiple page loads and AJAX callbacks, then Content Grabber will set the
action option Discover activities, instead of setting the exact browser activities the
action command should wait for. Content Grabber does this because the number of
activities or the order of activities is likely to change, so it's safest to keep discovering
the activities every time the action is executed. However, discovering activities is
slower than waiting for a set number of activities, because Content Grabber has to
wait to make sure no more activities occur before it can consider the action as
completed. It's always best to manually set the browser activities Content Grabber
should wait for, so it doesn't wait longer than absolutely necessary, but this can be
difficult to get right and may not be worth the effort unless performance is very
important.
When an action command uses the option Discover activities, it uses a set of timeout
values to determine how long it should wait for new activities. These timeout values
are set fairly conservatively to make sure they are long enough for slow computers
and slow Internet connections. You may be able to lower these timeout values, so an
action command completes faster without failing to extract content correctly.
© 2014 - 2015 Sequentum
269
Default activity timeout values
You can use the Browser Activity panel to view activities and timings for an action.
Please see the topic Browser Activities for more information.
© 2014 - 2015 Sequentum
270
12
Content Grabber
Building Self-Contained Agents
You can build self-contained agents that will run without the need for the Content
Grabber application to be present on the target computer. Self-contained agents can
be distributed royalty-free, and includes a user interface that allows end users to
configure and run the agent.
Note: the self-contained user interface has an introduction
screen that contains a promotional message for Content
Grabber and Sequentum. This message can only be removed if
you are using a premium version of Content Grabber.
The following figure shows the default self-contained user interface, in which you can
add your own text and logo:
© 2014 - 2015 Sequentum
271
An end user can configure the following agent features:
Export target (only file formats are supported)
Scheduling
Input parameters
Notifications
Proxies.
A self-contained agent can only export to file formats (Excel, CSV or XML). It cannot
export to databases or execute a data export script. Self-contained agents require
Internet Explorer 8+ and .NET v4+. The self-contained agents will not attempt to
download and install these required components if they don't exist on the target
computer, but most Windows computers will already have those components installed.
Building a Self-Contained Agent
Follow the steps below to build a self-contained agent:
1. Click the Export Agent application menu to open the Export Agent screen.
2. Make sure you check the option Create self-contained agent.
© 2014 - 2015 Sequentum
272
Content Grabber
3. You can now click Export to create the self-contained agent. Before creating the
self-contained agent you can click Customize Design to customize the design
and text displayed on the agent's user interface.
For more information, see the topic Customizing the User Interface.
Self-Contained Agent Files
All files required to configure and run a self-contained agent are included in a single
executable file which can be distributed royalty free. Any custom assemblies used by
an agent are included in the executable file, but only if they are located in the
Assemblies folder for the agent. Assemblies shared by agents and located in the
shared Assemblies folder are not included in the executable file, and should not be
used in self-contained agents.
Upgrading a Self-Contained Agent
You can upgrade an agent if the target website of a self-contained agent has changed
and you need to make changes to the agent in order for it to work correctly. Simply
make the changes to your agent and export the agent again. The end-user can
override his existing self-contained agent with your new agent. Any configuration
changes the end-user has made to the agent will not be lost.
For more information, see the topic Using a Self-Contained Agent.
© 2014 - 2015 Sequentum
12.1
273
Customizing the User Interface
A self-contained agent includes a user interface that allows users to configure and run
the agent. You can control some of the text and images displayed on this user
interface, and you can also control which agent options the user can configure.
Click the Customize Design button to open the Customize Self-Contained Agent
screen:
The Customize Self-Contained Agent screen has three tabs, in which you can
specify the text and images that should be used on the user interface of the selfcontained agent:
© 2014 - 2015 Sequentum
274
Content Grabber
The Agent Details tab allows you to specify text and images that are displayed on the
agent start-up screen. The Company Details tab allows you to specify text and
images that are displayed on the agent About screen. The Templates tab allows you
to specify which configuration options should be available on the user interface:
© 2014 - 2015 Sequentum
275
The Templates tab also allows you to specify custom template files. The template
files are HTML files that controls the user interface layout. The template files contain
special tags that will be replaced with the information you provide on the Agent Detail
and Company Details tabs.
If you want to provide your own template files, we recommend you copy the default
template files to a new location and modify them to suit your needs. The default files
are located in:
[Content Grabber Installation Folder]\Resources\SelfContained\Docs
This folder contains a number of files and sub-folders. Your custom folder must
contain the same sub-folders and HTML files with the same names. You can remove
or add style sheets and images to your custom folders. All files and sub-folders in
your custom folder will be included in the executable agent file.
© 2014 - 2015 Sequentum
276
12.2
Content Grabber
Using Input Data
A self-contained agent can use any input data that is supplied by a database. If an
agent uses data from a CSV file, the file should be placed in the default agent input
folder, and you should set the option to include input data when exporting the agent.
If an agent uses data from a Simple Data Provider, the standard self-contained user
interface can be used by the end-user to edit the input data. A Simple Data Provider is
only available in the self-contained user interface if the data provider property Public
Provider is set to True. A public name for the data provider can also be specified. This
public name will only be used in the self-contained user interface, and not in the
Content Grabber editor.
Allow Simple input data to be edited using
the self-contained user interface
An agent can have multiple public data providers, and each data provider can be
selected from a drop down box as shown on the following figure.
© 2014 - 2015 Sequentum
277
Simple input data can be edited using the self-contained user interface
12.3
Using a Self-Contained Agent
A self-contained agent is a single executable file that includes all the required files to
configure and run the agent. A self-contained agent does not need to be installed.
When you want to configure or run the agent, simply click on the executable agent file,
and the agent user interface will be displayed.
The first time the executable agent file is run, a sub-folder with the same name as the
agent will be created in the folder where the executable file is located. This folder will
store configuration information for the agent, so the folder should not be removed. If
the folder is removed, all configuration changes you have made to the agent will be
lost, and the agent will start with its default configuration.
© 2014 - 2015 Sequentum
278
Content Grabber
Upgrading a Self-Contained Agent
You can upgrade a self-contained agent by simply overriding the existing executable
agent file with a new agent file. Any configuration changes you have made to the
agent will not be lost, unless you delete the configuration folder.
© 2014 - 2015 Sequentum
Running Agents from the Command-Line
13
279
You can run agents from the command line using the RunAgent.exe program, which
you can find in the Content Grabber installation folder.
RunAgent.exe agentName
You can specify the full path to an agent file or just the name of the agent. If you only
specify the agent name, Content Grabber will look for the agent in the default location
for the user running the program. If the agent is not located in that folder, Content
Grabber will look in the default public agent location. The default public agent location
is:
C:\Users\Public\Documents\Content Grabber\Agents
If your agent is not located in any of the default agent locations, you need to specify
the full path to the agent file:
RunProject.exe "C:\My Custom Folder\agentName.scg"
When running multiple agents in a batch file, you can use the Start command to run
the agents asynchronously:
Start "WindowTitle" RunAgent.exe agentName1
Start "WindowTitle" RunAgent.exe agentName2
Exit Codes
The RunAgent.exe command line program returns one of the following exit codes:
Exit Code
Status
Description
0
Success
Data extraction was completed
successfully.
1
Failed
A critical application error occurred during
data extraction.
© 2014 - 2015 Sequentum
280
Content Grabber
Exit Code
Status
Description
2
Incomplete
Data extraction was interrupted.
3
Completed with errors
Data extraction was completed, but with
one or more page load errors or missing
required elements.
4
Incomplete with errors
Data extraction was interrupted, and
encountered one or more page load
errors or missing required elements.
5
Export Failed
Data extraction failed because it was
unable to export data. You can manually
open Content Grabber and attempt to
export.
Command Line Arguments
Input parameters can be added as command line arguments. The input parameter
name must be preceded with a dash. For example:
RunAgent.exe agentName -username "test" -password "test"
Please see the topic Input Parameters for more information.
The following command line switches can be used, and should not have a preceding
dash:
Switch
Description
log_level
Log detail level. Logging is turned off by default.
Example: RunAgent.exe agentName log_level High
log_html
Logs the HTML of all loaded web pages to files.
Example: RunAgent.exe agentName log_html
log_to_file
Logs information to a file instead of a database.
© 2014 - 2015 Sequentum
Switch
281
Description
Example: RunAgent.exe agentName log_to_file
view_browser
Displays the web browsers used to navigate the target
website.
Example: RunAgent.exe agentName view_browser
no_ui
No user interface will be displayed. This will cause an
agent pause command to fail.
Example: RunAgent.exe agentName no_ui
session_id
The agent will run in a specified Session.
Example: RunAgent.exe agentName session_id
sdgfdf4353
session_timeout
If an agent runs in a Session, this is the session
timeout in minutes. The default session timeout is 30
minutes.
Example: RunAgent.exe agentName session_id
sdgfdf4353 session_timeout 10
continue
Continues an agent that was stopped prematurely.
Example: RunAgent.exe agentName continue
13.1
Using the Content Grabber runtime
NOTICE: This functionality is only available to agents that are
built with the Content Grabber Premium Edition.
© 2014 - 2015 Sequentum
282
Content Grabber
You can use the Content Grabber command-line program on a computer without
installing the entire Content Grabber software environment. You must ensure that the
following files are in the same folder on the target computer:
AgentApi.dll
AgentProxy.dll
AjaxHook.dll
msvcp120.dll
msvcr120.dll
SQLite.Interop.dll
System.Data.SQLite.dll
RunAgent.exe
RunAgentProcess.exe
BaseEngine.dll
MySql.Data.dll
These are the minimum computer specifications for running the Content Grabber
runtime:
Windows 8/Vista/7/2008R2/2012 (though we don't support it, Windows XP will
work for most target websites)
Internet Explorer 9+ (Internet Explorer 8 will work for most target websites, but
is not supported)
.NET framework version 4+
© 2014 - 2015 Sequentum
Scripting
14
283
Scripting
You can use scripting in a variety of ways to customize Content Grabber behavior, or
to extend standard functionality. Content Grabber scripts are .NET functions written in
C# or VB.NET, or regular expressions.
Content Grabber scripts can be divided into three categories:
Content transformation scripts.
Command scripts.
Extension scripts.
Content Transformation Scripts
Content transformation scripts are used to transform a piece of text. For example, if
you extract an address from a web page, but want only the postal code, you can
transform the address into just the postal code. You can also view this process as
extracting sub-text, but sometimes you may do more than merely extract sub-text.
That is why we call this transformation.
Content transformation scripts are used in the following situations:
Transformation of extracted content.
Transformation of file names when downloading files.
Transformation of URLs when using Navigate Link commands that extract and
navigate to direct URLs.
Transformation of input data.
See the topic Content Transformation Scripts for more information.
Command Scripts
Command scripts are general purpose scripts that are executed as part of the normal
agent command execution flow. The scripts are often used to alter the execution flow
of an agent.
© 2014 - 2015 Sequentum
284
Content Grabber
See the topic Command Scripts for more information.
Extension Scripts
Extension scripts are used to extend Content Grabber functionality. A Data Export
script is an example of an extension script that can be used to apply some custom
business logic to extracted data.
The following scripts are extension scripts:
Agent Initialization Scripts
Data Export Scripts
Data Distribution Scripts
Data Input Scripts
Image OCR Scripts
Convert Document to HTML Scripts
14.1
Script Languages
The Content Grabber scripting engine supports C#, VB.NET and Regular Expressions.
C# and VB.NET can be used for all types of scripts, but Regular Expressions can only
be used by content transformation scripts.
Regular Expressions
This manual does not explain regular expression syntax. Please visit the following
website for more information about regular expressions:
Regular Expressions Reference Guide
Regular expression scripts in Content Grabber can include any number of regex match
and replace operations. Each regex operation must be specified on two lines. The first
line must contain the regex pattern and the second line must contain the operation
which can be return, return all or replace with. The return operation returns the first
© 2014 - 2015 Sequentum
Scripting
285
match in the original content or a selected group within the match The returned match
can be combined with static text. The return all operation returns all matches or the
specified group within all matches. The replace operation replaces all matches - or a
selected group within the first match, in the original content and then returns the
content.
A group within a match must be specified by the group number, and cannot be
specified by a group name. The group number 0 specifies the entire match, and the
number 1 specifies the first group in the match. A group number must be preceded by
the character $, so for example, $1 specifies the first group in a match. Use two $
characters ($$) to escape the $ character in an operation.
If a regex script contains more than one regex operation, the next operation will work
on the output from the preceding operation. All regex operations are case insensitive
and line breaks are ignored. The following eight special operations can be specified on
a single line:
strip_html removes all HTML tags from the content.
url_decode decodes an encoded URL.
html_decode decodes encoded HTML.
trim removes line breaks and white spaces from the beginning and end of the
content.
line_breaks converts some HTML tags such as <P>, <BR> and <LI> into
standard Windows line breaks.
to_lower converts text to lower case.
to_upper converts text to upper case.
capitalize_words capitalizes all words in the text.
aggregate [separator] if return all has been used to return a list, this operation
adds all strings in the list together separated by the specified separator.
The syntax {$content_name} can be used to reference extracted data in the current
container. For example, if you had a capture command named product_id, you could
construct the following regular expression to extract all text between the product ID
© 2014 - 2015 Sequentum
286
Content Grabber
and the first white space:
{$product_id}(.*?)\s
Examples
Here are a number of examples:
Regular
Description
Expression
.*
Returns the entire match, so everything in this
return
case.
A(.*?)B
Returns the group 1 match, so everything
return $1
between A and B.
(A).*?(B)
Returns group 1 and 2 matches, so AB in this
return $1$2
case.
A(.*?)B
Replaces every instance of everything between
replace with
A and B (including A and B) with nothing.
A(.*?)B
replace with
A and B (including A and B) with "some new
some new
text".
text
A(.*?)B
Replaces the first instance of everything
replace $1
between A and B (excluding A and B) with
with some
"some new text".
new text
A(.*?)B
replace with
A and B (including A and B) with the text
$1
between A and B, so in effect it removes A and
B.
© 2014 - 2015 Sequentum
Scripting
A(.*?)B
First extracts everything between A and B, and
return $1
then replaces every instance of everything
C(.*?)D
between C and D (including C and D) with
replace with
nothing.
<br>
Replaces all <BR> HTML tags with standard
replace with
Windows line breaks.
287
\r\n
A(.*?)B
Returns the group 1 match, so everything
return $1
between A and B, and then URL_decodes the
url_decode
result.
A(.*?)B
Returns the group 1 match, but adds C to the
return C$1D
beginning and D to the end before returning the
match.
C# and VB.NET
This manual does not explain C# and VB.NET syntax. Please visit an online reference
guide for more information about these languages.
C# and VB.NET scripts must have one class named Script with a static function that
is executed by Content Grabber. The signature of the static method depends on the
type of script. The following example is for a Data Input script.
using System;
using System.Data;
public class Script
{
public static DataTable ProvideData(DataProviderArguments args)
{
DataTable data = args.ScriptUtilities.LoadCsvFileFromDefaultInputFolder("inputData.scv");
return data;
}
}
© 2014 - 2015 Sequentum
288
Content Grabber
A script can have more than one function in the Script class, and can also use
functionality from external .NET libraries. All external .NET libraries must be added to
the agent's assembly references. See the topic Assembly References for more
information.
14.2
Script Library
Content Grabber provides a default procedure for developing scripts in Visual Studio.
The build-in script editor in Content Grabber does not provide advanced development
tools such as a debugger, so we recommend you build large scripts in Visual Studio
and smaller scripts in the built-in script editor.
When you add a script in Content Grabber, you can select to use a script library. The
script library is a normal .NET class library that contains classes and methods for all
types of scripts in Content Grabber. When using a Script Library, Content Grabber will
automatically call the specific methods in the .NET class library.
Agent or Shared Script Library
Content Grabber supports two types of script libraries, a script library that is shared
© 2014 - 2015 Sequentum
Scripting
289
between all agents in your Content Grabber document folder, and a script library that
is used only for a specific agent.
The shared script library assembly must be placed in the folder Content Grabber
\Assemblies and the script library is shared by all agents located in Content Grabber
\Agents. The Content Grabber folder can be located anywhere, but by default it's
located in My Documents\Content Grabber.
The agent script library assembly must be placed in the Assemblies sub-folder of an
agent folder.
Visual Studio Solution Template
Content Grabber provides a Visual Studio 2013 solution with a project that contains all
the required classes and methods. When you click the button Open Project in C# or
Open Project in Vb.NET, Content Grabber will automatically copy the solution and
projects to your agent's Visual Studio folder or to the shared Visual Studio folder if
you are using a shared script library. The Visual Studio solution also contains a test
project you can use to test the functionality in your script library.
When you build the Visual Studio solution in release mode, the script library assembly
is automatically copied to the right location so it's picked up by Content Grabber when
you run an agent. When Content Grabber loads the script library it will become locked
and you will need to close Content Grabber before you can build the library in release
mode. The debug version of the library is not locked by Content Grabber, because it's
not copied to any of the Content Grabber Assemblies folders, and will not be used
by your agents by default.
14.3
Assembly References
You can use external assemblies in your scripts by adding assembly references to
your agent. Click the Assembly References button when editing a script to open this
screen:
© 2014 - 2015 Sequentum
290
Content Grabber
You should only enter the file name of your assembly file when adding a reference, not
the full file path. You must copy the assembly file to the agent's Assemblies folder or to
the shared Assemblies folder. Content Grabber will look for unresolved assemblies in
those two folder.
The shared Assemblies folder is located in Content Grabber\Assemblies and assemblies
in this folder are shared by all agents in Content Grabber\Agents. The Content Grabber
folder can be located anywhere, but by default it's located in My Documents\Content
Grabber.
14.4
Script Utilities
All scripts have access to a ScriptUtils class through the script arguments. This class
contains the following utility functions:
Function
Description
DataTable
Loads a CSV file from a specified location
LoadCsvFile(stri
and returns the data in a DataTable.
ng path)
© 2014 - 2015 Sequentum
Scripting
DataTable
Loads a CSV file from a specified location
LoadCsvFile(stri
using a specified value separator and
ng path, char
returns the data in a DataTable.
separator)
DataTable
Loads a CSV file with a specified name
LoadCsvFileFrom
from the default input data folder and
DefaultInputFold
returns the data in a DataTable.
er(string
filename)
DataTable
Loads a CSV file with a specified name
LoadCsvFileFrom
from the default input data folder using a
DefaultInputFold
specified value separator and returns the
er(string
data in a DataTable.
filename, char
separator)
void
Executes a program on the command line.
ExecuteComman
The parameters inputFilePath,
dLine(string
outputFilePath and options are added as
programFilePath,
command line parameters.
string
inputFilePath,
string
outputFilePath,
string options)
void
Executes a program on the command line
ExecuteComman
with the specified command line
dLine(string
parameters.
programFilePath,
string
arguments)
© 2014 - 2015 Sequentum
291
292
Content Grabber
int
Clears all Internet Explorer cookies on the
ClearAllBrowserC
computer.
ookies()
int
Clears all Internet Explorer cookies on the
ClearBrowserCo
computer associated with the specified
okies(string url)
URL.
void
Clears the Internet Explorer session
ResetBrowserSe
cookies. This is equivalent to restarting an
ssion()
Internet Explorer browser.
Example
The following example loads data from a CSV file located in the
agent's default input folder:
using System;
using System.Data;
public class Script
{
{
DataTable data = args.ScriptUtilities.LoadCsvFileFromDefaultInputFolder("inputData.scv");
return data;
}
}
Database Utilities
All scripts have access to the following function through the script
arguments:
public IConnection GetDatabaseConnection(string connectionName)
The function returns a predefined database connection. See the topic Database
© 2014 - 2015 Sequentum
Scripting
293
Connections for more information about predefined database connections.
The IConnection, ICommand and IReader interfaces are Content Grabber interfaces
that are meant to make it easier to write and read data to a database, but you can
always use the standard .NET libraries if you prefer.
The following example writes some extracted data to a database:
using System;
public class Script
{
public static CustomScriptReturn CustomScript(CustomScriptArguments args)
{
IConnection connection = args.GetDatabaseConnection("exportDatabase");
connection.OpenDatabase();
try
{
ICommand command = connection.GetNewCommand();
command.SetSql("insert into export_table values
(@title, @description)");
command.AddParameterWithValue("title", args.DataRow["title"]);
command.AddParameterWithValue("description",
args.DataRow["description"]);
command.ExecuteNonQuery();
}
finally
{
connection.CloseDatabase();
}
return CustomScriptReturn.Empty();
}
}
The IConnection interface has the following functions and properties:
Function or
Property
© 2014 - 2015 Sequentum
Description
294
Content Grabber
IReader
Returns an IReader interface to a new
GetNewReader(st
Reader object that uses the current
ring sql)
connection.
ICommand
Returns an ICommand interface to a new
GetNewCommand
Command object that uses the current
()
connection.
void
Opens the database connection.
OpenDatabase()
void
Closes the database connection.
CloseDatabase()
object
Return the underlying database connection
GetConnection()
that is specific to the database type used.
For example, if the database is a MySQL
database, the object returned is a
MySqlConnection.
void
Executes a SQL statement that doesn't
ExecuteNonQuery
return a result set.
(string sql)
object
Executes a SQL statement that returns a
ExecuteScalar(str
single scalar value.
ing sql)
long
ExecuteCount(stri
single long value. This function should be
ng sql)
used when using the SQL function "Count".
bool
Return true if the specified table exists in
HasTable(string
the database.
tableName)
bool
Return true if the specified columns exists
HasColumn(string
in the specified table.
© 2014 - 2015 Sequentum
Scripting
tableName, string
columnName)
DataRow[]
Return all columns in the specified
GetTableColumns
database table.
(string
tableName)
void
Drops the specified database table.
DropTable(string
tableName)
void
Truncates the specified database table.
TruncateTable(str
ing tableName)
void Lock()
Locks the database connection, so it
cannot be used by any other running
threads.
void Release()
Releases a lock on the database
connection.
IReader
Executes a SQL with a list of SQL
ExecuteReader(st
parameters and returns a IReader.
ring sql, params
object[] pars)
DataTable
ExecuteDataTale(
parameters and returns a DataTable.
string sql, params
object[] pars)
void
ExecuteNonQuery
parameters that doesn't return a result.
(string sql,
params object[]
© 2014 - 2015 Sequentum
295
296
Content Grabber
pars)
object
ExecuteScalar(str
parameters that returns a single scalar
ing sql, params
value.
object[] pars)
long
ExecuteCount(stri
parameters that returns a single long
ng sql, params
value. This function should be used when
object[] pars)
using the SQL function "Count".
The ICommand interface has the following functions and properties:
Function or
Description
Property
void
Adds a SQL parameter to the Command
AddParameterWit
object. A matching parameter must exist in
hValue(string
the associated SQL statement. The value
pName, object
of a parameter is updated if the parameter
pValue)
already exists.
void
AddParameter(str
ing pName,
the associated SQL statement. The value
CaptureDataType
of a parameter is updated if the parameter
type)
already exists.
void
Sets the value of an existing SQL
SetParameterValu
parameter. A matching parameter must
e(string pName,
exist in the associated SQL statement.
object pValue)
void
Executes a SQL statement that doesn't
ExecuteNonQuer
return a result set.
© 2014 - 2015 Sequentum
Scripting
y(string sql)
void
Executes an existing SQL statement that
ExecuteNonQuer
doesn't return a result set.
y()
object
ExecuteScalar(str
single scalar value.
ing sql)
object
ExecuteScalar()
returns a single scalar value.
long
ExecuteCount(str
single long value. This function should be
ing sql)
used when using the SQL function "Count".
long
ExecuteCount()
returns a single long value. This function
should be used when using the SQL
function "Count".
void SetSql(string
Set the SQL statement associated with the
sql)
SQL command.
The IReader interface has the following functions and properties:
Function or
Description
Property
void
Sets the SQL statement associated with this
SetSql(string
data reader.
sql)
void
AddParameterWi
thValue(string
the associated SQL statement. The value of
a parameter is updated if the parameter
© 2014 - 2015 Sequentum
297
298
Content Grabber
pName, object
already exists.
pValue)
bool Read()
Reads a row of data. Executes an existing
SQL statement if it has not already been
executed.
void Close()
Closes the data reader.
DateTime
Gets a DateTime value from the current
GetDateTimeVal
row.
ue(int
columnIndex)
string
Gets a String value from the current row.
GetStringValue(i
nt columnIndex)
int
Gets a Integer value from the current row.
GetIntValue(int
columnIndex)
Guid
Gets a Guid value from the current row.
GetGuidValue(in
t columnIndex)
Type
Gets the data type of a specified column.
GetFieldType(int
columnIndex)
IDataReader
Returns the underlying .NET data reader.
GetDataReader()
object
Gets a data value from the current data
GetFieldValue(in
row.
t columnIndex)
Extension Methods
© 2014 - 2015 Sequentum
Scripting
299
Content Grabber provides a few extension methods that can be used in all scripts:
Function
Description
static DataTable
Converts a single string value into a DataTable
ToDataTable(this string
with one data column and one data row.
stringValue, string
columnName)
static DataTable
Converts an array of strings into a DataTable
ToDataTable(this string[]
with one data column and one data row for each
dataRows, string
string value.
columnName)
static DataTable
Converts an array of strings into a DataTable
ToDataTable(this string[]
with one data column and one data row for each
dataRows, string
string value. A standard .NET format string can
columnName, string
be used to format the string value before it is
stringFormat)
inserted into the DataTable.
static DataTable
Converts a list of strings into a DataTable with
ToDataTable(this
one data column and one data row for each
List<string> dataRows,
string value.
string columnName)
static DataTable
Converts a list of strings into a DataTable with
ToDataTable(this
one data column and one data row for each
List<string> dataRows,
string value. A standard .NET format string can
string columnName, string
be used to format the string value before it is
stringFormat)
inserted into the DataTable.
Example
The following script generates a list of URLs and uses an extension method to convert
the list of URLs into a DataTable.
using System;
using System.Collections.Generic;
© 2014 - 2015 Sequentum
300
Content Grabber
using System.Data;
public class Script
{
{
List<string> urls = new List<string>();
for(int i=1;i<1000;i++)
{
urls.Add("http://www.domain.com/page.php?ID=" +i.ToString());
}
return urls.ToDataTable("url");
}
}
14.5
Script Template Library
Content Grabber has a Script Template Library that allows you to easily share scripts
between agents. This library also contains some standard scripts for common
operations, such as regular expressions used to extract emails or phone numbers
from a piece of text.
You can save a script to the library at anytime when your writing a script. Simply enter
a script name and description, and click Save to Library. If you save a script with the
same name as an existing script, the existing script will be overwritten. You can save
a script to the template library at any time by clicking the Save button:
© 2014 - 2015 Sequentum
Scripting
301
You can load a script from the library by opening the Script name drop down box,
and then selecting the script you want to load:
Exporting and Importing Template Libraries
© 2014 - 2015 Sequentum
302
Content Grabber
All user templates can be exported and imported, so you can move the templates
from one computer to another. Content Grabber exports or imports all templates,
including both command templates and script templates. You cannot export or import
just command templates or just script templates.
14.6
Agent Initialization Scripts
An agent initialization script is run before the agent starts. The script can be used to
initialize data before an agent is run. The script is also often used to set agent
properties, such as database connection parameters, but we don't recommend you do
this, since the Agent class is undocumented and unsupported.
You can add an agent initialization script to an agent by clicking the menu Agent >
Initialization Script:
Agent initialization scripts can be written in C# or VB.NET.
The following example initializes a global counter.
using System;
public class Script
{
public static bool InitializeAgent(AgentInitializationArguments args)
{
args.GlobalData.AddOrUpdateData("counter", 1);
© 2014 - 2015 Sequentum
Scripting
303
return true;
}
}
The following example makes changes to an agent before it's run. The changes to the
agent are not permanent, but only in effect at runtime. When you are debugging an
agent, the debugger works directly on the agent in the editor, so any changes to an
agent are permanent. We therefore recommend you use the args.IsDebug variable to
turn off any changes to an agent during debugging.
using System;
using Sequentum.ContentGrabber.Commands;
public class Script
{
{
//Don't make changes to the agent while debugging
If(args.IsDebug)
return true;
//Set the file path and download folder if we're exporting to CSV
if(args.Agent.ExportDestination.ExportTargetType == ExportTargetType.Csv)
{
if(args.GlobalData.HasData("CsvFilePath"))
{
args.Agent.ExportDestination.CsvExport.UseDefaultPath = false;
args.Agent.ExportDestination.CsvExport.FilePath = args.GlobalData.GetStrin
}
if(args.GlobalData.HasData("DownloadFolder"))
{
args.Agent.ExportDestination.UseDefaultDownloadPath = false;
args.Agent.ExportDestination.DownloadDirectoryName = args.GlobalData.G
}
}
//Set the database connection if we're exporting to a SQL Server database
else if(args.Agent.ExportDestination.ExportTargetType == ExportTargetType.SqlServer)
{
© 2014 - 2015 Sequentum
304
Content Grabber
if(args.GlobalData.HasData("DatabaseName"))
{
args.Agent.ExportDestination.DatabaseExport.DatabaseConnectionName =
}
}
//Set the database connection and SQL for the input data provider in the Agent command
if(args.GlobalData.HasData("DatabaseName") && args.GlobalData.HasData("SQL"))
{
args.Agent.DataProvider.DatabaseProvider.DatabaseConnectionName = args.Globa
args.Agent.DataProvider.DatabaseProvider.Sql = args.GlobalData.GetString("SQL")
}
return true;
}
}
If you continue an agent, the initialization script is not run, but any changes that was
made to global data or the agent are are retained.
NOTE: The script must have a static method with the following signature:
All logging in an agent initialization script is always written to file and never to
database. This is because the script runs before the agent starts, and the internal
database is not available at that time.
An instance of the AgentInitializationArguments class is provided by Content Grabber
and has the following functions and properties:
Property or Function
Description
Agent
The current agent that is executing the script.
ScriptUtils ScriptUtilities
A script utility class with helper methods. See
Script Utilities for more information.
bool IsDebug
True if the agent is running in debug mode.
© 2014 - 2015 Sequentum
Scripting
305
Description
IInputData InputDataCache
All input data available to the agent command.
void WriteDebug(string
Writes log information to the agent log. This
debugMessage,
method has no effect if agent logging is disabled,
DebugMessageType
or if called during design time.
messageType =
DebugMessageType.Inform
ation)
void WriteDebug(string
debugMessage, bool
method has no effect if agent logging is disabled,
showMessageInDesignMod
or if called during design time.
e, DebugMessageType
messageType =
DebugMessageType.Inform
ation)
void Notify(bool
Triggers notification at the end of an agent run. If
alwaysNotify)
alwaysNotify is set to false, this method only
triggers a notification if the agent has been
configured to send notifications on critical errors.
void Notify(string message, Triggers notification at the end of an agent run, and
bool alwaysNotify)
adds the message to the notification email. If
alwaysNotify is set to false, this method only
triggers a notification if the agent has been
configured to send notifications on critical errors.
GlobalDataDictionary
Global data dictionary that can be used to store
GlobalData
data that needs to be available in all scripts and
after agent restarts.
Input Parameters are also stored in this dictionary.
IConnection
© 2014 - 2015 Sequentum
Returns the specified database connection. The
306
Content Grabber
Description
GetDatabaseConnection(st
database connection must have been previously
ring connectionName)
defined for the agent or be a shared connection for
all agents on the computer. Your script is
responsible for opening and closing the connection
by calling the OpenDatabase and CloseDatabase
methods.
14.7
Data Export Scripts
Data Export scripts can be used to customize the export process. An export script
could be used to export extracted data to custom data structures in a database, or it
could export data to a custom data source.
A Data Export script can be added from the Data Export Configuration screen:
The following example writes data from two export tables to two CSV files:
using System;
using System.IO;
using Sequentum.ContentGrabber.Commands;
public class Script
© 2014 - 2015 Sequentum
Scripting
307
{
public static bool ExportData(DataExportArguments args)
{
try
{
StreamWriter writer1 = File.CreateText(@"c:\temp\data1.csv");
try
{
StreamWriter writer2 = File.CreateText(@"c:\temp\data2.csv");
try
{
IExportReader mainTableReader = args.Data.GetTable();
try
{
while(mainTableReader.Read())
{
string csvData1 = childTableReader.
GetStringValue("Field Name 1");
csvData1 += ",";
csvData1 += childTableReader.GetStringValue
("Field Name 2");
writer1.WriteLine(csvData1);
IExportReader childTableReader =
mainTableReader.GetChildTable("Child table name");
try
{
while(childTableReader.Read())
{
string csvData2 =
childTableReader.GetStringValue("Child Field Name 1");
csvData2 += ",";
csvData2 += childTableReader.
GetStringValue("Child Field Name 2");
writer2.WriteLine(csvData2);
}
}
finally
{
childTableReader.Close();
}
© 2014 - 2015 Sequentum
308
Content Grabber
}
}
finally
{
mainTableReader.Close();
}
}
finally
{
writer1.Close();
}
}
finally
{
writer2.Close();
}
return true;
}
catch(Exception ex)
{
args.WriteDebug(ex.Message, DebugMessageType.Error);
return false;
}
}
}
The script must have a static method with the following signature:
public static bool ExportData(DataExportArguments args)
The function should return True if it succeeds and False if it fails. In this example, an
instance of the DataExportArguments class will be given by Content Grabber and
has the following functions and properties:
Property or
Description
Function
Agent Agent
The current agent.
© 2014 - 2015 Sequentum
Scripting
ScriptUtils
A script utility class with helper methods.
ScriptUtilities
See Script Utilities for more information.
IExportData Data
The data to export.
bool IsDebug
True if the agent is running in debug
mode.
ExtendedSortedLi
All input parameters defined by the agent.
st<string, string>
InputParemeters
void
Writes log information to the agent log.
WriteDebug(string This method has no effect if agent logging
debugMessage,
is disabled, or if called during design time.
DebugMessageTy
pe messageType
=
DebugMessageTy
pe.Information)
void
WriteDebug(string This method has no effect if agent logging
debugMessage,
bool
showMessageInD
esignMode,
DebugMessageTy
pe messageType
=
DebugMessageTy
pe.Information)
void Notify(bool
Triggers notification at the end of an agent
alwaysNotify)
run. If alwaysNotify is set to false, this
method only triggers a notification if the
© 2014 - 2015 Sequentum
309
310
Content Grabber
agent has been configured to send
notifications on critical errors.
void Notify(string
message, bool
run, and adds the message to the
alwaysNotify)
notification email. If alwaysNotify is set to
false, this method only triggers a
notification if the agent has been
configured to send notifications on critical
errors.
GlobalDataDiction
Global data dictionary that can be used to
ary GlobalData
store data that needs to be available in all
scripts and after agent restarts.
Input Parameters are also stored in this
dictionary.
IConnection
Returns the specified database
GetDatabaseConn
connection. The database connection must
ection(string
have been previously defined for the agent
connectionName)
or be a shared connection for all agents on
the computer. Your script is responsible
for opening and closing the connection by
calling the OpenDatabase and
CloseDatabase methods.
string[]
Exports data to the specified export
ExportData(Expor
target. Returns the names of any exported
tTarget
data files if exporting to a file format.
exportTarget,
string sessionId =
null)
public void
Exports data to the specified database.
ExportToDatabas
© 2014 - 2015 Sequentum
Scripting
e(string
databaseConnecti
onName, string
sessionId = null)
string[]
Exports data to a CSV file using the
ExportToCsv(strin default export folders and default options.
g sessionId = null)
Use the ExportData method to specify
CSV export options. Returns the names of
all exported CSV files.
string
Exports data to an Excel file using the
ExportToExcel(str
default export folders and default options.
ing sessionId =
null)
Excel export options. Returns the name of
the exported Excel file.
string
Exports data to an XML file using the
ExportToXml(strin default export folders and default options.
g sessionId = null)
XML export options. Returns the name of
the exported XML file.
The IExportData interface has the following functions and properties:
Property or
Description
Function
IExportReader
Returns a data reader that reads data from
GetTable()
the main table containing exported data. All
other data tables will be child tables of this
table.
IExportReader
this[string name]
a specified data table.
{ get; }
© 2014 - 2015 Sequentum
311
312
Content Grabber
IExportReader
GetTable(string
a specified data table.
name)
bool
Validates the export data and returns True
ValidateData()
if the data is valid and False if the data is
invalid. If the agent that extracted this data
has changed since the data was extracted,
then the data may have become invalid.
The IExportReader interface has the following functions and properties:
Property or
Description
Function
bool Read()
Reads the next data row. Returns False if
there are no more data rows available.
void Close()
Closes the data reader.
string
GetStringValue(st
ring
columnName)
int
GetIntValue(strin
g columnName)
Guid
Gets a GUID value from the current row.
GetGuidValue(stri
ng columnName)
string
GetStringValue(in
t index)
© 2014 - 2015 Sequentum
Scripting
int GetIntValue(int
index)
Guid
Gets a GUID value from the current row.
GetGuidValue(int
index)
Type
Gets the data type of a specified column.
GetFieldType(int
index)
object
Return a data value from the current data
GetFieldValue(stri
row.
ng columnName)
object
GetFieldValue(int
row.
index)
object this[string
columnName]
row.
object this[int
index]
row.
IExportReader
Return a data reader that reads data from
GetChildTable(str
a specified child table of the table
ing tableName)
associated with this data reader. The
returned data reader will only read data
rows that are child data to the current
data row.
string[]
An array containing the names of all child
ChildTableNames
tables of the table associated with this
data reader.
© 2014 - 2015 Sequentum
313
314
14.8
Content Grabber
Data Distribution Scripts
Content Grabber has built-in support for data distribution by email or to an FTP server
when data is exported to file formats such as CSV or Excel. A Data Distribution
script can be used to customize data distribution or distribute data to an unsupported
media. A Data Distribution script could also be used to run some custom functionality
after data has been exported. For example, a script could run a stored procedure in a
database after data has been exported to the database.
You can add a Data Distribution script to an agent by clicking the Script button in
the Data ribbon menu.
The following example executes a stored procedure in a SQL Server database:
using System;
public class Script
{
public static bool DeliverData(DataDeliveryArguments args)
{
args.GetDatabaseConnection("test").ExecuteNonQuery("exec updateData");
return true;
}
}
The script must have a static method which has the following signature:
public static bool DeliverData(DataDeliveryArguments args)
The function should return True if it succeeds and False if it fails.
An instance of the DataDeliveryArguments class is provided by Content Grabber and
© 2014 - 2015 Sequentum
Scripting
has the following functions and properties:
Property or
Description
Function
Agent Agent
The current agent.
ScriptUtils
ScriptUtilities
string[] Files
The exported files that need to be
distributed.
bool IsDebug
void
WriteDebug(strin
This method has no effect if agent logging
g
is disabled, or if it is called during design
debugMessage,
time.
DebugMessageT
ype
messageType =
DebugMessageT
ype.Information)
void
WriteDebug(strin
g
debugMessage,
bool
showMessageIn
DesignMode,
DebugMessageT
ype
messageType =
DebugMessageT
© 2014 - 2015 Sequentum
315
316
Content Grabber
ype.Information)
void Notify(bool
alwaysNotify)
void Notify(string
message, bool
alwaysNotify)
notification if the agent has been configured
to send notifications on critical errors.
GlobalDataDictio
nary GlobalData
dictionary.
IConnection
Returns the specified database connection.
GetDatabaseCon
The database connection must have been
nection(string
previously defined for the agent or be a
connectionName
shared connection for all agents on the
)
computer. Your script is responsible for
opening and closing the connection by
14.9
Content Transformation Scripts
Content transformation scripts are used to transform content after it has been
extracted from a web page. Content transformation is often used on HTML elements
© 2014 - 2015 Sequentum
Scripting
317
to extract information that is not placed in individual elements and therefore cannot be
selected in the web browser. For example, content transformation can be used to
extract parts of an address, such as a postal code, from a single HTML element
containing the full address.
Content transformation scripts can be used in most Capture Commands to transform
the content extracted by the commands, but can also be used in other types of
commands, such as in a Navigate Link command to transform an extracted URL.
A content transformation can be defined as a regular expression or as a C# or
VB.NET script. Regular expressions are often used when you wish to extract sub-text
from a larger piece of extracted text.
The following regular expression example extracts all the text until the first '<'
character:
(.*?)<
return $1
See the topic Script Languages for information about how to use regular expressions
in Content Grabber.
The following script also extracts all text until the first '<' character, but uses C#
© 2014 - 2015 Sequentum
318
Content Grabber
instead of regular expressions:
using System;
public class Script
{
public static string TransformContent(ContentTransformationArguments args)
{
return args.Content.Remove(args.Content.IndexOf('<'));
}
}
public static string TransformContent(ContentTransformationArguments args)
The function will return the transformed content.
An instance of the ContentTransformationArguments class is provided by Content
Grabber and has the following functions and properties:
Property or
Description
Function
Agent Agent
The current agent.
ScriptUtils
ScriptUtilities
Command
The current agent command being
Command
executed.
IConnection
The current internal database connection
DatabaseConnect
used by the agent. This connection is
ion
already open and should not be closed by
your script.
string Content
The extracted content that should be
© 2014 - 2015 Sequentum
Scripting
transformed.
IHtmlNode
The extracted HTML node.
HtmlNode
IInternalDataRow
The current internal data row containing the
DataRow
data that has been extracted so far in the
current container command.
bool IsDebug
IInputData
All input data available to the current
InputDataCache
command.
void
WriteDebug(strin
method has no effect if agent logging is
g debugMessage,
disabled, or if called during design time.
DebugMessageT
ype
messageType =
DebugMessageT
ype.Information)
void
WriteDebug(strin
g debugMessage,
bool
showMessageInD
esignMode,
DebugMessageT
ype
messageType =
DebugMessageT
ype.Information)
void Notify(bool
alwaysNotify)
© 2014 - 2015 Sequentum
319
320
Content Grabber
void Notify(string
message, bool
alwaysNotify)
false, this method only triggers a notification
if the agent has been configured to send
GlobalDataDictio
nary GlobalData
dictionary.
IConnection
GetDatabaseCon
nection(string
connectionName)
IInputDataRow
If the current command is a data provider,
GetInputData()
the data for that command is returned.
Otherwise this function searches the
command's parents and returns the first
found input data.
IInputDataRow
If the specified command is a data provider,
GetInputData(Co
mmand
© 2014 - 2015 Sequentum
Scripting
command)
321
found input data.
IInputDataRow
GetInputData(stri
ng
Otherwise searches the command's parents
commandName)
and returns the first found input data.
IInputDataRow
GetInputData(Gui
d commandId)
Otherwise the function throws an error.
14.10 Data Input Scripts
Data Input scripts can be used to generate data for data provider commands. Data
provider commands include the following types of commands:
Agent
Navigate URL
Set Form Field
Data List
A Data Input script can be added to a command by selecting the data provider Script
from the Data configuration tab:
© 2014 - 2015 Sequentum
322
Content Grabber
The following example generates a list of URLs that could be used in an Agent
command to provide start URLs:
using System;
using System.Collections.Generic;
using System.Data;
public class Script
{
{
List<string> urls = new List<string>();
for(int i=1;i<1000;i++)
{
urls.Add("http://www.domain.com/page.php?ID=" +i.ToString());
}
return urls.ToDataTable("url");
}
}
This script makes use of the extension method ToDataTable which is part of the
Script Utilities.
© 2014 - 2015 Sequentum
Scripting
323
The function must return a DataTable that contains the input data. The DataTable can
have one or more data columns, and all columns will be available to the data provider
command that is using the script.
An instance of the DataProviderArguments class is provided by Content Grabber
Property or
Description
Function
Agent Agent
The current agent.
ScriptUtils
ScriptUtilities
Command
The current agent command being executed.
Command
IContainer
The parent container command of the
ParentContainer
current command.
IConnection
DatabaseConnect
ion
your script.
IHtmlNode
HtmlNode
IInternalDataRow
DataRow
bool IsDebug
bool
If true, only the data schema is required, so
IsSchemaOnly
you can optimize processing by only
returning the data schema with no data.
© 2014 - 2015 Sequentum
324
Content Grabber
IInputData
InputDataCache
command.
void
WriteDebug(strin
g debugMessage,
DebugMessageTy
pe messageType
=
DebugMessageTy
pe.Information)
void
WriteDebug(strin
g debugMessage,
bool
showMessageInD
esignMode,
DebugMessageTy
pe messageType
=
DebugMessageTy
pe.Information)
void Notify(bool
alwaysNotify)
void Notify(string
message, bool
alwaysNotify)
© 2014 - 2015 Sequentum
Scripting
GlobalDataDiction Global data dictionary that can be used to
ary GlobalData
dictionary.
IConnection
GetDatabaseCon
nection(string
connectionName)
IInputDataRow
GetInputData()
found input data.
IInputDataRow
GetInputData(Co
mmand
command)
found input data.
IInputDataRow
GetInputData(stri
ng
commandName)
found input data.
IInputDataRow
© 2014 - 2015 Sequentum
325
326
Content Grabber
GetInputData(Gui
d commandId)
14.11 Image OCR Scripts
An Image OCR script is used to convert an image into plain text. This is often used
when a website uses a CAPTCHA page to block automated access to the website. A
CAPTCHA page contains an image with numbers and characters that a user must
recognize and enter in plain text. Some websites also display emails and phone
numbers as images, and an Image OCR script can be used to convert those images
into plain text, making the data more useful.
Content Grabber does not include any OCR functionality, so you must use a 3rd party
OCR service. The Image OCR script is used to integrate with a 3rd party service.
An Image OCR script can be added to a Download Image command by selecting the
OCR configuration tab and setting the option Convert image to text.
The following two examples show how to integrate with two popular CAPTCHA
recognition services:
© 2014 - 2015 Sequentum
Scripting
327
public static string ConvertImageToText(ConvertImageToTextArguments args)
{
string captcha = DeathByCaptchaService.DecodeCaptcha(args.Image, "login",
"password");
return captcha;
}
The script above integrates with the 3rd party service http://
www.deathbycaptcha.com:
{
string captcha = BypassCaptchaService.DecodeCaptcha(args.Image, "key");
return captcha;
}
The script above integrates with the 3rd party service http://bypasscaptcha.com/
The Image OCR script must have a static function with the following signature:
The function must return the converted image as a string.
An instance of the ConvertImageToTextArguments class is provided by Content
Grabber and has the following functions and properties:
Property or
Description
Function
byte[] Image
The image that needs to be converted to
text.
Agent Agent
The current agent.
ScriptUtils
ScriptUtilities
Command
Command
executed.
© 2014 - 2015 Sequentum
328
Content Grabber
IContainer
ParentContainer
current command.
IConnection
DatabaseConnectio used by the agent. This connection is
n
your script.
IHtmlNode
HtmlNode
IInternalDataRow
The current internal data row containing
DataRow
the data that has been extracted so far in
the current container command.
bool IsDebug
bool IsSchemaOnly If true, only the data schema is required,
so you can optimize processing by only
IInputData
InputDataCache
command.
void
WriteDebug(string
debugMessage,
DebugMessageTyp
e messageType =
DebugMessageTyp
e.Information)
void
WriteDebug(string
debugMessage,
bool
showMessageInDe
© 2014 - 2015 Sequentum
Scripting
signMode,
DebugMessageTyp
e messageType =
DebugMessageTyp
e.Information)
void Notify(bool
alwaysNotify)
void Notify(string
message, bool
alwaysNotify)
notification if the agent has been
configured to send notifications on critical
errors.
GlobalDataDictiona Global data dictionary that can be used to
ry GlobalData
dictionary.
IConnection
Returns the specified database
GetDatabaseConne connection. The database connection must
ction(string
have been previously defined for the agent
connectionName)
or be a shared connection for all agents on
the computer. Your script is responsible
for opening and closing the connection by
© 2014 - 2015 Sequentum
329
330
Content Grabber
IInputDataRow
GetInputData()
found input data.
IInputDataRow
If the specified command is a data
GetInputData(Com
provider, the data for that command is
mand command)
returned. Otherwise this function searches
the command's parents and returns the
first found input data.
IInputDataRow
GetInputData(strin
g commandName)
returned. Otherwise this function searches
the command's parents and returns the
first found input data.
IInputDataRow
GetInputData(Guid
commandId)
returned. Otherwise the function throws an
error.
14.12 Convert Document to HTML Scripts
A Convert Document to HTML script is used to convert a downloaded document into
a HTML page, so Content Grabber can extract data from the document the same way
as for any other HTML page.
Please see the topic Extracting Data From Non-HTML Documents for more
information.
A Convert Document to HTML script can be added to a Download Document
command by selecting the Convert to HTML configuration tab and setting the option
Convert to HTML:
© 2014 - 2015 Sequentum
Scripting
331
The following example is the default Convert Document to HTML script. This script
checks the type of the downloaded document, and uses an appropriate document
converter to convert the document to a HTML page:
using System;
using System.IO;
public class Script
{
public static bool ConvertDocumentToHtml(ConvertDocumentToHtmlArguments args)
{
if(args.DocumentType=="pdf")
ScriptUtils.ExecuteCommandLine(@"Converters\pdftohtml\pdftohtml.exe",
args.DocumentFilePath, args.HtmlFilePath, "-noframes");
else if(args.DocumentType=="docx")
ScriptUtils.ExecuteCommandLine(@"Converters\docxtohtml\docxtohtml.exe",
args.DocumentFilePath, args.HtmlFilePath, "");
if(!File.Exists(args.HtmlFilePath))
return false;
return true;
}
}
The function should return True if the conversion succeeds or False if the conversion
© 2014 - 2015 Sequentum
332
Content Grabber
failed.
An instance of the ConvertDocumentToHtmlArguments class is provided by
Content Grabber and has the following functions and properties:
Property or
Description
Function
string
The path of the document that needs to be
DocumentFilePat
converted to HTML.
h
string
The type of document that needs to be
DocumentType
converted to HTML. For example, if the
document is a PDF document, the
document type will be pdf.
string
The file path the script should use for the
HtmlFilePath
converted HTML file.
Agent Agent
The current agent.
ScriptUtils
ScriptUtilities
Command
Command
executed.
IContainer
ParentContainer
current command.
IConnection
DatabaseConnec
tion
your script.
IHtmlNode
HtmlNode
IInternalDataRow
© 2014 - 2015 Sequentum
Scripting
DataRow
bool IsDebug
bool
IsSchemaOnly
IInputData
InputDataCache
command.
void
WriteDebug(strin
g
debugMessage,
DebugMessageT
ype
messageType =
DebugMessageT
ype.Information)
void
WriteDebug(strin
g
debugMessage,
bool
showMessageIn
DesignMode,
DebugMessageT
ype
messageType =
DebugMessageT
ype.Information)
void Notify(bool
© 2014 - 2015 Sequentum
333
334
Content Grabber
alwaysNotify)
void
Notify(string
message, bool
alwaysNotify)
GlobalDataDictio
nary GlobalData
dictionary.
IConnection
GetDatabaseCon
nection(string
connectionName
)
IInputDataRow
GetInputData()
found input data.
IInputDataRow
GetInputData(Co
© 2014 - 2015 Sequentum
Scripting
mmand
command)
335
found input data.
IInputDataRow
GetInputData(str
ing
commandName)
found input data.
IInputDataRow
GetInputData(Gu
id commandId)
14.13 Custom Scripts
A custom script is the script used by an Execute Script command to execute custom
functionality. The Execute Script command provides a list of predefined scripts that
can be used directly or edited to suit specific needs. Please see the Execute Script
topic for more information about how to use the predefined scripts.
A custom script can be added to an Execute Script command by setting the Script
type to Custom Script:
© 2014 - 2015 Sequentum
336
Content Grabber
The following example writes extracted data to a database. Normally a Data Export
script would be used to write extracted data after an agent has completed, but an
Execute Script command could be used to write data as it's being extracted:
using System;
public class Script
{
public static CustomScriptReturn CustomScript(CustomScriptArguments args)
{
IConnection connection = args.GetDatabaseConnection("exportDatabase");
connection.OpenDatabase();
try
{
ICommand command = connection.GetNewCommand();
command.SetSql("insert into export_table values (@title,
@description)");
command.AddParameterWithValue("title", args.DataRow["title"]);
command.AddParameterWithValue("description",
args.DataRow["description"]);
command.ExecuteNonQuery();
}
finally
{
connection.CloseDatabase();
}
return CustomScriptReturn.Empty();
}
}
The above script uses the IConnection and ICommand interfaces which are part of
Content Grabber's Script Utilities.
The script must have a static method with the following signature.
The function must return a DataTable that contains the input data. The DataTable can
have one or more data columns, and all columns will be available to the data provider
command that is using the script.
© 2014 - 2015 Sequentum
Scripting
337
An instance of the DataProviderArguments class is provided by Content Grabber
Property or
Description
Function
Agent Agent
The current agent.
ScriptUtils
ScriptUtilities
Command
Command
executed.
IContainer
ParentContainer
current command.
IConnection
DatabaseConnect
ion
your script.
IHtmlNode
HtmlNode
IInternalDataRow
DataRow
bool IsDebug
bool
IsSchemaOnly
IInputData
InputDataCache
command.
void
© 2014 - 2015 Sequentum
338
Content Grabber
WriteDebug(strin
g debugMessage,
DebugMessageT
ype
messageType =
DebugMessageT
ype.Information)
void
WriteDebug(strin
g debugMessage,
bool
showMessageInD
esignMode,
DebugMessageT
ype
messageType =
DebugMessageT
ype.Information)
void Notify(bool
alwaysNotify)
void Notify(string
message, bool
alwaysNotify)
GlobalDataDictio
nary GlobalData
© 2014 - 2015 Sequentum
Scripting
dictionary.
IConnection
GetDatabaseCon
nection(string
connectionName)
IInputDataRow
GetInputData()
found input data.
IInputDataRow
GetInputData(Co
mmand
command)
found input data.
IInputDataRow
GetInputData(stri
ng
commandName)
found input data.
IInputDataRow
GetInputData(Gui
d commandId)
© 2014 - 2015 Sequentum
339
340
Content Grabber
© 2014 - 2015 Sequentum
Programming Interface
15
341
The Content Grabber programming interface (API) provides access to the Content
Grabber runtime from your own applications. The Content Grabber runtime can be
distributed with your applications royalty free and does not require the Content
Grabber application to be installed on the target computer. The Content Grabber
runtime requires .NET version 4 client profile or higher, and Internet Explorer 9 or
higher.
If you want to run agents from a web application, you must use the Content Grabber
proxy API. The proxy API interfaces with the Content Grabber runtime using a
Windows service. The Windows service is installed as part of the Content Grabber
application, so Content Grabber must be installed on the web server.
In this chapter, you can learn more about Building a Desktop Application and Building
a Web Application.
The Content Grabber programming interface (API) provides access to the Content
Grabber runtime from your own applications. The Content Grabber runtime can be
distributed with your applications royalty free and does not require the Content
Grabber application to be installed on the target computer. The Content Grabber
runtime requires .NET version 4 client profile or higher, and Internet Explorer 9 or
higher.
If you want to run agents from a web application, you must use the Content Grabber
proxy API. The proxy API interfaces with the Content Grabber runtime using a
Windows service. The Windows service is installed as part of the Content Grabber
application, so Content Grabber must be installed on the web server.
You can also use simple web requests to call the Windows service without the use of
the Content Grabber proxy API. This means that you can easily execute agents and
retrieve extracted data from non-Windows environments, such as from a PHP page on
a Linux server.
© 2014 - 2015 Sequentum
342
Content Grabber
In this chapter, you can learn more about Building a Desktop Application and Building
a Web Application.
API Restrictions
The API cannot change the agent command structure, or alter the agent in a way that
would change the output data structure. You cannot add, delete, move or copy
commands, and you cannot change the Disabled and Export command properties.
15.1
Building a Desktop Application
The Content Grabber application can build stand-alone agents. This is a convenient
way to distribute agents that can be configured and run without requiring the full
Content Grabber application on the target computer. However, stand-alone agents
have a standardized user interface and can only export data to file formats. If you
want full control of the user interface and export features, you can build your own
desktop application using the Content Grabber API.
See these topics for more information:
Visual Studio Configuration
Distributing Your Application
Example
The following example uses the API to run an agent with a set of input parameters. It
sets the log level to high and specifies that log information should be written to file.
AgentApi api = new AgentApi(@"C:\Users\Public\Documents\Content Grabber\Agents\
qantasApiTest\qantasApiTest.scg");
settings.InputParameters.Add("from", "SYD");
settings.InputParameters.Add("to", "MEL");
settings.InputParameters.Add("departure_date", DateTime.Now.ToString("yyyy-MM-dd"));
settings.InputParameters.Add("return_date", DateTime.Now.ToString("yyyy-MM-dd"));
settings.InputParameters.Add("travel_class", "ECO");
© 2014 - 2015 Sequentum
settings.InputParameters.Add("adults", "1");
settings.InputParameters.Add("children", "0");
api.RunAgent(settings);
Agent API Functions
Function
Description
AgentApi(string
Instantiates a new API class with the
agentNameOrPat
specified agent and session ID. You can
h, string
specify the full path to an agent file or just
sessionId)
the name of the agent. If you only specify
the agent name, Content Grabber will look
for the agent in the default location for the
user running the agent service. The default
agent location for the local System account
is:
C:\Users\Public\Documents\Content
Grabber\Agents
AgentApi(string
Instantiates a new API class without a
agentNameOrPat
session. You can specify the full path to an
h)
agent file or just the name of the agent. If
you only specify the agent name, Content
Grabber will look for the agent in the default
location for the user running the agent
service. The default agent location for the
local System account is:
Grabber\Agents
© 2014 - 2015 Sequentum
343
344
Content Grabber
void
Connects to a Content Grabber agent
Connect(string
service. You can specify the server name or
endPointAddress
IP address and port number. The default
)
connection string for a local service is:
http://localhost:8000/ContentGrabber
void
Closes the connection to the Content
CloseConnection
Grabber agent service.
()
void StartAgent()
Runs the agent specified when instantiating
the API. The agent will run asynchronously.
void
Runs the agent with additional settings. The
StartAgent(Agent
agent will run asynchronously. See below
Settings
for more information about the
settings)
AgentSettings class.
void StopAgent()
Stops the agent if it is currently running.
void
Closes an agent session. When you close
CloseAgentSessi
an agent session, all data associated with
on()
that session is removed and you will not be
able to retrieve status information about the
agent that ran in this session. You can only
close a session if an agent is not currently
running in the session.
You don't need to close a session. Session
data will be removed automatically after the
agent has completed running and the
session timeout has elapsed. The default
© 2014 - 2015 Sequentum
session timeout is 30 minutes, so by default
session data will be removed automatically
30 minutes after the agent has completed.
AgentStatus
Returns status information about an agent
GetAgentStatus(
that has been run asynchronously. See
)
below for more information about the
AgentStatus class.
DataTable
Returns progress information in a
GetAgentProgre
DataTable about an agent running in
ssAsDataTable()
asynchronously. See below for more
information about the information returned.
DataTable
Returns progress information as a JSON
GetAgentProgre
string about an agent running in
ssAsJson()
DataTable
Returns log data in a DataTable for an
GetAgentLogAsD
agent that has been run asynchronously.
ataTable(string
This function does not return any data if
agentNameOrPat
logging is disabled or if logging is written to
h, string
file. See below for more information about
sessionId)
the information returned.
string
Returns log data as a JSON string for an
GetAgentLogAsJ
son(string
agentNameOrPat
h, string
sessionId)
DataSet
Returns extracted data in a DataSet for an
GetAgentExport
DataAsDataSet()
© 2014 - 2015 Sequentum
345
346
Content Grabber
string
Returns extracted data as a JSON string for
GetAgentExport
an agent that has been run asynchronously.
DataAsJson()
string
Returns extracted data as an XML string for
GetAgentExport
DataAsXml()
RunAgentReturn
Runs an agent synchronously and returns
Json
extracted data as a JSON string. The agent
is always run in a session when the agent
supports sessions, and the session is
closed automatically after the agent has
completed its run.
RunAgentReturn
Xml
extracted data as an XML string. The agent
completed its run.
RunAgentReturn
DataSet
extracted data in a DataSet. The agent is
always run in a session when the agent
completed its run.
RunAgentReturn
Runs an agent synchronously with additional
Json(AgentSettin
settings and returns extracted data as a
gs settings)
JSON string. The agent is always run in a
session when the agent supports sessions,
and the session is closed automatically after
the agent has completed its run.
© 2014 - 2015 Sequentum
See below for more information about the
RunAgentReturn
Xml(AgentSettin
settings and returns extracted data as an
gs settings)
XML string. The agent is always run in a
RunAgentReturn
DataSet
settings and returns extracted data in a
DataSet. The agent is always run in a
public Agent
Returns the agent specified when
GetAgent()
instantiating the API class.
public void
Saves the specified agent.
SaveAgent(Agent
agent)
AgentSettings
The following agent settings can be specified when running an agent:
Property
© 2014 - 2015 Sequentum
Description
347
348
Content Grabber
bool LogLevel
Log detail level. Set the log level to
None to turn off logging.
bool IsLogHtml
Logs the raw HTML of all web pages
processed by the agent.
bool IsLogToFile
Logs data to a file instead of a
database table.
int Timeout
This value specifies the session timeout
in minutes when an agent is run
asynchronously. All session data is
removed automatically when the agent
has completed and this timeout has
elapsed. The default session timeout is
30 minutes.
This value specifies the maximum
number of seconds an agent will run
when it's run synchronously. When the
timeout is reached, the agent will stop
and close its session if it's run in a
session. The default timeout is 30
seconds.
Dictionary<string,
A list of input parameters.
string>
InputParameters
AgentStatus
An agent can provide the following status information:
Property
Description
© 2014 - 2015 Sequentum
PageLoads
The RunStatus can be one of the following
values:
Completed. The agent has completed
successfully.
Incomplete. The agent has completed,
but stopped prematurely. The agent
may have been stopped manually.
Failed. The agent has completed, but a
critical error occurred.
Idle. The agent has never been run.
Starting. The agent is starting.
ExportingData. The agent is exporting
data to the specified export target.
Stopping. The agent is in the process
of stopping.
Restarting. The agent is restarting.
This usually occurs when the agent
needs to clear JavaScript memory
leaks.
ExportFailed. The agent completed,
but failed to export data.
MissingElement
s
StartTime
The number of page loads. This includes
AJAX calls triggered by agent actions.
The amount of time the agent has run.
int
The number of times an agent command
MissingElements
could not find it's specified content where
the content was not specified as optional.
int PageErrors
The number of page load errors. This
includes errors loading content from AJAX
calls that were triggered by agent actions.
© 2014 - 2015 Sequentum
349
350
Content Grabber
DateTime
The time the agent started.
StartTime
Agent Progress Data
An agent can provide progress data in a DataTable. The DataTable contains a
DataRow for each web browser the agent is using to extract data. Each DataRow
contains a status column and a description column. The progress data is the same
information displayed when running an agent in the Content Grabber agent editor.
Agent Log Data
An agent can provide log data in a DataTable. The DataTable contains a log level
column and a description column. A log level of 1 means an error, 2 means a warning
and 3 means information. The log data is the same data you can view in the Content
Grabber agent editor.
Agent Export Data
The API can provider extracted data in a DataSet, as an XML string or as a JSON
string. The API provides the entire extracted data set in one operation, so for large
data sets it's recommend to export the data to an external database and read it
directly from there.
15.1.1 Visual Studio Configuration
The Content Grabber API is a 32-bit library and since the library will be embedded
into your application, your application also needs to be a 32-bit application. To make
sure your application is a 32-bit application, go to project properties in Visual Studio
and set the platform target to x86.
© 2014 - 2015 Sequentum
The Content Grabber API uses the .NET v4.0 client profile framework, so your
application should be configured to use the same or a higher version of the .NET
framework. To set the .NET framework version in Visual Studio, go to project
properties in Visual Studio and set the target framework to .NET v4 or higher.
© 2014 - 2015 Sequentum
351
352
Content Grabber
Assembly References
You must add the following references to your project to use the The Content Grabber
API:
AgentApi.dll
AgentProxy.dll
Once you add these assemblies to your project references, Visual Studio will
automatically copy these files and some dependent assemblies to your project's BIN
folder. However, not all required assemblies will be copied to your BIN folder, so you
must manually ensure the following files are in your BIN folder. The files are found in
the Content Grabber installation folder:
AgentApi.dll
AgentProxy.dll
© 2014 - 2015 Sequentum
353
AjaxHook.dll
msvcp120.dll
msvcr120.dll
SQLite.Interop.dll
RunAgent.exe
RunAgentProcess.exe
BaseEngine.dll
MySql.Data.dll
15.1.2 Distributing Your Application
You can distribute your desktop applications without having to install Content Grabber
on the target computers. You just have to make sure the following files are in the
same folder as your executable program:
AgentApi.dll
AgentProxy.dll
AjaxHook.dll
msvcp120.dll
msvcr120.dll
SQLite.Interop.dll
RunAgent.exe
RunAgentProcess.exe
BaseEngine.dll
MySql.Data.dll
15.2
Building a Web Application
The Content Grabber API can be used to run agents from your own applications.
When using the API directly, your application needs security privileges that are
standard for desktop applications, but not for web applications. It can be a complex
© 2014 - 2015 Sequentum
354
Content Grabber
task to ensure a web application has the required security privileges, and you may end
up giving the application too many security privileges, making it vulnerable to security
breaches.
Important: We don't support using the full API from a web
application. We only support the use of the API proxy in web
applications. You may be able to use the full API, depending on
the security policies on the web server. We are unable to help
you configure a web server for this purpose.
We recommend you use the Content Grabber agent service when using the API in
web applications. The Content Grabber agent service provides most API functionality
through a simple proxy assembly that requires no special security settings and
depends on no other files than the single proxy assembly file. The proxy assembly can
run in both 32-bit and 64-bit applications, which makes it even easier to use this
assembly rather than the full API. Read more in the topic Using the Content Grabber
Agent Service.
15.2.1 Using the Content Grabber Agent Service
Content Grabber includes a Windows service that can be used to run agents. Your
web application communicates with the Windows service using a small proxy
assembly that requires no special security privileges and depends on no other files.
You simply add the proxy assembly to your web application's assembly references
and use the proxy to call the API functions. The proxy assembly contains the same
methods as the standard API, except for functions used to load and save agents. The
proxy can only execute agents, not load agents, because the proxy is designed to rely
on no other assemblies and loading an agent would require the agent definition
classes found in the full Content Grabber API Reference.
Before you can use the proxy to communicate with the Windows Service, you need
to connect to the service. You use the proxy method Connect to connect to the
© 2014 - 2015 Sequentum
355
service. The default connection string is:
If your Content Grabber service is located on a remote server, you can change
localhost to the name or IP address of the remote server. The service is listed on port
8000 by default, but you can change the port in the Content Grabber editor. The
Content Grabber service is stopped by default and configured to start manually. If you
are going to use this service, you should configure the service to start automatically.
The Content Grabber service is configured to logon as the local System account by
default. You can change this account if you wish, but if you decide to continue using
the local System account, you must make sure this account has proper access to the
Internet through Internet Explorer. For example, Internet cookies may have been
disabled for the System account, or there may be other Internet restrictions for this
account. To test and change Internet setting for the System account, open the
Content Grabber editor and click the Test IE as System button in the Application
menu. Pressing this button will open Internet Explorer using the System account and
you can then test Internet access and change settings in Internet Explorer if
required.
All Windows services, including the Content Grabber agent service, run in a special
Windows session that cannot interact with users and cannot display user interfaces.
JavaScript on some websites does not work correctly without at least a hidden web
browser window. Such websites are rare, but if you need to run such a website
through the Content Grabber agent service, the service needs to run the agent in a
normal user session. In order for the Windows service to start an agent in a normal
user session the following three conditions must be met.
If your Content Grabber service is located on a remote server, you can change
localhost to the name or IP address of the remote server. The service is
listening on port 8001 by default, but you can change the port in the Content
Grabber editor. The Content Grabber service is stopped by default and
configured to start manually. If you are going to use this service, you should
© 2014 - 2015 Sequentum
356
Content Grabber
configure the service to start automatically.
The Content Grabber service is configured to logon as the local System account
by default. You can change this account if you wish, but if you decide to continue
using the local System account, you must make sure this account has proper
access to the Internet through Internet Explorer. For example, Internet cookies
may have been disabled for the System account, or there may be other Internet
restrictions for this account. To test and change Internet setting for the System
account, open the Content Grabber editor and click the Test IE as System
button in the Application menu. Pressing this button will open Internet Explorer
using the System account and you can then test Internet access and change
settings in Internet Explorer if required.
All Windows services, including the Content Grabber agent service, run in a
special Windows session that cannot interact with users and cannot display user
interfaces. JavaScript on some websites does not work correctly without at
least a hidden web browser window. Such websites are rare, but if you need to
run such a website through the Content Grabber agent service, the service
needs to run the agent in a normal user session. In order for the Windows
service to start an agent in a normal user session the following three conditions
must be met.
1. You must set the agent option Run Interactively.
2. The Content Grabber agent service must run under the System account.
3. A user must be logged onto the computer while the agent runs. The agent will run
in the Windows session of the logged in user, but in the security context of the
System account.
Example
The following example connects to a local Content Grabber agent service and runs an
agent in a new session. It also sets the log level to high and specifies that log
information should be written to file.
© 2014 - 2015 Sequentum
357
string sessionId = Guid.NewGuid().ToString();
AgentProxy proxy = new AgentProxy(@"C:\Users\Public\Documents\Content Grabber\
Agents\qantas\qantas.scg", sessionId);
proxy.Connect("http://localhost:8001/ContentGrabber");
proxy.StartAgent(settings);
You can connect to the Content Grabber agent service again at any time to get status
information about a specified agent running in a specified session.
AgentProxy qantasProxy = new AgentProxy(@"C:\Users\Public\Documents\Content Grabber\
Agents\qantas\qantas.scg", sessionId);
qantasProxy.Connect("http://localhost:8001/ContentGrabber");
AgentStatus status = qantasProxy.GetAgentStatus();
If you are running an agent from a web application, you could use AJAX callbacks to
get status information about a running agent. You only have to keep track of the
session ID between callbacks.
Proxy Functions
The following functions are available in the proxy assembly:
Function
Description
AgentProxy(string
Instantiates a new proxy with the
agentNameOrPath,
string sessionId)
specify the full path to an agent file or
just the name of the agent. If you only
specify the agent name, Content
Grabber will look for the agent in the
default location for the user running the
agent service. The default agent location
© 2014 - 2015 Sequentum
358
Content Grabber
for the local System account is:
C:\Users\Public\Documents\Content Grabber
\Agents
AgentProxy(string
Instantiates a new proxy without a
agentNameOrPath)
session. You can specify the full path to
an agent file or just the name of the
agent. If you only specify the agent
name, Content Grabber will look for the
agent in the default location for the user
running the agent service. The default
agent location for the local System
account is:
\Agents
void Connect(string
Connects to the Content Grabber agent
endPointAddress)
service. You can specify the server
name or IP address and port number.
The default connection string for a local
service is:
void
CloseConnection()
void StartAgent()
Starts the agent specified when
instantiating the proxy. The agent will run
asynchronously.
void
Starts the agent with additional settings.
© 2014 - 2015 Sequentum
StartAgent(AgentSe
The agent will run asynchronously. See
ttings settings)
void StopAgent()
Stops the agent if it is currently running
asynchronously.
void
Closes an agent session after the agent
CloseAgentSession(
has been run asynchronously. When you
)
close an agent session, all data
associated with that session is removed
and you will not be able to retrieve status
information about the agent that ran in
this session. You can only close a
session if an agent is not currently
You don't need to close a session.
Session data will be removed
automatically after the agent has
completed running and the session
timeout has elapsed. The default session
timeout is 30 minutes, so by default
session data will be removed
automatically 30 minutes after the agent
has completed.
AgentStatus
Returns status information about an
GetAgentStatus()
See below for more information about
the AgentStatus class.
DataTable
GetAgentProgressA
sDataTable()
© 2014 - 2015 Sequentum
359
360
Content Grabber
information about the information
returned.
DataTable
Returns progress information as JSON
GetAgentProgressA
about an agent running in
sJson()
returned.
DataTable
GetAgentLogAsData
Table(string
agentNameOrPath,
logging is disabled or if logging is written
string sessionId)
to file. See below for more information
about the information returned.
string
Returns log data as JSON for an agent
GetAgentLogAsJso
that has been run asynchronously. This
n(string
function does not return any data if
agentNameOrPath,
string sessionId)
DataSet
Returns extracted data in a DataSet for
GetAgentDataAsDat
an agent that has been run
aSet()
asynchronously.
string
Returns extracted data as a JSON string
GetAgentDataAsJso
for an agent that has been run
n()
asynchronously.
string
Returns extracted data as an XML string
GetAgentDataAsXml
()
asynchronously.
RunAgentReturnJso
Runs an agent synchronously and
n
returns extracted data as a JSON string.
© 2014 - 2015 Sequentum
The agent is always run in a session
when the agent supports sessions, and
the session is closed automatically after
RunAgentReturnXml
returns extracted data as an XML string.
RunAgentReturnDat
aSet
returns extracted data in a DataSet. The
agent is always run in a session when
the agent supports sessions, and the
session is closed automatically after the
agent has completed its run.
RunAgentReturnJso
Runs an agent synchronously with
n(AgentSettings
additional settings and returns extracted
settings)
data as a JSON string. The agent is
completed its run.
the AgentSettings class.
RunAgentReturnXml
(AgentSettings
settings)
data as an XML string. The agent is
© 2014 - 2015 Sequentum
361
362
Content Grabber
completed its run.
RunAgentReturnDat
aSet(AgentSettings
settings)
data in a DataSet. The agent is always
run in a session when the agent
completed its run.
AgentSettings
The following agent settings can specified when running an agent:
Property
Description
bool LogLevel
bool IsLogHtml
Logs the raw HTML of all web pages
processed by the agent.
bool IsLogToFile
Logs data to a file instead of a
database table.
int Timeout
This value specifies the session timeout
in minutes when an agent is run
asynchronously. All session data is
© 2014 - 2015 Sequentum
elapsed. The default session timeout is
30 minutes.
This value specifies the maximum
number of seconds an agent will run
when it's run synchronously. When the
timeout is reached, the agent will stop
and close its session if it's run in a
session. The default timeout is 30
seconds.
Dictionary<string,
string>
InputParameters
GlobalData
Any serializable data object can be
stored in this dictionary and will be
available to all scripts in an agent.
Notice that input parameters will
eventually be stored in this dictionary
as well, so it doesn't matter if you use
input parameters or global data to
store your input data.
ProxyList
A list of web proxies that will be used
by the agent. If a proxy list is specified,
it overrides any default proxy settings in
the agent.
AgentStatus
Property
Description
AgentRunningStatus
The RunStatus can be one of the
© 2014 - 2015 Sequentum
363
364
Content Grabber
RunStatus
following values.
Completed. The agent has
completed successfully.
Incomplete. The agent has
completed, but stopped prematurely.
The agent may have been stopped
manually.
Failed. The agent has completed, but
a critical error occurred.
ExportingData. The agent is
exporting data to the specified export
target.
Stopping. The agent is in the process
of stopping.
Restarting. The agent is restarting.
This usually occurs when the agent
needs to clear JavaScript memory
leaks.
ExportFailed. The agent completed,
but failed to export data.
int PageLoads
The number of page loads. This
includes AJAX calls triggered by agent
actions.
TimeSpan Runtime
int MissingElements
The number of times an agent
command could not find it's specified
content where the content was not
specified as optional.
© 2014 - 2015 Sequentum
int PageErrors
365
includes errors loading content from
AJAX calls that were triggered by agent
actions.
DateTime StartTime
Agent Progress Data
Agent Log Data
Agent Export Data
The API can provider extracted data in a DataSet, as an XML string or as a JSON
string. The API provides the entire extracted data set in one operation, so for large
data sets it's recommend to export the data to an external database and read it
directly from there.
15.2.2 Using Simple Web Requests
The Content Grabber Windows service supports simple web requests, so you
can execute agents and retrieve extracted data from non-Windows
environments, such as from a PHP page on a Linux server.
The Windows service is listening for web requests on port 8002 by default. You
© 2014 - 2015 Sequentum
366
Content Grabber
can change this port number in the Content Grabber editor. The service is
stopped by default and configured to start manually. If you are going to use this
service, you should configure the service to start automatically. Please see
Using the Content Grabber Agent Service for more information about the
Windows service.
This is an example of a web request that executes an agent named Sequentum
and provides some input values for the agent.
http://localhost:8002/ContentGrabber/RunAgentReturnJson?
agent=sequentum&pars={"StartDate":"2015-10-15","EndDate":"2015-12-15"}
The above web request executes the agent synchronously and returns the
extracted data as a JSON string.
Web Request Functions
The following functions are available when using web requests:
Function
Description
StartAgent?
Starts an agent that will run
agent={agentName
asynchronously.
OrPath}
&sessionId={sessio
This function supports both GET and
nId}
POST requests. If you need to specify a
&sessionTimeout={
long list of input parameters you must
sessionTimeout}
use POST requests, since GET requests
&logLevel={logLevel
are limited in length.
}
&logHtml={isLogHt
This function accepts the following
ml}
parameters:
&logToFile={isLogT
oFile}
agent (required): You can specify the
full path to an agent file or just the name
© 2014 - 2015 Sequentum
&pars={inputParam
of the agent. If you only specify the
eters}
agent name, Content Grabber will look
for the agent in the default location for
the user running the agent service. The
default agent location for the local
System account is:
\Agents
sessionId (optional): The session will
run in a session with the specified
session ID.
sessionTimeout (optional): If the agent
is running in a session, this value
specifies the number of minutes the
agent data will be available before it's
deleted. The default session timeout is
30 minutes.
logLevel (optional): Log detail level.
Set the log level to None to turn off
logging. The default log level is None.
Accepted values are None, Low, Medium
and High.
logHtml (optional): Logs the raw HTML
of all web pages processed by the
agent. The default value is False.
logToFile (optional): Logs data to a file
instead of a database table. The default
© 2014 - 2015 Sequentum
367
368
Content Grabber
value is False.
Pars (optional): A JSON formatted list
of input values that can be used by the
agent. The JSON string should be URL
encoded.
RunAgentReturnJso
n?
the extracted data as a JSON string.
agent={agentName
OrPath}
&timeout={timeout}
&logLevel={logLevel
}
&logHtml={isLogHt
ml}
&logToFile={isLogT
oFile}
&pars={inputParam
eters}
parameters:
System account is:
© 2014 - 2015 Sequentum
\Agents
Timeout (optional): This maximum
number of seconds an agent will run.
When the timeout is reached, the agent
will stop and close its session if it's run in
a session. The default timeout is 30
seconds.
and High.
value is False.
encoded.
RunAgentReturnXml
?
the extracted data as an XML string.
agent={agentName
OrPath}
&timeout={timeout}
© 2014 - 2015 Sequentum
369
370
Content Grabber
&logLevel={logLevel
}
&logHtml={isLogHt
ml}
&logToFile={isLogT
oFile}
&pars={inputParam
eters}
parameters:
System account is:
\Agents
seconds.
© 2014 - 2015 Sequentum
and High.
value is False.
encoded.
StopAgent?
agent={agentName
asynchronously.
OrPath}
&sessionId={sessio
This function supports only GET
nId}"
requests.
CloseAgentSession
?
agent={agentName
OrPath}
&sessionId={sessio
nId}
© 2014 - 2015 Sequentum
371
372
Content Grabber
has completed.
requests.
GetAgentStatus?
agent={agentName
OrPath}
&sessionId={sessio
nId}
requests.
GetAgentProgressA
sJson?
about an agent running in a
agent={agentName
OrPath}
&sessionId={sessio
returned.
nId}
requests.
GetAgentLogAsJso
n?
agent={agentName
OrPath}
&sessionId={sessio
© 2014 - 2015 Sequentum
373
nId}
requests.
DataSet
GetAgentDataAsDat
aSet()
asynchronously.
requests.
GetAgentDataAsJso
n?
agent={agentName
asynchronously.
OrPath}
&sessionId={sessio
nId}
requests.
GetAgentDataAsXml
?
agent={agentName
asynchronously.
OrPath}
15.3
&sessionId={sessio
nId}
requests.
Sessions
Sessions allow you to run multiple instances of the same agent at the same time. This
is particularly useful when running an agent directly from a website where you can
have many web users visiting the website at the same time. Each web user can start
the agent in their own session and the user will only see agent progress status and
extracted data belonging to his session.
Running an Agent in a Session
To run an agent in a session, you must specify a session ID when you run the agent
© 2014 - 2015 Sequentum
374
Content Grabber
and the agent must be configured to support sessions. To configure an agent to
support sessions, use the agent option Support Sessions in the Sessions section on
the Advanced options tab.
A session ID can be specified when running an agent via the Agent API or the Agent
Proxy. You can use any string as session ID, but each session must have a unique ID,
so you would normally use a unique identifier. The following C# code snippet runs an
agent with a unique session ID:
string sessionId = Guid.New().ToString();
AgentApi agent = new AgentApi("sequentum", sessionId);
agent.Run(new AgentSettings());
A session ID can also be specified when running an agent from the commandline by
using the command-line option session_id. The following command line input runs an
agent named sequentum with a session ID ry37456r.
RunAgent.exe "sequentum" session_id "ry37456r"
Session Data Cleanup
When running an agent normally without sessions, the agent can cleanup previously
extracted data and status information before it starts because only data and status
from the last run is important. Data cleanup becomes much more complex when
running agent sessions. An agent session should only cleanup data from its own
session, so it cannot do a general cleanup of all data belonging to the agent. Since all
sessions are normally new sessions there is no previous session data to cleanup, and
the agent cannot cleanup data when it completes a session since you'll need access
to the data for some time, so session data risks hanging around forever.
You can manually trigger a session cleanup by calling the API method CloseSession,
but in a web application you may not always be able to tell when a user session ends.
When running an agent in a session you can specify a session timeout, and the
session cleanup will start automatically after the agent session completes and the
© 2014 - 2015 Sequentum
375
specified timeout has passed. This gives your application a minimum amount of time to
use the session data before it's removed.
An agent only cleans up externally exported session data if the agent is configured to
export data to a database. If you are using an Export Script or if you are exporting to
a file format, then you are responsible for any cleanup of exported session data.
You may not always want to remove externally exported session data automatically
when an agent session ends. To prevent session data from being removed, set the
agent option Cleanup External Session Data to false. This option can be found in
the Sessions section on the advanced options tab.
If you are not using the API to retrieve extracted data, but only working with externally
exported session data, then you can use the option Remove Internal Session Data
Immediately to remove internal session data immediately after it has been externally
exported. This will reduce the size and increase performance of the internal database.
Session status information is not removed and will still be available until the session
timeout has passed.
15.4
API Reference
The Content Grabber API consists of two components, the Agent API Library and
the Agent Proxy Library. The two libraries provide similar functionality, but the Agent
Proxy library must connect to the Content Grabber agent service while the Agent API
is a stand-alone library that can either connect to a Content Grabber agent service or
work on its own.
The Agent API contains agent definition classes that allows you to load, modify and
save agents. You cannot add or remove commands from an agent, but you can
modify properties of existing commands. The Proxy API can only run agents and
cannot load and save agents.
© 2014 - 2015 Sequentum
376
Content Grabber
The Proxy API depends on no other files than the Proxy API assembly file. The
Agent API depends on the following files which are found in the Content Grabber
installation folder. These files must be copied to the folder of your executable
program:
AgentApi.dll
AgentProxy.dll
AjaxHook.dll
msvcp120.dll
msvcr120.dll
SQLite.Interop.dll
RunAgent.exe
RunAgentProcess.exe
BaseEngine.dll
MySql.Data.dll
Agents can also be run using simple web requests. Web requests
can be sent from most applications and require no special Content
Grabber API libraries.
Agent API Reference
Function
Description
AgentApi(string
Instantiates a new API class with the
agentNameOrPat
h, string
specify the full path to an agent file or just
sessionId)
the name of the agent. If you only specify
the agent name, Content Grabber will look
for the agent in the default location for the
user running the agent service. The default
agent location for the local System account
is:
© 2014 - 2015 Sequentum
Grabber\Agents
AgentApi(string
Instantiates a new API class without a
agentNameOrPat
session. You can specify the full path to an
h)
agent file or just the name of the agent. If
you only specify the agent name, Content
Grabber will look for the agent in the default
location for the user running the agent
service. The default agent location for the
local System account is:
Grabber\Agents
void
Connects to a Content Grabber agent
Connect(string
service. You can specify the server name or
endPointAddress
IP address and port number. The default
)
connection string for a local service is:
void
CloseConnection
()
void StartAgent()
Runs the agent specified when instantiating
the API. The agent will run asynchronously.
void
Runs the agent with additional settings. The
StartAgent(Agent
agent will run asynchronously. See below
Settings
for more information about the
© 2014 - 2015 Sequentum
377
378
Content Grabber
settings)
void StopAgent()
Stops the agent if it is currently running.
void
Closes an agent session. When you close
CloseAgentSessi
an agent session, all data associated with
on()
that session is removed and you will not be
able to retrieve status information about the
agent that ran in this session. You can only
close a session if an agent is not currently
You don't need to close a session. Session
data will be removed automatically after the
agent has completed running and the
session timeout has elapsed. The default
session timeout is 30 minutes, so by default
session data will be removed automatically
30 minutes after the agent has completed.
AgentStatus
Returns status information about an agent
GetAgentStatus(
that has been run asynchronously. See
)
AgentStatus class.
DataTable
GetAgentProgre
ssAsDataTable()
DataTable
Returns progress information as a JSON
GetAgentProgre
string about an agent running in
ssAsJson()
DataTable
GetAgentLogAsD
© 2014 - 2015 Sequentum
ataTable(string
agentNameOrPat
h, string
sessionId)
string
Returns log data as a JSON string for an
GetAgentLogAsJ
son(string
agentNameOrPat
h, string
sessionId)
DataSet
Returns extracted data in a DataSet for an
GetAgentExport
DataAsDataSet()
string
Returns extracted data as a JSON string for
GetAgentExport
DataAsJson()
string
Returns extracted data as an XML string for
GetAgentExport
DataAsXml()
RunAgentReturn
Json
extracted data as a JSON string. The agent
completed its run.
RunAgentReturn
Xml
extracted data as an XML string. The agent
© 2014 - 2015 Sequentum
379
380
Content Grabber
completed its run.
RunAgentReturn
DataSet
extracted data in a DataSet. The agent is
completed its run.
RunAgentReturn
Json(AgentSettin
settings and returns extracted data as a
gs settings)
JSON string. The agent is always run in a
RunAgentReturn
Xml(AgentSettin
settings and returns extracted data as an
gs settings)
XML string. The agent is always run in a
RunAgentReturn
DataSet
settings and returns extracted data in a
DataSet. The agent is always run in a
© 2014 - 2015 Sequentum
public Agent
Returns the agent specified when
GetAgent()
instantiating the API class.
public void
Saves the specified agent.
SaveAgent(Agent
agent)
Agent Proxy Reference
The following functions are available in the proxy assembly:
Function
Description
AgentProxy(string
Instantiates a new proxy with the
agentNameOrPath,
string sessionId)
specify the full path to an agent file or
just the name of the agent. If you only
specify the agent name, Content
Grabber will look for the agent in the
default location for the user running the
agent service. The default agent location
for the local System account is:
\Agents
AgentProxy(string
Instantiates a new proxy without a
agentNameOrPath)
session. You can specify the full path to
an agent file or just the name of the
agent. If you only specify the agent
name, Content Grabber will look for the
© 2014 - 2015 Sequentum
381
382
Content Grabber
agent in the default location for the user
running the agent service. The default
agent location for the local System
account is:
\Agents
void Connect(string
Connects to the Content Grabber agent
endPointAddress)
service. You can specify the server
name or IP address and port number.
The default connection string for a local
service is:
void
CloseConnection()
void StartAgent()
Starts the agent specified when
instantiating the proxy. The agent will run
asynchronously.
void
Starts the agent with additional settings.
StartAgent(AgentSe
The agent will run asynchronously. See
ttings settings)
void StopAgent()
asynchronously.
void
CloseAgentSession(
)
© 2014 - 2015 Sequentum
has completed.
AgentStatus
GetAgentStatus()
DataTable
GetAgentProgressA
sDataTable()
returned.
DataTable
GetAgentProgressA
about an agent running in
sJson()
returned.
DataTable
GetAgentLogAsData
© 2014 - 2015 Sequentum
383
384
Content Grabber
Table(string
agentNameOrPath,
string sessionId)
string
GetAgentLogAsJso
n(string
agentNameOrPath,
string sessionId)
DataSet
GetAgentDataAsDat
aSet()
asynchronously.
string
GetAgentDataAsJso
n()
asynchronously.
string
GetAgentDataAsXml
()
asynchronously.
RunAgentReturnJso
n
returns extracted data as a JSON string.
RunAgentReturnXml
returns extracted data as an XML string.
© 2014 - 2015 Sequentum
RunAgentReturnDat
aSet
returns extracted data in a DataSet. The
agent is always run in a session when
the agent supports sessions, and the
session is closed automatically after the
agent has completed its run.
RunAgentReturnJso
n(AgentSettings
settings)
data as a JSON string. The agent is
completed its run.
RunAgentReturnXml
(AgentSettings
settings)
data as an XML string. The agent is
completed its run.
RunAgentReturnDat
aSet(AgentSettings
settings)
data in a DataSet. The agent is always
run in a session when the agent
© 2014 - 2015 Sequentum
385
386
Content Grabber
completed its run.
AgentSettings Reference
The following agent settings can specified when running an agent:
Property
Description
AgentLogLevel
LogLevel
bool IsLogHtml
Logs the HTML of all loaded web
pages to files.
bool IsLogToFile
Logs information to a file instead of a
database.
int Timeout
Session timeout. All session data is
elapsed.
Dictionary<string,
string>
InputParameters
GlobalData
Any serializeable data object can be
stored in this dictionary and will be
available to all scripts in an agent.
Notice that input parameters will
eventually be stored in this dictionary
© 2014 - 2015 Sequentum
as well, so it doesn't matter if you use
input parameters or global data to
store your input data.
AgentStatus Reference
Property
Description
AgentRunningStat The RunStatus can be one of the following
us RunStatus
values.
Completed. The agent has completed
successfully.
Incomplete. The agent has completed,
but stopped prematurely. The agent may
have been stopped manually.
Failed. The agent has completed, but a
critical error occurred.
ExportingData. The agent is exporting
data to the specified export target.
Stopping. The agent is in the process of
stopping.
Restarting. The agent is restarting. This
usually occurs when the agent needs to
clear JavaScript memory leaks.
ExportFailed. The agent completed, but
failed to export data.
int PageLoads
© 2014 - 2015 Sequentum
The number of page loads. This includes
387
388
Content Grabber
AJAX calls triggered by agent actions.
TimeSpan
Runtime
int
The number of times an agent command
MissingElements
could not find it's specified content where
the content was not specified as optional.
int PageErrors
includes errors loading content from AJAX
calls that were triggered by agent actions.
DateTime
StartTime
Agent Progress Data
Agent Log Data
Web Request Referenc
The following functions are available when using web requests:
Function
Description
StartAgent?
Starts an agent that will run
© 2014 - 2015 Sequentum
agent={agentName
asynchronously.
OrPath}
&sessionId={sessio
nId}
&sessionTimeout={
sessionTimeout}
&logLevel={logLevel
}
&logHtml={isLogHt
ml}
parameters:
&logToFile={isLogT
oFile}
&pars={inputParam
eters}
System account is:
\Agents
sessionId (optional): The session will
run in a session with the specified
session ID.
sessionTimeout (optional): If the agent
is running in a session, this value
specifies the number of minutes the
agent data will be available before it's
deleted. The default session timeout is
30 minutes.
© 2014 - 2015 Sequentum
389
390
Content Grabber
and High.
value is False.
encoded.
RunAgentReturnJso
n?
the extracted data as a JSON string.
agent={agentName
OrPath}
&timeout={timeout}
&logLevel={logLevel
}
&logHtml={isLogHt
ml}
&logToFile={isLogT
oFile}
&pars={inputParam
eters}
© 2014 - 2015 Sequentum
parameters:
System account is:
\Agents
seconds.
and High.
© 2014 - 2015 Sequentum
391
392
Content Grabber
value is False.
encoded.
RunAgentReturnXml
?
the extracted data as an XML string.
agent={agentName
OrPath}
&timeout={timeout}
&logLevel={logLevel
}
&logHtml={isLogHt
ml}
&logToFile={isLogT
oFile}
&pars={inputParam
eters}
parameters:
© 2014 - 2015 Sequentum
System account is:
\Agents
seconds.
and High.
value is False.
encoded.
StopAgent?
agent={agentName
asynchronously.
OrPath}
© 2014 - 2015 Sequentum
393
394
Content Grabber
&sessionId={sessio
nId}"
requests.
CloseAgentSession
?
agent={agentName
OrPath}
&sessionId={sessio
nId}
has completed.
requests.
GetAgentStatus?
agent={agentName
OrPath}
&sessionId={sessio
nId}
requests.
© 2014 - 2015 Sequentum
GetAgentProgressA
sJson?
about an agent running in a
agent={agentName
OrPath}
&sessionId={sessio
returned.
nId}
requests.
GetAgentLogAsJso
n?
agent={agentName
OrPath}
&sessionId={sessio
nId}
requests.
DataSet
GetAgentDataAsDat
aSet()
asynchronously.
requests.
GetAgentDataAsJso
n?
agent={agentName
asynchronously.
OrPath}
&sessionId={sessio
nId}
requests.
GetAgentDataAsXml
?
© 2014 - 2015 Sequentum
395
396
Content Grabber
agent={agentName
asynchronously.
OrPath}
&sessionId={sessio
nId}
requests.
© 2014 - 2015 Sequentum
Index
Index
-&&OPEN
200
-((percentage)
167
-..NET class library
288
.NET functions 283
.NET v4.0 client profile framework
.NET version 4 341
350
-???Headers= 152
??Post= 152
-332-bit library
350
-AAction Command Error Handling 208
Action Commands 137
Action Completed 177
Action Configuration 163
Action Events 167
Action Type 137
Action Types 164
Activities 137
Activity
177
Activity Status 137
Activity Type 137
Add Force Refresh Header 137
Add Horizontally
239
Add Horizontally Export Method 239
Add to Agent Template Library
204
Add Vertically 239
© 2014 - 2015 Sequentum
397
Add Vertically Export Method 239
Address Bar 27
Agent 185, 234
Agent API 375
Agent API Library
375
Agent API Reference 375
Agent command 145
Agent Configuration 27
Agent Data 234
Agent Debug Log window 210
Agent Details 273
Agent Development 24
Agent Error Handling 208
Agent Explorer 24, 27
Agent Export Data 354
Agent Initialization Scripts 302
Agent Log Data 354, 375
Agent Progress Data 375
Agent Proxy
375
Agent Proxy Library
375
Agent Proxy Reference 375
Agent Settings 45, 259
Agent Templates 204
AgentSettings 354
AgentSettings Reference 375
AgentStatus Reference 375
AJAX 177
Ajax Completed Timeout 137
Ajax Content Render Delay 137
Ajax Content Render Delay After Scroll 137
Ajax Start Timeout 137
Always Merge Data 239
Always send notifications when an agent completes a
run 210
alwaysNotify
210
Anonymous Web Scraping 252
API 253, 341, 350
API Reference 354, 375
API Restrictions 341
Application Settings 29
Assembly References 145, 289, 350
Asynchronous Timeout 137
Attribute Name 135
Author Details 145
Auto Detect File Extension 114
Automatic activity discovery
177
Automatic CAPTCHA Configuration 253
398
Content Grabber
-BBasic Windows Authentication 200
Block Known Ad Servers 137
blur 167
bool HasColumn(string tableName, string
columnName) 290
bool HasTable(string tableName) 290
bool Read() 290
Browser 165
Browser action type 165
Browser Activities 177, 267
Browser Activity
267
Browser Activity Screen 177
Browser Mode 137, 165
Browser Target 165
Building a Self-Contained Agent 270
Building a Web Application 353
Building Self-Contained Agents 270
Building Your First Agent 30
Bypass CAPTCHA 253
bypasscaptcha.com 326
-CC# 283, 284, 302
Calculated Value 134, 155
capitalize_words 284
CAPTCHA 234, 253, 326
CAPTCHA Blocking 253
CAPTCHA images 234
CAPTCHA OCR Scripts 253
CAPTCHA Protection 234
Capture Commands 109, 316
change 167
Change Export Target 42
Changing the Default Data Structures 239
Character Encoding 238
CityData 223
Cleanup External Session Data 145, 373
Clear Cookies 137
Clear Session 137
click 167
click() 167
Close Browser 192
CloseSession 373
Command Library 204
Command Line Arguments 279
Command object 290
Command Properties 191
Command scripts 283
Command Templates 204
Command-line parameters 230
Company Details 273
Completed 177
Configuration of an Action Command 137
Configure Agent Command 24
Consuming Data 223
Consuming Data in a Script 223
Container Command Properties 191
Container Commands 108
Content Grabber agent service 353
Content Grabber API 230
Content Grabber Basics 24
Content Grabber Message window 24, 33
Content Grabber Premium Edition 281
Content Grabber Runtime 7
Content Transformation Scripts 283, 316
ContentTransformationArguments 316
continue 279
Conversion Script 123
Convert Document to HTML 330
Convert Document to HTML Scripts 330
Convert image to text 253, 326
Convert to HTML 123
Convert to Text 114, 129
ConvertDocumentToHtmlArguments 330
ConvertImageToTextArguments 326
Cruise Direct example 33
CSV 42, 223, 236
CSV Data Provider 223
Current 165
Custom Scripts 335
Customize Design 273
Customize Self-Contained Agent 273
Customizing the User Interface 273
-DData 219
Data Consumer 136, 152, 160
Data Distribution 314
Data Distribution Scripts 314
Data Export 306
Data Export script 283
© 2014 - 2015 Sequentum
Index
Data Export Scripts 306
Data Input 321
Data Input Scripts 321
Data List 185, 186
Data List command 223
Data Missing Action 186
Data output 42
Data Outputs 24
Data Provider 145, 186
Data Providers 223
Data Source 236
Data Value 136, 234
Database 223
Database Connections 145, 219
Database Data Providers 223
database types 42
Database Utilities 290
DataDeliveryArguments 314
DataExportArguments 306
DataProviderArguments 321, 335
DataRow[] GetTableColumns(string tableName) 290
DataTable LoadCsvFile(string path) 290
DataTable LoadCsvFile(string path, char separator)
290
DataTable LoadCsvFileFromDefaultInputFolder(string
filename) 290
DataTable LoadCsvFileFromDefaultInputFolder(string
filename, char separator) 290
DateTime GetDateTimeValue(int columnIndex) 290
Death by CAPTCHA 253
deathbycaptcha.com 326
Debug 43
Debug Disabled 210
Debugger 43
debugging 43
Default 165, 186, 239
Default Events 137
Default Export Method 239
delay(milliseconds) 167
DepartureCity
223
Development Environments 219
Directory
145
Discover action 137, 163
Discover activities 267
Discover Activities box 177
Discover Activity Timeout 137
Discover First Activity Timeout 137
distributed royalty-free 270
Distributing Data 219, 249
© 2014 - 2015 Sequentum
399
Distributing Your Application 353
Docx To HTML 214
Docx To HTML Converter 214
down arrow 167
down arrow key 167
Download Document 123, 200
Download Document Command 160
Download Image 114
Download Screenshot 129
Download Video 119
Dynamic Page Change 177
Dynamic Page Changes 177
Dynamic Websites 17
-EEditor Action 137
Editor Index 188
Email and FTP distribution 250
Email Delivery
250
Email Distribution 250
Email Notification 145
Error Handling 137, 207
Error Handling In An Execute Script Command 208
Error Logs 210
Error logs and notifications 207
Error Notifications 210
Error Retry Count 137
Events 137
Excel 42
exec(JavaScript) 167
executable file 277
Execute Script 192, 253, 335
Execute Script Commands 208
Exit Codes 279
Exit Command 137, 208, 253
Expert Layout 29
Export Agent 270
Export Converted Document 123
Export Converted Image 114, 129
export data 220
Export last data segment only
220
Export Rows in Parent Columns 239
Export Rows to Parent Columns 239
Export Target 145, 238
Export URL 114
Exporting and Importing Template Libraries 300
Exporting and Importing User Libraries 204
400
Content Grabber
Exporting Data 235
Exporting Data to MS Access 236
Exporting Data to Oracle 236
Exporting Data with Scripts 249
Exporting Downloaded Images and Files 238
Extension Methods 290
Extension scripts 283
external assemblies 289
External Data 219, 220
Externals 177
Extracting Content From a Converted Document
214
Extracting Data From Complex Websites 12
Extracting Data From Non-HTML Content 12
Extracting Data From Websites Using Deterrents
12
Extracting Huge Amounts of Data 12
Extracting URLs 111
-FFile Capture 114
File Download 177, 200
File Download Action 200
File Name 114, 119, 123, 129
File Name Attribute 114
File Name Transformation Script 114
File URL 114, 119, 123
Fire Event 163
Fire Events 164
Fixed File Extension 114
Flash 119
focus 167
Follow Pagination command 33
Frame Completed Timeout 137
Frames 177
Free Private Proxy Switch Account 259
FTP Delivery
250
FTP Distribution 250
Full Dynamic Browser 165, 252
-GGateway
259
Generate Transformation 40
Group 189, 190
Group Command Properties 190
Group Commands 189
Group Commands in Page Area 191
Group in Page Area 189, 191
GUI window 230
GUID 238
Guid GetGuidValue(int columnIndex) 290
-HHandle Web Browser Dialog 200
Handle Web Browser Dialogs 160
How to Configure Proxy Servers 259
HTML 111
HTML Attribute 111, 114, 129
HTML Capture 129
HTML Column 210
HTML Content 16
HTML Parser 165, 200
html_decode 284
HTTP proxy servers 252, 259
-IICommand 290
ICommand GetNewCommand() 290
IConnection 290
IDataReader GetDataReader() 290
IExportData 306
IExportReader 306
If Selection Exists 253
If-Modified-Since header 137
iframes 177
Ignore Command when Data is Missing 186
Image Conversion Script 114, 129
Image OCR Scripts 326
imdb 239
Import Proxies 259
Importing Proxy Servers 259
Improving Agent Performance and Reliability 264
Initialization Script 145
Inner HTML 111
Input data 219
Input Parameters 145, 230
Input Parameters with a GUI Window 230
Insert Template from Library 204
Installing Content Grabber 22
int ClearAllBrowserCookies() 290
int ClearBrowserCookies(string url) 290
© 2014 - 2015 Sequentum
Index
int GetIntValue(int columnIndex) 290
Interactive 177
Interactive mode 177
internal data 219, 220
Internal Database 145, 264
Internet Explorer 9 341
Introduction 7
IP blocking 253
IP Blocking & Proxy Servers 259
IReader 290
IReader ExecuteReader(string sql, params object[]
pars) 290
IReader GetNewReader(string sql) 290
IsParentCommandActionError 137, 208
-JJQuery
137
-Kkey steps for building a web scraping agent
keydown 167
keypress 167
keyup 167
-LLimit Number of Scrolls 137
line_breaks 284
List 185
list command 108
List Commands 185
List Count 185
List of Numeric Links 155
List of Text Links 155
List selection mode 33
List Selections 98
List Start Index 185
List/sequence of numeric links 155
List/sequence of text links 155
Lists 98
Load URL 164
Loading 177
log_html 279
log_level 279
log_to_file 279
© 2014 - 2015 Sequentum
30
401
long ExecuteCount() 290
long ExecuteCount(string sql) 290
long ExecuteCount(string sql, params object[] pars)
290
Low page number threshold 210
-MMain Window 27
manual and automatic data extraction 253
Manual CAPTCHA 253
Manual CAPTCHA Configuration 253
Max Inner Prospects 145
Max Memory Usage 145
Max Tree Expansions 145
Max. Page Number Command 155
Max. Repeats 151
Maximum Number of Scrolls 137
Microsoft Excel 236
Microsoft Excel 2003 236
Microsoft Excel 2007 236
Microsoft Word 214
mousedown 167
mouseup 167
Movie List 239
mp4 119
Multiple Columns 239
My Templates 204
MySQL 42, 223, 236, 238
MySQL Character Encoding 238
-NName Command 239
Navigate Directly to Page 155
Navigate Link 200, 316
Navigate Link command 151
Navigate Pagination Command 155
Navigate URL 152, 234
Navigate URL Command 152
New 165
New Command 111
Next Page 155
Next Page link/button 155
No Action 163, 164
No Error Handling 208
no_ui 279
None 236, 239
402
Content Grabber
-Oobject ExecuteScalar() 290
object ExecuteScalar(string sql) 290
object ExecuteScalar(string sql, params object[] pars)
290
object GetConnection() 290
object GetFieldValue(int columnIndex) 290
OCR 114, 129
OCR script 253
OCR scripts 253
OCR service 234
ogg 119
OleDB 42, 223, 236
Only Execute Action on Value Change 160
Optimized Dynamic Browser 165
Optimizing Agent Commands 267
Optimizing Browser Activities 267
Optional Data 186
Oracle 42, 223, 236
Oracle OleDB 236
Original File Name 114
Other Commands 192
-PPage Attribute 135
Page Completed Timeout 137
Page Count 155
Page Interactive Timeout 137
Page Load 177
Page Load Timeout 137
Page Loading 177
Page Number Attribute 155
Page Number Transformation Script
Page Refresh 177
Page URL 135
PageNumberAttributeName 155
Pages 259
Pagination 33
Pagination Command Link 155
Pagination Type 155
Parent 165
Password 200
Pause Agent 253
Pause Before Restarting 145
PDF 214
155
PDF To HTML 214
PDF-to-HTML document converter
Polipo 259
Post-completion events 177
Posting Data and Custom Header
Premium Edition 7, 281
Private Proxy Switch 259
Process Restart Condition 145
Professional Edition 7
Programming Interface 341
promotional message 270
Proxy 145
Proxy API 375
proxy assembly 354
Proxy Configuration 145
Proxy Functions 354
Proxy Gateway 259
Proxy List 259
Proxy source 259
Proxy switch 259
Proxy Verification 259
Public Provider 276
214
152
-RRefresh Document 192
Regular Expressions 18, 284
Regular Expressions Reference Guide 284
Remove Internal Session Data Immediately 145,
373
Remove skipped proxies 259
Repeat Link Action 151, 167
Required 137
Restart Agent and Continue 208
Restart and Continue Agent 137
Restart On Memory Usage 145
Restart On Time Interval 145
Restart Time Interval 145
Retry Command 137, 208, 253
Retry Count 208
Return 167
Return key
167
Reuse Browser Tab 137
Rotation interval 259
Run 46
Run Agent 46
Run Interactively
354
Run Your Agent 46
© 2014 - 2015 Sequentum
Index
RunAgent.exe 279
Running an Agent in a Session 373
Runtime 7
Runtime Input Parameters 230
279
-SSchedule 45
Scheduling 45, 145
Script 223
Script Data Provider 223
Script Languages 284
Script Library
288
Script Template Library
300
Script type 335
Script Utilities 290
Scripting 145, 283
Scripting Data Distribution 251
Scroll to End of Page 167
Scroll Until End of Page 137
scroll() 167
scroll(percentage) 167
Select the Content to Capture 33
Selecting an Export Target 236
Selection 111, 114, 223
Selection Data Provider 223
Selection Techniques 18
self-contained agent 277
Self-Contained Agent Files 270
Self-Contained Agents 145
Send notification on critical errors 210
Send notification on low number of page loads
sendkey(keycode) 167
sendtext() 167
sendtext(text) 167
Separate 239
Separate Export Method 239
Separator 239
Server Edition 7
Session Data Cleanup 373
session_id 279
session_timeout 279
Sessions 145, 373
Set Form Field 109, 160, 185, 200, 234
Set Form Field Command 160
Set Select Box Text 160
Set Value 160
© 2014 - 2015 Sequentum
210
403
setinputtext() 167
setinputtext(text) 167
shared script library
288
Simple 223
Simple Data Provider 223, 276
Simplified Layout 29
Single Column 239
Single Next Link 155
Single Next Page Link 155
Skip unavailable proxies 259
Specifying Input Parameters on the Command Line
230
SQL Server 42, 223, 236
SQL Server Express 264
SQLite 219
SQLite database 220
src
111
Start 279
Start command 279
Start URL 27, 32
Static Browser 165
static DataTable ToDataTable(this List<string>
dataRows, string columnName) 290
static DataTable ToDataTable(this List<string>
dataRows, string columnName, string stringFormat)
290
static DataTable ToDataTable(this string stringValue,
string columnName) 290
static DataTable ToDataTable(this string[] dataRows,
string columnName) 290
static DataTable ToDataTable(this string[] dataRows,
string columnName, string stringFormat) 290
Static Parser 265
Stop Agent 137, 208
string GetStringValue(int columnIndex) 290
strip_html 284
Sub-Container Export Method 239
Sub-container Export Rules 239
Sub-Container Merge Method 239
Submit 200
Submit button 200
Support Sessions 145, 373
-TTag Text 111
Target Browser 137, 165
Templates 273
Test Your Agent 43
404
Content Grabber
Text 111
The Internal Database 220
third-party CAPTCHA recognition service 253
Time 259
Timeout 137, 177
Timeouts 137
Timing 177
TNS 236
to_lower 284
to_upper 284
ToDataTable 321
TOR 259
Transform Page 192
transformation 40
Transformation Script 40
Tree Selections 145
trim 284
Type GetFieldType(int columnIndex) 290
-Uunbind(Event) 167
Unique Database Names on Your Network 219
Unique ID 111
Upgrading a Self-Contained Agent 270, 277
Uploading File 200
URL 163
URL Column 210
URL Match 137
URL Parameters 152
url_decode 284
Use default events 167
Use Default Input 145
Use Max. Page Number Command 155
Use Next Pagination Set 155
Use Original File Name 114
Use Page Count 155
User Agent 145
User Interface 145
Username 200
Using a Self-Contained Agent 277
Using Data Input 223
Using Input Data 276
Using Input Parameters 230
Using Input Parameters in a Script 230
Using MS SQL Server as Internal Database 264
Using the Content Grabber Agent Service 354
Using the Content Grabber runtime 281
Using the Free TOR Proxy Network 259
Using the Static Parser Browser 265
UTF-8 238
-VValue Command 239
VB.NET 283, 284, 302
Verification timeout 259
Verify proxy before use 259
Vidalia 259
Video URL 119
View Export Data 43
View Log 210
view_browser 279
Visual Studio 288, 350
Visual Studio Configuration 350
Visual Studio Solution Template 288
Visual Web Ripper 19
void AddParameter(string pName, CaptureDataType
type) 290
void AddParameterWithValue(string pName, object
pValue) 290
void Close() 290
void CloseDatabase() 290
void DropTable(string tableName) 290
void ExecuteCommandLine(string programFilePath,
string arguments) 290
void ExecuteCommandLine(string programFilePath,
string inputFilePath, string outputFilePath, string
options) 290
void ExecuteNonQuery() 290
void ExecuteNonQuery(string sql) 290
void ExecuteNonQuery(string sql, params object[]
pars) 290
void Lock() 290
void Notify(bool alwaysNotify = true) 210
void Notify(string message, bool alwaysNotify = false)
210
void OpenDatabase() 290
void Release() 290
void ResetBrowserSession() 290
void SetParameterValue(string pName, object pValue)
290
void SetSql(string sql) 290
void TruncateTable(string tableName) 290
© 2014 - 2015 Sequentum
Index
-WWait XPath 137
web applications 353
Web Element Content 111
Web Element List 185, 188
Web Element Selection 177
Web Forms 200
Web requests 354, 365
Web Scraping Limitations 12
Web Selection Properties 191
Web-Scraping Techniques 15
Website Login Forms 200
What is Web Scraping? 10
Windows service 354, 365
Windows Task Scheduler 45
windowscroll 167
windowscroll() 167
Worker Threads 151, 152
-XXML 42, 236
XPath 18, 223
XPath Factory
145
© 2014 - 2015 Sequentum
405
406
Content Grabber
Endnotes 2... (after index)
© 2014 - 2015 Sequentum
Back Cover

3 - Content Grabber

Transcription

Similar documents

$70 $50 - General Tire

$70 $50 - Tire Rack

Receive your choice of a pair of SMS Audio “Street... with the purchase of 4 General Tires from 7/1/14 to...

Mechanical Grabber sci on

brush grabber root grapple

FEEL THE adrenaline

QuickStart Setup Guide

USING EZTV2Pulsar:

Epiphan VGA2USB Pro Product Brochure