3 - Content Grabber
Transcription
3 - Content Grabber
User Guide © 2014 - 2015 Sequentum Content Grabber © 2014 - 2015 Sequentum All rights reserved. No parts of this work may be reproduced in any form or by any means - graphic, electronic, or mechanical, including photocopying, recording, taping, or information storage and retrieval systems - without the written permission of the publisher. Products that are referred to in this document may be either trademarks and/or registered trademarks of the respective owners. The publisher and the author make no claim to these trademarks. While every precaution has been taken in the preparation of this document, the publisher and the author assume no responsibility for errors or omissions, or for damages resulting from the use of information contained in this document or from the use of programs and source code that may accompany it. In no event shall the publisher and the author be liable for any loss of profit or any other commercial damage caused or alleged to have been caused directly or indirectly by this document. Printed: May 2015 Contents 3 Table of Contents Foreword 0 Part I Introduction 7 1 Editions ................................................................................................................................... & Licensing 7 2 What................................................................................................................................... is Web Scraping? 10 3 Web Scraping ................................................................................................................................... with Content Grabber 11 4 Web Scraping ................................................................................................................................... Limitations 12 5 Web Scraping ................................................................................................................................... Techniques 15 HTML Content .......................................................................................................................................................... 16 Dynamic.......................................................................................................................................................... Websites 17 XPath and .......................................................................................................................................................... Selection Techniques 18 Regular Expressions .......................................................................................................................................................... 18 6 Converting ................................................................................................................................... Visual Web Ripper Projects 19 Part II Quick Start with Content Grabber 22 1 Installing ................................................................................................................................... Content Grabber 22 2 Content ................................................................................................................................... Grabber Basics 24 3 Exploring ................................................................................................................................... the Main Window 27 4 Simple ................................................................................................................................... vs Expert Layout 29 5 Building ................................................................................................................................... Your First Agent 30 Choosing.......................................................................................................................................................... a Start URL 32 Select the .......................................................................................................................................................... Content to Capture 33 Refine Your .......................................................................................................................................................... Data (optional) 40 Output Data .......................................................................................................................................................... Format 42 Test Your.......................................................................................................................................................... Agent 43 Scheduling .......................................................................................................................................................... 45 Run Your.......................................................................................................................................................... Agent 46 Part III Content Grabber Editor 48 1 Customizing ................................................................................................................................... the Editor Layout 48 2 Expert ................................................................................................................................... Layout Features 51 3 Working ................................................................................................................................... With the Web Browser 59 4 Command ................................................................................................................................... Configuration 62 5 Selection ................................................................................................................................... Tools and Short-cut Keys 69 6 Navigating ................................................................................................................................... an Agent 73 7 Testing/Debugging ................................................................................................................................... an Agent 76 Using the.......................................................................................................................................................... Debugger 76 Using Logging .......................................................................................................................................................... 81 8 Scheduling ................................................................................................................................... 82 9 Copying ................................................................................................................................... & Moving Agents 85 10 Exporting ................................................................................................................................... & Importing Agents 86 © 2014 - 2015 Sequentum 3 4 Content Grabber 11 Automated ................................................................................................................................... Agent Backups 87 12 Template ................................................................................................................................... Libraries 88 Part IV Selection Techniques 89 1 Selection ................................................................................................................................... XPaths 90 2 Optimizing ................................................................................................................................... Selections 97 3 List Selections ................................................................................................................................... 98 4 Selection ................................................................................................................................... Anchors 101 5 Editing ................................................................................................................................... XPaths Manually 102 Part V Agent Commands 104 1 Container ................................................................................................................................... Commands 108 2 Capture ................................................................................................................................... Commands 109 Web Element .......................................................................................................................................................... Content 111 Download .......................................................................................................................................................... Image 114 Download .......................................................................................................................................................... Video 119 Download .......................................................................................................................................................... Document 123 Download .......................................................................................................................................................... Screenshot 129 Calculated .......................................................................................................................................................... Value 134 Page Attribute .......................................................................................................................................................... 135 Data Value .......................................................................................................................................................... 136 3 Action ................................................................................................................................... Commands 137 Agent .......................................................................................................................................................... 145 Navigate .......................................................................................................................................................... Link 151 Navigate .......................................................................................................................................................... URL 152 Navigate .......................................................................................................................................................... Pagination 155 Set Form .......................................................................................................................................................... Field 160 Action Configuration .......................................................................................................................................................... 163 Action ......................................................................................................................................................... Types 164 Browser ......................................................................................................................................................... Target 165 Action ......................................................................................................................................................... Events 167 Browser ......................................................................................................................................................... Activities 177 4 List ................................................................................................................................... Commands 185 Data List .......................................................................................................................................................... 186 Web Element .......................................................................................................................................................... List 188 5 Group ................................................................................................................................... Commands 189 Group .......................................................................................................................................................... 190 Group in .......................................................................................................................................................... Page Area 191 6 Other ................................................................................................................................... Commands 192 Execute.......................................................................................................................................................... Script 192 Transform .......................................................................................................................................................... Page 197 Refresh.......................................................................................................................................................... Document 199 Close Browser .......................................................................................................................................................... 200 7 Web................................................................................................................................... Forms 200 8 Command ................................................................................................................................... Library 204 Part VI Error Handling 207 © 2014 - 2015 Sequentum Contents 5 1 Agent ................................................................................................................................... Error Handling 208 2 Error ................................................................................................................................... Logs and Notifications 210 Part VII Extracting Data From Non-HTML Documents 214 Part VIII Data 219 1 Database ................................................................................................................................... Connections 219 2 The ................................................................................................................................... Internal Database 220 3 Using ................................................................................................................................... Data Input 223 Data Providers .......................................................................................................................................................... 223 Input Parameters .......................................................................................................................................................... 230 Agent Data .......................................................................................................................................................... 234 4 Exporting ................................................................................................................................... Data 235 Selecting .......................................................................................................................................................... an Export Target 236 Exporting .......................................................................................................................................................... Downloaded Images and Files 238 Character .......................................................................................................................................................... Encoding 238 Changing .......................................................................................................................................................... the Default Data Structures 239 Exporting .......................................................................................................................................................... Data with Scripts 249 5 Distributing ................................................................................................................................... Data 249 Email and .......................................................................................................................................................... FTP distribution 250 Scripting .......................................................................................................................................................... Data Distribution 251 Part IX Anonymous Web Scraping 252 Part X CAPTCHA & IP Blocking 253 1 CAPTCHA ................................................................................................................................... Blocking 253 2 IP Blocking ................................................................................................................................... & Proxy Servers 259 Part XI Improving Agent Performance and Reliability 264 1 Using ................................................................................................................................... MS SQL as Internal Database 264 2 Using ................................................................................................................................... the Static Parser Browser 265 3 Optimizing ................................................................................................................................... Agent Commands 267 Part XII Building Self-Contained Agents 270 1 Customizing ................................................................................................................................... the User Interface 273 2 Using ................................................................................................................................... Input Data 276 3 Using ................................................................................................................................... a Self-Contained Agent 277 Part XIII Running Agents from the Command-Line 279 1 Using ................................................................................................................................... the Content Grabber runtime 281 Part XIV Scripting 283 1 Script ................................................................................................................................... Languages 284 2 Script ................................................................................................................................... Library 288 3 Assembly ................................................................................................................................... References 289 © 2014 - 2015 Sequentum 5 6 Content Grabber 4 Script ................................................................................................................................... Utilities 290 5 Script ................................................................................................................................... Template Library 300 6 Agent ................................................................................................................................... Initialization Scripts 302 7 Data................................................................................................................................... Export Scripts 306 8 Data................................................................................................................................... Distribution Scripts 314 9 Content ................................................................................................................................... Transformation Scripts 316 10 Data................................................................................................................................... Input Scripts 321 11 Image ................................................................................................................................... OCR Scripts 326 12 Convert ................................................................................................................................... Document to HTML Scripts 330 13 Custom ................................................................................................................................... Scripts 335 Part XV Programming Interface 341 1 Building ................................................................................................................................... a Desktop Application 342 Visual Studio .......................................................................................................................................................... Configuration 350 Distributing .......................................................................................................................................................... Your Application 353 2 Building ................................................................................................................................... a Web Application 353 Using the .......................................................................................................................................................... Content Grabber Agent Service 354 Using Simple .......................................................................................................................................................... Web Requests 365 3 Sessions ................................................................................................................................... 373 4 API Reference ................................................................................................................................... 375 Index 397 © 2014 - 2015 Sequentum Introduction 1 7 Introduction Congratulations on your purchase of Content Grabber. We value you as a customer and we want to help you increase your web-scraping productivity. We have put years of effort into designing and developing Content Grabber. Our software can be a great help in simplifying the web-scraping process, increasing your efficiency, and optimizing the performance. We have the best web-scraping solution on the market and strive to exceed your expectations. You can learn more in these sections: Editions and Licensing What is Web Scraping? Web Scraping with Content Grabber Web Scraping Limitations Web Scraping Techniques We have designed this comprehensive user guide to help you get up and running quickly with Content Grabber, and then learn the advanced features so that you can get your job done more efficiently. After Installing Content Grabber, we recommend that you get familiar with Understanding the Concept, Exploring the Main Window and Building Your First Agent. We welcome your input and feedback, so please send us an email at: info@ContentGrabber.com.au. We are keen to know how you are using Content Grabber, and welcome any inquiry from you or your colleagues. 1.1 Editions & Licensing The Content Grabber software comes in three editions, each of which requires a license. Please see the license agreement in the Content Grabber installation folder for the full license agreement. In this article we highlight the most important differences among the Content Grabber editions. © 2014 - 2015 Sequentum 8 Content Grabber These are the current license editions of Content Grabber: Professional Edition Premium Edition Server Edition This user guide describes all features for all editions of the Content Grabber software. Some sections of this guide present features that are only available in specific versions, and these features may not be available in the software edition that you are using. We explain these differences below. Professional Edition If a current-license copy of the Professional Edition of the software is running on a computer, then a single user may create, edit, and run web-scraping agents without restrictions. The Professional Edition can create self-contained agents that you can distribute and users can run royalty-free. These self-contained agents use a standard user interface that display a promotional message for Content Grabber, and it isn't possible to edit the display templates to remove this message. It is not possible to open any agents in the Professional Edition that were built in the Premium Edition. The Content Grabber Runtime cannot run or edit web-scraping agents built with the Professional Edition. This also means that the entire API and the Script Library is unavailable to agents built with the Professional Edition. The runagent.exe command-line program will only run agents built with the Premium Edition (so long as that edition of the software is on the computer). © 2014 - 2015 Sequentum Introduction 9 Premium Edition The Professional Edition is the most feature-rich version of the Content Grabber software suite. As with the Professional Edition, if a current-license copy of the Premium Edition of the software is running on a computer, then a single user may create, edit, and run web-scraping agents without restrictions. With the Premium Edition, you can use the API and the Script Library. The Premium Edition can create self-contained agents that you can distribute and users can run royalty-free. These self-contained agents use a standard user interface that displays a promotional message for Content Grabber, but the Premium Edition gives you the ability to change the display templates to remove this message. You can open agents in the Premium Edition that were built in the Professional Edition, but the only features that will function are those that are available in the Professional Edition. Server Edition The Server Edition of Content Grabber is intended for use in production environments for running your agents and carrying out simple maintenance. It is not a developer license and is priced accordingly. The Server Edition cannot create new agents, but it can run and edit agents built with either the Premium Edition or the Professional Edition. Restrictions for the Server Edition: With this edition, you may not change the agent command structure, or alter an agent in a way that would change the output data structure. It is not possible to add, delete, move, or copy commands in this edition, nor can © 2014 - 2015 Sequentum 10 Content Grabber it modify the Disabled and Export command properties. This edition cannot export agents or create self-contained agents. This edition cannot edit script libraries, though it can run agents that use script libraries. Content Grabber Runtime The Content Grabber Runtime cannot change the agent command structure, or modify the agent in a way that would change the output data structure. Also the Runtime cannot add, delete, move or copy commands, nor can it change the Disabled and Export command properties. With the Premium Edition, you can create and distribute agents royalty-free with the Content Grabber Runtime. Please be aware that the Content Grabber Runtime has its own license and restrictions. The Content Grabber Runtime cannot run or edit web-scraping agents built with the Professional Edition. This also means that the entire API and the Script Library is unavailable to agents built with the Professional Edition. An agent built with the Professional Edition that you later edit with the Premium Edition cannot be run with the Content Grabber Runtime. 1.2 What is Web Scraping? We provide this brief introduction for those who are new to web-scraping. If you are familiar with web-scraping, then you many want to begin by reading the article Installing Content Grabber. Web-scraping is the process of extracting data from websites and storing that data in a structured, easy-to-use format. The value of a web-scraping tool like Content Grabber is that you can easily specify and collect large amounts of source data that may be very dynamic (data that changes very frequently). © 2014 - 2015 Sequentum Introduction 11 Usually, data available on the Internet has little or no structure and is only viewable with a web browser. Elements such as text, images, video, and sound are built into a web page so that they are presentable in a web browser. It can be very tedious to manually capture and separate this data, and can require many hours of effort to complete. With Content Grabber, you can automate this process and capture website data in a fraction of the time that it would take using other methods. Web-scraping software interacts with websites in the same way as you do when using your web browser. However, in addition to displaying the data in a browser on your screen, web-scraping software saves the data from the web page to a local file or database. You can configure web-scraping agents to run on multiple websites, and you can schedule each agent to run automatically. It's easy to configure your agent to run as frequently as you like (hourly, daily, weekly, monthly) to ensure that you are capturing the very latest data. 1.3 Web Scraping with Content Grabber With Content Grabber, you can automatically harvest data from a website and deliver the content as structured data in multiple database formats (Oracle, SQLServer, My SQL, OLE DBE), or in other formats such as Excel spreadsheets, CSV or XML files. Content Grabber can also extract data from highly dynamic websites where most other extraction tools are incapable. It can process AJAX-enabled websites, submit © 2014 - 2015 Sequentum 12 Content Grabber forms repeatedly to cover all possible input values, and manage website logins. Web-scraping technology is transforming the Internet into a structured data source, and Content Grabber is opening up numerous business opportunities for both corporations and individuals. The following is just a small sample of how web-scraping technology is optimizing and enabling new businesses: Price Comparison Portals / Mobile Locate the highest-ranking keywords Apps of your competitors on all major Collaborative lists (home search engines foreclosures, job boards, & tourist Background Checking attractions) Confirm the integrity of business News & Content Aggregation partners Competitive price monitoring Monitor online sources for copyright Monitor dealers for price compliance infringement Track inventory on retailer websites Sales Lead Generation Social media and brand monitoring Content migration (CMS & CRM). Content Grabber is a powerful, visual web-scraping tool that can do all of this and much more. We provide a comprehensive user guide to help you get up and running quickly. After Installing Content Grabber, we recommend that you look at Content Grabber Basics, and then get familiar with Exploring the Main Window and then Building Your First Agent. 1.4 Web Scraping Limitations Web-scraping can be challenging if you want to mine data from complex, dynamic websites. If you're new to web-scraping, then we recommend that you begin with an easy website: one that is mostly static and has little, if any, AJAX or JavaScript. After you get familiar with the navigation paths for your target website, you need to identify a good start URL. Sometimes this is simply the start URL of the website, but often the best URL is the one for a sub-page—such as a product listing. Once you © 2014 - 2015 Sequentum Introduction 13 have this URL, you’ll need to copy it and then paste it into the address bar of Content Grabber. NOTE: Some websites allow navigation without any corresponding change in the visible URL. In such cases, you may not have a start URL that points directly to your start webpage, and so you’ll need to add preliminary steps to your agent to navigate to that webpage. Web-scraping can be also challenging if you don't have the proper tools. Largely, you're completely at the mercy of the target website, and that website can change at anytime - without notice. Or, it may contain faulty JavaScript that causes it to crash and exhibit surprising behavior. The server that hosts the website may crash, or the website may undergo maintenance. Many potential problems can occur during a lengthy web-scraping session, and you have very little influence on any of them. Content Grabber offers an array of advanced error-handling and stability features that can help you manage many of the problems that a web-scraping agent is likely to encounter. In addition to the unreliable websites, another challenge is that some web-scraping tasks are especially difficult to complete - including the following: Extracting data from complex websites Extracting data from websites that use deterrents Extracting huge amounts of data Extracting data from non-HTML content. Extracting Data From Complex Websites If you are developing web-scraping agents for a large number of different websites, you will probably find that around 50% of the websites are very easy, 30% are modest in difficulty, and 20% are very challenging. For a small percentage, it will be effectively impossible to extract meaningful data. It may take two weeks or more for a © 2014 - 2015 Sequentum 14 Content Grabber web-scraping expert to develop an agent for such a website, so the cost of developing the agent is likely to outweigh the value of the data you might be able to extract. Extracting Data From Websites Using Deterrents Web-scraping will always be challenging for any website with active deterrents in place. If it is necessary to login to access the content that you want to extract, then the website can always cancel your account and make it impractical to create new accounts. Some websites uses browser fingerprinting to identify and block your access to the website. Fingerprinting uses JavaScript to make a positive identification by examining your browser and computer specifications, and thereby making it impossible to circumvent. Another method for websites that are wary of crawlers or scrapers is the use of CAPTCHA. Content Grabber includes tools you can use to overcome CAPTCHA protection, but you'll incur additional costs to get a 3rd-party to do automatic CAPTCHA processing. See CAPTCHA Blocking for more information. The most common protection technique is using your IP address to identify and block your access to a website. You can usually circumvent this technique by using a proxy rotation service, which hides your actual IP address and uses a new IP address every time you request a web page from a website. See IP Blocking & Proxy Servers for more information. NOTE: Ethically and legally, we recommend that you avoid websites that are actively taking measures to block your access, even if you are able to circumvent the protection. Extracting Huge Amounts of Data A web-scraping tool must actually visit a web page to extract data from it. Downloading a web page takes time, and it could take weeks and months to load and extract data from millions of web pages. For example, it's virtually impossible to © 2014 - 2015 Sequentum Introduction 15 extract all product data from Amazon.com, since there are too many web pages. Extracting Data From Non-HTML Content Some websites are built entirely in Flash, which is a small-footprint software application that runs in the web browser. Content Grabber can only work with HTML content, so it can only extract the Flash file. However, it can't interact with the Flash application or extract data from within the Flash application. Many websites provide data in the form of PDF files and other file formats. Though it cannot directly extract data from such files, Content Grabber can easily download those files and convert the files into an HTML document using 3rd-party converters to extract data from the conversion output. The document conversion happens very quickly in real-time, so it will seem as though you are performing a direct extraction. It's important to realize that PDF documents and most file formats don't contain content that is easily convertible into structured HTML. To do that, you can use the Regular Expressions feature of Content Grabber to resolve the conversion output. 1.5 Web Scraping Techniques Content Grabber makes it easy to extract data from most websites without requiring much prior knowledge about web-scraping techniques. However, you'll be able to build better web-scraping agents if you know some basic techniques. Some very difficult websites will require in-depth knowledge, but this user guide can help you gain more understanding and direct you to additional resources. The following topics are important if you want to become proficient at web-scraping, but they are not necessary a prerequisite for successful use of Content Grabber on all websites. Click the links to learn more about each topic: HTML Content - Web pages are driven by HTML, which is the basic language for building websites. Dynamic Websites - It can be challenging to perform data extraction on dynamic websites. So, it's good to have a general understanding of how JavaScript © 2014 - 2015 Sequentum 16 Content Grabber works, since it is found on most dynamic websites. XPath and Selection Techniques - Most web scraping tools extract data from a website by selecting web elements on the web page. XPath is a language that manages the web selection. Regular Expressions - XPath can select a web element such as a paragraph of text, but you may have interest only in a small part of the web element content. Regular Expressions is a language for extracting small bits of text from a larger text element. 1.5.1 HTML Content HTML stands for HyperText Markup Language - the standard markup language for creating web pages. It consists of content that is defined in an HTML document by tags that appear in brackets, such as <html>. Typically, these tags are seen in pairs, with one on each end of the content that they represent (such as <h1> and </h1>). The first tag in a pair is the start tag, and the second tag is the end tag (also known as opening tags and closing tags). Some tags that represent empty elements don't come in pairs, such as <img>. The purpose of a web browser is to read HTML documents and compose them into visible web pages. The browser does not display the HTML tags, but rather interprets the tags and displays content on the page that corresponds to that tag. HTML describes the structure of a web page semantically, with cues for presentation. This distinguishes it as a markup language rather than a programming language. HTML elements are the building blocks of any website, including embedded images and objects, and also interactive forms. It provides the structure for a page by denoting structural semantics for text such as headings, paragraphs, lists, links, quotes, and other items. It can also contain scripts written in languages such as JavaScript - which controls the behavior of HTML web pages. Content Grabber uses XPath to select specific HTML tags and then extracts content © 2014 - 2015 Sequentum Introduction 17 from those tags. An HTML tag can contain both text and attributes. For example, a HTML tag that displays an image will contain a scr attribute that specifies the URL of the image to display. Content Grabber can extract both tag text and tag attributes, and may perform certain actions on the content it extracts. For example, it may extract the scr attribute from an <image> HTML tag and then use the URL to download the image. There are many websites that have HTML tutorials. Here is one example: http://www.w3schools.com/html/html_intro.asp 1.5.2 Dynamic Websites Using HTML script, a client-side dynamic web page will continue to load more content after the initial content loads and the page elements are available to the user. The most common language for client-side scripting is JavaScript, and it may use AJAX (Asynchronous JavaScript and XML) to load additional content onto a web page asynchronously. It may also modify existing content on a web page, such as enabling or disabling content when you click on particular web elements. To extract data correctly, Content Grabber needs to detect any dynamic changes on a web page. For example, if you want to extract any additional data that AJAX loads onto a web page, then you'll want to configure Content Grabber to wait on AJAX to finish processing the new content before it can start extracting it. Content Grabber is excellent at the automatic detection of dynamic changes. However, sometimes JavaScript behaves unusually, and you may need to make adjustments to properly extract dynamic content. For example, Content Grabber can detect when JavaScript completes an AJAX load of dynamic content. But it cannot detect exactly when the JavaScript is done and so it will simply wait for a few milliseconds. If the JavaScript takes an unusually long time to display the dynamic content, you may need to use the timeout feature of Content Grabber to insert a short interval for the JavaScript to display the dynamic content (typically a few additional milliseconds). © 2014 - 2015 Sequentum 18 Content Grabber Familiarity with JavaScript can make it much easier to configure a web-scraping agent to extract data from dynamic websites when Content Grabber is unable configure the agent automatically. You can learn more from various JavaScript tutorials available on the web, such as: http://www.w3schools.com/js/ 1.5.3 XPath and Selection Techniques Proper selection technique is a critical aspect of web-scraping. The most basic selection technique is to point-and-click on elements in the web browser panel, which is the easiest way to add commands to an agent. XPath is a common syntax for selecting elements in HTML and XML documents. Each time you click on an element in the web browser, Content Grabber works in the background to calculate the selection XPath. Content Grabber has a variety of tools that help you create precise XPaths without needing to know XPath syntax at all. If you want finer control over element selection, it's worthwhile to learn some specifics about XPath syntax. Although this user guide doesn't include an XPath tutorial, there are many tutorials available on the web. We recommend this as a good place to learn more about XPath: www.w3schools.com/XPath/xpath_syntax.asp 1.5.4 Regular Expressions With Regular Expressions, you can write expressions that look for specific character sequences within strings and then extract small text strings out of larger ones. Content Grabber uses XPath to select web elements on a web page, and then extracts content from those web elements. You may only want some parts of the content extraction, or you may want to transform it. For example, a single web element may contain the entire address of a company, but you may want to extract © 2014 - 2015 Sequentum Introduction 19 the content into separate elements such as street address, city, zip code and state. You can use Regular Expressions to split the address text into separate text strings. There are many tutorial websites that teach Regular Expressions. Here is one example: http://www.regular-expressions.info/reference.html 1.6 Converting Visual Web Ripper Projects Visual Web Ripper is another web scraping tool published by Sequentum. Content Grabber can open Visual Web Ripper projects and convert them into Content Grabber agents. This is NOT a fully automated conversion, and many agents will need manual adjustments after conversion. To convert a Visual Web Ripper project, simply open the project file in Content Grabber as if it was a normal Content Grabber agent. Content Grabber will then ask if you want to convert the project file into an agent. The Visual Web Ripper project will be left intact, so you don't need to make a copy of the Visual Web Ripper project. First open the Visual Web Ripper project as if it was a normal agent. © 2014 - 2015 Sequentum 20 Content Grabber You will be asked to convert the Visual Web Ripper project to an agent. Content Grabber will attempt to add all required commands to the converted agent, but it will not be able to set all command properties correctly, so we recommend you test all commands in a converted agent to make sure they work correctly. The following list includes features that will NOT be converted from a Visual Web Ripper project. This is not a complete list, and there may be other aspects of a project that will not be converted correctly. Visual Web Ripper scripts will not be converted, except for very simple Content Transformation scripts. OleDB and Oracle database connections will not be converted. All other database connections will be converted into shared connections. The Private Proxy Switch option is not available in Content Grabber. PAC Proxy configuration will not be converted. © 2014 - 2015 Sequentum Introduction Most Page Transformation elements will require post configuration. The Tag Name attribute is not available in Content Grabber. The property SaveDataMethod will require post configuration when set to anything other than Default. Many Back templates will require post configuration. Multiple Page Navigation templates working together will require post configuration. Page Navigation templates using Dynamic Links or List of Links will sometimes require post configuration All actions will be converted into actions with the property Detect Action set to true. All other action configuration, including wait elements, will be ignored. Form Field elements using the Start Index and Count properties will often require post configuration. Form Field Lookup Data Sources will not be converted. Project user agent settings will not be converted. CAPTCHA configuration will not be converted. Document conversion will not be converted. Document download on form submit will not be converted. Scheduling configuration will not converted. Notification configuration will not be converted. Duplicate checks will sometimes require post configuration Many rarely used options in projects, templates and contents don't have corresponding options in Content Grabber and will therefore not be converted. © 2014 - 2015 Sequentum 21 22 2 Content Grabber Quick Start with Content Grabber We want to help you get up and running, so this chapter outlines the basics to help you quickly install the software, build a simple agent, and then test that agent. Further along in the guide, we explain more details and explore the advanced features. For now, let's see how easy it is to build an agent. In this chapter, you'll learn: Installing Content Grabber Content Grabber Basics Exploring the Main Window Simple vs Expert Layout Building Your First Agent 2.1 Installing Content Grabber Content Grabber is a Windows-based software application. Windows 7, Windows 8, Windows 2008 R2 and Windows 2012 are the supported versions of Windows. It is important to note that while it is not supported, the Content Grabber runtime works on Windows XP+ and the Content Grabber editor works on Windows Vista+. While a Macintosh version is not available, a number of our users utilize Parallel's windows emulation software to successfully run Content Grabber. Installing Content Grabber is easy. Simply follow these steps: 1. Download the setup file from http://www.sequentum.com/asdr34ert/setup.zip 2. Find the setup.zip file on your local disk (likely to be your Downloads folder). 3. Open the zip file and then double-click the setup.exe file to launch it. 4. After a moment, you'll see the first panel of the Content Grabber setup wizard. Click the Next button to continue with the installation. © 2014 - 2015 Sequentum Quick Start with Content Grabber 23 5. Click the option to Accept the Agreement, and then click the Next button. 6. In the next panel, you can change the installation location. If you accept the default location, simply click Next. Otherwise, click the Browse button to choose a different installation folder, and then choose the folder and click the Open button to close the pop-up, and then click the Next button. 7. In the next panel you can change the name of the folder that will appear in the Windows Start Menu. Click Next and, in the next panel, click the Install button to complete the installation. 8. Locate the Content Grabber item in the Windows Start Menu and then launch it. © 2014 - 2015 Sequentum 24 2.2 Content Grabber Content Grabber Basics Web-scraping tools generally use macros or configuration methods, and follow a sequential list of commands. The macro approach is more user friendly and automatically records the actions of a user in a browser. However, there are typically restrictions on accessing the code behind the agent. The configuration approach allows the user to directly configure each part of the agent. They can introduce more code structure, controls, data refinements, or add their own naming conventions. Content Grabber gives you the option to either follow the simple macro automation methods, or to take direct control over the treatment of each element and command within your agent. Content Grabber Agent Development With Content Grabber, you can visually browse the website and click on the data elements in the order that you want to collect them. Based on the content elements selected, Content Grabber will automatically determine the corresponding action type and provide default names for each command as it builds the agent for you. © 2014 - 2015 Sequentum Quick Start with Content Grabber 25 Content Grabber main screen - building CarPoint Agent A Content Grabber agent is a collection of commands which are executed in serial until completed. The commands can either be actions (such as a jump to a URL) or data capture commands (e.g. capture text). These commands are recorded in order of execution in the Agent Explorer panel of the Content Grabber screen. Agent Explorer panel with New Agent commands If you want to make other adjustments or gain more control of your commands, you can make changes in the Configure Agent Command panel. Configure Agent Command panel You can also add new commands to your agent, or configure existing ones. To do this, you simply click twice on any web element (content item) and the Content Grabber Message window will appear. From here you can select the command type you want and add it to the Agent Explorer. © 2014 - 2015 Sequentum 26 Content Grabber Content Grabber Message window pop-up Content Grabber Data Outputs After you have finished building you Agent and run it for the first time, Content Grabber saves the data locally in a structured database format. Content Grabber can export extracted web data as a report or to numerous different database types. Data output options include CSV, Excel, XML, SQL Server, MySQL, Oracle and OleDB. Content Grabber's Data Configuration window You can also use a Content Grabber export script to completely customize the data export to your own database structures. Scheduling © 2014 - 2015 Sequentum Quick Start with Content Grabber 27 Content Grabber provides an Agent scheduling facility that enables you to automatically run your agent at predetermined time slots whenever you need it to run. This can be done every hour, every day, month, year and so on. Content Grabber's Scheduling Window 2.3 Exploring the Main Window In this section, we give a brief overview of the Content Grabber window. © 2014 - 2015 Sequentum 28 Content Grabber Address Bar The Address Bar is where you enter the URL of the page that is the start of your web-scraping agent. This is what we call the Start URL. Web browser Panel with Data-Capture Box The Content Grabber web browser appears in the main panel. The special feature of this integral web browser is this: as you move the mouse and pass over a web element, you’ll see a yellow rectangle envelop that element. When you stop over one element and click on it twice, the Command pop-up window will appear—prompting you to choose the command that you want to insert into the agent for this data element. Agent Explorer Panel As you select data elements in the browser panel, the corresponding Command item will appear in the Agent Explorer panel. The Agent Explorer is used to view agent © 2014 - 2015 Sequentum Quick Start with Content Grabber 29 commands for the type of web page loaded in the selected browser tab. When you select a new browser tab with another web page, a new set of commands will be displayed in the Agent Explorer. You can use the Agent Explorer to view, edit or execute commands. Agent Configuration Panel In this panel, you can edit an agent command, execute a command, or add a subcommand. Further along in this guide, we’ll explain how to explore and edit each command. 2.4 Simple vs Expert Layout Content Grabber includes a vast number of advanced features which can be overwhelming for a new user. Content Grabber allows you to hide some advanced features or to show all the features available. The default layout setting is the Simplified Layout which excludes some advanced features from the Content Grabber menus. You can switch between the Simplified Layout and the Expert Layout from the Application Settings menu at the top of the Content Grabber application. Content Grabber's Application Settings menu with Simplified Layout enabled You can also show or hide the advanced features using the radio buttons displayed on Content Grabber's start-up screen. © 2014 - 2015 Sequentum 30 Content Grabber Show or hide advanced feature selection on start-up screen After switching to Expert Layout you will see a number of additional menu items displaying. To switch back to Simplified Layout you choose Simplified Layout under the Expert Layout option within the Application Settings menu. Content Grabber's Application Settings menu with Expert Layout enabled For more detail on the different Expert Layout options and their functions, refer to Expert Layout Features within the Content Grabber Editor section of this manual. 2.5 Building Your First Agent The diagram shows the key steps for building a web scraping agent. Below we provide links the other topics which explain each of these steps in more detail. © 2014 - 2015 Sequentum Quick Start with Content Grabber 31 Basic web scraping Agent creation process Here in this section, we cover the identification of data elements on your target website and create a web-scraping agent. We'll work through each of the above steps with examples that match common web-scraping usage, so you can get comfortable building your own agents. We cover these topics in this section: Choosing a Start URL Select the Data to Capture Refine your data (optional) Output Data Format © 2014 - 2015 Sequentum 32 Content Grabber Test Your Agent Scheduling Run your Agent After you get comfortable creating an agent, you can learn more about the Content Grabber Editor. 2.5.1 Choosing a Start URL The Start URL is the place where you begin data collection and corresponds to the starting point for your web-scraping agent. In the following sections, we'll use the Cruise Direct website for our example. http://www.cruisedirect.com Note: In this example we start from the Cruise Direct home page, however, if the data you require is not located on a website home page, you can start the agent from a website sub-page. This approach will make the agent more efficient, so it’s worth taking the time to be more specific. 1. We start by pasting the start web page url from the target website (http:// wwwcruisedirect.com) into Content Grabber's Address Bar. 2. Next click the Blue Play button to load the Cruise Direct home page. Note: you can also press the Enter key to load the Cruise Direct home page. © 2014 - 2015 Sequentum Quick Start with Content Grabber 33 Content Grabber with Cruise Direct home page loaded In the next section, Select the Data to Capture, we will continue use the Cruise Direct website data for our example. 2.5.2 Select the Content to Capture In the previous section, we selected our Start URL and loaded the web page into Content Grabber. Next, you can select the data you want to capture and start building your web scraping agent. In our Cruise Direct example, we plan to search for available cruise vacations and then extract the cruise name, the cruise line, destination, departing place, ports of call, and the different prices. 1. Firstly we need to perform a search to retrieve the data for the available cruises. To do this, we select the orange Search button element with the mouse, then click one more time to display the Content Grabber Message window. © 2014 - 2015 Sequentum 34 Content Grabber Content Grabber Message Window 2. From the message window, we choose the Click on the Web Element option to add a new command to the agent that will execute the search and display the search results on a new web page. Notice that Content Grabber has added our first command to Agent Explorer - in this case to execute the search and display the search results. Agent Explorer with new Search command linked to new Search page 3. We are now ready to add commands to our agent to extract the cruise data. As we have a number of data elements in tables, we will use a list to simplify the extraction for us. To capture a data element, move your mouse precisely over the data element you want, until you see the data-capture box around it. We start by selecting the first cruise name. © 2014 - 2015 Sequentum Quick Start with Content Grabber 35 First Cruise Line data element selected within Content Grabber 4. Then click List in the Configure Agent Command panel to activate the list selection mode. Activating List selection mode from the Configure Agent Command panel 5. In list selection mode, we can add web data elements into the list by clicking similar data elements. Now we'll click on the second cruise name and you will see Content Grabber has selected the remaining data elements on the page. Note: if any cruise data elements remain unselected, simply click on these to add to the list. © 2014 - 2015 Sequentum 36 Content Grabber Second Cruise Line data element selected while in List selection mode 6. We now click Save to save the list and exit list selection mode. The Web Element list command defines the list area, so any elements within this area are now included within the list. 7. To capture the cruise name text, we click on any selected element to display the Content Grabber Message window. From the Content Grabber Message window, choose the Capture Text option to add the web element command to capture the cruise names. We have now added new web element list and web element commands to the Agent Explorer and Content Grabber has set default names for these commands. 8. To edit the names for the commands, we click the respective Edit icons and set the names of the commands to ‘Search List’ and ‘Cruise Name’. Then click the Green Tick to save. © 2014 - 2015 Sequentum Quick Start with Content Grabber 37 Agent Explorer with new Search List and Cruise Name commands 9. Now we plan to extract the individual cruise data elements from each table. So firstly click on the cruise line data element. Content Grabber now automatically selects all the cruise line information because it is already defined as a list. 10. Next, click on the cruise line data element one more time to display the Content Grabber Message window. Now choose the Capture Text option from the Content Grabber Message window to add the command to the Agent so we can capture the individual cruise line data elements. 11. After that, we click the Edit icon to change the name of the command to ‘Cruise Line’ then save it. 12. Now we do the same for the Destination and Departing From data elements, then set the respective names of the commands to ‘Destination’ and ‘Departing From’ and save them. Agent Explorer with new Capture Text commands 13. To extract the ‘Ports of Call’ data we will utilize Content Grabber's content transformation method. Refer to Refine Your Data to see how this is done. © 2014 - 2015 Sequentum 38 Content Grabber 14. We also want to capture all the price information in the pricing table, so as before, we select the first element (View Itinerary) in the pricing table from the list. Content Grabber will then select all of the first elements throughout the list. 15. Click one more time to display the Content Grabber Message Window. Then choose the Capture Text option to add the command to the Agent. Before we change the default command name, we plan to add all the individual price details to the Agent, then we will change all the names at once. 16. Do steps 14 and 15 again for the ‘inside’, ‘outside’, ‘balcony’ and ‘suite’ data elements to add 4 new commands to the Agent. 17.Now we will change the names of the commands to ‘Departing Date’, ‘Inside’, ‘Outside’, ‘Balcony’ and ‘Suite’ by clicking the respective Edit icons and then save. Agent Explorer showing all Capture Text commands 18. Thus far we have created the Agent to extract all the cruise information on the first page. We need to set it up to iterate through all the search result pages. To do this, we need to use the Follow Pagination command to follow each of the pages.Scroll down the page and select the ‘Next 10’ link. Then click one more time on the selected element to display the Content Grabber Message window. © 2014 - 2015 Sequentum Quick Start with Content Grabber 39 Content Grabber Message windows with Follow Pagination option selected 19. Now we choose the Follow Pagination option to add the pagination command to the Agent. Content Grabber has added the pagination command to the Agent and loads the next page on the second browser tab. 20. When we click on the pagination command we can see all the search list commands inside of the pagination command. This means our agent will now iterate through all the search result pages to extract this information. Agent Explorer showing the contents of the Pagination command 21.We have now finished building the Agent so we should save it. To save the Agent, © 2014 - 2015 Sequentum 40 Content Grabber choose File > Save in the Content Grabber menu, and then enter the Agent Name “cruisedirect”. Then click the Save button to commit your changes. Saving the "cruisedirect" Agent In the next section, Refine Your Data we use Content Grabber's Content Transformation method to retrieve the Ports of Call data. 2.5.3 Refine Your Data (optional) In the previous section, we added commands to our agent to capture all the Cruise Direct content elements we require, except for the Ports of Call data. Due to the way the information is displayed for Ports of Call, we cannot select only the Ports of Call data element by clicking on it. Instead, we will select all the cruise information, then use Content Grabber’s transformation method to extract only the Ports of Call detail. 1. Firstly select the cruise information then use the mouse to expand the Configure Agent Command panel so we can see all the transformation fields. 2. Next, we scroll down the capture sample window until we can see the Ports of Call information. Now we select only the ports of call information. You should © 2014 - 2015 Sequentum Quick Start with Content Grabber 41 notice that the Transformation Script button has now changed to a Generate Transformation button. Selecting text to Transform in the Configure Agent Command panel 3. Now click on the Generate Transformation button, and we can now see only the ports of call information in the Transformed window. Transformed text in the Configure Agent Command panel 4. We will now click the Add Command button to add the transformation command to the Agent. 5. Click the Edit icon, then change the default command name to ‘Ports of Call’ then save it. © 2014 - 2015 Sequentum 42 Content Grabber Agent Explorer showing all the Agent commands In the next section, Output Data Format we look at the data output formats available for the extracted web data and show how to change and configure a new export target. 2.5.4 Output Data Format Content Grabber can export extracted web data as a report or to numerous different database types. Data output options include CSV, Excel, XML, SQL Server, MySQL, Oracle and OleDB. You can also use a Content Grabber export script to completely customize the data export to your own database structures. This is useful when you want the data updates to be dynamically reflected such as an online website / portal. Content Grabber can export data to Excel 2003+ and take advantage of features in Excel 2007+ such as outlining and embedded images. Data is exported automatically to your chosen export destination when a data extraction project completes, so you don't have to export data manually. However, you can always export extracted data manually at any time to any export destination. The following steps show how easy it is to choose the data export type you want to use. © 2014 - 2015 Sequentum Quick Start with Content Grabber 43 1. We start by clicking on Content Grabber's Data menu and then clicking the Change Export Target button. Content Grabber's Data menu 2. After clicking the Change Export Target button the Data Configuration Window displays. This window allows you to change and configure a new export target. The default option is Excel 2003. Click the Export target drop-down list box to display the different report and database export options available. Then choose the format you want to use. You can also change the default destination folder location for where the data file(s) are output. Content Grabber's Data Configuration window In the next section, Test Your Agent we use Content Grabber's Debugging features to test the "cruisedata" Agent is functioning as expected. 2.5.5 Test Your Agent Once you have finished developing your agent, it is important to test it to ensure the © 2014 - 2015 Sequentum 44 Content Grabber correct data is being extracted and in the format you require. Content Grabber has a sophisticated debugging engine which enables you to carefully analyze all aspects of your agent and the data being extracted. It can also help you to pin-point any trouble spots in your agent code so you can quickly resolve them. For more detail on the Debugger features available in Content Grabber refer to Testing/Debugging an Agent. Now let's run the Cruise Direct agent to check the commands are working properly. To do the test run, click the ‘Debug’ menu option at the top left of the screen and then the click ‘Start’ arrow button to start the debugging. Content Grabber Debugger running the Cruise Direct Agent During the debugging, we can observe and check that Content Grabber goes through each of the commands in sequence and processes each of the web pages to extract the required data. Part way through running the agent, we will click the ‘Stop’ button to stop the debugging. Then click ‘View Export Data’ to check the output results are correct. © 2014 - 2015 Sequentum Quick Start with Content Grabber 45 Content Grabber's default view of the export data To see the export data in Excel, we can simply click on the ‘Open Exported Spreadsheet’ button to open the Excel spreadsheet. Viewing the Cruise Direct export results in an Excel Spreadsheet In the next section, Scheduling you will learn how to set up an agent so it can be automatically run at intervals of your choosing. 2.5.6 Scheduling Content Grabber provides an Agent scheduling facility that enables you to automatically run your agent at predetermined time slots whenever you need it to run. © 2014 - 2015 Sequentum 46 Content Grabber This can be done every hour, every day, month, year and so on. Once you have completed your agent, you will be able to access this facility. From the Agent Settings menu at the top of the Content Grabber application. Simply Click on the Schedule menu option to display the Scheduling window. Content Grabber's Scheduling Window For details on how to configure Content Grabber's Scheduling feature, or if you'd like to learn how to use Windows Task Scheduler with your Agents, refer to Scheduling in the Content Grabber Editor section. In the next section, Run Your Agent we will show you how you run your finished webscraping Agent. 2.5.7 Run Your Agent Now that we have finished developing and testing the Cruise Direct Agent it is ready to use. To run your agent, you simply click on the Run menu at the top left of the Content Grabber application then click the Run Agent arrow selection. © 2014 - 2015 Sequentum Quick Start with Content Grabber 47 Content Grabber's Run menu selections Note: if you have already scheduled your Agent to run at a later time or date, you can just leave it on your Internet enabled PC or Server and it will run automatically. Refer to Scheduling for more detail. © 2014 - 2015 Sequentum 48 3 Content Grabber Content Grabber Editor The Content Grabber Editor enables you to easily see and manipulate your web scraping agent as your build it. This section of the manual looks at various aspects of the Content Grabber interface to help ensure you understand and get the most out of the Content Grabber application. In this chapter, you'll learn: Customizing the Editor Layout Expert Layout Features Working With the Web Browser Command Configuration Selection Tools and Short-cut Keys Navigating an Agent Testing/Debugging an Agent Scheduling Copying & Moving Agents Exporting & Importing Agents Automated Agent Backups Template Libraries 3.1 Customizing the Editor Layout Content Grabber's editor layout includes a number of workspace windows which can be moved and re-sized to suit your development layout preferences. To help highlight this feature and demonstrate how you can configure the Editor Layout, we will reposition the Agent Explorer panel on the left of the Editor Layout screen. © 2014 - 2015 Sequentum Content Grabber Editor 49 Content Grabber's default start screen with Agent Explorer on the upper right side of the screen Now by simply positioning the cursor at the top label of the Agent Explorer panel and then clicking and holding the mouse button down, you can drag the panel to a new position on the screen. When you are dragging the Agent Explorer panel, you will notice a number of small docking objects appear with small arrows on them. These are the available docking areas to where you can drag and lock the panel into its new position. Moving the Agent Explorer panel with the mouse - docking positions are displayed © 2014 - 2015 Sequentum 50 Content Grabber When you position the cursor over a docking position, you will notice the area changes color to blue. This means the panel is in position and you can let go of the mouse button to dock the Agent Explorer panel into this new position. Docking position selected at the left of the screen - indicated by blue highlight Once the mouse button is released, the Agent Explorer panel is relocated to the left of the Editor Layout screen. Agent Explorer panel repositioned to the left of the Editor Layout screen © 2014 - 2015 Sequentum Content Grabber Editor 3.2 51 Expert Layout Features When using Expert Layout there will be additional options in most menu selections. In this section we will have a brief look at what the Expert Layout options are and what they are used for. Note: You can switch between Simplified Layout and Expert Layout from the Application Settings menu at the top of the Content Grabber screen. Application Settings menu The Application Settings Menu includes the following Expert Layout options. Content Grabber's Application Settings menu with Expert Layout enabled Application Proxy Source Specifies a default source of proxies when an Agent's proxy source is set to Application. Proxy List Opens the Proxy List Window where you can add or remove proxy servers when the application proxy source is set to Proxy List. Database Connections Default database connection configurations for all agents on the current computer. By using computer wide database connections, you avoid having to configure database connections for each agent. © 2014 - 2015 Sequentum 52 Content Grabber New Connection Adds a new database connection for use in all agents on the computer being used. Edit Edit or remove the selected database connection. Command Library Opens the Command Library that allows you to view and insert commands from the library into your agent. Agent Library Opens the Agent Library that allows you to view agents in the library, and create new agents based on agents within the library. Export Libraries Exports all libraries to a single file that can be imported into Content Grabber on a different computer. You can also use this feature to backup your libraries. Import Libraries Imports libraries from a file that has previously been exported from Content Grabber. Note: Importing libraries overrides all existing libraries. Selection menu © 2014 - 2015 Sequentum Content Grabber Editor 53 The entire Selection (Selection XPath) menu is only visible in Expert Layout. The menu options are as follows. Content Grabber's Selection XPath menu Refresh Browser Refreshes the selection in the web browser If the selection XPath is edited manually, then a refresh is required to update the selection in the web browser. Selected XPath A selection can consist of multiple XPaths. The web browser will mark the web elements chosen by the selected XPath. Choose View All to mark the web elements selected by all the XPaths in the selection. Add Selection XPath Adds an XPath to the selection. A selection can consist of multiple XPaths, which allows you to select web elements in different areas of a web page. Remove XPath Selection Removes the currently selected XPath from the selection. Note: A selection must consist of at least one XPath, so an XPath can only be removed if © 2014 - 2015 Sequentum 54 Content Grabber the web selection has more than one XPath. Selection XPath The current selection XPath. Note: After editing the XPath, you must press Refresh Browser to update the selection in the web browser. Run menu The Run menu has two additional options for Internal Data when in Expert Layout. These menu options are as follows. Content Grabber's Run menu with Expert Layout enabled View Log Opens the Log Window that shows all the entries from the last agent run. Note: you must turn on logging and then run the agent before you can view log data. View Internal Data View the internal data extracted by the agent. You must run the agent before you can view internal data. The internal data is a "raw" representation of the extracted data, and contains many data fields that are only used internally by Content Grabber. © 2014 - 2015 Sequentum Content Grabber Editor Data menu The Data menu has three additional options when in Expert Layout. One additional Export Data option and two new Internal Data options. These menu options are as follows. Content Grabber's Data menu with Expert Layout enabled Regenerate Export Data Generates and exports existing data. If you change the export target, you can press this button to export the data to the new export target without having to extract the data again. View Internal Data View the internal data extracted by the agent. You must run the agent before you can view internal data. The internal data is a "raw" representation of the extracted data, and contains many data fields that are only used internally by Content Grabber. Change Internal Database Content Grabber saves data to an internal database while it's extracting data. The default internal database is a SLQLite database, but you can change this database to SQL Server or MySQL. © 2014 - 2015 Sequentum 55 56 Content Grabber Debug menu The Debug menu has a number of changes and additional menu options. The menu options are as follows. Content Grabber's Debug menu with Expert Layout enabled Execute Next Command Executes the next command and then pauses debugging. Execute Next Action Executes the next action command and then pauses debugging. An action command is a command that executes an action such as loading a new web page. Debug Speed Set the debug speed to slow down or increase the debugging speed. You can slow down the debugging speed to help you see what is happening while the agent is running. Debug Set If the selected command selects a list of web elements or data items, you can specify a range of list items that should be processed by the debugger. © 2014 - 2015 Sequentum Content Grabber Editor View Internal Data View the internal data extracted by the agent. You must run the agent before you can view internal data. The internal data is a "raw" representation of the extracted data, and contains many data fields that are only used internally by Content Grabber. Enable Debug Logging Turns on logging while debugging the agent. Log HTML Logs the full HTML of all pages in addition to other log details. View Log Opens the Log Window that shows all the entries from the last agent run. Note: you must turn on logging and then run the agent before you can view log data. Agent Settings menu The Agent Settings menu has several additional menu options available in Expert Layout. These menu options are as follows. Content Grabber's Agent Settings menu with Expert Layout enabled © 2014 - 2015 Sequentum 57 58 Content Grabber Proxy Source Specifies where Content Grabber should look for proxies when running an agent. Proxy Rotation Interval Specifies the proxy rotation interval when using a list of proxy servers. Proxy List Opens the Proxy List Window where you can add or remove proxy servers when the application proxy source is set to Proxy List. Activate Proxy After changing the proxy source, use this button to start using the chosen proxy configuration in the editor. Input Parameters Input parameters are named values assigned to an agent when the agent starts. Initialization Script An initialization script is executed before the agent starts. This script allows you to set certain agent parameters, such as database connection strings or start URLs. Assembly References A list of .NET assemblies used for scripting. These can be standard .NET © 2014 - 2015 Sequentum Content Grabber Editor 59 assemblies or custom assemblies. Refresh IntelliSense Refreshes IntelliSense in all script editors. Use this button to refresh IntelliSense after adding or removing assembly references. 3.3 Working With the Web Browser Content Grabber uses an embedded version of Internet Explorer as its web browser. The web browser has been greatly modified to suit the purpose of web scraping, but it essentially works the same way as the standard version of Internet Explorer you have installed on your computer. If a website does not work properly in Internet Explorer, it will also not work properly in Content Grabber. Web Browser Selection Mode When you click on links and buttons in a normal web browser, you will normally perform some sort of action, such as loading a new web page. Content Grabber intercepts all actions in its web browser, and when you click on a web element, it marks the element as selected instead of performing the default web browser action. When you click on a web element that has already been selected, Content Grabber will give you an option to add an agent command that performs an action on the selected web element. © 2014 - 2015 Sequentum 60 Content Grabber The available options, and the order of the options available, depends on what type of web element you have selected. For example, if you have clicked on a selected link element, the first option will be to navigate the link and open a new web page. You can change what happens when you click on a selected web element by setting the Add Command Mode in the Application Settings ribbon menu. You have the following three options. Add Description Command Mode © 2014 - 2015 Sequentum Content Grabber Editor Always Add Content Grabber automatically adds an agent Command command that performs the most appropriate 61 action for the selected web element, without first displaying the Add Command dialog. Confirm The Add Command dialog is displayed when you Before click on a selected web element, which lets you Adding decide what type action you want to perform on Command the web element. This is the default behavior. Never Add Nothing happens when you click on a selected Command web element. You always have to manually configure your agent commands. Web Browser Navigation Mode Sometimes it's useful to navigate in the Content Grabber web browser the same way as a normal web browser. For example, you may load a URL, but then want to navigate to a particular web page and start the data extraction from that page. You can switch the web browser from selection mode to navigation mode by clicking the Navigate in Web Browser button in the application menu. Once you have reached the web page where you want to start extracting data, you can set the current URL as the start URL by clicking the check icon next to the URL address bar. Important: Content Grabber must be able to load the start web page directly from the start URL. If the start web page cannot be loaded directly by using the start URL, © 2014 - 2015 Sequentum 62 Content Grabber then you must choose another start web page, and then use agent commands to navigate to the web page where you want data extraction to start. Disabling Web Browser Events Some websites display or hide web elements when certain events occur in the web browser, such as when the move moves over a web element, or when an input field loses focus. Sometimes it can be difficult to select such dynamic web elements, because they may show and hide as you move the mouse around. You can click the Disable Web Browser Events button in the application menu to block web browser events, so that dynamic web elements do not show and hide as you move your mouse for example. If you want to "freeze" the web page when your mouse is over a certain web element, then you can use the CTRL+D shortcut key to disable events, so you don't have to move the mouse away from the web element to click the Disable Web Browser Events button. 3.4 Command Configuration Content Grabber agents consist of commands that are executed to navigate a website and extract content. You can configure a command manually by selecting a web element in the Content Grabber web browser and then use the configuration window to specify and configure the action that should be performed on the selected web element. © 2014 - 2015 Sequentum Content Grabber Editor 63 You can also let Content Grabber configure a command automatically by clicking on a selected web element, and then choose from a list of appropriate actions for the selected web element type. © 2014 - 2015 Sequentum 64 Content Grabber Please see Agent Commands for more information about configuring commands. Configuring Multiple Commands Automatically Content Grabber can automatically configure multiple commands to extract content from whole areas of a web page, such as complete HTML tables or search results. To configure multiple commands automatically, select the entire page area you want to extract data from, such as a HTML table, and then click on the selected area again to open the Add Command dialog window. © 2014 - 2015 Sequentum Content Grabber Editor 65 If you have selected an area with more than one web element, you will get an option to Capture All Web elements. This option will add a default capture command for each web element in the selected area. The image below shows an example of commands added using the Capture All Web Elements option. © 2014 - 2015 Sequentum 66 Content Grabber Content Grabber will often add more capture commands than you need, and you will have to manually delete the commands that are not needed. Content Grabber will automatically try to assign names to the added commands, but in most cases you will have to change the names to make them more meaningful. Capturing HTML Tables or Lists If you have selected more than one web element in the web browser, you will get an option to Capture List or Table on the Add Command dialog window. This option will add a command for your selected list, and within the list it will add a default capture command for each web element. If you have selected a single web element that contains a list or a HTML table, you will also get an option to Capture List or Table on the Add Command dialog window. This option will locate the first list or table within the selected web element, and add a list command to your agent that iterates through each row in the table or list, and within the list it will add a default capture command for each web element. After selecting the Capture List or Table option from the Add Command dialog © 2014 - 2015 Sequentum Content Grabber Editor 67 window, you will get a new dialog window where you can select the type of table or list you have selected. You can choose one of the following table or list types. Table or Description List Type Table With A table where the first row contains header text. Header Row The header text will not be extracted, but will be used to name the capture commands. The header text must be unique for each column. Table A table without header text. The capture Without commands will use auto generated names. Header Text Table of © 2014 - 2015 Sequentum A table with two columns where the first column 68 Content Grabber Name/Value contains the value name and the second column Pairs contains the value. The value names will not be extracted, but will be used to name the capture commands. The value names must be unique. List Any simple list of content. A capture command will be added for each web element in the list. The capture commands will use auto generated names. After selecting the table or list type, Content Grabber will add a list command with capture sub-commands. The image below shows an example of commands added using the Capture List or Table option. Content Grabber will often add more capture commands than you need, and you will have to manually delete the commands that are not needed. Content Grabber will automatically try to assign names to the added commands, but if you have chosen the List or Table Without Header Text option, you may have to change the names to make them more meaningful. © 2014 - 2015 Sequentum Content Grabber Editor 3.5 69 Selection Tools and Short-cut Keys Content Grabber includes a handful of selection tools and short-cut keys that can significantly save you time when developing an agent. It is useful to understand how these work to improve your productivity. This section of the manual looks at these tools and explains how they are used. Also included are details on short-cut keys for further optimizing development time. Most of the menu buttons / icons for the selection tools can be found at the bottom of the Configure Agent Command panel. Bottom of Configure Agent Command panel showing selection tools List Selection The list selection is used to select common elements on a page and save then as a list. It can be text, titles, photos, captions, tabular data etc. To use this function, simply select a web element in the browser then click the List button at the bottom of the Configure Agent Command panel. Clicking the List button enables List Selection Mode. To collect multiple elements, select the next web element and Content Grabber will automatically add all the similar elements on the page to the list. While in List Selection Mode, you can continue to add items to your list by clicking on the elements if they weren't selected automatically. You can also click a selected item again to remove it from the list. © 2014 - 2015 Sequentum 70 Content Grabber List Selection Mode enabled Once you have selected all your content elements into the list, click the Save button to save the list and exit List Selection Mode. Shift Short-cut key Instead of clicking the List button you can also enable List Selection Mode using the Shift key. To do this, hold down the Shift key on your keyboard, click the first element (such as the first title in the list) and—continuing to hold down the Shift key—click the mouse on the very next element (e.g. the next title in the list). When you have finished selecting elements for you list, simply release the Shift key to save the list. Anchor Selection The Anchor Selection function is useful when you want to extract content that changes © 2014 - 2015 Sequentum Content Grabber Editor 71 position on different pages. For example, you may extract data from a table that displays product properties, such as width, height and depth, but some properties may not apply to all products, so the location of each property may vary depending on how many properties are available for a certain product. You can use an anchor selection to tie the property value to the property name, so the property value you are extracting is always correct, no matter where it's located in the table. To use this function, select one or more elements in the web browser then click the Anchor button at the bottom of the Configure Agent Command panel. This will enable Anchor Selection Mode. While in Anchor Selection Mode, click the web element you want to anchor your selection to. Anchor Selection Mode enabled Once you have made your selection, click the Save button to save the Anchor and © 2014 - 2015 Sequentum 72 Content Grabber exit Anchor Selection Mode. Ctrl Short-cut key Instead of clicking the Anchor button, you can also enable Anchor Selection Mode using use the Control (Ctrl) key. To do this, hold down the Ctrl key on your keyboard, click the first element and continuing to hold down the Crtl key. Next, click on the web element you want to Anchor this content element to. Once complete, release the Ctrl key to save the Anchor. Command Naming Short-cut (Alt Key) When creating a new Command in Agent Explorer, Content Grabber will automatically name the command for you. This is typically based on the naming conventions used in the HTML behind the web page. If you hold down the Alt key when clicking on any text element in the browser, Content Grabber will name the new command with the text you click on. Expand Selection The Expand Selection tool is useful when only part of a web element has been selected (e.g. a multiple row and column table) and you want to expand the selection. To use this function, simply click the Expand Selection icon at the bottom of the Configure Agent Command panel to expand the current selection in the web browser. Click again if you need to expand the selection further. Clear Selection © 2014 - 2015 Sequentum Content Grabber Editor 73 The Clear Selection tool clears the current content element selection in the web browser. To use this function, simply click the Clear Selection icon at the bottom of the Configure Agent Command panel to clear the current selection in the web browser. Exact Selection The Exact Selection tool is generally only used by advanced users. Use this option to generate selection XPaths with as much detail as possible. Exact selections are likely to fail if a website changes slightly, so only use this option if you understand the consequences. To use this function, select the content element on the browser page you want positional information for and then click the Expand Selection icon at the bottom of the Configure Agent Command panel. The detailed positional data is then displayed in the Selection XPath window at the top of the Content Grabber screen. This option is best used to create selection XPaths that can be used as a starting point when manually creating XPaths. Refer to Editing XPaths Manually for more detail. 3.6 Navigating an Agent An agent loads and processes a number of different types of web pages, and each type of web page has a separate web browser tab in the agent editor. © 2014 - 2015 Sequentum 74 Content Grabber The above image shows the browser tabs for an agent that processes three different types of web pages. An agent may process thousands of web pages, but you only need to define commands for each different type of web page. All web pages with the same layout are considered the same type of web page. For example, if you are extracting data from a product catalog, all product detail pages are considered a single type of web page, and you therefore only need to define commands for a single product detail page. Agent Explorer The Agent Explorer is used to view agent commands for the type of web page loaded in the selected browser tab. When you select a new browser tab with another web page, a new set of commands will be displayed in the Agent Explorer. You can use the Agent Explorer to view, edit or execute commands. If you execute a command that opens a new web page in a new web browser tab, Content Grabber will automatically switch web browser tab and load the new web page. © 2014 - 2015 Sequentum Content Grabber Editor 75 When you open a new agent, Content Grabber will automatically execute the Agent command which loads the first web page into the first web browser tab, but it will not execute all the other commands, so it will not load web pages into any other web browser tabs. You can still switch browser tabs to view and edit the associated commands, but you will not be able to view the associated web page until you execute the command that loads a web page into the browser tab. You generally need to execute all agent commands if you want to make sure web pages are loaded into all web browser tabs. It can be a very tedious task to manually execute all commands in an agent, so you can click the Execute All Commands button in the application menu to automatically execute all commands in the agent. You can also execute all commands from start up until a specific command. Right-click on a command in the Agent Explorer and select Execute From Start to Here from the context menu. If you just want to execute a set of commands on the web page in the selected web browser tab, then you can right-click on a command and select Execute to Here from the context menu. This is often used when you want to submit a web form, but don't want to manually execute all the Form Field commands required to set the form field values. Since you just want to set form field values on the current web page, there's © 2014 - 2015 Sequentum 76 Content Grabber no need to execute commands in other web browser tabs. 3.7 Testing/Debugging an Agent The agent debugger is an essential tool when trying to locate and correct issues in an agent. Running an agent through the debugger is normally the first thing you do after you have built your agent. The debugger will warn you of any potential issues while it's running an agent, and allow you to pause and correct any issues. An agent runs significantly slower through the debugger, and the debugger does not automatically manage JavaScript memory leaks and hanging/crashing web browsers, so you should only use the debugger for testing, and not for full agent runs. Refer to Using the Debugger for more detail. Logging Logging is used to collect detailed information about the web scraping process and can be used both when debugging and running an agent. Logging includes links to the web pages that are being processed, so it's easy to pinpoint pages that may be causing problems. Refer to Using Logging for more detail. 3.7.1 Using the Debugger The agent debugger is used to test your agent, and to find and correct issues such as missing content elements. When building an agent, you often base the design of the agent on just a few web pages, and then rely on Content Grabber to execute your commands on all similar web pages it encounters. © 2014 - 2015 Sequentum Content Grabber Editor 77 For example, if you are extracting data from a product catalog, you may select the product name and price on a single product detail page and add agent commands to extract those two web elements. Content Grabber will then use these two agent commands to extract product name and price for all product detail pages in the product catalog. This works fine in most cases, but some products maybe on special, and the product detail pages for those products may display the price differently. Content Grabber may be unable to pick up the discount prices and the price for some products may therefore be missing. Even if you know some products have a discount and need special attention, it may be difficult to find such a product page in a large catalog. The agent debugger can help you correct issues where an agent is unable to locate content, because it stops and warns you when content is missing and allows you to correct the agent command that caused the error. In the product catalog example, you would be able to correct the agent command that extracts the price, so it includes a selection for the discounted price. Controlling the Debugger Use the Start button in the Debug ribbon tab to start the debugger. The agent will run directly in the agent editor, so you will be able to see the web pages being loaded into the web browser. The debugger will mark the content that is being processed in the web browser, and it will also highlight the command that is being executed, in the Agent Explorer panel. Once the agent is running in the debugger, you can pause or stop the debugger at any time. If you pause the debugger, you can use the Next Command button to execute © 2014 - 2015 Sequentum 78 Content Grabber one command at a time, and the Next Action button to execute commands until the debugger reaches an action command. Setting the Debug Speed The debugger will mark the content that is being processed in the web browser, which is helpful when trying to work out how the agent is processing the website, but the debugger may run so fast that it's impossible to see what's going on. You can slow down the debugger to make it much easier to follow the process. You can change the debug speed before you start the debugger or while the debugger is running. Debugging a Data Subset Agent list commands may process large data sets, such as a long list of Start URLs or a long list of form field input data. If you want to test an agent with data from the end of a list, it may take hours for the agent to reach the data you want to test. Instead of waiting for the agent to reach the data you want to test, you can specify the subset of data you want to include in the debugging session. The Debug Set option is only available when you select a list command in the Agent Explorer. This setting is automatically saved to the agent command and will apply to the command every time you debug the agent. This option has no effect when you run an agent, only when you debug the agent. © 2014 - 2015 Sequentum Content Grabber Editor 79 Debug From a Specific Command Sometimes you may want to only debug a small part of your agent. If you have a complex agent that processes a large website, it may take a long time before the debugger reaches certain part of your agent, so instead of starting from the beginning, you can select the agent command where the debugger should start. When you start debugging from a specific command, the web browser tab that is associated with the command must have a web page loaded. You cannot start debugging from a blank web page. Disable a Command While Debugging When debugging a complex agent that processes a large website, it's often useful to exclude parts of the agent, so you only execute the part of the agent you want to test. you can do this by disabling commands while debugging. © 2014 - 2015 Sequentum 80 Content Grabber When you disable a container command, all sub-commands are also automatically disabled. This setting is automatically saved to the agent command, so the command will be disabled every time you debug the agent until you enable the command again. This setting has no effect when you run an agent, only when you debug the agent. Viewing Debug Data Data collected while debugging an agent is saved to a separate data store, so it doesn't overwrite data that is collected when you run the agent. The debugger will export data to your chosen export target, but it will never distribute data. If you are exporting data to a database, the debugger will create separate data tables for the debug data. If you are exporting data to a file format, the files will be written to the agent's debug data folder. Important: If you are using a script to export data, you are responsible for managing debug data if you want to separate debug data from normal data. © 2014 - 2015 Sequentum Content Grabber Editor 81 Please see the Data topic for more information about agent data. 3.7.2 Using Logging Logging can be used when debugging or running an agent to collect information about the web scraping process. Logging can be set to three different detail levels, low, medium and high. Low detail level only logs errors, medium detail level logs errors and warnings, and high detail level logs errors, warnings and general progress information. Debug logging is always set to high detail level and cannot be changed. Debug log data is stored in a separate data store, so it doesn't overwrite or get mixed with normal log data. Debug logging can be turned on in the Debug ribbon menu. Normal logging can be turned on when running an agent on the Run Agent screen. The Run Agent screen is not displayed when running an agent using the Run button in the application menu, so you will not be able to configure logging when running an agent this way. Logging Raw HTML Content Grabber automatically logs direct links to processed web pages when logging © 2014 - 2015 Sequentum 82 Content Grabber is turned on, so it's normally easy to view specific web pages that have been processed. For example, if an error or warning appears in the log, you can simply click on the associated URL to open the web page and see if there is anything special about the page that may cause an error. Sometimes it's not possible to open a web page using a direct URL. For example, some websites implement CAPTCHA protection, which is web pages that appear randomly to ask the user to enter a verification code. If a CAPTCHA page is retrieved instead of a normal web page, your agent is likely to encounter errors, but if you click on the associated URL later, you may not get the CAPTCHA page because it appears randomly. In this case it may be difficult to determine what is causing the error, but you can use the Log HTML feature to log the raw HTML of all processed web pages, and this will allow you to view the CAPTCHA page. Please see the CAPTCHA topic for more information about CAPTCHA. Error Handling Please see the Error Handling topic for more information about error handling, notifications and logging. 3.8 Scheduling Content Grabber Scheduling Content Grabber provides an Agent scheduling facility that enables you to automatically run your agent at predetermined time slots whenever you need it to run. This can be done every hour, every day, month, year and so on. If you have an Agent loaded in Content Grabber, you will be able to access the Scheduling feature. From the Agent Settings menu at the top of the Content Grabber application. Simply Click on the Schedule menu option to display the Scheduling window. © 2014 - 2015 Sequentum Content Grabber Editor 83 Content Grabber's Scheduling Window The Enable schedule checkbox allows you to easily turn scheduling on or off for the Agent. Click the Enable schedule checkbox so you can begin configuring the schedule details. The Run interval fields allow you to select how often your agent will run. This can be set in seconds, minutes, hours or days. Start date and Start time enable to preciously control from when the agent will start executing. The log feature allows you to record when the agent runs. The log level setting you choose (i.e. none, low, medium, high) will determine the level of detail that is recorded about each run. For instance, you may want to track all error messages or issues encountered by the running agent so you might choose "high". However, be mindful that a high level of logging may slow the Agent performance. Content Grabber also includes some scheduling security features. If you don't want your Agent to run when to aren't logged on to your computer, simply click the Run only if user is logged on check box. Using Windows Task Scheduler Should you want to manage multiple agents from one place or are looking for additional controls, or calendar functionality, you can utilize the Windows Task Scheduler instead. To do this, simply click on the Open Windows Scheduler button © 2014 - 2015 Sequentum 84 Content Grabber on the Scheduling window. Windows Task Scheduler is used to create and manage common tasks that your computer will carry out automatically at the times you specify. In this instance the task will be to run a web-scraping Agent(s) at the scheduled time(s) you require. To setup a basic Agent schedule task, simply go to the Action menu and select the Basic Task Wizard. This wizard will lead you through the required steps to setup your schedule task. Windows Task Scheduler - Using the Basic Task Wizard For more advanced scheduling options or settings such as multiple task actions or triggers, use the Create Task command in the Actions menu. © 2014 - 2015 Sequentum Content Grabber Editor 85 Windows Task Scheduler - Create Task Command (for advanced options) Tasks are stored in folders in the Task Scheduler Library. To view or perform an operation on an individual task, select it in the Task Scheduler Library and click on a command in the Action menu. 3.9 Copying & Moving Agents Moving an agent You can move an agent by simply moving the agent folder to a new location. The agent will retain its unique identifier, so the agent will work as before although it is now in a new location. Copying an agent To copy an agent, simply choose File > Save As in the Content Grabber menu. The new agent will get a unique identifier - an identifier that is different from the original agent. WARNING: We strongly recommend that you do not simply copy the agent from one folder to another on your disk. The result will be two agents having the same identifier, since the copy will retain the original identifier. © 2014 - 2015 Sequentum 86 Content Grabber Content Grabber stores information about each agent in its database, and can only store one entry per identifier. If you do have two agents on your computer with the same identifier, the active agent will be the agent that you open most recently in the Content Grabber editor. In these circumstances, you risk a duplication of effort or a loss of work. 3.10 Exporting & Importing Agents Exporting and Importing agents allows you to easily move your web scraping agents between machines. You can also export your agent for back-up to another device. Exporting Agents To export your agent, first make sure you save your agent and have it open. Then from the File menu, select Export Agent. The following Export Agent pop-up window will appear on your screen. Export Agent window Content Grabber's Export Agent facility enables you to package up all the associated files and folders into one file in a manner that is similar to a zip file. The default location for where agents are exported is the Documents > Content Grabber > Agents folder on your PC. You can also elect to Include internal data and/or Include export data into the export file by selecting the corresponding check box. Once you have made your selection, simply click the Export button. © 2014 - 2015 Sequentum Content Grabber Editor 87 For detail on the Create self-contained agent check box, refer to Building a Selfcontained Agent. Importing Agents You can only import agents that have previously been exported from Content Grabber. To do this, select Import Agent from the File menu. The following Export Agent popup window will appear on your screen. Import Agent Open window - to select agent file for import Simply select the Content Grabber agent file you wish to Import and then click the Open button. All the folder structures and associated files for the agent will then be unpacked into the Documents folder on your machine. 3.11 Automated Agent Backups Whenever you save an agent, Content Grabber automatically makes a backup for you which you can restore your agent from should you have any issues. To see the different backups available to restore from, click on the File menu and you © 2014 - 2015 Sequentum 88 Content Grabber will see the agent backup listed under this menu. Restore from Content Grabber's Automated Backups Content Grabber also saves a backup restore file for you should the application close unexpectedly (e.g should the application process terminate). This will be available for you to restore from so you do not lose your latest development work. 3.12 Template Libraries A template library contains building blocks that you can use to build your agents. A building block can be an entire agent, a group of commands, or a script. Content Grabber includes a large number of predefined building blocks, but you can also add your own - either to the Agent Template Library or the Command Template Library. For example, if you need to create many agents that have the same structure but will point to different websites, then you could add your own agent template and derive it from the first agent you create. We encourage you to learn more about the Agent Template Library and the Command Template Library. © 2014 - 2015 Sequentum Selection Techniques 4 89 Selection Techniques Mastering the essential selection techniques is a critical aspect of web-scraping. When you point-and-click on content in the web browser to create agent commands, you are using the most basic selection technique. In addition to the simple point-andclick feature, Content Grabber provides a range of tools to help you make precise selections, including: XPath Editor - gives you the ability to manipulate the selection XPath Tree View Window - gives a more precise view of a web page List Tool - helps you select a list of web page elements Anchor Tool - helps you apply conditions to the selection XPath. It is important to realize that you have to be explicit with Content Grabber. You have to be deliberate and specific when selecting each of the HTML elements that you want to capture. Sometimes it can be confusing, as when you want to capture the elements of a search-result listing and each entry has a heading. In some cases, the heading will be a link and in other cases it will be plain text, so the HTML for the headings may look something like this: <h1><a href="http://website.com">Heading as a link</a></h1> <h1><a href="http://website.com">Heading as a link</a></h1> <h1><span>Heading as plain text</span></h1> <h1><a href="http://website.com">Heading as a link</a></h1> If the first two headings are links, then only the link headings will be chosen - not the headings that are plain text. To extract the text of all the headings in this example, you might use the first two headings to create a list selection. As with many software applications, you have to be explicit with Content Grabber. It does not somehow know that you want to extract all the headings, and its default function will be to assume that © 2014 - 2015 Sequentum 90 Content Grabber you are trying to select only the links. In such a case, you would need to change the selection so that it selects the <h1> tag instead of the link tag - before you create the list. 4.1 Selection XPaths Each time you click on content in the web browser panel, Content Grabber does some processing in the background to calculate the selection XPath. XPath is a common syntax for selecting in XML and HTML documents. Content Grabber uses a standard implementation that supports XPath v1.0 syntax, and also supports a range of new methods specifically designed to make web scraping easier. Content Grabber has a range of tools that helps you create a precise XPath, without the need to know the syntax. Eventually, you may find the need to fine-tune the XPath manually, and then you will need to learn the XPath syntax. Each time you make a selection in the web browser, you can view the selection XPath either in the Selection ribbon menu or on the status bar (see the figures below). Note that the Selection ribbon menu is only available if you are using the Expert user interface layout. The selection XPath in the Selection ribbon menu The selection XPath is shown on the status bar The XPath in the figure above is: //div[@class='feature featureLast']/h3[1]/a[1] It contains these selection steps: © 2014 - 2015 Sequentum Selection Techniques 91 1. Selects all <div> tags within the webpage having the class attribute value feature featureLast. 2. Also selects all child <h3> tags from the selection in step 1 (/h3[1]). 3. Then selects all child <a> tags from the selection in step 2 (/a[1]). Read more in the XPath and Selection Techniques article. To learn more about XPath, we recommend that you consult a good reference guide such as this one: www.w3schools.com/XPath/xpath_syntax.asp Content Grabber XPath Functions Content Grabber has a set of non-standard functions you can use in XPath selections: Function Description bool equals(string Returns True if the two strings are source, string target) equal. The comparison is case insensitive. bool fuzzy-match(string Uses the Jaro/Winkler distance source, string target, algorithm to determine if two strings [double tolerance]) match. This function first splits each string into words and then compares words from each string. If one string contains more words than the other string, the additional words are not considered when calculating the distance. Tolerance must be between 0 and 1. A tolerance of 1 means an exact match. A tolerance of 0.85 is used if © 2014 - 2015 Sequentum 92 Content Grabber tolerance is not specified. bool fuzzy-match(string Uses the Jaro/Winkler distance source, string target, algorithm to determine if two strings double match. This function first splits each singleWordTolerance, string into words and then compares int words from each string. wordMismatchesAllowe d, [bool If isMatchAllSourceWords is true, all isMatchAllSourceWords words in the source string will be ], [bool matches. isMatchAllTargetWords ]) If isMatchAllTargetWords is true, all words in the target string will be matches. if both isMatchAllSourceWords and isMatchAllTargetWords are true, all words in the string containing the most words will be matched. if both isMatchAllSourceWords and isMatchAllTargetWords are false, all words in the string containing the least words will be matched. SingleWordTolerance must be between 0 and 1. A tolerance of 1 means an exact match. If a word match is less than the SingleWordTolerance value, a word mismatch is recorded. © 2014 - 2015 Sequentum Selection Techniques WordMismatchesAllowed is the number of words that are allowed to mismatch before the function returns false. bool not- Returns False if the string contains whitespace(string only white space characters. source) add-bookmark() Adds the current node to an internal list of bookmarks. bool has-parent- Returns True if a parent of the bookmark() current node is in the bookmark list. NodeSet parent- Returns the first parent node that is bookmark() in the bookmark list. string html() Returns the HTML of the current node. string inner-html() Returns the inner HTML of the current node. string uniqueid() Returns a unique ID. string url() Returns any URL property of the current node. string image-url() Returns any image URL of the current node. string email() Returns any email address of the current node. string flash-url() Returns any flash URL of the current node. string tag-text() Returns the text of the current node excluding text from any child nodes. © 2014 - 2015 Sequentum 93 94 Content Grabber string find-data(string Returns the extracted data for a commandName) command that has already been processed. If the command does not exist in the current container command, the function will keep searching in parent containers. string get-data(string Returns the extracted data for a commandName) command that has already been processed. The command must exist in the current container command. string get-input- Returns input data as a string value data(string from the specified data column in commandName, string the data provided by the specified columnName) command. The specified command must be a data provider. string get-input- Returns input data as a string value data(string from the specified data column in columnName) the data provided by the last data provider parent command. string get-input-data() Returns input data from the first data column that contains a string value in the data provided by the last data provider parent command. string get-global- Returns the data entry with the data(string name) specified name from the global data dictionary which includes input parameters. The data entry must be a string value or a simple value type that can be converted to a string value. © 2014 - 2015 Sequentum Selection Techniques int node- Returns the position of a specific position([nodeSet]) node among all nodes with the same parent node. If no node is given, then this is the position of the root node. When choosing elements inside a web element list, the current list element is the root node. So a call to node-position() would return the position of the current list element. The index is not zero based, so the first index is 1. NodeSet root([int Returns the root node with a specific nodeIndex]) index. The index must be greater than 1. If no index is specified the current root node is returned. When selecting elements inside a web element list, the current list element is the root node. So, a call to root() would return the current list element, and a call to root(2) would return the second list element. int root-index() Returns the index of the current root node. When selecting elements inside a web element list, the current list element is the root node. So, a call to root-index() would return the © 2014 - 2015 Sequentum 95 96 Content Grabber index of the current list element. The index is not zero based, so the first index is 1. int last-root-index() Returns the index of the last root node. When selecting elements inside a web element list, the current list element is the root node. So, a call to last-root-index() would return the index of the last list element. The index is not zero based, so the first index is 1. int root-position() Returns the position of the current root node. This position is relative to all nodes with the same parent node. When selecting elements inside a web element list, the current list element is the root node. So, a call to root-position() would return the position of the current list element. The index is not zero based, so the first index is 1. NodeSet root-siblings() Returns following siblings of the current root node, but stops when it encounters another root node. © 2014 - 2015 Sequentum Selection Techniques 97 This function is equivalent to the following selection: root()/following-sibling::*[root-index() =last-root-index() or position()<root-position(root-index()+1)root-position()] When selecting elements inside a web element list, the current list element is the root node. So, a call to root-siblings() would return siblings of the current list element. 4.2 Optimizing Selections An excessively long XPath is unlikely to work well if the target web page changes even slightly. Every step in an XPath must match an HTML tag on the web page, and it will fail if any of these tags move to another location or go away entirely. When you click on an HTML element in the web browser, Content Grabber will automatically attempt to create an optimal selection XPath by making it as short as possible. For example, the full selection XPath path to a <div> HTML element could be as follows: DIV[1]/DIV[5]/TABLE[2]/TBODY[1]/TR[2]/TD[1]/DIV If the DIV tag has an ID attribute with a unique value listView, the optimal XPath is: //DIV[@id='listView'] © 2014 - 2015 Sequentum 98 Content Grabber Content Grabber will examine the entire web page for a DIV tag having the ID attribute value listView. This XPath example is very robust and not sensitive to future page changes, and it will work as long as this element exists on the web page - even if the rest of the page changes. The design of Content Grabber is to prefer ID attributes for optimizing XPaths, because the expectation is that any given ID will be unique on a web page. However, sometimes websites use IDs that are unique to a specific content element, such as a product ID, and such IDs are not appropriate for use in an XPath. If you extract data from a product catalog by using an XPath to extract the product title from all product detail pages, then you do not want the XPath to depend on a specific product ID, such as we have in the case below: //H1[@id='sku_245865'] Such an XPath would work for only one specific product and not for the others. Content Grabber will avoid using IDs that look like identifiers, such as a product identifier, but if it turns out wrong, then your only option is to optimize the XPath manually. If you need to optimize a selection XPath manually, you can use the Exact Selection tool. This tool generates selection XPaths with as much detail as possible, and then you can manually remove the detail that is causing problems - such as an ID attribute containing a product identifier. The Exact Selection tool 4.3 List Selections List selections are important, since you will often want to extract a list of web content or follow a list of links. © 2014 - 2015 Sequentum Selection Techniques You can create list selections by editing the XPath manually, or by using the Content Grabber List Tool. You can start the List Tool by selecting a web element and then holding down the SHIFT key while selecting more web elements that should be included in the list. Once you release the SHIFT key the list selection will be complete. You can also start the List Tool by selecting a web element and then click on List toolbar button. The List Selection toolbar button You can now select more web elements that should be included in your list. You can exclude web elements from the list by clicking on web elements that are already selected. You will need to click the Save button in the List mode window to complete the list. The List Selection window Content Grabber will not just add web elements you click on to the list, but also other web elements of the same type that are naturally part of the same list on the page. For example, all rows in a HTML © 2014 - 2015 Sequentum 99 100 Content Grabber table are naturally part of the same list, so if you click on the first row in the table and then SHIFT-click on the second row, Content Grabber will create a list of all rows in the table, and not just the two rows you have clicked on. You can only add web elements of the same type to a list, and all web elements in a list must naturally belong to the same list. For example, if you have a list of products in one area of a page, and another list of products in another area of the same page, you may not be able to add them to the same list. Content Grabber will automatically attempt to create a natural list of the same type of web elements. If you want to create a list of table rows for example, you can select the first table row, and then SHIFT-click anywhere inside the second table row to create the list. Even if you select a table column, or any other child web element, in the second row, Content Grabber will know that it needs to create a list of table rows. It's important to notice that it DOES matter exactly which child web element you click on in the second table row, since Content grabber will use this information as a selection filter, and exclude all web elements from the list that don't contain this child web element. Content Grabber may also use the text or other properties of the child element to filter the list, but it will only do so if the filter does not filter away web elements you have specifically clicked on to add to the list. Web Element Siblings A list of web elements will automatically also include all siblings of the web elements. For example, you may have a list of table rows where the first row contains a product name and the next couple of rows contain details about that product, and then another title row followed by more rows containing details about the second product, © 2014 - 2015 Sequentum Selection Techniques and so on. In this case you should create a list of the title rows only, and the siblings of the title rows containing product details will then be available to agent commands that references the list selection. For example, if a Web Element List command uses the list selection, then sub-commands can capture content within the sibling web elements. 4.4 Selection Anchors Selection anchors are used to anchor a selected web element to another element, or to filter a list by a child element. The most common use of selection anchors is when you have a list of name/value pairs on a page and want to extract the value for a specific name, but that specific name/value pair is not always located in the same position. For example, when extracting product specifications from a table, some products may have size and weight in the first two rows, but other products may have model and memory in the first two rows, so if you extract size based on position, you will extract size for some products, but model for other products. To overcome this problem, you can anchor the selection of the size value to the text of the web element containing the size name. You can anchor a selection by CTRL-clicking on the anchor element, or by clicking on the Anchor toolbar button and then selecting the anchor element. When using CTRL to select an anchor, the anchor will always be a text anchor, but when using the Anchor toolbar button, you can choose many different kinds of anchors. © 2014 - 2015 Sequentum 101 102 Content Grabber The Anchor toolbar button An anchor works as a filter when used on a list selection, and the anchor element must be selected inside the list. When using CTRL to select the anchor element, the anchor will filter the list so that only web elements that contain a child web element similar to the anchor element will be included in the list. When using the Anchor toolbar button, you can chose many different kinds of anchors to filter a list. The Anchor window 4.5 Editing XPaths Manually Content Grabber will usually create an optimal selection XPath, but sometimes you may find it necessary to fine-tune the XPath manually. It can be difficult, however, to create an XPath from scratch. If you want to manually edit an XPath, we recommend that you first let Content Grabber generate the XPath and then edit it. As we show in the figures below, you can use either the status bar XPath editor or the Selection ribbon menu to manually edit an XPath. Note that the Selection ribbon menu is only available if you are using the Expert user interface layout. © 2014 - 2015 Sequentum Selection Techniques The status bar XPath editor The Selection Ribbon menu © 2014 - 2015 Sequentum 103 104 5 Content Grabber Agent Commands Content Grabber web-scraping agents usually contain a list of commands that execute sequentially. Sometimes, commands will execute in parallel (simultaneously). Each agent command performs an action, such as loading a new page or capturing content. Some commands can contain sub-commands. These commands are called container commands and they execute all their sub-commands one or more times. If a container command executes its sub-commands more than one time, then the command is also called a list command. A list command iterates through a list of data elements or web elements and executes all its sub-commands once for each element in the list. The Agent Explorer shows all commands in an agent The most important command in a agent is the Agent command itself. This is the first command that executes. Since it contains all other commands for the agent, it is called a container command. The Agent Command loads the Start URL, the point at which data extraction starts. The Agent Command also controls many other important aspects of the agent, such as data export. Some agent commands have a corresponding web selection that chooses an element from the current web page. One or more of these selections are put to use when the © 2014 - 2015 Sequentum Agent Commands 105 command executes. A Navigate Link command, for example, selects a link on the current web page, which will open a new web page. See The Content Grabber Editor topic to learn how to use the Agent Explorer to add and configure agent commands. Command Types We classify agent commands into four types, according to function: Capture Commands List Commands Action Commands Container Commands Capture commands do nothing but capture web content. Container, List, and Action commands may function as one or more types simultaneously. There are some special commands that don't fall into any of these 4 categories, such as the Execute Script command which simply runs a .NET script. Agent Command Properties All commands have a set of properties that govern the function of the command. The following is a set of basic properties common to all agent commands. Property Description Name The name of the command. ID The internal ID of the command. Disabled The default value for this property is False. The system will ignore the command if this property set to True. © 2014 - 2015 Sequentum 106 Content Grabber Export A command with this property set to False (default) will not save any data output. This includes all sub-commands that are part of a container command. Recursion A command with this property set to False Disabled (default) will only execute one time in case there is a recursive loop. Notify On The default value for this property is True, which Critical means that a notification email is sent at the end Error of an agent run if this command encounters a critical error. Critical errors include page load errors and mandatory web selections that are missing. Debug The system will ignore this command during Disabled debugging if this property set to False (default). Debug The default value for this property is False. Break During debugging, the system will break at this Point command if this value is True. Web Selection Properties All commands that perform web selection will have these properties: Property Description Paths Contains one or more selection paths, each of which points to a specific web element on a web page. These paths may point to elements in various locations on the web page. A path corresponds to a web page selection. © 2014 - 2015 Sequentum Agent Commands 107 The value of the Selection Mission Option property (below) determines how Content Grabber will handle any missing selection. Selection The value of this property specifies the action Missing of Content Grabber if this selection does not Option exist in the current page. You can set this property to one of the following: Default Option - The software will decide the most appropriate action depending on the type of command. Optional Selection - The selection is optional, so the agent will continue executing any sub-commands. Ignore Command when Selection Missing - The agent will skip this command and also skip all sub-commands. Log Error and Ignore Command when Selection Missing - The agent will skip this command and all sub-commands, and will also log an error. Selection Path Properties A web selection can consist of multiple selection paths, each of which selects web elements in different locations on a web page. Each selection path will have these properties: Property Description Path The selection XPath. See the topic XPath and Selection Techniques for more information about XPaths. © 2014 - 2015 Sequentum 108 Content Grabber Wait Specify an XPath here that you want the agent XPath to finish loading before proceeding with the other commands. This option is useful when selecting content that loads asynchronously often after the main elements of the page have finished loading. Wait Specifies how long to wait for the Wait XPath Timeout web selection element to load. Seconds Wait For If you set this property to True, the agent only Refresh waits for the wait XPath to refresh. Does not Only verify that the selection Path exists. This setting is useful if the Path is optional. 5.1 Container Commands Container Commands contain one or more sub-commands. After the container command executes, all of the sub-commands execute at least once. If a container command executes its sub-commands more than once, then the command is also a List Command. A list command iterates through a list of data elements or web element, and then executes all its sub-commands - once for each element in the list. In this chapter, we explain all of the container commands: Agent Navigate Link Navigate URL Set Form Field Data List Web Element List Group Commands Group Commands in Page Area © 2014 - 2015 Sequentum Agent Commands 109 Container Properties All commands have a set of properties that govern the function of the command. The following is a set of basic properties common to all container commands. Property Description Command Links to another container command where Link processing will continue. The targeted container command will be executed, so it's normally best to link to a group command that does nothing, so it's clear what happens after the link. Dependen The action of the dependent command will be t executed before the current command is Command executed. Dependent commands of the dependent will also be executed, so a dependent command can cause many commands to be executed. Dependent commands are often used when processing web forms, because you may need to ensure some form fields are set properly before setting other form fields. Repeat Repeats the command action while the While command selection exists. This can be used Selection when a command is deleting all items from a list, is Valid so the command will keep deleting items as long as there are items in the list. 5.2 Capture Commands Capture commands capture data from a web page or from an input data provider, and saves the captured data to output. No other type of command can save data to © 2014 - 2015 Sequentum 110 Content Grabber output, so you need a capture command for each data field you want in your output data. It is important to understand that some commands don't capture anything. For example, the Set Form Field command does not capture the input value that corresponds to a form field, but simply identifies the field. To capture data, you need to use one of the capture commands to acquire the input value and save it to data output. These are all of the Capture commands available in Content Grabber: Web Element Content Download Image Download Document Download Screenshot Calculated Value Page Attribute Data Value Capture Command Properties All commands have a set of properties that govern the function of the command. The following is a set of basic properties common to all capture commands. Property Description ContentTra Please see the Content Transformation Script nsformation topic for more information. Script DataType The type of the captured data. This should be set to ShortText or LongText. Other data © 2014 - 2015 Sequentum Agent Commands 111 types are not supported, but may be used internally. DesignValu The value to use for this capture command in e the agent editor. This value can be important when testing scripts in the editor if the scripts depend on captured data. IsUserDefin The design value for this command is user edDesignVa defined instead of set automatically when the lue command is saved. Use Data Captures a data value instead of an attribute of Value the selected web element. The web element is ignored when this option is true. 5.2.1 Data Specifies the input data to use when the Use Consumer Data Value property is set to true. Web Element Content Typically, the Web Element Content command is the most common, since it is the command that captures text content from the target web page. The command allows you to choose which type of text to extract from a particular web element. The properties shown in the following table are available to all Web Element Content commands: Text Option Description Text This is the text that displays in the web browser, and is the most common choice (it's also the default). HTML This option will extract the entire HTML of the chosen web element, including the HTML of any child elements. Inner HTML The entire HTML of all child elements of the selected web element, but not the tag HTML of the chosen element itself. © 2014 - 2015 Sequentum 112 Content Grabber Tag Text The text of the selected web element, excluding the text of any child elements. Unique ID A unique identifier. The web selection is ignored if this option is selected. Default File Returns any file name specified by a response from a web Name server. If no file name is specified, extracts a file name from a href attribute, or a unique identifier if the href does not not exist. This attribute is mostly relevant to file capture commands. In addition to the default options given above, some web elements may have other attributes available, such as Class, Name, ID, Value, URL, etc. If a web element has an attribute that is not shown in the default drop down box, then you can simply enter the name of the attribute you want to extract. The figure below shows the Command Properties panel after choosing Web Element Content from the New Command drop-down: © 2014 - 2015 Sequentum Agent Commands 113 Properties The following properties are specific to Web Element Content commands and are listed in the HTML Capture section on the command property tab: Property HTML Attribute Description The HTML attribute to extract from the selected web element. Concatenate Concatenates content if multiple web elements are Multiple Web selected. Only the first web element will be used if Elements this property is set to false. Concatenate The separator to use between content from multiple Content Separator web elements. This property is only applicable when Concatenate Multiple Web Elements is set to true. The following properties are available in the Web Selection section on the Properties tab: Property Selection Description The selection XPath for the chosen web element. See the Web Selection Properties article for more information. Extracting URLs deserves a special mention. That's because it's more common to extract a link URL instead of navigating to the link, or to extract an image URL instead of downloading the image itself. If you want to extract a URL from a web element, simply select the element and extract the URL or Image URL attribute. You can also enter the actual HTML tag attributes: src for images and href for links. However, the URL and Image URL attributes automatically convert any relative URLs to absolute URLs-which is best in most cases. © 2014 - 2015 Sequentum 114 Content Grabber Content Transformation Script The Web Element Content command allows you to use regular expressions or a .NET script, to transform the extracted content. In most cases we recommend that you write expressions or use a script to clean the data that you extract. You can also separate data-such as the elements of a postal address-into separate fields. Example: Consider a case in which you want to extract product data that includes a price of $400. You could use a transformation script to strip off the "$" character and leave only the numeric value. Please see the Content Transformation Script topic for more information. These common properties are also available for this command: Capture Command Propertie Agent Command Properties 5.2.2 Download Image The Download Image command extracts an image from a web page. The command will download an image, and then save it to the file system, send it to a database, or embed it into Excel - depending on your chosen export target. The web selection path for this command normally points to the image itself, but it could also point to a web element that contains a URL that links to the image. The figure below shows the Command Properties panel after choosing Download Image from the New Command drop-down: © 2014 - 2015 Sequentum Agent Commands 115 Data Fields If the agent is saving the image to a database, then by default this command will generate two data fields: one for the image data and another for the name of the image. If the agent is saving the image to the file system, the command will generate only one data field containing the full file path to the image. The command property Export URL can be used to also generate a data field that contains the image URL. Command Configuration The Common tab in the Configure Agent Command panel has three tabs: File URL - contains the URL for the image. File Name - contains the name of the downloaded image. OCR tab - check the box if you want to convert an image into text. We explain the details of each below. © 2014 - 2015 Sequentum 116 Content Grabber File URL The entry in this tab determines the specific URL for the image, and the agent uses this URL to download the image at run time. You can choose the HTML attribute that the command should extract to get the image URL. The default value is Image URL, which extracts the src HTML attribute. Click the Transformation Script button click to enter regular .NET expressions or write a .NET script that will transform the image URL to meet your requirements. This is often useful when you want to extract a large image, but it's easier to select a corresponding thumbnail image. You may then be able to transform the thumbnail image URL into the large image URL, and the agent will then use the transformed image URL to download the large image. See the Content Transformation Script topic for more information about content transformation scripts. You can choose the HTML attribute that the command should extract to get the image URL. The default value is Image URL, which extracts the src HTML attribute. File Name The entry in this tab contains the file name. From the drop-down menu you can choose the HTML attribute that you want to use as the name. Click the Transformation Script button to enter regular .NET expressions or write a .NET script that will transform the image name to meet your requirements. See the Content Transformation Script topic for more information about content transformation scripts. Use the Data Value option to specify that an agent data value will be used as file name. The agent data can come from a data provider, an input parameter or captured data. © 2014 - 2015 Sequentum Agent Commands 117 Use the Detect File Extension option to specify if agent should try and detect the file type of the downloaded document, or if a transformation script or a data value will provide a file name that includes a file extension. OCR Using the OCR tab, you have the option to convert the image into text. For example, you might need to convert CAPTCHA images into text so that the agent can bypass CAPTCHA blocking websites. See the topic CAPTCHA Blocking for more information. This tab gives you the option to export both the image file and the converted text, or just the converted text. To convert the image, you'll need to check the box and then enter a script to call an external OCR service. Content Grabber does not include an OCR feature, but allows you to integrate with 3rd party services by using this script feature. For more information, see the topic Image OCR Scripts. Command Properties The following command properties are specific to the Download Image command, and are found in the Convert Image section on the Command property tab: Attribute Description Convert to Uses a script to convert the image to text Text and export the text. An common use of this property is for CAPTCHA processing. Image A script used to transform an image into Conversion text. Script Export Converted Image © 2014 - 2015 Sequentum Exports the image along with the text. 118 Content Grabber The following properties are available in the File Capture section on the command property tab: Attribute File Name Description A script that transforms the file name. Transformation Script Export URL Specifies how to export the file URL. Use Original When possible, this property contains the File Name original file name. Fixed File Adds a fixed file extension to the Extension downloaded file. File Name The web element attribute to use as file Attribute name for the image file. Auto Detect File Automatically detects the file extension of Extension the image file. Uncheck this box if you want the file name transformation script to set the file extension. Use File Name Uses a data value as file name instead Data Value of an attribute of the selected web element. File Name Data Specifies the input data to use when Consumer Use Data Value as File Name is set to true. The following properties are available in the HTML Capture section on the Properties tab: Property Description © 2014 - 2015 Sequentum Agent Commands HTML Attribute 119 The HTML attribute to extract from the selected web element. Concatenate Concatenates content if multiple web elements are Multiple Web selected. Only the first web element will be used if Elements this property is set to false. Concatenate The separator to use between content from multiple Content Separator web elements. This property is only applicable when Concatenate Multiple Web Elements is set to true. The following properties are available in the Web Selection section on the Properties tab: Property Selection Description The selection XPath for the chosen web element. See the Web Selection Properties article for more information. These common properties are also available for this command: Web Selection Properties Capture Command Properties Agent Command Properties 5.2.3 Download Video The Download Video command will download a video from a website and save it to your local disk or a database. The most common video formats are mp4, ogg and Flash. Content Grabber cannot extract content from within a video (such as Flash objects), but can only extract the video file as a whole. The command has a corresponding web selection that will normally select the link to the video. Optionally, the command can extract data from a web element and then © 2014 - 2015 Sequentum 120 Content Grabber calculate the direct video URL. The figure below shows the Command Properties panel after choosing Download Video from the New Command drop-down: Data Fields If the agent is saving the video to a database, then by default this command will generate two data fields: one data field will contain the video binary data and another will contain the name of the video. If you're saving the video to the file system, the command will generate only one data field containing the full file path to the video file. You can also use the property Export URL to generate a data field that contains the video URL. Command Configuration The Common tab in the Configure Agent Command panel has two tabs: File URL - contains the URL for the video. File Name - contains the name of the downloaded video. We explain the details of each below. © 2014 - 2015 Sequentum Agent Commands 121 File URL The entry in this tab determines the specific URL for the video, and the agent uses this URL to download the video at run time. You can choose the HTML attribute that the command should extract to get the video URL. The default value is Video URL, which will cause the command to look at the entire tag HTML and extract the first appropriate video URL. Click the Transformation Script button to enter regular .NET expressions or write a .NET script that will transform the document URL to meet your requirements. See the Content Transformation Script topic for more information about content transformation scripts. File Name The entry in this tab contains the file name. From the drop-down menu, you can choose the HTML attribute that you want to use as the name. Click the Transformation Script button to enter regular .NET expressions or write a .NET script that will transform the document name to meet your requirements. See the Content Transformation Script topic for more information about content transformation scripts. Use the Data Value option to specify that an agent data value will be used as file name. The agent data can come from a data provider, an input parameter or captured data. Use the Detect File Extension option to specify if agent should try and detect the file type of the downloaded document, or if a transformation script or a data value will provide a file name that includes a file extension. Command Properties © 2014 - 2015 Sequentum 122 Content Grabber The following properties are available in the File Capture section on the command property tab: Attribute Description File Name A script that transforms the file name. Transformation Script Export URL Specifies how to export the file URL. Use Original When possible, this property contains the File Name original file name. Fixed File Adds a fixed file extension to the Extension downloaded file. File Name The web element attribute to use as file Attribute name for the file. Auto Detect File Automatically detects the file extension of Extension the image file. Uncheck this box if you want the file name transformation script to set the file extension. Use File Name Uses a data value as file name instead Data Value of an attribute of the selected web element. File Name Data Specifies the input data to use when Consumer Use Data Value as File Name is set to true. The following properties are available in the HTML Capture section on the Properties tab: Property Description © 2014 - 2015 Sequentum Agent Commands HTML Attribute 123 The HTML attribute to extract from the selected web element. Concatenate Concatenates content if multiple web elements are Multiple Web selected. Only the first web element will be used if Elements this property is set to false. Concatenate The separator to use between content from multiple Content Separator web elements. This property is only applicable when Concatenate Multiple Web Elements is set to true. The following properties are available in the Web Selection section on the Properties tab: Property Selection Description The selection XPath for the chosen web element. See the Web Selection Properties article for more information. These common properties are also available for this command: Web Selection Properties Capture Command Properties Agent Command Properties 5.2.4 Download Document The Download Document command extracts a document from a web page. The command will download a document, and then save it to the file system or send it to a database - depending on your chosen export target. The web selection path for this command normally points to the document link itself, but it could also point to a web element that contains information from which the © 2014 - 2015 Sequentum 124 Content Grabber document URL can be derived by using Content Transformation. The figure below shows the Command Properties panel after choosing Download Document from the New Command drop-down: Data Fields If the agent is saving the document to a database, then by default this command will generate two data fields: one for the document binary data and another for the name of the document. If the agent is saving the document to the file system, then the command will generate only one data field containing the full file path to the document. The command property Export URL can be used to also generate a data field that contains the document URL. Command Configuration The Common tab in the Configure Agent Command panel has three tabs: © 2014 - 2015 Sequentum Agent Commands 125 File URL - contains the URL for the image. File Name - contains the name of the downloaded image. Convert to HTML - specifies if the downloaded document should be converted to HTML. We explain the details of each below. File URL The entry in this tab determines the specific URL for the image, and the agent uses this URL to download the document at run time. You can choose the HTML attribute that the command should extract to get the document URL. The default value is URL, which extracts the href HTML attribute (if the chosen web element is a link). Click the Transformation Script button to enter regular .NET expressions or write a .NET script that will transform the document URL to meet your requirements. See the Content Transformation Script topic for more information about content transformation scripts. Use the Data Value option to specify that an agent data value will be used as file URL. The agent data can come from a data provider, an input parameter or captured data. File Name The entry in this tab contains the file name. From the drop-down menu, you can choose the HTML attribute that you want to use as the name. Click the Transformation Script button to enter regular .NET expressions or write a .NET script that will transform the document name to meet your requirements. See the Content Transformation Script topic for more information about content transformation scripts. © 2014 - 2015 Sequentum 126 Content Grabber Use the Data Value option to specify that an agent data value will be used as file name. The agent data can come from a data provider, an input parameter or captured data. Use the Detect File Extension option to specify if agent should try and detect the file type of the downloaded document, or if a transformation script or a data value will provide a file name that includes a file extension. Convert To HTML A downloaded document can be converted into a HTML page as it's being downloaded, and a URL command can later be used to open the HTML page. Capture commands can then be used to extract data from the HTML page. Please see Extracting Data From Non-HTML Documents for more information. Click To Download Check this box when no direct URL is available for the document, but it is necessary to download the document by clicking on a web element - such as a button. When you enable this option: The File URL tab becomes unavailable © 2014 - 2015 Sequentum Agent Commands Content Grabber will assign a unique identifier as the file name You will have access to the Action configuration tab where you can fine tune the behavior of the action. See the topic Action Commands for more information. Command Properties The following command properties are specific to Download Document commands and are listed in the Action section on the command property tab. Attribute Description File Download Action configuration when the Download Method Action property is set to Click to Download. See Action Command Properties for more information. Download Method Specifies how the document is downloaded. The file can be downloaded using the extracted URL, using an action on the selected web element, or as a result of an action from a parent from field. The following properties are available in the Convert Document section on the command property tab. Attribute Description Convert to HTML Uses a script to convert the downloaded document to a HTML page which can be processed later using a URL command. Conversion Script A script used to transform the downloaded document into a HTML page. Export Converted © 2014 - 2015 Sequentum Exports the downloaded document in addition to 127 128 Content Grabber Document generating and exporting the converted HTML page. The following properties are listed in the File Capture section on the command property tab. Attribute Description File Name A script used to transform the file name Transformation attribute used to name the download file. Script Export URL Specifies how to export the file URL. Use Original File Uses the file name of the document when Name possible. Fixed File Adds a fixed file extension to the downloaded Extension file. File Name Attribute The web element attribute to use as file name for the downloaded file. Auto Detect File Automatically detects the file extension of the Extension downloaded file. Clear this option if you want a file name transformation script to set the file extension. Use File Name Uses a data value as file name instead of an Data Value attribute of the selected web element. File Name Data Specifies the input data to use when Use Data Consumer Value as File Name is set to true. The following properties are available in the HTML Capture section on the command property tab. © 2014 - 2015 Sequentum Agent Commands Property HTML Attribute 129 Description The HTML attribute to extract from the selected web element. Concatenate Concatenates content if multiple web elements are Multiple Web selected. Only the first web element will be used if Elements this property is set to false. Concatenate The separator to use between content from multiple Content Separator web elements. This property is only applicable when Concatenate Multiple Web Elements is set to true. The following properties are available in the Web Selection section on the Properties tab: Property Selection Description The selection XPath for the chosen web element. See the Web Selection Properties article for more information. The following common command properties are also available to the Download Document command. Web Selection Properties Capture Command Properties Command Properties 5.2.5 Download Screenshot The Download Screenshot command extracts a screenshot of the entire webpage or the chosen web element. The command will save the screenshot, save it to the file system or send it to a database - depending on your chosen export target. The web selection path for this command normally points to the document itself, but it could also point to a web element that contains a URL that links to the document. © 2014 - 2015 Sequentum 130 Content Grabber The figure below shows the Command Properties panel after choosing Download Screenshot from the New Command drop-down: Data Fields If the agent is saving the screenshot to a database, then by default this command will generate two data fields: one for the screenshot data and another for the name of the screenshot. If the agent is saving the screenshot to the file system, the command will generate only one data field containing the full file path to the screenshot. Command Configuration The Common tab in the Configure Agent Command panel has two tabs: File Name tab - contains the name of the downloaded image. OCR tab - check the box if you want to convert an image into text. We explain the details of each below. © 2014 - 2015 Sequentum Agent Commands 131 File Name The entry in this tab contains the file name. From the drop-down menu, you can choose the HTML attribute that you want to use as the name. Click the Transformation Script button to enter regular .NET expressions or write a .NET script that will transform the screenshot name to meet your requirements. See the Content Transformation Script topic for more information about content transformation scripts. Use the Data Value option to specify that an agent data value will be used as file name. The agent data can come from a data provider, an input parameter or captured data. OCR Using the OCR tab, you have the option to convert the screenshot into text. This tab gives you the option to export both the screenshot file and the converted text, or just the converted text. To convert the screenshot, you'll need to check the box and then enter a script to call an external OCR service. Content Grabber does not include an OCR feature, but allows you to integrate with 3rd party services by using this script feature. For more information, see Image OCR Scripts. Command Properties The following properties are available in the Convert Image section in the Properties tab: Attribute Description Convert to Text Uses a script to convert the downloaded image to text and exports the text. This can be used for CAPTCHA processing. Image © 2014 - 2015 Sequentum A script used to transform the screenshot 132 Content Grabber Conversion image into text. Script Export Exports the converted screenshot image in Converted addition to the text. Image The following properties are available in the File Capture section in the Properties tab: Attribute Description File Name A script that transforms the file name. Transformation Script Export URL Specifies how to export the file URL. Use Original When possible, this property contains the File Name original file name. Fixed File Adds a fixed file extension to the file. Extension File Name The web element attribute to use as the Attribute file name for the screenshot image file. Auto Detect File Automatically detects the file extension of Extension the image file. Uncheck this box if you want the file name transformation script to set the file extension. Use File Name Uses a data value as file name instead Data Value of an attribute of the selected web element. File Name Data Specifies the input data to use when Consumer Use Data Value as File Name is set to © 2014 - 2015 Sequentum Agent Commands 133 true. The following properties are available in the HTML Capture section on the Properties tab: Property HTML Attribute Description The HTML attribute to extract from the selected web element. Concatenate Concatenates content if multiple web elements are Multiple Web selected. Only the first web element will be used if Elements this property is set to false. Concatenate The separator to use between content from multiple Content Separator web elements. This property is only applicable when Concatenate Multiple Web Elements is set to true. The following properties are available in the Web Selection section on the Properties tab: Property Selection Description The selection XPath for the chosen web element. See the Web Selection Properties article for more information. These common properties are also available for this command: Web Selection Properties Capture Command Properties Agent Command Properties © 2014 - 2015 Sequentum 134 5.2.6 Content Grabber Calculated Value This command will save a static value to a data output. You can specify the value at design time, or as an Input Parameter value that will be set at run time. Click the Transformation Script button to enter regular .NET expressions or write a .NET script that will transform the value to meet your requirements. See the Content Transformation Script topic for more information about content transformation scripts. Command Properties The following properties are available in the Calculated Value section in the Properties tab: Attribute Description Value The command will export this value. Use Input The command exports the Parameter chosen input value to data output. Input The command will export this Parameter as the name of the input Name parameter. © 2014 - 2015 Sequentum Agent Commands 135 These common properties are also available for this command: Capture Command Properties Agent Command Properties 5.2.7 Page Attribute Use this command to extract general attributes from the current web page, such a page URL or page meta data. A command that will extract the URL of the current web page Click the Transformation Script button to enter regular .NET expressions or write a .NET script that will transform the attribute value to meet your requirements. See the Content Transformation Script topic for more information about content transformation scripts. Command Properties The following properties are available in the Page Attribute section on the Properties tab: Attribute Description Attribute Name The name of the page attribute to capture. The default value is Page URL, which © 2014 - 2015 Sequentum 136 Content Grabber captures the URL of the current web page. The attribute name can also be the name of page meta data, such as page title, keywords or description. These common properties are also available for this command: Capture Command Properties Agent Command Properties 5.2.8 Data Value Use this command to extract data from a provider and save it to a data output. The command can save values from data provider commands, such as URL commands or Form Field commands. For example, if you are using a Set Form Field command to provide input values to a web form, then these input values will not automatically save to a data output unless you use a Data Value command. See Using Data Input for more information about data providers. This command extracts the URL given by the parent agent command Click the Transformation Script button to enter regular .NET expressions or write a .NET script that will transform the attribute value to meet your requirements. See the Content Transformation Script topic for more information about content transformation © 2014 - 2015 Sequentum Agent Commands 137 scripts. Command Properties The following properties are available in the Data Value section of the Properties tab: Attribute Description Data Consumer This property specifies which data provider and data column to extract data from. See the topic Using Data Input for more information. These common properties are also available for this command: Capture Command Properties Agent Command Properties 5.3 Action Commands An Action command executes an action in the web browser, such as opening a new web page. Most action commands are also container commands, and normally contain the commands that should execute after the web browser action. If, for example, an action command opens a page in a new web browser, the subcommands will execute in the context of that new web page. © 2014 - 2015 Sequentum 138 Content Grabber These are the Action commands available in Content Grabber: Agent Navigate Link Navigate URL Navigate Pagination Set Form Field Below we summarize each of the configuration tabs for Action commands, and provide links to each of the subtopics. Configuration of an Action Command Explore the options and properties that you can configure for a command by taking these simple steps: 1. Clicking once on a web element in the browser panel. 2. Locate the New Command drop-down in the Configure Agent Command. 3. Choose a command from the drop-down. 4. Explore the tabs: Common, Action, Data and Properties. 5. In the Common tab, uncheck the Use default input box to reveal more options. 6. In the Action tab, uncheck the Discover Action box to reveal more options. In many web-scraping scenarios, the default functionality will be quite sufficient. Should you find the need for more flexibility and control, you can learn how to configure all aspects and properties in the Action Configuration section. Action Command Properties The following is the set of basic properties common to all action commands: Property Description © 2014 - 2015 Sequentum Agent Commands Editor Action Specifies a URL to use when performing an action in the editor. See below for Editor Action Properties. Scroll Until End Set this value to True if you want the of Page command to scroll to the end of the web page after an action. Continues scrolling until no further scrolling is possible, and waits for the calls to complete if scrolling triggers AJAX calls. Limit Number Set this value to True if you want to limit the of Scrolls number of scrolls to a number that you specify in Maximum Number of Scrolls, which can be useful if a page can scroll indefinitely. Maximum If Limit Number of Scrolls = True, this Number of value is the maximum number of scrolls. Scrolls Discover Configures action properties automatically Action when the command first executes. Timeouts Specifies timeout values for the action. See below for Timeout properties. Action Type Specifies the action to perform when clicking on a web element. Activities Specifies how this action should wait for browser activities to complete. Events A list of events to fire on the selected web element. Default Events Unless you set this to False, the default list of events will fire on the chosen web © 2014 - 2015 Sequentum 139 140 Content Grabber element. JQuery Specifies what action events should be fired using JQuery. Add Force Adds an If-Modified-Since header to the Refresh Header web request to make sure the web page is not retrieved from cache. Block Known The web page will not load content from Ad Servers known ad servers, such as ad.doubleclick.net. This speeds up processing slightly. Target Browser Specifies the web browser where a new web page will load - either in a New browser, the Current browser, or the Parent. Redirect First Redirects the first request to a new browser Request window when TargetBrowser is set to New, even if the first request is coming from a frame within the current browser window. If this property is set to false, requests from frames within the current browser window will not be redirected. Set this property to true if a link opens a page in a frame, but you want to open the page in a new browser tab. Clear Cookies Clears cookies for the current URL before executing the action. Clear Session Clears the web browser session before executing the action. © 2014 - 2015 Sequentum Agent Commands Error Handling Specifies how the agent should react when an error occurs during execution. The default reaction is to exit the command, but the other options are available: None - No error handling. Exit Command - Exits the command. Retry Command - attempts another execution of the command. Stop Agent - Stops the agent completely. Restart and Continue Agent - Restarts the agent and then continues at previous point of execution. Choose None if you want the agent to continue executing sub-commands after an error occurs. You can handle the error in sub-commands by using the script parameter IsParentCommandActionError. Error Retry Applicable when you set Error Handling = Count Retry Command. This is the number of times that the agent should retry the command when an error occurs during execution. Reuse Browser If a page is loaded in a new browser tab, Tab the browser is reused the next time the same page is loaded. This option is only relevant for full page load actions. Browser Mode Specifies the type of a new web browser, which can be one of the following types: © 2014 - 2015 Sequentum 141 142 Content Grabber Default Full Dynamic Browser Optimized Dynamic Browser Static Browser HTML Parser. The Static Browser and the HTML Parser are faster because they don't execute JavaScript, but may not work on many websites that utilize JavaScript. Editor Action Properties These are the editor properties: Property Description Use Specific Specifies whether to load a direct URL URL when an action executes. URL Specifies the URL to load when an action action executes. Timeout Properties The table below lists all the timeout properties. NOTE: Any changes to the timeout values will override the default values. Property Description Page Interactive The number of milliseconds that the Timeout command will wait for a page load to become interactive. Page Completed The number of milliseconds that the © 2014 - 2015 Sequentum Agent Commands Timeout command will wait for a page load to complete after the page has become interactive. Ajax Completed The number of milliseconds that the Timeout command will wait for an AJAX call to complete. Frame The number of milliseconds that the Completed command will wait for frame content to Timeout complete loading. Page Load The number of milliseconds that the Timeout command will wait for a page to start loading. Ajax Start The number of milliseconds that the Timeout command will for an AJAX call to start. Asynchronous The number of milliseconds that the Timeout command will wait for an asynchronous action to complete. Discover First The number of milliseconds that the Activity Timeout command will wait for the first activity when discovering new activities. Discover The number of milliseconds that the Activity Timeout command will wait for the next activity when discovering new activities. Ajax Content The number of milliseconds that the Render Delay command will wait for AJAX loaded content to render on a web page. Ajax Content The number of milliseconds that the Render Delay command will wait for AJAX loaded content After Scroll to render on a web page after triggering a scroll. © 2014 - 2015 Sequentum 143 144 Content Grabber Wait for Waits for external frames to load. External Frames Otherwise ignores external frames. Use Waits twice as long as specified by the Conservative default timeout values. This can be a quick Default Timeout way to test if issues with an action are caused by timeout values being too short. Default timeouts are used when discovering activities, and when scrolling a page. Activity Properties Choose an activity in the Activities property, which launches this pop-up window. Here you can edit activity properties. This table lists all the activity timeout properties. Property Description © 2014 - 2015 Sequentum Agent Commands Discover Action 145 Automatically configures action properties when executing an action. This is slightly slower than specifying the activities. Activity Type The type of the activity, which you can choose to be one of the following: Activity Status Page XPath Refresh Frames Ajax Externals Timeout Download The status of the activity, which you can choose to be one of the following: Timeout Default Interactive Loading Completed The number of milliseconds to wait for the activity. URL Match If the activity type is a Page load, the page URL must match this regular expression. Wait XPath If the activity type is XPath, this property specifies the XPath to wait for. Required If True, an action fails if a required activity never occurs. 5.3.1 Agent The Agent command is the first command that executes in an agent; all other commands are sub-commands. So, only one Agent command can exist in an agent. The Agent command loads the start URL, which is the first point of data extraction, and also contains all common agent properties (including data export configuration). © 2014 - 2015 Sequentum 146 Content Grabber The Agent command uses a data provider that provides one or more start URLs, and the command will execute once for each of these URLs. This Agent command uses a simple data provider to load a single static start URL NOTE: The Agent command derives from the Navigate URL command, which loads one or more URLs. Command Configuration The configuration screen for the Agent command has four tabs: Common, Action, Data, and Properties. We explain the Properties in the sections below. In the Common tab, you can edit the command name and (optionally) customize the data provider properties. CSV Data: If you leave a check in the Use Default Input box, the command will provide simple CSV data and use that data as input. Simple CSV data consists of values that you enter directly into the command, so no external CSV is necessary. You can uncheck the Use Default Input box and choose the Data Provider that will provide the start URLs. The default data provider is a simple data provider that provides a list of static URLs. You can populate the data provider directly by entering the start URLs in the URLs input box. © 2014 - 2015 Sequentum Agent Commands 147 Use the Action tab to control how the web browser loads the start URLs. See Action Configuration for more information. Use the Data tab to set the data provider that provides the start URLs. Read more in Using Data Input. Command Properties The following properties are specific to Agent commands: Attribute Description Directory The directory that contains all the agent files. This is a read-only property and will be set automatically when you save the agent. User Agent The user agent that the web browser sends to target the web server when requesting web pages. Process Restart Condition These properties govern the conditions under which an agent process would restart. Click Process Restart Conditions in the Properties tab to reveal these properties: Attribute Description Restart On Restarts an agent process at a specific time interval that you Time Interval can edit. Restart Time The number of minutes between each restart - if an agent Interval process is restarting on time intervals. Restart On Restarts an agent process at a specific maximum memory Memory Usage usage limit. Use this option to clear JavaScript memory leaks and ensure that the process does not run out of memory. © 2014 - 2015 Sequentum 148 Content Grabber Max Memory If Restart on Memory Usage = True, the maximum Usage permissible memory usage (in MB) before an agent process will restart. Pause Before The number of minutes that you want the agent to pause Restarting between stopping and restarting an agent process. Data Attribute Description Data The data provider that provides start URLs for Provider the agent. See Using Data Input for more information. Database Common database connections that are Connection available to the agent. See the Data section for s more information. Export The destination where data extractions will go. Target See Exporting Data and Distributing Data for more information. Input A list of input values that you can specify when Parameters running an agent from the command line or from the API. See Input Parameters for more information. Internal The internal database that will store the data Database during extraction. For more information, see the article, The Internal Database. Proxy Attribute Description Proxy See IP Blocking & Proxy Servers © 2014 - 2015 Sequentum Agent Commands Configuratio for more information. n Scheduling Attribute Description Email See Error Logs and Notifications Notification for more information. Scheduling See Scheduling for more information. Scripting Attribute Description Assembly See Assembly References for more References information. Initialization See Agent Initialization Scripts for more Script information. Self-Contained Agents Attribute Description Author See Building Self-Contained Details Agents for more information. Sessions Attribute Description Cleanup See Sessions for more information. External Session © 2014 - 2015 Sequentum 149 150 Content Grabber Data Support See Sessions for more information. Sessions Remove See Sessions for more information. Internal Session Data Immediately User Interface Attribute Description Max Tree The maximum number of nodes that will Expansions automatically expand in the Tree View window. Max Tree The maximum number of chosen nodes in the Selections Tree View window. Setting this value too high may affect the performance of the editor. XPath Factory Attribute Description Max Inner The maximum number of inner nodes to examine Prospects when optimizing a selection XPath. Setting this value too high may affect the performance of the editor. The following common command properties are also available to the Agent command: Navigate URL Command Properties Data List Command Properties © 2014 - 2015 Sequentum Agent Commands Action Command Properties List Command Properties Container Command Properties Command Properties 5.3.2 Navigate Link This command executes an action, such as a mouse click, on a web element. You can apply the Navigate Link command to any kind of web element, including links, buttons, and menus. The figure below shows the Configure Agent Command panel after choosing Navigate Link from the New Command drop-down: Command Configuration The configuration screen for the Navigate Link command has three tabs. Common, Action, and Properties. We explain the Properties in the sections below. Use the Common tab to set the command name. Use the Action tab to control how the command chooses the web element and loads new content into the web browser. See Action Configuration for more information. Properties The properties in the tables below are available to Navigate Link commands. © 2014 - 2015 Sequentum 151 152 Content Grabber Link Attribute Description Worker Number of worker threads to use when this Threads command opens links in a new web browser. Each worker thread will use a new separate web browser. Repeat Link Action Attribute Description Repeat Link Repeats the link action until the chosen web Action element no longer exists, or the action no longer results in a page change. Max. The maximum number of times the link action Repeats will repeat. A value of zero means that there is no maximum to the number of repetitions. The following common command properties are also available to the Navigate Link command: Action Command Properties Web Selection Properties Container Command Properties Command Properties 5.3.3 Navigate URL This command opens a new URL and loads a web page into the web browser. You can specify a single URL, a list of URLs, or intake URLs from an input data source such as a database or CSV file. © 2014 - 2015 Sequentum Agent Commands The figure below shows the Configure Agent Command panel after choosing Navigate URL from the New Command drop-down: Command Configuration The configuration screen for the Agent command has four tabs: Common, Action, Data, and Properties. We explain the Properties in the sections below. In the Common tab, you can edit the command name and, optionally, customize the data provider properties. You can uncheck the Use default input box and choose the Data provider that will provide the start URLs. The default data provider is a simple data provider that provides a list of static URLs. You can populate the data provider directly by entering the start URLs in the URLs input box. Use the Action tab to control how the web browser loads the start URLs. See Action Configuration for more information. Use the Data tab to set the data provider that provides the start URLs. Read more in Using Data Input. The Navigate URL command can use the data provider configured on the Data tab, or it can use a data provider configured by a parent command. The Data tab is not available if the command uses a data provider from a parent command. See Using Data Input for more information about data providers. © 2014 - 2015 Sequentum 153 154 Content Grabber Posting Data and Custom Header The Navigate URL command supports URLs with custom parameters that allow you to post data and add custom headers, which is especially useful if you want to load a web page that is the result of a web-form submission. Sometimes, you can request such a web page by putting the form field values on the URL as normal query parameters; but other times the website requires that you post the form field values. URL Parameters You can use the following two URL parameters to specify post data and custom headers. Note: The Post parameter must come before the Header parameter. Each parameter is optional. Parameter Description ??Post= Specifies data that should be posted to the website. ??Headers= Specifies additional headers that should be sent to the website. NOTE: This parameter must come after the ??post parameter. Example The URL in this example would post par2=val2&par3=val3 to the website, while par1=val1 is just a normal query parameter. http://www.sequentum.com?par1=val1?? post=par2=val2&par3=val3??headers © 2014 - 2015 Sequentum Agent Commands =Content-Type: application/x-www-form-urlencoded NOTE: When posting value pairs to a website, you must also add the header Content-Type: application/x-www-form-urlencoded. Otherwise the website will not understand the format of the posted data. Navigate URL Command Properties The following properties are available to Navigate URL commands: Attribute Description Data Specifies the data provider, input parameter, or Consumer agent data that provides the URLs for the command. Worker Number of worker threads to use when this Threads command opens URLs in a new web browser. Each worker thread will use a new separate web browser. The following common command properties are also available to the Navigate URL command: Data List Command Properties Action Command Properties List Command Properties Container Command Properties Command Properties 5.3.4 Navigate Pagination Typically, pagination is the feature that gives the user the ability to navigate among multiple pages. The most common example of pagination is a website © 2014 - 2015 Sequentum 155 156 Content Grabber that displays a search results listing that spans multiple pages. This command follows all pagination links on a web page, and will process a specific container command after opening each pagination link. The figure below shows the Configure Agent Command panel after choosing Navigate Pagination from the New Command drop-down: Pagination takes several different forms, but in nearly all cases it will fall under one of the following categories: Next Page link/button List/sequence of numeric links List/sequence of text links. We cover each of these in the sections below. Single Next Page Link We recommend that you use the Single Next Link pagination type on a website that uses a single link to move to the next page. In addition to a single Next Page link, a website may provide a sequence of pagination links. In such cases, you should still choose this pagination type, which will ignore any of the other pagination links. It's only necessary that Content Grabber can move to the next page in the sequence. So, it's more efficient to identify the © 2014 - 2015 Sequentum Agent Commands Next Page link. Content Grabber will continue to follow the Next Page link until it no longer appears on the next results page - or it becomes inactive. In very rare cases, a website may continue to display an active Next Page link even when there are no more pages available, and may link back to the first page in the sequence or continue to load the last page. You can check the box for Use Page Count and then configure the agent to stop the pagination command when it reaches the maximum number of pages. The Page Count property requires you to specify the Capture Commands that delivers the maximum page number. Most websites that display a search result will also display the total number of available pages in the search result, so you can choose the Web Element Content command to capture the total number of pages. Then, use this command to place a limit on the number of pages that the pagination command will process. If you want to stop pagination on a fixed page, you can add a Calculated Value command that contains the fixed page number, and then use that command as the command that delivers the maximum page number. List of Numeric Links This pagination type is the best choice for a website that only displays a sequence of pagination links but lacks a Next Page link. Page set navigation links are often found together with the numeric links. For example, if a you find 10 numeric links, the page set navigator links allows a user to navigate to the next 10 numeric links or the previous 10 numeric links. This pagination type requires one web selection for the numeric links and, if applicable, another selection for the page set navigator link that moves to the next set of numeric links. When Content Grabber processes a list of numeric links, it will extract a © 2014 - 2015 Sequentum 157 158 Content Grabber specific attribute of the chosen web elements and then convert the content into numeric values. If the agent cannot convert the content into a numeric value, it ignores the corresponding link. Content Grabber expects a sequence of links, so it will continue to the next page number in the sequence after it processes the current page number. You can use a content transformation script to convert the content into a number. For example, if the agent cannot convert the text attribute of a pagination link directly into a number, it may look for the page number in the URL attribute. In this case, a transformation script would be necessary to transform the URL into a number. List of Text Links Use this pagination type on a website that features a list of pagination links but none of the attributes of the link elements will convert into sequential numbers. A common application of this pagination type is on a website that displays items according to the first letter of the item name. In cases like this, the pagination would consist of all the letters in the alphabet (A-Z), and when a user clicks on a letter, all items starting with that letter will appear in the listing. Navigate Pagination Command Properties The following properties are available to Navigate Pagination commands: Attribute Description Page Number of worker threads to use when this Number command opens links in a new web browser. Attribute Each worker thread will use a new separate web browser. Page A script used to transform the page number Number content. The page number content is extracted Transformati from the web element attribute specified by © 2014 - 2015 Sequentum Agent Commands on Script PageNumberAttributeName. Navigate Navigates directly to a specific pagination page Directly to when resuming an agent. It's not always Page possible to navigate directly to a pagination page. For example, this is not possible for websites using AJAX pagination. Pagination Specifies the type of pagination. Type Use Max. Specifies a command used to capture the Page maximum number of pages in pagination. The Number pagination command will stop following links Command when the number of pages reaches this value. Max. Page Uses a command to capture the maximum Number number of pages in pagination. The pagination Command command will stop following links when this number of pages is reached. Use Next The selection property includes a selection for Pagination a link to the next pagination set. Set Pagination The command that is executed for each Command pagination link. Link These properties are also available to the Navigate Pagination command: Action Command Properties Web Selection Properties Container Command Properties Command Properties © 2014 - 2015 Sequentum 159 160 5.3.5 Content Grabber Set Form Field Use this command to set input fields in a web form. The command supports the following types of web form fields: Input fields Select boxes Text area fields Check boxes Radio buttons. Note: Some web sites use advanced JavaScript to make web elements emulate standard form fields, and the Set Form Field command does not support those web elements. Use a normal Navigate Link command to click on a submit button to submit the web form. The submit button can be a standard web form submit button or any other web element that uses JavaScript to submit the web form. A Navigate Link command is not required if one of the form fields automatically submits the web form when an input value changes. default data provider for standard text box default data provider for radio button default data provider for a select box Command Configuration The configuration screen for the Set Form Field command has four tabs - Common, © 2014 - 2015 Sequentum Agent Commands 161 Action, Data and Properties. Use the Common tab to set the command name and select the data provider that provides the input values for the form field. The default data provider for a standard input form field or a radio button is a simple data provider that provides the input values. The input values for a radio button or check box must be either On or Off. The default data provider for a select box form field is a selection data provider. The default selection data provider will attempt to select all appropriate values in the drop-box (select-box), and the Set Form Field command uses those values as input values. Use the Action tab to control how the web browser loads the URLs. See Action Configuration for more information. Use the Data tab to configure a data provider that can provide the form field input values. The Set Form Field command can use the data provider configured on the Data tab, or it can use a data provider configured by a parent command. The Data tab is not available if the command uses a data provider from a parent command. See Using Data Input for more information about data providers. Command Properties The following properties are available to Set Form Field commands: Attribute Description Data Specifies the data provider, input parameter, or Consumer agent data that provides the input values for the command. Set Select If the chosen form field is a select box, then Box Text each of the input values will be set as a text attribute for each corresponding select options. Set Value Choose True if you always want the command to set the form field value. If you leave it as Default, the command will set the form field © 2014 - 2015 Sequentum 162 Content Grabber value if it is set to fire events - unless the form field is a checkbox or a radio button. We recommend that you leave this property set to Default unless an event function - such as sendtext() - is setting the form field value. Only Only executes the command action if the form Execute field value changes. That is if the current input Action on value is different from the current form field Value value. Change Handle Web When interacting with a file Browser upload web browser dialog box, Dialogs the command will cause a clicking of a dialog button having the name that is in the value of this property (&OPEN). Download Attempts to download a document when the Document form field value changes, and uses the Command Download Document command to capture and export the document. These properties are also available to the Set Form Field command: Data List Command Properties Action Command Properties Web Selection Properties List Command Properties Container Command Properties Command Properties © 2014 - 2015 Sequentum Agent Commands 5.3.6 Action Configuration By default, all action commands are set to Discover action settings (the Discover action check box on the Action tab is checked). The action settings will automatically be configured when you execute the command the first time. The default action settings are usually quite sufficient, so you will rarely need to worry about the Action configuration tab at all. However, you can fine-tune the action settings to achieve better performance, and you may need to adjust the action settings to get the command working correctly. After choosing a New Command type from the drop-down in the Configure Agent Command panel, click the Action tab. By default, all action commands are set to automatically discover action settings when you execute the command for the first time. The settings are usually suitable for most scenarios, so you will rarely need to worry about the Action configuration tab at all. After some experience, you may want to finetune these settings to achieve better performance, and sometimes it may be necessary to adjust the action settings to get the command to function according to your precise requirements. There are several configuration tabs that are available for Action commands in the Configure Agent Command panel, including: Action Types - Specifies the type of action, which can be a Fire © 2014 - 2015 Sequentum 163 164 Content Grabber Event, URL, or No Action. Not all action types are available for all action commands. Browser Target - Specifies the web browser in which the new content should load. Action Events - For action set to Fire Events only, this is a list of events that will fire on the chosen web element. Browser Activities - Specifies the browser activities for which the command will wait before the action is complete. 5.3.6.1 Action Types Action commands can execute one of two types, or no action at all: Fire Events - The command fires events, such as a mouse click, on the selected web element. Load URL - The command loads a new page into the web browser using a direct URL. No Action - The command does not execute an action. This is only relevant for form fields which may execute an action when an input values is assigned, but often execute no action at all. This action command will fire events © 2014 - 2015 Sequentum Agent Commands 165 Use Conservative Timeouts An action command uses default timeout values to determine how long it should wait for activities, such as how long it should wait for a new page to start loading. Default timeout values are only used when an action command is discovering activities or when it is scrolling a web page to discover new content. The default timeout values will work in most situations, but some websites may be very slow and require longer timeouts. You can increase the timeouts manually, but it can sometimes be hard to determine exactly which timeouts to increase. The option Use Conservative Timeouts causes an action command to wait twice as long as specified by the default timeout values. This can be a quick way to test if issues with an action are caused by timeout values being too short. Scroll to End of Page Some web pages load additional content when you scroll the page either downward or to the right. To extract all content from such pages, you need to include an action that scrolls down to the end of the page, so all content is available to the agent. When you set the option Scroll to end of page, you will be able to limit the number of times the command scrolls to the end of a page to load new content. This can be important, since some pages will continue loading new content for a single page until Content Grabber finally runs out of memory. 5.3.6.2 Browser Target An action command often loads new content. With the Browser action type, you can configure how content loads into a new web browser, the current web browser, or the © 2014 - 2015 Sequentum 166 Content Grabber parent web browser. You can also specify a different browser mode, as we explain below. This Navigate Link command will open a web page in a new web browser Uncheck the Discover Action box and then click the Browser tab to view these for the Target Browser: New. The action command loads content into a new web browser, and all subcommands will operate in the new browser. To use this option, the action command must load a completely new page. Asynchronous actions, such as AJAX calls, cannot load content into a new web browser. Current. The action command loads content into the current web browser, and all sub-commands will operate in the new browser. Asynchronous actions, such as AJAX calls, can only load content into the current web browser. Parent. Some older websites may require child browsers to load content into the parent browser in order to function correctly, so this option should only be chosen in such cases. If an action command loads content into a new web browser, that new web browser becomes a child browser of the current web browser, and actions in the child browser can direct content into the parent browser. To use this option, the action command must load a completely new page. © 2014 - 2015 Sequentum Agent Commands 167 Browser Mode If you leave the default for the Target Browser (New), you can choose the browser mode: Default - The new web browser will be exactly the same type as the parent browser. Full Dynamic Browser. The browser functions as a standard web browser, and it will download images, and execute JavaScript and all objects such as Flash. Optimized Dynamic Browser. The browser executes JavaScript, but will not download images or execute objects such as Flash. Although the browser still permits extraction of images and Flash, these will not display in the browser. Static Browser - The web browser does not execute JavaScript at all. Consequently, the browser is faster and more reliable than the other browsers, but does not work on websites that rely on JavaScript to function correctly. HTML Parser - the command does not start a new browser. Instead, the web page simply downloads and runs through a HTML parser. Although the HTML parser does not execute JavaScript and does not load frames, it is faster and more reliable than the web browsers. However, the parser does not work on websites that rely on JavaScript, and the parser may also be unable to submit some web forms (even when they don't rely on JavaScript). Note: You must reopen browser tab for any change in this setting to take effect. 5.3.6.3 Action Events If a command action type is set to fire events, then you can specify the events that should be fired on the selected web element. © 2014 - 2015 Sequentum 168 Content Grabber The Events tab In most cases, you can check the Use default events check box, which will fire all appropriate events that the web element supports. In special cases, you may want to remove some of the default events. For example, an action may try to open a dropdown box for an input form field. Firing the focus or click event on the input form field may cause the drop-down box to open, but the blur event may cause the drop-down box to close. In that case, firing all the default events would open the drop-down box but quickly close it again. To prevent this, uncheck the Use default events box and remove the blur event from the list. Content Grabber uses standard Internet Explorer functionality to fire events, but it can also use the JQuery library - if JQuery is available on the web page. Set the option Use JQuery if available to fire events using JQuery. This option is designed for websites that are built for Internet Explorer version 8 or earlier, and has no effect on most newer websites. Supported Events and Functions Content Grabber supports all the default events for the chosen web element and some custom events and functions. The following list includes only the most common events, and is not a complete list of all available events. Please see a JavaScript © 2014 - 2015 Sequentum Agent Commands 169 reference guide for all available events. Event Name Description mousedown Emulates the press of a mouse button onto the chosen web element without releasing the button. mouseup Following a mousedown event, emulates releasing a mouse button onto the chosen web element. click In immediate succession, emulates a press and a release of a mouse button on the chosen web element . keydown Emulates pressing a key in relation to the chosen web element without releasing it. keyup Emulates releasing a key in relation to the chosen web element . keypress In immediate succession, emulates a press and a release of a key in relation to the chosen web element. focus Emulates bringing input focus to the chosen web element. blur Emulates removing input focus to the chosen web element. change Emulates changing the input value of the chosen web element. The following list includes custom functions that can be used along with the standard events: Function © 2014 - 2015 Sequentum Description 170 Content Grabber exec(JavaS Executes a JavaScript on the selected web cript) element. Example 1: exec($(element).unbind('blur')) This example removes all blur events from the chosen web element. This example requires JQuery to be available on the web page, but the exec function works on non-JQuery JavaScript as well. Example 2: exec(element.click()) This example fires the click event on the selected web element. Example 3: exec(window.history.back()) This example moved the current page back to the previous page. The variable element is always defined as the selected web element. unbind(Eve Removes all events of type Event from the nt) selected web element. This function requires JQuery to be available on the web page. Example: unbind(blur) © 2014 - 2015 Sequentum Agent Commands This example is equivalent to calling exec($(element).unbind('blur')) or exec($(element).off('blur')) click() Calls the click function on the chosen web element. In most cases, this function is equivalent to using the event click in most cases. The function is specific to Internet Explorer and maybe required on some very old websites, but in most cases you should use the click event instead. delay(millise Pauses execution of the command for a conds) specific number of milliseconds. Example: delay(2000) This example inserts a delay of 2 seconds. Important note: The activity timeouts include any event delay. So, if you have a single activity that waits 500 milliseconds, and all events take longer than 500 milliseconds to fire, then the action will time out before all the events have had time to fire. setinputtext The function inserts text into the chosen form () field, and is only compatible with Form Field commands. If the form field is a select box, the setinputtext function selects the option with the text (text) attribute equal to the specified text. If this function call contains no text, the function © 2014 - 2015 Sequentum 171 172 Content Grabber inserts the input data for the Form Field command. Example: sendinputtext("hotels") Typically, a Form Field command sets the value of the chosen form field and then fires the specifies events. This function gives you the ability to fire events before setting the value of a form field. NOTE: Often, this function is used when the corresponding Form Field command property Set Value is set to false. sendtext() This function activates the web browser, sets the focus to the chosen web element, and sendtext(te sends text to that element. This will trigger the xt) same events as standard user input and may be necessary for some very complex websites-sites on which it can be very difficult to determine the exact order of events. If the function call contains no text, the function inserts the input data for the Form Field command. Example: sendtext("hotels") This will send the text "hotels" to the chosen web element. Some complex websites may open a drop- © 2014 - 2015 Sequentum Agent Commands down when a user enters text into an input box, and the drop-down box may close automatically when the focus moves away from the input box. When you are working in the Content Grabber designer, the input focus automatically moves off the web browser after an action executes, so you may need to remove the blur events from the chosen web element. Example: unbind(blur) sendtext("hotels") NOTE: the unbind function requires JQuery to be available on the web page. sendkey(ke This function activates the web browser, sets ycode) the focus to the chosen web element, and sends a key code to that element. This will trigger the same events as standard user input and maybe required for some very complex websites where it can be very difficult to determine the exact order of events. Example: sendkey(down) sendkey(return) This will send two instructions to the chosen web element: one instruction that is equivalent to a press of the down arrow key and another instruction that is equivalent to a press of the Return key. © 2014 - 2015 Sequentum 173 174 Content Grabber Some complex websites may open a dropdown when a user enters text into an input box, and the drop-down box may close automatically when the focus moves away from the input box. When you are working in the Content Grabber designer, the input focus automatically moves off the web browser after an action executes, so you may need to remove the blur events from the chosen web element. Example: unbind(blur) sendkey(down) NOTE: the unbind function requires JQuery to be available on the web page. scroll() This function scrolls downward on the page containing the chosen web element, to ensure scroll(perce adequate coverage for pages that load content ntage) dynamically. Often, the function call is used along with the option Repeat Link Action, which will repeat a downward scroll to the bottom-until the scroll no longer loads new content. Example 1: scroll() This example will scroll all the way to the bottom of the content. Optionally, you can specify an the amount to scroll in terms of percentage. © 2014 - 2015 Sequentum Agent Commands Example 2: scroll(50) This example will scroll half way down through the content. windowscro This function scrolls down to the bottom of the ll() page containing the chosen web element, to ensure adequate coverage for pages that load windowscro content dynamically. ll (percentage If the web browser contains multiple frames, ) you should choose a web element within the frame to which you want to scroll. The chosen web element has no other influence on this function. Often, the function call is used along with the option Repeat Link Action, which will repeat a downward scroll to the bottom - until the scroll no longer loads new content. The Scroll to End of Page action option combines the windowscroll event with the action option Repeat Link Action, so this option can be used as an alternative to the windowscroll function. Example 1: windowscroll() This example will scroll all the way to the bottom of the web page. Optionally, you can specify an the amount to scroll in terms of © 2014 - 2015 Sequentum 175 176 Content Grabber percentage. Example 2: windowscroll(50) This example will scroll half way down the web page. Back() Move back to the previous page by calling the JavaScript function window.history.back() If the action command is a Form Field, then the input value is available for all event functions by using the variable [input]. For example, you can call the function sendtext to set the input value of a form field: sendtext([input]) Or, the input value could be set using JQuery: exec($(element).val('[input]')) If you set a form field value using any of these event functions, you may want to uncheck the form field box Set Value, so that the form field value cannot be set automatically, but rather only by an event function. Events and Default Actions When you interact with web elements in a web browser, such as clicking on a link or setting focus on a form field, the associated event handlers will be executed, but the browser will also execute a default action on the selected web element. For example, if you click on a link, the browser will navigate to a new page, and if you click on a form field, the input focus will be set on the form field. When Content Grabber fires an event on a web element, the © 2014 - 2015 Sequentum Agent Commands 177 associated event handlers will be executed, but the default action will only be executed for click events. This is sometimes important, because some websites may expect the default action to be executed. For example, if you fire the focus event on a web element, the associated focus event handlers will be executed, but the focus will not be moved to the selected web element. Normally you don't want the focus to change when processing a website, since it may interrupt other work you are doing on the computer, but some websites may not work correctly if the focus does not change. In this case you can execute a function on the selected web element instead of firing an event. You can execute any supported function on a web element by simply specifying a function call like focus() instead of just the event name focus. 5.3.6.4 Browser Activities Typically, you have no concern about the sequence of complex activities during the loading of a web page, since you simply wait for the content that you want to see. The most critical content on a web page will likely load far in advance of the time that you actually get around to viewing a specific part of the web page. Usually, all features function correctly as you fill in web forms or click links. However, it's very different with web-scraping agents, since these agents are very fast. An agent will attempt to process a web page as quickly as possible and continue onto the next page. A web-scraping agent is so fast that it could easily start processing a page before all of the essential content loads. So, it's important that you configure an action command to wait for all important browser activities to complete and all the content loads before web page processing begins. When an action command executes, it waits for certain activities to complete in the web browser. For example, if a command executes a click on a link, it may wait for a page to load or an AJAX call to complete. Some actions may result in a very complex © 2014 - 2015 Sequentum 178 Content Grabber set of activities. An action may load a new page that then uses AJAX to load additional dynamic content onto the page. Discovering Activities By default, action commands are set to automatically discover activities. After a command fires the action events, it will monitor all activity in the web browser and wait for the critical activity to complete. Then the command will wait for less than a second and, if no new activity starts, it will consider the action to be complete. In Content Grabber, you can leave the default for automatic discovery, or uncheck the Discover Activities box and then use the buttons to manually specify which activities the command will discover: Manually configuring for discovery of activities Automatic activity discovery is a bit slower than manual configuration. That's because automatic detection pauses and waits for any additional activity. So, automatic discovery may take a long time when running on a poorly built website, since there may be long delays between activities. For example, after the initial page load, it may take several seconds for dynamic content to load into a new web page. With © 2014 - 2015 Sequentum Agent Commands 179 automatic discovery, the action command will only wait for a short moment, and may miss some activities if there is a significant delay. Alternatively, you can choose to manually configure the activities and set the wait timeout values. Or, you can change the default wait timeout values for the action command. Activities Each activity has a corresponding timeout: this is the number of milliseconds the action command will wait for the activity to start. If an activity is required, the action will fail if the activity does not start within this time period. If an activity is not required, the command will wait for the next activity. To manually configure activities, do the following: 1. Locate the Activities tab within the Action tab of the Configure Agent Command panel. 2. Uncheck the Discover Activities box to enable manually configuration of activity discovery. 3. Click on the buttons at the bottom of the panel to add one or more activities. 4. Edit the segments for each activity by (a) deleting segments that you don't want and (b) changing the numeric value for the timeout. Optionally, you can make some segments required. Page Load The Page activity waits for a full page load to complete. Click the Page button to load the default wait times for each of the activity segments: © 2014 - 2015 Sequentum 180 Content Grabber Activity Delay Description Segment Required Default Value by Default (milli seconds) Loading Wait for the web page to start loading. Y 1,000 Interactive Additional wait time for cases in which Y 30,000 N 5,000 the web page may not be done loading, but Content Grabber can operate on the page. Sometimes a web page may get stuck in this mode, because an image won't load correctly. In this case Content Grabber can still continue and extract data from the web page. Completed Wait for the page to complete loading. We recommend that you do not configure this activity as a required activity, since in most cases it's best to allow Content Grabber to proceed in case a web page becomes stuck in Interactive mode. AJAX This activity waits for one or more AJAX calls to complete. Note that several AJAX calls that load simultaneously are part of the same activity. Click the AJAX button to load the default wait times for each of the activity segments: © 2014 - 2015 Sequentum Agent Commands Activity Delay Description Segment Required 181 Default Value by Default (milli seconds) Loading Wait for one or more AJAX calls to N 1,000 N 5,000 load. Interactive Wait for all AJAX calls to complete. Any new AJAX call that starts after this segment is a new activity. Timeout This activity simply pauses any action for a specific number of milliseconds, and is usually put in place to wait for JavaScript to dynamically render content on a web page. For example, a JavaScript may use AJAX to retrieve content from a web server, and then consume additional time processing the content before displaying it on the page. The action command can wait for the AJAX call to complete, but it cannot wait for any processing afterward. Normally, JavaScript processes so quickly that no waiting is necessary, but the command may sometimes need to wait a short period of time. Page Refresh This activity waits for a web page refresh. Click the Refresh button to load the default wait times for each of the activity segments: © 2014 - 2015 Sequentum 182 Content Grabber Activity Delay Description Segment Required Default Value by Default (milli seconds) Loading Wait for the web page to start loading. Y 1,000 Interactive Additional wait time for cases in which Y 30,000 N 5,000 the web page may not be done loading, but Content Grabber can operate on the page. Sometimes a web page may get stuck in this mode, because an image won't load correctly. In this case Content Grabber can still continue and extract data from the web page Completed Wait for the page to finish loading. We recommend that you do not configure this activity as a required activity, since in most cases it's best to allow Content Grabber to proceed in case a web page becomes stuck in Interactive mode. Externals This activity waits for one or more external pages to load in frames or iframes. Several pages loading simultaneously are considered the same activity. NOTE: This wait activity is for external frames - frames that originate from a different domain than the main page. Use the Frames activity to wait for frame pages from the same domain as that of the main page. © 2014 - 2015 Sequentum Agent Commands 183 Click the Externals button to load the default wait times for each of the activity segments: Activity Delay Description Segment Required Default Value by Default (milli seconds) Loading Wait for one or more pages to start Y 1,000 N 5,000 loading. Interactive Wait for all pages to finish loading. New frame pages may start loading after this, but that is part of a new activity. File Download This activity waits for a file to download. You can specify a mime type to identify the file to download in case an action triggers multiple page loads. In most cases, the "application" mime type is the correct choice. Click the Download button to load the default wait time into the activity sequence. Web Element Selection This activity waits for a specific web element to load onto a page. You specify the web element with an XPath. Click the Selection button to load the default wait time into the activity sequence. NOTE: Content Grabber will parse the web page to perform this action, and may decrease performance significantly. If possible, avoid using this activity. Browser Activity Screen © 2014 - 2015 Sequentum 184 Content Grabber This feature shows all browser activities that occur after the current action executes. You can use this information to determine potential issues with the configuration of the action. Use the Activity button on the Content Grabber status bar to open the Browser Activity screen, as shown in the figure below: The Activity button on the status bar Critical activities have dark coloring and other activities have light coloring. A blue row appears in the sequence at the point where the command recognizes completion of the action. Activities that occur after the action completes may not necessarily indicate a problem. If the agent does not work as you expect, then you may need to reconfigure your action in such a way that it waits for some or all of those activities. Activity is seen after the action completes, which may not be a problem Taking a closer look, you can examine the Timing values on the Browser Activity pop-up window, and then work out appropriate timeout values for the post-completion activities. For example, if dynamic changes are causing problems, then you can look at the timing of the last dynamic change. We would recommend subtracting the time © 2014 - 2015 Sequentum Agent Commands 185 for Action Completed from the time of the last Dynamic Page Change action. The result is the minimum number of additional milliseconds that the command needs to wait. This would ensure that all dynamic page changes complete before the entire action completes. Simply insert a Timeout activity to insert a delay on the Activities tab configuration screen, as we explain in the section above. Post-completion events As we wrap up this article, let's also notice that the sequence shown in the figure above indicates two potential problems. We see both a Page Loading and many Dynamic Page Changes after the action completes. In this case, we suggest two remedies. First, increase the amount of time the command waits for the page load to start. Second, insert a Timeout activity to allow JavaScript to accommodate the dynamic changes. 5.4 List Commands A List command is a special type of container command that iterates through a set of elements and executes each of its sub-commands once for each element. The element list can be taken from an input source, such as a CSV file, or it can come from a web selection that selects multiple web elements on the current web page. The following commands are list commands: Agent Data List Set Form Field Web Element List The Web Element List command uses a web selection to generate its element list. The Agent, Set Form Field and Data List commands use an input data provider to generate their element lists. A command is not considered a list command if the input data provider is set to None. For example, a Set Form Field command that uses input data from a different command would have its own data provider set to None, and therefore is not a List command in that case. You can learn more by clicking the © 2014 - 2015 Sequentum 186 Content Grabber links to any of the subsections above. List Command Properties These properties are common to all list commands: Property Description List Start The index of the first item to use in a list when Index debugging. List Count The number of list items to use when debugging. Setting a value of zero will cause the command to use all list items while debugging. 5.4.1 Data List This command provides input data to sub-commands. The command (and all sub-command) will execute once for each input data value. The figure below shows the Command Properties panel after choosing Data List from the New Command drop-down: © 2014 - 2015 Sequentum Agent Commands Data List Command Properties These are properties specific to data list commands: Property Description Data The data source or CSV file that will provide Provider the input data to the command and all subcommands. See Using Data Input for more information about data providers. Data Specifies the action to take when data is Missing missing. Possible actions are: Action Default - The default behavior is to ignore the command if the data provider provides no data, and it will not process any subcommands. © 2014 - 2015 Sequentum 187 188 Content Grabber Optional Data - The agent will continue to process sub-commands if the data provider provides no data. Ignore Command when Data is Missing The agent will ignore the command if the data provider provides no data, and it will not process any sub-commands. 5.4.2 Web Element List This command selects a list of elements on a web page. The parent command and all of its sub-commands will execute once for each web element in the list, and the web selection of each sub-command will be relative to the current element in the list. NOTE: All sub-commands will have the same parent selection. The figure below shows the Configure Agent Command panel after choosing Web Element List from the New command drop-down: As we write above, the web selection of each sub-command will be relative to the current element in the list, but some commands can break the web-element list. If a command breaks a parent web-element list, then the sub-commands will no longer be relative to the web-element list. It's important to realize that any command that loads a new web page will always break a web-element list - since the elements in the webelement list don't exist on the new web page. © 2014 - 2015 Sequentum Agent Commands 189 A command that loads a partial web page, or makes changes to a web page, may break a web-element list if the Break HTML Area property has been set to True. This property is specific to data list commands: Property Description Editor Index Specifies the index of the list element to use in the editor. This property is only relevant at design time, and has no use at run time or when debugging. Data List Command Properties These are properties specific to web-element list commands: Web Selection Properties List Command Properties Container Command Properties Command Properties 5.5 Group Commands Group commands are simple container commands that you can use to place commands together - which can be a much easier arrangement for navigating and maintaining an agent. You may, for example, want to group together several commands that have various bits of contact data into a Group command and give it the name "Content Information". There are other uses for Group commands. Consider this restriction: you can link only one other container command to a container command, such as Navigate Pagination. So, you would need to use a Group command if you want to link to three container commands together. © 2014 - 2015 Sequentum 190 Content Grabber These are the Group commands: Group Group in Page Area 5.5.1 Group Use this command to group commands together. Although the Group command performs no action, it can be useful in the following cases: To group commands that logically belong together. This makes it easier to navigate and maintain an agent. For example, four capture commands used to extract street address, city, zip code and state could be grouped into a Group command having the name "Address". When adding commands to the template library for future reuse. When you want to link a container command to more than one other command. The Navigate Pagination command is an example of a container command that often needs to link to more than one other command. Group Command Properties These properties apply to all Group commands: Container Command Properties Command Properties © 2014 - 2015 Sequentum Agent Commands 5.5.2 191 Group in Page Area Use a Group Commands in Page Area command to group commands together that have the same web selection. The command is similar to the Group command, but it has a corresponding web selection, and any web selection of the sub-commands will be relative to that selection. Typically, you would choose this command when you need to split a single web element into separate fields. For example, a single web element may contain the entire address of a company, and several capture commands with the same web selection will be necessary to extract and split the address into separate fields such as street address, city, zip code, and state. If you group all the capture commands into a single Group Commands in Page Area command, then you only need to select the web element once. Also, if the page location of the address changes, you only need to change the web selection of the Group Commands in Page Area command. You can also add this command to your Template Library, so that you can reuse the command in another agent and only need to change the web selection. Command Properties These properties apply to all Group Commands in Page Area commands: Web Selection Properties Container Command Properties Command Properties © 2014 - 2015 Sequentum 192 5.6 Content Grabber Other Commands Click on the links below to learn more about the other commands that are available in Content Grabber: Execute Script Transform Page Refresh Document Close Browser 5.6.1 Execute Script This command executes a custom .NET script, which is useful in these scenarios: Control the command execution flow Check for duplicates Manual interaction with the agent web browser. Each script command has a corresponding web element selection, and this element has a corresponding entry in the custom script. At run time, if the selection does not exist on the web page, the web element will be NULL in the script. You can check for NULL elements in the script and execute custom functionality if a NULL is found. The figure below shows the Configure Agent Command panel after choosing Execute Script from the New Command drop-down: © 2014 - 2015 Sequentum Agent Commands 193 Predefined Scripts Click the Script type drop-down to choose a script containing typical functionality. You can use a script just as it is given in the template, or click the Edit Custom Script button to launch a pop-up window to edit the script according to your requirements. Duplicate Data Choose Duplicate Data in the Script type drop-down to select a script that will examine the data extraction up to a specific point, and check for a matching data row. If a duplicate is found, the agent exits a specified container command. If you want to check for duplicates in the data extraction from a previous agent run, you must ensure that the agent does not replace data every time it runs. This is done on the Change Internal Database window - which you can open from the Data menu. © 2014 - 2015 Sequentum 194 Content Grabber Exit Command This script exits a container command when a specific condition is met. Retry Command This script retries a container command when a specific condition is met. You can control the number of times that the agent will retry a container command by editing the default script and using the RetryCount value in the script parameters. © 2014 - 2015 Sequentum Agent Commands 195 Retry Agent and Continue This script restarts the entire agent and continues where it left off when a specified condition is met. Typically, you would use this script to completely reset a web browser after an error occurs. The agent continues exactly from the point before the reset, so that if an error keeps occurring at this point, the agent will keep restarting in an infinite loop. To avoid this problem, you can set the status Failed on the current data row, so the agent will move to the next data item when it restarts. This code snippet example sets the data row status to Failed. args.DataRow.SetStatus(RowStatus.Failed) Pause Agent © 2014 - 2015 Sequentum 196 Content Grabber This script pauses the agent when a specified condition is met. The agent will display the web browser and allow a user to manually interact with the web browser. The user can press the Continue button to continue processing. Script Return Value An Execute Script command can control the command execution flow by its return value. The command returns an instance of a class CustomScriptReturn. This class has the following 4 public static methods: Method Description static The agent will retry the specific CustomScriptReturn container command. The container RetryContainer(IContain command must be a parent of the er container) current command. static The agent will exit the specific CustomScriptReturn container command. The container ExitContainer(IContai command must be a parent of the ner container) current command. static The agent pauses and displays an CustomScriptReturn agent web browser, which allows Pause() a user to interact with the web browser before continuing © 2014 - 2015 Sequentum Agent Commands 197 processing. static The agent continues its normal CustomScriptReturn executing flow. Empty() 5.6.2 Transform Page This command transforms part of a web page to facilitate the capture of data. You can choose from the following types of transformations: Transform HTML Insert HTML Replace HTML Denormalize HTML Table Use the Transform Page command to transform the structure of a web page into a form that will facilitate easier data extraction. The command uses a transformation script to transform existing HTML, or it can insert or replace HTML on a web page. The corresponding web selection for this command is the HTML of the chosen web element that will be transformed. © 2014 - 2015 Sequentum 198 Content Grabber The default transformation for this command is Transform HTML, which denormalizes an HTML table. When using the Denormalize HTML Table transformation, the web selection must be an HTML table. The transformation will repeat some table cells and rows to make sure that all rowspan and colspan attributes in the table are set to one. The figure below depicts an HTML table having some cells that span multiple rows. You can use the Denormalize HTML Table transformation to ensure that no cells in the table spans more than one row, which makes it much easier to extract data from the table. © 2014 - 2015 Sequentum Agent Commands 199 Choose the Denormalize HTML Table transformation on this table to ensure that no cells span more than one row 5.6.3 Refresh Document When Content Grabber loads a new web page into the web browser, it parses the dynamic web page and generates a equivalent static version of the page. Agent commands will work on the static version of the web page, since it's much faster and more reliable. All action commands will cause a refresh of the static version of the page, since action commands are likely to either load a new web page or modify the existing page. The Refresh Document command is used to force a refresh of the static version of the web page, so that it corresponds to the dynamic version loaded in the web browser. Usually, Content Grabber will be able to refresh the document when necessary, so you'll only need to use this command in special cases. For example, consider the case where a child web page refreshes the parent web page. After processing the child web page, you may need to perform a Refresh Document command before continuing to process the parent web page. © 2014 - 2015 Sequentum 200 5.6.4 Content Grabber Close Browser Content Grabber often opens new web pages in a new browser window, and will always open a web page in a new window if the web page design specifies the opening of a child window. When Content Grabber finishes processing a page in a child window, it will leave the browser window open and reuse the window for the next similar child page. Some websites will close a child window after use, and will only work correctly if the child window does indeed close. Use the Close Browser command to force a browser window to close, so that the website will create a new browser window. 5.7 Web Forms Content Grabber can submit web forms repeatedly - for any combination of input © 2014 - 2015 Sequentum Agent Commands 201 values. You can specify input values as a static list of values, or you can feed the values from an input data source - such as a database or a CSV file. Remember, you can use Set Form Field commands to set form fields, and Navigate Link commands to submit web forms. Some web forms don't have a separate Submit button, but instead submit the form when a form field value changes. We recommend that you use a Navigate Link command in such cases. Set Form Field commands perform actions, such as loading a new web page. The figure below depicts an example configuration for an agent that submits a web form: Typically, Set Form Field commands that submit a web form are nested. So, if the first command is iterating through a list of input values, then all succeeding Set Form Field commands will execute once for each input value. The last command in a web form configuration is typically a Navigate Link command, and since it's nested along with all the preceding Set Form Field commands, it executes once for each combination of input values. File Download You can use a web form to download a file, such as an Excel or PDF file. The file download may start when you click the form Submit button, or it may start when you change a form field value. If a file download starts when you click a Submit button, you should use a Download © 2014 - 2015 Sequentum 202 Content Grabber Document command instead of a Navigate Link command. The Download Document command should select the Submit button and have the File Download Action property set to Click to Download. If a file download starts when you change a form field value, you should use a Download Document command instead of a Navigate Link command. Also, the Form Field command that triggers the document download most have the option Download Document command set to the Download Document. The Download Document command must be a direct sub-command of the Form Field command that triggers the file download. Uploading File Some web forms contain file upload boxes that allow users to upload files to a website. Scripting cannot modify file upload boxes, so it's necessary to process these differently than other web form input boxes. Content Grabber will automatically detect file upload form fields and will try to handle the file upload dialog box that appears when clicking on the file upload form field. For the Set Form Field command that handles the file upload form field, the input value should be the full local file path to the file you want to upload. When the file upload dialog box appears, Content Grabber will automatically enter the file path and click on the button that selects the file. This happens so fast that the dialog box won't appear. Content Grabber needs to click on the button that selects the file, and it does that by looking for a button having an "Open" label. This is the correct button for English language versions of Windows, but will not be the correct button for other language versions of Windows. In such cases, you may need to change the Set Form Field command property Handle Web Browser Dialog. The default value is &OPEN, which means that the agent will be looking for the text "Open" on the dialog button. The text &O is shown as O and means the button can be activated by pressed the keyboard combination ALT + O. © 2014 - 2015 Sequentum Agent Commands 203 Website Login Forms Content Grabber can process website login forms the same way it handles any other web forms. An agent that processes a login web form would normally have a Set Form Field command for the Username, and another Set Form Field command for the Password, and a Navigate Link command that clicks on the login Submit button. Basic Windows Authentication Some websites use basic Windows authentication, and they will display a Windows login box. Content Grabber gives you the ability to set the Username and Password for basic Windows authentication by editing the Agent Command and then setting the Username and Password in the Basic Windows Authentication of the properties tab. After setting the basic Windows authentication properties, you must reload your agent for the properties to take effect. Basic Windows authentication does not work for websites that have proxies, and also does not work with browser mode set to HTML Parser. NOTE: Internet Explorer no longer allows you to embed authentication directly in the URL like this: http:// username:password@www.domain.com/index.html © 2014 - 2015 Sequentum 204 5.8 Content Grabber Command Library Content Grabber contains an agent template library and a command template library for easy reuse of agents and commands. Some templates come along with the installation of Content Grabber software, but you can also add your own templates. Exporting and Importing User Libraries You can export and import any user templates from one computer to another. Agent Templates When creating a new agent, you can derive the new agent from one of your own agent templates or one of the standard templates. © 2014 - 2015 Sequentum Agent Commands 205 You can add any of your agents to the template library by right-clicking on the agent in the Agent Explorer window and then selecting Add to Agent Template Library from the context menu. After adding a new agent template, the template will become available in My Templates when creating a new agent. Command Templates You can add all container commands to the template library. Any sub-commands for a container command will go into that library, so that when you later insert the command from the library into a new agent, the insert task will include the entire command structure. Add container commands to the library by right-clicking on a command in the Agent Explorer window and selecting Add to Template Library from the context menu. © 2014 - 2015 Sequentum 206 Content Grabber Command templates are categorized to make it easier to find specific templates. When you add a new template, you can place it in an existing category or enter a new category name. You can insert a command template into an agent by doing the following: 1. In the Agent Explorer window, right-click the container command that will be the parent command. 2. Choose Insert Template from Library from the context menu. © 2014 - 2015 Sequentum Error Handling 6 207 Error Handling Web scraping is frequently and notoriously unreliable, because you must contend with many external factors over which you have no control. A target website may fail because of defects in the web application, or there may be problems with the Internet connectivity anywhere between you and the target website. These problems may seem negligible when you are browsing a few websites with a normal web browser, but a web-scraping agent is capable of navigating more web pages in a few hours than a human can view in an entire year. So, small glitches can become significant problems that inhibit reliable data extraction. You can minimize these factors with error handling, especially for your critical web-scraping tasks. Error handling can be difficult to implement properly. In cases where your agent only collects non-critical data, you may decide to skip error handling and simply re-run the agent. Error handling is only important when an agent needs to deliver reliable data every time it runs. For web scraping, there are two aspects to error handling: Agent error handling and Error logs and notifications. Agent error handling happens automatically, when specific errors occur during the execution of an agent. For example, an agent could retry a command if it fails to load a new web page. Error logs and notifications can warn an administrator when an agent encounters trouble and requires attention. Consider the case in which a target website has a new layout and the agent may no longer be able to extract data correctly. You can configure error handling for this agent so that an email notification will be sent to the administrator - who can decide how to update the agent to accommodate the new layout. We encourage you to learn more about error handling in these sections: Agent error handling Error logs and notifications © 2014 - 2015 Sequentum 208 6.1 Content Grabber Agent Error Handling You can configure an agent to react to errors in two different ways. You can configure the error handling properties for an action command, or you can use a custom script to watch for an error condition and take appropriate action. You can configure an agent to react to errors in two different ways. You can configure the error handling properties for an action command, or you can use a custom script to watch for an error condition and take appropriate action. Action Command Error Handling These are the Error Handling properties: Error Description Handling Exit The agent will exit the action command and Command continue executing the next command. The agent will skip all sub-commands of the action command. Retry The agent will retry the action command a Command specified number of times, and if the action command does not succeed, it will skip all subcommands of the action command and continue executing the next command. Set the property Retry Count to specify the number of retries. If Retry Count is set to zero the agent will keep retrying the command indefinitely. Restart Agent The agent will restart and continue where it left and Continue off. This option is useful if an error puts the website into a state where the agent cannot continue. Stop Agent The agent will stop. © 2014 - 2015 Sequentum Error Handling No Error The agent will not handle the error, but will Handling continue to execute the sub-commands of the 209 action command. Use this option if you want to handle the error in a custom script later in the process. Error Handling In An Execute Script Command You can use Execute Script Commands to implement advanced error handling, such as in these cases. A required element is missing from a web page An error message is displayed on a web page An error occurs during the execution of an action command. You can write a custom script in C# or VB.NET, or you can use the default script options to generate a default script without writing any .NET code. This default script retries a parent command if a selection is missing An Execute Script Command has a corresponding web selection, and the script can use this web selection to check for any errors. The script also is given the parameter IsParentCommandActionError, which is set to True if the parent action command © 2014 - 2015 Sequentum 210 Content Grabber encounters an error while executing the command action. 6.2 Error Logs and Notifications You can use error notifications to warn when an agent encounters an error, so that the administrator can take appropriate action. The administrator can use the logs to get more information about the errors. Error Notifications Notifications are email messages that are sent to a specific email address when certain conditions occur. The notification configuration screen is available by clicking the Agent Settings > Notifications menu. This agent sends email notifications if it encounters any critical errors An email notification can be sent when an agent finishes, but never during agent execution or when debugging an agent. You can choose to send email notifications when one or more of these conditions are met: Condition Description © 2014 - 2015 Sequentum Error Handling Always send A notification email will always be sent notifications when when the agent finishes, whether an error an agent completes occurs or not. a run Send notification An email notification will be sent when on critical errors critical errors occur. A critical error will appear as red color on the log screen. These critical errors include page load errors and also situations where required web selections are missing from a web page. Send notification A notification email will be sent when an on low number of agent loads less than the minimum page loads number of pages that you specify in Low page number threshold. Page loads include all command actions, including content loaded asynchronously using AJAX and content changes triggered by synchronous JavaScript. A script can trigger notifications by calling one of these methods on the script parameter object: Method Description void Notify(bool Triggers notification at the end of agent alwaysNotify = execution. If alwaysNotify is set to true) False, this method only triggers a notification if you configure the agent to Send notification on critical errors (see above). void Notify(string © 2014 - 2015 Sequentum Triggers notification at the end of an 211 212 Content Grabber message, bool agent execution, and adds a message to alwaysNotify = the notification email. If alwaysNotify is false) set to False, this method only triggers a notification if you configure the agent to Send notification on critical errors (see above). Error Logs Logging is off by default. If you want to save log information, you must first enable logging before you run an agent or debug it. Simply locate the Debug section in the Properties tab, and set the Debug Disabled property to True. A Content Grabber agent send log data to a database table by default. The database table supports the log viewer in the Content Grabber editor, which you can access by clicking the View Log button in either the Run ribbon menu or the Debug ribbon menu. The critical errors in this log are red in color URL Column The URL column in the Agent Debug Log window shows links to all web pages that © 2014 - 2015 Sequentum Error Handling 213 the agent has processed, and you can double-click on any of these links to open the web page in your default browser. For any failures on data extraction, this is an easy way to open the web page and check if the web page layout is different from the expected layout. NOTE: Double-clicking on a link to a sub-page only works if the website allows direct links into the sub-pages. Many websites don't allow direct links into subpages without going through other pages first. Read below to learn how to directly access the HTML for sub-pages. HTML Column If a website does not allow direct links into sub-pages, you can configure an agent to save the entire HTML of all processed pages. When you configure the agent to log all HTML, the HTML column will contain buttons that you can click to view the raw HTML in the Content Grabber HTML viewer. The raw HTML does not include style sheets and other support files that are necessary to render a complete web page properly, but it will often show you enough information to determine what is causing an error (such a CAPTCHA page). © 2014 - 2015 Sequentum 214 7 Content Grabber Extracting Data From Non-HTML Documents Websites generally provide most of their content in HTML format, but some websites may also provide content in other formats - such as PDF or Microsoft Word documents. Since Content Grabber can only process HTML documents, it will simply download any non-HTML document. Content Grabber can help you extract text and images from within a PDF or Word document by converting such documents into HTML. To have Content Grabber convert your non-HTML document, you will need to provide an external document converter. The Content Grabber public website provides a list of open source programs that you can use for this purpose. Please remember that we don't support these tools, and you must comply with the license for any conversion tool. Limitations The design of most file formats, including PDF and Word files, doesn't include easeof-conversion to HTML. So, the conversion output is considerably more difficult to manage than standard HTML. In many cases, you'll have to select the entire HTML page and then use Regular Expressions to extract the target content. Installing a Document Converter Content Grabber uses a custom script that makes a call to an external document converter, and you can configure this script to call any type of program. The default script can handle the two document converters currently available for download on the Content Grabber website. Installing the PDF To HTML Converter Follow these steps to install the PDF-to-HTML document converter: 1. Download the pdftohtml.zip file from the Content Grabber website: © 2014 - 2015 Sequentum Extracting Data From Non-HTML Documents 215 http://www.ContentGrabber.com/Resources/Tools.aspx 2. Extract the content of the zip file into the default Content Grabber Converters folder, My Documents\Content Grabber\Converters. You can also copy the converter into the corresponding Public Documents folder if you need the converter to be available for all users on the computer. 3. The direct path to the document converter should now be: My Documents\Content Grabber\Converters\pdftohtml\pdftohtml.exe Installing the Docx To HTML Converter Follow these steps to install the docx to HTML document converter: 1. Download the docxtohtml.zip file from the Content Grabber website: http://www.ContentGrabber.com/Resources/Tools.aspx 2. Extract the content of the zip file into the default Content Grabber Converters folder, My Documents\Content Grabber\Converters. You can also copy the converter into the corresponding Public Documents folder if you need the converter to be available for all users on the computer. 3. The direct path to the document converter should now be: My Documents\Content Grabber\Converters\docxtohtml \docxtohtml.exe Using a Document Converter You'll need to add the custom script (that calls the document converters) to a Download Document command, and that same command must also perform the download of the document. © 2014 - 2015 Sequentum 216 Content Grabber The default conversion script looks like this: using System; using System.IO; using Sequentum.ContentGrabber.Api; public class Script { //See help for a definition of ConvertDocumentToHtmlArguments. public static bool ConvertDocumentToHtml(ConvertDocumentToHtmlArguments args) { if(args.DocumentType=="pdf") ScriptUtils.ExecuteCommandLine(@"Converters\pdftohtml\pdftohtml.exe", args.DocumentFilePath, args.HtmlFilePath, "-noframes"); else if(args.DocumentType=="docx") ScriptUtils.ExecuteCommandLine(@"Converters\docxtohtml\docxtohtml.exe", args.DocumentFilePath, args.HtmlFilePath, ""); if(!File.Exists(args.HtmlFilePath)) return false; return true; } } Extracting Content From a Converted Document After converting a document to HTML, that document goes into the agent data folder and you can use a Navigate URL command to open the HTML document. The © 2014 - 2015 Sequentum Extracting Data From Non-HTML Documents 217 Download Document command that did the conversion will also store the path to the document, and then the Navigate URL command can use that path to get the file URL to the HTML document. An agent with a URL command that links to a converted HTML document The Navigate URL command uses data from the Download Document command, so the Download Document command must execute first. Also, both the commands must have the same parent command, or the Navigate URL command must be a child command of the command that contains the Download Document command. You can execute the Navigate URL command in the editor to open the converted HTML document, but you must first execute the Download Document command to make sure the HTML document is available. A Navigate URL command using data captured by a © 2014 - 2015 Sequentum 218 Content Grabber Download Document command © 2014 - 2015 Sequentum Data 8 219 Data Content Grabber manages three types of data: Input data - is used to provide input values to an agent, such as input values for a web form or start URLs. See the topic Using Data Input for more information. Internal data - An agent stores data extraction elements in an internal database while the agent is running. When the agent completes, this internal data will become external data. For more information, see the topic The Internal Database. External data - this is data that goes out to a specific export target. See the topic Exporting Data for more information about exporting data. Distributing Data If an agent is exporting data to a particular file type, such as a spreadsheet, then the agent can also send the file as an attachment to an email message to one or more recipients. The agent can also FTP the files to a remote computer. See the topic Distributing Data for more information. We also recommend these topics: Database Connections The Internal Database Using Data Input Exporting Data Distributing Data 8.1 Database Connections Many professionals who use Content Grabber will normally involve the use of one or more of their database servers, and a web-scraping agent can both import and export data to multiple servers. To increase performance and reliability, you can also configure an agent to have its primary database on an external server instead of using the default SQLite file database. © 2014 - 2015 Sequentum 220 Content Grabber Development Environments As you gain proficiency with agent development and deployment, you may find it best to use separate computing environments for the various stages: a developer environment, a testing environment, and a production environment. Each of these environments will typically have its own database, and so the database connections in your agent will need to change when you move an agent from development into testing, and then from testing into production. Unique Database Names on Your Network To make it easier to move agents between different computer environments, you can configure network database connections - each of which has a unique name on the network. Using such names will allow for an automatic switch to a new connection if the agent moves to another computer on the network. Also, we recommend that you take this approach, even if you are only using a single computer to design and run agents, since you won't have to configure a new database connection each time you create an agent. Another benefit of using unique-name database connections is the agent will contain only the name of the connection. This design gives additional security when you send agents to others, since the agent will contain no database connection information. If you don't configure a network database connection, then the connection will be known only to the agent in the location that you configure the agent. You can use the database connection anywhere in the agent, but it will be unknown to other agents. 8.2 The Internal Database Content Grabber uses two layers of internal data, internal data and export data. While the agent is running, it continuously saves data in an internal format. When the agent finishes, it converts the internal data into export data, and then it sends this data to the export target so it becomes external data. © 2014 - 2015 Sequentum Data 221 The default internal database is a SQLite database, but this can be changed SQL Server or MySQL. Oracle and OleDB are not currently supported as internal databases. Internal Data Content Grabber stores internal data in the database - which contains all the data extraction elements for any agent, but also contains the data which corresponds to the settings for the agent properties. For example, if an agent stops, crashes, or otherwise fails, the internal data is used to start the agent again exactly at the point of failure. Also, the internal data will always contain a data table for each container command in the agent and at least one data field for each capture command in the agent. You can always view the internal data by clicking the Data > View Internal Data menu. If you configure an agent to add new data to the existing data, then the internal data store will continue to grow larger. Otherwise, the internal data store will be recreated every time the agent is run. Export Data The agent stores the export data in the internal database, and this data doesn't contain any of the property data for the agent itself. Also, the export data is more readable than the internal data. The export data will usually not contain a data table for each container command in the agent, and some capture commands may not have corresponding data fields. Export data is always overwritten, so you cannot add data to existing export data. You can only add data to existing internal data. You can always view the internal data by clicking the Data > View Export Data menu. External Data © 2014 - 2015 Sequentum 222 Content Grabber External data is the same as the export data, and the agent delivers it to an external data store chosen by the user. Typically, external data is overwritten, except when you choose Export last data segment only and you are exporting to a database. In that case, the operation simply adds the data to the external database tables. You can also use a Data Export Script to customize the export process which will allow you to update or add to existing data. The export data is set in a fixed format, but you have some options for changing this format. You need to use an Data Export Script if you want to automatically create or configure your own data structures within this data. Internal Database The default internal database is a SQLite file database, but you can change it to either a SQL Server or MySQL database. You can make this change on the Internal Database screen. Important Notice About SQL Server and MySQL WARNING: It is important to ensure that no two agents that use the same internal database have the same agent name. Each time you run an agent, Content Grabber verifies the tables in the internal database. If the tables don't exist or are invalid, it removes and recreates all tables for that agent. If any changes have been made to the agent, there may be some inconsistencies. Instead of removing all tables, it removes all tables starting with the specific agent name. All internal data tables have the following naming format: VWR INTERNAL [PROJECT_NAME] [TEMPLATE_NAME] When Content Grabber removes a table for a specific agent, it removes all tables starting with the name: © 2014 - 2015 Sequentum Data 223 VWR INTERNAL [PROJECT_NAME] This approach creates an obvious problem when two agents use the same internal database and also start with the same name, because Content Grabber would remove the tables for both projects when it should only remove tables for one project. It is important to ensure that no two agents that use the same internal database have the same agent name. 8.3 Using Data Input Content Grabber can take input data from a wide variety of data sources, such as databases and CSV files. Though you can use input data for many different purposes, its most common use is for loading URLs or inputting values for web forms. We explain each of the types of input data in these sections: Data providers Input parameters Agent data 8.3.1 Data Providers You can assign a data provider to any Data List command and can load data from the following data sources: Simple SQL Server CSV MySQL Selection Oracle Script OleDB © 2014 - 2015 Sequentum 224 Content Grabber Data List commands include the following: Agent Set Form Field Navigate URL Data List A list command will iterate through all the data rows given by the data provider and execute each of the sub-commands once for each row. Consuming Data Consumer commands use the data that comes from the Data Provider. The consumer must be a sub-command of the command providing the data. Data consumers include the following types of commands: Agent Set Form Field Navigate URL Data Value Most data providers are also data consumers and, consequently, many commands will retrieve data automatically. For example, an agent command will normally provide at least one start URL to itself - as shown in the figure. © 2014 - 2015 Sequentum Data 225 You can choose the Data Provider on the Common tab of the Configure Agent Command panel. If multiple data columns are available, then you can choose a specific Data Column. The Default data column is simply the first data column that the data provider presents. A data provider will often provide multiple data columns for data sets that correspond to a web form. An example would be an agent that searches for air fares, which would submit a list of departure and destination cities to a web form. The data provider would then provide two columns: one for departure city and one for destination city. Consuming Data in a Script A script can consume data from a data provider as long as the script corresponds with a command that is a sub-command of the data provider. Most scripts provide easy access to input data from the script arguments. The following example retrieves the current data from the data column DepartureCity in the CityData data provider. args.GetInputData("CityData").GetStringValue("DepartureCity"); See the Scripting article for more information. Simple Data Provider © 2014 - 2015 Sequentum 226 Content Grabber Choosing Simple for the Data Provider will cause the agent to extract data from a CSV data set that the agent designer enters at design time. The agent contains the CSV data internally, so that there is no dependency on external files. The CSV data set may contain multiple rows, each having multiple comma-separated data values. The data follows the same formatting rules as standard CSV data. So, you should enclose a data value in double-quotes if it contains a comma or single quote, and use two double quotes to escape any double quotes in a data value. The following example contains a header row, although a header row is not mandatory. If the data contains a header row, then you need to check the Input data includes header row box on the Data tab. Departure City,Destination City Sydney,Brisbane Sydney,Melbourne CSV Data Provider Choosing CSV for the Data Provider is similar to the Simple data provider, but uses an external CSV file. We recommend that you choose the CSV Data Provider for large CSV files, since the © 2014 - 2015 Sequentum Data 227 CSV data provider will perform much better than the Simple data provider for large quantities of data. For a CSV provider, you can choose the value Separator and the text Encoding of the CSV file. You can place CSV files anywhere on your computer, but we recommend you place them in the default input data folder for your agent. Later, if you want to export your agent, then you can include these files along with the export. Selection Data Provider Choosing Selection for the Data Provider will extract data from elements on the web page. The data provider uses the selection XPath of the data provider command and then adds a relative XPath to find the web elements that you include in the selection. You can choose which HTML attributes of the web elements to include as data. © 2014 - 2015 Sequentum 228 Content Grabber The choice of Selection for the Data Provider is almost exclusively for use with dropdown menus in web forms. A form field command selects the drop down on the web page, and a relative XPath selects the option elements within that drop down. The drop-down options then become available to the form field command, which can iterate through the options and ensure that the web form submission is done for each item in the drop down. Although this is somewhat tedious, the benefit is more precise control over which options you want to submit with the web form. Script Data Provider Choose Script for the Data Provider for full customization of the agent input data. This option provides a .NET data table containing the input data, which may contain multiple data columns and rows. © 2014 - 2015 Sequentum Data 229 Content Grabber provides some standard .NET libraries which makes it easier to generate .NET data tables for a variety of input data. See the topic Data Input Scripts for more information and sample code. Database Data Providers Choose a Database for the Data Provider to work any of these database connections: SQL Server Oracle MySQL OleDB In Content Grabber, you can share database connections among all agents on a computer. We recommend that you read more about create new database connections. © 2014 - 2015 Sequentum 230 Content Grabber A database data provider requires execution of a SQL Select statement on the database connection to retrieve the input data. The following example selects the two data columns, DepatureCity and DestinationCity, from a table City. SELECT DepatureCity, DestinationCity FROM City 8.3.2 Input Parameters Input parameters are named values that get their value assignments when an agent starts. You can specify these as command-line parameters when running an agent from the command line, or specify them in a GUI window when running the agent from the Windows desktop. Any data consumer command within the agent can use these parameters. © 2014 - 2015 Sequentum Data 231 You can configure input parameters to get value assignments at runtime by using one of these methods: A GUI window Command-line parameters The Content Grabber API Input Parameters with a GUI Window When running an agent from the Content Grabber editor and the agent has any input parameters, then the Runtime Input Parameters window will automatically appear. Runtime Input Parameters The user running the agent can enter runtime values for the input parameters and can even add new parameters that are only defined during the current execution of the © 2014 - 2015 Sequentum 232 Content Grabber agent. An agent will always use the default input parameters when debugging. The Runtime Input Parameters window will not appear when starting a debugging session, so you will need to change the default values if you want to debug the agent with different input parameters. Specifying Input Parameters on the Command Line When running an agent from the command line, you can specify input parameters as command-line arguments. It's unnecessary to specify all of the input parameters. If you don't specify a parameter, the call to run the agent will use a default value for that parameter. See the topic Running Agent from the Command-Line for more information. The following example runs an agent named Sequentum on the command line, and then sets the input parameters DepartureCity and DestinationCity. RunAgent.exe Sequentum -DepartureCity "Sydney" -DestinationCity "Melbourne" Specifying Input Parameters Using the Content Grabber API When running an agent using the API, you can specify input parameters using the AgentSettings class in the Content Grabber API. See the topic Programming Interface for more information. The following example uses the API to add a set of input parameters and then run the agent: AgentApi api = new AgentApi(@"C:\Users\Public\Documents\Content Grabber\Agents\ qantasApiTest\qantasApiTest.scg"); AgentSettings settings = new AgentSettings(); settings.InputParameters.Add("from", "SYD"); settings.InputParameters.Add("to", "MEL"); settings.InputParameters.Add("departure_date", DateTime.Now.ToString("yyyy-MM-dd")); © 2014 - 2015 Sequentum Data 233 settings.InputParameters.Add("return_date", DateTime.Now.ToString("yyyy-MM-dd")); settings.InputParameters.Add("travel_class", "ECO"); settings.InputParameters.Add("adults", "1"); settings.InputParameters.Add("children", "0"); settings.LogLevel = AgentLogLevel.High; settings.IsLogToFile = true; api.RunAgent(settings); Using Input Parameters Any data consumer command, such as the following, can use input parameters. Agent Set Form Field Navigate URL Data Value Using Input Parameters Using Input Parameters in a Script Most scripts provide easy access to input parameters from the script arguments. See the topic Scripting for more information. The following example retrieves the input parameter DepartureCity: args.GlobalData.GetString("DepartureCity"); © 2014 - 2015 Sequentum 234 8.3.3 Content Grabber Agent Data Agent data is data that is currently being extracted by the agent. For example, data extracted by a capture command could be used as input to a form field command. Agent data can be used by data consumer commands, but the data consumer must be a sub-command of the capture command which is extracting the data. Else, it must have the same parent command as the capture command. See Agent Data for more information. Agent data is the data that the agent is currently extracting, and it can be put to use by other commands at runtime. If another command uses such data, then it must execute after the command that extracts the data. Otherwise, the data wouldn't be available. Also, the command using the data must be a sub-command or have the same parent command as the command that extracts the data. Using Agent Data Agent data can be used by any data consumer command. Data consumers include the following types of commands: Agent Set Form Field © 2014 - 2015 Sequentum Data Navigate URL 235 Data Value CAPTCHA Protection Agent data is often used when resolving CAPTCHA images. You can automatically process a CAPTCHA page by using an OCR service to convert the CAPTCHA image into text, and then use that text in a form field command to set the CAPTCHA input box. A CAPTCHA page or section appears on some websites to protect access to a secure area of the website. A CAPTCHA image displays a sequence of numbers and characters that a website user must type into an input box in order to continue to the secure area. This process ensures the website user is a real human and not an automated agent. See the topic CAPTCHA Blocking for more information. 8.4 Exporting Data While performing a data extraction, an agent saves data to an internal data store and then exports that data to a specific export target when the agent completes or if the user manually stops the agent. Content Grabber supports many different types of data stores such as databases, CSV files, XML files and spreadsheets. You can use a data export script to gain complete control over the entire export process - including the export of data to data stores that Content Grabber doesn't support. We recommend that you also read these topics: Selecting an Export Target Exporting Downloaded Images and Files Character Encoding Changing the Default Data Structures © 2014 - 2015 Sequentum 236 Content Grabber Scripting Data Export 8.4.1 Selecting an Export Target Content Grabber can export data to many different data formats. By default, data will export to an Excel spreadsheet, but you can change the export target by clicking the menu Data > Data Export. Content Grabber supports the following export targets: None - data extractions will only remain in the internal database, and the agent won't export the data. This is a useful option if you want to work directly on the internal database and have no need for exporting. NOTE: This option is for expert users only. XML - the data will export to a single XML file. SQL Server - the data will export to one or more SQL Server database tables. MySQL - the data will export to one or more MySQL database tables. Oracle - the data will export to one or more Oracle database tables. OleDB - the data will export to one or more database tables. You can use any database that has an OleDB provider. CSV - the data will export to one or more CSV files. The default character encoding is ASCII, but you can specify another type of encoding. Microsoft Excel 2003 - the data will export to a single Excel spreadsheet, but it may save the data into more than one worksheet. The agent will save any images to the disk. Microsoft Excel 2007 and later - the data will export to a single Excel spreadsheet. You can save multiple data tables to a single worksheet using grouping and outlining. The agent will save any images to the disk or embed them into the worksheet. Content Grabber always stores data in the internal database before exporting the data to the chosen output format. This means you can always export data to a new format without having to extract the data again. © 2014 - 2015 Sequentum Data 237 When exporting the data to a database, Content Grabber automatically creates new database tables if they don't exist, and automatically deletes and re-creates existing database tables if they do not have the correct number of columns. Exporting Data to Oracle Content Grabber supports Oracle databases but only through the Oracle OleDB provider. You must first install the Oracle client software before the OleDB provider will work. The OleDB provider given by Microsoft supports Oracle only up to version 8i, so you should use the OleDB provider from Oracle. Use an Oracle OleDB connection string such as this one: Provider=OraOLEDB.Oracle;Data Source=//localhost:1521/OracleTest; User Id=test;Password=test; The Data Source parameter in the connection string can be set to TNS, but only if you configure the TNS configuration file properly. Exporting Data to MS Access Content Grabber is a 32-bit application and will only be able to communicate with a 32-bit OleDB provider. If you have installed the 64-bit version of Microsoft Office, then you won't be able to use the 32-bit OleDB provider, and Content Grabber won't be able to export data to Microsoft Access. If you are using the 32-bit version of Office, you can use a connection string such as this: Provider=Microsoft.ACE.OLEDB.12.0;Data Source= C:\Data\MyDatabase.accdb;Persist Security Info=False; You must also change the Parameters option from "@" to [@]. © 2014 - 2015 Sequentum 238 8.4.2 Content Grabber Exporting Downloaded Images and Files An agent will often download images and files during a data extraction. By default, the agent saves these files in the internal database. However, you can configure the agent to save all files to disk by unchecking the Embed files in database box on the Internal Database configuration screen. When writing files to disk, the agent will store the full path to each file in the internal database. If you are exporting data and there are corresponding files in the internal database, then these files will also export to the external database. If you uncheck the Embed files in database box, then only the file paths will export to the external database. If you are exporting data to a file format that doesn't support embedded files, such as Excel 2003, then only the file paths will export to the external database and the files will be written to disk. Typically, Content Grabber will use the name of a file when saving it to disk. However, since a website could potentially use the same name for two different files, Content Grabber will set the file name with a unique identifier (GUID) if the file already exists on disk. You can also use a file name transformation script to name the files, and configure the script to use some of the other extracted data elements for the file name. For example, you could name a product image with the product name or product ID. 8.4.3 Character Encoding Although the user interface text is English-only, Content Grabber can extract data from websites in any language or encoding. If you have problems with the encoding of your data extraction, then the problem is most likely with your export target. For example, if you are exporting data to a MySQL database that uses Latin1 encoding, then you'll find a question mark in the place of each Unicode character. In accordance with CSV specifications, CSV output files have ASCII encoding and do not support Unicode. However, if you're using a CSV import tool that can read CSV © 2014 - 2015 Sequentum Data 239 files with other character encodings, then you can specify the encoding when choosing the CSV export target. MySQL Character Encoding Many users install their MySQL databases with Latin1 character encoding and then encounter some characters appearing as question marks. If you want to export Unicode characters to a MySQL database, you'll need to use Unicode character encoding such as UTF-8. If you configure your MySQL database to use Unicode, you must also specify the encoding when configuring the Export Target for the agent. 8.4.4 Changing the Default Data Structures All container commands have a set of properties that you can use to specify how to export data from sub-containers: Property Description Sub- Specifies how to export data from multiple sub- Container containers. You can add data together and Export then merge with data from parent containers, Method or simply merge the data and then merge that data set with data from parent containers. Alternatively, you can separate all the data from the various sub-containers, which will also separate the data from the data in parent containers. Non-list containers have the option Always Merge Data set to True by default. The setting of this property has no effect on sub-containers with the option Always Merge Data set to True. Sub- Specifies how a data export from a sub- Container container will merge with data from the Merge corresponding parent container. This option has © 2014 - 2015 Sequentum 240 Content Grabber Method no effect if the data from sub-containers is separate. Non-list containers have an option Always Merge Data set to True by default. The setting of this property has no effect on sub-containers with the option Always Merge Data set to True. Export Rows Exports data rows from a list container as in Parent columns in the parent data table. Columns Separator The separator character which marks each row when exporting multiple rows in a single column. Name A capture command that will provide the data Command for column names. This property is optional, but must be set together with a command name for the Value Command property. Value A capture command that will provide the data Command for column values. This property is optional, but must be set together with a command name for the Name Command property. Sub-container Export Rules Data exports from sub-containers will conform to these rules: 1. Data from non-list containers will merge with data from parent containers. A nonlist container that contains a list container will function as a list container. A nonlist container with the Always Merge Data property set to False will also function as a list container. A list container with Export Rows in Parent Columns set to Multiple Columns or Single Columns is equivalent to a non-list container. 2. Data from all list containers will combine or separate according to the Sub- © 2014 - 2015 Sequentum Data 241 Container Export Method setting. 3. If the data is separate in the previous step, then no further action is taken. Otherwise, the resulting data will merge or combine with the parent data according to the Sub-Container Merge Method setting. Sub-Container Export Method This method controls how data will export from sub-containers and has the following four options: Default Add Horizontally Separate Add Vertically Data from non-list containers will always merge with data from parent containers when the option Always Merge Data is set to True. The Sub-Container Export Method setting has no effect on these non-list containers. Default Export Method When using this method, data from sub-containers will be separate in some circumstances - depending on the structure of your commands. By default, Content Grabber separates data if the parent container contains more than one list container, or if a single list-container generates significantly less data fields than the parent container. For example, consider an agent that extracts product data from a website such as Amazon.com. One container command may extract product properties such as product name, description, and price; and a sub-container may extract images belonging to the product. The parent container extracts more data fields than the subcontainer, so by default the data from the sub-container will be separate from the data of the parent container. If the Default Export method does not separate data from sub-containers, it means that there is only one list-container. Consequently, the export result will depend on the Sub-Container Merge Method property, which will merge data with the parent data (by default). © 2014 - 2015 Sequentum 242 Content Grabber Separate Export Method When using this method, data from each sub-container will separate, which also means the data will be separate from data in the parent containers. If, for example, the data exports to a database, then the data from each sub-container will go into separate database tables. The figures below show a simple agent having one list container, along with the data export it generates. By default, the Movie List container will merge its data with the data from the imdb container, and the result will be a single data table containing the entire data extraction. A simple agent with one list command © 2014 - 2015 Sequentum Data 243 The above agent extracts this data by default. Data from the Movie List container merges with data from the "imdb" container. In this example, you can separate the data from the Movie List command by setting the property Sub-Container Export Method to Separate. The data is exported to separate data tables when the property Sub-Container Export method is set to Separate. Add Horizontally Export Method When using this export method, data from sub-commands will combine horizontally as shown in the example below. The resulting data will combine or append to the parent data according to the setting of the Sub-Container Merge Method. Consider an example in which two different sub-containers generate the following data sets. Here is the data for the first sub-container: Column1 Column2 Data1 Data2 Data3 Data4 Data5 Data6 Data7 Data8 Data9 Data10 © 2014 - 2015 Sequentum 244 Content Grabber Here is the data from the second sub-container: Column3 Column4 Column5 Data11 Data12 Data13 Data14 Data15 Data16 Data17 Data18 Data19 This would be the result of a merge of these two data sets: Column1 Column2 Column3 Column4 Column5 Data1 Data2 Data11 Data12 Data13 Data3 Data4 Data14 Data15 Data16 Data5 Data6 Data17 Data18 Data19 Data7 Data8 Data9 Data10 Add Vertically Export Method When using this method, data from sub-commands will combine vertically as shown in the example below. The resulting data set will merge or combine with the parent data according to the Sub-Container Merge Method setting. If we apply this method to the example data sets for the first and second sub-container above, this will be the result: Column1 Column2 Data1 Data2 Data3 Data4 Data5 Data6 Data7 Data8 Data9 Data10 Column3 © 2014 - 2015 Sequentum Data Data11 Data12 Data13 Data14 Data15 Data16 Data17 Data18 Data19 245 Sub-Container Merge Method This method controls how data from sub-containers will merge with parent data. Choose from these settings: Default Merge Add. The setting for this method has no effect when data from sub-containers is separate. It also has no effect on non-list containers having the Always Merge Data property set to True. By default, Content Grabber will always merge data from sub-containers with parent data, and this is nearly always the best action. When parent data merges with child data, each data row in the parent data table will repeat for each corresponding child data row - as shown in the example below. First, consider this data set for the parent container: ID Column Column 1 Column 2 ID1 Data1 Data2 ID2 Data3 Data4 ID3 Data5 Data6 Here is the data from the sub-container: Parent ID Column © 2014 - 2015 Sequentum Column 3 246 Content Grabber ID1 Data7 ID1 Data8 ID2 Data9 ID2 Data10 ID3 Data11 If we apply this method to a merge of the data sets for the parent and sub-container above, this will be the result: New ID Column1 Column2 Column3 ID1 Data1 Data2 Data7 ID2 Data1 Data2 Data8 ID3 Data3 Data4 Data9 ID4 Data3 Data4 Data10 ID5 Data5 Data6 Data11 Column One situation where the Sub-Container Merge Method property should be set to Add is the case in which a Pagination command links to a non-list container. If you don't set the property to Add, then the non-list container will generate one row in the parent data table by default, and the Navigate Pagination command (which is a list container) will merge its data with the parent data. The result will be that the non-list container will generate one row that will be overwritten, as we can see in yet another example below. This data set could come from a non-list container: IDC Column Column 1 Column 2 ID1 Data1 Data2 This data set could come from a Pagination container: © 2014 - 2015 Sequentum Data Parent ID Column 1 Column 2 ID1 Data3 Data4 ID1 Data5 Data6 ID1 Data7 Data8 ID1 Data9 Data10 ID1 Data11 Data12 247 Column If we apply this method to a merge of the data sets for the non-list container and the Pagination container above, this will be the result: New ID Column Column 1 Column 2 ID1 Data3 Data4 ID1 Data5 Data6 ID1 Data7 Data8 ID1 Data9 Data10 ID1 Data11 Data12 Export Rows to Parent Columns Use this property to save all of the list-container data in the table of the parent. You can control exactly what data will save and how it will save to the parent table by setting these properties: Export Rows to Parent Columns, Name Command, Value Command and Separator. This figure represents a container command having a configuration to export data rows to multiple columns in the parent data table: © 2014 - 2015 Sequentum 248 Content Grabber The Export Rows to Parent Columns property specifies how the data should save into the parent table. You can choose one of these three methods: None - The data will not save to the parent table, which is the default behavior. Multiple Columns - Each data row that the list container generates will generate additional columns in the parent table. Single Column - The content from all data rows will concatenate into a single column in the parent table. The data concatenates, but also separates by a specific character that is given in the Separator property. When saving data in multiple columns in the parent table, the column names in the parent table will be the same as the column names in the child table, but with an additional column. For example, if the column name in the child table is image, then column names in the parent table will be as follows: image_1, image_2, image_3, .... If you save different kinds of data in the parent table, such as name/value properties, then you may want the columns in the parent table to have names that correspond to the content in the child table. For example, consider the following data from a child table: Name Value Width 100 Height 50 Depth 25 You may want this child table to generate the following parent table columns: Width Height Depth 100 50 25 © 2014 - 2015 Sequentum Data 249 You can achieve this by setting the Name Command property to the name of the command that will extract the values you want to use as column names. Also, set the Value Command property to the name of the command that will extract the values you wish to use as column values. Limitations Adding columns to the parent table can lead to a different number of columns in the parent table every time you run an agent. This is usually only desirable when exporting data to Excel spreadsheets, and may sometimes be useful when exporting data to CSV files. 8.4.5 Exporting Data with Scripts With data export scripts, you can customize the export process for extraction data sets. An export script is ideal for exporting to special data structures in a database, or exporting data to a third-party or custom-data store. An export script gives much more control than the standard export process because the script can control when to insert, update, or delete data in the export target. See the topic Data Export Scripts for more information. 8.5 Distributing Data If the export target for an agent is set to a file format such as a spreadsheet, then you can configure an agent to send a data export as a message attachment to one or more email destinations. Alternatively, you can FTP the data to a remote server, or have the agent use a custom script to distribute the export. Read these topics for more information: Email and FTP distribution Scripting Data Distribution © 2014 - 2015 Sequentum 250 8.5.1 Content Grabber Email and FTP distribution If the export target for an agent is set to a file format, you can configure an agent to send a data export as a message attachment to one or more email destinations. Alternatively, you can FTP the data to a remote server. Email Distribution When an agent exports data through email, it attaches the data files to the email in a single compressed file or as individual data files. In the Email Delivery panel, you'll need to specify the Server, Login, and Email Addresses. Email distribution of exported data FTP Distribution When an agent uses FTP to copy data to a remote server, the agent can copy the data files in a single compressed file or as individual data files. In the FTP Delivery window, you must specify the remote Server address, Username, and Password. The remote server must host an FTP service. © 2014 - 2015 Sequentum Data 251 FTP distribution of exported data 8.5.2 Scripting Data Distribution An agent can use a custom script to distribute data export files. A data distribution script doesn't need to actually distribute the data, but can be used to perform custom processing tasks after data export is complete. Please see the topic Data Distribution Scripts for more information. © 2014 - 2015 Sequentum 252 9 Content Grabber Anonymous Web Scraping Many users prefer to browse the web in private. Some website owners do not want web scraping tools to extract data from their websites, and their administrators may attempt to block access to their website for any non-human users. Content Grabber will behave exactly like a normal Internet Explorer user when your agent uses a Full Dynamic Browser. When you visit a website with a web browser or a web scraping tool such as Content Grabber, the website owner can record your IP address and attempt to identify you with the intent to block your access to the website. If you do not want a website owner to be able to identify you when you visit a website, you can use a proxy server to hide your IP address. The proxy server will visit the website for you. There are many different types of proxy servers, but Content Grabber supports only HTTP proxy servers. It does not support other types of proxy servers, such as SOCKS proxies. See the topic CAPTCHA & IP Blocking for more information about configuring proxy servers in Content Grabber. © 2014 - 2015 Sequentum CAPTCHA & IP Blocking 10 253 CAPTCHA & IP Blocking Some website owners do not want web-scraping agents to extract data from their websites. They may try to block access to their website if they believe the user is not human. CAPTCHA and IP blocking are the two most common ways of blocking web scraping agents. You can configure a Content Grabber agent to manually or automatically bypass CAPTCHA, and also configure an agent to use proxy servers - which prevents IP blocking. Read these topics for more information: CAPTCHA Blocking IP Blocking & Proxy Server 10.1 CAPTCHA Blocking A website can implement CAPTCHA blocking by using a web form that the user must submit to gain access to any restricted areas of the site. The web form is usually quite simple, consisting of an image element and a text box element. The image displays some characters which the user must enter into the text box in the exact sequence as given in the image. A human user can read the text in the CAPTCHA image, but a web-scraping agent requires special character recognition software to successfully discern the characters in the image. © 2014 - 2015 Sequentum 254 Content Grabber A typical registration form with CAPTCHA blocking Content Grabber performs both manual and automatic data extraction from websites that implement CAPTCHA blocking. Automatic data extraction requires an account with a third-party CAPTCHA recognition service and, typically, there is a small fee for processing each CAPTCHA image. Manual data extraction is free, but requires you to manually decode CAPTCHA images while running a data extraction agent. Manual CAPTCHA Configuration You have two options when configuring manual CAPTCHA. The easiest one to configure is the option we describe below. The other option uses the same approach as automatic CAPTCHA configuration, but instead of using a script to resolve the CAPTCHA, a window is displayed allowing the user to manually resolve the CAPTCHA. Manual CAPTCHA processing is easy to configure, but requires you to manually decode CAPTCHA images while the agent is running. The agent will pause and display the browser window where a user can view the CAPTCHA image and enter the CAPTCHA text in the text box. You can configure the agent to automatically submit the web form after the user has entered the CAPTCHA text, or you can reply on the user to submit the form while the agent is paused. © 2014 - 2015 Sequentum CAPTCHA & IP Blocking If CAPTCHA blocking is part of a larger registration form, you can process the CAPTCHA part manually and let the agent process the rest of the form automatically. In this case you should let the agent submit the form automatically rather than relying on the user to submit the form while the agent is paused. Follow these steps to pause an agent when a CAPTCHA image is displayed: 1. Add an Execute Script command to your agent. 2. Select the CAPTCHA image element in the web browser. This sets the command's web selection. 3. Select the default script type Pause Agent. 4. Select the default script condition If Selection Exists. 5. Save the command. This command will pause the agent when the CAPTCHA image element exists on the web page, and allow a user to enter the CAPTCHA text. You must add this command to all locations in the agent where CAPTCHA blocking could be encountered. Important: Manual CAPTCHA processing relies on a human user to decode the CAPTCHA image, so an agent using manual CAPTCHA configuration cannot be run from the scheduler or API, or any other fully automated way. Automatic CAPTCHA Configuration Automatic CAPTCHA processing requires an account with a third party CAPTCHA recognition service. The third party recognition service must provide a .NET API and you must add an OCR script © 2014 - 2015 Sequentum 255 256 Content Grabber that uses this API to call the service. See the section below for two examples of CAPTCHA recognition services. Follow these steps to configure an agent for automatic CAPTCHA processing: 1. Add a new Group Commands command. This group of commands will handle CAPTCHA. 2. Add an Execute Script command to the group. This command will skip CAPTCHA processing if the CAPTCHA image doesn't exist in the web page. i. Select the CAPTCHA image in the web browser. This sets the command's web selection. ii. Select the default script type Exit Command. iii. Select the default command Parent Command. iv. Select the default condition If Selection Missing. 3. Add a Download Image command to the group. This command will download the CAPTCHA image and use an OCR script to decode the image into plain text. i. Select the CAPTCHA image in the web browser. This sets the command's web selection. ii. Open the OCR tab and check the option Convert image to text. iii. Add an OCR script. See below for script examples. If you want to manually resolve the CAPTCHA, you can check the option Convert image manually in which case you should not specify a script. 4. Add a Set Form Field command to the group. This command will use the converted image text to set the CAPTCHA text box. i. Select the CAPTCHA form field on the web page. This sets the command's web selection. © 2014 - 2015 Sequentum CAPTCHA & IP Blocking ii. Clear the command option Use default input. iii. Set the data provider to Captured Data and select the CAPTCHA image command from step 3. 5. Add a Navigate Link command to the group. This command will submit the CAPTCHA form. i. Select the CAPTCHA form submit button. This sets the command's web selection. 6. Add an Execute Script command to the group. This command will retry CAPTCHA processing if the CAPTCHA image still exists on the web page. If the CAPTCHA image still exists we assume it's because the CAPTCHA recognition service decoded the CAPTCHA image incorrectly and we'll try again. i. Select the CAPTCHA image in the web browser. This sets the command's web selection. ii. Select the default script type Retry Command. iii. Select the default command Parent Command. iv. Select the default condition If Selection Exists. Automatic CAPTCHA configuration The command library includes the group of commands listed above. Select the Automatic CAPTCHA command from the library and add it to your agent. After you have added the group command from the © 2014 - 2015 Sequentum 257 258 Content Grabber library, you need to set the web selection for all the commands that require a web selection. You must add this group of command to all locations in the agent where CAPTCHA blocking could be encountered. CAPTCHA OCR Scripts Content Grabber includes the API and standard OCR scripts to call the following CAPTCHA recognition services. http://www.deathbycaptcha.com http://bypasscaptcha.com At the time of writing, the Death by CAPTCHA service charges US$6.95 for 5000 CAPTCHAs and Bypass CAPTCHA charges US$34 for 5000 CAPTCHAs. We are not affiliated with these companies in any way and don't charge any additional fees for these services. The following OCR script uses the Death by CAPTCHA service to decode CAPTCHA images. public static string ConvertImageToText(ConvertImageToTextArguments args) { string captcha = DeathByCaptchaService.DecodeCaptcha(args.Image, "login", "password"); return captcha; } The login and password in the script above is provided by Death by © 2014 - 2015 Sequentum CAPTCHA & IP Blocking 259 CAPTCHA. The following OCR script uses the Bypass CAPTCHA service to decode CAPTCHA images. public static string ConvertImageToText(ConvertImageToTextArguments args) { string captcha = BypassCaptchaService.DecodeCaptcha(args.Image, "key"); return captcha; } The key in the script above is provided by Bypass CAPTCHA. Troubleshooting If CAPTCHA fails when entering a correct CAPTCHA it may be caused by the following issue. Content Grabber cannot directly get the CAPTCHA image from the web browser, so it downloads the image a second time, and that may be disallowed by the web server, especially if you are using a proxy rotation service where a new IP address maybe assigned for the image download. To overcome this problem, you can use a Download Screenshot command instead of a Download Image command, in which case a second image download is not required. 10.2 IP Blocking & Proxy Servers When you visit a website with a web browser or a web scraping tool, such as Content Grabber, the website owner can record your IP address and may be able to use this information to identify you or block your access to the website. If you do not want a website owner to be able to identify you while you visit a website, © 2014 - 2015 Sequentum 260 Content Grabber you can use a proxy server to hide your IP address. When you use a proxy server, you do not visit the target website directly, but instead request that the proxy server visit the website for you. There are many different types of proxy server, but Content Grabber supports only HTTP proxy servers. It does not support other types of proxy servers, such as SOCKS proxies. How to Configure Proxy Servers After you have purchased proxy server access or found freely available proxy servers on the web, you will receive one or more proxy server IP addresses and possibly a username and password to access the proxies. You need to enter this information into the Content Grabber agent by selecting Agent Settings from the ribbon menu, or you can setup default proxies in the Application Settings ribbon menu. Agent proxy settings Application proxy settings The proxy menus allow you to set the Proxy source and Rotation interval options. You can select one of the following proxy sources: Proxy source Default Description The agent will use the proxy configured in © 2014 - 2015 Sequentum CAPTCHA & IP Blocking 261 Internet Explorer, or no proxies if no proxies have been configured in Internet Explorer. Application The agent will use the default proxy settings configured in the Application Settings menu. Proxy List The agent will use a specific list of proxies. Click the Proxy List button to add proxies to the list. Gateway Use this proxy setting in Application Settings if you must connect to a specific proxy to access the Internet. The rotation interval is the number of pages the agent will load before switching to the next proxy. The Proxy List screen is used to specify a list of proxies and set the proxy verification option: Add proxies on the Proxy list screen © 2014 - 2015 Sequentum 262 Content Grabber You must specify the Proxy Address in the following format, including the port number: 206.118.215.245:60099 In the example above, 206.118.215.245 is the IP address of the proxy server and 60099 is the port number. Proxy Verification Content Grabber can automatically verify if a proxy is available before switching to the proxy. This allows agents to switch to the next proxy if a proxy is not available, and thereby avoid stopping the agent prematurely just because a proxy is unavailable. Proxy Description verification option Verify proxy The agent will verify a proxy before switching to before use the proxy. Verification The number of seconds the agent will wait for a timeout successful proxy verification. Importing Proxy Servers If you are using a large number of proxy servers, it can be tedious to add them all to each project. You can use the Import Proxies button to import a list of proxies from a CSV file. The CSV file must have the following format: Proxy Address, Username, Password The Username and Password are optional columns. CSV Example 1: © 2014 - 2015 Sequentum CAPTCHA & IP Blocking 263 proxy 173.244.220.185:8800 50.21.10.78:8800 134.22.166.242:8800 CSV Example 2: proxy, username, password 173.244.220.185:8800, user1, pass1 50.21.10.78:8800, user1, pass1 134.22.166.242:8800, user1, pass1 Proxy Gateway Some company network configurations require all users to access the Internet through a specific proxy. Use the Gateway proxy option in Application Settings to open the Proxy Gateway screen. Some companies require force Internet access only through a proxy gateway The Proxy address can specify a proxy directly or it can be a URL to a proxy configuration script (PAC). © 2014 - 2015 Sequentum 264 11 Content Grabber Improving Agent Performance and Reliability It is important to understand that the complexity of web scraping increases significantly when extracting large amounts of data. The time it takes to perform certain operations may take so long that the whole web scraping process becomes impractical. Web scraping is unreliable in nature, and the longer a web scraping project runs, the higher the chance something fails and interrupts the web scraping process. The following topics list a few steps you can take to increase performance and reliability. Performance tuning is not for the novice user, and if you have no IT experience, you may not be able to implement the recommendations in this topic. Using MS SQL as Internal Database Using the Static Parser Browser Optimizing Agent Commands 11.1 Using MS SQL as Internal Database A Content Grabber agent saves all extracted data to an internal database while running. The default internal database is a SQLite file database, and file databases don't perform as well as real databases, such as SQL Server or MySQL. You may get better performance on some operations by switching to a real database server depending on your system and the data you're extracting. We recommend that you change the internal database to SQL Server Express if you don't already have an existing database server. SQL Server Express is a free database server from Microsoft and can contain up to 10GB of data in a single database. You can download SQL Server Express from this URL: © 2014 - 2015 Sequentum Improving Agent Performance and Reliability 265 http://www.microsoft.com/sqlserver/en/us/editions/express.aspx Read the SQL Server Express documentation to learn how to install and operate the database server. Once you have installed SQL Server Express and added a database, you can change the internal database from the menu Data > Change Internal Database: For example, you can improve performance by choosing SQL Server from the Internal Database drop-down: 11.2 Using the Static Parser Browser When Content Grabber extracts data from a web page, it first loads the web page into a web browser. The web browser parses and renders the web page, and executes any JavaScript the page contains. This is a very safe approach, since Content Grabber uses an embedded version of Internet Explorer as it's web browser. Therefore if the target website is working in Internet Explorer, Content Grabber can usually extract data from the website. However, the approach is also slow and may cause instability. © 2014 - 2015 Sequentum 266 Content Grabber If you have been using Internet Explorer to browse the web, you may have sometimes experienced problems, such as hanging websites or program crashes. This may occur very rarely (say once a year), so it may not be a problem during normal usage of Internet Explorer. When Content Grabber uses Internet Explorer to browse a website, it may access more web pages in a few hours than you access in a year, so stability issues are magnified significantly. The main source of website instability is JavaScript. A website developer can use JavaScript to implement dynamic features on the website, but JavaScript bugs may lead to memory leaks, hanging websites or even program crashes. All action commands in a Content Grabber agent, that open a new web browser, can be configured to open a specific type of web browser. The default browser is an embedded version of Internet Explorer, but you can change this to a Static Parser. The Static Parser does not use Internet Explorer at all, and it completely ignores JavaScript, so it's generally much more reliable. JavaScript is always single threaded, so many operations cannot be performed simultaneously when using Internet Explorer web browsers. Since the Static Parser does not execute JavaScript, it can often process web pages much faster than an Internet Explorer web browser. Many websites don't work properly if JavaScript is disabled, so the Static Parser will not work for all websites, but many websites can be partly processed with a Static Parser, so you should always switch to a Static Parser if a particular web page can be processed without JavaScript. © 2014 - 2015 Sequentum Improving Agent Performance and Reliability 267 Configure an Action command to use a specific web browser type If you want an agent to use the Static Parser by default, then you can set the web browser type on the Agent Settings > Dynamic Browser > Static Parser: 11.3 Optimizing Agent Commands Action commands such as Navigate Link and Navigate URL commands are much slower to execute than commands that simply capture content, so it's important to try and limit the number of action commands in your agents. If you can access a target web page directly with a direct URL, rather than clicking multiple links, then you should always opt for the direct URL. Many dynamic websites show and hide content when you click on buttons or links. Content Grabber can extract both visible and hidden content, so it's often not necessary to add an action command that simply makes content visible. Optimizing Browser Activities © 2014 - 2015 Sequentum 268 Content Grabber An agent spends most of its time waiting for web pages to load, and Content Grabber cannot always determine exactly when a page load or page action has completed, so it sometimes ends up waiting longer than it has to. If the agent starts extracting data too early, it may not be able to extract the data correctly, since some data may not have loaded onto the web page yet. If an agent command executes an action that results in multiple browser activities, such as multiple page loads and AJAX callbacks, then Content Grabber will set the action option Discover activities, instead of setting the exact browser activities the action command should wait for. Content Grabber does this because the number of activities or the order of activities is likely to change, so it's safest to keep discovering the activities every time the action is executed. However, discovering activities is slower than waiting for a set number of activities, because Content Grabber has to wait to make sure no more activities occur before it can consider the action as completed. It's always best to manually set the browser activities Content Grabber should wait for, so it doesn't wait longer than absolutely necessary, but this can be difficult to get right and may not be worth the effort unless performance is very important. When an action command uses the option Discover activities, it uses a set of timeout values to determine how long it should wait for new activities. These timeout values are set fairly conservatively to make sure they are long enough for slow computers and slow Internet connections. You may be able to lower these timeout values, so an action command completes faster without failing to extract content correctly. © 2014 - 2015 Sequentum Improving Agent Performance and Reliability 269 Default activity timeout values You can use the Browser Activity panel to view activities and timings for an action. Please see the topic Browser Activities for more information. © 2014 - 2015 Sequentum 270 12 Content Grabber Building Self-Contained Agents You can build self-contained agents that will run without the need for the Content Grabber application to be present on the target computer. Self-contained agents can be distributed royalty-free, and includes a user interface that allows end users to configure and run the agent. Note: the self-contained user interface has an introduction screen that contains a promotional message for Content Grabber and Sequentum. This message can only be removed if you are using a premium version of Content Grabber. The following figure shows the default self-contained user interface, in which you can add your own text and logo: © 2014 - 2015 Sequentum Building Self-Contained Agents 271 An end user can configure the following agent features: Export target (only file formats are supported) Scheduling Input parameters Notifications Proxies. A self-contained agent can only export to file formats (Excel, CSV or XML). It cannot export to databases or execute a data export script. Self-contained agents require Internet Explorer 8+ and .NET v4+. The self-contained agents will not attempt to download and install these required components if they don't exist on the target computer, but most Windows computers will already have those components installed. Building a Self-Contained Agent Follow the steps below to build a self-contained agent: 1. Click the Export Agent application menu to open the Export Agent screen. 2. Make sure you check the option Create self-contained agent. © 2014 - 2015 Sequentum 272 Content Grabber 3. You can now click Export to create the self-contained agent. Before creating the self-contained agent you can click Customize Design to customize the design and text displayed on the agent's user interface. For more information, see the topic Customizing the User Interface. Self-Contained Agent Files All files required to configure and run a self-contained agent are included in a single executable file which can be distributed royalty free. Any custom assemblies used by an agent are included in the executable file, but only if they are located in the Assemblies folder for the agent. Assemblies shared by agents and located in the shared Assemblies folder are not included in the executable file, and should not be used in self-contained agents. Upgrading a Self-Contained Agent You can upgrade an agent if the target website of a self-contained agent has changed and you need to make changes to the agent in order for it to work correctly. Simply make the changes to your agent and export the agent again. The end-user can override his existing self-contained agent with your new agent. Any configuration changes the end-user has made to the agent will not be lost. For more information, see the topic Using a Self-Contained Agent. © 2014 - 2015 Sequentum Building Self-Contained Agents 12.1 273 Customizing the User Interface A self-contained agent includes a user interface that allows users to configure and run the agent. You can control some of the text and images displayed on this user interface, and you can also control which agent options the user can configure. Click the Customize Design button to open the Customize Self-Contained Agent screen: The Customize Self-Contained Agent screen has three tabs, in which you can specify the text and images that should be used on the user interface of the selfcontained agent: © 2014 - 2015 Sequentum 274 Content Grabber The Agent Details tab allows you to specify text and images that are displayed on the agent start-up screen. The Company Details tab allows you to specify text and images that are displayed on the agent About screen. The Templates tab allows you to specify which configuration options should be available on the user interface: © 2014 - 2015 Sequentum Building Self-Contained Agents 275 The Templates tab also allows you to specify custom template files. The template files are HTML files that controls the user interface layout. The template files contain special tags that will be replaced with the information you provide on the Agent Detail and Company Details tabs. If you want to provide your own template files, we recommend you copy the default template files to a new location and modify them to suit your needs. The default files are located in: [Content Grabber Installation Folder]\Resources\SelfContained\Docs This folder contains a number of files and sub-folders. Your custom folder must contain the same sub-folders and HTML files with the same names. You can remove or add style sheets and images to your custom folders. All files and sub-folders in your custom folder will be included in the executable agent file. © 2014 - 2015 Sequentum 276 12.2 Content Grabber Using Input Data A self-contained agent can use any input data that is supplied by a database. If an agent uses data from a CSV file, the file should be placed in the default agent input folder, and you should set the option to include input data when exporting the agent. If an agent uses data from a Simple Data Provider, the standard self-contained user interface can be used by the end-user to edit the input data. A Simple Data Provider is only available in the self-contained user interface if the data provider property Public Provider is set to True. A public name for the data provider can also be specified. This public name will only be used in the self-contained user interface, and not in the Content Grabber editor. Allow Simple input data to be edited using the self-contained user interface An agent can have multiple public data providers, and each data provider can be selected from a drop down box as shown on the following figure. © 2014 - 2015 Sequentum Building Self-Contained Agents 277 Simple input data can be edited using the self-contained user interface 12.3 Using a Self-Contained Agent A self-contained agent is a single executable file that includes all the required files to configure and run the agent. A self-contained agent does not need to be installed. When you want to configure or run the agent, simply click on the executable agent file, and the agent user interface will be displayed. The first time the executable agent file is run, a sub-folder with the same name as the agent will be created in the folder where the executable file is located. This folder will store configuration information for the agent, so the folder should not be removed. If the folder is removed, all configuration changes you have made to the agent will be lost, and the agent will start with its default configuration. © 2014 - 2015 Sequentum 278 Content Grabber Upgrading a Self-Contained Agent You can upgrade a self-contained agent by simply overriding the existing executable agent file with a new agent file. Any configuration changes you have made to the agent will not be lost, unless you delete the configuration folder. © 2014 - 2015 Sequentum Running Agents from the Command-Line 13 279 Running Agents from the Command-Line You can run agents from the command line using the RunAgent.exe program, which you can find in the Content Grabber installation folder. RunAgent.exe agentName You can specify the full path to an agent file or just the name of the agent. If you only specify the agent name, Content Grabber will look for the agent in the default location for the user running the program. If the agent is not located in that folder, Content Grabber will look in the default public agent location. The default public agent location is: C:\Users\Public\Documents\Content Grabber\Agents If your agent is not located in any of the default agent locations, you need to specify the full path to the agent file: RunProject.exe "C:\My Custom Folder\agentName.scg" When running multiple agents in a batch file, you can use the Start command to run the agents asynchronously: Start "WindowTitle" RunAgent.exe agentName1 Start "WindowTitle" RunAgent.exe agentName2 Exit Codes The RunAgent.exe command line program returns one of the following exit codes: Exit Code Status Description 0 Success Data extraction was completed successfully. 1 Failed A critical application error occurred during data extraction. © 2014 - 2015 Sequentum 280 Content Grabber Exit Code Status Description 2 Incomplete Data extraction was interrupted. 3 Completed with errors Data extraction was completed, but with one or more page load errors or missing required elements. 4 Incomplete with errors Data extraction was interrupted, and encountered one or more page load errors or missing required elements. 5 Export Failed Data extraction failed because it was unable to export data. You can manually open Content Grabber and attempt to export. Command Line Arguments Input parameters can be added as command line arguments. The input parameter name must be preceded with a dash. For example: RunAgent.exe agentName -username "test" -password "test" Please see the topic Input Parameters for more information. The following command line switches can be used, and should not have a preceding dash: Switch Description log_level Log detail level. Logging is turned off by default. Example: RunAgent.exe agentName log_level High log_html Logs the HTML of all loaded web pages to files. Example: RunAgent.exe agentName log_html log_to_file Logs information to a file instead of a database. © 2014 - 2015 Sequentum Running Agents from the Command-Line Switch 281 Description Example: RunAgent.exe agentName log_to_file view_browser Displays the web browsers used to navigate the target website. Example: RunAgent.exe agentName view_browser no_ui No user interface will be displayed. This will cause an agent pause command to fail. Example: RunAgent.exe agentName no_ui session_id The agent will run in a specified Session. Example: RunAgent.exe agentName session_id sdgfdf4353 session_timeout If an agent runs in a Session, this is the session timeout in minutes. The default session timeout is 30 minutes. Example: RunAgent.exe agentName session_id sdgfdf4353 session_timeout 10 continue Continues an agent that was stopped prematurely. Example: RunAgent.exe agentName continue 13.1 Using the Content Grabber runtime NOTICE: This functionality is only available to agents that are built with the Content Grabber Premium Edition. © 2014 - 2015 Sequentum 282 Content Grabber You can use the Content Grabber command-line program on a computer without installing the entire Content Grabber software environment. You must ensure that the following files are in the same folder on the target computer: AgentApi.dll AgentProxy.dll AjaxHook.dll msvcp120.dll msvcr120.dll SQLite.Interop.dll System.Data.SQLite.dll RunAgent.exe RunAgentProcess.exe BaseEngine.dll MySql.Data.dll These are the minimum computer specifications for running the Content Grabber runtime: Windows 8/Vista/7/2008R2/2012 (though we don't support it, Windows XP will work for most target websites) Internet Explorer 9+ (Internet Explorer 8 will work for most target websites, but is not supported) .NET framework version 4+ © 2014 - 2015 Sequentum Scripting 14 283 Scripting You can use scripting in a variety of ways to customize Content Grabber behavior, or to extend standard functionality. Content Grabber scripts are .NET functions written in C# or VB.NET, or regular expressions. Content Grabber scripts can be divided into three categories: Content transformation scripts. Command scripts. Extension scripts. Content Transformation Scripts Content transformation scripts are used to transform a piece of text. For example, if you extract an address from a web page, but want only the postal code, you can transform the address into just the postal code. You can also view this process as extracting sub-text, but sometimes you may do more than merely extract sub-text. That is why we call this transformation. Content transformation scripts are used in the following situations: Transformation of extracted content. Transformation of file names when downloading files. Transformation of URLs when using Navigate Link commands that extract and navigate to direct URLs. Transformation of input data. See the topic Content Transformation Scripts for more information. Command Scripts Command scripts are general purpose scripts that are executed as part of the normal agent command execution flow. The scripts are often used to alter the execution flow of an agent. © 2014 - 2015 Sequentum 284 Content Grabber See the topic Command Scripts for more information. Extension Scripts Extension scripts are used to extend Content Grabber functionality. A Data Export script is an example of an extension script that can be used to apply some custom business logic to extracted data. The following scripts are extension scripts: Agent Initialization Scripts Data Export Scripts Data Distribution Scripts Data Input Scripts Image OCR Scripts Convert Document to HTML Scripts 14.1 Script Languages The Content Grabber scripting engine supports C#, VB.NET and Regular Expressions. C# and VB.NET can be used for all types of scripts, but Regular Expressions can only be used by content transformation scripts. Regular Expressions This manual does not explain regular expression syntax. Please visit the following website for more information about regular expressions: Regular Expressions Reference Guide Regular expression scripts in Content Grabber can include any number of regex match and replace operations. Each regex operation must be specified on two lines. The first line must contain the regex pattern and the second line must contain the operation which can be return, return all or replace with. The return operation returns the first © 2014 - 2015 Sequentum Scripting 285 match in the original content or a selected group within the match The returned match can be combined with static text. The return all operation returns all matches or the specified group within all matches. The replace operation replaces all matches - or a selected group within the first match, in the original content and then returns the content. A group within a match must be specified by the group number, and cannot be specified by a group name. The group number 0 specifies the entire match, and the number 1 specifies the first group in the match. A group number must be preceded by the character $, so for example, $1 specifies the first group in a match. Use two $ characters ($$) to escape the $ character in an operation. If a regex script contains more than one regex operation, the next operation will work on the output from the preceding operation. All regex operations are case insensitive and line breaks are ignored. The following eight special operations can be specified on a single line: strip_html removes all HTML tags from the content. url_decode decodes an encoded URL. html_decode decodes encoded HTML. trim removes line breaks and white spaces from the beginning and end of the content. line_breaks converts some HTML tags such as <P>, <BR> and <LI> into standard Windows line breaks. to_lower converts text to lower case. to_upper converts text to upper case. capitalize_words capitalizes all words in the text. aggregate [separator] if return all has been used to return a list, this operation adds all strings in the list together separated by the specified separator. The syntax {$content_name} can be used to reference extracted data in the current container. For example, if you had a capture command named product_id, you could construct the following regular expression to extract all text between the product ID © 2014 - 2015 Sequentum 286 Content Grabber and the first white space: {$product_id}(.*?)\s Examples Here are a number of examples: Regular Description Expression .* Returns the entire match, so everything in this return case. A(.*?)B Returns the group 1 match, so everything return $1 between A and B. (A).*?(B) Returns group 1 and 2 matches, so AB in this return $1$2 case. A(.*?)B Replaces every instance of everything between replace with A and B (including A and B) with nothing. A(.*?)B Replaces every instance of everything between replace with A and B (including A and B) with "some new some new text". text A(.*?)B Replaces the first instance of everything replace $1 between A and B (excluding A and B) with with some "some new text". new text A(.*?)B Replaces every instance of everything between replace with A and B (including A and B) with the text $1 between A and B, so in effect it removes A and B. © 2014 - 2015 Sequentum Scripting A(.*?)B First extracts everything between A and B, and return $1 then replaces every instance of everything C(.*?)D between C and D (including C and D) with replace with nothing. <br> Replaces all <BR> HTML tags with standard replace with Windows line breaks. 287 \r\n A(.*?)B Returns the group 1 match, so everything return $1 between A and B, and then URL_decodes the url_decode result. A(.*?)B Returns the group 1 match, but adds C to the return C$1D beginning and D to the end before returning the match. C# and VB.NET This manual does not explain C# and VB.NET syntax. Please visit an online reference guide for more information about these languages. C# and VB.NET scripts must have one class named Script with a static function that is executed by Content Grabber. The signature of the static method depends on the type of script. The following example is for a Data Input script. using System; using System.Data; using Sequentum.ContentGrabber.Api; public class Script { public static DataTable ProvideData(DataProviderArguments args) { DataTable data = args.ScriptUtilities.LoadCsvFileFromDefaultInputFolder("inputData.scv"); return data; } } © 2014 - 2015 Sequentum 288 Content Grabber A script can have more than one function in the Script class, and can also use functionality from external .NET libraries. All external .NET libraries must be added to the agent's assembly references. See the topic Assembly References for more information. 14.2 Script Library Content Grabber provides a default procedure for developing scripts in Visual Studio. The build-in script editor in Content Grabber does not provide advanced development tools such as a debugger, so we recommend you build large scripts in Visual Studio and smaller scripts in the built-in script editor. When you add a script in Content Grabber, you can select to use a script library. The script library is a normal .NET class library that contains classes and methods for all types of scripts in Content Grabber. When using a Script Library, Content Grabber will automatically call the specific methods in the .NET class library. Agent or Shared Script Library Content Grabber supports two types of script libraries, a script library that is shared © 2014 - 2015 Sequentum Scripting 289 between all agents in your Content Grabber document folder, and a script library that is used only for a specific agent. The shared script library assembly must be placed in the folder Content Grabber \Assemblies and the script library is shared by all agents located in Content Grabber \Agents. The Content Grabber folder can be located anywhere, but by default it's located in My Documents\Content Grabber. The agent script library assembly must be placed in the Assemblies sub-folder of an agent folder. Visual Studio Solution Template Content Grabber provides a Visual Studio 2013 solution with a project that contains all the required classes and methods. When you click the button Open Project in C# or Open Project in Vb.NET, Content Grabber will automatically copy the solution and projects to your agent's Visual Studio folder or to the shared Visual Studio folder if you are using a shared script library. The Visual Studio solution also contains a test project you can use to test the functionality in your script library. When you build the Visual Studio solution in release mode, the script library assembly is automatically copied to the right location so it's picked up by Content Grabber when you run an agent. When Content Grabber loads the script library it will become locked and you will need to close Content Grabber before you can build the library in release mode. The debug version of the library is not locked by Content Grabber, because it's not copied to any of the Content Grabber Assemblies folders, and will not be used by your agents by default. 14.3 Assembly References You can use external assemblies in your scripts by adding assembly references to your agent. Click the Assembly References button when editing a script to open this screen: © 2014 - 2015 Sequentum 290 Content Grabber You should only enter the file name of your assembly file when adding a reference, not the full file path. You must copy the assembly file to the agent's Assemblies folder or to the shared Assemblies folder. Content Grabber will look for unresolved assemblies in those two folder. The shared Assemblies folder is located in Content Grabber\Assemblies and assemblies in this folder are shared by all agents in Content Grabber\Agents. The Content Grabber folder can be located anywhere, but by default it's located in My Documents\Content Grabber. 14.4 Script Utilities All scripts have access to a ScriptUtils class through the script arguments. This class contains the following utility functions: Function Description DataTable Loads a CSV file from a specified location LoadCsvFile(stri and returns the data in a DataTable. ng path) © 2014 - 2015 Sequentum Scripting DataTable Loads a CSV file from a specified location LoadCsvFile(stri using a specified value separator and ng path, char returns the data in a DataTable. separator) DataTable Loads a CSV file with a specified name LoadCsvFileFrom from the default input data folder and DefaultInputFold returns the data in a DataTable. er(string filename) DataTable Loads a CSV file with a specified name LoadCsvFileFrom from the default input data folder using a DefaultInputFold specified value separator and returns the er(string data in a DataTable. filename, char separator) void Executes a program on the command line. ExecuteComman The parameters inputFilePath, dLine(string outputFilePath and options are added as programFilePath, command line parameters. string inputFilePath, string outputFilePath, string options) void Executes a program on the command line ExecuteComman with the specified command line dLine(string parameters. programFilePath, string arguments) © 2014 - 2015 Sequentum 291 292 Content Grabber int Clears all Internet Explorer cookies on the ClearAllBrowserC computer. ookies() int Clears all Internet Explorer cookies on the ClearBrowserCo computer associated with the specified okies(string url) URL. void Clears the Internet Explorer session ResetBrowserSe cookies. This is equivalent to restarting an ssion() Internet Explorer browser. Example The following example loads data from a CSV file located in the agent's default input folder: using System; using System.Data; using Sequentum.ContentGrabber.Api; public class Script { public static DataTable ProvideData(DataProviderArguments args) { DataTable data = args.ScriptUtilities.LoadCsvFileFromDefaultInputFolder("inputData.scv"); return data; } } Database Utilities All scripts have access to the following function through the script arguments: public IConnection GetDatabaseConnection(string connectionName) The function returns a predefined database connection. See the topic Database © 2014 - 2015 Sequentum Scripting 293 Connections for more information about predefined database connections. The IConnection, ICommand and IReader interfaces are Content Grabber interfaces that are meant to make it easier to write and read data to a database, but you can always use the standard .NET libraries if you prefer. The following example writes some extracted data to a database: using System; using Sequentum.ContentGrabber.Api; public class Script { public static CustomScriptReturn CustomScript(CustomScriptArguments args) { IConnection connection = args.GetDatabaseConnection("exportDatabase"); connection.OpenDatabase(); try { ICommand command = connection.GetNewCommand(); command.SetSql("insert into export_table values (@title, @description)"); command.AddParameterWithValue("title", args.DataRow["title"]); command.AddParameterWithValue("description", args.DataRow["description"]); command.ExecuteNonQuery(); } finally { connection.CloseDatabase(); } return CustomScriptReturn.Empty(); } } The IConnection interface has the following functions and properties: Function or Property © 2014 - 2015 Sequentum Description 294 Content Grabber IReader Returns an IReader interface to a new GetNewReader(st Reader object that uses the current ring sql) connection. ICommand Returns an ICommand interface to a new GetNewCommand Command object that uses the current () connection. void Opens the database connection. OpenDatabase() void Closes the database connection. CloseDatabase() object Return the underlying database connection GetConnection() that is specific to the database type used. For example, if the database is a MySQL database, the object returned is a MySqlConnection. void Executes a SQL statement that doesn't ExecuteNonQuery return a result set. (string sql) object Executes a SQL statement that returns a ExecuteScalar(str single scalar value. ing sql) long Executes a SQL statement that returns a ExecuteCount(stri single long value. This function should be ng sql) used when using the SQL function "Count". bool Return true if the specified table exists in HasTable(string the database. tableName) bool Return true if the specified columns exists HasColumn(string in the specified table. © 2014 - 2015 Sequentum Scripting tableName, string columnName) DataRow[] Return all columns in the specified GetTableColumns database table. (string tableName) void Drops the specified database table. DropTable(string tableName) void Truncates the specified database table. TruncateTable(str ing tableName) void Lock() Locks the database connection, so it cannot be used by any other running threads. void Release() Releases a lock on the database connection. IReader Executes a SQL with a list of SQL ExecuteReader(st parameters and returns a IReader. ring sql, params object[] pars) DataTable Executes a SQL with a list of SQL ExecuteDataTale( parameters and returns a DataTable. string sql, params object[] pars) void Executes a SQL with a list of SQL ExecuteNonQuery parameters that doesn't return a result. (string sql, params object[] © 2014 - 2015 Sequentum 295 296 Content Grabber pars) object Executes a SQL with a list of SQL ExecuteScalar(str parameters that returns a single scalar ing sql, params value. object[] pars) long Executes a SQL with a list of SQL ExecuteCount(stri parameters that returns a single long ng sql, params value. This function should be used when object[] pars) using the SQL function "Count". The ICommand interface has the following functions and properties: Function or Description Property void Adds a SQL parameter to the Command AddParameterWit object. A matching parameter must exist in hValue(string the associated SQL statement. The value pName, object of a parameter is updated if the parameter pValue) already exists. void Adds a SQL parameter to the Command AddParameter(str object. A matching parameter must exist in ing pName, the associated SQL statement. The value CaptureDataType of a parameter is updated if the parameter type) already exists. void Sets the value of an existing SQL SetParameterValu parameter. A matching parameter must e(string pName, exist in the associated SQL statement. object pValue) void Executes a SQL statement that doesn't ExecuteNonQuer return a result set. © 2014 - 2015 Sequentum Scripting y(string sql) void Executes an existing SQL statement that ExecuteNonQuer doesn't return a result set. y() object Executes a SQL statement that returns a ExecuteScalar(str single scalar value. ing sql) object Executes an existing SQL statement that ExecuteScalar() returns a single scalar value. long Executes a SQL statement that returns a ExecuteCount(str single long value. This function should be ing sql) used when using the SQL function "Count". long Executes an existing SQL statement that ExecuteCount() returns a single long value. This function should be used when using the SQL function "Count". void SetSql(string Set the SQL statement associated with the sql) SQL command. The IReader interface has the following functions and properties: Function or Description Property void Sets the SQL statement associated with this SetSql(string data reader. sql) void Adds a SQL parameter to the Command AddParameterWi object. A matching parameter must exist in thValue(string the associated SQL statement. The value of a parameter is updated if the parameter © 2014 - 2015 Sequentum 297 298 Content Grabber pName, object already exists. pValue) bool Read() Reads a row of data. Executes an existing SQL statement if it has not already been executed. void Close() Closes the data reader. DateTime Gets a DateTime value from the current GetDateTimeVal row. ue(int columnIndex) string Gets a String value from the current row. GetStringValue(i nt columnIndex) int Gets a Integer value from the current row. GetIntValue(int columnIndex) Guid Gets a Guid value from the current row. GetGuidValue(in t columnIndex) Type Gets the data type of a specified column. GetFieldType(int columnIndex) IDataReader Returns the underlying .NET data reader. GetDataReader() object Gets a data value from the current data GetFieldValue(in row. t columnIndex) Extension Methods © 2014 - 2015 Sequentum Scripting 299 Content Grabber provides a few extension methods that can be used in all scripts: Function Description static DataTable Converts a single string value into a DataTable ToDataTable(this string with one data column and one data row. stringValue, string columnName) static DataTable Converts an array of strings into a DataTable ToDataTable(this string[] with one data column and one data row for each dataRows, string string value. columnName) static DataTable Converts an array of strings into a DataTable ToDataTable(this string[] with one data column and one data row for each dataRows, string string value. A standard .NET format string can columnName, string be used to format the string value before it is stringFormat) inserted into the DataTable. static DataTable Converts a list of strings into a DataTable with ToDataTable(this one data column and one data row for each List<string> dataRows, string value. string columnName) static DataTable Converts a list of strings into a DataTable with ToDataTable(this one data column and one data row for each List<string> dataRows, string value. A standard .NET format string can string columnName, string be used to format the string value before it is stringFormat) inserted into the DataTable. Example The following script generates a list of URLs and uses an extension method to convert the list of URLs into a DataTable. using System; using System.Collections.Generic; © 2014 - 2015 Sequentum 300 Content Grabber using System.Data; using Sequentum.ContentGrabber.Api; public class Script { public static DataTable ProvideData(DataProviderArguments args) { List<string> urls = new List<string>(); for(int i=1;i<1000;i++) { urls.Add("http://www.domain.com/page.php?ID=" +i.ToString()); } return urls.ToDataTable("url"); } } 14.5 Script Template Library Content Grabber has a Script Template Library that allows you to easily share scripts between agents. This library also contains some standard scripts for common operations, such as regular expressions used to extract emails or phone numbers from a piece of text. You can save a script to the library at anytime when your writing a script. Simply enter a script name and description, and click Save to Library. If you save a script with the same name as an existing script, the existing script will be overwritten. You can save a script to the template library at any time by clicking the Save button: © 2014 - 2015 Sequentum Scripting 301 You can load a script from the library by opening the Script name drop down box, and then selecting the script you want to load: Exporting and Importing Template Libraries © 2014 - 2015 Sequentum 302 Content Grabber All user templates can be exported and imported, so you can move the templates from one computer to another. Content Grabber exports or imports all templates, including both command templates and script templates. You cannot export or import just command templates or just script templates. 14.6 Agent Initialization Scripts An agent initialization script is run before the agent starts. The script can be used to initialize data before an agent is run. The script is also often used to set agent properties, such as database connection parameters, but we don't recommend you do this, since the Agent class is undocumented and unsupported. You can add an agent initialization script to an agent by clicking the menu Agent > Initialization Script: Agent initialization scripts can be written in C# or VB.NET. The following example initializes a global counter. using System; using Sequentum.ContentGrabber.Api; public class Script { public static bool InitializeAgent(AgentInitializationArguments args) { args.GlobalData.AddOrUpdateData("counter", 1); © 2014 - 2015 Sequentum Scripting 303 return true; } } The following example makes changes to an agent before it's run. The changes to the agent are not permanent, but only in effect at runtime. When you are debugging an agent, the debugger works directly on the agent in the editor, so any changes to an agent are permanent. We therefore recommend you use the args.IsDebug variable to turn off any changes to an agent during debugging. using System; using Sequentum.ContentGrabber.Api; using Sequentum.ContentGrabber.Commands; public class Script { public static bool InitializeAgent(AgentInitializationArguments args) { //Don't make changes to the agent while debugging If(args.IsDebug) return true; //Set the file path and download folder if we're exporting to CSV if(args.Agent.ExportDestination.ExportTargetType == ExportTargetType.Csv) { if(args.GlobalData.HasData("CsvFilePath")) { args.Agent.ExportDestination.CsvExport.UseDefaultPath = false; args.Agent.ExportDestination.CsvExport.FilePath = args.GlobalData.GetStrin } if(args.GlobalData.HasData("DownloadFolder")) { args.Agent.ExportDestination.UseDefaultDownloadPath = false; args.Agent.ExportDestination.DownloadDirectoryName = args.GlobalData.G } } //Set the database connection if we're exporting to a SQL Server database else if(args.Agent.ExportDestination.ExportTargetType == ExportTargetType.SqlServer) { © 2014 - 2015 Sequentum 304 Content Grabber if(args.GlobalData.HasData("DatabaseName")) { args.Agent.ExportDestination.DatabaseExport.DatabaseConnectionName = } } //Set the database connection and SQL for the input data provider in the Agent command if(args.GlobalData.HasData("DatabaseName") && args.GlobalData.HasData("SQL")) { args.Agent.DataProvider.DatabaseProvider.DatabaseConnectionName = args.Globa args.Agent.DataProvider.DatabaseProvider.Sql = args.GlobalData.GetString("SQL") } return true; } } If you continue an agent, the initialization script is not run, but any changes that was made to global data or the agent are are retained. NOTE: The script must have a static method with the following signature: public static bool InitializeAgent(AgentInitializationArguments args) All logging in an agent initialization script is always written to file and never to database. This is because the script runs before the agent starts, and the internal database is not available at that time. An instance of the AgentInitializationArguments class is provided by Content Grabber and has the following functions and properties: Property or Function Description Agent The current agent that is executing the script. ScriptUtils ScriptUtilities A script utility class with helper methods. See Script Utilities for more information. bool IsDebug True if the agent is running in debug mode. © 2014 - 2015 Sequentum Scripting Property or Function 305 Description IInputData InputDataCache All input data available to the agent command. void WriteDebug(string Writes log information to the agent log. This debugMessage, method has no effect if agent logging is disabled, DebugMessageType or if called during design time. messageType = DebugMessageType.Inform ation) void WriteDebug(string Writes log information to the agent log. This debugMessage, bool method has no effect if agent logging is disabled, showMessageInDesignMod or if called during design time. e, DebugMessageType messageType = DebugMessageType.Inform ation) void Notify(bool Triggers notification at the end of an agent run. If alwaysNotify) alwaysNotify is set to false, this method only triggers a notification if the agent has been configured to send notifications on critical errors. void Notify(string message, Triggers notification at the end of an agent run, and bool alwaysNotify) adds the message to the notification email. If alwaysNotify is set to false, this method only triggers a notification if the agent has been configured to send notifications on critical errors. GlobalDataDictionary Global data dictionary that can be used to store GlobalData data that needs to be available in all scripts and after agent restarts. Input Parameters are also stored in this dictionary. IConnection © 2014 - 2015 Sequentum Returns the specified database connection. The 306 Content Grabber Property or Function Description GetDatabaseConnection(st database connection must have been previously ring connectionName) defined for the agent or be a shared connection for all agents on the computer. Your script is responsible for opening and closing the connection by calling the OpenDatabase and CloseDatabase methods. 14.7 Data Export Scripts Data Export scripts can be used to customize the export process. An export script could be used to export extracted data to custom data structures in a database, or it could export data to a custom data source. A Data Export script can be added from the Data Export Configuration screen: The following example writes data from two export tables to two CSV files: using System; using System.IO; using Sequentum.ContentGrabber.Api; using Sequentum.ContentGrabber.Commands; public class Script © 2014 - 2015 Sequentum Scripting 307 { public static bool ExportData(DataExportArguments args) { try { StreamWriter writer1 = File.CreateText(@"c:\temp\data1.csv"); try { StreamWriter writer2 = File.CreateText(@"c:\temp\data2.csv"); try { IExportReader mainTableReader = args.Data.GetTable(); try { while(mainTableReader.Read()) { string csvData1 = childTableReader. GetStringValue("Field Name 1"); csvData1 += ","; csvData1 += childTableReader.GetStringValue ("Field Name 2"); writer1.WriteLine(csvData1); IExportReader childTableReader = mainTableReader.GetChildTable("Child table name"); try { while(childTableReader.Read()) { string csvData2 = childTableReader.GetStringValue("Child Field Name 1"); csvData2 += ","; csvData2 += childTableReader. GetStringValue("Child Field Name 2"); writer2.WriteLine(csvData2); } } finally { childTableReader.Close(); } © 2014 - 2015 Sequentum 308 Content Grabber } } finally { mainTableReader.Close(); } } finally { writer1.Close(); } } finally { writer2.Close(); } return true; } catch(Exception ex) { args.WriteDebug(ex.Message, DebugMessageType.Error); return false; } } } The script must have a static method with the following signature: public static bool ExportData(DataExportArguments args) The function should return True if it succeeds and False if it fails. In this example, an instance of the DataExportArguments class will be given by Content Grabber and has the following functions and properties: Property or Description Function Agent Agent The current agent. © 2014 - 2015 Sequentum Scripting ScriptUtils A script utility class with helper methods. ScriptUtilities See Script Utilities for more information. IExportData Data The data to export. bool IsDebug True if the agent is running in debug mode. ExtendedSortedLi All input parameters defined by the agent. st<string, string> InputParemeters void Writes log information to the agent log. WriteDebug(string This method has no effect if agent logging debugMessage, is disabled, or if called during design time. DebugMessageTy pe messageType = DebugMessageTy pe.Information) void Writes log information to the agent log. WriteDebug(string This method has no effect if agent logging debugMessage, is disabled, or if called during design time. bool showMessageInD esignMode, DebugMessageTy pe messageType = DebugMessageTy pe.Information) void Notify(bool Triggers notification at the end of an agent alwaysNotify) run. If alwaysNotify is set to false, this method only triggers a notification if the © 2014 - 2015 Sequentum 309 310 Content Grabber agent has been configured to send notifications on critical errors. void Notify(string Triggers notification at the end of an agent message, bool run, and adds the message to the alwaysNotify) notification email. If alwaysNotify is set to false, this method only triggers a notification if the agent has been configured to send notifications on critical errors. GlobalDataDiction Global data dictionary that can be used to ary GlobalData store data that needs to be available in all scripts and after agent restarts. Input Parameters are also stored in this dictionary. IConnection Returns the specified database GetDatabaseConn connection. The database connection must ection(string have been previously defined for the agent connectionName) or be a shared connection for all agents on the computer. Your script is responsible for opening and closing the connection by calling the OpenDatabase and CloseDatabase methods. string[] Exports data to the specified export ExportData(Expor target. Returns the names of any exported tTarget data files if exporting to a file format. exportTarget, string sessionId = null) public void Exports data to the specified database. ExportToDatabas © 2014 - 2015 Sequentum Scripting e(string databaseConnecti onName, string sessionId = null) string[] Exports data to a CSV file using the ExportToCsv(strin default export folders and default options. g sessionId = null) Use the ExportData method to specify CSV export options. Returns the names of all exported CSV files. string Exports data to an Excel file using the ExportToExcel(str default export folders and default options. ing sessionId = Use the ExportData method to specify null) Excel export options. Returns the name of the exported Excel file. string Exports data to an XML file using the ExportToXml(strin default export folders and default options. g sessionId = null) Use the ExportData method to specify XML export options. Returns the name of the exported XML file. The IExportData interface has the following functions and properties: Property or Description Function IExportReader Returns a data reader that reads data from GetTable() the main table containing exported data. All other data tables will be child tables of this table. IExportReader Returns a data reader that reads data from this[string name] a specified data table. { get; } © 2014 - 2015 Sequentum 311 312 Content Grabber IExportReader Returns a data reader that reads data from GetTable(string a specified data table. name) bool Validates the export data and returns True ValidateData() if the data is valid and False if the data is invalid. If the agent that extracted this data has changed since the data was extracted, then the data may have become invalid. The IExportReader interface has the following functions and properties: Property or Description Function bool Read() Reads the next data row. Returns False if there are no more data rows available. void Close() Closes the data reader. string Gets a String value from the current row. GetStringValue(st ring columnName) int Gets a Integer value from the current row. GetIntValue(strin g columnName) Guid Gets a GUID value from the current row. GetGuidValue(stri ng columnName) string Gets a String value from the current row. GetStringValue(in t index) © 2014 - 2015 Sequentum Scripting int GetIntValue(int Gets a Integer value from the current row. index) Guid Gets a GUID value from the current row. GetGuidValue(int index) Type Gets the data type of a specified column. GetFieldType(int index) object Return a data value from the current data GetFieldValue(stri row. ng columnName) object GetFieldValue(int Return a data value from the current data row. index) object this[string Return a data value from the current data columnName] row. object this[int Return a data value from the current data index] row. IExportReader Return a data reader that reads data from GetChildTable(str a specified child table of the table ing tableName) associated with this data reader. The returned data reader will only read data rows that are child data to the current data row. string[] An array containing the names of all child ChildTableNames tables of the table associated with this data reader. © 2014 - 2015 Sequentum 313 314 14.8 Content Grabber Data Distribution Scripts Content Grabber has built-in support for data distribution by email or to an FTP server when data is exported to file formats such as CSV or Excel. A Data Distribution script can be used to customize data distribution or distribute data to an unsupported media. A Data Distribution script could also be used to run some custom functionality after data has been exported. For example, a script could run a stored procedure in a database after data has been exported to the database. You can add a Data Distribution script to an agent by clicking the Script button in the Data ribbon menu. The following example executes a stored procedure in a SQL Server database: using System; using Sequentum.ContentGrabber.Api; public class Script { public static bool DeliverData(DataDeliveryArguments args) { args.GetDatabaseConnection("test").ExecuteNonQuery("exec updateData"); return true; } } The script must have a static method which has the following signature: public static bool DeliverData(DataDeliveryArguments args) The function should return True if it succeeds and False if it fails. An instance of the DataDeliveryArguments class is provided by Content Grabber and © 2014 - 2015 Sequentum Scripting has the following functions and properties: Property or Description Function Agent Agent The current agent. ScriptUtils A script utility class with helper methods. ScriptUtilities See Script Utilities for more information. string[] Files The exported files that need to be distributed. bool IsDebug True if the agent is running in debug mode. void Writes log information to the agent log. WriteDebug(strin This method has no effect if agent logging g is disabled, or if it is called during design debugMessage, time. DebugMessageT ype messageType = DebugMessageT ype.Information) void Writes log information to the agent log. WriteDebug(strin This method has no effect if agent logging g is disabled, or if called during design time. debugMessage, bool showMessageIn DesignMode, DebugMessageT ype messageType = DebugMessageT © 2014 - 2015 Sequentum 315 316 Content Grabber ype.Information) void Notify(bool Triggers notification at the end of an agent alwaysNotify) run. If alwaysNotify is set to false, this method only triggers a notification if the agent has been configured to send notifications on critical errors. void Notify(string Triggers notification at the end of an agent message, bool run, and adds the message to the alwaysNotify) notification email. If alwaysNotify is set to false, this method only triggers a notification if the agent has been configured to send notifications on critical errors. GlobalDataDictio Global data dictionary that can be used to nary GlobalData store data that needs to be available in all scripts and after agent restarts. Input Parameters are also stored in this dictionary. IConnection Returns the specified database connection. GetDatabaseCon The database connection must have been nection(string previously defined for the agent or be a connectionName shared connection for all agents on the ) computer. Your script is responsible for opening and closing the connection by calling the OpenDatabase and CloseDatabase methods. 14.9 Content Transformation Scripts Content transformation scripts are used to transform content after it has been extracted from a web page. Content transformation is often used on HTML elements © 2014 - 2015 Sequentum Scripting 317 to extract information that is not placed in individual elements and therefore cannot be selected in the web browser. For example, content transformation can be used to extract parts of an address, such as a postal code, from a single HTML element containing the full address. Content transformation scripts can be used in most Capture Commands to transform the content extracted by the commands, but can also be used in other types of commands, such as in a Navigate Link command to transform an extracted URL. A content transformation can be defined as a regular expression or as a C# or VB.NET script. Regular expressions are often used when you wish to extract sub-text from a larger piece of extracted text. The following regular expression example extracts all the text until the first '<' character: (.*?)< return $1 See the topic Script Languages for information about how to use regular expressions in Content Grabber. The following script also extracts all text until the first '<' character, but uses C# © 2014 - 2015 Sequentum 318 Content Grabber instead of regular expressions: using System; using Sequentum.ContentGrabber.Api; public class Script { public static string TransformContent(ContentTransformationArguments args) { return args.Content.Remove(args.Content.IndexOf('<')); } } The script must have a static method with the following signature: public static string TransformContent(ContentTransformationArguments args) The function will return the transformed content. An instance of the ContentTransformationArguments class is provided by Content Grabber and has the following functions and properties: Property or Description Function Agent Agent The current agent. ScriptUtils A script utility class with helper methods. ScriptUtilities See Script Utilities for more information. Command The current agent command being Command executed. IConnection The current internal database connection DatabaseConnect used by the agent. This connection is ion already open and should not be closed by your script. string Content The extracted content that should be © 2014 - 2015 Sequentum Scripting transformed. IHtmlNode The extracted HTML node. HtmlNode IInternalDataRow The current internal data row containing the DataRow data that has been extracted so far in the current container command. bool IsDebug True if the agent is running in debug mode. IInputData All input data available to the current InputDataCache command. void Writes log information to the agent log. This WriteDebug(strin method has no effect if agent logging is g debugMessage, disabled, or if called during design time. DebugMessageT ype messageType = DebugMessageT ype.Information) void Writes log information to the agent log. This WriteDebug(strin method has no effect if agent logging is g debugMessage, disabled, or if called during design time. bool showMessageInD esignMode, DebugMessageT ype messageType = DebugMessageT ype.Information) void Notify(bool Triggers notification at the end of an agent alwaysNotify) run. If alwaysNotify is set to false, this © 2014 - 2015 Sequentum 319 320 Content Grabber method only triggers a notification if the agent has been configured to send notifications on critical errors. void Notify(string Triggers notification at the end of an agent message, bool run, and adds the message to the alwaysNotify) notification email. If alwaysNotify is set to false, this method only triggers a notification if the agent has been configured to send notifications on critical errors. GlobalDataDictio Global data dictionary that can be used to nary GlobalData store data that needs to be available in all scripts and after agent restarts. Input Parameters are also stored in this dictionary. IConnection Returns the specified database connection. GetDatabaseCon The database connection must have been nection(string previously defined for the agent or be a connectionName) shared connection for all agents on the computer. Your script is responsible for opening and closing the connection by calling the OpenDatabase and CloseDatabase methods. IInputDataRow If the current command is a data provider, GetInputData() the data for that command is returned. Otherwise this function searches the command's parents and returns the first found input data. IInputDataRow If the specified command is a data provider, GetInputData(Co the data for that command is returned. mmand Otherwise this function searches the © 2014 - 2015 Sequentum Scripting command) 321 command's parents and returns the first found input data. IInputDataRow If the specified command is a data provider, GetInputData(stri the data for that command is returned. ng Otherwise searches the command's parents commandName) and returns the first found input data. IInputDataRow If the specified command is a data provider, GetInputData(Gui the data for that command is returned. d commandId) Otherwise the function throws an error. 14.10 Data Input Scripts Data Input scripts can be used to generate data for data provider commands. Data provider commands include the following types of commands: Agent Navigate URL Set Form Field Data List A Data Input script can be added to a command by selecting the data provider Script from the Data configuration tab: © 2014 - 2015 Sequentum 322 Content Grabber The following example generates a list of URLs that could be used in an Agent command to provide start URLs: using System; using System.Collections.Generic; using System.Data; using Sequentum.ContentGrabber.Api; public class Script { public static DataTable ProvideData(DataProviderArguments args) { List<string> urls = new List<string>(); for(int i=1;i<1000;i++) { urls.Add("http://www.domain.com/page.php?ID=" +i.ToString()); } return urls.ToDataTable("url"); } } This script makes use of the extension method ToDataTable which is part of the Script Utilities. The script must have a static method with the following signature: public static DataTable ProvideData(DataProviderArguments args) © 2014 - 2015 Sequentum Scripting 323 The function must return a DataTable that contains the input data. The DataTable can have one or more data columns, and all columns will be available to the data provider command that is using the script. An instance of the DataProviderArguments class is provided by Content Grabber and has the following functions and properties: Property or Description Function Agent Agent The current agent. ScriptUtils A script utility class with helper methods. ScriptUtilities See Script Utilities for more information. Command The current agent command being executed. Command IContainer The parent container command of the ParentContainer current command. IConnection The current internal database connection DatabaseConnect used by the agent. This connection is ion already open and should not be closed by your script. IHtmlNode The extracted HTML node. HtmlNode IInternalDataRow The current internal data row containing the DataRow data that has been extracted so far in the current container command. bool IsDebug True if the agent is running in debug mode. bool If true, only the data schema is required, so IsSchemaOnly you can optimize processing by only returning the data schema with no data. © 2014 - 2015 Sequentum 324 Content Grabber IInputData All input data available to the current InputDataCache command. void Writes log information to the agent log. This WriteDebug(strin method has no effect if agent logging is g debugMessage, disabled, or if called during design time. DebugMessageTy pe messageType = DebugMessageTy pe.Information) void Writes log information to the agent log. This WriteDebug(strin method has no effect if agent logging is g debugMessage, disabled, or if called during design time. bool showMessageInD esignMode, DebugMessageTy pe messageType = DebugMessageTy pe.Information) void Notify(bool Triggers notification at the end of an agent alwaysNotify) run. If alwaysNotify is set to false, this method only triggers a notification if the agent has been configured to send notifications on critical errors. void Notify(string Triggers notification at the end of an agent message, bool run, and adds the message to the alwaysNotify) notification email. If alwaysNotify is set to false, this method only triggers a notification if the agent has been configured to send © 2014 - 2015 Sequentum Scripting notifications on critical errors. GlobalDataDiction Global data dictionary that can be used to ary GlobalData store data that needs to be available in all scripts and after agent restarts. Input Parameters are also stored in this dictionary. IConnection Returns the specified database connection. GetDatabaseCon The database connection must have been nection(string previously defined for the agent or be a connectionName) shared connection for all agents on the computer. Your script is responsible for opening and closing the connection by calling the OpenDatabase and CloseDatabase methods. IInputDataRow If the current command is a data provider, GetInputData() the data for that command is returned. Otherwise this function searches the command's parents and returns the first found input data. IInputDataRow If the specified command is a data provider, GetInputData(Co the data for that command is returned. mmand Otherwise this function searches the command) command's parents and returns the first found input data. IInputDataRow If the specified command is a data provider, GetInputData(stri the data for that command is returned. ng Otherwise this function searches the commandName) command's parents and returns the first found input data. IInputDataRow © 2014 - 2015 Sequentum If the specified command is a data provider, 325 326 Content Grabber GetInputData(Gui the data for that command is returned. d commandId) Otherwise the function throws an error. 14.11 Image OCR Scripts An Image OCR script is used to convert an image into plain text. This is often used when a website uses a CAPTCHA page to block automated access to the website. A CAPTCHA page contains an image with numbers and characters that a user must recognize and enter in plain text. Some websites also display emails and phone numbers as images, and an Image OCR script can be used to convert those images into plain text, making the data more useful. Content Grabber does not include any OCR functionality, so you must use a 3rd party OCR service. The Image OCR script is used to integrate with a 3rd party service. An Image OCR script can be added to a Download Image command by selecting the OCR configuration tab and setting the option Convert image to text. The following two examples show how to integrate with two popular CAPTCHA recognition services: © 2014 - 2015 Sequentum Scripting 327 public static string ConvertImageToText(ConvertImageToTextArguments args) { string captcha = DeathByCaptchaService.DecodeCaptcha(args.Image, "login", "password"); return captcha; } The script above integrates with the 3rd party service http:// www.deathbycaptcha.com: public static string ConvertImageToText(ConvertImageToTextArguments args) { string captcha = BypassCaptchaService.DecodeCaptcha(args.Image, "key"); return captcha; } The script above integrates with the 3rd party service http://bypasscaptcha.com/ The Image OCR script must have a static function with the following signature: public static string ConvertImageToText(ConvertImageToTextArguments args) The function must return the converted image as a string. An instance of the ConvertImageToTextArguments class is provided by Content Grabber and has the following functions and properties: Property or Description Function byte[] Image The image that needs to be converted to text. Agent Agent The current agent. ScriptUtils A script utility class with helper methods. ScriptUtilities See Script Utilities for more information. Command The current agent command being Command executed. © 2014 - 2015 Sequentum 328 Content Grabber IContainer The parent container command of the ParentContainer current command. IConnection The current internal database connection DatabaseConnectio used by the agent. This connection is n already open and should not be closed by your script. IHtmlNode The extracted HTML node. HtmlNode IInternalDataRow The current internal data row containing DataRow the data that has been extracted so far in the current container command. bool IsDebug True if the agent is running in debug mode. bool IsSchemaOnly If true, only the data schema is required, so you can optimize processing by only returning the data schema with no data. IInputData All input data available to the current InputDataCache command. void Writes log information to the agent log. WriteDebug(string This method has no effect if agent logging debugMessage, is disabled, or if called during design time. DebugMessageTyp e messageType = DebugMessageTyp e.Information) void Writes log information to the agent log. WriteDebug(string This method has no effect if agent logging debugMessage, is disabled, or if called during design time. bool showMessageInDe © 2014 - 2015 Sequentum Scripting signMode, DebugMessageTyp e messageType = DebugMessageTyp e.Information) void Notify(bool Triggers notification at the end of an agent alwaysNotify) run. If alwaysNotify is set to false, this method only triggers a notification if the agent has been configured to send notifications on critical errors. void Notify(string Triggers notification at the end of an agent message, bool run, and adds the message to the alwaysNotify) notification email. If alwaysNotify is set to false, this method only triggers a notification if the agent has been configured to send notifications on critical errors. GlobalDataDictiona Global data dictionary that can be used to ry GlobalData store data that needs to be available in all scripts and after agent restarts. Input Parameters are also stored in this dictionary. IConnection Returns the specified database GetDatabaseConne connection. The database connection must ction(string have been previously defined for the agent connectionName) or be a shared connection for all agents on the computer. Your script is responsible for opening and closing the connection by calling the OpenDatabase and CloseDatabase methods. © 2014 - 2015 Sequentum 329 330 Content Grabber IInputDataRow If the current command is a data provider, GetInputData() the data for that command is returned. Otherwise this function searches the command's parents and returns the first found input data. IInputDataRow If the specified command is a data GetInputData(Com provider, the data for that command is mand command) returned. Otherwise this function searches the command's parents and returns the first found input data. IInputDataRow If the specified command is a data GetInputData(strin provider, the data for that command is g commandName) returned. Otherwise this function searches the command's parents and returns the first found input data. IInputDataRow If the specified command is a data GetInputData(Guid provider, the data for that command is commandId) returned. Otherwise the function throws an error. 14.12 Convert Document to HTML Scripts A Convert Document to HTML script is used to convert a downloaded document into a HTML page, so Content Grabber can extract data from the document the same way as for any other HTML page. Please see the topic Extracting Data From Non-HTML Documents for more information. A Convert Document to HTML script can be added to a Download Document command by selecting the Convert to HTML configuration tab and setting the option Convert to HTML: © 2014 - 2015 Sequentum Scripting 331 The following example is the default Convert Document to HTML script. This script checks the type of the downloaded document, and uses an appropriate document converter to convert the document to a HTML page: using System; using System.IO; using Sequentum.ContentGrabber.Api; public class Script { public static bool ConvertDocumentToHtml(ConvertDocumentToHtmlArguments args) { if(args.DocumentType=="pdf") ScriptUtils.ExecuteCommandLine(@"Converters\pdftohtml\pdftohtml.exe", args.DocumentFilePath, args.HtmlFilePath, "-noframes"); else if(args.DocumentType=="docx") ScriptUtils.ExecuteCommandLine(@"Converters\docxtohtml\docxtohtml.exe", args.DocumentFilePath, args.HtmlFilePath, ""); if(!File.Exists(args.HtmlFilePath)) return false; return true; } } The function should return True if the conversion succeeds or False if the conversion © 2014 - 2015 Sequentum 332 Content Grabber failed. An instance of the ConvertDocumentToHtmlArguments class is provided by Content Grabber and has the following functions and properties: Property or Description Function string The path of the document that needs to be DocumentFilePat converted to HTML. h string The type of document that needs to be DocumentType converted to HTML. For example, if the document is a PDF document, the document type will be pdf. string The file path the script should use for the HtmlFilePath converted HTML file. Agent Agent The current agent. ScriptUtils A script utility class with helper methods. ScriptUtilities See Script Utilities for more information. Command The current agent command being Command executed. IContainer The parent container command of the ParentContainer current command. IConnection The current internal database connection DatabaseConnec used by the agent. This connection is tion already open and should not be closed by your script. IHtmlNode The extracted HTML node. HtmlNode IInternalDataRow The current internal data row containing the © 2014 - 2015 Sequentum Scripting DataRow data that has been extracted so far in the current container command. bool IsDebug True if the agent is running in debug mode. bool If true, only the data schema is required, so IsSchemaOnly you can optimize processing by only returning the data schema with no data. IInputData All input data available to the current InputDataCache command. void Writes log information to the agent log. This WriteDebug(strin method has no effect if agent logging is g disabled, or if called during design time. debugMessage, DebugMessageT ype messageType = DebugMessageT ype.Information) void Writes log information to the agent log. This WriteDebug(strin method has no effect if agent logging is g disabled, or if called during design time. debugMessage, bool showMessageIn DesignMode, DebugMessageT ype messageType = DebugMessageT ype.Information) void Notify(bool © 2014 - 2015 Sequentum Triggers notification at the end of an agent 333 334 Content Grabber alwaysNotify) run. If alwaysNotify is set to false, this method only triggers a notification if the agent has been configured to send notifications on critical errors. void Triggers notification at the end of an agent Notify(string run, and adds the message to the message, bool notification email. If alwaysNotify is set to alwaysNotify) false, this method only triggers a notification if the agent has been configured to send notifications on critical errors. GlobalDataDictio Global data dictionary that can be used to nary GlobalData store data that needs to be available in all scripts and after agent restarts. Input Parameters are also stored in this dictionary. IConnection Returns the specified database connection. GetDatabaseCon The database connection must have been nection(string previously defined for the agent or be a connectionName shared connection for all agents on the ) computer. Your script is responsible for opening and closing the connection by calling the OpenDatabase and CloseDatabase methods. IInputDataRow If the current command is a data provider, GetInputData() the data for that command is returned. Otherwise this function searches the command's parents and returns the first found input data. IInputDataRow If the specified command is a data provider, GetInputData(Co the data for that command is returned. © 2014 - 2015 Sequentum Scripting mmand Otherwise this function searches the command) command's parents and returns the first 335 found input data. IInputDataRow If the specified command is a data provider, GetInputData(str the data for that command is returned. ing Otherwise this function searches the commandName) command's parents and returns the first found input data. IInputDataRow If the specified command is a data provider, GetInputData(Gu the data for that command is returned. id commandId) Otherwise the function throws an error. 14.13 Custom Scripts A custom script is the script used by an Execute Script command to execute custom functionality. The Execute Script command provides a list of predefined scripts that can be used directly or edited to suit specific needs. Please see the Execute Script topic for more information about how to use the predefined scripts. A custom script can be added to an Execute Script command by setting the Script type to Custom Script: © 2014 - 2015 Sequentum 336 Content Grabber The following example writes extracted data to a database. Normally a Data Export script would be used to write extracted data after an agent has completed, but an Execute Script command could be used to write data as it's being extracted: using System; using Sequentum.ContentGrabber.Api; public class Script { public static CustomScriptReturn CustomScript(CustomScriptArguments args) { IConnection connection = args.GetDatabaseConnection("exportDatabase"); connection.OpenDatabase(); try { ICommand command = connection.GetNewCommand(); command.SetSql("insert into export_table values (@title, @description)"); command.AddParameterWithValue("title", args.DataRow["title"]); command.AddParameterWithValue("description", args.DataRow["description"]); command.ExecuteNonQuery(); } finally { connection.CloseDatabase(); } return CustomScriptReturn.Empty(); } } The above script uses the IConnection and ICommand interfaces which are part of Content Grabber's Script Utilities. The script must have a static method with the following signature. public static DataTable ProvideData(DataProviderArguments args) The function must return a DataTable that contains the input data. The DataTable can have one or more data columns, and all columns will be available to the data provider command that is using the script. © 2014 - 2015 Sequentum Scripting 337 An instance of the DataProviderArguments class is provided by Content Grabber and has the following functions and properties: Property or Description Function Agent Agent The current agent. ScriptUtils A script utility class with helper methods. ScriptUtilities See Script Utilities for more information. Command The current agent command being Command executed. IContainer The parent container command of the ParentContainer current command. IConnection The current internal database connection DatabaseConnect used by the agent. This connection is ion already open and should not be closed by your script. IHtmlNode The extracted HTML node. HtmlNode IInternalDataRow The current internal data row containing the DataRow data that has been extracted so far in the current container command. bool IsDebug True if the agent is running in debug mode. bool If true, only the data schema is required, so IsSchemaOnly you can optimize processing by only returning the data schema with no data. IInputData All input data available to the current InputDataCache command. void Writes log information to the agent log. This © 2014 - 2015 Sequentum 338 Content Grabber WriteDebug(strin method has no effect if agent logging is g debugMessage, disabled, or if called during design time. DebugMessageT ype messageType = DebugMessageT ype.Information) void Writes log information to the agent log. This WriteDebug(strin method has no effect if agent logging is g debugMessage, disabled, or if called during design time. bool showMessageInD esignMode, DebugMessageT ype messageType = DebugMessageT ype.Information) void Notify(bool Triggers notification at the end of an agent alwaysNotify) run. If alwaysNotify is set to false, this method only triggers a notification if the agent has been configured to send notifications on critical errors. void Notify(string Triggers notification at the end of an agent message, bool run, and adds the message to the alwaysNotify) notification email. If alwaysNotify is set to false, this method only triggers a notification if the agent has been configured to send notifications on critical errors. GlobalDataDictio Global data dictionary that can be used to nary GlobalData store data that needs to be available in all © 2014 - 2015 Sequentum Scripting scripts and after agent restarts. Input Parameters are also stored in this dictionary. IConnection Returns the specified database connection. GetDatabaseCon The database connection must have been nection(string previously defined for the agent or be a connectionName) shared connection for all agents on the computer. Your script is responsible for opening and closing the connection by calling the OpenDatabase and CloseDatabase methods. IInputDataRow If the current command is a data provider, GetInputData() the data for that command is returned. Otherwise this function searches the command's parents and returns the first found input data. IInputDataRow If the specified command is a data provider, GetInputData(Co the data for that command is returned. mmand Otherwise this function searches the command) command's parents and returns the first found input data. IInputDataRow If the specified command is a data provider, GetInputData(stri the data for that command is returned. ng Otherwise this function searches the commandName) command's parents and returns the first found input data. IInputDataRow If the specified command is a data provider, GetInputData(Gui the data for that command is returned. d commandId) Otherwise the function throws an error. © 2014 - 2015 Sequentum 339 340 Content Grabber © 2014 - 2015 Sequentum Programming Interface 15 341 Programming Interface The Content Grabber programming interface (API) provides access to the Content Grabber runtime from your own applications. The Content Grabber runtime can be distributed with your applications royalty free and does not require the Content Grabber application to be installed on the target computer. The Content Grabber runtime requires .NET version 4 client profile or higher, and Internet Explorer 9 or higher. If you want to run agents from a web application, you must use the Content Grabber proxy API. The proxy API interfaces with the Content Grabber runtime using a Windows service. The Windows service is installed as part of the Content Grabber application, so Content Grabber must be installed on the web server. In this chapter, you can learn more about Building a Desktop Application and Building a Web Application. The Content Grabber programming interface (API) provides access to the Content Grabber runtime from your own applications. The Content Grabber runtime can be distributed with your applications royalty free and does not require the Content Grabber application to be installed on the target computer. The Content Grabber runtime requires .NET version 4 client profile or higher, and Internet Explorer 9 or higher. If you want to run agents from a web application, you must use the Content Grabber proxy API. The proxy API interfaces with the Content Grabber runtime using a Windows service. The Windows service is installed as part of the Content Grabber application, so Content Grabber must be installed on the web server. You can also use simple web requests to call the Windows service without the use of the Content Grabber proxy API. This means that you can easily execute agents and retrieve extracted data from non-Windows environments, such as from a PHP page on a Linux server. © 2014 - 2015 Sequentum 342 Content Grabber In this chapter, you can learn more about Building a Desktop Application and Building a Web Application. API Restrictions The API cannot change the agent command structure, or alter the agent in a way that would change the output data structure. You cannot add, delete, move or copy commands, and you cannot change the Disabled and Export command properties. 15.1 Building a Desktop Application The Content Grabber application can build stand-alone agents. This is a convenient way to distribute agents that can be configured and run without requiring the full Content Grabber application on the target computer. However, stand-alone agents have a standardized user interface and can only export data to file formats. If you want full control of the user interface and export features, you can build your own desktop application using the Content Grabber API. See these topics for more information: Visual Studio Configuration Distributing Your Application Example The following example uses the API to run an agent with a set of input parameters. It sets the log level to high and specifies that log information should be written to file. AgentApi api = new AgentApi(@"C:\Users\Public\Documents\Content Grabber\Agents\ qantasApiTest\qantasApiTest.scg"); AgentSettings settings = new AgentSettings(); settings.InputParameters.Add("from", "SYD"); settings.InputParameters.Add("to", "MEL"); settings.InputParameters.Add("departure_date", DateTime.Now.ToString("yyyy-MM-dd")); settings.InputParameters.Add("return_date", DateTime.Now.ToString("yyyy-MM-dd")); settings.InputParameters.Add("travel_class", "ECO"); © 2014 - 2015 Sequentum Programming Interface settings.InputParameters.Add("adults", "1"); settings.InputParameters.Add("children", "0"); settings.LogLevel = AgentLogLevel.High; settings.IsLogToFile = true; api.RunAgent(settings); Agent API Functions Function Description AgentApi(string Instantiates a new API class with the agentNameOrPat specified agent and session ID. You can h, string specify the full path to an agent file or just sessionId) the name of the agent. If you only specify the agent name, Content Grabber will look for the agent in the default location for the user running the agent service. The default agent location for the local System account is: C:\Users\Public\Documents\Content Grabber\Agents AgentApi(string Instantiates a new API class without a agentNameOrPat session. You can specify the full path to an h) agent file or just the name of the agent. If you only specify the agent name, Content Grabber will look for the agent in the default location for the user running the agent service. The default agent location for the local System account is: C:\Users\Public\Documents\Content Grabber\Agents © 2014 - 2015 Sequentum 343 344 Content Grabber void Connects to a Content Grabber agent Connect(string service. You can specify the server name or endPointAddress IP address and port number. The default ) connection string for a local service is: http://localhost:8000/ContentGrabber void Closes the connection to the Content CloseConnection Grabber agent service. () void StartAgent() Runs the agent specified when instantiating the API. The agent will run asynchronously. void Runs the agent with additional settings. The StartAgent(Agent agent will run asynchronously. See below Settings for more information about the settings) AgentSettings class. void StopAgent() Stops the agent if it is currently running. void Closes an agent session. When you close CloseAgentSessi an agent session, all data associated with on() that session is removed and you will not be able to retrieve status information about the agent that ran in this session. You can only close a session if an agent is not currently running in the session. You don't need to close a session. Session data will be removed automatically after the agent has completed running and the session timeout has elapsed. The default © 2014 - 2015 Sequentum Programming Interface session timeout is 30 minutes, so by default session data will be removed automatically 30 minutes after the agent has completed. AgentStatus Returns status information about an agent GetAgentStatus( that has been run asynchronously. See ) below for more information about the AgentStatus class. DataTable Returns progress information in a GetAgentProgre DataTable about an agent running in ssAsDataTable() asynchronously. See below for more information about the information returned. DataTable Returns progress information as a JSON GetAgentProgre string about an agent running in ssAsJson() asynchronously. See below for more information about the information returned. DataTable Returns log data in a DataTable for an GetAgentLogAsD agent that has been run asynchronously. ataTable(string This function does not return any data if agentNameOrPat logging is disabled or if logging is written to h, string file. See below for more information about sessionId) the information returned. string Returns log data as a JSON string for an GetAgentLogAsJ agent that has been run asynchronously. son(string This function does not return any data if agentNameOrPat logging is disabled or if logging is written to h, string file. See below for more information about sessionId) the information returned. DataSet Returns extracted data in a DataSet for an GetAgentExport agent that has been run asynchronously. DataAsDataSet() © 2014 - 2015 Sequentum 345 346 Content Grabber string Returns extracted data as a JSON string for GetAgentExport an agent that has been run asynchronously. DataAsJson() string Returns extracted data as an XML string for GetAgentExport an agent that has been run asynchronously. DataAsXml() RunAgentReturn Runs an agent synchronously and returns Json extracted data as a JSON string. The agent is always run in a session when the agent supports sessions, and the session is closed automatically after the agent has completed its run. RunAgentReturn Runs an agent synchronously and returns Xml extracted data as an XML string. The agent is always run in a session when the agent supports sessions, and the session is closed automatically after the agent has completed its run. RunAgentReturn Runs an agent synchronously and returns DataSet extracted data in a DataSet. The agent is always run in a session when the agent supports sessions, and the session is closed automatically after the agent has completed its run. RunAgentReturn Runs an agent synchronously with additional Json(AgentSettin settings and returns extracted data as a gs settings) JSON string. The agent is always run in a session when the agent supports sessions, and the session is closed automatically after the agent has completed its run. © 2014 - 2015 Sequentum Programming Interface See below for more information about the AgentSettings class. RunAgentReturn Runs an agent synchronously with additional Xml(AgentSettin settings and returns extracted data as an gs settings) XML string. The agent is always run in a session when the agent supports sessions, and the session is closed automatically after the agent has completed its run. See below for more information about the AgentSettings class. RunAgentReturn Runs an agent synchronously with additional DataSet settings and returns extracted data in a DataSet. The agent is always run in a session when the agent supports sessions, and the session is closed automatically after the agent has completed its run. See below for more information about the AgentSettings class. public Agent Returns the agent specified when GetAgent() instantiating the API class. public void Saves the specified agent. SaveAgent(Agent agent) AgentSettings The following agent settings can be specified when running an agent: Property © 2014 - 2015 Sequentum Description 347 348 Content Grabber bool LogLevel Log detail level. Set the log level to None to turn off logging. bool IsLogHtml Logs the raw HTML of all web pages processed by the agent. bool IsLogToFile Logs data to a file instead of a database table. int Timeout This value specifies the session timeout in minutes when an agent is run asynchronously. All session data is removed automatically when the agent has completed and this timeout has elapsed. The default session timeout is 30 minutes. This value specifies the maximum number of seconds an agent will run when it's run synchronously. When the timeout is reached, the agent will stop and close its session if it's run in a session. The default timeout is 30 seconds. Dictionary<string, A list of input parameters. string> InputParameters AgentStatus An agent can provide the following status information: Property Description © 2014 - 2015 Sequentum Programming Interface PageLoads The RunStatus can be one of the following values: Completed. The agent has completed successfully. Incomplete. The agent has completed, but stopped prematurely. The agent may have been stopped manually. Failed. The agent has completed, but a critical error occurred. Idle. The agent has never been run. Starting. The agent is starting. ExportingData. The agent is exporting data to the specified export target. Stopping. The agent is in the process of stopping. Restarting. The agent is restarting. This usually occurs when the agent needs to clear JavaScript memory leaks. ExportFailed. The agent completed, but failed to export data. MissingElement s StartTime The number of page loads. This includes AJAX calls triggered by agent actions. The amount of time the agent has run. int The number of times an agent command MissingElements could not find it's specified content where the content was not specified as optional. int PageErrors The number of page load errors. This includes errors loading content from AJAX calls that were triggered by agent actions. © 2014 - 2015 Sequentum 349 350 Content Grabber DateTime The time the agent started. StartTime Agent Progress Data An agent can provide progress data in a DataTable. The DataTable contains a DataRow for each web browser the agent is using to extract data. Each DataRow contains a status column and a description column. The progress data is the same information displayed when running an agent in the Content Grabber agent editor. Agent Log Data An agent can provide log data in a DataTable. The DataTable contains a log level column and a description column. A log level of 1 means an error, 2 means a warning and 3 means information. The log data is the same data you can view in the Content Grabber agent editor. Agent Export Data The API can provider extracted data in a DataSet, as an XML string or as a JSON string. The API provides the entire extracted data set in one operation, so for large data sets it's recommend to export the data to an external database and read it directly from there. 15.1.1 Visual Studio Configuration The Content Grabber API is a 32-bit library and since the library will be embedded into your application, your application also needs to be a 32-bit application. To make sure your application is a 32-bit application, go to project properties in Visual Studio and set the platform target to x86. © 2014 - 2015 Sequentum Programming Interface The Content Grabber API uses the .NET v4.0 client profile framework, so your application should be configured to use the same or a higher version of the .NET framework. To set the .NET framework version in Visual Studio, go to project properties in Visual Studio and set the target framework to .NET v4 or higher. © 2014 - 2015 Sequentum 351 352 Content Grabber Assembly References You must add the following references to your project to use the The Content Grabber API: AgentApi.dll AgentProxy.dll Once you add these assemblies to your project references, Visual Studio will automatically copy these files and some dependent assemblies to your project's BIN folder. However, not all required assemblies will be copied to your BIN folder, so you must manually ensure the following files are in your BIN folder. The files are found in the Content Grabber installation folder: AgentApi.dll AgentProxy.dll © 2014 - 2015 Sequentum Programming Interface 353 AjaxHook.dll msvcp120.dll msvcr120.dll SQLite.Interop.dll System.Data.SQLite.dll RunAgent.exe RunAgentProcess.exe BaseEngine.dll MySql.Data.dll 15.1.2 Distributing Your Application You can distribute your desktop applications without having to install Content Grabber on the target computers. You just have to make sure the following files are in the same folder as your executable program: AgentApi.dll AgentProxy.dll AjaxHook.dll msvcp120.dll msvcr120.dll SQLite.Interop.dll System.Data.SQLite.dll RunAgent.exe RunAgentProcess.exe BaseEngine.dll MySql.Data.dll 15.2 Building a Web Application The Content Grabber API can be used to run agents from your own applications. When using the API directly, your application needs security privileges that are standard for desktop applications, but not for web applications. It can be a complex © 2014 - 2015 Sequentum 354 Content Grabber task to ensure a web application has the required security privileges, and you may end up giving the application too many security privileges, making it vulnerable to security breaches. Important: We don't support using the full API from a web application. We only support the use of the API proxy in web applications. You may be able to use the full API, depending on the security policies on the web server. We are unable to help you configure a web server for this purpose. We recommend you use the Content Grabber agent service when using the API in web applications. The Content Grabber agent service provides most API functionality through a simple proxy assembly that requires no special security settings and depends on no other files than the single proxy assembly file. The proxy assembly can run in both 32-bit and 64-bit applications, which makes it even easier to use this assembly rather than the full API. Read more in the topic Using the Content Grabber Agent Service. 15.2.1 Using the Content Grabber Agent Service Content Grabber includes a Windows service that can be used to run agents. Your web application communicates with the Windows service using a small proxy assembly that requires no special security privileges and depends on no other files. You simply add the proxy assembly to your web application's assembly references and use the proxy to call the API functions. The proxy assembly contains the same methods as the standard API, except for functions used to load and save agents. The proxy can only execute agents, not load agents, because the proxy is designed to rely on no other assemblies and loading an agent would require the agent definition classes found in the full Content Grabber API Reference. Before you can use the proxy to communicate with the Windows Service, you need to connect to the service. You use the proxy method Connect to connect to the © 2014 - 2015 Sequentum Programming Interface 355 service. The default connection string is: http://localhost:8001/ContentGrabber If your Content Grabber service is located on a remote server, you can change localhost to the name or IP address of the remote server. The service is listed on port 8000 by default, but you can change the port in the Content Grabber editor. The Content Grabber service is stopped by default and configured to start manually. If you are going to use this service, you should configure the service to start automatically. The Content Grabber service is configured to logon as the local System account by default. You can change this account if you wish, but if you decide to continue using the local System account, you must make sure this account has proper access to the Internet through Internet Explorer. For example, Internet cookies may have been disabled for the System account, or there may be other Internet restrictions for this account. To test and change Internet setting for the System account, open the Content Grabber editor and click the Test IE as System button in the Application menu. Pressing this button will open Internet Explorer using the System account and you can then test Internet access and change settings in Internet Explorer if required. All Windows services, including the Content Grabber agent service, run in a special Windows session that cannot interact with users and cannot display user interfaces. JavaScript on some websites does not work correctly without at least a hidden web browser window. Such websites are rare, but if you need to run such a website through the Content Grabber agent service, the service needs to run the agent in a normal user session. In order for the Windows service to start an agent in a normal user session the following three conditions must be met. If your Content Grabber service is located on a remote server, you can change localhost to the name or IP address of the remote server. The service is listening on port 8001 by default, but you can change the port in the Content Grabber editor. The Content Grabber service is stopped by default and configured to start manually. If you are going to use this service, you should © 2014 - 2015 Sequentum 356 Content Grabber configure the service to start automatically. The Content Grabber service is configured to logon as the local System account by default. You can change this account if you wish, but if you decide to continue using the local System account, you must make sure this account has proper access to the Internet through Internet Explorer. For example, Internet cookies may have been disabled for the System account, or there may be other Internet restrictions for this account. To test and change Internet setting for the System account, open the Content Grabber editor and click the Test IE as System button in the Application menu. Pressing this button will open Internet Explorer using the System account and you can then test Internet access and change settings in Internet Explorer if required. All Windows services, including the Content Grabber agent service, run in a special Windows session that cannot interact with users and cannot display user interfaces. JavaScript on some websites does not work correctly without at least a hidden web browser window. Such websites are rare, but if you need to run such a website through the Content Grabber agent service, the service needs to run the agent in a normal user session. In order for the Windows service to start an agent in a normal user session the following three conditions must be met. 1. You must set the agent option Run Interactively. 2. The Content Grabber agent service must run under the System account. 3. A user must be logged onto the computer while the agent runs. The agent will run in the Windows session of the logged in user, but in the security context of the System account. Example The following example connects to a local Content Grabber agent service and runs an agent in a new session. It also sets the log level to high and specifies that log information should be written to file. © 2014 - 2015 Sequentum Programming Interface 357 string sessionId = Guid.NewGuid().ToString(); AgentProxy proxy = new AgentProxy(@"C:\Users\Public\Documents\Content Grabber\ Agents\qantas\qantas.scg", sessionId); proxy.Connect("http://localhost:8001/ContentGrabber"); AgentSettings settings = new AgentSettings(); settings.LogLevel = AgentLogLevel.High; settings.IsLogToFile = true; proxy.StartAgent(settings); You can connect to the Content Grabber agent service again at any time to get status information about a specified agent running in a specified session. AgentProxy qantasProxy = new AgentProxy(@"C:\Users\Public\Documents\Content Grabber\ Agents\qantas\qantas.scg", sessionId); qantasProxy.Connect("http://localhost:8001/ContentGrabber"); AgentStatus status = qantasProxy.GetAgentStatus(); If you are running an agent from a web application, you could use AJAX callbacks to get status information about a running agent. You only have to keep track of the session ID between callbacks. Proxy Functions The following functions are available in the proxy assembly: Function Description AgentProxy(string Instantiates a new proxy with the agentNameOrPath, specified agent and session ID. You can string sessionId) specify the full path to an agent file or just the name of the agent. If you only specify the agent name, Content Grabber will look for the agent in the default location for the user running the agent service. The default agent location © 2014 - 2015 Sequentum 358 Content Grabber for the local System account is: C:\Users\Public\Documents\Content Grabber \Agents AgentProxy(string Instantiates a new proxy without a agentNameOrPath) session. You can specify the full path to an agent file or just the name of the agent. If you only specify the agent name, Content Grabber will look for the agent in the default location for the user running the agent service. The default agent location for the local System account is: C:\Users\Public\Documents\Content Grabber \Agents void Connect(string Connects to the Content Grabber agent endPointAddress) service. You can specify the server name or IP address and port number. The default connection string for a local service is: http://localhost:8000/ContentGrabber void Closes the connection to the Content CloseConnection() Grabber agent service. void StartAgent() Starts the agent specified when instantiating the proxy. The agent will run asynchronously. void Starts the agent with additional settings. © 2014 - 2015 Sequentum Programming Interface StartAgent(AgentSe The agent will run asynchronously. See ttings settings) below for more information about the AgentSettings class. void StopAgent() Stops the agent if it is currently running asynchronously. void Closes an agent session after the agent CloseAgentSession( has been run asynchronously. When you ) close an agent session, all data associated with that session is removed and you will not be able to retrieve status information about the agent that ran in this session. You can only close a session if an agent is not currently running in the session. You don't need to close a session. Session data will be removed automatically after the agent has completed running and the session timeout has elapsed. The default session timeout is 30 minutes, so by default session data will be removed automatically 30 minutes after the agent has completed. AgentStatus Returns status information about an GetAgentStatus() agent that has been run asynchronously. See below for more information about the AgentStatus class. DataTable Returns progress information in a GetAgentProgressA DataTable about an agent running in sDataTable() asynchronously. See below for more © 2014 - 2015 Sequentum 359 360 Content Grabber information about the information returned. DataTable Returns progress information as JSON GetAgentProgressA about an agent running in sJson() asynchronously. See below for more information about the information returned. DataTable Returns log data in a DataTable for an GetAgentLogAsData agent that has been run asynchronously. Table(string This function does not return any data if agentNameOrPath, logging is disabled or if logging is written string sessionId) to file. See below for more information about the information returned. string Returns log data as JSON for an agent GetAgentLogAsJso that has been run asynchronously. This n(string function does not return any data if agentNameOrPath, logging is disabled or if logging is written string sessionId) to file. See below for more information about the information returned. DataSet Returns extracted data in a DataSet for GetAgentDataAsDat an agent that has been run aSet() asynchronously. string Returns extracted data as a JSON string GetAgentDataAsJso for an agent that has been run n() asynchronously. string Returns extracted data as an XML string GetAgentDataAsXml for an agent that has been run () asynchronously. RunAgentReturnJso Runs an agent synchronously and n returns extracted data as a JSON string. © 2014 - 2015 Sequentum Programming Interface The agent is always run in a session when the agent supports sessions, and the session is closed automatically after the agent has completed its run. RunAgentReturnXml Runs an agent synchronously and returns extracted data as an XML string. The agent is always run in a session when the agent supports sessions, and the session is closed automatically after the agent has completed its run. RunAgentReturnDat Runs an agent synchronously and aSet returns extracted data in a DataSet. The agent is always run in a session when the agent supports sessions, and the session is closed automatically after the agent has completed its run. RunAgentReturnJso Runs an agent synchronously with n(AgentSettings additional settings and returns extracted settings) data as a JSON string. The agent is always run in a session when the agent supports sessions, and the session is closed automatically after the agent has completed its run. See below for more information about the AgentSettings class. RunAgentReturnXml Runs an agent synchronously with (AgentSettings additional settings and returns extracted settings) data as an XML string. The agent is always run in a session when the agent supports sessions, and the session is © 2014 - 2015 Sequentum 361 362 Content Grabber closed automatically after the agent has completed its run. See below for more information about the AgentSettings class. RunAgentReturnDat Runs an agent synchronously with aSet(AgentSettings additional settings and returns extracted settings) data in a DataSet. The agent is always run in a session when the agent supports sessions, and the session is closed automatically after the agent has completed its run. See below for more information about the AgentSettings class. AgentSettings The following agent settings can specified when running an agent: Property Description bool LogLevel Log detail level. Set the log level to None to turn off logging. bool IsLogHtml Logs the raw HTML of all web pages processed by the agent. bool IsLogToFile Logs data to a file instead of a database table. int Timeout This value specifies the session timeout in minutes when an agent is run asynchronously. All session data is removed automatically when the agent has completed and this timeout has © 2014 - 2015 Sequentum Programming Interface elapsed. The default session timeout is 30 minutes. This value specifies the maximum number of seconds an agent will run when it's run synchronously. When the timeout is reached, the agent will stop and close its session if it's run in a session. The default timeout is 30 seconds. Dictionary<string, A list of input parameters. string> InputParameters GlobalData Any serializable data object can be stored in this dictionary and will be available to all scripts in an agent. Notice that input parameters will eventually be stored in this dictionary as well, so it doesn't matter if you use input parameters or global data to store your input data. ProxyList A list of web proxies that will be used by the agent. If a proxy list is specified, it overrides any default proxy settings in the agent. AgentStatus An agent can provide the following status information: Property Description AgentRunningStatus The RunStatus can be one of the © 2014 - 2015 Sequentum 363 364 Content Grabber RunStatus following values. Completed. The agent has completed successfully. Incomplete. The agent has completed, but stopped prematurely. The agent may have been stopped manually. Failed. The agent has completed, but a critical error occurred. Idle. The agent has never been run. Starting. The agent is starting. ExportingData. The agent is exporting data to the specified export target. Stopping. The agent is in the process of stopping. Restarting. The agent is restarting. This usually occurs when the agent needs to clear JavaScript memory leaks. ExportFailed. The agent completed, but failed to export data. int PageLoads The number of page loads. This includes AJAX calls triggered by agent actions. TimeSpan Runtime The amount of time the agent has run. int MissingElements The number of times an agent command could not find it's specified content where the content was not specified as optional. © 2014 - 2015 Sequentum Programming Interface int PageErrors 365 The number of page load errors. This includes errors loading content from AJAX calls that were triggered by agent actions. DateTime StartTime The time the agent started. Agent Progress Data An agent can provide progress data in a DataTable. The DataTable contains a DataRow for each web browser the agent is using to extract data. Each DataRow contains a status column and a description column. The progress data is the same information displayed when running an agent in the Content Grabber agent editor. Agent Log Data An agent can provide log data in a DataTable. The DataTable contains a log level column and a description column. A log level of 1 means an error, 2 means a warning and 3 means information. The log data is the same data you can view in the Content Grabber agent editor. Agent Export Data The API can provider extracted data in a DataSet, as an XML string or as a JSON string. The API provides the entire extracted data set in one operation, so for large data sets it's recommend to export the data to an external database and read it directly from there. 15.2.2 Using Simple Web Requests The Content Grabber Windows service supports simple web requests, so you can execute agents and retrieve extracted data from non-Windows environments, such as from a PHP page on a Linux server. The Windows service is listening for web requests on port 8002 by default. You © 2014 - 2015 Sequentum 366 Content Grabber can change this port number in the Content Grabber editor. The service is stopped by default and configured to start manually. If you are going to use this service, you should configure the service to start automatically. Please see Using the Content Grabber Agent Service for more information about the Windows service. This is an example of a web request that executes an agent named Sequentum and provides some input values for the agent. http://localhost:8002/ContentGrabber/RunAgentReturnJson? agent=sequentum&pars={"StartDate":"2015-10-15","EndDate":"2015-12-15"} The above web request executes the agent synchronously and returns the extracted data as a JSON string. Web Request Functions The following functions are available when using web requests: Function Description StartAgent? Starts an agent that will run agent={agentName asynchronously. OrPath} &sessionId={sessio This function supports both GET and nId} POST requests. If you need to specify a &sessionTimeout={ long list of input parameters you must sessionTimeout} use POST requests, since GET requests &logLevel={logLevel are limited in length. } &logHtml={isLogHt This function accepts the following ml} parameters: &logToFile={isLogT oFile} agent (required): You can specify the full path to an agent file or just the name © 2014 - 2015 Sequentum Programming Interface &pars={inputParam of the agent. If you only specify the eters} agent name, Content Grabber will look for the agent in the default location for the user running the agent service. The default agent location for the local System account is: C:\Users\Public\Documents\Content Grabber \Agents sessionId (optional): The session will run in a session with the specified session ID. sessionTimeout (optional): If the agent is running in a session, this value specifies the number of minutes the agent data will be available before it's deleted. The default session timeout is 30 minutes. logLevel (optional): Log detail level. Set the log level to None to turn off logging. The default log level is None. Accepted values are None, Low, Medium and High. logHtml (optional): Logs the raw HTML of all web pages processed by the agent. The default value is False. logToFile (optional): Logs data to a file instead of a database table. The default © 2014 - 2015 Sequentum 367 368 Content Grabber value is False. Pars (optional): A JSON formatted list of input values that can be used by the agent. The JSON string should be URL encoded. RunAgentReturnJso Runs an agent synchronously and returns n? the extracted data as a JSON string. agent={agentName OrPath} The agent is always run in a session &timeout={timeout} when the agent supports sessions, and &logLevel={logLevel the session is closed automatically after } the agent has completed its run. &logHtml={isLogHt ml} This function supports both GET and &logToFile={isLogT POST requests. If you need to specify a oFile} long list of input parameters you must &pars={inputParam eters} use POST requests, since GET requests are limited in length. This function accepts the following parameters: agent (required): You can specify the full path to an agent file or just the name of the agent. If you only specify the agent name, Content Grabber will look for the agent in the default location for the user running the agent service. The default agent location for the local System account is: C:\Users\Public\Documents\Content Grabber © 2014 - 2015 Sequentum Programming Interface \Agents Timeout (optional): This maximum number of seconds an agent will run. When the timeout is reached, the agent will stop and close its session if it's run in a session. The default timeout is 30 seconds. logLevel (optional): Log detail level. Set the log level to None to turn off logging. The default log level is None. Accepted values are None, Low, Medium and High. logHtml (optional): Logs the raw HTML of all web pages processed by the agent. The default value is False. logToFile (optional): Logs data to a file instead of a database table. The default value is False. Pars (optional): A JSON formatted list of input values that can be used by the agent. The JSON string should be URL encoded. RunAgentReturnXml Runs an agent synchronously and returns ? the extracted data as an XML string. agent={agentName OrPath} The agent is always run in a session &timeout={timeout} when the agent supports sessions, and © 2014 - 2015 Sequentum 369 370 Content Grabber &logLevel={logLevel the session is closed automatically after } the agent has completed its run. &logHtml={isLogHt ml} This function supports both GET and &logToFile={isLogT POST requests. If you need to specify a oFile} long list of input parameters you must &pars={inputParam use POST requests, since GET requests eters} are limited in length. This function accepts the following parameters: agent (required): You can specify the full path to an agent file or just the name of the agent. If you only specify the agent name, Content Grabber will look for the agent in the default location for the user running the agent service. The default agent location for the local System account is: C:\Users\Public\Documents\Content Grabber \Agents Timeout (optional): This maximum number of seconds an agent will run. When the timeout is reached, the agent will stop and close its session if it's run in a session. The default timeout is 30 seconds. logLevel (optional): Log detail level. Set the log level to None to turn off © 2014 - 2015 Sequentum Programming Interface logging. The default log level is None. Accepted values are None, Low, Medium and High. logHtml (optional): Logs the raw HTML of all web pages processed by the agent. The default value is False. logToFile (optional): Logs data to a file instead of a database table. The default value is False. Pars (optional): A JSON formatted list of input values that can be used by the agent. The JSON string should be URL encoded. StopAgent? Stops the agent if it is currently running agent={agentName asynchronously. OrPath} &sessionId={sessio This function supports only GET nId}" requests. CloseAgentSession Closes an agent session after the agent ? has been run asynchronously. When you agent={agentName close an agent session, all data OrPath} associated with that session is removed &sessionId={sessio and you will not be able to retrieve status nId} information about the agent that ran in this session. You can only close a session if an agent is not currently running in the session. You don't need to close a session. © 2014 - 2015 Sequentum 371 372 Content Grabber Session data will be removed automatically after the agent has completed running and the session timeout has elapsed. The default session timeout is 30 minutes, so by default session data will be removed automatically 30 minutes after the agent has completed. This function supports only GET requests. GetAgentStatus? Returns status information about an agent={agentName agent that has been run asynchronously. OrPath} See below for more information about &sessionId={sessio the AgentStatus class. nId} This function supports only GET requests. GetAgentProgressA Returns progress information as JSON sJson? about an agent running in a agent={agentName asynchronously. See below for more OrPath} information about the information &sessionId={sessio returned. nId} This function supports only GET requests. GetAgentLogAsJso Returns log data as JSON for an agent n? that has been run asynchronously. This agent={agentName function does not return any data if OrPath} logging is disabled or if logging is written &sessionId={sessio to file. See below for more information about the information returned. © 2014 - 2015 Sequentum Programming Interface 373 nId} This function supports only GET requests. DataSet Returns extracted data in a DataSet for GetAgentDataAsDat an agent that has been run aSet() asynchronously. This function supports only GET requests. GetAgentDataAsJso Returns extracted data as a JSON string n? for an agent that has been run agent={agentName asynchronously. OrPath} &sessionId={sessio This function supports only GET nId} requests. GetAgentDataAsXml Returns extracted data as an XML string ? for an agent that has been run agent={agentName asynchronously. OrPath} 15.3 &sessionId={sessio This function supports only GET nId} requests. Sessions Sessions allow you to run multiple instances of the same agent at the same time. This is particularly useful when running an agent directly from a website where you can have many web users visiting the website at the same time. Each web user can start the agent in their own session and the user will only see agent progress status and extracted data belonging to his session. Running an Agent in a Session To run an agent in a session, you must specify a session ID when you run the agent © 2014 - 2015 Sequentum 374 Content Grabber and the agent must be configured to support sessions. To configure an agent to support sessions, use the agent option Support Sessions in the Sessions section on the Advanced options tab. A session ID can be specified when running an agent via the Agent API or the Agent Proxy. You can use any string as session ID, but each session must have a unique ID, so you would normally use a unique identifier. The following C# code snippet runs an agent with a unique session ID: string sessionId = Guid.New().ToString(); AgentApi agent = new AgentApi("sequentum", sessionId); agent.Run(new AgentSettings()); A session ID can also be specified when running an agent from the commandline by using the command-line option session_id. The following command line input runs an agent named sequentum with a session ID ry37456r. RunAgent.exe "sequentum" session_id "ry37456r" Session Data Cleanup When running an agent normally without sessions, the agent can cleanup previously extracted data and status information before it starts because only data and status from the last run is important. Data cleanup becomes much more complex when running agent sessions. An agent session should only cleanup data from its own session, so it cannot do a general cleanup of all data belonging to the agent. Since all sessions are normally new sessions there is no previous session data to cleanup, and the agent cannot cleanup data when it completes a session since you'll need access to the data for some time, so session data risks hanging around forever. You can manually trigger a session cleanup by calling the API method CloseSession, but in a web application you may not always be able to tell when a user session ends. When running an agent in a session you can specify a session timeout, and the session cleanup will start automatically after the agent session completes and the © 2014 - 2015 Sequentum Programming Interface 375 specified timeout has passed. This gives your application a minimum amount of time to use the session data before it's removed. An agent only cleans up externally exported session data if the agent is configured to export data to a database. If you are using an Export Script or if you are exporting to a file format, then you are responsible for any cleanup of exported session data. You may not always want to remove externally exported session data automatically when an agent session ends. To prevent session data from being removed, set the agent option Cleanup External Session Data to false. This option can be found in the Sessions section on the advanced options tab. If you are not using the API to retrieve extracted data, but only working with externally exported session data, then you can use the option Remove Internal Session Data Immediately to remove internal session data immediately after it has been externally exported. This will reduce the size and increase performance of the internal database. Session status information is not removed and will still be available until the session timeout has passed. 15.4 API Reference The Content Grabber API consists of two components, the Agent API Library and the Agent Proxy Library. The two libraries provide similar functionality, but the Agent Proxy library must connect to the Content Grabber agent service while the Agent API is a stand-alone library that can either connect to a Content Grabber agent service or work on its own. The Agent API contains agent definition classes that allows you to load, modify and save agents. You cannot add or remove commands from an agent, but you can modify properties of existing commands. The Proxy API can only run agents and cannot load and save agents. © 2014 - 2015 Sequentum 376 Content Grabber The Proxy API depends on no other files than the Proxy API assembly file. The Agent API depends on the following files which are found in the Content Grabber installation folder. These files must be copied to the folder of your executable program: AgentApi.dll AgentProxy.dll AjaxHook.dll msvcp120.dll msvcr120.dll SQLite.Interop.dll System.Data.SQLite.dll RunAgent.exe RunAgentProcess.exe BaseEngine.dll MySql.Data.dll Agents can also be run using simple web requests. Web requests can be sent from most applications and require no special Content Grabber API libraries. Agent API Reference Function Description AgentApi(string Instantiates a new API class with the agentNameOrPat specified agent and session ID. You can h, string specify the full path to an agent file or just sessionId) the name of the agent. If you only specify the agent name, Content Grabber will look for the agent in the default location for the user running the agent service. The default agent location for the local System account is: © 2014 - 2015 Sequentum Programming Interface C:\Users\Public\Documents\Content Grabber\Agents AgentApi(string Instantiates a new API class without a agentNameOrPat session. You can specify the full path to an h) agent file or just the name of the agent. If you only specify the agent name, Content Grabber will look for the agent in the default location for the user running the agent service. The default agent location for the local System account is: C:\Users\Public\Documents\Content Grabber\Agents void Connects to a Content Grabber agent Connect(string service. You can specify the server name or endPointAddress IP address and port number. The default ) connection string for a local service is: http://localhost:8000/ContentGrabber void Closes the connection to the Content CloseConnection Grabber agent service. () void StartAgent() Runs the agent specified when instantiating the API. The agent will run asynchronously. void Runs the agent with additional settings. The StartAgent(Agent agent will run asynchronously. See below Settings for more information about the © 2014 - 2015 Sequentum 377 378 Content Grabber settings) AgentSettings class. void StopAgent() Stops the agent if it is currently running. void Closes an agent session. When you close CloseAgentSessi an agent session, all data associated with on() that session is removed and you will not be able to retrieve status information about the agent that ran in this session. You can only close a session if an agent is not currently running in the session. You don't need to close a session. Session data will be removed automatically after the agent has completed running and the session timeout has elapsed. The default session timeout is 30 minutes, so by default session data will be removed automatically 30 minutes after the agent has completed. AgentStatus Returns status information about an agent GetAgentStatus( that has been run asynchronously. See ) below for more information about the AgentStatus class. DataTable Returns progress information in a GetAgentProgre DataTable about an agent running in ssAsDataTable() asynchronously. See below for more information about the information returned. DataTable Returns progress information as a JSON GetAgentProgre string about an agent running in ssAsJson() asynchronously. See below for more information about the information returned. DataTable Returns log data in a DataTable for an GetAgentLogAsD agent that has been run asynchronously. © 2014 - 2015 Sequentum Programming Interface ataTable(string This function does not return any data if agentNameOrPat logging is disabled or if logging is written to h, string file. See below for more information about sessionId) the information returned. string Returns log data as a JSON string for an GetAgentLogAsJ agent that has been run asynchronously. son(string This function does not return any data if agentNameOrPat logging is disabled or if logging is written to h, string file. See below for more information about sessionId) the information returned. DataSet Returns extracted data in a DataSet for an GetAgentExport agent that has been run asynchronously. DataAsDataSet() string Returns extracted data as a JSON string for GetAgentExport an agent that has been run asynchronously. DataAsJson() string Returns extracted data as an XML string for GetAgentExport an agent that has been run asynchronously. DataAsXml() RunAgentReturn Runs an agent synchronously and returns Json extracted data as a JSON string. The agent is always run in a session when the agent supports sessions, and the session is closed automatically after the agent has completed its run. RunAgentReturn Runs an agent synchronously and returns Xml extracted data as an XML string. The agent is always run in a session when the agent supports sessions, and the session is closed automatically after the agent has © 2014 - 2015 Sequentum 379 380 Content Grabber completed its run. RunAgentReturn Runs an agent synchronously and returns DataSet extracted data in a DataSet. The agent is always run in a session when the agent supports sessions, and the session is closed automatically after the agent has completed its run. RunAgentReturn Runs an agent synchronously with additional Json(AgentSettin settings and returns extracted data as a gs settings) JSON string. The agent is always run in a session when the agent supports sessions, and the session is closed automatically after the agent has completed its run. See below for more information about the AgentSettings class. RunAgentReturn Runs an agent synchronously with additional Xml(AgentSettin settings and returns extracted data as an gs settings) XML string. The agent is always run in a session when the agent supports sessions, and the session is closed automatically after the agent has completed its run. See below for more information about the AgentSettings class. RunAgentReturn Runs an agent synchronously with additional DataSet settings and returns extracted data in a DataSet. The agent is always run in a session when the agent supports sessions, and the session is closed automatically after the agent has completed its run. © 2014 - 2015 Sequentum Programming Interface See below for more information about the AgentSettings class. public Agent Returns the agent specified when GetAgent() instantiating the API class. public void Saves the specified agent. SaveAgent(Agent agent) Agent Proxy Reference The following functions are available in the proxy assembly: Function Description AgentProxy(string Instantiates a new proxy with the agentNameOrPath, specified agent and session ID. You can string sessionId) specify the full path to an agent file or just the name of the agent. If you only specify the agent name, Content Grabber will look for the agent in the default location for the user running the agent service. The default agent location for the local System account is: C:\Users\Public\Documents\Content Grabber \Agents AgentProxy(string Instantiates a new proxy without a agentNameOrPath) session. You can specify the full path to an agent file or just the name of the agent. If you only specify the agent name, Content Grabber will look for the © 2014 - 2015 Sequentum 381 382 Content Grabber agent in the default location for the user running the agent service. The default agent location for the local System account is: C:\Users\Public\Documents\Content Grabber \Agents void Connect(string Connects to the Content Grabber agent endPointAddress) service. You can specify the server name or IP address and port number. The default connection string for a local service is: http://localhost:8000/ContentGrabber void Closes the connection to the Content CloseConnection() Grabber agent service. void StartAgent() Starts the agent specified when instantiating the proxy. The agent will run asynchronously. void Starts the agent with additional settings. StartAgent(AgentSe The agent will run asynchronously. See ttings settings) below for more information about the AgentSettings class. void StopAgent() Stops the agent if it is currently running asynchronously. void Closes an agent session after the agent CloseAgentSession( has been run asynchronously. When you ) close an agent session, all data associated with that session is removed © 2014 - 2015 Sequentum Programming Interface and you will not be able to retrieve status information about the agent that ran in this session. You can only close a session if an agent is not currently running in the session. You don't need to close a session. Session data will be removed automatically after the agent has completed running and the session timeout has elapsed. The default session timeout is 30 minutes, so by default session data will be removed automatically 30 minutes after the agent has completed. AgentStatus Returns status information about an GetAgentStatus() agent that has been run asynchronously. See below for more information about the AgentStatus class. DataTable Returns progress information in a GetAgentProgressA DataTable about an agent running in sDataTable() asynchronously. See below for more information about the information returned. DataTable Returns progress information as JSON GetAgentProgressA about an agent running in sJson() asynchronously. See below for more information about the information returned. DataTable Returns log data in a DataTable for an GetAgentLogAsData agent that has been run asynchronously. © 2014 - 2015 Sequentum 383 384 Content Grabber Table(string This function does not return any data if agentNameOrPath, logging is disabled or if logging is written string sessionId) to file. See below for more information about the information returned. string Returns log data as JSON for an agent GetAgentLogAsJso that has been run asynchronously. This n(string function does not return any data if agentNameOrPath, logging is disabled or if logging is written string sessionId) to file. See below for more information about the information returned. DataSet Returns extracted data in a DataSet for GetAgentDataAsDat an agent that has been run aSet() asynchronously. string Returns extracted data as a JSON string GetAgentDataAsJso for an agent that has been run n() asynchronously. string Returns extracted data as an XML string GetAgentDataAsXml for an agent that has been run () asynchronously. RunAgentReturnJso Runs an agent synchronously and n returns extracted data as a JSON string. The agent is always run in a session when the agent supports sessions, and the session is closed automatically after the agent has completed its run. RunAgentReturnXml Runs an agent synchronously and returns extracted data as an XML string. The agent is always run in a session when the agent supports sessions, and the session is closed automatically after © 2014 - 2015 Sequentum Programming Interface the agent has completed its run. RunAgentReturnDat Runs an agent synchronously and aSet returns extracted data in a DataSet. The agent is always run in a session when the agent supports sessions, and the session is closed automatically after the agent has completed its run. RunAgentReturnJso Runs an agent synchronously with n(AgentSettings additional settings and returns extracted settings) data as a JSON string. The agent is always run in a session when the agent supports sessions, and the session is closed automatically after the agent has completed its run. See below for more information about the AgentSettings class. RunAgentReturnXml Runs an agent synchronously with (AgentSettings additional settings and returns extracted settings) data as an XML string. The agent is always run in a session when the agent supports sessions, and the session is closed automatically after the agent has completed its run. See below for more information about the AgentSettings class. RunAgentReturnDat Runs an agent synchronously with aSet(AgentSettings additional settings and returns extracted settings) data in a DataSet. The agent is always run in a session when the agent © 2014 - 2015 Sequentum 385 386 Content Grabber supports sessions, and the session is closed automatically after the agent has completed its run. See below for more information about the AgentSettings class. AgentSettings Reference The following agent settings can specified when running an agent: Property Description AgentLogLevel Log detail level. Set the log level to LogLevel None to turn off logging. bool IsLogHtml Logs the HTML of all loaded web pages to files. bool IsLogToFile Logs information to a file instead of a database. int Timeout Session timeout. All session data is removed automatically when the agent has completed and this timeout has elapsed. Dictionary<string, A list of input parameters. string> InputParameters GlobalData Any serializeable data object can be stored in this dictionary and will be available to all scripts in an agent. Notice that input parameters will eventually be stored in this dictionary © 2014 - 2015 Sequentum Programming Interface as well, so it doesn't matter if you use input parameters or global data to store your input data. AgentStatus Reference An agent can provide the following status information: Property Description AgentRunningStat The RunStatus can be one of the following us RunStatus values. Completed. The agent has completed successfully. Incomplete. The agent has completed, but stopped prematurely. The agent may have been stopped manually. Failed. The agent has completed, but a critical error occurred. Idle. The agent has never been run. Starting. The agent is starting. ExportingData. The agent is exporting data to the specified export target. Stopping. The agent is in the process of stopping. Restarting. The agent is restarting. This usually occurs when the agent needs to clear JavaScript memory leaks. ExportFailed. The agent completed, but failed to export data. int PageLoads © 2014 - 2015 Sequentum The number of page loads. This includes 387 388 Content Grabber AJAX calls triggered by agent actions. TimeSpan The amount of time the agent has run. Runtime int The number of times an agent command MissingElements could not find it's specified content where the content was not specified as optional. int PageErrors The number of page load errors. This includes errors loading content from AJAX calls that were triggered by agent actions. DateTime The time the agent started. StartTime Agent Progress Data An agent can provide progress data in a DataTable. The DataTable contains a DataRow for each web browser the agent is using to extract data. Each DataRow contains a status column and a description column. The progress data is the same information displayed when running an agent in the Content Grabber agent editor. Agent Log Data An agent can provide log data in a DataTable. The DataTable contains a log level column and a description column. A log level of 1 means an error, 2 means a warning and 3 means information. The log data is the same data you can view in the Content Grabber agent editor. Web Request Referenc The following functions are available when using web requests: Function Description StartAgent? Starts an agent that will run © 2014 - 2015 Sequentum Programming Interface agent={agentName asynchronously. OrPath} &sessionId={sessio This function supports both GET and nId} POST requests. If you need to specify a &sessionTimeout={ long list of input parameters you must sessionTimeout} use POST requests, since GET requests &logLevel={logLevel are limited in length. } &logHtml={isLogHt This function accepts the following ml} parameters: &logToFile={isLogT oFile} &pars={inputParam eters} agent (required): You can specify the full path to an agent file or just the name of the agent. If you only specify the agent name, Content Grabber will look for the agent in the default location for the user running the agent service. The default agent location for the local System account is: C:\Users\Public\Documents\Content Grabber \Agents sessionId (optional): The session will run in a session with the specified session ID. sessionTimeout (optional): If the agent is running in a session, this value specifies the number of minutes the agent data will be available before it's deleted. The default session timeout is 30 minutes. © 2014 - 2015 Sequentum 389 390 Content Grabber logLevel (optional): Log detail level. Set the log level to None to turn off logging. The default log level is None. Accepted values are None, Low, Medium and High. logHtml (optional): Logs the raw HTML of all web pages processed by the agent. The default value is False. logToFile (optional): Logs data to a file instead of a database table. The default value is False. Pars (optional): A JSON formatted list of input values that can be used by the agent. The JSON string should be URL encoded. RunAgentReturnJso Runs an agent synchronously and returns n? the extracted data as a JSON string. agent={agentName OrPath} The agent is always run in a session &timeout={timeout} when the agent supports sessions, and &logLevel={logLevel the session is closed automatically after } the agent has completed its run. &logHtml={isLogHt ml} This function supports both GET and &logToFile={isLogT POST requests. If you need to specify a oFile} long list of input parameters you must &pars={inputParam eters} use POST requests, since GET requests are limited in length. © 2014 - 2015 Sequentum Programming Interface This function accepts the following parameters: agent (required): You can specify the full path to an agent file or just the name of the agent. If you only specify the agent name, Content Grabber will look for the agent in the default location for the user running the agent service. The default agent location for the local System account is: C:\Users\Public\Documents\Content Grabber \Agents Timeout (optional): This maximum number of seconds an agent will run. When the timeout is reached, the agent will stop and close its session if it's run in a session. The default timeout is 30 seconds. logLevel (optional): Log detail level. Set the log level to None to turn off logging. The default log level is None. Accepted values are None, Low, Medium and High. logHtml (optional): Logs the raw HTML of all web pages processed by the agent. The default value is False. © 2014 - 2015 Sequentum 391 392 Content Grabber logToFile (optional): Logs data to a file instead of a database table. The default value is False. Pars (optional): A JSON formatted list of input values that can be used by the agent. The JSON string should be URL encoded. RunAgentReturnXml Runs an agent synchronously and returns ? the extracted data as an XML string. agent={agentName OrPath} The agent is always run in a session &timeout={timeout} when the agent supports sessions, and &logLevel={logLevel the session is closed automatically after } the agent has completed its run. &logHtml={isLogHt ml} This function supports both GET and &logToFile={isLogT POST requests. If you need to specify a oFile} long list of input parameters you must &pars={inputParam eters} use POST requests, since GET requests are limited in length. This function accepts the following parameters: agent (required): You can specify the full path to an agent file or just the name of the agent. If you only specify the agent name, Content Grabber will look for the agent in the default location for the user running the agent service. The default agent location for the local © 2014 - 2015 Sequentum Programming Interface System account is: C:\Users\Public\Documents\Content Grabber \Agents Timeout (optional): This maximum number of seconds an agent will run. When the timeout is reached, the agent will stop and close its session if it's run in a session. The default timeout is 30 seconds. logLevel (optional): Log detail level. Set the log level to None to turn off logging. The default log level is None. Accepted values are None, Low, Medium and High. logHtml (optional): Logs the raw HTML of all web pages processed by the agent. The default value is False. logToFile (optional): Logs data to a file instead of a database table. The default value is False. Pars (optional): A JSON formatted list of input values that can be used by the agent. The JSON string should be URL encoded. StopAgent? Stops the agent if it is currently running agent={agentName asynchronously. OrPath} © 2014 - 2015 Sequentum 393 394 Content Grabber &sessionId={sessio This function supports only GET nId}" requests. CloseAgentSession Closes an agent session after the agent ? has been run asynchronously. When you agent={agentName close an agent session, all data OrPath} associated with that session is removed &sessionId={sessio and you will not be able to retrieve status nId} information about the agent that ran in this session. You can only close a session if an agent is not currently running in the session. You don't need to close a session. Session data will be removed automatically after the agent has completed running and the session timeout has elapsed. The default session timeout is 30 minutes, so by default session data will be removed automatically 30 minutes after the agent has completed. This function supports only GET requests. GetAgentStatus? Returns status information about an agent={agentName agent that has been run asynchronously. OrPath} See below for more information about &sessionId={sessio the AgentStatus class. nId} This function supports only GET requests. © 2014 - 2015 Sequentum Programming Interface GetAgentProgressA Returns progress information as JSON sJson? about an agent running in a agent={agentName asynchronously. See below for more OrPath} information about the information &sessionId={sessio returned. nId} This function supports only GET requests. GetAgentLogAsJso Returns log data as JSON for an agent n? that has been run asynchronously. This agent={agentName function does not return any data if OrPath} logging is disabled or if logging is written &sessionId={sessio to file. See below for more information nId} about the information returned. This function supports only GET requests. DataSet Returns extracted data in a DataSet for GetAgentDataAsDat an agent that has been run aSet() asynchronously. This function supports only GET requests. GetAgentDataAsJso Returns extracted data as a JSON string n? for an agent that has been run agent={agentName asynchronously. OrPath} &sessionId={sessio This function supports only GET nId} requests. GetAgentDataAsXml Returns extracted data as an XML string ? for an agent that has been run © 2014 - 2015 Sequentum 395 396 Content Grabber agent={agentName asynchronously. OrPath} &sessionId={sessio This function supports only GET nId} requests. © 2014 - 2015 Sequentum Index Index -&&OPEN 200 -((percentage) 167 -..NET class library 288 .NET functions 283 .NET v4.0 client profile framework .NET version 4 341 350 -???Headers= 152 ??Post= 152 -332-bit library 350 -AAction Command Error Handling 208 Action Commands 137 Action Completed 177 Action Configuration 163 Action Events 167 Action Type 137 Action Types 164 Activities 137 Activity 177 Activity Status 137 Activity Type 137 Add Force Refresh Header 137 Add Horizontally 239 Add Horizontally Export Method 239 Add to Agent Template Library 204 Add Vertically 239 © 2014 - 2015 Sequentum 397 Add Vertically Export Method 239 Address Bar 27 Agent 185, 234 Agent API 375 Agent API Library 375 Agent API Reference 375 Agent command 145 Agent Configuration 27 Agent Data 234 Agent Debug Log window 210 Agent Details 273 Agent Development 24 Agent Error Handling 208 Agent Explorer 24, 27 Agent Export Data 354 Agent Initialization Scripts 302 Agent Log Data 354, 375 Agent Progress Data 375 Agent Proxy 375 Agent Proxy Library 375 Agent Proxy Reference 375 Agent Settings 45, 259 Agent Templates 204 AgentSettings 354 AgentSettings Reference 375 AgentStatus Reference 375 AJAX 177 Ajax Completed Timeout 137 Ajax Content Render Delay 137 Ajax Content Render Delay After Scroll 137 Ajax Start Timeout 137 Always Merge Data 239 Always send notifications when an agent completes a run 210 alwaysNotify 210 Anonymous Web Scraping 252 API 253, 341, 350 API Reference 354, 375 API Restrictions 341 Application Settings 29 Assembly References 145, 289, 350 Asynchronous Timeout 137 Attribute Name 135 Author Details 145 Auto Detect File Extension 114 Automatic activity discovery 177 Automatic CAPTCHA Configuration 253 398 Content Grabber -BBasic Windows Authentication 200 Block Known Ad Servers 137 blur 167 bool HasColumn(string tableName, string columnName) 290 bool HasTable(string tableName) 290 bool Read() 290 Browser 165 Browser action type 165 Browser Activities 177, 267 Browser Activity 267 Browser Activity Screen 177 Browser Mode 137, 165 Browser Target 165 Building a Self-Contained Agent 270 Building a Web Application 353 Building Self-Contained Agents 270 Building Your First Agent 30 Bypass CAPTCHA 253 bypasscaptcha.com 326 -CC# 283, 284, 302 Calculated Value 134, 155 capitalize_words 284 CAPTCHA 234, 253, 326 CAPTCHA Blocking 253 CAPTCHA images 234 CAPTCHA OCR Scripts 253 CAPTCHA Protection 234 Capture Commands 109, 316 change 167 Change Export Target 42 Changing the Default Data Structures 239 Character Encoding 238 CityData 223 Cleanup External Session Data 145, 373 Clear Cookies 137 Clear Session 137 click 167 click() 167 Close Browser 192 CloseSession 373 Command Library 204 Command Line Arguments 279 Command object 290 Command Properties 191 Command scripts 283 Command Templates 204 Command-line parameters 230 Company Details 273 Completed 177 Configuration of an Action Command 137 Configure Agent Command 24 Consuming Data 223 Consuming Data in a Script 223 Container Command Properties 191 Container Commands 108 Content Grabber agent service 353 Content Grabber API 230 Content Grabber Basics 24 Content Grabber Message window 24, 33 Content Grabber Premium Edition 281 Content Grabber Runtime 7 Content Transformation Scripts 283, 316 ContentTransformationArguments 316 continue 279 Conversion Script 123 Convert Document to HTML 330 Convert Document to HTML Scripts 330 Convert image to text 253, 326 Convert to HTML 123 Convert to Text 114, 129 ConvertDocumentToHtmlArguments 330 ConvertImageToTextArguments 326 Cruise Direct example 33 CSV 42, 223, 236 CSV Data Provider 223 Current 165 Custom Scripts 335 Customize Design 273 Customize Self-Contained Agent 273 Customizing the User Interface 273 -DData 219 Data Consumer 136, 152, 160 Data Distribution 314 Data Distribution Scripts 314 Data Export 306 Data Export script 283 © 2014 - 2015 Sequentum Index Data Export Scripts 306 Data Input 321 Data Input Scripts 321 Data List 185, 186 Data List command 223 Data Missing Action 186 Data output 42 Data Outputs 24 Data Provider 145, 186 Data Providers 223 Data Source 236 Data Value 136, 234 Database 223 Database Connections 145, 219 Database Data Providers 223 database types 42 Database Utilities 290 DataDeliveryArguments 314 DataExportArguments 306 DataProviderArguments 321, 335 DataRow[] GetTableColumns(string tableName) 290 DataTable LoadCsvFile(string path) 290 DataTable LoadCsvFile(string path, char separator) 290 DataTable LoadCsvFileFromDefaultInputFolder(string filename) 290 DataTable LoadCsvFileFromDefaultInputFolder(string filename, char separator) 290 DateTime GetDateTimeValue(int columnIndex) 290 Death by CAPTCHA 253 deathbycaptcha.com 326 Debug 43 Debug Disabled 210 Debugger 43 debugging 43 Default 165, 186, 239 Default Events 137 Default Export Method 239 delay(milliseconds) 167 DepartureCity 223 Development Environments 219 Directory 145 Discover action 137, 163 Discover activities 267 Discover Activities box 177 Discover Activity Timeout 137 Discover First Activity Timeout 137 distributed royalty-free 270 Distributing Data 219, 249 © 2014 - 2015 Sequentum 399 Distributing Your Application 353 Docx To HTML 214 Docx To HTML Converter 214 down arrow 167 down arrow key 167 Download Document 123, 200 Download Document Command 160 Download Image 114 Download Screenshot 129 Download Video 119 Dynamic Page Change 177 Dynamic Page Changes 177 Dynamic Websites 17 -EEditor Action 137 Editor Index 188 Email and FTP distribution 250 Email Delivery 250 Email Distribution 250 Email Notification 145 Error Handling 137, 207 Error Handling In An Execute Script Command 208 Error Logs 210 Error logs and notifications 207 Error Notifications 210 Error Retry Count 137 Events 137 Excel 42 exec(JavaScript) 167 executable file 277 Execute Script 192, 253, 335 Execute Script Commands 208 Exit Codes 279 Exit Command 137, 208, 253 Expert Layout 29 Export Agent 270 Export Converted Document 123 Export Converted Image 114, 129 export data 220 Export last data segment only 220 Export Rows in Parent Columns 239 Export Rows to Parent Columns 239 Export Target 145, 238 Export URL 114 Exporting and Importing Template Libraries 300 Exporting and Importing User Libraries 204 400 Content Grabber Exporting Data 235 Exporting Data to MS Access 236 Exporting Data to Oracle 236 Exporting Data with Scripts 249 Exporting Downloaded Images and Files 238 Extension Methods 290 Extension scripts 283 external assemblies 289 External Data 219, 220 Externals 177 Extracting Content From a Converted Document 214 Extracting Data From Complex Websites 12 Extracting Data From Non-HTML Content 12 Extracting Data From Websites Using Deterrents 12 Extracting Huge Amounts of Data 12 Extracting URLs 111 -FFile Capture 114 File Download 177, 200 File Download Action 200 File Name 114, 119, 123, 129 File Name Attribute 114 File Name Transformation Script 114 File URL 114, 119, 123 Fire Event 163 Fire Events 164 Fixed File Extension 114 Flash 119 focus 167 Follow Pagination command 33 Frame Completed Timeout 137 Frames 177 Free Private Proxy Switch Account 259 FTP Delivery 250 FTP Distribution 250 Full Dynamic Browser 165, 252 -GGateway 259 Generate Transformation 40 Group 189, 190 Group Command Properties 190 Group Commands 189 Group Commands in Page Area 191 Group in Page Area 189, 191 GUI window 230 GUID 238 Guid GetGuidValue(int columnIndex) 290 -HHandle Web Browser Dialog 200 Handle Web Browser Dialogs 160 How to Configure Proxy Servers 259 HTML 111 HTML Attribute 111, 114, 129 HTML Capture 129 HTML Column 210 HTML Content 16 HTML Parser 165, 200 html_decode 284 HTTP proxy servers 252, 259 -IICommand 290 ICommand GetNewCommand() 290 IConnection 290 IDataReader GetDataReader() 290 IExportData 306 IExportReader 306 If Selection Exists 253 If-Modified-Since header 137 iframes 177 Ignore Command when Data is Missing 186 Image Conversion Script 114, 129 Image OCR Scripts 326 imdb 239 Import Proxies 259 Importing Proxy Servers 259 Improving Agent Performance and Reliability 264 Initialization Script 145 Inner HTML 111 Input data 219 Input Parameters 145, 230 Input Parameters with a GUI Window 230 Insert Template from Library 204 Installing Content Grabber 22 int ClearAllBrowserCookies() 290 int ClearBrowserCookies(string url) 290 © 2014 - 2015 Sequentum Index int GetIntValue(int columnIndex) 290 Interactive 177 Interactive mode 177 internal data 219, 220 Internal Database 145, 264 Internet Explorer 9 341 Introduction 7 IP blocking 253 IP Blocking & Proxy Servers 259 IReader 290 IReader ExecuteReader(string sql, params object[] pars) 290 IReader GetNewReader(string sql) 290 IsParentCommandActionError 137, 208 -JJQuery 137 -Kkey steps for building a web scraping agent keydown 167 keypress 167 keyup 167 -LLimit Number of Scrolls 137 line_breaks 284 List 185 list command 108 List Commands 185 List Count 185 List of Numeric Links 155 List of Text Links 155 List selection mode 33 List Selections 98 List Start Index 185 List/sequence of numeric links 155 List/sequence of text links 155 Lists 98 Load URL 164 Loading 177 log_html 279 log_level 279 log_to_file 279 © 2014 - 2015 Sequentum 30 401 long ExecuteCount() 290 long ExecuteCount(string sql) 290 long ExecuteCount(string sql, params object[] pars) 290 Low page number threshold 210 -MMain Window 27 manual and automatic data extraction 253 Manual CAPTCHA 253 Manual CAPTCHA Configuration 253 Max Inner Prospects 145 Max Memory Usage 145 Max Tree Expansions 145 Max. Page Number Command 155 Max. Repeats 151 Maximum Number of Scrolls 137 Microsoft Excel 236 Microsoft Excel 2003 236 Microsoft Excel 2007 236 Microsoft Word 214 mousedown 167 mouseup 167 Movie List 239 mp4 119 Multiple Columns 239 My Templates 204 MySQL 42, 223, 236, 238 MySQL Character Encoding 238 -NName Command 239 Navigate Directly to Page 155 Navigate Link 200, 316 Navigate Link command 151 Navigate Pagination Command 155 Navigate URL 152, 234 Navigate URL Command 152 New 165 New Command 111 Next Page 155 Next Page link/button 155 No Action 163, 164 No Error Handling 208 no_ui 279 None 236, 239 402 Content Grabber -Oobject ExecuteScalar() 290 object ExecuteScalar(string sql) 290 object ExecuteScalar(string sql, params object[] pars) 290 object GetConnection() 290 object GetFieldValue(int columnIndex) 290 OCR 114, 129 OCR script 253 OCR scripts 253 OCR service 234 ogg 119 OleDB 42, 223, 236 Only Execute Action on Value Change 160 Optimized Dynamic Browser 165 Optimizing Agent Commands 267 Optimizing Browser Activities 267 Optional Data 186 Oracle 42, 223, 236 Oracle OleDB 236 Original File Name 114 Other Commands 192 -PPage Attribute 135 Page Completed Timeout 137 Page Count 155 Page Interactive Timeout 137 Page Load 177 Page Load Timeout 137 Page Loading 177 Page Number Attribute 155 Page Number Transformation Script Page Refresh 177 Page URL 135 PageNumberAttributeName 155 Pages 259 Pagination 33 Pagination Command Link 155 Pagination Type 155 Parent 165 Password 200 Pause Agent 253 Pause Before Restarting 145 PDF 214 155 PDF To HTML 214 PDF-to-HTML document converter Polipo 259 Post-completion events 177 Posting Data and Custom Header Premium Edition 7, 281 Private Proxy Switch 259 Process Restart Condition 145 Professional Edition 7 Programming Interface 341 promotional message 270 Proxy 145 Proxy API 375 proxy assembly 354 Proxy Configuration 145 Proxy Functions 354 Proxy Gateway 259 Proxy List 259 Proxy source 259 Proxy switch 259 Proxy Verification 259 Public Provider 276 214 152 -RRefresh Document 192 Regular Expressions 18, 284 Regular Expressions Reference Guide 284 Remove Internal Session Data Immediately 145, 373 Remove skipped proxies 259 Repeat Link Action 151, 167 Required 137 Restart Agent and Continue 208 Restart and Continue Agent 137 Restart On Memory Usage 145 Restart On Time Interval 145 Restart Time Interval 145 Retry Command 137, 208, 253 Retry Count 208 Return 167 Return key 167 Reuse Browser Tab 137 Rotation interval 259 Run 46 Run Agent 46 Run Interactively 354 Run Your Agent 46 © 2014 - 2015 Sequentum Index RunAgent.exe 279 Running Agents from the Command-Line Running an Agent in a Session 373 Runtime 7 Runtime Input Parameters 230 279 -SSchedule 45 Scheduling 45, 145 Script 223 Script Data Provider 223 Script Languages 284 Script Library 288 Script Template Library 300 Script type 335 Script Utilities 290 Scripting 145, 283 Scripting Data Distribution 251 Scroll to End of Page 167 Scroll Until End of Page 137 scroll() 167 scroll(percentage) 167 Select the Content to Capture 33 Selecting an Export Target 236 Selection 111, 114, 223 Selection Data Provider 223 Selection Techniques 18 self-contained agent 277 Self-Contained Agent Files 270 Self-Contained Agents 145 Send notification on critical errors 210 Send notification on low number of page loads sendkey(keycode) 167 sendtext() 167 sendtext(text) 167 Separate 239 Separate Export Method 239 Separator 239 Server Edition 7 Session Data Cleanup 373 session_id 279 session_timeout 279 Sessions 145, 373 Set Form Field 109, 160, 185, 200, 234 Set Form Field Command 160 Set Select Box Text 160 Set Value 160 © 2014 - 2015 Sequentum 210 403 setinputtext() 167 setinputtext(text) 167 shared script library 288 Simple 223 Simple Data Provider 223, 276 Simplified Layout 29 Single Column 239 Single Next Link 155 Single Next Page Link 155 Skip unavailable proxies 259 Specifying Input Parameters on the Command Line 230 SQL Server 42, 223, 236 SQL Server Express 264 SQLite 219 SQLite database 220 src 111 Start 279 Start command 279 Start URL 27, 32 Static Browser 165 static DataTable ToDataTable(this List<string> dataRows, string columnName) 290 static DataTable ToDataTable(this List<string> dataRows, string columnName, string stringFormat) 290 static DataTable ToDataTable(this string stringValue, string columnName) 290 static DataTable ToDataTable(this string[] dataRows, string columnName) 290 static DataTable ToDataTable(this string[] dataRows, string columnName, string stringFormat) 290 Static Parser 265 Stop Agent 137, 208 string GetStringValue(int columnIndex) 290 strip_html 284 Sub-Container Export Method 239 Sub-container Export Rules 239 Sub-Container Merge Method 239 Submit 200 Submit button 200 Support Sessions 145, 373 -TTag Text 111 Target Browser 137, 165 Templates 273 Test Your Agent 43 404 Content Grabber Text 111 The Internal Database 220 third-party CAPTCHA recognition service 253 Time 259 Timeout 137, 177 Timeouts 137 Timing 177 TNS 236 to_lower 284 to_upper 284 ToDataTable 321 TOR 259 Transform Page 192 transformation 40 Transformation Script 40 Tree Selections 145 trim 284 Type GetFieldType(int columnIndex) 290 -Uunbind(Event) 167 Unique Database Names on Your Network 219 Unique ID 111 Upgrading a Self-Contained Agent 270, 277 Uploading File 200 URL 163 URL Column 210 URL Match 137 URL Parameters 152 url_decode 284 Use default events 167 Use Default Input 145 Use Max. Page Number Command 155 Use Next Pagination Set 155 Use Original File Name 114 Use Page Count 155 User Agent 145 User Interface 145 Username 200 Using a Self-Contained Agent 277 Using Data Input 223 Using Input Data 276 Using Input Parameters 230 Using Input Parameters in a Script 230 Using MS SQL Server as Internal Database 264 Using the Content Grabber Agent Service 354 Using the Content Grabber runtime 281 Using the Free TOR Proxy Network 259 Using the Static Parser Browser 265 UTF-8 238 -VValue Command 239 VB.NET 283, 284, 302 Verification timeout 259 Verify proxy before use 259 Vidalia 259 Video URL 119 View Export Data 43 View Log 210 view_browser 279 Visual Studio 288, 350 Visual Studio Configuration 350 Visual Studio Solution Template 288 Visual Web Ripper 19 void AddParameter(string pName, CaptureDataType type) 290 void AddParameterWithValue(string pName, object pValue) 290 void Close() 290 void CloseDatabase() 290 void DropTable(string tableName) 290 void ExecuteCommandLine(string programFilePath, string arguments) 290 void ExecuteCommandLine(string programFilePath, string inputFilePath, string outputFilePath, string options) 290 void ExecuteNonQuery() 290 void ExecuteNonQuery(string sql) 290 void ExecuteNonQuery(string sql, params object[] pars) 290 void Lock() 290 void Notify(bool alwaysNotify = true) 210 void Notify(string message, bool alwaysNotify = false) 210 void OpenDatabase() 290 void Release() 290 void ResetBrowserSession() 290 void SetParameterValue(string pName, object pValue) 290 void SetSql(string sql) 290 void TruncateTable(string tableName) 290 © 2014 - 2015 Sequentum Index -WWait XPath 137 web applications 353 Web Element Content 111 Web Element List 185, 188 Web Element Selection 177 Web Forms 200 Web requests 354, 365 Web Scraping Limitations 12 Web Selection Properties 191 Web-Scraping Techniques 15 Website Login Forms 200 What is Web Scraping? 10 Windows service 354, 365 Windows Task Scheduler 45 windowscroll 167 windowscroll() 167 Worker Threads 151, 152 -XXML 42, 236 XPath 18, 223 XPath Factory 145 © 2014 - 2015 Sequentum 405 406 Content Grabber Endnotes 2... (after index) © 2014 - 2015 Sequentum Back Cover