The Web Robots Pages

Transcription

Web Robots are programs that traverse the Web automatically. Some people call them Web
Wanderers, Crawlers, or Spiders. These pages have further information about these Web Robots.
The Web Robots FAQ
Frequently Asked Questions about Web Robots, from Web
users, Web authors, and Robot implementors.
Robots Exclusion
Find out what you can do to direct robots that visit your Web
site.
A List of Robots
A database of currently known robots, with descriptions and
contact details.
The Robots Mailing List
An archived mailing list for discussion of technical aspects
of designing, building, and operating Web Robots.
Articles and Papers
Background reading for people interested in Web Robots.
Related Sites
Some references to other sites that concern Web Robots.
The Web Robots FAQ...
These are frequently asked questions about Web robots.
Send suggestions and comments to Martijn Koster. This information is in the public domain.
Table of Contents
1. About WWW robots
❍ What is a WWW robot?
❍ What is an agent?
❍ What is a search engine?
❍ What kinds of robots are there?
❍ So what are Robots, Spiders, Web Crawlers, Worms, Ants?
❍ Aren't robots bad for the web?
❍ Are there any robot books?
❍ Where do I find out more about robots?
2. Indexing robots
❍ How does a robot decide where to visit?
❍ How does an indexing robot decide what to index?
❍ How do I register my page with a robot?
3. For Server Administrators
❍ How do I know if I've been visited by a robot?
❍ I've been visited by a robot! Now what?
❍ A robot is traversing my whole site too fast!
❍ How do I keep a robot off my server?
4. Robots exclusion standard
❍ Why do I find entries for /robots.txt in my log files?
❍ How do I prevent robots scanning my site?
❍ Where do I find out how /robots.txt files work?
❍ Will the /robots.txt standard be extended?
❍ What if I cannot make a /robots.txt?
5. Availability
❍ Where can I use a robot?
❍ Where can I get a robot?
❍ Where can I get the source code for a robot?
❍ I'm writing a robot, what do I need to be careful of?
❍ I've written a robot, how do I list it?
About Web Robots
What is a WWW robot?
A robot is a program that automatically traverses the Web's hypertext structure by retrieving a
document, and recursively retrieving all documents that are referenced.
Note that "recursive" here doesn't limit the definition to any specific traversal algorithm; even if a
robot applies some heuristic to the selection and order of documents to visit and spaces out requests
over a long space of time, it is still a robot.
Normal Web browsers are not robots, because they are operated by a human, and don't automatically
retrieve referenced documents (other than inline images).
Web robots are sometimes referred to as Web Wanderers, Web Crawlers, or Spiders. These names are
a bit misleading as they give the impression the software itself moves between sites like a virus; this is
not the case: a robot simply visits sites by requesting documents from them.
What is an agent?
The word "agent" is used with many different meanings in computing these days. Specifically:
Autonomous agents
are programs that do travel between sites, deciding themselves when to move and what to do.
These can only travel between special servers and are currently not widespread in the Internet.
Intelligent agents
are programs that help users with things, such as choosing a product, or guiding a user through
form filling, or even helping users find things. These have generally little to do with
networking.
User-agent
is a technical name for programs that perform networking tasks for a user, such as Web
User-agents like Netscape Navigator and Microsoft Internet Explorer, and Email User-agents
like Qualcomm Eudora, etc.
What is a search engine?
A search engine is a program that searches through some dataset. In the context of the Web, the word
"search engine" is most often used for search forms that search through databases of HTML
documents gathered by a robot.
What other kinds of robots are there?
Robots can be used for a number of purposes:
● Indexing
● HTML validation
● Link validation
● "What's New" monitoring
● Mirroring
See the list of active robots to see what robot does what. Don't ask me -- all I know is what's on the
list...
So what are Robots, Spiders, Web Crawlers, Worms, Ants?
They're all names for the same sort of thing, with slightly different connotations:
Robots
the generic name, see above.
Spiders
same as robots, but sounds cooler in the press.
Worms
same as robots, although technically a worm is a replicating program, unlike a robot.
Web crawlers
same as robots, but note WebCrawler is a specific robot.
WebAnts
distributed cooperating robots.
Aren't robots bad for the web?
There are a few reasons people believe robots are bad for the Web:
● Certain robot implementations can overload networks and servers (and have done so in the past). This
happens especially with people who are just starting to write a robot; these days there is
sufficient information on robots to prevent some of these mistakes.
● Robots are operated by humans, who make mistakes in configuration, or simply don't consider
the implications of their actions. This means people need to be careful, and robot authors need
to make it difficult for people to make mistakes with bad effects.
● Web-wide indexing robots build a central database of documents, which doesn't scale too well
to millions of documents on millions of sites.
But at the same time the majority of robots are well designed, professionally operated, cause no
problems, and provide a valuable service in the absence of widely deployed better solutions.
So no, robots aren't inherently bad, nor inherently brilliant, and need careful attention.
Are there any robot books?
Yes:
Internet Agents: Spiders, Wanderers, Brokers, and Bots by Fah-Chun Cheong.
This book covers Web robots, commerce transaction agents, Mud agents, and a few others. It
includes source code for a simple Web robot based on top of libwww-perl4.
Its coverage of HTTP, HTML, and Web libraries is a bit too thin to be a "how to write a web
robot" book, but it provides useful background reading and a good overview of the
state-of-the-art, especially if you haven't got the time to find all the info yourself on the Web.
Published by New Riders, ISBN 1-56205-463-5.
Bots and Other Internet Beasties by Joseph Williams
I haven't seen this myself, but someone said: The Williams book 'Bots and other Internet
Beasties' was quite disappointing. It claims to be a 'how to' book on writing robots, but my
impression is that it is nothing more than a collection of chapters, written by various people
involved in this area and subsequently bound together.
Published by Sam's, ISBN: 1-57521-016-9
Web Client Programming with Perl by Clinton Wong
This O'Reilly book is planned for Fall 1996, check the O'Reilly Web Site for the current status.
It promises to be a practical book, but I haven't seen it yet.
A few others can be found on The Software Agents Mailing List FAQ.
Where do I find out more about robots?
There is a Web robots home page at: http://info.webcrawler.com/mak/projects/robots/robots.html
While this is hosted at one of the major robots' sites, it is an unbiased and reasonably comprehensive
collection of information which is maintained by Martijn Koster <m.koster@webcrawler.com>.
Of course the latest version of this FAQ is there.
You'll also find details and an archive of the robots mailing list, which is intended for technical
discussions about robots.
Indexing robots
How does a robot decide where to visit?
This depends on the robot; each one uses different strategies. In general they start from a historical list
of URLs, especially of documents with many links elsewhere, such as server lists, "What's New"
pages, and the most popular sites on the Web.
Most indexing services also allow you to submit URLs manually, which will then be queued and
visited by the robot.
Sometimes other sources for URLs are used, such as scanners through USENET postings, published
mailing list archives, etc.
Given those starting points a robot can select URLs to visit and index, and to parse and use as a source
for new URLs.
How does an indexing robot decide what to index?
If an indexing robot knows about a document, it may decide to parse it, and insert it into its database.
How this is done depends on the robot: Some robots index the HTML Titles, or the first few
paragraphs, or parse the entire HTML and index all words, with weightings depending on HTML
constructs, etc. Some parse the META tag, or other special hidden tags.
We hope that as the Web evolves more facilities become available to efficiently associate meta data
such as indexing information with a document. This is being worked on...
How do I register my page with a robot?
You guessed it, it depends on the service :-) Most services have a link to a URL submission form on
their search page.
Fortunately you don't have to submit your URL to every service by hand: Submit-it <URL:
http://www.submit-it.com/> will do it for you.
For Server Administrators
How do I know if I've been visited by a robot?
You can check your server logs for sites that retrieve many documents, especially in a short time.
If your server supports User-agent logging you can check for retrievals with unusual User-agent header
values.
Finally, if you notice a site repeatedly checking for the file '/robots.txt' chances are that is a robot too.
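One quick way to spot such visits is to summarise requests for '/robots.txt' in your access log. The sketch below is not part of the original FAQ; it assumes a Common Log Format or combined-format file called access.log, so adjust the file name and field positions for your own server.

from collections import Counter

hits = Counter()
with open("access.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        parts = line.split('"')
        # In combined log format the request line is the second quoted
        # field and the User-agent (if logged) is the sixth quoted field.
        if len(parts) > 1 and "/robots.txt" in parts[1]:
            host = line.split(" ", 1)[0]
            agent = parts[5] if len(parts) > 5 else "unknown"
            hits[(host, agent)] += 1

for (host, agent), count in hits.most_common(20):
    print(count, host, agent)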
I've been visited by a robot! Now what?
Well, nothing :-) The whole idea is they are automatic; you don't need to do anything.
If you think you have discovered a new robot (i.e. one that is not listed on the list of active robots), and
it does more than sporadic visits, drop me a line so I can make a note of it for future reference. But
please don't tell me about every robot that happens to drop by!
A robot is traversing my whole site too fast!
This is called "rapid-fire", and people usually notice it if they're monitoring or analysing an access log
file.
First of all check if it is a problem by checking the load of your server, and monitoring your server's
error log, and concurrent connections if you can. If you have a medium or high performance server, it
is quite likely to be able to cope with a load of even several requests per second, especially if the visits
are quick.
However you may have problems if you have a low performance site, such as your own desktop PC or
Mac you're working on, or you run low performance server software, or if you have many long
retrievals (such as CGI scripts or large documents). These problems manifest themselves in refused
connections, a high load, performance slowdowns, or in extreme cases a system crash.
If this happens, there are a few things you should do. Most importantly, start logging information:
when did you notice, what happened, what do your logs say, and what are you doing in response, etc.;
this helps when investigating the problem later. Secondly, try and find out where the robot came from, what IP
addresses or DNS domains, and see if they are mentioned in the list of active robots. If you can
identify a site this way, you can email the person responsible, and ask them what's up. If this doesn't
help, try their own site for telephone numbers, or mail postmaster at their domain.
If the robot is not on the list, mail me with all the information you have collected, including actions on
your part. If I can't help, at least I can make a note of it for others.
How do I keep a robot off my server?
Read the next section...
Robots exclusion standard
Why do I find entries for /robots.txt in my log files?
They are probably from robots trying to see if you have specified any rules for them using the
Standard for Robot Exclusion, see also below.
If you don't care about robots and want to prevent the messages in your error logs, simply create an
empty file called robots.txt in the root level of your server.
Don't put any HTML or English language "Who the hell are you?" text in it -- it will probably never
get read by anyone :-)
How do I prevent robots scanning my site?
The quick way to prevent robots visiting your site is to put these two lines into the /robots.txt file on
your server:
User-agent: *
Disallow: /
but it's easy to be more selective than that.
Where do I find out how /robots.txt files work?
You can read the whole standard specification but the basic concept is simple: by writing a structured
text file you can indicate to robots that certain parts of your server are off-limits to some or all robots.
It is best explained with an example:
# /robots.txt file for http://webcrawler.com/
# mail webmaster@webcrawler.com for constructive criticism
User-agent: webcrawler
Disallow:
User-agent: lycra
Disallow: /
User-agent: *
Disallow: /tmp
Disallow: /logs
The first two lines, starting with '#', specify a comment.
The first paragraph specifies that the robot called 'webcrawler' has nothing disallowed: it may go
anywhere.
The second paragraph indicates that the robot called 'lycra' has all relative URLs starting with '/'
disallowed. Because all relative URL's on a server start with '/', this means the entire site is closed off.
The third paragraph indicates that all other robots should not visit URLs starting with /tmp or /logs.
Note the '*' is a special token, meaning "any other User-agent"; you cannot use wildcard patterns or
regular expressions in either User-agent or Disallow lines.
Two common errors:
● Wildcards are _not_ supported: instead of 'Disallow: /tmp/*' just say 'Disallow: /tmp'.
● You shouldn't put more than one path on a Disallow line (this may change in a future version of
the spec)
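If you have Python handy, you can sanity-check rules like the example above before deploying them. The following sketch is not part of the original FAQ; it feeds the example records to the standard library's urllib.robotparser and asks which URLs each robot may fetch.

from urllib.robotparser import RobotFileParser

rules = """\
# /robots.txt file for http://webcrawler.com/
User-agent: webcrawler
Disallow:

User-agent: lycra
Disallow: /

User-agent: *
Disallow: /tmp
Disallow: /logs
"""

rp = RobotFileParser()
rp.parse(rules.splitlines())

print(rp.can_fetch("webcrawler", "/tmp/index.html"))  # True: nothing is disallowed
print(rp.can_fetch("lycra", "/index.html"))           # False: the whole site is closed
print(rp.can_fetch("somebot", "/tmp/index.html"))     # False: /tmp is off-limits
print(rp.can_fetch("somebot", "/index.html"))         # True: everything else is fine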
Will the /robots.txt standard be extended?
Probably... there are some ideas floating around. They haven't made it into a coherent proposal
because of time constraints, and because there is little pressure. Mail suggestions to the robots mailing
list, and check the robots home page for work in progress.
What if I can't make a /robots.txt file?
Sometimes you cannot make a /robots.txt file, because you don't administer the entire server. All is
not lost: there is a new standard for using HTML META tags to keep robots out of your documents.
The basic idea is that if you include a tag like:
<META NAME="ROBOTS" CONTENT="NOINDEX">
in your HTML document, that document won't be indexed.
If you do:
<META NAME="ROBOTS" CONTENT="NOFOLLOW">
the links in that document will not be parsed by the robot.
Availability
Where can I use a robot?
If you mean a search service, check out the various directory pages on the Web, such as Netscape's
Exploring the Net, or try one of the Meta search services such as MetaSearch.
Where can I get a robot?
Well, you can have a look at the list of robots; I'm starting to indicate their public availability slowly.
In the meantime, two indexing robots that you should be able to get hold of are Harvest (free), and
Verity's.
Where can I get the source code for a robot?
See above -- some may be willing to give out source code.
Alternatively check out the libwww-perl5 package, which has a simple example.
I'm writing a robot, what do I need to be careful of?
Lots. First read through all the stuff on the robot page, then read the proceedings of past WWW
Conferences, and the complete HTTP and HTML specs. Yes, it's a lot of work :-)
I've written a robot, how do I list it?
Simply fill in a form you can find on The Web Robots Database and email it to me.
Robots Exclusion
Sometimes people find they have been indexed by an indexing robot, or that a resource discovery
robot has visited part of a site that for some reason shouldn't be visited by robots.
In recognition of this problem, many Web Robots offer facilities for Web site administrators and
content providers to limit what the robot does. This is achieved through two mechanisms:
The Robots Exclusion Protocol
A Web site administrator can indicate which parts of the site should not be visited by a robot,
by providing a specially formatted file on their site, in http://.../robots.txt.
The Robots META tag
A Web author can indicate if a page may or may not be indexed, or analysed for links, through
the use of a special HTML META tag.
The remainder of this page provides full details on these facilities.
Note that these methods rely on cooperation from the Robot, and are by no means guaranteed to work
for every Robot. If you need stronger protection from robots and other agents, you should use
alternative methods such as password protection.
The Robots Exclusion Protocol
The Robots Exclusion Protocol is a method that allows Web site administrators to indicate to visiting
robots which parts of their site should not be visited by the robot.
In a nutshell, when a Robot visits a Web site, say http://www.foobar.com/, it first checks for
http://www.foobar.com/robots.txt. If it can find this document, it will analyse its contents for records
like:
User-agent: *
Disallow: /
to see if it is allowed to retrieve the document. The precise details on how these rules can be specified,
and what they mean, can be found in:
● Web Server Administrator's Guide to the Robots Exclusion Protocol
● HTML Author's Guide to the Robots Exclusion Protocol
● The original 1994 protocol description, as currently deployed.
● The revised Internet-Draft specification, which is not yet completed or implemented.
The Robots META tag
The Robots META tag allows HTML authors to indicate to visiting robots if a document may be
indexed, or used to harvest more links. No server administrator action is required.
Note that currently only a few robots implement this.
In this simple example:
<META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW">
a robot should neither index this document, nor analyse it for links.
Full details on how this tag works are provided in:
● Web Server Administrator's Guide to the Robots META tag
● HTML Author's Guide to the Robots META tag
● The original notes from the May 1996 Indexing Workshop
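As a rough illustration of what implementing this looks like on the robot side, the sketch below (not taken from this page) uses Python's standard html.parser module to pick up the directives from a Robots META tag; real robots will differ in the details.

from html.parser import HTMLParser

class RobotsMetaParser(HTMLParser):
    """Collect directives from <META NAME="ROBOTS" CONTENT="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        if (attrs.get("name") or "").lower() == "robots":
            content = attrs.get("content") or ""
            self.directives.update(d.strip().lower() for d in content.split(","))

page = '<html><head><META NAME="ROBOTS" CONTENT="NOINDEX, NOFOLLOW"></head><body></body></html>'
parser = RobotsMetaParser()
parser.feed(page)
print("noindex" in parser.directives)   # True: do not index this document
print("nofollow" in parser.directives)  # True: do not follow its links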
The Web Robots Database
The List of Active Robots has been changed to a new format, called The Web Robots Database. This format
will allow more information to be stored, updates to happen faster, and the information to be more clearly
presented.
Note that now that robot technology is being used in an increasing number of end-user products, this list is
becoming less useful and complete.
For general information on robots see Web Robots Pages.
The robot information is now stored in individual files, with several HTML tables providing different
views of the data:
● View Names
● View Type Details using tables
● View Contact Details using tables
Browsers without support for tables can consult the overview of text files.
The combined raw data in machine readable format is available in a text file.
To add a new robot, fill in this empty template, using this schema description, and email it to
m.koster@webcrawler.com
Others
There are robots out there that the database contains no details on. If/when I get those details they will be
added, otherwise they'll remain on the list below, as unresponsive or unknown sites.
Services with no information
These services must use robots, but haven't replied to requests for an entry...
Magellan
User-agent field: Wobot/1.00
From: mckinley.mckinley.com (206.214.202.2) and galileo.mckinley.com.
(206.214.202.45)
Honors "robots.txt": yes
Contact: cedeno@mckinley.mckinley.com (or possibly:
spider@mckinley.mckinley.com)
Purpose: Resource discovery for Magellan (http://www.mckinley.com/)
User Agents
These look like new robots, but have no contact info...
BizBot04 kirk.overleaf.com
HappyBot (gserver.kw.net)
CaliforniaBrownSpider
EI*Net/0.1 libwww/0.1
Ibot/1.0 libwww-perl/0.40
Merritt/1.0
StatFetcher/1.0
TeacherSoft/1.0 libwww/2.17
WWW Collector
processor/0.0ALPHA libwww-perl/0.20
wobot/1.0 from 206.214.202.45
Libertech-Rover
www.libertech.com?
WhoWhere Robot
ITI Spider
w3index
MyCNNSpider
SummyCrawler
OGspider
linklooker
CyberSpyder (amant@www.cyberspyder.com)
SlowBot
heraSpider
Surfbot
Bizbot003
WebWalker
SandBot
EnigmaBot
spyder3.microsys.com
www.freeloader.com.
Hosts
These have no known user-agent, but have requested /robots.txt repeatedly or exhibited crawling
patterns.
205.252.60.71
194.20.32.131
198.5.209.201
acke.dc.luth.se
dallas.mt.cs.cmu.edu
darkwing.cadvision.com
waldec.com
www2000.ogsm.vanderbilt.edu
unet.ca
murph.cais.net (rapid fire... sigh)
spyder3.microsys.com
www.freeloader.com.
Some other robots are mentioned in a list of Japanese Search Engines.
WWW Robots Mailing List
Note: this mailing list was formerly located at robots@nexor.co.uk.
This list has moved to robots@mccmedia.com
Charter
The robots@webcrawler.com mailing-list is intended as a technical forum for authors,
maintainers and administrators of WWW robots. Its aim is to maximise the benefits WWW robots can
offer while minimising drawbacks and duplication of effort. It is intended to address both
development and operational aspects of WWW robots.
This list is not intended for general discussion of WWW development efforts, or as a first line of
support for users of robot facilities.
Postings to this list are informal, and decisions and recommendations formulated here do not
constitute any official standards. Postings to this list will be made available publicly through a mailing
list archive. Neither the administrator of this list nor his company accepts any responsibility for the content of
the postings.
Administrativa
These few rules of etiquette make the administrator's life easier, and this list (and others) more
productive and enjoyable:
When subscribing to this list, make sure you check any auto-responder ("vacation") software, and
make sure it doesn't reply to messages from this list. X.400 and LAN email systems are notorious for
positive delivery reports...
If your email address changes, please unsubscribe and resubscribe rather than just letting the subscription
go stale: this saves the administrator work (and frustration).
When first joining the list, glance through the archive (details below) or listen-in a while before
posting, so you get a feel for the kind of traffic on the list.
Never send "unsubscribe" messages to the list itself.
Don't post unrelated or repeated advertising to the list.
Subscription Details
To subscribe to this list, send a mail message to robots-request@webcrawler.com, with the
word subscribe on the first line of the body.
To unsubscribe from this list, send a mail message to robots-request@webcrawler.com, with
the word unsubscribe on the first line of the body.
Should this fail or should you otherwise need human assistance, send a message to
owner-robots@webcrawler.com.
To send a message to all subscribers on the list itself, mail robots@webcrawler.com.
The Archive
Messages to this list are archived. The preferred way of accessing the archived messages is using the
Robots Mailing List Archive provided by Hypermail.
Behind the scenes this list is currently managed by Majordomo, an automated mailing list manager
written in Perl. Majordomo also allows access to archived messages; send mail to
robots-request@webcrawler.com with the word help in the body to find out how.
Martijn Koster
Articles and Papers about WWW Robots
This is a list of papers related to robots. Formatted suggestions gratefully accepted. See also the FAQ
on books and information.
Protocol Gives Sites Way To Keep Out The 'Bots Jeremy Carl, Web Week, Volume 1, Issue 7,
November 1995 (no longer online)
Robots in the Web: threat or treat? , Martijn Koster, ConneXions, Volume 9, No. 4, April 1995
Guidelines for Robot Writers , Martijn Koster, 1993
Evaluation of the Standard for Robots Exclusion , Martijn Koster, 1996
WWW Robots Related Sites
Bot Spot "The Spot for All Bots on the Net".
The Web Robots Pages Martijn Koster's pages on robots, specifically robot exclusion.
Japanese Search Engines This is a comprehensive index for searching, submitting, and
navigating using Japanese search engines.
Search Engine Watch A site with information about many search engines, including
comparisons. Some information is available to subscribers
only.
RoboGen RoboGen is a visual editor for Robot Exclusion Files; it allows
one to create agent rules by logging onto your FTP server and
selecting files and directories.
Martijn Koster's Home Page
Martijn Koster
My name is Martijn Koster.
I currently work as a consultant software engineer at Excite,
primarily working on Excite Inbox.
You may be interested in my projects and publications, or a short
biography.
DO NOT ASK ME TO REMOVE OR ADD YOUR URLS TO ANY SEARCH ENGINES!
I can't even do it if I wanted to. To contact me about anything else,
email m.koster@webcrawler.com.
Disclaimer: These pages represent my personal views; I do not speak for my employer. Copyright
1995-2000 Martijn Koster. All rights reserved.
A Standard for Robot Exclusion
Table of contents:
● Status of this document
● Introduction
● Method
● Format
● Examples
● Example Code
● Author's Address
Status of this document
This document represents a consensus on 30 June 1994 on the robots mailing list
(robots-request@nexor.co.uk) [Note the Robots mailing list has relocated to WebCrawler. See the
Robots pages at WebCrawler for details], between the majority of robot authors and other people with
an interest in robots. It has also been open for discussion on the Technical World Wide Web mailing
list (www-talk@info.cern.ch). This document is based on a previous working draft under the same
title.
It is not an official standard backed by a standards body, or owned by any commercial organisation. It
is not enforced by anybody, and there is no guarantee that all current and future robots will use it.
Consider it a common facility the majority of robot authors offer the WWW community to protect
WWW servers against unwanted accesses by their robots.
The latest version of this document can be found on
http://info.webcrawler.com/mak/projects/robots/robots.html.
Introduction
WWW Robots (also called wanderers or spiders) are programs that traverse many pages in the World
Wide Web by recursively retrieving linked pages. For more information see the robots page.
In 1993 and 1994 there have been occasions where robots have visited WWW servers where they
weren't welcome for various reasons. Sometimes these reasons were robot specific, e.g. certain robots
swamped servers with rapid-fire requests, or retrieved the same files repeatedly. In other situations
robots traversed parts of WWW servers that weren't suitable, e.g. very deep virtual trees, duplicated
information, temporary information, or cgi-scripts with side-effects (such as voting).
These incidents indicated the need for established mechanisms for WWW servers to indicate to robots
which parts of their server should not be accessed. This standard addresses this need with an
operational solution.
The Method
The method used to exclude robots from a server is to create a file on the server which specifies an
access policy for robots. This file must be accessible via HTTP on the local URL "/robots.txt".
The contents of this file are specified below.
This approach was chosen because it can be easily implemented on any existing WWW server, and a
robot can find the access policy with only a single document retrieval.
A possible drawback of this single-file approach is that only a server administrator can maintain such
a list, not the individual document maintainers on the server. This can be resolved by a local process
to construct the single file from a number of others, but if, or how, this is done is outside of the scope
of this document.
The choice of the URL was motivated by several criteria:
● The filename should fit in file naming restrictions of all common operating systems.
● The filename extension should not require extra server configuration.
● The filename should indicate the purpose of the file and be easy to remember.
● The likelihood of a clash with existing files should be minimal.
The Format
The format and semantics of the "/robots.txt" file are as follows:
The file consists of one or more records separated by one or more blank lines (terminated by
CR,CR/NL, or NL). Each record contains lines of the form
"<field>:<optionalspace><value><optionalspace>". The field name is case
insensitive.
Comments can be included in the file using UNIX Bourne shell conventions: the '#' character is used to
indicate that preceding space (if any) and the remainder of the line up to the line termination is
discarded. Lines containing only a comment are discarded completely, and therefore do not indicate a
record boundary.
The record starts with one or more User-agent lines, followed by one or more Disallow lines,
as detailed below. Unrecognised headers are ignored.
User-agent
The value of this field is the name of the robot the record is describing access policy for.
If more than one User-agent field is present the record describes an identical access policy for
more than one robot. At least one field needs to be present per record.
The robot should be liberal in interpreting this field. A case insensitive substring match of the
name without version information is recommended.
If the value is '*', the record describes the default access policy for any robot that has not
matched any of the other records. It is not allowed to have multiple such records in the
"/robots.txt" file.
Disallow
The value of this field specifies a partial URL that is not to be visited. This can be a full path, or
a partial path; any URL that starts with this value will not be retrieved. For example,
Disallow: /help disallows both /help.html and /help/index.html, whereas
Disallow: /help/ would disallow /help/index.html but allow /help.html.
An empty value indicates that all URLs can be retrieved. At least one Disallow field needs to
be present in a record.
The presence of an empty "/robots.txt" file has no explicit associated semantics, it will be
treated as if it was not present, i.e. all robots will consider themselves welcome.
Examples
The following example "/robots.txt" file specifies that no robots should visit any URL starting
with "/cyberworld/map/" or "/tmp/", or /foo.html:
# robots.txt for http://www.example.com/
User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space
Disallow: /tmp/ # these will soon disappear
Disallow: /foo.html
This example "/robots.txt" file specifies that no robots should visit any URL starting with
"/cyberworld/map/", except the robot called "cybermapper":
# robots.txt for http://www.example.com/
User-agent: *
Disallow: /cyberworld/map/ # This is an infinite virtual URL space
# Cybermapper knows where to go.
User-agent: cybermapper
Disallow:
This example indicates that no robots should visit this site further:
# go away
User-agent: *
Disallow: /
Example Code
Although it is not part of this specification, some example code in Perl is available in norobots.pl. It is
a bit more flexible in its parsing than this document specifies, and is provided as-is, without
warranty.
Note: This code is no longer available. Instead I recommend using the robots exclusion
code in the Perl libwww-perl5 library, available from CPAN in the LWP directory.
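For readers who would rather not use Perl, here is a minimal sketch in Python of the parsing and checking described by this document. It is an illustration only, not the missing norobots.pl: records are split on blank lines, agent names are matched as case-insensitive substrings, '*' is used as the fallback record, and Disallow values are treated as simple path prefixes.

def parse_robots_txt(text):
    """Return a list of (agents, disallows) records from a /robots.txt body."""
    records, agents, disallows = [], [], []
    for raw in text.splitlines():
        if raw.strip().startswith("#"):
            continue                              # comment-only lines are not record boundaries
        line = raw.split("#", 1)[0].strip()       # strip trailing comments
        if not line:                              # blank line ends the current record
            if agents:
                records.append((agents, disallows))
                agents, disallows = [], []
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            if disallows:                         # tolerate a missing blank line
                records.append((agents, disallows))
                agents, disallows = [], []
            agents.append(value)
        elif field == "disallow":
            disallows.append(value)
    if agents:
        records.append((agents, disallows))
    return records

def allowed(records, robot_name, path):
    """Apply the record for this robot; '*' is only used if nothing else matches."""
    default = None
    for agents, disallows in records:
        if "*" in agents:
            default = disallows
        elif any(a.lower() in robot_name.lower() for a in agents):
            return not any(d and path.startswith(d) for d in disallows)
    if default is not None:
        return not any(d and path.startswith(d) for d in default)
    return True

example = """\
# robots.txt for http://www.example.com/
User-agent: *
Disallow: /cyberworld/map/   # This is an infinite virtual URL space

# Cybermapper knows where to go.
User-agent: cybermapper
Disallow:
"""

records = parse_robots_txt(example)
print(allowed(records, "cybermapper/1.0", "/cyberworld/map/area51.html"))  # True
print(allowed(records, "OtherBot/2.0", "/cyberworld/map/area51.html"))     # False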
Author's Address
Martijn Koster <m.koster@webcrawler.com>
Web Server Administrator's Guide
to the Robots Exclusion Protocol
This guide is aimed at Web Server Administrators who want to use the Robots Exclusion Protocol.
Note that this is not a specification -- for details and formal syntax and definition see the specification.
Introduction
The Robots Exclusion Protocol is very straightforward. In a nutshell it works like this:
When a compliant Web Robot visits a site, it first checks for a "/robots.txt" URL on the site. If this
URL exists, the Robot parses its contents for directives that instruct the robot not to visit certain parts
of the site.
As a Web Server Administrator you can create directives that make sense for your site. This page tells
you how.
Where to create the robots.txt file
The Robot will simply look for a "/robots.txt" URL on your site, where a site is defined as a HTTP
server running on a particular host and port number. For example:
Site URL                      Corresponding Robots.txt URL
http://www.w3.org/            http://www.w3.org/robots.txt
http://www.w3.org:80/         http://www.w3.org:80/robots.txt
http://www.w3.org:1234/       http://www.w3.org:1234/robots.txt
http://w3.org/                http://w3.org/robots.txt
Note that there can only be a single "/robots.txt" on a site. Specifically, you should not put "robots.txt"
files in user directories, because a robot will never look at them. If you want your users to be able to
create their own "robots.txt", you will need to merge them all into a single "/robots.txt". If you don't
want to do this your users might want to use the Robots META Tag instead.
Also, remember that URLs are case sensitive, and "/robots.txt" must be all lower-case.
Pointless robots.txt URLs
http://www.w3.org/admin/robots.txt
http://www.w3.org/~timbl/robots.txt
ftp://ftp.w3.com/robots.txt
So, you need to provide the "/robots.txt" in the top-level of your URL space. How to do this depends
on your particular server software and configuration.
For most servers it means creating a file in your top-level server directory. On a UNIX machine this
might be /usr/local/etc/httpd/htdocs/robots.txt
What to put into the robots.txt file
The "/robots.txt" file usually contains a record looking like this:
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /~joe/
In this example, three directories are excluded.
Note that you need a separate "Disallow" line for every URL prefix you want to exclude -- you cannot
say "Disallow: /cgi-bin/ /tmp/". Also, you may not have blank lines in a record, as they are used to
delimit multiple records.
Note also that regular expressions are not supported in either the User-agent or Disallow lines. The '*'
in the User-agent field is a special value meaning "any robot". Specifically, you cannot have lines like
"Disallow: /tmp/*" or "Disallow: *.gif".
What you want to exclude depends on your server. Everything not explicitly disallowed is considered
fair game to retrieve. Here follow some examples:
To exclude all robots from the entire server
User-agent: *
Disallow: /
To allow all robots complete access
User-agent: *
Disallow:
Or create an empty "/robots.txt" file.
To exclude all robots from part of the server
User-agent: *
Disallow: /cgi-bin/
Disallow: /tmp/
Disallow: /private/
To exclude a single robot
User-agent: BadBot
Disallow: /
To allow a single robot
User-agent: WebCrawler
Disallow:
User-agent: *
Disallow: /
To exclude all files except one
This is currently a bit awkward, as there is no "Allow" field. The easy way is to put all files to be
disallowed into a separate directory, say "docs", and leave the one file in the level above this
directory:
User-agent: *
Disallow: /~joe/docs/
Alternatively you can explicitly disallow all disallowed pages:
User-agent: *
Disallow: /~joe/private.html
Disallow: /~joe/foo.html
Disallow: /~joe/bar.html
HTML Author's Guide to the Robots
Exclusion Protocol
The Robots Exclusion Protocol requires that instructions are placed in a URL "/robots.txt", i.e. in the
top-level of your server's document space.
If you rent space for your HTML files on the server of your Internet Service Provider, or another third
party, you are usually not allowed to install or modify files in the top-level of the server's document
space.
This means that to use the Robots Exclusion Protocol, you have to liaise with the server administrator,
and get him/her to add the rules to the "/robots.txt", using the Web Server Administrator's Guide to the
Robots Exclusion Protocol.
There is no way around this -- specifically, there is no point in providing your own "/robots.txt" files
elsewhere on the server, like in your home directory or subdirectories; Robots won't look for them,
and even if they did find them, they wouldn't pay attention to the rules there.
If your administrator is unwilling to install or modify "/robots.txt" rules on your behalf, and all you
want is to prevent being indexed by indexing robots like WebCrawler and Lycos, you can add a Robots
Meta Tag to all pages you don't want indexed. Note this functionality is not implemented by all
indexing robots.
A Standard for Robot Exclusion
Network Working Group
INTERNET DRAFT
Category: Informational
Dec 4, 1996
<draft-koster-robots-00.txt>
M. Koster
WebCrawler
November 1996
Expires June 4, 1997
A Method for Web Robots Control
Status of this Memo
This document is an Internet-Draft. Internet-Drafts are
working documents of the Internet Engineering Task Force
(IETF), its areas, and its working groups. Note that other
groups may also distribute working documents as Internet-Drafts.
Internet-Drafts are draft documents valid for a maximum of six
months and may be updated, replaced, or obsoleted by other
documents at any time. It is inappropriate to use Internet-Drafts as reference material or to cite them other than as
``work in progress.''
To learn the current status of any Internet-Draft, please
check the ``1id-abstracts.txt'' listing contained in the
Internet-Drafts Shadow Directories on ftp.is.co.za (Africa),
nic.nordu.net (Europe), munnari.oz.au (Pacific Rim),
ds.internic.net (US East Coast), or ftp.isi.edu (US West
Coast).
Table of Contents

1.     Abstract
2.     Introduction
3.     Specification
3.1    Access method
3.2    File Format Description
3.2.1  The User-agent line
3.2.2  The Allow and Disallow lines
3.3    Formal Syntax
3.4    Expiration
4.     Examples
5.     Implementor's Notes
5.1    Backwards Compatibility
5.2    Interoperability
6.     Security Considerations
7.     References
8.     Acknowledgements
9.     Author's Address

1. Abstract
This memo defines a method for administrators of sites on the World-Wide Web to give instructions to visiting Web robots, most
importantly what areas of the site are to be avoided.
This document provides a more rigid specification of the Standard
for Robots Exclusion [1], which is currently in wide-spread use by
the Web community since 1994.
2. Introduction
Web Robots (also called "Wanderers" or "Spiders") are Web client
programs that automatically traverse the Web's hypertext structure
by retrieving a document, and recursively retrieving all documents
that are referenced.
Note that "recursively" here doesn't limit the definition to any
specific traversal algorithm; even if a robot applies some heuristic
to the selection and order of documents to visit and spaces out
requests over a long space of time, it qualifies to be called a
robot.
Robots are often used for maintenance and indexing purposes, by
people other than the administrators of the site being visited. In
some cases such visits may have undesirable effects which the
administrators would like to prevent, such as indexing of an
unannounced site, traversal of parts of the site which require vast
resources of the server, recursive traversal of an infinite URL
space, etc.
The technique specified in this memo allows Web site administrators
to indicate to visiting robots which parts of the site should be
avoided. It is solely up to the visiting robot to consult this
information and act accordingly. Blocking parts of the Web site
regardless of a robot's compliance with this method is outside
the scope of this memo.
3. The Specification
This memo specifies a format for encoding instructions to visiting
robots, and specifies an access method to retrieve these
instructions. Robots must retrieve these instructions before visiting
other URLs on the site, and use the instructions to determine if
other URLs on the site can be accessed.
3.1 Access method
The instructions must be accessible via HTTP [2] from the site that
the instructions are to be applied to, as a resource of Internet
Media Type [3] "text/plain" under a standard relative path on the
server: "/robots.txt".
For convenience we will refer to this resource as the "/robots.txt
file", though the resource need in fact not originate from a filesystem.
Some examples of URLs [4] for sites and URLs for corresponding
"/robots.txt" sites:
http://www.foo.com/welcome.html        http://www.foo.com/robots.txt
http://www.bar.com:8001/               http://www.bar.com:8001/robots.txt
If the server response indicates Success (HTTP 2xx Status Code),
the robot must read the content, parse it, and follow any
instructions applicable to that robot.
If the server response indicates the resource does not exist (HTTP
Status Code 404), the robot can assume no instructions are
available, and that access to the site is not restricted by
/robots.txt.
Specific behaviors for other server responses are not required by
this specification, though the following behaviours are recommended:
- On server response indicating access restrictions (HTTP Status
Code 401 or 403) a robot should regard access to the site
completely restricted.
- On a request attempt that resulted in temporary failure a robot
should defer visits to the site until such time as the resource
can be retrieved.
- On server response indicating Redirection (HTTP Status Code 3XX)
a robot should follow the redirects until a resource can be
found.
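For illustration only (this is not part of the memo), the recommendations above might be implemented in Python roughly as follows; urllib follows 3xx redirects by itself, and the robots.txt parser used in the 2xx case is the standard library one rather than a full implementation of this draft.

import urllib.error
import urllib.request
from urllib.robotparser import RobotFileParser

def fetch_robots_policy(site):
    """Return (parser, fully_restricted) for a site such as 'http://www.example.com'."""
    url = site.rstrip("/") + "/robots.txt"
    try:
        # urlopen follows 3xx redirects until a resource is found.
        with urllib.request.urlopen(url, timeout=10) as response:
            parser = RobotFileParser()
            parser.parse(response.read().decode("utf-8", "replace").splitlines())
            return parser, False
    except urllib.error.HTTPError as err:
        if err.code in (401, 403):
            return None, True       # access restricted: treat the whole site as off-limits
        if err.code == 404:
            return None, False      # no /robots.txt: access is not restricted by it
        raise                       # other failures: defer the visit and retry later
    except urllib.error.URLError:
        raise                       # temporary failure: defer the visit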
3.2 File Format Description
The instructions are encoded as a formatted plain text object,
described here. A complete BNF-like description of the syntax of this
format is given in section 3.3.
The format logically consists of a non-empty set of records,
separated by blank lines. The records consist of a set of lines of
the form:
<Field> ":" <value>
In this memo we refer to lines with a Field "foo" as "foo lines".
The record starts with one or more User-agent lines, specifying
which robots the record applies to, followed by "Disallow" and
"Allow" instructions to that robot. For example:
User-agent: webcrawler
User-agent: infoseek
Allow: /tmp/ok.html
Disallow: /tmp
Disallow: /user/foo
These lines are discussed separately below.
Lines with Fields not explicitly specified by this specification
may occur in the /robots.txt, allowing for future extension of the
format. Consult the BNF for restrictions on the syntax of such
extensions. Note specifically that for backwards compatibility
with robots implementing earlier versions of this specification,
breaking of lines is not allowed.
Comments are allowed anywhere in the file, and consist of optional
whitespace, followed by a comment character '#' followed by the
comment, terminated by the end-of-line.
3.2.1 The User-agent line
Name tokens are used to allow robots to identify themselves via a
simple product token. Name tokens should be short and to the
point. The name token a robot chooses for itself should be sent
as part of the HTTP User-agent header, and must be well documented.
These name tokens are used in User-agent lines in /robots.txt to
identify to which specific robots the record applies. The robot
must obey the first record in /robots.txt that contains a User-Agent line whose value contains the name token of the robot as a
substring. The name comparisons are case-insensitive. If no such
record exists, it should obey the first record with a User-agent
line with a "*" value, if present. If no record satisfied either
condition, or no records are present at all, access is unlimited.
For example, a fictional company FigTree Search Services, which names
its robot "Fig Tree", might send HTTP requests like:
GET / HTTP/1.0
User-agent: FigTree/0.1 Robot libwww-perl/5.04
might scan the "/robots.txt" file for records with:
User-agent: figtree
3.2.2 The Allow and Disallow lines
These lines indicate whether accessing a URL that matches the
corresponding path is allowed or disallowed. Note that these
instructions apply to any HTTP method on a URL.
To evaluate if access to a URL is allowed, a robot must attempt to
match the paths in Allow and Disallow lines against the URL, in the
order they occur in the record. The first match found is used. If no
match is found, the default assumption is that the URL is allowed.
The /robots.txt URL is always allowed, and must not appear in the
Allow/Disallow rules.
The matching process compares every octet in the path portion of
the URL and the path from the record. If a %xx encoded octet is
encountered it is unencoded prior to comparison, unless it is the
"/" character, which has special meaning in a path. The match
evaluates positively if and only if the end of the path from the
record is reached before a difference in octets is encountered.
This table illustrates some examples:
Record Path          URL path             Matches
/tmp                 /tmp                 yes
/tmp                 /tmp.html            yes
/tmp                 /tmp/a.html          yes
/tmp/                /tmp                 no
/tmp/                /tmp/                yes
/tmp/                /tmp/a.html          yes

/a%3cd.html          /a%3cd.html          yes
/a%3cd.html          /a%3Cd.html          yes
/a%3Cd.html          /a%3cd.html          yes
/a%3Cd.html          /a%3Cd.html          yes

/a%2fb.html          /a%2fb.html          yes
/a%2fb.html          /a/b.html            no
/a/b.html            /a%2fb.html          no
/a/b.html            /a/b.html            yes

/%7ejoe/index.html   /~joe/index.html     yes
/~joe/index.html     /%7Ejoe/index.html   yes
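As an illustration (not part of the memo), the octet-matching rule and the first-match evaluation could be sketched in Python as below; the handling of "%2f" here is one reasonable reading of the special case for "/".

from urllib.parse import unquote

def decode_path(path):
    """Percent-decode a path, but leave "%2F" encoded since "/" has special meaning."""
    pieces = path.replace("%2f", "%2F").split("%2F")
    return "%2F".join(unquote(piece) for piece in pieces)

def matches(record_path, url_path):
    """A rule path matches if it is a prefix of the URL path, compared after decoding."""
    return decode_path(url_path).startswith(decode_path(record_path))

def allowed(rules, url_path):
    """rules is an ordered list of ("Allow" | "Disallow", path); the first match wins."""
    if url_path == "/robots.txt":
        return True                    # the /robots.txt URL is always allowed
    for kind, record_path in rules:
        if matches(record_path, url_path):
            return kind == "Allow"
    return True                        # no match: allowed by default

# A few rows from the table above:
print(matches("/tmp", "/tmp.html"))                       # True  (table: yes)
print(matches("/tmp/", "/tmp"))                           # False (table: no)
print(matches("/a%3cd.html", "/a%3Cd.html"))              # True  (%3C is decoded)
print(matches("/a%2fb.html", "/a/b.html"))                # False (%2F is not decoded)
print(matches("/%7ejoe/index.html", "/~joe/index.html"))  # True

# The record for webcrawler and infoseek from section 3.2:
rules = [("Allow", "/tmp/ok.html"), ("Disallow", "/tmp"), ("Disallow", "/user/foo")]
print(allowed(rules, "/tmp/ok.html"))     # True: the Allow line occurs first
print(allowed(rules, "/tmp/index.html"))  # False
print(allowed(rules, "/index.html"))      # True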
3.3 Formal Syntax
This is a BNF-like description, using the conventions of RFC 822 [5],
except that "|" is used to designate alternatives. Briefly, literals
are quoted with "", parentheses "(" and ")" are used to group
elements, optional elements are enclosed in [brackets], and elements
may be preceded with <n>* to designate n or more repetitions of the
following element; n defaults to 0.
robotstxt    = *blankcomment
             | *blankcomment record *( 1*commentblank 1*record )
               *blankcomment
blankcomment = 1*(blank | commentline)
commentblank = *commentline blank *(blankcomment)
blank        = *space CRLF
CRLF         = CR LF
record       = *commentline agentline *(commentline | agentline)
               1*ruleline *(commentline | ruleline)
agentline    = "User-agent:" *space agent [comment] CRLF
ruleline     = (disallowline | allowline | extension)
disallowline = "Disallow" ":" *space path [comment] CRLF
allowline    = "Allow" ":" *space rpath [comment] CRLF
extension    = token : *space value [comment] CRLF
value        = <any CHAR except CR or LF or "#">
commentline  = comment CRLF
comment      = *blank "#" anychar
space        = 1*(SP | HT)
rpath        = "/" path
agent        = token
anychar      = <any CHAR except CR or LF>
CHAR         = <any US-ASCII character (octets 0 - 127)>
CTL          = <any US-ASCII control character
               (octets 0 - 31) and DEL (127)>
CR           = <US-ASCII CR, carriage return (13)>
LF           = <US-ASCII LF, linefeed (10)>
SP           = <US-ASCII SP, space (32)>
HT           = <US-ASCII HT, horizontal-tab (9)>
The syntax for "token" is taken from RFC 1945 [2], reproduced here for
convenience:
token      = 1*<any CHAR except CTLs or tspecials>

tspecials  = "(" | ")" | "<" | ">" | "@"
           | "," | ";" | ":" | "\" | <">
           | "/" | "[" | "]" | "?" | "="
           | "{" | "}" | SP | HT
The syntax for "path" is defined in RFC 1808 [6], reproduced here for
convenience:
path       = fsegment *( "/" segment )
fsegment   = 1*pchar
segment    = *pchar

pchar      = uchar | ":" | "@" | "&" | "="
uchar      = unreserved | escape
unreserved = alpha | digit | safe | extra

escape     = "%" hex hex
hex        = digit | "A" | "B" | "C" | "D" | "E" | "F" |
             "a" | "b" | "c" | "d" | "e" | "f"

alpha      = lowalpha | hialpha
lowalpha   = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h" | "i" |
             "j" | "k" | "l" | "m" | "n" | "o" | "p" | "q" | "r" |
             "s" | "t" | "u" | "v" | "w" | "x" | "y" | "z"
hialpha    = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H" | "I" |
             "J" | "K" | "L" | "M" | "N" | "O" | "P" | "Q" | "R" |
             "S" | "T" | "U" | "V" | "W" | "X" | "Y" | "Z"

digit      = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7" |
             "8" | "9"

safe       = "$" | "-" | "_" | "." | "+"
extra      = "!" | "*" | "'" | "(" | ")" | ","
3.4 Expiration
Robots should cache /robots.txt files, but if they do they must
periodically verify the cached copy is fresh before using its
contents.
Standard HTTP cache-control mechanisms can be used by both origin
server and robots to influence the caching of the /robots.txt file.
Specifically robots should take note of Expires header set by the
origin server.
If no cache-control directives are present robots should default to
an expiry of 7 days.
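A small sketch of that caching rule (not part of the memo), assuming the robot has kept the Expires header it received, if any, when it fetched /robots.txt:

from datetime import datetime, timedelta, timezone
from email.utils import parsedate_to_datetime

def robots_txt_expiry(expires_header, fetched_at=None):
    """Return the moment a cached /robots.txt copy should be re-checked."""
    fetched_at = fetched_at or datetime.now(timezone.utc)
    if expires_header:
        try:
            return parsedate_to_datetime(expires_header)
        except (TypeError, ValueError):
            pass                              # unparsable header: fall back to the default
    return fetched_at + timedelta(days=7)     # no cache-control directives: 7 days

print(robots_txt_expiry(None))                             # seven days from now
print(robots_txt_expiry("Thu, 01 Dec 1994 16:00:00 GMT"))  # the server-supplied expiry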
4. Examples
This section contains an example of how a /robots.txt may be used.
A fictional site may have the following URLs:
http://www.fict.org/
http://www.fict.org/index.html
http://www.fict.org/robots.txt
http://www.fict.org/server.html
http://www.fict.org/services/fast.html
http://www.fict.org/services/slow.html
http://www.fict.org/orgo.gif
http://www.fict.org/org/about.html
http://www.fict.org/org/plans.html
http://www.fict.org/%7Ejim/jim.html
http://www.fict.org/%7Emak/mak.html
The site may in the /robots.txt have specific rules for robots that
send a HTTP User-agent "UnhipBot/0.1", "WebCrawler/3.0", and
"Excite/1.0", and a set of default rules:
# /robots.txt for http://www.fict.org/
# comments to webmaster@fict.org
User-agent: unhipbot
Disallow: /
User-agent: webcrawler
User-agent: excite
Disallow:
User-agent: *
Disallow: /org/plans.html
Allow: /org/
Allow: /serv
Allow: /~mak
Disallow: /
The following matrix shows which robots are allowed to access URLs:
                                         unhipbot  webcrawler   other
                                                    & excite
http://www.fict.org/                        No        Yes         No
http://www.fict.org/index.html              No        Yes         No
http://www.fict.org/robots.txt              Yes       Yes         Yes
http://www.fict.org/server.html             No        Yes         Yes
http://www.fict.org/services/fast.html      No        Yes         Yes
http://www.fict.org/services/slow.html      No        Yes         Yes
http://www.fict.org/orgo.gif                No        Yes         No
http://www.fict.org/org/about.html          No        Yes         Yes
http://www.fict.org/org/plans.html          No        Yes         No
http://www.fict.org/%7Ejim/jim.html         No        Yes         No
http://www.fict.org/%7Emak/mak.html         No        Yes         Yes
5. Notes for Implementors
5.1 Backwards Compatibility
Previous versions of this specification didn't provide the Allow line. The
introduction of the Allow line causes robots to behave slightly
differently under either specification:
If a /robots.txt contains an Allow which overrides a later occurring
Disallow, a robot ignoring Allow lines will not retrieve those
parts. This is considered acceptable because there is no requirement
for a robot to access URLs it is allowed to retrieve, and it is safe,
in that no URLs a Web site administrator wants to Disallow will be
allowed. It is expected this may in fact encourage robots to upgrade
compliance to the specification in this memo.
5.2 Interoperability
Implementors should pay particular attention to the robustness in
parsing of the /robots.txt file. Web site administrators who are not
aware of the /robots.txt mechanisms often notice repeated failing
request for it in their log files, and react by putting up pages
asking "What are you looking for?".
As the majority of /robots.txt files are created with platform-specific text editors, robots should be liberal in accepting files
with different end-of-line conventions, specifically CR and LF in
addition to CRLF.
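
As a small illustration (not taken from the draft), a liberal parser
in Python can simply rely on splitlines(), which already accepts CR,
LF and CRLF as line terminators:

def robots_lines(raw):
    """Split a /robots.txt body into stripped lines, whatever the EOL style."""
    text = raw.decode("ascii", errors="replace")
    return [line.strip() for line in text.splitlines()]

sample = b"User-agent: *\r\nDisallow: /tmp/\rDisallow: /private/\n"
print(robots_lines(sample))
# -> ['User-agent: *', 'Disallow: /tmp/', 'Disallow: /private/']
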
6. Security Considerations
There are a few risks in the method described here, which may affect
either origin server or robot.
Web site administrators must realise this method is voluntary, and
is not sufficient to guarantee that some robots will not visit restricted
parts of the URL space. Failure to use proper authentication or other
restriction may result in exposure of restricted information. It is even
possible that the occurrence of paths in the /robots.txt file may
expose the existence of resources not otherwise linked to on the
site, which may aid people guessing for URLs.
Robots need to be aware that the amount of resources spent on dealing
with the /robots.txt is a function of the file contents, which is not
under the control of the robot. For example, the contents may be
larger in size than the robot can deal with. To prevent denial-of-service
attacks, robots are therefore encouraged to place limits on
the resources spent on processing of /robots.txt.
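
One simple way to bound that effort, sketched here in Python purely as
an illustration (the 500 KB cap is an arbitrary choice, not a figure
from this memo), is to stop reading the /robots.txt body after a fixed
number of bytes:

from urllib.request import urlopen

MAX_ROBOTS_BYTES = 500 * 1024   # arbitrary upper bound on what we will parse

def fetch_robots_txt(url):
    """Read at most MAX_ROBOTS_BYTES of the /robots.txt body."""
    with urlopen(url) as response:
        return response.read(MAX_ROBOTS_BYTES)
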
The /robots.txt directives are retrieved and applied in separate,
possibly unauthenticated HTTP transactions, and it is possible that
one server can impersonate another or otherwise intercept a
/robots.txt, and provide a robot with false information. This
specification does not preclude authentication and encryption
from being employed to increase security.
7. Acknowledgements
The author would like to thank the subscribers to the robots mailing
list for their contributions to this specification.
8. References
[1] Koster, M., "A Standard for Robot Exclusion",
http://info.webcrawler.com/mak/projects/robots/norobots.html,
June 1994.
[2] Berners-Lee, T., Fielding, R., and Frystyk, H., "Hypertext
Transfer Protocol -- HTTP/1.0." RFC 1945, MIT/LCS, May 1996.
[3] Postel, J., "Media Type Registration Procedure." RFC 1590,
USC/ISI, March 1994.
[4] Berners-Lee, T., Masinter, L., and M. McCahill, "Uniform
Resource Locators (URL)", RFC 1738, CERN, Xerox PARC,
University of Minnesota, December 1994.
[5] Crocker, D., "Standard for the Format of ARPA Internet Text
Messages", STD 11, RFC 822, UDEL, August 1982.
[6] Fielding, R., "Relative Uniform Resource Locators", RFC 1808,
UC Irvine, June 1995.
9. Author's Address
Martijn Koster
WebCrawler
America Online
690 Fifth Street
San Francisco
CA 94107
Phone: 415-3565431
EMail: m.koster@webcrawler.com
Expires June 4, 1997
The Web Robots Pages
http://info.webcrawler.com/mak/projects/robots/norobots-rfc.html (12 of 12) [18.02.2001 13:14:36]
Web Server Administrator's Guide to the Robots META tag
The Web Robots Pages
Web Server Administrator's Guide
to the Robots META tag.
Good news! As a Web Server Administrator you don't need to do anything to support the Robots
META tag.
Simply refer your users to the HTML Author's Guide to the Robots META tag.
The Web Robots Pages
http://info.webcrawler.com/mak/projects/robots/meta-admin.html [18.02.2001 13:14:37]
HTML Author's Guide to the Robots META tag
The Web Robots Pages
HTML Author's Guide
to the Robots META tag.
The Robots META tag is a simple mechanism to indicate to visiting Web Robots whether a page should be
indexed and whether links on the page should be followed.
It differs from the Protocol for Robots Exclusion in that you need no effort or permission from your Web
Server Administrator.
Note: Currently only a few robots support this tag!
Where to put the Robots META tag
Like any META tag it should be placed in the HEAD section of an HTML page:
<html>
<head>
<meta name="robots" content="noindex,nofollow">
<meta name="description" content="This page ....">
<title>...</title>
</head>
<body>
...
What to put into the Robots META tag
The content of the Robots META tag contains directives separated by commas. The currently defined
directives are [NO]INDEX and [NO]FOLLOW. The INDEX directive specifies if an indexing robot
should index the page. The FOLLOW directive specifies if a robot is to follow links on the page. The
defaults are INDEX and FOLLOW. The values ALL and NONE set all directives on or off:
ALL=INDEX,FOLLOW and NONE=NOINDEX,NOFOLLOW.
Some examples:
<meta name="robots" content="index,follow">
<meta name="robots" content="noindex,follow">
<meta name="robots" content="index,nofollow">
<meta name="robots" content="noindex,nofollow">
Note that the "robots" name of the tag and the content values are case insensitive.
You obviously should not specify conflicting or repeating directives such as:
<meta name="robots" content="INDEX,NOINDEX,NOFOLLOW,FOLLOW,FOLLOW">
A formal syntax for the Robots META tag content is:
content    = all | none | directives
all        = "ALL"
none       = "NONE"
directives = directive ["," directives]
directive  = index | follow
index      = "INDEX" | "NOINDEX"
follow     = "FOLLOW" | "NOFOLLOW"
The Web Robots Pages
http://info.webcrawler.com/mak/projects/robots/meta-user.html (2 of 2) [18.02.2001 13:14:39]
Spidering BOF Report
The Web Robots Pages
Spidering BOF Report
[Note: This is an HTML version of the original notes from the Distributed Indexing/Searching
Workshop ]
Report by Michael Mauldin (Lycos)
(later edited by Michael Schwartz)
While the overall workshop goal was to determine areas where standards could be pursued, the
Spidering BOF attempted to reach actual standards agreements about some immediate term issues
facing robot-based search services, at least among spider-based search service representatives who
were in attendance at the workshop (Excite, InfoSeek, and Lycos). The agreements fell into four
areas, but we report only three of them here because the fourth area concerned a KEYWORDS tag
that many workshop participants felt was not appropriate for specification by this BOF without the
participation of other groups that have been working on that issue.
The remaining three areas were:
ROBOTS meta-tag
<META NAME="ROBOTS" CONTENT="ALL | NONE | NOINDEX | NOFOLLOW">
default = empty = "ALL"
"NONE" = "NOINDEX, NOFOLLOW"
The filler is a comma separated list of terms: ALL, NONE, INDEX, NOINDEX, FOLLOW,
NOFOLLOW.
Discussion: This tag is meant for users who cannot control the robots.txt file at their sites. It
provides a last chance to keep their content out of search services. It was decided not to add syntax to
allow robot-specific permissions within the meta-tag.
INDEX means that robots are welcome to include this page in search services.
FOLLOW means that robots are welcome to follow links from this page to find other pages.
So a value of "NOINDEX" allows the subsidiary links to be explored, even though the page is not
indexed. A value of "NOFOLLOW" allows the page to be indexed, but no links from the page are
explored (this may be useful if the page is a free entry point into pay-per-view content, for example). A
value of "NONE" tells the robot to ignore the page.
DESCRIPTION meta-tag
<META NAME="DESCRIPTION" CONTENT="...text...">
The intent is that the text can be used by a search service when printing a summary of the document.
The text should not contain any formatting information.
Other issues with ROBOTS.TXT
These are issues recommended for future standards discussion that could not be resolved within the
scope of this workshop.
● Ambiguities in the current specification http://www.kollar.com/robots.html
● A means of canonicalizing sites, using: HTTP-EQUIV HOST ROBOTS.TXT ALIAS
● ways of supporting multiple robots.txt files per site ("robotsN.txt")
● ways of advertising content that should be indexed (rather than just restricting content that
should not be indexed)
● Flow control information: retrieval interval or maximum connections open to server
The Web Robots Pages
http://info.webcrawler.com/mak/projects/robots/meta-notes.html (2 of 2) [18.02.2001 13:14:40]
Database of Web Robots, Overview
The Web Robots Pages
Database of Web Robots, Overview
In addition to this overview, you can View Contact or View Type.
1. Acme.Spider
2. Ahoy! The Homepage Finder
3. Alkaline
4. Walhello appie
5. Arachnophilia
6. ArchitextSpider
7. Aretha
8. ARIADNE
9. arks
10. ASpider (Associative Spider)
11. ATN Worldwide
12. Atomz.com Search Robot
13. AURESYS
14. BackRub
15. unnamed
16. Big Brother
17. Bjaaland
18. BlackWidow
19. Die Blinde Kuh
20. Bloodhound
21. bright.net caching robot
22. BSpider
23. CACTVS Chemistry Spider
24. Calif
25. Cassandra
26. Digimarc Marcspider/CGI
27. Checkbot
28. churl
29. CMC/0.01
30. Collective
31. Combine System
32. Conceptbot
33. CoolBot
34. Web Core / Roots
35. XYLEME Robot
36. Internet Cruiser Robot
37. Cusco
38. CyberSpyder Link Test
39. DeWeb(c) Katalog/Index
40. DienstSpider
41. Digger
42. Digital Integrity Robot
43. Direct Hit Grabber
44. DNAbot
45. DownLoad Express
46. DragonBot
47. DWCP (Dridus' Web Cataloging Project)
48. e-collector
49. EbiNess
50. EIT Link Verifier Robot
51. Emacs-w3 Search Engine
52. ananzi
53. Esther
54. Evliya Celebi
55. nzexplorer
56. Fluid Dynamics Search Engine robot
57. Felix IDE
58. Wild Ferret Web Hopper #1, #2, #3
59. FetchRover
60. fido
61. Hämähäkki
62. KIT-Fireball
63. Fish search
64. Fouineur
65. Robot Francoroute
66. Freecrawl
67. FunnelWeb
68. gazz
69. GCreep
70. GetBot
71. GetURL
72. Golem
73. Googlebot
74. Grapnel/0.01 Experiment
75. Griffon
76. Gromit
77. Northern Light Gulliver
78. HamBot
79. Harvest
80. havIndex
81. HI (HTML Index) Search
82. Hometown Spider Pro
83. Wired Digital
84. ht://Dig
85. HTMLgobble
86. Hyper-Decontextualizer
87. IBM_Planetwide
88. Popular Iconoclast
89. Ingrid
90. Imagelock
91. IncyWincy
92. Informant
93. InfoSeek Robot 1.0
94. Infoseek Sidewinder
95. InfoSpiders
96. Inspector Web
97. IntelliAgent
98. I, Robot
99. Iron33
100. Israeli-search
101. JavaBee
102. JBot Java Web Robot
103. JCrawler
104. Jeeves
105. Jobot
106. JoeBot
107. The Jubii Indexing Robot
108. JumpStation
109. Katipo
110. KDD-Explorer
111. Kilroy
112. KO_Yappo_Robot
113. LabelGrabber
114. larbin
115. legs
116. Link Validator
117. LinkScan
118. LinkWalker
119. Lockon
120. logo.gif Crawler
121. Lycos
122. Mac WWWWorm
123. Magpie
124. Mattie
125. MediaFox
126. MerzScope
127. NEC-MeshExplorer
128. MindCrawler
129. moget
130. MOMspider
131. Monster
132. Motor
133. Muscat Ferret
134. Mwd.Search
135. Internet Shinchakubin
136. NetCarta WebMap Engine
137. NetMechanic
138. NetScoop
139. newscan-online
140. NHSE Web Forager
141. Nomad
142. The NorthStar Robot
143. Occam
144. HKU WWW Octopus
145. Orb Search
146. Pack Rat
147. PageBoy
148. ParaSite
149. Patric
150. pegasus
151. The Peregrinator
152. PerlCrawler 1.0
153. Phantom
154. PiltdownMan
155. Pioneer
156. html_analyzer
157. Portal Juice Spider
158. PGP Key Agent
159. PlumtreeWebAccessor
160. Poppi
161. PortalB Spider
162. GetterroboPlus Puu
163. The Python Robot
164. Raven Search
165. RBSE Spider
166. Resume Robot
167. RoadHouse Crawling System
168. Road Runner: The ImageScape Robot
169. Robbie the Robot
170. ComputingSite Robi/1.0
171. Robozilla
172. Roverbot
173. SafetyNet Robot
174. Scooter
175. Search.Aus-AU.COM
176. SearchProcess
177. Senrigan
178. SG-Scout
179. ShagSeeker
180. Shai'Hulud
181. Sift
182. Simmany Robot Ver1.0
183. Site Valet
184. Open Text Index Robot
185. SiteTech-Rover
186. SLCrawler
187. Inktomi Slurp
188. Smart Spider
189. Snooper
190. Solbot
191. Spanner
192. Speedy Spider
193. spider_monkey
194. SpiderBot
195. SpiderMan
196. SpiderView(tm)
197. Spry Wizard Robot
198. Site Searcher
199. Suke
200. suntek search engine
201. Sven
202. TACH Black Widow
203. Tarantula
204. tarspider
205. Tcl W3 Robot
206. TechBOT
207. Templeton
208. TeomaTechnologies
209. TitIn
210. TITAN
211. The TkWWW Robot
212. TLSpider
213. UCSD Crawl
214. UdmSearch
215. URL Check
216. URL Spider Pro
217. Valkyrie
218. Victoria
219. vision-search
220. Voyager
221. VWbot
222. The NWI Robot
223. W3M2
224. the World Wide Web Wanderer
225. WebBandit Web Spider
226. WebCatcher
227. WebCopy
228. webfetcher
229. The Webfoot Robot
230. weblayers
231. WebLinker
232. WebMirror
233. The Web Moose
234. WebQuest
235. Digimarc MarcSpider
236. WebReaper
237. webs
238. Websnarf
239. WebSpider
240. WebVac
241. webwalk
242. WebWalker
243. WebWatch
244. Wget
245. whatUseek Winona
246. WhoWhere Robot
247. w3mir
248. WebStolperer
249. The Web Wombat
250. The World Wide Web Worm
251. WWWC Ver 0.2.5
252. WebZinger
253. XGET
254. Nederland.zoek
The Web Robots Database
http://info.webcrawler.com/mak/projects/robots/active/html/index.html (8 of 8) [18.02.2001 13:14:44]
Database of Web Robots, View Type
The Web Robots Pages
Database of Web Robots, View Type
Alternatively you can View Contact, or see the Overview.
Name
Details
Acme.Spider
Purpose: indexing maintenance statistics
Availability: source
Platform: java
Ahoy! The Homepage Finder
Purpose: maintenance
Availability: none
Platform: UNIX
Alkaline
Purpose: indexing
Availability: binary
Platform: unix windows95 windowsNT
Walhello appie
Purpose: indexing
Availability: none
Platform: windows98
Arachnophilia
Purpose:
Availability:
Platform:
ArchitextSpider
Purpose: indexing, statistics
Availability:
Platform:
Aretha
Purpose:
Availability:
Platform: Macintosh
ARIADNE
Purpose: statistics, development of focused crawling
strategies
Availability: none
Platform: java
arks
Purpose: indexing
Availability: data
Platform: PLATFORM INDEPENDENT
ASpider (Associative Spider)
Purpose: indexing
Availability:
Platform: unix
ATN Worldwide
Purpose: indexing
Availability:
Platform:
Atomz.com Search Robot
Purpose: indexing
Availability: service
Platform: unix
AURESYS
Purpose: indexing,statistics
Availability: Protected by Password
Platform: Aix, Unix
BackRub
Purpose: indexing, statistics
Availability:
Platform:
unnamed
Purpose: Copyright Infringement Tracking
Availability: 24/7
Platform: NT
Big Brother
Purpose: maintenance
Availability: binary
Platform: mac
Bjaaland
Purpose: indexing
Availability: none
Platform: unix
BlackWidow
Purpose: indexing, statistics
Availability:
Platform:
Die Blinde Kuh
Purpose: indexing
Availability: none
Platform: unix
Bloodhound
Purpose: Web Site Download
Availability: Executable
Platform: Windows95, WindowsNT, Windows98,
Windows2000
bright.net caching robot
Purpose: caching
Availability: none
Platform:
BSpider
Purpose: indexing
Availability: none
Platform: Unix
CACTVS Chemistry Spider
Purpose: indexing.
Availability:
Platform:
Calif
Purpose: indexing
Availability: none
Platform: unix
Cassandra
Purpose: indexing
Availability: none
Platform: crossplatform
Digimarc Marcspider/CGI
Purpose: maintenance
Availability: none
Platform: windowsNT
Checkbot
Purpose: maintenance
Availability: source
Platform: unix,WindowsNT
churl
Purpose: maintenance
Availability:
Platform:
CMC/0.01
Purpose: maintenance
Availability: none
Platform: unix
Collective
Purpose: Collective is a highly configurable program
designed to interrogate online search engines and
online databases. It ignores web pages that lie
about their content, as well as dead URLs, and it can be super
strict: it searches each web page it finds for your
search terms to ensure those terms are present. Any
positive URLs are added to an HTML file for you to
view at any time, even before the program has
finished. Collective can wander the web for days if
required.
Availability: Executable
Platform: Windows95, WindowsNT, Windows98,
Windows2000
Combine System
Purpose: indexing
Availability: source
Platform: unix
Conceptbot
Purpose: indexing
Availability: data
Platform: unix
CoolBot
Purpose: indexing
Availability: none
Platform: unix
Web Core / Roots
Purpose: indexing, maintenance
Availability:
Platform:
XYLEME Robot
Purpose: indexing
Availability: data
Platform: unix
Internet Cruiser Robot
Purpose: indexing
Availability: none
Platform: unix
Cusco
Purpose: indexing
Availability: none
Platform: any
CyberSpyder Link Test
Purpose: link validation, some html validation
Availability: binary
Platform: windows 3.1x, windows95, windowsNT
DeWeb(c) Katalog/Index
Purpose: indexing, mirroring, statistics
Availability:
Platform:
DienstSpider
Purpose: indexing
Availability: none
Platform: unix
Digger
Purpose: indexing
Availability: none
Platform: unix, windows
Digital Integrity Robot
Purpose: WWW Indexing
Availability: none
Platform: unix
Direct Hit Grabber
Purpose: Indexing and statistics
Availability:
Platform: unix
DNAbot
Purpose: indexing
Availability: data
Platform: unix, windows, windows95, windowsNT, mac
DownLoad Express
Purpose: graphic download
Availability: binary
Platform: win95/98/NT
DragonBot
Purpose: indexing
Availability: none
Platform: windowsNT
DWCP (Dridus' Web
Cataloging Project)
Purpose: indexing, statistics
Availability: source, binary, data
Platform: java
e-collector
Purpose: email collector
Availability: Binary
Platform: Windows 9*/NT/2000
EbiNess
Purpose: statistics
Availability: Open Source
Platform: unix(Linux)
EIT Link Verifier Robot
Purpose: maintenance
Availability:
Platform:
Emacs-w3 Search Engine
Purpose: indexing
Availability:
Platform:
ananzi
Purpose: indexing
Availability:
Platform:
Esther
Purpose: indexing
Availability: data
Platform: unix (FreeBSD 2.2.8)
Evliya Celebi
Purpose: indexing turkish content
Availability: source
Platform: unix
nzexplorer
Purpose: indexing, statistics
Availability: source (commercial)
Platform: UNIX
Fluid Dynamics Search
Engine robot
Purpose: indexing
Availability: source;data
Platform: unix;windows
Felix IDE
Purpose: indexing, statistics
Availability: binary
Platform: windows95, windowsNT
Wild Ferret Web Hopper #1,
#2, #3
Purpose: indexing maintenance statistics
Availability:
Platform:
FetchRover
Purpose: maintenance, statistics
Availability: binary, source
Platform: Windows/NT, Windows/95, Solaris SPARC
fido
Purpose: indexing
Availability: none
Platform: Unix
Hämähäkki
Purpose: indexing
Availability: no
Platform: UNIX
KIT-Fireball
Purpose: indexing
Availability: none
Platform: unix
Fish search
Purpose: indexing
Availability: binary
Platform:
Fouineur
Purpose: indexing, statistics
Availability: none
Platform: unix, windows
Robot Francoroute
Purpose: indexing, mirroring, statistics
Availability:
Platform:
Freecrawl
Purpose: indexing
Availability: none
Platform: unix
FunnelWeb
Purpose: indexing, statistics
Availability:
Platform:
gazz
Purpose: statistics
Availability: none
Platform: unix
GCreep
Purpose: indexing
Availability: none
Platform: linux+mysql
GetBot
Purpose: maintenance
Availability:
Platform:
GetURL
Purpose: maintenance, mirroring
Availability:
Platform:
Golem
Purpose: maintenance
Availability: none
Platform: mac
Googlebot
Purpose: indexing statistics
Availability:
Platform:
Grapnel/0.01 Experiment
Purpose: Indexing
Availability: None, yet
Platform: WinNT
Griffon
Purpose: indexing
Availability: none
Platform: unix
Gromit
Purpose: indexing
Availability: none
Platform: unix
Northern Light Gulliver
Purpose: indexing
Availability: none
Platform: unix
HamBot
Purpose: indexing
Availability: none
Platform: unix, Windows95
Harvest
Purpose: indexing
Availability:
Platform:
havIndex
Purpose: indexing
Availability: binary
Platform: Java VM 1.1
HI (HTML Index) Search
Purpose: indexing
Availability:
Platform:
Hometown Spider Pro
Purpose: indexing
Availability: none
Platform: windowsNT
Wired Digital
Purpose: indexing
Availability: none
Platform: unix
ht://Dig
Purpose: indexing
Availability: source
Platform: unix
HTMLgobble
Purpose: mirror
Availability:
Platform:
Hyper-Decontextualizer
Purpose: indexing
Availability:
Platform:
IBM_Planetwide
Purpose: indexing, maintenance, mirroring
Availability:
Platform:
Popular Iconoclast
Purpose: statistics
Availability: source
Platform: unix (OpenBSD)
Ingrid
Purpose: Indexing
Availability: Commercial as part of search engine package
Platform: UNIX
Imagelock
Purpose: maintenance
Availability: none
Platform: windows95
IncyWincy
Purpose:
Availability:
Platform:
Informant
Purpose: indexing
Availability: none
Platform: unix
InfoSeek Robot 1.0
Purpose: indexing
Availability:
Platform:
Infoseek Sidewinder
Purpose: indexing
Availability:
Platform:
InfoSpiders
Purpose: search
Availability: none
Platform: unix, mac
Inspector Web
Purpose: maintenance: link validation, html validation,
image size validation, etc.
Availability: free service and more extensive commercial service
Platform: unix
IntelliAgent
Purpose: indexing
Availability:
Platform:
I, Robot
Purpose: indexing
Availability: none
Platform: unix
Iron33
Purpose: indexing, statistics
Availability: source
Platform: unix
Israeli-search
Purpose: indexing.
Availability:
Platform:
JavaBee
Purpose: Stealing Java Code
Availability: binary
Platform: Java
JBot Java Web Robot
Purpose: indexing
Availability: source
Platform: Java
JCrawler
Purpose: indexing
Availability: none
Platform: unix
Jeeves
Purpose: indexing maintenance statistics
Availability: none
Platform: UNIX
Jobot
Purpose: standalone
Availability:
Platform:
JoeBot
Purpose:
Availability:
Platform:
The Jubii Indexing Robot
Purpose: indexing, maintenance
Availability:
Platform:
JumpStation
Purpose: indexing
Availability:
Platform:
Katipo
Purpose: maintenance
Availability: binary
Platform: Macintosh
KDD-Explorer
Purpose: indexing
Availability: none
Platform: unix
Kilroy
Purpose: indexing,statistics
Availability: none
Platform: unix,windowsNT
KO_Yappo_Robot
Purpose: indexing
Availability: none
Platform: unix
LabelGrabber
Purpose: Grabs PICS labels from web pages, submits them to
a label bureau
Availability: source
Platform: windows, windows95, windowsNT, unix
larbin
Purpose: Your imagination is the only limit
Availability: source (GPL), mail me for customization
Platform: Linux
legs
Purpose: indexing
Availability: none
Platform: linux
Link Validator
Purpose: maintenance
Availability: none
Platform: unix, windows
LinkScan
Purpose: Link checker, SiteMapper, and HTML Validator
Availability: Program is shareware
Platform: Unix, Linux, Windows 98/NT
LinkWalker
Purpose: maintenance, statistics
Availability: none
Platform: windowsNT
Lockon
Purpose: indexing
Availability: none
Platform: UNIX
logo.gif Crawler
Purpose: indexing
Availability: none
Platform: unix
Lycos
Purpose: indexing
Availability:
Platform:
Mac WWWWorm
Purpose: indexing
Availability: none
Platform: Macintosh
Magpie
Purpose: indexing, statistics
Availability:
Platform: unix
Mattie
Purpose: MP3 Spider
Availability: None
Platform: Windows 2000
MediaFox
Purpose: indexing and maintenance
Availability: none
Platform: (Java)
MerzScope
Purpose: WebMapping
Availability: binary
Platform: (Java Based) unix,windows95,windowsNT,os2,mac
etc ..
NEC-MeshExplorer
Purpose: indexing
Availability: none
Platform: unix
MindCrawler
Purpose: indexing
Availability: none
Platform: linux
moget
Purpose: indexing,statistics
Availability: none
Platform: unix
MOMspider
Purpose: maintenance, statistics
Availability: source
Platform: UNIX
Monster
Purpose: maintenance, mirroring
Availability: binary
Platform: UNIX (Linux)
Motor
Purpose: indexing
Availability: data
Platform: mac
Muscat Ferret
Purpose: indexing
Availability: none
Platform: unix
Mwd.Search
Purpose: indexing
Availability: none
Platform: unix (Linux)
Internet Shinchakubin
Purpose: find new links and changed pages
Availability: binary as bundled software
Platform: Windows98
NetCarta WebMap Engine
Purpose: indexing, maintenance, mirroring, statistics
Availability:
Platform:
NetMechanic
Purpose: Link and HTML validation
Availability: via web page
Platform: UNIX
NetScoop
Purpose: indexing
Availability: none
Platform: UNIX
newscan-online
Purpose: indexing
Availability: binary
Platform: Linux
NHSE Web Forager
Purpose: indexing
Availability:
Platform:
Nomad
Purpose: indexing
Availability:
Platform:
The NorthStar Robot
Purpose: indexing
Availability:
Platform:
Occam
Purpose: indexing
Availability: none
Platform: unix
HKU WWW Octopus
Purpose: indexing
Availability:
Platform:
Orb Search
Purpose: indexing
Availability: data
Platform: unix
Pack Rat
Purpose: both maintenance and mirroring
Availability: at the moment, none...source when developed.
Platform: unix
PageBoy
Purpose: indexing
Availability: none
Platform: unix
ParaSite
Purpose: indexing
Availability: none
Platform: windowsNT
Patric
Purpose: statistics
Availability: data
Platform: unix
pegasus
Purpose: indexing
Availability: source, binary
Platform: unix
The Peregrinator
Purpose:
Availability:
Platform:
PerlCrawler 1.0
Purpose: indexing
Availability: source
Platform: unix
Phantom
Purpose: indexing
Availability:
Platform: Macintosh
PiltdownMan
Purpose: statistics
Availability: none
Platform: windows95, windows98, windowsNT
Pioneer
Purpose: indexing, statistics
Availability:
Platform:
html_analyzer
Purpose: maintenance
Availability:
Platform:
Portal Juice Spider
Purpose: indexing, statistics
Availability: none
Platform: unix
PGP Key Agent
Purpose: indexing
Availability: none
Platform: UNIX, Windows NT
PlumtreeWebAccessor
Purpose: indexing for the Plumtree Server
Availability: none
Platform: windowsNT
Poppi
Purpose: indexing
Availability: none
Platform: unix/linux
PortalB Spider
Purpose: indexing
Availability: none
Platform: unix
GetterroboPlus Puu
Purpose: gathering: gathers data from Puu's own standard TAG,
which contains the information of the sites registered with the
search engine; maintenance: link validation
Availability: none
Platform: unix
The Python Robot
Purpose:
Availability: none
Platform:
Raven Search
Purpose: Indexing: gather content for commercial query
engine.
Availability: None
Platform: Unix, Windows98, WindowsNT, Windows2000
RBSE Spider
Purpose: indexing, statistics
Availability:
Platform:
Resume Robot
Purpose: indexing.
Availability:
Platform:
RoadHouse Crawling System
Purpose:
Availability: none
Platform:
Road Runner: The
ImageScape Robot
Purpose: indexing
Availability:
Platform: UNIX
Robbie the Robot
Purpose: indexing
Availability: none
Platform: unix, windows95, windowsNT
ComputingSite Robi/1.0
Purpose: indexing,maintenance
Availability:
Platform: UNIX
Robozilla
Purpose: maintenance
Availability: none
Platform:
Roverbot
Purpose: indexing
Availability:
Platform:
SafetyNet Robot
Purpose: indexing.
Availability:
Platform:
Scooter
Purpose: indexing
Availability: none
Platform: unix
Search.Aus-AU.COM
Purpose: - indexing: gather content for an indexing service
Availability: - none
Platform: - mac - unix - windows95 - windowsNT
SearchProcess
Purpose: statistics
Availability: none
Platform: linux
Senrigan
Purpose: indexing
Availability: none
Platform: Java
SG-Scout
Purpose: indexing
Availability:
Platform:
ShagSeeker
Purpose: indexing
Availability: data
Platform: unix
Shai'Hulud
Purpose: mirroring
Availability: source
Platform: unix
Sift
Purpose: indexing
Availability: data
Platform: unix
Simmany Robot Ver1.0
Purpose: indexing, maintenance, statistics
Availability: none
Platform: unix
Site Valet
Purpose: maintenance
Availability: data
Platform: unix
Open Text Index Robot
Purpose: indexing
Availability: inquire to markk@opentext.com (Mark Kraatz)
Platform: UNIX
SiteTech-Rover
Purpose: indexing
Availability:
Platform:
SLCrawler
Purpose: To build the site map.
Availability: none
Platform: windows, windows95, windowsNT
Inktomi Slurp
Purpose: indexing, statistics
Availability: none
Platform: unix
Smart Spider
Purpose: indexing
Availability: data, binary, source
Platform: windows95, windowsNT
Snooper
Purpose:
Availability: none
Platform:
Solbot
Purpose: indexing
Availability: none
Platform: unix
Spanner
Purpose: indexing,maintenance
Availability: source
Platform: unix
Speedy Spider
Purpose: indexing
Availability: none
Platform: Windows
spider_monkey
Purpose: gather content for a free indexing service
Availability: bulk data gathered by robot available
Platform: unix
SpiderBot
Purpose: indexing, mirroring
Availability: source, binary, data
Platform: unix, windows, windows95, windowsNT
SpiderMan
Purpose: user searching using IR technique
Availability: binary&source
Platform: Java 1.2
SpiderView(tm)
Purpose: maintenance
Availability: source
Platform: unix, nt
Spry Wizard Robot
Purpose: indexing
Availability:
Platform:
Site Searcher
Purpose: indexing
Availability: binary
Platform: windows95, windows98, windowsNT
Suke
Purpose: indexing
Availability: source
Platform: FreeBSD3.*
suntek search engine
Purpose: to create a search portal on Asian web sites
Availability: available now
Platform: NT, Linux, UNIX
Sven
Purpose: indexing
Availability: none
Platform: Windows
TACH Black Widow
Purpose: maintenance: link validation
Availability: none
Platform: UNIX, Linux
Tarantula
Purpose: indexing
Availability: none
Platform: unix
tarspider
Purpose: mirroring
Availability:
Platform:
Tcl W3 Robot
Purpose: maintenance, statistics
Availability:
Platform:
TechBOT
Purpose: statistics, maintenance
Availability: none
Platform: Unix
Templeton
Purpose: mirroring, mapping, automating web applications
Availability: binary
Platform: OS/2, Linux, SunOS, Solaris
TeomaTechnologies
Purpose:
Availability: none
Platform:
TitIn
Purpose: indexing, statistics
Availability: data, source on request
Platform: unix
TITAN
Purpose: indexing
Availability: no
Platform: SunOS 4.1.4
The TkWWW Robot
Purpose: indexing
Availability:
Platform:
TLSpider
Purpose: to get web sites and add them to the topiclink future
directory
Availability: none
Platform: linux
UCSD Crawl
Purpose: indexing, statistics
Availability:
Platform:
UdmSearch
Purpose: indexing, validation
Availability: source, binary
Platform: unix
URL Check
Purpose: maintenance
Availability: binary
Platform: unix
URL Spider Pro
Purpose: indexing
Availability: binary
Platform: Windows9x/NT
Valkyrie
Purpose: indexing
Availability: none
Platform: unix
Victoria
Purpose: maintenance
Availability: none
Platform: unix
vision-search
Purpose: indexing.
Availability:
Platform:
Voyager
Purpose: indexing, maintenance
Availability: none
Platform: unix
VWbot
Purpose: indexing
Availability: source
Platform: unix
The NWI Robot
Purpose: discovery,statistics
Availability: none (at the moment)
Platform: UNIX
W3M2
Purpose: indexing, maintenance, statistics
Availability:
Platform:
the World Wide Web
Wanderer
Purpose: statistics
Availability: data
Platform: unix
WebBandit Web Spider
Purpose: Resource Gathering / Server Benchmarking
Availability: source, binary
Platform: Intel - windows95
WebCatcher
Purpose: indexing
Availability: none
Platform: unix, windows, mac
WebCopy
Purpose: mirroring
Availability:
Platform:
webfetcher
Purpose: mirroring
Availability:
Platform:
The Webfoot Robot
Purpose:
Availability:
Platform:
weblayers
Purpose: maintenance
Availability:
Platform:
WebLinker
Purpose: maintenance
Availability:
Platform:
WebMirror
Purpose: mirroring
Availability:
Platform: Windows95
The Web Moose
Purpose: statistics, maintenance
Availability: data
Platform: Windows NT
WebQuest
Purpose: indexing
Availability: none
Platform: unix
Digimarc MarcSpider
Purpose: maintenance
Availability: none
Platform: windowsNT
WebReaper
Purpose: indexing/offline browsing
Availability: binary
Platform: windows95, windowsNT
webs
Purpose: statistics
Availability: none
Platform: unix
Websnarf
Purpose:
Availability:
Platform:
WebSpider
Purpose: maintenance, link diagnostics
Availability:
Platform:
WebVac
Purpose: mirroring
Availability:
Platform:
webwalk
Purpose: indexing, maintenance, mirroring, statistics
Availability:
Platform:
WebWalker
Purpose: maintenance
Availability: source
Platform: unix
WebWatch
Purpose: maintenance, statistics
Availability:
Platform:
Wget
Purpose: mirroring, maintenance
Availability: source
Platform: unix
whatUseek Winona
Purpose: Robot used for site-level search and meta-search
engines.
Availability: none
Platform: unix
WhoWhere Robot
Purpose: indexing
Availability: none
Platform: Sun Unix
w3mir
Purpose: mirroring.
Availability:
Platform: UNIX, WindowsNT
WebStolperer
Purpose: indexing
Availability: none
Platform: unix, NT
The Web Wombat
Purpose: indexing, statistics.
Availability:
Platform:
The World Wide Web Worm
Purpose: indexing
Availability:
Platform:
WWWC Ver 0.2.5
Purpose: maintenance
Availability: binary
Platform: windows, windows95, windowsNT
WebZinger
Purpose: indexing
Availability: binary
Platform: windows95, windowsNT 4, mac, solaris, unix
XGET
Purpose: mirroring
Availability: binary
Platform: X68000, X68030
Nederland.zoek
Purpose: indexing
Availability: none
Platform: unix (Linux)
The Web Robots Database
http://info.webcrawler.com/mak/projects/robots/active/html/type.html (25 of 25) [18.02.2001 13:15:11]
Database of Web Robots, View Contact
The Web Robots Pages
Database of Web Robots, View Contact
Alternatively you can View Type, or see the Overview.
Name
Details
Acme.Spider
Agent: Due to a deficiency in Java it's not currently possible to set the
User-Agent.
Host: *
Email: jef@acme.com
Ahoy! The Homepage
Finder
Agent: 'Ahoy! The Homepage Finder'
Host: cs.washington.edu
Email: marclang@cs.washington.edu
Alkaline
Agent: AlkalineBOT
Host: *
Email: dblock@vestris.com
Walhello appie
Agent: appie/1.1
Host: 213.10.10.116, 213.10.10.117, 213.10.10.118
Email: aimo@walhello.com
Arachnophilia
Agent: Arachnophilia
Host: halsoft.com
Email: taluskie@utpapa.ph.utexas.edu
ArchitextSpider
Agent: ArchitextSpider
Host: *.atext.com
Email: spider@atext.com
Aretha
Agent:
Host:
Email: davew@well.com
ARIADNE
Agent: Due to a deficiency in Java it's not currently possible to set the
User-Agent.
Host: dbs.informatik.uni-muenchen.de
Email: Gross@dbs.informatik.uni-muenchen.de
arks
Agent: arks/1.0
Host: dpsindia.com
Email: aniruddha.c@usa.net
ASpider (Associative
Spider)
Agent: ASpider/0.09
Host: nova.pvv.unit.no
Email: fredj@pvv.ntnu.no
ATN Worldwide
Agent: ATN_Worldwide
Host: www.allthatnet.com
Email: info@allthatnet.com
Atomz.com Search
Robot
Agent: Atomz/1.0
Host: www.atomz.com
Email: mike@atomz.com
AURESYS
Agent: AURESYS/1.0
Host: crrm.univ-mrs.fr, 192.134.99.192
Email: mannina@crrm.univ-mrs.fr
BackRub
Agent: BackRub/*.*
Host: *.stanford.edu
Email: page@leland.stanford.edu
unnamed
Agent: BaySpider
Host:
Email:
Big Brother
Agent: Big Brother
Host: *
Email: Francois.Pottier@inria.fr
Bjaaland
Agent: Bjaaland/0.5
Host: barry.bitmovers.net
Email: tbray@textuality.com
BlackWidow
Agent: BlackWidow
Host: 140.190.65.*
Email: khooghee@marys.smumn.edu
Die Blinde Kuh
Agent: Die Blinde Kuh
Host: minerva.sozialwiss.uni-hamburg.de
Email: maschinist@blinde-kuh.de
Bloodhound
Agent: None
Host: *
Email: genius@ukonline.co.uk
bright.net caching robot
Agent: Mozilla/3.01 (compatible;)
Host: 209.143.1.46
Email:
BSpider
Agent: BSpider/1.0 libwww-perl/0.40
Host: 210.159.73.34, 210.159.73.35
Email: okumura@rsl.crl.fujixerox.co.jp
CACTVS Chemistry
Spider
Agent: CACTVS Chemistry Spider
Host: utamaro.organik.uni-erlangen.de
Email: wdi@eros.ccc.uni-erlangen.de
Calif
Agent: Calif/0.6 (kosarev@tnps.net; http://www.tnps.dp.ua)
Host: cobra.tnps.dp.ua
Email: kosarev@tnps.net
Cassandra
Agent:
Host: www.aha.ru
Email: billy168@aha.ru
Digimarc
Marcspider/CGI
Agent: Digimarc CGIReader/1.0
Host: 206.102.3.*
Email: wmreader@digimarc.com
Checkbot
Agent: Checkbot/x.xx LWP/5.x
Host: *
Email: graaff@xs4all.nl
churl
Agent:
Host:
Email: yunke@umich.edu
CMC/0.01
Agent: CMC/0.01
Host: haruna.next.ne.jp, 203.183.218.4
Email: shinobu@po.next.ne.jp
Collective
Agent: LWP
Host: *
Email: genius@ukonline.co.uk
Combine System
Agent: combine/0.0
Host: *.ub2.lu.se
Email: tsao@munin.ub2.lu.se
Conceptbot
Agent: conceptbot/0.3
Host: router.sifry.com
Email: david@sifry.com
CoolBot
Agent: CoolBot
Host: www.suchmaschine21.de
Email: info@suchmaschine21.de
Web Core / Roots
Agent: root/0.1
Host: shiva.di.uminho.pt, from www.di.uminho.pt
Email: wc@di.uminho.pt
XYLEME Robot
Agent: cosmos/0.3
Host:
Email: preda@xyleme.com
Internet Cruiser Robot
Agent: Internet Cruiser Robot/2.1
Host: *.krstarica.com
Email: robot@krstarica.com
Cusco
Agent: Cusco/3.2
Host: *.cusco.pt, *.viatecla.pt
Email: clerigo@viatecla.pt
CyberSpyder Link Test
Agent: CyberSpyder/2.1
Host: *
Email: amant@cyberspyder.com
DeWeb(c) Katalog/Index
Agent: Deweb/1.01
Host: deweb.orbit.de
Email: dewebmaster@orbit.de
DienstSpider
Agent: dienstspider/1.0
Host: sappho.csi.forth.gr
Email: asidirop@csi.forth.gr
Digger
Agent: Digger/1.0 JDK/1.3.0
Host:
Email: admin@diggit.com
Digital Integrity Robot
Agent: DIIbot
Host: digital-integrity.com
Email: robot@digital-integrity.com
Direct Hit Grabber
Agent: grabber
Host: *.directhit.com
Email: DirectHitGrabber@directhit.com
DNAbot
Agent: DNAbot/1.0
Host: xx.dnainc.co.jp
Email: tomatell@xx.dnainc.co.jp
DownLoad Express
Agent:
Host: *
Email: dlxpress@mediaone.net
DragonBot
Agent: DragonBot/1.0 libwww/5.0
Host: *.paczone.com
Email: admin@paczone.com
DWCP (Dridus' Web
Cataloging Project)
Agent: DWCP/2.0
Host: *.dridus.com
Email: rmm@dridus.com
e-collector
Agent: LWP::
Host: *
Email: smarty@thatrobotsite.com
EbiNess
Agent: EbiNess/0.01a
Host:
Email: mdavis@kieser.net
EIT Link Verifier Robot
Agent: EIT-Link-Verifier-Robot/0.2
Host: *
Email: mcguire@eit.COM
Emacs-w3 Search
Engine
Agent: Emacs-w3/v[0-9\.]+
Host: *
Email: wmperry@spry.com
ananzi
Agent: EMC Spider
Host: bilbo.internal.empirical.com
Email: hpayne@u-media.com
Esther
Agent: esther
Host: *.falconsoft.com
Email: tim@falconsoft.com
Evliya Celebi
Agent: Evliya Celebi v0.151 - http://ilker.ulak.net.tr
Host: 193.140.83.*
Email: ilker@ulak.net.tr
nzexplorer
Agent: explorersearch
Host: bitz.co.nz
Email: paul@bourke.gen.nz
Fluid Dynamics Search
Engine robot
Agent: Mozilla/4.0 (compatible: FDSE robot)
Host: yes
Email: zoltanm@nickname.net
Felix IDE
Agent: FelixIDE/1.0
Host: *
Email: felix@pentone.com
Wild Ferret Web Hopper
#1, #2, #3
Agent: Hazel's Ferret Web hopper,
Host:
Email: ghbos@postoffice.worldnet.att.net
FetchRover
Agent: ESIRover v1.0
Host: *
Email: ken@engsoftware.com
fido
Agent: fido/0.9 Harvest/1.4.pl2
Host: fido.planetsearch.com, *.planetsearch.com, 206.64.113.*
Email: fido@planetsearch.com
Hämähäkki
Agent: Hämähäkki/0.2
Host: *.www.fi
Email: Timo.Metsala@www.fi
KIT-Fireball
Agent: KIT-Fireball/2.0 libwww/5.0a
Host: *.fireball.de
Email: info@fireball.de
Fish search
Agent: Fish-Search-Robot
Host: www.win.tue.nl
Email: debra@win.tue.nl
Fouineur
Agent: Mozilla/2.0 (compatible fouineur v2.0; fouineur.9bit.qc.ca)
Host: *
Email: jvandal@9bit.qc.ca
Robot Francoroute
Agent: Robot du CRIM 1.0a
Host: zorro.crim.ca
Email: maparent@crim.ca
Freecrawl
Agent: Freecrawl
Host: *.freeside.net
Email: ekhall@freeside.net
FunnelWeb
Agent: FunnelWeb-1.0
Host: earth.planets.com.au
Email: eaglesd@pc.com.au
gazz
Agent: gazz/1.0
Host: *.nttrd.com, *.infobee.ne.jp
Email: gazz@nttrd.com
GCreep
Agent: gcreep/1.0
Host: mbx.instrumentpolen.se
Email: anders@instrumentpolen.se
GetBot
Agent: ???
Host:
Email: zav@macromedia.com
GetURL
Agent: GetURL.rexx v1.05
Host: *
Email: James@Snark.apana.org.au
Golem
Agent: Golem/1.1
Host: *.quibble.com
Email: geoff@quibble.com
Googlebot
Agent: Googlebot/2.0 beta (googlebot(at)googlebot.com)
Host: *.googlebot.com
Email: googlebot@googlebot.com
Grapnel/0.01 Experiment
Agent:
Host: varies
Email: v93_kat@ce.kth.se
Griffon
Agent: griffon/1.0
Host: *.navi.ocn.ne.jp
Email: griffon@super.navi.ocn.ne.jp
Gromit
Agent: Gromit/1.0
Host: *.austlii.edu.au
Email: dan@austlii.edu.au
Northern Light Gulliver
Agent: Gulliver/1.1
Host: scooby.northernlight.com, taz.northernlight.com,
gulliver.northernlight.com
Email: crawler@northernlight.com
HamBot
Agent:
Host: *.hamrad.com
Email: john@futureone.com
Harvest
Agent: yes
Host: bruno.cs.colorado.edu
Email:
havIndex
Agent: havIndex/X.xx[bxx]
Host: *
Email: havIndex@hav.com
HI (HTML Index)
Search
Agent: AITCSRobot/1.1
Host:
Email: a94385@cs.ait.ac.th
Hometown Spider Pro
Agent: Hometown Spider Pro
Host: 63.195.193.17
Email: admin@hometownsingles.com
Wired Digital
Agent: wired-digital-newsbot/1.5
Host: gossip.hotwired.com
Email: bowen@hotwired.com
ht://Dig
Agent: htdig/3.1.0b2
Host: *
Email: andrew@contigo.com
HTMLgobble
Agent: HTMLgobble v2.2
Host: tp70.rz.uni-karlsruhe.de
Email: ley@rz.uni-karlsruhe.de
Hyper-Decontextualizer
Agent: no
Host:
Email: cliff@tricon.net
IBM_Planetwide
Agent: IBM_Planetwide,
Host: www.ibm.com www2.ibm.com
Email: epc@www.ibm.com
Popular Iconoclast
Agent: gestaltIconoclast/1.0 libwww-FM/2.17
Host: gestalt.sewanee.edu
Email: chris@gestalt.sewanee.edu
Ingrid
Agent: INGRID/0.1
Host: bart.ilse.nl
Email: ilse@ilse.nl
Imagelock
Agent: Mozilla 3.01 PBWF (Win95)
Host: 209.111.133.*
Email: belanger@imagelock.com
IncyWincy
Agent: IncyWincy/1.0b1
Host: osiris.sunderland.ac.uk
Email: simon.stobart@sunderland.ac.uk
Informant
Agent: Informant
Host: informant.dartmouth.edu
Email: info_adm@cosmo.dartmouth.edu
InfoSeek Robot 1.0
Agent: InfoSeek Robot 1.0
Host: corp-gw.infoseek.com
Email: stk@infoseek.com
Infoseek Sidewinder
Agent: Infoseek Sidewinder
Host:
Email: mna@infoseek.com
InfoSpiders
Agent: InfoSpiders/0.1
Host: *.ucsd.edu
Email: fil@cs.ucsd.edu
Inspector Web
Agent: inspectorwww/1.0
http://www.greenpac.com/inspectorwww.html
Host: www.corpsite.com, www.greenpac.com, 38.234.171.*
Email: doug@greenpac.com
IntelliAgent
Agent: 'IAGENT/1.0'
Host: sand.it.bond.edu.au
Email: s1523@sand.it.bond.edu.au
I, Robot
Agent: I Robot 0.4 (irobot@chaos.dk)
Host: *.mame.dk, 206.161.121.*
Email: irobot@chaos.dk
Iron33
Agent: Iron33/0.0
Host: *.folon.ueda.info.waseda.ac.jp, 133.9.215.*
Email: watanabe@ueda.info.waseda.ac.jp
Israeli-search
Agent: IsraeliSearch/1.0
Host: dylan.ius.cs.cmu.edu
Email: etamar@xpert.co
JavaBee
Agent: JavaBee
Host: *
Email: info@objectbox.com
JBot Java Web Robot
Agent: JBot (but can be changed by the user)
Host: *
Email: daniel@matuschek.net
JCrawler
Agent: JCrawler/0.2
Host: db.netimages.com
Email: snowhare@netimages.com
Jeeves
Agent: Jeeves v0.05alpha (PERL, LWP, lglb@doc.ic.ac.uk)
Host: *.doc.ic.ac.uk
Email: lglb@doc.ic.ac.uk
Jobot
Agent: Jobot/0.1alpha libwww-perl/4.0
Host: supernova.micrognosis.com
Email: ajack@corp.micrognosis.com
JoeBot
Agent: JoeBot/x.x,
Host:
Email: rwaldin@primenet.com
The Jubii Indexing Robot
Agent: JubiiRobot/version#
Host: any host in the cybernet.dk domain
Email: jakob@jubii.dk
JumpStation
Agent: jumpstation
Host: *.stir.ac.uk
Email: j.fletcher@stirling.ac.uk
Katipo
Agent: Katipo/1.0
Host: *
Email: Michael.Newbery@vuw.ac.nz
KDD-Explorer
Agent: KDD-Explorer/0.1
Host: mlc.kddvw.kcom.or.jp
Email: matsu@lab.kdd.co.jp
Kilroy
Agent: yes
Host: *.oclc.org
Email: kilroy@oclc.org
KO_Yappo_Robot
Agent: KO_Yappo_Robot/1.0.4(http://yappo.com/info/robot.html)
Host: yappo.com,209.25.40.1
Email: office_KO@yappo.com
LabelGrabber
Agent: LabelGrab/1.1
Host: head.w3.org
Email: jamieson@mit.edu
larbin
Agent: larbin (+mail)
Host: *
Email: sebastien.ailleret@inria.fr
legs
Agent: legs
Host:
Email: admin@magportal.com
Link Validator
Agent: Linkidator/0.93
Host: *.mitre.org
Email: tgimon@mitre.org
LinkScan
Agent: LinkScan Server/5.5 | LinkScan Workstation/5.5
Host: *
Email: sales@elsop.com
LinkWalker
Agent: LinkWalker
Host: *.seventwentyfour.com
Email: rbryant@seventwentyfour.com
Lockon
Agent: Lockon/xxxxx
Host: *.hitech.tuis.ac.jp
Email: search@rsch.tuis.ac.jp
logo.gif Crawler
Agent: logo.gif crawler
Host: *.inm.de
Email: sevo@inm.de
Lycos
Agent: Lycos/x.x
Host: fuzine.mt.cs.cmu.edu, lycos.com
Email: fuzzy@cmu.edu
Mac WWWWorm
Agent:
Host:
Email: lemieuse@ERE.UMontreal.CA
Magpie
Agent: Magpie/1.0
Host: *.blueberry.co.uk, 194.70.52.*, 193.131.167.144
Email: Keith.Jones@blueberry.co.uk
Mattie
Agent: AO/A-T.IDRG v2.3
Host: mattie.mcw.aarkayn.org
Email: matt@mcw.aarkayn.org
MediaFox
Agent: MediaFox/x.y
Host: 141.99.*.*
Email: sfx@uni-media.de
MerzScope
Agent: MerzScope
Host: (Client Based)
Email:
NEC-MeshExplorer
Agent: NEC-MeshExplorer
Host: meshsv300.tk.mesh.ad.jp
Email: web-dir@mxa.meshnet.or.jp
MindCrawler
Agent: MindCrawler
Host: *
Email: support@mindpass.com
moget
Agent: moget/1.0
Host: *.goo.ne.jp
Email: moget@goo.ne.jp
MOMspider
Agent: MOMspider/1.00 libwww-perl/0.40
Host: *
Email: fielding@ics.uci.edu
Monster
Agent: Monster/vX.X.X -$TYPE ($OSTYPE)
Host: wild.stu.neva.ru
Email: diwil@wild.stu.neva.ru
Motor
Agent: Motor/0.2
Host: Michael.cybercon.technopark.gmd.de
Email: Motor@cybercon.technopark.gmd.de
Muscat Ferret
Agent: MuscatFerret/
Host: 193.114.89.*, 194.168.54.11
Email: olly@muscat.co.uk
Mwd.Search
Agent: MwdSearch/0.1
Host: *.fifi.net
Email: Antti.Westerberg@mwd.sci.fi
Internet Shinchakubin
Agent: User-Agent: Mozilla/4.0 (compatible; sharp-info-agent v1.0; )
Host: *
Email: shinchakubin-request@isl.nara.sharp.co.jp
NetCarta WebMap
Engine
Agent: NetCarta CyberPilot Pro
Host:
Email: info@netcarta.com
NetMechanic
Agent: NetMechanic
Host: 206.26.168.18
Email: tdahm@iquest.com
NetScoop
Agent: NetScoop/1.0 libwww/5.0a
Host: alpha.is.tokushima-u.ac.jp, beta.is.tokushima-u.ac.jp
Email: kita@is.tokushima-u.ac.jp
newscan-online
Agent: newscan-online/1.1
Host: *newscan-online.de
Email: mueller@newscan-online.de
NHSE Web Forager
Agent: NHSEWalker/3.0
Host: *.mcs.anl.gov
Email: olson@mcs.anl.gov
Nomad
Agent: Nomad-V2.x
Host: *.cs.colostate.edu
Email: sonnen@cs.colostat.edu
The NorthStar Robot
Agent: NorthStar
Host: frognot.utdallas.edu, utdallas.edu, cnidir.org
Email: barrie@unr.edu
Occam
Agent: Occam/1.0
Host: gentian.cs.washington.edu, sekiu.cs.washington.edu,
saxifrage.cs.washington.edu
Email: friedman@cs.washington.edu
HKU WWW Octopus
Agent: HKU WWW Robot,
Host: phoenix.cs.hku.hk
Email: jax@cs.hku.hk
Orb Search
Agent: Orbsearch/1.0
Host: cow.dyn.ml.org, *.dyn.ml.org
Email: webernet@geocities.com
Pack Rat
Agent: PackRat/1.0
Host: cps.msu.edu
Email: dexterte@cps.msu.edu
PageBoy
Agent: PageBoy/1.0
Host: *.webdocs.org
Email: pageboy@webdocs.org
ParaSite
Agent: ParaSite/0.21 (http://www.ianett.com/parasite/)
Host: *.ianett.com
Email: parasite@ianett.com
Patric
Agent: Patric/0.01a
Host: *.nwnet.net
Email: webmaster@nwnet.net
pegasus
Agent: web robot PEGASUS
Host: *
Email: shannon@opensource.or.id
The Peregrinator
Agent: Peregrinator-Mathematics/0.7
Host:
Email: jimr@maths.su.oz.au
PerlCrawler 1.0
Agent: PerlCrawler/1.0 Xavatoria/2.0
Host: server5.hypermart.net
Email: webmaster@perlsearch.hypermart.net
Phantom
Agent: Duppies
Host:
Email: lburke@aktiv.com
PiltdownMan
Agent: PiltdownMan/1.0 profitnet@myezmail.com
Host: 62.36.128.*, 194.133.59.*, 212.106.215.*
Email: profitnet@myezmail.com
Pioneer
Agent: Pioneer
Host: *.uncfsu.edu or flyer.ncsc.org
Email: micah@sequent.uncfsu.edu
html_analyzer
Agent:
Host:
Email: pitkow@aries.colorado.edu
Portal Juice Spider
Agent: PortalJuice.com/4.0
Host: *.portaljuice.com, *.nextopia.com
Email: pjspider@portaljuice.com
PGP Key Agent
Agent: PGP-KA/1.2
Host: salerno.starnet.it
Email: puma@comm2000.it
PlumtreeWebAccessor
Agent: PlumtreeWebAccessor/0.9
Host:
Email: josephs@plumtree.com
Poppi
Agent: Poppi/1.0
Host:
Email:
PortalB Spider
Agent: PortalBSpider/1.0 (spider@portalb.com)
Host: spider1.portalb.com, spider2.portalb.com, etc.
Email: spider@portalb.com
GetterroboPlus Puu
Agent: straight FLASH!! GetterroboPlus 1.5
Host: straight FLASH!! Getterrobo-Plus, *.homing.net
Email: marunaka@homing.net
The Python Robot
Agent:
Host:
Email: guido@python.org
Raven Search
Agent: Raven-v2
Host: 192.168.1.*
Email: ravensearch@hotmail.com
RBSE Spider
Agent:
Host: rbse.jsc.nasa.gov (192.88.42.10)
Email: eichmann@rbse.jsc.nasa.gov
Resume Robot
Agent: Resume Robot
Host:
Email: proquest@onramp.net
RoadHouse Crawling
System
Agent: RHCS/1.0a
Host: stage.perceval.be
Email: helpdesk@perceval.be
Road Runner: The
ImageScape Robot
Agent: Road Runner: ImageScape Robot (lim@cs.leidenuniv.nl)
Host:
Email: lim@cs.leidenuniv.nl
Robbie the Robot
Agent: Robbie/0.1
Host: *.lmco.com
Email: robert.h.pollack@lmco.com
ComputingSite Robi/1.0
Agent: ComputingSite Robi/1.0 (robi@computingsite.com)
Host: robi.computingsite.com
Email: robi@computingsite.com
Robozilla
Agent: Robozilla/1.0
Host: directory.mozilla.org
Email: robozilla@dmozed.org
Roverbot
Agent: Roverbot
Host: roverbot.com
Email: gmd@spyder.net
SafetyNet Robot
Agent: SafetyNet Robot 0.1,
Host: *.urlabs.com
Email: m.l.nelson@urlabs.com
Scooter
Agent: Scooter/2.0 G.R.A.B. V1.1.0
Host: *.av.pa-x.dec.com
Email: scooter@pa.dec.com
Search.Aus-AU.COM
Agent: not available
Host: Search.Aus-AU.COM, 203.55.124.29, 203.2.239.29
Email: dez@geko.com
SearchProcess
Agent: searchprocess/0.9
Host: searchprocess.com
Email: bruno@intelligence-process.com
Senrigan
Agent: Senrigan/xxxxxx
Host: aniki.olu.info.waseda.ac.jp
Email: kent@muraoka.info.waseda.ac.jp
SG-Scout
Agent: SG-Scout
Host: beta.xerox.com
Email: ptbb@ai.mit.edu, beebee@parc.xerox.com
ShagSeeker
Agent: Shagseeker at http://www.shagseek.com /1.0
Host: shagseek.com
Email: joe.reynolds@shagseek.com
Shai'Hulud
Agent: Shai'Hulud
Host: *.rdtex.ru
Email: shawdow@usa.net
Sift
Agent: libwww-perl-5.41
Host: www.worthy.com
Email: bworthy@worthy.com
Simmany Robot Ver1.0
Agent: SimBot/1.0
Host: sansam.hnc.net
Email: ailove@hnc.co.kr
Site Valet
Agent: Site Valet
Host: valet.webthing.com,valet.*
Email: nick@webthing.com
Open Text Index Robot
Agent: Open Text Site Crawler V1.0
Host: *.opentext.com
Email: faichney@opentext.com
SiteTech-Rover
Agent: SiteTech-Rover
Host:
Email: adasilva@sitetech.com
SLCrawler
Agent: SLCrawler
Host: n/a
Email: kng@inxight.com
Inktomi Slurp
Agent: Slurp/2.0
Host: *.inktomi.com
Email: slurp@inktomi.com
Smart Spider
Agent: ESISmartSpider/2.0
Host: 207.16.241.*
Email: ken@engsoftware.com
Snooper
Agent: Snooper/b97_01
Host:
Email: melnicki@sit.ca
Solbot
Agent: Solbot/1.0 LWP/5.07
Host: robot*.sol.no
Email: ftj@sys.sol.no
Spanner
Agent: Spanner/1.0 (Linux 2.0.27 i586)
Host: *.kluge.net
Email: felicity@kluge.net
Speedy Spider
Agent: Speedy Spider ( http://www.entireweb.com/speedy.html )
Host: router-00.sverige.net, 193.15.210.29, *.entireweb.com,
*.worldlight.com
Email: speedy@worldlight.com
spider_monkey
Agent: mouse.house/7.1
Host: snowball.ionsys.com
Email: mprm@ionsys.com
SpiderBot
Agent: SpiderBot/1.0
Host: *
Email: spidrboticruzado@solaria.emp.ubu.es
SpiderMan
Agent: SpiderMan 1.0
Host: NA
Email: leunghok@comp.nus.edu.sg
SpiderView(tm)
Agent: Mozilla/4.0 (compatible; SpiderView 1.0;unix)
Host: bobmin.quad2.iuinc.com, *
Email: webmaster@northernwebs.com
Spry Wizard Robot
Agent: no
Host: wizard.spry.com or tiger.spry.com
Email: info@spry.com
Site Searcher
Agent: ssearcher100
Host: *
Email: zackware@hotmail.com
Suke
Agent: suke/*.*
Host: *
Email: robot@kensaku.org
suntek search engine
Agent: suntek/1.0
Host: search.suntek.com.hk
Email: karen@suntek.com.hk
Sven
Agent:
Host: 24.113.12.29
Email: rhondle@home.com
TACH Black Widow
Agent: Mozilla/3.0 (Black Widow v1.1.0; Linux 2.0.27; Dec 31 1997 12:25:00
Host: *.theautochannel.com
Email: mjenn@theautochannel.com
Tarantula
Agent: Tarantula/1.0
Host: yes
Email: Markus.Hoevener@evision.de
tarspider
Agent: tarspider
Host:
Email: chakl@fu-berlin.de
Tcl W3 Robot
Agent: dlw3robot/x.y (in TclX by http://hplyot.obspm.fr/~dl/)
Host: hplyot.obspm.fr
Email: dl@hplyot.obspm.fr
TechBOT
Agent: TechBOT
Host: techaid.net
Email: techbot@techaid.net
Templeton
Agent: Templeton/{version} for {platform}
Host: *
Email: nealk@net66.com
TeomaTechnologies
Agent: teoma_agent1 [teoma_admin@hawkholdings.com]
Host: 63.236.92.145
Email: teoma_admin@hawkholdings.com
TitIn
Agent: TitIn/0.2
Host: barok.foi.hr
Email: dpavlin@foi.hr
TITAN
Agent: TITAN/0.1
Host: nlptitan.isl.ntt.jp
Email: hayashi@nttnly.isl.ntt.jp
The TkWWW Robot
Agent:
Host:
Email: scott@cs.sunyit.edu
TLSpider
Agent: TLSpider/1.1
Host: tlspider.topiclink.com (not available yet)
Email: tlspider@outtel.com
UCSD Crawl
Agent: UCSD-Crawler
Host: nuthaus.mib.org scilib.ucsd.edu
Email: atilghma@mib.org
UdmSearch
Agent: UdmSearch/2.1.1
Host: *
Email: bar@izhcom.ru
URL Check
Agent: urlck/1.2.3
Host: *
Email: dave@cutternet.com
URL Spider Pro
Agent: URL Spider Pro
Host: *
Email: greg@innerprise.net
Valkyrie
Agent: Valkyrie/1.0 libwww-perl/0.40
Host: *.c.u-tokyo.ac.jp
Email: harada@graco.c.u-tokyo.ac.jp
Victoria
Agent: Victoria/1.0
Host:
Email: adrianh@oneworld.co.uk
vision-search
Agent: vision-search/3.0'
Host: dylan.ius.cs.cmu.edu
Email: har@cs.cmu.edu
Voyager
Agent: Voyager/0.0
Host: *.lisa.co.jp
Email: voyager@lisa.co.jp
VWbot
Agent: VWbot_K/4.2
Host: vancouver-webpages.com
Email: andrew@vancouver-webpages.com
The NWI Robot
Agent: w3index
Host: nwi.ub2.lu.se, mars.dtv.dk and a few others
Email: siglun@munin.ub2.lu.se
W3M2
Agent: W3M2/x.xxx
Host: *
Email: tronche@lri.fr
the World Wide Web Wanderer
Agent: WWWWanderer v3.0
Host: *.mit.edu
Email: mkgray@mit.edu
WebBandit Web Spider
Agent: WebBandit/1.0
Host: ix.netcom.com
Email: wooger@ix.netcom.com
WebCatcher
Agent: WebCatcher/1.0
Host: oscar.lang.nagoya-u.ac.jp
Email: reiji@infonia.ne.jp
WebCopy
Agent: WebCopy/(version)
Host: *
Email: vparada@inf.utfsm.cl
webfetcher
Agent: WebFetcher/0.8,
Host: *
Email: webfetch@ontv.com
The Webfoot Robot
Agent:
Host: phoenix.doc.ic.ac.uk
Email: L.McLoughlin@doc.ic.ac.uk
weblayers
Agent: weblayers/0.0
Host:
Email: loic@afp.com
WebLinker
Agent: WebLinker/0.0 libwww-perl/0.1
Host:
Email: jcasey@maths.tcd.ie
WebMirror
Agent: no
Host:
Email: sfchan@mailhost.net
The Web Moose
Agent: WebMoose/0.0.0000
Host: msn.com
Email: mikeblas@nwlink.com
WebQuest
Agent: WebQuest/1.0
Host: 210.121.146.2, 210.113.104.1, 210.113.104.2
Email: cty@cosmonet.co.kr
Digimarc MarcSpider
Agent: Digimarc WebReader/1.2
Host: 206.102.3.*
Email: wmreader@digimarc.com
WebReaper
Agent: WebReaper [webreaper@otway.com]
Host: *
Email: webreaper@otway.com
webs
Agent: webs@recruit.co.jp
Host: lemon.recruit.co.jp
Email: dew@wwwadmin.rnet.or.jp
Websnarf
Agent:
Host:
Email: charles@fma.com
WebSpider
Agent:
Host: several
Email: u610468@csi.uottawa.ca
WebVac
Agent: webvac/1.0
Host:
Email: tim@federated.com
webwalk
Agent: webwalk
Host:
Email:
WebWalker
Agent: WebWalker/1.10
Host: *
Email: fccheong@cs.berkeley.edu
WebWatch
Agent: WebWatch
Host:
Email: janos@specter.com
Wget
Agent: Wget/1.4.0
Host: *
Email: hniksic@srce.hr
whatUseek Winona
Agent: whatUseek_winona/3.0
Host: *.whatuseek.com, *.aol2.com
Email: neil@whatUseek.com
WhoWhere Robot
Agent:
Host: spica.whowhere.com
Email: rupesh@whowhere.com
w3mir
Agent: w3mir
Host:
Email: w3mir-core@usit.uio.no
WebStolperer
Agent: WOLP/1.0 mda/1.0
Host: www.suchfibel.de
Email: mda@suchfibel.de
The Web Wombat
Agent: no
Host: qwerty.intercom.com.au
Email: phill@intercom.com.au
The World Wide Web Worm
Agent:
Host: piper.cs.colorado.edu
Email: mcbryan@piper.cs.colorado.edu
WWWC Ver 0.2.5
Agent: WWWC/0.25 (Win95)
Host:
Email: naka@kinet.or.jp
WebZinger
Agent: none
Host: http://www.imaginon.com/wzindex.html *
Email: info@imaginon.com
XGET
Agent: XGET/0.7
Host: *
Email: shige@mh1.117.ne.jp
Nederland.zoek
Agent: Nederland.zoek
Host: 193.67.110.*
Email: zoek@nederland.net
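The Agent and Host values listed above can be matched against a web server's access log to recognise robot visits. The short sketch below is only an illustration, assuming Python, an Apache "combined" format log file named access_log, and a hand-picked sample of Agent substrings taken from the list above:

    # Illustrative sketch only: scan a "combined" format access log for
    # User-Agent strings that contain some of the Agent values listed above.
    # The file name and the agent sample are assumptions, not part of the list.
    import re

    KNOWN_AGENTS = ["Scooter", "Slurp", "ArchitextSpider", "WebReaper", "Wget"]

    pattern = re.compile(r'"([^"]*)"$')  # last quoted field is the User-Agent

    with open("access_log") as log:
        for line in log:
            match = pattern.search(line.strip())
            if not match:
                continue
            user_agent = match.group(1)
            if any(agent in user_agent for agent in KNOWN_AGENTS):
                print(user_agent)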
Database of Web Robots, Overview of Raw files
The Web Robots Pages
Database of Web Robots, Overview of Raw files
1. Acme.Spider
2. Ahoy! The Homepage Finder
3. Alkaline
4. Walhello appie
5. Arachnophilia
6. ArchitextSpider
7. Aretha
8. ARIADNE
9. arks
10. ASpider (Associative Spider)
11. ATN Worldwide
12. Atomz.com Search Robot
13. AURESYS
14. BackRub
15. unnamed
16. Big Brother
17. Bjaaland
18. BlackWidow
19. Die Blinde Kuh
20. Bloodhound
21. bright.net caching robot
22. BSpider
23. CACTVS Chemistry Spider
24. Calif
25. Cassandra
26. Digimarc Marcspider/CGI
27. Checkbot
28. churl
29. CMC/0.01
30. Collective
31. Combine System
32. Conceptbot
33. CoolBot
34. Web Core / Roots
35. XYLEME Robot
36. Internet Cruiser Robot
37. Cusco
38. CyberSpyder Link Test
39. DeWeb(c) Katalog/Index
40. DienstSpider
41. Digger
42. Digital Integrity Robot
43. Direct Hit Grabber
44. DNAbot
45. DownLoad Express
46. DragonBot
47. DWCP (Dridus' Web Cataloging Project)
48. e-collector
49. EbiNess
50. EIT Link Verifier Robot
51. Emacs-w3 Search Engine
52. ananzi
53. Esther
54. Evliya Celebi
55. nzexplorer
56. Fluid Dynamics Search Engine robot
57. Felix IDE
58. Wild Ferret Web Hopper #1, #2, #3
59. FetchRover
60. fido
61. Hämähäkki
62. KIT-Fireball
63. Fish search
64. Fouineur
65. Robot Francoroute
66. Freecrawl
67. FunnelWeb
68. gazz
69. GCreep
70. GetBot
71. GetURL
72. Golem
73. Googlebot
74. Grapnel/0.01 Experiment
75. Griffon
76. Gromit
77. Northern Light Gulliver
78. HamBot
79. Harvest
80. havIndex
81. HI (HTML Index) Search
82. Hometown Spider Pro
83. Wired Digital
84. ht://Dig
85. HTMLgobble
86. Hyper-Decontextualizer
87. IBM_Planetwide
88. Popular Iconoclast
89. Ingrid
90. Imagelock
91. IncyWincy
92. Informant
93. InfoSeek Robot 1.0
94. Infoseek Sidewinder
95. InfoSpiders
96. Inspector Web
97. IntelliAgent
98. I, Robot
99. Iron33
100. Israeli-search
101. JavaBee
102. JBot Java Web Robot
103. JCrawler
104. Jeeves
105. Jobot
106. JoeBot
107. The Jubii Indexing Robot
108. JumpStation
109. Katipo
110. KDD-Explorer
111. Kilroy
112. KO_Yappo_Robot
113. LabelGrabber
114. larbin
115. legs
116. Link Validator
117. LinkScan
118. LinkWalker
119. Lockon
120. logo.gif Crawler
121. Lycos
122. Mac WWWWorm
123. Magpie
124. Mattie
125. MediaFox
126. MerzScope
127. NEC-MeshExplorer
128. MindCrawler
129. moget
130. MOMspider
131. Monster
132. Motor
133. Muscat Ferret
134. Mwd.Search
135. Internet Shinchakubin
136. NetCarta WebMap Engine
137. NetMechanic
138. NetScoop
139. newscan-online
140. NHSE Web Forager
141. Nomad
142. The NorthStar Robot
143. Occam
144. HKU WWW Octopus
145. Orb Search
146. Pack Rat
147. PageBoy
148. ParaSite
149. Patric
150. pegasus
151. The Peregrinator
152. PerlCrawler 1.0
153. Phantom
154. PiltdownMan
155. Pioneer
156. html_analyzer
157. Portal Juice Spider
158. PGP Key Agent
159. PlumtreeWebAccessor
160. Poppi
161. PortalB Spider
162. GetterroboPlus Puu
163. The Python Robot
164. Raven Search
165. RBSE Spider
166. Resume Robot
167. RoadHouse Crawling System
168. Road Runner: The ImageScape Robot
169. Robbie the Robot
170. ComputingSite Robi/1.0
171. Robozilla
172. Roverbot
173. SafetyNet Robot
174. Scooter
175. Search.Aus-AU.COM
176. SearchProcess
177. Senrigan
178. SG-Scout
179. ShagSeeker
180. Shai'Hulud
181. Sift
182. Simmany Robot Ver1.0
183. Site Valet
184. Open Text Index Robot
185. SiteTech-Rover
186. SLCrawler
187. Inktomi Slurp
188. Smart Spider
189. Snooper
190. Solbot
191. Spanner
192. Speedy Spider
193. spider_monkey
194. SpiderBot
195. SpiderMan
196. SpiderView(tm)
197. Spry Wizard Robot
198. Site Searcher
199. Suke
200. suntek search engine
201. Sven
202. TACH Black Widow
203. Tarantula
204. tarspider
205. Tcl W3 Robot
206. TechBOT
207. Templeton
208. TeomaTechnologies
209. TitIn
210. TITAN
211. The TkWWW Robot
212. TLSpider
213. UCSD Crawl
214. UdmSearch
215. URL Check
216. URL Spider Pro
217. Valkyrie
218. Victoria
219. vision-search
220. Voyager
221. VWbot
222. The NWI Robot
223. W3M2
224. the World Wide Web Wanderer
225. WebBandit Web Spider
226. WebCatcher
227. WebCopy
228. webfetcher
229. The Webfoot Robot
230. weblayers
231. WebLinker
232. WebMirror
233. The Web Moose
234. WebQuest
235. Digimarc MarcSpider
236. WebReaper
237. webs
238. Websnarf
239. WebSpider
240. WebVac
241. webwalk
242. WebWalker
243. WebWatch
244. Wget
245. whatUseek Winona
246. WhoWhere Robot
247. w3mir
248. WebStolperer
249. The Web Wombat
250. The World Wide Web Worm
251. WWWC Ver 0.2.5
252. WebZinger
253. XGET
254. Nederland.zoek
http://info.webcrawler.com/mak/projects/robots/active/all.txt
robot-id: Acme.Spider
robot-name: Acme.Spider
robot-cover-url: http://www.acme.com/java/software/Acme.Spider.html
robot-details-url: http://www.acme.com/java/software/Acme.Spider.html
robot-owner-name: Jef Poskanzer - ACME Laboratories
robot-owner-url: http://www.acme.com/
robot-owner-email: jef@acme.com
robot-status: active
robot-purpose: indexing maintenance statistics
robot-type: standalone
robot-platform: java
robot-availability: source
robot-exclusion: yes
robot-exclusion-useragent: Due to a deficiency in Java it's not currently possible to set the User-Agent.
robot-noindex: no
robot-host: *
robot-from: no
robot-useragent: Due to a deficiency in Java it's not currently possible to set the User-Agent.
robot-language: java
robot-description: A Java utility class for writing your own robots.
robot-history:
robot-environment:
modified-date: Wed, 04 Dec 1996 21:30:11 GMT
modified-by: Jef Poskanzer
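Records in this raw file follow the simple "field: value" layout shown above, with long values wrapped onto continuation lines. The sketch below is one possible way to read the file into a list of records; it assumes Python, a local copy saved as all.txt, and Latin-1 encoding:

    # Illustrative sketch: parse all.txt into a list of dictionaries, one per
    # robot record. A record starts at "robot-id:"; lines without a known
    # field label are folded into the previous field as continuations.
    def parse_robots_database(path):
        records, current, last_field = [], None, None
        with open(path, encoding="latin-1") as handle:
            for raw in handle:
                line = raw.rstrip("\n")
                if not line.strip():
                    continue
                if line.startswith("robot-id:"):
                    current = {}
                    records.append(current)
                if current is None:
                    continue
                field, sep, value = line.partition(":")
                if sep and field.strip().startswith(("robot-", "modified-")):
                    last_field = field.strip()
                    current[last_field] = value.strip()
                elif last_field:
                    current[last_field] += " " + line.strip()
        return records

    if __name__ == "__main__":
        robots = parse_robots_database("all.txt")
        print(len(robots), "records")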
robot-id: ahoythehomepagefinder
robot-name: Ahoy! The Homepage Finder
robot-cover-url: http://www.cs.washington.edu/research/ahoy/
robot-details-url: http://www.cs.washington.edu/research/ahoy/doc/home.html
robot-owner-name: Marc Langheinrich
robot-owner-url: http://www.cs.washington.edu/homes/marclang
robot-owner-email: marclang@cs.washington.edu
robot-status: active
robot-purpose: maintenance
robot-type: standalone
robot-platform: UNIX
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: ahoy
robot-noindex: no
robot-host: cs.washington.edu
robot-from: no
robot-useragent: 'Ahoy! The Homepage Finder'
robot-language: Perl 5
robot-description: Ahoy! is an ongoing research project at the University of Washington for finding personal Homepages.
robot-history: Research project at the University of Washington in 1995/1996
robot-environment: research
modified-date: Fri June 28 14:00:00 1996
modified-by: Marc Langheinrich
robot-id: Alkaline
robot-name: Alkaline
robot-cover-url: http://www.vestris.com/alkaline
robot-details-url: http://www.vestris.com/alkaline
robot-owner-name: Daniel Doubrovkine
robot-owner-url: http://cuiwww.unige.ch/~doubrov5
robot-owner-email: dblock@vestris.com
robot-status: development active
robot-purpose: indexing
robot-type: standalone
robot-platform: unix windows95 windowsNT
robot-availability: binary
robot-exclusion: yes
robot-exclusion-useragent: AlkalineBOT
robot-noindex: yes
robot-host: *
robot-from: no
robot-useragent: AlkalineBOT
robot-language: c++
robot-description: Unix/NT internet/intranet search engine
robot-history: Vestris Inc. search engine designed at the University of
Geneva
robot-environment: commercial research
modified-date: Thu Dec 10 14:01:13 MET 1998
modified-by: Daniel Doubrovkine <dblock@vestris.com>
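The robot-exclusion-useragent value recorded above (AlkalineBOT) is the token a server administrator would match in /robots.txt to direct this robot. For example, a site that wanted to keep it out of a /private/ directory (the path is only an example) could include lines like these in its /robots.txt:

    User-agent: AlkalineBOT
    Disallow: /private/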
robot-id: appie
robot-name: Walhello appie
robot-cover-url: www.walhello.com
robot-details-url: www.walhello.com/aboutgl.html
robot-owner-name: Aimo Pieterse
robot-owner-url: www.walhello.com
robot-owner-email: aimo@walhello.com
robot-status: active
robot-purpose: indexing
robot-type: standalone
robot-platform: windows98
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: appie
robot-noindex: yes
robot-host: 213.10.10.116, 213.10.10.117, 213.10.10.118
robot-from: yes
robot-useragent: appie/1.1
robot-language: Visual C++
robot-description: The appie-spider is used to collect and index web pages for
the Walhello search engine
robot-history: The spider was built in march/april 2000
robot-environment: commercial
modified-date: Thu, 20 Jul 2000 22:38:00 GMT
modified-by: Aimo Pieterse
robot-id: arachnophilia
robot-name: Arachnophilia
robot-cover-url:
robot-details-url:
robot-owner-name: Vince Taluskie
robot-owner-url: http://www.ph.utexas.edu/people/vince.html
robot-owner-email: taluskie@utpapa.ph.utexas.edu
robot-status:
robot-purpose:
robot-type:
robot-platform:
robot-availability:
robot-exclusion: yes
robot-exclusion-useragent:
robot-noindex: no
robot-host: halsoft.com
robot-from:
robot-useragent: Arachnophilia
robot-language:
robot-description: The purpose (undertaken by HaL Software) of this run was to collect approximately 10k html documents for testing automatic abstract generation
robot-history:
robot-environment:
modified-date:
modified-by:
robot-id: architext
robot-name: ArchitextSpider
robot-cover-url: http://www.excite.com/
robot-details-url:
robot-owner-name: Architext Software
robot-owner-url: http://www.atext.com/spider.html
robot-owner-email: spider@atext.com
robot-status:
robot-purpose: indexing, statistics
robot-type: standalone
robot-platform:
robot-availability:
robot-exclusion: yes
robot-exclusion-useragent:
robot-noindex: no
robot-host: *.atext.com
robot-from: yes
robot-useragent: ArchitextSpider
robot-language: perl 5 and c
robot-description: Its purpose is to generate a Resource Discovery database, and to generate statistics. The ArchitextSpider collects information for the Excite and WebCrawler search engines.
robot-history:
robot-environment:
modified-date: Tue Oct 3 01:10:26 1995
modified-by:
robot-id: aretha
robot-name: Aretha
robot-cover-url:
robot-details-url:
robot-owner-name: Dave Weiner
robot-owner-url: http://www.hotwired.com/Staff/userland/
robot-owner-email: davew@well.com
robot-status:
robot-purpose:
robot-type:
robot-platform: Macintosh
robot-availability:
robot-exclusion:
robot-exclusion-useragent:
robot-noindex:
robot-host:
robot-from:
robot-useragent:
robot-language:
robot-description: A crude robot built on top of Netscape and Userland Frontier, a scripting system for Macs
robot-history:
robot-environment:
modified-date:
modified-by:
robot-id: ariadne
robot-name: ARIADNE
robot-cover-url: (forthcoming)
robot-details-url: (forthcoming)
robot-owner-name: Mr. Matthias H. Gross
robot-owner-url: http://www.lrz-muenchen.de/~gross/
robot-owner-email: Gross@dbs.informatik.uni-muenchen.de
robot-status: development
robot-purpose: statistics, development of focused crawling strategies
robot-type: standalone
robot-platform: java
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: ariadne
robot-noindex: no
robot-host: dbs.informatik.uni-muenchen.de
robot-from: no
robot-useragent: Due to a deficiency in Java it's not currently possible
to set the User-Agent.
robot-language: java
robot-description: The ARIADNE robot is a prototype of an environment for testing focused crawling strategies.
robot-history: This robot is part of a research project at the
University of Munich (LMU), started in 2000.
robot-environment: research
modified-date: Mo, 13 Mar 2000 14:00:00 GMT
modified-by: Mr. Matthias H. Gross
robot-id:arks
robot-name:arks
robot-cover-url:http://www.dpsindia.com
robot-details-url:http://www.dpsindia.com
robot-owner-name:Aniruddha Choudhury
robot-owner-url:
robot-owner-email:aniruddha.c@usa.net
robot-status:development
robot-purpose:indexing
robot-type:standalone
robot-platform:PLATFORM INDEPENDENT
robot-availability:data
robot-exclusion:yes
robot-exclusion-useragent:arks
robot-noindex:no
robot-host:dpsindia.com
robot-from:no
robot-useragent:arks/1.0
robot-language:Java 1.2
robot-description:The Arks robot is used to build the database
for the dpsindia/lawvistas.com search service.
The robot runs weekly, and visits sites in a random order
robot-history:finds its root from s/w development project for a portal
robot-environment:commercial
modified-date:6th November 2000
modified-by:Aniruddha Choudhury
robot-id: aspider
robot-name: ASpider (Associative Spider)
robot-cover-url:
robot-details-url:
robot-owner-name: Fred Johansen
robot-owner-url: http://www.pvv.ntnu.no/~fredj/
robot-owner-email: fredj@pvv.ntnu.no
robot-status: retired
robot-purpose: indexing
robot-type:
robot-platform: unix
robot-availability:
robot-exclusion:
robot-exclusion-useragent:
robot-noindex: no
robot-host: nova.pvv.unit.no
robot-from: yes
robot-useragent: ASpider/0.09
robot-language: perl4
robot-description: ASpider is a CGI script that searches the web for keywords given by the user through a form.
robot-history:
robot-environment: hobby
modified-date:
modified-by:
robot-id: atn.txt
robot-name: ATN Worldwide
robot-details-url:
robot-cover-url:
robot-owner-name: All That Net
robot-owner-url: http://www.allthatnet.com
robot-owner-email: info@allthatnet.com
robot-status: active
robot-purpose: indexing
robot-type:
robot-platform:
robot-availability:
robot-exclusion: yes
robot-exclusion-useragent: ATN_Worldwide
robot-noindex:
robot-nofollow:
robot-host: www.allthatnet.com
robot-from:
robot-useragent: ATN_Worldwide
robot-language:
robot-description: The ATN robot is used to build the database for the
AllThatNet search service operated by All That Net. The robot runs weekly,
and visits sites in a random order.
robot-history:
robot-environment:
modified-date: July 09, 2000 17:43 GMT
robot-id: atomz
robot-name: Atomz.com Search Robot
robot-cover-url: http://www.atomz.com/help/
robot-details-url: http://www.atomz.com/
robot-owner-name: Mike Thompson
robot-owner-url: http://www.atomz.com/
robot-owner-email: mike@atomz.com
robot-status: active
robot-purpose: indexing
robot-type: standalone
robot-platform: unix
robot-availability: service
robot-exclusion: yes
robot-exclusion-useragent: Atomz
robot-noindex: yes
robot-host: www.atomz.com
robot-from: no
robot-useragent: Atomz/1.0
robot-language: c
robot-description: Robot used for web site search service.
robot-history: Developed for Atomz.com, launched in 1999.
robot-environment: service
modified-date: Tue Jul 13 03:50:06 GMT 1999
modified-by: Mike Thompson
robot-id: auresys
robot-name: AURESYS
robot-cover-url: http://crrm.univ-mrs.fr
robot-details-url: http://crrm.univ-mrs.fr
robot-owner-name: Mannina Bruno
robot-owner-url: ftp://crrm.univ-mrs.fr/pub/CVetud/Etudiants/Mannina/CVbruno.htm
robot-owner-email: mannina@crrm.univ-mrs.fr
robot-status: robot actively in use
robot-purpose: indexing,statistics
robot-type: Standalone
robot-platform: Aix, Unix
robot-availability: Protected by Password
robot-exclusion: Yes
robot-exclusion-useragent:
robot-noindex: no
robot-host: crrm.univ-mrs.fr, 192.134.99.192
robot-from: Yes
robot-useragent: AURESYS/1.0
robot-language: Perl 5.001m
robot-description: AURESYS is used to build a personal database for
somebody searching for information. The database is structured to be
analysed. AURESYS can find new servers by incrementing IP addresses. It
generates statistics...
robot-history: This robot finds its roots in a research project at the
University of Marseille in 1995-1996
robot-environment: used for Research
modified-date: Mon, 1 Jul 1996 14:30:00 GMT
modified-by: Mannina Bruno
robot-id: backrub
robot-name: BackRub
robot-cover-url:
robot-details-url:
robot-owner-name: Larry Page
robot-owner-url: http://backrub.stanford.edu/
robot-owner-email: page@leland.stanford.edu
robot-status:
robot-purpose: indexing, statistics
robot-type: standalone
robot-platform:
robot-availability:
robot-exclusion: yes
robot-exclusion-useragent:
robot-noindex:
robot-host: *.stanford.edu
robot-from: yes
robot-useragent: BackRub/*.*
robot-language: Java.
robot-description:
robot-history:
robot-environment:
modified-date: Wed Feb 21 02:57:42 1996.
modified-by:
robot-id:
robot-name: bayspider
robot-cover-url: http://www.baytsp.com
robot-details-url: http://www.baytsp.com
robot-owner-name: BayTSP.com,Inc
robot-owner-url:
robot-owner-email: marki@baytsp.com
robot-status: Active
robot-purpose: Copyright Infringement Tracking
robot-type: Stand Alone
robot-platform: NT
robot-availability: 24/7
robot-exclusion:
robot-exclusion-useragent:
robot-noindex:
robot-host:
robot-from:
robot-useragent: BaySpider
robot-language: English
robot-description:
robot-history:
robot-environment:
modified-date: 1/15/2001
modified-by: Marki@baytsp.com
robot-id: bigbrother
robot-name: Big Brother
robot-cover-url: http://pauillac.inria.fr/~fpottier/mac-soft.html.en
robot-details-url:
robot-owner-name: Francois Pottier
robot-owner-url: http://pauillac.inria.fr/~fpottier/
robot-owner-email: Francois.Pottier@inria.fr
robot-status: active
robot-purpose: maintenance
robot-type: standalone
robot-platform: mac
robot-availability: binary
robot-exclusion: no
robot-exclusion-useragent:
robot-noindex: no
robot-host: *
robot-from: not as of 1.0
robot-useragent: Big Brother
robot-language: c++
robot-description: Macintosh-hosted link validation tool.
robot-history:
robot-environment: shareware
modified-date: Thu Sep 19 18:01:46 MET DST 1996
modified-by: Francois Pottier
robot-id: bjaaland
robot-name: Bjaaland
robot-cover-url: http://www.textuality.com
robot-details-url: http://www.textuality.com
robot-owner-name: Tim Bray
robot-owner-url: http://www.textuality.com
robot-owner-email: tbray@textuality.com
robot-status: development
robot-purpose: indexing
robot-type: standalone
robot-platform: unix
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: Bjaaland
robot-noindex: no
robot-host: barry.bitmovers.net
robot-from: no
robot-useragent: Bjaaland/0.5
robot-language: perl5
robot-description: Crawls sites listed in the ODP (see http://dmoz.org)
robot-history: None, yet
robot-environment: service
modified-date: Monday, 19 July 1999, 13:46:00 PDT
modified-by: tbray@textuality.com
robot-id: blackwidow
robot-name: BlackWidow
robot-cover-url: http://140.190.65.12/~khooghee/index.html
robot-details-url:
robot-owner-name: Kevin Hoogheem
robot-owner-url:
robot-owner-email: khooghee@marys.smumn.edu
robot-status:
robot-purpose: indexing, statistics
robot-type: standalone
robot-platform:
robot-availability:
robot-exclusion: no
robot-exclusion-useragent:
robot-noindex:
robot-host: 140.190.65.*
robot-from: yes
robot-useragent: BlackWidow
robot-language: C, C++.
robot-description: Started as a research project and now is used to find links for a random link generator. Also is used to research the growth of specific sites.
robot-history:
robot-environment:
modified-date: Fri Feb 9 00:11:22 1996.
modified-by:
robot-id: blindekuh
robot-name: Die Blinde Kuh
robot-cover-url: http://www.blinde-kuh.de/
robot-details-url: http://www.blinde-kuh.de/robot.html (german language)
robot-owner-name: Stefan R. Mueller
robot-owner-url: http://www.rrz.uni-hamburg.de/philsem/stefan_mueller/
robot-owner-email:maschinist@blinde-kuh.de
robot-status: development
robot-purpose: indexing
robot-type: browser
robot-platform: unix
robot-availability: none
robot-exclusion: no
robot-exclusion-useragent:
robot-noindex: no
robot-host: minerva.sozialwiss.uni-hamburg.de
robot-from: yes
robot-useragent: Die Blinde Kuh
robot-language: perl5
robot-description: The robot is used for indexing and proofing the
registered URLs in the German-language search engine for kids.
It is a non-commercial one-woman project of Birgit Bachmann
living in Hamburg, Germany.
robot-history: The robot was developed by Stefan R. Mueller
to help with the manual proofing of registered links.
robot-environment: hobby
modified-date: Mon Jul 22 1998
modified-by: Stefan R. Mueller
robot-id:Bloodhound
robot-name:Bloodhound
robot-cover-url:http://web.ukonline.co.uk/genius/bloodhound.htm
robot-details-url:http://web.ukonline.co.uk/genius/bloodhound.htm
robot-owner-name:Dean Smart
robot-owner-url:http://web.ukonline.co.uk/genius/bloodhound.htm
robot-owner-email:genius@ukonline.co.uk
robot-status:active
robot-purpose:Web Site Download
robot-type:standalone
robot-platform:Windows95, WindowsNT, Windows98, Windows2000
robot-availability:Executible
robot-exclusion:No
robot-exclusion-useragent:Ukonline
robot-noindex:No
robot-host:*
robot-from:No
robot-useragent:None
robot-language:Perl5
robot-description:Bloodhound will download a whole web site, depending on the
number of links to follow specified by the user.
robot-history:First version was released on the 1 july 2000
robot-environment:Commercial
modified-date:1 july 2000
modified-by:Dean Smart
robot-id: brightnet
robot-name: bright.net caching robot
robot-cover-url:
robot-details-url:
robot-owner-name:
robot-owner-url:
robot-owner-email:
robot-status: active
robot-purpose: caching
robot-type:
robot-platform:
robot-availability: none
robot-exclusion: no
robot-noindex:
robot-host: 209.143.1.46
robot-from: no
robot-useragent: Mozilla/3.01 (compatible;)
robot-language:
robot-description:
robot-history:
robot-environment:
modified-date: Fri Nov 13 14:08:01 EST 1998
modified-by: brian d foy <comdog@computerdog.com>
robot-id: bspider
robot-name: BSpider
robot-cover-url: not yet
robot-details-url: not yet
robot-owner-name: Yo Okumura
robot-owner-url: not yet
robot-owner-email: okumura@rsl.crl.fujixerox.co.jp
robot-status: active
robot-purpose: indexing
robot-type: standalone
robot-platform: Unix
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: bspider
robot-noindex: yes
robot-host: 210.159.73.34, 210.159.73.35
robot-from: yes
robot-useragent: BSpider/1.0 libwww-perl/0.40
robot-language: perl
robot-description: BSpider crawls inside the Japanese domain for indexing.
robot-history: Starts Apr 1997 in a research project at Fuji Xerox Corp.
Research Lab.
robot-environment: research
modified-date: Mon, 21 Apr 1997 18:00:00 JST
modified-by: Yo Okumura
robot-id: cactvschemistryspider
robot-name: CACTVS Chemistry Spider
robot-cover-url: http://schiele.organik.uni-erlangen.de/cactvs/spider.html
robot-details-url:
robot-owner-name: W. D. Ihlenfeldt
robot-owner-url: http://schiele.organik.uni-erlangen.de/cactvs/
robot-owner-email: wdi@eros.ccc.uni-erlangen.de
robot-status:
robot-purpose: indexing.
robot-type: standalone
robot-platform:
robot-availability:
robot-exclusion: yes
robot-exclusion-useragent:
robot-noindex:
robot-host: utamaro.organik.uni-erlangen.de
robot-from: no
robot-useragent: CACTVS Chemistry Spider
robot-language: TCL, C
robot-description: Locates chemical structures in Chemical MIME formats on WWW and FTP servers and downloads them into a database searchable with structure queries (substructure, fullstructure, formula, properties etc.)
robot-history:
robot-environment:
modified-date: Sat Mar 30 00:55:40 1996.
modified-by:
robot-id: calif
robot-name: Calif
robot-details-url: http://www.tnps.dp.ua/calif/details.html
robot-cover-url: http://www.tnps.dp.ua/calif/
robot-owner-name: Alexander Kosarev
robot-owner-url: http://www.tnps.dp.ua/~dark/
robot-owner-email: kosarev@tnps.net
robot-status: development
robot-purpose: indexing
robot-type: standalone
robot-platform: unix
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: calif
robot-noindex: yes
robot-host: cobra.tnps.dp.ua
robot-from: yes
robot-useragent: Calif/0.6 (kosarev@tnps.net; http://www.tnps.dp.ua)
robot-language: c++
robot-description: Used to build searchable index
robot-history: In development stage
robot-environment: research
modified-date: Sun, 6 Jun 1999 13:25:33 GMT
robot-id: cassandra
robot-name: Cassandra
robot-cover-url: http://post.mipt.rssi.ru/~billy/search/
robot-details-url: http://post.mipt.rssi.ru/~billy/search/
robot-owner-name: Mr. Oleg Bilibin
robot-owner-url: http://post.mipt.rssi.ru/~billy/
robot-owner-email: billy168@aha.ru
robot-status: development
robot-purpose: indexing
robot-type: standalone
robot-platform: crossplatform
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent:
robot-noindex: no
robot-host: www.aha.ru
robot-from: no
robot-useragent:
robot-language: java
robot-description: Cassandra search robot is used to create and maintain indexed
database for widespread Information Retrieval System
robot-history: Master of Science degree project at Moscow Institute of Physics and
Technology
robot-environment: research
modified-date: Wed, 3 Jun 1998 12:00:00 GMT
robot-id: cgireader
robot-name: Digimarc Marcspider/CGI
robot-cover-url: http://www.digimarc.com/prod_fam.html
robot-details-url: http://www.digimarc.com/prod_fam.html
robot-owner-name: Digimarc Corporation
robot-owner-url: http://www.digimarc.com
robot-owner-email: wmreader@digimarc.com
robot-status: active
robot-purpose: maintenance
robot-type: standalone
robot-platform: windowsNT
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent:
robot-noindex:
robot-host: 206.102.3.*
robot-from:
robot-useragent: Digimarc CGIReader/1.0
robot-language: c++
robot-description: Similar to Digimarc Marcspider, Marcspider/CGI examines
image files for watermarks but is more focused on CGI URLs.
In order not to waste internet bandwidth with yet another crawler,
we have contracted with one of the major crawlers/search engines
to provide us with a list of specific CGI URLs of interest to us.
If a URL is to a page of interest (via CGI), then we access the
page to get the image URLs from it, but we do not crawl to
any other pages.
robot-history: First operation in December 1997
robot-environment: service
modified-date: Fri, 5 Dec 1997 12:00:00 GMT
modified-by: Dan Ramos
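As a rough illustration of the behaviour described above (fetch a single page, collect the image URLs it references, and follow nothing else), the sketch below uses only the Python standard library; it is not Digimarc's code, and the page URL is an example:

    # Sketch: extract the image URLs referenced by one page, without
    # following any of its links.
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen

    class ImageCollector(HTMLParser):
        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.images = []

        def handle_starttag(self, tag, attrs):
            if tag == "img":
                for name, value in attrs:
                    if name == "src" and value:
                        self.images.append(urljoin(self.base_url, value))

    page_url = "http://www.example.com/gallery.html"   # example URL only
    html = urlopen(page_url).read().decode("latin-1", errors="replace")
    collector = ImageCollector(page_url)
    collector.feed(html)
    for image_url in collector.images:
        print(image_url)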
robot-id: checkbot
robot-name: Checkbot
robot-cover-url: http://www.xs4all.nl/~graaff/checkbot/
robot-details-url:
robot-owner-name: Hans de Graaff
robot-owner-url: http://www.xs4all.nl/~graaff/checkbot/
robot-owner-email: graaff@xs4all.nl
robot-status: active
robot-purpose: maintenance
robot-type: standalone
robot-platform: unix,WindowsNT
robot-availability: source
robot-exclusion: no
robot-exclusion-useragent:
robot-noindex: no
robot-host: *
robot-from: no
robot-useragent: Checkbot/x.xx LWP/5.x
robot-language: perl 5
robot-description: Checkbot checks links in a given set of pages on one or more servers. It reports links which returned an error code
robot-history:
robot-environment: hobby
modified-date: Tue Jun 25 07:44:00 1996
modified-by: Hans de Graaff
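Checkbot itself is a Perl program; the sketch below is only a rough illustration of the link-checking idea described above (fetch a page, try each link, report the ones that come back with an error), assuming Python and an example start URL:

    # Sketch: report links on one page that return an HTTP error code.
    from html.parser import HTMLParser
    from urllib.parse import urljoin
    from urllib.request import urlopen
    from urllib.error import HTTPError, URLError

    class LinkCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                for name, value in attrs:
                    if name == "href" and value:
                        self.links.append(value)

    start_url = "http://www.example.com/"      # example only
    collector = LinkCollector()
    collector.feed(urlopen(start_url).read().decode("latin-1", errors="replace"))

    for link in collector.links:
        target = urljoin(start_url, link)
        if not target.startswith("http"):      # skip mailto:, ftp:, etc.
            continue
        try:
            urlopen(target).close()
        except HTTPError as error:             # e.g. 404 Not Found
            print(error.code, target)
        except URLError as error:              # DNS failures, refused connections
            print("failed:", error.reason, target)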
robot-id: churl
robot-name: churl
robot-cover-url: http://www-personal.engin.umich.edu/~yunke/scripts/churl/
robot-details-url:
robot-owner-name: Justin Yunke
robot-owner-url: http://www-personal.engin.umich.edu/~yunke/
robot-owner-email: yunke@umich.edu
robot-status:
robot-purpose: maintenance
robot-type:
robot-platform:
robot-availability:
robot-exclusion:
robot-exclusion-useragent:
robot-noindex: no
robot-host:
robot-from:
robot-useragent:
robot-language:
robot-description: A URL checking robot, which stays within one step of the local server
robot-history:
robot-environment:
modified-date:
modified-by:
robot-id: cmc
robot-name: CMC/0.01
robot-details-url: http://www2.next.ne.jp/cgi-bin/music/help.cgi?phase=robot
robot-cover-url: http://www2.next.ne.jp/music/
robot-owner-name: Shinobu Kubota.
robot-owner-url: http://www2.next.ne.jp/cgi-bin/music/help.cgi?phase=profile
robot-owner-email: shinobu@po.next.ne.jp
robot-status: active
robot-purpose: maintenance
robot-type: standalone
robot-platform: unix
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: CMC/0.01
robot-noindex: no
robot-host: haruna.next.ne.jp, 203.183.218.4
robot-from: yes
robot-useragent: CMC/0.01
robot-language: perl5
robot-description: This CMC/0.01 robot collects the information
of the page that was registered to the music
specialty searching service.
robot-history: This CMC/0.01 robot was made for the computer
music center on November 4, 1997.
robot-environment: hobby
modified-date: Sat, 23 May 1998 17:22:00 GMT
robot-id:Collective
robot-name:Collective
robot-cover-url:http://web.ukonline.co.uk/genius/collective.htm
robot-details-url:http://web.ukonline.co.uk/genius/collective.htm
robot-owner-name:Dean Smart
robot-owner-url:http://web.ukonline.co.uk/genius/collective.htm
robot-owner-email:genius@ukonline.co.uk
robot-status:development
robot-purpose:Collective is a highly configurable program designed to interrogate
online search engines and online databases. It will ignore web pages that lie
about their content, as well as dead URLs, and it can be super strict: it
searches each web page it finds for your search terms to ensure those terms are
present. Any positive URLs are added to an HTML file for you to view at any
time, even before the program has finished. Collective can wander the web for
days if required.
robot-type:standalone
robot-platform:Windows95, WindowsNT, Windows98, Windows2000
robot-availability:Executible
robot-exclusion:No
robot-exclusion-useragent:
robot-noindex:No
robot-host:*
robot-from:No
robot-useragent:LWP
robot-language:Perl5 (With Visual Basic front-end)
robot-description:Collective is the cleverest Internet search engine, with all
found URLs guaranteed to contain your search terms.
robot-history:Development started on August 03, 2000
robot-environment:Commercial
modified-date:August, 03, 2000
modified-by:Dean Smart
robot-id: combine
robot-name: Combine System
robot-cover-url: http://www.ub2.lu.se/~tsao/combine.ps
robot-details-url: http://www.ub2.lu.se/~tsao/combine.ps
robot-owner-name: Yong Cao
robot-owner-url: http://www.ub2.lu.se/
robot-owner-email: tsao@munin.ub2.lu.se
robot-status: development
robot-purpose: indexing
robot-type: standalone
robot-platform: unix
robot-availability: source
robot-exclusion: yes
robot-exclusion-useragent: combine
robot-noindex: no
robot-host: *.ub2.lu.se
robot-from: yes
robot-useragent: combine/0.0
robot-language: c, perl5
robot-description: An open, distributed, and efficient harvester.
robot-history: A complete re-design of the NWI robot (w3index) for DESIRE project.
robot-environment: research
modified-date: Tue, 04 Mar 1997 16:11:40 GMT
modified-by: Yong Cao
robot-id: conceptbot
robot-name: Conceptbot
robot-cover-url: http://www.aptltd.com/~sifry/conceptbot/tech.html
robot-details-url: http://www.aptltd.com/~sifry/conceptbot
robot-owner-name: David L. Sifry
robot-owner-url: http://www.aptltd.com/~sifry
robot-owner-email: david@sifry.com
robot-status: development
robot-purpose: indexing
robot-type: standalone
robot-platform: unix
robot-availability: data
robot-exclusion: yes
robot-exclusion-useragent: conceptbot
robot-noindex: yes
robot-host: router.sifry.com
robot-from: yes
robot-useragent: conceptbot/0.3
robot-language: perl5
robot-description:The Conceptbot spider is used to research concept-based
search indexing techniques. It uses a breadth-first search to spread out the
number of hits on a single site over time. The spider runs at irregular
intervals and is still under construction.
robot-history: This spider began as a research project at Sifry
Consulting in April 1996.
robot-environment: research
modified-date: Mon, 9 Sep 1996 15:31:07 GMT
modified-by: David L. Sifry <david@sifry.com>
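The breadth-first strategy described above, which spreads requests across many sites rather than hammering one, is the usual shape of a polite crawler. The sketch below is not Conceptbot's code (which is Perl); it is an illustration in Python, with an arbitrary start URL, a depth limit of one, a made-up agent name ExampleBot, and a /robots.txt check done with the standard urllib.robotparser module:

    # Sketch: breadth-first crawl with a /robots.txt check and a fixed depth.
    from collections import deque
    from html.parser import HTMLParser
    from urllib.parse import urljoin, urlparse
    from urllib.request import urlopen
    from urllib import robotparser

    class LinkCollector(HTMLParser):
        def __init__(self):
            super().__init__()
            self.links = []

        def handle_starttag(self, tag, attrs):
            if tag == "a":
                self.links.extend(v for k, v in attrs if k == "href" and v)

    def allowed(url, agent="ExampleBot"):
        # Re-reads /robots.txt for every URL; fine for a sketch, wasteful in practice.
        rules = robotparser.RobotFileParser(
            "{0.scheme}://{0.netloc}/robots.txt".format(urlparse(url)))
        rules.read()
        return rules.can_fetch(agent, url)

    start = "http://www.example.com/"          # example start point only
    queue, seen = deque([(start, 0)]), {start}
    while queue:
        url, depth = queue.popleft()           # FIFO queue gives breadth-first order
        if depth > 1 or not allowed(url):
            continue
        page = urlopen(url).read().decode("latin-1", errors="replace")
        collector = LinkCollector()
        collector.feed(page)
        for link in collector.links:
            child = urljoin(url, link)
            if child.startswith("http") and child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))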
robot-id: coolbot
robot-name: CoolBot
robot-cover-url: www.suchmaschine21.de
robot-details-url: www.suchmaschine21.de
robot-owner-name: Stefan Fischerlaender
robot-owner-url: www.suchmaschine21.de
robot-owner-email: info@suchmaschine21.de
robot-status: active
robot-purpose: indexing
robot-type: standalone
robot-platform: unix
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: CoolBot
robot-noindex: yes
robot-host: www.suchmaschine21.de
robot-from: no
robot-useragent: CoolBot
robot-language: perl5
robot-description: The CoolBot robot is used to build and maintain the
directory of the german search engine Suchmaschine21.
robot-history: none so far
robot-environment: service
modified-date: Wed, 21 Jan 2001 12:16:00 GMT
modified-by: Stefan Fischerlaender
robot-id: core
robot-name: Web Core / Roots
robot-cover-url: http://www.di.uminho.pt/wc
robot-details-url:
robot-owner-name: Jorge Portugal Andrade
robot-owner-url: http://www.di.uminho.pt/~cbm
robot-owner-email: wc@di.uminho.pt
robot-status:
robot-purpose: indexing, maintenance
robot-type:
robot-platform:
robot-availability:
robot-exclusion: yes
robot-exclusion-useragent:
robot-noindex:
robot-host: shiva.di.uminho.pt, from www.di.uminho.pt
robot-from: no
robot-useragent: root/0.1
robot-language: perl
robot-description: Parallel robot developed at Minho University in Portugal to catalog relations among URLs and to support a special navigation aid.
robot-history: First versions since October 1995.
robot-environment:
modified-date: Wed Jan 10 23:19:08 1996.
modified-by:
robot-id: cosmos
robot-name: XYLEME Robot
robot-cover-url: http://xyleme.com/
robot-details-url:
robot-owner-name: Mihai Preda
robot-owner-url: http://www.mihaipreda.com/
robot-owner-email: preda@xyleme.com
robot-status: development
robot-purpose: indexing
robot-type: standalone
robot-platform: unix
robot-availability: data
robot-exclusion: yes
robot-exclusion-useragent: cosmos
robot-noindex: no
robot-nofollow: no
robot-host:
robot-from: yes
robot-useragent: cosmos/0.3
robot-language: c++
robot-description: index XML, follow HTML
robot-history:
robot-environment: service
modified-date: Fri, 24 Nov 2000 00:00:00 GMT
modified-by: Mihai Preda
robot-id: cruiser
robot-name: Internet Cruiser Robot
robot-cover-url: http://www.krstarica.com/
robot-details-url: http://www.krstarica.com/eng/url/
robot-owner-name: Internet Cruiser
robot-owner-url: http://www.krstarica.com/
robot-owner-email: robot@krstarica.com
robot-status: active
robot-purpose: indexing
robot-type: standalone
robot-platform: unix
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: Internet Cruiser Robot
robot-noindex: yes
robot-host: *.krstarica.com
robot-from: no
robot-useragent: Internet Cruiser Robot/2.1
robot-language: c++
robot-description: Internet Cruiser Robot is Internet Cruiser's prime index
agent.
robot-history:
robot-environment: service
modified-date: Fri, 17 Jan 2001 12:00:00 GMT
modified-by: tech@krstarica.com
robot-id: cusco
robot-name: Cusco
robot-cover-url: http://www.cusco.pt/
robot-details-url: http://www.cusco.pt/
robot-owner-name: Filipe Costa Clerigo
robot-owner-url: http://www.viatecla.pt/
robot-owner-email: clerigo@viatecla.pt
robot-status: active
robot-purpose: indexing
robot-type: standalone
robot-platform: any
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: cusco
robot-noindex: yes
robot-host: *.cusco.pt, *.viatecla.pt
robot-from: yes
robot-useragent: Cusco/3.2
robot-language: Java
robot-description: The Cusco robot is part of the CUCE indexing system. It
gathers information from several sources: HTTP, databases or the filesystem. At
this moment, its universe is the .pt domain and the information it gathers
is available at the Portuguese search engine Cusco, http://www.cusco.pt/.
robot-history: The Cusco search engine started in the company ViaTecla as a
project to demonstrate our development capabilities and to fill the need for
a Portuguese-specific search engine. Now we are developing new
functionality that cannot be found in any other on-line search engine.
robot-environment:service, research
modified-date: Mon, 21 Jun 1999 14:00:00 GMT
modified-by: Filipe Costa Clerigo
robot-id: cyberspyder
robot-name: CyberSpyder Link Test
robot-cover-url: http://www.cyberspyder.com/cslnkts1.html
robot-details-url: http://www.cyberspyder.com/cslnkts1.html
robot-owner-name: Tom Aman
robot-owner-url: http://www.cyberspyder.com/
robot-owner-email: amant@cyberspyder.com
robot-status: active
robot-purpose: link validation, some html validation
robot-type: standalone
robot-platform: windows 3.1x, windows95, windowsNT
robot-availability: binary
robot-exclusion: user configurable
robot-exclusion-useragent: cyberspyder
robot-noindex: no
robot-host: *
robot-from: no
robot-useragent: CyberSpyder/2.1
robot-language: Microsoft Visual Basic 4.0
robot-description: CyberSpyder Link Test is intended to be used as a site
management tool to validate that HTTP links on a page are functional and to
produce various analysis reports to assist in managing a site.
robot-history: The original robot was created to fill a widely seen need
for an easy-to-use link checking program.
robot-environment: commercial
modified-date: Tue, 31 Mar 1998 01:02:00 GMT
modified-by: Tom Aman
robot-id: deweb
robot-name: DeWeb(c) Katalog/Index
robot-cover-url: http://deweb.orbit.de/
robot-details-url:
robot-owner-name: Marc Mielke
robot-owner-url: http://www.orbit.de/
robot-owner-email: dewebmaster@orbit.de
robot-status:
robot-purpose: indexing, mirroring, statistics
robot-type: standalone
robot-platform:
robot-availability:
robot-exclusion: yes
robot-exclusion-useragent:
robot-noindex: no
robot-host: deweb.orbit.de
robot-from: yes
robot-useragent: Deweb/1.01
robot-language: perl 4
robot-description: Its purpose is to generate a Resource Discovery database, perform mirroring, and generate statistics. Uses a combination of Informix(tm) Database and WN 1.11 server software for indexing/resource discovery, fulltext search, and text excerpts.
robot-history:
robot-environment:
modified-date: Wed Jan 10 08:23:00 1996
modified-by:
robot-id: dienstspider
robot-name: DienstSpider
robot-cover-url: http://sappho.csi.forth.gr:22000/
robot-details-url:
robot-owner-name: Antonis Sidiropoulos
robot-owner-url: http://www.csi.forth.gr/~asidirop
robot-owner-email: asidirop@csi.forth.gr
robot-status: development
robot-purpose: indexing
robot-type: standalone
robot-platform: unix
robot-availability: none
robot-exclusion:
robot-exclusion-useragent:
robot-noindex:
robot-host: sappho.csi.forth.gr
robot-from:
robot-useragent: dienstspider/1.0
robot-language: C
robot-description: Indexing and searching the NCSTRL(Networked Computer Science
Technical Report Library) and ERCIM Collection
robot-history: The version 1.0 was the developer's master thesis project
robot-environment: research
modified-date: Fri, 4 Dec 1998 0:0:0 GMT
modified-by: asidirop@csi.forth.gr
robot-id: digger
robot-name: Digger
robot-cover-url: http://www.diggit.com/
robot-details-url:
robot-owner-name: Benjamin Lipchak
robot-owner-url:
robot-owner-email: admin@diggit.com
robot-status: active
robot-purpose: indexing
robot-type: standalone
robot-platform: unix, windows
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: digger
robot-noindex: yes
robot-host:
robot-from: yes
robot-useragent: Digger/1.0 JDK/1.3.0
robot-language: java
robot-description: indexing web sites for the Diggit! search engine
robot-history:
robot-environment: service
modified-date:
modified-by:
robot-id: diibot
robot-name: Digital Integrity Robot
robot-cover-url: http://www.digital-integrity.com/robotinfo.html
robot-details-url: http://www.digital-integrity.com/robotinfo.html
robot-owner-name: Digital Integrity, Inc.
robot-owner-url:
robot-owner-email: robot@digital-integrity.com
robot-status: Production
robot-purpose: WWW Indexing
robot-type:
robot-platform: unix
robot-availability: none
robot-exclusion: Conforms to robots.txt convention
robot-exclusion-useragent: DIIbot
robot-noindex: Yes
robot-host: digital-integrity.com
robot-from:
robot-useragent: DIIbot
robot-language: Java/C
robot-description:
robot-history:
robot-environment:
modified-date:
modified-by:
robot-id: directhit
robot-name: Direct Hit Grabber
robot-cover-url: www.directhit.com
robot-details-url: http://www.directhit.com/about/company/spider.html
robot-status: active
robot-description: Direct Hit Grabber indexes documents and
collects Web statistics for the Direct Hit Search Engine (available at
www.directhit.com and our partners' sites)
robot-purpose: Indexing and statistics
robot-type: standalone
robot-platform: unix
robot-language: C++
robot-owner-name: Direct Hit Technologies, Inc.
robot-owner-url: www.directhit.com
robot-owner-email: DirectHitGrabber@directhit.com
robot-exclusion: yes
robot-exclusion-useragent: grabber
robot-noindex: yes
robot-host: *.directhit.com
robot-from: yes
robot-useragent: grabber
robot-environment: service
modified-by: grabber@directhit.com
robot-id: dnabot
robot-name: DNAbot
robot-cover-url: http://xx.dnainc.co.jp/dnabot/
robot-details-url: http://xx.dnainc.co.jp/dnabot/
robot-owner-name: Tom Tanaka
robot-owner-url: http://xx.dnainc.co.jp
robot-owner-email: tomatell@xx.dnainc.co.jp
robot-status: development
robot-purpose: indexing
robot-type: standalone
robot-platform: unix, windows, windows95, windowsNT, mac
robot-availability: data
robot-exclusion: yes
robot-exclusion-useragent:
robot-noindex: no
robot-host: xx.dnainc.co.jp
robot-from: yes
robot-useragent: DNAbot/1.0
robot-language: java
robot-description: A search robot written in 100% Java, with its own built-in
database engine and web server. Currently in Japanese.
robot-history: Developed by DNA, Inc.(Niigata City, Japan) in 1998.
robot-environment: commercial
modified-date: Mon, 4 Jan 1999 14:30:00 GMT
modified-by: Tom Tanaka
robot-id: download_express
robot-name: DownLoad Express
robot-cover-url: http://www.jacksonville.net/~dlxpress
robot-details-url: http://www.jacksonville.net/~dlxpress
robot-owner-name: DownLoad Express Inc
robot-owner-url: http://www.jacksonville.net/~dlxpress
robot-owner-email: dlxpress@mediaone.net
robot-status: active
robot-purpose: graphic download
robot-type: standalone
robot-platform: win95/98/NT
robot-availability: binary
robot-exclusion: yes
robot-exclusion-useragent: downloadexpress
robot-noindex: no
robot-host: *
robot-from: no
robot-useragent:
robot-language: visual basic
robot-description: automatically downloads graphics from the web
robot-history:
robot-environment: commercial
modified-date: Wed, 05 May 1998
modified-by: DownLoad Express Inc
robot-id: dragonbot
robot-name: DragonBot
robot-cover-url: http://www.paczone.com/
robot-details-url:
robot-owner-name: Paul Law
robot-owner-url:
robot-owner-email: admin@paczone.com
robot-status: active
robot-purpose: indexing
robot-type: standalone
robot-platform: windowsNT
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: DragonBot
robot-noindex: no
robot-host: *.paczone.com
robot-from: no
robot-useragent: DragonBot/1.0 libwww/5.0
robot-language: C++
robot-description: Collects web pages related to East Asia
robot-history:
robot-environment: service
modified-date: Mon, 11 Aug 1997 00:00:00 GMT
modified-by:
robot-id: dwcp
robot-name: DWCP (Dridus' Web Cataloging Project)
robot-cover-url: http://www.dridus.com/~rmm/dwcp.php3
robot-details-url: http://www.dridus.com/~rmm/dwcp.php3
robot-owner-name: Ross Mellgren (Dridus Norwind)
robot-owner-url: http://www.dridus.com/~rmm
robot-owner-email: rmm@dridus.com
robot-status: development
robot-purpose: indexing, statistics
robot-type: standalone
robot-platform: java
robot-availability: source, binary, data
robot-exclusion: yes
robot-exclusion-useragent: dwcp
robot-noindex: no
robot-host: *.dridus.com
robot-from: dridus@dridus.com
robot-useragent: DWCP/2.0
robot-language: java
robot-description: The DWCP robot is used to gather information for
Dridus' Web Cataloging Project, which is intended to catalog domains and
urls (no content).
robot-history: Developed from scratch by Dridus Norwind.
robot-environment: hobby
modified-date: Sat, 10 Jul 1999 00:05:40 GMT
modified-by: Ross Mellgren
robot-id: e-collector
robot-name: e-collector
robot-cover-url: http://www.thatrobotsite.com/agents/ecollector.htm
robot-details-url: http://www.thatrobotsite.com/agents/ecollector.htm
robot-owner-name: Dean Smart
robot-owner-url: http://www.thatrobotsite.com
robot-owner-email: smarty@thatrobotsite.com
robot-status: Active
robot-purpose: email collector
robot-type: Collector of email addresses
robot-platform: Windows 9*/NT/2000
robot-availability: Binary
robot-exclusion: No
robot-exclusion-useragent: ecollector
robot-noindex: No
robot-host: *
robot-from: No
robot-useragent: LWP::
robot-language: Perl5
robot-description: e-collector is, in the simplest terms, an e-mail address
collector, thus the name e-collector.
So what?
Have you ever wanted the email addresses of as many companies as possible that
sell or supply, for example, "dried fruit"? I personally don't, but this is
just an example.
Those of you who may use this type of robot will know exactly what you can
do with the information. First, don't spam with it. For those still not sure
what this type of robot will do for you, take this for example:
You're an international distributor of "dried fruit" and your boss has told you
that if you raise sales by 10% he will buy you a new car (wish I had a boss
like that). There are thousands of shops, distributors etc. that
you could be doing business with, but you don't know who they are, because
they're in other countries or the nearest town and you have never heard of them
before. Has the penny dropped yet? No? Well, now you have the opportunity to
find out who they are, with an internet address and a person to contact in
that company, just by downloading and running e-collector.
Plus it's free: you don't have to do any leg work, just run the program and
sit back and watch your potential customers arriving.
robot-history:
robot-environment: Service
modified-date: Weekly
modified-by: Dean Smart
robot-id:ebiness
robot-name:EbiNess
robot-cover-url:http://sourceforge.net/projects/ebiness
robot-details-url:http://ebiness.sourceforge.net/
robot-owner-name:Mike Davis
robot-owner-url:http://www.carisbrook.co.uk/mike
robot-owner-email:mdavis@kieser.net
robot-status:Pre-Alpha
robot-purpose:statistics
robot-type:standalone
robot-platform:unix(Linux)
robot-availability:Open Source
robot-exclusion:yes
robot-exclusion-useragent:ebiness
robot-noindex:no
robot-host:
robot-from:no
robot-useragent:EbiNess/0.01a
robot-language:c++
robot-description:Used to build a url relationship database, to be viewed in 3D
robot-history:Dreamed it up over some beers
robot-environment:hobby
modified-date:Mon, 27 Nov 2000 12:26:00 GMT
modified-by:Mike Davis
robot-id: eit
robot-name: EIT Link Verifier Robot
robot-cover-url: http://wsk.eit.com/wsk/dist/doc/admin/webtest/verify_links.html
robot-details-url:
robot-owner-name: Jim McGuire
robot-owner-url: http://www.eit.com/people/mcguire.html
robot-owner-email: mcguire@eit.COM
robot-status:
robot-purpose: maintenance
robot-type:
robot-platform:
robot-availability:
robot-exclusion:
robot-exclusion-useragent:
robot-noindex: no
robot-host: *
robot-from:
robot-useragent: EIT-Link-Verifier-Robot/0.2
robot-language:
robot-description: Combination of an HTML form and a CGI script that verifies links from a given starting point (with some controls to prevent it going off-site or limitless)
robot-history: Announced on 12 July 1994
robot-environment:
modified-date:
modified-by:
robot-id: emacs
robot-name: Emacs-w3 Search Engine
robot-cover-url: http://www.cs.indiana.edu/elisp/w3/docs.html
robot-details-url:
robot-owner-name: William M. Perry
robot-owner-url: http://www.cs.indiana.edu/hyplan/wmperry.html
robot-owner-email: wmperry@spry.com
robot-status: retired
robot-purpose: indexing
robot-type: browser
robot-platform:
robot-availability:
robot-exclusion: no
robot-exclusion-useragent:
robot-noindex: no
robot-host: *
robot-from: yes
robot-useragent: Emacs-w3/v[0-9\.]+
robot-language: lisp
robot-description: Its purpose is to generate a Resource Discovery database. This code has not been looked at in a while, but will be spruced up for the Emacs-w3 2.2.0 release sometime this month. It will honor the /robots.txt file at that time.
robot-history:
robot-environment:
modified-date: Fri May 5 16:09:18 1995
modified-by:
robot-id:
emcspider
robot-name:
ananzi
robot-cover-url:
http://www.empirical.com/
robot-details-url:
robot-owner-name:
Hunter Payne
robot-owner-url:
http://www.psc.edu/~hpayne/
robot-owner-email: hpayne@u-media.com
robot-status:
robot-purpose:
indexing
robot-type:
standalone
robot-platform:
robot-availability:
robot-exclusion:
yes
robot-exclusion-useragent:
robot-noindex:
robot-host:
bilbo.internal.empirical.com
robot-from:
yes
robot-useragent:
EMC Spider
robot-language:
java
robot-description:
This spider is still in the development stages, but it
will be hitting sites while I finish debugging it.
robot-history:
robot-environment:
modified-date:
Wed May 29 14:47:01 1996.
modified-by:
robot-id: esther
robot-name: Esther
robot-details-url: http://search.falconsoft.com/
robot-cover-url: http://search.falconsoft.com/
robot-owner-name: Tim Gustafson
robot-owner-url: http://www.falconsoft.com/
robot-owner-email:
tim@falconsoft.com
robot-status: active
robot-purpose: indexing
robot-type: standalone
robot-platform: unix (FreeBSD 2.2.8)
robot-availability: data
robot-exclusion: yes
robot-exclusion-useragent: esther
robot-noindex: no
robot-host: *.falconsoft.com
robot-from: yes
robot-useragent: esther
robot-language: perl5
robot-description: This crawler is used to build the search database at
http://search.falconsoft.com/
robot-history: Developed by FalconSoft.
robot-environment: service
modified-date: Tue, 22 Dec 1998 00:22:00 PST
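A note on the robot-exclusion-useragent field seen throughout these entries: it is
the token a web site puts in its /robots.txt file to address that particular robot.
As a minimal illustration, using esther's token from the entry above (the path is
only an example), a site that wants to keep this crawler out of one directory could
serve:

    # /robots.txt -- keep the "esther" crawler out of /private/
    # The User-agent token comes from the entry above; the path is made up.
    User-agent: esther
    Disallow: /private/

Robots that honour the exclusion standard, as this entry says esther does, will then
skip everything under /private/.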
robot-id: evliyacelebi
robot-name: Evliya Celebi
robot-cover-url: http://ilker.ulak.net.tr/EvliyaCelebi
robot-details-url: http://ilker.ulak.net.tr/EvliyaCelebi
robot-owner-name: Ilker TEMIR
robot-owner-url: http://ilker.ulak.net.tr
robot-owner-email: ilker@ulak.net.tr
robot-status: development
robot-purpose: indexing turkish content
robot-type: standalone
robot-platform: unix
robot-availability: source
robot-exclusion: yes
robot-exclusion-useragent: N/A
robot-noindex: no
robot-nofollow: no
robot-host: 193.140.83.*
robot-from: ilker@ulak.net.tr
robot-useragent: Evliya Celebi v0.151 - http://ilker.ulak.net.tr
robot-language: perl5
robot-history:
robot-description: crawls pages under the ".tr" domain or having Turkish character
encoding (iso-8859-9 or windows-1254)
robot-environment: hobby
modified-date: Fri Mar 31 15:03:12 GMT 2000
robot-id:
nzexplorer
robot-name:
nzexplorer
robot-cover-url:
http://nzexplorer.co.nz/
robot-details-url:
robot-owner-name:
Paul Bourke
robot-owner-url:
http://bourke.gen.nz/paul.html
robot-owner-email: paul@bourke.gen.nz
robot-status:
active
robot-purpose:
indexing, statistics
robot-type:
standalone
robot-platform:
UNIX
robot-availability: source (commercial)
robot-exclusion:
no
robot-exclusion-useragent:
robot-noindex:
no
robot-host:
bitz.co.nz
robot-from:
no
robot-useragent:
explorersearch
robot-language:
c++
robot-history:
Started in 1995 to provide a comprehensive index
to WWW pages within New Zealand. Now also used in
Malaysia and other countries.
robot-environment: service
modified-date:
Tues, 25 Jun 1996
modified-by:
Paul Bourke
robot-id:fdse
robot-name:Fluid Dynamics Search Engine robot
robot-cover-url:http://www.xav.com/scripts/search/
robot-details-url:http://www.xav.com/scripts/search/
robot-owner-name:Zoltan Milosevic
robot-owner-url:http://www.xav.com/
robot-owner-email:zoltanm@nickname.net
robot-status:active
robot-purpose:indexing
robot-type:standalone
robot-platform:unix;windows
robot-availability:source;data
robot-exclusion:yes
robot-exclusion-useragent:FDSE
robot-noindex:yes
robot-host:*
robot-from:yes
robot-useragent:Mozilla/4.0 (compatible: FDSE robot)
robot-language:perl5
robot-description:Crawls remote sites as part of a shareware search engine
program
robot-history:Developed in late 1998 over three pots of coffee
robot-environment:commercial
modified-date:Fri, 21 Jan 2000 10:15:49 GMT
modified-by:Zoltan Milosevic
robot-id:
felix
robot-name:
Felix IDE
robot-cover-url:
http://www.pentone.com
robot-details-url:
http://www.pentone.com
robot-owner-name:
The Pentone Group, Inc.
robot-owner-url:
http://www.pentone.com
robot-owner-email:
felix@pentone.com
robot-status:
active
robot-purpose: indexing, statistics
robot-type:
standalone
robot-platform: windows95, windowsNT
robot-availability:
binary
robot-exclusion:
yes
robot-exclusion-useragent:
FELIX IDE
robot-noindex: yes
robot-host:
*
robot-from:
yes
robot-useragent:
FelixIDE/1.0
robot-language: visual basic
robot-description:
Felix IDE is a retail personal search spider sold by
The Pentone Group, Inc.
It supports the proprietary exclusion "Frequency: ??????????" in the
robots.txt file. Question marks represent an integer
indicating number of milliseconds to delay between document requests. This
is called VDRF(tm) or Variable Document Retrieval Frequency. Note that
users can re-define the useragent name.
robot-history: This robot began as an in-house tool for the lucrative Felix
IDS (Information Discovery Service) and has gone retail.
robot-environment:
service, commercial, research
modified-date: Fri, 11 Apr 1997 19:08:02 GMT
modified-by:
Kerry B. Rogers
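The "Frequency" directive described in the Felix IDE entry above is proprietary, and
its exact placement inside /robots.txt is not spelled out here, so the fragment below
is only an assumed layout: a record addressed to the FELIX IDE user-agent asking for
a ten-second delay between document requests.

    # Hypothetical robots.txt fragment for the Felix IDE VDRF extension.
    # "Frequency" is assumed to sit inside the robot's record and to give the
    # delay between document requests in milliseconds (10000 ms = 10 seconds).
    User-agent: FELIX IDE
    Frequency: 10000
    Disallow: /cgi-bin/

Robots that do not understand the extension will simply ignore the unknown Frequency
line.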
robot-id:
ferret
robot-name:
Wild Ferret Web Hopper #1, #2, #3
robot-cover-url:
http://www.greenearth.com/
robot-details-url:
robot-owner-name:
Greg Boswell
robot-owner-url:
http://www.greenearth.com/
robot-owner-email: ghbos@postoffice.worldnet.att.net
robot-status:
robot-purpose:
indexing maintenance statistics
robot-type:
standalone
robot-platform:
robot-availability:
robot-exclusion:
no
robot-exclusion-useragent:
robot-noindex:
robot-host:
robot-from:
yes
robot-useragent:
Hazel's Ferret Web hopper,
robot-language:
C++, Visual Basic, Java
robot-description: The wild ferret web hoppers are designed as specific agents
to retrieve data from all available sources on the internet.
They work in an onion format, hopping from spot to spot one
level at a time over the internet. The information is
gathered into different relational databases, known as
"Hazel's Horde". The information is publicly available and
will be free for the browsing at www.greenearth.com.
Effective date of the data posting is to be
announced.
robot-history:
robot-environment:
modified-date:
Mon Feb 19 00:28:37 1996.
modified-by:
robot-id: fetchrover
robot-name: FetchRover
robot-cover-url: http://www.engsoftware.com/fetch.htm
robot-details-url: http://www.engsoftware.com/spiders/
robot-owner-name: Dr. Kenneth R. Wadland
robot-owner-url: http://www.engsoftware.com/
robot-owner-email: ken@engsoftware.com
robot-status: active
robot-purpose: maintenance, statistics
robot-type: standalone
robot-platform: Windows/NT, Windows/95, Solaris SPARC
robot-availability: binary, source
robot-exclusion: yes
robot-exclusion-useragent: ESI
robot-noindex: N/A
robot-host: *
robot-from: yes
robot-useragent: ESIRover v1.0
robot-language: C++
robot-description: FetchRover fetches Web Pages.
It is an automated page-fetching engine. FetchRover can be
used stand-alone or as the front-end to a full-featured Spider.
Its database can use any ODBC compliant database server, including
Microsoft Access, Oracle, Sybase SQL Server, FoxPro, etc.
robot-history: Used as the front-end to SmartSpider (another Spider
product sold by Engineering Software, Inc.)
robot-environment: commercial, service
modified-date: Thu, 03 Apr 1997 21:49:50 EST
modified-by: Ken Wadland
robot-id: fido
robot-name: fido
robot-cover-url: http://www.planetsearch.com/
robot-details-url: http://www.planetsearch.com/info/fido.html
robot-owner-name: Steve DeJarnett
robot-owner-url: http://www.planetsearch.com/staff/steved.html
robot-owner-email: fido@planetsearch.com
robot-status: active
robot-purpose: indexing
robot-type: standalone
robot-platform: Unix
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: fido
robot-noindex: no
robot-host: fido.planetsearch.com, *.planetsearch.com, 206.64.113.*
robot-from: yes
robot-useragent: fido/0.9 Harvest/1.4.pl2
robot-language: c, perl5
robot-description: fido is used to gather documents for the search engine
provided in the PlanetSearch service, which is operated by
the Philips Multimedia Center. The robot runs on an
ongoing basis.
robot-history: fido was originally based on the Harvest Gatherer, but has since
evolved into a new creature. It still uses some support code
from Harvest.
robot-environment: service
modified-date: Sat, 2 Nov 1996 00:08:18 GMT
modified-by: Steve DeJarnett
robot-id:
finnish
robot-name:
Hämähäkki
robot-cover-url:
http://www.fi/search.html
robot-details-url: http://www.fi/www/spider.html
robot-owner-name:
Timo Metsälä
robot-owner-url:
http://www.fi/~timo/
robot-owner-email: Timo.Metsala@www.fi
robot-status:
active
robot-purpose:
indexing
robot-type:
standalone
robot-platform:
UNIX
robot-availability: no
robot-exclusion:
yes
robot-exclusion-useragent: Hämähäkki
robot-noindex:
no
robot-host:
*.www.fi
robot-from:
yes
robot-useragent:
Hämähäkki/0.2
robot-language:
C
robot-description: Its purpose is to generate a Resource Discovery
database from the Finnish (top-level domain .fi) www servers.
The resulting database is used by the search engine
at http://www.fi/search.html.
robot-history:
(The name Hämähäkki is just Finnish for spider.)
robot-environment:
modified-date:
1996-06-25
modified-by:
Jaakko.Hyvatti@www.fi
robot-id: fireball
robot-name: KIT-Fireball
robot-cover-url: http://www.fireball.de
robot-details-url: http://www.fireball.de/technik.html (in German)
robot-owner-name: Gruner + Jahr Electronic Media Service GmbH
robot-owner-url: http://www.ems.guj.de
robot-owner-email:info@fireball.de
robot-status: active
robot-purpose: indexing
robot-type: standalone
robot-platform: unix
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: KIT-Fireball
robot-noindex: yes
robot-host: *.fireball.de
robot-from: yes
robot-useragent: KIT-Fireball/2.0 libwww/5.0a
robot-language: c
robot-description: The Fireball robots gather web documents in German
language for the database of the Fireball search service.
robot-history: The robot was developed by Benhui Chen in a research
project at the Technical University of Berlin in 1996 and was
re-implemented by its developer in 1997 for the present owner.
robot-environment: service
modified-date: Mon Feb 23 11:26:08 1998
modified-by: Detlev Kalb
robot-id:
fish
robot-name:
Fish search
robot-cover-url:
http://www.win.tue.nl/bin/fish-search
robot-details-url:
robot-owner-name:
Paul De Bra
robot-owner-url:
http://www.win.tue.nl/win/cs/is/debra/
robot-owner-email: debra@win.tue.nl
robot-status:
robot-purpose:
indexing
robot-type:
standalone
robot-platform:
robot-availability: binary
robot-exclusion:
no
robot-exclusion-useragent:
robot-noindex:
no
robot-host:
www.win.tue.nl
robot-from:
no
robot-useragent:
Fish-Search-Robot
robot-language:
c
robot-description: Its purpose is to discover resources on the fly. A version
exists that is integrated into the Tübingen Mosaic
2.4.2 browser (also written in C).
robot-history:
Originated as an addition to Mosaic for X
robot-environment:
modified-date:
Mon May 8 09:31:19 1995
modified-by:
robot-id: fouineur
robot-name: Fouineur
robot-cover-url: http://fouineur.9bit.qc.ca/
robot-details-url: http://fouineur.9bit.qc.ca/informations.html
robot-owner-name: Joel Vandal
robot-owner-url: http://www.9bit.qc.ca/~jvandal/
robot-owner-email: jvandal@9bit.qc.ca
robot-status: development
robot-purpose: indexing, statistics
robot-type: standalone
robot-platform: unix, windows
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: fouineur
robot-noindex: no
robot-host: *
robot-from: yes
robot-useragent: Mozilla/2.0 (compatible fouineur v2.0; fouineur.9bit.qc.ca)
robot-language: perl5
robot-description: This robot automatically builds a database that is used
by our own search engine. It auto-detects the
language (French, English & Spanish) used in the HTML
page. Each database record generated by this robot
includes: date, URL, title, total words, size
and de-HTMLized text. It also supports server-side and
client-side IMAGEMAPs.
robot-history: No existing robot did everything we needed for our usage.
robot-environment: service
modified-date: Thu, 9 Jan 1997 22:57:28 EST
modified-by: jvandal@9bit.qc.ca
robot-id:
francoroute
robot-name:
Robot Francoroute
robot-cover-url:
robot-details-url:
robot-owner-name:
Marc-Antoine Parent
robot-owner-url:
http://www.crim.ca/~maparent
robot-owner-email:
maparent@crim.ca
robot-status:
robot-purpose:
indexing, mirroring, statistics
robot-type:
browser
robot-platform:
robot-availability:
robot-exclusion:
yes
robot-exclusion-useragent:
robot-noindex:
robot-host:
zorro.crim.ca
robot-from:
yes
robot-useragent:
Robot du CRIM 1.0a
robot-language:
perl5, sqlplus
robot-description: Part of the RISQ's Francoroute project for researching
francophone resources. Uses the Accept-Language tag and reduces demand
accordingly.
robot-history:
robot-environment:
modified-date:
Wed Jan 10 23:56:22 1996.
modified-by:
robot-id: freecrawl
robot-name: Freecrawl
robot-cover-url: http://euroseek.net/
robot-owner-name: Jesper Ekhall
robot-owner-email: ekhall@freeside.net
robot-status: active
robot-purpose: indexing
robot-type: standalone
robot-platform: unix
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: Freecrawl
robot-noindex: no
robot-host: *.freeside.net
robot-from: yes
robot-useragent: Freecrawl
robot-language: c
robot-description: The Freecrawl robot is used to build a database for the
EuroSeek service.
robot-environment: service
robot-id:
funnelweb
robot-name:
FunnelWeb
robot-cover-url:
http://funnelweb.net.au
robot-details-url:
robot-owner-name:
David Eagles
robot-owner-url:
http://www.pc.com.au
robot-owner-email: eaglesd@pc.com.au
robot-status:
robot-purpose:
indexing, statistics
robot-type:
standalone
robot-platform:
robot-availability:
robot-exclusion:
yes
robot-exclusion-useragent:
robot-noindex:
no
robot-host:
earth.planets.com.au
robot-from:
yes
robot-useragent:
FunnelWeb-1.0
robot-language:
c and c++
robot-description: Its purpose is to generate a Resource Discovery database,
and generate statistics. Localised South Pacific Discovery
and Search Engine, plus distributed operation under
development.
robot-history:
robot-environment:
modified-date:
Mon Nov 27 21:30:11 1995
modified-by:
robot-id: gazz
robot-name: gazz
robot-cover-url: http://gazz.nttrd.com/
robot-details-url: http://gazz.nttrd.com/
robot-owner-name: NTT Cyberspace Laboratories
robot-owner-url: http://gazz.nttrd.com/
robot-owner-email: gazz@nttrd.com
robot-status: development
robot-purpose: statistics
robot-type: standalone
robot-platform: unix
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: gazz
robot-noindex: yes
robot-host: *.nttrd.com, *.infobee.ne.jp
robot-from: yes
robot-useragent: gazz/1.0
robot-language: c
robot-description: This robot is used for research purposes.
robot-history: Its roots are in the TITAN project at NTT.
robot-environment: research
modified-date: Wed, 09 Jun 1999 10:43:18 GMT
modified-by: noto@isl.ntt.co.jp
robot-id: gcreep
robot-name: GCreep
robot-cover-url: http://www.instrumentpolen.se/gcreep/index.html
robot-details-url: http://www.instrumentpolen.se/gcreep/index.html
robot-owner-name: Instrumentpolen AB
robot-owner-url: http://www.instrumentpolen.se/ip-kontor/eng/index.html
robot-owner-email: anders@instrumentpolen.se
robot-status: development
robot-purpose: indexing
robot-type: browser+standalone
robot-platform: linux+mysql
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: gcreep
robot-noindex: yes
robot-host: mbx.instrumentpolen.se
robot-from: yes
robot-useragent: gcreep/1.0
robot-language: c
robot-description: Indexing robot to learn SQL
robot-history: Spare time project begun late '96, maybe early '97
robot-environment: hobby
modified-date: Fri, 23 Jan 1998 16:09:00 MET
modified-by: Anders Hedstrom
robot-id:
getbot
robot-name:
GetBot
robot-cover-url:
http://www.blacktop.com.zav/bots
robot-details-url:
robot-owner-name:
Alex Zavatone
robot-owner-url:
http://www.blacktop.com/zav
robot-owner-email: zav@macromedia.com
robot-status:
robot-purpose:
maintenance
robot-type:
standalone
robot-platform:
robot-availability:
robot-exclusion:
no.
robot-exclusion-useragent:
robot-noindex:
robot-host:
robot-from:
no
robot-useragent:
???
robot-language:
Shockwave/Director.
robot-description: GetBot's purpose is to index all the sites it can find that
contain Shockwave movies. It is the first bot or spider
written in Shockwave. The bot was originally written at
Macromedia on a hungover Sunday as a proof of concept. Alex Zavatone 3/29/96
robot-history:
robot-environment:
modified-date:
Fri Mar 29 20:06:12 1996.
modified-by:
robot-id:
geturl
robot-name:
GetURL
robot-cover-url:
http://Snark.apana.org.au/James/GetURL/
robot-details-url:
robot-owner-name:
James Burton
robot-owner-url:
http://Snark.apana.org.au/James/
robot-owner-email: James@Snark.apana.org.au
robot-status:
robot-purpose:
maintenance, mirroring
robot-type:
standalone
robot-platform:
robot-availability:
robot-exclusion:
no
robot-exclusion-useragent:
robot-noindex:
no
robot-host:
*
robot-from:
no
robot-useragent:
GetURL.rexx v1.05
robot-language:
ARexx (Amiga REXX)
robot-description: Its purpose is to validate links, perform mirroring, and
copy document trees. Designed as a tool for retrieving web
pages in batch mode without the encumbrance of a browser.
Can be used to describe a set of pages to fetch, and to
maintain an archive or mirror. Is not run by a central site
and accessed by clients - is run by the end user or archive
maintainer
robot-history:
robot-environment:
modified-date:
Tue May 9 15:13:12 1995
modified-by:
robot-id: golem
robot-name: Golem
robot-cover-url: http://www.quibble.com/golem/
robot-details-url: http://www.quibble.com/golem/
robot-owner-name: Geoff Duncan
robot-owner-url: http://www.quibble.com/geoff/
robot-owner-email: geoff@quibble.com
robot-status: active
robot-purpose: maintenance
robot-type: standalone
robot-platform: mac
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: golem
robot-noindex: no
robot-host: *.quibble.com
robot-from: yes
robot-useragent: Golem/1.1
robot-language: HyperTalk/AppleScript/C++
robot-description: Golem generates status reports on collections of URLs
supplied by clients. Designed to assist with editorial updates of
Web-related sites or products.
robot-history: Personal project turned into a contract service for private
clients.
robot-environment: service,research
modified-date: Wed, 16 Apr 1997 20:50:00 GMT
modified-by: Geoff Duncan
robot-id: googlebot
robot-name: Googlebot
robot-cover-url: http://googlebot.com/
robot-details-url: http://googlebot.com/
robot-owner-name: Google Inc.
robot-owner-url: http://google.com/
robot-owner-email: googlebot@googlebot.com
robot-status: active
robot-purpose: indexing statistics
robot-type: standalone
robot-platform:
robot-availability:
robot-exclusion: yes
robot-exclusion-useragent: Googlebot
robot-noindex: yes
robot-host: *.googlebot.com
robot-from: yes
robot-useragent: Googlebot/2.0 beta (googlebot(at)googlebot.com)
robot-language: Python
robot-description:
robot-history: Used to be called backrub and run from stanford.edu
robot-environment: service
modified-date: Wed, 29 Sep 1999 18:36:25 -0700
modified-by: Amit Patel <amitp@google.com>
robot-id: grapnel
robot-name: Grapnel/0.01 Experiment
robot-cover-url: varies
robot-details-url: mailto:v93_kat@ce.kth.se
robot-owner-name: Philip Kallerman
robot-owner-url: v93_kat@ce.kth.se
robot-owner-email: v93_kat@ce.kth.se
robot-status: Experimental
robot-purpose: Indexing
robot-type:
robot-platform: WinNT
robot-availability: None, yet
robot-exclusion: Yes
robot-exclusion-useragent: No
robot-noindex: No
robot-host: varies
robot-from: Varies
robot-useragent:
robot-language: Perl
robot-description: Resource Discovery Experimentation
robot-history: None, hoping to make some
robot-environment:
modified-date: 7 Feb 1997
modified-by:
robot-id:griffon
robot-name:Griffon
robot-cover-url:http://navi.ocn.ne.jp/
robot-details-url:http://navi.ocn.ne.jp/griffon/
robot-owner-name:NTT Communications Corporate Users Business Division
robot-owner-url:http://navi.ocn.ne.jp/
robot-owner-email:griffon@super.navi.ocn.ne.jp
robot-status:active
robot-purpose:indexing
robot-type:standalone
robot-platform:unix
robot-availability:none
robot-exclusion:yes
robot-exclusion-useragent:griffon
robot-noindex:yes
robot-nofollow:yes
robot-host:*.navi.ocn.ne.jp
robot-from:yes
robot-useragent:griffon/1.0
robot-language:c
robot-description:The Griffon robot is used to build the database for the OCN navi
search service operated by NTT Communications Corporation.
It mainly gathers pages written in Japanese.
robot-history:Its roots are in the TITAN project at NTT.
robot-environment:service
modified-date:Mon,25 Jan 2000 15:25:30 GMT
modified-by:toka@navi.ocn.ne.jp
robot-id: gromit
robot-name: Gromit
robot-cover-url: http://www.austlii.edu.au/
robot-details-url: http://www2.austlii.edu.au/~dan/gromit/
robot-owner-name: Daniel Austin
robot-owner-url: http://www2.austlii.edu.au/~dan/
robot-owner-email: dan@austlii.edu.au
robot-status: development
robot-purpose: indexing
robot-type: standalone
robot-platform: unix
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: Gromit
robot-noindex: no
robot-host: *.austlii.edu.au
robot-from: yes
robot-useragent: Gromit/1.0
robot-language: perl5
robot-description: Gromit is a Targeted Web Spider that indexes legal
sites contained in the AustLII legal links database.
robot-history: This robot is based on the Perl5 LWP::RobotUA module.
robot-environment: research
modified-date: Wed, 11 Jun 1997 03:58:40 GMT
modified-by: Daniel Austin
robot-id: gulliver
robot-name: Northern Light Gulliver
robot-cover-url:
robot-details-url:
robot-owner-name: Mike Mulligan
robot-owner-url:
robot-owner-email: crawler@northernlight.com
robot-status: active
robot-purpose: indexing
robot-type: standalone
robot-platform: unix
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: gulliver
robot-noindex: yes
robot-host: scooby.northernlight.com, taz.northernlight.com,
gulliver.northernlight.com
robot-from: yes
robot-useragent: Gulliver/1.1
robot-language: c
robot-description: Gulliver is a robot to be used to collect
web pages for indexing and subsequent searching of the index.
robot-history: Oct 1996: development; Dec 1996-Jan 1997: crawl & debug;
Mar 1997: crawl again;
robot-environment: service
modified-date: Wed, 21 Apr 1999 16:00:00 GMT
modified-by: Mike Mulligan
robot-id: hambot
robot-name: HamBot
robot-cover-url: http://www.hamrad.com/search.html
robot-details-url: http://www.hamrad.com/
robot-owner-name: John Dykstra
robot-owner-url:
robot-owner-email: john@futureone.com
robot-status: active
robot-purpose: indexing
robot-type: standalone
robot-platform: unix, Windows95
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: hambot
robot-noindex: yes
robot-host: *.hamrad.com
robot-from:
robot-useragent:
robot-language: perl5, C++
robot-description: Two HamBot robots are used (stand-alone & browser-based)
to aid in building the database for HamRad Search - The Search Engine for
Search Engines. The robots are run intermittently and perform nearly
identical functions.
robot-history: A non-commercial (hobby?) project to aid in building and
maintaining the database for the HamRad search engine.
robot-environment: service
modified-date: Fri, 17 Apr 1998 21:44:00 GMT
modified-by: JD
robot-id:
harvest
robot-name:
Harvest
robot-cover-url:
http://harvest.cs.colorado.edu
robot-details-url:
robot-owner-name:
robot-owner-url:
robot-owner-email:
robot-status:
robot-purpose:
indexing
robot-type:
robot-platform:
robot-availability:
robot-exclusion:
robot-exclusion-useragent:
robot-noindex:
robot-host:
bruno.cs.colorado.edu
robot-from:
yes
robot-useragent:
yes
robot-language:
robot-description: Harvest's motivation is to index community- or
topic-specific collections, rather than to locate and index all
HTML objects that can be found. Also, Harvest allows users
to control the enumeration several ways, including stop
lists and depth and count limits. Therefore, Harvest
provides a much more controlled way of indexing the Web than
is typical of robots. Pauses 1 second between requests (by
default).
robot-history:
robot-environment:
modified-date:
modified-by:
robot-id: havindex
robot-name: havIndex
robot-cover-url: http://www.hav.com/
robot-details-url: http://www.hav.com/
robot-owner-name: hav.Software and Horace A. (Kicker) Vallas
robot-owner-url: http://www.hav.com/
robot-owner-email: havIndex@hav.com
robot-status: active
robot-purpose: indexing
robot-type: standalone
robot-platform: Java VM 1.1
robot-availability: binary
robot-exclusion: yes
robot-exclusion-useragent: havIndex
robot-noindex: yes
robot-host: *
robot-from: no
robot-useragent: havIndex/X.xx[bxx]
robot-language: Java
robot-description: havIndex allows individuals to build a searchable word
index of (user-specified) lists of URLs. havIndex does not crawl;
rather, it requires one or more user-supplied lists of URLs to be
indexed. havIndex does (optionally) save URLs parsed from indexed
pages.
robot-history: Developed to answer client requests for URL specific
index capabilities.
robot-environment: commercial, service
modified-date: 6-27-98
modified-by: Horace A. (Kicker) Vallas
robot-id:
hi
robot-name:
HI (HTML Index) Search
robot-cover-url:
http://cs6.cs.ait.ac.th:21870/pa.html
robot-details-url:
robot-owner-name:
Razzakul Haider Chowdhury
robot-owner-url:
http://cs6.cs.ait.ac.th:21870/index.html
robot-owner-email: a94385@cs.ait.ac.th
robot-status:
robot-purpose:
indexing
robot-type:
robot-platform:
robot-availability:
robot-exclusion:
no
robot-exclusion-useragent:
robot-noindex:
no
robot-host:
robot-from:
yes
robot-useragent:
AITCSRobot/1.1
robot-language:
perl 5
robot-description: Its purpose is to generate a Resource Discovery database.
This Robot traverses the net and creates a searchable
database of Web pages. It stores the title string of the
HTML document and the absolute URL. A search engine provides
boolean AND & OR query models, with or without filtering
out stop words. A feature lets Web page
owners add their URL to the searchable database.
robot-history:
robot-environment:
modified-date:
Wed Oct 4 06:54:31 1995
modified-by:
robot-id: hometown
robot-name: Hometown Spider Pro
robot-cover-url: http://www.hometownsingles.com
robot-details-url: http://www.hometownsingles.com
robot-owner-name: Bob Brown
robot-owner-url: http://www.hometownsingles.com
robot-owner-email: admin@hometownsingles.com
robot-status: active
robot-purpose: indexing
robot-type: standalone
robot-platform: windowsNT
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: *
robot-noindex: yes
robot-host: 63.195.193.17
robot-from: no
robot-useragent: Hometown Spider Pro
robot-language: delphi
robot-description: The Hometown Spider Pro is used to maintain the indexes
for Hometown Singles.
robot-history: Innerprise URL Spider Pro
robot-environment: commercial
modified-date: Tue, 28 Mar 2000 16:00:00 GMT
modified-by: Hometown Singles
robot-id: wired-digital
robot-name: Wired Digital
robot-cover-url:
robot-details-url:
robot-owner-name: Bowen Dwelle
robot-owner-url:
robot-owner-email: bowen@hotwired.com
robot-status: development
robot-purpose: indexing
robot-type: standalone
robot-platform: unix
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: hotwired
robot-noindex: no
robot-host: gossip.hotwired.com
robot-from: yes
robot-useragent: wired-digital-newsbot/1.5
robot-language: perl-5.004
robot-description: this is a test
robot-history:
robot-environment: research
modified-date: Thu, 30 Oct 1997
modified-by: bowen@hotwired.com
robot-id:
htdig
robot-name:
ht://Dig
robot-cover-url:
http://www.htdig.org/
robot-details-url:
http://www.htdig.org/howitworks.html
robot-owner-name:
Andrew Scherpbier
robot-owner-url:
http://www.htdig.org/author.html
robot-owner-email:
andrew@contigo.com
robot-owner-name2:
Geoff Hutchison
robot-owner-url2:
http://wso.williams.edu/~ghutchis/
robot-owner-email2:
ghutchis@wso.williams.edu
robot-status:
robot-purpose:
indexing
robot-type:
standalone
robot-platform:
unix
robot-availability: source
robot-exclusion:
yes
robot-exclusion-useragent: htdig
robot-noindex:
yes
robot-host:
*
robot-from:
no
robot-useragent:
htdig/3.1.0b2
robot-language:
C,C++.
robot-history:This robot was originally developed for use at San Diego
State University.
robot-environment:
modified-date:Tue, 3 Nov 1998 10:09:02 EST
modified-by: Geoff Hutchison <Geoffrey.R.Hutchison@williams.edu>
robot-id:
htmlgobble
robot-name:
HTMLgobble
robot-cover-url:
robot-details-url:
robot-owner-name:
Andreas Ley
robot-owner-url:
robot-owner-email: ley@rz.uni-karlsruhe.de
robot-status:
robot-purpose:
mirror
robot-type:
robot-platform:
robot-availability:
robot-exclusion:
robot-exclusion-useragent:
robot-noindex:
no
robot-host:
tp70.rz.uni-karlsruhe.de
robot-from:
yes
robot-useragent:
HTMLgobble v2.2
robot-language:
robot-description: A mirroring robot. Configured to stay within a directory,
sleeps between requests, and the next version will use HEAD
to check if the entire document needs to be
retrieved
robot-history:
robot-environment:
modified-date:
modified-by:
robot-id:
hyperdecontextualizer
robot-name:
Hyper-Decontextualizer
robot-cover-url:
http://www.tricon.net/Comm/synapse/spider/
robot-details-url:
robot-owner-name:
Cliff Hall
robot-owner-url:
http://kpt1.tricon.net/cgi-bin/cliff.cgi
robot-owner-email: cliff@tricon.net
robot-status:
robot-purpose:
indexing
robot-type:
standalone
robot-platform:
robot-availability:
robot-exclusion:
no
robot-exclusion-useragent:
robot-noindex:
robot-host:
robot-from:
no
robot-useragent:
no
robot-language:
Perl 5
robot-description:
Takes an input sentence and marks up each word with
an appropriate hyper-text link.
robot-history:
robot-environment:
modified-date:
Mon May 6 17:41:29 1996.
modified-by:
robot-id:
ibm
robot-name:
IBM_Planetwide
robot-cover-url:
http://www.ibm.com/%7ewebmaster/
robot-details-url:
robot-owner-name:
Ed Costello
robot-owner-url:
http://www.ibm.com/%7ewebmaster/
robot-owner-email: epc@www.ibm.com
robot-status:
robot-purpose:
indexing, maintenance, mirroring
robot-type:
standalone and
robot-platform:
robot-availability:
robot-exclusion:
yes
robot-exclusion-useragent:
robot-noindex:
robot-host:
www.ibm.com www2.ibm.com
robot-from:
yes
robot-useragent:
IBM_Planetwide,
robot-language:
Perl5
robot-description: Restricted to IBM owned or related domains.
robot-history:
robot-environment:
modified-date:
Mon Jan 22 22:09:19 1996.
modified-by:
robot-id: iconoclast
robot-name: Popular Iconoclast
robot-cover-url: http://gestalt.sewanee.edu/ic/
robot-details-url: http://gestalt.sewanee.edu/ic/info.html
robot-owner-name: Chris Cappuccio
robot-owner-url: http://sefl.satelnet.org/~ccappuc/
robot-owner-email: chris@gestalt.sewanee.edu
robot-status: development
robot-purpose: statistics
robot-type: standalone
robot-platform: unix (OpenBSD)
robot-availability: source
robot-exclusion: no
robot-exclusion-useragent:
robot-noindex: no
robot-host: gestalt.sewanee.edu
robot-from: yes
robot-useragent: gestaltIconoclast/1.0 libwww-FM/2.17
robot-language: c,perl5
robot-description: This guy likes statistics
robot-history: This robot has a history in mathematics and english
robot-environment: research
modified-date: Wed, 5 Mar 1997 17:35:16 CST
modified-by: chris@gestalt.sewanee.edu
robot-id: Ilse
robot-name: Ingrid
robot-cover-url:
robot-details-url:
robot-owner-name: Ilse c.v.
robot-owner-url: http://www.ilse.nl/
robot-owner-email: ilse@ilse.nl
robot-status: Running
robot-purpose: Indexing
robot-type: Web Indexer
robot-platform: UNIX
robot-availability: Commercial as part of search engine package
robot-exclusion: Yes
robot-exclusion-useragent: INGRID/0.1
robot-noindex: Yes
robot-host: bart.ilse.nl
robot-from: Yes
robot-useragent: INGRID/0.1
robot-language: C
robot-description:
robot-history:
robot-environment:
modified-date: 06/13/1997
modified-by: Ilse
robot-id: imagelock
robot-name: Imagelock
robot-cover-url:
robot-details-url:
robot-owner-name: Ken Belanger
robot-owner-url:
robot-owner-email: belanger@imagelock.com
robot-status: development
robot-purpose: maintenance
robot-type:
robot-platform: windows95
robot-availability: none
robot-exclusion: no
robot-exclusion-useragent:
robot-noindex: no
robot-host: 209.111.133.*
robot-from: no
robot-useragent: Mozilla 3.01 PBWF (Win95)
robot-language:
robot-description: searches for image links
robot-history:
robot-environment: service
modified-date: Tue, 11 Aug 1998 17:28:52 GMT
modified-by: brian@smithrenaud.com
robot-id:
incywincy
robot-name:
IncyWincy
robot-cover-url:
http://osiris.sunderland.ac.uk/sst-scripts/simon.html
robot-details-url:
robot-owner-name:
Simon Stobart
robot-owner-url:
http://osiris.sunderland.ac.uk/sst-scripts/simon.html
robot-owner-email: simon.stobart@sunderland.ac.uk
robot-status:
robot-purpose:
robot-type:
standalone
robot-platform:
robot-availability:
robot-exclusion:
yes
robot-exclusion-useragent:
robot-noindex:
robot-host:
osiris.sunderland.ac.uk
robot-from:
yes
robot-useragent:
IncyWincy/1.0b1
robot-language:
C++
robot-description: Various Research projects at the University of
Sunderland
robot-history:
robot-environment:
modified-date:
Fri Jan 19 21:50:32 1996.
modified-by:
robot-id: informant
robot-name: Informant
robot-cover-url: http://informant.dartmouth.edu/
robot-details-url: http://informant.dartmouth.edu/about.html
robot-owner-name: Bob Gray
robot-owner-name2: Aditya Bhasin
robot-owner-name3: Katsuhiro Moizumi
robot-owner-name4: Dr. George V. Cybenko
robot-owner-url: http://informant.dartmouth.edu/
robot-owner-email: info_adm@cosmo.dartmouth.edu
robot-status: active
robot-purpose: indexing
robot-type: standalone
robot-platform: unix
robot-availability: none
robot-exclusion: no
robot-exclusion-useragent: Informant
robot-noindex: no
robot-host: informant.dartmouth.edu
robot-from: yes
robot-useragent: Informant
robot-language: c, c++
robot-description: The Informant robot continually checks the Web pages
that are relevant to user queries. Users are notified of any new or
updated pages. The robot runs daily, but the number of hits per site
per day should be quite small, and these hits should be randomly
distributed over several hours. Since the robot does not actually
follow links (aside from those returned from the major search engines
such as Lycos), it does not fall victim to the common looping problems.
The robot will support the Robot Exclusion Standard by early December, 1996.
robot-history: The robot is part of a research project at Dartmouth College.
The robot may become part of a commercial service (at which time it may be
subsumed by some other, existing robot).
robot-environment: research, service
modified-date: Sun, 3 Nov 1996 11:55:00 GMT
modified-by: Bob Gray
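The page-watching behaviour described in the Informant entry above (re-fetch the
pages relevant to a user's query and notify the user when one has changed) can be
sketched briefly. This is not Informant's code, which is written in C/C++; it is a
minimal Python illustration, the state file name is invented for the example, and
the URL is just the service's own home page.

    # Minimal sketch of "notify when a watched page changes": fetch each page,
    # hash its body, and compare against the hash stored on the previous run.
    # Not Informant's own code; the state file name is invented for the example.
    import hashlib
    import json
    import os
    import urllib.request

    STATE_FILE = "watched_pages.json"   # hypothetical local state file

    def load_state():
        if os.path.exists(STATE_FILE):
            with open(STATE_FILE) as f:
                return json.load(f)
        return {}

    def check_pages(urls):
        state = load_state()
        for url in urls:
            with urllib.request.urlopen(url, timeout=15) as response:
                digest = hashlib.sha256(response.read()).hexdigest()
            if state.get(url) not in (None, digest):
                print("changed:", url)
            state[url] = digest
        with open(STATE_FILE, "w") as f:
            json.dump(state, f)

    check_pages(["http://informant.dartmouth.edu/"])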
robot-id:
infoseek
robot-name:
InfoSeek Robot 1.0
robot-cover-url:
http://www.infoseek.com
robot-details-url:
robot-owner-name:
Steve Kirsch
robot-owner-url:
http://www.infoseek.com
robot-owner-email: stk@infoseek.com
robot-status:
robot-purpose:
indexing
robot-type:
standalone
robot-platform:
robot-availability:
robot-exclusion:
yes
robot-exclusion-useragent:
robot-noindex:
no
robot-host:
corp-gw.infoseek.com
robot-from:
yes
robot-useragent:
InfoSeek Robot 1.0
robot-language:
python
robot-description: Its purpose is to generate a Resource Discovery database.
Collects WWW pages for both InfoSeek's free WWW search and
commercial search. Uses a unique proprietary algorithm to
identify the most popular and interesting WWW pages. Very
fast, but never has more than one request per site
outstanding at any given time. Has been refined for more
than a year.
robot-history:
robot-environment:
modified-date:
Sun May 28 01:35:48 1995
modified-by:
robot-id:
infoseeksidewinder
robot-name:
Infoseek Sidewinder
robot-cover-url:
http://www.infoseek.com/
robot-details-url:
robot-owner-name:
Mike Agostino
robot-owner-url:
http://www.infoseek.com/
robot-owner-email: mna@infoseek.com
robot-status:
robot-purpose:
indexing
robot-type:
standalone
robot-platform:
robot-availability:
robot-exclusion:
yes
robot-exclusion-useragent:
robot-noindex:
robot-host:
robot-from:
yes
robot-useragent:
Infoseek Sidewinder
robot-language:
C
robot-description:
Collects WWW pages for both InfoSeek's free WWW search
services. Uses a unique, incremental, very fast proprietary
algorithm to find WWW pages.
robot-history:
robot-environment:
modified-date:
Sat Apr 27 01:20:15 1996.
modified-by:
robot-id: infospider
robot-name: InfoSpiders
robot-cover-url: http://www-cse.ucsd.edu/users/fil/agents/agents.html
robot-owner-name: Filippo Menczer
robot-owner-url: http://www-cse.ucsd.edu/users/fil/
robot-owner-email: fil@cs.ucsd.edu
robot-status: development
robot-purpose: search
robot-type: standalone
robot-platform: unix, mac
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: InfoSpiders
robot-noindex: no
robot-host: *.ucsd.edu
robot-from: yes
robot-useragent: InfoSpiders/0.1
robot-language: c, perl5
robot-description: application of artificial life algorithm to adaptive
distributed information retrieval
robot-history: UC San Diego, Computer Science Dept. PhD research project
(1995-97) under supervision of Prof. Rik Belew
robot-environment: research
modified-date: Mon, 16 Sep 1996 14:08:00 PDT
robot-id: inspectorwww
robot-name: Inspector Web
robot-cover-url: http://www.greenpac.com/inspector/
robot-details-url: http://www.greenpac.com/inspector/ourrobot.html
robot-owner-name: Doug Green
robot-owner-url: http://www.greenpac.com
robot-owner-email: doug@greenpac.com
robot-status: active: robot significantly developed, but still undergoing fixes
robot-purpose: maintenance: link validation, html validation, image size
validation, etc.
robot-type: standalone
robot-platform: unix
robot-availability: free service and more extensive commercial service
robot-exclusion: yes
robot-exclusion-useragent: inspectorwww
robot-noindex: no
robot-host: www.corpsite.com, www.greenpac.com, 38.234.171.*
robot-from: yes
robot-useragent: inspectorwww/1.0 http://www.greenpac.com/inspectorwww.html
robot-language: c
robot-description: Provides inspection reports which give advice to WWW
site owners on missing links, image resize problems, syntax errors, etc.
robot-history: development started in Mar 1997
robot-environment: commercial
modified-date: Tue Jun 17 09:24:58 EST 1997
modified-by: Doug Green
robot-id:
intelliagent
robot-name:
IntelliAgent
robot-cover-url:
http://www.geocities.com/SiliconValley/3086/iagent.html
robot-details-url:
robot-owner-name:
David Reilly
robot-owner-url:
http://www.geocities.com/SiliconValley/3086/index.html
robot-owner-email: s1523@sand.it.bond.edu.au
robot-status:
development
robot-purpose:
indexing
robot-type:
standalone
robot-platform:
robot-availability:
robot-exclusion:
no
robot-exclusion-useragent:
robot-noindex:
robot-host:
sand.it.bond.edu.au
robot-from:
no
robot-useragent:
'IAGENT/1.0'
robot-language:
C
robot-description: IntelliAgent is still in development. Indeed, it is very far
from completion. I'm planning to limit the depth at which it
will probe, so hopefully IAgent won't cause anyone much of a
problem. At the end of its completion, I hope to publish
both the raw data and original source code.
robot-history:
robot-environment:
modified-date:
Fri May 31 02:10:39 1996.
modified-by:
robot-id: irobot
robot-name: I, Robot
robot-cover-url: http://irobot.mame.dk/
robot-details-url: http://irobot.mame.dk/about.phtml
robot-owner-name: [mame.dk]
robot-owner-url: http://www.mame.dk/
robot-owner-email: irobot@chaos.dk
robot-status: active
robot-purpose: indexing
robot-type: standalone
robot-platform: unix
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: irobot
robot-noindex: yes
robot-host: *.mame.dk, 206.161.121.*
robot-from: no
robot-useragent: I Robot 0.4 (irobot@chaos.dk)
robot-language: c
robot-description: I Robot is used to build a fresh database for the
emulation community. Primary focus is information on emulation and
especially old arcade machines. Primarily English sites will be indexed, and
only if they have their own domain. Sites are added manually, based on
submissions, after they have been evaluated.
robot-history: The robot was started in june 2000
robot-environment1: service
robot-environment2: hobby
modified-date: Fri, 27 Oct 2000 09:08:06 GMT
modified-by: BombJack mameadm@chaos.dk
robot-id:iron33
robot-name:Iron33
robot-cover-url:http://verno.ueda.info.waseda.ac.jp/iron33/
robot-details-url:http://verno.ueda.info.waseda.ac.jp/iron33/history.html
robot-owner-name:Takashi Watanabe
robot-owner-url:http://www.ueda.info.waseda.ac.jp/~watanabe/
robot-owner-email:watanabe@ueda.info.waseda.ac.jp
robot-status:active
robot-purpose:indexing, statistics
robot-type:standalone
robot-platform:unix
robot-availability:source
robot-exclusion:yes
robot-exclusion-useragent:Iron33
robot-noindex:no
robot-host:*.folon.ueda.info.waseda.ac.jp, 133.9.215.*
robot-from:yes
robot-useragent:Iron33/0.0
robot-language:c
robot-description:The robot "Iron33" is used to build the
database for the WWW search engine "Verno".
robot-history:
robot-environment:research
modified-date:Fri, 20 Mar 1998 18:34 JST
modified-by:Watanabe Takashi
robot-id:
israelisearch
robot-name:
Israeli-search
robot-cover-url:
http://www.idc.ac.il/Sandbag/
robot-details-url:
robot-owner-name:
Etamar Laron
robot-owner-url:
http://www.xpert.com/~etamar/
robot-owner-email: etamar@xpert.co
robot-status:
robot-purpose:
indexing.
robot-type:
standalone
robot-platform:
robot-availability:
robot-exclusion:
yes
robot-exclusion-useragent:
robot-noindex:
robot-host:
dylan.ius.cs.cmu.edu
robot-from:
no
robot-useragent:
IsraeliSearch/1.0
robot-language:
C
robot-description:
A complete software package designed to collect information in a
distributed workload and support context queries. Intended
to be a complete, updated resource for Israeli sites and
information related to Israel or Israeli
society.
robot-history:
robot-environment:
modified-date:
Tue Apr 23 19:23:55 1996.
modified-by:
robot-id: javabee
robot-name: JavaBee
robot-cover-url: http://www.javabee.com
robot-details-url:
robot-owner-name:ObjectBox
robot-owner-url:http://www.objectbox.com/
robot-owner-email:info@objectbox.com
robot-status:Active
robot-purpose:Stealing Java Code
robot-type:standalone
robot-platform:Java
robot-availability:binary
robot-exclusion:no
robot-exclusion-useragent:
robot-noindex:no
robot-host:*
robot-from:no
robot-useragent:JavaBee
robot-language:Java
robot-description:This robot is used to grab Java applets and run them
locally, overriding the security implemented.
robot-history:
robot-environment:commercial
modified-date:
modified-by:
robot-id: JBot
robot-name: JBot Java Web Robot
robot-cover-url: http://www.matuschek.net/software/jbot
robot-details-url: http://www.matuschek.net/software/jbot
robot-owner-name: Daniel Matuschek
robot-owner-url: http://www.matuschek.net
robot-owner-email: daniel@matuschek.net
robot-status: development
robot-purpose: indexing
robot-type: standalone
robot-platform: Java
robot-availability: source
robot-exclusion: yes
robot-exclusion-useragent: JBot
robot-noindex: no
robot-host: *
robot-from:
robot-useragent: JBot (but can be changed by the user)
robot-language: Java
robot-description: Java web crawler to download web sites
robot-history:
robot-environment: hobby
modified-date: Thu, 03 Jan 2000 16:00:00 GMT
modified-by: Daniel Matuschek <daniel@matuschek.net>
robot-id: jcrawler
robot-name: JCrawler
robot-cover-url: http://www.nihongo.org/jcrawler/
robot-details-url:
robot-owner-name: Benjamin Franz
robot-owner-url: http://www.nihongo.org/snowhare/
robot-owner-email: snowhare@netimages.com
robot-status: active
robot-purpose: indexing
robot-type: standalone
robot-platform: unix
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: jcrawler
robot-noindex: yes
robot-host: db.netimages.com
robot-from: yes
robot-useragent: JCrawler/0.2
robot-language: perl5
robot-description: JCrawler is currently used to build the Vietnam topic
specific WWW index for VietGATE
<URL:http://www.vietgate.net/>. It schedules visits
randomly, but will not visit a site more than once
every two minutes. It uses a subject matter relevance
pruning algorithm to determine what pages to crawl
and index and will not generally index pages with
no Vietnam related content. Uses Unicode internally,
and detects and converts several different Vietnamese
character encodings.
robot-history:
robot-environment: service
modified-date: Wed, 08 Oct 1997 00:09:52 GMT
modified-by: Benjamin Franz
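The visit scheduling described in the JCrawler entry above (random order, but no
site visited more than once every two minutes) amounts to a per-host rate limit.
The sketch below is not JCrawler's Perl code, just a minimal Python illustration of
that policy; the example URLs reuse the VietGATE host mentioned in the entry.

    # Minimal sketch of a per-host politeness policy: never fetch from the same
    # host more than once every two minutes. Illustration only, not JCrawler code.
    import time
    from urllib.parse import urlparse

    MIN_INTERVAL = 120.0      # seconds between visits to the same host
    last_visit = {}           # host -> time of the most recent request

    def may_fetch(url):
        """Return True if the host of url was not visited in the last two minutes."""
        host = urlparse(url).netloc
        now = time.monotonic()
        last = last_visit.get(host)
        if last is not None and now - last < MIN_INTERVAL:
            return False      # visited this host less than two minutes ago
        last_visit[host] = now
        return True

    print(may_fetch("http://www.vietgate.net/index.html"))   # True
    print(may_fetch("http://www.vietgate.net/other.html"))   # False: too soon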
robot-id: jeeves
robot-name: Jeeves
robot-cover-url: http://www-students.doc.ic.ac.uk/~lglb/Jeeves/
robot-details-url:
robot-owner-name: Leon Brocard
robot-owner-url: http://www-students.doc.ic.ac.uk/~lglb/
robot-owner-email: lglb@doc.ic.ac.uk
robot-status: development
robot-purpose: indexing maintenance statistics
robot-type: standalone
robot-platform: UNIX
robot-availability: none
robot-exclusion: no
robot-exclusion-useragent: jeeves
robot-noindex: no
robot-host: *.doc.ic.ac.uk
robot-from: yes
robot-useragent: Jeeves v0.05alpha (PERL, LWP, lglb@doc.ic.ac.uk)
robot-language: perl5
robot-description: Jeeves is basically a web-mirroring robot built as a
final-year degree project. It will have many nice features and is
already web-friendly. Still in development.
robot-history: Still short (0.05alpha)
robot-environment: research
modified-date: Wed, 23 Apr 1997 17:26:50 GMT
modified-by: Leon Brocard
robot-id:
jobot
robot-name:
Jobot
robot-cover-url:
http://www.micrognosis.com/~ajack/jobot/jobot.html
robot-details-url:
robot-owner-name:
Adam Jack
robot-owner-url:
http://www.micrognosis.com/~ajack/index.html
robot-owner-email: ajack@corp.micrognosis.com
robot-status:
inactive
robot-purpose:
standalone
robot-type:
robot-platform:
robot-availability:
robot-exclusion:
yes
robot-exclusion-useragent:
robot-noindex:
no
robot-host:
supernova.micrognosis.com
robot-from:
yes
robot-useragent:
Jobot/0.1alpha libwww-perl/4.0
robot-language:
perl 4
robot-description:
Its purpose is to generate a Resource Discovery database. Intended to
seek out sites of potential "career interest". Hence - Job Robot.
robot-history:
robot-environment:
modified-date:
Tue Jan 9 18:55:55 1996
modified-by:
robot-id:
joebot
robot-name:
JoeBot
robot-cover-url:
robot-details-url:
robot-owner-name:
Ray Waldin
robot-owner-url:
http://www.primenet.com/~rwaldin
robot-owner-email: rwaldin@primenet.com
robot-status:
robot-purpose:
robot-type:
standalone
robot-platform:
robot-availability:
robot-exclusion:
yes
robot-exclusion-useragent:
robot-noindex:
robot-host:
robot-from:
yes
robot-useragent:
JoeBot/x.x,
robot-language:
java JoeBot is a generic web crawler implemented as a
collection of Java classes which can be used in a variety of
applications, including resource discovery, link validation,
mirroring, etc. It currently limits itself to one visit per
host per minute.
robot-description:
robot-history:
robot-environment:
modified-date:
Sun May 19 08:13:06 1996.
modified-by:
robot-id:
jubii
robot-name:
The Jubii Indexing Robot
robot-cover-url:
http://www.jubii.dk/robot/default.htm
robot-details-url:
robot-owner-name:
Jakob Faarvang
robot-owner-url:
http://www.cybernet.dk/staff/jakob/
robot-owner-email: jakob@jubii.dk
robot-status:
robot-purpose:
indexing, maintenance
robot-type:
standalone
robot-platform:
robot-availability:
robot-exclusion:
yes
robot-exclusion-useragent:
robot-noindex:
no
robot-host:
any host in the cybernet.dk domain
robot-from:
yes
robot-useragent:
JubiiRobot/version#
robot-language:
visual basic 4.0
robot-description: Its purpose is to generate a Resource Discovery database
and validate links. Used for indexing the .dk top-level
domain as well as other Danish sites for a Danish web
database, as well as for link validation.
robot-history:
Will be in constant operation from Spring
1996
robot-environment:
modified-date:
Sat Jan 6 20:58:44 1996
modified-by:
robot-id:
jumpstation
robot-name:
JumpStation
robot-cover-url:
http://js.stir.ac.uk/jsbin/jsii
robot-details-url:
robot-owner-name:
Jonathon Fletcher
robot-owner-url:
http://www.stir.ac.uk/~jf1
robot-owner-email: j.fletcher@stirling.ac.uk
robot-status:
retired
robot-purpose:
indexing
robot-type:
robot-platform:
robot-availability:
robot-exclusion:
yes
robot-exclusion-useragent:
robot-noindex:
robot-host:
*.stir.ac.uk
robot-from:
yes
robot-useragent:
jumpstation
robot-language:
perl, C, c++
robot-description:
robot-history:
Originated as a weekend project in 1993.
robot-environment:
modified-date:
Tue May 16 00:57:42 1995.
modified-by:
robot-id:
katipo
robot-name:
Katipo
robot-cover-url:
http://www.vuw.ac.nz/~newbery/Katipo.html
robot-details-url: http://www.vuw.ac.nz/~newbery/Katipo/Katipo-doc.html
robot-owner-name:
Michael Newbery
robot-owner-url:
http://www.vuw.ac.nz/~newbery
robot-owner-email: Michael.Newbery@vuw.ac.nz
robot-status:
active
robot-purpose:
maintenance
robot-type:
standalone
robot-platform:
Macintosh
robot-availability: binary
robot-exclusion:
no
robot-exclusion-useragent:
robot-noindex:
no
robot-host:
*
robot-from:
yes
robot-useragent:
Katipo/1.0
robot-language:
c
robot-description: Watches all the pages you have previously visited
and tells you when they have changed.
robot-history:
robot-environment: commercial (free)
modified-date:
Tue, 25 Jun 96 11:40:07 +1200
modified-by:
Michael Newbery
robot-id:
kdd
robot-name:
KDD-Explorer
robot-cover-url:
http://mlc.kddvw.kcom.or.jp/CLINKS/html/clinks.html
robot-details-url:
not available
robot-owner-name:
Kazunori Matsumoto
robot-owner-url:
not available
robot-owner-email:
matsu@lab.kdd.co.jp
robot-status:
development (to be active in June 1997)
robot-purpose:
indexing
robot-type:
standalone
robot-platform:
unix
robot-availability:
none
robot-exclusion:
yes
robot-exclusion-useragent:KDD-Explorer
robot-noindex:
no
robot-host:
mlc.kddvw.kcom.or.jp
robot-from:
yes
robot-useragent:
KDD-Explorer/0.1
robot-language:
c
robot-description:
KDD-Explorer is used for indexing valuable documents
which will be retrieved via an experimental cross-language
search engine, CLINKS.
robot-history:
This robot was designed in Knowledge-bases Information
processing Laboratory, KDD R&D Laboratories, 1996-1997
robot-environment:
research
modified-date:
Mon, 2 June 1997 18:00:00 JST
modified-by:
Kazunori Matsumoto
robot-id:kilroy
robot-name:Kilroy
robot-cover-url:http://purl.org/kilroy
robot-details-url:http://purl.org/kilroy
robot-owner-name:OCLC
robot-owner-url:http://www.oclc.org
robot-owner-email:kilroy@oclc.org
robot-status:active
robot-purpose:indexing,statistics
robot-type:standalone
robot-platform:unix,windowsNT
robot-availability:none
robot-exclusion:yes
robot-exclusion-useragent:*
robot-noindex:no
robot-host:*.oclc.org
robot-from:no
robot-useragent:yes
robot-language:java
robot-description:Used to collect data for several projects.
Runs constantly and visits each site no faster than once every 90 seconds.
robot-history:none
robot-environment:research,service
modified-date:Thursday, 24 Apr 1997 20:00:00 GMT
modified-by:tkac
robot-id: ko_yappo_robot
robot-name: KO_Yappo_Robot
robot-cover-url: http://yappo.com/info/robot.html
robot-details-url: http://yappo.com/
robot-owner-name: Kazuhiro Osawa
robot-owner-url: http://yappo.com/
robot-owner-email: office_KO@yappo.com
robot-status: active
robot-purpose: indexing
robot-type: standalone
robot-platform: unix
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: ko_yappo_robot
robot-noindex: yes
robot-host: yappo.com,209.25.40.1
robot-from: yes
robot-useragent: KO_Yappo_Robot/1.0.4(http://yappo.com/info/robot.html)
robot-language: perl
robot-description: The KO_Yappo_Robot robot is used to build the database
for the Yappo search service by k,osawa
(part of AOL).
The robot runs on random days, and visits sites in a random order.
robot-history: The robot is a hobby of k,osawa,
started in Tokyo in 1997.
robot-environment: hobby
modified-date: Fri, 18 Jul 1996 12:34:21 GMT
modified-by: KO
robot-id: labelgrabber.txt
robot-name: LabelGrabber
robot-cover-url: http://www.w3.org/PICS/refcode/LabelGrabber/index.htm
robot-details-url: http://www.w3.org/PICS/refcode/LabelGrabber/index.htm
robot-owner-name: Kyle Jamieson
robot-owner-url: http://www.w3.org/PICS/refcode/LabelGrabber/index.htm
robot-owner-email: jamieson@mit.edu
robot-status: active
robot-purpose: Grabs PICS labels from web pages, submits them to a label bureau
robot-type: standalone
robot-platform: windows, windows95, windowsNT, unix
robot-availability: source
robot-exclusion: yes
robot-exclusion-useragent: label-grabber
robot-noindex: no
robot-host: head.w3.org
robot-from: no
robot-useragent: LabelGrab/1.1
robot-language: java
robot-description: The label grabber searches for PICS labels and submits
them to a label bureau
robot-history: N/A
robot-environment: research
modified-date: Wed, 28 Jan 1998 17:32:52 GMT
modified-by: jamieson@mit.edu
robot-id: larbin
robot-name: larbin
robot-cover-url: http://para.inria.fr/~ailleret/larbin/index-eng.html
robot-owner-name: Sebastien Ailleret
robot-owner-url: http://para.inria.fr/~ailleret/
robot-owner-email: sebastien.ailleret@inria.fr
robot-status: active
robot-purpose: Your imagination is the only limit
robot-type: standalone
robot-platform: Linux
robot-availability: source (GPL), mail me for customization
robot-exclusion: yes
robot-exclusion-useragent: larbin
robot-noindex: no
robot-host: *
robot-from: no
robot-useragent: larbin (+mail)
robot-language: c++
robot-description: Roaming the web, such is my passion (originally: "Parcourir le web, telle est ma passion")
robot-history: french research group (INRIA Verso)
robot-environment: hobby
modified-date: 2000-3-28
modified-by: Sebastien Ailleret
robot-id: legs
robot-name: legs
robot-cover-url: http://www.MagPortal.com/
robot-details-url:
robot-owner-name: Bill Dimm
robot-owner-url: http://www.HotNeuron.com/
robot-owner-email: admin@magportal.com
robot-status: active
robot-purpose: indexing
robot-type: standalone
robot-platform: linux
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: legs
robot-noindex: no
robot-host:
robot-from: yes
robot-useragent: legs
robot-language: perl5
robot-description: The legs robot is used to build the magazine article
database for MagPortal.com.
robot-history:
robot-environment: service
modified-date: Wed, 22 Mar 2000 14:10:49 GMT
modified-by: Bill Dimm
robot-id: linkidator
robot-name: Link Validator
robot-cover-url:
robot-details-url:
robot-owner-name: Thomas Gimon
robot-owner-url:
robot-owner-email: tgimon@mitre.org
robot-status: development
robot-purpose: maintenance
robot-type: standalone
robot-platform: unix, windows
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: Linkidator
robot-noindex: yes
robot-nofollow: yes
robot-host: *.mitre.org
robot-from: yes
robot-useragent: Linkidator/0.93
robot-language: perl5
robot-description: Recursively checks all links on a site, looking for
broken or redirected links. Checks all off-site links using HEAD
requests and does not progress further. Designed to behave well and to
be very configurable.
robot-history: Built using WWW-Robot-0.022 perl module. Currently in
beta test. Seeking approval for public release.
robot-environment: internal
modified-date: Fri, 20 Jan 2001 02:22:00 EST
modified-by: Thomas Gimon
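Linkidator's entry describes checking off-site links with HEAD requests and not progressing further. A minimal sketch of that kind of check in Python (an illustration only; Linkidator itself is a Perl program built on the WWW-Robot module):

    import urllib.error
    import urllib.request

    def check_link(url, timeout=10):
        """HEAD-request a URL and report whether it is ok, redirected, or broken."""
        req = urllib.request.Request(url, method="HEAD")
        try:
            with urllib.request.urlopen(req, timeout=timeout) as resp:
                # urlopen follows redirects, so compare the final URL with the original
                if resp.geturl() != url:
                    return "redirected to %s" % resp.geturl()
                return "ok (%d)" % resp.status
        except urllib.error.HTTPError as err:
            return "broken (%d)" % err.code
        except urllib.error.URLError as err:
            return "unreachable (%s)" % err.reason
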
robot-id:linkscan
robot-name:LinkScan
robot-cover-url:http://www.elsop.com/
robot-details-url:http://www.elsop.com/linkscan/overview.html
robot-owner-name:Electronic Software Publishing Corp. (Elsop)
robot-owner-url:http://www.elsop.com/
robot-owner-email:sales@elsop.com
robot-status:Robot actively in use
robot-purpose:Link checker, SiteMapper, and HTML Validator
robot-type:Standalone
robot-platform:Unix, Linux, Windows 98/NT
robot-availability:Program is shareware
robot-exclusion:No
robot-exclusion-useragent:
robot-noindex:Yes
robot-host:*
robot-from:
robot-useragent:LinkScan Server/5.5 | LinkScan Workstation/5.5
robot-language:perl5
robot-description:LinkScan checks links, validates HTML and creates site maps
robot-history: First developed by Elsop in January, 1997
robot-environment:Commercial
modified-date:Fri, 3 September 1999 17:00:00 PDT
modified-by: Kenneth R. Churilla
robot-id: linkwalker
robot-name: LinkWalker
robot-cover-url: http://www.seventwentyfour.com
robot-details-url: http://www.seventwentyfour.com/tech.html
robot-owner-name: Roy Bryant
robot-owner-url:
robot-owner-email: rbryant@seventwentyfour.com
robot-status: active
robot-purpose: maintenance, statistics
robot-type: standalone
robot-platform: windowsNT
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: linkwalker
robot-noindex: yes
robot-host: *.seventwentyfour.com
robot-from: yes
robot-useragent: LinkWalker
robot-language: c++
robot-description: LinkWalker generates a database of links.
We send reports of bad ones to webmasters.
robot-history: Constructed late 1997 through April 1998.
In full service April 1998.
robot-environment: service
modified-date: Wed, 22 Apr 1998
modified-by: Roy Bryant
robot-id:lockon
robot-name:Lockon
robot-cover-url:
robot-details-url:
robot-owner-name:Seiji Sasazuka & Takahiro Ohmori
robot-owner-url:
robot-owner-email:search@rsch.tuis.ac.jp
robot-status:active
robot-purpose:indexing
robot-type:standalone
robot-platform:UNIX
robot-availability:none
robot-exclusion:yes
robot-exclusion-useragent:Lockon
robot-noindex:yes
robot-host:*.hitech.tuis.ac.jp
robot-from:yes
robot-useragent:Lockon/xxxxx
robot-language:perl5
robot-description:This robot gathers only HTML documents.
robot-history:This robot was developed at the Tokyo University of Information Sciences
in 1998.
robot-environment:research
modified-date:Tue. 10 Nov 1998 20:00:00 GMT
modified-by:Seiji Sasazuka & Takahiro Ohmori
robot-id:logo_gif
robot-name: logo.gif Crawler
robot-cover-url: http://www.inm.de/projects/logogif.html
robot-details-url:
robot-owner-name: Sevo Stille
robot-owner-url: http://www.inm.de/people/sevo
robot-owner-email: sevo@inm.de
robot-status: under development
robot-purpose: indexing
robot-type: standalone
robot-platform: unix
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: logo_gif_crawler
robot-noindex: no
robot-host: *.inm.de
robot-from: yes
robot-useragent: logo.gif crawler
robot-language: perl
robot-description: Meta-indexing engine for corporate logo graphics.
The robot runs at irregular intervals and will only pull a start page and
its associated /.*logo\.gif/i (if any). It will be terminated once a
statistically significant number of samples has been collected.
robot-history: logo.gif is part of the design diploma of Markus Weisbeck,
and tries to analyze the abundance of the logo metaphor in WWW
corporate design.
The crawler and image database were written by Sevo Stille and Peter
Frank of the Institut für Neue Medien, respectively.
robot-environment: research, statistics
modified-date: 25.5.97
modified-by: Sevo Stille
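The logo.gif Crawler entry says the robot only pulls a start page and any image matching /.*logo\.gif/i. A short Python sketch of that filtering step (only the regular expression comes from the entry; the fetching and parsing details are illustrative assumptions):

    import re
    import urllib.request
    from html.parser import HTMLParser

    LOGO_RE = re.compile(r".*logo\.gif", re.IGNORECASE)   # the pattern named in the entry

    class ImgCollector(HTMLParser):
        """Collect the src attribute of every <img> tag on a page."""
        def __init__(self):
            super().__init__()
            self.srcs = []

        def handle_starttag(self, tag, attrs):
            if tag == "img":
                for name, value in attrs:
                    if name == "src" and value:
                        self.srcs.append(value)

    def find_logo_candidates(start_url):
        """Return the image URLs on start_url whose file name matches the logo pattern."""
        with urllib.request.urlopen(start_url) as resp:
            html = resp.read().decode("utf-8", errors="replace")
        collector = ImgCollector()
        collector.feed(html)
        return [src for src in collector.srcs if LOGO_RE.match(src)]
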
robot-id: lycos
robot-name: Lycos
robot-cover-url: http://lycos.cs.cmu.edu/
robot-details-url:
robot-owner-name: Dr. Michael L. Mauldin
robot-owner-url: http://fuzine.mt.cs.cmu.edu/mlm/home.html
robot-owner-email: fuzzy@cmu.edu
robot-status:
robot-purpose: indexing
robot-type:
robot-platform:
robot-availability:
robot-exclusion: yes
robot-exclusion-useragent:
robot-noindex: no
robot-host: fuzine.mt.cs.cmu.edu, lycos.com
robot-from:
robot-useragent: Lycos/x.x
robot-language:
robot-description: This is a research program in providing information
retrieval and discovery in the WWW, using a finite memory
model of the web to guide intelligent, directed searches for
specific information needs
robot-history:
robot-environment:
modified-date:
modified-by:
robot-id: macworm
robot-name: Mac WWWWorm
robot-cover-url:
robot-details-url:
robot-owner-name: Sebastien Lemieux
robot-owner-url:
robot-owner-email: lemieuse@ERE.UMontreal.CA
robot-status:
robot-purpose: indexing
robot-type:
robot-platform: Macintosh
robot-availability: none
robot-exclusion:
robot-exclusion-useragent:
robot-noindex: no
robot-host:
robot-from:
robot-useragent:
robot-language: hypercard
robot-description: A French keyword-searching robot for the Mac. The author
has decided not to release this robot to the public.
robot-history:
robot-environment:
modified-date:
modified-by:
robot-id: magpie
robot-name: Magpie
robot-cover-url:
robot-details-url:
robot-owner-name: Keith Jones
robot-owner-url:
robot-owner-email: Keith.Jones@blueberry.co.uk
robot-status: development
robot-purpose: indexing, statistics
robot-type: standalone
robot-platform: unix
robot-availability:
robot-exclusion: no
robot-exclusion-useragent:
robot-noindex: no
robot-host: *.blueberry.co.uk, 194.70.52.*, 193.131.167.144
robot-from: no
robot-useragent: Magpie/1.0
robot-language: perl5
robot-description: Used to obtain information from a specified list of web pages for
local indexing. Runs every two hours, and visits only a small number of sites.
robot-history: Part of a research project. Alpha testing from 10 July 1996, Beta
testing from 10 September.
robot-environment: research
modified-date: Wed, 10 Oct 1996 13:15:00 GMT
modified-by: Keith Jones
robot-id: mattie
robot-name: Mattie
robot-cover-url: http://www.mcw.aarkayn.org
robot-details-url: http://www.mcw.aarkayn.org/web/mattie.asp
robot-owner-name: Matt
robot-owner-url: http://www.mcw.aarkayn.org
robot-owner-email: matt@mcw.aarkayn.org
robot-status: Active
robot-purpose: MP3 Spider
robot-type: Standalone
robot-platform: Windows 2000
robot-availability: None
robot-exclusion: Yes
robot-exclusion-useragent: mattie
robot-noindex: N/A
robot-nofollow: Yes
robot-host: mattie.mcw.aarkayn.org
robot-from: Yes
robot-useragent: AO/A-T.IDRG v2.3
robot-language: AO/A-T.IDRGL
robot-description: Mattie's sole purpose is to seek out MP3z for Matt.
robot-history: Mattie was written 2000 Mar. 03 Fri. 18:48:00 -0500
GMT (e). He was last modified 2000 Nov. 08 Wed. 14:52:00 -0600 GMT (f).
robot-environment: Hobby
modified-date: Wed, 08 Nov 2000 20:52:00 GMT
modified-by: Matt
robot-id: mediafox
robot-name: MediaFox
robot-cover-url: none
robot-details-url: none
robot-owner-name: Lars Eilebrecht
robot-owner-url: http://www.home.unix-ag.org/sfx/
robot-owner-email: sfx@uni-media.de
robot-status: development
robot-purpose: indexing and maintenance
robot-type: standalone
robot-platform: (Java)
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: mediafox
robot-noindex: yes
robot-host: 141.99.*.*
robot-from: yes
robot-useragent: MediaFox/x.y
robot-language: Java
robot-description: The robot is used to index meta information of a
specified set of documents and update a database
accordingly.
robot-history: Project at the University of Siegen
robot-environment: research
modified-date: Fri Aug 14 03:37:56 CEST 1998
modified-by: Lars Eilebrecht
robot-id:merzscope
robot-name:MerzScope
robot-cover-url:http://www.merzcom.com
robot-details-url:http://www.merzcom.com
robot-owner-name:(Client based robot)
robot-owner-url:(Client based robot)
robot-owner-email:
robot-status:actively in use
robot-purpose:WebMapping
robot-type:standalone
robot-platform: (Java Based) unix,windows95,windowsNT,os2,mac etc ..
robot-availability:binary
robot-exclusion: yes
robot-exclusion-useragent: MerzScope
robot-noindex: no
robot-host:(Client Based)
robot-from:
robot-useragent: MerzScope
robot-language: java
robot-description: Robot is part of a Web-Mapping package called MerzScope,
to be used mainly by consultants and webmasters to create and
publish maps on and of the World Wide Web.
robot-history:
robot-environment:
modified-date: Fri, 13 March 1997 16:31:00
modified-by: Philip Lenir, MerzScope lead developer
robot-id: meshexplorer
robot-name: NEC-MeshExplorer
robot-cover-url: http://netplaza.biglobe.or.jp/
robot-details-url: http://netplaza.biglobe.or.jp/keyword.html
robot-owner-name: web search service maintenance group
robot-owner-url: http://netplaza.biglobe.or.jp/keyword.html
robot-owner-email: web-dir@mxa.meshnet.or.jp
robot-status: active
robot-purpose: indexing
robot-type: standalone
robot-platform: unix
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: NEC-MeshExplorer
robot-noindex: no
robot-host: meshsv300.tk.mesh.ad.jp
robot-from: yes
robot-useragent: NEC-MeshExplorer
robot-language: c
robot-description: The NEC-MeshExplorer robot is used to build the database for the
NETPLAZA search service operated by NEC Corporation. The robot searches URLs
around sites in Japan (JP domain).
The robot runs every day, and visits sites in a random order.
robot-history: Prototype version of this robot was developed in C&C Research
Laboratories, NEC Corporation. Current robot (Version 1.0) is based
on the prototype and has more functions.
robot-environment: research
modified-date: Jan 1, 1997
modified-by: Nobuya Kubo, Hajime Takano
robot-id: MindCrawler
robot-name: MindCrawler
robot-cover-url: http://www.mindpass.com/_technology_faq.htm
robot-details-url:
robot-owner-name: Mindpass
robot-owner-url: http://www.mindpass.com/
robot-owner-email: support@mindpass.com
robot-status: active
robot-purpose: indexing
robot-type: standalone
robot-platform: linux
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: MindCrawler
robot-noindex: no
robot-host: *
robot-from: no
robot-useragent: MindCrawler
robot-language: c++
robot-description:
robot-history:
robot-environment:
modified-date: Tue Mar 28 11:30:09 CEST 2000
modified-by:
robot-id:moget
robot-name:moget
robot-cover-url:
robot-details-url:
robot-owner-name:NTT-ME Information Xing, Inc
robot-owner-url:http://www.nttx.co.jp
robot-owner-email:moget@goo.ne.jp
robot-status:active
robot-purpose:indexing,statistics
robot-type:standalone
robot-platform:unix
robot-availability:none
robot-exclusion:yes
robot-exclusion-useragent:moget
robot-noindex:yes
robot-host:*.goo.ne.jp
robot-from:yes
robot-useragent:moget/1.0
robot-language:c
robot-description: This robot is used to build the database for the search service
operated by goo
robot-history:
robot-environment:service
modified-date:Thu, 30 Mar 2000 18:40:37 GMT
modified-by:moget@goo.ne.jp
robot-id: momspider
robot-name: MOMspider
robot-cover-url: http://www.ics.uci.edu/WebSoft/MOMspider/
robot-details-url:
robot-owner-name: Roy T. Fielding
robot-owner-url: http://www.ics.uci.edu/dir/grad/Software/fielding
robot-owner-email: fielding@ics.uci.edu
robot-status: active
robot-purpose: maintenance, statistics
robot-type: standalone
robot-platform: UNIX
robot-availability: source
robot-exclusion: yes
robot-exclusion-useragent:
robot-noindex: no
robot-host: *
robot-from: yes
robot-useragent: MOMspider/1.00 libwww-perl/0.40
robot-language: perl 4
robot-description: to validate links, and generate statistics. It's usually run
from anywhere
robot-history: Originated as a research project at the University of
California, Irvine, in 1993. Presented at the First
International WWW Conference in Geneva, 1994.
robot-environment:
modified-date: Sat May 6 08:11:58 1995
modified-by: fielding@ics.uci.edu
robot-id: monster
robot-name: Monster
robot-cover-url: http://www.neva.ru/monster.list/russian.www.html
robot-details-url:
robot-owner-name: Dmitry Dicky
robot-owner-url: http://wild.stu.neva.ru/
robot-owner-email: diwil@wild.stu.neva.ru
robot-status: active
robot-purpose: maintenance, mirroring
robot-type: standalone
robot-platform: UNIX (Linux)
robot-availability: binary
robot-exclusion: yes
robot-exclusion-useragent:
robot-noindex: no
robot-host: wild.stu.neva.ru
robot-from:
robot-useragent: Monster/vX.X.X -$TYPE ($OSTYPE)
robot-language: C
robot-description: The Monster has two parts - a Web searcher and a Web analyzer.
The searcher is intended to produce the list of WWW sites of a
desired domain (for example, it can produce the list of all
WWW sites of the mit.edu, com, org, etc. domains).
In the User-agent field $TYPE is set to 'Mapper' for the Web searcher
and 'StAlone' for the Web analyzer.
robot-history: Now the full (I suppose) list of ex-USSR sites is produced.
robot-environment:
modified-date: Tue Jun 25 10:03:36 1996
modified-by:
robot-id: motor
robot-name: Motor
robot-cover-url: http://www.cybercon.de/Motor/index.html
robot-details-url:
robot-owner-name: Mr. Oliver Runge, Mr. Michael Goeckel
robot-owner-url: http://www.cybercon.de/index.html
robot-owner-email: Motor@cybercon.technopark.gmd.de
robot-status: development
robot-purpose: indexing
robot-type: standalone
robot-platform: mac
robot-availability: data
robot-exclusion: yes
robot-exclusion-useragent: Motor
robot-noindex: no
robot-host: Michael.cybercon.technopark.gmd.de
robot-from: yes
robot-useragent: Motor/0.2
robot-language: 4th dimension
robot-description: The Motor robot is used to build the database for the
www.webindex.de search service operated by CyberCon. The robot is under
development - it runs at random intervals and visits sites in a priority
driven order (.de/.ch/.at first, root and robots.txt first)
robot-history:
robot-environment: service
modified-date: Wed, 3 Jul 1996 15:30:00 +0100
modified-by: Michael Goeckel (Michael@cybercon.technopark.gmd.de)
robot-id: muscatferret
robot-name: Muscat Ferret
robot-cover-url: http://www.muscat.co.uk/euroferret/
robot-details-url:
robot-owner-name: Olly Betts
robot-owner-url: http://www.muscat.co.uk/~olly/
robot-owner-email: olly@muscat.co.uk
robot-status: active
robot-purpose: indexing
robot-type: standalone
robot-platform: unix
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: MuscatFerret
robot-noindex: yes
robot-host: 193.114.89.*, 194.168.54.11
robot-from: yes
robot-useragent: MuscatFerret/<version>
robot-language: c, perl5
robot-description: Used to build the database for the EuroFerret
<URL:http://www.muscat.co.uk/euroferret/>
robot-history:
robot-environment: service
modified-date: Tue, 21 May 1997 17:11:00 GMT
modified-by: olly@muscat.co.uk
robot-id: mwdsearch
robot-name: Mwd.Search
robot-cover-url: (none)
robot-details-url: (none)
robot-owner-name: Antti Westerberg
robot-owner-url: (none)
robot-owner-email: Antti.Westerberg@mwd.sci.fi
robot-status: active
robot-purpose: indexing
robot-type: standalone
robot-platform: unix (Linux)
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: MwdSearch
robot-noindex: yes
robot-host: *.fifi.net
robot-from: no
robot-useragent: MwdSearch/0.1
robot-language: perl5, c
robot-description: Robot for indexing Finnish (top-level domain .fi)
webpages for the search engine called Fifi.
Visits sites in random order.
robot-history: (none)
robot-environment: service (+ commercial)
modified-date: Mon, 26 May 1997 15:55:02 EEST
modified-by: Antti.Westerberg@mwd.sci.fi
robot-id: myweb
robot-name: Internet Shinchakubin
robot-cover-url: http://naragw.sharp.co.jp/myweb/home/
robot-details-url:
robot-owner-name: SHARP Corp.
robot-owner-url: http://naragw.sharp.co.jp/myweb/home/
robot-owner-email: shinchakubin-request@isl.nara.sharp.co.jp
robot-status: active
robot-purpose: find new links and changed pages
robot-type: standalone
robot-platform: Windows98
robot-availability: binary as bundled software
robot-exclusion: yes
robot-exclusion-useragent: sharp-info-agent
robot-noindex: no
robot-host: *
robot-from: no
robot-useragent: User-Agent: Mozilla/4.0 (compatible; sharp-info-agent v1.0; )
robot-language: Java
robot-description: makes a list of new links and changed pages based
on the user's frequently clicked pages in the past 31 days.
The client may run this software once or a few times every day, manually or
at a specified time.
robot-history: shipped for SHARP's PC users since Feb 2000
robot-environment: commercial
modified-date: Fri, 30 Jun 2000 19:02:52 JST
modified-by: Katsuo Doi <doi@isl.nara.sharp.co.jp>
robot-id: netcarta
robot-name: NetCarta WebMap Engine
robot-cover-url: http://www.netcarta.com/
robot-details-url:
robot-owner-name: NetCarta WebMap Engine
robot-owner-url: http://www.netcarta.com/
robot-owner-email: info@netcarta.com
robot-status:
robot-purpose: indexing, maintenance, mirroring, statistics
robot-type: standalone
robot-platform:
robot-availability:
robot-exclusion: yes
robot-exclusion-useragent:
robot-noindex:
robot-host:
robot-from: yes
robot-useragent: NetCarta CyberPilot Pro
robot-language: C++.
robot-description: The NetCarta WebMap Engine is a general purpose, commercial
spider. Packaged with a full GUI in the CyberPilot Pro
product, it acts as a personal spider to work with a browser
to facilitate context-based navigation. The WebMapper
product uses the robot to manage a site (site copy, site
diff, and extensive link management facilities). All
versions can create publishable NetCarta WebMaps, which
capture the crawled information. If the robot sees a
published map, it will return the published map rather than
continuing its crawl. Since this is a personal spider, it
will be launched from multiple domains. This robot tends to
focus on a particular site. No instance of the robot should
have more than one outstanding request out to any given site
at a time. The User-agent field contains a coded ID
identifying the instance of the spider; specific users can
be blocked via robots.txt using this ID.
robot-history:
robot-environment:
modified-date: Sun Feb 18 02:02:49 1996.
modified-by:
robot-id: netmechanic
robot-name: NetMechanic
robot-cover-url: http://www.netmechanic.com
robot-details-url: http://www.netmechanic.com/faq.html
robot-owner-name: Tom Dahm
robot-owner-url: http://iquest.com/~tdahm
robot-owner-email: tdahm@iquest.com
robot-status: development
robot-purpose: Link and HTML validation
robot-type: standalone with web gateway
robot-platform: UNIX
robot-availability: via web page
robot-exclusion: Yes
robot-exclusion-useragent: WebMechanic
robot-noindex: no
robot-host: 206.26.168.18
robot-from: no
robot-useragent: NetMechanic
robot-language: C
robot-description: NetMechanic is a link validation and
HTML validation robot run using a web page interface.
robot-history:
robot-environment:
modified-date: Sat, 17 Aug 1996 12:00:00 GMT
modified-by:
robot-id: netscoop
robot-name: NetScoop
robot-cover-url: http://www-a2k.is.tokushima-u.ac.jp/search/index.html
robot-owner-name: Kenji Kita
robot-owner-url: http://www-a2k.is.tokushima-u.ac.jp/member/kita/index.html
robot-owner-email: kita@is.tokushima-u.ac.jp
robot-status: active
robot-purpose: indexing
robot-type: standalone
robot-platform: UNIX
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: NetScoop
robot-host: alpha.is.tokushima-u.ac.jp, beta.is.tokushima-u.ac.jp
robot-useragent: NetScoop/1.0 libwww/5.0a
robot-language: C
robot-description: The NetScoop robot is used to build the database
for the NetScoop search engine.
robot-history: The robot has been used in the research project
at the Faculty of Engineering, Tokushima University, Japan,
since Dec. 1996.
robot-environment: research
modified-date: Fri, 10 Jan 1997.
modified-by: Kenji Kita
robot-id: newscan-online
robot-name: newscan-online
robot-cover-url: http://www.newscan-online.de/
robot-details-url: http://www.newscan-online.de/info.html
robot-owner-name: Axel Mueller
robot-owner-url:
robot-owner-email: mueller@newscan-online.de
robot-status: active
robot-purpose: indexing
robot-type: standalone
robot-platform: Linux
robot-availability: binary
robot-exclusion: yes
robot-exclusion-useragent: newscan-online
robot-noindex: no
robot-host: *newscan-online.de
robot-from: yes
robot-useragent: newscan-online/1.1
robot-language: perl
robot-description: The newscan-online robot is used to build a database for
the newscan-online news search service operated by smart information
services. The robot runs daily and visits predefined sites in a random order.
robot-history: This robot finds its roots in a prereleased software for
news filtering for Lotus Notes in 1995.
robot-environment: service
modified-date: Fri, 9 Apr 1999 11:45:00 GMT
modified-by: Axel Mueller
robot-id: nhse
robot-name: NHSE Web Forager
robot-cover-url: http://nhse.mcs.anl.gov/
robot-details-url:
robot-owner-name: Robert Olson
robot-owner-url: http://www.mcs.anl.gov/people/olson/
robot-owner-email: olson@mcs.anl.gov
robot-status:
robot-purpose: indexing
robot-type: standalone
robot-platform:
robot-availability:
robot-exclusion: yes
robot-exclusion-useragent:
robot-noindex: no
robot-host: *.mcs.anl.gov
robot-from: yes
robot-useragent: NHSEWalker/3.0
robot-language: perl 5
robot-description: to generate a Resource Discovery database
robot-history:
robot-environment:
modified-date: Fri May 5 15:47:55 1995
modified-by:
robot-id: nomad
robot-name: Nomad
robot-cover-url: http://www.cs.colostate.edu/~sonnen/projects/nomad.html
robot-details-url:
robot-owner-name: Richard Sonnen
robot-owner-url: http://www.cs.colostate.edu/~sonnen/
robot-owner-email: sonnen@cs.colostat.edu
robot-status:
robot-purpose: indexing
robot-type: standalone
robot-platform:
robot-availability:
robot-exclusion: no
robot-exclusion-useragent:
robot-noindex:
robot-host: *.cs.colostate.edu
robot-from: no
robot-useragent: Nomad-V2.x
robot-language: Perl 4
robot-description:
robot-history: Developed in 1995 at Colorado State University.
robot-environment:
modified-date: Sat Jan 27 21:02:20 1996.
modified-by:
robot-id: northstar
robot-name: The NorthStar Robot
robot-cover-url: http://comics.scs.unr.edu:7000/top.html
robot-details-url:
robot-owner-name: Fred Barrie
robot-owner-url:
robot-owner-email: barrie@unr.edu
robot-status:
robot-purpose: indexing
robot-type:
robot-platform:
robot-availability:
robot-exclusion:
robot-exclusion-useragent:
robot-noindex:
robot-host: frognot.utdallas.edu, utdallas.edu, cnidir.org
robot-from: yes
robot-useragent: NorthStar
robot-language:
robot-description: Recent runs (26 April 94) will concentrate on textual
analysis of the Web versus GopherSpace (from the Veronica
data) as well as indexing.
robot-history:
robot-environment:
modified-date:
modified-by:
robot-id: occam
robot-name: Occam
robot-cover-url: http://www.cs.washington.edu/research/projects/ai/www/occam/
robot-details-url:
robot-owner-name: Marc Friedman
robot-owner-url: http://www.cs.washington.edu/homes/friedman/
robot-owner-email: friedman@cs.washington.edu
robot-status: development
robot-purpose: indexing
robot-type: standalone
robot-platform: unix
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: Occam
robot-noindex: no
robot-host: gentian.cs.washington.edu, sekiu.cs.washington.edu,
saxifrage.cs.washington.edu
robot-from: yes
robot-useragent: Occam/1.0
robot-language: CommonLisp, perl4
robot-description: The robot takes high-level queries, breaks them down into
multiple web requests, and answers them by combining disparate
data gathered in one minute from numerous web sites, or from
the robot's cache. Currently the only user is me.
robot-history: The robot is a descendant of Rodney,
an earlier project at the University of Washington.
robot-environment: research
modified-date: Thu, 21 Nov 1996 20:30 GMT
modified-by: friedman@cs.washington.edu (Marc Friedman)
robot-id: octopus
robot-name: HKU WWW Octopus
robot-cover-url: http://phoenix.cs.hku.hk:1234/~jax/w3rui.shtml
robot-details-url:
robot-owner-name: Law Kwok Tung, Lee Tak Yeung, Lo Chun Wing
robot-owner-url: http://phoenix.cs.hku.hk:1234/~jax
robot-owner-email: jax@cs.hku.hk
robot-status:
robot-purpose: indexing
robot-type: standalone
robot-platform:
robot-availability:
robot-exclusion: no
robot-exclusion-useragent:
robot-noindex:
robot-host: phoenix.cs.hku.hk
robot-from: yes
robot-useragent: HKU WWW Robot,
robot-language: Perl 5, C, Java.
robot-description: HKU Octopus is an ongoing project for resource discovery in
the Hong Kong and China WWW domain. It is a research
project conducted by three undergraduates at the University
of Hong Kong.
robot-history:
robot-environment:
modified-date: Thu Mar 7 14:21:55 1996.
modified-by:
robot-id: orb_search
robot-name: Orb Search
robot-cover-url: http://orbsearch.home.ml.org
robot-details-url: http://orbsearch.home.ml.org
robot-owner-name: Matt Weber
robot-owner-url: http://www.weberworld.com
robot-owner-email: webernet@geocities.com
robot-status: active
robot-purpose: indexing
robot-type: standalone
robot-platform: unix
robot-availability: data
robot-exclusion: yes
robot-exclusion-useragent: Orbsearch/1.0
robot-noindex: yes
robot-host: cow.dyn.ml.org, *.dyn.ml.org
robot-from: yes
robot-useragent: Orbsearch/1.0
robot-language: Perl5
robot-description: Orbsearch builds the database for Orb Search Engine.
It runs when requested.
robot-history: This robot was started as a hobby.
robot-environment: hobby
modified-date: Sun, 31 Aug 1997 02:28:52 GMT
modified-by: Matt Weber
robot-id: packrat
robot-name: Pack Rat
robot-cover-url: http://web.cps.msu.edu/~dexterte/isl/packrat.html
robot-details-url:
robot-owner-name: Terry Dexter
robot-owner-url: http://web.cps.msu.edu/~dexterte
robot-owner-email: dexterte@cps.msu.edu
robot-status: development
robot-purpose: both maintenance and mirroring
robot-type: standalone
robot-platform: unix
robot-availability: at the moment, none...source when developed.
robot-exclusion: yes
robot-exclusion-useragent: packrat or *
robot-noindex: no, not yet
robot-host: cps.msu.edu
robot-from:
robot-useragent: PackRat/1.0
robot-language: perl with libwww-5.0
robot-description: Used for local maintenance and for gathering web pages
so that local statistical info can be used in artificial intelligence programs.
Funded by NEMOnline.
robot-history: In the making...
robot-environment: research
modified-date: Tue, 20 Aug 1996 15:45:11
modified-by: Terry Dexter
robot-id:pageboy
robot-name:PageBoy
robot-cover-url:http://www.webdocs.org/
robot-details-url:http://www.webdocs.org/
robot-owner-name:Chihiro Kuroda
robot-owner-url:http://www.webdocs.org/
robot-owner-email:pageboy@webdocs.org
robot-status:development
robot-purpose:indexing
robot-type:standalone
robot-platform:unix
robot-availability:none
robot-exclusion:yes
robot-exclusion-useragent:pageboy
robot-noindex:yes
robot-nofollow:yes
robot-host:*.webdocs.org
robot-from:yes
robot-useragent:PageBoy/1.0
robot-language:c
robot-description:The robot visits at regular intervals.
robot-history:none
robot-environment:service
modified-date:Fri, 21 Oct 1999 17:28:52 GMT
modified-by:webdocs
robot-id: parasite
robot-name: ParaSite
robot-cover-url: http://www.ianett.com/parasite/
robot-details-url: http://www.ianett.com/parasite/
robot-owner-name: iaNett.com
robot-owner-url: http://www.ianett.com/
robot-owner-email: parasite@ianett.com
robot-status: active
robot-purpose: indexing
robot-type: standalone
robot-platform: windowsNT
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: ParaSite
robot-noindex: yes
robot-nofollow: yes
robot-host: *.ianett.com
robot-from: yes
robot-useragent: ParaSite/0.21 (http://www.ianett.com/parasite/)
robot-language: c++
robot-description: Builds index for ianett.com search database. Runs
continuously.
robot-history: Second generation of ianett.com spidering technology,
originally called Sven.
robot-environment: service
modified-date: July 28, 2000
modified-by: Marty Anstey
robot-id: patric
robot-name: Patric
robot-cover-url: http://www.nwnet.net/technical/ITR/index.html
robot-details-url: http://www.nwnet.net/technical/ITR/index.html
robot-owner-name: toney@nwnet.net
robot-owner-url: http://www.nwnet.net/company/staff/toney
robot-owner-email: webmaster@nwnet.net
robot-status: development
robot-purpose: statistics
robot-type: standalone
robot-platform: unix
robot-availability: data
robot-exclusion: yes
robot-exclusion-useragent: patric
robot-noindex: yes
robot-host: *.nwnet.net
robot-from: no
robot-useragent: Patric/0.01a
robot-language: perl
robot-description: (contained at http://www.nwnet.net/technical/ITR/index.html )
robot-history: (contained at http://www.nwnet.net/technical/ITR/index.html )
robot-environment: service
modified-date: Thurs, 15 Aug 1996
modified-by: toney@nwnet.net
robot-id: pegasus
robot-name: pegasus
robot-cover-url: http://opensource.or.id/projects.html
robot-details-url: http://pegasus.opensource.or.id
robot-owner-name: A.Y.Kiky Shannon
robot-owner-url: http://go.to/ayks
robot-owner-email: shannon@opensource.or.id
robot-status: inactive - open source
robot-purpose: indexing
robot-type: standalone
robot-platform: unix
robot-availability: source, binary
robot-exclusion: yes
robot-exclusion-useragent: pegasus
robot-noindex: yes
robot-host: *
robot-from: yes
robot-useragent: web robot PEGASUS
robot-language: perl5
robot-description: pegasus gathers information from HTML pages (7 important
tags). The indexing process can be started based on starting URL(s) or a range
of IP addresses.
robot-history: This robot was created as an implementation of a final project at the
Informatics Engineering Department, Institute of Technology Bandung, Indonesia.
robot-environment: research
modified-date: Fri, 20 Oct 2000 14:58:40 GMT
modified-by: A.Y.Kiky Shannon
robot-id: perignator
robot-name: The Peregrinator
robot-cover-url: http://www.maths.usyd.edu.au:8000/jimr/pe/Peregrinator.html
robot-details-url:
robot-owner-name: Jim Richardson
robot-owner-url: http://www.maths.usyd.edu.au:8000/jimr.html
robot-owner-email: jimr@maths.su.oz.au
robot-status:
robot-purpose:
robot-type:
robot-platform:
robot-availability:
robot-exclusion: yes
robot-exclusion-useragent:
robot-noindex: no
robot-host:
robot-from: yes
robot-useragent: Peregrinator-Mathematics/0.7
robot-language: perl 4
robot-description: This robot is being used to generate an index of documents
on Web sites connected with mathematics and statistics. It
ignores off-site links, so does not stray from a list of
servers specified initially.
robot-history: commenced operation in August 1994
robot-environment:
modified-date:
modified-by:
robot-id: perlcrawler
robot-name: PerlCrawler 1.0
robot-cover-url: http://perlsearch.hypermart.net/
robot-details-url: http://www.xav.com/scripts/xavatoria/index.html
robot-owner-name: Matt McKenzie
robot-owner-url: http://perlsearch.hypermart.net/
robot-owner-email: webmaster@perlsearch.hypermart.net
robot-status: active
robot-purpose: indexing
robot-type: standalone
robot-platform: unix
robot-availability: source
robot-exclusion: yes
robot-exclusion-useragent: perlcrawler
robot-noindex: yes
robot-host: server5.hypermart.net
robot-from: yes
robot-useragent: PerlCrawler/1.0 Xavatoria/2.0
robot-language: perl5
robot-description: The PerlCrawler robot is designed to index and build
a database of pages relating to the Perl programming language.
robot-history: Originated in modified form on 25 June 1998
robot-environment: hobby
modified-date: Fri, 18 Dec 1998 23:37:40 GMT
modified-by: Matt McKenzie
robot-id: phantom
robot-name: Phantom
robot-cover-url: http://www.maxum.com/phantom/
robot-details-url:
robot-owner-name: Larry Burke
robot-owner-url: http://www.aktiv.com/
robot-owner-email: lburke@aktiv.com
robot-status:
robot-purpose: indexing
robot-type: standalone
robot-platform: Macintosh
robot-availability:
robot-exclusion: yes
robot-exclusion-useragent:
robot-noindex:
robot-host:
robot-from: yes
robot-useragent: Duppies
robot-language:
robot-description: Designed to allow webmasters to provide a searchable index
of their own site as well as to other sites, perhaps with
similar content.
robot-history:
robot-environment:
modified-date: Fri Jan 19 05:08:15 1996.
modified-by:
robot-id: piltdownman
robot-name: PiltdownMan
robot-cover-url: http://profitnet.bizland.com/
robot-details-url: http://profitnet.bizland.com/piltdownman.html
robot-owner-name: Daniel Vilà
robot-owner-url: http://profitnet.bizland.com/aboutus.html
robot-owner-email: profitnet@myezmail.com
robot-status: active
robot-purpose: statistics
robot-type: standalone
robot-platform: windows95, windows98, windowsNT
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: piltdownman
robot-noindex: no
robot-nofollow: no
robot-host: 62.36.128.*, 194.133.59.*, 212.106.215.*
robot-from: no
robot-useragent: PiltdownMan/1.0 profitnet@myezmail.com
robot-language: c++
robot-description: The PiltdownMan robot is used to get a list of links from
the search engines in our database. These links are followed, and the pages
that they refer to are downloaded to get some statistics from them.
The robot runs once a month, more or less, and visits the first 10 pages
listed in every search engine, for a group of keywords.
robot-history: To maintain a database of search engines,
we needed an automated tool. That's why
we began the creation of this robot.
robot-environment: service
modified-date: Mon, 13 Dec 1999 21:50:32 GMT
modified-by: Daniel Vilà
robot-id: pioneer
robot-name: Pioneer
robot-cover-url: http://sequent.uncfsu.edu/~micah/pioneer.html
robot-details-url:
robot-owner-name: Micah A. Williams
robot-owner-url: http://sequent.uncfsu.edu/~micah/
robot-owner-email: micah@sequent.uncfsu.edu
robot-status:
robot-purpose: indexing, statistics
robot-type: standalone
robot-platform:
robot-availability:
robot-exclusion: yes
robot-exclusion-useragent:
robot-noindex:
robot-host: *.uncfsu.edu or flyer.ncsc.org
robot-from: yes
robot-useragent: Pioneer
robot-language: C.
robot-description: Pioneer is part of an undergraduate research project.
robot-history:
robot-environment:
modified-date: Mon Feb 5 02:49:32 1996.
modified-by:
robot-id: pitkow
robot-name: html_analyzer
robot-cover-url:
robot-details-url:
robot-owner-name: James E. Pitkow
robot-owner-url:
robot-owner-email: pitkow@aries.colorado.edu
robot-status:
robot-purpose: maintenance
robot-type:
robot-platform:
robot-availability:
robot-exclusion:
robot-exclusion-useragent:
robot-noindex: no
robot-host:
robot-from:
robot-useragent:
robot-language:
robot-description: to check validity of Web servers. I'm not sure if it has
ever been run remotely.
robot-history:
robot-environment:
modified-date:
modified-by:
robot-id: pjspider
robot-name: Portal Juice Spider
robot-cover-url: http://www.portaljuice.com
robot-details-url: http://www.portaljuice.com/pjspider.html
robot-owner-name: Nextopia Software Corporation
robot-owner-url: http://www.portaljuice.com
robot-owner-email: pjspider@portaljuice.com
robot-status: active
robot-purpose: indexing, statistics
robot-type: standalone
robot-platform: unix
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: pjspider
robot-noindex: yes
robot-host: *.portaljuice.com, *.nextopia.com
robot-from: yes
robot-useragent: PortalJuice.com/4.0
robot-language: C/C++
robot-description: Indexing web documents for Portal Juice vertical portal
search engine
robot-history: Indexing the web since 1998 for the purposes of offering our
commercial Portal Juice search engine services.
robot-environment: service
modified-date: Wed Jun 23 17:00:00 EST 1999
modified-by: pjspider@portaljuice.com
robot-id: pka
robot-name: PGP Key Agent
robot-cover-url: http://www.starnet.it/pgp
robot-details-url:
robot-owner-name: Massimiliano Pucciarelli
robot-owner-url: http://www.starnet.it/puma
robot-owner-email: puma@comm2000.it
robot-status: Active
robot-purpose: indexing
robot-type: standalone
robot-platform: UNIX, Windows NT
robot-availability: none
robot-exclusion: no
robot-exclusion-useragent:
robot-noindex: no
robot-host: salerno.starnet.it
robot-from: yes
robot-useragent: PGP-KA/1.2
robot-language: Perl 5
robot-description: This program searches for the PGP public key of the
specified user.
robot-history: Originated as a research project at Salerno
University in 1995.
robot-environment: Research
modified-date: June 27 1996.
modified-by: Massimiliano Pucciarelli
robot-id: plumtreewebaccessor
robot-name: PlumtreeWebAccessor
robot-cover-url:
robot-details-url: http://www.plumtree.com/
robot-owner-name: Joseph A. Stanko
robot-owner-url:
robot-owner-email: josephs@plumtree.com
robot-status: development
robot-purpose: indexing for the Plumtree Server
robot-type: standalone
robot-platform: windowsNT
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: PlumtreeWebAccessor
robot-noindex: yes
robot-host:
robot-from: yes
robot-useragent: PlumtreeWebAccessor/0.9
robot-language: c++
robot-description: The Plumtree Web Accessor is a component that
customers can add to the
Plumtree Server to index documents on the World Wide Web.
robot-history:
robot-environment: commercial
modified-date: Thu, 17 Dec 1998
modified-by: Joseph A. Stanko <josephs@plumtree.com>
robot-id: poppi
robot-name: Poppi
robot-cover-url: http://members.tripod.com/poppisearch
robot-details-url: http://members.tripod.com/poppisearch
robot-owner-name: Antonio Provenzano
robot-owner-url: Antonio Provenzano
robot-owner-email:
robot-status: active
robot-purpose: indexing
robot-type: standalone
robot-platform: unix/linux
robot-availability: none
robot-exclusion:
robot-exclusion-useragent:
robot-noindex: yes
robot-host:
robot-from:
robot-useragent: Poppi/1.0
robot-language: C
robot-description: Poppi is a crawler that indexes the web; it runs weekly,
gathering and indexing hypertextual, multimedia and executable file
formats
robot-history: Created by Antonio Provenzano in April 2000, it has
been acquired by Tomi Officine Multimediali srl and is about to be
released as a service and commercially
robot-environment: service
modified-date: Mon, 22 May 2000 15:47:30 GMT
modified-by: Antonio Provenzano
robot-id: portalb
robot-name: PortalB Spider
robot-cover-url: http://www.portalb.com/
robot-details-url:
robot-owner-name: PortalB Spider Bug List
robot-owner-url:
robot-owner-email: spider@portalb.com
robot-status: active
robot-purpose: indexing
robot-type: standalone
robot-platform: unix
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: PortalBSpider
robot-noindex: yes
robot-nofollow: yes
robot-host: spider1.portalb.com, spider2.portalb.com, etc.
robot-from: no
robot-useragent: PortalBSpider/1.0 (spider@portalb.com)
robot-language: C++
robot-description: The PortalB Spider indexes selected sites for
high-quality business information.
robot-history:
robot-environment: service
robot-id: Puu
robot-name: GetterroboPlus Puu
robot-details-url: http://marunaka.homing.net/straight/getter/
robot-cover-url: http://marunaka.homing.net/straight/
robot-owner-name: marunaka
robot-owner-url: http://marunaka.homing.net
robot-owner-email: marunaka@homing.net
robot-status: active: robot actively in use
robot-purpose:
- gathering: gathers data from the original standard TAG for Puu, which contains
the information of the sites registered in my search engine
- maintenance: link validation
robot-type: standalone
robot-platform: unix
robot-availability: none
robot-exclusion: yes (Puu patrols only registered url in my Search Engine)
robot-exclusion-useragent: Getterrobo-Plus
robot-noindex: no
robot-host: straight FLASH!! Getterrobo-Plus, *.homing.net
robot-from: yes
robot-useragent: straight FLASH!! GetterroboPlus 1.5
robot-language: perl5
robot-description:
The Puu robot is used to gather data from sites registered in the search engine
"straight FLASH!!" for building an announcement page on the state of renewal of
registered sites in "straight FLASH!!".
The robot runs every day.
robot-history:
This robot patrols the sites registered in the search engine "straight FLASH!!"
robot-environment: hobby
modified-date: Fri, 26 Jun 1998
robot-id: python
robot-name: The Python Robot
robot-cover-url: http://www.python.org/
robot-details-url:
robot-owner-name: Guido van Rossum
robot-owner-url: http://www.python.org/~guido/
robot-owner-email: guido@python.org
robot-status: retired
robot-purpose:
robot-type:
robot-platform:
robot-availability: none
robot-exclusion:
robot-exclusion-useragent:
robot-noindex: no
robot-host:
robot-from:
robot-useragent:
robot-language:
robot-description:
robot-history:
robot-environment:
modified-date:
modified-by:
robot-id: raven
robot-name: Raven Search
robot-cover-url: http://ravensearch.tripod.com
robot-details-url: http://ravensearch.tripod.com
robot-owner-name: Raven Group
robot-owner-url: http://ravensearch.tripod.com
robot-owner-email: ravensearch@hotmail.com
robot-status: Development: robot under development
robot-purpose: Indexing: gather content for commercial query engine.
robot-type: Standalone: a separate program
robot-platform: Unix, Windows98, WindowsNT, Windows2000
robot-availability: None
robot-exclusion: Yes
robot-exclusion-useragent: Raven
robot-noindex: Yes
robot-nofollow: Yes
robot-host: 192.168.1.*
robot-from: Yes
robot-useragent: Raven-v2
robot-language: Perl-5
robot-description: Raven was written for the express purpose of indexing the web.
It can process hundreds of URLs in parallel at a time. It runs on a sporadic basis
as testing continues. It is really several programs running concurrently.
It takes four computers to run Raven Search. Scalable in sets of four.
robot-history: This robot is new. First active on March 25, 2000.
robot-environment: Commercial: is a commercial product. Possibly GNU later ;-)
modified-date: Fri, 25 Mar 2000 17:28:52 GMT
modified-by: Raven Group
robot-id: rbse
robot-name: RBSE Spider
robot-cover-url: http://rbse.jsc.nasa.gov/eichmann/urlsearch.html
robot-details-url:
robot-owner-name: David Eichmann
robot-owner-url: http://rbse.jsc.nasa.gov/eichmann/home.html
robot-owner-email: eichmann@rbse.jsc.nasa.gov
robot-status: active
robot-purpose: indexing, statistics
robot-type:
robot-platform:
robot-availability:
robot-exclusion: yes
robot-exclusion-useragent:
robot-noindex:
robot-host: rbse.jsc.nasa.gov (192.88.42.10)
robot-from:
robot-useragent:
robot-language: C, oracle, wais
robot-description: Developed and operated as part of the NASA-funded Repository
Based Software Engineering Program at the Research Institute
for Computing and Information Systems, University of Houston
- Clear Lake.
robot-history:
robot-environment:
modified-date: Thu May 18 04:47:02 1995
modified-by:
robot-id: resumerobot
robot-name: Resume Robot
robot-cover-url: http://www.onramp.net/proquest/resume/robot/robot.html
robot-details-url:
robot-owner-name: James Stakelum
robot-owner-url: http://www.onramp.net/proquest/resume/java/resume.html
robot-owner-email: proquest@onramp.net
robot-status:
robot-purpose: indexing.
robot-type: standalone
robot-platform:
robot-availability:
robot-exclusion: yes
robot-exclusion-useragent:
robot-noindex:
robot-host:
robot-from: yes
robot-useragent: Resume Robot
robot-language: C++.
robot-description:
robot-history:
robot-environment:
modified-date: Tue Mar 12 15:52:25 1996.
modified-by:
robot-id: rhcs
robot-name: RoadHouse Crawling System
robot-cover-url: http://stage.perceval.be (under development)
robot-details-url:
robot-owner-name: Gregoire Welraeds, Emmanuel Bergmans
robot-owner-url: http://www.perceval.be
robot-owner-email: helpdesk@perceval.be
robot-status: development
robot-purpose1: indexing
robot-purpose2: maintenance
robot-purpose3: statistics
robot-type: standalone
robot-platform1: unix (FreeBSD & Linux)
robot-availability: none
robot-exclusion: no (under development)
robot-exclusion-useragent: RHCS
robot-noindex: no (under development)
robot-host: stage.perceval.be
robot-from: no
robot-useragent: RHCS/1.0a
robot-language: c
robot-description: robot used to build the database for the RoadHouse search service
project operated by Perceval
robot-history: The need for this robot finds its roots in the current RoadHouse directory,
not maintained since 1997
robot-environment: service
modified-date: Fri, 26 Feb 1999 12:00:00 GMT
modified-by: Gregoire Welraeds
robot-id: roadrunner
robot-name: Road Runner: The ImageScape Robot
robot-owner-name: LIM Group
robot-owner-email: lim@cs.leidenuniv.nl
robot-status: development/active
robot-purpose: indexing
robot-type: standalone
robot-platform: UNIX
robot-exclusion: yes
robot-exclusion-useragent: roadrunner
robot-useragent: Road Runner: ImageScape Robot (lim@cs.leidenuniv.nl)
robot-language: C, perl5
robot-description: Create Image/Text index for WWW
robot-history: ImageScape Project
robot-environment: commercial service
modified-date: Dec. 1st, 1996
robot-id: robbie
robot-name: Robbie the Robot
robot-cover-url:
robot-details-url:
robot-owner-name: Robert H. Pollack
robot-owner-url:
robot-owner-email: robert.h.pollack@lmco.com
robot-status: development
robot-purpose: indexing
robot-type: standalone
robot-platform: unix, windows95, windowsNT
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: Robbie
robot-noindex: no
robot-host: *.lmco.com
robot-from: yes
robot-useragent: Robbie/0.1
robot-language: java
robot-description: Used to define document collections for the DISCO system.
Robbie is still under development and runs several
times a day, but usually only for ten minutes or so.
Sites are visited in the order in which references
are found, but no host is visited more than once in
any two-minute period.
robot-history: The DISCO system is a resource-discovery component in
the OLLA system, which is a prototype system, developed
under DARPA funding, to support computer-based education
and training.
robot-environment: research
modified-date: Wed, 5 Feb 1997 19:00:00 GMT
modified-by:
robot-id: robi
robot-name: ComputingSite Robi/1.0
robot-cover-url: http://www.computingsite.com/robi/
robot-details-url: http://www.computingsite.com/robi/
robot-owner-name: Tecor Communications S.L.
robot-owner-url: http://www.tecor.com/
robot-owner-email: robi@computingsite.com
robot-status: Active
robot-purpose: indexing,maintenance
robot-type: standalone
robot-platform: UNIX
robot-availability:
robot-exclusion: yes
robot-exclusion-useragent: robi
robot-noindex: no
robot-host: robi.computingsite.com
robot-from:
robot-useragent: ComputingSite Robi/1.0 (robi@computingsite.com)
robot-language: python
robot-description: Intelligent agent used to build the ComputingSite Search
Directory.
robot-history: It was born on August 1997.
robot-environment: service
modified-date: Wed, 13 May 1998 17:28:52 GMT
modified-by: Jorge Alegre
robot-id: robozilla
robot-name: Robozilla
robot-cover-url: http://dmoz.org/
robot-details-url: http://www.dmoz.org/newsletter/2000Aug/robo.html
robot-owner-name: "Rob O'Zilla"
robot-owner-url: http://dmoz.org/profiles/robozilla.html
robot-owner-email: robozilla@dmozed.org
robot-status: active
robot-purpose: maintenance
robot-type: standalone
robot-availability: none
robot-exclusion: no
robot-noindex: no
robot-host: directory.mozilla.org
robot-useragent: Robozilla/1.0
robot-description: Robozilla visits all the links within the Open Directory
periodically, marking the ones that return errors for review.
robot-environment: service
robot-id: roverbot
robot-name: Roverbot
robot-cover-url: http://www.roverbot.com/
robot-details-url:
robot-owner-name: GlobalMedia Design (Andrew Cowan & Brian Clark)
robot-owner-url: http://www.radzone.org/gmd/
robot-owner-email: gmd@spyder.net
robot-status:
robot-purpose: indexing
robot-type: standalone
robot-platform:
robot-availability:
robot-exclusion: yes
robot-exclusion-useragent:
robot-noindex:
robot-host: roverbot.com
robot-from: yes
robot-useragent: Roverbot
robot-language: perl5
robot-description: Targeted email gatherer utilizing user-defined seed points
and interacting with both the webserver and MX servers of
remote sites.
robot-history:
robot-environment:
modified-date: Tue Jun 18 19:16:31 1996.
modified-by:
robot-id: safetynetrobot
robot-name: SafetyNet Robot
robot-cover-url: http://www.urlabs.com/
robot-details-url:
robot-owner-name: Michael L. Nelson
robot-owner-url: http://www.urlabs.com/
robot-owner-email: m.l.nelson@urlabs.com
robot-status:
robot-purpose: indexing.
robot-type: standalone
robot-platform:
robot-availability:
robot-exclusion: no.
robot-exclusion-useragent:
robot-noindex:
robot-host: *.urlabs.com
robot-from: yes
robot-useragent: SafetyNet Robot 0.1,
robot-language: Perl 5
robot-description: Finds URLs for K-12 content management.
robot-history:
robot-environment:
modified-date: Sat Mar 23 20:12:39 1996.
modified-by:
robot-id: scooter
robot-name: Scooter
robot-cover-url: http://www.altavista.com/
robot-details-url: http://www.altavista.com/av/content/addurl.htm
robot-owner-name: AltaVista
robot-owner-url: http://www.altavista.com/
robot-owner-email: scooter@pa.dec.com
robot-status: active
robot-purpose: indexing
robot-type: standalone
robot-platform: unix
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: Scooter
robot-noindex: yes
robot-host: *.av.pa-x.dec.com
robot-from: yes
robot-useragent: Scooter/2.0 G.R.A.B. V1.1.0
robot-language: c
robot-description: Scooter is AltaVista's prime index agent.
robot-history: Version 2 of Scooter/1.0 developed by Louis Monier of WRL.
robot-environment: service
modified-date: Wed, 13 Jan 1999 17:18:59 GMT
modified-by: steves@avs.dec.com
robot-id: search_au
robot-name: Search.Aus-AU.COM
robot-details-url: http://Search.Aus-AU.COM/
robot-cover-url: http://Search.Aus-AU.COM/
robot-owner-name: Dez Blanchfield
robot-owner-url: not currently available
robot-owner-email: dez@geko.com
robot-status: - development: robot under development
robot-purpose: - indexing: gather content for an indexing service
robot-type: - standalone: a separate program
robot-platform: - mac - unix - windows95 - windowsNT
robot-availability: - none
robot-exclusion: yes
robot-exclusion-useragent: Search-AU
robot-noindex: yes
robot-host: Search.Aus-AU.COM, 203.55.124.29, 203.2.239.29
robot-from: no
robot-useragent: not available
robot-language: c, perl, sql
robot-description: Search-AU is a development tool I have built
to investigate the power of a search engine and web crawler
to give me access to a database of web content (html / urls)
and addresses etc. from which I hope to build more accurate stats
about the .au zone's web content.
The robot started crawling from http://www.geko.net.au/ on
March 1st, 1998 and after nine days had 70mb of compressed ascii
in a database to work with. I hope to run a refresh of the crawl
every month initially, and soon every week, bandwidth and cpu allowing.
If the project warrants further development, I will turn it into
an Australian (.au) zone search engine and make it commercially
available for advertising to cover the costs, which are starting
to mount up. --dez (980313 - black friday!)
robot-environment: hobby
modified-date: Fri Mar 13 10:03:32 EST 1998
robot-id: searchprocess
robot-name: SearchProcess
robot-cover-url: http://www.searchprocess.com
robot-details-url: http://www.intelligence-process.com
robot-owner-name: Mannina Bruno
robot-owner-url: http://www.intelligence-process.com
robot-owner-email: bruno@intelligence-process.com
robot-status: active
robot-purpose: statistics
robot-type: browser
robot-platform: linux
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: searchprocess
robot-noindex: yes
robot-host: searchprocess.com
robot-from: yes
robot-useragent: searchprocess/0.9
robot-language: perl
robot-description: An intelligent agent online. SearchProcess is used to
provide structured information to users.
robot-history: This is the son of Auresys
robot-environment: Service freeware
modified-date: Thu, 22 Dec 1999
modified-by: Mannina Bruno
robot-id: senrigan
robot-name: Senrigan
robot-cover-url: http://www.info.waseda.ac.jp/search-e.html
robot-details-url:
robot-owner-name: TAMURA Kent
robot-owner-url: http://www.info.waseda.ac.jp/muraoka/members/kent/
robot-owner-email: kent@muraoka.info.waseda.ac.jp
robot-status: active
robot-purpose: indexing
robot-type: standalone
robot-platform: Java
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: Senrigan
robot-noindex: yes
robot-host: aniki.olu.info.waseda.ac.jp
robot-from: yes
robot-useragent: Senrigan/xxxxxx
robot-language: Java
robot-description: This robot now fetches HTML pages from the jp domain only.
robot-history: It has been running since Dec 1994
robot-environment: research
modified-date: Mon Jul 1 07:30:00 GMT 1996
modified-by: TAMURA Kent
robot-id: sgscout
robot-name: SG-Scout
robot-cover-url: http://www-swiss.ai.mit.edu/~ptbb/SG-Scout/SG-Scout.html
robot-details-url:
robot-owner-name: Peter Beebee
robot-owner-url: http://www-swiss.ai.mit.edu/~ptbb/personal/index.html
robot-owner-email: ptbb@ai.mit.edu, beebee@parc.xerox.com
robot-status: active
robot-purpose: indexing
robot-type:
robot-platform:
robot-availability:
robot-exclusion: yes
robot-exclusion-useragent:
robot-noindex: no
robot-host: beta.xerox.com
robot-from: yes
robot-useragent: SG-Scout
robot-language:
robot-description: Does a "server-oriented" breadth-first search in a
  round-robin fashion, with multiple processes.
robot-history: Run since 27 June 1994, for an internal XEROX research project
robot-environment:
modified-date:
modified-by:
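The "server-oriented" round-robin scheduling described in the SG-Scout record above can be pictured with a small sketch. The following Python fragment is only an illustration of that scheduling idea in a single process; fetch() and extract_links() are assumed caller-supplied helpers, and this is not SG-Scout's actual code.

    from collections import defaultdict, deque
    from urllib.parse import urlsplit

    def crawl_round_robin(seeds, fetch, extract_links, max_pages=100):
        """Breadth-first crawl that takes one URL per server on each pass."""
        per_host = defaultdict(deque)   # host -> queue of URLs waiting to be fetched
        seen = set()

        def enqueue(url):
            if url not in seen:
                seen.add(url)
                per_host[urlsplit(url).netloc].append(url)

        for url in seeds:
            enqueue(url)
        fetched = 0
        while fetched < max_pages:
            busy = [h for h, q in per_host.items() if q]   # servers with pending work
            if not busy:
                break
            for host in busy:                              # round-robin over servers
                if fetched >= max_pages:
                    break
                page = fetch(per_host[host].popleft())     # assumed HTTP GET helper
                fetched += 1
                for link in extract_links(page):           # assumed link extractor
                    enqueue(link)
        return seen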
robot-id:shaggy
robot-name:ShagSeeker
robot-cover-url:http://www.shagseek.com
robot-details-url:
robot-owner-name:Joseph Reynolds
robot-owner-url:http://www.shagseek.com
robot-owner-email:joe.reynolds@shagseek.com
robot-status:active
robot-purpose:indexing
robot-type:standalone
robot-platform:unix
robot-availability:data
robot-exclusion:yes
robot-exclusion-useragent:Shagseeker
robot-noindex:yes
robot-host:shagseek.com
robot-from:
robot-useragent:Shagseeker at http://www.shagseek.com /1.0
robot-language:perl5
robot-description:Shagseeker is the gatherer for the Shagseek.com search
engine and goes out weekly.
robot-history:none yet
robot-environment:service
modified-date:Mon 17 Jan 2000 10:00:00 EST
modified-by:Joseph Reynolds
robot-id: shaihulud
robot-name: Shai'Hulud
robot-cover-url:
robot-details-url:
robot-owner-name: Dimitri Khaoustov
robot-owner-url:
robot-owner-email: shawdow@usa.net
robot-status: active
robot-purpose: mirroring
robot-type: standalone
robot-platform: unix
robot-availability: source
robot-exclusion: no
robot-exclusion-useragent:
robot-noindex: no
robot-host: *.rdtex.ru
robot-from:
robot-useragent: Shai'Hulud
robot-language: C
robot-description: Used to build mirrors for internal use
robot-history: This robot finds its roots in a research project at RDTeX
Perspective Projects Group in 1996
robot-environment: research
modified-date: Mon, 5 Aug 1996 14:35:08 GMT
modified-by: Dimitri Khaoustov
robot-id: sift
robot-name: Sift
robot-cover-url: http://www.worthy.com/
robot-details-url: http://www.worthy.com/
robot-owner-name: Bob Worthy
robot-owner-url: http://www.worthy.com/~bworthy
robot-owner-email: bworthy@worthy.com
robot-status: development, active
robot-purpose: indexing
robot-type: standalone
robot-platform: unix
robot-availability: data
robot-exclusion: yes
robot-exclusion-useragent: sift
robot-noindex: yes
robot-host: www.worthy.com
robot-from:
robot-useragent: libwww-perl-5.41
robot-language: perl
robot-description: Subject directed (via key phrase list) indexing.
robot-history: Libwww of course, implementation using MySQL August, 1999.
Indexing Search and Rescue sites.
robot-environment: research, service
modified-date: Sat, 16 Oct 1999 19:40:00 GMT
modified-by: Bob Worthy
robot-id: simbot
robot-name: Simmany Robot Ver1.0
robot-cover-url: http://simmany.hnc.net/
robot-details-url: http://simmany.hnc.net/irman1.html
robot-owner-name: Youngsik, Lee
robot-owner-url:
robot-owner-email: ailove@hnc.co.kr
robot-status: development & active
robot-purpose: indexing, maintenance, statistics
robot-type: standalone
robot-platform: unix
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: SimBot
robot-noindex: no
robot-host: sansam.hnc.net
robot-from: no
robot-useragent: SimBot/1.0
robot-language: C
robot-description: The Simmany Robot is used to build the Map(DB) for
the simmany service operated by HNC (Hangul & Computer Co., Ltd.). The
robot runs weekly, and visits sites that have useful Korean
information, in a defined order.
robot-history: This robot is a part of simmany service and simmini
products. The simmini is the Web products that make use of the indexing
and retrieving modules of simmany.
robot-environment: service, commercial
modified-date: Thu, 19 Sep 1996 07:02:26 GMT
modified-by: Youngsik, Lee
robot-id: site-valet
robot-name: Site Valet
robot-cover-url: http://valet.webthing.com/
robot-details-url: http://valet.webthing.com/
robot-owner-name: Nick Kew
robot-owner-url:
robot-owner-email: nick@webthing.com
robot-status: active
robot-purpose: maintenance
robot-type: standalone
robot-platform: unix
robot-availability: data
robot-exclusion: yes
robot-exclusion-useragent: Site Valet
robot-noindex: no
robot-host: valet.webthing.com,valet.*
robot-from: yes
robot-useragent: Site Valet
robot-language: perl
robot-description: a deluxe site monitoring and analysis service
robot-history: builds on cg-eye, the WDG Validator, and the Link Valet
robot-environment: service
modified-date: Tue, 27 June 2000
modified-by: nick@webthing.com
robot-id: sitegrabber
robot-name: Open Text Index Robot
robot-cover-url: http://index.opentext.net/main/faq.html
robot-details-url: http://index.opentext.net/OTI_Robot.html
robot-owner-name: John Faichney
robot-owner-url:
robot-owner-email: faichney@opentext.com
robot-status: active
robot-purpose: indexing
robot-type: standalone
robot-platform: UNIX
robot-availability: inquire to markk@opentext.com (Mark Kraatz)
robot-exclusion: yes
robot-exclusion-useragent: Open Text Site Crawler
robot-noindex: no
robot-host: *.opentext.com
robot-from: yes
robot-useragent: Open Text Site Crawler V1.0
robot-language: perl/C
robot-description: This robot is run by Open Text Corporation to produce the
data for the Open Text Index
robot-history: Started in May/95 to replace existing Open Text robot which
was based on libwww
robot-environment: commercial
modified-date: Fri Jul 25 11:46:56 EDT 1997
modified-by: John Faichney
robot-id: sitetech
robot-name: SiteTech-Rover
robot-cover-url: http://www.sitetech.com/
robot-details-url:
robot-owner-name: Anil Peres-da-Silva
robot-owner-url: http://www.sitetech.com
robot-owner-email: adasilva@sitetech.com
robot-status:
robot-purpose: indexing
robot-type: standalone
robot-platform:
robot-availability:
robot-exclusion: yes
robot-exclusion-useragent:
robot-noindex:
robot-host:
robot-from: yes
robot-useragent: SiteTech-Rover
robot-language: C++
robot-description: Originated as part of a suite of Internet Products to
  organize, search & navigate Intranet sites and to validate
  links in HTML documents.
robot-history: This robot originally went by the name of LiberTech-Rover
robot-environment:
modified-date: Fri Aug 9 17:06:56 1996.
modified-by: Anil Peres-da-Silva
robot-id:slcrawler
robot-name:SLCrawler
robot-cover-url:
robot-details-url:
robot-owner-name:Inxight Software
robot-owner-url:http://www.inxight.com
robot-owner-email:kng@inxight.com
robot-status:active
robot-purpose:To build the site map.
robot-type:standalone
robot-platform:windows, windows95, windowsNT
robot-availability:none
robot-exclusion:yes
robot-exclusion-useragent:SLCrawler/2.0
robot-noindex:no
robot-host:n/a
robot-from:
robot-useragent:SLCrawler
robot-language:Java
robot-description:To build the site map.
robot-history:SLCrawler crawls HTML pages on the Internet.
robot-environment: commercial: is a commercial product
modified-date:Nov. 15, 2000
modified-by:Karen Ng
robot-id: slurp
robot-name: Inktomi Slurp
robot-cover-url: http://www.inktomi.com/
robot-details-url: http://www.inktomi.com/slurp.html
robot-owner-name: Inktomi Corporation
robot-owner-url: http://www.inktomi.com/
robot-owner-email: slurp@inktomi.com
robot-status: active
robot-purpose: indexing, statistics
robot-type: standalone
robot-platform: unix
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: slurp
robot-noindex: yes
robot-host: *.inktomi.com
robot-from: yes
robot-useragent: Slurp/2.0
robot-language: C/C++
robot-description: Indexing documents for the HotBot search engine
(www.hotbot.com), collecting Web statistics
robot-history: Switch from Slurp/1.0 to Slurp/2.0 November 1996
robot-environment: service
modified-date: Fri Feb 28 13:57:43 PST 1997
modified-by: slurp@inktomi.com
robot-id: smartspider
robot-name: Smart Spider
robot-cover-url: http://www.travel-finder.com
robot-details-url: http://www.engsoftware.com/robots.htm
robot-owner-name: Ken Wadland
robot-owner-url: http://www.engsoftware.com
robot-owner-email: ken@engsoftware.com
robot-status: active
robot-purpose: indexing
robot-type: standalone
robot-platform: windows95, windowsNT
robot-availability: data, binary, source
robot-exclusion: Yes
robot-exclusion-useragent: ESI
robot-noindex: Yes
robot-host: 207.16.241.*
robot-from: Yes
robot-useragent: ESISmartSpider/2.0
robot-language: C++
robot-description: Classifies sites using a Knowledge Base. The robot collects
web pages which are then parsed and fed to the Knowledge Base. The Knowledge
Base classifies the sites into any of hundreds of categories based on the
vocabulary used. Currently used by: //www.travel-finder.com (Travel and
Tourist Info) and //www.golightway.com (Christian Sites). Several options
exist to control whether sites are discovered and/or classified fully
automatically, fully manually, or somewhere in between.
robot-history: Feb '96 -- Product design begun. May '96 -- First data results
published by Travel-Finder. Oct '96 -- Generalized and announced as a
product for other sites. Jan '97 -- First data results published by GoLightWay.
robot-environment: service, commercial
modified-date: Mon, 13 Jan 1997 10:41:00 EST
modified-by: Ken Wadland
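The record above only says that pages are classified by the vocabulary they use; the Knowledge Base itself is not described. As a rough sketch of vocabulary-based classification (the category names and keyword sets below are invented for illustration, not taken from the product):

    def classify(page_text, categories):
        """Pick the category whose keyword set overlaps most with the page's words."""
        words = {w.strip('.,;:!?()"').lower() for w in page_text.split()}
        scores = {name: len(words & keywords) for name, keywords in categories.items()}
        best = max(scores, key=scores.get)
        return best if scores[best] > 0 else None

    # Illustrative categories only; the real product reportedly used hundreds.
    categories = {
        "travel": {"hotel", "flight", "tour", "resort"},
        "religion": {"church", "ministry", "faith", "worship"},
    }
    print(classify("Book a hotel and a flight for your tour", categories))  # travel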
robot-id: snooper
robot-name: Snooper
robot-cover-url: http://darsun.sit.qc.ca
robot-details-url:
robot-owner-name: Isabelle A. Melnick
robot-owner-url:
robot-owner-email: melnicki@sit.ca
robot-status: part under development and part active
robot-purpose:
robot-type:
robot-platform:
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: snooper
robot-noindex:
robot-host:
robot-from:
robot-useragent: Snooper/b97_01
robot-language:
robot-description:
robot-history:
robot-environment:
modified-date:
modified-by:
robot-id: solbot
robot-name: Solbot
robot-cover-url: http://kvasir.sol.no/
robot-details-url:
robot-owner-name: Frank Tore Johansen
robot-owner-url:
robot-owner-email: ftj@sys.sol.no
robot-status: active
robot-purpose: indexing
robot-type: standalone
robot-platform: unix
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: solbot
robot-noindex: yes
robot-host: robot*.sol.no
robot-from:
robot-useragent: Solbot/1.0 LWP/5.07
robot-language: perl, c
robot-description: Builds data for the Kvasir search service. Only searches
sites which end with one of the following domains: "no", "se", "dk", "is", "fi"
robot-history: This robot is the result of a 3-year-old late-night hack, when
the Verity robot (of that time) was unable to index sites with iso8859
characters (in URLs and other places), and we just _had_ to have something up
and going the next day...
robot-environment: service
modified-date: Tue Apr 7 16:25:05 MET DST 1998
modified-by: Frank Tore Johansen <ftj@sys.sol.no>
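The domain restriction in the description above ("no", "se", "dk", "is", "fi") amounts to a simple host-suffix filter. A minimal sketch of such a check, assuming nothing about Solbot's own code:

    from urllib.parse import urlsplit

    ALLOWED_TLDS = {"no", "se", "dk", "is", "fi"}   # from the record above

    def in_scope(url):
        """True if the URL's host ends in one of the allowed country-code domains."""
        host = urlsplit(url).hostname or ""
        return host.rsplit(".", 1)[-1].lower() in ALLOWED_TLDS

    print(in_scope("http://kvasir.sol.no/"))  # True
    print(in_scope("http://example.com/"))    # False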
robot-id: spanner
robot-name: Spanner
robot-cover-url: http://www.kluge.net/NES/spanner/
robot-details-url: http://www.kluge.net/NES/spanner/
robot-owner-name: Theo Van Dinter
robot-owner-url: http://www.kluge.net/~felicity/
robot-owner-email: felicity@kluge.net
robot-status: development
robot-purpose: indexing,maintenance
robot-type: standalone
robot-platform: unix
robot-availability: source
robot-exclusion: yes
robot-exclusion-useragent: Spanner
robot-noindex: yes
robot-host: *.kluge.net
robot-from: yes
robot-useragent: Spanner/1.0 (Linux 2.0.27 i586)
robot-language: perl
robot-description: Used to index/check links on an intranet.
robot-history: Pet project of the author since beginning of 1996.
robot-environment: hobby
modified-date: Mon, 06 Jan 1997 00:00:00 GMT
modified-by: felicity@kluge.net
robot-id:speedy
robot-name:Speedy Spider
robot-cover-url:http://www.entireweb.com/
robot-details-url:http://www.entireweb.com/speedy.html
robot-owner-name:WorldLight.com AB
robot-owner-url:http://www.worldlight.com
robot-owner-email:speedy@worldlight.com
robot-status:active
robot-purpose:indexing
robot-type:standalone
robot-platform:Windows
robot-availability:none
robot-exclusion:yes
robot-exclusion-useragent:speedy
robot-noindex:yes
robot-host:router-00.sverige.net, 193.15.210.29, *.entireweb.com,
*.worldlight.com
robot-from:yes
robot-useragent:Speedy Spider ( http://www.entireweb.com/speedy.html )
robot-language:C, C++
robot-description:Speedy Spider is used to build the database
for the Entireweb.com search service operated by WorldLight.com
(part of WorldLight Network).
The robot runs constantly, and visits sites in a random order.
robot-history:This robot is a part of the highly advanced search engine
Entireweb.com, that was developed in Halmstad, Sweden during 1998-2000.
robot-environment:service, commercial
modified-date:Mon, 17 July 2000 11:05:03 GMT
modified-by:Marcus Andersson
robot-id: spider_monkey
robot-name: spider_monkey
robot-cover-url: http://www.mobrien.com/add_site.html
robot-details-url: http://www.mobrien.com/add_site.html
robot-owner-name: MPRM Group Limited
robot-owner-url: http://www.mobrien.com
robot-owner-email: mprm@ionsys.com
robot-status: robot actively in use
robot-purpose: gather content for a free indexing service
robot-type: FDSE robot
robot-platform: unix
robot-availability: bulk data gathered by robot available
robot-exclusion: yes
robot-exclusion-useragent: spider_monkey
robot-noindex: yes
robot-host: snowball.ionsys.com
robot-from: yes
robot-useragent: mouse.house/7.1
robot-language: perl5
robot-description: Robot runs every 30 days for a full index and weekly
on a list of accumulated visitor requests
robot-history: This robot is under development and currently active
robot-environment: written as an employee / guest service
modified-date: Mon, 22 May 2000 12:28:52 GMT
modified-by: MPRM Group Limited
robot-id: spiderbot
robot-name: SpiderBot
robot-cover-url: http://pisuerga.inf.ubu.es/lsi/Docencia/TFC/ITIG/icruzadn/cover.htm
robot-details-url:
http://pisuerga.inf.ubu.es/lsi/Docencia/TFC/ITIG/icruzadn/details.htm
robot-owner-name: Ignacio Cruzado Nuño
robot-owner-url: http://pisuerga.inf.ubu.es/lsi/Docencia/TFC/ITIG/icruzadn/icruzadn.htm
robot-owner-email: spidrboticruzado@solaria.emp.ubu.es
robot-status: active
robot-purpose: indexing, mirroring
robot-type: standalone, browser
robot-platform: unix, windows, windows95, windowsNT
robot-availability: source, binary, data
robot-exclusion: yes
robot-exclusion-useragent: SpiderBot/1.0
robot-noindex: yes
robot-host: *
robot-from: yes
robot-useragent: SpiderBot/1.0
robot-language: C++, Tcl
robot-description: Recovers Web Pages and saves them on your hard disk. Then it
reindexes them.
robot-history: This Robot belongs to Ignacio Cruzado Nuño's End of Studies Thesis
"Recuperador páginas Web", towards the degree of "Management Technical Informatics
Engineer" at the University of Burgos in Spain.
robot-environment: research
modified-date: Sun, 27 Jun 1999 09:00:00 GMT
modified-by: Ignacio Cruzado Nuño
robot-id:spiderman
robot-name:SpiderMan
robot-cover-url:http://www.comp.nus.edu.sg/~leunghok
robot-details-url:http://www.comp.nus.edu.sg/~leunghok/honproj.html
robot-owner-name:Leung Hok Peng , The School Of Computing Nus , Singapore
robot-owner-url:http://www.comp.nus.edu.sg/~leunghok
robot-owner-email:leunghok@comp.nus.edu.sg
robot-status:development & active
robot-purpose:user searching using IR technique
robot-type:stand alone
robot-platform:Java 1.2
robot-availability:binary&source
robot-exclusion:no
robot-exclusion-useragent:nil
robot-noindex:no
robot-host:NA
robot-from:NA
robot-useragent:SpiderMan 1.0
robot-language:java
robot-description:It is used for any user to search the web given a query string
robot-history:Originated from The Center for Natural Product Research and The
School of computing National University Of Singapore
robot-environment:research
modified-date:08/08/1999
modified-by:Leung Hok Peng and Dr Hsu Wynne
robot-id: spiderview
robot-name: SpiderView(tm)
robot-cover-url: http://www.northernwebs.com/set/spider_view.html
robot-details-url: http://www.northernwebs.com/set/spider_sales.html
robot-owner-name: Northern Webs
robot-owner-url: http://www.northernwebs.com
robot-owner-email: webmaster@northernwebs.com
robot-status: active
robot-purpose: maintenance
robot-type: standalone
robot-platform: unix, nt
robot-availability: source
robot-exclusion: no
robot-exclusion-useragent:
robot-noindex:
robot-host: bobmin.quad2.iuinc.com, *
robot-from: No
robot-useragent: Mozilla/4.0 (compatible; SpiderView 1.0;unix)
robot-language: perl
robot-description: SpiderView is a server based program which can spider
a webpage, testing the links found on the page, evaluating your server
and its performance.
robot-history: This is an offshoot http retrieval program based on our
Medibot software.
robot-environment: commercial
modified-date:
modified-by:
robot-id: spry
robot-name: Spry Wizard Robot
robot-cover-url: http://www.spry.com/wizard/index.html
robot-details-url:
robot-owner-name: spry
robot-owner-url: http://www.spry.com/index.html
robot-owner-email: info@spry.com
robot-status:
robot-purpose: indexing
robot-type:
robot-platform:
robot-availability:
robot-exclusion:
robot-exclusion-useragent:
robot-noindex:
robot-host: wizard.spry.com or tiger.spry.com
robot-from: no
robot-useragent: no
robot-language:
robot-description: Its purpose is to generate a Resource Discovery database.
  Spry is refusing to give any comments about this robot.
robot-history:
robot-environment:
modified-date: Tue Jul 11 09:29:45 GMT 1995
modified-by:
robot-id: ssearcher
robot-name: Site Searcher
robot-cover-url: www.satacoy.com
robot-details-url: www.satacoy.com
robot-owner-name: Zackware
robot-owner-url: www.satacoy.com
robot-owner-email: zackware@hotmail.com
robot-status: active
robot-purpose: indexing
robot-type: standalone
robot-platform: windows95, windows98, windowsNT
robot-availability: binary
robot-exclusion: no
robot-exclusion-useragent:
robot-noindex: no
robot-host: *
robot-from: no
robot-useragent: ssearcher100
robot-language: C++
robot-description: Site Searcher scans web sites for specific file types.
(JPG, MP3, MPG, etc)
robot-history: Released 4/4/1999
robot-environment: hobby
modified-date: 04/26/1999
robot-id: suke
robot-name: Suke
robot-cover-url: http://www.kensaku.org/
robot-details-url: http://www.kensaku.org/
robot-owner-name: Yosuke Kuroda
robot-owner-url: http://www.kensaku.org/yk/
robot-owner-email: robot@kensaku.org
robot-status: development
robot-purpose: indexing
robot-type: standalone
robot-platform: FreeBSD3.*
robot-availability: source
robot-exclusion: yes
robot-exclusion-useragent: suke
robot-noindex: no
robot-host: *
robot-from: yes
robot-useragent: suke/*.*
robot-language: c
robot-description: This robot mainly visits sites in Japan.
robot-history: since 1999
robot-environment: service
robot-id: suntek
robot-name: suntek search engine
robot-cover-url: http://www.portal.com.hk/
robot-details-url: http://www.suntek.com.hk/
robot-owner-name: Suntek Computer Systems
robot-owner-url: http://www.suntek.com.hk/
robot-owner-email: karen@suntek.com.hk
robot-status: operational
robot-purpose: to create a search portal on Asian web sites
robot-type:
robot-platform: NT, Linux, UNIX
robot-availability: available now
robot-exclusion:
robot-exclusion-useragent:
robot-noindex: yes
robot-host: search.suntek.com.hk
robot-from: yes
robot-useragent: suntek/1.0
robot-language: Java
robot-description: A multilingual search engine with emphasis on Asian content
robot-history:
robot-environment:
modified-date:
modified-by:
robot-id: sven
robot-name: Sven
robot-cover-url:
robot-details-url: http://marty.weathercity.com/sven/
robot-owner-name: Marty Anstey
robot-owner-url: http://marty.weathercity.com/
robot-owner-email: rhondle@home.com
robot-status: Active
robot-purpose: indexing
robot-type: standalone
robot-platform: Windows
robot-availability: none
robot-exclusion: no
robot-exclusion-useragent:
robot-noindex: no
robot-host: 24.113.12.29
robot-from: no
robot-useragent:
robot-language: VB5
robot-description: Used to gather sites for netbreach.com. Runs constantly.
robot-history: Developed as an experiment in web indexing.
robot-environment: hobby, service
modified-date: Tue, 3 Mar 1999 08:15:00 PST
modified-by: Marty Anstey
robot-id: tach_bw
robot-name: TACH Black Widow
robot-cover-url: http://theautochannel.com/~mjenn/bw.html
robot-details-url: http://theautochannel.com/~mjenn/bw-syntax.html
robot-owner-name: Michael Jennings
robot-owner-url: http://www.spd.louisville.edu/~mejenn01/
robot-owner-email: mjenn@theautochannel.com
robot-status: development
robot-purpose: maintenance: link validation
robot-type: standalone
robot-platform: UNIX, Linux
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: tach_bw
robot-noindex: no
robot-host: *.theautochannel.com
robot-from: yes
robot-useragent: Mozilla/3.0 (Black Widow v1.1.0; Linux 2.0.27; Dec 31 1997 12:25:00
robot-language: C/C++
robot-description: Exhaustively recurses a single site to check for broken links
robot-history: Corporate application begun in 1996 for The Auto Channel
robot-environment: commercial
modified-date: Thu, Jan 23 1997 23:09:00 GMT
modified-by: Michael Jennings
robot-id:tarantula
robot-name: Tarantula
robot-cover-url: http://www.nathan.de/nathan/software.html#TARANTULA
robot-details-url: http://www.nathan.de/
robot-owner-name: Markus Hoevener
robot-owner-url:
robot-owner-email: Markus.Hoevener@evision.de
robot-status: development
robot-purpose: indexing
robot-type: standalone
robot-platform: unix
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: yes
robot-noindex: yes
robot-host: yes
robot-from: no
robot-useragent: Tarantula/1.0
robot-language: C
robot-description: Tarantula gathers information for the German search engine Nathan.
robot-history: Started February 1997
robot-environment: service
modified-date: Mon, 29 Dec 1997 15:30:00 GMT
modified-by: Markus Hoevener
robot-id: tarspider
robot-name: tarspider
robot-cover-url:
robot-details-url:
robot-owner-name: Olaf Schreck
robot-owner-url: http://www.chemie.fu-berlin.de/user/chakl/ChaklHome.html
robot-owner-email: chakl@fu-berlin.de
robot-status:
robot-purpose: mirroring
robot-type:
robot-platform:
robot-availability:
robot-exclusion:
robot-exclusion-useragent:
robot-noindex: no
robot-host:
robot-from: chakl@fu-berlin.de
robot-useragent: tarspider
robot-language:
robot-description:
robot-history:
robot-environment:
modified-date:
modified-by:
robot-id: tcl
robot-name: Tcl W3 Robot
robot-cover-url: http://hplyot.obspm.fr/~dl/robo.html
robot-details-url:
robot-owner-name: Laurent Demailly
robot-owner-url: http://hplyot.obspm.fr/~dl/
robot-owner-email: dl@hplyot.obspm.fr
robot-status:
robot-purpose: maintenance, statistics
robot-type: standalone
robot-platform:
robot-availability:
robot-exclusion: yes
robot-exclusion-useragent:
robot-noindex: no
robot-host: hplyot.obspm.fr
robot-from: yes
robot-useragent: dlw3robot/x.y (in TclX by http://hplyot.obspm.fr/~dl/)
robot-language: tcl
robot-description: Its purpose is to validate links, and generate statistics.
robot-history:
robot-environment:
modified-date: Tue May 23 17:51:39 1995
modified-by:
robot-id: techbot
robot-name: TechBOT
robot-cover-url: http://www.techaid.net/
robot-details-url: http://www.techaid.net/TechBOT/
robot-owner-name: TechAID Internet Services
robot-owner-url: http://www.techaid.net/
robot-owner-email: techbot@techaid.net
robot-status: active
robot-purpose:statistics, maintenance
robot-type: standalone
robot-platform: Unix
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: TechBOT
robot-noindex: yes
robot-host: techaid.net
robot-from: yes
robot-useragent: TechBOT
robot-language: perl5
robot-description: TechBOT is constantly upgraded. Currently he is used for
Link Validation, Load Time, HTML Validation and much much more.
robot-history: TechBOT started his life as a Page Change Detection robot,
but has taken on many new and exciting roles.
robot-environment: service
modified-date: Sat, 18 Dec 1998 14:26:00 EST
modified-by: techbot@techaid.net
robot-id: templeton
robot-name: Templeton
robot-cover-url: http://www.bmtmicro.com/catalog/tton/
robot-details-url: http://www.bmtmicro.com/catalog/tton/
robot-owner-name: Neal Krawetz
robot-owner-url: http://www.cs.tamu.edu/people/nealk/
robot-owner-email: nealk@net66.com
robot-status: active
robot-purpose: mirroring, mapping, automating web applications
robot-type: standalone
robot-platform: OS/2, Linux, SunOS, Solaris
robot-availability: binary
robot-exclusion: yes
robot-exclusion-useragent: templeton
robot-noindex: no
robot-host: *
robot-from: yes
robot-useragent: Templeton/{version} for {platform}
robot-language: C
robot-description: Templeton is a very configurable robot for mirroring, mapping, and
automating applications on retrieved documents.
robot-history: This robot was originally created as a test-of-concept.
robot-environment: service, commercial, research, hobby
modified-date: Sun, 6 Apr 1997 10:00:00 GMT
modified-by: Neal Krawetz
robot-id:teoma_agent1
robot-name:TeomaTechnologies
robot-cover-url:http://www.teoma.com/
robot-details-url:
robot-owner-name:
robot-owner-url:
robot-owner-email:teoma_admin@hawkholdings.com
robot-status:active
robot-purpose:
robot-type:
robot-platform:
robot-availability:none
robot-exclusion:no
robot-exclusion-useragent:
robot-noindex:unknown
robot-host:63.236.92.145
robot-from:
robot-useragent:teoma_agent1 [teoma_admin@hawkholdings.com]
robot-language:unknown
robot-description:Unknown robot visiting pages and tacking "%09182837231" or
somesuch onto the ends of URLs.
robot-history:unknown
robot-environment:unknown
modified-date:Thu, 04 Jan 2001 09:05:00 PDT
modified-by: kph
robot-id: titin
robot-name: TitIn
robot-cover-url: http://www.foi.hr/~dpavlin/titin/
robot-details-url: http://www.foi.hr/~dpavlin/titin/tehnical.htm
robot-owner-name: Dobrica Pavlinusic
robot-owner-url: http://www.foi.hr/~dpavlin/
robot-owner-email: dpavlin@foi.hr
robot-status: development
robot-purpose: indexing, statistics
robot-type: standalone
robot-platform: unix
robot-availability: data, source on request
robot-exclusion: yes
robot-exclusion-useragent: titin
robot-noindex: no
robot-host: barok.foi.hr
robot-from: no
robot-useragent: TitIn/0.2
robot-language: perl5, c
robot-description: TitIn is used to index all titles of Web servers in
the .hr domain.
robot-history: It was done as the result of a desperate need for a central
index of Croatian web servers in December 1996.
robot-environment: research
modified-date: Thu, 12 Dec 1996 16:06:42 MET
modified-by: Dobrica Pavlinusic
robot-id: titan
robot-name: TITAN
robot-cover-url: http://isserv.tas.ntt.jp/chisho/titan-e.html
robot-details-url: http://isserv.tas.ntt.jp/chisho/titan-help/eng/titan-help-e.html
robot-owner-name: Yoshihiko HAYASHI
robot-owner-url:
robot-owner-email: hayashi@nttnly.isl.ntt.jp
robot-status: active
robot-purpose: indexing
robot-type: standalone
robot-platform: SunOS 4.1.4
robot-availability: no
robot-exclusion: yes
robot-exclusion-useragent:
robot-noindex: no
robot-host: nlptitan.isl.ntt.jp
robot-from: yes
robot-useragent: TITAN/0.1
robot-language: perl 4
robot-description: Its purpose is to generate a Resource Discovery
  database, and copy document trees. Our primary goal is to develop
  an advanced method for indexing the WWW documents. Uses libwww-perl
robot-history:
robot-environment:
modified-date: Mon Jun 24 17:20:44 PDT 1996
modified-by: Yoshihiko HAYASHI
robot-id: tkwww
robot-name: The TkWWW Robot
robot-cover-url: http://fang.cs.sunyit.edu/Robots/tkwww.html
robot-details-url:
robot-owner-name: Scott Spetka
robot-owner-url: http://fang.cs.sunyit.edu/scott/scott.html
robot-owner-email: scott@cs.sunyit.edu
robot-status:
robot-purpose: indexing
robot-type:
robot-platform:
robot-availability:
robot-exclusion:
robot-exclusion-useragent:
robot-noindex: no
robot-host:
robot-from:
robot-useragent:
robot-language:
robot-description: It is designed to search Web neighborhoods to find pages
  that may be logically related. The Robot returns a list of links that looks
  like a hot list. The search can be by key word or all links at a distance of
  one or two hops may be returned. The TkWWW Robot is described in a paper
  presented at the WWW94 Conference in Chicago.
robot-history:
robot-environment:
modified-date:
modified-by:
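The "one or two hops" neighborhood search described above is essentially a depth-limited breadth-first expansion. A minimal sketch of that idea follows; fetch_links() is an assumed helper that returns the URLs linked from a page, and this is not the TkWWW Robot's own code.

    def neighborhood(start_url, fetch_links, hops=2):
        """Collect every link reachable within `hops` link-follows of start_url."""
        frontier = {start_url}
        found = set()
        for _ in range(hops):
            next_frontier = set()
            for url in frontier:
                for link in fetch_links(url):          # assumed link extractor
                    if link != start_url and link not in found:
                        found.add(link)
                        next_frontier.add(link)
            frontier = next_frontier
        return sorted(found)                           # a flat "hot list" of nearby pages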
robot-id: tlspider
robot-name:TLSpider
robot-cover-url: n/a
robot-details-url: n/a
robot-owner-name: topiclink.com
robot-owner-url: topiclink.com
robot-owner-email: tlspider@outtel.com
robot-status: not activated
robot-purpose: to get web sites and add them to the topiclink future directory
robot-type:development: robot under development
robot-platform:linux
robot-availability:none
robot-exclusion:yes
robot-exclusion-useragent:topiclink
robot-noindex:no
robot-host: tlspider.topiclink.com (not available yet)
robot-from:no
robot-useragent:TLSpider/1.1
robot-language:perl5
robot-description:This robot runs 2 days a week getting information for
TopicLink.com
robot-history:This robot was created to serve the internet search engine
TopicLink.com
robot-environment:service
modified-date:September,10,1999 17:28 GMT
modified-by: TopicLink Spider Team
robot-id: ucsd
robot-name: UCSD Crawl
robot-cover-url: http://www.mib.org/~ucsdcrawl
robot-details-url:
robot-owner-name: Adam Tilghman
robot-owner-url: http://www.mib.org/~atilghma
robot-owner-email: atilghma@mib.org
robot-status:
robot-purpose: indexing, statistics
robot-type: standalone
robot-platform:
robot-availability:
robot-exclusion: yes
robot-exclusion-useragent:
robot-noindex:
robot-host: nuthaus.mib.org scilib.ucsd.edu
robot-from: yes
robot-useragent: UCSD-Crawler
robot-language: Perl 4
robot-description: Should hit ONLY within UC San Diego - trying to count
  servers here.
robot-history:
robot-environment:
modified-date: Sat Jan 27 09:21:40 1996.
modified-by:
robot-id: udmsearch
robot-name: UdmSearch
robot-details-url: http://mysearch.udm.net/
robot-cover-url: http://mysearch.udm.net/
robot-owner-name: Alexander Barkov
robot-owner-url: http://mysearch.udm.net/
robot-owner-email: bar@izhcom.ru
robot-status: active
robot-purpose: indexing, validation
robot-type: standalone
robot-platform: unix
robot-availability: source, binary
robot-exclusion: yes
robot-exclusion-useragent: UdmSearch
robot-noindex: yes
robot-host: *
robot-from: no
robot-useragent: UdmSearch/2.1.1
robot-language: c
robot-description: UdmSearch is a free web search engine software for
intranet/small domain internet servers
robot-history: Developed since 1998; the original purpose was a search engine
over the Republic of Udmurtia, http://search.udm.net
robot-environment: hobby
modified-date: Mon, 6 Sep 1999 10:28:52 GMT
robot-id: urlck
robot-name: URL Check
robot-cover-url: http://www.cutternet.com/products/webcheck.html
robot-details-url: http://www.cutternet.com/products/urlck.html
robot-owner-name: Dave Finnegan
robot-owner-url: http://www.cutternet.com
robot-owner-email: dave@cutternet.com
robot-status: active
robot-purpose: maintenance
robot-type: standalone
robot-platform: unix
robot-availability: binary
robot-exclusion: yes
robot-exclusion-useragent: urlck
robot-noindex: no
robot-host: *
robot-from: yes
robot-useragent: urlck/1.2.3
robot-language: c
robot-description: The robot is used to manage, maintain, and modify
  web sites. It builds a database detailing the site, builds HTML
  reports describing the site, and can be used to up-load pages to
  the site or to modify existing pages and URLs within the site. It
  can also be used to mirror whole or partial sites. It supports
  HTTP, File, FTP, and Mailto schemes. Originally designed to
  validate URLs.
robot-history:
robot-environment: commercial
modified-date: July 9, 1997
modified-by: Dave Finnegan
robot-id: us
robot-name: URL Spider Pro
robot-cover-url: http://www.innerprise.net
robot-details-url: http://www.innerprise.net/us.htm
robot-owner-name: Innerprise
robot-owner-url: http://www.innerprise.net
robot-owner-email: greg@innerprise.net
robot-status: active
robot-purpose: indexing
robot-type: standalone
robot-platform: Windows9x/NT
robot-availability: binary
robot-exclusion: yes
robot-exclusion-useragent: *
robot-noindex: yes
robot-host: *
robot-from: no
robot-useragent: URL Spider Pro
robot-language: delphi
robot-description: Used for building a database of web pages.
robot-history: Project started July 1998.
robot-environment: commercial
modified-date: Mon, 12 Jul 1999 17:50:30 GMT
modified-by: Innerprise
robot-id: valkyrie
robot-name: Valkyrie
robot-cover-url: http://kichijiro.c.u-tokyo.ac.jp/odin/
robot-details-url: http://kichijiro.c.u-tokyo.ac.jp/odin/robot.html
robot-owner-name: Masanori Harada
robot-owner-url: http://www.graco.c.u-tokyo.ac.jp/~harada/
robot-owner-email: harada@graco.c.u-tokyo.ac.jp
robot-status: active
robot-purpose: indexing
robot-type: standalone
robot-platform: unix
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: Valkyrie libwww-perl
robot-noindex: no
robot-host: *.c.u-tokyo.ac.jp
robot-from: yes
robot-useragent: Valkyrie/1.0 libwww-perl/0.40
robot-language: perl4
robot-description: used to collect resources from Japanese Web sites for ODIN search
engine.
robot-history: This robot has been used since Oct. 1995 for author's research.
robot-environment: service research
modified-date: Thu Mar 20 19:09:56 JST 1997
modified-by: harada@graco.c.u-tokyo.ac.jp
robot-id: victoria
robot-name: Victoria
robot-cover-url:
robot-details-url:
robot-owner-name: Adrian Howard
robot-owner-url:
robot-owner-email: adrianh@oneworld.co.uk
robot-status: development
robot-purpose: maintenance
robot-type: standalone
robot-platform: unix
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: Victoria
robot-noindex: yes
robot-host:
robot-from:
robot-useragent: Victoria/1.0
robot-language: perl,c
robot-description: Victoria is part of a groupware produced
by Victoria Real Ltd. (voice: +44 [0]1273 774469,
fax: +44 [0]1273 779960 email: victoria@pavilion.co.uk).
Victoria is used to monitor changes in W3 documents,
both intranet and internet based.
Contact Victoria Real for more information.
robot-history:
robot-environment: commercial
modified-date: Fri, 22 Nov 1996 16:45 GMT
modified-by: victoria@pavilion.co.uk
robot-id: visionsearch
robot-name: vision-search
robot-cover-url: http://www.ius.cs.cmu.edu/cgi-bin/vision-search
robot-details-url:
robot-owner-name: Henry A. Rowley
robot-owner-url: http://www.cs.cmu.edu/~har
robot-owner-email: har@cs.cmu.edu
robot-status:
robot-purpose: indexing
robot-type: standalone
robot-platform:
robot-availability:
robot-exclusion: yes
robot-exclusion-useragent:
robot-noindex:
robot-host: dylan.ius.cs.cmu.edu
robot-from: no
robot-useragent: vision-search/3.0'
robot-language: Perl 5
robot-description: Intended to be an index of computer vision pages, containing
  all pages within <em>n</em> links (for some small <em>n</em>) of the Vision
  Home Page
robot-history:
robot-environment:
modified-date: Fri Mar 8 16:03:04 1996
modified-by:
robot-id: voyager
robot-name: Voyager
robot-cover-url: http://www.lisa.co.jp/voyager/
robot-details-url:
robot-owner-name: Voyager Staff
robot-owner-url: http://www.lisa.co.jp/voyager/
robot-owner-email: voyager@lisa.co.jp
robot-status: development
robot-purpose: indexing, maintenance
robot-type: standalone
robot-platform: unix
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: Voyager
robot-noindex: no
robot-host: *.lisa.co.jp
robot-from: yes
robot-useragent: Voyager/0.0
robot-language: perl5
robot-description: This robot is used to build the database for the
Lisa Search service. The robot is launched manually and visits sites in a
random order.
robot-history:
robot-environment: service
modified-date: Mon, 30 Nov 1998 08:00:00 GMT
modified-by: Hideyuki Ezaki
robot-id: vwbot
robot-name: VWbot
robot-cover-url: http://vancouver-webpages.com/VWbot/
robot-details-url: http://vancouver-webpages.com/VWbot/aboutK.shtml
robot-owner-name: Andrew Daviel
robot-owner-url: http://vancouver-webpages.com/~admin/
robot-owner-email: andrew@vancouver-webpages.com
robot-status: active
robot-purpose: indexing
robot-type: standalone
robot-platform: unix
robot-availability: source
robot-exclusion: yes
robot-exclusion-useragent: VWbot_K
robot-noindex: yes
robot-host: vancouver-webpages.com
robot-from: yes
robot-useragent: VWbot_K/4.2
robot-language: perl4
robot-description: Used to index BC sites for the searchBC database. Runs daily.
robot-history: Originally written fall 1995. Actively maintained.
robot-environment: service commercial research
modified-date: Tue, 4 Mar 1997 20:00:00 GMT
modified-by: Andrew Daviel
robot-id: w3index
robot-name: The NWI Robot
robot-cover-url: http://www.ub2.lu.se/NNC/projects/NWI/the_nwi_robot.html
robot-owner-name: Sigfrid Lundberg, Lund university, Sweden
robot-owner-url: http://nwi.ub2.lu.se/~siglun
robot-owner-email: siglun@munin.ub2.lu.se
robot-status: active
robot-purpose: discovery, statistics
robot-type: standalone
robot-platform: UNIX
robot-availability: none (at the moment)
robot-exclusion: yes
robot-noindex: No
robot-host: nwi.ub2.lu.se, mars.dtv.dk and a few others
robot-from: yes
robot-useragent: w3index
robot-language: perl5
robot-description: A resource discovery robot, used primarily for
  the indexing of the Scandinavian Web
robot-history: It is about a year or so old. Written by Anders Ardö,
  Mattias Borrell, Håkan Ardö and myself.
robot-environment: service, research
modified-date: Wed Jun 26 13:58:04 MET DST 1996
modified-by: Sigfrid Lundberg
robot-id: w3m2
robot-name: W3M2
robot-cover-url: http://tronche.com/W3M2
robot-details-url:
robot-owner-name: Christophe Tronche
robot-owner-url: http://tronche.com/
robot-owner-email: tronche@lri.fr
robot-status:
robot-purpose: indexing, maintenance, statistics
robot-type: standalone
robot-platform:
robot-availability:
robot-exclusion: yes
robot-exclusion-useragent:
robot-noindex: no
robot-host: *
robot-from: yes
robot-useragent: W3M2/x.xxx
robot-language: Perl 4, Perl 5, and C++
robot-description: to generate a Resource Discovery database, validate links,
  validate HTML, and generate statistics
robot-history:
robot-environment:
modified-date: Fri May 5 17:48:48 1995
modified-by:
robot-id: wanderer
robot-name: the World Wide Web Wanderer
robot-cover-url: http://www.mit.edu/people/mkgray/net/
robot-details-url:
robot-owner-name: Matthew Gray
robot-owner-url: http://www.mit.edu:8001/people/mkgray/mkgray.html
robot-owner-email: mkgray@mit.edu
robot-status: active
robot-purpose: statistics
robot-type: standalone
robot-platform: unix
robot-availability: data
robot-exclusion: no
robot-exclusion-useragent:
robot-noindex: no
robot-host: *.mit.edu
robot-from:
robot-useragent: WWWWanderer v3.0
robot-language: perl4
robot-description: Run initially in June 1993, its aim is to measure
  the growth in the web.
robot-history:
robot-environment: research
modified-date:
modified-by:
robot-id:webbandit
robot-name:WebBandit Web Spider
robot-cover-url:http://pw2.netcom.com/~wooger/
robot-details-url:http://pw2.netcom.com/~wooger/
robot-owner-name:Jerry Walsh
robot-owner-url:http://pw2.netcom.com/~wooger/
robot-owner-email:wooger@ix.netcom.com
robot-status:active
robot-purpose:Resource Gathering / Server Benchmarking
robot-type:standalone application
robot-platform:Intel - windows95
robot-availability:source, binary
robot-exclusion:no
robot-exclusion-useragent:WebBandit/1.0
robot-noindex:no
robot-host:ix.netcom.com
robot-from:no
robot-useragent:WebBandit/1.0
robot-language:C++
robot-description:multithreaded, hyperlink-following,
resource finding webspider
robot-history:Inspired by reading of
Internet Programming book by Jamsa/Cope
robot-environment:commercial
modified-date:11/21/96
modified-by:Jerry Walsh
robot-id: webcatcher
robot-name: WebCatcher
robot-cover-url: http://oscar.lang.nagoya-u.ac.jp
robot-details-url:
robot-owner-name: Reiji SUZUKI
robot-owner-url: http://oscar.lang.nagoya-u.ac.jp/~reiji/index.html
robot-owner-email: reiji@infonia.ne.jp
robot-owner-name2: Masatoshi SUGIURA
robot-owner-url2: http://oscar.lang.nagoya-u.ac.jp/~sugiura/index.html
robot-owner-email2: sugiura@lang.nagoya-u.ac.jp
robot-status: development
robot-purpose: indexing
robot-type: standalone
robot-platform: unix, windows, mac
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: webcatcher
robot-noindex: no
robot-host: oscar.lang.nagoya-u.ac.jp
robot-from: no
robot-useragent: WebCatcher/1.0
robot-language: perl5
robot-description: WebCatcher gathers web pages
that Japanese college students want to visit.
robot-history: This robot finds its roots in a research project
at Nagoya University in 1998.
robot-environment: research
modified-date: Fri, 16 Oct 1998 17:28:52 JST
modified-by: "Reiji SUZUKI" <reiji@infonia.ne.jp>
robot-id: webcopy
robot-name: WebCopy
robot-cover-url: http://www.inf.utfsm.cl/~vparada/webcopy.html
robot-details-url:
robot-owner-name: Victor Parada
robot-owner-url: http://www.inf.utfsm.cl/~vparada/
robot-owner-email: vparada@inf.utfsm.cl
robot-status:
robot-purpose: mirroring
robot-type: standalone
robot-platform:
robot-availability:
robot-exclusion: no
robot-exclusion-useragent:
robot-noindex: no
robot-host: *
robot-from: no
robot-useragent: WebCopy/(version)
robot-language: perl 4 or perl 5
robot-description: Its purpose is to perform mirroring. WebCopy can retrieve
  files recursively using HTTP protocol. It can be used as a delayed browser
  or as a mirroring tool. It cannot jump from one site to another.
robot-history:
robot-environment:
modified-date: Sun Jul 2 15:27:04 1995
modified-by:
robot-id: webfetcher
robot-name: webfetcher
robot-cover-url: http://www.ontv.com/
robot-details-url:
robot-owner-name:
robot-owner-url: http://www.ontv.com/
robot-owner-email: webfetch@ontv.com
robot-status:
robot-purpose: mirroring
robot-type: standalone
robot-platform:
robot-availability:
robot-exclusion: no
robot-exclusion-useragent:
robot-noindex:
robot-host: *
robot-from: yes
robot-useragent: WebFetcher/0.8,
robot-language: C++
robot-description: Don't wait! OnTV's WebFetcher mirrors whole sites down to
  your hard disk on a TV-like schedule. Catch w3 documentation. Catch
  discovery.com without waiting! A fully operational web robot for NT/95
  today, most UNIX soon, MAC tomorrow.
robot-history:
robot-environment:
modified-date: Sat Jan 27 10:31:43 1996.
modified-by:
robot-id: webfoot
robot-name: The Webfoot Robot
robot-cover-url:
robot-details-url:
robot-owner-name: Lee McLoughlin
robot-owner-url: http://web.doc.ic.ac.uk/f?/lmjm
robot-owner-email: L.McLoughlin@doc.ic.ac.uk
robot-status:
robot-purpose:
robot-type:
robot-platform:
robot-availability:
robot-exclusion:
robot-exclusion-useragent:
robot-noindex:
robot-host: phoenix.doc.ic.ac.uk
robot-from:
robot-useragent:
robot-language:
robot-description:
robot-history: First spotted in Mid February 1994
robot-environment:
modified-date:
modified-by:
robot-id: weblayers
robot-name: weblayers
robot-cover-url: http://www.univ-paris8.fr/~loic/weblayers/
robot-details-url:
robot-owner-name: Loic Dachary
robot-owner-url: http://www.univ-paris8.fr/~loic/
robot-owner-email: loic@afp.com
robot-status:
robot-purpose: maintenance
robot-type: standalone
robot-platform:
robot-availability:
robot-exclusion: yes
robot-exclusion-useragent:
robot-noindex: no
robot-host:
robot-from:
robot-useragent: weblayers/0.0
robot-language: perl 5
robot-description: Its purpose is to validate, cache and maintain links. It is
  designed to maintain the cache generated by the emacs w3 mode (N*tscape
  replacement) and to support annotated documents (keep them in sync with the
  original document via diff/patch).
robot-history:
robot-environment:
modified-date: Fri Jun 23 16:30:42 FRE 1995
modified-by:
robot-id: weblinker
robot-name: WebLinker
robot-cover-url: http://www.cern.ch/WebLinker/
robot-details-url:
robot-owner-name: James Casey
robot-owner-url: http://www.maths.tcd.ie/hyplan/jcasey/jcasey.html
robot-owner-email: jcasey@maths.tcd.ie
robot-status:
robot-purpose: maintenance
robot-type:
robot-platform:
robot-availability:
robot-exclusion:
robot-exclusion-useragent:
robot-noindex:
robot-host:
robot-from:
robot-useragent: WebLinker/0.0 libwww-perl/0.1
robot-language:
robot-description: It traverses a section of web, doing URN->URL conversion.
  It will be used as a post-processing tool on documents created by automatic
  converters such as LaTeX2HTML or WebMaker. At the moment it works at full
  speed, but is restricted to local sites. External GETs will be added, but
  these will be running slowly. WebLinker is meant to be run locally, so if
  you see it elsewhere let the author know!
robot-history:
robot-environment:
modified-date:
modified-by:
robot-id: webmirror
robot-name: WebMirror
robot-cover-url: http://www.winsite.com/pc/win95/netutil/wbmiror1.zip
robot-details-url:
robot-owner-name: Sui Fung Chan
robot-owner-url: http://www.geocities.com/NapaVally/1208
robot-owner-email: sfchan@mailhost.net
robot-status:
robot-purpose: mirroring
robot-type: standalone
robot-platform: Windows95
robot-availability:
robot-exclusion: no
robot-exclusion-useragent:
robot-noindex:
robot-host:
robot-from: no
robot-useragent: no
robot-language: C++
robot-description: It downloads web pages to the hard drive for off-line
  browsing.
robot-history:
robot-environment:
modified-date: Mon Apr 29 08:52:25 1996.
modified-by:
robot-id: webmoose
robot-name: The Web Moose
robot-cover-url:
robot-details-url: http://www.nwlink.com/~mikeblas/webmoose/
robot-owner-name: Mike Blaszczak
robot-owner-url: http://www.nwlink.com/~mikeblas/
robot-owner-email: mikeblas@nwlink.com
robot-status: development
robot-purpose: statistics, maintenance
robot-type: standalone
robot-platform: Windows NT
robot-availability: data
robot-exclusion: no
robot-exclusion-useragent: WebMoose
robot-noindex: no
robot-host: msn.com
robot-from: no
robot-useragent: WebMoose/0.0.0000
robot-language: C++
robot-description: This robot collects statistics and verifies links.
It builds a graph of its visit path.
robot-history: This robot is under development.
It will support ROBOTS.TXT soon.
robot-environment: hobby
modified-date: Fri, 30 Aug 1996 00:00:00 GMT
modified-by: Mike Blaszczak
robot-id:webquest
robot-name:WebQuest
robot-cover-url:
robot-details-url:
robot-owner-name:TaeYoung Choi
robot-owner-url:http://www.cosmocyber.co.kr:8080/~cty/index.html
robot-owner-email:cty@cosmonet.co.kr
robot-status:development
robot-purpose:indexing
robot-type:standalone
robot-platform:unix
robot-availability:none
robot-exclusion:yes
robot-exclusion-useragent:webquest
robot-noindex:no
robot-host:210.121.146.2, 210.113.104.1, 210.113.104.2
robot-from:yes
robot-useragent:WebQuest/1.0
robot-language:perl5
robot-description:WebQuest will be used to build the databases for various web
search service sites which will be in service by early 1998. Until the end of
Jan. 1998, WebQuest will run from time to time. Since then, it will run
daily (for a few hours and very slowly).
robot-history:The development of WebQuest was motivated by the need for a
customized robot in various projects of COSMO Information & Communication Co.,
Ltd. in Korea.
robot-environment:service
modified-date:Tue, 30 Dec 1997 09:27:20 GMT
modified-by:TaeYoung Choi
robot-id: webreader
robot-name: Digimarc MarcSpider
robot-cover-url: http://www.digimarc.com/prod_fam.html
robot-details-url: http://www.digimarc.com/prod_fam.html
robot-owner-name: Digimarc Corporation
robot-owner-url: http://www.digimarc.com
robot-owner-email: wmreader@digimarc.com
robot-status: active
robot-purpose: maintenance
robot-type: standalone
robot-platform: windowsNT
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent:
robot-noindex:
robot-host: 206.102.3.*
robot-from: yes
robot-useragent: Digimarc WebReader/1.2
robot-language: c++
robot-description: Examines image files for watermarks.
In order to not waste internet bandwidth with yet
another crawler, we have contracted with one of the major crawlers/search
engines to provide us with a list of specific URLs of interest to us. If an
URL is to an image, we may read the image, but we do not crawl to any other
URLs. If an URL is to a page of interest (usually due to CGI), then we
access the page to get the image URLs from it, but we do not crawl to any
other pages.
robot-history: First operation in August 1997.
robot-environment: service
modified-date: Mon, 20 Oct 1997 16:44:29 GMT
modified-by: Brian MacIntosh
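The policy described in the record above (read images directly, pull image URLs out of pages of interest, never crawl on to other pages) can be sketched as follows; fetch(), extract_image_urls() and inspect_image() are assumed helpers, and this is not Digimarc's actual implementation:

    IMAGE_SUFFIXES = (".gif", ".jpg", ".jpeg", ".png")

    def process(url_list, fetch, extract_image_urls, inspect_image):
        """Walk a pre-supplied URL list without ever following page-to-page links."""
        for url in url_list:
            if url.lower().endswith(IMAGE_SUFFIXES):
                inspect_image(fetch(url))            # e.g. check for a watermark
            else:
                page = fetch(url)                    # a page of interest
                for img in extract_image_urls(page):
                    inspect_image(fetch(img))
            # deliberately no recursion into other pages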
robot-id: webreaper
robot-name: WebReaper
robot-cover-url: http://www.otway.com/webreaper
robot-details-url:
robot-owner-name: Mark Otway
robot-owner-url: http://www.otway.com
robot-owner-email: webreaper@otway.com
robot-status: active
robot-purpose: indexing/offline browsing
robot-type: standalone
robot-platform: windows95, windowsNT
robot-availability: binary
robot-exclusion: yes
robot-exclusion-useragent: webreaper
robot-noindex: no
robot-host: *
robot-from: no
robot-useragent: WebReaper [webreaper@otway.com]
robot-language: c++
robot-description: Freeware app which downloads and saves sites locally for
offline browsing.
robot-history: Written for personal use, and then distributed to the public
as freeware.
robot-environment: hobby
modified-date: Thu, 25 Mar 1999 15:00:00 GMT
modified-by: Mark Otway
robot-id: webs
robot-name: webs
robot-cover-url: http://webdew.rnet.or.jp/
robot-details-url: http://webdew.rnet.or.jp/service/shank/NAVI/SEARCH/info2.html#robot
robot-owner-name: Recruit Co., Ltd.
robot-owner-url:
robot-owner-email: dew@wwwadmin.rnet.or.jp
robot-status: active
robot-purpose: statistics
robot-type: standalone
robot-platform: unix
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: webs
robot-noindex: no
robot-host: lemon.recruit.co.jp
robot-from: yes
robot-useragent: webs@recruit.co.jp
robot-language: perl5
robot-description: The webs robot is used to gather WWW servers' top pages
  last modified date data. Collected statistics reflect the priority of WWW
  server data collection for the webdew indexing service. Indexing in webdew
  is done manually.
robot-history:
robot-environment: service
modified-date: Fri, 6 Sep 1996 10:00:00 GMT
modified-by:
robot-id:
websnarf
robot-name:
Websnarf
robot-cover-url:
robot-details-url:
robot-owner-name:
Charlie Stross
robot-owner-url:
robot-owner-email: charles@fma.com
robot-status:
retired
robot-purpose:
robot-type:
robot-platform:
robot-availability:
robot-exclusion:
robot-exclusion-useragent:
robot-noindex:
no
robot-host:
robot-from:
robot-useragent:
robot-language:
robot-description:
robot-history:
robot-environment:
modified-date:
modified-by:
robot-id: webspider
robot-name: WebSpider
robot-details-url: http://www.csi.uottawa.ca/~u610468
robot-cover-url:
robot-owner-name: Nicolas Fraiji
robot-owner-email: u610468@csi.uottawa.ca
robot-status: active, under further enhancement.
robot-purpose: maintenance, link diagnostics
robot-type: standalone
robot-exclusion: yes
robot-noindex: no
robot-exclusion-useragent: webspider
robot-host: several
robot-from: Yes
robot-language: Perl4
robot-history: developed as a course project at the University of
Ottawa, Canada in 1996.
robot-environment: Educational use and Research
robot-id:
webvac
robot-name:
WebVac
robot-cover-url:
http://www.federated.com/~tim/webvac.html
robot-details-url:
robot-owner-name:
Tim Jensen
robot-owner-url:
http://www.federated.com/~tim
robot-owner-email: tim@federated.com
robot-status:
robot-purpose:
mirroring
robot-type:
standalone
robot-platform:
robot-availability:
robot-exclusion:
no
robot-exclusion-useragent:
robot-noindex:
robot-host:
robot-from:
no
robot-useragent:
webvac/1.0
robot-language:
C++
robot-description:
robot-history:
robot-environment:
modified-date:
Mon May 13 03:19:17 1996.
modified-by:
robot-id:
webwalk
robot-name:
webwalk
robot-cover-url:
robot-details-url:
robot-owner-name:
Rich Testardi
robot-owner-url:
robot-owner-email:
robot-status:
retired
robot-purpose:
indexing, maintenance, mirroring, statistics
robot-type:
standalone
robot-platform:
robot-availability:
robot-exclusion:
yes
robot-exclusion-useragent:
robot-noindex:
no
robot-host:
robot-from:
yes
robot-useragent:
webwalk
robot-language:
c
robot-description: Its purpose is to generate a Resource Discovery database,
validate links, validate HTML, perform mirroring, copy
document trees, and generate statistics. Webwalk is easily
extensible to perform virtually any maintenance function
which involves web traversal, in a way much like the '-exec'
option of the find(1) command. Webwalk is usually used
behind the HP firewall.
robot-history:
robot-environment:
modified-date:
Wed Nov 15 09:51:59 PST 1995
modified-by:
robot-id: webwalker
robot-name: WebWalker
robot-cover-url:
robot-details-url:
robot-owner-name: Fah-Chun Cheong
robot-owner-url: http://www.cs.berkeley.edu/~fccheong/
robot-owner-email: fccheong@cs.berkeley.edu
robot-status: active
robot-purpose: maintenance
robot-type: standalone
robot-platform: unix
robot-availability: source
robot-exclusion: yes
robot-exclusion-useragent: WebWalker
robot-noindex: no
robot-host: *
robot-from: yes
robot-useragent: WebWalker/1.10
robot-language: perl4
robot-description: WebWalker performs WWW traversal for individual
sites and tests for the integrity of all hyperlinks
to external sites.
robot-history: A Web maintenance robot for expository purposes,
first published in the book "Internet Agents: Spiders,
Wanderers, Brokers, and Bots" by the robot's author.
robot-environment: hobby
modified-date: Thu, 25 Jul 1996 16:00:52 PDT
modified-by: Fah-Chun Cheong
robot-id:
webwatch
robot-name:
WebWatch
robot-cover-url:
http://www.specter.com/users/janos/specter
robot-details-url:
robot-owner-name:
Joseph Janos
robot-owner-url:
http://www.specter.com/users/janos/specter
robot-owner-email: janos@specter.com
robot-status:
robot-purpose:
maintenance, statistics
robot-type:
standalone
robot-platform:
robot-availability:
robot-exclusion:
no
robot-exclusion-useragent:
robot-noindex:
no
robot-host:
robot-from:
no
robot-useragent:
WebWatch
robot-language:
c++
robot-description: Its purpose is to validate HTML, generate statistics, and
check URLs modified since a given date.
robot-history:
robot-environment:
modified-date:
Wed Jul 26 13:36:32 1995
modified-by:
robot-id: wget
robot-name: Wget
robot-cover-url: ftp://gnjilux.cc.fer.hr/pub/unix/util/wget/
robot-details-url:
robot-owner-name: Hrvoje Niksic
robot-owner-url:
robot-owner-email: hniksic@srce.hr
robot-status: development
robot-purpose: mirroring, maintenance
robot-type: standalone
robot-platform: unix
robot-availability: source
robot-exclusion: yes
robot-exclusion-useragent: wget
robot-noindex: no
robot-host: *
robot-from: yes
robot-useragent: Wget/1.4.0
robot-language: C
robot-description:
Wget is a utility for retrieving files using HTTP and FTP protocols.
It works non-interactively, and can retrieve HTML pages and FTP
trees recursively. It can be used for mirroring Web pages and FTP
sites, or for traversing the Web gathering data. It is run by the
end user or archive maintainer.
robot-history:
robot-environment: hobby, research
modified-date: Mon, 11 Nov 1996 06:00:44 MET
modified-by: Hrvoje Niksic
robot-id: whatuseek
robot-name: whatUseek Winona
robot-cover-url: http://www.whatUseek.com/
robot-details-url: http://www.whatUseek.com/
robot-owner-name: Neil Mansilla
robot-owner-url: http://www.whatUseek.com/
robot-owner-email: neil@whatUseek.com
robot-status: active
robot-purpose: Robot used for site-level search and meta-search engines.
robot-type: standalone
robot-platform: unix
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: winona
robot-noindex: yes
robot-host: *.whatuseek.com, *.aol2.com
robot-from: no
robot-useragent: whatUseek_winona/3.0
robot-language: c++
robot-description: The whatUseek robot, Winona, is used for site-level
search engines. It is also implemented in several meta-search engines.
robot-history: Winona was developed in November of 1996.
robot-environment: service
modified-date: Wed, 17 Jan 2001 11:52:00 EST
modified-by: Neil Mansilla
robot-id: whowhere
robot-name: WhoWhere Robot
robot-cover-url: http://www.whowhere.com
robot-details-url:
robot-owner-name: Rupesh Kapoor
robot-owner-url:
robot-owner-email: rupesh@whowhere.com
robot-status: active
robot-purpose: indexing
robot-type: standalone
robot-platform: Sun Unix
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: whowhere
robot-noindex: no
robot-host: spica.whowhere.com
robot-from: no
robot-useragent:
robot-language: C/Perl
robot-description: Gathers data for email directory from web pages
robot-history:
robot-environment: commercial
modified-date:
modified-by:
robot-id:
wmir
robot-name:
w3mir
robot-cover-url:
http://www.ifi.uio.no/~janl/w3mir.html
robot-details-url:
robot-owner-name:
Nicolai Langfeldt
robot-owner-url:
http://www.ifi.uio.no/~janl/w3mir.html
robot-owner-email: w3mir-core@usit.uio.no
robot-status:
robot-purpose:
mirroring.
robot-type:
standalone
robot-platform:
UNIX, WindowsNT
robot-availability:
robot-exclusion:
no.
robot-exclusion-useragent:
robot-noindex:
robot-host:
robot-from:
yes
robot-useragent:
w3mir
robot-language:
Perl
robot-description: W3mir uses the If-Modified-Since HTTP header and recurses
only into the directory and subdirectories of its start
document. Known to work on U*ixes and Windows NT.
robot-history:
robot-environment:
modified-date:
Wed Apr 24 13:23:42 1996.
modified-by:
robot-id: wolp
robot-name: WebStolperer
robot-cover-url: http://www.suchfibel.de/maschinisten
robot-details-url: http://www.suchfibel.de/maschinisten/text/werkzeuge.htm (in German)
robot-owner-name: Marius Dahler
robot-owner-url: http://www.suchfibel.de/maschinisten
robot-owner-email: mda@suchfibel.de
robot-status: active
robot-purpose: indexing
robot-type: standalone
robot-platform: unix, NT
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: WOLP
robot-noindex: yes
robot-host: www.suchfibel.de
robot-from: yes
robot-useragent: WOLP/1.0 mda/1.0
robot-language: perl5
robot-description: The robot gathers information about specified
web projects and generates knowledge bases in JavaScript or its own
format.
robot-environment: hobby
modified-date: 22 Jul 1998
modified-by: Marius Dahler
robot-id:
wombat
robot-name:
The Web Wombat
robot-cover-url:
http://www.intercom.com.au/wombat/
robot-details-url:
robot-owner-name:
Internet Communications
robot-owner-url:
http://www.intercom.com.au/
robot-owner-email: phill@intercom.com.au
robot-status:
robot-purpose:
indexing, statistics.
robot-type:
robot-platform:
robot-availability:
robot-exclusion:
no.
robot-exclusion-useragent:
robot-noindex:
robot-host:
qwerty.intercom.com.au
robot-from:
no
robot-useragent:
no
robot-language:
IBM Rexx/VisualAge C++ under OS/2.
robot-description: The robot is the basis of the Web Wombat search engine
(Australian/New Zealand content ONLY).
robot-history:
robot-environment:
modified-date:
Thu Feb 29 00:39:49 1996.
modified-by:
robot-id:
worm
robot-name:
The World Wide Web Worm
robot-cover-url:
http://www.cs.colorado.edu/home/mcbryan/WWWW.html
robot-details-url:
robot-owner-name:
Oliver McBryan
robot-owner-url:
http://www.cs.colorado.edu/home/mcbryan/Home.html
robot-owner-email: mcbryan@piper.cs.colorado.edu
robot-status:
robot-purpose:
indexing
robot-type:
robot-platform:
robot-availability:
robot-exclusion:
robot-exclusion-useragent:
robot-noindex:
no
robot-host:
piper.cs.colorado.edu
robot-from:
robot-useragent:
robot-language:
robot-description: An indexing robot; it actually has quite flexible search
options.
robot-history:
robot-environment:
modified-date:
modified-by:
robot-id: wwwc
robot-name: WWWC Ver 0.2.5
robot-cover-url: http://www.kinet.or.jp/naka/tomo/wwwc.html
robot-details-url:
robot-owner-name: Tomoaki Nakashima.
robot-owner-url: http://www.kinet.or.jp/naka/tomo/
robot-owner-email: naka@kinet.or.jp
robot-status: active
robot-purpose: maintenance
robot-type: standalone
robot-platform: windows, windows95, windowsNT
robot-availability: binary
robot-exclusion: yes
robot-exclusion-useragent: WWWC
robot-noindex: no
robot-host:
robot-from: yes
robot-useragent: WWWC/0.25 (Win95)
robot-language: c
robot-description:
robot-history: 1997
robot-environment: hobby
modified-date: Tuesday, 18 Feb 1997 06:02:47 GMT
modified-by: Tomoaki Nakashima (naka@kinet.or.jp)
robot-id: wz101
robot-name: WebZinger
robot-details-url: http://www.imaginon.com/wzindex.html
robot-cover-url: http://www.imaginon.com
robot-owner-name: ImaginOn, Inc
robot-owner-url: http://www.imaginon.com
robot-owner-email: info@imaginon.com
robot-status: active
robot-purpose: indexing
robot-type: standalone
robot-platform: windows95, windowsNT 4, mac, solaris, unix
robot-availability: binary
robot-exclusion: no
robot-exclusion-useragent: none
robot-noindex: no
robot-host: http://www.imaginon.com/wzindex.html *
robot-from: no
robot-useragent: none
robot-language: java
robot-description: commercial Web bot that accepts plain-text queries, uses
WebCrawler, Lycos or Excite to get URLs, then visits the sites. If the user's
filter parameters are met, it downloads one picture and a paragraph of text,
then plays back a slide show of one text paragraph plus an image from each site.
robot-history: developed by ImaginOn in 1996 and 1997
robot-environment: commercial
modified-date: Wed, 11 Sep 1997 02:00:00 GMT
modified-by: schwartz@imaginon.com
robot-id: xget
robot-name: XGET
robot-cover-url: http://www2.117.ne.jp/~moremore/x68000/soft/soft.html
robot-details-url: http://www2.117.ne.jp/~moremore/x68000/soft/soft.html
robot-owner-name: Hiroyuki Shigenaga
robot-owner-url: http://www2.117.ne.jp/~moremore/
robot-owner-email: shige@mh1.117.ne.jp
robot-status: active
robot-purpose: mirroring
robot-type: standalone
robot-platform: X68000, X68030
robot-availability: binary
robot-exclusion: yes
robot-exclusion-useragent: XGET
robot-noindex: no
robot-host: *
robot-from: yes
robot-useragent: XGET/0.7
robot-language: c
robot-description: Its purpose is to retrieve updated files. It is run by the end user.
robot-history: 1997
robot-environment: hobby
modified-date: Fri, 07 May 1998 17:00:00 GMT
modified-by: Hiroyuki Shigenaga
robot-id: Nederland.zoek
robot-name: Nederland.zoek
robot-cover-url: http://www.nederland.net/
robot-details-url:
robot-owner-name: System Operator Nederland.net
robot-owner-url:
robot-owner-email: zoek@nederland.net
robot-status: active
robot-purpose: indexing
robot-type: standalone
robot-platform: unix (Linux)
robot-availability: none
robot-exclusion: yes
robot-exclusion-useragent: Nederland.zoek
robot-noindex: no
robot-host: 193.67.110.*
robot-from: yes
robot-useragent: Nederland.zoek
robot-language: c
robot-description: This robot indexes all .nl sites for the search-engine of
Nederland.net
robot-history: Developed at Computel Standby in Apeldoorn, The Netherlands
robot-environment: service
modified-date: Sat, 8 Feb 1997 01:10:00 CET
modified-by: Sander Steffann <sander@nederland.net>
http://info.webcrawler.com/mak/projects/robots/active/all.txt (107 of 107) [18.02.2001 13:17:48]
http://info.webcrawler.com/mak/projects/robots/active/empty.txt
robot-id:
robot-name:
robot-cover-url:
robot-details-url:
robot-owner-name:
robot-owner-url:
robot-owner-email:
robot-status:
robot-purpose:
robot-type:
robot-platform:
robot-availability:
robot-exclusion:
robot-exclusion-useragent:
robot-noindex:
robot-host:
robot-from:
robot-useragent:
robot-language:
robot-description:
robot-history:
robot-environment:
modified-date:
modified-by:
http://info.webcrawler.com/mak/projects/robots/active/empty.txt [18.02.2001 13:17:50]
http://info.webcrawler.com/mak/projects/robots/active/schema.txt
Database Format
---------------

Records
-------

Records are formatted like RFC 822 messages.
Unless specified, values may not contain HTML, or empty lines,
but may contain 8-bit values.
Where a value contains "one or more" tokens, they are
to be separated by a comma followed by a space.
Fields can be repeated and grouped by appending number 2 and up,
for example:
robot-owner-name1: Mr A. RobotAuthor
robot-owner-url1: http://webrobot.com/~a/a.html
robot-owner-name2: Mr B. RobotCoAuthor
robot-owner-url2: http://webrobot.com/~b/b.html
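
For illustration only (this sketch is not part of the original database or
its tools), records in the format above can be parsed with a few lines of
Python; the helper names below are hypothetical:

import re

# Minimal sketch, assuming each record begins at its "robot-id" field and that
# any line which does not start a new field continues the previous field's value.
FIELD_RE = re.compile(r'^([a-z-]+?)(\d*):\s*(.*)$')

def parse_records(text):
    records, current, last_key = [], None, None
    for line in text.splitlines():
        m = FIELD_RE.match(line)
        if m:
            name, index, value = m.groups()
            if name == 'robot-id':              # a new record starts here
                current = {}
                records.append(current)
            if current is None:
                continue                        # ignore text before the first record
            last_key = name + index             # keeps e.g. robot-owner-name2 distinct
            current[last_key] = value.strip()
        elif current is not None and last_key is not None:
            # continuation line: append it to the previous field's value
            current[last_key] = (current[last_key] + ' ' + line.strip()).strip()
    return records

def tokens(value):
    # "one or more" values are separated by a comma followed by a space
    return [t for t in value.split(', ') if t]

# Hypothetical usage:
#   robots = parse_records(open('all.txt').read())
#   print(tokens(robots[0].get('robot-purpose', '')))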
Fields Schema
-------------

robot-id:
Short name for the robot,
used internally as a unique reference.
Should use [a-z-_]+
Example: webcrawler
robot-name:
Full name of the robot,
for presentation purposes.
Example: WebCrawler
robot-details-url:
URL of the robot home page,
containing further technical details on the robot,
background information etc.
Example: http://webcrawler.com/WebCrawler/Facts/HowItWorks.html
robot-cover-url:
URL of the robot product,
containing marketing details about either the robot,
or the service to which the robot is related.
Example: http://webcrawler.com/
robot-owner-name:
Name of the owner. For service robots this is the person
running the robot, who can be contacted in case of specific
problems.
In the case of robot products this is the person
maintaining the product, who can be contacted if the
robot has bugs.
Example: Brian Pinkerton
robot-owner-url:
Home page of the robot-owner-name
Example: http://info.webcrawler.com/bp/bp.html
robot-owner-email:
Email address of owner
Example: np@webcrawler.com
robot-status:
Deployment status of the robot. One of:
- development: robot under development
- active: robot actively in use
- retired: robot no longer used
robot-purpose:
Purpose of the robot. One or more of:
- indexing: gather content for an indexing service
- maintenance: link validation, html validation etc.
- statistics: used to gather statistics
Further details can be given in the description
robot-type:
Type of robot software. One or more of:
- standalone: a separate program
- browser: built into a browser
- plugin: a plugin for a browser
robot-platform:
Platform robot runs on. One or more of:
- unix
- windows, windows95, windowsNT
- os2
- mac
etc.
robot-availability:
Availability of robot to general public. One or more of:
- source: source code available
- binary: binary form available
- data: bulk data gathered by robot available
- none
Details on robot-url or robot-cover-url.
robot-exclusion:
Standard for Robots Exclusion supported.
yes or no
robot-exclusion-useragent:
Substring to use in /robots.txt (a short matching sketch follows this field list).
Example: webcrawler
robot-noindex:
<meta name="robots" content="noindex"> directive supported:
yes or no
robot-nofollow:
<meta name="robots" content="nofollow"> directive supported:
yes or no
robot-host:
Host the robot is run from. Can be a pattern of DNS and/or IP.
If the robot is available to the general public, add '*'
Example: spidey.webcrawler.com, *.webcrawler.com, 192.216.46.*
robot-from:
The HTTP From field as defined in RFC 1945 can be set.
yes or no
robot-useragent:
The HTTP User-Agent field as defined in RFC 1945
Example: WebCrawler/1.0 libwww/4.0
robot-language:
Languages the robot is written in. One or more of:
c,c++,perl,perl4,perl5,java,tcl,python, etc.
robot-description:
Text description of the robot's functions.
More details should go on robot-url.
Example: The WebCrawler robot is used to build the database
for the WebCrawler search service operated by GNN
(part of AOL).
The robot runs weekly, and visits sites in a random order.
robot-history:
Text description of the origins of the robot.
Example: This robot finds its roots in a research project
at the University of Washington in 1994.
robot-environment:
The environment the robot operates in. One or more of:
- service: builds a commercial service
- commercial: is a commercial product
- research: used for research
- hobby: written as a hobby
modified-date:
The date this record was last modified. Format as in HTTP
Example: Fri, 21 Jun 1996 17:28:52 GMT
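
As an illustration of how a robot might act on the robot-exclusion,
robot-exclusion-useragent and robot-noindex fields above, here is a minimal
Python sketch (not part of this schema). It deliberately simplifies the
exclusion protocol: each User-agent line is treated independently, and the
meta tag is matched with a naive pattern that assumes the name attribute
precedes content.

import re

def disallowed_prefixes(robots_txt, exclusion_useragent):
    # Collect Disallow path prefixes that apply to this robot (or to '*').
    prefixes, applies = [], False
    for line in robots_txt.splitlines():
        line = line.split('#', 1)[0].strip()    # drop comments
        if not line:
            continue
        field, _, value = line.partition(':')
        field, value = field.strip().lower(), value.strip()
        if field == 'user-agent':
            applies = (value == '*' or
                       exclusion_useragent.lower() in value.lower())
        elif field == 'disallow' and applies and value:
            prefixes.append(value)
    return prefixes

def may_fetch(path, prefixes):
    return not any(path.startswith(p) for p in prefixes)

# Naive check for <meta name="robots" content="... noindex ...">;
# it assumes the name attribute appears before the content attribute.
NOINDEX_RE = re.compile(
    r'<meta[^>]+name=["\']?robots["\']?[^>]*content=["\']?[^"\'>]*noindex',
    re.IGNORECASE)

def may_index(html):
    return NOINDEX_RE.search(html) is None

# Hypothetical usage for a robot whose robot-exclusion-useragent is "webcrawler":
#   prefixes = disallowed_prefixes(robots_txt_text, "webcrawler")
#   if may_fetch("/private/page.html", prefixes):
#       ...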
http://info.webcrawler.com/mak/projects/robots/active/schema.txt (3 of 3) [18.02.2001 13:17:53]
Robots Mailing List Archive by thread
●
About this archive
●
Most recent messages
●
Messages sorted by: [ date ][ subject ][ author ]
●
Other mail archives
Starting: Wed 00 Jan 1970 - 16:31:48 PDT
Ending: Thu 18 Dec 1997 - 14:33:60 PDT
Messages: 2106
● Announcement Michael Göckel
●
RE: The Internet Archive robot Sigfrid Lundberg
●
The robots mailing list at WebCrawler Martijn Koster
●
Something that would be handy Tim Bray
●
Site Announcement James
❍
Re: Site Announcement Martijn Koster
●
How do I let spiders in? Roger Dearnaley
●
Unfriendly robot at 205.177.10.2 Nick Arnett
❍
Re: Unfriendly robot at 205.177.10.2 Tim Bray
❍
Re: Unfriendly robot at 205.177.10.2 Nick Arnett
❍
Re: Unfriendly robot at 205.177.10.2 Reinier Post
❍
Re: Unfriendly robot at 205.177.10.2 Reinier Post
❍
Re: Unfriendly robot at 205.177.10.2 Leigh DeForest Dupee
●
CORRECTION -- Re: Unfriendly robot Nick Arnett
●
Looking for a spider Alain Desilets
❍
Re: Looking for a spider Alvaro Monge
❍
Re: Looking for a spider Xiaodong Zhang
❍
Re: Looking for a spider Alvaro Monge
❍
Re: Looking for a spider Reinier Post
❍
Re: Looking for a spider Alain Desilets
❍
Re: Looking for a spider Alain Desilets
❍
Re: Looking for a spider Alain Desilets
❍
Re: Looking for a spider Marilyn R Wulfekuhler
❍
Re: Looking for a spider Alain Desilets
❍
Re: Looking for a spider Marilyn R Wulfekuhler
❍
Re: Looking for a spider Alain Desilets
❍
Re: Looking for a spider Gene Essman
❍
Re: Looking for a spider Nick Arnett
❍
Re: Looking for a spider Ted Sullivan
❍
Re: Looking for a spider drose@AZStarNet.com
❍
Re: Looking for a spider Ted Sullivan
❍
Re: Looking for a spider i.bromwich
●
Is it a robot or a link-updater? David Bakin
●
Unfriendly robot owner identified! Nick Arnett
❍
Re: Unfriendly robot owner identified! Andrew Leonard
●
Really fast searching Nick Arnett
●
Sorry! Alain Desilets
● Re: Unfriendly robot at 205.252.60.50 Nick Arnett
❍ Re: Unfriendly robot at 205.252.60.50 Andrew Leonard
● re: Lycos unfriendly robot Murray Bent
❍ Re: Lycos unfriendly robot Reinier Post
● Re: Unfriendly robot at 205.252.60.50 Nick Arnett
❍ Re: Unfriendly robot at 205.252.60.50 Kim Davies
● Re: Unfriendly robot at 205.252.60.50 Nick Arnett
● Re: Proposed URLs that robots should search Nick Arnett
❍ Re: Proposed URLs that robots should search Kim Davies
● Re: Proposed URLs that robots should search Martijn Koster
❍ Re: Proposed URLs that robots should search Andrew Daviel
● lycos patents Murray Bent
❍
Re: lycos patents Scott Stephenson
❍
Re: lycos patents Martijn Koster
❍
Re: lycos patents Matthew Gray
❍
Re: lycos patents Reinier Post
❍
Re: lycos patents Roger Dearnaley
●
re: Lycos patents Murray Bent
●
Patents? Martijn Koster
●
meta tag implementation Davide Musella
❍
Re: meta tag implementation Jeffrey C. Chen
❍
Simple load robot Jaakko Hyvatti
❍
Re: meta tag implementation Steve Nisbet
❍
Re: meta tag implementation Davide Musella
❍
Re: meta tag implementation Reinier Post
❍
●
Re: meta tag implementation Steve Nisbet
Preliminary robot.faq (Please Send Questions or Comments) Keith Fischer
❍
Re: Preliminary robot.faq (Please Send Questions or Comments) Tim Bray
❍
Re: Preliminary robot.faq (Please Send Questions or Comments) Keith Fischer
❍
Re: Preliminary robot.faq (Please Send Questions or Comments) Reinier Post
❍
Re: Preliminary robot.faq (Please Send Questions or Comments) YUWONO BUDI
❍
Bad robot: WebHopper bounch! Owner: peter@cartes.hut.fi Benjamin Franz
●
BOUNCE robots: Admin request
●
wwwbot.pl problem Andrew Daviel
❍
●
●
●
●
●
●
Re: wwwbot.pl problem Fred Douglis
yet another robot Paul Francis
❍
yet another robot, volume 2 David Eagles
❍
Re: yet another robot, volume 2 James
❍
Re: yet another robot, volume 2 James
Q: Cooperation of robots Byung-Gyu Chang
❍
Re: Q: Cooperation of robots David Eagles
❍
Re: Q: Cooperation of robots Nick Arnett
❍
Re: Q: Cooperation of robots Paul Francis
❍
Re: Q: Cooperation of robots Jaakko Hyvatti
Smart Agent help Michael Goldberg
❍
Re: Smart Agent help Paul Francis
❍
Re: Smart Agent help Jason_Murray_at_FCRD@cclink.tfn.com
harvest John D. Pritchard
❍
Re: harvest Michael Goldberg
❍
mortgages with: Re: harvest John D. Pritchard
How frequently should I check /robots.txt? Skip Montanaro
❍
Re: How frequently should I check /robots.txt? gil cosson
❍
Re: How frequently should I check /robots.txt? Martijn Koster
❍
Re: How frequently should I check /robots.txt? Martijn Koster
McKinley Spider hit us hard Christopher Penrose
❍
Re: McKinley Spider hit us hard Michael Van Biesbrouck
●
Mail failure Adminstrator
●
Mail failure Adminstrator
●
Mail failure Adminstrator
●
Mail failure Adminstrator
●
Small robot needed Karoly Negyesi
●
New robot turned loose on an unsuspecting public... and a DNS question Skip Montanaro
●
Re: New robot turned loose on an unsuspecting public... and a DNS question
Thomas Maslen
inquiry about robots Cristian Ionitoiu
●
MacPower Lance Ogletree
❍
❍
Re: MacPower Jaakko Hyvatti
❍
Re: MacPower (an apology, I am very sorry) Jaakko Hyvatti
●
Re: Returned mail: Service unavailableHELP HELP! Julian Gorodsky
●
Re: Returned mail: Service unavailableHELP AGAIN HELP AGAIN! Julian Gorodsky
●
Indexing two-byte text Harry Munir Behrens
❍
Re: Indexing two-byte text John D. Pritchard
●
Indexing two-byte text Harry Munir Behrens
●
Either a spider or a hacker? ww2.allcon.com Randall Hill
●
Indexing two-byte text Mark Schrimsher
❍
Re: Indexing two-byte text Paul Francis
❍
Re: Indexing two-byte text Mark Schrimsher
❍
Re: Indexing two-byte text Paul Francis
❍
Re: Indexing two-byte text Frank Smadja
❍
Re: Indexing two-byte text Paul Francis
❍
Re: Indexing two-byte text Mark Schrimsher
❍
Re: Indexing two-byte text Mark Schrimsher
❍
Re: Indexing two-byte text Paul Francis
❍
Re: Indexing two-byte text Mark Schrimsher
●
RE: Indexing two-byte text smadja@netvision.net.il
●
RE: Indexing two-byte text Mark Schrimsher
❍
RE: Indexing two-byte text Noboru Iwayama
●
Freely available robot code in C available? ecarp@tssun5.dsccc.com
●
Freely available robot code in C available? Kevin Hoogheem
●
❍
Re: Freely available robot code in C available? Mark Schrimsher
❍
Re: Freely available robot code in C available? Nick Arnett
❍
Re: Freely available robot code in C available? Edwin Carp
Harvest question Jim Meritt
❍
Re: Harvest question Mark Schrimsher
●
Announcement and Help Requested Simon.Stobart
●
Announcement and Help Requested Simon.Stobart
❍
Re: Announcement and Help Requested Martijn Koster
❍
Re: Announcement and Help Requested Jeremy.Ellman
❍
Re: Announcement and Help Requested Simon.Stobart
❍
Re: Announcement and Help Requested Jeremy.Ellman
●
Re[2]: Harvest question Jim Meritt
●
Robot on the Rampage mschrimsher@twics.com
❍
Re: Robot on the Rampage Susumu Shimizu
❍
Re: Robot on the Rampage Reinier Post
❍
Checking Log files Cees Hek
❍
Re: Checking Log files Kevin Hoogheem
❍
Re: Checking Log files Mark Schrimsher
❍
Re: Checking Log files Cees Hek
●
[1]RE>Checking Log files Roger Dearnaley
●
[2]RE>Checking Log files Roger Dearnaley
●
[3]RE>Checking Log files Roger Dearnaley
●
[4]RE>Checking Log files Roger Dearnaley
●
[5]RE>Checking Log files Roger Dearnaley
●
[5]RE>Checking Log files Mark Schrimsher
❍
Re: [5]RE>Checking Log files M.Levy@cs.ucl.ac.uk
●
[1]RE>[5]RE>Checking Log fi Roger Dearnaley
●
[2]RE>[5]RE>Checking Log fi Roger Dearnaley
●
❍
Re: [2]RE>[5]RE>Checking Log fi Micah A. Williams
❍
Re: [2]RE>[5]RE>Checking Log fi Skip Montanaro
❍
Re: [2]RE>[5]RE>Checking Log fi Bjorn-Olav Strand
❍
Re: [2]RE>[5]RE>Checking Log fi Gordon Bainbridge
❍
Re: [2]RE>[5]RE>Checking Log fi Micah A. Williams
❍
Re: [2]RE>[5]RE>Checking Log fi ecarp@tssun5.dsccc.com
Wobot? Byung-Gyu Chang
❍
Re: Wobot? Nick Arnett
●
Announcing NaecSpyr, a new. . . robot? Mordechai T. Abzug
●
[3]RE>[5]RE>Checking Log fi Roger Dearnaley
❍
Re: [3]RE>[5]RE>Checking Log fi Micah A. Williams
●
[1]Wobot? Roger Dearnaley
●
[1]Announcing NaecSpyr, a n Roger Dearnaley
●
[2]Announcing NaecSpyr, a n Roger Dearnaley
●
[2]Wobot? Roger Dearnaley
❍
Contact for Intouchgroup.com Vince Taluskie
●
[4]RE>[5]RE>Checking Log fi Roger Dearnaley
●
[3]Wobot? Roger Dearnaley
●
[3]Announcing NaecSpyr, a n Roger Dearnaley
●
[1]RE>[3]RE>[5]RE>Checking Roger Dearnaley
●
[5]RE>[5]RE>Checking Log fi Roger Dearnaley
●
[1]RE>[2]RE>[5]RE>Checking Roger Dearnaley
●
[4]Wobot? Roger Dearnaley
●
[4]Announcing NaecSpyr, a n Roger Dearnaley
●
[2]RE>[3]RE>[5]RE>Checking Roger Dearnaley
●
Dearnaley Auto Reply Cannon? Micah A. Williams
●
Dearnaley Auto Reply Cannon? Kevin Hoogheem
●
[2]RE>[2]RE>[5]RE>Checking Roger Dearnaley
●
[1]Contact for Intouchgroup Roger Dearnaley
●
Re: [2]RE>[5]RE>Checking Lo Saul Jacobs
●
[5]Wobot? Roger Dearnaley
●
Re: [2]RE>[5]RE>Checking Lo Bonnie Scott
●
Vacation wars Martijn Koster
❍
Re: Vacation wars Nick Arnett
●
Re: [2]RE>[5]RE>Checking Lo David Henderson
●
New Robot??? David Henderson
●
test; please ignore Martijn Koster
❍
Re: test; please ignore Mark Schrimsher
❍
Re: test; please ignore David Henderson
●
Unfriendly Lycos , again ... Murray Bent
●
Inter-robot Comms Port David Eagles
●
❍
Re: Inter-robot Comms Port John D. Pritchard
❍
Re: Inter-robot Comms Port Super-User
❍
Re: Inter-robot Comms Port Carlos Baquero
Re: Unfriendly Lycos , again ... Steven L. Baur
❍
●
Re: Unfriendly Lycos , again ... John_R_R_Leavitt@NL.CS.CMU.EDU
Inter-robot Communications - Part II David Eagles
❍
Re: Inter-robot Communications - Part II Martijn Koster
❍
Re: Inter-robot Communications - Part II Mordechai T. Abzug
●
unknown robot (no name)
❍
Re: unknown robot Luiz Fernando
❍
Re: unknown robot
❍
Re: unknown robot John Lindroth
●
RE: Inter-robot Communications - Part II David Eagles
●
RE: Inter-robot Communications - Part II Martijn Koster
❍ Re: Inter-robot Communications - Part II John D. Pritchard
● please add my site gil cosson
❍ Re: please add my site Martijn Koster
● Please Help ME!! Dong-Hyun Kim
❍ Re: Please Help ME!! Byung-Gyu Chang
● Infinite e-mail loop Roger Dearnaley
❍
Infinite e-mail loop Skip Montanaro
●
Up to date list of Robots James
●
Web Robot Matthew Gray
●
Re: Web Robots James
●
Re: Web Robots Jakob Faarvang
●
Does this count as a robot? Thomas Stets
❍
Re: Does this count as a robot? Jeremy.Ellman
❍
Re: Does this count as a robot? Benjamin Franz
❍
Re: Does this count as a robot? Benjamin Franz
❍
Re: Does this count as a robot? YUWONO BUDI
❍
Re: Does this count as a robot? M.Levy@cs.ucl.ac.uk
❍
Re: Does this count as a robot? YUWONO BUDI
❍
Recursing heuristics (Re: Does this..) Jaakko Hyvatti
❍
avoiding infinite regress for robots Reinier Post
●
RE: avoiding infinite regress for robots David Eagles
●
Recursion David Eagles
❍
Re: Recursion mabzug1@gl.umbc.edu
●
Duplicate docs (was avoiding infinite regress...) Nick Arnett
●
RE: Recursion David Eagles
●
MD5 in HTTP headers - where? Skip Montanaro
❍
●
Re: MD5 in HTTP headers - where? Mordechai T. Abzug
robots.txt extensions Adam Jack
❍
Re: robots.txt extensions Martijn Koster
●
❍
Re: robots.txt extensions Adam Jack
❍
Re: robots.txt extensions Jaakko Hyvatti
❍
Re: robots.txt extensions Adam Jack
❍
Re: robots.txt extensions Martin Kiff
Does anyone else consider this irresponsible? Robert Raisch, The Internet Company
❍
Re: Does anyone else consider this irresponsible? Stan Norton
❍
Re: Does anyone else consider this irresponsible?
❍
Re: Does anyone else consider this irresponsible? Mark Norman
❍
Re: Does anyone else consider this irresponsible? Eric Hollander
❍
Re: Does anyone else consider this irresponsible? Super-User
❍
Re: Does anyone else consider this irresponsible? Robert Raisch, The Internet
Company
Responsible behavior, Robots vs. humans, URL botany... Skip Montanaro
❍
❍
Re: Does anyone else consider this irresponsible? Robert Raisch, The Internet
Company
Re: Does anyone else consider this irresponsible? Ed Carp @ TSSUN5
❍
Re: Does anyone else consider this irresponsible?
❍
●
FAQ again. Martijn Koster
●
Robots / source availability? Martijn Koster
❍
●
●
Re: Robots / source availability? Erik Selberg
Re: Does anyone else consider... Mark Norman
❍
Re: Does anyone else consider...
❍
Re: Does anyone else consider... Skip Montanaro
Re: Does anyone else consider... Mark Schrimsher
❍
Re: Does anyone else consider...
●
(no subject) Alison Gwin
●
Robots not Frames savy David Henderson
●
Re: Spam Software Sought Mark Schrimsher
●
Re: Does anyone else consider... smadja@netvision.net.il
●
Horror story
❍
Re: Horror story Skip Montanaro
❍
Re: Horror story Jaakko Hyvatti
❍
Re: Horror story Ted Sullivan
❍
Re: Horror story Brian Pinkerton
❍
Re: Horror story Murray Bent
❍
Re: Horror story Mordechai T. Abzug
❍
Re: Horror story Steve Nisbet
❍
Re: Horror story Steve Nisbet
●
Gopher Protocol Question Hal Belisle
●
New Robot Announcement Larry Burke
●
❍
Re: New Robot Announcement Mordechai T. Abzug
❍
Re: New Robot Announcement Larry Burke
❍
Re: New Robot Announcement Jeremy.Ellman
❍
Re: New Robot Announcement Ed Carp @ TSSUN5
❍
Re: New Robot Announcement David Levine
❍
Re: New Robot Announcement Larry Burke
❍
Re: New Robot Announcement Jakob Faarvang
❍
Re: New Robot Announcement John Lindroth
❍
Re: New Robot Announcement Ed Carp @ TSSUN5
❍
Re: New Robot Announcement Kevin Hoogheem
Re: robots.txt extensions Steven L Baur
❍
●
●
Re: robots.txt extensions Skip Montanaro
Alta Vista searches WHAT?!? Ed Carp @ TSSUN5
❍
Re: Alta Vista searches WHAT?!? Martijn Koster
❍
Re: Alta Vista searches WHAT?!?
❍
Re: Alta Vista searches WHAT?!? Adam Jack
❍
Re: Alta Vista searches WHAT?!? Tronche Ch. le pitre
❍
Re: Alta Vista searches WHAT?!? Mark Schrimsher
❍
Re: Alta Vista searches WHAT?!? Wayne Lamb
❍
Re: Alta Vista searches WHAT?!? Reinier Post
❍
Re: Alta Vista searches WHAT?!? Wayne Lamb
❍
Re: Alta Vista searches WHAT?!? Erik Selberg
❍
Re: Alta Vista searches WHAT?!? Edward Stangler
❍
Re: Alta Vista searches WHAT?!? Erik Selberg
BOUNCE robots: Admin request Martijn Koster
❍
Re: BOUNCE robots: Admin request Nick Arnett
❍
Re: BOUNCE robots: Admin request Jim Meritt
●
RE: Alta Vista searches WHAT?!? Ted Sullivan
●
robots.txt , authors of robots , webmasters .... savron@world-net.sct.fr
❍
Re: robots.txt , authors of robots , webmasters .... Reinier Post
❍
Re: robots.txt , authors of robots , webmasters ....OMOMOM[D Wayne Lamb
●
❍
Re: robots.txt , authors of robots , webmasters ....OM Wayne Lamb
❍
Re: robots.txt , authors of robots , webmasters .... Benjamin Franz
❍
Re: robots.txt , authors of robots , webmasters .... Wayne Lamb
❍
❍
Re: robots.txt , authors of robots , webmasters .... Robert Raisch, The Internet
Company
Re: robots.txt , authors of robots , webmasters .... Benjamin Franz
❍
Re: robots.txt , authors of robots , webmasters .... Carlos Baquero
❍
Re: robots.txt , authors of robots , webmasters .... Reinier Post
❍
Re: robots.txt , authors of robots , webmasters .... Kevin Hoogheem
❍
Re: robots.txt , authors of robots , webmasters .... Ed Carp @ TSSUN5
❍
Re: robots.txt , authors of robots , webmasters .... Adam Jack
❍
Re: robots.txt , authors of robots , webmasters .... Nick Arnett
Web robots and gopher space -- two separate worlds savron@world-net.sct.fr
❍
Re: Web robots and gopher space -- two separate worlds Wayne Lamb
●
[ANNOUNCE] CFP: AAAI-96 WS on Internet-based Information Systems Alexander
Franz
Robot Research Bhupinder S. Sran
●
Re: Re: robots.txt , authors of robots , webmasters .... Larry Burke
●
re: privacy, courtesy, protection John Lammers
●
Server name in /robots.txt Martin Kiff
●
●
❍
Re: Server name in /robots.txt Tim Bray
❍
Re: Server name in /robots.txt Christopher Penrose
❍
Re: Server name in /robots.txt Martijn Koster
❍
Re: Server name in /robots.txt Reinier Post
❍
Re: Server name in /robots.txt Martin Kiff
❍
❍
Canonical Names for documents (was Re: Server name in /robots.txt) Michael De La
Rue
Re: Server name in /robots.txt Martijn Koster
❍
Re: Server name in /robots.txt mannina@crrm.univ-mrs.fr
Polite Request #2 to be Removed form List AJAJR@aol.com
❍
Re: Polite Request #2 to be Removed form List blea@hic.net
●
un-subcribe AJAJR@aol.com
●
RE: Server name in /robots.txt David Eagles
●
RE: Server name in /robots.txt Tim Bray
❍
●
Re: Server name in /robots.txt Mordechai T. Abzug
Who sets standards (was Server name in /robots.txt) Nick Arnett
❍
●
Re: Who sets standards (was Server name in /robots.txt) Tim Bray
HEAD request [was Re: Server name in /robots.txt] Davide Musella
❍
Re: HEAD request [was Re: Server name in /robots.txt] Martijn Koster
❍
Re: HEAD request [was Re: Server name in /robots.txt] Davide Musella
❍
Re: HEAD request [was Re: Server name in /robots.txt] Renato Mario Rossello
●
Activity from 205.252.60.5[0-8] Martin Kiff
●
test. please ignore. Mark Norman
●
Any info on "E-mail America"? Bonnie Scott
●
www.pl? The YakkoWakko. Webmaster
●
New URL's from Equity Int'll Webcenter Mark Krell
●
Requesting info on database engines Renato Mario Rossello
●
News Clipper for newsgroups - Windows Richard Glenner
❍
●
●
●
Re: News Clipper for newsgroups - Windows Nick Arnett
Wanted: Web Robot code - C/Perl Charlie Brown
❍
Re: Wanted: Web Robot code - C/Perl Patrick 'Zapzap' Lin
❍
Re: Wanted: Web Robot code - C/Perl Keith Fischer
❍
Re: Wanted: Web Robot code - C/Perl Kevin Hoogheem
Perl Spiders Christopher Penrose
❍
Re: Perl Spiders dino
❍
Re: Perl Spiders Charlie Brown
❍
Re: Perl Spiders Christopher Penrose
❍
Re:Re: Perl Spiders mannina@crrm.univ-mrs.fr
Here is WebWalker Christopher Penrose
❍
Re: Here is WebWalker dino
●
The "Robot and Search Engine FAQ" Keith D. Fischer
●
algorithms Kenneth DeMarse
❍
●
The Robot And Search Engine FAQ Keith D. Fischer
❍
●
●
Re: algorithms too Jose Raul Vaquero Pulido
Re: The Robot And Search Engine FAQ Erik Selberg
Money Spider WWW Robot for Windows John McGrath - Money Spider Ltd.
❍
Re: Money Spider WWW Robot for Windows Nick Arnett
❍
Re: Money Spider WWW Robot for Windows Simon.Stobart
robots.txt changes how often? Tangy Verdell
❍
Re: robots.txt changes how often? Darrin Chandler
❍
Re: robots.txt changes how often? Jaakko Hyvatti
●
❍
Re: robots.txt changes how often? Martin Kiff
❍
Re: robots.txt changes how often? Jeremy.Ellman
robots.txt Tangy Verdell
❍
Re: robots.txt Jaakko Hyvatti
●
Re: Commercial Robot Vendor Recoomendations Request Michael De La Rue
●
fdsf Davide Musella
●
Robots and search engines technical information. Hayssam Hasbini
❍
Re: Robots and search engines technical information. Jeremy.Ellman
❍
Re: Robots and search engines technical information. Tangy Verdell
❍
Re: Robots and search engines technical information. Erik Selberg
❍
Re: Robots and search engines technical information. joseph williams
●
Tutorial Proposal for WWW95 http
●
URL measurement studies? Darren R. Hardy
❍
Re: URL measurement studies? John D. Pritchard
●
about robots.txt content errors savron@world-net.sct.fr
●
Dot dot problem... Sean Parker
●
●
❍
Re: Dot dot problem... Reinier Post
❍
Re: Dot dot problem... Nick Arnett
Robot Databases Steve Livingston
❍
Re: Robot Databases Tronche Ch. le pitre
❍
Re: Robot Databases Ted Sullivan
❍
Re: Robot Databases
❍
Re: Robot Databases Skip Montanaro
❍
Re: Robot Databases Ted Sullivan
Anyone doing a Java-based robot yet? Nick Arnett
❍
Re: Anyone doing a Java-based robot yet? David A Weeks
❍
Anyone doing a Java-based robot yet? Pertti Kasanen
❍
Re: Anyone doing a Java-based robot yet? Adam Jack
❍
Re: Anyone doing a Java-based robot yet? John D. Pritchard
❍
Re: Anyone doing a Java-based robot yet? Nick Arnett
❍
Re: Anyone doing a Java-based robot yet? Adam Jack
❍
Re: Anyone doing a Java-based robot yet? John D. Pritchard
❍
Re: Anyone doing a Java-based robot yet? Adam Jack
❍
Re: Anyone doing a Java-based robot yet? Frank Smadja
❍
Re: Anyone doing a Java-based robot yet?6 Mr David A Weeks
●
RE: Robot Databases Ted Sullivan
●
Ingrid ready for prelim alpha testing.... Paul Francis
●
Robot for Sun David Schnardthorst
●
url locating Martijn De Boef
❍
Re: url locating Terry Smith
●
Re[2]: Anyone doing a Java-based robot yet? Tangy Verdell
●
Altavista indexing password files John Messerly
❍
Re: Altavista indexing password files
❍
Re: Altavista indexing password files savron@world-net.sct.fr
●
RE: Altavista indexing password files John Messerly
●
BSE-Slurp/0.6 Gordon V. Cormack
❍
Re: BSE-Slurp/0.6 Mordechai T. Abzug
❍
Re: BSE-Slurp/0.6 Mark Schrimsher
●
Can I retrieve image map files? Mark Norman
●
Robots available for Intranet applications Douglas Summersgill
❍
Re: Robots available for Intranet applications Sylvain Duclos
❍
Re: Robots available for Intranet applications Mark Slabinski
❍
Re: Robots available for Intranet applications Jared Williams
❍
Re: Robots available for Intranet applications Josef Pellizzari
❍
Re: Robots available for Intranet applications Jared Williams
❍
Re: Robots available for Intranet applications Nick Arnett
●
"What's new" in web pages is not necessarily reliable Mordechai T. Abzug
●
verify URL Jim Meritt
❍
Re: verify URL Vince Taluskie
❍
Re: verify URL Carlos Baquero
❍
Re: verify URL Tronche Ch. le pitre
❍
Re: verify URL Reinier Post
●
libww and robot source for Sequent Dynix/Ptx 4.1.3 (no name)
●
Re[2]: verify URL Jim Meritt
●
robot authentication parameters Sibylle Gonzales
❍
Re: robot authentication parameters Dan Gildor
❍
Re: robot authentication parameters Michael De La Rue
●
RE: verify URL Debbie Swanson
●
Re: Robots available for Intranet applications jon madison
❍
Re: Robots available for Intranet applications David Schnardthorst
●
How to...??? Francesco
❍
●
Re: How to...??? Jeremy.Ellman
image map traversal Mark Norman
❍
Re: image map traversal Benjamin Franz
❍
Re: image map traversal Cees Hek
●
ReHowto...??? Jared Williams
●
Info on authoring a Web Robot Jared Williams
●
●
●
❍
Re: Info on authoring a Web Robot Detlev Kalb
❍
Re: Info on authoring a Web Robot Keith D. Fischer
❍
RCPT: Re: Info on authoring a Web Robot Jeannine Washington
image map traversal frizzlefry@nucleus.atom.com
❍
Re: image map traversal Nick Arnett
❍
Re: image map traversal jon madison
Links Jared Williams
❍
Re: Links Martijn Koster
❍
Re: Links Jared Williams
❍
Re: Links Mordechai T. Abzug
❍
Re: Links (don't bother checking; I've done it for you) Michael De La Rue
❍
Re: Links (don't bother checking; I've done it for you) Chris Brown
❍
Re: Links (don't bother checking; I've done it for you) Jaakko Hyvatti
❍
Re: Links Jared Williams
Limiting robots to top-level page only (via robots.txt)? Chuck Doucette
❍
●
Re: Limiting robots to top-level page only (via robots.txt)? Jaakko Hyvatti
Image Maps Thomas Merlin
❍
Re: Image Maps
●
Request for Source code in C for Robots ACHAKS
●
robots that index comments Dan Gildor
❍
Re: robots that index comments murray bent
●
Re: UNSUBSCRIBE ROBOTS Pinano@aol.com
●
keywords in META-element Detlev Kalb
❍
Re: keywords in META-element Davide Musella
●
Announce: ActiveX Search (IFilter) spec/sample Lee Fisher
●
Re: Links This Site is about Robots Not Censorship Keith
❍
●
Re: Links This Site is about Robots Not Censorship Michael De La Rue
Re: Links (don't bother checking; I've done it for you) Darrin Chandler
❍
Re: Links (don't bother checking; I've done it for you) Benjamin Franz
●
Re: Links (don't bother checking; I've done it for you) David Henderson
●
Re: Links This Site is about Robots Not Censorship Rob Turk
●
Re: Links (don't bother checking; I've done it for you) Martin Kiff
●
Re: Links (don't bother checking; I've done it for you) Darrin Chandler
●
The Letter To End All Letters Jared Williams
●
Heuristics.... Martin Kiff
❍
●
Re: Heuristics.... Nick Arnett
unscribe christophe grandjacquet
❍
Re: unscribe jon madison
●
unsubscibe christophe grandjacquet
●
Admin: how to get off this list Martijn Koster
●
Search accuracy Nick Arnett
❍
Re: Search accuracy Benjamin Franz
❍
Re: Search accuracy Nick Arnett
❍
Re: Search accuracy Benjamin Franz
❍
Re: Search accuracy mred@neosoft.com
❍
Re: Search accuracy YUWONO BUDI
❍
Re: Search accuracy Judy Feder
❍
Re: Search accuracy Nick Arnett
❍
Re: Search accuracy John D. Pritchard
❍
Re: Search accuracy Daniel C Grigsby
❍
Re: Search accuracy David A Weeks
❍
Re: Search accuracy Nick Arnett
❍
Re: Search accuracy Robert Raisch, The Internet Company
❍
Re: Search accuracy Nick Arnett
❍
Re: Search accuracy Benjamin Franz
❍
Re: Search accuracy Ellen M Voorhees
❍
Re: Search accuracy Ted Sullivan
●
Clean up Bots... Andy Warner
●
Re: (Fwd) Re: Search accuracy Colin Goodier
●
VB and robot development Mitchell Elster
❍
Re: VB and robot development mred@neosoft.com
❍
Re: VB and robot development erc@dal1820.computek.net
❍
Re: VB and robot development Jakob Faarvang
●
❍
Re: VB and robot development Ian McKellar
❍
Re: VB and robot development Jakob Faarvang
❍
Re: VB and robot development Darrin Chandler
❍
Re: VB and robot development
Problem with your Index Mike Rodriguez
❍
●
●
●
Re: Problem with your Index Martijn Koster
Handling keyword repetitions Mr David A Weeks
❍
Re: Handling keyword repetitions Alan
❍
Re: Handling keyword repetitions Alan
word spam chris cobb
❍
Re: word spam arutgers
❍
Re: word spam Alan
❍
Re: word spam Trevor Jenkins
❍
Re: word spam Benjamin Franz
❍
Re: word spam Ken Wadland
❍
Re: word spam YUWONO BUDI
❍
Re: word spam Benjamin Franz
❍
Re: word spam Kevin Hoogheem
❍
Re: word spam Ken Wadland
❍
Re: word spam Reinier Post
❍
Re: word spam Alan
❍
Re: word spam terces1@postoffice.ptd.net
❍
Re: word spam
❍
Re: word spam Andrey A. Krasov
❍
Re: word spam Nick Arnett
http directory index request Mark Norman
❍
Re: http directory index request Mordechai T. Abzug
●
RE: http directory index request David Levine
●
Returned mail: Can't create output: Error 0 Mail Delivery Subsystem
●
Re: word spam
●
Web Robot Jared Williams
●
Robots in the client? Ricardo Eito Brun
❍
Re: Robots in the client? Paul De Bra
❍
Re: Robots in the client? Bonnie Scott
❍
Re: Robots in the client? Ricardo Eito Brun
❍
Re: Robots in the client? Michael Carnevali, Student, FHD
●
General Information Ricardo Eito Brun
●
Magic, Intelligence, and search engines Tim Bray
❍
●
●
Re: Magic, Intelligence, and search engines YUWONO BUDI
default documents Jakob Faarvang
❍
Re: default documents Darrin Chandler
❍
Re: default documents Harry Munir Behrens
❍
Re: default documents Micah A. Williams
❍
Re: default documents mred@neosoft.com
Mailing list Jared Williams
❍
Re: Mailing list Mordechai T. Abzug
❍
Re: Mailing list Jared Williams
❍
Re: Mailing list Mordechai T. Abzug
❍
Re: Mailing list Kevin Hoogheem
❍
Re: Mailing list Gordon Bainbridge
❍
Re: Mailing list Rob Turk
●
RE: Mailing List Mitchell Elster
●
About Mother of All Bulletin Boards Ricardo Eito Brun
❍
●
Re: About Mother of All Bulletin Boards Oliver A. McBryan
search engine Jose Raul Vaquero Pulido
❍
Re: search engine Scott W. Wood
❍
Re: search engine Rob Turk
❍
Re: search engine Jakob Faarvang
●
Quiz playing robots ? Andrey A. Krasov
●
Try robot... Jared Williams
❍
●
Re: Try robot... Andy Warner
(no subject) Jared Williams
❍
(no subject) Vince Taluskie
❍
(no subject) Michael De La Rue
●
"Good Times" hoax Nick Arnett
●
(no subject) Kevin Hoogheem
●
RE: "Good Times" hoax David Eagles
●
Re: Re: Bill Day
●
About integrated search engines Ricardo Eito Brun
❍
Re: About integrated search engines Brian Ulicny
❍
Re: About integrated search engines Keith D. Fischer
❍
Re: About integrated search engines Frank Smadja
●
[ MERCHANTS ] My Sincerest Apologies Jared Williams
●
Re: Apologies || communal bots Rob Turk
❍
Re: communal bots Bonnie Scott
❍
Re: communal bots John D. Pritchard
●
Re: communal bots Rob Turk
●
To: ???? Robot Mitchell Elster
❍
Re: To: ???? Robot Ken Wadland
❍
Re: To: ???? Robot John D. Pritchard
❍
Re: To: ???? Robot Mitchell Elster
❍
Re: To: ???? Robot John D. Pritchard
●
RE: To: ???? Robot chris cobb
●
Private Investigator Lists Scott W. Wood
●
[Fwd: Re: To: ???? Robot] Rob Turk
●
Re: To: ???? Robot] Terry Smith
●
Re: To: ???? Robot Mitchell Elster
●
Looking for a search engine Bhupinder S. Sran
●
❍
Re: Looking for a search engine Mark Schrimsher
❍
Re: Looking for a search engine Nick Arnett
Admin: List archive is back Martijn Koster
❍
●
Re: Admin: List archive is back Nick Arnett
topical search tool -- help?! Brian Fitzgerald
❍
Re: topical search tool -- help?! Brian Ulicny
❍
Re: topical search tool -- help?! Brian Ulicny
❍
Re: topical search tool -- help?! Paul Francis
❍
Re: topical search tool -- help?! Paul Francis
❍
Re: topical search tool -- help?! Paul Francis
❍
Re: topical search tool -- help?! Nick Arnett
❍
Re: topical search tool -- help?! Brian Ulicny
❍
Re: topical search tool -- help?! Brian Ulicny
❍
Re: topical search tool -- help?! Brian Ulicny
❍
Re: topical search tool -- help?! Nick Arnett
❍
Re: topical search tool -- help?! Robert Raisch, The Internet Company
❍
Re: topical search tool -- help?! Brian Ulicny
❍
Re: topical search tool -- help?! Paul Francis
❍
Re: topical search tool -- help?! Robert Raisch, The Internet Company
●
cc:Mail SMTPLINK Undeliverable Message
●
Indexing a set of URL's Fred Melssen
●
Political economy of distributed search (was topical search...) Nick Arnett
❍
Re: Political economy of distributed search (was topical search...) Erik Selberg
❍
Re: Political economy of distributed search (was topical search...) Jeremy.Ellman
❍
Re: Political economy of distributed search (was topical search...) Benjamin Franz
❍
Re: Political economy of distributed search (was topical search...) John D. Pritchard
●
Re: Political economy of distributed search (was topical search Robert Raisch, The Internet
Company
(OTP) RE: Political economy of distributed search (was topical search...) David Levine
●
Re: Political economy of distributed search (was topical Terry Smith
●
Re: (OTP) RE: Political economy of distributed search (was topical Kevin Hoogheem
●
Re: Political economy of distributed search (was topical Steve Jones
●
Meta-seach engines Harry Munir Behrens
●
●
●
●
❍
Re: Meta-seach engines Greg Fenton
❍
Re: Meta-seach engines Erik Selberg
any robots/search-engines which index links? Alex Chapman
❍
Re: any robots/search-engines which index links? Jakob Faarvang
❍
Re: any robots/search-engines which index links? Carlos Baquero
❍
Re: any robots/search-engines which index links? Benjamin Franz
❍
Re: any robots/search-engines which index links? Alex Chapman
❍
Re: any robots/search-engines which index links? Carlos Baquero
❍
Re: any robots/search-engines which index links? Carlos Baquero
alta vista and virtualvin.com chris cobb
❍
Re: alta vista and virtualvin.com Benjamin Franz
❍
Re: alta vista and virtualvin.com Larry Gilbert
❍
Re: alta vista and virtualvin.com Benjamin Franz
VB. page grabber... Marc's internet diving suit
❍
Re: VB. page grabber... Terry Smith
❍
Re: VB. page grabber... Ed Carp
❍
Re: VB. page grabber... Jakob Faarvang
●
Re: Lead Time Myles Olson
●
RE: VB. page grabber... Victor F Ribeiro
●
Re: Political economy of distributed search (was topical Brian Ulicny
●
RE: VB. page grabber... chris cobb
●
RE: any robots/search-engines which index links? Louis Monier
●
Re[2]: verify URL Jim Meritt
●
ANNOUNCE: Don Norman (Apple) LIVE! 15-May 5PM UK = noon EDT Marc Eisenstadt
●
Web spaces of strange topology. Where? Michael De La Rue
●
❍
Re: Web spaces of strange topology. Where? Michael De La Rue
❍
Re: Web spaces of strange topology. Where? J.E. Fritz
❍
Re: Web spaces of strange topology. Where? Benjamin Franz
❍
Re: Web spaces of strange topology. Where? John Lindroth
❍
Re: Web spaces of strange topology. Where? Brian Clark
❍
Re: Web spaces of strange topology. Where? mred@neosoft.com
Somebody is turning 23! Eric
❍
Re: Somebody is turning 23! drose@AZStarNet.com
❍
Re: Somebody is turning 23! John D. Pritchard
❍
Re: Somebody is turning 23! drose@AZStarNet.com
●
robots and cookies Ken Nakagama
●
Accept: Thomas Abrahamsson
●
❍
Re: Accept: Martijn Koster
❍
Re: Accept: mred@neosoft.com
Defenses against bad robots kathy@accessone.com
❍
Re: Defenses against bad robots Larry Gilbert
❍
Re: Defenses against bad robots Mordechai T. Abzug
❍
Re: Defenses against bad robots Benjamin Franz
❍
Re: Defenses against bad robots Martijn Koster
❍
Re: Defenses against bad robots mred@neosoft.com
❍
Re: Defenses against bad robots mred@neosoft.com
❍
Re: Defenses against bad robots Benjamin Franz
❍
Book about robots (was Re: Defenses against bad robots) Tronche Ch. le pitre
❍
Re: Book about robots (was Re: Defenses against bad robots) Eric Knight
❍
Re: Defenses against bad robots
❍
Re: Defenses against bad robots Jaakko Hyvatti
❍
Re: Defenses against bad robots Steve Jones
❍
Re: Defenses against bad robots John D. Pritchard
❍
Re: Defenses against bad robots mred@neosoft.com
❍
Re: Defenses against bad robots Robert Raisch, The Internet Company
❍
Re: Defenses against bad robots John D. Pritchard
❍
Re: Defenses against bad robots John D. Pritchard
❍
Re: Defenses against bad robots Robert Raisch, The Internet Company
●
Re: Image Maps jon madison
●
That wacky Wobot
●
??: reload problem Andrey A. Krasov
●
Robot-HTML Web Page? J.Y.K.
❍
Re: Robot-HTML Web Page? Paul Francis
❍
Re: Robot-HTML Web Page? Darrin Chandler
❍
Re: Robot-HTML Web Page? Paul Francis
❍
Re: Robot-HTML Web Page? Mordechai T. Abzug
❍
Re: Robot-HTML Web Page? Ricardo Eito Brun
❍
Re: Robot-HTML Web Page? Alex Chapman
❍
Re: Robot-HTML Web Page? Nick Arnett
●
Robot Exclusion Standard Revisited Charles P. Kollar
●
McKinley robot Rafhael Cedeno
●
McKinley robot Dave Rothwell
●
RE: alta vista and virtualvin.com Louis Monier
●
RE: alta vista and virtualvin.com Louis Monier
❍
RE: alta vista and virtualvin.com Ann Cantelow
❍
RE: alta vista and virtualvin.com Michael De La Rue
❍
RE: alta vista and virtualvin.com Ann Cantelow
❍
Re: alta vista and virtualvin.com John D. Pritchard
●
RE: alta vista and virtualvin.com Louis Monier
●
Specific searches The Wild
●
RE: alta vista and virtualvin.com Louis Monier
●
RE: alta vista and virtualvin.com Bakin, David
●
RE: alta vista and virtualvin.com Paul Francis
❍
Re: alta vista and virtualvin.com Michael Van Biesbrouck
●
A new robot...TOPjobs(tm) USA JOBbot 1.0a D. Williams
●
Test server for robot development? D. Williams
●
Re: Robot Exclusion Standard Revisited (LONG) Martijn Koster
❍
BackRub robot warning Roy T. Fielding
●
Content based robot collectors Scott 'Webster' Wood
●
Robot to collect web pages per site Henrik Fagrell
❍
Re: Robot to collect web pages per site Jeremy.Ellman
●
Recherche de documentation sur les agents intelligents ou Robots. Mannina Bruno
●
www.kollar.com/robots.html John D. Pritchard
●
Tagging a document with language Tronche Ch. le pitre
❍
Re: Tagging a document with language Donald E. Eastlake 3rd
❍
Re: Tagging a document with language J.E. Fritz
●
Re: Tagging a document with language Gen-ichiro Kikui
●
RE: Tagging a document with language Henk Alles
●
RE: Tagging a document with language Robert Raisch, The Internet Company
●
implementation fo HEAD response with meta info Davide Musella
●
❍
Re: implementation fo HEAD response with meta info G. Edward Johnson
❍
Re: implementation fo HEAD response with meta info Davide Musella
robots, what else! Fred K. Lenherr
❍
Re: robots, what else! Scott 'Webster' Wood
❍
Re: robots, what else! anna@nerve.itim.mi.cnr.it
●
Finding the canonical name for a server Jaakko Hyvatti
●
Re: (book recommendation re: net agents) Rob Turk
●
Content based search engine Scott 'Webster' Wood
❍
Re: Content based search engine Martijn Koster
❍
Re: Content based search engine Skip Montanaro
●
(no subject) Digital Universe Inc.
●
(no subject) Larry Stephen Burke
●
(no subject) Fred K. Lenherr
●
BackRub robot L a r r y P a g e
❍
Re: BackRub robot Ross Finlayson
❍
Re: BackRub robot Ron Kanagy
❍
Re: BackRub robot Issac Roth
❍
Re: BackRub robot Ann Cantelow
❍
Re: BackRub robot Martijn Koster
●
robots.txt: allow directive Leslie Cuff
●
looking for specific bot... Brenden Portolese
❍
Re: looking for specific bot... Ed Carp
●
Looking for News robot Michael Goldberg
❍
Re: Looking for News robot Martijn Koster
●
robot.polite Brian Hancock
●
●
❍
Re: robot.polite Joe Nieten
❍
Re: robot.polite Martijn Koster
❍
Re: robot.polite Martijn Koster
❍
Re: robot.polite Terry O'Neill
❍
Re: robot.polite Leslie Cuff
❍
Re: robot.polite Leslie Cuff
❍
Re: robot.polite Ross Finlayson
❍
Re: robot.polite Martijn Koster
Looking for good one JoongSub Lee
❍
Introducing myself Richard J. Rossi
❍
Re: Introducing myself Martijn Koster
❍
Re: Introducing myself Fred K. Lenherr
❍
Re: Introducing myself Siu-ki Wong
❍
Re: Introducing myself Leslie Cuff
❍
Re: Introducing myself Martijn Koster
❍
Re: Introducing myself Martijn Koster
❍
Re: Introducing myself Dr. Detlev Kalb
❍
Re: Introducing myself Dr. Detlev Kalb
❍
RCPT: Re: Introducing myself Hauke Loens
❍
RCPT: Re: Introducing myself Hauke Loens
❍
Re: Looking for good one Greg Fenton
More dangers of spiders... dws
❍
HTML query to .ps? Scott 'Webster' Wood
❍
Re: HTML query to .ps? .... John W. Kulp
❍
Re: HTML query to .ps? .... John W. Kulp
●
RE: Tagging a document with language Henk Alles
●
Keyword indexing David Reilly
❍
Re: Keyword indexing Robert Raisch, The Internet Company
❍
Re: Keyword indexing Martijn Koster
❍
Re: Keyword indexing Scott 'Webster' Wood
❍
Re: Keyword indexing Sigfrid Lundberg
❍
Re: Keyword indexing Brian Ulicny
❍
Re: Keyword indexing David Reilly
❍
Re: Keyword indexing Dave White
❍
Re: Keyword indexing Brian Ulicny
●
❍
Re: Keyword indexing Paul Francis
❍
Re: Keyword indexing Ricardo Eito Brun
❍
Re: Keyword indexing Fred K. Lenherr
Robot books boogieoogie goobnie
❍
Re: Robot books joseph williams
❍
Re: Robot books Richard J. Rossi
❍
Re: Introducing myself Richard J. Rossi
❍
Re: Robot books Hauke Loens
❍
RCPT: Re: Robot books Netmode
❍
RCPT: Re: Robot books Hauke Loens
●
Allow/deny robots from major search services boogieoogie goobnie
●
Robot logic? marco@specialnet.cmt.it
❍
●
●
●
Re: Robot logic? Martijn Koster
robots.txt usage Wiebe Weikamp
❍
Re: robots.txt usage Reinier Post
❍
Re: robots.txt usage Daniel Lo
❍
Re: robots.txt usage Brian Clark
❍
Re: robots.txt usage Ulrich Ruffiner
robot vaiable list Marco Genua
❍
Re: robot vaiable list Martijn Koster
❍
The Web Robots Database (was Re: robot vaiable list Martijn Koster
WebAnalyzer - introduction Craig McQueen
❍
Re: WebAnalyzer - introduction Martijn Koster
❍
Re: WebAnalyzer - introduction Paul De Bra
●
(no subject) Martijn Koster
●
RE: Introducing myself Anthony D. Thomas
●
RE: WebAnalyzer - introduction Gregg Steinhilpert
●
PERL Compilers & Interpretive Tools MPMC-Manhattan Premed Council & Hunter PBPMA-Post Bac PreMed Assoc 212-843-3701 Ext 2800
●
RE: robots.txt usage David Levine
●
RE: WebAnalyzer - introduction Craig McQueen
●
RE: Tagging a document with language Robert Raisch, The Internet Company
●
Microsoft Tripoli Web Search Beta now available Lee Fisher
●
RE: Tagging a document with language Henk Alles
●
(no subject) Kevin Lew
●
in-document directive to discourage indexing ? Denis McKeon
❍
makerobots.perl (Re: in-document directive..) Jaakko Hyvatti
❍
Re: in-document directive to discourage indexing ? Martijn Koster
❍
Re: in-document directive to discourage indexing ? Nick Arnett
❍
Re: in-document directive to discourage indexing ? arutgers
❍
Re: in-document directive to discourage indexing ? Denis McKeon
❍
Re: in-document directive to discourage indexing ? Kevin Hoogheem
❍
Re: in-document directive to discourage indexing ? Benjamin Franz
❍
Re: in-document directive to discourage indexing ? Jaakko Hyvatti
❍
Re: in-document directive to discourage indexing ? Drew Hamilton
❍
Re: in-document directive to discourage indexing ? Rob Turk
❍
Re: in-document directive to discourage indexing ? Kevin Hoogheem
❍
Re: in-document directive to discourage indexing ? Kevin Hoogheem
❍
Re: in-document directive to discourage indexing ? Rob Turk
❍
Re: in-document directive to discourage indexing ? Convegno NATO HPC 1996
❍
Re: your mail Ed Carp
❍
Req for ADMIM: How to sunsubscribe? perez@fusoes.com.br
❍
Re: in-document directive to discourage indexing ? Terry O'Neill
❍
Re: in-document directive to discourage indexing ? Kevin Hoogheem
❍
Re: in-document directive to discourage indexing ? Nick Arnett
❍
Re: in-document directive to discourage indexing ? Terry O'Neill
❍
Re: in-document directive to discourage indexing ? Rob Turk
●
(no subject) edas@micros.et.byu.edu
●
Client Robot 'Ranjan' Brenden Portolese
❍
Re: Client Robot 'Ranjan' Robert Raisch, The Internet Company
❍
Re: Client Robot 'Ranjan' John D. Pritchard
❍
Re: Client Robot 'Ranjan' Kevin Hoogheem
❍
Re: Client Robot 'Ranjan' Benjamin Franz
❍
Re: Client Robot 'Ranjan' Brenden Portolese
❍
Re: Client Robot 'Ranjan' Reinier Post
❍
Re: Client Robot 'Ranjan' Kevin Hoogheem
●
multiple copies Steve Leibman
●
RE: Client Robot 'Ranjan' chris cobb
●
Inter-robot communication David Reilly
❍
Re: Inter-robot communication Darren R. Hardy
❍
Re: Inter-robot communication John D. Pritchard
●
❍
Re: Inter-robot communication Harry Munir Behrens
❍
Re: Inter-robot communication John D. Pritchard
❍
Re: Inter-robot communication David Reilly
❍
Re: Inter-robot communication David Reilly
❍
Re: Inter-robot communication John D. Pritchard
❍
Re: Inter-robot communication Ross Finlayson
Re: RCPT: Re: Introducing myself Steve Jones
❍
Re: RCPT: Re: Introducing myself Bjørn-Olav Strand
●
Re: RCPT: Re: Introducing myself Fred K. Lenherr
●
Looking for... Steven Frank
●
Re: RCPT: Re: Introducing myself Rob Turk
❍
●
Re: RCPT: Re: Introducing myself Ann Cantelow
Re: RCPT: Re: Introducing myself Daniel Williams
❍
People who live in SPAM houses... (Was re: RCPT Scott 'Webster' Wood
●
RE: Looking for... Craig McQueen
●
Collected information standards Scott 'Webster' Wood
●
Re: RCPT: Re: Introducing myself Ross Finlayson
●
Re: RCPT: Re: Introducing myself Chris Crowther
●
Re: RCPT: Re: Introducing myself Chris Crowther
●
Java Robot Fabio Arciniegas A.
❍
Re: Java Robot Joe Nieten
❍
Re: Java Robot John D. Pritchard
❍
Re: Java Robot Joe Nieten
❍
Re: Java Robot L a r r y P a g e
❍
Re: Java Robot John D. Pritchard
●
Re: RCPT: Re: Introducing myself Hauke Loens
●
Report of the Distributed Indexing/Searching Workshop Martijn Koster
●
RE: Java Robot David Levine
●
RE: Java Robot David Levine
●
Re: Java Robot John D. Pritchard
●
unscribe Jeffrey Kerns
●
Dead account Anthony John Carmody
●
web topology Fred K. Lenherr
❍
Re: web topology Nick Arnett
❍
Re: web topology L a r r y P a g e
❍
Re: web topology Ed Carp
●
A modest proposal...<snip> to discourage indexing ? Rob Turk
●
Re: Unsubscribing from Robots (was "your mail") Ian Samson
●
Robot's Book. Mannina Bruno
●
❍
Re: Robot's Book. Rob Turk
❍
Re: Robot's Book. joseph williams
❍
Re: Robot's Book. Victor Ribeiro
ADMIN: unsubscribing (was: Re: Martijn Koster
❍
●
●
●
Search Engine end-users Stephen Kahn
❍
Re: Search Engine end-users Ted Resnick
❍
Re: Search Engine end-users Nick Arnett
loc(SOIF) John D. Pritchard
❍
Re: loc(SOIF) Paul Francis
❍
Re: loc(SOIF) David Reilly
ADMIN: Archive Martijn Koster
❍
●
Re: ADMIN: Archive Nick Arnett
roverbot - perhaps the worst robot yet dws
❍
●
New engine on the loose? Scott 'Webster' Wood
Re: roverbot - perhaps the worst robot yet Brian Clark
we should help spiders and not say NO! Daniel Lo
❍
Re: we should help spiders and not say NO! Daniel Lo
❍
robot clusion; was Re: we should help spiders and not say NO! John D. Pritchard
❍
Re: robot clusion; was Re: we should help spiders and not say NO! Martijn Koster
Re: robot clusion; was Re: we should help spiders and not say NO! John D. Pritchard
Advice Alyne Mochan & Warren Baker
❍
●
❍
Re: Advice John D. Pritchard
❍
Re: Advice Martijn Koster
❍
Re: Advice Alyne Mochan & Warren Baker
❍
Re: Advice Alyne Mochan & Warren Baker
●
Robot Mirror with Username/Password feature (no name)
●
(no subject) K.E. HERING
●
Re: Robot Mirror with Username/Password feature Mannina Bruno
●
❍
Re: Announcement Mannina Bruno
❍
Re: Announcement Michael. Gunn
(no subject) K.E. HERING
●
Announcement Michael Göckel
●
Harvest-like use of spiders Fred Melssen
❍
Re: Harvest-like use of spiders Nick Arnett
●
(no subject) Ronald Kanagy
●
Re: hey man gimme a break Martijn Koster
●
Identifying identical documents Daniel T. Martin
❍
●
●
●
Description or Abstract? G. Edward Johnson
❍
Re: Description or Abstract? Reinier Post
❍
Re: Description or Abstract? Davide Musella
❍
Re: Description or Abstract? Reinier Post
❍
Re: Description or Abstract? Davide Musella
❍
Re: Description or Abstract? Nick Arnett
❍
Re: Description or Abstract? Mike Agostino
❍
Re: Description or Abstract? Martijn Koster
❍
Re: Description or Abstract? Martijn Koster
❍
Re: Description or Abstract? Nick Arnett
❍
On the subject of abuse/pro-activeness Scott 'Webster' Wood
❍
Re: Description or Abstract? Davide Musella
Should I index all ... CLEDER Catherine
❍
Re: Should I index all ... Michael Göckel
❍
Re: Should I index all ... Jaakko Hyvatti
❍
Re: Should I index all ... Terry O'Neill
❍
Re: Should I index all ... Cleder Catherine
❍
Re: Should I index all ... Michael Göckel
❍
Re: Should I index all ... Martijn Koster
❍
Re: Should I index all ... Nick Arnett
❍
Re: Should I index all ... Chris Crowther
❍
Re: Should I index all ... Terry O'Neill
❍
Re: Should I index all ... Terry O'Neill
❍
Re: Should I index all ... Chris Crowther
❍
Re: Should I index all ... Trevor Jenkins
HTML Parser Ronald Kanagy
❍
●
Re: Identifying identical documents Jaakko Hyvatti
HTML Parser Skip Montanaro
desperately looking for a news searcher Saloum Fall
❍
Re: desperately looking for a news searcher Reinier Post
●
Re: Unsubscribing from Robots (was "your mail") Ian Samson
●
A modest proposal...<snip> to discourage indexing ? Rob Turk
●
Re: your mail Ed Carp
●
Blackboard for Discussing Domain-specific Robots Fred K. Lenherr
●
robots.txt unavailability Daniel T. Martin
●
●
❍
Re: robots.txt unavailability Jaakko Hyvatti
❍
Re: robots.txt unavailability Fred K. Lenherr
❍
Re: robots.txt unavailability Michael Göckel
❍
Re: robots.txt unavailability Daniel T. Martin
❍
Re: robots.txt unavailability levitte@lp.se
nastygram from xxx.lanl.gov Aaron Nabil
❍
Re: nastygram from xxx.lanl.gov Wiebe Weikamp
❍
Re: nastygram from xxx.lanl.gov Aaron Nabil
Re: nastygram from xxx.lanl.gov Aaron Nabil
❍
●
Re: nastygram from xxx.lanl.gov Michael Schlindwein
nastygram from xxx.lanl.gov Peter.Vogt@tokai.sprint.com
❍
Re: nastygram from xxx.lanl.gov Paul Francis
❍
Re: nastygram from xxx.lanl.gov Paul Francis
❍
Re: nastygram from xxx.lanl.gov Aaron Nabil
❍
Re: nastygram from xxx.lanl.gov Paul Francis
❍
Re: nastygram from xxx.lanl.gov Steve Nisbet
❍
Re: nastygram from xxx.lanl.gov Daniel T. Martin
❍
Re: nastygram from xxx.lanl.gov Chris Crowther
❍
Re: nastygram from xxx.lanl.gov Roy T. Fielding
❍
Re: nastygram from xxx.lanl.gov Istvan
❍
Re: nastygram from xxx.lanl.gov Istvan
❍
Re: nastygram from xxx.lanl.gov Larry Gilbert
❍
Re: nastygram from xxx.lanl.gov Tim Bray
❍
Re: nastygram from xxx.lanl.gov Chris Crowther
❍
Re: nastygram from xxx.lanl.gov Rob Hartill
❍
Re: nastygram from xxx.lanl.gov Benjamin Franz
❍
Re: nastygram from xxx.lanl.gov Rob Hartill
❍
Re: nastygram from xxx.lanl.gov Aaron Nabil
❍
Re: nastygram from xxx.lanl.gov Rob Hartill
●
❍
Re: nastygram from xxx.lanl.gov Gordon Bainbridge
❍
Re: nastygram from xxx.lanl.gov Benjamin Franz
❍
Re: nastygram from xxx.lanl.gov Denis McKeon
❍
Re: nastygram from xxx.lanl.gov Drew Hamilton
❍
Re: nastygram from xxx.lanl.gov Rob Hartill
❍
Re: nastygram from xxx.lanl.gov Istvan
❍
Re: nastygram from xxx.lanl.gov Fred K. Lenherr
❍
Re: nastygram from xxx.lanl.gov Steve Jones
❍
You found it... (was Re: nastygram from xxx.lanl.gov) Michael Schlindwein
❍
Re: nastygram from xxx.lanl.gov Garth T Kidd
❍
Re: nastygram from xxx.lanl.gov Rob Hartill
❍
Re: nastygram from xxx.lanl.gov Garth T Kidd
❍
Re: nastygram from xxx.lanl.gov nobody@physics.yale.edu
❍
Re: nastygram from xxx.lanl.gov Istvan
❍
Re: nastygram from xxx.lanl.gov Garth T Kidd
❍
Re: nastygram from xxx.lanl.gov Jaakko Hyvatti
❍
Re: nastygram from xxx.lanl.gov Chris Crowther
❍
Re: nastygram from xxx.lanl.gov Chris Crowther
❍
Re: nastygram from xxx.lanl.gov Gordon Bainbridge
❍
Re: nastygram from xxx.lanl.gov
❍
Re: nastygram from xxx.lanl.gov Kevin Hoogheem
❍
Re: nastygram from xxx.lanl.gov Chris Crowther
RE: nastygram from xxx.lanl.gov Elias Sideris
❍
●
RE: nastygram from xxx.lanl.gov Bjørn-Olav Strand
RE: nastygram from xxx.lanl.gov Frank Wales
❍
Re: nastygram from xxx.lanl.gov Michael Schlindwein
●
htaccess Steve Leibman
●
RE: robots.txt unavailability Louis Monier
●
RE: On the subject of abuse/pro-activeness Louis Monier
❍
●
●
●
●
Re: On the subject of abuse/pro-activeness Scott 'Webster' Wood
NCSA Net Access_log Analysis Tool for Win95 MPMC-Manhattan Premed Council & Hunter PBPMA-Post Bac PreMed Assoc 212-843-3701 Ext 2800
Proxies Larry Stephen Burke
NCSA Net Access_log Analysis Tool for Win95 MPMC-Manhattan Premed Council & Hunter PBPMA-Post Bac PreMed Assoc 212-843-3701 Ext 2800
RE: Should I index all ... David Levine
●
●
●
How long to cache robots.txt for? Aaron Nabil
❍
Re: How long to cache robots.txt for? Micah A. Williams
❍
Re: How long to cache robots.txt for? Jaakko Hyvatti
❍
Re: How long to cache robots.txt for? Martin Kiff
❍
Re: How long to cache robots.txt for? Greg Fenton
Re: psycho at xxx.lanl.gov Rob Turk
❍
Re: psycho at xxx.lanl.gov Bonnie Scott
❍
Re: psycho at xxx.lanl.gov Scott 'Webster' Wood
Re: Alta Vista getting stale? Denis McKeon
❍
Re: Alta Vista getting stale? Martin Kiff
❍
Re: Alta Vista getting stale? Denis McKeon
●
RE: nastygram from xxx.lanl.gov Frank Wales
●
RE: nastygram from xxx.lanl.gov Paul Francis
●
❍
Re: nastygram from xxx.lanl.gov Michael De La Rue
❍
Re: nastygram from xxx.lanl.gov Michael Schlindwein
RE: nastygram from xxx.lanl.gov Rob Hartill
❍
Re: nastygram from xxx.lanl.gov Aaron Nabil
❍
Re: nastygram from xxx.lanl.gov Rob Hartill
❍
RE: nastygram from xxx.lanl.gov Bjørn-Olav Strand
●
Updating Robots Ian Samson
●
Re: nastygram from xxx.lanl.gov Aaron Nabil
❍
Re: nastygram from xxx.lanl.gov Rob Hartill
●
Prasad Wagle: Webhackers: Java servlets and agents John D. Pritchard
●
RE: nastygram from xxx.lanl.gov Istvan
●
RE: nastygram from xxx.lanl.gov Tim Bray
●
Re: nastygram for xxx.lanl.gov David Levine
●
RE: nastygram from xxx.lanl.gov Istvan
●
Re: nastygram from xxx.lanl.gov Aaron Nabil
❍
●
Re: nastygram from xxx.lanl.gov Rob Hartill
dumb robots and xxx Rob Hartill
❍
Re: dumb robots and xxx Ron Wolf
❍
Re: dumb robots and xxx Randy Terbush
❍
Re: dumb robots and xxx Jim Ausman
❍
Re: dumb robots and xxx Michael Schlindwein
❍
Re: dumb robots and xxx Randy Terbush
●
ADMIN: Spoofing vs xxx.lanl.gov Martijn Koster
●
RE: nastygram from xxx.lanl.gov Frank Wales
●
RE: nastygram from xxx.lanl.gov Istvan
●
forwarded e-mail Paul Ginsparg 505-667-7353
●
the POST myth... a web admin's opinions.. Rob Hartill
❍
Re: the POST myth... a web admin's opinions.. Benjamin Franz
❍
Re: the POST myth... a web admin's opinions.. Rob Hartill
❍
Re: the POST myth... a web admin's opinions.. Benjamin Franz
❍
Re: the POST myth... a web admin's opinions.. Bonnie Scott
❍
Re: the POST myth... a web admin's opinions.. Benjamin Franz
❍
Re: the POST myth... a web admin's opinions.. Rob Hartill
❍
Re: the POST myth... a web admin's opinions.. Istvan
●
xxx.lanl.gov - The thread continues.... Chris Crowther
●
xxx.lanl.gov a real threat? John Lammers
●
Re: Alta Vista getting stale? Nick Arnett
●
crawling accident Rafhael Cedeno
●
xxx.lanl.gov/robots.txt Chris Crowther
●
Stupid robots cache DNS and not IMS Roy T. Fielding
●
Suggestion to help robots and sites coexist a little better Rob Hartill
❍
Java intelligent agents and compliance? Scott 'Webster' Wood
❍
Re: Java intelligent agents and compliance? John D. Pritchard
❍
Re: Java intelligent agents and compliance? Shiraz Siddiqui
❍
Re: Suggestion to help robots and sites coexist a little better Benjamin Franz
❍
Re: Suggestion to help robots and sites coexist a little better Rob Hartill
❍
Suggestion to help robots and sites coexist a little better Skip Montanaro
❍
Re: Suggestion to help robots and sites coexist a little better Dirk.vanGulik
❍
Re: Suggestion to help robots and sites coexist a little better Martijn Koster
❍
Re: Suggestion to help robots and sites coexist a little better Randy Terbush
❍
Re: Suggestion to help robots and sites coexist a little better Rob Hartill
❍
Re: Suggestion to help robots and sites coexist a little better Nick Arnett
❍
Re: Suggestion to help robots and sites coexist a little better Scott 'Webster' Wood
❍
Re: Suggestion to help robots and sites coexist a little better Rob Hartill
❍
Re: Suggestion to help robots and sites coexist a little better Nick Arnett
❍
Re: Suggestion to help robots and sites coexist a little better Rob Hartill
❍
Re: Suggestion to help robots and sites coexist a little better Nick Arnett
●
❍
Re: Suggestion to help robots and sites coexist a little better Scott 'Webster' Wood
❍
Re: Suggestion to help robots and sites coexist a little better Jaakko Hyvatti
❍
Re: Suggestion to help robots and sites coexist a little better Rob Hartill
❍
Re: Suggestion to help robots and sites coexist a little better Brian Clark
❍
Re: Suggestion to help robots and sites coexist a little better Nick Arnett
❍
Re: Suggestion to help robots and sites coexist a little better Martijn Koster
❍
Apology -- I didn't mean to send that last message to the list Bonnie Scott
❍
Re: Suggestion to help robots and sites coexist a little better Rob Hartill
❍
Re: Suggestion to help robots and sites coexist a little better Nick Arnett
❍
Re: Suggestion to help robots and sites coexist a little better Rob Hartill
PS Rob Hartill
❍
Re: PS Benjamin Franz
❍
Re: PS Brian Clark
❍
Re: PS Rob Hartill
❍
About the question what a robots is (was Re: PS) Michael Schlindwein
❍
Re: About the question what a robots is (was Re: PS) Rob Turk
❍
Re: About the question what a robots is (was Re: PS) Michael Schlindwein
●
Linux and Robot development... root
●
Re: Suggestion to help robots and sites coexist a little better Mark J Cox
●
interactive generation of URL's Fred Melssen
●
❍
Re: interactive generation of URL's Benjamin Franz
❍
Re: interactive generation of URL's Chris Crowther
Newbie question Dan Hurwitz
❍
Re: Newbie question David Eichmann
❍
Re: Newbie question Danny Sullivan
●
Q: meta name="robots" content="noindex" ? John Bro, InterSoft Solutions, Inc
●
Possible MSIIS bug? Jakob Faarvang
●
HOST: header Rob Hartill
●
●
❍
Re: HOST: header Scott 'Webster' Wood
❍
Re: HOST: header Rob Hartill
Robot Research Kosta Tombras
❍
Re: Robot Research David Eichmann
❍
Re: Robot Research Steve Nisbet
❍
Re: Robot Research Paul Francis
Safe Methods Istvan
❍
Re: Safe Methods Garth T Kidd
❍
Re: Safe Methods Rob Hartill
❍
Re: Safe Methods Benjamin Franz
❍
Re: Safe Methods Rob Hartill
❍
Re: Safe Methods Benjamin Franz
❍
Re: Safe Methods Randy Terbush
❍
Re: Safe Methods Rob Turk
❍
Re: Safe Methods Rob Hartill
●
robots source code in C Ethan Lee
●
Re: How long to cache robots.txt Daniel T. Martin
●
*Help: Writing BOTS* Shiraz Siddiqui
●
Re: How long to cache robots.txt Mike Agostino
●
Search Engine article Danny Sullivan
❍
Re: Search Engine article James
●
Unusual request - sorry! David Eagles
●
Re: How long to cache robots.txt Daniel T. Martin
●
Last message David Eagles
●
Re: Social Responsibilities (was Safe Methods) David Eichmann
●
AltaVista's Index is obsolete Ian Samson
❍
Re: AltaVista's Index is obsolete Patrick Lee
❍
Re: AltaVista's Index is obsolete Wiebe Weikamp
●
RE: AltaVista's Index is obsolete; but what about the others Ted Sullivan
●
RE: AltaVista's Index is obsolete; but what about the other Ian Samson
●
Re: AltaVista's Index is obsolete; but what about the other Daniel T. Martin
●
hoohoo.cac.washington = bad Andy Warner
❍
Re: hoohoo.cac.washington = bad David Reilly
❍
Re: hoohoo.cac.washington = bad Betsy Dunphy
❍
Re: hoohoo.cac.washington = bad Erik Selberg
❍
Re: hoohoo.cac.washington = bad Patrick Lee
❍
Re: hoohoo.cac.washington = bad Jeremy.Ellman
●
PHP stops robots kathy@accessone.com
●
netscape spec for RDM Jim Ausman
●
❍
Re: netscape spec for RDM Nick Arnett
❍
Re: netscape spec for RDM David Reilly
Stop 'bots using apache, etc. or php? kathy@accessone.com
●
●
●
Anyone know who owns this one? Betsy Dunphy
❍
Re: Anyone know who owns this one? Benjamin Franz
❍
Re: Anyone know who owns this one? Rob Turk
❍
Re: Anyone know who owns this one? Rob Hartill
❍
Re: Anyone know who owns this one? David Reilly
❍
Robo-phopbic Mailing list Shiraz Siddiqui
❍
Re: Anyone know who owns this one? Captain Napalm
❍
Re: Anyone know who owns this one? pinson
Robot Gripes forum? (Was: Anyone know who owns this one?) Betsy Dunphy
❍
Re: Robot Gripes forum? (Was: Anyone know who owns this one?) Rob Turk
❍
Re: Robot Gripes forum? (Was: Anyone know who owns this one?) Bruce Rhodewalt
❍
Re: Robot Gripes forum? (Was: Anyone know who owns this one?) Issac Roth
RE: Anyone know who owns this one? Martin.Soukup
❍
●
●
I vote NO (Was: Robot Gripes forum?) Nick Arnett
❍
Re: I vote NO (Was: Robot Gripes forum?) Rob Hartill
❍
Re: I vote NO (Was: Robot Gripes forum?) Nick Arnett
❍
Re: I vote NO (Was: Robot Gripes forum?) Rob Hartill
Re: I vote NO (Was: Robot Gripes forum?) - I vote YES Betsy Dunphy
❍
●
●
Re: Anyone know who owns this one? Captain Napalm
Re: I vote NO (Was: Robot Gripes forum?) - I vote YES Erik Selberg
What is your favorite search engine - a survey Bhupinder S. Sran
❍
Re: What is your favorite search engine - a survey Sanna
❍
Re: What is your favorite search engine - a survey jmills@caribsurf.com
robots on an intranet Adam Gaffin
❍
Re: robots on an intranet Tim Bray
❍
Re: robots on an intranet Tim Bray
❍
Re: robots on an intranet (replies to list...) Michael De La Rue
❍
Re: robots on an intranet Peter Small
❍
Re: robots on an intranet jmills@caribsurf.com
❍
Re: robots on an intranet Jane Doyle
❍
(no subject) blea@hic.net
❍
Re: robots on an intranet Ulla Sandberg
❍
Re: robots on an intranet Abderrezak Kamel
●
grammar engines Ross A. Finlayson
●
Re[2]: robots on an intranet Stacy Cannady
●
Add This Search Engine to your Results. Thomas Bedell
❍
●
Re: Add This Search Engine to your Results. James
Search Engine Tutorial for Web Developers Edward Stangler
❍
How can IR Agents be evaluate ? Giacomo Fiorentini
●
RE: How can IR Agents be evaluate ? Jim Harris
●
RE: How can IR Agents be evaluate ? Dan Quigley
●
Project Aristotle(sm) Gerry McKiernan
●
Re: robots on an intranet (replies to list...) Martijn Koster
❍
●
Re: robots on an intranet (replies to list...) Reinier Post
RE: Search Engine System Admin
❍
RE: Search Engine Marc Langheinrich
❍
RE: Search Engine WEBsmith Editor
❍
Re: Search Engine Harry Munir Behrens
●
Re: robots on an intranet (replies to list...) Operator
●
Re: Search Engine Operator
❍
Re: Search Engine siddiqui athar shiraz
●
Re: Search Engine Stimpy
●
Lynx. The one true browser. Ian McKellar
●
email spider Richard A. Paris
❍
Re: email spider Scott 'Webster' Wood
❍
Re: email spider Jerry Walsh
●
RE: email spider Jim Harris
●
RE: How can IR Agents be evaluate ? Nick Arnett
●
Re: Search Engine Nick Arnett
●
Offline Agents for UNIX A. Shiraz Siddiqui
❍
●
Re: Offline Agents for UNIX Ian McKellar
mini-robot Chiaki Ohta
❍
Re: mini-robot Joao Moreira
●
Webfetch Edward Ooi
●
Library Agents(sm): Library Applications of Intelligent Software Agents Gerry McKiernan
Mail robot? Christopher J. Farrell IV
●
❍
●
Re: Mail robot? blea@hic.net
depth first vs breadth first Robert Nicholson
❍
Re: depth first vs breadth first steved@pmc02.pmc.philips.com
❍
Re: depth first vs breadth first David Eichmann
❍
●
●
Quakebots Ross A. Finlayson
❍
Re: Quakebots Rob Torchon
❍
Re: Quakebots Ross A. Finlayson
articles or URL's on search engines Werner Schweibenz, Mizzou
❍
●
●
●
Re: depth first vs breadth first David Eichmann
Re: articles or URL's on search engines Peter Small
HEAD Anna Torti
❍
Re: HEAD Daniel Lo
❍
Re: HEAD Captain Napalm
❍
Re: HEAD Captain Napalm
The Internet Archive robot Mike Burner
❍
Re: The Internet Archive robot Tronche Ch. le pitre
❍
Re: The Internet Archive robot Fred Douglis
The Internet Archive robot ACHAKS@inf.com
❍
Re: The Internet Archive robot Brian Clark
❍
Re: The Internet Archive robot Robert B. Turk
❍
Re: The Internet Archive robot Alex Strasheim
❍
Re: The Internet Archive robot Marilyn R Wulfekuhler
❍
Re: The Internet Archive robot Richard Gaskin - Fourth World
❍
Re: The Internet Archive robot Richard Gaskin - Fourth World
❍
Re: The Internet Archive robot Eric Kristoff
❍
Re: The Internet Archive robot Jeremy Sigmon
❍
Re: The Internet Archive robot Michael Göckel
❍
Re: The Internet Archive robot Eric Kristoff
❍
Re: The Internet Archive robot Gareth R White
❍
Re: The Internet Archive robot Jeremy Sigmon
❍
Re: The Internet Archive robot Todd Markle
❍
Re: The Internet Archive robot Jeremy Sigmon
❍
robots.txt buffer question. Jeremy Sigmon
❍
Re: The Internet Archive robot Martijn Koster
❍
Re: The Internet Archive robot Z Smith
❍
Re: The Internet Archive robot Rob Hartill
❍
Re: The Internet Archive robot Z Smith
●
(no subject) Stacy Cannady
●
"hidden text" vs. META tags for robots/search engines Todd Sellers
●
❍
Re: "hidden text" vs. META tags for robots/search engines Can Ozturan
❍
Re: "hidden text" vs. META tags for robots/search engines Martijn Koster
❍
Re: "hidden text" vs. META tags for robots/search engines Martin Kiff
❍
Re: "hidden text" vs. META tags for robots/search engines Chad Zimmerman
❍
Re: "hidden text" vs. META tags for robots/search engines Davide Musella
❍
Re: "hidden text" vs. META tags for robots/search engines Martin Kiff
Looking for subcontracting spider-programmers Richard Rossi
❍
Re: Looking for subcontracting spider-programmers Bob Worthy
❍
Re: Looking for subcontracting spider-programmers Hani Yakan
❍
tryme Marty Landman
❍
Re: Looking for subcontracting spider-programmers Kevin Hoogheem
●
RE: The Internet Archive robot Ted Sullivan
●
(no subject) Robert Stober
●
RE: The Internet Archive robot (fwd) Brewster Kahle
❍
●
●
Re: The Internet Archive robot (fwd) Fred Douglis
pointers for a novice? Alex Strasheim
❍
Re: pointers for a novice? Kevin Hoogheem
❍
Re: pointers for a novice? Elias Hatzis
Extracting info from SIG forum archives Peter Small
❍
Re: Extracting info from SIG forum archives Denis McKeon
●
Conceptbot spider David L. Sifry
●
Re: The Internet Archive robot (fwd) Robert B. Turk
●
Re: The Internet Archive robot Phil Hochstetler
●
Copyrights (was Re: The Internet Archive robot) Brian Clark
●
❍
Re: Copyrights (was Re: The Internet Archive robot) Tim Bray
❍
Re: Copyrights (was Re: The Internet Archive robot) Robert B. Turk
❍
Re: Copyrights (was Re: The Internet Archive robot) Brian Clark
❍
Re: Copyrights (was Re: The Internet Archive robot) Robert B. Turk
❍
Re: Copyrights (was Re: The Internet Archive robot) Brian Clark
Copyrights on the web Charlie Brown
❍
Re: Copyrights on the web Benjamin Franz
❍
Re: Copyrights on the web Glburt@aol.com
❍
Re: Copyrights on the web Chad Zimmerman
❍
Re: Copyrights on the web Denis McKeon
❍
Re: Copyrights on the web Richard Gaskin - Fourth World
❍
Re: Copyrights on the web Eric Kristoff
●
RE: Copyrights on the web Ted Sullivan
●
More ways to spam search engines? G. Edward Johnson
❍
Re: More ways to spam search engines? Andrew Leonard
●
RE: copyright, etc. Richard Gaskin - Fourth World
●
Copyrights, let them be ! Joao Moreira
●
❍
Re: Copyrights, let them be ! Denis McKeon
❍
Re: Copyrights, let them be ! Brian Clark
❍
Re: Copyrights, let them be ! Richard Gaskin - Fourth World
❍
Re: Copyrights, let them be ! John D. Pritchard
❍
(Fwd) Re: Copyrights, let them be ! Robert Raisch, The Internet Company
❍
Re: (Fwd) Re: Copyrights, let them be ! Richard Gaskin - Fourth World
crawling FTP sites Greg Fenton
❍
Re: crawling FTP sites Jaakko Hyvatti
❍
Re: crawling FTP sites James Black
●
RE: Copyrights, let them be ! William Dan Terry
●
RE: crawling FTP sites William Dan Terry
●
Re: The Internet Archive robot (fwd) Brewster Kahle
●
RE: crawling FTP sites William Dan Terry
●
Re: Public Access Nodes / Copywrited Nodes Ross A. Finlayson
●
FAQ? Frank Smadja
●
RE: The Internet Archive robot David Levine
❍
RE: The Internet Archive robot Denis McKeon
❍
RE: The Internet Archive robot Sigfrid Lundberg
❍
Re: The Internet Archive robot Fred Douglis
●
RE: Copyrights on the web Bryan Cort
●
RE: The Internet Archive robot William Dan Terry
●
RE: The Internet Archive robot David Levine
❍
RE: The Internet Archive robot Denis McKeon
●
InfoSpiders/0.1 Filippo Menczer
●
info for newbie Filippo Menczer
❍
MOMSpider problem. Broken Pipe Jeremy Sigmon
●
Bad agent...A *very* bad agent. Benjamin Franz
●
RE: The Internet Archive robot Nick Arnett
●
A few copyright notes Nick Arnett
●
Topic drift (archive robot, copyright...) Nick Arnett
❍
Re: Topic drift (archive robot, copyright...) Ross A. Finlayson
❍
Re: Topic drift (archive robot, copyright...) Jeremy Sigmon
●
Netscape Catalog Server: An Eval Eric Kristoff
●
file retrieval Eddie Rojas
❍
●
Re: file retrieval Martijn Koster
Preferred access time Martin.Soukup
❍
Re: Preferred access time John D. Pritchard
●
White House/PARC "Leveraging Cyberspace" Nick Arnett
●
RE: Preferred access time Martin.Soukup
●
robot to get specific info only? Martin.Soukup
❍
re: robot to get specific info only? Steve Leibman
●
Unregistered MIME types? Nick Arnett
●
A bad agent? Nick Dearnaley
❍
Re: A bad agent? Rob Hartill
❍
Re: A bad agent? Nick Dearnaley
❍
Re: A bad agent? Rob Hartill
●
Use of robots.txt to "check status"? Ed Costello
●
Robot exclustion for for non-'unix file' hierarchy Hallvard B Furuseth
❍
Re: Robot exclustion for for non-'unix file' hierarchy Martijn Koster
●
How to get listed #1 on all search engines (fwd) Bonnie
●
RE: How to get listed #1 on all search engines (fwd) Ted Sullivan
●
Cannot believe it "Morons" Cafe
●
Possible robot? Chad Zimmerman
●
Bye Bye HyperText: The End of the World (Wide Web) As We Know It! Gerry McKiernan
●
Re: Bye Bye HyperText: The End of the World (Wide Web) As We Know It! Michael De La Rue
Image Maps Harold Gibbs
❍
❍
Re: Image Maps Martin Kiff
●
Bug in LibWWW perl + Data::Dumper (libwwwperl refs are strange) Michael De La Rue
●
do robots send HTTP_HOST? Joe Pruett
❍
Re: do robots send HTTP_HOST? Aaron Nabil
●
The End of The World (Wide Web) / Part II Gerry McKiernan
❍
Re: The End of The World (Wide Web) / Part II Brian Clark
❍
Re: The End of The World (Wide Web) / Part II Nick Dearnaley
●
Seeing is Believing: Candidate Web Resources for Information Visualization Gerry McKiernan
●
CitedSites(sm): Citation Indexing of Web Resources Gerry McKiernan
●
Netscape Catalog Server Eric Kristoff
●
Netscape Catalog Server Eric Kristoff
●
Topic-specific robots Fred K. Lenherr
❍
●
●
●
Re: Topic-specific robots Nick Dearnaley
robots.txt syntax Fred K. Lenherr
❍
Re: robots.txt syntax John D. Pritchard
❍
Re: robots.txt syntax Martijn Koster
❍
Re: robots.txt syntax Martijn Koster
❍
Re: robots.txt syntax Captain Napalm
❍
Re: robots.txt syntax John D. Pritchard
❍
Re: robots.txt syntax Captain Napalm
❍
Re: robots.txt syntax John D. Pritchard
❍
Re: robots.txt syntax Captain Napalm
❍
Re: robots.txt syntax John D. Pritchard
Another rating scam! (And a proposal on how to fix it) Aaron Nabil
❍
Re: Another rating scam! (And a proposal on how to fix it) Martijn Koster
❍
Re: Another rating scam! (And a proposal on how to fix it) Paul Francis
❍
Re: Another rating scam! (And a proposal on how to fix it) Paul Francis
❍
Re: Another rating scam! (And a proposal on how to fix it) Aaron Nabil
❍
Re: Another rating scam! (And a proposal on how to fix it) Benjamin Franz
META tag standards, search accuracy Nick Arnett
❍
Re: META tag standards, search accuracy Robert B. Turk
❍
Re: META tag standards, search accuracy Nick Arnett
❍
Re: META tag standards, search accuracy Nick Arnett
❍
Re: META tag standards, search accuracy Nick Arnett
❍
Re: META tag standards, search accuracy Nick Arnett
❍
Re: META tag standards, search accuracy Eric Miller
●
ADMIN (was Re: hypermail archive not operational Martijn Koster
●
ActiveAgent HipCrime
❍
Re: ActiveAgent Benjamin Franz
●
non 2nn repsonses on robots.txt Aaron Nabil
●
servers that don't return a 404 for "not found" Aaron Nabil
●
Re: servers that don't return a 404 for "not found" Aaron Nabil
●
Re: servers that don't return a 404 for "not found" Aaron Nabil
●
Re: Another rating scam! (And a proposal on how to fix it Nick Dearnaley
●
Returned mail: Host unknown (Name server: webcrawler: host not found) Mail Delivery Subsystem
Re: META tag standards, search accuracy Eric Miller
●
❍
Re: META tag standards, search accuracy Benjamin Franz
❍
Re: META tag standards, search accuracy Benjamin Franz
●
ActiveAgent Larry Steinberg
●
Re: ActiveAgent Benjamin Franz
●
Re: META tag standards, search accuracy Eric Miller
●
Returned mail: Host unknown (Name server: webcrawler: host not found) Mail Delivery Subsystem
●
RE: ActiveAgent and E-Mail Spam Bryan Cromartie
●
Re: META tag standards, search accuracy Eric Miller
●
Re: ActiveAgent William Neuhauser
●
agents ignoring robots.txt Rob Hartill
❍
Re: agents ignoring robots.txt Erik Selberg
❍
Re: agents ignoring robots.txt Captain Napalm
❍
Re: agents ignoring robots.txt Erik Selberg
❍
Re: agents ignoring robots.txt John D. Pritchard
●
Re: agents ignoring robots.txt Rob Hartill
●
Disallow/Allow by Action (Re: robots.txt syntax) Brian Clark
❍
Re: Disallow/Allow by Action (Re: robots.txt syntax) Nick Dearnaley
●
CyberPromo shut down at last!!! Richard Gaskin - Fourth World
●
infoseek Fred K. Lenherr
●
Comparing robots/search sites Fred K. Lenherr
●
Re: infoseek Rob Hartill
●
McKinley -- 100% error rate Jennifer C. O'Brien
●
Re: robots.txt buffer question. Brent Boghosian
●
What to rate limit/lock on, name or IP address? Aaron Nabil
❍
Re: What to rate limit/lock on, name or IP address? bbh@xenodata.com
●
RE: What to rate limit/lock on, name or IP address? Greg Fenton
●
RE: What to rate limit/lock on, name or IP address? Brent Boghosian
●
Filtering queries on a robot-built database Fred K. Lenherr
●
sockets in PERL HipCrime
●
Re: sockets in PERL Otis Gospodnetic
●
Re: Showbiz search engine Showbiz Information
●
Is a robot visiting? Tim Freeman
❍
Re: Is a robot visiting? Klaus Johannes Rusch
❍
Re: Is a robot visiting? Daniel T. Martin
❍
Re: Is a robot visiting? Aaron Nabil
❍
Re: Is a robot visiting? Tim Freeman
❍
Re: Is a robot visiting? Tim Freeman
❍
Re: Is a robot visiting? David Steele
❍
Re: Is a robot visiting? Tim Freeman
❍
Re: Is a robot visiting? Aaron Nabil
❍
Re: Is a robot visiting? Klaus Johannes Rusch
❍
Re: Is a robot visiting? Hallvard B Furuseth
❍
Re: Is a robot visiting? Hallvard B Furuseth
●
RE: Is a robot visiting? Greg Fenton
●
Tim Freeman Aaron Nabil
●
Thanks! Tim Freeman
●
Possible robots.txt addition Ian Graham
❍
Re: Possible robots.txt addition Issac Roth
❍
Re: Possible robots.txt addition Francois Rouaix
❍
Re: Possible robots.txt addition John D. Pritchard
❍
Re: Possible robots.txt addition John D. Pritchard
❍
Re: Possible robots.txt addition Martijn Koster
●
RE: Is a robot visiting? Hallvard B Furuseth
●
download robot LIAM GUINANE
●
A new robot -- ask for advice Hrvoje Niksic
●
Re: Possible robots.txt addition Martin Kiff
●
technical descripton [D [D [D LIAM GUINANE
❍
●
●
Re: technical descripton [D [D [D P. Senthil
Domains and HTTP_HOST Brian Clark
❍
Re: Domains and HTTP_HOST Ian Graham
❍
Re: Domains and HTTP_HOST DECLAN FITZPATRICK
❍
Re: Domains and HTTP_HOST Klaus Johannes Rusch
❍
Re: Domains and HTTP_HOST Brian Clark
❍
Re: Domains and HTTP_HOST Klaus Johannes Rusch
❍
Re: Domains and HTTP_HOST John D. Pritchard
Source code Stephane Vaillancourt
●
Back of the envelope computations Francois Rouaix
❍
Re: Back of the envelope computations John D. Pritchard
❍
Re: Back of the envelope computations Sigfrid Lundberg
❍
Re: Possible robots.txt addition Ian Graham
●
Re: Possible robots.txt addition Ian Graham
●
Re: Possible robots.txt addition Ian Graham
●
Possible robots.txt addition - did I say that? Martin Kiff
●
Re: Possible robots.txt addition Klaus Johannes Rusch
●
Re: Possible robots.txt addition (fwd) Ian Graham
●
Belated notice of spider article Adam Gaffin
●
Re: Possible robots.txt addition Klaus Johannes Rusch
●
ActiveAgent HipCrime
●
Matching the user-agent in /robots.txt Hrvoje Niksic
❍
Re: Domains and HTTP_HOST Benjamin Franz
●
Re: ActiveAgent Aaron Nabil
●
Re: ActiveAgent ROBERT-LF_HUANG@HP-China-om1.om.hp.com
●
Re: ActiveAgent Rob Hartill
●
We need robot information Juan
●
anti-robot regexps Hallvard B Furuseth
●
anti-robot regexps Hallvard B Furuseth
●
An extended verion of the robot exclusion standard Captain Napalm
❍
●
Re: An extended verion of the robot exclusion standard Hrvoje Niksic
Re: An extended version of the Robots... Martijn Koster
❍
Re: An extended version of the Robots... Captain Napalm
❍
Re: An extended version of the Robots... Hrvoje Niksic
●
Re: Domains and HTTP_HOST Klaus Johannes Rusch
●
robot algorithm ? Otis Gospodnetic
●
itelligent agents Wanca, Vincent
❍
●
●
Re: itelligent agents Hrvoje Niksic
Get official! Hallvard B Furuseth
❍
Re: Get official! DA
❍
Re: Get official! Klaus Johannes Rusch
❍
Re: Get official! Hallvard B Furuseth
Regexps (Was: Re: An extended version of the Robots...) Hallvard B Furuseth
❍
Regexps (Was: Re: An extended version of the Robots...) Skip Montanaro
●
Regexps (Was: Re: An extended version of the Robots...) Hallvard B Furuseth
●
Is their a web site... z960849@rice.farm.niu.edu
❍
Re: Is their a web site... DA
●
Re: Regexps (Was: Re: An extended version of the Robots...) Martijn Koster
●
Re: An extended verion of the robot exclusion standard Captain Napalm
●
Re: An extended version of the Robots... Captain Napalm
❍
●
An updated extended standard for robots.txt Captain Napalm
❍
●
●
Re: An extended version of the Robots... Hrvoje Niksic
Re: An updated extended standard for robots.txt Art Matheny
Notification protocol? Nick Arnett
❍
Re: Notification protocol? Fred K. Lenherr
❍
Re: Notification protocol? Ted Hardie
❍
Re: Notification protocol? John D. Pritchard
❍
Re: Notification protocol? Tony Barry
❍
Re: Notification protocol? Mike Schwartz
❍
Re: Notification protocol? Sankar Virdhagriswaran
❍
Re: Notification protocol? John D. Pritchard
❍
Re: Notification protocol? Peter Jurg
Re: An extended version of the Robots... Hallvard B Furuseth
❍
Re: An extended version of the Robots... Art Matheny
●
Re: An extended version of the Robots... (fwd) Vu Quoc HUNG
●
changes to robots.txt Rob Hartill
●
❍
Re: changes to robots.txt DA
❍
Re: changes to robots.txt Rob Hartill
❍
Re: changes to robots.txt DA
❍
Re: changes to robots.txt Klaus Johannes Rusch
❍
Re: changes to robots.txt Steve DeJarnett
❍
Re: changes to robots.txt Rob Hartill
Re: An updated extended standard for robots.txt Captain Napalm
❍
Re: An updated extended standard for robots.txt Art Matheny
●
Re: An extended version of the Robots... Martijn Koster
●
UN/LINK protocol is standardized! wasn't that quick! John D. Pritchard
●
Re: UN/LINK protocol is standardized! wasn't that quick! John D. Pritchard
●
[ROBOTS JJA] Lib-WWW-perl5 Juan Jose Amor
●
Admitting the obvious I think therefore I spam
❍
●
Re: Admitting the obvious Richard Gaskin
how do they do that? Otis Gospodnetic
❍
Re: how do they do that? Martijn Koster
●
databases for spiders Elias Hatzis
●
databases for spiders Elias Hatzis
❍
Re: databases for spiders John D. Pritchard
❍
Re: databases for spiders Nick Arnett
❍
Re: databases for spiders DA
❍
Re: databases for spiders Nick Arnett
●
Re: databases for spiders ROBERT-LF_HUANG@HP-China-om1.om.hp.com
●
indexing intranet-site Martin Paff
❍
Re: indexing intranet-site Nick Arnett
●
Re: An extended version of the Robots... Hallvard B Furuseth
●
Re: An extended version of the Robots... Hallvard B Furuseth
●
RE: databases for spiders Larry Fitzpatrick
●
RE: databases for spiders Nick Arnett
●
infoseeks robot is dumb Otis Gospodnetic
❍
Re: infoseeks robot is dumb Matthew K Gray
❍
Re: infoseeks robot is dumb Otis Gospodnetic
●
RE: changes to robots.txt Scott Johnson
●
Not so Friendly Robot - Teleport David McGrath
●
❍
Re: infoseeks robot is dumb DA
❍
Re: infoseeks robot is dumb Hrvoje Niksic
❍
Re: infoseeks robot is dumb Otis Gospodnetic
RFC, draft 1 Martijn Koster
❍
Re: RFC, draft 1 Captain Napalm
❍
Re: RFC, draft 1 Denis McKeon
❍
Re: RFC, draft 1 Hrvoje Niksic
❍
Re: RFC, draft 1 Darren Hardy
❍
Re: RFC, draft 1 Darren Hardy
❍
Re: RFC, draft 1 Hallvard B Furuseth
❍
Re: RFC, draft 1 Martijn Koster
❍
Re: RFC, draft 1 Hrvoje Niksic
❍
Re: RFC, draft 1 Martijn Koster
❍
Re: RFC, draft 1 Martijn Koster
❍
Re: RFC, draft 1 Martijn Koster
❍
Re: RFC, draft 1 Captain Napalm
❍
Re: RFC, draft 1 Klaus Johannes Rusch
❍
Re: RFC, draft 1 Martijn Koster
❍
Re: RFC, draft 1 Klaus Johannes Rusch
❍
RFC, draft 2 (was Re: RFC, draft 1 Martijn Koster
❍
Re: RFC, draft 1 Martijn Koster
❍
Re: RFC, draft 1 Hallvard B Furuseth
❍
Re: RFC, draft 1 Martijn Koster
●
Re: An extended version of the Robots... Martijn Koster
●
Re: infoseeks robot is dumb DA
❍
Re: infoseeks robot is dumb Hrvoje Niksic
❍
Re: infoseeks robot is dumb Otis Gospodnetic
●
Getting a Reply-to: field ... Captain Napalm
●
Re: Getting a Reply-to: field ... Captain Napalm
●
RE: RFC, draft 1 Keiji Kanazawa
●
RE: RFC, draft 1 Keiji Kanazawa
❍
Re: Notification protocol? Peter Jurg
❍
Re: Notification protocol? Erik Selberg
❍
Re: Notification protocol? Issac Roth
●
Re: An extended version of the Robots... Darren Hardy
●
Re: indexing intranet-site Nick Arnett
●
Re: RFC, draft 1 Klaus Johannes Rusch
●
Hipcrime no more Martijn Koster
●
RE: Notification protocol? Larry Fitzpatrick
●
http://HipCrime.com HipCrime
●
Re: http://HipCrime.com Nick Arnett
●
RE: RFC, draft 1 Martijn Koster
●
Re: Virtual (was: RFC, draft 1) Klaus Johannes Rusch
●
Re: Washington again !!! Erik Selberg
●
Re: Washington again !!! Rob Hartill
❍
Re: Washington again !!! Erik Selberg
●
SetEnv a problem Anna Torti
●
User-Agent David Banes
●
Broadness of Robots.txt (Re: Washington again !!!) Brian Clark
❍
Re: Broadness of Robots.txt (Re: Washington again !!!) Thaddeus O. Cooper
❍
Re: Broadness of Robots.txt (Re: Washington again !!!) Erik Selberg
❍
Re: Broadness of Robots.txt (Re: Washington again !!!) Erik Selberg
❍
Re: Broadness of Robots.txt (Re: Washington again !!!) Brian Clark
❍
Re: Broadness of Robots.txt (Re: Washington again !!!) Brian Clark
❍
Re: Broadness of Robots.txt (Re: Washington again !!!) Hrvoje Niksic
❍
Re: Broadness of Robots.txt (Re: Washington again !!!) Martijn Koster
❍
❍
Agent Categories (was Re: Broadness of Robots.txt (Re: Washington again !!!) Martijn Koster
Re: Agent Categories (was Re: Broadness of Robots.txt (Re: Washington again !!!) Erik Selberg
Re: Broadness of Robots.txt (Re: Washington again !!!) Martijn Koster
❍
Re: Broadness of Robots.txt (Re: Washington again !!!) John D. Pritchard
❍
Re: Broadness of Robots.txt (Re: Washington again !!!) Erik Selberg
❍
Re: Broadness of Robots.txt (Re: Washington again !!!) John D. Pritchard
❍
Re: Broadness of Robots.txt (Re: Washington again !!!) John D. Pritchard
❍
Re: Broadness of Robots.txt (Re: Washington again !!!) Hallvard B Furuseth
❍
RE: Broadness of Robots.txt (Re: Washington again !!!) Martin.Soukup
❍
●
Re: Washington again !!! Martijn Koster
●
Re: Washington again !!! Gregory Lauckhart
●
robots.txt HipCrime
❍
Re: robots.txt David M Banes
❍
Re: robots.txt Martijn Koster
❍
Re: robots.txt David Banes
●
Re: User-Agent Klaus Johannes Rusch
●
Re: Broadness of Robots.txt (Re: Washington again !!!) Captain Napalm
●
Re: Broadness of Robots.txt (Re: Washington again !!!) Captain Napalm
❍
Re: Broadness of Robots.txt (Re: Washington again !!!) Erik Selberg
●
must be something in the water Rob Hartill
●
who/what uses robots.txt HipCrime
❍
Re: who/what uses robots.txt Martijn Koster
❍
Re: who/what uses robots.txt Erik Selberg
❍
Re: who/what uses robots.txt HipCrime
❍
Re: who/what uses robots.txt Erik Selberg
❍
Re: who/what uses robots.txt HipCrime
❍
Re: who/what uses robots.txt Erik Selberg
●
❍
Re: who/what uses robots.txt Hrvoje Niksic
❍
Re: who/what uses robots.txt Martijn Koster
❍
Re: who/what uses robots.txt Erik Selberg
❍
Re: who/what uses robots.txt admin@superhot.com
❍
Re: who/what uses robots.txt Terry O'Neill
Re: Broadness of Robots.txt (Re: Washington again !!!) Art Matheny
❍
●
Re: Broadness of Robots.txt (Re: Washington again !!!) Erik Selberg
Re: Broadness of Robots.txt (Re: Washington again !!!) Rob Hartill
❍
Re: Broadness of Robots.txt (Re: Washington again !!!) Erik Selberg
●
Koen Holtman: Content negotiation draft 04 submitted John D. Pritchard
●
Re: who/what uses robots.txt Rob Hartill
●
defining "robot" HipCrime
❍
Re: defining "robot" Matthew K Gray
❍
Re: defining "robot" Hrvoje Niksic
❍
Re: defining "robot" David M Banes
❍
Re: defining "robot" Martin Kiff
❍
Re: defining "robot" Martijn Koster
❍
Re: defining "robot" Hrvoje Niksic
❍
Re: defining "robot" Martin Kiff
❍
Re: defining "robot" Erik Selberg
❍
Re: defining "robot" Martijn Koster
❍
Re: defining "robot" Erik Selberg
❍
Re: defining "robot" Rob Hartill
❍
Re: defining "robot" Erik Selberg
❍
Re: defining "robot" Brian Clark
❍
Re: defining "robot" David M Banes
❍
Re: defining "robot" David M Banes
❍
Re: USER_AGENT and Apache 1.2 Klaus Johannes Rusch
❍
Re: defining "robot" Martijn Koster
❍
Re: defining "robot" Brian Clark
●
Re: defining "robot" Art Matheny
●
define a page? HipCrime
❍
●
Re: define a page? Ross A. Finlayson
robots.txt (A *little* off the subject) Thaddeus O. Cooper
❍
Re: robots.txt (A *little* off the subject) Erik Selberg
❍
Re: robots.txt (A *little* off the subject) Thaddeus O. Cooper
❍
Re: NetJet Rob Hartill
●
Re: define a page? Rob Hartill
●
robot defined HipCrime
❍
●
Re: robot defined Kim Davies
not a robot HipCrime
❍
Re: not a robot Matthew K Gray
❍
Re: not a robot Hrvoje Niksic
●
robot? HipCrime
●
another suggestion Rob Hartill
●
Re: not a robot Rob Hartill
●
ActiveAgent Rob Hartill
●
robot definition Ross Finlayson
●
make people use ROBOTS.txt? HipCrime
❍
Re: make people use ROBOTS.txt? Richard Levitte - VMS Whacker
❍
Re: make people use ROBOTS.txt? Kim Davies
❍
Re: make people use ROBOTS.txt? Kim Davies
❍
Re: make people use ROBOTS.txt?
❍
Re: make people use ROBOTS.txt? Erik Selberg
❍
Re: make people use ROBOTS.txt? John D. Pritchard
❍
Re: make people use ROBOTS.txt? John D. Pritchard
❍
Re: make people use ROBOTS.txt? Nick Arnett
❍
Re: make people use ROBOTS.txt? Erik Selberg
❍
Re: make people use ROBOTS.txt? John D. Pritchard
●
Re: make people use ROBOTS.txt? Benjamin Franz
●
Hip Crime Thomas Bedell
❍
Re: Hip Crime Otis Gospodnetic
❍
Re: Hip Crime John D. Pritchard
●
another rare attack Rob Hartill
●
another dumb robot (possibly) Rob Hartill
●
ActiveAgent Rob Hartill
❍
Re: ActiveAgent HipCrime
❍
Re: ActiveAgent Benjamin Franz
❍
Re: ActiveAgent Issac Roth
❍
Re: robots.txt syntax Captain Napalm
●
●
❍
Re: robots.txt syntax Brent Boghosian
❍
Re: ActiveAgent Kim Davies
❍
Re: ActiveAgent John D. Pritchard
❍
Re: ActiveAgent Brian Clark
❍
Re: ActiveAgent Betsy Dunphy
❍
Re: ActiveAgent John D. Pritchard
❍
Re: ActiveAgent HipCrime
❍
Re: ActiveAgent Nick Dearnaley
❍
Re: ActiveAgent Betsy Dunphy
❍
Re: ActiveAgent Richard Levitte - VMS Whacker
❍
Re: ActiveAgent Ross A. Finlayson
❍
Re: ActiveAgent Betsy Dunphy
❍
Re: ActiveAgent Richard Gaskin
❍
Re: ActiveAgent Richard Levitte - VMS Whacker
❍
Re: ActiveAgent Fred K. Lenherr
❍
Re: ActiveAgent Randy Terbush
❍
Re: ActiveAgent Richard Gaskin - Fourth World
❍
Re: ActiveAgent John Lindroth
❍
Re: ActiveAgent Randy Terbush
❍
Re: ActiveAgent andy@andy.net
❍
Re: ActiveAgent andy@andy.net
❍
Re: ActiveAgent Richard Levitte - VMS Whacker
❍
Re: ActiveAgent Richard Gaskin - Fourth World
❍
Re: ActiveAgent Richard Levitte - VMS Whacker
❍
Re: ActiveAgent Fred K. Lenherr
❍
Re: An extended version of the Robots... Hallvard B Furuseth
❍
Re: An extended version of the Robots... Captain Napalm
❍
Re: ActiveAgent John D. Pritchard
❍
Re: ActiveAgent Captain Napalm
Lycos' HEAD vs. GET Otis Gospodnetic
❍
Re: Lycos' HEAD vs. GET Klaus Johannes Rusch
❍
Re: Lycos' HEAD vs. GET David Banes
java applet sockets John D. Pritchard
❍
●
Re: java applet sockets Art Matheny
Re: spam? (fwd) Otis Gospodnetic
●
Re: spam? (fwd) Otis Gospodnetic
●
hipcrime Rob Hartill
❍
Re: hipcrime bbh@xenodata.com
●
Re: robots.txt (A *little* off the subject) Erik Selberg
●
RE: make people use ROBOTS.txt? Terry Coatta
❍
Re: make people use ROBOTS.txt? Hrvoje Niksic
❍
Re: make people use ROBOTS.txt? Erik Selberg
●
IROS 97 Call for Papers John D. Pritchard
●
Re[2]: Lycos' HEAD vs. GET Shadrach Todd
●
Re: user-agent in Java Captain Napalm
●
USER_AGENT and Apache 1.2 Rob Hartill
❍
●
NetJet dpp@peg.apc.org
❍
●
●
Re: USER_AGENT and Apache 1.2 Martijn Koster
Re: NetJet Rob Hartill
Standard Joseph Whitmore
❍
Re: Standard Michael Göckel
❍
Re: Standard John D. Pritchard
❍
Re: Standard Hrvoje Niksic
❍
Re: Standard Greg Fenton
❍
Re: Standard Joseph Whitmore
Servers vs Agents Davis, Ian
❍
Re: Servers vs Agents David M Banes
❍
Re: Standard? Captain Napalm
❍
Who are you robots.txt? was Re: Servers vs Agents John D. Pritchard
❍
Re: Who are you robots.txt? was Re: Servers vs Agents david jost
●
unix robot LIAM GUINANE
●
Re: Servers vs Agents Rob Hartill
●
Re: Servers vs Agents Martin Kiff
❍
●
Re: Servers vs Agents Erik Selberg
Standard? admin@superhot.com
❍
Re: Standard? Kim Davies
❍
Re: Standard? Martin Kiff
❍
Re: Standard? Brian Clark
❍
Re: Standard? Richard Levitte - VMS Whacker
❍
Re: Standard?
❍
Re: Standard?
❍
Re: Standard? Nigel Rantor
❍
legal equivalence of fax and email was Re: Standard? John D. Pritchard
❍
Re: legal equivalence of fax and email was Re: Standard? Rob Hartill
❍
Junkie-Mail was Re: Standard? John D. Pritchard
❍
Re: Junkie-Mail was Re: Standard? Brian Clark
❍
Re: Junkie-Mail was Re: Standard? John D. Pritchard
❍
Re: Junkie-Mail was Re: Standard? DA
❍
Re: Junkie-Mail was Re: Standard? Gary L. Burt
❍
Re: Junkie-Mail was Re: Standard? Brian Clark
❍
Re: Standard? Nick Arnett
❍
Re: Standard? John D. Pritchard
❍
Re: Standard? Davis, Ian
❍
Re: Standard? Randy Fischer
❍
Re: Standard? Richard Gaskin
❍
Re: Standard? Joseph Whitmore
❍
Re: Standard? Gary L. Burt
●
Cache Filler Nigel Rantor
●
an article... (was: Re: Standard?) Greg Fenton
●
[...]Re: Cache Filler Benjamin Franz
●
Re: Cache Filler Ian Graham
●
Re: an article... (was: Re: Standard?) Ian Graham
●
Re: an article... (was: Re: Standard?) Rob Hartill
●
Re: Servers vs Agents Art Matheny
●
Re: an article... (was: Re: Standard?) Nigel Rantor
●
Re: [...]Re: Cache Filler Carlos Horowicz
●
Regexp Library Cook-off Tim Bunce
●
Re: Standard? Nigel Rantor
❍
Re: Standard? Erik Selberg
❍
Re: Standard? Denis McKeon
●
Re: robot ? Baron Timothy de Vallee
●
Lets get on task! {was Re: Standard} Wes Miller
❍
Re: Lets get on task! {was Re: Standard} Ross A. Finlayson
●
What Is wwweb Alvaro Muñoz-Aycuens Martinez
●
Re: robot ? David Banes
●
●
WebCrawler & Excite Otis Gospodnetic
❍
Re: WebCrawler & Excite Nick Arnett
❍
Re: WebCrawler & Excite Brian Pinkerton
❍
Re: WebCrawler & Excite Nick Arnett
[Fwd: WebCrawler & Excite] Wes Miller
❍
Re: WebCrawler & Excite Otis Gospodnetic
●
Re: Standard? Nigel Rantor
●
Re: WebCrawler & Excite Otis Gospodnetic
●
Please take Uninvited Email discussion elsewhere Martijn Koster
●
Just when you thought it might be interesting to standardize Dave Bakin
❍
❍
●
●
Re: Just when you thought it might be interesting to standardize robots.txt... Klaus Johannes Rusch
Re: Just when you thought it might be interesting to standardize Otis Gospodnetic
Server Indexing -- Helping a Robot Out Ian Graham
❍
Re: Server Indexing -- Helping a Robot Out Martijn Koster
❍
Re: Server Indexing -- Helping a Robot Out Ian Graham
stingy yahoo server? Mark Norman
❍
Re: stingy yahoo server? Klaus Johannes Rusch
❍
Re: stingy yahoo server? Klaus Johannes Rusch
❍
Re: stingy yahoo server? Dan Gildor
❍
Re: stingy yahoo server? Klaus Johannes Rusch
●
IIS and If-modified-since [was Re: stingy yahoo server?] Dan Gildor
●
Crawlers and "dynamic" urls David Koblas
❍
Re: Crawlers and "dynamic" urls Martijn Koster
❍
Re: Crawlers and "dynamic" urls olly@muscat.co.uk
❍
Re: Crawlers and "dynamic" urls olly@muscat.co.uk
●
The Big Picture(sm): Visual Browsing in Web and non-Web Databases Gerry McKiernan
●
Re: Crawlers and "dynamic" urls Klaus Johannes Rusch
❍
Re: Crawlers and "dynamic" urls Ian Graham
●
Re: IIS and If-modified-since Lee Fisher
●
Merry Christmas, spidie-boyz&bottie-girlz! Santa Claus
●
Merry Christmas, HipXmas-SantaSpam! Santa Claus
●
USER_AGENT spoofing Rob Hartill
●
Web pages being served from an SQL database Eric Mackie
❍
Re: Web pages being served from an SQL database Randy Fischer
❍
Re: Web pages being served from an SQL database Brian Clark
●
Re: Web pages being served from an SQL database Sigfrid Lundberg
●
RE: Web pages being served from an SQL database Martin.Soukup
●
Netscape-Catalog-Robot Rob Hartill
❍
●
Re: Netscape-Catalog-Robot Nick Arnett
It's not only robots we have to worry about ... Captain Napalm
❍
Re: It's not only robots we have to worry about ... Rob Hartill
❍
Re: It's not only robots we have to worry about ... Simon Powell
●
Re: It's not only robots we have to worry about ... Rob Hartill
●
Re: Remember Canseco..... YukYukYo69@aol.com
●
Re: Remember Canseco..... Lilian Bartholo
●
Re: RE: Scalpers (SJPD does crack down) Tim Simmons
●
FS- Sharks Tkts- Front Row (2nd Deck) Jan 13 Howard Strachman
●
FS-Jan 7 Row 1 Sec 211 Anne Greene
●
For Sale for 12/26! Reynolds, Cathy A
●
Game tonight Karmy T. Kays
❍
Re: Game tonight
●
Good HREFs vs Bogus HREFs: 80/20 mike mulligan
●
Returned mail: User unknown Mail Delivery Subsystem
●
We need to Shut down Roenick RaffiK98@aol.com
●
RE: shot clock?!.... Mark Fullerton
●
Forsberg Laura M. Sebastian
●
Quick--who knows listproc? Bonnie Scott
●
Error Condition Re: Invalid request listproc@plaidworks.com
●
Cyclones sign MacLeod clonez1@one.net
●
Error Condition Re: Invalid request listproc@plaidworks.com
●
Re: WRITERS WANTED (re-post) Greg Tanzola
●
ADMIN: mailing list attack :-( Martijn Koster
❍
Re: ADMIN: mailing list attack :-( Martijn Koster
●
RE: Netscape-Catalog-Robot Ian King
●
Referencing dynamic pages Thomas Merlin
●
Re: help Martijn Koster
●
Excite Authors? Randy Terbush
●
Do robots have to follow links ? Thomas Merlin
❍
●
Re: Do robots have to follow links ? Theo Van Dinter
Frames ? Lycos ? Thomas Merlin
●
❍
Re: Frames ? Lycos ? Mitch Allen
❍
Re: Frames ? Lycos ? Mitch Allen
Do robots have to follow links ? Ross A. Finlayson
❍
Re: Do robots have to follow links ? Nick Arnett
●
Re: Frames ? Lycos ? Theo Van Dinter
●
Lycos Thomas Merlin
❍
●
●
Re: Lycos Klaus Johannes Rusch
Meta refresh tags John Heard
❍
Re: Meta refresh tags Martijn Koster
❍
Re: Meta refresh tags Julian Smith
Cron <robh@us2> /usr/home/robh/show_robots (fwd) Rob Hartill
❍
Re: Cron <robh@us2> /usr/home/robh/show_robots (fwd) olly@muscat.co.uk
●
Re: Cron <robh@us2> /usr/home/robh/show_robots (fwd) Sigfrid Lundberg
●
email address grabber Dorian Ellis
❍
Re: email address grabber Robert Raisch, The Internet Company
❍
Re: email address grabber Ken nakagama
❍
Re: email address grabber Wes Miller
●
Re: email address grabber Art Matheny
●
Re: email address grabber Klaus Johannes Rusch
●
Re: email address grabber Jeff Drost
●
Re: email address grabber Issac Roth
●
robot source code v.sreekanth
❍
Re: robot source code =?ISO-8859-1?Q?Alvaro_Mu=F1oz-Aycuens_Martinez?=
❍
Re: robot source code Dan Howard
●
Meta Tag Article Mitch Allen
●
Re: email address grabber Jeff Drost
●
re Email Grabber Steve Nisbet
●
Re: email address grabber Captain Napalm
●
Re: email address grabber Captain Napalm
●
Re: email address grabber Art Matheny
●
re: email grabber Rich Dorfman
●
re: email grabber Richard Gaskin
●
Re: email grabber Chris Brown
●
SpamBots Mitch Allen
❍
Re: SpamBots Richard Gaskin
❍
Re: SpamBots Mitch Allen
●
an image observer Martin Paff
●
Re: email grabber Andy Rollins
❍
More Robot Talk (was Re: email grabber) Captain Napalm
❍
Re: More Robot Talk (was Re: email grabber) bbh@xenodata.com
●
Re: an image observer David Steele
●
Re: email grabber Wendell B. Kozak
●
Re[2]: SpamBots Brad Fox
●
Re: email grabber Robert Raisch, The Internet Company
●
Re[3]: SpamBots
●
Re: More Robot Talk (was Re: email grabber) Captain Napalm
●
RE: email grabber Ian King
●
Re: More Robot Talk Nick Arnett
●
Too Many Admins (TMA) !!! HipCrime
●
Re: More Robot Talk Theo Van Dinter
●
Re: More Robot Talk bbh@xenodata.com
●
escaped vs unescaped urls Dan Gildor
●
●
●
❍
Re: escaped vs unescaped urls =?ISO-8859-1?Q?Jaakko_Hyv=E4tti?=
❍
Re: escaped vs unescaped urls =?ISO-8859-1?Q?Jaakko_Hyv=E4tti?=
❍
Re: Q: size of the web in bytes, comprehensive list DA
Q: size of the web in bytes, comprehensive list Yossi Cohen
❍
Re: Q: size of the web in bytes, comprehensive list Thomas R. Bedell
❍
Re: Q: size of the web in bytes, comprehensive list Otis Gospodnetic
❍
Re: Q: size of the web in bytes, comprehensive list Noah Parker
"real-time" spidering by Lycos Otis Gospodnetic
❍
Re: "real-time" spidering by Lycos Danny Sullivan
❍
Re: "real-time" spidering by Lycos Otis Gospodnetic
Info on large scale spidering? Nick Craswell
❍
Re: Info on large scale spidering? Greg Fenton
❍
Re: Info on large scale spidering? Nick Arnett
❍
Re: Info on large scale spidering? Nick Craswell
●
Re: Info on large scale spidering? Otis Gospodnetic
●
AltaVista Meta Tag Rumour Mitch Allen
❍
Re: AltaVista Meta Tag Rumour Danny Sullivan
❍
Re: AltaVista Meta Tag Rumour Erik Selberg
●
indexing via redirectors Patrick Berchtold
❍
Re: indexing via redirectors Sigfrid Lundberg
❍
Re: indexing via redirectors Hrvoje Niksic
❍
Re: indexing via redirectors olly@muscat.co.uk
❍
Re: indexing via redirectors Mike Burner
●
fetching .map files Patrick Berchtold
●
Re: indexing via redirectors Eric Miller
●
Re: indexing via redirectors Theo Van Dinter
●
Re: indexing via redirectors Sigfrid Lundberg
❍
Re: indexing via redirectors Sigfrid Lundberg
❍
Re: indexing via redirectors Hrvoje Niksic
●
Re: indexing via redirectors Benjamin Franz
●
Re: indexing via redirectors Eric Miller
●
RE: indexing via redirectors Martin.Soukup
●
Meta Tags =?ISO-8859-1?Q?Alvaro_Mu=F1oz-Aycuens_Martinez?=
❍
Re: Meta Tags Jeff Drost
❍
Re: Meta Tags Martijn Koster
❍
Re: Meta Tags Danny Sullivan
❍
Re: Meta Tags Eric Miller
●
Re: indexing via redirectors Jeff Drost
●
Re: indexing via redirectors Captain Napalm
❍
Re: indexing via redirectors Hrvoje Niksic
●
Have you used the Microsoft Active-X Internet controls for Visual Basic? (Or know
someone who does?) Richard Edwards
●
Crawling & DNS issues Neil Cotty
●
❍
Re: Meta Tags Jeff Drost
❍
Re: Crawling & DNS issues Neil Cotty
❍
Re: Crawling & DNS issues David L. Sifry
❍
Re: Crawling & DNS issues Martin Beet
❍
Re: Crawling & DNS issues bbh@xenodata.com
robot meta tags Dan Gildor
❍
Re: robot meta tags Theo Van Dinter
●
Re: Crawling & DNS issues Otis Gospodnetic
●
Re: Crawling & DNS issues Srinivas Padmanabhuni
●
Re: Info on large scale spidering? Otis Gospodnetic
❍
Re: Info on large scale spidering? Martin Hamilton
●
Inktomi & large scale spidering Otis Gospodnetic
❍
Re: Inktomi & large scale spidering andy@andy.net
❍
Re: Inktomi & large scale spidering Martin Hamilton
❍
Re: Inktomi & large scale spidering Otis Gospodnetic
❍
Re: Inktomi & large scale spidering Nick Arnett
❍
Re: Inktomi & large scale spidering Erik Selberg
●
Single tar (Re: Inktomi & large scale spidering) =?ISO-8859-1?Q?Jaakko_Hyv=E4tti?=
●
Meta Tags only on home page ? Thomas Merlin
●
Re: Single tar (Re: Inktomi & large scale spidering) Sigfrid Lundberg
❍
Re: Meta Tags only on home page ? Jon Knight
❍
Re: Meta Tags only on home page ? Klaus Johannes Rusch
❍
Re: Single tar (Re: Inktomi & large scale spidering) Erik Selberg
●
Re: Single tar (Re: Inktomi & large scale spidering) Martin Hamilton
●
robots & copyright law Tony Rose
❍
Re: robots & copyright law Wayne Rust
❍
Re: robots & copyright law Gary L. Burt
❍
Re: robots & copyright law Danny Sullivan
❍
Re: robots & copyright law Nick Arnett
❍
Re: robots & copyright law Mitch Allen
●
Re: Single tar (Re: Inktomi & large sca Howard, Dan: CIO
●
Re: Single tar (Re: Inktomi & large scale spidering) Rob Hartill
●
Referencing dynamic pages Thomas Merlin
●
Need help on Search Engine accuracy test. HiPromote@aol.com
●
❍
Re: Referencing dynamic pages Klaus Johannes Rusch
❍
Re: Need help on Search Engine accuracy test. Nick Arnett
❍
Re: Need help on Search Engine accuracy test. Nick Craswell
Need help again. HiPromote@aol.com
❍
●
Re: Need help again. Nick Arnett
Question about Robot.txt =?ISO-8859-1?Q?Alvaro_Mu=F1oz-Aycuens_Martinez?=
❍
Re: Question about Robot.txt Klaus Johannes Rusch
●
Re: Analysing the Web (was Re: Info on large scale spidering?) Patrick Berchtold
●
Robot Specifications. Paul Bingham
●
Agent Specification Paul Bingham
●
specialized searches Mike Fresener
●
NaughtyRobot Martijn Koster
●
Unfriendly robot at 192.115.187.2 Tim Holt
●
Information about AltaVista and Excite Huaiyu Liu
❍
●
FILEZ bbh@xenodata.com
❍
●
Re: Information about AltaVista and Excite Klaus Johannes Rusch
Re: FILEZ Barry A. Dobyns
Re: Single tar (Re: Inktomi & large scale spidering) Martin Hamilton
❍
Re: Single tar (Re: Inktomi & large scale spidering) Simon Wilkinson
❍
Re: Single tar (Re: Inktomi & large scale spidering) Sigfrid Lundberg
●
i need a bot! Myles Weissleder
●
robots: lycos's t-rex: strange behaviour Dinesh
●
WININET caching Martin.Soukup
●
Welcome to cypherpunks Majordomo@toad.com
●
Re: Welcome to cypherpunks Skip Montanaro
●
More with the Cypherpunk antics Captain Napalm
❍
●
●
●
Re: More with the Cypherpunk antics Hrvoje Niksic
The Metacrawler, Reborn Paul Phillips
❍
Re: More with the Cypherpunk antics Martijn Koster
❍
Re: More with the Cypherpunk antics Chad Zimmerman
Java and robots... Manuel J. Kwak
❍
Re: Java and robots... John_R_R_Leavitt@NL.CS.CMU.EDU
❍
Re: Java and robots... Art Matheny
Lack of support for "If-Modified-Since" Howard, Dan: CIO
❍
Re: Lack of support for "If-Modified-Since" John W. James
❍
Re: Lack of support for "If-Modified-Since" mike mulligan
●
Re: Lack of support for "If-Modified-Since" Rob Hartill
●
How to get the document info ? mannina bruno
●
Thanks! Manuel Jesus Fernandez Blanco
●
RE: How to get the document info ? Howard, Dan: CIO
●
Re: message to USSA House of Representatives autoresponder@WhiteHouse.gov
●
New Site Aaron Stayton
●
New Site Aaron Stayton
●
Re: message to USSA Senate autoresponder@WhiteHouse.gov
●
Re: More with the Cypherpunk antics Benjamin Franz
●
Re: More with the Cypherpunk antics Sigfrid Lundberg
●
Re: More with the Cypherpunk antics Rob Hartill
●
Re: More with the Cypherpunk antics Jeff Drost
●
AMDIN: The list is dead Martijn Koster
❍
●
Re: AMDIN: The list is dead Martijn Koster
[3]RE>[5]RE>Checking Log fi Roger Dearnaley
❍
Re: [3]RE>[5]RE>Checking Log fi Gordon Bainbridge
Last message date: Thu 18 Dec 1997 - 14:33:60 PDT
Archived on: Sun Aug 17 1997 - 19:13:25 PDT
● Messages sorted by: [ date ][ subject ][ author ]
●
Other mail archives
This archive was generated by hypermail 1.02.
http://info.webcrawler.com/mailing-lists/robots/index.html (61 of 61) [18.02.2001 13:19:28]
Robots in the Web: threat or treat?
The Web Robots Pages
Robots in the Web: threat or treat?
Martijn Koster, NEXOR April 1995
[1997: Updated links and addresses]
ABSTRACT
Robots have been operating in the World-Wide Web for over a year. In that time they have performed
useful tasks, but also on occasion wreaked havoc on the networks. This paper investigates the
advantages and disadvantages of robots, with an emphasis on robots used for resource discovery. New
alternative resource discovery strategies are discussed and compared. It concludes that while current
robots will be useful in the immediate future, they will become less effective and more problematic as
the Web grows.
INTRODUCTION
The World Wide Web [1] has become highly popular in the last few years, and is now one of the
primary means of information publishing on the Internet. When the size of the Web increased beyond
a few sites and a small number of documents, it became clear that manual browsing through a
significant portion of the hypertext structure is no longer possible, let alone an effective method for
resource discovery.
This problem has prompted experiments with automated browsing by "robots". A Web robot is a
program that traverses the Web's hypertext structure by retrieving a document, and recursively
retrieving all documents that are referenced. These programs are sometimes called "spiders", "web
wanderers", or "web worms". These names, while perhaps more appealing, may be misleading, as the
term "spider" and "wanderer" give the false impression that the robot itself moves, and the term
"worm" might imply that the robot multiplies itself, like the infamous Internet worm [2]. In reality
robots are implemented as a single software system that retrieves information from remote sites using
standard Web protocols.
ROBOT USES
Robots can be used to perform a number of useful tasks:
Statistical Analysis
The first robot [3] was deployed to discover and count the number of Web servers. Other statistics
could include the average number of documents per server, the proportion of certain file types, the
average size of a Web page, the degree of interconnectedness, etc.
Maintenance
One of the main difficulties in maintaining a hypertext structure is that references to other pages may
become "dead links", when the page referred to is moved or even removed. There is currently no
general mechanism to proactively notify the maintainers of the referring pages of this change. Some
servers, for example the CERN HTTPD, will log failed requests caused by dead links, along with the
reference of the page where the dead link occurred, allowing for post-hoc manual resolution. This is
not very practical, and in reality authors only find that their documents contain bad links when they
notice themselves, or in the rare case that a user notifies them by e-mail.
A robot that verifies references, such as MOMspider [4], can assist an author in locating these dead
links, and as such can assist in the maintenance of the hypertext structure. Robots can help maintain
the content as well as the structure, by checking for HTML [5] compliance, conformance to style
guidelines, regular updates, etc., but this is not common practice. Arguably this kind of functionality
should be an integrated part of HTML authoring environments, as these checks can then be repeated
when the document is modified, and any problems can be resolved immediately.
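By way of illustration, a minimal sketch of the reference check such a maintenance robot performs might
look as follows (Python, standard library only; the start URL is a placeholder): fetch a page, extract its
anchors, and issue a HEAD request for each target, reporting those that fail.

# Minimal sketch of a reference-verifying robot (in the spirit of MOMspider):
# fetch one page, extract its links, and report those that no longer resolve.
# Uses only the Python standard library; the start URL is a placeholder.
import urllib.request
import urllib.error
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.links.append(value)

def check_links(page_url):
    with urllib.request.urlopen(page_url) as resp:
        parser = LinkCollector()
        parser.feed(resp.read().decode("utf-8", errors="replace"))
    for href in parser.links:
        target = urljoin(page_url, href)            # resolve relative references
        if not target.startswith("http"):
            continue                                # skip mailto:, ftp:, fragments, ...
        try:
            req = urllib.request.Request(target, method="HEAD")
            with urllib.request.urlopen(req, timeout=10):
                pass                                # request succeeded: link is alive
        except (urllib.error.HTTPError, urllib.error.URLError) as err:
            print("dead link on %s: %s (%s)" % (page_url, target, err))

check_links("http://www.example.com/index.html")    # placeholder URL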
Mirroring
Mirroring is a popular technique for maintaining FTP archives. A mirror copies an entire directory
tree recursively by FTP, and then regularly retrieves those documents that have changed. This allows
load sharing, redundancy to cope with host failures, faster and cheaper local access, and off-line
access.
In the Web mirroring can be implemented with a robot, but at the time of writing no sophisticated
mirroring tools exist. There are some robots that will retrieve a subtree of Web pages and store it
locally, but they don't have facilities for updating only those pages that have changed. A second
problem unique to the Web is that the references in the copied pages need to be rewritten: where they
reference pages that have also been mirrored they may need to be changed to point to the copies, and
where relative links point to pages that haven't been mirrored they need to be expanded into absolute
links. The need for mirroring tools for performance reasons is much reduced by the arrival of
sophisticated caching servers [6], which do offer selective updates, can guarantee that a cached
document is up-to-date, and are largely self maintaining. However, it is expected that mirroring tools
will be developed in due course.
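As a small illustration of the link-rewriting step described above, the following Python sketch (the
subtree prefix and URLs are placeholders) expands links that leave the mirrored subtree into absolute
URLs, while leaving links within the subtree relative so they keep pointing at the local copies.

# Sketch of the link-rewriting step a Web mirroring tool needs.
from urllib.parse import urljoin

MIRRORED_PREFIX = "http://www.example.com/docs/"   # placeholder subtree being mirrored

def rewrite_link(page_url, href):
    absolute = urljoin(page_url, href)             # resolve against the page it came from
    if absolute.startswith(MIRRORED_PREFIX):
        return href                                # stays relative, points into the mirror
    return absolute                                # expanded, points back at the original site

print(rewrite_link("http://www.example.com/docs/intro.html", "chapter1.html"))
print(rewrite_link("http://www.example.com/docs/intro.html", "../other/page.html"))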
Resource discovery
Perhaps the most exciting application of robots is their use in resource discovery. Where humans
cannot cope with the amount of information it is attractive to let the computer do the work. There are
several robots that summarise large parts of the Web, and provide access to a database with these
results through a search engine.
This means that rather than relying solely on browsing, a Web user can combine browsing and
searching to locate information; even if the database doesn't contain the exact item you want to
retrieve, it is likely to contain references to related pages, which in turn may reference the target item.
The second advantage is that these databases can be updated automatically at regular intervals, so that
dead links in the database will be detected and removed. This in contrast to manual document
maintenance, where verification is often sporadic and not comprehensive. The use of robots for
resource discovery will be further discussed below.
Combined Uses
A single robot can perform more than one of the above tasks. For example the RBSE Spider [7] does
statistical analysis of the retrieved documents as well as providing a resource discovery database. Such
combined uses are unfortunately quite rare.
OPERATIONAL COSTS AND DANGERS
The use of robots comes at a price, especially when they are operated remotely on the Internet. In this
section we will see that robots can be dangerous in that they place high demands on the Web.
Network resource and server load
Robots require considerable bandwidth. Firstly robots operate continually over prolonged periods of
time, often months. To speed up operations many robots feature parallel retrieval, resulting in a
consistently high use of bandwidth in the immediate proximity. Even remote parts of the network can
feel the network resource strain if the robot makes a large number of retrievals in a short time ("rapid
fire"). This can result in a temporary shortage of bandwidth for other uses, especially on
low-bandwidth links, as the Internet has no facility for protocol-dependent load balancing.
Traditionally the Internet has been perceived to be "free", as the individual users did not have to pay
for its operation. This perception is coming under scrutiny, as especially corporate users do feel a
direct cost associated with network usage. A company may feel that the service to its (potential)
customers is worth this cost, but that automated transfers by robots are not.
Besides placing demands on the network, a robot also places extra demand on servers. Depending on the
frequency with which it requests documents from the server this can result in a considerable load,
which results in a lower level of service for other Web users accessing the server. Especially when the
host is also used for other purposes this may not be acceptable. As an experiment the author ran a
simulation of 20 concurrent retrievals from his server running the Plexus server on a Sun 4/330.
Within minutes the machine slowed down to a crawl and was unusable for anything. Even with only
consecutive retrievals the effect can be felt. Only the week that this paper was written a robot visited
the author's site with rapid fire requests. After 170 consecutive retrievals the server, which had been
operating fine for weeks, crashed under the extra load.
This shows that rapid fire needs to be avoided. Unfortunately even modern manual browsers (e.g.
Netscape) contribute to this problem by retrieving in-line images concurrently. The Web's protocol,
HTTP [8], has been shown to be inefficient for this kind of transfer [9], and new protocols are being
designed to remedy this [10].
Updating overhead
It has been mentioned that databases generated by robots can be automatically updated. Unfortunately
there is no efficient change control mechanism in the Web; there is no single request that can
determine which of a set of URLs has been removed, moved, or modified.
The HTTP does provide the "If-Modified-Since" mechanism, whereby the user-agent can specify the
modification time-stamp of a cached document along with a request for the document. The server will
then only transfer the contents if the document has been modified since it was cached.
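A minimal sketch of such a conditional retrieval, in Python with only the standard library (the URL and
timestamp are placeholders):

# Sketch of a conditional retrieval using If-Modified-Since: the server replies
# 304 Not Modified (no body) if the document is unchanged since the timestamp,
# otherwise it returns the new contents.
import urllib.request
import urllib.error

def fetch_if_modified(url, last_retrieved):
    req = urllib.request.Request(url, headers={"If-Modified-Since": last_retrieved})
    try:
        with urllib.request.urlopen(req) as resp:
            return resp.read()                     # changed: re-index this document
    except urllib.error.HTTPError as err:
        if err.code == 304:
            return None                            # unchanged: keep the stored summary
        raise

body = fetch_if_modified("http://www.example.com/index.html",
                         "Sat, 29 Oct 1994 19:43:31 GMT")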
This facility can only be used by a robot if it retains the relationship between the summary data it
extracts from a document, its URL, and the timestamp of the retrieval. This places extra requirements
on the size and complexity of the database, and is not widely implemented.
Client-side robots/agents
The load on the network is especially an issue with the category of robots that are used by end-users,
and implemented as part of a general purpose Web client (e.g. the Fish Search [11] and the tkWWW
robot [12]). One feature that is common in these end-user robots is the ability to pass on search-terms
to search engines found while traversing the Web. This is touted as improving resource discovery by
querying several remote resource discovery databases automatically. However it is the author's
opinion that this feature is unacceptable for two reasons. Firstly a search operation places a far higher
load on a server than a simple document retrieval, so a single user can cause a considerable overhead
on several servers in a far shorter period than normal. Secondly, it is a fallacy to assume that the same
search-terms are relevant, syntactically correct, let alone optimal for a broad range of databases, and
the range of databases is totally hidden from the user. For example, the query "Ford and garage" could
be sent to a database on 17th century literature, a database that doesn't support Boolean operators, or a
database that specifies that queries specific to automobiles should start with the word "car:". And the
user isn't even aware of this.
Another dangerous aspect of a client-side robot is that once it is distributed no bugs can be fixed, no
knowledge of problem areas can be added and no new efficient facilities can be taken advantage of, as
not everyone will upgrade to the latest version.
The most dangerous aspect however is the sheer number of possible users. While some people are
likely to use such a facility sensibly, i.e. bounded by some maximum, on a known local area of the
web, and for a short period of time, there will be people who will abuse this power, through ignorance
or arrogance. It is the author's opinion that remote robots should not be distributed to end-users, and
fortunately it has so far been possible to convince at least some robot authors to cancel releases [13].
Even without the dangers client-side robots pose an ethical question: where the use of a robot may be
acceptable to the community if its data is then made available to the community, client-side robots
may not be acceptable as they operate only for the benefit of a single user. The ethical issues will be
discussed further below.
End-user "intelligent agents" [14] and "digital assistants" are currently a popular research topic in
computing, and often viewed as the future of networking. While this may indeed be the case, and it is
already apparent that automation is invaluable for resource discovery, a lot more research is required
for them to be effective. Simplistic user-driven Web robots are far removed from intelligent network
agents: an agent needs to have some knowledge of where to find specific kinds of information (i.e.
which services to use) rather than blindly traversing all information. Compare the situation where a
person is searching for a book shop; they use the Yellow Pages for a local area, find the list of shops,
select one or a few, and visit those. A client-side robot would walk into all shops in the area asking for
books. On a network, as in real life, this is inefficient on a small scale, and prohibitive on a larger
scale.
Bad Implementations
The strain placed on the network and hosts is sometimes increased by bad implementations of
especially newly written robots. Even if the protocol and URLs sent by the robot are correct, and the
robot correctly deals with the returned protocol (including more advanced features such as redirection),
there are some less-obvious problems.
The author has observed several identical robot runs accessing his server. While in some cases this
was caused by people using the site for testing (instead of a local server), in some cases it became
apparent that this was caused by lax implementation. Repeated retrievals can occur when either no
history of accessed locations is stored (which is unforgivable), or when a robot does not recognise
cases where several URLs are syntactically equivalent, e.g. where different DNS aliases for the same
IP address are used, or where URLs aren't canonicalised by the robot, e.g. "foo/bar/../baz.html" is
equivalent to "foo/baz.html".
Some robots retrieve document types, such as GIFs and PostScript, which they cannot
handle and thus ignore.
Another danger is that some areas of the web are near-infinite. For example, consider a script that
returns a page with a link to one level further down. This will start with for example "/cgi-bin/pit/",
and continue with "/cgi-bin/pit/a/", "/cgi-bin/pit/a/a/", etc. Because such URL spaces can trap robots
that fall into them, they are often called "black holes". See also the discussion of the Proposed
Standard for Robot Exclusion below.
CATALOGUING ISSUES
That resource discovery databases generated by robots are popular is undisputed. The author himself
regularly uses such databases when locating resources. However, there are some issues that limit the
applicability of robots to Web-wide resource discovery.
There is too much material, and it's too dynamic
One measure of effectiveness of an information retrieval approach is "recall", the fraction of all
relevant documents that were actually found. Brian Pinkerton [15] states that recall in Internet
indexing systems is adequate, as finding enough relevant documents is not the problem. However, if
one considers the complete set of information available on the Internet as a basis, rather than the
database created by the robot, recall cannot be high, as the amount of information is enormous, and
changes are very frequent. So in practice a robot database may not contain a particular resource that is
available, and this will get worse as the Web grows.
Determining what to include/exclude
A robot cannot automatically determine if a given Web page should be included in its index. Web
servers may serve documents that are only relevant to a local context (for example an index of an
internal library), or that exist only temporarily, etc. To a certain extent the decision of what is relevant
also depends on the audience, which may not have been identified at the time the robot operates. In
practice robots end up storing almost everything they come across. Note that even if a robot
could decide whether a particular page is to be excluded from its database, it has already incurred the cost
of retrieving the file; a robot that decides to ignore a high percentage of documents is very wasteful.
In an attempt to alleviate this situation somewhat the robot community has adopted "A Standard for
Robot exclusion" [16]. This standard describes the use of a simple structured text file available at
a well-known place on a server ("/robots.txt") to specify which parts of its URL space should be
avoided by robots (see Figure 1). This facility can also be used to warn robots about black holes.
Individual robots can be given specific instructions, as some may behave more sensibly than others, or
are known to specialise in a particular area. This standard is voluntary, but is very simple to
implement, and there is considerable public pressure for robots to comply.
Determining how to traverse the Web is a related problem. Given that most Web servers are organised
hierarchically, a breadth-first traversal from the top to a limited depth is likely to more quickly find a
broader and higher-level set of documents and services than a depth-first traversal, and is therefore
much preferable for resource discovery. However, a depth-first traversal is more likely to find
individual users' home pages with links to other, potentially new, servers, and is therefore more likely
to find new sites to traverse.
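A minimal sketch of such a breadth-first, depth-limited traversal in Python; the depth cap also keeps the
robot out of the near-infinite URL spaces ("black holes") discussed earlier. The get_links function is an
assumption: it is expected to return the absolute URLs referenced by a page, for example using a parser
like the LinkCollector sketched above.

# Sketch of a breadth-first traversal with a depth limit.
from collections import deque

def crawl(start_url, get_links, max_depth=3):
    visited = {start_url}
    queue = deque([(start_url, 0)])                # (url, depth) pairs; FIFO = breadth-first
    while queue:
        url, depth = queue.popleft()
        print("visiting", url)
        if depth == max_depth:
            continue                               # don't descend any further
        for link in get_links(url):
            if link not in visited:
                visited.add(link)
                queue.append((link, depth + 1))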
# /robots.txt for http://www.site.com/

User-agent: *                # attention all robots:
Disallow: /cyberworld/map    # infinite URL space
Disallow: /tmp/              # temporary files

Figure 1: An example robots.txt file
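A robot written in Python can honour an exclusion file like the one in Figure 1 with the standard
urllib.robotparser module; a minimal sketch (the URLs and the robot name are placeholders):

# Sketch of checking /robots.txt before retrieving a URL.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.set_url("http://www.site.com/robots.txt")
rp.read()                                          # fetch and parse the exclusion file

for url in ("http://www.site.com/cyberworld/map/index.html",
            "http://www.site.com/services.html"):
    if rp.can_fetch("MyRobot/1.0", url):           # user-agent string of the robot
        print("allowed: ", url)
    else:
        print("excluded:", url)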
Summarising documents
It is very difficult to index an arbitrary Web document. Early robots simply stored document titles and
anchor texts, but newer robots use more advanced mechanisms and generally consider the entire
content.
These methods are good general measures, and can be automatically applied to all Web pages, but
cannot be as effective as manual indexing by the author. HTML provides a facility to attach general
meta information to documents, by specifying a <META> element, e.g. <meta name="Keywords"
content="Ford Car Maintenance">. However, no semantics have (yet) been defined for specific values
of the attributes of this tag, and this severely limits its acceptance, and therefore its usefulness.
This results in a low "precision", the proportion of the total number of documents retrieved that is
relevant to the query. Advanced features such as Boolean operators, weighted matches like WAIS, or
relevance feedback can improve this, but given that the information on the Internet is enormously
diverse, this will continue to be a problem.
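Where authors do supply such META information, an indexing robot can extract it mechanically. A
minimal Python sketch using the standard html.parser module (the sample page is invented):

# Sketch of extracting author-supplied <META> keywords from a page.
from html.parser import HTMLParser

class MetaKeywords(HTMLParser):
    def __init__(self):
        super().__init__()
        self.keywords = []
    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        d = dict(attrs)
        if (d.get("name") or "").lower() == "keywords":
            text = d.get("content") or ""
            self.keywords.extend(k.strip() for k in text.split(",") if k.strip())

parser = MetaKeywords()
parser.feed('<html><head><meta name="Keywords" content="Ford, Car, Maintenance">'
            '</head><body>...</body></html>')
print(parser.keywords)                             # ['Ford', 'Car', 'Maintenance']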
Classifying documents
Web users often ask for a "subject hierarchy" of documents in the Web. Projects such as GENVL [17]
allow these subject hierarchies to be manually maintained, which presents a number of problems that
fall outside the scope of this paper. It would be useful if a robot could present a subject hierarchy view
of its data, but this requires some automated classification of documents [18].
The META tag discussed above could provide a mechanism for authors to classify their own
documents. The question then arises which classification system to use, and how to apply it. Even
traditional libraries don't use a single universal system, but adopt one of a few, and adopt their own
conventions for applying them. This gives little hope for an immediate universal solution for the Web.
Determining document structures
Perhaps the most difficult issue is that the Web doesn't consist of a flat set of files of equal
importance. Often services on the Web consist of a collection of Web pages: there is a welcome page,
maybe some pages with forms, maybe some pages with background information, and some pages with
individual data points. The service provider announces the service by referring to the welcome page,
which is designed to give structured access to the rest of the information. A robot however has no way
of distinguishing these pages, and may well find a link into for example one of the data points or
background files, and index those rather than the main page. So it can happen that rather than storing a
reference to "The Perl FAQ", it stores some random subset of the questions addressed in the FAQ. If
there were a facility in the Web for specifying, per document, that one shouldn't link to that page
but to another, designated page, this problem could be avoided.
Related to the above problem is that the content of web pages is often written for a specific context,
provided by the access structure, and may not make sense outside that context. For example, a page
describing the goals of a project may refer to "The project", without fully specifying the name, or
giving a link to the welcome page. Another problem is that of moved URL's. Often when service
administrators reorganise their URL structure they will provide mechanisms for backward
compatibility with the previous URL structure, to prevent broken links. In some servers this can be
achieved by specifying redirection configuration, which results in the HTTP server returning the new URL
when users try to access the old URL. However, when symbolic links are used it is not possible to tell
the difference between the two. An indexing robot can in these cases store the deprecated URL,
prolonging the requirement for a web administrator to provide backward compatibility.
A related problem is that a robot might index a mirror of a particular service, rather than the original
site. If both source and mirror are visited there will be duplicate entries in the database, and bandwidth
is being wasted repeating identical retrievals to different hosts. If only the mirror is visited users may
be referred to out-of-date information even when up-to-date information is available elsewhere.
ETHICS
We have seen that robots are useful, but that they can place high demands on bandwidth, and that they
have some fundamental problems when indexing the Web. Therefore a robot author needs to balance
these issues when designing and deploying a robot. This becomes an ethical question: "Is the cost to
others of the operation of a robot justified?" This is a grey area, and people have very different
opinions on what is acceptable.
When some of the acceptability issues first became apparent (after a few incidents with robots
doubling the load on servers) the author developed a set of Guidelines for Robot Writers [19], as a
first step to identify problem areas and promote awareness. These guidelines can be summarised as
follows:
● Reconsider: Do you really need a new robot?
● Be accountable: Ensure the robot can be identified by server maintainers, and the author can be
easily contacted.
● Test extensively on local data
● Moderate resource consumption: Prevent rapid fire and eliminate redundant and pointless
retrievals.
● Follow the Robot Exclusion Standard.
● Monitor operation: Continuously analyse the robot logs.
● Share results: Make the robot's results available to others, the raw results as well as any
intended high-level results.
David Eichman [20] makes a further distinction between Service Agents, robots that build information
bases that will be publicly available, and User Agents, robots that benefit only a single user such as
client-side robots, and has identified separate high-level ethics for each.
The fact that most Robot writers have already implemented these guidelines indicates that they are
conscious of the issues, and eager to minimise any negative impact. The public discussion forum
provided by the robots mailing list speeds up the discussion of new problem areas, and the public
overview of the robots on the Active list provides a certain community pressure on robot behaviour
[21].
This maturation of the Robot field means there have recently been fewer incidents where robots have
upset information providers. In particular, the standard for robot exclusion means that people who don't
approve of robots can prevent their sites from being visited. Experiences from several projects that have deployed
robots have been published, especially at the World-Wide Web conferences at CERN in July 1994
and Chicago in October 1994, and these help to educate, and discourage, would-be Robot writers.
However, with the increasing popularity of the Internet in general, and the Web in particular it is
inevitable that more Robots will appear, and it is likely that some will not behave appropriately.
ALTERNATIVES FOR RESOURCE DISCOVERY
Robots can be expected to continue to be used for network information retrieval on the Internet.
However, we have seen that there are practical, fundamental and ethical problems with deploying
robots, and it is worth considering research into alternatives, such as ALIWEB [22] and Harvest [23].
ALIWEB has a simple model for human distributed indexing of services in the Web, loosely based on
Archie [24]. In this model aggregate indexing information is available from hosts on the Web. This
information indexes only local resources, not resources available from third parties. In ALIWEB this
is implemented with IAFA templates [25], which give typed resource information in a simple
text-based format (See Figure 2). These templates can be produced manually, or can be constructed by
automated means, for example from titles and META elements in a document tree. The ALIWEB
gathering engine retrieves these index files through normal Web access protocols, and combines them
into a searchable database. Note that it is not a robot, as it doesn't recursively retrieve documents
found in the index.
Template-Type: SERVICE
Title:         The ArchiePlex Archie Gateway
URL:           /public/archie/archieplex/archieplex.html
Description:   A Full Hypertext interface to Archie.
Keywords:      Archie, Anonymous FTP.

Template-Type: DOCUMENT
Title:         The Perl Page
URL:           /public/perl/perl.html
Description:   Information on the Perl Programming Language.
               Includes hypertext versions of the Perl 5 Manual
               and the latest FAQ.
Keywords:      perl, programming language, perl-faq

Figure 2: An IAFA index file
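Because the records in Figure 2 are plain "Field: value" text, they are simple to consume mechanically.
A minimal Python sketch, assuming records are separated by blank lines and that indented lines continue
the previous field:

# Sketch of parsing IAFA-style index records into dictionaries.
def parse_iafa(text):
    records, current, last_field = [], {}, None
    for line in text.splitlines():
        if not line.strip():                       # blank line ends a record
            if current:
                records.append(current)
                current, last_field = {}, None
            continue
        if ":" in line and not line.startswith((" ", "\t")):
            field, _, value = line.partition(":")
            last_field = field.strip()
            current[last_field] = value.strip()
        elif last_field:                           # indented continuation of previous field
            current[last_field] += " " + line.strip()
    if current:
        records.append(current)
    return records

example = """Template-Type: SERVICE
Title:         The ArchiePlex Archie Gateway
URL:           /public/archie/archieplex/archieplex.html
Description:   A Full Hypertext interface to Archie.
Keywords:      Archie, Anonymous FTP.
"""
print(parse_iafa(example))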
There are several advantages to this approach. The quality of human-generated index information is
combined with the efficiency of automated update mechanisms. The integrity of the information is
higher than with traditional "hotlists", as only local index information is maintained. Because the
information is typed in a computer-readable format, search interfaces can offer extra facilities to
constrain queries. There is very little network overhead, as the index information is retrieved in a
single request. The simplicity of the model and the index file means any information provider can
immediately participate.
There are some disadvantages. The manual maintenance of indexing information can appear to place a
large burden on the information provider, but in practice indexing information for major services doesn't
change often. There have been experiments with index generation from TITLE and META tags in the
HTML, but this requires the local use of a robot, and has the danger that the quality of the index
information suffers. A second limitation is that in the current implementation information providers
have to register their index files at a central registry, which limits scalability. Finally, updates are not
optimally efficient, as an entire index file needs to be retrieved even if only one of its records was
modified.
ALIWEB has been in operation since October 1993, and the results have been encouraging. The main
operational difficulties appeared to be lack of understanding; initially people often attempted to
register their own HTML files instead of IAFA index files. The other problem is that as a personal
project ALIWEB is run on a spare-time basis and receives no funding, so further development is slow.
Harvest is a distributed resource discovery system recently released by the Internet Research Task Force
Research Group on Resource Discovery (IRTF-RD), and offers software systems for automated
indexing of the contents of documents, efficient replication and caching of such index information on remote
hosts, and finally searching of this data through an interface in the web. Initial reactions to this system
have been very positive.
One disadvantage of Harvest is that it is a large and complex system which requires considerable
human and computing resources, making it less accessible to information providers.
The use of Harvest to form a common platform for the interworking of existing databases is perhaps
its most exciting aspect. It is reasonably straightforward for other systems to interwork with Harvest;
experiments have shown that ALIWEB for example can operate as a Harvest broker. This gives
ALIWEB the caching and searching facilities Harvest offers, and offers Harvest a low-cost entry
mechanism.
These two systems show attractive alternatives to the use of robots for resource discovery: ALIWEB
provides a simple and high-level index, Harvest provides a comprehensive indexing system that uses
low-level information. However, neither system is targeted at indexing of third-parties that don't
actively participate, and it is therefore expected that robots will continue to be used for that purpose,
but in co-operation with other systems such as ALIWEB and Harvest.
CONCLUSIONS
In today's World-Wide Web, robots are used for a number of different purposes, including global
resource discovery. There are several practical, fundamental, and ethical problems involved in the use
of robots for this task. The practical and ethical problems are being addressed as experience with
robots increases, but are likely to continue to cause occasional problems. The fundamental problems
limit the amount of growth there is for robots. Alternative strategies such as ALIWEB and Harvest
are more efficient, and give authors and sites control of the indexing of their own information. It is
expected that this type of system will increase in popularity, and will operate alongside robots and
interwork with them. In the longer term complete Web-wide traversal by robots will become
prohibitively slow, expensive, and ineffective for resource discovery.
REFERENCES
1
Berners-Lee, T., R. Cailliau, A. Loutonen, H.F.Nielsen and A. Secret. "The World-Wide Web".
Communications of the ACM, v. 37, n. 8, August 1994, pp. 76-82.
2
Seeley, Donn. "A tour of the worm". USENIX Association Winter Conference 1989
Proceedings, January 1989, pp. 287-304.
3
Gray, M. "Growth of the World-Wide Web," Dec. 1993. <URL:
http://www.mit.edu:8001/aft/sipb/user/mkgray/ht/web-growth.html >
4
Fielding, R. "Maintaining Distributed Hypertext Infostructures: Welcome to MOMspider's
Web". Proceedings of the First International World-Wide Web Conference, Geneva
Switzerland, May 1994.
5
Berners-Lee, T., D. Connolly et al., "HyperText Markup Language Specification 2.0". Work in
progress of the HTML working group of the IETF. <URL:
ftp://nic.merit.edu/documents/internet-drafts/draft-ietf-html-spec-00.txt >
6
Luotonen, A., K. Altis. "World-Wide Web Proxies". Proceedings of the First International
World-Wide Web Conference, Geneva Switzerland, May 1994.
7
Eichmann, D. "The RBSE Spider - Balancing Effective Search against Web Load". Proceedings
of the First International World-Wide Web Conference, Geneva Switzerland, May 1994.
8
Berners-Lee, T., R. Fielding, F. Nielsen. "HyperText Transfer Protocol". Work in progress of
the HTTP working group of the IETF. <URL:
ftp://nic.merit.edu/documents/internet-drafts/draft-fielding-http-spec-00.txt >
9
Spero, S. "Analysis of HTTP Performance problems" July 1994 <URL:
http://sunsite.unc.edu/mdma-release/http-prob.html >
10
Spero, S. "Progress on HTTP-NG". <URL:
http://info.cern.ch/hypertext/www/Protocols/HTTP-NG/http-ng-status.html >
11
De Bra, P.M.E and R.D.J. Post. "Information Retrieval in the World-Wide Web: Making
Client-based searching feasable". Proceedings of the First International World-Wide Web
Conference, Geneva Switzerland, May 1994.
12
Spetka, Scott. "The TkWWW Robot: Beyond Browsing". Proceedings of the Second
International World-Wide Web Conference, Chicago United States, October 1994.
13
Slade, R., "Risks of client search tools," RISKS-FORUM Digest, v. 16, n. 37, Weds 31 August
1994.
14
Riechen, Doug. "Intelligent Agents". Communications of the ACM Vol. 37 No. 7, July 1994.
15
Pinkerton, B., "Finding What People Want: Experiences with the WebCrawler," Proceedings of
the Second International World-Wide Web Conference, Chicago United States, October 1994.
16
Koster, M., "A Standard for Robot Exclusion," < URL:
http://info.webcrawler.com/mak/projects/robots/exclusion.html >
17
McBryan, A., "GENVL and WWWW: Tools for Taming the Web," Proceedings of the First
International World-Wide Web Conference, Geneva Switzerland, May 1994.
18
Kent, R.E., Neus, C., "Creating a Web Analysis and Visualization Environment," Proceedings
of the Second International World-Wide Web Conference, Chicago United States, October
1994.
19
Koster, Martijn. "Guidelines for Robot Writers". 1993. <URL:
http://info.webcrawler.com/mak/projects/robots/guidelines.html >
20
Eichmann, D., "Ethical Web Agents," "Proceedings of the Second International World-Wide
Web Conference, Chicago United States, October 1994.
21
Koster, Martijn. "WWW Robots, Wanderers and Spiders". <URL:
http://info.webcrawler.com/mak/projects/robots/robots.html >
22
Koster, Martijn, "ALIWEB - Archie-Like Indexing in the Web," Proceedings of the First
International World-Wide Web Conference, Geneva Switzerland, May 1994.
23
Bowman, Mic, Peter B. Danzig, Darren R. Hardy, Udi Manber and Michael F. Schwartz.
"Harvest: Scalable, Customizable Discovery and Access System". Technical Report
CU-CS-732-94, Department of Computer Science, University of Colorado, Boulder, July 1994.
<URL: http://harvest.cs.colorado.edu/>
24
Deutsch, P., A. Emtage, "Archie - An Electronic Directory Service for the Internet", Proc.
Usenix Winter Conf., pp. 93-110, Jan 92.
25
Deutsch, P., A. Emtage, M. Koster, and M. Stumpf. "Publishing Information on the Internet
with Anonymous FTP". Work in progress of the Integrated Internet Information Retrieval
working group. <URL:
ftp://nic.merit.edu/documents/internet-drafts/draft-ietf-iiir-publishing-02.txt >
MARTIJN KOSTER holds a B.Sc. in Computer Science from Nottingham University (UK). During
his national service he served as a 2nd lieutenant in the Dutch Army at the Operations Research
group of STC, NATO's research lab in the Netherlands. Since 1992 he has worked for NEXOR as
software engineer on X.500 Directory User Agents, and maintains NEXOR's World-Wide Web
service. He is also author of the ALIWEB and CUSI search tools, and maintains a mailing-list
dedicated to World-Wide Web robots.
Reprinted with permission from ConneXions, Volume 9, No. 4, April 1995.
ConneXions--The Interoperability Report is published monthly by:
Interop Company, a division of SOFTBANK Expos
303 Vintage Park Drive, Suite 201
Foster City, CA 94404-1138
USA
Phone: +1 415 578-6900 FAX: +1 415 525-0194
Toll-free (in USA): 1-800-INTEROP
E-mail: connexions@interop.com
Free sample issue and list of back issues available upon request.
The Web Robots Pages
http://info.webcrawler.com/mak/projects/robots/threat-or-treat.html (12 of 12) [18.02.2001 13:19:53]
Guidelines for Robot Writers
The Web Robots Pages
Guidelines for Robot Writers
Martijn Koster, 1993
This document contains some suggestions for people who are thinking about developing Web
Wanderers (Robots), programs that traverse the Web.
Reconsider
Are you sure you really need a robot? They put a strain on network and processing resources all over
the world, so consider if your purpose is really worth it. Also, the purpose for which you want to run
your robot is probably not as novel as you think; there are already many other spiders out there.
Perhaps you can make use of the data collected by one of the other spiders (check the list of robots
and the mailing list). Finally, are you sure you can cope with the results? Retrieving the entire Web is
not a scalable solution, it is just too big. If you do decide to do it, don't aim to traverse the entire web,
only go a few levels deep.
Be Accountable
If you do decide you want to write and/or run one, make sure that if your actions do cause problems,
people can easily contact you and start a dialog. Specifically:
Identify your Web Wanderer
HTTP supports a User-agent field to identify a WWW browser. As your robot is a kind of
WWW browser, use this field to name your robot e.g. "NottinghamRobot/1.0". This will allow
server maintainers to set your robot apart from human users using interactive browsers. It is
also recommended to run it from a machine registered in the DNS, which will make it easier to
recognise, and will indicate to people where you are.
Identify yourself
HTTP supports a From field to identify the user who runs the WWW browser. Use this to
advertise your email address e.g. "j.smith@somewhere.edu". This will allow server maintainers
to contact you in case of problems, so that you can start a dialogue on better terms than if you
were hard to track down. (A minimal sketch of setting these headers appears at the end of this section.)
Announce It
Post a message to comp.infosystems.www.providers before running your robot. If
people know in advance they can keep an eye out. I maintain a list of active Web Wanderers, so
that people who wonder about access from a certain site can quickly check if it is a known robot
-- please help me keep it up-to-date by informing me of any missing ones.
Announce it to the target
If you are only targeting a single site, or a few, contact its administrator and inform him/her.
Be informative
Server maintainers often wonder why their server is hit. If you use the HTTP Referer field
you can tell them. This costs no effort on your part, and may be informative.
Be there
Don't set your Web Wanderer going and then go on holiday for a couple of days. If in your
absence it does things that upset people you are the only one who can fix it. It is best to remain
logged in to the machine that is running your robot, so people can use "finger" and "talk" to
contact you.
Suspend the robot when you're not there for a number of days (e.g. over the weekend); only run it in
your presence. Yes, it may be better for the performance of the machine if you run it overnight,
but that implies you don't think about the performance overhead on other machines. Yes, it will
take longer for the robot to run, but this is more an indication that robots are not the way to do
things anyway than an argument for running it continually; after all, what's the rush?
Notify your authorities
It is advisable to tell your system administrator / network provider what you are planning to do.
You will be asking a lot of the services they offer, and if something goes wrong they like to
hear it from you first, not from external people.
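A minimal sketch of setting these identification headers in Python; the robot name, e-mail address and
URL are the placeholder examples used above:

# Sketch of identifying a robot: every request carries a User-Agent naming the
# robot and a From header with the operator's e-mail address (placeholders).
import urllib.request

HEADERS = {
    "User-Agent": "NottinghamRobot/1.0",
    "From": "j.smith@somewhere.edu",
}

def fetch(url):
    req = urllib.request.Request(url, headers=HEADERS)
    with urllib.request.urlopen(req) as resp:
        return resp.read()

page = fetch("http://www.example.com/")            # placeholder URL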
Test Locally
Don't run repeated tests on remote servers; instead run a number of servers locally and use them to test
your robot first. When going off-site for the first time, stay close to home first (e.g. start from a page
with local servers). After doing a small run, analyse your performance, your results, and estimate how
they scale up to thousands of documents. It may soon become obvious you can't cope.
Don't hog resources
Robots consume a lot of resources. To minimise the impact, keep the following in mind:
Walk, don't run
Make sure your robot runs slowly: although robots can handle hundreds of documents per
minute, this puts a large strain on a server, and is guaranteed to infuriate the server maintainer.
Instead, put a sleep in, or if you're clever rotate queries between different servers in a
round-robin fashion. Retrieving 1 document per minute is a lot better than one per second. One
per 5 minutes is better still. Yes, your robot will take longer, but what's the rush, it's only a
program. (A minimal pacing sketch appears at the end of this section.)
Use If-modified-since or HEAD where possible
If your application can use the HTTP If-modified-since header, or the HEAD method for its
purposes, that gives less overhead than full GETs.
Ask for what you want
HTTP has an Accept field in which a browser (or your robot) can specify the kinds of data it
can handle. Use it: if you only analyse text, specify so. This will allow clever servers to not
bother sending you data you can't handle and have to throw away anyway. Also, make use of
URL suffixes if they're there.
Ask only for what you want
You can build in some logic yourself: if a link refers to a ".ps", ".zip", ".Z", ".gif" etc, and you
only handle text, then don't ask for it. Although they are not the modern way to do things
(Accept is), there is an enormous installed base out there that uses it (especially FTP sites).
Also look out for gateways (e.g. URLs starting with finger), News gateways, WAIS gateways etc.
And think about other protocols ("news:", "wais:") etc. Don't forget the sub-page references
(<A HREF="#abstract">) -- don't retrieve the same page more than once. It's imperative to
make a list of places not to visit before you start...
Check URLs
Don't assume the HTML documents you are going to get back are sensible. When scanning for
URLs be wary of things like <A HREF="http://somehost.somedom/doc>. A lot of sites don't
put the trailing / on URLs for directories; a naive strategy of concatenating the names of sub-URLs
can result in bad names.
Check the results
Check what comes back. If a server refuses a number of documents in a row, check what it is
saying. It may be that the server refuses to let you retrieve these things because you're a robot.
Don't Loop or Repeat
Remember all the places you have visited, so you can check that you're not looping. Check to
see if the different machine addresses you have are not in fact the same box (e.g.
web.nexor.co.uk is the same machine as "hercules.nexor.co.uk" and 128.243.219.1) so you
don't have to go through it again. This is imperative.
Run at opportune times
On some systems there are preferred times of access, when the machine is only lightly loaded.
If you plan to do many automatic requests from one particular site, check with its
administrator(s) when the preferred time of access is.
Don't run it often
How often people find acceptable differs, but I'd say once every two months is probably too
often. Also, when you re-run it, make use of your previous data: you know which url's to avoid.
Make a list of volatile links (like the what's new page, and the meta-index). Use this to get
pointers to other documents, and concentrate on new links -- this way you will get a high initial
yield, and if you stop your robot for some reason at least it has spent its time well.
Don't try queries
Some WWW documents are searchable (ISINDEX) or contain forms. Don't follow these. The
Fish Search does this for example, which may result in a search for "cars" being sent to
databases with computer science PhD's, people in the X.500 directory, or botanical data. Not
sensible.
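A minimal sketch combining "Walk, don't run" and "Don't Loop or Repeat": pause between retrievals and
remember every URL already fetched. The fetch function and the one-minute delay are placeholders.

# Sketch of pacing requests and avoiding repeat retrievals.
import time

visited = set()

def polite_fetch(url, fetch, delay_seconds=60):
    if url in visited:                             # never retrieve the same page twice
        return None
    visited.add(url)
    time.sleep(delay_seconds)                      # roughly one document per minute
    return fetch(url)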
Stay with it
It is vital you know what your robot is doing, and that it remains under control.
Log
Make sure it provides ample logging, and it wouldn't hurt to keep certain statistics, such as the
number of successes/failures, the hosts accessed recently, the average size of recent files, and
keep an eye on it. This ties in with the "Don't Loop" section -- you need to log where you have
been to prevent looping. Again, estimate the required disk-space, you may find you can't cope.
Be interactive
Arrange to be able to guide your robot. Commands that suspend or cancel the robot, or make it skip the current host, can be very useful. Checkpoint your robot frequently, so you don't lose everything if it falls over.
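A checkpoint could be as simple as the following Python sketch, which pickles the pending URLs and the visited set to a local file; the file name and the structure of the saved state are assumptions.

    import pickle

    def checkpoint(frontier, visited, path="robot.checkpoint"):
        # Dump the crawl state so a crash loses at most one interval of work.
        with open(path, "wb") as f:
            pickle.dump({"frontier": list(frontier), "visited": set(visited)}, f)

    def restore(path="robot.checkpoint"):
        with open(path, "rb") as f:
            state = pickle.load(f)
        return state["frontier"], state["visited"]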
Be prepared
Your robot will visit hundreds of sites. It will probably upset a number of people. Be prepared
to respond quickly to their enquiries, and tell them what you're doing.
Be understanding
If your robot upsets someone, instruct it not to visit his/her site, or only the home page. Don't
lecture him/her about why your cause is worth it, because they probably aren't in the least
interested. If you encounter barriers that people put up to stop your access, don't try to go
around them to show that in the Web it is difficult to limit access. I have actually had this
happen to me; and although I'm not normally violent, I was ready to strangle this person as he
was deliberately wasting my time. I have written a standard practice proposal for a simple
method of excluding servers. Please implement this practice, and respect the wishes of the
server maintainers.
Share results
OK, so you are using the resources of a lot of people to do this. Do something back:
Keep results
This may sound obvious, but think about what you are going to do with the retrieved documents. Try to keep as much information as you can possibly store; this will make the results optimally useful.
Raw Result
Make your raw results available, via FTP, the Web, or whatever. This means other people can use them, and don't need to run their own servers.
Polished Result
You are running a robot for a reason; probably to create a database, or gather statistics. If you
make these results available on the Web people are more likely to think it worth it. And you
might get in touch with people with similar interests.
Report Errors
Your robot might come across dangling links. You might as well publish them on the Web somewhere (after checking they really are dangling). If you are convinced they are in error (as opposed to restricted), notify the administrator of the server.
Examples
This is not intended to be a public flaming forum or a "Best/Worst Robot" league table. But it shows the problems are real, and the guidelines help alleviate them. Heh, maybe a league table isn't too bad an idea anyway.
Examples of how not to do it
The robot which retrieved the same sequence of about 100 documents on three occasions in four days.
And the machine couldn't be fingered. The results were never published. Sigh.
The robot run from phoenix.doc.ic.ac.uk in Jan 94. It provides no User-agent or From fields, one
can't finger the host, and it is not part of a publicly known project. In addition it has been reported to
retrieve documents it can't handle. Has since improved.
The Fish search capability added to Mosaic. One instance managed to retrieve 25 documents in under
one minute.
Better examples
The RBSE-Spider, run in December 93. It had a User-agent field, and after a finger to the host it
was possible to open a dialogue with the robot writers. Their web server explained the purpose of it.
Jumpstation: the results are presented in a searchable database, the author announced it, and is
considering making the raw results available. Unfortunately some people complained about the high
rate with which documents were retrieved.
Why?
Why am I rambling on about this? Because it annoys me to see that people cause other people
unnecessary hassle, and the whole discussion can be so much gentler. And because I run a server that
is regularly visited by robots, and I am worried they could make the Web look bad.
This page has been contributed to by Jonathon Fletcher (JumpStation Robot author), Lee McLoughlin (L.McLoughlin@doc.ic.ac.uk), and others.
Evaluation of the Standard for Robots Exclusion
Martijn Koster, 1996
Abstract
This paper contains an evaluation of the Standard for Robots Exclusion, identifies some of its problems and feature requests, and recommends future work.
● Introduction
● Architecture
● Problems and Feature Requests
● Recommendations
Introduction
The Standard for Robots Exclusion (SRE) was first proposed in 1994, as a mechanism for keeping
robots out of unwanted areas of the Web. Such unwanted areas included:
● infinite URL spaces in which robots could get trapped ("black holes").
● resource intensive URL spaces, e.g. dynamically generated pages.
● documents which would attract unmanageable traffic, e.g. erotic material.
● documents which could represent a site unfavourably, e.g. bug archives.
● documents which aren't useful for world-wide indexing, e.g. local information.
The Architecture
The main design considerations to achieve this goal were:
● simple to administer,
● simple to implement, and
● simple to deploy.
This specifically ruled out special network-level protocols, platform-specific solutions, or changes to
clients or servers.
Instead, the mechanism uses a specially formatted resource, at a known location in the server's URL space. In its simplest form the resource could be a text file produced with a text editor, placed in the root-level server directory.
This formatted-file approach satisfied the design considerations: The administration was simple,
because the format of the file was easy to understand, and required no special software to produce.
The implementation was simple, because the format was simple to parse and apply. The deployment
was simple, because no client or server changes were required.
Indeed the majority of robot authors rapidly embraced this proposal, and it has received a great deal of
attention in both Web-based documentation and the printed press. This in turn has promoted
awareness and acceptance amongst users.
Problems and Feature Requests
In the years since the initial proposal, a lot of practical experience with the SRE has been gained, and a considerable number of suggestions for improvement or extension have been made. They broadly fall into the following categories:
1. operational problems
2. general Web problems
3. further directives for exclusion
4. extensions beyond exclusion
I will discuss some of the most frequent suggestions in that order, and give some arguments in favour
or against them.
One main point to keep in mind is that it is difficult to gauge how much of an issue these problems are in practice, and how widespread support for extensions would be. When considering further development of the SRE it is important to prevent second-system syndrome.
Operational problems
These relate to the administration of the SRE, and as such the effectiveness of the approach for the
purpose.
Administrative access to the /robots.txt resource
The SRE specifies a location for the resource, in the root level of a server's URL space. Modifying
this file generally requires administrative access to the server, which may not be granted to a user who
would like to add exclusion directives to the file. This is especially common in large multi-user
systems.
It can be argued this is not a problem with the SRE, which after all does not specify how the resource is administered. It is, for example, possible to programmatically collect individual users' '~/robots.txt' files, combining them into a single '/robots.txt' file on a regular basis. How this could be implemented depends on the operating system, server software, and publishing process. In practice users find their administrators unwilling or unable to provide such a solution. This indicates again how important it is to stress simplicity; even if the extra effort required is minuscule, requiring changes in practices, procedures, or software is a major barrier for deployment.
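For what it's worth, the "collect individual files" idea could look roughly like this Python sketch; the /home layout, the output path, and the mapping of each user's rules onto a /~user/ prefix are all assumptions about one particular Unix-style setup, not part of the SRE.

    import glob
    import os

    def combine_user_files(home_root="/home", output="/usr/local/www/robots.txt"):
        lines = ["User-agent: *"]
        for path in sorted(glob.glob(os.path.join(home_root, "*", "robots.txt"))):
            user = os.path.basename(os.path.dirname(path))
            with open(path) as f:
                for line in f:
                    if line.lower().startswith("disallow:"):
                        rule = line.split(":", 1)[1].strip()
                        # Re-anchor the user's rule under their own URL space.
                        lines.append("Disallow: /~%s%s" % (user, rule))
        with open(output, "w") as f:
            f.write("\n".join(lines) + "\n")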
Suggestions to alleviate the problem include a CGI script which combines multiple individual files on the fly, or listing multiple referral files in the '/robots.txt' which the robot can retrieve and combine. Both these options suffer from the same problem: some administrative access is still required.
This is the most painful operational problem, and cannot be sufficiently addressed in the current
design. It seems that the only solution is to move the robot policy closer to the user, in the URL space
they do control.
File specification
The SRE allows only a single method for specifying parts of the URL space: by substring anchored at the front. People have asked for substrings anchored at the end, as in "Disallow: *.shtml", as well as generalised regular expression parsing, as in "Disallow: *sex*".
The issue with this extension is that it increases complexity of both administration and
implementation. In this case I feel this may be justified.
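If such wildcards were adopted, robots could implement them with shell-style pattern matching; this Python sketch only illustrates the proposed (non-standard) syntax, not the SRE as it stands.

    from fnmatch import fnmatch

    def disallowed(path, patterns):
        # "*" matches any run of characters, including "/".
        return any(fnmatch(path, pattern) for pattern in patterns)

    print(disallowed("/archive/page.shtml", ["*.shtml"]))   # True
    print(disallowed("/archive/page.html", ["*.shtml"]))    # False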
Redundancy for specific robots
The SRE allows for specific directives for individual robots. This may result in considerable repetition of rules common to all robots. It has been suggested that an OO inheritance scheme could address this.
In practice the per-robot distinction is not that widely used, and the need seems to be sporadic. The increased complexity of both administration and implementation seems prohibitive in this case.
Scalability
The SRE groups all rules for the server into a single file. This doesn't scale well to thousands or millions of individually specified URLs.
This is a fundamental problem, and one that can only be solved by moving beyond a single file, and
bringing the policy closer to the individual resources.
Web problems
These are problems faced by the Web at large, which could be addressed (at least for robots) separately using extensions to the SRE. I am against following that route, as it is fixing the problem in the wrong place. These issues should be addressed by a proper general solution separate from the SRE.
"Wrong" domain names
The use of multiple domain names sharing a logical network interface is a common practice (even
without vanity domains), which often leads to problems with indexing robots, who may end up using
an undesired domain name for a given URL.
This could be addressed by adding a "preferred" address, or even encoding "preferred" domain names for certain parts of a URL space. This again increases complexity, and doesn't solve the problem for non-robots, which can suffer the same fate.
The issue here is that deployed HTTP software doesn't have a facility to indicate the host part of the HTTP URL, and a server therefore cannot use that to decide the availability of a URL. HTTP 1.1 and later address this using a Host header and full URIs in the request line. This will address the problem across the board, but will take time to be deployed and used.
Mirrors
Some servers, such as "webcrawler.com", run identical URL spaces on several different machines, for
load balancing or redundancy purposes. This can lead to problems when a robot uses only the IP
address to uniquely identify a server; the robot would traverse and list each instance of the server
separately.
It is possible to list alternative IP addresses in the /robots.txt file, indicating equivalency. However, in
the common case where a single domain name is used for these separate IP addresses this information
is already obtainable from the DNS.
Updates
Currently robots can only track updates by frequent revisits. There seem to be a few options: the robot could request a notification when a page changes, the robot could ask for modification information in bulk, or the SRE could be extended to suggest expirations on URLs.
This is a more general problem, and ties in to caching issues and link consistency. I will not go into the first two options as they do not concern the SRE. The last option would duplicate existing HTTP-level mechanisms such as Expires, only because they are currently difficult to configure in servers. It seems to me this is the wrong place to solve that problem.
Further directives for exclusion
These concern further suggestions to reduce robot-generated problems for a server. All of these are easy to add, at the cost of more complex administration and implementation. It also brings up the issue of partial compliance; not all robots may be willing or able to support all of these. Given that the importance of these extensions is secondary to the SRE's purpose, I suggest they be listed as MAY or SHOULD options, not MUST.
Multiple prefixes per line
The SRE doesn't allow multiple URL prefixes on a single line, as in "Disallow: /users /tmp". In
practice people do this, so the implementation (if not the SRE) could be changed to condone this
practice.
Hit rate
This directive could indicate to a robot how long to wait between requests to the server. Currently it is
accepted practice to wait at least 30 seconds between requests, but this is too fast for some sites, too
slow for others.
A limitation is that this would specify a value for the entire site, whereas the value may depend on
specific parts of the URL space.
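A per-host delay of the kind such a directive would control might be implemented like this Python sketch; the 30-second default is the accepted practice mentioned above, and a Hit-rate value from '/robots.txt' would simply replace it.

    import time

    last_request = {}   # host -> time of the most recent request

    def polite_wait(host, delay=30.0):
        # Sleep until at least `delay` seconds have passed since the last
        # request to this host.
        pause = last_request.get(host, 0.0) + delay - time.time()
        if pause > 0:
            time.sleep(pause)
        last_request[host] = time.time()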
ReVisit frequency
This directive could indicate how long a robot should wait before revisiting pages on the server.
A limitation is that this would specify a value for the entire site, whereas the value may depend on
specific parts of the URL space.
This appears to duplicate some of the existing (and future) cache-consistency measures such as
Expires.
Visit frequency for '/robots.txt'
This is a special version of the directive above; specifying how often the '/robots.txt' file should be
refreshed.
Again Expires could be used to do this.
Visiting hours
It has often been suggested to list certain hours as "preferred hours" for robot accesses. These would
be given in GMT, and would probably list local low-usage time.
A limitation is that this would specify a value for the entire site, whereas the value may depend on
specific parts of the URL space.
Visiting vs indexing
The SRE specifies URL prefixes that are not to be retrieved. In practice we find it is used both for URLs that are not to be retrieved and for ones that are not to be indexed, and that the distinction is not explicit.
For example, a page with links to a company's employee pages may not be all that desirable to appear in an index, whereas the employee pages themselves are; the robot should be allowed to recurse on the parent page to get to the child pages and index them, without indexing the parent.
This could be addressed by adding a "DontIndex" directive.
Extensions beyond exclusion
The SRE's aim was to reduce abuses by robots, by specifying what is off-limits. It has often been
suggested to add more constructive information. I strongly believe such constructive information
would be of immense value, but I contest that the '/robots.txt' file is the best place for this. In the first
place, there may be a number of different schemes for providing such information; keeping exclusion
and "inclusion" separate allows multiple inclusions schemes to be used, or the inclusion scheme to be
changed without affecting the exclusion parts. Given the broad debates on meta information this
seems prudent.
Some of you may actually not be aware of ALIWEB, a separate pilot project I set up in 1994 which
used a '/site.idx' file in IAFA format, as one way of making such inclusive information available. A
full analysis of ALIWEB is beyond the scope of this document, but as it used the same concept as the
'/robots.txt' (single resource on a known URL), it shares many of the problems outlined in this
document. In addition there were issues with the exact nature of the meta data, the complexity of
administration, the restrictiveness of the RFC822-like format, and internationalisation issues. That
experience suggests to me that this does not belong in the '/robots.txt' file, except possibly in its most
basic form: a list of URL's to visit.
For the record, people's suggestions for inclusive information included:
● list of URIs to visit
● per-URL meta information
● site administrator contact information
● description of the site
● geographic information
Recommendations
I have outlined most of the problems and missing features of the SRE. I have also indicated that I am against most of the extensions to the current scheme, because of increased complexity, or because the '/robots.txt' is the wrong place to solve the problem. Here is what I believe we can do to address these issues.
Moving policy closer to the resources
To address the issues of scaling and administrative access, it is clear we must move beyond the single resource per server. There is currently no effective way in the Web for clients to consider collections (subtrees) of documents together. Therefore the only option is to associate policy with the resources themselves, i.e. the pages identified by a URL.
This association can be done in a few ways:
Embedding the policy in the resource itself
This could be done using the META tag, e.g. <META NAME="robotpolicy" CONTENT="dontindex">. While this would only work for HTML, it would be extremely easy for a user to add this information to their documents. No software or administrative access is required for the user, and it is really easy to support in the robot (see the sketch after this list).
Embedding a reference to the policy in the resource
This could be done using the LINK tag, e.g. <LINK REL="robotpolicy" HREF="public.pol">. This would give the extra flexibility of sharing a policy among documents, and of supporting different policy encodings which could move beyond RFC822-like syntax. The drawback is increased traffic (even with regular caching) and complexity.
Using an explicit protocol for the association
This could be done using PEP, in a similar fashion to PICS. It may even be possible or
beneficial to use the PICS framework as the infrastructure, and express the policy as a rating.
Note that this can be deployed independently of, and used together with, a site '/robots.txt'.
I suggest the first option should be an immediate first step, with the other options possibly following
later.
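To show how little robot-side support the first option needs, here is a Python sketch that extracts the proposed "robotpolicy" META tag from a page; the tag name is the one used in the example above and is a proposal, not a deployed standard.

    from html.parser import HTMLParser

    class RobotPolicyParser(HTMLParser):
        # Collects the CONTENT of a <META NAME="robotpolicy" ...> tag, if any.
        def __init__(self):
            super().__init__()
            self.policy = None

        def handle_starttag(self, tag, attrs):
            attributes = dict(attrs)
            if tag == "meta" and (attributes.get("name") or "").lower() == "robotpolicy":
                self.policy = attributes.get("content")

    parser = RobotPolicyParser()
    parser.feed('<head><META NAME="robotpolicy" CONTENT="dontindex"></head>')
    print(parser.policy)    # dontindex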
Meta information
The same three steps can be used for descriptive META information:
Embedding the meta information in the resource itself
This could be done using the META tag, e.g. <META NAME="description" CONTENT="...">. The nature of the META information could be the Dublin Core set, or even just "description" and "keywords". While this would only work for HTML, it would be extremely easy for a user to add this information to their documents. No software or administrative access is required for the user, and it is really easy to support in the robot.
Embedding a reference to the meta information in the resource
This could be done using the LINK tag, e.g. <LINK REL="meta" HREF="doc.meta">. This would give the extra flexibility of sharing meta information among documents, and of supporting different meta encodings which could move beyond RFC822-like syntax (which could even be negotiated using HTTP content type negotiation!). The drawback is increased traffic (even with regular caching) and complexity.
Using an explicit protocol for the association
This could be done using PEP, in a similar fashion to PICS. It may even be possible or
beneficial to use the PICS framework as the infrastructure, and express the meta information as
a rating.
I suggest the first option should be an immediate first step, with the other options possibly following
later.
Extending the SRE
The measures above address some of the problems in the SRE in a more scalable and flexible way than by adding a multitude of directives to the '/robots.txt' file.
I believe that of the suggested additions, this one will have the most benefit, without adding complexity:
PleaseVisit
To suggest relative URLs to visit on the site
Standards...
I believe any future version of the SRE should be documented either as an RFC or a W3C-backed
standard.