Thanks go out to Sol over at FederatedSearchBlog.com for giving me some suggestions on things to watch out for. If you want more background information on federated search or information retrieval, go check out that site.
In the last post, I introduced the idea of creating a deep-web crawler. I laid out the basic requirements that I’ve given myself, and I touched on some of the barriers to meeting those requirements. In this post, I’m going to introduce DeepCrawler.NET, my .NET-based (prototype-stage) crawler.
DeepCrawler.NET is written in C# for Microsoft .NET 3.5. While there is intelligence behind it, at its core it is doing nothing more than automating Internet Explorer. The crawler's "brain" examines a page in IE, then tells IE what to do, such as populate a form field with a value, click a link or button, or navigate to a new URL. To facilitate this automation, I'm currently using the open-source WatiN API. WatiN is actually designed for creating unit tests for web interfaces, but it's proving to be a fairly nice abstraction over the alternative method of automating IE from C# (that is, using the raw COM APIs).
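To give a feel for what "automating IE" looks like, here's a minimal sketch of driving the browser through WatiN. The URL, field name, and button value are placeholders I made up for illustration, not anything from the actual crawler:

```csharp
using System;
using WatiN.Core;

public class WatiNSketch
{
    public static void Main()
    {
        // Open an IE instance pointed at a (hypothetical) search page.
        using (IE browser = new IE("http://example.com/search"))
        {
            // Fill in a text field and submit the form, just as a user would.
            browser.TextField(Find.ByName("q")).TypeText("deep web");
            browser.Button(Find.ByValue("Search")).Click();

            // The resulting page's DOM is now available for inspection.
            Console.WriteLine(browser.Title);
        }
    }
}
```

The nice part is that everything happens in a real browser instance, so JavaScript-heavy forms behave exactly as they would for a human user.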
The main class in WatiN is "IE", which represents an instance of the Internet Explorer browser. There are all sorts of options you can adjust to control how WatiN "wraps" the browser, but for the most part, the defaults appear to be fine. Now, though WatiN is designed to facilitate testing of a web form, its API is flexible enough to enable exploratory analysis of a web page: you can easily enumerate forms, links, buttons, or anything else in the DOM tree. Since the first task of a deep-web crawler is simply to submit a query through a search form, our task is straightforward, assuming access to a magic black box that can help make certain decisions. First, enumerate the forms (some pages may contain multiple forms), and use the black box to select the one that is most likely the search form. Next, enumerate the fields in that form, and use the black box to determine which fields correspond to which available query criteria (the crawler's pool of queries may be simple keywords, or keywords augmented with date ranges or other values). Finally, enumerate the buttons and links, and use the black box to determine which one submits the form and begins the search. From there, it's a simple matter of paging through the results and grabbing all the links.
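The flow above can be sketched in code. The `IQueryOracle` interface here is my stand-in for the magic black box; its name and methods are hypothetical, not part of WatiN or the current prototype:

```csharp
using System;
using WatiN.Core;

// Stand-in for the "black box" that makes the hard decisions.
// Everything about this interface is a hypothetical placeholder.
public interface IQueryOracle
{
    Form PickSearchForm(FormCollection forms);
    TextField PickKeywordField(Form form);
    Button PickSubmitButton(Form form);
}

public class QuerySubmitter
{
    public void Submit(IE browser, IQueryOracle oracle, string keywords)
    {
        // 1. Enumerate the page's forms; let the oracle pick the search form.
        Form searchForm = oracle.PickSearchForm(browser.Forms);

        // 2. Map our query criteria onto the form's fields.
        oracle.PickKeywordField(searchForm).TypeText(keywords);

        // 3. Submit the form to reach the first results page.
        oracle.PickSubmitButton(searchForm).Click();

        // 4. Harvest every link on the results page for later filtering.
        foreach (Link link in browser.Links)
            Console.WriteLine(link.Url);
    }
}
```

Notice that all of the actual intelligence hides behind the oracle; the WatiN plumbing around it is almost trivial.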
By simple, I mean very NOT simple. First, not all of the links on a results page actually point to results: some may link back to the search form, some may go elsewhere on the site, some may be ads, and some are (hopefully) the links for paging through the results. Which brings up another issue: how do you determine how to page through the results? These are open questions that I'm currently working to address and hope to discuss in a future post.
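To make the difficulty concrete, here is one naive heuristic for spotting a "next page" link: scan anchor text for hints like "next" or ">>". This is purely my own strawman, not something the crawler relies on, and real result pages will defeat it often enough that something smarter is needed:

```csharp
using System;
using WatiN.Core;

public static class PagingHeuristic
{
    // Anchor-text fragments that often mark pagination links (an assumption).
    private static readonly string[] NextHints = { "next", ">>", "more results" };

    public static Link FindNextLink(IE browser)
    {
        foreach (Link link in browser.Links)
        {
            string text = (link.Text ?? "").Trim().ToLowerInvariant();
            foreach (string hint in NextHints)
            {
                if (text.Contains(hint))
                    return link; // best guess at the "next page" link
            }
        }
        return null; // no obvious pagination link found
    }
}
```

Image-only pagination buttons, JavaScript-driven paging, and localized link text all slip right past a heuristic like this, which is exactly why it's still an open question.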
In the next post, we’ll start diving into some of these issues. And remember, these are just the issues with meeting the *first* requirement. After that, we still have to figure out how to do these things efficiently and intelligently. It’s a long road ahead; I wish I had more than four weeks left in the semester. 😀