Thanks go out to Sol over at FederatedSearchBlog.com for giving me some suggestions on things to watch out for. If you want more background information on federated search or information retrieval, go check out that site.
In the last post, I introduced the idea of creating a deep-web crawler. I laid out the basic requirements that I’ve given myself, and I touched on some of the barriers to meeting those requirements. In this post, I’m going to introduce DeepCrawler.NET, my .NET-based (prototype-stage) crawler.
DeepCrawler.NET is written in C# for Microsoft .NET 3.5. While there is intelligence behind it, at its core it does nothing more than automate Internet Explorer. The crawler’s "brain" examines a page in IE, then tells IE what to do, such as populate a form field with a value, click a link or button, or navigate to a new URL. To facilitate this automation, I’m currently using the open-source WatiN API. WatiN is actually designed for creating unit tests for web interfaces, but it’s proving to be a fairly nice abstraction over the alternative method of automating IE from C# (that is, using the raw COM APIs).
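To give a feel for what that automation looks like, here’s a minimal sketch of driving IE through WatiN. The URL, the "q" field name, and the button label are placeholders I made up for illustration, not a real target site:

```csharp
using System;
using WatiN.Core;

// Minimal WatiN usage: drive a real IE window from C#.
// WatiN needs an STA thread, hence the attribute.
class Program
{
    [STAThread]
    static void Main()
    {
        // Placeholder URL and element names -- any real site will differ.
        using (var browser = new IE("http://example.com/search"))
        {
            browser.TextField(Find.ByName("q")).TypeText("deep web crawler");
            browser.Button(Find.ByValue("Search")).Click();

            Console.WriteLine(browser.Html);   // raw HTML of the page IE ended up on
        }
    }
}
```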
The main class in WatiN is "IE", which represents an instance of the Internet Explorer browser. There are all sorts of options you can adjust to control how WatiN "wraps" the browser, but for the most part, the defaults appear to be fine. Now, though WatiN is designed to facilitate testing of web forms, its API is flexible enough to enable exploratory analysis of a web page. You can easily enumerate forms, links, buttons, or anything else in the DOM tree. Since the first task of a deep-web crawler is just to submit a query through a search form, our task is straightforward, assuming access to a magic black box that can help make certain decisions. First, enumerate the forms (some pages may contain multiple forms), and use the black box to select the one that is most likely the search form. Next, enumerate the fields in the form, and use the black box to determine which fields correspond to which available query criteria (the crawler’s pool of queries may be simple keywords, or they could be keywords augmented with date ranges or other values). Finally, enumerate the buttons and links, and use the black box to determine which one to use to submit the form and begin the search. From there, it’s a simple matter of paging through the results and grabbing all the links.
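To make that workflow concrete, here’s a rough sketch of how the black box might be wired into WatiN. The IFormOracle interface and SubmitQuery helper are hypothetical names I’m using for illustration; implementing those three decisions well is exactly the hard part:

```csharp
using WatiN.Core;

// Hypothetical sketch only: IFormOracle stands in for the "magic black box"
// described above. Nothing like it ships with WatiN.
public interface IFormOracle
{
    Form PickSearchForm(IE browser);        // which form on the page is the search form?
    TextField PickQueryField(Form form);    // which field should receive the query keywords?
    Element PickSubmitElement(Form form);   // which button/link/image actually submits it?
}

public static class QuerySubmitter
{
    public static void SubmitQuery(IE browser, IFormOracle oracle, string query)
    {
        Form searchForm = oracle.PickSearchForm(browser);

        oracle.PickQueryField(searchForm).TypeText(query);
        oracle.PickSubmitElement(searchForm).Click();

        browser.WaitForComplete();   // block until IE finishes loading the result page
    }
}
```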
By simple, I mean very NOT simple. First, not all the links on the page are going to be for results: some may link back to the search form, some may go elsewhere on the site, some may be ads, and some are (hopefully) the links to page through the results. Which brings up another issue: how do you determine how to page through the results? These are open questions that I’m currently working to address and will hopefully discuss in a future post.
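Just to show the raw material the crawler has to work with, here’s a throwaway diagnostic that dumps every link on the current page along with a naive "is this a pager link?" guess. The heuristic is obviously far too crude to rely on; classifying result, ad, navigation, and pager links properly is the open question:

```csharp
using System;
using WatiN.Core;

// Throwaway diagnostic, not part of the crawler proper: list every link on the
// page with a naive guess at whether it pages through results.
public static class LinkDumper
{
    public static void DumpLinks(IE browser)
    {
        foreach (Link link in browser.Links)
        {
            string text = (link.Text ?? "").Trim();

            // Crude guess: pager links often say "next" or ">" -- many don't.
            bool looksLikePager = text.ToLowerInvariant() == "next" || text == ">";

            Console.WriteLine("{0}\t{1}\t{2}", looksLikePager, text, link.Url);
        }
    }
}
```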
Ignoring those issues for now and focusing just on how you submit a form, you can see that I’ve skipped all the hard parts by using this magical black box. Such a box doesn’t exist, so we have to implement one. What issues do we have to deal with? How do we decide which form is the search form? Once we’ve done that, how do we determine where to place our query criteria? There’s *nothing* that says people have to give their form fields meaningful names or IDs, so "q" could just as easily be a box for a "query" as it could be a box for entering your username. Finally, even if we find the form and figure out how to populate it with the query we want to execute, how do we submit the form? Some forms may use JavaScript, some may use buttons, some may use images… what can we do? How do we determine which one to use?
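For completeness, here’s what a deliberately naive version of that black box might look like, plugged into the hypothetical IFormOracle interface from the sketch above. It just scores fields by whether their names or IDs look "search-like," and it fails for exactly the reasons just described (a field named "q" tells you almost nothing):

```csharp
using System;
using System.Linq;
using WatiN.Core;

// A deliberately naive stand-in for the hypothetical IFormOracle sketched earlier.
// It scores elements by whether their name/id attributes contain search-like hints.
public class NaiveFormOracle : IFormOracle
{
    private static readonly string[] Hints = new[] { "search", "query", "keyword", "find" };

    private static bool LooksSearchy(Element element)
    {
        string label = ((element.GetAttributeValue("name") ?? "") + " " +
                        (element.Id ?? "")).ToLowerInvariant();
        return label.Trim() == "q" || Hints.Any(hint => label.Contains(hint));
    }

    public Form PickSearchForm(IE browser)
    {
        // Prefer the form whose text fields look most search-like; fall back to the first form.
        return browser.Forms.Cast<Form>()
            .OrderByDescending(f => f.TextFields.Cast<TextField>().Count(tf => LooksSearchy(tf)))
            .FirstOrDefault();
    }

    public TextField PickQueryField(Form form)
    {
        var fields = form.TextFields.Cast<TextField>().ToList();
        return fields.FirstOrDefault(tf => LooksSearchy(tf)) ?? fields.FirstOrDefault();
    }

    public Element PickSubmitElement(Form form)
    {
        // Guess that the first button submits the form. Image inputs and
        // JavaScript-only links are exactly the hard cases noted above.
        return form.Buttons.Cast<Button>().FirstOrDefault();
    }
}
```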
In the next post, we’ll start diving into some of these issues. And remember, these are just the issues with meeting the *first* requirement. After that, we still have to figure out how to do these things efficiently and intelligently. It’s a long road ahead; I wish I had more than four weeks left in the semester. 😀
If it would help, check out my project, http://arachnode.net.
arachnode.net is a full-featured/complete C# site crawler that uses SQL2005 and lucene.net.
I’m with you on the SIMPLE reference. Hopefully my work can answer some of the questions and problems you’ll face.
Thanks!
-an
Very impressive! I had looked for related work while working on DeepCrawler.NET, but I didn’t find your project. I’m going to check it out in depth. Most of my research has been focused on the content-extraction "problem" as opposed to the things you’ve addressed, so there could be an opportunity to collaborate here that would be fun and interesting… I’ll be in touch. 🙂
Thanks for the compliment! I would really like to find people to collaborate with.
One of the next items of business for arachnode.net is to finish the automatic content extraction plugin.
Given a webpage, blog, or forum, how can a computer program know what the meat/content of the page is, and what the comments are?
I’m about 50% done with this functionality. I wonder if this would be of help?
Feel free to come over to the site and register, unless you already have! 😀
Yes – please stay in touch.
@Arachnode,
I have briefly checked out your site, and I’m very intrigued by what you’ve built. I haven’t found time to really dive into it yet, but I have a reminder on my todo list (that keeps getting deferred to tomorrow). I *will* get over there and check it out fully.
Automated metadata extraction is actually something I’ve been reading up on. I found a neat idea in a research paper on using partial tree alignment to find what to extract, but I haven’t had time to really digest it yet. Fun stuff, though!