In the last post, I laid out DeepCrawler.NET's (primitive) strategy for finding search forms, populating them, and submitting them using WatiN and a heuristic search mechanism. As I mentioned at the end of that post, though, submitting a query is only the first step in a complicated process. Assuming nothing goes wrong, submitting a query gets us back one or more pages of results. The problem now becomes parsing and crawling those results.
The Current State
First, let's look at what a typical result page consists of, using Google as an example. There are some useless links that don't really help us: links to other pages on Google (we'll call these navigational links), sponsored links, etc. Then there are links to the individual results, along with "supporting" links for each result (the cached copy, similar pages, and so on). Finally, there are links to subsequent pages in the result set. Ideally, we'd like to skip the first set of links entirely and focus the crawl on the latter two sets (the actual results and the subsequent result pages). Unfortunately, that's very difficult to do. Without encoding some site-specific knowledge into the process, how do we tell the difference between a result link and a link we don't care about?
For now, I've decided to punt on the issue: DeepCrawler.NET just crawls all of the available links. Fortunately, with the right crawl depth, this actually achieves fairly good recall (meaning we grab a large number of the actual results) at the cost of low precision (we also grab a lot of pages we don't care about). The high recall comes from the fact that the initial result page links to the subsequent result pages, so the crawler recursively navigates to those pages, pulls in their results, and so on. The low precision stems from the same recursive property: DeepCrawler.NET also pulls in the junk pages, follows their links, and so on. It works, but it's far from perfect.
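To make the current strategy concrete, here's a minimal sketch of a depth-limited, crawl-everything walk. This is illustrative Python, not the actual .NET implementation; `fetch_links` is a hypothetical stand-in for the WatiN-driven step that visits a page and scrapes its links.

```python
from collections import deque

def crawl_all_links(start_url, fetch_links, max_depth):
    """Visit every link reachable within max_depth hops of start_url.

    fetch_links(url) -> list of link URLs found on that page
    (a placeholder for the real browser-driven page visit).
    """
    visited = set()
    queue = deque([(start_url, 0)])
    while queue:
        url, depth = queue.popleft()
        if url in visited or depth > max_depth:
            continue
        visited.add(url)
        for link in fetch_links(url):
            queue.append((link, depth + 1))
    return visited
```

With a deep enough `max_depth`, pagination links pull in every result page (high recall); but every junk link is expanded the same way (low precision).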
A Better Approach?
The approach I'm currently investigating is to "learn" where the good results are by executing probe queries against a source. Assuming the source is reasonably consistent in how it presents results (a safe assumption, I think), the result pages for different queries should be largely identical, varying only in the actual result links, descriptions, and so on. Any link that appears on every probe's result page regardless of the query is therefore probably navigational, and hopefully we can identify and exclude such links from the crawl quickly. Next, we can limit the crawler so that it only travels one link away from the original source's domain. That lets us crawl subsequent result pages by increasing the crawl depth, without losing precision to the crawler jumping further and further away from the results.
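The two ideas above can be sketched as follows. This is a hypothetical illustration of the technique, not DeepCrawler.NET's code: the first function learns navigational links by intersecting the link sets scraped from several probe queries' result pages, and the second enforces the one-hop rule by only expanding links found on pages still on the source's domain.

```python
from urllib.parse import urlparse

def learn_navigational_links(probe_result_link_sets):
    """Links that appear on the result page of *every* probe query
    never vary with the query, so they are presumed navigational."""
    return set.intersection(*(set(s) for s in probe_result_link_sets))

def should_follow(link, current_page_url, source_domain, nav_links):
    """Skip learned navigational links, and only expand links found on
    pages still on the source's domain -- so the crawl never travels
    more than one link away from the source."""
    if link in nav_links:
        return False
    return urlparse(current_page_url).netloc == source_domain
```

Result pages themselves live on the source's domain, so pagination links keep getting expanded as the depth grows, while an off-domain result page is saved but never expanded.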
The above approach isn't perfect. First, it still doesn't filter ads, but many of the sources I envision DeepCrawler.NET being used for won't have ads, and there's always a chance that an ad is actually relevant to what you're looking for, so maybe that isn't a terrible thing. Second, the depth-limiting approach may not work well for search sources that link only to pages on their own site; for this type of source, the crawler could still wander too far from the results. A better solution is needed, one that can precisely identify the result links and the controls for paging through a set of results.
Well, I obviously still have some work to do on crawling the results. The current approach works, but it's flawed. I plan to review recent research literature to see if there's an existing method I can build on, but I suspect I'm wading into muddy waters where there's no easy solution. Another thing I need to do is add richer support for handling different file types. Right now, DeepCrawler.NET is only capable of saving HTML documents (it can "visit" any file type, though), but this can easily be solved by leveraging available functionality in WatiN.
One thing that's completely missing from DeepCrawler.NET is solid support for sources that use sessions. For such sources, the crawl currently must be performed immediately after executing a query. Ideally, the crawl should be interruptible and distributable, enabling a much more flexible crawl process. My current plan is to have the crawler record the "path" of actions it took to arrive at any given page in the crawl. With that path information, the crawler should (in theory) be able to replay the actions and return to any point in the crawl at a later time. A "path" could then be discovered by one instance of the crawler and followed further by a completely separate instance.
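The replay idea might look something like this. Again, a Python sketch of the concept rather than real DeepCrawler.NET code: the `browser` object and its `submit_query`/`follow_link` methods are hypothetical stand-ins for the WatiN-backed browser wrapper, and a path is just the recorded sequence of actions that produced a page.

```python
def replay_path(browser, path):
    """Re-execute a recorded action path from the start of a fresh
    session, returning the URL the path ends at.

    path is a list of (action, argument) pairs recorded by an earlier
    crawler instance, e.g. [("submit_query", "foo"), ("follow_link", 2)].
    """
    for action, arg in path:
        if action == "submit_query":
            browser.submit_query(arg)   # fill in and submit the search form
        elif action == "follow_link":
            browser.follow_link(arg)    # click the arg-th link on the current page
        else:
            raise ValueError(f"unknown action in path: {action}")
    return browser.current_url
```

Because the path re-establishes the session from scratch, a completely separate crawler instance can pick up where another left off, which is what makes an interruptible, distributable crawl possible.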
Questions or comments? Suggestions or ideas? Please leave me a comment.