Much to my surprise, getting DeepCrawler.NET up and working with basic functionality was easy. It’s far from finished, and I haven’t exhaustively tested it, but it does work. In this post, I’ll describe the current implementation and how it addresses some of the barriers raised in my last post.
How do we decide which form is the search form?
Some sites (in particular, FedBizOpps) contain multiple forms. As in this case, sites can contain login forms alongside query forms, and we definitely don’t want our crawler to try to log in with our various keywords.
To address this barrier, DeepCrawler.NET employs a heuristic search of the forms on the page. It calculates a score for each form on the page by examining the form’s contents (text fields, labels, buttons, etc.) as well as attributes on the form, such as the form’s name and ID. The properties of each of these controls are compared to a very short list of words (query, search, keywords) that correlate well with search forms. Right now, scores are basically binary: the heuristic either considers something a potential match or not a potential match. Future work will add some more intelligent scoring to the process.
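To make the idea concrete, here is a minimal Python sketch of a binary keyword heuristic over the forms on a page. It is not DeepCrawler.NET's actual code (which runs on .NET and drives the browser through WatiN); the class and function names are made up for illustration, and matching on any attribute value or visible text inside the form is my simplifying assumption.

```python
from html.parser import HTMLParser

# Keyword list taken from the words mentioned in the post.
SEARCH_WORDS = ("query", "search", "keywords")


class FormCollector(HTMLParser):
    """Hypothetical helper: gathers attribute values and text for each <form>."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one list of text fragments per form
        self._current = None  # fragments of the form being parsed, if any

    def handle_starttag(self, tag, attrs):
        if tag == "form":
            self._current = []
            self.forms.append(self._current)
        if self._current is not None:
            # Keep id, name, value, etc. of the form and the controls inside it.
            self._current.extend(value for _, value in attrs if value)

    def handle_endtag(self, tag):
        if tag == "form":
            self._current = None

    def handle_data(self, data):
        # Visible text such as label contents.
        if self._current is not None and data.strip():
            self._current.append(data.strip())


def looks_like_search_form(fragments):
    """Binary score: True if any collected fragment contains a search word."""
    text = " ".join(fragments).lower()
    return any(word in text for word in SEARCH_WORDS)


def find_search_forms(html):
    """Return the indexes (in page order) of forms flagged as potential search forms."""
    collector = FormCollector()
    collector.feed(html)
    return [i for i, frags in enumerate(collector.forms)
            if looks_like_search_form(frags)]


page = """
<form id="login" action="/login">
  <input type="text" name="username"><input type="password" name="password">
  <input type="submit" value="Log in">
</form>
<form id="f2" action="/results">
  <label>Keywords</label><input type="text" name="q">
  <input type="submit" value="Go">
</form>
"""
candidates = find_search_forms(page)
```

On this sample page only the second form is flagged, because its label text contains one of the keywords while nothing in the login form does.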
Another issue not fully addressed yet is forms with useless ID, name, and value descriptors for the fields within the form. Right now, DeepCrawler.NET can’t do anything with them, but that’s going to change (hopefully this weekend). I’ve prototyped a search mechanism that looks for text labels (not label elements, which are much more useful and already handled by DeepCrawler.NET) by "visually" searching a grid around a form element of interest. The search is primitive now, but early tests indicate that it will successfully locate text labels corresponding to form fields in some cases.
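One way to picture a "visual" label search is as a nearest-neighbor lookup over page coordinates. The Python sketch below is only an illustration under assumptions of my own, not the prototype described above: it assumes the rendered position and size of each text fragment and form field are already known, and the `Box` type, the `nearest_label` function, and the 150-pixel window are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Box:
    """Axis-aligned rectangle in page pixels (hypothetical layout data)."""
    text: str
    left: float
    top: float
    width: float
    height: float

    @property
    def center(self):
        return (self.left + self.width / 2, self.top + self.height / 2)


def nearest_label(field: Box, texts: List[Box], margin: float = 150.0) -> Optional[Box]:
    """Return the closest text fragment inside a square window around the field."""
    fx, fy = field.center
    best, best_dist = None, None
    for t in texts:
        tx, ty = t.center
        dx, dy = abs(tx - fx), abs(ty - fy)
        if dx > margin or dy > margin:
            continue  # outside the search window around the field
        dist = dx + dy  # Manhattan distance between centers
        if best_dist is None or dist < best_dist:
            best, best_dist = t, dist
    return best


field = Box("", left=200, top=100, width=150, height=20)
texts = [
    Box("Keywords:", left=120, top=100, width=70, height=20),
    Box("Copyright notice", left=0, top=600, width=100, height=20),
]
label = nearest_label(field, texts)
```

Here the fragment sitting just left of the field is chosen and the footer text is ignored because it falls outside the window. A real page would need tie-breaking rules (for example, preferring text to the left of or above a field), which this sketch leaves out.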
How do we determine where to place our query criteria?
This is somewhat addressed by the solution to the previous issue. Within each candidate form, DeepCrawler.NET applies a heuristic test to each text box (input elements of type text) to determine if the text box is where the query should go. The "best" text box is retained and used as our query box. Again, this is a somewhat naive approach, but it works well enough for the sites I’ve evaluated DeepCrawler.NET against. Future work will add more intelligence to the heuristics.
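As a rough illustration of picking the "best" text box, here is a small Python sketch. It is not the DeepCrawler.NET implementation: the dictionary layout for a text box, the function names, and the choice to rank boxes by counting keyword hits are all assumptions made for the example.

```python
# Keyword list taken from the words mentioned in the post.
SEARCH_WORDS = ("query", "search", "keywords")


def score_text_box(box):
    """Count how many search words appear in the box's id, name, or label text."""
    descriptors = " ".join(str(box.get(key, "")) for key in ("id", "name", "label"))
    descriptors = descriptors.lower()
    return sum(1 for word in SEARCH_WORDS if word in descriptors)


def pick_query_box(boxes):
    """Return the highest-scoring text box, or None if nothing matches at all."""
    best = max(boxes, key=score_text_box, default=None)
    if best is None or score_text_box(best) == 0:
        return None
    return best


# Hypothetical text boxes extracted from one candidate form.
boxes = [
    {"id": "txtZip", "name": "zip", "label": "ZIP code"},
    {"id": "txtKeywords", "name": "query", "label": "Keywords"},
]
chosen = pick_query_box(boxes)
```

With these two boxes, the second one is chosen because its descriptors mention both "query" and "keywords", while the ZIP code box matches nothing. Returning `None` when no box matches lets the caller skip a form rather than type keywords into an unrelated field.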
In its current state, DeepCrawler.NET doesn’t handle anything but text boxes. That still leaves drop-down lists, multi-selects, text areas, radio buttons, and checkboxes. Technically, there could also be hidden fields, but since Internet Explorer is serving as the crawler’s "window", I’m assuming that any hidden fields will be correctly populated by the page itself. I plan to address the remaining field types in the near future, but probably not for the first "finished" version of DeepCrawler.NET.
How do we submit the form?
I’ve described how DeepCrawler.NET finds a form, populates it, and submits it, but that’s only part of the battle. Next up is crawling the search results. My approach is somewhat primitive, but it actually works quite well in the limited testing I’ve done so far. I’ll do a write-up on that at some point next week. I also plan to release the full source code to DeepCrawler.NET after the semester ends, but if you have any questions on how I’m accomplishing anything specific (remember that I’m using WatiN right now), feel free to ask in the comments.