For one of my graduate courses, I’ve decided to tackle the task of creating an intelligent agent for deep-web (AKA hidden-web) crawling.  Unlike traditional crawlers, which work by following hyperlinks from a set of seed pages, a deep-web crawler digs through search interfaces and forms on the Internet to access the data underneath, data that is typically not directly accessible via a hyperlink (and would therefore be missed by a traditional crawler).  It’s a tough problem to crack.  No two search forms on the web are the same (example 1 and example 2), and there’s no standard that’s useful here.  Worse, many now employ client-side scripting, SSL, and other technologies that make interaction very difficult even when you focus on a single site.  When you try to scale to the myriad of possibilities that exist on the web, the problem becomes (seemingly) intractable. 

There has been quite a bit of research in this area over the last few years.  Even Google has gotten into the game with their own deep-web crawler that is now tearing up the (hidden) web.  Still, most of the documented approaches have limitations that I don’t like: some only support GET-based requests (meaning that any form that uses POST is unsupported); many don’t work with client-side scripting; some try to spider an entire source (digging too deeply for a focused task), while others barely scratch the surface and miss a lot of valuable information.  So, my goal is to create a flexible deep-web crawler that meets these basic requirements:

  1. If you can browse it, you can crawl it.  The crawler should work with client-side scripting, work over SSL, and support POST requests.
  2. You give it “topics”, and it does the rest.  You tell it keywords (or phrases) that you’re interested in, and it will crawl a source to harvest as much information about those keywords as it can.
  3. It should be efficient.  The crawler should intelligently determine which queries to submit to achieve high recall and high precision (see the sketch just after this list for what I mean by those two terms).
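
To pin down requirement 3, here’s roughly how I’d score a crawl, written as a minimal Python sketch.  The set names are purely illustrative: they stand for the records in a source that actually match a topic and the records the crawler managed to pull out.

```python
# Minimal sketch: recall and precision of a crawl for one topic.
# 'relevant'  -- records in the source that actually match the topic
# 'harvested' -- records the crawler retrieved
def recall(harvested, relevant):
    # Fraction of the relevant records that the crawler actually found.
    return len(harvested & relevant) / len(relevant)

def precision(harvested, relevant):
    # Fraction of what the crawler retrieved that was actually relevant.
    return len(harvested & relevant) / len(harvested)
```

Requirement 3 is really about query selection: choosing keyword combinations that push recall up without burying the results in junk.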

For now, let’s ignore requirements 2 and 3, and focus just on requirement 1.  There are a few ways we could go about this.  First, we could use low-level API classes to issue GET, POST, and other requests ourselves, basically implementing an HTTP-compliant crawler.  We could parse the HTML received in response to a request, then ‘figure out’ what to do next (more on that in a future post).  But what happens if the HTML contains JavaScript?  What if the form requires cookies?  What if the traffic is encrypted over SSL?  Well, we could implement support for all of that eventually, but it would be very, very painful.  We would basically be implementing an automated browser.

Instead of going that route, why not just automate an existing browser?  It turns out that this is actually fairly easy to do.  Microsoft has a COM API that can be used to manipulate Internet Explorer programmatically.  This API, though a bit rough around the edges, can be leveraged to create a crawler, and that’s the route I’ve chosen.  Doing so lets me very quickly bypass painful things like cookies, JavaScript, etc., and focus on implementing the logic to find and interact with web forms.
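
To make that concrete, here’s a minimal sketch of what driving IE through COM looks like.  I’m using the Python pywin32 bindings purely for illustration; the URL and the form-listing loop are placeholders, not part of the actual crawler.

```python
# Minimal sketch: automating Internet Explorer through its COM API.
# Requires the pywin32 package; "InternetExplorer.Application" is the
# registered COM class for IE.
import time
import win32com.client

ie = win32com.client.Dispatch("InternetExplorer.Application")
ie.Visible = True                            # watch it work while debugging
ie.Navigate("http://example.com/search")     # placeholder URL

# Let the browser handle cookies, SSL, and any client-side scripts;
# just wait until the page reports that it has finished loading.
while ie.Busy or ie.ReadyState != 4:         # 4 == READYSTATE_COMPLETE
    time.sleep(0.5)

# The rendered DOM is exposed over the same COM interfaces, so the crawler
# can locate forms, fill in fields, and submit them without speaking raw HTTP.
doc = ie.Document
for form in doc.getElementsByTagName("form"):
    print(form.method, form.action)

ie.Quit()
```

The nice part is that everything the browser already knows how to do (cookies, SSL, script execution) comes for free; the crawler only has to reason about the DOM it ends up with.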

Time’s up for today.  In the next post, I’ll introduce the API I’m using for my crawler, and I’ll discuss some of the barriers to meeting the first requirement.