For one of my graduate courses, I’ve decided to tackle the task of creating an intelligent agent for deep-web (AKA hidden-web) crawling. Unlike traditional crawlers, which work by following hyperlinks from a set of seed pages, a deep-web crawler digs through search interfaces and forms on the Internet to access the data underneath, data that is typically not directly accessible via a hyperlink (and would therefore be missed by a traditional crawler). It’s a tough problem to crack. No two search forms on the web are the same (example 1 and example 2), and there’s no standard that’s useful here. Worse, many now employ client-side scripting, SSL, and other technologies that make interaction very difficult even when you focus on a single site. When you try to scale to the myriad of possibilities that exist on the web, the problem becomes (seemingly) intractable.
There has been quite a bit of research in this area over the last few years. Even Google has gotten into the game with their own deep-web crawler that is now tearing up the (hidden) web. Still, most of the documented approaches have limitations that I don’t like: some only support GET-based requests (meaning that any form that uses POST is unsupported); many don’t work with client-side scripting; some try to spider an entire source (digging too deeply for a focused task), while others barely scratch the surface, missing a lot of valuable information. So, my goal is to create a flexible deep-web crawler that meets these basic requirements:
- If you can browse it, you can crawl it. The crawler should work with client-side scripting, work over SSL, and support POST requests.
- You give it "topics", and it does the rest. You tell it keywords (or phrases) that you’re interested in, and it will crawl a source to harvest as much information about those keywords as it can.
- It should be efficient. The crawler should intelligently determine which queries to submit to achieve high recall and high precision.
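To make the GET/POST distinction in the first requirement concrete, here is a minimal Python sketch of how the same form submission looks in each case. It is not the crawler's actual code: the helper name `build_form_request`, the `example.com` URL, and the `q` field are all made up for illustration, and the sketch only builds the requests without sending them.

```python
from urllib.parse import urlencode
from urllib.request import Request


def build_form_request(action_url, method, fields):
    """Build (but do not send) a request for a form submission.

    Hypothetical helper for illustration: action_url is the form's action,
    method is "get" or "post", fields maps input names to values.
    """
    encoded = urlencode(fields)
    if method.upper() == "POST":
        # POST: the fields travel in the request body, so the resulting
        # page has no URL of its own that a hyperlink could point to.
        return Request(
            action_url,
            data=encoded.encode("utf-8"),
            headers={"Content-Type": "application/x-www-form-urlencoded"},
            method="POST",
        )
    # GET: the fields are appended to the URL as a query string.
    separator = "&" if "?" in action_url else "?"
    return Request(action_url + separator + encoded, method="GET")


# Assumed example form: a single text input named "q".
get_req = build_form_request("https://example.com/search", "get", {"q": "deep web"})
post_req = build_form_request("https://example.com/search", "post", {"q": "deep web"})

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.full_url, post_req.data)
```

A GET submission ends up as an ordinary URL, which is why GET-only crawlers are comparatively easy to build. A POST submission carries its fields in the body, so a crawler has to construct and send the request itself.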
Time’s up for today. In the next post, I’ll introduce the API I’m using for my crawler, and I’ll discuss some of the barriers to meeting the first requirement.