Try-Catch-FAIL

Failure is inevitable.

Crawling results in DeepCrawler.NET

November 17, 2008 02:45 by Matt

In the last post, I laid out DeepCrawler.NET's (primitive) strategy for finding search forms, populating them, and submitting their contents using WatiN and a heuristic search mechanism.  As I mentioned at the end of the previous post though, submitting a query is only the first step in a complicated process.  Assuming nothing goes wrong, submitting a query will get us back one or more pages of results.  The problem now becomes parsing and crawling the results. 

The Current State

First, let's look at what a typical result page consists of, using Google as an example.  There are some useless links that don't really help us: links to other pages on Google (we'll call these navigational links), sponsored links, etc.  Then there are links to individual results along with "supporting" links for each result (such as links to the cached copy, links to similar pages, etc).  Finally, there are links to subsequent pages in the result set.  Ideally, we'd like to skip the first set of links and focus the crawl just on the second two sets of links (the actual results and the subsequent result pages).  Unfortunately, this is very difficult to do.  Without encoding some site-specific knowledge into the process, how do we determine the difference between a result link and a link that we don't care about? 

For now, I've decided to punt on the issue: DeepCrawler.NET just crawls all the available links.  Fortunately, with the right crawl depth, this actually has fairly good recall (meaning we grab a large number of the actual results) at the cost of low precision (we also grab a lot of pages that we don't care about).  The high recall is achieved because the initial result page contains links to the subsequent result pages, so the crawler recursively navigates to the result pages, pulling in their results, and so on.  The low precision is due to the same recursive property: DeepCrawler.NET pulls in the junk pages, follows their links, and so on.  It does work, but it's not perfect.
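
To make the crawl-everything strategy concrete, here is a minimal sketch of a depth-limited, breadth-first crawl.  It is not DeepCrawler.NET's actual code: the NaiveCrawl class and the getLinks delegate are hypothetical names, and getLinks stands in for "load the page in the browser and return its links".

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper (an assumption, not part of DeepCrawler.NET):
// follows every link it finds, up to a fixed crawl depth.
public static class NaiveCrawl
{
    // getLinks is an assumed stand-in for loading a page and
    // returning the URLs of its outgoing links.
    public static HashSet<string> Crawl(
        string startUrl, int maxDepth, Func<string, IEnumerable<string>> getLinks)
    {
        var visited = new HashSet<string> { startUrl };
        var frontier = new Queue<KeyValuePair<string, int>>();
        frontier.Enqueue(new KeyValuePair<string, int>(startUrl, 0));

        while (frontier.Count > 0)
        {
            KeyValuePair<string, int> current = frontier.Dequeue();
            if (current.Value >= maxDepth)
            {
                continue; // Depth limit reached: do not follow this page's links.
            }

            foreach (string link in getLinks(current.Key))
            {
                // Add returns false for pages that were already seen.
                if (visited.Add(link))
                {
                    frontier.Enqueue(
                        new KeyValuePair<string, int>(link, current.Value + 1));
                }
            }
        }

        return visited;
    }
}
```

Raising maxDepth reaches the later result pages (better recall), but every junk page found along the way gets the same treatment (worse precision), which is the trade-off described above.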

A Better Approach?

The approach I'm currently investigating is to "learn" where the good results are by executing probe queries against a source.  Assuming that the source is semi-consistent in how it presents results (which is a safe assumption, I think), the result pages for different queries should be largely the same, varying only in the actual result links, descriptions, etc.  Using this assumption, hopefully we can quickly identify navigational links and exclude them from the crawl.  Next, we can limit the crawler so that it only travels one link away from the original source's domain.  Doing this will enable us to crawl subsequent result pages by increasing the crawl depth without diminishing precision due to the crawler jumping further and further away from the results. 
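
The first half of that idea can be sketched roughly as follows, assuming the links from each probe query's result page have already been collected as plain URL strings.  The ProbeAnalysis class and its method names are hypothetical, and "appears on every probe page" is only one possible test for a navigational link.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (an assumption, not DeepCrawler.NET's code).
public static class ProbeAnalysis
{
    // Treats a link as navigational if it appears on the result page
    // of every probe query.
    public static HashSet<string> FindNavigationalLinks(
        IList<IList<string>> probePages)
    {
        if (probePages.Count == 0)
        {
            return new HashSet<string>();
        }

        var common = new HashSet<string>(probePages[0]);
        foreach (IList<string> page in probePages.Skip(1))
        {
            common.IntersectWith(page);
        }

        return common;
    }

    // Keeps only the links that were not flagged as navigational,
    // preserving their original order.
    public static List<string> FilterCandidates(
        IEnumerable<string> pageLinks, ISet<string> navigational)
    {
        return pageLinks.Where(link => !navigational.Contains(link)).ToList();
    }
}
```

Result links and paging links usually embed the query or a result ID, so they differ between probes and survive the filter.  The one-link-off-domain limit is not covered by this sketch.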

The above approach isn't perfect.  First, it still doesn't filter ads, but many of the sources that I envision DeepCrawler.NET being used for won't have ads, and there's always a chance that an ad is actually relevant to what you're looking for, so maybe that isn't a terrible thing.  Second, the approach for limiting crawl depth may not work well for search sources that only list links from their own sites.  For this type of source, the crawler could still navigate too broadly away from the results.  A better solution is needed, one that can precisely identify the result links and the controls for paging through a set of results.

What's Next?

Well, I obviously still have some work to do on crawling the results.  The current approach works, but it's flawed.  I plan to review recent research literature to see if there's an existing method I can build on top of, but I suspect I'm going into muddy waters here where there's not an easy solution.  Another thing that I need to do is add richer support for handling different file types.  Right now, DeepCrawler.NET is only capable of saving HTML documents (it can "visit" any file type, though), but this can easily be solved by leveraging available functionality in WatiN. 

One thing that's completely missing from DeepCrawler.NET is solid support for sources that utilize sessions.  For such sources, the crawl must be performed immediately after executing a query.  Ideally, the crawl should be interruptible and distributable, enabling a much more flexible crawl process.  My current plan for facilitating this is to have the crawler record the "path" of actions it took to arrive at any given page in the crawl.  With that path information, the crawler should (in theory) be able to replay the actions to return to any point in the crawl at any point in time.  This would allow a "path" to be discovered by one instance of the crawler, then followed further by a completely separate instance at a different point in time. 
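
As a rough illustration of the path idea, the sketch below records browser actions in order and replays them through a callback.  Everything here (CrawlActionKind, CrawlAction, CrawlPath) is a hypothetical design, not something that exists in DeepCrawler.NET; a real version would also need to serialize the path so that a separate crawler instance could pick it up.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical types (assumptions for illustration only).
public enum CrawlActionKind { Navigate, TypeText, Click }

public sealed class CrawlAction
{
    public CrawlActionKind Kind { get; private set; }
    public string Target { get; private set; } // URL or element identifier
    public string Value { get; private set; }  // Text to type, if any

    public CrawlAction(CrawlActionKind kind, string target, string value)
    {
        Kind = kind;
        Target = target;
        Value = value;
    }
}

public sealed class CrawlPath
{
    private readonly List<CrawlAction> mActions = new List<CrawlAction>();

    // Appends an action to the end of the recorded path.
    public void Record(CrawlAction action)
    {
        mActions.Add(action);
    }

    // Replays the recorded actions in their original order.  The
    // callback is where a browser-automation layer would plug in.
    public void Replay(Action<CrawlAction> perform)
    {
        foreach (CrawlAction action in mActions)
        {
            perform(action);
        }
    }
}
```

Replaying from the start re-executes the query, so a fresh session is created before the crawler arrives back at the page it wants to continue from.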

Questions or comments?  Suggestions or ideas?  Please leave me a comment.


DeepCrawler.NET: Alive and Kicking

November 14, 2008 02:10 by Matt

Much to my surprise, getting DeepCrawler.NET up and working with basic functionality was easy.  It's far from finished, and I haven't exhaustively tested it, but it does work.  In this post, I'll describe the current implementation with respect to how I've addressed some of the barriers raised in my last post.

How do we decide which form contains a search form?

Some sites (in particular, FedBizOpps) contain multiple forms.  As in this case, sites can contain login forms alongside query forms, and we definitely don't want our crawler to try to log in with our various keywords. 

To address this barrier, DeepCrawler.NET employs a heuristic search of the forms on the page.  It calculates a probability for each form on the page by examining the form's contents (text fields, labels, buttons, etc) as well as attributes on the form, such as the form's name and ID.  The properties of each of these controls are compared to a very short list of words (query, search, keywords) that correlate well with search forms.  Right now, scores are basically binary: the heuristic either considers something a potential match or not a potential match.  Future work will add some more intelligent scoring to the process.
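
Here is a minimal sketch of that binary scoring, assuming each form has already been reduced to a list of descriptor strings (form name and ID, field names, label text, button captions).  The FormHeuristics class and its methods are hypothetical; only the three-word list comes from the description above.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper (an assumption, not DeepCrawler.NET's code).
public static class FormHeuristics
{
    // The short list of words that correlate well with search forms.
    private static readonly string[] SearchWords = { "query", "search", "keywords" };

    // Binary score: 1 if any descriptor contains any search word
    // (ignoring case), otherwise 0.
    public static int Score(IEnumerable<string> descriptors)
    {
        bool match = descriptors.Any(d => d != null && SearchWords.Any(
            w => d.IndexOf(w, StringComparison.OrdinalIgnoreCase) >= 0));
        return match ? 1 : 0;
    }

    // Returns the index of the first form that scores as a potential
    // search form, or -1 if no form matches.
    public static int PickSearchForm(IList<IList<string>> forms)
    {
        for (int i = 0; i < forms.Count; i++)
        {
            if (Score(forms[i]) == 1)
            {
                return i;
            }
        }

        return -1;
    }
}
```

A login form described by "loginForm", "username", and "password" scores 0, while a form with a field named "txtKeywords" scores 1.  Weighted scores per control type would be one way to move beyond the binary version.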

Another issue not fully addressed yet is forms with useless ID, name, and value descriptors for the fields within the form.  Right now, DeepCrawler.NET can't do anything with them, but that's going to change (hopefully this weekend).  I've prototyped a search mechanism that looks for text labels (not label elements, which are much more useful and already handled by DeepCrawler.NET) by "visually" searching a grid around a form element of interest.  The search is primitive now, but early tests indicate that it will successfully locate text labels corresponding to form fields in some cases.

How do we determine where to place our query criteria?

This is somewhat addressed by the solution to the previous issue.  Within each candidate form, DeepCrawler.NET applies a heuristic test to each text box (input elements of type text) to determine if the text box is where the query should go.  The "best" text box is retained and used as our query box.  Again, this is a somewhat naive approach, but it works well enough for the sites I've evaluated DeepCrawler.NET against.  Future work will add more intelligence to the heuristics.

In its current state DeepCrawler.NET doesn't handle anything but text boxes.  That still leaves drop-down lists, multi-selects, text areas, radio buttons, and checkboxes.  Technically, there could also be hidden fields, but since Internet Explorer is serving as the crawler's "window", I'm assuming that any hidden fields will be correctly populated by the page itself.  I plan to address the remaining field types in the near future, but probably not for the first "finished" version of DeepCrawler.NET.

How do we submit the form?

So, at this point, we have a form, and we know where the query should go, but how do we submit it?  Again, this problem is addressed by the heuristic search that finds the search form.  While calculating the probability that each form is a search form, it examines buttons and images, and the best one is retained to submit the form.  Unfortunately, this does not work for forms that use JavaScript with hyperlinks to submit the form, but I'm already working on code to address that limitation.  There are other issues, too, like working with an image button that lacks useful attributes, but I probably won't address those issues in this first version of the crawler.

Moving Forward

I've described how DeepCrawler.NET finds a form, populates it, and submits it, but that's only part of the battle.  Next up is crawling the search results.  My approach is somewhat primitive, but it actually works quite well in the limited testing I've done so far.  I'll do a write-up on that at some point next week.  I also plan to release the full source code to DeepCrawler.NET after the semester ends, but if you have any questions on how I'm accomplishing anything specific (remember that I'm using WatiN right now), feel free to ask in the comments.


Rant: Automatic Updates

November 14, 2008 01:06 by Matt

You know what's awesome?  When you leave a long-running process going overnight on your computer, and return the next day expecting to see the results, but instead realize that Automatic Updates rebooted your computer for you, killing the process and losing all of your process's progress!  Yeah, that's my bad for not remembering to disable Automatic Updates on this new computer, but this is terrible usability.  How hard would it have been to make Automatic Updates check my running processes to see if anything looked really, really busy before forcing a reboot?!? 

Also, if you use OneCare Live, you get the nice circle-of-redness if you disable the "force a reboot on my computer any time you want" feature.  This. Sucks. 


Deep-web crawling with .NET: Getting Started

November 12, 2008 10:40 by Matt

Thanks go out to Sol over at FederatedSearchBlog.com for giving me some suggestions on things to watch out for.  If you want more background information on federated search or information retrieval, go check out that site.

In the last post, I introduced the idea of creating a deep-web crawler.  I laid out the basic requirements that I've given myself, and I touched on some of the barriers to meeting those requirements.  In this post, I'm going to introduce DeepCrawler.NET, my .NET-based (prototype-stage) crawler.

DeepCrawler.NET is written in C# for Microsoft .NET 3.5.  While there is intelligence behind it, at its core it is doing nothing more than automating Internet Explorer.  The crawler's "brain" examines a page in IE, then tells IE what to do, such as populate a form field with a value, click a link or button, or navigate to a new URL.  To facilitate this automation, I'm currently using the open-source WatiN API.  WatiN is actually designed for creating unit tests for web interfaces, but it's proving to be a fairly nice abstraction over the alternative method of automating IE from C# (that is, using the raw COM APIs). 

The main class in WatiN is "IE", which represents an instance of the Internet Explorer browser.  There are all sorts of options you can adjust to control how WatiN "wraps" the browser, but for the most part, the defaults appear to be fine.  Now, though WatiN is designed to facilitate testing of a web form, its API is flexible enough to enable exploratory analysis of a web page.  You can easily enumerate forms, links, buttons, or anything else in the DOM tree.  Since the first task of a deep-web crawler is just to submit a query through a search form, our task is straightforward, assuming access to a magic black box that can help us make certain decisions.  First, enumerate the forms (some pages may contain multiple forms), and use the black box to select the form that is most likely the search form.  Next, enumerate the fields in the form, and use the black box to determine which fields correspond to which available query criteria (the crawler's pool of queries may be simple keywords, or they could be keywords augmented with date ranges or other values).  Finally, enumerate the buttons and links, and use the black box to determine which one to use to submit the form and begin the search.  From there, it's a simple matter of paging through the results and grabbing all the links.

By simple, I mean very NOT simple.  First, not all the links on the page are going to be for results: some may link back to the search form, some may go elsewhere on the site, some may be ads, and some are (hopefully) the links to page through the results.  Which brings up another issue: how do you determine how to page through the results?  These are open questions that I'm currently working to address and will hopefully discuss in a future post.

Ignoring those issues for now and focusing just on how you submit a form, you can see that I've skipped all the hard parts by using this magical black box.  Such a box doesn't exist, so we have to implement one.  What issues do we have to deal with?  How do we decide which form contains a search form?  Once we've done that, how do we determine where to place our query criteria?  There's *nothing* that says people have to give their form fields meaningful names or IDs, so "q" could just as easily be a box for a "query" as it could be a box for entering your username.  Finally, even if we find the form and figure out how to populate it with the query we want to execute, how do we submit the form?  Some forms may use JavaScript, some may use buttons, some may use images... what can we do?  How do we determine which one to use?

In the next post, we'll start diving into some of these issues.  And remember, these are just the issues with meeting the *first* requirement.  After that, we still have to figure out how to do these things efficiently and intelligently.  It's a long road ahead; I wish I had more than four weeks left in the semester. :D


Creating a deep-web crawler with .NET: Background

November 10, 2008 10:15 by Matt

For one of my graduate courses, I've decided to tackle the task of creating an intelligent agent for deep-web (AKA hidden-web) crawling.  Unlike traditional crawlers, which work by following hyperlinks from a set of seed pages, a deep-web crawler digs through search interfaces and forms on the Internet to access the data underneath, data that is typically not directly accessible via a hyperlink (and would therefore be missed by a traditional crawler).  It's a tough problem to crack.  No two search forms on the web are the same (example 1 and example 2), and there's no standard that's useful here.  Worse, many now employ client-side scripting, SSL, and other technologies that make interaction very difficult even when you focus on a single site.  When you try to scale to the myriad of possibilities that exist on the web, the problem becomes (seemingly) intractable. 

There has been quite a bit of research in this area over the last few years.  Even Google has gotten into the game with their own deep-web crawler that is now tearing up the (hidden) web.  Still, most of the documented approaches have limitations that I don't like: some only support GET-based requests (meaning that any form that uses POST is unsupported); many don't work with client-side scripting; some try to spider an entire source (digging too deeply for a focused task), while others barely scratch the surface, missing a lot of valuable information.  So, my goal is to create a flexible deep-web crawler that meets these basic requirements:

  1. If you can browse it, you can crawl it.  The crawler should work with client-side scripting, work over SSL, and support POST requests.
  2. You give it "topics", and it does the rest.  You tell it keywords (or phrases) that you're interested in, and it will crawl a source to harvest as much information about those keywords as it can.
  3. It should be efficient. The crawler should intelligently determine which queries to submit to achieve high recall and high precision. 

For now, let's ignore requirements 2 and 3, and focus just on requirement 1.  There are a few ways we could go about this.  First, we could use low-level API classes to simulate GET, POST, and other requests, basically implementing an HTTP-compliant crawler.  We could parse the HTML received in response to a request, then 'figure out' what to do next (more on that in a future post).  But what happens if the HTML contains JavaScript?  What if the form requires cookies?  What if the traffic is encrypted over SSL?  Well, we could implement support for all of that eventually, but it would be very, very painful.  We would basically be implementing an automated browser.  Instead of going that route, why not just automate an existing browser?  It turns out that this option is actually fairly easy to do.  Microsoft has a COM API that can be used to manipulate Internet Explorer programmatically.  This API, though a bit rough around the edges, can be leveraged to create a crawler, which is the route I've chosen to go.  Doing so will allow me to very quickly bypass painful things like cookies, JavaScript, etc., and focus on implementing the logic to find and interact with web forms.

Time's up for today.  In the next post, I'll introduce the API I'm using for my crawler, and I'll discuss some of the barriers to meeting the first requirement.


How to run a software development company (INTO THE GROUND) - Part 7

November 7, 2008 09:57 by Matt

It's been a while since I've done one of these.  I wish I could say that it was because the bucket was running low, but it's not.  Unfortunately, most of the examples that spring to mind lately are a tad too personal to vent in public (right now), but here's a solid, non-offensive practice you can implement to ensure the complete and utter failure of your software development company.

"I need subcontractors.  Lots of subcontractors."

So you have some money in your pocket, and you want to spend it on software development.  You have a few options:

  1. You could buy the things your in-house developers have been pestering you about, like chairs that aren't broken, monitors that don't flicker so badly that they cause seizures, and "workstations" that you didn't buy surplus from the government five years ago. 
  2. You could hire additional developers.
  3. You could actually pay overtime for those "rare" weeks where there seems to be more to do than the team can accomplish during a 40 hour week.
  4. You could maintain the status quo a little longer.  If things are tight, the extra money can be used to maintain existing staff and resources for longer.
  5. You can bring in consultants and subcontractors.

Let's weigh the pros and cons of each.  With option 1, the developers get better equipment, but how's that going to make you money?  It's not.  The quality of the equipment has absolutely nothing AT ALL to do with developer productivity.

With option 2, you would be bringing on more people, which means you would be spending more money (and that's good, because you have to spend money to make money). 

Concerning option 3: HAHAHAHAHAHAHAH, yeah, overtime, hahahaha, right.

Option 4 is just dumb.  Money is like a hot potato: you want to get rid of it as quickly as possible.

Option 5 sounds interesting.  Again, it would involve spending lots of money, which automatically guarantees making lots of money.  This is actually a better choice than option 2 for two reasons.  First, consultants and subcontractors are (usually) more expensive than direct hires, therefore the quality of their work will be superior.  Second, because they are more expensive, you will spend more money faster, which means you will make more money and get rid of the hot potato sooner. 

So, let's go with option 5.  You want to find the most expensive subcontractors or consultants that you can.  It doesn't matter if they are not experienced with the type of software you are building, because consultants are expensive and therefore always contribute a lot of value.  Be sure not to pay any attention to things like previous projects, or even whether or not they are familiar with the tools and languages that your company uses.  Code is code, after all; everything just works when you pile it all together.

Once you have brought consultants on board, it is very important to let them drive the process.  It doesn't matter if your developers have an existing "methodology" (whatever that is); just do whatever the consultants say!  Remember, you are paying them more than your developers (hopefully more than all your developers combined), so they are smarter than your developers.  Be sure not to give the consultants any firm deliverables or requirements.  You don't want to chain these people down; you want to hitch your company to them and let them pull your software development up to the stratosphere of success!

Concluding Remarks

There are situations where consultants and/or subcontractors are very, very useful.  For example, maybe your team needs an installer for a project, but no one on the team knows anything about installers.  Instead of investing effort in mastering a technology that you may not have much use for, it may be better just to pay someone else to build the installer for you. 

However, there are many times when bringing in consultants is the completely wrong choice.  For example, you don't have much need for them early in a project.  The early stages are to figure out roughly what you're actually going to build, and consultants (unless they are experts in whatever market the project is targeting) aren't going to contribute much.  You also don't want to bring consultants in unless you have a very clear, specific need for them.  Don't bring them in to do general work that your own developers can do.  If your developers are overworked, bring on additional full or part-time staff, not a long-term consultant or subcontractor.

If you do decide to bring a consultant in, agree on specific deliverables and hold the consultant to them.  Charging a lot of money is not an excuse to under-deliver.  If things aren't working out, cut ties as soon as you can.  Be careful with how any contracts are worded, too.  You don't want to get tied into a relationship that isn't beneficial to your company.  Keep the scope small, focused, and specific.  Do that, and you might be ok.  Do it not, and you are probably going to have problems. 


Dynamic objects in C# 4.0

November 5, 2008 00:43 by Matt

I was going to write up my feelings on the new 'dynamic' features coming with C# 4.0, but this basically sums up my feelings exactly.  Are dynamic references a useful feature?  Sure, there are going to be times when this is really handy.  Can it be misused?  Most definitely.  Will it be misused?  Almost certainly. Sometimes, having to jump through the hoops of strong-typing is a really, really good thing.


Bridging the Java-.NET Gap: foreach-ing an Enumeration

November 3, 2008 10:08 by Matt

My fun with IKVM.NET continues this week as I utilize Weka from .NET (just a fun note, but the .NET-compiled version is insanely faster than the Java version doing the exact same thing; suck on that, Java fans!).  For the most part, everything has been swell.  The hardest part is trying to decipher research methodologies to replicate the systems they describe.  Still, I've run into a few snags when working with the .NET versions of Weka, the first of which is that you can't foreach over a Java Enumeration.  For example, consider the following:

    //THE FOLLOWING DOESN'T COMPILE!
    //foreach (weka.core.Instance instance in instances.enumerateInstances())
    //{
    //    //TODO: Operate on instance here.
    //}

    Enumeration enumerator = instances.enumerateInstances();
    while (enumerator.hasMoreElements())
    {
        weka.core.Instance instance = (weka.core.Instance)enumerator.nextElement();

        //TODO: Operate on instance here
    }

While it isn't a huge difference, being able to use foreach would obviously be simpler than having to create an instance of Enumeration, then use it to step through the items.  Fortunately, it's quite easy to generically bridge the gap so that any Enumeration can be enumerated by simply calling an extension method, like so:

    foreach (weka.core.Instance instance in instances.enumerateInstances().ToEnumerable())
    {
        //TODO: Operate on instance
    }

How does this work?  First, we need to apply the adapter pattern to convert a Java Enumeration into a .NET IEnumerator.  Here's our adapter:

    using System;
    using System.Collections;
    using java.util; // java.util.Enumeration, as exposed by IKVM.NET

    /// <summary>
    /// Provides an adapter that can convert a Java
    /// Enumeration class into something that implements
    /// IEnumerator.
    /// </summary>
    public class EnumeratorEnumerationAdapter : IEnumerator
    {
        #region Private Fields

        /// <summary>
        /// The class being adapted.
        /// </summary>
        private Enumeration mEnumeration;

        /// <summary>
        /// The current object (null until MoveNext succeeds).
        /// </summary>
        private object mCurrent;

        #endregion

        #region Implementation of IEnumerator

        /// <summary>
        /// Advances the enumerator to the next element of the collection.
        /// </summary>
        /// <returns>
        /// true if the enumerator was successfully advanced to the next element; false if the enumerator has passed the end of the collection.
        /// </returns>
        public bool MoveNext()
        {
            if (!mEnumeration.hasMoreElements())
            {
                return false;
            }

            mCurrent = mEnumeration.nextElement();
            return true;
        }

        /// <summary>
        /// Not supported: a Java Enumeration is forward-only and
        /// cannot be rewound to its initial position.
        /// </summary>
        /// <exception cref="T:System.NotSupportedException">Always thrown.</exception>
        public void Reset()
        {
            throw new NotSupportedException();
        }

        /// <summary>
        /// Gets the current element in the collection.
        /// </summary>
        /// <returns>
        /// The element most recently returned by the wrapped Enumeration,
        /// or null if MoveNext has not been called yet.
        /// </returns>
        public object Current
        {
            get
            {
                return mCurrent;
            }
        }

        #endregion

        #region Public Constructors

        /// <summary>
        /// Creates an adapter for the specified enumeration.
        /// </summary>
        /// <param name="enumeration">The Java Enumeration to adapt.</param>
        public EnumeratorEnumerationAdapter(Enumeration enumeration)
        {
            mEnumeration = enumeration;
            mCurrent = null;
        }

        #endregion
    }

Next, we just need to write the extension method that utilizes our adapter to create an IEnumerable:

    using System.Collections;
    using java.util; // java.util.Enumeration, as exposed by IKVM.NET

    /// <summary>
    /// Contains extension methods to simplify working with 
    /// <see cref="Enumeration"/> objects.
    /// </summary>
    public static class EnumerationExtensions
    {
        /// <summary>
        /// Creates a <see cref="IEnumerable"/> wrapper
        /// around a <see cref="Enumeration"/>.
        /// </summary>
        /// <param name="enumeration">The Java Enumeration to wrap.</param>
        /// <returns>
        /// A sequence that can be enumerated only once, because it
        /// consumes the underlying Enumeration as it goes.
        /// </returns>
        public static IEnumerable ToEnumerable(this Enumeration enumeration)
        {
            EnumeratorEnumerationAdapter adapter = new EnumeratorEnumerationAdapter(enumeration);

            while (adapter.MoveNext())
            {
                yield return adapter.Current;
            }
        }
    }

And like magic, you can now foreach over any Java Enumeration just as if it were a .NET IEnumerable implementor.  It'd be nice if this were baked into IKVM.NET, but for now, this simple "hack" will do the trick.


Why Microsoft instead of "Open Source"?

October 31, 2008 08:44 by Matt

Recently, one of my friends asked me why I was a "Microsoft" coder instead of joining the open-source camps.  I think that's a great question, but it's really two questions: how did I end up as a "Microsoft" coder, and (the more interesting question) why have I chosen to remain a "Microsoft" coder when there are so many cool buzzwords popping up?  I may get into the former question some day, but for today, I thought I'd run through the top 5 reasons why I am, and will remain for the foreseeable future, 100% entrenched in the Microsoft .NET world of development.

1. I am very, very productive with .NET.

I love writing code, but I don't enjoy writing code for the sake of writing code.  I love writing code that does things.  I like solving problems.  And I like to do it as elegantly (and as quickly) as possible.  That's probably the main reason that I have stayed firmly rooted in C#: I can solve most any problem quickly without changing languages or platforms.  I know the language very well, and I've run into very few cases where I felt like it was impeding me from solving a problem.  I'm not going to switch just for the sake of switching.  Yes, I'll continue to test out the waters of new platforms, languages, tools, etc, but I'm only going to switch to something that makes me more productive.  So far, I haven't found anything that meets that requirement.

2. What other platform is this diverse?

Let's look at what I can do with .NET (starting from the bottom up): I can write scripts.  I can create console applications.  I can create rich client applications. I can create thin web applications.  I can create rich interactive web applications.  I can create games.  I can create web services.  I can create enterprise services.  I can create embedded database modules.  I can write add-ins for many popular products.  While it is true that you can do some of these things with other platforms, nothing that I've seen is anywhere near as well-supported across such a wide range of areas as is .NET. 

3. We have the best tools, period.

We have some amazing tools available to us in the .NET world, both free and fee.  On the free side, we have things like the Express Editions, SharpDevelop, NUnit, and TestDriven.NET.  On the pay side, we have the Pro+ versions of Visual Studio, ReSharper, dotTrace, and oh so much more.  It's easy (and fun!) to write code when you have the best tools at your disposal. 

4. We have great APIs.

Out of the box, .NET has a wonderfully clean API.  Sure, there are some things that aren't perfect, but it's head-and-shoulders over the mess that is PHP or the muck that is Java.  On top of that, we have some great add-on APIs that make life even easier: Castle Project, NHibernate (to be fair, we stole this from Java), log4net (ditto), Moq, ASP.NET MVC... I could name things off all day long, and I still couldn't cover them all.  There is a very active .NET developer community with a lot of very smart, talented people contributing both code and wisdom to our collective knowledge base.

5. .NET continues to improve and evolve.

Microsoft keeps pushing forward to make the .NET platform even better.  We're getting major revisions to the framework about every two years, and we're getting interim releases with new functionality constantly.  They're paying attention to what others are doing that works (functional programming, for example), and they're bringing it into the framework.  I don't need to go somewhere else for the Next Big Thing, because Microsoft is going to bring it to me.


So, that's my top 5 list.  Take it for what it is (a .NET developer's reasoning for staying in the .NET world), but I don't consider myself to be one of those people that gets entrenched in one language/platform and can never move on.  I actually enjoy learning new things, so I'm probably slightly biased towards moving on to the Next Big Thing.  The fact that I've been a .NET developer for six years now is a testament to just how compelling a platform Microsoft has made. 


Using Java APIs in .NET

October 28, 2008 15:40 by Matt

Have you ever found a neat-looking API that would save you tons of time and pain, only to have your hopes crushed when you discover that the API is written in Java?  Well, fret no more, because there’s a nice, easy way to leverage tasty Java APIs from within .NET: just compile them to IL using IKVM.NET!

I’m currently using the excellent Weka machine learning library from .NET.  Here’s the code:

    /// <summary>
    /// Simple ad-hoc class for testing out the Weka API from .NET.
    /// </summary>
    public class AdHocTests
    {
        /// <summary>
        /// Tests COBWEB.
        /// </summary>
        public void CobwebTest()
        {
            string input = @"D:\Program Files (x86)\Weka-3-5\data\soybean.arff";