This is the obligatory “introduction to something that I’m going to be talking about periodically” post. I think a lot of developers (myself included until recently) are not familiar with Lucene/Lucene.NET and where it fits in their development toolbox. By the end of this post, you should understand the basics of Lucene, and you will (hopefully) want to come back for future posts that will show you how to use Lucene in your applications.
What is Lucene?
Good question. For that, I defer to the documentation: “Apache Lucene is a high-performance, full-featured text search engine library written entirely in Java. It is a technology suitable for nearly any application that requires full-text search, especially cross-platform.” Pretty self-explanatory, right? OK, maybe not. Basically, it’s an open-source library that you can use if your application needs to provide search capabilities that go beyond simple relational queries.
“But Matt, this is a .NET blog, why are you talking about a Java library? I HATE JAVA!” Yeah, we all hate Java, but they have some cool projects (that run like crap because the Sun Java VM is a horrible, horrible abomination that should be purged by fire*). As is the case for most good Java projects, we (the .NET community) have “borrowed” Lucene and created Lucene.NET: “Lucene.Net is a source code, class-per-class, API-per-API and algorithmatic port of the Java Lucene search engine to the C# and .NET platform utilizing Microsoft .NET Framework.” Basically, Lucene.NET brings the Lucene API into .NET. Be warned though: it is a very direct port. You will see lots of GetSuchAndSuch and SetSuchAndSuch methods instead of properties. You will see things that expose iterators but are not IEnumerable, meaning you can’t just drop them into a foreach loop. You will see things that *should* implement IDisposable, but don’t.
Those issues aside, the fact that we can use Lucene in .NET is still awesome, awesome enough to outweigh the Javainess of the API.
What’s it used for?
The most common use for Lucene is full-text indexing and search. “But I’m not building a search engine, I’m building a bug tracking/record management/whatever!” Is your system storing text? Think users might ever want to search the text? Congratulations, you have a need for full-text search. Lucene can allow you to quickly and easily index text fields for fast, efficient searching later. And when I say fast, I mean fast. Way, way faster than doing something stupid like string.Contains across all your records or doing WHERE Text LIKE '%whatever%' in your database.
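To make the speed difference a bit more concrete, here is a toy sketch of the general idea behind a full-text index: build a map from each term to the records that contain it, once, and then answer queries with lookups instead of scanning every record. This is not Lucene's API or its actual implementation. It is written in Python only to keep it short and dependency-free, and every name in it (tokenize, build_index, search, scan, the sample bug reports) is made up for illustration.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_index(records):
    """Build a toy inverted index: term -> set of record ids containing it."""
    index = defaultdict(set)
    for record_id, text in records.items():
        for term in tokenize(text):
            index[term].add(record_id)
    return index

def search(index, query):
    """Return ids of records that contain every term in the query."""
    terms = tokenize(query)
    if not terms:
        return set()
    # Start with the matches for the first term, then intersect the rest.
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return result

def scan(records, needle):
    """The slow way: substring-check every record, like LIKE '%needle%'."""
    needle = needle.lower()
    return {rid for rid, text in records.items() if needle in text.lower()}

# Hypothetical bug-tracker records, keyed by id.
records = {
    1: "Login page crashes when the password is empty",
    2: "Crash on startup after upgrade",
    3: "Password reset email is never sent",
}
index = build_index(records)

print(search(index, "password"))        # records 1 and 3
print(search(index, "password empty"))  # record 1 only
print(scan(records, "crash"))           # records 1 and 2 (substring match)
```

Note that the indexed search for "crash" finds only record 2, because this sketch matches whole terms and "crashes" is a different term. A real search library deals with that kind of thing through text analysis (stemming and so on), which the sketch deliberately leaves out.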
Lucene can index and search more than just text though. You can index dates, numbers… virtually anything that your users might want to build queries on. Sure, you could do something similar using SQL Server, but…
Why not use SQL Server?
Because I am telling you: do not use SQL Server to store and search large amounts of text. I have direct experience going down this road, and it leads only to pain and suffering. Yes, all versions of SQL Server support full-text indexing, but words cannot describe how terrible the performance is for large databases (and how bad it is at actually ranking results). It’s also inflexible, whereas Lucene is quite flexible, supports a wide range of queries, and does a pretty good job of ranking things. Lucene is also not the overweight elephant that SQL Server is: it uses very little RAM and runs fine even on old or scaled-down machines.
Here’s my advice: use SQL Server for things that map well to set-based operations. Use it for creating reports that involve aggregation. Use it for doing primary key look-ups. But please, PLEASE, don’t use it as a search engine, especially not for text. I have been down this road (in fact, I’m *still* trying to get off this road), and it is not anywhere you want to be.
Whoa, this post is already finished? I thought it would take longer to make the case for why Lucene is great, but obviously not since this is the last section. In future posts, we’ll start looking at how to get up and running with Lucene.NET. We’ll move on to advanced topics, such as distributed indexing and search, faceting, and (hopefully) even look at how to correctly use Lucene and SQL Server together. We’ll also take detours along the way to see how you can integrate Lucene with your ORM solution to get full-text indexing for free. If anyone has suggestions or ideas for things they want to see, please let me know!
*That’s not a joke. I have no problems with the Java language, other than it being inferior and all, but I hate the Sun Java VM with the fiery passion of a thousand suns.