I have mentioned my thesis work only vaguely on this blog up until now (mostly out of fear of some shady PhD-by-mail student trying to steal my ideas), but since things are winding down, I think it’s safe to start talking a bit more about what I’m doing.

So, my thesis is in the area of data mining and machine learning. It was motivated by a fairly common problem: we have a data set that existing machine learning classification techniques are simply unable to model well. Our hypothesis is that this isn’t due to an inherent limitation of the algorithms; we believe the problem is the data itself. In our case, the training data provided, even though there is *tons* of it, is not useful for learning a classification function. The question we wanted to answer was why. We wanted to see which subsets of the data the classifier was able to learn from, which it wasn’t, and what those subsets represented. With that information, maybe we could provide feedback to domain experts about where the data is insufficient.
To that end, we have designed a new process that leverages both supervised and unsupervised machine learning techniques to discover a knowledge frontier within data. A knowledge frontier represents a conceptual boundary where a classifier’s performance becomes very stable. The performance may be terrible, but it’s consistent across that subset of the data. Knowledge frontier nodes are subgroups/subpopulations of a data set where no meaningful sub-partition exists that has significantly different performance with respect to some classification technique.
*reads what he wrote*
Huh, that’s still kinda vague. It’s tough to condense 20+ pages of explanation and justification down to a couple of paragraphs. That, and I don’t want you stealing my ideas, PhD-by-mail guy!
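But let me try to make it a little more concrete anyway. Here’s a rough sketch of the frontier-node criterion in C#. This is illustrative only: the class and property names are made up for this post, and the real discovery process uses a proper statistical test rather than a fixed threshold.

```csharp
using System;
using System.Collections.Generic;

// Illustrative sketch, not the real implementation: a node in the cluster
// tree is on the knowledge frontier when no sub-partition's classifier
// performance differs meaningfully from the node's own.
public class ClusterNode
{
    public List<ClusterNode> Children = new List<ClusterNode>();
    public double Accuracy; // classifier accuracy measured on this subpopulation

    public bool IsFrontierNode(double threshold)
    {
        foreach (ClusterNode child in Children)
        {
            // A sub-partition that behaves very differently means we
            // haven't reached the frontier yet.
            if (Math.Abs(child.Accuracy - Accuracy) > threshold)
                return false;
        }
        return true; // performance is stable across all sub-partitions
    }
}
```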
Anyway, how is this interesting to .NET developers? Well, for a couple of reasons. First, I have created a toolkit for machine learning in C#. It’s modeled loosely after Weka, but I think the object model is actually significantly better than Weka’s. And to add insult to injury, it actually interops with Weka via IKVM.NET. The interop isn’t as polished as I’d like yet, but it’s getting there. Oh, and did I mention that it performs about 1000% better than Weka? Well, it does. I’ll be posting more about this toolkit in the coming weeks/months/whenever.
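For a taste of the interop, here’s roughly what driving the converted Weka API from C# looks like. Consider this a sketch: it assumes weka.jar has been run through IKVM to produce a weka.dll that you reference alongside the IKVM runtime assemblies, and the class and method names mirror Weka’s Java API one-for-one because IKVM preserves them.

```csharp
using weka.core;
using weka.classifiers;
using weka.classifiers.trees;

class WekaInteropDemo
{
    static void Main()
    {
        // Load an ARFF file using the same API you'd use from Java.
        Instances data = new Instances(
            new java.io.BufferedReader(new java.io.FileReader("iris.arff")));
        data.setClassIndex(data.numAttributes() - 1);

        // Build a C4.5 decision tree (J48) and classify the first instance.
        Classifier tree = new J48();
        tree.buildClassifier(data);
        double prediction = tree.classifyInstance(data.instance(0));

        System.Console.WriteLine("Predicted class: "
            + data.classAttribute().value((int)prediction));
    }
}
```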
The other reason this is interesting is that I have taken it as an opportunity to learn WPF, which sounded a lot cooler when they called it Avalon. Stupid Microsoft marketing… Anyway, that brings us to Knowledge Frontier Miner, my tool for actually discovering knowledge frontier nodes within a data set. It does this by first using the conceptual clustering algorithm COBWEB to create a hierarchical cluster tree (basically a tree that recursively partitions the data set into smaller and smaller pieces, where each partition has some sort of “meaning” to it, like decomposing animals into mammals, reptiles, birds, etc.). Next, it applies the knowledge frontier discovery process to those clusters to find the frontier nodes.
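In sketch form (building on the made-up ClusterNode class from earlier, and again glossing over the statistics), the discovery pass walks down from the root and keeps the shallowest nodes that satisfy the criterion:

```csharp
using System.Collections.Generic;

// Simplified sketch of the discovery pass over the COBWEB cluster tree:
// descend from the root and collect the shallowest nodes whose performance
// is stable across all of their sub-partitions.
public static class FrontierDiscovery
{
    public static List<ClusterNode> FindFrontier(ClusterNode root, double threshold)
    {
        var frontier = new List<ClusterNode>();
        Collect(root, threshold, frontier);
        return frontier;
    }

    private static void Collect(ClusterNode node, double threshold,
                                List<ClusterNode> frontier)
    {
        if (node.Children.Count == 0 || node.IsFrontierNode(threshold))
        {
            frontier.Add(node); // stable here: a knowledge frontier node
            return;
        }
        foreach (ClusterNode child in node.Children)
            Collect(child, threshold, frontier);
    }
}
```

The result is presented in a WPF TreeView that looks absolutely nothing like what you would expect from a TreeView: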
Yeah, that’s a TreeView. That screenshot (which wouldn’t have been possible without the excellent articles by Josh Smith on CodeProject) really shows off the power of WPF. Virtually everything can be templated and customized, then bound to whatever you want. That beautiful TreeView, which by the way supports zooming, took far less total code than just the data binding would have taken had I gone the WinForms route.
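To give you an idea of how little markup that takes, here’s the general shape of the template (a simplified sketch with made-up property names like RootNodes and AccuracyBrush; the real one has far more styling):

```xml
<TreeView ItemsSource="{Binding RootNodes}">
  <TreeView.ItemTemplate>
    <!-- A HierarchicalDataTemplate applies itself recursively to Children,
         so the whole cluster tree binds with a handful of lines. -->
    <HierarchicalDataTemplate ItemsSource="{Binding Children}">
      <Border CornerRadius="4" Padding="4"
              Background="{Binding AccuracyBrush}">
        <TextBlock Text="{Binding Label}" />
      </Border>
    </HierarchicalDataTemplate>
  </TreeView.ItemTemplate>
</TreeView>
```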
More screenshots to prove how awesome this is:
The last one is a data set with a very, very broad cluster tree and very inconsistent accuracy in See5; that’s why the frontier is so massive. 🙂
If I can find the time, I’ll post some code and discussion about how the app actually works. If anyone is even remotely interested in knowledge frontier discovery, I might make another, more detailed post about that. Let me know what you want in the comments!
I would be really interested in what this tool can accomplish. Is it 10x better than Weka? How did you measure this? How much better is it than Infer.NET (http://research.microsoft.com/en-us/um/cambridge/projects/infernet/)?
@bnm, I should have worded that more carefully, because "performance" can mean a lot of things. In terms of accuracy or "correctness" of results, my toolkit is identical to Weka; it re-uses the same classes, since the Weka JAR file has just been converted to a .NET DLL. The performance improvement I was referring to was speed. I have not (yet) done a formal evaluation, but unofficially, the difference is very obvious. It’s not that there’s anything wrong with Weka that causes it to run slowly; it’s just that Java is an inferior platform (my two cents).
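For the curious, the conversion itself is essentially a one-liner with IKVM’s static compiler, something like `ikvmc -target:library weka.jar`, which produces a weka.dll you can reference like any other .NET assembly.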
As for Infer.NET, I just saw that yesterday. It’s on my list of things to investigate further.
Anyway, I plan to make my toolkit open source at some point, and I will quantify the speed difference between the original Weka and the Weka API converted to .NET in a future post. Thanks for the feedback!