In the last post, I introduced the topic of machine learning. In this post, I’ll describe an example problem, discuss how you might go about writing code to address the problem, then discuss how you can apply machine learning to the same problem and get the computer to do the heavy lifting for you. In the end, hopefully you’ll at least have a vague idea of how machine learning might be useful.
You work for a company that makes data entry software for hospital emergency rooms. One day, your boss comes in and says that you’re going to add a new feature to the software: based on patient vitals and stats, it will prioritize patients automatically into one of two categories: critical or not critical. What do you do? On his way out the door, your boss leaves you with a database containing about 100,000 records from the local ER. Each record contains stats on some patient and whether or not the supervising doctor decided that the patient was critically injured. That’s all you have to work with. Stop and think about how you would try to solve this problem.
Approach 1: Write code
You decide that you’ll write a series of if/else-if statements to solve this problem. Surely that will work, right? Well, it turns out that there are about 40 stats (we’ll call these attributes) for each patient (and we’ll call patients instances), so which attributes do you use in your if/else-if statements? This is going to get pretty hairy in a hurry. After checking a few instances, you end up with something that looks like:
if (patient.BP < minBP && patient.Pulse < minPulse)
    return true;
if (patient.O2 < minO2 || patient.Pulse > maxPulse)
    return true;
if (patient.Pulse == 0)
    return true;
So far so good! Except now you look at the next instance, and suddenly you have a patient who meets the requirements for the first if statement but is labeled as not critically injured. So you carefully examine the instance and try to tack another attribute check onto the if statement, but that causes the statement to miss an instance that it should be matching. Now you have to add yet another if statement, and it has to go before the original one.
Now, multiply that scenario by about 100,000, and you’re finished. Wasn’t that fun and easy?
Approach 2: Encode human knowledge
So approach 1 didn’t work out so well. Instead, you try asking the doctors what criteria they use to decide whether or not someone is critically injured. You get a short explanation and encode it as a couple of rules, easy enough. Except that when you test it on your 100,000 records, it gets the vast majority of them wrong. When you bring a few examples to the doctor, he says "Oh, yeah, the patient isn’t critically injured *UNLESS* this attribute has this value, too, then they’re critically injured." You take the knowledge back, rework your rules to encode this new knowledge, and find that you’re still not doing a very good job at re-classifying your 100,000 instances. After multiple iterations with the doctor, you have a set of rules that seem to work. You roll it out into production, and immediately the phone starts ringing off the hook. "Your software says this guy is critically injured, but he’s fine!" Even though your rules work very, very well on your 100,000 records, they seem to do very poorly in the real world.
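To make the "test it on your 100,000 records" step concrete, here is a minimal Python sketch of scoring a set of hand-written rules against doctor-labeled records. Everything specific in it is an assumption made up for illustration: the `Patient` fields, the threshold constants, and the four sample records are hypothetical, and the thresholds are not real clinical cutoffs.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    bp: float       # blood pressure (hypothetical attribute)
    pulse: float    # beats per minute
    o2: float       # oxygen saturation, percent
    critical: bool  # label assigned by the supervising doctor


# Hypothetical thresholds, chosen only so the example runs.
MIN_BP, MIN_PULSE, MAX_PULSE, MIN_O2 = 90, 50, 130, 90


def rule_says_critical(p: Patient) -> bool:
    """Hand-written rules in the style of approaches 1 and 2."""
    if p.bp < MIN_BP and p.pulse < MIN_PULSE:
        return True
    if p.o2 < MIN_O2 or p.pulse > MAX_PULSE:
        return True
    if p.pulse == 0:
        return True
    return False


def accuracy(records) -> float:
    """Fraction of records where the rules agree with the doctor's label."""
    correct = sum(rule_says_critical(p) == p.critical for p in records)
    return correct / len(records)


# Tiny made-up sample standing in for the 100,000 real records.
records = [
    Patient(bp=80, pulse=45, o2=95, critical=True),
    Patient(bp=120, pulse=80, o2=98, critical=False),
    Patient(bp=85, pulse=48, o2=97, critical=False),  # rules fire, doctor disagrees
    Patient(bp=110, pulse=140, o2=96, critical=True),
]

# The misclassified records are the ones you would take back to the doctor.
misses = [p for p in records if rule_says_critical(p) != p.critical]
print(f"accuracy: {accuracy(records):.2f}, misclassified: {len(misses)}")
```

A high score here only tells you the rules fit the records you already have. It says nothing about how they will behave on patients who are not in the database, which is exactly the trap in this approach.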
Approach 3: The Machine Learning way
You throw away all your rules, and instead decide to try out these fancy machine learning tools. You don’t really know much about the tool, other than that you give it data, tell it which attribute to predict, and it builds a model that you can apply to new instances. For this problem, you feed it your 100,000 records and tell it to build a model that can predict the critical/not critical status. After a few short seconds, it spits out a model that works quite well on the 100,000 instances you trained it with (note that you actually don’t want it to be 100% accurate most of the time; more on that in a future post). You hook the model into your code; all you have to do is pass it an instance, and it passes back a true/false, which is much cleaner than 100 if/else-if statements strewn about. You then roll the code out to the world and wait… the phone rings, and people do complain that it isn’t 100% accurate, but the calls are infrequent. Most of the time when it makes mistakes, it does so on patients who are borderline anyway. All in all, not bad considering you had to write almost no code.
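If "give it data and it builds a model" sounds too magical, here is a rough Python sketch of about the simplest learner there is: a decision stump, which searches for the single attribute and threshold that best separates critical from not critical in the training data. This is only an illustration of the idea, not the tool from the story; the function names `train_stump` and `predict`, the three attributes, and the six training rows are all hypothetical.

```python
def train_stump(rows, labels):
    """Find the (attribute index, threshold, below_is_critical) combination
    that makes the fewest mistakes on the training data."""
    best = None
    for attr in range(len(rows[0])):
        # Try every value seen in the data as a candidate threshold.
        for threshold in sorted({row[attr] for row in rows}):
            for below_is_critical in (True, False):
                errors = sum(
                    ((row[attr] < threshold) == below_is_critical) != label
                    for row, label in zip(rows, labels)
                )
                if best is None or errors < best[0]:
                    best = (errors, attr, threshold, below_is_critical)
    return best[1:]


def predict(model, row):
    """Apply the learned stump to one instance; True means critical."""
    attr, threshold, below_is_critical = model
    return (row[attr] < threshold) == below_is_critical


# Made-up training data: each row is [bp, pulse, o2].
rows = [
    [120, 80, 98],
    [115, 75, 97],
    [130, 90, 99],
    [85, 110, 88],
    [90, 120, 85],
    [80, 40, 91],
]
labels = [False, False, False, True, True, True]

model = train_stump(rows, labels)
print(model)                          # the rule the learner found
print(predict(model, [88, 105, 90]))  # a new, unseen instance
```

Notice that nobody told the learner which attribute mattered or what the cutoff should be; it found both by counting mistakes. Real tools search far richer families of models than a single threshold, but the calling pattern is the same: train once on labeled instances, then ask for a true/false on each new one.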
So what happened?
The machine learning approach worked because there were patterns in the data that the computer was able to learn to recognize. The patterns were too subtle for you to pick up on given the sheer size of the data set and the number of attributes on each case. Our brains have a hard time working across many dimensions at once. The machine learner is really good at that sort of thing though. It’s able to consider all 100,000 records and all 40 attributes quite easily. It was able to identify patterns in the training data that you fed it, and it generalized those patterns so that they would be useful in classifying new instances. That’s the magic of machine learning: being able to generalize from observed instances to things that haven’t been seen before.
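One way to see what "generalize" means is to hold some records back and score a model only on instances it never saw. The Python sketch below does that with two toy models: one that simply memorizes its training records, and one that learns a threshold. The data is synthetic, and the hidden pattern (critical when `o2 < 90`) is an assumption invented for this example, as are all the function names.

```python
# Synthetic stand-in for ER records: ((o2, pulse), critical).
# The hidden pattern "critical when o2 < 90" is made up for this sketch.
records = [
    ((o2, pulse), o2 < 90)
    for o2 in range(80, 101)
    for pulse in range(40, 161, 5)
]

# Hold out every third record; neither model sees these during training.
train = [r for i, r in enumerate(records) if i % 3 != 0]
test = [r for i, r in enumerate(records) if i % 3 == 0]


def accuracy(predict, data):
    """Fraction of instances where the prediction matches the label."""
    return sum(predict(x) == y for x, y in data) / len(data)


# Model A: memorize every training record, guess "not critical" otherwise.
table = dict(train)


def memorizer(x):
    return table.get(x, False)


# Model B: learn the o2 threshold that makes the fewest training mistakes.
def learn_o2_threshold(data):
    candidates = sorted({x[0] for x, _ in data})
    return min(candidates, key=lambda t: sum((x[0] < t) != y for x, y in data))


threshold = learn_o2_threshold(train)


def generalizer(x):
    return x[0] < threshold


for name, model in (("memorizer", memorizer), ("generalizer", generalizer)):
    print(name, "train:", accuracy(model, train), "test:", accuracy(model, test))
```

The memorizer is perfect on the records it was built from and noticeably worse on the held-out ones, which is the same failure as the hand-tuned rules that worked on the 100,000 records but not in production. The threshold learner captured the underlying pattern, so it does just as well on instances it has never seen.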
Next time, I’ll give some examples of how machine learning is used by tools that you’re probably already using. I may even get into specific types of machine learning techniques and what they can be used for.
Sorry, but approach 3 doesn’t work that way. I think Duda and Hart discuss this as the "no free lunch theorem" (it is usually credited to Wolpert): there is no machine learning algorithm that will always generalize better than any other. Specifically, for any machine learning algorithm, there exists a machine learning problem where random guessing will produce better results than your algorithm. What does that mean? You can’t just give the data to some ML code and forget about it. You have to understand your data, and you have to understand the algorithm in question. Only then can you choose the right input feature set, decide how best to pre-process those features, select an ML algorithm, pick the right parameters for it, and fine-tune it so that it delivers good results.
You are correct that the choice of induction algorithm is not trivial. This post wasn’t meant to exhaustively cover all the tradeoffs involved; it was meant merely to introduce the potential benefit that machine learning techniques can provide. Future posts will get deeper and deeper into the nitty-gritty of how and why things do and don’t work.
Then I probably misunderstood the sentence "You don’t really know much about the tool, other than you give it data, tell it what attribute to predict, and it builds a model that you can apply to new instances". This sentence kind of made my toenails curl, because I’ve seen machine learning projects that were done like that.
My point is, you really have to understand your data. Otherwise, your machine learning algorithm will fail as miserably as the other two approaches. In fact, more miserably, because you’ll at least learn something from approaches 1 and 2.