Discover more from Mostly Harmless Ideas
Foundations of ML #1 - What is Machine Learning?
The first issue on the Foundations of Machine Learning series, exploring the most basic concepts in the field.
Machine Learning is a fundamental paradigm shift in how we build computer programs. It opens the door to solving problems that are unsolvable by traditional means. In short, Machine Learning answers the question, “What if we don't know how to write a program to solve this problem?” The solution is simple and elegant: we write another program to find the best program to solve this problem.
This is the first article in a series about the Foundations of Machine Learning. This series will cover basic concepts, the main learning and modeling paradigms, and evaluating and comparing models. From there, future series will explore specific models and algorithms in greater detail.
This series is written in collaboration with
This first series is aimed at general readers interested in knowing more about this field so they can understand the conversation around Machine Learning going on in the media. For this reason, this series has no code or math whatsoever, just intuitions —ok, maybe I’ll throw in a few equations for performative reasons, but nothing strictly necessary.
However, if you are interested in Machine Learning from a software development or a scientific perspective, this series is still for you. In that case, I believe this series will be helpful as an intuitive foundation from which you can move on to more technical courses. I’ll give you some pointers if you want to dig deeper.
In this first article, I want to answer the broadest questions: What is Machine Learning, and when is it useful? To get there, first, we’ll look at why we want Machine Learning in the first place by comparing it with the traditional approach in computer science on a motivating example: making a world-class chess player. Then we’ll define exactly what “the Machine Learning approach” means. And finally, we’ll briefly discuss how we can actually do Machine Learning.
Building a chess player the traditional way
To understand why machine learning is important, we must first discuss when the traditional approach to solving computer problems fails miserably. For that, let’s briefly talk about what it would take to solve what was for a long time considered the quintessential problem in artificial intelligence: making a chess player that can beat the world champion.
If you know how to code, you can already begin to imagine how hard it must be to write a program that plays chess at the highest level. Let me illustrate the type of approach that a typical software developer would follow.
First, you must understand the rules of chess: how pieces move and what it means to win and lose a game. This only guarantees that you can write a program that never makes invalid moves. But that’s far from good enough. To beat even the weakest opponents, you’ll need to code some strategies, i.e., rules for the program to not simply pick one random valid move but a move with a reasonable chance of leading to victory.
To find strategies, you can either read strategy books, consult an expert, or maybe think really hard and develop good ideas yourself. Either way, you will end up coding a program that does a combination of the following:
a bunch of hard-coded rules for well-known openings and endings; and
some heuristic to play the mid-game, most likely based on weighting different pieces according to their position on the board.
And this approach works, but it has many disadvantages, some obvious and some less. The first obvious limitation is that building and maintaining such a complex program is increasingly difficult, especially as new strategies are discovered. That kind of long collection of rules is a headache for most software engineers. Second, slightly less obvious, we never know if we're playing the optimal strategy. Chess is an unsolved game, meaning no definitive strategy is known to work against all opponents.
But the most important limitation of the traditional programming approach is that the program can only be as good as the programmer. There is nuance here, of course. Maybe the knowledge comes from domain experts and not precisely the programmers. But in any case, one or more humans’ knowledge about the problem must be explicitly coded in software.
This means that while the computer is definitely an immense boost in scale and speed —we can solve bigger problems much faster—, fundamentally, traditional software cannot take us beyond human capabilities. The best chess-playing program using this approach might exceed the best human by sheer brute force, but it cannot fundamentally play better chess.
To see why this is a huge issue, consider what happens when we face a problem no one knows how to solve explicitly. At least two kinds of problems in this category are amenable to machine learning.
First, we have unstructured problems that we can solve, but we can’t explain how. These are problems where we hint there must be a set of rules that work —because we humans solve them intuitively— but we cannot explicitly list those rules in a way that a computer can understand —i.e., in a computer program. Examples include perception (e.g., recognizing objects in images) and language.
And second, we have inverse problems that we can solve easily in one direction but not in reverse. These are problems for which we know how to go from cause to effect, but it is not easy to determine, given an effect, what was the cause. Most physics falls in this category, e.g., given a specific design for an engine, we can simulate in a computer—and thus calculate— its performance under certain conditions. The inverse problem would be, given some constraints of design and a desired performance, to find the optimal engine design.
Our motivating example, playing chess, falls somewhere in between. On the one hand, some humans are extremely good at it. Still, while there are books and books of strategies, no one can explain exactly what the world champions do, not even themselves —otherwise, everyone could be a world champion. On the other hand, given a set of moves in chess, we can simulate it and compute who is the winner, but it is not easy to determine the right set of moves that, from a given board, leads to someone winning.
The bottom line is: we know there must be an optimal strategy (or set of strategies) for playing chess at the world-champion level —because there are people who do it—, but neither we nor anyone else can think through and come up with a computationally convenient description of that strategy.
The way of Machine Learning
Now here is the Machine Learning approach to this problem.
Instead of attempting to code the optimal strategy directly, let’s turn the problem on its side and ask, “If I had a huge number of possible strategies, could I identity the best one?”
This change in framing is a big leap because it makes the problem at least tractable. We are no longer trying to invent or discover something; we are “merely” searching for something inside a well-defined set of things. The problem shifted from being open-ended (come up with a winning strategy) to closed-ended (select the best strategy among all of these)!
However, this leap hinges on two crucial assumptions.
First, we must be able to enumerate all possible chess-playing strategies or, at least, to define a sufficiently large set of strategies such that the best one among them has a fair chance of being the one we’re looking for. We call this set the hypothesis space, and defining it will prove non-trivial. In fact, having a good hypothesis space will be the fundamental obstacle in getting machine learning to work.
For this motivating example, we can easily define a sensible hypothesis space that will illustrate the thinking one needs to develop in machine learning. Let’s say we assign a score to each pair of piece-position, for example, white-pawn-at-B7, black-rook-at-D3, etc. We compute the “value” of any concrete board configuration by adding all scores associated with the pieces in their corresponding locations.
Assuming we have the right scores, our chess-playing program is pretty simple:
Generate every possible board from the current one, applying all valid moves.
For each board, compute its value using those —still unknown— scores.
Return the move that leads to the highest valued board.
We assume that those scores somehow capture the notion of “this is a good board to be in.” A board with a higher value is better because I’m more likely to win by playing a move that takes me to that configuration. This is, of course, a huge oversimplification because we are ignoring all interactions between pieces. A rook may be more or less valuable in a given position because it challenges an enemy piece or protects one of your own.
However, even with this simplification, we already have six different types of pieces (rooks, knights, bishops, pawns, queens, and kings) times 64 different positions times 2, equal to 768 different scores. Every possible assignment to those 768 scores represents a different strategy, a hypothesis —also called a model. Note that even if we restrict the scores to whole numbers between 1 and 100, we still talk about 100 to the power of 786 different strategies —a 1 followed by 1536 zeros! Trying out all those hypotheses before the universe reaches heat death is impossible.
So far, we have a simplistic but still huge hypothesis space. Now we need a way to search for an optimal —or as good as possible— strategy in it, which in practice means finding the right way to tune those scores so that moving to boards with higher scores actually tends to increase your chance of winning. But how do we know which boards actually lead to winning?
Here comes another key piece in machine learning solutions: leveraging experience. Obviously, we don’t know an a priori way to determine if a board is better than another. If we had, that would be our strategy! But we can access millions, potentially even an infinite number of chess matches. How can we learn the right scores from that source of experience?
There are two fundamentally different paradigms in machine learning which fit chess well. We will dive deeper into them next time, but for now, let’s illustrate how they work in the case of chess.
The first paradigm is learning by imitation —also called supervised learning. In our example, it would work as follows. We first collect all the chess matches ever recorded that we can find. We note whether whites win or lose for each match and tag each board in that match accordingly.
The second paradigm is learning by trial and error —also called reinforcement learning. Instead of using pre-recorded matches, we let two computer programs play at will —e.g., choosing each move randomly. And just like before, we tag each board as a winning or losing board conditioned on the results of each simulated match.
We will have plenty to talk about the advantages and limitations of each paradigm in future entries in this series. For now, all that matters is that we can access a huge amount of experience that tells us how good different boards are in practice.
With all that experience, how do we actually write a Machine Learning program? Or, specifically in this case, how do we adjust the scores so that the resulting strategy is the best possible? Details vary, but the bottom line is always the same.
Instead of directly coding the program that plays chess, we write a meta-program —also called a training algorithm— that will search for the best chess program, which means finding the best set of scores in our example. And here comes the final ingredient: best with respect to what?
In any machine learning setting, we will need a way to compare different hypotheses —models, or strategies in the case of chess. Thus, finding the best model in our hypothesis space is equivalent to solving an optimization problem in which we seek the hypothesis or model that minimizes some error metric. This metric tells us how well any given hypothesis solves the task we wanted to learn, e.g., playing chess in this article.
We will discuss evaluation metrics in detail in a future article, but for now, let’s consider the most straightforward metric solution in our example. The best strategy in chess is the one that wins the most number of games. Thus, the direct solution would be optimized to maximize the probability of winning. However, for reasons that will become evident in future articles, we can rarely optimize for a direct success metric. That is usually mathematically intractable or at least quite involved.
Instead, we often find a proxy error metric easier to formulate as an optimization problem and close enough to our objective. In this example, a sensible approximation would be attempting to guess whether the current board is winning. Thus, we will compare two strategies —hypotheses or models— not precisely by which is more likely to win but by which is more accurate guessing if a given board appears on a winning game.
Now comes the technical part. Instead of simply enumerating all hypotheses —which we know is impossible, we will pull a rabbit out of the hat and turn the problem of finding the best strategy into solving a system of mathematical equations! Taking the scores as variables, each winning board yields an equation like the following:
While each losing board yields a similar equation:
Then, we will use a bit of mathematical magicto determine how to assign the scores w such that most winning boards sum close to 1 and losing boards sum close to 0.
This magic trick of turning an impossibly large search problem into a clever mathematical problem is pivotal to the success of machine learning. It is what allows our trainer algorithm to actually work at all. Otherwise, we would be stuck with trying randomly strategy after strategy to see which is best —something that we will actually do at some point, mind you.
And that’s it! We just tricked a computer into learning to play chess!
Machine Learning formalized
So far, we’ve built an intuitive formulation for a Machine Learning approach to solving chess. To summarize, we first simplified playing good chess to merely deciding which is a good next board by assigning some scores to all possible pieces and positions. Then we used loads of previous experience (recorded or simulated) to gather evidence for good and bad boards. And then, we turned the problem of finding the best scores into solving a mathematical problem.
Now let’s zoom out and finally define machine learning. This one is my favorite definition, due to MIT professor and author of one of the most influential books on ML, Tom Mitchell.
Machine Learning is a computational approach to problem-solving with four key ingredients:
A task to solve (T)
A performance metric (M)
A computer program (P)
A source of experience (E)
You have a Machine Learning solution when:
The performance of program P at task T, as measured by M, improves with access to the experience E.
That's it. Notice that the gist of the definition is that experience improves performance. This starkly contrasts traditional software, as we’ve seen in this post.
In a typical software application, how well our program solves a problem depends only on its initial programming. The application doesn’t improve the more you use it. Summarizing, in the traditional approach we:
Define a desired output, e.g., the best move in a given board.
Think very hard about the process to compute that output based on a deep understanding of the problem.
Write the program P that produces the output.
Machine learning aims to build a program that improves automatically with experience, even if we don’t fully understand how to solve the problem. To achieve that, we:
Assume there is a template that any sensible program P follows, e.g., a formula parameterized with some unknown values —a hypothesis space.
Write a program Q — a training algorithm— that finds the best values based on some experience E.
Run Q on E to find the best program P —the model or hypothesis.
The training algorithm Q is any of a vast array of machine learning techniques: decision trees, neural networks, naive Bayes, logistic regression, etc. Don’t worry if these names sound alien; we’ll meet them all in due time.
In conclusion, the Machine Learning approach represents a major paradigm shift to computational problem-solving. Instead of directly writing a program P to solve task T, you actually code a trainer program Q that, when run on suitable experience E, finds the best program P, according to some metric M.
This paradigm is so useful because there is an incredible amount of tasks for which we don't know how to write a solution directly. However, in many cases, writing a trainer algorithm is fairly straightforward, provided we have enough experience (e.g., data or simulations) to train on. Machine Learning allows us to tackle problems we can’t solve directly with a performance that often surpasses the best humans in that task.
And that's it for today. In the next entry in the series, we'll talk about the different flavors of experience we can have.
This simplification of ignoring interactions between elements is extremely common in machine learning. Viewing this from a probabilistic perspective, we’re assuming independence among the influence of each variable —each piece-position pair— concerning the probability of having a winning board. Assuming independence is often dubbed the naive approach, but it is actually a pretty smart thing to do in many cases.
A careful examination will show you that there are symmetrical positions for which having different scores makes no sense. However, for simplicity, let’s happily ignore that fact. Even accounting for all symmetries, the number of strategies will be so large it is unfathomable. And yes, that’s a word I just learned.
This means we’re computing scores for the white player. Scores for the black player will be symmetrical but with opposite signs.
In practice, you won’t directly optimize this probability, but rather a continuous approximation called the logistic function.
Mathematical magic is also known as algebra, optimization, probabilities, etc.
I told you we would throw a few math formulas for performative reasons, but don’t worry; the exact equations don’t matter. All that matters, for now, is that there is a way to convert a search problem into a mathematical problem. These equations aren’t even the ones we would actually use (see footnote 4).