Foundations of ML #2 - Learning Paradigms
A framework to characterize and understand the different flavors of machine learning, from supervised, unsupervised and reinforcement, to the many hybrid variants.
This is the second entry in a series on the Foundations of Machine Learning. In the previous post, we talked about how Machine Learning works. It’s about having some source of experience E for solving a given task T, which allows us to find a program P that is (hopefully) optimal for some metric M. We argued that this paradigm represents a fundamental shift in how we build software because we turn the problem from coming up with a solution ourselves to searching for a suitable solution among sensible strategies —from open-ended to closed-ended.
If you haven’t read the previous post or need a recap, check it out before moving on.
In this post, we will start discussing the different subtypes of machine learning paradigms we can find. If you are familiar with ML, you’ve probably heard terms like supervised, unsupervised, or reinforcement learning. This post will explain exactly what these terms mean. But first, we will develop a unified framework to think about machine learning paradigms, one that encompasses these three and many other ML flavors.
All machine learning is learning from experience, as we’ve seen. To learn, you need to be able to measure how well you’re doing; that is, you need feedback. Thus, to understand where learning paradigms differ, we will look at the different types of experiences that we can access and the types of feedback we can rely on. Once those intuitions are in place, we will meet the three fundamental learning strategies: learning by imitation, pattern recognition, and trial and error.
Combining different flavors of experiences, feedback, and learning strategies, we can define all machine learning paradigms, from supervised, unsupervised, and reinforcement learning, to self-supervised learning, active learning, and other hybrid modes. We will finish this post by describing the most popular paradigms.
Deconstructing learning paradigms
Machine learning paradigms vary in at least three essential factors: the nature of the experience available, the quality of the feedback, and the strategy used to learn. The fundamental dimension to characterize a source of experience is whether it is static or dynamic. Regarding feedback, the fundamental dimension to characterize it is implicit vs. explicit.
Nature of experience
Static experience is, for example, a collection of books. You can learn from books by reading other people’s experiences. But you are limited beforehand to the amount and quality of experience that was put in those books. If you read from a very good author (or teacher), they will have chosen the precise experience you need to learn effectively. But you cannot control what they teach you, since you cannot ask questions to a book; you can only choose beforehand what source to consult.
In contrast, dynamic experience is what you get by practicing, e.g., playing a sport or a musical instrument. Dynamic here means that the amount and quality of the experience you get depends at least partially on how you act. You try to kick the ball a certain way or put your fingers on a particular key on the piano, and you’ll get different feedback. Thus you can, to some extent, control the experience you get by asking different questions.
In short, static experience is predefined before we access it. We can perhaps decide in which order to consume it, but we cannot fundamentally change the experience we get. Dynamic experience depends upon our behavior —by acting in specific ways, we can shape the experience we get. We must understand, though, that this isn’t a binary distinction but a spectrum going from pure static to more dynamic.
In computational terms, the most common type of static experience is a dataset, i.e., a collection of data —e.g., images, text, audio, video, etc.— that someone put together. The most common type of dynamic experience is a simulation, i.e., a computer program with specific rules where one or more virtual agents (also computer programs) can interact.
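To make the contrast concrete, here is a minimal sketch in Python. The dataset values and the `GuessingGame` simulation are hypothetical toys invented for this illustration; the point is only that the dataset is fixed before we touch it, while what the simulation tells us depends on the actions we choose.

```python
import random

# Static experience: a fixed dataset that someone assembled beforehand.
# The learner can choose the reading order but cannot change its contents.
static_dataset = [(0.0, 0.0), (1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]


class GuessingGame:
    """Hypothetical toy simulation: the agent guesses a hidden number."""

    def __init__(self, low=0, high=100, seed=0):
        self.low, self.high = low, high
        self._secret = random.Random(seed).randint(low, high)

    def act(self, guess):
        # What the agent observes depends on the action it chose.
        if guess < self._secret:
            return "higher"
        if guess > self._secret:
            return "lower"
        return "correct"


def play(game):
    """Binary search: each question is shaped by the previous answers."""
    low, high, steps = game.low, game.high, 0
    while True:
        guess = (low + high) // 2
        steps += 1
        answer = game.act(guess)
        if answer == "correct":
            return guess, steps
        if answer == "higher":
            low = guess + 1
        else:
            high = guess - 1


game = GuessingGame(seed=1)
guess, steps = play(game)
```

With the dataset, the best we can do is read the four pairs. With the simulation, the agent decides what to ask next, which is why it can find the hidden number in a handful of steps.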
Quality of feedback
Explicit feedback is directly related to your learning target. For example, if you’re learning to play an instrument, the most explicit feedback would be, “This is how you must move your fingers.” In chess, very explicit feedback would be knowing the optimal move on a given board.
However, you often can only get implicit feedback, in various degrees. For example, take learning to play soccer. You kick the ball and observe its trajectory. If it doesn’t land where it should, you know you did something wrong but not exactly what. This feedback is only indirectly related to your learning target (e.g., the exact force and torque you need to hit the ball). As with experience, this is a spectrum rather than a binary distinction.
The more explicit the feedback, the easier it is to learn. However, providing explicit feedback often entails knowing how to solve the problem and effectively communicating that knowledge. Your music teacher not only knows how to play the guitar; they can also tell you precisely what you’re doing right or wrong. But if you’re learning to ride a bicycle, no amount of explanation will do. We don’t even know exactly how we do it, let alone how to explain it to someone else.
Feedback can also lose quality in at least two critical ways: it can become delayed or sparse. Delayed feedback is, for example, when you lose a game of chess. You only know that your strategy was flawed at the end, but you can’t tell exactly which moves were good or bad. Sparse feedback is the more general issue of getting a feedback signal only for a subset of your decisions. Both phenomena make learning harder.
Three main learning strategies underpin all ML algorithms so far in use. These are the actual mechanisms by which a model is trained based on experience. While the lines between ML paradigms are somewhat blurry, they all use one or a combination of learning by imitation, trial-and-error, and pattern recognition.
Learning by imitation is the most common and intuitive technique. It’s one of the main ways the young learn, both in humans and in most mammal species. You observe some behavior and try to reproduce it. Learning by imitation is the most efficient strategy but requires clear and direct feedback.
Learning by trial and error is what you do when feedback is indirect and sparse. If you cannot know precisely what you should do, but at least you can know if your actions are better or worse, you can try different things and see which is better. Learning by trial and error can take much longer than imitation, especially if the space of possible decisions is huge. It requires dynamic experience since you need to be able to decide which actions to try out.
Learning by pattern recognition is the most indirect type of learning, useful when the feedback is of the lowest quality: implicit, and tied to static experience. It’s the learning that happens when, for example, you read a non-didactic book or watch a movie. There is no explicit learning objective, but you can still find some interesting lessons to keep. It involves finding patterns in the available experience that explain or summarize at least part of it.
These three learning strategies are not disjoint. They can be used in different combinations depending on the available experience.
Machine learning paradigms
Now that all intuitions are in place, we can briefly enumerate the most important machine learning paradigms. We will start with the three basics: supervised, unsupervised, and reinforcement learning. Each occurs when one specific combination of experience, feedback, and learning strategies intersect. Then we will see three prominent hybrid paradigms that blur the frontiers between these three.
The most common, simple, and helpful learning paradigm is Supervised Learning. It involves static experience with the most explicit kind of feedback, and it’s based on imitation learning.
In this paradigm, the experience is a static collection of input/output pairs, and the task is defined as finding a function that produces the correct output for any given input. Supervised learning is thus a problem of prediction: given an input, predict the corresponding output. The underlying assumption is that there is some correlation (or, in general, a computable relation) between the structure of an input and its corresponding output and that it is possible to infer that function or mapping from a sufficiently large number of examples.
The output can have any structure, from highly complex to simple atomic values. When the output has a complex structure (e.g., a sequence or an arbitrary object with properties), we call it a structured prediction problem. When the output is a simple atomic value, we have two special sub-problems: classification and regression.
Classification is when the output is a category out of a finite set. Examples are detecting the object(s) that appear in an image, assigning a sentiment to a text, or predicting if a person has a disease —like cancer— given the results of a set of medical exams.
Classification problems can be binary —e.g., predict cancer or no cancer— or multi-class —e.g., predict the sentiment of a text. Multi-class problems can also be multi-label when the output can be zero, one, or more than one category for a single input—for example, predicting the topics of a piece of text.
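As a rough illustration of binary classification by imitation, the sketch below uses a nearest-neighbor rule: to label a new input, copy the label of the most similar example we have seen. The exam values, the labels, and the choice of nearest neighbor are all assumptions made for this example, not something prescribed by the post.

```python
import math

# Hypothetical labeled examples: (two exam results) -> diagnosis.
train = [
    ((1.0, 1.2), "healthy"),
    ((0.8, 0.9), "healthy"),
    ((3.1, 2.9), "disease"),
    ((2.8, 3.3), "disease"),
]


def predict(x, examples):
    """Return the label of the training input closest to x (1-nearest neighbor)."""
    _, nearest_label = min(examples, key=lambda pair: math.dist(pair[0], x))
    return nearest_label


label_a = predict((1.1, 1.0), train)
label_b = predict((3.0, 3.0), train)
```

The feedback here is as explicit as it gets: every training input comes with the exact output we want the model to reproduce.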
Regression is when the output is a continuous value, bounded or not. A typical example is property valuation, e.g., predicting the price of a house, a car, or any other object given its characteristics.
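For regression, a minimal sketch is fitting a straight line by ordinary least squares. The areas and prices below are made-up numbers, and a single-feature linear model is just the simplest assumption we can make about how price depends on the input.

```python
def fit_line(xs, ys):
    """Ordinary least squares for a single input feature."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical data: house area in square meters -> price in thousands.
areas = [50.0, 80.0, 100.0, 120.0]
prices = [150.0, 240.0, 300.0, 360.0]

slope, intercept = fit_line(areas, prices)
predicted_price = slope * 90.0 + intercept  # valuation for an unseen 90 m2 house
```

Unlike classification, the prediction is not picked from a finite set: any real number can come out of the fitted line.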
The main alternative to supervised learning is, you guessed it, Unsupervised Learning. The main difference is that we don’t have access to the “right outputs.” Instead, we have a large static dataset and want to find hidden patterns. Thus, feedback is highly indirect and often implicitly defined, and we resort to pattern recognition as the primary learning strategy.
The underlying assumption is that there is some regularity in the structure of those elements that helps explain their characteristics with a restricted amount of information, hopefully significantly less than just enumerating all elements. In this approach, the programmer leaves a deeper footprint since the kind of regularities that can be found is a bias that needs to be provided.
Two common sub-problems in unsupervised learning are associated with where we want to find that structure: clustering and dimensionality reduction.
Clustering is when we care about the relationship between different elements. This task is often framed as finding the (generally predefined number of) groups to which each element belongs. Examples include grouping users given their behavior on social media or grouping products given their characteristics in an online store.
Clustering requires defining an implicit similarity relation between elements, such that elements in the same group are, on average, more similar to each other than to the elements in other groups. Thus, the performance of clustering is often highly subjective, depending on how we define that similarity. Also, other assumptions and biases determine the “geometry” of the clusters. Long story short, every clustering algorithm is the best one under its own assumptions.
Dimensionality Reduction is when we care about the internal structure of each element. The task is often framed as finding a compact representation of the elements capturing their most salient characteristics. In other words, these representations keep the most information possible, where each model defines what information means in its context.
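The sketch below shows one possible clustering algorithm, k-means, which the post does not name; it is used here only as a familiar example. Its assumptions are visible in the code: similarity is Euclidean distance, the number of groups `k` is fixed in advance, and the starting centroids are simply the first `k` points (a naive choice that can give poor results on other data).

```python
import math


def kmeans(points, k, iterations=20):
    """Plain k-means with Euclidean distance as the similarity relation."""
    centroids = list(points[:k])  # naive initialization: an assumption of this sketch
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[i] = tuple(sum(axis) / len(members) for axis in zip(*members))
    labels = [min(range(k), key=lambda i: math.dist(p, centroids[i])) for p in points]
    return centroids, labels


# Hypothetical users described by two behavioral features.
users = [(1.0, 1.0), (8.0, 8.0), (1.2, 0.8), (8.2, 7.9), (0.9, 1.1), (7.9, 8.1)]
centroids, labels = kmeans(users, k=2)
```

Nobody told the algorithm which user belongs to which group. The grouping falls out of the distance function, and a different notion of similarity could produce a different, equally defensible answer.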
This task is often used as a preprocessing step before supervised learning because it helps reduce each element’s detail to its most essential traits. For this reason, there aren’t easily recognizable examples of dimensionality reduction in user-facing applications.
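One classic way to find such a compact representation is to project each element onto the direction along which the data varies the most, which is the idea behind principal component analysis. The post does not name this technique, so take the sketch below as one possible instance. Here "information" means variance, and the data is a made-up set of 2D points that happen to lie on a line.

```python
import math


def first_principal_component(rows, iterations=100):
    """Find the direction of maximum variance via power iteration."""
    n, d = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(d)]
    centered = [[r[j] - means[j] for j in range(d)] for r in rows]
    cov = [[sum(r[a] * r[b] for r in centered) / n for b in range(d)] for a in range(d)]
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[a][b] * v[b] for b in range(d)) for a in range(d)]
        norm = math.sqrt(sum(x * x for x in w))  # assumes the data is not constant
        v = [x / norm for x in w]
    return means, v


def encode(row, means, direction):
    """Compress a d-dimensional element into a single number."""
    return sum((x - m) * u for x, m, u in zip(row, means, direction))


def decode(code, means, direction):
    """Approximately rebuild the original element from its compact code."""
    return [m + code * u for m, u in zip(means, direction)]


data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
means, direction = first_principal_component(data)
code = encode(data[3], means, direction)
rebuilt = decode(code, means, direction)
```

Each point went from two numbers to one. Because this toy data lies exactly on a line, nothing is lost; on real data the reconstruction would only be approximate.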
The third fundamental ML paradigm is Reinforcement Learning. The main difference with the previous two is that instead of a static collection of data, here we have a dynamic simulation where an agent can act, and the task is formulated as learning to make decisions that lead to desirable outcomes. It is based on learning by trial and error.
This paradigm is useful when we have to learn to perform a sequence of actions, and there is no obvious way to define the “correct” sequence beforehand other than trial and error, such as training artificial players for video games, robots in real life, or self-driving cars. The feedback is formulated as a payoff obtained after the agent performs one or a sequence of actions, and the objective is often to maximize the lifetime expected accumulated payoff.
Feedback in reinforcement learning can be delayed and sparse, meaning our agents must learn to perform long sequences of actions that lead to a single outcome. Thus, a major problem is credit assignment, i.e., deciding which actions in that sequence should account for what part of the ultimate payoff. Also, the same action can lead to different payoffs at different moments.
For example, in self-driving cars, you get an immediate positive payoff every second the car hasn’t crashed, a mid-term negative payoff for every driving law broken, a longer-term payoff if you reach the desired location, etc. The same action (e.g., turning the steering wheel a degree to the left) can be involved in all those payoffs. This level of indirection between action and feedback makes RL far more complex than supervised and unsupervised learning in general.
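One common way to spread a late payoff back over the actions that preceded it is to compute discounted returns: each step is credited with its own payoff plus a shrinking share of everything that came after. The discount factor `gamma` is an assumption introduced for this sketch; the post itself does not commit to a particular scheme.

```python
def discounted_returns(rewards, gamma=0.9):
    """Credit each step with its reward plus the discounted future rewards."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


# Sparse, delayed feedback: the only payoff arrives after the last action.
episode_rewards = [0.0, 0.0, 0.0, 1.0]
credit = discounted_returns(episode_rewards, gamma=0.5)
# credit == [0.125, 0.25, 0.5, 1.0]
```

Actions closer to the payoff receive more of the credit, which is a reasonable default but still only a heuristic answer to the credit assignment problem.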
Reinforcement learning approaches can be divided into model-based and model-free. In the former, the agent learns a model of the environment. Thus it can predict the payoff that specific sequences of actions will provide and optimize accordingly. In the latter, the agent learns which action to take directly from the payoffs it experiences, without building an explicit model of the environment.
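As an illustration of trial and error in the model-free flavor, here is a sketch of tabular Q-learning, a standard algorithm that the post does not name. The corridor environment, its single payoff at the far end, and all the hyperparameters are assumptions chosen to keep the example tiny. The agent never learns how the corridor works; it only keeps a running estimate of how good each action is in each state.

```python
import random

N_STATES = 5          # states 0..4; reaching state 4 ends the episode
LEFT, RIGHT = 0, 1


def step(state, action):
    """Hypothetical environment: a corridor with a payoff only at the far end."""
    next_state = max(0, state - 1) if action == LEFT else state + 1
    done = next_state == N_STATES - 1
    reward = 1.0 if done else 0.0
    return next_state, reward, done


def q_learning(episodes=500, alpha=0.5, gamma=0.9, seed=0):
    rng = random.Random(seed)
    q = [[0.0, 0.0] for _ in range(N_STATES)]
    for _ in range(episodes):
        state, done = 0, False
        while not done:
            # Pure random exploration, for simplicity; Q-learning still
            # learns the value of acting greedily (it is off-policy).
            action = rng.choice((LEFT, RIGHT))
            next_state, reward, done = step(state, action)
            target = reward if done else reward + gamma * max(q[next_state])
            q[state][action] += alpha * (target - q[state][action])
            state = next_state
    return q


q_table = q_learning()
policy = [max((LEFT, RIGHT), key=lambda a: q_table[s][a]) for s in range(N_STATES - 1)]
```

After enough episodes, the greedy policy read from the table is to move right in every state, even though the payoff only ever appeared at the very last step.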
Semi-supervised Learning is a straightforward mixture of supervised and unsupervised learning, in which we have explicit output samples for just a few of the inputs plus many additional inputs where we can try, at least, to learn some structure.
Semi-supervised Learning is useful in almost any supervised learning problem when we hit the point where getting additional labeled data —with both inputs and outputs— is too expensive, but it is easy to get lots of unlabeled data —just with inputs.
Its advantage is that if the input data is relatively uniform, the underlying structure useful for predicting outputs —the one you would need to learn in supervised mode— will closely match the structure you can discover unsupervised. Thus, you can learn a lot from the data alone, in an unsupervised way, and then add a small supervised tweak to the mix.
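A minimal sketch of that recipe: first discover two groups in all the inputs without looking at any label, then use the handful of labeled examples only to name the groups. The one-dimensional data, the two labels, and the two-cluster assumption are hypothetical and chosen for brevity.

```python
def two_means_1d(values, iterations=20):
    """Unsupervised step: split 1D values into two groups around two centers."""
    lo, hi = min(values), max(values)  # assumes at least two distinct values
    for _ in range(iterations):
        left = [v for v in values if abs(v - lo) <= abs(v - hi)]
        right = [v for v in values if abs(v - lo) > abs(v - hi)]
        lo, hi = sum(left) / len(left), sum(right) / len(right)
    return lo, hi


def fit(labeled, unlabeled):
    """Supervised tweak: name each cluster after its nearest labeled example."""
    centers = two_means_1d([x for x, _ in labeled] + list(unlabeled))
    names = {}
    for center in centers:
        _, nearest_label = min(labeled, key=lambda pair: abs(pair[0] - center))
        names[center] = nearest_label
    return names


def predict(x, names):
    nearest_center = min(names, key=lambda c: abs(x - c))
    return names[nearest_center]


labeled = [(1.0, "small"), (9.0, "large")]              # expensive: only two labels
unlabeled = [0.5, 1.5, 2.0, 1.2, 8.0, 8.5, 9.5, 10.0]   # cheap: inputs only
model = fit(labeled, unlabeled)
```

Two labels would be a shaky basis for a classifier on their own, but they are enough to name clusters whose shape was learned from the unlabeled data.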
One form of this paradigm, unsupervised pretraining, was very popular in the early days of deep learning. Training an entire neural network end-to-end on lots of data was computationally infeasible back then. The alternative was to pre-train intermediate layers of the network in unsupervised mode before training the final layer(s) in supervised mode. As newer, simpler architectures have been invented and computational power has increased, this approach has fallen out of favor, but it is still helpful in low-resource scenarios.
Self-supervised Learning is also in-between supervised and unsupervised learning but has a different setup. Here, we want to predict an explicit output, but that output is simultaneously part of the input. So in a sense, the output is also defined implicitly. Thus, this paradigm is simultaneously supervised and unsupervised instead of simply concatenating supervised and unsupervised problems.
A straightforward example is language models, like BERT and GPT, where the objective is (hugely oversimplifying) to predict the n-th word in a sentence from the surrounding words, a problem for which we have lots of data —e.g., all the text on the Internet.
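The sketch below shows how raw text turns into supervised-looking training pairs without any human labeling: each word becomes the target, and its neighbors become the input. The `window` parameter and the whitespace tokenization are simplifying assumptions; real language models use far more elaborate setups.

```python
def masked_word_examples(sentence, window=2):
    """Build (context, target) pairs where the target comes from the input itself."""
    words = sentence.split()
    examples = []
    for i, target in enumerate(words):
        context = words[max(0, i - window):i] + words[i + 1:i + 1 + window]
        examples.append((context, target))
    return examples


pairs = masked_word_examples("the cat sat on the mat")
# pairs[2] == (["the", "cat", "on", "the"], "sat")
```

Every sentence yields as many explicit training examples as it has words, which is why this setup scales to all the text we can collect.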
Self-supervised learning is a highly effective approach in machine learning as it blurs the boundaries between supervised and unsupervised learning. Its reliance on explicit feedback makes it as efficient in learning as the supervised paradigm. However, what sets it apart is that it doesn’t require human labeling, making it a scalable and much more efficient option, just like in unsupervised learning.
The final paradigm we’ll analyze in this post is Active Learning, which is also a hybrid of sorts, but this time mixing traits of supervised and reinforcement learning. In this setup, we have a vast, potentially infinite amount of unlabeled data as experience and access to an “oracle” (e.g., a human expert) whom we can ask for the correct output for any specific input.
The key to active learning lies in pinpointing the most effective examples to request from humans. This approach can optimize our knowledge acquisition while minimizing human effort.
Active learning can reduce the cost of data collection enormously compared to traditional supervised learning while keeping the same model performance. As in supervised learning, we have explicit, high-quality feedback. But unlike supervised learning, we only pay the cost of getting that feedback (which often entails having a human annotator) for the set of most informative examples.
In active learning, we train the model and annotate the data simultaneously in a tight loop. As the model improves, it will tell us exactly which data point is most cost-effective to annotate next. Once finished, we obtain the trained model and the final, optimally annotated dataset.
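A toy version of this loop, under loud assumptions: the model is a single threshold on a one-dimensional input, the `oracle` function stands in for the human annotator, and "most informative" is taken to mean "closest to the current threshold" (a form of uncertainty sampling). None of these choices come from the post; they are just one simple way to make the idea runnable.

```python
def oracle(x):
    """Stands in for the human expert; the hidden rule is made up for this toy."""
    return int(x >= 0.62)


def current_threshold(labeled):
    """The model: midpoint between the largest known 0 and the smallest known 1."""
    highest_zero = max(x for x, y in labeled.items() if y == 0)
    lowest_one = min(x for x, y in labeled.items() if y == 1)
    return (highest_zero + lowest_one) / 2


def active_learn(pool, budget):
    pool = sorted(pool)
    # Seed with the two extremes (assumed to belong to different classes).
    labeled = {pool[0]: oracle(pool[0]), pool[-1]: oracle(pool[-1])}
    for _ in range(budget):
        threshold = current_threshold(labeled)
        candidates = [x for x in pool if x not in labeled]
        if not candidates:
            break
        # Ask about the point the current model is least sure about.
        query = min(candidates, key=lambda x: abs(x - threshold))
        labeled[query] = oracle(query)
    return current_threshold(labeled), labeled


pool = [i / 100 for i in range(101)]  # 101 unlabeled inputs
threshold, annotations = active_learn(pool, budget=10)
```

The loop ends with both a trained model (the threshold) and a small annotated dataset: at most 12 labels instead of the 101 that labeling the whole pool would require.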
Not every task comes with a paradigm label, though. In the realm of machine learning, these paradigms do not only operate independently. Often the more powerful methods live on the frontiers.
In this final section, we aim to give you a flavor of how these interactions look. We will present the ideas behind some illustrative examples without technical details.
Autoencoders are an excellent example of how a single model can play both a supervised and an unsupervised role. At their core, they are unsupervised models designed to uncover hidden patterns in unlabeled data. Yet, this process can be enhanced using supervised learning techniques. We feed an encoder with the input data, and it returns a more compact representation as output. Then a decoder receives this representation as input, producing something in the same format as the original input. The composition of both the encoder and the decoder is what we call the autoencoder. Given a similarity metric, the task is to learn to reconstruct the input.
This architecture resembles self-supervised learning —in the sense that the output is part of the input, i.e., it’s the input itself. We are not quite interested in the output itself, though. We are instead interested in the intermediate, more compact representation.
Embeddings are another interesting example, a general method for representing discrete variables, like words, items, or users, by transforming them into continuous vectors within a lower-dimensional space. These embeddings encapsulate some semantic or contextual attributes of the variables. Their application extends to natural language processing, computer vision, recommender systems, and beyond.
In the unsupervised scenario, they can detect similarity, association, or hierarchy. Facing supervised problems, they offer substantial utility due to their capacity to handle nominal data that do not have explicit characteristics. For instance, when dealing with word embeddings, words that share analogous meanings or fall within the same category can be positioned closer together within the vector space.
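To see what "closer together within the vector space" means in practice, the sketch below compares a few hand-made vectors with cosine similarity. The three-dimensional embeddings are invented for illustration; real embeddings are learned from data and typically have hundreds of dimensions.

```python
import math

# Hypothetical, hand-made embeddings (real ones are learned, not written by hand).
embeddings = {
    "cat": (0.9, 0.8, 0.1),
    "dog": (0.8, 0.9, 0.2),
    "car": (0.1, 0.2, 0.9),
}


def cosine(u, v):
    """Cosine similarity: 1.0 means same direction, 0.0 means orthogonal."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def most_similar(word):
    """Return the other word whose vector points in the most similar direction."""
    others = (w for w in embeddings if w != word)
    return max(others, key=lambda w: cosine(embeddings[word], embeddings[w]))


closest_to_cat = most_similar("cat")
```

Once words are vectors, questions like "which word is most related to this one?" reduce to simple geometry.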
In conclusion, the various subtypes of Machine Learning paradigms are shaped by the interplay between different types of experiences and feedback and the learning strategies employed.
Static and dynamic experiences refer to the nature of the data used for training, with static experiences involving fixed datasets and dynamic experiences involving continuous data streams or simulations. Explicit and implicit feedback refers to the type of guidance provided to the machine learning algorithm, with explicit feedback being more direct while implicit feedback is more subtle and indirect.
Considering these three key concepts, we can better understand the differences and similarities between machine learning paradigms. For example, supervised learning involves explicit feedback and static experiences, while reinforcement learning involves implicit feedback and dynamic experiences. On the other hand, unsupervised learning relies on implicit feedback and can be either static or more dynamic, depending on the nature of the data.
Ultimately, by understanding these key concepts, we can gain a deeper understanding of the underlying principles that govern Machine Learning and make informed decisions about which approach to use for a given problem. In the next entry in this series, we will discuss another fundamental distinction between Machine Learning approaches: generative vs discriminative modeling.
Don’t worry about the details of this process. We will talk about neural networks and their training paradigms in future posts.
This concept is analogous to a musician trying to replicate several pieces by ear. The musician’s brain builds a representation of the music that’s simpler than the music itself. Eventually, most musicians agreed on a unified language for this representation —but that’s totally out of scope for this article ;)