Completing the Reinforcement Learning course at Georgia Tech (OMSCS) was a significant personal challenge for me. It stands out as one of the most demanding courses I've ever undertaken, largely due to the inherent complexity of reinforcement learning.
Basics of RL
Like all forms of machine learning, we start with an equation: one that encodes the fundamental concepts we'll use to capture some pattern in our training data and, hopefully, attain generalization. In RL, that fundamental equation is the Bellman Equation. In its base form the Bellman Equation is actually very simple, but for various reasons computer scientists have pulled it apart, put it back together, and extended it in many convoluted ways. Understanding each of these variants (TD-Lambda, Monte Carlo, SARSA, and so on) is ultimately about understanding the preference bias needed for different applications of RL. This is fine and interesting by itself, but there's an added practical complexity with RL.
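To ground the discussion, the Bellman optimality equation for state values is V(s) = max_a Σ_s' P(s'|s,a) [ R(s,a,s') + γ V(s') ]. Below is a minimal sketch of one backup sweep over a small tabular MDP; the representation of P as lists of (probability, next_state, reward) tuples is a hypothetical choice made for brevity, not the course's or any library's API.

    import numpy as np

    # One sweep of the Bellman optimality backup over a small tabular MDP.
    # P[s][a] is a list of (probability, next_state, reward) tuples and
    # gamma is the discount factor.
    def bellman_backup(V, P, gamma=0.99):
        V_new = np.zeros_like(V)
        for s in range(len(V)):
            q_values = [
                sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])
                for a in range(len(P[s]))
            ]
            V_new[s] = max(q_values)  # greedy over actions
        return V_new

Iterating this backup to a fixed point is value iteration; TD methods, Monte Carlo, and SARSA can all be read as different ways of approximating the same update when the transition model is unknown.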
Difficulty of RL
Why is reinforcement learning complex in practice? The biggest problem with coding RL algorithms from scratch is that they are very difficult to debug. One small mistake can render an implementation of, say, TD-Lambda completely incapable of convergence. These algorithms can be so parameter-sensitive that, while trying to understand why a model isn't converging, you're never quite sure whether you chose a bad discount factor or made a mistake in computing your policy update.
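To make that concrete, here is a minimal sketch of one episode of tabular TD(lambda) policy evaluation with accumulating eligibility traces. The env interface (reset() and step() returning (next_state, reward, done)) and the policy callable are hypothetical stand-ins; the point is how many small decisions (trace decay, terminal handling, step size) can each silently break convergence.

    import numpy as np

    # Sketch of one episode of tabular TD(lambda) policy evaluation.
    # alpha, gamma, and lam are exactly the knobs whose misconfiguration can be
    # hard to distinguish from an outright implementation bug.
    def td_lambda_episode(env, policy, V, alpha=0.1, gamma=0.99, lam=0.9):
        e = np.zeros_like(V)                  # eligibility traces
        s = env.reset()
        done = False
        while not done:
            a = policy(s)
            s_next, r, done = env.step(a)     # hypothetical env interface
            target = r + (0.0 if done else gamma * V[s_next])
            delta = target - V[s]             # TD error
            e[s] += 1.0                       # accumulating trace
            V += alpha * delta * e            # credit recently visited states
            e *= gamma * lam                  # decay traces; forgetting this breaks learning
            s = s_next
        return V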
The Cutting Edge
Towards the other end of the sophistication spectrum for RL algorithms, you have Deep Reinforcement Learning. This is a very interesting field that will clearly become even more valuable going forward. Just the other day, NVIDIA released an intriguing piece of research called DrEureka. The highlight of this project is that they managed to train a robot dog in simulation to balance on a yoga ball and then transfer it to the real world zero-shot. This is impressive in its own right, but it's even more exciting when you consider that they did it with the help of an LLM (GPT-4) that used various meta-learning techniques to construct a really good reward function, along with a kind of domain randomization to help the model generalize.
We can imagine how, as LLMs continue to improve and as we make domain-specific LLMs for RL reward modeling, we'll see further increases in the capabilities of such systems. Imagine instructing an LLM to create an RL model capable of piloting a general-purpose humanoid robot for construction or cooking. Truly mind-blowing stuff.
Let's consider, however, why the LLM is so effective in this use case. Reward shaping in reinforcement learning is tedious and error-prone. Reward shaping is one way to add bias to an RL model to help it converge: by rewarding the agent for getting somewhat closer to its ultimate goal, we reduce the problem of reward sparsity. From firsthand experience, I can tell you that it takes a long time to do well, and even then you have no guarantee of convergence. This is the other reason why reinforcement learning is so difficult.
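One standard way to shape rewards without changing what the agent ultimately learns is potential-based shaping (Ng, Harada and Russell, 1999), where a bonus of the form gamma * phi(s_next) - phi(s) is added to the environment reward. A minimal sketch, where the potential function phi (for example, negative distance to the goal) is a hypothetical ingredient you would supply:

    # Potential-based reward shaping: add F = gamma * phi(s_next) - phi(s) to
    # the environment reward. In this form the set of optimal policies is
    # provably unchanged, while the denser signal eases credit assignment.
    def shaped_reward(r, s, s_next, phi, gamma=0.99):
        return r + gamma * phi(s_next) - phi(s)

    # Hypothetical potential: negative distance to the goal, so moving closer
    # earns a small bonus even while the true reward is still zero.
    # phi = lambda state: -distance_to_goal(state)

The appeal of the potential-based form is exactly that it relieves some of the anxiety described above: a badly chosen phi can slow learning down, but it cannot change which policies are optimal.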
Theoretical Perspective on RL Difficulty
To better communicate this concept, I'd like to invoke the concepts of the Vapnik-Chervonenkis (VC) dimension and the Markov property in RL. Let's compare supervised learning with reinforcement learning to build an intuition for why RL can be so difficult. In RL, the Markov property asserts that the next state and reward depend only on the current state and action. Therefore, we can think of an RL environment as a mapping to a theoretical but extremely large set of all possible state-action-reward-next-state tuples (S, A, R, S'). If we think of this dataset as being equivalent to the kind of training data we use in supervised learning, we can more easily compare the two domains.
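This "environment as a dataset" view is essentially what experience replay implements: interactions are stored as Markovian (s, a, r, s') tuples and sampled in batches, much like supervised training data. A minimal sketch (the Transition fields and buffer capacity here are just illustrative choices):

    import random
    from collections import namedtuple

    # Experience stored as Markovian tuples, per the framing above.
    Transition = namedtuple("Transition",
                            ["state", "action", "reward", "next_state", "done"])

    class ReplayBuffer:
        def __init__(self, capacity=100_000):
            self.capacity = capacity
            self.data = []

        def add(self, *args):
            self.data.append(Transition(*args))
            if len(self.data) > self.capacity:
                self.data.pop(0)              # drop the oldest transition

        def sample(self, batch_size):
            # Sampling batches makes this look like supervised data, but the
            # distribution shifts as the policy that generates it changes.
            return random.sample(self.data, batch_size)

The crucial difference from a supervised dataset is that last comment: the data distribution depends on the policy that collected it, and it keeps shifting as that policy improves.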
VC dimension measures the complexity of a model in terms of its capacity to fit training data. A high VC dimension means a more complex model, and in RL we can generally expect to be dealing with a higher-VC-dimension problem class because of the inherently dynamic nature of the domain. Due to that dynamic nature, each inference in a sequence of inferences (up to a certain time horizon or discount) must be correct in order to maximize reward. So, in a sense, we're dealing with a chain of predictions that must all hold up together rather than a set of independent, individual predictions. This increase in problem complexity brings with it a necessary increase in VC dimension.
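A back-of-the-envelope illustration of why chained predictions are harsher than independent ones: if each step's decision is "good" with probability p, and we (simplistically) treat steps as independent, the probability that an entire H-step trajectory stays on track decays geometrically.

    # Illustrative only: assumes per-step correctness is independent, which real
    # trajectories are not, but the geometric decay is the point.
    for p in (0.99, 0.95, 0.90):
        for horizon in (10, 100, 1000):
            print(f"p={p}, H={horizon}: "
                  f"P(whole trajectory correct) ~= {p**horizon:.4f}")

Even a per-step accuracy that would be excellent in a classifier collapses quickly over long horizons, which is one way to see why the sequential setting demands so much more capacity and data.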
Despite this dataset framing, RL faces a basic challenge of information sparsity. We have sparse rewards (which reward shaping helps to address), and we might only have partial observability of states. The hypothesis space in RL, defined by the policy or value function, will be very complex. This might not be a problem if you always knew the reward, but because of reward sparsity, it's like having a training dataset where part of the goal is to fill in reward values for previously visited data points. This sparsity reduces the effective sample size, that is, the amount of useful information available for learning. So, comparably complex RL and SL problems will require significantly more compute in the RL case.
I think we can appreciate the theoretical difficulties of reinforcement learning from another perspective - causal modeling. If we consider a causal model, which is ultimately a model of the environment that generates the RL problem, we can make some insightful generalizations. In RL, beliefs about reward values must be updated whenever we discover new information about the environment. This dynamic relates to the VC dimension argument above: more complex causal graphs should correspond to higher VC dimensions, and causal dynamics that are more cyclical require a further increase in VC dimension. I expect we'll see greater investigation into Causal RL, because it unifies several emerging topics in machine learning today - RL, causal modeling, and implications for meta-learning in LLMs. These will likely be relevant as we seek greater generalization and value from our machine learning systems.
Future of RL
On top of all of this, the economics of scaling RL systems are just different from, say, training a classifier or an LLM. Over the last decade, we've seen so much excitement over AI because deep-learning-based classifiers and language models have finally become capable enough to be useful in many general cases. This was made possible by the parallelizability of deep learning models in general and transformer architectures in particular. Hence why NVIDIA became one of the world's leading companies by stock market valuation. The problem with RL, however, is that because the training data is inherently dynamic and gathered through simulation, GPUs don't help us nearly as much. In fact, CPU-based training was what my instructors recommended. So, ultimately, we should expect RL applications to lag behind other forms of generalizable ML for some time.
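To illustrate where the compute actually goes: experience comes from stepping a simulator, which is typically sequential, branchy CPU work rather than the dense matrix math GPUs excel at. A common pattern is to fan rollouts out across CPU worker processes and only then hand the collected transitions to a learner. This is a minimal sketch; the toy random-walk "episode" is a hypothetical stand-in for a real simulator.

    import multiprocessing as mp
    import random

    def rollout_worker(seed):
        # Hypothetical stand-in for a real simulator: a 100-step random walk
        # that pays reward 1.0 whenever the walker reaches state 10.
        rng = random.Random(seed)
        state, transitions = 0, []
        for _ in range(100):
            action = rng.choice([-1, 1])
            next_state = state + action
            reward = 1.0 if next_state == 10 else 0.0
            transitions.append((state, action, reward, next_state))
            state = next_state
        return transitions

    if __name__ == "__main__":
        with mp.Pool(processes=mp.cpu_count()) as pool:
            batches = pool.map(rollout_worker, range(16))  # 16 parallel rollouts
        experience = [t for batch in batches for t in batch]
        print(f"collected {len(experience)} transitions")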
So, what are some really good use cases for RL? We've already talked about robot dogs balancing on exercise equipment, but I'd also like to mention something that I've used RL for professionally - Recommender Systems.
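In that setting, a recommender can often be framed as a bandit problem: the "action" is which item to show, and the reward is a click or conversion. A minimal epsilon-greedy sketch under that framing (the item IDs and feedback loop are hypothetical; production systems generally add contextual features and off-policy evaluation):

    import random
    from collections import defaultdict

    class EpsilonGreedyRecommender:
        def __init__(self, items, epsilon=0.1):
            self.items = list(items)
            self.epsilon = epsilon
            self.counts = defaultdict(int)
            self.values = defaultdict(float)   # running mean reward per item

        def recommend(self):
            if random.random() < self.epsilon:
                return random.choice(self.items)                    # explore
            return max(self.items, key=lambda i: self.values[i])    # exploit

        def update(self, item, reward):
            # Incremental mean of observed rewards (e.g., 1.0 for a click).
            self.counts[item] += 1
            self.values[item] += (reward - self.values[item]) / self.counts[item]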