when the map is not the territory

Rethinking fair supervised learning by treating it as reinforcement learning with a corrupted reward channel, and introducing a new criterion for corrupt reward reinforcement learning as a result.



  1. Brief notice
  2. Motivation
  3. Summary
  4. Supervised learning
    1. Practical problems in supervised learning
    2. Generalization is everything
    3. Not all noise is equal
    4. The notion of bias in “Fairness through Awareness”
  5. Biased supervised learners are naïve agents in CRMDPs
    1. Intro to CRMDPs
    2. Supervised learning as reinforcement learning in CRMDPs
  6. Implications and conclusions
    1. Connections
    2. What does this mean for CRMDPs?
    3. What does this mean for supervised learning?
    4. Prioritizing fairness, short-term safety, and long-term safety

Brief notice

This is a very rough WIP, and the details are likely illegible to those who aren’t familiar with formalizations of reinforcement learning and supervised learning. This includes everything after the “Summary” section.

Helpful pre-reading:



This work shows that an agent engaging in unfair supervised learning, weak generalization in the face of noise, or learning under a corrupt reward channel are all dealing with the same underlying phenomenon. In particular,

  1. Supervised learning is a special case of reinforcement learning.
  2. Assuming a fallible oracle in the supervised learning formalism allows noise to creep into the learning task.
  3. The role of the oracle in supervised learning is identical to the role of the corruption map in a corrupt reward Markov decision process.
  4. When the oracle introduces locally concentrated noise, the classifier is particularly susceptible to the oracle’s bias.
  5. A classifier learned from such an oracle fails a specific Lipschitz condition that leads to worse classification performance. This failure is induced by the oracle.
  6. The corruption map in a CRMDP induces the same failure for naïve RL agents.
  7. Since failing the Lipschitz condition is induced by the oracle/corruption map, it can provide a novel condition for learnability in CRMDPs.
  8. Using the Lipschitz condition as a learnability condition for CRMDPs would resolve open questions described by the original authors. In particular, it would show agents can learn in a larger class of finite-state CRMDPs than had been originally proven. These results could also extend to infinite state space CRMDPs, which is an open problem. Further implications, extensions, and speculations are considered near the end of this post.

Supervised learning

As setup, we formalize supervised learning. If you’re already familiar with supervised learning formalities, you can skip to the next section after noting the notation we use.

First, there is a task T\mathcal{T} to be learned. This could be a classification task, e.g. classifying defaults on loans, classifying likelihood of recidivism, recognizing objects in an image, or it could be a regression task, e.g. predicting amount of interest gained on a private student loan, pricing a salary package for a new employee based on past employee profiles. We assume that our task is well-defined and solvable, i.e. there exists some solution set T=XT×YTT = X_\mathcal{T} \times Y_\mathcal{T}.1 We also assume that TT is not able to be determined by brute force.

In order to solve the task under this constraint, we want to determine a computable function f:XYf^*:X \mapsto Y such that y=f(x)y=f^*(x) for every (x,y)T(x,y) \in T. In practice, we pursue an approximation f:XYf:X \mapsto Y such that fff \approx f^*.

We require access to a dataset D=XT×YT={(xi,yi)nP(T)}D = X_T \times Y_T = \{(x_i, y_i) \stackrel{\scriptscriptstyle n}{\sim} \mathcal{P}(T)\} where P(T)\mathcal{P}(T) is a distribution over the solution set TT. For simplicity, we will often refer to XTX_T and XTX_\mathcal{T} as DD and TT in a slight abuse of notation, although the difference should be clear from the context.

Since we don’t have access to ff^*, we assume the existence of some oracle Ω\Omega that’s able to approximate it. The dataset DD we use is actually D={(xi,Ω(xi))nP(T)}D = \{(x_i, \Omega(x_i)) \stackrel{\scriptscriptstyle n}{\sim} \mathcal{P}(T)\}.

Practical problems in supervised learning

There are several common practical constraints here. Notably:

  1. We are often stuck with how these samples are gathered from XTX_\mathcal{T}. For example, we may wish to gather a representative sample of animal species for image recognition, but we may have regulatory guidelines on which animals can be photographed based on their scarcity.
  2. DD is small, i.e. the nn above is small relative to the cardinality of XTX_\mathcal{T}. Put another way, we have a finite capacity for queries of Ω\Omega, whether due to difficulty of querying Ω\Omega or difficulty of sampling from P(T)\mathcal{P}(T). 2
  3. We create DD through Ω\Omega, which is itself generally fallible with respect to ff^*. More on this later.

These constraints can lead to dangerous or biased behavior. Sometimes they’re so strong and invasive that they cause a non-starter for the learning task, possibly forcing us to find a better oracle, redefine the task, or to abandon it altogether. Dangerous supervised learning occurs when we, the operators, fail to recognize these constraints as being as serious as they are. Failure of recognition can occur throughout the entire machine learning development/deployment process.

Generalization is everything

People often confuse the training objective with the task objective. In truth, we optimize over the training objective (i.e. loss function) only so that we can minimize out-of-sample error. If such error were computable, we’d prefer to optimize that directly. This means that the training objective is a surrogate for our actual optimization objective. We can only ever approximate our true performance on out-of-sample error in expectation, using methods like cross-validation.

But even then, any observed out-of-sample error is not the true, idealized objective of the task, as the true out-of-sample error can only ever be approximated through the use of the oracle Ω\Omega. Our hope in SL is therefore to use DD to match the oracle’s performance on data in TT, when it should really be to achieve perfect performance on TT. That is, we can only in good faith attempt to learn Ω(T)\Omega (T) but not f(T)f^{*}(T). Given a perfect oracle, these are identical optimization problems, but given a fallible oracle, they are distinct. Fallible oracles introduce noise.

Not all noise is equal

The notion of bias in “Fairness through Awareness”

Biased supervised learners are naïve agents in CRMDPs

Short intro to this section, explaining the overall assertion. Summarize related work (basically just say Paul Christiano’s theories are general enough to phrase these problems the same way, but their similarity has not been investigated explicilty or in detail).

Intro to CRMDPs

Boating with a corrupt reward, from OpenAI

Boating with a corrupt reward, from OpenAI

The main difference between an MDP and a CRMDP is the inclusion of a corruption function C:R˙R^C:\dot{R} \mapsto \hat{R} mapping true rewards to observed (potentially corrupt) rewards. The agent is only able to observe the environment’s true reward after it’s passed through the corruption map, so the corruption map can introduce unwanted behaviors, as seen above in the boat race example. In the boat race example, the corruption map is particularly extreme, although unwanted behaviors can be more subtle depending on the map’s character. This corruption map plays an important role in supervised learning, too.

TODO No Free Lunch Theorem for CRMDPs, decoupled RL, limitations of quantilising agent

Supervised learning as reinforcement learning in CRMDPs

Intuitively, supervised learning can be considered a special case of an MDP. For example, in classification, we often want to maximize the probability of the classifier yielding the correct class. In practice, we equivalently maximize the log-probability for numerical stability reasons.3

This (log-)probability can be considered the true reward of an MDP, where the state space contains all past, present, and future instances for the learning task. In particular, the training instances form a partition of the state space that is non-communicating with the rest of the state space, i.e. non-training instance states are impossible for the agent to reach in finite time on its own (see CRMDP paper for a precise definition). The actions are the labels that the policy is trained to assign; classification has a discrete action space, while regression has a continuous one.

When we throw the oracle into this setting, things get a bit more complicated, but there’s a simple explanation for how the MDP formalism is affected. The oracle in a supervised learning task is equivalent to the corruption map in its corresponding CRMDP. The extent of an oracle’s bias dictates just how extreme the corruption map affects the true reward of the underlying MDP. An idealised, unbiased oracle will lead to a pure MDP; a strongly biased oracle will lead to an intractable CRMDP.

Implications and conclusions


TODO actually porivde support for these conclusions, or some modified variation

  1. In special case of a deep policy, naïve agents in special non-communicating MDPs can apparently still defeat reward corruption (i.e. label noise) in special cases, but only if label noise is not locally concentrated in the feature space (see Drory et al. 2018). If locally concentrated, the classifier’s error drops perfectly linearly with amount of noise. Classification error in the CRMDP case is just regret, so in this case the agent experiences linear regret. But if the noise is random or “flipped”, regret is sublinear! The intractability of locally conentrated noise by standard methods mirrors the no free lunch theorem for CRMDPs.
  2. To understand this, we switched the requirement for a communicating CRMDP with a condition on the kernel between the action distribution produced by our agent for different states. “The No Free Lunch theorem suggests that without some assumption of what the reward corruption looks like, any RL agent is essentially lost.” The authors of Everett et al. assume the reward corruption is limited to a subset of states in a finite-state MDP, but assume nothing about the magnitude of corruption, or how these states may interrelate in the geometry/topology of the state space. In contrast, we’ve constrained the magnitude of corruption for states within local regions of spaces homeomorphic to the state space, but not limiting the amount of corrupt states or forcing to be able to get to every other state in finite time. I’m calling this approach implicitly decoupled RL, in the hope that these results will still apply in the general CRMDP as well. I’ll leave this for a follow up work?
  3. The kernel in 2 can be defined by the fairness metrics from “Fairness through Awareness”.
  4. To make the comparison to safety explicit, in language of “Fairness through Awareness” – if similar inputs don’t map to similar distributions, this counts as locally concentrated noise. So an unfair classifier is a classifier trained on a dataset with locally concentrated noise, which in turn is an agent trained naïvely in a CRMDP.

What does this mean for CRMDPs?

What does this mean for supervised learning?

Prioritizing fairness, short-term safety, and long-term safety

The question of the prioritization of research into short-term and long-term problems of AI safety has been a controversial one. Long-term safety researchers argue that we should prioritize safe AGI as it has a clear path to a significant existential risk for all of humanity, whereas short-term safety issues do not seem to have such sweeping existential risks. Conversely, AGI may not be possible, or may be so far off that we’re wasting our time trying to figure it out now, especially when there are people living now whose lives are being adversely affected by biased machine learning algorithms in production. Victoria Krakovna has argued that the two fields do not need to be mutually prioritized, since their research budgets are largely drawn from mutually exclusive sources. In some sense, arguments for prioritization rely on a false unification of budgets. This work suggests that, in this case, the dichotomy between safe RL and safe SL is a false one – learning in the face of reward corruption or biased datasets are the same problem. Methods for dealing with reward corruption can be recast as possible solutions to learning from biased datasets, and research into fair supervised learning can suggest new results for CRMDPs.

  1. Formally, the solution set is a relation between the feature space XX and the target space YY that qualifies as a well-defined function. In cases where the relation is not a well-defined function, the task can often be extended or redefined so that this is the case. 

  2. Note that (2), other than constraining the amount of signal we have for pattern recognition overall, can cause problems for (1) as well. For example, perhaps there are no regulatory guidelines about photographing endangered species, but the sheer rarity of these species keeps them out of the sample with realistically sized nn

  3. The case for regression is similar; in that case, we want to maximize the negative of the mean error.