when the map is not the territory
Rethinking fair supervised learning by treating it as reinforcement learning with a corrupted reward channel, and introducing a new criterion for corrupt reward reinforcement learning as a result.
tl;dr
 We can recast supervised learning tasks as reinforcement learning tasks with a corrupted reward signal
 Doing so allows us to learn about each from the techniques and results of the other
 In particular, we can reexamine the theory of corrupt rewards in the context of results on generalization and fairness in supervised learning
 Novel sufficient conditions for learning with corrupt rewards are suggested that would resolve open problems in the literature
Contents
 Brief notice
 Motivation
 Summary
 Supervised learning
 Biased supervised learners are naïve agents in CRMDPs
 Implications and conclusions
Brief notice
This is a very rough WIP, and the details are likely illegible to those who aren’t familiar with formalizations of reinforcement learning and supervised learning. This includes everything after the “Summary” section.
Helpful prereading:
 Reinforcement Learning with a Corrupted Reward Channel (required: sections 14 and 7)
 Fairness through Awareness (required: Lipschitz definition and using it to achieve fairness, mainly section 2 and optionally some of section 3)
 On the Resistance of Neural Nets to Label Noise (mainly the definition of locally concentrated noise and experimental results comparing susceptibility to noise)
Motivation
 Safe supervised learning has always been a matter of generalizing properly given realworld dataset constraints.
 The most fundamental safety constraint in supervised learning is the bias/variance tradeoff, and how to handle it with various models/techniques. It is a necessary, but far from sufficient, ingredient for safe supervised learning.
 Meanwhile, neural networks have been shown to be able to both memorize completely random noise and learn tasks robustly in the presence of noise.
 Despite this observed robustness, neural networks have been shown to be extremely sensitive to biases in training data. Google Photos incident, Tay, plenty more examples to choose from. Similarly, neural network policies in RL have been shown to be susceptible to wireheading.
 Finally, safe RL and fair SL have always been thought of as separate problems, but we show that they’re formally related.
Summary
This work shows that an agent engaging in unfair supervised learning, weak generalization in the face of noise, or learning under a corrupt reward channel are all dealing with the same underlying phenomenon. In particular,
 Supervised learning is a special case of reinforcement learning.
 Assuming a fallible oracle in the supervised learning formalism allows noise to creep into the learning task.
 The role of the oracle in supervised learning is identical to the role of the corruption map in a corrupt reward Markov decision process.
 When the oracle introduces locally concentrated noise, the classifier is particularly susceptible to the oracle’s bias.
 A classifier learned from such an oracle fails a specific Lipschitz condition that leads to worse classification performance. This failure is induced by the oracle.
 The corruption map in a CRMDP induces the same failure for naïve RL agents.
 Since failing the Lipschitz condition is induced by the oracle/corruption map, it can provide a novel condition for learnability in CRMDPs.
 Using the Lipschitz condition as a learnability condition for CRMDPs would resolve open questions described by the original authors. In particular, it would show agents can learn in a larger class of finitestate CRMDPs than had been originally proven. These results could also extend to infinite state space CRMDPs, which is an open problem. Further implications, extensions, and speculations are considered near the end of this post.
Supervised learning
As setup, we formalize supervised learning. If you’re already familiar with supervised learning formalities, you can skip to the next section after noting the notation we use.
First, there is a task $\mathcal{T}$ to be learned. This could be a classification task, e.g. classifying defaults on loans, classifying likelihood of recidivism, recognizing objects in an image, or it could be a regression task, e.g. predicting amount of interest gained on a private student loan, pricing a salary package for a new employee based on past employee profiles. We assume that our task is welldefined and solvable, i.e. there exists some solution set $T = X_\mathcal{T} \times Y_\mathcal{T}$.^{1} We also assume that $T$ is not able to be determined by brute force.
In order to solve the task under this constraint, we want to determine a computable function $f^*:X \mapsto Y$ such that $y=f^*(x)$ for every $(x,y) \in T$. In practice, we pursue an approximation $f:X \mapsto Y$ such that $f \approx f^*$.
We require access to a dataset $D = X_T \times Y_T = \{(x_i, y_i) \stackrel{\scriptscriptstyle n}{\sim} \mathcal{P}(T)\}$ where $\mathcal{P}(T)$ is a distribution over the solution set $T$. For simplicity, we will often refer to $X_T$ and $X_\mathcal{T}$ as $D$ and $T$ in a slight abuse of notation, although the difference should be clear from the context.
Since we don’t have access to $f^*$, we assume the existence of some oracle $\Omega$ that’s able to approximate it. The dataset $D$ we use is actually $D = \{(x_i, \Omega(x_i)) \stackrel{\scriptscriptstyle n}{\sim} \mathcal{P}(T)\}$.
Practical problems in supervised learning
There are several common practical constraints here. Notably:
 We are often stuck with how these samples are gathered from $X_\mathcal{T}$. For example, we may wish to gather a representative sample of animal species for image recognition, but we may have regulatory guidelines on which animals can be photographed based on their scarcity.
 $D$ is small, i.e. the $n$ above is small relative to the cardinality of $X_\mathcal{T}$. Put another way, we have a finite capacity for queries of $\Omega$, whether due to difficulty of querying $\Omega$ or difficulty of sampling from $\mathcal{P}(T)$. ^{2}
 We create $D$ through $\Omega$, which is itself generally fallible with respect to $f^*$. More on this later.
These constraints can lead to dangerous or biased behavior. Sometimes they’re so strong and invasive that they cause a nonstarter for the learning task, possibly forcing us to find a better oracle, redefine the task, or to abandon it altogether. Dangerous supervised learning occurs when we, the operators, fail to recognize these constraints as being as serious as they are. Failure of recognition can occur throughout the entire machine learning development/deployment process.
Generalization is everything
People often confuse the training objective with the task objective. In truth, we optimize over the training objective (i.e. loss function) only so that we can minimize outofsample error. If such error were computable, we’d prefer to optimize that directly. This means that the training objective is a surrogate for our actual optimization objective. We can only ever approximate our true performance on outofsample error in expectation, using methods like crossvalidation.
But even then, any observed outofsample error is not the true, idealized objective of the task, as the true outofsample error can only ever be approximated through the use of the oracle $\Omega$. Our hope in SL is therefore to use $D$ to match the oracle’s performance on data in $T$, when it should really be to achieve perfect performance on $T$. That is, we can only in good faith attempt to learn $\Omega (T)$ but not $f^{*}(T)$. Given a perfect oracle, these are identical optimization problems, but given a fallible oracle, they are distinct. Fallible oracles introduce noise.
Not all noise is equal
 “On the Resistance of Neural Nets to Label Noise”
 random noise: for random $x_i, x_j \in D$ such that $f^*(x_i) = f^*(x_j)$ and $\Omega(x_i) \neq f^*(x_i)$ or $\Omega(x_j) \neq f^*(x_j)$.
 the authors also define fliplabel and confusionmatrix noise, which are straightforward generalizations of random noise that condition the noise on the ground truth label.
 locally concentrated noise: when $x_i, x_j$ are similar, i.e. exist in the same local region of feature space, but are noisily labelled, i.e. are consistently misclassified by the oracle. This is quite different from the other kinds of noise, and it’s the one that’s most important to us.
 The authors give theoretical and empirical evidence that neural networks are robust to random, fliplabel, and confusionmatrix noise, but fail on locally concentrated noise.
 How then do we end up with biased neural networks?
The notion of bias in “Fairness through Awareness”
 Fairness problems arise in the space between $f^{*}$ and $\Omega$.
 Biased data can be considered noise, in the sense that a biased data point is an “incorrect” representation of the idealized task but a “correct” representation of the task as determined by the oracle. (This should be considered an oversimplification for anaytic purposes. The map is not the territory. See Kate Crawford’s “The Trouble with Bias”, and also decades of work on relativism in philosophy/social sciences).
 Assuming perfect learning, the prejudice of our classifier w.r.t. the idealized task can be attributed to noise/mistakes in the oracle’s estimation of $f^*$.
 Naïvely, we’d suspect that bias takes the form of random, fliplabel, or confusionmatrix noise – but then neural networks would be robust, and they’re often not!
 The problem is that the noise is locally concentrated. Recall: locally concentrated noise occurs when $x_i, x_j$ are similar, i.e. exist in a neighborhood of some metric space, but are noisily labelled, i.e. one of them is misclassified by the oracle.
 Compare this to the Lipschitz condition in “Fairness through Awareness”. A classifier is said to be fair if for similar $x_i, x_j$, $p(\hat{y}_i = f(x_i))$ and $p(\hat{y}_j = f(x_j))$ are similar (replacing the $p$ with $\log p$ can be cast equivalently by changing the similarity metric). Similarity can be defined by various kernels over distributions, and the authors explore metrics like total variation, the relative Linfinity measure, and the Earthmover distance.
 To make this explicit, say we have two similar points $x, x_b$ that are mislabelled by the oracle according to some hidden feature that the oracle is implicitly aware of (e.g. race, gender). This means that $k(x, x_b) < \epsilon$ for some small epsilon and $f^{*}(x) = f^{*}(x_b)$, but $\Omega(x) \neq \Omega(x_b)$. This constitutes locallyconcentrated noise, and also implies that the probabilities in the Lipschitz condition above will be significantly different.
 Locally concentrated noise was previously shown to be intractable for neural network classifiers; similarly, fairness is a general, unsolved problem for standard neural networks.
Biased supervised learners are naïve agents in CRMDPs
Short intro to this section, explaining the overall assertion. Summarize related work (basically just say Paul Christiano’s theories are general enough to phrase these problems the same way, but their similarity has not been investigated explicilty or in detail).
Intro to CRMDPs
Boating with a corrupt reward, from OpenAI
The main difference between an MDP and a CRMDP is the inclusion of a corruption function $C:\dot{R} \mapsto \hat{R}$ mapping true rewards to observed (potentially corrupt) rewards. The agent is only able to observe the environment’s true reward after it’s passed through the corruption map, so the corruption map can introduce unwanted behaviors, as seen above in the boat race example. In the boat race example, the corruption map is particularly extreme, although unwanted behaviors can be more subtle depending on the map’s character. This corruption map plays an important role in supervised learning, too.
TODO No Free Lunch Theorem for CRMDPs, decoupled RL, limitations of quantilising agent
Supervised learning as reinforcement learning in CRMDPs
Intuitively, supervised learning can be considered a special case of an MDP. For example, in classification, we often want to maximize the probability of the classifier yielding the correct class. In practice, we equivalently maximize the logprobability for numerical stability reasons.^{3}
This (log)probability can be considered the true reward of an MDP, where the state space contains all past, present, and future instances for the learning task. In particular, the training instances form a partition of the state space that is noncommunicating with the rest of the state space, i.e. nontraining instance states are impossible for the agent to reach in finite time on its own (see CRMDP paper for a precise definition). The actions are the labels that the policy is trained to assign; classification has a discrete action space, while regression has a continuous one.
When we throw the oracle into this setting, things get a bit more complicated, but there’s a simple explanation for how the MDP formalism is affected. The oracle in a supervised learning task is equivalent to the corruption map in its corresponding CRMDP. The extent of an oracle’s bias dictates just how extreme the corruption map affects the true reward of the underlying MDP. An idealised, unbiased oracle will lead to a pure MDP; a strongly biased oracle will lead to an intractable CRMDP.
 TODO Using the Lipschitz condition from “Fairness through Awareness,” formalize a novel condition for learnability in CRMDPs
Implications and conclusions
Connections
TODO actually porivde support for these conclusions, or some modified variation
 In special case of a deep policy, naïve agents in special noncommunicating MDPs can apparently still defeat reward corruption (i.e. label noise) in special cases, but only if label noise is not locally concentrated in the feature space (see Drory et al. 2018). If locally concentrated, the classifier’s error drops perfectly linearly with amount of noise. Classification error in the CRMDP case is just regret, so in this case the agent experiences linear regret. But if the noise is random or “flipped”, regret is sublinear! The intractability of locally conentrated noise by standard methods mirrors the no free lunch theorem for CRMDPs.
 To understand this, we switched the requirement for a communicating CRMDP with a condition on the kernel between the action distribution produced by our agent for different states. “The No Free Lunch theorem suggests that without some assumption of what the reward corruption looks like, any RL agent is essentially lost.” The authors of Everett et al. assume the reward corruption is limited to a subset of states in a finitestate MDP, but assume nothing about the magnitude of corruption, or how these states may interrelate in the geometry/topology of the state space. In contrast, we’ve constrained the magnitude of corruption for states within local regions of spaces homeomorphic to the state space, but not limiting the amount of corrupt states or forcing to be able to get to every other state in finite time. I’m calling this approach implicitly decoupled RL, in the hope that these results will still apply in the general CRMDP as well. I’ll leave this for a follow up work?
 The kernel in 2 can be defined by the fairness metrics from “Fairness through Awareness”.
 To make the comparison to safety explicit, in language of “Fairness through Awareness” – if similar inputs don’t map to similar distributions, this counts as locally concentrated noise. So an unfair classifier is a classifier trained on a dataset with locally concentrated noise, which in turn is an agent trained naïvely in a CRMDP.
What does this mean for CRMDPs?
 We can ostensibly get around reward corruption by optimizing the policy subject to the Lipschitz constraint as suggested in “Fairness through Awareness”, either through regularization or through something like iterated PathCG optimization.
 This requires a taskspecific distance metric for the MDP. For example, in the TomatoWateringEnv from AI safety gridworlds, an appropriate distance metric would be to swap each tomato’s watered status in $x$ to get $x_s$, then measure the L1 or L2 distance between $r(x_s)$ and $r(x)$.
 Here, I’m using knowledge of the MDP as well as knowledge of the corruption to determine a good metric for the task. This is a reasonable prerequisite for many classification tasks, simple MDPs, and intuitively complex MDPs (e.g. when individuals/groups of individuals are the states), but not so reasonable for unintuitively complex MDPs.
 Authors from “Fairness through Awareness” suggest their method is agnostic to the correctness of this metric. Is this really the case? Can we use other techniques suggested in the paper to determine the right metric for our task? Are they still feasible for MDPs?
 Can we learn a metric that makes the Lipschitz condition useful as a regularizer/hard constraint when optimizing a policy with the observed reward of a CRMDP? Could learning an embedding allow us to be fast and loose with the actualy metric we use? (cc: Deep RL from Human Preferences) If so, we could scale this approach with little to no human feedback.
 Claim: The No Free Lunch theorem does not apply to CRMDPs with some kinds of noisy reward corruption. In particular, confusionmatrix noise is surrmountable.
 Claim: CIRL is able to avoid reward corruption under these conditions. (Decoupled RL provides sufficient, but not necessary conditions, so failing to comply with decoupled RL does not strictly imply failure to learn in general CRMDPs).
 Justification: the supervisor’s action can be used as feedback for metric labelling. We can use this to learn a kernel over the feature space to define the Lipschitz condition, then optimize subject to this Lipschitz condition.
 Claim: This formulation works for the infinite state space case.
 Justification: There’s no need for a finite state space, so long as the feature space kernel is general enough to account for all states in the CRMDP.
 Hypothesis: This could also work for nonstationary corruption maps, in cases where the nonstationarity is slow enough and we can adapt techniques from online learning to fit CRMDPs.
What does this mean for supervised learning?
 Can we reinterpret methods for solving CRMDPs (SSRL, LVFS, LfHP) as methods for learning classifiers and regressors on biased datasets? Clearly, decoupled RL is infeasible here, but methods for solving CRMDPs needn’t be constrained by the decoupled RL formalism.
 Gives a new context for the original No Free Lunch theorem.
 Phrasing bias as a corruption between an optimal policy and the oracle highlights Kate Crawford’s point in “The Trouble with Bias”. Our oracles are social structures and processes, and these oracles are often corrupt. Social bias is first and foremost a social problem.
Prioritizing fairness, shortterm safety, and longterm safety
The question of the prioritization of research into shortterm and longterm problems of AI safety has been a controversial one. Longterm safety researchers argue that we should prioritize safe AGI as it has a clear path to a significant existential risk for all of humanity, whereas shortterm safety issues do not seem to have such sweeping existential risks. Conversely, AGI may not be possible, or may be so far off that we’re wasting our time trying to figure it out now, especially when there are people living now whose lives are being adversely affected by biased machine learning algorithms in production. Victoria Krakovna has argued that the two fields do not need to be mutually prioritized, since their research budgets are largely drawn from mutually exclusive sources. In some sense, arguments for prioritization rely on a false unification of budgets. This work suggests that, in this case, the dichotomy between safe RL and safe SL is a false one – learning in the face of reward corruption or biased datasets are the same problem. Methods for dealing with reward corruption can be recast as possible solutions to learning from biased datasets, and research into fair supervised learning can suggest new results for CRMDPs.

Formally, the solution set is a relation between the feature space $X$ and the target space $Y$ that qualifies as a welldefined function. In cases where the relation is not a welldefined function, the task can often be extended or redefined so that this is the case. ↩

Note that (2), other than constraining the amount of signal we have for pattern recognition overall, can cause problems for (1) as well. For example, perhaps there are no regulatory guidelines about photographing endangered species, but the sheer rarity of these species keeps them out of the sample with realistically sized $n$. ↩

The case for regression is similar; in that case, we want to maximize the negative of the mean error. ↩