Reinforcement learning, like many topics whose names end with “ing” such as machine learning and mountaineering, is simultaneously a problem, a class of solution methods that work well on the problem, and the field that studies this problem and its solution methods. It is convenient to use a single name for all three things, but at the same time essential to keep the three conceptually separate. In particular, the distinction between problems and solution methods is very important in reinforcement learning; failing to make this distinction is the source of many confusions.
Like other terms ending in "ing" (machine learning, mountaineering), reinforcement learning refers to three things at once:

- a problem;
- a class of methods that solve the problem;
- the field that studies both the problem and its solution methods.

Keeping these three conceptually separate is essential; many confusions later on arise from failing to distinguish them.
In interactive problems it is often impractical to obtain examples of desired behavior that are both correct and representative of all the situations in which the agent has to act.
In interactive problems, it is often impractical to obtain example behaviors that are both:

- correct, and
- representative of all the situations in which the agent has to act.

Reinforcement learning is also different from what machine learning researchers call unsupervised learning, which is typically about finding structure hidden in collections of unlabeled data. The terms supervised learning and unsupervised learning would seem to exhaustively classify machine learning paradigms, but they do not. Although one might be tempted to think of reinforcement learning as a kind of unsupervised learning because it does not rely on examples of correct behavior, reinforcement learning is trying to maximize a reward signal instead of trying to find hidden structure.
RL's goal is to maximize a reward signal rather than to find hidden structure; it is therefore a third machine learning paradigm, distinct from both supervised and unsupervised learning.
Exploitation: the action of making use of and benefiting from resources.
To obtain a lot of reward, a reinforcement learning agent must prefer actions that it has tried in the past and found to be effective in producing reward.
The agent prefers actions it has already tried and found effective at producing reward. But to discover such actions in the first place, it must try actions it has never selected before, and this is the source of the conflict.
The agent has to exploit what it has already experienced in order to obtain reward, but it also has to explore in order to make better action selections in the future.
- The purpose of exploitation: obtain reward now.
- The purpose of exploration: enable better action selections in the future.

Pursuing either one exclusively is likely to make the agent fail at the task.
For now, we simply note that the entire issue of balancing exploration and exploitation does not even arise in supervised and unsupervised learning, at least in the purest forms of these paradigms.
The problem of balancing exploration and exploitation is still unsolved; for now, simply note that it does not even arise in supervised or unsupervised learning, at least in the purest forms of those paradigms.
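To make the trade-off concrete, here is a minimal epsilon-greedy sketch for a toy k-armed bandit (this example and its reward probabilities are my own illustration, not code from the book): with probability epsilon the agent explores a random action, otherwise it exploits the action with the highest estimated value.

```python
import random

random.seed(0)

K = 5                                          # number of actions (arms)
true_reward_prob = [0.1, 0.3, 0.5, 0.7, 0.2]   # hypothetical; unknown to the agent
epsilon = 0.1                                  # fraction of steps spent exploring

value_estimate = [0.0] * K                     # Q(a): running average of rewards
action_count = [0] * K

for step in range(10_000):
    if random.random() < epsilon:
        a = random.randrange(K)                # explore: pick a random action
    else:
        a = max(range(K), key=lambda i: value_estimate[i])  # exploit the best so far

    reward = 1.0 if random.random() < true_reward_prob[a] else 0.0

    # incremental sample average: Q(a) <- Q(a) + (r - Q(a)) / n
    action_count[a] += 1
    value_estimate[a] += (reward - value_estimate[a]) / action_count[a]

print("estimated values:", [round(q, 2) for q in value_estimate])
```

With epsilon = 0 the agent can lock onto a mediocre arm it happened to try first; with epsilon = 1 it never benefits from what it has learned. The balance between those two extremes is exactly the dilemma described above.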
This is in contrast to many approaches that consider subproblems without addressing how they might fit into a larger picture. … Although these approaches (some tricks in machine learning) have yielded many useful results, their focus on isolated subproblems is a significant limitation.
RL states the goal of the whole problem directly, rather than translating it into a set of lower-level subgoals.
Methods for solving reinforcement learning problems that use models and planning are called model-based methods, as opposed to simpler model-free methods that are explicitly trial-and-error learners—viewed as almost the opposite of planning.
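As a rough illustration of this distinction, the sketch below contrasts a model-free learner (tabular Q-learning, which updates value estimates from sampled transitions by trial and error) with a model-based planner (value iteration, which computes values directly from a known model). The two-state environment is invented for this example and is not from the book.

```python
import random

random.seed(0)
gamma = 0.9

# A tiny deterministic environment, made up for illustration:
# env[state][action] = (next_state, reward).
env = {0: {0: (0, 0.0), 1: (1, 1.0)},
       1: {0: (0, 0.0), 1: (1, 2.0)}}

# --- Model-free: Q-learning learns only from sampled transitions ---
# Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
Q = {(s, a): 0.0 for s in env for a in (0, 1)}
for _ in range(5000):
    s, a = random.choice(list(env)), random.choice([0, 1])
    s_next, r = env[s][a]                      # observed by trial and error
    best_next = max(Q[(s_next, 0)], Q[(s_next, 1)])
    Q[(s, a)] += 0.1 * (r + gamma * best_next - Q[(s, a)])

# --- Model-based: value iteration plans with the model itself ---
V = {s: 0.0 for s in env}
for _ in range(200):
    V = {s: max(r + gamma * V[s2] for s2, r in env[s].values()) for s in env}

print("model-free Q:", {k: round(v, 1) for k, v in Q.items()})
print("model-based V:", {s: round(v, 1) for s, v in V.items()})
```

The planner uses `env` directly to look ahead; the Q-learner only ever sees one sampled transition at a time, which is what makes it "explicitly trial-and-error".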
The formal definition of state as we use it here is given by the framework of Markov decision processes presented in Chapter 3. More generally, however, we encourage the reader to follow the informal meaning and think of the state as whatever information is available to the agent about its environment.
Beyond the formal definition of state given by the MDP framework of Chapter 3, the authors encourage an informal reading: the state is whatever information about its environment is available to the agent.
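Under that informal reading, a state can be any bundle of information the agent has access to, not only a formal MDP state. A minimal, purely hypothetical sketch (the field names are mine, not the book's):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class AgentState:
    """Whatever information about the environment is available to the agent."""
    sensor_readings: tuple[float, ...]  # direct observations of the environment
    last_action: int                    # remembered information counts as state too
```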