Reinforcement Learning

Book description

Reinforcement learning (RL) will deliver one of the biggest breakthroughs in AI over the next decade, enabling algorithms to learn from their environment to achieve arbitrary goals. This exciting development avoids constraints found in traditional machine learning (ML) algorithms. This practical book shows data science and AI professionals how to apply reinforcement learning and enable a machine to learn by itself.

Author Phil Winder of Winder Research covers everything from basic building blocks to state-of-the-art practices. You'll explore the current state of RL, focus on industrial applications, learn numerous algorithms, and benefit from dedicated chapters on deploying RL solutions to production. This is no cookbook; it doesn't shy away from math and expects familiarity with ML.

  • Learn what RL is and how the algorithms help solve problems
  • Become grounded in RL fundamentals including Markov decision processes, dynamic programming, and temporal difference learning
  • Dive deep into a range of value and policy gradient methods
  • Apply advanced RL solutions such as meta learning, hierarchical learning, multi-agent learning, and imitation learning
  • Understand cutting-edge deep RL algorithms including Rainbow, PPO, TD3, SAC, and more
  • Get practical examples through the accompanying website

Table of contents

  1. Preface
    1. Objective
    2. Who Should Read This Book?
    3. Guiding Principles and Style
    4. Prerequisites
    5. Scope and Outline
    6. Supplementary Materials
    7. Conventions Used in This Book
      1. Acronyms
      2. Mathematical Notation
      3. Fair Use Policy
    8. O’Reilly Online Learning
    9. How to Contact Us
    10. Acknowledgments
  2. 1. Why Reinforcement Learning?
    1. Why Now?
    2. Machine Learning
    3. Reinforcement Learning
      1. When Should You Use RL?
      2. RL Applications
    4. Taxonomy of RL Approaches
      1. Model-Free or Model-Based
      2. How Agents Use and Update Their Strategy
      3. Discrete or Continuous Actions
      4. Optimization Methods
      5. Policy Evaluation and Improvement
    5. Fundamental Concepts in Reinforcement Learning
      1. The First RL Algorithm
      2. Is RL the Same as ML?
      3. Reward and Feedback
    6. Reinforcement Learning as a Discipline
    7. Summary
    8. Further Reading
  3. 2. Markov Decision Processes, Dynamic Programming, and Monte Carlo Methods
    1. Multi-Arm Bandit Testing
      1. Reward Engineering
      2. Policy Evaluation: The Value Function
      3. Policy Improvement: Choosing the Best Action
      4. Simulating the Environment
      5. Running the Experiment
      6. Improving the ϵ-greedy Algorithm
    2. Markov Decision Processes
      1. Inventory Control
      2. Inventory Control Simulation
    3. Policies and Value Functions
      1. Discounted Rewards
      2. Predicting Rewards with the State-Value Function
      3. Predicting Rewards with the Action-Value Function
      4. Optimal Policies
    4. Monte Carlo Policy Generation
    5. Value Iteration with Dynamic Programming
      1. Implementing Value Iteration
      2. Results of Value Iteration
    6. Summary
    7. Further Reading
  4. 3. Temporal-Difference Learning, Q-Learning, and n-Step Algorithms
    1. Formulation of Temporal-Difference Learning
      1. Q-Learning
      2. SARSA
      3. Q-Learning Versus SARSA
      4. Case Study: Automatically Scaling Application Containers to Reduce Cost
    2. Industrial Example: Real-Time Bidding in Advertising
      1. Defining the MDP
      2. Results of the Real-Time Bidding Environments
      3. Further Improvements
    3. Extensions to Q-Learning
      1. Double Q-Learning
      2. Delayed Q-Learning
      3. Comparing Standard, Double, and Delayed Q-Learning
      4. Opposition Learning
    4. n-Step Algorithms
      1. n-Step Algorithms on Grid Environments
    5. Eligibility Traces
    6. Extensions to Eligibility Traces
      1. Watkins’s Q(λ)
      2. Fuzzy Wipes in Watkins’s Q(λ)
      3. Speedy Q-Learning
      4. Accumulating Versus Replacing Eligibility Traces
    7. Summary
    8. Further Reading
  5. 4. Deep Q-Networks
    1. Deep Learning Architectures
      1. Fundamentals
      2. Common Neural Network Architectures
      3. Deep Learning Frameworks
      4. Deep Reinforcement Learning
    2. Deep Q-Learning
      1. Experience Replay
      2. Q-Network Clones
      3. Neural Network Architecture
      4. Implementing DQN
      5. Example: DQN on the CartPole Environment
      6. Case Study: Reducing Energy Usage in Buildings
    3. Rainbow DQN
      1. Distributional RL
      2. Prioritized Experience Replay
      3. Noisy Nets
      4. Dueling Networks
    4. Example: Rainbow DQN on Atari Games
      1. Results
      2. Discussion
    5. Other DQN Improvements
      1. Improving Exploration
      2. Improving Rewards
      3. Learning from Offline Data
    6. Summary
    7. Further Reading
  6. 5. Policy Gradient Methods
    1. Benefits of Learning a Policy Directly
    2. How to Calculate the Gradient of a Policy
    3. Policy Gradient Theorem
    4. Policy Functions
      1. Linear Policies
      2. Arbitrary Policies
    5. Basic Implementations
      1. Monte Carlo (REINFORCE)
      2. REINFORCE with Baseline
      3. Gradient Variance Reduction
      4. n-Step Actor-Critic and Advantage Actor-Critic (A2C)
      5. Eligibility Traces Actor-Critic
      6. A Comparison of Basic Policy Gradient Algorithms
    6. Industrial Example: Automatically Purchasing Products for Customers
      1. The Environment: Gym-Shopping-Cart
      2. Expectations
      3. Results from the Shopping Cart Environment
    7. Summary
    8. Further Reading
  7. 6. Beyond Policy Gradients
    1. Off-Policy Algorithms
      1. Importance Sampling
      2. Behavior and Target Policies
      3. Off-Policy Q-Learning
      4. Gradient Temporal-Difference Learning
      5. Greedy-GQ
      6. Off-Policy Actor-Critics
    2. Deterministic Policy Gradients
      1. Deterministic Policy Gradients
      2. Deep Deterministic Policy Gradients
      3. Twin Delayed DDPG
      4. Case Study: Recommendations Using Reviews
      5. Improvements to DPG
    3. Trust Region Methods
      1. Kullback–Leibler Divergence
      2. Natural Policy Gradients and Trust Region Policy Optimization
      3. Proximal Policy Optimization
    4. Example: Using Servos for a Real-Life Reacher
      1. Experiment Setup
      2. RL Algorithm Implementation
      3. Increasing the Complexity of the Algorithm
      4. Hyperparameter Tuning in a Simulation
      5. Resulting Policies
    5. Other Policy Gradient Algorithms
      1. Retrace(λ)
      2. Actor-Critic with Experience Replay (ACER)
      3. Actor-Critic Using Kronecker-Factored Trust Regions (ACKTR)
      4. Emphatic Methods
    6. Extensions to Policy Gradient Algorithms
      1. Quantile Regression in Policy Gradient Algorithms
    7. Summary
      1. Which Algorithm Should I Use?
      2. A Note on Asynchronous Methods
    8. Further Reading
  8. 7. Learning All Possible Policies with Entropy Methods
    1. What Is Entropy?
    2. Maximum Entropy Reinforcement Learning
    3. Soft Actor-Critic
      1. SAC Implementation Details and Discrete Action Spaces
      2. Automatically Adjusting Temperature
      3. Case Study: Automated Traffic Management to Reduce Queuing
    4. Extensions to Maximum Entropy Methods
      1. Other Measures of Entropy (and Ensembles)
      2. Optimistic Exploration Using the Upper Bound of Double Q-Learning
      3. Tinkering with Experience Replay
      4. Soft Policy Gradient
      5. Soft Q-Learning (and Derivatives)
      6. Path Consistency Learning
    5. Performance Comparison: SAC Versus PPO
    6. How Does Entropy Encourage Exploration?
      1. How Does the Temperature Parameter Alter Exploration?
    7. Industrial Example: Learning to Drive with a Remote Control Car
      1. Description of the Problem
      2. Minimizing Training Time
      3. Dramatic Actions
      4. Hyperparameter Search
      5. Final Policy
      6. Further Improvements
    8. Summary
      1. Equivalence Between Policy Gradients and Soft Q-Learning
      2. What Does This Mean For the Future?
      3. What Does This Mean Now?
  9. 8. Improving How an Agent Learns
    1. Rethinking the MDP
      1. Partially Observable Markov Decision Process
      2. Case Study: Using POMDPs in Autonomous Vehicles
      3. Contextual Markov Decision Processes
      4. MDPs with Changing Actions
      5. Regularized MDPs
    2. Hierarchical Reinforcement Learning
      1. Naive HRL
      2. High-Low Hierarchies with Intrinsic Rewards (HIRO)
      3. Learning Skills and Unsupervised RL
      4. Using Skills in HRL
      5. HRL Conclusions
    3. Multi-Agent Reinforcement Learning
      1. MARL Frameworks
      2. Centralized or Decentralized
      3. Single-Agent Algorithms
      4. Case Study: Using Single-Agent Decentralized Learning in UAVs
      5. Centralized Learning, Decentralized Execution
      6. Decentralized Learning
      7. Other Combinations
      8. Challenges of MARL
      9. MARL Conclusions
    4. Expert Guidance
      1. Behavior Cloning
      2. Imitation RL
      3. Inverse RL
      4. Curriculum Learning
    5. Other Paradigms
      1. Meta-Learning
      2. Transfer Learning
    6. Summary
    7. Further Reading
  10. 9. Practical Reinforcement Learning
    1. The RL Project Life Cycle
      1. Life Cycle Definition
    2. Problem Definition: What Is an RL Project?
      1. RL Problems Are Sequential
      2. RL Problems Are Strategic
      3. Low-Level RL Indicators
      4. Types of Learning
    3. RL Engineering and Refinement
      1. Process
      2. Environment Engineering
      3. State Engineering or State Representation Learning
      4. Policy Engineering
      5. Mapping Policies to Action Spaces
      6. Exploration
      7. Reward Engineering
    4. Summary
    5. Further Reading
  11. 10. Operational Reinforcement Learning
    1. Implementation
      1. Frameworks
      2. Scaling RL
      3. Evaluation
    2. Deployment
      1. Goals
      2. Architecture
      3. Ancillary Tooling
      4. Safety, Security, and Ethics
    3. Summary
    4. Further Reading
  12. 11. Conclusions and the Future
    1. Tips and Tricks
      1. Framing the Problem
      2. Your Data
      3. Training
      4. Evaluation
      5. Deployment
    2. Debugging
      1. ${ALGORITHM_NAME} Can’t Solve ${ENVIRONMENT}!
      2. Monitoring for Debugging
    3. The Future of Reinforcement Learning
      1. RL Market Opportunities
      2. Future RL and Research Directions
    4. Concluding Remarks
      1. Next Steps
      2. Now It’s Your Turn
    5. Further Reading
  13. A. The Gradient of a Logistic Policy for Two Actions
  14. B. The Gradient of a Softmax Policy
  15. Glossary
    1. Acronyms and Common Terms
    2. Symbols and Notation
  16. Index

Product information

  • Title: Reinforcement Learning
  • Author(s): Phil Winder
  • Release date: November 2020
  • Publisher(s): O'Reilly Media, Inc.
  • ISBN: 9781098114831