The Landscape of Deep Reinforcement Learning

Deep reinforcement learning (deep RL) is one of the most active topics in artificial intelligence (AI), and it attracts many outstanding researchers who want to test its ability to solve hard real-world problems. Its versatility has earned it attention across application fields: end-to-end game control, robotic arm control, recommendation systems, and even natural language dialogue systems. However, it is hard for a machine learning engineer or researcher to keep pace with the fast, iterative development of the field. We hope this book can guide readers who want not only to understand the connections between deep reinforcement learning algorithms but also to become familiar with the practical side of these algorithms.

We first briefly introduce deep learning techniques and applications. Then we review some core concepts of reinforcement learning, explore the combination of deep learning and reinforcement learning, introduce several paradigms of deep reinforcement learning, and present some interesting work and applications. Finally, we outline a quick learning guide to this project-based book.

We'll cover the following main topics:

  • Deep Learning

  • Deep Reinforcement Learning

  • The general guide for deep RL projects

Deep learning

Deep learning plays a key role in building powerful modern AI systems for many applications and in conducting research across scientific areas. Here we briefly explain the basics of deep learning and some typical applications.

Elements of deep learning

In March 2019, three influential AI scientists, Geoffrey E. Hinton, Yoshua Bengio, and Yann LeCun, received the 2018 ACM Turing Award, often called the Nobel Prize of computer science, for their contributions to deep learning. Deep learning is the modern form of artificial neural networks, a framework for building intelligent systems that draws inspiration from psychology and brain science and dates back to the early days of artificial intelligence. It works by using layered computational models to learn representations of data at multiple levels of abstraction.

Artificial neural networks can approximate arbitrarily complex continuous functions; see chapter 4 of Michael Nielsen's book Neural Networks and Deep Learning. Deep learning adds more hidden layers to strengthen this ability to represent data. From a mathematical point of view, a deep network is a composition of a large number of functions that can be trained with the back-propagation algorithm.
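To make the "composition of functions" view concrete, here is a minimal sketch in Python with NumPy. The toy task (fitting sin(x)), the layer sizes, the learning rate, and all variable names are our own assumptions for illustration; they do not come from Nielsen's book.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (an assumption for illustration): learn y = sin(x).
x = np.linspace(-3.0, 3.0, 64).reshape(-1, 1)
y = np.sin(x)

# Parameters of two composed functions: f1 (hidden layer) and f2 (output layer).
W1 = rng.normal(0.0, 0.5, size=(1, 16))
b1 = np.zeros(16)
W2 = rng.normal(0.0, 0.5, size=(16, 1))
b2 = np.zeros(1)


def forward(inputs):
    hidden = np.tanh(inputs @ W1 + b1)   # f1: affine map followed by tanh
    return hidden, hidden @ W2 + b2      # f2: affine map applied to f1's output


def mse(pred, target):
    return float(np.mean((pred - target) ** 2))


initial_loss = mse(forward(x)[1], y)
lr = 0.01
for _ in range(2000):
    h, pred = forward(x)
    # Back-propagation: apply the chain rule from the output back to the input.
    d_pred = 2.0 * (pred - y) / len(x)
    dW2 = h.T @ d_pred
    db2 = d_pred.sum(axis=0)
    d_h = d_pred @ W2.T
    d_z = d_h * (1.0 - h ** 2)           # derivative of tanh
    dW1 = x.T @ d_z
    db1 = d_z.sum(axis=0)
    # Gradient descent step on every parameter.
    W1 -= lr * dW1
    b1 -= lr * db1
    W2 -= lr * dW2
    b2 -= lr * db2

final_loss = mse(forward(x)[1], y)
print(f"loss before: {initial_loss:.4f}, after: {final_loss:.4f}")
```

Deeper networks repeat the same pattern: each extra layer is one more function in the composition and one more chain-rule step in the backward pass.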

Deep learning has become popular around the world because of its remarkable results in practical applications. GPUs produced by Nvidia provided the computing power that made deep learning an efficient way to tackle the ImageNet competition in 2012; they now dominate the training market and have become a must for deep learning. This computing power is also one of the most important driving forces for building capable reinforcement learning agents.

Nowadays, deep learning has swept the fields of speech recognition, image recognition, computer vision, natural language processing, and even video prediction. Two main network architectures, convolutional neural networks and recurrent neural networks, cover the spatial and temporal perspectives for modeling problems.

Although deep learning models still suffer from limited interpretability, the technology is already the default choice in many areas, such as computer vision, natural language processing, social network analysis, biology, quantum mechanics, and astronomy.

Here is a list of building blocks for constructing various recent deep learning programs:

  1. Transformers

  2. Residual connections

  3. Attention mechanisms

  4. Generative adversarial networks, GANs for short

  5. Variational auto-encoders

  6. Graph convolutional networks

Applications of Deep Learning


From the early days of deep learning, researchers designed algorithms that generate art, such as paintings in the styles of famous painters: Claude Monet's impressionism, Van Gogh's post-impressionism, and others. This is done by a deep learning algorithm named Neural Style Transfer. A well-known mobile application named Prisma is dedicated to producing high-resolution images in different styles for its users.

In October 2018, 'Edmond de Belamy, from La Famille de Belamy', a painting generated by a deep learning algorithm, sold for $432,500 at auction.

Nvidia researchers recently developed a GAN named GauGAN, which can turn doodles into photorealistic landscapes. You draw a doodle first and get a landscape image after the GAN processes it. The example below shows an extraordinary waterfall generated from the drawing on the left.


Working with experts from various fields, including structural biology and physics, DeepMind published work on predicting the 3D structure of a protein based solely on its genetic sequence. Their system, named AlphaFold, is making significant progress on one of the core challenges in biology. The left figure below shows how a structure evolves during optimization, using only gradient descent as in ordinary deep learning training.

The basic procedure is described in the right figure: the protein sequence is fed into a neural network, whose distance and angle predictions provide enough information to construct a score that can be optimized to obtain a protein structure. We can view this as a fairly standard deep learning application, similar to those in vision.

Natural language processing

OpenAI announced their NLP model GPT-2 in February 2019. The model is trained in an unsupervised fashion. Given a sentence or two as a beginning, it can generate text in a variety of styles, such as news and novels, and the content looks very realistic. The following example shows text generated by GPT-2 from a human-written prompt.


In a shocking finding, scientist discovered a herd of unicorns living in a remote, previously unexplored valley, in the Andes Mountains. Even more surprising to the researchers was the fact that the unicorns spoke perfect English.


The scientist named the population, after their distinctive horn, Ovid’s Unicorn. These four-horned, silver-white unicorns were previously unknown to science.

Now, after almost two centuries, the mystery of what sparked this odd phenomenon is finally solved.

Dr. Jorge Pérez, an evolutionary biologist from the University of La Paz, and several companions, were exploring the Andes Mountains when they found a small valley, with no other animals or humans. Pérez noticed that the valley had what appeared to be a natural fountain, surrounded by two peaks of rock and silver snow.


However, Pérez also pointed out that it is likely that the only way of knowing for sure if unicorns are indeed the descendants of a lost alien race is through DNA. “But they seem to be able to communicate in English quite well, which I believe is a sign of evolution, or at least a change in social organization,” said the scientist.

Unusually, however, OpenAI researchers decided not to release the training data or the pre-trained parameters of the largest model, because they believed such a powerful model was at risk of malicious abuse. Their argument that the model was too risky to release caused a big stir, and researchers in the machine learning and natural language processing communities discussed it intensely. In our view, many aspects still need to be improved or rebuilt before generated text stays consistent around a core meaning the way a real writer's text does.


Google Brain made progress in medical diagnosis. Its machine learning system for diagnosing diabetic retinopathy from images can perform on par with professionally certified ophthalmologists. About 400 million people are at risk of blindness if early diabetic retinopathy goes undetected. In many countries, however, there are too few professional ophthalmologists to perform the necessary checks, and this technology will help ensure that more people receive them. Research in other areas of medical imaging is investigating the potential of machine learning for other medical prediction tasks. We believe that machine learning can improve the treatment experience for both physicians and patients in terms of quality and efficiency.

In 2018, the Stanford AI Lab worked with a medical team to build a deep learning system that helps monitor patients. It detects postures and actions so that dangerous situations can be prevented.


To summarize, deep learning is a force that pushes the understanding and use of artificial intelligence across sectors of human life and society. One thing to remember is that for a deep learning program, we humans give the instructions to start, to transfer, and to stop. We believe an intrinsically autonomous system could be a more natural way to build real artificial intelligence; that is, reinforcement learning could be a crucial framework for building artificial general intelligence.

Deep reinforcement learning

Deep learning models are simple to build. In just a few dozen lines of code, you can solve problems that previously required systems designed with a great deal of effort. As a result, many application areas (speech, images, vision, natural language understanding, and so on) now tilt their resources toward deep learning. We do not judge the possible unintended consequences of this here; from an optimistic point of view, deep learning has truly revitalized the field of artificial intelligence. How to channel people's enthusiasm is, of course, very important, and we believe that in time everyone will find a suitable path for development.

Success is often very hard to achieve in science. Several important problems in number theory and graph theory have been carried forward through generations of scientists, each building on the work of predecessors. With that history in mind, let us look at the present and its most exciting progress. We introduce the paradigm of deep reinforcement learning and its related algorithms and ask what the most critical factor is. The key is how we apply these techniques to solve problems: suitable problem modeling and improvement of solutions.

Before deep learning, reinforcement learning was not practical because it could not deal effectively with very large state or action spaces. The examples commonly seen were relatively simplified scenarios. The emergence of deep learning lets people deal with real problems: visual recognition accuracy has improved dramatically, with the ImageNet top-5 error rate dropping below 4%, and speech recognition has matured and is widely used, with all current commercial speech recognition systems based on deep learning. These results indicate that deep learning can be the basis of practical applications. Research and applications of deep reinforcement learning now largely target the problems above.

ImageNet: ImageNet is an important dataset for computer vision research. It is organized according to the WordNet hierarchy, and each node of the hierarchy is associated with hundreds or thousands of images. Researchers test their deep learning algorithms on different ImageNet tasks. To read more about it, please visit

Thanks to work on building reinforcement learning environments, comparing research results is easier than it was ten years ago. OpenAI Gym has had an impressive effect on the development of new algorithms and on finding applications of reinforcement learning.

Reinforcement learning

Reinforcement learning is a way to model the human learning process by simplifying the situations in which humans make decisions. It grew from two independent threads of development: research on animal behavior and optimal control. The process was eventually formalized as a Markov Decision Process (MDP) by Richard Bellman. Since then, many scientists have extended this model into a relatively complete system, often called approximate dynamic programming. See the work of MIT professor Dimitri P. Bertsekas: Dynamic Programming and Optimal Control, Vol. II, 4th Edition: Approximate Dynamic Programming.
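As a small illustration of the MDP formalism and of dynamic programming, the following Python sketch runs value iteration on a two-state MDP. The states, actions, rewards, and discount factor are hypothetical values chosen by us so that the result is easy to check by hand.

```python
# Hypothetical 2-state, 2-action MDP.
# P[state][action] is a list of (probability, next_state, reward) triples.
P = {
    0: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 1.0)]},
    1: {0: [(1.0, 0, 0.0)], 1: [(1.0, 1, 2.0)]},
}
GAMMA = 0.9  # discount factor (assumed)


def q_value(V, s, a):
    """Expected one-step reward plus discounted value of the next state."""
    return sum(p * (r + GAMMA * V[s2]) for p, s2, r in P[s][a])


def value_iteration(tol=1e-10):
    """Apply the Bellman optimality backup until the values stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        new_V = {s: max(q_value(V, s, a) for a in P[s]) for s in P}
        if max(abs(new_V[s] - V[s]) for s in P) < tol:
            return new_V
        V = new_V


V = value_iteration()
# The greedy policy with respect to the converged values.
policy = {s: max(P[s], key=lambda a: q_value(V, s, a)) for s in P}
print(V, policy)
```

In state 1 the best choice is to stay and collect a reward of 2 forever, which is worth 2 / (1 - 0.9) = 20; state 0 is worth 1 + 0.9 * 20 = 19. Exhaustive sweeps like this are only feasible because the state space is tiny.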

Although we have many reinforcement learning methods, they are very difficult to apply to large-scale real-world problems. First of all, the curse of dimensionality makes it difficult to find the optimal policy or compute the optimal action values efficiently. In addition, the ideas contained in reinforcement learning, such as greedy algorithms, dynamic programming, and approximation, are the most critical and most frequently used parts of these methods. Here is an overview diagram of reinforcement learning introduced by David Silver in his Reinforcement Learning course:

Core concepts of a reinforcement learning system are:

  1. Environment. The environment can be fully observable or partially observable based on the complexity of the problem and related factors.

  2. Reward signal. Reinforcement learning is based on the reward hypothesis: any goal can be formalized as the outcome of maximizing a cumulative reward. That is the basic foundation for applying reinforcement learning techniques to your problems.

  3. Agent. The agent is the key element of reinforcement learning. It contains a state, a policy, possibly a value function, and optionally a model.
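The three concepts above can be sketched as a simple interaction loop. The corridor environment, the fixed "always move right" policy, and the discount factor below are hypothetical choices of ours for illustration; the reset/step interface only loosely imitates the style of common RL environment libraries.

```python
class Corridor:
    """Hypothetical fully observable environment: positions 0..4, goal at 4."""

    def __init__(self, length=5):
        self.length = length
        self.pos = 0

    def reset(self):
        self.pos = 0
        return self.pos

    def step(self, action):
        # action 1 moves right, any other action moves left (clipped at the walls).
        move = 1 if action == 1 else -1
        self.pos = min(max(self.pos + move, 0), self.length - 1)
        done = self.pos == self.length - 1
        reward = 1.0 if done else 0.0      # reward signal: 1 on reaching the goal
        return self.pos, reward, done


def run_episode(env, policy, gamma=0.9, max_steps=100):
    """Let the agent act until the episode ends; return the discounted return."""
    state = env.reset()
    total, discount = 0.0, 1.0
    for _ in range(max_steps):
        action = policy(state)             # agent: maps state to action
        state, reward, done = env.step(action)
        total += discount * reward         # cumulative (discounted) reward
        discount *= gamma
        if done:
            break
    return total


def always_right(state):
    return 1


episode_return = run_episode(Corridor(), always_right)
print(episode_return)
```

The agent needs four steps to reach the goal, so the single reward of 1 is discounted three times and the return equals 0.9 ** 3.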

Classical methods of deep reinforcement learning

Now let's consider a number of deep reinforcement learning algorithms and group them into clusters based on how they tackle reinforcement learning problems. Because research and practice move so quickly, it is hard to get a complete view of this area, so we visualize these methods. You can find their classes, names, and publication dates in the figure from the awesome-deep-rl project.

As we know, deep reinforcement learning methods combine deep learning and reinforcement learning. We can typically divide them into the following types:

  • Value-based methods

  • Policy gradient methods

  • Explorations in deep reinforcement learning

  • Actor-critic methods

  • Model-based methods

  • Multi-agent reinforcement learning

  • Meta-reinforcement learning

  • Hierarchical reinforcement learning

  • Inverse reinforcement learning

Value-based methods

DQN, the symbol of the rise of deep reinforcement learning, was first proposed by V. Mnih and colleagues at DeepMind in 2013. The team later published an improved model in Nature that removed some issues of the original DQN. It is often called Nature DQN.

Many researchers, especially at DeepMind, followed up on Nature DQN. Over the last several years, the performance of value-based methods has improved considerably on a large portion of tasks in the Atari game environment.

In some stochastic environments, Q-learning performs very poorly. The culprit is a large overestimation of action values. These overestimates arise because Q-learning uses the maximum action value as an estimate of the maximum expected action value, which introduces a positive bias. There is another way to approximate the maximum expected value of a set of random variables: the so-called double estimator, which tends to underestimate rather than overestimate. Applying this idea to Q-learning yields double Q-learning, an off-policy reinforcement learning method. This algorithm converges to the optimal policy and performs better than Q-learning in certain settings.

The Double Q-Network (Double DQN) is the result of combining double Q-learning with deep learning. In some Atari games, DQN itself suffers from overestimation. With double Q-learning, the idea carries over to large-scale function approximation. The resulting algorithm not only reduces the observed overestimation but also performs considerably better in some games.
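The difference between the two bootstrap targets fits in a few lines of Python. This is a sketch under our own assumptions: the Q-values are made-up numbers standing in for the outputs of an online network and a target network, and the function names are ours.

```python
def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])


def dqn_target(reward, next_q_target, gamma=0.99, done=False):
    """Standard DQN: the target network both selects and evaluates the action."""
    if done:
        return reward
    return reward + gamma * max(next_q_target)


def double_dqn_target(reward, next_q_online, next_q_target, gamma=0.99, done=False):
    """Double DQN: the online network selects, the target network evaluates."""
    if done:
        return reward
    best_action = argmax(next_q_online)
    return reward + gamma * next_q_target[best_action]


# Made-up Q-values for the next state (three actions).
q_online = [1.0, 2.5, 2.0]
q_target = [1.2, 1.5, 3.0]

y_dqn = dqn_target(1.0, q_target, gamma=0.9)
y_ddqn = double_dqn_target(1.0, q_online, q_target, gamma=0.9)
print(y_dqn, y_ddqn)
```

Here the standard target is 1 + 0.9 * 3.0 = 3.7, while the double estimator gives 1 + 0.9 * 1.5 = 2.35. Because the action is chosen by one set of values and scored by another, the double target can never exceed the standard one, which is what counteracts the positive bias.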

Other algorithms based on DQN make more efficient use of the samples in the experience replay buffer or change the network architecture, as Dueling networks do. Rainbow, published in 2018, looks like a final version of DQN; it combines these with other great ideas.

Policy gradient methods

Policy gradient methods are another typical way to solve reinforcement learning problems. Strictly speaking, this class of methods searches for the optimal policy directly.

Although there are several books on reinforcement learning, their treatment of policy gradients is insufficient. Existing reinforcement learning textbooks do not give enough guidance on how to use function approximation; they focus mainly on discrete state spaces. Moreover, they do not adequately describe derivative-free optimization and policy gradient methods, even though these techniques are quite important in many tasks.

Policy gradient algorithms optimize the policy by gradient ascent: they repeatedly compute a noisy estimate of the gradient of the policy's expected return and then update the policy in the direction of that gradient. This approach has an advantage over other RL methods (such as Q-learning) because it directly optimizes the quantity of interest, the expected total return of the policy. Methods of this type were long considered impractical because of the high variance of the gradient estimate, until the recent work of Schulman et al. and Mnih et al. demonstrated successful applications of policy gradient methods to difficult control problems.
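A minimal sketch of this idea is the REINFORCE update on a two-armed bandit, written here in plain Python. The reward values, the softmax parameterization, the learning rate, and the number of steps are assumptions made for illustration; this is not the method of Schulman et al. or Mnih et al.

```python
import math
import random

random.seed(0)
ARM_REWARDS = [0.0, 1.0]   # hypothetical deterministic reward of each action


def softmax(prefs):
    """Turn action preferences into a probability distribution."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


theta = [0.0, 0.0]         # policy parameters: one preference per action
lr = 0.5
for _ in range(2000):
    probs = softmax(theta)
    # Sample one action; the sampled return is a noisy estimate of the
    # expected return under the current policy.
    action = random.choices(range(len(theta)), weights=probs)[0]
    reward = ARM_REWARDS[action]
    for k in range(len(theta)):
        # d/d theta[k] of log pi(action) for a softmax policy.
        grad_log_pi = (1.0 if k == action else 0.0) - probs[k]
        theta[k] += lr * reward * grad_log_pi   # gradient ascent step

final_probs = softmax(theta)
print(final_probs)
```

Each update uses a single sampled action, so individual steps are noisy, yet the policy still shifts almost all probability onto the rewarding arm. Practical algorithms add baselines and other variance reduction on top of this basic estimator.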

Explorations in deep reinforcement learning

Exploration is how a reinforcement learning agent gathers rich experience and learns about its environment more broadly. Its counterpart is exploitation, and the two compete with each other. We call this tension the exploration-exploitation dilemma.

Now we give more intuition for exploration. Generally speaking, there are three types of exploration strategy: optimistic exploration, posterior sampling, and information gain.

  1. Optimistic exploration always tries to choose highly uncertain actions, in the spirit of the famous Chinese saying "rare things are precious". If we count how often an action or a state-action pair has been visited, the count can be turned into a bonus that is larger for rarely visited states. In deep reinforcement learning, researchers have explored several ways to make this possible: CTS-based pseudo-counts, hash-based pseudo-counts, and exemplar models are typical pseudo-count exploration methods.

  2. Posterior sampling makes exploration more accurate by borrowing ideas from Bayesian learning. Since posterior sampling algorithms update the probability distribution after each sample, after many learning steps we hold a stronger belief about each action, indicated by a smaller variance. Techniques such as Thompson sampling and Bootstrapped DQN are good examples of posterior sampling.

  3. Information gain exploration treats information as part of the state. The agent visits new states to acquire new information and prefers the states that yield the most information gain. Information theory helps here, but exact solutions are usually intractable, so we need good approximations. We can use VIME (variational information maximizing exploration) to get as much information about the environment as possible from each interaction.
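The first strategy is the easiest to sketch. The class below keeps exact visit counts in a table and returns a bonus of beta / sqrt(N(s, a)), a commonly used form; the class name, the value of beta, and the string-valued states are our own assumptions. Pseudo-count methods replace the exact table with a density model or a hash of the state.

```python
import math
from collections import defaultdict


class CountBonus:
    """Tabular count-based exploration bonus: beta / sqrt(N(s, a))."""

    def __init__(self, beta=1.0):
        self.beta = beta
        self.counts = defaultdict(int)

    def bonus(self, state, action):
        """Record one visit to (state, action) and return its bonus."""
        self.counts[(state, action)] += 1
        return self.beta / math.sqrt(self.counts[(state, action)])


explorer = CountBonus(beta=0.5)
first = explorer.bonus("s0", "left")     # first visit: 0.5 / sqrt(1)
explorer.bonus("s0", "left")
explorer.bonus("s0", "left")
fourth = explorer.bonus("s0", "left")    # fourth visit: 0.5 / sqrt(4)
novel = explorer.bonus("s1", "left")     # a new state gets the full bonus again
print(first, fourth, novel)
```

During training, the bonus is added to the environment reward, so rarely visited state-action pairs look temporarily more attractive to the agent.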

Actor-critic methods

The guiding goal for better methods is to make algorithms stable and data efficient. The most famous deep actor-critic algorithm is DDPG. DDPG is the deep learning version of the deterministic policy gradient (DPG) method, which uses ideas from DQN to adapt DPG. DDPG can solve reinforcement learning problems with continuous action spaces. On those continuous action tasks, DDPG performs stably across different environments with no changes to the setup. In addition, the DDPG authors report that in all their experiments it found solutions using far fewer steps of experience than DQN needs on Atari, by a factor of about 20. Given more simulation time, DDPG may solve even more difficult problems. A future direction for DDPG is to use a model-based approach to reduce the number of training episodes, because model-free reinforcement learning methods usually require a lot of training to find a reasonable solution.

DDPG has an actor-critic structure that combines information from both the policy and the value function. Both the actor and the critic use deep neural networks for approximation.
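The interplay between actor and critic can be sketched on a toy problem. The code below is not DDPG: it has no neural networks, replay buffer, or target networks. It is a deterministic actor-critic in the same spirit, with a linear actor and a critic that is linear in hand-picked quadratic features; the one-step task, the features, and all constants are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def reward_fn(s, a):
    # Hypothetical one-step task: the best action in state s is a = 2 * s.
    return -(a - 2.0 * s) ** 2


def critic_features(s, a):
    # Quadratic features, so a linear critic can represent the reward exactly.
    return np.stack([a * a, a * s, s * s, np.ones_like(s)], axis=1)


w = 0.0            # actor parameter: deterministic policy mu(s) = w * s
actor_lr = 0.5
for _ in range(50):
    s = rng.uniform(-1.0, 1.0, size=256)
    a = w * s + rng.normal(0.0, 0.3, size=256)   # exploration noise on the action
    # Critic update: fit Q(s, a) = theta . features(s, a) by least squares.
    # Episodes last one step, so the regression target is just the reward.
    theta, *_ = np.linalg.lstsq(critic_features(s, a), reward_fn(s, a), rcond=None)
    # Actor update: deterministic policy gradient,
    # dQ/da evaluated at a = mu(s), times dmu/dw = s.
    mu = w * s
    dq_da = 2.0 * theta[0] * mu + theta[1] * s
    w += actor_lr * np.mean(dq_da * s)

print(w)
```

The actor never sees the reward directly. It only follows the critic's gradient with respect to the action, which is the mechanism DDPG scales up with deep networks; here the actor parameter w converges to 2.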

After DDPG came the ACER and Reactor algorithms. Researchers have tried to address the flaws of policy gradient methods with approaches such as TRPO and ACER, which constrain or otherwise optimize the size of each policy update. These methods all have their own trade-offs. ACER is much more complicated than PPO: it requires additional code for off-policy corrections and a replay buffer, yet it does only marginally better than PPO on the Atari benchmark. TRPO, although useful for continuous control tasks, is not easily compatible with algorithms that share parameters between the policy and value function or that use auxiliary losses, which are important for solving Atari and other tasks with visual input.

Model-based methods

The methods mentioned above aim to improve stability and data efficiency without explicitly modeling the environment. Model-based methods, by contrast, make use of information about the environment. Once we have a learned model, we can use it to plan optimal actions.

Compared with model-free methods, model-based reinforcement learning methods can learn with significantly fewer samples. They use a learned model of the environment dynamics to perform policy optimization. Dynamics models can be learned in a sample-efficient manner because they are trained with standard supervised learning techniques, which allows the use of off-policy data.
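The two steps, learning a dynamics model with supervised learning and then planning with it, can be sketched on a one-dimensional linear system. The true dynamics, the random data-collection policy, and the one-step planner below are hypothetical choices of ours for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
TRUE_A, TRUE_B = 0.9, 0.5   # hypothetical dynamics, unknown to the agent


def env_step(s, u):
    """Real environment: next state given state s and action u."""
    return TRUE_A * s + TRUE_B * u


# 1. Collect transitions with a random behavior policy (off-policy data).
states = rng.uniform(-1.0, 1.0, size=200)
actions = rng.uniform(-1.0, 1.0, size=200)
next_states = env_step(states, actions)

# 2. Supervised learning: fit the model s' = a * s + b * u by least squares.
X = np.stack([states, actions], axis=1)
(a_hat, b_hat), *_ = np.linalg.lstsq(X, next_states, rcond=None)


# 3. Plan with the learned model: pick the action whose predicted
#    next state equals the goal.
def plan(s, goal=0.0):
    return (goal - a_hat * s) / b_hat


s0 = 0.8
u0 = plan(s0)
s1 = env_step(s0, u0)
print(a_hat, b_hat, s1)
```

The model is trained purely from randomly collected transitions, yet it is good enough to steer the real system to the goal in one step. Real tasks need nonlinear models and longer planning horizons, where model errors compound.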

The low sample complexity of the method in Model-Based Reinforcement Learning via Meta-Policy Optimization makes it suitable for real-world robots. For example, it can learn an optimal policy for a high-dimensional and complex quadrupedal locomotion task from two hours of real-world data. Note that the amount of data required to learn such a policy with a model-free approach is 10 to 100 times higher, and to the authors' knowledge no previous model-based method had achieved similar performance on such tasks.

In March 2019, a complete model-based deep RL algorithm proposed by Google Brain and UIUC researchers, based on video prediction models with a novel architecture, yielded the best results on standard benchmark tasks. It suggests that model-based methods may be close to one of the core mechanisms of fast human learning.

Multi-agent reinforcement learning

Multi-agent reinforcement learning is a natural extension of (deep) reinforcement learning for several reasons. First, when a single agent interacts with the environment, information is exchanged only between two clearly separated counterparts with totally different roles, which is too restrictive for modeling real-world problems. Second, once we have solved really challenging single-agent problems, we want to find more difficult ones, and predicting the development of a complex system with many agents is clearly more difficult. Third, if we want to devise generally intelligent agents, the ultimate goal is for them to act in a real society as we do.

As we know, a great deal of work has been done throughout human history on the dynamics of societies and on predicting the future actions of individuals. Complex networks and game theory are two main areas that contribute ideas about interactions within a giant system of many individuals competing and cooperating with each other. Clustering of nodes is one of the problems in complex network analysis, while the Nash equilibrium is a great milestone in game theory. It is in effect the premise of multi-agent systems.

Normally, single-agent methods do not work well in multi-agent settings, so we need more suitable methodologies and frameworks for multi-agent systems. Multi-agent deep reinforcement learning is a vital area for building efficient and effective algorithms that help us understand the dynamics and properties of networked sets of agents.

Compared with training a single policy that issues all actions in an environment, a multi-agent perspective can decompose the problem more naturally, for example in F1 racing, antenna placement, and traffic control. Multi-agent reinforcement learning can also be seen as a more scalable way of learning. First, decomposing the actions and observations of a single agent into multiple simpler agents not only reduces the dimensions of each agent's input and output but also effectively increases the amount of training data per time step. Second, partitioning the action and observation space per agent can have an effect similar to introducing temporal abstraction, which has been used to improve learning efficiency in the single-agent setting. Finally, a good decomposition can also lead to policies that transfer more readily across variants of an environment, whereas a single super-agent may overfit to one particular environment.

Meta-reinforcement learning

Thanks to OpenAI Gym, robotic arms, and other environments, we can train agents on many more tasks than before. Having such a large number of tasks can give us a huge advantage. Meta-learning is a way to use these tasks to figure out a general way to learn better.

Meta-learning could reduce the number of samples needed to train deep reinforcement learning algorithms, since it can meta-learn a reinforcement learner that adapts faster to new tasks. Meta-learning comes in various forms, such as learning RNNs from experience, learning representations, and even learning optimizers.

Researchers found that meta-learning could help agents explore more intelligently, avoid useless actions, and find the right features faster. Meta-learning can itself be automated by automating the process of task design. For example, unsupervised meta-reinforcement learning can effectively accelerate reinforcement learning without manual task design: it exceeds the performance of learning from scratch and is competitive with methods that use hand-specified task distributions.

Hierarchical reinforcement learning

If the reward is delayed and sparse, a reinforcement learning algorithm may suffer from poor sample efficiency. Hierarchical reinforcement learning (HRL) enables agents to learn temporally extended actions at multiple levels of abstraction in an efficient and automated manner. HRL allows agents to learn policies at different time scales in parallel.

Multi-level hierarchies can accelerate learning in sparse-reward tasks because they break a problem into a set of short-term subproblems.

Inverse reinforcement learning

So far we have always dealt with given rewards. However, most problems we face in the real world do not come with a proper reward function, so we need a method to learn the reward function from an expert's behavior. In inverse RL, we try to find a reward function that explains the history of an agent's behavior or its policy.

Ng and Russell first proposed algorithms for inverse RL. Assuming that the agent always chooses the best possible action for its reward function, we try to estimate a reward function that could have generated the observed behavior.

When we build such an algorithm, the inputs are the environment dynamics (e.g., an MDP without a reward function) and the optimal behavior (e.g., the full policy or a set of trajectories), and the output is the inferred reward function.

Inverse RL faces two main problems:

  1. Many reward functions can explain most observed behavior.

  2. Sometimes the observed behavior is not optimal. Our optimal policy assumption is too strong.
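The first problem is easy to demonstrate. In the sketch below, two different reward functions on a hypothetical two-state MDP lead to exactly the same optimal policy, so observing that policy cannot tell them apart. The MDP, the rewards, and the scale and shift constants are our own assumptions for illustration.

```python
GAMMA = 0.9
# Hypothetical deterministic MDP: NEXT[state][action] is the next state.
NEXT = {0: {0: 0, 1: 1}, 1: {0: 0, 1: 1}}


def optimal_policy(reward, iters=500):
    """Value iteration followed by greedy action selection.

    reward[state][action] is the immediate reward.
    """
    V = {s: 0.0 for s in NEXT}
    for _ in range(iters):
        V = {
            s: max(reward[s][a] + GAMMA * V[NEXT[s][a]] for a in NEXT[s])
            for s in NEXT
        }
    return {
        s: max(NEXT[s], key=lambda a: reward[s][a] + GAMMA * V[NEXT[s][a]])
        for s in NEXT
    }


reward_a = {0: {0: 0.0, 1: 1.0}, 1: {0: 0.0, 1: 2.0}}
# A different reward: every entry scaled by 5 and shifted by 3.
reward_b = {
    s: {a: 5.0 * r + 3.0 for a, r in acts.items()}
    for s, acts in reward_a.items()
}

policy_a = optimal_policy(reward_a)
policy_b = optimal_policy(reward_b)
print(policy_a, policy_b)
```

Both rewards produce the same behavior, which is why practical inverse RL algorithms need extra assumptions or regularization to choose among the candidate reward functions.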

More recently, discussions of value misalignment in AI safety have shown the importance of using inverse RL to help keep AI safe.

There are many other branches of deep reinforcement learning, such as option-based reinforcement learning, multi-task reinforcement learning, and distributional reinforcement learning. We stop here to focus on the methods related to the projects in this book.

Applications of deep reinforcement learning

Since reinforcement learning is a powerful and general framework for modeling many kinds of situations, we see applications in many fields. Thanks to the power of deep learning, deep reinforcement learning can be designed to match the real-world needs of various domains.

As we know, the first success in this area was the DQN agent for playing Atari games. Once people saw the potential of deep reinforcement learning, big companies and research institutes such as DeepMind, OpenAI, Google Brain, UC Berkeley, CMU, and many others put resources into this area to develop better AI techniques.

Besides games, we can also find applications in drone autopilots, traffic control, self-driving cars, robotics control, e-commerce, and even computer security.


Humans have always liked playing games. Go, for example, is an ancient game that requires players to calculate and reason, while video games draw on various aspects of human intelligence to different degrees. Today, researchers use games to demonstrate the effectiveness and performance of their algorithms.


AlphaGo was the first artificial intelligence program to defeat a professional human Go player and the first to defeat a Go world champion. It was developed by a team at DeepMind, a Google company.

In March 2016, AlphaGo played a human-machine match against Lee Sedol, Go world champion and professional 9-dan player, and won with a total score of 4-1. At the end of 2016 and the beginning of 2017, the program played on a Chinese board game website under the registered account "Master" against dozens of top Go players from China, Japan, and Korea, winning 60 consecutive games without a single defeat. In May 2017, at the Wuzhen Go Summit in China, it played world Go champion Ke Jie and won the match 3 to 0. In the Go world, AlphaGo is recognized as stronger than the top human professionals. In the world professional Go ranking published on the GoRatings website, its score surpassed that of Ke Jie, the number one player in the ranking.

On May 27, 2017, after the human-machine match with Ke Jie, the AlphaGo team announced that AlphaGo would no longer take part in Go competitions. On October 18, 2017, the DeepMind team announced the strongest version of AlphaGo, named AlphaGo Zero.

They used a new form of reinforcement learning in which AlphaGo Zero becomes its own teacher. The system starts with a neural network that knows nothing about the game of Go and then plays games against itself by combining this neural network with a search algorithm. The approach behind AlphaGo Zero could be used in other areas to discover new knowledge about complex systems. Their work, published in Nature, suggests that deep learning techniques can help scientific research in other domains.

OpenAI Five

In 2017, an OpenAI bot beat some of the world's top Dota 2 players in 1v1 matches at The International (TI), the Dota 2 championship. In June 2018, OpenAI announced that its team of bots, OpenAI Five, had beaten amateur human teams in 5v5 games and that it aimed to beat top professional teams next. The following summary is based on OpenAI's blog.

Through self-play, OpenAI Five plays the equivalent of 180 years of games against itself every day. Training runs on 256 GPUs and 128,000 CPU cores and uses Proximal Policy Optimization, in a scaled-up version of the system OpenAI built for solo Dota 2 the year before. Using a separate LSTM for each hero and no human data, the model learns recognizable strategies. This suggests that reinforcement learning can yield long-term planning at a large but achievable scale, even without fundamental advances.
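To make Proximal Policy Optimization a little more concrete, here is a minimal sketch of its clipped surrogate objective in plain Python. This is not OpenAI Five's training code: the function name `ppo_clip_objective`, the clip range `epsilon=0.2`, and the sample numbers are our own assumptions for illustration.

```python
import math


def ppo_clip_objective(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """Average clipped surrogate objective over a batch (to be maximized).

    All names and the default epsilon are illustrative assumptions.
    """
    total = 0.0
    for new_lp, old_lp, adv in zip(new_log_probs, old_log_probs, advantages):
        # Probability ratio between the updated policy and the old policy.
        ratio = math.exp(new_lp - old_lp)
        # Keep the ratio inside [1 - epsilon, 1 + epsilon].
        clipped_ratio = max(1.0 - epsilon, min(1.0 + epsilon, ratio))
        # Take the more pessimistic of the unclipped and clipped terms.
        total += min(ratio * adv, clipped_ratio * adv)
    return total / len(advantages)


# The new policy doubles the probability of an action with positive advantage.
# The objective only credits the change up to the clip limit of 1.2.
print(ppo_clip_objective([math.log(2.0)], [0.0], [1.0]))
```

The clipping removes the incentive to move the policy far from the one that collected the data in a single update, which is one reason the method is practical at large scale.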

Self-driving cars


The UK company Wayve built what it describes as the first autonomous car that learns to drive with the help of reinforcement learning.

Using a deep reinforcement learning approach, they taught the car to drive in just 15-20 minutes. The driving policy is a deep neural network with 4 convolutional layers and 3 fully connected layers.

Electronic Commerce

Here we present some use cases from e-commerce companies such as Alibaba. The following figure is from a paper.

It shows a typical search session on Taobao. A user starts a session with a query and then has several actions to choose from: clicking into an item description, buying an item, turning to the next page, or leaving the session.
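The session structure described above can be sketched as an episode that ends when the user buys or leaves. The code below is only an illustration under our own assumptions: `run_session`, `patient_user`, the page limit, and the step limit are hypothetical and do not come from Alibaba's system.

```python
USER_ACTIONS = ("click", "buy", "next_page", "leave")


def run_session(choose_action, max_pages=5, max_steps=50):
    """Roll out one search session and return its (page, action) history.

    `choose_action(page)` is a hypothetical user model that returns one of
    USER_ACTIONS for the current result page.
    """
    history = []
    page = 1
    for _ in range(max_steps):
        action = choose_action(page)
        if action not in USER_ACTIONS:
            raise ValueError(f"unknown action: {action!r}")
        history.append((page, action))
        if action in ("buy", "leave"):
            break  # both actions end the session
        if action == "next_page":
            if page == max_pages:
                break  # assumed: no more result pages, so the session ends
            page += 1
        # "click" opens an item description and returns to the same page
    return history


def patient_user(page):
    """Hypothetical user who browses to page 3 and then buys."""
    return "buy" if page == 3 else "next_page"


print(run_session(patient_user))
```

In a reinforcement learning formulation, each page view would be one step of the episode, and the search engine would choose how to rank the items shown at that step.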

Product recommendation

In the recommendation scenario, Alibaba uses deep reinforcement learning and adaptive online learning to build a decision engine that is continuously trained and optimized, analyzing massive amounts of user behavior and tens of billions of product features in real time. This helps each user find the right product quickly and improves the matching efficiency between users and products. Algorithm performance metrics reportedly improved by 10% to 20%.

Customer service

In intelligent customer service, customer service robots such as Ali Xiaomi act as agents of the delivery engine and need decision-making capabilities. Their decisions are not based on the immediate benefit of a single step, but on a relatively long-term process of human-computer interaction. The interaction between consumers and the platform is modeled as a Markov decision process within a reinforcement learning framework. This establishes a closed-loop system in which consumers interact with the system and the system's decisions maximize the benefit over the whole process, achieving a dynamic balance between the system and the user.
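The difference between the benefit of a single step and the benefit of the whole interaction can be made concrete with the discounted return. The sketch below is a generic illustration. The reward sequences and the discount factor `gamma=0.9` are made-up assumptions, not data from any real customer service system.

```python
def discounted_return(rewards, gamma=0.9):
    """Sum of rewards, each discounted by gamma for every step of delay."""
    total = 0.0
    for reward in reversed(rewards):
        total = reward + gamma * total
    return total


# Hypothetical dialogues: one grabs a small reward immediately,
# the other resolves the customer's problem two steps later.
quick_reply = [1.0, 0.0, 0.0]
full_resolution = [0.0, 0.0, 5.0]

print(discounted_return(quick_reply))
print(discounted_return(full_resolution))
```

An agent that maximizes the discounted return prefers the second dialogue even though its first step earns nothing.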


Robotics is another area where deep reinforcement learning can be applied.

Robotic arms

The following figure shows the scene of collective training of robotic arms using deep reinforcement learning algorithms.

In this experiment, each robotic arm started by practicing the door-opening skill on the specific door position and orientation that a human instructor had demonstrated. As the arm gets better at the task, the instructor begins to change the position and orientation of the door so that it is slightly beyond the current capability of the policy, but not so difficult that the policy fails completely. This allows the robots to gradually improve their skill level over time and expand the range of situations they can handle. The combination of human guidance and trial and error allowed the robots to collectively learn how to open the door in just a few hours. Since the robotic arms were trained on doors that looked different from each other, the final policy succeeded on a door with a handle that none of the robotic arms had seen before.
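The idea of keeping the task just beyond the current ability of the policy can be sketched as a simple difficulty schedule. In the experiment a human instructor made these adjustments. The function below is only a hypothetical automated stand-in, and its name, thresholds, and step size are our own assumptions.

```python
def next_difficulty(current, success_rate, step=0.1, promote_at=0.8, demote_at=0.2):
    """Return the task difficulty (0.0 to 1.0) for the next round of practice.

    Hypothetical rule: make the task harder once the policy succeeds often,
    and easier if it fails almost every time.
    """
    if success_rate >= promote_at:
        return min(1.0, current + step)
    if success_rate <= demote_at:
        return max(0.0, current - step)
    return current


# Example schedule driven by made-up success rates from four practice rounds.
difficulty = 0.0
for rate in (0.9, 0.85, 0.5, 0.1):
    difficulty = next_difficulty(difficulty, rate)
    print(round(difficulty, 2))
```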

The general guide for deep RL projects

In this section, we present the general guide for this book, designed so that you gain as much hands-on experience with deep reinforcement learning as possible. We lay it out as a step-by-step explanation of how to solve a problem. It can also serve as a general guide for most problems you have already encountered or will face in the future.

General steps

Step 1: Make sure you understand the problem clearly

Step 2: Design a (PO)MDP for the problem

Step 3: Try the methods you have already learned and examine the results

Step 4: Based on the results, find tricks and theoretical results to deal with the sub-problems

Step 5: Combine tricks and rigorous techniques to solve the problem

General tools

Here we introduce important tools for building deep reinforcement learning agents. For a detailed introduction, you can read the corresponding part in the appendix.

TensorFlow 2.0

TensorFlow is an open-source machine learning library for research and production. As of early 2019, it is the most popular deep learning framework on GitHub, with 123,589 stars and 73,108 forks on its project.

At the end of 2018, the Google TensorFlow team announced the roadmap for TensorFlow 2.0. We assume you have some experience in using TensorFlow to implement deep learning models such as CNNs, ResNets, LSTMs, GRUs, or even attention mechanisms. Compared with TF 1.x, TF 2.0 changes several things, most notably eager execution by default and tf.keras as the recommended high-level API.

We also want to introduce Google Colab, an easy way to learn and use TensorFlow. We can use Colab to practice deep reinforcement learning algorithms with free computing resources. In the following chapters, we will go through the usage of Colab.

You can visit the TensorFlow website to learn more about TensorFlow, and the Colab website to learn how to use Colab.


Ray

Ray is a high-performance distributed execution framework targeted at large-scale machine learning and reinforcement learning applications. It comes from the UC Berkeley RISELab, whose teams have built many successful frameworks for big data and machine learning.

RISE stands for Real-time Intelligence with Secure Explainable decisions, which is important for building systems in the AI era. The lab's premise is: "Sensors are everywhere. ... AI is for real. ... The world is programmable." We expect Ray to become an important framework for building distributed AI systems. We will therefore introduce Ray so that readers gain a basic understanding of its usage, and we will try to build multi-agent reinforcement learning agents with it.

Visit the Ray project website to learn more about Ray.

OpenAI gym

As mentioned above, OpenAI Gym has been pushing research in deep reinforcement learning forward since its release, and it is one of OpenAI's great contributions. In OpenAI's words, "Gym is a toolkit for developing and comparing reinforcement learning algorithms." We can use OpenAI Gym to design new environments for different problems and integrate them with any numerical computation library, such as TensorFlow or PyTorch, as well as with a distributed computing framework such as Ray.
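To show what designing a new environment looks like, here is a minimal sketch of a class that follows the classic Gym convention, where `reset()` returns an observation and `step(action)` returns an `(observation, reward, done, info)` tuple. To stay self-contained, it deliberately does not import Gym. The environment `CorridorEnv`, its rewards, and the helper `run_episode` are hypothetical examples of ours.

```python
class CorridorEnv:
    """Toy 1-D corridor: start at cell 0, the episode ends at the last cell."""

    def __init__(self, length=5):
        self.length = length
        self.position = 0

    def reset(self):
        """Start a new episode and return the initial observation."""
        self.position = 0
        return self.position

    def step(self, action):
        """Apply an action (0 = left, 1 = right)."""
        if action not in (0, 1):
            raise ValueError(f"invalid action: {action!r}")
        move = 1 if action == 1 else -1
        # Walls at both ends keep the agent inside the corridor.
        self.position = min(self.length - 1, max(0, self.position + move))
        done = self.position == self.length - 1
        reward = 1.0 if done else -0.1  # assumed reward scheme
        return self.position, reward, done, {}


def run_episode(env, policy, max_steps=100):
    """Run one episode with `policy(observation) -> action`; return total reward."""
    observation = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        observation, reward, done, _ = env.step(policy(observation))
        total_reward += reward
        if done:
            break
    return total_reward


print(run_episode(CorridorEnv(length=4), lambda observation: 1))
```

Because the agent only talks to the environment through `reset` and `step`, the same training loop can be reused across very different problems.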

You can learn more on the Gym project website.


The projects we will build with the tools above are as follows:

  • Developing Grid Environment: In this project, we will build one of the most classical reinforcement learning tasks, the grid world, which is also a common benchmark for deep reinforcement learning. We will solve this task with different reinforcement learning algorithms: dynamic programming (policy and value iteration), Monte Carlo, and temporal-difference methods. Through this single task, readers will learn the core algorithms of both tabular and deep reinforcement learning.

  • Playing Atari games using improved deep Q-learning methods: In this project, we will start to use the Gym environment and train an agent to play Atari games. Atari games are a collection of interesting video games, and their image input requires us to combine CNNs with our RL algorithms. DQN and other techniques will be introduced to build this agent.

  • Building a continuous deep RL agent to control MuJoCo robots: In this project, we will learn how to train an agent to control robots in a simulated environment. MuJoCo is a physics simulator with a Gym wrapper, and MuJoCo tasks are typical continuous control reinforcement learning problems. We will solve this task with several policy gradient algorithms such as TRPO and PPO, and also with evolution strategies.

  • Building powerful RL agents to play Montezuma’s Revenge: In this project, we will learn how to train an agent to play Montezuma’s Revenge. To solve it, we will introduce effective methods such as Go-Explore, hierarchical RL, and imitation learning. Solving this hard problem is a sign that a method's exploration ability is becoming competitive enough for real-world problems.

  • Solving the general-soldier control problem with Multi-Agent RL: This is a project for controlling multiple agents in an environment that is both competitive and cooperative. We will introduce multi-agent system concepts and basic methods, along with useful designs for constructing a framework to solve the general-soldier control problem.

  • Building a Deep RL dialogue model with ACER: In this project, we will build a Deep RL dialogue model. We will introduce the development of dialogue generation, especially with RL techniques, and implement an agent that learns from interactions and generates reasonable responses. Finally, we discuss algorithms like ACER in depth, which can produce more interactive and more natural conversations in dialogue simulation.

  • Building an agent for playing the RTS game StarCraft II: StarCraft II is a classic real-time strategy (RTS) game in which players need good control at both the strategic and tactical levels. There has been much recent progress in RTS game AI. We use this project to test the RL control algorithms in our book, so that readers can practice with the methods and see the potential of RL in game playing.

  • Building a self-driving agent with deep RL: Self-driving problems can be approached with Deep RL methods, and DeepDrive is an interesting self-driving environment. We use this project to test the RL control algorithms in our book, so that readers can practice with the methods and see the potential of RL in self-driving cars.
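To give a flavor of the first project in the list above, the sketch below runs value iteration, a dynamic programming method, on a tiny grid world. The grid size, the reward of -1 per move, and the function names are our own assumptions for illustration. The actual project may set the task up differently.

```python
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def next_state(state, move, rows, cols):
    """Deterministic transition: moving into a wall leaves the agent in place."""
    row = min(rows - 1, max(0, state[0] + move[0]))
    col = min(cols - 1, max(0, state[1] + move[1]))
    return (row, col)


def value_iteration(rows, cols, goal, gamma=1.0, tol=1e-9):
    """Compute optimal state values when every move costs -1 until the goal."""
    values = {(r, c): 0.0 for r in range(rows) for c in range(cols)}
    while True:
        delta = 0.0
        for state in values:
            if state == goal:
                continue  # terminal state keeps value 0
            # Bellman optimality backup over the four moves.
            best = max(
                -1.0 + gamma * values[next_state(state, move, rows, cols)]
                for move in MOVES
            )
            delta = max(delta, abs(best - values[state]))
            values[state] = best
        if delta < tol:
            return values


values = value_iteration(rows=3, cols=3, goal=(2, 2))
for r in range(3):
    print([values[(r, c)] for c in range(3)])
```

With these assumptions, each state's value is the negative of the number of moves needed to reach the goal, so acting greedily with respect to the values gives a shortest path.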

In each project, you will learn the deep reinforcement learning methods used to tackle the problems that project poses. The difficulty grows step by step from project to project, bringing in more advanced and complicated algorithms. The techniques you learn can then be applied to real-world problems.


So far, we have surveyed these fascinating topics in deep reinforcement learning and the organization of this book. Now let us jump into the first project and build your first working deep reinforcement learning agent with Python and TensorFlow 2.0.
