Gorila: Google Reinforcement Learning Architecture
Today (Sat 11th July) is the second day of the Deep Learning workshop here at ICML 2015 in Lille. I liked one session in particular as I think it offers a good glimpse of how the industrialisation of Machine Learning is shaping up. Or, put another way, of the white-hot nexus where experimental ML meets practical software engineering.
Gorila (General Reinforcement Learning Architecture)
The talk today was by David Silver, Head of the Reinforcement Learning team at Google DeepMind, and Arun Nair from the Applied team at DeepMind, on Gorila (General Reinforcement Learning Architecture). This separation of concerns (research vs. application) makes a lot of sense when productionising Machine Learning - you wouldn't ask a pure research scientist to make their code production-ready - so the teams are stratified from research through to production.
Silver gave a keynote on the same topic at ICLR in San Diego in May, but as you'll see from those slides that talk focused more on the general benefits of Reinforcement Learning (RL) as opposed to Gorila itself.
In summary, Gorila looks to be a generalisation of Jeff Dean et al.'s well-known DistBelief from feed-forward supervised learning to reinforcement learning.
What is Reinforcement Learning?
From an ML perspective, Reinforcement Learning has some nice properties over supervised learning, but it is also harder to implement successfully. Supervised or semi-supervised learning hinges on two key properties in particular:
- Having well-formed labelled data to tell your network in the training phase whether it got the task right or wrong. Labelled data is often hard to get and expensive to scale as we need humans to do the labelling.
- Having a well-defined Teacher module, which in turn requires constructing your objective function in such a way that a Teacher module is even possible.
In unsupervised learning, there are no labels and no teacher. The most common example of unsupervised learning would be clustering a data set based on properties intrinsic to the data that the algorithm can infer without outside help.
Reinforcement Learning (RL) is different again from both supervised and unsupervised learning. RL asks a different question: can the network figure out how to take one or more actions now to achieve a reward or payout that may be far off, i.e. t steps in the future? This delayed-reward scenario is much harder to train for: we may have a large number of t steps to count back from, and we also need to solve the credit assignment problem, whereby multiple actions chosen by the network combine to realise the goal. There is no teacher module and very little labelled data - we only need to be able to measure the effect of actions on the environment.
Mathematically, the network is asked to learn the policy that achieves the best outcome by picking the best action for each state of the environment the actor finds itself in - i.e. to solve the Q-learning / Bellman optimality equation (derived from dynamic programming).
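To make the Bellman idea concrete, here is a minimal tabular Q-learning sketch - a toy illustration of the update rule, not the deep, distributed Q-network Gorila actually uses. The tiny MDP (4 states, 2 actions) and the learning-rate/discount values are my own illustrative choices.

```python
import numpy as np

# Hypothetical tiny MDP: states 0..3, actions 0..1 (illustrative only).
n_states, n_actions = 4, 2
Q = np.zeros((n_states, n_actions))
alpha, gamma = 0.5, 0.9  # learning rate and discount factor (assumed values)

def q_update(s, a, r, s_next):
    """One Bellman-style update: move Q(s, a) toward r + gamma * max_a' Q(s', a')."""
    target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

# A single observed transition: action 1 in state 0 yields reward 1.0,
# landing in state 2. The delayed-reward structure comes from gamma
# propagating value backwards through many such updates.
q_update(0, 1, 1.0, 2)
```

Iterating this update over many transitions is what lets value propagate back across the t steps between an action and its eventual payout.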
The easiest way to visualise this is by playing video games - which is also a great way to ensure copious coverage in the mainstream media :) The DeepMind team train their networks on 49 games from the Atari 2600 - Seaquest, Tennis, Boxing etc.
Figure 1 A schematic from David Silver's ICML 2015 presentation on Reinforcement Learning. Used with permission.
What's interesting for me about Gorila is how much it feels like MapReduce from Dean et al. or BigTable from Chang et al. In both of those cases, a hard problem (using a heterogeneous compute cluster efficiently, storing and querying very large data sets) was solved by a new framework designed right from inception to scale to levels not previously encountered.
The four key components of Gorila are:
- Actor (there are many of these and they correspond to video game players, users of a service etc.)
- Replay memory (this was a key insight to improve performance of the RL system and enable the Q learning task to be learned)
- Learner (parallelised - so can generate many more gradients than the previous iteration)
- The Q network or model itself (distributed using DistBelief - capacity to process many more networks in parallel)
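Of the four components, the replay memory is the one called out above as a key insight, so here is a minimal sketch of one. The class name, capacity, and dummy transitions are my own assumptions for illustration - the real system stores experience at much larger scale across many actors.

```python
import random
from collections import deque

class ReplayMemory:
    """Sketch of an experience replay buffer (illustrative, not Gorila's implementation)."""

    def __init__(self, capacity=10000):
        # Oldest transitions are discarded automatically once full.
        self.buffer = deque(maxlen=capacity)

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform random sampling breaks the strong correlation between
        # consecutive transitions, which is what stabilises Q-learning.
        return random.sample(self.buffer, batch_size)

memory = ReplayMemory(capacity=100)
for t in range(50):
    memory.add(t, t % 2, 1.0, t + 1)  # dummy transitions for demonstration
batch = memory.sample(4)  # a learner would compute gradients from this batch
```

In the Gorila design, many actors write transitions into replay memory while parallel learners sample from it to generate gradients, which is what decouples data collection from training.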
The implications of all of this are fairly obvious but worth noting nonetheless for their importance:
The team reported very significant speedups in performance and wall-clock training time: v2 beat v1 (the Nature DQN) on 41 out of 49 Atari games - by more than 2x on 22 games and more than 5x on 11 - and on 25 games it exceeded human-level play. Training time is reduced from ~2 weeks to ~1 day. So the speed of iteration from fairly basic research to iteration two is fast (less than one year by my reckoning).
Google are building the same infrastructure around ML as they have around other problems (Gorila is to Reinforcement Learning as MapReduce is to task parallelisation as BigTable is to data storage). History then tells us that sooner or later there will be an open-source equivalent, and eventually using reinforcement learning will be commonplace in software (whereas today it is esoteric, even within the deep learning community itself).
Reinforcement Learning has the potential for far wider practical usage than just video games, and systems like Gorila pull it towards this future. If you imagine that we are all actors or protagonists inside Google services like YouTube, AdWords etc., then it becomes quite realistic to use RL to pick the best ads to show me, recommend new content for me to watch, and so on.
Finally, the paper that Silver and Nair referenced during the talk is now available on arXiv!
Thanks to David Silver for constructive feedback on this blog post.