Department of Computer Science

Graduate School of Information Science and Technology, University of Tokyo


Kamil Rocki, Ph.D.

I graduated from the University of Tokyo in 2011 and moved to Silicon Valley. At the moment, I am a research scientist in the Machine Intelligence group at IBM Research. I am often asked to classify what I do as software or hardware research, but to me computer science means both. Therefore, I work on ideas that bring artificial intelligence, hardware, and software together.

twitter: @rockikamil

github: krocki

Nintendo Learning Environment

I wrote a minimal, performance-oriented Nintendo Gameboy emulator in C (~500 LOC) with 0 dependencies. The goals of this project are:

1. provide an alternative to existing research datasets in Reinforcement Learning

2. provide much better performance (frames/second) than other emulators

3. give access to a platform with a variety of games (target meta- or transfer learning research)

4. enable modifications due to its simplicity

More details on NLE

My experiments with RL algorithms


Hardware-accelerated reinforcement learning

My most recent work has focused on accelerating reinforcement learning using reconfigurable hardware. RL research suffers from the fact that data is not as readily available as in other domains. Computer games are widely used as a platform to test new ideas. However, simulating more complex games such as Mario or Prince of Persia takes a considerable amount of time (in terms of the number of frames collected to finish).

I implemented an entire Gameboy-compatible processing system inside an FPGA, which allowed a much higher rate of observations per second (~200x speedup compared to a multi-core Xeon). The recording below shows Tetris running at about 60x the original speed, which is approximately 1/5th of the maximum framerate. In addition, one FPGA node can accommodate 50 Gameboy instances.

As a part of a larger system: the IBM Neural Computer houses over 1000 FPGAs and provides orders of magnitude more power for data generation.

An example of a network trained using the distributed Asynchronous Advantage Actor-Critic (A3C) algorithm. Pacman is a relatively simple problem, as rewards come often and are not delayed significantly.

Reconfigurable Supercomputer

RISC-V is an open-source instruction set architecture from UC Berkeley. I implemented a mini-supercomputer inside an FPGA which, depending on the instruction set and resources, houses from 10 to nearly 500 general-purpose 32-bit RISC-V cores. Each core is capable of executing its own program in a MIMD fashion. One can program RISC-V using C/C++ or RV assembly.

Learning algorithms in hardware

Recent work on algorithms that learn algorithms (Neural Turing Machines, Differentiable Neural Computer, Neural Programmer Interpreter, ...) has shown that it is possible to learn simple algorithmic tasks such as copy or sort directly from data. I used this principle to test whether it is possible to implement a simple CPU-like learning system with external memory and registers directly in hardware. I used a custom mini instruction set (set, load, store, halt, add, compare) and learned algorithms such as copy, reverse copy, and bubble sort directly in FPGA logic using Genetic Algorithms. So the answer is: yes! Moreover, the hardware implementation is up to 1000x faster than the Python code.
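To make the idea concrete, here is a minimal software sketch of an interpreter for a six-instruction machine like the one described above, together with a fitness function of the kind a genetic algorithm could use to score candidate copy programs. The exact semantics are assumptions made for illustration and are not taken from the hardware design: `compare` is assumed to skip the next instruction when the two registers differ, the program counter is assumed to wrap around to the start, and the register layout and the names `run`, `COPY` and `copy_fitness` are hypothetical.

```python
def run(program, memory, registers, max_steps=1000):
    """Execute `program` until `halt` or until `max_steps` instructions have run.

    Each instruction is a tuple (op, a, b). Semantics are assumed, not official.
    """
    mem = list(memory)
    reg = list(registers)
    pc = 0
    for _ in range(max_steps):
        op, a, b = program[pc]
        pc = (pc + 1) % len(program)          # assumed: the program counter wraps
        if op == "halt":
            break
        elif op == "set":                     # reg[a] = immediate b
            reg[a] = b
        elif op == "load":                    # reg[a] = mem[reg[b]]
            reg[a] = mem[reg[b] % len(mem)]
        elif op == "store":                   # mem[reg[b]] = reg[a]
            mem[reg[b] % len(mem)] = reg[a]
        elif op == "add":                     # reg[a] += reg[b]
            reg[a] += reg[b]
        elif op == "compare":                 # assumed: skip next instruction if different
            if reg[a] != reg[b]:
                pc = (pc + 1) % len(program)
    return mem, reg


# Hand-written copy loop: r0 = source pointer, r1 = destination pointer,
# r2 = constant 1, r3 = scratch, r4 = end of source.
COPY = [
    ("compare", 0, 4),   # source pointer reached the end?
    ("halt", 0, 0),      # only executed when the compare matched
    ("load", 3, 0),      # r3 = mem[r0]
    ("store", 3, 1),     # mem[r1] = r3
    ("add", 0, 2),       # r0 += 1
    ("add", 1, 2),       # r1 += 1
]


def copy_fitness(program, data):
    """Number of cells correctly copied into the second half of memory."""
    n = len(data)
    memory = list(data) + [0] * n
    registers = [0, n, 1, 0, n]
    mem, _ = run(program, memory, registers)
    return sum(1 for i in range(n) if mem[n + i] == data[i])
```

A genetic algorithm would start from random instruction tuples instead of the hand-written `COPY` program and keep the candidates with the highest `copy_fitness`; the `max_steps` limit ensures that non-terminating candidates still receive a score.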

Applying Reinforcement Learning to discovering hardware architectures

I defined a problem of copying data as follows: given memory A and memory B, maximize the performance (minimize the time) of moving the data from A to B. I wrote a simple program which moves data and defines the reward associated with that. The trick is that the underlying hardware needs to learn to copy by receiving feedback from the program. Initially it is just a collection of randomly connected gates, and it evolves into a working circuit. By following this approach, I was able to train hardware to perform the copy at much higher efficiency than with traditional methods.

The problem is defined as follows: Reward = latency * correctness

Architecture evolves from thousands of runs

Maximizing the frequency through RL

PYNQ-Z1 used for testing

Content Addressable Memory

I used Autoencoders, VAEs, and GANs to learn a low-dimensional manifold of perceived data in order to perform inverse lookups in large memory spaces. This can be thought of as a neural hash table that allows rapid localization of content based on the content itself.
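The lookup idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: a fixed random linear projection stands in for the trained encoder (it is not an autoencoder, VAE, or GAN), the nearest stored code is found by a plain linear scan, and the class and method names are hypothetical.

```python
import random


class ContentAddressableMemory:
    """Toy content-addressable memory: items are located by their low-dimensional code."""

    def __init__(self, input_dim, code_dim=2, seed=0):
        rng = random.Random(seed)
        # Stand-in encoder: a fixed random projection, NOT a trained model.
        self.weights = [[rng.gauss(0.0, 1.0) for _ in range(input_dim)]
                        for _ in range(code_dim)]
        self.items = []   # items[i] is the content stored at address i
        self.codes = []   # codes[i] is the code of items[i]

    def encode(self, item):
        """Map an item to its low-dimensional code."""
        return [sum(w * x for w, x in zip(row, item)) for row in self.weights]

    def write(self, item):
        """Store an item and return the address it was written to."""
        self.items.append(list(item))
        self.codes.append(self.encode(item))
        return len(self.items) - 1

    def locate(self, query):
        """Inverse lookup: return the address whose code is closest to the query's code."""
        q = self.encode(query)

        def squared_distance(address):
            return sum((a - b) ** 2 for a, b in zip(self.codes[address], q))

        return min(range(len(self.codes)), key=squared_distance)
```

The linear scan in `locate` keeps the sketch short; a practical system would need an index over the codes to make the lookup fast.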

Surprisal-Driven LSTM

Improved the state of the art on character-level text prediction by adding surprisal information as input to an LSTM network.

K. Rocki, T. Kornuta, T. Maharaj, Surprisal-Driven Zoneout, NIPS 2016, Continual Learning and Deep Networks Workshop
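As a rough illustration, surprisal is the negative log-probability that the previous prediction assigned to the symbol that actually occurred, and it can be appended to the next input vector. The sketch below only shows this bookkeeping and assumes the predictions are already available; the LSTM itself is omitted, and the function names and input layout are assumptions rather than the published implementation.

```python
import math


def surprisal(predicted_probs, observed):
    """Negative log-probability assigned to the symbol that actually occurred."""
    return -math.log(predicted_probs[observed])


def inputs_with_surprisal(symbols, predictions, vocab_size):
    """Build per-step inputs: one-hot symbol plus one extra surprisal feature.

    predictions[t] is assumed to be the distribution the model emitted for
    symbols[t] before observing it.
    """
    rows = []
    for t, sym in enumerate(symbols):
        one_hot = [0.0] * vocab_size
        one_hot[sym] = 1.0
        rows.append(one_hot + [surprisal(predictions[t], sym)])
    return rows
```

A confident but wrong prediction yields a large surprisal value, which gives the network an explicit signal of how badly it was just mistaken.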

K. Rocki, Surprisal-Driven Feedback in Recurrent Networks, arXiv:1608.06027

K. Rocki, Recurrent Memory Array Structures, arXiv:1607.03085

Low-level parallel primitives

Autotuning of various operations performed on parallel hardware using LLVM: GITHUB

DARPA funded 'Saccadic Vision Project'

Our eyes move rapidly in order to shift the narrow, high-acuity field of view, called the fovea, to a different object. These movements are called saccades and are controlled based on a given task and the saliency (information content) of an object. In order to classify and track objects more efficiently, one could envision learning sequences of tiny patches which represent the fovea. This work coupled a Restricted Boltzmann Machine with an LSTM to do just this and added Reinforcement Learning on top to further improve efficiency.

Parallel Monte-Carlo Tree Search

I used the Tokyo-based TSUBAME 2.0 supercomputer to run massively parallel Monte Carlo Tree Search utilizing 2048 CPUs and 256 GPUs. MCTS is not well suited for SIMD hardware, so an efficient implementation is far from trivial. For details, see my PhD Thesis.
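One common way to parallelize MCTS is root parallelization: independent workers each build their own tree, and their root statistics are merged at the end. The sketch below shows only that merge step. It is a generic illustration and not necessarily the scheme used in the thesis; the function names and the dictionary format of the per-worker visit counts are assumptions.

```python
from collections import Counter


def merge_root_visits(worker_visit_counts):
    """Sum the per-move visit counts reported by independent search workers."""
    total = Counter()
    for counts in worker_visit_counts:
        total.update(counts)   # Counter.update adds counts instead of replacing them
    return total


def best_move(worker_visit_counts):
    """Pick the move with the highest combined visit count."""
    total = merge_root_visits(worker_visit_counts)
    return max(total, key=total.get)
```

For example, `best_move([{"a": 10, "b": 5}, {"a": 3, "b": 20}])` selects `"b"`, because its combined count (25) exceeds that of `"a"` (13).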