2018 -                | Nintendo Learning Environment

2018 -                | GAMEBOY Supercomputer

2017 -                | AISC - Adaptive Instruction Set Computer

2016 - 2018      | Large Scale Content Addressable Memory

2016                  | L2P2 - Low Level Parallel Primitives

2015 - 2016      | Recurrent Memory Array Structures

2015 - 2016      | DARPA Saccadic Vision Project

2013 - 2014      | Hierarchical Temporal Memory

2012 - 2013      | Parallel Stochastic Local Optimization

2008 - 2011      | Large Scale Parallel Monte Carlo Tree Search

2006 - 2008      | Real-time data processing on GPU

2005 - 2007      | Robotic Arm Manipulation using ARTMAP neural network

2018 - | Nintendo Learning Environment


New environment for accelerating research in AI, targeting transfer and meta learning

Fastest game emulator in the world (CPU, GPU and FPGA versions), with support for over 1,000 games in total

Visualization of the 3D manifold created by running Mario and Tetris through a Variational Autoencoder (a fully unsupervised approach). The Nintendo environment is designed to run experiments on multiple games at once

More details on manifolds

Tetris game reconstructed from the internal world representation (as in Ha et al. 2018); this enables 'artificial imagination'

2018 - | GAMEBOY Supercomputer


Featured on the main page of VICE Motherboard on Dec 19th, 2018

A $2M FPGA-based supercomputer, 1296 nodes connected in a 3D mesh, co-designed HW + SW

1B+ frames per second aggregate, over 200x speedup vs Xeon CPU

Demonstrated practical use with an existing RL algorithm, the distributed asynchronous advantage actor-critic (A3C)
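For readers unfamiliar with A3C, the "advantage" in its name is the gap between the discounted return a worker actually observed and the critic's value estimate. The sketch below shows that computation for one rollout segment. It is illustrative only and not the code that ran on the system; the function name, argument names and the default discount factor are assumptions.

```python
def n_step_advantages(rewards, values, bootstrap_value, gamma=0.99):
    """Return (returns, advantages) for one rollout segment.

    rewards[t] and values[t] are the reward and critic estimate at step t.
    bootstrap_value is the critic estimate for the state after the last
    step, or 0.0 if the episode ended there.
    """
    returns = []
    running = bootstrap_value
    # Accumulate discounted returns from the last step backwards.
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    # Advantage: how much better the observed return was than predicted.
    advantages = [g - v for g, v in zip(returns, values)]
    return returns, advantages
```

In A3C, each asynchronous worker computes these quantities on its own rollouts and uses the advantages to weight the policy gradient, while the returns serve as regression targets for the critic.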

Trained using the system: Pacman (LEFT), Mario (RIGHT)

2017 - | AISC - Adaptive Instruction Set Computer

A RISC-V CPU architecture has been modified to give the CPU an ability to shape its own architecture, based on the reward collected from the environment (correct and fast execution)

10x faster than Xeon at 1/100th the power and footprint

Architecture evolves from thousands of runs

Self-evolved architecture is capable of learning simple algorithmic tasks (copy, sort)

2016 - 2018 | Large Scale Content Addressable Memory

High-dimensional data is mapped into physical 3D memory space (GAN-based) and used with a memory-augmented neural network in order to provide very fast key-value lookups

An illustration of real-time encoding and decoding data location in latent 3D space

2016 | L2P2 - Low Level Parallel Primitives

Automatic generation of machine code based on high-level structures and LLVM backends

A programmer specifies a general description (LEFT), and the program is executed millions of times to find the best version (RIGHT: auto-tuning)
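The auto-tuning idea can be sketched in a few lines: run every candidate variant, measure its cost, and keep the cheapest. This is a toy illustration in Python, not the L2P2 implementation (which generates machine code through LLVM); `autotune`, `sum_loop` and `sum_builtin` are hypothetical names, and wall-clock time is assumed as the default cost.

```python
import timeit


def autotune(candidates, args, cost=None, repeats=5):
    """Return (name, cost) of the cheapest candidate.

    candidates: dict mapping a variant name to a callable.
    cost: optional function (callable, args) -> float. When omitted,
          best-of-N wall-clock time is used.
    """
    if cost is None:
        def cost(fn, fn_args):
            # Best-of-N timing reduces noise from other processes.
            return min(timeit.repeat(lambda: fn(*fn_args), number=1, repeat=repeats))
    measured = {name: cost(fn, args) for name, fn in candidates.items()}
    best = min(measured, key=measured.get)
    return best, measured[best]


def sum_loop(xs):
    # Variant 1: explicit accumulation loop.
    total = 0
    for x in xs:
        total += x
    return total


def sum_builtin(xs):
    # Variant 2: delegate to the built-in.
    return sum(xs)


if __name__ == "__main__":
    data = list(range(100_000))
    print(autotune({"loop": sum_loop, "builtin": sum_builtin}, (data,)))
```

Passing a custom `cost` function makes the search deterministic, which is handy for testing; a real auto-tuner would typically also search over parameters such as tile sizes or unroll factors.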

2015-2016 | Recurrent Memory Array Structures

Improvements to existing models: stochastic memory array (LEFT), surprisal-driven zoneout (RIGHT)

Link to the paper shown at NIPS 2016

Illustration of zoneout driven by misprediction. LEFT: information content (surprisal) while predicting Wikipedia text. RIGHT: high surprisal is used to gate the memory cell, possibly sparsifying usage during predictable passages
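The gating idea in this caption can be sketched as follows: surprisal is the negative log-probability the model assigned to the symbol that actually occurred, and the memory cell is only overwritten when that value is high. This is a simplified illustration and not the formulation from the paper; the hard threshold and the function names are assumptions of this sketch.

```python
import math


def surprisal(predicted_probs, observed_index):
    """Information content, in nats, of the symbol that actually occurred."""
    return -math.log(predicted_probs[observed_index])


def zoneout_step(prev_cell, candidate_cell, s, threshold):
    """Choose the next memory cell state given surprisal s.

    Low surprisal (a predictable input): keep the previous cell state,
    i.e. "zone out". High surprisal: accept the candidate update.
    The hard threshold is an assumption made for clarity.
    """
    if s < threshold:
        return list(prev_cell)
    return list(candidate_cell)
```

With this rule, long predictable stretches of text leave the cell untouched, which is one way to obtain the sparse update pattern described above.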

Visualization of activity in the LSTM network, showing induced sparsity in activations: no zoneout (LEFT), with zoneout (RIGHT)

Hidden state

Memory cell

It also works by objective measures, improving on the state of the art (SOTA)

2015-2016 | DARPA Saccadic Vision Project

Image recognition using sequences of 2D patches as inputs

Observation trajectory depends on the task (Yarbus 1967)

Unsupervised learning of spatial pattern and image saliency map

Unsupervised training using RBM and LSTM builds a predictive model (marked as SDR/memory sequence in the animation), followed by RL-based fine-tuning for classification (probs)

2013-2014 | Hierarchical Temporal Memory

This is a demo of my implementation of HTM running over a million neurons and tens of millions of synapses. Left: prediction task, Right: visualization of activations in real time

My work has been used as a flagship project of the Machine Intelligence Group at IBM Research

2012-2013 | Parallel Stochastic Local Optimization

Traveling Salesman Problem (TSP) is a known NP-hard problem in combinatorial optimization

I built the fastest GPU-based TSP Solver
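For context, local optimization on TSP repeatedly applies small tour edits that shorten the route. The sketch below uses the classic 2-opt move (swap two edges by reversing the segment between them) as a representative example; that the solver relies on this particular move is an assumption here, and this sequential Python version only illustrates the logic that a GPU version would evaluate in parallel. Function names are hypothetical.

```python
import math


def tour_length(points, tour):
    """Total length of the closed tour visiting points in the given order."""
    n = len(tour)
    return sum(math.dist(points[tour[i]], points[tour[(i + 1) % n]]) for i in range(n))


def two_opt(points, tour):
    """Apply improving 2-opt moves until none remains (a local optimum)."""
    tour = list(tour)
    n = len(tour)
    improved = True
    while improved:
        improved = False
        for i in range(n - 1):
            for j in range(i + 2, n):
                if i == 0 and j == n - 1:
                    continue  # these two edges share a city
                a, b = points[tour[i]], points[tour[i + 1]]
                c, d = points[tour[j]], points[tour[(j + 1) % n]]
                # Change in length if edges (a,b),(c,d) become (a,c),(b,d).
                delta = math.dist(a, c) + math.dist(b, d) - math.dist(a, b) - math.dist(c, d)
                if delta < -1e-12:
                    tour[i + 1:j + 1] = tour[i + 1:j + 1][::-1]
                    improved = True
    return tour
```

Every candidate pair `(i, j)` can be scored independently, which is what makes this kind of search a natural fit for massively parallel hardware.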

Presented my work at multiple venues, including the Supercomputing conference, IPDPS, GECCO and GTC

I published an open-source version of the solver, which runs on CPU and GPU (GitHub link)

Proposed many algorithmic innovations applicable to a wide range of irregular problems, a poster from Supercomputing 2012:

An issue of TSUBAME journal with my work

HPC Wire article on my research

2008-2011 | Large Scale Parallel Monte Carlo Tree Search

PhD Thesis

1. A multi-GPU enhanced version of highly irregular MCTS

2. Used TSUBAME2 supercomputer (2048 CPUs + 256 GPUs, 3M threads)

3. Self-play based learning
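As background on why MCTS is irregular, each iteration walks down the tree by picking the child that best balances a good win rate against being under-explored. A common rule for this is UCT (UCB1 applied to trees), sketched below. Whether the thesis uses exactly this rule or this exploration constant is an assumption; the function names are hypothetical.

```python
import math


def uct_score(child_wins, child_visits, parent_visits, c=math.sqrt(2)):
    """UCB1 score: average reward plus an exploration bonus."""
    if child_visits == 0:
        return math.inf  # always try unvisited children first
    exploit = child_wins / child_visits
    explore = c * math.sqrt(math.log(parent_visits) / child_visits)
    return exploit + explore


def select_child(children, parent_visits):
    """children: list of (wins, visits) pairs. Returns the index to descend into."""
    return max(
        range(len(children)),
        key=lambda i: uct_score(children[i][0], children[i][1], parent_visits),
    )
```

Because the chosen path depends on statistics that change after every simulation, the tree grows unevenly, which is what makes the workload hard to map onto GPUs.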

2006-2008 | Real-time data processing on GPU

Repurposed OpenGL graphics pipeline for general-purpose computation (2 years before NVIDIA CUDA)

MSc Thesis

1. Edge Detection



2005-2007 | Robotic Arm Manipulation using ARTMAP neural network

Used an ARTMAP fuzzy neural net to guide a robotic arm, mapping pixels to torque

BSc Thesis

1. Tracking

2. Robustness

3. Recognition Heatmaps