Department of Computer Science

Graduate School of Information Science and Technology, University of Tokyo


Kamil Rocki, Ph.D.


Post-petascale programming tools and applications, CREST, JST

The recent trend in hardware towards multi-core and GPU-accelerated heterogeneous environments creates a big challenge for algorithm designers. In particular, the graph algorithms are not well suited for SIMD architectures due to their complexity and branching. Moreover, parallelization of such algorithms in general always poses a challenge. My research explores new ways of speeding up these algorithms by utilizing modern architectures and novel parallel algorithms. I am interested in all types on new parallel, low-power architectures and the methods of programming them.

In my opinion, this research can have a great impact now and in the near future because of the rapidly increasing parallelism of computer hardware. It combines my parallel computing, GPU programming and supercomputing with very important, yet hard fundamental problems in computer science such as combinatorial optimization and AI. The main goals of my work were to investigate the impact of the new parallel architectures such as Xeon Phi, GPUs and platforms like OpenCL and CUDA on problems which is typically very hard to parallelize. It should be expected that computer hardware will become even more complex, surely not simpler. Therefore novel parallel algorithms are required. In the modern world, being limited by power and space, parallelization is the only way to run algorithms quicker and graph algorithms are no exception. It is expected that by 2020 the fastest supercomputer in the world will have billions of cores and personal computers will be as powerful as currently fastest machines which thousands of processors.

Parallel GPU accelerated discrete optimization (Traveling Salesman Problem)

Research was focused on the classic TSP problem. GPU usage greatly decreases the time needed to optimize a route, however requires a complicated and well tuned implementation. Using even a simple 2-opt optimization technique, the time needed to perform a single local search operation can be decreased approximately 5 to 45 times compared to a similar parallel CPU version. My parallel TSP solver is available here.

Logo - High Performance Parallel GPU-based TSP Solver (Screenshot) tsp

GTC 2013 Slides Previous Slides


Download latest source code (OSX, Linux - CUDA/OpenCL/SSE/AVX/Xeon Phi support v. 0.62 (03/04/2013)

Download Windows binary (CUDA/OpenCL, Multi-GPU support) (Video), based on the Linux code - (17/03/2013)

Source code is also available here:


Previous versions

CUDA/OpenCL/SSE/AVX/Xeon Phi - v. 0.61, v. 0.6

CUDA/OpenCL - v. 0.52, v. 0.51, v. 0.5, v. 0.4, v. 0.31, v. 0.3

CUDA - v. 0.2, v. 0.1

Download TSPLIB instances

CUDA + OpenGL Windows demo application (as presented at SC12)


FlopsCUDA - CUDA Benchmarking Tool (with Kepler GPU support)

Download : Linux source (Screenshot)

FlopsCL - OpenCL Benchmarking Tool

Download : Windows executable (Screenshot), Linux source (Screenshot)

OpenCL Platforms - NVIDIA, AMD, Intel

ULP-HPC: Ultra Low-Power, High-Performance Computing via Modeling and Optimization of Next Generation HPC Technologies, CREST, JST


As a PhD student I was a member of ULP-HPC (Ultra Low-Power, High-Performance Computing via Modeling and Optimization of Next Generation HPC Technologies) project. My PhD dissertation described a very effective parallelization scheme a novel, approximate Monte Carlo Tree Search (MCTS) algorithm. It is a method for making optimal decisions in artificial intelligence (AI) problems, typically move planning in combinatorial games. It combines the generality of random simulation with the precision of tree search. I have thoroughly investigated the problems related to parallelizing the algorithm and successively implemented a parallel multi-GPU version of the MCTS algorithm using CUDA on the TSUBAME 2.0 supercomputer. I was able to scale it up to 256 GPUs and 2048 CPUs showing tremendous improvement over the sequential program or any other existing work at that time. In addition to that, I analyzed the factors affecting the overall performance of the parallel version and the limitations of its scalability.Implementation and analysis of large scale parallel tree searching algorithms on GPU - Minimax, MCTS (Monte Carlo Tree Search) using CUDA and MPI. Analysis of power-usage related issues.


The motivation behind this work was caused by the emerging GPU-based supercomputer systems and their high computational potential combined with relatively low power usage compared to CPUs. As a problem to be solved I chose an AI GPU-based agent in the game of Reversi (Othello) which provides a sufficiently complex problem for tree searching with non-uniform structure.



The research covered areas such as: Artificial Intelligence, Tree Search, Monte Carlo/Random methods, Parallel processing, General Purpose GPU Programming (GPGPU)





Study on real-time image processing. Implementation of several algorithms within CUDA platform (Edge detection, Histogram calculation, Segmentation, Noise reduction) and integration with OpenCV. The main goal of the work was to improve the performance of feature extraction process from video signal while keeping the existing interface untouched.



Implementation and integration of an ART neural network class in C++ for use with the OpenCV library. The purpose of the work was an application able to recognize and locate simple objects based on their color and shape. The features obtained by OpenCV’s built-in routines were processed by the neural network. The application was part of a larger servomechanism system controlling robot's arm movement based on the object's location.