## ChainerRL Visualizer: Deep RL Agent Visualization Library

ofk

2019-03-19 12:40:46

This post is contributed by Mr. Takahiro Ishikawa, who was an intern last year and now works as a part-time engineer at PFN.

We have released ChainerRL Visualizer, which visualizes the behaviors of agents trained by ChainerRL in a Web browser.

My name is Takahiro Ishikawa. I participated in the PFN 2018 internship and currently work as a part-time engineer.

This library was developed with the aim of “making debugging of deep RL implementations easier” and “contributing to understanding of how deep RL agents work”.
It lets you interactively observe the behaviors of trained deep RL agents in a Web browser.
The library is easy to use: all you have to do is pass an agent object implemented in ChainerRL, an env object that satisfies a specific interface, and a few options to the `launch_visualizer` function provided by this library.

```python
from chainerrl_visualizer import launch_visualizer

# Prepare the agent and env objects here
#

# Prepare a dictionary that explains the meaning of each action
ACTION_MEANINGS = {
    0: 'hoge',
    1: 'fuga',
    # ...
}

launch_visualizer(
    agent,                           # required
    env,                             # required
    ACTION_MEANINGS,                 # required
    port=5002,                       # optional (default: 5002)
    log_dir='log_space',             # optional (default: 'log_space')
    raw_image_input=False,           # optional (default: False)
    contains_rnn=False,              # optional (default: False)
)
```

After you run this script, a local Web server is launched and the following features become available in the Web browser.

## 1. Roll out one episode (or a specified number of steps)

From the UI, you can tell the agent to run one episode (or a specified number of steps); the outputs of the agent model are then visualized in chronological order.
In the following video, the probabilities of the next action and the state value of the agent trained with A3C are visualized.

## 2. Tick timesteps and visualize the behaviors of the environment and agent

In the following video, the agent is stepped back and forth through time, and its outputs at each step are visualized along with the behavior of the environment. The pie chart in the bottom-right corner shows the probabilities of the next action at each step.

## 3. Saliency map

If the input of the model is raw pixels, the UI can visualize saliency maps, which show the specific sub-areas of the input that the agent pays attention to. This feature is implemented based on the paper “Visualizing and Understanding Atari Agents”.
In the following video, saliency maps of the agent trained with CategoricalDQN are visualized over the image of the environment.
Because creating saliency maps is computationally very expensive, the feature currently lets you specify the number of steps for which saliency maps are created.
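To see why saliency maps are costly, here is a minimal NumPy sketch of the perturbation-based idea from that paper: blur a Gaussian-masked region around each location and measure how much the agent's output changes. It is not ChainerRL Visualizer's code. `policy` is a hypothetical callable that maps a 2-D grayscale frame to a vector (for example, action probabilities), and the `stride`, `mask_sigma`, and `blur_sigma` defaults are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel1d(sigma):
    """Normalized 1-D Gaussian kernel truncated at 3 sigma."""
    radius = int(3 * sigma)
    x = np.arange(-radius, radius + 1)
    kernel = np.exp(-x ** 2 / (2 * sigma ** 2))
    return kernel / kernel.sum()

def blur(image, sigma):
    """Separable Gaussian blur of a 2-D image, reflecting at the borders."""
    kernel = gaussian_kernel1d(sigma)
    r = len(kernel) // 2
    padded = np.pad(image, r, mode="reflect")
    rows = np.apply_along_axis(np.convolve, 1, padded, kernel, mode="valid")
    return np.apply_along_axis(np.convolve, 0, rows, kernel, mode="valid")

def gaussian_mask(shape, center, sigma):
    """Mask with value 1 at `center`, decaying as a 2-D Gaussian."""
    ys, xs = np.indices(shape)
    cy, cx = center
    return np.exp(-((ys - cy) ** 2 + (xs - cx) ** 2) / (2 * sigma ** 2))

def saliency_map(policy, image, stride=5, mask_sigma=5.0, blur_sigma=3.0):
    """Score each sampled location by how much blurring it there changes the policy output."""
    blurred = blur(image, blur_sigma)
    base = policy(image)
    rows = range(0, image.shape[0], stride)
    cols = range(0, image.shape[1], stride)
    sal = np.zeros((len(rows), len(cols)))
    for a, i in enumerate(rows):
        for b, j in enumerate(cols):
            m = gaussian_mask(image.shape, (i, j), mask_sigma)
            perturbed = image * (1 - m) + blurred * m
            # One extra forward pass per location: this is where the cost comes from.
            sal[a, b] = 0.5 * np.sum((base - policy(perturbed)) ** 2)
    return sal
```

With these defaults, an 84×84 frame needs 290 forward passes (289 perturbed frames plus the original) for a single timestep, which is why limiting the number of steps matters.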

## 4. Miscellaneous visualizations

Various visualizations specific to each type of agent are supported.
For example, the value distributions of the agent trained with CategoricalDQN are visualized in the following video.
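For readers unfamiliar with CategoricalDQN: for each action, the agent outputs a probability distribution over a fixed set of return values ("atoms"), and its Q-value is the expectation of that distribution. The NumPy sketch below assumes 51 evenly spaced atoms on [-10, 10], a common choice; the helper name and these defaults are assumptions, not necessarily ChainerRL's settings.

```python
import numpy as np

def categorical_q_values(probs, v_min=-10.0, v_max=10.0):
    """Expected return per action from a (n_actions, n_atoms) array of atom probabilities."""
    n_atoms = probs.shape[-1]
    atoms = np.linspace(v_min, v_max, n_atoms)  # evenly spaced support z_1 .. z_N
    return probs @ atoms                        # Q(s, a) = sum_i z_i * p_i(s, a)

# Two actions: a uniform distribution and one concentrated on the highest atom.
probs = np.zeros((2, 51))
probs[0] = 1.0 / 51
probs[1, -1] = 1.0
q = categorical_q_values(probs)
```

The visualizer's distribution plots show the full `probs` rows, which carry more information than the single expected value.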

Quickstart guides are provided.

To date, most visualization tools in deep learning have focused on visualizing scores and other metrics as model training progresses. ChainerRL Visualizer is distinctive among them because it can interactively and dynamically visualize the behaviors of the deep RL agent and the environment themselves.

Atari Zoo, released by Uber Research, was developed for a similar purpose. Atari Zoo aims to accelerate research on understanding deep RL agents by providing trained models and tools for analyzing those frozen models. It enables researchers to take part in this research even if they don’t have enough computing resources.

ChainerRL Visualizer differs from Atari Zoo in that “all” kinds of agents in ChainerRL can be analyzed dynamically during training, whereas Atari Zoo only visualizes already-trained models from its repository, and those models are limited to the ALE environments and to specific algorithms whose architecture looks like `raw image => Conv => Conv .. => FC ..`.

There are also other visualization tools with similar motivations, such as DQNViz, which visualizes the behaviors of DQN agents and their various metrics during training.

Although much effort has been devoted to improving the performance of deep RL algorithms on benchmark tasks, less attention has been paid to understanding what deep RL agents learn and analyzing how trained agents behave. However, research on understanding deep RL agents, as seen in the visualizations above, will expand in the future.

ChainerRL Visualizer is currently in beta and does not yet have enough features for deeply analyzing deep RL agents, so continuous development is needed for it to contribute to the emerging research area of understanding deep RL agents. We welcome you to participate in the development of ChainerRL Visualizer, adding new features or improving existing ones through OSS collaboration.

## Experimental toolchain to compile and run Chainer models

Shinichiro Hamaji

2019-01-25 16:19:34

Hello, my name is Shinichiro Hamaji, an engineer at Preferred Networks. Today I would like to introduce an experimental project named Chainer-compiler. Although it is not ready for end users, we are making it publicly available in its current state in the hope that others may find it interesting or useful for research purposes.

https://github.com/pfnet-research/chainer-compiler

Late last year, Preferred Networks released a beta version of ChainerX. The three goals of ChainerX were

• Optimize the speed of running models
• Make models deployable to environments without Python
• Make it easier to port models to non-CPU/GPU environments

while keeping the flexibility of Chainer. The goal of chainer-compiler is to go further with ChainerX. Currently, it has the following three main components:

1. Translate the Python AST to extract computation graphs in an extended ONNX format.
2. Modify the extended ONNX graph for optimization, auto-differentiation, etc., and then generate deployable code.
3. Run the exported code with ChainerX’s C++ API.
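Step 1 can be illustrated with Python's standard `ast` module. The toy below handles only a function whose last statement returns an expression built from names, `+`, `-`, `*`, `@`, and function calls, and it emits ONNX-style `(op, inputs, output)` tuples. `extract_graph` and its output format are hypothetical; this is not chainer-compiler's translator, which also has to deal with control flow, lists, and dynamic shapes.

```python
import ast
import itertools
import textwrap

BINOPS = {ast.Add: "Add", ast.Sub: "Sub", ast.Mult: "Mul", ast.MatMult: "MatMul"}

def extract_graph(source):
    """Turn `def f(...): return <expr>` into (inputs, nodes, output) with ONNX-like op names."""
    func = ast.parse(textwrap.dedent(source)).body[0]
    nodes = []
    names = (f"t{i}" for i in itertools.count(1))  # fresh names for intermediate values

    def visit(expr):
        if isinstance(expr, ast.Name):
            return expr.id
        if isinstance(expr, ast.BinOp) and type(expr.op) in BINOPS:
            lhs, rhs = visit(expr.left), visit(expr.right)
            out = next(names)
            nodes.append((BINOPS[type(expr.op)], [lhs, rhs], out))
            return out
        if isinstance(expr, ast.Call) and isinstance(expr.func, ast.Name):
            args = [visit(a) for a in expr.args]
            out = next(names)
            nodes.append((expr.func.id.capitalize(), args, out))  # relu -> Relu
            return out
        raise NotImplementedError(ast.dump(expr))

    ret = func.body[-1]
    if not isinstance(ret, ast.Return):
        raise NotImplementedError("only functions ending in a return statement are supported")
    output = visit(ret.value)
    return [a.arg for a in func.args.args], nodes, output

inputs, nodes, output = extract_graph("""
def f(x, w, b):
    return relu(x @ w + b)
""")
```

Here `nodes` becomes `MatMul(x, w) -> t1`, `Add(t1, b) -> t2`, `Relu(t2) -> t3`, a small computation graph that later passes can inspect as a whole.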

Here are expected use-cases of this project:

• Unlike the imperative model of Chainer/CuPy/ChainerX (a.k.a. define-by-run), the first step extracts a computation graph spanning multiple operations, which makes it possible to apply inter-operation optimizations such as operation fusion.
• By running steps 1 and 2 on the host machine and deploying only step 3, you can easily deploy your model to Python-free environments.
• By adding new targets to the code generator, your model can run with an optimized model executor or on domain-specific chips such as MN-Core.
• By using steps 2 and 3, you can run ONNX models generated by other tools such as ONNX-chainer.
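To make the fusion point concrete, here is a toy pass over a hypothetical ONNX-style node list of `(op, inputs, output)` tuples (not chainer-compiler's internal representation). It fuses a `MatMul` followed by an `Add` into a single `Gemm` node, ONNX's general matrix-multiply-plus-addend operator, when nothing else consumes the `MatMul` result. It assumes intermediate values are never graph outputs.

```python
def fuse_matmul_add(nodes):
    """Replace MatMul -> Add pairs with one Gemm node when the MatMul result has no other consumer."""
    # Count how many nodes consume each value (graph outputs are assumed not to be intermediates).
    uses = {}
    for _, inputs, _ in nodes:
        for name in inputs:
            uses[name] = uses.get(name, 0) + 1
    producer = {out: (op, inputs) for op, inputs, out in nodes}

    fused_away = set()
    result = []
    for op, inputs, out in nodes:
        if op == "Add":
            lhs, rhs = inputs
            src = producer.get(lhs)
            if src is not None and src[0] == "MatMul" and uses[lhs] == 1:
                result.append(("Gemm", src[1] + [rhs], out))
                fused_away.add(lhs)
                continue
        result.append((op, inputs, out))
    # Drop the MatMul nodes whose results were folded into a Gemm.
    return [node for node in result if node[2] not in fused_away]

graph = [("MatMul", ["x", "w"], "t1"), ("Add", ["t1", "b"], "t2"), ("Relu", ["t2"], "t3")]
fused = fuse_matmul_add(graph)
```

Such a rewrite is only possible because the whole graph is visible at once; an imperative runtime executes `MatMul` before it knows an `Add` will follow.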

Other than the above, we would like to continue conducting experimental research.

As in other areas of deep learning, many groups are competing to build deep learning compilers. They have different strengths and focuses, which, in my opinion, makes research on deep learning compilers very interesting. In this project, we are trying not to compromise the flexibility of Chainer. For that reason, the toolchain does not assume that the model is static: it can handle tensors without static dimensions, Python control-flow primitives, and Python lists. This could be one of the unique strengths of this toolchain.

In this article, we have introduced chainer-compiler, an experimental project that compiles and runs Chainer models. We still have a huge number of TODOs, but they are challenging and fun to work on. If you are interested in working with us, please consider applying to Preferred Networks. Any questions or feedback are greatly appreciated.

Lastly, I would like to thank everyone who helped us. I especially would like to thank Sato-san, an intern who implemented the Python-code-to-ONNX compiler.

## Optuna: An Automatic Hyperparameter Optimization Framework

Takuya Akiba

2018-12-03 14:00:28

Preferred Networks has released a beta version of an open-source, automatic hyperparameter optimization framework called Optuna. In this blog, we will introduce the motivation behind the development of Optuna as well as its features.

# What is a hyperparameter?

A hyperparameter is a parameter that controls how a machine learning algorithm behaves. In deep learning, the learning rate, batch size, and number of training iterations are hyperparameters. Hyperparameters also include the numbers of neural network layers and channels. They are not limited to numerical values, however: choices such as whether to use Momentum SGD or Adam for training are also regarded as hyperparameters.

It is almost impossible to make a machine learning algorithm do the job without tuning hyperparameters. The number of hyperparameters tends to be high, especially in deep learning, and it is believed that performance largely depends on how we tune them. Most researchers and engineers who use deep learning tune these hyperparameters manually and spend a significant amount of their time doing so.

# What is Optuna?

Optuna is a software framework for automating the optimization of these hyperparameters. Through trial and error, it automatically searches for hyperparameter values that yield excellent performance. Currently, it can be used from Python.

Optuna uses the history of past trials to decide which hyperparameter values to try next. From this data, it estimates a promising region and tries values in that region. Optuna then estimates an even more promising region based on the new result, repeating this process with the history of all trials completed so far. Specifically, it employs a Bayesian optimization algorithm called Tree-structured Parzen Estimator (TPE).
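The core idea of TPE can be sketched for a single continuous parameter: split past trials into a "good" fraction and the rest, model each group with a Parzen (kernel density) estimator, and pick the candidate that maximizes the ratio of the good density to the bad density. This is a simplified illustration under our own assumptions (a fixed bandwidth rule, a uniform prior component, no categorical or conditional parameters); `tpe_minimize` and `parzen_logpdf` are hypothetical helpers, not Optuna's implementation.

```python
import math
import random

def parzen_logpdf(x, centers, bandwidth, low, high):
    """Log-density of equal-weight Gaussian kernels mixed with a uniform prior on [low, high]."""
    density = 1.0 / (high - low)
    for c in centers:
        z = (x - c) / bandwidth
        density += math.exp(-0.5 * z * z) / (bandwidth * math.sqrt(2 * math.pi))
    return math.log(density / (len(centers) + 1))

def tpe_minimize(objective, low, high, n_trials=100, n_startup=10,
                 gamma=0.25, n_candidates=24, seed=0):
    rng = random.Random(seed)
    history = []  # (x, objective(x)) for every completed trial
    for t in range(n_trials):
        if t < n_startup:
            x = rng.uniform(low, high)  # too little history yet: sample at random
        else:
            ranked = sorted(history, key=lambda pair: pair[1])
            n_good = max(1, math.ceil(gamma * len(ranked)))
            good = [xv for xv, _ in ranked[:n_good]]
            bad = [xv for xv, _ in ranked[n_good:]]
            bandwidth = (high - low) / (1 + len(good))
            # Draw candidates near good points and keep the one with the largest l(x) / g(x).
            candidates = [min(high, max(low, rng.gauss(rng.choice(good), bandwidth)))
                          for _ in range(n_candidates)]
            x = max(candidates,
                    key=lambda v: parzen_logpdf(v, good, bandwidth, low, high)
                    - parzen_logpdf(v, bad, bandwidth, low, high))
        history.append((x, objective(x)))
    return min(history, key=lambda pair: pair[1])

best_x, best_y = tpe_minimize(lambda x: (x - 2.0) ** 2, -10.0, 10.0)
```

Each new trial is informed by all completed ones, which is the "history of trials" loop described above.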

# What is its relationship with Chainer?

Chainer is a deep learning framework and Optuna is an automatic hyperparameter optimization framework. To optimize hyperparameters for training a neural network with Chainer, the user writes code within the Chainer code that receives hyperparameters from Optuna. Given this code, Optuna repeatedly calls the user code, and the neural network is trained with different hyperparameter values until good values are found.

Optuna is used with Chainer in most use cases at PFN, but this does not mean that Optuna and Chainer are tightly coupled. Users can use Optuna with other machine learning software as well: we have prepared sample code that uses scikit-learn, XGBoost, and LightGBM in addition to Chainer. In fact, Optuna can cover a broad range of use cases beyond machine learning, such as acceleration, as long as you provide an interface that receives hyperparameters and returns an evaluation value.

# Why did PFN develop Optuna?

Why did we develop Optuna even though there were already established automatic hyperparameter optimization frameworks like Hyperopt, Spearmint, and SMAC?

When we tried the existing alternatives, we found that they did not work or were unstable in some of our environments, and that their algorithms lagged behind recent advances in hyperparameter optimization. We also wanted a way to specify which hyperparameters should be tuned within the Python code itself, instead of having to write separate code for the optimizer.

# Key Features

### Define-by-Run style API

Optuna provides a novel Define-by-Run style API that lets the user optimize hyperparameters even when the user code is complex, while maintaining higher modularity than other frameworks. It can also optimize hyperparameters over complex search spaces that no previous framework could express.

There are two paradigms in deep learning frameworks: Define-and-Run and Define-by-Run. In the early days, Caffe and other Define-and-Run frameworks were dominant players. Then, PFN-developed Chainer appeared as the first advocate of the Define-by-Run paradigm, followed by the release of PyTorch, and later, eager mode becoming the default in TensorFlow 2.0. Now the Define-by-Run paradigm is well recognized and appears to be gaining momentum to become the standard.

Is the Define-by-Run paradigm useful only for deep learning frameworks? We realized that a similar approach could be applied to automatic hyperparameter optimization frameworks as well. Viewed this way, all existing automatic hyperparameter optimization frameworks can be classified as Define-and-Run. Optuna, on the other hand, is based on the Define-by-Run concept and provides users with a new style of API that is very different from other frameworks. Among other things, this makes user programs highly modular and gives access to complex hyperparameter spaces.
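To illustrate what Define-by-Run means for hyperparameter search, here is a toy, framework-free sketch: the objective asks a trial object for values while it runs, so a parameter such as `momentum` only exists on the branch that needs it. `ToyTrial`, its `choice`/`uniform` methods, `random_search`, and the stand-in loss are all hypothetical; the search is plain random sampling, and none of this is Optuna's API or its TPE sampler.

```python
import random

class ToyTrial:
    """Hypothetical trial object: samples a parameter at the moment the objective asks for it."""

    def __init__(self, rng):
        self.rng = rng
        self.params = {}

    def choice(self, name, options):
        self.params[name] = self.rng.choice(options)
        return self.params[name]

    def uniform(self, name, low, high):
        self.params[name] = self.rng.uniform(low, high)
        return self.params[name]

def objective(trial):
    # The search space is built while this code runs: 'momentum' is only
    # sampled when the optimizer is Momentum SGD.
    optimizer = trial.choice("optimizer", ["MomentumSGD", "Adam"])
    lr = trial.uniform("lr", 1e-5, 1e-1)
    if optimizer == "MomentumSGD":
        momentum = trial.uniform("momentum", 0.0, 0.99)
        return abs(lr - 0.01) + abs(momentum - 0.9)  # stand-in for a validation loss
    return abs(lr - 0.001)

def random_search(objective, n_trials=50, seed=0):
    rng = random.Random(seed)
    best = None
    for _ in range(n_trials):
        trial = ToyTrial(rng)
        value = objective(trial)
        if best is None or value < best[0]:
            best = (value, trial.params)
    return best

best_value, best_params = random_search(objective)
```

In a Define-and-Run framework, the same conditional space would have to be declared up front, separately from the code that consumes it.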

### Pruning of trials using learning curves

When iterative algorithms such as deep learning and gradient boosting are used, the final result of training can be roughly predicted from the learning curve. Using these predictions, Optuna can halt unpromising trials before training is over. This is Optuna's pruning feature.

Existing frameworks such as Hyperopt, Spearmint, and SMAC do not have this functionality. Recent studies show that pruning based on learning curves is highly effective. The following graph shows its effectiveness on a sample deep learning task: although Optuna and Hyperopt both use TPE as their optimization engine, Optuna optimizes more efficiently thanks to pruning.
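One simple learning-curve pruning rule stops a trial whose intermediate value at a given step is worse than the median of what earlier trials reported at that same step. The function below is a hypothetical, framework-free sketch of that median rule, not Optuna's code; `completed_curves` is assumed to be a list of dicts mapping a step to an intermediate value such as validation loss.

```python
import statistics

def should_prune(step, value, completed_curves, n_warmup_steps=0, minimize=True):
    """Return True if `value` at `step` is worse than the median of earlier trials at that step."""
    if step < n_warmup_steps:
        return False  # give every trial a few steps before judging it
    previous = [curve[step] for curve in completed_curves if step in curve]
    if not previous:
        return False  # nothing to compare against yet
    median = statistics.median(previous)
    return value > median if minimize else value < median

curves = [{0: 1.0, 1: 0.5}, {0: 0.9, 1: 0.4}, {0: 1.1, 1: 0.6}]
```

With these curves, a trial reporting a loss of 0.7 at step 1 would be stopped (the median is 0.5), while one reporting 0.45 would keep training.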

### Parallel distributed optimization

Deep learning is computationally intensive, and each training run is very time-consuming. Therefore, for automatic hyperparameter optimization in practical use cases, it is essential that the user can easily run parallel distributed optimization that is efficient and stable. Optuna supports asynchronous distributed optimization, which performs multiple trials simultaneously on multiple nodes. As the following figure shows, parallelization can make the optimization process even faster. In this example, we varied the number of workers (1, 2, 4, and 8) and confirmed that parallelization accelerated the optimization.

Optuna can also work with ChainerMN, allowing the user to easily optimize training that itself requires distributed processing. By combining these functionalities, the user can run objective functions that include distributed processing in a parallel, distributed manner.

### Visualized optimization on dashboard (under development)

Optuna has a dashboard that visualizes the optimization process, from which the user can obtain useful information about experimental results. The dashboard is served by an HTTP server that can be started with a single command and accessed from a Web browser. Optuna can also export the optimization history as a pandas DataFrame for systematic analysis.

# Conclusions

Optuna is already in use by several projects at PFN. Among them is the project to compete in the Open Images Challenge 2018, in which we finished in second place. We will continue to aggressively develop Optuna to improve its integrity as well as prototyping and implementing advanced functionalities. We believe Optuna is ready for use, so we would love to receive your candid feedback.

Our objective is to speed up deep learning related R&D activities as much as possible. Our effort into automatic hyperparameter optimization is an important step toward this end. Additionally, we have begun working on other important technologies such as neural architecture search and automatic feature extraction. PFN is looking for potential full-time members and interns who are enthusiastic about working with us in these fields and activities.

## New HCI group + upcoming papers and demos at UIST and ISS 2018

Fabrice Matulic

2018-10-15 09:11:17

## Creation of HCI group

At PFN, we aspire to create next-generation “intelligent” systems and services, powered by cutting-edge AI technology, but we also recognise that humans will remain essential actors in the design and usage of such systems and therefore it is paramount to think about how the dialogue occurs. Human-Computer Interaction (HCI) approaches, which focus on bridging the gap between people and machines, can considerably contribute to improving intricate machine-learning processes requiring human intervention. With the creation of a dedicated HCI group at PFN, we aim to advance user-centred design for AI and machines and make sure the “humans in the loop” are supported with powerful tools when working with such systems.

Broadly, there are three main lines of research that the team would like to pursue:

• HCI for machine learning: Utilise HCI methods to facilitate complex or tedious machine-learning processes in which people are involved (such as data gathering, labelling, pre-processing, augmentation; neural network engineering, deployment, and management)
• Machine-learning for HCI: Use deep learning to enhance existing or enable new interaction techniques (e.g. advanced gesture recognition, activity recognition, multimodal input, sensor fusion, embodied interaction, collaboration between AI, robots and humans, generative models to create interactive content etc.)
• Human-Robot Interaction (HRI): Make communication and interaction between smart robots and their users natural, intuitive and hopefully even fun!

Of course, HCI does not necessarily involve machine learning or robots and we are also generally interested in creating novel and exciting interactive experiences.

The HCI group will benefit from the expertise of Prof. Takeo Igarashi, of The University of Tokyo, who has been hired as an external consultant. In addition to his wide experience in HCI and HRI, Prof. Igarashi has recently started a JST CREST project on “HCI for machine learning” at his lab, which very much aligns with our research interests. We look forward to a long and fruitful collaboration.

## Papers and demos at UIST and ISS 2018

Although the group was only just officially created, we have already been active in HCI research for several months, and we will present two papers on recent work, at UIST this week and at ISS next month, respectively.

The first project, which was started at the University of Waterloo with Drini Cami and Prof. Dan Vogel, proposes using different ways of holding a stylus pen while writing on a tablet to trigger different UI actions. The technique applies machine learning to the raw touch input data to detect these different pen grips when the user's hand contacts the surface. The advantage of our technique is that it allows the user to rapidly switch between various pen modes with the same hand that writes, without resorting to cumbersome UI widgets.

In addition to the paper presentation, Drini will also be showing the technique at UIST’s popular demo session.

Our second contribution is the interactive projection mapping system for PaintsChainer that we showed at the Winter Comiket last year. For those of you who missed it, ColourAIze (as we call it in the paper) works directly with drawings and art on paper. Specifically, it projects colour fills determined by PaintsChainer directly onto the paper drawing, superimposing the colouring on the line art. As with the web version of PaintsChainer, users can specify local colour hints to influence the colourisation through simple (digital) pen strokes.

As with the pen-posture project above, we will both present our paper and do a demo of the system at the conference. If you’d like to try the fun experience of having your paper sketches, drawings and mangas coloured by AI, come and see us at ISS in Tokyo in November!

Last but not least, we are looking for talented HCI researchers to join our team, so if you think you can contribute in the areas mentioned above, please check the details of the position on our jobs page and apply!

## [PFN Day] BoF session: How to Improve Sharing of Software Components and Best Practices

tgpfeiffer

2018-08-24 12:54:26

Hello, my name is Tobias Pfeiffer, I am a Lead Engineer at Preferred Networks. On July 24, 2018, PFN held an internal tech conference called “PFN Day”, in which I hosted a Birds-of-a-Feather session named “How to Improve Sharing of Software Components and Best Practices”. I want to give a short report on what we did in that session and what the outcomes were.

Reusability of both code and best practices is one of the core goals of software engineering – some people would go as far as saying that achieving a high level of reusability is the sole reason why software engineering exists. Avoiding reinventing the wheel, fighting the “not-invented-here syndrome”, and understanding the real cost of rewriting something from scratch as opposed to learning how to use an existing code base are essential if time is to be spent on developing truly new functionality.

In a company such as PFN that develops highly specialized code based on the latest research results, it is all the more important that work that cannot be regarded as something that “only PFN can do” is minimized and shared as much as possible. However, in a rapidly growing organization it is often difficult to keep track of which member of which group may have already implemented the same functionality, or whom to ask for help with a certain problem. Believing that PFN still has room to improve when it comes to code reuse, I decided to hold a session on that topic and asked people from various projects to join.

First, everyone contributed examples of efforts from their respective teams that seemed like duplicate work. One person reported that wrappers for a certain library to work with a certain data format are written again and again by different members because they don’t know this work has already been done. Someone else reported that some algorithms for a certain computer vision task are implemented separately in many teams, simply because nobody who implemented them took the step of packaging that code into a reusable unit. Also, perhaps more an example of sharing best practices than program code, it was said that getting started with Continuous Integration (CI) systems is difficult because it takes a lot of time initially, and special requirements like hardware-specific code (GPU, Infiniband, …) or nested Docker containers are complex to set up.

Next, we thought about reasons for the low level of reuse and possible impediments. We found that on the producer side, when someone has implemented new functionality that could be useful for other members, hurdles might be that the person may not know how to package code, that it is not trivial to design a good interface for a software library, and that it takes time to extract the core functionality, package the code, and write usable documentation; this time then cannot be used for further development, and the effort does not come with an immediate benefit.

On the consumer side, problems might be that it is not easy to find code implementing a required functionality in the internal repositories, that copy/paste combined with editing individual lines of code often provides a faster solution than using code as a library and contributing missing functionality back upstream, and that it is hard to judge the quality of code found “somewhere” in an internal repository.

So we concluded that in order to increase code reuse we should try to (1) increase the motivation to share code and (2) lower the technical barriers to both creating and using shared code. In concrete terms, we thought that the following steps might be a good approach, starting with Python developers in mind:

• Create an internal PyPI server for code that cannot be released as open source code.
• Create a template repository with skeleton files that encode the best practices for sharing Python code as a library. For example, there could be a setup.py template, a basic documentation template, and CI settings that run all the tests, build the documentation, and publish to the internal package server.
• Form an expert group that can assist with any questions related to packaging, teach best practices, and that can perform some QA tasks.
• Create an overview of which internal libraries exist for which purpose, where people can add their libraries. This listing could also show some metric of how many people are using a particular library, serving as a motivation for the author and a quality indicator for potential users.

As the host of that 80-minute session, I was happy about the deep and focused discussion we had. With members from different teams, it was possible to get input from various viewpoints and also learn about issues that are specific to certain work styles. I think we developed some concrete and actionable ideas, and we will also try to follow up and actually realize these ideas after PFN Day to increase code reuse across all of our teams.

## 2018 PFN Internship Coding Tasks

Mitsuru Kusumoto

2018-07-25 18:01:08

We have published the coding tasks used in the screening process of the PFN 2018 internship. They are available on GitHub.

Hello, I’m Kusumoto, an engineer at PFN. At PFN, we organize a summer internship from August to September. The coding tasks are problems we ask applicants to solve during the screening process to check their skill level in programming, problem-solving, and so on. Because we hire in a wide range of fields, including machine learning, this year we prepared five kinds of problems: “Machine learning/Mathematics,” “Back-end,” “Front-end,” “Processor/Compiler,” and “Chainer.” Each applicant chose one of these tasks according to the theme they had selected.

This year, we received many more applications than in previous years. In response to the increase, we also increased the number of offers we made.

The details of the coding tasks are as follows.

• Machine learning/Mathematics: You are asked to implement an algorithm for generating adversarial examples against a neural network model. You also need to write a simple report on the performance of the algorithm.
• Back-end: You are asked to create a tool that analyzes a log file.
• Front-end: You are asked to develop a prototype of an annotation tool for speech videos.
• Processor/Compiler: You are asked to optimize matrix multiplication code. You also need to design a hardware circuit for matrix multiplication.
• Chainer: You are asked to implement training code for a model using Chainer.

Every year, we put care and creativity into designing the coding tasks. I hope these tasks serve as good practice problems for learning what you want to study.

I created the Machine learning/Mathematics task this year. Let me briefly describe what I usually keep in mind when creating problems.

• Make the problem not require specific knowledge: At PFN, we hire people from a wide range of fields. As far as possible, we make problems solvable without any particular experience or knowledge of machine learning itself, so that various people can tackle them.
• Make the problem setting close to actual research: In the field of machine learning or deep learning, we often repeat the process like “find a good theme -> consider a novel method -> implement it -> summarize and evaluate the result.” Our problem setting imitates the latter part of this process. It may be similar to an assignment in a university class.
• Choose an interesting theme: Many interesting research results appear every day in the area of machine learning/deep learning, so the coding task should be interesting too. This year, the task was on a method called the Fast Gradient Sign Method (FGSM), which performs far better than a random-noise baseline. I believe this was a fun experiment in and of itself.
• Do not make the problem too difficult: It is not good if the problem is too time-consuming. Our objective is that a student with enough skills can solve the problem within one or two days.
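For reference, FGSM perturbs an input by a step of size ε in the direction of the sign of the loss gradient, x_adv = x + ε · sign(∇ₓ L). The NumPy sketch below applies it to a hypothetical logistic-regression model, chosen because its input gradient has a closed form, and compares it with a random-sign baseline of the same size. It is not the task's model or a reference solution, and it assumes inputs lie in [0, 1].

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(w, b, x, y):
    """Binary cross-entropy of a logistic-regression model on one example (y is 0 or 1)."""
    p = sigmoid(w @ x + b)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

def fgsm(w, b, x, y, eps):
    """One signed-gradient step that increases the loss, clipped to the valid input range."""
    grad_x = (sigmoid(w @ x + b) - y) * w  # d loss / d x in closed form
    return np.clip(x + eps * np.sign(grad_x), 0.0, 1.0)

def random_sign_noise(x, eps, rng):
    """Baseline: perturbation of the same L-infinity size but in a random direction."""
    return np.clip(x + eps * rng.choice([-1.0, 1.0], size=x.shape), 0.0, 1.0)

w, b = np.array([1.0, -2.0]), 0.0
x, y = np.array([0.5, 0.5]), 1
x_adv = fgsm(w, b, x, y, eps=0.1)
```

For a linear model, the sign step is the worst-case perturbation inside the ε-box (as long as clipping is inactive), so it never does worse than random sign noise of the same size; for deep networks it is a one-step linear approximation.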

We evaluate the submitted code and report from various perspectives. A correct implementation is not the only thing that matters: we also evaluate whether the code is readable for other engineers, whether it has an appropriate amount of unit tests, and whether other engineers can easily reproduce the results.

In addition to the code, summarizing the results and evaluating the proposed method are also important parts of experiments. Reporting results to other people is important too, especially when you work in a team. We check the submitted report to see how well these aspects are handled.

If you are interested in PFN, we look forward to receiving your application in the next internship program.

We are also hiring full-time employees in Tokyo, Japan and San Mateo, California. Please refer to the job page below.

https://www.preferred-networks.jp/en/job

## Technologies behind Distributed Deep Learning: AllReduce

kfukuda

2018-07-10 15:11:24

This post is contributed by Mr. Yuichiro Ueno, who was a summer intern at PFN in 2017 and now works as a part-time engineer at PFN.

Hello, I am Yuichiro Ueno. I participated in a summer internship program at PFN in 2017, and I currently work as a part-time engineer. I am an undergraduate student at Tokyo Institute of Technology, and my research topic is High-Performance, Parallel and Distributed Computing.

In this blog post, I will describe our recent study on algorithms for AllReduce, a communication operation used for distributed deep learning.

## What is Distributed Deep Learning?

Currently, one of the significant challenges of deep learning is that training is a very time-consuming process. Designing a deep learning model requires exploring a design space with a large number of hyperparameters and processing big data. Thus, accelerating the training process is critical for our research and development. Distributed deep learning is one of the essential technologies for reducing training time.

We have deployed a private supercomputer “MN-1” to accelerate our research and development process. It is equipped with 1024 NVIDIA(R) Tesla(R) P100 GPUs and Mellanox(R) InfiniBand FDR interconnect and is the most powerful supercomputer in the industry segment in Japan. By leveraging MN-1, we completed training a ResNet-50 model on the ImageNet dataset in 15 minutes.

Communication among GPUs is one of the many challenges when training distributed deep learning models in a large-scale environment. The latency of exchanging gradients over all GPUs is a severe bottleneck in data-parallel synchronized distributed deep learning.

How is the communication performed in distributed deep learning? Also, why is the communication so time-consuming?

## The Importance of AllReduce in Distributed Deep Learning

In synchronized data-parallel distributed deep learning, the major computation steps are:

1. Compute the gradient of the loss function using a minibatch on each GPU.
2. Compute the mean of the gradients by inter-GPU communication.
3. Update the model.

To compute the mean, we use a collective communication operation called “AllReduce.”
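The three steps can be simulated on one machine to check why averaging is the right reduction. In this NumPy sketch, a hypothetical linear-regression model with a mean-squared-error loss stands in for the network, and each "GPU" is just one shard of the minibatch; when the shards have equal size, the mean of the per-shard gradients equals the gradient over the whole minibatch.

```python
import numpy as np

def local_gradient(w, X, y):
    """Gradient of 0.5 * mean((X @ w - y) ** 2) on one worker's minibatch."""
    return X.T @ (X @ w - y) / len(y)

def data_parallel_step(w, shards, lr=0.1):
    grads = [local_gradient(w, X, y) for X, y in shards]  # step 1: each worker's own gradient
    mean_grad = sum(grads) / len(grads)                   # step 2: AllReduce(SUM), then divide by P
    return w - lr * mean_grad                             # step 3: identical update on every worker

rng = np.random.default_rng(0)
X, y = rng.normal(size=(8, 3)), rng.normal(size=8)
w = np.zeros(3)
shards = [(X[i:i + 2], y[i:i + 2]) for i in range(0, 8, 2)]  # 4 workers, 2 examples each
w_next = data_parallel_step(w, shards)
```

Only step 2 requires communication, and every worker needs the same averaged result, which is exactly what AllReduce provides.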

As of now, one of the fastest collective communication libraries for GPU clusters is the NVIDIA Collective Communications Library (NCCL) [3]. It achieves far better communication performance than MPI, the de facto standard communication library in the HPC community. NCCL is indispensable for achieving high performance in distributed deep learning with ChainerMN; without it, the 15-minute ImageNet training could not have been achieved [2].

Our researchers and engineers were curious about NCCL’s excellent performance. Since NCCL is not an open source library, we tried to understand the high performance of the library by developing and optimizing an experimental AllReduce library.

## Algorithms of AllReduce

First, let’s take a look at the AllReduce algorithms. AllReduce is an operation that reduces the target arrays of all processes to a single array and returns the resulting array to all processes. Let P be the total number of processes. Each process p has an array $$A_p$$ of length N, and the $$i$$-th element of the array of process $$p ~(1 \leq p \leq P)$$ is $$A_{p,i}$$.

The resulting array B is to be:
$$B_{i}~~=~~A_{1,i}~~\mathrm{Op}~~A_{2,i}~~\mathrm{Op}~~\cdots~~\mathrm{Op}~~A_{P,i}$$

Here, Op is a binary operator. SUM, MAX, and MIN are frequently used. In distributed deep learning, the SUM operation is used to compute the mean of gradients. In the rest of this blog post, we assume that the reduction operation is SUM. Figure 1 illustrates how the AllReduce operation works by using an example of P=4 and N=4.
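As a point of reference, the definition of B can be written down directly. The sketch below (a hypothetical helper `allreduce_reference`, with 0-indexed processes instead of the 1-indexed notation above) only captures what every process must end up with; it says nothing about how data moves between processes, which is what the algorithms below are about.

```python
import numpy as np


def allreduce_reference(arrays, op=np.add):
    """Reference semantics of AllReduce (not an efficient algorithm).

    arrays: list of P equal-length arrays; arrays[p] plays the role of A_p.
    op:     element-wise binary operator, e.g. np.add (SUM),
            np.maximum (MAX) or np.minimum (MIN).
    Returns P copies of B, where B[i] = A_1[i] op A_2[i] op ... op A_P[i].
    """
    result = np.asarray(arrays[0]).copy()
    for a in arrays[1:]:
        result = op(result, np.asarray(a))
    # Every process receives the same resulting array.
    return [result.copy() for _ in arrays]


# Hypothetical example with P=4 processes and N=4 elements per array.
A = [np.array([1, 2, 3, 4]),
     np.array([5, 6, 7, 8]),
     np.array([9, 10, 11, 12]),
     np.array([13, 14, 15, 16])]
B = allreduce_reference(A)
print(B[0])  # [28 32 36 40]
```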

Fig.1 AllReduce Operation

There are several algorithms to implement the operation. For example, a straightforward one is to select one process as a master, gather all arrays into the master, perform reduction operations locally in the master, and then distribute the resulting array to the rest of the processes. Although this algorithm is simple and easy to implement, it is not scalable. The master process is a performance bottleneck because its communication and reduction costs increase in proportion to the number of total processes.
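A minimal sketch of this master-based scheme, simulated in a single Python process, makes the bottleneck visible by counting how many array elements each process sends and receives. The helper name `master_allreduce` and the element counting are illustrative assumptions, not part of any library.

```python
import numpy as np


def master_allreduce(arrays, master=0):
    """Naive AllReduce (SUM): gather to a master, reduce, then broadcast.

    Returns (results, received, sent), where received[p] and sent[p] count
    the array elements process p receives and sends.
    """
    P = len(arrays)
    N = len(arrays[0])
    received = [0] * P
    sent = [0] * P
    # Gather: every non-master process sends its whole array to the master,
    # which reduces it locally.
    total = np.asarray(arrays[master]).copy()
    for p in range(P):
        if p != master:
            sent[p] += N
            received[master] += N
            total = total + arrays[p]
    # Broadcast: the master sends the full result back to every other process.
    results = [None] * P
    for p in range(P):
        if p != master:
            sent[master] += N
            received[p] += N
        results[p] = total.copy()
    return results, received, sent


rng = np.random.default_rng(0)
arrays = [rng.integers(0, 10, size=1000) for _ in range(8)]
results, received, sent = master_allreduce(arrays)
print(received[0], sent[0])  # 7000 7000  (master)
print(received[1], sent[1])  # 1000 1000  (any other process)
```

The master's traffic grows as (P - 1) x N in each direction, while every other process only moves N elements each way.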

Faster and more scalable algorithms have been proposed. They eliminate the bottleneck by carefully distributing the computation and communication over the participant processes.
Such algorithms include Ring-AllReduce and Rabenseifner’s algorithm[4].

We will focus on the Ring-AllReduce algorithm in this blog post. This algorithm is also employed by NCCL [5] and baidu-allreduce [6].

## Ring-AllReduce

Let us assume that P is the total number of processes and that each process is uniquely identified by a number between 1 and P. As shown in Fig.2, the processes form a single ring.

Fig.2 Example of a process ring

First, each process divides its own array into P subarrays, which we refer to as “chunks”. Let chunk[p] be the p-th chunk.

Next, let us focus on process p. It sends chunk[p] to the next process, p+1, while simultaneously receiving chunk[p-1] from the previous process, p-1 (Fig.3). Indices wrap around the ring, so the process after process P is process 1.

Fig.3 Each process sends its chunk[p] to the next process [p+1]

Then, process p reduces the received chunk[p-1] with its own chunk[p-1] and sends the reduced chunk to the next process, p+1 (Fig.4).

Fig.4 Each process sends a reduced chunk to the next process

By repeating the receive-reduce-send steps P-1 times, each process obtains a different portion of the resulting array (Fig.5).

Fig.5 After P-1 steps, each process has a reduced subarray.

In other words, each process adds its local chunk to the received chunk and sends the result to the next process. Put differently, every chunk travels around the ring, accumulating the corresponding chunk of each process it visits. After visiting all processes once, it becomes a portion of the final result array, and the last-visited process holds that portion.

Finally, all processes obtain the complete array by sharing these distributed partial results. This is done by circulating the chunks around the ring once more without reduction, i.e., each process simply overwrites its corresponding local chunk with the received one. The AllReduce operation completes when every process holds all portions of the final array.
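The two phases described above (often called reduce-scatter and allgather) can be simulated in a single Python process. This is a minimal sketch assuming SUM as the operator and NumPy arrays as buffers; the name `ring_allreduce` is hypothetical, processes are 0-indexed, and it is not the PFN-Proto, NCCL, or baidu-allreduce implementation. It also counts the elements each process sends, which ties into the communication analysis that follows.

```python
import numpy as np


def ring_allreduce(arrays):
    """Simulate Ring-AllReduce (SUM) among P processes inside one Python process.

    arrays[r] is the local array of process r (0-indexed). Process r only ever
    sends to process (r + 1) % P. In each step, all outgoing chunks are copied
    before any is applied, mimicking simultaneous send and receive.
    Returns (buffers, sent): the final array on each process and the number of
    elements each process sent.
    """
    P = len(arrays)
    bufs = [np.array(a, dtype=np.float64) for a in arrays]  # np.array copies
    # Chunk c of process r is bufs[r][bounds[c]:bounds[c + 1]].
    bounds = np.linspace(0, len(bufs[0]), P + 1).astype(int)
    sent = [0] * P

    def chunk(r, c):
        # Basic slicing returns a view, so writes go straight into bufs[r].
        return bufs[r][bounds[c]:bounds[c + 1]]

    def circulate(index_to_send, reduce):
        for step in range(P - 1):
            outgoing = []
            for r in range(P):
                c = index_to_send(r, step)
                outgoing.append((r, c, chunk(r, c).copy()))
            for src, c, data in outgoing:
                dst = (src + 1) % P
                sent[src] += data.size
                local = chunk(dst, c)
                if reduce:
                    local += data    # receive, then reduce into the local chunk
                else:
                    local[:] = data  # receive, then overwrite the local chunk

    # Phase 1: P-1 receive-reduce-send steps. Chunk c starts at process c and
    # ends, fully reduced, at process (c - 1) % P, so process r ends up holding
    # the complete chunk (r + 1) % P.
    circulate(lambda r, step: (r - step) % P, reduce=True)
    # Phase 2: circulate the finished chunks once more without reduction.
    circulate(lambda r, step: (r + 1 - step) % P, reduce=False)
    return bufs, sent


rng = np.random.default_rng(0)
arrays = [rng.integers(0, 10, size=12) for _ in range(4)]
out, sent = ring_allreduce(arrays)
expected = np.sum(arrays, axis=0)
print(all(np.array_equal(o, expected) for o in out))  # True
print(sent)  # [18, 18, 18, 18]
```

With N = 12 and P = 4, each process sends 2 x 3 chunks of 3 elements, i.e. 2N(P-1)/P = 18 elements.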

Let’s compare the amount of communication of Ring-AllReduce to that of the simple algorithm we mentioned above.

In the simple algorithm, the master process receives the arrays of all other processes, so the total amount of data it receives is $$(P - 1) \times N$$. After the reduction, it sends the resulting array back to all other processes, which is again $$(P - 1) \times N$$ of data. Thus, the amount of communication of the master process is proportional to P.

In the Ring-AllReduce algorithm, we can calculate the amount of communication per process as follows. In the first half of the algorithm, each process sends a chunk of size $$N/P$$ a total of $$P-1$$ times. In the second half, each process again sends a chunk of the same size $$P-1$$ times. The total amount of data each process sends throughout the algorithm is therefore $$2N(P-1)/P$$, which approaches $$2N$$ as P grows and is thus practically independent of P.
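Plugging numbers into the two formulas shows how differently they scale. The sketch below counts data sent in one direction (in both schemes a process receives as much as it sends); the function names are hypothetical, and sizes are in MB, using the 256MB array from the evaluation in this post as the example.

```python
def master_sent(n, p):
    """Data the master sends in the naive algorithm (broadcast phase): N(P-1).
    It receives the same amount during the gather phase."""
    return n * (p - 1)


def ring_sent(n, p):
    """Data each process sends in Ring-AllReduce: 2 phases x (P-1) chunks of N/P.
    It receives the same amount."""
    return 2 * n * (p - 1) / p


# N = 256 (MB), as in the evaluation; P varies.
for p in (2, 8, 64, 1024):
    print(p, master_sent(256, p), ring_sent(256, p))
# 2 256 256.0
# 8 1792 448.0
# 64 16128 504.0
# 1024 261888 511.5
```

For P = 2 both schemes send the same amount, but the master's load grows linearly with P, while the per-process ring load stays below 2N = 512MB.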

Thus, the Ring-AllReduce algorithm is more efficient than the simple algorithm because it eliminates the bottleneck process by distributing computation and communication evenly over all participating processes. Many AllReduce implementations adopt Ring-AllReduce, and it is well suited to distributed deep learning workloads as well.

## Implementation and Optimization

The Ring-AllReduce algorithm is simple to implement if basic send and receive routines are given. baidu-allreduce[6] is built on top of MPI using MPI_Send and MPI_Recv.

However, we pursued further optimization by using the InfiniBand Verbs API instead of MPI. To fully utilize the hardware resources, the algorithm is split into multiple stages, such as memory registration (pinning), cuda-memcpy, send, reduction, receive, and memory deregistration, which are processed in a software pipeline. Here, “registration” and “deregistration” are the pre- and post-processing stages for DMA data transfer. MPI send/receive routines abstract away such low-level operations, so with MPI we cannot split them into pipeline stages. To make the pipelining finer-grained, we further divide chunks into smaller sub-chunks. We also introduced a memory pool to hide the memory allocation overhead.
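Why smaller sub-chunks help can be illustrated with an idealized timing model of a linear software pipeline. This is a sketch under strong assumptions, not a model of PFN-Proto: each stage's cost is taken to be proportional to the sub-chunk size, stages work on different sub-chunks in parallel, and the stage costs are hypothetical. The function name `pipeline_makespan` is likewise illustrative.

```python
def pipeline_makespan(stage_cost_per_elem, n_elems, n_subchunks):
    """Idealized completion time of a linear software pipeline.

    stage_cost_per_elem: per-element cost of each stage, in arbitrary time
                         units (hypothetical values; real stages differ).
    The chunk of n_elems elements is cut into n_subchunks equal sub-chunks.
    A sub-chunk may enter stage s once it has left stage s-1 and stage s
    has finished the previous sub-chunk.
    """
    sub_size = n_elems / n_subchunks
    costs = [c * sub_size for c in stage_cost_per_elem]
    stage_free_at = [0.0] * len(costs)  # when each stage finishes its latest sub-chunk
    for _ in range(n_subchunks):
        ready = 0.0  # when this sub-chunk has left the previous stage
        for s, cost in enumerate(costs):
            start = max(ready, stage_free_at[s])
            stage_free_at[s] = start + cost
            ready = stage_free_at[s]
    return stage_free_at[-1]


# Three hypothetical stages of equal cost, 12 elements, varying sub-chunk counts.
for k in (1, 4, 12):
    print(k, pipeline_makespan([1.0, 1.0, 1.0], 12, k))
# 1 36.0
# 4 18.0
# 12 14.0
```

With k equal sub-chunks and S equally costly stages, the makespan is (S + k - 1) times the per-sub-chunk stage cost, approaching the cost of the slowest stage alone as k grows. The model ignores fixed per-sub-chunk overheads such as message latency or buffer allocation and registration, which in practice limit how small sub-chunks can usefully be; that fixed cost is also the kind of overhead a memory pool helps hide.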

## Performance Evaluation

For performance evaluation, we compared our prototype (called PFN-Proto) to several AllReduce implementations shown in the Appendix.

Our prototype currently focuses on inter-node communication; it is not optimized for intra-node communication using shared memory or GPU-to-GPU DMA transfer. We evaluated all implementations in a one-process-per-node configuration. For Open MPI [7], we have not yet adopted the latest 3.x series because it has a minor issue related to GPUDirect, so we used version 2.1.3 instead.

We used our private supercomputer MN-1 for this experiment (see “Experimental environment” in the Appendix). We ran eight processes, one per compute node. The target data size was 256MB.

Fig.6 AllReduce Execution Time

Figure 6 shows the results. Each bar indicates the median of 10 runs, and the error bars indicate confidence intervals. The details of each library are listed under “Software versions” in the Appendix.

First, let’s look at the median values. Our experimental implementation, PFN-Proto, was the fastest: approximately 82%, 286%, 28%, and 1.6% better than ompi, ompi-cuda, Baidu, and NCCL, respectively. Although it is not shown in the graph, Baidu achieved the fastest single run of all five libraries, 0.097 [s].

Next, we focus on the variance of the performance. The maximum and minimum runtimes of PFN-Proto and NCCL are within +/- 3% and +/- 6% of their medians, respectively. In contrast, Baidu’s maximum runtime is 7.5x its median because its first run takes a very long time. Even excluding the first run, its maximum runtime is +9.6% over the median, which is still larger than those of NCCL and PFN-Proto.

We hypothesize that the performance variance of MPI and MPI-based routines comes from MPI’s internal behavior related to memory operations. MPI’s programming interface hides the memory allocation and registration operations needed for InfiniBand communication, so AllReduce implementations built on MPI cannot control when such operations occur.

## Summary

We described the AllReduce communication pattern, which is very important for distributed deep learning. In particular, we implemented the Ring-AllReduce algorithm in our experimental communication library, and it achieved performance comparable to the NCCL library released by NVIDIA. The implementation makes efficient use of the available hardware through optimizations such as using the InfiniBand Verbs API and software pipelining. We will continue our research and development on accelerating distributed deep learning.

Caveats: our implementation is experimental, and we have only demonstrated its performance on our in-house cluster. NCCL is a highly practical and usable library thanks to its performance and its availability on a wide range of IB-connected NVIDIA GPU clusters.

## Acknowledgement

I would like to thank my mentors and the team for their kind support and feedback. Since my internship last year, I have been given access to rich computational resources, and it has been a fantastic experience.

## From Mentors:

This project started with a question: “how does NCCL achieve such high and stable performance?” It is an advanced and experimental topic, but Mr. Ueno achieved a remarkable result with his high motivation and technical skills.

PFN is looking for talent not only in deep learning and machine learning but across the full range of technical areas, from hardware to software. Please visit https://www.preferred-networks.jp/en/jobs for more information.

For students who are interested in high-performance computing and other technologies, PFN offers international internship opportunities, as well as domestic programs for Japanese students. The application period has finished this year, but be ready for the next opportunity!

## References

[1] Preferred Networks officially released ChainerMN version 1.0.0
[2] Akiba, et al., “Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes”
[3] NVIDIA Collective Communications Library
[4] Rabenseifner, “Optimization of Collective Reduction Operations”, ICCS 2004
[5] Jeaugey, “Optimized Inter-GPU Collective Operations with NCCL”, GTC 2017
[6] baidu-allreduce
[7] Open MPI
[8] New ChainerMN functions for improved performance in cloud environments and performance testing results on AWS
[9] Tsuzuku, et al., “Variance-based Gradient Compression for Efficient Distributed Deep Learning”, In Proceedings of ICLR 2018 (Workshop Track)

## Appendix

### Software versions

| Implementation | Version | Note |
|---|---|---|
| MPI (ompi) | Open MPI 2.1.3 | Transfer from CPU memory to CPU memory (no GPU involved) |
| CUDA-aware MPI | Open MPI 2.1.3 | From GPU memory to GPU memory |
| baidu-allreduce (baidu) | A customized version of baidu-allreduce, based on commit ID 73c7b7f | https://github.com/keisukefukuda/baidu-allreduce |
| NCCL | 2.2.13 | |

### Experimental environment

• Intel(R) Xeon(R) CPU E5-2667 x2
• Mellanox ConnectX-3 InfiniBand FDR (56Gbps) x2
• NVIDIA Pascal P100 GPU (with NVIDIA Driver Version 375.20)

## About the Release of the DNN Inference Library Menoh

2018-06-26 14:36:43

Don’t you sometimes wish you could use languages other than Python, especially in the deep learning community?

Menoh repository : https://github.com/pfnet-research/menoh

I am Shintaro Okada, developer of Menoh. This article will give you an introduction to Menoh and describe my motivation for the development.

Menoh is a library that reads trained DNN models in the ONNX format and runs inference with them. I wrote it in C++, but it has a C interface, so its functions can easily be called from other languages as well. At release, C++, C#, and Haskell wrappers are available, and Ruby, NodeJS, and Java (JVM) wrappers are in the pipeline. I used Intel’s MKL-DNN as the backend, so even without a GPU, Menoh runs inference fast on Intel CPUs. Menoh makes it possible to deploy your trained Chainer model to an application written in a language other than Python in no time.

Incidentally, why is it Python, rather than Ruby, that has come to dominate the deep learning community? Why not R, Perl, or C++? There are many programming languages out there that could have been widely used to write deep learning training frameworks instead of Python. (Of course, each language is useful in its own way, and how likely that would have been varies from language to language.) Python holds hegemony in our universe, but in another universe, Lisp may hold supremacy. That said, we have no choice but to live in this universe, where we must part with sweet (), {}, or begin/end and write blocks with pointless indentation in order to implement the deep-something under today’s Python rule. What a tragedy. I wish I could say so without any reservations, but Python is a good programming language.

Yes, Python is a good language. It comes with a myriad of useful libraries, NumPy in particular; it is dynamically typed; and it has garbage collection. All of these make the trial-and-error process of writing code to implement and train DNNs easier. Chainer is a flexible and easily extensible DNN framework, and it is, of course, written in Python. Chainer is amazingly easy to use thanks to its magic called Define-by-Run. Sure, another language could have been used to implement Define-by-Run. But if it had not been for Python, the code would have been more complicated and its implementation more painful. Python itself obviously plays a part in Chainer’s user-friendliness.

For us, studying DNNs is not difficult, since we have Chainer backed by easy-to-use Python. We can write and train DNN models without a hitch. It’s heavenly. On the flip side, deploying trained DNN models is where the pain starts.

“Pain” may be an exaggeration. When deploying to a Python-friendly environment, I can just use Chainer as is, and there is no pain from beginning to end (at least in the deployment work). But what if the environment doesn’t allow Python? Outside the lab, Python may be ruled out for security or computing-resource reasons, and it may be of little use in areas dominated by other languages. There are a variety of situations like this (for example, Ruby still enjoys enduring popularity in the Web community). Some DL frameworks have been designed with deployment in mind and allow users to write DNNs in C or C++ without Python. But they often require a lot of implementation effort and have too few wrappers to make them easy to use. While know-how for training DNNs has become widespread, DNN deployment is still far from mature.

I just wanted to build trained models into my applications, but it’s been a hassle.

This is why I decided to develop Menoh.

Menoh is a result of my project under PFN’s 20% rule, a company policy that allows PFN members to spend 20% of their working time on their favorite tasks or projects, aside from formally assigned tasks. At PFN, various other 20% projects and study sessions, by both individuals and groups, are currently in progress.

In fact, Menoh is based on a library called Instant, which I developed as a personal project in December 2017. Since then, I have used my 20% time to enhance its functionality. Along the way, some of my colleagues gave me valuable advice on how to improve its design, and others volunteered to write wrappers for other languages. Thanks to the support of all these members, Instant has finally been released as an experimental product in pfnet-research under the new name Menoh. I plan to continue spending 20% of my time improving it. I hope you will use Menoh, and I would appreciate it if you would open issues for suggestions or any bugs you may find.

## Research Activities at Preferred Networks

Takuya Akiba

2018-06-18 15:03:33

Hello, I am Takuya Akiba, newly appointed corporate officer and chief research strategist. I would like to make an inaugural address and share my view on research activities at PFN.

# What does research mean at PFN?

It is very difficult to draw a line between what is research and what is not, and it is not worthwhile to go out of your way to define it. Research means mastering something by sharpening one’s thinking. It is usually understood as investigating and studying a subject in depth in order to establish facts and reach new conclusions about it.

Almost all projects at PFN are challenging, entail great uncertainty, and require no small amount of research. In most cases, research and development of core deep learning technologies, not to mention their applications, does not go well without selecting an appropriate method or devising a nontrivial technique according to a task or data. We are also dealing with unknown problems that arise when trying to combine technologies in multiple fields such as robotics, computer vision, and natural language processing. In addition to that, when we design a cluster, manage its resources, and work on a deep learning framework, there are many things to consider and solve by trial and error in order to make them useful and highly efficient while satisfying requirements that are specific to deep learning at the same time.

Among them, the following kinds of projects involve an especially large amount of research:

• Academic research whose findings are worth publishing in a paper
• Preparing and performing demonstrations at exhibitions
• Participating in competitions
• Solving open social problems that have been left unsolved

We have already started producing excellent results in these activities, with our papers continuously being accepted at a wide range of top conferences, including ICML, CVPR, ACL, and CHI. We are not only publishing more papers than before, but our papers are also receiving global attention. Recently, one of our researchers won the Best Paper Award on Human-Robot Interaction at ICRA’18, and another researcher’s paper was selected for an oral presentation at ICLR’18. As for demonstrations, we have exhibited our work at several events, including CES 2016 and ICRA 2017. We have also taken part in many competitions and achieved great results at the Amazon Picking Challenge 2016, the IPAB drug discovery contest, and others.

# Why does PFN do research?

What is the point of researching what doesn’t seem to bring immediate profits to a business like PFN? For example, writing a research paper means that the researcher will need to spend a good amount of his/her precious time in the office, and publishing it would be tantamount to revealing technology to people outside the company. You may be wondering whether activities like academic research and paper writing have a negative impact on the company.

At PFN, however, we highly value such activities and will continue to increase our focus on them. It is often said that the “winner takes all” in the competitive and borderless world of computer and AI businesses. To survive in this harsh business environment, we need to attain world-class technological strength through these activities and keep a competitive edge over our rivals. Building a good patent portfolio is also important in practice.

Also, I often hear people say, “Isn’t it more efficient to focus on practical applications of technologies in papers published by others?” However, leading organizations in the world will certainly be far ahead by the time those papers come out and catch our eye. Besides, the information we can get from reading papers is very limited. Oftentimes, we need to go through trial and error or contact the authors before we can reproduce the published results, or apply the method to other datasets to discover limitations that are not written in the paper. This takes an enormous amount of time. Alan Kay, known as the father of the personal computer, once said: “The best way to predict the future is to invent it.” Now that we have made one great achievement after another in multiple research fields, his words are really beginning to hit home.

Furthermore, we not only conduct research within the company but also place great importance on presenting our results to contribute to the community. This not only helps make our presence felt both in and outside Japan but will also eventually accelerate the advancement of the technologies needed to realize our objectives, if we can inspire other professionals around the world to undertake follow-on research based on the techniques we publish. This is why we actively release the code and data used in our research to the public and publish software as open source. Our researchers also peer-review papers for academic journals during work hours as part of our contribution to the academic community.

# What kind of research are we promoting?

We are working on an extensive range of research fields centered on deep learning, including computer vision, natural language processing, speech recognition, robotics, compilers, distributed processing, dedicated hardware, bioinformatics, and cheminformatics. We will step up efforts to further promote these research activities based on the following philosophy.

### Legitimately crazy

Research should be conducted not only by looking at the world today but also with an eye to the future, and its value shouldn’t be judged solely by today’s common knowledge. An impractical method that requires heavy computation, or a massive process that no one dares to run in today’s computing environment, is not necessarily a bad thing. For example, last year we succeeded in a high-profile project in which we trained an image recognition model within minutes through distributed processing on 1,024 GPUs. Not only was the speed we achieved unprecedented, but the scale of the experiment itself, using 1,024 GPUs at once, was out of the ordinary. It may not be realistic to use 1,024 GPUs for ordinary training. Does that mean research like this is not worth conducting?

Computational speed continues to improve. Especially for deep learning, there is strong interest in developing dedicated chips. According to an analysis released by OpenAI, the computational power used in large-scale deep learning training has been doubling every 3.5 months. Settings that seem incredible now may become commonplace and widely available in several years. It is extremely important to act early and with foresight: to anticipate what will happen and what problems will arise at that time, and to think about how to solve them and what we will then be able to do. The experiment using 1,024 GPUs mentioned above was the first step in our endeavor to create an environment in which such large-scale experiments are nothing out of the ordinary. We are taking advantage of having a private supercomputer and a team specializing in parallel and distributed computing to realize this.
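To put the quoted doubling time in perspective, here is the arithmetic it implies; these figures follow directly from the 3.5-month figure above and are not additional results from the OpenAI analysis.

$$\text{growth after } t \text{ months} = 2^{t/3.5}: \qquad 2^{12/3.5} \approx 10.8 \ \text{(1 year)}, \qquad 2^{36/3.5} \approx 1.2 \times 10^{3} \ \text{(3 years)}$$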

### Out into the world

You should aspire to lead the world in your research, regardless of the field. Technological strength that is a cut above the rest of the world can bring great value. Rather than turning inward, you should look outside the company and take the lead. Publishing a paper that is highly recognized by researchers worldwide, placing at the top of a competition, or being invited to give a lecture on a spotlighted subject: these are the kinds of achievements you should aim for. In reality, it may be difficult to outdistance the world in every research area. But when you are conscious of and aiming for the top spot, you will know where you stand relative to the most advanced research in the world.

It is also very important to work your way into the international community. If you become acquainted with leading researchers and they recognize you as someone to be reckoned with, you will be able to exchange valuable information with them. Therefore, PFN encourages its members to give talks outside the company and makes sure to publicize those who have made such contributions.

### Go all-out to expand

Research should not be kept behind closed doors but expanded further. For example, writing a paper on your research is an important milestone, but it’s not the end of your research project, and you shouldn’t undertake research just for the sake of writing a paper. In deep learning, a common technique can sometimes work effectively across different application fields. I have high hopes that PFN members will widen the scope of their research toward broader applications by working with members from different fields. Having people with a variety of expertise is one of our company’s strengths. If possible, you should also consider developing new software or giving feedback to make in-house software more useful. It would also be great if your research led to improvements in day-to-day business activities. Although I emphasized the importance of the number of papers accepted at top conferences, I have no intention of evaluating R&D activities solely on the number of papers or the ranking of the conferences that accepted them.

To break into the top tier, you need to make full use of your skills while staying highly motivated. That said, you don’t need to do everything by yourself. You should actively consider relying on someone who has abilities you don’t have. This applies not only to technical skills but also to paper writing. Even if you put a lot of effort into your research and make interesting findings, your paper could be underestimated and rejected by an academic conference because of misleading wording or other problems caused by a lack of experience or knowledge in writing good papers. PFN has many senior researchers with years of experience in basic research who can teach young members not only paper writing but also how to conduct thorough investigations and how to compare experiments correctly. I will ensure that our junior members can receive the support of these experienced researchers.

# The appeal of working on R&D at PFN

What are the benefits of engaging in research and development at PFN for researchers and engineers?

One of the most attractive points is that your superb individual skills, as well as organizational technical competence, are truly sought after and can make a big difference in PFN’s technical domains, mainly deep learning. This means that differences in technical skill, whether individual or team, are strongly reflected in research outcomes, so high technical skill leads directly to high value. Your individual skills and your ability to put them to good use in a team are highly regarded. This is particularly appealing if you are confident in, or motivated to improve, your technical capabilities.

It is also worth mentioning that we have flexibility in the way we do research. Some researchers devote 100% of their time to pure basic research and have formed a team entirely dedicated to it, which we even plan to expand. Others handle business-related problems while pursuing their main research. Joint research with academia is also actively being carried out. Some members work part-time while pursuing a doctoral degree in graduate school to polish their expertise.

We are also putting extra effort into enhancing our in-house systems to promote R&D activities. PFN fully supports members who take on new challenges by trusting them, giving them considerable discretion, and flexibly accommodating requests to improve in-house systems or for assets that are not available in the company. For example, all PFN members may spend up to 20% of their work hours at their own discretion. This 20% rule enables us to test our ideas right away, so I expect our motivated members to produce unique ideas and launch new initiatives one after another.

Everything from algorithms and software frameworks to research-supporting middleware and hardware is important in deep learning and the other technical domains that PFN engages in. Another appealing point is that at PFN you get to talk with experts in a wide range of fields, such as deep learning, reinforcement learning, computer vision, natural language processing, bioinformatics, high-performance computing, distributed systems, networking, robotics, simulation, data analysis, optimization, and anomaly detection. You can ask them about subjects you’re not familiar with, discuss practical problems, work together on a research subject, and so on.

# In conclusion

Finally, let me write a little about my personal aspirations. I have been given an honor beyond what I deserve: serving as corporate officer and chief research strategist at a company where many esteemed professionals do splendid work in a wonderful team whose abilities keep inspiring me every day. At first, I hesitated to accept this important role, which seemed too big for someone like me, and I was afraid I might not live up to everyone’s expectations.

I was a researcher in academia before joining PFN and worked as an intern for several corporate labs outside Japan in my university days because I was interested in becoming a researcher in a corporate environment. During one of the internships, they carried out layoffs, and I saw right before my eyes all researchers in the lab, including my mentor, being dismissed.  I experienced firsthand the toughness of continuing to make research activities meaningful enough for a company.

Despite that bitter experience, I believe PFN should promote research as a corporate activity and generate value by keeping it in a healthy state. This is not an easy task, but it is a very exciting and meaningful one, and it is exactly the area where the experience and knowledge I gained in various places could be useful. So, I decided to do my best to contribute in this new role.

I excel at combining several areas of expertise, such as research, engineering, deep learning, and distributed computing, to create new value, as well as at devising and executing competitive strategies. I will try to make the most of these strengths in broader areas.

PFN is looking for researchers and engineers who are enthusiastic about working with us on these research activities.

## CHI 2018 and PacificVis 2018

Fabrice Matulic

2018-05-08 12:02:31

This is Fabrice, Human Computer Interaction (HCI) researcher at PFN.

While automated systems based on deep neural networks are making rapid progress, it is important not to neglect the human factors involved in those processes, an aspect that is frequently referred to as “human in the loop”. In this respect, the HCI research community is well positioned to not only utilise advanced machine learning techniques as tools to create novel user-centred applications, but also to contribute approaches to facilitate the introduction, use and management of those complex tools. The information visualisation (InfoVis) community has started to shed some light on the “black box” of deep neural networks by proposing visualisations and user interfaces that help practitioners better understand what is happening inside them. PFN is closely following what is going on in HCI and InfoVis/Visual Analytics research and also aims to contribute in those areas.

## PacificVis

The 11th IEEE Pacific Visualization Symposium (PacificVis 2018), which PFN sponsored and attended, was held in Kobe in April. Machine learning was well covered with several contributions in that area, including the first keynote by Prof. Shixia Liu of Tsinghua University on “Explainable Machine Learning” and the best paper “GANViz: A Visual Analytics Approach to Understand the Adversarial Game“, which followed in the footsteps of the best paper of IEEE VIS’17 about a visual analytics system for TensorFlow. Those contributions are closely related to Explainable Artificial Intelligence (XAI), an effort to produce machine learning techniques based on explainable models and interfaces that can help users understand and interpret why and how automated systems come to particular decisions or results. Whether those algorithms and tools will be sufficient to fulfil the right to explanation of the EU’s new General Data Protection Regulation (GDPR) remains to be seen.

## CHI

The ACM Conference on Human Factors in Computing Systems (CHI) is the premier international conference on Human-Computer Interaction. This year it took place in Montreal, Canada, with more than 3,300 participants and an official welcome letter from Prime Minister Justin Trudeau.

A common use of machine learning in HCI is to detect or recognise patterns in complex sensor data in order to realise novel interaction techniques, e.g. palm contact detection from raw touch data, or handwriting recognition using pen tip motion and writing sound. With the wide availability of deep learning frameworks, HCI researchers have added those new tools to their arsenal, either to improve the recognition performance of existing techniques or to create entirely new ones that would have been ineffective or difficult to realise with earlier methods. Good examples of the latter are systems enabled by generative networks. For instance, DeepWriting is a deep generative model that can generate handwriting from typeset text and even beautify or mimic handwriting styles. ExtVision, which is inspired by IllumiRoom, automatically generates peripheral images using conditional adversarial nets instead of relying on actual content.

Aksan, E., Pece, F. and Hilliges, O. DeepWriting: Making Digital Ink Editable via Deep Generative Modeling. Code available on GitHub.

Two other application categories of machine learning that we increasingly see in HCI are interaction prediction and emotional state estimation. In the former category, Li, Bengio (Samy) and Bailly investigated how DNNs can predict human performance in interaction tasks, using vertical menu selection as an example. For emotion and state recognition, in addition to an introductory course by Lex Fridman from MIT on “deep learning for understanding the human”, two papers were presented on estimating cognitive load, one from eye pupil movements in videos and one from EEG signals. With the unceasing proliferation of sensors in mobile and wearable devices, we are bound to see more and more “smart” systems that seek to better understand people and anticipate their moves, for good or bad.

CHI also includes many vis contributions, and this year was no exception. Of particular relevance for visual exploration of big data and DNN understanding was the work by Cavallo and Demiralp, who created a visual interaction framework to improve exploratory analysis of high-dimensional data, with tools to navigate a reduced-dimension plot and observe how modifying the reduced data affects the original dataset. The examples using autoencoders on MNIST and QuickDraw, where the user draws on input samples to see how the results change, are particularly interesting.

Cavallo M, Demiralp Ç. A Visual Interaction Framework for Dimensionality Reduction Based Data Exploration.

I should also mention DuetDraw, a prototype that allows users and an AI to sketch collaboratively and which uses PaintsChainer!

### Multiray: Multi-Finger Raycasting for Large Displays

My contribution to CHI this year was not related to machine learning. It involved interacting with remote displays using multiple rays emanating from the fingers. This work with Dan Vogel, which received an honourable mention, was done while I was at the University of Waterloo. The idea is to extend single-finger raycasting to multiple rays cast from two or more fingers in order to increase the interaction vocabulary, in particular through a number of geometric shapes that users form with the points projected onto the screen.

Matulic F, Vogel D. Multiray: Multi-Finger Raycasting for Large Displays
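To make the multi-ray idea a little more concrete, here is a minimal sketch. It is not the Multiray implementation: it assumes each finger is given as a 3D origin plus a pointing direction, that the display is a flat plane at z = 0, and it uses the area of the polygon formed by the projected points as one hypothetical example of a shape feature. The helper names (`cast_ray`, `polygon_area`, `project_fingers`) are illustrative only.

```python
from typing import List, Optional, Sequence, Tuple

Vec3 = Tuple[float, float, float]


def dot(a: Vec3, b: Vec3) -> float:
    return sum(x * y for x, y in zip(a, b))


def cast_ray(origin: Vec3, direction: Vec3,
             plane_point: Vec3, plane_normal: Vec3) -> Optional[Vec3]:
    """Intersect one finger ray with the display plane (None if it misses)."""
    denom = dot(direction, plane_normal)
    if abs(denom) < 1e-9:
        return None  # ray runs parallel to the display
    diff = tuple(p - o for p, o in zip(plane_point, origin))
    t = dot(diff, plane_normal) / denom
    if t < 0:
        return None  # display is behind the finger
    return tuple(o + t * d for o, d in zip(origin, direction))


def polygon_area(points: Sequence[Tuple[float, float]]) -> float:
    """Shoelace formula: area of the polygon spanned by the projected points."""
    n = len(points)
    twice_area = 0.0
    for i in range(n):
        x1, y1 = points[i]
        x2, y2 = points[(i + 1) % n]
        twice_area += x1 * y2 - x2 * y1
    return abs(twice_area) / 2.0


def project_fingers(rays: Sequence[Tuple[Vec3, Vec3]]) -> List[Tuple[float, float]]:
    """Cast every finger ray onto a display assumed to lie in the z = 0 plane."""
    hits = []
    for origin, direction in rays:
        hit = cast_ray(origin, direction, (0.0, 0.0, 0.0), (0.0, 0.0, 1.0))
        if hit is not None:
            hits.append((hit[0], hit[1]))  # drop z, which is 0 on the display
    return hits


# Three fingers 1 unit in front of the screen, all pointing straight at it.
fingers = [((0.0, 0.0, 1.0), (0.0, 0.0, -1.0)),
           ((1.0, 0.0, 1.0), (0.0, 0.0, -1.0)),
           ((0.0, 1.0, 1.0), (0.0, 0.0, -1.0))]
points = project_fingers(fingers)
print(points)                # [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
print(polygon_area(points))  # 0.5
```

Polygon area is just one conceivable feature; distances between points or the angles of the resulting shape could be derived from the same projected points in a similar way.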

## Final thoughts

So far, it is mostly the vis community that has tackled the challenge of opening up the black box of DNNs, but because they focus on visualisation, many of the proposed tools have only limited interactive capabilities, especially when it comes to tweaking input and output data to understand how this affects the neurons of the inner layers. This is where HCI researchers need to step up and create tools that support dynamic analysis of DNNs, with the possibility of interactively adjusting the models. HCI approaches are also needed to improve the other stages of machine learning pipelines in which humans are involved, such as data labelling, model selection and integration, and data augmentation and generation. I think we can expect to see an increasing amount of work addressing those aspects at future CHIs and other HCI venues.