## Open-set Few-shot Speaker Identification

tsuboiyuta

2019-10-21 15:56:12

This is a guest blog from an ex-intern, Nontawat Charoenphakdee.

I am Nontawat Charoenphakdee from Bangkok, Thailand. I am currently a second-year PhD student (starting from Sep 2018) working on machine learning at the Sugiyama-Sato-Honda lab in the University of Tokyo. I graduated with a master’s degree from my current lab. My hobbies are listening to music, karaoke, and playing games. More information about me can be found here: https://nolfwin.github.io/.This blog entry introduces my work during the summer internship (Aug-Sep 2019) at PFN.

# Introduction: a system that can recognize you by your voice

Speaking is a natural way for humans to communicate. As we can see from recent developments in speech technology, the way we communicate with robots is getting closer and closer to the way we talk to people [1, 2, 3, 4]. For example, PFN’s interactive robot can receive voice commands from humans and follow their orders (see PFN’s ICRA-2018 paper and Autonomous Tidying-up Robot System at CEATEC2018 for more information).

Currently, several voice assistant applications are focusing on understanding the voice command without verifying the identity of speakers [2]. It is known that being able to know a speaker’s identity can enhance the security of an application. Intuitively, we definitely do not want anyone to be able to give an order to any robots, especially in a critical application (for example, see “Amazon’s Alexa started ordering people dollhouses after hearing its name on TV” and “How A Few Words To Apple’s Siri Unlocked A Man’s Front Door“).

Not only security issues, we believe that being able to recognize a speaker’s identity will lead to more exciting and useful applications of the current technology we have. Two examples are given as follows.

First, a robot can provide an appropriate response for each speaker. For example, a robot teacher may adjust an explanation according to the student or a personal robot may interact with its owner and their friends differently according to the robot’s knowledge about each user. Another example is we can design a permission level for each user to use a robot. This can also prevent a robot to receive a command from an unknown speaker who wants to use a robot in an inappropriate way.

Second, we can communicate with a robot more naturally. Consider a scenario such as a person says “Take my cup” behind the robot. Although a robot cannot see that person, a robot can associate the word “my” with the speech identity of the speaker and perform a command accordingly. It would be less natural to say “Take [person_name]’s cup” when the context is clear. Intuitively, knowing who you are talking to will give you a better understanding on the current context of the conversation.

For these reasons, being able to recognize a speaker allows a personal robot to support a wider range of applications. Thus, this study aims to explore the possibility of using speaker identification under the situation where we only have a few training data for our target speakers (few-shot). The main motivation is that we do not want our customers/users to spend too much time teaching our robots. Moreover, for safety reasons, the system should be able to detect unknown speakers that are not in the training data in the test phase (open-set). So that we can avoid any potential damages that may cause by them. As a result, we created an application and tested it using real-world few-shot speech data (which are collected from PFN members and interns, thank you for your cooperation!).

# Problem setting: Open-set Few-shot Speaker Identification

Without any prior knowledge of the task, it would be difficult to use machine learning when the number of data is very few (e.g., two data points per class) because it is prone to overfitting. In our speaker identification task, we are working on human speech information. Therefore, in addition to a small dataset of our target speakers (target data), we may consider incorporating a large labeled speech dataset even though such data are not from our target speakers (source data). Our problem setting can be informally explained as follows:

Given:

1. A large labeled data (speech-speaker pairs) from benchmark datasets.
2. A small labeled data (speech-speaker pairs) from target speakers

Goal:

Learn a classifier that can classify well from given new speech input, whether it comes from which target speaker or it is not from any target speakers at all. To evaluate the performance of a classifier, we use the following three evaluation metrics: accuracy (ACC), balanced accuracy (BAC, i.e. averaged accuracy), and F1-measure (F1) (see [5] for more information on each metric).

## Source data

We used the LibriSpeech dataset [6] as the source data. It is a famous freely available speech data consists of more than 1000 hours of English speech. There are almost 2500 speakers in this dataset.

Figure 1: Statistics of LibriSpeech dataset

## Target data

Figure 2:  An invitation to join this project used in the interim presentation

Figure 3: An instruction for collecting target speaker data

In this internship program, we collected two datasets from PFN members. The first dataset is a 4-speaker dataset recorded in “Banana” room in PFN (we call this dataset PFN-banana). Banana is a meeting room for up to 10 people. The recording environment was clean, i.e., without noise. The other dataset is a 14-speaker dataset recorded in PFN’s cafe (we call this dataset PFN-Cafe), which is a large room that can be used to hold a party for more than 100 people. Since we recorded the data during the interim poster presentation for PFN-Cafe, there were many people speaking (e.g., other presenters were presenting their work next to my poster). As a result, the collected data were quite noisy. Furthermore, because I was worried that data could be too noisy, I asked the speakers to speak loudly and found that it was too loud and the record was clipped (audio clipping is a type of waveform distortion).

Figure 4:  Audio clipping issue in PFN-Cafe dataset

We also exclude one folder of LibriSpeech (test-other) as target data with 33 speakers.

# Method

## Data preprocessing

We used 16,000 Hz as a sampling rate. An audio file is trimmed to start from speech information. Then we extracted a log filterbank feature (n_filter = 24) from given speech. Next, we stacked 5 adjacent filterbank features together (n_dim=n_filter x 5 = 120). Note that we are using variable-length data (each speech input data do not need to have the same dimension). For one speech input, we order these stacked filterbank features up to 80 stacked filterbank features for each input. As a result, the possible dimensions of inputs are from (1,120) to (80, 120). This preprocessing idea is quite similar to the preprocessing of X-vector but not exactly the same (see [7] for more information), because we found our preprocessing scheme empirically works better in our experiment.

## The model choice for training a feature extractor

We used a very simple LSTM for our experiment. We used Chainer [8] to implement all methods. We used LSTM (chainer.links.NStepLSTM) as our model. We used 10% dropout [13] for regularization. For the optimization algorithm, we used AdamW [14] with 1e-4 as a weight decay rate. One may try a more complicated model to get better performance. We also implemented X-vector [7],  but found that it takes a long time to run and we couldn’t do many trial-and-error. d-vector [9] is another choice but it cannot support the variable-length data. i-vector [10] is also another alternative to deal with speaker identification, which is well-known and highly-used before the popularity of deep neural networks. Although we used a relatively small network (one-layer LSTM) that can be trained in such a short time (within a day using 1 GPU), we still obtained a good performance on the source data. Table 5 shows the performance on source unseen data (but seen 2451 speakers).

## Simple yet the best method in my experiment

We have attempted several methods but we found that this highly simple approach achieves the best performance and we used it in our demo for the final poster presentation.

1. Using source data: learn a neural network to classify source data effectively using cross-entropy loss
2. Remove the final linear layer and the last softmax layer and use the remaining network as a feature extractor
3. For dealing with the open-set scenario, we simply used unseen LibriSpeech data (test-other folder) as a background class.
4. Learn a new linear layer and a softmax layer on target data and a background class data

(Optional) for 4, we may also fine-tune the feature extractor for our target speakers, however, we have to be careful to avoid overfitting since we have very few target data. We also did a fine-tune with a few epochs and we observed a small improvement.

One interesting but still explainable thing is we found that using one-layer LSTM and 200 number of units, the performance is the best in our experiments. Although the performance on the source is not as good as when using two-layer LSTM with the same number of units, In a preliminary experiment, we found the best model in the source suffers from overfitting in the few-shot setting. Figure 6 shows the difference in the performance of one-layer LSTM and two-layer LSTM with the same number of units (200). One-layer LSTM outperformed two-layer LSTM in the few-shot learning scenario although two-layer LSTM is better when evaluating on the source (see Table 5 for the performance on the source data). Note that this method discards the final linear layer and the last softmax layer after finishing training the source domain. In our opinion, it is interesting to explore the possibility to incorporate this information to improve the performance of few-shot learning for future work.

Table 5: We can achieve 99 percent test accuracy on 2451-speaker classification (unseen test points, but seen 2451-speaker)  for LibriSpeech dataset using our preprocessing method and simply ran it with cross-entropy loss (200 epochs).

 LSTM layers Number of units Test accuracy on source data with 2451 classes (%) 1 50 84.60 1 100 96.23 1 200 97.75 2 100 97.1 2 200 99.04

Figure 6: Performance of 2-shot learning on PFN-Cafe without open-set scenario as the number of target speakers increases. Left: one-layer, 200 number of units LSTM. Right: two-layer, 200 number of units LSTM.

## Baseline++

In the paper: A closer look at Few-shot learning [11] proposed a simple method that can perform well in their experiment (Baseline++), which is based on cosine-similarity. However, we found that this method did not work well when the number of pre-trained classes is large (2451 in our case). We found that the implementation of this method is not that straightforward and the author introduced the scale factor, which needs to be adjusted appropriately depending on the task (see the code from the original paper). We also tried to adjust this scale factor and improve the performance but it still did not work well when we have 1000+ of classes (2451 in our case).

## Prototypical network

We also tried a famous prototypical network [12] for our problem. However, it did not work well in our preliminary experiments and it is not straightforward to extend a prototypical network to support the open-set classification. It is one interesting research direction to make this happens.

# Results

We presented our results on three datasets. First, the benchmark dataset (LibriSpeech). Second, PFN-Banana datasets, which are speech recorded without noise from four PFN members. Third, PFN-Cafe, which are speech recorded in the café during the interim presentation. Figure 7 shows an overview of how we evaluate the result.

Figure 7: An overview of the evaluation procedure.

## Result on few-shot learning in LibriSpeech

We exclude the test-other folder from the pre-training set (source) and because we will use it for evaluating the performance of the few-shot learning. Note that target speakers are not given in the source data. Experiments show that we can achieve over 99% ACC/F1/BAC for 10-shot learning with 33 target speakers.

Figure 8: Performance of 10-shot learning on LibriSpeech without open-set scenario as the number of target speakers increases.

Figure 9:  Performance of 33-speaker learning on LibriSpeech without open-set scenario as the number of shots increases.

Although we have never observed 33 speakers before in the pre-training phase. We can obtain highly accurate predictions in the few-shot learning scenario. This suggests that our simple pre-training method can extract useful information to identify a speaker identity to some extent. However, one may argue that it is still the same dataset collected under similar environments, and it may not work when we use a different dataset. Motivated by this argument, we collect real-world data and test it on completely different environments (PFN-Banana, PFN-Cafe).

## PFN-Banana

Figure 10 shows the performance of our method for PFN-Banana dataset. For the closed-set scenario (the scenario without open-set scenario), we can achieve over 90 percent for this dataset. For the open-set scenario, the performance dropped (around 6-7%). It is not surprising that accuracy is very high in the open-set scenario because we add a lot of open-set data in the test phase, which caused the data to be highly imbalanced between in-distribution and out-distribution data.

Figure 10:  Performance of 2-shot learning on PFN-Banana as the number of target speakers increases. Left: Closed-set scenario. Right: Open-set scenario.

## PFN-Cafe

We report the performance on PFN-Cafe dataset, which we conducted similarly to our experiment on PFN-Banana dataset, in Figure 10. Although the data is quite noisy and the audio amplitude is clipped, our method still performed reasonably well on PFN-Cafe dataset.

Figure 11:  Performance of 2-shot learning on PFN-Cafe as the number of target speakers increases. Left: Closed-set scenario. Right: Open-set scenario.

## Demo (final presentation):

The final presentation was done in the room “Forest”, which we have never collected the data in this room. Nevertheless, our simple method can classify reasonably well. Unfortunately, we did not record the exact performance of our method on the day. We found that it can classify many target speakers very accurately. But at the same time, there was one target speaker that our classifier almost always failed to recognize. Our method could detect unknown speakers pretty well although there were also a few misclassifications to target speakers. We also found that the first prediction result users see really affects the first impression towards our application, which is reasonable and developers should keep this in mind.

Figure 12:  An invitation to test our system in the final lightning talk

Figure 13:  Testing a demo (Yuya Unno (left), Nontawat Charoenphakdee (right))

## Discussion

It is important to know the limitation of this technology. For example, if a target speaker is sick and his/her voice sounds different from usual, can the system still detect that person accurately? Is there a good and cheap data augmentation method to alleviate this problem because it is impractical to record one’s voice in every condition? Moreover, in practice, we may incorporate visual information to handle this problem. But sometimes only visual information is insufficient since we may not see everything in the range. For example, we may not be able to see something behind us or there might be something that blocks our vision. In such a case, the hearing will be very helpful.

## Acknowledgment

My mentors are Yuta Tsuboi (main) and Katsuhiko Ishiguro (sub). I received tremendous support from them. Also, I would like to thank Toru Taniguchi for teaching me a lot especially during the first week: from preprocessing the speech data to introducing several interesting state-of-the-art papers in the field of speech processing. Moreover, I would like to thank Takashi Masuko, who actively attended my weekly meeting and gave me useful comments. Finally, I would like to thank people from Human-robot interface team, Intelligent information processing team, and everyone who provided data for this project.

## References

[1] Fong, T., Nourbakhsh, I., & Dautenhahn, K. (2003). A survey of socially interactive robots. Robotics and autonomous systems, 42(3-4), 143-166.
[2] Hoy, M. B. (2018). Alexa, Siri, Cortana, and more: an introduction to voice assistants. Medical reference services quarterly, 37(1), 81-88.
[3] Kepuska, V., & Bohouta, G. (2018). Next-generation of virtual personal assistants (microsoft cortana, apple siri, amazon alexa and google home). In 2018 IEEE 8th Annual Computing and Communication Workshop and Conference (CCWC) (pp. 99-103). IEEE.
[4] Hatori, J., Kikuchi, Y., Kobayashi, S., Takahashi, K., Tsuboi, Y., Unno, Y., Ko, W. & Tan, J. (2018). Interactively picking real-world objects with unconstrained spoken language instructions. In 2018 IEEE International Conference on Robotics and Automation (ICRA) (pp. 3774-3781). IEEE.
[5] Hossin, M., & Sulaiman, M. N. (2015). A review on evaluation metrics for data classification evaluations. International Journal of Data Mining & Knowledge Management Process, 5(2), 1.
[6] Panayotov, V., Chen, G., Povey, D., & Khudanpur, S. (2015). Librispeech: an ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5206-5210). IEEE.
[7] Snyder, D., Garcia-Romero, D., Sell, G., Povey, D., & Khudanpur, S. (2018). X-vectors: Robust dnn embeddings for speaker recognition. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 5329-5333). IEEE.
[8] Tokui, S., Okuta, R., Akiba, T., Niitani, Y., Ogawa, T., Saito, S., Suzuki, S., Uenishi, K., Vogel, B. & Yamazaki Vincent, H. (2019). Chainer: A deep learning framework for accelerating the research cycle. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (SIGKDD) (pp. 2002-2011). ACM.
[9] Variani, E., Lei, X., McDermott, E., Moreno, I. L., & Gonzalez-Dominguez, J. (2014). Deep neural networks for small footprint text-dependent speaker verification. In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 4052-4056). IEEE.
[10] Dehak, N., Kenny, P. J., Dehak, R., Dumouchel, P., & Ouellet, P. (2010). Front-end factor analysis for speaker verification. IEEE Transactions on Audio, Speech, and Language Processing, 19(4), 788-798.
[11] Chen, W. Y., Liu , Y. C., Kira, Z., Wang, Y. C. F. & Huang, J. B. (2019). A Closer Look at Few-shot Classification. In Proceedings of International Conference on Learning Representations (ICLR).
[12] Snell, J., Swersky, K., & Zemel, R. (2017). Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems (NeurIPS) (pp. 4077-4087).
[13] Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. The journal of machine learning research, 15(1), 1929-1958.
[14] Loshchilov, I., & Hutter, F. (2019). Decoupled weight decay regularization. In Proceedings of International Conference on Learning Representations (ICLR).

## k8s-cluster-simulator: A simulator for evaluating Kubernetes schedulers

Daisuke Taniwaki

2019-04-11 08:00:53

## Overview

We’re happy to release an open source, Kubernetes cluster simulator, called k8s-cluster-simulator.  The simulator is in the alpha release, and was created by Hidehito Yabuuchi, a PFN internship student in 2018 and part-time employee, along with his mentors, Daisuke Taniwaki and Shingo Omura. This simulator simulates workloads of a Kubernetes cluster and time clock so you can evaluate your Kubernetes scheduler without actually deploying it in the production site.

## Motivation

We have large on-premise GPU clusters, in which researchers run ML jobs of various running duration via Kubernetes. One of our goals is to maximize the utilization of the GPUs for cost-effectiveness while enabling all researchers to have reasonable access. To do this, we developed our own private Kubernetes scheduler and extender (e.g. kube-throttler). However, it’s hard to evaluate new logic in production, because researchers are running jobs, and we should not change the scheduling logic and fairness so often. Of course, we cannot deploy a buggy scheduler that stops the researchers work. Moreover, it is not desirable to stop research to test new scheduling logic in large clusters. Therefore, we started to develop a scheduler simulator for Kubernetes.

## Design

We believe the simulator should have the following properties.

• Require as few changes on scheduler’s implementation and interface as possible.
• Simulate clock time to accelerate evaluations and also evaluate scheduling logics without being affected by system latencies such as network and internal processes.
• Simulate workloads as flexibly as possible.
• Support various output formats for further analysis.

## Architecture

Here’s the simple flow diagram.

The idea is simple. The simulator simulates clocks and ticks the simulated clock at each step of the loop. At each step, the simulator asks submitters if they have pods which should be submitted or deleted in this clock, and schedule the submitted pods to scheduler. Scheduler returns bind and delete events so the simulator can simulate the resource management. Finally, the simulator writes metrics of simulation by metrics loggers.

And here’s the high-level class diagram.

We provide the following two points of customizations for scheduling simulations.

Submitters

Multiple users can be simulated by adding any number and combination of submitters, with time and number of pods submitted fully customizable through the simulator interface. For example, assume user A tends to submit more pods in the morning and user B tends to submit more pods in the evening. A submitter can be created for each user and plugged into the simulator.
Moreover, as submitters receive metrics from the simulator, they can change behaviors based on the state of a cluster, such as crowded or not.

Scheduler

You have two options for scheduler extensions, depending on the style of Kubernetes scheduler customization. The first scheduler extension mimics the normal Kubernetes scheduler (kube-scheduler), and can be extended with Prioritizer, Extender and Predicate. If you customize your scheduling logic by these kube-scheduler extension points, this is the best approach. As Kubernetes scheduler is a queue-based scheduler, you may want to implement more complicated scheduling logic that doesn’t fit a queue based scheduler, for example, scheduling a new set of pods immediately after receiving multiple pod submissions. For this case, we provide an option to evaluate a scheduler with the interface defined in Kubernetes with a thin wrapper function.

We’re implementing the following features before the beta phase to support more realistic cluster environments simulations.

• More isolation between components (e.g. supporting RPC interface for a scheduler and submitter)
• Provide common submitter implementations (e.g. typical probabilistic distributions(Uniform, Binomial, Poisson, etc.))
• Support various cluster events (node failures, accidental pods failures, node addition/removal, etc.)
• Support plottable output formats in popular plotter tools (matplotlib, gnuplot etc.)

## ChainerRL Visualizer: Deep RL Agent Visualization Library

ofk

2019-03-19 12:40:46

This post is contributed by Mr. Takahiro Ishikawa, who was an intern last year and now works as a part-time engineer at PFN.

We have released ChainerRL Visualizer, which visualizes the behaviors of agents trained by ChainerRL on the Web browser.

My name is Takahiro Ishikawa, who participated in PFN 2018 internship and currently work as a part-time engineer.

This library is developed in the aim of “making debugging of deep RL implementations easier” and “contributing to understanding of how deep RL agents work”.
It enables to interactively observe the behaviors of trained deep RL agents on the Web browser.
This library is easy to use. All you have to do is to pass the agent object implemented in ChainerRL and the env object that satisfies a specific interface to the launch visualizer function provided by this library, along with a few options.

from chainerrl_visualizer import launch_visualizer

# Prepare agent and env object here
#

# Prepare dictionary which explains meanings of each action
ACTION_MEANINGS = {
0: 'hoge',
1: 'fuga',
...
}

launch_visualizer(
agent,                           # required
env,                             # required
ACTION_MEANINGS,                 # required
port=5002,                       # optional (default: 5002)
log_dir='log_space',             # optional (default: 'log_space')
raw_image_input=False,           # optional (default: False)
contains_rnn=False,              # optional (default: False)
)

After executing this script, a local Web server will be launched and the following features will be provided on the Web browser.

## 1. Roll-out one episode (or specified steps)

You can tell the agent to run one episode (or specified steps) from the UI, then the outputs of the agent model will be visualized in chronological order.
In the following video, the probabilities of the next action and the state value of the agent trained with A3C are visualized.

## 2. Tick timestep and visualize the behaviors of environment and agent

In the following video, the agent can be moved back and forth and the outputs in each step are visualized along with the behavior of the environment. The pie chart on bottom-right shows the probabilities of the next action of each step.

## 3. Saliency map

If the input of the model is raw pixels, the UI can visualize saliency map, which shows the specific sub-area to which the agent pays attention. This feature is implemented based on the paper Visualizing and Understanding Atari Agents.
In the following video, saliency maps of the agent trained with CategoricalDQN are visualized over the image of the environment.
For now, this feature allows us to specify the number of steps for which saliency maps are created because the computational cost of creating saliency maps is very expensive.

## 4. Miscellaneous visualizations

Various ways of visualization for each type of agent are supported.
For example, the value distributions of the agent trained with CategoricalDQN are visualized in the following video.

Quickstart guides are provided.

For now, almost all of the visualization tools in deep learning have focused on visualizing scores and other metrics along with the progress of the model training. ChainerRL Visualizer is the original visualizer among them as it can interactively and dynamically visualize the behaviors of the deep RL agent and environment themselves.

Atari Zoo, which was released by Uber Research, is developed for a similar purpose. Atari Zoo aims at accelerating research for understanding deep RL agent, providing trained models, and analyzing tools for the frozen models. It enables researchers to participate in the research for understanding deep RL agent even if they don’t have enough computing resources.

ChainerRL Visualizer is different from Atari Zoo in the sense that “all” kinds of agents in ChainerRL can be dynamically analyzed during training by the visualizer while Atari Zoo is only for visualizing already-trained models in the repository and those models are limited to the ALE environments and specific algorithms where the architecture looks like ( raw image => Conv => Conv .. => FC .. ).

There are also other visualization tools with similar motivations, such as DQNViz that visualizes the behaviors of DQN agents and its various metrics during the training.

Though much effort has been dedicated to improve the performance of deep RL algorithms on benchmark tasks, less effort has been paid for understanding what deep RL agents learn by deep RL algorithm, and analyzing how the trained agents behave. However, the research for understanding deep RL agent as seen in the visualizations above will expand in the future.

ChainerRL Visualizer is now in beta version and still does not have sufficient features for deeply analyzing deep RL agents. So continuous development is needed in order to contribute to the emerging research area of understanding deep RL agent. We welcome you to participate in the development of ChainerRL Visualizer to add new features or to improve existing features through OSS collaboration.

## Technologies behind Distributed Deep Learning: AllReduce

kfukuda

2018-07-10 15:11:24

This post is contributed by Mr. Yuichiro Ueno, who were a Summer intern in 2017 and a part time engineer at PFN.

Hello, I am Yuichiro Ueno. I participated in a summer internship program at PFN in 2017, and I currently work as a part-time engineer. I am an undergraduate student at Tokyo Institute of Technology, and my research topic is High-Performance, Parallel and Distributed Computing.

In this blog post, I will describe our recent study on algorithms for AllReduce, a communication operation used for distributed deep learning.

## What is Distributed Deep Learning?

Currently, one of the significant challenges of deep learning is it is a very time-consuming process. Designing a deep learning model requires design space exploration of a large number of hyper-parameters and processing big data. Thus, accelerating the training process is critical for our research and development. Distributed deep learning is one of the essential technologies in reducing training time.

We have deployed a private supercomputer “MN-1” to accelerate our research and development process. It is equipped with 1024 NVIDIA(R) Tesla(R) P100 GPUs and Mellanox(R) InfiniBand FDR interconnect and is the most powerful supercomputer in the industry segment in Japan. By leveraging MN-1, we completed training a ResNet-50 model on the ImageNet dataset in 15 minutes.

Communication among GPUs is one of the many challenges when training distributed deep learning models in a large-scale environment. The latency of exchanging gradients over all GPUs is a severe bottleneck in data-parallel synchronized distributed deep learning.

How is the communication performed in distributed deep learning? Also, why is the communication so time-consuming?

## The Importance of AllReduce in Distributed Deep Learning

In synchronized data-parallel distributed deep learning, the major computation steps are:

1. Compute the gradient of the loss function using a minibatch on each GPU.
2. Compute the mean of the gradients by inter-GPU communication.
3. Update the model.

To compute the mean, we use a collective communication operation called “AllReduce.”

As of now, one of the fastest collective communication libraries for GPU clusters is NVIDIA Collective Communication Library: NCCL[3]. It achieves far better communication performance than MPI, which is the de-facto standard communication library in the HPC community. NCCL is indispensable for achieving high performance in distributed deep learning using ChainerMN. Without it, the ImageNet 15-min feat could not have been achieved[2].

Our researchers and engineers were curious about NCCL’s excellent performance. Since NCCL is not an open source library, we tried to understand the high performance of the library by developing and optimizing an experimental AllReduce library.

## Algorithms of AllReduce

First, let’s take a look at the AllReduce algorithms. AllReduce is an operation that reduces the target arrays in all processes to a single array and returns the resultant array to all processes. Now, let P the total number of processes. Each process has an array of length N called $$A_p$$. $$i$$-th element of the array of process $$p ~(1 \leq p \leq P)$$ is $$A_{p,i}$$.

The resulting array B is to be:
$$B_{i}~~=~~A_{1,i}~~Op~~A_{2,i}~~Op~~…~~Op~~A_{P,i}$$

Here, Op is a binary operator. SUM, MAX, and MIN are frequently used. In distributed deep learning, the SUM operation is used to compute the mean of gradients. In the rest of this blog post, we assume that the reduction operation is SUM. Figure 1 illustrates how the AllReduce operation works by using an example of P=4 and N=4.

Fig.1 AllReduce Operation

There are several algorithms to implement the operation. For example, a straightforward one is to select one process as a master, gather all arrays into the master, perform reduction operations locally in the master, and then distribute the resulting array to the rest of the processes. Although this algorithm is simple and easy to implement, it is not scalable. The master process is a performance bottleneck because its communication and reduction costs increase in proportion to the number of total processes.

Faster and more scalable algorithms have been proposed. They eliminate the bottleneck by carefully distributing the computation and communication over the participant processes.
Such algorithms include Ring-AllReduce and Rabenseifner’s algorithm[4].

We will focus on the Ring-AllReduce algorithms in this blog post. This algorithm is also employed by NCCL [5] and baidu-allreduce[6].

## Ring-AllReduce

Let us assume that P is the total number of the processes, and each process is uniquely identified a number between 1 and P. As shown in the Fig.2, the processes constitute a single ring.

Fig.2 Example of a process ring

First, each process divides its own array into P subarrays, which we refer to as “chunks”. Let chunk[p] be the p-th chunk.

Next, let us focus on the process [p]. The process sends chunk[p] to the next process, while it receives chunk[p-1] from the previous process simultaneously (Fig.3).

Fig.3 Each process sends its chunk[p] to the next process [p+1]

Then, process p performs the reduction operation to the received chunk[p-1] and its own chunk[p-1], and sends the reduced chunk to the next process p+1 (Fig.4).

Fig.4 Each process sends a reduced chunk to the next process

By repeating the receive-reduce-send steps P-1 times, each process obtains a different portion of the resulting array (Fig.5).

Fig.5 After P-1 steps, each process has a reduced subarray.

In other words, each process adds its local chunk to a received chunk and send it to the next process. In other words, every chunk travels all around the ring and accumulates a chunk in each process. After visiting all processes once, it becomes a portion of the final result array, and the last-visited process holds the chunk.

Finally, all processes can obtain the complete array by sharing the distributed partial results among them. This is achieved by doing the circulating step again without reduction operations, i.e., merely overwriting the received chunk to the corresponding local chunk in each process. The AllReduce operation completes when all processes obtain all portions of the final array.

Let’s compare the amount of communication of Ring-AllReduce to that of the simple algorithm we mentioned above.

In the simple algorithm, the master process receives all the arrays from all other processes, which means the total amount of received data is $$(P – 1) \times N$$. After the reduction operation, it sends the arrays back to all the processes, which is again $$(P – 1) \times N$$ data. Thus, the amount of communication of the master process is proportional to P.

In the Ring-AllReduce algorithm, we can calculate the amount of communication in each process in the following way. In the earlier half of the algorithm, each process sends an array, the size of which is $$N/P$$, $$P-1$$ times. Next, each process again sends an array of the same size P-1 times. The total amount of data each process sends throughout the algorithm is $$2N(P-1) / P$$, which is practically independent of P.

Thus, the Ring-Allreduce algorithm is more efficient than the simple algorithm because it eliminates the bottleneck process by distributing computation and communication evenly over all participant processes. Many AllReduce implementations adopt Ring-AllReduce, and it is suitable for distributed deep learning workloads as well.

## Implementation and Optimization

The Ring-AllReduce algorithm is simple to implement if basic send and receive routines are given. baidu-allreduce[6] is built on top of MPI using MPI_Send and MPI_Recv.

However, we tried to do further optimizations by using InfiniBand Verbs API instead of MPI. To fully utilize hardware resources, the algorithm has multiple stages such as memory registration (pinning), cuda-memcpy, send, reduction, receive, and memory deregistration, and they are processed in a software pipeline. Here, “registration” and “deregistration” are pre- and post-processing stages for DMA data transfer. Such low-level operations are abstracted out in MPI send/receive routines, and we are not able to split them into pipeline stages. To increase the granularity of the communication and computation, we further divide chunks into smaller sub-chunks. Also, we introduce a memory pool to hide memory allocation overhead.

## Performance Evaluation

For performance evaluation, we compared our prototype (called PFN-Proto) to several AllReduce implementations shown in the Appendix.

Our prototype implementation currently focuses on inter-node communication; it is not optimized for intra-node communication using shared memory or GPU-to-GPU DMA data transfer. We evaluated the implementations in one process per node configuration. For Open MPI [7], our company is yet to introduce the latest version 3.x series because the most recent series has a minor issue related to GPUDirect. So, we used version 2.1.3 instead.

We used our private supercomputer MN-1 for this experiment, as shown in the “Experimental environment” below. Eight processes were run, where one process ran on one computing node. The target data size is 256MB.

Fig.6 AllReduce Execution Time

Figure 6 shows the result of the evaluation. Each bar indicates the median of 10 runs. The error bar indicates confidence intervals. The details of each library are shown in the “software versions” below.

First, let’s look at the median values. Our experimental implementation, PFN-Proto, showed the fastest time, which is approximately 82%, 286%, 28%, 1.6% better than ompi, ompi-cuda, Baidu, NCCL, respectively. One thing worth mentioning, which is not in the graph, is that Baidu achieved the fastest single-run time 0.097 [s] among all the five libraries.

Next, we focus on the variance of the performance. Maximum and minimum runtimes of PFN-Proto and NCCL are within +/- 3% and +/- 6%, respectively. In contrast, Baidu’s maximum value is 7.5x its median, because its first run takes a very long time. Its maximum runtime excluding the first run is +9.6% over the median, which is still more significant than those of NCCL and PFN-Proto.

Our hypothesis is that the performance variances of MPI and MPI-based routines are attributed to MPI’s internal behavior related to memory operations. MPI’s programming interface hides memory allocation and registration operations for InfiniBand communication. Timings of such operations are not controllable from those AllReduce implementations.

## Summary

We described the AllReduce communication pattern, which is very important for distributed deep learning. In particular, we implemented the Ring-AllReduce algorithm in our experimental communication library, and it achieved comparable performance to NCCL library released by NVIDIA. The implementation efficiently utilizes available hardware resources through advanced optimization such as using InfiniBand Verbs API and software pipelining. We continue our research and development on accelerating distributed deep learning.

Caveats: our implementation is experimental, and we only demonstrated the performance on our in-house cluster. NCCL is a highly practical and usable library thanks to its performance suitability and availability on a wide range of IB-connected NVIDIA GPU clusters.

## Acknowledgement

I would like to thank my mentors and the team for the kind support and feedbacks. Since my internship period last year, I have been give access to rich computation resources, and it has been a fantastic experience.

## From Mentors:

This project started with a question: “how does NCCL achieve such high and stable performance?” It is an advanced and experimental topic, but Mr. Ueno achieved a remarkable result with his high motivation and technical skills.

PFN is looking for talents, not only in the deep learning/machine learning field but a full range of technical areas from hardware to software. Please visit https://www.preferred-networks.jp/en/jobs for more information.

For students who are interested in high-performance computing and other technologies, PFN offers international internship opportunities, as well as domestic programs for Japanese students. The application period has finished this year, but be ready for the next opportunity!

## References

[1] Preferred Networks officially released ChainerMN version 1.0.0
[2] Akiba, et al., “Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes”
[3] NVIDIA Collective Communications Library
[4] Rabenseifner, “Optimization of Collective Reduction Operations”, ICCS 2004
[5] Jeaugey, “Optimized Inter-GPU Collective Operations with NCCL”, GTC 2017
[6] baidu-allreduce
[7] Open MPI
[8] New ChainerMN functions for improved performance in cloud environments and performance testing results on AWS
[9] Tsuzuku, et al., “Variance-based Gradient Compression for Efficient Distributed Deep Learning”, In Proceedings of ICLR 2018 (Workshop Track)

## Appendix

### Software versions

Implementation Version Note
MPI (ompi) Open MPI 2.1.3 Trasnfer from CPU memory to CPU memory (No GPU involved)
CUDA-aware MPI Open MPI 2.1.3 From GPU memory to GPU memory
baidu-allreduce (baidu) A customized version of baidu-allreduce, based on commit ID 73c7b7f https://github.com/keisukefukuda/baidu-allreduce
NCCL 2.2.13

### Experimental environment

• Intel(R) Xeon(R) CPU E5-2667 * 2
• Mellanox ConnectX-3 InfiniBand FDR (56Gbps) x2
• NVIDIA Pascal P100 GPU (with NVIDIA Driver Version 375.20)

## Guest blog with Hai, a former intern at PFN

hido

2018-04-09 17:34:34

This is a guest post in an interview style with Hai Nguyen, a former intern 2017 summer at Preferred Networks, whose research has been accepted at one of the NIPS 2017 workshops. After finishing PFN internship, he joined Kyoto University as a Ph.d student.

“Semi-supervised Learning of Hierarchical Representations of Molecules Using Neural Message Passing,” Hai Nguyen, Shin-ichi Maeda, and Kenta Oono; NIPS Workshop on Machine Learning for Molecules and Materials, 2017. (Link, arXiv)

## 2018 Intern Results at Preferred Networks (Part 1)

hido

2017-10-18 07:44:24

This summer, Preferred Networks accepted a record number of interns in Tokyo from all over the world. They tackled challenging tasks around artificial intelligence together with PFN mentors. We appreciate their passion, focus, and designation to the internship.

In this post, we would like to share some of their great jobs (more to come).

## Guest blog with Weihua, a former intern at PFN

hido

2017-09-11 16:29:13

This is a guest post in an interview style with Weihua Hu, a former intern at Preferred Networks last year from University of Tokyo, whose research has been extended after the internship and accepted at ICML 2017.

“Learning Discrete Representations via Information Maximizing Self-Augmented Training,” Weihua Hu, Takeru Miyato, Seiya Tokui, Eiichi Matsumoto, and Masashi Sugiyama; Proceedings of the 34th International Conference on Machine Learning, PMLR 70:1558-1567, 2017. (Link)

## Publish 2017 PFN Internship Coding Tasks

Kosuke Nakago

2017-08-01 11:23:51

* Japanese blog is also written here.

Preferred Networks (PFN) organizes two-month-long summer internship program for students in August and September every year.

The number of applications is increasing year by year. This year we have received the highest number of applications ever and the interview and selection process has been finished.

PFN 2017 Summer Internship Program

Then we have published our intern coding tasks on github.