Technologies behind Distributed Deep Learning: AllReduce

kfukuda

2018-07-10 15:11:24

This post is contributed by Mr. Yuichiro Ueno, who was a summer intern in 2017 and is now a part-time engineer at PFN.


Hello, I am Yuichiro Ueno. I participated in a summer internship program at PFN in 2017, and I currently work as a part-time engineer. I am an undergraduate student at Tokyo Institute of Technology, and my research topic is High-Performance, Parallel and Distributed Computing.

In this blog post, I will describe our recent study on algorithms for AllReduce, a communication operation used for distributed deep learning.

What is Distributed Deep Learning?

Currently, one of the significant challenges of deep learning is that it is a very time-consuming process. Designing a deep learning model requires design space exploration of a large number of hyper-parameters and processing big data. Thus, accelerating the training process is critical for our research and development. Distributed deep learning is one of the essential technologies in reducing training time.

We have deployed a private supercomputer “MN-1” to accelerate our research and development process. It is equipped with 1024 NVIDIA(R) Tesla(R) P100 GPUs and Mellanox(R) InfiniBand FDR interconnect and is the most powerful supercomputer in the industry segment in Japan. By leveraging MN-1, we completed training a ResNet-50 model on the ImageNet dataset in 15 minutes.

Communication among GPUs is one of the many challenges when training distributed deep learning models in a large-scale environment. The latency of exchanging gradients over all GPUs is a severe bottleneck in data-parallel synchronized distributed deep learning.

How is the communication performed in distributed deep learning? Also, why is the communication so time-consuming?

The Importance of AllReduce in Distributed Deep Learning

In synchronized data-parallel distributed deep learning, the major computation steps are:

  1. Compute the gradient of the loss function using a minibatch on each GPU.
  2. Compute the mean of the gradients by inter-GPU communication.
  3. Update the model.

To compute the mean, we use a collective communication operation called “AllReduce.”
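As a concrete illustration of these three steps, the following is a minimal sketch of one data-parallel training iteration written with mpi4py and NumPy; compute_gradient and apply_update are hypothetical placeholders for the framework's backward pass and optimizer update, not real APIs.

# Minimal sketch of one synchronized data-parallel training step (assumes mpi4py and NumPy).
# compute_gradient() and apply_update() are hypothetical placeholders for the framework's
# backward pass and optimizer update.
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

def train_step(model, minibatch):
    grad = compute_gradient(model, minibatch)      # step 1: local gradient on this GPU
    mean_grad = np.empty_like(grad)
    comm.Allreduce(grad, mean_grad, op=MPI.SUM)    # step 2: sum the gradients of all processes
    mean_grad /= comm.Get_size()                   # divide by P to obtain the mean
    apply_update(model, mean_grad)                 # step 3: update the model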

As of now, one of the fastest collective communication libraries for GPU clusters is the NVIDIA Collective Communications Library (NCCL)[3]. It achieves far better communication performance than MPI, which is the de facto standard communication library in the HPC community. NCCL is indispensable for achieving high performance in distributed deep learning using ChainerMN; without it, the ImageNet-in-15-minutes feat could not have been achieved[2].

Our researchers and engineers were curious about NCCL’s excellent performance. Since NCCL is not an open source library, we tried to understand the high performance of the library by developing and optimizing an experimental AllReduce library.

Algorithms of AllReduce

First, let’s take a look at the AllReduce algorithms. AllReduce is an operation that reduces the target arrays in all processes to a single array and returns the resultant array to all processes. Let \(P\) be the total number of processes. Each process has an array of length \(N\) called \(A_p\). The \(i\)-th element of the array of process \(p ~(1 \leq p \leq P)\) is \(A_{p,i}\).

The resulting array \(B\) is defined as:
$$ B_{i}~~=~~A_{1,i}~~Op~~A_{2,i}~~Op~~…~~Op~~A_{P,i} $$

Here, Op is a binary operator. SUM, MAX, and MIN are frequently used. In distributed deep learning, the SUM operation is used to compute the mean of gradients. In the rest of this blog post, we assume that the reduction operation is SUM. Figure 1 illustrates how the AllReduce operation works by using an example of P=4 and N=4.


Fig.1 AllReduce Operation

 

There are several algorithms to implement the operation. For example, a straightforward one is to select one process as a master, gather all arrays into the master, perform reduction operations locally in the master, and then distribute the resulting array to the rest of the processes. Although this algorithm is simple and easy to implement, it is not scalable. The master process is a performance bottleneck because its communication and reduction costs increase in proportion to the number of total processes.

Faster and more scalable algorithms have been proposed. They eliminate the bottleneck by carefully distributing the computation and communication over the participant processes.
Such algorithms include Ring-AllReduce and Rabenseifner’s algorithm[4].

We will focus on the Ring-AllReduce algorithm in this blog post. This algorithm is also employed by NCCL [5] and baidu-allreduce[6].

Ring-AllReduce

Let us assume that P is the total number of processes, and each process is uniquely identified by a number between 1 and P. As shown in Fig.2, the processes constitute a single ring.


Fig.2 Example of a process ring

 

First, each process divides its own array into P subarrays, which we refer to as “chunks”. Let chunk[p] be the p-th chunk.

Next, let us focus on process p. The process sends chunk[p] to the next process, while simultaneously receiving chunk[p-1] from the previous process (Fig.3).


Fig.3 Each process sends its chunk[p] to the next process [p+1]

 

Then, process p performs the reduction operation on the received chunk[p-1] and its own chunk[p-1], and sends the reduced chunk to the next process p+1 (Fig.4).


Fig.4 Each process sends a reduced chunk to the next process

 

By repeating the receive-reduce-send steps P-1 times, each process obtains a different portion of the resulting array (Fig.5).


Fig.5 After P-1 steps, each process has a reduced subarray.

 

In other words, each process adds its local chunk to the chunk it receives and sends the result to the next process. Put differently, every chunk travels all the way around the ring, accumulating one process’s contribution at each hop. After visiting all processes once, it becomes a portion of the final result array, and the last-visited process holds it.

Finally, all processes obtain the complete array by sharing the distributed partial results among themselves. This is achieved by circulating the chunks around the ring once more, this time without reduction operations, i.e., merely overwriting each local chunk with the corresponding received chunk. The AllReduce operation completes when all processes have obtained all portions of the final array.
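To make the two phases concrete, the following is a minimal single-process NumPy simulation of Ring-AllReduce (reduce-scatter followed by allgather). It models the P processes as a list of arrays instead of performing real communication, so it is only a sketch of the data movement.

# Single-process NumPy simulation of Ring-AllReduce: reduce-scatter then allgather.
# This is an illustrative sketch; a real implementation overlaps the sends and
# receives of neighboring processes instead of looping over them sequentially.
import numpy as np

def ring_allreduce_sim(arrays):
    P = len(arrays)                                    # number of simulated processes
    # Each process splits its array into P chunks.
    chunks = [np.array_split(a.astype(np.float64), P) for a in arrays]

    # Phase 1: reduce-scatter. At step t, process p receives chunk (p-t-1) mod P
    # from process p-1 and adds it to its own copy. After P-1 steps, process p
    # holds the fully reduced chunk (p+1) mod P.
    for t in range(P - 1):
        for p in range(P):
            idx = (p - t - 1) % P
            chunks[p][idx] = chunks[p][idx] + chunks[(p - 1) % P][idx]

    # Phase 2: allgather. At step t, process p receives the fully reduced chunk
    # (p-t) mod P from process p-1 and overwrites its own copy with it.
    for t in range(P - 1):
        for p in range(P):
            idx = (p - t) % P
            chunks[p][idx] = chunks[(p - 1) % P][idx].copy()

    return [np.concatenate(c) for c in chunks]

# Example: P=4 processes, arrays of length 8; every result equals the elementwise sum.
if __name__ == "__main__":
    data = [np.arange(8) + 10 * p for p in range(4)]
    results = ring_allreduce_sim(data)
    assert all(np.allclose(r, sum(data)) for r in results)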

Let’s compare the amount of communication of Ring-AllReduce to that of the simple algorithm we mentioned above.

In the simple algorithm, the master process receives all the arrays from all other processes, which means the total amount of received data is \((P-1) \times N\). After the reduction operation, it sends the arrays back to all the processes, which is again \((P-1) \times N\) data. Thus, the amount of communication of the master process is proportional to P.

In the Ring-AllReduce algorithm, we can calculate the amount of communication in each process in the following way. In the first half of the algorithm, each process sends an array of size \(N/P\), \(P-1\) times. In the second half, each process again sends an array of the same size \(P-1\) times. The total amount of data each process sends throughout the algorithm is \(2N(P-1)/P\), which is practically independent of P.
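For example, with \(P = 8\) processes and \(N = 256\) MB of data (the setting used in the evaluation below), the master in the simple algorithm transfers \(2 \times (8-1) \times 256~\mathrm{MB} = 3584~\mathrm{MB}\), while each process in Ring-AllReduce transfers only \(2 \times 256~\mathrm{MB} \times 7/8 = 448~\mathrm{MB}\).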

Thus, the Ring-AllReduce algorithm is more efficient than the simple algorithm because it eliminates the bottleneck process by distributing computation and communication evenly over all participating processes. Many AllReduce implementations adopt Ring-AllReduce, and it is suitable for distributed deep learning workloads as well.

Implementation and Optimization

The Ring-AllReduce algorithm is simple to implement if basic send and receive routines are given. baidu-allreduce[6] is built on top of MPI using MPI_Send and MPI_Recv.

However, we tried to do further optimizations by using InfiniBand Verbs API instead of MPI. To fully utilize hardware resources, the algorithm has multiple stages such as memory registration (pinning), cuda-memcpy, send, reduction, receive, and memory deregistration, and they are processed in a software pipeline. Here, “registration” and “deregistration” are pre- and post-processing stages for DMA data transfer. Such low-level operations are abstracted out in MPI send/receive routines, and we are not able to split them into pipeline stages. To increase the granularity of the communication and computation, we further divide chunks into smaller sub-chunks. Also, we introduce a memory pool to hide memory allocation overhead.
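As a rough illustration of the idea only (not our actual Verbs-based implementation), the sketch below uses Python threads and a queue to show how splitting a chunk into sub-chunks lets the “send” of one reduced sub-chunk overlap with the reduction of the next one.

# Rough sketch of software pipelining over sub-chunks (illustration only, not the
# actual Verbs-based implementation): while one reduced sub-chunk is being "sent",
# the reduction of the next received sub-chunk proceeds concurrently.
import queue
import threading
import numpy as np

def reduce_stage(received_subchunks, local_subchunks, send_q):
    # Reduce each received sub-chunk into the local buffer and hand it to the sender.
    for i, sub in enumerate(received_subchunks):
        local_subchunks[i] += sub
        send_q.put((i, local_subchunks[i]))
    send_q.put(None)                                  # sentinel: no more sub-chunks

def send_stage(send_q, send_fn):
    # Consume reduced sub-chunks as they become ready and "send" them downstream.
    while True:
        item = send_q.get()
        if item is None:
            break
        index, sub = item
        send_fn(index, sub)

if __name__ == "__main__":
    local = [np.ones(1024) for _ in range(8)]         # one chunk split into 8 sub-chunks
    received = [np.full(1024, 2.0) for _ in range(8)]
    q = queue.Queue()
    sender = threading.Thread(target=send_stage, args=(q, lambda i, s: None))
    sender.start()
    reduce_stage(received, local, q)
    sender.join()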

Performance Evaluation

For performance evaluation, we compared our prototype (called PFN-Proto) to several AllReduce implementations shown in the Appendix.

Our prototype implementation currently focuses on inter-node communication; it is not optimized for intra-node communication using shared memory or GPU-to-GPU DMA data transfer. We evaluated the implementations in a one-process-per-node configuration. For Open MPI [7], our company has yet to introduce the latest 3.x series because it has a minor issue related to GPUDirect, so we used version 2.1.3 instead.

We used our private supercomputer MN-1 for this experiment, as shown in the “Experimental environment” section below. Eight processes were run, one process per computing node. The target data size is 256MB.


Fig.6 AllReduce Execution Time

 

Figure 6 shows the result of the evaluation. Each bar indicates the median of 10 runs. The error bars indicate confidence intervals. The details of each library are shown in the “Software versions” section below.

First, let’s look at the median values. Our experimental implementation, PFN-Proto, showed the fastest time, approximately 82%, 286%, 28%, and 1.6% better than ompi, ompi-cuda, Baidu, and NCCL, respectively. One thing worth mentioning, which is not in the graph, is that Baidu achieved the fastest single-run time of 0.097 s among all five libraries.

Next, we focus on the variance of the performance. The maximum and minimum runtimes of PFN-Proto and NCCL are within +/- 3% and +/- 6% of their medians, respectively. In contrast, Baidu’s maximum value is 7.5x its median because its first run takes a very long time. Its maximum runtime excluding the first run is +9.6% over the median, which is still larger than those of NCCL and PFN-Proto.

Our hypothesis is that the performance variances of MPI and MPI-based routines are attributed to MPI’s internal behavior related to memory operations. MPI’s programming interface hides memory allocation and registration operations for InfiniBand communication. Timings of such operations are not controllable from those AllReduce implementations.

Summary

We described the AllReduce communication pattern, which is very important for distributed deep learning. In particular, we implemented the Ring-AllReduce algorithm in our experimental communication library, and it achieved performance comparable to the NCCL library released by NVIDIA. The implementation efficiently utilizes available hardware resources through advanced optimizations such as using the InfiniBand Verbs API and software pipelining. We will continue our research and development on accelerating distributed deep learning.

Caveats: our implementation is experimental, and we have only demonstrated its performance on our in-house cluster. NCCL is a highly practical and usable library thanks to its performance and its availability on a wide range of InfiniBand-connected NVIDIA GPU clusters.

Acknowledgement

I would like to thank my mentors and the team for their kind support and feedback. Since my internship last year, I have been given access to rich computation resources, and it has been a fantastic experience.

From Mentors:

This project started with a question: “how does NCCL achieve such high and stable performance?” It is an advanced and experimental topic, but Mr. Ueno achieved a remarkable result with his high motivation and technical skills.

PFN is looking for talented people, not only in the deep learning/machine learning field but across a full range of technical areas from hardware to software. Please visit https://www.preferred-networks.jp/en/jobs for more information.

For students who are interested in high-performance computing and other technologies, PFN offers international internship opportunities, as well as domestic programs for Japanese students. The application period has finished this year, but be ready for the next opportunity!

References

[1] Preferred Networks officially released ChainerMN version 1.0.0
[2] Akiba, et al., “Extremely Large Minibatch SGD: Training ResNet-50 on ImageNet in 15 Minutes”
[3] NVIDIA Collective Communications Library
[4] Rabenseifner, “Optimization of Collective Reduction Operations”, ICCS 2004
[5] Jeaugey, “Optimized Inter-GPU Collective Operations with NCCL”, GTC 2017
[6] baidu-allreduce
[7] Open MPI
[8] New ChainerMN functions for improved performance in cloud environments and performance testing results on AWS
[9] Tsuzuku, et al., “Variance-based Gradient Compression for Efficient Distributed Deep Learning”, In Proceedings of ICLR 2018 (Workshop Track)

Appendix

Software versions

  • MPI (ompi): Open MPI 2.1.3, transfer from CPU memory to CPU memory (no GPU involved)
  • CUDA-aware MPI (ompi-cuda): Open MPI 2.1.3, transfer from GPU memory to GPU memory
  • baidu-allreduce (baidu): a customized version of baidu-allreduce, based on commit ID 73c7b7f (https://github.com/keisukefukuda/baidu-allreduce)
  • NCCL: version 2.2.13

Experimental environment

  • Intel(R) Xeon(R) CPU E5-2667 * 2
  • Mellanox ConnectX-3 InfiniBand FDR (56Gbps) x2
  • NVIDIA Pascal P100 GPU (with NVIDIA Driver Version 375.20)

About the Release of the DNN Inference Library Menoh

Shintarou Okada

2018-06-26 14:36:43

Don’t you want to use languages other than Python, especially in the deep learning community?

Menoh repository : https://github.com/pfnet-research/menoh

I am Shintaro Okada, developer of Menoh. This article will give you an introduction to Menoh and describe my motivation for the development.

Menoh is a library that can read trained DNN models in the ONNX format for inference. I wrote it in C++, but it has a C language interface. So, its functions can easily be called from other languages as well.  At release, C++, C#, and Haskell wrappers are available, and Ruby, NodeJS, and Java (JVM) wrappers are in the pipeline. I leveraged Intel’s MKL-DNN backend, so that even without using a GPU, it does fast inference on Intel CPUs. Menoh makes it possible to deploy your trained Chainer model to an application programmed in languages other than Python in no time.

In the meantime, why is it Python, rather than Ruby, that has dominated the deep learning community? Why not R, Perl, or C++? There are many programming languages out there that could have been widely used to write deep learning frameworks instead of Python (of course, each language is useful in its own way, and how likely that would have been depends on the language). Python has the hegemony of our universe, but in another universe, Lisp may hold supremacy. That said, we have no choice but to live in this universe, where we must part with sweet (), {}, or begin/end and write blocks with pointless indentation in order to implement the deep-something under today’s Python rule. What a tragedy. I wish I could say so without any reservations, but Python is a good programming language.

Yes, Python is a good language. It comes with a myriad of useful libraries, NumPy in particular, is dynamically typed, and has garbage collection. All of these make the trial-and-error process of writing code to train and implement DNNs easier. Chainer is a flexible and easily extensible DNN framework, and it is, of course, written in Python. Chainer is amazingly easy to use thanks to its magic called Define-by-Run. Sure, another language could have been used to implement the Define-by-Run feature. But if it had not been for Python, the code would have been more complicated and its implementation more painful. The Python language itself obviously plays a part in Chainer’s user-friendliness.

For us, studying DNNs is not difficult, since we have Chainer backed by easy-to-use Python. We can write and train DNN models without a hitch. It’s heavenly. On the flip side, deploying trained DNN models is where the pain starts.

It may be an exaggeration to use the word pain. When deploying to a Python-friendly environment, I can just use Chainer as is, and there is no pain from beginning to end (at least in the deployment work). But what if one’s environment doesn’t allow Python? Outside the lab, one may not be able to use Python due to security or computing-resource issues, and Python may be useless in areas dominated by other languages. There are a variety of situations like this (for example, Ruby enjoys enduring popularity in the Web community even today). Some DL frameworks have been designed with deployment in mind and allow users to write DNNs in C or C++ without using Python, but they often require a lot of effort to implement and have too few wrappers to make them easy to use. While the knowledge of training DNNs has become widespread, the deployment of DNNs has been far from developed.

I just wanted to build trained models into my applications, but it’s been a hassle.

This is why I decided to develop Menoh.

Menoh is a result of my project under PFN’s 20% rule. It’s our company policy that allows PFN members to spend 20% of their time at work on their favorite tasks or projects, aside from formally assigned tasks. At PFN, we have various other 20% projects and study sessions both by individuals and groups progressing at the moment.  

As a matter of fact, Menoh is based on a library called Instant, which I developed as a personal project in December 2017. Since then, I have taken advantage of the 20% time to enhance its functionality. Along the way, some of my colleagues gave me valuable advice on how to better design it, and others volunteered to write wrappers for other languages. Thanks to the support of all these members, Instant has finally been released as an experimental product in pfnet-research under the new name Menoh. I plan to continue spending 20% of my time improving it. I hope you will use Menoh, and I would appreciate it if you would open issues for suggestions or any bugs you may find.

Research Activities at Preferred Networks

Takuya Akiba

2018-06-18 15:03:33

Hello, I am Takuya Akiba, a newly appointed corporate officer doubling as chief research strategist. I would like to make an inaugural address as well as share my view on research activities at PFN.

What does research mean at PFN?

It is very difficult to draw a line between what is research and what is not, and it is not worthwhile to go out of your way to define it. Research means to master something by sharpening one’s thinking. It is usually understood that research is to deeply investigate into and study a subject in order to establish facts and reach new conclusions about it.

Almost all projects at PFN are challenging, entail great uncertainty, and require no small amount of research. In most cases, research and development of core deep learning technologies, not to mention their applications, does not go well without selecting an appropriate method or devising a nontrivial technique according to a task or data. We are also dealing with unknown problems that arise when trying to combine technologies in multiple fields such as robotics, computer vision, and natural language processing. In addition to that, when we design a cluster, manage its resources, and work on a deep learning framework, there are many things to consider and solve by trial and error in order to make them useful and highly efficient while satisfying requirements that are specific to deep learning at the same time.

Among them, especially the following projects involve a great deal of research:

  • Academic research whose findings are worth publishing in papers
  • Preparing and giving demonstrations at exhibitions
  • Participating in competitions
  • Solving open social problems that have been left unsolved

We have already started producing excellent results in these activities, with our papers continuously being accepted at a wide range of top conferences, including ICML, CVPR, ACL, and CHI. We are not only publishing more papers than before, but our papers are also receiving global attention. One of our researchers won the Best Paper Award on Human-Robot Interaction at ICRA’18, while another researcher recently had a paper selected for an oral presentation at ICLR’18. With regard to demonstrations, we displayed our work at several exhibitions including CES 2016 and ICRA 2017. We also took part in many competitions and achieved great results at the Amazon Picking Challenge 2016, the IPAB drug discovery contest, and the like.

Why does PFN do research?

What is the point of researching what doesn’t seem to bring immediate profits to a business like PFN? For example, writing a research paper means that the researcher will need to spend a good amount of his/her precious time in the office, and publishing it would be tantamount to revealing technology to people outside the company. You may be wondering whether activities like academic research and paper writing have a negative impact on the company.

At PFN, however, we highly value such activities and will even continue to increase our focus on them. It is often said that the “winner takes all” in the competitive and borderless world of computer and AI businesses. In order to survive in this harsh business environment, we need to obtain a world-class technological strength through these activities and retain a competitive edge to stay ahead of rivals. Building a good patent portfolio is practically important as well.

Also, I often hear some say, “Isn’t it more efficient to focus on practical applications of technologies in papers published by others?” It is certain, however, that leading organizations in the world will be far ahead by the time those papers come out and catch our eyes. Besides, the information we can get from reading papers is very limited. Often times, we need to go through a process of trial and error or ask authors before successfully reproducing the published result or need to apply it to other datasets to learn its negative aspect that is not written in the paper. These would take an incredible amount of time.  Alan Kay, who is known as the father of personal computers, once said: “The best way to predict the future is to invent it.” Now that we have made one great achievement after another in multiple research fields, his words are beginning to hit home. They carry a great sense of reality.

Furthermore, we not only do research within the company but also place great importance on presenting our results to contribute to the community. This not only helps make our presence felt both in and outside Japan, but it will eventually accelerate the advances of the technology necessary to realize our objectives if we can inspire other professionals in the world to undertake follow-on research based on the techniques we publish. This is why we are very active in making the code and data used in our research open to the public as well as releasing software as OSS. Our researchers also peer-review papers for academic journals during work hours as part of our contribution to the academic community.

What kind of research are we promoting?

We are working on an extensive range of research fields, centering around deep learning. They include computer vision, natural language processing, speech recognition, robotics, compiler, distributed processing, dedicated hardware, bioinformatics, and cheminformatics. We will step up efforts to further promote these research activities based on the following philosophy.

Legitimately crazy

Any research should be conducted not only by looking at the world today but also with an eye toward the future. Nor should the value of research be judged only by today’s common knowledge. An impractical method that requires heavy computation, or a massive experiment that no one dares to run in today’s computing environment, is not necessarily a bad thing. For example, last year we succeeded in a high-profile project in which we completed training an image recognition model within minutes through distributed processing on 1,024 GPUs. Not only was the unprecedented speed we achieved extraordinary, but the scale of the experiment itself, using 1,024 GPUs all at once, was out of the ordinary. It may not be realistic to use 1,024 GPUs for ordinary training. Then, is research like this not worth conducting?

Computational speed continues to improve. Especially for deep learning, people are keen to develop dedicated chips. According to an analysis released by OpenAI, the computational power used in large-scale deep learning training has been doubling every 3.5 months. Settings that seem incredible now may become commonplace and widely available in several years. Knowing what will happen and what will be a problem at that time, thinking about how to solve those problems and what we will be able to do, and quickly embarking on such far-sighted action, is extremely important. The experiment using 1,024 GPUs mentioned above was the first step in our endeavor to create an environment that makes such large-scale experiments nothing out of the ordinary. We are taking advantage of having a private supercomputer and a team specializing in parallel and distributed computing to realize this.

Out into the world

You should aspire to lead the world in your research, regardless of the field. Having a technological strength that is a cut above the rest of the world can bring great value. Rather than turning inward, you should look outside the company and take the lead. Publishing a paper that is highly recognized by global researchers, placing among the top teams in a competition, or being invited to give a lecture on a spotlighted subject: these are the kinds of activities you should aim for. In reality, it may be difficult to outdistance the world in every research area. But when you are conscious of and aiming for the top spot, you will know where you stand relative to the most advanced research in the world.

It is also very important to work your way into the international community. If you become acquainted with leading researchers and they recognize you as someone to be reckoned with, you will be able to exchange valuable information with them. Therefore, PFN encourages its members to give talks outside the company and makes sure to publicize those who have made such contributions.

Go all-out to expand

Any research should not be kept behind closed doors but expanded further. For example, compiling a paper on your research is an important milestone, but it’s not the end of your research project. You shouldn’t undertake research just for the sake of writing a paper. In deep learning, a common technique can sometimes work effectively across different application fields. I have high hopes that PFN members will widen the scope of their research for broader applications by working with members from different study areas. Having people with a wide variety of expertise is one of our company’s strengths. If possible, you should also consider developing new software or giving feedback to make in-house software more serviceable. It would also be great if your research resulted in improving day-to-day business activities. Although I emphasized the importance of the number of research papers accepted at top conferences, I have no intention of evaluating R&D activities solely by the number of papers or the ranking of the conferences that accepted them.

To break into one of the top places, you need to utilize your skills fully while staying highly motivated. Having said that, you don’t need to do everything by yourself. You should positively consider relying on someone who has an ability that you don’t have. This applies not only to technical skills but also to paper writing. Even if you put a lot of effort into your research and made interesting findings, your paper could be underestimated, and thus rejected by a conference, due to misleading wording or other problems caused by a lack of experience in writing good papers. PFN has many senior researchers with years of experience in basic research who can teach young members not only paper writing but also how to conduct a thorough investigation and how to compare experiments correctly. I will ensure that our junior members can receive the support of these experienced researchers.

The appeal of working on R&D at PFN

What are the benefits of engaging in research and development at PFN for researchers and engineers?

One of the most attractive points is that your superb individual skills as well as organizational technical competence are truly being sought after and can make a big difference in PFN’s technical domains, mainly deep learning. This means that the difference of technical skills, whether they are individual or team, will be hugely reflected on the outcome of research. So, having high technological skills will lead directly to a high value. Your individual skills and the ability to put them to good use in a team are highly regarded.  This is particularly a good thing if you are confident about or motivated to improve your technical capability.

It is also worth mentioning that we have flexibility in the way we do research. Some researchers devote 100% of their time to pure basic research; they have formed a team entirely dedicated to it, which we even plan to expand. Some handle business-oriented problems while pursuing their main research activities. Joint research with academia is also actively being carried out. Some members work part-time while taking a doctoral course in graduate school to polish their expertise.

We are also putting extra effort into enhancing our in-house systems to promote R&D activities. PFN fully supports members taking on new challenges by trusting them, giving them considerable discretion, and flexibly dealing with requests to improve such in-house systems or for assets that are not available in the company. For example, all PFN members are eligible to spend up to 20% of their work hours at their own discretion. This 20% rule enables us to test our ideas right away, so I am expecting our motivated members to produce unique ideas and launch new initiatives one after another.

Everything from the algorithm, to software framework, to research supporting middleware, and to hardware is important in deep learning and other technical domains that PFN engages in.  It is also one of the appealing points that at PFN you get to chat with experts in a wide range of research fields such as deep learning, reinforcement learning, computer vision, natural language processing, bioinformatics, high-performance computation, distributed system, network, robotics, simulation, data analysis, optimization, and anomaly detection. You can ask them about subjects you’re not familiar with, exchange practical problems, work together on a research subject, and so on.

In conclusion

Finally, let me write a little bit about my personal aspirations. I have been given the honor, more than I deserve, of serving as corporate officer and chief research strategist at a company where many esteemed professionals are doing splendid work in a wonderful team whose great abilities keep inspiring me every day. At first, I hesitated over whether to accept this important role that seemed too big for someone like me, afraid that I might not be able to live up to their expectations.

I was a researcher in academia before joining PFN and worked as an intern for several corporate labs outside Japan in my university days because I was interested in becoming a researcher in a corporate environment. During one of the internships, they carried out layoffs, and I saw right before my eyes all researchers in the lab, including my mentor, being dismissed.  I experienced firsthand the toughness of continuing to make research activities meaningful enough for a company.

Despite the bitter experience, I believe PFN should promote research as a corporate activity and generate value from maintaining it in a healthy state. This is not an easy but very exciting and meaningful task, and this is exactly the area where my experiences and knowledge obtained in various places could be useful. So, I decided to do my best to make contributions in this new role.

I excel at combining several areas of my expertise, such as research, engineering, deep learning, and distributed computation, to create new value, as well as at devising and executing a competitive strategy. I will try to use these strengths to the fullest across broader areas.

PFN is looking for researchers and engineers who are enthusiastic about working with us on these research activities.

CHI 2018 and PacificVis 2018

Fabrice Matulic

2018-05-08 12:02:31

This is Fabrice, Human Computer Interaction (HCI) researcher at PFN.

While automated systems based on deep neural networks are making rapid progress, it is important not to neglect the human factors involved in those processes, an aspect that is frequently referred to as “human in the loop”. In this respect, the HCI research community is well positioned to not only utilise advanced machine learning techniques as tools to create novel user-centred applications, but also to contribute approaches to facilitate the introduction, use and management of those complex tools. The information visualisation (InfoVis) community has started to shed some light into the “black box” of deep neural networks by proposing visualisations and user interfaces that help practitioners better understand what is happening inside it. PFN is closely following what is going on in HCI and InfoVis/Visual Analytics research and also aims to contribute in those areas.

PacificVis

The 11th IEEE Pacific Visualization Symposium (PacificVis 2018), which PFN sponsored and attended, was held in Kobe in April. Machine learning was well covered with several contributions in that area, including the first keynote by Prof. Shixia Liu of Tsinghua University on “Explainable Machine Learning” and the best paper “GANViz: A Visual Analytics Approach to Understand the Adversarial Game“, which followed in the footsteps of the best paper of IEEE VIS’17 about a visual analytics system for TensorFlow. Those contributions are closely related to Explainable Artificial Intelligence (XAI), an effort to produce machine learning techniques based on explainable models and interfaces that can help users understand and interpret why and how automated systems come to particular decisions or results. Whether those algorithms and tools will be sufficient to fulfil the right to explanation of the EU’s new General Data Protection Regulation (GDPR) remains to be seen.

CHI


The ACM Conference on Human Factors in Computing Systems (CHI) is the premier international conference of Human-Computer Interaction. This year it took place in Montreal, Canada, with attendance exceeding 3300 participants and an official welcome letter by Prime Minister Justin Trudeau.

A common use of machine learning in HCI is to detect or recognise patterns from complex sensor data in order to realise novel interaction techniques, e.g. palm contact detection from raw touch data, handwriting recognition using pen tip motion and writing sound. With the wide availability of deep learning frameworks, HCI researchers have integrated those new tools in their arsenal to increase the recognition performance for previous techniques or to create entirely new ones, which would have been ineffective or difficult to realise using old methods. Good examples of the latter are systems enabled by generative nets. For instance, DeepWriting is a deep generative model that can generate handwriting from typeset text and even beautify or mimic handwriting styles. ExtVision, which is inspired by IllumiRoom, automatically generates peripheral images using conditional adversarial nets instead of using actual content.

Aksan, E., Pece, F. and Hilliges, O. DeepWriting: Making Digital Ink Editable via Deep Generative Modeling. Code made available on Github.

Two other categories of applications of machine learning that we increasingly see in HCI are for interaction prediction and emotional state estimation. In the former category, Li, Bengio (Samy) and Bailly investigated how DNNs can predict human performance in interaction tasks using the example of vertical menu selection. For emotion and state recognition, in addition to an introductory course by Lex Fridman from MIT on “deep learning for understanding the human”, two papers about estimating cognitive load from eye pupil movements in videos and EEG signals were presented. With the non-stopping proliferation of sensors in mobile and wearable devices, we are bound to see more and more “smart” systems that seek to better understand people and anticipate their moves, for good or bad.

CHI also includes many vis contributions and this year was no exception. Of particular relevance for visual exploration of big data and DNN understanding was the work by Cavallo and Demiralp, who created a visual interaction framework to improve exploratory analysis of high-dimensional data using tools to navigate in a reduced dimension graph and observe how modifying the reduced data affects the initial dataset. The examples using autoencoders on MNIST and QuickDraw, where the user draws on input samples to see how results change, are particularly interesting.

Cavallo M, Demiralp Ç. A Visual Interaction Framework for Dimensionality Reduction Based Data Exploration.

I should also mention DuetDraw, a prototype that allows users and AI to sketch collaboratively and which uses PaintsChainer!

Multiray: Multi-Finger Raycasting for Large Displays

My contribution to CHI this year was not related to machine learning. It involved interacting with remote displays using multiple rays emanating from the fingers. This work with Dan Vogel, which received an honourable mention, was done while I was at the University of Waterloo. The idea is to extend single-finger raycasting to multiple rays using two or more fingers in order to increase the interaction vocabulary, in particular through a number of geometric shapes that users form with the projected points on the screen.

Matulic F, Vogel D. Multiray: Multi-Finger Raycasting for Large Displays

Final thoughts

So far, it is mostly the vis community that has tackled the challenge of opening up the black box of DNNs, but being focused on visualisation, many of the proposed tools have only limited interactive capabilities, especially when it comes to tweaking input and output data to understand how it affects the neurons of the inner layers. This is where HCI researchers need to step up and create the tools to support dynamic analysis of DNNs with possibilities to interactively make adjustments to the models. HCI approaches are also needed to improve the other processes of machine-learning pipelines in which humans are involved, such as data labelling, model selection and integration, data augmentation and generation etc. I think we can expect to see an increasing amount of work addressing those aspects at future CHIs and other HCI venues.

NIPS’17 Adversarial Learning Competition

Takuya Akiba

2018-04-20 18:03:49

PFN members participated in the NIPS’17 Adversarial Learning Competition, a competition held on Kaggle as one of the additional events of NIPS’17, the international conference on machine learning, and we came in fourth place. As a result, we were invited to give a presentation at NIPS’17 and have also written and published a paper explaining our method. In this article, I will describe the details of the competition as well as the approach we took to achieve fourth place.

What is an Adversarial Example?

Adversarial examples [1, 2, 3] are a very hot research topic and are said to be one of the biggest challenges facing the practical application of deep learning. Take image recognition, for example. It is known that adversarial examples can cause a CNN to misclassify images just by making modifications to the original images that are too subtle for humans to notice.

(Figure: sample adversarial example images from ref. [2])

The above are sample images of adversarial examples (ref. [2]). The left image is a picture of a panda that is correctly classified as a panda by a CNN. In the middle is maliciously crafted noise. The right image looks the same as the left panda, but it has the slight noise superimposed on it, causing the CNN to classify it not as a panda but as a gibbon with a very high confidence level.

  • [1] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian J. Goodfellow, Rob Fergus: Intriguing properties of neural networks. CoRR abs/1312.6199 (2013)
  • [2] Ian J. Goodfellow, Jonathon Shlens, Christian Szegedy: Explaining and Harnessing Adversarial Examples. CoRR abs/1412.6572 (2014).

NIPS’17 Adversarial Learning Competition

The NIPS’17 Adversarial Learning Competition we took part in was, as the name suggests, a competition on adversarial examples. I will explain the two types of competition events: the Attack and Defense tracks.

Attack Track

You must submit a program that adds noise to input images with malicious intent to convert them to adversarial examples. You will earn points depending on how well the adversarial images generated by your algorithm can fool image classifiers submitted in the defense track by other competitors. To be specific, your score will be the average rate of misclassifications made by each submitted defense classifier. The goal of the attack track is to develop a method for crafting formidable adversarial examples.

Defense Track

You must submit a program that returns a classification result for each input image. Your score will be the average of accuracy in classifying all adversarial images generated by each adversarial example generator submitted in the attack track by other teams. The goal of defense track is to build a robust image classifier that is hard to fool.

Rules in Detail

Your programs will have to process multiple images. Adversarial programs in the attack track are only allowed to generate noise up to the parameter ε, which is given when they are run. Specifically, attacks can change the R, G, and B values of each pixel of each image only by up to ε. In other words, the L∞ norm of the noise needs to be equal to or less than ε. The attack track is divided into non-targeted and targeted subsections, and we participated in the non-targeted competition, which is the focus of this article. For more details, please refer to the official competition pages [4, 5, 6].

Standard Approach for Creating Adversarial Examples

We competed in the attack track. First, I will describe standard methods for creating adversarial examples. Roughly speaking, the most popular method, FGSM (fast gradient sign method) [2], and almost all other existing methods take the following three steps:

  1. Classify the input image with an image classifier
  2. Use backpropagation all the way back to the image to calculate the gradient
  3. Add noise to the image using the calculated gradient


Methods for crafting strong adversarial examples have been developed by exercising ingenuity in deciding whether these steps should be carried out only once or repeated, how the loss function used in backpropagation should be defined, how the gradient should be used to update the image, among other factors. Similarly, most of the teams seemed to have used this kind of approach to build their attacks in the competition.
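As a concrete example of this three-step recipe, here is a minimal FGSM-style sketch in Chainer. It assumes model is a trained Chainer classifier link that maps an image batch to logits; it is only an illustration of the standard approach, not the code used in the competition.

# Minimal FGSM-style sketch in Chainer, following the classify -> backpropagate -> add-noise
# steps above. `model` is assumed to be a trained classifier link returning logits.
import numpy as np
import chainer
import chainer.functions as F

def fgsm(model, images, labels, eps):
    x = chainer.Variable(images.astype(np.float32))     # step 1: classify the input images
    loss = F.softmax_cross_entropy(model(x), labels)
    model.cleargrads()
    loss.backward()                                      # step 2: gradient with respect to the images
    adv = images + eps * np.sign(x.grad)                 # step 3: add sign-of-gradient noise
    return np.clip(adv, 0.0, 1.0)                        # keep pixel values in a valid range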

Our Method

Our approach was to create a neural network that produces adversarial examples directly, which differs greatly from the current major approach described above.


The process to craft an attack image is simple: all you need to do is feed an image to the neural network. It will then generate an output image, which is an adversarial example in itself.

How We Trained the Attack Network

The essence of this approach was, of course, how we created the neural network. We henceforth call the neural network that generates adversarial examples the “attack network.” We trained the attack network by repeating the following steps:

  1. The attack network generates an adversarial example.
  2. An existing trained CNN classifies the generated adversarial example.
  3. Backpropagation through the CNN calculates the gradient with respect to the adversarial example.
  4. Further backpropagation through the attack network updates the network with the gradient.


We designed the architecture of the attack network to be fully convolutional. A similar approach has been proposed in the following paper [7] for your reference.

  • [7] Shumeet Baluja, Ian Fischer. Adversarial Transformation Networks: Learning to Generate Adversarial Examples. CoRR, abs/1703.09387, 2017.
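The following is only a schematic Chainer sketch of this training loop under our own simplifying assumptions (not the architecture or objectives from our paper): the attack network outputs a perturbation bounded by ε via tanh, and training simply maximizes the classifier's loss on the true labels.

# Schematic sketch of the attack-network training loop (simplified assumptions, not the
# actual method from the paper): the attack network emits a tanh-bounded perturbation,
# and training maximizes the classifier's loss so that its output fools the classifier.
import numpy as np
import chainer
import chainer.functions as F
from chainer import optimizers

def train_attack_network(attack_net, classifier, batches, eps):
    # `batches` yields (images, labels) NumPy pairs; `classifier` is a fixed trained CNN.
    opt = optimizers.Adam()
    opt.setup(attack_net)
    for images, labels in batches:
        x = chainer.Variable(images.astype(np.float32))
        noise = eps * F.tanh(attack_net(x))              # step 1: generate bounded noise
        adv = F.clip(x + noise, 0.0, 1.0)
        logits = classifier(adv)                         # step 2: classify the adversarial batch
        loss = -F.softmax_cross_entropy(logits, labels)  # steps 3-4: maximize the classifier's error
        attack_net.cleargrads()
        loss.backward()                                  # gradient flows back into the attack network
        opt.update()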

 

Techniques to Boost Attacks

We developed techniques such as multi-target training, multi-task training, and gradient hint in order to generate more powerful adversarial examples, devising the architecture of the attack network and the training method through repeated trial and error. Please refer to our paper for details.

Distributed Training on 128 GPUs Combining Data and Model Parallelism

To address the significant training time and to allow a large-scale attack network architecture, we used ChainerMN [8] to train the network in a distributed manner on 128 GPUs. We considered two factors in particular: the attack network is larger than the classifier CNN, so the batch size has to be reduced to fit in GPU memory, and in the aforementioned multi-target training each worker uses a different classifier network. We therefore decided to combine standard data parallelism with the latest model-parallel functionality of ChainerMN to achieve effective data parallelism.


  • [8] Takuya Akiba, Keisuke Fukuda, Shuji Suzuki: ChainerMN: Scalable Distributed Deep Learning Framework. CoRR abs/1710.11351 (2017)

Generated Images

In our approach, not only the method we used but also the generated adversarial examples are very unique.


Original images are in the left column, generated adversarial examples are in the middle, and the generated noise is in the right column (i.e., the differences between the original images and the adversarial examples). We can observe two distinguishing features from the above.

  • Noise was generated to cancel the fine patterns such as the texture of the panda’s fur, making the image flat and featureless.
  • Jigsaw puzzle-like patterns were added unevenly but effectively by using the original images wisely.

Because of these two features, many image classifiers seemed to classify these adversarial examples as jigsaw puzzles. It is interesting to note that we did not specifically train the attack network to generate such puzzle-like images. We trained it with objective functions that reward crafting images that mislead image classifiers; evidently, the attack network automatically learned that generating such jigsaw-puzzle-like images was effective.

Results

Finally, we came in fourth place among about 100 teams. Although I was personally disappointed by this result, as we were aiming for the top place, we had the honor of giving a talk at the NIPS’17 workshop, since only the top four teams were invited to do so.


At the invitation of the organizers of the event, we also co-authored a paper related to the competition with big names in machine learning such as Ian Goodfellow and Samy Bengio. It was a good experience to publish a paper with such great researchers [9]. We have also made the source code available on GitHub [10].

  • [9] Alexey Kurakin, Ian Goodfellow, Samy Bengio, Yinpeng Dong, Fangzhou Liao, Ming Liang, Tianyu Pang, Jun Zhu, Xiaolin Hu, Cihang Xie, Jianyu Wang, Zhishuai Zhang, Zhou Ren, Alan Yuille, Sangxia Huang, Yao Zhao, Yuzhe Zhao, Zhonglin Han, Junjiajia Long, Yerkebulan Berdibekov, Takuya Akiba, Seiya Tokui, Motoki Abe. Adversarial Attacks and Defences Competition. CoRR, abs/1804.00097, 2018.
  • [10] pfnet-research/nips17-adversarial-attack: Submission to Kaggle NIPS’17 competition on adversarial examples (non-targeted adversarial attack track): https://github.com/pfnet-research/nips17-adversarial-attack

While our team was ranked fourth, we had been attracting attention from other participants even before the competition ended because our run time was very different in nature from that of other teams. This is attributed to the completely different approach we took. The table below lists the top 15 teams with their scores and run times. As you can see, our team’s run time was an order of magnitude shorter. This is because our attack only computes a forward pass, and thus has a short calculation time, whereas almost all approaches taken by other teams repeat forward and backward computations using gradients of the images.

(Table: top 15 teams with their scores and run times)

In fact, according to a PageRank-style analysis conducted by one of the participants, our team got the highest score. This indicates our attack was especially effective against top defense teams. It must have been difficult to defend against our attack which was different in nature from other attacks. For your information, a paper describing the method used by the top team [11] has been accepted by the computer vision international conference CVPR’18 and is scheduled to be presented in the spotlight session.

  • [11] Yinpeng Dong, Fangzhou Liao, Tianyu Pang, Xiaolin Hu, Jun Zhu: Discovering Adversarial Examples with Momentum. CoRR abs/1710.06081 (2017)

Conclusions

Our efforts to participate in the competition started as part of our company’s 20% projects. Once things got going, we began to think we should concentrate our efforts and aim for the top place. After some coordination, our team got into full gear and, toward the end, spent almost all of our work hours on this project. PFN has an atmosphere that encourages its members to participate in competitions like this; other PFN teams have competed in the Amazon Picking Challenges and the IT Drug Discovery Contest, for example. I like taking part in these kinds of competitions very much and will continue to engage in such activities on a regular basis, while choosing competitions related to the challenges our company wants to tackle. Quite often, I find the skills honed through these competitions useful at critical moments of our company projects, such as tuning accuracy or speed.

PFN is looking for engineers and researchers who are enthusiastic about working with us on these kinds of activities.

Guest blog with Hai, a former intern at PFN

hido

2018-04-09 17:34:34

This is a guest post in an interview style with Hai Nguyen, who interned at Preferred Networks in the summer of 2017 and whose research has been accepted at one of the NIPS 2017 workshops. After finishing the PFN internship, he joined Kyoto University as a Ph.D. student.

“Semi-supervised Learning of Hierarchical Representations of Molecules Using Neural Message Passing,” Hai Nguyen, Shin-ichi Maeda, and Kenta Oono; NIPS Workshop on Machine Learning for Molecules and Materials, 2017. (Link, arXiv)


The PFN spirit that we put in the required qualifications – “Qualified applicants must be familiar with all aspects of computer science”

Toru Nishikawa

2018-03-06 12:29:19

*Some of our guidelines for applicants have already been updated, based on the content of this blog post, so that our true intent is conveyed properly.

 

Hello, this is Nishikawa, CEO of PFN.

I am going to write about one of our hiring requirements.

It is about the wording in the job section of our website used to describe one of the qualifications/requirements for researchers – “Researchers must be seitsu (精通, “familiar with” in Japanese) all aspects of computer science.” We have always had this requirement since the times of PFI because we truly believe in the importance of having a deep knowledge of not just one specific branch but various areas when doing research on computer science.

Take database research, for example. It is essential to have a thorough knowledge not only of the theory of transaction processing and relational algebra, but also of storage and the computer architecture that runs a database. Researchers are also required to know about computer networks, now that distributed databases have become common. In today’s deep learning research, a single computer cannot produce competitive results, so highly efficient parallel processing is a must. For creating a framework, understanding computer architecture and language processors is vital. When creating a domain-specific language without an understanding of programming language theory, you will easily end up making a language that looks like an annex added to a building as an afterthought. In reinforcement learning, it is important to refine simulation and rendering technologies.

In short, we live in an age when someone who knows about only one particular area can no longer have an advantage. Furthermore, it is difficult to know in advance which areas of computer science should be fused to generate new technology. In order to realize our mission, that is, to make breakthroughs with cutting-edge technologies, it is extremely important to strive to familiarize oneself with each and every branch of computer science.

This familiarity, a comprehensive knowledge and deep understanding in every field of computer science, is expressed by the Japanese word seitsu mentioned in the first paragraph. The word does not mean you can publish papers in top conferences – that would require not only seitsu but also the ability to conduct new groundbreaking research. (Being able to perform such research is a very important skill that we also need to acquire.) It also does not mean to “know everything” about each field. Someone who declares he knows everything is, rather, not a scientist.

The field of computer science is making rapid progress and we must always pursue its advancement. Sometimes I come across comments making fun of the passage “with all aspects of computer science” on social media, but the message we put into the job requirement has played an important role in shaping PFN culture and so it has remained to date. We will continue to stick to this principle. That said, we also understand the need to come up with an expression that is not misleading. The domain of computer science has been expanding rapidly over the past decade. This trend will no doubt continue to accelerate. New fields of study will emerge after combining many different fields within and outside of the computer science domain. Considering this, we should revise the employment condition in light of the following factors:

 

・It will become more important to absorb changes and progress that will be made in computer science and become acquainted with new fields that will come out in the future rather than being well-versed in all aspects of computer science at this point. (It is, of course, still necessary to have an extensive knowledge.) We will treat an applicant’s eagerness and passion for learning as more important than his current knowledge.

・We value an applicant’s forward-looking attitude toward deepening an understanding of not only computer science but also other fields such as software engineering, life science and mechanical engineering.

・We welcome not only experts in the artificial intelligence field but also specialists in various areas of expertise to make innovation by combining new technologies.

 

The criterion has been applied only to researchers, but I believe it is crucial for everyone to be united to open up a path to new technology with no distinction between researchers and engineers because researchers need to have some engineering knowledge while engineers need to make efforts to understand research as well. Therefore, we will make this a requirement for both researchers and engineers.

It is also an important duty of mine to create a workplace in which all of PFN's valuable members can do their best to innovate and create new technology, and I will continue to actively work on this.

 

PFN is looking for talented people with diverse expertise in various fields. If you are interested in working with us, please apply at the following link:

https://www.preferred-networks.jp/en/job

We have released ChainerUI, a training visualizer and manager for Chainer

ofk

2017-12-20 10:58:31

We have released ChainerUI, to help visualize training results and manage training jobs.

Among Chainer users, there has been demand for ways to watch the progress of training jobs and to compare runs by plotting the training loss, accuracy, and other logged values of multiple runs. These tasks tend to be cumbersome because no suitable application was available. ChainerUI offers the functions listed below to support your DNN training routine.

  • Visualizing training logs: plot values such as loss and accuracy
  • Managing histories of training jobs together with their experimental conditions
  • Operating training jobs: take a snapshot or modify hyperparameters such as the learning rate during training

ChainerUI consists of a web application and an extension module for Chainer's Trainer, which makes it easy to use with existing training scripts. If you have already used the LogReport extension, you can watch training logs in a web browser without any change. If you add the other ChainerUI extensions, more experimental conditions are displayed in the results table and the training job can be managed from ChainerUI.

Visualizing training logs


ChainerUI monitors the training log file and plots values such as loss and accuracy. Users can choose which variables to plot on the chart.

Managing training jobs


ChainerUI's web application shows a list of multiple training jobs in the results table along with their experimental conditions. In addition, you can take actions such as taking a snapshot or modifying hyperparameters from the job control panel.

How to use

To install, use pip, and then set up the ChainerUI database.

pip install chainerui
chainerui db create
chainerui db upgrade

Next, register a "project" and run the server. A "project" is the repository or directory that contains your Chainer-based scripts.

chainerui project create -d PROJECT_DIR [-n PROJECT_NAME]
chainerui server

You can also call chainerui project create while the server is running.

Finally, open http://localhost:5000/ in a web browser, and you are ready!

Visualizing training logs

The standard LogReport extension included in Chainer exports a "log" file. ChainerUI watches that "log" file and plots a chart automatically. The following commands run the MNIST example to show plotting with ChainerUI.

chainerui project create -d path/to/result -n mnist
python train_mnist.py -o path/to/result/1

The result in "…result/1" is added to the "mnist" project. ChainerUI continuously watches the "log" file updated in "path/to/result/1" and plots the values written in the file.
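For reference, below is a minimal sketch of a training script that produces such a "log" file using only Chainer's standard LogReport extension. The small two-layer MLP and the hyperparameters here are illustrative choices, not the contents of the official examples/train_mnist.py.

# Minimal Chainer training script sketch: the only ChainerUI-relevant part is
# the standard LogReport extension, which writes the "log" file ChainerUI watches.
import chainer
import chainer.functions as F
import chainer.links as L
from chainer import training
from chainer.training import extensions

class MLP(chainer.Chain):
    def __init__(self, n_units=100, n_out=10):
        super(MLP, self).__init__()
        with self.init_scope():
            self.l1 = L.Linear(None, n_units)
            self.l2 = L.Linear(None, n_out)

    def __call__(self, x):
        return self.l2(F.relu(self.l1(x)))

def main():
    model = L.Classifier(MLP())
    optimizer = chainer.optimizers.Adam()
    optimizer.setup(model)

    train, _ = chainer.datasets.get_mnist()
    train_iter = chainer.iterators.SerialIterator(train, 100)

    updater = training.StandardUpdater(train_iter, optimizer)
    trainer = training.Trainer(updater, (5, 'epoch'), out='path/to/result/1')

    # LogReport periodically dumps loss/accuracy to <out>/log as JSON;
    # ChainerUI plots this file without further changes to the script.
    trainer.extend(extensions.LogReport())
    trainer.run()

if __name__ == '__main__':
    main()

Running such a script with its output directory placed under a registered project directory is enough for ChainerUI to start plotting.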

Managing training jobs

ChainerUI monitors the "args" file located in the same directory as the "log" file, and shows its contents in the results table as experimental conditions. The "args" file holds key-value pairs in JSON format.
The sample code below shows how to save the "args" file using ChainerUI's utility function.

import argparse

# [ChainerUI] import chainerui util function
from chainerui.utils import save_args

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument('--out', '-o', default='result',
                        help='Directory to output the result')
    args = parser.parse_args()

    # [ChainerUI] save 'args' to show experimental conditions
    save_args(args, args.out)

To operate training jobs, set CommandsExtension in the training script. This extension supports taking a snapshot and changing hyperparameters such as the learning rate while the training job is running.

from chainer import training

# [ChainerUI] import CommandsExtension
from chainerui.extensions import CommandsExtension

def main():
    # ... set up args, the model, and the updater as in a usual training script ...
    trainer = training.Trainer(updater, (args.epoch, 'epoch'), out=args.out)

    # [ChainerUI] enable sending commands from ChainerUI
    trainer.extend(CommandsExtension())

To see the whole code, refer to examples/train_mnist.py.

Background

ChainerUI was mainly developed by Inagaki-san and Kobayashi-san, who participated in the summer internship at Preferred Networks this year.

During the two months of their internship program, they specified user requirements and implemented a prototype. They have continued to contribute after the internship as part-time workers. They are proud to release their work as “ChainerUI.”

Future plan

ChainerUI is being developed under the Chainer organization. The future plan includes the following functions.

  • Output chart as image file
  • Add other extensions to operate training script, etc.

We are also hiring front-end engineers to work on these features! We look forward to receiving your applications.

Release Chainer Chemistry: A library for Deep Learning in Biology and Chemistry

Kosuke Nakago

2017-12-18 11:40:20

 

* A Japanese version of this blog post is also available here.

We released Chainer Chemistry, a Chainer [1] extension to train and run neural networks for tasks in biology and chemistry.

The library helps you to easily apply deep learning on molecular structures.

For example, you can apply machine learning to toxicity classification tasks or to HOMO (highest occupied molecular orbital) level regression tasks with compound input.

The library was developed during the PFN 2017 summer internship, and part of the library has been implemented by an internship student, Hirotaka Akita at Kyoto University.

 

Supported features

Graph Convolutional Neural Network implementation

The recently proposed Graph Convolutional Network (see below for details) opened the door to applying deep learning to "graph structure" inputs, and Graph Convolutional Networks are currently an active area of research. We implemented several Graph Convolutional Network architectures, including a network introduced in a paper published this year.

The following models are implemented:

  • NFP: Neural Fingerprint [2, 3]
  • GGNN: Gated-Graph Neural Network [4, 3]
  • WeaveNet: Molecular Graph Convolutions [5, 3]
  • SchNet: A continuous-filter convolutional Neural Network [6]

 

Common data preprocessing/research dataset support

Various datasets can be used with a common interface with this library. Also, some research datasets can be downloaded automatically and preprocessed.

The following datasets are supported:

  • QM9 [7, 8]: a dataset of organic molecular structures with up to nine C/O/N/F atoms and their computed physical property values, including HOMO/LUMO levels and internal energy. The computation is at the B3LYP/6-31G level of quantum chemistry.
  • Tox21 [9]: dataset of toxicity measurements on 12 biological targets

 

Train/inference example code is available

We provide example code for training models and running inference, so you can easily try the models implemented in this library for a quick start.

 

Background

In the fields of new material discovery and drug discovery, simulation of molecular behavior is important. When quantum effects need to be taken into account with high precision, DFT (density functional theory) is widely used. However, it requires a lot of computational resources, especially for large molecules, which makes it difficult to run simulations over many molecular structures.

There is a different approach from the machine learning field: learn from data measured or calculated in previous experiments, and predict the chemical properties of molecules that have not been examined yet. A neural network may compute such a prediction far faster than a quantum simulation.

 

Cited from "Neural Message Passing for Quantum Chemistry", Gilmer et al. https://arxiv.org/pdf/1704.01212.pdf

 

An important question is how to handle compounds as input and output in order to apply deep learning. The main problem is that molecular structures have variable numbers of atoms and are represented as different graph structures, while conventional deep learning methods deal with fixed-size, fixed-structure inputs.

The "Graph Convolutional Neural Network" has been proposed to deal with graph-structured input.

 

What is a Graph Convolutional Neural Network

Convolutional Neural Networks introduce "convolutional" layers, which apply a kernel to local information in an image. They show promising results on many image tasks, including classification, detection, segmentation, and even image generation.

Graph Convolutional Neural Networks introduce a "graph convolution" operation, which applies a kernel over the neighboring nodes of each node, in order to deal with graph structure.

 

How graph convolutions work

A CNN takes an image as input, whereas a Graph CNN can take a graph structure (such as a molecular structure) as input.
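To make the idea concrete, here is a minimal NumPy sketch of one generic graph-convolution layer. It is only an illustration of the general idea, not Chainer Chemistry's implementation: each node aggregates the features of its neighbors (plus itself) through the adjacency matrix, then applies a shared weight matrix and a nonlinearity.

# Illustrative single graph-convolution layer (generic sketch, not the library's code).
import numpy as np

def graph_conv_layer(X, A, W):
    """X: (n_nodes, in_dim) node features, A: (n_nodes, n_nodes) adjacency,
    W: (in_dim, out_dim) shared weights."""
    A_hat = A + np.eye(A.shape[0])   # add self-loops so each node keeps its own features
    H = A_hat @ X @ W                # aggregate neighbor features, then project
    return np.maximum(H, 0.0)        # ReLU

# Toy molecule-like graph: three nodes in a chain (0-1-2), 4-dimensional features.
A = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)
X = np.random.rand(3, 4)
W = np.random.rand(4, 8)
H = graph_conv_layer(X, A, W)
print(H.shape)  # (3, 8): one updated feature vector per node, regardless of graph size

Because the same weight matrix is shared across all nodes, the same layer works for molecules with any number of atoms, which is exactly what fixed-size input methods cannot do.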

Its application is not limited to molecular structures. "Graph structures" appear in many other fields, including social networks and transportation, and research on applications of graph convolutional neural networks is an interesting topic. For example, [10] applied graph convolutions to images, [11] applied them to knowledge bases, and [12] applied them to traffic forecasting.

 

Target users

  1. Deep learning researchers
    This library provides up-to-date Graph Convolutional Neural Network implementations.
    Applications of graph convolutions are not limited to biology and chemistry but extend to various other fields, and we would like many people to use this library.
  2. Material/drug discovery researchers
    The library enables users to build their own models to predict various chemical properties of molecules.

 

Future plan

This library is still a beta version, and in active development. We would like to support the following features:

  • Provide pre-trained models for inference
  • Add more datasets
  • Implement more networks

We have prepared a tutorial to help you get started with this library. Please try it and let us know if you have any feedback.

 

Reference

[1] Tokui, S., Oono, K., Hido, S., & Clayton, J. (2015). Chainer: a next-generation open source framework for deep learning. In Proceedings of workshop on machine learning systems (LearningSys) in the twenty-ninth annual conference on neural information processing systems (NIPS) (Vol. 5).

[2] Duvenaud, D. K., Maclaurin, D., Iparraguirre, J., Bombarell, R., Hirzel, T., Aspuru-Guzik, A., & Adams, R. P. (2015). Convolutional networks on graphs for learning molecular fingerprints. In Advances in neural information processing systems (pp. 2224-2232).

[3] Gilmer, J., Schoenholz, S. S., Riley, P. F., Vinyals, O., & Dahl, G. E. (2017). Neural message passing for quantum chemistry. arXiv preprint arXiv:1704.01212.

[4] Li, Y., Tarlow, D., Brockschmidt, M., & Zemel, R. (2015). Gated graph sequence neural networks. arXiv preprint arXiv:1511.05493.

[5] Kearnes, S., McCloskey, K., Berndl, M., Pande, V., & Riley, P. (2016). Molecular graph convolutions: moving beyond fingerprints. Journal of computer-aided molecular design, 30(8), 595-608.

[6] Kristof T. Schütt, Pieter-Jan Kindermans, Huziel E. Sauceda, Stefan Chmiela, Alexandre Tkatchenko, Klaus-Robert Müller (2017). SchNet: A continuous-filter convolutional neural network for modeling quantum interactions. arXiv preprint arXiv:1706.08566.

[7] L. Ruddigkeit, R. van Deursen, L. C. Blum, J.-L. Reymond, Enumeration of 166 billion organic small molecules in the chemical universe database GDB-17, J. Chem. Inf. Model. 52, 2864–2875, 2012.

[8] R. Ramakrishnan, P. O. Dral, M. Rupp, O. A. von Lilienfeld, Quantum chemistry structures and properties of 134 kilo molecules, Scientific Data 1, 140022, 2014.

[9] Huang R, Xia M, Nguyen D-T, Zhao T, Sakamuru S, Zhao J, Shahane SA, Rossoshek A and Simeonov A (2016) Tox21 Challenge to Build Predictive Models of Nuclear Receptor and Stress Response Pathways as Mediated by Exposure to Environmental Chemicals and Drugs. Front. Environ. Sci. 3:85. doi: 10.3389/fenvs.2015.00085

[10] Michaël Defferrard, Xavier Bresson, Pierre Vandergheynst (2016), Convolutional Neural Networks on Graphs with Fast Localized Spectral Filtering, NIPS 2016.

[11] Michael Schlichtkrull, Thomas N. Kipf, Peter Bloem, Rianne van den Berg, Ivan Titov, Max Welling (2017). Modeling Relational Data with Graph Convolutional Networks. arXiv preprint arXiv:1703.06103.

[12] Yaguang Li, Rose Yu, Cyrus Shahabi, Yan Liu (2017). Diffusion Convolutional Recurrent Neural Network: Data-Driven Traffic Forecasting. arXiv preprint arXiv:1707.01926.

 

MN-1: The GPU cluster behind 15-min ImageNet

doipfn

2017-11-30 11:00:05

Preferred Networks, Inc. has completed ImageNet training in 15 minutes [1,2]. This is the fastest time ever achieved for a 90-epoch ImageNet training. Let me describe the MN-1 cluster used for this accomplishment.

Preferred Networks' MN-1 cluster started operation this September [3]. It consists of 128 nodes with 8 NVIDIA P100 GPUs each, for 1024 GPUs in total. As each GPU has a theoretical peak of 4.7 TFLOPS in double-precision floating point, the total theoretical peak capacity is more than 4.7 PFLOPS (including CPUs as well). The nodes are connected with two FDR InfiniBand links (56 Gbps x 2). PFN has exclusive use of the cluster, which is located in an NTT datacenter.

MN-1 Cluster in an NTT Datacenter

On the TOP500 list published this November, the MN-1 cluster is ranked as the 91st most powerful supercomputer, with approximately 1.39 PFLOPS of maximum performance on the LINPACK benchmark [4]. Compared to traditional supercomputers, MN-1's computational efficiency (28%) is not high. One of the performance bottlenecks is the interconnect. Unlike typical supercomputers, MN-1 is connected as a thin tree (as opposed to a fat tree). A group of sixteen nodes is connected to a redundant pair of InfiniBand switches; the cluster has eight such groups, and the links between groups are aggregated in another redundant pair of InfiniBand switches. Thus, if a process needs to communicate with a node in a different group, the link between groups becomes a bottleneck, which lowers the LINPACK benchmark score.

Distributed Learning in ChainerMN

However, as stated at the beginning of this article, MN-1 can perform ultra-fast deep learning (DL). This is because ChainerMN does not require bottleneck-free communication for DL training. During training, ChainerMN collects and re-distributes parameter updates among all nodes. In the 15-minute trial, we used the ring allreduce algorithm, in which each node communicates only with its adjacent nodes in a ring topology: the gradients are accumulated in the first round, and the accumulated parameter update is distributed in the second round. Since we can form a ring without hitting the inter-group bottleneck on a full-duplex network, the MN-1 cluster can efficiently finish ImageNet training in 15 minutes with 1024 GPUs.
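The following is a toy, single-process Python simulation of this two-round scheme (a reduce-scatter round followed by an allgather round), written only to build intuition; it is not ChainerMN's or NCCL's actual implementation.

# Toy simulation of ring allreduce: P "processes", each holding an equal-length array
# split into P chunks; chunks circulate around the ring and are summed, then broadcast.
import numpy as np

def ring_allreduce(arrays):
    P = len(arrays)
    chunks = [np.array_split(a.astype(float), P) for a in arrays]

    # Round 1 (reduce-scatter): in step s, process p sends its copy of chunk (p - s) % P
    # to its right neighbor, which adds it to its own copy.
    for s in range(P - 1):
        sends = [(p, (p - s) % P, chunks[p][(p - s) % P].copy()) for p in range(P)]
        for p, c, data in sends:
            chunks[(p + 1) % P][c] += data
    # Now process p holds the fully accumulated chunk (p + 1) % P.

    # Round 2 (allgather): in step s, process p forwards the completed chunk
    # (p + 1 - s) % P to its right neighbor, which overwrites its own copy.
    for s in range(P - 1):
        sends = [(p, (p + 1 - s) % P, chunks[p][(p + 1 - s) % P].copy()) for p in range(P)]
        for p, c, data in sends:
            chunks[(p + 1) % P][c] = data

    return [np.concatenate(chunks[p]) for p in range(P)]

# 4 simulated processes, each with an 8-element gradient array.
rng = np.random.RandomState(0)
inputs = [rng.rand(8) for _ in range(4)]
outputs = ring_allreduce(inputs)
assert all(np.allclose(o, sum(inputs)) for o in outputs)

Because every step involves only neighbor-to-neighbor transfers, the physical ring can be laid out so that most traffic stays on links that avoid the inter-group bottleneck described above.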

Scalability of ChainerMN up to 1024 GPUs

[1] https://arxiv.org/abs/1711.04325

[2] https://www.preferred-networks.jp/en/news/pr20171110

[3] https://www.preferred-networks.jp/en/news/pr20170920

[4] https://www.preferred-networks.jp/en/news/pr20171114
