dc.contributor
Universitat Politècnica de Catalunya. Doctorat en Arquitectura de Computadors
dc.contributor
Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors
dc.contributor
Barcelona Supercomputing Center
dc.contributor
Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
dc.contributor.author
Njoroge Kahira, Albert
dc.contributor.author
Nguyen, Truong Thao
dc.contributor.author
Bautista Gomez, Leonardo
dc.contributor.author
Takano, Ryousei
dc.contributor.author
Badia Sala, Rosa Maria
dc.contributor.author
Wahib, Mohamed
dc.identifier
Kahira, A. [et al.]. An oracle for guiding large-scale model/hybrid parallel training of convolutional neural networks. In: International Summer School on Advanced Computer Architecture and Compilation for High-Performance and Embedded Systems. "ACACES 2021 poster abstracts: September 15, 2021, Fiuggi, Italy". European Network of Excellence on High Performance and Embedded Architecture and Compilation (HiPEAC), 2021, p. 37-40. ISBN 978-88-905806-8-0.
dc.identifier
978-88-905806-8-0
dc.identifier
https://hdl.handle.net/2117/356635
dc.description.abstract
Deep Neural Network (DNN) frameworks use distributed training to reach convergence faster and to alleviate memory capacity limitations when training large models and/or using high-dimensional inputs. With the steady increase in dataset and model sizes, model/hybrid parallelism is expected to play an important role in the future of distributed DNN training. We analyze the compute, communication, and memory requirements of Convolutional Neural Networks (CNNs) to understand how different parallelism approaches trade off performance and scalability. We use this model-driven analysis as the basis for an oracle utility that helps detect the limitations and bottlenecks of different parallelism approaches at scale. We evaluate the oracle on six parallelization strategies, with four CNN models and multiple datasets (2D and 3D), on up to 1024 GPUs. The results demonstrate that the oracle has an average accuracy of about 86.74% when compared to empirical results, and as high as 97.57% for.
dc.description.abstract
Peer Reviewed
dc.description.abstract
Postprint (published version)
dc.format
application/pdf
dc.publisher
European Network of Excellence on High Performance and Embedded Architecture and Compilation (HiPEAC)
dc.subject
UPC subject areas::Computer science::Computer architecture::Parallel architectures
dc.subject
Neural networks (Computer science)
dc.subject
Parallel processing (Electronic computers)
dc.subject
Model parallelism
dc.subject
Performance modeling
dc.subject
Deep learning
dc.title
An oracle for guiding large-scale model/hybrid parallel training of convolutional neural networks
dc.type
Conference lecture