Stencil codes on a vector length agnostic architecture

To access the full-text documents, please follow this link: http://hdl.handle.net/2117/125368

Title:	Stencil codes on a vector length agnostic architecture
Author:	Armejach Sanosa, Adrià; Caminal Pallarés, Helena; Cebrián González, Juan Manuel; González-Alberquilla, Rekai; Adeniyi-Jones, Chris; Valero Cortés, Mateo; Casas, Marc; Moreto Planas, Miquel
Other authors:	Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors; Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions
Abstract:	Data-level parallelism is frequently ignored or underutilized. Achieved through vector/SIMD capabilities, it can provide substantial performance improvements on top of widely used techniques such as thread-level parallelism. However, manual vectorization is a tedious and costly process that needs to be repeated for each specific instruction set or register size. In addition, automatic compiler vectorization is susceptible to code complexity, and usually limited due to data and control dependencies. To address some of these issues, Arm recently released a new vector ISA, the Scalable Vector Extension (SVE), which is Vector-Length Agnostic (VLA). VLA enables the generation of binary files that run regardless of the physical vector register length. In this paper we leverage the main characteristics of SVE to implement and optimize stencil computations, ubiquitous in scientific computing. We show that SVE enables easy deployment of textbook optimizations like loop unrolling, loop fusion, load trading or data reuse. Our detailed simulations using vector lengths ranging from 128 to 2,048 bits show that these optimizations can lead to performance improvements over straightforward vectorized code of up to 56.6% for 2,048-bit vectors. In addition, we show that certain optimizations can hurt performance due to a reduction in arithmetic intensity, and provide insight useful for compiler optimizers.
Abstract:	This work has been partially supported by the European HiPEAC Network of Excellence, by the Spanish Ministry of Economy and Competitiveness (contract TIN2015-65316-P), and by the Generalitat de Catalunya (contracts 2017-SGR-1328 and 2017-SGR-1414). The Mont-Blanc project receives funding from the EU's H2020 Framework Programme (H2020/2014-2020) under grant agreements no. 671697 and no. 779877. M. Moreto has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Ramon y Cajal fellowship number RYC-2016-21104. Finally, A. Armejach has been partially supported by the Spanish Ministry of Economy, Industry and Competitiveness under Juan de la Cierva postdoctoral fellowship number FJCI-2015-24753.
Abstract:	Peer Reviewed
Subject(s):	-Àrees temàtiques de la UPC::Informàtica::Arquitectura de computadors::Arquitectures paral·leles -Parallel processing (Electronic computers) -Single instruction -Multiple data -Parallel computing models -Data-level parallelism -Scalable vector extension -Vector length agnostic -Stencil computations -Processament en paral·lel (Ordinadors)
Rights:
Document type:	Article - Submitted version; Conference object
Publisher:	Association for Computing Machinery (ACM)

Related documents

Other documents by the same author

Using Arm’s scalable vector extension on stencil codes

Armejach Sanosa, Adrià; Caminal Pallarés, Helena; Cebrián González, Juan Manuel; Langarita, Rubén; González-Alberquilla, Rekai; Adeniyi-Jones, Chris; Valero Cortés, Mateo; Casas Guix, Marc; Moreto Planas, Miquel

Performance and energy effects on task-based parallelized applications: User-directed versus manual vectorization

Caminal Pallarés, Helena; Caballero de Gea, Diego; Cebrián González, Juan Manuel; Ferrer, Roger; Casas, Marc; Moreto Planas, Miquel; Martorell Bofill, Xavier; Valero Cortés, Mateo

Design trade-offs for emerging HPC processors based on mobile market technology

Armejach Sanosa, Adrià; Casas, Marc; Moreto Planas, Miquel

Graph partitioning applied to DAG scheduling to reduce NUMA effects

Sánchez Barrera, Isaac; Casas, Marc; Moreto Planas, Miquel; Ayguadé Parra, Eduard; Labarta Mancho, Jesús José; Valero Cortés, Mateo

Architectural support for task dependence management with flexible software scheduling

Castillo, Emilio; Álvarez Martí, Lluc; Moreto Planas, Miquel; Casas, Marc; Vallejo, Enrique; Bosque, Jose L.; Beivide Palacio, Ramon; Valero Cortés, Mateo
