Work-efficient parallel non-maximum suppression for embedded GPU architectures

Oro Garcia, David; Fernandez Tena, Carles; Martorell Bofill, Xavier; Hernando Pericás, Francisco Javier

dc.contributor

Universitat Politècnica de Catalunya. Departament d'Arquitectura de Computadors

dc.contributor

Universitat Politècnica de Catalunya. Departament de Teoria del Senyal i Comunicacions

dc.contributor

Universitat Politècnica de Catalunya. CAP - Grup de Computació d'Altes Prestacions

dc.contributor

Universitat Politècnica de Catalunya. VEU - Grup de Tractament de la Parla

dc.contributor.author

Oro Garcia, David

dc.contributor.author

Fernandez Tena, Carles

dc.contributor.author

Martorell Bofill, Xavier

dc.contributor.author

Hernando Pericás, Francisco Javier

dc.date.issued

2016

dc.identifier

Oro, D., Fernandez, C., Martorell, X., Hernando, J. Work-efficient parallel non-maximum suppression for embedded GPU architectures. In: IEEE International Conference on Acoustics, Speech, and Signal Processing. "2016 IEEE International Conference on Acoustics, Speech, and Signal Processing: proceedings: March 20-25, 2016: Shanghai International Convention Center: Shanghai, China". Shanghai: Institute of Electrical and Electronics Engineers (IEEE), 2016, p. 1026-1030.

dc.identifier

978-1-4799-9988-0

dc.identifier

https://hdl.handle.net/2117/91351

dc.identifier

10.1109/ICASSP.2016.7471831

dc.description.abstract

With the emergence of GPU computing, deep neural networks have become a widely used technique for advancing research in image and speech processing. In object and event detection, sliding-window classifiers must choose the best among all positively discriminated candidate windows. In this paper, we introduce the first GPU-based non-maximum suppression (NMS) algorithm for embedded GPU architectures. The results show that the proposed parallel algorithm reduces NMS latency by a wide margin compared to CPUs, even with the GPU clocked at 50% of its maximum frequency on an NVIDIA Tegra K1. We report results for object detection in images; the proposed technique is also directly applicable to speech segmentation tasks such as speaker diarization.
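
To illustrate what choosing the best among overlapping candidate windows means, the following is a minimal sketch of one parallel-friendly NMS formulation. It is not the authors' GPU algorithm. It assumes boxes given as hypothetical (x1, y1, x2, y2, score) tuples and an illustrative intersection-over-union (IoU) threshold of 0.5. A window is discarded if any other window overlaps it beyond the threshold and has a higher score. This all-pairs rule differs from classic greedy NMS, but because every pairwise test is independent, it maps naturally onto one GPU thread per pair or per window. Here, the loops simply run serially.

```python
from typing import List, Tuple

# Hypothetical box layout: (x1, y1, x2, y2, score), with x2 > x1 and y2 > y1.
Box = Tuple[float, float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def all_pairs_nms(boxes: List[Box], threshold: float = 0.5) -> List[int]:
    """Return indices of the windows that survive all-pairs suppression.

    Window i is dropped if some other window j overlaps it with IoU above
    `threshold` and outranks it (higher score; ties go to the lower index).
    Each (i, j) test is independent of the others, which is what would let
    a GPU evaluate them concurrently; this sketch evaluates them serially.
    """
    keep = [True] * len(boxes)
    for i, bi in enumerate(boxes):
        for j, bj in enumerate(boxes):
            if i == j:
                continue
            outranks = bj[4] > bi[4] or (bj[4] == bi[4] and j < i)
            if outranks and iou(bi, bj) > threshold:
                keep[i] = False
                break
    return [i for i, k in enumerate(keep) if k]


if __name__ == "__main__":
    candidates = [
        (10, 10, 50, 50, 0.9),      # strong detection
        (12, 12, 52, 52, 0.8),      # near-duplicate of the first window
        (100, 100, 140, 140, 0.7),  # separate object
    ]
    print(all_pairs_nms(candidates))  # [0, 2]
```

The second window overlaps the first with an IoU of about 0.82 and has a lower score, so it is suppressed. The third window overlaps nothing and is kept.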

dc.description.abstract

Peer Reviewed

dc.description.abstract

Postprint (published version)

dc.format

5 p.

dc.format

application/pdf

dc.language

eng

dc.publisher

Institute of Electrical and Electronics Engineers (IEEE)

dc.relation

http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7471831

dc.relation

info:eu-repo/grantAgreement/EC/H2020/644312/EU/Heterogeneous Secure Multi-level Remote Acceleration Service for Low-Power Integrated Systems and Devices/RAPID

dc.rights

Restricted access - publisher's policy

dc.subject

UPC subject areas::Computer science::Information systems

dc.subject

UPC subject areas::Computer science

dc.subject

Embedded computer systems

dc.subject

Information display systems

dc.subject

Embedded systems

dc.subject

Graphics processing units

dc.subject

Parallel algorithms

dc.subject

Work-efficient parallel non-maximum suppression

dc.subject

Embedded GPU architectures

dc.subject

Image processing

dc.subject

Speech processing

dc.subject

Deep neural networks

dc.subject

NMS latency

dc.subject

Positively discriminated candidate windows

dc.subject

Parallel algorithm

dc.subject

NVIDIA Tegra K1

dc.subject

Speech segmentation tasks

dc.subject

Speaker diarization

dc.subject

Embedded computer systems (Computer science)

dc.subject

Visualization (Computer science)

dc.title

Work-efficient parallel non-maximum suppression for embedded GPU architectures

dc.type

Conference report
