MapReduce performance models for Hadoop 2.x

dc.contributor
Universitat Politècnica de Catalunya. Departament d'Enginyeria de Serveis i Sistemes d'Informació
dc.contributor
Universitat Politècnica de Catalunya. inSSIDE - integrated Software, Service, Information and Data Engineering
dc.contributor.author
Glushkova, Daria
dc.contributor.author
Jovanovic, Petar
dc.contributor.author
Abelló Gamazo, Alberto
dc.date.issued
2017
dc.identifier
Glushkova, D., Jovanovic, P., Abelló, A. MapReduce performance models for Hadoop 2.x. A: International Workshop On Design, Optimization, Languages and Analytical Processing of Big Data. "Proceedings of the Workshops of the EDBT/ICDT 2017 Joint Conference (EDBT/ICDT 2017): Venice, Italy, March 21-24, 2017". Venice: CEUR-WS.org, 2017, p. 1-10.
dc.identifier
1613-0073
dc.identifier
https://hdl.handle.net/2117/113535
dc.description.abstract
MapReduce is a popular programming model for distributed processing of large data sets. Apache Hadoop is one of the most common open-source implementations of this paradigm. Performance analysis of concurrent job executions has been recognized as a challenging problem; at the same time, it may provide reasonably accurate estimates of job response time at a significantly lower cost than experimental evaluation of real setups. In this paper, we tackle the challenge of defining MapReduce performance models for Hadoop 2.x. While there are several efficient approaches for modeling the performance of MapReduce workloads in Hadoop 1.x, the fundamental architectural changes of Hadoop 2.x require that the cost models be reconsidered as well. The proposed solution is based on an existing performance model for Hadoop 1.x, but it takes into consideration the architectural changes of Hadoop 2.x and captures the execution flow of a MapReduce job by using a queuing network model. This way, the cost model adheres to the intra-job synchronization constraints that occur due to contention at shared resources. The accuracy of our solution is validated by comparing our model estimates against measurements in a real Hadoop 2.x setup. According to our evaluation results, the proposed model produces estimates of average job response time with an error in the range of 11% to 13.5%.
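The abstract describes capturing a MapReduce job's execution flow with a queuing network model, and "Mean value analysis" appears among the subject keywords. As a rough illustration only (not the paper's actual model), the following sketch runs exact Mean Value Analysis on a closed queuing network; the per-station service demands and the mapping of stations to CPU, disk, and shuffle are hypothetical:

```python
# Illustrative sketch, not the paper's model: exact Mean Value Analysis (MVA)
# for a closed product-form queuing network of FCFS stations.

def mva(demands, n_jobs):
    """Exact MVA recursion.

    demands: per-station service demands (seconds per job visit) -- hypothetical
    n_jobs:  number of concurrent jobs circulating in the network
    Returns (system throughput, per-station mean queue lengths).
    """
    queues = [0.0] * len(demands)            # Q_k(0) = 0 for every station k
    throughput = 0.0
    for n in range(1, n_jobs + 1):
        # Residence time: service demand inflated by jobs already queued there
        resid = [d * (1 + q) for d, q in zip(demands, queues)]
        throughput = n / sum(resid)          # X(n) = n / sum_k R_k(n)
        queues = [throughput * r for r in resid]  # Little's law per station
    return throughput, queues

# Hypothetical demands: CPU-bound map work, local disk I/O, shuffle/network
x, q = mva([0.4, 0.2, 0.1], n_jobs=5)
response_time = 5 / x                        # overall R(N) = N / X(N)
```

With five concurrent jobs, the throughput is bounded by the bottleneck station (here the 0.4 s demand), and the per-station queue lengths sum to the job population, as Little's law requires.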
dc.description
Peer Reviewed
dc.description
Postprint (published version)
dc.format
10 p.
dc.format
application/pdf
dc.language
eng
dc.publisher
CEUR-WS.org
dc.relation
http://ceur-ws.org/Vol-1810/DOLAP_paper_28.pdf
dc.rights
http://creativecommons.org/licenses/by-nc-nd/3.0/es/
dc.rights
Open Access
dc.rights
Attribution-NonCommercial-NoDerivs 3.0 Spain
dc.subject
UPC subject areas::Computing::Information systems
dc.subject
Electronic data processing -- Distributed processing
dc.subject
Cost effectiveness
dc.subject
Open source software
dc.subject
MapReduce performance models
dc.subject
Hadoop 2.x
dc.subject
Queuing theory
dc.subject
Mean value analysis
dc.subject
Distributed data processing
dc.subject
Cost effectiveness
dc.subject
Free software
dc.title
MapReduce performance models for Hadoop 2.x
dc.type
Conference report

