Review and comparison of Apriori algorithm implementations on Hadoop-MapReduce and Spark

Eduardo P. S. Castro; Thiago D. Maia; Marluce R. Pereira; Ahmed A. A. Esmin; Denilson A. Pereira

doi:10.1017/S0269888918000127

Published online by Cambridge University Press: 11 July 2018

Affiliation (all authors): Department of Computer Science, Universidade Federal de Lavras, PO Box 3037, Lavras 37200-000, Brazil

Abstract

Several implementations of the Apriori algorithm for mining association rules have been proposed in the literature using the Hadoop-MapReduce framework and, more recently, Spark. However, none of these works has assessed their performance in detail, for example by comparing them with other implementations across data sets with varying characteristics. In this work, we review the main algorithms proposed for Hadoop-MapReduce and compare their implementations in a single environment under several different conditions. Moreover, we adapted these implementations to Spark and compared them under the same circumstances. Based on the experimental results, we present a framework for recommending the most appropriate Apriori implementation for a given problem, according to the data set characteristics and the minimum required support. The results show that the Spark implementations outperform the Hadoop-MapReduce implementations in runtime in most experiments. However, no single implementation is the best in all the evaluated situations.

Type: Review Article
Information: The Knowledge Engineering Review, Volume 33, 2018, e9

DOI: https://doi.org/10.1017/S0269888918000127
Copyright: © Cambridge University Press, 2018
