1. bookVolume 9 (2019): Issue 4 (October 2019)
Journal Details
License
Format
Journal
eISSN
2449-6499
First Published
30 Dec 2014
Publication timeframe
4 times per year
Languages
English
Open Access

Decision-Making Enhancement in a Big Data Environment: Application of the K-Means Algorithm to Mixed Data

Published Online: 30 Aug 2019
Volume & Issue: Volume 9 (2019) - Issue 4 (October 2019)
Page range: 293 - 302
Received: 08 May 2019
Accepted: 25 Jul 2019
Journal Details
License
Format
Journal
eISSN
2449-6499
First Published
30 Dec 2014
Publication timeframe
4 times per year
Languages
English
Abstract

Big data research has become an important discipline in information systems research. However, the flood of data being generated on the Internet is increasingly unstructured and non-numeric in the form of images and texts. Thus, research indicates that there is an increasing need to develop more efficient algorithms for treating mixed data in big data for effective decision making. In this paper, we apply the classical K-means algorithm to both numeric and categorical attributes in big data platforms. We first present an algorithm that handles the problem of mixed data. We then use big data platforms to implement the algorithm, demonstrating its functionalities by applying the algorithm in a detailed case study. This provides us with a solid basis for performing more targeted profiling for decision making and research using big data. Consequently, the decision makers will be able to treat mixed data, numerical and categorical data, to explain and predict phenomena in the big data ecosystem. Our research includes a detailed end-to-end case study that presents an implementation of the suggested procedure. This demonstrates its capabilities and the advantages that allow it to improve the decision-making process by targeting organizations’ business requirements to a specific cluster[s]/profiles[s] based on the enhancement outcomes.

Keywords

[1] Ahmed Abbasi, Suprateek Sarker, and Roger HL Chiang. Big data research in information systems: Toward an inclusive research agenda. Journal of the Association for Information Systems, 17(2):I, 2016.Search in Google Scholar

[2] Ritu Agarwal and Vasant Dhar. Big data, data science, and analytics: The opportunity and challenge for is research. Information Systems Research, 25(3):443–448, 2014.10.1287/isre.2014.0546Open DOISearch in Google Scholar

[3] Amir Ahmad and Lipika Dey. A k-mean clustering algorithm for mixed numeric and categorical data. Data & Knowledge Engineering, 63(2):503–527, 2007.10.1016/j.datak.2007.03.016Open DOISearch in Google Scholar

[4] Pavel Berkhin. A survey of clustering data mining techniques. In Grouping multidimensional data, pages 25–71. Springer, 2006.10.1007/3-540-28349-8_2Search in Google Scholar

[5] Xiao Cai, Feiping Nie, and Heng Huang. Multi-view k-means clustering on big data. In Twenty-Third International Joint conference on artificial intelligence, 2013.Search in Google Scholar

[6] Xiaoli Cui, Pingfei Zhu, Xin Yang, Keqiu Li, and Changqing Ji. Optimized big data k-means clustering using mapreduce. The Journal of Supercomputing, 70(3):1249–1259, 2014.10.1007/s11227-014-1225-7Search in Google Scholar

[7] Kenneth Cukier and Viktor Mayer-Schoenberger. The rise of big data: How it’s changing the way we think about the world. Foreign Aff., 92:28, 2013.Search in Google Scholar

[8] Jeffrey Dean and Sanjay Ghemawat. Mapreduce: simplified data processing on large clusters. Communications of the ACM, 51(1):107–113, 2008.Search in Google Scholar

[9] Yuri Demchenko, Canh Ngo, and Peter Membrey. Architecture framework and components for the big data ecosystem. Journal of System and Network Engineering, pages 1–31, 2013.10.1109/CTS.2014.6867550Search in Google Scholar

[10] Dany Di Tullio and D Sandy Staples. The governance and control of open source software projects. Journal of Management Information Systems, 30(3):49–80, 2013.10.2753/MIS0742-1222300303Search in Google Scholar

[11] Gal Engelberg, Oded Koren, and Nir Perel. Big data performance evaluation analysis using apache pig. International Journal of Software Engineering and Its Applications, 10(11):429–440, 2016.10.14257/ijseia.2016.10.11.34Search in Google Scholar

[12] Johann Füller, Katja Hutter, Julia Hautz, and Kurt Matzler. User roles and contributions in innovation-contest communities. Journal of Management Information Systems, 31(1):273–308, 2014.10.2753/MIS0742-1222310111Open DOISearch in Google Scholar

[13] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The google file system. In Proceedings of the 19th ACM Symposium on Operating Systems Principles, pages 20–43, Bolton Landing, NY, 2003.10.1145/945445.945450Search in Google Scholar

[14] Shanshan Guo, Xitong Guo, Yulin Fang, and Doug Vogel. How doctors gain social and economic returns in online health-care communities: a professional capital perspective. Journal of Management Information Systems, 34(2):487–519, 2017.Search in Google Scholar

[15] Bock Hans-Hermann. Origins and extensions of the k-means algorithm in cluster analysis. Journal Electronique dHistoire des Probabilités et de la Statistique Electronic Journal for History of Probability and Statistics, 4:48–49, 2008.Search in Google Scholar

[16] Doug Henschen. Why sears is going all-in on hadoop. Information week. Retrieved July, 1:2014, 2012.Search in Google Scholar

[17] Joshua Zhexue Huang, Michael K Ng, Hongqiang Rong, and Zichen Li. Automated variable weighting in k-means type clustering. IEEE Transactions on Pattern Analysis & Machine Intelligence, (5):657–668, 2005.10.1109/TPAMI.2005.9515875789Open DOISearch in Google Scholar

[18] Zhexue Huang. Extensions to the k-means algorithm for clustering large data sets with categorical values. Data mining and knowledge discovery, 2(3):283–304, 1998.10.1023/A:1009769707641Search in Google Scholar

[19] Cisco Visual Networking Index. The zettabyte era–trends and analysis. Cisco white paper, 2013.Search in Google Scholar

[20] Anil K Jain. Data clustering: 50 years beyond k-means. Pattern recognition letters, 31(8):651–666, 2010.Search in Google Scholar

[21] Tapas Kanungo, David M Mount, Nathan S Netanyahu, Christine D Piatko, Ruth Silverman, and Angela Y Wu. An efficient k-means clustering algorithm: Analysis and implementation. IEEE Transactions on Pattern Analysis & Machine Intelligence, (7):881–892, 2002.10.1109/TPAMI.2002.1017616Open DOISearch in Google Scholar

[22] Daniel Kendal, Oded Koren, and Nir Perel. Pig vs. hive use case analysis. International Journal of Database Theory and Application, 9(12):267–276, 2016.10.14257/ijdta.2016.9.12.24Search in Google Scholar

[23] Oded Koren, Carina Antonia Hallin, Nir Perel, and Dror Bendet. Enhancement of the k-means algorithm for mixed data in big data platforms. In Proceedings of SAI Intelligent Systems Conference, pages 1025–1040. Springer, 2018.10.1007/978-3-030-01054-6_71Search in Google Scholar

[24] Sara Landset, Taghi M Khoshgoftaar, Aaron N Richter, and Tawfiq Hasanin. A survey of open source tools for machine learning with big data in the hadoop ecosystem. Journal of Big Data, 2(1):24, 2015.10.1186/s40537-015-0032-1Search in Google Scholar

[25] James Manyika, Michael Chui, Brad Brown, Jacques Bughin, Richard Dobbs, Charles Roxburgh, and Angela H Byers. Big data: The next frontier for innovation, competition, and productivity. McKinsey Global Institute, 2011.Search in Google Scholar

[26] R Angelin Preethi and J Elavarasi. Big data analytics using hadoop tools, pache hive vs apache pig. International Journal of Emerging Technology in Computer Science & Electronics, 24(3), 2017.Search in Google Scholar

[27] Arun Rai. Editor’s comments: Synergies between big data and theory. MIS quarterly, 40(2):iii–ix, 2016.Search in Google Scholar

[28] Henri Ralambondrainy. A conceptual version of the k-means algorithm. Pattern Recognition Letters, 16(11):1147–1157, 1995.10.1016/0167-8655(95)00075-RSearch in Google Scholar

[29] Alok R Saboo, V Kumar, and Insu Park. Using big data to model time-varying effects for marketing resource (re) allocation. MIS Quarterly, 40(4), 2016.10.25300/MISQ/2016/40.4.06Search in Google Scholar

[30] Ohn Mar San, Van-Nam Huynh, and Yoshiteru Nakamori. An alternative extension of the k-means algorithm for clustering categorical data. International Journal of Applied Mathematics and Computer Science, 14:241–247, 2004.Search in Google Scholar

[31] Prasanna Tambe. Big data investment, skills, and firm value. Management Science, 60(6):1452–1469, 2014.10.1287/mnsc.2014.1899Open DOISearch in Google Scholar

[32] Tom White. Hadoop: The definitive guide. O’Reilly Media, Inc., 2012.Search in Google Scholar

[33] Rui Xu and Donald C Wunsch. Survey of clustering algorithms. IEEE Transactions on Neural Networks, 16(3):645–678, 2005.10.1109/TNN.2005.84514115940994Open DOISearch in Google Scholar

Recommended articles from Trend MD

Plan your remote conference with Sciendo