Dr. Aly, O.
Computer Science
The purpose of this discussion is to identify and describe a data analytics tool on the market, how the tool is used, and where it can be used. The discussion begins with an overview of Big Data Analytics (BDA) tools, followed by the top five tools for 2018, among which RapidMiner is selected as the BDA tool for this discussion. The discussion of RapidMiner as one of the top five BDA tools covers its features, technical specification, use, advantages, and limitations. The application of RapidMiner in various domains such as medicine and education is also addressed.
Overview of Big Data Analytics Tools
Organizations must be able to analyze large amounts of data quickly and effectively and extract value from that data to make sound business decisions. The benefits of Big Data Analytics are driving organizations and businesses to implement Big Data Analytics techniques in order to compete in the market. A survey conducted by CIO Insight showed that 65% of executives and senior decision makers indicated that organizations risk becoming uncompetitive or irrelevant if Big Data is not embraced (McCafferly, 2015). The same survey showed that 56% anticipated a higher investment in Big Data, and 15% indicated that this increase in budget allocation will be significant (McCafferly, 2015). Such a budget can be spent on skilled professionals, Big Data storage, BDA tools, and so forth.
Various BDA tools exist in the market for different business purposes, and organizations must select the tool that serves their business model. Several studies have discussed tools for BDA implementation. Chen and Zhang (2014) examined various types of Big Data tools. Some tools are based on batch processing, such as Apache Hadoop, Dryad, Apache Mahout, and Tableau, while other tools are based on stream processing, such as Storm, S4, Splunk, Apache Kafka, and SAP Hana, as summarized in Table 1 and Table 2. Each tool provides certain features for BDA implementation and offers various advantages to the organizations that adopt BDA.

Table 1. Big Data Tools Based on Batch Processing (Chen & Zhang, 2014).

Table 2. Big Data Tools Based on Stream Processing (Chen & Zhang, 2014).
Other studies, such as Rangra and Bansal (2014), have provided a comparative study of data mining tools such as Weka, Keel, R-Programming, Knime, RapidMiner, and Orange, covering their technical specification, general features, specialization, advantages, and limitations. Choi (2017) discussed BDA tools by category: open source data tools, data visualization tools, sentiment tools, and data extraction tools. Figure 1 summarizes some examples of BDA tools, including the database sources from which big datasets can be downloaded for analysis.

Figure 1. A Summary of Big Data Analytics Tools.
Al-Khoder and Harmouch (2014) evaluated four of the most popular open source and free data mining tools: R, RapidMiner, Weka, and Knime. The R Foundation developed R-Programming, while the Rapid-I company developed RapidMiner. Weka was developed by the University of Waikato, and Knime was developed by Knime.com AG. Figure 2 summarizes these four tools, with the logo, description, launch date, current version at the time the study was written, and development team.

Figure 2. Open Source and Free Data Mining Tools Analyzed by (Al-Khoder & Harmouch, 2014).
The top five BDA tools for 2018 are Tableau Public, RapidMiner, Hadoop, R-Programming, and IBM Big Data (Seli, 2017). The present discussion focuses on one of these top five BDA tools for 2018. Figure 3 summarizes the five tools.

Figure 3. Top Five BDA Tools for 2018.
RapidMiner Big Data Analytic Tool
RapidMiner is selected for the present discussion because it is among the top five BDA tools for 2018. RapidMiner is an open source platform for BDA, based on the Java programming language. It provides machine learning and data mining procedures, as well as data visualization, processing, statistical modeling, deployment, evaluation, and predictive analytics (Hofmann & Klinkenberg, 2013; Rangra & Bansal, 2014; Seli, 2017). RapidMiner is known for its commercial and business applications, as it provides an integrated environment and platform for machine learning, data mining, predictive analysis, and business analytics (Hofmann & Klinkenberg, 2013; Seli, 2017). It is also used for research, education, training, rapid prototyping, and application development (Rangra & Bansal, 2014). It specializes in predictive analysis and statistical computing, and it supports all steps of the data mining process (Hofmann & Klinkenberg, 2013; Rangra & Bansal, 2014). RapidMiner uses the client/server model, where the server can be deployed as software, offered as a service, or hosted on cloud infrastructures (Rangra & Bansal, 2014).
RapidMiner was released in 2006. The version cited here is RapidMiner 7.2, which comes with free versions of Server and Radoop and can be downloaded from the RapidMiner site (RapidMiner, 2018). It can be installed on any operating system (Rangra & Bansal, 2014). The advantages of RapidMiner include an integrated environment for all steps of the data mining process, an easy-to-use graphical user interface (GUI) for designing data mining processes, visualization of results and data, and validation and optimization of these processes. RapidMiner can be integrated into more complex systems (Hofmann & Klinkenberg, 2013). RapidMiner also stores data mining processes in a machine-readable XML format, which can be executed with the click of a button, and provides a graphical visualization of the data mining processes (Hofmann & Klinkenberg, 2013). It contains over a hundred learning schemes for regression, classification, and clustering analysis (Rangra & Bansal, 2014). RapidMiner has a few limitations, including constraints on the number of rows and a need for more hardware resources than other tools such as SAS for the same task and data (Seli, 2017). RapidMiner also requires substantial knowledge of database handling (Rangra & Bansal, 2014).
RapidMiner Use and Application
Data mining requires six essential steps to extract value from a large dataset (Chisholm, 2013). The data mining process framework begins with business understanding, followed by data understanding and data preparation. The modeling, evaluation, and deployment phases then develop the predictive models, test them, and deploy them in real time. Figure 4 illustrates these six steps of data mining.

Figure 4. Data Mining Six Phases Process Framework (Chisholm, 2013).
Before working with RapidMiner, the user must know the common terms it uses, such as process, operator, macro, repository, attribute, role, label, and ID (Chisholm, 2013). The data mining process in RapidMiner begins with loading the data, which is imported either from files or from databases. A large file can be split into pieces within RapidMiner: a RapidMiner process reads a file such as a CSV file line by line and splits it into chunks. If the dataset resides in a database, a Java Database Connectivity (JDBC) driver must be used. RapidMiner supports MySQL, PostgreSQL, SQL Server, Oracle, and Access (Chisholm, 2013). After the data is loaded into RapidMiner and test data is generated, a predictive model can be created from the loaded dataset, the process is executed, and the result is reviewed visually. RapidMiner provides various techniques to visualize the data, including scatter plots, scatter 3D color, parallel and deviation plots, quartile color, series plots, and the survey plotter. Figure 5 illustrates the scatter 3D color visualization of the data in RapidMiner (Chisholm, 2013).

Figure 5. Scatter 3D Color Visualization of the Data in RapidMiner (Chisholm, 2013).
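The idea of reading a CSV file row by row and splitting it into chunks can be sketched outside RapidMiner in a few lines of Python. This is only an illustration of the general technique, not RapidMiner's own operators or API. The function name read_csv_in_chunks, the chunk size, and the sample data are all assumptions made for the example.

```python
import csv
import io
from itertools import islice
from typing import Iterator, List, TextIO


def read_csv_in_chunks(handle: TextIO, chunk_size: int) -> Iterator[List[dict]]:
    """Yield lists of at most chunk_size rows, each row a dict keyed by the header."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    reader = csv.DictReader(handle)
    while True:
        # Pull the next chunk_size rows without loading the whole file.
        chunk = list(islice(reader, chunk_size))
        if not chunk:
            break
        yield chunk


# Hypothetical in-memory data standing in for a large CSV file on disk.
sample = io.StringIO("id,amount\n1,10\n2,20\n3,30\n4,40\n5,50\n")
chunks = list(read_csv_in_chunks(sample, chunk_size=2))
chunk_sizes = [len(chunk) for chunk in chunks]
print(chunk_sizes)  # [2, 2, 1]
```

Because the function is a generator, only one chunk is held in memory at a time when the chunks are consumed in a loop, which is the reason for chunking a large file in the first place.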
RapidMiner supports classification methods such as K-Nearest Neighbor and Naïve Bayes, which can be used for credit approval and in education (Hofmann & Klinkenberg, 2013). RapidMiner is also applied in other areas such as marketing, cross-selling, and recommender systems (Hofmann & Klinkenberg, 2013). Other use cases include clustering in the medical and education domains (Hofmann & Klinkenberg, 2013). RapidMiner can also be used for text mining scenarios such as spam detection, language detection, and customer feedback analysis. Other applications of RapidMiner include anomaly detection and instance selection.
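To make the K-Nearest Neighbor idea concrete, the following minimal Python sketch classifies a new point by majority vote among its k closest training examples. It is a generic illustration of the technique, not RapidMiner's implementation. The function name knn_predict, the two normalized features, and the toy credit approval data are hypothetical.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple

Example = Tuple[Sequence[float], str]


def knn_predict(training: List[Example], query: Sequence[float], k: int = 3) -> str:
    """Return the majority label among the k training points closest to query."""
    if not 1 <= k <= len(training):
        raise ValueError("k must be between 1 and the number of training examples")
    # Sort training examples by Euclidean distance to the query and keep the k nearest.
    nearest = sorted(training, key=lambda ex: math.dist(ex[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical applicants: (income score, debt score), both scaled to 0..1.
training_data: List[Example] = [
    ((0.9, 0.1), "approve"),
    ((0.8, 0.2), "approve"),
    ((0.7, 0.3), "approve"),
    ((0.2, 0.8), "reject"),
    ((0.3, 0.9), "reject"),
    ((0.1, 0.7), "reject"),
]

high_income = knn_predict(training_data, (0.85, 0.15))
high_debt = knn_predict(training_data, (0.25, 0.85))
print(high_income, high_debt)  # approve reject
```

The features are assumed to be on the same scale. With raw values such as income in dollars and a debt ratio, the larger-valued feature would dominate the distance, so normalization would be needed first.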
Conclusion
This discussion has identified different tools for Big Data Analytics (BDA). More than thirty analytic tools can be used to overcome some of the BDA challenges. Some are open source tools that can be downloaded for free, such as Knime, R-Programming, and RapidMiner, while others are visualization tools, such as Tableau Public and Google Fusion, which provide compelling visual images of the data in various scenarios. Other tools focus on sentiment and semantics, such as OpenText and Opinion Crawl. Data extraction tools for BDA include Octoparse and Content Grabber. Users can download large datasets for BDA from various databases such as data.gov.
The discussion has also addressed the top five BDA tools for 2018, namely Tableau Public, RapidMiner, Hadoop, R-Programming, and IBM Big Data. RapidMiner was selected as the BDA tool for this discussion. The discussion of RapidMiner covered its technical specification, use, advantages, and limitations. The data mining process and steps when using RapidMiner have also been discussed. The analytic process begins with loading the data into RapidMiner, during which the data can be split using RapidMiner capabilities. After the data is loaded and cleaned, the model is developed and tested, followed by visualization. The analytic capabilities of RapidMiner include classification methods such as K-Nearest Neighbor and Naïve Bayes. RapidMiner use cases have been addressed as well, including the medical and education domains and text mining scenarios such as spam detection. Organizations must select the appropriate BDA tools based on their business model.
References
Al-Khoder, A., & Harmouch, H. (2014). Evaluating four of the most popular open source and free data mining tools.
Chen, C. P., & Zhang, C.-Y. (2014). Data-intensive applications, challenges, techniques and technologies: A survey on Big Data. Information Sciences, 275, 314-347.
Chisholm, A. (2013). Exploring data with RapidMiner. Packt Publishing Ltd.
Choi, N. (2017). Top 30 Big Data Tools for Data Analysis. Retrieved from https://bigdata-madesimple.com/top-30-big-data-tools-data-analysis/.
Hofmann, M., & Klinkenberg, R. (2013). RapidMiner: Data mining use cases and business analytics applications. CRC Press.
McCafferly, D. (2015). How To Overcome Big Data Barriers. Retrieved from https://www.cioinsight.com/it-strategy/big-data/slideshows/how-to-overcome-big-data-barriers.html.
Rangra, K., & Bansal, K. (2014). Comparative study of data mining tools. International Journal of Advanced Research in Computer Science and Software Engineering, 4(6).
RapidMiner. (2018). Introducing RapidMiner 7.2, Free Versions of Server & Radoop, and New Pricing. Retrieved from https://rapidminer.com/blog/introducing-new-rapidminer-pricing-free-versions-server-radoop/.
Seli, A. (2017). Top 5 Big Data Analytics Tools for 2018. Retrieved from http://heartofcodes.com/big-data-analytics-tools-for-2018/.