R-Programming Language

Dr. Aly, O.
Computer Science

Introduction

The purpose of this discussion is to compare the statistical features of R to its programming features. The discussion also outlines the programming features available in R in a table format. Furthermore, the discussion describes how the analytics of R are suited for Big Data. We will begin by defining R followed by the comparison.

What is R?

R is defined as a “language and environment for statistical computing and graphics” (r-project.org, n.d.). The R system for statistical computing is used for data analysis and graphics (Hothorn & Everitt, 2009; Venables, Smith, & Team, 2017). It is also described as an integrated suite of software facilities for data manipulation, calculation, and graphical display (Venables et al., 2017). The root of R is the S language, developed by John Chambers and colleagues at Bell Laboratories (formerly AT&T, now owned by Lucent Technologies) starting in the 1960s (Hothorn & Everitt, 2009; r-project.org, n.d.; Venables et al., 2017). The S language was designed and developed as a programming language for data analysis. While S is a full-featured programming language (Hothorn & Everitt, 2009; r-project.org, n.d.), R provides a wide range of statistical techniques such as linear and non-linear modeling, classical statistical tests, time-series analysis, classification, and clustering (Venables et al., 2017; Verzani, 2014). It also provides graphical techniques and is highly extensible (Hothorn & Everitt, 2009; r-project.org, n.d.).

R is available as Free Software under the terms of the Free Software Foundation’s GNU General Public License (r-project.org, n.d.). It has become the “lingua franca,” or common language, of statistical computing (Hothorn & Everitt, 2009). It is becoming the primary computing engine for reproducible statistical research because of its open source availability and its strong language and graphical capabilities (Hothorn & Everitt, 2009). It runs on the Unix-like, Windows, and Mac families of operating systems (Hornik, 2016; Hothorn & Everitt, 2009; r-project.org, n.d.; Venables et al., 2017).

The R system provides an extensive, coherent, integrated collection of intermediate tools for data analysis. It also provides graphical facilities for data analysis and display, either directly on the computer or on hard copy. The term “environment” is intended to characterize R as a fully planned and coherent system, rather than an incremental accretion of specific and inflexible tools, as is the case with other data analysis software (Venables et al., 2017). However, most programs written in R are written for a single piece of data analysis and are inherently ephemeral (Venables et al., 2017). The R system provides most classical statistics and much of the latest methodology (Hothorn & Everitt, 2009; Venables et al., 2017). Furthermore, the R system has a well-developed, simple, and effective programming language that includes conditionals, loops, user-defined recursive functions, and input and output facilities (Venables et al., 2017). These advantages make R a powerful tool for data analysis.
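The language constructs named above (conditionals, loops, user-defined recursive functions, and input and output facilities) can be illustrated with a minimal sketch in base R. The function name factorial_rec and the temporary file are hypothetical choices made for this illustration only; they do not come from the cited sources.

```r
# User-defined recursive function that uses a conditional.
factorial_rec <- function(n) {
  if (n <= 1) {
    return(1)
  }
  n * factorial_rec(n - 1)
}

# Loop that stores one result per iteration in a pre-allocated vector.
results <- numeric(5)
for (i in 1:5) {
  results[i] <- factorial_rec(i)
}

# Output and input: write the results to a temporary CSV file,
# then read them back into a data frame.
out_file <- tempfile(fileext = ".csv")
write.csv(data.frame(n = 1:5, factorial = results), out_file,
          row.names = FALSE)
round_trip <- read.csv(out_file)
print(round_trip)
```

The same few lines would typically be typed at the interactive prompt, which fits the observation that most R programs are written for a single piece of analysis.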

Statistical Features vs Programming Features

With R, several statistical tests and methods can be performed, such as two-sample tests, hypothesis testing, z-tests, t-tests, chi-square tests, regression analysis, multiple linear regression, and analysis of variance (Hothorn & Everitt, 2009; r-project.org, n.d.; Schumacker, 2014; Venables et al., 2017; Verzani, 2014). With respect to the programming features, R is an interpreted language, and it can be accessed through a command-line interpreter. R supports matrix arithmetic. It supports procedural programming with functions and object-oriented programming with generic functions. Procedural programming includes procedures, records, modules, and procedure calls. R has useful data handling and storage facilities. Packages are part of R programming and are useful for collecting sets of R functions into a single unit. The programming features of R include database input, data export, data viewing, variable labels, and missing-data handling. R also supports a large pool of operators for performing operations on arrays and matrices. It has facilities to present the results of an analysis as graphs, either on-screen or on hard copy (Hothorn & Everitt, 2009; r-project.org, n.d.; Schumacker, 2014; Venables et al., 2017; Verzani, 2014). Table 1 summarizes these features.
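As a rough illustration of how the statistical and programming features described above look in practice, the following sketch uses only base R and the built-in mtcars data set. The choice of data set, the variables tested, and the "claim" class are assumptions made for this example, not taken from the cited sources.

```r
# --- Statistical features ---

# Two-sample t-test: does fuel economy differ by transmission type?
t_res <- t.test(mpg ~ am, data = mtcars)

# Chi-square test of independence between cylinders and transmission.
# Some cell counts are small, so the approximation warning is suppressed.
chi_res <- suppressWarnings(chisq.test(table(mtcars$cyl, mtcars$am)))

# Multiple linear regression and one-way analysis of variance.
fit <- lm(mpg ~ wt + hp, data = mtcars)
aov_res <- aov(mpg ~ factor(cyl), data = mtcars)

# --- Programming features ---

# Matrix arithmetic: %*% is the matrix product, solve() the inverse.
A <- matrix(c(2, 1, 1, 3), nrow = 2)
identity_check <- A %*% solve(A)

# Object-oriented programming with generic functions (S3):
# print() dispatches on the class attribute of its argument.
claim <- structure(list(id = 1, amount = 250), class = "claim")
print.claim <- function(x, ...) {
  cat("Claim", x$id, "for amount", x$amount, "\n")
  invisible(x)
}
print(claim)
```

Calling summary(fit) or summary(aov_res) prints the usual coefficient and ANOVA tables, and plot(fit) produces the on-screen diagnostic graphs mentioned above.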

Table 1. Summary of the Programming Features and Statistical Features in R.

Big Data Analytics Using R

Big Data has attracted the attention of various sectors, researchers, academia, government, and even the media (Chen, Mao, & Liu, 2014; Géczy, 2014; Kaisler, Armour, Espinosa, & Money, 2013). Such attention is driven by the value and the opportunities that can be derived from Big Data. The importance of Big Data has been evident in almost every sector. There are various advanced analytical theories and methods that can be applied to Big Data in different fields such as medicine, finance, manufacturing, and marketing. The six analytical models are Clustering, Association Rules, Regression, Classification, Time Series Analysis, and Text Analysis (EMC, 2015). The Clustering, Regression, and Classification models can be used in the medical field. The Classification model with the Decision Tree and Naïve Bayes methods has been used to diagnose patients with specific diseases such as heart disease, and to estimate the probability of a patient having a specific disease. As an example, Shouman, Turner, and Stocker (2011) performed various experiments to evaluate the Decision Tree in the diagnosis of heart disease. The key benefit of the study was that it compared multiple variants of the Decision Tree, such as Information Gain, Gini Index, and Gain Ratio. The study also performed the experiments with and without the voting technique.
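To make the idea of comparing Decision Tree variants concrete, the following minimal sketch fits two classification trees with the rpart package, one using the Gini index and one using the information (entropy) criterion. It is not a reproduction of the heart disease study: it uses the built-in iris data set as a stand-in, assumes rpart is installed (it ships with standard R distributions as a recommended package), and omits Gain Ratio and voting, which rpart does not provide.

```r
library(rpart)  # recommended package, assumed to be installed

# Fit one classification tree per splitting criterion supported by rpart.
tree_gini <- rpart(Species ~ ., data = iris, method = "class",
                   parms = list(split = "gini"))
tree_info <- rpart(Species ~ ., data = iris, method = "class",
                   parms = list(split = "information"))

# Training-set accuracy. This is optimistic; a real evaluation would
# use held-out data or cross-validation.
accuracy <- function(tree) {
  predicted <- predict(tree, newdata = iris, type = "class")
  mean(predicted == iris$Species)
}

acc <- c(gini = accuracy(tree_gini), information = accuracy(tree_info))
print(acc)
```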

Furthermore, there are four major types of analytics: Descriptive Analytics, Predictive Analytics, Prescriptive Analytics (Apurva, Ranakoti, Yadav, Tomer, & Roy, 2017; Davenport & Dyché, 2013; Mohammed, Far, & Naugler, 2014), and Diagnostic Analysis (Apurva et al., 2017). Descriptive Analytics summarizes historical data to provide useful information. Predictive Analytics predicts future events from previous behavior using data mining techniques and modeling. Prescriptive Analytics supports working through various scenarios of data models, such as multi-variable simulation and the detection of hidden relationships between variables. It is useful for finding an optimum solution and the best course of action using an algorithm.
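The difference between the first two types can be sketched in a few lines of base R. The example below is only illustrative: it uses the built-in mtcars data set, and the "new car" of weight 3.0 (in units of 1000 lb) is a hypothetical case. Prescriptive and diagnostic analytics are not covered by this sketch.

```r
# Descriptive analytics: summarize historical data.
overall <- summary(mtcars$mpg)
mean_mpg_by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
print(mean_mpg_by_cyl)

# Predictive analytics: fit a model to past observations, then predict
# the outcome for a new, unseen case with a prediction interval.
model <- lm(mpg ~ wt, data = mtcars)
new_car <- data.frame(wt = 3.0)  # hypothetical car, weight in 1000 lb
predicted_mpg <- predict(model, newdata = new_car, interval = "prediction")
print(predicted_mpg)
```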

Moreover, many organizations have employed Big Data and Data Mining in areas including fraud detection. Big Data Analytics can empower the healthcare industry in fraud detection to mitigate the impact of fraudulent activities. Several use cases (Halyna, 2017; Nelson, 2017) have demonstrated the positive impact of integrating Big Data Analytics into fraud detection systems. Big Data Analytics and Data Mining offer various techniques such as the classification model, the regression model, and the clustering model. The classification model employs logistic, tree, naïve Bayesian, and neural network algorithms, and it can be used for fraud detection. The regression model employs linear and k-nearest-neighbor algorithms. The clustering model employs k-means, hierarchical, and principal component algorithms. For instance, Liu and Vasarhelyi (2013) applied the clustering technique using an unsupervised data mining approach to detect fraud by insurance subscribers. Ekina, Leva, Ruggeri, and Soyer (2013) applied Bayesian co-clustering with an unsupervised data mining method to detect conspiracy fraud involving more than one party. Capelleveen (2013) employed the outlier detection technique using an unsupervised data mining method to detect fraud in dental claim data within Medicaid. Aral, Güvenir, Sabuncuoğlu, and Akar (2012) used distance-based correlation with hybrid supervised and unsupervised data mining methods for prescription fraud detection. These research studies and use cases are examples of taking advantage of Big Data Analytics in healthcare fraud detection. They suggest that Big Data Analytics can play a significant role in various sectors, including healthcare fraud detection.
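The unsupervised clustering idea can be sketched with the k-means implementation in base R. The example below is a toy illustration under stated assumptions, not the method of any of the studies cited above: the claim data are simulated, the column names amount and visits are hypothetical, and the rule "treat the smallest cluster as suspicious" is one simple heuristic among many.

```r
set.seed(42)

# Simulated claim data: 200 ordinary claims plus 5 inflated ones.
normal <- data.frame(amount = rnorm(200, mean = 100, sd = 15),
                     visits = rnorm(200, mean = 3, sd = 1))
inflated <- data.frame(amount = rnorm(5, mean = 400, sd = 20),
                       visits = rnorm(5, mean = 12, sd = 1))
claims <- rbind(normal, inflated)

# Standardize the columns so both contribute equally to distances,
# then cluster without using any fraud labels (unsupervised).
scaled <- scale(claims)
fit <- kmeans(scaled, centers = 2, nstart = 25)

# Heuristic: flag the members of the smallest cluster for review.
sizes <- table(fit$cluster)
small_cluster <- as.integer(names(sizes)[which.min(sizes)])
flagged <- unname(which(fit$cluster == small_cluster))
print(claims[flagged, ])
```

In this simulation the inflated claims are far from the ordinary ones, so the separation is clean. Real claim data would rarely be this well behaved and would need domain review of whatever is flagged.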

Therefore, given the nature of Big Data (BD) and Big Data Analytics (BDA), and the nature of the R language, which can be integrated with other languages and platforms such as SQL, Hadoop (Prajapati, 2013), and Spark (spark.rstudio.com, 2018), R is becoming the primary workhorse for statistical analyses (Hothorn & Everitt, 2009) and can be used for BDA as discussed above. Statistical methods not only help make scientific discoveries but also quantify the reliability, reproducibility, and general uncertainty associated with these discoveries (Ramasubramanian & Singh, 2017). Examples of using R with BDA include Matrix (2006), which analyzed customer behavioral data to identify unique and actionable segments of the customer base. Another example is Gentleman (2005), which used R in a genetics and molecular biology use case.
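As one possible illustration of the SQL integration mentioned above, the sketch below pushes an aggregation into a database and pulls only the summarized result back into R. It assumes the DBI and RSQLite packages are installed (they are not part of base R), uses an in-memory SQLite database, and uses the built-in mtcars data as a stand-in for a real table. Hadoop and Spark integration rely on other packages and are not shown.

```r
# Assumption: the DBI and RSQLite packages are installed.
# The guard lets the script run without error when they are not.
if (requireNamespace("DBI", quietly = TRUE) &&
    requireNamespace("RSQLite", quietly = TRUE)) {

  # Connect to a temporary in-memory database and load a table.
  con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
  DBI::dbWriteTable(con, "cars", mtcars)

  # Let the database do the grouping; R receives a small data frame.
  by_cyl <- DBI::dbGetQuery(con, "
    SELECT cyl, COUNT(*) AS n, AVG(mpg) AS mean_mpg
    FROM cars
    GROUP BY cyl
    ORDER BY cyl
  ")
  DBI::dbDisconnect(con)

  print(by_cyl)
}
```

Doing the aggregation in the database keeps the data transferred into R small, which is the usual motivation for this pattern when the underlying table is large.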

In summary, the R system offers both programming and statistical features that help in data analysis. Big Data has various types of analytics, such as clustering, association rules, regression, classification, time series analysis, and text analysis. Most of these analyses are statistically based and can be performed using the R language. R has been used in various BDA areas such as healthcare and fraud detection.

References

Apurva, A., Ranakoti, P., Yadav, S., Tomer, S., & Roy, N. R. (2017, October 12-14). Redefining cyber security with big data analytics. Paper presented at the 2017 International Conference on Computing and Communication Technologies for Smart Nation (IC3TSN).

Aral, K. D., Güvenir, H. A., Sabuncuoğlu, İ., & Akar, A. R. (2012). A prescription fraud detection model. Computer methods and programs in biomedicine, 106(1), 37-46.

Capelleveen, G. C. (2013). Outlier based predictors for health insurance fraud detection within US Medicaid. The University of Twente.

Chen, M., Mao, S., & Liu, Y. (2014). Big data: a survey. Mobile Networks and Applications, 19(2), 171-209.

Davenport, T. H., & Dyché, J. (2013). Big data in big companies. International Institute for Analytics.

Ekina, T., Leva, F., Ruggeri, F., & Soyer, R. (2013). Application of Bayesian methods in the detection of healthcare fraud.

EMC. (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. (1st ed.): Wiley.

Géczy, P. (2014). Big data characteristics. The Macrotheme Review, 3(6), 94-104.

Gentleman, R. (2005). Reproducible research: A bioinformatics case study.

Halyna. (2017). Challenge Accomplished: Healthcare Fraud Detection Using Predictive Analytics. Retrieved from https://www.romexsoft.com/blog/healthcare-fraud-detection/.

Hornik, K. (2016). R FAQ. Retrieved from: https://CRAN.R-project.org/doc/FAQ/R-FAQ.html.

Hothorn, T., & Everitt, B. S. (2009). A handbook of statistical analyses using R: Chapman and Hall/CRC.

Kaisler, S., Armour, F., Espinosa, J. A., & Money, W. (2013). Big data: issues and challenges moving forward. Paper presented at the 2013 46th Hawaii International Conference on System Sciences (HICSS).

Liu, Q., & Vasarhelyi, M. (2013). Healthcare fraud detection: A survey and a clustering model incorporating Geo-location information.

Matrix, L. (2006). Using R for Customer Analytics: A Practical Introduction to R for Business Analysts.

Mohammed, E. A., Far, B. H., & Naugler, C. (2014). Applications of the MapReduce Programming Framework to Clinical Big Data Analysis: Current Landscape and Future Trends. BioData mining, 7(1), 1.

Nelson, P. (2017). Fraud Detection Powered by Big Data – An Insurance Agency’s Case Story. Retrieved from https://www.searchtechnologies.com/blog/fraud-detection-big-data.

Prajapati, V.-i. (2013). Big Data Analytics with R and Hadoop: Packt Publishing Ltd.

r-project.org. (n.d.). What is R? Retrieved from https://www.r-project.org/about.html.

Ramasubramanian, K., & Singh, A. (2017). Machine Learning Using R: Springer.

Schumacker, R. E. (2014). Learning statistics using R: Sage Publications.

Shouman, M., Turner, T., & Stocker, R. (2011). Using decision tree for diagnosing heart disease patients. Paper presented at the Proceedings of the Ninth Australasian Data Mining Conference-Volume 121.

spark.rstudio.com. (2018). R Interface For Apache Spark. Retrieved from http://spark.rstudio.com/.

Venables, W. N., Smith, D. M., & Team, R. C. (2017). Introduction To R. Retrieved from: https://cran.r-project.org/doc/manuals/R-intro.pdf, Version 3.4.2(2017-09-28).

Verzani, J. (2014). Using R for introductory statistics: CRC Press.

Published by Think and Knowledge Tank