Proposal: State-of-the-Art Healthcare System in Four States.

Dr. O. Aly
Computer Science

Abstract

The purpose of this proposal is to design a state-of-the-art healthcare system in the four states of Colorado, Utah, Arizona, and New Mexico.  Big Data (BD) and Big Data Analytics (BDA) have played significant roles in various industries, including healthcare.  The value driven by BDA can save lives and minimize costs for patients.  The project proposes a design to apply BD and BDA in the healthcare system across these four states.  Cloud computing is the most appropriate technology for handling the large volume of healthcare data at both the storage and data processing levels.  Because of the security issues of cloud computing, a Virtual Private Cloud (VPC) will be used.  The VPC provides a secure cloud environment in which network traffic is controlled through security groups and network access control lists.  The project requires other components built on the latest technology: Hadoop and MapReduce for data processing, and machine learning, a branch of artificial intelligence, which will support the Internet of Things (IoT).  The NoSQL databases HBase and MongoDB will handle semi-structured data such as XML and unstructured data such as logs and images.  Spark will be used for real-time data processing, which can be vital for urgent care and emergency services.  This project addresses the assumptions and limitations as well as the justification for selecting these specific components.  All stakeholders in the healthcare sector, including providers, insurers, pharmaceutical vendors, and practitioners, should cooperate and coordinate to facilitate the implementation process.  The rigid culture and silo pattern need to change for better healthcare, which can save the healthcare industry millions of dollars while providing excellent care to patients.

Keywords: Big Data Analytics; Hadoop; Healthcare Big Data System; Spark.

Introduction

            In the age of Big Data (BD), information technology plays a significant role in the healthcare industry (HIMSS, 2018).  The healthcare sector generates a massive amount of data every day to conform to standards and regulations (Alexandru, Alexandru, Coardos, & Tudora, 2016).  This Big Data has the potential to support many medical and healthcare operations, including clinical decision support, disease surveillance, and population health management (Alexandru et al., 2016).  This project proposes a state-of-the-art integrated system for hospitals located in Arizona, Colorado, New Mexico, and Utah.  The system is based on the Hadoop ecosystem and will help the hospitals maintain and improve human health through diagnosis, treatment, and disease prevention.

The proposal begins with an overview of Big Data Analytics in healthcare, which covers the benefits and challenges of BD and BDA in the healthcare industry.  The overview also covers the various healthcare data sources for analytics and their different formats, such as semi-structured data (e.g., XML and JSON) and unstructured data (e.g., images and X-rays).  The second section presents the healthcare BDA design proposal using Hadoop and covers several components.  The first component discusses the requirements for this design, which include state-of-the-art technologies such as Hadoop/MapReduce, Spark, NoSQL databases, artificial intelligence (AI), and the Internet of Things (IoT).  The proposal also provides several diagrams, including the data flow diagram, a communication flowchart, and the overall system diagram.  The healthcare system design is bounded by regulations, policies, and governance such as HIPAA, which are also covered in this project.  The justification, limitations, and assumptions are discussed as well.

Big Data Analytics in Healthcare Overview

BD and BDA are terms that have been used interchangeably and described as the next frontier for innovation, competition, and productivity (Maltby, 2011; Manyika et al., 2011).  BD has a multi-V model with unique characteristics: volume refers to large datasets, velocity refers to the speed of computation as well as data generation, and variety refers to the various data types such as semi-structured and unstructured data (Assunção, Calheiros, Bianchi, Netto, & Buyya, 2015; Hu, Wen, Chua, & Li, 2014).  Various industries, including healthcare, have taken this opportunity and applied BD and BDA in their business models (Manyika et al., 2011).  The McKinsey Global Institute estimated a potential annual value of $300 billion to US healthcare (Manyika et al., 2011).

The healthcare industry generates extensive data driven by patient record keeping, compliance with regulations and policies, and patient care (Raghupathi & Raghupathi, 2014).  The current trend is to digitize this explosively growing data in the age of Big Data (BD) and Big Data Analytics (BDA) (Raghupathi & Raghupathi, 2014).  BDA has revolutionized healthcare by transforming data into valuable information and knowledge used to predict epidemics, cure diseases, improve quality of life, and avoid preventable deaths (Van-Dai, Chuan-Ming, & Nkabinde, 2016).  Applications of BDA in healthcare include pervasive health, fraud detection, pharmaceutical discovery, clinical decision support systems, computer-aided diagnosis, and biomedical applications.

Healthcare Big Data Benefits and Challenges

            The healthcare sector employs BDA in various aspects of care, such as detecting diseases at early stages, providing evidence-based medicine, minimizing medication doses to avoid side effects, and delivering effective medicine based on genetic analysis.  The use of BD and BDA can reduce the readmission rate and thereby reduce healthcare-related costs for patients.  Healthcare BDA can also detect disease outbreaks early, before they spread, using real-time analytics (Archenaa & Anita, 2015; Raghupathi & Raghupathi, 2014; Wang, Kung, & Byrd, 2018).  An example of BDA applied in a healthcare system is Kaiser Permanente's HealthConnect, which ensures data exchange across all medical facilities and promotes the use of electronic health records (Fox & Vaidyanathan, 2016).

            Despite the various benefits of BD and BDA in the healthcare sector, challenges and issues emerge from their application.  The nature of the healthcare industry itself poses challenges to BDA (Groves, Kayyali, Knott, & Kuiken, 2016).  The episodic culture, the data puddles, and IT leadership are the three significant challenges the healthcare industry faces in applying BDA.  The episodic culture refers to the conservative culture of healthcare and the lack of an IT mindset, which together create a rigid culture; only a few providers have overcome this rigidity and started to use BDA technology.  The data puddles reflect the silo nature of healthcare.  Silos are described as one of the most significant flaws in the healthcare sector (Wicklund, 2014): proper use of technology is lacking, which leaves the industry behind others, and each silo uses its own methods to collect data from labs, diagnosis, radiology, emergency, case management, and so forth.  IT leadership is another challenge caused by the rigid culture of the healthcare industry; the lack of familiarity with the latest technologies among healthcare IT leadership is a severe problem.

Healthcare Data Sources for Data Analytics

            The current healthcare data is collected from clinical and non-clinical sources (InformationBuilders, 2018; Van-Dai et al., 2016; Zia & Khan, 2017).  Electronic health records are digital copies of patients' medical histories.  They contain a variety of data relevant to patient care, such as demographics, medical problems, medications, body mass index, medical history, laboratory test data, radiology reports, clinical notes, and payment information.  These electronic health records are the most important data in healthcare analytics because they provide effective and efficient methods for providers and organizations to share data (Botta, de Donato, Persico, & Pescapé, 2016; Palanisamy & Thirunavukarasu, 2017; Van-Dai et al., 2016; Wang et al., 2018).

Biomedical imaging data plays a crucial role in healthcare, aiding disease monitoring, treatment planning, and prognosis.  This data can be used to generate quantitative information and draw inferences from images that provide insight into a medical condition.  Image analytics is more complicated due to the noise associated with image data, which is one of the significant limitations of biomedical analysis (Ji, Ganchev, O'Droma, Zhang, & Zhang, 2014; Malik & Sangwan, 2015; Van-Dai et al., 2016).

Sensing data is ubiquitous in the medical domain, both for real-time and for historical data analysis.  It is produced by several kinds of medical data collection instruments, such as the electrocardiogram (ECG) and electroencephalogram (EEG), which are vital sensors for collecting signals from various parts of the human body.  Sensing data plays a significant role in intensive care units (ICUs) and in real-time remote monitoring of patients with specific conditions such as diabetes or high blood pressure.  Real-time and long-term analysis of trends and treatments in remote monitoring programs can help providers monitor the state of patients with such conditions (Van-Dai et al., 2016).

Biomedical signals are collected from many sources, including heart activity, blood pressure, oxygen saturation levels, blood glucose, nerve conduction, and brain activity.  Examples of biomedical signals include the electroneurogram (ENG), electromyogram (EMG), electrocardiogram (ECG), electroencephalogram (EEG), electrogastrogram (EGG), and phonocardiogram (PCG).  Real-time analytics of biomedical signals will provide better management of chronic diseases, earlier detection of adverse events such as heart attacks and strokes, and earlier diagnosis of disease.  These biomedical signals can be discrete or continuous based on the kind of care or the severity of a particular pathological condition (Malik & Sangwan, 2015; Van-Dai et al., 2016).

Genomic data analysis helps in better understanding the relationships among genes, mutations, and disease conditions.  It has great potential for the development of gene therapies to cure certain conditions.  Furthermore, genomic data analytics can assist in translating genetic discoveries into personalized medicine practice (Liang & Kelemen, 2016; Luo, Wu, Gopukumar, & Zhao, 2016; Palanisamy & Thirunavukarasu, 2017; Van-Dai et al., 2016).

Clinical text analytics using data mining is the process of transforming information from clinical notes, stored in unstructured formats, into useful patterns.  Manual coding of clinical notes is costly and time-consuming because of their unstructured nature, heterogeneity, and differing formats and contexts across patients and practitioners.  Methods such as natural language processing (NLP) and information retrieval can be used to extract useful knowledge from large volumes of clinical text and automatically encode clinical information in a timely manner (Ghani, Zheng, Wei, & Friedman, 2014; Sun & Reddy, 2013; Van-Dai et al., 2016).
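As a toy illustration of this idea, the sketch below extracts the highest-weighted terms from a few invented clinical notes using TF-IDF in scikit-learn; a production pipeline would rely on dedicated clinical NLP tooling rather than this simplified approach.

```python
# Sketch: surfacing salient terms from unstructured clinical notes with TF-IDF.
# The notes are invented examples; scikit-learn is assumed to be available.
from sklearn.feature_extraction.text import TfidfVectorizer

notes = [
    "Patient reports chest pain and shortness of breath, history of hypertension.",
    "Follow-up for type 2 diabetes; HbA1c improved, continue metformin.",
    "No acute distress; blood pressure controlled on current medication.",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(notes)

# The top-weighted terms per note hint at the clinical concepts worth encoding.
terms = vectorizer.get_feature_names_out()
for i, row in enumerate(tfidf.toarray()):
    top = sorted(zip(terms, row), key=lambda t: t[1], reverse=True)[:3]
    print(f"note {i}: {[term for term, _ in top]}")
```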

Social network healthcare data analytics is based on data collected from social media sources, such as social networking sites (e.g., Facebook and Twitter) and web logs, to discover new patterns and knowledge that can be leveraged to model and predict global health trends such as outbreaks of infectious epidemics (InformationBuilders, 2018; Luo et al., 2016; Van-Dai et al., 2016; Zia & Khan, 2017).  Figure 1 shows a summary of these healthcare data sources.


Figure 1.  Healthcare Data Sources.

Healthcare Big Data Analytics Design Proposal Using Hadoop

            The implementation of BDA in the hospitals within the four states aims to improve patient safety and clinical outcomes and to promote wellness and disease management (Alexandru et al., 2016; HIMSS, 2018).  The BDA system will take advantage of the large volume of healthcare-generated data to support several applied analytical disciplines spanning the statistical, contextual, quantitative, predictive, and cognitive spectrums (Alexandru et al., 2016; HIMSS, 2018).  These applied analytical disciplines will drive fact-based decision making for planning, management, and learning in hospitals (Alexandru et al., 2016; HIMSS, 2018).

            The proposal begins with the requirements, followed by the data flow diagram, the communication flowcharts, and the overall system diagram.  It then addresses the regulations, policies, and governance for the medical system.  The limitations and assumptions are also addressed, followed by the justification for the overall design.

1.      Basic Design Requirements

The basic requirements for the implementation of this proposal include not only the tools and software but also training at all levels, from staff to nurses, clinicians, and patients.  The requirements are divided into system requirements, implementation requirements, and training requirements.

1.1 Cloud Computing Technology Adoption Requirement

Volume is one of the defining characteristics of BD, especially in the healthcare industry (Manyika et al., 2011).  Given the challenges addressed earlier when dealing with BD and BDA in healthcare, the system requirements cannot be met with a traditional on-premises data center, as it cannot handle the intensive computation requirements of BD or the storage requirements for all the medical information from the hospitals in the four states (Hu et al., 2014).  Thus, a cloud computing environment is the more appropriate solution for the implementation of this proposal.  Cloud computing plays a significant role in BDA (Assunção et al., 2015), and the massive computation and storage requirements of BDA create a critical need for this emerging technology (Mehmood, Natgunanathan, Xiang, Hua, & Guo, 2016).  Cloud computing offers various benefits such as cost reduction, elasticity, pay-per-use pricing, availability, reliability, and maintainability (Gupta, Gupta, & Mohania, 2012; Kritikos, Kirkham, Kryza, & Massonet, 2017).  However, it also has security and privacy issues under the standard deployment models of public, private, hybrid, and community clouds.  Thus, one of the major requirements is to adopt the Virtual Private Cloud (VPC), which has been regarded as the most prominent approach to trusted computing technology (Abdul, Jena, Prasad, & Balraju, 2014).

 1.2 Security Requirement

Cloud computing has been facing various threats (Cloud Security Alliance, 2013, 2016, 2017).  Records show that over the three years from 2015 through 2017, the number of breaches, lost medical records, and fine settlements was staggering (Thompson, 2017).  The Office for Civil Rights (OCR) issued 22 resolution agreements, requiring monetary settlements approaching $36 million (Thompson, 2017).  Table 1 shows the data categories and the total for each year.

Table 1.  Approximation of Records Lost by Category Disclosed on HHS.gov (Thompson, 2017)

Furthermore, a recent report (HIPAA, 2018d) showed that the first three months of 2018 saw 77 healthcare data breaches reported to the OCR.  In the second quarter of 2018, at least 3.14 million healthcare records were exposed (HIPAA, 2018a), and in the third quarter of 2018, 4.39 million records were exposed in 117 breaches (HIPAA, 2018c).

Thus, the protection of patients' private information requires technology that can extract, analyze, and correlate potentially sensitive datasets (HIPAA, 2018b).  The implementation of BDA requires security measures and safeguards to protect patient privacy in the healthcare industry (HIPAA, 2018b).  Sensitive data should be encrypted to prevent exposure in the event of theft (Abernathy & McMillan, 2016).  The security requirements apply both to the VPC cloud deployment model and to the local hospitals in each state (Regola & Chawla, 2013).  Security in the VPC should include security groups and network access control lists so that the right individuals have access to the right applications and patient records.  A security group in a VPC acts as the first line of defense, a firewall for the associated instances in the VPC (McKelvey, Curran, Gordon, Devlin, & Johnston, 2015).  Network access control lists act as the second layer of defense, a firewall for the associated subnets that controls inbound and outbound traffic at the subnet level (McKelvey et al., 2015).
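To make these two layers concrete, the following is a minimal sketch of how a security group and a network ACL rule might be configured, assuming an AWS-style VPC and the Python boto3 SDK; the VPC ID, CIDR ranges, and port numbers are illustrative placeholders rather than part of the proposal.

```python
# Illustrative sketch (assumes an AWS-style VPC and boto3; IDs and CIDRs are placeholders).
import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")

# First line of defense: a security group that only admits HTTPS from hospital networks.
sg = ec2.create_security_group(
    GroupName="ehr-app-sg",
    Description="HTTPS access to EHR application servers",
    VpcId="vpc-0123456789abcdef0",           # hypothetical VPC
)
ec2.authorize_security_group_ingress(
    GroupId=sg["GroupId"],
    IpPermissions=[{
        "IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
        "IpRanges": [{"CidrIp": "10.10.0.0/16", "Description": "Hospital WAN"}],
    }],
)

# Second line of defense: a network ACL entry on the subnet allowing the same traffic.
acl = ec2.create_network_acl(VpcId="vpc-0123456789abcdef0")
ec2.create_network_acl_entry(
    NetworkAclId=acl["NetworkAcl"]["NetworkAclId"],
    RuleNumber=100, Protocol="6", RuleAction="allow", Egress=False,
    CidrBlock="10.10.0.0/16", PortRange={"From": 443, "To": 443},
)
```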

Security at the local hospital level in each state is mandatory to protect patient records and comply with HIPAA regulations (Regola & Chawla, 2013).  Medical equipment must be secured with authentication and authorization techniques so that only medical staff, nurses, and clinicians have access to medical devices based on their roles.  General access should be prohibited, as every member of the hospital has a different role with different responsibilities.  Encryption should be used to hide the meaning or intent of communication from unintended recipients (Stewart, Chapple, & Gibson, 2015).  Encryption is an essential security control, especially for data in transit (Stewart et al., 2015).  The hospitals in all four states should implement the same encryption controls, such as PKI, cryptographic applications, and symmetric key algorithms (Stewart et al., 2015).

The system requirements should also include identity management systems that interoperate across the hospitals in each state.  An identity management system provides authentication and authorization so that only those who should have access to patients' medical records can obtain it.  The proposal requires the implementation of encryption protocols such as Secure Sockets Layer (SSL), Transport Layer Security (TLS), and Internet Protocol Security (IPSec) to protect information transferred over public networks (Zhang, R. & Liu, 2010).

 1.3 Hadoop Implementation for Data Stream Processing Requirement

While the velocity of BD refers to the speed at which large volumes of data are generated and must be processed (Hu et al., 2014), the variety of the data requires specific technology capable of handling various types of datasets: structured, semi-structured, and unstructured (Bansal, Deshpande, Ghare, Dhikale, & Bodkhe, 2014; Hu et al., 2014).  The Hadoop ecosystem is found to be the most appropriate system for implementing BDA (Bansal et al., 2014; Dhotre, Shimpi, Suryawanshi, & Sanghati, 2015).  The implementation requirements include various technologies and tools.  This section covers the components required when implementing Hadoop technology for the healthcare BDA system in the four states.

Hadoop has three significant limitations, which must be addressed in this design.  The first is the lack of technical support and documentation for open-source Hadoop (Guo, 2013).  Thus, this design requires an enterprise edition of Hadoop, such as Cloudera, Hortonworks, or MapR, to work around this limitation (Guo, 2013); the final product decision will be made by the cost analysis team.  The second limitation is that Hadoop is not optimal for real-time data processing (Guo, 2013).  The solution is to integrate a real-time streaming framework such as Spark, Storm, or Kafka (Guo, 2013; Palanisamy & Thirunavukarasu, 2017); the Spark integration is discussed below as a separate requirement of this design (Guo, 2013).  The third limitation is that Hadoop is not a good fit for large graph datasets (Guo, 2013).  The solution is to integrate GraphLab, which is also discussed below as a separate requirement of this design.

1.3.1 Hadoop Ecosystem for Data Processing

Hadoop technologies have been front-runners for Big Data applications (Bansal et al., 2014; Chrimes, Zamani, Moa, & Kuo, 2018).  The Hadoop ecosystem will be part of the implementation requirement, as it has proven to serve well for intensive computation on large datasets (Raghupathi & Raghupathi, 2014; Wang et al., 2018).  The implementation of Hadoop will be performed in the VPC deployment model.  The required Hadoop version is 2.x, which includes YARN for resource management (Karanth, 2014).  Hadoop 2.x also includes HDFS snapshots, which provide a read-only image of the entire filesystem or a particular subset of it to protect against user errors and to support backup and disaster recovery (Karanth, 2014).  The Hadoop platform can be implemented to gain insight into various areas (Raghupathi & Raghupathi, 2014; Wang et al., 2018).  The Hadoop ecosystem involves the Hadoop Distributed File System (HDFS), MapReduce, and NoSQL databases such as HBase and Hive to handle large volumes of data, using various algorithms and machine learning to extract value from medical records that are structured, semi-structured, and unstructured (Raghupathi & Raghupathi, 2014; Wang et al., 2018).  Other supporting components include Oozie for workflow, Pig for scripting, and Mahout for machine learning, which is part of artificial intelligence (AI) (Ankam, 2016; Karanth, 2014).  The ecosystem will also include Flume for log collection, Sqoop for data exchange, and ZooKeeper for coordination (Ankam, 2016; Karanth, 2014).  HCatalog is a required component to manage metadata in Hadoop (Ankam, 2016; Karanth, 2014).  Figure 2 shows the Hadoop ecosystem before integrating Spark for real-time analytics.


Figure 2.  Hadoop Architecture Overview (Alguliyev & Imamverdiyev, 2014).
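To make the MapReduce layer of this ecosystem concrete, the following is a minimal Hadoop Streaming sketch in Python that counts visits per diagnosis code from delimited records; the record layout, file paths, and the idea of running mapper and reducer from one script are assumptions for illustration only (in practice they would be separate scripts).

```python
# Minimal Hadoop Streaming sketch (assumed record layout: patient_id|state|diagnosis_code|...).
import sys

def mapper():
    # Emit (diagnosis_code, 1) for each visit record read from standard input.
    for line in sys.stdin:
        fields = line.rstrip("\n").split("|")
        if len(fields) >= 3:
            print(f"{fields[2]}\t1")

def reducer():
    # Sum the counts for each diagnosis code; streaming input arrives sorted by key.
    current_key, count = None, 0
    for line in sys.stdin:
        key, value = line.rstrip("\n").split("\t")
        if key != current_key:
            if current_key is not None:
                print(f"{current_key}\t{count}")
            current_key, count = key, 0
        count += int(value)
    if current_key is not None:
        print(f"{current_key}\t{count}")

# Submitted with the Hadoop Streaming jar, e.g.:
#   hadoop jar hadoop-streaming.jar -mapper "visits.py map" -reducer "visits.py reduce" \
#       -input /ehr/visits -output /ehr/visit_counts
if __name__ == "__main__":
    mapper() if sys.argv[1:] == ["map"] else reducer()
```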

1.3.2 Hadoop-specific File Format for Splittable and Agnostic Compression

The ability to split files plays a significant role during data processing (Grover, Malaska, Seidman, & Shapira, 2015).  Therefore, Hadoop-specific file formats such as SequenceFile, serialization formats like Avro, and columnar formats such as RCFile and Parquet should be used, because these formats share two characteristics that are essential for Hadoop applications: splittable compression and agnostic compression (Grover et al., 2015).  Splittability allows large files to be divided into inputs for MapReduce and other types of jobs, which is required for parallel processing and is key to leveraging Hadoop's data locality feature (Grover et al., 2015).  Agnostic compression allows data to be compressed with any codec without readers having to know which one, because the codec is stored in the header metadata of the file format (Grover et al., 2015).  Figure 3 summarizes the three Hadoop file types and the two common characteristics.


Figure 3. Three Hadoop File Types with the Two Common Characteristics.  
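As an illustration of a splittable, compression-agnostic columnar format, the sketch below writes a small batch of lab results to Parquet with Snappy compression using pyarrow; the column names and file name are hypothetical.

```python
# Sketch: writing lab results to a splittable columnar format (Parquet) with Snappy compression.
# Column names are hypothetical; pyarrow is assumed to be available on the ingest nodes.
import pyarrow as pa
import pyarrow.parquet as pq

lab_results = pa.table({
    "patient_id": ["p-001", "p-002", "p-003"],
    "test_code":  ["HbA1c", "LDL", "HbA1c"],
    "result":     [6.1, 128.0, 7.4],
    "units":      ["%", "mg/dL", "%"],
})

# The codec is recorded in the file metadata, so readers need not know it in advance.
pq.write_table(lab_results, "lab_results.parquet", compression="snappy")

# Reading back only the columns needed for an analysis (columnar projection).
subset = pq.read_table("lab_results.parquet", columns=["patient_id", "result"])
print(subset.to_pydict())
```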

1.3.3 XML and JSON Use in Hadoop

Clinical data includes semi-structured formats such as XML and JSON.  Splitting XML and JSON is not straightforward and presents unique challenges in Hadoop, since Hadoop does not provide a built-in InputFormat for either format (Grover et al., 2015).  Furthermore, JSON presents more challenges than XML because no token marks the beginning or end of a record (Grover et al., 2015).  When using these file formats, two primary considerations apply.  First, a container format such as Avro should be used, because Avro provides a compact and efficient method to store and process the data once it has been transformed into Avro (Grover et al., 2015).  Second, a library for processing XML or JSON should be used (Grover et al., 2015): XMLLoader in Pig's PiggyBank library is an example for XML data, and the Elephant Bird project is an example for JSON data (Grover et al., 2015).
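A minimal sketch of the container-format approach just described follows: it wraps JSON clinical notes in an Avro file using the fastavro library, so downstream Hadoop jobs can split and read the data without format-specific parsing.  The schema and field names are illustrative assumptions.

```python
# Sketch: wrapping JSON clinical records in an Avro container so Hadoop jobs can split them.
# Schema and field names are illustrative; fastavro is assumed to be installed.
import json
from fastavro import writer, parse_schema

schema = parse_schema({
    "name": "ClinicalNote",
    "type": "record",
    "fields": [
        {"name": "patient_id", "type": "string"},
        {"name": "author",     "type": "string"},
        {"name": "note",       "type": "string"},
    ],
})

json_lines = [
    '{"patient_id": "p-001", "author": "Dr. A", "note": "Follow-up in two weeks."}',
    '{"patient_id": "p-002", "author": "Dr. B", "note": "Blood pressure stable."}',
]
records = [json.loads(line) for line in json_lines]

# Avro files carry their schema and are block-compressed, so they remain splittable in HDFS.
with open("clinical_notes.avro", "wb") as out:
    writer(out, schema, records, codec="deflate")
```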

1.4 HBase and MongoDB NoSQL Database Integration Requirement

In the age of BD and BDA, traditional data stores are inadequate to handle not only the large volume of data but also the various data formats, such as unstructured and semi-structured data (Hu et al., 2014).  Thus, Not Only SQL (NoSQL) databases emerged to meet the requirements of BDA.  These NoSQL data stores are used for modern, scalable databases (Sahafizadeh & Nematbakhsh, 2015).  Their scalability enables systems to increase throughput when demand increases during data processing (Sahafizadeh & Nematbakhsh, 2015).  The platform can incorporate two scalability types to support large volumes of data: horizontal and vertical scalability.  Horizontal scaling distributes the workload across many servers and nodes to increase throughput, while vertical scaling requires more processors, more memory, and faster hardware on a single server (Sahafizadeh & Nematbakhsh, 2015).

NoSQL data stores come in many varieties, such as MongoDB, CouchDB, Redis, Voldemort, Cassandra, Bigtable, Riak, HBase, Hypertable, ZooKeeper, Vertica, Neo4j, db4o, and DynamoDB.  These data stores are categorized into four types: document-oriented, column-oriented (column-family), graph, and key-value (EMC, 2015; Hashem et al., 2015).  A document-oriented data store can store and retrieve collections of data and documents using complex data forms in various formats such as XML and JSON as well as PDF and MS Word (EMC, 2015; Hashem et al., 2015); MongoDB and CouchDB are examples (EMC, 2015; Hashem et al., 2015).  A column-oriented data store stores content in columns rather than rows, with the attributes of a column stored contiguously (Hashem et al., 2015); this type of data store can store and render blog entries, tags, and feedback (Hashem et al., 2015), and Cassandra, DynamoDB, and HBase are examples (EMC, 2015; Hashem et al., 2015).  A key-value store can store and scale large volumes of data, with a key used to access each value (EMC, 2015; Hashem et al., 2015); the value can be complex, and this type of store is useful for holding a user's login ID as the key referencing patient information.  Redis and Riak are examples of key-value NoSQL data stores (Alexandru et al., 2016).  Each of these NoSQL data stores has its own advantages and limitations.  A graph NoSQL database stores and represents data using graph models with nodes, edges, and properties related to one another through relationships, which will be useful for unstructured medical data such as images and lab results; Neo4j is an example of this type (Hashem et al., 2015).  Figure 4 summarizes these NoSQL data store types, the data they store, and examples.

Figure 4.  Big Data Analytics NoSQL Data Store Types.

            The proposed design requires one or more NoSQL data stores to meet the requirements of BDA in a Hadoop environment for this healthcare system.  Healthcare big data has unique characteristics that must be addressed when selecting the data store, and consideration must be given to the various types of data.  HBase and HDFS are the most commonly used storage managers in the Hadoop environment (Grover et al., 2015).  HBase is a column-oriented data store that will be used to store multi-structured data (Archenaa & Anita, 2015), and it sits on top of HDFS in the Hadoop ecosystem (Raghupathi & Raghupathi, 2014).

MongoDB will also be used to store semi-structured datasets such as XML and JSON, as well as metadata for the HBase data schema to improve the schema's accessibility and readability (Luo et al., 2016).  Riak will be used for key-value datasets, such as dictionaries, hash tables, and associative arrays, which can hold login and user ID information for patients as well as for providers and clinicians (Klein et al., 2015).  Neo4j will be used to store image data with nodes and edges, such as lab images and X-rays (Alexandru et al., 2016).

The proposed healthcare system has a logical data model and query patterns that must be supported by the NoSQL databases (Klein et al., 2015).  Reading a patient's medical test results is a core function used to populate the user interface, and the model also requires strong replica consistency when a new medical result is written for a patient.  Providers make patient care decisions using these records, so all providers within the hospital systems of the four states must see the same information, whether they are at the same site as the patient or providing telemedicine support from another location.

The logical data model includes mapping the application-specific model onto the particular data model, indexing, and query language capabilities of each database.  The HL7 Fast Healthcare Interoperability Resources (FHIR) standard is used as the logical data model for records analysis.  Patient demographic data, such as names, addresses, and telephone numbers, will be modeled using the FHIR Patient resource, and test results, with fields such as result quantity and result units, will be modeled using the corresponding FHIR resources (Klein et al., 2015).
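A brief sketch of this mapping follows: a FHIR-style Patient resource stored in MongoDB with a majority write concern to approximate the strong replica consistency requirement described above.  The field values, database name, collection name, and host are hypothetical.

```python
# Sketch: storing a FHIR-style Patient resource in MongoDB and supporting the read pattern
# described above. Field values, database, collection, and host names are hypothetical.
from pymongo import MongoClient, ASCENDING
from pymongo.write_concern import WriteConcern

client = MongoClient("mongodb://ehr-mongo:27017")
db = client["ehr"]

# Majority write concern approximates the strong replica consistency requirement.
patients = db.get_collection("patients", write_concern=WriteConcern(w="majority"))

patients.insert_one({
    "resourceType": "Patient",
    "id": "p-001",
    "name": [{"family": "Doe", "given": ["Jane"]}],
    "address": [{"city": "Denver", "state": "CO"}],
    "telecom": [{"system": "phone", "value": "555-0100"}],
})

# Index the identifier used by the user interface so record lookups stay fast.
patients.create_index([("id", ASCENDING)], unique=True)

record = patients.find_one({"id": "p-001"})
print(record["name"][0]["family"])
```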

1.5 Spark Integration for Real-Time Data Processing Requirement

While the Hadoop ecosystem architecture has been designed for scenarios such as data storage, data management, statistical analysis, and statistical association between various data sources using distributed computing and batch processing, this proposal requires real-time data processing, which cannot be met by Hadoop alone (Basu, 2014).  Real-time analytics will add tremendous value to the proposed healthcare system.  Thus, Apache Spark is another component required to implement this proposal (Basu, 2014).  Spark allows in-memory processing for fast response times, bypassing MapReduce operations (Basu, 2014).  With Spark integrated with Hadoop, stream processing, machine learning, interactive analytics, and data integration all become possible (Scott, 2015).  Spark will run on top of Hadoop to benefit from YARN and the underlying storage of HDFS, HBase, and the other Hadoop ecosystem building blocks (Scott, 2015).  Figure 5 shows the core engines of Spark.


Figure 5. Spark Core Engines (Scott, 2015).
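A brief PySpark sketch of the in-memory processing Spark adds on top of HDFS is given below; the HDFS path and column names are assumptions made for illustration.

```python
# Sketch: Spark reading encounter records from HDFS, caching them in memory, and
# computing 30-day readmission counts per hospital. Path and columns are assumed.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("readmission-summary").getOrCreate()

encounters = spark.read.parquet("hdfs:///ehr/encounters")   # hypothetical dataset
encounters.cache()                                          # keep hot data in memory

readmissions = (
    encounters
    .filter(F.col("readmitted_within_30_days") == True)
    .groupBy("hospital_id", "state")
    .count()
    .orderBy(F.desc("count"))
)
readmissions.show()
```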

 1.6 Big Healthcare Data Visualization Requirement

Visualization is one of the most powerful ways to present data (Jayasingh, Patra, & Mahesh, 2016).  It helps in viewing data more meaningfully, in the form of graphs, images, and pie charts that can be understood easily.  It also helps in synthesizing large datasets, such as healthcare data, to get to the core of the raw big data and convey its key points for insight (Meyer, 2018).  Commercial visualization tools include Tableau, Spotfire, QlikView, and Adobe Illustrator; the most commonly used in healthcare are Tableau, PowerBI, and QlikView.  This healthcare design proposal will utilize Tableau.

Healthcare providers are successfully transforming data from information to insight using Tableau software.  Healthcare organizations can take three approaches to get more from their datasets.  The first is to broaden data access by empowering healthcare departments to explore their own data.  The second is to uncover answers by combining data from multiple systems to reveal trends and outliers.  The third is to share insights with executives, providers, and others to drive collaboration (Tableau, 2011).  Tableau has several advantages, including interactive visualization with drag-and-drop techniques, handling large amounts of data and millions of rows with ease, and integration with scripting languages such as Python (absentdata.com, 2018).  It also provides mobile support and responsive dashboards.  Its limitations are that it requires substantial training to master fully, lacks automatic refreshing and conditional formatting, and has a 16-column table limit (absentdata.com, 2018).  Figure 6 shows a patient cycle time data visualization built with Tableau.


Figure 6. Patient Cycle Time Data Visualization Example (Tableau, 2011).

1.7 Artificial Intelligence Integration Requirement

Artificial intelligence (AI) is a computational technique that allows machines to perform cognitive functions, such as acting or reacting to input, similar to the way humans do (Patrizio, 2018).  Traditional computing applications react to data, but their reactions and responses must be hand-coded with human intervention (Patrizio, 2018).  AI systems, by contrast, are continuously in flux, changing their behavior to accommodate changes in results and modifying their reactions accordingly (Patrizio, 2018).  AI techniques include video recognition, natural language processing, speech recognition, machine learning engines, and automation (Mills, 2018).

The healthcare system can benefit from integrating BDA with artificial intelligence (Bresnick, 2018).  Since AI can play a significant role in healthcare BDA, this proposal suggests implementing machine learning, a branch of AI, to deploy more precise and impactful interventions at the right time in patient care (Bresnick, 2018).  The application of AI in the proposed design therefore requires machine learning (Patrizio, 2018).  Because the data used for AI and machine learning has already been cleaned of duplicates and unnecessary records, AI can take advantage of this filtered data, leading to healthcare breakthroughs such as genomic and proteomic experiments that enable personalized medicine (Kersting & Meyer, 2018).

The healthcare industry has been utilizing AI, machine learning (ML), and data mining (DM) to extract value from BD by transforming large medical datasets into actionable knowledge through predictive and prescriptive analytics (Palanisamy & Thirunavukarasu, 2017).  ML will be used to develop sophisticated algorithms that process massive medical datasets, including structured, unstructured, and semi-structured data, and perform advanced analytics (Palanisamy & Thirunavukarasu, 2017).  Apache Mahout, an open-source ML library, will be integrated with Hadoop to facilitate the execution of scalable machine learning algorithms, offering techniques such as recommendation, classification, and clustering (Palanisamy & Thirunavukarasu, 2017).
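The proposal names Mahout for scalable ML; purely as a stand-in illustration of the kind of classification pipeline intended, the sketch below trains a readmission classifier with Spark MLlib.  The training table, feature names, and label column are hypothetical.

```python
# Illustrative ML sketch using Spark MLlib as a stand-in for the scalable learning the
# proposal assigns to Mahout. Feature names and the training table are hypothetical.
from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import LogisticRegression
from pyspark.ml import Pipeline

spark = SparkSession.builder.appName("readmission-model").getOrCreate()
visits = spark.read.parquet("hdfs:///ehr/labeled_visits")    # label column: readmitted (0/1)

assembler = VectorAssembler(
    inputCols=["age", "num_prior_admissions", "length_of_stay", "num_medications"],
    outputCol="features",
)
model = Pipeline(stages=[
    assembler,
    LogisticRegression(featuresCol="features", labelCol="readmitted"),
]).fit(visits)

# Score encounters; the predicted probability can feed the clinical decision support layer.
scored = model.transform(visits).select("patient_id", "probability", "prediction")
scored.show(5)
```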

1.8 Internet of Things (IoT) Integration Requirement

The Internet of Things (IoT) refers to the growing number of connected devices with IP addresses, which were not common years ago (Anand & Clarice, 2015; Thompson, 2017).  These connected devices collect information and use their IP addresses to transmit it (Thompson, 2017).  Healthcare providers take advantage of the collected information to find new treatment methods and increase efficiency (Thompson, 2017).

The implementation of IoT will involve various technologies, including radio frequency identification (RFID), near field communication (NFC), machine-to-machine communication (M2M), wireless sensor networks (WSN), and addressing schemes (IPv6 addresses) (Anand & Clarice, 2015; Kumari, 2017).  It also requires machine learning algorithms to find patterns, correlations, and anomalies that have the potential to enable healthcare improvements (O'Brien, 2016), as sketched below.  Machine learning is a critical component of artificial intelligence; thus, the success of IoT depends on the AI implementation.
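The following small sketch illustrates the anomaly-detection step on IoT device readings, using scikit-learn's IsolationForest on simulated vital signs; the features, thresholds, and data are illustrative only and not part of the proposal's specified toolset.

```python
# Sketch: flagging anomalous vital-sign readings from connected devices.
# Uses scikit-learn's IsolationForest on simulated data; features and thresholds are illustrative.
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(seed=7)

# Simulated (heart_rate, systolic_bp) readings from remote monitoring devices.
normal = rng.normal(loc=[72, 120], scale=[8, 10], size=(500, 2))
spikes = np.array([[150, 200], [40, 70]])          # obviously abnormal readings
readings = np.vstack([normal, spikes])

detector = IsolationForest(contamination=0.01, random_state=7).fit(normal)
flags = detector.predict(readings)                  # -1 marks an anomaly

for reading in readings[flags == -1]:
    print(f"Alert: abnormal reading heart_rate={reading[0]:.0f}, systolic_bp={reading[1]:.0f}")
```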

1.9 Training Requirement

This design proposal requires training for IT professionals, providers, clinicians, and everyone else who will use this healthcare ecosystem, tailored to their roles (Alexandru et al., 2016; Archenaa & Anita, 2015).  Each component of the ecosystem should have its own training, such as training for Hadoop/MapReduce, Spark, security, and so forth.  Training will play a significant role in the success of this implementation of BD and BDA in the healthcare system across the four states of Colorado, Utah, Arizona, and New Mexico.  Patients should also receive training for remote monitoring programs, such as blood sugar and blood pressure monitoring applications.  The senior generation might face some challenges, but with technical support these can be alleviated.

2.      Data Flow Diagram

            This section discusses the data flow for the proposed healthcare BDA ecosystem.

2.1 HBase Cluster and HDFS Data Flow

HBase stores data in tables with a predefined schema and specified column families (Yang, Liu, Hsu, Lu, & Chu, 2013).  The table schema must be predefined and the column families specified, but new columns can be added to families as required, making the schema flexible and adaptable to changing application requirements (Yang et al., 2013).  HBase is structured similarly to HDFS, with a NameNode and slave nodes, and to MapReduce, with a JobTracker and TaskTracker slaves (Yang et al., 2013).  HBase will play a vital role in the Hadoop cluster environment.  In HBase, a master node called the HMaster manages the cluster, while region servers store portions of the tables and perform the work on the data.  The HMaster is responsible for monitoring all RegionServer instances in the cluster and is the interface for all metadata changes; it executes on the NameNode in the distributed Hadoop cluster.  The HRegionServer represents the RegionServer and is responsible for serving and managing regions; it runs on a DataNode in the distributed Hadoop cluster.  ZooKeeper assists in selecting another machine in the cluster as HMaster in case of a failure, unlike the HDFS framework, in which the NameNode is a single point of failure.  The data flow between the DataNodes and the NameNode when integrating HBase on top of HDFS is shown in Figure 7.


Figure 7.  HBase Cluster Data Flow (Yang et al., 2013).
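A minimal sketch of the predefined schema and column families described above follows, using the happybase client against an HBase Thrift gateway; the table, family, row-key layout, and host names are assumptions for illustration.

```python
# Sketch: predefining an HBase table schema with a column family and writing a sensor reading.
# Uses the happybase client against an HBase Thrift gateway; names are hypothetical.
import happybase

connection = happybase.Connection("hbase-thrift.example.org")

# Column families must be declared up front; columns inside them can be added freely later.
if b"patient_vitals" not in connection.tables():
    connection.create_table("patient_vitals", {"vitals": dict(max_versions=3)})

table = connection.table("patient_vitals")

# Row key: patient id + timestamp, so a scan over a patient's prefix returns a time series.
table.put(b"p-001|2018-11-02T10:15:00", {
    b"vitals:heart_rate": b"78",
    b"vitals:spo2": b"97",
})

for key, data in table.scan(row_prefix=b"p-001|"):
    print(key, data[b"vitals:heart_rate"])
```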

2.2 HBase and MongoDB with Hadoop/MapReduce and HDFS Data Flow

The healthcare system integrates four significant components: HBase, MongoDB, Hadoop/MapReduce, and a data visualization tool.  HBase is used for data storage, MongoDB is used for metadata, and Hadoop/MapReduce is used for computation.  Signal data will be stored in HBase, while metadata and other clinical data will be stored in MongoDB.  The data stored in both HBase and MongoDB will be accessible from the Hadoop/MapReduce environment for processing, as well as from the data visualization layer.  The cluster will consist of one master node, eight slave nodes, and several supporting servers.  The data will be imported into Hadoop and processed via MapReduce, and the results of the computation will be viewed through a data visualization tool such as Tableau.  Figure 8 shows the data flow between these four components of the proposed healthcare ecosystem.


Figure 8.  The Proposed Data Flow Between Hadoop/MapReduce and Other Databases.

2.3 XML Design Flow Using ETL Process with MongoDB 

Healthcare records contain various types of data, from structured and semi-structured to unstructured (Luo et al., 2016).  Some of these records are XML-based, in a semi-structured format using tags; XML stands for eXtensible Markup Language (Fawcett, Ayers, & Quin, 2012).  The healthcare sector can derive value from these XML documents, which represent semi-structured data (Aravind & Agrawal, 2014).  An example of an XML-based patient record is shown in Figure 9.


Figure 9.  Example of the Patient’s Electronic Health Record (HL7, 2011)

XML-based records need to be ingested into the Hadoop system for analysis so that value can be derived from this semi-structured data.  However, Hadoop does not offer a standard XML "RecordReader" (Lublinsky, Smith, & Yakubovich, 2013), even though XML is one of the standard file formats processed with MapReduce.  Various approaches can be used to process XML semi-structured data; in this design, an ETL (Extract, Transform, and Load) process is used to move XML data into MongoDB, the document-oriented NoSQL database required by this proposal.

The ETL process in MongoDB starts with extract and transform.  The MongoDB application provides the ability to map the XML elements within a document to the downstream data structure.  The application supports the ability to unwind simple arrays or present embedded documents using appropriate data relationships such as one-to-one (1:1), one-to-many (1:M), or many-to-many (M:M) (MongoDB, 2018).  The application infers schema information by examining a subset of documents within the target collections, and organizations can add fields to the discovered data model that may not have been present in the subset of documents used for schema inference.  The application also infers information about the existing indexes for the collections to be queried and prompts or warns about queries that do not use any indexed fields.  It can return a subset of fields from documents using query projections.  For queries against MongoDB replica sets, the application supports specifying custom MongoDB read preferences for individual query operations.  The application then infers information about the sharded cluster deployment and notes the shard key fields for each sharded collection.  For queries against MongoDB sharded clusters, the application warns against queries that do not use proper query isolation, because broadcast queries in a sharded cluster can negatively impact database performance (MongoDB, 2018).

The load process in MongoDB is performed after extract and transform.  The application supports writing data to any MongoDB deployment, whether a single node, a replica set, or a sharded cluster.  For writes to a MongoDB sharded cluster, the application informs the user or displays an error message if XML documents do not contain a shard key.  A custom WriteConcern can be used for any write operation against a running MongoDB deployment.  For bulk loading, documents can be written in batches using the insert() method with MongoDB 2.6 or above, which supports the bulk update database command.  For bulk loading into a MongoDB sharded deployment, bulk insert into a sharded collection is supported, including pre-splitting of the collection's shard key and inserting via multiple mongos processes.  Figure 10 shows this ETL process for XML-based patient records using MongoDB.


Figure 10.  The Proposed XML ETL Process in MongoDB.
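A condensed sketch of this extract-transform-load path follows: parse a patient XML document, map its elements to a MongoDB document, and bulk-insert the batch.  The element names, field names, and connection string are illustrative and not taken from an actual HL7 export.

```python
# Sketch of the XML-to-MongoDB ETL path: extract elements, transform to documents, bulk load.
# Element and field names are illustrative, not taken from an actual HL7 export.
import xml.etree.ElementTree as ET
from pymongo import MongoClient

xml_batch = ["""
<patientRecord>
  <id>p-001</id>
  <name>Jane Doe</name>
  <observation code="HbA1c" value="6.1" units="%"/>
</patientRecord>
"""]

def transform(xml_text):
    # Extract and transform: map XML elements to the downstream document structure.
    root = ET.fromstring(xml_text)
    return {
        "patient_id": root.findtext("id"),
        "name": root.findtext("name"),
        "observations": [obs.attrib for obs in root.findall("observation")],
    }

documents = [transform(x) for x in xml_batch]

# Load step: batched inserts; against a sharded cluster the shard key would be patient_id.
client = MongoClient("mongodb://ehr-mongo:27017")
client["ehr"]["patient_records"].insert_many(documents)
```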

2.4 Real-Time Streaming Spark Data Flow

Real-time streaming can be implemented using a streaming framework such as Spark, Kafka, or Storm.  This healthcare design proposal will integrate the open-source Spark framework for real-time streaming data, such as sensing data from intensive care units, remote monitoring programs, and biomedical signals.  The data from these sources will flow into Spark for analytics and then be imported into the data storage systems.  Figure 11 illustrates the data flow for real-time streaming analytics.

Figure 11.  The Proposed Spark Data Flow.
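A brief Structured Streaming sketch of this flow follows: vital-sign events are read from a Kafka topic, aggregated per patient each minute, and written onward.  The broker address, topic name, and event schema are assumptions for illustration.

```python
# Sketch: real-time vitals flowing into Spark Structured Streaming from Kafka and being
# aggregated per patient each minute. Broker, topic, and schema are assumptions.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("vitals-stream").getOrCreate()

schema = StructType([
    StructField("patient_id", StringType()),
    StructField("heart_rate", DoubleType()),
    StructField("event_time", TimestampType()),
])

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "broker-1:9092")
       .option("subscribe", "icu-vitals")
       .load())

vitals = raw.select(F.from_json(F.col("value").cast("string"), schema).alias("v")).select("v.*")

per_minute = (vitals
              .withWatermark("event_time", "2 minutes")
              .groupBy(F.window("event_time", "1 minute"), "patient_id")
              .agg(F.avg("heart_rate").alias("avg_heart_rate")))

query = per_minute.writeStream.outputMode("update").format("console").start()
query.awaitTermination()
```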

3.      Communication Workflow

The communication flow involves the stakeholders in the healthcare system: providers, insurers, pharmaceutical vendors, IT professionals, and practitioners.  The flow is centered on the patient-centric healthcare system, which uses cloud computing technology for the four states of Colorado, Utah, Arizona, and New Mexico; the stakeholders are drawn from these states.  The patient-centric healthcare system is the central point of communication.  Patients communicate with the central system through the web-based platform and clinical forums as needed.  Providers communicate with the system through resource usage, patient feedback, hospital visits, and service details.  Insurers communicate with the system through claims databases and census and societal data.  Pharmaceutical vendors communicate with the system through prescription and drug reports, which providers can retrieve from anywhere in the four states.  IT professionals and practitioners communicate with the system for data streaming, medical records, genomics, and other omics data analysis and reporting.  Figure 12 shows the communication flow between these stakeholders and the central system in the cloud, which can be accessed from any of the four states.

Figure 12.  The Proposed Patient-Centric Healthcare System Communication Flow.

4.      Overall System Diagram

The overall system represents a state-of-the-art healthcare ecosystem that utilizes the latest technology for healthcare Big Data Analytics.  The system is bounded by regulations and policies such as HIPAA to ensure the protection of patients' privacy across its various layers.  Its integrated components include the latest Hadoop technology with MapReduce and HDFS.  The data governance layer is the bottom layer, containing three major building blocks: master data management (MDM), data life-cycle management (DLM), and data security and privacy management.  The MDM component is responsible for data completeness, accuracy, and availability, while the DLM component is responsible for archiving data, maintaining the data warehouse, and handling data deletion and disposal.  The data security and privacy management building block is responsible for sensitive data discovery, vulnerability and configuration assessment, application of security policies, auditing and compliance reporting, activity monitoring, identity and access management, and data protection.  The layers above it are the data layer, the data aggregation layer, the data analytics layer, and the information exploration layer.  The data layer is responsible for data sources and content formats, while the data aggregation layer involves the data acquisition process, transformation engines, and the data storage area using Hadoop, HDFS, and NoSQL databases such as MongoDB and HBase.  The data analytics layer involves the Hadoop/MapReduce mapping process, stream computing, real-time streaming, and database analytics; AI and IoT are part of this layer.  The information exploration layer involves data visualization, visual reporting, real-time monitoring using a healthcare dashboard, and clinical decision support.  Figure 13 illustrates the overall system diagram with these layers.


Figure 13.  The Proposed Healthcare Overall System Diagram.

5.      Regulations, Policies, and Governance for the Medical Industry

Healthcare data must be stored in a secure storage area to protect the information and the privacy of patients (Liveri, Sarri, & Skouloudi, 2015).  When the healthcare industry fails to comply with regulations and policies, the fines and costs can cause financial stress (Thompson, 2017).  Records show that the healthcare industry has paid millions of dollars in fines.  Advocate Health Care in suburban Chicago agreed to the most significant settlement as of August 2016, totaling $5.55 million (Thompson, 2017), and Memorial Health System in southern Florida became the second entity to top $5 million (Thompson, 2017).  Table 2 shows the five largest fines posted to the Office for Civil Rights (OCR) site.

Table 2.  Five Largest Fines Posted to OCR Web Site (Thompson, 2017)

Hospitals must adhere carefully to data privacy regulations and legislative rules to protect patients' medical records from data breaches (HIPAA).  Proper security policy and risk management must be implemented to ensure the protection of private information and to minimize the impact of confidential data loss or theft (HIPAA, 2018a, 2018c; Salido, 2010).  The healthcare system design requires a process for handling hospitals or providers that are not compliant with regulations and policies, along with an escalation path (Salido, 2010).  This design proposal implements four major principles as best practice to comply with required policies and regulations and to protect the confidential data assets of patients and users (Salido, 2010).  The first principle is to honor policies throughout the private data's life (Salido, 2010).  The second is to minimize the risk of unauthorized access to or misuse of confidential data (Salido, 2010).  The third is to minimize the impact of confidential data loss, and the fourth is to document appropriate controls and demonstrate their effectiveness (Salido, 2010).  Figure 14 shows these four principles, which this healthcare design proposal adheres to in order to protect healthcare data from unauthorized users and comply with the required regulations and policies.


Figure 14.  Healthcare Design Proposal Four Principles.

6.      Assumptions and Limitations

This design proposal assumes that the healthcare sector in the four states will support the application of BD and BDA across all four of them.  This support includes investment in the proper technology, tools, and training based on the requirements of this proposal.  The proposal also assumes that the stakeholders, including providers, patients, insurers, pharmaceutical vendors, and practitioners, will welcome the application of BDA and take advantage of it to provide efficient healthcare services, increase productivity, decrease costs for the healthcare sector as well as for patients, and provide better patient care.

            The main limitation of this proposal is the timeframe required to implement it.  With the support of the healthcare sector in these four states, the implementation can be expedited; however, the silo pattern and rigid culture of healthcare may interfere with the implementation, which could then take longer than expected.  The initial implementation might also face unexpected challenges, most likely stemming from the lack of IT professionals and managers experienced in the BD and BDA domain.  This design proposal will be refined based on observations from the first few months of implementation.

7.      Justification for the Overall Design

            Traditional database and analytical systems are inadequate for healthcare data in the age of BDA.  The characteristics of healthcare datasets, including the large volume of medical records, the variety of data from structured to semi-structured to unstructured, and the velocity of data generation and processing, require technology such as cloud computing (Fernández et al., 2014).  Cloud computing is found to be the best solution for BD and BDA because it addresses the challenges of BD storage and compute-intensive processing demands (Alexandru et al., 2016; Hashem et al., 2015).  The healthcare system in the four states will shift its communication technology and services for applications across the hospitals and providers (Hashem et al., 2015).  Advantages of cloud computing adoption include virtualized resources, parallel processing, security, and data service integration with scalable data storage (Hashem et al., 2015).  With cloud computing, the healthcare sector in the four states will reduce costs and increase efficiency (Hashem et al., 2015).  When quick access to critical patient care data is required, the ability to access data from anywhere is one of the most significant advantages of the cloud adoption recommended by this design (Carutasu, Botezatu, Botezatu, & Pirnau, 2016).  The benefits of cloud computing include technological benefits such as virtualization, multi-tenancy, data and storage, and security and privacy compliance (Chang, 2015); economic benefits such as pay-per-use pricing, cost reduction, and return on investment (Chang, 2015); and non-functional benefits covering elasticity, quality of service, reliability, and availability (Chang, 2015).  Thus, the proposed design justifies the use of cloud computing, which has proven to be the best technology for BDA, especially for healthcare data analytics.

            Although cloud computing offers several benefits to the proposed healthcare system, it has suffered from security and privacy concerns (Balasubramanian & Mala, 2015; Kazim & Zhu, 2015).  The security concerns involve risk areas such as external data storage, dependency on the public internet, lack of control, multi-tenancy, and integration with internal security (Hashizume, Rosado, Fernández-medina, & Fernandez, 2013).  Traditional security techniques such as identity, authentication, and authorization are not sufficient for cloud computing environments in their current forms under the standard public and private cloud deployment models (Hashizume et al., 2013).  The increasing trend in security threats and data breaches, together with the fact that the current private and public deployment models do not meet these security challenges, has triggered the need for another deployment model to ensure security and privacy protection.  Thus, this design adopts the VPC, a newer deployment model of cloud computing technology (Botta et al., 2016; Sultan, 2010; Venkatesan, 2012; Zhang, Q., Cheng, & Boutaba, 2010).  The VPC takes advantage of technologies such as the virtual private network (VPN), which allows hospitals and providers to set up their required network settings, including security (Botta et al., 2016; Sultan, 2010; Venkatesan, 2012; Zhang, Q. et al., 2010).  The VPC deployment model will have dedicated resources and a VPN to provide the isolation required to protect patients' information (Botta et al., 2016; Sultan, 2010; Venkatesan, 2012; Zhang, Q. et al., 2010).  Thus, this proposed design will use the VPC cloud deployment model to store and use healthcare data in a secure, isolated environment that protects patients' medical records (Regola & Chawla, 2013).

The Hadoop ecosystem is a required component of this proposed design for several reasons.  Hadoop is a commonly used computing paradigm for processing massive volumes of data in the cloud (Bansal et al., 2014; Chrimes et al., 2018; Dhotre et al., 2015).  Hadoop is the only technology that enables large volumes of healthcare data to be stored in their native form (Dezyre, 2016).  Hadoop has been used to develop better treatments for diseases such as cancer by accelerating the design and testing of effective treatments tailored to patients, expanding genetically based clinical cancer trials, and establishing a national cancer knowledge network to guide treatment decisions (Dezyre, 2016).  With the Hadoop system, hospitals in the four states will be able to monitor patient vitals (Dezyre, 2016).  Children's Healthcare of Atlanta, for example, uses the Hadoop ecosystem to treat over six thousand children in its ICUs (Dezyre, 2016).

The proposed design requires the integration of NoSQL databases because they offer benefits such as mass storage support, fast read and write operations, and easy, low-cost expansion (Sahafizadeh & Nematbakhsh, 2015).  HBase is proposed as a required NoSQL database because it is faster when reading more than six million variants, which is required when analyzing large healthcare datasets (Luo et al., 2016).  In addition, query engines such as SeqWare can be integrated with HBase as needed to help bioinformatics researchers access large-scale whole-genome datasets (Luo et al., 2016).  HBase can store clinical sensor data, where the row key serves as the timestamp of a single value and the columns store the patients' physiological values corresponding to that row key timestamp (Luo et al., 2016).  HBase is a scalable, high-performance, low-cost NoSQL data store that can be integrated with Hadoop, sitting on top of HDFS (Yang et al., 2013).  As a column-oriented NoSQL data store running on top of HDFS, HBase is well suited to parsing large healthcare datasets (Yang et al., 2013), and it supports applications written in Avro, REST, and Thrift (Yang et al., 2013).  MongoDB is another required NoSQL data store, which will be used to store metadata to improve the accessibility and readability of the HBase data schema (Luo et al., 2016).

The integration of Spark is required to overcome Hadoop's limitation in real-time data processing, for which Hadoop is not optimal (Guo, 2013).  Thus, Apache Spark is a required component of this proposal so that the healthcare BDA system can take advantage of processing data at rest using batch techniques as well as data in motion using real-time processing (Liang & Kelemen, 2016).  Spark allows in-memory processing for fast response times, bypassing MapReduce operations (Liang & Kelemen, 2016).  Spark also integrates tightly with recent Hadoop cluster deployments (Scott, 2015).  While Spark is a powerful tool on its own for processing large volumes of medical and healthcare datasets, it is not well suited for production workloads by itself.  Thus, integrating Spark with the Hadoop ecosystem provides many capabilities that neither Spark nor Hadoop can offer on its own.

The integration of AI as part of this proposal is justified by a Harvard Business Review (HBR) examination of ten promising AI applications in healthcare (Kalis, Collier, & Fu, 2018).  The findings of HBR's examination showed that the application of AI could create up to $150 billion in annual savings for U.S. healthcare by 2026 (Kalis et al., 2018).  The results also showed that AI currently creates the most value in helping frontline clinicians be more productive and in making back-end processes more efficient (Kalis et al., 2018).  Furthermore, IBM invested $1 billion in AI through the IBM Watson Group, and the healthcare industry is the most significant application of Watson (Power, 2015).

Conclusion

Big Data and Big Data Analytics have played significant roles in various industries, including the healthcare industry.  The value driven by BDA can save lives and minimize costs for patients.  This project proposes a design to apply BDA in the healthcare system across the four States of Colorado, Utah, Arizona, and New Mexico.  Cloud computing is the most appropriate technology to deal with the large volume of healthcare data.  Due to the security issues of cloud computing, the Virtual Private Cloud (VPC) will be used.  The VPC provides a secure cloud environment in which network traffic is controlled using security groups and network access control lists. 

The project requires other components to be fully implemented using the latest technology, such as Hadoop and MapReduce for streaming data processing and machine learning for artificial intelligence, which will be used for the Internet of Things (IoT).  The NoSQL databases HBase and MongoDB will be used to handle semi-structured data such as XML and unstructured data such as logs and images.  Spark will be used for real-time data processing, which can be vital for urgent care and emergency services.  This project addressed the assumptions and limitations, plus the justification for selecting these specific components. 

In summary, all stakeholders in the healthcare sector, including providers, insurers, pharmaceuticals, and practitioners, should cooperate and coordinate to facilitate the implementation process.  All stakeholders are responsible for facilitating the integration of BD and BDA into the healthcare system.  The rigid culture and silo pattern need to change for a better healthcare system, which can save millions of dollars for the healthcare industry and provide excellent care to patients at the same time.

References

Abdul, A. M., Jena, S., Prasad, S. D., & Balraju, M. (2014). Trusted Environment In Virtual Cloud. International Journal of Advanced Research in Computer Science, 5(4).

Abernathy, R., & McMillan, T. (2016). CISSP Cert Guide: Pearson IT Certification.

absentdata.com. (2018). Tableau Advantages and Disadvantages. Retrieved from https://www.absentdata.com/advantages-and-disadvantages-of-tableau/.

Alexandru, A., Alexandru, C., Coardos, D., & Tudora, E. (2016). Healthcare, Big Data and Cloud Computing. management, 1, 2.

Alguliyev, R., & Imamverdiyev, Y. (2014). Big data: big promises for information security. Paper presented at the Application of Information and Communication Technologies (AICT), 2014 IEEE 8th International Conference on.

Anand, M., & Clarice, S. (2015). Artificial Intelligence Meets Internet of Things. Retrieved from http://www.ijcset.net/docs/Volumes/volume5issue6/ijcset2015050604.pdf.

Ankam, V. (2016). Big Data Analytics: Packt Publishing Ltd.

Aravind, P. S., & Agrawal, V. (2014). Processing XML data in BigInsights 3.0. Retrieved from https://developer.ibm.com/hadoop/2014/10/31/processing-xml-data-biginsights-3-0/.

Archenaa, J., & Anita, E. M. (2015). A survey of big data analytics in healthcare and government. Procedia Computer Science, 50, 408-413.

Assunção, M. D., Calheiros, R. N., Bianchi, S., Netto, M. A. S., & Buyya, R. (2015). Big Data Computing and Clouds: Trends and Future Directions. Journal of Parallel and Distributed Computing, 79, 3-15. doi:10.1016/j.jpdc.2014.08.003

Balasubramanian, V., & Mala, T. (2015). A Review On Various Data Security Issues In Cloud Computing Environment And Its Solutions. Journal of Engineering and Applied Sciences, 10(2).

Bansal, A., Deshpande, A., Ghare, P., Dhikale, S., & Bodkhe, B. (2014). Healthcare data analysis using dynamic slot allocation in Hadoop. International Journal of Recent Technology and Engineering, 3(5), 15-18.

Basu, A. (2014). Real-Time Healthcare Analytics on Apache Hadoop* using Spark* and Shark. Retrieved from https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/big-data-real-time-healthcare-analytics-whitepaper.pdf.

Botta, A., de Donato, W., Persico, V., & Pescapé, A. (2016). Integration of Cloud Computing and Internet Of Things: a Survey. Future Generation computer systems, 56, 684-700.

Bresnick, J. (2018). Top 12 Ways Artificial Intelligence Will Impact Healthcare. Retrieved from https://healthitanalytics.com/news/top-12-ways-artificial-intelligence-will-impact-healthcare.

Carutasu, G., Botezatu, M., Botezatu, C., & Pirnau, M. (2016). Cloud Computing and Windows Azure. Electronics, Computers and Artificial Intelligence.

Chang, V. (2015). A Proposed Framework for Cloud Computing Adoption. International Journal of Organizational and Collective Intelligence, 6(3).

Chrimes, D., Zamani, H., Moa, B., & Kuo, A. (2018). Simulations of Hadoop/MapReduce-Based Platform to Support its Usability of Big Data Analytics in Healthcare.

Cloud Security Alliance. (2013). The Notorious Nine: Cloud Computing Top Threats in 2013. Cloud Security Alliance: Top Threats Working Group. 

Cloud Security Alliance. (2016). The Treacherous 12: Cloud Computing Top Threats in 2016. Cloud Security Alliance: Top Threats Working Group. 

Cloud Security Alliance. (2017). The Treacherous 12 Top Threats to Cloud Computing. Cloud Security Alliance: Top Threats Working Group. 

Dezyre. (2016). 5 Healthcare Applications of Hadoop and Big Data Retrieved from https://www.dezyre.com/article/5-healthcare-applications-of-hadoop-and-big-data/85.

Dhotre, P., Shimpi, S., Suryawanshi, P., & Sanghati, M. (2015). Health Care Analysis Using Hadoop. International Journal of Scientific & Technology Research, 4(12), 279-281.

EMC. (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. (1st ed.): Wiley.

Fawcett, J., Ayers, D., & Quin, L. R. (2012). Beginning XML: John Wiley & Sons.

Fernández, A., del Río, S., López, V., Bawakid, A., del Jesus, M. J., Benítez, J. M., & Herrera, F. (2014). Big Data with Cloud Computing: An Insight on the Computing Environment, MapReduce, and Programming Frameworks. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 4(5), 380-409. doi:10.1002/widm.1134

Fox, M., & Vaidyanathan, G. (2016). Impacts of Healthcare Big Data:  A Framwork With Legal and Ethical Insights. Issues in Information Systems, 17(3).

Ghani, K. R., Zheng, K., Wei, J. T., & Friedman, C. P. (2014). Harnessing big data for health care and research: are urologists ready? European urology, 66(6), 975-977.

Grover, M., Malaska, T., Seidman, J., & Shapira, G. (2015). Hadoop Application Architectures: Designing Real-World Big Data Applications: O'Reilly Media, Inc.

Groves, P., Kayyali, B., Knott, D., & Kuiken, S. V. (2016). The ‘Big Data’ Revolution in Healthcare: Accelerating Value and Innovation.

Guo, S. (2013). Hadoop operations and cluster management cookbook: Packt Publishing Ltd.

Gupta, R., Gupta, H., & Mohania, M. (2012). Cloud Computing and Big Data Analytics: What is New From Databases Perspective? Paper presented at the International Conference on Big Data Analytics, Springer-Verlag Berlin Heidelberg.

Hashem, I. A. T., Yaqoob, I., Anuar, N. B., Mokhtar, S., Gani, A., & Khan, S. U. (2015). The Rise of “Big Data” on Cloud Computing: Review and Open Research Issues. Information Systems, 47, 98-115. doi:10.1016/j.is.2014.07.006

Hashizume, K., Rosado, D. G., Fernández-medina, E., & Fernandez, E. B. (2013). An analysis of security issues for cloud computing. Journal of internet services and applications, 4(1), 1-13. doi:10.1186/1869-0238-4-5

HIMSS. (2018). 2017 Security Metrics:  Guide to HIPAA Compliance: What Healthcare Entities and Business Associates Need to Know. . Retrieved on 12/1/2018 from  http://www.himss.org/file/1318331/download?token=h9cBvnl2. 

HIPAA. (2018a). At Least 3.14 Million Healthcare Records Were Exposed in Q2, 2018. Retrieved 11/22/2018 from https://www.hipaajournal.com/q2-2018-healthcare-data-breach-report/. 

HIPAA. (2018b). How to Defend Against Insider Threats in Healthcare. Retrieved 8/22/2018 from https://www.hipaajournal.com/category/healthcare-cybersecurity/. 

HIPAA. (2018c). Q3 Healthcare Data Breach Report: 4.39 Million Records Exposed in 117 Breaches. Retrieved 11/22/2018 from https://www.hipaajournal.com/q3-healthcare-data-breach-report-4-39-million-records-exposed-in-117-breaches/. 

HIPAA. (2018d). Report: Healthcare Data Breaches in Q1, 2018. Retrieved 5/15/2018 from https://www.hipaajournal.com/report-healthcare-data-breaches-in-q1-2018/. 

HL7. (2011). Patient Example Instance in XML.  

Hu, H., Wen, Y., Chua, T., & Li, X. (2014). Toward Scalable Systems for Big Data Analytics: A Technology Tutorial. Practical Innovation, Open Solution, 2, 652-687. doi:10.1109/ACCESS.2014.2332453

InformationBuilders. (2018). Data In Motion – Big Data Analytics in Healthcare. Retrieved from http://docs.media.bitpipe.com/io_10x/io_109369/item_674791/datainmotionbigdataanalytics.pdf, White Paper.

Jayasingh, B. B., Patra, M. R., & Mahesh, D. B. (2016, 14-17 Dec. 2016). Security issues and challenges of big data analytics and visualization. Paper presented at the 2016 2nd International Conference on Contemporary Computing and Informatics (IC3I).

Ji, Z., Ganchev, I., O’Droma, M., Zhang, X., & Zhang, X. (2014). A cloud-based X73 ubiquitous mobile healthcare system: design and implementation. The Scientific World Journal, 2014.

Kalis, B., Collier, M., & Fu, R. (2018). 10 Promising AI Applications in Health Care. Retrieved from https://hbr.org/2018/05/10-promising-ai-applications-in-health-care, Harvard Business Review.

Karanth, S. (2014). Mastering Hadoop: Packt Publishing Ltd.

Kazim, M., & Zhu, S. Y. (2015). A Survey on Top Security Threats in Cloud Computing. International Journal Advanced Computer Science and Application, 6(3), 109-113.

Kersting, K., & Meyer, U. (2018). From Big Data to Big Artificial Intelligence? : Springer.

Klein, J., Gorton, I., Ernst, N., Donohoe, P., Pham, K., & Matser, C. (2015, June 27 2015-July 2 2015). Application-Specific Evaluation of No SQL Databases. Paper presented at the 2015 IEEE International Congress on Big Data.

Kritikos, K., Kirkham, T., Kryza, B., & Massonet, P. (2017). Towards a Security-Enhanced PaaS Platform for Multi-Cloud Applications. Future Generation computer systems, 67, 206-226. doi:10.1016/j.future.2016.10.008

Kumari, W. M. P. (2017). Artificial Intelligence Meets Internet of Things.

Liang, Y., & Kelemen, A. (2016). Big Data Science and its Applications in Health and Medical Research: Challenges and Opportunities. Austin Journal of Biometrics & Biostatistics, 7(3).

Liveri, D., Sarri, A., & Skouloudi, C. (2015). Security and Resilience in eHealth: Security Challenges and Risks. European Union Agency For Network And Information Security.

Lublinsky, B., Smith, K. T., & Yakubovich, A. (2013). Professional hadoop solutions: John Wiley & Sons.

Luo, J., Wu, M., Gopukumar, D., & Zhao, Y. (2016). Big data application in biomedical research and health care: a literature review. Biomedical informatics insights, 8, BII. S31559.

Malik, L., & Sangwan, S. (2015). MapReduce Framework Implementation on the Prescriptive Analytics of Health Industry. International Journal of Computer Science and Mobile Computing, ISSN, 675-688.

Maltby, D. (2011). Big Data Analytics. Paper presented at the Annual Meeting of the Association for Information Science and Technology.

Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (2011). Big Data: The Next Frontier for Innovation, Competition, and Productivity. McKinsey Global Institute.

McKelvey, N., Curran, K., Gordon, B., Devlin, E., & Johnston, K. (2015). Cloud Computing and Security in the Future Guide to Security Assurance for Cloud Computing (pp. 95-108): Springer.

Mehmood, A., Natgunanathan, I., Xiang, Y., Hua, G., & Guo, S. (2016). Protection of Big Data Privacy. Institute of Electrical and Electronic Engineers, 4, 1821-1834. doi:10.1109/ACCESS.2016.2558446

Meyer, M. (2018). The Rise of Healthcare Data Visualization.

Mills, T. (2018). Eight Ways Big Data And AI Are Changing The Business World.

MongoDB. (2018). ETL Best Practice.  

O'Brien, B. (2016). Why The IoT Needs Artificial Intelligence to Succeed.

Palanisamy, V., & Thirunavukarasu, R. (2017). Implications of Big Data Analytics in developing Healthcare Frameworks–A review. Journal of King Saud University-Computer and Information Sciences.

Patrizio, A. (2018). Big Data vs. Artificial Intelligence.

Power, B. (2015). Artificial Intelligence Is Almost Ready for Business.

Raghupathi, W., & Raghupathi, V. (2014). Big data analytics in healthcare: promise and potential. Health Information Science and Systems, 2(1), 1.

Regola, N., & Chawla, N. (2013). Storing and Using Health Data in a Virtual Private Cloud. Journal of medical Internet research, 15(3), 1-12. doi:10.2196/jmir.2076

Sahafizadeh, E., & Nematbakhsh, M. A. (2015). A Survey on Security Issues in Big Data and NoSQL. Int’l J. Advances in Computer Science, 4(4), 2322-5157.

Salido, J. (2010). Data Governance for Privacy, Confidentiality and Compliance: A Holistic Approach. ISACA Journal, 6, 17.

Scott, J. A. (2015). Getting Started with Spark: MapR Technologies, Inc.

Stewart, J., Chapple, M., & Gibson, D. (2015). ISC Official Study Guide.  CISSP Security Professional Official Study Guide (7th ed.): Wiley.

Sultan, N. (2010). Cloud Computing for Education: A New Dawn? International Journal of Information Management, 30(2), 109-116. doi:10.1016/j.ijinfomgt.2009.09.004

Sun, J., & Reddy, C. (2013). Big Data Analytics for Healthcare. Retrieved from https://www.siam.org/meetings/sdm13/sun.pdf.

Tableau. (2011). Three Ways Healthcare Probiders are transforming data from information to insight. White Paper.

Thompson, E. C. (2017). Building a HIPAA-Compliant Cybersecurity Program, Using NIST 800-30 and CSF to Secure Protected Health Information.

Van-Dai, T., Chuan-Ming, L., & Nkabinde, G. W. (2016, 5-7 July 2016). Big data stream computing in healthcare real-time analytics. Paper presented at the 2016 IEEE International Conference on Cloud Computing and Big Data Analysis (ICCCBDA).

Venkatesan, T. (2012). A Literature Survey on Cloud Computing. i-Manager’s Journal on Information Technology, 1(1), 44-49.

Wang, Y., Kung, L. A., & Byrd, T. A. (2018). Big Data Analytics: Understanding its Capabilities and Potential Benefits for Healthcare Organizations. Technological Forecasting and Social Change, 126, 3-13. doi:10.1016/j.techfore.2015.12.019

Wicklund, E. (2014). ‘Silo’ one of healthcare’s biggest flaws. Retrieved from http://www.healthcareitnews.com/news/silo-one-healthcares-biggest-flaws.

Yang, C. T., Liu, J. C., Hsu, W. H., Lu, H. W., & Chu, W. C. C. (2013, 16-18 Dec. 2013). Implementation of Data Transform Method into NoSQL Database for Healthcare Data. Paper presented at the 2013 International Conference on Parallel and Distributed Computing, Applications and Technologies.

Zhang, Q., Cheng, L., & Boutaba, R. (2010). Cloud Computing: State-of-the-Art and Research Challenges. Journal of internet services and applications, 1(1), 7-18. doi:10.1007/s13174-010-0007-6

Zhang, R., & Liu, L. (2010). Security models and requirements for healthcare application clouds. Paper presented at the Cloud Computing (CLOUD), 2010 IEEE 3rd International Conference on.

Zia, U. A., & Khan, N. (2017). An Analysis of Big Data Approaches in Healthcare Sector. International Journal of Technical Research & Science, 2(4), 254-264.

 

Case Study: Big Data Analytics in Healthcare Using Outlier Detection.

Dr. O. Aly
Computer Science

Abstract

The purpose of this project is to discuss and examine a Big Data Analytics (BDA) technique through a case study.  The discussion begins with an overview of BDA applications in various sectors, followed by the implementation of BDA in the healthcare industry.  The records show that the healthcare industry suffers from fraud, waste, and abuse (FWA).  The emphasis of this discussion is on FWA in the healthcare industry.  The project provides a case study of BDA in healthcare using the outlier detection data mining technique.  The data mining phases of the use case are discussed and analyzed, and an improvement to the selected outlier detection technique is proposed.  The analysis shows that the outlier detection data mining technique for fraud detection is still under experimentation and has not yet proven reliable.  The recommendation is to use the clustering data mining technique as a more heuristic technique for fraud detection.  Organizations should evaluate the available BDA tools and select the most appropriate tool to meet the requirements of the business model.

Keywords: Big Data Analytics; Healthcare; Outlier Detection; Fraud Detection.

Introduction

Organizations must be able to quickly and effectively analyze large amounts of data and extract value from such data for sound business decisions.  The benefits of Big Data Analytics are driving organizations and businesses to implement BDA techniques in order to compete in the market.  A survey conducted by CIO Insight showed that 65% of executives and senior decision makers indicated that organizations risk becoming uncompetitive or irrelevant if Big Data is not embraced (McCafferly, 2015).  The same survey also showed that 56% anticipated higher investment in big data, and 15% indicated that this increasing trend in budget allocation would be significant (McCafferly, 2015).  Such budget allocations can be used for skilled professionals, BD storage, BDA tools, and so forth.  This project discusses and analyzes the application of Big Data Analytics.  It begins with an overview of these broad applications, with more emphasis on a single application for further investigation.  The healthcare sector is selected for a closer look at the implementation of BDA and methods to improve such implementation.

Overview of Big Data Analytics Applications

            Numerous research studies have discussed and analyzed the application of Big Data in different domains. (Chen & Zhang, 2014) have discussed BDA in scientific research domains such as astronomy, meteorology, social computing, bioinformatics, and computational biology, which are based on data-intensive scientific discovery.  Other studies, such as (Rabl et al., 2012), have investigated the performance of six modern open-source data stores in the context of application performance monitoring as part of the initiative of (CA-Technologies, 2018). (Bi & Cochran, 2014) have discussed BDA in cloud manufacturing, indicating that the success of a manufacturing enterprise depends on the advancement of IT to support and enhance the value stream.  Manufacturing technologies have evolved throughout the years, and the advancement of a manufacturing system can be measured by its scale, complexity, and automation responsiveness (Bi & Cochran, 2014).  Figure 1 illustrates this evolution of manufacturing technologies from before the 1950s to the Big Data age. 


Figure 1.  Manufacturing Technologies, Information System, ITs, and Their Evolutions

McKinsey Institute first reported four essential sectors that can benefit from BDA: the healthcare industry, government services, retailing, and manufacturing (Brown, Chui, & Manyika, 2011).  The report also predicted that BDA implementation would improve productivity by 0.5 to 1 percent annually and produce hundreds of billions of dollars in new value (Brown et al., 2011).  McKinsey Institute indicated that not all industries are created equal when it comes to capturing the benefits of BDA (Brown et al., 2011).   

Another McKinsey Institute report described the transformative potential of BD in five domains: health care (U.S.), public sector administration (European Union), retail (U.S.), manufacturing (global), and personal location data (global) (Manyika et al., 2011).  The same report predicted $300 billion in potential annual value to US healthcare and a potential 60% increase in retailers' operating margins made possible by BDA (Manyika et al., 2011).  Some sectors are poised for more significant gains and benefits from BD than others, although the implementation of BD will matter across all sectors (Manyika et al., 2011).  The sectors are divided into clusters A, B, C, D, and E.  Cluster A reflects information and computer and electronic products, while finance & insurance and government are categorized as cluster B.  Cluster C includes several sectors such as construction, educational services, and arts and entertainment.  Cluster D covers manufacturing and wholesale trade, while cluster E covers retail, healthcare providers, and accommodation and food.  Figure 2 shows that some sectors are positioned for more significant gains from the use of BD. 


Figure 2.  Capturing Value from Big Data by Sector (Manyika et al., 2011).

The application of BDA in specific sectors has been discussed in various research studies, such as health and medical research (Liang & Kelemen, 2016), biomedical research (Luo, Wu, Gopukumar, & Zhao, 2016), and machine learning techniques in the healthcare sector (MCA, 2017).  The next section discusses the implementation of BDA in the healthcare sector.

Big Data Analytics Implementation in Healthcare

            Numerous research studies have discussed Big Data Analytics (BDA) in the healthcare industry from different perspectives.  Healthcare organizations have taken advantage of BDA in fraud and abuse prevention, detection, and reporting (cms.gov, 2017).  Fraud and abuse of Medicare are regarded as a severe problem that needs attention (cms.gov, 2017).  Various examples of Medicare fraud scenarios have been reported (cms.gov, 2017).  Submitting, or causing to be submitted, false claims or making misrepresentations of fact to obtain a federal healthcare payment is the first Medicare fraud scenario.  Soliciting, receiving, offering, or paying remuneration to induce or reward referrals for items or services reimbursed by federal health care programs is another.  The last fraud scenario is making prohibited referrals for certain designated health services (cms.gov, 2017).  The abuse of Medicare includes billing for unnecessary medical services, charging excessively for services or supplies, and misusing codes on a claim, such as upcoding or unbundling codes (cms.gov, 2017; J. Liu et al., 2016).  In 2012, improper healthcare payments amounted to $120 billion (J. Liu et al., 2016), and Medicare and Medicaid contributed more than half of this total (J. Liu et al., 2016).  The annual loss to fraud, waste, and abuse in the healthcare domain is estimated to be $750 billion (J. Liu et al., 2016).  In 2013, over 60% of improper payments were healthcare related.  Figure 3 illustrates improper payments in government expenditure.


Figure 3. Improper Payments Resulted from Fraud and Abuse (J. Liu et al., 2016).

Medicare fraud and abuse are governed by federal laws (cms.gov, 2017).  These federal laws include the False Claims Act (FCA), the Anti-Kickback Statute (AKS), the Physician Self-Referral Law (Stark Law), the Criminal Health Care Fraud Statute, the Social Security Act, and the United States Criminal Code.  Medicare anti-fraud and abuse partnerships among various government agencies, such as the Health Care Fraud Prevention Partnership (HFPP) and the Centers for Medicare and Medicaid Services (CMS), have been established to combat fraud and abuse.  The main aim of these partnerships is to uphold the integrity of the Medicare program, save and recoup taxpayer funds, reduce the costs of health care to patients, and improve the quality of healthcare (cms.gov, 2017).  

In 2010, Health and Human Services (HHS) and CMS initiated a national effort known as the Fraud Prevention System (FPS), a predictive analytics technology that runs predictive algorithms and other analytics nationwide on all Medicare FFS claims prior to any payment, in an effort to detect suspicious claims and patterns that may constitute fraud and abuse (cms.gov, 2017).  In 2012, CMS developed the Program Integrity Command Center to bring together Medicare and Medicaid experts such as clinicians, policy experts, officials, fraud investigators, and the law enforcement community, including the FBI, to develop and improve predictive analytics that identify fraud and mobilize a rapid response (cms.gov, 2017).  This effort aims to connect with the field offices to examine fraud allegations within a few hours through real-time investigation.  Before the application of BDA, the process of finding substantiating evidence for a fraud allegation took days or weeks.

Research communities and the data analytics industry have exerted various efforts to develop fraud-detection systems (J. Liu et al., 2016).  Various research studies have used different data mining approaches for healthcare fraud and abuse detection.  (J. Liu et al., 2016) have used an unsupervised data mining approach and applied the clustering data mining technique for healthcare fraud detection.  (Ekina, Leva, Ruggeri, & Soyer, 2013) have used an unsupervised approach and applied the Bayesian co-clustering data mining technique for healthcare fraud detection.  (Ngufor & Wojtusiak, 2013) have used a hybrid supervised and unsupervised approach, applying unsupervised data labeling and outlier detection together with classification and regression techniques for medical claims prediction.  (Capelleveen, 2013; van Capelleveen, Poel, Mueller, Thornton, & van Hillegersberg, 2016) have used an unsupervised approach and applied the outlier detection data mining technique for health insurance fraud detection within the Medicaid domain. 

Case Study of BDA in Healthcare

The case study presented by (Capelleveen, 2013; van Capelleveen et al., 2016) has been selected for further investigation of the application of BDA in healthcare.  Outlier detection, one of the unsupervised data mining techniques, is regarded as an effective predictor for fraud detection and is recommended to support the initiation of audits (Capelleveen, 2013; van Capelleveen et al., 2016).  Outlier detection is the primary analytic tool used in this case study.  The outlier detection tool can be based on linear model analysis, multivariate clustering analysis, peak analysis, or boxplot analysis (Capelleveen, 2013; van Capelleveen et al., 2016).  The outlier detection algorithm in this case study was applied to a single state's Medicaid dataset of 650,000 healthcare claims and 369 dentists.  RapidMiner is one tool that can be used for outlier detection data mining.  The study of (Capelleveen, 2013; van Capelleveen et al., 2016) did not specify the name of the tool used for outlier detection of fraud and abuse in Medicaid, with emphasis on dental practice.

The process for this unsupervised outlier detection technique involves seven iterative phases.  The first phase is the composition of metrics for the domain.  These metrics are derived or calculated data, such as a feature, attribute, or measurement, that characterize the behavior of an entity over a certain period.  The purpose of these metrics is to enable a comparative behavioral analysis using data mining algorithms.  During the first iteration, the metrics are inferred from provider behavior supported by known fraud causes and are developed in cooperation with fraud experts.  In subsequent iterations, the metrics composition incorporates the latest metrics, updating the existing ones, modifying the configuration, and adjusting the confidence level to optimize the hit rates.  The metric composition phase is followed by cleaning and filtering the data.  Selecting the provider groups and computing the metrics is the third phase.  The fourth phase involves comparing providers by metric and flagging outliers.  Forming suspicion predictors for provider fraud detection is the fifth phase, followed by the reporting and presentation to fraud investigators.  The last phase of using the outlier detection analytic tool is metric evaluation.  The result of the outlier detection analysis showed that 12 of the top 17 providers (71%) submitted suspicious claim patterns and should be referred to officials for further investigation.  The study concluded that the outlier detection tool could be used to surface new patterns of potential fraud that can be identified and possibly used for future automated detection techniques.
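
As a concrete illustration of the metric-composition and flagging phases, the following minimal R sketch (R being the language used elsewhere in this document) computes one behavioral metric per provider and flags providers whose metric falls outside the boxplot fences.  The claims data frame, its columns, and the planted inflated amounts are illustrative assumptions and do not reproduce the actual Medicaid data, metrics, or tool used in the study.

  • ## Illustrative claims data (assumed): provider id and billed amount per claim.
  • set.seed(1)
  • claims <- data.frame(provider = rep(paste0("P", 1:50), each = 20),
  •                      amount   = c(rlnorm(980, meanlog = 4), rlnorm(20, meanlog = 6)))
  • ## Phases 1-3: compose a simple behavioral metric, the mean billed amount per provider.
  • metric <- aggregate(amount ~ provider, data = claims, FUN = mean)
  • ## Phase 4: compare providers by metric and flag outliers beyond the boxplot whiskers.
  • fences <- boxplot.stats(metric$amount)$stats[c(1, 5)]
  • metric$flagged <- metric$amount < fences[1] | metric$amount > fences[2]
  • ## Phases 5-6: rank flagged providers for presentation to fraud investigators.
  • flagged <- metric[metric$flagged, ]
  • flagged[order(-flagged$amount), ]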

Proposed Improvements for Outlier Detection Tool Use Case

            (Lazarevic & Kumar, 2005) have indicated that most outlier detection techniques fall into four categories: the statistical approach, the distance-based approach, the profiling method, and the model-based approach.  In the statistical approach, the data points are modeled using a stochastic distribution, and points are determined to be outliers based on their relationship with the model.  Most statistical approaches struggle with higher-dimensional distributions of the data points because the complexity of such distributions leads to inaccurate estimations.  The distance-based approach detects outliers by computing the distances among points, overcoming the limitation of the statistical approach.  Various distance-based outlier detection algorithms have been proposed, based on different strategies.  The first strategy computes the full-dimensional distances of points from one another using all available features; the second computes the densities of local neighborhoods.  The profiling method develops profiles of normal behavior using data mining techniques or heuristic-based approaches, and deviations from these profiles are considered intrusions.  The model-based approach begins by characterizing normal behavior using predictive models, such as replicator neural networks or unsupervised support vector machines, and detects outliers as deviations from the learned model (Lazarevic & Kumar, 2005).  (Capelleveen, 2013; van Capelleveen et al., 2016) have indicated that the outlier detection tool as a data mining technique has not proven itself in the long run and is still under experimentation.  It is also considered a sophisticated data mining technique, and validating its effectiveness remains difficult (Capelleveen, 2013; van Capelleveen et al., 2016). 
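
A minimal R sketch of the distance-based approach described above, using only base R and a small simulated matrix of provider metrics (an assumption for illustration): each point is scored by its mean distance to its k nearest neighbours, and the highest-scoring points are treated as candidate outliers.

  • ## Distance-based outlier scoring: mean distance to the k nearest neighbours.
  • set.seed(2)
  • x <- rbind(matrix(rnorm(200), ncol = 2),           ## 100 simulated "normal" providers
  •            matrix(rnorm(10, mean = 6), ncol = 2))  ## 5 planted outliers
  • d <- as.matrix(dist(x))                            ## full pairwise distance matrix
  • k <- 5
  • knn.score <- apply(d, 1, function(row) mean(sort(row)[2:(k + 1)]))  ## drop distance to self
  • head(order(knn.score, decreasing = TRUE), 5)       ## indices of the most outlying points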

Based on this analysis of the outlier detection tool, a more heuristic and novel approach should be used.  (Viattchenin, 2016) has proposed a novel technique for outlier detection based on a heuristic, function-based clustering algorithm.  (Q. Liu & Vasarhelyi, 2013) have proposed healthcare fraud detection using a clustering model that incorporates geolocation information.  The results of that clustering model detected claims with extreme payment amounts and identified some suspicious claims.  In summary, integrating the clustering technique can play a role in enhancing the reliability and validity of the outlier detection data mining technique.
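
The clustering idea can be sketched in R along the following lines, again on simulated data rather than the geolocation-aware model of (Q. Liu & Vasarhelyi, 2013): observations are grouped with k-means, and points that lie unusually far from their assigned cluster centre are flagged for review.

  • ## Clustering-based screening: flag points far from their k-means cluster centre.
  • set.seed(3)
  • x  <- rbind(matrix(rnorm(200, mean = 0), ncol = 2),
  •             matrix(rnorm(200, mean = 5), ncol = 2),
  •             matrix(rnorm(6,  mean = 15), ncol = 2))  ## 3 planted extreme observations
  • km <- kmeans(x, centers = 2, nstart = 25)
  • ## Distance of every point to the centre of its assigned cluster.
  • centre.dist <- sqrt(rowSums((x - km$centers[km$cluster, ])^2))
  • ## Flag points whose distance exceeds the 99th percentile as suspicious.
  • which(centre.dist > quantile(centre.dist, 0.99))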

Conclusion

This project has discussed and examined Big Data Analytics (BDA) methods.  An overview of BDA applications in various sectors was discussed, followed by the implementation of BDA in the healthcare industry.  The records showed that the healthcare industry suffers from fraud, waste, and abuse.  The discussion provided a case study of BDA in healthcare using the outlier detection tool.  The data mining phases were discussed and analyzed, and a proposed improvement for the selected outlier detection technique was addressed.  The analysis indicated that the outlier detection technique is still under experimentation and that a more heuristic data mining fraud detection technique, such as clustering, should be used.  In summary, various BDA techniques are available for different industries, and organizations must select the appropriate BDA tool to meet the requirements of their business model. 

References

Bi, Z., & Cochran, D. (2014). Big data analytics with applications. Journal of Management Analytics, 1(4), 249-265.

Brown, B., Chui, M., & Manyika, J. (2011). Are you ready for the era of ‘big data’. McKinsey Quarterly, 4(1), 24-35.

CA-Technologies. (2018). CA Technologies. Retrieved from https://www.ca.com/us/company/about-us.html.

Capelleveen, G. C. (2013). Outlier based predictors for health insurance fraud detection within US Medicaid. University of Twente.  

Chen, C. P., & Zhang, C.-Y. (2014). Data-intensive applications, challenges, techniques and technologies: A survey on Big Data. Information Sciences, 275, 314-347.

cms.gov. (2017). Medicare Fraud & Abuse: Prevention, Detection, and Reporting. Retrieved from https://www.cms.gov/Outreach-and-Education/Medicare-Learning-Network-MLN/MLNProducts/downloads/fraud_and_abuse.pdf.

Ekina, T., Leva, F., Ruggeri, F., & Soyer, R. (2013). Application of bayesian methods in detection of healthcare fraud.

Lazarevic, A., & Kumar, V. (2005). Feature bagging for outlier detection. Paper presented at the Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining.

Liang, Y., & Kelemen, A. (2016). Big Data Science and its Applications in Health and Medical Research: Challenges and Opportunities. Austin Journal of Biometrics & Biostatistics, 7(3).

Liu, J., Bier, E., Wilson, A., Guerra-Gomez, J. A., Honda, T., Sricharan, K., . . . Davies, D. (2016). Graph analysis for detecting fraud, waste, and abuse in healthcare data. AI Magazine, 37(2), 33-46.

Liu, Q., & Vasarhelyi, M. (2013). Healthcare fraud detection: A survey and a clustering model incorporating Geo-location information.

Luo, J., Wu, M., Gopukumar, D., & Zhao, Y. (2016). Big data application in biomedical research and health care: a literature review. Biomedical informatics insights, 8, BII. S31559.

Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (2011). Big data: The next frontier for innovation, competition, and productivity.

MCA, M. J. S. (2017). Applications of Big Data Analytics and Machine Learning Techniques in Health Care Sectors. International Journal Of Engineering And Computer Science, 6(7).

McCafferly, D. (2015). How To Overcome Big Data Barriers. Retrieved from https://www.cioinsight.com/it-strategy/big-data/slideshows/how-to-overcome-big-data-barriers.html.

Ngufor, C., & Wojtusiak, J. (2013). Unsupervised labeling of data for supervised learning and its application to medical claims prediction. Computer Science, 14(2), 191.

Rabl, T., Gómez-Villamor, S., Sadoghi, M., Muntés-Mulero, V., Jacobsen, H.-A., & Mankovskii, S. (2012). Solving big data challenges for enterprise application performance management. Proceedings of the VLDB Endowment, 5(12), 1724-1735.

van Capelleveen, G., Poel, M., Mueller, R. M., Thornton, D., & van Hillegersberg, J. (2016). Outlier detection in healthcare fraud: A case study in the Medicaid dental domain. International Journal of Accounting Information Systems, 21, 18-31.

Viattchenin, D. A. (2016). A Technique for Outlier Detection Based on Heuristic Possibilistic Clustering. CERES, 17.

 

Quantitative Analysis of Online Radio “LastFM” Dataset Using R-Programming

Dr. O. Aly
Computer Science

Abstract

The purpose of this project is to analyze the online radio dataset (lastfm.csv).  The project is divided into two main Parts.  Part-I evaluates and examines the dataset in RStudio and involves three major tasks to review and understand the dataset variables.  Part-II discusses the pre-data analysis by converting the dataset to a data frame and involves three major tasks to analyze the data frame.  The Association Rule data mining technique is used in this project.  The support for each of the 1,004 artists is calculated, and the support is displayed for all artists with support larger than 8%, indicating that the artists shown on the graph (Figure 4) are played by more than 8% of the users.  The association rules are then constructed using the apriori function in the R package arules.  The search was for artists or groups of artists who have support larger than 1% and who give confidence larger than 50% to another artist; these requirements rule out rare artists.  The antecedents (LHS) that involve more than one artist are also calculated and listed.  The list is further narrowed by requiring that the lift be larger than 5, and the resulting list is ordered by decreasing confidence as illustrated in Figure 6. 

Keywords: Online Radio, Association Rule Data Mining Analysis

Introduction

This project examines and analyzes the lastfm.csv dataset.  The dataset is downloaded from the CTU course materials.  The lastfm.csv dataset reflects an online radio service that keeps track of everything each user plays.  It has 289,955 observations on four variables.  The focus of this analysis is Association Rules.  The information in the dataset is used for recommending music the user is likely to enjoy and supports focused marketing that sends the user advertisements for music the user is likely to buy.  From the available information, such as demographic information (age, sex, and location), the support for the frequencies of listening to individual artists can be determined, as well as the joint support for pairs or larger groupings of artists.  To calculate such support, the incidences (0/1) are counted across all members of the network and those frequencies are divided by the number of members.  From the support, the confidence and the lift are calculated.
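
A minimal R sketch of this calculation on a tiny made-up 0/1 incidence matrix (not the lastfm.csv data): support is the column mean of the incidence matrix, confidence is the joint support divided by the support of the antecedent, and lift divides the confidence by the support of the consequent.

  • ## Toy incidence matrix: 6 users (rows) x 3 artists (columns); the values are assumed.
  • m <- matrix(c(1,1,1,0,1,0,    ## column 1: radiohead
  •               1,1,0,0,1,0,    ## column 2: muse
  •               0,1,1,1,0,1),   ## column 3: coldplay
  •             nrow = 6, dimnames = list(NULL, c("radiohead", "muse", "coldplay")))
  • support    <- colMeans(m)                         ## share of users playing each artist
  • joint      <- mean(m[, "muse"] == 1 & m[, "radiohead"] == 1)
  • confidence <- joint / support["muse"]             ## confidence of the rule muse => radiohead
  • lift       <- confidence / support["radiohead"]   ## improvement over chance
  • c(support["muse"], confidence, lift)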

This project addresses two major Parts.  Part-I covers the following key Tasks to understand and examine the Dataset of “lastfm.csv.” 

  • Task-1:  Review the Variables of the Dataset.
  • Task-2:  Load and Understand the Dataset Using names(), head(), dim() Functions.
  • Task-3:  Examine the Dataset, Summary of the Descriptive Statistics, and Visualization of the Variables.

Part-II covers the following three primary key Tasks to the plot, discuss and analyze the result.

  • Task-1: Required Computations for Association Rules and Frequent Items.
  • Task-2: Association Rules.
  • Task-3: Discussion and Analysis.

Various resources were utilized to develop the required code using R. These resources include (Ahlemeyer-Stubbe & Coleman, 2014; Fischetti, Mayor, & Forte, 2017; Ledolter, 2013; r-project.org, 2018).

Part-I:  Understand and Examine the Dataset “lastfm.csv”

Task-1:  Review the Variables of the Dataset

The purpose of this task is to understand the variables of the dataset, which is lastfm.csv.  The dataset describes the artists and the users who listen to the music.  From the available information, such as demographic information (age, sex, and location), the support for the frequencies of listening to individual artists can be determined, as well as the joint support for pairs or larger groupings of artists.  There are four variables.  Table 1 summarizes the selected variables for this project.  

Table 1:  LastFm Dataset Variables

Task-2:  Load and Understand the Dataset Using names(), head(), dim() Functions.

            The purpose of this task is to load and understand the dataset using the names(), head(), and dim() functions.  The task also displays the first three observations.

  • ## reading the data
  • lf <- read.csv("C:/CS871/Data/lastfm.csv")
  • lf
  • dim(lf)
  • length(lf$user)
  • names(lf)
  • head(lf)
  • lf <- data.frame(lf)
  • head(lf)
  • str(lf)
  • lf[1:20,]
  • lfsmallset <- lf[1:1000,]
  • lfsmallset
  • plot(lfsmallset, col="blue", main="Small Set of Online Radio")

Figure 1.  First Sixteen Observations for User (1) – Woman from Germany.

Figure 2. The plot of Small Set of Last FM Variables.

 Task-3:  Examine the Dataset, Summary of the Descriptive Statistics and Visualization of the Variables.

            The purpose of this task is to examine the dataset.  This task also converts the user variable to a factor and lists the levels of the user and artist variables.  It also displays the summary of the variables and a visualization of each variable.

  • ### Factor user and levels of user and artist
  • lf$user <- factor(lf$user)
  • levels(lf$user)    ## 15,000 users
  • levels(lf$artist)  ## 1,004 artists
  • ## Summary of the Variables
  • summary(lf)
  • summary(lf$user)
  • summary(lf$artist)
  • ## Plot for Visualization of the variables.
  • plot(lf$user, col="blue")
  • plot(lf$artist, col="blue")
  • plot(lf$sex, col="orange")
  • plot(lf$country, col="orange")

Figure 3.  Plots of LastFM Variables.

Part-II:  Association Rules Data Mining, Discussion and Analysis

 Task-1:  Required Computations for Association Rules and Frequent Items

The purpose of this task is to first implement computations which are required for the association rules.  The required package arules is first installed.  This task visualizes the frequency of items in Figure 4.

  • ## Install arules library for association rules
  • install.packages("arules")
  • library(arules)
  • ### computational environment for mining association rules and frequent item sets
  • playlist <- split(x=lf[,"artist"], f=lf$user)
  • playlist[1:2]
  • ## Remove Artist Duplicates.
  • playlist <- lapply(playlist, unique)
  • playlist <- as(playlist, "transactions")
  • ## view this as a list of "transactions"
  • ## transactions is a data class defined in arules
  • itemFrequency(playlist)
  • ## lists the support of the 1,004 bands
  • ## number of times a band is listed on the playlists of the 15,000 users
  • ## computes relative frequency of artist mentioned by the 15,000 users
  • ## plots the item frequencies.
  • itemFrequencyPlot(playlist, support=0.08, cex.names=1.5, col="blue", main="Item Frequency")

Figure 4.  Plot of Item Frequency.

Task-2:  Association Rules Data Mining

The purpose of this task is to mine the music list (lastfm.csv) using the Association Rules technique.  First, the code builds the association rules, keeping only associations with support > 0.01 and confidence > 0.50, which rules out rare bands.  The result is then ordered by confidence for a better understanding of the association rules.

  • ## Build the Association Rules
  • ## Only associations with support > 0.01 and confidence > 0.50
  • ## Rule out rare bands
  • music.association.rules <- apriori(playlist, parameter=list(support=0.01, confidence=0.50))
  • inspect(music.association.rules)
  • ## Filter by lift > 5
  • ## Show only those with lift > 5, among the associations with support > 0.01 and confidence > 0.50.
  • inspect(subset(music.association.rules, subset=lift > 5))
  • ## Order by confidence for better understanding of the association rules result.
  • inspect(sort(subset(music.association.rules, subset=lift > 5), by="confidence"))

Figure 5.  Example of Listening to both “Muse” and “Beatles” with a Confidence of 0.507 for Radiohead.

Figure 6.  Narrow the List by increasing the Lift to > 5 and Decreasing Confidence.

 Task-3: Discussion and Analysis

Association rules are used to explore the relationships between items and sets of items (Fischetti et al., 2017; Giudici, 2005).  Each transaction is composed of one or more items.  The interest is in transactions of at least two items because there cannot be a relationship between several items in the purchase of a single item (Fischetti et al., 2017).  An association rule is an explicit statement of a relationship in the data, in the form X => Y, where X (the antecedent) can be composed of one or several items and is called an itemset, and Y (the consequent) is always a single item.  In this project, the interest is in the antecedents of music, since the interest is in promoting the purchase of music.  The frequent itemsets are the items or collections of items that frequently occur in transactions.  Itemsets are considered frequent if they occur more often than a specified threshold, called the minimal support (Fischetti et al., 2017).  The omission of itemsets with support less than the minimum support is called support pruning (Fischetti et al., 2017).  The support of an itemset is the proportion of all cases in which the itemset of interest is present, which allows estimation of how interesting an itemset or rule is; when support is low, the interest is limited (Fischetti et al., 2017).  The confidence is the proportion of cases of X in which X => Y holds, which can be computed as the number of cases featuring both X and Y divided by the number of cases featuring X (Fischetti et al., 2017).  Lift is a measure of the improvement of the rule support over what can be expected by chance, computed as support(X and Y) / (support(X) * support(Y)) (Fischetti et al., 2017).  If the lift value is not higher than 1, the rule does not explain the relationship between the items any better than chance.  The goal of apriori is to compute the frequent itemsets and the association rules efficiently and to compute support and confidence.  
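
In standard association-rule notation (the definitions implemented in the arules package), these three measures can be written, for a rule X => Y over N transactions, as:

$$\mathrm{supp}(X \Rightarrow Y) = \frac{|\{\,t : X \cup Y \subseteq t\,\}|}{N}, \qquad \mathrm{conf}(X \Rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)}, \qquad \mathrm{lift}(X \Rightarrow Y) = \frac{\mathrm{supp}(X \cup Y)}{\mathrm{supp}(X)\,\mathrm{supp}(Y)}.$$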

In this project, the large lastfm dataset (289,955 observations and four variables) is used.  The descriptive analysis shows that the number of observations from male users (N = 211,823) exceeds the number from female users (N = 78,132), as illustrated in Figure 3.  The top artist has a count of 2,704, followed by "Beatles" with 2,668 and "Coldplay" with 2,378.  The top country has a count of 59,558, followed by the United Kingdom with 27,638 and Germany with 24,251, as illustrated in Task-3 of Part-I.

As illustrated in Figure 1, the first sixteen observations are for user (1), a woman from Germany, and form the first sixteen rows of the data matrix.  The R package arules was used for mining the association rules and for identifying frequent itemsets.  The data are transformed into an incidence matrix where each listener is represented by a row, with 0s and 1s across the columns indicating whether or not the user has played a particular artist.  The incidence matrix is stored in the R object "playlist."  The support for each of the 1,004 artists is calculated, and the support is displayed for all artists with support larger than 8%, indicating that the artists shown on the graph (Figure 4) are played by more than 8% of the users. 

The association rules are then constructed using the apriori function in the R package arules.  The search was for artists or groups of artists who have support larger than 1% and who give confidence larger than 50% to another artist; these requirements rule out rare artists.  The antecedents (LHS) that involve more than one artist are also calculated and listed.  For instance, listening to both "Muse" and "Beatles" has support larger than 1%, and the confidence for "Radiohead," given that someone listens to both "Muse" and "Beatles," is 0.507 with a lift of 2.82, as illustrated in Figure 5.  This result meets both requirements, whereas antecedents involving three artists do not appear in the list because they fail to meet both requirements.  The list is further narrowed by requiring that the lift be larger than 5, and the resulting list is ordered by decreasing confidence as illustrated in Figure 6.  The result shows that listening to both "Led Zeppelin" and "The Doors" has a support of 1%, a confidence of 0.597 (60%), and a lift of 5.69, and is quite predictive of listening to "Pink Floyd," as shown in Figure 6.  Another example is that listening to "Judas Priest" lifts the chance of listening to "Iron Maiden" by a factor of 8.56, as illustrated in Figure 6.  Thus, if a user listens to "Judas Priest," the recommendation is for that user to also listen to "Iron Maiden."  The same reading applies to all six rules listed in Figure 6.
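
As a worked check on these reported numbers, the consequent's support can be backed out from the confidence and lift using the lift definition above; for the {Muse, Beatles} => Radiohead rule this gives approximately

$$\mathrm{supp}(\text{Radiohead}) = \frac{\mathrm{conf}}{\mathrm{lift}} = \frac{0.507}{2.82} \approx 0.18,$$

that is, roughly 18% of the 15,000 users play Radiohead, which is consistent with Radiohead appearing among the artists with support above 8% in Figure 4.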

References

Ahlemeyer-Stubbe, A., & Coleman, S. (2014). A practical guide to data mining for business and industry: John Wiley & Sons.

Fischetti, T., Mayor, E., & Forte, R. M. (2017). R: Predictive Analysis: Packt Publishing.

Giudici, P. (2005). Applied data mining: statistical methods for business and industry: John Wiley & Sons.

Ledolter, J. (2013). Data mining and business analytics with R: John Wiley & Sons.

r-project.org. (2018). R: A Language and Environment for Statistical Computing. Retrieved from https://cran.r-project.org/doc/manuals/r-release/fullrefman.pdf.

Quantitative Analysis of “Prostate Cancer” Dataset Using R-Programming

Dr. O. Aly
Computer Science

Introduction

The purpose of this discussion is to use the prostate cancer dataset available in R, in which biopsy results are given for 97 men.  The goal is to predict tumor spread, measured as the log of cancer volume, in this dataset of 97 men who had undergone a biopsy.  The measures used for prediction are BPH, PSA, the Gleason score, capsular penetration, and the age of the patient.  The predicted tumor size affects the treatment options for the patients, which can include chemotherapy, radiation treatment, and surgical removal of the prostate.  

The dataset "prostate.cancer.csv" is downloaded from the CTU course learning materials.  The dataset has 97 observations (patients) on six variables.  The response variable is the log of cancer volume (lcavol).  The assignment is to predict this variable (lcavol) from five covariates (age, the logarithms of BPH, capsular penetration, and PSA, and the Gleason score) using a decision tree.  The response variable is a continuous measurement variable, and the sum of squared residuals is used as the impurity (fitting) criterion in this analysis.
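
Concretely, with a continuous response the tree grown in the tasks below chooses, at each node, the split that most reduces the sum of squared residuals; this is the deviance that R's tree package reports for a regression tree.  For a node t containing observations y_i with node mean \bar{y}_t, the impurity is

$$D(t) = \sum_{i \in t} (y_i - \bar{y}_t)^2,$$

and a candidate split of t into left and right children t_L and t_R is scored by the reduction D(t) - D(t_L) - D(t_R).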

This assignment discusses and addresses fourteen Tasks as shown below:

Various resources were utilized to develop the required code using R. These resources include (Ahlemeyer-Stubbe & Coleman, 2014; Fischetti, Mayor, & Forte, 2017; Ledolter, 2013; r-project.org, 2018)

Task-1:  Understand the Variables of the Data Sets

The purpose of this task is to understand the variables of the dataset. The dataset has 97 observations or patients with six variables. The response variable for prediction is (lcavol), and the five covariates (age, logarithms of bph, cp, and PSA, and Gleason score) will be used for this prediction using the decision tree.  The response variable is a continuous measurement variable.  Table 1 summarizes these variables including the response variable of (lcavol).

Table 1:  Prostate Cancer Variables.

Task-2:  Load and Review the Dataset using names(), heads(), dim() functions

  • pc <- read.csv("C:/CS871/prostate.cancer.csv")
  • pc
  • dim(pc)
  • names(pc)
  • head(pc)
  • pc <- data.frame(pc)
  • head(pc)
  • str(pc)
  • pc <- data.frame(pc)
  • summary(pc)
  • plot(pc, col="blue", main="Plot of Prostate Cancer")

Figure 1. Plot of Prostate Variables.

Task-3:  Distribution of Prostate Cancer Variables.

  • #### Distribution of Prostate Cancer Variables
  • ### These are the variable names
  • colnames(pc)
  • ## Set up grid, margins.
  • par(mfrow=c(3,3), mar=c(4,4,2,0.5))
  • for (j in 1:ncol(pc))
  • {
  • hist(pc[,j], xlab=colnames(pc)[j],
  • main=paste("Histogram of", colnames(pc)[j]),
  • col="blue", breaks=20)
  • }
  • hist(pc$lcavol, col="orange")
  • hist(pc$age, col="orange")
  • hist(pc$lbph, col="orange")
  • hist(pc$lcp, col="orange")
  • hist(pc$gleason, col="orange")
  • hist(pc$lpsa, col="orange")

Figure 2.  Distribution of Prostate Cancer Variables.

Task-4:  Correlation Among Prostate Variables

  • ## Correlations between prostate cancer variables
  • pc.cor = cor(pc)
  • pc.cor <- round(pc.cor, 3)
  • summary(pc)
  • pc.cor[lower.tri(pc.cor, diag=TRUE)] = 0
  • pc.cor.sorted = sort(abs(pc.cor), decreasing=T)
  • pc.cor.sorted[1]
  • ## [1] 0.871
  • # Use arrayInd()
  • vars.big.cor = arrayInd(which(abs(pc.cor)==pc.cor.sorted[1]), dim(pc.cor))
  • colnames(pc)[vars.big.cor]
  • ## [1] "lcp"     "gleason"
  • pc.cor.sorted[2]
  • ## [1] 0.812
  • vars.big.cor = arrayInd(which(abs(pc.cor)==pc.cor.sorted[2]), dim(pc.cor))
  • colnames(pc)[vars.big.cor]
  • ## [1] "lbph" "lcp"

Task-5:  Visualization of the Relationship and Correlation between the Cancer Spread (lcavol) and other Variables.

  • ## Visualizing relationships among variables
  • plot(lcavol~age, data=pc, col="red", main="Relationship of Age on the Cancer Volume (lcavol)")
  • plot(lcavol~lbph, data=pc, col="red", main="Relationship of Amount of Benign Prostatic Hyperplasia (lbph) on the Cancer Volume (lcavol)")
  • plot(lcavol~lcp, data=pc, col="red", main="Relationship of Capsular Penetration (lcp) on the Cancer Volume (lcavol)")
  • plot(lcavol~gleason, data=pc, col="red", main="Relationship of Gleason System (gleason) on the Cancer Volume (lcavol)")
  • plot(lcavol~lpsa, data=pc, col="red", main="Relationship of Prostate Specific Antigen (lpsa) on the Cancer Volume (lcavol)")

Figure 3.  Plot of Correlation Among Prostate Variables Using plot() Function.

  • ## Correlation among the variables using the corrplot() function
  • install.packages("ElemStatLearn")
  • library(ElemStatLearn)    ## it contains the data
  • install.packages("car")
  • library(car)              ## package to calculate variance inflation factor
  • install.packages("corrplot")
  • library(corrplot)         ## correlation plots
  • install.packages("leaps")
  • library(leaps)            ## best subsets regression
  • install.packages("glmnet")
  • library(glmnet)           ## allows ridge regression, LASSO and elastic net
  • install.packages("caret")
  • library(caret)            ## parameter tuning
  • pc.cor = cor(pc)
  • corrplot.mixed(pc.cor)

Figure 4.  Plot of Correlation Among Prostate Variables Using corplot() Function.

Task-6:  Build a Decision Tree for Prediction

  • ## Building a Decision Tree for Prediction
  • install.packages("tree")
  • library(tree)
  • ## Construct the tree
  • pctree <- tree(lcavol ~ ., data=pc, mindev=0.1, mincut=1)
  • pctree <- tree(lcavol ~ ., data=pc, mincut=1)
  • pctree
  • plot(pctree, col=4)
  • text(pctree, digits=2)
  • pccut <- prune.tree(pctree, k=1.7)
  • plot(pccut, col="red", main="Pruning Using k=1.7")
  • pccut
  • text(pccut, digits=2)
  • pccut <- prune.tree(pccut, k=2.05)
  • plot(pccut, col="darkgreen", main="Pruning Using k=2.05")
  • pccut
  • text(pccut, digits=2)
  • pccut <- prune.tree(pctree, k=3)
  • plot(pccut, col="orange", main="Pruning Using k=3")
  • pccut
  • text(pccut, digits=2)
  • pccut <- prune.tree(pctree)
  • pccut
  • plot(pccut, col="blue", main="Decision Tree Pruning")
  • pccut <- prune.tree(pctree, best=3)
  • pccut
  • plot(pccut, col="orange", main="Pruning Using best=3")
  • text(pccut, digits=2)

Figure 5:  Initial Tree Development.

Figure 6:  First Pruning with α=1.7.

Figure 7:  Second Pruning with α=2.05.

Figure 8:  Third Pruning with α=3.

Figure 9:  Plot of the Decision Tree Pruning.

Figure 10.  Plot of the Final Tree.

Task-7:  Cross-Validation

  • ## Use 10-fold cross-validation to prune the tree
  • set.seed(2)
  • cvpc <- cv.tree(pctree, K=10)
  • cvpc$size
  • cvpc$dev
  • plot(cvpc, pch=21, bg=8, type="p", cex=1.5, ylim=c(65,100), col="blue")
  • pccut <- prune.tree(pctree, best=3)
  • pccut
  • plot(pccut, col="red")
  • text(pccut)
  • ## Final Plot
  • plot(pc[,c("lcp","lpsa")], col="red", cex=0.2*exp(pc$lcavol))
  • abline(v=.261624, col=4, lwd=2)
  • lines(x=c(-2,.261624), y=c(2.30257,2.30257), col=4, lwd=2)

Figure 13.  Plot of the Cross-Validation Deviance.

Figure 14.  Plot of the Final Classification Tree.

Figure 15.  Plots of Cross-Validation.

Task-8:  Discussion and Analysis

The classification and regression tree (CART) represents a nonparametric technique which generalizes parametric regression models (Ledolter, 2013).  It allows for non-linearity and variable interactions with no need to specify the structure in advance.  Furthermore, the violation of constant variance, which is a critical assumption in the regression model, is not critical in this technique (Ledolter, 2013).

The descriptive statistics show that lcavol has a mean of 1.35, which is less than the median of 1.45, indicating a negatively skewed distribution, with a minimum of -1.35 and a maximum of 2.8.  The age of the prostate cancer patients has an average of 64 years, with a minimum of 41 and a maximum of 79 years.  The lbph has an average of 0.1004, which is less than the median of 0.300, indicating a similarly negatively skewed distribution, with a minimum of -1.39 and a maximum of 2.33.  The lcp has an average of -0.18, which is higher than the median of -0.79, indicating a positively skewed distribution, with a minimum of -1.39 and a maximum of 2.9.  The Gleason measure has a mean of 6.8, which is slightly less than the median of 7, indicating a slightly negatively skewed distribution, with a minimum of 6 and a maximum of 9.  The last variable, lpsa, has an average of 2.48, which is slightly less than the median of 2.59, indicating a slightly negatively skewed distribution, with a minimum of -0.43 and a maximum of 5.58.  The result shows that there is a positive correlation between lpsa and lcavol, and between lcp and lcavol as well.  The result also shows that lcavol tends to increase for ages between 60 and 70.

Furthermore, the result shows that the Gleason score takes integer values of 6 and larger.  The result for lpsa shows that the log PSA score is close to normally distributed.  The result in Task-4 on the correlation among prostate variables is not surprising, as it shows that patients with a high Gleason score now likely also had a history of high Gleason scores.  The result also shows that lcavol should be included as a predictor in any prediction of lpsa.

As illustrated in Figure 4, the result shows that PSA is highly correlated with the log of cancer volume (lcavol); the relationship appears to be highly linear.  The result also shows that multicollinearity may become an issue; for example, cancer volume is also correlated with capsular penetration, which in turn is correlated with seminal vesicle invasion.

For the implementation of the tree, the initial tree has 12 leaf nodes, and the size of the tree is thus 12, as illustrated in Figure 5.  The root shows the 97 cases with a deviance of 133.4.  Node 1 is the root; Node 2 has lcp < 0.26 with 63 patients and a deviance of 64.11.  Node 3 has lcp > 0.26 with 34 cases and a deviance of 13.39.  Node 4 has lpsa < 2.30 with 35 cases and a deviance of 24.72.  Node 5 has lpsa > 2.30 with 28 cases and a deviance of 18.6.  Node 6 has lcp < 2.14 with 25 cases and a deviance of 6.662.  Node 7 has lcp > 2.139 with 9 cases and a deviance of 1.48.  Node 8 has lpsa < 0.11 with 4 cases and a deviance of 0.3311, while Node 9 has lpsa > 0.11 with 31 cases and a deviance of 18.92, split further into age < 52 with a deviance of 0.12 and age > 52 with a deviance of 13.88.  Node 10 has lpsa < 3.25 with 23 cases and a deviance of 11.61, while Node 11 has lpsa > 3.25 with 5 cases and a deviance of 1.76.  Node 12 is for age < 62 with 7 cases and a deviance of 0.73.

The first pruning process using α=1.7 did not result in any difference from the initial tree; it still resulted in 12 nodes.  The second pruning with α=2.05 improved the tree to eight nodes.  The root shows the same result of 97 cases with a deviance of 133.4.  Node 1 has lcp < 0.26 with a deviance of 64.11, while Node 2 has lcp > 0.26 with a deviance of 13.39.  The third pruning using α=3 further improved the tree, as shown in Figure 8.  The final tree has the root with four nodes: Node 1 for lcp < 0.26 and Node 2 for lcp > 0.26; Node 3 has lpsa < 2.30, while Node 4 reflects lpsa > 2.30.  With regard to prediction, a patient with lcp=0.20 (categorized in Node 2) and lpsa of 2.40 (categorized in Node 4) can be predicted to have a log cancer volume (lcavol) of 1.20.

The biggest challenge for the CART model, which is described as flexible in comparison to regression models, is overfitting (Giudici, 2005; Ledolter, 2013).  If the splitting algorithm is not stopped, the tree algorithm can ultimately extract all information from the data, including information which is not and cannot be predicted in the population with the current set of predictors, causing random or noise variation (Ledolter, 2013).  However, when subsequent splits add only minimal improvement to the prediction, stopping the generation of new split nodes can be used as a defense against the overfitting issue.  Thus, if 90% of all cases can be predicted correctly from 10 splits, and 90.1% of all cases from 11 splits, then there is no need to add the 11th split to the tree, as it adds only 0.1% of value.  There are various techniques to stop the splitting process.  The basic constraints (mincut, mindev) lead to a full tree fit with a certain number of terminal nodes.  In this prostate analysis, mincut=1 is used, which is the minimum number of observations to include in a child node, and a tree of size 12 is obtained.

Once the tree-building is stopped as illustrated in Figure 10, cross-validation is used to evaluate the quality of the prediction of the current tree.  Cross-validation subjects the tree computed from one set of observations (the training sample) to another independent set of observations (the test sample).  If most or all of the splits determined by the analysis of the training sample are based on random noise, then the prediction for the test sample will be poor.  The cross-validation cost, or CV cost, is the averaged error rate for a particular tree size.  The tree size which produces the minimum CV cost is found, and the reference tree is then pruned back to the number of nodes matching that size.  Pruning was implemented in a stepwise bottom-up manner, by removing the least important nodes during each pruning cycle.  The v-fold CV is implemented with the R command cv.tree().  The graph in Figure 13 of the CV deviance indicates that, for the prostate example, a tree of size 3 is appropriate.  Thus, the reference tree which was obtained from all the data is pruned back to size 3.  CV chooses capsular penetration and PSA as the decision variables.  The effect of capsular penetration on the response of log cancer volume (lcavol) depends on PSA.  The final graph in Figure 15 shows that CART divides the space of the explanatory variables into rectangles, with each rectangle leading to a different prediction.  The size of the circles of the data points in the respective rectangles reflects the magnitude of the response.  Figure 15 confirms that the tree splits are quite reasonable.

References

Ahlemeyer-Stubbe, A., & Coleman, S. (2014). A practical guide to data mining for business and industry: John Wiley & Sons.

Fischetti, T., Mayor, E., & Forte, R. M. (2017). R: Predictive Analysis: Packt Publishing.

Giudici, P. (2005). Applied data mining: statistical methods for business and industry: John Wiley & Sons.

Ledolter, J. (2013). Data mining and business analytics with R: John Wiley & Sons.

r-project.org. (2018). R: A Language and Environment for Statistical Computing. Retrieved from https://cran.r-project.org/doc/manuals/r-release/fullrefman.pdf.

Analysis of Ensembles

Dr. O. Aly
Computer Science

Introduction

The purpose of this discussion is to analyze the creation of ensembles from different methods such as logistic regression, nearest neighbor methods, classification trees, Bayesian classifiers, or discriminant analysis.  This discussion also addresses the use of the Random Forest to do the analysis.

Ensembles

There are two useful techniques which combine methods for improving predictive power: ensembles and uplift modeling.  Ensembles are the focus of this discussion; thus, uplift modeling is not discussed here.  An ensemble combines multiple “supervised” models into a “super-model” (Shmueli, Bruce, Patel, Yahav, & Lichtendahl Jr, 2017).  An ensemble is based on the dominant notion of combining models (EMC, 2015; Shmueli et al., 2017).  Thus, several models can be combined to achieve improved predictive accuracy (Shmueli et al., 2017). 

Ensembles played a significant role in the million-dollar Netflix Prize contest which started in 2006 to improve their movie recommendation system (Shmueli et al., 2017).  The principle of combining methods is known for reducing risk because the variation is smaller than each of the individual components (Shmueli et al., 2017).  The risk is equivalent to a variation in prediction error in predictive modeling.  The more the prediction errors vary, the more volatile the predictive model (Shmueli et al., 2017).  Using an average of two predictions can potentially result in smaller error variance, and therefore, better predictive power (Shmueli et al., 2017).  Thus, results can be combined from multiple prediction methods or classifiers (Shmueli et al., 2017).  The combination can be implemented for predictions, classifications, and propensities as discussed below. 

Ensembles Combining Prediction Using Average Method

When combining predictions, the predictions from the different methods can be combined by taking an average.  One alternative to a simple average is taking the median prediction, which is less affected by extreme predictions (Shmueli et al., 2017).  Computing a weighted average is another possibility, where the weights are proportional to a quantity of interest such as quality or accuracy (Shmueli et al., 2017).  Ensembles for prediction are useful not only in cross-sectional prediction but also in time series forecasting (Shmueli et al., 2017).
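As a minimal R sketch of this averaging idea (not from the cited sources; the data below are simulated purely for illustration), the predictions of a linear regression and a regression tree are combined by a simple average, a median, and a weighted average:

  • ## Hedged sketch: combining numeric predictions by averaging (simulated data)
  • library(rpart)
  • set.seed(1)
  • train <- data.frame(x = runif(100)); train$y <- 2*train$x + rnorm(100, sd = 0.2)
  • test  <- data.frame(x = runif(20))
  • m1 <- lm(y ~ x, data = train)                      ## model 1: linear regression
  • m2 <- rpart(y ~ x, data = train)                   ## model 2: regression tree
  • p1 <- predict(m1, newdata = test)
  • p2 <- predict(m2, newdata = test)
  • ens.avg    <- (p1 + p2) / 2                        ## simple average
  • ens.median <- apply(cbind(p1, p2), 1, median)      ## median prediction
  • ens.wtd    <- 0.7*p1 + 0.3*p2                      ## weighted average (weights are illustrative)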

Ensembles Combining Classification Using Voting Method

When combining classifications, the results from multiple classifiers can be combined using “voting”: for each record, multiple classifications are available, and a simple rule would be to choose the most popular class among these classifications (Shmueli et al., 2017).  For instance, a Classification Tree, a Naïve Bayes classifier, and discriminant analysis can be used for classifying a binary outcome (Shmueli et al., 2017).  For each record, three predicted classes are generated, and simple voting would choose the most common class of the three (Shmueli et al., 2017).  Similar to prediction, heavier weights can be assigned to scores from some models, based on considerations such as model accuracy or data quality, which can be implemented by setting a “majority rule” which is different from 50% (Shmueli et al., 2017).  Concerning the nearest neighbor method (K-NN), ensemble learning such as bagging can be performed with K-NN (Dubitzky, 2008).  The individual decisions are combined to classify new examples, and the combining of individual results is performed by weighted or unweighted voting (Dubitzky, 2008).
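As a minimal R sketch of majority voting (not from the cited sources; the three prediction vectors below are toy values standing in for the outputs of a tree, a Naïve Bayes classifier, and a discriminant analysis):

  • ## Hedged sketch: simple majority voting across three assumed classifiers
  • pred.tree <- c("yes", "no",  "yes", "no")
  • pred.nb   <- c("yes", "yes", "no",  "no")
  • pred.lda  <- c("no",  "yes", "yes", "no")
  • votes <- data.frame(pred.tree, pred.nb, pred.lda, stringsAsFactors = FALSE)
  • ## for each record, choose the most frequent class among the three predictions
  • majority <- apply(votes, 1, function(v) names(which.max(table(v))))
  • majority
  • ## [1] "yes" "yes" "yes" "no"   (expected output for these toy inputs)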

Ensembles Combining Propensities Using Average Method

Similar to prediction, propensities can be combined by taking a simple or weighted average.  Some algorithms such as Naïve Bayes produce biased propensities and should not, therefore, be averaged with propensities from other methods (Shmueli et al., 2017).

Other Forms of Ensembles

Various methods are commonly used for classification, including bagging, boosting, random forests, and support vector machines (SVM).  Bagging, boosting, and random forests are all examples of ensemble methods which use multiple models to obtain better predictive performance than can be obtained from any of the constituent models (EMC, 2015; Ledolter, 2013; Shmueli et al., 2017).

  • Bagging: It is short for “bootstrap aggregating” (Ledolter, 2013; Shmueli et al., 2017).  It was proposed by Leo Breiman in 1994 as a model aggregation technique to reduce model variance (Swamynathan, 2017).  It is another form of ensemble which is based on averaging across multiple random data samples (Shmueli et al., 2017).  There are two steps to implement bagging, and Figure 1 illustrates the bagging process flow:
    • Generate multiple random samples by sampling “with replacement from the original data.”  This method is called “bootstrap sampling.”
    • Running an algorithm on each sample and producing scores (Shmueli et al., 2017).

Figure 1.  Bagging Process Flow (Swamynathan, 2017).

Bagging improves the performance stability of a model and helps avoid overfitting by separately modeling different data samples and then combining the results.  Thus, it is especially useful for algorithms such as trees and neural networks.  Figure 2 illustrates an example in which the bootstrap sample has the same size as the original sample, with roughly ¾ of the original values appearing, plus repeated values due to sampling with replacement.  A minimal R sketch of the bagging idea follows Figure 2.

Figure 2:  Bagging Example (Swamynathan, 2017).
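The following R sketch is a minimal illustration of the two bagging steps above (not the authors' code; the data frame is simulated): bootstrap samples are drawn with replacement, a regression tree is fitted to each sample with rpart, and the resulting scores are averaged.

  • ## Hedged sketch: bagging regression trees on a simulated data frame `dat` with response y
  • library(rpart)
  • set.seed(123)
  • dat <- data.frame(x1 = runif(200), x2 = runif(200))
  • dat$y <- 3*dat$x1 - 2*dat$x2 + rnorm(200, sd = 0.3)
  • B <- 50                                            ## number of bootstrap samples
  • preds <- matrix(NA, nrow = nrow(dat), ncol = B)
  • for (b in 1:B) {
  •   idx <- sample(1:nrow(dat), replace = TRUE)       ## step 1: bootstrap sample with replacement
  •   fit <- rpart(y ~ x1 + x2, data = dat[idx, ])     ## step 2: run the algorithm on each sample
  •   preds[, b] <- predict(fit, newdata = dat)        ## score the original data
  • }
  • bagged.pred <- rowMeans(preds)                     ## combine by averaging the B scores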

  • Boosting: It is a slightly different method of creating ensembles.  It was introduced by Freund and Schapire in 1995 with the well-known AdaBoost algorithm (adaptive boosting) (Swamynathan, 2017).  The underlying concept of boosting is that, rather than fitting independent individual hypotheses, combining hypotheses in a sequential order increases the accuracy (Swamynathan, 2017).  Boosting algorithms convert “weak learners” into “strong learners” (Swamynathan, 2017).  Boosting algorithms are well designed to address bias problems (Swamynathan, 2017), and boosting tends to increase accuracy (Ledolter, 2013).  The AdaBoost process involves three steps, illustrated in Figure 3:

  1. Assign a uniform weight to all data points, W0(x)=1/N, where N is the total number of training data points.
  2. At each iteration m, fit a classifier ym(xn) to the training data and update the weights to minimize the weighted error function.
  3. Form the final model as a weighted combination of the fitted classifiers; in AdaBoost this takes the form YM(x) = sign(Σm αm ym(x)), where the weight αm of each classifier depends on its weighted error rate.

Figure 3.  “AdaBoosting” Process (Swamynathan, 2017).

            As an example illustration of AdaBoost, consider a sample dataset with 10 data points, with the assumption that all data points initially have equal weights given by 1/10, as illustrated in Figure 4.  A minimal R sketch of this weighting scheme follows Figure 4.

Figure 4.  An Example Illustration of AdaBoost: Final Model After Three Iterations (Swamynathan, 2017).
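The following R sketch is a minimal, simplified illustration of the AdaBoost steps above (not the authors' code; the data are simulated, the weak learners are rpart stumps, and the labels are assumed to be coded -1/+1):

  • ## Hedged sketch: AdaBoost-style boosting of decision stumps on simulated data
  • library(rpart)
  • set.seed(42)
  • x <- data.frame(x1 = runif(100), x2 = runif(100))
  • y <- ifelse(x$x1 + x$x2 > 1, 1, -1)
  • N <- length(y); M <- 10
  • w <- rep(1/N, N)                                     ## step 1: uniform weights W0(x) = 1/N
  • alpha <- numeric(M); stumps <- vector("list", M)
  • for (m in 1:M) {                                     ## step 2: sequential fits with weight updates
  •   d <- data.frame(x, y = factor(y))
  •   stumps[[m]] <- rpart(y ~ ., data = d, weights = w, control = rpart.control(maxdepth = 1))
  •   pred <- ifelse(predict(stumps[[m]], d, type = "class") == "1", 1, -1)
  •   err  <- sum(w * (pred != y)) / sum(w)              ## weighted error of the m-th stump
  •   alpha[m] <- 0.5 * log((1 - err) / err)             ## classifier weight
  •   w <- w * exp(-alpha[m] * y * pred); w <- w / sum(w)
  • }
  • ## step 3: final model = sign of the weighted sum of the stump votes
  • scores <- sapply(1:M, function(m)
  •   alpha[m] * ifelse(predict(stumps[[m]], data.frame(x), type = "class") == "1", 1, -1))
  • final.pred <- sign(rowSums(scores))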

  • Random Forest: It is another class of ensemble method which uses decision tree classifiers.  It is a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest.  A particular case of random forest uses bagging on decision trees, where samples are randomly chosen with replacement from the original training set (EMC, 2015).
  • SVM: It is another common classification method which combines linear models with instance-based learning techniques.  SVM selects a small number of critical boundary instances called support vectors from each class and builds a linear decision function which separates them as widely as possible.  SVM can efficiently perform linear classifications by default and can also be configured to perform non-linear classifications (EMC, 2015).

Advantages and Limitations of Ensembles

Combining scores from multiple models is aimed at generating more precise predictions by lowering the prediction error variance (Shmueli et al., 2017).  The ensemble method is most useful when the combined models generate prediction errors which are negatively associated or correlated, but it can also be useful when the correlation is low (Ledolter, 2013; Shmueli et al., 2017).  Ensembles can use simple averaging, weighted averaging, voting, and the median (Ledolter, 2013; Shmueli et al., 2017).  Models can be based on the same algorithm or different algorithms, using the same sample or different samples (Ledolter, 2013; Shmueli et al., 2017).  Ensembles have become an important strategy for participants in data mining contests, where the goal is to optimize some predictive measure (Ledolter, 2013; Shmueli et al., 2017).  Ensembles which are based on different data samples help avoid overfitting; however, overfitting can also happen with an ensemble, for instance in the choice of the best weights when using a weighted average (Shmueli et al., 2017).

The primary limitation of ensembles is the resources they require, in terms of computation, skills, and time investment (Shmueli et al., 2017).  Ensembles which combine results from different algorithms require the development and evaluation of each model.  Boosting-type and bagging-type ensembles do not require much effort, but they do have a computational cost.  Furthermore, ensembles which rely on multiple data sources require the collection and maintenance of those data sources (Shmueli et al., 2017).  Ensembles are regarded as “black box” methods, in which the relationship between the predictors and the outcome variable usually becomes non-transparent (Shmueli et al., 2017). 

The Use of Random Forests for Analysis

The decision tree is based on a set of true/false decision rules, and the prediction is based on the tree rules for each terminal node.  A decision tree built on a small training sample encounters the overfitting problem.  The random forest model, in contrast, is well suited to handle small sample size problems.  The random forest contains multiple decision trees, and generally the more trees, the better.  Randomness comes from selecting a random training subset from the training dataset, using the bootstrap aggregating (bagging) method, which reduces overfitting by stabilizing the predictions.  This method is utilized in many other machine-learning algorithms, not only in Random Forests (Hodeghatta & Nayak, 2016).  There is another type of randomness which occurs when selecting variables randomly from the set of variables, resulting in different trees which are based on different sets of variables.  In a forest, all the trees still influence the overall prediction by the random forest (Hodeghatta & Nayak, 2016).

The programming logic for Random Forest includes seven steps, as follows (Azhad & Rao, 2011); a minimal R sketch using the randomForest package follows the list.

  1. Input the number of training set N.
  2. Compute the number of attributes M.
  3. For (m) input attributes used to form the decision at a node m<M.
  4. Choose training set by sampling with replacement.
  5. For each node of the tree, use one of the (m) variables as the decision node.
  6. Grow each tree without pruning.
  7. Select the classification with maximum votes.
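The following R sketch is a minimal illustration of these steps using the randomForest package on the built-in iris data (an assumed example, not from the cited sources): the package grows ntree unpruned trees on bootstrap samples, tries a random subset of mtry variables at each node, and classifies by majority vote.

  • ## Hedged sketch: fitting a random forest on the built-in iris data
  • install.packages("randomForest")
  • library(randomForest)
  • set.seed(7)
  • rf <- randomForest(Species ~ ., data = iris,
  •                    ntree = 500,   ## number of trees grown without pruning
  •                    mtry  = 2)     ## number of variables (m < M) tried at each split
  • rf                                ## OOB error estimate and confusion matrix
  • importance(rf)                    ## variable importance
  • predict(rf, newdata = iris[1:5, ])  ## classification by majority vote across the trees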

Random Forests have a low bias (Hodeghatta & Nayak, 2016).  Adding more trees reduces the variance, and thus the overfitting, which is one of the advantages of Random Forests and a reason for their growing popularity.  Random Forest models are relatively robust to the set of input variables and often require little pre-processing of the data.  Random Forests are described as more efficient to build than other models such as SVM (Hodeghatta & Nayak, 2016).  Table 1 summarizes the advantages and disadvantages of Random Forests in comparison with other classification algorithms such as Naïve Bayes, Decision Tree, and Nearest Neighbor. 

Table 1.  Advantages and Disadvantages of Random Forest in Comparison with Other Classification Algorithms.  Adapted from (Hodeghatta & Nayak, 2016).

References

Azhad, S., & Rao, M. S. (2011). Ensuring data storage security in cloud computing.

Dubitzky, W. (2008). Data Mining in Grid Computing Environments: John Wiley & Sons.

EMC. (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. (1st ed.): Wiley.

Hodeghatta, U. R., & Nayak, U. (2016). Business Analytics Using R-A Practical Approach: Springer.

Ledolter, J. (2013). Data mining and business analytics with R: John Wiley & Sons.

Shmueli, G., Bruce, P. C., Patel, N. R., Yahav, I., & Lichtendahl Jr, K. C. (2017). Data mining for business analytics: concepts, techniques, and applications in R: John Wiley & Sons.

Swamynathan, M. (2017). Mastering Machine Learning with Python in Six Steps: A Practical Implementation Guide to Predictive Data Analytics Using Python: Apress.

Decision Trees with a Comparison of Classification and Regression Decision Tree (CART)

Dr. O. Aly
Computer Science

Introduction

The purpose of this discussion is to analyze Decision Trees, with a comparison of Classification and Regression Decision Trees.  The discussion also addresses the advantages and disadvantages of Decision Trees.  The focus of this discussion is on the Classification and Regression Tree (CART) algorithm as one of the statistical criteria.  The discussion begins with a brief overview of Classification, followed by additional related topics.  It ends with a sample Decision Tree for the decision of whether or not to take an umbrella.

Classification

Classification is a fundamental data mining technique (EMC, 2015).  Most classification methods are supervised, in that they start with a training set of pre-labeled observations to learn how the attributes of these observations may contribute to the classification of future unlabeled observations (EMC, 2015).  For instance, marketing, sales, and customer demographic data can be used to develop a classifier to assign a “purchase” or “no purchase” label to potential future customers (EMC, 2015).  Classification is widely used for prediction purposes (EMC, 2015).  Logistic Regression is one of the popular classification methods (EMC, 2015).  Classification can also be used by health care professionals to diagnose diseases such as heart disease (EMC, 2015).  There are two fundamental classification methods: Decision Trees and Naïve Bayes.  In this discussion, the focus is on the Decision Trees. 

The Tree Models vs. Linear & Logistic Regression Models

The tree models are distinguished from the Linear and Logistic Regression models.  The tree models produce a classification of observations into groups first and then obtain a score for each group, while the Linear and Logistic Regression methods produce a score and then possibly a classification based on a discriminant rule (Giudici, 2005). 

Regression Trees vs. Classification Trees

The tree models are divided into regression trees and classification trees (Giudici, 2005).  The regression trees are used when the response variable is continuous, while the classification trees are used when the response variable is quantitative discrete or qualitative (categorical) (Giudici, 2005).  The tree models can be defined as a recursive process through which a set of (n) statistical units is divided into groups progressively, based on a division rule aiming to increase a homogeneity or purity measure of the response variable in each of the obtained groups (Giudici, 2005).  An explanatory variable specifies a division rule at each step of the procedure, to split the observations and establish splitting rules to partition them (Giudici, 2005).  The final partition of the observations is the main result of a tree model (Giudici, 2005).  It is critical to specify a “stopping criterion” for the division process to achieve such a result (Giudici, 2005). 

Concerning the classification tree, fitted values are given in terms of the fitted probabilities of affiliation to a single group (Giudici, 2005).  A discriminant rule for the classification trees can be derived at each leaf of the tree (Giudici, 2005).  Classifying all observations belonging to a terminal node in the class corresponding to the most frequent level is a commonly used rule, called the “majority rule” (Giudici, 2005).  While other “voting” schemes can also be implemented, in the absence of other considerations this rule is the most reasonable (Giudici, 2005).  Thus, each of the leaves points out a clear allocation rule for the observations, which is read by following the path that connects the initial node to that leaf.  Therefore, every path in the tree model represents a classification rule (Giudici, 2005).  

With comparison to other discriminant models, the tree models produce rules which are less explicit analytically but easier to understand graphically (Giudici, 2005).  The tree models can be regarded as nonparametric predictive models, as they do not require assumptions about the probability distribution of the response variable (Giudici, 2005).  This flexibility indicates that the tree models are generally applicable, whatever the nature of the dependent variable and the explanatory variables (Giudici, 2005).  However, this flexibility comes with disadvantages: a higher demand for computational resources, and a sequential nature and algorithmic complexity which make the trees dependent on the observed data, so that even a small change in the data might alter the structure of the tree (Giudici, 2005).  Thus, it is difficult to take a tree structure designed for one context and generalize it to other contexts (Giudici, 2005).

The Classification Tree Analysis vs. The Hierarchical Cluster Analysis

The classification tree analysis is distinguished from the hierarchical cluster analysis despite their graphical similarities (Giudici, 2005).  The classification trees are predictive rather than descriptive.  While the hierarchical cluster analysis performs an unsupervised classification of the observations based on all available variables, the classification trees perform a classification of the observations based on the explanatory variables, supervised by the presence of the response (target) variable (Giudici, 2005).  The second critical difference between the hierarchical cluster analysis and the classification tree analysis is related to the partition rule.  While in the classification trees the segmentation is typically carried out using only one explanatory variable at a time, in the hierarchical clustering the divisive or agglomerative rule between groups is established based on considerations of the distance between them, calculated using all the available variables (Giudici, 2005).

Decision Trees Algorithms

The goal of Decision Trees is to extract from the training data the succession of decisions about the attributes that best explains the class, that is, group membership (Fischetti, Mayor, & Forte, 2017).  Decision Trees have a root, which is the best attribute on which to split the data with respect to the outcome (Fischetti et al., 2017).  The dataset is partitioned into branches by this attribute (Fischetti et al., 2017).  The branches lead to other nodes which correspond to the next best partition for the considered branch (Fischetti et al., 2017).  The process continues until the terminal nodes are reached, where no more partitioning is required (Fischetti et al., 2017).  Decision Trees allow class predictions (group membership) of previously unseen observations (testing or prediction datasets) using statistical criteria applied on the seen data (training dataset) (Fischetti et al., 2017).  Commonly used algorithms, each with its own statistical criteria, include:

  • ID3
  • C4.5
  • Random Forest.
  • Conditional Inference Trees.
  • Classification and Regression Trees (CART).

The most used algorithm in the statistical community is the CART algorithm, while C4.5 and its latest version, C5.0, are widely used by computer scientists (Giudici, 2005).  The first versions of C4.5 and C5.0 were limited to categorical predictors, but the most recent versions are similar to CART (Giudici, 2005).  

Classification and Regression Trees (CART)

CART is often used as a generic acronym for decision trees, although it is a specific implementation of tree models (EMC, 2015).  CART, similar to C4.5, can handle continuous attributes (EMC, 2015).  While C4.5 uses entropy-based criteria to rank tests, CART uses the Gini diversity index, Gini(t) = 1 − Σi p(i|t)², where p(i|t) is the proportion of class i at node t (EMC, 2015; Fischetti et al., 2017).

Moreover, while C4.5 uses stopping rules, CART constructs a sequence of subtrees, uses cross-validation to estimate the misclassification cost of each subtree, and chooses the one with the lowest cost (EMC, 2015; Hand, Mannila, & Smyth, 2001).  CART represents a powerful nonparametric technique which generalizes parametric regression models (Ledolter, 2013).  It allows nonlinearity and variable interactions without having to specify the structure in advance (Ledolter, 2013).  It operates by choosing the best variable for splitting the data into two groups at the root node (Hand et al., 2001).  It builds the tree using a single variable at a time and can readily deal with large numbers of variables (Hand et al., 2001).  It uses different statistical criteria to decide on tree splits (Fischetti et al., 2017).  There are some differences between CART used for classification and the other algorithms in the family.  In CART, the attribute to be partitioned is selected with the Gini index as the decision criterion (Fischetti et al., 2017).  This method is described as more efficient compared to information gain and gain ratio (Fischetti et al., 2017).  CART implements the necessary partitioning on the modalities of the attribute and merges modalities for the partition, such as modality A versus modalities B and C (Fischetti et al., 2017).  CART can also predict a numeric outcome (Fischetti et al., 2017).  In the case of regression trees, CART performs regression and builds the tree in a way which minimizes the squared residuals (Fischetti et al., 2017). 
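The following R sketch is a minimal illustration of CART with the rpart package on the built-in iris data (an assumed example, not from the cited sources): a classification tree split with the Gini index, and a regression tree that minimizes the squared residuals.

  • ## Hedged sketch: CART classification and regression trees with rpart
  • install.packages("rpart")
  • library(rpart)
  • ## Classification tree: Gini index as the splitting criterion
  • class.tree <- rpart(Species ~ ., data = iris, method = "class",
  •                     parms = list(split = "gini"))
  • class.tree
  • plot(class.tree); text(class.tree)
  • ## Regression tree: splits chosen to minimize the squared residuals
  • reg.tree <- rpart(Sepal.Length ~ Sepal.Width + Petal.Length + Petal.Width,
  •                   data = iris, method = "anova")
  • plot(reg.tree); text(reg.tree)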

CART Algorithms of Division Criteria and Pruning

There are two critical aspects of the CART algorithm:  Division Criteria, and Pruning, which can be employed to reduce the complexity of a tree (Giudici, 2005).  Concerning the division criteria algorithm, the primary essential element of a tree model is to choose the division rule for the units belonging to a group, corresponding to a node of the tree (Giudici, 2005).  The decision rule selection means a predictor selection from those available, and the selection of the best partition of its levels (Giudici, 2005).  The selection is generally made using a goodness measure of the corresponding division rule, which allows the determination of the rule to maximize the goodness measure at each stage of the procedure (Giudici, 2005). 

The impurity concept refers to a measure of variability of the response values of the observations (Giudici, 2005).  In a regression tree, a node will be pure if it has null variance as all observations are equal, and it will be impure if the variance of the observation is high (Giudici, 2005).  For the regression trees, the impurity corresponds to the variance, while for the classification trees alternative measures for the impurity are considered such as Misclassification impurity, Gini impurity, Entropy impurity, and Tree assessments (Giudici, 2005).  

When there is no “stopping criterion,” a tree model can grow until each node contains identical observations regarding the values or levels of the dependent variable (Giudici, 2005).  This approach does not produce a parsimonious segmentation (Giudici, 2005).  Thus, it is critical to stop the growth of the tree at a reasonable dimension (Giudici, 2005).  The tree configuration becomes ideal when it is parsimonious and accurate (Giudici, 2005).  The parsimony attribute indicates that the tree has a small number of leaves, so that the predictive rule can be easily interpreted (Giudici, 2005).  The accuracy attribute indicates a large number of leaves which are pure to a maximum extent (Giudici, 2005).  There are two opposing techniques for the final choice which tree algorithms can employ.  The first technique uses stopping rules based on thresholds on the number of leaves or on the maximum number of steps in the process, whereas the other technique introduces probabilistic assumptions on the variables, allowing the use of suitable statistical tests (Giudici, 2005).  In the absence of probabilistic assumptions, the growth is stopped when the decrease in impurity is too small (Giudici, 2005).  The result of a tree model can be influenced by the choice of the stopping rule (Giudici, 2005). 

The CART method utilizes a strategy different from the stepwise stopping criteria; the method is based on the pruning concept.  The tree is first built to its greatest size, and it then gets “trimmed” or “pruned” according to a cost-complexity criterion (Giudici, 2005).  The concept of pruning is to optimally find a subtree that minimizes a loss function which, in the CART algorithm, depends on the total impurity of the tree and the tree complexity (Giudici, 2005).  The misclassification impurity is usually chosen for the pruning, although the other impurity measures can also be used.  The minimization of the loss function results in a compromise between choosing a complex model with low impurity but a high complexity cost and choosing a simple model with high impurity but a low complexity cost (Giudici, 2005).  The loss function is assessed by measuring the complexity of the model fitted on the training dataset, whose misclassification errors are measured in the validation dataset (Giudici, 2005).  This method partitions the training data into a subset for building the tree and then estimates the misclassification rate on the remaining validation subset (Hand et al., 2001).  A minimal R sketch of cost-complexity pruning follows.
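The following R sketch is a minimal illustration of cost-complexity pruning with rpart on the built-in iris data (an assumed example, not the authors' code): a large tree is grown and then pruned back using the complexity parameter with the lowest cross-validated error reported in the cptable.

  • ## Hedged sketch: cost-complexity pruning of an rpart tree
  • library(rpart)
  • set.seed(11)
  • big.tree <- rpart(Species ~ ., data = iris, method = "class",
  •                   control = rpart.control(cp = 0.001, minsplit = 2))  ## grow to near-maximum size
  • printcp(big.tree)                                   ## cross-validated error for each subtree
  • best.cp <- big.tree$cptable[which.min(big.tree$cptable[, "xerror"]), "CP"]
  • pruned  <- prune(big.tree, cp = best.cp)            ## trim back to the lowest-cost subtree
  • plot(pruned); text(pruned)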

CART has been widely used for several years in marketing applications and others (Hodeghatta & Nayak, 2016).  CART is described as a flexible model because the violation of constant variance, which is very critical in regression, is permissible in CART (Ledolter, 2013).  However, the biggest challenge in CART is the avoidance of overfitting (Ledolter, 2013).

Advantages and Disadvantages of the Trees

Decision trees for regression and classification have advantages and disadvantages.  Trees are regarded as easier to explain than linear regression and can be displayed graphically and interpreted easily (Cristina, 2010; Tibshirani, James, Witten, & Hastie, 2013).  Decision trees are self-explanatory and easy to understand even for non-technical users (Cristina, 2010; Tibshirani et al., 2013).  They can handle qualitative predictors without the need to create dummy variables (Tibshirani et al., 2013).  Decision trees are efficient, and complex alternatives can be expressed quickly and precisely.  A decision tree can easily be modified as new information becomes available.  Standard decision tree notation is easy to adopt (Cristina, 2010).  They can be used in conjunction with other management tools.  Decision trees can handle both nominal and numerical attributes (Cristina, 2010).  They are capable of handling datasets which may have errors or missing values.  Decision trees are considered a non-parametric method, which means that they make no assumption about the spatial distribution of the data or the classifier structure.  Their representations are rich enough to represent any discrete-value classifier.

However, trees have limitations as well.  They do not have the same level of predictive accuracy as some of the other regression and classification models (Tibshirani et al., 2013).  Most of the algorithms, like ID3 and C4.5, require that the target attribute have only discrete values.  Decision trees are over-sensitive to the training set, to irrelevant attributes, and to noise.  Decision trees tend to perform poorly if many complex interactions are present, and well if a few highly relevant attributes exist, as they use the “divide and conquer” method (Cristina, 2010).  Table 1 summarizes the advantages and disadvantages of trees.

Table 1.  Summary of the Advantages and Disadvantages of Trees.
Note:  Constructed by the researcher based on the literature.

Take an Umbrella Decision Tree Example:

  • If input field value < n
    • Then target = Y%
  • If input field value > n
    • Then target = X%

Figure 1.  Decision Tree for Taking an Umbrella

  • The decision depends on the weather, specifically on the predicted rain probability and on whether it is sunny or cloudy (a minimal R sketch of these rules follows the list).
  • The forecast predicts the rain probability, and the tree uses thresholds at 70% and 30%:
    • If the rain probability is >70%, take an umbrella; otherwise use the 30% threshold and the sky condition for further decisions.
    • If the rain probability is between 30% and 70% and it is cloudy, take an umbrella; otherwise, no umbrella.
    • If the rain probability is <30%, no umbrella.
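As a minimal R sketch of the umbrella rules above (an illustration only, with the 70% and 30% thresholds and the cloudy condition taken from the list):

  • ## Hedged sketch: the umbrella decision rules as a simple R function
  • take_umbrella <- function(rain_prob, cloudy) {
  •   if (rain_prob > 0.70) {
  •     "take an umbrella"                 ## high rain probability
  •   } else if (rain_prob > 0.30 && cloudy) {
  •     "take an umbrella"                 ## moderate rain probability and a cloudy sky
  •   } else {
  •     "no umbrella"                      ## low rain probability, or moderate but sunny
  •   }
  • }
  • take_umbrella(0.80, cloudy = FALSE)    ## "take an umbrella"
  • take_umbrella(0.40, cloudy = TRUE)     ## "take an umbrella"
  • take_umbrella(0.20, cloudy = TRUE)     ## "no umbrella"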

References

Cristina, P. (2010). Decision Trees. Retrieved from http://www.cs.ubbcluj.ro/~gabis/DocDiplome/DT/DecisionTrees.pdf.

EMC. (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. (1st ed.): Wiley.

Fischetti, T., Mayor, E., & Forte, R. M. (2017). R: Predictive Analysis: Packt Publishing.

Giudici, P. (2005). Applied data mining: statistical methods for business and industry: John Wiley & Sons.

Hand, D. J., Mannila, H., & Smyth, P. (2001). Principles of data mining.

Hodeghatta, U. R., & Nayak, U. (2016). Business Analytics Using R-A Practical Approach: Springer.

Ledolter, J. (2013). Data mining and business analytics with R: John Wiley & Sons.

Tibshirani, R., James, G., Witten, D., & Hastie, T. (2013). An introduction to statistical learning-with applications in R: New York, NY: Springer.

Quantitative Analysis of the “Flight-Delays” Dataset Using R-Programming

Dr. O. Aly
Computer Science

Abstract

The purpose of this project is to analyze the flight delays Dataset.  The project is divided into two main Parts.  Part-I evaluates and examines the Dataset for understanding using RStudio and involves five major tasks to review and understand the Dataset variables.  Part-II discusses the pre-data analysis, converting the Dataset to a Data Frame and involving three major tasks to analyze the Data Frame using logistic regression first, followed by the naïve Bayesian method.  The naïve Bayesian method used probabilities from a training set consisting of 60% randomly selected flights, and the remaining 40% of the 2201 flights served as the holdout period.  The misclassification proportion of the naïve Bayesian method is 19.52%, which is a little higher than that of the logistic regression.  The prediction correctly identifies 30 of the 167 delayed flights but fails to identify 137/(137+30), or 82%, of the delayed flights.  Moreover, 35/(35+679), or 4.9%, of the on-time flights are predicted as delayed, as illustrated in Task-2 of Part-II and Figure 19.

Keywords: Flight-Delays Dataset; Naïve Bayes Prediction Analysis Using R.

Introduction

This project examines and analyzes the Dataset (flight.delays.csv).  The Dataset was downloaded from the CTU course materials.  There were a couple of attempts to download the Dataset from the following link: https://www.transtats.bts.gov/.  However, the analysis could not continue with those files due to the size of the downloaded Datasets and the limited resources of the student’s machine.  Thus, this project utilized the version of flight.delays.csv which is provided in the course material.  The Dataset (flight.delays.csv) has 2201 observations on 14 variables.  The focus of this analysis is Naïve Bayes.  However, for a better understanding of the prediction and a comparison using two different models, the researcher also implemented Logistic Regression first, followed by the Naïve Bayesian approach, on the same Dataset of flight.delays.csv.  This project addresses two major Parts.  Part-I covers the following key Tasks to understand and examine the Dataset “flight.delays.csv.” 

  • Task-1:  Review the Variables of the Dataset.
  • Task-2:  Load and Understand the Dataset Using names(), head(), dim() Functions.
  • Task-3:  Examine the Dataset, Install the Required Packages, and Summary of the Descriptive Statistics.
  • Task-4:  Create Data Frame and Histogram of the Delay (Response)
  • Task-5:  Visualization of the Desired Variables Using Plot() Function.

Part-II covers the following three primary key Tasks to plot, discuss, and analyze the results.

  • Task-1:  Logistic Regression Model for Flight Delays Prediction
  • Task-2:  Naïve Bayesian Model for Flight Delays Prediction.
  • Task-3:  Discussion and Analysis.

Various resources were utilized to develop the required code using R. These resources include (Ahlemeyer-Stubbe & Coleman, 2014; Fischetti, Mayor, & Forte, 2017; Ledolter, 2013; r-project.org, 2018).

Part-I:  Understand and Examine the Dataset “flight.delays.csv”

Task-1:  Review the Variables of the Dataset

The purpose of this task is to understand the variables of the Dataset.  The Dataset is the “flight.delays” Dataset, which describes 2201 flights from Washington, DC into New York City and whether each flight was delayed by more than 15 minutes.  There are 14 variables.  Table 1 summarizes the selected variables for this project.  

Table 1:  Flight Delays Variables

Task-2:  Load and Understand the Dataset Using names(), head(), dim() Functions.

            The purpose of this task is to load and understand the Dataset using names(), head(), dim() function.  The task also displays the first three observations.

  • ## reading the data
  • fd <- read.csv("C:/CS871/Data/flight.delays.csv")
  • names(fd[1:5,])
  • head(fd)
  • dim(fd)
  • fd[1:3,]

Task-3:  Examine the Dataset, Install the Required Packages, and Summary of the Descriptive Statistics.

            The purpose of this task is to examine the data set and install the required package (car).  This task also displays the descriptive statistics for analysis.

  • ### set seed
  • set.seed(1)
  • ## Required library(car) to recode a variable
  • install.packages("car")
  • library(car)
  • summary(fd)
  • plot(fd, col="blue")

Figure 1. The plot of the Identified Variables for the Flight Delays Dataset.

Task-5:  Visualization of the Desired Variables Using Plot() Function.

            The purpose of this task is to visualize the selected variables using the Plot() Function for a good understanding of these variables and the current trend for each variable.

  • plot(fd$schedf, col="blue", main="Histogram of the Scheduled Time")
  • plot(fd$carrier, col="blue", main="Histogram of the Carrier")
  • plot(fd$dest, col="blue", main="Histogram of the Destination")
  • plot(fd$origin, col="blue", main="Histogram of the Origin")
  • plot(fd$weather, col="blue", main="Histogram of the Weather")
  • plot(fd$dayweek, col="blue", main="Histogram of the Day of Week")

Figure 2.  Histogram of the Schedule Time and Carrier.

Figure 3.  Histogram of the Destination and Origin.

Figure 4.  Histogram of the Weather and Day of Week.

Part-II:  Plot, Discuss and Analyze

 Task-1: Logistic Regression Model for Flight Delays

The purpose of this task is to first use the logistic regression model for predicting whether a flight will be on time or delayed by more than 15 minutes.  The Dataset consists of 2201 flights in 2004 from Washington, DC into New York City.  The response is whether or not a flight was delayed by more than 15 minutes, coded as 0=no delay and 1=delay by more than 15 minutes.  The explanatory variables include:

  • Three arrival airports (Kennedy, Newark, and LaGuardia).
  • Three different departure airports (Reagan, Dulles, and Baltimore).
  • Eight carriers.
  • A categorical variable for 16 different hours of departure (6:00 AM to 10:00 PM).
  • Weather conditions (0=good, 1=bad).
  • Day of week (1 for Sunday and Monday; and 0 for all other days).

The code of R is shown below for the logistic regression model.

  • ## Create a Data Frame and Understand the Dataset.
  • fd <- data.frame(fd)
  • names(fd)
  • head(fd)
  • fd[1:5,]
  • dim(fd)
  • summary(fd)
  • plot(fd, col="blue")
  • ## library car is needed to recode variables
  • library(car)
  • ## Define hours of departure
  • fd$sched=factor(floor(fd$schedtime/100))
  • table(fd$sched)
  • table(fd$carrier)
  • table(fd$dest)
  • table(fd$origin)
  • table(fd$weather)
  • table(fd$dayweek)
  • table(fd$daymonth)
  • table(fd$delay)
  • fd$delay=recode(fd$delay,"'delayed'=1;else=0")
  • fd$delay=as.numeric(levels(fd$delay)[fd$delay])
  • table(fd$delay)
  • ## Summary of the Major Variables
  • summary(fd$sched)
  • summary(fd$carrier)
  • summary(fd$dest)
  • summary(fd$origin)
  • summary(fd$weather)
  • summary(fd$dayweek)
  • summary(fd$daymonth)
  • summary(fd$delay)
  • ## Plots and Histograms of the Major Variables
  • plot(fd$sched, col="blue", main="Schedule Departure Time")
  • plot(fd$carrier, col="darkblue", main="Flight Carriers")
  • plot(fd$dest, col="darkred", main="Destination of Flights")
  • plot(fd$origin, col="green", main="Origin of Flights")
  • plot(fd$weather, col="darkgreen", main="Weather During Flight Days")
  • hist(fd$dayweek, col="darkblue", main="Flights Day of the Week", xlab="Day of Week")
  • hist(fd$daymonth, col="yellow", main="Flights Day of the Month")
  • plot(fd$delay, col="red", main="Plot of the Delay")
  • hist(fd$delay, col="red", main="Histogram of the Delay")
  • ## Dayweek: 1=Monday and 7=Sunday coded as 1, else 0.
  • fd$dayweek=recode(fd$dayweek,"c(1,7)=1;else=0")
  • table(fd$dayweek)
  • summary(fd$dayweek)
  • hist(fd$dayweek, col="darkblue", main="Flights Day of the Week", xlab="Day of Week")
  • ## Omit unused variables
  • fd=fd[,c(-1,-3,-5,-6,-7,-11,-12)]
  • fd[1:5,]
  • ## Create Sample Dataset
  • delay.length=length(fd$delay)
  • delay.length
  • delay.length1=floor(delay.length*(0.6))
  • delay.length1
  • delay.length2=delay.length-delay.length1
  • delay.length2
  • train=sample(1:delay.length, delay.length1)
  • train
  • plot(train, col="red")
  • ## Estimation of Logistic Regression Model
  • ## Explanatory Variables: carrier, destination, origin, weather, day of week
  • ## (weekday/weekend), scheduled hour of departure.
  • ## Create design matrix; indicators for categorical variables (factors)
  • Xfd <- model.matrix(delay~., data=fd)[,-1]
  • Xfd[1:5,]
  • xtrain <- Xfd[train,]
  • xtrain[1:2,]
  • xtest <- Xfd[-train,]
  • xtest[1:2,]
  • ytrain <- fd$delay[train]
  • ytrain[1:5]
  • ytest <- fd$delay[-train]
  • ytest[1:5]
  • model1 = glm(delay~., family=binomial, data=data.frame(delay=ytrain,xtrain))
  • summary(model1)
  • ## Prediction: predicted delay probabilities for cases in the test set
  • probability.test <- predict(model1, newdata=data.frame(xtest), type="response")
  • data.frame(ytest,probability.test)[1:10,]
  • ## The first column in the list represents the case number of the test element
  • plot(ytest~probability.test, col="blue")
  • ## Coding as 1 if the probability is 0.5 or larger
  • ### using the floor function
  • probability.fifty = floor(probability.test+0.5)
  • table.ytest = table(ytest,probability.fifty)
  • table.ytest
  • error = ((table.ytest[1,2]+table.ytest[2,1])/delay.length2)
  • error

Figure 5.  The probability of the Delay Using Logistic Regression.

Task-2:  Naïve Bayesian Model for Predicting Delays and Ontime Flights

The purpose of this task is to use the Naïve Bayesian model for predicting a categorical response from mostly categorical predictor variables.  The Dataset consists of 2201 flights in 2004 from Washington, DC into New York City.  The response is whether or not a flight was delayed by more than 15 minutes (0=no delay, 1=delay).  The explanatory variables include the following:

  • Three arrival airports (Kennedy, Newark, and LaGuardia).
  • Three different departure airports (Reagan, Dulles, and Baltimore).
  • Eight carriers.
  • A categorical variable for 16 different hours of departure (6:00 AM to 10:00 PM).
  • Weather conditions (0=good, 1=bad).
  • Day of week (7 days with Monday=1, …, Sunday=7).

The R code for the naïve Bayesian model is shown below, followed by the result of each step.

  • fd=data.frame(fd)
  • fd$schedf=factor(floor(fd$schedtime/100))
  • fd$delay=recode(fd$delay,"'delayed'=1;else=0")
  • response=as.numeric(levels(fd$delay)[fd$delay])
  • hist(response, col="orange")
  • fd.mean.response=mean(response)
  • fd.mean.response
  • ## Create Train Dataset 60/40
  • n=length(fd$dayweek)
  • n
  • n1=floor(n*(0.6))
  • n1
  • n2=n-n1
  • n2
  • train=sample(1:n,n1)
  • train
  • plot(train, col="blue", main="Train Data Plot")
  • ## Determine Marginal Probabilities
  • td=cbind(fd$schedf[train],fd$carrier[train],fd$dest[train],fd$origin[train],fd$weather[train],fd$dayweek[train],response[train])
  • td
  • tdtrain0=td[td[,7]<0.5,]
  • tdtrain1=td[td[,7]>0.5,]
  • tdtrain0[1:3]
  • tdtrain1[1:3]
  • plot(td, col="blue", main="Train Data")
  • plot(tdtrain0, col="blue", main="Marginal Probability < 0.5")
  • plot(tdtrain1, col="blue", main="Marginal Probability > 0.5")
  • ## Prior Probabilities for Delay: P(y=0) and P(y=1)
  • tdel=table(response[train])
  • tdel=tdel/sum(tdel)
  • tdel
  • ##    P(y=0)       P(y=1)
  • ## 0.8022727    0.1977273
  • ## Probabilities for Scheduled Time
  • ### Probabilities for (y=0) for Scheduled Time
  • ts0=table(tdtrain0[,1])
  • ts0=ts0/sum(ts0)
  • ts0
  • ### Probabilities for (y=1) for Scheduled Time
  • ts1=table(tdtrain1[,1])
  • ts1=ts1/sum(ts1)
  • ts1
  • ## Probabilities for Carrier
  • ### Probabilities for (y=0) for Carrier
  • tc0=table(tdtrain0[,2])
  • tc0=tc0/sum(tc0)
  • tc0
  • ### Probabilities for (y=1) for Carrier
  • tc1=table(tdtrain1[,2])
  • tc1=tc1/sum(tc1)
  • tc1
  • ## Probabilities for Destination
  • ### Probabilities for (y=0) for Destination
  • td0=table(tdtrain0[,3])
  • td0=td0/sum(td0)
  • td0
  • ### Probabilities for (y=1) for Destination
  • td1=table(tdtrain1[,3])
  • td1=td1/sum(td1)
  • td1
  • ## Probabilities for Origin
  • ### Probabilities for (y=0) for Origin
  • to0=table(tdtrain0[,4])
  • to0=to0/sum(to0)
  • to0
  • ### Probabilities for (y=1) for Origin
  • to1=table(tdtrain1[,4])
  • to1=to1/sum(to1)
  • to1
  • ## Probabilities for Weather
  • ### Probabilities for (y=0) for Weather
  • tw0=table(tdtrain0[,5])
  • tw0=tw0/sum(tw0)
  • tw0
  • ### Probabilities for (y=1) for Weather
  • tw1=table(tdtrain1[,5])
  • tw1=tw1/sum(tw1)
  • tw1
  • ## Band-aid as there is no observation in one weather cell for y=0
  • tw0=tw1
  • tw0[1]=1
  • tw0[2]=0
  • ## Probabilities for Day of Week
  • ### Probabilities for (y=0) for Day of Week
  • tdw0=table(tdtrain0[,6])
  • tdw0=tdw0/sum(tdw0)
  • tdw0
  • ### Probabilities for (y=1) for Day of Week
  • tdw1=table(tdtrain1[,6])
  • tdw1=tdw1/sum(tdw1)
  • tdw1
  • ### Create Test Data
  • testdata=cbind(fd$schedf[-train],fd$carrier[-train],fd$dest[-train],fd$origin[-train],fd$weather[-train],fd$dayweek[-train],response[-train])
  • testdata[1:3]
  • ## With these estimates, the following probability can be determined:
  • ## P(y=1 | Carrier=7, DOW=7, DepTime=9AM-10AM, Dest=LGA, Origin=DCA, Weather=0)
  • ##   = [(0.015)(0.172)(0.027)(0.402)(0.490)(0.920)](0.198) /
  • ##     {[(0.015)(0.172)(0.027)(0.402)(0.490)(0.920)](0.198) + [(0.015)(0.099)(0.059)(0.545)(0.653)(1)](0.802)}
  • ##   = 0.09
  • ## Creating Predictions, stored in prediction
  • p0=ts0[testdata[,1]]*tc0[testdata[,2]]*td0[testdata[,3]]*to0[testdata[,4]]*tw0[testdata[,5]+1]*tdw0[testdata[,6]]
  • p1=ts1[testdata[,1]]*tc1[testdata[,2]]*td1[testdata[,3]]*to1[testdata[,4]]*tw1[testdata[,5]+1]*tdw1[testdata[,6]]
  • prediction=(p1*tdel[2])/(p1*tdel[2]+p0*tdel[1])
  • hist(prediction, col="blue", main="Histogram of Predictions")
  • plot(response[-train], prediction, col="blue")
  • ### Coding as 1 if probability >= 0.5
  • ## Calculate the error for the 0.5 cutoff
  • prob1=floor(prediction+0.5)
  • tr=table(response[-train],prob1)
  • tr
  • error=(tr[1,2]+tr[2,1])/n2
  • error
  • ## Calculate the error for the second cutoff
  • prob2=floor(prediction+0.3)
  • tr2=table(response[-train],prob2)
  • tr2
  • error=(tr2[1,2]+tr2[2,1])/n2
  • error
  • ## Calculating the lift: cumulative successes sorted by predicted values vs. average success probability
  • ## (the sorted matrix bb1 and the average success probability xbar are defined here as they are needed below)
  • bb=cbind(prediction,response[-train])
  • bb1=bb[order(prediction,decreasing=TRUE),]
  • xbar=mean(response[-train])
  • ## cumulative 1's sorted by predicted values
  • ## cumulative 1's using the average success probability from the evaluation set
  • axis=dim(n2)
  • ax=dim(n2)
  • ay=dim(n2)
  • axis[1]=1
  • ax[1]=xbar
  • ay[1]=bb1[1,2]
  • for (i in 2:n2) {
  • axis[i]=i
  • ax[i]=xbar*i
  • ay[i]=ay[i-1]+bb1[i,2]
  • }
  • aaa=cbind(bb1[,1],bb1[,2],ay,ax)
  • aaa[1:100,]
  • plot(axis,ay,xlab="Number of Cases",ylab="Number of Successes",main="Lift: Cum Successes Sorted by Predicted Values vs. Average Success Probabilities", col="red")
  • points(axis,ax,type="l")

Figure 6. Pre and Post Factor and Level of History Categorical Variable.

Figure 7: Train Dataset Plot.

Figure 8.  Train Data, Marginal Probability of <0.5 and >0.5.

Figure 9.  Prior Probability for Delay: (y=0) and (y=1).

Figure 10.  Prior Probability for Scheduled Time: Left (y=0) and Right (y=1).

Figure 11.  Prior Probability for Carrier: Left (y=0) and Right (y=1).

Figure 12.  Prior Probability for Destination: Left (y=0) and Right (y=1).

Figure 13.  Prior Probability for Origin: Left (y=0) and Right (y=1).

Figure 14.  Prior Probability for Weather: Left (y=0) and Right (y=1).

Figure 15.  Prior Probability for Day of Week: Left (y=0) and Right (y=1).

Figure 16.  Test Data Plot.

Figure 17.  Histogram of the Prediction Using Bayesian Method.

Figure 18.  Plot of Prediction to the Response Using the Test Data.

Figure 19.  Probability Calculation for at least 0.5 or larger (left), and at least 0.3 or larger (right).

Figure 20.  Lift: Cum Success Sorted by Predicted Values Using Average Success Probabilities.

Task-3: Discussion and Analysis

            The descriptive analysis shows that the average scheduled time is 1372 (in hhmm coding), which is less than the median of 1455, indicating a negatively skewed distribution, while the average departure time is 1369, which is less than the median of 1450, confirming the negatively skewed distribution.  The result for the carrier shows that DH has the highest count of 551, followed by RU with 408.  The result for the destination shows that LGA has the highest count of 1150, followed by EWR with 665 and JFK with 386.  The result for the origin shows that DCA has the highest count of 1370, followed by IAD with 686 and BWI with 145.  The result shows that the weather is not the primary reason for the delays; only a few weather instances are related to the delays.  The descriptive analysis shows that on-time flights have the highest frequency of 1773, followed by delayed flights with a frequency of 428.  The average delay, or response, is 0.195.

The result shows that the success probability, which is the proportion of delayed flights in the training set, is 0.198, as analyzed in Task-2 of Part-II; the failure probability, which is the proportion of on-time flights, is 0.802, as discussed and analyzed in Task-2 of Part-II.  The naïve rule, which does not incorporate any covariate information, classifies every flight as being on-time, as the estimated unconditional probability of a flight being on-time, 0.802, is larger than the cutoff of 0.5.  Thus, this rule makes no error in predicting a flight which is on-time, but it makes a 100% error when the flight is delayed.  The naïve rule fails to identify the 167 delayed flights among the 881 flights of the evaluation Dataset, as shown in Task-1 of Part-II; its misclassification error rate in the holdout sample is 167/881 = 0.189.  The logistic regression reduces the overall misclassification error in the holdout (evaluation/test) Dataset to 0.176, which is a modest improvement over the naïve rule (0.189), as illustrated in Task-1 of Part-II.  The logistic regression correctly identifies 14 of the 167 delayed flights (8.4%), but it misses 153/167 delayed flights (91.6%).  Moreover, the logistic regression model predicts 2 of the 714 on-time flights as being delayed, as illustrated in Task-1 of Part-II.

The naïve Bayesian method uses probabilities estimated from a training set consisting of 60% randomly selected flights; the remaining 40% of the 2201 flights serve as the holdout period.  The misclassification proportion of the naïve Bayesian method is 19.52%, which is slightly higher than that of the logistic regression.  The prediction correctly identifies 30 of the 167 delayed flights but fails to identify the other 137, that is, 137/(137+30), or about 82%, of the delayed flights.  Moreover, 35 of the 714 on-time flights, that is, 35/(35+679), or 4.9%, are predicted as delayed, as illustrated in Task-2 of Part-II and Figure 19.

The lift chart (Figure 20) is constructed with the number of cases on the x-axis and the cumulative number of true-positive cases on the y-axis.  True positives are the observations that are classified correctly.  The chart measures the effectiveness of a classification model by comparing its true positives with those obtained without a model (Hodeghatta & Nayak, 2016).  It also indicates how well the model would perform if the samples were selected randomly from a population (Hodeghatta & Nayak, 2016), and it allows the performance of different models to be compared on a common set of cases (Hodeghatta & Nayak, 2016).  In Figure 20, the lift varies with the number of cases, and the black reference line shows what would be achieved by predicting positive cases without a model, providing a benchmark.  The lift curve in Figure 20 graphs the expected number of delayed flights, assuming that the probability of delay is estimated by the proportion of delayed flights in the evaluation sample, against the number of cases.  The reference line expresses the performance of the naïve model: with ten flights, for instance, the expected number of delayed flights is 10p, where p is the proportion of delayed flights in the evaluation sample, which is 0.189 in this case.  At the very end, the lift curve and the reference line meet.  However, at the beginning, the logistic regression leads to a "lift": for instance, when picking the 10 cases with the largest estimated success probabilities, all 10 turn out to be delayed.  If the lift curve is close to the reference line, there is not much point in using the estimated model for classification.  The overall misclassification rate of the logistic regression is not that different from that of the naïve strategy which considers all flights as being on-time; however, as the lift curve shows, the flights with the largest probabilities of being delayed are classified correctly, and the logistic regression is quite successful in identifying those flights as delayed.  The lift curve in Figure 20 therefore shows that the model gives an advantage in detecting the flights that are most clearly going to be delayed or on-time.
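
The following is a minimal sketch (not taken from the code above) of how a lift curve such as the one in Figure 20 can be constructed in R; it assumes that pred holds the predicted success probabilities and ynew holds the observed 0/1 delay indicators for the evaluation dataset.

  • ## minimal lift-curve sketch (assumes pred = predicted probabilities,
  • ## ynew = observed 0/1 outcomes for the evaluation set)
  • ord <- order(pred, decreasing=TRUE)          ## sort cases by predicted probability
  • cum.success <- cumsum(ynew[ord])             ## cumulative true positives
  • n <- length(ynew)
  • plot(1:n, cum.success, type="l", col="red", xlab="Number of Cases", ylab="Cumulative Successes", main="Lift: Cumulative Successes Sorted by Predicted Values")
  • lines(1:n, (1:n)*mean(ynew))                 ## reference line: naive model (average success rate)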

References

Ahlemeyer-Stubbe, A., & Coleman, S. (2014). A practical guide to data mining for business and industry: John Wiley & Sons.

Fischetti, T., Mayor, E., & Forte, R. M. (2017). R: Predictive Analysis: Packt Publishing.

Hodeghatta, U. R., & Nayak, U. (2016). Business Analytics Using R-A Practical Approach: Springer.

Ledolter, J. (2013). Data mining and business analytics with R: John Wiley & Sons.

r-project.org. (2018). R: A Language and Environment for Statistical Computing. Retrieved from https://cran.r-project.org/doc/manuals/r-release/fullrefman.pdf.

Quantitative Analysis of “German.Credit” Dataset Using K-NN Classification and Cross-Validation

Dr. O. Aly
Computer Science

Introduction

The purpose of this discussion is to use the german.credit.csv dataset to address the issue of lending that results in default.  The two outcomes are success (defaulting on the loan) and failure (not defaulting on the loan).  The explanatory variables in the logistic regression are the type of loan and the borrowing amount.  For the K-NN classification, three continuous variables are used: duration, amount, and installment.  Cross-validation with k=5 for the nearest-neighbor classifier is also used in this analysis.

The dataset is downloaded from the UCI Machine Learning Repository archive: https://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data).  The dataset has 1000 observations on 22 variables.  There are two versions of the German credit data.  The original dataset, in the form provided by Professor Hofmann, contains categorical/symbolic attributes and is in the german.credit.csv file used in this discussion.  The other dataset, "german.data-numeric," which was produced by Strathclyde University for algorithms that require numerical attributes, is not used here.  This discussion utilizes the original version of german.credit.csv, which has both the categorical and the continuous variables.

This analysis discusses and addresses the following fourteen tasks:

  • Task-1: Understand the Variables of the Dataset.
  • Task-2: Load and Review the Dataset using names(), head(), dim() functions.
  • Task-3: Pre and Post Factor and Level of Categorical Variables of the Dataset.
  • Task-4: Summary and Plot of the Continuous Variables: Duration, Amount, and Installment.
  • Task-5: Classify Amount into Groups.
  • Task-6: Summary of all selected variables.
  • Task-7: Select and Plot Specific Variables for this analysis.
  • Task-8: Create Design Matrix.
  • Task-9: Create Training and Prediction Datasets.
  • Task-10: Implement K-Nearest Neighbor Method.
  • Task-11: Calculate the Proportion of Correct Classification.
  • Task-12: Plot for 3 Nearest Neighbors.
  • Task-13: Cross-Validation with k=5 for the Nearest Neighbor.
  • Task-14: Discussion and Analysis.

Various resources were utilized to develop the required R code, including Ahlemeyer-Stubbe and Coleman (2014), Fischetti, Mayor, and Forte (2017), Ledolter (2013), and r-project.org (2018).

Task-1:  Understand the Variables of the Data Sets

The purpose of this task is to understand the variables of the german.credit dataset, which describes clients who may default on a loan.  A subset of the 22 variables is selected as the target of this analysis.  Table 1 and Table 2 summarize the selected variables: Table 1 covers the variables with binary and numerical values, while Table 2 covers the variables with categorical values.

Table 1:  Binary and Numerical Variables

Table 2: Categorical Variables.

Task-2:  Load and Review the Dataset using names(), head(), dim() functions

  • gc <- read.csv("C:/CS871/german.credit.csv")
  • names(gc)
  • head(gc)
  • dim(gc)
  • gc[1:3,]

Task-3:  Pre and Post Factor and Level of Categorical Variables of the Data Sets

  • ## history categorical variable pre and post factor and level.
  • summary(gc$history)
  • plot(gc$history, col="green", xlab="History Categorical Variable Pre Factor and Level")
  • gc$history = factor(gc$history, levels=c("A30", "A31", "A32", "A33", "A34"))
  • levels(gc$history) = c("good-others", "good-thisBank", "current-paid-duly", "bad-delayed", "critical")
  • summary(gc$history)
  • plot(gc$history, col="green", xlab="History Categorical Variable Post Factor and Level")
  • ##### purpose pre and post factor and level
  • summary(gc$purpose)
  • plot(gc$purpose, col="darkgreen")
  • ### transform purpose (ten level codes, so ten labels are supplied)
  • gc$purpose <- factor(gc$purpose, levels=c("A40","A41","A42","A43","A44","A45","A46","A48","A49","A410"))
  • levels(gc$purpose) <- c("newcar","usedcar","furniture/equipment","radio/television","domestic appliances","repairs","edu","retraining","business","others")
  • summary(gc$purpose)
  • plot(gc$purpose, col="darkgreen")

Figure 1.  Example of Pre and Post Factor of Purpose as a Categorical Variable.

Task-4:  Summary & Plot the Numerical Variables: Duration, Amount, Installment

  • ## summary and plot of the numerical variables
  • summary(gc$duration)
  • summary(gc$amount)
  • plot(gc$amount, col="blue", main="Amount Numerical Variable")
  • summary(gc$installment)
  • plot(gc$installment, col="blue", main="Installment Numerical Variable")

Figure 2.  Duration, Amount, and Installment Continuous Independent Variables.

Task-5:  Classify the Amount into Groups

  • #### To classify the amount into groups
  • gc$amount <- as.factor(ifelse(gc$amount <= 2500, '0-2500', ifelse(gc$amount <= 5000, '2501-5000', '5000+')))
  • summary(gc$amount)

Task-6:  Summary of all variables

  • summary(gc$duration)
  • summary(gc$amount)
  • summary(gc$installment)
  • summary(gc$age)
  • summary(gc$history)
  • summary(gc$purpose)
  • summary(gc$housing)
  • summary(gc$rent)   ## rent is derived from housing in Task-7 below

Task-7:  Select and Plot specific variables for this discussion

  • ## cut the dataset down to the selected variables:
  • ## (duration, amount, installment, and age) which are numeric,
  • ## (history, purpose, foreign, and housing) which are categorical, and
  • ## Default (representing the risk) which is binary.
  • gc.sv <- gc[,c("Default", "duration", "amount", "installment", "age", "history", "purpose", "foreign", "housing")]
  • gc.sv[1:3,]
  • summary(gc.sv)
  • ### Setting the rent indicator from housing
  • gc$rent <- factor(gc$housing=="A151")
  • summary(gc$rent)
  • plot(gc.sv, col="blue")   ## scatterplot matrix of the selected variables (Figure 3)

Figure 3.  Plot of the Selected Variables.

Task-8:  Create Design Matrix

  • ###Create a Design Matrix
  • ##Factor variables are turned into indicator variables
  • ## The first column of ones is omitted
  • Xgc <- model.matrix(Default~.,data=gc)[,-1]
  • Xgc[1:3,]

Task-9: Create Training and Prediction Datasets

  • ## creating training and prediction datasets
  • ## select 900 rows for estimation and 100 for testing
  • set.seed(1)
  • train <- sample(1:1000,900)
  • xtrain <- Xgc[train,]
  • xnew <- Xgc[-train,]
  • ytrain <- gc$Default[train]
  • ynew <- gc$Default[-train]

Task-10:  K-Nearest Neighbor Method

  • ## k-nearest neighbor method
  • library(class)
  • nearest1 <- knn(train=xtrain, test=xnew, cl=ytrain, k=1)
  • nearest3 <- knn(train=xtrain, test=xnew, cl=ytrain, k=3)
  • data.frame(ynew,nearest1,nearest3)[1:10,]

Task-11: Calculate the Proportion of Correct Classification

  • ## calculate the percentage of correct classifications
  • ## (the test set has 100 cases, so sum(...)/100 is the proportion correct)
  • proportion.correct.class1=100*sum(ynew==nearest1)/100
  • proportion.correct.class3=100*sum(ynew==nearest3)/100
  • proportion.correct.class1
  • proportion.correct.class3

Task-12: Plot for 3 Nearest Neighbors

  • ## plot for 3NN
  • plot(xtrain[,c("amount","duration")],
  • col=c(4,3,6,2)[gc[train,"installment"]],
  • pch=c(1,2)[as.numeric(ytrain)],
  • main="Predicted Default, by 3 Nearest Neighbors", xlab="Amount", ylab="Duration", cex.main=.95)
  • points(xnew[,c("amount","duration")],
  • bg=c(4,3,6,2)[gc[-train,"installment"]],   ## test rows, so index with -train
  • pch=c(21,24)[as.numeric(nearest3)], cex=1.2, col=grey(.7))
  • legend("bottomright", pch=c(1,16,2,17), bg=c(1,1,1,1),
  • legend=c("data 0","pred 0","data 1","pred 1"),
  • title="default", bty="n", cex=.8)
  • legend("topleft", fill=c(4,3,6,2), legend=c(1,2,3,4),
  • title="installment %", horiz=TRUE, bty="n", col=grey(.7), cex=.8)

Figure 4.  Predicted Default by 3 Nearest Neighbors.

Task-13: Cross-Validation with k=5 for the nearest neighbor

  • ## The above was for just one training set
  • ## Leave-one-out cross-validation, varying the number of neighbors k from 1 to 10
  • proportion.corr=dim(10)
  • for (k in 1:10) {
  • prediction=knn.cv(Xgc, cl=gc$Default, k)
  • proportion.corr[k]=100*sum(gc$Default==prediction)/1000
  • }
  • proportion.corr

Task-14: Discussion and Analysis

The descriptive analysis shows that the average duration (Mean=20.9) is higher than the median (Median=18), indicating a positively skewed distribution.  The average amount (Mean=3271) is higher than the median (Median=2320), also indicating a positively skewed distribution.  The average installment rate (Mean=2.97) is slightly less than the median (Median=3.00), indicating a slightly negatively skewed distribution.  The results also show that radio/TV ranks first as a loan purpose, followed by new cars.  A training dataset is created by selecting 900 rows for estimation and 100 for testing.  The K-NN method is applied with k=1 and k=3, and the proportion of correct classifications is 60% and 61% for k=1 and k=3, respectively (Figure 4).  Cross-validation with k=5 nearest neighbors correctly classifies about 65% of the outcomes.

References

Ahlemeyer-Stubbe, A., & Coleman, S. (2014). A practical guide to data mining for business and industry: John Wiley & Sons.

Fischetti, T., Mayor, E., & Forte, R. M. (2017). R: Predictive Analysis: Packt Publishing.

Ledolter, J. (2013). Data mining and business analytics with R: John Wiley & Sons.

r-project.org. (2018). R: A Language and Environment for Statistical Computing. Retrieved from https://cran.r-project.org/doc/manuals/r-release/fullrefman.pdf.


Examples of Bayesian Analysis in the Context of Social Media

Dr. O. Aly
Computer Science

Introduction

The purpose of this discussion is to provide examples of how Bayesian analysis can be used in the context of social media, and to summarize work that has applied Bayesian analysis in this context.  The discussion begins with Naïve Bayes Classifiers for Mining, followed by the Importance of Social Media, Social Media Mining, Social Media Mining Techniques, the Social Media Mining Process, the Data Modelling Step, and Twitter Mining Using Naïve Bayes with R.  The discussion ends with additional examples of Bayesian analysis methods in social media.

Naïve Bayes Classifiers for Mining

Naïve Bayes classifiers are probabilistic classifiers built using Bayes' Theorem (Kumar & Paul, 2016).  Naïve Bayes is also described as a prior-probability and class-conditional-probability classifier, since it combines prior probabilities with class-conditional probabilities of the features to generate a posterior probability distribution (Kumar & Paul, 2016).

The Naïve Bayes classifier makes the following assumptions about the data (Kumar & Paul, 2016):

  • All the features in the dataset are independent of each other.
  • All the features are important.

Though these assumptions may not be accurate in a real-world scenario, Naïve Bayes is still used in many text classification applications, such as (Kumar & Paul, 2016):

  • Spam filtering for email applications
  • Social media mining, such as finding the sentiments in a given text.
  • Computer network security applications.

As indicated by Kumar and Paul (2016), the classifier has various strengths, such as the following (a minimal R sketch is given after this list):

  • Naïve Bayes classifiers are highly scalable and need fewer computation cycles than more advanced and sophisticated classifiers.
  • A vast number of features can be taken into consideration.
  • Naïve Bayes classifiers work well when there is missing data and when the dimensionality of the inputs is high.
  • Naïve Bayes needs only a small amount of training data.
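
As a minimal illustration of these points (a sketch, not taken from Kumar & Paul, 2016), the following R code fits a Naïve Bayes classifier with the e1071 package on R's built-in iris data; the package choice, the random seed, and the 100-row training split are assumptions made only for this example.

  • ## minimal Naive Bayes sketch using the e1071 package (assumed to be installed)
  • library(e1071)
  • set.seed(1)
  • idx <- sample(1:nrow(iris), 100)                   ## 100 rows for training
  • nb <- naiveBayes(Species ~ ., data=iris[idx,])     ## class priors plus per-feature class-conditional distributions
  • pred <- predict(nb, iris[-idx, -5])                ## posterior-based class predictions for the holdout rows
  • table(pred, iris$Species[-idx])                    ## confusion matrix on the holdout rows

The final line produces a confusion matrix for the held-out rows, summarizing how well the conditional-independence assumption performs in practice.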

The Importance of Social Media

Traditional media such as radio, newspapers, and television facilitate one-way communication with a limited scope of reach and usability.  Although the audience can interact with channels such as radio, the quality and frequency of such communications are limited (Ravindran & Garg, 2015).  In contrast, Internet-based social media offers multi-way communication with features such as immediacy and permanence (Ravindran & Garg, 2015).

Social media is an approach to communication using online tools such as Twitter, Facebook, LinkedIn, and so forth (Ravindran & Garg, 2015).  Social Media is defined by Andreas Kaplan and Michael Haenlein as cited in (Ravindran & Garg, 2015) as follows:  “A group of Internet-based applications that build on the ideological and technological foundations of Web 2.0 and that allow the creation and exchange of user-generated content.”

Social media spans various Internet-based platforms that facilitate different forms of human interaction and expression, such as:

  • Networking, such as Facebook, LinkedIn.
  • Microblogging, such as Twitter, Tumblr.  
  • Photo sharing, such as Instagram, Flickr.
  • Video sharing, such as YouTube, Vimeo.
  • Stack exchanging, such as Stack Overflow, Github.
  • Instant messaging, such as Whatsapp, Hike.

The marketing industry is maturing in its understanding of the promise and impact of social media (Ravindran & Garg, 2015).  While social media is regarded as a great tool for banner advertising in terms of cost and reach, it can turn out to be even more influential in the long term (Ravindran & Garg, 2015).  Organizations need to learn about the opinions of consumers by mining social networks (Ravindran & Garg, 2015).  They can understand the current and potential outlook of consumers by collecting information on their opinions, and such information can guide business decisions that, in the long run, influence the fate of any business (Ravindran & Garg, 2015).

Social Media Mining

Social media mining is the systematic analysis of the information generated from social media.  The set of tools and techniques used to mine such information are collectively called data mining techniques; in the context of social media, they are referred to as Social Media Mining (SMM) (Ravindran & Garg, 2015).  There has been much research in multiple disciplines of social media, such as modeling behavior, predictive analysis, and recommending content (Ravindran & Garg, 2015).

Social Media Mining (SMM) Techniques

Graph mining is one SMM technique.  It is described as "the process of extracting useful knowledge (patterns, outliers and so on) from a social relationship between the community members [that] can be represented as a graph" (Ravindran & Garg, 2015).  The most prominent example of graph mining is Facebook Graph Search (Ravindran & Garg, 2015).  Text mining is another SMM technique, which involves the extraction of meaning from the unstructured text data presented in social media.  The primary targets of this type of mining are blogs and microblogs such as Twitter (Ravindran & Garg, 2015).

Social Media Mining Process

The process of social media mining includes the following five steps (Ravindran & Garg, 2015):

  1. Getting authentication from the social website.
  2. Data Visualization.
  3. Cleaning and pre-processing.
  4. Data modeling using standard algorithms such as opinion mining, clustering, anomaly/spam detection, correlations and segmentation, recommendations.
  5. Result Visualization.

Data Modelling Step

This step is the fourth in the social media mining process and involves the application of mining algorithms (Ravindran & Garg, 2015).  Standard mining algorithms include opinion mining, or sentiment mining, where the opinion or sentiment present in a given phrase is assessed (Ravindran & Garg, 2015).  Although the classification of sentiments is not a simple task, various classification algorithms have been employed to aid opinion mining.  These algorithms range from simple probabilistic classifiers such as Naïve Bayes, which assumes that all features are independent and does not use any prior information, to more advanced classifiers such as Maximum Entropy, which uses prior information to a certain extent (Ravindran & Garg, 2015).  Other classifiers that have been used to classify sentiments correctly include Support Vector Machines (SVM) and Neural Networks (NN) (Ravindran & Garg, 2015).  Additional methods include anomaly/spam detection, or social spammer detection (Ravindran & Garg, 2015).  Fake profiles created with malicious intentions are known as spam or anomalous profiles (Ravindran & Garg, 2015).

Twitter Mining Using Naïve Bayes with R

In this application, after the Twitter data has been cleaned, R packages are used to assess the sentiments in the tweets.  The first step is to obtain authorization, for example using the following calls (r-project.org, 2018; Ravindran & Garg, 2015):

  • getTwitterOAuth(consumer_key, consumer_secret)
  • registerTwitterOAuth(OAuth)
  • source("authenticate.R")

Tweets are then collected as a corpus using the searchTwitter() function in R.  After the Twitter data has been cleaned, a few R packages are used to assess the sentiments in the tweets.  A naïve scoring approach first gives each sentence a score based on the number of times a positive or negative word occurs in it (Ravindran & Garg, 2015).  To estimate the sentiment further, Naïve Bayes is used to decide on the emotion present in any tweet.  The Naïve Bayes method requires the R packages Rstem and sentiment to assist with this assessment.  These packages have been removed from the CRAN repository; however, they can still be downloaded from the archive at https://cran.r-project.org/src/contrib/Archive/.  The sentiment package provides functions such as classify_emotion().  An example of the result of applying the Naïve Bayes method in R is illustrated in Figure 1 (Ravindran & Garg, 2015).

Figure 1. Example of the Result When Applying Naïve Bayes Method in R (Ravindran & Garg, 2015).
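
For concreteness, the following is a minimal sketch of this workflow; it assumes that Twitter authentication has already been registered as shown above, that the archived Rstem and sentiment packages have been installed from the CRAN archive, and that the search term "#bigdata" and the sample size are placeholders only.

  • ## minimal sketch of the Twitter sentiment workflow (assumes prior OAuth setup
  • ## and that the archived Rstem/sentiment packages are installed)
  • library(twitteR)
  • library(sentiment)
  • tweets <- searchTwitter("#bigdata", n=100)           ## collect tweets as a corpus
  • txt <- sapply(tweets, function(t) t$getText())       ## extract the raw text
  • txt <- gsub("[^[:alnum:][:space:]#@]", "", txt)      ## crude cleaning of special symbols
  • emo <- classify_emotion(txt, algorithm="bayes")      ## Naive Bayes emotion classification
  • table(emo[, "BEST_FIT"], useNA="ifany")              ## distribution of detected emotions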

In summary, the Naïve Bayes method can be used on social media data, as in the Twitter example above.  Certain R packages must be installed to complete the Naïve Bayes analysis on a Twitter dataset.

Additional Examples of Bayesian Analysis methods in Social Media

Another application of the Naïve Bayes method to social media is reported by Singh, Singh, and Singh (2017).  The Naïve Bayes classifier, a popular supervised classifier, provides a way to label web text as expressing positive, negative, or neutral feelings.  It utilizes conditional probability to classify words into their respective categories (Singh et al., 2017).  The benefit of using Naïve Bayes for text classification is that it needs only a small dataset for training (Singh et al., 2017).  The raw data from the web undergoes pre-processing, including the removal of numerals, foreign words, HTML tags, and special symbols, yielding a set of words (Singh et al., 2017).  The tagging of words with positive, negative, and neutral labels is performed manually by human experts (Singh et al., 2017).  This pre-processing produces the word-category pairs for the training set (Singh et al., 2017).

The work of Singh et al. (2017) focused on four text classifiers used for sentiment analysis: the Naïve Bayes, J48, BFTree, and OneR algorithms.  Naïve Bayes was found to be quite fast in learning, whereas the OneR method was more promising, achieving 91.3% precision, 97% F-measure, and 92.34% correctly classified instances (Singh et al., 2017).

Another example of the application of Bayesian analysis is reported in the research paper of Volkova and Van Durme (2015), who proposed two approaches to mining streaming social media.  They studied iterative, incremental retraining in batch and online settings, with and without iterative annotations.  They treated each new message as independent evidence that is combined into an incremental user-prediction model by applying Bayes' rule, and they explored model training in parallel with its application rather than assuming a previously existing labeled dataset.  Bayes' rule is applied to dynamically revise posterior probability estimates of the attribute value in question (Volkova & Van Durme, 2015).
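
As a simple illustration of this kind of incremental updating (a sketch, not the authors' implementation), the following R code applies Bayes' rule message by message to revise the posterior belief about a binary user attribute; the prior and the per-message likelihoods are made-up values.

  • ## minimal sketch of incremental Bayes updating for a binary user attribute
  • ## (illustrative only; the likelihoods below are made-up values)
  • prior <- c(yes=0.5, no=0.5)                      ## initial belief about the attribute
  • lik <- list(c(yes=0.7, no=0.4), c(yes=0.6, no=0.5), c(yes=0.8, no=0.3))   ## assumed p(evidence | attribute) per message
  • posterior <- prior
  • for (m in lik) {
  •   posterior <- posterior * m                     ## Bayes rule: prior times likelihood
  •   posterior <- posterior / sum(posterior)        ## normalize to a probability
  • }
  • posterior                                        ## revised belief after three messages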

References

Kumar, A., & Paul, A. (2016). Mastering Text Mining with R: Packt Publishing Ltd.

r-project.org. (2018). Package 'twitteR'. Retrieved from https://cran.r-project.org/web/packages/twitteR/twitteR.pdf.

Ravindran, S. K., & Garg, V. (2015). Mastering social media mining with R: Packt Publishing Ltd.

Singh, J., Singh, G., & Singh, R. (2017). Optimization of sentiment analysis using machine learning classifiers. Human-centric Computing and Information Sciences, 7(1), 32.

Volkova, S., & Van Durme, B. (2015). Online Bayesian Models for Personal Analytics in Social Media.

Bayesian Analysis

Dr. Aly, O.
Computer Science

Introduction

The purpose of this discussion is to analyze Bayesian analysis and the reasons for using it when faced with uncertainty in making decisions.  The discussion also addresses the assumptions of Bayesian analysis, cases that call for Bayesian analysis, and problems with the Bayesian approach.

Probability Theory and Probability Calculus

Various words and terms are used to describe uncertainty and related concepts, such as probability, chance, randomness, luck, hazard, and fate (Hand, Mannila, & Smyth, 2001).  Modeling uncertainty is a required component of almost all data analysis (Hand et al., 2001).  There are various reasons for such uncertainty.  One reason is that the data may be only a sample from the population under study, so the uncertainty concerns the extent to which different samples differ from each other and from the overall population (Hand et al., 2001).  Another reason is that a prediction about tomorrow is based on today's data, so the conclusions are subject to uncertainty about what the future will bring (Hand et al., 2001).  A further reason is ignorance: some values cannot be observed, and conclusions must be based on a "best guess" about them (Hand et al., 2001).

There are two distinct aspects of probability: probability theory and the probability calculus (Hand et al., 2001).  Probability theory is concerned with the interpretation of probability, while the probability calculus is concerned with the manipulation of the mathematical representation of probability (Hand et al., 2001).

Probability measures the likelihood that a particular event will occur.  Mathematicians refer to a set of potential outcomes of an experiment or trial to which a probability of occurrence can be assigned (Fischetti, Mayor, & Forte, 2017).  Probabilities are expressed as a number between 0 and 1 or as a percentage out of 100 (Fischetti et al., 2017).  An event with a probability of 0 denotes an impossible outcome, and a probability of 1 describes an event that is certain to occur (Fischetti et al., 2017).  An example is a coin flip, where there are two outcomes: heads or tails.  Since these two outcomes cover the entire sample space, they are said to be collectively exhaustive, and they are mutually exclusive, meaning that they can never occur together (Fischetti et al., 2017).  Thus, the probability of obtaining heads is P(heads)=0.50 and of obtaining tails is P(tails)=0.50.  Moreover, when the probability of one outcome does not affect the probability of the other, the events are described as independent (Fischetti et al., 2017).  For independent events A and B, the probability that both occur is the product of the probability of A and the probability of B; for example, the probability of two heads in two fair coin flips is 0.50 × 0.50 = 0.25 (Fischetti et al., 2017).

Subjective vs. Objective Probability Views

In the first half of the 20th century, the dominant approach in statistics was the frequentist approach (Hand et al., 2001; O'Hagan, 2004).  The frequentist view takes the perspective that probability is an "objective" concept, where the probability of an event is defined as the limiting proportion of times that the event would occur in repetitions of a substantially identical situation (Hand et al., 2001).  The coin flip mentioned above is an example of this frequentist notion of probability.

In the second half of the 20th century, the dominance of the frequentist view started to fade (Hand et al., 2001; O'Hagan, 2004).  Although the vast majority of statistical analysis in practice is still frequentist, a competing view of "subjective probability" has acquired increasing importance (Hand et al., 2001; O'Hagan, 2004).  The principles and methodologies for data analysis derived from the subjective view are often referred to as "Bayesian" statistics (Fischetti et al., 2017; Hand et al., 2001; O'Hagan, 2004).  The calculus is the same under the two viewpoints, even though the underlying interpretations are entirely different (Hand et al., 2001).

Bayesian Theorem

Bayesian statistics is named after the 18th-century minister Thomas Bayes, whose paper presented to the Royal Society in 1763 first used the Bayesian argument (O'Hagan, 2004).  Although the Bayesian approach can be traced back to Thomas Bayes, its modern incarnation began only in the 1950s and 1960s (O'Hagan, 2004).

The central tenet of Bayesian statistics is the explicit characterization of all forms of uncertainty in a data analysis problem, including uncertainty about any parameters estimated from the data, uncertainty about which among a set of model structures is best or closest to the "truth," uncertainty in any forecast, and so forth (Hand et al., 2001).  Because the Bayesian interpretation is subjective, when evidence is scarce there can be wildly different degrees of belief among different people (Fischetti et al., 2017; Hand et al., 2001; O'Hagan, 2004).

From the perspective of subjective probability, the Bayesian interpretation views probability as the degree of belief in a claim or hypothesis (Fischetti et al., 2017; Hand et al., 2001).  Bayesian inference provides a method to update that belief in the light of new evidence (Fischetti et al., 2017).  Bayes' Theorem is defined in equation (1) (Fischetti et al., 2017; Hand et al., 2001; O'Hagan, 2004):

P(H|E) = P(E|H) P(H) / P(E)                (1)

Where:

  • H is the hypothesis.
  • E is the evidence.
  • P(H|E) is the posterior probability of the hypothesis given the evidence, P(H) is the prior probability of the hypothesis, P(E|H) is the likelihood of the evidence under the hypothesis, and P(E) is the marginal probability of the evidence.

Bayesian Continuous Posterior Distribution

When working with Bayesian analysis, the hypothesis often concerns a continuous parameter or several parameters (Fischetti et al., 2017).  Bayesian analysis therefore usually yields a continuous posterior, called a "posterior distribution" (Fischetti et al., 2017).  Bayesian methods utilize Bayes' rule, which provides a powerful framework for combining sample information with prior expert opinion to produce an updated, or posterior, expert opinion (Giudici, 2005).  In the Bayesian analysis, a parameter is treated as a random variable whose uncertainty is modeled by a probability distribution (Giudici, 2005).  This distribution is the expert's prior distribution p(q), stated in the absence of the sampled data (Giudici, 2005).  The likelihood is the distribution of the sample, conditional on the values of the random variable q: p(x|q) (Giudici, 2005).  Bayes' rule provides an algorithm to update the expert's opinion in the light of the data, producing the so-called posterior distribution p(q|x), as shown in equation (2) below, where c = p(x) is a constant that does not depend on the unknown parameter q (Giudici, 2005).

p(q|x) = p(x|q) p(q) / c                (2)

The posterior distribution is the main Bayesian inferential tool (Giudici, 2005).  Once it is obtained, it is easy to derive any inference of interest (Giudici, 2005).  For instance, to obtain a point estimate, a summary of the posterior distribution such as the mean or the mode is taken (Giudici, 2005).  Similarly, confidence intervals can easily be derived by taking any two values of q such that the probability of q belonging to the interval described by those two values corresponds to the given confidence level (Giudici, 2005).  Because q is a random variable, it is now correct to interpret the confidence level as a probabilistic statement: (1 - α) is the coverage probability of the interval, namely the probability that q assumes values in the interval (Giudici, 2005).  The Bayesian approach is thus a coherent and flexible procedure (Giudici, 2005).
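
As a small worked example of these ideas (a sketch, not drawn from Giudici, 2005), the following R code uses a conjugate Beta prior with a binomial likelihood to obtain the posterior distribution, point estimates, and a 95% credible interval; the prior parameters and the data are assumed values chosen only for illustration.

  • ## minimal conjugate Beta-Binomial example of a posterior distribution
  • ## prior: Beta(2, 2); data: 7 successes in 20 trials (assumed values)
  • a0 <- 2; b0 <- 2; x <- 7; n <- 20
  • a1 <- a0 + x; b1 <- b0 + n - x                   ## posterior is Beta(a1, b1)
  • post.mean <- a1/(a1+b1)                          ## point estimate (posterior mean)
  • post.mode <- (a1-1)/(a1+b1-2)                    ## point estimate (posterior mode)
  • cred.int <- qbeta(c(0.025, 0.975), a1, b1)       ## 95% interval from the posterior
  • c(mean=post.mean, mode=post.mode)
  • cred.int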

Special Case of Bayesian Estimates: The Maximum Likelihood Estimator (MLE)

The maximum likelihood estimator (MLE) is a special case of a Bayesian estimate: when a constant (flat) distribution expressing a vague state of prior knowledge is assumed as the prior for q, the posterior mode is equal to the MLE (Giudici, 2005).  More generally, when a large sample is considered, the Bayesian posterior distribution approaches an asymptotic normal distribution with the MLE as its expected value (Giudici, 2005).

Bayesian Statistics

In Bayesian statistics, there is a four-step process (O'Hagan, 2004).  The first step is to create a statistical model linking the data to the parameters.  The second step is to formulate prior information about the parameters.  The third step is to combine the two sources of information using Bayes' theorem.  The last step is to use the resulting posterior distribution to derive inferences about the parameters (O'Hagan, 2004).  Figure 1 illustrates the synthesis of the information by Bayes' theorem.

Figure 1.  Synthesis of the Information by Bayes’ Theorem (O’Hagan, 2004).

Figure 2 illustrates Bayes' Theorem using a "triplot," in which the prior distribution, the likelihood, and the posterior distribution are all plotted on the same graph.  The prior information is represented by the dashed line lying, in this example, between -4 and +4.  The dotted line represents the likelihood from the data, which favors values of the parameter between 0 and 3 and strongly argues against any value below -2 or above +4 (O'Hagan, 2004).  The posterior, represented by the solid line, puts these two sources of information together (O'Hagan, 2004).  Thus, for values below -2, the posterior density is minimal because the data say that these values are highly implausible, while values above +4 are ruled out by the prior (O'Hagan, 2004).  While the data favor values around 1.5 and the prior prefers values around 0, the posterior listens to both; the synthesis is a compromise, and the parameter is most likely to be around 1 (O'Hagan, 2004).

Figure 2.  Triplot. Prior Density (dashed), Likelihood (dotted), and Posterior Density (solid) (O’Hagan, 2004).

Bayes Assumption, Naïve Bayes, and Bayes Classifier

Any joint distribution can be simplified by making appropriate independence assumptions, essentially approximating a full table of probabilities by products of much smaller tables (Hand et al., 2001).  At an extreme, an assumption can be made that all the variables are conditionally independent given the class.  This assumption is sometimes referred to as the Naïve Bayes or first-order Bayes assumption (Alexander & Wang, 2017; Baştanlar & Özuysal, 2014; Hand et al., 2001; Suguna, Sakthi Sakunthala, Sanjana, & Sanjhana, 2017).  The conditional independence model is linear in the number of variables p rather than exponential (Hand et al., 2001).  To use the model for classification, the product form is simply used for the class-conditional distributions, yielding the Naïve Bayes classifier (Hand et al., 2001).  The reduction in the number of parameters by using the Naïve Bayes model comes at a cost, because an extreme independence assumption is made (Hand et al., 2001).  In some cases the conditional independence assumption is quite reasonable, but in many practical cases it may not be realistic (Hand et al., 2001).  Even when the independence assumption is not a realistic model of the probabilities involved, it may still permit relatively accurate classification performance (Hand et al., 2001).
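
In symbols, with class c and features x1, …, xp, the Naïve Bayes assumption replaces the full class-conditional distribution p(x1, …, xp | c) by the product p(x1 | c) p(x2 | c) … p(xp | c), so the resulting classifier assigns an observation to the class c that maximizes p(c) p(x1 | c) … p(xp | c).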

The Naïve Bayes model can easily be generalized in many different directions (Hand et al., 2001).  The simplicity, parsimony, and interpretability of the Naïve Bayes model have led to its widespread popularity, particularly in the machine learning literature (Hand et al., 2001).  The model can be generalized equally well by including some but not all dependencies beyond first-order (Hand et al., 2001).  However, the conventional wisdom in practice is that such additions to the model often provide only limited improvements in classification performance on many data sets, underscoring the difference between building accurate density estimators and building good classifiers (Hand et al., 2001).

Markov Chain Monte Carlo (MCMC) for Bayesian Analysis

Computing tools have been developed explicitly for Bayesian analysis that are more powerful than anything available for frequentist methods, in the sense that Bayesians can now tackle enormously intricate problems that frequentist methods cannot begin to address (O'Hagan, 2004).  The advent of MCMC methods in the early 1990s served to emancipate the implementation of Bayesian analysis (Allenby, Bradlow, George, Liechty, & McCulloch, 2014).

The transformation is continuing, and computational developments are shifting the balance consistently in favor of Bayesian methods (O'Hagan, 2004).  MCMC is a simulation technique whose concept is to bypass the mathematical operations rather than implement them directly (O'Hagan, 2004).  The Bayesian inference problem is solved by randomly drawing a sizeable simulated sample from the posterior distribution (O'Hagan, 2004).  The underlying concept is that a sufficiently large sample from any distribution can represent the whole distribution effectively (O'Hagan, 2004).
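
To make the idea concrete, the following is a minimal random-walk Metropolis sketch in R (illustrative only, not drawn from O'Hagan, 2004); the target log-posterior is simply a standard normal density standing in for the sum of the log prior and the log likelihood.

  • ## minimal random-walk Metropolis sketch (illustrative target: standard normal posterior)
  • set.seed(1)
  • log.post <- function(q) dnorm(q, 0, 1, log=TRUE)   ## stand-in for log prior + log likelihood
  • n.iter <- 5000
  • draws <- numeric(n.iter)
  • q <- 0
  • for (i in 1:n.iter) {
  •   q.prop <- q + rnorm(1, 0, 1)                     ## propose a nearby value
  •   if (log(runif(1)) < log.post(q.prop) - log.post(q)) q <- q.prop   ## accept or reject
  •   draws[i] <- q
  • }
  • mean(draws); quantile(draws, c(0.025, 0.975))      ## posterior summaries from the simulated sample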

Bayesian Analysis Application

Although Bayesian methods are not yet the tools of choice in many application areas (O'Hagan, 2004), Bayes' Theorem has been applied to, and proven useful in, various disciplines and contexts, such as breaking the German Enigma code during World War II, saving millions of lives (Fischetti et al., 2017).  Furthermore, an essential application of Bayes' rule arises in predictive classification problems (Giudici, 2005).  Bayesian analyses applied to marketing science problems have become increasingly popular due to their ability to capture individual-level customer heterogeneity (Allenby et al., 2014).  Big Data is promoting the collection and archiving of an unprecedented amount of data (Allenby et al., 2014).  The marketing industry is convinced that there is gold in such Big Data (Allenby et al., 2014).  Big Data is described as discrete, and as huge because of its breadth rather than its depth (Allenby et al., 2014).  It provides significant amounts of shallow data that do not reveal the state of the respondent or the state of the market (Allenby et al., 2014).  Bayesian methods are found to be useful in marketing analysis because of their ability to deal with large, shallow datasets and to produce exact, finite-sample inference (Allenby et al., 2014).

The pharmaceutical industry is another example of the use of Bayesian methods.  Pharmaceutical companies are regularly forced to abandon drugs that have just failed to demonstrate beneficial effects in frequentist terms, even when a Bayesian analysis suggests that it would be worth persevering (O'Hagan, 2004).  DNA evidence is another example of the use of Bayesian analysis (O'Hagan, 2004).  Other applications of Bayesian methods include fraud detection (Bolton & Hand, 2002) and social networks (Rodriguez, 2012).

Bayesian Analysis Software

There is a growing range of software available to assist with Bayesian analysis (O'Hagan, 2004).  Two particular software packages are in general use and freely available: First Bayes and WinBUGS (O'Hagan, 2004).

First Bayes is an elementary program aimed at helping beginners learn and understand how Bayesian methods work (O'Hagan, 2004).  WinBUGS is a robust program for carrying out MCMC computations and is in widespread use for serious Bayesian analysis (O'Hagan, 2004).

Advantages and Disadvantages of Bayesian Analysis

Bayesian methods and classical methods both have advantages and disadvantages, and there are some similarities between them (sas.com, n.d.).  When the sample size is large, Bayesian inference often provides results for parametric models that are very similar to those produced by frequentist methods (sas.com, n.d.).  Bayesian analysis has the following five advantages:

  • It provides a natural and principled technique of combining prior information with data, within a solid decision theoretical framework.  Thus, past information about a parameter can be incorporated to form a prior distribution for future analysis.  When new observations are available, the previous posterior distribution can be used as a prior.  All inferences logically follow from the Bayes’ Theorem (sas.com, n.d.).
  • It provides inferences which are conditional on the data and are precise, without reliance on the asymptotic approximation.  Small sample inference proceeds in the same manner as if one had a large sample.  The Bayesian analysis can also estimate any functions of parameters directly, without using the “plug-in” method, which is a way to estimate functionals by plugging the estimated parameters in the functionals (sas.com, n.d.).
  • It adheres to the likelihood principle.  If two distinct sampling designs yield proportional likelihood functions for q, then all inferences about q should be identical from these two designs.  The classical inference does not adhere to the likelihood principle in general (sas.com, n.d.).
  • It provides interpretable answers, such as "the true parameter q has a probability of 0.95 of falling in a 95% credible interval" (sas.com, n.d.).
  • It provides a convenient setting for a wide range of models, such as hierarchical models and missing data problems (sas.com, n.d.).    

Bayesian Analysis also has the following disadvantages:

  • It does not tell how to select a prior, and there is no single correct method for choosing one.  Bayesian inference requires skill to translate subjective prior beliefs into a mathematically formulated prior, which can generate misleading results if not done with caution (sas.com, n.d.).
  • It can produce posterior distributions which are heavily influenced by the priors.  From a practical point of view, subject matter experts might disagree with the validity of the chosen prior (sas.com, n.d.).
  • It often comes with a high computational cost, especially in models with a large number of parameters.  Also, simulations provide slightly different answers unless the same random seed is used.  However, these slight variations in simulation results do not contradict the claim that Bayesian inferences are exact: the posterior distribution of a parameter is exact, given the likelihood function and the priors, while simulation-based estimates of posterior quantities can vary due to the random number generator used in the procedures (sas.com, n.d.).

References

Alexander, C., & Wang, L. (2017). Big data analytics in heart attack prediction. The Journal of Nursing Care, 6(393).

Allenby, G. M., Bradlow, E. T., George, E. I., Liechty, J., & McCulloch, R. E. (2014). Perspectives on Bayesian Methods and Big Data. Customer Needs and Solutions, 1(3), 169-175.

Baştanlar, Y., & Özuysal, M. (2014). Introduction to machine learning miRNomics: MicroRNA Biology and Computational Analysis (pp. 105-128): Springer.

Bolton, R. J., & Hand, D. J. (2002). Statistical fraud detection: A review. Statistical Science, 235-249.

Fischetti, T., Mayor, E., & Forte, R. M. (2017). R: Predictive Analysis: Packt Publishing.

Giudici, P. (2005). Applied data mining: statistical methods for business and industry: John Wiley & Sons.

Hand, D. J., Mannila, H., & Smyth, P. (2001). Principles of data mining.

O’Hagan, A. (2004). Bayesian statistics: principles and benefits. Frontis, 31-45.

Rodriguez, A. (2012). Modeling the dynamics of social networks using Bayesian hierarchical blockmodels. Statistical Analysis and Data Mining: The ASA Data Science Journal, 5(3), 218-234.

sas.com. (n.d.). Bayesian Analysis: Advantages and Disadvantages. Retrieved from https://support.sas.com/documentation/cdl/en/statug/63033/HTML/default/viewer.htm#statug_introbayes_sect006.htm.

Suguna, M., Sakthi Sakunthala, N., Sanjana, S., & Sanjhana, S. (2017). A Survey on Prediction of Heart Diseases Using Big Data Algorithms.