"Artificial Intelligence without Big Data Analytics is lame, and Big Data Analytics without Artificial Intelligence is blind." Dr. O. Aly, Computer Science.
The purpose of this discussion is to analyze the design of a data audit system for health informatics and the elements of such a system design. Auditing is part of the security measures that organizations must implement for privacy protection. Organizations must comply with regulations and rules such as HIPAA to protect patients' private information.
Data Audit System for Healthcare Informatics
Despite significant progress in technological security solutions such as information access control, the operational process still confronts significant challenges (Appari & Johnson, 2010). In healthcare, data access is provided broadly, and the "Break the Glass" (BTG) policy is adopted to facilitate care effectively and promptly, given the nature of healthcare and the various purposes of access (Appari & Johnson, 2010). The BTG policy allows granting emergency access to critical electronic protected health information (ePHI) systems by providing a quick way for a person who lacks access privileges to specific information to gain access when required. As indicated in (Appari & Johnson, 2010), 99% of doctors were granted overriding privileges while only 52% required overriding rights on a regular basis, and the security techniques of health information systems were overridden to access 54% of patients' information. Moreover, the BTG policy can be misused by employees (Appari & Johnson, 2010). As indicated in (Malin & Airoldi, 2007), a study found that 28 of 28 Electronic Medical Record (EMR) systems incorporated audit capabilities, yet only 10 of the systems alerted healthcare administrators of potential violations. Thus, there is a serious need for healthcare organizations to design and implement a robust audit system to avoid pitfalls that can lead to serious malicious attacks and data breaches.
Various research studies have proposed audit systems to address these pitfalls and to ensure that a proper security policy with an appropriate audit system is defined and implemented correctly. In (Malin & Airoldi, 2007), the researchers proposed a novel protocol called CAMRA (Confidential Audits of Medical Record Access), which allows an auditor to access information from non-EMR systems without revealing the identity of those being investigated. In (Rostad & Edsberg, 2006), the researchers analyzed role-based access control systems in healthcare, which are often extended with exception mechanisms to ensure access to needed information even when the needs do not follow the proper methods; they recommend limiting the use of exception mechanisms because such mechanisms increase the threats to patients' privacy and are subject to auditing. In (Bhatti & Grandison, 2007), the researchers proposed a model called PRIMA (PRIvacy Management Architecture) that exploits policy refinement techniques to gradually and seamlessly embed privacy controls into the clinical workflow; the underlying concept of PRIMA is based on the Active Enforcement and Compliance Auditing components of the Hippocratic Database technology and leverages standard data analysis techniques. In (Ferreira et al., 2006), the researchers analyzed the BTG policy in a Virtual EMR (VEMR) system integrated with the access control model already in use; one of the requirements of the access control model involves auditing and monitoring mechanisms that must be in place at all times for all users. In (Zhao & Johnson, 2008), the researchers proposed a governance structure based on controls and incentives in which employees' self-interested behavior can result in the firm-optimal use of information; the result of their analysis indicated that audit quality is a critical element of the proposed governance scheme.
The Role of Audit as a Security Measure
Security has three main principles, known as the CIA Triad: Confidentiality, Integrity, and Availability. Additional security concepts are known as the five elements of the AAA services: Identification, Authentication, Authorization, Auditing, and Accounting (the acronym names Authentication, Authorization, and Accounting, or sometimes Auditing). Identification is the claiming of an identity when attempting to access a secured area or system. Authentication is proving that claimed identity. Authorization defines what resource and object access is allowed or denied for a specific identity. Auditing records a log of the events and activities related to the system and its subjects. Accounting (a.k.a. accountability) is the review of log files to check for compliance and violations in order to hold users accountable for their actions (Stewart, Chapple, & Gibson, 2015).
Auditing comprises the programmatic techniques used to track and record actions so that users can be held accountable for what they do while authenticated on the system. Auditing is also used to detect unauthorized users and abnormal activities, and to record both user activity and the activity of the core system functions that maintain the operating environment and its security mechanisms. The audit trails created by recording system events to logs can be utilized to evaluate the health and performance of the system. System crashes may indicate faulty programs, corrupt drivers, or intrusion attempts, and the event logs can be used to discover the reason a system failed. Auditing is required to detect malicious actions by users, attempted intrusions, and system failures, and to reconstruct events, provide evidence for prosecution, and produce problem analysis and reports. Auditing provides accountability: it tracks users and records when they access objects and files, creating an audit trail in the audit logs. For instance, auditing can record the time a user reads, modifies, or deletes a file (Stewart et al., 2015).
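As a minimal illustration of the who/when/what pattern described above, the following Python sketch records file-access events to an append-only log. The field names and log path are hypothetical; a production system would write to protected, tamper-evident storage.

```python
import json
import time

AUDIT_LOG = "audit_trail.log"  # hypothetical path; real systems use protected storage

def record_event(user: str, action: str, resource: str) -> None:
    """Append a who/when/what entry to the audit trail."""
    entry = {
        "timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
        "user": user,          # who
        "action": action,      # what operation (read/modify/delete)
        "resource": resource,  # which object or file
    }
    with open(AUDIT_LOG, "a") as log:
        log.write(json.dumps(entry) + "\n")

record_event("dr_smith", "read", "patient/12345/chart")
record_event("dr_smith", "modify", "patient/12345/medications")
```

Each appended line is one audit-trail record; reviewing these records is the accounting step that holds users responsible for their actions.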
Auditing is a native feature of operating systems and of most applications and services. Thus, a security professional must ensure that the auditing feature is enabled and configured on systems, applications, and services to monitor and record all activities, including malicious events. Moreover, most firewalls offer extensive auditing, logging, and monitoring capabilities as well as alarms and basic intrusion detection system (IDS) functions. Every system must have the appropriate combination of a local host firewall, anti-malware scanners, authentication, authorization, auditing, spam filters, and IDS/IPS services. Many organizations require the retention of all audit logs for three years or longer so that they can reconstruct the details of past security incidents (Stewart et al., 2015). Every organization must have a retention policy that provides the rules for retaining such audit logs and supports compliance with HIPAA investigations.
The Role of the Auditor
The auditor must review and verify that the security policy is implemented properly and that the security solutions are adequate. The auditor produces compliance and effectiveness reports to be reviewed by senior management, which transforms the issues discovered in these reports into new directives. Moreover, the role of the auditor is to ensure that a secure environment is properly protecting the organization's assets (Stewart et al., 2015).
HIPAA Audit Requirement Compliance
Audit trails are records with retention requirements, and Healthcare Information Management should include them in the management of electronic health records. Legal requirements and compliance drive audit trail management. HIPAA audits have been occurring around the country, resulting in judgments with substantial fines; organizations must reduce risk and maintain robust audit trails for their clinical applications (Nunn, 2009).
The audit trail is distinguished from audit control. As cited in (Nunn, 2009), the audit trail is defined by the Fundamentals of Law for Health Informatics and Information Management as a "record that shows who had accessed a computer system, when it was accessed, and what operation was performed." Audit control is a term used by IT professionals, defined as "the mechanisms employed to record and examine system activity. The data collected and potentially used to facilitate a security audit is called the audit trail that in turn may consist of several audit files located on different entities of a network" (Nunn, 2009). This distinction indicates that it may take several different audit trails across systems to detect inappropriate access or malicious intrusions into clinical databases.
Organizations must conduct routine random audits to ensure compliance with HIPAA and other regulations that protect patients' privacy. Audit trails can track all system activities, including a detailed listing of content, duration, and users, generating the date and time for entries and logging all modifications to EHRs. Routine audits can assist in capturing inappropriate use and access by unauthorized users. When there is inappropriate access to a medical record, the system can generate information about the name of the individual gaining access, the time, the date, the screens accessed, and the duration of the review. This information can provide evidence for prosecution if the access was not authorized or if there is a malicious attack or data breach. The HIPAA Security Rule requires organizations to maintain audit trails, document information system activities, and have the hardware, software, and procedures to record and examine these activities where they involve health information (Ozair, Jamshed, Sharma, & Aggarwal, 2015; Walsh & Miaoulis, 2014).
In summary, healthcare organizations must implement an audit system to comply with regulations such as HIPAA and to ensure the protection of patients' private information. The audit system should use audit trail techniques to track system activities and user access to health information. Limiting access to authorized users is recommended. The BTG policy should be applied only for exceptions; in practice, however, it has been misused in healthcare and applied to users who have no exceptional need to access health records. Auditing is also used to detect fraudulent activities. Thus, organizations must take advantage of the available hardware and software to implement an audit system, not only to protect the privacy of patients but also to detect fraud.
References
Appari, A., & Johnson, M. E. (2010). Information security and privacy in healthcare: Current state of research. International Journal of Internet and Enterprise Management, 6(4), 279-314.
Bhatti, R., & Grandison, T. (2007). Towards improved privacy policy coverage in healthcare using policy refinement. Paper presented at the Workshop on Secure Data Management.
Ferreira, A., Cruz-Correia, R., Antunes, L., Farinha, P., Oliveira-Palhares, E., Chadwick, D. W., & Costa-Pereira, A. (2006). How to break access control in a controlled manner. Paper presented at the 19th IEEE International Symposium on Computer-Based Medical Systems (CBMS 2006).
Malin, B., & Airoldi, E. (2007). Confidentiality preserving audits of electronic medical record access.
Ozair, F. F., Jamshed, N., Sharma, A., & Aggarwal, P. (2015). Ethical issues in electronic health records: A general overview. Perspectives in Clinical Research, 6(2), 73.
Rostad, L., & Edsberg, O. (2006). A study of access control requirements for healthcare systems based on audit trails from access logs. Paper presented at the 22nd Annual Computer Security Applications Conference (ACSAC '06).
Stewart, J., Chapple, M., & Gibson, D. (2015). CISSP: Certified Information Systems Security Professional Official Study Guide (7th ed.). Wiley.
The purpose of this project is to discuss and analyze Big Data and Big Data Analytics in the healthcare industry. The discussion and analysis cover the benefits of BD and BDA in healthcare and how the healthcare industry is not taking full advantage of these benefits. The analyses also cover various proposed BDA frameworks for healthcare built on advanced technologies such as Hadoop and MapReduce. The proposed BDA ecosystem for healthcare includes several layers: the Data Layer, Data Aggregation Layer, Analytics Layer, and Information Exploration Layer. These layers are controlled by the Data Governance Layer to protect health information and ensure compliance with HIPAA regulations. The discussion and analysis detail the role of BDA in healthcare for data privacy protection. The analyses also cover HIPAA and data privacy requirements, and the increasing trend of data breaches and privacy violations in healthcare. The project proposes a policy for healthcare data security and privacy protection. The proposed policy covers the general principles of HIPAA and the implications of violating HIPAA regulations. The project details the elements of the proposed policy and the Risk Analysis that is required as the first element of the proposed policy.
Keywords: Big Data Analytics, Healthcare, Data Privacy, Data Protection.
The healthcare industry continuously generates a large volume of data resulting from record keeping, patient-related data, and compliance. As indicated in (Dezyre, 2016), the US healthcare industry generated 150 billion gigabytes (150 exabytes) of data in 2011. In the era of information technology and the digital world, the digitization of data is becoming mandatory. The analysis of such a large volume of data is critically required to improve the quality of healthcare, minimize healthcare-related costs, and respond to challenges effectively and promptly. Big Data Analytics (BDA) offers great opportunities in the healthcare industry to discover patterns and relationships using machine learning algorithms and to gain meaningful insights for sound decision making.
Although the implementation of Big Data Analytics offers various benefits, the healthcare industry continues to struggle to realize the full benefit of its investments in BDA, and some healthcare organizations doubt the benefits and advantages of applying BDA technologies. As indicated in (Wang, Kung, & Byrd, 2018), only 42% of healthcare organizations take advantage of BDA to support the decision-making process, and only 16% have substantial experience using analytics across a broad range of functions. These survey results show that there is a lack of understanding about the key role of BDA in the healthcare industry (Bresnick, 2017; Wang et al., 2018). Thus, there is an urgent need for the healthcare industry to comprehend the impact of BDA and explore its potential benefits at the managerial, economic, and strategic levels (Wang et al., 2018). Healthcare organizations must also understand the implementation of data governance and regulations such as HIPAA to protect healthcare data from threats, risks, and data breaches. All organizations, including healthcare organizations, must fully understand the consequences of breaking or violating these regulations and rules.
This project addresses Big Data Analytics in the healthcare industry and the techniques and methods needed to meet the data privacy requirements of HIPAA. The project also proposes a policy for healthcare organizations that explains the consequences of a data breach at the personal, patient, and organizational levels. The importance of using BDA together with security measures is also discussed. Moreover, the project discusses and analyzes the steps required to comply with data privacy rules and HIPAA regulations.
Healthcare generates various types of data from various sources, such as physician notes, X-ray reports, lab reports, case histories, diet regimes, lists of doctors and nurses, national health register data, medicines and pharmacies, and identification of expired medical tools, materials, and instruments based on RFID data (Archenaa & Anita, 2015; Dezyre, 2016; Wang et al., 2018). This exponentially increasing trend in generating healthcare data has resulted in an expenditure of $1.2 trillion on healthcare data solutions in the healthcare industry (Dezyre, 2016). Healthcare organizations rely on Big Data technology to capture this information about patients and to gain more insight into care coordination, health management, and patient engagement. As cited in (Dezyre, 2016), McKinsey projects that the use of Big Data in the healthcare industry can reduce the expenses associated with healthcare data management by $300-$500 billion, one example of the benefits of using BD in healthcare.
Big Data Analytics offers various benefits to healthcare organizations (Jee & Kim, 2013), including the provision of patient-centric services. Healthcare organizations can employ Big Data Analytics in areas such as detecting diseases at an early stage, providing evidence-based medicine, minimizing drug doses to avoid side effects, and delivering efficient medicine based on genetic makeup. The use of BDA can reduce re-admission rates and thereby reduce healthcare-related costs for patients. BDA can also be used to detect spreading diseases early, before they spread widely, using real-time analysis that includes the social logs of patients who suffer from a disease in a particular geographical location. This analytical process can assist healthcare professionals in helping the community take preventive measures. Moreover, BDA is used in the healthcare industry to monitor the quality of healthcare organizations and entities such as hospitals, and treatment methods can be improved by monitoring the effectiveness of medications (Archenaa & Anita, 2015; Raghupathi & Raghupathi, 2014; Wang et al., 2018). Examples of Big Data Analytics in the healthcare industry include Kaiser Permanente, which implemented its HealthConnect system to ensure data exchange across all medical facilities and promote the use of electronic health records, and AstraZeneca and HealthCore, which joined an alliance to determine the most effective and economical treatments for some chronic illnesses and common diseases based on their combined data (Fox & Vaidyanathan, 2016).
It is very important for the IT professionals of healthcare organizations to understand the framework and topology of BDA so they can apply security measures that protect patients' information. The new framework for the healthcare industry includes emerging technologies such as Hadoop and MapReduce, which can be utilized to gain more insight in various areas. Traditional analytic systems have not proven adequate for data volumes as large as those healthcare generates (Wang et al., 2018). Thus, new technologies emerged to handle large volumes of data: Hadoop, with its major components the Hadoop Distributed File System (HDFS) and the MapReduce processing model, together with NoSQL stores such as HBase and warehousing tools such as Hive, applying various algorithms and machine learning to extract value from the data. Data without analytics has no value; the analytical process turns raw data into valuable information that can be used to save lives, predict diseases, decrease costs, and improve the quality of healthcare services.
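To make the MapReduce idea concrete, the following minimal Python sketch simulates the map and reduce phases by counting diagnosis codes across records. The records are synthetic and the functions are illustrative; a real deployment would run these phases in parallel on a Hadoop cluster reading from HDFS.

```python
from collections import defaultdict

# Synthetic, illustrative records; real input would come from HDFS.
records = [
    {"patient": "p1", "diagnosis": "E11"},   # type 2 diabetes
    {"patient": "p2", "diagnosis": "I10"},   # hypertension
    {"patient": "p3", "diagnosis": "E11"},
]

def map_phase(record):
    """Emit (key, 1) pairs, as a Hadoop mapper would."""
    yield (record["diagnosis"], 1)

def reduce_phase(pairs):
    """Sum counts per key, as a Hadoop reducer would."""
    counts = defaultdict(int)
    for key, value in pairs:
        counts[key] += value
    return dict(counts)

pairs = [pair for record in records for pair in map_phase(record)]
print(reduce_phase(pairs))  # {'E11': 2, 'I10': 1}
```

The split into a stateless map step and an aggregating reduce step is what lets Hadoop distribute the work over many machines.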
Various research studies have proposed BDA frameworks for healthcare in an attempt to shed light on integrating the new technologies to generate value for healthcare. These proposed frameworks vary. For instance, the framework in (Raghupathi & Raghupathi, 2014) involved several layers: a Data Source Layer, a Transformation Layer, a Big Data Platform Layer, and a Big Data Analytical Application Layer. In (Chawla & Davis, 2013), the researchers proposed a personalized, patient-centric healthcare framework that empowers patients to take a more active role in their own health and the health of their families. In (Youssef, 2014), the researcher proposed a framework for secure healthcare systems based on BDA in a mobile cloud computing environment; the framework involved cloud computing as the technology for handling big healthcare data, the electronic health records, and the security model.
Thus, this project introduces a framework and ecosystem for BDA in healthcare organizations that integrates data governance to protect patients' information at the various data states, such as data in transit and data in storage. The researcher of this project is in agreement with the framework proposed by (Wang et al., 2018), as it is a comprehensive framework that addresses various data privacy protection techniques during analytical processing. Thus, the framework selected for this project is based on the ecosystem and topology of (Wang et al., 2018).
The framework consists of major layers of the Data Layer, Data Aggregation Layer, Analytics Layer, Information Exploration Layer, and Data Governance Layer. Each layer has its purpose and its role in the implementation of BDA in the healthcare domain. Figure 1 illustrates the BDA framework for healthcare organizations (Wang et al., 2018).
Figure 1. Big Data Analytics Framework in Healthcare (Wang et al., 2018).
The framework includes a Data Governance Layer that controls data processing from capturing the data, through transforming it, to its consumption. The Data Governance Layer consists of three key elements: Master Data Management, Data Life-Cycle Management, and Data Security and Privacy Management. These three elements ensure the proper use of the data and its protection from breaches and unauthorized access.
The Data Layer represents the capture of data from various sources such as patients' records, mobile data, social media, clinical and lab results, X-rays, R&D labs, home care sensors, and so forth, as illustrated in Figure 2. This data is captured in structured, semi-structured, and unstructured formats. Structured data represents the traditional electronic healthcare records (EHRs), and transaction data, including patients' information, is also structured; video, voice, and images are unstructured, while machine-generated data is semi-structured. These various types of data represent the variety feature, one of the three major characteristics of Big Data (volume, velocity, and variety). The integration of these data pools is required for the healthcare industry to realize major opportunities from BDA.
The Data Aggregation Layer consists of three major steps to digest and handle the data: acquisition, transformation, and storage. The acquisition step is challenging because it involves reading data from various communication channels with differing frequencies, sizes, and formats. As indicated in (Wang et al., 2018), data acquisition is a major obstacle in the early stage of BDA implementation because the captured data has varied characteristics, and budgets may be exceeded in expanding the data warehouse to avoid bottlenecks under the workload. The transformation step involves moving, cleaning, splitting, translating, merging, sorting, and validating the data. After the data is transformed using transformation engines, it is loaded into storage such as HDFS or a Hadoop cloud for further processing and analysis. The principles of data storage are based on compliance regulations, data governance policies, and access controls. Data storage can be implemented and completed as a batch process or in real time.
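A minimal sketch of the cleaning and validating steps named above follows. The field names are hypothetical and the "load" target is a local list; a production pipeline would use a transformation engine and write to HDFS.

```python
raw_records = [
    {"name": " Jane Doe ", "dob": "1980-05-01", "glucose": "110"},
    {"name": "John Roe",   "dob": "",           "glucose": "95"},
]

def clean(record):
    """Trim whitespace and normalize types."""
    return {
        "name": record["name"].strip(),
        "dob": record["dob"] or None,
        "glucose": int(record["glucose"]),
    }

def validate(record):
    """Keep only records with the mandatory fields present."""
    return record["dob"] is not None

# Transform, validate, then 'load' (a real pipeline would write to HDFS).
warehouse = [r for r in map(clean, raw_records) if validate(r)]
print(warehouse)  # only the complete Jane Doe record survives
```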
The Analytics Layer involves three main operations, chosen according to the type of data: Hadoop MapReduce, stream computing, and in-database analytics. MapReduce is the most popular BDA technique, as it provides the capability to process a large volume of data in batch form in a cost-effective fashion and to analyze various types of data, both structured and unstructured, using massively parallel processing (MPP). The analytical process can also run in real time or near real time. In real-time analytics, data in motion is tracked, responses to unexpected events are produced as they occur, and next-best actions are determined quickly. Examples include healthcare fraud detection, where stream computing is a key analytical tool for predicting the likelihood of illegal transactions or deliberate misuse of patients' information.
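As a toy illustration of stream-based fraud screening, the sketch below flags claims that arrive faster than a plausible rate for one provider. The window size, threshold, and field names are invented for the example; a real deployment would use a stream-computing engine rather than an in-process loop.

```python
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 60        # hypothetical sliding window
MAX_CLAIMS_PER_WINDOW = 5  # hypothetical plausibility threshold

recent = defaultdict(deque)  # provider_id -> timestamps of recent claims

def on_claim(provider_id: str, now: float) -> bool:
    """Return True if the claim stream for this provider looks suspicious."""
    window = recent[provider_id]
    window.append(now)
    while window and now - window[0] > WINDOW_SECONDS:
        window.popleft()  # drop claims outside the sliding window
    return len(window) > MAX_CLAIMS_PER_WINDOW

# Simulate a burst of claims from one provider.
start = time.time()
flags = [on_claim("provider_42", start + i) for i in range(8)]
print(flags)  # the later claims in the burst are flagged
```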
With respect to in-database analytics, the analysis is implemented through data mining techniques using approaches such as clustering, classification, decision trees, and so forth. Data mining allows data to be processed within the data warehouse, providing high-speed parallel processing, scalability, and optimization features aimed at analyzing big data. The results of the in-database analytics process are not current or real-time; instead, it generates reports with static predictions, which can be used in healthcare to support preventive healthcare practices and improve pharmaceutical management. This Analytics Layer also provides significant support for evidence-based medical practice by analyzing electronic healthcare records (EHRs), care experience, patterns of care, patients' habits, and medical histories (Wang et al., 2018).
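The following sketch shows the classification approach named above using a decision tree, assuming the scikit-learn package is available. The features and labels are entirely synthetic and only illustrate the mechanics, not a validated clinical model.

```python
from sklearn.tree import DecisionTreeClassifier

# Synthetic features: [age, systolic blood pressure, BMI]
X = [[45, 130, 28], [62, 160, 33], [30, 115, 22], [70, 155, 31]]
y = [0, 1, 0, 1]  # 1 = elevated readmission risk (invented labels)

model = DecisionTreeClassifier(max_depth=2, random_state=0)
model.fit(X, y)

# Predict risk for a new (synthetic) patient.
print(model.predict([[55, 150, 30]]))  # e.g., [1]
```

In an in-database setting, the same kind of model would be trained and scored inside the warehouse rather than by exporting the data.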
In the Information Exploration Layer, visualization reports, real-time information monitoring, and meaningful business insights derived from the Analytics Layer are generated to assist organizations in making better decisions in a timely fashion. For healthcare organizations, the most important reporting involves real-time information monitoring, such as alerts and proactive notifications, real-time data navigation, and operational key performance indicators (KPIs). This information is collected from devices such as smartphones and personal medical devices and can be sent to interested users or made available as real-time dashboards for monitoring patients' health and preventing accidental medical events. The value of remote monitoring has been demonstrated for diabetes, as indicated in (Sidhtara, 2015), and for heart diseases, as indicated in (Landolina et al., 2012).
The benefits of Big Data and Big Data Analytics in the healthcare industry are not in question. Several research studies and practitioners agree that the healthcare industry can benefit greatly from BD and BDA technologies in areas such as reducing healthcare costs for patients as well as for healthcare organizations, providing quality services, predicting diseases among potential patients, and more. There is also consensus among researchers and practitioners about the security requirements for patients' information and the requirement to protect data privacy from unauthorized access and data breaches.
As indicated in (himss.org, 2018), medical and healthcare entities accounted for 36.5% of the reported data breaches in 2017. According to a recent report published by HIPAA, the first three months of 2018 saw 77 healthcare data breaches reported to the Department of Health and Human Services' Office for Civil Rights (OCR). The report added that the impact of these breaches was significant, as more than one million patients and health plan members were affected; this is estimated to be almost twice the number of individuals impacted by healthcare data breaches in Q4 of 2017. Figure 3 illustrates this increasing trend in healthcare data breaches (HIPAA, 2018).
Figure 3: Q1, 2018 Healthcare Data Breaches (HIPAA, 2018).
As reported in the same report, the healthcare industry is unique with respect to data breaches because they are caused mostly by insiders; "insiders were behind the majority of breaches" (HIPAA, 2018). Other causes involve improper disposal, loss/theft, unauthorized access/disclosure incidents, and hacking incidents. The largest healthcare data breaches of Q1 2018 comprised 18 security breaches that each impacted more than 10,000 individuals, and hacking/IT incidents involved more records than any other breach cause, as illustrated in Figure 4 (HIPAA, 2018).
Figure 4. Healthcare Records Exposed by Breach Cause (HIPAA, 2018).
Healthcare providers were the worst affected by the healthcare data breaches of Q1 2018. Among the states, California was the worst affected, with 11 reported breaches, followed by Massachusetts with eight security incidents.
The Health Insurance Portability and Accountability Act (HIPAA) of 1996 is U.S. legislation that provides data privacy and security provisions for safeguarding medical information. Every organization, including healthcare organizations, must comply with and meet the requirements of HIPAA. HIPAA compliance is critical because the privacy and security of patients' information are among the most important aspects of the healthcare domain. The goal of security is to meet the CIA Triad of Confidentiality, Integrity, and Availability. In the healthcare domain, organizations must apply security measures by utilizing commercial software such as Cloudera instead of open source software that may be exposed to security holes (Fox & Vaidyanathan, 2016).
The data privacy concern is driven by potential breaches and leaks of patients' information. As indicated in (Fox & Vaidyanathan, 2016), cyber thieves routinely target medical records. The Federal Bureau of Investigation (FBI) issued a warning to healthcare providers to guard their data against cyber attacks after the incident at Community Health Systems Inc., regarded as one of the largest U.S. hospital operators, in which the personal information of 4.5 million patients was stolen by Chinese hackers. Moreover, the names and addresses of 80 million patients were stolen by hackers from Anthem, regarded as one of the largest U.S. health insurance companies. Although the details of these patients' illnesses and treatments were not exposed, the incident shows how exposed the healthcare industry is to cyber attacks. There is an increasing trend in such privacy breaches and data loss through cyber attack incidents.
The number of threats and data breach incidents exposing patients' information to unauthorized users and cyber attacks has grown over the years. As indicated in (Fox & Vaidyanathan, 2016), data breaches cost the healthcare industry about $6 billion. A study conducted by Ponemon, as cited in (Fox & Vaidyanathan, 2016), showed that healthcare organizations failed to perform risk assessments for security incidents. The study also showed that healthcare organizations are struggling to comply with federal and state privacy and security regulations such as HIPAA.
HIPAA added a new part to Title 45 of the Code of Federal Regulations (CFR) for health plans, healthcare providers, and healthcare clearinghouses in general. As indicated in (Fischer, 2003), the new regulations include five basic principles that healthcare entities must adhere to. The first principle is consumer control: the regulation provides patients with critical new rights to control the release of their medical information. The second principle is boundaries, which allows the use of patients' healthcare information for health purposes only, including treatment and payment, with few exceptions. The third principle is accountability, which enforces specific federal penalties if a patient's right to privacy is violated. The fourth principle is public responsibility, which reflects the need to balance privacy protections with the public responsibility to support national priorities such as protecting public health, conducting medical research, improving the quality of care, and fighting healthcare fraud and abuse. The last principle is security, which organizations must implement to protect against deliberate or inadvertent misuse or disclosure.
Thus, HIPAA requires three main implementations for these principles. The first requirement involves the standardization of electronic patient health, administrative, and financial data. The second requirement involves unique health identifiers for individuals, employers, health plans, and healthcare providers. The third requirement involves security standards to protect the confidentiality and integrity of "individually identifiable health information," past, present, and future (Fischer, 2003). The implementation of these three requirements is critical to complying with the five principles.
Privacy and security are closely related, and HIPAA regulations cover both; confidentiality is part of the Privacy Rule. Privacy and security are distinguished, yet related. As defined in (Fischer, 2003), "Privacy is the right of the individual to control the use of his or her personal information." This personal information should not be used against the will of the person. Confidentiality becomes an issue when personal information is received by another entity; it means protecting the information by safeguarding it from unauthorized disclosure. Security, on the other hand, refers to all of the physical, technical, and administrative safeguards placed and implemented to protect the information. Security involves the protection of systems from unauthorized internal or external access, for data in storage and data in transit. Thus, security and privacy are distinct but related; there is no one without the other (Fischer, 2003).
The Privacy Rule of HIPAA focuses on the way PHI is handled by an entity and between healthcare organizations and other covered entities, and it includes both paper and electronic records. The five HIPAA privacy rules are summarized in Figure 5.
Figure 5. HIPAA Privacy Rules. Adapted from (Fischer, 2003).
Organizations must comply with the HIPAA Privacy Rules and must therefore implement reasonable and appropriate security measures. These security measures and rules address internal and external threats and vulnerabilities, and they include protecting computers (servers and clients) and medical data and related patient information in transit and in storage. Additional measures are also needed, such as formalizing and documenting policies and logging to create audit trails that monitor user access to and modification of PHI (Fischer, 2003).
HIPAA requires healthcare organizations to assess potential risks and vulnerabilities by performing gap and risk analyses. It also requires organizations to protect against threats to information security or integrity and against unauthorized use or disclosure. Organizations are further required to implement and maintain security measures appropriate to their needs, capabilities, and circumstances, and to ensure compliance with these safeguards by all staff (Fischer, 2003).
The implications of data breaches and non-compliance with privacy regulations such as HIPAA are serious. Healthcare organizations are directly liable for violating HIPAA rules. Such violations include failure to comply with the Security Rule, impermissible uses and disclosures, and failure to provide breach notification (Fox & Vaidyanathan, 2016).
It is therefore critical for healthcare organizations to comply with HIPAA regulations, to consider security measures during the application of BDA, and to apply privacy protection measures at every level of the BDA framework. Healthcare organizations are required to implement security measures that protect ePHI through appropriate administrative, physical, and technical safeguards (Fox & Vaidyanathan, 2016). These security measures can be implemented using controls such as access control, encryption, multi-factor authentication, and secure sockets layer (SSL) techniques (Fox & Vaidyanathan, 2016; Gardazi & Shahid, 2017; Youssef, 2014).
Various government documents detail the financial implications for organizations that do not comply with HIPAA regulations. In accordance with (hhs.gov, 2003), there are two types of enforcement and penalties for non-compliance with HIPAA regulations. Civil money penalties may be imposed on a covered entity at $100 per failure to comply with a Privacy Rule requirement; this penalty may not exceed $25,000 per year for multiple violations of the identical Privacy Rule requirement in a calendar year. With respect to criminal penalties, there is a fine of $50,000 and up to one year of imprisonment for a person who obtains or discloses individually identifiable health information in violation of HIPAA. The criminal penalties increase to $100,000 and up to five years of imprisonment if the wrongful conduct involves false pretenses, and to $250,000 and up to ten years of imprisonment if the wrongful conduct involves the intent to sell, transfer, or use individually identifiable health information for commercial advantage, personal gain, or malicious harm. These criminal sanctions are enforced by the Department of Justice (hhs.gov, 2003).
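As a simple worked example of the civil penalty arithmetic above (figures taken from the cited 2003 schedule; the penalty tiers have since been revised), the annual cap works as follows:

```python
PER_VIOLATION = 100   # civil penalty per failure to comply (2003 schedule)
ANNUAL_CAP = 25_000   # cap per identical requirement per calendar year

def civil_penalty(violations: int) -> int:
    """Penalty for repeated violations of one identical requirement in a year."""
    return min(violations * PER_VIOLATION, ANNUAL_CAP)

print(civil_penalty(50))   # 5,000
print(civil_penalty(400))  # 40,000 uncapped, so capped at 25,000
```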
Moreover, in 2013, the Health and Human Services' Office for Civil Rights (HHS) released its final regulations expanding privacy rights for patients and others. The new rules include major changes in the medical record privacy measures required of health providers by HIPAA and the Health Information Technology for Economic and Clinical Health Act (HITECH). The rules expand the privacy measures to apply to additional groups that have access to patients' information, "regardless of whether the information is being held by a health plan, a health care provider, or one of their business associates" (ncsl.org, 2016). This change extends many of the requirements to business associates of these healthcare entities that receive protected health information, such as contractors and sub-contractors, as some of the largest breaches reported to HHS involved business associates. Penalties for noncompliance are increased based on the level of negligence, with a maximum penalty of $1.5 million per violation (ncsl.org, 2016).
It is essential for healthcare organizations to comply with HIPAA regulations. Organizations that are breached and not compliant with HIPAA regulations can face financial consequences, as summarized in Table 1, adapted from (himss.org, 2018).
Table 1. Financial Consequences for Violating HIPAA Regulations. Adapted from (himss.org, 2018).
The statistical analysis and the increasing trend in reported data breach incidents indicate a serious need for healthcare entities to comply with HIPAA regulations to protect the privacy of patients and individuals. There are three major HIPAA rules with which healthcare organizations must comply: the Security Rule, the Privacy Rule, and the Breach Notification Rule.
The main goal of complying with HIPAA regulations is to protect the privacy of the health information of patients and individuals while allowing organizations to adopt new technologies, such as emerging BDA technologies, to improve the quality and efficiency of patient care. Some of the requirements of the HIPAA Security Rule aim to make it more difficult for attackers to install malware and other harmful code on systems. Examples of these Security Rule requirements include the implementation of a security awareness and training program for all workforce members, including management (himss.org, 2018).
The BDA healthcare framework should be developed to comply with HIPAA regulations. The first step toward compliance involves risk assessment and analysis and a risk management plan. The purpose of this step is to identify and implement safeguards that comply with and execute the standards and specifications of the Security Rule. The Risk Analysis is the foundation for assessing the potential vulnerabilities, threats, and risks to PHI, so every healthcare organization must begin with it to document those vulnerabilities, threats, and risks to patient information. Healthcare organizations must identify everywhere PHI is created and enters the organization; the entry points of PHI include emails, texts, the people and locations involved in entering electronic health records, new patient information, business associate communications, and the databases used and their locations. Moreover, healthcare organizations must know where patient health information is stored, such as EMR/EHR systems, mobile devices, servers, workstations, wireless medical devices, laptops, computers, and so forth. The Risk Analysis should detail the scope of the analysis, the data collection procedures, the vulnerabilities, threats, and risks, the assessment of current security measures, the likelihood of threat occurrence, the potential impact of each threat, the risk level, and the periodic review and updates required. Figure 6 summarizes the elements of the Risk Analysis (himss.org, 2018).
Figure 6. Risk Analysis Elements. Adapted from (himss.org, 2018)
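A minimal sketch of a risk register reflecting the elements listed above follows. The 1-5 scales, threats, and likelihood-times-impact scoring are invented for illustration; real methodologies such as NIST SP 800-30 define their own scales and procedures.

```python
# Hypothetical 1-5 scales for likelihood and impact.
risks = [
    {"threat": "stolen laptop with ePHI", "likelihood": 4, "impact": 5},
    {"threat": "phishing of staff credentials", "likelihood": 5, "impact": 4},
    {"threat": "server room flood", "likelihood": 1, "impact": 3},
]

for risk in risks:
    # Simple likelihood x impact scoring; periodic review would update these.
    risk["level"] = risk["likelihood"] * risk["impact"]

for risk in sorted(risks, key=lambda r: r["level"], reverse=True):
    print(f'{risk["level"]:>2}  {risk["threat"]}')
```

Ranking risks this way helps prioritize which safeguards to implement first.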
Organizations must implement encryption for PHI to protect patient information in case a data breach occurs. HIPAA does not specify the type of encryption; however, the industry best practice is to use AES-128, Triple DES, AES-256, or better (himss.org, 2018). Moreover, organizations must comply with HIPAA requirements for handling email: in addition to encryption, email should follow additional security measures such as password procedures. Mobile devices must follow the HIPAA Security Rule as well. The best mobile security practice is not to implement a "bring your own device" (BYOD) strategy; however, since this is often impractical, the organization should consider using the National Institute of Standards and Technology (NIST) guidelines for healthcare providers and staff, which include basic mobile security practices such as securing wireless connections. The organization should also limit the use of mobile devices to reduce the associated risk (himss.org, 2018).
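To illustrate the AES-256 recommendation above, the following sketch encrypts a record with AES-256-GCM, assuming the third-party `cryptography` package is installed. Key management, on which compliance actually hinges, is out of scope here; the key would live in a key vault, never beside the data.

```python
import os
from cryptography.hazmat.primitives.ciphers.aead import AESGCM

key = AESGCM.generate_key(bit_length=256)  # AES-256; store in a real key vault
nonce = os.urandom(12)                     # must be unique per message

plaintext = b"patient 12345: glucose 110 mg/dL"
ciphertext = AESGCM(key).encrypt(nonce, plaintext, None)

# A breach of the ciphertext alone does not expose the record.
recovered = AESGCM(key).decrypt(nonce, ciphertext, None)
assert recovered == plaintext
```

GCM mode also authenticates the data, so tampering with the stored ciphertext is detected at decryption time.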
As indicated in (himss.org, 2018), the majority of physical data thefts take only minutes to plan and execute. Thus, physical access to healthcare organizations must be protected. Examples of physical access controls include securing offices, storage doors, windows, and reception desks, and limiting access to PHI through role-based access. Healthcare organizations must follow best practices for physical security and train all staff and physicians to ensure the security requirements and safeguards are implemented (himss.org, 2018).
Healthcare organizations must implement firewalls to filter potentially harmful internet traffic and protect valuable and sensitive PHI. A firewall can be a hardware firewall, a software firewall, or a web application firewall. The firewall must be configured and maintained properly; otherwise, the organization can be put at risk. Firewall best practices include implementing security settings for each switch port, particularly if the network is segmented, continuously updating the firewall settings, using virtual private networks, establishing inbound and outbound rules, and segmenting the network using switch ports (himss.org, 2018).
Healthcare organizations should address all known security vulnerabilities and be consistent with industry-accepted system hardening standards. Examples of best practices for complying with HIPAA regulations include disabling services and features not in use, uninstalling applications that are not needed, limiting systems to a single role, removing or disabling default accounts, and changing default passwords and other settings. System updates and patches must be applied on a regular basis at every level of the healthcare framework.
User access should be secured by eliminating default password weaknesses and enforcing password procedures. Healthcare organizations must implement role-based access control to limit access to PHI to authorized users based on their roles. Access control is not limited to software and hardware; it also applies to physical locations.
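A minimal sketch of role-based access control as described above follows; the roles and permission names are invented for the example.

```python
# Hypothetical role-to-permission mapping.
ROLE_PERMISSIONS = {
    "physician": {"read_phi", "write_phi"},
    "billing":   {"read_billing"},
    "auditor":   {"read_audit_log"},
}

def is_allowed(role: str, permission: str) -> bool:
    """Grant access only if the user's role carries the permission."""
    return permission in ROLE_PERMISSIONS.get(role, set())

print(is_allowed("physician", "read_phi"))  # True
print(is_allowed("billing", "read_phi"))    # False: denied, and worth auditing
```

Because decisions flow through one function, every denial can also be written to the audit trail.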
Healthcare organizations must follow HIPAA regulations for logging and log management. This requirement includes retaining the event, audit, and access logs of each system for a total of six years (himss.org, 2018). Procedures for monitoring log-ins must be implemented to track all attempts, including those by unauthorized users. Audit controls include hardware and software mechanisms that record and examine activity in information systems that use protected health information. Moreover, organizations must perform vulnerability scanning and penetration testing on a regular basis.
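To illustrate the log-in monitoring requirement above, here is a toy sketch that flags repeated failed attempts in an access log; the alert threshold and log format are assumptions for the example.

```python
from collections import Counter

FAILED_ATTEMPT_THRESHOLD = 3  # hypothetical alerting threshold

# Synthetic access-log entries: (user, success?)
log = [("intruder", False), ("dr_smith", True), ("intruder", False),
       ("intruder", False), ("intruder", False)]

failures = Counter(user for user, ok in log if not ok)
for user, count in failures.items():
    if count >= FAILED_ATTEMPT_THRESHOLD:
        print(f"ALERT: {count} failed log-ins for '{user}'")  # escalate for review
```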
This project discussed and analyzed Big Data and Big Data Analytics in the healthcare industry. The discussion and analysis covered the benefits of BD and BDA in healthcare and how the healthcare industry is not taking full advantage of these benefits. The analyses also covered various proposed BDA frameworks for healthcare built on advanced technologies such as Hadoop and MapReduce. The proposed BDA ecosystem for healthcare includes several layers: the Data Layer, Data Aggregation Layer, Analytics Layer, and Information Exploration Layer. These layers are controlled by the Data Governance Layer to protect health information and ensure compliance with HIPAA regulations. The discussion and analysis detailed the role of BDA in healthcare for data privacy protection, and also covered HIPAA and data privacy requirements and the increasing trend of data breaches and privacy violations in healthcare. The project proposed a policy for healthcare data security and privacy protection; the proposed policy covered the general principles of HIPAA and the implications of violating HIPAA regulations. The project detailed the elements of the proposed policy and the Risk Analysis that is required as the first element of the proposed policy.
In conclusion, the healthcare industry must pay more attention to HIPAA regulations to protect health information from intruders, whether insiders or outsiders, and take proper security measures at every level, from physical access to system access. Security measures should include encryption so that, if an attack occurs, the data remain unreadable.
References
Chawla, N. V., & Davis, D. A. (2013). Bringing big data to personalized healthcare: A patient-centered framework. Journal of General Internal Medicine, 28(3), 660-665.
Fox, M., & Vaidyanathan, G. (2016). Impacts of healthcare big data: A framework with legal and ethical insights. Issues in Information Systems, 17(3).
Gardazi, S. U., & Shahid, A. A. (2017). Compliance-driven architecture for healthcare industry. International Journal of Advanced Computer Science and Applications, 8(5), 568-577.
Groves, P., Kayyali, B., Knott, D., & Kuiken, S. V. (2016). The 'big data' revolution in healthcare: Accelerating value and innovation.
Jee, K., & Kim, G.-H. (2013). Potentiality of big data in the medical sector: Focus on how to reshape the healthcare system. Healthcare Informatics Research, 19(2), 79-85.
Landolina, M., Perego, G. B., Lunati, M., Curnis, A., Guenzati, G., Vicentini, A., . . . Valsecchi, S. (2012). Remote monitoring reduces healthcare use and improves quality of care in heart failure patients with implantable defibrillators: The Evolution of Management Strategies of Heart Failure Patients With Implantable Defibrillators (EVOLVO) study. Circulation, 125(24), 2985-2992.
Wang, Y., Kung, L., & Byrd, T. A. (2018). Big data analytics: Understanding its capabilities and potential benefits for healthcare organizations. Technological Forecasting and Social Change, 126, 3-13.
Youssef, A. E. (2014). A framework for secure healthcare systems based on big data analytics in mobile cloud computing environments.
The purpose of this discussion is to analyze data privacy and governance as they relate to the healthcare industry. The discussion also addresses privacy violations, approaches to avoiding them, and the best practices the healthcare industry could adopt to avoid instances of data breach and privacy violation.
Data Privacy and Governance
The privacy protection of users' information depends on the capabilities of the technology adopted to extract, analyze, and correlate potentially sensitive datasets. In the age of Big Data and the emerging technologies to deal with it, protecting privacy is becoming more challenging than before. The development of Big Data tools requires security measures and safeguards to protect the privacy of users, and of patients in the healthcare industry in particular. Moreover, the data used for analytical purposes may include regulated information or intellectual property. Thus, Big Data professionals must comply with the regulations to ensure appropriate data use and protection (CSA, 2013).
With respect to privacy, there are various legislative and regulatory compliance issues. Many US regulations include privacy requirements, such as the Health Insurance Portability and Accountability Act (HIPAA), the Sarbanes-Oxley Act of 2002 (SOX), and the Gramm-Leach-Bliley Act. When privacy is violated, the affected individuals and organizations must be informed; otherwise, legal ramifications will follow. The issues of privacy must be addressed when allowing or restricting personal use of email, retaining email, recording phone conversations, gathering information about surfing or spending habits, and so forth (Stewart, Chapple, & Gibson, 2015).
Data Breach and Privacy Violation in Healthcare
According to a recent report published by HIPAA, the first three months of 2018 saw 77 healthcare data breaches reported to the Department of Health and Human Services' Office for Civil Rights (OCR). The report added that the impact of these breaches was significant, as more than one million patients and health plan members were affected; this is estimated to be almost twice the number of individuals impacted by healthcare data breaches in Q4 of 2017. Figure 1 illustrates this increasing trend in healthcare data breaches (HIPAA, 2018).
Figure 1: Q1, 2018 Healthcare Data Breaches (HIPAA, 2018).
As reported in the same report, the healthcare industry is unique with respect to data breaches because they are caused mostly by insiders; "insiders were behind the majority of breaches" (HIPAA, 2018). Other causes involve improper disposal, loss/theft, unauthorized access/disclosure incidents, and hacking incidents. The largest healthcare data breaches of Q1 2018 comprised 18 security breaches that each impacted more than 10,000 individuals, and hacking/IT incidents involved more records than any other breach cause, as illustrated in Figure 2 (HIPAA, 2018).
Figure 2. Healthcare Records Exposed by Breach Cause (HIPAA, 2018).
Healthcare providers were the worst affected by the healthcare data breaches of Q1 2018. Among the states, California was the worst affected, with 11 reported breaches, followed by Massachusetts with eight security incidents.
Best Practice
Organizations must follow four main principles to select the technologies and activities that will protect their confidential data assets. The first principle concerns the policies that the organization must honor throughout the confidential data lifespan: organizations should commit to processing all data in accordance with regulations and laws, protecting users' and patients' private information, and employing transparency to allow them to correct their information. The second principle concerns the risk of unauthorized access to or misuse of confidential information, which organizations should minimize by establishing a security system that provides administrative, technical, and physical measures to ensure the CIA Triad elements of Confidentiality, Integrity, and Availability of the data. The third principle concerns the impact of confidential data loss: organizations must establish a protection system that minimizes the impact of stolen or lost data by providing reasonable safeguards for data in storage and in transit, such as encryption to preserve confidentiality if data is lost or stolen, and they must develop an appropriate plan to respond to data breaches effectively and train all involved employees. The last principle involves documenting the applicable controls and demonstrating their effectiveness: organizations must verify the effectiveness of these principles' implementation using appropriate monitoring, auditing, and control systems, and must establish an appropriate system for handling non-compliance and the escalation path (Salido, 2010). Figure 3 summarizes these four principles, which organizations must adhere to in order to ensure compliance with the CIA Triad.
Figure 3: Four Principles for Data Privacy and Confidentiality. Adapted from (Salido, 2010).
In summary, organizations, including healthcare organizations, must adhere carefully to data privacy regulations and legislative rules to protect users and patients from data breaches. Organizations must commit to these four principles for data privacy and confidentiality, and must implement proper security policy and risk management to ensure the protection of private information as well as to minimize the impact of confidential data loss or theft.
References
CSA, C. S. A. (2013). Big data analytics for security intelligence. Big Data Working Group.
The purpose of this discussion is to analyze the role of Data Brokers in the healthcare industry. The discussion begins with the dark side of the Big Data industry, followed by the Data Broker business and some Data Broker use cases in healthcare.
The Dark Side of Big Data Industry
Big Data Analytics provides various benefits to organizations and businesses. With respect to healthcare, as indicated in (Wang, Kung, & Byrd, 2018), the benefits of Big Data cover areas such as IT infrastructure benefits, organizational benefits, managerial benefits, and strategic benefits. In IT infrastructure, healthcare organizations can benefit from reduced system redundancy, avoidance of unnecessary IT costs, quick data transfer among healthcare IT systems, and better use of healthcare systems. The organizational benefits of Big Data Analytics in healthcare include a reduction in surgery-related hospitalization, improvements in the quality and accuracy of clinical decisions, and reduced diagnosis time. The managerial benefits include quickly gaining insight into changing healthcare trends in the market, optimizing business-related decisions, and detecting fraud. Strategic benefits include the development of highly competitive healthcare services. Thus, the utilization of Big Data Analytics makes a difference in healthcare organizations in particular and in all businesses in general.
However, there is a dark side to Big Data. As indicated in (Martin, 2015), there are ethical issues in the Big Data industry. Big Data is about combining information from various sources to create knowledge, make better predictions, and tailor services. The ethical issues arise from reselling consumers' data into the secondary market for Big Data, and the Data Broker business is a prime example.
Data Broker Business: Data Broker is defined in (Gartner, n.d.) as follows: "A Data Broker is a business that aggregates information from a variety of sources; processes it to enrich, cleanse or analyze it; and licenses it to other organizations. Data Brokers can also license another company's data directly, or process another organization's data to provide them with enhanced results. Data is typically accessed via an application programming interface (API) and frequently involves subscription type contract. Data typically is not 'sold' (i.e., its ownership transferred), but rather it is licensed for particular or limited uses. (A data broker is also sometimes known as information broker, syndicated data broker, or information product company.)"
Big Data Means Big Business to Data Brokers: Use Cases in Healthcare
As discussed in (Tanner, 2016), Data Brokers make money off patients' medical records. By law, the patients' identity should be kept in a secure place for privacy protection. Organizations which sell medical information to data mining companies strip out the private information of the patients, such as Social Security numbers, names, and detailed addresses, to protect the identity of the patients (Tanner, 2016). However, the Data Brokers add unique numbers to each record they collect, which allow them to match disparate pieces of information to the same individual even if they do not know the patient's name. Such matching makes the information more valuable to the Data Brokers (Tanner, 2016). As indicated in (Hill, 2013), Data Brokers make money by selling lists of rape victims, alcoholics, and patients with AIDS/HIV and hundreds of other illnesses. There are about four thousand Data Brokers, and the roughly 320 million people in the U.S. cannot escape the Data Broker business. Moreover, Data Brokers and pharmacies such as Walgreens commercialize medical data (Leetaru, 2018).
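To make the matching mechanism above concrete, the following minimal Python sketch shows how a stable pseudonymous key derived from quasi-identifiers can link de-identified records from two different sources to the same individual. The field names and the hashing scheme are illustrative assumptions, not the brokers' actual method.

import hashlib

def pseudo_id(record):
    # Derive a stable identifier from quasi-identifiers that often survive
    # de-identification; no name or Social Security number is involved.
    quasi = f"{record['dob']}|{record['zip']}|{record['gender']}"
    return hashlib.sha256(quasi.encode()).hexdigest()[:16]

pharmacy = {"dob": "1970-03-14", "zip": "94110", "gender": "F", "rx": "atorvastatin"}
lab = {"dob": "1970-03-14", "zip": "94110", "gender": "F", "test": "lipid panel"}

# The same key emerges from both "anonymized" sources, so the two records can
# be joined into one longitudinal profile without knowing the patient's name.
assert pseudo_id(pharmacy) == pseudo_id(lab)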
In summary, although Big Data has tremendous benefits to organizations in many industries, including the healthcare industry, there is a major concern across the board about the privacy of people, including patients in the healthcare industry. More regulations must be enforced to protect people's privacy from the hands of businesses such as Data or Information Brokers. As indicated in (Slater, 2014), "The debate surrounding how to regulate big data is set to continue, both within the data broker context and beyond."
References
Wang, Y., Kung, L., & Byrd, T. A. (2018). Big data analytics: Understanding its capabilities and potential benefits for healthcare organizations. Technological Forecasting and Social Change, 126, 3-13.
The purpose of this discussion is to present a case study where Big Data Analytics was used to detect fraudulent data in the healthcare industry. The discussion also analyzes the case study and how Big Data Analytics can be used to eliminate or mitigate fraud in the healthcare industry.
Healthcare Fraud is Reality
The healthcare industry is valued at $2.7 trillion in the United States alone, and one-third of this value is lost to waste, fraud, and abuse. Common examples of fraud and abuse in healthcare include illegal medical billing for falsified claims, multiple claims filed by multiple providers, and stolen patient identities used to gain reimbursement for medical services never provided. Fraudulent billing is estimated at 3-10% of annual healthcare costs in the US (Halyna, 2017). As indicated in (Travaille, Mueller, Thornton, & J., 2011), an estimated $600 to $850 billion is lost annually to fraud, waste, and abuse in the U.S. healthcare system, of which $125 to $175 billion is due to fraudulent activities. An example of the magnitude of healthcare fraud is the Medicare fraud scheme of a Texas doctor who netted nearly $375 million; the fraud included $350 million for Medicare and more than $24 million for Medicaid (medicarefraudcenter.org, 2012). Healthcare fraud is an intentional deception used by medical practitioners to obtain unauthorized benefits (Joudaki et al., 2015). Frauds are categorized as criminal activities and must be detected and stopped. As indicated in (Rashidian, Joudaki, & Vian, 2012), combating fraud in the healthcare industry remains a challenge.
Thus, there is a serious need for effective fraud detection and prevention in the healthcare industry, and Big Data Analytics plays a significant role in it. As indicated in (Hitchcock, 2018), "Big data has massively transformed the healthcare industry in so many ways, contributing largely to today's more efficient value-based healthcare system." Various research studies showed that Big Data Analytics and Data Mining can be used effectively in fraud detection. For example, in the research study of (J. Liu et al., 2016), the researchers used graph analysis for detecting fraud, waste, and abuse in healthcare data. In (Bresnick, 2017), machine learning and Big Data Analytics are helping Anthem improve member engagement and detect fraud, waste, and abuse. In (Suleiman, Agrawal, Seay, & Grosky, 2014), Big Data is used to filter fraudulent Medicaid applications.
Case Study in Health Insurance
This case study is based on (Nelson, 2017). The health insurance company was using the traditional technique in fraud detection, which relied on executing SQL queries against a data warehouse storing massive amounts of claims, billings, and other information. This process took weeks or months before enough evidence for a legal case was developed, and the longer a fraud runs undetected, the larger the organization's losses.
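The following minimal sketch illustrates the traditional approach described above: a SQL aggregation over a claims table to surface providers whose billing stands out. The schema, data, and threshold are hypothetical, and an in-memory SQLite database is used so the example is self-contained.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE claims (provider_id TEXT, patient_id TEXT, amount REAL)")
conn.executemany("INSERT INTO claims VALUES (?, ?, ?)", [
    ("P1", "A", 120.0), ("P1", "B", 95.0),
    ("P2", "C", 4000.0), ("P2", "C", 4000.0), ("P2", "D", 3900.0),
])

# Flag providers whose average billed amount is well above the overall average.
rows = conn.execute("""
    SELECT provider_id, COUNT(*) AS n, AVG(amount) AS avg_amount
    FROM claims
    GROUP BY provider_id
    HAVING AVG(amount) > 1.5 * (SELECT AVG(amount) FROM claims)
""").fetchall()
print(rows)  # P2 stands out as a candidate for manual investigation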
The health insurance company then adopted Big Data techniques such as Predictive Analytics and Machine Learning for fraud detection. The integration of Big Data and search architecture proved to be the most feasible approach for fraud detection. Figure 1 illustrates the Big Data architecture for fraud detection utilized by the healthcare insurance company. This Big Data framework made the fraud detection effort more scalable, faster, and more accurate.
Figure 1. Big Data Analytics Architecture for Fraud Detection (Nelson, 2017).
As a result of integrating Big Data into its fraud detection system, the health insurance company experienced an immediate return on investment, saving $150 million used to prosecute large fraud cases. The advantages of the integration also include the analysis of 10+ million claims, 100+ million bill line details, and related records. The framework also provided more accurate information by computing key fraud indicators, automatically red-flagging datasets for suspicious activities, and leveraging all records rather than just a statistical sample.
Big Data Analytics Integration and Impact in Fraud Detection in Healthcare
Many organizations have employed Big Data and Data Mining in several areas, including fraud detection. Big Data Analytics can empower the healthcare industry to detect fraud and mitigate the impact of fraudulent activities. Several use cases such as (Halyna, 2017; Nelson, 2017) have demonstrated the positive impact of integrating Big Data Analytics into the fraud detection system.
Big Data Analytics and Data Mining offer various techniques such as classification models, regression models, and clustering models. The classification model employs logistic, tree, naïve Bayesian, and neural network algorithms, and can be used for fraud detection. The regression model employs linear and k-nearest-neighbor algorithms. The clustering model employs k-means, hierarchical, and principal component algorithms.
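As a hedged illustration of the classification approach, the sketch below trains a logistic regression on synthetic, labeled claim features; the feature names and distributions are assumptions for demonstration only.

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
# Hypothetical features: [billed amount, claims per month, distinct procedure codes]
legit = rng.normal([200, 10, 3], [50, 3, 1], size=(500, 3))
fraud = rng.normal([900, 40, 12], [200, 10, 4], size=(50, 3))
X = np.vstack([legit, fraud])
y = np.array([0] * 500 + [1] * 50)  # 1 = fraudulent claim

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
print("held-out accuracy:", clf.score(X_test, y_test))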
For instance, in (Q. Liu & Vasarhelyi, 2013), the researchers applied a clustering technique using an unsupervised data mining approach to detect fraud by insurance subscribers. In (Ekina, Leva, Ruggeri, & Soyer, 2013), the researchers applied Bayesian co-clustering with an unsupervised data mining method to detect conspiracy fraud involving more than one party. In (Capelleveen, 2013), the researchers employed an outlier detection technique using an unsupervised data mining method on dental claim data within Medicaid. In (Aral, Güvenir, Sabuncuoğlu, & Akar, 2012), the researchers used distance-based correlation with hybrid supervised and unsupervised data mining methods for prescription fraud detection. These research studies and use cases are examples of taking advantage of Big Data Analytics in healthcare fraud detection.
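In the spirit of the clustering studies above (though not reproducing their exact methods), the following sketch flags claims that fall into unusually small k-means clusters as candidate outliers; the data are synthetic.

import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(1)
claims = np.vstack([
    rng.normal([100, 2], [20, 0.5], size=(300, 2)),  # typical [amount, visits]
    [[950.0, 14.0], [880.0, 11.0]],                  # a few extreme claims
])

km = KMeans(n_clusters=3, n_init=10, random_state=1).fit(claims)
# Clusters holding less than 5% of the claims are treated as anomalous.
labels, counts = np.unique(km.labels_, return_counts=True)
small = labels[counts < 0.05 * len(claims)]
print("suspicious claim rows:", np.where(np.isin(km.labels_, small))[0])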
These studies indicate that Big Data Analytics can play a significant role in healthcare fraud detection.
References
Aral, K. D., Güvenir, H. A., Sabuncuoğlu, İ., & Akar, A. R. (2012). A prescription fraud detection model. Computer Methods and Programs in Biomedicine, 106(1), 37-46.
Joudaki, H., Rashidian, A., Minaei-Bidgoli, B., Mahmoodi, M., Geraili, B., Nasiri, M., & Arab, M. (2015). Using data mining to detect health care fraud and abuse: a review of the literature. Global Journal of Health Science, 7(1), 194.
Liu, J., Bier, E., Wilson, A., Guerra-Gomez, J. A., Honda, T., Sridharan, K., . . . Davies, D. (2016). Graph analysis for detecting fraud, waste, and abuse in healthcare data. AI Magazine, 37(2), 33-46.
Liu, Q., & Vasarhelyi, M. (2013). Healthcare fraud detection: A survey and a clustering model incorporating geo-location information.
Suleiman, M., Agrawal, R., Seay, C., & Grosky, W. (2014). Data-driven implementation to filter fraudulent Medicaid applications. Paper presented at SOUTHEASTCON 2014, IEEE.
The purpose of this project was to articulate all the steps for the installation and configuration of OpenStack and Amazon Web Services. The project begins with an overview of OpenStack and is divided into three main phases. Phase 1 discusses and analyzes the differences between the networking techniques in AWS and OpenStack. Phase 2 discusses the required configurations to deploy the OpenStack Controller, and also discusses and analyzes the expansion of OpenStack to include an additional node as the Compute Node. Phase 3 discusses the issues encountered during the installation and configuration of OpenStack and AWS services. A virtual bridge for the provider network was configured so that all VM traffic reaches the Internet through the external bridge. The floating IP also must be disabled to avoid packets being dropped when they reach AWS. In this project, OpenStack with the Controller Node and an additional Compute Node is deployed and accessed successfully using the Horizon dashboard. Elastic Compute Cloud (EC2) is also installed and configured successfully using the default VPC, the default Security Group, and Access Control List.
OpenStack is a result of initiatives from Rackspace and NASA in 2010, because NASA could not store its data in the Public Cloud for security reasons. OpenStack is an open source project which can be utilized by leading vendors to bring AWS-like capability and agility to the private cloud. OpenStack has grown since its inception in 2010 to include 500 member companies as part of the OpenStack Foundation, with platinum and gold members from the largest IT vendors globally. Examples of these platinum members include Red Hat, SUSE, IBM, Hewlett Packard Enterprise, Ubuntu, AT&T, and Rackspace (Armstrong, 2016).
OpenStack primarily provides an Infrastructure-as-a-Service (IaaS) function within the private cloud, where it makes centralized storage, commodity compute, and networking features available to end users to self-service their needs, through the Horizon dashboard or a set of common APIs. Many organizations deploy OpenStack in-house to build their own data centers. The implementation of OpenStack is less likely to fail when utilizing professional service support from known vendors, and it can create alternative solutions to Microsoft Azure and AWS. Examples of these professional service vendors include Red Hat, SUSE, HP, Canonical, and Mirantis; they provide different methods of installing the platform (Armstrong, 2016).
The release cycle of OpenStack is six months, during which an upstream release is created. The OpenStack Foundation creates the upstream release and governs it. Examples of public cloud deployments of OpenStack include AT&T, Rackspace, and GoDaddy. Thus, OpenStack is not exclusively used for private cloud. However, OpenStack has been increasingly popular as a Private Cloud alternative to the AWS Public Cloud, and it is now widely used for Network Function Virtualization (NFV) (Armstrong, 2016).
OpenStack
and AWS utilize different approaches to Networking. This section begins with AWS Networking,
followed by OpenStack Networking.
A Virtual Private Cloud (VPC) is a hybrid model comprising public and private cloud characteristics, and it is the default setting for new AWS users. The VPC can also be connected to a network of users or to the private data center of the organization. The underlying concept of connecting the VPC to the organization's private data center is the use of a customer gateway and a virtual private gateway (VPG). The VPG terminates two redundant VPN tunnels, which are instantiated from the private network of the user or the organization. The gateway of the organization exposes a set of external static addresses from the organization's site, which use Network Address Translation-Traversal (NAT-T) to hide the address. The organization can use one gateway device to access multiple VPCs. The VPC provides an isolated view of all provisioned instances. Identity and Access Management (IAM) of AWS is used to set up user accounts to access the VPC. Figure 1 illustrates an example of the AWS VPC with virtual machines or instances mapped to one or more security groups and connected to different subnets connected to the VPC router (Armstrong, 2016; AWS, 2017).
Figure 1. VPC of AWS showing multiple instances using Security Group.
The VPC simplifies networking in software, allowing users and organizations to perform a certain set of networking operations such as subnet mapping, Domain Name System (DNS) use, public and private IP address assignment, and security group and access control list application. When an organization creates a virtual machine or instance, a default VPC is assigned to it automatically. Every VPC comes with a default router which can have additional custom routes and routing priorities to forward traffic to specific subnets based on the requirements of the organizations and users. Figure 2 illustrates a VPC using private IPs, public IPs, and the Main Route Table, adapted from (Armstrong, 2016; AWS, 2017).
Figure 2. AWS VPC Configuration Example (AWS, 2017).
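The constructs above can be sketched with the boto3 SDK as follows. This is a minimal sketch, assuming configured AWS credentials; the CIDR blocks and region are illustrative, and running it would create billable resources.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

vpc_id = ec2.create_vpc(CidrBlock="10.0.0.0/16")["Vpc"]["VpcId"]
subnet_id = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.1.0/24")["Subnet"]["SubnetId"]

# An internet gateway plus a default route makes the subnet externally reachable.
igw_id = ec2.create_internet_gateway()["InternetGateway"]["InternetGatewayId"]
ec2.attach_internet_gateway(InternetGatewayId=igw_id, VpcId=vpc_id)
rtb_id = ec2.create_route_table(VpcId=vpc_id)["RouteTable"]["RouteTableId"]
ec2.create_route(RouteTableId=rtb_id, DestinationCidrBlock="0.0.0.0/0", GatewayId=igw_id)
ec2.associate_route_table(RouteTableId=rtb_id, SubnetId=subnet_id)
print(vpc_id, subnet_id)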
With respect to IP addressing in AWS, a mandatory private IP is assigned automatically to every virtual machine or instance, along with a public IP and DNS entry unless the instance is a dedicated instance. The private IP is used to route traffic among instances when a virtual machine needs to communicate with another virtual machine close to it on the same subnet. The public IP, on the other hand, is accessible through the Internet. If a persistent public IP address is needed for a virtual machine, AWS provides the Elastic IP address feature, which is limited to five per VPC account. When using Elastic IP addresses, the IP address can be remapped quickly to another instance in case of an instance failure. When using AWS, it can take up to 24 hours for the DNS Time To Live (TTL) of a public IP address to propagate. Moreover, AWS supports a Maximum Transmission Unit (MTU) of 1,500 for traffic passed to an instance, which the organization must consider for application performance (Armstrong, 2016; AWS, 2017).
AWS uses Security Groups and Access Control Lists. The Security Group (SG) in AWS is used to group a collection of access control rules with implicit denies, and it can be associated with one or more network interfaces of instances. The SG acts as the firewall for the instances. A default SG is applied automatically if no other security group is specified for the instantiated instance. The default SG allows all outbound traffic, and allows inbound traffic only from other instances within the same VPC; it cannot be deleted. With a custom SG, no inbound traffic is allowed, but all outbound traffic is allowed. The user can add Access Control List (ACL) rules associated with the SG to govern inbound traffic using the AWS console (Armstrong, 2016; AWS, 2017).
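The SG behavior described above can be sketched with boto3 as follows: a custom SG starts with no inbound access, and an ingress rule is then added explicitly. The VPC ID is a hypothetical placeholder.

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
sg_id = ec2.create_security_group(
    GroupName="web-sg",
    Description="example custom security group",
    VpcId="vpc-0123456789abcdef0",  # hypothetical VPC ID
)["GroupId"]

# Allow inbound HTTPS from anywhere; all outbound traffic is already allowed.
ec2.authorize_security_group_ingress(
    GroupId=sg_id,
    IpPermissions=[{
        "IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
        "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
    }],
)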
The VPC of AWS has access to different regions and availability zones of shared compute, dictating the data center in which the instance or virtual machine will be deployed. An availability zone (AZ) is an isolated location residing in a region, which is a geographic area isolated by design; thus, an AZ is a subset of a region. Organizations and users can place resources in different locations for redundancy and recovery considerations. AWS supports the use of more than one AZ when deploying production workloads. Moreover, organizations and users can replicate instances and data across regions (Armstrong, 2016; AWS, 2017).
The Elastic Load Balancing (ELB) feature is also offered by AWS and can be configured within a VPC. The ELB can be external or internal. When the ELB is external, it allows the creation of an internet-facing entry point into the VPC using an associated DNS entry, and it balances load among the instances in the VPC. An SG is assigned to the ELB to control access to the ports which need to be used (Armstrong, 2016; AWS, 2017).
OpenStack is deployed in a data center on multiple controllers, which contain all services of the OpenStack cloud. These controllers can be installed on virtual machines, bare-metal physical servers, or containers. When deployed in a production environment, they host all OpenStack services on a highly available and redundant platform. Different OpenStack vendors offer different installers; examples include Red Hat Director, Mirantis Fuel, HPE's installer, and Canonical's Juju. All these installers install controllers, and they are also used to scale out compute nodes in the OpenStack cloud (Armstrong, 2016; OpenStack, 2018b).
With respect to OpenStack services, there are eleven core services which are installed on the OpenStack controller. These core services include Keystone, Heat, Glance, Cinder, Nova, Horizon, RabbitMQ, Galera, Swift, Ironic, and Neutron. Figure 3 summarizes each core service of OpenStack (OpenStack, 2018a). The Neutron networking service is similar in its constructs to AWS networking (Armstrong, 2016; OpenStack, 2018b).
Figure 3. Summary of OpenStack Core Services (OpenStack, 2018a)
In OpenStack, a Project, also referred to as a Tenant, provides an isolated view of everything which a team has provisioned in the OpenStack cloud. Using the Keystone Identity service, different users can be set up for a Project (Tenant). These accounts can be integrated with LDAP directories such as Active Directory to support a customizable permission model (Armstrong, 2016; OpenStack, 2018b).
The Neutron service of OpenStack performs all networking-related tasks and functions, which include seven major steps. The first step is the creation of instances or virtual machines mapped to networks. The second step is the assignment of IP addresses using the built-in DHCP service. The third step is the application of DNS entries to instances from named servers. The fourth step is the assignment of private and floating IP addresses. The fifth step is the creation or association of the network subnets, followed by the sixth step of creating the routers. The last step is the application of the Security Groups (Armstrong, 2016; OpenStack, 2018b).
The compute nodes of OpenStack are deployed using a hypervisor which uses Open vSwitch. Most vendor distributions of OpenStack provide the KVM hypervisor by default, which gets deployed and configured on each compute node by the OpenStack installer. The compute nodes in OpenStack are connected to the access layer of the STP 3-tier model; in modern networks, they are connected to the Leaf switches, with VLANs connected to each compute node in the OpenStack cloud. Tenant networks are used to provide isolation among tenants and use VXLAN and GRE tunneling to connect the layer 2 networks (Armstrong, 2016; OpenStack, 2018b).
The configuration and setup of simple networking using Neutron in a Project (Tenant) requires two different networks: an internal network and an external network. The internal network is used for traffic among instances in the Project, where the subnet name and range are specified in the subnet. The external network is used to make the internal network accessible from outside of OpenStack. A router is also used in OpenStack to route packets between the networks and is associated with both; the external network is set as the router's gateway. The last step in the network configuration connects one router to an internal and an external network. Instances are provisioned in OpenStack onto the internal private network by selecting the private network NIC during the deployment of the instance. OpenStack assigns pools of public IPs, known as Floating IP addresses, from an external network for instances which need to be externally routable outside of OpenStack (Armstrong, 2016; OpenStack, 2018b).
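These tenant networking steps can be sketched with the openstacksdk Python client as follows. This is a minimal sketch, assuming a cloud named "mycloud" in clouds.yaml and an existing external (provider) network named "public"; both names are assumptions.

import openstack

conn = openstack.connect(cloud="mycloud")

# Internal network and subnet for traffic among the project's instances.
net = conn.network.create_network(name="internal")
subnet = conn.network.create_subnet(
    network_id=net.id, name="internal-subnet", ip_version=4, cidr="192.168.10.0/24"
)

# Router with the external network as its gateway, attached to the subnet.
ext = conn.network.find_network("public")
router = conn.network.create_router(
    name="tenant-router", external_gateway_info={"network_id": ext.id}
)
conn.network.add_interface_to_router(router, subnet_id=subnet.id)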
OpenStack uses SGs like AWS to set up firewall rules between instances. However, unlike AWS, which allows all outbound communications, OpenStack works with both ingress and egress ACL rules. SSH access must be configured as an ACL rule against the parent SG in OpenStack, which is pushed down through Open vSwitch into kernel space on each hypervisor. When the internal and external networks are set up and configured for the Project (Tenant), instances are ready to be launched on the private network, and users can access the instances from the Horizon dashboard (Armstrong, 2016; OpenStack, 2018b).
With respect to regions and availability zones, OpenStack, like AWS, uses regions and AZs. The compute nodes in OpenStack (hypervisors) can be assigned to different AZs, which are virtual separations of computing resources. An AZ in OpenStack can be segmented into host aggregates. However, a compute node can be assigned to only one AZ in OpenStack, while it can be a part of multiple host aggregates in the same AZ (Armstrong, 2016; OpenStack, 2018b).
OpenStack offers Load-Balancer-as-a-Service (LBaaS), which allows incoming requests to be distributed evenly among designated instances using a Virtual IP (VIP). Examples of popular LBaaS plugins in OpenStack include Citrix NetScaler, F5, HAProxy, and Avi Networks. The underlying concept of LBaaS in OpenStack is to let organizations and users employ LBaaS as a broker to the load balancing solutions, using the OpenStack APIs or the Horizon dashboard to configure the Load Balancer (Armstrong, 2016; OpenStack, 2018b).
This project deployed OpenStack on AWS and was limited to the configuration of the Controller Node; the OpenStack cloud was then expanded to add a Compute Node. The topology for this project is illustrated in Figure 4. Port 9000 will be configured to be accessed from the browser on the client. The Compute Node VM will use a different IP address than the OpenStack Controller Node. A private network will be configured using the Vagrant software, and a NAT interface will be configured and mapped to the Compute Node and the OpenStack Controller Node, as illustrated in Figure 4.
Figure 4. This Project’s Topology.
The Controller Node is configured with one processor, 4 GB memory, and 5 GB storage. The Compute Node is configured with one processor, 2 GB memory, and 10 GB storage. The installation must be performed on a 64-bit distribution on each node. VirtualBox and the Vagrant software are used in this project. Another piece of software, Sublime Text, is installed to edit the Vagrantfile and avoid any control characters at the end of each line, which can cause problems. The project uses the Pike release.
The project requires an AWS account to select the image which can be used for the OpenStack deployment. Multi-Factor Authentication is implemented to access the account. An Amazon Machine Image (AMI) for Elastic Compute Cloud (EC2) is selected from the pool of AMIs for this project. The Free Tier EC2 instance is configured with the default Security Group (SG) and Access Control List (ACL) rules, as discussed earlier. An EC2 AMI is a template which contains the software configuration, such as the operating system, application server, and applications, required to launch and instantiate the instance. The EC2 AMI is configured to use the default VPC.
The Controller Node is configured first to use the IP address identified in the topology. This configuration is implemented using the Vagrant software and a Vagrantfile.
Connect to the controller using the Vagrant software. To start the Controller from Vagrant, execute:
$vagrant up controller
Verify the Controller is running successfully.
$vagrant status
Verify the NAT address using eth0.
$ifconfig -a
Verify the Private IP Address using eth1. The IP address shows the same IP address configured in the configuration file.
Access the Controller Node of the OpenStack from the Browser using the Port 9000.
The OpenStack cloud is expanded by adding a Compute Node. The configuration of the Compute Node is performed using the Vagrantfile.
Connect to the Compute Node using the Vagrant command. The Compute Node uses node1 as the hostname. To start the Compute Node from Vagrant, execute the following command:
$vagrant up node1
Verify the Compute Node is running
successfully.
$vagrant status
Access node1 using SSH.
Check OpenStack services:
$sudo systemctl list-units devstack@*
Verify the NAT address using eth0.
$ifconfig -a
Verify the Private IP Address using
eth1. The IP address shows the same IP
address configured in the configuration file.
Access the Controller Node of OpenStack from the browser using port 9000. Verify the hypervisors from the Horizon interface.
Some issues were encountered during the deployment of OpenStack on AWS. The issue which impacted the EC2 AMI involved the MAC address, which must be registered in the AWS network environment. Moreover, the MAC address and the IP address must be mapped together, because packets will not be allowed to flow if the MAC address and the IP address do not match.
During the configuration of OpenStack Neutron networking, a virtual bridge for the provider network is configured so that all VM traffic reaches the Internet through the external bridge, which is backed by the actual physical NIC, eth1. Thus, a NIC with a special type of configuration is configured as the external interface, as shown in the topology for this project (Figure 4).
The floating IP must be disabled because it would send the packet through the router's gateway with a floating IP address as the source, which would result in the packets being dropped once they reach AWS, because they would reach the switch with an unregistered IP and MAC address. In this project, NAT is configured to access the public address externally, as shown in the topology in Figure 4.
The purpose of this project was to articulate all the steps for the installation and configuration of OpenStack and Amazon Web Services. The project began with an overview of OpenStack and was divided into three main phases. Phase 1 discussed and analyzed the differences between the networking techniques in AWS and OpenStack. Phase 2 discussed the required configurations to deploy the OpenStack Controller, and also discussed and analyzed the expansion of OpenStack to include an additional node as the Compute Node. Phase 3 discussed the issues encountered during the installation and configuration of OpenStack and AWS services. A virtual bridge for the provider network was configured so that all VM traffic reaches the Internet through the external bridge. The floating IP also had to be disabled to avoid packets being dropped when they reach AWS. In this project, OpenStack with the Controller Node and an additional Compute Node was deployed and accessed successfully using the Horizon dashboard. Elastic Compute Cloud (EC2) was also installed and configured successfully using the default VPC, the default Security Group, and Access Control List.
The purpose of this discussion is to analyze the use of Cloud Computing technology in the healthcare industry. It also discusses and analyzes the present issues related to healthcare data in the Cloud and the advantages and disadvantages of placing the data in the Public and Private Cloud. The discussion also provides a use case scenario.
Healthcare in Cloud Computing
As indicated in (Chen & Hoang, 2011), the healthcare industry is moving slowly toward Cloud Computing technology due to the sensitive nature of healthcare data. Healthcare organizations fear employing Cloud Computing because of privacy and security issues which can cause data to leak from the Cloud to unauthorized users. Various researchers have exerted tremendous effort to propose cloud frameworks that ensure data protection for the healthcare industry.
In (Chen & Hoang, 2011), the researchers proposed a robust data protection framework surrounded by a chain of protection schemes, from Access Control and Monitoring to Active Auditing. The proposed framework includes three major models in this chain of protection schemes. The first component is the Cloud-based, Privacy-aware, Role-based Access Control (CPRBAC) model. The second is the Triggerable Data File Structure (TDFS) model. The third is the Active Auditing Scheme (AAS). In (Regola & Chawla, 2013), the researchers presented a prototype infrastructure in Amazon's Virtual Private Cloud to allow researchers and practitioners to utilize the data in a HIPAA-compliant environment. In (Yu, Kollipara, Penmetsa, & Elliadka, 2013), the researchers provided an approach for a distributed storage system using a combination of RDBMS and NoSQL databases to ensure optimal system performance and scalability. These three research studies are examples of the tremendous effort exerted by researchers in the healthcare domain to ensure security.
Healthcare Use Case
The Healthcare Information System supports clinical and medical activities related to patient care. The system is an integration of several components where each component serves a specific need of a medical system. These components include Radiology Information System (RIS), Picture Archiving and Communication System (PACS), Laboratory Information System (LIS), and Policy and Procedure Management System (PPMS) (Yu et al., 2013).
In (Yu et al., 2013), the researchers focused on the RIS, which is software used to manage patients and their radiology data such as ultrasound scans, X-rays, CT scans, audio, and video. Patient activity management includes examination scheduling, patient data processing and monitoring, and analysis of patient record statistics. Radiology data management includes the processing of file records, formatting and storing radiology data with a digital signature, and tracking the film records. The RIS deals with very large volumes of unstructured and structured data; it is often used with the PACS and requires very large storage space.
The researchers examined two NoSQL databases
for this project: MongoDB and Cassandra. They found that MongoDB is more apt for
Healthcare Information Systems. Table 1 summarizes the comparison between
MongoDB and Cassandra, adapted from (Yu et al., 2013).
Table 1. Comparison between MongoDB and Cassandra for Healthcare data (Yu et al., 2013).
The RIS framework in this project included the System Architecture and the Cloud Architecture. The System Architecture was deployed in AWS using EC2 (Elastic Compute Cloud), which can be accessed by request from a browser using HTML or from a mobile client application. The application server was placed in the Public Cloud, and the database was placed in the Private Cloud. When the system requires communication with the database in the Private Cloud, the request must go through various security measures and pass through the Private Cloud security and the firewall to connect to the storage server. The request talks to either the SQL or the NoSQL database based on the data management logic model, as sketched below. The Cloud Architecture involved a Public Cloud and a Private Cloud. The Private Cloud was used to store all sensitive data. The storage server controls the SQL and NoSQL databases along with the security and backup capabilities and functionalities. A NAS server was used as the storage solution to deal with the large volume of healthcare data (Yu et al., 2013).
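A tiny sketch of such a data management logic model is shown below: each incoming record is routed to the relational or the NoSQL store according to its structure. The field names and the routing rule are illustrative assumptions, not the system's actual logic.

def route(record):
    # Fixed-schema rows go to the RDBMS; documents, blobs, and free text go to NoSQL.
    structured_fields = {"patient_id", "exam_date", "modality"}
    return "rdbms" if structured_fields.issuperset(record) else "nosql"

print(route({"patient_id": "A1", "exam_date": "2013-01-02", "modality": "CT"}))   # rdbms
print(route({"patient_id": "A1", "scan_notes": "free text", "image_ref": "x.dcm"}))  # nosql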
Advantages and Disadvantages of healthcare data in the Cloud
Cloud Computing offers various advantages to several industries, including the healthcare industry. The major benefits of using Cloud Computing technology for healthcare include scalability, data storage, data sharing and availability, reliability and efficiency, and cost reduction (Pullarao & Thirupathi Rao, 2013). The major challenge when using Cloud Computing in the healthcare industry is security. However, as demonstrated in the above use case, the risk of leaking data from the Cloud to unauthorized users can be mitigated by using the Private Cloud, which has additional security measures. The Public Cloud should never be used for storing sensitive data; in the above use case, it was used only for the application layer, with security measures such as access control and SSL to access the data from the browser.
References
Chen, L., & Hoang, D. B. (2011). Towards scalable, fine-grained, intrusion-tolerant data protection models for healthcare cloud. Paper presented at the 2011 IEEE 10th International Conference on Trust, Security and Privacy in Computing and Communications.
Pullarao, K., & Thirupathi Rao, K. (2013). A secure approach for storing and using health data in a private cloud computing environment. International Journal of Advanced Research in Computer Science, 4(9).
Regola, N., & Chawla, N. (2013). Storing and using health data in a virtual private cloud. Journal of Medical Internet Research, 15(3), e63.
Yu, W. D., Kollipara, M., Penmetsa, R., & Elliadka, S. (2013). A distributed storage solution for cloud-based e-Healthcare Information System. Paper presented at the 2013 IEEE 15th International Conference on e-Health Networking, Applications, and Services (Healthcom 2013).
The purpose of this discussion is to analyze the current state of data storage for Big Data and the impact of Big Data storage on organizational processes.
Big Data and Big Data Analytics Brief Overview
The term Big Data refers to the explosive growth in the volume of data which is difficult to store, process, and analyze. Volume, however, is only one feature: Big Data is characterized by the major 3Vs of volume, variety, and velocity. The variety of the data is reflected by the different types of data collected from sensors, smartphones, or social networks; the data collected thus includes unstructured and semi-structured types besides the structured type. The velocity characteristic reflects the speed of data transfer, where the content of the data is continuously changing. Big Data is classified even further to include Data Sources, Content Format, Data Stores, Data Staging, and Data Processing. Figure 1 summarizes the Big Data classifications, adapted from (Hashem et al., 2015).
Figure 1. Big Data Classification. Adapted from (Hashem et al., 2015).
Big Data without Analytics has no value.
Big Data Analytics (BDA) is the process of examining large datasets
containing a variety of data types such as unstructured, semi-structured and
structured. The purpose of the BDA is to uncover hidden patterns, market
trends, unknown correlations, customer preferences and other useful business
information that can help the organization (Arora & Bahuguna, 2016). BDA has been
used in various industries such as healthcare.
Big Data Storage
The explosive growth of data has challenged the capabilities of existing storage technologies to store and manage data. Organizations have been utilizing traditional storage techniques to store data in structured relational databases. However, Big Data and BDA require distributed storage technology based on Cloud Computing instead of local storage attached to a computer or electronic device. Cloud Computing technologies provide a powerful framework which performs complex large-scale computing tasks and spans a range of IT functions, from storage and computation to database and application services. Organizations and users adopt Cloud Computing technologies because of the need to store, process, and analyze large amounts of data (Hashem et al., 2015).
Various storage technologies have emerged to meet the requirements of dealing with large volumes of data. These storage technologies include Direct Attached Storage (DAS), Network Attached Storage (NAS), and the Storage Area Network (SAN). When using DAS, various hard disk drives are directly connected to the servers, and each hard disk drive receives a certain amount of I/O resource managed by the application. The DAS technology is a good fit for servers that are interconnected on a small scale. The NAS technology provides a storage device which is attached to a network through a switch or hub via TCP/IP protocols. When using NAS, data is transferred as files. The I/O burden in NAS is lighter than in DAS because the NAS server can indirectly access a storage device through the network. NAS technology is oriented toward networks, including scalable and bandwidth-intensive networks such as high-speed optical-fiber networks. The SAN storage system is independent of storage on the local area network; data management and sharing are maximized by multipath data switching conducted among internal nodes. The organizational data storage systems of DAS, NAS, and SAN can be divided into three parts: the disk array, the connection and network sub-systems, and the storage management software. The disk array provides the storage system. The connection and network sub-systems provide connectivity to one or more disk arrays and servers. The storage management software handles data sharing, storage management, and disaster recovery tasks for multiple servers (Hashem et al., 2015).
When dealing with Big Data and BDA, the storage system is not physically separated from the processing system. There are various storage types, such as hard drives, solid-state memory, object storage, optical storage, and cloud storage. Each type has advantages as well as limitations. Thus, organizations must examine the goals and objectives of the data storage prior to selecting any of these storage media. Table 1 shows a comparison of storage media, adapted from (Hashem et al., 2015).
Table 1. Comparison of Storage Media. Adapted from (Hashem et al., 2015).
The Hadoop Distributed File System (HDFS) is a primary component of the Hadoop technology, which emerged to deal with Big Data and BDA; the other major component is MapReduce. The Hadoop framework is described as the de facto standard for Big Data storage and processing (Jinquan, Jie, Shengsheng, Yan, & Yuanhao, 2012). The HDFS is a distributed file system designed to run on top of the local file systems of the cluster nodes. It stores extremely large files for streaming access. HDFS is highly fault-tolerant and can scale from a single server to thousands of nodes, where each node offers local computation and storage.
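A minimal sketch of working with HDFS from Python is shown below, wrapping the standard hdfs dfs command-line interface; it assumes a running Hadoop cluster with the hdfs client on the PATH, and the paths are illustrative.

import subprocess

def hdfs(*args):
    # Run an HDFS shell command and fail loudly on errors.
    subprocess.run(["hdfs", "dfs", *args], check=True)

hdfs("-mkdir", "-p", "/data/claims")         # create a directory in HDFS
hdfs("-put", "claims.csv", "/data/claims/")  # copy a local file into HDFS
hdfs("-ls", "/data/claims")                  # list it; blocks are replicated across nodes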
Cloud Computing technology can meet the requirements of Big Data and BDA, offering an effective framework and platform for computation as well as for storage. Thus, organizations which intend to take advantage of Big Data and BDA utilize Cloud Computing technology. However, the use of Cloud Computing does not come without a price: security and privacy have been major concerns for Cloud Computing users and organizations. Although Cloud Computing offers several benefits to organizations, from scalability and fault tolerance to data storage, it is curbed by security and privacy concerns. Organizations must take appropriate security measures for data in storage, in transit, and in processing, such as SSL, encryption, access control, Multi-Factor Authentication, and so forth.
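As a small sketch of one such safeguard, the snippet below encrypts a record with a symmetric key before it would be uploaded to cloud storage, using the Python cryptography package; the record content is illustrative.

from cryptography.fernet import Fernet

key = Fernet.generate_key()  # in practice, held in a key management service
f = Fernet(key)

record = b'{"patient_id": "A102", "diagnosis": "I25.1"}'
token = f.encrypt(record)    # ciphertext is safe to place in cloud storage
assert f.decrypt(token) == record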
In summary, Big Data comes with big storage requirements. Organizations have been facing various challenges when dealing with Big Data, such as data storage and data processing. The data storage issue is partially solved by Cloud Computing technology. However, until the security and privacy issues are resolved in the Cloud Computing platform, organizations must apply robust security measures to mitigate and alleviate the security risks.
References
Arora, M., & Bahuguna, H. (2016). Big Data Security – The Big Challenge.
Hashem, I. A. T., Yaqoob, I., Anuar, N. B., Mokhtar, S., Gani, A., & Khan, S. U. (2015). The rise of "big data" on cloud computing: Review and open research issues. Information Systems, 47, 98-115.
Jinquan, D., Jie, H., Shengsheng, H., Yan, L., & Yuanhao, S. (2012). The Hadoop Stack: New paradigm for Big Data storage and processing. Intel Technology Journal, 16(4), 92-110.
The purpose of this discussion is to create the building blocks of a system for healthcare Big Data Analytics and compare the building block design to a DNA networked cluster currently used by an organization in the current market.
The discussion begins with the Cloud Computing
Building Blocks, followed by Big Data Analytics Building Blocks, and DNA
Sequencing. The discussion also addresses the building blocks for the health
analytics and the building blocks for DNA Sequencing System, and the comparison
between both systems.
Cloud Computing Building Blocks
The Cloud Computing model contains two elements, the front end and the back end, both connected to the network. The user interacts with the system using the front end, while the cloud itself is the back end. The front end is the client which the user uses to access the cloud through a device such as a smartphone, tablet, or laptop. The back end, represented by the Cloud, provides the applications, computers, servers, and data storage which create the services (IBM, 2012).
As indicated in (Macias & Thomas, 2011),
three building blocks are required to enable Cloud Computing. The first block
is the “Infrastructure,” where the organization
can optimize data center consolidation, enhance network performance, connect
anyone, anywhere seamlessly, and implement pre-configured solutions. The second block is the “Applications,” where
the organization can identify
applications for rapid deployment, and utilize automation and orchestration
features. The third block is the
“Services,” where the organization can
determine the right implementation model, and create a phased cloud migration
plan.
In (Mousannif, Khalil, & Kotsis, 2013-14), the building blocks for Cloud Computing involve the physical layer, the virtualization layer, and the service layer. Virtualization is a basic building block in Cloud Computing: it is the technology which hides the physical characteristics of the computing platform from the front-end users and provides an abstract, emulated computing platform. Clusters and grids are features of Cloud Computing for high-performance computing applications such as simulations. Other building blocks of Cloud Computing include Service-Oriented Architectures (SOA) and Web Services (Mousannif et al., 2013-14).
Big Data Building Blocks
As indicated in (Verhaeghe, n.d.), there are four major building blocks for Big Data Analytics. The first building block is Big Data Management, to enable the organization to capture, store, and protect the data. The second building block is Big Data Analytics, to extract value from the data. The third is Big Data Integration, to ensure the application of governance over the data. The last building block is Big Data Applications, for the organization to apply the first three building blocks using the Big Data technologies.
DNA Sequencing
DNA stands for Deoxyribonucleic Acid, which represents the smallest building block of life (Matthews, 2016). As indicated in (Salzberg, 1999), advances in biotechnology have produced enormous volumes of DNA-related information; however, the rate of data generation is outpacing the ability of scientists to analyze the data. DNA Sequencing is a technique used to determine the order of the four chemical building blocks, called "bases," which make up the DNA molecule (genome.gov, 2015). The sequence provides the kind of genetic information carried in a particular DNA segment. DNA sequencing can provide valuable information about the role of inheritance in susceptibility to disease and response to environmental influences. Moreover, DNA sequencing enables rapid and cost-effective diagnosis and treatment. Markov chains and hidden Markov models are probabilistic techniques which can be used to analyze the results of DNA sequencing (Han, Pei, & Kamber, 2011). An example of a DNA Sequencing application is discussed and analyzed in (Leung et al., 2011), where the researchers employed Data Mining on DNA sequence datasets for the Hepatitis B Virus.
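To illustrate the Markov-chain idea mentioned above, the short sketch below estimates base-to-base transition probabilities from a toy DNA sequence; a first-order model like this underlies more elaborate hidden Markov model analyses.

from collections import Counter

seq = "ATGCGATACGCTTAGGCTAACGGTA"  # toy sequence for illustration
pair_counts = Counter(zip(seq, seq[1:]))
base_counts = Counter(seq[:-1])

# P(next base | current base): a first-order Markov model of the sequence.
probs = {(a, b): n / base_counts[a] for (a, b), n in pair_counts.items()}
for (a, b), p in sorted(probs.items()):
    print(f"P({b}|{a}) = {p:.2f}")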
DNA Sequencing used to be performed on non-networked computers, using a limited subset of data due to limited computer processing speed (Matthews, 2016). However, DNA Sequencing has since benefited from various advanced technologies and techniques. Predictive Analytics is an example: applied to DNA Sequencing, it results in Predictive Genomics. Cloud Computing plays a significant role in the success of Predictive Genomics for two major reasons: the first is the volume of the genomic data, while the second is the low cost (Matthews, 2016). Cloud Computing is becoming a valuable tool for various domains, including DNA Sequencing. As cited in (Blaisdell, 2017), a Transparency Market Research study showed that the healthcare Cloud Computing market is going to evolve further, reaching up to $6.8 billion by 2018.
Building Block for Healthcare System
Healthcare data requires protection due to security and privacy concerns; thus, a Private Cloud will be used in this use case. To build a Private Cloud, the virtualization layer, the physical layer, and the service layer are required. The virtualization layer consists of a hypervisor to allow multiple operating systems to share a single hardware system. The hypervisor is a program which controls the host processors and resources by allocating the resources to each operating system. There are two types of hypervisors: native (also called bare-metal, or Type 1) and hosted (Type 2). Type 1 runs directly on the physical hardware, while Type 2 runs on a host operating system which runs on the physical hardware. Examples of native hypervisors include VMware's ESXi and Microsoft's Hyper-V; examples of hosted hypervisors include Oracle VirtualBox and VMware Workstation. The physical layer can consist of two computer pools, one for PCs and the other for servers (Mousannif et al., 2013-14).
In (Archenaa & Anita, 2015), the researchers illustrated a secure Healthcare Analytics System. The electronic health record is a heterogeneous dataset which is given as input to HDFS through Flume and Sqoop. The analysis of the data is performed using MapReduce and Hive by implementing a Machine Learning algorithm to analyze similar patterns of data and to predict the risk of a patient's health condition at an early stage. The HBase database is used for storing the multi-structured data. Storm is used to perform live streaming and to flag emergency conditions, such as a patient's temperature falling beyond the expected level. A Lambda function is also used in this healthcare system. The final building block of the healthcare system involves the reports generated by the top layer tools such as "Hunk." Figure 1 illustrates the Healthcare System, adapted from (Archenaa & Anita, 2015).
Building Block for DNA and Next Generation Sequencing System
Besides DNA Sequencing, there is next-generation sequencing (NGS), which has been increasing exponentially since 2007 (Bhuvaneshwar et al., 2015). In (Bhuvaneshwar et al., 2015), the Globus Genomics System is proposed as an enhanced Galaxy workflow system made available as a service, offering users the capability to process and transfer data easily, reliably, and quickly. This system addresses the end-to-end NGS analysis requirements and is implemented using the Amazon Cloud Computing infrastructure. Figure 2 illustrates the framework for the Globus Genomics System, taking into account the security measures for protecting the data. Examples of healthcare organizations which are using genomic sequencing include Kaiser Permanente in Northern California and Geisinger Health System in Pennsylvania (Khoury & Feero, 2017).
Figure 2. Globus Genomics System for Next Generation Sequencing (NGS). Adapted from (Bhuvaneshwar et al., 2015).
In summary, Cloud Computing has reshaped the healthcare industry in many aspects. Healthcare Cloud Computing and Analytics provide many benefits, from easy access to electronic patient records to DNA Sequencing and NGS. The building blocks of Cloud Computing must be implemented with care for security and privacy considerations to protect the patients' data from unauthorized users. The building blocks for the Healthcare Analytics system involve advanced technologies such as Hadoop, MapReduce, Storm, and Flume, as illustrated in Figure 1. The building blocks for the DNA Sequencing and NGS system involve the Dynamic Worker Pool, HTCondor, a Shared File System, the Elastic Provisioner, Globus Transfer and Nexus, and Galaxy, as illustrated in Figure 2. Each system has the required building blocks to perform the analytics tasks.
References
Archenaa, J., & Anita, E. M. (2015). A survey of big data analytics in healthcare and government. Procedia Computer Science, 50, 408-413.
Bhuvaneshwar, K., Sulakhe, D., Gauba, R., Rodriguez, A., Madduri, R., Dave, U., . . . Madhavan, S. (2015). A case study for cloud-based high throughput analysis of NGS data using the Globus Genomics system. Computational and Structural Biotechnology Journal, 13, 64-74.
Leung, K., Lee, K., Wang, J., Ng, E. Y., Chan, H. L., Tsui, S. K., . . . Sung, J. J. (2011). Data mining on DNA sequences of hepatitis B virus. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 8(2), 428-440.
Mousannif, H., Khalil, I., & Kotsis, G. (2013-14). Collaborative learning in the clouds. Information Systems Frontiers, 15(2), 159-165. doi:10.1007/s10796-012-9364-y
Salzberg, S. L. (1999). Gene discovery in DNA sequences. IEEE Intelligent Systems and their Applications, 14(6), 44-48.
The purpose of this project is to articulate all the steps conducted to perform an analysis of a heart disease use case. The project contained two main phases: Phase 1, Sandbox Configuration, and Phase 2, the Heart Disease Use Case. The setup and configuration were not trivial and required the integration of Hive with MapReduce and Tez, as well as the integration of R and RStudio with Hive to perform transactions to retrieve and aggregate data. The analysis included descriptive analysis for all patients, then drilled down to focus on gender (female and male). Moreover, the analysis included the Decision Tree and the Fast-and-Frugal Trees (FFTrees). The researcher of this paper is in agreement with other researchers that Big Data Analytics and Data Mining can play a significant role in healthcare in various areas such as patient care, healthcare records, and fraud detection and prevention.
Keywords: Decision Tree, Diagnosis of Heart Disease.
The medical records and the databases to store these records are increasing rapidly. This rapid increase is leading researchers and practitioners to employ Big Data technologies. Data Mining techniques play a significant role in finding patterns and extracting knowledge to provide better patient care and effective diagnostic capabilities. As indicated in (Koh & Tan, 2011), "In healthcare, data mining is becoming increasingly popular, if not increasingly essential." Healthcare can benefit from Data Mining applications in various areas such as the evaluation of treatment effectiveness, customer and patient relationship management, healthcare management, and fraud detection and prevention. Other benefits include predictive medicine and the analysis of DNA microarrays.
Various research studies have employed Data Mining techniques in healthcare. In (Alexander & Wang, 2017), the main objective of the study was to identify the usage of Big Data Analytics to predict and prevent heart attacks; the results showed that Big Data Analytics is useful in predicting and preventing heart attacks. In (Dineshgar & Singh, 2016), the purpose of the study was to develop a prototype Intelligent Heart Disease Prediction System (IHDPS) using Data Mining techniques. In (Karthiga, Mary, & Yogasini, 2017), the researchers utilized Data Mining techniques to predict heart disease using the Decision Tree algorithm and Naïve Bayes; the results showed a prediction accuracy of 99%. Thus, Data Mining techniques enable the healthcare industry to predict patterns. In (Kirmani & Ansarullah, 2016), the researchers also applied Data Mining techniques with the aim of investigating the results of different types of Decision Tree methods to obtain better performance in heart disease prediction. These research studies are examples of the vast literature on the use of Big Data Analytics and Data Mining in the healthcare industry.
In this project, the heart disease dataset is utilized as the use case for a Data Mining application. The project used the Hortonworks sandbox, with Hive, MapReduce, and Tez. The project also integrated R with Hive to perform statistical analysis, including the Decision Tree method. The project utilized techniques from various research studies such as (Karthiga et al., 2017; Kirmani & Ansarullah, 2016; Martignon, Katsikopoulos, & Woike, 2008; Pandey, Pandey, & Jaiswal, 2013; Phillips, Neth, Woike, & Gaissmaier, 2017; Reddy, Raju, Kumar, Sujatha, & Prakash, 2016).
The project begins with Phase 1, the Sandbox Configuration, followed by Phase 2, the Heart Disease Use Case. The Sandbox configuration included the environment setup, from mapping the sandbox IP for the Ambari Console management to the integration of R and RStudio with Hive. The Heart Disease Use Case involved fourteen steps, starting from understanding the dataset to the analysis of the results. The project articulates the steps and the commands as required.
After the installation, the environment must be configured to function fully using the following steps.
1.1 IP Address
and HTTP Web Port
After the Sandbox is installed, the host must use an IP address depending on the virtual machine (VMware, VirtualBox) or container (Docker). After the installation is finished, the local IP address is assigned with the HTTP web access port of 8888, as shown below. Thus, local access uses http://127.0.0.1:8888/
The IP address can be mapped to a hostname using the hosts file, as shown below. After setting the hostname to replace the IP address, the sandbox can be accessed from the browser using http://hortonworks-sandbox.com:8888.
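For example, the mapping can be a single entry in the hosts file such as the following (assuming the sandbox is reached locally):

127.0.0.1    hortonworks-sandbox.com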
There are five major users with different roles in the Hortonworks Sandbox. These users, with their roles, services, and passwords, are summarized in Table 1.
Table 1. Users, Roles, Services, and Passwords.
With respect to access, PuTTY can be used to reach the sandbox over SSH on port 2222. On first login, the root user is prompted to specify a new password.
The shell web client, also known as Shell-in-a-Box, allows users to issue shell commands without installing additional software; it uses port 4200. The admin password can be reset using the shell web client.
The admin can manage Ambari through the web interface on port 8080 at http://hortonworks-sandbox.com:8080 using the admin user and password. The admin can operate the cluster, manage users and groups, and deploy views. The cluster section is the primary UI for Hadoop operators and allows the admin to grant permissions to Ambari users and groups.
Test the query using MapReduce as the execution engine. The following query was run using MapReduce.
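The exact query depends on the local environment; as an illustration only, a query of this kind can be issued from R through a JDBC connection to HiveServer2 using the RJDBC package. The driver jar path, the maria_dev credentials, and the heart_disease table name are assumptions that depend on the local installation:

    # Minimal sketch: connect R to Hive over JDBC (paths and names are assumptions)
    library(RJDBC)
    drv  <- JDBC("org.apache.hive.jdbc.HiveDriver",
                 "/usr/hdp/current/hive-client/lib/hive-jdbc-standalone.jar")
    conn <- dbConnect(drv, "jdbc:hive2://hortonworks-sandbox.com:10000/default",
                      "maria_dev", "maria_dev")

    # Force MapReduce as the execution engine for this session
    dbSendUpdate(conn, "SET hive.execution.engine=mr")

    # An illustrative aggregation over the heart disease table
    dbGetQuery(conn, "SELECT sex, AVG(chol) AS avg_chol
                      FROM heart_disease GROUP BY sex")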
Configure Tez as the Execution Engine in Hive
The user can also modify the value of hive.execution.engine from mr to tez so that Hive runs on the Tez execution engine and takes advantage of a DAG execution representing the query, instead of multiple stages of a MapReduce program that involve a lot of synchronization, barriers, and I/O overheads.
Click on the Hive Settings tab.
Click Add New and add the Key: Value pair sketched below.
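The central property is the execution engine itself; a minimal sketch of the pair, assuming the default Ambari Hive view (additional Tez-tuning properties may also be added):

    hive.execution.engine : tez

Equivalently, the statement SET hive.execution.engine=tez; can be issued at the start of a Hive session to switch the engine for that session only.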
This integration is implemented on Tez to take advantage of the Directed Acyclic Graph (DAG) execution. Tez improves on this technique by writing intermediate datasets to memory instead of to hard disk.
The analysis of the heart disease dataset included descriptive analysis and a decision tree. The results of the descriptive analysis showed that the minimum age among the heart disease patients is 29 years old, while the maximum age is 79, with a median and mean of 52 years old. The results also showed that the minimum cholesterol level for these patients is 126, while the maximum is 564, with a median of 236 and a mean of 244, indicating that the cholesterol level also increases with age.
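A minimal sketch of this descriptive step in R, assuming the Hive result set has been pulled into a data frame named heart whose age and chol column names follow the common UCI heart disease layout (an assumption):

    # Five-number summaries and means for age and cholesterol
    summary(heart$age)    # reported above: min 29, median and mean 52, max 79
    summary(heart$chol)   # reported above: min 126, median 236, mean 244, max 564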
The descriptive analysis then drilled down to gender (female vs. male) to identify the impact of age on cholesterol for the heart disease patients. The results showed the same trend among the female heart disease patients, with a minimum age of 34 and a maximum age of 76, and a median and mean of 54. The cholesterol level among female heart disease patients ranges from a minimum of 141 to a maximum of 564, with a median of 250 and a mean of 257. The maximum cholesterol level of 564 is an outlier, with another outlier at the age of 76. With respect to the male heart disease patients, the results showed the same trend: a minimum age of 29 years old and a maximum age of 70 years old, with a median of 52 and a mean of 51. The cholesterol level among these male heart disease patients showed a minimum of 126 and a maximum of 325, with a median and mean of 233. There is another outlier among the male heart disease patients at the age of 29. Due to these outliers among the female and male heart disease patients, the comparison between male and female patients will not be accurate. However, Figure 5 and Figure 6 show similarities in the impact of age on the cholesterol level between both genders.
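The gender drill-down can be sketched in R as follows, assuming sex is coded 1 for male and 0 for female, as in the common UCI heart disease data (an assumption for this dataset):

    # Descriptive statistics of cholesterol by gender
    by(heart$chol, heart$sex, summary)

    # Age vs. cholesterol per gender, as in Figure 5 and Figure 6
    plot(chol ~ age, data = subset(heart, sex == 0), main = "Female patients")
    plot(chol ~ age, data = subset(heart, sex == 1), main = "Male patients")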
With regard to the decision tree, the first decision tree shows the data partitioned among six nodes. The first two nodes are for the Resting Blood Pressure attribute (RestingBP). The first node of this cluster shows 65 heart disease patients with a RestingBP of 138 or less, while the second node shows 41 heart disease patients with a RestingBP greater than 138. These two nodes cover patients whose vessels value is zero or less and whose heart rate is 165 or less. For a vessels value that exceeds zero, there is another node with 95 patients. The second set of nodes is for heart rates greater than 165. These nodes split on the vessels attribute: zero or fewer vessels, and more than zero vessels. Two nodes fall under the first category of zero or less, with 22 heart disease patients having a heart rate of 172 or less. The last node shows a vessels value greater than zero, with 15 heart disease patients. The FFTrees results show that the high-risk heart disease patients are those with a vessels value greater than zero, while the low-risk patients are those with a vessels value of zero or less.
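A minimal sketch of the two tree models in R, using the rpart package for the classical decision tree and the FFTrees package of Phillips et al. (2017); the column names target, RestingBP, heartRate, and vessels are assumptions based on the attributes discussed above:

    library(rpart)     # recursive-partitioning decision trees
    library(FFTrees)   # fast-and-frugal trees (Phillips et al., 2017)

    # Classical decision tree over the attributes analyzed above
    fit <- rpart(target ~ RestingBP + heartRate + vessels,
                 data = heart, method = "class")
    plot(fit); text(fit, use.n = TRUE)   # plot the splits with node counts

    # Fast-and-frugal tree separating high-risk from low-risk patients
    fft <- FFTrees(formula = target ~ ., data = heart)
    plot(fft)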
The purpose of this project was to articulate all the steps conducted to perform the analysis of the heart disease use case. The project contained two main phases: Phase 1, Sandbox Configuration, and Phase 2, the Heart Disease Use Case. The setup and the configurations are not trivial; they required the integration of Hive with MapReduce and Tez, as well as the integration of R and RStudio with Hive to perform the transactions that retrieve and aggregate the data. The analysis included descriptive analysis for all patients and then drilled down to focus on gender: females and males. Moreover, the analysis included the Decision Tree and the FFTrees. The researcher of this paper is in agreement with other researchers that Big Data Analytics and Data Mining can play a significant role in healthcare in various areas such as patient care, healthcare records, and fraud detection and prevention.
References
Alexander, C., & Wang, L. (2017). Big data analytics in heart attack prediction. The Journal of Nursing Care, 6(393).
Dineshgar, G. P., & Singh, L. (2016). A review of data mining for heart disease prediction. International Journal of Advanced Research in Electronics and Communication Engineering (IJARECE), 5(2).
Karthiga, A. S., Mary, M. S., & Yogasini, M. (2017). Early prediction of heart disease using decision tree algorithm. International Journal of Advanced Research in Basic Engineering Sciences and Technology (IJARBEST), 3(3).
Kirmani, M. M., & Ansarullah, S. I. (2016). Prediction of heart disease using decision tree: A data mining technique. IJCSN International Journal of Computer Science and Network, 5(6), 885-892.
Koh, H. C., & Tan, G. (2011). Data mining applications in healthcare. Journal of Healthcare Information Management, 19(2), 65.
Martignon, L., Katsikopoulos, K. V., & Woike, J. K. (2008). Categorization with limited resources: A family of simple heuristics. Journal of Mathematical Psychology, 52(6), 352-361.
Pandey, A. K., Pandey, P., & Jaiswal, K. (2013). A heart disease prediction model using decision tree. IUP Journal of Computer Sciences, 7(3), 43.
Phillips, N. D., Neth, H., Woike, J. K., & Gaissmaier, W. (2017). FFTrees: A toolbox to create, visualize, and evaluate fast-and-frugal decision trees. Judgment and Decision Making, 12(4), 344.
Reddy, R. V. K., Raju, K. P., Kumar, M. J., Sujatha, C., & Prakash, P. R. (2016). Prediction of heart disease using decision tree approach. International Journal of Advanced Research in Computer Science and Engineering, 6(3).