"Artificial Intelligence without Big Data Analytics is lame, and Big Data Analytics without Artificial Intelligence is blind." Dr. O. Aly, Computer Science.
The purpose of this discussion is to examine artificial intelligence (AI) and whether it should be used as a tool to support or to replace decision makers. The discussion begins with a brief history of AI, followed by the foundations of AI and, finally, the question of whether AI should support or replace decision makers.
The History of Artificial Intelligence
Artificial intelligence is defined as a computational technique allowing machines to perform cognitive functions, such as acting or reacting to input, similar to the way humans do (Patrizio, 2018). The gestation of AI occurred between 1943 and 1955. The work of Warren McCulloch and Walter Pitts (1943) is regarded as the first work of artificial intelligence (AI) (Russell & Norvig, 2016). Their work drew on three sources: knowledge of the underlying physiology and function of neurons in the brain, a formal analysis of propositional logic due to Russell and Whitehead, and Turing's theory of computation (Russell & Norvig, 2016). Hebbian learning is the result of the work of Donald Hebb (1949), who demonstrated a simple updating rule for modifying the connection strengths between neurons (Russell & Norvig, 2016). The Hebbian theory remains an influential model to this day (Russell & Norvig, 2016).
The birth of AI came in 1956, when John McCarthy, another influential figure in AI who had moved from Princeton to Dartmouth College, organized the summer research project at Dartmouth that launched the field. AI witnessed early enthusiasm and great expectations from 1952 until 1969 (Russell & Norvig, 2016), followed by a dose of reality between 1966 and 1973. Knowledge-based systems emerged as the key to power from 1969 until 1979. From 1980 to the present, AI has become an industry; from 1986 to the present, neural networks have returned; and from 1987 to the present, AI has adopted the scientific method. Intelligent agents emerged from 1995 onward, and very large datasets became available from 2001 onward. Recent work in AI suggests that, for many problems, the emphasis should be on the data rather than on the algorithm (Russell & Norvig, 2016).
The Foundations of Artificial Intelligence
AI, ideally, takes the best possible action in a situation (Russell & Norvig, 2016). Building an intelligent agent is not an easy task and is often described as problematic. There are eight foundations for building an intelligent agent. Early philosophers such as Aristotle (384-322 B.C.) made AI conceivable by considering the ideas that the mind is in some ways like a machine, that it operates on knowledge encoded in some internal language, and that thought can be used to choose what actions to take (Russell & Norvig, 2016). Mathematics is another building block: mathematicians provide the tools to manipulate statements of logical certainty as well as uncertain, probabilistic statements, and they laid the groundwork for understanding computation and reasoning about algorithms (Russell & Norvig, 2016). Economics formalizes the problem of making decisions that maximize the expected outcome for the decision maker (Russell & Norvig, 2016). Neuroscience has discovered facts about how the brain works and how it is similar to, and different from, computers. Computer engineering has provided the ever more powerful machines that make AI applications possible. Control theory deals with designing devices that act optimally on the basis of feedback from the environment (Russell & Norvig, 2016). Finally, understanding language requires an understanding of the subject matter and context, not just the structure of sentences, which remains a challenge for AI (Russell & Norvig, 2016).
Can AI Support or Replace Decision Makers?
AI has already entered various industries such as healthcare (navatiosolutions.com, 2018; UNAB, 2018). It has been used to manage medical records and other data, and to perform repetitive jobs such as analyzing tests, X-rays, and CT scans and doing data entry (navatiosolutions.com, 2018). AI has been used to analyze data and reports to help select the correct, individually customized treatment path (navatiosolutions.com, 2018). Patients can report their symptoms to an AI app, which uses speech recognition to compare them against a database of illnesses. AI can act as a virtual nurse, helping monitor patients' conditions and following up on treatments between doctor visits (navatiosolutions.com, 2018). AI has also been used to monitor patients' use of medication. The pharmaceutical industry has taken advantage of AI to create drugs faster and more cheaply. In genetics and genomics, AI has been used to identify mutations and links to disease from information in DNA (navatiosolutions.com, 2018). AI has been used to sift through data to highlight mistakes in treatments and workflow inefficiencies and to help area healthcare systems avoid unnecessary patient hospitalizations (navatiosolutions.com, 2018). Other examples of AI's benefits include autonomous transport systems that decrease the number of accidents and medical systems that make major advances in health monitoring possible (UNAB, 2018).
The UNAB think tank (UNAB, 2018) has raised valid questions, among them whether humans and AI will reach a singularity and whether humans and AI can become integrated. The prospect of AI controlling humans with no regard for human values is causing fears about AI technology (UNAB, 2018). The other questions include the following (UNAB, 2018): "What if AI was wholly monitoring human behavior, without human participation? Who or what will be engaged in the related decision-making process? To what extent would individuals accept AI despite the consequences? Will the human factor as we know it disappear completely?" These questions must be addressed before AI technology can be fully adopted and integrated into human lives. James (2018) has raised another valid question: "Can we trust AI?"
Despite the benefits of AI, especially in the healthcare industry, these systems can still make mistakes, caused by limited training or by unknown bias in the algorithm stemming from a lack of understanding of how neural network models operate (James, 2018). In several high-profile instances, machines have demonstrated bias caused by a flawed training dataset or by a malicious attacker who tampers with the training dataset to bias it (IBM, n.d.).
Ethical issues accompany the adoption of AI technology (James, 2018; UNAB, 2018). IBM has suggested instilling human values and morality into AI systems (IBM, n.d.). However, there is no single ethical system for AI (IBM, n.d.). Transparency appears to be key to trusting AI (IBM, n.d.; James, 2018). People need to know how an AI system arrives at a particular conclusion, decision, or recommendation (IBM, n.d.; James, 2018).
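To make the transparency point concrete, the following minimal Python sketch (an illustration only, not drawn from the cited sources) trains a small decision tree with scikit-learn and prints the rules it learned, so a reviewer can trace exactly how the model reaches a recommendation; the dataset is a standard scikit-learn example, and the shallow depth is a deliberate, assumed constraint.

# Minimal transparency illustration: a shallow decision tree whose learned rules can be printed and audited.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Keep the tree shallow so its decision logic stays human-readable.
model = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_train, y_train)

print("Held-out accuracy:", round(model.score(X_test, y_test), 3))
# export_text renders the learned rules, showing how each prediction is reached.
print(export_text(model, feature_names=list(X.columns)))

A trade-off of this approach is that such simple, auditable models may be less accurate than opaque ones, which is part of the transparency debate noted above.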
Conclusion
This discussion has addressed artificial intelligence and its key dimensions in human life. AI has contributed to various industries, including healthcare and pharmaceuticals, and has proven to provide value in certain areas. However, it has also been shown to make mistakes and to demonstrate bias due to flawed training datasets or malicious attacks. There is a fear of integrating AI technology fully into human lives with no regard for human participation and human values. Integrating values and ethics into AI is not an easy task.
From the researcher's point of view, AI should not be used to make decisions related to human values and ethics. Human lives have many dimensions that are not always black and white. There are areas where human integrity, principles, values, and ethics play a role. In court, there is always the notion of the "benefit of the doubt." Can an AI decision be based on the "benefit of the doubt" rule in court? Another aspect of AI, from the researcher's point of view, is who develops it. AI technology is developed by humans. Are humans trying to get rid of humans and put AI in a superior role? AI technology has its role and its place in certain fields, but not in all fields and domains where humans interact with other humans with integrity and values. Let AI technology make decisions in areas where it has proven most useful, such as promoting sales and marketing and automating certain processes to increase efficiency and productivity. Let humans make decisions in areas where human judgment has proven most valuable to human lives, promoting ethics, values, integrity, and principles. "Computers are becoming great assistants to us however they still need our thought to make good decisions" (Chibuk, 2018).
References
Chibuk, J. D. (2018). Four building blocks for a general AI.
This project discusses and analyzes ten big data visualization tools: Sisense, Microsoft Power BI, Tableau, Domo, Looker, InsightSquared, QlikView, WebFOCUS, Phocas, and Easy Insight. Each tool has advantages and disadvantages, which are discussed in this project as well. The analysis shows that Sisense, Looker, InsightSquared, and Easy Insight have the highest ratings for quality of support. Phocas and InsightSquared have the highest ratings for ease of use. InsightSquared has the highest rating for meeting requirements. Easy Insight has the highest rating for ease of administration. Easy Insight, Phocas, WebFOCUS, InsightSquared, Looker, and Sisense have the highest ratings for ease of doing business. Easy Insight and Sisense have the highest ratings for ease of setup. Organizations must analyze each tool to determine which best meets their business requirements.
Keywords:
Big Data Analytics, Data Visualization Tools.
Data visualization is a visual depiction of data, with context, that helps people better understand the data's significance (P. Baker, 2018; Meyer, 2018). Visualization is one of the most powerful ways of representing and presenting data (Jayasingh, Patra, & Mahesh, 2016). It helps in viewing data in the form of graphs, images, and pie charts (EMC, 2015; Jayasingh et al., 2016). Data visualization is not a new concept; it has long been utilized in business intelligence (P. Baker, 2018). However, the kind and size of the data that visualization represents nowadays are far more sophisticated (Meyer, 2018). The visualization process includes synthesizing large volumes of data to obtain the essence of the data and convey the critical insights for decision making (Meyer, 2018). Data visualization plays a significant role in depicting patterns and relationships in large volumes of data that may not be easily seen in raw data reports (Meyer, 2018). It helps identify emerging trends and provides actionable insights that can drive change (Meyer, 2018).
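As a minimal illustration of that synthesis step (the data, column names, and chart choice here are hypothetical, not taken from the cited sources), the following Python sketch aggregates a large synthetic transaction table into a monthly summary and renders it as a bar chart with pandas and matplotlib.

# Minimal sketch: condense many raw records into one chart a decision maker can read.
# The data and column names are synthetic and purely illustrative.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
sales = pd.DataFrame({
    "date": pd.to_datetime(rng.integers(0, 365, 50_000), unit="D", origin="2018-01-01"),
    "region": rng.choice(["North", "South", "East", "West"], 50_000),
    "amount": rng.gamma(2.0, 50.0, 50_000),
})

# Synthesis: 50,000 raw rows become a 12-row monthly summary.
monthly = sales.groupby(sales["date"].dt.to_period("M"))["amount"].sum()

monthly.plot(kind="bar", title="Monthly sales (synthetic data)")
plt.ylabel("Total amount")
plt.tight_layout()
plt.show()

The same aggregate-then-plot pattern underlies most of the dashboarding tools reviewed below, which automate these steps behind a drag-and-drop interface.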
Ten tools were recognized as the best data visualization tools of 2018 (P. Baker, 2018): Zoho Reports, Sisense, Domo, Microsoft Power BI, Tableau Desktop, Google Analytics, Chartio, SAP Analytics Cloud, IBM Watson Analytics, and the Salesforce Einstein Analytics Platform. The same report rated these tools as fair, good, or excellent. The tools rated excellent include Sisense, Power BI, Tableau, and IBM Watson. The tools rated good include Zoho, Domo, Google Analytics, SAP Analytics Cloud, and the Salesforce Einstein Analytics Platform. The only tool rated fair is Chartio. IBM Watson Analytics has not been available on the market since July 2018 (P. Baker, 2018).
Other reports have included additional data visualization tools such as QlikView, MATLAB, Kibana, Plotly, and others (financeonline.com, 2018a). The seven best data visualization tools have been listed as Tableau, QlikView, FusionCharts, Highcharts, Datawrapper, Plotly, and Sisense (Marr, 2017). There are various data visualization tools on the market, and organizations must analyze each tool to ensure that it will serve their business needs.
This project has selected ten top visualization tools on the market: Sisense, Microsoft Power BI, Tableau, Domo, Looker, InsightSquared, QlikView, WebFOCUS, Phocas, and Easy Insight. The project discusses and analyzes these ten tools, describing each tool, its benefits, and its limitations. The discussion is followed by an analysis of the selected tools and a conclusion.
Sisense won the best business intelligence software award for 2017 with 99% user satisfaction (financesonline.com, 2018b). It was rated among the excellent data visualization tools (P. Baker, 2018c). It can be used to collect information from all sources and unify it into a single repository (P. Baker, 2018c; financesonline.com, 2018b). It generates intelligent analysis and enables sharing insights across the organization (P. Baker, 2018c; financesonline.com, 2018b). It is a fast system, as it uses 64-bit computing, parallelization capabilities, and multi-core CPUs (financesonline.com, 2018b). It has been named a leader based on receiving a high customer satisfaction score and having a large market presence (looker.com, 2018), with 91% of users rating it 4-5 stars (looker.com, 2018). Figure 1 illustrates the satisfaction ratings of Sisense.
Sisense has various advantages. It has an intuitive user interface (P. Baker, 2018c). It is described as an all-in-one BI solution enabling multiple tasks, including data modeling and complex calculations (P. Baker, 2018c; financesonline.com, 2018b). Sisense provides accurate data analysis in real time (financesonline.com, 2018b). It uses crowd accelerated BI technology, which can handle hundreds of queries simultaneously (financesonline.com, 2018b). Other advantages include excellent technical support. It connects to all data storage platforms, allowing users to quickly view data assets in different systems (financesonline.com, 2018b). It creates robust dashboards accessible from any device (financesonline.com, 2018b). The system can be easily customized for multiple user levels (financesonline.com, 2018b). The highest-rated features are dashboards (90%), data visualization (90%), and scorecards (90%).
The drawbacks of Sisense include that sharing dashboards created by multiple users is not as easy as it should be, and that navigation and filtering on the mobile platform could be improved (financesonline.com, 2018b). It is a bit complex for a self-service business intelligence (BI) tool (P. Baker, 2018c). The analytics process needs work (P. Baker, 2018c). The natural language features have limitations (P. Baker, 2018c). The lowest-rated features are mobile user support (77%), auto-modeling (78%), and breadth of partner applications (79%). Figure 2 shows the highest-rated and lowest-rated features of Sisense.
Figure 2. Sisense Highest-Rated and
Lowest-Rated Features (looker.com, 2018).
Power BI does a very good job of combining powerful analytics with a user-friendly user interface (UI) and remarkable data visualization capabilities (P. Baker, 2018b). There is a limited free version as well as a Professional version that begins at $9.99 per user per month (P. Baker, 2018b). It provides a single dashboard view of critical business data (P. Baker, 2018b). Power BI has been named a leader based on receiving a high customer satisfaction score and having a significant market presence (looker.com, 2018), with 89% of users rating it 4-5 stars. Figure 3 illustrates the Power BI satisfaction ratings.
Figure 3. Power BI Satisfaction Ratings (looker.com, 2018).
Power BI has several advantages. It is affordable, because there is a free version as well as the Professional version at $9.99 per user per month (P. Baker, 2018b; techaffinity.com, 2017). It is tightly coupled with the Microsoft product suite, and integration with Excel, Azure, and SQL Server is simple and straightforward (techaffinity.com, 2017). It is consistently upgraded to improve its features (techaffinity.com, 2017). It provides good visualization reports offering detailed reporting (techaffinity.com, 2017). It can connect to and extract data from a variety of data sources, such as Excel, Access, GitHub, Google Analytics, and Salesforce, and it works on all platforms, including Windows, Android, and iOS (techaffinity.com, 2017). The highest-rated features of Power BI are data visualization (90%), graphs and charts (89%), and breadth of partner applications (86%) (looker.com, 2018).
The drawbacks of Power BI include its limitations in handling large volumes of data (techaffinity.com, 2017). It will often hang when handling vast data sets; the best solution is to use a live connection, which makes it much faster (techaffinity.com, 2017). It is complicated to master, as there are many complex components, such as Power BI Desktop, the gateway, and the Power BI service, and it is difficult to understand which option is best suited for the business (techaffinity.com, 2017). The lowest-rated features are predictive analytics (73%), big data services (77%), and auto-modeling (80%) (looker.com, 2018). Figure 4 shows the highest-rated and lowest-rated features of Power BI.
Figure 4. Power BI Highest-Rated and
Lowest-Rated Features (looker.com, 2018).
Tableau Desktop was one of the early players in the business intelligence domain and is one of the most mature offerings on the market (P. Baker, 2018d). It has a steeper learning curve than other platforms, but it is easily one of the best data visualization tools (P. Baker, 2018d). It has been named a leader based on receiving a high customer satisfaction score and having a significant market presence (looker.com, 2018), with 85% of users rating Tableau 4-5 stars (looker.com, 2018). Figure 5 shows the satisfaction ratings of Tableau.
Tableau has several advantages. It supports complex computations (absentdata.com, 2018). It quickly creates interactive visualizations using drag-and-drop functionality (absentdata.com, 2018). It offers many types of visualization options, enhancing the user experience (absentdata.com, 2018). Tableau can handle large amounts of data, including millions of rows, with ease (absentdata.com, 2018). Users can integrate other scripting languages, such as Python or R, into Tableau (absentdata.com, 2018). It provides mobile support and responsive dashboards (absentdata.com, 2018). The highest-rated features are data visualization (90%), graphs and charts (90%), and dashboards (89%) (looker.com, 2018).
The drawbacks of Tableau include the substantial training required to fully master the platform (P. Baker, 2018d). It does not provide automatic refreshing of reports through scheduling (absentdata.com, 2018). It is not a completely open tool; unlike tools such as Power BI, it does not let developers create custom visuals that can be imported, so any new visual needs to be recreated rather than imported into Tableau (absentdata.com, 2018). It has limited conditional formatting, tables limited to 16 columns, and parameters that must be updated manually (absentdata.com, 2018). Screen resolution can disrupt a Tableau dashboard (absentdata.com, 2018). Tableau Desktop provides basic pre-processing, including joining and blending data, but data cleansing is a required step that has typically required another tool, such as Power BI, to clean the data. Tableau introduced the Tableau Prep tool in 2018 to prepare data, and it has its own advantages and disadvantages as well (absentdata.com, 2018). The biggest issues with Tableau are scaling and pricing for the enterprise (absentdata.com, 2018). The lowest-rated features of Tableau are the sandbox and test environment (71%), predictive analytics (73%), and auto-modeling (78%) (looker.com, 2018). Figure 6 shows the highest-rated and lowest-rated features of Tableau.
Figure 6. Tableau Highest-Rated and
Lowest-Rated Features (looker.com, 2018).
Domo was founded in 2010 and had received $689 million in funding as of April 2017 (yurbi.com, 2018). This funding includes investments from Fidelity Investments and Salesforce. Domo has big customers such as eBay and National Geographic (yurbi.com, 2018). Its initial public offering (IPO) was on June 29, 2018, and the company is now publicly traded on NASDAQ (yurbi.com, 2018). It is intended for companies that already have business intelligence experience in their organizations (P. Baker, 2018a). It is a powerful BI tool with many data connectors and robust data visualization capabilities (P. Baker, 2018a). It has been named a leader based on receiving a high customer satisfaction score and having a significant market presence; 92% of users rate it 4-5 stars, and 87% of users believe it is headed in the right direction (looker.com, 2018). Figure 7 shows the satisfaction ratings.
Domo has advantages and disadvantages. The advantages include the ability to view real-time data in a single dashboard (yurbi.com, 2018). Domo can integrate on-premises data and external data sources in the cloud (yurbi.com, 2018). It is built on reliable technology, its network is dependable, and it has both the leadership and the financial resources to continue to evolve. The highest-rated features are graphs and charts (91%), dashboards (90%), and data visualization (90%) (looker.com, 2018).
The drawbacks of Domo include its cloud-based nature for organizations whose data is mostly on-premises (yurbi.com, 2018). The cost of Domo is prohibitive for many businesses, as the vendor is not interested in deals of less than $50,000 (yurbi.com, 2018). The lack of improvement is another drawback of Domo (yurbi.com, 2018), along with the difficulty of extracting data and high-pressure sales (yurbi.com, 2018). The lowest-rated features of Domo are predictive analytics (73%), auto-modeling (77%), and search (80%) (looker.com, 2018). Figure 8 shows the highest-rated and lowest-rated features.
Figure 8. Domo Highest-Rated and
Lowest-Rated Features (looker.com, 2018).
Looker is a self-service data exploration solution hosted in the cloud (comparecamp.com, 2018). It can be used to discover critical data and generate reports in seconds (comparecamp.com, 2018). All collaboration and exploration models can be created with a little SQL coding knowledge (comparecamp.com, 2018). It supports more than 20 database variants, including BigQuery, Vertica, Hive, and Spark (comparecamp.com, 2018). It can be used in complex installations and can process the required terabytes of data (comparecamp.com, 2018). It has been named a leader based on receiving a high customer satisfaction score and having a significant market presence (comparecamp.com, 2018). Ninety-seven percent of users rated it 4-5 stars, and 90% of them believe it is headed in the right direction (looker.com, 2018). The satisfaction ratings show 94% for quality of support, 82% for ease of use, 86% for meeting requirements, 88% for ease of administration, 94% for ease of doing business, and 84% for ease of setup. Figure 9 shows the satisfaction ratings for Looker as a data visualization tool.
Looker has advantages and disadvantages. The main advantage of Looker is that the user does not have to be an SQL expert or a professional data analyst to use it (looker.com, 2018). It has its own understandable and flexible modeling language, LookML, with a specific syntax, which can be used to explore data and extend the platform's SQL efficiency (looker.com, 2018). Packages can be tailored to the budget of the organization (looker.com, 2018). The highest-rated features are data column filtering (90%), data modeling (89%), and the reports interface (88%) (looker.com, 2018).
The drawbacks of Looker appear in its lowest-rated features: mobile user support (66%), predictive analytics (70%), and auto-modeling (74%) (looker.com, 2018). Figure 10 shows the highest-rated and lowest-rated features of Looker.
Figure 10. Looker Highest-Rated and Lowest-Rated Features (looker.com, 2018).
InsightSquared is a suite offering an array of business reporting and analytics features to measure all aspects of a business, including marketing, sales, customer service, financials, and staffing (financesonline.com, 2018a). It provides an in-depth analysis of the sales pipeline, covering sales forecasting and retrospective trend identification (financesonline.com, 2018a). The sales analytics reports can be used to measure employee achievements and successes (financesonline.com, 2018a). These sales analytics reports can be combined with marketing analytics information to identify lead sources, track lead generation, and measure campaign results (financesonline.com, 2018a). It has been named a leader based on receiving a high customer satisfaction score and having a significant market presence (looker.com, 2018). It received the highest satisfaction score among products in the BI platform category (looker.com, 2018). Figure 11 shows the satisfaction ratings for the InsightSquared tool.
InsightSquared has
several advantages. It offers six
customizable dashboards that give a near real-time view of essential metrics
and latest trends, such as new pipeline opportunities, sales cycle warnings,
employee activities, and flagged data errors (financesonline.com, 2018a). It offers sales analytics and reports as well
as marketing analytics. It also offers
other analytic reports such as financial reporting, staffing analytics and
support team analytics (financesonline.com, 2018a). Financesonline.com (2018a) reported 100% user satisfaction, and looker.com (2018) reported that 98% of users rated it 4-5 stars, with 97% of users believing it is headed in the right direction.
The disadvantages of InsightSquared include limited customization, loading times that need improvement, and tricky configuration to ensure the data is perfect (getapp.com, 2018b). The lowest-rated features of the InsightSquared tool are mobile user support (83%), customization (83%), and Big Data services (85%) (looker.com, 2018). Figure 12 shows the highest-rated and lowest-rated features of the InsightSquared tool.
Figure 12. InsightSquared Highest-Rated and Lowest-Rated Features (looker.com, 2018).
QlikView is a data discovery tool that allows users to build analytical applications on their data (financeonline.com, 2018a). It can be used to create and utilize default and custom data connectors and templates based on the requirements of the business (financeonline.com, 2018a). It has been named a leader based on receiving a high customer satisfaction score and having a significant market presence (looker.com, 2018). Ninety percent of users rated it 4-5 stars, and 80% of users believe it is headed in the right direction (looker.com, 2018). Figure 13 shows the satisfaction ratings.
The advantages of QlikView include personalized data search (financeonline.com, 2018a). Users can build applications from the software's scripting to fit their business needs. It offers role-based access controls for security and data access (financeonline.com, 2018a). The highest-rated features are dashboards (89%), followed by performance and reliability (87%) and data transformation (87%) (looker.com, 2018).
The disadvantages of QlikView appear in its lowest-rated features: predictive analytics (67%), auto-modeling (69%), and integration APIs (75%) (looker.com, 2018). Figure 14 shows the highest-rated and lowest-rated features of the QlikView tool.
Figure 14. QlikView Highest-Rated
and Lowest-Rated Features (looker.com, 2018).
WebFOCUS is a cloud-based business intelligence and analytics platform offering analytical tools, applications, reports, and documents for business stakeholders such as the management team, analysts, line-of-business workers, partners, and customers (softwareadvice.com, 2018). It provides data discovery, location intelligence, predictive and prescriptive analytics, BI smart search, and natural language search (looker.com, 2018; softwareadvice.com, 2018). It has been named a leader based on receiving a high customer satisfaction score and having a significant market presence. Ninety-three percent of users rated it 4-5 stars, and 95% of users believe it is headed in the right direction (looker.com, 2018). Figure 15 shows the satisfaction ratings for WebFOCUS.
The advantages of WebFOCUS include the ability to create anything the customer requests; the platform is very flexible (looker.com, 2018). Server time is minimal, and it is easy to host multiple, separate applications with separate user groups and roles (looker.com, 2018). The product is user-friendly and can be used by users with varying levels of knowledge (looker.com, 2018). The highest-rated features are data column filtering (90%), the reports interface (90%), and user, role, and access management (89%) (looker.com, 2018).
The disadvantages of WebFOCUS include the learning curve for developers to create content and the reporting server's static, proportional font (looker.com, 2018). Some things that seem easy become difficult, and not all features work with all data sources (softwareadvice.com, 2018). The lowest-rated features of WebFOCUS are auto-modeling (80%), mobile user support (80%), and predictive analytics (81%) (looker.com, 2018). Figure 16 shows the highest-rated and lowest-rated features of the WebFOCUS tool.
Figure 16. WebFOCUS Highest-Rated
and Lowest-Rated Features (looker.com, 2018).
Phocas is a leading business intelligence solution built around exceeding customer expectations (financeonline.com, 2018c; looker.com, 2018). It helps businesses make data-driven decisions, see new sales opportunities, and enhance efficiency (financeonline.com, 2018c). It is an integrated data solution providing an innovative data discovery platform (financeonline.com, 2018c). It is designed for non-technical users and delivers a simple and powerful analytical capability that easily turns data into a graph, chart, or map with a few clicks or touches of a screen (financeonline.com, 2018c). It has been named a leader based on receiving a high customer satisfaction score and having a significant market presence (looker.com, 2018). One hundred percent of users rated it 4-5 stars, and 95% of users believe it is headed in the right direction (looker.com, 2018). Figure 17 shows the satisfaction ratings for Phocas.
The advantages of Phocas include gaining quick insight into sales data, the ease of use of the product, and the responsiveness of the vendor (getapp.com, 2018c). It is described as quick and easy for getting the reports the business needs, with dashboards providing summaries that users can drill down into (getapp.com, 2018c). It is accessible from anywhere and can be designed to suit business needs (getapp.com, 2018c). The highest-rated features of Phocas are performance and reliability (96%), data column filtering (95%), and Big Data services (94%) (looker.com, 2018).
The disadvantages of Phocas include the cost of making it available to more employees (getapp.com, 2018c). The lowest-rated features are predictive analytics (81%), breadth of partner applications (81%), and auto-modeling (82%) (looker.com, 2018). Figure 18 shows the highest-rated and lowest-rated features of Phocas.
Figure 18. Phocas Highest-Rated and Lowest-Rated Features (looker.com, 2018).
Easy Insight is a feature-packed business intelligence platform enabling users to get an insightful overview of business operations. The overview helps businesses define and analyze their business data to find better ways to improve existing business processes or to devise and deploy new strategies (financeonline.com, 2018b). Businesses get access to vital information that helps them make better decisions and initiate sound business actions (financeonline.com, 2018b). It has been named a niche vendor based on receiving a relatively low customer satisfaction score and having a small market presence (financeonline.com, 2018b). Nevertheless, 100% of users rated it 4-5 stars, and 100% of users believe it is headed in the right direction (looker.com, 2018). Figure 19 shows the satisfaction ratings for the Easy Insight tool.
The advantages of Easy Insight include data import and export, data visualization, a drag-and-drop interface, and graphical data presentation (getapp.com, 2018a). It also offers real-time analytics, real-time reporting, reporting and statistics, and visual analytics (getapp.com, 2018a). It has data filtering and customizable reporting (getapp.com, 2018a). The highest-rated features are performance and reliability (96%), data column filtering (95%), and Big Data services (94%) (looker.com, 2018).
The limitations of Easy Insight include the lack of an activity dashboard, ad hoc reporting, automatic notifications, monitoring, and dashboard creation (getapp.com, 2018a). The limitations also include third-party integration. The lowest-rated features of Easy Insight are collaboration and workflow (71%), auto-modeling (71%), and mobile user support (76%) (looker.com, 2018). Figure 20 shows the highest-rated and lowest-rated features of the Easy Insight tool.
Figure 20. Easy Insight Highest-Rated
and Lowest-Rated Features (looker.com, 2018).
Table 1 summarizes the satisfaction ratings for these ten big data visualization tools. The comparative analysis in Figure 21 illustrates visually that Sisense, Looker, InsightSquared, and Easy Insight have the highest ratings for quality of support; that Phocas and InsightSquared have the highest ratings for ease of use; that InsightSquared has the highest rating for meeting requirements; that Easy Insight has the highest rating for ease of administration; that Easy Insight, Phocas, WebFOCUS, InsightSquared, Looker, and Sisense have the highest ratings for ease of doing business; and that Easy Insight and Sisense have the highest ratings for ease of setup.
Table 1. The Ten Tools Satisfaction Rating Summary.
Figure 21. Illustration of the Satisfaction
Ratings of the Ten Data Visualization Tools.
This project has discussed and analyzed ten big data visualization tools: Sisense, Microsoft Power BI, Tableau, Domo, Looker, InsightSquared, QlikView, WebFOCUS, Phocas, and Easy Insight. Each tool has advantages and disadvantages, which were discussed in this project as well. The analysis showed that Sisense, Looker, InsightSquared, and Easy Insight have the highest ratings for quality of support. Phocas and InsightSquared have the highest ratings for ease of use. InsightSquared has the highest rating for meeting requirements. Easy Insight has the highest rating for ease of administration. Easy Insight, Phocas, WebFOCUS, InsightSquared, Looker, and Sisense have the highest ratings for ease of doing business. Easy Insight and Sisense have the highest ratings for ease of setup. Organizations must analyze each tool to determine which best meets their business requirements.
Jayasingh, B. B., Patra, M. R., & Mahesh, D. B. (2016, December 14-17). Security issues and challenges of big data analytics and visualization. Paper presented at the 2016 2nd International Conference on Contemporary Computing and Informatics (IC3I).
looker.com. (2018). BI Tools Comparison, Ratings of Top 20 Vendors. Retrieved from https://www.looker.com
The purpose of this discussion is to select an industry and discuss three Big Data visualization tools often used in that particular industry. The selected tools are Tableau, QlikView, and Power BI, and the selected industry is the healthcare sector. The discussion also provides the researcher's viewpoint on their impact on, or purpose for, the industry.
Data Visualization
Visualization is one of the most powerful ways of representing and presenting data (Jayasingh, Patra, & Mahesh, 2016). It helps in viewing data in the form of graphs, images, and pie charts (EMC, 2015; Jayasingh et al., 2016). Table 1 shows common representation methods for data and charts. Visualization helps synthesize a large volume of data to get at the essence of such big data and convey its key insights (Meyer, 2018).
Table 1. Common Representation Methods
for Data and Charts (EMC, 2015).
EMC (2015) has provided a good example that demonstrates the power of data visualization: forty-five years of store-opening data presented in a table is very hard to understand, whereas the same data visualized on a map, as in Figure 1, can be easily understood.
Figure 1. Demonstration of Data
Visualization Role in Presenting Big Data (EMC, 2015).
Data Visualization and Visual Analytics
Data visualization is increasingly becoming a critical building block of analytics in the era of Big Data Analytics (EMC, 2015; Fiaz, Asha, Sumathi, & Navaz, 2016). The volume and variety of data keep growing, and data visualization plays a significant role in presenting analytical results to audiences with various backgrounds (EMC, 2015; Fiaz et al., 2016). Big Data analytical projects are sophisticated, and the presentation of their value is critical to sustaining their momentum (EMC, 2015). Presenting such analytical projects is challenging due to the mixed backgrounds of the audience (EMC, 2015). Interpreting the data visually assists in understanding it and in making business decisions quickly (Fiaz et al., 2016). In addition, dynamic data visualization presents its own challenge.
EMC (2015) has recommended four deliverables for communicating analytical projects that satisfy most of the needs of the various stakeholders: a presentation for the project sponsor, a presentation for an analytical audience, technical specification documents, and well-annotated production code.
Various data visualization tools, such as Tableau, D3.js, and timeline libraries, provide visualizations for Big Data, offering overviews and summaries and allowing drill-down to a level where patterns can be extracted and correlations developed from the datasets (EMC, 2015; Jayasingh et al., 2016). Table 2 shows common tools for data visualization.
Table 2. Common Tools for Data
Visualization (EMC, 2015).
Visual analytics is the combination of Big Data Analytics and interactive visualization techniques (Jayasingh et al., 2016). Visual analytics faces the challenge of embedding or supporting Big Data in order to represent the data (Jayasingh et al., 2016). Applications of visual analytics include early fraud detection in the credit card sector, weather monitoring, network analysis, and forensic analysis (Jayasingh et al., 2016).
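As a minimal illustration of the fraud-detection use case above (the data, features, and contamination rate are hypothetical, not drawn from the cited sources), the following Python sketch scores synthetic credit card-like transactions with an Isolation Forest and plots the flagged anomalies so an analyst can inspect them visually.

# Minimal visual-analytics sketch: score synthetic transactions for anomalies and visualize the flagged points.
# Data, features, and the contamination setting are illustrative assumptions only.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
normal = rng.normal(loc=[50, 12], scale=[20, 4], size=(1000, 2))   # typical transactions
fraud = rng.normal(loc=[400, 2], scale=[50, 1], size=(15, 2))      # injected outliers
X = np.vstack([normal, fraud])                                     # columns: amount, hour-of-day feature

model = IsolationForest(contamination=0.02, random_state=0).fit(X)
labels = model.predict(X)  # -1 = anomaly, 1 = normal

plt.scatter(X[:, 0], X[:, 1], c=(labels == -1), cmap="coolwarm", s=10)
plt.xlabel("Transaction amount")
plt.ylabel("Hour feature")
plt.title("Flagged anomalies (synthetic data)")
plt.show()

In a full visual analytics workflow, the analyst would interact with such a plot, drilling into the flagged points rather than reading the raw transaction log.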
Data Visualization in Healthcare
As the healthcare industry takes advantage of Big Data Analytics, it is also utilizing the powerful presentation capabilities of visual analytics and data visualization to make sense of large volumes of data (Bresnick, 2018; Patel-Misra, 2018). However, as Patel-Misra (2018) has indicated, data visualization alone does not drive value; rather, value is realized when the visualization drives a process, a change in a process, or a new action. Thus, data visualization works best when it is integrated seamlessly into a process (Patel-Misra, 2018).
Various user-friendly data visualization tools have been used across sectors, including healthcare, such as Splunk, Datameer, Jaspersoft, Karmasphere, Pentaho, Hadapt, HP Vertica, Teradata Aster solutions, Cognos, Crystal Reports, Tableau, QlikView, Spotfire, and Power BI (Haughom, Horstmeier, Wadsworth, Staheli, & Falk, 2017; Mathew & Pillai, 2015).
Tableau, QlikView, and Power BI have often been used in healthcare and have transformed analytical reporting (Thompson, Gresse, & Lendway, 2018). This discussion therefore addresses Tableau, QlikView, and Power BI as the three data visualization tools often used in the healthcare industry.
Tableau in Healthcare
Tableau has reported in a white paper (Tableau, 2011) that healthcare providers are successfully transforming data from information to insight using Tableau software. Healthcare organizations utilize three approaches to get more from their information assets (Tableau, 2011). The first approach is to break the data access logjam by empowering the departments in a healthcare organization to explore their own data. The second approach is to uncover answers with data from multiple systems to reveal trends and outliers. The third approach is to share insights with executives, doctors, and others to drive collaboration.
Applying the first approach reduced the time patients have to wait by nearly ten minutes. Figure 2 shows the data visualization for this result.
Figure 2. Patient Cycle Time Data
Visualization (Tableau, 2011).
The second approach was applied in a syringe label audit, resulting in better labeling with provider initials and the date and time of syringe preparation, increasing patient safety. Figure 3 shows the data visualization for the anesthesiology syringe audit.
Figure 3. Anesthesiology Syringe Audit
Data Visualization (Tableau, 2011).
Applying the third approach, sharing insights with executives, doctors, and others to drive collaboration, has helped increase hospital profitability, analyzed by market. Figure 4 shows the data visualization for this application.
Figure 4. Analyzing Hospital Profitability by Market (Tableau, 2011)
QlikView in Healthcare
QlikView has reported that healthcare providers can integrate data from across systems and remove the inhibitors to improving the quality, safety, and cost of healthcare delivery using QlikView (qlikview.com, 2007). It uses in-memory analysis. Data can be analyzed across an unlimited number of dimensions and explored in any direction against the entire data volume, down to the transaction level. Qlik provides quick and robust business analysis using data visualization, all enabled through its in-memory associative technology. It provides various types of analysis, such as clinical operations analysis, care delivery analysis, resource planning analysis, supply chain analysis, and financial analysis, all aimed at improving the quality, safety, and cost of healthcare delivery (qlikview.com, 2007).
Power BI in Healthcare
Schott (2017) has reported three ways real-time data visualization will transform the healthcare industry: sharing data across healthcare organizations, providing real-time visualization, and improving response times. Power BI was found useful in all three (Schott, 2017). Power BI collects and analyzes electronic health records and then connects that information with open data sources to enable users to visualize data and explore service-area patterns (Schott, 2017).
The Foundation Trust is an example of using Power BI to evaluate the cost and efficacy of drugs during treatment processes (Schott, 2017). The organization integrated regional weather information with its own data to find out how inclement weather can affect the frequency of respiratory ailments (Schott, 2017). The group has also worked jointly with a local hospital, comparing data through Power BI to identify best practices in prescribing medications (Schott, 2017).
Conclusion
In summary, data visualization plays a significant role not only in interpreting Big Data Analytics but also in representing and presenting its results. Data visualization faces the challenges of varied audiences and dynamic visual analytics. Various data visualization software packages and programs are available for Big Data Analytics. The three programs selected for this discussion were Tableau, QlikView, and Power BI in the healthcare sector. Each program offers unique services and data visualization capabilities; QlikView, for example, provides analysis using an in-memory technique for better performance. Organizations should utilize the data visualization tool that best fits their business model, and they might need to implement more than one tool should the business model require it.
References
Bresnick, J. (2018). Using visual analytics, big data dashboards for healthcare insights.
EMC. (2015). Data science and big data analytics: Discovering, analyzing, visualizing and presenting data (1st ed.). Wiley.
Fiaz, A. S., Asha, N., Sumathi, D., & Navaz, A. S. (2016). Data visualization: Enhancing big data more adaptable and valuable. International Journal of Applied Engineering Research, 11(4), 2801-2804.
Jayasingh, B. B., Patra, M. R., & Mahesh, D. B. (2016, December 14-17). Security issues and challenges of big data analytics and visualization. Paper presented at the 2016 2nd International Conference on Contemporary Computing and Informatics (IC3I).
Mathew, P. S., & Pillai, A. S. (2015). Big data solutions in healthcare: Problems and perspectives. Paper presented at the 2015 International Conference on Innovations in Information, Embedded and Communication Systems (ICIIECS).
Meyer, M. (2018). The rise of healthcare data visualization.
Patel-Misra, D. (2018). Data visualization in healthcare: Driving real-time actionable insights.
The purpose of this discussion is to address whether the combination of Big Data and artificial intelligence is significant to any industry. The discussion also provides an example where artificial intelligence has been used and applied successfully. The sector chosen for this example of AI use is health care.
The
Significance of Big Data and Artificial Intelligence Integration
As discussed in U4-DB2, Big Data empowers artificial intelligence. Thus, there is no doubt about the benefits and advantages of utilizing Big Data in artificial intelligence for businesses. However, the question in this discussion is whether their combination is significant to every industry or only to specific industries.
The McKinsey Global Institute reported in 2011 that not all industries are created equal when parsing the benefits of Big Data (Brown, Chui, & Manyika, 2011). The report indicated that although Big Data is changing the game for virtually every sector, it favors some companies and industries over others, especially in the early stages of adoption. McKinsey (Manyika et al., 2011) also reported five domains that could take advantage of the transformative potential of Big Data: healthcare in the United States, public sector administration in the European Union, retail in the United States, manufacturing globally, and personal location data globally. Figure 1 illustrates Big Data's significant financial value across sectors.
Figure 1. Big Data Financial Value
Across Sectors (Manyika et al., 2011).
Thus, the value of Big Data Analytics is already tremendous for almost every business, and that value varies from one sector to another. The combination of Big Data and artificial intelligence is good for innovation (Bean, 2018; Seamans, 2017), and there is no limit to innovation for any business. Figure 2 shows 19-year-old Go player Ke Jie reacting during the second match against Google's artificial intelligence program AlphaGo in Wuzhen.
Figure 2. 19-year old Ke Jie Reacts
During the Second Match Against Google’s Artificial Intelligence Program
AlphaGo (Seamans, 2017).
If the combination of Big Data and artificial intelligence is good for innovation, then logically every organization and every sector needs innovation to survive the competition. In a survey conducted by NewVantage Partners, 97.2% of executive decision-makers reported that their companies are investing in building or launching Big Data and artificial intelligence initiatives (Bean, 2018; Patrizio, 2018). It is also worth noting that 76.5% of the executives indicated that the availability of Big Data is empowering AI and cognitive initiatives within their organizations (Bean, 2018). The same survey also showed that 93% of the executives identified artificial intelligence as the disruptive technology their organizations are investing in for the future. This result shows a common consensus among executives that organizations must leverage cognitive technologies to compete in an increasingly disruptive period (Bean, 2018).
AI
Application Example in the Health Care
Industry
Since various research studies have identified the healthcare industry as benefiting greatly from Big Data and artificial intelligence, this sector is chosen as the example of the application of both BD and AI for this discussion. AI is becoming a transformational force in healthcare (Bresnick, 2018). The healthcare industry has almost endless opportunities to apply technologies such as Big Data and AI to deploy more precise and impactful interventions at the right time in patient care (Bresnick, 2018).
Harvard Business Review (HBR) has indicated that 121 health AI and machine learning companies raised $2.7 billion in 206 deals between 2011 and 2017 (Kalis, Collier, & Fu, 2018). HBR has examined ten promising artificial intelligence applications in healthcare (Kalis et al., 2018). The findings have shown that the application of AI could create up to $150 billion in annual savings for U.S. health care by 2026 (Kalis et al., 2018). The investigation has also shown that AI currently creates the most value in assisting the frontline clinicians to be more productive and in making back-end processes more efficient, but not yet in making clinical decisions or improving clinical outcomes (Kalis et al., 2018). Figure 3 shows the ten AI applications that could change health care.
Figure 3. Ten Applications of AI That Could Change Health Care (Kalis et al., 2018).
Conclusion
In
conclusion, the combination of Big Data and Artificial
Intelligence drives innovations for all sectors. Every sector and every
business need to innovate to maintain a competitive edge. Some sectors are leading in taking the advantages of this combination of BD and AI more
than others. Health care is an excellent example of employing artificial
intelligence. However, the application
of the AI has its most value on three main areas only of AI-assisted surgery,
virtual nurse, administrative workflow. The use of AI in other areas in healthcare is
still in infant stages and will take time until it establishes its root and
witness the great benefits of AI application (Kalis et al., 2018).
References
Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (2011). Big data: The next frontier for innovation, competition, and productivity.
Patrizio, A. (2018). Big Data vs. Artificial Intelligence.
Seamans, R. (2017). Artificial Intelligence and Big Data: Good for Innovation?
The purpose of this discussion is to examine the future impact of Big Data Analytics on artificial intelligence. The discussion also provides an example of AI use in Big Data generation and analysis. It begins with artificial intelligence, followed by advanced levels of big data analysis. The impact of Big Data (BD) on artificial intelligence is then discussed, with various examples showing how artificial intelligence is empowered by BD.
Artificial Intelligence
Artificial intelligence (AI) has eight definitions laid out across two dimensions of thinking and acting (Table 1) (Russell & Norvig, 2016). The top definitions are concerned with thought processes and reasoning, while the bottom definitions address behavior. The definitions on the left measure success in terms of fidelity to human performance, while the definitions on the right measure success against an ideal performance measure called "rationality" (Russell & Norvig, 2016). A system is "rational" if it does the "right thing," given what it knows.
Table 1: Some Definitions of Artificial Intelligence, Organized Into Four Categories (Russell & Norvig, 2016).
Patrizio (2018) defined artificial intelligence as a computational technique allowing machines to perform cognitive functions, such as acting or reacting to input, similar to the way humans do. Traditional computing applications react to data, but their reactions and responses have to be hand-coded, so such applications cannot react to unexpected results (Patrizio, 2018). Artificial intelligence systems, in contrast, are continuously in flux, changing their behavior to accommodate changes in the results and modifying their reactions (Patrizio, 2018). An artificial intelligence-enabled system is designed to analyze and interpret data and to address issues based on those interpretations (Patrizio, 2018). Using machine learning algorithms, the computer learns once how to act or react to a particular result and then knows how to act the same way in the future (Patrizio, 2018). IBM has invested $1 billion in artificial intelligence through the launch of its IBM Watson Group (Power, 2015). The health care industry is the most significant application of Watson (Power, 2015).
Advanced Level of Big Data Analysis
The fundamental analytics techniques include descriptive analytics, which breaks big data down into smaller, more useful pieces of information about what has happened, focusing on insight gained from historical data to provide trend information on past or current events (Liang & Kelemen, 2016). Advanced-level computational tools focus on predictive analytics, which determines patterns and predicts future outcomes and trends by quantifying the effects of future decisions to advise on possible outcomes (Liang & Kelemen, 2016). Prescriptive analytics functions as a decision support tool, exploring a set of possible actions and proposing actions based on descriptive and predictive analysis of complex data. Advanced-level computational techniques also include real-time analytics.
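To make the distinction concrete, the following minimal Python sketch (synthetic data and a deliberately simple linear model, not taken from the cited source) contrasts a descriptive summary of what has already happened with a predictive model that projects the next few periods.

# Descriptive vs. predictive analytics on synthetic monthly demand data.
# The data and the linear trend model are illustrative assumptions only.
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(1)
months = np.arange(1, 25)                                   # two years of monthly observations
demand = 100 + 5 * months + rng.normal(0, 10, months.size)  # upward trend plus noise
df = pd.DataFrame({"month": months, "demand": demand})

# Descriptive analytics: summarize what has happened.
print(df["demand"].describe())

# Predictive analytics: fit the trend and forecast the next three months.
model = LinearRegression().fit(df[["month"]], df["demand"])
future = pd.DataFrame({"month": [25, 26, 27]})
print(model.predict(future))

A prescriptive layer would go one step further, comparing candidate actions (for example, different stocking levels) against such forecasts and recommending one.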
Advanced-level data analysis includes various techniques. Real-time analytics and meta-analysis can be used to integrate multiple data sources (Liang & Kelemen, 2016). Hierarchical or multi-level models can be used for spatial data, and longitudinal and mixed models for real-time or dynamic temporal data rather than static data (Liang & Kelemen, 2016). Data mining and pattern recognition can be used for detecting trends and patterns (Liang & Kelemen, 2016). Natural language processing (NLP) can be used for text mining, together with machine learning, statistical learning, and Bayesian learning with automatic extraction of data and variables (Liang & Kelemen, 2016). Artificial intelligence with automatic ensemble techniques and intelligent agents, along with deep learning methods such as neural networks, support vector machines, and dynamic state-space models, can be used for automated analysis and information retrieval (Liang & Kelemen, 2016). Causal inference and Bayesian approaches can be used for probabilistic interpretations (Liang & Kelemen, 2016).
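As a small illustration of the text-mining technique mentioned above (the documents and urgency labels are made up, and the approach is a generic scikit-learn pipeline rather than anything from the cited source), the following Python sketch turns short free-text notes into TF-IDF features and trains a Naive Bayes classifier on them.

# Minimal text-mining sketch: TF-IDF features plus a Naive Bayes classifier.
# Documents and labels are hypothetical and far too few for a real model.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

notes = [
    "patient reports mild headache and fatigue",
    "severe chest pain and shortness of breath",
    "routine follow-up, no complaints",
    "acute abdominal pain with fever",
]
urgent = [0, 1, 0, 1]  # 1 = urgent, 0 = routine (assumed labels)

model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(notes, urgent)

print(model.predict(["sudden chest pain radiating to the arm"]))

In practice the same pipeline would be trained on thousands of labeled documents and evaluated on held-out data before any conclusions were drawn.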
Big Data Empowers Artificial Intelligence
The trend of artificial intelligence implementation is increasing: it is anticipated that 70% of enterprises will implement artificial intelligence (AI) by the end of 2018, up from 40% in 2016 and 51% in 2017 (Mills, 2018). A survey of C-level executive decision-makers conducted by NewVantage Partners found that 97.2% of executives stated that their companies are investing in, building, or launching Big Data and artificial intelligence initiatives (Bean, 2018; Patrizio, 2018). The same survey found that 76.5% of the executives feel that artificial intelligence and Big Data are becoming closely interconnected and that the availability of data is empowering artificial intelligence and cognitive initiatives within their organizations (Patrizio, 2018).
Artificial intelligence, particularly machine learning, requires data to develop its intelligence (Patrizio, 2018). The data used in artificial intelligence and machine learning must already be cleaned, with extraneous, duplicate, and unnecessary data removed; this cleaning is regarded as the first big step when using Big Data and artificial intelligence (Patrizio, 2018). The CERN data center has accumulated over 200 petabytes of filtered data (Kersting & Meyer, 2018). Machine learning and artificial intelligence can take advantage of such filtered data, leading to many breakthroughs (Kersting & Meyer, 2018). Examples of these breakthroughs include genomic and proteomic experiments that enable personalized medicine (Kersting & Meyer, 2018); historical climate data used to understand global warming and to predict weather better (Kersting & Meyer, 2018); and massive amounts of sensor network readings and hyperspectral images of plants used to identify drought conditions and gain insights into plant growth and development (Kersting & Meyer, 2018).
Multiple technologies such as artificial intelligence, machine learning, and data mining have been used together to extract the maximum value from Big Data (Luo, Wu, Gopukumar, & Zhao, 2016), including in healthcare (Luo et al., 2016). Computational tools such as neural networks, genetic algorithms, support vector machines, and case-based reasoning have been used in the prediction of stock markets and other financial markets (Mishra, Dehuri, & Kim, 2016; Qin, 2012).
AI has impacted the business world through social media and the large volume of data collected from it (Mills, 2018). For instance, personalized content delivered in real time is increasingly used to enhance sales opportunities (Mills, 2018), and artificial intelligence makes use of effective behavioral targeting methodologies (Mills, 2018). Big Data improves customer service by making it proactive and allows companies to make customer-responsive products (Mills, 2018). Big Data Analytics (BDA) assists in predicting what is wanted out of a product (Mills, 2018). BDA has also been playing a significant role in fraud prevention using artificial intelligence (Mills, 2018). Artificial intelligence techniques such as video recognition, natural language processing, speech recognition, machine learning engines, and automation have been used to help businesses protect against sophisticated fraud schemes (Mills, 2018).
The healthcare industry has utilized machine learning to transform large volumes of medical data into actionable knowledge by performing predictive and prescriptive analytics (Palanisamy & Thirunavukarasu, 2017). Machine learning platforms utilize artificial intelligence to develop sophisticated algorithms that process massive structured and unstructured datasets and perform advanced analytics (Palanisamy & Thirunavukarasu, 2017). For a distributed environment, Apache Mahout (2017), an open-source machine learning library, integrates with Hadoop to facilitate the execution of scalable machine learning algorithms, offering techniques such as recommendation, classification, and clustering (Palanisamy & Thirunavukarasu, 2017).
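As a concrete, hedged illustration of the recommendation capability mentioned above (not drawn from the cited sources), the short Java sketch below builds a user-based recommender with Mahout's "Taste" collaborative-filtering API; the file name ratings.csv, the neighborhood size, and the user and item identifiers are hypothetical assumptions.

import java.io.File;
import java.util.List;
import org.apache.mahout.cf.taste.impl.model.file.FileDataModel;
import org.apache.mahout.cf.taste.impl.neighborhood.NearestNUserNeighborhood;
import org.apache.mahout.cf.taste.impl.recommender.GenericUserBasedRecommender;
import org.apache.mahout.cf.taste.impl.similarity.PearsonCorrelationSimilarity;
import org.apache.mahout.cf.taste.model.DataModel;
import org.apache.mahout.cf.taste.neighborhood.UserNeighborhood;
import org.apache.mahout.cf.taste.recommender.RecommendedItem;
import org.apache.mahout.cf.taste.recommender.Recommender;
import org.apache.mahout.cf.taste.similarity.UserSimilarity;

public class RecommenderSketch {
    public static void main(String[] args) throws Exception {
        // ratings.csv is a hypothetical file of userID,itemID,preference lines.
        DataModel model = new FileDataModel(new File("ratings.csv"));
        UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
        UserNeighborhood neighborhood = new NearestNUserNeighborhood(10, similarity, model);
        Recommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity);
        // Recommend three items for user 1.
        List<RecommendedItem> items = recommender.recommend(1L, 3);
        for (RecommendedItem item : items) {
            System.out.println(item.getItemID() + " -> " + item.getValue());
        }
    }
}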
Conclusion
Big Data has attracted the attention of various sectors, including academia, healthcare, and government. Artificial intelligence has been around for some time. Big Data offers various advantages to organizations, from increasing sales to reducing costs to improving health care. Artificial intelligence also has its advantages, providing real-time analysis that reacts to changes continuously. The use of Big Data has empowered artificial intelligence, and various industries, such as healthcare, are taking advantage of Big Data and artificial intelligence together. Their growing adoption increasingly demonstrates that businesses realize the importance of artificial intelligence in the age of Big Data, and the importance of Big Data's role in the artificial intelligence domain.
References
Kersting, K., & Meyer, U. (2018). From Big Data to Big Artificial Intelligence? Springer.
Liang, Y., & Kelemen, A. (2016). Big Data science and its applications in health and medical research: Challenges and opportunities. Austin Journal of Biometrics & Biostatistics, 7(3).
Luo, J., Wu, M., Gopukumar, D., & Zhao, Y. (2016). Big data application in biomedical research and health care: A literature review. Biomedical Informatics Insights, 8, BII.S31559.
Mills, T. (2018). Eight ways Big Data and AI are changing the business world.
Mishra, B. S. P., Dehuri, S., & Kim, E. (2016). Techniques and environments for Big Data analysis: Parallel, cloud, and grid computing (Vol. 17). Springer.
Palanisamy, V., & Thirunavukarasu, R. (2017). Implications of Big Data Analytics in developing healthcare frameworks – A review. Journal of King Saud University - Computer and Information Sciences.
Patrizio, A. (2018). Big Data vs. artificial intelligence.
Power, B. (2015). Artificial intelligence is almost ready for business.
Qin, X. (2012). Making use of the big data: Next generation of algorithm trading. Paper presented at the International Conference on Artificial Intelligence and Computational Intelligence.
Russell, S. J., & Norvig, P. (2016). Artificial intelligence: A modern approach. Pearson Education Limited.
The purpose of this project is to discuss how data should be handled before Hadoop breaks it into manageable sizes for processing. The discussion begins with an overview of Hadoop, providing a brief history and the difference between Hadoop 1.x and Hadoop 2.x. It then covers the Big Data Analytics process using Hadoop, which involves six significant steps, including data pre-processing and the ETL process, where the data must be converted and cleaned before processing. Before data processing, some consideration must be given to data preprocessing, modeling, and schema design in Hadoop for better processing and data retrieval, as these affect how data can be split among the various nodes of the distributed environment; not all tools can split the data. This consideration begins with the data storage format, followed by Hadoop file types and the challenges that XML and JSON formats pose in Hadoop. The compression of the data must be considered carefully because not all compression types are "splittable." The discussion also covers schema design considerations for HDFS and HBase, since they are used often in the Hadoop ecosystem.
Keywords:
Big Data Analytics; Hadoop; Data
Modelling in Hadoop; Schema Design in Hadoop.
In the age of Big Data, dealing with large datasets in terabytes and petabytes is a reality and requires specific technology, as traditional technology was found inappropriate for it (Dittrich & Quiané-Ruiz, 2012). Hadoop was developed to store and process such large datasets efficiently and is becoming a data processing engine for Big Data (Dittrich & Quiané-Ruiz, 2012). One of the significant advantages of Hadoop MapReduce is that it allows non-expert users to easily run analytical tasks over Big Data (Dittrich & Quiané-Ruiz, 2012). However, before the analytical process takes place, some schema design and data modeling considerations must be addressed so that data processing in Hadoop can be efficient (Grover, Malaska, Seidman, & Shapira, 2015). Hadoop requires splitting the data; some tools can split the data natively, while others cannot and require integration (Grover et al., 2015).
This project discusses these considerations to ensure an appropriate schema design for Hadoop and its components HDFS and HBase, where the data gets stored in a distributed environment. The discussion begins with an overview of Hadoop, followed by the data analytics process, and ends with the data modeling techniques and considerations for Hadoop, which can assist in splitting the data appropriately for better data processing performance and better data retrieval.
Google published and disclosed its MapReduce technique and implementation around 2004 (Karanth, 2014). It also introduced the Google File System (GFS), which is associated with the MapReduce implementation. MapReduce has since become the most common technique to process massive data sets in parallel and distributed settings across many companies (Karanth, 2014). In 2008, Yahoo released Hadoop as an open-source implementation of the MapReduce framework (Karanth, 2014; sas.com, 2018). Hadoop and its file system HDFS are inspired by Google's MapReduce and GFS (Ankam, 2016; Karanth, 2014).
Apache Hadoop is the parent project for all subsequent Hadoop projects (Karanth, 2014). It contains three essential branches: the 0.20.1 branch, the 0.20.2 branch, and the 0.21 branch. The 0.20.2 branch is often termed MapReduce v1.0, MRv1, or Hadoop 1.0. Two additional releases, Hadoop-0.20-append and Hadoop-0.20-Security, introduced HDFS append and security-related features into Hadoop, respectively. The timeline for Hadoop technology is outlined in Figure 1.
Figure 1. Hadoop Timeline from 2003 until 2013 (Karanth, 2014).
Hadoop version 1.0 marked the inception and evolution of Hadoop as a simple MapReduce job-processing framework (Karanth, 2014). It exceeded expectations, with wide adoption for massive data processing. The stable version of the 1.x release includes features such as append and security. Hadoop version 2.0 was released in 2013 to increase the efficiency and mileage of existing Hadoop clusters in enterprises. Hadoop is becoming a common cluster-computing and storage platform rather than being limited to MapReduce only, because it has had to move beyond MapReduce to stay at the forefront of massive-scale data processing while remaining backward compatible (Karanth, 2014).
In Hadoop 1.x, the JobTracker was responsible for both resource allocation and job execution (Karanth, 2014). MapReduce was the only supported model, since the computing model was tied to the resources in the cluster. Yet Another Resource Negotiator (YARN) was developed to separate the concerns of resource management and application execution, which enables other application paradigms to be added to the Hadoop computing cluster. The support for diverse applications results in efficient and effective utilization of the resources and integrates well with the infrastructure of the business (Karanth, 2014). YARN maintains backward compatibility with Hadoop version 1.x APIs (Karanth, 2014); thus, an old MapReduce program can still execute in YARN with no code changes, but it has to be recompiled (Karanth, 2014).
YARN abstracts out the resource management functions to form a platform layer called ResourceManager (RM) (Karanth, 2014). Every cluster must have RM to keep track of cluster resource usage and activity. RM is also responsible for allocation of the resources and resolving contentions among resource seekers in the cluster. RM utilizes a generalized resource model and is agnostic to application-specific resource needs. RM does not need to know the resources corresponding to a single Map or Reduce slot (Karanth, 2014). Figure 2 shows Hadoop 1.x and Hadoop 2.x with YARN layer.
Figure 2. Hadoop 1.x vs. Hadoop 2.x (Karanth, 2014).
Hadoop 2.x involves various enhancements at the storage layer as well. These include the high-availability feature, which provides a hot standby of the NameNode (Karanth, 2014); when the active NameNode fails, the standby can become the active NameNode in a matter of minutes. Zookeeper or another HA monitoring service can be utilized to track NameNode failure (Karanth, 2014), and the failover process that promotes the hot standby to active NameNode is triggered with the assistance of Zookeeper. HDFS federation is another enhancement in Hadoop 2.x; it is a more generalized storage model in which block storage has been generalized and separated from the filesystem layer (Karanth, 2014). HDFS snapshots are another enhancement, providing a read-only image of an entire filesystem or a particular subset of it, to protect against user errors and to support backup and disaster recovery. Other enhancements added in Hadoop 2.x include Protocol Buffers (Karanth, 2014); the wire protocol for RPCs within Hadoop is based on Protocol Buffers. Hadoop 2.x is also aware of the type of storage and exposes this information to applications in order to optimize data fetch and placement strategies (Karanth, 2014). HDFS append support has been another enhancement in Hadoop 2.x.
Hadoop is regarded as the de facto open-source framework for large-scale, massively parallel, and distributed data processing (Karanth, 2014). The Hadoop framework includes two layers: a computation layer and a data layer (Karanth, 2014). The computation layer is used for parallel and distributed computation processing, while the data layer provides highly fault-tolerant data storage associated with the computation layer. These two layers run on commodity hardware, which is inexpensive, readily available, and compatible with other similar hardware (Karanth, 2014).
Hadoop Architecture
Apache Hadoop has four projects: Hadoop Common, the Hadoop Distributed File System (HDFS), Yet Another Resource Negotiator (YARN), and MapReduce (Ankam, 2016). HDFS is used to store data, MapReduce is used to process data, YARN is used to manage cluster resources such as CPU and memory, and Hadoop Common provides the utilities that support the Hadoop framework (Ankam, 2016; Karanth, 2014). Apache Hadoop integrates with other tools such as Avro, Hive, Pig, HBase, Zookeeper, and Apache Spark (Ankam, 2016; Karanth, 2014).
Hadoop has three significant components for Big Data Analytics. HDFS is a framework for reliable distributed data storage (Ankam, 2016; Karanth, 2014), and some considerations must be taken when storing data into it (Grover et al., 2015). The multiple frameworks for parallel processing of data include MapReduce, Crunch, Cascading, Hive, Tez, Impala, Pig, Mahout, Spark, and Giraph (Ankam, 2016; Karanth, 2014). The Hadoop architecture includes NameNodes and DataNodes. It also includes Oozie for workflow, Pig for scripting, Mahout for machine learning, Hive for the data warehouse, Sqoop for data exchange, and Flume for log collection. YARN, introduced in Hadoop 2.0 as discussed earlier, handles distributed computing, HCatalog handles Hadoop metadata management, HBase serves as the columnar database, and Zookeeper provides coordination (Alguliyev & Imamverdiyev, 2014). Figure 3 shows the Hadoop ecosystem components.
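To make the division of labor between HDFS (storage), MapReduce (processing), and YARN (resource management) concrete, the minimal word-count job below shows how a MapReduce program is expressed in Java. This is a standard illustration rather than something taken from the cited sources; the input and output paths are supplied on the command line and are assumed to live in HDFS.

import java.io.IOException;
import java.util.StringTokenizer;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer tokens = new StringTokenizer(value.toString());
            while (tokens.hasMoreTokens()) {
                word.set(tokens.nextToken());
                context.write(word, ONE);   // emit (word, 1) for every token
            }
        }
    }

    public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values) {
                sum += v.get();             // sum the counts for this word
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}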
The process of Big
Data Analytics involves six essential steps (Ankam, 2016).
The first step is the identification of the business problem and desired outcomes. Examples of business problems include declining sales, shopping carts abandoned by customers, a sudden rise in call volumes, and so forth. Examples of outcomes include improving the buying rate by 10%, decreasing shopping cart abandonment by 50%, and reducing call volume by 50% by next quarter while keeping customers happy. The second step is to identify the required data; data sources can be a data warehouse using online analytical processing, an application database using online transactional processing, log files from servers, documents from the internet, sensor-generated data, and so forth, depending on the case and the problem. Data collection is the third step in analyzing the Big Data (Ankam, 2016). The Sqoop tool can be used to collect data from relational databases, Flume can be used for streaming data, and Apache Kafka can be used for reliable intermediate storage. The data collection design should be implemented using a fault-tolerance strategy (Ankam, 2016). Data pre-processing and the ETL process form the fourth step in the analytical process. The collected data comes in various formats, and data quality can be an issue; thus, before processing it, it needs to be converted to the required format and cleaned of inconsistent, invalid, or corrupted data. Apache Hive, Apache Pig, and Spark SQL can be used for preprocessing massive amounts of data.
The analytics implementation is the fifth step, performed to answer the business questions and problems. The analytical process requires understanding the data and the relationships between data points. The types of data analytics include descriptive and diagnostic analytics, which present past and current views of the data to answer questions such as what happened and why. Predictive analytics is performed to answer questions such as what would happen based on a hypothesis. Apache Hive, Pig, Impala, Drill, Tez, Apache Spark, and HBase can be used for data analytics in batch-processing mode. Real-time analytics tools, including Impala, Tez, Drill, and Spark SQL, can be integrated with traditional business intelligence (BI) tools such as Tableau, QlikView, and others for interactive analytics. The last step in this process involves the visualization of the data, presenting the analytics output in a graphical or pictorial format so that the analysis can be understood better for decision making. The finished data is either exported from Hadoop to a relational database using Sqoop for integration into visualization systems, or visualization tools such as Tableau, QlikView, Excel, and so forth are integrated directly. Web-based notebooks such as Jupyter, Zeppelin, and Databricks Cloud are also used to visualize data by integrating Hadoop and Spark components (Ankam, 2016).
Before processing any data, and before collecting any data for storage, some considerations must be taken into account for data modeling and design in Hadoop for better processing and better retrieval (Grover et al., 2015). The traditional data management system is referred to as a Schema-on-Write system, which requires the schema of the data store to be defined before the data is loaded (Grover et al., 2015). This traditional approach results in long analysis cycles of data modeling, data transformation, loading, testing, and so forth before the data can be accessed (Grover et al., 2015). In addition, if anything changes or a wrong decision is made, the cycle must start from the beginning, which takes even longer (Grover et al., 2015). This section addresses the various considerations to address before processing data in Hadoop for analytical purposes.
A dataset may have various levels of quality regarding noise, redundancy, and consistency (Hu, Wen, Chua, & Li, 2014). Preprocessing techniques to improve data quality must therefore be in place in Big Data systems (Hu et al., 2014; Lublinsky, Smith, & Yakubovich, 2013). Data pre-processing involves three techniques: data integration, data cleansing, and redundancy elimination.
Data integration techniques are used to combine data residing in different sources and provide users with a unified view of the data (Hu et al., 2014). The traditional database approach has well-established data integration systems, including the data warehouse method and the data federation method (Hu et al., 2014). The data warehouse approach is also known as ETL, consisting of extraction, transformation, and loading (Hu et al., 2014). The extraction step involves connecting to the source systems and selecting and collecting the required data to be processed for analytical purposes. The transformation step involves applying a series of rules to the extracted data to convert it into a standard format. The load step involves importing the extracted and transformed data into a target storage infrastructure (Hu et al., 2014). The federation approach creates a virtual database to query and aggregate data from various sources (Hu et al., 2014). The virtual database contains information, or metadata, about the actual data and its location, and does not contain the data itself (Hu et al., 2014). These two pre-processing approaches are called store-and-pull techniques, which are not appropriate for Big Data processing given its high computation demands, high streaming rates, and dynamic nature (Hu et al., 2014).
The data cleansing process is vital to keep data consistent and updated, and it is widely used in many fields such as banking, insurance, and retailing (Hu et al., 2014). The cleansing process is required to determine the incomplete, inaccurate, or unreasonable data and then remove it to improve the quality of the data (Hu et al., 2014). The data cleansing process includes five steps (Hu et al., 2014). The first step is to define and determine the error types. The second step is to search for and identify error instances. The third step is to correct the errors, and the fourth is to document error instances and error types. The last step is to modify data entry procedures to reduce future errors. Various types of checks must be done during the cleansing process, including format checks, completeness checks, reasonableness checks, and limit checks (Hu et al., 2014). The process of data cleansing is required to improve the accuracy of the analysis (Hu et al., 2014). However, the data cleansing process depends on a complex relationship model and carries extra computation and delay overhead (Hu et al., 2014). Organizations must seek a balance between the complexity of the data-cleansing model and the resulting improvement in analysis accuracy (Hu et al., 2014).
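As a minimal sketch of the kinds of checks listed above (format, completeness, reasonableness, and limit checks), the plain Java method below validates one hypothetical CSV record of the form id,age,email before it is admitted into the analytical store; the field layout and thresholds are illustrative assumptions, not part of the cited sources.

import java.util.regex.Pattern;

public class RecordCleanser {
    private static final Pattern EMAIL = Pattern.compile("^[^@\\s]+@[^@\\s]+\\.[^@\\s]+$");

    /** Returns true if the record passes the basic cleansing checks. */
    public static boolean isClean(String csvLine) {
        String[] fields = csvLine.split(",", -1);
        if (fields.length != 3) {
            return false;                        // format check: exactly three fields expected
        }
        for (String field : fields) {
            if (field.trim().isEmpty()) {
                return false;                    // completeness check: no empty fields
            }
        }
        int age;
        try {
            age = Integer.parseInt(fields[1].trim());
        } catch (NumberFormatException e) {
            return false;                        // format check: age must be numeric
        }
        if (age < 0 || age > 120) {
            return false;                        // reasonableness/limit check on age
        }
        return EMAIL.matcher(fields[2].trim()).matches();  // format check on email
    }

    public static void main(String[] args) {
        System.out.println(isClean("42,37,user@example.com"));   // true
        System.out.println(isClean("42,,user@example.com"));     // false: incomplete
        System.out.println(isClean("42,250,user@example.com"));  // false: out of range
    }
}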
Redundancy elimination is the third data pre-processing step. Data redundancy, where data is repeated, increases the overhead of data transmission and causes limitations for storage systems, including wasted space, data inconsistency, data corruption, and reduced reliability (Hu et al., 2014). Redundancy reduction methods include redundancy detection and data compression (Hu et al., 2014). The data compression method poses an extra computation burden during the compression and decompression processes (Hu et al., 2014).
Data Modeling and Design Consideration
A Schema-on-Write system is used when the application or structure is well understood and high-value data is frequently accessed through queries and reports (Grover et al., 2015). The term Schema-on-Read is used in the context of the Hadoop data management system (Ankam, 2016; Grover et al., 2015). It refers to raw data that is not processed when loaded into Hadoop and is instead given the required structure at processing time, based on the requirements of the processing application (Ankam, 2016; Grover et al., 2015). Schema-on-Read is used when the application or the structure of the data is not well understood (Ankam, 2016; Grover et al., 2015). The agility of the process comes from schema-on-read, providing valuable insights on data not previously accessible (Grover et al., 2015).
Five factors must be considered before storing data in Hadoop for processing (Grover et al., 2015). The first is the data storage format: a number of file formats and compression formats are supported on Hadoop, and each has strengths that make it better suited to specific applications. Although the Hadoop Distributed File System (HDFS) is the building block of the Hadoop ecosystem used for storing data, several commonly used systems are implemented on top of HDFS, such as HBase for additional data access functionality and Hive for additional data management functionality, and these must also be taken into consideration before storing data in Hadoop (Grover et al., 2015). The second factor is multitenancy, since it is common for clusters to host multiple users, groups, and application types, and multi-tenant clusters involve important considerations for data storage. The third factor is schema design, which should be considered even though Hadoop is schema-less (Grover et al., 2015); it involves the directory structures for data loaded into HDFS and for the output of data processing and analysis, as well as the schemas of objects stored in systems such as HBase and Hive. The fourth factor is metadata management. Metadata is related to the stored data and is often regarded as being as important as the data itself; understanding metadata management plays a significant role because it affects the accessibility of the data. The last factor is security, which involves decisions about authentication, fine-grained access control, and encryption. These security measures should be considered for data at rest, when it gets stored, as well as for data in motion during processing (Grover et al., 2015). Figure 4 summarizes these considerations before storing data into the Hadoop system.
Figure 4. Considerations Before Storing
Data into Hadoop.
When architecting a solution on Hadoop, the method of storing the data is one of the essential decisions. Primary considerations for data storage in Hadoop involve the file format, compression, and the data storage system (Grover et al., 2015). The standard file formats involve three types: text data, structured text data, and binary data. Figure 5 summarizes these three standard file formats.
Figure 5. Standard File Formats.
Text data is a widespread use case for Hadoop, including log files such as weblogs and server logs (Grover et al., 2015). Text data can come in many forms, such as CSV files, or unstructured data such as emails. Compression of these files is recommended, and the selection of the compression format is influenced by how the data will be used (Grover et al., 2015). For instance, if the data is for archival, the most compact compression method can be used, while if the data is used in processing jobs such as MapReduce, a splittable format should be used (Grover et al., 2015). A splittable format enables Hadoop to split files into chunks for processing, which is essential for efficient parallel processing (Grover et al., 2015). In most cases, the use of container formats such as SequenceFiles or Avro provides benefits that make them the preferred format for most file types, including text (Grover et al., 2015). Among other benefits, these container formats provide functionality to support splittable compression (Grover et al., 2015). Binary data, such as images, can be stored in Hadoop as well. A container format such as SequenceFile is preferred when storing binary data in Hadoop; however, if the splittable unit of binary data is more than 64 MB, the data should be put into its own file without using a container format (Grover et al., 2015).
Structured text data includes formats such as XML and JSON, which present unique challenges in Hadoop because splitting XML and JSON files for processing is not straightforward, and Hadoop does not provide a built-in InputFormat for either (Grover et al., 2015). JSON presents more challenges than XML because no token is available to mark the beginning or end of a record. When using these file formats, two primary considerations must be taken. First, a container format such as Avro should be used, because transforming the data into Avro provides a compact and efficient way to store and process it (Grover et al., 2015). Second, a library designed for processing XML or JSON should be used; XMLLoader in Pig's PiggyBank library is an example for XML data, and the Elephant Bird project is an example for JSON data (Grover et al., 2015).
Several Hadoop-based file formats were created to work well with MapReduce (Grover et al., 2015). The Hadoop-specific file formats include file-based data structures such as SequenceFiles, serialization formats such as Avro, and columnar formats such as RCFile and Parquet (Grover et al., 2015). These file types share two essential characteristics that are important for Hadoop applications: splittable compression and agnostic compression. The ability to split files plays a significant role during data processing and should not be underestimated when storing data in Hadoop, because it allows large files to be split for input to MapReduce and other types of jobs, which is a fundamental part of parallel processing and a key to leveraging Hadoop's data locality feature (Grover et al., 2015). Agnostic compression is the ability to compress a file using any compression codec without readers having to know the codec, because the codec is stored in the header metadata of the file format (Grover et al., 2015). Figure 6 summarizes these Hadoop-specific file formats with the common characteristics of splittable compression and agnostic compression.
Figure 6. Three Hadoop File Types with the Two Common Characteristics.
The SequenceFile format is the most widely used of the Hadoop file-based formats. It stores data as binary key-value pairs (Grover et al., 2015) and supports three record formats within a SequenceFile: uncompressed, record-compressed, and block-compressed. Every SequenceFile uses a standard header containing necessary metadata about the file, such as the compression codec used, key and value class names, user-defined metadata, and a randomly generated sync marker. SequenceFiles are well supported in Hadoop; however, they have limited support outside the Hadoop ecosystem, as they are only supported in Java. A frequent use case for SequenceFiles is as a container for smaller files: storing a large number of small files in Hadoop can cause memory issues and excessive processing overhead, and packing smaller files into a SequenceFile makes their storage and processing more efficient because Hadoop is optimized for large files (Grover et al., 2015). Other file-based formats include MapFiles, SetFiles, ArrayFiles, and BloomMapFiles. These formats offer a high level of integration for all forms of MapReduce jobs, including those run via Pig and Hive, because they were designed to work with MapReduce (Grover et al., 2015). Figure 7 summarizes the three formats for records stored within SequenceFiles.
Figure 7. Three Formats for Records
Stored within SequenceFile.
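The sketch below illustrates the small-files use case described above: it packs a handful of local files into one block-compressed SequenceFile on HDFS, keyed by file name. The paths are hypothetical, and the snippet assumes the Hadoop client libraries (and, for Snappy, the native codec) are available on the cluster.

import java.io.File;
import java.nio.file.Files;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.compress.SnappyCodec;

public class SmallFilePacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path target = new Path("/data/packed/smallfiles.seq");   // hypothetical HDFS location

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(target),
                SequenceFile.Writer.keyClass(Text.class),
                SequenceFile.Writer.valueClass(BytesWritable.class),
                SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK, new SnappyCodec()))) {

            for (String name : args) {                            // local files passed on the command line
                byte[] content = Files.readAllBytes(new File(name).toPath());
                // key = original file name, value = raw bytes of the small file
                writer.append(new Text(name), new BytesWritable(content));
            }
        }
    }
}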
Serialization is the process of turning data structures into bytes for storage or for transferring data over the network (Grover et al., 2015). Deserialization is the opposite process of converting a byte stream back into a data structure (Grover et al., 2015). Serialization is a fundamental building block for distributed processing systems such as Hadoop because it allows data to be converted into a format that can be efficiently stored and transferred across a network connection (Grover et al., 2015). Figure 8 illustrates the serialization and deserialization processes.
Figure 8. Serialization Process vs.
Deserialization Process.
Serialization covers two aspects of data processing in a distributed system: interprocess communication (remote procedure calls, or RPC) and data storage (Grover et al., 2015). Hadoop utilizes Writables as its main serialization format, which is compact and fast but usable only from Java. Other serialization frameworks are increasingly used within the Hadoop ecosystem, including Thrift, Protocol Buffers, and Avro (Grover et al., 2015). Avro is a language-neutral data serialization system (Grover et al., 2015) designed to address the main limitation of Hadoop's Writables, namely the lack of language portability. Like Thrift and Protocol Buffers, Avro is described through a language-independent schema (Grover et al., 2015); unlike Thrift and Protocol Buffers, code generation in Avro is optional. Table 1 provides a comparison of these serialization formats.
Table 1:
Comparison between Serialization Formats.
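As an illustration of a language-independent schema with optional code generation, the sketch below defines a tiny Avro schema inline and writes two records to an Avro data file using the generic (non-code-generated) API; the schema, field names, and file name are hypothetical.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteSketch {
    private static final String SCHEMA_JSON =
            "{\"type\":\"record\",\"name\":\"Order\",\"fields\":["
          + "{\"name\":\"order_id\",\"type\":\"long\"},"
          + "{\"name\":\"customer\",\"type\":\"string\"}]}";

    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

        GenericRecord r1 = new GenericData.Record(schema);
        r1.put("order_id", 1001L);
        r1.put("customer", "alice");

        GenericRecord r2 = new GenericData.Record(schema);
        r2.put("order_id", 1002L);
        r2.put("customer", "bob");

        // The schema travels with the data file, so any Avro reader can deserialize it.
        try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.create(schema, new File("orders.avro"));
            writer.append(r1);
            writer.append(r2);
        }
    }
}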
Row-oriented systems have traditionally been used to fetch data stored in a database (Grover et al., 2015). This type of data retrieval has been used because analysis heavily relied on fetching all fields for records that belonged to a specific time range. The process is efficient if all columns of the record are available at the time of writing, because the record can be written with a single disk seek. Columnar storage has more recently been used to fetch data, and it has four main benefits over row-oriented systems (Grover et al., 2015). One benefit is that columnar storage skips I/O and decompression on columns that are not part of the query. Columnar data storage works better for queries that access a small subset of columns, whereas row-oriented storage is appropriate when many columns are retrieved. Compression on columns is more efficient because data within the same column is more similar than data within a block of rows. Columnar data storage is also more appropriate for data-warehousing applications where aggregations are implemented over specific columns rather than over an extensive collection of whole records (Grover et al., 2015).
Hadoop applications have been using columnar file formats including the RCFile format, Optimized Row Columnar (ORC), and Parquet. The RCFile format has been used as a Hive format; it was developed to provide fast data loading, fast query processing, and highly efficient storage space utilization. It breaks files into row splits and uses column-oriented storage within each split. Despite its query and compression advantages over SequenceFiles, RCFile has limitations that prevent optimal query times and compression. The newer columnar formats, ORC and Parquet, are designed to address many of the limitations of the RCFile (Grover et al., 2015).
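As a hedged sketch of how a columnar file might be produced from application code (assuming the parquet-avro library and a Hadoop client are on the classpath; the schema, path, and values are hypothetical), the snippet below writes one record to a Snappy-compressed Parquet file through the Avro binding.

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class ParquetWriteSketch {
    private static final String SCHEMA_JSON =
            "{\"type\":\"record\",\"name\":\"Event\",\"fields\":["
          + "{\"name\":\"ts\",\"type\":\"long\"},"
          + "{\"name\":\"metric\",\"type\":\"string\"},"
          + "{\"name\":\"value\",\"type\":\"double\"}]}";

    public static void main(String[] args) throws Exception {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);

        // Columns are stored and compressed independently, which is what enables
        // the column pruning and better compression described above.
        try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
                .<GenericRecord>builder(new Path("/data/events/part-00000.parquet"))
                .withSchema(schema)
                .withCompressionCodec(CompressionCodecName.SNAPPY)
                .build()) {
            GenericRecord event = new GenericData.Record(schema);
            event.put("ts", System.currentTimeMillis());
            event.put("metric", "cpu_load");
            event.put("value", 0.42);
            writer.write(event);
        }
    }
}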
Compression is another data storage consideration because it plays a crucial role in reducing storage requirements and in improving data processing performance (Grover et al., 2015). Some compression formats supported on Hadoop are not splittable (Grover et al., 2015). Because the MapReduce framework splits data for input to multiple tasks, a non-splittable compression format is an obstacle to efficient processing; thus, splittability is a critical consideration in selecting the compression format and file format for Hadoop. Compression types for Hadoop include Snappy, LZO, Gzip, and bzip2. Google developed Snappy for speed, so it does not offer the best compression size; it is not inherently splittable and is designed to be used with a container format such as SequenceFile or Avro, and it is distributed with Hadoop. Like Snappy, LZO is optimized for speed as opposed to size. Unlike Snappy, LZO supports splittability of compressed files, but it requires indexing; it is also not distributed with Hadoop because of licensing and requires separate installation. Gzip provides good compression performance, with read speed similar to Snappy but slower write performance; like Snappy, it is not splittable and should be used with a container format, and the use of smaller blocks with Gzip can result in better performance. Bzip2 is another compression type for Hadoop. It provides good compression performance, but it can be slower than other compression codecs such as Snappy, so it is not an ideal codec for general Hadoop storage. Unlike Snappy and Gzip, bzip2 is inherently splittable because it inserts synchronization markers between blocks, and it can be used for active archival purposes (Grover et al., 2015).
A compression format can become splittable when used with container file formats such as Avro or SequenceFile, which compress blocks of records or each record individually (Grover et al., 2015). If compression is applied to the entire file without using a container file format, a compression format that inherently supports splitting, such as bzip2, must be used. There are three recommendations for using compression with Hadoop (Grover et al., 2015). The first is to enable compression of MapReduce intermediate output, which improves performance by decreasing the amount of intermediate data that needs to be read from and written to disk. The second recommendation is to pay attention to the order of the data: when similar data is close together, it compresses better. Data in Hadoop file formats is compressed in chunks, and the organization of those chunks determines the final compression. The last recommendation is to consider the use of a compact file format with support for splittable compression, such as Avro. Avro and SequenceFiles support splittability even with non-splittable compression formats: a single HDFS block can contain multiple Avro or SequenceFile blocks, and each of these blocks can be compressed and decompressed individually and independently of the others, which makes the data splittable. Figure 9 shows the Avro and SequenceFile splittability support (Grover et al., 2015).
Figure 9. Compression Example Using Avro
(Grover et al., 2015).
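A minimal sketch of the first recommendation above: enabling compression of the intermediate map output, plus block-compressed SequenceFile output for the final results, through standard Hadoop configuration properties. The codec choice (Snappy) and the job wiring are illustrative assumptions; the mapper, reducer, and paths would be set as in any normal job.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class CompressionConfigSketch {
    public static Job configure() throws Exception {
        Configuration conf = new Configuration();

        // Compress the intermediate map output that is shuffled to the reducers.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec", SnappyCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "compressed-output-job");

        // Compress the final output using a block-compressed SequenceFile container,
        // which keeps the output splittable even though Snappy itself is not.
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        FileOutputFormat.setCompressOutput(job, true);
        FileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
        SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);

        return job;
    }
}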
HDFS and HBase are the most commonly used storage managers in the Hadoop ecosystem. Organizations can store data in HDFS directly or in HBase, which internally stores it on HDFS (Grover et al., 2015). When storing data in HDFS, some design considerations must be taken into account. The schema-on-read model of Hadoop does not impose any requirements when loading data into Hadoop; data can be ingested into HDFS by one of many methods without associating a schema or preprocessing the data. Although Hadoop has been used to load many types of data, such as unstructured and semi-structured data, some order is still required, because Hadoop serves as a central location for the entire organization, and the data stored in HDFS is intended to be shared across various departments and teams (Grover et al., 2015). The data repository should therefore be carefully structured and organized to provide various benefits to the organization (Grover et al., 2015). When there is a standard directory structure, it becomes easier to share data among teams working with the same data set. Data also gets staged in a separate location before it is processed, and a standard staging convention helps ensure that data which has not been appropriately or completely staged is not processed. The standard organization of data allows for reuse of some of the code that processes the data (Grover et al., 2015), and assumptions about data placement help simplify the loading of data into Hadoop. The HDFS data model design for projects such as a data warehouse implementation is likely to use structured fact and dimension tables similar to a traditional schema (Grover et al., 2015), whereas the HDFS data model design for projects with unstructured and semi-structured data is likely to focus on directory placement and metadata management (Grover et al., 2015).
Grover et al. (2015) suggested three key considerations when designing the schema, regardless of the data model design project. The first is to develop standard practices that can be followed by all teams. The second is to ensure the design works well with the chosen tools; for instance, if the version of Hive in use supports table partitions only on directories that are named a certain way, this will affect the schema design and the names of the table subdirectories. The last consideration is to keep usage patterns in mind, because different data processing and querying patterns work better with different schema designs (Grover et al., 2015).
The first step when designing an HDFS schema involves determining the location of the files. A standard file location plays a significant role in finding and sharing data among various departments and teams. It also helps in assigning permissions to access files to various groups and users. The recommended file locations are summarized in Table 2.
The HDFS schema design involves advanced techniques to organize data into files (Grover et al., 2015). A few strategies are recommended to organize the data set: partitioning, bucketing, and denormalization. Partitioning a data set is a common technique used to reduce the amount of I/O required to process it. Unlike a traditional data warehouse, HDFS does not store indexes on the data. This lack of indexes speeds up data ingest, but it comes at the cost of a full table scan: every query has to read the entire dataset, even when processing only a small subset of the data. Breaking the data set up into smaller subsets, or partitions, addresses this by allowing queries to read only the specific partitions they need, reducing the amount of I/O and improving query processing time significantly (Grover et al., 2015). When the data is placed in the filesystem, the directory format for a partition should be as shown below. The order data sets in the example are partitioned by date because a large number of orders are placed daily, so the partitions contain files large enough to be handled efficiently by HDFS. Various tools such as HCatalog, Hive, Impala, and Pig understand this directory structure and leverage the partitioning to reduce the amount of I/O required during data processing (Grover et al., 2015).
<data set name>/<partition_column_name=partition_column_value>/
e.g. medication_orders/date=20181107/[order1.csv, order2.csv]
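A minimal sketch of writing data into the date-partitioned layout shown above using the HDFS FileSystem API; the data set name, partition value, and record contents are hypothetical, and the client is assumed to pick up the cluster configuration from core-site.xml and hdfs-site.xml.

import java.nio.charset.StandardCharsets;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class PartitionedWriteSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        // <data set name>/<partition column>=<partition value>/<file>
        String dataSet = "medication_orders";
        String partition = "date=20181107";
        Path file = new Path("/data/" + dataSet + "/" + partition + "/order1.csv");

        try (FSDataOutputStream out = fs.create(file, true)) {  // overwrite if present
            out.write("order_id,patient_id,drug\n1001,42,aspirin\n"
                    .getBytes(StandardCharsets.UTF_8));
        }
        // Hive, Impala, and Pig can now prune on the date= partition directory.
    }
}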
Bucketing is another technique for breaking a large data set into manageable subsets (Grover et al., 2015). The bucketing technique is similar to the hash partitioning used in relational databases. The partition example above used the date, which resulted in data files large enough to be handled efficiently by HDFS (Grover et al., 2015). However, if the data sets were partitioned by the category of the physician, the result would be too many small files. This leads to the small-files problem, which can cause excessive memory use on the NameNode, since the metadata for each file stored in HDFS is held in memory (Grover et al., 2015). Many small files also lead to many processing tasks, causing excessive overhead in processing. The solution for too many small files is to use bucketing for the physician in this example, which uses a hashing function to map physicians into a specified number of buckets (Grover et al., 2015).
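The hashing idea behind bucketing can be illustrated in a few lines of plain Java: each physician identifier is mapped deterministically to one of a fixed number of buckets, so related records land in the same, manageably sized file. The bucket count and identifiers are illustrative assumptions (a power of two, per the common practice noted below).

public class BucketingSketch {
    /** Maps an identifier to a bucket in [0, numBuckets), the way hash bucketing distributes rows. */
    static int bucketFor(String physicianId, int numBuckets) {
        // Mask off the sign bit so the modulo result is never negative.
        return (physicianId.hashCode() & Integer.MAX_VALUE) % numBuckets;
    }

    public static void main(String[] args) {
        int numBuckets = 16;
        for (String id : new String[] {"dr_smith", "dr_jones", "dr_lee"}) {
            System.out.println(id + " -> bucket " + bucketFor(id, numBuckets));
        }
    }
}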
The bucketing technique controls the size of the data subsets and optimizes query speed (Grover et al., 2015). The recommended average bucket size is a few multiples of the HDFS block size. An even distribution of the data when it is hashed on the bucketing column is essential because it results in consistent bucketing (Grover et al., 2015), and using a number of buckets that is a power of two is common practice. Bucketing also allows joining two data sets, where the join represents the general idea of combining two data sets to retrieve a result. Joins can be implemented through SQL-on-Hadoop systems as well as in MapReduce, Spark, or other programming interfaces to Hadoop. When joining two bucketed data sets, corresponding buckets can be joined individually without having to join the entire datasets, which helps minimize the time complexity of the reduce-side join of the two datasets, an operation that is computationally expensive (Grover et al., 2015). A join can instead be implemented in the map stage of a MapReduce job by loading the smaller of the corresponding buckets into memory, because the buckets are small enough to fit easily; this is called a map-side join, and it improves join performance compared with a reduce-side join. Hive recognizes that tables are bucketed and optimizes the join process accordingly. Further optimization is possible if the data in the buckets is sorted: a merge join can be used, and the entire bucket does not need to be held in memory when joining, resulting in a faster process that uses much less memory than a simple bucket join. Hive supports this optimization as well. The use of both sorting and bucketing on large tables that are frequently joined together, using the join key as the bucketing column, is recommended (Grover et al., 2015).
The schema design depends on how the data will be queried (Grover et al., 2015). Thus, the columns to be used for joining and filtering must be identified before partitioning and bucketing of the data are implemented. In some cases, when identifying a single partitioning key is challenging, the same data set can be stored multiple times, each copy with a different physical organization, which would be regarded as an anti-pattern in a relational database. However, this solution works with Hadoop, because Hadoop data is write-once and few updates are expected, so the overhead of keeping duplicated data sets in sync is reduced; the cost of storage in Hadoop clusters is low as well (Grover et al., 2015). Keeping duplicated data sets in sync provides better query processing speed in such cases (Grover et al., 2015).
Denormalization is another technique that trades disk space for query performance by minimizing the need to join the entire data set (Grover et al., 2015). In the relational database model, data is stored in third normal form (3NF), where redundancy is minimized and data integrity is enforced by splitting data into smaller tables, each holding a particular entity. In this relational model, most queries require joining a large number of tables together to produce the final result (Grover et al., 2015). In Hadoop, however, joins are often the slowest operations and consume the most resources from the cluster. Specifically, the reduce-side join requires sending the entire table over the network, which is computationally costly. While sorting and bucketing help to minimize this computational cost, another solution is to create data sets that are pre-joined or pre-aggregated (Grover et al., 2015). Thus, the data can be joined once and stored in that form, instead of running the join operation every time the data is queried. A Hadoop schema often consolidates many of the small dimension tables into a few larger dimensions by joining them during the ETL process (Grover et al., 2015). Other techniques to speed up processing include aggregation and data type conversion. Duplication of data is of less concern in Hadoop; thus, when a large number of queries repeat the same processing, it is recommended to do it once and reuse the result, as is the case with a materialized view in a relational database. In Hadoop, a new dataset is created that contains the same data in its aggregated form (Grover et al., 2015).
To summarize, the partitioning process is used to reduce the I/O overhead of processing by selectively reading and writing data in particular partitions. Bucketing can be used to speed up queries that involve joins or sampling, again by reducing I/O. Denormalization can be implemented to speed up Hadoop jobs. This section reviewed advanced techniques to organize data into files, including the use of a small number of large files versus a large number of small files; Hadoop prefers working with a small number of large files rather than a large number of small files. The discussion also addressed the reduce-side join versus the map-side join. The reduce-side join is computationally costly; hence, the map-side join technique is preferred and recommended.
HBase is not a relational database (Grover et al., 2015; Yang, Liu, Hsu, Lu, & Chu, 2013). HBase is similar to a large hash table, which allows the association of values with keys and performs a fast lookup of a value based on a given key (Grover et al., 2015). The operations on this hash table involve put, get, scan, increment, and delete. HBase provides scalability and flexibility and is useful in many applications, including fraud detection, which is a widespread application for HBase (Grover et al., 2015). The framework of HBase involves the Master Server, Region Servers, the Write-Ahead Log (WAL), the Memstore, HFiles, the API, and Hadoop HDFS (Bhojwani & Shah, 2016). Each component of the HBase framework plays a significant role in data storage and processing. Figure 10 illustrates the HBase framework.
The following considerations must be taken into account when designing the schema for HBase (Grover et al., 2015).
Row Key Consideration.
Timestamp Consideration.
Hops Consideration.
Tables and Regions Consideration.
Columns Use Consideration.
Column Families Use
Consideration.
Time-To-Live Consideration.
The row key is one of the most critical factors in a well-architected HBase schema design (Grover et al., 2015). Row key considerations involve record retrieval, distribution, the block cache, the ability to scan, size, readability, and uniqueness. The row key is critical for retrieving records from HBase. In a relational database, a composite key can be used to combine multiple primary keys; in HBase, multiple pieces of information can likewise be combined into a single key. For instance, a key composed of customer_id, order_id, and timestamp would be the row key for a row describing an order. In a relational database these would be three different columns, but in HBase they are combined into a single unique identifier. Another consideration for selecting the row key is the get operation, because a get of a single record is the fastest operation in HBase. Designing the key so that a single get retrieves the most common uses of the data improves performance, which requires putting much of the information into a single record; this is called a denormalized design. For instance, while a relational database would place customer information in various tables, in HBase all customer information can be stored in a single record, retrieved with a single get operation. Distribution is another consideration for HBase schema design. The row key determines how the rows of a given table are scattered throughout the regions of the HBase cluster (Grover et al., 2015; Yang et al., 2013). The row keys are sorted, and each region stores a range of these sorted row keys (Grover et al., 2015). Each region is pinned to a region server, namely a node in the cluster (Grover et al., 2015). A combination of device ID and timestamp or reverse timestamp is commonly used to "salt" the key for machine data (Grover et al., 2015). The block cache is a least recently used (LRU) cache that caches data blocks in memory (Grover et al., 2015). By default, HBase reads records in chunks of 64 KB from the disk; each of these chunks is called an HBase block (Grover et al., 2015). When an HBase block is read from disk, it is put into the block cache (Grover et al., 2015). The choice of the row key can affect the scan operation as well. HBase scan rates are about eight times slower than HDFS scan rates; thus, reducing I/O requirements offers a significant performance advantage. The size of the row key also affects the performance of the workload: a short row key is better than a long one because it has lower storage overhead and faster read/write performance. The readability of the row key matters as well, so it is advisable to start with a human-readable row key. Finally, the uniqueness of the row key is critical, since a row key is equivalent to a key in the hash table analogy. If the row key is based on a non-unique attribute, the application should handle such cases and only put data into HBase with a unique row key (Grover et al., 2015).
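A minimal sketch of the composite row key and denormalized single-record access described above, using the standard HBase Java client; the table name, column family, key components, and values are hypothetical assumptions.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class OrderRowKeySketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table orders = connection.getTable(TableName.valueOf("orders"))) {

            // Composite row key: customer_id + order_id + reverse timestamp in a single key.
            long reverseTs = Long.MAX_VALUE - System.currentTimeMillis();  // newest rows sort first
            byte[] rowKey = Bytes.toBytes("cust42|order1001|" + reverseTs);

            // Denormalized write: everything needed to serve the order lives in one row.
            Put put = new Put(rowKey);
            put.addColumn(Bytes.toBytes("o"), Bytes.toBytes("status"), Bytes.toBytes("SHIPPED"));
            put.addColumn(Bytes.toBytes("o"), Bytes.toBytes("total"), Bytes.toBytes("129.95"));
            orders.put(put);

            // Single get: the fastest HBase access path, one round trip for the whole record.
            Result result = orders.get(new Get(rowKey));
            System.out.println(Bytes.toString(
                    result.getValue(Bytes.toBytes("o"), Bytes.toBytes("status"))));
        }
    }
}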
The timestamp is the second essential consideration for good HBase schema design (Grover et al., 2015). The timestamp determines which records are newer in the case of a put operation that modifies a record, and it determines the order in which records are returned when multiple versions of a single record are requested. The timestamp is also used to remove out-of-date records: the time-to-live (TTL) setting is compared with the timestamp to determine whether a record value has been superseded by another put or should be deleted (Grover et al., 2015).
The term hop refers to the number of synchronized get requests needed to retrieve a piece of information from HBase (Grover et al., 2015). The fewer hops, the better, because of the overhead involved. Although multi-hop requests can be made with HBase, it is best to avoid them through better schema design, for example by leveraging denormalization, because every hop is a round trip to HBase that carries a significant performance overhead (Grover et al., 2015).
The number of tables and the number of regions per table in HBase can have a negative impact on performance and on the distribution of the data (Grover et al., 2015). If the number of tables and regions is not configured correctly, it can result in an imbalance in the distribution of the load. Important points include that there is one region server per node, that there are many regions in a region server, that a given region is pinned to a particular region server, and that tables are split into regions and scattered across region servers. A table must have at least one region. All regions in a region server receive put requests and share the region server's memstore, a cache structure present on every HBase region server that caches the writes sent to that region server and sorts them before flushing them when certain memory thresholds are reached. Thus, the more regions that exist in a region server, the less memstore space is available per region. The default configuration sets the ideal flush size to 100 MB, so the memstore size can be divided by 100 MB to obtain the maximum number of regions that should be placed on that region server. Very large regions take a long time to compact; the upper limit on the size of a region is around 20 GB, although there are successful HBase clusters with regions upward of 120 GB. Regions can be assigned to an HBase table using one of two techniques: the first is to create the table with a single default region, which auto-splits as data increases, and the second is to create the table with a given number of regions and set the region size to a high enough value (e.g., 100 GB per region) to avoid auto-splitting (Grover et al., 2015). Figure 11 shows a topology of region servers, regions, and tables.
Figure 11. The Topology of Region Servers, Regions, and Tables (Grover et al., 2015).
The way columns are used in HBase is different from a traditional relational database (Grover et al., 2015; Yang et al., 2013). In HBase, unlike the traditional database, one record can have a million columns and the next record can have a million completely different columns, which is not recommended but is possible (Grover et al., 2015). HBase stores data in a format called HFile, where each column value gets its own row in the HFile (Grover et al., 2015; Yang et al., 2013). This row has fields such as the row key, timestamp, column name, and value. The file format provides various functionality, such as versioning and sparse column storage (Grover et al., 2015).
HBase includes the concept of column families (Grover et al., 2015; Yang et al., 2013). A column family is a container for columns, and a table can have one or more column families. Each column family has its own set of HFiles and gets compacted independently of other column families in the same table. In many cases, no more than one column family is needed per table; more than one column family should be used only when the operations performed on, or the rate of change of, a subset of the columns differs from that of the other columns (Grover et al., 2015; Yang et al., 2013). The last consideration for HBase schema design is the use of TTL, a built-in feature of HBase that ages out data based on its timestamp (Grover et al., 2015). If TTL is not used and an aging requirement exists, a much more I/O-intensive delete operation would need to be performed. The objects in HBase begin with the table object, followed by the regions for the table, a store per column family for each region, and then the memstore, store files, and blocks (Yang et al., 2013). Figure 12 shows the hierarchy of objects in HBase.
Figure 12. The Hierarchy of Objects
in HBase (Yang et al., 2013).
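A minimal sketch of the column-family and TTL considerations above: creating a table with a single column family whose cells age out automatically after 30 days. The table and family names are hypothetical, and the snippet uses the older HTableDescriptor/HColumnDescriptor administrative API, which is still widely available even though newer builder-style APIs exist.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HColumnDescriptor;
import org.apache.hadoop.hbase.HTableDescriptor;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class TtlTableSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Admin admin = connection.getAdmin()) {

            HTableDescriptor table = new HTableDescriptor(TableName.valueOf("sensor_events"));

            // One column family is usually enough; TTL lets HBase age out old cells
            // during compaction instead of requiring an I/O-heavy cleanup job.
            HColumnDescriptor family = new HColumnDescriptor("d");
            family.setTimeToLive(30 * 24 * 60 * 60);   // 30 days, in seconds
            table.addFamily(family);

            admin.createTable(table);
        }
    }
}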
To summarize this section, HBase schema design requires seven key considerations, starting with the row key, which should be selected carefully for record retrieval, distribution, block cache use, the ability to scan, size, readability, and uniqueness. The timestamp and hops are further schema design considerations for HBase. Tables and regions must be considered for put performance and compaction time. The use of columns and column families should also be considered when designing the schema for HBase. The TTL feature for removing aged data is another consideration for HBase schema design.
The above discussion has been about the data and the techniques to store it in Hadoop. Metadata is as essential as the data itself. Metadata is data about the data (Grover et al., 2015). The Hadoop ecosystem has various forms of metadata. Metadata about logical datasets, usually stored in a separate metadata repository, includes information like the location of a dataset (such as a directory in HDFS or an HBase table name), the schema associated with the dataset, the partitioning and sorting properties of the dataset, and the format of the dataset, e.g., CSV, SequenceFile, etc. (Grover et al., 2015). The metadata about files on HDFS includes the permissions and ownership of such files and the location of the various blocks on data nodes, usually stored and managed by the Hadoop NameNode (Grover et al., 2015). Metadata about tables in HBase includes information like table names, associated namespaces, associated attributes (e.g., MAX_FILESIZE, READONLY), and the names of column families, usually stored and managed by HBase (Grover et al., 2015). Metadata about data ingest and transformation includes information like which user generated a given dataset, where the dataset came from, how long it took to generate it, and how many records there are or the size of the data load (Grover et al., 2015). Metadata about dataset statistics includes information like the number of rows in a dataset, the number of unique values in each column, a histogram of the distribution of the data, and maximum and minimum values (Grover et al., 2015). Figure 13 summarizes these various types of metadata.
Figure 13. Various Metadata in Hadoop.
Apache Hive was the first project in the Hadoop ecosystem to store, manage, and leverage metadata (Antony et al., 2016; Grover et al., 2015). Hive stores this metadata in a relational database called the Hive “metastore” (Antony et al., 2016; Grover et al., 2015). Hive also provides a “metastore” service which interfaces with the Hive metastore database (Antony et al., 2016; Grover et al., 2015). When a query is submitted, Hive contacts the metastore to get the metadata for the desired query, the metastore sends the metadata back, Hive generates an execution plan, the job is executed on the Hadoop cluster, and Hive sends the fetched result to the user (Antony et al., 2016; Grover et al., 2015). Figure 14 shows the query process and the role of the metastore in the Hive framework.
Figure 14. Query Process and the Role of
Metastore in Hive (Antony et al., 2016).
As more projects started to use the metadata concept introduced by Hive, a separate project called HCatalog was created to enable the usage of the Hive metastore outside of Hive (Grover et al., 2015). HCatalog is a part of Hive and allows other tools like Pig and MapReduce to integrate with the Hive metastore. It also opens access to the Hive metastore to other tools through a REST API served by the WebHCat server. MapReduce, Pig, and standalone applications can talk directly to the Hive metastore through its APIs, but HCatalog allows easy access through its WebHCat REST APIs, and it allows cluster administrators to lock down access to the Hive metastore to address security concerns. Other ways to store metadata include embedding metadata in file paths and names. Another technique is to store metadata in HDFS in a hidden file, e.g., .metadata. Figure 15 shows HCatalog as an accessibility veneer around the Hive metastore (Grover et al., 2015).
Figure 15. HCatalog acts as an accessibility veneer around the Hive metastore (Grover et al., 2015).
There are some limitations to the Hive metastore and HCatalog, including problems with high availability (HA) (Grover et al., 2015). HA database cluster solutions can be used to bring high availability to the Hive metastore database. For the Hive metastore service, there is support for running multiple metastore instances concurrently on more than one node in the cluster. However, concurrency issues related to data definition language (DDL) operations can occur, and the Hive community is working on fixing these issues. The fixed schema for metadata is another limitation. Hadoop provides much flexibility in the type of data that can be stored, mainly because of the Schema-on-Read concept, whereas the Hive metastore imposes a fixed schema for the metadata itself and provides a tabular abstraction for the datasets. Finally, the metastore is another moving part in the infrastructure which must be kept running and secured as part of the Hadoop infrastructure (Grover et al., 2015).
References
Alguliyev, R., & Imamverdiyev, Y. (2014). Big data: Big promises for information security. Paper presented at the 2014 IEEE 8th International Conference on Application of Information and Communication Technologies (AICT).
Ankam, V. (2016). Big data analytics. Packt Publishing Ltd.
Antony, B., Boudnik, K., Adams, C., Lee, C., Shao, B., & Sasaki, K. (2016). Professional Hadoop. John Wiley & Sons.
Yang, C. T., Liu, J. C., Hsu, W. H., Lu, H. W., & Chu, W. C. C. (2013). Implementation of data transform method into NoSQL database for healthcare data. Paper presented at the 2013 International Conference on Parallel and Distributed Computing, Applications and Technologies.
The purpose of this project is to discuss how data should be handled before Hadoop can act on it by breaking it into manageable sizes. The discussion begins with an overview of Hadoop, providing a brief history of Hadoop and the difference between Hadoop 1.x and Hadoop 2.x. The discussion then covers the Big Data Analytics process using Hadoop, which involves six significant steps, including the data pre-processing and ETL step in which data must be converted and cleaned before it is processed. Before data processing, some considerations must be taken into account for data preprocessing, modeling, and schema design in Hadoop to achieve better processing and data retrieval, because these choices affect how data can be split among the various nodes in the distributed environment, and not all tools can split the data. These considerations begin with the data storage format, followed by Hadoop file types and the challenges of XML and JSON formats in Hadoop. The compression of the data must be considered carefully because not all compression types are “splittable.” The discussion also covers schema design considerations for HDFS and HBase since they are used often in the Hadoop ecosystem.
Keywords:
Big Data Analytics; Hadoop; Data
Modelling in Hadoop; Schema Design in Hadoop.
In the age of Big Data, dealing with large datasets in terabytes and petabytes is a reality and requires specific technology, as traditional technology was found inappropriate for it (Dittrich & Quiané-Ruiz, 2012). Hadoop was developed to store and process such large datasets efficiently, and it is becoming the data processing engine for Big Data (Dittrich & Quiané-Ruiz, 2012). One of the significant advantages of Hadoop MapReduce is that it allows non-expert users to easily run analytical tasks over Big Data (Dittrich & Quiané-Ruiz, 2012). However, before the analytical process takes place, some schema design and data modeling considerations must be taken into account so that data processing in Hadoop can be efficient (Grover, Malaska, Seidman, & Shapira, 2015). Hadoop requires splitting the data; some tools can split the data, while others cannot split it natively and require additional integration (Grover et al., 2015).
This project discusses these considerations to ensure an appropriate schema design for Hadoop and its components HDFS and HBase, where the data gets stored in a distributed environment. The discussion begins with an overview of Hadoop, followed by the data analytics process, and ends with the data modeling techniques and considerations for Hadoop which can assist in splitting the data appropriately for better data processing performance and better data retrieval.
Google published and disclosed its MapReduce technique and implementation around 2004 (Karanth, 2014). It also introduced the Google File System (GFS), which is associated with the MapReduce implementation. Since then, MapReduce has become the most common technique to process massive datasets in parallel and distributed settings across many companies (Karanth, 2014). In 2008, Yahoo released Hadoop as an open-source implementation of the MapReduce framework (Karanth, 2014; sas.com, 2018). Hadoop and its file system HDFS are inspired by Google’s MapReduce and GFS (Ankam, 2016; Karanth, 2014).
Apache Hadoop is the parent project for all subsequent Hadoop projects (Karanth, 2014). It contains three essential branches: the 0.20.1 branch, the 0.20.2 branch, and the 0.21 branch. The 0.20.2 branch is often termed MapReduce v2.0, MRv2, or Hadoop 2.0. Two additional releases, Hadoop-0.20-append and Hadoop-0.20-Security, introduced HDFS append and security-related features into Hadoop, respectively. The timeline for Hadoop technology is outlined in Figure 1.
Figure 1. Hadoop Timeline from 2003 until 2013 (Karanth, 2014).
Hadoop version 1.0 represented the inception and evolution of Hadoop as a simple MapReduce job-processing framework (Karanth, 2014). It exceeded expectations with wide adoption for massive data processing. The stable version of the 1.x release includes features such as append and security. The Hadoop version 2.0 release came out in 2013 to increase efficiency and mileage from existing Hadoop clusters in enterprises. Hadoop is becoming a common cluster-computing and storage platform rather than being limited to MapReduce only, because it has been moving beyond MapReduce to stay at the forefront of massive-scale data processing while facing the challenge of remaining backward compatible (Karanth, 2014).
In Hadoop 1.x, the JobTracker was responsible for resource allocation and job execution (Karanth, 2014). MapReduce was the only supported model since the computing model was tied to the resources in the cluster. Yet Another Resource Negotiator (YARN) was developed to separate the concerns of resource management and application execution, which enables other application paradigms to be added to the Hadoop computing cluster. The support for diverse applications results in efficient and effective utilization of the resources and integrates well with the infrastructure of the business (Karanth, 2014). YARN maintains backward compatibility with the Hadoop version 1.x APIs (Karanth, 2014). Thus, old MapReduce programs can still execute in YARN with no code changes, but they have to be recompiled (Karanth, 2014).
YARN abstracts out the resource management functions to form a platform layer called the ResourceManager (RM) (Karanth, 2014). Every cluster must have an RM to keep track of cluster resource usage and activity. The RM is also responsible for allocating resources and resolving contention among resource seekers in the cluster. The RM utilizes a generalized resource model and is agnostic to application-specific resource needs; for example, it does not need to know the resources corresponding to a single Map or Reduce slot (Karanth, 2014). Figure 2 shows Hadoop 1.x and Hadoop 2.x with the YARN layer.
Figure 2. Hadoop 1.x vs. Hadoop 2.x (Karanth, 2014).
Hadoop 2.x involves various enhancements at the storage layer as well. These enhancements include the high availability feature, which provides a hot standby of the NameNode (Karanth, 2014). When the active NameNode fails, the standby can become the active NameNode in a matter of minutes. ZooKeeper or any other HA monitoring service can be utilized to track NameNode failure (Karanth, 2014). The failover process to promote the hot standby to the active NameNode is triggered with the assistance of ZooKeeper. HDFS federation is another enhancement in Hadoop 2.x; it is a more generalized storage model in which block storage has been generalized and separated from the filesystem layer (Karanth, 2014). HDFS snapshots are another enhancement in Hadoop 2.x, providing a read-only image of the entire filesystem or a particular subset of it to protect against user errors and to support backup and disaster recovery. Other enhancements added in Hadoop 2.x include Protocol Buffers (Karanth, 2014): the wire protocol for RPCs within Hadoop is based on Protocol Buffers. Hadoop 2.x is also aware of the type of storage and exposes this information to the application to optimize data fetch and placement strategies (Karanth, 2014). HDFS append support has been another enhancement in Hadoop 2.x.
Hadoop is regarded as the de facto open-source framework for large-scale, massively parallel, and distributed data processing (Karanth, 2014). The framework of Hadoop includes two layers: a computation layer and a data layer (Karanth, 2014). The computation layer is used for parallel and distributed computation processing, while the data layer provides highly fault-tolerant data storage associated with the computation layer. These two layers run on commodity hardware, which is inexpensive, readily available, and compatible with other similar hardware (Karanth, 2014).
Hadoop Architecture
Apache Hadoop has four projects: Hadoop Common, the Hadoop Distributed File System (HDFS), Yet Another Resource Negotiator (YARN), and MapReduce (Ankam, 2016). HDFS is used to store data, MapReduce is used to process data, YARN is used to manage cluster resources such as CPU and memory, and Hadoop Common provides the common utilities that support the Hadoop framework (Ankam, 2016; Karanth, 2014). Apache Hadoop integrates with other tools such as Avro, Hive, Pig, HBase, Zookeeper, and Apache Spark (Ankam, 2016; Karanth, 2014).
Hadoop has three significant components for Big Data Analytics. HDFS is a framework for reliable distributed data storage (Ankam, 2016; Karanth, 2014), and some considerations must be taken into account when storing data in HDFS (Grover et al., 2015). The multiple frameworks for parallel processing of data include MapReduce, Crunch, Cascading, Hive, Tez, Impala, Pig, Mahout, Spark, and Giraph (Ankam, 2016; Karanth, 2014). The Hadoop architecture includes NameNodes and DataNodes. It also includes Oozie for workflow, Pig for scripting, Mahout for machine learning, Hive for data warehousing, Sqoop for data exchange, and Flume for log collection. YARN, introduced in Hadoop 2.0 as discussed earlier, handles distributed computing, while HCatalog handles Hadoop metadata management. HBase serves as a columnar database, and Zookeeper provides coordination (Alguliyev & Imamverdiyev, 2014). Figure 3 shows the Hadoop ecosystem components.
The process of Big Data Analytics involves six essential steps (Ankam, 2016). The identification of the business problem and outcomes is the first step. Examples of business problems include declining sales, shopping carts abandoned by customers, a sudden rise in call volumes, and so forth. Examples of outcomes include improving the buying rate by 10%, decreasing shopping cart abandonment by 50%, and reducing call volume by 50% by next quarter while keeping customers happy. The second step is to identify the required data, where the data sources can be a data warehouse using online analytical processing, an application database using online transactional processing, log files from servers, documents from the internet, sensor-generated data, and so forth, depending on the case and the problem. Data collection is the third step in analyzing Big Data (Ankam, 2016). The Sqoop tool can be used to collect data from relational databases, and Flume can be used for streaming data. Apache Kafka can be used for reliable intermediate storage. The data collection and design should be implemented using a fault tolerance strategy (Ankam, 2016). Data preprocessing and the ETL process make up the fourth step in the analytical process. The collected data comes in various formats, and data quality can be an issue; thus, before processing it, it needs to be converted to the required format and cleaned of inconsistent, invalid, or corrupted data. Apache Hive, Apache Pig, and Spark SQL can be used for preprocessing massive amounts of data. The analytics implementation is the fifth step, which should be designed to answer the business questions and problems. The analytical process requires understanding the data and the relationships between data points. The types of data analytics include descriptive and diagnostic analytics, which present past and current views of the data to answer questions such as what happened and why it happened. Predictive analytics is performed to answer questions such as what would happen based on a hypothesis. Apache Hive, Pig, Impala, Drill, Tez, Apache Spark, and HBase can be used for data analytics in batch processing mode. Real-time analytics tools, including Impala, Tez, Drill, and Spark SQL, can be integrated with traditional business intelligence (BI) tools such as Tableau, QlikView, and others for interactive analytics. The last step in this process involves the visualization of the data to present the analytics output in a graphical or pictorial format so that the analysis can be better understood for decision making. The finished data is either exported from Hadoop to a relational database using Sqoop for integration into visualization systems, or visualization tools such as Tableau, QlikView, Excel, and so forth are integrated directly. Web-based notebooks such as Jupyter, Zeppelin, and Databricks Cloud are also used to visualize data by integrating Hadoop and Spark components (Ankam, 2016).
Before processing any data, and even before collecting any data for storage, some considerations must be taken into account for data modeling and design in Hadoop to achieve better processing and better retrieval (Grover et al., 2015). The traditional data management system is referred to as a Schema-on-Write system, which requires the definition of the schema of the data store before the data is loaded (Grover et al., 2015). This traditional approach results in long analysis cycles of data modeling, data transformation, loading, testing, and so forth before the data can be accessed (Grover et al., 2015). In addition to this long analysis cycle, if anything changes or a wrong decision was made, the cycle must start from the beginning, which takes even longer (Grover et al., 2015). This section addresses the various types of considerations before processing data from Hadoop for analytical purposes.
A dataset may have various levels of quality regarding noise, redundancy, and consistency (Hu, Wen, Chua, & Li, 2014). Preprocessing techniques to improve data quality must be in place in Big Data systems (Hu et al., 2014; Lublinsky, Smith, & Yakubovich, 2013). Data pre-processing involves three techniques: data integration, data cleansing, and redundancy elimination.
Data integration techniques are used to combine data residing in different sources and provide users with a unified view of the data (Hu et al., 2014). The traditional database approach has well-established data integration methods, including the data warehouse method and the data federation method (Hu et al., 2014). The data warehouse approach is also known as ETL, consisting of extraction, transformation, and loading (Hu et al., 2014). The extraction step involves connecting to the source systems and selecting and collecting the required data to be processed for analytical purposes. The transformation step involves the application of a series of rules to the extracted data to convert it into a standard format. The load step involves importing the extracted and transformed data into a target storage infrastructure (Hu et al., 2014). The federation approach creates a virtual database to query and aggregate data from various sources (Hu et al., 2014). The virtual database contains information, or metadata, about the actual data and its location but does not contain the data itself (Hu et al., 2014). These two pre-processing approaches are called store-and-pull techniques, which are not appropriate for Big Data processing given its high computation demands, heavy streaming, and dynamic nature (Hu et al., 2014).
The data cleansing process is vital to keep data consistent and up to date, and it is widely used in many fields such as banking, insurance, and retailing (Hu et al., 2014). The cleansing process is required to identify incomplete, inaccurate, or unreasonable data and then remove or correct it to improve the quality of the data (Hu et al., 2014). The data cleansing process includes five steps (Hu et al., 2014). The first step is to define and determine the error types. The second step is to search for and identify error instances. The third step is to correct the errors, and the fourth is to document the error instances and error types. The last step is to modify data entry procedures to reduce future errors. Various types of checks must be done during the cleansing process, including format checks, completeness checks, reasonableness checks, and limit checks (Hu et al., 2014). The process of data cleansing is required to improve the accuracy of the analysis (Hu et al., 2014). The data cleansing process depends on a complex relationship model, and it has extra computation and delay overhead (Hu et al., 2014). Organizations must seek a balance between the complexity of the data-cleansing model and the resulting improvement in analysis accuracy (Hu et al., 2014).
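The completeness and limit checks described above can be expressed concisely with Spark, one of the preprocessing tools mentioned earlier in this paper. This is only a minimal sketch: the file paths, column names, and thresholds below are hypothetical and serve only to illustrate the idea.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.col;

public class CleansingSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("CleansingSketch").getOrCreate();

        // Hypothetical raw CSV input with a header row.
        Dataset<Row> raw = spark.read().option("header", "true").csv("hdfs:///data/raw/orders/");

        Dataset<Row> cleaned = raw
                // Completeness check: drop rows missing key identifiers.
                .na().drop(new String[] {"order_id", "customer_id"})
                // Limit/reasonableness check: keep only plausible quantities.
                .filter(col("quantity").cast("int").between(1, 1000));

        // Write the cleansed data back for downstream analytics.
        cleaned.write().mode("overwrite").parquet("hdfs:///data/cleansed/orders/");
        spark.stop();
    }
}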
Redundancy elimination is the third data pre-processing step. Data redundancy, where data is repeated, increases the overhead of data transmission and causes limitations for storage systems, including wasted space, inconsistency of the data, corruption of the data, and reduced reliability (Hu et al., 2014). Redundancy reduction methods include redundancy detection and data compression (Hu et al., 2014). The data compression method poses an extra computation burden in the data compression and decompression processes (Hu et al., 2014).
Data Modeling and Design Considerations
A Schema-on-Write system is used when the application or structure is well understood and frequently accessed through queries and reports on high-value data (Grover et al., 2015). The term Schema-on-Read is used in the context of the Hadoop data management system (Ankam, 2016; Grover et al., 2015). This term refers to raw data that is not processed before it is loaded into Hadoop; the required structure is applied at processing time based on the requirements of the processing application (Ankam, 2016; Grover et al., 2015). Schema-on-Read is used when the application or structure of the data is not well understood (Ankam, 2016; Grover et al., 2015). The agility of the process comes from the schema-on-read approach, which provides valuable insights on data not previously accessible (Grover et al., 2015).
Five factors must be considered before storing data in Hadoop for processing (Grover et al., 2015). The data storage format must be considered, as there are a number of file formats and compression formats supported on Hadoop, and each type of format has strengths that make it better suited to specific applications. Although the Hadoop Distributed File System (HDFS) is the building block of the Hadoop ecosystem used for storing data, several commonly used systems are implemented on top of HDFS, such as HBase for additional data access functionality and Hive for additional data management functionality (Grover et al., 2015). These systems, HBase for data access functionality and Hive for data management functionality, must be taken into consideration before storing data in Hadoop (Grover et al., 2015). The second factor involves multitenancy, which is a common approach where clusters host multiple users, groups, and application types; multi-tenant clusters involve essential considerations for data storage. The schema design factor should also be considered before storing data in Hadoop, even though Hadoop is schema-less (Grover et al., 2015). The schema design consideration involves directory structures for data loaded into HDFS and for the output of data processing and analysis, including the schema of objects stored in systems such as HBase and Hive. The fourth factor is metadata management. Metadata is related to the stored data and is often regarded as being as necessary as the data itself; understanding metadata management plays a significant role because it can affect the accessibility of the data. Security is the last factor which should be considered before storing data in the Hadoop system. The security decisions for the data involve authentication, fine-grained access control, and encryption. These security measures should be considered for data at rest, when it gets stored, as well as in motion during processing (Grover et al., 2015).
Figure 4 summarizes these considerations before storing data into the Hadoop system.
Figure 4. Considerations Before Storing
Data into Hadoop.
When architecting a solution on Hadoop, the method of storing the data in Hadoop is one of the essential decisions. Primary considerations for data storage in Hadoop involve file format, compression, and the data storage system (Grover et al., 2015). The standard file formats involve three types: text data, structured text data, and binary data. Figure 5 summarizes these three standard file formats.
Figure 5. Standard File Formats.
Text data is a widespread use case for Hadoop, including log files such as weblogs and server logs (Grover et al., 2015). This text data can come in many forms, such as CSV files, or as unstructured data such as emails. Compression of the files is recommended, and the selection of the compression format is influenced by how the data will be used (Grover et al., 2015). For instance, if the data is for archival, the most compact compression method can be used, while if the data is used in processing jobs such as MapReduce, a splittable format should be used (Grover et al., 2015). A splittable format enables Hadoop to split files into chunks for processing, which is essential for efficient parallel processing (Grover et al., 2015). In most cases, the use of container formats such as SequenceFiles or Avro provides benefits that make them the preferred format for most file types, including text (Grover et al., 2015). It is worth noting that these container formats provide functionality to support splittable compression, among other benefits (Grover et al., 2015). Binary data, such as images, can be stored in Hadoop as well. A container format such as SequenceFile is preferred when storing binary data in Hadoop. If the splittable unit of binary data is larger than 64 MB, the data should be put into its own file without using a container format (Grover et al., 2015).
Structured text data includes formats such as XML and JSON, which can present unique challenges in Hadoop because splitting XML and JSON files for processing is not straightforward, and Hadoop does not provide a built-in InputFormat for either (Grover et al., 2015). JSON presents more challenges to Hadoop than XML because no token is available to mark the beginning or end of a record. When using these file formats, two primary considerations apply. First, a container format such as Avro should be used, because transforming the data into Avro provides a compact and efficient way to store and process it (Grover et al., 2015). Second, a library designed for processing XML or JSON should be used. XMLLoader in Pig’s PiggyBank library is an example for the XML data type, and the Elephant Bird project is an example for the JSON data type (Grover et al., 2015).
Several Hadoop-specific file formats were created to work well with MapReduce (Grover et al., 2015). These include file-based data structures such as SequenceFiles, serialization formats like Avro, and columnar formats such as RCFile and Parquet (Grover et al., 2015). These file types share two essential characteristics that are important for Hadoop applications: splittable compression and agnostic compression. The ability to split files plays a significant role during data processing and should not be underestimated when storing data in Hadoop, because it allows large files to be split for input to MapReduce and other types of jobs, which is a fundamental part of parallel processing and a key to leveraging Hadoop’s data locality feature (Grover et al., 2015). Agnostic compression is the ability to compress data using any compression codec without readers having to know the codec, because the codec is stored in the header metadata of the file format (Grover et al., 2015). Figure 6 summarizes these Hadoop-specific file formats with the common characteristics of splittable compression and agnostic compression.
Figure 6. Three Hadoop File Types with the Two Common Characteristics.
The SequenceFile format is the most widely used of the Hadoop file-based formats. The SequenceFile format stores data as binary key-value pairs (Grover et al., 2015). It involves three formats for records stored within SequenceFiles: uncompressed, record-compressed, and block-compressed. Every SequenceFile uses a standard header format containing necessary metadata about the file, such as the compression codec used, key and value class names, user-defined metadata, and a randomly generated sync marker. SequenceFiles are well supported in Hadoop; however, they have limited support outside the Hadoop ecosystem, as they are only supported in Java. A frequent use case for SequenceFiles is as a container for smaller files. Storing a large number of small files in Hadoop can cause memory issues and excessive overhead in processing, and packing smaller files into a SequenceFile can make the storage and processing of these files more efficient, because Hadoop is optimized for large files (Grover et al., 2015). Other file-based formats include MapFiles, SetFiles, ArrayFiles, and BloomMapFiles. These formats offer a high level of integration for all forms of MapReduce jobs, including those run via Pig and Hive, because they were designed to work with MapReduce (Grover et al., 2015). Figure 7 summarizes the three formats for records stored within SequenceFiles.
Figure 7. Three Formats for Records
Stored within SequenceFile.
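A short sketch of packing small files into a block-compressed SequenceFile is shown below. The target path, key/value choices, and payload are hypothetical; the writer options come from the standard Hadoop Java API.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class SmallFilePacker {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Path output = new Path("/data/packed/small_files.seq"); // hypothetical target path

        try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
                SequenceFile.Writer.file(output),
                SequenceFile.Writer.keyClass(Text.class),              // key: original file name
                SequenceFile.Writer.valueClass(BytesWritable.class),   // value: file contents
                SequenceFile.Writer.compression(SequenceFile.CompressionType.BLOCK))) {
            byte[] payload = "example file contents".getBytes("UTF-8");
            writer.append(new Text("file-0001.txt"), new BytesWritable(payload));
            // In practice, loop over the small files and append one record per file.
        }
    }
}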
Serialization is the process of turning data structures into bytes for storage or for transferring data over the network (Grover et al., 2015). Deserialization is the opposite process of converting a byte stream back into a data structure (Grover et al., 2015). Serialization is a fundamental building block for distributed processing systems such as Hadoop because it allows data to be converted into a format that can be efficiently stored and transferred across a network connection (Grover et al., 2015). Figure 8 summarizes the serialization and deserialization processes when architecting for Hadoop.
Figure 8. Serialization Process vs.
Deserialization Process.
Serialization appears in two aspects of data processing in a distributed system: interprocess communication through remote procedure calls (RPC), and persistent data storage (Grover et al., 2015).
Hadoop utilizes Writables as its main serialization format, which is compact and fast but is Java-only. Other serialization frameworks have been increasingly used within the Hadoop ecosystem, including Thrift, Protocol Buffers, and Avro (Grover et al., 2015). Avro is a language-neutral data serialization system (Grover et al., 2015). It was designed to address the main limitation of Hadoop’s Writables, namely the lack of language portability. Similar to Thrift and Protocol Buffers, Avro is described through a language-independent schema (Grover et al., 2015). Unlike Thrift and Protocol Buffers, code generation is optional for Avro. Table 1 provides a comparison between these serialization formats.
Table 1:
Comparison between Serialization Formats.
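As an illustration of Avro’s schema-based usage without code generation, the sketch below defines a small schema inline and writes one record to a Snappy-compressed Avro data file. The schema, field names, and output file are hypothetical choices for this example.

import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.CodecFactory;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class AvroWriteSketch {
    public static void main(String[] args) throws Exception {
        // Hypothetical schema describing an order record.
        String schemaJson = "{\"type\":\"record\",\"name\":\"Order\",\"fields\":["
                + "{\"name\":\"order_id\",\"type\":\"string\"},"
                + "{\"name\":\"amount\",\"type\":\"double\"}]}";
        Schema schema = new Schema.Parser().parse(schemaJson);

        GenericRecord record = new GenericData.Record(schema);
        record.put("order_id", "order-123");
        record.put("amount", 99.5);

        try (DataFileWriter<GenericRecord> writer =
                     new DataFileWriter<>(new GenericDatumWriter<GenericRecord>(schema))) {
            writer.setCodec(CodecFactory.snappyCodec());     // block-level compression inside the container
            writer.create(schema, new File("orders.avro"));  // hypothetical local output file
            writer.append(record);
        }
    }
}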
Row-oriented systems have traditionally been used to fetch data stored in a database (Grover et al., 2015). This type of data retrieval has been used because analysis heavily relied on fetching all fields for records that belonged to a specific time range. This process is efficient if all columns of the record are available at the time of writing, because the record can be written with a single disk seek. Columnar storage has more recently been used to fetch data. The use of columnar storage has four main benefits over a row-oriented system (Grover et al., 2015). Skipping I/O and decompression on columns that are not part of the query is one of the benefits of columnar storage. Columnar data storage works better for queries that access a small subset of columns than row-oriented data storage, which is preferable when many columns are retrieved. Compression on columns is more efficient because data is more similar within the same column than it is within a block of rows. Columnar data storage is also more appropriate for data warehousing applications, where aggregations are computed over specific columns rather than over an extensive collection of whole records (Grover et al., 2015). Hadoop applications have been using columnar file formats including the RCFile format, Optimized Row Columnar (ORC), and Parquet. The RCFile format has been used as a Hive format. It was developed to provide fast data loading, fast query processing, and highly efficient storage space utilization. It breaks files into row splits and, within each split, uses column-oriented storage. Despite its advantages in query and compression performance compared to SequenceFiles, it has limitations that prevent optimal query times and compression. The newer columnar formats, ORC and Parquet, are designed to address many of the limitations of the RCFile (Grover et al., 2015).
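A minimal sketch of converting a dataset into a columnar format is shown below, using Spark to write Parquet. The input and output paths are hypothetical; the point is only that a column-oriented copy of the data can be produced in a few lines once the raw data is readable.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class ToParquetSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("ToParquetSketch").getOrCreate();

        // Hypothetical raw CSV input with a header row.
        Dataset<Row> orders = spark.read().option("header", "true").csv("hdfs:///data/raw/orders/");

        // Write a columnar (Parquet) copy that analytic queries can scan selectively.
        orders.write().mode("overwrite").parquet("hdfs:///data/columnar/orders_parquet/");
        spark.stop();
    }
}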
Compression is another data storage consideration, because it plays a crucial role in reducing storage requirements and in improving data processing performance (Grover et al., 2015). Some compression formats supported on Hadoop are not splittable (Grover et al., 2015). Since the MapReduce framework splits data for input to multiple tasks, a non-splittable compression format is an obstacle to efficient processing. Thus, splittability is a critical consideration in selecting the compression format and file format for Hadoop. Compression types for Hadoop include Snappy, LZO, Gzip, and bzip2. Google developed Snappy for speed; however, it does not offer the best compression size. It is designed to be used with a container format like SequenceFile or Avro because it is not inherently splittable, and it is distributed with Hadoop. Similar to Snappy, LZO is optimized for speed as opposed to size. However, unlike Snappy, LZO supports splittability of the compressed files, although it requires indexing. LZO, unlike Snappy, is not distributed with Hadoop and requires a license and a separate installation. Gzip, like Snappy, provides good compression performance, and its read speed is similar to Snappy’s, but Gzip is slower than Snappy for writes. Gzip is not splittable and should be used with a container format; the use of smaller blocks with Gzip can result in better performance. Bzip2 is another compression type for Hadoop. It provides good compression performance, but it can be slower than other compression codecs such as Snappy, so it is not an ideal codec for Hadoop storage. Bzip2, unlike Snappy and Gzip, is inherently splittable, because it inserts synchronization markers between blocks. It can be used for active archival purposes (Grover et al., 2015).
A compression format can become splittable when used with container file formats such as Avro or SequenceFile, which compress blocks of records or each record individually (Grover et al., 2015). If compression is applied to the entire file without using a container file format, a compression format that inherently supports splitting, such as bzip2, must be used. There are three recommendations for using compression with Hadoop (Grover et al., 2015). The first recommendation is to enable compression of MapReduce intermediate output, which improves performance by decreasing the amount of intermediate data that needs to be read from and written to disk. The second recommendation is to pay attention to the order of the data: when similar data is close together, it compresses better. The data in a Hadoop file format is compressed in chunks, and the organization of those chunks determines the final compression. The last recommendation is to consider the use of a compact file format with support for splittable compression, such as Avro. Avro and SequenceFiles support splittability even with non-splittable compression formats: a single HDFS block can contain multiple Avro or SequenceFile blocks, and each of these blocks can be compressed and decompressed individually and independently of the other blocks, which makes the data splittable. Figure 9 shows the Avro and SequenceFile splittability support (Grover et al., 2015).
Figure 9. Compression Example Using Avro
(Grover et al., 2015).
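The first and third recommendations above can be expressed directly in a MapReduce job configuration. The sketch below enables Snappy compression of intermediate map output and block-compressed SequenceFile output; the job name is a hypothetical placeholder, and the mapper, reducer, and paths are omitted.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.SnappyCodec;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;

public class CompressionConfigSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Recommendation 1: compress intermediate map output with Snappy.
        conf.setBoolean("mapreduce.map.output.compress", true);
        conf.setClass("mapreduce.map.output.compress.codec", SnappyCodec.class, CompressionCodec.class);

        Job job = Job.getInstance(conf, "compression-config-sketch"); // hypothetical job name
        // Recommendation 3: write block-compressed SequenceFile output, which stays splittable.
        job.setOutputFormatClass(SequenceFileOutputFormat.class);
        SequenceFileOutputFormat.setCompressOutput(job, true);
        SequenceFileOutputFormat.setOutputCompressorClass(job, SnappyCodec.class);
        SequenceFileOutputFormat.setOutputCompressionType(job, SequenceFile.CompressionType.BLOCK);
        // Mapper, reducer, and input/output paths would be set here in a real job.
    }
}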
HDFS and HBase are the most commonly used storage managers in the Hadoop ecosystem. Organizations can store data in HDFS or in HBase, which internally stores it on HDFS (Grover et al., 2015). When storing data in HDFS, some design techniques must be taken into consideration. The schema-on-read model of Hadoop does not impose any requirements when loading data into Hadoop, as data can be ingested into HDFS by one of many methods without having to associate a schema or preprocess the data. Although Hadoop has been used to load many types of data, such as unstructured and semi-structured data, some order is still required, because Hadoop serves as a central location for the entire organization and the data stored in HDFS is intended to be shared across various departments and teams (Grover et al., 2015). The data repository should be carefully structured and organized to provide various benefits to the organization (Grover et al., 2015). When there is a standard directory structure, it becomes easier to share data among teams working with the same dataset. Data gets staged in a separate location before it is processed, and a standard staging convention helps avoid processing data that has not yet been completely or appropriately staged. The standard organization of data also allows reuse of some of the code that processes it (Grover et al., 2015). Agreed-upon assumptions about data placement can help simplify the process of loading data into Hadoop. An HDFS data model design for projects such as a data warehouse implementation is likely to use fact and dimension table structures similar to the traditional schema (Grover et al., 2015). An HDFS data model design for projects with unstructured and semi-structured data is likely to focus on directory placement and metadata management (Grover et al., 2015).
Grover et al.
(2015) suggested three key considerations when designing the schema, regardless
of the data model design project. The
first consideration is to develop standard practices that can be followed by
all teams. The second point is to ensure
the design works well with the chosen tools.
For instance, if the version of Hive can support only table partitions
on directories that are named a certain way, it will affect the schema design
and the names of the table subdirectories.
The last consideration when designing a schema is to keep usage patterns
in mind, because different data processing and
querying patterns work better with different schema designs (Grover et al., 2015).
The first step when designing an HDFS schema involves determining the location of the files. Standard file locations play a significant role in finding and sharing data among various departments and teams. They also help in assigning access permissions to various groups and users. The recommended file locations are summarized in Table 2.
The HDFS schema design also involves advanced techniques to organize data into files (Grover et al., 2015). A few strategies are recommended to organize the dataset: partitioning, bucketing, and denormalization. Partitioning a dataset is a common technique used to reduce the amount of I/O required to process it. Unlike a traditional data warehouse, HDFS does not store indexes on the data. The lack of indexes plays a key role in speeding up data ingest, but it comes at the cost of a full table scan: every query has to read the entire dataset, even when processing only a small subset of the data. Breaking the dataset up into smaller subsets, or partitions, addresses this problem by allowing queries to read only the specific partitions they need, reducing the amount of I/O and improving query processing time significantly (Grover et al., 2015). When data is placed in the filesystem, the directory format for a partition should be as shown below. The order datasets are partitioned by date because a large number of orders is placed daily, so the partitions will contain files large enough to be handled efficiently by HDFS. Various tools such as HCatalog, Hive, Impala, and Pig understand this directory structure and leverage the partitioning to reduce the amount of I/O required during data processing (Grover et al., 2015).
<data set name>/<partition_column_name=partition_column_value>/
e.g., medication_orders/date=20181107/[order1.csv, order2.csv]
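Such a partitioned layout can be produced directly when writing the data; for example, Spark's partitionBy generates exactly this kind of directory-per-value structure. The paths and column name below are hypothetical choices for illustration.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PartitionedWriteSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("PartitionedWriteSketch").getOrCreate();

        Dataset<Row> orders = spark.read().option("header", "true")
                .csv("hdfs:///data/raw/medication_orders/"); // hypothetical input

        // Produces directories such as .../medication_orders/date=20181107/
        orders.write()
                .partitionBy("date")
                .mode("overwrite")
                .parquet("hdfs:///data/medication_orders/");
        spark.stop();
    }
}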
Bucketing is another technique for breaking a large dataset into manageable subsets (Grover et al., 2015). The bucketing technique is similar to the hash partitioning used in relational databases. The partitioning example above used the date, which results in data files large enough to be handled efficiently by HDFS (Grover et al., 2015). However, if the datasets were partitioned by, for example, the category of the physician, the result would be too many small files. This leads to the small-files problem, which can cause excessive memory use on the NameNode, since the metadata for each file stored in HDFS is kept in memory (Grover et al., 2015). Many small files can also lead to many processing tasks, causing excessive overhead in processing. The solution to too many small files is to use bucketing on the physician column in this example, which uses a hashing function to map physicians into a specified number of buckets (Grover et al., 2015).
The bucketing technique controls the size of the data subsets and optimizes query speed (Grover et al., 2015). The recommended average bucket size is a few multiples of the HDFS block size. An even distribution of data when hashed on the bucketing column is essential because it results in consistent bucketing (Grover et al., 2015). Using a number of buckets that is a power of two is common practice. Bucketing also helps when joining two datasets. The join, in this case, is used to represent the general idea of combining two datasets to retrieve a result. Joins can be implemented through SQL-on-Hadoop systems and also in MapReduce, Spark, or other programming interfaces to Hadoop. When bucketed datasets are joined, the corresponding buckets can be joined individually without having to join the entire datasets, which helps minimize the time complexity of the reduce-side join of the two datasets, an operation that is computationally expensive (Grover et al., 2015). The join can then be implemented in the map stage of a MapReduce job by loading the smaller of the buckets into memory, because the buckets are small enough to fit easily; this is called a map-side join. The map-side join process improves join performance compared to a reduce-side join. Hive, when used for data analysis, recognizes that the tables are bucketed and optimizes the process accordingly. Further optimization can be implemented if the data in the buckets is sorted: a merge join can be used, and the entire bucket does not need to be held in memory when joining, resulting in a faster process that uses much less memory than a simple bucket join. Hive supports this optimization as well. The use of both sorting and bucketing on large tables that are frequently joined together, using the join key for bucketing, is recommended (Grover et al., 2015).
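As a sketch of how bucketing and sorting can be declared when writing a table, the example below uses Spark's bucketBy and sortBy, which implement the same hash-bucketing idea described above. The table name, column, and bucket count are hypothetical.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class BucketedTableSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("BucketedTableSketch")
                .enableHiveSupport()   // so the bucketed table is registered in the metastore
                .getOrCreate();

        Dataset<Row> visits = spark.read().parquet("hdfs:///data/clinical/visits/"); // hypothetical input

        // Hash physician_id into 32 buckets and sort within each bucket,
        // so joins on physician_id can proceed bucket by bucket.
        visits.write()
                .bucketBy(32, "physician_id")
                .sortBy("physician_id")
                .mode(SaveMode.Overwrite)
                .saveAsTable("clinical.visits_bucketed");
        spark.stop();
    }
}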
The schema design depends on how the data will be queried (Grover et al., 2015). Thus, the columns to be used for joining and filtering must be identified before the partitioning and bucketing of the data are implemented. In some cases, when identifying a single partitioning key is challenging, the same dataset can be stored multiple times, each copy with a different physical organization, which would be regarded as an anti-pattern in a relational database. However, this solution works with Hadoop, because Hadoop data is write-once and few updates are expected, so the overhead of keeping the duplicated datasets in sync is reduced, and the cost of storage in Hadoop clusters is low as well (Grover et al., 2015). Duplicated datasets kept in sync provide better query processing speed in such cases (Grover et al., 2015).
Denormalization is another technique that trades disk space for query performance by minimizing the need to join entire datasets (Grover et al., 2015). In the relational database model, data is stored in third normal form (3NF), where redundancy is minimized and data integrity is enforced by splitting the data into smaller tables, each holding a particular entity. In this relational model, most queries require joining a large number of tables together to produce the desired final result (Grover et al., 2015). In Hadoop, however, joins are often the slowest operations and consume the most resources from the cluster. Specifically, the reduce-side join requires sending the entire table over the network, which is computationally costly. While sorting and bucketing help minimize this computational cost, another solution is to create datasets that are pre-joined or pre-aggregated (Grover et al., 2015). Thus, the data can be joined once and stored in that form instead of running the join operations every time the data is queried. A Hadoop schema often consolidates many of the small dimension tables into a few larger dimension tables by joining them during the ETL process (Grover et al., 2015). Other techniques to speed up processing include aggregation and data type conversion. Duplication of the data is of less concern in Hadoop; thus, when the same processing is performed frequently for a large number of queries, it is recommended to do it once and reuse the result, as is the case with a materialized view in a relational database. In Hadoop, a new dataset is created that contains the same data in its aggregated form (Grover et al., 2015).
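A minimal sketch of this pre-joining idea is shown below: two datasets are joined once during ETL and the denormalized result is persisted so that later queries can avoid the join. The paths, table layout, and join key are hypothetical.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PreJoinSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder().appName("PreJoinSketch").getOrCreate();

        Dataset<Row> orders = spark.read().parquet("hdfs:///data/warehouse/orders/");       // hypothetical
        Dataset<Row> customers = spark.read().parquet("hdfs:///data/warehouse/customers/"); // hypothetical

        // Join once during ETL and persist the denormalized result;
        // downstream queries read this dataset instead of repeating the join.
        orders.join(customers, "customer_id")
                .write()
                .mode("overwrite")
                .parquet("hdfs:///data/warehouse/orders_denormalized/");
        spark.stop();
    }
}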
To summarize, the partitioning process is used to reduce the I/O overhead of processing by selectively reading and writing data in particular partitions. Bucketing can be used to speed up queries that involve joins or sampling, again by reducing I/O. Denormalization can be implemented to speed up Hadoop jobs. This section reviewed advanced techniques for organizing data into files. The discussion included the use of a small number of large files versus a large number of small files; Hadoop prefers working with a small number of large files rather than a large number of small files. The discussion also addressed the reduce-side join versus the map-side join technique. The reduce-side join is computationally costly; hence, the map-side join technique is preferred and recommended.
HBase is not a
relational database (Grover et al., 2015; Yang, Liu, Hsu, Lu, & Chu, 2013). HBase is similar to a large hash table, which allows the association of values with
keys and performs a fast lookup of the value based on a given key (Grover et al., 2015).
The hash-table-like operations HBase supports include put, get, scan, increment, and delete. HBase provides scalability and flexibility and is useful in many applications, including fraud detection, which is a widespread application for HBase (Grover et al., 2015).
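A short sketch of the put and get operations through the standard HBase Java client is shown below. The table name, column family, qualifier, and row key are hypothetical.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Get;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

public class HBasePutGetSketch {
    public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        byte[] rowKey = Bytes.toBytes("cust42|order123");       // hypothetical composite row key

        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("orders"))) {
            // put: write one column value into column family "o".
            Put put = new Put(rowKey);
            put.addColumn(Bytes.toBytes("o"), Bytes.toBytes("total"), Bytes.toBytes("99.50"));
            table.put(put);

            // get: fast lookup of the same value by row key.
            Result result = table.get(new Get(rowKey));
            String total = Bytes.toString(result.getValue(Bytes.toBytes("o"), Bytes.toBytes("total")));
            System.out.println("total = " + total);
        }
    }
}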
The framework of HBase involves the Master Server, Region Servers, the Write-Ahead Log (WAL), the Memstore, HFiles, the API, and Hadoop HDFS (Bhojwani & Shah, 2016). Each component of the HBase framework plays a significant role in data storage and processing. Figure 10 illustrates the HBase framework.
The following considerations must be taken into account when designing the schema for HBase (Grover et al., 2015).
Row Key Consideration.
Timestamp Consideration.
Hops Consideration.
Tables and Regions Consideration.
Columns Use Consideration.
Column Families Use
Consideration.
Time-To-Live Consideration.
The row key is one of the most critical factors in a well-architected HBase schema design (Grover et al., 2015). The row key considerations involve record retrieval, distribution, block cache, ability to scan, size, readability, and uniqueness. The row key is critical for retrieving records from HBase. In a relational database, a composite key can be used to combine multiple primary keys; in HBase, multiple pieces of information can likewise be combined in a single key. For instance, a combination of customer_id, order_id, and timestamp could be the row key for a row describing an order. In a relational database, these would be three different columns, but in HBase they are combined into a single unique identifier. Another consideration for selecting the row key is the get operation, because a get of a single record is the fastest operation in HBase. Designing the key so that a single get operation can retrieve the most common uses of the data improves performance; this requires putting much information into a single record, which is called a denormalized design. For instance, while in a relational database customer information would be spread across various tables, in HBase all the customer information can be stored in a single record that one get operation retrieves. Distribution is another consideration for HBase schema design. The row key determines how the rows of a given table are scattered across the regions of the HBase cluster (Grover et al., 2015; Yang et al., 2013). The row keys are sorted, and each region stores a range of these sorted row keys (Grover et al., 2015). Each region is pinned to a region server, namely a node in the cluster (Grover et al., 2015). The combination of a device ID and a timestamp or reverse timestamp is commonly used to “salt” the key in machine data (Grover et al., 2015). The block cache is a least recently used (LRU) cache which caches data blocks in memory (Grover et al., 2015). HBase reads records in chunks of 64 KB from the disk by default; each of these chunks is called an HBase block (Grover et al., 2015). When an HBase block is read from disk, it is put into the block cache (Grover et al., 2015). The choice of the row key can affect the scan operation as well. HBase scan rates are about eight times slower than HDFS scan rates, so reducing I/O requirements has a significant performance advantage. The size of the row key also affects the performance of the workload: a short row key is better than a long row key because it has lower storage overhead and faster read/write performance. The readability of the row key matters as well, so it is essential to start with a human-readable row key. The uniqueness of the row key is also critical, since a row key is equivalent to a key in the hash table analogy. If the row key is based on a non-unique attribute, the application should handle such cases and only put data into HBase with a unique row key (Grover et al., 2015).
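The row key guidance above can be illustrated with a small sketch that builds a salted, composite, human-readable key with a reverse timestamp so that newer entries sort first. The fields, separator, and number of salt buckets are hypothetical choices for illustration.

import org.apache.hadoop.hbase.util.Bytes;

public class RowKeySketch {
    public static void main(String[] args) {
        String customerId = "cust42";     // hypothetical identifiers
        String orderId = "order123";
        long reverseTs = Long.MAX_VALUE - System.currentTimeMillis(); // newest rows sort first

        // Salt prefix spreads keys across regions (16 hypothetical buckets).
        String salt = String.format("%02d", Math.abs(customerId.hashCode()) % 16);

        String readableKey = salt + "|" + customerId + "|" + orderId + "|" + reverseTs;
        byte[] rowKey = Bytes.toBytes(readableKey);

        System.out.println("row key: " + readableKey + " (" + rowKey.length + " bytes)");
    }
}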
The timestamp is the second essential consideration for a good HBase schema design (Grover et al., 2015). The timestamp makes it possible to determine which records are newer in the case of a put operation that modifies a record. It also determines the order in which records are returned when multiple versions of a single record are requested. The timestamp is also used to decide which records to remove: the time-to-live (TTL) setting is compared with the timestamp, and a record value may be aged out, or it may be removed because it has been overwritten by another put or deleted (Grover et al., 2015).
The term hop refers to the number of synchronized “get” requests required to retrieve specific data from HBase (Grover et al., 2015). The fewer hops, the better, because of the overhead of each request. Although multi-hop requests can be made with HBase, it is best to avoid them through better schema design, for example by leveraging denormalization, because every hop is a round trip to HBase that carries a significant performance overhead (Grover et al., 2015).
The number of tables and the number of regions per table in HBase can have a negative impact on the performance and distribution of the data (Grover et al., 2015). If the number of tables and regions is not chosen correctly, it can result in an imbalance in the distribution of the load. Important points to keep in mind are that there is one region server per node, a region server hosts many regions, a given region is pinned to a particular region server, and tables are split into regions and scattered across region servers. A table must have at least one region. All regions in a region server receive “put” requests and share the region server’s “memstore,” which is a cache structure present on every HBase region server. The “memstore” caches the writes sent to that region server and sorts them before flushing them when certain memory thresholds are reached. Thus, the more regions that exist in a region server, the less memstore space is available per region. The default configuration sets the ideal flush size to 100 MB, so dividing the total memstore size by 100 MB gives the maximum number of regions that can reasonably be put on that region server. Very large regions take a long time to compact. The upper limit on the size of a region is around 20 GB, although there are successful HBase clusters with regions upward of 120 GB. Regions can be assigned to an HBase table using one of two techniques. The first technique is to create the table with a single default region, which auto-splits as data increases. The second technique is to create the table with a given number of regions and set the region size to a high enough value, e.g., 100 GB per region, to avoid auto-splitting (Grover et al., 2015). Figure 11 shows a topology of region servers, regions, and tables.
Figure 11. The Topology of Region Servers, Regions, and Tables (Grover et al., 2015).
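The second region-assignment technique above, creating a table with a pre-defined set of regions, can be sketched with the HBase Admin API (HBase 2.x Java client) as shown below. The table name, column family, and split points are hypothetical; the split points assume salted keys in the range "00" to "15" as in the earlier row key sketch.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class PreSplitTableSketch {
    public static void main(String[] args) throws Exception {
        TableDescriptor table = TableDescriptorBuilder.newBuilder(TableName.valueOf("orders"))
                .setColumnFamily(ColumnFamilyDescriptorBuilder.of("o"))
                .build();

        // Hypothetical split points: four regions covering salted keys "00".."15".
        byte[][] splitPoints = new byte[][] {
                Bytes.toBytes("04"), Bytes.toBytes("08"), Bytes.toBytes("12")
        };

        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            // Creates the table already split into regions, so auto-splitting is avoided
            // until regions grow past the configured maximum region size.
            admin.createTable(table, splitPoints);
        }
    }
}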
The columns used in HBase are different from those in a traditional relational database (Grover et al., 2015; Yang et al., 2013). In HBase, unlike a traditional database, a record can have a million columns, and the next record can have a million completely different columns, which is not recommended but is possible (Grover et al., 2015). HBase stores data in a format called HFile, where each column value gets its own row in the HFile (Grover et al., 2015; Yang et al., 2013). Each such row has fields like the row key, timestamp, column name, and value. The file format provides various functionality, like versioning and sparse column storage (Grover et al., 2015).
HBase includes the concept of column families (Grover et al., 2015; Yang et al., 2013). A column family is a container for columns. In HBase, a table can have one or more column families. Each column family has its own set of HFiles and gets compacted independently of the other column families in the same table. In many cases, no more than one column family is needed per table. More than one column family per table makes sense when the operations performed on, or the rate of change of, a subset of the columns of a table differ from those of the other columns (Grover et al., 2015; Yang et al., 2013). The last consideration for HBase schema design is the use of TTL, a built-in feature of HBase that ages out data based on its timestamp (Grover et al., 2015). If TTL is not used and an aging requirement exists, a much more I/O-intensive delete operation would be needed. The objects in HBase begin with the table object, followed by the regions for the table, a store per column family for each region of the table, the memstore, store files, and blocks (Yang et al., 2013). Figure 12 shows the hierarchy of objects in HBase.
Figure 12. The Hierarchy of Objects
in HBase (Yang et al., 2013).
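The column family and TTL considerations can be combined in a single table definition. The sketch below (HBase 2.x Java client) declares one column family whose cells age out after 30 days; the table name, family name, and TTL value are hypothetical.

import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptor;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.TableDescriptor;
import org.apache.hadoop.hbase.client.TableDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

public class TtlColumnFamilySketch {
    public static void main(String[] args) throws Exception {
        // One column family; cells older than 30 days are aged out by TTL.
        ColumnFamilyDescriptor events = ColumnFamilyDescriptorBuilder
                .newBuilder(Bytes.toBytes("e"))
                .setTimeToLive(30 * 24 * 60 * 60)   // TTL in seconds
                .build();

        TableDescriptor table = TableDescriptorBuilder
                .newBuilder(TableName.valueOf("sensor_events"))
                .setColumnFamily(events)
                .build();

        try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
             Admin admin = conn.getAdmin()) {
            admin.createTable(table);
        }
    }
}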
To summarize this section, HBase schema design requires seven key considerations, starting with the row key, which should be selected carefully for record retrieval, distribution, block cache, ability to scan, size, readability, and uniqueness. The timestamp and hops are other schema design considerations for HBase. Tables and regions must be considered for put performance and compaction time. The use of columns and column families should also be considered when designing the schema for HBase. The TTL feature for removing aged data is another consideration for HBase schema design.
The above discussion has focused on the data and the techniques to store it in Hadoop. Metadata is as essential as the data itself; metadata is data about the data (Grover et al., 2015). The Hadoop ecosystem holds various forms of metadata. Metadata about a logical dataset is usually stored in a separate metadata repository and includes information such as the location of the dataset (e.g., a directory in HDFS or an HBase table name), the schema associated with the dataset, the partitioning and sorting properties of the dataset, and the format of the dataset, e.g., CSV, SequenceFile, etc. (Grover et al., 2015). Metadata about files on HDFS includes the permissions and ownership of such files and the location of the various blocks on data nodes; it is usually stored and managed by the Hadoop NameNode (Grover et al., 2015). Metadata about tables in HBase includes information such as table names, associated namespaces, associated attributes (e.g., MAX_FILESIZE, READONLY), and the names of column families; it is usually stored and managed by HBase itself (Grover et al., 2015). Metadata about data ingest and transformation includes information such as which user generated a given dataset, where the dataset came from, how long it took to generate, how many records it contains, and the size of the data load (Grover et al., 2015). Metadata about dataset statistics includes information such as the number of rows in a dataset, the number of unique values in each column, a histogram of the distribution of the data, and maximum and minimum values (Grover et al., 2015). Figure 13 summarizes these various types of metadata.
Figure 13. Various Metadata in Hadoop.
Apache Hive was the first project in the Hadoop ecosystem to store, manage, and leverage metadata (Antony et al., 2016; Grover et al., 2015). Hive stores this metadata in a relational database called the Hive "metastore" (Antony et al., 2016; Grover et al., 2015). Hive also provides a "metastore" service which interfaces with the Hive metastore database (Antony et al., 2016; Grover et al., 2015). During query processing, Hive contacts the metastore to obtain the metadata for the desired query; the metastore returns the metadata, Hive generates an execution plan, the job is executed on the Hadoop cluster, and Hive fetches the results and returns them to the user (Antony et al., 2016; Grover et al., 2015). Figure 14 shows the query process and the role of the metastore in the Hive framework.
Figure 14. Query Process and the Role of
Metastore in Hive (Antony et al., 2016).
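To illustrate the role of the metastore from a client's point of view, the short sketch below submits a query through HiveServer2 using the third-party PyHive package; Hive resolves the table's schema, format, and location from the metastore before planning and running the job. The host, port, database, and table name are assumptions for illustration.

```python
from pyhive import hive  # third-party client for HiveServer2 (assumed installed)

# Connect to HiveServer2; hostname, port, and database are illustrative assumptions.
conn = hive.connect(host="hive.example.com", port=10000, database="default")
cursor = conn.cursor()

# Hive looks up the schema, format, and HDFS location of 'claims'
# in the metastore before generating and executing the job plan.
cursor.execute("SELECT provider_id, COUNT(*) AS n FROM claims GROUP BY provider_id")
for provider_id, n in cursor.fetchall():
    print(provider_id, n)
```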
As more projects sought to utilize the concept of metadata introduced by Hive, a separate project called HCatalog was created to enable the use of the Hive metastore outside of Hive (Grover et al., 2015). HCatalog is part of Hive and allows other tools, like Pig and MapReduce, to integrate with the Hive metastore. It also opens access to the Hive metastore to external tools through a REST API exposed by the WebHCat server. MapReduce, Pig, and standalone applications can talk directly to the Hive metastore through its APIs, but HCatalog allows easy access through its WebHCat REST APIs and lets cluster administrators lock down access to the Hive metastore to address security concerns. Other ways to store metadata include embedding metadata in file paths and names. Another technique involves storing metadata in HDFS in a hidden file, e.g., .metadata. Figure 15 shows HCatalog as an accessibility veneer around the Hive metastore (Grover et al., 2015).
Figure 15. HCatalog Acts as an Accessibility Veneer Around the Hive Metastore (Grover et al., 2015).
There are some limitations to the Hive metastore and HCatalog, including a problem with high availability (HA) (Grover et al., 2015). HA database cluster solutions can be used to bring high availability to the Hive metastore database. For the Hive metastore service, there is support for running multiple metastore instances concurrently on more than one node in the cluster. However, concurrency issues related to data definition language (DDL) operations can occur, and the Hive community is working on fixing these issues. The fixed schema for metadata is another limitation. Hadoop provides much flexibility in the types of data that can be stored, mainly because of the Schema-on-Read concept, but the Hive metastore imposes a fixed schema on the metadata itself and provides a tabular abstraction for the datasets. Finally, the metastore is another moving part in the infrastructure, which needs to be kept running and secured as part of the Hadoop infrastructure (Grover et al., 2015).
Alguliyev, R., & Imamverdiyev, Y. (2014). Big data: Big promises for information security. Paper presented at the 2014 IEEE 8th International Conference on Application of Information and Communication Technologies (AICT).
Ankam, V. (2016). Big Data Analytics: Packt Publishing Ltd.
Antony, B., Boudnik, K., Adams, C., Lee, C., Shao, B., & Sasaki, K. (2016). Professional Hadoop: John Wiley & Sons.
Yang, C. T., Liu, J. C., Hsu, W. H., Lu, H. W., & Chu, W. C. C. (2013, 16-18 Dec. 2013). Implementation of Data Transform Method into NoSQL Database for Healthcare Data. Paper presented at the 2013 International Conference on Parallel and Distributed Computing, Applications and Technologies.
The purpose of this discussion is to identify and research an analytics case that involves data in motion. The discussion addresses how data-in-motion analytics is performed in the selected case study and why it is essential to apply data analytics to data in motion. The discussion begins with data in motion vs. data at rest, followed by data-in-motion analytics in healthcare. The case study selected for this discussion concerns fraud detection in healthcare insurance.
Data in Motion vs. Data at Rest
(CSA, 2013) has categorized Big Data technologies into two groups: batch processing and stream processing. Batch processing involves analytics on data at rest, while stream processing involves analytics on data in motion. Real-time data (a.k.a. data in motion) is streaming data that needs to be analyzed as it comes in (Jain, 2013). The processing of real-time data does not always require the data to reside in memory (CSA, 2013), as technologies such as Drill (Hausenblas & Nadeau, 2013) and Dremel (Melnik et al., 2010) provide new techniques for interactive analysis of large-scale datasets (CSA, 2013). Hadoop dominates batch processing through MapReduce. Stream processing, however, does not have a single dominant technology like Hadoop; rather, it involves stream technologies such as InfoSphere Streams and Storm (CSA, 2013). Figure 1 shows data analytics using batch processing (data at rest) and streaming processing (data in motion).
Figure 1. Data In Motion (Stream) vs.
Data At Rest (Batch) (CSA, 2013)
(Tibco, 2013) has described data at rest vs. data in motion from the Business Intelligence (BI) perspective. Traditional BI employs data at rest, such as customer information, purchasing history, inventory, and so forth, which does not change continuously. Data-at-rest analyses can be used for decisions for the next week, next month, or next quarter (Tibco, 2013). Data-in-motion analytics can be employed to make immediate decisions for the next few minutes (Tibco, 2013). The value of data-in-motion or real-time analytics is significant, as it implicitly allows for higher quality data for decision making (evariant.com, 2015). Figure 2 illustrates the value of data-in-motion analytics for businesses.
Figure 2. Business Values Through an
Accelerated “Insight to Action” Process (Tibco, 2013).
Data in Motion Analytics Value in Healthcare
Various studies and reports have discussed the value of real-time or data-in-motion analytics in healthcare. Real-time or streaming data analytics will be a critical asset providing a more profound understanding of patient situations at the point of care (Maike, 2018). Such understanding, based on real-time data analytics, can assist in decreasing costs and improving outcomes. (Maike, 2018) has referenced a BIS Research prediction that the big data analytics market will grow to over $68.75 billion by 2025. Real-time data analytics for patients and real-time monitoring of patient care plans give providers the opportunity to care for patients proactively (Maike, 2018). (redhat.com, 2016) has reported a use case for real-time data analytics. TMG Health was using a batch-oriented application which prevented it from providing continuous data visibility and access to its Medicare and Medicaid clients. TMG handles various tasks such as billing, health insurance, and claim processing, handling more than 3 million file feeds daily. TMG was seeking a solution to process data quickly and improve visibility and efficiency. TMG employed a new application to accelerate real-time data access and visibility for its clients, which reduced development time and costs. (Perna, 2015) has addressed another use case for the value of real-time analytics in healthcare: Maine's health information exchange rolled out real-time predictive analytics for its members. The analytics service employs real-time clinical data to identify potentially costly patients who are at risk of a stroke, heart attack, or Type 2 diabetes. The real-time predictive analytics adds value to patient care and supports value-based reimbursement and risk-based models, which can filter the population by provider and contractor. (Maike, 2018) has also discussed real-time data from medical devices, which are increasingly connected and able to relay data to centralized patient management systems. With real-time analytics, healthcare providers can monitor patients both during their hospital stays and after they return home.
Fraud Detection Use Case in Insurance Agency
(Nelson, 2017) has reported an insurance agency's use case of real-time analytics for fraud detection. The case study shows that the insurance agency was able to detect $100+ million in fraud in a fraction of the time the legacy process used to take. (Nelson, 2017) has also provided a good example of a powerful user interface for detecting new fraud schemes, shown in Figure 3.
Figure 3. Powerful User Interface to Detect New Fraud Schemes with Real-Time Analytics (Nelson, 2017).
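Nelson (2017) does not publish the agency's implementation, but the idea of scoring claims as they arrive, rather than in overnight batches, can be sketched in a few lines of Python: a running mean and standard deviation per provider flag claims whose amounts deviate sharply from that provider's history. The thresholds, field names, and sample stream are assumptions for illustration only.

```python
import math
from collections import defaultdict

class StreamingClaimScorer:
    """Flag claims whose amount is far from the provider's running average
    (a simplified stand-in for real-time fraud analytics on data in motion)."""

    def __init__(self, z_threshold=3.0):
        self.z_threshold = z_threshold
        self.stats = defaultdict(lambda: [0, 0.0, 0.0])  # count, mean, M2 (Welford)

    def score(self, provider_id, amount):
        n, mean, m2 = self.stats[provider_id]
        # Score against the provider's history *before* folding in this claim.
        std = math.sqrt(m2 / (n - 1)) if n > 1 else 0.0
        z = (amount - mean) / std if std > 0 else 0.0
        suspicious = n >= 3 and z > self.z_threshold
        # Welford's online update keeps the per-provider state constant in size.
        n += 1
        delta = amount - mean
        mean += delta / n
        m2 += delta * (amount - mean)
        self.stats[provider_id] = [n, mean, m2]
        return suspicious  # True means "route this claim to an investigator"

scorer = StreamingClaimScorer()
stream = [("prov-1", 100), ("prov-1", 120), ("prov-1", 95), ("prov-1", 110), ("prov-1", 5000)]
for provider, amount in stream:
    if scorer.score(provider, amount):
        print("suspicious claim:", provider, amount)
```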
Conclusion
Big Data Analytics is distinguished from Business Intelligence (BI) by its data processing. Traditional analytics techniques in BI use batch processing based on historical data, where decisions are made for the next week, next month, or even the next quarter. However, the unique characteristics of Big Data, such as volume, velocity, and variety, provide a unique opportunity to process data in motion and make more effective decisions in real time or near real time. CSA categorized two types of processing: batch and streaming. Batch processing is similar to BI data processing and can be implemented using Hadoop technology. The processing of data in motion, or real-time processing, is implemented through interactive technologies such as Drill and Dremel. Real-time data analytics has proven effective and efficient in reducing costs and increasing value in various domains.
References
CSA. (2013). Big Data Analytics for Security Intelligence. Big Data Working Group, Cloud Security Alliance.
Melnik, S., Gubarev, A., Long, J. J., Romer, G., Shivakumar, S., Tolton, M., & Vassilakis, T. (2010). Dremel: Interactive analysis of web-scale datasets. Proceedings of the VLDB Endowment, 3(1-2), 330-339.
Nelson, P. (2017). Fraud Detection Powered by Big Data – An Insurance Agency's Case Story.
Perna, G. (2015). Patients in Motion: Maine HIE Rolls Out Real-Time Predictive Analytics.
The purpose of this discussion is to discuss and analyze the impact of XML on MapReduce. The discussion addresses the various techniques and approaches proposed by research studies for processing large XML documents using MapReduce. The XML fragmentation process, both in the absence and in the presence of MapReduce, is also discussed to provide a better understanding of the complex process of handling large XML documents in a distributed, scalable MapReduce environment.
XML Query Processing Using MapReduce
The XML format has been used to store data for multiple applications (Aravind & Agrawal, 2014). Such data needs to be ingested into Hadoop and analyzed to obtain value from the XML data (Aravind & Agrawal, 2014). The Hadoop ecosystem needs to understand XML when it is ingested and be able to interpret it (Aravind & Agrawal, 2014). MapReduce is a building block of the Hadoop ecosystem. In the age of Big Data, XML documents are expected to be very large, and their processing must be scalable and distributed. Processing XML queries using MapReduce requires decomposing a big XML document and distributing its portions to different nodes. The relational approach is not appropriate because it is expensive: transforming a big XML document into relational database tables can be extremely time consuming, and θ-joins among relational tables are costly (Wu, 2014). Various research studies have proposed approaches to implement native XML query processing algorithms using MapReduce.
(Dede, Fadika, Gupta, & Govindaraju, 2011) have discussed and analyzed the scalable and distributed processing of scientific XML data and how the MapReduce model should be used in XML metadata indexing. The study has presented performance results using two MapReduce implementations: the Apache Hadoop framework and the proposed LEMO-MR framework. The study has provided an indexing framework that is capable of indexing and efficiently searching large-scale scientific XML datasets. The framework has been tailored for integration with any framework that uses the MapReduce model, to meet the scalability and variety requirements.
(Fegaras, Li, Gupta, & Philip, 2011) have also discussed and analyzed query optimization in a MapReduce environment. The study has presented a novel query language for large-scale analysis of XML data in a MapReduce environment, called MRQL (MapReduce Query Language), which is designed to capture the most common data analysis tasks in a form that can be optimized. XML data fragmentation is also discussed in this study. Parallel data computation expects the input data to be fragmented into small, manageable pieces that determine the granularity of the computation. In a MapReduce environment, each map worker is assigned a data split that consists of data fragments, and a map worker processes these data one fragment at a time. For structured relational data, a fragment is a relational tuple, while for a text file a fragment can be a single line. However, for hierarchical data and nested collections such as XML data, the fragment size and structure depend on the actual application that processes these data. For instance, XML data may consist of some XML documents, each one containing a single XML element whose size may exceed the memory capacity of a map worker. Thus, when processing XML data, it is recommended to allow custom fragmentation to meet a wide range of application requirements. (Fegaras et al., 2011) have argued that Hadoop provides a simple input format for XML fragmentation based on a single tag name. A data split of an XML document may start and end at arbitrary points in the document, even in the middle of tag names. (Fegaras et al., 2011; Sakr & Gaber, 2014) have indicated that this input format allows reading the document as a stream of string fragments, so that each string contains a single complete element with the requested tag name. An XML parser can then be used to parse these strings and convert them to objects. The fragmentation process is complex because the requested elements may cross data split boundaries, and these data splits may reside on different data nodes in the distributed file system (DFS). The Hadoop DFS implicitly addresses this problem by allowing a reader to scan beyond a data split into the next one, subject to some overhead for transferring data between nodes. (Fegaras et al., 2011) have proposed an XML fragmentation technique built on top of the existing Hadoop XML input format that provides a higher level of abstraction and better customization. It is a higher level of abstraction because it constructs XML data in the MRQL data model, ready to be processed by MRQL queries, instead of deriving a string for each XML element (Fegaras et al., 2011).
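The single-tag fragmentation idea can be illustrated outside Hadoop with a short Python sketch: a generator scans a character stream for a given element name, emits each complete element as a string fragment, and a standard parser then turns each fragment into an object. This is only a simplified, in-memory analogue of Hadoop's XML input format; the tag name and the sample document are assumptions for illustration.

```python
import io
import xml.etree.ElementTree as ET

def xml_fragments(stream, tag):
    """Yield each complete <tag>...</tag> element found in a character stream."""
    start, end = "<" + tag, "</" + tag + ">"
    buf = ""
    for chunk in iter(lambda: stream.read(64), ""):   # read the stream in small chunks
        buf += chunk
        while True:
            s = buf.find(start)
            e = buf.find(end, s)
            if s == -1 or e == -1:
                break
            yield buf[s:e + len(end)]                  # one complete element as a string
            buf = buf[e + len(end):]

doc = io.StringIO(
    "<records><patient><id>1</id><name>Jane</name></patient>"
    "<patient><id>2</id><name>John</name></patient></records>"
)
for fragment in xml_fragments(doc, "patient"):
    element = ET.fromstring(fragment)                  # parse the string fragment into an object
    print(element.findtext("id"), element.findtext("name"))
```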
(Sakr & Gaber, 2014) have also briefly discussed another language that has been proposed to support distributed XML processing using the MapReduce framework, called ChuQL. It is a MapReduce-based extension of the syntax, grammar, and semantics of XQuery, the standard W3C language for querying XML documents. The ChuQL implementation takes care of distributing the computation to multiple XQuery engines running on Hadoop nodes, as described by one or more ChuQL MapReduce expressions. The "word count" example program can be represented in the ChuQL language using its extended expressions, where the MapReduce expression is used to describe a MapReduce job. The input and output clauses are used to read from and write to HDFS, respectively. The rr and rw clauses describe the record reader and record writer, respectively. The map and reduce clauses represent the standard map and reduce phases of the framework; they process XML values, or key/value pairs of XML values, to match the MapReduce model and are specified using XQuery expressions. Figure 1 shows the word count example program in ChuQL using XML in a distributed environment.
Figure 1. The Word Count Example Program in ChuQL Using XML in a Distributed Environment (Sakr & Gaber, 2014).
(Vasilenko & Kurapati, 2014) have discussed and analyzed the efficient processing of XML documents in the Hadoop MapReduce environment. They argued that the most common approach to processing XML data is to introduce a custom solution based on user-defined functions or scripts; the common choices vary from introducing an ETL process for extracting the data of interest to transforming the XML into other formats that are natively supported by Hive. They have instead addressed a generic approach to handling XML based on the Apache Hive architecture. The researchers have described an approach that complements the existing family of Hive serializers and de-serializers for other popular data formats, such as JSON, and makes it much easier for users to deal with large XML datasets. The implementation creates logical splits of the input files, each of which is assigned to an individual mapper. The mapper relies on the implemented Apache Hive XML SerDe to break the split into XML fragments using specified start/end byte sequences. Each fragment corresponds to a single Hive record. The fragments are handled by the XML processor to extract values for the record columns using specified XPath queries. The reduce phase was not required in this implementation (Vasilenko & Kurapati, 2014).
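The mapping from an XML fragment to a flat record can be illustrated with Python's built-in ElementTree and its limited XPath support. This is only a conceptual analogue of the Hive XML SerDe described by Vasilenko and Kurapati (2014); the column-to-XPath mapping and the sample fragment are assumptions.

```python
import xml.etree.ElementTree as ET

# Assumed mapping from Hive-style column names to XPath expressions.
COLUMNS = {
    "claim_id": "./id",
    "provider": "./provider/name",
    "amount":   "./amount",
}

def fragment_to_record(fragment):
    """Extract one flat record (column -> value) from a single XML fragment."""
    root = ET.fromstring(fragment)
    return {col: root.findtext(xpath) for col, xpath in COLUMNS.items()}

fragment = """
<claim>
  <id>C-1001</id>
  <provider><name>Dr. Smith</name></provider>
  <amount>125.00</amount>
</claim>
"""
print(fragment_to_record(fragment))
```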
(Wu, 2014) has discussed and analyzed the partitioning of XML documents and the distribution of XML fragments to different compute nodes, which can introduce high overhead in transferring XML fragments from one node to another during MapReduce execution. The study proposes a technique that uses MapReduce to distribute the labels in inverted lists across a computing cluster so that structural joins can be performed in parallel to process queries. It also proposes an optimization technique that reduces the computing space in the proposed framework to improve the performance of query processing. The author argues that this approach differs from the current approach of shredding the XML document and distributing it across different nodes in a cluster. The process includes reading and distributing the inverted lists that are required for the input queries during query processing; their size is much smaller than the size of the whole document. The process also partitions the total computing space for structural joins so that each sub-space can be handled by one reducer performing structural joins. A pruning-based optimization algorithm is also proposed to further improve the performance of the approach.
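Wu (2014) does not give code, but the underlying idea of joining inverted lists of element labels can be sketched with a common region (start, end) labeling scheme, in which an element is an ancestor of another exactly when its interval encloses the other's. The labels and the query below are illustrative assumptions, and the parallel partitioning across reducers is omitted.

```python
# Region labels: element -> (start, end) positions in document order (assumed values).
# An element A is an ancestor of D iff A.start < D.start and D.end < A.end.
inverted_lists = {
    "provider": [(1, 20), (21, 40)],          # labels of all <provider> elements
    "name":     [(2, 5), (22, 25), (50, 53)], # labels of all <name> elements
}

def structural_join(ancestors, descendants):
    """Return (ancestor, descendant) label pairs where the ancestor encloses the descendant."""
    return [
        (a, d)
        for a in ancestors
        for d in descendants
        if a[0] < d[0] and d[1] < a[1]
    ]

# Evaluate the path query //provider//name using only the two inverted lists.
print(structural_join(inverted_lists["provider"], inverted_lists["name"]))
# -> [((1, 20), (2, 5)), ((21, 40), (22, 25))]
```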
Conclusion
This discussion has addressed XML query processing in a MapReduce environment. It has covered the various techniques and approaches proposed by research studies for processing large XML documents using MapReduce. The XML fragmentation process, both in the absence and in the presence of MapReduce, has also been discussed to provide a better understanding of the complex process of handling large XML documents in a distributed, scalable MapReduce environment.
Dede, E., Fadika, Z., Gupta, C., & Govindaraju, M. (2011). Scalable and distributed processing of scientific XML data. Paper presented at the 2011 12th IEEE/ACM International Conference on Grid Computing (GRID).
Fegaras, L., Li, C., Gupta, U., & Philip, J. (2011). XML Query Optimization in Map-Reduce.
Sakr, S., & Gaber, M. (2014). Large Scale and Big Data: Processing and Management: CRC Press.
Vasilenko, D., & Kurapati, M. (2014). Efficient processing of XML documents in Hadoop MapReduce.
Wu, H. (2014). Parallelizing structural joins to process queries over big XML data using MapReduce. Paper presented at the International Conference on Database and Expert Systems Applications.
The purpose of this discussion is to discuss and analyze the design of the XML document. The discussion also examines the XML design document from the users' perspective for improved performance. The discussion begins with the XML design principles and a detailed analysis of each principle. The XML design document is also examined from the performance perspective, focusing on the appropriate use of elements and attributes when designing an XML document.
XML Design Principles
The XML design document has guidelines and principles that developers should follow. These guidelines are divided into four major principles for the use of elements and attributes: the core content principle, the structured information principle, the readability principle, and the element/attribute binding principle. Figure 1 summarizes these principles of the XML design document.
Figure 1. XML Design Document Four Principles for Elements and Attributes Use.
Core Content Principle
The core content principle governs the choice between an element and an attribute. If the information is part of the essential material for a human-readable document, the use of an element is recommended. If the information is aimed at machine-oriented record formats and helps applications process the primary communication, the use of an attribute is recommended. One example of this principle is a title that is mistakenly placed in an attribute when it should be placed in element content. Another example is internal product identifiers thrown as elements into the detailed records of products; in some cases, attributes are more appropriate than elements because an internal product code would not be of primary interest to most readers or processors of the document, especially when the ID has an extended format. By analogy with data and metadata, data should be placed in elements, and metadata should be placed in attributes (Ogbuji, 2004).
Since elements and attributes are the two main building blocks of XML document design, developers should be aware of legal and illegal elements and attributes. (Fawcett, Ayers, & Quin, 2012) have identified legal and illegal element forms. For instance, spaces are allowed after a name, but names cannot contain spaces. Digits can appear within a name, but names cannot begin with a digit. Spaces can appear between the name and the forward slash in a self-closing element, but initial spaces are not allowed. A hyphen is allowed within a name, but not as the first character. Non-Roman characters are allowed if they are classified as letters by the Unicode specification (for example, an element name such as forename written in Greek), and start and end tags must match case-sensitively (Fawcett et al., 2012). Table 1 shows the legal and illegal elements to consider when designing an XML document.
Table 1. Legal vs. Illegal Elements
Consideration for XML Design Document (Fawcett et al., 2012).
For attributes, (Fawcett et al., 2012) have likewise identified legal and illegal forms. A single quote inside double-quote delimiters is allowed, and double quotes inside single-quote delimiters are allowed, while a single quote inside single-quote delimiters is not allowed. Attribute names cannot begin with a digit. Two attributes with the same name are not allowed, and mismatched delimiters are not allowed. Table 2 shows the legal and illegal attributes to be considered when designing an XML document.
Table 2. Legal vs. Illegal Attributes
Consideration for XML Design Document (Fawcett et al., 2012).
Structured Information Principle
Since the element is an extensible engine for expressing structure in XML, the use of an element is recommended if the information is expressed in a structured form, especially if the structure may be extensible. The use of an attribute is recommended if the information is expressed as an atomic token, since attributes are designed to express simple properties of the information represented in an element (Ogbuji, 2004). A date is an excellent example: it has a fixed structure and acts as a single token, and hence can be used as an attribute. Personal names, by contrast, are recommended to be placed in element content rather than in attributes, since personal names have a variable structure and are rarely an atomic token. The following code example makes the name an element, as illustrated in the sketch below. Figure 2 shows the name as an element, while Figure 3 shows the name as an attribute.
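Since Figures 2 and 3 are not reproduced here, the contrast can be sketched with Python's built-in ElementTree: the first variant places the personal name in structured element content (recommended, because names have variable structure), while the second collapses it into a single attribute. The element and attribute names are illustrative assumptions.

```python
import xml.etree.ElementTree as ET

# Recommended: the name is structured element content.
patient = ET.Element("patient", attrib={"id": "P-001"})       # id is atomic metadata -> attribute
name = ET.SubElement(patient, "name")
ET.SubElement(name, "given").text = "Jane"
ET.SubElement(name, "family").text = "Doe"
print(ET.tostring(patient, encoding="unicode"))
# <patient id="P-001"><name><given>Jane</given><family>Doe</family></name></patient>

# Not recommended: the name collapsed into a single atomic attribute.
flat = ET.Element("patient", attrib={"id": "P-001", "name": "Jane Doe"})
print(ET.tostring(flat, encoding="unicode"))
# <patient id="P-001" name="Jane Doe" />
```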
Readability Principle
If the information is intended to be read by people, the use of an element is recommended. If the information is intended for machine readability, the use of an attribute is recommended (Ogbuji, 2004). A URL is an example, as it is of little use to a person without a computer to retrieve the referenced resource (Ogbuji, 2004).
Element/Attribute Binding
The use of an element is recommended if its value needs to be modified by another attribute (Ogbuji, 2004). An attribute should express properties of, or modifications to, the element that carries it (Ogbuji, 2004).
XML Design Document Examination
One of the best practices identified by IBM for DB2 is to use attributes and elements appropriately in XML (IBM, 2018). Although it was identified for DB2, it can be applied to the design of any application using XML documents, because elements and attributes are the building blocks of XML, as discussed above. The example for this examination involves a menu and the use of elements and attributes.
If a menu for a restaurant is developed as an XML document and the portion sizes of the items are placed in the menu, the core content principle suggests using attributes, with the assumption that the portion size is not important to the reader of the menu. Following the structured information principle, the portion measurement and its unit should not be packed together into a single attribute. Figure 4 shows the code using the core content principle, while Figure 5 shows the code using the structured information principle.
However, the code following the structured information principle in Figure 5 allows the portion-unit attribute to modify the portion-size attribute, which is not recommended. An attribute is recommended to modify an element, which in this example is the menu-item element. Thus, the solution is to modify the code so that the portion-unit attribute modifies an element. The resulting code shows the portion size to the reader, as shown in Figure 6.
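Because Figures 4 through 6 are not reproduced here, a minimal ElementTree sketch of one possible revised menu fragment is shown below: the portion size is element content, and the portion-unit attribute modifies that element rather than another attribute. The element and attribute names are assumptions based on the example in the text, and the attribute could equally be placed on the menu-item element.

```python
import xml.etree.ElementTree as ET

menu = ET.Element("menu")
item = ET.SubElement(menu, "menu-item")
ET.SubElement(item, "name").text = "Grilled salmon"
# The attribute modifies the element whose content it qualifies.
portion = ET.SubElement(item, "portion-size", attrib={"portion-unit": "oz"})
portion.text = "8"
print(ET.tostring(menu, encoding="unicode"))
# <menu><menu-item><name>Grilled salmon</name><portion-size portion-unit="oz">8</portion-size></menu-item></menu>
```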
After the code is modified so that the portion-unit attribute modifies an element, the core content and readability principles come into play. This modification contradicts the original decision, based on the core content principle, that the size is not essential for the reader to know. Therefore, XML developers should judge which principle is appropriate to apply based on the requirements.
Another XML menu example is available at the following link: https://www.w3schools.com/xml/default.asp. The code from that link, shown in Figure 7, demonstrates attributes modifying elements, which is the recommended practice.
Conclusion
This assignment has focused on the XML design document. The discussion has covered the four major principles that should be considered when designing an XML document. The four principles revolve around the two building blocks of the attribute and the element. The use of elements is recommended for human-readable documents, while the use of attributes is recommended for machine-oriented records. The use of elements is also recommended for information that is expressed in a structured form, especially if the structure is extensible, while the use of attributes is recommended for information expressed as an atomic token. If an attribute would otherwise modify another attribute, the use of an element is recommended. XML document design should also consider the legal and illegal elements and attributes. A few examples have been provided to demonstrate the use of elements versus attributes and the method of improving the code for good performance as well as good practice. The discussion was limited to the use of elements and attributes and performance considerations from that perspective. However, XML document design involves other performance considerations, such as XML for the database, for parsing, and for the data warehouse, as discussed in (IBM, 2018; Mahboubi & Darmont, 2009; Nicola & John, 2003; Su-Cheng, Chien-Sing, & Mustapha, 2010).
References
Fawcett, J., Ayers, D., & Quin, L. R. (2012). Beginning XML: John Wiley & Sons.
Mahboubi, H., & Darmont, J. (2009). Enhancing XML data warehouse query performance by fragmentation. Paper presented at the Proceedings of the 2009 ACM Symposium on Applied Computing.
Nicola, M., & John, J. (2003). XML parsing: A threat to database performance. Paper presented at the Proceedings of the Twelfth International Conference on Information and Knowledge Management.
The purpose of this project is to discuss and examine a Big Data Analytics (BDA) technique and a case study. The discussion begins with an overview of BDA applications in various sectors, followed by the implementation of BDA in the healthcare industry. The records show that the healthcare industry suffers from fraud, waste, and abuse (FWA). The emphasis of this discussion is on FWA in the healthcare industry. The project provides a case study of BDA in healthcare using the outlier detection data mining technique. The data mining phases of the use case are discussed and analyzed, and an improvement to the selected BDA technique of outlier detection is proposed. The analysis shows that the outlier detection data mining technique for fraud detection is still under experimentation and has not yet proven reliable. The recommendation is to use the clustering data mining technique as a more heuristic technique for fraud detection. Organizations should evaluate BDA tools and select the most appropriate tool to meet the requirements of the business model successfully.
Keywords: Big Data Analytics; Healthcare; Outlier Detection; Fraud Detection.
Organizations must be able to quickly and effectively analyze large amounts of data and extract value from such data for sound business decisions. The benefits of Big Data Analytics are driving organizations and businesses to implement Big Data Analytics techniques in order to compete in the market. A survey conducted by CIO Insight has shown that 65% of executives and senior decision makers believe that organizations risk becoming uncompetitive or irrelevant if Big Data is not embraced (McCafferly, 2015). The same survey has also shown that 56% anticipated higher investment in big data, and 15% indicated that the increase in budget allocation will be significant (McCafferly, 2015). Such budget allocation can be used for skilled professionals, Big Data storage, BDA tools, and so forth. This project discusses and analyzes the application of Big Data Analytics. It begins with an overview of such broad applications, with more emphasis on a single application for further investigation. The healthcare sector is selected for further discussion, with a closer lens to investigate the implementation of BDA and methods to improve such implementation.
Numerous research studies have discussed and analyzed the application of Big Data in different domains. (Chen & Zhang, 2014) have discussed BDA in scientific research domains such as astronomy, meteorology, social computing, bioinformatics, and computational biology, which are based on data-intensive scientific discovery. Other studies, such as (Rabl et al., 2012), have investigated the performance of six modern open-source data stores in the context of application performance monitoring as part of a CA Technologies initiative (CA-Technologies, 2018). (Bi & Cochran, 2014) have discussed BDA in cloud manufacturing, indicating that the success of a manufacturing enterprise depends on the advancement of IT to support and enhance the value stream. Manufacturing technologies have evolved throughout the years, and the advancement of a manufacturing system can be measured by its scale, complexity, and automation responsiveness (Bi & Cochran, 2014). Figure 1 illustrates the evolution of manufacturing technologies from before the 1950s until the Big Data age.
Figure 1. Manufacturing Technologies,
Information System, ITs, and Their Evolutions
The McKinsey Institute first reported four essential sectors that can benefit from BDA: the healthcare industry, government services, retailing, and manufacturing (Brown, Chui, & Manyika, 2011). The report also predicted that BDA implementation could improve productivity by 0.5 to 1 percent annually and produce hundreds of billions of dollars in new value (Brown et al., 2011). The McKinsey Institute has indicated that not all industries are created equal in the context of capturing the benefits of BDA (Brown et al., 2011).
Another McKinsey Institute report has described the transformative potential of Big Data in five domains: health care (U.S.), public sector administration (European Union), retail (U.S.), manufacturing (global), and personal location data (global) (Manyika et al., 2011). The same report has predicted $300 billion in potential annual value to US healthcare and a 60% potential increase in retailers' operating margins made possible by BDA (Manyika et al., 2011). Some sectors are poised for more significant gains and benefits from Big Data than others, although the implementation of Big Data will matter across all sectors (Manyika et al., 2011). The sectors are divided into clusters A, B, C, D, and E. Cluster A reflects information and computer and electronic products, while finance and insurance and government are categorized as cluster B. Cluster C includes several sectors such as construction, educational services, and arts and entertainment. Cluster D includes manufacturing and wholesale trade, while cluster E covers retail, healthcare providers, and accommodation and food. Figure 2 shows the sectors positioned for more significant gains from the use of Big Data.
Figure 2. Capturing Value from Big Data by
Sector (Manyika
et al., 2011).
The application of BDA in specific sectors has been discussed in various research studies, such as health and medical research (Liang & Kelemen, 2016), biomedical research (Luo, Wu, Gopukumar, & Zhao, 2016), and machine learning techniques in the healthcare sector (MCA, 2017). The next section discusses the implementation of BDA in the healthcare sector.
Numerous research studies have discussed Big Data Analytics (BDA) in the healthcare industry from different perspectives. The healthcare industry has taken advantage of BDA in fraud and abuse prevention, detection, and reporting (cms.gov, 2017). Fraud and abuse of Medicare are regarded as a severe problem which needs attention (cms.gov, 2017). Various examples of Medicare fraud scenarios have been reported (cms.gov, 2017). Submitting, or causing to be submitted, false claims or making misrepresentations of fact to obtain a federal healthcare payment is the first Medicare fraud scenario. Soliciting, receiving, offering, or paying remuneration to induce or reward referrals for items or services reimbursed by federal health care programs is another Medicare fraud scenario. The last fraud scenario in Medicare is making prohibited referrals for certain designated health services (cms.gov, 2017). The abuse of Medicare includes billing for unnecessary medical services, charging excessively for services or supplies, and misusing codes on a claim, such as upcoding or unbundling codes (cms.gov, 2017; J. Liu et al., 2016). In 2012, approximately $120 billion in healthcare payments were made improperly (J. Liu et al., 2016). Medicare and Medicaid contributed to more than half of this improper payment total (J. Liu et al., 2016). The annual loss to fraud, waste, and abuse in the healthcare domain is estimated to be $750 billion (J. Liu et al., 2016). In 2013, over 60% of the improper payments were healthcare related. Figure 3 illustrates the improper payments in government expenditure.
Figure 3. Improper Payments Resulted from Fraud and Abuse (J. Liu et al., 2016).
Medicare
fraud and abuse are governed by federal
laws (cms.gov, 2017). These federal laws include False Claim Act (FCA), Anti-Kickback Statute (AKS),
Physician Self-Referral Law (Stark Law), Criminal Health Care Fraud Statute, Social
Security Act, and the United States
Criminal Code. Medicare anti-fraud and
abuse partnerships of various government agencies such as Health Care Fraud
Prevention Partnership (HFPP) and Centers for Medicare
and Medicaid Services (CMS) have been established to combat fraud and abuse. The main aim of this
partnership is to uphold the integrity of the Medicare program, save and recoup
taxpayer funds, reduce the costs of health care
to patients, and improve the quality of healthcare (cms.gov, 2017).
In
2010, Health and Human Services (HHS) and CMS initiated a national effort known
as Fraud Prevention System (FPS), a predictive analytics technology which runs
predictive algorithms and other analytics nationwide on all Medicare FFS claims
prior to any payment in an effort to detect any potential suspicious claims and
patterns that may constitute fraud and abuse (cms.gov, 2017). In 2012, CMS developed the Program Integrity
Command Center to combine Medicare and Medicaid experts such as clinicians,
policy experts, officials, fraud investigators, and law enforcement community
including the FBI, to develop and improve predictive analytics that identify fraud and mobilize a rapid response (cms.gov, 2017). Such an effort aims to connect with the field offices to examine fraud allegations within a few hours through a real-time investigation. Before the application of BDA, the process of finding substantiating evidence for a fraud allegation took days or weeks.
Research communities and the data analytics industry have exerted various efforts to develop fraud-detection systems (J. Liu et al., 2016). Various research studies have used different data mining techniques for healthcare fraud and abuse detection. (J. Liu et al., 2016) have used an unsupervised data mining approach and applied the clustering data mining technique for healthcare fraud detection. (Ekina, Leva, Ruggeri, & Soyer, 2013) have used an unsupervised data mining approach and applied the Bayesian co-clustering data mining technique for healthcare fraud detection. (Ngufor & Wojtusiak, 2013) have used a hybrid supervised and unsupervised data mining approach and applied unsupervised data labeling together with outlier detection, classification, and regression techniques for medical claims prediction. (Capelleveen, 2013; van Capelleveen, Poel, Mueller, Thornton, & van Hillegersberg, 2016) have used an unsupervised data mining approach and applied the outlier detection data mining technique for health insurance fraud detection within the Medicaid domain.
The case study presented by (Capelleveen, 2013; van Capelleveen et al., 2016) has been selected for further investigation of the application of BDA in healthcare. Outlier detection, one of the unsupervised data mining techniques, is regarded as an effective predictor for fraud detection and is recommended for use to support the initiation of audits (Capelleveen, 2013; van Capelleveen et al., 2016). Outlier detection is the primary analytic tool used in this case study. The outlier detection tool can be based on linear model analysis, multivariate clustering analysis, peak analysis, and boxplot analysis (Capelleveen, 2013; van Capelleveen et al., 2016). The outlier detection algorithm in this case study was applied to a Medicaid dataset of 650,000 healthcare claims from 369 dentists in one state. A tool such as RapidMiner can be used for outlier detection data mining techniques, although the study (Capelleveen, 2013; van Capelleveen et al., 2016) did not specify the name of the tool that was used for outlier detection of fraud and abuse in the Medicaid dental domain.
The process for this unsupervised outlier detection data mining technique involves seven iterative phases. The first phase involves the composition of metrics for the domain. These metrics are derived or calculated data, such as a feature, attribute, or measurement, which characterize the behavior of an entity over a certain period. The purpose of these metrics is to enable a comparative behavioral analysis using data mining algorithms. During the first iteration, the metrics are expected to be inferred from provider behavior, supported by known fraud causes, and developed in cooperation with fraud experts. In subsequent iterations, the metric composition consists of the latest metrics: existing metrics are updated, the configuration is modified, and the confidence levels are adjusted to optimize the hit rates. The composition of metrics phase is followed by cleaning and filtering the data. Selecting the provider groups and computing the metrics is the third phase of this outlier detection process. The fourth phase involves comparing providers by metric and flagging outliers. Forming suspicion predictors for provider fraud detection is the fifth phase, followed by the phase of reporting and presenting results to fraud investigators. The last phase of the use of the outlier detection analytic tool involves metric evaluation. The result of the outlier detection analysis showed that 12 of the top 17 providers (71%) submitted suspicious claim patterns and should be referred to officials for further investigation. The study concluded that the outlier detection tool could be used to reveal new patterns of potential fraud that can be identified and possibly used in future automated detection techniques.
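As an illustration of the "compare providers by metric and flag outliers" phase, the following sketch applies the boxplot (interquartile-range) rule to one per-provider metric. The metric values and the conventional 1.5 x IQR fence are illustrative assumptions, not the case study's actual configuration.

```python
def iqr_outliers(values, k=1.5):
    """Return indexes of values outside the boxplot fences [Q1 - k*IQR, Q3 + k*IQR]."""
    ordered = sorted(values)
    n = len(ordered)
    q1 = ordered[n // 4]          # rough quartile estimates, sufficient for a sketch
    q3 = ordered[(3 * n) // 4]
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [i for i, v in enumerate(values) if v < low or v > high]

# Metric: average billed amount per visit for each dental provider (assumed values).
providers = ["D01", "D02", "D03", "D04", "D05", "D06", "D07", "D08"]
metric = [110.0, 95.0, 130.0, 105.0, 980.0, 120.0, 90.0, 115.0]
for i in iqr_outliers(metric):
    print("flag for review:", providers[i], metric[i])   # -> flag for review: D05 980.0
```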
(Lazarevic & Kumar, 2005) have indicated that most outlier detection techniques fall into four categories: the statistical approach, the distance-based approach, the profiling method, and the model-based approach. In the statistical approach, the data points are modeled using a stochastic distribution and are determined to be outliers based on their relationship with the model. Most statistical approaches have limitations with higher-dimensional distributions of the data points, because the complexity of such distributions results in inaccurate estimations. The distance-based approach can detect outliers by computing the distances among points, thereby overcoming the limitation of the statistical approach. Various distance-based outlier detection algorithms have been proposed, based on different approaches: the first is based on computing the full-dimensional distances of points from one another using all the available features, and the second is based on computing the densities of local neighborhoods. The profiling method develops profiles of normal behavior using different data mining techniques or heuristic-based approaches, and deviations from these profiles are considered intrusions. The model-based approach begins by characterizing normal behavior using predictive models, such as replicator neural networks or unsupervised support vector machines, and detects outliers as deviations from the learned model (Lazarevic & Kumar, 2005). (Capelleveen, 2013; van Capelleveen et al., 2016) have indicated that the outlier detection tool as a data mining technique has not proven itself in the long run and is still under experimentation. It is also considered a sophisticated data mining technique (Capelleveen, 2013; van Capelleveen et al., 2016), and the validation of its effectiveness remains difficult (Capelleveen, 2013; van Capelleveen et al., 2016).
Based on this analysis of the outlier detection tool, a more heuristic and novel approach should be used. (Viattchenin, 2016) has proposed a novel technique for outlier detection based on a heuristic algorithm of possibilistic clustering, which is a function-based method. (Q. Liu & Vasarhelyi, 2013) have proposed healthcare fraud detection using a clustering model incorporating geo-location information. The results of their clustering model detected claims with extreme payment amounts and identified some suspicious claims. In summary, integrating the clustering technique can play a role in enhancing the reliability and validity of the outlier detection data mining technique.
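A minimal sketch of the suggested clustering direction is shown below, using scikit-learn's general-purpose KMeans rather than the specific heuristic possibilistic algorithm of Viattchenin (2016) or the geo-location model of Q. Liu and Vasarhelyi (2013). The two claim features, the number of clusters, and the sample values are assumptions for illustration; claims falling into very small clusters would then be reviewed manually.

```python
import numpy as np
from sklearn.cluster import KMeans  # third-party dependency, assumed installed

# Assumed per-claim features: [payment amount in dollars, number of procedures].
claims = np.array([
    [120, 1], [135, 1], [110, 2], [128, 1],
    [400, 6], [380, 5], [415, 6],
    [4200, 2],                      # an extreme payment amount
])

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(claims)
labels = kmeans.labels_

# Flag claims that fall into the smallest cluster for manual review.
sizes = np.bincount(labels)
rare_cluster = int(np.argmin(sizes))
for i in np.where(labels == rare_cluster)[0]:
    print("review claim", i, claims[i].tolist())
```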
This project has discussed and examined Big Data Analytics (BDA) methods. An overview of BDA applications in various sectors was discussed, followed by the implementation of BDA in the healthcare industry. The records showed that the healthcare industry suffers from fraud, waste, and abuse. The discussion provided a case study of BDA in healthcare using the outlier detection tool. The data mining phases have been discussed and analyzed, and a proposed improvement for the selected BDA technique of outlier detection has also been addressed. The analysis has indicated that the outlier detection technique is still under experimentation and that a more heuristic data mining fraud detection technique, such as the clustering data mining technique, should be used. In summary, various BDA techniques are available for different industries. Organizations must select the appropriate BDA tool to meet the requirements of the business model.
Capelleveen, G. C. (2013). Outlier based predictors for health insurance fraud detection within US Medicaid. University of Twente.
Chen, C. P., & Zhang, C.-Y. (2014). Data-intensive applications, challenges, techniques and technologies: A survey on Big Data. Information Sciences, 275, 314-347.
Ekina, T., Leva, F., Ruggeri, F., & Soyer, R. (2013). Application of Bayesian methods in detection of healthcare fraud.
Lazarevic, A., & Kumar, V. (2005). Feature bagging for outlier detection. Paper presented at the Proceedings of the Eleventh ACM SIGKDD International Conference on Knowledge Discovery in Data Mining.
Liang, Y., & Kelemen, A. (2016). Big Data Science and its Applications in Health and Medical Research: Challenges and Opportunities. Austin Journal of Biometrics & Biostatistics, 7(3).
Liu, J., Bier, E., Wilson, A., Guerra-Gomez, J. A., Honda, T., Sricharan, K., . . . Davies, D. (2016). Graph analysis for detecting fraud, waste, and abuse in healthcare data. AI Magazine, 37(2), 33-46.
Liu, Q., & Vasarhelyi, M. (2013). Healthcare fraud detection: A survey and a clustering model incorporating geo-location information.
Luo, J., Wu, M., Gopukumar, D., & Zhao, Y. (2016). Big data application in biomedical research and health care: A literature review. Biomedical Informatics Insights, 8, BII.S31559.
Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (2011). Big data: The next frontier for innovation, competition, and productivity.
MCA, M. J. S. (2017). Applications of Big Data Analytics and Machine Learning Techniques in Health Care Sectors. International Journal of Engineering and Computer Science, 6(7).
Ngufor, C., & Wojtusiak, J. (2013). Unsupervised labeling of data for supervised learning and its application to medical claims prediction. Computer Science, 14(2), 191.
Rabl, T., Gómez-Villamor, S., Sadoghi, M., Muntés-Mulero, V., Jacobsen, H.-A., & Mankovskii, S. (2012). Solving big data challenges for enterprise application performance management. Proceedings of the VLDB Endowment, 5(12), 1724-1735.
van Capelleveen, G., Poel, M., Mueller, R. M., Thornton, D., & van Hillegersberg, J. (2016). Outlier detection in healthcare fraud: A case study in the Medicaid dental domain. International Journal of Accounting Information Systems, 21, 18-31.
Viattchenin, D. A. (2016). A Technique for Outlier Detection Based on Heuristic Possibilistic Clustering. CERES, 17.