Case Study: Big Data Analytics in Healthcare Using Outlier Detection.

Dr. O. Aly
Computer Science

Abstract

The purpose of this project is to discuss and examine Big Data Analytics (BDA) techniques through a case study. The discussion begins with an overview of BDA applications in various sectors, followed by the implementation of BDA in the healthcare industry. The records show that the healthcare industry suffers from fraud, waste, and abuse (FWA), which is the emphasis of this discussion. The project provides a case study of BDA in healthcare using the outlier detection data mining technique. The data mining phases of the use case are discussed and analyzed, and an improvement to the selected outlier detection technique is proposed. The analysis shows that the outlier detection data mining technique for fraud detection is still under experimentation and has not yet been proven reliable. The recommendation is to use the clustering data mining technique as a more heuristic approach to fraud detection. Organizations should evaluate the available BDA tools and select the one that best meets the requirements of their business model.

Keywords: Big Data Analytics; Healthcare; Outlier Detection; Fraud Detection.

Introduction

Organizations must be able to analyze large amounts of data quickly and effectively and extract value from that data to make sound business decisions. The benefits of Big Data Analytics are driving organizations to implement Big Data Analytics techniques in order to compete in the market. A survey conducted by CIO Insight showed that 65% of executives and senior decision makers indicated that organizations risk becoming uncompetitive or irrelevant if Big Data is not embraced (McCafferly, 2015). The same survey showed that 56% anticipated higher investment in Big Data, and 15% indicated that this increase in budget allocation will be significant (McCafferly, 2015). Such a budget can be allocated to skilled professionals, Big Data (BD) storage, BDA tools, and so forth. This project discusses and analyzes the application of Big Data Analytics. It begins with an overview of its broad applications and then focuses on a single application for further investigation. The healthcare sector is selected for a closer look at the implementation of BDA and at methods to improve that implementation.

Overview of Big Data Analytics Applications

Numerous research studies have discussed and analyzed the application of Big Data in different domains. Chen and Zhang (2014) have discussed BDA in scientific research domains such as astronomy, meteorology, social computing, bioinformatics, and computational biology, which are based on data-intensive scientific discovery. Other studies, such as Rabl et al. (2012), have investigated the performance of six modern open-source data stores in the context of application performance monitoring, as part of an initiative of CA Technologies (CA-Technologies, 2018). Bi and Cochran (2014) have discussed BDA in cloud manufacturing, indicating that the success of a manufacturing enterprise depends on the advancement of IT to support and enhance the value stream. Manufacturing technologies have evolved over the years, and the advancement of a manufacturing system can be measured by its scale, complexity, and automation responsiveness (Bi & Cochran, 2014). Figure 1 illustrates the evolution of manufacturing technologies from before the 1950s to the Big Data age.


Figure 1.  Manufacturing Technologies, Information System, ITs, and Their Evolutions

The McKinsey Institute first reported four essential sectors that can benefit from BDA: the healthcare industry, government services, retailing, and manufacturing (Brown, Chui, & Manyika, 2011). The report also predicted that BDA implementation would improve productivity by 0.5 to 1 percent annually and produce hundreds of billions of dollars in new value (Brown et al., 2011). The McKinsey Institute has indicated that not all industries are created equal when it comes to reaping the benefits of BDA (Brown et al., 2011).

Another report by the McKinsey Institute described the transformative potential of BD in five domains: health care (U.S.), public sector administration (European Union), retail (U.S.), manufacturing (global), and personal location data (global) (Manyika et al., 2011). The same report predicted a potential annual value of $300 billion to U.S. healthcare and a potential 60% increase in retailers’ operating margins with BDA (Manyika et al., 2011). Some sectors are poised for more significant gains from BD than others, although the implementation of BD will matter across all sectors (Manyika et al., 2011). The report groups the sectors into five clusters, A through E. Cluster A covers information and computer and electronic products, while finance and insurance and government are categorized as cluster B. Cluster C includes several sectors such as construction, educational services, and arts and entertainment. Cluster D covers manufacturing and wholesale trade, while cluster E covers retail, healthcare providers, and accommodation and food. Figure 2 shows which sectors are positioned for more significant gains from the use of BD.


Figure 2.  Capturing Value from Big Data by Sector (Manyika et al., 2011).

The application of BDA in specific sectors has been discussed in various research studies, covering areas such as health and medical research (Liang & Kelemen, 2016), biomedical research (Luo, Wu, Gopukumar, & Zhao, 2016), and machine learning techniques in healthcare sectors (MCA, 2017). The next section discusses the implementation of BDA in the healthcare sector.

Big Data Analytics Implementation in Healthcare

Numerous research studies have discussed Big Data Analytics (BDA) in the healthcare industry from different perspectives. The healthcare industry has taken advantage of BDA in fraud and abuse prevention, detection, and reporting (cms.gov, 2017). Medicare fraud and abuse are regarded as a severe problem that needs attention (cms.gov, 2017), and various Medicare fraud scenarios are reported (cms.gov, 2017). The first is submitting, or causing to be submitted, false claims or making misrepresentations of fact to obtain a federal healthcare payment. The second is soliciting, receiving, offering, or paying remuneration to induce or reward referrals for items or services reimbursed by federal health care programs. The third is making prohibited referrals for certain designated health services (cms.gov, 2017). Medicare abuse includes billing for unnecessary medical services, charging excessively for services or supplies, and misusing codes on a claim, such as upcoding or unbundling codes (cms.gov, 2017; J. Liu et al., 2016). In 2012, improper payments for healthcare amounted to $120 billion (J. Liu et al., 2016). Medicare and Medicaid contributed more than half of this improper payment total (J. Liu et al., 2016). The annual loss to fraud, waste, and abuse in the healthcare domain is estimated at $750 billion (J. Liu et al., 2016). In 2013, over 60% of improper payments were healthcare related. Figure 3 illustrates the improper payments in government expenditure.


Figure 3. Improper Payments Resulted from Fraud and Abuse (J. Liu et al., 2016).

Medicare fraud and abuse are governed by federal laws (cms.gov, 2017). These laws include the False Claims Act (FCA), the Anti-Kickback Statute (AKS), the Physician Self-Referral Law (Stark Law), the Criminal Health Care Fraud Statute, the Social Security Act, and the United States Criminal Code. Medicare anti-fraud and abuse partnerships involving various government agencies, such as the Health Care Fraud Prevention Partnership (HFPP) and the Centers for Medicare and Medicaid Services (CMS), have been established to combat fraud and abuse. The main aims of these partnerships are to uphold the integrity of the Medicare program, save and recoup taxpayer funds, reduce the cost of health care to patients, and improve the quality of healthcare (cms.gov, 2017).

In 2010, the Department of Health and Human Services (HHS) and CMS initiated a national effort known as the Fraud Prevention System (FPS), a predictive analytics technology that runs predictive algorithms and other analytics nationwide on all Medicare fee-for-service (FFS) claims prior to payment in order to detect suspicious claims and patterns that may constitute fraud and abuse (cms.gov, 2017). In 2012, CMS developed the Program Integrity Command Center to bring together Medicare and Medicaid experts, such as clinicians, policy experts, officials, fraud investigators, and the law enforcement community including the FBI, to develop and improve predictive analytics that identify fraud and mobilize a rapid response (cms.gov, 2017). This effort aims to connect with the field offices so that fraud allegations can be examined within a few hours through a real-time investigation. Before the application of BDA, finding substantiating evidence of a fraud allegation took days or weeks.

Research communities and the data analytics industry have made various efforts to develop fraud-detection systems (J. Liu et al., 2016), and research studies have applied different data mining techniques to healthcare fraud and abuse detection. J. Liu et al. (2016) used an unsupervised data mining approach and applied the clustering technique to healthcare fraud detection. Ekina, Leva, Ruggeri, and Soyer (2013) used an unsupervised approach and applied the Bayesian co-clustering technique to healthcare fraud detection. Ngufor and Wojtusiak (2013) used a hybrid supervised and unsupervised approach, applying unsupervised data labeling and outlier detection together with classification and regression techniques to medical claims prediction. Capelleveen (2013) and van Capelleveen, Poel, Mueller, Thornton, and van Hillegersberg (2016) used an unsupervised approach and applied the outlier detection technique to health insurance fraud detection within the Medicaid domain.

Case Study of BDA in Healthcare

The case study presented by Capelleveen (2013) and van Capelleveen et al. (2016) has been selected for further investigation of the application of BDA in healthcare. Outlier detection, which is one of the unsupervised data mining techniques, is regarded as an effective predictor for fraud detection and is recommended for supporting the initiation of audits (Capelleveen, 2013; van Capelleveen et al., 2016). Outlier detection is the primary analytic tool used in this case study. It can be based on linear model analysis, multivariate clustering analysis, peak analysis, and boxplot analysis (Capelleveen, 2013; van Capelleveen et al., 2016). In the case study, the outlier detection algorithm was applied to a Medicaid dataset of 650,000 healthcare claims and 369 dentists from one state. The study did not name the tool used for detecting fraud and abuse outliers in the Medicaid dental domain, although a tool such as RapidMiner can be used for outlier detection.
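
As an illustration of the boxplot analysis mentioned above, the following minimal Python sketch flags values that fall outside the conventional boxplot fences (1.5 times the interquartile range beyond the quartiles). The metric name, the sample values, and the fence multiplier are assumptions made for illustration only; they are not taken from the case study, which does not publish its configuration in this discussion.

```python
from statistics import quantiles


def boxplot_outliers(values, k=1.5):
    """Return (index, value) pairs that lie outside the boxplot fences."""
    # Quartiles computed with the inclusive method (data treated as the full population).
    q1, _, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower, upper = q1 - k * iqr, q3 + k * iqr
    return [(i, v) for i, v in enumerate(values) if v < lower or v > upper]


# Hypothetical metric: average claimed amount per patient for ten providers.
avg_claim_per_patient = [112.0, 98.5, 105.0, 120.0, 101.0, 99.0, 310.0, 108.0, 95.0, 115.0]

flagged = boxplot_outliers(avg_claim_per_patient)
print(flagged)  # [(6, 310.0)]
```

A flagged provider is only a candidate for review; in the case study the flagged providers are passed on to fraud experts rather than treated as confirmed fraud.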

The process for this unsupervised outlier detection technique involves seven iterative phases. The first phase is the composition of metrics for the domain. These metrics are derived or calculated data, such as a feature, attribute, or measurement, that characterize the behavior of an entity over a certain period. Their purpose is to enable a comparative behavioral analysis using data mining algorithms. During the first iteration, the metrics are expected to be inferred from provider behavior, supported by known fraud causes, and developed in cooperation with fraud experts. In subsequent iterations, the metrics composition is updated with the latest metrics, and the configuration and confidence levels are adjusted to optimize the hit rates. The second phase is cleaning and filtering the data. The third phase is selecting the provider groups and computing the metrics. The fourth phase is comparing providers by metric and flagging outliers. The fifth phase is forming suspicion of provider fraud from the predictors, and the sixth is reporting and presenting the results to fraud investigators. The seventh and last phase is metric evaluation. The outlier detection analysis showed that 12 of the top 17 providers (71%) submitted suspicious claim patterns and should be referred to officials for further investigation. The study concluded that the outlier detection tool can reveal new patterns of potential fraud, which can then be used in future automated detection techniques.
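
The third, fourth, and seventh phases described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the method of the case study: the claim records, the metric (amount paid per distinct patient), and the flagging rule (deviation from the peer median measured in median absolute deviations) are all hypothetical choices. Only the final figure, 12 confirmed out of 17 flagged providers, comes from the study.

```python
from collections import defaultdict
from statistics import median

# Hypothetical cleaned claims: (provider_id, patient_id, amount_paid).
claims = [
    ("D01", "p1", 80.0), ("D01", "p2", 95.0), ("D01", "p3", 90.0),
    ("D02", "p4", 85.0), ("D02", "p5", 100.0),
    ("D03", "p6", 400.0), ("D03", "p6", 380.0), ("D03", "p7", 420.0),
    ("D04", "p8", 92.0), ("D04", "p9", 88.0),
]


def amount_per_patient(claims):
    """Phase 3: compute one metric per provider (total paid / distinct patients)."""
    totals, patients = defaultdict(float), defaultdict(set)
    for provider, patient, amount in claims:
        totals[provider] += amount
        patients[provider].add(patient)
    return {p: totals[p] / len(patients[p]) for p in totals}


def flag_outliers(metric, threshold=3.0):
    """Phase 4: flag providers far from the peer median (assumed MAD-based rule)."""
    center = median(metric.values())
    mad = median(abs(v - center) for v in metric.values())
    return sorted(p for p, v in metric.items()
                  if mad > 0 and abs(v - center) / mad > threshold)


def hit_rate(confirmed, flagged):
    """Phase 7: share of flagged providers confirmed as suspicious by experts."""
    return confirmed / flagged


metric = amount_per_patient(claims)
print(flag_outliers(metric))          # ['D03']
print(round(hit_rate(12, 17) * 100))  # 71
```

The hit rate computed in the last line is the quantity that the iterative adjustment of metrics and confidence levels is meant to optimize.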

Proposed Improvements for Outlier Detection Tool Use Case

Lazarevic and Kumar (2005) have indicated that most outlier detection techniques fall into four categories: the statistical approach, the distance-based approach, the profiling method, and the model-based approach. In the statistical approach, the data points are modeled using a stochastic distribution and are determined to be outliers based on their relationship with the model. Most statistical approaches are limited when the data are high-dimensional, because the complexity of the distribution results in inaccurate estimations. The distance-based approach overcomes this limitation by detecting outliers from the distances computed among points. Various distance-based outlier detection algorithms have been proposed. Some compute the full-dimensional distances of points from one another using all the available features, while others compute the densities of local neighborhoods. The profiling method develops profiles of normal behavior using different data mining techniques or heuristic-based approaches, and deviations from those profiles are considered intrusions, that is, outliers. The model-based approach first characterizes normal behavior using predictive models, such as replicator neural networks or unsupervised support vector machines, and then detects outliers as deviations from the learned model (Lazarevic & Kumar, 2005).

Capelleveen (2013) and van Capelleveen et al. (2016) have indicated that outlier detection as a data mining technique has not proven itself in the long run and is still under experimentation. It is also considered a sophisticated data mining technique, and validating its effectiveness remains difficult (Capelleveen, 2013; van Capelleveen et al., 2016).
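
To make the distance-based category more concrete, the sketch below scores each point by the distance to its k-th nearest neighbour, computed over all features. It illustrates the first distance-based variant (full-dimensional distances) only; the density-based variant is not shown. The provider profiles, the two features, and the choice k = 2 are assumptions for illustration and do not come from any of the cited studies.

```python
from math import dist


def knn_distance_scores(points, k=2):
    """Score each point by the Euclidean distance to its k-th nearest neighbour."""
    scores = []
    for i, p in enumerate(points):
        neighbour_distances = sorted(
            dist(p, q) for j, q in enumerate(points) if j != i
        )
        scores.append(neighbour_distances[k - 1])
    return scores


# Hypothetical provider profiles with two already-scaled features,
# e.g. (claims per patient, average amount per claim).
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (2.0, 2.0), (8.0, 8.0)]

scores = knn_distance_scores(points, k=2)
most_outlying = max(range(len(points)), key=scores.__getitem__)
print(most_outlying)  # 4
```

A larger score means the point is more isolated. Because every pairwise distance is computed, this naive version scales quadratically with the number of points, which is one reason more elaborate distance-based algorithms have been proposed.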

Based on this analysis of the outlier detection tool, a more heuristic and novel approach should be used. Viattchenin (2016) has proposed a novel technique for outlier detection that is based on a heuristic clustering algorithm, which is a function-based method. Q. Liu and Vasarhelyi (2013) have proposed a healthcare fraud detection approach using a clustering model that incorporates geolocation information. Their clustering model detected claims with extreme payment amounts and identified some suspicious claims. In summary, integrating the clustering technique can play a role in enhancing the reliability and validity of the outlier detection data mining technique.
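
The general idea of using clustering to expose outliers can be sketched as follows. This is not the heuristic possibilistic clustering of Viattchenin (2016) nor the model of Q. Liu and Vasarhelyi (2013); it is a deliberately simple substitute in which points closer than a chosen radius are linked into the same cluster, and members of very small clusters are treated as outlier candidates. The claim features, the radius, and the minimum cluster size are hypothetical.

```python
from math import dist


def threshold_clusters(points, radius):
    """Label connected components: points within `radius` of each other are linked."""
    labels = [None] * len(points)
    current = 0
    for start in range(len(points)):
        if labels[start] is not None:
            continue
        labels[start] = current
        stack = [start]
        while stack:
            i = stack.pop()
            for j, q in enumerate(points):
                if labels[j] is None and dist(points[i], q) <= radius:
                    labels[j] = current
                    stack.append(j)
        current += 1
    return labels


def small_cluster_outliers(points, radius, min_size=2):
    """Return indices of points whose cluster has fewer than `min_size` members."""
    labels = threshold_clusters(points, radius)
    sizes = {}
    for label in labels:
        sizes[label] = sizes.get(label, 0) + 1
    return [i for i, label in enumerate(labels) if sizes[label] < min_size]


# Hypothetical claims as (payment amount in $100s, patient-provider distance in 10s of miles).
claims = [(1.0, 0.5), (1.2, 0.6), (0.9, 0.4),
          (5.0, 5.1), (5.2, 4.9), (4.8, 5.0),
          (20.0, 1.0)]

print(small_cluster_outliers(claims, radius=1.0))  # [6]
```

In this toy data the claim with the extreme payment amount forms a cluster of its own and is the only one flagged, which mirrors the kind of result reported for the clustering model.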

Conclusion

This project has discussed and examined Big Data Analytics (BDA) methods. It presented an overview of BDA applications in various sectors, followed by the implementation of BDA in the healthcare industry. The records showed that the healthcare industry suffers from fraud, waste, and abuse. The discussion provided a case study of BDA in healthcare using the outlier detection tool, and the data mining phases were discussed and analyzed. An improvement to the selected outlier detection technique was also proposed. The analysis indicated that the outlier detection technique is still under experimentation and that a more heuristic data mining technique for fraud detection, such as clustering, should be used. In summary, various BDA techniques are available for different industries, and organizations must select the appropriate BDA tool to meet the requirements of their business model.

References

Bi, Z., & Cochran, D. (2014). Big data analytics with applications. Journal of Management Analytics, 1(4), 249-265.

Brown, B., Chui, M., & Manyika, J. (2011). Are you ready for the era of ‘big data’? McKinsey Quarterly, 4(1), 24-35.

CA-Technologies. (2018). CA Technologies. Retrieved from https://www.ca.com/us/company/about-us.html.

Capelleveen, G. C. (2013). Outlier based predictors for health insurance fraud detection within US Medicaid. University of Twente.  

Chen, C. P., & Zhang, C.-Y. (2014). Data-intensive applications, challenges, techniques and technologies: A survey on Big Data. Information Sciences, 275, 314-347.

cms.gov. (2017). Medicare Fraud & Abuse: Prevention, Detection, and Reporting. Retrieved from https://www.cms.gov/Outreach-and-Education/Medicare-Learning-Network-MLN/MLNProducts/downloads/fraud_and_abuse.pdf.

Ekina, T., Leva, F., Ruggeri, F., & Soyer, R. (2013). Application of bayesian methods in detection of healthcare fraud.

Lazarevic, A., & Kumar, V. (2005). Feature bagging for outlier detection. Paper presented at the Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining.

Liang, Y., & Kelemen, A. (2016). Big Data Science and its Applications in Health and Medical Research: Challenges and Opportunities. Austin Journal of Biometrics & Biostatistics, 7(3).

Liu, J., Bier, E., Wilson, A., Guerra-Gomez, J. A., Honda, T., Sricharan, K., . . . Davies, D. (2016). Graph analysis for detecting fraud, waste, and abuse in healthcare data. AI Magazine, 37(2), 33-46.

Liu, Q., & Vasarhelyi, M. (2013). Healthcare fraud detection: A survey and a clustering model incorporating Geo-location information.

Luo, J., Wu, M., Gopukumar, D., & Zhao, Y. (2016). Big data application in biomedical research and health care: a literature review. Biomedical Informatics Insights, 8, BII.S31559.

Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (2011). Big data: The next frontier for innovation, competition, and productivity.

MCA, M. J. S. (2017). Applications of Big Data Analytics and Machine Learning Techniques in Health Care Sectors. International Journal Of Engineering And Computer Science, 6(7).

McCafferly, D. (2015). How To Overcome Big Data Barriers. Retrieved from https://www.cioinsight.com/it-strategy/big-data/slideshows/how-to-overcome-big-data-barriers.html.

Ngufor, C., & Wojtusiak, J. (2013). Unsupervised labeling of data for supervised learning and its application to medical claims prediction. Computer Science, 14(2), 191.

Rabl, T., Gómez-Villamor, S., Sadoghi, M., Muntés-Mulero, V., Jacobsen, H.-A., & Mankovskii, S. (2012). Solving big data challenges for enterprise application performance management. Proceedings of the VLDB Endowment, 5(12), 1724-1735.

van Capelleveen, G., Poel, M., Mueller, R. M., Thornton, D., & van Hillegersberg, J. (2016). Outlier detection in healthcare fraud: A case study in the Medicaid dental domain. International Journal of Accounting Information Systems, 21, 18-31.

Viattchenin, D. A. (2016). A Technique for Outlier Detection Based on Heuristic Possibilistic Clustering. CERES, 17.