Information Technology Requirements in Healthcare

Dr. O. Aly
Computer Science

The purpose of this discussion is to address one of the sectors that has unique information technology (IT) requirements. The selected sector is healthcare. The discussion addresses the sector's IT needs based on a case study. The discussion begins with Information Technology Key Role in Business, followed by the Healthcare Industry Case Study.

Information Technology Key Role in Business

Information technology (IT) is a critical resource for businesses in the age of Big Data and Big Data Analytics (Dewett & Jones, 2001; Pearlson & Saunders, 2001). IT both supports and consumes a significant amount of enterprise resources. Like other significant business resources such as people, money, and machines, IT must be managed wisely, and these resources must return value to the business. Thus, enterprises must carefully evaluate their resources, including IT resources, so that they can be used efficiently and effectively.

Information systems and technology are now integrated with almost every aspect of every business. IT and IS play significant roles in business because they simplify organizational activities and processes. Enterprises can gain competitive advantages when they utilize appropriate information technology. An inadequate information system can cause a breakdown in providing services to customers or developing products, which harms sales and eventually the business (Bhatt & Grover, 2005; Brynjolfsson & Hitt, 2000; Pearlson & Saunders, 2001). The same applies when inefficient business processes are sustained by ill-fitting information systems and technology, which increase costs without any return on investment or value. Lag in implementation or poor process adaptation reduces profits and growth and can place the business behind its competitors. The failure of information systems and technology in business is caused primarily by ignoring them during the planning of the business strategy and organizational strategy. IT will fail to support business goals and organizational systems when it is not considered in the business and organizational strategy. When the business strategy is misaligned with the organizational strategy, IT is subject to failure (Pearlson & Saunders, 2001).

IT Support to Business Goals

Enterprises should invest in IT resources that will benefit them. They should invest in systems that support their business goals, including gaining competitive advantages (Bhatt & Grover, 2005). Although IT represents a significant investment for businesses, a poorly chosen information system can become an obstacle to achieving business goals (Dewett & Jones, 2001; Henderson & Venkatraman, 1999; Pearlson & Saunders, 2001). When IT does not allow the business to achieve its goals, or lacks the capacity required to collect, store, and transfer critical information, the results can be disastrous, leading to dissatisfied customers or excessive production costs. The Toys R Us store is an excellent example of such an issue (Pearlson & Saunders, 2001). Its well-publicized website was not designed to process and fulfill orders fast enough. The site had to be redesigned at an additional cost that could have been avoided if the IT strategy and business goals had been discussed and aligned together.

IT Support to Organizational Systems

Organizational systems, including people, work processes, and structure, represent the core elements of the business. Enterprises should plan to enable these systems to work together efficiently to achieve the business goals (Henderson & Venkatraman, 1999; Pearlson & Saunders, 2001; Ryssel, Ritter, & Georg Gemünden, 2004). When a business's IT fails to support its organizational systems, the result is a misalignment of the resources needed to achieve the business goals. For instance, when organizations decide to use an Enterprise Resource Planning (ERP) system, the system often dictates how business processes are executed. When enterprises deploy a technology, they should think through various aspects such as how the technology will be used in the organization, who will use it, how they will use it, and how to make sure the chosen application accomplishes what is intended. For instance, an organization that plans to institute a wide-scale telecommuting program needs an information system strategy compatible with its organizational strategy (Pearlson & Saunders, 2001). Desktop PCs located within the corporate office are not the right solution for a telecommuting organization. Laptop computers and applications accessible online anywhere and anytime are a more appropriate solution. If a business only allows the purchase of desktop PCs and only builds systems accessible from desks within the office, the telecommuting program is likely to fail. Thus, information systems implementation should support the organizational systems and should be aligned with the business goals.

Advantages of IT in Business

With the advent of information systems and the internet, businesses can transform from local to international operations (Bhatt & Grover, 2005; Zimmer, 2018). Organizations are under pressure to take advantage of information technology to gain competitive advantages. They are turning to information technology to streamline services and enhance performance. IT has become an essential feature of the business landscape that helps businesses decrease costs, improve communication, build recognition, and release more innovative and attractive products.

IT streamlines communication, and effective communication is critical to an organization's success (Bhatt & Grover, 2005; Zimmer, 2018). A key advantage of information systems lies in their ability to streamline communication both internally and externally. For instance, online meeting and video-conferencing platforms such as Skype and WebEx give businesses the opportunity to collaborate virtually in real time, reducing the costs associated with bringing clients on site or communicating with staff who work remotely. IT also enables enterprises to connect almost effortlessly with international suppliers and consumers.

IT can enhance a business's competitive advantage in the marketplace by facilitating strategic thinking and knowledge transfer (Bhatt & Grover, 2005; Zimmer, 2018). When IT is used as a strategic investment rather than merely a means to an end, it provides businesses with the tools they need to properly evaluate the market and implement the strategies needed for a competitive edge.

IT stores and safeguards information, as information management is another domain of IT (Bhatt & Grover, 2005; Zimmer, 2018). IT is essential to any business that must store and safeguard sensitive information, such as financial data, for long periods. Various security techniques can be applied to ensure the data is stored securely. Organizations should evaluate the options available for storing their data, such as a local data center or cloud-based storage.

IT cuts costs and eliminates waste (Bhatt & Grover, 2005; Zimmer, 2018). Although IT implementation is expensive at the outset, in the long run it becomes incredibly cost-effective by streamlining the operational and managerial processes of the business. Thus, investing in the appropriate IT is key for a business to gain a return on investment. For instance, the implementation of online training programs is a classic example of IT improving internal processes by reducing costs, employees' time spent away from work, and travel expenses. Information technology enables organizations to accomplish more with less investment without sacrificing quality or value.

Healthcare Industry Case Study

The healthcare industry generates extensive data driven by patient record keeping, compliance with regulations and policies, and patient care (Raghupathi & Raghupathi, 2014). The current trend is to digitize this explosively growing data in the age of Big Data (BD) and Big Data Analytics (BDA) (Raghupathi & Raghupathi, 2014). BDA has revolutionized healthcare by transforming data into valuable information and knowledge used to predict epidemics, cure diseases, improve quality of life, and avoid preventable deaths (Van-Dai, Chuan-Ming, & Nkabinde, 2016). Applications of BDA in healthcare include pervasive health, fraud detection, pharmaceutical discoveries, clinical decision support systems, computer-aided diagnosis, and biomedical applications.

Healthcare Big Data Benefits and Challenges

The healthcare sector employs BDA in various aspects of care, such as detecting diseases at early stages, providing evidence-based medicine, minimizing medication doses to avoid side effects, and delivering useful medicine based on genetic analysis. The use of BD and BDA can reduce the readmission rate, thereby reducing healthcare-related costs for patients. Healthcare BDA can also detect spreading diseases early, before they become widespread, using real-time analytics (Archenaa & Anita, 2015; Raghupathi & Raghupathi, 2014; Wang, Kung, & Byrd, 2018). An example of BDA applied in a healthcare system is Kaiser Permanente's implementation of HealthConnect to ensure data exchange across all medical facilities and promote the use of electronic health records (Fox & Vaidyanathan, 2016).
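
To make the readmission example concrete, the following is a minimal sketch of a readmission-risk model in Python with scikit-learn. It assumes a hypothetical extract of historical encounters (encounters.csv) with illustrative feature columns; a production model would draw many more variables from the EHR.

```python
# Minimal sketch of a readmission-risk model, assuming a hypothetical CSV of
# historical encounters with columns: age, num_prior_admissions,
# length_of_stay, num_medications, readmitted_30d (0/1).
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

df = pd.read_csv("encounters.csv")  # hypothetical extract from the EHR system
features = ["age", "num_prior_admissions", "length_of_stay", "num_medications"]
X_train, X_test, y_train, y_test = train_test_split(
    df[features], df["readmitted_30d"], test_size=0.2, random_state=42)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)

# Rank held-out patients by predicted readmission risk for follow-up outreach.
probs = model.predict_proba(X_test)[:, 1]
print("AUC:", roc_auc_score(y_test, probs))
```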

Despite the various benefits of BD and BDA in the healthcare sector, various challenges and issues emerge from applying BDA in healthcare. The nature of the healthcare industry itself poses challenges to BDA (Groves, Kayyali, Knott, & Kuiken, 2016). The episodic culture, the data puddles, and IT leadership are three significant challenges the healthcare industry faces in applying BDA. The episodic culture refers to the conservative culture of healthcare and the lack of an IT mindset, which creates a rigid culture. Few providers have overcome this rigid culture and started to use BDA technology. The data puddles reflect the siloed nature of healthcare. Silos are described as one of the most significant flaws in the healthcare sector (Wicklund, 2014). Proper use of technology is lacking in the healthcare sector, causing the industry to fall behind other industries. Each silo uses its own methods to collect data from labs, diagnosis, radiology, emergency, case management, and so forth. IT leadership is another challenge, caused by the rigid culture of the healthcare industry; the lack of familiarity with the latest technologies among healthcare IT leadership is a severe problem.

Healthcare Data Sources for Data Analytics

Current healthcare data are collected from clinical and non-clinical sources (InformationBuilders, 2018; Van-Dai et al., 2016; Zia & Khan, 2017). Electronic health records are digital copies of patients' medical histories. They contain a variety of data relevant to patient care, such as demographics, medical problems, medications, body mass index, medical history, laboratory test data, radiology reports, clinical notes, and payment information. These electronic health records are the most critical data in healthcare data analytics because they provide effective and efficient methods for providers and organizations to share data (Botta, de Donato, Persico, & Pescapé, 2016; Palanisamy & Thirunavukarasu, 2017; Van-Dai et al., 2016; Wang et al., 2018).

Biomedical imaging data play a crucial role in healthcare, aiding disease monitoring, treatment planning, and prognosis. These data can be used to generate quantitative information and draw inferences from images that provide insights into a medical condition. Image analytics is more complicated due to the noise associated with image data, which is one of the significant limitations of biomedical analysis (Ji, Ganchev, O'Droma, Zhang, & Zhang, 2014; Malik & Sangwan, 2015; Van-Dai et al., 2016).

Sensing data are ubiquitous in the medical domain, both for real-time and for historical analysis. Sensing data come from several medical data-collection instruments, such as the electrocardiogram (ECG) and electroencephalogram (EEG), which are vital sensors that collect signals from various parts of the human body. Sensing data play a significant role in intensive care units (ICUs) and in real-time remote monitoring of patients with specific conditions such as diabetes or high blood pressure. Real-time and long-term analysis of trends and treatment in remote monitoring programs can help providers monitor the state of patients with such conditions (Van-Dai et al., 2016).

Biomedical signals are collected from many sources such as the heart, blood pressure, oxygen saturation levels, blood glucose, nerve conduction, and brain activity. Examples of biomedical signals include the electroneurogram (ENG), electromyogram (EMG), electrocardiogram (ECG), electroencephalogram (EEG), electrogastrogram (EGG), and phonocardiogram (PCG). Real-time analytics of biomedical signals will provide better management of chronic diseases, earlier detection of adverse events such as heart attacks and strokes, and earlier diagnosis of disease. These biomedical signals can be discrete or continuous depending on the kind of care or the severity of a particular pathological condition (Malik & Sangwan, 2015; Van-Dai et al., 2016).
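
As a simple illustration of biomedical signal analytics, the sketch below estimates heart rate from an ECG-like trace using SciPy's peak detection. The signal is synthetic and the sampling rate is an assumption; a real system would read from the monitoring device's data stream.

```python
# Minimal sketch of heart-rate estimation from an ECG-like signal using
# peak detection; the trace and sampling rate are synthetic assumptions.
import numpy as np
from scipy.signal import find_peaks

fs = 250  # sampling rate in Hz (assumed)
t = np.arange(0, 10, 1 / fs)
# Synthetic ECG-like trace: sharp positive spikes at 1.2 Hz (72 bpm) plus noise.
ecg = np.maximum(np.sin(2 * np.pi * 1.2 * t), 0) ** 20 + 0.05 * np.random.randn(t.size)

# Detect R-peaks: require a minimum height and spacing between beats.
peaks, _ = find_peaks(ecg, height=0.5, distance=fs * 0.4)
rr_intervals = np.diff(peaks) / fs          # seconds between beats
heart_rate = 60.0 / rr_intervals.mean()     # beats per minute
print(f"Estimated heart rate: {heart_rate:.1f} bpm")
```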

Genomic data analysis helps clarify the relationships among genes, mutations, and disease conditions. It has great potential in the development of gene therapies to cure certain conditions. Furthermore, genomic data analytics can assist in translating genetic discoveries into personalized medicine practice (Liang & Kelemen, 2016; Luo, Wu, Gopukumar, & Zhao, 2016; Palanisamy & Thirunavukarasu, 2017; Van-Dai et al., 2016).

Clinical text analytics uses data mining to transform information from clinical notes stored in unstructured formats into useful patterns. Manual coding of clinical notes is costly and time-consuming because of the notes' unstructured nature, heterogeneity, and varying format and context across patients and practitioners. Methods such as natural language processing (NLP) and information retrieval can be used to extract useful knowledge from large volumes of clinical text and to automatically encode clinical information in a timely manner (Ghani, Zheng, Wei, & Friedman, 2014; Sun & Reddy, 2013; Van-Dai et al., 2016).
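
A minimal, rule-based sketch of clinical text extraction is shown below; it uses a simple regular expression to pull medication names and doses from a hypothetical note. Production systems would rely on a full NLP pipeline rather than regular expressions, but the example illustrates the transformation from free text to structured fields.

```python
# Minimal rule-based extraction from a free-text clinical note; the note
# text and extraction rule are illustrative only.
import re

note = ("Patient reports chest pain. Started metformin 500 mg twice daily. "
        "Continue lisinopril 10 mg daily. Denies shortness of breath.")

# Capture drug name, dose, and unit, e.g. "metformin 500 mg".
medication_pattern = re.compile(r"\b([A-Za-z]+)\s+(\d+(?:\.\d+)?)\s*(mg|mcg|g)\b",
                                re.IGNORECASE)

for drug, dose, unit in medication_pattern.findall(note):
    print(f"medication={drug.lower()}, dose={dose} {unit.lower()}")
# medication=metformin, dose=500 mg
# medication=lisinopril, dose=10 mg
```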

Social network healthcare data analytics draws on various social media sources, such as social networking sites (e.g., Facebook, Twitter) and web logs, to discover new patterns and knowledge that can be leveraged to model and predict global health trends such as outbreaks of infectious epidemics (InformationBuilders, 2018; Luo et al., 2016; Van-Dai et al., 2016; Zia & Khan, 2017).

IT Requirements for Healthcare Sector

The basic requirements for implementing this proposal include not only the tools and required software but also training at all levels, from staff to nurses, clinicians, and patients. The requirements are divided into system requirements, implementation requirements, and training requirements.

Cloud Computing Technology Adoption Requirement

Volume is one of the defining characteristics of BD, especially in the healthcare industry (Manyika et al., 2011). Given the challenges addressed earlier when dealing with BD and BDA in healthcare, the system requirements cannot be met using a traditional on-premises data center, which can handle neither the intensive computation requirements of BD nor the storage requirements for all the medical information from the hospitals across the four states (Hu, Wen, Chua, & Li, 2014). Thus, a cloud computing environment is a more appropriate solution for implementing this proposal. Cloud computing plays a significant role in BDA (Assunção, Calheiros, Bianchi, Netto, & Buyya, 2015). The massive computation and storage requirements of BDA create a critical need for the emerging technology of cloud computing (Mehmood, Natgunanathan, Xiang, Hua, & Guo, 2016). Cloud computing offers various benefits such as cost reduction, elasticity, pay per use, availability, reliability, and maintainability (Gupta, Gupta, & Mohania, 2012; Kritikos, Kirkham, Kryza, & Massonet, 2017). However, although cloud computing offers various benefits, it has security and privacy issues under the standard deployment models of public, private, hybrid, and community clouds. Thus, one of the major requirements is to adopt a Virtual Private Cloud (VPC), which has been regarded as the most prominent approach to trusted computing technology (Abdul, Jena, Prasad, & Balraju, 2014).
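
As an illustration of the VPC requirement, the sketch below provisions an isolated network on AWS with boto3. It assumes AWS credentials are already configured, and the region and CIDR ranges are illustrative only.

```python
# Minimal sketch of provisioning a Virtual Private Cloud on AWS with boto3;
# region and CIDR ranges are illustrative assumptions.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Create the VPC that will isolate the healthcare analytics workloads.
vpc = ec2.create_vpc(CidrBlock="10.0.0.0/16")
vpc_id = vpc["Vpc"]["VpcId"]

# Private subnet for the analytics cluster nodes.
subnet = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.1.0/24")
print("VPC:", vpc_id, "Subnet:", subnet["Subnet"]["SubnetId"])
```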

Security Requirement

Cloud computing has been facing various threats (Cloud Security Alliance, 2013, 2016, 2017). Records show that over the three years from 2015 through 2017, the numbers of breaches, lost medical records, and fine settlements were staggering (Thompson, 2017). The Office for Civil Rights (OCR) issued 22 resolution agreements requiring monetary settlements approaching $36 million (Thompson, 2017). Table 1 shows the data categories and the total for each year.

Table 1.  Approximation of Records Lost by Category Disclosed on HHS.gov (Thompson, 2017)

Furthermore, a recent report published by the HIPAA Journal showed that the first three months of 2018 saw 77 healthcare data breaches reported to the OCR (HIPAA, 2018d). In the second quarter of 2018, at least 3.14 million healthcare records were exposed (HIPAA, 2018a). In the third quarter of 2018, 4.39 million records were exposed in 117 breaches (HIPAA, 2018c).

Thus, protecting patients' private information requires technology to extract, analyze, and correlate potentially sensitive datasets (HIPAA, 2018b). The implementation of BDA requires security measures and safeguards to protect patient privacy in the healthcare industry (HIPAA, 2018b). Sensitive data should be encrypted to prevent exposure in the event of theft (Abernathy & McMillan, 2016). The security requirements involve security at the VPC cloud deployment model as well as at the local hospitals in each state (Regola & Chawla, 2013). Security at the VPC level should involve the implementation of security groups and network access control lists to give the right individuals access to the right applications and patient records. A security group in a VPC acts as the first line of defense, a firewall for the associated instances of the VPC (McKelvey, Curran, Gordon, Devlin, & Johnston, 2015). Network access control lists act as the second layer of defense, firewalls for the associated subnets that control inbound and outbound traffic at the subnet level (McKelvey et al., 2015).
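
A minimal sketch of the first line of defense described above follows: a VPC security group that permits HTTPS only from a hospital's address range, created with boto3. The VPC ID and CIDR block are hypothetical placeholders.

```python
# Minimal sketch of a VPC security group restricting inbound traffic to HTTPS
# from a hospital network; the VPC ID and CIDR block are illustrative.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

sg = ec2.create_security_group(
    GroupName="ehr-analytics-sg",
    Description="Allow HTTPS from hospital network only",
    VpcId="vpc-0123456789abcdef0",   # hypothetical VPC created earlier
)

ec2.authorize_security_group_ingress(
    GroupId=sg["GroupId"],
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 443,
        "ToPort": 443,
        "IpRanges": [{"CidrIp": "198.51.100.0/24",   # example hospital range
                      "Description": "State hospital clinicians"}],
    }],
)
```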

Security at the local hospital level in each state is mandatory to protect patient records and comply with HIPAA regulations (Regola & Chawla, 2013). Medical equipment must be secured with authentication and authorization techniques so that only medical staff, nurses, and clinicians have access to medical devices based on their roles. General access should be prohibited, as every member of the hospital has a different role with different responsibilities. Encryption should be used to hide the meaning or intent of communication from unintended users (Stewart, Chapple, & Gibson, 2015). Encryption is an essential security control, especially for data in transit (Stewart et al., 2015). The hospitals in all four states should implement the same type of encryption controls across facilities, such as PKI, cryptographic applications, and symmetric-key algorithms (Stewart et al., 2015).
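
The following sketch illustrates symmetric encryption of a patient record at rest using the Python cryptography package's Fernet recipe; it assumes that key management (for example, a hardware security module or a key-management service) is handled separately.

```python
# Minimal sketch of encrypting a patient record at rest with symmetric
# encryption; key management is out of scope for this illustration.
from cryptography.fernet import Fernet

key = Fernet.generate_key()          # in practice, kept in a key-management service
cipher = Fernet(key)

record = b'{"patient_id": "12345", "diagnosis": "E11.9", "note": "follow up in 3 months"}'
token = cipher.encrypt(record)       # ciphertext safe to store or transmit

# Only holders of the key can recover the plaintext.
assert cipher.decrypt(token) == record
```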

The system requirements should also include identity management systems that interoperate across the hospitals in each state. The identity management system provides authentication and authorization, allowing access only to those who should have access to patients' medical records. The proposal requires the implementation of encryption protocols such as Secure Sockets Layer (SSL), Transport Layer Security (TLS), and Internet Protocol Security (IPSec) to protect information transferred over public networks (Zhang & Liu, 2010).
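
For data in transit, the short sketch below uses Python's standard ssl module to open a TLS connection that validates the server certificate and enforces TLS 1.2 or later; the endpoint host name is a hypothetical example.

```python
# Minimal sketch of enforcing TLS for data in transit between hospital
# systems; the host name is illustrative.
import socket
import ssl

context = ssl.create_default_context()          # validates certificates by default
context.minimum_version = ssl.TLSVersion.TLSv1_2

host = "records.example-hospital.org"           # hypothetical records endpoint
with socket.create_connection((host, 443)) as sock:
    with context.wrap_socket(sock, server_hostname=host) as tls:
        print("Negotiated protocol:", tls.version())
```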

Hadoop Implementation for Data Stream Processing Requirement

While the velocity of BD refers to the speed at which large volumes of data are generated and requires speed in data processing (Hu et al., 2014), the variety of the data requires specific technology capabilities to handle various types of datasets, such as structured, semi-structured, and unstructured data (Bansal, Deshpande, Ghare, Dhikale, & Bodkhe, 2014; Hu et al., 2014). The Hadoop ecosystem is the most appropriate system for implementing BDA (Bansal et al., 2014; Dhotre, Shimpi, Suryawanshi, & Sanghati, 2015). The implementation requirements include various technologies and tools. This section covers the components required when implementing Hadoop technology across the four states for the healthcare BDA system.

Hadoop has three significant limitations that must be addressed in this design. The first limitation is the lack of technical support and documentation for open-source Hadoop (Guo, 2013). Thus, this design requires an Enterprise Edition of Hadoop, available from Cloudera, Hortonworks, or MapR, to get around this limitation (Guo, 2013); the final product decision will be made by the cost-analysis team. The second limitation is that Hadoop is not optimal for real-time data processing (Guo, 2013). The solution requires integrating a real-time streaming framework such as Spark, Storm, or Kafka (Guo, 2013; Palanisamy & Thirunavukarasu, 2017). The requirement to integrate Spark is discussed below as a separate requirement for this design (Guo, 2013). The third limitation is that Hadoop is not a good fit for large graph datasets (Guo, 2013). The solution requires integrating GraphLab, which is also discussed below as a separate requirement for this design.

Conclusion

Information technology (IT) plays a significant role in various industries, including the healthcare sector. This project discussed IT's role in business and the requirement that it be aligned with the strategic goals and organizational systems of the business. If IT systems are not included during the planning of business and organizational strategy, integrating IT into the business at a later stage is very likely to fail. IT offers various advantages to business, including competitive advantage in the marketplace. The healthcare industry is no exception when it comes to integrating IT systems. The healthcare sector has been suffering from various challenges, including the high cost of services and inefficient service to patients. The case study showed the need for IT system requirements that can give the industry a competitive advantage by offering better care to patients at lower cost. Various IT integrations have been used lately in the healthcare industry, including Big Data Analytics, Hadoop technology, security systems, and cloud computing. Kaiser Permanente, for instance, applied Big Data Analytics using HealthConnect to provide care to patients at lower cost and with better quality, aligned with the strategic goals of its business.

References

Abdul, A. M., Jena, S., Prasad, S. D., & Balraju, M. (2014). Trusted Environment In Virtual Cloud. International Journal of Advanced Research in Computer Science, 5(4).

Abernathy, R., & McMillan, T. (2016). CISSP Cert Guide: Pearson IT Certification.

Archenaa, J., & Anita, E. M. (2015). A survey of big data analytics in healthcare and government. Procedia Computer Science, 50, 408-413.

Assunção, M. D., Calheiros, R. N., Bianchi, S., Netto, M. A. S., & Buyya, R. (2015). Big Data Computing and Clouds: Trends and Future Directions. Journal of Parallel and Distributed Computing, 79, 3-15. doi:10.1016/j.jpdc.2014.08.003

Bansal, A., Deshpande, A., Ghare, P., Dhikale, S., & Bodkhe, B. (2014). Healthcare data analysis using dynamic slot allocation in Hadoop. International Journal of Recent Technology and Engineering, 3(5), 15-18.

Bhatt, G. D., & Grover, V. (2005). Types of information technology capabilities and their role in competitive advantage: An empirical study. Journal of management information systems, 22(2), 253-277.

Botta, A., de Donato, W., Persico, V., & Pescapé, A. (2016). Integration of Cloud Computing and Internet Of Things: a Survey. Future Generation computer systems, 56, 684-700.

Brynjolfsson, E., & Hitt, L. M. (2000). Beyond computation: Information technology, organizational transformation and business performance. Journal of Economic perspectives, 14(4), 23-48.

Cloud Security Alliance. (2013). The Notorious Nine: Cloud Computing Top Threats in 2013. Cloud Security Alliance: Top Threats Working Group. 

Cloud Security Alliance. (2016). The Treacherous 12: Cloud Computing Top Threats in 2016. Cloud Security Alliance: Top Threats Working Group. 

Cloud Security Alliance. (2017). The Treacherous 12 Top Threats to Cloud Computing. Cloud Security Alliance: Top Threats Working Group. 

Dewett, T., & Jones, G. R. (2001). The role of information technology in the organization: a review, model, and assessment. Journal of Management, 27(3), 313-346.

Dhotre, P., Shimpi, S., Suryawanshi, P., & Sanghati, M. (2015). Health Care Analysis Using Hadoop. International Journal of Scientific & Technology Research, 4(12), 279-281.

Fox, M., & Vaidyanathan, G. (2016). Impacts of Healthcare Big Data: A Framework With Legal and Ethical Insights. Issues in Information Systems, 17(3).

Ghani, K. R., Zheng, K., Wei, J. T., & Friedman, C. P. (2014). Harnessing big data for health care and research: are urologists ready? European urology, 66(6), 975-977.

Groves, P., Kayyali, B., Knott, D., & Kuiken, S. V. (2016). The ‘Big Data’ Revolution in Healthcare: Accelerating Value and Innovation.

Guo, S. (2013). Hadoop operations and cluster management cookbook: Packt Publishing Ltd.

Gupta, R., Gupta, H., & Mohania, M. (2012). Cloud Computing and Big Data Analytics: What is New From Databases Perspective? Paper presented at the International Conference on Big Data Analytics, Springer-Verlag Berlin Heidelberg.

Henderson, J. C., & Venkatraman, H. (1999). Strategic alignment: Leveraging information technology for transforming organizations. IBM systems journal, 38(2.3), 472-484.

HIPAA. (2018a). At Least 3.14 Million Healthcare Records Were Exposed in Q2, 2018. Retrieved 11/22/2018 from https://www.hipaajournal.com/q2-2018-healthcare-data-breach-report/. 

HIPAA. (2018b). How to Defend Against Insider Threats in Healthcare. Retrieved 8/22/2018 from https://www.hipaajournal.com/category/healthcare-cybersecurity/. 

HIPAA. (2018c). Q3 Healthcare Data Breach Report: 4.39 Million Records Exposed in 117 Breaches. Retrieved 11/22/2018 from https://www.hipaajournal.com/q3-healthcare-data-breach-report-4-39-million-records-exposed-in-117-breaches/. 

HIPAA. (2018d). Report: Healthcare Data Breaches in Q1, 2018. Retrieved 5/15/2018 from https://www.hipaajournal.com/report-healthcare-data-breaches-in-q1-2018/. 

Hu, H., Wen, Y., Chua, T., & Li, X. (2014). Toward Scalable Systems for Big Data Analytics: A Technology Tutorial. Practical Innovation, Open Solution, 2, 652-687. doi:10.1109/ACCESS.2014.2332453

InformationBuilders. (2018). Data In Motion – Big Data Analytics in Healthcare. Retrieved from http://docs.media.bitpipe.com/io_10x/io_109369/item_674791/datainmotionbigdataanalytics.pdf, White Paper.

Ji, Z., Ganchev, I., O’Droma, M., Zhang, X., & Zhang, X. (2014). A cloud-based X73 ubiquitous mobile healthcare system: design and implementation. The Scientific World Journal, 2014.

Kritikos, K., Kirkham, T., Kryza, B., & Massonet, P. (2017). Towards a Security-Enhanced PaaS Platform for Multi-Cloud Applications. Future Generation computer systems, 67, 206-226. doi:10.1016/j.future.2016.10.008

Liang, Y., & Kelemen, A. (2016). Big Data Science and its Applications in Health and Medical Research: Challenges and Opportunities. Austin Journal of Biometrics & Biostatistics, 7(3).

Luo, J., Wu, M., Gopukumar, D., & Zhao, Y. (2016). Big data application in biomedical research and health care: a literature review. Biomedical informatics insights, 8, BII. S31559.

Malik, L., & Sangwan, S. (2015). MapReduce Framework Implementation on the Prescriptive Analytics of Health Industry. International Journal of Computer Science and Mobile Computing, ISSN, 675-688.

Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (2011). Big Data: The Next Frontier for Innovation, Competition, and Productivity. McKinsey Global Institute.

McKelvey, N., Curran, K., Gordon, B., Devlin, E., & Johnston, K. (2015). Cloud Computing and Security in the Future Guide to Security Assurance for Cloud Computing (pp. 95-108): Springer.

Mehmood, A., Natgunanathan, I., Xiang, Y., Hua, G., & Guo, S. (2016). Protection of Big Data Privacy. Institute of Electrical and Electronic Engineers, 4, 1821-1834. doi:10.1109/ACCESS.2016.2558446

Palanisamy, V., & Thirunavukarasu, R. (2017). Implications of Big Data Analytics in developing Healthcare Frameworks–A review. Journal of King Saud University-Computer and Information Sciences.

Pearlson, K., & Saunders, C. (2001). Managing and Using Information Systems: A Strategic Approach. USA: John Wiley & Sons.

Raghupathi, W., & Raghupathi, V. (2014). Big data analytics in healthcare: promise and potential. Health Information Science and Systems, 2(1), 1.

Regola, N., & Chawla, N. (2013). Storing and Using Health Data in a Virtual Private Cloud. Journal of medical Internet research, 15(3), 1-12. doi:10.2196/jmir.2076

Ryssel, R., Ritter, T., & Georg Gemünden, H. (2004). The impact of information technology deployment on trust, commitment and value creation in business relationships. Journal of business & industrial marketing, 19(3), 197-207.

Stewart, J., Chapple, M., & Gibson, D. (2015). CISSP: Certified Information Systems Security Professional Official Study Guide (7th ed.). Wiley.

Sun, J., & Reddy, C. (2013). Big Data Analytics for Healthcare. Retrieved from https://www.siam.org/meetings/sdm13/sun.pdf.

Thompson, E. C. (2017). Building a HIPAA-Compliant Cybersecurity Program, Using NIST 800-30 and CSF to Secure Protected Health Information.

Van-Dai, T., Chuan-Ming, L., & Nkabinde, G. W. (2016, 5-7 July 2016). Big data stream computing in healthcare real-time analytics. Paper presented at the 2016 IEEE International Conference on Cloud Computing and Big Data Analysis (ICCCBDA).

Wang, Y., Kung, L. A., & Byrd, T. A. (2018). Big Data Analytics: Understanding its Capabilities and Potential Benefits for Healthcare Organizations. Technological Forecasting and Social Change, 126, 3-13. doi:10.1016/j.techfore.2015.12.019

Wicklund, E. (2014). ‘Silo’ one of healthcare’s biggest flaws. Retrieved from http://www.healthcareitnews.com/news/silo-one-healthcares-biggest-flaws.

Zhang, R., & Liu, L. (2010). Security models and requirements for healthcare application clouds. Paper presented at the Cloud Computing (CLOUD), 2010 IEEE 3rd International Conference on.

Zia, U. A., & Khan, N. (2017). An Analysis of Big Data Approaches in Healthcare Sector. International Journal of Technical Research & Science, 2(4), 254-264.

Zimmer, T. (2018). What Are the Advantages of Information Technology in Business?

Critical Information Technology Solutions Used to Gain Competitive Advantages

Dr. O. Aly
Computer Science

Abstract

The purpose of this project is to discuss critical information technology solutions used to gain competitive advantages. The discussion begins with Big Data and Big Data Analytics, addressing essential topics such as the Hadoop ecosystem, NoSQL databases, Spark integration for real-time data processing, and Big Data visualization. Cloud computing is an emerging technology that solves Big Data challenges such as storage for the large volume of data and the high-speed processing needed to extract value from data. Enterprise Resource Planning (ERP) is a system that can help organizations gain competitive advantages if implemented correctly. The project discusses various success factors for ERP systems. Big Data plays a significant role in ERP, which is also discussed in this project. The last technology addressed in this project is Customer Relationship Management (CRM), its building blocks, and its integration. The project addresses the challenges and costs associated with CRM, as well as CRM best practices, which can assist in its successful implementation. In summary, enterprises should evaluate the various information technology systems developed to help them gain competitive advantages.

Keywords: Big Data Analytics; Cloud Computing; ERP; CRM.

Introduction

Enterprises should evaluate various information technologies to gain competitive advantages in the market. Big Data and Big Data Analytics are among the most significant topics in information technology and computer science. Cloud computing is another critical topic in the same domains, as it emerged to solve the challenges of Big Data. Thus, this project begins with these top information technologies. The discussion covers major Big Data topics such as the Hadoop ecosystem and Spark for real-time processing. The discussion of cloud computing covers the service models and deployment models that cloud computing offers.

The most common business areas that require information technology support include Enterprise Resource Planning (ERP), Customer Relationship Management (CRM), Product Life Cycle Management (PLM), Supply Chain Management (SCM), and Supplier Relationship Management (SRM) (DuttaRoy, 2016). Thus, this project discusses ERP and CRM as additional critical information technology systems that help enterprises gain competitive advantages.

Big Data and Big Data Analytics

Big Data is now a buzzword in the fields of computer science and information technology. Big Data has attracted the attention of various sectors, researchers, academia, government, and even the media (Géczy, 2014; Kaisler, Armour, Espinosa, & Money, 2013). A 2011 report by the International Data Corporation (IDC) estimated that the amount of information created and replicated in 2011 would exceed 1.8 zettabytes (1.8 trillion gigabytes), an amount growing by a factor of nine in just five years (Gantz & Reinsel, 2011).

BD and BDA are terms that have been used interchangeably and described as the next frontier for innovation, competition, and productivity (Maltby, 2011; Manyika et al., 2011). BD is characterized by a multi-V model: volume refers to the large size of datasets, velocity to the speed of computation and data generation, and variety to the various data types such as semi-structured and unstructured data (Assunção, Calheiros, Bianchi, Netto, & Buyya, 2015; Hu, Wen, Chua, & Li, 2014). Various industries have taken this opportunity and applied BD and BDA in their business models (Manyika et al., 2011). Many technologies, such as cloud computing, Hadoop MapReduce, Hive, and others, have emerged to deal with the Big Data phenomenon. Data without analysis have no value to organizations.

Hadoop Ecosystem

While the velocity of BD refers to the speed at which large volumes of data are generated and requires speed in data processing (Hu et al., 2014), the variety of the data requires specific technology capabilities to handle various types of datasets, such as structured, semi-structured, and unstructured data (Bansal, Deshpande, Ghare, Dhikale, & Bodkhe, 2014; Hu et al., 2014). The Hadoop ecosystem is the most appropriate system for implementing BDA (Bansal et al., 2014; Dhotre, Shimpi, Suryawanshi, & Sanghati, 2015), and Hadoop technologies have been at the forefront of Big Data applications (Bansal et al., 2014; Chrimes, Zamani, Moa, & Kuo, 2018). The Hadoop ecosystem will be part of the implementation requirement, as it is proven to serve intensive computation on large datasets well (Raghupathi & Raghupathi, 2014; Wang, Kung, & Byrd, 2018). The required Hadoop version is 2.x, which includes YARN for resource management (Karanth, 2014). Hadoop 2.x also includes HDFS snapshots, which provide a read-only image of an entire filesystem or a particular subset of it to protect against user errors and to support backup and disaster recovery (Karanth, 2014). The Hadoop platform can be implemented to gain more insight into various areas (Raghupathi & Raghupathi, 2014; Wang et al., 2018). The Hadoop ecosystem involves the Hadoop Distributed File System (HDFS), MapReduce, and NoSQL stores such as HBase, along with Hive for SQL-like querying, to handle large volumes of data using various algorithms and machine learning to extract value from medical records that are structured, semi-structured, and unstructured (Raghupathi & Raghupathi, 2014; Wang et al., 2018). Other supporting components include Oozie for workflow, Pig for scripting, and Mahout for machine learning, which is part of artificial intelligence (AI) (Ankam, 2016; Karanth, 2014). The ecosystem also includes tools such as Flume for log collection, Sqoop for data exchange, and ZooKeeper for coordination (Ankam, 2016; Karanth, 2014). HCatalog is a required component to manage metadata in Hadoop (Ankam, 2016; Karanth, 2014). Figure 1 shows the Hadoop ecosystem before integrating Spark for real-time analytics.


Figure 1.  Hadoop Architecture Overview (Alguliyev & Imamverdiyev, 2014).
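
To illustrate the MapReduce component of the ecosystem, the sketch below counts encounters per diagnosis code with two small Python scripts that could be run through Hadoop Streaming. The tab-separated input layout (with an ICD-10 code in the third field) is an assumption, not a real hospital export format.

```python
#!/usr/bin/env python3
# mapper.py -- emits (diagnosis_code, 1) for each encounter record.
# Assumes tab-separated input with an ICD-10 code in the third field.
import sys

for line in sys.stdin:
    fields = line.rstrip("\n").split("\t")
    if len(fields) >= 3:
        print(f"{fields[2]}\t1")
```

```python
#!/usr/bin/env python3
# reducer.py -- sums counts per diagnosis code (input arrives sorted by key).
import sys

current_key, count = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t")
    if key != current_key:
        if current_key is not None:
            print(f"{current_key}\t{count}")
        current_key, count = key, 0
    count += int(value)
if current_key is not None:
    print(f"{current_key}\t{count}")
```

These scripts would be submitted through the Hadoop Streaming jar (using its -files, -mapper, -reducer, -input, and -output options); the exact jar path depends on the distribution chosen.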

NoSQL Databases

In the age of BD and BDA, traditional data stores are inadequate to handle not only the large volume of datasets but also the various data formats such as unstructured and semi-structured data (Hu et al., 2014). Thus, Not Only SQL (NoSQL) databases emerged to meet the requirements of BDA. These NoSQL data stores are used for modern, scalable databases (Sahafizadeh & Nematbakhsh, 2015). The scalability of NoSQL data stores enables systems to increase throughput when demand increases during data processing (Sahafizadeh & Nematbakhsh, 2015). The platform can incorporate two scalability types to support large volumes of data: horizontal and vertical scalability. Horizontal scaling distributes the workload across many servers and nodes to increase throughput, while vertical scaling requires more processors, more memory, and faster hardware on a single server (Sahafizadeh & Nematbakhsh, 2015).

NoSQL data stores come in many varieties, such as MongoDB, CouchDB, Redis, Voldemort, Cassandra, Bigtable, Riak, HBase, Hypertable, ZooKeeper, Vertica, Neo4j, db4o, and DynamoDB. These data stores are categorized into four types: document-oriented stores, column-oriented (or column-family) stores, graph databases, and key-value stores (EMC, 2015; Hashem et al., 2015). The document-oriented data store can store and retrieve collections of data and documents using complex data forms in various formats such as XML and JSON as well as PDF and MS Word (EMC, 2015; Hashem et al., 2015). MongoDB and CouchDB are examples of document-oriented data stores (EMC, 2015; Hashem et al., 2015). The column-oriented data store keeps content in columns rather than rows, with the attributes of a column stored contiguously (Hashem et al., 2015). This type of data store can store and render blog entries, tags, and feedback (Hashem et al., 2015). Cassandra, DynamoDB, and HBase are examples of column-oriented data stores (EMC, 2015; Hashem et al., 2015). The key-value store can hold and scale large volumes of data and consists of a value and a key used to access that value (EMC, 2015; Hashem et al., 2015). The value can be complex, and this type of data store can be useful for storing a user's login ID as the key referencing patient values. Redis and Riak are examples of key-value NoSQL data stores (Alexandru, Alexandru, Coardos, & Tudora, 2016). Each of these NoSQL data stores has its own limitations and advantages. The graph NoSQL database stores and represents data using graph models with nodes, edges, and properties related to one another through relations, which is useful for unstructured medical data such as images and lab results. Neo4j is an example of a graph NoSQL database (Hashem et al., 2015). Figure 2 summarizes these NoSQL data store types, the data they store, and examples.

Figure 2.  Big Data Analytics NoSQL Data Store Types.
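
As a concrete example of the document-oriented category, the sketch below stores and queries a semi-structured clinical note in MongoDB using the pymongo driver; the database, collection, and field names are illustrative, and a local MongoDB instance is assumed.

```python
# Minimal sketch of storing and querying a semi-structured clinical note in
# MongoDB (a document-oriented NoSQL store); names and fields are illustrative.
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017/")
notes = client["healthcare"]["clinical_notes"]

notes.insert_one({
    "patient_id": "12345",
    "author": "Dr. Smith",
    "tags": ["diabetes", "follow-up"],
    "text": "Patient stable on metformin 500 mg twice daily.",
})

# Documents with different shapes can coexist in the same collection.
for doc in notes.find({"tags": "diabetes"}):
    print(doc["patient_id"], doc["text"])
```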

Spark Integration for Real-Time Data Processing

While the Hadoop ecosystem architecture has been designed for various scenarios such as data storage, data management, statistical analysis, statistical association between data sources, distributed computing, and batch processing, businesses require real-time data processing to gain competitive advantages. However, real-time processing cannot be met by Hadoop alone (Basu, 2014). Real-time analytics will add tremendous value to the proposed healthcare system. Thus, Apache Spark is another component required for real-time data processing. Spark allows in-memory processing for fast response times, bypassing MapReduce operations (Basu, 2014). With Spark integrated with Hadoop, stream processing, machine learning, interactive analytics, and data integration become possible (Scott, 2015). Spark runs on top of Hadoop to benefit from YARN and from the underlying storage of HDFS, HBase, and other Hadoop ecosystem building blocks (Scott, 2015). Figure 3 shows Spark's core engines.


Figure 3. Spark Core Engines (Scott, 2015).
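
The sketch below shows Spark Structured Streaming computing a windowed aggregate, the kind of real-time summary a monitoring dashboard might display. PySpark is assumed to be installed, and the built-in rate source stands in for a real feed such as Kafka messages carrying vital signs.

```python
# Minimal sketch of Spark Structured Streaming; the built-in "rate" source
# stands in for a real feed such as Kafka vital-signs messages.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("vitals-stream-demo").getOrCreate()

# The rate source emits (timestamp, value) rows; treat "value" as a mock reading.
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# Aggregate readings into 10-second windows, as a real-time dashboard might.
windowed = (stream
            .groupBy(F.window("timestamp", "10 seconds"))
            .agg(F.avg("value").alias("avg_reading")))

query = (windowed.writeStream
         .outputMode("complete")
         .format("console")
         .start())
query.awaitTermination(30)   # run briefly for demonstration, then stop
query.stop()
```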

Big Data Visualization

Visualization is one of the most powerful ways to present data (Jayasingh, Patra, & Mahesh, 2016). It helps in viewing data in a more meaningful way, in the form of graphs, images, and pie charts that can be understood easily. It helps in synthesizing large volumes of data, such as healthcare data, to get at the core of raw big data and convey the key points for insight (Meyer, 2018). Some of the commercial visualization tools include Tableau, Spotfire, QlikView, and Adobe Illustrator. However, the most commonly used visualization tools in healthcare are Tableau, Power BI, and QlikView.
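
Beyond the commercial tools, a chart like the one a healthcare dashboard might display can be sketched with matplotlib, as below; the admission counts are made-up illustrative numbers.

```python
# Minimal sketch of a healthcare dashboard chart; the counts are illustrative.
import matplotlib.pyplot as plt

departments = ["Cardiology", "Oncology", "Orthopedics", "Neurology"]
admissions = [120, 85, 60, 45]   # hypothetical monthly admissions

plt.figure(figsize=(6, 4))
plt.bar(departments, admissions, color="steelblue")
plt.title("Monthly Admissions by Department")
plt.ylabel("Admissions")
plt.tight_layout()
plt.savefig("admissions_by_department.png")   # or plt.show() interactively
```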

Cloud Computing Technology

Numerous studies have discussed the definition of cloud computing, as the term was not well defined (Foster, Zhao, Raicu, & Lu, 2008). In an effort to define the term precisely, IT practitioners, academics, and the research community have proposed various definitions; Vaquero, Rodero-Merino, Caceres, and Lindner (2008) compiled twenty-two definitions of cloud computing from different research studies. The underlying concepts of cloud computing rely heavily on providing computing power, storage services, software services, and platform services on demand to customers over the internet (Lewis, 2010). Access to cloud computing services can scale up or down as needed, and consumers use the pay-per-use or pay-as-you-go model (Armbrust et al., 2009; Lewis, 2010).

The National Institute of Standards and Technology (NIST) proposed an official definition of cloud computing: cloud computing enables ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources such as networks, servers, storage, applications, and services. Organizations can quickly provision and release these resources with minimal management effort or service provider interaction (Mell & Grance, 2011).

Cloud Computing Essential Characteristics

The essential characteristics of cloud computing identified by NIST include on-demand self-service, broad network access, resource pooling, rapid elasticity, and measured service (Mell & Grance, 2011). The on-demand self-service feature provides cloud consumers with computing capabilities, such as server time and network storage, as needed and automatically, eliminating the need for any human interaction with a service provider. The broad network access feature makes capabilities available to cloud consumers over the network through various devices, such as mobile phones and tablets, from anywhere, enabling heterogeneous client platforms. The resource pooling feature provides a multi-tenant model that serves multiple consumers sharing a pool of resources. This feature provides location independence, where consumers do not know the exact location of the provided resources, although they may be able to specify the location at a higher level of abstraction such as country, state, or data center (Mell & Grance, 2011). The rapid elasticity feature provides capabilities to scale horizontally and vertically to meet demand. The measured service feature enables measurement of the consumption of resources such as processing, storage, and bandwidth. Resource utilization can be monitored, controlled, and reported, providing transparency for both the provider and the consumer of the utilized services (Mell & Grance, 2011).

Cloud Computing Three Essential Service Models

Cloud computing offers three essential service models: Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS), and Software-as-a-Service (SaaS) (Mell & Grance, 2011). The IaaS layer provides consumers the capability to provision storage, processing, networks, and other fundamental computing resources. Using IaaS, the consumer can deploy and run arbitrary software, which can include operating systems and applications. When using IaaS, users do not manage or control the underlying cloud infrastructure; they have control over the storage, operating systems, and deployed applications, and limited control of some networking components such as host firewalls. PaaS allows cloud consumers to deploy applications created using programming languages, libraries, services, and tools supported by the provider. Using PaaS, consumers do not manage or control the underlying cloud infrastructure, including the network, servers, operating systems, or storage, but they have control over the deployed applications and possibly the configuration settings of the application-hosting environment. SaaS allows cloud consumers to use the provider's applications running on the cloud infrastructure. SaaS consumers can access the applications from various client devices through either a thin client interface, such as web-based email in a web browser, or a program interface. SaaS consumers do not control or manage the underlying cloud infrastructure, such as the network, operating systems, or storage, or even individual application capabilities, with the possible exception of limited user-specific application configuration settings (Mell & Grance, 2011).

Cloud Computing Four Essential Deployment Models

Cloud computing offers four essential deployment models known as the public cloud, private cloud, community cloud, and hybrid cloud (Mell & Grance, 2011). The public cloud reflects cloud infrastructure available to the general public. It can be managed, owned, and operated by businesses, academic entities, government entities, or a combination of them, and it resides on the premises of the cloud provider. The private cloud is cloud infrastructure designed exclusively for a single organization. This deployment model can be managed, owned, and operated by the organization, a third party, or a combination of both, and it may reside either on-premises or off-premises. The community cloud is cloud infrastructure designed exclusively for a specific community of consumers from organizations that have shared concerns such as security requirements, compliance considerations, and policies. One or more of the organizations in the community, a third party, or some combination of them can manage, own, and operate the community cloud, which can reside on-premises or off-premises. The hybrid cloud is cloud infrastructure combining two or more cloud infrastructures such as private, public, or community (Mell & Grance, 2011). Figure 4 presents the full representation of cloud computing technology per NIST, including the standard service models, deployment models, and essential characteristics.

Figure 4.  Overview of Cloud Computing based on NIST’s Definitions.

Cloud Computing Role in Big Data and Big Data Analytics

Cloud computing plays a significant role in BDA (Assunção et al., 2015). The massive computation and storage requirements of BDA create a critical need for the emerging technology of cloud computing (Mehmood, Natgunanathan, Xiang, Hua, & Guo, 2016). Cloud computing offers various benefits such as cost reduction, elasticity, pay per use, availability, reliability, and maintainability (Gupta, Gupta, & Mohania, 2012; Kritikos, Kirkham, Kryza, & Massonet, 2017). However, although cloud computing offers various benefits, it has security and privacy issues under the standard deployment models of public, private, hybrid, and community clouds.

Enterprise Resource Planning (ERP)

The American Production and Inventory Control Society (2001), as cited in Madanhire and Mbohwa (2016), defined ERP as a method for the effective planning and control of all the resources needed to take, make, ship, and account for customer orders in a manufacturing, distribution, or service organization. This integration of functions can be achieved through a software package solution offered by vendors to support the seamless integration of all information flowing through the enterprise, such as financial, accounting, and human resources information. ERP is business management software designed to integrate the data sources and processes of the entire organization into a combined system (Bahssas, AlBar, & Hoque, 2015).

An ERP system is a popular solution used by organizations to integrate and automate various processes, improve performance, and reduce costs. ERP provides a business with a real-time view of its core business processes such as production, planning, manufacturing, inventory management, and development (Bahssas et al., 2015). ERP software is a multi-module application that integrates activities across functional departments such as production, planning, purchasing, inventory control, product distribution, and order tracking. It allows the automation and integration of business processes by enabling data and information sharing to reach best practices in managing the processes of the business.

ERP involves various modules such as accounting, finance, supply chain, human resources, customer information, and others (Bahssas et al., 2015; Madanhire & Mbohwa, 2016). The production planning module is used to optimize the utilization of manufacturing capacity, parts, components, and material resources. The purchasing module streamlines the procurement of required raw materials, as it automates the processes of identifying potential suppliers, negotiating prices, placing orders with suppliers, and the related billing. The inventory control module facilitates maintaining an appropriate level of stock in the warehouse by identifying inventory requirements, setting targets, providing replenishment techniques and options, monitoring item usage, reconciling inventory balances, and reporting inventory status. The sales module is used for order placement, order scheduling, shipping, and invoicing. The marketing module supports lead generation and direct mailing campaigns. The financial module gathers financial data from various departments and generates reports such as the balance sheet, general ledger, and trial balance. The human resources (HR) module maintains a complete employee database including contact information, salary details, attendance, and so forth (Madanhire & Mbohwa, 2016).

Innovations and technology trends have forced ERP designers to pursue new development. Thus, new ERP system designs are implemented to satisfy organizations and customers by evolving new ERP business models. Furthermore, one of the biggest challenges for ERP is keeping pace with the manufacturing sector, which has been moving rapidly from a product-centric to a customer-centric focus (Bahssas et al., 2015). Most ERP vendors have had to add a variety of functions and modules to their core systems.

Critical Factors for Successful ERP Implementation

The implementation of ERP systems is costly, and organizations should be careful when implementing them to ensure success. Some believe that ERP systems could hurt their business because of the potential problems of ERP (Umble, Haft, & Umble, 2003). Various studies have identified success factors for ERP. Umble et al. (2003) addressed the most prominent factors for successful implementation of ERP. The first critical success factor is that organizations should have a clear understanding of their strategic goals. Commitment by top management is another success factor. Successful ERP implementation also requires excellent project management. The existing organizational structure and processes found in most enterprises are not compatible with the structure, tools, and types of information provided by ERP systems; thus, organizational change management is required to ensure successful implementation. ERP implementation teams should be composed of highly skilled professionals chosen for their skills, past accomplishments, reputation, and flexibility. Data accuracy is another success factor for ERP implementation, as are education and training. Bahssas et al. (2015) indicated that reserving 10-15% of the total ERP implementation budget for training gives an organization an 80% chance of successful implementation. Focused performance measures must be included from the beginning of the implementation, because if the system is not associated with compensation, it will not be successful.

Big Data and Big Data Analytics Role in ERP

Big Data Analytics plays a significant role in ERP applications (Carlton, 2014; ERP Solutions, 2018; Woodie, 2016). Enterprise data spans various departments such as HR, finance, CRM, and other essential business functions. This data can be leveraged to make ERP functionality better. When Big Data tools are brought together with an ERP system, they can unfold valuable insights that help businesses make smarter decisions (Carlton, 2014; Cornell University, 2017; Wailgum, 2018). Many ERP systems fail to make use of real-time inventory and supply chain data because these systems lack the intelligence to make predictions about product demand (Carlton, 2014; ERP Solutions, 2018). Big Data tools can predict demand and help determine what a company needs going forward (ERP Solutions, 2018). Infor co-president Duncan Angove established Dynamic Science Labs (DSL) aiming to use data science techniques to solve a particular class of business problems for its customers; employees with big data, math, and coding skills were hired at the Cambridge, Massachusetts-based organization to develop proofs of concept (POC) (Woodie, 2016). Big Data systems such as Apache Hadoop are creating node-level operating transparency that affects nearly every current ERP module in real time (Carlton, 2014). Managers will be able to quickly leverage ERP Big Data capabilities, thereby enhancing information density and speeding up overall decision-making. In brief, Big Data and Big Data Analytics impact business at all levels, and ERP is no exception.

Customer Relationship Management (CRM)

Customer Relationship Management (CRM) systems assist organizations in managing customer interactions and customer data, automating marketing, sales, and customer support, assessing business information, and managing partner, vendor, and employee relationships.  A quality CRM system can be scaled to serve the needs of a small, medium, or large business (Financesonline, 2018).  CRM systems can be customized to allow a business to derive actionable customer insights using back-end analytics, identify opportunities with predictive analytics, personalize customer support, and streamline operations based on the history of customers' interactions with the business.  Organizations must be aware of the CRM software available in order to select the system that best serves their needs.

Various reports have identified leading CRM systems.  The best-known systems include Salesforce CRM, HubSpot CRM, Freshsales, Pipedrive, Insightly, Zoho CRM, Nimble, PipelineDeals, Nutshell CRM, Microsoft Dynamics CRM, SalesforceIQ, Spiro, and ExxpertApps.  Table 1 shows the best CRM systems available in the market.


Table 1.  CRM Systems  (Financesonline, 2018).

Customer satisfaction is a critical element in the success of a business (Bygstad, 2003; Pearlson & Saunders, 2001).  Businesses need to continuously satisfy customers, understand their needs and expectations, and provide high-quality products or services at a competitive price to maintain success.  These interactions need to be tracked and analyzed by the business in an organized way to foster long-lasting customer relationships, which in turn translate into long-term success.

CRM can help a business increase sales efficiency, drive customer satisfaction, streamline business processes and make them more efficient, and identify and resolve bottlenecks in operational processes from marketing and sales to product development (Ahearne, Rapp, Mariadoss, & Ganesan, 2012; Bygstad, 2003).  The development of customer relationships is not a trivial or straightforward task. When it is done right, it gives the business a competitive edge. However, the implementation of CRM is challenging.

CRM Challenges and Costs

The implementation of CRM demonstrates the value of customers to the business and places customer service as a top priority (Pearlson & Saunders, 2001).  CRM plays a significant role in coordinating the effort between customer service, marketing, and sales in an organization.  However, the implementation of CRM is challenging, especially for small businesses and startups.  Various reports have addressed the challenges of implementing CRM.  Cost is the most significant challenge organizations are confronted with when implementing a CRM solution (Sage Software, 2015).  The development of clear objectives to achieve with the CRM system is another challenge.  Organizations must also decide on the type of deployment, whether on-premise or cloud-based CRM.  Other challenges involve employee training, choosing the right CRM solution provider, and planning the integration in advance (Sage Software, 2015).

The cost of CRM systems varies from one vendor to another based on features and deployment options such as data importing, analytics, email integration, mobile accessibility, email marketing, multi-channel support, and SaaS or on-premise platforms.  Some vendors offer CRM for small and medium businesses, or small businesses only, while others offer CRM systems for small, medium, and large businesses.  In a report by Business-Software (2019), cost is categorized from most expensive to least expensive using dollar signs: $$$$ for most expensive, $$$ for expensive, $$ for less expensive, and $ for least expensive.  Each vendor's CRM system has certain features which must be examined by organizations before deciding to adopt the system.  Table 2 provides an overview of costs from most expensive to least expensive.


Table 2.  CRM System Costs based on the Report by (Business-Software, 2019).

 

The Building Blocks of CRM Systems and Their Integration

Understanding the building blocks of a CRM system can assist in its implementation and integration.  CRM involves four core building blocks (Meyer & Kolbe, 2005). The first is the acquisition and continuous update of a knowledge base on customers' needs, motivations, and behavior over the lifetime of the customer relationship.  The second is the application of customer knowledge to continuously improve performance through a process of learning from successes and failures.  The third is the integration of marketing, sales, and service activities to achieve a common goal.  The last building block involves the implementation of appropriate systems to support customer knowledge acquisition and sharing and the measurement of CRM effectiveness.

CRM integration is a critical building block for CRM success (Meyer, 2005).  The process of integrating CRM involves various organizational and operational functions of the business such as marketing, sales, and service activities.  CRM requires detailed business processes, which can be categorized into three core elements: the CRM delivery process, the CRM support process, and the CRM analysis process.  The delivery process involves direct contact with customers to cover part of the customer process, such as campaign management, sales management, service management, and complaint management. The support process also involves direct contact with customers but fulfills supporting functions within the CRM context rather than covering part of the customer process, such as market research and loyalty management.  The analysis process consolidates and analyzes the knowledge of customers collected in the other CRM processes.  The result of the analysis process is passed to the delivery process, the support process, and the service innovation and service production processes to enhance their effectiveness; examples include customer scoring and lead management, customer profiling and segmentation, and feedback and knowledge management.
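One simple way to make these three process categories concrete is to tag each CRM business process with its category in code.  The sketch below is only an illustration; the process names are hypothetical and do not represent any vendor schema.

from dataclasses import dataclass
from enum import Enum

class CRMCategory(Enum):
    DELIVERY = "delivery"   # direct customer contact (campaigns, sales, service, complaints)
    SUPPORT = "support"     # supporting functions (market research, loyalty management)
    ANALYSIS = "analysis"   # consolidates customer knowledge for the other processes

@dataclass
class CRMProcess:
    name: str
    category: CRMCategory

processes = [
    CRMProcess("campaign_management", CRMCategory.DELIVERY),
    CRMProcess("complaint_management", CRMCategory.DELIVERY),
    CRMProcess("market_research", CRMCategory.SUPPORT),
    CRMProcess("customer_scoring", CRMCategory.ANALYSIS),
]

# Group processes by category, e.g. to route analysis output to delivery processes.
by_category = {}
for p in processes:
    by_category.setdefault(p.category, []).append(p.name)
print(by_category)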

Best Practices in Implementing These CRM Systems

Various studies and reports have addressed best practices for implementing and integrating CRM systems into the business (Salesforce, 2018; Schiff, 2018).  Organizations must choose a CRM that fits their needs.  Not every CRM is created equally, and if organizations choose a CRM system without properly researching its features, capabilities, and weaknesses, they could end up committed to a system that is not appropriate for the business and, as a result, could lose money.  Organizations should decide whether the CRM should be cloud-based or on-premise (Salesforce, 2018; Schiff, 2018; Wailgum, 2008).  Organizations should decide whether the CRM should be a service contract or one that costs more upfront to install.  The business should also decide whether it needs in-depth, highly customizable features or whether basic functionality will be sufficient.  Organizations should analyze the options and decide on the CRM system that is most appropriate for the business, one that can serve its needs, build strong customer relationships, and help gain a competitive edge in the market.

A well-trained workforce will help an organization achieve its strategic CRM goals. If organizations do not invest in training the workforce on how to utilize the CRM system, CRM tools will become useless.  CRM systems are only as effective as organizations allow them to be. When the workforce is not using the CRM system to its full potential, or is misusing it, CRM will not perform its functions properly and will not serve the needs of the business as expected (Salesforce, 2018; Schiff, 2018).

Automation is another critical factor for best practice when implementing CRM systems.  Tasks associated with data entry can be automated so that CRM systems stay up to date.  Automation will increase the efficiency of the CRM systems as well as the business overall (Salesforce, 2018; Schiff, 2018).  One of the significant benefits of CRM is its potential for improving and enhancing cooperative efforts across departments of the business; when the same information is accessible across various departments, CRM systems eliminate the confusion that can be caused by using different terms and different information.  Data without analysis is meaningless.  Organizations should consider mining the data to extract value that can aid in making sound business decisions.  CRM systems are designed to capture and organize massive amounts of data; if organizations do not take advantage of this data and turn it into actionable information, the value of the CRM implementation will be limited. The best CRM systems come with built-in analytics features that use advanced programming to mine all captured data and produce valuable conclusions for future business decisions.  When organizations take advantage of the built-in analytics and analyze the data that the CRM system procures, the resulting information can provide insight for business decisions (Salesforce, 2018).  The last element of best practice in implementing CRM is to keep it simple. The best CRM system is the one that best fits the needs and requirements of the business, and simplicity is a crucial element.  Organizations should implement a CRM that provides everything the business needs without unnecessary complexity.  Organizations should also consider making changes to CRM policies where necessary; the effectiveness of day-to-day operations is the best indicator of whether the CRM performs as expected, and if it does not, changes must be made until it does (Salesforce, 2018; Wailgum, 2008).
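As a hedged illustration of the data-mining point above, the following sketch uses pandas on a hypothetical CRM export (interactions.csv with customer_id, channel, and purchase_amount columns) to roll up interactions per customer and flag the top decile by revenue.  The file and column names are assumptions, not a vendor's data model.

import pandas as pd

# Minimal sketch (hypothetical CRM export): per-customer rollup of contact
# volume and revenue, a simple starting point for segmentation.
interactions = pd.read_csv("interactions.csv")

summary = (interactions
           .groupby("customer_id")
           .agg(contacts=("channel", "count"),
                revenue=("purchase_amount", "sum"))
           .sort_values("revenue", ascending=False))

# Flag the top decile of customers by revenue for targeted outreach.
threshold = summary["revenue"].quantile(0.9)
summary["top_decile"] = summary["revenue"] >= threshold
print(summary.head())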

Conclusion

This project discussed critical information technology solutions used to gain competitive advantages.  The discussion began with Big Data and Big Data Analytics, addressing essential topics such as the Hadoop ecosystem, NoSQL databases, Spark integration for real-time data processing, and Big Data visualization. Cloud computing is an emerging technology for solving Big Data challenges such as storage for the large volume of data and the high-speed processing needed to extract value from it.  Enterprise Resource Planning (ERP) is a system that can help organizations gain competitive advantages if implemented correctly, and the project discussed various success factors for ERP systems.  Big Data plays a significant role in ERP, which is also discussed in this project.  The last technology addressed is Customer Relationship Management (CRM), including its building blocks and integration.  The project addressed the challenges and costs associated with CRM, and best practices that can assist in its successful implementation.  In summary, enterprises should evaluate the various information technology systems developed to help them gain competitive advantages.

References

Ahearne, M., Rapp, A., Mariadoss, B. J., & Ganesan, S. (2012). Challenges of CRM implementation in business-to-business markets: A contingency perspective. Journal of Personal Selling & Sales Management, 32(1), 117-129.

Alexandru, A., Alexandru, C., Coardos, D., & Tudora, E. (2016). Healthcare, Big Data and Cloud Computing. management, 1, 2.

Alguliyev, R., & Imamverdiyev, Y. (2014). Big data: big promises for information security. Paper presented at the Application of Information and Communication Technologies (AICT), 2014 IEEE 8th International Conference on.

Ankam, V. (2016). Big Data Analytics: Packt Publishing Ltd.

Armbrust, M., Fox, A., Griffith, R., Joseph, A. D., Katz, R. H., Konwinski, A., . . . Stoica, I. (2009). Above The Clouds: A Berkeley View of Cloud Computing. Electrical Engineering and Computer Sciences University of California at Berkeley.

Assunção, M. D., Calheiros, R. N., Bianchi, S., Netto, M. A. S., & Buyya, R. (2015). Big Data Computing and Clouds: Trends and Future Directions. Journal of Parallel and Distributed Computing, 79, 3-15. doi:10.1016/j.jpdc.2014.08.003

Bahssas, D. M., AlBar, A. M., & Hoque, M. R. (2015). Enterprise resource planning (ERP) systems: design, trends and deployment. The International Technology Management Review, 5(2), 72-81.

Bansal, A., Deshpande, A., Ghare, P., Dhikale, S., & Bodkhe, B. (2014). Healthcare data analysis using dynamic slot allocation in Hadoop. International Journal of Recent Technology and Engineering, 3(5), 15-18.

Basu, A. (2014). Real-Time Healthcare Analytics on Apache Hadoop* using Spark* and Shark. Retrieved from https://www.intel.com/content/dam/www/public/us/en/documents/white-papers/big-data-real-time-healthcare-analytics-whitepaper.pdf.

Business-Software. (2019). Top 40 CRM Software Report.  

Bygstad, B. (2003). The implementation puzzle of CRM systems in knowledge based organizations. Information Resources Management Journal (IRMJ), 16(4), 33-45.

Carlton, R. (2014). 5 Ways Big Data is Changing ERP Software. Retrieved from https://www.erpfocus.com/five-ways-big-data-is-changing-erp-software-2733.html.

Chrimes, D., Zamani, H., Moa, B., & Kuo, A. (2018). Simulations of Hadoop/MapReduce-Based Platform to Support its Usability of Big Data Analytics in Healthcare.

Cornell University. (2017). Enterprise Information Systems. Retrieved from https://it.cornell.edu/strategic-plan/enterprise-information-systems. 

Dhotre, P., Shimpi, S., Suryawanshi, P., & Sanghati, M. (2015). Health Care Analysis Using Hadoop. International Journal of Scientific & Technology Research, 4(12), 279-281.

DuttaRoy, S. (2016). SAP Business Analytics: A Best Practices Guide for Implementing Business Analytics Using SAP: Springer.

EMC. (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data. (1st ed.): Wiley.

ERP Solutions. (2018). The Role of Big Data Analytics in ERP Applications. Retrieved from https://erpsolutions.oodles.io/big-data-analytics-in-erp/. 

Financesonline. (2018). 15 Best CRM Systems for Your Business. Retrieved from https://financesonline.com/15-best-crm-software-systems-business/. 

Foster, I., Zhao, Y., Raicu, I., & Lu, S. (2008). Cloud Computing and Grid Computing 360-Degree Compared. Paper presented at the 2008 Grid Computing Environments Workshop.

Gantz, J., & Reinsel, D. (2011). Extracting Value From Chaos. International Data Corporation, 1142, 1-12.

Géczy, P. (2014). Big data characteristics. The Macrotheme Review, 3(6), 94-104.

Gupta, R., Gupta, H., & Mohania, M. (2012). Cloud Computing and Big Data Analytics: What is New From Databases Perspective? Paper presented at the International Conference on Big Data Analytics, Springer-Verlag Berlin Heidelberg.

Hashem, I. A. T., Yaqoob, I., Anuar, N. B., Mokhtar, S., Gani, A., & Khan, S. U. (2015). The Rise of “Big Data” on Cloud Computing: Review and Open Research Issues. Information Systems, 47, 98-115. doi:10.1016/j.is.2014.07.006

Hu, H., Wen, Y., Chua, T., & Li, X. (2014). Toward Scalable Systems for Big Data Analytics: A Technology Tutorial. Practical Innovation, Open Solution, 2, 652-687. doi:10.1109/ACCESS.2014.2332453

Jayasingh, B. B., Patra, M. R., & Mahesh, D. B. (2016, 14-17 Dec. 2016). Security issues and challenges of big data analytics and visualization. Paper presented at the 2016 2nd International Conference on Contemporary Computing and Informatics (IC3I).

Kaisler, S., Armour, F., Espinosa, J. A., & Money, W. (2013). Big Data: Issues and Challenges Moving Forward. Paper presented at the Hawaii International Conference on System Sciences

Karanth, S. (2014). Mastering Hadoop: Packt Publishing Ltd.

Kritikos, K., Kirkham, T., Kryza, B., & Massonet, P. (2017). Towards a Security-Enhanced PaaS Platform for Multi-Cloud Applications. Future Generation computer systems, 67, 206-226. doi:10.1016/j.future.2016.10.008

Lewis, G. (2010). Basics About Cloud Computing. Software Engineering Institute Carnegie Mellon University, Pittsburgh.

Madanhire, I., & Mbohwa, C. (2016). Enterprise resource planning (ERP) in improving operational efficiency: Case study. Procedia Cirp, 40, 225-229.

Maltby, D. (2011). Big Data Analytics. Paper presented at the Annual Meeting of the Association for Information Science and Technology.

Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., & Byers, A. H. (2011). Big Data: The Next Frontier for Innovation, Competition, and Productivity. McKinsey Global Institute.

Mehmood, A., Natgunanathan, I., Xiang, Y., Hua, G., & Guo, S. (2016). Protection of Big Data Privacy. Institute of Electrical and Electronic Engineers, 4, 1821-1834. doi:10.1109/ACCESS.2016.2558446

Mell, P., & Grance, T. (2011). The NIST Definition of Cloud Computing. National Institute of Standards and Technology (NIST), 800-145, 1-7.

Meyer, M. (2005). Multidisciplinarity of CRM Integration and its Implications. Paper presented at the System Sciences, 2005. HICSS’05. Proceedings of the 38th Annual Hawaii International Conference on.

Meyer, M. (2018). The Rise of Healthcare Data Visualization.

Meyer, M., & Kolbe, L. M. (2005). Integration of customer relationship management: status quo and implications for research and practice. Journal of strategic marketing, 13(3), 175-198.

Pearlson, K., & Saunders, C. (2001). Managing and Using Information Systems: A Strategic Approach. USA: John Wiley & Sons.

Raghupathi, W., & Raghupathi, V. (2014). Big data analytics in healthcare: promise and potential. Health Information Science and Systems, 2(1), 1.

Sage Software. (2015). Top Challenges in CRM Implementation.  

Sahafizadeh, E., & Nematbakhsh, M. A. (2015). A Survey on Security Issues in Big Data and NoSQL. Int’l J. Advances in Computer Science, 4(4), 2322-5157.

Salesforce. (2018). 7 CRM Best Practices to Get the Most out of your CRM. Retrieved from https://www.salesforce.com/crm/best-practices/. 

Schiff, J. L. (2018). 8 CRM implementation best practices.

Scott, J. A. (2015). Getting Started with Spark: MapR Technologies, Inc.

Umble, E. J., Haft, R. R., & Umble, M. M. (2003). Enterprise resource planning: Implementation procedures and critical success factors. European Journal of Operational Research, 146(2), 241-257.

Vaquero, L. M., Rodero-Merino, L., Caceres, J., & Lindner, M. (2008). A Break in the Clouds: Towards a Cloud Definition. Association for Computing Machinery: Computer Communication Review, 39(1), 50-55.

Wailgum, T. (2008). Five Best Practices for Implementing SaaS CRM. Retrieved from https://www.cio.com/article/2435928/customer-relationship-management/five-best-practices-for-implementing-saas-crm.html.

Wailgum, T. (2018). What is CRM? Software for Managing Customer Data. Retrieved from https://www.cio.com/article/2439505/customer-relationship-management/customer-relationship-management-crm-definition-and-solutions.html.

Wang, Y., Kung, L. A., & Byrd, T. A. (2018). Big Data Analytics: Understanding its Capabilities and Potential Benefits for Healthcare Organizations. Technological Forecasting and Social Change, 126, 3-13. doi:10.1016/j.techfore.2015.12.019

Woodie, A. (2016). Making ERP Better with Big Data. Retrieved from https://www.datanami.com/2016/07/08/making-erp-better-big-data/.

The Impact of Cloud Computing Technology on Information Security Governance Decisions

Dr. O. Aly
Computer Science

Information security plays a significant role in the context of information technology (IT) governance. The critical governance decisions for information security lie in the areas of information security strategy, policies, infrastructure, training, and investments in tools. Cloud computing, an emerging technology, provides a new business model for accessing computing infrastructure on a virtualized, scalable, and lower-cost basis.  The purpose of this discussion is to address the impact of cloud computing on decisions related to information security governance.

Cloud Computing Technology

“Cloud computing and big data are conjoined” (Hashem et al., 2015).  This statement raises the question of the reason for such a relationship.  Big Data has been characterized by what is often referred to as a multi-V model: variety, velocity, volume, veracity, and value (Assunção, Calheiros, Bianchi, Netto, & Buyya, 2015). While variety represents the data types, velocity reflects the rate at which the data is produced and processed (Assunção et al., 2015).  Volume defines the amount of data, and veracity reflects how much the data can be trusted given the reliability of its source. Value, on the other hand, represents the monetary worth which organizations can derive from adopting Big Data computing.  Along with these characteristics of Big Data, including its explosive growth rate, challenges and issues emerged (Jagadish et al., 2014; Meeker & Hong, 2014; Misra, Sharma, Gulia, & Bana, 2014; Nasser & Tariq, 2015; Zhou, Chawla, Jin, & Williams, 2014).  The growth rate is regarded as a significant challenge for IT researchers and practitioners to design appropriate systems that handle the data effectively and analyze it to extract relevant meaning for decision-making (Kaisler, Armour, Espinosa, & Money, 2013). Other challenges include data storage, data management, and data processing (Fernández et al., 2014; Kaisler et al., 2013); Big Data variety, Big Data integration and cleaning, Big Data reduction, Big Data query and indexing, and Big Data analysis and mining (Chen et al., 2013).

Traditional systems could not cope with all these challenges of Big Data, and cloud computing technology emerged to address them. Cloud computing is regarded as the solution and the answer to Big Data challenges and issues (Fernández et al., 2014).  Organizations and businesses are under pressure to quickly adopt and implement technologies such as cloud computing to address the storage and processing demands of Big Data (Hashem et al., 2015).  Besides, given the increasing demands that Big Data places on networks, storage, and servers, outsourcing the data to the cloud may seem a practical and useful approach when dealing with Big Data (Katal, Wazid, & Goudar, 2013).  During the last two decades, the demand for data storage and data security has been growing at a fast pace (Gupta, 2015).  Such demand led to the emergence of cloud computing technology (Gupta, 2015).  Issues such as the scalability of Big Data have also pointed toward cloud computing technology, which can aggregate multiple disparate workloads with varying performance goals into significant clusters in the cloud (Katal et al., 2013).

Various studies have provided different definitions of cloud computing.  However, the National Institute of Standards and Technology (NIST) proposed an official definition.  NIST defined cloud computing as “a model for enabling ubiquitous, convenient, on-demand network access to a shared pool of configurable computing resources (e.g., network, servers, storage, applications, and services) that can be rapidly provisioned and released with minimal management effort or service provider interaction” (Mell & Grance, 2011, p. 2).

Cloud computing technology offers various deployment models: public cloud, private cloud, hybrid cloud, and community cloud. The public cloud is the least secure cloud model (Puthal, Sahoo, Mishra, & Swain, 2015).  The private cloud has also been referred to by Armbrust et al. (2009) as internal datacenters, which are not available to the general public. The community cloud supports a specific community with particular concerns such as security requirements, policy and compliance considerations, and mission (Yang & Tate, 2012; Zissis & Lekkas, 2012). Cloud computing also offers three major service models: Infrastructure-as-a-Service (IaaS), Platform-as-a-Service (PaaS), and Software-as-a-Service (SaaS) (Mell & Grance, 2011).

Cloud computing offers various benefits, from technological benefits such as data and storage, APIs, metering, and tools, to economic benefits such as pay per use, cost reduction, and return on investment, to non-functional benefits such as elasticity, reliability, and availability (Chang, 2015).  Despite these benefits and the increasing trend in adoption, cloud computing is still not widely used. Security concerns related to virtualization, hardware, network, data, and service providers act as significant obstacles to adopting cloud computing in the IT industry (Balasubramanian & Mala, 2015; Kazim & Zhu, 2015).  Security and privacy concerns have been among the major obstacles preventing full adoption of the technology (Shahzad, 2014).  Purcell (2014) stated that “The advantages of cloud computing are tempered by two major concerns – security and loss of control.” The uncertainty about security has led executives to state that security is their number one concern for deploying cloud computing (Hashizume, Rosado, Fernández-medina, & Fernandez, 2013).

Cloud Computing Governance and Data Governance

The enforcement of regulatory laws such as the Health Insurance Portability and Accountability Act (HIPAA) and Sarbanes-Oxley becomes an issue especially when adopting cloud computing (Ali, Khan, & Vasilakos, 2015).  Cloud computing fosters security concerns that hamper its rapid adoption. Thus, cloud governance and data governance are highly recommended when adopting cloud computing.

Cloud governance is defined as the controls and processes that make sure policies are enforced (Saidah & Abdelbaki, 2014).  It is a framework applied to all related parties and business processes to ensure that the cloud securely supports the goals of the organization and complies with all required regulations and rules. The cloud governance model should be aligned with corporate governance and IT governance, and it has to comply with the strategy of the organization to accomplish the business goals.  Various studies have proposed cloud governance models.

Saidah and Abdelbaki (2014) proposed a cloud governance framework comprising three models: a policy model, an operational model, and a management model. The policy model involves data policy, service policy, business process management policy, and exit policy. The operational model includes authentication, authorization, audit, monitoring, adaptation, metadata repository, and asset management. The management model includes policy management, security management, and service management. Figure 1 illustrates the proposed cloud governance model.


Figure 1.  The Proposed Cloud Governance Model (Saidah & Abdelbaki, 2014).

Rebollo, Mellado, and Fernández-Medina (2013) proposed a security governance framework for the cloud computing environment (ISGcloud). The proposed framework is founded upon two main standards: it implements the core governance principles of the ISO/IEC 38500 governance standard, and it proposes a cloud service lifecycle based on the ISO/IEC 27036 outsourcing security draft.

When organizations decide to adopt cloud computing technology, careful consideration must be given to the deployment model as well as the service model in order to understand the security requirements and the governance strategies (Al-Ruithe, Benkhelifa, & Hameed, 2016).  Data governance for cloud computing is not merely nice to have; it is required by rules and regulations to protect the privacy of users and employees.

The loss of control over the data is the most significant issue when adopting cloud computing, because the data is stored on infrastructure belonging to the cloud provider. This loss of governance and control could have a potentially severe impact on the strategy of the organization and its capacity to meet its mission and goals (Al-Ruithe et al., 2016).  The loss of control and governance of the data can lead to the impossibility of complying with security requirements, a lack of confidentiality, integrity, and availability of data, and a deterioration of performance and quality of services, not to mention the introduction of compliance challenges. Thus, organizations must be aware of best practices for safeguarding, governing, and operating data when adopting cloud computing technology.  NIST offers many recommendations for adopting cloud computing technology (Al-Ruithe et al., 2016). Organizations should consider a data governance strategy before adopting cloud computing. This recommendation demonstrates the importance of data governance for organizations which intend to move their data and services to the cloud computing environment, as policies, rules, and the distribution of responsibilities between cloud actors will have to be set.  The development of policies and data governance will assist organizations in monitoring compliance with current regulations and rules.  The primary benefit of data governance when using a cloud environment is to ensure security measures, privacy protection, and quality of data.

The implementation of data governance for cloud computing changes based on the roles and responsibilities in the internal processes of the organization (Al-Ruithe et al., 2016).  Thus, organizations are expected to face many issues.  The lack of understanding of data governance is one of the major issues.  The lack of training and the lack of a communication plan are additional issues organizations will face. The lack of support is another obstacle, which includes lack of top management support, lack of compliance enforcement, and lack of cloud regulation. The lack of policies, processes, and defined roles in the organization is one of the main obstacles to implementing data governance in the cloud.  The lack of resources, including funding, technology, people, and skills, is considered another data governance obstacle.

Conclusion

This discussion addressed cloud computing technology and its relationship with Big Data and Big Data Analytics. Cloud computing technology emerged as a solution to the challenges that Big Data and Big Data Analytics face. However, cloud computing is confronted with security and privacy challenges, and executives have expressed security as the number one concern for cloud computing adoption.  The governance of cloud computing will provide a secure environment to protect data from loss or malicious attacks. Organizations are required to comply with various security and privacy regulations and rules, and they are under pressure to protect data, especially when using cloud computing technology.  Thus, they are required to implement data governance and a cloud computing governance framework to ensure such compliance.

References

Al-Ruithe, M., Benkhelifa, E., & Hameed, K. (2016). A Conceptual Framework for Designing Data Governance for Cloud Computing. Procedia Computer Science, 94, 160-167. doi:10.1016/j.procs.2016.08.025

Ali, M., Khan, S. U., & Vasilakos, A. V. (2015). Security in cloud computing: Opportunities and challenges. Information Sciences, 305, 357-383. doi:10.1016/j.ins.2015.01.025

Armbrust, M., Fox, A., Griffith, R., Joseph, A. D., Katz, R. H., Konwinski, A., . . . Stoica, I. (2009). Above The Clouds: A Berkeley View of Cloud Computing. Electrical Engineering and Computer Sciences University of California at Berkeley.

Assunção, M. D., Calheiros, R. N., Bianchi, S., Netto, M. A. S., & Buyya, R. (2015). Big Data Computing and Clouds: Trends and Future Directions. Journal of Parallel and Distributed Computing, 79, 3-15. doi:10.1016/j.jpdc.2014.08.003

Balasubramanian, V., & Mala, T. (2015). A Review On Various Data Security Issues In Cloud Computing Environment And Its Solutions. Journal of Engineering and Applied Sciences, 10(2).

Chang, V. (2015). A Proposed Framework for Cloud Computing Adoption. International Journal of Organizational and Collective Intelligence, 6(3).

Chen, J., Chen, Y., Du, X., Li, C., Lu, J., Zhao, S., & Zhou, X. (2013). Big Data Challenge: a Data Management Perspective. Frontiers of Computer Science, 7(2), 157-164. doi:10.1007/s11704-013-3903-7

Fernández, A., Del Río, S., López, V., Bawakid, A., del Jesus, M. J., Benítez, J. M., & Herrera, F. (2014). Big Data with Cloud Computing: An Insight on the Computing Environment, MapReduce, and Programming Frameworks. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 4(5), 380-409. doi:10.1002/widm.1134

Gupta, U. (2015). Survey on Security Issues in File Management in Cloud Computing Environment. Department of Computer Science and Information Systems, Birla Institute of Technology and Science, Pilani.

Hashem, I. A. T., Yaqoob, I., Anuar, N. B., Mokhtar, S., Gani, A., & Khan, S. U. (2015). The Rise of “Big Data” on Cloud Computing: Review and Open Research Issues. Information Systems, 47, 98-115. doi:10.1016/j.is.2014.07.006

Hashizume, K., Rosado, D. G., Fernández-medina, E., & Fernandez, E. B. (2013). An analysis of security issues for cloud computing. Journal of internet services and applications, 4(1), 1-13. doi:10.1186/1869-0238-4-5

Jagadish, H. V., Gehrke, J., Labrinidis, A., Papakonstantinou, Y., Patel, J. M., Ramakrishnan, R., & Shahabi, C. (2014). Big Data and Its Technical Challenges. Communications of the Association for Computing Machinery, 57(7), 86-94. doi:10.1145/2611567

Kaisler, S., Armour, F., Espinosa, J. A., & Money, W. (2013). Big Data: Issues and Challenges Moving Forward. Paper presented at the Hawaii International Conference on System Sciences

Katal, A., Wazid, M., & Goudar, R. H. (2013). Big Data: Issues, Challenges, Tools and Good Practices. Paper presented at the International Conference on Contemporary Computing.

Kazim, M., & Zhu, S. Y. (2015). A Survey on Top Security Threats in Cloud Computing. International Journal Advanced Computer Science and Application, 6(3), 109-113.

Meeker, W., & Hong, Y. (2014). Reliability Meets Big Data: Opportunities and Challenges. Quality Engineering, 26(1), 102-116. doi:10.1080/08982112.2014.846119

Mell, P., & Grance, T. (2011). The NIST Definition of Cloud Computing. National Institute of Standards and Technology (NIST), 800-145, 1-7.

Misra, A., Sharma, A., Gulia, P., & Bana, A. (2014). Big Data: Challenges and Opportunities. International Journal of Innovative Technology and Exploring Engineering, 4(2).

Nasser, T., & Tariq, R. S. (2015). Big Data Challenges. Journal of Computer Engineering & Information Technology, 9307, 1-10. doi:10.4172/2324

Purcell, B. M. (2014). Big Data Using Cloud Computing. Journal of Technology Research, 5, 1-9.

Puthal, D., Sahoo, B., Mishra, S., & Swain, S. (2015). Cloud Computing Features, Issues, and Challenges: a Big Picture. Paper presented at the 2015 International Conference on Computational Intelligence and Networks (CINE).

Rebollo, O., Mellado, D., & Fernández-Medina, E. (2013). Introducing a security governance framework for cloud computing. Paper presented at the Proceedings of the 10th International Workshop on Security in Information Systems (WOSIS), Angers, France.

Saidah, A. S., & Abdelbaki, N. (2014). A New Cloud Computing Governance Framework.

Shahzad, F. (2014). State-of-the-art Survey on Cloud Computing Security Challenges, Approaches and Solutions. Procedia Computer Science, 37, 357-362. doi:10.1016/j.procs.2014.08.053

Yang, H., & Tate, M. (2012). A Descriptive Literature Review and Classification of Cloud Computing Research. Communications of the Association for Information Systems, 31(2), 35-60.

Zhou, Z., Chawla, N., Jin, Y., & Williams, G. (2014). Big Data Opportunities and Challenges: Discussions from Data Analytics Perspectives. Institute of Electrical and Electronic Engineers: Computational Intelligence Magazine, 9(4), 62-74.

Zissis, D., & Lekkas, D. (2012). Is Cloud Computing Finally Beginning to Mature? International Journal of Cloud Computing and Services Science, 1(4), 172. doi:10.11591/closer.v1i4.1248

Installation and Configuration of Openstack and AWS

Dr. Aly, O.
Computer Science

Abstract

The purpose of this project was to articulate all the steps for the installation and configuration of OpenStack and Amazon Web Services.  The project begins with an overview of OpenStack and is divided into three main phases.  The first phase discusses and analyzes the differences between the networking techniques in AWS and OpenStack.  Phase 2 discusses the required configurations to deploy the OpenStack Controller, and also discusses and analyzes the expansion of OpenStack to include an additional node as the Compute Node.   Phase 3 discusses the issues encountered during the installation and configuration of OpenStack and AWS services. A virtual bridge for the provider network was configured so that all VM traffic reaches the Internet through the external bridge.   The floating IP must also be disabled to avoid dropping packets when they reach AWS.  In this project, OpenStack with the Controller Node and an additional Compute Node is deployed and accessed successfully using the Horizon dashboard.  Elastic Compute Cloud (EC2) is also installed and configured successfully using the default VPC, the default Security Group, and Access Control List.

Keywords: OpenStack, Amazon Web Services (AWS).

Introduction

            OpenStack is a result of initiatives from Rackspace and NASA in 2010 because NASA could not store its data in the Public Cloud for security reasons.  OpenStack is an open source project which can be utilized by leading vendors to bring AWS-like ability and agility to the private cloud.  OpenStack has been growing since its inception in 2010 to include 500 member companies as part of the OpenStack Foundation with platinum and gold members from the largest IT vendors globally.  Examples of these platinum members include RedHat, Suse, IBM, Hewlett Packard Enterprise, Ubuntu, AT&T and Rackspace (Armstrong, 2016).

OpenStack primarily provides an Infrastructure-as-a-Service (IaaS) function within the private cloud, making centralized storage, commodity compute, and networking features available to end users to self-service their needs through the Horizon dashboard or a set of common APIs.  Many organizations are deploying OpenStack in-house to develop their own data centers.  The implementation of OpenStack is less likely to fail when utilizing professional service support from known vendors, and it can create alternative solutions to Microsoft Azure and AWS.  Examples of these professional service vendors include Red Hat, SUSE, HP, Canonical, and Mirantis, and they provide different methods of installing the platform (Armstrong, 2016).

The release cycle of OpenStack is six months, during which an upstream release is created.  The OpenStack Foundation creates the upstream release and governs it.  Examples of public cloud deployments of OpenStack include AT&T, Rackspace, and GoDaddy; thus, OpenStack is not exclusively used for private cloud.  However, OpenStack has been increasingly popular as a private cloud alternative to the AWS public cloud, and it is now widely used for Network Function Virtualization (NFV) (Armstrong, 2016).

OpenStack and AWS utilize different approaches to Networking.  This section begins with AWS Networking, followed by OpenStack Networking.

Phase 1:  OpenStack Networking vs. AWS Networking

1.1       AWS Networking

Amazon's Virtual Private Cloud (VPC) is an isolated virtual network that can be combined with public and private resources to form a hybrid cloud.  The VPC is the default setting for new AWS users.  The VPC can also be connected to the user's network or the private data center of the organization.  The underlying concept of connecting the VPC to the private data center of the organization is the use of the organization's gateway and a virtual private gateway (VPG).  The VPG terminates two redundant VPN tunnels, which are instantiated from the private network of the user or the organization.  The gateway of the organization exposes a set of external static addresses from the site of the organization, which use Network Address Translation-Traversal (NAT-T) to hide the addresses.  The organization can use one gateway device to access multiple VPCs.   The VPC provides an isolated view of all provisioned instances.  Identity and Access Management (IAM) of AWS is used to set up user accounts to access the VPC.   Figure 1 illustrates an example of an AWS VPC with virtual machines or instances mapped to one or more security groups and connected to different subnets attached to the VPC router (Armstrong, 2016; AWS, 2017).

Figure 1.  VPC of AWS showing multiple instances using Security Group.

VPC simplifies networking in software, allowing users and organizations to perform a set of networking operations such as subnet mapping, Domain Name System (DNS) usage, public and private IP address assignment, and security group and access control list application.   When organizations create a virtual machine or instance, a default VPC is assigned to it automatically.  Every VPC comes with a default router, which can have additional custom routes and routing priorities to forward traffic to specific subnets based on the requirements of the organizations and users.   Figure 2 illustrates a VPC using private IPs, public IPs, and the main route table, adapted from (Armstrong, 2016; AWS, 2017).

Figure 2.  AWS VPC Configuration Example (AWS, 2017).
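To make these constructs concrete, the following minimal sketch uses the boto3 library to create a VPC and a subnet and then list the VPC's route tables.  The region and CIDR ranges are illustrative assumptions, not values taken from this project.

import boto3

# Minimal sketch (assumed region and CIDR blocks): create a VPC and a subnet,
# then list the route tables AWS created for the VPC by default.
ec2 = boto3.client("ec2", region_name="us-east-1")

vpc = ec2.create_vpc(CidrBlock="10.0.0.0/16")
vpc_id = vpc["Vpc"]["VpcId"]

subnet = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.1.0/24")

# Every VPC comes with a main route table; custom routes can be added to it.
route_tables = ec2.describe_route_tables(
    Filters=[{"Name": "vpc-id", "Values": [vpc_id]}])
for table in route_tables["RouteTables"]:
    print(table["RouteTableId"],
          [route.get("DestinationCidrBlock") for route in table["Routes"]])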

With respect to IP addressing in AWS, a mandatory private IP address is assigned automatically to every virtual machine or instance, along with a public IP address and DNS entry unless the instance is a dedicated instance.  The private IP is used to route traffic among instances when a virtual machine needs to communicate with another virtual machine close to it on the same subnet.  Public IPs, on the other hand, are accessible through the Internet.   If a persistent public IP address is needed for a virtual machine, AWS provides the Elastic IP address feature, which is limited to five addresses per VPC account.  When using Elastic IP addresses, the IP address can be remapped quickly to another instance in case of an instance failure.  When using AWS, it can take up to 24 hours for the DNS Time To Live (TTL) of a public IP address to propagate.  Moreover, AWS supports a Maximum Transmission Unit (MTU) of 1,500 bytes for traffic passed to an instance, which organizations must consider for application performance (Armstrong, 2016; AWS, 2017).

AWS uses Security Groups (SGs) and Access Control Lists.  An SG in AWS groups a collection of access control rules with implicit denies.  An SG can be associated with one or more network interfaces of instances and acts as the firewall for those instances.  A default SG is applied automatically if no other security group is specified for the instantiated instance; the default SG allows all outbound traffic and allows inbound traffic only from other instances within the same VPC, and it cannot be deleted.  A custom SG allows no inbound traffic but allows all outbound traffic.  The user can add Access Control List (ACL) rules associated with the SG to govern inbound traffic using the AWS console (Armstrong, 2016; AWS, 2017).
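As an illustration, the following minimal boto3 sketch creates a custom security group in a VPC and adds an inbound SSH rule.  The VPC ID and source CIDR are hypothetical placeholders.

import boto3

# Minimal sketch (placeholder VPC ID and source CIDR): a custom security
# group starts with no inbound rules, so an explicit SSH ingress rule is added.
ec2 = boto3.client("ec2", region_name="us-east-1")

sg = ec2.create_security_group(
    GroupName="ssh-only",
    Description="Allow inbound SSH from the corporate range",
    VpcId="vpc-0123456789abcdef0")  # placeholder VPC ID

ec2.authorize_security_group_ingress(
    GroupId=sg["GroupId"],
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 22,
        "ToPort": 22,
        "IpRanges": [{"CidrIp": "203.0.113.0/24"}],  # assumed corporate range
    }])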

An AWS VPC has access to different regions and availability zones of shared compute, which dictate the data center in which the instance or virtual machine will be deployed.  An availability zone (AZ) is an isolated location residing within a region, which is a geographic area isolated by design; thus, an AZ is a subset of a region.  Organizations and users can place resources in different locations for redundancy and recovery considerations.  AWS supports the use of more than one AZ when deploying production workloads.   Moreover, organizations and users can replicate instances and data across regions (Armstrong, 2016; AWS, 2017).

Elastic Load Balancing (ELB) is also offered by AWS and can be configured within a VPC.   The ELB can be external or internal.  When the ELB is external, it allows the creation of an internet-facing entry point into the VPC using an associated DNS entry and balances load among the instances in the VPC.  An SG is assigned to the ELB to control access to the ports that need to be used (Armstrong, 2016; AWS, 2017).

1.2       OpenStack Networking

OpenStack is deployed in a data center on multiple controllers.  These controllers contain all OpenStack services and can be installed on virtual machines, bare-metal physical servers, or containers.  When these controllers are deployed in a production environment, they host all OpenStack services on a highly available and redundant platform.   Different OpenStack vendors offer different installers to install OpenStack; examples include Red Hat Director, Mirantis Fuel, HP's HPE installer, and Canonical's Juju. All of these installers install controllers and are also used to scale out compute nodes in the OpenStack cloud (Armstrong, 2016; OpenStack, 2018b).

With respect to OpenStack services, there are eleven core services installed on the OpenStack controller.  These core services include Keystone, Heat, Glance, Cinder, Nova, Horizon, RabbitMQ, Galera, Swift, Ironic, and Neutron.  Figure 3 summarizes each core service of OpenStack (OpenStack, 2018a).  The Neutron networking service is similar in its constructs to AWS networking (Armstrong, 2016; OpenStack, 2018b).

Figure 3.  Summary of OpenStack Core Services (OpenStack, 2018a)

In OpenStack, a Project (also referred to as a Tenant) provides an isolated view of everything a team has provisioned in the OpenStack cloud.  Using the Keystone identity service, different users can be set up for a Project (Tenant).  These accounts can be integrated with an LDAP directory such as Active Directory to support a customizable permission model (Armstrong, 2016; OpenStack, 2018b).

The Neutron service of OpenStack performs all networking-related tasks and functions.  These tasks and functions include seven major steps. The first step is the creation of instances or virtual machines mapped to networks.  The second step is the assignment of IP addresses using the built-in DHCP service.  The third step is the application of DNS entries to instances from named servers.  The fourth step is the assignment of private and floating IP addresses.  The fifth step is the creation or association of the network subnet, followed by creating the routers.  The last step is the application of Security Groups (Armstrong, 2016; OpenStack, 2018b).

The compute nodes of OpenStack are deployed using a hypervisor that uses Open vSwitch.  Most vendor distributions of OpenStack provide the KVM hypervisor by default, which is deployed and configured on each compute node by the OpenStack installer.  The compute nodes in OpenStack are connected to the access layer of the STP three-tier model; in modern networks, they are connected to leaf switches, with VLANs connected to each compute node in the OpenStack cloud.  Tenant networks are used to provide isolation among tenants and use VXLAN and GRE tunneling to connect the layer-2 network (Armstrong, 2016; OpenStack, 2018b).

The configuration and setup of simple networking using Neutron in a Project (Tenant) requires two different networks: an internal network and an external network.  The internal network is used for traffic among instances in the Project, with the subnet name and range specified in the subnet.  The external network is used to make the internal network accessible from outside OpenStack.   A router is also used in OpenStack to route packets between the networks with which it is associated, and the external network must be set as the router's gateway.  The last step in the network configuration connects the router to the internal and external networks.   Instances are provisioned in OpenStack onto the internal private network by selecting the private network NIC during deployment of the instance.  OpenStack assigns pools of public IPs, known as floating IP addresses, from an external network for instances that need to be externally routable outside of OpenStack (Armstrong, 2016; OpenStack, 2018b).
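The same setup can be scripted.  The following minimal sketch uses the openstacksdk Python library, assuming a clouds.yaml entry named mycloud and a pre-existing external network named public; both names are assumptions, not part of this project's configuration.

import openstack

# Minimal sketch (assumed cloud name and external network name): create an
# internal network and subnet, then attach a router whose gateway is the
# pre-existing external network.
conn = openstack.connect(cloud="mycloud")

internal = conn.network.create_network(name="private")
subnet = conn.network.create_subnet(
    name="private-subnet",
    network_id=internal.id,
    ip_version=4,
    cidr="192.168.10.0/24")

external = conn.network.find_network("public")
router = conn.network.create_router(
    name="project-router",
    external_gateway_info={"network_id": external.id})
conn.network.add_interface_to_router(router, subnet_id=subnet.id)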

OpenStack uses SGs, like AWS, to set up firewall rules between instances.  However, OpenStack, unlike AWS, supports both ingress and egress ACL rules, whereas AWS allows all outbound communications by default.  SSH access must be configured as an ACL rule against the parent SG in OpenStack, which is pushed down through Open vSwitch into kernel space on each hypervisor.  When the internal and external networks are set up and configured for the Project (Tenant), instances are ready to be launched on the private network, and users can access the instances from the Horizon dashboard (Armstrong, 2016; OpenStack, 2018b).
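A minimal sketch of such an SSH rule with the openstacksdk follows; note the explicit direction field, which reflects the ingress/egress distinction mentioned above.  The cloud name and the 0.0.0.0/0 source range are only illustrative assumptions.

import openstack

# Minimal sketch: a security group with an explicit ingress rule for SSH.
# Neutron rules carry a direction ("ingress" or "egress"), unlike AWS SGs.
conn = openstack.connect(cloud="mycloud")  # assumed cloud name

sg = conn.network.create_security_group(name="ssh-access")
conn.network.create_security_group_rule(
    security_group_id=sg.id,
    direction="ingress",
    ethertype="IPv4",
    protocol="tcp",
    port_range_min=22,
    port_range_max=22,
    remote_ip_prefix="0.0.0.0/0")  # example source range; restrict in practice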

With respect to regions and availability zones, OpenStack, like AWS, uses regions and AZs.  The compute nodes (hypervisors) in OpenStack can be assigned to different AZs, which provide a virtual separation of compute resources.  An AZ in OpenStack can be segmented into host aggregates. However, a compute node can be assigned to only one AZ in OpenStack, while it can be part of multiple host aggregates in the same AZ (Armstrong, 2016; OpenStack, 2018b).

OpenStack offers Load-Balancer-as-a-Service (LBaaS), which allows incoming requests to be distributed evenly among designated instances using a Virtual IP (VIP).  Examples of popular LBaaS plugins in OpenStack include Citrix NetScaler, F5, HAProxy, and Avi Networks.  The underlying concept of LBaaS in OpenStack is to allow organizations and users to use LBaaS as a broker to load-balancing solutions, using OpenStack APIs or the Horizon dashboard to configure the load balancer (Armstrong, 2016; OpenStack, 2018b).

Phase 2:  AWS and OpenStack Setup and Configuration

This project deployed OpenStack on AWS, limited initially to the configuration of the controller node.  In the same project, the OpenStack cloud is expanded to add a compute node.   The topology for this project is illustrated in Figure 4.   Port 9000 will be configured to be accessible from the browser on the client.  The Compute Node VM will use a different IP address than the OpenStack Controller Node. A private network will be configured using the Vagrant software.   A NAT interface will be configured and mapped to the Compute Node and the OpenStack Controller Node, as illustrated in Figure 4.

Figure 4.  This Project’s Topology.

The Controller Node is configured with one processor, 4 GB of memory, and 5 GB of storage.  The Compute Node is configured with one processor, 2 GB of memory, and 10 GB of storage.  The installation must be performed on a 64-bit distribution on each node.  VirtualBox and the Vagrant software are used in this project.  Another tool, Sublime Text, is installed to edit the Vagrant file and avoid control characters at the end of each line, which can cause problems.  The project uses the Pike release.

2.1 Amazon Machine Images (AMI) Elastic Cloud Compute (EC2) AMI Configuration

The project requires an AWS account in order to select the image to be used for the OpenStack deployment.  Multi-Factor Authentication is implemented to access the account.  An Amazon Machine Image (AMI) for Elastic Compute Cloud (EC2) is selected from the pool of AMIs for this project.  The Free Tier EC2 instance is configured with the default Security Group (SG) and Access Control List (ACL) rules discussed earlier.   An EC2 AMI is a template that contains the software configuration, such as the operating system, application server, and applications, required to launch and instantiate the instance.  The EC2 AMI is configured to use the default VPC.
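For reference, an equivalent launch can be scripted with boto3.  This is only a minimal sketch with a placeholder AMI ID and an assumed region, not the exact instance used in this project; when no subnet or security group is supplied, AWS applies the default VPC's default subnet and default SG, matching the configuration described above.

import boto3

# Minimal sketch (placeholder AMI ID, assumed region): launch one Free Tier
# instance into the default VPC with the default security group and ACLs.
ec2 = boto3.client("ec2", region_name="us-east-1")

response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder, not the project's AMI
    InstanceType="t2.micro",
    MinCount=1,
    MaxCount=1)

instance_id = response["Instances"][0]["InstanceId"]
print("Launched instance:", instance_id)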

2.2 OpenStack Controller Node Configuration

The Controller Node is configured first to use the IP Address identified in the topology.  This configuration is implemented using Vagrant software and Vagrant file. 

  • Connect to the controller using the Vagrant software.  To start the Controller from Vagrant, execute:
    • $vagrant up controller
  • Verify the Controller is running successfully.
    • $vagrant status
  • Verify the NAT address using eth0. 
    • $ifconfig -a
  • Verify the Private IP Address using eth1.  The IP address shows the same IP address configured in the configuration file.

Access the OpenStack Controller Node from the browser using port 9000.

  • Verify the Hypervisors from Horizon interface. 

2.3 OpenStack Compute Node Configuration

The OpenStack Cloud is expanded by adding a Compute Node.  The configuration of the compute node is performed using the Vagrant file.

  • Connect to the compute node using the Vagrant command.  The Compute Node uses node1 as the hostname.  To start the Compute Node from Vagrant, execute the following command:
    • $vagrant up node1. 
  • Verify the Compute Node is running successfully.
    • $vagrant status
  • Access node1 using SSH.
  • Check OpenStack Services:
    • $sudo systemctl list-units devstack@*
  • Verify the NAT address using eth0. 
    • $ifconfig -a
  • Verify the Private IP Address using eth1.  The IP address shows the same IP address configured in the configuration file.

Access the OpenStack Controller Node from the browser using port 9000.  Verify the hypervisors from the Horizon interface.

Phase 3:  Issues Deploying OpenStack on AWS

Several issues were encountered during the deployment of OpenStack on AWS.   The issue that impacted the EC2 AMI involved the MAC address, which must be registered in the AWS network environment.  Moreover, the MAC address and the IP address must be mapped together, because packets will not be allowed to flow if they do not match.

3.1 Neutron Networking

During the configuration of OpenStack Neutron networking, a virtual bridge for the provider network is configured so that all VM traffic reaches the Internet through the external bridge, which is backed by the actual physical NIC, eth1.  Thus, a NIC with a special configuration is set up as the external interface, as shown in the topology for this project (Figure 4).

 3.2 Disable Floating IP

The floating IP must be disabled because it would send packets through the router's gateway with a floating IP as the source address, which would result in the packets being dropped once they reach AWS, because they would arrive at the switch with an unregistered IP and MAC address.  In this project, NAT is configured to access the public address externally, as shown in the topology in Figure 4.

Conclusion

The purpose of this project was to articulate all the steps for the installation and configuration of OpenStack and Amazon Web Services.  The project began with an overview of OpenStack and was divided into three main phases.  The first phase discussed and analyzed the differences between the networking techniques in AWS and OpenStack.  Phase 2 discussed the required configurations to deploy the OpenStack Controller, and also discussed and analyzed the expansion of OpenStack to include an additional node as the Compute Node.   Phase 3 discussed the issues encountered during the installation and configuration of OpenStack and AWS services. A virtual bridge for the provider network was configured so that all VM traffic reaches the Internet through the external bridge.   The floating IP also had to be disabled to avoid dropping packets when they reach AWS.  In this project, OpenStack with the Controller Node and an additional Compute Node was deployed and accessed successfully using the Horizon dashboard.  Elastic Compute Cloud (EC2) was also installed and configured successfully using the default VPC, the default Security Group, and Access Control List.

References

Armstrong, S. (2016). DevOps for Networking: Packt Publishing Ltd.

AWS. (2017). Virtual Private Cloud:  User Guide. Retrieved from: http://docs.aws.amazon.com/AmazonVPC/latest/UserGuide/vpc-ug.pdf.

OpenStack. (2018a). Introduction to OpenStack. Retrieved from https://docs.openstack.org/security-guide/introduction/introduction-to-openstack.html.

OpenStack. (2018b). OpenStack Overview. Retrieved from https://docs.openstack.org/install-guide/overview.html.

Current State of Data Storage for Big Data

Dr. Aly, O.
Computer Science

Introduction

The purpose of this discussion is to discuss and analyze the current state of data storage for Big Data, as well as the impact of Big Data storage on organizational processes.

Big Data and Big Data Analytics Brief Overview

The term Big Data refers to the explosive growth in the volume of data that is difficult to store, process, and analyze.  Volume, however, is only one feature; Big Data is characterized by the three major Vs of volume, variety, and velocity.  The variety of the data reflects the different types of data collected from sensors, smartphones, or social networks; the data collected therefore includes unstructured and semi-structured types besides the structured type.  The velocity characteristic of Big Data reflects the speed of data transfer, where the content of the data is continuously changing.  These three major features characterize the nature of Big Data.  Big Data is classified further into Data Sources, Content Format, Data Stores, Data Staging, and Data Processing.  Figure 1 summarizes the Big Data classification, adapted from (Hashem et al., 2015).

Figure 1.  Big Data Classification.  Adapted from (Hashem et al., 2015).

Big Data without Analytics has no value.  Big Data Analytics (BDA) is the process of examining large datasets containing a variety of data types such as unstructured, semi-structured and structured. The purpose of the BDA is to uncover hidden patterns, market trends, unknown correlations, customer preferences and other useful business information that can help the organization (Arora & Bahuguna, 2016).  BDA has been used in various industries such as healthcare.   

Big Data Storage

The explosive growth of data has challenged the capabilities of existing storage technologies to store and manage data.  Organizations have traditionally stored data in structured relational databases.  However, Big Data and BDA require distributed storage technology based on Cloud Computing instead of local storage attached to a computer or electronic device.  Cloud Computing technologies provide a powerful framework which performs complex large-scale computing tasks and spans a range of IT functions from storage and computation to database and application services.  Organizations and users adopt Cloud Computing technologies because of the need and requirements to store, process, and analyze large amounts of data (Hashem et al., 2015).

Various storage technologies have emerged to meet the requirements of dealing with large volumes of data.  These storage technologies include Direct Attached Storage (DAS), Network Attached Storage (NAS), and Storage Area Network (SAN).  When using DAS, various hard disk drives are directly connected to the servers, and each hard disk drive receives a certain amount of I/O resources managed by the application.  The DAS technology is a good fit for servers that are interconnected on a small scale.  The NAS technology provides a storage device which supports a network through a switch or hub via TCP/IP protocols.  When using NAS, data is transferred as files.  The I/O burden in NAS is lighter than in DAS because the NAS server can indirectly access a storage device through the network.  NAS technology suits scalable and bandwidth-intensive networks, including high-speed networks with optical-fiber connections.   The SAN system of data storage is independent of storage on the local area network.  Data management and sharing are maximized by using multipath data switching conducted among internal nodes.  An organizational data storage system based on DAS, NAS, or SAN can be divided into three parts: the disk array, the connection and network sub-systems, and the storage management software.  The disk array provides the storage system.  The connection and network sub-systems provide the connection to one or more disk arrays and servers.  The storage management software handles data sharing, storage management, and disaster recovery tasks for multiple servers (Hashem et al., 2015).

When dealing with Big Data and BDA, the storage system is not physically separated from the processing system.  There are various storage types such as hard drives, solid-state memory, object storage, optical storage, and cloud storage.  Each type has advantages as well as limitations.  Thus, organizations must examine the goals and objectives of the data storage prior to selecting any of these storage media.  Table 1 shows a comparison of storage media, adapted from (Hashem et al., 2015).

Table 1.  Comparison of Storage Media.  Adapted from (Hashem et al., 2015).

The Hadoop Distributed File System (HDFS) is a primary component of the Hadoop technology, which emerged to deal with Big Data and BDA.  The other major component of Hadoop is MapReduce.  The Hadoop framework is described as the de facto standard for Big Data storage and processing (Jinquan, Jie, Shengsheng, Yan, & Yuanhao, 2012).  HDFS is a distributed file system designed to run on top of the local file systems of the cluster nodes.  It stores extremely large files for streaming access.  HDFS is highly fault tolerant and can scale from a single server to thousands of nodes, where each node offers local computation and storage.
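To make the file-system abstraction concrete, the following minimal Python sketch writes and reads a file through WebHDFS using the third-party hdfs package; the NameNode endpoint, user, and paths are assumptions for illustration only.

    # Write and read a file on HDFS through WebHDFS (hypothetical endpoint and paths).
    from hdfs import InsecureClient

    client = InsecureClient("http://namenode:9870", user="hadoop")

    # Upload a local log file into the distributed file system.
    client.upload("/data/logs/events.log", "events.log", overwrite=True)

    # Read it back; HDFS transparently replicates the underlying blocks across nodes.
    with client.read("/data/logs/events.log") as reader:
        print(reader.read()[:200])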

Cloud Computing technology can meet the requirements of Big Data and BDA by offering an effective framework and platform for computation as well as storage.  Thus, organizations which intend to take advantage of Big Data and BDA utilize Cloud Computing technology.  However, the use of Cloud Computing does not come without a price.  Security and privacy have been major concerns for Cloud Computing users and organizations.  Although Cloud Computing offers several benefits to organizations, from scalability and fault tolerance to data storage, it is curbed by security and privacy concerns.  Organizations must take the appropriate security measures for data in storage, in transit, and in processing, such as SSL, encryption, access control, multi-factor authentication, and so forth.
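As one illustration of such measures, data can be encrypted on the client side before it ever reaches cloud storage; the following is a minimal sketch using the Python cryptography package, with the record contents invented for illustration.

    # Client-side symmetric encryption before data is sent to cloud storage.
    from cryptography.fernet import Fernet

    key = Fernet.generate_key()        # the key itself must be stored securely (e.g., a key vault)
    cipher = Fernet(key)

    record = b"patient-id=1234;diagnosis=..."
    token = cipher.encrypt(record)     # ciphertext that can safely be stored in the cloud

    assert cipher.decrypt(token) == record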

In summary, Big Data comes with a big storage requirement.  Organizations have been facing various challenges when dealing with Big Data, such as data storage and data processing.  The data storage issue is partially solved by Cloud Computing technology.  However, until the security and privacy issues are resolved in the Cloud Computing platform, organizations must apply robust security measures to mitigate and alleviate the security risks.

References

Arora, M., & Bahuguna, H. (2016). Big Data Security–The Big Challenge.

Hashem, I. A. T., Yaqoob, I., Anuar, N. B., Mokhtar, S., Gani, A., & Khan, S. U. (2015). The rise of “big data” on cloud computing: Review and open research issues. Information Systems, 47, 98-115.

Jinquan, D., Jie, H., Shengsheng, H., Yan, L., & Yuanhao, S. (2012). The Hadoop Stack: New Paradigm for Big Data Storage and Processing. Intel Technology Journal, 16(4), 92-110.

Building Blocks of a System for Healthcare Big Data Analytics

Dr. Aly, O.
Computer Science

Introduction

The purpose of this discussion is to create the building blocks of a system for healthcare Big Data Analytics and compare the building block design to a DNA networked cluster currently used by an organization in the current market.

The discussion begins with the Cloud Computing Building Blocks, followed by Big Data Analytics Building Blocks, and DNA Sequencing. The discussion also addresses the building blocks for the health analytics and the building blocks for DNA Sequencing System, and the comparison between both systems.

Cloud Computing Building Blocks

The Cloud Computing model contains two elements: the front end and the back end.  Both elements are connected to the network.  The user interacts with the system using the front end, while the cloud itself is the back end.  The front end is the client which the user uses to access the cloud through a device such as a smartphone, tablet, or laptop.  The back end, represented by the cloud itself, provides the applications, computers, servers, and data storage that create the services (IBM, 2012).

As indicated in (Macias & Thomas, 2011), three building blocks are required to enable Cloud Computing. The first block is the “Infrastructure,” where the organization can optimize data center consolidation, enhance network performance, connect anyone, anywhere seamlessly, and implement pre-configured solutions.  The second block is the “Applications,” where the organization can identify applications for rapid deployment, and utilize automation and orchestration features.  The third block is the “Services,” where the organization can determine the right implementation model, and create a phased cloud migration plan.

In (Mousannif, Khalil, & Kotsis, 2013-14), the building blocks for the Cloud Computing involve the physical layer, the virtualization layer, and the service layer.  Virtualization is a basic building block in Cloud Computing.  Virtualization is the technology which hides the physical characteristics of the computing platform from the front end users.  Virtualization provides an abstract and emulated computing platform.  The clusters and grids are features and characteristics in Cloud Computing for high-performance computing applications such as simulations. Other building blocks of the Cloud Computing include Service-Oriented Architectures (SOA) and Web Services (Mousannif et al., 2013-14). 

Big Data Building Block

As indicated in (Verhaeghe, n.d.), there are four major building blocks for Big Data Analytics.  The first building block is Big Data Management, which enables the organization to capture, store, and protect the data.  The second building block is Big Data Analytics, to extract value from the data.  Big Data Integration is the third building block, ensuring the application of governance over the data.  The last building block is Big Data Applications, through which the organization applies the first three building blocks using Big Data technologies.

DNA Sequencing

DNA stands for Deoxyribonucleic Acid, which represents the smallest building block of life (Matthews, 2016).  As indicated in (Salzberg, 1999), advances in biotechnology have produced enormous volumes of DNA-related information.  However, the rate of data generation is outpacing the ability of scientists to analyze the data.  DNA Sequencing is a technique used to determine the order of the four chemical building blocks, called “bases,” which make up the DNA molecule (genome.gov, 2015).  The sequence reveals the kind of genetic information carried in a particular DNA segment.  DNA sequencing can provide valuable information about the role of inheritance in susceptibility to disease and in response to environmental influences.  Moreover, DNA sequencing enables rapid and cost-effective diagnosis and treatment.  Markov chains and hidden Markov models are probabilistic techniques which can be used to analyze the results of DNA sequencing (Han, Pei, & Kamber, 2011).  An example of a DNA Sequencing application is discussed and analyzed in (Leung et al., 2011), where the researchers applied data mining to DNA sequence datasets for the Hepatitis B virus.
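As a small illustration of the Markov-chain idea, first-order transition probabilities between the four bases can be estimated directly from an observed sequence; the following Python sketch uses an invented toy sequence rather than real genomic data.

    # Estimate first-order Markov transition probabilities from a DNA sequence.
    from collections import Counter

    sequence = "ATGCGATACGCTTAGGCTAACGT"        # toy example, not real data
    pairs = Counter(zip(sequence, sequence[1:]))
    totals = Counter(sequence[:-1])

    transition = {(a, b): count / totals[a] for (a, b), count in pairs.items()}

    # Probability that 'T' follows 'A' in the observed sequence, for instance:
    print(transition.get(("A", "T"), 0.0))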

DNA Sequencing used to be performed on non-networked computers, using a limited subset of data due to the limited computer processing speed (Matthews, 2016).  However, DNA Sequencing has been benefiting from various advanced technologies and techniques.  Predictive Analytics is an example of these techniques; applied to DNA Sequencing, it results in Predictive Genomics.  Cloud Computing plays a significant role in the success of Predictive Genomics for two major reasons.  The first reason is the volume of the genomic data, while the second is the low cost (Matthews, 2016).  Cloud Computing is becoming a valuable tool for various domains including DNA Sequencing.   As cited in (Blaisdell, 2017), a study by Transparency Market Research showed that the healthcare Cloud Computing market will continue to evolve, reaching up to $6.8 billion by 2018.

Building Block for Healthcare System

Healthcare data requires protection due to security and privacy concerns.  Thus, a Private Cloud will be used in this use case.  To build a Private Cloud, the virtualization layer, the physical layer, and the service layer are required.  The virtualization layer consists of a hypervisor that allows multiple operating systems to share a single hardware system.  The hypervisor is a program which controls the host processors and resources by allocating the resources to each operating system.  There are two types of hypervisors: native (also called bare-metal, or Type 1) and hosted (also called Type 2).  Type 1 runs directly on the physical hardware, while Type 2 runs on a host operating system which runs on the physical hardware.  Examples of native hypervisors include VMware’s ESXi and Microsoft’s Hyper-V; examples of hosted hypervisors include Oracle VirtualBox and VMware Workstation.  The physical layer can consist of two computer pools, one for PCs and the other for servers (Mousannif et al., 2013-14).

In (Archenaa & Anita, 2015), the researchers illustrated a secure Healthcare Analytics System.  The electronic health record is a heterogeneous dataset which is given as input to HDFS through Flume and Sqoop.  The analysis of the data is performed using MapReduce and Hive, implementing machine learning algorithms to analyze similar patterns in the data and to predict the risk to a patient’s health condition at an early stage.  The HBase database is used for storing the multi-structured data.  STORM is used to perform live streaming and to detect emergency conditions, such as a patient temperature falling outside the expected range.  Lambda functions are also used in this healthcare system.  The final component of a building block in the Healthcare system involves the reports generated by top-layer tools such as “Hunk.”  Figure 1 illustrates the Healthcare Analytics System, adapted from (Archenaa & Anita, 2015).

Figure 1.  Healthcare Analytics System. Adapted from (Archenaa & Anita, 2015)
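The live-streaming check described above can be pictured with a simplified Python sketch; the threshold and the reading format are assumptions for illustration and do not correspond to the STORM topology in the cited system.

    # Simplified stand-in for the streaming alert described above:
    # flag patient temperature readings that fall outside an expected range.
    EXPECTED_RANGE = (36.0, 38.5)        # degrees Celsius; illustrative threshold only

    def monitor(readings):
        """Yield an alert for every reading outside the expected range."""
        low, high = EXPECTED_RANGE
        for patient_id, temperature in readings:
            if not low <= temperature <= high:
                yield f"ALERT patient={patient_id} temperature={temperature}"

    stream = [("p-001", 37.1), ("p-002", 39.4), ("p-003", 35.2)]
    for alert in monitor(stream):
        print(alert)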

Building Block for DNA and Next Generation Sequencing System

Besides DNA Sequencing, there is next-generation sequencing (NGS), whose use has been increasing exponentially since 2007 (Bhuvaneshwar et al., 2015).  In (Bhuvaneshwar et al., 2015), the Globus Genomics System is proposed as an enhanced Galaxy workflow system made available as a service, offering users the capability to process and transfer data easily, reliably, and quickly.  This system addresses the end-to-end NGS analysis requirements and is implemented using the Amazon Cloud Computing infrastructure.  Figure 2 illustrates the framework for the Globus Genomics System, taking into account the security measures for protecting the data.  Examples of healthcare organizations which are using genomic sequencing include Kaiser Permanente in Northern California and Geisinger Health System in Pennsylvania (Khoury & Feero, 2017).

Figure 2. Globus Genomics System for Next Generation Sequencing (NGS). Adapted from (Bhuvaneshwar et al., 2015).

In summary, Cloud Computing has reshaped the healthcare industry in many aspects.  Healthcare Cloud Computing and Analytics provide many benefits, from easy access to electronic patient records to DNA Sequencing and NGS.  The building blocks of Cloud Computing must be implemented with security and privacy considerations in mind to protect patients’ data from unauthorized users.  The building blocks for a Healthcare Analytics system involve advanced technologies such as Hadoop, MapReduce, STORM, and Flume, as illustrated in Figure 1.  The building blocks for a DNA Sequencing and NGS system involve the Dynamic Worker Pool, HTCondor, a Shared File System, the Elastic Provisioner, Globus Transfer and Nexus, and Galaxy, as illustrated in Figure 2.  Each system has the required building blocks to perform its analytics tasks.

References

Archenaa, J., & Anita, E. M. (2015). A survey of big data analytics in healthcare and government. Procedia Computer Science, 50, 408-413.

Bhuvaneshwar, K., Sulakhe, D., Gauba, R., Rodriguez, A., Madduri, R., Dave, U., . . . Madhavan, S. (2015). A case study for cloud-based high throughput analysis of NGS data using the globus genomics system. Computational and structural biotechnology journal, 13, 64-74.

Blaisdell, R. (2017). DNA Sequencing in the Cloud. Retrieved from https://rickscloud.com/dna-sequencing-in-the-cloud/.

genome.gov. (2015). DNA Sequencing. Retrieved from https://www.genome.gov/10001177/dna-sequencing-fact-sheet/.

Han, J., Pei, J., & Kamber, M. (2011). Data mining: concepts and techniques: Elsevier.

IBM. (2012). Cloud computing fundamentals: A different way to deliver computer resources. Retrieved from https://www.ibm.com/developerworks/cloud/library/cl-cloudintro/cl-cloudintro-pdf.pdf.

Khoury, M. J., & Feero, G. (2017). Genome Sequencing for Healthy Individuals? Think Big and Act Small! Retrieved from https://blogs.cdc.gov/genomics/2017/05/17/genome-sequencing-2/.

Leung, K., Lee, K., Wang, J., Ng, E. Y., Chan, H. L., Tsui, S. K., . . . Sung, J. J. (2011). Data mining on dna sequences of hepatitis b virus. IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB), 8(2), 428-440.

Macias, F., & Thomas, G. (2011). Three Building Blocks to Enable the Cloud. Retrieved from https://www.cisco.com/c/dam/en_us/solutions/industries/docs/gov/white_paper_c11-675835.pdf.

Matthews, K. (2016). DNA Sequencing. Retrieved from https://cloudtweaks.com/2016/11/cloud-dna-sequencing/.

Mousannif, H., Khalil, I., & Kotsis, G. (2013-14). Collaborative learning in the clouds. Information Systems Frontiers, 15(2), 159-165. doi:10.1007/s10796-012-9364-y

Salzberg, S. L. (1999). Gene discovery in DNA sequences. IEEE Intelligent Systems and their Applications, 14(6), 44-48.

Verhaeghe, X. (n.d.). The Building Blocks of a Big Data Strategy. Retrieved from https://www.oracle.com/uk/big-data/features/bigdata-strategy/index.html.

Security Measures for Virtual and Cloud Environment

Dr. Aly, O.
Computer Science

Introduction

The purpose of this discussion is to discuss and analyze security measures for virtual and cloud environments. It also discusses and analyzes the current security models and the possibility for additional enhancements to increase the protection for these virtual and cloud environments. 

Virtualization

Virtualization is a core technology in Cloud Computing.  The purpose of Virtualization in Cloud Computing is to virtualize resources for the Cloud Computing Service Models such as Software-as-a-Service (SaaS), Infrastructure-as-a-Service (IaaS), and Platform-as-a-Service (PaaS) (Gupta, Srivastava, & Chauhan, 2016).   Virtualization allows creating many instances of Virtual Machines (VMs) on a single physical operating system.  The utilization of these VMs provides flexibility, agility, and scalability to the Cloud Computing resources.  The VM is provided to the client to access resources at a remote location using the virtualization computing technique.  Key features of Virtualization include resource utilization with isolation among hardware, operating systems, and software.  Another key feature of Virtualization is multi-tenancy, the simultaneous access of the VMs residing on a single physical machine.  After a VM is created, it can be copied and migrated.  These features of Virtualization are double-edged, as they provide flexibility, scalability, and agility while also causing security challenges and concerns.  The security concerns are one of the biggest obstacles to the widespread adoption of Cloud Computing (Ali, Khan, & Vasilakos, 2015).

Hardware virtualization on the physical machine is implemented using a hypervisor.  There are two types of hypervisor: Type 1 and Type 2.  The Type 1 hypervisor is called the “Bare Metal Hypervisor,” as illustrated in Figure 1.  The Type 2 hypervisor is called the “Hosted Hypervisor,” as illustrated in Figure 2.   The “Bare Metal Hypervisor” provides a layer between the physical system and the VMs, while the “Hosted Hypervisor” is deployed on top of an operating system.

Figure 1.  Hypervisor Type 1: Bare Metal Hypervisor. Adapted from (Gupta et al., 2016).

Figure 2: Hypervisor Type 2: Hosted Hypervisor. Adapted from (Gupta et al., 2016).

Virtualization exposes many security flaws to intruders.  The traditional security measures that control physical systems are inadequate or ineffective when dealing with the virtualized data center and hybrid and private Cloud environments (Gupta et al., 2016).  Moreover, the default configuration of the hypervisor does not always include security measures that can protect the virtual and cloud environment.

One of the roles of the hypervisor is to control the management of resources between the VMs and the physical hardware.  With the Type 1 “Bare Metal Hypervisor,” the hypervisor is a single point of failure, so a compromise affects the entire virtualized environment on the physical system.  With the Type 2 “Hosted Hypervisor,” the configuration exposes more threats than the “Bare Metal Hypervisor.”  The VMs hosted on the same physical system communicate with each other, which can open loopholes for intruders.

Virtualization is exposed to various types of threats and vulnerabilities.  These vulnerabilities in Virtualization security include VM Escape, VM Hopping, VM Theft, VM Sprawl, insecure VM migration, sniffing, and spoofing.  Figure 3 illustrates the vulnerabilities of Virtualization.

Figure 3.  Vulnerabilities of Virtualization. Adapted from (Gupta et al., 2016).

As indicated in (Gupta et al., 2016), the hypervisor should have built-in firewall security, and console access (USB, NIC) should be disabled to prevent unauthorized access.   Role-Based Access Control (RBAC) is effective for controlling hyperjacking of VMs.  Roles and responsibilities should be defined for the users of the VMs to check access authorization.

Security Principles, Security Modes, Security Models, and Security Implementation

As indicated in (Abernathy & McMillan, 2016), the primary goal of all security measures is to provide protection and to ensure that the measure is successful.  The three major principles of security are confidentiality, integrity, and availability (CIA), known as the CIA triad.  Confidentiality is provided if the data cannot be read by unauthorized parties, whether through access control and encryption for data at rest on the hard drive or through encryption for data in transit.   Confidentiality is the opposite of “disclosure” (Abernathy & McMillan, 2016).  Integrity is provided if the data is not changed in any way by unauthorized users; it is typically enforced through a hashing algorithm or a checksum.  The availability principle concerns the time that resources or data are available.  Availability is measured as a percentage of “up” time, with 99.9% uptime representing more availability than 99% uptime.   The availability principle ensures that the data is available and accessible whenever it is needed, and it is described as a prime goal of security.  Most attacks result in a violation of one of these security principles of confidentiality, integrity, or availability.  Thus, the defense-in-depth technique is recommended to provide additional layers of security.  For instance, even if the firewall is configured for protection, access control lists should still be applied to resources to help prevent access to sensitive data in case the firewall is breached.
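As a simple illustration of the integrity principle, a stored hash can be compared against a freshly computed one to detect unauthorized modification; the following is a minimal sketch using Python’s standard hashlib, with the file name chosen for illustration.

    # Integrity check: a file's hash must match the value recorded when it was stored.
    import hashlib

    def sha256_of(path):
        digest = hashlib.sha256()
        with open(path, "rb") as handle:
            for chunk in iter(lambda: handle.read(8192), b""):
                digest.update(chunk)
        return digest.hexdigest()

    recorded = sha256_of("report.pdf")      # stored alongside the data at write time
    # ... later, after transfer or retrieval ...
    assert sha256_of("report.pdf") == recorded, "file was modified"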

Security has four major security modes, which are typically used with Mandatory Access Control (MAC).  These four security modes are Dedicated Security Mode, System High-Security Mode, Compartmented Security Mode, and Multi-Level Security Mode.  MAC operates in different security modes at different times, based on variables such as the sensitivity of the data, the clearance level of the user, and the actions users are authorized to take.  In all four security modes, a non-disclosure agreement (NDA) must be signed, and access to certain information is based on each mode.

Security Models provide a technique for mapping the security policymakers’ requirements to the rules which a computer system must follow.  Various types of Security Models provide different approaches to implement such a mapping (Abernathy & McMillan, 2016):

  • State Machine Model,
  • Multi-Level Lattice Models, 
  • Matrix-Based Models,
  • Non-Interface Models, and
  • Information Flow Models.

Moreover, there are formal Security Models which incorporate security concepts and principles to guide the security design of systems.  These formal Security Models include the following seven models (Abernathy & McMillan, 2016); the detail of each model is beyond the scope of this discussion.

  • Bell-LaPadula Model.
  • Biba Model.
  • Clark-Wilson Integrity Model.
  • Lipner Model.
  • Brewer-Nash Model.
  • Graham-Denning Model.
  • Harrison-Ruzzo-Ullman Model.

With respect to Security Implementation, there are standards which must be followed when implementing security measures for protection.  These standards include ISO/IEC 27001, ISO/IEC 27002, and PCI-DSS.   ISO/IEC 27001 is the most popular of these standards and is used by organizations to obtain certification for information security.  It guides organizations in ensuring that the information security management system (ISMS) is properly built, administered, maintained, and improved.  The ISO/IEC 27002 standard provides a code of practice for information security management; it includes security measures such as access control, cryptography, and compliance.  PCI-DSS v3.1 is specific to the payment card industry.

Security Models in Cloud Computing

The service model is one of the main models in Cloud Computing.  These services are offered through a service provider, known as a Cloud Service Provider, to the cloud users.  Security and privacy are the main challenges and concerns when using a Cloud Computing environment.  Although there is a demand to leverage the resources of Cloud Computing to provide services to clients, there is also a requirement that the Cloud servers and resources not learn any sensitive information about the data being managed, stored, or queried (Chaturvedi & Zarger, 2015).   Effort should be exerted to improve users’ control over their data in the public environment.  Cloud Computing Security Models include the Multi-Tenancy Model, the Cloud Cube Security Model, the Mapping Model of Cloud, Security, and Compliance, and the Cloud Risk Accumulation Model of CSA (Chaturvedi & Zarger, 2015).

The Multi-Tenancy Model is described as the major functional characteristic of Cloud Computing, allowing multiple applications to provide cloud services to the clients.  Tenants are separated by virtual partitions, and each partition holds a tenant’s data, customized settings, and configuration settings.  Virtualization in a physical machine allows users to share computing resources such as memory, processor, I/O, and storage among different users’ applications and improves the utilization of Cloud resources.  SaaS is a good example of the Multi-Tenant Model, providing the scalability to serve a large number of clients based on Web services.  Security experts describe this Multi-Tenancy model as vulnerable because it exposes confidentiality, one of the security principles, to risk between the tenants.  A side-channel attack is a significant risk in the Multi-Tenancy Model; this kind of attack is based on information obtained from bandwidth monitoring.   Another risk of the Multi-Tenancy Model is the assignment of resources to clients with unknown identities and intentions.  A further security risk associated with Multi-Tenancy involves the storage of multiple tenants’ data in the same database tablespaces or backup tapes.

The Cloud Cube Security Model is characterized by four main dimensions: Internal/External, Proprietary/Open, Perimeterized/De-perimeterized, and Insourced/Outsourced.  The Mapping Model of Cloud, Security, and Compliance is another model that provides a better method to analyze the gaps between the cloud architecture and the compliance framework and the corresponding security control strategies provided by the Cloud Service Provider or third parties.  The Cloud Risk Accumulation Model of CSA is the last of these Cloud Computing Security Models.  The three Cloud models of IaaS, PaaS, and SaaS have different security requirements due to their layer dependencies.

Security Implementation: Virtual Private Cloud (VPC)

The VPC Deployment Model provides more security than the Public Deployment Model.  In this model, the user can apply access control at the instance level as well as at the network level.  Policies are configured and assigned to groups based on the access role.   The VPC, as a Deployment Model of Cloud Computing, solves problems such as loss of authentication, loss of confidentiality, loss of availability, and loss and corruption of data (Abdul, Jena, Prasad, & Balraju, 2014).  The VPC is logically isolated from other virtual networks in the cloud.  As indicated in (Abdul et al., 2014), the VPC is regarded as the most prominent approach to Trusted Computing technology.  However, organizations must implement the security measures based on the requirements of the business.  For instance, organizations and users have control to select the IP address range and to create subnets, route tables, network gateways, and security settings, as illustrated in Figure 4.

Figure 4.  Virtual Private Cloud Security Implementation.
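As one possible way to exercise this control programmatically, the following minimal boto3 sketch creates an isolated VPC, a subnet, and a restrictive security group; the CIDR blocks, group name, and port are illustrative assumptions, and AWS credentials and region are assumed to be configured.

    # Create an isolated VPC with a private subnet and a restrictive security group.
    import boto3

    ec2 = boto3.client("ec2")

    vpc = ec2.create_vpc(CidrBlock="10.0.0.0/16")
    vpc_id = vpc["Vpc"]["VpcId"]

    subnet = ec2.create_subnet(VpcId=vpc_id, CidrBlock="10.0.1.0/24")

    sg = ec2.create_security_group(GroupName="private-sg",
                                   Description="internal access only",
                                   VpcId=vpc_id)

    # Allow SSH only from inside the VPC's own address range.
    ec2.authorize_security_group_ingress(GroupId=sg["GroupId"],
                                         IpProtocol="tcp", FromPort=22, ToPort=22,
                                         CidrIp="10.0.0.0/16")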

In summary, security measures must be implemented to protect the cloud environment.  Virtualization introduces threats to the Cloud environment, and the hypervisor is a major component of Virtualization.  It is recommended that the hypervisor have built-in firewall security and that console access (USB, NIC) be disabled to prevent unauthorized access.   Role-Based Access Control (RBAC) should be used to control hyperjacking of VMs, and roles and responsibilities should be defined for the users of the VMs to check access authorization.  The Virtual Private Cloud, as a trusted deployment model of Cloud Computing, provides a more secure cloud environment than the Public Cloud.  The security implementation must follow certain standards, and organizations must comply with these standards to protect themselves and their users.

References

Abdul, A. M., Jena, S., Prasad, S. D., & Balraju, M. (2014). Trusted Environment In Virtual Cloud. International Journal of Advanced Research in Computer Science, 5(4).

Abernathy, R., & McMillan, T. (2016). CISSP Cert Guide: Pearson IT Certification.

Ali, M., Khan, S. U., & Vasilakos, A. V. (2015). Security in cloud computing: Opportunities and challenges. Information Sciences, 305, 357-383. doi:10.1016/j.ins.2015.01.025

Chaturvedi, D. A., & Zarger, S. A. (2015). A review of security models in cloud computing and an Innovative approach. International Journal of Computer Trends and Technology (IJCTT), 30(2), 87-92.

Gupta, M., Srivastava, D. K., & Chauhan, D. S. (2016). Security Challenges of Virtualization in Cloud Computing. Paper presented at the Proceedings of the Second International Conference on Information and Communication Technology for Competitive Strategies, Udaipur, India.

Cloud Computing Security Issues

Dr. Aly, O.
Computer Science

Introduction

The purpose of this discussion is to discuss and analyze two security issues associated with the Cloud Computing system.  The analysis includes the causes for these two security issues and the solutions.  The discussion begins with an overview of the Security Issues when dealing with Cloud Computing.

Security Issues Associated with Cloud Computing

Cloud Computing and Big Data are the current buzzwords in the IT industry.  Cloud Computing not only addresses the challenges of Big Data but also offers benefits for businesses, organizations, and individuals such as:

  • Cost saving,
  • Access data from anywhere anytime,
  • Pay per use like any utility,
  • Data Storage,
  • Data Processing,
  • Elasticity,
  • Energy Efficiency,
  • Enhanced Productivity, and more (Botta, de Donato, Persico, & Pescapé, 2016; Carutasu, Botezatu, Botezatu, & Pirnau, 2016; El-Gazzar, 2014).

Despite the tremendous benefits of Cloud Computing, this emerging technology is confronted with many challenges.  The top challenge is security, which executives cite as the number one concern for adopting Cloud Computing (Avram, 2014; Awadhi, Salah, & Martin, 2013; Chaturvedi & Zarger, 2015; Hashizume, Rosado, Fernández-medina, & Fernandez, 2013; Pearson, 2013).

The security issues in the Cloud Computing environment are distinguished from the security issues of traditional distributed systems (Sakr & Gaber, 2014).   Various research studies, in an attempt to justify this security challenge in the Cloud Computing environment, provide various reasons, such as the fact that the underlying technologies of Cloud Computing, including virtualization and SOA (Service-Oriented Architecture), have their own security issues (Inukollu, Arsi, & Ravuri, 2014).  Thus, the security issues associated with these technologies come along with Cloud Computing (Inukollu et al., 2014).  The Cloud Computing service model of PaaS (Platform as a Service) is a good example because it is based on the SOA model; thus, PaaS inherits all of the security issues associated with SOA technology (Almorsy, Grundy, & Müller, 2016).   In (Sakr & Gaber, 2014), factors such as multi-tenancy, trust asymmetry, global reach, and insider threats contribute to the security issues associated with the Cloud Computing environment.  In (Tripathi & Mishra, 2011), eleven security issues and threats associated with the Cloud environment are identified: (1) VM-level attacks, (2) abuse and nefarious use of Cloud Computing, (3) loss of governance, (4) lock-in, (5) insecure interfaces and APIs, (6) isolation failure, (7) data loss or leakage, (8) account or service hijacking, (9) management interface compromise, (10) compliance risks, and (11) malicious insiders.  In the more recent report of (CSA, 2016), twelve critical issues for Cloud security are identified and ranked in order of severity.  Data Breaches are ranked at the top and regarded as the most severe security issue of the Cloud Computing environment.  Weak Identity, Credential, and Access Management is the second most severe security issue.  Insecure APIs, System and Application Vulnerabilities, and Account Hijacking are the next ranked security issues.  Table 1 lists the twelve security issues associated with Cloud Computing as reported by (CSA, 2016).

Table 1.  Top Twelve Security Issues of Cloud Computing in Order of Severity. Adapted from (CSA, 2016).

The discussion and analysis are limited to the top two security issues, which are the Data Breaches, and the Weak Identity, Credential and Access Management. The discussion and analysis cover the causes and the proposed solutions.

Data Breaches

A data breach occurs when sensitive or confidential information, or any private data not intended for the public, is released, viewed, stolen, or used by unauthorized users (CSA, 2016).   The data breach issue is not unique to the Cloud Computing environment (CSA, 2016).  However, it consistently ranks as the top issue and concern for Cloud users.  The Cloud environment is subject to the same threats as the traditional corporate network, as well as new attack techniques due to the shared resources.  The sensitivity of the data determines the extent of the damage.  The impact of a data breach on users and organizations can be devastating.  For instance, in a single data-breach incident in the USA, 40 million credit card numbers and about 70 million addresses, phone numbers, and other private and personal details were compromised (Soomro, Shah, & Ahmed, 2016).  The firm spent $61 million within one year of the breach on damages and recovery, in addition to the cash loss and the profit, which dropped by 46% in one quarter of the year (Soomro et al., 2016).  The anti-virus firm BitDefender and the British telecom provider TalkTalk are other examples of data breaches.  Private information such as usernames and passwords of BitDefender customers was stolen in mid-2015, and the hacker demanded a ransom of $15,000 (CSA, 2016; Fox-Brewster, 2015).  Multiple security incidents in 2014 and 2015 were reported by TalkTalk, resulting in the theft of four million users’ private information (CSA, 2016; Gibbs, 2015).

The organization is obliged to exercise certain security standards of care to ensure that sensitive information is not released to unauthorized users.  Cloud providers are responsible for certain aspects of Cloud Computing, and they usually provide the security measures for those aspects.  However, Cloud users are responsible for other aspects of their Cloud use and must protect their data in the Cloud accordingly.   Multi-factor authentication and encryption are two techniques proposed to secure the Cloud environment.
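To illustrate the multi-factor part, a time-based one-time password (TOTP) can serve as a second factor; the following minimal sketch uses the third-party pyotp package and is only one possible way such a factor might be wired in, not a statement about any particular provider’s implementation.

    # Time-based one-time password (TOTP) as a second authentication factor.
    import pyotp

    secret = pyotp.random_base32()       # provisioned once per user, e.g. via a QR code
    totp = pyotp.TOTP(secret)

    code = totp.now()                    # the code shown by the user's authenticator app
    print(totp.verify(code))             # True only within the current time window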

Insufficient Identity, Credential, and Access Management

Data breaches and malicious attacks happen for various reasons.  The lack of scalable identity access management systems can cause a data breach (CSA, 2016).  The failure to use multi-factor authentication, weak password use, and a lack of ongoing automated rotation of cryptographic keys, passwords, and certificates can also cause a data breach (CSA, 2016).  Malicious attackers who can masquerade as legitimate users or developers can modify and delete data, issue control and management functions, snoop on data in transit, or release malicious software which appears to originate from a legitimate source.   Insufficient identity, credential, or key management can allow these malicious attackers or non-authorized users to access private and sensitive data and cause catastrophic damage to users and organizations alike.   The GitHub attack and the Dell root certificate are good examples of this security issue.  GitHub is a good example because attackers scraped GitHub for cloud service credentials and hijacked accounts to mine virtual currency (Sandvik, 2014).  Dell is another example; it released a fix for a root certificate failure because all Dell systems used the same secret key and certificate, which enabled the creation of a certificate for any domain that Dell systems would trust (Schwartz, 2015).

These security issues require Cloud Computing systems to be protected so that unauthorized users do not have access to private and sensitive information.  Various solutions are proposed to solve the security issue of insufficient identity and access management.  A security framework for distributed systems that considers public key cryptography, software agents, and XML binding technologies was proposed in (Prakash & Darbari).  Credentials and cryptographic keys should not be embedded in source code or distributed in public repositories such as GitHub.  Keys should be properly secured using a well-secured public key infrastructure (PKI) to ensure sound key management (CSA, 2016).  Identity Management Systems (IMS) should scale to handle lifecycle management for millions of users and cloud service providers (CSP).  The IMS should support immediate de-provisioning of access to resources when events such as job termination or role changes occur.   Multi-Factor Authentication, such as a smart card or phone authentication, should be required for users and operators of the Cloud service (CSA, 2016).
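A minimal Python sketch of the “no credentials in source code” guidance might read secrets from the environment at runtime; the variable names are hypothetical.

    # Credentials are read from the environment at runtime instead of being
    # committed to source code or pushed to a public repository.
    import os

    api_key = os.environ["CLOUD_API_KEY"]        # raises KeyError if not provisioned
    api_secret = os.environ["CLOUD_API_SECRET"]

    # The key pair can now be rotated centrally without touching the code base.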

References

Almorsy, M., Grundy, J., & Müller, I. (2016). An analysis of the cloud computing security problem. arXiv preprint arXiv:1609.01107.

Avram, M. G. (2014). Advantages and Challenges of Adopting Cloud Computing from an Enterprise Perspective. Procedia Technology, 12, 529-534. doi:10.1016/j.protcy.2013.12.525

Awadhi, E. A., Salah, K., & Martin, T. (2013, 17-20 Nov. 2013). Assessing the Security of the Cloud Environment. Paper presented at the GCC Conference and Exhibition (GCC), 2013 7th IEEE.

Botta, A., de Donato, W., Persico, V., & Pescapé, A. (2016). Integration of Cloud Computing and Internet Of Things: a Survey. Future Generation computer systems, 56, 684-700.

Carutasu, G., Botezatu, M., Botezatu, C., & Pirnau, M. (2016). Cloud Computing and Windows Azure.

Chaturvedi, D. A., & Zarger, S. A. (2015). A review of security models in cloud computing and an Innovative approach. International Journal of Computer Trends and Technology (IJCTT), 30(2), 87-92.

CSA. (2016). The Treacherous 12: Cloud Computing Top Threats in 2016. Cloud Security Alliance Top Threats Working Group.

El-Gazzar, R. F. (2014). A literature review on cloud computing adoption issues in enterprises. Paper presented at the International Working Conference on Transfer and Diffusion of IT.

Fox-Brewster, T. (2015). Anti-Virus Firm BitDefender Admits Breach, Hacker Claims Stolen Passwords Are Unencrypted. Retrieved from https://www.forbes.com/sites/thomasbrewster/2015/07/31/bitdefender-hacked/#5a5f5b125ab2.

Gibbs, S. (2015). TalkTalk criticised for poor security and handling of hack attack. Retrieved from http://www.theguardian.com/technology/2015/oct/23/talktalk-criticised-for-poor-security-and-handling-of-hack-attack.

Hashizume, K., Rosado, D. G., Fernández-medina, E., & Fernandez, E. B. (2013). An analysis of security issues for cloud computing. Journal of internet services and applications, 4(1), 1-13. doi:10.1186/1869-0238-4-5

Inukollu, V. N., Arsi, S., & Ravuri, S. R. (2014). Security issues associated with big data in cloud computing. International Journal of Network Security & Its Applications, 6(3), 45.

Pearson, S. (2013). Privacy, security and trust in cloud computing Privacy and Security for Cloud Computing (pp. 3-42): Springer.

Prakash, V., & Darbari, M. A Review on Security Issues in Distributed Systems.

Sakr, S., & Gaber, M. (2014). Large Scale and big data: Processing and Management: CRC Press.

Sandvik , R. A. (2014). Attackers Scrape GitHub for Cloud Service Credentials, Hijack Account to Mine Virtual Currency. Retrieved from https://www.forbes.com/sites/runasandvik/2014/01/14/attackers-scrape-github-for-cloud-service-credentials-hijack-account-to-mine-virtual-currency/#71fe913c3196.

Schwartz, M. (2015). Dell Releases Fix for Root Certificate Fail. Retrieved from http://www.bankinfosecurity.com/dell-releases-fix-for-root-certificate-fail-a-8701/op-1.

Soomro, Z. A., Shah, M. H., & Ahmed, J. (2016). Information security management needs more holistic approach: A literature review. International Journal of Information Management, 36(2), 215-225.

Tripathi, A., & Mishra, A. (2011, 14-16 Sept. 2011). Cloud computing security considerations. Paper presented at the 2011 IEEE International Conference on Signal Processing, Communications and Computing (ICSPCC).

Big Data Security Issues

Dr. Aly, O.
Computer Science

Introduction

The purpose of this discussion is to discuss and analyze two security issues associated with Big Data.  The analysis includes the causes for these two security issues and the solutions.  The discussion begins with an overview of the Security Issues when dealing with Big Data.

Security Issues Associated with Big Data

As indicated in (CSA & Big-Data-Working-Group, 2013), the velocity, volume, and variety characteristics of Big Data magnify the security and privacy issues.  These issues include the large-scale infrastructures in the Cloud, the variety of data sources and formats, the acquisition of data using streaming techniques, and the high-volume migration inside the Cloud (CSA & Big-Data-Working-Group, 2013).  Thus, the traditional security techniques, which tended to target small-scale, static data, are found inadequate when dealing with Big Data (CSA & Big-Data-Working-Group, 2013).  Storing organizations’, customers’, and patients’ information in a secure manner is not a trivial process, and it gets more complicated in a Big Data environment (Al-Kahtani, 2017).   CSA identified the top ten Big Data security and privacy challenges, illustrated in the Big Data ecosystem in Figure 1, adapted from CSA.  These ten security challenges fall into four main categories in Big Data ecosystems: (1) Infrastructure Security, (2) Data Privacy, (3) Data Management, and (4) Integrity and Reactive Security.

Figure 1.  Top Ten Security Challenges in Big Data Ecosystem, adapted from (Al-Kahtani, 2017).

Tremendous efforts from researchers, practitioners, and industry have been exerted to address the security issues associated with Big Data.  As indicated in (Arora & Bahuguna), the security of Big Data is challenging due to two main vulnerabilities.  The first vulnerability is information leakage, which is amplified by Big Data because of its characteristics of high volume and velocity (Arora & Bahuguna).  The second vulnerability reflects the risk to privacy and of predicting people’s behavior, which increases with the development of intelligent terminals (Arora & Bahuguna).  In (Al-Kahtani, 2017), the general security risks associated with Big Data environments are identified as six security risk elements.  The first security risk is associated with the implementation of a new technology, which can lead to the discovery of new vulnerabilities.  The second security risk can be associated with open source tools, which can contain undocumented vulnerabilities such as backdoors and can lack update options.  The third security risk reflects the large cluster-node attack surfaces, which organizations are not prepared to monitor.  The fourth security risk reflects poor authentication of users and weak remote access policies.  The fifth security risk arises when organizations are unable to handle the processing of large volumes of audit and access logs.  The sixth element is the lack of data validation to detect malicious data input, which can become lost in the large volume of Big Data (Al-Kahtani, 2017).  With regard to the infrastructure, common attacks can include false data injection, Denial of Service (DoS), worm and malware propagation, and botnet attacks (Al-Kahtani, 2017).   In (Khan et al., 2014), the security issues associated with Big Data are categorized into privacy, integrity, availability, confidentiality, and governance.  Data leakage is a major privacy concern in Big Data.  Data integrity is a particular challenge for large-scale collaborative analysis, where data frequently changes (Khan et al., 2014).  Availability is critical when dealing with Big Data in the cloud; it involves threats such as Denial of Service (DoS) attacks and the mitigation of DoS attacks.  The confidentiality issue concerns protecting data from theft (Khan et al., 2014).  In (Mehta, 2017), the security issues associated with Big Data involve granular access control, real-time monitoring, granular audits, privacy preservation in data mining and analytics, encrypted data-centric security, data provenance and verification, and integrity and reactive security.  These security issues are similar to the ones discussed in (CSA & Big-Data-Working-Group, 2013; Sahafizadeh & Nematbakhsh, 2015; Yadav, 2016).

For this discussion, only two security issues associated with Big Data are discussed and analyzed with the proposed solutions to overcome them.  These two security issues are categorized under the Integrity and Reactive Security category of (CSA & Big-Data-Working-Group, 2013), which involves (1) End-point validation and filtering, and (2) Real-time Security Monitoring.  The End-point validation and filtering are categorized in (Demchenko, Ngo, de Laat, Membrey, & Gordijenko, 2013-15) under the Infrastructure Security category, while the Real-time Security Monitoring is categorized under the Data Management. 

End-Point Validation and Filtering Security Issue and Proposed Solutions

The end-points are the main components for Big Data collection (Yadav, 2016).  They provide the input data for storage and processing.  Security is very important to ensure that only authentic end-points are used and that the network is free from other end-points, including malicious ones (Yadav, 2016).   Data collected from various sources, including end-point devices, is required when dealing with Big Data (CSA & Big-Data-Working-Group, 2013).  A security information and event management (SIEM) system is an example of collecting logs from millions of software applications and hardware devices in an enterprise or organizational network.  The input validation and filtering performed during this data collection are very challenging and critical to the integrity and trustworthiness of the data, due to threats from untrusted sources, especially with the “bring-your-own-device” (BYOD) model, which allows employees to bring their own devices to the workplace (CSA & Big-Data-Working-Group, 2013).  There are four threat models when dealing with the validation and filtering security issue.  The first threat model is that a malicious attacker may tamper with a device, such as a smartphone, from which data is collected and retrieved, with the aim of providing malicious input to a central data collection system.  The second is that a malicious attacker may perform ID-cloning attacks, such as Sybil attacks, on the collected data, with the aim of providing malicious input to the central data collection using faked identities.  The third threat model is that a malicious attacker can manipulate the input sources of the sensed data.  The last threat model involves a malicious attacker compromising data in transmission from a benign source to the central collection system, for example by performing a man-in-the-middle attack or a replay attack (CSA & Big-Data-Working-Group, 2013).

A use case scenario for this issue is data retrieved from weather sensors and feedback votes sent by smartphone applications, such as iPhone or Android applications, which have a similar validation and filtering problem (CSA & Big-Data-Working-Group, 2013).  The validation and filtering issue in this example becomes further complicated as the volume of the collected data increases (CSA & Big-Data-Working-Group, 2013).  An algorithm is required to validate the input for large data sets and to filter out any malicious and untrusted data (CSA & Big-Data-Working-Group, 2013).

The solutions to the validation security issue fall into two categories.  The first category is to prevent the malicious attacker from generating and sending malicious input to the central collection system (CSA & Big-Data-Working-Group, 2013).  The second category is to detect and filter malicious input at the central system in case the malicious attacker succeeds in sending malicious data to the central collection system (CSA & Big-Data-Working-Group, 2013).

The first category of solutions, preventing malicious attacks, requires tamper-proof software and defenses against Sybil attacks.   Researchers and industry have exerted tremendous efforts to design and implement tamper-proof secure software and tools.  The security of PC-based platforms and applications has been widely studied; however, mobile device and application security is still an active area of research (CSA & Big-Data-Working-Group, 2013).  Thus, a determined malicious attacker may succeed in tampering with mobile devices.  The Trusted Platform Module (TPM) was proposed to ensure the integrity of raw sensor data and of data derived from raw data (CSA & Big-Data-Working-Group, 2013).  However, the TPM is not found universally in mobile devices, so a malicious attacker can still manipulate sensor input such as GPS signals (CSA & Big-Data-Working-Group, 2013).  Various defenses against fake identities used in ID-cloning and Sybil attacks have been proposed for P2P (peer-to-peer) systems, recommender systems (RS), vehicular networks, and wireless sensor networks (CSA & Big-Data-Working-Group, 2013).  Many of these defense techniques rely on trusted certificates and trusted devices to prevent Sybil attacks.  However, in large enterprise settings and organizations with millions of entities, the management of certificates becomes an additional challenge.  Thus, additional solutions based on resource testing are proposed to provide a minimal defense against Sybil attacks by discouraging them rather than preventing them (CSA & Big-Data-Working-Group, 2013).   Big Data analytical techniques can be used to detect and filter malicious input at the central collection system.  Malicious input from an attacker may appear as outliers; thus, statistical analysis and outlier detection techniques can be used to detect and filter out the malicious input (CSA & Big-Data-Working-Group, 2013).
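As a concrete, simplified illustration of that last point, a robust outlier test based on the median absolute deviation can flag implausible readings for review; the values and threshold below are invented for illustration.

    # Flag values far from the median (robust outlier test) as candidate malicious input.
    from statistics import median

    def filter_outliers(values, threshold=3.5):
        med = median(values)
        mad = median(abs(v - med) for v in values)      # median absolute deviation
        kept, suspicious = [], []
        for v in values:
            score = 0.6745 * abs(v - med) / mad if mad else 0.0
            (suspicious if score > threshold else kept).append(v)
        return kept, suspicious

    readings = [21.1, 20.8, 21.4, 98.6, 21.0]           # one implausible sensor value
    print(filter_outliers(readings))                    # 98.6 lands in the suspicious list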

Real-time Security Monitoring

Real-Time Security Monitoring is described as one of the most challenging Big Data Analytics issues (CSA & Big-Data-Working-Group, 2013; Sakr & Gaber, 2014).  This is a two-dimensional issue, encompassing the monitoring of the Big Data infrastructure itself and the use of that same infrastructure for Big Data Analytics.  Monitoring the performance and health of the nodes in the Big Data infrastructure is an example of the first dimension.  A good example of the second dimension is a health care provider using monitoring tools to look for fraudulent claims in order to obtain better real-time alerting and compliance monitoring (CSA & Big-Data-Working-Group, 2013).   Real-Time Security Monitoring is challenging because security devices generate alerts which can lead to a massive number of false positives, which are often ignored due to the limited human capacity for analysis.  This problem becomes even more challenging with Big Data, due to the volume of the data and the velocity of the data streams.   However, Big Data technologies also provide an opportunity to process and analyze different types of data rapidly and to perform real-time anomaly detection based on scalable security analytics (CSA & Big-Data-Working-Group, 2013).

A use case scenario for this issue is the healthcare industry, where real-time monitoring reduces fraud related to claims (CSA & Big-Data-Working-Group, 2013).  Moreover, the stored data are extremely sensitive, must comply with patient privacy regulations such as HIPAA, and must be carefully protected.   Real-time detection of anomalous retrieval of patients’ private information enables the healthcare provider to rapidly repair the damage and prevent further misuse (CSA & Big-Data-Working-Group, 2013).

The Big Data infrastructure and platform must themselves be secured, which is a prerequisite for Real-Time Security Monitoring.  Big Data infrastructure threats include (1) rogue administrator access to applications or nodes, (2) web application threats, and (3) eavesdropping on the line.  Thus, the security of the Big Data ecosystem and infrastructure must cover each component and the integration of these components.  For instance, when using a Hadoop cluster in a Public Cloud, the security for Big Data should include the security of the Public Cloud, the security of the Hadoop cluster and all nodes in the cluster, the security of the monitoring applications, and the security of the input sources such as devices and sensors (CSA & Big-Data-Working-Group, 2013).  The threats also include attacks on the Big Data Analytics tools which are used to identify malicious activity.  For instance, evasion attacks can be used to avoid detection, and data poisoning attacks can be used to undermine the trustworthiness and integrity of the datasets which are used to train Big Data analytics algorithms (CSA & Big-Data-Working-Group, 2013).  Moreover, barriers such as legal regulations become important when dealing with the Real-Time Security Monitoring challenges of Big Data.  Big Data Analytics (BDA) can be employed to monitor anomalous connections to the cluster environment and to mine the logging events to identify any suspicious activities (CSA & Big-Data-Working-Group, 2013).  When dealing with Real-Time Security Monitoring, various factors such as technical, legal, and ethical considerations must be taken into account (CSA & Big-Data-Working-Group, 2013).
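A highly simplified Python sketch of mining log events for suspicious activity is shown below; the event format, threshold, and addresses are assumptions for illustration, not part of the cited framework.

    # Simplified real-time check: raise an alert when one source accumulates
    # too many failed logins within the monitoring window.
    from collections import Counter

    FAILED_LOGIN_LIMIT = 5                       # illustrative threshold

    def scan(events):
        failures = Counter()
        for source_ip, outcome in events:
            if outcome == "FAILED_LOGIN":
                failures[source_ip] += 1
                if failures[source_ip] == FAILED_LOGIN_LIMIT:
                    yield f"ALERT possible brute force from {source_ip}"

    window = [("10.0.0.7", "FAILED_LOGIN")] * 6 + [("10.0.0.9", "LOGIN_OK")]
    for alert in scan(window):
        print(alert)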

Conclusion

This discussion focused on two security issues associated with Big Data.  The discussion and analysis covered the causes of these two issues and the proposed solutions.  The discussion began with an overview of the security issues that arise when dealing with Big Data.  The categories of threats have been described not only by CSA (CSA & Big-Data-Working-Group, 2013), which identified the top ten challenges of Big Data, but also by other researchers who identified further threats and security challenges.  These include secure computations in distributed programming frameworks, the security of data storage and transaction logs, end-point input validation and filtering, and Real-Time Security Monitoring.  The two security issues chosen for this discussion were end-point input validation and filtering, and Real-Time Security Monitoring.  Various solutions have been proposed to reduce and prevent the attacks and threats associated with Big Data.  However, there is no perfect solution yet, due to the nature of Big Data and the fact that mobile device security remains an active research area.

References

Al-Kahtani, M. S. (2017). Security and Privacy in Big Data. International Journal of Computer Engineering and Information Technology, 9(2).

Arora, M., & Bahuguna, H. Big Data Security–The Big Challenge.

CSA, & Big-Data-Working-Group. (2013). Expanded Top Ten Big Data Security and Privacy Challenges. Cloud Security Alliance.

Demchenko, Y., Ngo, C., de Laat, C., Membrey, P., & Gordijenko, D. (2013). Big security for big data: addressing security challenges for the big data infrastructure. Paper presented at the Workshop on Secure Data Management.

Khan, N., Yaqoob, I., Hashem, I. A. T., Inayat, Z., Mahmoud Ali, W. K., Alam, M., . . . Gani, A. (2014). Big Data: Survey, Technologies, Opportunities, and Challenges. The Scientific World Journal, 2014.

Mehta, T. M. P. M. P. (2017). Security and Privacy–A Big Concern in Big Data A Case Study on Tracking and Monitoring System.

Sahafizadeh, E., & Nematbakhsh, M. A. (2015). A Survey on Security Issues in Big Data and NoSQL. Int’l J. Advances in Computer Science, 4(4), 2322-5157.

Sakr, S., & Gaber, M. (2014). Large Scale and Big Data: Processing and Management: CRC Press.

Yadav, N. (2016). Top Ten Big Data Security and Privacy Challenge. Retrieved from https://www.infosecurity-magazine.com/opinions/big-data-security-privacy/.

Performance of IaaS Cloud and Stochastic Model

Dr. Aly, O.
Computer Science

Abstract

The purpose of this paper is to provide an analysis of the performance of the IaaS Cloud with an emphasis on the stochastic model.  The project begins with a brief discussion of Cloud Computing and its service models of IaaS, PaaS, and SaaS.  It also discusses the three available options for performance analysis of the IaaS Cloud: experiment-based, discrete-event-simulation-based, and stochastic-model-based.  The discussion focuses on the most feasible approach, the stochastic model.  The discussion of the performance analysis also includes the proposed sub-models: the Resource Provisioning Decision Engine (RPDE) sub-model based on a Continuous-Time Markov Chain (CTMC), the Hot Physical Machine (PM) Sub-Model, the Closed-Form Solution for the Hot PM Sub-Model, the Warm PM Sub-Model, and the Cold PM Sub-Model.  The discussion and analysis also cover the interactions among these sub-models and their impact on performance.  The Monolithic Model is also discussed and analyzed.  The findings compare the scalability and accuracy of the interacting sub-models with the one-level monolithic model.  The results show that when the number of PMs in each pool increases beyond three and the number of VMs per PM increases beyond 38, the monolithic model runs into a memory overflow problem.  The results also indicate that the state space size of the monolithic model increases quickly and becomes too large to construct the reachability graph even for a small number of PMs and VMs.  When using the interacting sub-models, the reduced number of states and nonzero entries leads to a concomitant reduction in the solution time needed.  The findings also indicate that the values of the probabilities (Ph, Pw, Pc) that at least one PM in a pool can accept a job differ between the monolithic ("exact") model and the interacting ("approximate") sub-models.

Keywords: IaaS Performance Analysis, Stochastic Model, Monolithic Model, CTMC.

Cloud Computing

Cloud Computing has attracted the attention of both the IT industry and academia as it represents a new computing paradigm and business model (Xiao & Xiao, 2013).  The key concept of Cloud Computing is not new (Botta, de Donato, Persico, & Pescapé, 2016; Kaufman, 2009; Kim, Kim, Lee, & Lee, 2009; Zhang, Cheng, & Boutaba, 2010).  According to Kaufman (2009), the technology of Cloud Computing has been evolving for decades, "more than 40 years."  Licklider introduced the term "intergalactic computer network" back in the 1960s at the Advanced Research Projects Agency (Kaufman, 2009; Timmermans, Stahl, Ikonen, & Bozdag, 2010).  The term "cloud" goes back to the 1990s, when the telecommunications world was emerging (Kaufman, 2009).  Virtual private network (VPN) services were also introduced with telecommunications (Kaufman, 2009).  Although VPNs maintained the same bandwidth as "fixed networks," bandwidth efficiency increased and network utilization was balanced because these "fixed networks" supported "dynamic routing" (Kaufman, 2009).  The combination of telecommunications, VPNs, and bandwidth efficiency through dynamic routing resulted in a technology for which the term "telecom cloud" was coined (Kaufman, 2009).  The term Cloud Computing is similar to the term "telecom cloud," as Cloud Computing also provides computing services using virtual environments that are dynamically allocated as required by consumers (Kaufman, 2009).

Also, the underlying concept of Cloud Computing was introduced by John McCarthy, the "MIT computer scientist and Turing Award winner," in 1961 (Jadeja & Modi, 2012; Kaufman, 2009).  McCarthy predicted that "computation may someday be organized as a public utility" (Foster, Zhao, Raicu, & Lu, 2008; Jadeja & Modi, 2012; Joshua & Ogwueleka, 2013; Khan, Khan, & Galibeen, 2011; Mokhtar, Ali, Al-Sharafi, & Aborujilah, 2013; Qian, Luo, Du, & Guo, 2009; Timmermans et al., 2010).  Besides, Douglas F. Parkhill, as cited in (Adebisi, Adekanmi, & Oluwatobi, 2014), also predicted in his book "The Challenge of the Computer Utility" that the computer industry would provide services like a public utility, "in which many remotely located users are connected via communication links to a central computing facility" (Adebisi et al., 2014).

NIST (Mell & Grance, 2011) identifies three essential Cloud Computing Service Models as follows:

  • The Infrastructure as a Service (IaaS) layer provides the consumer with the capability to provision storage, processing, networks, and other fundamental computing resources.  Using IaaS, the consumer can deploy and run "arbitrary" software, which can include operating systems and applications.  When using IaaS, the consumer does not manage or control the "underlying cloud infrastructure."  However, the consumer has control over the storage, the operating systems, and the deployed applications, and "possibly limited control of selected networking components such as host firewalls" (Mell & Grance, 2011).
  • The Platform as a Service (PaaS) layer allows Cloud Computing consumers to deploy applications created using programming languages, libraries, services, and tools supported by the provider.  Using PaaS, consumers do not manage or control the underlying cloud infrastructure, including the network, servers, operating systems, or storage.  However, they have control over the deployed applications and possibly the configuration settings for the application-hosting environment (Mell & Grance, 2011).
  • The Software as a Service (SaaS) layer allows Cloud Computing consumers to use the provider's applications running on the cloud infrastructure.  Users can access the applications from various client devices through either a thin client interface, such as a web-based email client in a web browser, or a program interface.  Using SaaS, consumers do not manage or control the underlying cloud infrastructure, including the network, servers, operating systems, storage, or even individual application capabilities, with the possible exception of limited user-specific application configuration settings (Mell & Grance, 2011).

Performance of IaaS Cloud

The management of Big Data requires computing capacity.  This requirement is met by IaaS clouds, which are regarded as a major enabler of data-intensive cloud applications (Ghosh, Longo, Naik, & Trivedi, 2013; Sakr & Gaber, 2014).  When using the IaaS service model, instances of virtual machines (VMs) deployed on physical machines (PMs) are provided to users to meet their computing needs (Ghosh et al., 2013; Sakr & Gaber, 2014).  Providing the basic functionality for processing Big Data is important; however, the performance of the Cloud is another important factor (Ghosh et al., 2013; Sakr & Gaber, 2014).  IaaS cloud providers offer Service Level Agreements (SLAs) to guarantee availability (Ghosh et al., 2013; Sakr & Gaber, 2014).  However, a performance SLA is as important as an availability SLA (Ghosh et al., 2013; Sakr & Gaber, 2014).  Performance analysis of the Cloud is a complex process because performance is affected by many factors, such as hardware components (CPU speed, disk properties), software (the nature of the hypervisor), the workload (the arrival rate), and the placement policies (Ghosh et al., 2013; Sakr & Gaber, 2014).

There are three major techniques that can be used to evaluate the performance of the Cloud (Ghosh et al., 2013; Sakr & Gaber, 2014).  The first is experimentation for measurement-based performance quantification (Ghosh et al., 2013; Sakr & Gaber, 2014).  However, this approach is not practical because the scale of the cloud makes measurement-based analysis prohibitive in terms of time and cost (Ghosh et al., 2013; Sakr & Gaber, 2014).  The second approach is discrete-event simulation (Ghosh et al., 2013; Sakr & Gaber, 2014).  This approach is not practical either, because the simulation can take a long time to produce statistically significant results (Ghosh et al., 2013; Sakr & Gaber, 2014).  The third approach is the stochastic modeling technique, which can be used as a low-cost option whose model solution time is much shorter than that of the experimental and simulation approaches (Ghosh et al., 2013; Sakr & Gaber, 2014).  However, a stochastic model may not scale given the complexity and the size of the Cloud (Ghosh et al., 2013; Sakr & Gaber, 2014).  A scalable stochastic modeling approach that preserves accuracy is therefore important (Ghosh et al., 2013; Sakr & Gaber, 2014).

As indicated in (Ghosh et al., 2013; Sakr & Gaber, 2014), three pools are identified in the Cloud architecture: hot, warm, and cold.  The hot pool contains machines that are running and ready, the warm pool contains machines that are turned on but not ready (and are saving power), and the cold pool contains machines that are turned off (Ghosh et al., 2013; Sakr & Gaber, 2014).  There is no provisioning delay with the hot pool, a small delay with the warm pool, and the largest delay with the cold pool (Ghosh et al., 2013; Sakr & Gaber, 2014).  When a request arrives, the Resource Provisioning Decision Engine (RPDE) tries to find a physical machine in the hot pool that can accept the request (Ghosh et al., 2013; Sakr & Gaber, 2014).  If all machines in the hot pool are busy, the RPDE tries to find a physical machine in the warm pool (Ghosh et al., 2013; Sakr & Gaber, 2014).  If the warm pool cannot meet the request, the RPDE goes to the cold pool to meet it (Ghosh et al., 2013; Sakr & Gaber, 2014).
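
As a hypothetical illustration of this hot/warm/cold decision flow (not the authors' implementation), the sketch below tries each pool in order and rejects the job if no pool can accept it; the pool interface, capacities, and delays are assumptions made for the example.

    class Pool:
        """Toy pool: a fixed capacity and a provisioning delay (both assumed)."""
        def __init__(self, name, capacity, provisioning_delay):
            self.name = name
            self.capacity = capacity
            self.provisioning_delay = provisioning_delay
            self.in_use = 0

        def can_accept(self):
            return self.in_use < self.capacity

        def provision(self, job):
            self.in_use += 1
            return (job, self.name, self.provisioning_delay)

    def rpde_decide(job, hot, warm, cold):
        """Try the hot pool first, then warm, then cold; otherwise reject."""
        for pool in (hot, warm, cold):
            if pool.can_accept():
                return pool.provision(job)
        return (job, "rejected", None)

    hot, warm, cold = Pool("hot", 2, 0.0), Pool("warm", 2, 5.0), Pool("cold", 2, 30.0)
    for job_id in range(7):
        print(rpde_decide(job_id, hot, warm, cold))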

Interacting sub-models are used for the performance analysis.  A scalable approach using interacting stochastic sub-models is proposed, in which an overall solution is composed by iterating over the individual sub-model solutions (Ghosh et al., 2013; Sakr & Gaber, 2014).

  1. RPDE Sub-Model of the Continuous-Time Markov Chain (CTMC)

The first model is the RPDE sub-model, a Continuous-Time Markov Chain (CTMC) designed to capture the resource provisioning decision process (Ghosh et al., 2013; Sakr & Gaber, 2014).  In this sub-model, a finite-length decision queue is considered, where decisions are made on a first-come, first-served (FCFS) basis (Ghosh et al., 2013; Sakr & Gaber, 2014).  Under this sub-model, closed-form solutions exist for the RPDE sub-model and the VM provisioning sub-models.

1.1 The Closed Form Solution for RPDE Sub-Model

Using this closed-form sub-model, a numerical solution can be obtained in two steps.  The first step starts with some value of π(0,0) and computes all the state probabilities as functions of π(0,0) (Ghosh et al., 2013; Sakr & Gaber, 2014).  In the second step, the actual steady-state probabilities are calculated by normalization (Ghosh et al., 2013; Sakr & Gaber, 2014).  The calculation of the steady state is given in (Ghosh et al., 2013; Sakr & Gaber, 2014).  Using the Markov reward approach, the outputs of the RPDE sub-model are obtained by assigning an appropriate reward rate to each state of the CTMC and then computing the expected reward rate in the steady state (Ghosh et al., 2013; Sakr & Gaber, 2014).  There are three outputs from this RPDE sub-model: the job rejection probability, the mean queuing delay, and the mean decision delay.  The calculations for each of these outputs are detailed in (Ghosh et al., 2013; Sakr & Gaber, 2014).
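
The two-step procedure can be illustrated on a much simpler chain than the actual RPDE CTMC.  The sketch below uses a hypothetical M/M/1/K queue as a stand-in: each state probability is first expressed relative to the empty-state probability, and the values are then normalized so they sum to one.  The rates, buffer size, and reward choices are assumptions made only for the example.

    # Two-step closed-form procedure illustrated on a simple M/M/1/K queue
    # (a stand-in for the RPDE CTMC): express each state probability relative
    # to pi_0, then normalize so the probabilities sum to one.
    lam, mu, K = 0.8, 1.0, 10   # arrival rate, service rate, buffer size (assumed)

    rho = lam / mu
    unnormalized = [rho ** k for k in range(K + 1)]   # pi_k proportional to rho^k
    total = sum(unnormalized)
    pi = [p / total for p in unnormalized]            # normalized steady-state probabilities

    # Example outputs in the style of the Markov reward approach:
    p_block = pi[K]                                   # probability the buffer is full
    mean_jobs = sum(k * pi[k] for k in range(K + 1))  # mean number of jobs in the queue
    print(round(p_block, 4), round(mean_jobs, 4))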

1.2 The VM Provisioning Sub-Models

The VM provisioning sub-models capture the instantiation, configuration, and provisioning of a VM on a physical machine (PM) (Ghosh et al., 2013; Sakr & Gaber, 2014).  For each of the hot, warm, and cold physical machine pools, there is a CTMC that keeps track of the number of assigned and running VMs (Ghosh et al., 2013; Sakr & Gaber, 2014).  The VM provisioning sub-models comprise: (1) the hot PM sub-model, (2) the closed-form solution for the hot PM sub-model, (3) the warm PM sub-model, and (4) the cold PM sub-model (Ghosh et al., 2013; Sakr & Gaber, 2014).

1.2.1 The Hot PM Sub-Model

In the hot PM sub-model, the overall hot pool is modeled by a set of independent hot PM sub-models (Ghosh et al., 2013; Sakr & Gaber, 2014).  The VMs are assumed to be provisioned serially (Ghosh et al., 2013; Sakr & Gaber, 2014).

1.2.2 The Closed-Form Solution for Hot PM Sub-Model

The closed-form solution for the hot PM sub-model is derived for the steady-state probabilities of the hot PM CTMC, where the hot PM is modeled as a two-stage tandem network of queues (Ghosh et al., 2013; Sakr & Gaber, 2014).  In this closed-form solution, the queuing system consists of two nodes, node A and node B.  Node A has a single server with service rate βh, while node B has an infinite number of servers, each with service rate µ (Ghosh et al., 2013; Sakr & Gaber, 2014).  The server in node A represents the provisioning engine within the PM, while the servers in node B represent the running VMs.  The service time distribution at both nodes is exponential (Ghosh et al., 2013; Sakr & Gaber, 2014).  The calculation of the external arrival process is demonstrated in (Ghosh et al., 2013; Sakr & Gaber, 2014).  If the buffer of a PM is full, it cannot accept a job for provisioning (Ghosh et al., 2013; Sakr & Gaber, 2014).  The steady-state probabilities can be computed after solving the hot PM sub-model (Ghosh et al., 2013; Sakr & Gaber, 2014).
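
To make the tandem-queue structure concrete, the sketch below builds the generator matrix of a small CTMC with states (i, j), where i is the number of jobs waiting at the provisioning engine (node A) and j is the number of running VMs (node B), and solves πQ = 0 numerically.  The buffer size, VM capacity, and rates are assumptions, and the model is a deliberate simplification of the sub-model in (Ghosh et al., 2013), shown only to illustrate the structure.

    import numpy as np

    LAM, BETA_H, MU = 2.0, 1.0, 0.5   # arrival, provisioning, and per-VM service rates (assumed)
    BUF, MAX_VMS = 3, 4               # node A buffer size and VM capacity (assumed)

    states = [(i, j) for i in range(BUF + 1) for j in range(MAX_VMS + 1)]
    index = {s: k for k, s in enumerate(states)}
    Q = np.zeros((len(states), len(states)))

    for (i, j), k in index.items():
        if i < BUF:                               # job arrival to the provisioning buffer
            Q[k, index[(i + 1, j)]] += LAM
        if i > 0 and j < MAX_VMS:                 # provisioning engine starts one VM
            Q[k, index[(i - 1, j + 1)]] += BETA_H
        if j > 0:                                 # one of the j running VMs completes
            Q[k, index[(i, j - 1)]] += j * MU
    np.fill_diagonal(Q, -Q.sum(axis=1))

    # Solve pi Q = 0 with sum(pi) = 1 by replacing one balance equation.
    A = np.vstack([Q.T[:-1], np.ones(len(states))])
    b = np.zeros(len(states)); b[-1] = 1.0
    pi = np.linalg.lstsq(A, b, rcond=None)[0]

    p_buffer_full = sum(pi[index[(BUF, j)]] for j in range(MAX_VMS + 1))
    print(round(p_buffer_full, 4))   # probability the PM cannot accept a new job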

1.2.3 The Warm PM Sub-Model

The warm PM sub-model has three main differences from the hot PM sub-model.  The first difference is the effective arrival rate: jobs arrive at the warm PM pool only if they cannot be provisioned on any of the hot PMs (Ghosh et al., 2013; Sakr & Gaber, 2014).  The second difference is the time required to provision a VM.  When no VM is running or being provisioned, a warm PM is turned on but not yet ready for use; upon a job arrival in this state, the warm PM requires additional startup time, which introduces a delay before it is ready (Ghosh et al., 2013; Sakr & Gaber, 2014).  The time to make a warm PM ready for use is assumed to be exponentially distributed.  The third difference is that the mean time to provision a VM on a warm PM is 1/βw for the first VM deployed on that PM, while the mean time to provision VMs for subsequent jobs is the same as that for a hot PM (Ghosh et al., 2013; Sakr & Gaber, 2014).  After solving the warm PM sub-model, the steady-state probabilities are computed as detailed in (Ghosh et al., 2013; Sakr & Gaber, 2014).
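
A rough Monte Carlo sketch of the first-job delay on a warm PM versus a hot PM is given below; the startup rate and the βw and βh values are assumptions chosen only to illustrate the extra startup time and slower first-VM provisioning described above, not parameters from the cited work.

    import random

    BETA_H = 1.0      # hot PM VM provisioning rate (assumed)
    BETA_W = 0.2      # warm PM first-VM provisioning rate (assumed)
    GAMMA_W = 0.5     # warm PM startup rate, i.e., mean startup time 1/GAMMA_W (assumed)

    def first_job_delay(pool, rng):
        """Sample the delay until the first VM is ready on an idle PM."""
        if pool == "hot":
            return rng.expovariate(BETA_H)
        # Warm PM: exponential startup delay first, then slower first-VM provisioning.
        return rng.expovariate(GAMMA_W) + rng.expovariate(BETA_W)

    rng = random.Random(7)
    hot_mean = sum(first_job_delay("hot", rng) for _ in range(100_000)) / 100_000
    warm_mean = sum(first_job_delay("warm", rng) for _ in range(100_000)) / 100_000
    print(round(hot_mean, 2), round(warm_mean, 2))   # roughly 1.0 vs 2.0 + 5.0 = 7.0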

1.2.4 The Cold PM Sub-Model

The cold PM sub-model differs from the hot and warm PM sub-models discussed above in its effective arrival rate, the rate at which startup is executed, the initial VM provisioning rate, and the buffer size (Ghosh et al., 2013; Sakr & Gaber, 2014).  The computations for these factors are provided in detail in (Ghosh et al., 2013; Sakr & Gaber, 2014).

Once a job is successfully provisioned on a hot, warm, or cold PM, it utilizes the resources until its execution is complete (Ghosh et al., 2013; Sakr & Gaber, 2014).  The run-time sub-model is used to determine the mean time for job completion.  A Discrete-Time Markov Chain (DTMC) is used to capture the details of job execution (Ghosh et al., 2013; Sakr & Gaber, 2014).  The job can complete its execution with probability P0 or perform some local I/O operations with probability (1 − P0) (Ghosh et al., 2013; Sakr & Gaber, 2014).  The full calculation is detailed in (Ghosh et al., 2013; Sakr & Gaber, 2014).
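
As an illustration of this kind of DTMC, under the assumption that each execution round ends the job with probability P0 and otherwise triggers an I/O round, the expected number of execution rounds is geometric with mean 1/P0.  The sketch below computes a hypothetical mean completion time from assumed per-round durations; it is not the exact formula used in (Ghosh et al., 2013).

    P0 = 0.25            # probability the job completes after an execution round (assumed)
    T_EXEC = 2.0         # mean duration of one execution round, in seconds (assumed)
    T_IO = 0.5           # mean duration of one local I/O round, in seconds (assumed)

    # Geometric number of execution rounds: expected count is 1 / P0.
    expected_exec_rounds = 1 / P0
    # Each non-final round (probability 1 - P0) is followed by an I/O round,
    # so the expected number of I/O rounds is (1 - P0) / P0.
    expected_io_rounds = (1 - P0) / P0

    mean_completion_time = expected_exec_rounds * T_EXEC + expected_io_rounds * T_IO
    print(mean_completion_time)   # 4 * 2.0 + 3 * 0.5 = 9.5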

2. The Interactions Among Sub-Models

The sub-models discussed above can interact.  The interactions among these sub-models are illustrated in Figure 1, adapted from (Ghosh et al., 2013).

Figure 1:  Interactions among the sub-models, adapted from (Ghosh et al., 2013).

In (Ghosh et al., 2013), this interaction is discussed briefly.  The run-time sub-model yields the mean service time (1/µ), which is needed as an input parameter to each type of VM provisioning sub-model (hot, warm, or cold) (Ghosh et al., 2013).  The VM provisioning sub-models compute the steady-state probabilities (Ph, Pw, and Pc) that at least one PM in the hot, warm, or cold pool, respectively, can accept a job for provisioning (Ghosh et al., 2013).  These probabilities are used as input parameters to the RPDE sub-model (Ghosh et al., 2013).  From the RPDE sub-model, the rejection probability due to a full buffer (Pblock), the rejection probability due to insufficient capacity (Pdrop), and their sum (Preject) are obtained (Ghosh et al., 2013).  Pblock is in turn an input parameter to the three VM provisioning sub-models discussed above.  Moreover, the Mean Response Delay (MRD) is computed from the overall performance model (Ghosh et al., 2013).  Two of its components, the Mean Queuing Delay (MQD) in the RPDE and the Mean Decision Delay (MDD), are obtained from the RPDE sub-model (Ghosh et al., 2013).  Two further components, the MQD in a PM and the Mean Provisioning Delay (MPD), are obtained from the VM provisioning sub-models (Ghosh et al., 2013).  There are dependencies among the sub-models: Pblock, computed in the RPDE sub-model, is used as an input parameter to the VM provisioning sub-models (Ghosh et al., 2013), while solving the RPDE sub-model requires the outputs (Ph, Pw, Pc) of the VM provisioning sub-models as input parameters (Ghosh et al., 2013).  This cyclic dependency is resolved by fixed-point iteration using a variant of the successive substitution method (Ghosh et al., 2013).
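
The fixed-point iteration can be sketched as follows, with placeholder solver functions standing in for the actual sub-model solutions in (Ghosh et al., 2013); the stand-in update formulas, initial guess, and convergence tolerance are assumptions made only to show the successive-substitution structure.

    def solve_vm_submodels(p_block):
        """Placeholder for the hot/warm/cold provisioning sub-model solutions.
        Returns assumed acceptance probabilities (Ph, Pw, Pc) given Pblock."""
        base = 1.0 - 0.5 * p_block
        return 0.9 * base, 0.6 * base, 0.3 * base

    def solve_rpde_submodel(ph, pw, pc):
        """Placeholder for the RPDE sub-model solution.
        Returns an assumed Pblock given the pool acceptance probabilities."""
        return 0.2 * (1.0 - ph) + 0.1 * (1.0 - pw) + 0.05 * (1.0 - pc)

    def fixed_point(tol=1e-9, max_iters=100):
        p_block = 0.0                              # initial guess (assumed)
        for _ in range(max_iters):
            ph, pw, pc = solve_vm_submodels(p_block)
            new_p_block = solve_rpde_submodel(ph, pw, pc)
            if abs(new_p_block - p_block) < tol:   # successive substitution converged
                return ph, pw, pc, new_p_block
            p_block = new_p_block
        return ph, pw, pc, p_block

    print(fixed_point())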

3. The Monolithic Model

In (Ghosh et al., 2013), a monolithic model of the IaaS cloud is constructed using a variant of stochastic Petri nets (SPNs) called stochastic reward nets (SRNs) (Ghosh et al., 2013).  SRNs are extensions of Generalized Stochastic Petri Nets (GSPNs) (Ajmone Marsan, Conte, & Balbo, 1984), and their key features are:

  • (Ghosh et al., 2013). 

Using this monolithic model, the findings of (Ghosh et al., 2013) showed that the outputs are obtained by assigning an appropriate reward rate to each marking of the SRN and then computing the expected reward rate in the steady state.  The measures used by (Ghosh et al., 2013) are the job rejection probability (Preject) and the mean number of jobs in the RPDE (E(NRPDE)).  Preject has two components, as discussed earlier: Pblock, the rejection probability due to a full buffer, which applies when the RPDE buffer is full; and Pdrop, the rejection probability due to insufficient capacity, which applies when all hot, warm, and cold PMs are busy (Ghosh et al., 2013).  E(NRPDE), which is a measure of the mean response delay, is given by the sum of the number of jobs waiting in the RPDE queue and the job currently undergoing a provisioning decision (Ghosh et al., 2013).

4. The Findings

In (Ghosh et al., 2013; Sakr & Gaber, 2014), the SHARPE software package is used to solve the interacting sub-models and compute two main measures: (1) the job rejection probability and (2) the Mean Response Delay (MRD) (Ghosh et al., 2013; Sakr & Gaber, 2014).  The results of (Ghosh et al., 2013; Sakr & Gaber, 2014) showed that the job rejection probability increases with longer Mean Service Time (MST).  Moreover, if the PM capacity in each pool is increased, the job rejection probability is reduced at a given value of mean service time (Ghosh et al., 2013; Sakr & Gaber, 2014).  The results also showed that, with increasing MST, the MRD increases for a fixed number of PMs in each pool (Ghosh et al., 2013; Sakr & Gaber, 2014).

Comparing the scalability and accuracy of the proposed approach with the one-level monolithic model, the results of (Ghosh et al., 2013; Sakr & Gaber, 2014) showed that when the number of PMs in each pool increases beyond three and the number of VMs per PM increases beyond 38, the monolithic model runs into a memory overflow problem.  The results also indicated that the state space size of the monolithic model increases quickly and becomes too large to construct the reachability graph even for a small number of PMs and VMs (Ghosh et al., 2013; Sakr & Gaber, 2014).  The findings of (Ghosh et al., 2013; Sakr & Gaber, 2014) also showed that, for a given number of PMs and VMs, the number of non-zero elements in the infinitesimal generator matrix of the CTMC underlying the monolithic model is hundreds to thousands of times larger than for the interacting sub-models.  When using the interacting sub-models, the reduced number of states and nonzero entries leads to a concomitant reduction in the solution time needed (Ghosh et al., 2013; Sakr & Gaber, 2014).  As demonstrated by (Ghosh et al., 2013; Sakr & Gaber, 2014), the solution time for the monolithic model increases almost exponentially with model size, while the solution time for the interacting sub-models remains almost constant.  Thus, the findings of (Ghosh et al., 2013; Sakr & Gaber, 2014) indicated that the proposed approach is scalable and tractable compared with the one-level monolithic model.

Regarding the accuracy of the interacting sub-models compared with the monolithic modeling approach, when the arrival rate and the maximum number of VMs per PM are varied, the outputs obtained from the two approaches are nearly identical for the two performance measures of job rejection probability and mean number of jobs in the RPDE (Ghosh et al., 2013; Sakr & Gaber, 2014).  Thus, the errors introduced by the decomposition of the monolithic model are negligible, and the interacting sub-models approach preserves accuracy while being scalable (Ghosh et al., 2013; Sakr & Gaber, 2014).  These errors result from solving only one model for all the PMs in each pool and aggregating the obtained results to approximate the behavior of the pool as a whole (Ghosh et al., 2013; Sakr & Gaber, 2014).  The findings of (Ghosh et al., 2013; Sakr & Gaber, 2014) indicated that the values of the probabilities (Ph, Pw, Pc) that at least one PM in a pool can accept a job differ between the monolithic ("exact") model and the interacting ("approximate") sub-models (Ghosh et al., 2013; Sakr & Gaber, 2014).

Conclusion

The purpose of this project was to provide an analysis of the performance of the IaaS Cloud with an emphasis on the stochastic model.  The project began with a brief discussion of Cloud Computing and its service models of IaaS, PaaS, and SaaS.  It also discussed the three available options for performance analysis of the IaaS Cloud: experiment-based, discrete-event-simulation-based, and stochastic-model-based.  The discussion focused on the most feasible approach, the stochastic model.  The discussion of the performance analysis also included the proposed sub-models: the RPDE sub-model based on a CTMC, the Hot PM Sub-Model, the Closed-Form Solution for the Hot PM Sub-Model, the Warm PM Sub-Model, and the Cold PM Sub-Model.  The discussion and analysis also included the interactions among these sub-models and their impact on performance.  The Monolithic Model was also discussed and analyzed.  The findings of this analysis compared the scalability and accuracy of the interacting sub-models with the one-level monolithic model.  The results showed that when the number of PMs in each pool increases beyond three and the number of VMs per PM increases beyond 38, the monolithic model runs into a memory overflow problem.  The results also indicated that the state space size of the monolithic model increases quickly and becomes too large to construct the reachability graph even for a small number of PMs and VMs.  The findings also showed that, for a given number of PMs and VMs, the number of non-zero elements in the infinitesimal generator matrix of the CTMC underlying the monolithic model is hundreds to thousands of times larger than for the interacting sub-models.  When using the interacting sub-models, the reduced number of states and nonzero entries leads to a concomitant reduction in the solution time needed (Ghosh et al., 2013; Sakr & Gaber, 2014).  As demonstrated by (Ghosh et al., 2013; Sakr & Gaber, 2014), the solution time for the monolithic model increased almost exponentially with model size, while the solution time for the interacting sub-models remained almost constant.  Thus, the findings indicated that the proposed approach is scalable and tractable compared with the one-level monolithic model.  The findings of (Ghosh et al., 2013; Sakr & Gaber, 2014) also indicated that the values of the probabilities (Ph, Pw, Pc) that at least one PM in a pool can accept a job differ between the monolithic ("exact") model and the interacting ("approximate") sub-models.

References

Adebisi, A. A., Adekanmi, A. A., & Oluwatobi, A. E. (2014). A Study of Cloud Computing in the University Enterprise. International Journal of Advanced Computer Research, 4(2), 450-458.

Ajmone Marsan, M., Conte, G., & Balbo, G. (1984). A class of generalized stochastic Petri nets for the performance evaluation of multiprocessor systems. ACM Transactions on Computer Systems (TOCS), 2(2), 93-122.

Botta, A., de Donato, W., Persico, V., & Pescapé, A. (2016). Integration of Cloud Computing and Internet Of Things: a Survey. Future Generation computer systems, 56, 684-700.

Foster, I., Zhao, Y., Raicu, I., & Lu, S. (2008). Cloud Computing and Grid Computing 360-Degree Compared. Paper presented at the 2008 Grid Computing Environments Workshop.

Ghosh, R., Longo, F., Naik, V. K., & Trivedi, K. S. (2013). Modeling and performance analysis of large-scale IaaS clouds. Future Generation computer systems, 29(5), 1216-1234.

Jadeja, Y., & Modi, K. (2012). Cloud Computing-Concepts, Architecture and Challenges. Paper presented at the Computing, Electronics and Electrical Technologies (ICCEET), 2012 International Conference on.

Joshua, A., & Ogwueleka, F. (2013). Cloud Computing with Related Enabling Technologies. International Journal of Cloud Computing and Services Science, 2(1), 40. doi:10.11591/closer.v2i1.1720

Kaufman, L. M. (2009). Data Security in the World of Cloud Computing. IEEE Security & Privacy, 7(4), 61-64.

Khan, S., Khan, S., & Galibeen, S. (2011). Cloud Computing an Emerging Technology: Changing Ways of Libraries Collaboration. International Research: Journal of Library and Information science, 1(2).

Kim, W., Kim, S. D., Lee, E., & Lee, S. (2009). Adoption Issues for Cloud Computing. Paper presented at the Proceedings of the 7th International Conference on Advances in Mobile Computing and Multimedia.

Mell, P., & Grance, T. (2011). The NIST Definition of Cloud Computing.

Mokhtar, S. A., Ali, S. H. S., Al-Sharafi, A., & Aborujilah, A. (2013). Cloud Computing in Academic Institutions. Paper presented at the Proceedings of the 7th International Conference on Ubiquitous Information Management and Communication.

Qian, L., Luo, Z., Du, Y., & Guo, L. (2009). Cloud Computing: an Overview. Paper presented at the IEEE International Conference on Cloud Computing.

Sakr, S., & Gaber, M. (2014). Large Scale and Big Data: Processing and Management: CRC Press.

Timmermans, J., Stahl, B. C., Ikonen, V., & Bozdag, E. (2010). The Ethics of Cloud Computing: A Conceptual Review.

Xiao, Z., & Xiao, Y. (2013). Security and Privacy in Cloud Computing. IEEE Communications Surveys & Tutorials, 15(2), 843-859.

Zhang, Q., Cheng, L., & Boutaba, R. (2010). Cloud Computing: State-of-the-Art and Research Challenges. Journal of internet services and applications, 1(1), 7-18.