The Challenges and Benefits of Data Warehousing and Data Mining Techniques.

Dr. O. Aly
Computer Science

In the age of big data, a considerable variety, volume, and velocity of data are being generated. The data are being generated by people, machines, the Web, and information systems. Harnessing these data and making sense of them in real time or near real time to develop actionable intelligence is one of the big challenges facing organizations. Data are stored in warehouses, and they are then mined to generate insights. Analytical techniques that are used include statistical techniques, machine learning, and others. The purpose of this discussion is to address the challenges and benefits of data warehousing and data mining techniques.

Data Warehousing

Data warehousing is defined as a subject-oriented, integrated, time-variant, and non-volatile collection of data in support of the decision-making process (Connolly & Begg, 2015). Since the 1970s, enterprises have mostly focused their investment on new information systems that automate business processes. Businesses gained competitive advantages through these systems, which provided more efficient and cost-effective services to customers. Organizations stored their data in operational databases; however, an operational database is designed for daily operations, not for supporting the decision-making process. Enterprises therefore faced the challenge of turning archived data into a source of knowledge. The concept of the data warehouse emerged as the solution to this requirement: a system capable of supporting decision making and receiving data from various operational sources (Connolly & Begg, 2015; Coronel & Morris, 2016).

The concept of the data warehouse (DW) was devised by IBM as the “information warehouse,” a solution for accessing data held in non-relational systems (Connolly & Begg, 2015). It was proposed to allow businesses to use archived data to gain a business advantage. However, due to the complexity of the implementation, the early attempts at creating an information warehouse were mostly rejected. The concept of data warehousing has been raised several times since then, and in recent years data warehousing has come to be viewed as a valuable and viable solution for businesses. Bill Inmon is regarded as the father of DW because he was one of the earliest promoters of data warehousing (Connolly & Begg, 2015; Guohong, Lijun, Junhui, & Peixin, 2010).

Data Warehouse Characteristics
The data warehouse (DW) database is another type of database in a management information system; it acts as “one-stop shopping” and focuses on supporting informed and actionable decision making (Ally & Khan, 2016; Coronel & Morris, 2016). It is a central location for knowledge creation that mitigates the challenge of various independent sources of data. This type of database is distinguished from other databases such as transactional or operational databases (Ally & Khan, 2016; Coronel & Morris, 2016). Unlike the operational database, the DW collects consolidated and summarized data used in the decision-making process. The DW has four significant characteristics proposed by two DW icons, Kimball and Inmon: it is integrated, subject-oriented, time-variant, and non-volatile (Ally & Khan, 2016; Connolly & Begg, 2015; Coronel & Morris, 2016).
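
The four characteristics can be made concrete with a small sketch. The Python code below is an illustration only, not an implementation taken from the cited sources; the class names (CustomerFact, Warehouse), the field names, and the two source systems are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)  # frozen: a loaded fact is never modified (non-volatile)
class CustomerFact:
    customer_id: str      # subject-oriented: organized around the customer
    source: str           # operational system that supplied the record
    snapshot_date: date   # time-variant: each row is tied to a point in time
    total_spend: float


class Warehouse:
    """Hypothetical append-only store used only for this illustration."""

    def __init__(self):
        self._facts = []

    def load(self, source, rows, snapshot_date):
        """Integrated: rows from different sources are mapped to one schema."""
        for row in rows:
            self._facts.append(CustomerFact(
                customer_id=str(row["id"]).strip().upper(),
                source=source,
                snapshot_date=snapshot_date,
                total_spend=float(row["spend"]),
            ))

    def history(self, customer_id):
        """Return every snapshot for one customer, oldest first."""
        return sorted(
            (f for f in self._facts if f.customer_id == customer_id),
            key=lambda f: f.snapshot_date,
        )


dw = Warehouse()
# Two assumed sources with inconsistent formatting for the same customer.
dw.load("crm", [{"id": " c001 ", "spend": "120.50"}], date(2024, 1, 31))
dw.load("erp", [{"id": "C001", "spend": 80}], date(2024, 2, 29))
c001_history = dw.history("C001")
```

New snapshots are appended rather than overwriting earlier ones, which is what allows the warehouse to answer historical questions that an operational database, holding only the current state, cannot.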

Data Warehouse Architecture
Various studies have proposed different architectures for the data warehouse. The architecture selected for this discussion includes customer relationship management (CRM) and enterprise resource planning (ERP) (Guohong et al., 2010). CRM integrates the scattered, isolated data in the enterprise for a comprehensive and complete understanding of customers. Online analytical processing (OLAP) is a software technology that allows analysts and managers to access the data quickly, consistently, and interactively. Figure 1 shows the holistic view of the data warehouse framework.

Figure 1. A Holistic View of DW Framework (Guohong et al., 2010)

Benefits and Challenges of Data Warehousing
The successful implementation of a data warehouse can bring significant advantages to a business. Enterprises can gain potentially high returns on investment, competitive advantage, and increased productivity of corporate decision makers. According to a study cited in Connolly and Begg (2015), data warehouse projects delivered an average three-year return on investment of 401%. This high ROI places enterprises that successfully implemented data warehousing projects at a competitive advantage. Businesses gain competitive advantages when decision makers can access data that reveal previously unavailable, unknown, and untapped information on customers, products, trends, and demands. The successful implementation of data warehousing improves the productivity of enterprise decision makers by creating an integrated database of consistent, subject-oriented, and historical data. The data warehouse can integrate data from various independent data sources and transform these data into meaningful information, providing decision makers with substantive, accurate, and consistent analysis (Connolly & Begg, 2015; Coronel & Morris, 2016).

Data warehousing is confronted with various challenges. Underestimating the resources required for the data ETL (extract, transform, and load) process is one of the significant challenges (Connolly & Begg, 2015; Coronel & Morris, 2016). Hidden problems with source systems and required data that are not captured are other challenges that the data warehouse faces. Other challenges include increased end-user demands, data homogenization, high demand for resources, data ownership, high maintenance, long-duration projects, and complexity of integration. In the era of Big Data and Big Data Analytics (BDA), the data warehouse is confronted with the additional challenges of new technologies such as Hadoop, MapReduce, Cloud Computing, and so forth. The data warehouse was initially designed for historical data. However, with BDA, a real-time (RT) or near-real-time (NRT) data warehouse is required. Thus, there is increased demand to design the DW to enable RT/NRT extraction and the modeling of RT fact tables, and to address scalability and query contention (Connolly & Begg, 2015; Coronel & Morris, 2016).
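
To show why the ETL process consumes resources, the following minimal sketch walks through the three steps in Python using only the standard library. It is an illustration under stated assumptions: the inline CSV export, the sales_fact table, and the cleaning rules are hypothetical and do not come from the cited sources.

```python
import csv
import io
import sqlite3

# Hypothetical export from an operational system, inlined so the sketch runs as-is.
SOURCE_CSV = """order_id,customer,amount,order_date
1,alice,19.99,2024-03-01
2,BOB,5.00,2024-03-01
3,alice,,2024-03-02
"""


def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop incomplete rows and normalize names and types."""
    clean = []
    for row in rows:
        if not row["amount"]:
            continue  # required data was not captured by the source system
        clean.append((
            int(row["order_id"]),
            row["customer"].strip().lower(),
            float(row["amount"]),
            row["order_date"],
        ))
    return clean


def load(conn, rows):
    """Load: write the cleaned rows into the (assumed) warehouse fact table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sales_fact ("
        "order_id INTEGER PRIMARY KEY, customer TEXT, amount REAL, order_date TEXT)"
    )
    conn.executemany("INSERT INTO sales_fact VALUES (?, ?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(conn, transform(extract(SOURCE_CSV)))
loaded = conn.execute(
    "SELECT customer, amount FROM sales_fact ORDER BY order_id"
).fetchall()
```

Even in this toy case, the transform step has to deal with a missing amount and inconsistent capitalization. Hidden problems of this kind in real source systems are one reason ETL effort is so often underestimated.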

Data Mining

The data warehouse, OLAP, and data mining are essential technologies that form critical components of a Business Intelligence implementation (Connolly & Begg, 2015). The value of the data warehouse is determined by how well it provides data to end users through appropriate analytical tools such as data mining and OLAP (Connolly & Begg, 2015). Because OLAP and data mining tools differ in what they offer to end users, they are regarded as complementary technologies (Connolly & Begg, 2015). While OLAP employs advanced data analysis and presentation tools, including multi-dimensional data analysis, data mining provides advanced statistical tools not only to analyze the large volumes of data available through data warehouses and other sources but also to identify possible relationships and anomalies (Connolly & Begg, 2015).
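
The multi-dimensional analysis that OLAP offers can be sketched with two basic operations, roll-up and slice. The Python code below is a simplified illustration, not how an OLAP server is implemented; the fact rows, the dimension names, and the helper functions roll_up and slice_rows are assumptions made for the example.

```python
from collections import defaultdict

# Hypothetical fact rows: (region, product, quarter, sales).
facts = [
    ("East", "Laptop", "Q1", 100),
    ("East", "Phone", "Q1", 50),
    ("West", "Laptop", "Q1", 70),
    ("East", "Laptop", "Q2", 120),
    ("West", "Phone", "Q2", 30),
]
DIMENSIONS = ("region", "product", "quarter")


def roll_up(rows, keep):
    """Aggregate sales over every dimension that is not listed in `keep`."""
    idx = [DIMENSIONS.index(d) for d in keep]
    totals = defaultdict(int)
    for row in rows:
        totals[tuple(row[i] for i in idx)] += row[-1]
    return dict(totals)


def slice_rows(rows, dimension, value):
    """Keep only the rows where one dimension has a fixed value."""
    i = DIMENSIONS.index(dimension)
    return [r for r in rows if r[i] == value]


by_region = roll_up(facts, ["region"])
by_region_quarter = roll_up(facts, ["region", "quarter"])
q1_by_product = roll_up(slice_rows(facts, "quarter", "Q1"), ["product"])
```

These operations summarize what has already happened along chosen dimensions. This retrospective view is what data mining complements by searching for relationships and anomalies.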

Data mining is “the process of discovering meaningful new correlations, patterns, and trends by mining large amounts of data using statistical, mathematical, and AI techniques. Data mining has the potential to supersede the capabilities of OLAP tools, as the major attraction of data mining is its ability to build predictive rather than retrospective models” (Connolly & Begg, 2015). While traditional BI tools are “reactive,” data mining is regarded as “proactive”: end users do not have to identify the problem and select the data to be analyzed, as they do with traditional BI tools; rather, data mining tools identify the problem by automatically searching the data for anomalies and possible relationships (Coronel & Morris, 2016). Thus, data mining involves four tasks: (1) analyzing the data, (2) discovering the problems or opportunities that might be hidden in the relationships in the data, (3) formulating a model based on the findings, and (4) using the model to predict the behavior of the business, which requires minimal intervention from end users (Coronel & Morris, 2016). As a result of these activities, the business can use the findings to obtain knowledge that can lead to competitive advantages (Coronel & Morris, 2016). In summary, data mining is described as the analytical tool that “initiate analyses to create knowledge” (Coronel & Morris, 2016). This knowledge represents very specialized information (Coronel & Morris, 2016).
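
The difference between a retrospective summary and a predictive model can be shown with a very small example. The sketch below fits a least-squares line to made-up monthly sales figures and uses it to project the next month. Linear regression is chosen here only for simplicity; the cited sources do not prescribe it, and the data and function names are hypothetical.

```python
def fit_line(xs, ys):
    """Formulate a model: ordinary least-squares fit of y = slope * x + intercept."""
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def predict(model, x):
    """Use the model: estimate y for an x that has not been observed yet."""
    slope, intercept = model
    return slope * x + intercept


# Hypothetical history: month number and units sold.
months = [1, 2, 3, 4]
units = [3, 5, 7, 9]

retrospective_average = sum(units) / len(units)  # describes the past only
model = fit_line(months, units)
forecast_month_5 = predict(model, 5)             # projects the trend forward
```

The average describes what already happened, while the fitted model produces an estimate for a period with no data yet, which is the sense in which data mining models are predictive.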

Data Mining Techniques

Data mining techniques involve four essential operations: (1) “Predictive Modeling,” (2) “Database Segmentation,” (3) “Link Analysis,” and (4) “Deviation Detection” (Connolly & Begg, 2015). The “Predictive Modeling” operation implements the classification and prediction techniques. The “Database Segmentation” operation implements demographic clustering and neural clustering techniques (Connolly & Begg, 2015). The “Link Analysis” operation implements association discovery, sequential pattern discovery, and similar time sequence discovery techniques (Connolly & Begg, 2015). The “Deviation Detection” operation implements statistics and visualization techniques (Connolly & Begg, 2015). Although a business can implement any of these four operations, certain associations exist between business applications and data mining operations (Connolly & Begg, 2015). For instance, “Retail/Marketing” applies the “Database Segmentation” operation, while “Fraud Detection” applies any of the four operations (Connolly & Begg, 2015).
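
As one concrete instance of the “Link Analysis” operation, association discovery is commonly expressed through the support and confidence of item combinations. The Python sketch below computes both measures by brute force over a handful of made-up shopping baskets. It is an illustration only: the transactions and function names are hypothetical, and production tools use more efficient algorithms.

```python
from itertools import combinations

# Hypothetical market baskets.
transactions = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Of the baskets containing `antecedent`, the fraction that also contain `consequent`."""
    both = set(antecedent) | set(consequent)
    return support(both, baskets) / support(antecedent, baskets)


def frequent_pairs(baskets, min_support):
    """Brute-force search for item pairs whose support meets the threshold."""
    items = sorted(set().union(*baskets))
    result = {}
    for pair in combinations(items, 2):
        s = support(pair, baskets)
        if s >= min_support:
            result[pair] = s
    return result


pairs = frequent_pairs(transactions, min_support=0.5)
conf_bread_milk = confidence({"bread"}, {"milk"}, transactions)
```

A rule such as "customers who buy bread also buy milk" is then judged by how often the pair occurs (support) and how reliably the second item follows the first (confidence).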

The Machine Learning Algorithm
“Supervised” and “unsupervised” learning are the most common machine learning techniques implemented in various domains, particularly the “Data Mining” domain (Hall, Dean, Kabul, & Silva, 2014). A supervised learning algorithm (SLA) uses labeled data to train a model (Hall et al., 2014). It comprises the “Prediction” (“Regression”) algorithm and the “Classification” algorithm. The “Regression” or “Prediction” algorithm is used for “interval labels,” while the “Classification” algorithm is used for “class labels” (Hall et al., 2014). In the SL algorithm, the training data, represented as observations, measurements, and so forth, are associated with labels reflecting the class of the observations (Han, Pei, & Kamber, 2011). New data are classified based on the “training set” (Han et al., 2011). Unsupervised learning (UL) occurs when a model is trained on unlabeled data (Hall et al., 2014). The UL algorithm typically segments data into “groups of examples” called “Clusters” or “groups of features” called “Feature Extraction” (Hall et al., 2014). The UL technique can be either the “end goal of a machine learning task,” as is the case with “Market Segmentation,” or a “preliminary or pre-processing step in a supervised learning task” (Hall et al., 2014). When the UL algorithm is used, the class labels of the training data are “unknown” (Han et al., 2011). The UL algorithm is used to establish the existence of classes or clusters in the data, given a set of measurements and observations (Han et al., 2011).
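
The contrast between the two families can be sketched in a few lines of Python. The example below uses a one-nearest-neighbor classifier to stand in for supervised classification and a naive k-means loop to stand in for unsupervised clustering. Both algorithms, the sample points, and the labels are chosen here purely for illustration and are not taken from the cited sources.

```python
import math


def nearest_neighbor_predict(train_points, train_labels, x):
    """Supervised: assign x the class label of its closest labeled training point."""
    best = min(range(len(train_points)),
               key=lambda i: math.dist(train_points[i], x))
    return train_labels[best]


def k_means(points, k, iterations=10):
    """Unsupervised: group unlabeled points into k clusters (naive k-means)."""
    centroids = [points[i] for i in range(k)]  # simplistic init: first k points
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assign each point to its nearest centroid.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        # Move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return assignment, centroids


# Supervised: the training set carries class labels.
train_points = [(1.0, 1.0), (1.2, 0.8), (5.0, 5.0), (5.2, 4.9)]
train_labels = ["low", "low", "high", "high"]
predicted = nearest_neighbor_predict(train_points, train_labels, (4.8, 5.1))

# Unsupervised: the same kind of data, but with no labels supplied.
unlabeled = [(1.0, 1.0), (5.0, 5.0), (1.2, 0.8), (5.2, 4.9)]
clusters, centers = k_means(unlabeled, k=2)
```

In the supervised case the labels "low" and "high" are given in advance and the model reproduces them for new data. In the unsupervised case the algorithm only reports that two groups exist, and interpreting those groups (for example, as market segments) is left to the analyst.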

Benefits and Challenges

The goal of data mining is to extract value from data. Enterprises can use this information to make sound decisions and gain competitive advantages (Che, Safran, & Peng, 2013). Organizations can benefit from data mining by discovering concept/class descriptions, associations and correlations, classification, prediction, clustering, trend analysis, and outlier and deviation analysis when making strategic and tactical decisions (Hand, Mannila, & Smyth, 2001; Linoff & Berry, 2011; Rygielski, Wang, & Yen, 2002). However, data mining is confronted with various challenges, including the development of parallel or high-performance algorithms, theoretical models, and data mining techniques (Dubitzky, 2008). Distributed data mining algorithms should support the complete data mining process, from pre-processing to data mining to post-processing. The design of new data mining systems and architectures that use computing resources efficiently is another challenging area for data mining. Further development challenges arise in several areas, such as the high complexity of many data mining applications, the variety of data sources with different data models, and the volume of the data (Dubitzky, 2008).

Conclusion

This discussion addressed two significant topics: the data warehouse and data mining. It began with the data warehouse and its evolution from IBM's information warehouse. Due to its complexity, the concept disappeared for a while but later resurfaced. Bill Inmon is regarded as the father of the data warehouse. The benefits of the data warehouse to businesses are substantial, including high return on investment, competitive advantage, and increased productivity of decision makers. However, data warehouse project implementation is confronted with various challenges, especially in the age of Big Data Analytics and emerging technologies such as Hadoop. Data mining is another technique that organizations embrace to extract value from data. Data mining offers various techniques, including supervised and unsupervised algorithms. Like the data warehouse, data mining helps organizations gain a competitive edge, and like the data warehouse, it is also confronted with various challenges. Organizations should analyze each technique before embracing the technology in order to understand the benefits as well as the challenges.

References

Ally, S. S., & Khan, N. (2016, 15-17 Dec. 2016). Data Warehouse and BI to Catalize Information Use in Health Sector for Decision Making: A Case Study. Paper presented at the 2016 International Conference on Computational Science and Computational Intelligence (CSCI).

Che, D., Safran, M., & Peng, Z. (2013). From Big Data to Big Data Mining: Challenges, Issues, and Opportunities. Paper presented at the International Conference on Database Systems for Advanced Applications.

Connolly, T., & Begg, C. (2015). Database Systems: A Practical Approach to Design, Implementation, and Management (6th ed.): Pearson.

Coronel, C., & Morris, S. (2016). Database systems: design, implementation, & management: Cengage Learning.

Dubitzky, W. (2008). Data Mining in Grid Computing Environments: John Wiley & Sons.

Guohong, G., Lijun, X., Junhui, F., & Peixin, Q. (2010). The building of Customer Relationship Management system based on OLAP. Paper presented at the Industrial Mechatronics and Automation (ICIMA), 2010 2nd International Conference on.

Hall, P., Dean, J., Kabul, I. K., & Silva, J. (2014). An Overview of Machine Learning with SAS® Enterprise Miner™. SAS Institute Inc.

Han, J., Pei, J., & Kamber, M. (2011). Data mining: concepts and techniques: Elsevier.

Hand, D. J., Mannila, H., & Smyth, P. (2001). Principles of data mining.

Linoff, G. S., & Berry, M. J. (2011). Data mining techniques: for marketing, sales, and customer relationship management: John Wiley & Sons.

Rygielski, C., Wang, J.-C., & Yen, D. C. (2002). Data mining techniques for customer relationship management. Technology in society, 24(4), 483-502.

Published by Think and Knowledge Tank