Stakeholders’ Involvement in Big Data Analytics Lifecycle

Dr. O. Aly
Computer Science

Introduction

The purpose of this discussion is to determine when and how it is beneficial for stakeholders to be involved in the data analytics lifecycle.  The discussion begins with a summary of the Big Data Analytics Lifecycle, followed by the stakeholders’ involvement in it.

Big Data Analytics Lifecycle

The data analytics lifecycle defines the analytics process for an organization’s data science project.  This process consists of six phases identified by EMC (2015): “Discovery,” “Data Preparation,” “Model Planning,” “Model Building,” “Communicate Results,” and “Operationalize.”

“Discovery” is the first phase of the data analytics lifecycle.  It determines whether there is enough information to draft an analytic plan and share it for peer review.  In this phase, the team learns the business domain, including its relevant history, and assesses the available resources: technology, time, data, and people.  The business problem is framed, and the initial hypotheses are formulated.  The key stakeholders are also identified and interviewed to understand their perspectives on the problem.  Finally, the potential data sources are identified, the aggregate data sources are captured, the raw data is reviewed, the data structures and tools needed for the project are evaluated, and the data infrastructure, such as disk storage and network capacity, is identified and scoped.

“Data Preparation” is the second phase of the data analytics lifecycle.  During this phase, the analytics sandbox and workspace are prepared, and the data is moved into the sandbox through Extract, Transform, and Load (ETL), Extract, Load, and Transform (ELT), or a combination of the two known as ETLT.  Learning about the data is very important in this phase.  Thus, access to the project data must be clarified, gaps in the data must be identified, and useful datasets outside the organization must be identified.  “Data conditioning” must also be performed, which involves cleaning the data, normalizing the datasets, and transforming the data.  The data is also surveyed and visualized with descriptive statistics during this phase.  Common tools for the “Data Preparation” phase include Hadoop, Alpine Miner, OpenRefine, and Data Wrangler.
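
The “data conditioning” step described above can be illustrated with a minimal Python sketch.  It assumes records arrive as dictionaries and that a single numeric field needs to be cleaned and then min-max normalized.  The function name `condition`, the field name `spend`, and the sample records are hypothetical and are not taken from EMC (2015).

```python
def condition(records, field):
    """Clean one numeric field and min-max normalize it to the range [0, 1].

    Hypothetical helper for illustration only.
    """
    cleaned = []
    for rec in records:
        # Cleaning: skip records where the field is missing or not numeric.
        try:
            value = float(rec[field])
        except (KeyError, TypeError, ValueError):
            continue
        cleaned.append({**rec, field: value})

    if not cleaned:
        return []

    values = [rec[field] for rec in cleaned]
    low, high = min(values), max(values)
    span = high - low

    # Normalizing: rescale to [0, 1]; if every value is equal, map all to 0.0.
    return [
        {**rec, field: (rec[field] - low) / span if span else 0.0}
        for rec in cleaned
    ]


# Assumed sample data: two of the five records are unusable.
raw = [
    {"id": 1, "spend": "10"},
    {"id": 2, "spend": ""},
    {"id": 3, "spend": "30"},
    {"id": 4},
    {"id": 5, "spend": "20"},
]
conditioned = condition(raw, "spend")
```

In practice this work is usually done with the tools named above rather than by hand; the sketch only shows the order of operations, cleaning first and normalizing second.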

“Model Planning” is the third phase of the data analytics lifecycle.  The purpose of this phase is to capture the key predictors and variables instead of considering every possible variable that might affect the outcome.  In this phase, the data is explored, the variables are selected, and the relationships between the variables are determined.  The candidate models are then identified in order to select the analytical techniques that will achieve the goal of the project.  Common tools for the “Model Planning” phase include R, SQL Analysis Services, and SAS/ACCESS.
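
One simple way to narrow a set of candidate variables down to key predictors is to rank them by their correlation with the outcome.  The sketch below is only one possible approach, not the method prescribed by EMC (2015).  The helper names, the threshold of 0.5, and the sample columns are assumptions made for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 for a constant column."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    dev_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    dev_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    if dev_x == 0 or dev_y == 0:
        return 0.0
    return cov / (dev_x * dev_y)


def select_predictors(columns, outcome, threshold=0.5):
    """Keep variables whose absolute correlation with the outcome meets the threshold."""
    scores = {name: pearson(values, outcome) for name, values in columns.items()}
    kept = [name for name, r in scores.items() if abs(r) >= threshold]
    # Strongest relationships first.
    return sorted(kept, key=lambda name: -abs(scores[name]))


# Assumed sample data: "store_id" is constant and carries no signal.
candidates = {
    "ad_spend": [1, 2, 3, 4],
    "store_id": [7, 7, 7, 7],
    "discount": [4, 3, 2, 1],
}
sales = [2, 4, 6, 8]
selected = select_predictors(candidates, sales)
```

Correlation captures only linear relationships, so a screen like this is a starting point for the exploration described above rather than a replacement for it.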

“Model Building” is the fourth phase of the data analytics lifecycle.  In this phase, the datasets are developed for training, testing, and production purposes.  The models identified in phase three are implemented and executed.  The tools needed to run these models must also be identified and examined.  Common tools for the “Model Building” phase include commercial tools such as SAS Enterprise Miner, SPSS Modeler, MATLAB, Alpine Miner, and STATISTICA, and open source tools such as R and PL/R, Octave, WEKA, Python, and SQL.
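
The development of separate datasets for training, testing, and production can be sketched as a shuffled split.  The function name `split_dataset`, the 60/20/20 proportions, and the fixed seed are assumptions chosen for illustration; the source does not specify how the datasets should be divided.

```python
import random


def split_dataset(rows, train=0.6, test=0.2, seed=42):
    """Shuffle rows and split them into training, test, and production sets.

    Rows left over after the training and test shares go to production.
    """
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)

    n_train = int(len(shuffled) * train)
    n_test = int(len(shuffled) * test)

    return {
        "training": shuffled[:n_train],
        "test": shuffled[n_train:n_train + n_test],
        "production": shuffled[n_train + n_test:],
    }


# Ten hypothetical row identifiers.
parts = split_dataset(range(10))
```

Keeping the test rows out of the training set is what allows the model from phase three to be evaluated on data it has not seen.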

“Communicate Results” is the fifth phase, which involves communicating the results to the stakeholders.  The team must determine whether the results of the project are a success or a failure based on the criteria developed in the first phase, “Discovery.”  The key findings must be identified in this phase, the business value must be quantified, and a narrative summary of the findings must be communicated to the stakeholders.

“Operationalize” is the last phase of the data analytics lifecycle.  This phase involves delivering the final reports, briefings, code, and technical documentation.  A pilot project may also be implemented in a production environment.

Stakeholders and Their Involvement

Stakeholders are those who will benefit from the project.  They include the project sponsor, project manager, business intelligence analyst, data engineer, data scientist, database administrator, and business user.  The first phase, “Discovery,” is regarded as the right phase for the project managers and the key stakeholders to negotiate the right resources, rather than leaving this negotiation to a later stage.  In this phase, it is critical to document the problem statement, the goal statement, the objectives of the project, the requirements to achieve these objectives, the success criteria, and the minimum acceptable outcome for the project, and to share them with the key stakeholders.  The analytics problem should be clarified and framed in collaboration with the stakeholders.  However, in some situations, the project sponsors may have a pre-defined solution, which can be biased.  In such cases, a more objective approach is preferable to adopting the sponsors’ pre-defined solution.  In the same phase, the hypotheses should be developed and evaluated in collaboration with the stakeholders.  The stakeholders, as experts in the domain area, can offer suggestions and concepts to be tested while the hypotheses are formulated.  These concepts can expand the scope of the project.  The last phase is also an important phase for stakeholder involvement, because the results and findings of the project should be presented and communicated to the stakeholders.

In summary, the analytic team should collaborate with the stakeholders at the beginning of the project to understand the requirements, the objectives, and the hypotheses, and at the end of the project to share the results and the findings.  The analytic team tends to be more objective than the stakeholders.  Thus, the more objectively a project is driven, the more likely it is to succeed.

References

EMC. (2015). Data Science and Big Data Analytics: Discovering, Analyzing, Visualizing and Presenting Data (1st ed.). Wiley.