The Big Data Lifecycle explained
Big Data analysis differs greatly from traditional data analysis methods. The 5V characteristics of Big Data (volume, velocity, variety, veracity and value) define Big Data analysis and demand a structured approach. To keep the analysis of Big Data uniform, organizations can use the Big Data Lifecycle. The Big Data Lifecycle consists of 9 steps that address the different phases in the analysis of data. It is a step-by-step process that can help organizations systematically analyse data:
- Business Case Evaluation
- Data Identification
- Data Acquisition and Filtering
- Data Extraction
- Data Validation and Cleansing
- Data Aggregation and Representation
- Data Analysis
- Data Visualization
- Utilisation of Analysis Results
A brief overview of each of the Big Data Lifecycle phases is provided below. They are discussed in further detail in the Big Data Science Professional course.
1. Business Case Evaluation
The Big Data Lifecycle begins with a sound evaluation of the business case. Before any Big Data project can be started, it needs to be clear what the business objectives and results of the data analysis should be. Begin with the end in mind and clearly define the objectives and desired results of the project. Many different forms of data analysis could be conducted, but what exactly is the reason for investing time and effort in data analysis? As with any good business case, the proposal should be backed up by financial data.
2. Data Identification
The Data Identification stage determines the origin of the data. Before data can be analysed, it is important to know what the sources of the data will be. Especially if data is procured from external suppliers, it is necessary to clearly identify what the original source of the data is and how reliable the dataset is (frequently referred to as the veracity of the data). This second stage of the Big Data Lifecycle is very important, because unreliable input data will inevitably lead to unreliable output.
3. Data Acquisition and Filtering
The Data Acquisition and Filtering stage builds upon the previous stage of the Big Data Lifecycle. In this stage, the data is gathered from different sources, both from within the company and outside of it. After the acquisition, a first round of filtering is conducted to remove corrupt data. Additionally, data that is not necessary for the analysis is filtered out as well. The filtering step is applied to each data source individually, before the data is aggregated into the data warehouse.
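A minimal sketch of such a per-source filtering step, assuming records arrive as dictionaries; the required and relevant field names here are purely illustrative:

```python
REQUIRED = {"id", "amount"}        # assumed fields a record must have
KEEP = {"id", "amount", "region"}  # assumed fields relevant to the analysis

def filter_source(records):
    """Filter one data source before it is aggregated with the others."""
    clean = []
    for rec in records:
        # Discard corrupt records: missing or empty required fields.
        if not REQUIRED.issubset(rec) or rec["amount"] is None:
            continue
        # Drop fields the analysis does not need.
        clean.append({k: v for k, v in rec.items() if k in KEEP})
    return clean

source_a = [
    {"id": 1, "amount": 10.0, "region": "EU", "debug_flag": True},
    {"id": 2, "amount": None, "region": "US"},   # corrupt: no amount
    {"id": 3, "region": "EU"},                   # corrupt: missing amount
]
print(filter_source(source_a))
```

Each source would be run through its own filter like this before the cleaned records are combined.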
4. Data Extraction
Some of the data identified in the two previous stages may be incompatible with the Big Data tool that will perform the actual analysis. To deal with this problem, the Data Extraction stage is dedicated to extracting data from its source formats and transforming it into a format the Big Data tool is able to process and analyse. The complexity of the transformation, and the extent to which it is necessary, depends greatly on the Big Data tool that has been selected. Most modern Big Data tools can read industry-standard data formats of relational and non-relational data.
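As an illustrative sketch of such a transformation, assume a source delivers CSV while the (hypothetical) downstream tool consumes JSON lines; the Python standard library covers both formats:

```python
import csv
import io
import json

raw_csv = "id,amount\n1,10.5\n2,7.25\n"

def csv_to_json_lines(text):
    """Extract CSV rows and re-emit each one as a JSON line."""
    reader = csv.DictReader(io.StringIO(text))
    lines = []
    for row in reader:
        row["amount"] = float(row["amount"])  # cast numeric fields
        lines.append(json.dumps(row))
    return lines

for line in csv_to_json_lines(raw_csv):
    print(line)
```

In a real project this step is usually handled by an ETL tool rather than hand-written code, but the principle is the same: read the source format, emit the format the analysis tool expects.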
5. Data Validation and Cleansing
Data that is invalid leads to invalid results. To ensure only appropriate data is analysed, the Data Validation and Cleansing stage of the Big Data Lifecycle is required. During this stage, data is validated against a set of predetermined conditions and rules in order to ensure the data is not corrupt. An example of a validation rule would be to exclude all persons older than 100 years, since, given physical constraints, data about such persons is very unlikely to be correct.
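The age rule from the text can be sketched as a simple validation pass; the field names and the extra lower bound are illustrative assumptions:

```python
MAX_AGE = 100  # the validation rule from the text: older than 100 is excluded

def validate(records):
    """Split records into valid and rejected sets using predetermined rules."""
    valid, rejected = [], []
    for rec in records:
        age = rec.get("age")
        if age is None or age < 0 or age > MAX_AGE:
            rejected.append(rec)   # fails a validation rule
        else:
            valid.append(rec)
    return valid, rejected

people = [{"name": "Ann", "age": 34},
          {"name": "Bob", "age": 147},   # implausible: likely corrupt
          {"name": "Cy", "age": -2}]     # impossible: corrupt
valid, rejected = validate(people)
print(len(valid), len(rejected))
```

Keeping the rejected records (rather than silently dropping them) makes it possible to audit why data was excluded.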
6. Data Aggregation and Representation
Data may be spread across multiple datasets, requiring that those datasets be joined together before the actual analysis can be conducted. The Data Aggregation and Representation stage is dedicated to integrating multiple datasets to arrive at a unified view. Aggregating the data in advance also speeds up the analysis considerably, because the Big Data tool is no longer required to join tables from different datasets at analysis time.
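A toy sketch of this pre-joining, assuming two hypothetical datasets (customers and orders) that share a key:

```python
# Two assumed datasets that share the customer id as a key.
customers = {1: {"name": "Ann"}, 2: {"name": "Bob"}}
orders = [{"customer_id": 1, "amount": 10.0},
          {"customer_id": 1, "amount": 5.0},
          {"customer_id": 2, "amount": 7.5}]

def aggregate(customers, orders):
    """Join the datasets into one unified view: one record per customer."""
    totals = {}
    for order in orders:
        cid = order["customer_id"]
        totals[cid] = totals.get(cid, 0.0) + order["amount"]
    return [{"id": cid, "name": cust["name"], "total": totals.get(cid, 0.0)}
            for cid, cust in customers.items()]

print(aggregate(customers, orders))
```

The analysis stage can now work from this single unified view instead of joining the two datasets itself.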
7. Data Analysis
The Data Analysis stage of the Big Data Lifecycle is dedicated to carrying out the actual analysis task. It runs the code or algorithm that performs the calculations that lead to the actual result. Data Analysis can be simple or highly complex, depending on the required analysis type. In this stage the actual value of the Big Data project is generated. If all previous stages have been executed carefully, the results will be factual and correct.
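On the simple end of the spectrum, the "algorithm" may be no more than descriptive statistics over the aggregated records; a minimal sketch over an assumed numeric field:

```python
import statistics

# Assumed unified records produced by the previous stage.
records = [{"region": "EU", "amount": 10.0},
           {"region": "EU", "amount": 14.0},
           {"region": "US", "amount": 7.5}]

def analyse(records):
    """Run a deliberately simple analysis: mean and spread of one field."""
    amounts = [r["amount"] for r in records]
    return {"mean": statistics.mean(amounts),
            "stdev": statistics.stdev(amounts)}

result = analyse(records)
print(result["mean"])
```

In a real project this stage would instead run the chosen model or computation (machine learning, aggregation queries, simulations), but the position in the lifecycle is the same.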
8. Data Visualization
The ability to analyse massive amounts of data and find useful insights is one thing; communicating the results in a way that everybody can understand is something else entirely. The Data Visualization stage is dedicated to using data visualization techniques and tools to graphically communicate the analysis results for effective interpretation by business users. Frequently this requires plotting data points in charts, graphs or heat maps.
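In practice this stage uses charting libraries or dashboard tools; as a dependency-free sketch of the idea, here is a minimal text bar chart over hypothetical analysis results:

```python
# Assumed analysis results: a value per category.
results = {"EU": 24.0, "US": 7.5, "APAC": 15.0}

def bar_chart(data, width=20):
    """Render one text bar per category, scaled to the largest value."""
    peak = max(data.values())
    lines = []
    for label, value in data.items():
        bar = "#" * round(width * value / peak)
        lines.append(f"{label:<5} {bar} {value}")
    return "\n".join(lines)

print(bar_chart(results))
```

The same principle (scale values into a visual dimension a reader can compare at a glance) underlies real charts and heat maps.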
9. Utilisation of Analysis Results
After the data analysis has been performed and the results have been presented, the final step of the Big Data Lifecycle is to use the results in practice. The Utilisation of Analysis Results stage is dedicated to determining how and where the processed data can be further utilised to leverage the results of the Big Data project.