Data Science Process
Data science is the application of statistical, computational and mathematical techniques to create opportunities for organizations. It is a lengthy technical process, and a project may take several days or months to complete. Every step in a data science project (the Data Science Process) depends on a range of data scientist skills and data science tools. The Data Science Process starts with business understanding and data understanding of a particular domain, followed by exploratory data analysis (EDA), modeling, visualization, evaluation and deployment.
Although several domain-specific Data Science Process models have emerged recently, one can argue that there is still no universal, comprehensive Data Science Process framework in wide use across different domains. Normally, the Data Science Process begins with an interesting business question that guides the overall workflow of the data science project. Broadly, the Data Science Process can be divided into the following steps:
- Business Understanding
- Data Acquisition
- Data Preparation
- Exploratory Data Analysis (EDA)
- Data Modeling
- Evaluation and Deployment
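The steps above can be sketched as a simple pipeline in which each stage's output feeds the next. The function names and toy records below are illustrative, not a standard API:

```python
# A minimal sketch of the Data Science Process as a pipeline of stages.
# Each stage mirrors the list above; the bodies are placeholders.

def acquire_data():
    # In practice: pull from databases, APIs, or files.
    return [{"age": 34, "income": 52000}, {"age": 41, "income": 61000}]

def prepare_data(raw):
    # In practice: clean, deduplicate, and transform raw records.
    return [r for r in raw if r["age"] is not None]

def explore(data):
    # In practice: summarize, plot, and look for patterns (EDA).
    return {"mean_age": sum(r["age"] for r in data) / len(data)}

def build_model(data):
    # In practice: fit a statistical or ML model on the prepared data.
    return {"trained_on": len(data)}

def run_pipeline():
    raw = acquire_data()
    clean = prepare_data(raw)
    summary = explore(clean)
    fitted = build_model(clean)
    return summary, fitted
```

Each stage depends on the one before it, which is why a failure early in the process (for example, poor data preparation) propagates into every later stage.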
Business Understanding
A Data Scientist should understand the strategic objectives of the organization. To convert a business question into a data science solution, they need to recognize the business problem, the data analysis objectives and metrics, and the mapping to one or more business patterns. They must know the specifics of how the organization runs its business: how it is structured, who its competitors are, how many divisions and sub-divisions exist, the different objectives and targets each has, and how success or failure is evaluated. Most importantly, they should understand what the organization expects to gain from data science and how its results will be used. A strong business understanding of a data analysis project benefits both the Data Scientist and the organization.
Data Acquisition
The second step in the Data Science Process is to retrieve the raw data, without which no further activity is possible. Integrating data from various sources is a critical and time-consuming step: data sources are rarely summarized, centrally available, or ready to be used for data science. Data acquisition typically involves selecting data from the different sources. Since every piece of data is valuable, a data scientist should trace the origin of each data source and check whether its information is up to date, as this matters greatly for matching the real-time output. The received raw data are then transformed into input for the next stage.
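One way to keep the origin of each record traceable is to attach provenance metadata at acquisition time. The sketch below assumes a hypothetical CSV export (`crm_export.csv`) standing in for one of many data sources:

```python
import csv
import io
from datetime import datetime, timezone

# Hypothetical in-memory CSV export standing in for a real data source.
raw_export = io.StringIO(
    "customer_id,region,last_update\n"
    "101,EU,2024-01-15\n"
    "102,US,2023-11-02\n"
)

def acquire(source, source_name):
    """Read records and attach provenance so their origin can be traced later."""
    retrieved_at = datetime.now(timezone.utc).isoformat()
    records = []
    for row in csv.DictReader(source):
        row["_source"] = source_name          # where the record came from
        row["_retrieved_at"] = retrieved_at   # when it was pulled
        records.append(row)
    return records

records = acquire(raw_export, "crm_export.csv")
```

With `_source` and `_retrieved_at` on every record, later stages can check whether the information is stale before relying on it.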
Data Preparation
Data preparation is the process of cleaning and mapping raw data prior to processing or analysis. Preparing the data helps to catch errors before analysis; once data has been extracted from its original source, such errors become more difficult to detect and correct. It is therefore very important for a Data Scientist to refine and transform the raw data into usable datasets that can be leveraged for analysis. Data wrangling, sometimes referred to as data munging, typically follows a set of general steps: convert the raw data using algorithms, parse the information into defined data structures, and finally store the output for future use.
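Those general wrangling steps can be sketched in a few lines: parse messy raw rows into typed records, catch invalid rows before analysis, and store the result. The input rows here are hypothetical:

```python
import json

# Hypothetical messy raw input: stray whitespace and one malformed value.
raw_rows = ["  42 , 19.5 ", "17,abc", "8, 3.0"]

def parse_row(row):
    """Parse a 'age,score' string into a typed record."""
    age_s, score_s = (field.strip() for field in row.split(","))
    return {"age": int(age_s), "score": float(score_s)}

clean, errors = [], []
for row in raw_rows:
    try:
        clean.append(parse_row(row))
    except ValueError:
        errors.append(row)  # caught before analysis, not after

# Store the prepared output for the next stage.
with open("prepared.json", "w") as f:
    json.dump(clean, f)
```

Keeping the rejected rows in `errors` makes it easy to report back how much of the raw data failed validation.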
Exploratory Data Analysis (EDA)
After the essential phase of exploring and cleaning the data comes the stage of data analysis. This is the stage that helps Data Scientists decide what they really want to do with the information. The goal of data analysis is to extract useful information from the collected data and to draw findings based on it. EDA plays a key role in this phase, as summarizing the clean data helps in recognizing structure, outliers, anomalies and patterns in the processed data. This information feeds into the next stage of building the data model.
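A small example of EDA-style summarization: compute basic statistics on a (hypothetical) column of cleaned values and flag points far from the mean as candidate outliers. The two-standard-deviation threshold is one common rule of thumb, not a fixed standard:

```python
import statistics

# Hypothetical cleaned measurements; one reading looks suspicious.
values = [12, 14, 13, 15, 14, 13, 98]

mean = statistics.mean(values)
stdev = statistics.pstdev(values)  # population standard deviation

# Flag points more than two standard deviations from the mean:
outliers = [v for v in values if abs(v - mean) > 2 * stdev]
```

Summaries like these tell the Data Scientist whether the data needs further cleaning or is ready for modeling.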
Data Modeling
A model accepts the prepared data (from the data analysis stage) as input and produces the desired output. Data modeling is a set of techniques and tools used to understand and analyse whether the data will be sufficient to meet the requirements of building a data model. This step includes selecting the suitable type of model, depending on whether the problem is a classification, regression or clustering problem; the model is chosen based on the organizational problem. It is carried out using either statistical analysis or machine learning (ML) techniques. Often these models are implemented in languages such as Python, R, MATLAB or Perl.
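As a minimal sketch of the classification case, here is a 1-nearest-neighbour model fit on toy data; in practice a library such as scikit-learn would be used, and the data points and labels below are assumptions for illustration:

```python
# 1-nearest-neighbour classification: predict the label of the
# closest training example.

def euclidean(a, b):
    """Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def predict(train, point):
    """Return the label of the training example nearest to `point`."""
    nearest = min(train, key=lambda ex: euclidean(ex[0], point))
    return nearest[1]

# Toy training set: (features, label) pairs forming two clusters.
train = [((1.0, 1.0), "low"), ((1.2, 0.9), "low"),
         ((8.0, 9.0), "high"), ((7.5, 8.5), "high")]

label = predict(train, (7.0, 8.0))  # closest to the "high" cluster
```

A regression or clustering problem would swap in a different model type, but the fit-then-predict shape of the step stays the same.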
Evaluation and Deployment
This is the last phase in the Data Science Process. Evaluation is the validation of the data model: you must examine the candidate models in depth and identify the one that meets the organizational business requirements. After careful evaluation, the data model is finally deployed in the desired format. The aim of this phase is to deploy the model into a live or production-like environment for end-user acceptance. If any of the preceding steps fail, the data model will fail in the real world.
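Selecting among candidate models can be as simple as scoring each against held-out data on an agreed metric. The labels and predictions below are hypothetical, and accuracy stands in for whatever metric the business requirements dictate:

```python
# Hypothetical held-out labels and two candidate models' predictions.
actual  = [1, 0, 1, 1, 0, 1, 0, 0]
model_a = [1, 0, 1, 0, 0, 1, 0, 1]
model_b = [1, 0, 1, 1, 0, 0, 0, 0]

def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

scores = {"model_a": accuracy(actual, model_a),
          "model_b": accuracy(actual, model_b)}
best = max(scores, key=scores.get)  # the candidate to deploy
```

Only the winning candidate moves on to deployment; the evaluation scores also give a baseline to monitor once the model is live.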