Data Science Deconstructed and Defined

The foundation of data science can be deconstructed into six main processes that together define the work of a great data scientist. Because several roles fall under the discipline of data science, these are the processes most commonly shared across them.

Once you understand the overall data science process, the next step is to learn the basic skills required for each of the six areas. Again, these are the skills most common across the data science field.

Data scientist AJ Goldstein breaks the processes down in a meaningful way.

1. Frame the problem – 

Find out exactly what the organization wants to learn. Determine the overall strategy or goal.

SKILLS: Know the organization's needs, metrics, priorities/goals, and available resources, and have a team in place to do the work.

2. Identify All Available Datasets/Extract Data Into a Usable Format – 

Gather all of the raw data from external and internal sources and put them into a usable format to work with.

SKILLS: Strong database skills (MongoDB, Oracle, MySQL, etc.), SQL for querying, ability to retrieve unstructured data, and ability to work with distributed storage and processing systems (Hadoop HDFS, Spark, Flink, Apache Storm).
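
As a rough illustration of "SQL for querying" and getting data into a usable format, here is a minimal sketch using Python's built-in sqlite3 module. SQLite only stands in for whichever database the organization actually runs, and the `orders` table, its columns, and the sample rows are all made up for the example.

```python
import sqlite3

# Hypothetical in-memory database standing in for an internal data source.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets each row be read by column name

# Made-up table and rows, purely for illustration.
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, customer TEXT, amount REAL)"
)
conn.executemany(
    "INSERT INTO orders (customer, amount) VALUES (?, ?)",
    [("alice", 20.0), ("bob", 35.5), ("alice", 12.5)],
)


def total_spend_by_customer(connection):
    """Aggregate raw order rows into one plain dict per customer."""
    cursor = connection.execute(
        "SELECT customer, SUM(amount) AS total "
        "FROM orders GROUP BY customer ORDER BY customer"
    )
    return [dict(row) for row in cursor]


records = total_spend_by_customer(conn)
print(records)
```

The query does the aggregation inside the database, and the result comes back as ordinary Python dictionaries that later steps can work with directly.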

3.  Examine the Data at a High Level/Clean the Data – 

Understand what data is available, what is missing, and what is junk. Identify what needs to be fixed, replaced, or discarded as bad data.

SKILLS: Knowledge of a scripting language (Python, R, JavaScript, Scala), ability to do data wrangling and cleansing (pandas, NumPy, PySpark), and experience with distributed processing (MapReduce, Spark).
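
To make the cleaning step more concrete, here is a small pandas sketch. The tiny dataset, the column names, and the cleaning rules (a negative age counts as junk, missing ages get the median, missing cities become "unknown") are assumptions chosen for illustration. Real rules depend on the data and the problem framed in step 1.

```python
import numpy as np
import pandas as pd

# Made-up raw data with a missing age, a duplicate row,
# an impossible (negative) age, and a missing city.
raw = pd.DataFrame(
    {
        "age": [34, np.nan, 29, 29, -1],
        "city": ["Austin", "Boston", "Chicago", "Chicago", None],
    }
)

# High-level look: how many values are missing in each column?
print(raw.isna().sum())


def clean(df):
    """Apply the illustrative cleaning rules and return a new DataFrame."""
    out = df.drop_duplicates().copy()
    # Treat impossible ages as missing rather than keeping junk values.
    out.loc[out["age"] < 0, "age"] = np.nan
    # Replace missing ages with the median of the remaining ages.
    out["age"] = out["age"].fillna(out["age"].median())
    # Mark missing cities explicitly instead of dropping the row.
    out["city"] = out["city"].fillna("unknown")
    return out.reset_index(drop=True)


cleaned = clean(raw)
print(cleaned)
```

Whether to fill, flag, or drop a bad value is a judgment call. The point of the sketch is only that each decision is written down as an explicit, repeatable rule.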

4. Play Around with the Data/Identify Patterns and Extract Features –

Segment the data and format it to see if there is anything that stands out. Use statistical models to see if there are any significant patterns or important points to address.

SKILLS: Ability to do some form of scientific computing (Python is the most popular choice), knowledge of statistics for hypothesis testing and confirming causes, and ability to run A/B tests.
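
For a sense of what hypothesis testing in an A/B test can look like, here is a sketch of a two-proportion z-test written with only Python's standard library. The function name and the conversion counts are hypothetical, and this is just one common test among several that could apply.

```python
import math


def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (z, two-sided p-value) for the difference in conversion rates.

    conv_a / conv_b are conversion counts; n_a / n_b are group sizes.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both groups convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Made-up example: variant B converts 130 of 1000 visitors, variant A 100 of 1000.
z, p_value = two_proportion_z_test(100, 1000, 130, 1000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

A small p-value suggests the difference between the two variants is unlikely to be due to chance alone. It is random assignment to A or B that allows the difference to be read as causal.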

5. Create Predictive Model/Evaluate and Refine the Model – 

AJ suggests building the model from the feature vectors produced in step 4 and going back over steps 2-4 as needed, since each pass provides further data analysis to refine the model.

SKILLS: Machine learning skills such as supervised and unsupervised learning, familiarity with machine learning libraries and tools, and advanced algebra and calculus.
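
The create/evaluate loop can be sketched in a few lines. The example below assumes a simple supervised problem, a linear regression fitted by least squares with NumPy, and uses synthetic data generated on the spot. The helper names, the 150/50 split, and the choice of R² as the score are illustrative. In practice a dedicated machine learning library would usually handle these steps.

```python
import numpy as np

# Synthetic feature vectors and target, generated only for this example.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 2))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 5.0 + rng.normal(0, 1.0, size=200)

# Hold out the last 50 rows so the model is scored on data it has not seen.
split = 150
X_train, X_test = X[:split], X[split:]
y_train, y_test = y[:split], y[split:]


def fit_linear(features, target):
    """Least-squares fit; returns the weights followed by the intercept."""
    design = np.column_stack([features, np.ones(len(features))])
    coef, *_ = np.linalg.lstsq(design, target, rcond=None)
    return coef


def predict(coef, features):
    """Apply fitted weights and intercept to new feature vectors."""
    return np.column_stack([features, np.ones(len(features))]) @ coef


def r_squared(y_true, y_pred):
    """Share of the target's variance explained by the predictions."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


coef = fit_linear(X_train, y_train)
test_r2 = r_squared(y_test, predict(coef, X_test))
print(f"coefficients: {coef}, held-out R^2: {test_r2:.3f}")
```

If the held-out score is poor, that is the signal to go back to the earlier steps for better data or better features before trying a more complex model.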

6. Identify Business Insights/Visualize Your Findings/Tell a Clear and Actionable Story – 

At this point, it's time to look back at the original business question or problem and show a simple visualization of the findings. The visualization should be understandable to non-technical audiences so that the results are communicated clearly.

SKILLS: Ability to explain the data in non-technical terms and use visualization tools to show the findings. The findings typically take the form of data storytelling, reports, presentations, etc.

Helpful Notice

Missing any of the skill sets or need assistance? Please check out the Training, Cheat Sheets, and Practice/Participate options.

AJ also broke the process down into technical and non-technical functions.

The Core 20