The subtleties of Data Science: What do most beginners overlook?


Who can use this post?

Before we begin this post for the Data Science aspirants out there, let’s first be clear on who can use this guide. It is meant for beginners who are confused about what a Data Science project looks like. The aim is to clarify which steps and measures are taken in a typical Data Science project.


What is and what is not?

You may have come across terms such as ‘Hypotheses matrix’, ‘Statistical tests’, ‘Modelling’ etc. I would agree that they are all part of Data Science as a whole. In this post, I’ll convey my thoughts on what a real-world project looks like, contrary to the perception I had of Data Science projects during my undergraduate years. Working as a Decision Scientist for one of the top telecommunication clients in the US, I have come to realize that there’s much more to a real-world project than just looking at data structures and coming up with some ‘fancy’ model to deploy. A common mistake most beginners make is rushing into data collection and analysis, which leaves too little time to plan and scope the work involved, understand the requirements, or even frame the business problem properly.


This is what a project life cycle looks like through the team’s lens.

Source: EMC ‘Data Science and Big Data Analytics’

Note: The term ‘model’ in the above illustration refers to the methods, techniques, framework, and workflow as a whole, not to a Machine Learning model. This post aims to keep things as simple as possible.


What are the basic elements of progressing with the project?

For any project, acquiring the business context and requirements should be the first goal for any lead or individual. This phase alone took almost a month on the project I had been working on. Some of the steps involved were:

  • Learning the business domain

  • Framing the problem and breaking it down

  • Identifying key stakeholders

  • Assessing the resources at your disposal

  • Interviewing and being inquisitive with the Analytics sponsor

For a successful take-off, getting the correct context and preparing the proper material is crucial. For instance, before we even got access to the data sources, we went ahead and created multiple hypotheses to test, sprint plans to stay on ‘track’, an analysis inventory, a business process flow, and a list of data sources and accesses.
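To make the idea of a list of hypotheses to test more concrete, here is a minimal Python sketch. The field names (`statement`, `data_needed`, `test`, `status`) and the two example hypotheses are illustrative assumptions, not the format or content used on the project described above.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """One row of a hypothetical hypotheses matrix."""
    statement: str         # what we believe might be true
    data_needed: str       # which data source could confirm or refute it
    test: str              # how we plan to check it
    status: str = "open"   # open / supported / rejected


# Illustrative entries only; a real matrix comes out of stakeholder interviews.
matrix = [
    Hypothesis(
        "Customers on older plans churn more",
        "billing history",
        "compare churn rates by plan age",
    ),
    Hypothesis(
        "Support calls spike before churn",
        "call-centre logs",
        "trend of calls in the last 90 days",
    ),
]


def open_items(rows):
    """Return the hypotheses that still need to be tested."""
    return [h for h in rows if h.status == "open"]


# Mark the first hypothesis as tested, then list what is still pending.
matrix[0].status = "supported"
for h in open_items(matrix):
    print(f"TODO: {h.statement} -> needs {h.data_needed}")
```

Writing the list down in this form before touching any data makes it obvious which data sources you need access to, and why.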


How can you apply this to yourself, if you are just a beginner?

I have been expounding on creating plans, gathering context for the project and what not. But hey! You’re in college, or maybe a beginner who has just delved into this sphere of work. I’ll make this more relatable for you, so that you can apply it to your own projects or interests.


Let’s say you take up a project or a competition from Kaggle.

Source: Kaggle.com

What do you think your first job as a participant should be? Download the ‘kick-ass’ datasets provided or fire them up in the kernels? Open up RStudio or Jupyter, import the data and start loading libraries? Simple answer: NO!

Let me walk you through the simple steps that you could adopt, which would always help you in the long term.


Choose your project wisely

You have to decide which specific problem space you want to work in. Ask yourself what you’re trying to achieve by working on that particular project or competition. Problem spaces range from Demand/Sales Forecasting and Customer Churn Prediction to Click Through Frameworks, and so on. Your aim should not be to take up a project just to fill up your resume. Always go for the learning; the resume takes care of itself.


Have the correct business context

Congratulations! You have chosen the project that you want to work on. This should mean that you know which problem space you want to deal with. Having chosen the project, work on developing your background on it. Know the ‘in and out’ of the project. That would mean eating the project description for breakfast. A sample project overview can look like the one below:

Source: Kaggle competition on ‘Google Analytics Customer revenue Prediction’

Read each and every detail about the project. Gather the material that will help you proceed with it. For the above project, say, googling what the GStore is and what it does would help you in the long run. Reading up on the 80/20 rule, revenue, revenue per customer and sales forecasting would also be a good start. Do not, I repeat, do not switch over to the Data tab yet, and ignore the prize money for now. We’re not here for the wins, right? We’re here to learn for now.


Create your plan

This is, by far, one of the most important phases of your project life-cycle. To ‘create your plan’, you could start off with the deadline. Set a deadline for yourself, otherwise you’re not going to work on it. Human nature. I know. You know that too. 😏 (Thank god for emojis. I wanted to express that smirk.)

Lay out the steps you’re going to carry out for your project. These can include setting a deadline, fixing dates for deliverables, and determining your requirements and resources.
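As a rough illustration of turning a deadline into dated deliverables, the sketch below spreads a list of steps evenly between a start date and a deadline. The function name `build_plan`, the dates and the step names are all made up for the example; adjust them to your own project.

```python
from datetime import date, timedelta


def build_plan(start, deadline, steps):
    """Spread the steps evenly between start and deadline.

    Returns a list of (step, due_date) pairs; the last step is due
    on the deadline itself.
    """
    if deadline <= start:
        raise ValueError("deadline must be after the start date")
    total_days = (deadline - start).days
    plan = []
    for i, step in enumerate(steps, start=1):
        # Integer arithmetic keeps every due date on a whole day.
        due = start + timedelta(days=total_days * i // len(steps))
        plan.append((step, due))
    return plan


# Hypothetical four-week project.
steps = [
    "Read the competition overview",
    "Explore the data",
    "Build a baseline",
    "Write up results",
]
plan = build_plan(date(2024, 1, 1), date(2024, 1, 29), steps)
for step, due in plan:
    print(f"{due.isoformat()}  {step}")
```

An even split is only a starting point; in practice you would give more days to the steps you expect to be harder.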


A typical planner would contain a list of the processes you might want to use, such as:

  • Determining the use of technologies

  • Importing and Pre-processing the data (Kaggle forums and kernels can help you fast track on that)

  • Exploratory Data Analysis

  • Processing your data according to the project requirement

  • Performing statistical tests on your hypotheses

  • Modelling
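The middle of that list (importing, exploring, and testing a hypothesis) can be sketched with nothing but the Python standard library. Everything here is an assumption made for illustration: the tiny dataset is invented, the column names `channel` and `revenue` are hypothetical, and the permutation test is just one simple choice of statistical test, not a recommendation for any particular competition.

```python
import csv
import io
import random
import statistics

# Tiny made-up dataset standing in for a competition CSV file.
RAW = """channel,revenue
organic,12.0
organic,15.5
organic,11.0
organic,14.0
paid,18.0
paid,21.5
paid,19.0
paid,22.0
"""


def load(text):
    """Import and pre-process: parse the CSV and cast revenue to float."""
    reader = csv.DictReader(io.StringIO(text))
    return [(row["channel"], float(row["revenue"])) for row in reader]


def summarize(rows):
    """Exploratory step: mean revenue per channel."""
    groups = {}
    for channel, revenue in rows:
        groups.setdefault(channel, []).append(revenue)
    return {c: statistics.mean(v) for c, v in groups.items()}


def permutation_p_value(a, b, n_iter=2000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    observed = abs(statistics.mean(a) - statistics.mean(b))
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_iter):
        rng.shuffle(pooled)
        left, right = pooled[:len(a)], pooled[len(a):]
        if abs(statistics.mean(left) - statistics.mean(right)) >= observed:
            hits += 1
    return hits / n_iter


rows = load(RAW)
means = summarize(rows)
organic = [v for c, v in rows if c == "organic"]
paid = [v for c, v in rows if c == "paid"]
p = permutation_p_value(organic, paid)
print(means, p)
```

With real data you would most likely reach for libraries such as pandas and SciPy instead, but the order of the steps stays the same: load, look, then test.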

https://www.kaggle.com/learn/overview is a good place to ramp up quickly for Kaggle competitions. Always try to choose the right study material for reference, so that you don’t lose time unnecessarily.



Execution and iterations

Your planner might be subject to iterations and changes. Do not let that alarm you. In the course of your project, you might run into hiccups. One of the hiccups I found while working on the Google Analytics Customer Revenue Prediction competition was that the target variable could be mined from an external source and used in your models to get awesome accuracy. Well, that was a bummer. My point being, there could be a hundred other problems that you face. Do not be demoralized; just try to work your way around them.
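A hiccup like that one is a form of target leakage. One generic sanity check (not what I did in that competition, just a common heuristic) is to flag any feature that correlates almost perfectly with the target. The sketch below assumes numeric columns; the feature names, the values and the 0.98 threshold are made up for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    if sx == 0 or sy == 0:
        return 0.0  # a constant column carries no signal
    return cov / (sx * sy)


def suspicious_features(features, target, threshold=0.98):
    """Names of features whose |correlation| with the target exceeds threshold."""
    return [
        name
        for name, column in features.items()
        if abs(pearson(column, target)) > threshold
    ]


# Invented data: the second feature is a near copy of the target.
target = [10.0, 0.0, 25.0, 5.0, 40.0]
features = {
    "page_views": [12, 3, 8, 15, 9],
    "external_revenue": [10.1, 0.0, 24.9, 5.2, 40.0],
}
print(suspicious_features(features, target))
```

A flagged feature is not automatically a leak, but it is worth asking where that column came from before trusting a score built on it.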


Deliver

This step is pretty much self-explanatory from the heading. Prepare all your outcomes and results in a properly presentable manner. Don’t just stack up files and folders into a zip file and send those across. Even if you have arrived at the correct answer with the correct approach, no one likes to look at haphazard files and unstructured code.

This is an ideal example of a structured kernel. https://www.kaggle.com/sudalairajkumar/simple-exploration-baseline-ga-customer-revenue (Credits: Sudalai Rajkumar, Kaggle Grandmaster)


You can see fellow Kagglers being happy with it in the comments section. That is what your output should look like. Presentable.


To summarize

Data Science is not just about looking at data and creating models to predict values. There’s a lot more involved. This post should point you in the right direction for starting your work in this field. By keeping it at the overview level, I hope to nudge you in the proper direction if and when you decide to work on a Data Science project.


That is pretty much it for this post. Hope this helps. Don’t forget to leave a like/clap on the post, if you liked it. ❤
