What is Data Science?
Data science uses the scientific method to process data and produce useful insights (results or decisions).
Big data and data science have become so intertwined that many organizations see them as one thing. Remember, data science applies the scientific method to your data, and you don’t need a lot of data to do that. What big data provides is a robust new source of data, one that lets you ask questions that couldn’t be answered with a smaller data set. Often, more data points also provide more statistical power during analysis.
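One way to see why more data points give more power is the standard error of the mean, which shrinks with the square root of the sample size. The sketch below is only an illustration: the spread of 10 units and the sample sizes are made-up values, and standard_error is a hypothetical helper name.

```python
import math

def standard_error(sample_std: float, n: int) -> float:
    """Standard error of the mean for a sample of size n."""
    if n <= 0:
        raise ValueError("n must be positive")
    return sample_std / math.sqrt(n)

# Hypothetical spread of 10 units, measured at increasing sample sizes.
for n in (25, 100, 400):
    print(n, standard_error(10.0, n))
```

Each fourfold increase in sample size halves the standard error, which is why a larger data set can detect smaller effects.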
The data science pathway
The insights you get from data science can feel like a gift to your business, but you don’t get to just open your hands and get it delivered to you with a bow on it. Really, there are a lot of moving parts and things that have to be planned and coordinated for all of this to work properly. I like to think of data science projects like walking down a pathway, where each step gets you closer to the goal that you have in mind. And with that I want to introduce you to a way of thinking about the data science pathway.
1. define goals
What is it that you’re actually trying to find out or accomplish? Answering that lets you know when you’re on target and when you need to redirect a little bit.
2. organize resources
This can include things as simple as getting the right computers and software, accessing the data, and making sure people and their time are available.
3. coordinate people
Data science is a team effort. Not everybody’s going to be doing the same thing, and some things have to happen first while others happen later.
4. schedule the project
Schedule the project so that it doesn’t expand to fill up an enormous amount of time. Time boxing, or saying that you will accomplish this task in this amount of time, can be especially useful when you’re working on a tight timeframe, or when you have a budget and you’re working with a client.
Wrangling or preparing the data
Data wrangling, sometimes referred to as data munging, is the process of transforming and mapping data from one “raw” data form into another format with the intent of making it more appropriate and valuable for a variety of downstream purposes such as analytics.
5. get the data
You may be gathering new data, you may be using open data sources, you may be using public APIs, but you have to actually get the raw materials together.
6. clean the data
Cleaning the data is actually an enormous task within data science. It’s about getting the data ready so that it fits the paradigm you’re working in, for instance the programs and applications that you’re using, so that you can process it to get the insight that you need.
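As a rough illustration of what cleaning can involve, here is a minimal sketch using only the Python standard library. The column names, the sample rows, and the clean_rows helper are all invented for this example. It trims whitespace, normalizes capitalization, turns missing or invalid ages into None, and drops duplicate records.

```python
import csv
import io

# Made-up raw data with typical problems: stray spaces, inconsistent case,
# a missing value, an invalid value, and a duplicate record.
RAW = """name,age,city
 Alice ,34,Boston
Bob,,Chicago
alice,34,boston
Carol,n/a,Denver
"""

def clean_rows(text):
    """Trim whitespace, normalize case, convert ages, and drop duplicates."""
    seen = set()
    cleaned = []
    for row in csv.DictReader(io.StringIO(text)):
        name = row["name"].strip().title()
        city = row["city"].strip().title()
        age_text = row["age"].strip()
        # Missing or non-numeric ages become None instead of crashing later.
        age = int(age_text) if age_text.isdigit() else None
        key = (name, city, age)
        if key in seen:
            continue  # same record seen before, skip it
        seen.add(key)
        cleaned.append({"name": name, "age": age, "city": city})
    return cleaned

for record in clean_rows(RAW):
    print(record)
```

Real projects usually need rules specific to their own data, so treat these particular choices (title case, None for bad ages) as assumptions rather than a recipe.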
7. explore the data
Once the data’s prepared and it’s in your computer, you need to explore it, maybe by making visualizations or doing some numerical summaries, as a way of getting a feel for what’s going on in there.
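A first pass at exploration might look like the sketch below, which computes a few numerical summaries and prints a crude text histogram in place of a real chart. The list of ages is made up, and summarize and text_histogram are hypothetical helper names. In practice you would more likely reach for a plotting library.

```python
import statistics
from collections import Counter

def summarize(values):
    """Basic numerical summaries for a list of numbers."""
    return {
        "n": len(values),
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }

def text_histogram(values, bin_width=10):
    """Count integer values per bin and draw one row of '#' marks per bin."""
    counts = Counter((v // bin_width) * bin_width for v in values)
    lines = []
    for start in sorted(counts):
        label = f"{start}-{start + bin_width - 1}"
        lines.append(f"{label:<8}{'#' * counts[start]}")
    return lines

ages = [23, 25, 31, 35, 38, 41, 47, 52]  # made-up sample
print(summarize(ages))
print("\n".join(text_histogram(ages)))
```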
8. refine the data
Based on your exploration, you may need to refine the data. You may need to re-categorize cases. You may need to combine variables into new scores. Any of these can help you get the data ready to yield the insight you need.
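The two refinements mentioned here, re-categorizing cases and combining variables into a score, can be sketched as follows. The age bands, the survey fields q1 to q3, and the helper names are all assumptions made up for illustration.

```python
def age_group(age):
    """Re-categorize a numeric age into a coarser, hypothetical set of bands."""
    if age < 30:
        return "under 30"
    if age < 50:
        return "30-49"
    return "50 and over"

def composite_score(row, fields):
    """Combine several variables into one score by averaging them."""
    return sum(row[f] for f in fields) / len(fields)

def refine(rows, score_fields):
    """Return copies of the rows with the derived columns added."""
    return [
        {
            **row,
            "age_group": age_group(row["age"]),
            "satisfaction": composite_score(row, score_fields),
        }
        for row in rows
    ]

# Made-up survey responses on a 1 to 5 scale.
survey = [
    {"age": 27, "q1": 4, "q2": 5, "q3": 3},
    {"age": 45, "q1": 2, "q2": 3, "q3": 4},
]
refined = refine(survey, ["q1", "q2", "q3"])
for row in refined:
    print(row)
```

The originals are left untouched and the derived columns are added to copies, which makes it easier to go back if a refinement turns out to be a poor choice.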
9. create model
This is where you actually create the statistical model: you do the linear regression, the decision tree, or the deep learning neural network.
10. validate model
Then you need to validate the model. How well will it generalize from the current data set to other data sets? In a lot of research that step is left out, and you often end up with conclusions that fall apart when you go to new places. So validation is a very important part of this.
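To make the create and validate steps concrete, here is a minimal sketch of a one-variable linear regression fitted by ordinary least squares and checked against a held-out split. The data is synthetic (generated as roughly 2x + 1 plus noise), and the function names, the 25% test fraction, and the seeds are all assumptions for this example. Hold-out validation is only one of several possible validation schemes.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def mean_squared_error(xs, ys, slope, intercept):
    """Average squared gap between observed and predicted y values."""
    return sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys)) / len(xs)

def holdout_validate(xs, ys, test_fraction=0.25, seed=0):
    """Fit on a random training split and score on the held-out rest."""
    indices = list(range(len(xs)))
    random.Random(seed).shuffle(indices)
    n_test = max(1, int(len(indices) * test_fraction))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    train_x = [xs[i] for i in train_idx]
    train_y = [ys[i] for i in train_idx]
    test_x = [xs[i] for i in test_idx]
    test_y = [ys[i] for i in test_idx]
    slope, intercept = fit_line(train_x, train_y)
    return {
        "slope": slope,
        "intercept": intercept,
        "train_mse": mean_squared_error(train_x, train_y, slope, intercept),
        "test_mse": mean_squared_error(test_x, test_y, slope, intercept),
    }

# Synthetic data, made up for illustration: y is roughly 2x + 1 plus noise.
rng = random.Random(42)
xs = [float(x) for x in range(40)]
ys = [2.0 * x + 1.0 + rng.gauss(0, 1.5) for x in xs]
result = holdout_validate(xs, ys)
print(result)
```

If the error on the held-out data is much larger than the error on the training data, that is a sign the model may not generalize to other data sets.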
11. evaluate model
The next step is evaluating the model. How well does it fit the data? What’s the return on investment for it? How usable is it going to be?
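Two of these questions can be put into numbers. The sketch below uses R squared as one common measure of fit and a simple return-on-investment ratio. Both helper names and all the figures are hypothetical, and other measures may suit a given project better.

```python
def r_squared(actual, predicted):
    """Share of the variance in actual values explained by the predictions."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    if ss_tot == 0:
        raise ValueError("actual values have no variance")
    return 1 - ss_res / ss_tot

def return_on_investment(gain, cost):
    """Net gain as a fraction of cost, so 0.5 means a 50% return."""
    return (gain - cost) / cost

# Made-up numbers for illustration.
actual = [3.0, 5.0, 7.0, 9.0]
predicted = [2.8, 5.1, 7.3, 8.9]
print(r_squared(actual, predicted))
print(return_on_investment(gain=150.0, cost=100.0))
```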
12. refine model
Then, based on the evaluation, you may need to refine the model. You may need to try processing the data a different way, adjust the parameters in your neural network, or get additional variables to include in your linear regression. Any one of those can help you build a better model to achieve the goals that you had in mind in the first place.
13. present model
The last part of the data science pathway is applying the model. That starts with presenting it: showing what you learned to other people, to the decision makers, the invested parties, and your client, so they know what it is that you’ve found.
14. deploy model
Then you deploy the model. Say, for instance, you created a recommendation engine. You actually need to put it online so that it can start providing recommendations to clients, or put it into a dashboard so that it can start providing recommendations to your decision makers.
15. revisit model
You will eventually need to revisit the model, see how well it’s performing, especially when you have new data and maybe a new context in which it’s operating. And then, you may need to revise it and try the process over again.
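One simple way to decide when to revisit is to compare the model’s error on new data with the error you measured when you first validated it. The sketch below is an assumption about how that check might look: the 25% tolerance, the use of mean absolute error, and the function names are all illustrative choices.

```python
def mean_absolute_error(actual, predicted):
    """Average absolute gap between observed and predicted values."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)

def needs_revisit(baseline_error, actual, predicted, tolerance=0.25):
    """Flag the model when error on new data exceeds the baseline by more than tolerance.

    Returns a (flag, current_error) pair.
    """
    current = mean_absolute_error(actual, predicted)
    return current > baseline_error * (1 + tolerance), current

# Made-up numbers: the baseline error from validation was 1.0.
flag, current = needs_revisit(1.0, actual=[10, 12, 14], predicted=[13, 9, 17])
print(flag, current)
```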
16. archive assets
Finally, once you’ve done all of this, there’s the matter of archiving the assets. Cleaning up after yourself is very important in data science. It includes documenting where the data came from and how you processed it. It includes commenting the code that you used to analyze it. It includes making things future proof.
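Documenting where the data came from and how it was processed can be as light as a small manifest saved next to the archived files. The sketch below is one possible shape for such a record. The field names, the source description, and the processing steps are hypothetical. The checksum lets someone later confirm that the archived data is unchanged.

```python
import hashlib
import json
from datetime import datetime, timezone

def build_manifest(data: bytes, source: str, steps: list) -> dict:
    """Describe an archived data file: origin, checksum, and processing history."""
    return {
        "source": source,
        "sha256": hashlib.sha256(data).hexdigest(),
        "size_bytes": len(data),
        "processing_steps": list(steps),
        "archived_at": datetime.now(timezone.utc).isoformat(),
    }

# Made-up data and descriptions for illustration.
manifest = build_manifest(
    b"name,age\nAlice,34\n",
    source="hypothetical survey export",
    steps=["trimmed whitespace", "dropped duplicate rows"],
)
print(json.dumps(manifest, indent=2))
```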
Following each of these steps can make the project more successful, easier to manage, and easier to calculate the return on investment for.