There are eight core competencies required to become a data scientist:
1. Software Tools: Every professional needs a tool set, and a data scientist's tool set is built around a few basic tasks: cleaning, mining, and understanding data. No matter what type of role or organization you work in, you're likely going to be expected to know the tools of the trade: a statistical programming language like R or Python, a database querying language like SQL, and Hadoop concepts. We highly recommend the Python tool set. It is helpful not only for mining data, visualizing data, and machine learning; Python is also a general-purpose language used for web and system programming. A data science job is typically about 40-50% technical.
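To give a feel for how a programming language and a query language work together, here is a minimal sketch that runs a SQL aggregation from Python using the standard-library sqlite3 module. The `orders` table, its columns, the function name, and the sample rows are all made up for illustration.

```python
import sqlite3


def average_order_by_city(rows):
    """Load (city, amount) rows into an in-memory table and aggregate with SQL."""
    conn = sqlite3.connect(":memory:")  # throwaway database, nothing written to disk
    try:
        conn.execute("CREATE TABLE orders (city TEXT, amount REAL)")
        conn.executemany("INSERT INTO orders VALUES (?, ?)", rows)
        query = (
            "SELECT city, AVG(amount) FROM orders "
            "GROUP BY city ORDER BY city"
        )
        return conn.execute(query).fetchall()
    finally:
        conn.close()


# Hypothetical sample data, not taken from any real system.
sample_orders = [("Austin", 20.0), ("Austin", 40.0), ("Boston", 15.0)]

if __name__ == "__main__":
    for city, avg_amount in average_order_by_city(sample_orders):
        print(city, avg_amount)
```

Python handles loading and presenting the data, while SQL does the grouping and averaging. The same split applies when the database is a production system rather than an in-memory toy.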
2. Statistics: A data scientist's prime goal is to look at data and understand its behavior, so knowledge of statistics is one of the most required skills. You will need more than a basic understanding of statistics, and in an interview you will be expected to know the fundamentals. A few of my fellow data scientists once told me that many of the people they've interviewed couldn't even provide the correct definition of a p-value or k-means. Be familiar with statistical tests, distributions, maximum likelihood estimators, etc. Statistics also helps with machine learning, but one of the most important uses of your statistical knowledge is knowing which techniques help you understand the data and its behavior better. Statistics is very important for all company types, especially data-driven companies, in the current age of search engines, social media, and wearables. Even where the product is not data-focused, stakeholders will depend on your help to make decisions based on a data-driven approach.
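Since the p-value definition trips up so many candidates, a small worked example may help. A p-value is the probability, assuming the null hypothesis is true, of seeing a result at least as extreme as the one observed. The sketch below estimates it with a permutation test, which is only one of several ways to get a p-value. The function name, default parameters, and any data you pass in are assumptions for illustration.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=2000, seed=0):
    """Estimate a two-sided p-value for the difference in group means.

    Null hypothesis: the group labels do not matter, so shuffling them
    should often produce differences as large as the observed one.
    """
    rng = random.Random(seed)  # fixed seed so the estimate is reproducible
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)

    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            at_least_as_extreme += 1
    # Share of shuffles that look at least as extreme as the real data.
    return at_least_as_extreme / n_permutations
```

A small returned value means the observed difference would rarely arise from label shuffling alone. It does not say how large or how important the difference is.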
3. Machine Learning: Machine learning is not new to the industry. Many of us learned some artificial intelligence in our college studies. Machine learning refers to algorithms that adapt to input data and help build intelligence on top of that data. It is important for a data scientist to know a programming language like Python in order to use machine learning. Machine learning intimidates a lot of new professionals who want to learn data science. You need to understand the statistics and know which kind of machine learning algorithm to use. In practice, applying a machine learning algorithm often takes no more than a few lines of code. The key is understanding what an algorithm does and how to use it. Having a basic understanding of k-nearest neighbors, random forests, ensemble methods, and the rest of the machine learning buzzwords is very important. All of these techniques can be implemented using a few lines of Python code. What you need is to know the basic machine learning algorithms and when it is appropriate to use each technique.
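To show how little code a basic algorithm needs, here is a minimal sketch of k-nearest neighbors classification. It is written in plain Python rather than with a library such as scikit-learn so that it stays self-contained; the function name and its parameters are hypothetical.

```python
import math
from collections import Counter


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority vote among its k nearest training points."""
    # Pair each training point's Euclidean distance to the query with its label.
    distances = sorted(
        (math.dist(point, query), label)
        for point, label in zip(train_points, train_labels)
    )
    nearest_labels = [label for _, label in distances[:k]]
    # The most common label among the k closest points wins.
    return Counter(nearest_labels).most_common(1)[0][0]
```

The code is short, but the choices behind it are where understanding matters: the value of k, the distance measure, and whether the features are on comparable scales all change the answer.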
4. Linear Algebra and Multivariable Calculus: This skill is like revisiting the math from your high school classes, but you need this foundation before you get into the heavy load of machine learning and statistical models. Why does a data scientist need to understand this stuff if there are a bunch of out-of-the-box implementations in Python's scikit-learn? The answer is that, at a certain point, it can become worth it for a data science team to build its own machine learning implementations, so that the algorithm can be adapted to the behavior of the team's data. Understanding these concepts is most important where the product is defined by a data-driven approach; in predictive models, this can lead to huge wins for you and your company.
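As one illustration of where the calculus shows up, the sketch below fits a straight line by gradient descent: it takes the partial derivatives of the mean squared error with respect to the slope and intercept and steps against them. The function name, learning rate, and step count are arbitrary assumptions, and a real team would tune or replace them.

```python
def fit_line(xs, ys, learning_rate=0.05, steps=2000):
    """Fit y = w * x + b by gradient descent on mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Residuals of the current line on every data point.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Partial derivatives of the mean squared error w.r.t. w and b.
        grad_w = (2 / n) * sum(e * x for e, x in zip(errors, xs))
        grad_b = (2 / n) * sum(errors)
        # Step downhill along each partial derivative.
        w -= learning_rate * grad_w
        b -= learning_rate * grad_b
    return w, b
```

With more than one input feature, the sums become matrix-vector products, which is where the linear algebra comes in. A learning rate that is too large for the scale of the data will make the updates diverge instead of converge.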
5. Data Munging (Janitorial Work): In the past, a few data scientists have described their job as janitorial work. This statement is quite true and very important to understand. The most important thing for a data scientist is an accurate result, which depends on having the most accurate data first. Data comes into our systems from every direction. Organizing, pulling, and storing data is a data engineering job, but it is equally important for a data scientist to be able to clean and munge data. Python has packages like Pandas for this, with Matplotlib alongside for quick plots. With Pandas, you can write a few lines of code to fix a broken row or column, replace blank values, or drop missing values. Companies that are not data-driven, where the product is not data-related, often won't understand how data can help them grow their business. This is a must-know skill for your growth and your company's growth.
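Here is a minimal sketch of that kind of cleanup with Pandas, using `dropna` to discard unusable rows and `fillna` to replace blanks. The column names, the cleaning rules, and the sample frame are invented for illustration; which rows to drop and what to fill in always depends on the data at hand.

```python
import pandas as pd


def clean_readings(df):
    """Return a cleaned copy: drop rows with no id, then fill remaining blanks."""
    # A reading without a sensor id cannot be attributed, so drop it.
    cleaned = df.dropna(subset=["sensor_id"])
    # Fill blanks per column: mean for the numeric column, a marker for text.
    cleaned = cleaned.fillna({
        "temperature": cleaned["temperature"].mean(),
        "status": "unknown",
    })
    return cleaned.reset_index(drop=True)


# Hypothetical raw data with typical gaps.
raw = pd.DataFrame({
    "sensor_id": ["a1", "a2", None, "a4"],
    "temperature": [20.0, None, 25.0, 22.0],
    "status": ["ok", "ok", "ok", None],
})

if __name__ == "__main__":
    print(clean_readings(raw))
```

Filling with the column mean is only one option. Depending on the analysis, a median, a forward fill, or simply dropping the row may be more defensible.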
6. Data Visualization Tools & Communication: Data visualization is a career in itself, and if you were a front-end developer in a previous career you have a head start. Tools like D3.js and ggplot are very effective for plotting and reporting data and predictions. Visualizing and communicating data is incredibly important for being an effective communicator, especially at young companies that are making data-driven decisions. Data visualization is replacing traditional, old-school dashboard software and reporting: you can have real-time data presentations on the web and on your mobile device. It is very important to be familiar not just with the tools for visualizing data, but also with the principles behind visually supporting a data-driven decision-making process and communicating information.
7. Software Engineering: There is a race among software engineers and data analysts to call themselves data scientists. It is true that you need to know good software engineering to be an expert data scientist. If you're interviewing at a small company and are one of its first data science hires, it can be important to have a strong software engineering background. You'll be responsible for handling a lot of data logging, and potentially for developing data-driven products.
8. Thinking Like a Data Scientist (Multi-Dimensional Problem Solving): A data scientist's way of thinking is very different from that of an engineer or similar professional. New companies want to see that data scientists are data-driven problem solvers. At some point during your project initiation process, you'll probably be asked about some high-level problem, for example a test the company may want to run or a data-driven product it may want to develop. It's important to think about what things are important and what things aren't. How should you, as the data scientist, interact with the engineers and product managers? What methods should you use? When do approximations make sense? Make a habit of looking at any problem from multiple dimensions, not in black and white. Every problem has multiple factors that contribute to the result. For example, a house price is affected by multiple factors, like zip code, square footage, and the number of bedrooms and bathrooms. As a data scientist, you should be able to explain how you can narrow down your search to the right factors affecting the housing price in a given zip code.
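One simple way to start narrowing down the factors in the housing example is to rank each one by how strongly it correlates with price. The sketch below does that with a hand-written Pearson correlation. The listings and prices are made-up numbers for a single hypothetical zip code, and the function names are assumptions. Treat correlation as a first screening step, not proof that a factor drives the price.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


def rank_factors(factors, prices):
    """Sort factor names by absolute correlation with price, strongest first."""
    scores = {name: pearson(values, prices) for name, values in factors.items()}
    return sorted(scores.items(), key=lambda item: abs(item[1]), reverse=True)


# Made-up listings within one zip code, for illustration only.
listings = {
    "square_feet": [900, 1200, 1500, 1800, 2400],
    "bedrooms": [2, 3, 3, 4, 4],
    "bathrooms": [1, 2, 1, 2, 3],
}
prices = [180_000, 240_000, 290_000, 350_000, 470_000]

if __name__ == "__main__":
    for name, score in rank_factors(listings, prices):
        print(f"{name}: {score:.3f}")
```

Factors are often correlated with each other as well (bigger houses tend to have more bedrooms), so a ranking like this is a starting point for the conversation rather than the final answer.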