Data Science Interview Questions (Part 2)

Data science's popularity has grown over the years, and companies have adopted data science techniques to grow their business and increase customer satisfaction. It continues to evolve as one of the most promising and in-demand career paths for skilled professionals. If you have expertise in data science, many companies offer positions such as Data Scientist, Data Analyst, Data Engineer, and Data Architect, among other roles, with opportunities for both freshers and experienced candidates.

Differentiate between Inductive and Deductive Learning

  1. Inductive Learning - the model uses observations to draw conclusions (it generalizes rules from examples).
  2. Deductive Learning - the model uses known conclusions (rules) to form observations.

What is a confusion matrix?

A confusion matrix is a performance measurement for machine learning classification algorithms. It is a square matrix that compares actual vs. predicted values; the size of the matrix is directly proportional to the number of output classes.

For binary classification, it is a table with 4 different combinations of predicted and actual values.

  1. TP (True Positive) : The values were positive and predicted positive.
  2. FP (False Positive) : The values were negative but falsely predicted as positive.
  3. FN (False Negative) : The values were positive but falsely predicted as negative.
  4. TN (True Negative) : The values were negative and were predicted negative.

The accuracy of a model (through a confusion matrix) is calculated using the formula below:

Accuracy = (TP + TN) / (TP + TN + FP + FN)
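As a minimal sketch in plain Python, the four confusion-matrix counts and the accuracy derived from them can be computed like this (labels are assumed to be 0/1, with 1 as the positive class):

```python
def confusion_matrix(actual, predicted):
    """Return (TP, FP, FN, TN) counts for binary 0/1 labels."""
    tp = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 1)
    fp = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 1)
    fn = sum(1 for a, p in zip(actual, predicted) if a == 1 and p == 0)
    tn = sum(1 for a, p in zip(actual, predicted) if a == 0 and p == 0)
    return tp, fp, fn, tn

def accuracy(tp, fp, fn, tn):
    """Accuracy = (TP + TN) / (TP + TN + FP + FN)."""
    return (tp + tn) / (tp + tn + fp + fn)

actual    = [1, 1, 0, 0, 1, 0, 1, 0]
predicted = [1, 0, 0, 1, 1, 0, 1, 0]
tp, fp, fn, tn = confusion_matrix(actual, predicted)
print(tp, fp, fn, tn)            # 3 1 1 3
print(accuracy(tp, fp, fn, tn))  # 0.75
```

In practice a library routine such as scikit-learn's `confusion_matrix` would be used, but the hand-rolled version makes the four cells explicit.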

What is Bagging and Boosting?

Bagging and boosting are two methods of implementing ensemble models.


Bagging stands for bootstrap aggregating , a parallel ensemble: each model is built independently. First, create random samples of the training data set (subsets drawn with replacement). Then train a learner on each subset of the data. Finally, combine the learners by simply averaging all the individual learners' outputs. Bagging does not improve the model's predictive force by itself; it decreases the variance, narrowly tuning the prediction to the expected outcome.


Boosting starts out similar to bagging by training a learner on each subset of the data, but it is sequential: the first predictor is learned on the whole data set, while each following one is trained based on the performance of the previous one. It starts by classifying the original data set and giving equal weight to each observation. If the first learner predicts classes incorrectly, higher weight is given to the misclassified observations. Being an iterative process, it continues to add learners until a limit is reached on the number of models or on accuracy. Boosting has often shown better predictive accuracy than bagging, but it also tends to overfit the training data.
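The bagging mechanics (resample with replacement, fit, average) can be sketched with a deliberately simple base learner: each learner just memorizes the mean of its bootstrap sample, and the ensemble averages the learners' outputs. Real bagging (e.g. a random forest) uses decision trees as the base learner, but the steps are the same.

```python
import random

def bootstrap_sample(data, rng):
    # Sample len(data) points *with replacement* from the training data.
    return [rng.choice(data) for _ in data]

def bagged_prediction(data, n_learners=50, seed=0):
    rng = random.Random(seed)
    predictions = []
    for _ in range(n_learners):
        sample = bootstrap_sample(data, rng)
        predictions.append(sum(sample) / len(sample))  # "train" one learner
    return sum(predictions) / len(predictions)         # average the outputs

data = [2.0, 4.0, 6.0, 8.0]
print(bagged_prediction(data))  # close to the true mean, 5.0
```

Averaging over many bootstrap learners does not change what the model can express; it only smooths out the variance of any single learner, which is the point made above.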

What are categorical variables?

A categorical variable takes only a limited number of values: it represents discrete values that belong to a finite set of categories or distinct groups. If people responded to a survey about which brand of motorcycle they owned, the responses would fall into categories like "Yamaha", "Suzuki", and "Harley-Davidson". In this case, the data is categorical.
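Since most models need numeric inputs, categorical values like the brands above are typically converted to one-hot vectors. A minimal sketch in plain Python (category order is fixed by sorting so the encoding is reproducible):

```python
def one_hot_encode(values):
    # Map each distinct category to a position in the vector.
    categories = sorted(set(values))
    index = {c: i for i, c in enumerate(categories)}
    vectors = []
    for v in values:
        vec = [0] * len(categories)
        vec[index[v]] = 1
        vectors.append(vec)
    return categories, vectors

brands = ["Yamaha", "Suzuki", "Harley-Davidson", "Yamaha"]
categories, encoded = one_hot_encode(brands)
print(categories)  # ['Harley-Davidson', 'Suzuki', 'Yamaha']
print(encoded)     # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 0, 1]]
```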

When to use ensemble learning?

When using decision trees, you face a dilemma: to model a complex data set you often need many levels, but as the number of levels in your tree increases, it becomes more prone to overfitting . Ensemble learning is used to fight overfitting, or to play the specific weaknesses and strengths of different classifiers off against each other, improving the stability and predictive power of the model. Ensembles are predictive models that typically average the predictions of different models to get a better prediction. The idea in ensemble learning is that a group of weak predictors can outperform a strong predictor. So, if you train different models with different predictive results and use the majority rule as the final result of your ensemble, this result is usually better than training one single model.
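The majority rule mentioned above can be sketched in a few lines: three hypothetical classifiers each vote on a label, and the ensemble returns the most common vote per input.

```python
from collections import Counter

def majority_vote(predictions_per_model):
    # predictions_per_model: one list of labels per model, same length.
    ensemble = []
    for votes in zip(*predictions_per_model):
        # Most common label among the models' votes for this input.
        ensemble.append(Counter(votes).most_common(1)[0][0])
    return ensemble

model_a = ["cat", "dog", "dog", "cat"]
model_b = ["cat", "cat", "dog", "dog"]
model_c = ["dog", "dog", "dog", "cat"]
print(majority_vote([model_a, model_b, model_c]))
# ['cat', 'dog', 'dog', 'cat']
```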

What is the trade-off between accuracy and interpretability?

When building a predictive model, there are two important criteria: predictive accuracy and model interpretability . Predictive accuracy concerns the ability of a model to make correct predictions; model performance is estimated in terms of its accuracy in predicting the occurrence of an event on unseen data. Model interpretability concerns the degree to which the model allows for human understanding . Interpretability provides insight into the relationship between the inputs and the output. So, an interpretable model can answer your questions as to why the independent features predict the dependent attribute.

What is a ROC Curve and How to Interpret It?

The ROC curve is used in binary classification problems to help visualize model performance. Binary classification means that your model predicts that a data point belongs to one of two possible classes, for example, pedestrian versus non-pedestrian input samples. The Receiver Operating Characteristic (ROC) curve shows the trade-off between the true positive rate and the false positive rate at various thresholds. The true positive rate , defined as the fraction of true positives out of all positives, is also called the sensitivity or recall. The false positive rate , defined as the fraction of false positives out of all negatives, is equivalent to 1 - specificity. The curve plots the True Positive Rate against the False Positive Rate on a graph.
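A minimal sketch of how the (FPR, TPR) points of a ROC curve arise from predicted scores, assuming labels are 1 (positive) / 0 (negative) and a sample is predicted positive when its score is at or above the threshold:

```python
def roc_point(labels, scores, threshold):
    tp = sum(1 for y, s in zip(labels, scores) if y == 1 and s >= threshold)
    fn = sum(1 for y, s in zip(labels, scores) if y == 1 and s < threshold)
    fp = sum(1 for y, s in zip(labels, scores) if y == 0 and s >= threshold)
    tn = sum(1 for y, s in zip(labels, scores) if y == 0 and s < threshold)
    tpr = tp / (tp + fn)  # sensitivity / recall
    fpr = fp / (fp + tn)  # 1 - specificity
    return fpr, tpr

labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.1]
for t in (0.2, 0.5, 0.85):
    print(t, roc_point(labels, scores, t))
```

Sweeping the threshold from high to low traces the curve from (0, 0) toward (1, 1); a library routine such as scikit-learn's `roc_curve` automates this sweep.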

Explain PCA?

PCA stands for Principal Component Analysis . It is a feature transformation technique that rotates your original data dimensions and converts them to a new orthonormal feature space. In the new feature space, the principal components form the dimensions of the space. These components are linear combinations of your original feature dimensions. It is most often used to reduce the dimensionality of a large data set, mitigating the curse of dimensionality so that it becomes more practical to apply machine learning where the original data are inherently high dimensional.
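A minimal sketch of PCA on 2-D data, using the closed-form eigendecomposition of the 2x2 covariance matrix (a real implementation would use a linear-algebra library such as NumPy for arbitrary dimensions):

```python
import math

def first_principal_component(points):
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    # Covariance matrix [[a, b], [b, c]] of the centered data.
    a = sum((x - mx) ** 2 for x, _ in points) / n
    c = sum((y - my) ** 2 for _, y in points) / n
    b = sum((x - mx) * (y - my) for (x, y) in points) / n
    # Eigenvalues of a symmetric 2x2 matrix, in closed form.
    mean_diag = (a + c) / 2
    spread = math.sqrt(((a - c) / 2) ** 2 + b ** 2)
    lam1 = mean_diag + spread          # largest eigenvalue
    lam2 = mean_diag - spread
    # Eigenvector for lam1 (assumes b != 0, i.e. the features covary).
    vx, vy = b, lam1 - a
    norm = math.hypot(vx, vy)
    explained = lam1 / (lam1 + lam2)   # share of variance captured
    return (vx / norm, vy / norm), explained

# Points lying exactly on the line y = 2x: one component captures everything.
pts = [(1, 2), (2, 4), (3, 6), (-1, -2), (-2, -4), (-3, -6)]
direction, ratio = first_principal_component(pts)
print(direction)  # ~ (0.447, 0.894), i.e. the direction (1, 2) normalized
print(ratio)      # ~ 1.0: the first component explains all the variance
```

The "rotation" described above is the projection of the centered data onto these eigenvector directions, ordered by how much variance each captures.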

Differences between supervised and unsupervised learning?

A supervised learning algorithm takes a known set of input data and known responses to the data (output), and trains a model to generate reasonable predictions for the response to new data. This means that the input x is provided with the expected outcome y, which is often called the "class" (or "label") of the corresponding input x. Supervised learning can be used for two types of problems: Classification and Regression . In supervised learning the "categories", "tags" or "labels" are known, while in unsupervised learning they are not, and the learning process attempts to find appropriate "categories" or "labels". This means that unsupervised learning algorithms try to find correlations without any external inputs other than the raw data. Unsupervised learning can be used for two types of problems: Clustering and Association .
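The contrast can be sketched on toy 1-D data: a supervised 1-nearest-neighbour classifier uses the known labels, while an unsupervised step groups the same kinds of points using only their values (a simple k-means-style assignment to two centres). Both the data and the helpers here are illustrative assumptions, not a standard API.

```python
# Supervised: labelled training data -> predict a label for a new point.
train = [(1.0, "small"), (1.5, "small"), (8.0, "large"), (9.0, "large")]

def knn_predict(x):
    # 1-nearest neighbour: copy the label of the closest training point.
    return min(train, key=lambda pair: abs(pair[0] - x))[1]

print(knn_predict(2.0))  # 'small'
print(knn_predict(7.0))  # 'large'

# Unsupervised: raw values only -> discover two groups without labels.
values = [1.0, 1.5, 8.0, 9.0, 2.0, 7.0]

def two_means(xs, iterations=10):
    c1, c2 = min(xs), max(xs)  # initial cluster centres
    for _ in range(iterations):
        g1 = [x for x in xs if abs(x - c1) <= abs(x - c2)]
        g2 = [x for x in xs if abs(x - c1) > abs(x - c2)]
        c1, c2 = sum(g1) / len(g1), sum(g2) / len(g2)
    return sorted(g1), sorted(g2)

print(two_means(values))  # ([1.0, 1.5, 2.0], [7.0, 8.0, 9.0])
```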

Explain univariate, bivariate, and multivariate analyses.

  1. Univariate analysis examines only one variable at a time.
  2. Bivariate analysis compares two variables.
  3. Multivariate analysis compares more than two variables.