Statistics Interview Questions

Statistics is a way to get information from quantitative data. It is a collection of methods and theory for gaining insights in order to make decisions under uncertainty. When you are getting started in Data Science, statistical knowledge helps you draw better insights from data. Statistics provides tools and methods to find structure in data and to understand it more deeply. Here are the most commonly asked statistics interview questions for data scientists, broken into basics and advanced.

Ready to dive in? Then let's get started!

What is Statistical Learning?

Statistical Learning is a fairly new area in statistics that has developed in parallel with computer science, especially Machine Learning. It relies on rule-based programming; that is, it formalizes how variables relate to one another. Statistical Learning is mostly about inference, populations, and hypotheses. These hypotheses may involve specific assumptions, which you validate after building the models. In Data Science, Statistical Learning means learning from training data and predicting on unseen data. It refers to a set of tools for modelling and understanding complex datasets. The formulation of the Statistical Learning problem is quite general. However, the two main types of problem are:
  1. Regression Estimation
  2. Classification
Examples of Statistical Learning problems include:
  1. Predict the price of a stock one year from now, on the basis of company performance measures and financial data.
  2. Estimate the amount of glucose in the blood of a diabetic patient, from the infrared absorption spectrum of that patient's blood.
  3. Identify the risk factors for prostate cancer, based on clinical and demographic variables.

What is the importance of statistics in Data Science?

Statistics serves as a foundation for working with data and its analysis in Data Science. Statistical concepts provide insight into the data and make quantitative analysis possible. Some of the more important statistical concepts used in Data Science include probability distributions, statistical significance, hypothesis testing, and regression. In Data Science, statistics is used to tackle complex real-world problems so that data scientists can look for meaningful trends and changes in datasets. It also helps to build robust data models and to validate inferences and predictions. Statistical techniques are required at every step of the Data Science cycle, from beginning to end. This means that a good statistician can be a good Data Scientist as well.

What is Statistical Modelling?

Statistical Modelling is the use of statistics to build a representation of the data and then analyze it to infer relationships between variables or to discover insights. The model can take the form of a mathematical equation or a visual representation of the information, and it is used to make predictions about the real world. When data scientists apply statistical models to the dataset they are investigating, they are able to understand and interpret the information more strategically. There are three main types of statistical models:
  1. Parametric
  2. Nonparametric
  3. Semiparametric
The most common statistical modelling methods for analyzing data fall into two categories:
  1. Supervised Learning
  2. Unsupervised Learning
Supervised Learning techniques include Regression Models and Classification Models, while Unsupervised Learning techniques include Clustering Algorithms and Association Rules.

What are the different statistical techniques used in Data Science?

The main purpose of statistics is to explain and anticipate information. Many different statistical techniques are available, some simple, some complicated, and often very specific to certain purposes. The following are the most popular statistical techniques used in Data Science:
  1. Linear Regression
  2. Classification
  3. Resampling Methods
  4. Nonlinear Models
  5. Tree-Based Methods
  6. Unsupervised Learning

What is a measure of central tendency?

A measure of central tendency is a descriptive summary that represents the center point or typical value of a dataset. It is a single value that attempts to describe a set of data by identifying the central position within that set. Although it does not provide information about the individual values in the dataset, it gives a concise summary of the dataset as a whole. Generally, the central tendency of a dataset can be described using the following measures:
  1. Mean: the sum of all values divided by the total number of values.
  2. Median: the middle number in an ordered data set.
  3. Mode: the most frequent value.
Each of the above measures calculates the location of the central point using a different method. Choosing the best measure of central tendency depends on the type of data you have.
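As a minimal sketch of the three measures above, the following uses Python's standard statistics module. The list of scores is a hypothetical sample invented for illustration; it includes one repeated value and one outlier so that the three measures differ.

```python
import statistics

# Hypothetical sample: seven scores with one repeated value (75) and one outlier (155).
scores = [70, 75, 75, 80, 85, 90, 155]

mean_score = statistics.mean(scores)      # sum of all values / number of values
median_score = statistics.median(scores)  # middle value of the sorted data
mode_score = statistics.mode(scores)      # most frequent value

print("mean:", mean_score)
print("median:", median_score)
print("mode:", mode_score)
```

In this sample the outlier pulls the mean above the median, which is one reason the median is often preferred for skewed data.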

What do you understand by the term Normal Distribution?

The Normal Distribution, also referred to as the Gaussian distribution or bell curve, is a probability function that describes how the values of a variable are distributed. It is a continuous probability distribution that is symmetrical about the mean, so the right side of the center is a mirror image of the left side. Extreme values in both tails of the distribution are similarly unlikely.
The Normal Distribution has the following properties:
  1. Mean = median = mode
  2. Bell shaped
  3. Symmetrical
  4. Tails that extend indefinitely
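These properties can be checked numerically. The sketch below uses statistics.NormalDist from the Python standard library (Python 3.8 or later); the mean of 100 and standard deviation of 15 are arbitrary values chosen for illustration.

```python
from statistics import NormalDist

# Illustrative parameters (assumed, not from the text): mean 100, standard deviation 15.
dist = NormalDist(mu=100, sigma=15)

# Symmetry: the density is the same at equal distances on either side of the mean.
left_density = dist.pdf(85)
right_density = dist.pdf(115)

# Mean = median: the 50th percentile coincides with the mean.
median = dist.inv_cdf(0.5)

# Probability of falling within one standard deviation of the mean.
within_one_sd = dist.cdf(115) - dist.cdf(85)

# Tails extend indefinitely: the density far from the mean is tiny but not zero.
far_tail_density = dist.pdf(100 + 10 * 15)

print(left_density, right_density)
print(median)
print(round(within_one_sd, 4))
print(far_tail_density > 0)
```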

What is Linear Regression?

Linear regression analysis is used to predict the value of one variable (y) based on the value of one or more other variables (x). The variable you want to predict (y) is called the response, outcome, or dependent variable. The variable you use to make the prediction (x) is called the predictor, explanatory, or independent variable. For example, linear regression analysis can be used to quantify the relative impacts of age, gender, and diet (the independent variables, x) on height (the dependent variable, y). A linear regression line has an equation of the form:
Y = a + bX
Here X is the predictor variable (independent variable) and Y is the response variable (dependent variable). The slope of the line is b, and a is the intercept (the value of Y when X = 0).

There are two types of linear regression:

  1. Simple Linear Regression
  2. Multiple Linear Regression
A Simple Linear Regression model has only one predictor or independent variable, while a Multiple Linear Regression model has two or more predictor or independent variables.
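To make the equation Y = a + bX concrete, here is a minimal sketch of fitting a simple linear regression. It assumes ordinary least squares as the fitting criterion, which the text above does not specify; the function name fit_line and the data points are hypothetical.

```python
def fit_line(xs, ys):
    """Return (a, b) for the least-squares line Y = a + bX."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    # Slope: co-variation of x and y divided by the variation of x.
    s_xy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    s_xx = sum((x - x_mean) ** 2 for x in xs)
    b = s_xy / s_xx
    # Intercept: chosen so the line passes through the point of means.
    a = y_mean - b * x_mean
    return a, b

# Hypothetical data lying close to the line Y = 2X.
xs = [1, 2, 3, 4, 5]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_line(xs, ys)
predicted_at_6 = a + b * 6  # prediction for a new value X = 6

print("intercept a:", round(a, 3))
print("slope b:", round(b, 3))
print("prediction at X = 6:", round(predicted_at_6, 3))
```

A Multiple Linear Regression model extends the same idea to several predictors and is usually fitted with a numerical library rather than by hand.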

What is Bias in a Model?

Bias is the difference between the average prediction of your model and the true value it is trying to predict. It is a measure of how "inflexible" the model is. A model with high bias won't match the data set closely (it underfits), while a model with low bias will match the data set very closely.

What is variance in a model?

Variance describes how much the model's predictions would vary if it were trained on a different dataset drawn from the same population. A common practical symptom of high variance is a gap between training and test performance: if a Machine Learning model predicts with an accuracy of "x" on training data and "y" on test data, a large value of x - y suggests that the model is overfitting. This gap is only a rough indicator, not the variance itself.

Bias Vs. Variance

The expected error of a trained model, taken over different samples drawn from the training distribution, can be decomposed into bias, variance, and irreducible noise. Bias and variance usually trade off against each other:
  1. Models with high bias tend to have low variance.
  2. Models with high variance tend to have low bias.
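The trade-off can be illustrated with a small simulation. This is a sketch under stated assumptions, not a general proof: the true function (x squared), the noise level, the training-set size, and the two toy models (a constant predictor and a 1-nearest-neighbour predictor) are all choices made for illustration. Each model is trained on many training sets drawn from the same population, and its predictions at a single point are compared.

```python
import random
import statistics

def true_f(x):
    # Assumed "true" relationship for the simulation.
    return x * x

def sample_training_set(rng, n=20, noise_sd=0.3):
    # Draw n noisy observations of true_f on the interval [0, 1).
    xs = [rng.random() for _ in range(n)]
    ys = [true_f(x) + rng.gauss(0, noise_sd) for x in xs]
    return xs, ys

def predict_constant(xs, ys, x0):
    # Inflexible model: always predicts the mean of the training responses.
    return statistics.mean(ys)

def predict_nearest(xs, ys, x0):
    # Flexible model: returns the response of the closest training point.
    i = min(range(len(xs)), key=lambda j: abs(xs[j] - x0))
    return ys[i]

def bias_and_variance(predict, x0=0.9, trials=500, seed=0):
    # Train on many independent training sets and collect predictions at x0.
    rng = random.Random(seed)
    preds = [predict(*sample_training_set(rng), x0) for _ in range(trials)]
    bias = statistics.mean(preds) - true_f(x0)   # average prediction minus truth
    variance = statistics.pvariance(preds)       # spread of predictions across training sets
    return bias, variance

bias_const, var_const = bias_and_variance(predict_constant)
bias_nn, var_nn = bias_and_variance(predict_nearest)

print("constant model:   bias =", round(bias_const, 3), " variance =", round(var_const, 4))
print("nearest neighbour: bias =", round(bias_nn, 3), " variance =", round(var_nn, 4))
```

In this setup the constant model shows the larger bias and the nearest-neighbour model the larger variance, in line with the two points above.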

What is the difference between covariance and correlation?

These are two statistical concepts that are used to determine the relationship between two random variables.
  1. Covariance shows us how the two variables vary together. It indicates the direction of the linear relationship between the variables, but its magnitude depends on the units in which they are measured.
  2. Correlation shows us how the two variables are related. It is covariance standardized to lie between -1 and 1, so it shows not only the direction of the relationship but also how strong it is.
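The difference shows up clearly when the units of one variable change. The sketch below computes the sample covariance and the Pearson correlation by hand; the height and weight values are hypothetical, and the function names are chosen for illustration.

```python
import math

def covariance(xs, ys):
    # Sample covariance: average product of deviations from the means (n - 1 denominator).
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    return sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys)) / (n - 1)

def correlation(xs, ys):
    # Pearson correlation: covariance divided by the product of the standard deviations.
    return covariance(xs, ys) / math.sqrt(covariance(xs, xs) * covariance(ys, ys))

# Hypothetical measurements for five people.
heights_m = [1.50, 1.60, 1.70, 1.80, 1.90]
weights_kg = [52, 58, 67, 75, 83]
heights_cm = [h * 100 for h in heights_m]  # same heights, different unit

cov_m = covariance(heights_m, weights_kg)
cov_cm = covariance(heights_cm, weights_kg)
corr_m = correlation(heights_m, weights_kg)
corr_cm = correlation(heights_cm, weights_kg)

print("covariance (m, kg): ", round(cov_m, 4))
print("covariance (cm, kg):", round(cov_cm, 4))
print("correlation (m, kg): ", round(corr_m, 4))
print("correlation (cm, kg):", round(corr_cm, 4))
```

Switching from metres to centimetres multiplies the covariance by 100 but leaves the correlation unchanged, which is why correlation is the usual choice for comparing the strength of relationships.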