Data Science Interview Questions

Data science is an evolutionary extension of statistics, capable of dealing with the huge amounts of data produced today. It continues to evolve as one of the most promising and in-demand career paths for skilled professionals. If you are preparing for a data science job, it's a good idea to go through data science interview questions. These questions are categorized into Methods and Algorithms, Data Analysis, Machine Learning, Statistics, Artificial Intelligence, etc. This detailed guide will help you prepare for your job interview.

Ready to dive in? Then let's get started!

In one sentence, what is Data Science?

Data science is a multi-disciplinary field that uses scientific methods, processes, algorithms and techniques to extract patterns and meaningful information from raw data.

Does very little data lead to the best model?

No. Too little data usually leads to a poor model, typically through underfitting. Underfitting occurs when a model is unable to capture the relationship between the input and output variables accurately, and it can happen when there is not enough data to learn that relationship. (A flexible model trained on very few points can also fail the other way and overfit, memorizing noise instead of the pattern.) For example, if you have 1 trillion data points, outliers are easier to identify and the underlying distribution of the data is clearer. If you have very few data points (e.g., 10), this is probably not the case.

What is Pattern Recognition?

Pattern recognition is the scientific discipline of classifying objects into categories or classes that can then be used for further analysis. It is a generic term for the ability to recognize regularities or patterns in data. This data can be anything from text and images to sounds or other definable qualities. The methodology involves Machine Learning algorithms that identify the regularities in the given data. One example is k-means, a clustering algorithm: when k-means runs, it finds patterns in your data and tries to split the data into distinct clusters. Pattern recognition has a variety of applications, including image processing, speech recognition, aerial photo interpretation, optical character recognition, and medical imaging and diagnosis.
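
As a rough illustration of the k-means idea mentioned above, here is a minimal sketch on one-dimensional numbers using only the Python standard library. The function name kmeans_1d, the toy data, and the fixed iteration count are assumptions made for this example; in practice a library implementation would normally be used.

```python
import random

def kmeans_1d(values, k, iterations=20, seed=0):
    """Cluster 1-D numbers into k groups with a basic k-means loop."""
    rng = random.Random(seed)
    centroids = rng.sample(values, k)  # start from k data points
    clusters = []
    for _ in range(iterations):
        # Assignment step: each value joins its nearest centroid.
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[nearest].append(v)
        # Update step: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        centroids = [
            sum(c) / len(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
    return centroids, clusters

# Toy data with two visibly separate groups (hypothetical values).
data = [1.0, 1.2, 0.8, 9.0, 9.5, 10.1]
centroids, clusters = kmeans_1d(data, k=2)
print(clusters)
```

On this toy data the two groups are far apart, so the loop settles on one cluster of small values and one of large values regardless of the starting centroids.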

What are the major steps in exploratory data analysis?

Exploratory data analysis can help identify common errors, better understand patterns within the dataset, detect outliers or anomalous events, and find notable relations among the variables. The major steps are listed below:
  1. Handling missing values
  2. Removing duplicates
  3. Treating outliers
  4. Normalizing and scaling numerical variables
  5. Encoding categorical variables (dummy variables)
  6. Bivariate analysis
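
A few of the steps above can be sketched in plain Python. The records, the column names (age, city), and the choice of mean imputation and min-max scaling are all assumptions for illustration; outlier treatment and bivariate analysis are left out to keep the sketch short, and a data-frame library would typically be used on real data.

```python
# Hypothetical records: one missing age and one exact duplicate row.
rows = [
    {"age": 25, "city": "Paris"},
    {"age": None, "city": "Rome"},
    {"age": 35, "city": "Paris"},
    {"age": 35, "city": "Paris"},
    {"age": 45, "city": "Rome"},
]

# Step 1: handle missing values by filling with the mean of known ages.
known = [r["age"] for r in rows if r["age"] is not None]
mean_age = sum(known) / len(known)
filled = [{**r, "age": mean_age if r["age"] is None else r["age"]} for r in rows]

# Step 2: remove duplicates, keeping the first occurrence.
seen, unique = set(), []
for r in filled:
    key = (r["age"], r["city"])
    if key not in seen:
        seen.add(key)
        unique.append(r)

# Step 4: min-max scale the numerical variable to the range [0, 1].
lo = min(r["age"] for r in unique)
hi = max(r["age"] for r in unique)

# Step 5: encode the categorical variable as dummy (one-hot) columns.
cities = sorted({r["city"] for r in unique})
prepared = [
    {
        "age_scaled": (r["age"] - lo) / (hi - lo),
        **{f"city_{c}": int(r["city"] == c) for c in cities},
    }
    for r in unique
]

for row in prepared:
    print(row)
```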

What is Genetic Programming?

Genetic Programming is an application of genetic algorithms, a family of techniques within machine learning. It solves problems by mimicking the processes of natural evolution. Genetic Programming is used to evolve the answer to a problem by comparing the fitness of each candidate in a population of potential candidates over many generations. In each generation, new candidates are created by randomly changing (mutation) or swapping parts of (crossover) other candidates. The least 'fit' candidates are removed from the population.
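
The loop of selection, crossover, and mutation described above can be sketched as follows. Note the simplification: this is a plain genetic algorithm that evolves bit strings, not full genetic programming, which evolves programs. The fitness function (count of 1 bits), population size, and other parameters are assumptions chosen only for illustration.

```python
import random

def fitness(bits):
    """Toy fitness: the number of 1 bits in the candidate."""
    return sum(bits)

def evolve(length=20, pop_size=30, generations=40, seed=1):
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    history = []  # best fitness seen at the start of each generation
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        history.append(fitness(population[0]))
        # Selection: the least fit half is removed from the population.
        survivors = population[: pop_size // 2]
        children = []
        while len(survivors) + len(children) < pop_size:
            a, b = rng.sample(survivors, 2)
            cut = rng.randrange(1, length)
            child = a[:cut] + b[cut:]      # crossover: swap parts
            i = rng.randrange(length)
            child[i] = 1 - child[i]        # mutation: flip one random bit
            children.append(child)
        population = survivors + children
    population.sort(key=fitness, reverse=True)
    history.append(fitness(population[0]))
    return population[0], history

best, history = evolve()
print(fitness(best), history[0], history[-1])
```

Because the fittest candidates always survive into the next generation, the best fitness in this sketch never decreases from one generation to the next.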

What is the difference between Classification and Regression?

The main difference between classification and regression is that classification predicts a discrete label, such as True or False, Spam or Not Spam, etc., while regression predicts a continuous quantity or value, such as price, salary, age, etc. A regression algorithm is commonly evaluated by calculating the Root Mean Squared Error of its output, while a classification algorithm is commonly evaluated by computing the accuracy with which it classifies its input.
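
The two evaluation metrics named above can be computed as in the following sketch. The helper names rmse and accuracy and the sample values are assumptions made for this example.

```python
import math

def rmse(y_true, y_pred):
    """Root Mean Squared Error for continuous predictions."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

def accuracy(y_true, y_pred):
    """Fraction of discrete labels predicted correctly."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)

# Regression: hypothetical true and predicted prices.
prices_true = [100.0, 150.0, 200.0]
prices_pred = [110.0, 140.0, 200.0]
print(rmse(prices_true, prices_pred))

# Classification: hypothetical true and predicted labels (3 of 4 correct).
labels_true = ["spam", "ham", "spam", "ham"]
labels_pred = ["spam", "spam", "spam", "ham"]
print(accuracy(labels_true, labels_pred))  # 0.75
```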

How to use labelled and unlabelled data?

Labelled data has a meaningful label, tag, or class attached to each observation. These labels can come from observations or from asking people or specialists about the data. Unlabelled data does not have any meaningful labels or tags associated with it. There is no "explanation" for each piece of unlabelled data; it contains just the data and nothing else. The set of algorithms that use a labelled dataset is called Supervised Learning. Classification and Regression can be applied to labelled datasets for Supervised Learning. The set of algorithms that use an unlabelled dataset is called Unsupervised Learning.

How to deal with imbalanced data?

Imbalanced data typically refers to a problem with the distribution of the target classes. If the target classes are not equally distributed, or not close to an equal ratio, the dataset is said to have a class imbalance. There are several techniques to handle the imbalance in a dataset:
  1. Change the performance metric
  2. Resample your dataset
  3. Oversample the minority class
  4. Undersample the majority class
  5. Cluster the abundant class
  6. Ensemble different resampled datasets
  7. Try different algorithms
  8. Generate synthetic samples
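
One of the techniques listed above, oversampling the minority class, can be sketched as follows. This is plain random oversampling with replacement; the function name random_oversample and the toy data are assumptions for illustration, and it does not generate synthetic samples.

```python
import random
from collections import Counter

def random_oversample(rows, labels, seed=0):
    """Duplicate random minority rows until every class matches the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, n in counts.items():
        pool = [r for r, l in zip(rows, labels) if l == cls]
        for _ in range(target - n):
            out_rows.append(rng.choice(pool))  # sample with replacement
            out_labels.append(cls)
    return out_rows, out_labels

# Hypothetical dataset: four samples of class 0, one of class 1.
X = [[0.1], [0.2], [0.3], [0.4], [0.9]]
y = [0, 0, 0, 0, 1]
X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

Oversampling should be applied to the training split only, so that duplicated rows do not leak into the data used for evaluation.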

If you have a small dataset, how would you handle it?

There are several ways to handle this problem. The following are a few techniques:

  1. Choose simple models
  2. Remove outliers from data
  3. Combine several models
  4. Rely on confidence intervals
  5. Apply transfer learning when possible
  6. Find ways to extend the dataset
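
To illustrate the point about confidence intervals in the list above, here is a minimal sketch of a percentile bootstrap interval for the mean of a small sample. The bootstrap is one possible way to obtain such an interval, not the only one; the function name bootstrap_mean_ci, the sample values, and the number of resamples are assumptions for this example.

```python
import random
import statistics

def bootstrap_mean_ci(data, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    # Resample the data with replacement and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(data, k=len(data)))
        for _ in range(n_resamples)
    )
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper

# Hypothetical small sample of 10 measurements.
sample = [4.8, 5.1, 5.5, 4.9, 6.2, 5.0, 5.3, 4.7, 5.9, 5.4]
lower, upper = bootstrap_mean_ci(sample)
print(lower, upper)
```

With so few points the interval tends to be wide, which makes the uncertainty of the estimate visible instead of hiding it behind a single number.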

What is Cluster Sampling?

Cluster sampling is a sampling design in which the population is divided, usually for logistical reasons, into multiple groups called clusters. Researchers then select groups at random, using simple random or systematic random sampling, for data collection and analysis. For example, households may be clustered by geographic area. Typically, cluster samples are multistage samples: geographic areas are selected in the first stage and households in the subsequent stage.
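
The two-stage procedure described above can be sketched as follows. The area names, household identifiers, and the function name two_stage_cluster_sample are hypothetical, and simple random sampling is assumed at both stages.

```python
import random

def two_stage_cluster_sample(clusters, n_clusters, n_per_cluster, seed=0):
    """Stage 1: pick clusters at random. Stage 2: pick units within each."""
    rng = random.Random(seed)
    chosen = rng.sample(sorted(clusters), n_clusters)
    return {
        area: rng.sample(clusters[area], min(n_per_cluster, len(clusters[area])))
        for area in chosen
    }

# Hypothetical households grouped by geographic area.
households = {
    "north": ["n1", "n2", "n3", "n4"],
    "south": ["s1", "s2", "s3"],
    "east": ["e1", "e2", "e3", "e4", "e5"],
    "west": ["w1", "w2"],
}
sample = two_stage_cluster_sample(households, n_clusters=2, n_per_cluster=2)
print(sample)
```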

Define False Positive and False Negative

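  1. False Positive - A case that is actually negative (False) but gets classified as positive (True).
  2. False Negative - A case that is actually positive (True) but gets classified as negative (False).

For example, in a classification or screening test, you can have four different situations:
  1. True Positive - The case is positive and is classified as positive.
  2. False Positive - The case is negative but is classified as positive.
  3. True Negative - The case is negative and is classified as negative.
  4. False Negative - The case is positive but is classified as negative.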
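
As a rough illustration, the four outcomes (true positive, false positive, true negative, false negative) can be counted from paired actual and predicted labels. The function name confusion_counts and the sample labels are assumptions for this example.

```python
def confusion_counts(y_true, y_pred):
    """Count the four outcomes of a binary classifier."""
    counts = {"TP": 0, "FP": 0, "TN": 0, "FN": 0}
    for actual, predicted in zip(y_true, y_pred):
        if predicted and actual:
            counts["TP"] += 1   # positive case, classified positive
        elif predicted and not actual:
            counts["FP"] += 1   # negative case, classified positive
        elif not predicted and not actual:
            counts["TN"] += 1   # negative case, classified negative
        else:
            counts["FN"] += 1   # positive case, classified negative
    return counts

# Hypothetical screening results.
actual = [True, True, False, False, True, False]
predicted = [True, False, True, False, True, False]
print(confusion_counts(actual, predicted))
# {'TP': 2, 'FP': 1, 'TN': 2, 'FN': 1}
```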
