Scikit-learn Interview Questions

Scikit-learn is one of the most widely used machine learning libraries in Python and focuses on modelling data. It provides a wide range of supervised and unsupervised learning algorithms through a consistent interface and is built on NumPy, SciPy and Matplotlib. The package aims to bring machine learning to non-specialists using a general-purpose high-level language. Being comfortable with scikit-learn will help in data science job interviews.

What does the "fit()" method in scikit-learn do?

Fitting a model to the training data with fit() is the training step of the modelling process. The fit() method estimates the parameters (for example, the coefficients) of the model defined by the chosen algorithm. It modifies the estimator object in place and returns a reference to that same object. Once trained, the model can be used to make predictions, usually by calling predict().
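As a minimal sketch of the behaviour described above, the snippet below fits a LinearRegression on a made-up dataset (the values are invented for illustration and follow y = 2x + 1) and checks that fit() hands back the same object it was called on.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data invented for this example: y = 2x + 1 exactly.
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = np.array([1.0, 3.0, 5.0, 7.0])

model = LinearRegression()
fitted = model.fit(X, y)  # learns coef_ and intercept_ in place

# fit() returns the estimator itself, which is what makes chaining possible.
print(fitted is model)
print(model.coef_, model.intercept_)

# The trained model can now predict for unseen inputs.
prediction = model.predict(np.array([[4.0]]))
print(prediction)
```

Because fit() returns the estimator, a one-liner such as LinearRegression().fit(X, y).predict(X) also works.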

How to eliminate warnings from scikit-learn?

You can use Python's built-in warnings module to suppress warnings.
import warnings
warnings.filterwarnings('ignore')
The filterwarnings call must run before the code that triggers the warning, typically in the file that calls the offending function. Keep in mind that a warning usually tells you exactly what the problem is, so fixing its cause is better than suppressing it.
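If the suppression should only last for a few lines, one option is the standard library's warnings.catch_warnings() context manager, which restores the previous filters on exit. The sketch below uses a hypothetical noisy() function as a stand-in for whatever library call emits the warning.

```python
import warnings


def noisy():
    # Hypothetical stand-in for a library call that emits a warning.
    warnings.warn("this option is deprecated", UserWarning)
    return 42


# Inside the block, UserWarning is ignored; the filters are restored afterwards.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=UserWarning)
    result = noisy()

# To confirm the warning still exists, record it instead of hiding it.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    noisy()

print(result, len(caught))
```

Restricting the filter to one category and one block keeps unrelated warnings visible.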

What does calling fit() multiple times on the same model do?

Calling fit() a second time retrains the model from scratch on the data passed in and discards the previous results. It resets the following inside the model:
  1. Fitted coefficients
  2. Weights
  3. Intercept (bias)
  4. Any other state learned during training
To avoid overwriting, some estimators accept a warm_start parameter, which initialises the model parameters from the previous fit() solution. Estimators that implement partial_fit() can also keep what they have already learned and continue training on new data.
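The difference between refitting and incremental training can be sketched with SGDClassifier, one of the estimators that implements partial_fit(). The data is randomly generated for illustration, and the variable names are assumptions of this example.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

# Synthetic two-class data, invented for this example.
rng = np.random.RandomState(0)
X = rng.randn(200, 3)
y = (X[:, 0] + X[:, 1] > 0).astype(int)

# Incremental training: the second call continues from the current weights.
clf = SGDClassifier(random_state=0)
clf.partial_fit(X[:100], y[:100], classes=np.array([0, 1]))
coef_after_first = clf.coef_.copy()
clf.partial_fit(X[100:], y[100:])
coef_after_second = clf.coef_.copy()

# Refitting: the second fit() starts from scratch, so the first batch
# should leave no trace in the final coefficients.
twice = SGDClassifier(random_state=0).fit(X[:100], y[:100]).fit(X[100:], y[100:])
fresh = SGDClassifier(random_state=0).fit(X[100:], y[100:])

print(np.allclose(twice.coef_, fresh.coef_))
```

Note that partial_fit() needs the full list of classes on its first call, because later batches may not contain every class.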

How to predict time series in scikit-learn?

A time series is a collection of data points recorded at constant time intervals. Time-series prediction rests on the idea that the current value depends, more or less, on past values. Two properties you would ideally like to keep stable over time are the mean and the variance: to stabilise the variance you can apply a logarithm transformation to the data, and to remove a trend in the mean you can difference the series. Recurrent neural networks such as RNNs and LSTMs (deep learning) are widely used for time-series prediction, but scikit-learn does not provide built-in implementations of them. For those models you may be better off learning TensorFlow or PyTorch, which are common frameworks for building RNN and LSTM models.
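The log transform and differencing mentioned above can be sketched with NumPy. The second half goes one step further than the text: it feeds lagged values of the transformed series into a plain LinearRegression, which is only one possible way to use scikit-learn here and not a method the answer prescribes. The series itself is synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic positive series with an exponential trend (invented data).
t = np.arange(60)
series = np.exp(0.05 * t) * (1.0 + 0.1 * np.sin(t / 3.0))

log_series = np.log(series)    # log transform to stabilise the variance
diffed = np.diff(log_series)   # first difference to remove the trend

# Assumption of this sketch: predict each value from its 3 previous values.
n_lags = 3
X = np.column_stack(
    [diffed[i : len(diffed) - n_lags + i] for i in range(n_lags)]
)
y = diffed[n_lags:]

# Split chronologically (no shuffling) so the model never sees the future.
split = 45
model = LinearRegression().fit(X[:split], y[:split])
preds = model.predict(X[split:])
print(preds.shape)
```

Predictions are on the differenced log scale; to get back to the original units you would cumulatively sum them and exponentiate.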

Is it possible in scikit-learn to split into three sets directly?

No, scikit-learn has no function that splits into three sets directly. However, one way to divide the dataset into train, validation and test sets with proportions 0.6, 0.2 and 0.2 is to call train_test_split twice. The second call uses test_size=0.25 because 25% of the remaining 80% is 20% of the original data.
from sklearn.model_selection import train_test_split
x, x_test, y, y_test = train_test_split(xtrain, labels, test_size=0.2, train_size=0.8)
x_train, x_cv, y_train, y_cv = train_test_split(x, y, test_size=0.25, train_size=0.75)
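A self-contained version of the two-step split, using a made-up array of 100 samples so the resulting sizes are easy to check, might look like this. The random_state values are an addition for reproducibility.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Invented data: 100 samples, 2 features.
X = np.arange(200).reshape(100, 2)
y = np.arange(100)

# First split: hold out 20% of the data as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Second split: 25% of the remaining 80% becomes the validation set.
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

print(len(X_train), len(X_val), len(X_test))  # 60 20 20
```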

How to split data into 3 sets (train, validation and test)?

It can be done with NumPy and pandas. np.split() first splits the shuffled dataset into three objects (train, validate and test); each of these later needs to be split into the independent variables, commonly called X, and the target variable, commonly called y.
import numpy as np
train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])

You get three different objects: train holds the first 60% of the shuffled rows of df, validate holds the rows between 60% and 80%, and test holds the last 20% (80% to 100%).
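The follow-up step of separating each part into X and y can be sketched as below. This example slices the shuffled frame with iloc rather than np.split, which is a substitution made here for simplicity, and the column names f1, f2 and target are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: two feature columns and one target column.
df = pd.DataFrame(
    {"f1": np.arange(10), "f2": np.arange(10) * 2, "target": np.arange(10) % 2}
)

shuffled = df.sample(frac=1, random_state=0)  # shuffle the rows
n = len(shuffled)
cut1, cut2 = n * 6 // 10, n * 8 // 10         # integer 60% and 80% cut points

train = shuffled.iloc[:cut1]
validate = shuffled.iloc[cut1:cut2]
test = shuffled.iloc[cut2:]

# Separate the independent variables (X) from the target (y) in each part.
X_train, y_train = train.drop(columns="target"), train["target"]
X_val, y_val = validate.drop(columns="target"), validate["target"]
X_test, y_test = test.drop(columns="target"), test["target"]

print(X_train.shape, X_val.shape, X_test.shape)
```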

Difference between scikit-learn and sklearn?

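scikit-learn is the name of the project and of the package you install, while sklearn is the name you import in Python. Because of the hyphen, scikit-learn is not a valid Python identifier, so it cannot be used as the module name. If you prefer, you can alias the import: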
import sklearn as scikit_learn

Can you use a custom distance function with scikit-learn K-Means clustering?

Unfortunately no: by definition, the k-means clustering algorithm relies on the Euclidean distance from the mean of each cluster. scikit-learn's KMeans has no metric parameter, and it is not trivial to extend k-means to other distances. In your own implementation you could still compute cluster means while assigning points with a different metric, such as the Mahalanobis distance, although the result is then no longer standard k-means.

How do you solve overfitting in random forest of Python sklearn?

To avoid overfitting in random forest models, the main thing to do is tune the parameter that governs how many features are randomly chosen when growing each tree from the bootstrapped data. If possible, the best remedy is to get more data: the more data you have, the less likely the model is to overfit, because random patterns that appear predictive get drowned out as the dataset grows. Growing a larger forest also improves predictive accuracy, although there are usually diminishing returns beyond several hundred trees.

Look at the following params:

  1. n_estimators: In general, the more trees, the less likely the algorithm is to overfit.
  2. max_features: Try reducing this number. The smaller it is, the less likely the model is to overfit, but too small a value starts to introduce underfitting.
  3. max_depth: Reducing the maximum depth helps fight overfitting.
  4. min_samples_leaf: This has a similar effect to max_depth: a node is split only if each resulting leaf would contain at least this many samples.
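The four parameters above can be set together when constructing the forest. The values below are arbitrary starting points chosen for illustration, not recommended settings, and the dataset is synthetic; comparing training and test accuracy gives a rough sense of how much the model overfits.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic classification data, invented for this example.
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

forest = RandomForestClassifier(
    n_estimators=200,      # more trees: more stable averaged predictions
    max_features="sqrt",   # features considered at each split
    max_depth=5,           # cap on how deep each tree may grow
    min_samples_leaf=5,    # each leaf must hold at least 5 samples
    random_state=0,
)
forest.fit(X_train, y_train)

# A large gap between these two scores is a sign of overfitting.
train_acc = forest.score(X_train, y_train)
test_acc = forest.score(X_test, y_test)
print(train_acc, test_acc)
```

In practice these values are usually chosen by cross-validation rather than by hand.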

Is there a library function for Root mean square error (RMSE) in python?

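In older scikit-learn versions, mean_squared_error has a squared parameter with a default value of True; setting it to False makes the same function return RMSE instead of MSE. That parameter was deprecated in scikit-learn 1.4 and removed in 1.6; from version 1.4 onwards, use root_mean_squared_error instead.
# scikit-learn 1.4 and later
from sklearn.metrics import root_mean_squared_error
rms = root_mean_squared_error(y_actual, y_predicted)
# Older versions: rms = mean_squared_error(y_actual, y_predicted, squared=False)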
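If you are unsure which scikit-learn version is installed, a version-independent option is to take the square root of the MSE yourself. The arrays below are made-up values chosen so the arithmetic is easy to follow.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Invented values: the errors are 1, 0, -1 and -2.
y_actual = np.array([3.0, 5.0, 7.0, 9.0])
y_predicted = np.array([2.0, 5.0, 8.0, 11.0])

mse = mean_squared_error(y_actual, y_predicted)  # (1 + 0 + 1 + 4) / 4 = 1.5
rmse = np.sqrt(mse)                              # square root of 1.5

print(mse, rmse)
```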

How to extract the decision rules from scikit-learn decision-tree?

You can use scikit-learn's export_text to extract the rules from a tree. Once the model is fitted, you need only a few lines of code.
from sklearn.tree import export_text
rules = export_text(loan_tree, feature_names=list(X_train.columns))
print(rules)
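For a version that runs end to end, the sketch below swaps in the built-in iris dataset and a shallow DecisionTreeClassifier in place of loan_tree and X_train, which are not defined in the snippet above.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules short.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# export_text returns the rules as one indented, human-readable string.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Each line of the output is either a threshold test on a feature or a leaf that reports the predicted class.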

What is the difference between 'transform' and 'fit_transform' in sklearn?

  1. fit() learns the model parameters from the training data. This is where the model "learns" from the data.
  2. transform() transforms the data (produces the model outputs) according to the fitted model.
  3. fit_transform() does both: it fits the model to the data, then transforms that data according to the fitted model.
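The three methods above can be seen side by side with StandardScaler. In this sketch, on tiny made-up arrays, the scaler learns its mean and standard deviation from the training data only and then reuses them on the test data.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Invented data: the training mean is 2.0.
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[2.0], [4.0]])

scaler = StandardScaler()

# fit_transform: learn mean_ and scale_ from X_train, then scale X_train.
X_train_scaled = scaler.fit_transform(X_train)

# transform: reuse the parameters learned from X_train; nothing is re-learned.
X_test_scaled = scaler.transform(X_test)

# Calling fit() and then transform() separately gives the same result.
same = np.allclose(
    X_train_scaled, StandardScaler().fit(X_train).transform(X_train)
)
print(scaler.mean_, same)
```

Calling fit_transform() on the test set instead would re-learn the statistics from test data, which leaks information and makes the two sets inconsistent.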