# Scikit-learn Interview Questions

Scikit-learn is probably the most useful library for **Machine Learning** in Python, and it focuses on modelling data. It provides a wide range of **supervised** and **unsupervised** learning algorithms through a consistent interface and is built on NumPy, SciPy and Matplotlib. The package aims to bring machine learning to non-specialists using a **general-purpose** high-level language. Being comfortable with the scikit-learn library will help in **Data Science** job interviews.

## What does the "fit()" method in scikit-learn do?

Fitting a model to the training data with the **fit()** method is essentially the training part of the **modelling process**. The fit() method estimates the parameters of the algorithm being used, for example the coefficients of a linear model. It modifies the estimator object in place and returns a reference to that same object. Once trained, the model can be used to **make predictions**, usually with a .predict() method call.
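As a minimal sketch of the fit-then-predict workflow, the example below trains a `LinearRegression` on a tiny made-up dataset. The data values and variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy data that follows y = 2x + 1 exactly (illustrative values only)
X = np.array([[0.0], [1.0], [2.0], [3.0]])
y = 2 * X.ravel() + 1

model = LinearRegression()
returned = model.fit(X, y)  # training step: estimates coef_ and intercept_

print(returned is model)                 # fit() returns the estimator itself
print(model.coef_, model.intercept_)     # learned slope and intercept
print(model.predict(np.array([[4.0]])))  # prediction for an unseen input
```

Because fit() returns the estimator, calls can be chained, for example `LinearRegression().fit(X, y).predict(X)`.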

## How to eliminate warnings from scikit-learn?

You can use Python's standard **warnings** module to suppress warnings:

import warnings
warnings.filterwarnings('ignore')

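The **filterwarnings** call should be placed in the file that calls the function that raises the warning, before that call is made. The filter then stays in effect for the rest of the process unless the filters are reset. A warning usually tells you exactly what the problem is, so instead of **suppressing a warning** it is generally better to fix its cause.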
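If the goal is to silence a warning only around one call, a context manager from the same standard module may be a better fit. The sketch below uses a hypothetical `noisy_function` as a stand-in for whatever library call emits the warning.

```python
import warnings


def noisy_function():
    # Hypothetical stand-in for a library call that emits a warning
    warnings.warn("this value is deprecated", UserWarning)
    return 42


# record=True collects any warnings that still get through; it is used
# here only to show that nothing was emitted inside the block
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("ignore", category=UserWarning)
    result = noisy_function()

print(result, len(caught))
```

The previous warning filters are restored when the `with` block exits, so warnings elsewhere in the program remain visible.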

## What does calling fit() multiple times on the same model do?

If you call **model.fit()** a second time, training starts again from scratch on the data passed in, and the results of the previous fit are discarded. This resets the following inside the model:

- Fitted Coefficients
- Weights
- Intercept (bias)
- Any other state learned during training

Some estimators accept a **warm_start** parameter; when it is set to True, fit() initialises the model parameters from the previous solution instead of starting over. Estimators that implement the **partial_fit()** method can also keep what they have already learned and continue training on the next batch of data.
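The difference between repeated fit() and partial_fit() calls can be sketched with `GaussianNB`, one of the estimators that implements partial_fit(). The two batches are made-up values, and `class_count_` is used only as a convenient way to see how many samples the model has learned from. warm_start is not covered here.

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

# Two small illustrative batches with binary labels
X1 = np.array([[0.0], [0.2], [1.0], [1.2]])
y1 = np.array([0, 0, 1, 1])
X2 = np.array([[0.1], [0.3], [0.9], [1.1]])
y2 = np.array([0, 0, 1, 1])

# fit() twice: the second call discards what was learned from the first batch
refit = GaussianNB()
refit.fit(X1, y1)
refit.fit(X2, y2)
print(refit.class_count_)  # counts reflect only the second batch

# partial_fit() twice: the second call adds to what was already learned
incremental = GaussianNB()
incremental.partial_fit(X1, y1, classes=np.array([0, 1]))
incremental.partial_fit(X2, y2)
print(incremental.class_count_)  # counts reflect both batches
```

The first partial_fit() call needs the full list of classes, because later batches are not guaranteed to contain every label.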

## How to predict time series in scikit-learn?

A time series is a collection of data points collected at **constant time intervals**. Time-series prediction rests on the idea that the current value depends, to a greater or lesser degree, on past values. Two basic properties of a **time series** are its mean and its variance. Ideally you want to stabilise both: to reduce changing variability you can apply a **logarithm transformation** to the data, and to remove a trend you can difference it, that is, subtract each value from the one that follows. For predicting **time series** data, RNN and LSTM models (Deep Learning) are widely used, but **scikit-learn** does not provide built-in implementations of them. You may therefore be better off studying TensorFlow or PyTorch, which are common frameworks for building **RNN** or **LSTM** models.
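The two preprocessing steps mentioned above, the logarithm transformation and differencing, can be sketched with NumPy alone. The series below is made up so that the effect is easy to see.

```python
import numpy as np

# Illustrative series that doubles at every step (trend and growing variability)
series = np.array([1.0, 2.0, 4.0, 8.0, 16.0, 32.0])

log_series = np.log(series)         # log transform: exponential growth becomes linear
differenced = np.diff(log_series)   # first difference: removes the linear trend

print(differenced)  # every step is now the same constant, log(2)
```

Differencing shortens the series by one element, which matters when aligning the result with other columns.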

## Is it possible in scikit-learn to split into three sets directly?

No, **scikit-learn** cannot split a dataset into three sets in a single call. However, one way to divide a dataset into train, validation and test sets in proportions 0.6, 0.2 and 0.2 is to call **train_test_split** twice. The second call takes 25% of the remaining 80%, which is 20% of the original data.

from sklearn.model_selection import train_test_split

x, x_test, y, y_test = train_test_split(xtrain,labels,test_size=0.2,train_size=0.8)
x_train, x_cv, y_train, y_cv = train_test_split(x,y,test_size = 0.25,train_size =0.75)

## How to split data into 3 sets (train, validation and test)?

It can also be done with NumPy and pandas. **np.split()** first splits the dataset into three objects (train, validate and test), each of which later needs to be split into the independent variables, commonly called X, and the target variable, commonly called y.

train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])

You get three objects: the first 60% of the shuffled rows of df as train, the rows between 60% and 80% as validate, and the last 20% (80% to 100%) as test. The call df.sample(frac=1) shuffles the rows before they are split.
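The same 60/20/20 idea can be sketched on a plain NumPy array by shuffling row indices and splitting those, followed by the separation into X and y. The array shape and the convention that the last column is the target are assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative data: 100 rows, 3 feature columns plus a target column
data = rng.normal(size=(100, 4))

# Shuffle row indices, then cut them at 60% and 80%
indices = rng.permutation(len(data))
train_idx, val_idx, test_idx = np.split(
    indices, [int(0.6 * len(data)), int(0.8 * len(data))]
)

# Separate features (X) and target (y) for each subset
X_train, y_train = data[train_idx, :-1], data[train_idx, -1]
X_val, y_val = data[val_idx, :-1], data[val_idx, -1]
X_test, y_test = data[test_idx, :-1], data[test_idx, -1]

print(X_train.shape, X_val.shape, X_test.shape)
```

Splitting indices rather than the data itself keeps the three subsets disjoint and makes it easy to apply the same split to several aligned arrays.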

## Difference between scikit-learn and sklearn?

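**scikit-learn** is the name of the library, and sklearn is the name you import in Python. The hyphen makes scikit-learn an invalid Python identifier, so it cannot be used as a module name. If you prefer a name closer to the original, you can alias the import: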

import sklearn as scikit_learn

## Is it possible to use a custom distance function with scikit-learn K-Means clustering?

Unfortunately no: by definition, the **k-means clustering** algorithm relies on the Euclidean distance from the mean of each cluster. scikit-learn's KMeans has no metric parameter, and it is not trivial to extend **k-means** to other distances. In principle you could keep the mean as the cluster centre and measure distance with a different metric, such as the Mahalanobis distance, but you would have to implement that variant yourself.
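As a quick check, the sketch below confirms that `KMeans` exposes no metric setting. It then shows `DBSCAN` with a callable metric as one possible alternative when a custom distance is required. DBSCAN is a different, density-based algorithm and not a k-means variant, and the `manhattan` function, the data and the `eps` value are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN, KMeans

# KMeans exposes no "metric" setting among its constructor parameters
print("metric" in KMeans().get_params())


def manhattan(a, b):
    # Custom distance chosen only for illustration
    return float(np.abs(a - b).sum())


X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
              [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

# DBSCAN is a different algorithm from k-means; it accepts a callable metric
labels = DBSCAN(eps=1.5, min_samples=2, metric=manhattan).fit_predict(X)
print(labels)
```

A callable metric is evaluated in Python for each pair of points, so this approach tends to be slow on large datasets.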

## How do you solve overfitting in random forest of Python sklearn?

To avoid **over-fitting** in random forest models, the main thing to do is tune the parameter that governs how many features are randomly chosen when growing each tree from the **bootstrapped** data. If possible, the best remedy is more data: the more data you have, the less likely the model is to **overfit**, because random patterns that appear predictive get drowned out as the dataset grows. Growing a larger forest also tends to improve predictive accuracy, although there are usually **diminishing returns** once you reach several hundred trees.

Look at the following parameters:

- **n_estimators:** In general, the more trees, the less likely the algorithm is to overfit.
- **max_features:** Try reducing this number. The smaller it is, the less likely the model is to overfit, but too small a value starts to introduce underfitting.
- **max_depth:** Reducing the maximum depth helps fight overfitting.
- **min_samples_leaf:** This has a similar effect to max_depth: a split is only made if it leaves at least this many samples in each resulting leaf.
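The parameters above can be sketched on a synthetic dataset as follows. The specific values (50 trees, depth 5, 5 samples per leaf) are arbitrary starting points for illustration, not recommendations; in practice they would be tuned, for example with cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic data, used only to have something to fit
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

forest = RandomForestClassifier(
    n_estimators=50,      # number of trees in the forest
    max_features="sqrt",  # features considered at each split
    max_depth=5,          # cap on how deep each tree may grow
    min_samples_leaf=5,   # minimum samples required in every leaf
    random_state=0,
)
forest.fit(X_train, y_train)

# A large gap between these two scores is a common sign of overfitting
print(forest.score(X_train, y_train), forest.score(X_test, y_test))
```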

## Is there a library function for Root mean square error (RMSE) in python?

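In scikit-learn versions before 1.6, sklearn's **mean_squared_error** accepts a squared parameter whose default value is True. If you set it to False, the same function returns RMSE instead of MSE. The parameter was deprecated in version 1.4, which added a dedicated root_mean_squared_error function, and it was removed in version 1.6. The call below uses the older form.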

from sklearn.metrics import mean_squared_error

rms = mean_squared_error(y_actual, y_predicted, squared=False)
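Since RMSE is simply the square root of MSE, a version-independent sketch is to take the square root yourself. The values in `y_actual` and `y_predicted` are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Illustrative targets and predictions
y_actual = [3.0, -0.5, 2.0, 7.0]
y_predicted = [2.5, 0.0, 2.0, 8.0]

mse = mean_squared_error(y_actual, y_predicted)
rmse = np.sqrt(mse)  # RMSE is the square root of MSE

print(mse, rmse)
```

This form does not rely on the squared parameter, so it should behave the same way across scikit-learn versions.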

## How to extract the decision rules from scikit-learn decision-tree?

You can use scikit-learn's **export_text** to extract the rules from a tree. Once the model is fitted, you need only two lines of code.

from sklearn.tree import export_text

rules = export_text(loan_tree, feature_names=(list(X_train.columns)))
print(rules)
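For a self-contained version of the same idea, the sketch below fits a shallow tree on the bundled iris dataset and prints its rules. The dataset choice and `max_depth=2` are assumptions made to keep the output short.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rule set small
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# feature_names makes the rules readable instead of feature_0, feature_1, ...
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

Each indented line in the output is a threshold test on one feature, and each leaf line reports the predicted class.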

## What is the difference between 'transform' and 'fit_transform' in sklearn?

- **fit()** learns the model parameters from the training data. This is where the model "learns" from the data.
- **transform()** transforms data (produces the model's outputs) according to the parameters learned during fitting.
- **fit_transform()** does both: it fits the model to the data, then transforms that same data according to the fitted model.
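A minimal sketch with `StandardScaler` shows the usual pattern: fit_transform() on the training data, then transform() on new data so that it is scaled with the statistics learned from training. The arrays are made-up values.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Illustrative training and test data with a single feature
X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[2.0]])

scaler = StandardScaler()

# fit_transform(): learn mean and scale from X_train, then scale X_train
X_train_scaled = scaler.fit_transform(X_train)

# transform(): reuse the training mean and scale on unseen data
X_test_scaled = scaler.transform(X_test)

print(scaler.mean_)    # mean learned from the training data
print(X_test_scaled)   # the test value equals the training mean, so it maps to 0
```

Calling fit_transform() on the test data instead would recompute the statistics from the test set, which leaks information and makes the two sets inconsistent.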
