Scikit-learn Interview Questions

Scikit-learn is one of the most widely used and versatile Python libraries for machine learning, with a focus on data modeling. It offers an extensive collection of supervised and unsupervised learning algorithms, all accessible through a uniform interface. Built on NumPy and SciPy, and commonly used alongside Matplotlib for plotting, it provides an efficient and robust framework for machine learning tasks. Its emphasis on making machine learning accessible to non-specialists through a high-level language has made it a widely adopted tool in the field. Familiarity with scikit-learn is highly beneficial for anyone preparing for Data Science job interviews, as it covers the essential tools for a wide range of data analysis problems.

What does the "fit()" method in scikit-learn do?

The process of fitting a model to the training data using the fit() method is a crucial step in the modeling process. This step involves finding the coefficients for the equation specified by the chosen algorithm. As the fit() method runs, it adjusts the model's internal parameters to best represent the underlying patterns in the training data. The method modifies the model object and returns a reference to the updated object.

Once the model is trained, it becomes capable of making predictions on new, unseen data. This is typically achieved using the .predict() method, which takes the input data and returns the predicted outcomes based on the learned patterns from the training phase. The predict() method allows the model to generalize its knowledge to previously unseen instances and make informed predictions.
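
As a minimal sketch of the fit/predict cycle described above, the example below uses LinearRegression on a toy dataset. The estimator choice and the data values are assumptions made for illustration only.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy training data that follows y = 2*x + 1 exactly (illustrative values).
X_train = np.array([[0.0], [1.0], [2.0], [3.0]])
y_train = np.array([1.0, 3.0, 5.0, 7.0])

model = LinearRegression()
fitted = model.fit(X_train, y_train)  # fit() learns coef_ and intercept_

# fit() returns the estimator itself, so calls can be chained.
print(fitted is model)

# predict() applies the learned parameters to unseen inputs.
X_new = np.array([[4.0], [5.0]])
y_pred = model.predict(X_new)
print(model.coef_, model.intercept_, y_pred)
```

Because fit() returns the same object, a one-liner such as LinearRegression().fit(X_train, y_train).predict(X_new) works as well.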

How to eliminate warnings from scikit-learn?

You can use Python's built-in warnings module to suppress warnings.

import warnings
warnings.filterwarnings('ignore')

The filterwarnings call must run before the code that triggers the warning, so place it near the top of the script that calls the offending function.

A warning usually tells you exactly what the problem is, so it is generally better to fix its cause than to suppress it.
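
If a warning really does have to be silenced, limiting the suppression to a small block is usually safer than a global filter. The sketch below uses warnings.catch_warnings from the standard library; noisy_function is a hypothetical stand-in for whatever call emits the warning.

```python
import warnings

def noisy_function():
    # Hypothetical stand-in for a library call that emits a warning.
    warnings.warn("this API is deprecated", DeprecationWarning)
    return 42

# Suppress only inside this block; the previous filters are restored on exit.
with warnings.catch_warnings():
    warnings.simplefilter("ignore", category=DeprecationWarning)
    result = noisy_function()

# Outside the block, record warnings to confirm they are visible again.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    noisy_function()

print(result, len(caught))
```

Filtering on a specific category keeps unrelated warnings visible.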

What does calling fit() multiple times on the same model do?

If you call model.fit() a second time, training starts again from scratch on the data you pass, and the previous results are discarded. The following are reset inside the model:

  1. Fitted coefficients
  2. Weights
  3. Intercept (bias)
  4. Any other state learned during training

To avoid overwriting, some estimators accept a warm_start parameter, which initialises the model parameters with the solution from the previous call to fit(). Estimators that support incremental learning also provide a partial_fit() method, which keeps what has already been learned and continues training on the next batch of data.
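
The difference between refitting and incremental training can be sketched as follows. LinearRegression and SGDRegressor are chosen here only as examples (SGDRegressor is one of the estimators that implements partial_fit), and the synthetic data is an assumption for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor

rng = np.random.default_rng(0)
X_a = rng.normal(size=(50, 1))
y_a = 2.0 * X_a[:, 0]      # first batch: slope +2
X_b = rng.normal(size=(50, 1))
y_b = -3.0 * X_b[:, 0]     # second batch: slope -3

# Calling fit() twice: the second call discards what the first one learned.
lin = LinearRegression()
lin.fit(X_a, y_a)
coef_after_a = lin.coef_.copy()
lin.fit(X_b, y_b)
coef_after_b = lin.coef_.copy()  # reflects only the second batch

# Calling partial_fit() twice: the second call continues from the first.
incremental = SGDRegressor(random_state=0)
incremental.partial_fit(X_a, y_a)
incremental.partial_fit(X_b, y_b)

# A fresh model that only ever sees the second batch ends up elsewhere.
fresh = SGDRegressor(random_state=0)
fresh.partial_fit(X_b, y_b)

print(coef_after_a, coef_after_b)
print(incremental.coef_, fresh.coef_)
```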

How to predict time series in scikit-learn?

Time series data consists of data points collected at regular time intervals, forming a sequence of observations over time. Time-series prediction is based on the principle that the current value of a series is influenced to some extent by its past values. Two properties of a series matter most for modeling: its mean and its variance. Ideally, you want both to be stable over time so that the underlying patterns are easier to see. To stabilise the variance, a logarithm transformation can be applied, while differencing (subtracting each value from the next one) can be used to remove trends.
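
A log transformation followed by first differencing can be sketched with NumPy alone. The series below is synthetic (pure exponential growth), chosen as an assumption so that the effect is easy to see.

```python
import numpy as np

# Synthetic series with exponential growth (illustrative values).
t = np.arange(1, 9)
series = 10.0 * 1.5 ** t

log_series = np.log(series)   # compresses growing variability
diffed = np.diff(log_series)  # first difference removes the trend

# For a pure exponential trend, every difference equals log(1.5).
print(diffed)
```

Note that differencing shortens the series by one observation, which matters when aligning it with other features.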

When it comes to predicting time series data, Recurrent Neural Networks (RNN) or Long Short-Term Memory (LSTM) algorithms, which fall under the umbrella of Deep Learning, have gained significant popularity. However, it's important to note that the scikit-learn library does not include built-in support for RNN or LSTM models. For those specific algorithms, you might find it more beneficial to explore other frameworks like TensorFlow or PyTorch, as these are widely used tools that allow you to effectively build RNN or LSTM models for time series prediction tasks.


Is it possible in scikit-learn to split into three sets directly?

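No, scikit-learn has no function that splits into three sets directly. However, to divide the dataset into train, validation and test sets with proportions 0.6, 0.2 and 0.2, you can call train_test_split twice: first hold out 20% as the test set, then take 25% of the remaining 80% (that is, 20% of the total) as the validation set.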

from sklearn.model_selection import train_test_split
x, x_test, y, y_test = train_test_split(xtrain, labels, test_size=0.2, train_size=0.8)
x_train, x_cv, y_train, y_cv = train_test_split(x, y, test_size=0.25, train_size=0.75)
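
A self-contained version of the two-step split is sketched below. The array contents and the random_state value are assumptions for illustration; only the proportions come from the answer above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# 100 illustrative samples with 2 features each and a binary label.
X = np.arange(200).reshape(100, 2)
y = np.arange(100) % 2

# First split: hold out 20% of the data as the test set.
X_rest, X_test, y_rest, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42
)

# Second split: 25% of the remaining 80% is 20% of the total (validation).
X_train, X_val, y_train, y_val = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=42
)

print(len(X_train), len(X_val), len(X_test))
```

Fixing random_state makes the split reproducible between runs.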

How to split data into 3 sets (train, validation and test)?

It can be achieved using numpy and pandas. np.split() first splits the shuffled dataset into three objects (train, validate and test), each of which then needs to be split into the independent variables, commonly known as X, and the target variable, known as Y.

import numpy as np
# df is an existing pandas DataFrame; sample(frac=1) shuffles its rows
train, validate, test = np.split(df.sample(frac=1), [int(.6*len(df)), int(.8*len(df))])

You get three objects: train holds the first 60% of the shuffled rows of df, validate holds the rows between the 60% and 80% marks, and test holds the last 20%.
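
The same 60/20/20 idea can also be applied to a shuffled array of row indices, which makes the later separation into X and Y straightforward and does not depend on how np.split treats a DataFrame in a given pandas version. This is a sketch on synthetic NumPy arrays; the data and the seed are assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))     # illustrative features
y = rng.integers(0, 2, size=10)  # illustrative binary target

n = len(X)
shuffled = rng.permutation(n)    # random order of the row indices

# Cut the shuffled indices at the 60% and 80% marks.
train_idx, val_idx, test_idx = np.split(shuffled, [int(0.6 * n), int(0.8 * n)])

X_train, y_train = X[train_idx], y[train_idx]
X_val, y_val = X[val_idx], y[val_idx]
X_test, y_test = X[test_idx], y[test_idx]

print(len(X_train), len(X_val), len(X_test))
```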

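What is the difference between scikit-learn and sklearn?

scikit-learn is the name of the project and of the package you install, while sklearn is the name you import in Python. The import name differs because scikit-learn contains a hyphen and is therefore not a valid Python identifier.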

import sklearn as scikit_learn

Is it possible to use a custom distance function with scikit-learn K-Means clustering?

Unfortunately no: by definition, the k-means clustering algorithm relies on the Euclidean distance from the mean of each cluster. scikit-learn's KMeans has no metric parameter, and it is not trivial to extend k-means to other distances. In principle you could write your own variant that still computes the mean of each cluster but assigns points using something like the Mahalanobis distance, but scikit-learn does not provide this.
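
The short check below illustrates both points on synthetic data (the two well-separated blobs and the seed are assumptions): KMeans exposes no metric parameter, and its predictions coincide with the nearest centroid under Euclidean distance.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
# Two well-separated blobs (illustrative data).
X = np.vstack([
    rng.normal(0.0, 0.5, size=(30, 2)),
    rng.normal(5.0, 0.5, size=(30, 2)),
])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

# KMeans has no "metric" entry among its constructor parameters.
has_metric = "metric" in km.get_params()

# Assigning each point to its nearest centroid by Euclidean distance
# reproduces what predict() returns.
dists = pairwise_distances(X, km.cluster_centers_, metric="euclidean")
nearest = dists.argmin(axis=1)
same = np.array_equal(km.predict(X), nearest)

print(has_metric, same)
```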

How do you solve overfitting in random forest of Python sklearn?

To mitigate overfitting in Random Forest (RF) models, a key approach is to optimize a tuning parameter that controls the number of features randomly chosen to grow each tree from the bootstrapped data. This parameter, often referred to as "max_features," determines the subset of features considered for each split, reducing the risk of individual trees being too specialized to the training data.

Additionally, obtaining more data can be highly beneficial in reducing overfitting tendencies. With an increased dataset size, random patterns that might seem predictive in smaller datasets are likely to be overshadowed, leading to a more generalized and accurate model. Separately, adding more trees rarely hurts accuracy, but growing an excessively large forest adds computational cost without significant improvement in predictive accuracy, so it is worth striking a balance.

Moreover, feature engineering and selection can play a crucial role in preventing overfitting. By carefully selecting relevant features and eliminating irrelevant or noisy ones, the model can focus on the most important patterns in the data, leading to better generalization.

Look at the following parameters:

  1. n_estimators: In general, the more trees, the less likely the algorithm is to overfit.
  2. max_features: Try reducing this number. The smaller it is, the less likely the model is to overfit, but too small a value will start to introduce underfitting.
  3. max_depth: Reducing the maximum depth helps to fight overfitting.
  4. min_samples_leaf: This has a similar effect to the max_depth parameter. A split is only made if it leaves at least this many samples in each resulting leaf, so larger values stop branches from growing too fine-grained.
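
The parameters above can be set directly on RandomForestClassifier. The sketch below uses a synthetic dataset, and the specific values (200 trees, depth 5, 5 samples per leaf) are arbitrary assumptions for illustration, not recommended settings; in practice they would be tuned, for example with cross-validation.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic classification data (illustrative only).
X, y = make_classification(
    n_samples=500, n_features=20, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

rf = RandomForestClassifier(
    n_estimators=200,      # more trees: more stable averaged predictions
    max_features="sqrt",   # features considered at each split
    max_depth=5,           # cap on how deep each tree may grow
    min_samples_leaf=5,    # each leaf must keep at least 5 samples
    random_state=0,
)
rf.fit(X_train, y_train)

train_acc = rf.score(X_train, y_train)
test_acc = rf.score(X_test, y_test)
deepest = max(tree.get_depth() for tree in rf.estimators_)

# A large gap between train and test accuracy is a sign of overfitting.
print(train_acc, test_acc, deepest)
```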

Is there a library function for Root mean square error (RMSE) in python?

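sklearn's mean_squared_error has long had a squared parameter with a default value of True; setting it to False makes the same function return RMSE instead of MSE. That parameter was deprecated in scikit-learn 1.4 and removed in 1.6. From version 1.4 onward, use root_mean_squared_error instead.

# scikit-learn < 1.6
from sklearn.metrics import mean_squared_error
rms = mean_squared_error(y_actual, y_predicted, squared=False)

# scikit-learn >= 1.4
from sklearn.metrics import root_mean_squared_error
rms = root_mean_squared_error(y_actual, y_predicted)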
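
A way that does not depend on the installed scikit-learn version is to take the square root of the MSE yourself. The values below are made up for illustration.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Illustrative targets and predictions.
y_actual = np.array([3.0, -0.5, 2.0, 7.0])
y_predicted = np.array([2.5, 0.0, 2.0, 8.0])

mse = mean_squared_error(y_actual, y_predicted)  # mean of squared errors
rmse = np.sqrt(mse)                              # back in the units of y

print(mse, rmse)
```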

How to extract the decision rules from scikit-learn decision-tree?

You can use scikit-learn's export_text to extract the rules from a tree. Once you've fit your model, you just need two lines of code.

from sklearn.tree import export_text
rules = export_text(loan_tree, feature_names=list(X_train.columns))
print(rules)
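
For a version that runs end to end, the sketch below fits a small tree on the built-in iris dataset. The dataset, max_depth=2 and the random_state are assumptions for illustration; loan_tree and X_train from the snippet above are replaced by these stand-ins.

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()

# A shallow tree keeps the printed rules short.
tree = DecisionTreeClassifier(max_depth=2, random_state=0)
tree.fit(iris.data, iris.target)

# export_text returns the rules as one indented, human-readable string.
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```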

What is the difference between 'transform' and 'fit_transform' in sklearn?

  1. The fit() method learns the model parameters from the training data. This is where the model "learns" from the data.
  2. The transform() method transforms the data (produces the model's outputs) using the parameters learned by fit().
  3. The fit_transform() method does both: it fits the model to the data, then transforms that same data according to the fitted model.
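
The three methods can be compared on a small example. StandardScaler and the data values are assumptions chosen for illustration; the usual pattern is fit_transform() on the training data and transform() alone on the test data, so that the test set is scaled with the statistics learned from training.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X_train = np.array([[1.0], [2.0], [3.0]])
X_test = np.array([[4.0]])

scaler = StandardScaler()

# fit_transform(): learn the mean and standard deviation, then scale X_train.
X_train_scaled = scaler.fit_transform(X_train)

# transform(): reuse the statistics learned from X_train on new data.
X_test_scaled = scaler.transform(X_test)

# fit() followed by transform() gives the same result as fit_transform().
two_step = StandardScaler().fit(X_train).transform(X_train)

print(scaler.mean_, X_train_scaled.ravel(), X_test_scaled.ravel())
```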