Classification Algorithms

Regression algorithms predict the output for continuous values, but to predict categorical values, you need to use classification algorithms. Classification is one of the most fundamental concepts in Data Science. A classification algorithm is a two-step process in Machine Learning: a learning step and a prediction step. In the learning step, the model is developed from the given training data. In the prediction step, the model is used to predict the response for new data. A classification problem has a discrete value as its output. One of the most common uses of classification is deciding whether an email is spam (1) or non-spam (0). Popular algorithms that can be used for binary classification include:
  1. Logistic Regression
  2. K-Nearest Neighbor
  3. Decision Tree
  4. Random Forest
  5. Support Vector Machine

Logistic Regression

Logistic Regression (LR) is a classification algorithm that comes under the Supervised Learning technique. It is used for predicting a categorical dependent variable from a given set of independent variables, so the outcome must be a categorical or discrete value. The outcome can be either that the event happens (1, true, etc.) or that it does not happen (0, false, etc.), but instead of giving the exact values 0 and 1, logistic regression gives probabilistic values that lie between 0 and 1. It is a simple and efficient method for binary and linear classification problems.
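The probabilistic output described above can be seen in a minimal sketch using scikit-learn's LogisticRegression on a synthetic dataset (the dataset and its parameters are illustrative assumptions, not from the original text):

```python
# Minimal sketch: binary classification with logistic regression.
# The synthetic dataset is an assumption for illustration only.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Generate a toy binary classification problem
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

clf = LogisticRegression().fit(X_train, y_train)

# predict_proba returns probabilities between 0 and 1 (one column per
# class), rather than only the hard 0/1 labels
probs = clf.predict_proba(X_test)
print(probs[0])               # two probabilities that sum to 1
print(clf.predict(X_test)[:5])  # hard 0/1 predictions
```

Thresholding the probability (by default at 0.5) is what turns the continuous output into the discrete class label.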

K-Nearest Neighbors

K-Nearest Neighbors (KNN) is a supervised machine learning algorithm that can be used to solve both classification and regression problems, though it is mostly used for classification. The algorithm assumes that similar things exist in close proximity; in other words, similar things are near each other. It stores all available cases and classifies new cases based on a similarity measure, making an "educated guess" about which class an unclassified point belongs to. This means that when new data appears, it can easily be classified into a well-suited category. KNN is considered both non-parametric and an example of lazy learning: non-parametric means it makes no assumptions about the underlying data distribution, and lazy learning means the algorithm makes no generalizations until it is asked to classify a new point.
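A small sketch with scikit-learn makes the "lazy" behaviour concrete: the model simply stores the training points and classifies a new point by majority vote among its k nearest neighbors (the toy 2-D points below are assumptions for illustration):

```python
# Sketch: KNN stores the training points ("lazy learning") and classifies
# a query by majority vote among its k nearest stored neighbours.
from sklearn.neighbors import KNeighborsClassifier

# Toy 2-D points: class 0 clusters near the origin, class 1 near (5, 5)
X_train = [[0, 0], [1, 0], [0, 1], [5, 5], [6, 5], [5, 6]]
y_train = [0, 0, 0, 1, 1, 1]

knn = KNeighborsClassifier(n_neighbors=3).fit(X_train, y_train)

# A point near the origin gets the majority class of its 3 nearest
# neighbours, all of which are class 0
print(knn.predict([[0.5, 0.5]]))   # -> [0]
print(knn.predict([[5.5, 5.5]]))   # -> [1]
```

Because no model is fit up front, all the work happens at prediction time, which is why KNN can be slow on large datasets.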

Decision Tree

Decision Trees are a type of Supervised Machine Learning algorithm that can be used to solve both classification and regression problems. A decision tree uses a tree representation to solve the problem, in which the data is continuously split according to a certain parameter. The tree can be explained by two entities, namely decision nodes and leaves. Each branch of the tree represents a possible decision, occurrence, or reaction. Decision nodes are used to make a decision and have multiple branches, whereas leaf nodes are the outputs of those decisions and do not contain any further branches. The deeper the tree, the more complex its decision rules and the more closely the model fits the training data. The decision rules are generally in the form of if-then-else statements. Important terminology related to decision trees:
  1. Root Node: It represents the entire population or sample, which further gets divided into two or more homogeneous sets.
  2. Splitting: It is the process of dividing a node into two or more sub-nodes.
  3. Decision Node: When a sub-node splits into further sub-nodes, it is called a decision node.
  4. Leaf/Terminal Node: Nodes that do not split further are called leaf or terminal nodes.
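The if-then-else structure of the learned rules can be printed directly with scikit-learn; here is a sketch on the classic Iris dataset (the choice of dataset and depth limit are illustrative assumptions):

```python
# Sketch: fit a shallow decision tree and print its if-then-else rules.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

iris = load_iris()
# max_depth=2 keeps the tree small enough to read; deeper trees give
# more complex rules that fit the training data more closely
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(
    iris.data, iris.target
)

# export_text renders the splits as nested if-then-else rules
rules = export_text(tree, feature_names=list(iris.feature_names))
print(rules)
```

The printed rules show the root node at the top, decision nodes as the nested conditions, and leaf nodes as the class predictions at the ends of the branches.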

Random Forest

A Random Forest (RF) is a type of Supervised Machine Learning algorithm used to solve regression and classification problems. The algorithm consists of many decision trees; a decision tree is a tree-shaped diagram used to determine a course of action. The "forest", a collection of decision trees built on various subsets of the given dataset, is trained through bagging, or bootstrap aggregating. Bagging is an ensemble meta-algorithm that improves predictive accuracy. An ensemble method, or ensemble learning algorithm, aggregates the outputs of a diverse set of predictors to obtain better results than any single predictor.
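The forest-of-trees idea is visible in scikit-learn's RandomForestClassifier, whose fitted model is literally a list of individual decision trees trained on bootstrap samples (the synthetic dataset below is an assumption for illustration):

```python
# Sketch: a random forest is an ensemble of decision trees trained via
# bagging (bootstrap samples of the data). Dataset is illustrative.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=1)

forest = RandomForestClassifier(n_estimators=50, random_state=1).fit(X, y)

# The fitted ensemble exposes its individual trees
print(len(forest.estimators_))       # 50 decision trees
# The forest's prediction averages the class probabilities of the trees
print(forest.predict_proba(X[:1]))
```

Because each tree sees a different bootstrap sample (and a random subset of features at each split), the trees disagree in different ways, and averaging their votes reduces variance compared with a single tree.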

Support Vector Machine

Support Vector Machines (SVMs) are a type of Supervised Machine Learning algorithm that is powerful for solving regression and classification problems. The main aim of the algorithm is to find a hyperplane in an N-dimensional space that distinctly separates the data points. There are many possible hyperplanes that could separate two classes of data points; the objective is to find the plane with the maximum margin, i.e., the maximum distance between the data points of both classes. Maximizing the margin provides some reinforcement so that future data points can be classified with more confidence.
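A linear SVM on two separable toy clusters shows the maximum-margin idea in a few lines; only the points closest to the separating hyperplane (the support vectors) determine where it lies (the 2-D points below are assumptions for illustration):

```python
# Sketch: a linear SVM finds the maximum-margin hyperplane between two
# separable classes. Toy data is illustrative.
from sklearn.svm import SVC

# Two linearly separable clusters in 2-D
X = [[0, 0], [1, 1], [1, 0], [4, 4], [5, 5], [4, 5]]
y = [0, 0, 0, 1, 1, 1]

svm = SVC(kernel="linear").fit(X, y)

# Only the boundary points (support vectors) define the margin; interior
# points like (0, 0) and (5, 5) do not affect the hyperplane
print(svm.support_vectors_)
print(svm.predict([[0.5, 0.5], [4.5, 4.5]]))   # -> [0 1]
```

Points well inside either cluster could be moved or removed without changing the decision boundary, which is what makes SVMs robust when the margin is wide.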