Linear Classifiers & Logistic Regression
Linear classifiers are amongst the most practical classification methods. For example, in our sentiment analysis case-study, a linear classifier associates a coefficient with the counts of each word in the sentence. In this module, you will become proficient in this type of representation. You will focus on a particularly useful type of linear classifier called logistic regression, which, in addition to allowing you to predict a class, provides a probability associated with the prediction. These probabilities are extremely useful, since they provide a degree of confidence in the predictions. In this module, you will also be able to construct features from categorical inputs, and to tackle classification problems with more than two class (multiclass problems). You will examine the results of these techniques on a real-world product sentiment analysis task.
Learning Linear Classifiers
Once familiar with linear classifiers and logistic regression, you can now dive in and write your first learning algorithm for classification. In particular, you will use gradient ascent to learn the coefficients of your classifier from data. You first will need to define the quality metric for these tasks using an approach called maximum likelihood estimation (MLE). You will also become familiar with a simple technique for selecting the step size for gradient ascent. An optional, advanced part of this module will cover the derivation of the gradient for logistic regression. You will implement your own learning algorithm for logistic regression from scratch, and use it to learn a sentiment analysis classifier.
Overfitting & Regularization in Logistic Regression
As we saw in the regression course, overfitting is perhaps the most significant challenge you will face as you apply machine learning approaches in practice. This challenge can be particularly significant for logistic regression, as you will discover in this module, since we not only risk getting an overly complex decision boundary, but your classifier can also become overly confident about the probabilities it predicts. In this module, you will investigate overfitting in classification in significant detail, and obtain broad practical insights from some interesting visualizations of the classifiers' outputs. You will then add a regularization term to your optimization to mitigate overfitting. You will investigate both L2 regularization to penalize large coefficient values, and L1 regularization to obtain additional sparsity in the coefficients. Finally, you will modify your gradient ascent algorithm to learn regularized logistic regression classifiers. You will implement your own regularized logistic regression classifier from scratch, and investigate the impact of the L2 penalty on real-world sentiment analysis data.
Decision Trees
Along with linear classifiers, decision trees are amongst the most widely used classification techniques in the real world. This method is extremely intuitive, simple to implement and provides interpretable predictions. In this module, you will become familiar with the core decision trees representation. You will then design a simple, recursive greedy algorithm to learn decision trees from data. Finally, you will extend this approach to deal with continuous inputs, a fundamental requirement for practical problems. In this module, you will investigate a brand new case-study in the financial sector: predicting the risk associated with a bank loan. You will implement your own decision tree learning algorithm on real loan data.
Preventing Overfitting in Decision Trees
Out of all machine learning techniques, decision trees are amongst the most prone to overfitting. No practical implementation is possible without including approaches that mitigate this challenge. In this module, through various visualizations and investigations, you will investigate why decision trees suffer from significant overfitting problems. Using the principle of Occam's razor, you will mitigate overfitting by learning simpler trees. At first, you will design algorithms that stop the learning process before the decision trees become overly complex. In an optional segment, you will design a very practical approach that learns an overly-complex tree, and then simplifies it with pruning. Your implementation will investigate the effect of these techniques on mitigating overfitting on our real-world loan data set.
Handling Missing Data
Real-world machine learning problems are fraught with missing data. That is, very often, some of the inputs are not observed for all data points. This challenge is very significant, happens in most cases, and needs to be addressed carefully to obtain great performance. And, this issue is rarely discussed in machine learning courses. In this module, you will tackle the missing data challenge head on. You will start with the two most basic techniques to convert a dataset with missing data into a clean dataset, namely skipping missing values and inputing missing values. In an advanced section, you will also design a modification of the decision tree learning algorithm that builds decisions about missing data right into the model. You will also explore these techniques in your real-data implementation.
Boosting
One of the most exciting theoretical questions that have been asked about machine learning is whether simple classifiers can be combined into a highly accurate ensemble. This question lead to the developing of boosting, one of the most important and practical techniques in machine learning today. This simple approach can boost the accuracy of any classifier, and is widely used in practice, e.g., it's used by more than half of the teams who win the Kaggle machine learning competitions. In this module, you will first define the ensemble classifier, where multiple models vote on the best prediction. You will then explore a boosting algorithm called AdaBoost, which provides a great approach for boosting classifiers. Through visualizations, you will become familiar with many of the practical aspects of this techniques. You will create your very own implementation of AdaBoost, from scratch, and use it to boost the performance of your loan risk predictor on real data.
Precision-Recall
In many real-world settings, accuracy or error are not the best quality metrics for classification. You will explore a case-study that significantly highlights this issue: using sentiment analysis to display positive reviews on a restaurant website. Instead of accuracy, you will define two metrics: precision and recall, which are widely used in real-world applications to measure the quality of classifiers. You will explore how the probabilities output by your classifier can be used to trade-off precision with recall, and dive into this spectrum, using precision-recall curves. In your hands-on implementation, you will compute these metrics with your learned classifier on real-world sentiment analysis data.
Scaling to Huge Datasets & Online Learning
With the advent of the internet, the growth of social media, and the embedding of sensors in the world, the magnitudes of data that our machine learning algorithms must handle have grown tremendously over the last decade. This effect is sometimes called "Big Data". Thus, our learning algorithms must scale to bigger and bigger datasets. In this module, you will develop a small modification of gradient ascent called stochastic gradient, which provides significant speedups in the running time of our algorithms. This simple change can drastically improve scaling, but makes the algorithm less stable and harder to use in practice. In this module, you will investigate the practical techniques needed to make stochastic gradient viable, and to thus to obtain learning algorithms that scale to huge datasets. You will also address a new kind of machine learning problem, online learning, where the data streams in over time, and we must learn the coefficients as the data arrives. This task can also be solved with stochastic gradient. You will implement your very own stochastic gradient ascent algorithm for logistic regression from scratch, and evaluate it on sentiment analysis data.