## Computer Science Assignment Help | Machine Learning Assignment and Exam Help | COMP4702

December 23, 2022

## Logistic Regression

When developing regular linear regressors in Chapter 2, we wanted a model $f_\theta$ whose estimates $f_\theta\left(x_i\right)$ were as close as possible to the (real-valued) labels $y_i$. When adapting a linear regression algorithm to classification, we might instead seek models that associate positive values of $x_i \cdot \theta$ with positive labels $\left(y_i=1\right)$, and negative values of $x_i \cdot \theta$ with negative labels $\left(y_i=0\right)$.

If we could do so, we could write down the accuracy associated with a particular model:
$$\frac{1}{|y|} \sum_{i=1}^{|y|}\left(\underbrace{\delta\left(y_i=0\right) \delta\left(x_i \cdot \theta \leq 0\right)}_{\text {label is negative and prediction is negative }}+\overbrace{\delta\left(y_i=1\right) \delta\left(x_i \cdot \theta>0\right)}^{\text {label is positive and prediction is positive }}\right) \tag{3.1}$$
(Here $\delta$ is an indicator function that returns 1 if its argument is true and 0 otherwise.) Despite the slightly confusing notation, the equation simply counts the fraction of instances for which we correctly predict a positive score for a positively labeled instance, or a negative (or zero) score for a negatively labeled instance.
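The accuracy expression above can be sketched in a few lines of plain Python. This is only an illustration: the function name `accuracy`, the toy feature vectors, and the parameter values are hypothetical and do not come from the text.

```python
def accuracy(X, y, theta):
    """Fraction of instances whose score sign agrees with the label."""
    correct = 0
    for x_i, y_i in zip(X, y):
        score = sum(a * b for a, b in zip(x_i, theta))  # x_i . theta
        if y_i == 0 and score <= 0:
            correct += 1  # label is negative and prediction is negative
        elif y_i == 1 and score > 0:
            correct += 1  # label is positive and prediction is positive
    return correct / len(y)


# Hypothetical toy data; the first feature is a constant offset term.
X = [[1, 2.0], [1, -1.0], [1, 0.5], [1, -3.0]]
y = [1, 0, 0, 1]
theta = [0.0, 1.0]

print(accuracy(X, y, theta))
```

With these made-up values the first two instances are classified correctly and the last two are not, so half of the predictions agree with their labels.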

We now simply want a classifier $\theta$ that maximizes the accuracy measured by Equation (3.1). Unfortunately, directly optimizing Equation (3.1) for $\theta$ is NP-hard (see, e.g., Nguyen and Sanner (2013)). To get a sense of why it is difficult, consider that the function in Equation (3.1) is essentially a step function (Figure 3.1, left): it is flat (derivative zero) almost everywhere, and is therefore not amenable to techniques like gradient ascent, as we saw in Section 2.5.
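To make the "flat almost everywhere" point concrete, the following sketch estimates the slope of the accuracy with a finite difference. The helper names, the toy data, and the step size `eps` are all assumptions chosen for illustration.

```python
def accuracy(X, y, theta):
    """Fraction of instances whose score sign agrees with the label."""
    hits = 0
    for x_i, y_i in zip(X, y):
        score = sum(a * b for a, b in zip(x_i, theta))
        hits += int((score > 0) == (y_i == 1))
    return hits / len(y)


def finite_difference(X, y, theta, k, eps=1e-4):
    """Approximate the partial derivative of accuracy w.r.t. theta[k]."""
    bumped = list(theta)
    bumped[k] += eps
    return (accuracy(X, y, bumped) - accuracy(X, y, theta)) / eps


# Hypothetical toy data; the first feature is a constant offset term.
X = [[1, 2.0], [1, -1.0], [1, 0.5], [1, -3.0]]
y = [1, 0, 0, 1]
theta = [0.0, 1.0]

slopes = [finite_difference(X, y, theta, k) for k in range(len(theta))]
print(slopes)                           # small nudges leave accuracy unchanged
print(accuracy(X, y, theta))            # accuracy at theta
print(accuracy(X, y, [-1.0, 1.0]))      # a large move can still change it
```

On this toy data a small nudge to either parameter flips no prediction, so the estimated slope is zero and gives gradient ascent no direction to follow, even though a larger change to $\theta$ does improve the accuracy.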

So, to optimize the accuracy approximately, we would like a function that is similar to Equation (3.1), but is more straightforward to optimize.

Logistic Regression achieves this goal by converting the outputs of a linear function $x_i \cdot \theta$ to probabilities via a smooth function. Our intuition is that large values of $x_i \cdot \theta$ should correspond to high probabilities, and small (i.e., large negative) values of $x_i \cdot \theta$ should correspond to low probabilities.
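As a minimal sketch of such a smooth mapping, the code below assumes the sigmoid function $\sigma(t)=1 /\left(1+e^{-t}\right)$, the usual choice for logistic regression. The function names and example values are hypothetical.

```python
import math


def sigmoid(t):
    """Map a real-valued score to a value in (0, 1)."""
    if t >= 0:
        return 1.0 / (1.0 + math.exp(-t))
    # For negative t, rewrite the expression to avoid overflow in exp(-t).
    z = math.exp(t)
    return z / (1.0 + z)


def predict_probability(x, theta):
    """Probability of a positive label under the assumed sigmoid model."""
    score = sum(a * b for a, b in zip(x, theta))  # x . theta
    return sigmoid(score)


# Hypothetical parameters and feature vectors (constant offset term first).
theta = [0.0, 1.0]
for x in ([1, -10.0], [1, 0.0], [1, 10.0]):
    print(x, predict_probability(x, theta))
```

Large positive scores map to probabilities near 1, large negative scores to probabilities near 0, and a score of exactly zero maps to 0.5, which matches the intuition described above.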

## Other Classification Techniques

In our introduction to classification, we have only discussed a single classification technique: Logistic Regression. Our choice to explore this particular technique was largely a practical one: the idea of associating a probability with a particular outcome (as in Equation (3.5)) and estimating that probability via a differentiable function (to facilitate gradient ascent) will appear repeatedly as we develop more and more complex models.

However, the technique we have explored is only one of several approaches to building classifiers. The specific choice to map binary labels to continuous probabilities via a smooth function carries hidden assumptions and limitations, meaning that logistic regression is not the ideal classifier for every situation. Below we present a few alternatives, largely as further reading and to highlight specific situations where logistic regression may not be the preferable choice.

Support Vector Machines: While logistic regressors optimize a probability associated with a set of observed labels, they do not explicitly minimize the number of mistakes made by the classifier. Support Vector Machines (SVMs) (Cortes and Vapnik, 1995) replace the sigmoid function in Figure 3.1 with an expression that assigns zero cost to correctly classified examples and a positive cost to incorrectly classified examples (in proportion to the confidence of the prediction $x \cdot \theta$). This distinction is fairly subtle: while every sample will influence the optimal value of $\theta$ for a logistic regressor, the solution found by an SVM is entirely determined by a few samples closest to the classification boundary, or those that are mislabeled. Conceptually it is appealing for a classifier to focus on the most 'difficult' samples in this way, though note that in many cases (and notably when building recommender systems) our goal is to optimize ranking performance rather than classification accuracy (as we will discuss in Section 3.3.3), such that giving special attention to the most ambiguous examples is not necessarily desirable.
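The contrast between the two costs can be sketched as follows. This assumes the standard hinge loss with a margin of 1 for the SVM (so an example incurs zero cost only when it is correct by at least that margin) and the negative log-likelihood for logistic regression. The function names and scores are hypothetical.

```python
import math


def hinge_loss(score, label):
    """Assumed SVM cost: zero beyond the margin, linear in the score otherwise."""
    sign = 1 if label == 1 else -1
    return max(0.0, 1.0 - sign * score)


def logistic_loss(score, label):
    """Negative log-likelihood of the label under a sigmoid model."""
    sign = 1 if label == 1 else -1
    margin = sign * score
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical scores x . theta for positively labeled examples.
for score in (3.0, 0.5, -2.0, -4.0):
    print(score, hinge_loss(score, 1), logistic_loss(score, 1))
```

Under these assumptions a confidently correct example (score 3.0) contributes exactly zero hinge cost but still a small logistic cost, which is why every sample influences a logistic regressor while only examples near or on the wrong side of the boundary influence the SVM solution.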

