General Suggestions for Removing or Adding Inputs
The following is a general guide to removing inputs:
(a) Remove an independent variable (input) if it has zero variance, which implies that the input has a single unique value (Kuhn and Johnson 2013).
(b) Remove an independent variable (input) if it has near-zero variance, which implies that the input takes very few unique values.
(c) Remove an independent variable (input) if it is highly correlated with another input variable (nearly perfect correlation), since both are measuring the same underlying information (Kuhn and Johnson 2013). This phenomenon, known as collinearity in statistical machine learning, is important because in its presence the parameter estimates of some machine learning algorithms (for example, those based on gradient descent) are inflated (not accurately estimated).
These three issues are very common in genomic prediction, since part of the independent variables consists of marker information: many markers have zero or near-zero variance, and some pairs of markers are very highly correlated. One advantage of removing inputs prior to the modeling process is that it reduces the computational resources needed to implement the statistical machine learning algorithm. It can also lead to a more parsimonious and interpretable model. Another advantage is that models with less correlated inputs are less prone to unstable parameter estimates, numerical errors, and degraded prediction performance (Kuhn and Johnson 2013).
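The three removal rules can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions: the function name `drop_uninformative_inputs`, the thresholds `var_tol` and `corr_tol`, and the toy marker matrix are all hypothetical and are not taken from Kuhn and Johnson (2013). A single variance threshold stands in for both the zero and near-zero variance rules.

```python
import numpy as np

def drop_uninformative_inputs(X, var_tol=0.1, corr_tol=0.95):
    """Return the indices of the columns kept after applying rules (a)-(c).

    var_tol and corr_tol are illustrative thresholds, not recommended values.
    """
    X = np.asarray(X, dtype=float)
    # Rules (a) and (b): drop columns whose variance is zero or below var_tol.
    keep = [j for j in range(X.shape[1]) if X[:, j].var() > var_tol]
    if not keep:
        return []
    # Rule (c): among the survivors, drop the later column of any pair whose
    # absolute Pearson correlation exceeds corr_tol.
    corr = np.abs(np.atleast_2d(np.corrcoef(X[:, keep], rowvar=False)))
    selected = []  # positions within `keep`
    for pos in range(len(keep)):
        if all(corr[pos, prev] <= corr_tol for prev in selected):
            selected.append(pos)
    return [keep[pos] for pos in selected]

# Toy marker matrix (made-up values coded 0/1/2), ten lines by five markers.
m2 = np.array([0, 1, 2, 0, 1, 2, 0, 1, 2, 0])
X = np.column_stack([
    np.ones(10),                             # marker 0: a single unique value
    np.array([0, 0, 0, 0, 0, 0, 0, 0, 0, 1]),  # marker 1: near-zero variance
    m2,                                      # marker 2: informative
    2 - m2,                                  # marker 3: perfectly correlated with marker 2
    np.array([2, 2, 1, 1, 0, 0, 2, 1, 0, 1]),  # marker 4: informative
])
kept = drop_uninformative_inputs(X)
print(kept)  # markers 2 and 4 remain
```

Because the variance of a column depends on its scale, a threshold such as `var_tol` only makes sense when the inputs share a common coding, as the marker columns do here.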
The following are general rules for the addition of input variables:
(a) Create dummy variables from nominal or categorical inputs.
(b) Manually create a categorical variable from a continuous variable.
(c) Transform the original input variable using a specific transformation.
First, we describe the process of creating dummy variables from categorical (nominal or ordinal) inputs. Most supervised statistical machine learning methods require categorical inputs to be transformed into dummy variables; supplying the original categorical variable without this transformation is incorrect and should be avoided. However, when the dependent variable is categorical, most statistical machine learning methods do not require it to be transformed into dummy variables. For example, assume that we are studying three genotypes (G1, G2, and G3) in two environments (E1 and E2) and we collected the following grain yield data.
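As a minimal sketch of dummy coding for the genotype and environment factors in this example, the snippet below builds 0/1 indicator columns in plain Python. The helper `dummy_code`, the choice of dropping the first level as a reference category, and the layout of one observation per genotype-environment combination are assumptions made for illustration; no yield values are used.

```python
def dummy_code(values, drop_first=True):
    """Map a list of category labels to rows of 0/1 indicator variables.

    With drop_first=True the first level (in sorted order) is the reference
    category and gets no column of its own.
    """
    levels = sorted(set(values))
    used = levels[1:] if drop_first else levels
    rows = [[1 if v == level else 0 for level in used] for v in values]
    return used, rows

# Hypothetical layout: each genotype observed once in each environment.
genotype = ["G1", "G2", "G3", "G1", "G2", "G3"]
environment = ["E1", "E1", "E1", "E2", "E2", "E2"]

g_levels, g_dummies = dummy_code(genotype)     # columns for G2 and G3
e_levels, e_dummies = dummy_code(environment)  # column for E2

for g, e, gd, ed in zip(genotype, environment, g_dummies, e_dummies):
    print(g, e, gd + ed)
```

Dropping one level per factor keeps the indicator columns from being perfectly collinear with an intercept; methods that do not fit an intercept may prefer `drop_first=False`.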
Principal Component Analysis as a Compression Method
Principal component analysis (PCA) is a method often used to compress the input data without losing much information. PCA works on a rectangular matrix in which the rows represent the observations $(n)$ and the columns represent the independent variables $(p)$. PCA creates linear combinations of the columns of the matrix $\boldsymbol{X}$ and generates, at most, $p$ linear combinations, called principal components. These linear combinations, or principal components, can be obtained as follows:
$$
\begin{aligned}
\mathrm{PC}_1 &= w_1 \boldsymbol{X} = w_{11} X_1 + w_{12} X_2 + \cdots + w_{1p} X_p \\
&\;\;\vdots \\
\mathrm{PC}_p &= w_p \boldsymbol{X} = w_{p1} X_1 + w_{p2} X_2 + \cdots + w_{pp} X_p
\end{aligned}
$$
These linear combinations are constructed in such a way that the first principal component, $\mathrm{PC}_1$, captures the largest variance, the second principal component, $\mathrm{PC}_2$, captures the second largest variance, and so on. For this reason, it is expected that a few principal components $(k<p)$ can explain most of the variability contained in the original rectangular matrix $(\boldsymbol{X})$, which means that a compressed matrix, $\boldsymbol{X}^*$, retains most of the variability of the original matrix with a significant reduction in the number of columns. In matrix notation, the full set of principal components is obtained with the following expression:
$$
\mathrm{PC}=\boldsymbol{X} \boldsymbol{W},
$$
where $\boldsymbol{W}$ is a $p$-by-$p$ matrix of weights whose columns are the eigenvectors of $\boldsymbol{Q}=\boldsymbol{X}^{\mathrm{T}} \boldsymbol{X}$; that is, we first need to calculate the eigenvalue decomposition of $\boldsymbol{Q}$, which is equal to $\boldsymbol{Q}=\boldsymbol{W} \boldsymbol{\Lambda} \boldsymbol{W}^{\mathrm{T}}$, where $\boldsymbol{W}$ represents the matrix of eigenvectors and $\boldsymbol{\Lambda}$ is a diagonal matrix of order $p$-by-$p$ containing the eigenvalues. For this reason, if we use $k<p$ principal components, the reduced (compressed) matrix is of order $n \times k$ and is calculated as
$$
\boldsymbol{X}^*=\boldsymbol{X} \boldsymbol{W}^*,
$$
where $\boldsymbol{W}^*$ contains the same rows of $\boldsymbol{W}$, but only the first $k$ columns instead of the original $p$ columns.
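The compression step can be sketched with NumPy's symmetric eigendecomposition. This is an illustrative sketch, not a reference implementation: the function name `pca_compress` and the simulated data are hypothetical, and the columns of $\boldsymbol{X}$ are centered before forming $\boldsymbol{Q}$, which the text does not state explicitly but which is the usual assumption when the components are meant to capture variance.

```python
import numpy as np

def pca_compress(X, k):
    """Compress an n-by-p matrix to n-by-k using the first k principal components."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)            # assumption: columns are centered first
    Q = Xc.T @ Xc                      # p-by-p matrix Q = X^T X
    eigvals, W = np.linalg.eigh(Q)     # eigh returns eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]  # reorder so the largest variance comes first
    eigvals, W = eigvals[order], W[:, order]
    W_star = W[:, :k]                  # same rows as W, first k columns only
    X_star = Xc @ W_star               # compressed matrix X* = X W*
    return X_star, W_star, eigvals

# Simulated inputs (hypothetical): 50 observations, 5 variables, with the
# fourth variable built to be nearly collinear with the first.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=50)

X_star, W_star, eigvals = pca_compress(X, k=2)
explained = eigvals[:2].sum() / eigvals.sum()
print(X_star.shape, round(float(explained), 3))
```

The ratio `explained` is the share of total variability retained by the first $k$ components, which is one common way to choose $k$.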

