【MAFA23】Computing and Technology Midterm Review

L1 Intro

1.1 Definition of ML


Machine learning is an application of AI.

It gives systems the ability to learn and improve.

1.2 Types of Learning Compared

Feature / Type	Supervised	Unsupervised	Reinforcement
Training dataset	Labeled (desired output known); there is a correct answer	Not labeled (output unknown); there is no correct answer	No labelled input/output pairs; sub-optimal actions need not be explicitly corrected; the machine receives feedback at each learning step
Goal	Learn the mapping from inputs to outputs	Discover patterns or structure in the data	Learn by interacting with the environment in discrete time steps
Algorithm	- Regression (continuous real-valued output) - Classification (discrete-valued output)	- Clustering (K-means, hierarchical clustering) - Non-clustering	Q-learning, SARSA, Deep Q-Network (DQN), policy gradient methods, etc.

1.2.1 Types of Unsupervised Learning

Types of unsupervised learning and how they differ

1.2.2 Reinforcement Learning Workflow


1.3 Machine Learning Application Flowchart


L2 Linear Regression

2.1 Linear Regression Models

2.1.1. Univariate Linear Regression Model

Predicts a continuous real-valued output.


2.1.1.1 Loss Function vs. Cost Function

Loss function: the error on a single training example. Cost function: the average of the loss over the whole training set.

2.1.1.2 OLS Regression

Ordinary Least Squares (OLS) regression

One dependent (response) variable

One or more independent (explanatory) variables

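As a rough illustration of OLS with one explanatory variable, the sketch below fits a line with NumPy's least-squares solver. The data, the seed, and the names `beta0` and `beta1` are assumptions made up for this example, not taken from the lecture.

```python
import numpy as np

# Toy data (made up for illustration): y is roughly 2 + 3x plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = 2.0 + 3.0 * x + rng.normal(scale=1.0, size=x.size)

# Design matrix with a column of ones for the intercept.
X = np.column_stack([np.ones_like(x), x])

# OLS picks the coefficients that minimise the sum of squared residuals.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
beta0, beta1 = beta

residuals = y - X @ beta
print(f"intercept = {beta0:.3f}, slope = {beta1:.3f}")
print(f"sum of squared residuals = {np.sum(residuals ** 2):.3f}")
```

At the OLS solution the residuals are uncorrelated with every column of the design matrix, which is one way to check a fit.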

2.1.1.3 Understanding the Cost Function

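To get a feel for the cost function, the sketch below evaluates the mean squared error form $J(\beta_0, \beta_1) = \frac{1}{N} \sum_{i=1}^{N} (h_{\beta}(x_i) - y_i)^2$ for a few candidate slopes. The toy points and the helper name `cost` are assumptions for illustration; some courses use a $\frac{1}{2N}$ factor instead, which does not change the minimiser.

```python
import numpy as np

def cost(beta0, beta1, x, y):
    """Mean squared error of the line h(x) = beta0 + beta1 * x."""
    predictions = beta0 + beta1 * x
    return np.mean((predictions - y) ** 2)

# Toy points lying exactly on y = 1 + 2x (made up for illustration).
x = np.array([0.0, 1.0, 2.0, 3.0])
y = 1.0 + 2.0 * x

# Hold the intercept at 1 and vary the slope: the cost is lowest at the true slope.
for slope in [0.0, 1.0, 2.0, 3.0]:
    print(f"beta1 = {slope:.1f} -> J = {cost(1.0, slope, x, y):.2f}")
```

The cost is zero at the true slope of 2 and grows symmetrically on either side, which is the bowl shape that optimisation relies on.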

2.1.1.4 Two Ways to Compute the Cost Function



2.1.2. Multivariate Linear Regression Model

2.1.3. Polynomial Regression Model

y is modelled as a Kth-degree polynomial in x.
It is still considered linear regression, because the model is linear in its coefficients.

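One way to see why polynomial regression still counts as linear regression: the powers of x become extra columns of the design matrix, and the fit is an ordinary least-squares problem in the coefficients. The cubic used here and the name `X_poly` are assumptions for illustration only.

```python
import numpy as np

# Toy data from a cubic (made up): y = 1 - 2x + 0.5x^3, no noise.
x = np.linspace(-2, 2, 30)
y = 1.0 - 2.0 * x + 0.5 * x ** 3

K = 3  # degree of the polynomial
# Columns are x^0, x^1, ..., x^K, so the model is linear in the coefficients.
X_poly = np.vander(x, N=K + 1, increasing=True)

coeffs, *_ = np.linalg.lstsq(X_poly, y, rcond=None)
print(np.round(coeffs, 3))
```

Because the data is noise-free, the recovered coefficients should match the ones used to generate it, up to floating-point error.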

2.2 Linear Regression Models

2.2.1. Feature Scaling

Purpose: ensure that the features are on a similar scale.

*Two feature scaling methods:

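The original figure is not available, so the sketch below assumes the two methods are the common pair: min-max scaling and standardization (z-score). The feature values are made up for illustration.

```python
import numpy as np

# Two features on very different scales (made-up values):
# house size in square feet and number of bedrooms.
X = np.array([[2104.0, 3.0],
              [1600.0, 3.0],
              [2400.0, 4.0],
              [1416.0, 2.0],
              [3000.0, 5.0]])

# Min-max scaling: map each column onto the interval [0, 1].
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))

# Standardization (z-score): zero mean and unit standard deviation per column.
X_standard = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_minmax)
print(X_standard)
```

After either transformation both columns have comparable magnitudes, which typically helps gradient-based fitting converge faster.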

2.2.2. Model Regularization

*2.2.2.1 Definition of over-fitting:

The model matches the training data “too closely”.
It does well on the training data.
It captures noise, errors and disturbances rather than just the true values (signal).
It produces a model that does not generalize.

*2.2.2.2 Two ways to address the problem of overfitting (reduce the number of variables, or shrink the weights of extreme values):

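As one possible illustration of the weight-shrinking approach, the sketch below uses ridge (L2) regularization in closed form. Ridge is an assumption here, since the original figure is missing and the lecture may have used a different penalty; the data and the helper `ridge_fit` are made up for the example.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution (X'X + lam * I)^(-1) X'y, without an intercept."""
    n_features = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(n_features), X.T @ y)

# Small, noisy data set with many irrelevant features (made up), which invites overfitting.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 8))
true_beta = np.array([3.0, -2.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
y = X @ true_beta + rng.normal(scale=1.0, size=20)

norms = []
for lam in [0.0, 1.0, 10.0, 100.0]:
    beta = ridge_fit(X, y, lam)
    norms.append(np.linalg.norm(beta))
    print(f"lambda = {lam:6.1f} -> ||beta|| = {norms[-1]:.3f}")
```

With `lam = 0` this reduces to plain OLS; larger values of `lam` pull the coefficients toward zero, trading a little bias for lower variance.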

L3 Implementing ML Algos

3.1 Implementation Procedures

Goal of linear regression: build a model that accurately predicts unknown cases.

3.1.1. Regression Evaluation Metrics

• Train and Test on the Same Data Set
• Train/Test Split

3.1.2. Training Accuracy vs. Out-of-Sample Accuracy

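A minimal sketch of a train/test split, comparing the error on the training data with the error on held-out data. The 80/20 ratio, the toy data, and the helpers `design` and `mse` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(42)
x = rng.uniform(0, 10, size=100)
y = 1.0 + 2.0 * x + rng.normal(scale=2.0, size=100)

# Shuffle the indices, then hold out 20% as a test set.
# The two index sets are mutually exclusive by construction.
indices = rng.permutation(x.size)
n_test = int(0.2 * x.size)
test_idx, train_idx = indices[:n_test], indices[n_test:]

def design(values):
    """Design matrix with an intercept column."""
    return np.column_stack([np.ones_like(values), values])

# Fit on the training portion only.
beta, *_ = np.linalg.lstsq(design(x[train_idx]), y[train_idx], rcond=None)

def mse(idx):
    """Mean squared error of the fitted line on the rows in idx."""
    return np.mean((design(x[idx]) @ beta - y[idx]) ** 2)

print(f"training MSE      = {mse(train_idx):.3f}")
print(f"out-of-sample MSE = {mse(test_idx):.3f}")
```

The out-of-sample figure is the one that estimates how the model would do on unknown cases; the training figure alone tends to be optimistic.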

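3.1.3. Problem of Misrepresentation

3.1.3.1 Meaning of misrepresentation

The evaluation result depends on how the data is split:
– Testing and training sets should be mutually exclusive; if they overlap, the evaluation may misrepresent the model's performance.
– A poor split can leave markedly different extreme values in the test set or the training set.
– Overfitting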

3.1.3.2 Solution: K-Fold Cross Validation

A resampling procedure on a limited data sample

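Split the data: the data set is randomly split into K equal-sized subsets.
Train and validate: for each of the K iterations:
- Select 1 subset as the test set
- Use the remaining K-1 subsets as the training set to train the model
- Evaluate the model's performance on the test set
Take the average of the performance measures: after all K iterations, average the model's performance metrics (e.g. accuracy, recall, F1 score) to obtain an overall estimate of model performance.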

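The K-fold steps above can be sketched by hand with NumPy. The straight-line model, the use of MSE as the performance measure, the toy data, and the helper `k_fold_mse` are all assumptions for illustration.

```python
import numpy as np

def k_fold_mse(x, y, k, seed=0):
    """Average test MSE of a straight-line fit over k folds."""
    rng = np.random.default_rng(seed)
    # Randomly split the row indices into k (near-)equal subsets.
    folds = np.array_split(rng.permutation(len(x)), k)
    scores = []
    for i in range(k):
        test_idx = folds[i]                      # one subset is the test set
        train_idx = np.concatenate(              # the other k-1 form the training set
            [folds[j] for j in range(k) if j != i]
        )
        X_train = np.column_stack([np.ones(len(train_idx)), x[train_idx]])
        X_test = np.column_stack([np.ones(len(test_idx)), x[test_idx]])
        beta, *_ = np.linalg.lstsq(X_train, y[train_idx], rcond=None)
        scores.append(np.mean((X_test @ beta - y[test_idx]) ** 2))
    return np.mean(scores), scores

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, size=100)
y = 1.0 + 2.0 * x + rng.normal(scale=2.0, size=100)

average, per_fold = k_fold_mse(x, y, k=5)
print("per-fold MSE:", np.round(per_fold, 3))
print("average MSE :", round(float(average), 3))
```

Every observation is used for testing exactly once, so the averaged score depends less on any single split than one train/test split does.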

3.2 Vectorization
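
As a hedged illustration of what vectorization buys, the sketch below computes the same mean squared error cost twice: once with explicit Python loops and once with a single matrix-vector product. The data and the function names `cost_loop` and `cost_vectorized` are assumptions for this example.

```python
import numpy as np

def cost_loop(beta, X, y):
    """Cost computed one training example and one feature at a time."""
    total = 0.0
    for i in range(X.shape[0]):
        prediction = 0.0
        for j in range(X.shape[1]):
            prediction += X[i, j] * beta[j]
        total += (prediction - y[i]) ** 2
    return total / X.shape[0]

def cost_vectorized(beta, X, y):
    """Same cost using one matrix-vector product."""
    errors = X @ beta - y
    return (errors @ errors) / X.shape[0]

# Made-up data: an intercept column plus three random features.
rng = np.random.default_rng(1)
X = np.column_stack([np.ones(200), rng.normal(size=(200, 3))])
beta = np.array([0.5, -1.0, 2.0, 0.3])
y = X @ beta + rng.normal(scale=0.1, size=200)

print(cost_loop(beta, X, y), cost_vectorized(beta, X, y))
```

Both functions return the same value up to floating-point rounding; the vectorized form is shorter and hands the arithmetic to NumPy's compiled routines.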

3.3 Model

3.3.1 libraries

NumPy: processing matrices and arrays

SciPy: high-level mathematical computation

Matplotlib: 2D and higher-dimensional plots

Pandas: data importing, manipulation and analysis

Scikit-Learn: machine learning tools

L4 Interpreting Regression Results

4.1 Linear Regression Models

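Term	Explanation	Role	Key points
Parameter Estimates	Estimated values of the unknown parameters of the regression model (e.g. intercept and slope).	Describe the mathematical relationship between each independent variable and the dependent variable.	- The sign of a coefficient shows whether the relationship is positive or negative. - The size of a coefficient shows how much the mean of the dependent variable changes when that variable changes.
T Statistics	A statistic used to test whether an individual regression coefficient differs significantly from 0.	Indicates the statistical significance of the regression coefficient.	- The larger the absolute t-value, the higher the statistical significance. - The t-value is a signal-to-noise ratio: in regression, the coefficient estimate divided by its standard error (in a two-sample t-test, the between-group difference relative to the within-group variation).
P-value	The probability associated with the t-value: the probability of seeing a result at least this extreme by chance if the true coefficient were 0.	Used to judge whether the result of a statistical test is statistically significant.	- The lower the p-value, the less likely the result arose by chance; a p-value below 0.05 is conventionally considered statistically significant.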

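To make the three terms concrete, the sketch below computes parameter estimates, t-statistics and two-sided p-values by hand for an OLS fit, using NumPy and `scipy.stats`. The data is simulated for illustration: `x1` is built to matter and `x2` is built to be irrelevant.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 100
x1 = rng.normal(size=n)   # truly related to y
x2 = rng.normal(size=n)   # unrelated to y
y = 1.0 + 2.0 * x1 + rng.normal(scale=1.0, size=n)

X = np.column_stack([np.ones(n), x1, x2])

# Parameter estimates from the normal equations.
beta = np.linalg.solve(X.T @ X, X.T @ y)

# Standard errors from the residual variance and (X'X)^(-1).
residuals = y - X @ beta
dof = n - X.shape[1]
sigma2 = residuals @ residuals / dof
std_err = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))

# t-statistic: estimate divided by its standard error.
t_stats = beta / std_err
# Two-sided p-value from the t distribution with n - p degrees of freedom.
p_values = 2 * stats.t.sf(np.abs(t_stats), dof)

for name, b, t, p in zip(["const", "x1", "x2"], beta, t_stats, p_values):
    print(f"{name:5s} coef = {b:7.3f}  t = {t:7.2f}  p = {p:.4f}")
```

The coefficient on `x1` should come out with a large t-value and a tiny p-value, while `x2` will usually (not always) fail the 0.05 threshold.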

4.2 Scikit-Learn vs. Statsmodels

Library	scikit-learn	statsmodels
Design purpose	Prediction	Explanatory analysis
Speed	Significantly faster (on datasets with more than 1,000 observations)	Relatively slower
Result presentation	No detailed statistical report	A nice summary table from the OLS class
Target users	Machine learning applications that need fast predictions	Econometricians and other users who need detailed statistical reports

L5 Classification

Q: What is a classification problem?


Feature	Binary Classification	Multi-class Classification (KNN)
Goal	Classify an instance into one of two classes (usually a positive and a negative class)	Classify an instance into one of several classes
Number of classes	2 (e.g. yes/no, 0/1)	More than 2 (e.g. drug A, B, C)
Model	Logistic Regression	Multinomial Logistic Regression, KNN, etc.
Output
Logistic Function / KNN
Decision boundary		Multiple decision boundaries, one per class
Note
Cost function / choose the best K	$J(\beta_0, \beta_1) = \frac{1}{N} \sum_{i=1}^{N} (h_{\beta}(x_i) - y_i)^2$
Evaluation Metrics: Jaccard index + log loss
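
A small sketch of the logistic function and the two evaluation metrics named above, written with plain NumPy. The labels and linear scores are made up, and the 0.5 decision threshold and the helper names `sigmoid`, `log_loss` and `jaccard_index` are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real score to a probability in (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def log_loss(y_true, p):
    """Average negative log-likelihood of the true labels under probabilities p."""
    p = np.clip(p, 1e-15, 1 - 1e-15)  # avoid log(0)
    return -np.mean(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))

def jaccard_index(y_true, y_pred):
    """|intersection| / |union| of the sets of actual and predicted positives."""
    intersection = np.sum((y_true == 1) & (y_pred == 1))
    union = np.sum((y_true == 1) | (y_pred == 1))
    return intersection / union

y_true = np.array([1, 0, 1, 1, 0, 0])
# Hypothetical linear scores beta0 + beta1 * x from some fitted model.
scores = np.array([2.0, -1.0, 0.5, -0.3, -2.0, 0.8])

probs = sigmoid(scores)
y_pred = (probs >= 0.5).astype(int)   # decision boundary at probability 0.5

print("predicted labels:", y_pred)
print("Jaccard index   :", jaccard_index(y_true, y_pred))
print("log loss        :", round(float(log_loss(y_true, probs)), 4))
```

The Jaccard index only looks at the hard labels, while log loss also penalises predictions that are confident and wrong, so the two metrics complement each other.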