Python Machine Learning from Scratch -- Biomechanical Feature Analysis of Orthopedic Patients (Part 2)

it2022-05-05 482

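# Supervised learning
# Exploratory data analysis
'''
Supervised learning: we will cover linear regression and logistic regression.
This orthopedic patient dataset is not well suited to regression, so I use only
two columns from the Abnormal class: pelvic_incidence and sacral_slope.
The feature is pelvic_incidence and the target is sacral_slope.
Let's look at a scatter plot to get a better sense of the relationship.
reshape(-1, 1): without it, x and y have shape (210,), and sklearn rejects a
1-D feature array. After reshape(-1, 1), x and y have shape (210, 1).
'''
# `data` is assumed to be the DataFrame loaded in the first part of this series.
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

data1 = data[data['class'] == 'Abnormal']
x = np.array(data1.loc[:, 'pelvic_incidence']).reshape(-1, 1)
y = np.array(data1.loc[:, 'sacral_slope']).reshape(-1, 1)
plt.figure(figsize=[10, 10])
plt.scatter(x=x, y=y)
plt.xlabel('pelvic_incidence')
plt.ylabel('sacral_slope')
plt.show()

# Linear regression
'''
y = ax + b, where y is the target, x is the feature, and a and b are the model parameters.
We choose the parameters that minimize a loss function; linear regression uses
ordinary least squares (OLS) as its loss.
OLS: simply summing the residuals would let positive and negative residuals
cancel out, so we minimize the sum of squared residuals instead.
Score: reg.score returns R^2 = 1 - sum((y_real - y_pred)^2) / sum((y_real - y_mean)^2).
For an OLS fit with an intercept, evaluated on the training data, this equals
sum((y_pred - y_mean)^2) / sum((y_real - y_mean)^2).
'''
from sklearn.linear_model import LinearRegression
reg = LinearRegression()
# Prediction space: evenly spaced feature values across the observed range
predict_space = np.linspace(x.min(), x.max()).reshape(-1, 1)
reg.fit(x, y)
# Predict
predicted = reg.predict(predict_space)
# R^2
print('R^2 score:', reg.score(x, y))
# Plot the regression line and the scatter
plt.plot(predict_space, predicted, color='black', linewidth=3)
plt.scatter(x=x, y=y)
plt.xlabel('pelvic_incidence')
plt.ylabel('sacral_slope')
plt.show()

# Cross-validation
from sklearn.model_selection import cross_val_score
reg = LinearRegression()
k = 5
# 5-fold cross-validation of reg (linear regression) on the x and y defined above:
# the data are split five ways, and the model is trained and scored once per fold.
cv_result = cross_val_score(reg, x, y, cv=k)  # uses R^2 as score
print('CV scores: ', cv_result)
print('CV score average: ', np.sum(cv_result) / k)

# Ridge regression
'''
Fitting a linear regression means choosing the parameters (coefficients) that
minimize the loss function. If the model considers a feature important, it gives
that feature a large coefficient. This can lead to overfitting, much like
memorizing the training data in KNN.
To avoid overfitting we use regularization, which penalizes large coefficients.
Ridge regression is the first regularization technique. It is also called L2 regularization.
Ridge regression loss = OLS + alpha * sum(parameter^2).
alpha is a parameter we have to choose before fitting and predicting.
Choosing alpha is similar to choosing K in KNN.
'''
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=2, test_size=0.3)
# The original passed normalize=True. That argument was removed in scikit-learn 1.2,
# so it is dropped here; scores may differ from the original run.
ridge = Ridge(alpha=0.1)
ridge.fit(x_train, y_train)
ridge_predict = ridge.predict(x_test)
print('Ridge score: ', ridge.score(x_test, y_test))

# Lasso regression (L1 regularization)
from sklearn.linear_model import Lasso
x = np.array(data1.loc[:, ['pelvic_incidence', 'pelvic_tilt numeric', 'lumbar_lordosis_angle', 'pelvic_radius']])
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=3, test_size=0.3)
# normalize=True dropped for the same reason as in the Ridge example
lasso = Lasso(alpha=0.1)
lasso.fit(x_train, y_train)
lasso_predict = lasso.predict(x_test)
print('Lasso score: ', lasso.score(x_test, y_test))
print('Lasso coefficients: ', lasso.coef_)
'''
pelvic_incidence and pelvic_tilt numeric are the important features; the others are not.
Now let's discuss accuracy. Is it a sufficient metric for model selection?
Suppose a dataset contains 95% normal and 5% abnormal samples, and we evaluate
the model with accuracy. A model that predicts "normal" for every sample scores
95% accuracy, yet it misclassifies every abnormal sample.
For imbalanced data we therefore need the confusion matrix as the evaluation tool.
Here we use it together with a random forest classifier.
'''
# Confusion matrix with random forest
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.ensemble import RandomForestClassifier
x, y = data.loc[:, data.columns != 'class'], data.loc[:, 'class']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=1)
rf = RandomForestClassifier(random_state=4)
rf.fit(x_train, y_train)
y_pred = rf.predict(x_test)
cm = confusion_matrix(y_test, y_pred)
print('Confusion matrix: \n', cm)
print('Classification report: \n', classification_report(y_test, y_pred))
sns.heatmap(cm, annot=True, fmt="d")
plt.show()

# Logistic regression and ROC curve
from sklearn.metrics import roc_curve
from sklearn.linear_model import LogisticRegression
# abnormal = 1 and normal = 0
data['class_binary'] = [1 if i == 'Abnormal' else 0 for i in data.loc[:, 'class']]
x, y = data.loc[:, (data.columns != 'class') & (data.columns != 'class_binary')], data.loc[:, 'class_binary']
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=42)
logreg = LogisticRegression()
logreg.fit(x_train, y_train)
# Predicted probability of the positive (abnormal) class
y_pred_prob = logreg.predict_proba(x_test)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)
# Plot ROC curve
plt.plot([0, 1], [0, 1], 'k--')
plt.plot(fpr, tpr)
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC')
plt.show()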
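
The accuracy discussion above (95% normal samples, 5% abnormal) can be checked by hand. What follows is only a minimal sketch in plain Python: the labels are made up for illustration, and `confusion_counts` is a hypothetical helper that does not appear in the original code. It shows a model that always predicts "normal" reaching 95% accuracy while finding none of the abnormal samples.

```python
# Hypothetical labels: 95 normal (0) and 5 abnormal (1) samples.
y_true = [0] * 95 + [1] * 5
# A degenerate "model" that predicts normal for every sample.
y_pred = [0] * len(y_true)


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn), treating `positive` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    return tp, fp, fn, tn


tp, fp, fn, tn = confusion_counts(y_true, y_pred)
# Accuracy: share of all samples classified correctly.
accuracy = (tp + tn) / len(y_true)
# Recall: share of abnormal samples that the model actually found.
recall = tp / (tp + fn) if (tp + fn) else 0.0

print('Confusion counts (tp, fp, fn, tn):', (tp, fp, fn, tn))
print('Accuracy:', accuracy)
print('Recall for the abnormal class:', recall)
```

The confusion counts make the failure visible where accuracy alone hides it: all five abnormal samples land in the false-negative cell.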
