Matplotlib and Machine Learning

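An introduction to data visualization with Matplotlib and to basic machine learning concepts with scikit-learn.

Overview

This lecture covers two core Python libraries for data analysis:

Matplotlib: a Python library for visualizing data and results.
Scikit-learn (sklearn): a machine learning library for processing data and extracting underlying patterns.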

Matplotlib

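Matplotlib is a Python library for creating static, animated, and interactive visualizations.

Common Chart Types

Chart type	Purpose
Line plot	Shows the trend of a continuous variable
Histogram	Shows the distribution of a single variable
Scatter plot	Shows the relationship between two variables
Subplots	Combines multiple charts in a single figure
Bar chart	Compares categorical data
Pie chart	Shows the proportions of categorical data
Box plot	Shows the distribution and summary statistics of data
3D plot	Creates three-dimensional visualizations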
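
Several of the chart types in this table have no example elsewhere on this page. The following is a minimal sketch of a bar chart, a pie chart, and a box plot combined as subplots in one figure. The category names, counts, and random samples are invented for illustration only.

```python
import matplotlib.pyplot as plt
import numpy as np

# Hypothetical data, invented for illustration only
categories = ["A", "B", "C", "D"]
counts = [12, 7, 15, 9]
rng = np.random.default_rng(0)
samples = [rng.normal(loc=mu, scale=1.0, size=100) for mu in (0, 1, 2)]

# One figure with three axes side by side
fig, axes = plt.subplots(1, 3, figsize=(12, 4))

axes[0].bar(categories, counts)                            # compare categories
axes[0].set_title("Bar chart")

axes[1].pie(counts, labels=categories, autopct="%1.0f%%")  # show proportions
axes[1].set_title("Pie chart")

axes[2].boxplot(samples)                                   # median, quartiles, outliers
axes[2].set_xticks([1, 2, 3])
axes[2].set_xticklabels(["g1", "g2", "g3"])
axes[2].set_title("Box plot")

fig.tight_layout()
```

Call `plt.show()` or `fig.savefig(...)` afterwards to display or save the figure.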

Line Plots

Line plots are typically used for time series data, such as stock prices, temperature trends, or website traffic over time.

import matplotlib.pyplot as plt
import numpy as np

# Generate data
x = np.linspace(0, 10, 100)  # 100 evenly spaced points between 0 and 10
y = np.sin(x)                # sine of each x value

# Create the plot
plt.plot(x, y)
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Sine Wave Plot")
plt.grid(True)
plt.savefig('sin_plot.png', dpi=400)
plt.savefig('sin_plot.pdf', dpi=400)
plt.show()

# Read back the saved image and display it
import matplotlib.image as mpimg
image = mpimg.imread('sin_plot.png')
plt.imshow(image)
plt.axis('off')
plt.show()

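Customization Options

Property	Options
Line style	Solid `-`, dashed `--`, dotted `:`, dash-dot `-.`
Color	Color names (`red`, `blue`), hex codes (`#FF0000`), grayscale given as a string (`'0.7'`)
Marker	Circle `o`, plus `+`, star `*`, and others
Line width	Controls line thickness
Labels and title	Add descriptive labels and a title
Legend	Identifies the different plot elements
Grid	Adds grid lines to improve readability
Axis limits	Control the displayed range of the x and y axes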

plt.plot(x, y, color='green', linestyle='--', marker='o', linewidth=2, label='Sine Wave')
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Customized Sine Wave Plot")
plt.legend()
plt.grid(True)
plt.xlim(0, 10)
plt.ylim(-1.2, 1.2)
plt.show()

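Object-Oriented Approach

For more complex figures, the object-oriented approach is recommended. It calls methods on explicit Figure and Axes objects instead of relying on pyplot's implicit current figure: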

import matplotlib.pyplot as plt
import numpy as np
 
x = np.linspace(0, 10, 100)
y1 = np.sin(x)
y2 = np.cos(x)
 
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(x, y1, label='Sine', linestyle='--')
ax.plot(x, y2, label='Cosine', linestyle='-')
ax.set_xlabel("X-axis")
ax.set_ylabel("Y-axis")
ax.set_title("Sine and Cosine Waves")
ax.legend()
ax.grid(True)
plt.savefig('plot_objected.png', dpi=400)
plt.show()

Histograms

A histogram shows the distribution of a single variable.

import matplotlib.pyplot as plt
import seaborn; seaborn.set()
import pandas as pd
import numpy as np
 
plt.rc('font', family='Times New Roman', size=12)
data = pd.read_csv('./data/president_heights.csv')
heights = np.array(data['height(cm)'])
print(heights)
 
plt.figure(num=1)
plt.hist(heights, bins=8)
plt.title('Height Distribution of US Presidents')
plt.xlabel('height (cm)')
plt.ylabel('number')
plt.savefig('BarChart.png', dpi=400)
plt.show()

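A histogram can be compared with a probability density function (PDF) to see how well a theoretical distribution describes the data.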
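
One way to make this comparison is sketched below. It draws a normalized histogram and overlays a normal PDF whose mean and standard deviation are estimated from the sample. The heights here are synthetic (an assumption, since the presidents CSV is not reproduced on this page), and the normal distribution is only one possible theoretical model.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic heights in cm (an assumption for illustration, not the presidents data)
rng = np.random.default_rng(0)
heights = rng.normal(loc=180.0, scale=7.0, size=500)

# Estimate the parameters of the normal distribution from the sample
mu, sigma = heights.mean(), heights.std()

fig, ax = plt.subplots()
# density=True rescales the bars so that their total area is 1,
# which puts the histogram on the same scale as a PDF
counts, edges, _ = ax.hist(heights, bins=20, density=True, alpha=0.6, label="Histogram")

# Normal PDF evaluated on a fine grid
grid = np.linspace(heights.min(), heights.max(), 200)
pdf = np.exp(-((grid - mu) ** 2) / (2 * sigma**2)) / (sigma * np.sqrt(2 * np.pi))
ax.plot(grid, pdf, label="Normal PDF")

ax.set_xlabel("height (cm)")
ax.set_ylabel("density")
ax.legend()
```

Without `density=True` the bars show raw counts and would not be on the same scale as the PDF curve.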

Comparing Two Distributions

data_marks1 = pd.read_excel('./data/CourseMarks1.xlsx')
Marks = np.array(data_marks1['总分'], dtype=int)
plt.figure(num=2)
plt.hist(Marks, bins=20)
plt.title('Scores of students in Class 1')
plt.xlabel('Scores of students (Full scores are 100)')
plt.ylabel('number')
plt.savefig('BarChart_scores1.png', dpi=400)
plt.show()
 
data_marks2 = pd.read_excel('./data/CourseMarks2.xlsx')
Marks = np.array(data_marks2['总分'], dtype=int)
plt.figure(num=3)
plt.hist(Marks, bins=20, color='green')
plt.title('Scores of students in Class 2')
plt.xlabel('Scores of students (Full scores are 100)')
plt.ylabel('number')
plt.savefig('BarChart_scores2.png', dpi=400)
plt.show()
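
The two histograms above are drawn in separate figures, which makes them harder to compare by eye. As a possible alternative, the sketch below overlays both distributions on one set of axes with shared bin edges. The scores are synthetic stand-ins, because the Excel files are not available here.

```python
import matplotlib.pyplot as plt
import numpy as np

# Synthetic scores for two classes (hypothetical, for illustration only)
rng = np.random.default_rng(1)
scores1 = np.clip(rng.normal(72, 10, 60), 0, 100)
scores2 = np.clip(rng.normal(80, 8, 60), 0, 100)

# Shared bin edges make the bars of the two histograms directly comparable
bins = np.linspace(0, 100, 21)  # 20 bins of width 5

fig, ax = plt.subplots()
n1, _, _ = ax.hist(scores1, bins=bins, alpha=0.5, label="Class 1")
n2, _, _ = ax.hist(scores2, bins=bins, alpha=0.5, color="green", label="Class 2")
ax.set_xlabel("Score (out of 100)")
ax.set_ylabel("Number of students")
ax.set_title("Score distributions of two classes")
ax.legend()
```

The `alpha` transparency keeps both sets of bars visible where they overlap.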

Scatter Plots

A scatter plot shows the relationship between two variables.

import matplotlib.pyplot as plt
import numpy as np
 
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 1, 3, 5])
 
plt.scatter(x, y)
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Scatter Plot")
plt.savefig('Simple_scatter.png')
plt.show()
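
A scatter plot shows a relationship visually. To summarize it with a number, one common choice (not part of the original example) is the Pearson correlation coefficient, computed here for the same five points:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 1, 3, 5])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal entry is r
r = np.corrcoef(x, y)[0, 1]
print(f"Pearson correlation: {r:.2f}")  # 0.50
```

A value of 0.5 indicates a moderate positive linear relationship between the two variables.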

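Machine Learning

What Is Machine Learning?

Machine learning is a branch of artificial intelligence in which algorithms extract hidden patterns from a dataset. A trained model can then make predictions on new, similar data without being explicitly programmed for each task. Its applications span many fields; the examples on this page include house price prediction and spam detection.

Supervised and Unsupervised Learning

Supervised learning: the model learns from labeled data (input-output pairs).
Unsupervised learning: the model looks for patterns in unlabeled data.

Regression and Classification

Aspect	Regression	Classification
Output	Continuous values	Discrete categories / class labels
Goal	Predict a quantity	Predict a category
Example	House price prediction	Email spam detection

Structure of Supervised Learning

A typical supervised learning workflow involves:

Inputs / features / attributes: the data used to make predictions.
Outputs / targets / labels: the known results for the training data.
Model selection: choosing a suitable algorithm.
Training: fitting the model to the training data to obtain its parameters.
Evaluation: assessing model performance on test data.
Inference / prediction: applying the trained model to new, unseen data.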
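
The workflow steps above can be sketched end to end with scikit-learn. The data is synthetic (a noisy line y = 3x + 2, chosen only for illustration), and `LinearRegression` is just one possible model choice.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Inputs (features) and outputs (targets): synthetic data, y = 3x + 2 plus noise
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(200, 1))
y = 3.0 * X[:, 0] + 2.0 + rng.normal(0, 0.5, size=200)

# Hold out part of the data for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Model selection and training
model = LinearRegression()
model.fit(X_train, y_train)

# Evaluation on data the model has not seen during training
r2 = r2_score(y_test, model.predict(X_test))
print(f"R² on the test set: {r2:.3f}")

# Inference on a new input
new_X = np.array([[12.0]])
print("Prediction for x = 12:", model.predict(new_X)[0])
```

Evaluating on the held-out test set rather than on the training data gives a fairer estimate of how the model behaves on new inputs.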

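Linear Regression

Linear regression is a statistical method for modeling the relationship between a dependent variable and one or more independent variables.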
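
Written out, the model and the least-squares criterion commonly used to fit it look as follows. The symbols are introduced here for illustration: $x_1, \ldots, x_p$ are the independent variables, $\hat{y}$ is the predicted value of the dependent variable, and $n$ is the number of training samples.

$\hat{y} = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p$

$\min_{\beta_0, \ldots, \beta_p} \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2$

In scikit-learn's `LinearRegression`, the fitted $\beta_0$ is stored in `intercept_` and $\beta_1, \ldots, \beta_p$ in `coef_`.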

Simple Linear Regression

import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

# Generate sample data: areas with random noise, reshaped to a column of shape (n_samples, 1)
area = np.array([100, 150, 200, 250, 300]) + np.random.randint(-30, 20, 5)
area = area[:, np.newaxis]
price = np.array([200000, 250000, 300000, 350000, 400000])

# Create and train the model
model = LinearRegression()
model.fit(area, price)
print("Coefficients:", model.coef_)
print("Intercept:", model.intercept_)

# Predict for new data
new_area = np.array([350])
new_area = new_area[:, np.newaxis]
predicted_price = model.predict(new_area)
print(f"Predicted price for an area of {new_area[0][0]} square feet: {predicted_price[0]}")

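Multiple Linear Regression

Multiple linear regression extends simple regression to several input features. The following example uses the California housing dataset: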

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import pandas as pd
import matplotlib.pyplot as plt
 
data_file_path = './data/cal_housing.data'
data = pd.read_csv(data_file_path, header=None)
print(data.head())
 
column_names = [
    'longitude', 'latitude', 'housingMedianAge', 'totalRooms', 'totalBedrooms',
    'population', 'households', 'medianIncome', 'medianHouseValue'
]
data.columns = column_names
print(data.head())
 
X = data.drop('medianHouseValue', axis=1)
y = data['medianHouseValue']

# Split the data and train the model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)

# Predict on the test set
y_pred = model.predict(X_test)

# Evaluate the model
print("coefficients:", model.coef_)
print("Intercept:", model.intercept_)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse}")
print(f"R² Score: {r2}")
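
To make the two metrics concrete, the sketch below computes MSE and R² by hand and checks them against scikit-learn. The four values are made up for illustration and are not taken from the housing data.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Small made-up example values (not taken from the housing data)
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_hat = np.array([2.5, 5.5, 7.0, 10.0])

# MSE: mean of the squared residuals
mse_manual = np.mean((y_true - y_hat) ** 2)

# R²: 1 - (residual sum of squares) / (total sum of squares around the mean)
ss_res = np.sum((y_true - y_hat) ** 2)
ss_tot = np.sum((y_true - y_true.mean()) ** 2)
r2_manual = 1.0 - ss_res / ss_tot

print(mse_manual, mean_squared_error(y_true, y_hat))  # both 0.375
print(r2_manual, r2_score(y_true, y_hat))             # both 0.925
```

MSE is expressed in squared units of the target, so its size depends on the scale of the data. R² is unitless, and a value of 1 means a perfect fit.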

Visualize the predictions:

plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, color='blue', alpha=0.5)
plt.xlabel('Actual House Prices', fontsize=12)
plt.ylabel('Predicted House Prices', fontsize=12)
plt.title('Actual vs Predicted House Prices', fontsize=14)
plt.grid(True)
plt.xlim(min(y_test), max(y_test))
plt.ylim(min(y_test), max(y_test))
plt.show()

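Polynomial Regression

When the relationship between the variables is nonlinear, polynomial features can be added so that a linear model fits a polynomial function of the input: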

from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
import numpy as np
 
poly_model = make_pipeline(PolynomialFeatures(7), LinearRegression())
rng = np.random.RandomState(1)
x = 10 * rng.rand(50)
y = np.sin(x) + 0.1 * rng.randn(50)
poly_model.fit(x[:, np.newaxis], y)

xfit = np.linspace(0, 10, 1000)
yfit = poly_model.predict(xfit[:, np.newaxis])
plt.scatter(x, y)
plt.plot(xfit, yfit)
plt.xlabel('x')
plt.ylabel('y')
plt.savefig('polynomial_sin.png', dpi=400)
plt.show()

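Underfitting and Overfitting

Underfitting (e.g., degree = 1): the model is too simple to capture the underlying pattern.
Overfitting (e.g., degree = 6 or higher): the model fits the noise in the training data and generalizes poorly.

The degrees quoted here are only indicative: the degree at which overfitting begins depends on how much data there is and how noisy it is. Choosing an appropriate model complexity is essential for good predictive performance.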
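
One way to look for underfitting and overfitting is to compare the error on the training data with the error on held-out data at several degrees. The sketch below does this on the same kind of noisy sine data as the polynomial example. The 70/30 split and the set of degrees are choices made here for illustration.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Noisy sine data, generated as in the polynomial example
rng = np.random.RandomState(1)
x = 10 * rng.rand(50)
y = np.sin(x) + 0.1 * rng.randn(50)
X = x[:, np.newaxis]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

train_mse, test_mse = {}, {}
for degree in (1, 3, 7):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse[degree] = mean_squared_error(y_train, model.predict(X_train))
    test_mse[degree] = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree}: train MSE={train_mse[degree]:.4f}, test MSE={test_mse[degree]:.4f}")
```

An underfitted model shows a high error on both sets. An overfitted model shows a low training error with a noticeably higher test error.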

Predicting Population by Year

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline

# Yearly population (in thousands) with random noise added
years = np.array([2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020])
population = np.array([100, 103, 108, 115, 124, 135, 148, 163, 180, 199, 220]) + np.random.randint(0, 50, 11)

# Fit a degree-3 polynomial to the data
X = years[:, np.newaxis]
degree = 3
poly_model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
poly_model.fit(X, population)

# Predict over the observed years and ten years beyond them
future_years = np.arange(2010, 2031)[:, np.newaxis]
predicted_population = poly_model.predict(future_years)

pred_2025 = poly_model.predict([[2025]])
print(f"The predicted population in 2025 is {pred_2025[0]:.2f} thousand.")

plt.scatter(years, population, color='blue', label='Actual Data')
plt.plot(future_years, predicted_population, color='red', label=f'Polynomial fit (degree {degree})')
plt.xlabel('Year')
plt.ylabel('Population (thousands)')
plt.legend()  # required for the labels to be displayed
plt.show()
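
Raw year values around 2010 become very large when cubed (roughly 8e9), which can make the least-squares problem poorly conditioned. A common remedy, shown here as a suggestion rather than something the example above does, is to shift the years so they start at zero before building the polynomial features. The sketch uses the population series without the random noise so that the result is reproducible.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

years = np.arange(2010, 2021)
# Noise-free base series from the example above
population = np.array([100, 103, 108, 115, 124, 135, 148, 163, 180, 199, 220])

# Shift the years so that 2010 becomes 0; the features then stay small
base_year = 2010
t = (years - base_year)[:, np.newaxis]

model = make_pipeline(PolynomialFeatures(3), LinearRegression())
model.fit(t, population)

# Apply the same shift to any year that is predicted
pred_2025 = model.predict([[2025 - base_year]])[0]
print(f"Predicted population in 2025: {pred_2025:.2f} thousand")
```

The noise-free series follows 100 + 2t + t² exactly, with t the number of years since 2010, so the fit returns 355 for 2025. Extrapolating ten years beyond the data should still be treated with caution.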

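Bayes' Theorem for Classification

Bayes' theorem provides a probabilistic framework for classification. The core idea is to compare posterior probabilities to determine the most likely class:

$P(c_j \mid A) = \frac{P(A \mid c_j) \cdot P(c_j)}{P(A)}$

where:

$P(c_j \mid A)$ is the posterior probability of class $c_j$ given the evidence $A$
$P(A \mid c_j)$ is the likelihood of the evidence $A$ given class $c_j$
$P(c_j)$ is the prior probability of class $c_j$
$P(A)$ is the marginal probability of the evidence $A$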
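
As a worked example of the formula, the sketch below estimates P(PlayTennis = Yes | Outlook = Sunny) by counting. It uses the Outlook and PlayTennis columns of the PlayTennis table that appears later on this page. Estimating each probability as a plain relative frequency, without smoothing, is a simplification made here.

```python
from fractions import Fraction

# Outlook and PlayTennis columns of the PlayTennis table used later on this page
outlook = ['Sunny', 'Sunny', 'Overcast', 'Rain', 'Rain', 'Rain', 'Overcast',
           'Sunny', 'Sunny', 'Rain', 'Sunny', 'Overcast', 'Overcast', 'Rain']
play = ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes',
        'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']

n = len(play)
n_yes = play.count('Yes')
n_sunny = outlook.count('Sunny')
n_sunny_and_yes = sum(o == 'Sunny' and p == 'Yes' for o, p in zip(outlook, play))

prior = Fraction(n_yes, n)                      # P(Yes) = 9/14
likelihood = Fraction(n_sunny_and_yes, n_yes)   # P(Sunny | Yes) = 2/9
evidence = Fraction(n_sunny, n)                 # P(Sunny) = 5/14

posterior = likelihood * prior / evidence       # P(Yes | Sunny)
print(posterior)  # 2/5
```

Because P(Yes | Sunny) = 2/5 is smaller than P(No | Sunny) = 3/5, a classifier using only the Outlook feature would predict No on a sunny day.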

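Classification with Multiple Discrete Features

The PlayTennis dataset demonstrates classifying categorical features with a naive Bayes classifier: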

import pandas as pd
from sklearn.naive_bayes import BernoulliNB
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
 
data = {
    'Rec': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
    'Outlook': ['Sunny', 'Sunny', 'Overcast', 'Rain', 'Rain', 'Rain', 'Overcast', 'Sunny', 'Sunny', 'Rain', 'Sunny', 'Overcast', 'Overcast', 'Rain'],
    'Temperature': ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool', 'Mild', 'Cool', 'Mild', 'Mild', 'Mild', 'Hot', 'Mild'],
    'Humidity': ['High', 'High', 'High', 'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'High'],
    'Wind': ['Weak', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Strong'],
    'PlayTennis': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']
}
df = pd.DataFrame(data)

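Encode the categorical variables:

df = df.drop('Rec', axis=1)  # the record number is an ID, not a feature
le = LabelEncoder()
encoded_df = df.copy()
for col in df.columns:
    # Map each column's string categories to integers (in sorted order)
    encoded_df[col] = le.fit_transform(df[col])

X = encoded_df.drop('PlayTennis', axis=1)
y = encoded_df['PlayTennis']

# Note: BernoulliNB binarizes its inputs (binarize=0.0 by default), so every code
# above 0 is treated as 1 and the three-level features lose one distinction
bnb = BernoulliNB()
bnb.fit(X, y)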

Predict for new data:

# New observation to classify: features only, since the label is what we predict
data_new = pd.DataFrame({
    'Outlook': ['Sunny'], 'Temperature': ['Cool'], 'Humidity': ['High'], 'Wind': ['Strong']
})

# Encode each feature with an encoder fitted on the matching training column,
# so the integer codes agree with those used to train the model
encoded_new_data = data_new.copy()
for col in data_new.columns:
    le = LabelEncoder()
    le.fit(df[col])
    encoded_new_data[col] = le.transform(data_new[col])

X_new = encoded_new_data

new_pred = bnb.predict(X_new)

# Decode the predicted integer back to 'Yes' / 'No'
play_tennis_le = LabelEncoder()
play_tennis_le.fit(df['PlayTennis'])
decoded_new_pred = play_tennis_le.inverse_transform(new_pred)
print("Prediction for new data:", decoded_new_pred)
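
`BernoulliNB` models binary features and binarizes its inputs, so the three-level Outlook and Temperature columns are collapsed to two levels. As an alternative that this page does not use, scikit-learn's `CategoricalNB` models each integer code as a separate category. The sketch below applies it to the same records, with `OrdinalEncoder` chosen here as the encoding step.

```python
import numpy as np
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

# Same PlayTennis records as above; columns: Outlook, Temperature, Humidity, Wind
features = np.array([
    ['Sunny', 'Hot', 'High', 'Weak'], ['Sunny', 'Hot', 'High', 'Strong'],
    ['Overcast', 'Hot', 'High', 'Weak'], ['Rain', 'Mild', 'High', 'Weak'],
    ['Rain', 'Cool', 'Normal', 'Weak'], ['Rain', 'Cool', 'Normal', 'Strong'],
    ['Overcast', 'Cool', 'Normal', 'Strong'], ['Sunny', 'Mild', 'High', 'Weak'],
    ['Sunny', 'Cool', 'Normal', 'Weak'], ['Rain', 'Mild', 'Normal', 'Weak'],
    ['Sunny', 'Mild', 'Normal', 'Strong'], ['Overcast', 'Mild', 'High', 'Strong'],
    ['Overcast', 'Hot', 'Normal', 'Weak'], ['Rain', 'Mild', 'High', 'Strong'],
])
labels = np.array(['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes',
                   'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No'])

# One encoder handles all feature columns and remembers each column's categories
encoder = OrdinalEncoder()
X = encoder.fit_transform(features).astype(int)

clf = CategoricalNB()  # uses additive (Laplace) smoothing, alpha=1.0 by default
clf.fit(X, labels)

# Encode the new sample with the same fitted encoder, then predict
new_sample = np.array([['Sunny', 'Cool', 'High', 'Strong']])
X_new = encoder.transform(new_sample).astype(int)
prediction = clf.predict(X_new)
probabilities = clf.predict_proba(X_new)
print(prediction[0], dict(zip(clf.classes_, probabilities[0].round(3))))
```

Keeping a single fitted encoder and reusing it for new samples avoids refitting encoders at prediction time.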

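Classification with Continuous Features

When the features are continuous, each one is usually assumed to follow a normal (Gaussian) distribution within each class. Its probability density function is determined by the mean and the variance and has the characteristic bell shape.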
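
For reference, the Gaussian density assumed for a continuous feature value $x$ within class $c_j$ can be written as follows. The symbols $\mu_j$ and $\sigma_j^2$ are introduced here for illustration and stand for the mean and variance of the feature in that class, both estimated from the training data.

$p(x \mid c_j) = \frac{1}{\sqrt{2 \pi \sigma_j^2}} \exp\left( -\frac{(x - \mu_j)^2}{2 \sigma_j^2} \right)$

This density plays the role of the likelihood $P(A \mid c_j)$ in Bayes' theorem when the evidence is a continuous value.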

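The Iris Dataset: Multiple Continuous Features

The Iris dataset contains four continuous features (sepal length, sepal width, petal length, petal width) and one class label (setosa, versicolor, virginica). Each feature is assumed to follow a Gaussian distribution, and the features are treated as independent of one another given the class: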

from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix

iris = load_iris()
X, y = iris.data, iris.target

# Hold out 30% of the samples for testing
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

gnb = GaussianNB()
gnb.fit(X_train, y_train)

# Evaluate with classification metrics; MSE is not meaningful for class labels
y_pred = gnb.predict(X_test)
print(f"Accuracy: {accuracy_score(y_test, y_pred):.3f}")
print("Confusion matrix:")
print(confusion_matrix(y_test, y_pred))
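
To see what the Gaussian model has learned, the sketch below prints the per-class feature means and the posterior probabilities for a single flower. For brevity it fits on the full dataset rather than on a training split, which is a simplification made here.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB

iris = load_iris()
gnb = GaussianNB().fit(iris.data, iris.target)

# theta_ holds the per-class mean of each feature: shape (n_classes, n_features)
for name, means in zip(iris.target_names, gnb.theta_):
    print(name, np.round(means, 2))

# Posterior probabilities for one flower (the first sample in the dataset)
sample = iris.data[:1]
posterior = gnb.predict_proba(sample)[0]
predicted_name = iris.target_names[np.argmax(posterior)]
print(predicted_name, np.round(posterior, 3))
```

The class with the largest posterior probability is the one that `predict` returns.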

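Key Takeaways

Topic	Key points
Matplotlib	Line plots, histograms, scatter plots; customization via color, line style, and markers; object-oriented API
ML workflow	Load data → split into training/test sets → choose a model → train → evaluate → predict
Linear regression	Simple (one feature), multiple (several features), polynomial (nonlinear); watch for underfitting and overfitting
Naive Bayes	Probabilistic classifier; BernoulliNB for discrete (binary-valued) features, GaussianNB for continuous features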

Sources

MatplotlibAndMachineLearning
