AITC Wiki

Matplotlib and Machine Learning


An introduction to data visualization with Matplotlib and foundational machine learning concepts using scikit-learn.

Overview

This lecture covers two core Python libraries for data analysis:

  • Matplotlib: A Python library for visualizing data and results.
  • Scikit-learn (sklearn): A machine learning library for processing data to extract underlying patterns.

Matplotlib

Matplotlib is a Python library for creating static, animated, and interactive visualizations.

Common Chart Types

Chart Type   | Purpose
Line Plot    | Shows trends over continuous variables
Histogram    | Shows the distribution of a single variable
Scatter Plot | Shows the relationship between two variables
Subplots     | Combines multiple plots in a single figure
Bar Chart    | Compares categorical data
Pie Chart    | Shows proportions of categorical data
Box Plot     | Shows the distribution and statistical summary of data
3D Plots     | Creates three-dimensional visualizations
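For instance, a bar chart and a pie chart of the same (made-up) category counts can be drawn side by side with subplots:

```python
import matplotlib.pyplot as plt

# Hypothetical category counts, purely for illustration
languages = ['Python', 'C++', 'Java', 'Rust']
counts = [45, 25, 20, 10]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Bar chart: compares categorical data
ax1.bar(languages, counts, color='steelblue')
ax1.set_title('Bar Chart')
ax1.set_ylabel('Count')

# Pie chart: shows proportions of categorical data
ax2.pie(counts, labels=languages, autopct='%1.1f%%')
ax2.set_title('Pie Chart')

plt.show()
```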

Line Plot

Line plots are typically used for time-series data, such as stock prices, temperature trends, or website traffic over time.

import matplotlib.pyplot as plt
import numpy as np
 
# Generate some data
x = np.linspace(0, 10, 100)  # Create 100 evenly spaced points between 0 and 10
y = np.sin(x)                # Calculate the sine of each x value
 
# Create the plot
plt.plot(x, y)
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Sine Wave Plot")
plt.grid(True)
plt.savefig('sin_plot.png', dpi=400)
plt.savefig('sin_plot.pdf', dpi=400)
plt.show()
 
# Read the saved image
import matplotlib.image as mpimg
image = mpimg.imread('sin_plot.png')
plt.imshow(image)
plt.axis('off')
plt.show()

Customization Options

Property          | Options
Line Styles       | Solid -, Dashed --, Dotted :, Dash-dot -.
Colors            | Color names (red, blue), hex codes (#FF0000), grayscale (0.7)
Markers           | Circle o, Plus +, Star *, etc.
Linewidth         | Controls the thickness of the line
Labels and Titles | Add descriptive labels and titles
Legends           | Identify different plot elements
Grid              | Add gridlines to improve readability
Axis Limits       | Control the range of the x and y axes

plt.plot(x, y, color='green', linestyle='--', marker='o', linewidth=2, label='Sine Wave')
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Customized Sine Wave Plot")
plt.legend()
plt.grid(True)
plt.xlim(0, 10)
plt.ylim(-1.2, 1.2)
plt.show()

Object-Oriented Approach

For more complex plots, the object-oriented approach is recommended:

import matplotlib.pyplot as plt
import numpy as np
 
x = np.linspace(0, 10, 100)
y1 = np.sin(x)
y2 = np.cos(x)
 
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(x, y1, label='Sine', linestyle='--')
ax.plot(x, y2, label='Cosine', linestyle='-')
ax.set_xlabel("X-axis")
ax.set_ylabel("Y-axis")
ax.set_title("Sine and Cosine Waves")
ax.legend()
ax.grid(True)
plt.savefig('plot_objected.png', dpi=400)
plt.show()

Histogram

A histogram shows the distribution of a single variable.

import matplotlib.pyplot as plt
import seaborn
seaborn.set_theme()  # seaborn.set() is a deprecated alias for set_theme()
import pandas as pd
import numpy as np
 
plt.rc('font', family='Times New Roman', size=12)
data = pd.read_csv('./data/president_heights.csv')
heights = np.array(data['height(cm)'])
print(heights)
 
plt.figure(num=1)
plt.hist(heights, bins=8)
plt.title('Height Distribution of US Presidents')
plt.xlabel('height (cm)')
plt.ylabel('number')
plt.savefig('BarChart.png', dpi=400)
plt.show()

Histograms can be compared with the probability density function (PDF) to understand the theoretical distribution.
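As a sketch of that comparison, a fitted normal PDF can be overlaid on a density-normalized histogram (synthetic height data here, assuming scipy is available):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

rng = np.random.default_rng(0)
samples = rng.normal(loc=170, scale=7, size=500)  # synthetic heights (cm)

# density=True normalizes the bars so their area sums to 1,
# making the histogram directly comparable to a PDF
plt.hist(samples, bins=20, density=True, alpha=0.6, label='Histogram')

xs = np.linspace(samples.min(), samples.max(), 200)
plt.plot(xs, norm.pdf(xs, loc=samples.mean(), scale=samples.std()),
         'r-', label='Fitted normal PDF')
plt.xlabel('height (cm)')
plt.ylabel('density')
plt.legend()
plt.show()
```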

Comparing Two Distributions

data_marks1 = pd.read_excel('./data/CourseMarks1.xlsx')
Marks = np.array(data_marks1['总分'], dtype=int)  # '总分' means 'total score'
plt.figure(num=2)
plt.hist(Marks, bins=20)
plt.title('Scores of students in Class 1')
plt.xlabel('Scores of students (Full scores are 100)')
plt.ylabel('number')
plt.savefig('BarChart_scores1.png', dpi=400)
plt.show()
 
data_marks2 = pd.read_excel('./data/CourseMarks2.xlsx')
Marks = np.array(data_marks2['总分'], dtype=int)  # '总分' means 'total score'
plt.figure(num=3)
plt.hist(Marks, bins=20, color='green')
plt.title('Scores of students in Class 2')
plt.xlabel('Scores of students (Full scores are 100)')
plt.ylabel('number')
plt.savefig('BarChart_scores2.png', dpi=400)
plt.show()

Scatter Plot

A scatter plot shows the relationship between two variables.

import matplotlib.pyplot as plt
import numpy as np
 
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 1, 3, 5])
 
plt.scatter(x, y)
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Scatter Plot")
plt.savefig('Simple_scatter.png')
plt.show()

Machine Learning

What is Machine Learning?

Machine learning is a branch of artificial intelligence that enables algorithms to extract hidden patterns from datasets and generalize to new, similar data without being explicitly programmed for each task. Machine learning finds applications in diverse domains.

Supervised vs Unsupervised Learning

  • Supervised Learning: The model learns from labeled data (input-output pairs).
  • Unsupervised Learning: The model finds patterns in unlabeled data.
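As a minimal sketch of the difference (the model choices here, LogisticRegression and KMeans, are illustrative, not part of the lecture):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = np.array([[1.0], [2.0], [8.0], [9.0]])  # two well-separated groups

# Supervised: labels are provided, so the model learns the input-output mapping
y = np.array([0, 0, 1, 1])
clf = LogisticRegression().fit(X, y)
print(clf.predict([[1.5], [8.5]]))

# Unsupervised: no labels; KMeans discovers the two groups on its own
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)
```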

Regression vs Classification

Aspect  | Regression                 | Classification
Output  | Continuous numerical value | Discrete category / class label
Goal    | Predict a quantity         | Predict a category
Example | House price prediction     | Email spam detection

Structure of Supervised Learning

A typical supervised learning workflow consists of:

  1. Input / Features / Attributes: The data used to make predictions.
  2. Output / Targets / Labels: The known results for training data.
  3. Model Selection: Choose an appropriate algorithm.
  4. Training: Fit the model to training data to obtain parameters.
  5. Evaluation: Assess model performance on test data.
  6. Inference / Prediction: Apply the trained model to new, unseen data.
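These steps can be sketched end-to-end with scikit-learn (the built-in Iris dataset and a logistic-regression model are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# 1-2. Features X and labels y
X, y = load_iris(return_X_y=True)

# 3. Model selection
model = LogisticRegression(max_iter=1000)

# 4. Training: fit the model to training data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model.fit(X_train, y_train)

# 5. Evaluation on held-out test data
print("test accuracy:", model.score(X_test, y_test))

# 6. Inference on new, unseen data
print("prediction:", model.predict([[5.1, 3.5, 1.4, 0.2]]))
```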

Linear Regression

Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables.

Simple Linear Regression

import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
 
# Generate sample data
area = np.array([100, 150, 200, 250, 300]) + np.random.randint(-30, 20, 5)
area = area[:, np.newaxis]
price = np.array([200000, 250000, 300000, 350000, 400000])
 
# Create and train model
model = LinearRegression()
model.fit(area, price)
print("coefficients:", model.coef_)
print("Intercept:", model.intercept_)
 
# Predict for new data
new_area = np.array([350])
new_area = new_area[:, np.newaxis]
predicted_price = model.predict(new_area)
print(f"Predicted price for an area of {new_area[0][0]} square feet: {predicted_price[0]}")
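Since the block above already imports matplotlib.pyplot, the fit can also be visualized; here is a self-contained sketch using the same data without the random noise, so the line passes exactly through the points:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Noise-free version of the data above: price = 1000 * area + 100000
area = np.array([100, 150, 200, 250, 300])[:, np.newaxis]
price = np.array([200000, 250000, 300000, 350000, 400000])

model = LinearRegression().fit(area, price)

# Scatter the training data and overlay the fitted line
plt.scatter(area, price, color='blue', label='Training data')
plt.plot(area, model.predict(area), color='red', label='Fitted line')
plt.xlabel('Area (square feet)')
plt.ylabel('Price')
plt.title('Simple Linear Regression Fit')
plt.legend()
plt.show()
```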

Multiple Linear Regression

Multiple linear regression extends simple regression to multiple input features. Below is an example using the California housing dataset:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import pandas as pd
import matplotlib.pyplot as plt
 
data_file_path = './data/cal_housing.data'
data = pd.read_csv(data_file_path, header=None)
print(data.head())
 
column_names = [
    'longitude', 'latitude', 'housingMedianAge', 'totalRooms', 'totalBedrooms',
    'population', 'households', 'medianIncome', 'medianHouseValue'
]
data.columns = column_names
print(data.head())
 
X = data.drop('medianHouseValue', axis=1)
y = data['medianHouseValue']
# Train the model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
 
# Test the model
y_pred = model.predict(X_test)
 
# Evaluate the model
print("coefficients:", model.coef_)
print("Intercept:", model.intercept_)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse}")
print(f"R² Score: {r2}")

Visualizing the prediction results:

plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, color='blue', alpha=0.5)
plt.xlabel('Actual House Prices', fontsize=12)
plt.ylabel('Predicted House Prices', fontsize=12)
plt.title('Actual vs Predicted House Prices', fontsize=14)
plt.grid(True)
plt.xlim(min(y_test), max(y_test))
plt.ylim(min(y_test), max(y_test))
plt.show()

Polynomial Linear Regression

When the relationship between variables is non-linear, a polynomial function can be used:

from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
import numpy as np
 
poly_model = make_pipeline(PolynomialFeatures(7), LinearRegression())
rng = np.random.RandomState(1)
x = 10 * rng.rand(50)
y = np.sin(x) + 0.1 * rng.randn(50)
poly_model.fit(x[:, np.newaxis], y)
xfit = np.linspace(0, 10, 1000)
yfit = poly_model.predict(xfit[:, np.newaxis])
plt.scatter(x, y)
plt.plot(xfit, yfit)
plt.xlabel('x')
plt.ylabel('y')
plt.savefig('polynomial_sin.png', dpi=400)
plt.show()

Underfitting vs Overfitting

  • Underfitting (e.g., degree = 1): The model is too simple to capture the underlying pattern.
  • Overfitting (e.g., an excessively high polynomial degree): The model captures noise in the training data and generalizes poorly.

Choosing an appropriate model complexity is crucial for good predictive performance.
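The effect can be sketched by comparing training-set and held-out test-set error across polynomial degrees, on synthetic sine data like the example above:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(1)
x = 10 * rng.rand(50)
y = np.sin(x) + 0.1 * rng.randn(50)
X = x[:, np.newaxis]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

results = {}
for degree in [1, 3, 7]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    results[degree] = (train_mse, test_mse)
    # Training error always shrinks with degree; test error reveals under/overfitting
    print(f"degree={degree}  train MSE={train_mse:.4f}  test MSE={test_mse:.4f}")
```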

Predicting Population Over Years

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
 
years = np.array([2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020])
population = np.array([100, 103, 108, 115, 124, 135, 148, 163, 180, 199, 220]) + np.random.randint(0, 50, 11)
 
X = years[:, np.newaxis]
degree = 3
poly_model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
poly_model.fit(X, population)
future_years = np.arange(2010, 2031)[:, np.newaxis]
predicted_population = poly_model.predict(future_years)
 
pred_2025 = poly_model.predict([[2025]])
print(f"The predicted population in 2025 is {pred_2025[0]:.2f} thousand.")
 
plt.scatter(years, population, color='blue', label='Actual Data')
plt.plot(future_years, predicted_population, color='red', label='Polynomial Fit')
plt.xlabel('Year')
plt.ylabel('Population (thousand)')
plt.legend()
plt.show()

Bayes’ Theorem for Classification

Bayes’ theorem provides a probabilistic framework for classification. The core idea is to compare posterior probabilities to determine the most likely class:

P(C | X) = P(X | C) · P(C) / P(X)

Where:

  • P(C | X) is the posterior probability of class C given evidence X
  • P(X | C) is the likelihood of evidence X given class C
  • P(C) is the prior probability of class C
  • P(X) is the marginal probability of evidence X
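A worked numeric sketch with made-up numbers (a toy spam filter):

```python
# Hypothetical numbers: 20% of emails are spam; the word "offer" appears
# in 60% of spam emails and 5% of non-spam emails.
p_spam = 0.20
p_word_given_spam = 0.60
p_word_given_ham = 0.05

# Marginal probability of the evidence: P(word), summed over both classes
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' theorem: P(spam | word) = P(word | spam) * P(spam) / P(word)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(f"P(spam | word) = {p_spam_given_word:.3f}")  # → 0.750
```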

Classification with Multiple Discrete Features

The PlayTennis dataset demonstrates classification with categorical features using a Naive Bayes classifier:

import pandas as pd
from sklearn.naive_bayes import BernoulliNB
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
 
data = {
    'Rec': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
    'Outlook': ['Sunny', 'Sunny', 'Overcast', 'Rain', 'Rain', 'Rain', 'Overcast', 'Sunny', 'Sunny', 'Rain', 'Sunny', 'Overcast', 'Overcast', 'Rain'],
    'Temperature': ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool', 'Mild', 'Cool', 'Mild', 'Mild', 'Mild', 'Hot', 'Mild'],
    'Humidity': ['High', 'High', 'High', 'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'High'],
    'Wind': ['Weak', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Strong'],
    'PlayTennis': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']
}
df = pd.DataFrame(data)

Encoding categorical variables:

df = df.drop('Rec', axis=1)
le = LabelEncoder()
encoded_df = df.copy()
for col in df.columns:
    encoded_df[col] = le.fit_transform(df[col])
 
X = encoded_df.drop('PlayTennis', axis=1)
y = encoded_df['PlayTennis']
 
bnb = BernoulliNB()
bnb.fit(X, y)

Making predictions on new data:

data_new = {
    'Outlook': ['Sunny'], 'Temperature': ['Cool'], 'Humidity': ['High'], 'Wind': ['Strong'],
    'PlayTennis': ['Yes']  # placeholder value only; this column is dropped before prediction
}
data_new = pd.DataFrame(data_new)
 
encoded_new_data = data_new.copy()
for col in data_new.columns:
    if col in df.columns:
        le = LabelEncoder()
        le.fit(df[col])
        encoded_new_data[col] = le.transform(data_new[col])
 
X_new = encoded_new_data.drop('PlayTennis', axis=1)
 
new_pred = bnb.predict(X_new)
play_tennis_le = LabelEncoder()
play_tennis_le.fit(df['PlayTennis'])
decoded_new_pred = play_tennis_le.inverse_transform(new_pred)
print("Prediction for new data:", decoded_new_pred)

Classification with Continuous Features

When features are continuous, the Normal (Gaussian) distribution is typically assumed. Its probability density function depends on the mean and variance, producing the characteristic bell curve.
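As a sketch of the likelihood computation this implies, scipy.stats.norm can evaluate p(x | class) from hypothetical per-class means and standard deviations:

```python
from scipy.stats import norm

# Hypothetical per-class statistics for one feature (petal length, cm)
class_stats = {'setosa': (1.5, 0.17), 'versicolor': (4.3, 0.47)}  # (mean, std)
x = 1.7  # observed feature value

# Gaussian likelihood p(x | class) from each class's bell curve
for label, (mu, sigma) in class_stats.items():
    print(f"p(x=1.7 | {label}) = {norm.pdf(x, loc=mu, scale=sigma):.4f}")
```

The class whose bell curve assigns the higher likelihood (combined with the class prior, as in Bayes' theorem) wins; GaussianNB does this per feature and multiplies the likelihoods under the independence assumption.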

Iris Dataset: Multiple Continuous Features

The Iris dataset contains four continuous features (sepal length, sepal width, petal length, petal width) and a categorical label (Iris Setosa, Versicolor, or Virginica). Each feature is assumed to follow a Gaussian distribution and features are treated as independent:

from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

gnb = GaussianNB()
gnb.fit(X_train, y_train)

y_pred = gnb.predict(X_test)
# Accuracy and a confusion matrix are the appropriate metrics for classification
# (mean squared error is a regression metric)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Summary

Topic             | Key Points
Matplotlib        | Line plots, histograms, scatter plots; customization via color, linestyle, marker; object-oriented API
ML Workflow       | Load data → Split train/test → Choose model → Train → Evaluate → Predict
Linear Regression | Simple (one feature), multiple (many features), polynomial (non-linear); watch for underfitting/overfitting
Naive Bayes       | Probabilistic classifier; BernoulliNB for discrete features, GaussianNB for continuous features
