AITC Wiki

Matplotlib and Machine Learning


An introduction to data visualization with Matplotlib and foundational machine learning concepts using scikit-learn.

Overview

This lecture covers two core Python libraries for data analysis:

  • Matplotlib: A Python library for visualizing data and results.
  • Scikit-learn (sklearn): A machine learning library for processing data to extract underlying patterns.

Matplotlib

Matplotlib is a Python library for creating static, animated, and interactive visualizations.

Common Chart Types

Chart Type   | Purpose
Line Plot    | Shows trends over continuous variables
Histogram    | Shows the distribution of a single variable
Scatter Plot | Shows the relationship between two variables
Subplots     | Combines multiple plots in a single figure
Bar Chart    | Compares categorical data
Pie Chart    | Shows proportions of categorical data
Box Plot     | Shows the distribution and statistical summary of data
3D Plots     | Creates three-dimensional visualizations
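For instance, a bar chart and a pie chart of the same (made-up) category counts can be drawn side by side with subplots:

```python
import matplotlib.pyplot as plt

# Hypothetical category counts, purely for illustration
languages = ['Python', 'C++', 'Java', 'Rust']
counts = [45, 25, 20, 10]

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))

# Bar chart: compares categorical data
ax1.bar(languages, counts, color='steelblue')
ax1.set_title('Bar Chart')
ax1.set_ylabel('Count')

# Pie chart: shows proportions of categorical data
ax2.pie(counts, labels=languages, autopct='%1.1f%%')
ax2.set_title('Pie Chart')

plt.show()
```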

Line Plot

Line plots are typically used for time-series data, such as stock prices, temperature trends, or website traffic over time.

import matplotlib.pyplot as plt
import numpy as np
 
# Generate some data
x = np.linspace(0, 10, 100)  # Create 100 evenly spaced points between 0 and 10
y = np.sin(x)                # Calculate the sine of each x value
 
# Create the plot
plt.plot(x, y)
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Sine Wave Plot")
plt.grid(True)
plt.savefig('sin_plot.png', dpi=400)
plt.savefig('sin_plot.pdf', dpi=400)
plt.show()
 
# Read the saved image
import matplotlib.image as mpimg
image = mpimg.imread('sin_plot.png')
plt.imshow(image)
plt.axis('off')
plt.show()

Customization Options

Property          | Options
Line Styles       | Solid -, Dashed --, Dotted :, Dash-dot -.
Colors            | Color names (red, blue), hex codes (#FF0000), grayscale (0.7)
Markers           | Circle o, Plus +, Star *, etc.
Linewidth         | Controls the thickness of the line
Labels and Titles | Add descriptive labels and titles
Legends           | Identify different plot elements
Grid              | Add gridlines to improve readability
Axis Limits       | Control the range of the x and y axes

plt.plot(x, y, color='green', linestyle='--', marker='o', linewidth=2, label='Sine Wave')
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Customized Sine Wave Plot")
plt.legend()
plt.grid(True)
plt.xlim(0, 10)
plt.ylim(-1.2, 1.2)
plt.show()

Object-Oriented Approach

For more complex plots, the object-oriented approach is recommended:

import matplotlib.pyplot as plt
import numpy as np
 
x = np.linspace(0, 10, 100)
y1 = np.sin(x)
y2 = np.cos(x)
 
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(x, y1, label='Sine', linestyle='--')
ax.plot(x, y2, label='Cosine', linestyle='-')
ax.set_xlabel("X-axis")
ax.set_ylabel("Y-axis")
ax.set_title("Sine and Cosine Waves")
ax.legend()
ax.grid(True)
plt.savefig('plot_objected.png', dpi=400)
plt.show()

Histogram

A histogram shows the distribution of a single variable.

import matplotlib.pyplot as plt
import seaborn
seaborn.set_theme()  # seaborn.set() is a deprecated alias for set_theme()
import pandas as pd
import numpy as np
 
plt.rc('font', family='Times New Roman', size=12)
data = pd.read_csv('./data/president_heights.csv')
heights = np.array(data['height(cm)'])
print(heights)
 
plt.figure(num=1)
plt.hist(heights, bins=8)
plt.title('Height Distribution of US Presidents')
plt.xlabel('height (cm)')
plt.ylabel('number')
plt.savefig('BarChart.png', dpi=400)
plt.show()

Histograms can be compared with the probability density function (PDF) to understand the theoretical distribution.
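As a sketch of that comparison, a fitted normal PDF can be overlaid on a density-normalized histogram (synthetic height data here, assuming scipy is available):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats import norm

rng = np.random.default_rng(0)
samples = rng.normal(loc=170, scale=7, size=500)  # synthetic heights (cm)

# density=True normalizes the bars so their area sums to 1,
# making the histogram directly comparable to a PDF
plt.hist(samples, bins=20, density=True, alpha=0.6, label='Histogram')

xs = np.linspace(samples.min(), samples.max(), 200)
plt.plot(xs, norm.pdf(xs, loc=samples.mean(), scale=samples.std()),
         'r-', label='Fitted normal PDF')
plt.xlabel('height (cm)')
plt.ylabel('density')
plt.legend()
plt.show()
```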

Comparing Two Distributions

data_marks1 = pd.read_excel('./data/CourseMarks1.xlsx')
Marks = np.array(data_marks1['总分'], dtype=int)  # '总分' means 'total score'
plt.figure(num=2)
plt.hist(Marks, bins=20)
plt.title('Scores of students in Class 1')
plt.xlabel('Scores of students (Full scores are 100)')
plt.ylabel('number')
plt.savefig('BarChart_scores1.png', dpi=400)
plt.show()
 
data_marks2 = pd.read_excel('./data/CourseMarks2.xlsx')
Marks = np.array(data_marks2['总分'], dtype=int)  # '总分' means 'total score'
plt.figure(num=3)
plt.hist(Marks, bins=20, color='green')
plt.title('Scores of students in Class 2')
plt.xlabel('Scores of students (Full scores are 100)')
plt.ylabel('number')
plt.savefig('BarChart_scores2.png', dpi=400)
plt.show()

Scatter Plot

A scatter plot shows the relationship between two variables.

import matplotlib.pyplot as plt
import numpy as np
 
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 1, 3, 5])
 
plt.scatter(x, y)
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Scatter Plot")
plt.savefig('Simple_scatter.png')
plt.show()

Machine Learning

What is Machine Learning?

Machine learning is a branch of artificial intelligence that enables algorithms to extract hidden patterns from datasets and generalize to new, similar data without being explicitly programmed for each task. Machine learning finds applications in diverse domains.

Supervised vs Unsupervised Learning

  • Supervised Learning: The model learns from labeled data (input-output pairs).
  • Unsupervised Learning: The model finds patterns in unlabeled data.
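As a minimal sketch of the difference (the model choices here, LogisticRegression and KMeans, are illustrative, not part of the lecture):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = np.array([[1.0], [2.0], [8.0], [9.0]])  # two well-separated groups

# Supervised: labels are provided, so the model learns the input-output mapping
y = np.array([0, 0, 1, 1])
clf = LogisticRegression().fit(X, y)
print(clf.predict([[1.5], [8.5]]))

# Unsupervised: no labels; KMeans discovers the two groups on its own
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)
```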

Regression vs Classification

Aspect  | Regression                 | Classification
Output  | Continuous numerical value | Discrete category / class label
Goal    | Predict a quantity         | Predict a category
Example | House price prediction     | Email spam detection

Structure of Supervised Learning

A typical supervised learning workflow consists of:

  1. Input / Features / Attributes: The data used to make predictions.
  2. Output / Targets / Labels: The known results for training data.
  3. Model Selection: Choose an appropriate algorithm.
  4. Training: Fit the model to training data to obtain parameters.
  5. Evaluation: Assess model performance on test data.
  6. Inference / Prediction: Apply the trained model to new, unseen data.
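These steps can be sketched end-to-end with scikit-learn (the built-in Iris dataset and a logistic-regression model are illustrative choices):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

# 1-2. Features X and labels y
X, y = load_iris(return_X_y=True)

# 3. Model selection
model = LogisticRegression(max_iter=1000)

# 4. Training: fit the model to training data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
model.fit(X_train, y_train)

# 5. Evaluation on held-out test data
print("test accuracy:", model.score(X_test, y_test))

# 6. Inference on new, unseen data
print("prediction:", model.predict([[5.1, 3.5, 1.4, 0.2]]))
```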

Linear Regression

Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables.

Simple Linear Regression

import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
 
# Generate sample data
area = np.array([100, 150, 200, 250, 300]) + np.random.randint(-30, 20, 5)
area = area[:, np.newaxis]
price = np.array([200000, 250000, 300000, 350000, 400000])
 
# Create and train model
model = LinearRegression()
model.fit(area, price)
print("coefficients:", model.coef_)
print("Intercept:", model.intercept_)
 
# Predict for new data
new_area = np.array([350])
new_area = new_area[:, np.newaxis]
predicted_price = model.predict(new_area)
print(f"Predicted price for an area of {new_area[0][0]} square feet: {predicted_price[0]}")
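Since the block above already imports matplotlib.pyplot, the fit can also be visualized; here is a self-contained sketch using the same data without the random noise, so the line passes exactly through the points:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

# Noise-free version of the data above: price = 1000 * area + 100000
area = np.array([100, 150, 200, 250, 300])[:, np.newaxis]
price = np.array([200000, 250000, 300000, 350000, 400000])

model = LinearRegression().fit(area, price)

# Scatter the training data and overlay the fitted line
plt.scatter(area, price, color='blue', label='Training data')
plt.plot(area, model.predict(area), color='red', label='Fitted line')
plt.xlabel('Area (square feet)')
plt.ylabel('Price')
plt.title('Simple Linear Regression Fit')
plt.legend()
plt.show()
```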

Multiple Linear Regression

Multiple linear regression extends simple regression to multiple input features. Below is an example using the California housing dataset:

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import pandas as pd
import matplotlib.pyplot as plt
 
data_file_path = './data/cal_housing.data'
data = pd.read_csv(data_file_path, header=None)
print(data.head())
 
column_names = [
    'longitude', 'latitude', 'housingMedianAge', 'totalRooms', 'totalBedrooms',
    'population', 'households', 'medianIncome', 'medianHouseValue'
]
data.columns = column_names
print(data.head())
 
X = data.drop('medianHouseValue', axis=1)
y = data['medianHouseValue']
# Train the model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
 
# Test the model
y_pred = model.predict(X_test)
 
# Evaluate the model
print("coefficients:", model.coef_)
print("Intercept:", model.intercept_)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse}")
print(f"R² Score: {r2}")

Visualizing the prediction results:

plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, color='blue', alpha=0.5)
plt.xlabel('Actual House Prices', fontsize=12)
plt.ylabel('Predicted House Prices', fontsize=12)
plt.title('Actual vs Predicted House Prices', fontsize=14)
plt.grid(True)
plt.xlim(min(y_test), max(y_test))
plt.ylim(min(y_test), max(y_test))
plt.show()

Polynomial Linear Regression

When the relationship between variables is non-linear, a polynomial function can be used:

from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
import numpy as np
 
poly_model = make_pipeline(PolynomialFeatures(7), LinearRegression())
rng = np.random.RandomState(1)
x = 10 * rng.rand(50)
y = np.sin(x) + 0.1 * rng.randn(50)
poly_model.fit(x[:, np.newaxis], y)
xfit = np.linspace(0, 10, 1000)
yfit = poly_model.predict(xfit[:, np.newaxis])
plt.scatter(x, y)
plt.plot(xfit, yfit)
plt.xlabel('x')
plt.ylabel('y')
plt.savefig('polynomial_sin.png', dpi=400)
plt.show()

Underfitting vs Overfitting

  • Underfitting (e.g., degree = 1): The model is too simple to capture the underlying pattern.
  • Overfitting (e.g., an excessively high polynomial degree): The model captures noise in the training data and generalizes poorly.

Choosing an appropriate model complexity is crucial for good predictive performance.
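The effect can be sketched by comparing training-set and held-out test-set error across polynomial degrees, on synthetic sine data like the example above:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(1)
x = 10 * rng.rand(50)
y = np.sin(x) + 0.1 * rng.randn(50)
X = x[:, np.newaxis]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

results = {}
for degree in [1, 3, 7]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    results[degree] = (train_mse, test_mse)
    # Training error always shrinks with degree; test error reveals under/overfitting
    print(f"degree={degree}  train MSE={train_mse:.4f}  test MSE={test_mse:.4f}")
```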

Predicting Population Over Years

import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
 
years = np.array([2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020])
population = np.array([100, 103, 108, 115, 124, 135, 148, 163, 180, 199, 220]) + np.random.randint(0, 50, 11)
 
X = years[:, np.newaxis]
degree = 3
poly_model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
poly_model.fit(X, population)
future_years = np.arange(2010, 2031)[:, np.newaxis]
predicted_population = poly_model.predict(future_years)
 
pred_2025 = poly_model.predict([[2025]])
print(f"The predicted population in 2025 is {pred_2025[0]:.2f} thousand.")
 
plt.scatter(years, population, color='blue', label='Actual Data')
plt.plot(future_years, predicted_population, color='red', label='Polynomial Fit')
plt.xlabel('Year')
plt.ylabel('Population (thousand)')
plt.legend()
plt.show()

Bayes’ Theorem for Classification

Bayes’ theorem provides a probabilistic framework for classification. The core idea is to compare posterior probabilities to determine the most likely class:

P(C | X) = P(X | C) · P(C) / P(X)

Where:

  • P(C | X) is the posterior probability of class C given evidence X
  • P(X | C) is the likelihood of evidence X given class C
  • P(C) is the prior probability of class C
  • P(X) is the marginal probability of evidence X
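A worked numeric sketch with made-up numbers (a toy spam filter):

```python
# Hypothetical numbers: 20% of emails are spam; the word "offer" appears
# in 60% of spam emails and 5% of non-spam emails.
p_spam = 0.20
p_word_given_spam = 0.60
p_word_given_ham = 0.05

# Marginal probability of the evidence: P(word), summed over both classes
p_word = p_word_given_spam * p_spam + p_word_given_ham * (1 - p_spam)

# Bayes' theorem: P(spam | word) = P(word | spam) * P(spam) / P(word)
p_spam_given_word = p_word_given_spam * p_spam / p_word
print(f"P(spam | word) = {p_spam_given_word:.3f}")  # → 0.750
```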

Classification with Multiple Discrete Features

The PlayTennis dataset demonstrates classification with categorical features using a Naive Bayes classifier:

import pandas as pd
from sklearn.naive_bayes import BernoulliNB
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
 
data = {
    'Rec': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
    'Outlook': ['Sunny', 'Sunny', 'Overcast', 'Rain', 'Rain', 'Rain', 'Overcast', 'Sunny', 'Sunny', 'Rain', 'Sunny', 'Overcast', 'Overcast', 'Rain'],
    'Temperature': ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool', 'Mild', 'Cool', 'Mild', 'Mild', 'Mild', 'Hot', 'Mild'],
    'Humidity': ['High', 'High', 'High', 'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'High'],
    'Wind': ['Weak', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Strong'],
    'PlayTennis': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']
}
df = pd.DataFrame(data)

Encoding categorical variables:

df = df.drop('Rec', axis=1)
le = LabelEncoder()
encoded_df = df.copy()
for col in df.columns:
    encoded_df[col] = le.fit_transform(df[col])
 
X = encoded_df.drop('PlayTennis', axis=1)
y = encoded_df['PlayTennis']
 
bnb = BernoulliNB()
bnb.fit(X, y)

Making predictions on new data:

data_new = {
    'Outlook': ['Sunny'], 'Temperature': ['Cool'], 'Humidity': ['High'], 'Wind': ['Strong'],
    'PlayTennis': ['Yes']  # placeholder value only; this column is dropped before prediction
}
data_new = pd.DataFrame(data_new)
 
encoded_new_data = data_new.copy()
for col in data_new.columns:
    if col in df.columns:
        le = LabelEncoder()
        le.fit(df[col])
        encoded_new_data[col] = le.transform(data_new[col])
 
X_new = encoded_new_data.drop('PlayTennis', axis=1)
 
new_pred = bnb.predict(X_new)
play_tennis_le = LabelEncoder()
play_tennis_le.fit(df['PlayTennis'])
decoded_new_pred = play_tennis_le.inverse_transform(new_pred)
print("Prediction for new data:", decoded_new_pred)

Classification with Continuous Features

When features are continuous, the Normal (Gaussian) distribution is typically assumed. Its probability density function depends on the mean and variance, producing the characteristic bell curve.
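As a sketch of the likelihood computation this implies, scipy.stats.norm can evaluate p(x | class) from hypothetical per-class means and standard deviations:

```python
from scipy.stats import norm

# Hypothetical per-class statistics for one feature (petal length, cm)
class_stats = {'setosa': (1.5, 0.17), 'versicolor': (4.3, 0.47)}  # (mean, std)
x = 1.7  # observed feature value

# Gaussian likelihood p(x | class) from each class's bell curve
for label, (mu, sigma) in class_stats.items():
    print(f"p(x=1.7 | {label}) = {norm.pdf(x, loc=mu, scale=sigma):.4f}")
```

The class whose bell curve assigns the higher likelihood (combined with the class prior, as in Bayes' theorem) wins; GaussianNB does this per feature and multiplies the likelihoods under the independence assumption.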

Iris Dataset: Multiple Continuous Features

The Iris dataset contains four continuous features (sepal length, sepal width, petal length, petal width) and a categorical label (Iris Setosa, Versicolor, or Virginica). Each feature is assumed to follow a Gaussian distribution and features are treated as independent:

from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import confusion_matrix, accuracy_score

iris = load_iris()
X, y = iris.data, iris.target

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

gnb = GaussianNB()
gnb.fit(X_train, y_train)

y_pred = gnb.predict(X_test)
# Accuracy and a confusion matrix are the appropriate metrics for classification
# (mean squared error is a regression metric)
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_pred))

Summary

Topic             | Key Points
Matplotlib        | Line plots, histograms, scatter plots; customization via color, linestyle, marker; object-oriented API
ML Workflow       | Load data → Split train/test → Choose model → Train → Evaluate → Predict
Linear Regression | Simple (one feature), multiple (many features), polynomial (non-linear); watch for underfitting/overfitting
Naive Bayes       | Probabilistic classifier; BernoulliNB for discrete features, GaussianNB for continuous features
