Matplotlib and Machine Learning
An introduction to data visualization with Matplotlib and foundational machine learning concepts using scikit-learn.
Overview
This lecture covers two core Python libraries for data analysis:
- Matplotlib: A Python library for visualizing data and results.
- Scikit-learn (sklearn): A machine learning library for processing data to extract underlying patterns.
Matplotlib
Matplotlib is a Python library for creating static, animated, and interactive visualizations.
Common Chart Types
| Chart Type | Purpose |
|---|---|
| Line Plot | Shows trends over continuous variables |
| Histogram | Shows the distribution of a single variable |
| Scatter Plot | Shows the relationship between two variables |
| Subplots | Combines multiple plots in a single figure |
| Bar Chart | Compares categorical data |
| Pie Chart | Shows proportions of categorical data |
| Box Plot | Shows the distribution and statistical summary of data |
| 3D Plots | Creates three-dimensional visualizations |
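Bar, pie, and box plots from the table are not demonstrated in the sections below; a minimal sketch with synthetic data shows the basic calls:

```python
import matplotlib.pyplot as plt
import numpy as np

fig, axes = plt.subplots(1, 3, figsize=(12, 4))

# Bar chart: compare categorical data
categories = ['A', 'B', 'C']
values = [23, 17, 35]
axes[0].bar(categories, values)
axes[0].set_title('Bar Chart')

# Pie chart: proportions of categorical data
axes[1].pie(values, labels=categories, autopct='%1.1f%%')
axes[1].set_title('Pie Chart')

# Box plot: distribution and statistical summary (median, quartiles, outliers)
samples = np.random.normal(loc=0, scale=1, size=100)
axes[2].boxplot(samples)
axes[2].set_title('Box Plot')

plt.tight_layout()
plt.show()
```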
Line Plot
Line plots are typically used for time-series data, such as stock prices, temperature trends, or website traffic over time.
import matplotlib.pyplot as plt
import numpy as np
# Generate some data
x = np.linspace(0, 10, 100) # Create 100 evenly spaced points between 0 and 10
y = np.sin(x) # Calculate the sine of each x value
# Create the plot
plt.plot(x, y)
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Sine Wave Plot")
plt.grid(True)
plt.savefig('sin_plot.png', dpi=400)
plt.savefig('sin_plot.pdf', dpi=400)
plt.show()
# Read the saved image
import matplotlib.image as mpimg
image = mpimg.imread('sin_plot.png')
plt.imshow(image)
plt.axis('off')
plt.show()
Customization Options
| Property | Options |
|---|---|
| Line Styles | Solid -, Dashed --, Dotted :, Dash-dot -. |
| Colors | Color names (red, blue), hex codes (#FF0000), grayscale (0.7) |
| Markers | Circle o, Plus +, Star *, etc. |
| Linewidth | Controls the thickness of the line |
| Labels and Titles | Add descriptive labels and titles |
| Legends | Identify different plot elements |
| Grid | Add gridlines to improve readability |
| Axis Limits | Control the range of the x and y axes |
plt.plot(x, y, color='green', linestyle='--', marker='o', linewidth=2, label='Sine Wave')
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Customized Sine Wave Plot")
plt.legend()
plt.grid(True)
plt.xlim(0, 10)
plt.ylim(-1.2, 1.2)
plt.show()
Object-Oriented Approach
For more complex plots, the object-oriented approach is recommended:
import matplotlib.pyplot as plt
import numpy as np
x = np.linspace(0, 10, 100)
y1 = np.sin(x)
y2 = np.cos(x)
fig, ax = plt.subplots(figsize=(8, 6))
ax.plot(x, y1, label='Sine', linestyle='--')
ax.plot(x, y2, label='Cosine', linestyle='-')
ax.set_xlabel("X-axis")
ax.set_ylabel("Y-axis")
ax.set_title("Sine and Cosine Waves")
ax.legend()
ax.grid(True)
plt.savefig('plot_objected.png', dpi=400)
plt.show()
Histogram
A histogram shows the distribution of a single variable.
import matplotlib.pyplot as plt
import seaborn; seaborn.set_theme()  # apply seaborn's default plot style (set() is a deprecated alias)
import pandas as pd
import numpy as np
plt.rc('font', family='Times New Roman', size=12)
data = pd.read_csv('./data/president_heights.csv')
heights = np.array(data['height(cm)'])
print(heights)
plt.figure(num=1)
plt.hist(heights, bins=8)
plt.title('Height Distribution of US Presidents')
plt.xlabel('height (cm)')
plt.ylabel('number')
plt.savefig('BarChart.png', dpi=400)
plt.show()
Histograms can be compared with the probability density function (PDF) to understand the theoretical distribution.
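This comparison can be sketched directly: a density-scaled histogram overlaid with a normal PDF. The example below uses synthetic heights so it runs without the presidents CSV, and computes the PDF with NumPy alone:

```python
import matplotlib.pyplot as plt
import numpy as np

rng = np.random.default_rng(0)
heights = rng.normal(loc=180, scale=7, size=200)  # synthetic heights in cm

# density=True scales the bars so their total area is 1, matching a PDF
plt.hist(heights, bins=15, density=True, alpha=0.6, label='Sample')

# Normal PDF with the sample's mean and standard deviation
mu, sigma = heights.mean(), heights.std()
x = np.linspace(heights.min(), heights.max(), 200)
pdf = np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))
plt.plot(x, pdf, 'r-', label='Normal PDF')

plt.xlabel('height (cm)')
plt.ylabel('density')
plt.legend()
plt.show()
```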
Comparing Two Distributions
data_marks1 = pd.read_excel('./data/CourseMarks1.xlsx')
Marks = np.array(data_marks1['总分'], dtype=int)  # '总分' = total score
plt.figure(num=2)
plt.hist(Marks, bins=20)
plt.title('Scores of students in Class 1')
plt.xlabel('Scores of students (Full scores are 100)')
plt.ylabel('number')
plt.savefig('BarChart_scores1.png', dpi=400)
plt.show()
data_marks2 = pd.read_excel('./data/CourseMarks2.xlsx')
Marks = np.array(data_marks2['总分'], dtype=int)  # '总分' = total score
plt.figure(num=3)
plt.hist(Marks, bins=20, color='green')
plt.title('Scores of students in Class 2')
plt.xlabel('Scores of students (Full scores are 100)')
plt.ylabel('number')
plt.savefig('BarChart_scores2.png', dpi=400)
plt.show()
Scatter Plot
A scatter plot shows the relationship between two variables.
import matplotlib.pyplot as plt
import numpy as np
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 1, 3, 5])
plt.scatter(x, y)
plt.xlabel("X-axis")
plt.ylabel("Y-axis")
plt.title("Scatter Plot")
plt.savefig('Simple_scatter.png')
plt.show()
Machine Learning
What is Machine Learning?
Machine learning is a branch of artificial intelligence in which algorithms learn hidden patterns from datasets and use them to make predictions on new, similar data without being explicitly programmed for each task. It is applied across diverse domains, from spam filtering and recommendation systems to image recognition.
Supervised vs Unsupervised Learning
- Supervised Learning: The model learns from labeled data (input-output pairs).
- Unsupervised Learning: The model finds patterns in unlabeled data.
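The distinction shows up directly in scikit-learn's API: a supervised model is fit on (X, y), while an unsupervised one is fit on X alone. A minimal sketch with toy data:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.cluster import KMeans

X = np.array([[1.0], [2.0], [3.0], [10.0], [11.0], [12.0]])
y = np.array([0, 0, 0, 1, 1, 1])  # labels are available

# Supervised: learn the mapping from features to known labels
clf = LogisticRegression().fit(X, y)
print(clf.predict([[2.5], [10.5]]))

# Unsupervised: group the same points with no labels given
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(km.labels_)
```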
Regression vs Classification
| Aspect | Regression | Classification |
|---|---|---|
| Output | Continuous numerical value | Discrete category / class label |
| Goal | Predict a quantity | Predict a category |
| Example | House price prediction | Email spam detection |
Structure of Supervised Learning
A typical supervised learning workflow consists of:
- Input / Features / Attributes: The data used to make predictions.
- Output / Targets / Labels: The known results for training data.
- Model Selection: Choose an appropriate algorithm.
- Training: Fit the model to training data to obtain parameters.
- Evaluation: Assess model performance on test data.
- Inference / Prediction: Apply the trained model to new, unseen data.
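The steps above can be sketched end-to-end in scikit-learn; this example uses the built-in Iris data and a k-nearest-neighbors classifier purely as stand-ins for each stage:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score

# 1. Input features X and target labels y
X, y = load_iris(return_X_y=True)

# 2. Hold out test data for evaluation
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

# 3-4. Model selection and training
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)

# 5. Evaluation on unseen test data
acc = accuracy_score(y_test, model.predict(X_test))
print(f"Test accuracy: {acc:.2f}")

# 6. Inference on a new sample
print(model.predict([[5.1, 3.5, 1.4, 0.2]]))
```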
Linear Regression
Linear regression is a statistical method used to model the relationship between a dependent variable and one or more independent variables.
Simple Linear Regression
import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
# Generate sample data
area = np.array([100, 150, 200, 250, 300]) + np.random.randint(-30, 20, 5)
area = area[:, np.newaxis]
price = np.array([200000, 250000, 300000, 350000, 400000])
# Create and train model
model = LinearRegression()
model.fit(area, price)
print("coefficients:", model.coef_)
print("Intercept:", model.intercept_)
# Predict for new data
new_area = np.array([350])
new_area = new_area[:, np.newaxis]
predicted_price = model.predict(new_area)
print(f"Predicted price for an area of {new_area[0][0]} square feet: {predicted_price[0]}")
Multiple Linear Regression
Multiple linear regression extends simple regression to multiple input features. Below is an example using the California housing dataset:
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
import pandas as pd
import matplotlib.pyplot as plt
data_file_path = './data/cal_housing.data'
data = pd.read_csv(data_file_path, header=None)
print(data.head())
column_names = [
'longitude', 'latitude', 'housingMedianAge', 'totalRooms', 'totalBedrooms',
'population', 'households', 'medianIncome', 'medianHouseValue'
]
data.columns = column_names
print(data.head())
X = data.drop('medianHouseValue', axis=1)
y = data['medianHouseValue']
# Train the model
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LinearRegression()
model.fit(X_train, y_train)
# Test the model
y_pred = model.predict(X_test)
# Evaluate the model
print("coefficients:", model.coef_)
print("Intercept:", model.intercept_)
mse = mean_squared_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f"Mean Squared Error (MSE): {mse}")
print(f"R² Score: {r2}")
Visualizing the prediction results:
plt.figure(figsize=(10, 6))
plt.scatter(y_test, y_pred, color='blue', alpha=0.5)
plt.xlabel('Actual House Prices', fontsize=12)
plt.ylabel('Predicted House Prices', fontsize=12)
plt.title('Actual vs Predicted House Prices', fontsize=14)
plt.grid(True)
plt.xlim(min(y_test), max(y_test))
plt.ylim(min(y_test), max(y_test))
plt.show()
Polynomial Linear Regression
When the relationship between variables is non-linear, a polynomial function can be used:
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
import numpy as np
poly_model = make_pipeline(PolynomialFeatures(7), LinearRegression())
rng = np.random.RandomState(1)
x = 10 * rng.rand(50)
y = np.sin(x) + 0.1 * rng.randn(50)
poly_model.fit(x[:, np.newaxis], y)
xfit = np.linspace(0, 10, 1000)
yfit = poly_model.predict(xfit[:, np.newaxis])
plt.scatter(x, y)
plt.plot(xfit, yfit)
plt.xlabel('x')
plt.ylabel('y')
plt.savefig('polynomial_sin.png', dpi=400)
plt.show()
Underfitting vs Overfitting
- Underfitting (e.g., degree = 1): The model is too simple to capture the underlying pattern.
- Overfitting (e.g., a degree much higher than the data warrants): The model captures noise in the training data and generalizes poorly.
Choosing an appropriate model complexity is crucial for good predictive performance.
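One way to see the trade-off is to compare training and test error across polynomial degrees, a sketch on the same kind of noisy sine data used above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

rng = np.random.RandomState(1)
x = 10 * rng.rand(80)
y = np.sin(x) + 0.1 * rng.randn(80)
X = x[:, np.newaxis]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.5, random_state=0)

# Low degrees underfit (both errors high); very high degrees overfit
# (training error keeps shrinking while test error stops improving)
for degree in [1, 3, 7, 15]:
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse = mean_squared_error(y_train, model.predict(X_train))
    test_mse = mean_squared_error(y_test, model.predict(X_test))
    print(f"degree={degree:2d}  train MSE={train_mse:.4f}  test MSE={test_mse:.4f}")
```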
Predicting Population Over Years
import numpy as np
import matplotlib.pyplot as plt
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
years = np.array([2010, 2011, 2012, 2013, 2014, 2015, 2016, 2017, 2018, 2019, 2020])
population = np.array([100, 103, 108, 115, 124, 135, 148, 163, 180, 199, 220]) + np.random.randint(0, 50, 11)
X = years[:, np.newaxis]
degree = 3
poly_model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
poly_model.fit(X, population)
future_years = np.arange(2010, 2031)[:, np.newaxis]
predicted_population = poly_model.predict(future_years)
pred_2025 = poly_model.predict([[2025]])
print(f"The predicted population in 2025 is {pred_2025[0]:.2f} thousand.")
plt.scatter(years, population, color='blue', label='Actual Data')
plt.plot(future_years, predicted_population)
plt.show()
Bayes’ Theorem for Classification
Bayes’ theorem provides a probabilistic framework for classification. The core idea is to compare posterior probabilities to determine the most likely class:

P(C | X) = P(X | C) · P(C) / P(X)

Where:
- P(C | X) is the posterior probability of class C given evidence X
- P(X | C) is the likelihood of evidence X given class C
- P(C) is the prior probability of class C
- P(X) is the marginal probability of evidence X
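A worked instance with assumed, illustrative numbers: suppose 30% of email is spam, the word "offer" appears in 60% of spam and 5% of non-spam; the posterior follows directly from the formula:

```python
# Assumed illustrative probabilities (not taken from any dataset)
p_spam = 0.3                 # prior P(C = spam)
p_ham = 1 - p_spam           # prior P(C = not spam)
p_word_given_spam = 0.60     # likelihood P(X | spam)
p_word_given_ham = 0.05      # likelihood P(X | not spam)

# Marginal P(X) by the law of total probability
p_word = p_word_given_spam * p_spam + p_word_given_ham * p_ham

# Bayes' theorem: P(spam | X) = P(X | spam) * P(spam) / P(X)
posterior = p_word_given_spam * p_spam / p_word
print(f"P(spam | 'offer') = {posterior:.3f}")  # P(spam | 'offer') = 0.837
```

Even with a low prior, a word far more common in one class pushes the posterior strongly toward that class.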
Classification with Multiple Discrete Features
The PlayTennis dataset demonstrates classification with categorical features using a Naive Bayes classifier:
import pandas as pd
from sklearn.naive_bayes import BernoulliNB
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
data = {
'Rec': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14],
'Outlook': ['Sunny', 'Sunny', 'Overcast', 'Rain', 'Rain', 'Rain', 'Overcast', 'Sunny', 'Sunny', 'Rain', 'Sunny', 'Overcast', 'Overcast', 'Rain'],
'Temperature': ['Hot', 'Hot', 'Hot', 'Mild', 'Cool', 'Cool', 'Cool', 'Mild', 'Cool', 'Mild', 'Mild', 'Mild', 'Hot', 'Mild'],
'Humidity': ['High', 'High', 'High', 'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'Normal', 'Normal', 'High', 'Normal', 'High'],
'Wind': ['Weak', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Weak', 'Weak', 'Strong', 'Strong', 'Weak', 'Strong'],
'PlayTennis': ['No', 'No', 'Yes', 'Yes', 'Yes', 'No', 'Yes', 'No', 'Yes', 'Yes', 'Yes', 'Yes', 'Yes', 'No']
}
df = pd.DataFrame(data)
Encoding categorical variables:
df = df.drop('Rec', axis=1)
le = LabelEncoder()
encoded_df = df.copy()
for col in df.columns:
encoded_df[col] = le.fit_transform(df[col])
X = encoded_df.drop('PlayTennis', axis=1)
y = encoded_df['PlayTennis']
bnb = BernoulliNB()  # note: BernoulliNB binarizes inputs at 0; CategoricalNB is better suited to multi-valued encoded features
bnb.fit(X, y)
Making predictions on new data:
data_new = {
'Outlook': ['Sunny'], 'Temperature': ['Cool'], 'Humidity': ['High'], 'Wind': ['Strong'], 'PlayTennis': ['Yes']
}
data_new = pd.DataFrame(data_new)
encoded_new_data = data_new.copy()
for col in data_new.columns:
if col in df.columns:
le = LabelEncoder()
le.fit(df[col])
encoded_new_data[col] = le.transform(data_new[col])
X_new = encoded_new_data.drop('PlayTennis', axis=1)
new_pred = bnb.predict(X_new)
play_tennis_le = LabelEncoder()
play_tennis_le.fit(df['PlayTennis'])
decoded_new_pred = play_tennis_le.inverse_transform(new_pred)
print("Prediction for new data:", decoded_new_pred)
Classification with Continuous Features
When features are continuous, the Normal (Gaussian) distribution is typically assumed. Its probability density function depends on the mean and variance, producing the characteristic bell curve.
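Under this assumption, the likelihood of an observed feature value is the Gaussian density evaluated at that value, using the class's mean and variance. A minimal sketch with assumed per-class statistics (values loosely inspired by iris petal lengths, labeled here as assumptions):

```python
import numpy as np

def gaussian_pdf(x, mean, var):
    """Normal density with the given mean and variance."""
    return np.exp(-((x - mean) ** 2) / (2 * var)) / np.sqrt(2 * np.pi * var)

# Assumed per-class mean and variance for one feature (petal length, cm)
stats = {'setosa': (1.46, 0.03), 'versicolor': (4.26, 0.22)}

x = 1.5  # observed feature value
for cls, (mean, var) in stats.items():
    print(f"P(x={x} | {cls}) = {gaussian_pdf(x, mean, var):.4f}")
```

The class whose bell curve is tallest at the observed value contributes the larger likelihood, which (combined with the priors) determines the predicted class.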
Iris Dataset: Multiple Continuous Features
The Iris dataset contains four continuous features (sepal length, sepal width, petal length, petal width) and a categorical label (Iris Setosa, Versicolor, or Virginica). Each feature is assumed to follow a Gaussian distribution and features are treated as independent:
from sklearn.datasets import load_iris
from sklearn.naive_bayes import GaussianNB
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
iris = load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
gnb = GaussianNB()
gnb.fit(X_train, y_train)
y_pred = gnb.predict(X_test)
# Classification is evaluated with accuracy and a confusion matrix, not MSE
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Confusion matrix:\n", confusion_matrix(y_test, y_pred))
Summary
| Topic | Key Points |
|---|---|
| Matplotlib | Line plots, histograms, scatter plots; customization via color, linestyle, marker; object-oriented API |
| ML Workflow | Load data → Split train/test → Choose model → Train → Evaluate → Predict |
| Linear Regression | Simple (one feature), multiple (many features), polynomial (non-linear); watch for underfitting/overfitting |
| Naive Bayes | Probabilistic classifier; BernoulliNB for discrete features, GaussianNB for continuous features |
Related Concepts
- Regression Analysis — Linear and polynomial forecasting methods
- Classification — Supervised learning for category prediction
- Introduction to Python — Python basics required for this lecture