0506 Linear Regression
In Depth: Linear Regression
Just as naive Bayes (discussed earlier in In Depth: Naive Bayes Classification) is a good starting point for classification tasks, linear regression models are a good starting point for regression tasks. Such models are popular because they can be fit very quickly, and are very interpretable. You are probably familiar with the simplest form of a linear regression model (i.e., fitting a straight line to data), but such models can be extended to model more complicated data behavior.
In this section we will start with a quick intuitive walk-through of the mathematics behind this well-known problem, before moving on to see how linear models can be generalized to account for more complicated patterns in data.
We begin with the standard imports:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns; sns.set()
import numpy as np

Simple Linear Regression
We will start with the most familiar linear regression, a straight-line fit to data. A straight-line fit is a model of the form

y = a x + b

where a is commonly known as the slope, and b is commonly known as the intercept.
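To make the fitting procedure concrete, here is a minimal sketch (not part of the original example) that recovers a slope and intercept from noisy data; it uses np.polyfit, which solves the degree-1 least-squares problem directly:

```python
import numpy as np

# Synthetic data scattered about a known line (slope 2, intercept -5)
rng = np.random.RandomState(42)
x = 10 * rng.rand(50)
y = 2 * x - 5 + rng.randn(50)

# Degree-1 polyfit returns the least-squares [slope, intercept]
slope, intercept = np.polyfit(x, y, 1)
print(slope, intercept)  # close to the true values 2 and -5
```

This is the same optimization LinearRegression performs under the hood for a single feature.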
Consider the following data, which is scattered about a line with a slope of 2 and an intercept of -5:
rng = np.random.RandomState(1)
x = 10 * rng.rand(50)
y = 2 * x - 5 + rng.randn(50)
plt.scatter(x, y);

We can use Scikit-Learn's LinearRegression estimator to fit this data and construct the best-fit line:
from sklearn.linear_model import LinearRegression
model = LinearRegression(fit_intercept=True)
model.fit(x[:, np.newaxis], y)
xfit = np.linspace(0, 10, 1000)
yfit = model.predict(xfit[:, np.newaxis])
plt.scatter(x, y)
plt.plot(xfit, yfit);

The slope and intercept of the data are contained in the model's fit parameters, which in Scikit-Learn are always marked by a trailing underscore.
Here the relevant parameters are coef_ and intercept_:
print("Model slope: ", model.coef_[0])
print("Model intercept:", model.intercept_)

Polynomial basis functions
This polynomial projection is useful enough that it is built into Scikit-Learn, using the PolynomialFeatures transformer:
from sklearn.preprocessing import PolynomialFeatures
x = np.array([2, 3, 4])
poly = PolynomialFeatures(3, include_bias=False)
poly.fit_transform(x[:, None])

We see here that the transformer has converted our one-dimensional array into a three-dimensional array by taking the exponent of each value. This new, higher-dimensional data representation can then be plugged into a linear regression.
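To make the transformation transparent, here is a small check (an illustration added here, not from the original text) showing that each output column is simply a power of the input:

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

x = np.array([2, 3, 4])
poly = PolynomialFeatures(3, include_bias=False)
features = poly.fit_transform(x[:, None])

# Each row should be [x, x**2, x**3] for the corresponding input value
manual = np.stack([x, x**2, x**3], axis=1)
print(features)
```

The linear model then learns one coefficient per column, which is what makes the combined model polynomial in x while remaining linear in its parameters.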
As we saw in Feature Engineering, the cleanest way to accomplish this is to use a pipeline. Let's make a 7th-degree polynomial model in this way:
from sklearn.pipeline import make_pipeline
poly_model = make_pipeline(PolynomialFeatures(7),
                           LinearRegression())

With this transform in place, we can use the linear model to fit much more complicated relationships between x and y. For example, here is a sine wave with noise:
rng = np.random.RandomState(1)
x = 10 * rng.rand(50)
y = np.sin(x) + 0.1 * rng.randn(50)
poly_model.fit(x[:, np.newaxis], y)
yfit = poly_model.predict(xfit[:, np.newaxis])
plt.scatter(x, y)
plt.plot(xfit, yfit);

Our linear model, through the use of 7th-order polynomial basis functions, can provide an excellent fit to this non-linear data!
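One way to sanity-check how good "excellent" is (a quick addition, not in the original walkthrough) is the R² score on the training data, available through the pipeline's score method:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

# Same noisy sine-wave data as above
rng = np.random.RandomState(1)
x = 10 * rng.rand(50)
y = np.sin(x) + 0.1 * rng.randn(50)

poly_model = make_pipeline(PolynomialFeatures(7), LinearRegression())
poly_model.fit(x[:, np.newaxis], y)

# R^2 close to 1 indicates the polynomial basis captures the sine shape well
r2 = poly_model.score(x[:, np.newaxis], y)
print(r2)
```

Note that a training-set score only measures fit, not generalization; for model selection between polynomial degrees, a validation curve would be the right tool.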
Let's take a look at another prediction example based on time series. First, import the necessary libraries:
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

Read the specified data file:
data = pd.read_csv('Hongkong.csv')

Assign the year and temperature columns of data to two variables:
X = data['Year'].values.reshape(-1, 1) # feature: year
y = data['Temperature'].values  # label: temp

Train a linear regression model with X as input and y as output:
model = LinearRegression()
model.fit(X, y)

Predict temperature data using the next 5 years as test data:
future_years = np.array([2023, 2024, 2025, 2026, 2027]).reshape(-1, 1)
predicted_temperatures = model.predict(future_years)

Draw the original and predicted year and temperature data:
plt.figure(figsize=(10, 5))
plt.scatter(data['Year'], data['Temperature'], color='blue', label='Original Data')
plt.plot(future_years, predicted_temperatures, color='red', label='Predicted Data', linestyle='--')
plt.xlabel('Year')
plt.ylabel('Temperature')
plt.title('Temperature Prediction for Hong Kong')
plt.legend()
plt.show()
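Since Hongkong.csv is not bundled with this text, here is a self-contained sketch of the same fit-and-extrapolate pattern using made-up yearly temperatures (the data and the mild warming trend are invented purely for illustration):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Placeholder data standing in for the Hongkong.csv columns:
# a slight upward trend of 0.02 degrees/year plus noise
rng = np.random.RandomState(0)
years = np.arange(2000, 2023).reshape(-1, 1)
temps = 23.0 + 0.02 * (years.ravel() - 2000) + 0.1 * rng.randn(len(years))

# Fit on past years, then extrapolate to the next five
model = LinearRegression().fit(years, temps)
future = np.array([[2023], [2024], [2025], [2026], [2027]])
pred = model.predict(future)
print(pred)
```

Keep in mind that a straight-line extrapolation only captures the linear trend in the training years; it cannot anticipate any change in that trend.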