AITC Wiki

0502 Introducing Scikit Learn


Introducing Scikit-Learn

There are several Python libraries which provide solid implementations of a range of machine learning algorithms. One of the best known is Scikit-Learn, a package that provides efficient versions of a large number of common algorithms. Scikit-Learn is characterized by a clean, uniform, and streamlined API, as well as by very useful and complete online documentation. A benefit of this uniformity is that once you understand the basic use and syntax of Scikit-Learn for one type of model, switching to a new model or algorithm is very straightforward.

This section provides an overview of the Scikit-Learn API; a solid understanding of these API elements will form the foundation for understanding the deeper practical discussion of machine learning algorithms and approaches in the following chapters.

We will start by covering data representation in Scikit-Learn, then cover the Estimator API, and finally walk through a more interesting example of using these tools to explore a set of images of hand-written digits.

Scikit-Learn’s Estimator API

The Scikit-Learn API is designed with the following guiding principles in mind, as outlined in the Scikit-Learn API paper:

  • Consistency: All objects share a common interface drawn from a limited set of methods, with consistent documentation.

  • Inspection: All specified parameter values are exposed as public attributes.

  • Limited object hierarchy: Only algorithms are represented by Python classes; datasets are represented in standard formats (NumPy arrays, Pandas DataFrames, SciPy sparse matrices) and parameter names use standard Python strings.

  • Composition: Many machine learning tasks can be expressed as sequences of more fundamental algorithms, and Scikit-Learn makes use of this wherever possible.

  • Sensible defaults: When models require user-specified parameters, the library defines an appropriate default value.
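The composition principle is easiest to see in Scikit-Learn's pipeline tools. As a minimal sketch (make_pipeline, StandardScaler, and LinearRegression are all real library components; the toy data is ours), a preprocessing step and a model can be chained into one composite estimator that exposes the same fit/predict interface as any single model:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression

# Chain a preprocessing step and a model into one composite estimator.
pipe = make_pipeline(StandardScaler(), LinearRegression())

# Toy data lying exactly on y = x.
X = np.array([[1.0], [2.0], [3.0]])
y = np.array([1.0, 2.0, 3.0])

# The composite object is fit and queried exactly like a single estimator.
pred = pipe.fit(X, y).predict([[4.0]])
```

Because both steps are linear here, the pipeline recovers the exact relationship and predicts 4.0 for an input of 4.0.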

In practice, these principles make Scikit-Learn very easy to use, once the basic principles are understood. Every machine learning algorithm in Scikit-Learn is implemented via the Estimator API, which provides a consistent interface for a wide range of machine learning applications.
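As a small illustration of that uniformity, the identical instantiate/fit/predict sequence works unchanged across model classes. The sketch below (LinearRegression and Ridge are both built-in Scikit-Learn regressors; the data is an illustrative toy set) runs the same three calls on two different models:

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

# Toy data lying exactly on y = 2x.
X = np.array([[1.0], [2.0], [3.0], [4.0]])
y = np.array([2.0, 4.0, 6.0, 8.0])

# The identical sequence of calls works for either model class.
predictions = {}
for Model in (LinearRegression, Ridge):
    model = Model()          # instantiate (default hyperparameters)
    model.fit(X, y)          # learn model parameters from the data
    predictions[Model.__name__] = model.predict([[5.0]])[0]
```

Ordinary least squares recovers the exact line (predicting 10.0 at x = 5), while Ridge's default regularization shrinks the slope slightly; the point is that switching models required no change to the calling code.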

Supervised learning example: Simple linear regression

As an example of this process, let’s consider a simple linear regression, that is, the common case of fitting a line to data. We will use the following simple data for our regression example:

import matplotlib.pyplot as plt
import numpy as np

rng = np.random.RandomState(42)  # seeded generator for reproducibility
x = 10 * rng.rand(50)            # 50 points drawn uniformly from [0, 10)
y = 2 * x - 1 + rng.randn(50)    # a line with slope 2, intercept -1, plus Gaussian noise
plt.scatter(x, y);

With this data in place, we can use the recipe outlined earlier. Let’s walk through the process:

1. Choose a class of model

In Scikit-Learn, every class of model is represented by a Python class. So, for example, if we would like to compute a simple linear regression model, we can import the linear regression class:

from sklearn.linear_model import LinearRegression

2. Choose model hyperparameters

An important point is that a class of model is not the same as an instance of a model.

Once we have decided on our model class, there are still some options open to us. Depending on the model class we are working with, we might need to answer one or more questions like the following:

  • Would we like to fit for the offset (i.e., y-intercept)?
  • Would we like the model to be normalized?
  • Would we like to preprocess our features to add model flexibility?
  • What degree of regularization would we like to use in our model?
  • How many model components would we like to use?

These are examples of the important choices that must be made once the model class is selected. These choices are often represented as hyperparameters, or parameters that must be set before the model is fit to data. In Scikit-Learn, hyperparameters are chosen by passing values at model instantiation. We will explore how you can quantitatively motivate the choice of hyperparameters in Hyperparameters and Model Validation.

For our linear regression example, we can instantiate the LinearRegression class and specify that we would like to fit the intercept using the fit_intercept hyperparameter:

model = LinearRegression(fit_intercept=True)
model

Keep in mind that when the model is instantiated, the only action is the storing of these hyperparameter values. In particular, we have not yet applied the model to any data: the Scikit-Learn API makes very clear the distinction between choice of model and application of model to data.
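This separation can be inspected directly. The get_params() method is part of the standard estimator interface, and learned attributes such as coef_ simply do not exist until fit() has been called; a quick sketch:

```python
from sklearn.linear_model import LinearRegression

model = LinearRegression(fit_intercept=True)

# The hyperparameter is stored and publicly inspectable...
stored = model.get_params()["fit_intercept"]

# ...but no learned parameters exist yet: coef_ appears only after fit().
already_fit = hasattr(model, "coef_")
```

Here stored is True while already_fit is False, reflecting that instantiation records the configuration without touching any data.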

3. Arrange data into a features matrix and target vector

Previously we detailed the Scikit-Learn data representation, which requires a two-dimensional features matrix and a one-dimensional target array. Here our target variable y is already in the correct form (a length-n_samples array), but we need to massage the data x to make it a matrix of size [n_samples, n_features]. In this case, this amounts to a simple reshaping of the one-dimensional array:

X = x[:, np.newaxis]
X.shape
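The np.newaxis indexing above is equivalent to NumPy's reshape method; a quick sketch (regenerating the same data as before) confirming that both produce the same [n_samples, 1] matrix:

```python
import numpy as np

rng = np.random.RandomState(42)
x = 10 * rng.rand(50)          # same 1D data as above

X1 = x[:, np.newaxis]          # add a trailing axis via slicing
X2 = x.reshape(-1, 1)          # or reshape, letting NumPy infer the row count

same = X1.shape == X2.shape == (50, 1) and np.array_equal(X1, X2)
```

Either form is idiomatic; reshape(-1, 1) is often preferred when the array might not already be one-dimensional.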

4. Fit the model to your data

Now it is time to apply our model to data. This can be done with the fit() method of the model:

model.fit(X, y)

This fit() command causes a number of model-dependent internal computations to take place, and the results of these computations are stored in model-specific attributes that the user can explore. In Scikit-Learn, by convention all model parameters that were learned during the fit() process have trailing underscores; for example, in this linear model we have the following:

model.coef_
model.intercept_

These two parameters represent the slope and intercept of the simple linear fit to the data. Comparing to the data definition, we see that they are very close to the input slope of 2 and intercept of -1.
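We can check this closeness numerically. The sketch below refits the same model end to end and measures how far the learned parameters are from the true slope and intercept (the tolerances are our own loose choices, allowing for the injected noise):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Regenerate the same data: true slope 2, true intercept -1, unit noise.
rng = np.random.RandomState(42)
x = 10 * rng.rand(50)
y = 2 * x - 1 + rng.randn(50)

model = LinearRegression(fit_intercept=True).fit(x[:, np.newaxis], y)

# Distance of the learned parameters from the generating values.
slope_error = abs(model.coef_[0] - 2)
intercept_error = abs(model.intercept_ - (-1))
```

With 50 noisy points, the slope is typically recovered to within a few hundredths and the intercept to within a few tenths.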

One question that frequently comes up regards the uncertainty in such internal model parameters. In general, Scikit-Learn does not provide tools to draw conclusions from internal model parameters themselves: interpreting model parameters is much more a statistical modeling question than a machine learning question. Machine learning rather focuses on what the model predicts. If you would like to dive into the meaning of fit parameters within the model, other tools are available, including the Statsmodels Python package.

5. Predict labels for unknown data

Once the model is trained, the main task of supervised machine learning is to evaluate it based on what it says about new data that was not part of the training set. In Scikit-Learn, this can be done using the predict() method. For the sake of this example, our “new data” will be a grid of x values, and we will ask what y values the model predicts:

xfit = np.linspace(-1, 11)  # 50 evenly spaced points spanning (and slightly beyond) the data range

As before, we need to coerce these x values into a [n_samples, n_features] features matrix, after which we can feed it to the model:

Xfit = xfit[:, np.newaxis]
yfit = model.predict(Xfit)

Finally, let’s visualize the results by plotting first the raw data, and then this model fit:

plt.scatter(x, y)
plt.plot(xfit, yfit);

Typically, the efficacy of the model is evaluated by comparing its results to some known baseline, as we will see in the next example.
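One common single-number summary of this kind is the coefficient of determination R², which every Scikit-Learn regressor reports via its score() method: R² is 1 for a perfect fit and 0 for a model no better than always predicting the mean. A sketch on the same data as above:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Regenerate the example data.
rng = np.random.RandomState(42)
x = 10 * rng.rand(50)
y = 2 * x - 1 + rng.randn(50)
X = x[:, np.newaxis]

model = LinearRegression().fit(X, y)

# score() returns R^2 on the given data: the fraction of the variance
# in y that the fitted line explains.
r2 = model.score(X, y)
```

Because the signal (slope 2 over a range of 10) dwarfs the unit noise, R² here is well above 0.9; note that scoring on the training data, as done here for simplicity, generally overstates performance on truly new data.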