02.04 Computation on Arrays: Aggregates
Aggregations: Min, Max, and Everything In Between
Often when faced with a large amount of data, a first step is to compute summary statistics for the data in question. Perhaps the most common summary statistics are the mean and standard deviation, which allow you to summarize the "typical" values in a dataset, but other aggregates are useful as well (the sum, product, median, minimum and maximum, quantiles, etc.).
NumPy has fast built-in aggregation functions for working on arrays; we'll discuss and demonstrate some of them here.
Summing the Values in an Array
As a quick example, consider computing the sum of all values in an array.
Python itself can do this using the built-in sum function:
import numpy as np
L = np.random.random(100)
sum(L)
The syntax is quite similar to that of NumPy's sum function, and the result is the same in the simplest case:
np.sum(L)
However, because it executes the operation in compiled code, NumPy's version of the operation is computed much more quickly:
big_array = np.random.rand(10000)
%timeit sum(big_array)
%timeit np.sum(big_array)
Be careful, though: the sum function and the np.sum function are not identical, which can sometimes lead to confusion!
In particular, their optional arguments have different meanings, and np.sum is aware of multiple array dimensions, as we will see in the following section.
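The difference in optional arguments is easy to demonstrate. In this short sketch (the array values here are illustrative), the second positional argument of the built-in sum is a starting value, while for np.sum it is the axis:

```python
import numpy as np

arr = np.arange(6)              # [0, 1, 2, 3, 4, 5]

# Built-in sum: the second positional argument is a *starting value*.
print(sum(arr, 10))             # 15 + 10 = 25

# np.sum: the second positional argument is the *axis*.
M = np.arange(6).reshape(2, 3)
print(np.sum(M, 1))             # row sums: [ 3 12]
```

Passing the same positional argument to the two functions thus computes two very different things, which is one more reason to prefer the NumPy version consistently.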
Minimum and Maximum
Similarly, Python has built-in min and max functions, used to find the minimum value and maximum value of any given array:
min(big_array), max(big_array)
NumPy's corresponding functions have similar syntax, and again operate much more quickly:
np.min(big_array), np.max(big_array)
For min, max, sum, and several other NumPy aggregates, a shorter syntax is to use methods of the array object itself:
print(big_array.min(), big_array.max(), big_array.sum())
Whenever possible, make sure that you are using the NumPy version of these aggregates when operating on NumPy arrays!
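Outside of IPython, where the %timeit magic is unavailable, the same comparison can be sketched with the standard-library timeit module (the array size and repetition count here are arbitrary choices):

```python
import timeit
import numpy as np

big_array = np.random.rand(10000)

# Time the built-in sum against np.sum over the same array.
t_builtin = timeit.timeit(lambda: sum(big_array), number=100)
t_numpy = timeit.timeit(lambda: np.sum(big_array), number=100)
print(f"built-in sum: {t_builtin:.4f} s, np.sum: {t_numpy:.4f} s")

# The method syntax gives the same results as the function syntax:
print(big_array.min() == np.min(big_array),
      big_array.sum() == np.sum(big_array))
```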
Multidimensional aggregates
One common type of aggregation operation is an aggregate along a row or column. Say you have some data stored in a two-dimensional array:
M = np.random.random((3, 4))
print(M)
By default, each NumPy aggregation function will return the aggregate over the entire array:
M.sum()
Aggregation functions take an additional argument specifying the axis along which the aggregate is computed. For example, we can find the minimum value within each column by specifying axis=0:
M.min(axis=0)
The function returns four values, corresponding to the four columns of numbers.
Similarly, we can find the maximum value within each row:
M.max(axis=1)
The way the axis is specified here can be confusing to users coming from other languages.
The axis keyword specifies the dimension of the array that will be collapsed, rather than the dimension that will be returned.
So specifying axis=0 means that the first axis will be collapsed: for two-dimensional arrays, this means that values within each column will be aggregated.
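One way to keep this straight is to watch the shapes: for a (3, 4) array, aggregating over axis=0 collapses the three rows and leaves four values, while axis=1 collapses the four columns and leaves three. A minimal sketch (using a small arange-based array for reproducibility):

```python
import numpy as np

M = np.arange(12).reshape(3, 4)  # shape (3, 4)

# axis=0 collapses the first dimension: one result per column.
print(M.min(axis=0).shape)  # (4,)

# axis=1 collapses the second dimension: one result per row.
print(M.max(axis=1).shape)  # (3,)
```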
Other aggregation functions
NumPy provides many other aggregation functions, but we won't discuss them in detail here.
Additionally, most aggregates have a NaN-safe counterpart that computes the result while ignoring missing values, which are marked by the special IEEE floating-point NaN value (for a fuller discussion of missing data, see Handling Missing Data).
Some of these NaN-safe functions were not added until NumPy 1.8, so they will not be available in older NumPy versions.
The following table provides a list of useful aggregation functions available in NumPy:
| Function Name | NaN-safe Version | Description |
|---------------|------------------|-------------|
| np.sum | np.nansum | Compute sum of elements |
| np.prod | np.nanprod | Compute product of elements |
| np.mean | np.nanmean | Compute mean of elements |
| np.std | np.nanstd | Compute standard deviation |
| np.var | np.nanvar | Compute variance |
| np.min | np.nanmin | Find minimum value |
| np.max | np.nanmax | Find maximum value |
| np.argmin | np.nanargmin | Find index of minimum value |
| np.argmax | np.nanargmax | Find index of maximum value |
| np.median | np.nanmedian | Compute median of elements |
| np.percentile | np.nanpercentile | Compute rank-based statistics of elements |
| np.any | N/A | Evaluate whether any elements are true |
| np.all | N/A | Evaluate whether all elements are true |
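The behavior of the NaN-safe counterparts is easy to check: an ordinary aggregate propagates NaN through the result, while the nan-prefixed version ignores the missing values. A quick sketch:

```python
import numpy as np

x = np.array([1.0, 2.0, np.nan, 4.0])

print(np.sum(x))      # nan -- NaN propagates through the ordinary aggregate
print(np.nansum(x))   # 7.0 -- the missing value is ignored
print(np.nanmean(x))  # mean of the three non-NaN values
```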
We will see these aggregates often throughout the rest of the book.
Example: What Is the Average Height of US Presidents?
Aggregates available in NumPy can be extremely useful for summarizing a set of values. As a simple example, let's consider the heights of all US presidents. This data is available in the file president_heights.csv, which is a simple comma-separated list of labels and values:
!head -4 president_heights.csv
We'll use the Pandas package, which we'll explore more fully in Chapter 3, to read the file and extract this information (note that the heights are measured in centimeters).
import pandas as pd
data = pd.read_csv('president_heights.csv')
heights = np.array(data['height(cm)'])
print(heights)
Now that we have this data array, we can compute a variety of summary statistics:
print("Mean height: ", heights.mean())
print("Standard deviation:", heights.std())
print("Minimum height: ", heights.min())
print("Maximum height: ", heights.max())
Note that in each case, the aggregation operation reduced the entire array to a single summarizing value, which gives us information about the distribution of values. We may also wish to compute quantiles:
print("25th percentile: ", np.percentile(heights, 25))
print("Median: ", np.median(heights))
print("75th percentile: ", np.percentile(heights, 75))
We see that the median height of US presidents is 182 cm, or just shy of six feet.
Of course, sometimes it's more useful to see a visual representation of this data, which we can accomplish using tools in Matplotlib (we'll discuss Matplotlib more fully in Chapter 4). For example, this code generates the following chart:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn; seaborn.set()  # set plot style
plt.hist(heights)
plt.title('Height Distribution of US Presidents')
plt.xlabel('height (cm)')
plt.ylabel('number');
These aggregates are some of the fundamental pieces of exploratory data analysis that we'll explore in more depth in later chapters of the book.