0206 Boolean Arrays and Masks
Comparisons, Masks, and Boolean Logic
This section covers the use of Boolean masks to examine and manipulate values within NumPy arrays. Masking comes up when you want to extract, modify, count, or otherwise manipulate values in an array based on some criterion: for example, you might wish to count all values greater than a certain value, or perhaps remove all outliers that are above some threshold. In NumPy, Boolean masking is often the most efficient way to accomplish these types of tasks.
Example: Counting Rainy Days
Imagine you have a series of data that represents the amount of precipitation each day for a year in a given city. For example, here we’ll load the daily rainfall statistics for the city of Seattle in 2014, using Pandas (which is covered in more detail in Chapter 3):
import numpy as np
import pandas as pd
# use pandas to extract rainfall (recorded in 1/10 mm) as a NumPy array
rainfall = pd.read_csv('Seattle2014.csv')['PRCP'].values
inches = rainfall / 254.0 # 1/10mm -> inches
inches.shape
The array contains 365 values, giving daily rainfall in inches from January 1 to December 31, 2014.
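For readers without Seattle2014.csv at hand, the conversion step can be sketched on a small hand-made array. The values and the names sample_tenths_mm and sample_inches are hypothetical and are not taken from the dataset; the only fact relied on is that 1 inch = 25.4 mm, so 254 tenths of a millimeter make one inch.

```python
import numpy as np

# Hypothetical precipitation readings in tenths of a millimeter
# (illustrative values only, not from Seattle2014.csv)
sample_tenths_mm = np.array([0, 41, 127, 254, 508])

# 1 inch = 25.4 mm = 254 tenths of a millimeter
sample_inches = sample_tenths_mm / 254.0

print(sample_inches.shape)  # one converted value per reading
print(sample_inches)
```

Dividing by the float 254.0 yields a floating-point array even though the input holds integers.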
As a first quick visualization, let’s look at the histogram of rainy days, which was generated using Matplotlib (we will explore this tool more fully in Chapter 4):
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn; seaborn.set()  # set plot styles
plt.hist(inches, 40);
This histogram gives us a general idea of what the data looks like: despite its reputation, the vast majority of days in Seattle saw near zero measured rainfall in 2014. But this doesn’t do a good job of conveying some information we’d like to see: for example, how many rainy days were there in the year? What is the average precipitation on those rainy days? How many days were there with more than half an inch of rain?
Comparison Operators as ufuncs
In Computation on NumPy Arrays: Universal Functions we introduced ufuncs, and focused in particular on arithmetic operators. We saw that using +, -, *, /, and others on arrays leads to element-wise operations.
NumPy also implements comparison operators such as < (less than) and > (greater than) as element-wise ufuncs.
The result of these comparison operators is always an array with a Boolean data type.
All six of the standard comparison operations are available:
x = np.array([1, 2, 3, 4, 5])
x < 3  # less than
x > 3  # greater than
x <= 3  # less than or equal
x >= 3  # greater than or equal
x != 3  # not equal
x == 3  # equal
It is also possible to do an element-wise comparison of two arrays, and to include compound expressions:
(2 * x) == (x ** 2)
As in the case of arithmetic operators, the comparison operators are implemented as ufuncs in NumPy; for example, when you write x < 3, internally NumPy uses np.less(x, 3).
A summary of the comparison operators and their equivalent ufunc is shown here:
| Operator | Equivalent ufunc | Operator | Equivalent ufunc |
|----------|------------------|----------|------------------|
| == | np.equal | != | np.not_equal |
| < | np.less | <= | np.less_equal |
| > | np.greater | >= | np.greater_equal |
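The correspondence in the table can be checked directly. The following is a minimal sketch, assuming the same five-element array as above; the dictionary name pairs is introduced here only for illustration.

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])

# Pair each operator result with the result of the ufunc named in the table
pairs = {
    "==": (x == 3, np.equal(x, 3)),
    "!=": (x != 3, np.not_equal(x, 3)),
    "<": (x < 3, np.less(x, 3)),
    "<=": (x <= 3, np.less_equal(x, 3)),
    ">": (x > 3, np.greater(x, 3)),
    ">=": (x >= 3, np.greater_equal(x, 3)),
}

for symbol, (via_operator, via_ufunc) in pairs.items():
    # Both routes give a Boolean array with identical contents
    print(symbol, via_operator.dtype, np.array_equal(via_operator, via_ufunc))
```

Calling the ufunc by name is mainly useful when you need its extra keyword arguments, such as out; for everyday work the operator form is easier to read.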
Just as in the case of arithmetic ufuncs, these will work on arrays of any size and shape. Here is a two-dimensional example:
rng = np.random.RandomState(0)
x = rng.randint(10, size=(3, 4))
x
x < 6
In each case, the result is a Boolean array, and NumPy provides a number of straightforward patterns for working with these Boolean results.
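To make the shape and data type of such a result concrete, here is a small sketch. The array grid is a hypothetical, explicitly written 3x4 stand-in, so the example does not depend on the random number generator.

```python
import numpy as np

# Hypothetical 3x4 integer array, written out explicitly for reproducibility
grid = np.array([[5, 0, 3, 3],
                 [7, 9, 3, 5],
                 [2, 4, 7, 6]])

mask = grid < 6

print(mask.shape)  # same shape as grid
print(mask.dtype)  # bool
print(mask)
```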
Working with Boolean Arrays
Given a Boolean array, there are a host of useful operations you can do.
We’ll work with x, the two-dimensional array we created earlier.
print(x)
Counting entries
To count the number of True entries in a Boolean array, np.count_nonzero is useful:
# how many values less than 6?
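np.count_nonzero(x < 6)
We see that there are eight array entries that are less than 6.
Another way to get at this information is to use np.sum; in this case, False is interpreted as 0, and True is interpreted as 1:
np.sum(x < 6)
Like other NumPy aggregation functions, np.sum can also be applied along rows or columns:
# how many values less than 6 in each row?
np.sum(x < 6, axis=1)
This counts the number of values less than 6 in each row of the matrix.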
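The counting patterns can be compared side by side. This is a minimal sketch on a hypothetical, explicitly written array named grid, not the author's x; the counts in the comments follow from the values shown.

```python
import numpy as np

# Hypothetical 3x4 array, written out so the counts can be checked by eye
grid = np.array([[5, 0, 3, 3],
                 [7, 9, 3, 5],
                 [2, 4, 7, 6]])
mask = grid < 6

total_nonzero = np.count_nonzero(mask)          # counts True entries: 8
total_sum = np.sum(mask)                        # True counts as 1: also 8
per_row = np.sum(mask, axis=1)                  # one count per row: [4 2 2]
per_column = np.sum(mask, axis=0)               # one count per column: [2 2 2 2]
per_row_nonzero = np.count_nonzero(mask, axis=1)

print(total_nonzero, total_sum)
print(per_row, per_column, per_row_nonzero)
```

Both functions accept an axis argument, so the choice between them is largely a matter of readability.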
If we’re interested in quickly checking whether any or all the values are true, we can use (you guessed it) np.any or np.all:
# are there any values greater than 8?
np.any(x > 8)
# are there any values less than zero?
np.any(x < 0)
# are all values less than 10?
np.all(x < 10)
# are all values equal to 6?
np.all(x == 6)
np.all and np.any can be used along particular axes as well. For example:
# are all values in each row less than 8?
np.all(x < 8, axis=1)
Here all the elements in the first and third rows are less than 8, while this is not the case for the second row.
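The axis behavior of np.any and np.all can be sketched as follows, again on a hypothetical array named grid that is written out explicitly.

```python
import numpy as np

# Hypothetical 3x4 array; only the second row contains a value of 8 or more
grid = np.array([[5, 0, 3, 3],
                 [7, 9, 3, 5],
                 [2, 4, 7, 6]])

all_per_row = np.all(grid < 8, axis=1)     # one answer per row
any_per_column = np.any(grid > 8, axis=0)  # one answer per column
any_overall = np.any(grid > 8)             # a single answer for the whole array

print(all_per_row)     # [ True False  True]
print(any_per_column)  # [False  True False False]
print(any_overall)     # True
```

With axis=1 the reduction runs across the columns of each row, and with axis=0 it runs down the rows of each column.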
Finally, a quick warning: as mentioned in Aggregations: Min, Max, and Everything In Between, Python has built-in sum(), any(), and all() functions. These have a different syntax than the NumPy versions, and in particular will fail or produce unintended results when used on multidimensional arrays. Be sure that you are using np.sum(), np.any(), and np.all() for these examples!
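A short sketch shows what this warning looks like in practice. It uses a hypothetical two-dimensional array named grid; the variable names are introduced here for illustration only.

```python
import numpy as np

grid = np.array([[5, 0, 3, 3],
                 [7, 9, 3, 5],
                 [2, 4, 7, 6]])
mask = grid < 6

# The built-in sum iterates over the rows and adds them together,
# so it returns per-column counts rather than a single total
builtin_total = sum(mask)
numpy_total = np.sum(mask)

# The built-in any asks each row for a single truth value, which fails
builtin_any_error = None
try:
    any(mask)
except ValueError as err:
    builtin_any_error = err

print(builtin_total)      # an array, not a scalar
print(numpy_total)        # 8
print(builtin_any_error)  # message about an ambiguous truth value
```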
Boolean operators
We’ve already seen how we might count, say, all days with rain less than four inches, or all days with rain greater than two inches.
But what if we want to know about all days with rain less than four inches and greater than one inch?
This is accomplished through Python’s bitwise logic operators, &, |, ^, and ~.
As with the standard arithmetic operators, NumPy overloads these as ufuncs which work element-wise on (usually Boolean) arrays.
For example, we can address this sort of compound question as follows:
np.sum((inches > 0.5) & (inches < 1))
So we see that there are 29 days with rainfall between 0.5 and 1.0 inches.
Note that the parentheses here are important: because of operator precedence rules, with parentheses removed this expression would be evaluated as follows, which results in an error:
inches > (0.5 & inches) < 1
Using the equivalence of A AND B and NOT (NOT A OR NOT B) (which you may remember if you’ve taken an introductory logic course), we can compute the same result in a different manner:
np.sum(~((inches <= 0.5) | (inches >= 1)))
Combining comparison operators and Boolean operators on arrays can lead to a wide range of efficient logical operations.
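Both points, the precedence error and the equivalent rewritten form, can be reproduced on a small array. This is a sketch under stated assumptions: sample_inches holds hypothetical values, not the Seattle data.

```python
import numpy as np

# Hypothetical daily rainfall in inches (illustrative values only)
sample_inches = np.array([0.0, 0.3, 0.6, 0.9, 1.0, 1.4])

# Parenthesized form: True where 0.5 < value < 1
with_and = np.sum((sample_inches > 0.5) & (sample_inches < 1))

# Rewritten form using NOT (NOT A OR NOT B)
with_de_morgan = np.sum(~((sample_inches <= 0.5) | (sample_inches >= 1)))

# Without parentheses, & binds tighter than the comparisons, so Python
# first attempts 0.5 & sample_inches, which is undefined for floats
precedence_error = None
try:
    sample_inches > 0.5 & sample_inches < 1
except TypeError as err:
    precedence_error = err

print(with_and, with_de_morgan)  # 2 2
print(type(precedence_error).__name__)
```

One caveat: the rewritten form replaces NOT (value > 0.5) with value <= 0.5, and those two differ for NaN entries, so the two counts are guaranteed to agree only when the data contain no NaN values.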
The following table summarizes the bitwise Boolean operators and their equivalent ufuncs:
| Operator | Equivalent ufunc | Operator | Equivalent ufunc |
|----------|------------------|----------|------------------|
| & | np.bitwise_and | \| | np.bitwise_or |
| ^ | np.bitwise_xor | ~ | np.bitwise_not |
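As with the comparison operators, the table can be verified in a few lines. The arrays a and b below are hypothetical and cover all four combinations of True and False.

```python
import numpy as np

# Two Boolean arrays covering every combination of truth values
a = np.array([True, True, False, False])
b = np.array([True, False, True, False])

checks = {
    "&": np.array_equal(a & b, np.bitwise_and(a, b)),
    "|": np.array_equal(a | b, np.bitwise_or(a, b)),
    "^": np.array_equal(a ^ b, np.bitwise_xor(a, b)),
    "~": np.array_equal(~a, np.bitwise_not(a)),
}

print(checks)
print(a & b)  # True only where both inputs are True
print(a ^ b)  # True where exactly one input is True
```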
Using these tools, we might start to answer the types of questions we have about our weather data. Here are some examples of results we can compute when combining masking with aggregations:
print("Number days without rain:      ", np.sum(inches == 0))
print("Number days with rain:         ", np.sum(inches != 0))
print("Days with more than 0.5 inches:", np.sum(inches > 0.5))
print("Rainy days with < 0.2 inches  :", np.sum((inches > 0) &
                                                (inches < 0.2)))
Boolean Arrays as Masks
In the preceding section we looked at aggregates computed directly on Boolean arrays.
A more powerful pattern is to use Boolean arrays as masks, to select particular subsets of the data themselves.
Returning to our x array from before, suppose we want an array of all values in the array that are less than, say, 5:
x
We can obtain a Boolean array for this condition easily, as we’ve already seen:
x < 5
Now to select these values from the array, we can simply index on this Boolean array; this is known as a masking operation:
x[x < 5]
What is returned is a one-dimensional array filled with all the values that meet this condition; in other words, all the values in positions at which the mask array is True.
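Two properties of masking are worth seeing in code: selection returns a new one-dimensional array rather than a view, and a mask can also be used on the left-hand side of an assignment to modify values in place. The sketch below uses a hypothetical array named grid, and the capping step is an illustration that goes slightly beyond the text above.

```python
import numpy as np

grid = np.array([[5, 0, 3, 3],
                 [7, 9, 3, 5],
                 [2, 4, 7, 6]])

# Selection: values are gathered in row-major order into a 1-D array
small = grid[grid < 5]
print(small)  # [0 3 3 3 2 4]

# The selection is a copy, so changing it leaves grid untouched
small[0] = 99
print(grid[0, 1])  # still 0

# Assignment through a mask modifies the target array in place;
# here every value above 6 is capped at 6 in a separate copy
capped = grid.copy()
capped[capped > 6] = 6
print(capped.max())  # 6
```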
We are then free to operate on these values as we wish. For example, we can compute some relevant statistics on our Seattle rain data:
# construct a mask of all rainy days
rainy = (inches > 0)
# construct a mask of all summer days (June 21st is the 172nd day)
days = np.arange(365)
summer = (days > 172) & (days < 262)
print("Median precip on rainy days in 2014 (inches):   ",
      np.median(inches[rainy]))
print("Median precip on summer days in 2014 (inches):  ",
      np.median(inches[summer]))
print("Maximum precip on summer days in 2014 (inches): ",
      np.max(inches[summer]))
print("Median precip on non-summer rainy days (inches):",
      np.median(inches[rainy & ~summer]))
By combining Boolean operations, masking operations, and aggregates, we can very quickly answer these sorts of questions for our dataset.
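The same pattern can be tried without the dataset. This is a minimal sketch on ten hypothetical daily values; sample_inches, window, and the chosen day range are assumptions made for illustration and do not correspond to real seasons or to the Seattle data.

```python
import numpy as np

# Ten hypothetical days of rainfall in inches (illustrative values only)
sample_inches = np.array([0.0, 0.2, 0.0, 1.1, 0.4, 0.0, 0.0, 0.7, 0.1, 0.0])
days = np.arange(sample_inches.size)

# Mask of days with any rain
rainy = sample_inches > 0

# Mask of a hypothetical "season" covering day indices 3 through 7
window = (days >= 3) & (days < 8)

median_rainy = np.median(sample_inches[rainy])
median_window = np.median(sample_inches[window])
max_window = np.max(sample_inches[window])
median_rainy_outside = np.median(sample_inches[rainy & ~window])

print(median_rainy, median_window, max_window, median_rainy_outside)
```

Masks built from different arrays (here the values and the day indices) can be combined freely as long as they share the same shape.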
Aside: Using the Keywords and/or Versus the Operators &/|
One common point of confusion is the difference between the keywords and and or on one hand, and the operators & and | on the other hand.
When would you use one versus the other?
The difference is this: and and or gauge the truth or falsehood of an entire object, while & and | refer to bits within each object.
When you use and or or, it’s equivalent to asking Python to treat the object as a single Boolean entity.
In Python, all nonzero integers will evaluate as True. Thus:
bool(42), bool(0)
When you use & and | on integers, the expression operates on the bits of the element, applying the and or the or to the individual bits making up the number:
bin(42)
bin(42 & 59)
Notice that the corresponding bits of the binary representation are compared in order to yield the result.
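The contrast between the keywords and the bitwise operators on plain integers can be laid out in one place. This sketch uses the same two integers, 42 and 59.

```python
# Binary representations of the two operands
print(bin(42))  # 0b101010
print(bin(59))  # 0b111011

# Bitwise operators combine the operands bit by bit
print(bin(42 & 59))  # 0b101010: bits set in both
print(bin(42 | 59))  # 0b111011: bits set in either

# The keywords treat each operand as a single truth value and
# return one of the operands unchanged
keyword_and = 42 and 59  # both truthy, so the last operand is returned
keyword_or = 42 or 59    # first operand is truthy, so it is returned
print(keyword_and, keyword_or)  # 59 42
```

That 42 & 59 equals 42 while 42 and 59 equals 59 is a compact reminder that the two forms answer different questions.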
When you have an array of Boolean values in NumPy, this can be thought of as a string of bits where 1 = True and 0 = False, and the result of & and | operates similarly to above:
A = np.array([1, 0, 1, 0, 1, 0], dtype=bool)
B = np.array([1, 1, 1, 0, 1, 1], dtype=bool)
A | B
Using or on these arrays will try to evaluate the truth or falsehood of the entire array object, which is not a well-defined value:
A or B
Similarly, when doing a Boolean expression on a given array, you should use | or & rather than or or and:
x = np.arange(10)
(x > 4) & (x < 8)
Trying to evaluate the truth or falsehood of the entire array will give the same ValueError we saw previously:
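The failure described here can be reproduced safely by catching the exception. This is a minimal sketch; the variable names elementwise and keyword_error are introduced for illustration.

```python
import numpy as np

x = np.arange(10)

# Element-wise form: one Boolean per entry
elementwise = (x > 4) & (x < 8)
print(x[elementwise])  # [5 6 7]

# Keyword form: Python asks the whole array for a single truth value
keyword_error = None
try:
    (x > 4) and (x < 8)
except ValueError as err:
    keyword_error = err

print(keyword_error)  # message about an ambiguous truth value
```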