Practice: DataHandle Solution
(Chinese title: data handling practice reference answers)
Pandas Data Cleaning
Data cleaning refers to the process of handling problematic data.
Many datasets contain issues such as missing values, incorrect formats, erroneous entries, or duplicate records. To ensure more accurate data analysis, these problematic data points must be addressed.
Common Steps in Data Cleaning and Preprocessing
1. Handling Missing Values
- Identify missing values.
- Fill missing values using appropriate methods (e.g., mean, median, mode).
- Remove rows or columns containing too many missing values.
2. Dealing with Duplicate Data
- Detect duplicate entries.
- Remove duplicates to ensure each data point is unique.
3. Handling Outliers
- Identify outliers using statistical methods or visualization.
- Decide whether to remove or transform extreme values.
4. Standardization and Normalization
- Standardization: Scale data using Z-score (mean = 0, standard deviation = 1).
- Normalization: Scale data to a fixed range, usually [0, 1], using Min-Max scaling.
5. Data Sampling
- Randomly extract samples for testing or training purposes.
- Apply oversampling or undersampling techniques to handle class imbalance.
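The oversampling step mentioned above is not covered by the code below, so here is a minimal sketch of random oversampling using only pandas, on a hypothetical imbalanced label column:

```python
import pandas as pd

# Hypothetical imbalanced dataset: four rows of class 0, one row of class 1.
df = pd.DataFrame({'feature': [1, 2, 3, 4, 5],
                   'label':   [0, 0, 0, 0, 1]})

majority = df[df['label'] == 0]
minority = df[df['label'] == 1]

# Random oversampling: draw from the minority class with replacement
# until it matches the majority class size.
oversampled = minority.sample(n=len(majority), replace=True, random_state=42)
balanced = pd.concat([majority, oversampled])

print(balanced['label'].value_counts())  # both classes now have 4 rows
```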
0. Import the pandas library in Python and use pd as an alias.
Read the dirty_data.csv file using pandas and store it in a DataFrame named df.
import pandas as pd  # hint: import ... as ...
df = pd.read_csv('dirty_data.csv') # pd.read_csv('name.csv')
print(df)

1. Handling Missing Values
Age Column: We choose to fill the missing values in the Age column with the median value of Age.
- Age is susceptible to extreme values (e.g., abnormally high values).
- Using the median is more robust than the mean.
- The median better reflects the typical age of most individuals.
Salary Column: We choose to fill the missing values in the Salary column with the mean value of Salary.
- Salary data tends to fluctuate within a reasonable range and carries meaningful economic implications.
- The mean can effectively represent the overall trend.
- Using the mean is a common practice when the proportion of missing data is low.
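Before filling, the missing values can be located and the candidate fill values inspected. A minimal sketch on a hypothetical toy frame (the real dirty_data.csv is not reproduced here):

```python
import pandas as pd
import numpy as np

# Hypothetical toy frame with gaps, standing in for dirty_data.csv.
df = pd.DataFrame({'Age': [25, np.nan, 40],
                   'Salary': [3000.0, 4500.0, np.nan]})

print(df.isna().sum())      # missing count per column
print(df['Age'].median())   # candidate fill value for Age → 32.5
print(df['Salary'].mean())  # candidate fill value for Salary → 3750.0
```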
# hint: df.fillna({col: value}, inplace=True)
df.fillna({'Age': df['Age'].median(), 'Salary': df['Salary'].mean()}, inplace=True)
print(df)

2. Dealing with Duplicate Data
Pay attention to row 1 and row 2: they are exact duplicates.
Remove entirely duplicate rows from the DataFrame, helping ensure data uniqueness and avoid bias or overrepresentation in analysis.
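Before removing anything, the duplicate rows can first be inspected with df.duplicated(). A minimal sketch on a hypothetical toy frame where the first two rows are identical:

```python
import pandas as pd

# Toy frame: the first two rows are exact duplicates.
df = pd.DataFrame({'Name': ['Ann', 'Ann', 'Bob'],
                   'Age':  [30, 30, 41]})

print(df.duplicated())        # marks the second copy of each row as True
print(df.duplicated().sum())  # number of fully duplicated rows → 1
```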
df.drop_duplicates(inplace=True)
print(df)

3. Handling Outliers
Exclude records with unreasonable or implausible values (see row 8).
Assume a valid Age is between 18 and 100 (inclusive) and a valid Salary is greater than 0.
Therefore, Age values of exactly 18 or 100 are retained, while Salary values less than or equal to 0 are excluded.
# Keep only rows where Age is between 18 and 100 (inclusive)
df = df[(df['Age'] >= 18) & (df['Age'] <= 100)]
# Keep only rows where Salary is greater than 0 (exclusive)
df = df[(df['Salary'] > 0)]
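Besides fixed thresholds, outliers can also be flagged statistically, as the overview mentioned. A minimal sketch using the common 1.5 * IQR rule on hypothetical salary values (it does not modify df above):

```python
import pandas as pd

# Hypothetical salaries containing one extreme value.
s = pd.Series([3000, 3200, 3100, 2900, 3050, 90000])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside [lower, upper] are flagged as outliers.
print(s[(s < lower) | (s > upper)])  # flags the 90000 entry
```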
print(df)

4. Standardization and Normalization
This ensures all values are on the same scale.
import matplotlib.pyplot as plt
df.boxplot(column=['Age', 'Score', 'Salary'], showfliers=False)
plt.show()

from sklearn.preprocessing import MinMaxScaler, StandardScaler
scaler = MinMaxScaler() #or scaler = StandardScaler()
df[['Score']] = scaler.fit_transform(df[['Score']])
df[['Age']] = scaler.fit_transform(df[['Age']])
df[['Salary']] = scaler.fit_transform(df[['Salary']])
print(df)

df.boxplot(column=['Age', 'Score', 'Salary'], showfliers=False)
plt.show()

🔍 Explanation:
- MinMaxScaler(): a normalization tool from sklearn.preprocessing that scales numerical values to a fixed range, typically [0, 1]. It transforms the data using the formula X_scaled = (X - X_min) / (X_max - X_min).
- StandardScaler(): a standardization tool that transforms features to have mean = 0 and standard deviation = 1. It uses the formula Z = (X - mean) / std.
- scaler.fit_transform(df[['Score']])
  - fit_transform() first computes the minimum and maximum values of the Score column.
  - It then scales all values so that the minimum becomes 0 and the maximum becomes 1.
- df[['Score']] = ... The normalized values are reassigned to the Score column, replacing the original data in the DataFrame.
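As a sanity check, the MinMaxScaler result can be reproduced by hand with the formula above. A minimal sketch using hypothetical Score values:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical Score values for illustration.
df = pd.DataFrame({'Score': [50.0, 70.0, 90.0]})

scaled = MinMaxScaler().fit_transform(df[['Score']])

# Manual version of the same formula: (X - X_min) / (X_max - X_min).
manual = (df['Score'] - df['Score'].min()) / (df['Score'].max() - df['Score'].min())

print(scaled.ravel())   # [0.  0.5 1. ]
print(manual.tolist())  # [0.0, 0.5, 1.0]
```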
5. Data Sampling (e.g., 80% sample for training)
Predictive modeling often requires splitting data into training and testing sets (e.g., 80% training, 20% testing).
df_train = df.sample(frac=0.8, random_state=42)
print(df_train)

df_test = df.drop(df_train.index)
print(df_test)

🔍 Explanation:
- df.sample(): randomly samples rows from a DataFrame.
- frac=0.8: indicates that 80% of the rows from the original DataFrame will be sampled.
- random_state=42: sets a random seed to ensure reproducibility. With a fixed seed (e.g., 42), the same rows will be sampled every time the code is executed, which is important for consistent results during testing or experiments.
- df_train = ...: the resulting subset is stored in a new DataFrame called df_train.
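As an alternative to the sample()/drop() pattern, scikit-learn's train_test_split performs the same 80/20 split in one call. A minimal sketch on a hypothetical 10-row frame:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with 10 rows, standing in for the cleaned df.
df = pd.DataFrame({'x': range(10)})

# 80/20 split with a fixed seed, mirroring frac=0.8 / random_state=42 above.
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)

print(len(df_train), len(df_test))  # 8 2
```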