Practice: DataHandle Solution
(Chinese title: data handling practice reference answers)
Pandas Data Cleaning
Data cleaning refers to the process of handling problematic data.
Many datasets contain issues such as missing values, incorrect formats, erroneous entries, or duplicate records. To ensure more accurate data analysis, these problematic data points must be addressed.
Common Steps in Data Cleaning and Preprocessing
1. Handling Missing Values
- Identify missing values.
- Fill missing values using appropriate methods (e.g., mean, median, mode).
- Remove rows or columns containing too many missing values.
2. Dealing with Duplicate Data
- Detect duplicate entries.
- Remove duplicates to ensure each data point is unique.
3. Handling Outliers
- Identify outliers using statistical methods or visualization.
- Decide whether to remove or transform extreme values.
4. Standardization and Normalization
- Standardization: Scale data using Z-score (mean = 0, standard deviation = 1).
- Normalization: Scale data to a fixed range, usually [0, 1], using Min-Max scaling.
5. Data Sampling
- Randomly extract samples for testing or training purposes.
- Apply oversampling or undersampling techniques to handle class imbalance.
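The oversampling step mentioned above is not covered by the code below, so here is a minimal sketch of random oversampling using only pandas, on a hypothetical imbalanced label column:

```python
import pandas as pd

# Hypothetical imbalanced dataset: four rows of class 0, one row of class 1.
df = pd.DataFrame({'feature': [1, 2, 3, 4, 5],
                   'label':   [0, 0, 0, 0, 1]})

majority = df[df['label'] == 0]
minority = df[df['label'] == 1]

# Random oversampling: draw from the minority class with replacement
# until it matches the majority class size.
oversampled = minority.sample(n=len(majority), replace=True, random_state=42)
balanced = pd.concat([majority, oversampled])

print(balanced['label'].value_counts())  # both classes now have 4 rows
```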
0. Import the pandas library in Python and use pd as an alias.
Read the dirty_data.csv file using pandas and store it in a DataFrame named df.
import pandas as pd  # hint: import ... as ...
df = pd.read_csv('dirty_data.csv') # pd.read_csv('name.csv')
print(df)

1. Handling Missing Values
Age Column: We choose to fill the missing values in the Age column with the median value of Age.
- Age is susceptible to extreme values (e.g., abnormally high values).
- Using the median is more robust than the mean.
- The median better reflects the typical age of most individuals.
Salary Column: We choose to fill the missing values in the Salary column with the mean value of Salary.
- Salary data tends to fluctuate within a reasonable range and carries meaningful economic implications.
- The mean can effectively represent the overall trend.
- Using the mean is a common practice when the proportion of missing data is low.
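Before filling, the missing values can be located and the candidate fill values inspected. A minimal sketch on a hypothetical toy frame (the real dirty_data.csv is not reproduced here):

```python
import pandas as pd
import numpy as np

# Hypothetical toy frame with gaps, standing in for dirty_data.csv.
df = pd.DataFrame({'Age': [25, np.nan, 40],
                   'Salary': [3000.0, 4500.0, np.nan]})

print(df.isna().sum())      # missing count per column
print(df['Age'].median())   # candidate fill value for Age → 32.5
print(df['Salary'].mean())  # candidate fill value for Salary → 3750.0
```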
# hint: df.fillna({col: value}, inplace=True)
df.fillna({'Age': df['Age'].median(), 'Salary': df['Salary'].mean()}, inplace=True)
print(df)

2. Dealing with Duplicate Data
Pay attention to row 1 and row 2: they are exact duplicates.
Remove entirely duplicate rows from the DataFrame, helping ensure data uniqueness and avoid bias or overrepresentation in analysis.
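Before removing anything, the duplicate rows can first be inspected with df.duplicated(). A minimal sketch on a hypothetical toy frame where the first two rows are identical:

```python
import pandas as pd

# Toy frame: the first two rows are exact duplicates.
df = pd.DataFrame({'Name': ['Ann', 'Ann', 'Bob'],
                   'Age':  [30, 30, 41]})

print(df.duplicated())        # marks the second copy of each row as True
print(df.duplicated().sum())  # number of fully duplicated rows → 1
```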
df.drop_duplicates(inplace=True)
print(df)

3. Handling Outliers
Exclude records with unreasonable or implausible values (see row 8).
Assume a valid Age is between 18 and 100 (inclusive) and a valid Salary is greater than 0.
Therefore, Age values of exactly 18 or 100 are retained, while Salary values less than or equal to 0 are excluded.
# Keep only rows where Age is between 18 and 100 (inclusive)
df = df[(df['Age'] >= 18) & (df['Age'] <= 100)]
# Keep only rows where Salary is greater than 0 (exclusive)
df = df[(df['Salary'] > 0)]
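Besides fixed thresholds, outliers can also be flagged statistically, as the overview mentioned. A minimal sketch using the common 1.5 * IQR rule on hypothetical salary values (it does not modify df above):

```python
import pandas as pd

# Hypothetical salaries containing one extreme value.
s = pd.Series([3000, 3200, 3100, 2900, 3050, 90000])

q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Values outside [lower, upper] are flagged as outliers.
print(s[(s < lower) | (s > upper)])  # flags the 90000 entry
```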
print(df)

4. Standardization and Normalization
This ensures all values are on the same scale.
import matplotlib.pyplot as plt
df.boxplot(column=['Age', 'Score', 'Salary'], showfliers=False)
plt.show()

from sklearn.preprocessing import MinMaxScaler, StandardScaler
scaler = MinMaxScaler() #or scaler = StandardScaler()
df[['Score']] = scaler.fit_transform(df[['Score']])
df[['Age']] = scaler.fit_transform(df[['Age']])
df[['Salary']] = scaler.fit_transform(df[['Salary']])
print(df)

df.boxplot(column=['Age', 'Score', 'Salary'], showfliers=False)
plt.show()

🔍 Explanation:
- MinMaxScaler(): a normalization tool from sklearn.preprocessing that scales numerical values to a fixed range, typically [0, 1]. It transforms the data using the formula X_scaled = (X - X_min) / (X_max - X_min).
- StandardScaler(): a standardization tool that transforms features to have mean = 0 and standard deviation = 1. It uses the formula Z = (X - mean) / std.
- scaler.fit_transform(df[['Score']])
  - fit_transform() first computes the minimum and maximum values of the Score column.
  - It then scales all values so that the minimum becomes 0 and the maximum becomes 1.
- df[['Score']] = ... The normalized values are reassigned to the Score column, replacing the original data in the DataFrame.
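As a sanity check, the MinMaxScaler result can be reproduced by hand with the formula above. A minimal sketch using hypothetical Score values:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical Score values for illustration.
df = pd.DataFrame({'Score': [50.0, 70.0, 90.0]})

scaled = MinMaxScaler().fit_transform(df[['Score']])

# Manual version of the same formula: (X - X_min) / (X_max - X_min).
manual = (df['Score'] - df['Score'].min()) / (df['Score'].max() - df['Score'].min())

print(scaled.ravel())   # [0.  0.5 1. ]
print(manual.tolist())  # [0.0, 0.5, 1.0]
```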
5. Data Sampling (e.g., 80% sample for training)
Predictive modeling often requires splitting data into training and testing sets (e.g., 80% training, 20% testing).
df_train = df.sample(frac=0.8, random_state=42)
print(df_train)

df_test = df.drop(df_train.index)
print(df_test)

🔍 Explanation:
- df.sample(): randomly samples rows from a DataFrame.
- frac=0.8: indicates that 80% of the rows from the original DataFrame will be sampled.
- random_state=42: sets a random seed to ensure reproducibility. With a fixed seed (e.g., 42), the same rows will be sampled every time the code is executed, which is important for consistent results during testing or experiments.
- df_train = ...: the resulting subset is stored in a new DataFrame called df_train.
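As an alternative to the sample()/drop() pattern, scikit-learn's train_test_split performs the same 80/20 split in one call. A minimal sketch on a hypothetical 10-row frame:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame with 10 rows, standing in for the cleaned df.
df = pd.DataFrame({'x': range(10)})

# 80/20 split with a fixed seed, mirroring frac=0.8 / random_state=42 above.
df_train, df_test = train_test_split(df, test_size=0.2, random_state=42)

print(len(df_train), len(df_test))  # 8 2
```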