Analysis of Big Data
This lecture introduces the four types of data analytics — Descriptive, Diagnostic, Predictive, and Prescriptive — with hands-on Python examples.
Types of Data Analytics
| Type | Focus | Question Answered |
|---|---|---|
| Descriptive | Insight into the past | What has happened? |
| Diagnostic | Understanding causes | Why did it happen? |
| Predictive | Forecasting the future | What will happen? |
| Prescriptive | Decision support | What should we do? |
Descriptive Analytics
Descriptive analytics analyzes past data and presents it in a summarized form that is easy to interpret. Much of the analytics done today is descriptive, relying on statistical functions such as count, maximum, minimum, mean, median, mode, and percentage.
Use Cases
- Computing total likes for a social media post
- Analyzing comments to understand user attitudes
- Business reports of revenue, expenses, cash flow, inventory, and production
- Identifying trends in customer preference and behavior
- Analyzing past electricity usage to set optimal charges and categorize consumers
Three Types of Descriptive Analytics
| Type | Measures | Visualization |
|---|---|---|
| Frequency | Counting | Histogram |
| Shape (Central Tendency) | Mean, median, mode | Box plot |
| Dispersion (Variability) | Variance, standard deviation, outliers | Histogram and box plot |
Example: Salary Data Analysis
This example uses a .csv dataset containing employee salary, years of experience, age, and gender.
Frequency Questions
- Total number and proportion of male and female employees
- Average salary by gender
- Employee count in age groups (21–30, 31–40, 41–50, 51–60)
- Employees with more than 5 years of experience
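The frequency questions above could be answered with a few pandas calls. The sketch below is illustrative only: it builds a tiny made-up DataFrame instead of reading the lecture's CSV, and it assumes the column names used elsewhere in this lecture (`salary`, `years experience`, `age`, `gender`) and the gender labels `Male` and `Female`.

```python
import pandas as pd

# Hypothetical sample rows; the real dataset is not reproduced here.
data = pd.DataFrame({
    'gender': ['Male', 'Female', 'Female', 'Male', 'Female', 'Male'],
    'age': [25, 34, 45, 52, 29, 38],
    'years experience': [3, 8, 15, 25, 6, 10],
    'salary': [40000, 60000, 80000, 100000, 50000, 70000],
})

# Total number and proportion of employees by gender
gender_counts = data['gender'].value_counts()
gender_share = data['gender'].value_counts(normalize=True)

# Average salary by gender
mean_salary_by_gender = data.groupby('gender')['salary'].mean()

# Employee count per age group.
# pd.cut bins are right-inclusive by default, so (20, 30] covers ages 21 to 30.
age_bins = [20, 30, 40, 50, 60]
age_labels = ['21-30', '31-40', '41-50', '51-60']
age_groups = pd.cut(data['age'], bins=age_bins, labels=age_labels)
age_group_counts = age_groups.value_counts(sort=False)

# Employees with more than 5 years of experience
experienced_count = int((data['years experience'] > 5).sum())

print(gender_counts)
print(gender_share)
print(mean_salary_by_gender)
print(age_group_counts)
print('More than 5 years of experience:', experienced_count)
```

Swapping the made-up DataFrame for the result of `pd.read_csv` should leave the remaining lines unchanged, provided the column names match.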
Distribution/Shape Questions
- Mean salary, years of experience, and age
- Median, minimum, and maximum values
- Salary distribution against years of experience and age
- Salary profile by gender
Python Examples
import pandas as pd
data = pd.read_csv("/Desktop/salary_data.csv")
# Mean values
mean_salary = data['salary'].mean()
mean_experience = data['years experience'].mean()
mean_age = data['age'].mean()
print('Mean Salary:', mean_salary)
print('Mean Years of Experience:', mean_experience)
print('Mean Age:', mean_age)

# Median values
median_salary = data['salary'].median()
median_experience = data['years experience'].median()
median_age = data['age'].median()

# Min / Max values
min_salary = data['salary'].min()
max_salary = data['salary'].max()

# Visualization: Salary vs Years of Experience
import matplotlib
matplotlib.use('TkAgg')
import matplotlib.pyplot as plt
plt.figure(1)
plt.plot(data['years experience'], data['salary'] / 1E3, 'ro', linewidth=1)
plt.xlabel('Years of Experience')
plt.ylabel('Salary in Thousands (RMB)')
plt.grid(True)
plt.title('Salary Distribution against Years of Experience')
plt.show()

# Visualization: Salary vs Age
plt.figure(2)
plt.plot(data['age'], data['salary'] / 1E3, 'bo', linewidth=1)
plt.xlabel('Age')
plt.ylabel('Salary in Thousands (RMB)')
plt.grid(True)
plt.title('Salary Distribution against Age')
plt.show()

# Visualization: Salary by Gender
import seaborn as sns
plt.figure(3)
plt.scatter(data['gender'], data['salary'] / 1E3, c='blue', alpha=0.5)
plt.xlabel('Gender')
plt.ylabel('Salary in Thousands (RMB)')
plt.grid(True)
plt.title('Salary Distribution by Gender')
plt.show()
Variability of Data
Standard deviation measures how widely the data are spread around the mean. It is useful in further statistical analysis, such as hypothesis testing.
Example:
The lists 28, 29, 30, 31, 32 and 10, 20, 30, 40, 50 both have mean = 30, but the second list is much more spread out.
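The comparison of the two lists can be checked with Python's standard `statistics` module. This is a minimal sketch; the variable names `narrow` and `wide` are illustrative, and it uses the population standard deviation (`pstdev`), which is an assumption since the lecture does not say whether the population or sample formula is intended.

```python
import statistics

narrow = [28, 29, 30, 31, 32]
wide = [10, 20, 30, 40, 50]

for name, values in (('narrow', narrow), ('wide', wide)):
    mean = statistics.mean(values)
    # pstdev treats the list as the whole population;
    # statistics.stdev would apply the sample formula (divide by n - 1).
    spread = statistics.pstdev(values)
    print(f'{name}: mean = {mean}, population standard deviation = {spread:.2f}')
```

Both lists have the same mean, but the standard deviation of the second list is ten times that of the first (about 14.14 against about 1.41).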
Diagnostic Analytics
Diagnostic analytics is the process of using data to determine the causes of trends and correlations between variables. It is a logical next step after descriptive analytics and can be viewed as a form of root-cause analysis.
Key Characteristics
- Enables understanding of what is happening and why it happened
- Can be done manually, with algorithms, or using statistical software (MATLAB, R, Excel)
- Provides actionable insights to decision-makers
- Investigates factors that contributed to a certain outcome
Example: Salary Gap Analysis
Descriptive analytics showed that the average male salary is higher than the average female salary, and that the average female salary is below the overall average.
Question: Why is the mean salary of female employees lower?
- The histogram shows that the male salary distribution lies further to the right (toward higher salaries) than the female distribution.
Correlation Analysis
Correlation analysis reveals relationships among variables (years of experience, salary, age, gender).
- Correlation coefficient (r): describes how closely related two characteristics are
- Values close to 1 or −1 indicate a strong linear relationship
- Values close to 0 indicate a weak linear relationship
- Positive values: greater values of one variable associated with greater values of the other
- Negative values: greater values of one variable associated with lesser values of the other
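To make the meaning of r concrete, the following sketch computes the Pearson correlation coefficient by hand for two small made-up series. The function name `pearson_r` and all data values are hypothetical; in practice `DataFrame.corr()` or `numpy.corrcoef` would normally be used instead.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError('need two sequences of equal length with at least 2 values')
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of products of deviations, and the two sums of squared deviations
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    s_yy = sum((y - mean_y) ** 2 for y in ys)
    # Note: a constant series makes s_xx or s_yy zero, and r is then undefined.
    return s_xy / math.sqrt(s_xx * s_yy)

# Hypothetical values: years of experience and salary in thousands
experience = [1, 2, 3, 4, 5]
salary_up = [50, 55, 65, 70, 80]    # rises with experience
salary_down = [80, 70, 65, 55, 50]  # falls with experience

r_up = pearson_r(experience, salary_up)
r_down = pearson_r(experience, salary_down)
print(f'r (rising salary)  = {r_up:.3f}')
print(f'r (falling salary) = {r_down:.3f}')
```

The first pair gives r close to 1 (a strong positive linear relationship), and the second gives the same magnitude with a negative sign.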
Python Example: Correlation Matrix
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
data = pd.read_csv("/Desktop/salary_data.csv")
# Correlation between Salary and Years of Experience
correlation_matrix = data[['salary', 'years experience']].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix between Salary and Years of Experience')
plt.show()

# Correlation for male employees only
male_data = data[data['gender'] == 'Male']
correlation_matrix = np.corrcoef(male_data['salary'], male_data['years experience'])
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm',
            xticklabels=['Salary', 'Years of Experience'],
            yticklabels=['Salary', 'Years of Experience'])
plt.title("Correlation Analysis for Male Employees")
plt.show()
Findings
| Age Group | Avg Male Experience (years) | Avg Female Experience (years) | Avg Male Salary (RMB) | Avg Female Salary (RMB) |
|---|---|---|---|---|
| 21–30 | 7 | 8 | 48,355 | 46,809 |
| 31–40 | 8 | 9 | 74,013 | 65,503 |
In both age groups, female employees have more experience on average, yet their average salary is lower. Without additional data (such as education or skills) to explain the gap, the higher male salary may indicate bias or subjective judgement.
Predictive Analytics
Predictive analytics uses historical data to forecast future outcomes.
Example: Salary Prediction for Experienced Hires
A company wants to hire consultants with more than 15 years of experience. How much should they pay?
Start from the plot of salary against years of experience and apply regression analysis to estimate the salary at a given level of experience.
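As a rough illustration of the regression step, the sketch below fits a simple linear model by ordinary least squares and reports R². The data values and the helper names `fit_simple_linear` and `r_squared` are hypothetical and are not taken from the lecture's dataset; a library such as scikit-learn or NumPy would usually do this work.

```python
def fit_simple_linear(xs, ys):
    """Ordinary least squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    s_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    s_xx = sum((x - mean_x) ** 2 for x in xs)
    slope = s_xy / s_xx
    intercept = mean_y - slope * mean_x
    return slope, intercept

def r_squared(xs, ys, slope, intercept):
    """Share of the variance in ys explained by the fitted line."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

# Hypothetical data: years of experience and salary in thousands
experience = [1, 2, 3, 4, 5]
salary = [50, 55, 65, 70, 80]

slope, intercept = fit_simple_linear(experience, salary)
r2 = r_squared(experience, salary, slope, intercept)
predicted = slope * 6 + intercept  # estimate for 6 years of experience

print(f'salary = {slope:.2f} * experience + {intercept:.2f}')
print(f'R^2 = {r2:.4f}')
print(f'Predicted salary at 6 years: {predicted:.1f} thousand')
```

The same fitted line can be evaluated at any experience level, although predictions far outside the observed range should be treated with caution.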
Regression Types
Regression
├── Linear
│   ├── Simple linear
│   └── Multiple linear
└── Nonlinear
Model Comparison
| Model | Equation | R² |
|---|---|---|
| Linear (order 1) | y = 4677.7x + 42799 | 0.7555 |
| Nonlinear (order 2) | y = −197.31x² + 7395.4x + 35904 | 0.7692 |
| Nonlinear (order 3) | y = 81.613x³ − 1929.7x² + 17798x + 19818 | 0.8016 |
| Nonlinear (order 4) | y = 15.512x⁴ − 362.5x³ + 2270x² + 3067.3x + 34930 | 0.8113 |
Warning
While the quartic model has the highest R², a linear or quadratic model is usually preferred in practice for representing salary against experience. Higher-order polynomials tend to fit noise in the sample and extrapolate poorly, so the simpler models generalize better.
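One way to see the risk is to evaluate the four fitted equations from the model comparison table at the same point. The sketch below copies the coefficients from the table and assumes x is years of experience and y is salary; the query point of 20 years is hypothetical, and because the lecture does not state the range of the fitted data, the numbers are arithmetic on the published coefficients rather than validated predictions.

```python
# Coefficients copied from the model comparison table, highest power first.
models = {
    'order 1': [4677.7, 42799],
    'order 2': [-197.31, 7395.4, 35904],
    'order 3': [81.613, -1929.7, 17798, 19818],
    'order 4': [15.512, -362.5, 2270, 3067.3, 34930],
}

def evaluate(coefficients, x):
    """Evaluate a polynomial at x using Horner's method."""
    result = 0.0
    for coefficient in coefficients:
        result = result * x + coefficient
    return result

years = 20  # hypothetical query point, not taken from the lecture
predictions = {name: evaluate(coeffs, years) for name, coeffs in models.items()}

for name, value in predictions.items():
    print(f'{name}: {value:,.0f}')
```

At 20 years the four equations disagree widely, from roughly 105 thousand for the quadratic to roughly 586 thousand for the quartic, even though their R² values on the fitted data differ by less than 0.06.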
Classification
Classification is another predictive method: it assigns a category to a given set of data.
| Aspect | Classification | Regression |
|---|---|---|
| Output | Category / class | Continuous numerical value |
| Examples | Predicting illness, fraud detection, face classifier | Assessing house price, forecasting demand, temperature forecasting |
The scikit-learn library is commonly used for predictive data analysis, usually alongside NumPy, SciPy, and Matplotlib.
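# Label encoding example
from sklearn.preprocessing import LabelEncoder

# Converts categorical labels (e.g., blue, green) into numerical values (0, 1)
encoder = LabelEncoder()
encoded = encoder.fit_transform(['blue', 'green', 'blue'])  # array([0, 1, 0])
Prescriptive Analytics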
Prescriptive analytics provides decision support so that organizations can act on analysis outcomes. It goes beyond predicting future outcomes by recommending actions that take advantage of those predictions.
Key Features
- Optimizes business outcomes by combining mathematical models, machine learning algorithms, and historical data
- Anticipates what will happen, when it will happen, and why
- Implemented using simulation and optimization
- Guides users on how different actions will affect business and suggests the optimal choice
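The idea of comparing candidate actions and suggesting the best one can be sketched with a toy pricing problem. Everything here is an assumption made for illustration: the linear demand curve, the unit cost, the candidate prices, and the function name `expected_profit`. A real system would take the demand prediction from a fitted model and would typically use a proper optimization or simulation tool.

```python
def expected_profit(price, unit_cost=10.0):
    """Profit under a hypothetical linear demand curve (made up for illustration)."""
    demand = max(0.0, 1000.0 - 20.0 * price)  # predicted units sold at this price
    return (price - unit_cost) * demand

# Candidate actions: prices from 10 to 50 in steps of 5
candidate_prices = range(10, 51, 5)

# Evaluate every action and recommend the one with the highest expected profit
best_price = max(candidate_prices, key=expected_profit)
best_profit = expected_profit(best_price)

for price in candidate_prices:
    print(f'price {price}: expected profit {expected_profit(price):,.0f}')
print(f'Recommended price: {best_price} (expected profit {best_profit:,.0f})')
```

The exhaustive search works because there are only a few candidate actions; with many decision variables, a dedicated optimizer would be the more practical choice.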
Applications
Pricing, production planning, marketing, financial planning, and supply chain optimization.
Example: Asset Performance Management
- Foundation: Time-series data consolidated by a historian tool
- Predictive analytics: Creates indicators and alerts that forecast risks or suboptimal performance
- Prescriptive actions: Pre-defined actions triggered by alerts enable problem-solving before failure impacts operations
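The three layers above can be sketched as a small pipeline: time-series readings, a forecast that raises an alert, and a pre-defined action per alert level. The readings, thresholds, action names, and the naive trend forecast are all hypothetical; a real asset performance system would use the historian's data and a properly validated predictive model.

```python
# Hypothetical mapping from alert level to a pre-defined action
ACTIONS = {
    'normal': 'no action',
    'warning': 'schedule inspection',
    'critical': 'shut down and repair',
}

def forecast_next(readings, window=3):
    """Naive forecast: last reading plus the average step over the last `window` values.

    Requires at least two readings and window >= 2.
    """
    recent = readings[-window:]
    steps = [later - earlier for earlier, later in zip(recent, recent[1:])]
    return recent[-1] + sum(steps) / len(steps)

def alert_level(forecast, warning_at=88.0, critical_at=95.0):
    """Turn a forecast value into an alert level using made-up thresholds."""
    if forecast >= critical_at:
        return 'critical'
    if forecast >= warning_at:
        return 'warning'
    return 'normal'

# Hypothetical temperature readings consolidated from a historian tool
readings = [70, 71, 73, 76, 80, 85]

forecast = forecast_next(readings)
level = alert_level(forecast)
action = ACTIONS[level]
print(f'Forecast: {forecast:.1f}, alert: {level}, action: {action}')
```

Because the action is chosen from the forecast rather than from the latest reading, the alert can fire before the threshold is actually crossed.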
Summary
| Analytics Type | Purpose | Methods |
|---|---|---|
| Descriptive | Summarize past data | Statistics, histograms, box plots |
| Diagnostic | Find root causes | Correlation analysis, hypothesis testing |
| Predictive | Forecast future events | Regression, classification, machine learning |
| Prescriptive | Recommend optimal actions | Simulation, optimization |
Related Concepts
- Descriptive Analytics — Statistics and visualization techniques
- Correlation Analysis — Measuring relationships between variables
- Regression Analysis — Linear and nonlinear forecasting
- Classification — Supervised learning for category prediction