AITC Wiki

Analysis of Big Data

This lecture introduces the four types of data analytics (descriptive, diagnostic, predictive, and prescriptive) with hands-on Python examples.

Types of Data Analytics

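| Type         | Focus                  | Question Answered  |
|--------------|------------------------|--------------------|
| Descriptive  | Insight into the past  | What has happened? |
| Diagnostic   | Understanding causes   | Why did it happen? |
| Predictive   | Forecasting the future | What will happen?  |
| Prescriptive | Decision support       | What should we do? |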

Descriptive Analytics

Descriptive analytics summarizes past data in a form that is easy to interpret. Much of the analytics done today is descriptive, relying on statistical functions such as counts, maximum, minimum, mean, median, mode, and percentages.

Use Cases

  • Computing total likes for a social media post
  • Analyzing comments to understand user attitudes
  • Business reports of revenue, expenses, cash flow, inventory, and production
  • Identifying trends in customer preference and behavior
  • Analyzing past electricity usage to set optimal charges and categorize consumers

Three Types of Descriptive Analytics

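| Type                       | Measures                               | Visualization          |
|----------------------------|----------------------------------------|------------------------|
| Frequency                  | Counting                               | Histogram              |
| Shape (Central Tendency)   | Mean, median, mode                     | Box plot               |
| Dispersion (Variability)   | Variance, standard deviation, outliers | Histogram and box plot |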

Example: Salary Data Analysis

This example uses a .csv dataset containing each employee's salary, years of experience, age, and gender.

Frequency Questions

  • Total number and proportion of male and female employees
  • Average salary by gender
  • Employee count in age groups (21–30, 31–40, 41–50, 51–60)
  • Employees with more than 5 years of experience
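
The frequency questions above map onto a handful of pandas operations. The sketch below is a minimal illustration, not the lecture's own solution: the six rows are hypothetical, and only the column names ('salary', 'years experience', 'age', 'gender') are taken from the dataset described here.

```python
import pandas as pd

# Hypothetical sample; the column names follow the lecture's dataset.
data = pd.DataFrame({
    "gender": ["Male", "Female", "Female", "Male", "Female", "Male"],
    "age": [25, 34, 29, 45, 52, 38],
    "years experience": [3, 9, 6, 20, 25, 12],
    "salary": [45000, 66000, 47000, 90000, 98000, 75000],
})

# Total number and proportion of male and female employees
gender_counts = data["gender"].value_counts()
gender_share = data["gender"].value_counts(normalize=True)

# Average salary by gender
salary_by_gender = data.groupby("gender")["salary"].mean()

# Employee count in the age groups 21-30, 31-40, 41-50, 51-60
age_bins = [20, 30, 40, 50, 60]  # right-inclusive edges: (20, 30], (30, 40], ...
age_labels = ["21-30", "31-40", "41-50", "51-60"]
age_group = pd.cut(data["age"], bins=age_bins, labels=age_labels)
age_group_counts = age_group.value_counts().sort_index()

# Employees with more than 5 years of experience
experienced_count = int((data["years experience"] > 5).sum())

print(gender_counts, gender_share, salary_by_gender, age_group_counts, sep="\n\n")
print("More than 5 years of experience:", experienced_count)
```

With the real file, the DataFrame literal would be replaced by the result of pd.read_csv; the remaining lines should work unchanged as long as the column names match.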

Distribution/Shape Questions

  • Mean salary, years of experience, and age
  • Median, minimum, and maximum values
  • Salary distribution against years of experience and age
  • Salary profile by gender

Python Examples

import pandas as pd
import matplotlib
matplotlib.use('TkAgg')  # select the backend before importing pyplot
import matplotlib.pyplot as plt

data = pd.read_csv("/Desktop/salary_data.csv")

# Mean values
mean_salary = data['salary'].mean()
mean_experience = data['years experience'].mean()
mean_age = data['age'].mean()
print('Mean Salary:', mean_salary)
print('Mean Years of Experience:', mean_experience)
print('Mean Age:', mean_age)

# Median values
median_salary = data['salary'].median()
median_experience = data['years experience'].median()
median_age = data['age'].median()
print('Median Salary:', median_salary)
print('Median Years of Experience:', median_experience)
print('Median Age:', median_age)

# Min / Max values
min_salary = data['salary'].min()
max_salary = data['salary'].max()
print('Salary Range:', min_salary, 'to', max_salary)

# Visualization: Salary vs Years of Experience (red markers, no connecting line)
plt.figure(1)
plt.plot(data['years experience'], data['salary'] / 1E3, 'ro')
plt.xlabel('Years of Experience')
plt.ylabel('Salary in Thousands (RMB)')
plt.grid(True)
plt.title('Salary Distribution against Years of Experience')
plt.show()

# Visualization: Salary vs Age (blue markers)
plt.figure(2)
plt.plot(data['age'], data['salary'] / 1E3, 'bo')
plt.xlabel('Age')
plt.ylabel('Salary in Thousands (RMB)')
plt.grid(True)
plt.title('Salary Distribution against Age')
plt.show()

# Visualization: Salary by Gender (one column of points per category)
plt.figure(3)
plt.scatter(data['gender'], data['salary'] / 1E3, c='blue', alpha=0.5)
plt.xlabel('Gender')
plt.ylabel('Salary in Thousands (RMB)')
plt.grid(True)
plt.title('Salary Distribution by Gender')
plt.show()

Variability of Data

Standard deviation measures the degree of spread in data around the mean. It is also an input to later analyses such as hypothesis testing.

Example: 28, 29, 30, 31, 32 and 10, 20, 30, 40, 50 both have mean = 30, but the second list is much more spread out.
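
The two lists can be compared directly in code. This short sketch uses NumPy with ddof=1, which is an assumption that the sample standard deviation (dividing by n − 1) is wanted; with that choice the spread is about 1.58 for the first list and about 15.81 for the second.

```python
import numpy as np

narrow = np.array([28, 29, 30, 31, 32])
wide = np.array([10, 20, 30, 40, 50])

# ddof=1 gives the sample standard deviation (divide by n - 1)
narrow_std = narrow.std(ddof=1)
wide_std = wide.std(ddof=1)

print("Means:", narrow.mean(), wide.mean())
print("Sample standard deviations:", round(narrow_std, 2), round(wide_std, 2))
```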


Diagnostic Analytics

Diagnostic analytics is the process of using data to determine the causes of trends and correlations between variables. It is a logical next step after descriptive analytics and can be viewed as a form of root-cause analysis.

Key Characteristics

  • Enables understanding of what is happening and why it happened
  • Can be done manually, with algorithms, or using statistical software (MATLAB, R, Excel)
  • Provides actionable insights to decision-makers
  • Investigates factors that contributed to a certain outcome

Example: Salary Gap Analysis

Descriptive analytics shows that the average male salary is higher than the average female salary, and that the average female salary is below the overall average.

Question: Why is the mean salary of female employees lower?

  • A histogram shows that the male salary distribution is shifted to the right (toward higher salaries) relative to the female distribution.

Correlation Analysis

Correlation analysis reveals relationships among variables (Years of Experience, Salary, Age, Gender).

  • Correlation coefficient (r): describes how closely related two characteristics are
  • Values close to 1 or −1 indicate a strong linear relationship
  • Values close to 0 indicate a weak linear relationship
  • Positive values: greater values of one variable associated with greater values of the other
  • Negative values: greater values of one variable associated with lesser values of the other
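
The properties listed above can be checked by computing r directly. The sketch below implements the Pearson correlation coefficient by hand and compares it with NumPy's np.corrcoef. All values, including the days_absent variable, are hypothetical and were chosen only to produce one positive and one negative relationship.

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    return float((dx * dy).sum() / np.sqrt((dx ** 2).sum() * (dy ** 2).sum()))

# Hypothetical values, chosen only to illustrate the sign and size of r
years = [1, 3, 5, 7, 9]
salary = [40, 48, 55, 66, 71]   # rises with experience
days_absent = [9, 8, 6, 3, 2]   # falls with experience

r_positive = pearson_r(years, salary)
r_negative = pearson_r(years, days_absent)

print("r(years, salary):", round(r_positive, 3))
print("r(years, days_absent):", round(r_negative, 3))

# The manual formula agrees with NumPy's built-in
print(np.isclose(r_positive, np.corrcoef(years, salary)[0, 1]))
```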

Python Example: Correlation Matrix

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
 
data = pd.read_csv("/Desktop/salary_data.csv")
 
# Correlation between Salary and Years of Experience
correlation_matrix = data[['salary', 'years experience']].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix between Salary and Years of Experience')
plt.show()
# Correlation for male employees only
male_data = data[data['gender'] == 'Male']
correlation_matrix = np.corrcoef(male_data['salary'], male_data['years experience'])
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm',
            xticklabels=['Salary', 'Years of Experience'],
            yticklabels=['Salary', 'Years of Experience'])
plt.title("Correlation Analysis for Male Employees")
plt.show()

Findings

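| Age Group | Avg Male Experience (years) | Avg Female Experience (years) | Avg Male Salary (RMB) | Avg Female Salary (RMB) |
|-----------|-----------------------------|-------------------------------|-----------------------|-------------------------|
| 21–30     | 7                           | 8                             | 48,355                | 46,809                  |
| 31–40     | 8                           | 9                             | 74,013                | 65,503                  |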

In both age groups, female employees have more experience on average, yet their average salary is lower. Without additional data (such as education or skills) the cause cannot be confirmed, but the higher male salary may indicate bias or subjective judgement.
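
A breakdown like the one in the findings can be produced with a grouped aggregation. The following is a sketch under stated assumptions: the eight rows are hypothetical and do not reproduce the lecture's figures, and the 'age group' column is a helper introduced here for illustration.

```python
import pandas as pd

# Hypothetical rows; column names follow the lecture's dataset.
data = pd.DataFrame({
    "gender": ["Male", "Female", "Male", "Female", "Male", "Female", "Male", "Female"],
    "age": [24, 27, 29, 30, 33, 36, 38, 40],
    "years experience": [2, 4, 6, 7, 8, 10, 9, 11],
    "salary": [42000, 43000, 52000, 50000, 70000, 64000, 78000, 68000],
})

# Helper column (an assumption of this sketch): bucket ages into the two groups
data["age group"] = pd.cut(data["age"], bins=[20, 30, 40], labels=["21-30", "31-40"])

# Average experience and salary for each age group, split by gender
summary = (
    data.groupby(["age group", "gender"], observed=True)[["years experience", "salary"]]
    .mean()
    .unstack("gender")
)
print(summary)
```

Each row of summary is one age group, and the columns pair a measure with a gender, so experience and salary can be read side by side.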


Predictive Analytics

Predictive analytics uses historical data to forecast future outcomes.

Example: Salary Prediction for Experienced Hires

A company wants to hire consultants with more than 15 years of experience. How much should they pay?

Start from the plot of salary against years of experience and fit a regression model to it.

Regression Types

Regression
├── Linear
│   ├── Simple linear
│   └── Multiple linear
└── Nonlinear

Model Comparison

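| Model               | Equation                                               | R²     |
|---------------------|--------------------------------------------------------|--------|
| Linear (order 1)    | y = 4677.7x + 42799                                    | 0.7555 |
| Nonlinear (order 2) | y = −197.31x² + 7395.4x + 35904                        | 0.7692 |
| Nonlinear (order 3) | y = 81.613x³ − 1929.7x² + 17798x + 19818               | 0.8016 |
| Nonlinear (order 4) | y = 15.512x⁴ − 362.5x³ + 2270x² + 3067.3x + 34930      | 0.8113 |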

Warning

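Although the quartic model has the highest R², R² on the fitted data can only stay the same or rise as polynomial terms are added, so it does not by itself indicate better predictions. Linear or quadratic models are usually preferred in practice for salary versus experience because they generalize better.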
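
The effect described in the warning can be reproduced on synthetic data. The sketch below does not use the lecture's dataset: it generates a noisy, roughly linear salary trend (the slope, intercept, and noise level are assumptions), fits polynomials of order 1 to 4 with np.polyfit, and reports both the R² on the fitted data and the extrapolated prediction at 16 years, which lies outside the fitted range.

```python
import numpy as np

# Synthetic data (an assumption for illustration): linear trend plus noise
rng = np.random.default_rng(0)
years = np.linspace(0, 12, 25)
salary = 43000 + 4700 * years + rng.normal(0, 6000, size=years.size)

def r_squared(y, y_fit):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    ss_res = np.sum((y - y_fit) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1 - ss_res / ss_tot

r2_by_order = {}
prediction_at_16 = {}
for order in (1, 2, 3, 4):
    coeffs = np.polyfit(years, salary, deg=order)  # least-squares polynomial fit
    r2_by_order[order] = r_squared(salary, np.polyval(coeffs, years))
    prediction_at_16[order] = float(np.polyval(coeffs, 16.0))  # beyond the data range
    print(order, round(r2_by_order[order], 4), round(prediction_at_16[order]))
```

R² never decreases as the order rises, because each higher-order model contains the lower-order one as a special case. The extrapolated predictions are the figures to watch: when they diverge across orders, the extra terms are fitting noise rather than trend.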

Classification

Classification is another predictive method: it assigns a category to a given set of input data.

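| Aspect   | Classification                                        | Regression                                                          |
|----------|-------------------------------------------------------|---------------------------------------------------------------------|
| Output   | Category / class                                      | Continuous numerical value                                          |
| Examples | Predicting illness, fraud detection, face classifier  | Assessing house price, forecasting demand, temperature forecasting  |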

The scikit-learn library is commonly used for predictive data analysis, usually alongside NumPy, SciPy, and Matplotlib.

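# Label encoding example
from sklearn.preprocessing import LabelEncoder

# Converts categorical labels (e.g., blue, green) into numerical values (0, 1)
encoder = LabelEncoder()
codes = encoder.fit_transform(['blue', 'green', 'blue'])  # array([0, 1, 0])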
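
To show what "output is a category" means in practice, here is a small NumPy-only sketch. It is not scikit-learn's API: it uses a nearest-centroid rule, swapped in because it is simple enough to read in full, and it encodes labels with np.unique. The features, the "junior"/"senior" labels, and the predict helper are all hypothetical.

```python
import numpy as np

# Hypothetical training data: [years of experience, salary in thousands] -> job level
features = np.array(
    [[1, 40], [2, 45], [3, 48], [10, 80], [12, 90], [15, 95]], dtype=float
)
labels = np.array(["junior", "junior", "junior", "senior", "senior", "senior"])

# Label encoding: map each category to an integer code (classes sorted alphabetically)
classes, codes = np.unique(labels, return_inverse=True)

# "Training": one centroid (mean feature vector) per class
centroids = np.array([features[codes == k].mean(axis=0) for k in range(len(classes))])

def predict(sample):
    """Return the class whose centroid is closest to the sample."""
    distances = np.linalg.norm(centroids - np.asarray(sample, dtype=float), axis=1)
    return classes[np.argmin(distances)]

print(predict([2, 44]))   # a category, not a number
print(predict([11, 85]))
```

A regression model given the same inputs would return a number such as a salary; the classifier returns one of the known class names.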

Prescriptive Analytics

Prescriptive analytics provides decision support: it goes beyond predicting future outcomes by recommending actions that take advantage of those predictions.

Key Features

  • Optimizes business outcomes by combining mathematical models, machine learning algorithms, and historical data
  • Anticipates what will happen, when it will happen, and why
  • Implemented using simulation and optimization
  • Guides users on how different actions will affect business and suggests the optimal choice
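
The step from a prediction to a recommended action can be sketched as a small optimization. In the example below, everything is an assumption made for illustration: the linear demand model, the unit cost, and the range of candidate prices. It evaluates the predicted profit of each candidate price and recommends the best one.

```python
import numpy as np

UNIT_COST = 10.0  # hypothetical cost per unit

def predicted_demand(price):
    """Hypothetical predictive model: demand falls linearly as price rises."""
    return np.maximum(0.0, 1000.0 - 20.0 * price)

def expected_profit(price):
    """Profit implied by the demand prediction at a given price."""
    return (price - UNIT_COST) * predicted_demand(price)

# Candidate actions: every whole-number price from 10 to 50
candidate_prices = np.arange(10, 51)
profits = expected_profit(candidate_prices)

best_price = int(candidate_prices[np.argmax(profits)])
best_profit = float(profits.max())
print("Recommended price:", best_price, "expected profit:", best_profit)
```

An exhaustive search is enough here because there are only 41 candidate actions; larger decision spaces would call for a dedicated optimization or simulation method.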

Applications

Pricing, production planning, marketing, financial planning, and supply chain optimization.

Example: Asset Performance Management

  1. Foundation: Time-series data consolidated by a historian tool
  2. Predictive analytics: Creates indicators and alerts forecasting risks or sub-optimum performance
  3. Prescriptive actions: Pre-defined actions triggered by alerts enable problem-solving before failure impacts operations
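
The three steps above can be sketched in a few lines. This is an illustrative toy, not a real asset-management system: the temperature readings, the thresholds, the action names, and the naive trend forecast are all assumptions.

```python
from statistics import mean

# Hypothetical alert thresholds and the pre-defined action attached to each
ALERT_RULES = [
    (90.0, "shut down and inspect bearing"),
    (80.0, "schedule maintenance"),
]

def forecast_next(readings, window=3):
    """Naive forecast: last reading plus the average step over the last `window` steps."""
    recent = readings[-(window + 1):]
    steps = [b - a for a, b in zip(recent, recent[1:])]
    return readings[-1] + mean(steps)

def prescribe(readings):
    """Return the forecast and the action for the highest threshold it reaches, if any."""
    predicted = forecast_next(readings)
    for threshold, action in ALERT_RULES:  # ordered from most to least severe
        if predicted >= threshold:
            return predicted, action
    return predicted, None

# Time-series temperature readings (degrees C) as a historian tool might supply them
temperatures = [70.0, 72.0, 75.0, 79.0]
print(prescribe(temperatures))
```

The latest reading (79.0) is still below both thresholds, but the forecast crosses 80, so the maintenance action is triggered before the limit is actually reached.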

Summary

| Analytics Type | Purpose                   | Methods                                      |
|----------------|---------------------------|----------------------------------------------|
| Descriptive    | Summarize past data       | Statistics, histograms, box plots            |
| Diagnostic     | Find root causes          | Correlation analysis, hypothesis testing     |
| Predictive     | Forecast future events    | Regression, classification, machine learning |
| Prescriptive   | Recommend optimal actions | Simulation, optimization                     |
