Technical Aspects of Big Data

This lecture covers domain-specific big data applications and the core Hadoop ecosystem for storage, processing, and analytics.

Big Data Applications

Businesses in many industries have benefited from adopting big data solutions.

Cross-Industry Use Cases

Industry	Application
Banking & Securities	Credit/debit card fraud detection, securities fraud warnings, credit risk reporting, customer data analytics
Healthcare	Storing patient data and analyzing it to detect medical ailments at an early stage
Marketing	Analyzing customer purchase history to reach the right customers for newly launched products
Web Analysis	Analyzing social media and search engine data to broadcast advertisements based on user interests
Call Centers	Identifying recurring problems and staff behavior patterns by capturing and processing call content
Agriculture	Biotechnology firms use sensors to optimize crop efficiency; big data systems analyze the sensor data
Smartphones	Facial recognition for unlocking phones and retrieving stored information about a person

Manufacturing Industry

Machine Diagnosis & Prognosis: Predicting machine performance by analyzing current operating conditions and deviations from normal conditions. Sensors monitor temperature and vibration levels. Diagnostic systems integrate with cloud-based storage and big data analytics backends.
Risk Analysis of Industrial Operations: Monitoring indoor air quality using gas sensors (CO, NO, NO₂). Big data systems analyze risks and identify hazardous zones.
Production Planning and Control: Systems measure various production process parameters and control the entire process in real time using sensors. Big data systems analyze this data for planning and for identifying potential problems.
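
The idea of flagging deviations from normal operating conditions can be sketched with a simple z-score check. This is a minimal illustration, not a method from the lecture: the function name, the sample vibration values, and the threshold of three standard deviations are all assumptions.

```python
from statistics import mean, stdev

def find_deviations(baseline, readings, threshold=3.0):
    """Return (index, value) pairs for readings far from the baseline.

    baseline: readings recorded under normal operating conditions.
    threshold: number of baseline standard deviations treated as abnormal
               (an illustrative choice, not a value from the lecture).
    """
    mu = mean(baseline)
    sigma = stdev(baseline)
    flagged = []
    for i, value in enumerate(readings):
        if sigma > 0 and abs(value - mu) / sigma > threshold:
            flagged.append((i, value))
    return flagged

# Hypothetical vibration levels (mm/s) under normal operation.
normal_vibration = [2.0, 2.1, 1.9, 2.0, 2.2, 1.8, 2.0, 2.1]
# New readings; the fourth one is clearly outside the normal range.
current_vibration = [2.0, 2.1, 1.9, 5.5, 2.0]

alerts = find_deviations(normal_vibration, current_vibration)
print(alerts)
```

A production diagnostic system would likely use richer models, but the same pattern applies: learn a baseline, then compare incoming sensor data against it.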

Civil Infrastructure

Structural Health Monitoring: Networks of sensors monitor vibration levels in bridges and buildings. Analyzing this data can detect cracks, locate damage, estimate remaining service life, and provide advance warning of imminent failures.

Transportation & Smart Cities

Smart Roads: Sensors provide information on driving conditions, travel time estimates, and alerts for poor conditions, traffic congestion, and accidents. This information is sent over the Internet to cloud-based analytics applications and then disseminated to drivers.

Healthcare Industry

Epidemiological Surveillance: Studying the distribution and determinants of health-related states in populations. Electronic Health Record (EHR) systems hold laboratory results and diagnostic, treatment, and demographic data. Big data frameworks integrate multiple EHR systems for predicting outbreaks, population-level surveillance, disease detection, and public health mapping.

Finance Industry

Credit Risk Modeling: Scoring credit applications and predicting borrower default. Models are built from credit scores, credit history, account balances, transactions, and spending patterns. Big data systems compute credit risk scores for large numbers of customers regularly.
Fraud Detection: Detecting credit card fraud, money laundering, and insurance claim fraud. Real-time analytics frameworks label transactions as they arrive, machine learning models detect anomalies, and batch analytics searches historical data for fraud patterns.
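
Labeling transactions as they arrive can be illustrated with running statistics. The sketch below assumes a single numeric feature (the amount) and a z-score cutoff; the class name, warm-up length, and threshold are hypothetical, and real fraud models use many more signals.

```python
import math

class StreamingAnomalyLabeler:
    """Labels each amount as it arrives, using running statistics (Welford's method)."""

    def __init__(self, threshold=3.0, warmup=5):
        self.threshold = threshold  # z-score cutoff (illustrative assumption)
        self.warmup = warmup        # observations needed before labeling starts
        self.count = 0
        self.mean = 0.0
        self.m2 = 0.0               # running sum of squared deviations

    def label(self, amount):
        """Return 'suspicious' or 'normal' for one transaction amount."""
        result = "normal"
        if self.count >= self.warmup:
            std = math.sqrt(self.m2 / (self.count - 1))
            if std > 0 and abs(amount - self.mean) / std > self.threshold:
                result = "suspicious"
        # Only normal-looking amounts update the profile, so one outlier
        # does not inflate the variance and mask later ones.
        if result == "normal":
            self.count += 1
            delta = amount - self.mean
            self.mean += delta / self.count
            self.m2 += delta * (amount - self.mean)
        return result

labeler = StreamingAnomalyLabeler()
amounts = [25.0, 30.0, 22.0, 28.0, 27.0, 26.0, 950.0, 24.0]
labels = [labeler.label(a) for a in amounts]
print(list(zip(amounts, labels)))
```

Because the statistics are updated incrementally, each transaction is labeled without rescanning history, which is the property a real-time framework needs. The batch side would revisit the stored history separately.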

Digital Marketing

Content Recommendation: Applications collect user search patterns, browsing history, consumed content, and ratings. Big data systems recommend new content based on user preferences and interests.
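
One common way to turn collected ratings into recommendations is similarity-weighted scoring between users. The following is a minimal sketch under that assumption; the user names, item names, and ratings are invented for illustration, and the lecture does not prescribe a specific algorithm.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two sparse rating dicts (item -> rating)."""
    shared = set(a) & set(b)
    dot = sum(a[i] * b[i] for i in shared)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def recommend(ratings, user, top_n=2):
    """Rank items the user has not rated by similarity-weighted ratings of others."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        sim = cosine_similarity(ratings[user], other_ratings)
        if sim <= 0:
            continue  # users with nothing in common contribute no evidence
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + sim * rating
    ranked = sorted(scores, key=lambda item: scores[item], reverse=True)
    return ranked[:top_n]

# Hypothetical ratings on a 1-5 scale.
ratings = {
    "alice": {"hadoop_intro": 5, "spark_basics": 4},
    "bob":   {"hadoop_intro": 5, "spark_basics": 5, "hive_queries": 4},
    "carol": {"cooking_101": 5, "gardening": 4},
}
suggestions = recommend(ratings, "alice")
print(suggestions)
```

Here only the user with overlapping tastes influences the result, so unrelated content is not suggested.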

Environment

Weather Monitoring: Sensors collect temperature, humidity, and pressure data and send it to cloud-based applications. Data is analyzed and visualized for monitoring and generating weather alerts.

Big Data Technology

Big data technology refers to the software used to store, process, and analyze datasets that are too large or complex for a traditional RDBMS to handle. The key technologies include:

Hadoop
Hadoop Distributed File System (HDFS)
MapReduce
YARN

Apache Hadoop

An open-source framework written in Java that supports processing of large datasets across clusters in a distributed computing environment. It can store structured, semi-structured, and unstructured data in a distributed file system (DFS) and process it in parallel.

The Hadoop ecosystem comprises four layers:

Layer	Components
Data Storage	HDFS, HBase
Data Processing	MapReduce, YARN
Data Access	Hive, Pig, Mahout, SQOOP, Avro
Data Management	Flume, Oozie, Chukwa

HDFS (Hadoop Distributed File System)

Designed to store large datasets on clusters of low-cost commodity hardware
Does not require highly reliable, expensive hardware
Data is stored in a write-once, read-many-times pattern
Not suitable for applications requiring low-latency access
HBase is a suitable alternative for low-latency applications
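
The write-once, read-many access pattern can be pictured with a toy in-memory store. This is only a conceptual model: `WriteOnceStore` and the file path are hypothetical, and the code does not use or imitate the HDFS client API.

```python
class WriteOnceStore:
    """Toy in-memory store: a path can be written once, then read many times."""

    def __init__(self):
        self._files = {}

    def write(self, path, data):
        if path in self._files:
            raise FileExistsError(f"{path} already exists and cannot be overwritten")
        self._files[path] = bytes(data)

    def read(self, path):
        return self._files[path]

store = WriteOnceStore()
store.write("/logs/2024-01-01.log", b"sensor readings")
first_read = store.read("/logs/2024-01-01.log")
second_read = store.read("/logs/2024-01-01.log")

# A second write to the same path is rejected instead of modifying the file.
try:
    store.write("/logs/2024-01-01.log", b"changed")
    overwrite_rejected = False
except FileExistsError:
    overwrite_rejected = True
```

Workloads that need to update individual records in place fit this model poorly, which is why a store such as HBase is suggested for them.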

HBase

A column-oriented NoSQL database built on top of HDFS
Horizontally scalable, open-source, distributed
Does not require any predefined schema
Supports both structured and unstructured data
Provides real-time read and write access to data stored in HDFS
Supports many reads and many writes of the same data, unlike the write-once pattern of plain HDFS files
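
A column-oriented layout without a fixed schema can be modeled as rows keyed by a row key, each holding its own set of columns. The sketch below is a toy model, not the HBase client API; `ToyColumnStore` and the sample data are assumptions. In HBase itself, column families are declared when a table is created, while the column qualifiers inside a family can differ from row to row.

```python
from collections import defaultdict

class ToyColumnStore:
    """Rows keyed by row key; each row maps 'family:qualifier' -> value."""

    def __init__(self):
        self._rows = defaultdict(dict)

    def put(self, row_key, column, value):
        # Columns are created on write; no per-column schema is declared up front.
        self._rows[row_key][column] = value

    def get(self, row_key, column=None):
        row = self._rows.get(row_key, {})
        return dict(row) if column is None else row.get(column)

table = ToyColumnStore()
table.put("patient#001", "info:name", "A. Sample")
table.put("patient#001", "vitals:heart_rate", 72)
table.put("patient#002", "info:name", "B. Sample")
table.put("patient#002", "notes:free_text", "No known allergies")  # different columns per row
table.put("patient#001", "vitals:heart_rate", 75)                  # cells can be rewritten

row_1 = table.get("patient#001")
rate = table.get("patient#001", "vitals:heart_rate")
print(row_1, rate)
```

Lookups go straight to a row key, which is what makes low-latency random access practical compared with scanning whole files.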

MapReduce

A batch-processing programming model adopting a divide-and-conquer principle.

Processes data in a parallel and distributed computing environment
Supports only batch workloads
Map task: splits and maps the data
Reduce task: shuffles and reduces the data
Not suitable for real-time processing or for small datasets

MapReduce Architecture:

Phase	Description
Input Splits	Input is divided into fixed-size pieces, one per map task
Mapping	Each split is turned into key-value pairs; in the word-count example, a `<word, frequency>` list
Shuffling & Sorting	Groups the records that share a key in the mapping output
Reducer	Aggregates the values for each key into a single output value
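
The four phases can be traced in a single-process word count. This is a sketch of the programming model only, assuming line-based splits; it runs sequentially and does not use Hadoop, and the function names are chosen for illustration.

```python
from collections import defaultdict

def split_input(text, lines_per_split=1):
    """Input splits: divide the input into fixed-size pieces (here, by line count)."""
    lines = text.strip().splitlines()
    return [lines[i:i + lines_per_split] for i in range(0, len(lines), lines_per_split)]

def map_phase(split):
    """Mapping: emit a (word, 1) pair for every word in one split."""
    return [(word.lower(), 1) for line in split for word in line.split()]

def shuffle_and_sort(mapped_pairs):
    """Shuffling and sorting: group all values that share a key, ordered by key."""
    groups = defaultdict(list)
    for key, value in mapped_pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reduce_phase(key, values):
    """Reducer: aggregate the grouped values into a single output value."""
    return key, sum(values)

text = """big data needs big storage
hadoop stores big data"""

splits = split_input(text)
mapped = [pair for split in splits for pair in map_phase(split)]
grouped = shuffle_and_sort(mapped)
word_counts = dict(reduce_phase(k, v) for k, v in grouped)
print(word_counts)
```

In a cluster, each split would be mapped on a different node and the shuffle would move data across the network; the logic per phase stays the same.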

YARN (Yet Another Resource Negotiator)

Developed to overcome drawbacks of the original MapReduce architecture. It allows various data processing engines to share a cluster for batch, interactive, and real-time stream processing.

Core Components:

Component	Role
ResourceManager	Allocates cluster resources through its two parts, the Scheduler and the ApplicationsManager
ApplicationMaster	Manages the lifecycle of one job: requests resources and directs NodeManagers to create and destroy containers
NodeManager	Manages the work on a specific node by creating and destroying containers
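
The division of roles among the three components can be sketched as a small simulation. Everything here is a simplification under stated assumptions: memory is the only resource, allocation is first-fit, and the class and method names are hypothetical rather than taken from the YARN API.

```python
from dataclasses import dataclass, field

@dataclass
class NodeManager:
    name: str
    free_memory_mb: int
    containers: list = field(default_factory=list)

    def create_container(self, app_id, memory_mb):
        self.free_memory_mb -= memory_mb
        self.containers.append((app_id, memory_mb))

    def destroy_containers(self, app_id):
        for owner, memory_mb in [c for c in self.containers if c[0] == app_id]:
            self.containers.remove((owner, memory_mb))
            self.free_memory_mb += memory_mb

class ResourceManager:
    """Picks a node with enough free memory for each container request."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, app_id, memory_mb):
        for node in self.nodes:
            if node.free_memory_mb >= memory_mb:
                node.create_container(app_id, memory_mb)
                return node.name
        return None  # no capacity; a real scheduler would queue the request

class ApplicationMaster:
    """Requests containers for one job and releases them when the job ends."""

    def __init__(self, app_id, resource_manager):
        self.app_id = app_id
        self.rm = resource_manager

    def run(self, task_count, memory_mb):
        return [self.rm.allocate(self.app_id, memory_mb) for _ in range(task_count)]

    def finish(self):
        for node in self.rm.nodes:
            node.destroy_containers(self.app_id)

nodes = [NodeManager("node-1", 4096), NodeManager("node-2", 2048)]
rm = ResourceManager(nodes)
am = ApplicationMaster("app-001", rm)
placements = am.run(task_count=3, memory_mb=2048)
free_while_running = [n.free_memory_mb for n in nodes]
am.finish()
free_after_finish = [n.free_memory_mb for n in nodes]
print(placements, free_while_running, free_after_finish)
```

The point of the separation is that the cluster-wide allocator knows nothing about the job's internals, so engines other than MapReduce can plug in their own ApplicationMaster.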

Hive

A tool for processing structured data in the Hadoop environment. It provides a platform for writing SQL-like scripts that are executed as MapReduce operations.

Query language: HiveQL (Hive Query Language, also abbreviated HQL)
Has a DDL similar to SQL DDL for creating, deleting, or altering schema objects
Data organized into: Tables, Partitions, and Buckets
Partitions speed up query performance by avoiding full table scans
Buckets further divide partitions based on the hash of a column
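
How partitions and buckets decide where a row is stored can be sketched as follows. The column names, bucket count, and sample rows are assumptions, and `zlib.crc32` is a stand-in hash for this sketch, not Hive's own hash function.

```python
import zlib

NUM_BUCKETS = 4  # illustrative; in Hive this is chosen when the table is defined

def bucket_for(value, num_buckets=NUM_BUCKETS):
    """Assign a bucket from a hash of the bucketing column.

    zlib.crc32 is a stand-in hash for this sketch, not Hive's own hash function.
    """
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets

def storage_location(row, partition_column, bucket_column):
    """Return (partition directory, bucket number) for one row."""
    partition_dir = f"{partition_column}={row[partition_column]}"
    return partition_dir, bucket_for(row[bucket_column])

rows = [
    {"order_id": 1, "customer_id": "c-17", "country": "IN"},
    {"order_id": 2, "customer_id": "c-42", "country": "US"},
    {"order_id": 3, "customer_id": "c-17", "country": "IN"},
]
layout = [storage_location(r, "country", "customer_id") for r in rows]

# A query filtered on country = 'IN' can skip every other partition.
in_rows = [r for r, (part, _) in zip(rows, layout) if part == "country=IN"]
print(layout, len(in_rows))
```

Rows with the same partition value land in the same directory, and rows with the same bucketing value always land in the same bucket, which is what lets a query avoid a full table scan.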

Pig

A high-level platform for analyzing large datasets. It has two components: Pig Latin (the language) and the execution environment that runs Pig Latin programs.

Can handle structured, semi-structured, and unstructured data
Programmers without Java knowledge can perform MapReduce tasks
Internally, Pig Latin scripts are converted into MapReduce jobs

Internal Process:

Parser checks syntax
Optimizer carries out logical optimization
Compiler compiles into MapReduce jobs
Execution Engine submits jobs to Hadoop

SQOOP (SQL to Hadoop)

Transfers structured data from an RDBMS to HDFS when the data volume exceeds what the RDBMS can handle
Can also move data from relational databases to HBase
Final analysis results can be exported back to the database
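
The import direction can be illustrated by copying a relational table into delimited text. This sketch does not use Sqoop: an in-memory SQLite database stands in for the source RDBMS, a `StringIO` buffer stands in for a file in HDFS, and the table and function names are hypothetical.

```python
import csv
import io
import sqlite3

def import_table(connection, table, out_file):
    """Copy every row of a relational table into a comma-delimited text file."""
    # The table name is interpolated directly; acceptable only for this trusted sketch.
    cursor = connection.execute(f"SELECT * FROM {table}")
    writer = csv.writer(out_file, lineterminator="\n")
    row_count = 0
    for row in cursor:
        writer.writerow(row)
        row_count += 1
    return row_count

# In-memory SQLite stands in for the source RDBMS.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE customers (id INTEGER, name TEXT, city TEXT)")
db.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Asha", "Pune"), (2, "Ben", "Leeds")],
)

# StringIO stands in for a file in HDFS.
target = io.StringIO()
imported = import_table(db, "customers", target)
exported_text = target.getvalue()
db.close()
print(imported, exported_text)
```

Exporting results back to the database is the same idea in reverse: read the delimited records and insert them into a table.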

Avro

An open-source data serialization framework.

Translates in-memory data into binary or textual format for transport or storage
Designed to overcome the lack of language portability in Hadoop’s native serialization
Serialized data can be processed by programs written in multiple languages
Schemas defined in JSON
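
A schema written in JSON and its use to check a record before serialization might look like the sketch below. It relies only on the standard library rather than an Avro library, so the validation logic and the JSON text output are stand-ins for Avro's real encodings; the `SensorReading` record and its fields are invented for illustration.

```python
import json

# An Avro-style record schema, written in JSON.
SCHEMA_JSON = """
{
  "type": "record",
  "name": "SensorReading",
  "fields": [
    {"name": "sensor_id", "type": "string"},
    {"name": "temperature", "type": "double"},
    {"name": "timestamp", "type": "long"}
  ]
}
"""

# Mapping from the primitive type names used above to Python types (sketch only).
PYTHON_TYPES = {"string": str, "double": float, "long": int}

def conforms(record, schema):
    """Check that a dict has exactly the schema's fields with matching types."""
    fields = schema["fields"]
    if set(record) != {f["name"] for f in fields}:
        return False
    return all(isinstance(record[f["name"]], PYTHON_TYPES[f["type"]]) for f in fields)

def serialize(record, schema):
    """Validate, then encode as JSON text (a stand-in for Avro's real encodings)."""
    if not conforms(record, schema):
        raise ValueError("record does not match schema")
    return json.dumps(record, sort_keys=True)

schema = json.loads(SCHEMA_JSON)
reading = {"sensor_id": "s-01", "temperature": 21.5, "timestamp": 1700000000}
payload = serialize(reading, schema)
restored = json.loads(payload)
bad_record_ok = conforms({"sensor_id": "s-01", "temperature": "hot", "timestamp": 1}, schema)
```

Because the schema is plain JSON, a reader written in another language can load the same schema and interpret the data, which is the portability benefit described above.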

Flume

A distributed and reliable tool for collecting large amounts of streaming data from multiple sources.

Ingests streaming data (sensors, social media, log files) into HDFS
Difference from SQOOP: SQOOP handles structured data; Flume handles streaming data
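
Flume's documentation describes an agent as a source, a channel, and a sink. The flow of streaming events through those three parts can be sketched in plain Python; the classes, the batch size, and the list that stands in for HDFS are all hypothetical and do not reflect Flume's configuration or API.

```python
from collections import deque

class MemoryChannel:
    """Buffers events between the source and the sink."""

    def __init__(self):
        self._events = deque()

    def put(self, event):
        self._events.append(event)

    def take_batch(self, max_events):
        batch = []
        while self._events and len(batch) < max_events:
            batch.append(self._events.popleft())
        return batch

def run_source(lines, channel):
    """Source: turn each incoming log line into an event on the channel."""
    for line in lines:
        channel.put({"body": line.rstrip("\n")})

def run_sink(channel, storage, batch_size=2):
    """Sink: drain the channel in batches and append them to storage."""
    batches = 0
    while True:
        batch = channel.take_batch(batch_size)
        if not batch:
            return batches
        storage.extend(event["body"] for event in batch)
        batches += 1

log_lines = ["GET /index 200\n", "GET /login 500\n", "POST /cart 200\n"]
channel = MemoryChannel()
hdfs_stand_in = []  # a list standing in for a file in HDFS

run_source(log_lines, channel)
batches_written = run_sink(channel, hdfs_stand_in)
print(batches_written, hdfs_stand_in)
```

The buffer between the two ends is what lets a fast, bursty source and a slower storage system run at different speeds.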

Oozie

An open-source workflow management engine and scheduler for Hadoop.

Three types of jobs:

Type	Description
Workflow	Sequences of actions represented as directed acyclic graphs (DAGs); run on demand
Coordinator	Scheduled to execute periodically based on frequency or data availability
Bundle	Collection of coordinator jobs managed as a single job
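
Running a workflow expressed as a DAG means executing each action only after its predecessors have finished. A minimal sketch using Python's standard `graphlib` module (Python 3.9 or later) is shown below; the action names are invented, and this is not how Oozie workflows are actually defined or submitted.

```python
from graphlib import TopologicalSorter

# Each action maps to the set of actions that must finish before it starts.
workflow = {
    "ingest_logs": set(),
    "ingest_db": set(),
    "clean": {"ingest_logs", "ingest_db"},
    "aggregate": {"clean"},
    "export_report": {"aggregate"},
}

def run_workflow(dag, actions):
    """Run every action once, only after all of its predecessors have run."""
    executed = []
    for name in TopologicalSorter(dag).static_order():
        actions[name]()
        executed.append(name)
    return executed

log = []
actions = {name: (lambda n=name: log.append(f"ran {n}")) for name in workflow}
order = run_workflow(workflow, actions)
print(order)
```

A coordinator job would wrap a workflow like this with a trigger, such as a time interval or the arrival of input data.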

Chukwa

An open-source data collection system for monitoring large distributed systems.

Four components:

Agents: Run on each machine and emit data
Collectors: Receive data from agents and write to stable storage
MapReduce jobs: Parse and archive data
HICC (Hadoop Infrastructure Care Center): Web-portal interface for displaying data

Big Data Analytics Technology

MATLAB

A high-performance language for technical computing that integrates computation, visualization, and programming.

Typical uses: math and computation, algorithm development, modeling/simulation/prototyping, data analysis/visualization, scientific/engineering graphics, application development.

R

An open-source programming language optimized for statistical analysis and data visualization. It is commonly used within the RStudio IDE.

It offers libraries and tools for cleansing and preparing data, creating visualizations, and training and evaluating machine learning and deep learning algorithms.

Python

An open-source, general-purpose programming language that is applicable wherever data is used. It is one of the most widely used languages for data analytics, largely because of its library ecosystem.

Key Libraries:

Library	Purpose
Pandas	Data manipulation and analysis
NumPy	Numerical computing
SciPy	Scientific computing
Matplotlib	Visualization
scikit-learn	Machine learning
TensorFlow	Deep learning
Theano	Numerical computation
Keras	Neural networks
PyTorch	Deep learning

Big Data Visualization Technology

Big data visualization translates information into a visual context, such as maps and graphs, so that the data is easier to understand and patterns, trends, and outliers are easier to identify.

Tool	Description
Power BI	Microsoft’s business analytics service for analyzing, visualizing, extracting insights, and sharing across organizations
Tableau	Business Intelligence tool managing data flow and turning data into actionable information; creates interactive visualizations

Summary

Domain-specific applications in manufacturing, finance, environment, smart cities, healthcare, and digital marketing
Hadoop Ecosystem: HDFS, YARN, MapReduce, HBase, Hive, Pig, SQOOP, Avro, Flume, Oozie, Chukwa
Analytics languages: MATLAB, R, Python
Visualization tools: Tableau, Power BI

Related Concepts

Hadoop Ecosystem: Deep dive into storage, processing, and management layers
MapReduce: Batch processing model and architecture
YARN: Resource negotiation and job scheduling
HDFS: Distributed file system design

Sources

Lecture-2-Technical-aspects-of-big-data
