Technical Aspects of Big Data
This lecture covers domain-specific big data applications and the core Hadoop ecosystem for storage, processing, and analytics.
Big Data Applications
Businesses in many industries have benefited from adopting big data solutions.
Cross-Industry Use Cases
| Industry | Application |
|---|---|
| Banking & Securities | Credit/debit card fraud detection, securities fraud warnings, credit risk reporting, customer data analytics |
| Healthcare | Storing patient data and analyzing it to detect medical ailments at an early stage |
| Marketing | Analyzing customer purchase history to reach the right customers for newly launched products |
| Web Analysis | Analyzing social media and search engine data to broadcast advertisements based on user interests |
| Call Centers | Identifying recurring problems and staff behavior patterns by capturing and processing call content |
| Agriculture | Biotechnology firms use sensors to optimize crop efficiency; big data systems analyze the sensor data |
| Smartphones | Facial recognition for unlocking phones and retrieving stored information about a person |
Manufacturing Industry
- Machine Diagnosis & Prognosis: Predicting machine performance by analyzing current operating conditions and deviations from normal conditions. Sensors monitor temperature and vibration levels. Diagnostic systems integrate with cloud-based storage and big data analytics backends.
- Risk Analysis of Industrial Operations: Monitoring indoor air quality using gas sensors (CO, NO, NO₂). Big data systems analyze risks and identify hazardous zones.
- Production Planning and Control: Sensor-based systems measure production process parameters and control the entire process in real time. Big data systems analyze this data to support planning and to identify potential problems.
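The "deviations from normal conditions" idea in the machine diagnosis item can be illustrated with a small sketch. It assumes a baseline of readings recorded under normal operation and flags any new reading that lies more than a chosen number of standard deviations from the baseline mean. The function name `find_deviations`, the threshold of 3, and the sample vibration values are hypothetical, chosen only for illustration.

```python
from statistics import mean, stdev

def find_deviations(baseline, readings, threshold=3.0):
    """Return (index, value) pairs for readings far from the baseline mean."""
    mu = mean(baseline)
    sigma = stdev(baseline)  # assumes the baseline is not perfectly constant
    flagged = []
    for i, value in enumerate(readings):
        # z-score: distance from normal, in units of standard deviation
        z = (value - mu) / sigma
        if abs(z) > threshold:
            flagged.append((i, value))
    return flagged

# Hypothetical vibration levels recorded under normal operation
baseline = [2.0, 2.1, 1.9, 2.0, 2.2, 1.8, 2.0, 2.1]
# New readings; the third one is clearly abnormal
readings = [2.1, 1.9, 5.0, 2.0]
flagged = find_deviations(baseline, readings)
print(flagged)
```

A production system would likely use a richer model per machine and per sensor, but the comparison against a learned "normal" is the same.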
Civil Infrastructure
- Structural Health Monitoring: Networks of sensors monitor vibration levels in bridges and buildings. Analyzing this data can detect cracks, locate damage, estimate remaining life, and provide advance warning of imminent failures.
Transportation & Smart Cities
- Smart Roads: Sensors provide information on driving conditions, travel time estimates, and alerts for poor conditions, traffic congestion, and accidents. This information is sent over the Internet to cloud-based analytics applications and then disseminated to drivers.
Healthcare Industry
- Epidemiological Surveillance: Studying distribution and determinants of health-related states in populations. Electronic Health Record (EHR) systems include laboratory results, diagnostic, treatment, and demographic data. Big data frameworks integrate multiple EHR systems for predicting outbreaks, population-level surveillance, disease detection, and public health mapping.
Finance Industry
- Credit Risk Modeling: Scoring credit applications and predicting borrower default. Models are built from credit scores, credit history, account balances, transactions, and spending patterns. Big data systems compute credit risk scores for large numbers of customers regularly.
- Fraud Detection: Detecting credit card fraud, money laundering, and insurance claim fraud. Real-time analytics frameworks label transactions in real-time. Machine learning models detect anomalies. Batch analytics searches historical data for fraud patterns.
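Labeling transactions as they arrive, as described for fraud detection, might look roughly like the sketch below. A simple rule stands in for a trained machine learning model: a transaction is marked suspicious when its amount is several times the card's running average. The class name `StreamingLabeler`, the ratio of 5, and the minimum history of 3 transactions are all assumptions made for this example.

```python
from collections import defaultdict

class StreamingLabeler:
    """Toy real-time labeler that keeps a running average per card."""

    def __init__(self, ratio=5.0, min_history=3):
        self.ratio = ratio
        self.min_history = min_history
        self.count = defaultdict(int)
        self.total = defaultdict(float)

    def label(self, card_id, amount):
        n = self.count[card_id]
        result = "ok"
        # Only judge a card once enough history has been seen
        if n >= self.min_history:
            average = self.total[card_id] / n
            if amount > self.ratio * average:
                result = "suspicious"
        # Update the running statistics after labeling
        self.count[card_id] += 1
        self.total[card_id] += amount
        return result

# Hypothetical transaction stream: (card id, amount)
stream = [("c1", 20.0), ("c1", 25.0), ("c1", 15.0), ("c1", 500.0), ("c1", 22.0)]
labeler = StreamingLabeler()
labels = [labeler.label(card, amount) for card, amount in stream]
print(labels)
```

The batch side of fraud detection would run over the stored history instead, searching for patterns that a per-transaction rule cannot see.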
Digital Marketing
- Content Recommendation: Applications collect user search patterns, browsing history, consumed content, and ratings. Big data systems recommend new content based on user preferences and interests.
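One simple way to recommend content from consumption histories is to score unseen items by how similar their consumers are to the target user. The sketch below uses Jaccard similarity between users' item sets. The user names, item names, and the function names `jaccard` and `recommend` are hypothetical, and real recommenders usually combine many more signals, such as ratings and search patterns.

```python
def jaccard(a, b):
    """Similarity of two sets: size of the overlap divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def recommend(target, histories, top_n=2):
    """Suggest items consumed by similar users but not yet by the target user."""
    seen = histories[target]
    scores = {}
    for user, items in histories.items():
        if user == target:
            continue
        similarity = jaccard(seen, items)
        if similarity == 0:
            continue  # ignore users with nothing in common
        for item in items - seen:
            scores[item] = scores.get(item, 0.0) + similarity
    # Highest score first; ties broken alphabetically for a stable result
    ranked = sorted(scores, key=lambda item: (-scores[item], item))
    return ranked[:top_n]

# Hypothetical consumption histories
histories = {
    "ana": {"hadoop-intro", "hive-basics", "pig-latin"},
    "ben": {"hadoop-intro", "hive-basics", "yarn-deep-dive"},
    "cy": {"matlab-plots", "r-stats"},
}
suggestions = recommend("ana", histories)
print(suggestions)
```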
Environment
- Weather Monitoring: Sensors collect temperature, humidity, and pressure data and send it to cloud-based applications. Data is analyzed and visualized for monitoring and generating weather alerts.
Big Data Technology
Big data technology refers to the software utilities designed to store, process, and analyze datasets that are too large and complex for a traditional RDBMS to handle. The key technologies include:
- Hadoop
- Hadoop Distributed File System (HDFS)
- MapReduce
- YARN
Apache Hadoop
An open-source framework written in Java that supports processing of large datasets across clusters in a distributed computing environment. It can store structured, semi-structured, and unstructured data in a distributed file system (DFS) and process them in parallel.
The Hadoop ecosystem comprises four layers:
| Layer | Components |
|---|---|
| Data Storage | HDFS, HBase |
| Data Processing | MapReduce, YARN |
| Data Access | Hive, Pig, Mahout, SQOOP, Avro |
| Data Management | Flume, Oozie, Chukwa |
HDFS (Hadoop Distributed File System)
- Designed to store large datasets running on low-cost commodity hardware
- Does not require highly reliable expensive hardware
- Data is stored in a write once, read many times pattern
- Not suitable for applications requiring low latency access
- HBase is a suitable alternative for low-latency applications
HBase
- A column-oriented NoSQL database built on top of HDFS
- Horizontally scalable, open-source, distributed
- Does not require any predefined schema
- Supports both structured and unstructured data
- Provides real-time access to data in HDFS
- Supports read and write many times
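The schema-free, column-oriented model can be pictured as a map from row key to a set of named columns. The toy class below is only an in-memory illustration of that idea. It is not the HBase client API, and the class name `TinyColumnStore` and the `family:qualifier` column names are assumptions made for the example.

```python
from collections import defaultdict

class TinyColumnStore:
    """Toy in-memory model: row key -> "family:qualifier" -> value."""

    def __init__(self):
        self.rows = defaultdict(dict)

    def put(self, row_key, column, value):
        # No predefined schema: any column can be written to any row,
        # and an existing cell can be overwritten (write many times)
        self.rows[row_key][column] = value

    def get(self, row_key, column=None):
        # Random access by row key (read many times)
        row = self.rows.get(row_key, {})
        return dict(row) if column is None else row.get(column)

store = TinyColumnStore()
store.put("user1", "info:name", "Ana")
store.put("user1", "info:city", "Pune")
store.put("user2", "info:name", "Ben")      # user2 has no city column at all
store.put("user1", "info:city", "Delhi")    # overwrite an existing cell
print(store.get("user1"))
```

Rows in the same table carry different columns here, which is what "does not require any predefined schema" means in practice.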
MapReduce
A batch-processing programming model adopting a divide-and-conquer principle.
- Processes data in a parallel and distributed computing environment
- Supports only batch workloads
- Map task: splits and maps the data
- Reduce task: shuffles and reduces the data
- Not suitable for real-time processing and small data
MapReduce Architecture:
| Phase | Description |
|---|---|
| Input Splits | Input divided into fixed-size pieces |
| Mapping | In a word-count job, counts word occurrences in each split and prepares a <word, frequency> list |
| Shuffling & Sorting | Consolidates relevant records from mapping output |
| Reducer | Aggregates shuffling output values into single output value |
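The four phases in the table can be traced with a single-process word count. This is a sketch of the programming model only, not Hadoop's Java API: the function names are hypothetical, a "split" is simply a fixed number of lines, and the mapper emits a (word, 1) pair per word so that all counting happens in the reducer.

```python
from collections import defaultdict

def split_input(lines, split_size):
    """Input splits: divide the input into fixed-size pieces."""
    return [lines[i:i + split_size] for i in range(0, len(lines), split_size)]

def map_phase(split):
    """Mapping: emit a (word, 1) pair for every word in one split."""
    return [(word.lower(), 1) for line in split for word in line.split()]

def shuffle_and_sort(mapped):
    """Shuffling and sorting: group values by key across all mapper outputs."""
    groups = defaultdict(list)
    for pairs in mapped:
        for word, count in pairs:
            groups[word].append(count)
    return sorted(groups.items())

def reduce_phase(grouped):
    """Reducer: aggregate each key's values into a single output value."""
    return {word: sum(counts) for word, counts in grouped}

lines = ["Deer Bear River", "Car Car River", "Deer Car Bear"]
splits = split_input(lines, 1)
mapped = [map_phase(s) for s in splits]   # in Hadoop, mappers run in parallel
result = reduce_phase(shuffle_and_sort(mapped))
print(result)  # {'bear': 2, 'car': 3, 'deer': 2, 'river': 2}
```

Each call to `map_phase` depends only on its own split, which is why the mapping step can be distributed across machines.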
YARN (Yet Another Resource Negotiator)
Developed to overcome MapReduce architecture drawbacks. Allows using various data processing engines for batch, interactive, and real-time stream processing.
Core Components:
| Component | Role |
|---|---|
| ResourceManager | Allocates cluster resources through its two components, the Scheduler and the ApplicationsManager |
| ApplicationMaster | Manages job lifecycle by directing NodeManager to create/destroy containers |
| NodeManager | Manages jobs on a specific node by creating and destroying containers |
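The division of roles in the table can be mimicked with three small classes. This is a toy simulation under strong assumptions: memory is the only resource, the scheduling policy is "first node with enough free memory", and none of the method names correspond to the real YARN API.

```python
class NodeManager:
    """Creates and destroys containers on one node."""

    def __init__(self, name, memory_mb):
        self.name = name
        self.free_mb = memory_mb
        self.containers = {}
        self._next_id = 0

    def create_container(self, memory_mb):
        self._next_id += 1
        container_id = f"{self.name}-c{self._next_id}"
        self.containers[container_id] = memory_mb
        self.free_mb -= memory_mb
        return container_id

    def destroy_container(self, container_id):
        self.free_mb += self.containers.pop(container_id)

class ResourceManager:
    """Decides which node may host a new container."""

    def __init__(self, nodes):
        self.nodes = nodes

    def allocate(self, memory_mb):
        for node in self.nodes:
            if node.free_mb >= memory_mb:
                return node
        return None  # cluster is full

class ApplicationMaster:
    """Drives one job: requests resources, then directs the NodeManagers."""

    def __init__(self, resource_manager):
        self.rm = resource_manager
        self.running = []  # (node, container_id) pairs

    def run_tasks(self, task_count, memory_mb):
        for _ in range(task_count):
            node = self.rm.allocate(memory_mb)
            if node is None:
                break
            self.running.append((node, node.create_container(memory_mb)))
        return len(self.running)

    def finish(self):
        for node, container_id in self.running:
            node.destroy_container(container_id)
        self.running.clear()

nodes = [NodeManager("n1", 2048), NodeManager("n2", 1024)]
master = ApplicationMaster(ResourceManager(nodes))
started = master.run_tasks(task_count=4, memory_mb=1024)
print(started)  # only 3 of the 4 tasks fit in the cluster
master.finish()
```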
Hive
A tool to process structured data in the Hadoop environment. Provides a platform to develop scripts similar to SQL to perform MapReduce operations.
- Query language: HiveQL (Hive Query Language), often abbreviated HQL
- Has a DDL similar to SQL DDL for creating, deleting, or altering schema objects
- Data organized into: Tables, Partitions, and Buckets
- Partitions speed up query performance by avoiding full table scans
- Buckets further divide partitions based on hash of a column
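How partitions and buckets narrow down the data a query must read can be sketched in a few lines. The example groups rows first by a partition column and then by the hash of a bucketing column. `zlib.crc32` stands in for Hive's own hash function, and the column names and sample rows are hypothetical.

```python
import zlib
from collections import defaultdict

def bucket_for(value, num_buckets):
    """Map a column value to a bucket number using a stable hash."""
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets

def organize(rows, partition_col, bucket_col, num_buckets):
    """Group rows as table[partition value][bucket number] -> list of rows."""
    table = defaultdict(lambda: defaultdict(list))
    for row in rows:
        bucket = bucket_for(row[bucket_col], num_buckets)
        table[row[partition_col]][bucket].append(row)
    return table

def query_partition(table, partition_value):
    """Read a single partition instead of scanning the full table."""
    buckets = table.get(partition_value, {})
    return [row for bucket in buckets.values() for row in bucket]

# Hypothetical sales rows
rows = [
    {"country": "IN", "customer_id": 101, "amount": 40},
    {"country": "US", "customer_id": 102, "amount": 75},
    {"country": "IN", "customer_id": 103, "amount": 20},
    {"country": "US", "customer_id": 102, "amount": 15},
]
table = organize(rows, partition_col="country", bucket_col="customer_id", num_buckets=4)
india = query_partition(table, "IN")
print(len(india), "rows read out of", len(rows))
```

Because the bucket is derived from a hash, all rows for the same `customer_id` within a partition land in the same bucket.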
Pig
A high-level programming language for analyzing large datasets. Two components: Pig Latin (language) and the execution environment.
- Can handle structured, semi-structured, and unstructured data
- Programmers without Java knowledge can perform MapReduce tasks
- Internally, Pig Latin scripts are converted into MapReduce jobs
Internal Process:
- Parser checks syntax
- Optimizer carries out logical optimization
- Compiler compiles into MapReduce jobs
- Execution Engine submits jobs to Hadoop
SQOOP (SQL to Hadoop)
- Transfers structured data from an RDBMS to HDFS, typically when the RDBMS can no longer handle the data volume
- Can also move data from relational databases to HBase
- Final analysis results can be exported back to the database
Avro
An open-source data serialization framework.
- Translates in-memory data into binary or textual format for transport or storage
- Designed to overcome the lack of language portability in Hadoop's native serialization
- The data format can be processed by multiple programming languages
- Schemas defined in JSON
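An Avro record schema is an ordinary JSON document with a `type`, a `name`, and a list of `fields`. The sketch below parses one with Python's standard `json` module and checks a record against it by hand. It does not use an Avro library, the `SensorReading` schema is hypothetical, and the `conforms` helper covers only three primitive types.

```python
import json

schema_json = """
{
  "type": "record",
  "name": "SensorReading",
  "fields": [
    {"name": "sensor_id", "type": "string"},
    {"name": "temperature", "type": "double"},
    {"name": "sequence", "type": "int"}
  ]
}
"""

schema = json.loads(schema_json)

# Hand-written mapping from a few Avro primitive type names to Python types
PYTHON_TYPES = {"string": str, "double": float, "int": int}

def conforms(record, schema):
    """Check that a dict has exactly the schema's fields with matching types."""
    fields = schema["fields"]
    if set(record) != {field["name"] for field in fields}:
        return False
    return all(
        isinstance(record[field["name"]], PYTHON_TYPES[field["type"]])
        for field in fields
    )

good = {"sensor_id": "s1", "temperature": 21.5, "sequence": 7}
bad = {"sensor_id": "s1", "temperature": "hot", "sequence": 7}
print(conforms(good, schema), conforms(bad, schema))
```

Because the schema is plain JSON, a reader written in any language can parse it, which is what makes the data format portable.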
Flume
A distributed and reliable tool for collecting large amounts of streaming data from multiple sources.
- Ingests streaming data (sensors, social media, log files) into HDFS
- Difference from SQOOP: SQOOP handles structured data; Flume handles streaming data
Oozie
An open-source workflow management engine and scheduler for Hadoop.
Three types of jobs:
| Type | Description |
|---|---|
| Workflow | Represented as DAGs, run on demand |
| Coordinator | Scheduled to execute periodically based on frequency or data availability |
| Bundle | Collection of coordinator jobs managed as a single job |
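Representing a workflow as a DAG means each action lists the actions it depends on, and any valid run order must respect those dependencies. The sketch below uses the standard library's `graphlib.TopologicalSorter` (Python 3.9 or later) to compute one such order. The action names are hypothetical, and Oozie itself defines workflows in XML rather than Python.

```python
from graphlib import TopologicalSorter

# Hypothetical workflow: each action maps to the set of actions it depends on
workflow = {
    "import_from_rdbms": set(),
    "ingest_logs": set(),
    "clean_data": {"import_from_rdbms", "ingest_logs"},
    "run_hive_query": {"clean_data"},
    "export_results": {"run_hive_query"},
}

def execution_order(dag):
    """Return one valid order in which every action follows its dependencies."""
    return list(TopologicalSorter(dag).static_order())

order = execution_order(workflow)
print(order)
```

A cycle in the dependencies would make `static_order` raise `graphlib.CycleError`, which matches the requirement that a workflow be acyclic.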
Chukwa
An open-source data collection system for monitoring large distributed systems.
Four components:
- Agents: Run on each machine and emit data
- Collectors: Receive data from agents and write to stable storage
- MapReduce jobs: Parse and archive data
- HICC (Hadoop Infrastructure Care Center): Web-portal interface for displaying data
Big Data Analytics Technology
MATLAB
A high-performance language for technical computing integrating computation, visualization, and programming.
Typical uses: math and computation, algorithm development, modeling/simulation/prototyping, data analysis/visualization, scientific/engineering graphics, application development.
R
An open-source programming language optimized for statistical analysis and data visualization. Commonly used within RStudio IDE.
Libraries and tools for: cleansing and preparing data, creating visualizations, and training and evaluating machine learning and deep learning algorithms.
Python
An open-source, general-purpose programming language that is applicable wherever data is used. It is one of the most widely used languages for data analytics.
Key Libraries:
| Library | Purpose |
|---|---|
| Pandas | Data manipulation and analysis |
| NumPy | Numerical computing |
| SciPy | Scientific computing |
| Matplotlib | Visualization |
| scikit-learn | Machine learning |
| TensorFlow | Deep learning |
| Theano | Numerical computation |
| Keras | Neural networks |
| PyTorch | Deep learning |
Big Data Visualization Technology
Big data visualization translates information into a visual context (maps, graphs) so that the data is easier to understand and patterns, trends, and outliers are easier to identify.
| Tool | Description |
|---|---|
| Power BI | Microsoft’s business analytics service for analyzing, visualizing, extracting insights, and sharing across organizations |
| Tableau | Business Intelligence tool managing data flow and turning data into actionable information; creates interactive visualizations |
Summary
- Domain-specific applications in manufacturing, finance, environment, smart cities, healthcare, and digital marketing
- Hadoop Ecosystem: HDFS, YARN, MapReduce, HBase, Hive, Pig, SQOOP, Avro, Flume, Oozie, Chukwa
- Analytics languages: MATLAB, R, Python
- Visualization tools: Tableau, Power BI
Related Concepts
- Hadoop Ecosystem — Deep dive into storage, processing, and management layers
- MapReduce — Batch processing model and architecture
- YARN — Resource negotiation and job scheduling
- HDFS — Distributed file system design