AITC Wiki

Technical Aspects of Big Data

This lecture covers domain-specific big data applications and the core Hadoop ecosystem for storage, processing, and analytics.

Big Data Applications

Businesses across all industries have benefited from adopting big data solutions.

Cross-Industry Use Cases

  • Banking & Securities: Credit/debit card fraud detection, securities fraud warnings, credit risk reporting, customer data analytics
  • Healthcare: Storing patient data and analyzing it to detect medical ailments at an early stage
  • Marketing: Analyzing customer purchase history to reach the right customers for newly launched products
  • Web Analysis: Analyzing social media and search engine data to broadcast advertisements based on user interests
  • Call Centers: Identifying recurring problems and staff behavior patterns by capturing and processing call content
  • Agriculture: Biotechnology firms deploy sensors to optimize crop efficiency; big data systems analyze the sensor data
  • Smartphones: Facial recognition for unlocking phones and retrieving stored information about a person

Manufacturing Industry

  • Machine Diagnosis & Prognosis: Predicting machine performance by analyzing current operating conditions and deviations from normal conditions. Sensors monitor temperature and vibration levels. Diagnostic systems integrate with cloud-based storage and big data analytics backends.
  • Risk Analysis of Industrial Operations: Monitoring indoor air quality using gas sensors (CO, NO, NO₂). Big data systems analyze risks and identify hazardous zones.
  • Production Planning and Control: Systems measure various production process parameters and control the entire process in real-time using sensors. Big data analyzes this data for planning and identifying potential problems.

Civil Infrastructure

  • Structural Health Monitoring: Networks of sensors monitor vibration levels in bridges and buildings. Analyzing this data can detect cracks, locate damages, calculate remaining life, and provide advance warnings of imminent failures.

Transportation & Smart Cities

  • Smart Roads: Sensors provide information on driving conditions, travel time estimates, and alerts for poor conditions, traffic congestion, and accidents. Information is communicated via Internet to cloud-based analytics applications and disseminated to drivers.

Healthcare Industry

  • Epidemiological Surveillance: Studying distribution and determinants of health-related states in populations. Electronic Health Record (EHR) systems include laboratory results, diagnostic, treatment, and demographic data. Big data frameworks integrate multiple EHR systems for predicting outbreaks, population-level surveillance, disease detection, and public health mapping.

Finance Industry

  • Credit Risk Modeling: Scoring credit applications and predicting borrower default. Models are built from credit scores, credit history, account balances, transactions, and spending patterns. Big data systems compute credit risk scores for large numbers of customers regularly.
  • Fraud Detection: Detecting credit card fraud, money laundering, and insurance claim fraud. Real-time analytics frameworks label transactions in real-time. Machine learning models detect anomalies. Batch analytics searches historical data for fraud patterns.
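As a minimal illustration of the anomaly-detection idea above (not any specific production system), transactions far from the typical amount can be flagged with a robust median/MAD rule; the threshold of 3 and the sample amounts are assumptions made for this sketch:

```python
from statistics import median

def flag_anomalies(amounts, k=3.0):
    """Flag amounts whose distance from the median exceeds k times the
    median absolute deviation (MAD) - a simple, robust outlier rule."""
    med = median(amounts)
    mad = median(abs(x - med) for x in amounts)
    return [x for x in amounts if abs(x - med) > k * mad]

# Five ordinary card transactions and one suspicious one (made-up data)
txns = [100, 102, 98, 101, 99, 5000]
print(flag_anomalies(txns))  # [5000]
```

Real fraud pipelines combine many such signals with trained machine learning models; the median/MAD rule is only the simplest robust baseline.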

Digital Marketing

  • Content Recommendation: Applications collect user search patterns, browsing history, consumed content, and ratings. Big data systems recommend new content based on user preferences and interests.
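A toy illustration of preference-based recommendation: comparing users by cosine similarity over rating vectors. The rating numbers are made up, and real recommender systems are far more sophisticated:

```python
from math import sqrt

def cosine(u, v):
    # Cosine similarity between two rating vectors
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# Ratings for three items by the current user and two others (hypothetical)
user  = [5, 0, 3]
alice = [4, 1, 3]
bob   = [0, 5, 1]

# Content liked by the most similar user is a recommendation candidate
print(cosine(user, alice) > cosine(user, bob))  # True
```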

Environment

  • Weather Monitoring: Sensors collect temperature, humidity, and pressure data and send it to cloud-based applications. Data is analyzed and visualized for monitoring and generating weather alerts.

Big Data Technology

Big data technology refers to software frameworks and utilities designed to store, process, and analyze datasets too large and complex for a traditional RDBMS to handle. The key technologies include:

  • Hadoop
  • Hadoop Distributed File System (HDFS)
  • MapReduce
  • YARN

Apache Hadoop

An open-source framework, written in Java, that supports processing of large datasets across clusters of computers in a distributed computing environment. It can store structured, semi-structured, and unstructured data in a distributed file system (DFS) and process it in parallel.

The Hadoop ecosystem comprises four layers:

  • Data Storage: HDFS, HBase
  • Data Processing: MapReduce, YARN
  • Data Access: Hive, Pig, Mahout, SQOOP, Avro
  • Data Management: Flume, Oozie, Chukwa

HDFS (Hadoop Distributed File System)

  • Designed to store large datasets on clusters of low-cost commodity hardware
  • Does not require highly reliable expensive hardware
  • Data is stored in a write once, read many times pattern
  • Not suitable for applications requiring low latency access
  • HBase is a suitable alternative for low-latency applications

HBase

  • A column-oriented NoSQL database built on top of HDFS
  • Horizontally scalable, open-source, distributed
  • Does not require any predefined schema
  • Supports both structured and unstructured data
  • Provides real-time access to data in HDFS
  • Supports reading and writing data many times, unlike HDFS's write-once model

MapReduce

A batch-processing programming model adopting a divide-and-conquer principle.

  • Processes data in a parallel and distributed computing environment
  • Supports only batch workloads
  • Map task: splits and maps the data
  • Reduce task: shuffles and reduces the data
  • Not suitable for real-time processing and small data

MapReduce Architecture:

  • Input Splits: The input is divided into fixed-size pieces, each handled by one map task
  • Mapping: Each split is transformed into intermediate <key, value> pairs (e.g., a <word, 1> pair per word in a word count)
  • Shuffling & Sorting: Intermediate pairs from all mappers are consolidated and grouped by key
  • Reducer: The grouped values are aggregated into a single output value per key
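The four phases above can be sketched in plain Python as a toy word count; this simulates the model in one process and is not Hadoop's actual API:

```python
from itertools import groupby

def map_phase(line):
    # Map task: split the input and emit <word, 1> pairs
    return [(word.lower(), 1) for word in line.split()]

def reduce_phase(word, counts):
    # Reduce task: aggregate all counts for one word
    return (word, sum(counts))

# Two "input splits" (made-up sample lines)
lines = ["Big data needs big storage", "Big data needs parallel processing"]

# Mapping: every split produces intermediate key/value pairs
mapped = [pair for line in lines for pair in map_phase(line)]

# Shuffling & sorting: group the intermediate pairs by key
mapped.sort(key=lambda kv: kv[0])
grouped = {k: [v for _, v in g] for k, g in groupby(mapped, key=lambda kv: kv[0])}

# Reducing: one output value per word
counts = dict(reduce_phase(w, vs) for w, vs in grouped.items())
print(counts)  # {'big': 3, 'data': 2, ...}
```

In real Hadoop the splits, the shuffle, and the reducers run on different machines in parallel; the data flow, however, is exactly this.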

YARN (Yet Another Resource Negotiator)

Developed to overcome MapReduce architecture drawbacks. Allows using various data processing engines for batch, interactive, and real-time stream processing.

Core Components:

  • ResourceManager: Allocates cluster resources using the Scheduler and ApplicationManager
  • ApplicationMaster: Manages the job lifecycle by directing NodeManagers to create/destroy containers
  • NodeManager: Manages jobs on a specific node by creating/destroying containers

Hive

A tool to process structured data in the Hadoop environment. Provides a platform to write SQL-like scripts that are executed as MapReduce operations.

  • Query language: HQL (Hive Query Language, also written HiveQL)
  • Has a DDL similar to SQL DDL for creating, deleting, or altering schema objects
  • Data organized into: Tables, Partitions, and Buckets
  • Partitions speed up query performance by avoiding full table scans
  • Buckets further divide partitions based on hash of a column
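Bucket assignment can be illustrated in Python. Hive uses its own column-hash function; CRC32 here is a deterministic stand-in (Python's built-in hash() is salted per process and would give different buckets on each run):

```python
import zlib

def bucket_for(value, num_buckets=4):
    # Hash the column value and take it modulo the bucket count,
    # mirroring how a bucketed table assigns a row to one of N buckets.
    return zlib.crc32(str(value).encode("utf-8")) % num_buckets

print(bucket_for("user_42"))  # the same value always lands in the same bucket
```

Because the mapping is deterministic, a query that filters on the bucketed column only needs to scan one bucket instead of the whole partition.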

Pig

A high-level programming language for analyzing large datasets. Two components: Pig Latin (language) and the execution environment.

  • Can handle structured, semi-structured, and unstructured data
  • Programmers without Java knowledge can perform MapReduce tasks
  • Internally, Pig Latin scripts are converted into MapReduce jobs

Internal Process:

  1. Parser checks syntax
  2. Optimizer carries out logical optimization
  3. Compiler compiles into MapReduce jobs
  4. Execution Engine submits jobs to Hadoop

SQOOP (SQL to Hadoop)

  • Transfers structured data from an RDBMS into HDFS when the dataset outgrows what the RDBMS can handle
  • Can also move data from relational databases to HBase
  • Final analysis results can be exported back to the database

Avro

An open-source data serialization framework.

  • Translates in-memory data into a binary or textual format for transport or storage
  • Designed to overcome the limited language portability of Hadoop’s native Java serialization
  • Produces a data format that can be read and written from many programming languages
  • Schemas are defined in JSON
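An example of an Avro schema defined in JSON; the record and field names here are illustrative, not from any particular system:

```json
{
  "type": "record",
  "name": "Employee",
  "namespace": "example.avro",
  "fields": [
    {"name": "id",     "type": "int"},
    {"name": "name",   "type": "string"},
    {"name": "salary", "type": ["null", "double"], "default": null}
  ]
}
```

The union type `["null", "double"]` marks `salary` as optional; any language with an Avro library can read records written against this schema.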

Flume

A distributed and reliable tool for collecting large amounts of streaming data from multiple sources.

  • Ingests streaming data (sensors, social media, log files) into HDFS
  • Difference from SQOOP: SQOOP handles structured data; Flume handles streaming data

Oozie

An open-source workflow management engine and scheduler for Hadoop.

Three types of jobs:

  • Workflow: Represented as DAGs; run on demand
  • Coordinator: Scheduled to execute periodically based on frequency or data availability
  • Bundle: A collection of coordinator jobs managed as a single job

Chukwa

An open-source data collection system for monitoring large distributed systems.

Four components:

  • Agents: Run on each machine and emit data
  • Collectors: Receive data from agents and write to stable storage
  • MapReduce jobs: Parse and archive data
  • HICC (Hadoop Infrastructure Care Center): Web-portal interface for displaying data

Big Data Analytics Technology

MATLAB

A high-performance language for technical computing integrating computation, visualization, and programming.

Typical uses: math and computation, algorithm development, modeling/simulation/prototyping, data analysis/visualization, scientific/engineering graphics, application development.

R

An open-source programming language optimized for statistical analysis and data visualization. Commonly used within RStudio IDE.

Libraries and tools support: cleansing and prepping data, creating visualizations, and training and evaluating machine learning and deep learning algorithms.

Python

An open-source, general-purpose programming language applicable anywhere data is used; it has become a de facto standard language for data analytics.

Key Libraries:

  • Pandas: Data manipulation and analysis
  • NumPy: Numerical computing
  • SciPy: Scientific computing
  • Matplotlib: Visualization
  • scikit-learn: Machine learning
  • TensorFlow: Deep learning
  • Theano: Numerical computation
  • Keras: Neural networks
  • PyTorch: Deep learning
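A small sketch of a typical Pandas workflow, grouping and aggregating tabular data; the column names and figures are made up for illustration:

```python
import pandas as pd

# Hypothetical sales records for a quick group-and-aggregate analysis
df = pd.DataFrame({
    "region": ["North", "North", "South", "South"],
    "sales":  [100, 150, 200, 50],
})

# Total and mean sales per region - a typical first analytics step
totals = df.groupby("region")["sales"].sum()
means = df.groupby("region")["sales"].mean()
print(totals["North"], means["South"])  # 250 125.0
```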

Big Data Visualization Technology

Big data visualization translates information into visual context (maps, graphs) to make data easier to understand and identify patterns, trends, and outliers.

  • Power BI: Microsoft’s business analytics service for analyzing, visualizing, extracting insights, and sharing data across organizations
  • Tableau: A Business Intelligence tool that manages data flow, turns data into actionable information, and creates interactive visualizations

Summary

  • Domain-specific applications in manufacturing, finance, environment, smart cities, healthcare, and digital marketing
  • Hadoop Ecosystem: HDFS, YARN, MapReduce, HBase, Hive, Pig, SQOOP, Avro, Flume, Oozie, Chukwa
  • Analytics languages: MATLAB, R, Python
  • Visualization tools: Tableau, Power BI
