General Introduction to Big Data
Chinese title: 大数据概述 ("Overview of Big Data")
This lecture introduces the definition, evolution, characteristics, types, and architecture of big data.
What is Big Data?
Historical Definitions
- 1997 (NASA): The first documented use of the term “big data” appeared in a NASA paper describing visualization problems: “We call this the problem of big data. When data sets do not fit in main memory (in core), or when they do not fit even on local disk, the most common solution is to acquire more resources.”
- 2008: Prominent computer scientists predicted that “big-data computing” will “transform the activities of companies, scientific researchers, medical practitioners, and our nation’s defense and intelligence operations.”
- Oxford English Dictionary (definition #1): “Data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges.”
- Oxford English Dictionary (definition #2): “Sets of information that are too large or too complex to handle, analyse or use with standard methods.”
- McKinsey (2011): “Datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.” This definition is intentionally subjective and incorporates a moving target.
Big Data Evolution
The size of large data stores grew from roughly 600 MB in the 1950s to about 100 petabytes by 2010.
What Leads to Big Data?
The proliferation of smart devices and connected systems has dramatically increased data generation:
- Ordinary phones → Smart phones
- Desktops → Cloud services and applications
- Ordinary cars → Smart driverless cars
- Smart homes → Smart cities and IoT
Big Data Characteristics — The 5 Vs
In 2001, industry analyst Doug Laney described the 3Vs (Volume, Velocity, Variety) as key data management challenges. Later, additional Vs were added:
| V | Meaning | Description |
|---|---|---|
| Volume | The size of the data | Large enough that it can’t be effectively stored, accessed, or processed locally |
| Velocity | The speed at which data is generated | Generated far more continuously than traditional data; covers both the rate of generation and the rate at which the data must be handled |
| Variety | Different types of data | Structured, unstructured, or semi-structured: numbers, text, images, video, audio, and more |
| Veracity | Data accuracy | Concerned with data provenance, ownership, quality, uncertainty, and trustworthiness |
| Value | Useful data | The knowledge that can be extracted through big data analytics |
Volume Challenges
- Storage of large data while maintaining integrity and security
- Rapid and critical access to such data
- Processing big data to retrieve information of interest
Velocity Example (Per Minute, 2015)
- 100,000+ tweets
- 695,000+ social status updates
- 11,000,000+ instant messages
- 700,000+ Google searches
- 170,000,000+ emails
- 1,820 TB created
- 220 new mobile users
Veracity Concerns
- 1 in 3 business leaders don’t trust the information they use to make critical decisions
- 90% of data is obsolete or stored on obsolete media (referred to as dark data)
Types of Big Data
Data may be machine-generated or human-generated:
- Human-generated: Outcome of human-machine interactions — emails, documents, Facebook posts
- Machine-generated: Generated by computer applications or hardware without active human intervention — sensors, disaster warning systems, weather forecasting systems, satellite data
Primitive Types
| Type | Description | Examples |
|---|---|---|
| Structured Data | Stored in relational databases in table format with rows and columns | Employee details, financial transactions |
| Unstructured Data | Raw, unorganized data that does not fit into relational database systems | Video, audio, images, emails, text files, social media posts |
| Semi-Structured Data | Has some organizational structure (e.g., tags or key-value pairs) but does not fit into relational tables | JSON, XML |
JSON
JavaScript Object Notation — a lightweight format for storing and transporting data, often used when data is sent from a server to a web page.
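As an illustrative sketch, the hypothetical employee record below shows how a JSON document is parsed with Python's standard `json` module; field and value names are invented for the example:

```python
import json

# A hypothetical semi-structured record: fields are named, but no
# relational table definition enforces the schema.
raw = '{"name": "Alice", "role": "analyst", "skills": ["SQL", "Python"]}'

record = json.loads(raw)       # parse JSON text into a Python dict
print(record["name"])          # access a field by key
print(len(record["skills"]))   # nested structures (arrays, objects) are allowed
```

Because the structure travels with the data itself, a consumer can read the record without knowing a table schema in advance.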
XML
eXtensible Markup Language — a markup language much like HTML, designed to store and transport data.
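The same hypothetical record can be expressed in XML and parsed with Python's standard `xml.etree.ElementTree` module; the tag names are invented for the example:

```python
import xml.etree.ElementTree as ET

# A hypothetical semi-structured record expressed as XML markup.
raw = """
<employee>
    <name>Alice</name>
    <role>analyst</role>
</employee>
"""

root = ET.fromstring(raw)      # parse XML text into an element tree
print(root.tag)                # the root element's tag name
print(root.find("name").text)  # navigate to a child element by tag
```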
Big Data Architecture
Big data systems are typically organized into four layers:
Data Source Layer
Before data is processed and analyzed, it must be captured from raw sources into big data systems. Examples include:
- Logs: Generated by web applications and servers for performance monitoring
- Transactional Data: Generated by eCommerce, banking, and financial applications
- Social Media: Data generated by social media platforms
- Databases: Structured data residing in relational databases
- Sensor Data: Generated by Internet of Things (IoT) systems
Data Aggregation Layer
Involves collecting raw data using data access connectors (wired and wireless). Key components:
- Publish-Subscribe Messaging: Communication model involving publishers, brokers, and consumers. Examples: Apache Kafka, Amazon Kinesis.
- Source-Sink Connectors: Efficiently collect, aggregate, and move data from various sources into a centralized data store. Example: Apache Flume.
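The publish-subscribe model above can be sketched with a toy in-memory broker; this is not how Kafka or Kinesis are implemented, just a minimal illustration of the publisher/broker/consumer roles, with all names invented:

```python
from collections import defaultdict

class Broker:
    """A toy broker: publishers send messages to named topics,
    and the broker delivers each message to every subscribed consumer."""

    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        # A consumer registers interest in a topic.
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        # A publisher sends a message; the broker fans it out.
        for callback in self.subscribers[topic]:
            callback(message)

broker = Broker()
received = []
broker.subscribe("clicks", received.append)              # consumer side
broker.publish("clicks", {"user": 42, "page": "/home"})  # publisher side
print(received)
```

The key design property, which real systems like Kafka preserve at scale, is that publishers and consumers never reference each other directly; both know only the broker and the topic name.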
Data Preprocessing
An important process performed on raw data to transform it into an understandable format. Steps include:
| Step | Description |
|---|---|
| Data Integration | Combining data from different sources to give end users a unified data view |
| Data Cleaning | Detecting and resolving corrupt records, missing values, bad formatting |
| Data Reduction | Reducing the volume or dimension (number of attributes) of data |
| Data Transformation | Transforming or consolidating data into an appropriate format for management and analysis |
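The four steps above can be sketched on a small hypothetical dataset (invented records from two sources, one reporting temperatures in Fahrenheit):

```python
# Hypothetical raw records from two sources; source_b uses a different unit.
source_a = [{"id": 1, "temp_c": 21.5}, {"id": 2, "temp_c": None}]
source_b = [{"id": 3, "temp_f": 68.0}]

# Integration: combine both sources into one unified view (Celsius throughout).
integrated = [dict(r) for r in source_a]
for r in source_b:
    integrated.append({"id": r["id"], "temp_c": round((r["temp_f"] - 32) * 5 / 9, 1)})

# Cleaning: drop records with missing values.
cleaned = [r for r in integrated if r["temp_c"] is not None]

# Reduction and transformation: keep only the attribute of interest,
# in a format ready for analysis.
temps = [r["temp_c"] for r in cleaned]
print(temps)
```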
Data Analytics Layer
Analytics is not a new concept — techniques like regression analysis and machine learning have existed for years. Big data analytics focuses on extracting meaningful information using efficient algorithms.
Types of analytics applied to big data:
| Type | Purpose |
|---|---|
| Descriptive Analytics | What happened? |
| Diagnostic Analytics | Why did it happen? |
| Predictive Analytics | What will happen? |
| Prescriptive Analytics | What should we do? |
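The difference between descriptive and predictive analytics can be illustrated on a tiny invented series of daily sales figures (a naive extrapolation, not a production forecasting method):

```python
# Hypothetical daily sales figures.
sales = [100, 110, 120, 130]

# Descriptive: what happened? Summarize the past.
mean_sales = sum(sales) / len(sales)

# Predictive: what will happen? A naive linear extrapolation
# using the average day-over-day change.
avg_change = (sales[-1] - sales[0]) / (len(sales) - 1)
forecast = sales[-1] + avg_change

print(mean_sales, forecast)
```

Diagnostic and prescriptive analytics go further, explaining why the trend occurred and recommending an action, and typically require richer models than this sketch.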
Data Visualization Layer
Visualization completes the big data lifecycle by helping end users gain insights from the analyzed data:
- Static: Display stored analysis results from a serving database
- Dynamic: Results updated regularly with live widgets, plots, or gauges
- Interactive: Accept user inputs and display corresponding results
Recommended Textbooks
- Big Data Analytics: A Hands-On Approach — Arshdeep Bahga & Vijay Madisetti
  - ISBN: 978-1-949978-00-1
  - Website: www.hands-on-books-series.com
- Python Data Science Handbook — Jake VanderPlas
  - ISBN: 978-1-491-91205-8
- Learning Python — Mark Lutz
  - ISBN: 978-0-596-15806-4
Related Concepts
- Big Data 5Vs — Deep dive into Volume, Velocity, Variety, Veracity, Value
- Structured vs Unstructured Data — Data type comparison
- Big Data Architecture — Four-layer system overview
- Data Analytics Types — Descriptive, Diagnostic, Predictive, Prescriptive