General Introduction to Big Data
Chinese title: 大数据概述 ("Overview of Big Data")
This lecture introduces the definition, evolution, characteristics, types, and architecture of big data.
What is Big Data?
Historical Definitions
- 1997 (NASA): The first documented use of the term “big data” appeared in a NASA paper describing visualization problems: “We call this the problem of big data. When data sets do not fit in main memory (in core), or when they do not fit even on local disk, the most common solution is to acquire more resources.”
- 2008: Prominent computer scientists predicted that “big-data computing” will “transform the activities of companies, scientific researchers, medical practitioners, and our nation’s defense and intelligence operations.”
- Oxford English Dictionary (definition #1): “Data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges.”
- Oxford English Dictionary (definition #2): “Sets of information that are too large or too complex to handle, analyse or use with standard methods.”
- McKinsey (2011): “Datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.” This definition is intentionally subjective and incorporates a moving target.
Big Data Evolution
The size of large data stores grew from roughly 600 MB in the 1950s to about 100 petabytes by 2010.
What Leads to Big Data?
The proliferation of smart devices and connected systems has dramatically increased data generation:
- Ordinary phones → Smart phones
- Desktops → Cloud services and applications
- Ordinary cars → Smart driverless cars
- Smart homes → Smart cities and IoT
Big Data Characteristics — The 5 Vs
In 2001, industry analyst Doug Laney described the 3Vs (Volume, Velocity, Variety) as key data management challenges. Later, additional Vs were added:
| V | Meaning | Description |
|---|---|---|
| Volume | The size of the data | Large enough that it can’t be effectively stored, accessed, or processed locally |
| Velocity | The speed at which data is generated | Generated far more continuously than traditional data; covers both the rate of generation and the rate at which the data must be handled |
| Variety | Different types of data | Structured, unstructured, or semi-structured: numbers, text, images, video, audio, and more |
| Veracity | Data accuracy | Concerned with data provenance, ownership, quality, uncertainty, and trustworthiness |
| Value | Useful data | The knowledge that can be extracted through big data analytics |
Volume Challenges
- Storage of large data while maintaining integrity and security
- Rapid and critical access to such data
- Processing big data to retrieve information of interest
Velocity Example (Per Minute, 2015)
- 100,000+ tweets
- 695,000+ social status updates
- 11,000,000+ instant messages
- 700,000+ Google searches
- 170,000,000+ emails
- 1,820 TB created
- 220 new mobile users
Veracity Concerns
- 1 in 3 business leaders don’t trust the information they use to make critical decisions
- 90% of data is obsolete or stored on obsolete media (referred to as dark data)
Types of Big Data
Data may be machine-generated or human-generated:
- Human-generated: Outcome of human-machine interactions — emails, documents, Facebook posts
- Machine-generated: Generated by computer applications or hardware without active human intervention — sensors, disaster warning systems, weather forecasting systems, satellite data
Primitive Types
| Type | Description | Examples |
|---|---|---|
| Structured Data | Stored in relational databases in table format with rows and columns | Employee details, financial transactions |
| Unstructured Data | Raw, unorganized data that does not fit into relational database systems | Video, audio, images, emails, text files, social media posts |
| Semi-Structured Data | Has some organizational structure (e.g., tags or key-value pairs) but does not fit into relational tables | JSON, XML |
JSON
JavaScript Object Notation — a lightweight format for storing and transporting data, often used when data is sent from a server to a web page.
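As an illustrative sketch, the hypothetical employee record below shows how a JSON document is parsed with Python's standard `json` module; field and value names are invented for the example:

```python
import json

# A hypothetical semi-structured record: fields are named, but no
# relational table definition enforces the schema.
raw = '{"name": "Alice", "role": "analyst", "skills": ["SQL", "Python"]}'

record = json.loads(raw)       # parse JSON text into a Python dict
print(record["name"])          # access a field by key
print(len(record["skills"]))   # nested structures (arrays, objects) are allowed
```

Because the structure travels with the data itself, a consumer can read the record without knowing a table schema in advance.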
XML
eXtensible Markup Language — a markup language much like HTML, designed to store and transport data.
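The same hypothetical record can be expressed in XML and parsed with Python's standard `xml.etree.ElementTree` module; the tag names are invented for the example:

```python
import xml.etree.ElementTree as ET

# A hypothetical semi-structured record expressed as XML markup.
raw = """
<employee>
    <name>Alice</name>
    <role>analyst</role>
</employee>
"""

root = ET.fromstring(raw)      # parse XML text into an element tree
print(root.tag)                # the root element's tag name
print(root.find("name").text)  # navigate to a child element by tag
```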
Big Data Architecture
Big data systems are typically organized into four layers:
Data Source Layer
Before data is processed and analyzed, it must be captured from raw sources into big data systems. Examples include:
- Logs: Generated by web applications and servers for performance monitoring
- Transactional Data: Generated by eCommerce, banking, and financial applications
- Social Media: Data generated by social media platforms
- Databases: Structured data residing in relational databases
- Sensor Data: Generated by Internet of Things (IoT) systems
Data Aggregation Layer
Involves collecting raw data using data access connectors (wired and wireless). Key components:
- Publish-Subscribe Messaging: Communication model involving publishers, brokers, and consumers. Examples: Apache Kafka, Amazon Kinesis.
- Source-Sink Connectors: Efficiently collect, aggregate, and move data from various sources into a centralized data store. Example: Apache Flume.
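The publish-subscribe model above can be sketched with a toy in-memory broker; this is not how Kafka or Kinesis are implemented, just a minimal illustration of the publisher/broker/consumer roles, with all names invented:

```python
from collections import defaultdict

class Broker:
    """A toy broker: publishers send messages to named topics,
    and the broker delivers each message to every subscribed consumer."""

    def __init__(self):
        self.subscribers = defaultdict(list)  # topic -> list of callbacks

    def subscribe(self, topic, callback):
        # A consumer registers interest in a topic.
        self.subscribers[topic].append(callback)

    def publish(self, topic, message):
        # A publisher sends a message; the broker fans it out.
        for callback in self.subscribers[topic]:
            callback(message)

broker = Broker()
received = []
broker.subscribe("clicks", received.append)              # consumer side
broker.publish("clicks", {"user": 42, "page": "/home"})  # publisher side
print(received)
```

The key design property, which real systems like Kafka preserve at scale, is that publishers and consumers never reference each other directly; both know only the broker and the topic name.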
Data Preprocessing
An important process performed on raw data to transform it into an understandable format. Steps include:
| Step | Description |
|---|---|
| Data Integration | Combining data from different sources to give end users a unified data view |
| Data Cleaning | Detecting and resolving corrupt records, missing values, bad formatting |
| Data Reduction | Reducing the volume or dimension (number of attributes) of data |
| Data Transformation | Transforming or consolidating data into an appropriate format for management and analysis |
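The four steps above can be sketched on a small hypothetical dataset (invented records from two sources, one reporting temperatures in Fahrenheit):

```python
# Hypothetical raw records from two sources; source_b uses a different unit.
source_a = [{"id": 1, "temp_c": 21.5}, {"id": 2, "temp_c": None}]
source_b = [{"id": 3, "temp_f": 68.0}]

# Integration: combine both sources into one unified view (Celsius throughout).
integrated = [dict(r) for r in source_a]
for r in source_b:
    integrated.append({"id": r["id"], "temp_c": round((r["temp_f"] - 32) * 5 / 9, 1)})

# Cleaning: drop records with missing values.
cleaned = [r for r in integrated if r["temp_c"] is not None]

# Reduction and transformation: keep only the attribute of interest,
# in a format ready for analysis.
temps = [r["temp_c"] for r in cleaned]
print(temps)
```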
Data Analytics Layer
Analytics is not a new concept — techniques like regression analysis and machine learning have existed for years. Big data analytics focuses on extracting meaningful information using efficient algorithms.
Types of analytics applied to big data:
| Type | Purpose |
|---|---|
| Descriptive Analytics | What happened? |
| Diagnostic Analytics | Why did it happen? |
| Predictive Analytics | What will happen? |
| Prescriptive Analytics | What should we do? |
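The difference between descriptive and predictive analytics can be illustrated on a tiny invented series of daily sales figures (a naive extrapolation, not a production forecasting method):

```python
# Hypothetical daily sales figures.
sales = [100, 110, 120, 130]

# Descriptive: what happened? Summarize the past.
mean_sales = sum(sales) / len(sales)

# Predictive: what will happen? A naive linear extrapolation
# using the average day-over-day change.
avg_change = (sales[-1] - sales[0]) / (len(sales) - 1)
forecast = sales[-1] + avg_change

print(mean_sales, forecast)
```

Diagnostic and prescriptive analytics go further, explaining why the trend occurred and recommending an action, and typically require richer models than this sketch.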
Data Visualization Layer
Visualization completes the big data lifecycle by helping end users gain insights from the analyzed data:
- Static: Display stored analysis results from a serving database
- Dynamic: Results updated regularly with live widgets, plots, or gauges
- Interactive: Accept user inputs and display corresponding results
Recommended Textbooks
- Big Data Analytics: A Hands-On Approach — Arshdeep Bahga & Vijay Madisetti
  - ISBN: 978-1-949978-00-1
  - Website: www.hands-on-books-series.com
- Python Data Science Handbook — Jake VanderPlas
  - ISBN: 978-1-491-91205-8
- Learning Python — Mark Lutz
  - ISBN: 978-0-596-15806-4
Related Concepts
- Big Data 5Vs — Deep dive into Volume, Velocity, Variety, Veracity, Value
- Structured vs Unstructured Data — Data type comparison
- Big Data Architecture — Four-layer system overview
- Data Analytics Types — Descriptive, Diagnostic, Predictive, Prescriptive