AITC Wiki

General Introduction to Big Data

This lecture introduces the definition, evolution, characteristics, types, and architecture of big data.

What is Big Data?

Historical Definitions

  • 1997 (NASA): The first documented use of the term “big data” appeared in a NASA paper describing visualization problems: “We call this the problem of big data. When data sets do not fit in main memory (in core), or when they do not fit even on local disk, the most common solution is to acquire more resources.”
  • 2008: Prominent computer scientists predicted that “big-data computing” will “transform the activities of companies, scientific researchers, medical practitioners, and our nation’s defence and intelligence operations.”
  • Oxford English Dictionary (definition #1): “Data of a very large size, typically to the extent that its manipulation and management present significant logistical challenges.”
  • Oxford English Dictionary (definition #2): “Sets of information that are too large or too complex to handle, analyse or use with standard methods.”
  • McKinsey (2011): “Datasets whose size is beyond the ability of typical database software tools to capture, store, manage, and analyze.” This definition is intentionally subjective and incorporates a moving target.

Big Data Evolution

Data size grew from just 600 MB in the 1950s to 100 petabytes by 2010.

What Leads to Big Data?

The proliferation of smart devices and connected systems has dramatically increased data generation:

  • Ordinary phones → Smartphones
  • Desktops → Cloud services and applications
  • Ordinary cars → Smart driverless cars
  • Smart homes → Smart cities and IoT

Big Data Characteristics — The 5 Vs

In 2001, industry analyst Doug Laney described the 3Vs (Volume, Velocity, Variety) as key data management challenges. Later, additional Vs were added:

| V | Meaning | Description |
|---|---|---|
| Volume | The size of the data | Large enough that it can’t be effectively stored, accessed, or processed locally |
| Velocity | The speed at which data is generated | Big data is produced more continuously than small data; velocity covers both how often data is generated and how quickly it must be handled |
| Variety | Different types of data | Structured, unstructured, or semi-structured: numbers, text, images, video, audio, and more |
| Veracity | Data accuracy | Concerned with data provenance, ownership, quality, uncertainty, and trustworthiness |
| Value | Useful data | The knowledge that can be extracted through big data analytics |

Volume Challenges

  • Storage of large data while maintaining integrity and security
  • Rapid and critical access to such data
  • Processing big data to retrieve information of interest
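
The processing challenge is often approached by streaming: reading the data in pieces so that only a small part is in memory at any time. The following is a minimal sketch of that idea, not a method from the lecture. The `readings` generator stands in for a hypothetical data source, and the chunk size is an arbitrary illustrative choice.

```python
from itertools import islice


def readings(n):
    """Hypothetical data source: yields n numeric readings lazily."""
    for i in range(n):
        yield i % 100


def chunked(iterable, size):
    """Yield successive lists of at most `size` items from an iterable."""
    iterator = iter(iterable)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def streaming_mean(source, chunk_size=1000):
    """Compute the mean while holding only one chunk in memory."""
    total = 0
    count = 0
    for chunk in chunked(source, chunk_size):
        total += sum(chunk)
        count += len(chunk)
    return total / count if count else 0.0


mean_value = streaming_mean(readings(1_000_000), chunk_size=10_000)
print(mean_value)
```

Because the source is a generator, the one million readings are never materialized as a single list; the same pattern applies when the source is a file or a network stream.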

Velocity Example (Per Minute, 2015)

  • 100,000+ tweets
  • 695,000+ social status updates
  • 11,000,000+ instant messages
  • 700,000+ Google searches
  • 170,000,000+ emails
  • 1,820 TB created
  • 220 new mobile users
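
To get a feel for these rates, the per-minute figures can be converted to per-second and per-day values. The sketch below uses only the numbers listed above, treated as lower bounds since each is quoted with a "+"; the variable names are illustrative.

```python
# Per-minute figures quoted above (2015), treated as lower bounds.
per_minute = {
    "tweets": 100_000,
    "social status updates": 695_000,
    "instant messages": 11_000_000,
    "Google searches": 700_000,
    "emails": 170_000_000,
}

SECONDS_PER_MINUTE = 60
MINUTES_PER_DAY = 24 * 60

per_second = {name: count / SECONDS_PER_MINUTE for name, count in per_minute.items()}
per_day = {name: count * MINUTES_PER_DAY for name, count in per_minute.items()}

for name in per_minute:
    print(f"{name}: {per_second[name]:,.0f} per second, {per_day[name]:,} per day")

# Data volume: 1,820 TB per minute expressed per day.
tb_per_day = 1_820 * MINUTES_PER_DAY
print(f"{tb_per_day:,} TB per day")
```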

Veracity Concerns

  • 1 in 3 business leaders don’t trust the information they use to make critical decisions
  • 90% of data is obsolete or stored on obsolete media (referred to as dark data)

Types of Big Data

Data may be machine-generated or human-generated:

  • Human-generated: Outcome of human-machine interactions — emails, documents, Facebook posts
  • Machine-generated: Generated by computer applications or hardware without active human intervention — sensors, disaster warning systems, weather forecasting systems, satellite data

Primitive Types

| Type | Description | Examples |
|---|---|---|
| Structured Data | Stored in relational databases in table format with rows and columns | Employee details, financial transactions |
| Unstructured Data | Raw, unorganized data that does not fit into relational database systems | Video, audio, images, emails, text files, social media posts |
| Semi-Structured Data | Has some structure but does not fit into relational databases | JSON, XML |

JSON

JavaScript Object Notation — a lightweight format for storing and transporting data, often used when data is sent from a server to a web page.
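
As a small illustration, the sketch below parses a JSON string with Python's standard `json` module and serializes it back. The record itself (a sensor with an id, a name, and readings) is a made-up example, not data from the lecture.

```python
import json

# Hypothetical record, as a server might send it to a web page.
payload = '{"id": 7, "name": "sensor-a", "readings": [21.5, 22.0], "active": true}'

record = json.loads(payload)          # JSON text -> Python dict
print(record["name"], record["readings"])

record["readings"].append(22.4)       # nested values can be edited like normal Python data
print(json.dumps(record, indent=2))   # Python dict -> JSON text
```

The keys and nesting are carried by the data itself rather than by a fixed table schema, which is why JSON is classed as semi-structured.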

XML

eXtensible Markup Language — a markup language much like HTML, designed to store and transport data.
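
For comparison, here is a minimal sketch that reads a small XML document with Python's standard `xml.etree.ElementTree` module. The `sensor` document and its element names are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML document describing one sensor.
document = """
<sensor id="7">
  <name>sensor-a</name>
  <reading unit="C">21.5</reading>
  <reading unit="C">22.0</reading>
</sensor>
""".strip()

root = ET.fromstring(document)            # parse text into an element tree
sensor_id = root.get("id")                # attribute on the root element
name = root.findtext("name")              # text of the first <name> child
values = [float(r.text) for r in root.findall("reading")]
print(sensor_id, name, values)
```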

Big Data Architecture

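Big data systems are typically organized into four layers (source, aggregation, analytics, and visualization), with a preprocessing stage that prepares raw data for analytics: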

Data Source Layer

Before data is processed and analyzed, it must be captured from raw sources into big data systems. Examples include:

  • Logs: Generated by web applications and servers for performance monitoring
  • Transactional Data: Generated by eCommerce, banking, and financial applications
  • Social Media: Data generated by social media platforms
  • Databases: Structured data residing in relational databases
  • Sensor Data: Generated by Internet of Things (IoT) systems

Data Aggregation Layer

This layer collects raw data using data access connectors (wired and wireless). Key components:

  • Publish-Subscribe Messaging: Communication model involving publishers, brokers, and consumers. Examples: Apache Kafka, Amazon Kinesis.
  • Source-Sink Connectors: Efficiently collect, aggregate, and move data from various sources into a centralized data store. Example: Apache Flume.
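
The publish-subscribe model can be sketched with a toy in-memory broker. This is only an illustration of the roles (publisher, broker, consumer); the `Broker` class and topic names are hypothetical and do not reflect the API of Apache Kafka or Amazon Kinesis.

```python
from collections import defaultdict


class Broker:
    """Toy in-memory broker: routes each message to the topic's subscribers."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic, callback):
        """Register a consumer callback for a topic."""
        self._subscribers[topic].append(callback)

    def publish(self, topic, message):
        """Deliver a message to every consumer subscribed to the topic."""
        for callback in self._subscribers[topic]:
            callback(message)


broker = Broker()
received_logs = []
received_alerts = []

# Two consumers on the same topic: one stores everything, one keeps only errors.
broker.subscribe("logs", received_logs.append)
broker.subscribe("logs", lambda m: received_alerts.append(m) if "ERROR" in m else None)

# Publishers only know the topic name, not who is consuming.
broker.publish("logs", "INFO service started")
broker.publish("logs", "ERROR disk full")
broker.publish("metrics", "cpu=0.93")  # no subscribers: message is dropped

print(received_logs)
print(received_alerts)
```

The point of the pattern is decoupling: publishers and consumers never reference each other directly. Production brokers additionally store messages durably and scale across machines, which this sketch omits.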

Data Preprocessing

Preprocessing transforms raw data into a consistent, understandable format before analysis. Steps include:

| Step | Description |
|---|---|
| Data Integration | Combining data from different sources to give end users a unified data view |
| Data Cleaning | Detecting and resolving corrupt records, missing values, bad formatting |
| Data Reduction | Reducing the volume or dimension (number of attributes) of data |
| Data Transformation | Transforming or consolidating data into an appropriate format for management and analysis |
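
The four steps can be sketched on a tiny example using plain Python. The two data sources, the field names, and the choice of min-max scaling as the transformation are all assumptions made for illustration; real pipelines would typically use dedicated tools.

```python
# Two hypothetical sources describing the same customers.
profiles = [
    {"id": 1, "name": " Alice ", "age": "34"},
    {"id": 2, "name": "Bob", "age": ""},        # missing value
    {"id": 3, "name": "Carol", "age": "n/a"},   # bad formatting
]
purchases = {1: 120.0, 2: 80.0, 3: 200.0}  # id -> total spend


def integrate(profiles, purchases):
    """Data integration: join both sources on the shared id."""
    return [{**p, "spend": purchases.get(p["id"])} for p in profiles]


def clean(records):
    """Data cleaning: trim names, turn unparseable ages into None."""
    cleaned = []
    for r in records:
        age = r["age"].strip()
        cleaned.append({**r, "name": r["name"].strip(),
                        "age": int(age) if age.isdigit() else None})
    return cleaned


def reduce_attributes(records, keep):
    """Data reduction: keep only the attributes needed downstream."""
    return [{k: r[k] for k in keep} for r in records]


def transform(records):
    """Data transformation: rescale spend to the range [0, 1]."""
    spends = [r["spend"] for r in records]
    low, high = min(spends), max(spends)
    span = (high - low) or 1.0  # avoid dividing by zero when all values match
    return [{**r, "spend": (r["spend"] - low) / span} for r in records]


result = transform(reduce_attributes(clean(integrate(profiles, purchases)),
                                     keep=("id", "age", "spend")))
for row in result:
    print(row)
```

Each function maps to one row of the table, and chaining them shows the usual order: combine sources first, then fix bad values, drop what is not needed, and finally reshape for analysis.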

Data Analytics Layer

Analytics is not a new concept — techniques like regression analysis and machine learning have existed for years. Big data analytics focuses on extracting meaningful information using efficient algorithms.

Types of analytics applied to big data:

| Type | Purpose |
|---|---|
| Descriptive Analytics | What happened? |
| Diagnostic Analytics | Why did it happen? |
| Predictive Analytics | What will happen? |
| Prescriptive Analytics | What should we do? |
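
Two of these types can be contrasted on a small, made-up dataset: an average summarizes what happened (descriptive), and a least-squares line extrapolates what may happen next (predictive). The monthly sales figures and the `fit_line` helper are hypothetical, and a straight-line fit is just one simple stand-in for the regression techniques mentioned above.

```python
from statistics import mean

# Hypothetical monthly sales figures (month index -> units sold).
months = [1, 2, 3, 4, 5]
sales = [10.0, 12.0, 14.0, 16.0, 18.0]

# Descriptive analytics: summarize what happened.
average_sales = mean(sales)


def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    slope = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
             / sum((x - x_bar) ** 2 for x in xs))
    return slope, y_bar - slope * x_bar


# Predictive analytics: estimate what will happen in month 6.
slope, intercept = fit_line(months, sales)
forecast = slope * 6 + intercept
print(average_sales, slope, intercept, forecast)
```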

Data Visualization Layer

Visualization completes the big data lifecycle by helping end users gain insights from the results. Three kinds are common:

  • Static: Display stored analysis results from a serving database
  • Dynamic: Results updated regularly with live widgets, plots, or gauges
  • Interactive: Accept user inputs and display corresponding results

Sources

  1. Big Data Analytics: A Hands-On Approach, by Arshdeep Bahga and Vijay Madisetti
  2. Python Data Science Handbook, by Jake VanderPlas
    • ISBN: 978-1-491-91205-8
  3. Learning Python, by Mark Lutz
    • ISBN: 978-0-596-15806-4