Storage and Treatment of Big Data

Chinese version: Storage and Treatment of Big Data (大数据的存储与处理)

This lecture covers two pillars of big data systems: how data is stored (relational vs. non-relational databases) and how computing is provisioned (cloud deployment and service models).

Lecture Overview

This lecture bridges raw big data technologies and practical implementation. We focus on:

Database paradigms — Relational (SQL) vs. Non-relational (NoSQL), and when to use each.
Cloud computing — Deployment models (Public, Private, Hybrid) and service models (SaaS, PaaS, IaaS).

Relational Database (SQL)

A relational database organizes data into tables (rows and columns) and defines relationships between them using foreign keys. It uses SQL (Structured Query Language) for querying and manipulation.

Key Characteristics

Structured schema: Data must conform to predefined tables and types.
ACID compliance: Guarantees Atomicity, Consistency, Isolation, and Durability.
Strong relationships: Tables are linked via primary/foreign keys.
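
The characteristics above can be illustrated with a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and the place_orders helper are assumptions made up for this example. It shows a foreign key linking two tables and a transaction that is applied in full or not at all (atomicity).

```python
import sqlite3

# In-memory database; all table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customers(id),"
    " amount REAL NOT NULL)"
)
conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Tom')")
conn.commit()


def place_orders(rows):
    """Insert all rows in one transaction; True if committed, False if rolled back."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.executemany(
                "INSERT INTO orders (customer_id, amount) VALUES (?, ?)", rows
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = place_orders([(1, 19.99), (1, 5.00)])
# The second row references a customer that does not exist, so the whole
# batch is rolled back, including the valid first row.
failed = place_orders([(1, 7.50), (999, 1.00)])
order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
print(ok, failed, order_count)
```

Only the two orders from the first batch remain in the table, because the second batch violated the foreign key constraint and was undone as a unit.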

Common Systems

System	Typical Use Case
MySQL	Web applications, LAMP stack
PostgreSQL	Complex queries, geospatial data
Microsoft SQL Server	Enterprise Windows environments
Oracle	Large-scale enterprise systems
Microsoft Access	Small desktop applications

When to Use

Data integrity and complex transactions are critical.
Relationships between entities are well-defined and stable.
You need powerful querying (JOINs, aggregations, window functions).
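
As a rough illustration of the querying point, the sketch below joins two tables and aggregates per customer, again using sqlite3. The schema and sample rows are invented for the example, and window functions are left out to keep it short.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Tom'), (2, 'Ana');
    INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 5.0), (3, 2, 12.5);
""")

# JOIN links each order to its customer; GROUP BY aggregates per customer.
totals = conn.execute("""
    SELECT c.name, COUNT(o.id) AS n_orders, SUM(o.amount) AS total
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.id, c.name
    ORDER BY total DESC
""").fetchall()
print(totals)
```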

Warning

RDBMS can struggle with horizontal scalability when data volume or write throughput grows beyond a single server’s capacity. This is where NoSQL becomes relevant.

Non-Relational Database (NoSQL)

NoSQL databases were designed to overcome the scalability and performance limitations of traditional RDBMS when handling unstructured or semi-structured big data. Many of them relax strict ACID guarantees, often settling for eventual consistency, in exchange for horizontal scalability and schema flexibility.

Four Main Types

Type	Data Model	Best For	Examples
Key-Value Store	Dictionary of key-value pairs	Caching, session management, simple lookups	Redis, DynamoDB, Riak, Memcached
Column-Store	Data stored by columns rather than rows	Analytical queries, OLAP, fast bulk reads	HBase, Cassandra, Bigtable
Document Store	Flexible documents (JSON, XML, BSON)	Content management, user profiles, catalogs	MongoDB, CouchDB, DocumentDB
Graph Database	Nodes, edges, and properties	Social networks, recommendation engines, fraud detection	Neo4j, InfiniteGraph, OrientDB

Key-Value Store

The simplest NoSQL model. Each item is stored as an attribute name (key) together with its value.

# Conceptual example: each session ID (key) maps to one session record (value)
user_session = {
    "session_001": {"user_id": 42, "login_time": "2025-04-20T09:00:00Z"},
    "session_002": {"user_id": 7,  "login_time": "2025-04-20T09:15:00Z"}
}
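
To make the session-management use case more concrete, here is a minimal in-memory sketch of a key-value store with per-key expiry. The KeyValueStore class and its put/get/delete methods are hypothetical names chosen for this example and do not mirror the API of Redis, DynamoDB, or any other product.

```python
import time


class KeyValueStore:
    """Toy in-memory key-value store with optional per-key expiry (hypothetical API)."""

    def __init__(self, clock=time.monotonic):
        self._data = {}      # key -> (value, expires_at or None)
        self._clock = clock  # injectable so expiry can be shown without sleeping

    def put(self, key, value, ttl=None):
        expires_at = None if ttl is None else self._clock() + ttl
        self._data[key] = (value, expires_at)

    def get(self, key, default=None):
        entry = self._data.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._data[key]  # lazily evict expired entries on read
            return default
        return value

    def delete(self, key):
        """Remove a key; return True if it was present."""
        return self._data.pop(key, None) is not None


# Usage with a fake clock (seconds) so the example is deterministic.
now = [0.0]
store = KeyValueStore(clock=lambda: now[0])
store.put("session_001", {"user_id": 42}, ttl=30)
before = store.get("session_001")
now[0] = 31.0  # 31 seconds later the session has expired
after = store.get("session_001")
print(before, after)
```

Lookups go through the key only; the store never inspects the value, which is what keeps this model simple and fast.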

Info

Amazon DynamoDB is a fully managed key-value and document database offered by AWS. Redis is an in-memory key-value store that is widely used for caching.

Column-Store Database

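Row-oriented databases, typical of OLTP workloads, store each record contiguously. Column-stores instead keep the values of each column together, which suits OLAP workloads: an analytical query reads only the columns it references, often a small fraction of the table. Note that wide-column stores such as HBase and Cassandra group data into column families and are tuned for high write throughput, whereas columnar analytical engines such as ClickHouse store each column contiguously; the scan savings in the example below apply most directly to the latter.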

Example: A table with 100 million rows and 100 columns (100 GB total, so about 1 GB per column if the columns are of similar size).

Query: What is the average age of males?

Row-wise DB must read: 100 GB

Columnar DB reads only age + gender columns: 2 GB
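
The arithmetic behind these figures can be sketched as follows, assuming all columns have the same size and the query is a full scan with no index. The function name and the three sample records are made up for illustration.

```python
def bytes_scanned(total_bytes, n_columns, columns_needed, columnar):
    """Estimate bytes read by a full scan, assuming equally sized columns."""
    if not columnar:
        return total_bytes  # row layout: every block holds whole rows
    return total_bytes * columns_needed // n_columns


GB = 10**9
row_read = bytes_scanned(100 * GB, 100, 2, columnar=False)
col_read = bytes_scanned(100 * GB, 100, 2, columnar=True)
print(row_read // GB, col_read // GB)  # 100 GB versus 2 GB

# The same idea on a tiny table: pivot row records into one list per column,
# then answer "average age of males" by touching only two of the columns.
rows = [
    {"name": "Ana", "gender": "F", "age": 31},
    {"name": "Bo", "gender": "M", "age": 40},
    {"name": "Cy", "gender": "M", "age": 20},
]
columns = {key: [r[key] for r in rows] for key in rows[0]}
male_ages = [a for g, a in zip(columns["gender"], columns["age"]) if g == "M"]
avg_male_age = sum(male_ages) / len(male_ages)
print(avg_male_age)
```

Real columnar engines also compress each column, so the observed savings can differ from this simple proportion.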

Aspect	Row-Oriented (OLTP)	Column-Oriented (OLAP)
Storage pattern	Whole row in same disk block	Each column in separate blocks
Best for	Transactions, frequent updates	Analytics, bulk aggregations
Typical read pattern	Select few rows, many columns	Select many rows, few columns
Examples	MySQL, PostgreSQL, Oracle	HBase, Cassandra, ClickHouse

Document Database

Stores data as flexible documents (usually JSON) instead of rigid rows. A group of documents is called a collection.

{
  "_id": "user_001",
  "name": "Tom",
  "age": 28,
  "skills": ["Python", "SQL", "Cloud"],
  "address": {"city": "Suzhou", "zip": "215123"}
}

Info

MongoDB stores documents in BSON (Binary JSON), which supports richer data types than plain JSON.
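
A minimal sketch of the document model, using plain Python dicts as documents and a list as the collection. The get_path and find helpers are hypothetical; they only imitate the dotted-path style of querying nested fields and are not the MongoDB driver API.

```python
def get_path(doc, path, default=None):
    """Follow a dotted path such as 'address.city' through nested dicts."""
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return default
        current = current[part]
    return current


def find(collection, criteria):
    """Return the documents whose fields equal every value in criteria."""
    return [
        doc for doc in collection
        if all(get_path(doc, path) == value for path, value in criteria.items())
    ]


# Documents in one collection do not need to share the same fields.
users = [
    {
        "_id": "user_001",
        "name": "Tom",
        "age": 28,
        "skills": ["Python", "SQL", "Cloud"],
        "address": {"city": "Suzhou", "zip": "215123"},
    },
    {"_id": "user_002", "name": "Ana", "email": "ana@example.com"},
]

in_suzhou = find(users, {"address.city": "Suzhou"})
by_email = find(users, {"email": "ana@example.com"})
print([d["_id"] for d in in_suzhou], [d["_id"] for d in by_email])
```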

Graph Database

Represents data as nodes (entities), edges (relationships), and properties (attributes). Ideal for traversing deep relationships.

Nodes: Objects or instances (roughly equivalent to a row in an RDBMS).
Edges: Relationships with direction and type.
Properties: Information attached to nodes or edges.

Use case: In a social network, find “friends of friends” by traversing edges rather than expensive JOINs.
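
The "friends of friends" traversal can be sketched with an adjacency mapping. The names and the friends_of_friends helper are invented for this example, and a real graph database would run the traversal inside its own query engine.

```python
# Each node maps to the set of nodes it shares a "friend" edge with.
friends = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice", "dave", "erin"},
    "dave": {"bob", "carol"},
    "erin": {"carol"},
}


def friends_of_friends(graph, person):
    """People exactly two hops away: friends of friends who are not already friends."""
    direct = graph.get(person, set())
    two_hops = set()
    for friend in direct:
        two_hops |= graph.get(friend, set())  # follow each friend's edges
    return two_hops - direct - {person}


fof = friends_of_friends(friends, "alice")
print(sorted(fof))
```

Each hop is a direct lookup of a node's neighbours, so the cost depends on how many edges are followed rather than on the total size of the data set.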

Cloud Computing

Cloud computing has transformed how organizations store, access, and manipulate data. Compute and storage are rented over the network, computation can run close to where the data lives, and resources are provisioned elastically, scaling up or down with demand.
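
Elastic provisioning can be pictured as a rule that turns current load into a target number of instances. The sketch below is a simplified assumption, not any provider's autoscaling policy; the function name, the per-instance capacity, and the load figures are all hypothetical.

```python
import math


def desired_instances(requests_per_sec, capacity_per_instance,
                      min_instances=1, max_instances=20):
    """Instances needed for the load, clamped to the allowed range."""
    needed = math.ceil(requests_per_sec / capacity_per_instance)
    return max(min_instances, min(max_instances, needed))


# Hypothetical load samples (requests per second) over a day.
load_samples = [50, 400, 1800, 5200, 900]
plan = [desired_instances(load, capacity_per_instance=250) for load in load_samples]
print(plan)
```

The upper bound caps cost during spikes, and the lower bound keeps the service reachable when traffic is near zero.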

Deployment Models

Model	Ownership	Access	Best For
Public Cloud	Third-party vendor (AWS, Azure, GCP)	Internet, pay-as-you-go	Startups, variable workloads, cost reduction
Private Cloud	Single organization (on-premises or hosted)	Internal network	Highly regulated industries, full data control
Hybrid Cloud	Combination of public + private	Mixed	Burst capacity, keeping sensitive data private while offloading general workloads

Service Models

The cloud stack is often visualized as three layers, with the vendor managing more as you move up from IaaS through PaaS to SaaS:

Model	You Manage	Vendor Manages	Examples
SaaS	Nothing (just use the app)	Everything: app, data, runtime, middleware, OS, networking, storage	Gmail, Dropbox, Salesforce, Google Workspace
PaaS	Application, data	Runtime, middleware, OS, networking, storage	Heroku, Google App Engine, AWS Elastic Beanstalk
IaaS	OS, middleware, runtime, application, data	Networking, storage, virtualization, servers	AWS EC2, Azure VMs, DigitalOcean, Linode
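
The responsibility split in the table can be written down as a small data structure, which makes it easy to ask who manages a given layer. The layer names follow the table; the ordering of the list and the helper functions are assumptions made for this sketch.

```python
# Stack layers, listed from the application down to the network.
LAYERS = [
    "application", "data", "runtime", "middleware", "os",
    "virtualization", "servers", "storage", "networking",
]

# Layers the customer manages under each service model, per the table.
CUSTOMER_MANAGES = {
    "SaaS": set(),
    "PaaS": {"application", "data"},
    "IaaS": {"application", "data", "runtime", "middleware", "os"},
}


def vendor_manages(model):
    """Layers left to the vendor under the given service model."""
    return [layer for layer in LAYERS if layer not in CUSTOMER_MANAGES[model]]


def models_giving_control_of(layer):
    """Service models in which the customer manages the given layer."""
    return [m for m, layers in CUSTOMER_MANAGES.items() if layer in layers]


print(vendor_manages("IaaS"))
print(models_giving_control_of("os"))
```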

SaaS — Software as a Service

Delivers complete applications over the internet, usually via subscription or pay-as-you-go.

Advantages: Zero installation, centralized updates, accessible from any device.
Limitations: Vendor lock-in, limited customization, data resides on third-party servers.

PaaS — Platform as a Service

Provides a managed development platform so teams can build and deploy without maintaining infrastructure.

Advantages: Faster time-to-market, automatic scaling, reduced operational overhead.
Limitations: Runtime constraints, data security concerns, potential vendor-specific API lock-in.

IaaS — Infrastructure as a Service

Offers raw computing resources on demand. The most flexible model, but requires the most management.

Advantages: Full control over the stack, highly scalable, pay-only-for-what-you-use.
Limitations: Requires internal DevOps expertise, shared-tenant security risks, users must patch and maintain their own OS.

How to Choose?

Need a ready-to-use application?        → SaaS
Need to deploy code without managing OS? → PaaS
Need full control over servers and OS?   → IaaS

Summary

Topic	Key Takeaway
Relational DB	Structured, ACID, SQL, great for transactions
NoSQL	Scalable, flexible schema, optimized for specific data models (key-value, column, document, graph)
Cloud Deployment	Public (cost), Private (control), Hybrid (both)
Cloud Service	SaaS (use), PaaS (build), IaaS (control)

Related Concepts

Relational Database — Deep dive into SQL and ACID
NoSQL Database — Design patterns and consistency models
OLTP vs OLAP — Transactional vs analytical workloads
SaaS, PaaS, IaaS — Cloud service model comparison

Sources

Lecture-3-Storage-and-Treatment-of-Big-Data
