Storage and Treatment of Big Data
This lecture covers two pillars of big data systems: how data is stored (relational vs. non-relational databases) and how computing is provisioned (cloud deployment and service models).
Lecture Overview
This lecture bridges raw big data technologies and practical implementation. We focus on:
- Database paradigms — Relational (SQL) vs. Non-relational (NoSQL), and when to use each.
- Cloud computing — Deployment models (Public, Private, Hybrid) and service models (SaaS, PaaS, IaaS).
Relational Database (SQL)
A relational database organizes data into tables (rows and columns) and defines relationships between them using foreign keys. It uses SQL (Structured Query Language) for querying and manipulation.
Key Characteristics
- Structured schema: Data must conform to predefined tables and types.
- ACID compliance: Guarantees Atomicity, Consistency, Isolation, and Durability.
- Strong relationships: Tables are linked via primary/foreign keys.
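The characteristics above can be tried out with Python's built-in `sqlite3` module. This is a minimal sketch, not a production schema: the `customers` and `orders` tables and the `place_orders` helper are hypothetical names chosen for illustration. It shows a foreign key rejecting an orphan row and a transaction rolling back as a unit (atomicity).

```python
import sqlite3

# In-memory database; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customers(id),"
    " amount REAL NOT NULL)"
)
conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Tom')")
conn.commit()


def place_orders(rows):
    """Insert all rows in one transaction; undo every row if any of them fails."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.executemany(
                "INSERT INTO orders (customer_id, amount) VALUES (?, ?)", rows
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = place_orders([(1, 19.99), (1, 5.00)])      # both rows reference customer 1
failed = place_orders([(1, 7.50), (999, 1.00)])  # customer 999 does not exist
order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

After the second call, the valid 7.50 row is rolled back together with the invalid one, so only the two rows from the first call remain.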
Common Systems
| System | Typical Use Case |
|---|---|
| MySQL | Web applications, LAMP stack |
| PostgreSQL | Complex queries, geospatial data |
| Microsoft SQL Server | Enterprise Windows environments |
| Oracle | Large-scale enterprise systems |
| Microsoft Access | Small desktop applications |
When to Use
- Data integrity and complex transactions are critical.
- Relationships between entities are well-defined and stable.
- You need powerful querying (JOINs, aggregations, window functions).
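As a small illustration of the querying point, the sketch below joins two tables and aggregates per customer using Python's built-in `sqlite3` module. The table names and sample rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Tom'), (2, 'Ana');
    INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 30.0), (3, 2, 15.0);
""")

# JOIN links each order to its customer; GROUP BY aggregates per customer.
totals = conn.execute("""
    SELECT c.name, COUNT(o.id) AS n_orders, SUM(o.amount) AS total
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.id, c.name
    ORDER BY total DESC
""").fetchall()
```

Here `totals` holds one row per customer with the order count and the summed amount, highest total first.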
Warning
An RDBMS can struggle to scale horizontally once data volume or write throughput outgrows a single server’s capacity. This is where NoSQL becomes relevant.
Non-Relational Database (NoSQL)
NoSQL databases were designed to overcome the scalability and performance limitations of traditional RDBMS when handling unstructured or semi-structured big data. Many of them relax some ACID guarantees in exchange for horizontal scalability and schema flexibility.
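One common way to get horizontal scalability is to partition keys across servers by hashing them. The sketch below is only an illustration of that idea under simplifying assumptions: the node names are hypothetical, each "node" is a plain dictionary, and simple modulo hashing is used (real systems often prefer consistent hashing, because modulo hashing moves most keys when a node is added or removed).

```python
import hashlib

NODES = ["node-a", "node-b", "node-c"]  # hypothetical server names


def node_for(key, nodes=NODES):
    """Pick a node by hashing the key, so the same key always maps to the same node."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:8], "big") % len(nodes)
    return nodes[index]


# Each node stores only its own share of the data.
cluster = {name: {} for name in NODES}


def put(key, value):
    cluster[node_for(key)][key] = value


def get(key):
    return cluster[node_for(key)].get(key)


for i in range(1000):
    put(f"user:{i}", {"id": i})

sizes = {name: len(data) for name, data in cluster.items()}
```

Each lookup touches exactly one node, which is why adding nodes can raise total capacity without any single server holding the full dataset.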
Four Main Types
| Type | Data Model | Best For | Examples |
|---|---|---|---|
| Key-Value Store | Dictionary of key-value pairs | Caching, session management, simple lookups | Redis, DynamoDB, Riak, Memcached |
| Column-Store | Data stored by columns rather than rows | Analytical queries, OLAP, fast bulk reads | HBase, Cassandra, Bigtable |
| Document Store | Flexible documents (JSON, XML, BSON) | Content management, user profiles, catalogs | MongoDB, CouchDB, DocumentDB |
| Graph Database | Nodes, edges, and properties | Social networks, recommendation engines, fraud detection | Neo4j, InfiniteGraph, OrientDB |
Key-Value Store
The simplest NoSQL model. Each item is stored as an attribute name (key) together with its value.
    # Conceptual example: each session ID (key) maps to its session data (value)
    user_session = {
        "session_001": {"user_id": 42, "login_time": "2025-04-20T09:00:00Z"},
        "session_002": {"user_id": 7, "login_time": "2025-04-20T09:15:00Z"},
    }

Info
Amazon DynamoDB is a managed key-value and document database from AWS. Redis is widely used for in-memory caching.
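To make the caching and session use cases concrete, here is a minimal in-memory key-value store with per-key expiry. It is a sketch only: `TTLStore` and its methods are hypothetical names, not the API of Redis, DynamoDB, or any other product, and the clock is injectable so the example runs deterministically.

```python
import time


class TTLStore:
    """A tiny in-memory key-value store with optional per-key expiry (illustrative)."""

    def __init__(self, clock=time.monotonic):
        self._data = {}
        self._clock = clock

    def set(self, key, value, ttl_seconds=None):
        expires_at = None if ttl_seconds is None else self._clock() + ttl_seconds
        self._data[key] = (value, expires_at)

    def get(self, key, default=None):
        item = self._data.get(key)
        if item is None:
            return default
        value, expires_at = item
        if expires_at is not None and self._clock() >= expires_at:
            del self._data[key]  # lazily evict expired entries on read
            return default
        return value

    def delete(self, key):
        """Remove a key; return True if it was present."""
        return self._data.pop(key, None) is not None


now = [0.0]  # fake clock so the example does not depend on real time
sessions = TTLStore(clock=lambda: now[0])
sessions.set("session_001", {"user_id": 42}, ttl_seconds=30)
before = sessions.get("session_001")  # still valid
now[0] = 31.0                         # 31 seconds later
after = sessions.get("session_001")   # expired, so the default is returned
```

The whole interface is set, get, and delete by key, which is what keeps this model simple to distribute and fast for lookups.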
Column-Store Database
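Unlike row-oriented databases, which suit transactional (OLTP) workloads, column-stores lay data out column by column, which suits analytical (OLAP) workloads. This dramatically improves analytical query performance because the system reads only the columns relevant to the query. Note that wide-column stores such as HBase and Cassandra group columns into families under a row key and are tuned for high write throughput; the scan savings described here are most pronounced in analytics-oriented columnar engines such as ClickHouse.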
Example: A table with 100 million rows and 100 columns (100 GB total, so roughly 1 GB per column if the columns are of similar width).
- Query: What is the average age of males?
- Row-wise DB must read: 100 GB
- Columnar DB reads only the `age` and `gender` columns: about 2 GB
| Aspect | Row-Oriented (OLTP) | Column-Oriented (OLAP) |
|---|---|---|
| Storage pattern | Whole row in same disk block | Each column in separate blocks |
| Best for | Transactions, frequent updates | Analytics, bulk aggregations |
| Typical read pattern | Select few rows, many columns | Select many rows, few columns |
| Examples | MySQL, PostgreSQL, Oracle | ClickHouse, Amazon Redshift, Vertica |
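The difference between the two layouts can be sketched in plain Python. The three sample records are invented, and "values touched" is a stand-in for bytes read from disk; the final lines redo the 100 GB arithmetic under the assumption that all columns are equally wide.

```python
# Row-oriented layout: one record per entry, all fields stored together.
rows = [
    {"id": 1, "name": "Tom", "age": 28, "gender": "M", "city": "Suzhou"},
    {"id": 2, "name": "Ana", "age": 35, "gender": "F", "city": "Lisbon"},
    {"id": 3, "name": "Raj", "age": 42, "gender": "M", "city": "Pune"},
]

# Column-oriented layout: one array per field.
columns = {field: [r[field] for r in rows] for field in rows[0]}


def avg_age_rowwise(rows):
    touched = 0
    ages = []
    for r in rows:
        touched += len(r)  # the whole row is read, even unused fields
        if r["gender"] == "M":
            ages.append(r["age"])
    return sum(ages) / len(ages), touched


def avg_age_columnar(columns):
    ages, genders = columns["age"], columns["gender"]
    touched = len(ages) + len(genders)  # only the two needed columns are read
    male_ages = [a for a, g in zip(ages, genders) if g == "M"]
    return sum(male_ages) / len(male_ages), touched


row_avg, row_touched = avg_age_rowwise(rows)
col_avg, col_touched = avg_age_columnar(columns)

# Same reasoning at scale, assuming 100 equally wide columns in 100 GB.
total_gb, n_columns, needed_columns = 100, 100, 2
columnar_read_gb = total_gb / n_columns * needed_columns
```

Both functions return the same average, but the row-wise scan touches every field of every record while the columnar scan touches only two fields per record.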
Document Database
Stores data as flexible documents (usually JSON) instead of rigid rows. A group of documents is called a collection.
    {
      "_id": "user_001",
      "name": "Tom",
      "age": 28,
      "skills": ["Python", "SQL", "Cloud"],
      "address": {"city": "Suzhou", "zip": "215123"}
    }

Info
MongoDB stores documents in BSON (Binary JSON), which supports richer data types than plain JSON.
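The schema flexibility of a document store can be imitated with a list of dictionaries. This is a rough sketch, not MongoDB's query API: the second document and the `get_path` and `find` helpers are hypothetical and exist only to show that documents in one collection may have different fields and nested values.

```python
import json

users = [  # a "collection": documents need not share the same fields
    {"_id": "user_001", "name": "Tom", "age": 28,
     "skills": ["Python", "SQL", "Cloud"],
     "address": {"city": "Suzhou", "zip": "215123"}},
    {"_id": "user_002", "name": "Ana", "skills": ["Go"]},  # no age, no address
]


def get_path(doc, path, default=None):
    """Follow a dotted path such as 'address.city' through nested dicts."""
    current = doc
    for part in path.split("."):
        if not isinstance(current, dict) or part not in current:
            return default
        current = current[part]
    return current


def find(collection, path, expected):
    """Return documents whose value at `path` equals `expected`,
    or contains it when that value is a list."""
    matches = []
    for doc in collection:
        value = get_path(doc, path)
        if value == expected or (isinstance(value, list) and expected in value):
            matches.append(doc)
    return matches


in_suzhou = find(users, "address.city", "Suzhou")
python_users = find(users, "skills", "Python")
round_trip = json.loads(json.dumps(users[0]))  # documents serialize cleanly to JSON
```

The document without an `address` field is simply skipped by the first query; no schema change or NULL column is needed to accommodate it.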
Graph Database
Represents data as nodes (entities), edges (relationships), and properties (attributes). Ideal for traversing deep relationships.
- Nodes: Objects or instances (equivalent to a row in RDBMS).
- Edges: Relationships with direction and type.
- Properties: Information attached to nodes or edges.
Use case: In a social network, find “friends of friends” by traversing edges rather than expensive JOINs.
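The "friends of friends" traversal can be sketched with an adjacency map. The people and friendships below are invented, and a real graph database would run this through its own query language rather than Python sets.

```python
# Hypothetical undirected friendship graph: node -> set of neighbouring nodes.
friends = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice", "dave", "erin"},
    "dave": {"bob", "carol"},
    "erin": {"carol"},
}


def friends_of_friends(graph, person):
    """People exactly two hops away: reachable via a friend, but not already a friend."""
    direct = graph.get(person, set())
    two_hops = set()
    for friend in direct:
        two_hops |= graph.get(friend, set())  # follow each friend's edges
    return two_hops - direct - {person}


fof = friends_of_friends(friends, "alice")
```

Each hop is a direct lookup of a node's edges, so the cost depends on how many neighbours are visited rather than on the total size of the dataset.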
Cloud Computing
Cloud computing has transformed how organizations store, access, and manipulate data by moving computation closer to data and adopting elastic resource provisioning — scaling up or down based on demand.
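Elastic provisioning can be pictured as a rule that maps current demand to a number of running instances. The function below is a deliberately simplified sketch: the per-instance capacity, the bounds, and the demand figures are assumptions for illustration, not the behaviour of any particular cloud provider's autoscaler.

```python
import math


def desired_instances(requests_per_second, capacity_per_instance,
                      min_instances=1, max_instances=20):
    """Instances needed to serve the load, clamped to the configured bounds."""
    needed = math.ceil(requests_per_second / capacity_per_instance)
    return max(min_instances, min(max_instances, needed))


# Hypothetical demand over five time windows (requests per second).
demand = [50, 400, 1200, 300, 0]
plan = [desired_instances(rps, capacity_per_instance=100) for rps in demand]
```

The plan rises with the peak and falls back afterwards, which is the pay-for-what-you-use behaviour that fixed on-premise capacity cannot offer.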
Deployment Models
| Model | Ownership | Access | Best For |
|---|---|---|---|
| Public Cloud | Third-party vendor (AWS, Azure, GCP) | Internet, pay-as-you-go | Startups, variable workloads, cost reduction |
| Private Cloud | Single organization (on-premise or hosted) | Internal network | Highly regulated industries, full data control |
| Hybrid Cloud | Combination of public + private | Mixed | Burst capacity, keeping sensitive data private while offloading general workloads |
Service Models
The cloud stack is often visualized as three layers, with the vendor managing more as you move up from IaaS through PaaS to SaaS:
| Model | You Manage | Vendor Manages | Examples |
|---|---|---|---|
| SaaS | Nothing (just use the app) | Everything: app, data, runtime, middleware, OS, networking, storage | Gmail, Dropbox, Salesforce, Google Workspace |
| PaaS | Application, data | Runtime, middleware, OS, networking, storage | Heroku, Google App Engine, AWS Elastic Beanstalk |
| IaaS | OS, middleware, runtime, application, data | Networking, storage, virtualization, servers | AWS EC2, Azure VMs, DigitalOcean, Linode |
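The table above can be read as a single stack with a dividing line that moves upward from IaaS to SaaS. The sketch below encodes that reading; the layer ordering and the `responsibilities` helper are assumptions made for illustration, and servers and virtualization are treated as vendor-managed in every model.

```python
# Layers ordered from the hardware up to the application.
LAYERS = ["servers", "virtualization", "storage", "networking",
          "os", "middleware", "runtime", "data", "application"]

# Lowest layer the customer manages in each model (None = customer manages nothing).
FIRST_CUSTOMER_LAYER = {"IaaS": "os", "PaaS": "data", "SaaS": None}


def responsibilities(model):
    """Split the stack into vendor-managed and customer-managed layers."""
    first = FIRST_CUSTOMER_LAYER[model]
    split = len(LAYERS) if first is None else LAYERS.index(first)
    return {"vendor": LAYERS[:split], "you": LAYERS[split:]}


iaas = responsibilities("IaaS")
paas = responsibilities("PaaS")
saas = responsibilities("SaaS")
```

Moving from IaaS to PaaS to SaaS shrinks the `you` list at each step, which is the trade between control and operational effort described in the sections that follow.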
SaaS — Software as a Service
Delivers complete applications over the internet, usually via subscription or pay-as-you-go.
- Advantages: Zero installation, centralized updates, accessible from any device.
- Limitations: Vendor lock-in, limited customization, data resides on third-party servers.
PaaS — Platform as a Service
Provides a managed development platform so teams can build and deploy without maintaining infrastructure.
- Advantages: Faster time-to-market, automatic scaling, reduced operational overhead.
- Limitations: Runtime constraints, data security concerns, potential vendor-specific API lock-in.
IaaS — Infrastructure as a Service
Offers raw computing resources on demand. The most flexible model, but requires the most management.
- Advantages: Full control over the stack, highly scalable, pay-only-for-what-you-use.
- Limitations: Requires internal DevOps expertise, shared-tenant security risks, users must patch and maintain their own OS.
How to Choose?
Need a ready-to-use application? → SaaS
Need to deploy code without managing OS? → PaaS
Need full control over servers and OS? → IaaS
Summary
| Topic | Key Takeaway |
|---|---|
| Relational DB | Structured, ACID, SQL, great for transactions |
| NoSQL | Scalable, flexible schema, optimized for specific data models (key-value, column, document, graph) |
| Cloud Deployment | Public (cost), Private (control), Hybrid (both) |
| Cloud Service | SaaS (use), PaaS (build), IaaS (control) |
Related Concepts
- Relational Database — Deep dive into SQL and ACID
- NoSQL Database — Design patterns and consistency models
- OLTP vs OLAP — Transactional vs analytical workloads
- SaaS, PaaS, IaaS — Cloud service model comparison