AITC Wiki

Storage and Treatment of Big Data


This lecture covers two pillars of big data systems: how data is stored (relational vs. non-relational databases) and how computing is provisioned (cloud deployment and service models).

Lecture Overview

This lecture bridges raw big data technologies and practical implementation. We focus on:

  1. Database paradigms — Relational (SQL) vs. Non-relational (NoSQL), and when to use each.
  2. Cloud computing — Deployment models (Public, Private, Hybrid) and service models (SaaS, PaaS, IaaS).

Relational Database (SQL)

A relational database organizes data into tables (rows and columns) and defines relationships between them using foreign keys. It uses SQL (Structured Query Language) for querying and manipulation.

Key Characteristics

  • Structured schema: Data must conform to predefined tables and types.
  • ACID compliance: Guarantees Atomicity, Consistency, Isolation, and Durability.
  • Strong relationships: Tables are linked via primary/foreign keys.
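
These characteristics can be sketched with Python's built-in `sqlite3` module. The `customers` and `orders` tables and the `place_orders` helper are hypothetical names chosen for illustration; the point is that a foreign-key violation in the middle of a transaction rolls back the whole batch (atomicity).

```python
import sqlite3

# Hypothetical two-table schema used only for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when enabled
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL)")
conn.execute(
    "CREATE TABLE orders ("
    " id INTEGER PRIMARY KEY,"
    " customer_id INTEGER NOT NULL REFERENCES customers(id),"
    " amount REAL NOT NULL)"
)
conn.execute("INSERT INTO customers (id, name) VALUES (1, 'Tom')")
conn.commit()


def place_orders(rows):
    """Insert all rows in one transaction; any failure rolls back every row."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            conn.executemany(
                "INSERT INTO orders (customer_id, amount) VALUES (?, ?)", rows
            )
        return True
    except sqlite3.IntegrityError:
        return False


ok = place_orders([(1, 19.99), (1, 5.00)])        # both rows reference customer 1
rejected = place_orders([(1, 7.50), (99, 1.00)])  # customer 99 does not exist
order_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

After the second call, only the two rows from the first batch remain: the valid row `(1, 7.50)` is discarded together with the invalid one.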

Common Systems

| System | Typical Use Case |
|---|---|
| MySQL | Web applications, LAMP stack |
| PostgreSQL | Complex queries, geospatial data |
| Microsoft SQL Server | Enterprise Windows environments |
| Oracle | Large-scale enterprise systems |
| Microsoft Access | Small desktop applications |

When to Use

  • Data integrity and complex transactions are critical.
  • Relationships between entities are well-defined and stable.
  • You need powerful querying (JOINs, aggregations, window functions).
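
As a rough illustration of the querying power mentioned in the last point, the sketch below joins two tables and aggregates per customer, again using `sqlite3`. The schema and sample rows are invented for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schema and data, for illustration only.
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'Tom'), (2, 'Ana');
    INSERT INTO orders VALUES (1, 1, 20.0), (2, 1, 5.0), (3, 2, 12.5);
""")

# JOIN links each order to its customer; GROUP BY aggregates per customer.
totals = conn.execute("""
    SELECT c.name, COUNT(o.id) AS n_orders, SUM(o.amount) AS total
    FROM customers AS c
    JOIN orders AS o ON o.customer_id = c.id
    GROUP BY c.id
    ORDER BY total DESC
""").fetchall()
```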

Warning

Relational database management systems (RDBMS) can struggle to scale horizontally once data volume or write throughput exceeds what a single server can handle. This is where NoSQL becomes relevant.


Non-Relational Database (NoSQL)

NoSQL databases were designed to overcome the scalability and performance limitations of traditional RDBMS when handling unstructured or semi-structured big data. Many of them relax some ACID guarantees in exchange for horizontal scalability and schema flexibility.
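
Horizontal scalability usually means partitioning (sharding) data across several machines. The following is a minimal sketch of hash-based sharding, with each node modelled as a plain dictionary. The function names and the three-node cluster are assumptions made for the example, not the scheme of any particular database.

```python
import hashlib


def shard_for(key: str, n_shards: int) -> int:
    """Map a key to a shard index using a stable hash (not Python's salted hash())."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n_shards


# Hypothetical cluster of three nodes, each modelled as a plain dict.
shards = [dict() for _ in range(3)]


def put(key, value):
    shards[shard_for(key, len(shards))][key] = value


def get(key):
    return shards[shard_for(key, len(shards))].get(key)


for i in range(1000):
    put(f"user:{i}", {"id": i})

sizes = [len(s) for s in shards]  # keys per shard; roughly even for a good hash
```

Simple modulo placement moves most keys when the number of shards changes, which is why production systems typically rely on schemes such as consistent hashing instead.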

Four Main Types

| Type | Data Model | Best For | Examples |
|---|---|---|---|
| Key-Value Store | Dictionary of key-value pairs | Caching, session management, simple lookups | Redis, DynamoDB, Riak, Memcached |
| Column-Store | Data stored by columns rather than rows | Analytical queries, OLAP, fast bulk reads | HBase, Cassandra, Bigtable |
| Document Store | Flexible documents (JSON, XML, BSON) | Content management, user profiles, catalogs | MongoDB, CouchDB, DocumentDB |
| Graph Database | Nodes, edges, and properties | Social networks, recommendation engines, fraud detection | Neo4j, InfiniteGraph, OrientDB |

Key-Value Store

The simplest NoSQL model. Each item is stored as an attribute name (key) together with its value.

# Conceptual example
user_session = {
    "session_001": {"user_id": 42, "login_time": "2025-04-20T09:00:00Z"},
    "session_002": {"user_id": 7,  "login_time": "2025-04-20T09:15:00Z"}
}

Info

Amazon DynamoDB is a managed key-value and document database offered by AWS. Redis is widely used for in-memory caching.
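
To make the session-caching use case concrete, here is a minimal in-memory sketch of a key-value store with optional expiry. `TinyKVStore` and its methods are hypothetical and do not mirror the API of Redis or DynamoDB; the clock is injectable so that expiry can be demonstrated without waiting.

```python
import time


class TinyKVStore:
    """In-memory key-value store with optional per-key expiry (illustrative only)."""

    def __init__(self, clock=time.monotonic):
        self._data = {}      # key -> (value, expires_at or None)
        self._clock = clock  # injectable so expiry can be tested without sleeping

    def set(self, key, value, ttl_seconds=None):
        expires_at = None if ttl_seconds is None else self._clock() + ttl_seconds
        self._data[key] = (value, expires_at)

    def get(self, key, default=None):
        entry = self._data.get(key)
        if entry is None:
            return default
        value, expires_at = entry
        if expires_at is not None and self._clock() >= expires_at:
            del self._data[key]  # lazily evict expired entries on read
            return default
        return value

    def delete(self, key):
        """Remove a key; return True if it was present."""
        return self._data.pop(key, None) is not None


fake_now = [0.0]  # fake clock value, in seconds
store = TinyKVStore(clock=lambda: fake_now[0])
store.set("session_001", {"user_id": 42}, ttl_seconds=30)
fresh = store.get("session_001")    # still valid at t = 0
fake_now[0] = 31.0
expired = store.get("session_001")  # past the 30-second TTL
```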

Column-Store Database

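Unlike row-oriented databases, which suit transactional (OLTP) workloads, column-stores lay data out column by column, which suits analytical (OLAP) workloads. This dramatically improves analytical query performance because the system reads only the columns relevant to the query. Note that HBase, Cassandra, and Bigtable are more precisely wide-column (column-family) stores: they partition data by row key and group columns into families, so they are not purely columnar in the way an analytics engine such as ClickHouse is.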

Example: A table with 100 million rows and 100 columns (100 GB total), assuming the columns are of roughly equal size.

  • Query: What is the average age of males?
  • Row-wise DB must read: 100 GB
  • Columnar DB reads only the age and gender columns: about 2 GB

| Aspect | Row-Oriented (OLTP) | Column-Oriented (OLAP) |
|---|---|---|
| Storage pattern | Whole row in same disk block | Each column in separate blocks |
| Best for | Transactions, frequent updates | Analytics, bulk aggregations |
| Typical read pattern | Select few rows, many columns | Select many rows, few columns |
| Examples | MySQL, PostgreSQL, Oracle | HBase, Cassandra, ClickHouse |
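
The 100 GB versus 2 GB figures can be reproduced with a short back-of-the-envelope calculation. The sketch assumes equal-width columns and ignores compression and indexes; `bytes_scanned` and the three sample rows are made up for illustration.

```python
def bytes_scanned(total_bytes, n_columns, columns_needed, columnar):
    """Estimate bytes read for a full-table aggregate, assuming equal-width columns."""
    if not columnar:
        return float(total_bytes)  # row layout: every block holds whole rows
    return total_bytes * columns_needed / n_columns


GB = 10**9
row_scan = bytes_scanned(100 * GB, n_columns=100, columns_needed=2, columnar=False)
col_scan = bytes_scanned(100 * GB, n_columns=100, columns_needed=2, columnar=True)

# The same query on a toy dataset, stored row-wise and then pivoted to columns.
rows = [
    {"name": "Tom", "age": 28, "gender": "M"},
    {"name": "Ana", "age": 35, "gender": "F"},
    {"name": "Li", "age": 40, "gender": "M"},
]
columns = {field: [r[field] for r in rows] for field in rows[0]}

# Only the "age" and "gender" columns are touched; "name" is never read.
male_ages = [a for a, g in zip(columns["age"], columns["gender"]) if g == "M"]
avg_age_males = sum(male_ages) / len(male_ages)
```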

Document Database

Stores data as flexible documents (usually JSON) instead of rigid rows. A group of documents is called a collection.

{
  "_id": "user_001",
  "name": "Tom",
  "age": 28,
  "skills": ["Python", "SQL", "Cloud"],
  "address": {"city": "Suzhou", "zip": "215123"}
}

Info

MongoDB stores documents in BSON (Binary JSON), which supports richer data types than plain JSON.
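
The flexible-schema idea can be sketched with plain Python dictionaries standing in for documents and a list standing in for a collection. `get_path` and `find` are hypothetical helpers written for this example; they only loosely imitate the dotted-path queries that document databases offer.

```python
import json

# A collection is modelled here as a list of dicts; field names are illustrative.
users = [
    {"_id": "user_001", "name": "Tom", "age": 28,
     "skills": ["Python", "SQL", "Cloud"],
     "address": {"city": "Suzhou", "zip": "215123"}},
    {"_id": "user_002", "name": "Ana", "skills": ["Go"]},  # no age or address
]


def get_path(doc, dotted_path, default=None):
    """Follow a dotted path such as 'address.city' through nested dicts."""
    current = doc
    for part in dotted_path.split("."):
        if not isinstance(current, dict) or part not in current:
            return default
        current = current[part]
    return current


def find(collection, dotted_path, expected):
    """Return documents whose field equals `expected`, or whose list field contains it."""
    matches = []
    for doc in collection:
        value = get_path(doc, dotted_path)
        if value == expected or (isinstance(value, list) and expected in value):
            matches.append(doc)
    return matches


in_suzhou = [d["_id"] for d in find(users, "address.city", "Suzhou")]
knows_go = [d["_id"] for d in find(users, "skills", "Go")]
round_trip = json.loads(json.dumps(users[0]))  # documents serialize directly to JSON
```

The second document omits `age` and `address` entirely, which a relational table with a fixed schema would not allow without nullable columns.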

Graph Database

Represents data as nodes (entities), edges (relationships), and properties (attributes). Ideal for traversing deep relationships.

  • Nodes: Objects or instances (roughly comparable to a row in an RDBMS).
  • Edges: Relationships with direction and type.
  • Properties: Information attached to nodes or edges.

Use case: In a social network, find “friends of friends” by traversing edges rather than expensive JOINs.
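
A minimal sketch of the "friends of friends" traversal, using an adjacency-set representation in plain Python. The names and the `friends_of_friends` function are invented for the example; a real graph database would express this in its own query language.

```python
# Hypothetical undirected friendship graph stored as adjacency sets.
friends = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice", "dave", "erin"},
    "dave": {"bob", "carol"},
    "erin": {"carol"},
}


def friends_of_friends(graph, person):
    """People exactly two hops away: friends' friends, minus self and direct friends."""
    direct = graph.get(person, set())
    two_hops = set()
    for friend in direct:
        two_hops |= graph.get(friend, set())  # follow each edge one more hop
    return two_hops - direct - {person}


result = sorted(friends_of_friends(friends, "alice"))
```

Each hop is a direct lookup of a node's neighbours, whereas the relational equivalent would need a self-join on a friendship table for every additional hop.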


Cloud Computing

Cloud computing has transformed how organizations store, access, and manipulate data by moving computation closer to data and adopting elastic resource provisioning — scaling up or down based on demand.
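
Elastic provisioning can be illustrated with a simple scaling rule: run the smallest number of instances that covers the current load, within fixed bounds. The function name, the capacity of 100 requests per second per instance, and the bounds are assumptions made for this sketch, not the policy of any specific cloud provider.

```python
import math


def desired_instances(requests_per_sec, capacity_per_instance,
                      min_instances=1, max_instances=20):
    """Smallest instance count that covers the load, clamped to [min, max]."""
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    needed = math.ceil(requests_per_sec / capacity_per_instance)
    return max(min_instances, min(max_instances, needed))


# Instance counts for a quiet period, moderate load, heavy load, and a spike.
plan = [desired_instances(load, capacity_per_instance=100)
        for load in (0, 250, 1000, 5000)]
```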

Deployment Models

| Model | Ownership | Access | Best For |
|---|---|---|---|
| Public Cloud | Third-party vendor (AWS, Azure, GCP) | Internet, pay-as-you-go | Startups, variable workloads, cost reduction |
| Private Cloud | Single organization (on-premise or hosted) | Internal network | Highly regulated industries, full data control |
| Hybrid Cloud | Combination of public + private | Mixed | Burst capacity, keeping sensitive data private while offloading general workloads |

Service Models

The cloud stack is often visualized as three layers, with the vendor managing more of the stack as you move up from IaaS through PaaS to SaaS:

| Model | You Manage | Vendor Manages | Examples |
|---|---|---|---|
| SaaS | Nothing (just use the app) | Everything: app, data, runtime, middleware, OS, networking, storage | Gmail, Dropbox, Salesforce, Google Workspace |
| PaaS | Application, data | Runtime, middleware, OS, networking, storage | Heroku, Google App Engine, AWS Elastic Beanstalk |
| IaaS | OS, middleware, runtime, application, data | Networking, storage, virtualization, servers | AWS EC2, Azure VMs, DigitalOcean, Linode |
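
The split of responsibilities can also be written as a small lookup. The layer ordering below is an assumption used to line up the layer names from the table; `responsibilities` is a hypothetical helper.

```python
# Stack layers from top (application) to bottom, using the table's layer names.
LAYERS = ["application", "data", "runtime", "middleware", "OS",
          "virtualization", "servers", "storage", "networking"]

# Number of top layers the customer manages under each service model.
CUSTOMER_LAYERS = {"SaaS": 0, "PaaS": 2, "IaaS": 5}


def responsibilities(model):
    """Split the stack into the layers you manage and the layers the vendor manages."""
    n = CUSTOMER_LAYERS[model]
    return {"you": LAYERS[:n], "vendor": LAYERS[n:]}


paas = responsibilities("PaaS")
iaas = responsibilities("IaaS")
```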

SaaS — Software as a Service

Delivers complete applications over the internet, usually via subscription or pay-as-you-go.

  • Advantages: Zero installation, centralized updates, accessible from any device.
  • Limitations: Vendor lock-in, limited customization, data resides on third-party servers.

PaaS — Platform as a Service

Provides a managed development platform so teams can build and deploy without maintaining infrastructure.

  • Advantages: Faster time-to-market, automatic scaling, reduced operational overhead.
  • Limitations: Runtime constraints, data security concerns, potential vendor-specific API lock-in.

IaaS — Infrastructure as a Service

Offers raw computing resources on demand. The most flexible model, but requires the most management.

  • Advantages: Full control over the stack, highly scalable, pay-only-for-what-you-use.
  • Limitations: Requires internal DevOps expertise, shared-tenant security risks, users must patch and maintain their own OS.

How to Choose?

Need a ready-to-use application?        → SaaS
Need to deploy code without managing OS? → PaaS
Need full control over servers and OS?   → IaaS
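
The same decision chart, expressed as a small function. The two boolean parameters are a simplification introduced for this sketch; real decisions weigh cost, compliance, and team skills as well.

```python
def choose_service_model(needs_ready_made_app, needs_os_control):
    """Return the service model suggested by the decision chart."""
    if needs_ready_made_app:
        return "SaaS"   # just use a finished application
    if needs_os_control:
        return "IaaS"   # manage servers and OS yourself
    return "PaaS"       # deploy code, let the vendor run the platform


email_choice = choose_service_model(needs_ready_made_app=True, needs_os_control=False)
web_app_choice = choose_service_model(needs_ready_made_app=False, needs_os_control=False)
custom_kernel_choice = choose_service_model(needs_ready_made_app=False, needs_os_control=True)
```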

Summary

| Topic | Key Takeaway |
|---|---|
| Relational DB | Structured, ACID, SQL, great for transactions |
| NoSQL | Scalable, flexible schema, optimized for specific data models (key-value, column, document, graph) |
| Cloud Deployment | Public (cost), Private (control), Hybrid (both) |
| Cloud Service | SaaS (use), PaaS (build), IaaS (control) |
