Introduction to SQL

10/26/2015
WORKSHOP: Introduction to DB and SQL
• Overview of database systems
– What is behind this Web Site?
– Database Management Systems
– What is a DBMS?
– Where are RDBMSs used?
• Problems without a DBMS
• How the Programmer Sees the DBMS
• SQL
Fundamentals of Website
Development
CSC 2320, Fall 2015
The Department of Computer Science
What is behind this Web Site?
• Search on a large database
• Specify search conditions
• Many users
• Updates
• Access through a web interface
©2015 http://cs.gsu.edu/~mhan7
Database Management Systems
Database Management System = DBMS
• A collection of files that store the data
• A big C program written by someone else that accesses and updates those files for you
Relational DBMS = RDBMS
• Data files are structured as relations (tables)
Where are RDBMSs used?
• Backend for traditional “database” applications
– EPFL administration
• Backend for large Websites
– Google
• Backend for Web services
– Amazon
Example of a Traditional Database Application
Suppose we are building a system to store the information about:
• students
• courses
• professors
• who takes what, who teaches what
What is a DBMS?
• Need for information management
• A very large, integrated collection of data.
• Models real-world enterprise.
– Entities (e.g., students, courses)
– Relationships (e.g., John is taking CSC2320)
• A Database Management System (DBMS) is a software package designed to store and manage databases.
Why Use a DBMS?
• Data independence and efficient access.
• Data integrity and security.
• Uniform data administration.
• Concurrent access, recovery from crashes.
• Replication control
• Reduced application development time.
Why Study Databases?
• Shift from computation to information
– at the “low end”: access to physical world
– at the “high end”: scientific applications
• Datasets increasing in diversity and volume.
– Digital libraries, interactive video, Human Genome
project, e-commerce, sensor networks
– ... need for DBMS/data services exploding
• DBMS encompasses several areas of CS
– OS, languages, theory, AI, multimedia, logic
Data Models
• A data model is a collection of concepts for
describing data.
• A schema is a description of a particular collection of
data, using a given data model.
• The relational model of data is the most widely used
model today.
– Main concept: relation, basically a table with rows
and columns.
– Every relation has a schema, which describes the
columns, or fields.
Can we do it without a DBMS?
Sure we can! Start by storing the data in files:
students.txt
courses.txt
professors.txt
Now write C or Java programs to implement specific tasks
Levels of Abstraction
• Many views, single conceptual (logical) schema and physical schema.
– Views describe how users see the data.
– Conceptual schema defines logical structure
– Physical schema describes the files and indexes used.
[Figure: View 1, View 2, View 3 over the Conceptual Schema, over the Physical Schema]
Schemas are defined using DDL; data is modified/queried using DML.
Doing it without a DBMS...
• Enroll “Mary Johnson” in “CSC2320”:
Write a C/Java program to do the following:
Read ‘students.txt’
Read ‘courses.txt’
Find & update the record “Mary Johnson”
Find & update the record “CSC2320”
Write “students.txt”
Write “courses.txt”
Problems without a DBMS...
• System crashes:
[Figure: the same read, find & update, write steps (for “Mary Johnson” and “CSE444”), interrupted by “CRASH !”]
– What is the problem?
• Large data sets (say 500GB)
– Why is this a problem?
• Simultaneous access by many users
– Lock students.txt – what is the problem?
Enters a DBMS
“Two tier system” or “client-server”
[Figure: Applications, connection (ODBC, JDBC), Database server (someone else’s C program), Data files]
Functionality of a DBMS
The programmer sees SQL, which has two components:
• Data Definition Language - DDL
• Data Manipulation Language - DML
– query language
Behind the scenes the DBMS has:
• Query engine
• Query optimizer
• Storage management
• Transaction Management (concurrency, recovery)
How the Programmer Sees the DBMS
• Start with DDL to create tables:
CREATE TABLE Students (
    Name CHAR(30),
    SSN CHAR(9) PRIMARY KEY NOT NULL,
    Category CHAR(20)
) ...
• Continue with DML to populate tables:
INSERT INTO Students
VALUES('Charles', '123456789', 'undergraduate')
. . . .
• Tables:
Students:
SSN          Name     Category
123-45-6789  Charles  undergrad
234-56-7890  Dan      grad
…            …        …
Takes:
SSN          CID
123-45-6789  CSC2444
123-45-6789  CSC2541
234-56-7890  CSC2142
…            …
Courses:
CID      Name               Quarter
CSC2444  Databases          fall
CSC2541  Operating systems  winter
• Still implemented as files, but behind the scenes can be quite complex
“data independence” = separate logical view from physical implementation
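The DDL and DML statements above can be tried out with a minimal sketch in Python. It assumes the standard-library sqlite3 module and an in-memory database as a stand-in for the RDBMS; the slides do not name a particular system. The row for Dan is taken from the Students table above.

```python
import sqlite3

# In-memory SQLite database, standing in for the RDBMS of the slides.
conn = sqlite3.connect(":memory:")

# DDL: create the Students table.
conn.execute("""
    CREATE TABLE Students (
        Name     CHAR(30),
        SSN      CHAR(9) PRIMARY KEY NOT NULL,
        Category CHAR(20)
    )
""")

# DML: populate the table, once with a literal statement and once with parameters.
conn.execute("INSERT INTO Students VALUES ('Charles', '123456789', 'undergraduate')")
conn.execute("INSERT INTO Students VALUES (?, ?, ?)", ("Dan", "234567890", "grad"))
conn.commit()

rows = conn.execute(
    "SELECT Name, SSN, Category FROM Students ORDER BY Name"
).fetchall()
print(rows)
```

The program never opens or writes a data file itself; where and how the rows are stored is left to the database engine, which is the “data independence” point made above.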
Transactions
• Enroll “Mary Johnson” in “CSC2320”:
BEGIN TRANSACTION;
INSERT INTO Takes
SELECT Students.SSN, Courses.CID
FROM Students, Courses
WHERE Students.name = 'Mary Johnson' and
Courses.name = 'CSC2320'
-- More updates here....
IF everything-went-OK
THEN COMMIT;
ELSE ROLLBACK
If system crashes, the transaction is still either committed or aborted
• A transaction = sequence of statements that either all succeed, or all fail
• Transactions have the ACID properties:
A = atomicity (a transaction should be done or undone completely)
C = consistency (a transaction should transform a system from one consistent state to another consistent state)
I = isolation (each transaction should happen independently of other transactions)
D = durability (completed transactions should remain permanent)
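The commit-or-rollback pattern above can be sketched in Python with the standard-library sqlite3 module. This is an illustration under assumptions, not the course's own code: the `Enrolled` counter column, the course id `C1` and the `enroll` helper are hypothetical and only serve to give the transaction two statements.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Students (SSN TEXT PRIMARY KEY, Name TEXT);
    CREATE TABLE Courses  (CID TEXT PRIMARY KEY, Name TEXT, Enrolled INTEGER DEFAULT 0);
    CREATE TABLE Takes    (SSN TEXT, CID TEXT, PRIMARY KEY (SSN, CID));
    INSERT INTO Students VALUES ('123456789', 'Mary Johnson');
    INSERT INTO Courses (CID, Name) VALUES ('C1', 'CSC2320');
""")

def enroll(conn, student_name, course_name):
    """Run both updates as one transaction; return True if it committed."""
    try:
        # 'with conn' commits if the block succeeds and rolls back if it raises.
        with conn:
            conn.execute(
                "UPDATE Courses SET Enrolled = Enrolled + 1 WHERE Name = ?",
                (course_name,),
            )
            conn.execute(
                """INSERT INTO Takes
                   SELECT Students.SSN, Courses.CID
                   FROM Students, Courses
                   WHERE Students.Name = ? AND Courses.Name = ?""",
                (student_name, course_name),
            )
        return True
    except sqlite3.Error:
        return False

first = enroll(conn, "Mary Johnson", "CSC2320")   # both statements succeed
second = enroll(conn, "Mary Johnson", "CSC2320")  # duplicate key: everything is undone

takes_count = conn.execute("SELECT COUNT(*) FROM Takes").fetchone()[0]
enrolled = conn.execute("SELECT Enrolled FROM Courses WHERE CID = 'C1'").fetchone()[0]
print(first, second, takes_count, enrolled)
```

In the second call the counter update runs before the insert fails, and the rollback undoes it as well. That is the atomicity property: the two statements succeed together or not at all.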
Queries
• Find all courses that “Mary” takes
SELECT C.name
FROM Students S, Takes T, Courses C
WHERE S.name = 'Mary' and
S.ssn = T.ssn and T.cid = C.cid
• What happens behind the scene?
– Query processor figures out how to answer the query efficiently.
Queries, behind the scene
Declarative SQL query:
SELECT C.name
FROM Students S, Takes T, Courses C
WHERE S.name = 'Mary' and
S.ssn = T.ssn and T.cid = C.cid
Imperative query execution plan:
[Figure: operator tree with nodes sname, cid=cid, sid=sid and name=“Mary” over the tables Students, Takes, Courses]
The optimizer chooses the best execution plan for a query
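To see the declarative query and the plan chosen for it side by side, here is a small sketch using Python's sqlite3 module. The sample rows and SSN values are made up for illustration, and `EXPLAIN QUERY PLAN` is SQLite's own way of showing a plan; other systems use different commands and print different text.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Students (ssn TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE Courses  (cid TEXT PRIMARY KEY, name TEXT);
    CREATE TABLE Takes    (ssn TEXT, cid TEXT);
    INSERT INTO Students VALUES ('111', 'Mary'), ('222', 'Dan');
    INSERT INTO Courses  VALUES ('CSC2444', 'Databases'), ('CSC2541', 'Operating systems');
    INSERT INTO Takes    VALUES ('111', 'CSC2444'), ('111', 'CSC2541'), ('222', 'CSC2444');
""")

# The declarative query: it says what is wanted, not how to compute it.
query = """
    SELECT C.name
    FROM Students S, Takes T, Courses C
    WHERE S.name = 'Mary' AND S.ssn = T.ssn AND T.cid = C.cid
    ORDER BY C.name
"""
courses = [row[0] for row in conn.execute(query)]
print(courses)

# Ask SQLite which plan its optimizer picked; the exact text varies by version.
for row in conn.execute("EXPLAIN QUERY PLAN " + query):
    print(row[-1])
```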
Database Systems
• The big commercial database vendors:
– Oracle
– IBM (with DB2)
– Microsoft (SQL Server)
– Sybase
• Some free database systems (Unix) :
– Postgres
– MySQL
– Predator
Databases and the Web
• Accessing databases through web interfaces
– Java programming interface (JDBC)
– Embedding into HTML pages (JSP)
– Access through http protocol (Web Services)
• Using Web document formats for data definition and manipulation
– XML, XQuery, XPath
– XML databases and messaging systems
Database Integration
• Combining data from different databases
– collection of data (wrapping)
– combination of data and generation of new views on the data (mediation)
• Problem: heterogeneity
– access, representation, content
Other Trends in Databases
• Industrial
– Object-relational databases
– Main memory database systems
– Data warehousing and mining
• Research
– Peer-to-peer data management
– Stream data management
– Mobile data management
AGAIN: 3 types of SQL commands
• 1. Data Definition Language (DDL) commands - that define a database, including creating, altering, and dropping tables and establishing constraints
• 2. Data Manipulation Language (DML) commands - that maintain and query a database
• 3. Data Control Language (DCL) commands - that control a database, including administering privileges and committing data
A simplified schematic of a typical SQL environment
[Figure not reproduced]
SQL Data types
• CHAR(n) – fixed-length character data, n characters long. Maximum length = 2000 bytes
• VARCHAR2(n) – variable-length character data, maximum 4000 bytes
• LONG – variable-length character data, up to 2GB (in Oracle). Maximum 1 per table
• NUMBER(p,q) – general purpose numeric data type
• INTEGER(p) – signed integer, p digits wide
• FLOAT(p) – floating point in scientific notation with p binary digits precision
• DATE – fixed-length date/time in dd-mm-yy form
DDL, DML, DCL, and the database development process
[Figure not reproduced]
Syntax used in these notes
• Capitals = command syntax (may not be required by the RDBMS)
• Lowercase = values that must be supplied by user
• Brackets = enclose optional syntax
• Ellipses (...) = indicate that the accompanying syntactic clause may be repeated as necessary
• Each SQL command ends with a semicolon ‘;’
• In interactive mode, when the user presses the RETURN key, the SQL command will execute
SQL Database Definition
• Data Definition Language (DDL) has these major CREATE statements:
– CREATE SCHEMA – defines a portion of the database owned by a particular user. Schemas are dependent on a catalog and contain schema objects, including base tables and views, domains, constraints, assertions, character sets, collations etc.
– CREATE TABLE – defines a new table and its columns. The table may be a base table or a derived table. Tables are dependent on a schema. Derived tables are created by executing a query that uses one or more tables or views.
SQL Database Definition
– CREATE VIEW – defines a logical table from one or more tables or views. There are limitations on updating data through a view. Where views can be updated, those changes can be transferred to the underlying base tables originally referenced to create the view.
SQL database definition
• Each of the previous create commands may be reversed using a DROP command, so DROP TABLE will destroy a table (including its definitions, contents and any schemas or views associated with it)
• Usually only the table creator may delete the table
• DROP SCHEMA and DROP VIEW will also delete the named schema or view.
• ALTER TABLE may be used to change the definition of an existing table
Creating tables
• Once the data model is designed and normalized, the columns needed for each table can be defined using the CREATE TABLE command. The syntax for this is shown in the following figure. These are the seven steps to follow:
• 1. Identify the appropriate datatype for each column, including length and precision
• 2. Identify those columns that should accept null values. Column controls that indicate a column cannot be null are established when a table is created and are enforced for every update of the table
• 3. Identify those columns that need to be UNIQUE - when the data in that column must have a different value (no duplicates) for each row of data within that table. Where a column or set of columns is designated as UNIQUE, this is a candidate key. Only one candidate key may be designated as a PRIMARY KEY
• 4. Identify all primary key-foreign key mates. Foreign keys can be established immediately or later by altering the table. The parent table in such a parent-child relationship should be created first. The column constraint REFERENCES can be used to enforce referential integrity
• 5. Determine values to be inserted into any columns for which a DEFAULT value is desired - can be used to define a value that is automatically inserted when no value is provided during data entry.
• 6. Identify any columns for which domain specifications may be stated that are more constrained than those established by data type. Using CHECK it is possible to establish validation rules for values to be inserted into the database
• 7. Create the table and any desired indexes using the CREATE TABLE and CREATE INDEX statements
Table creation
General syntax for CREATE TABLE
[Figure not reproduced]
Table creation
• The following figure shows SQL database definition commands
• Here some additional column constraints are shown, and primary and foreign keys are given names
• For example, the CUSTOMER table’s primary key is CUSTOMER_ID
• The primary key constraint is named CUSTOMER_PK; without the constraint name a system identifier would be assigned automatically and the identifier would be difficult to read
SQL database definition commands for Pine Valley Furniture
[Figure not reproduced]
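The seven table-creation steps can be walked through in one short sketch, written in Python with the standard-library sqlite3 module. The two tables, their columns and the sample values are hypothetical; they are not the Pine Valley Furniture definitions from the figure, only a small stand-in that uses the same kinds of constraints, including a named primary key in the style of CUSTOMER_PK.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked

conn.executescript("""
    -- Step 4: the parent table is created first.
    CREATE TABLE Customer (
        Customer_ID   INTEGER     NOT NULL,   -- steps 1 and 2: data type, not null
        Customer_Name VARCHAR(25) NOT NULL,
        Email         VARCHAR(50) UNIQUE,     -- step 3: a candidate key
        CONSTRAINT Customer_PK PRIMARY KEY (Customer_ID)
    );
    CREATE TABLE Orders (
        Order_ID    INTEGER NOT NULL,
        Customer_ID INTEGER NOT NULL,
        Status      VARCHAR(10) DEFAULT 'NEW',     -- step 5: default value
        Quantity    INTEGER CHECK (Quantity > 0),  -- step 6: domain constraint
        CONSTRAINT Orders_PK PRIMARY KEY (Order_ID),
        CONSTRAINT Orders_FK FOREIGN KEY (Customer_ID)
            REFERENCES Customer (Customer_ID)      -- step 4: referential integrity
    );
    -- Step 7: add any desired indexes.
    CREATE INDEX Orders_Customer_IDX ON Orders (Customer_ID);
""")

conn.execute("INSERT INTO Customer VALUES (1, 'Acme', 'acme@example.com')")
conn.execute("INSERT INTO Orders (Order_ID, Customer_ID, Quantity) VALUES (1001, 1, 3)")
status = conn.execute("SELECT Status FROM Orders WHERE Order_ID = 1001").fetchone()[0]

def rejected(sql):
    """Return True if the statement violates one of the constraints."""
    try:
        conn.execute(sql)
        return False
    except sqlite3.IntegrityError:
        return True

bad_quantity = rejected("INSERT INTO Orders (Order_ID, Customer_ID, Quantity) VALUES (1002, 1, 0)")
bad_customer = rejected("INSERT INTO Orders (Order_ID, Customer_ID, Quantity) VALUES (1003, 99, 1)")
print(status, bad_quantity, bad_customer)
```

The first order picks up the DEFAULT status because none was supplied. The other two inserts are refused by the CHECK and REFERENCES constraints, so the validation lives in the table definition rather than in application code.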
[Figure annotations for the table-creation steps]
STEP 1: Defining attributes and their data types
STEP 2: Non-nullable specifications (Note: primary keys should not be null)
STEP 3: Identifying primary keys (This is a composite primary key)
STEP 4: Identifying foreign keys and establishing relationships
STEPS 5 and 6: Default values and domain constraints
STEP 7: Overall table definitions
The Relational Model
•E.F. Codd: (1923-2003)
– Developed the relational model while at IBM San Jose Research Laboratory
– IBM Fellow 1976
– Turing Award 1981
– ACM Fellow 1994
– British, by birth
•“A Relational Model of Data for Large Shared Data Banks,” E.F. Codd, Communications of the ACM, Vol. 13, No. 6, June, 1970.
•“Further Normalization of the Data Base Relational Model,” E.F. Codd, Data Base Systems, Proceedings of 6th Courant Computer Science Symposium, May, 1971.
•“Relational Completeness of Data Base Sublanguages,” E.F. Codd, Data Base Systems, Proceedings of 6th Courant Computer Science Symposium, May, 1971.
•Associations:
– Raymond F. Boyce
– Hugh Darwen
– C.J. Date
– Nikos Lorentzos
– David McGoveran
– Fabian Pascal
•Plus others…
The Relational Model
•The basic data model:
– Relations, tuples, attributes, domains
– Primary & foreign keys
– Normal forms
“Employee”
ID     Last-Name  Date-of-Birth  Job-Category
21621  Smith      6/24/69        Management
17852  Brown      8/14/72        Hardware
32904  Carson     10/29/64       Software
:      :
•Query model:
– Relational algebra – cartesian product, selection, projection, union, set-difference
– Relational calculus
•A primary theme:
– Physical data independence
Relational Database Management Systems (RDBMS)
•Database Management Systems Based on the Relational Model:
– System R – IBM research project (1974)
– Ingres – University of California Berkeley (early 1970’s)
– Oracle – Relational Software, now Oracle Corporation (1979)
– SQL/DS – IBM’s first commercial RDBMS (1981)
– Informix – Relational Database Systems, now IBM (1981)
– DB2 – IBM (1984)
– Sybase SQL Server – Sybase, now SAP (1988)
Structured Query Language (SQL)
SQL is a language for querying relational databases.
History:
Developed at IBM San Jose Research Laboratory, early 1970’s, for System R
Credited to Donald D. Chamberlin and Raymond F. Boyce
Based on relational algebra and tuple calculus
Originally called SEQUEL
Language Elements:
Clauses, expressions, predicates, queries, statements, transactions, operators, nesting etc.
select o_orderpriority, count(*) as order_count
from orders
where o_orderdate >= date '[DATE]' and o_orderdate < date '[DATE]' + interval '3' month
and exists (select * from lineitem
where l_orderkey = o_orderkey and l_commitdate < l_receiptdate)
group by o_orderpriority
order by o_orderpriority;
SQL and the Relational Model
•A text search of E.F. Codd’s early papers for “SQL” (or SEQUEL) reveals:
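The order-priority query in these slides combines several of the language elements listed: a predicate, a nested EXISTS subquery, grouping and ordering. A runnable approximation in Python's sqlite3 module is sketched below. Two things are assumptions: the sample rows are invented, and the `date '...' + interval '3' month` arithmetic is replaced by SQLite's `date(x, '+3 months')` function, since SQLite does not accept the interval syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE orders   (o_orderkey INTEGER, o_orderdate TEXT, o_orderpriority TEXT);
    CREATE TABLE lineitem (l_orderkey INTEGER, l_commitdate TEXT, l_receiptdate TEXT);
    INSERT INTO orders VALUES
        (1, '2015-01-10', '1-URGENT'),
        (2, '2015-02-05', '1-URGENT'),
        (3, '2015-02-20', '2-HIGH'),
        (4, '2015-06-01', '2-HIGH');      -- outside the three-month window
    INSERT INTO lineitem VALUES
        (1, '2015-01-20', '2015-01-25'),  -- received after the commit date
        (2, '2015-02-15', '2015-02-10'),  -- received in time
        (3, '2015-03-01', '2015-03-05'),  -- received after the commit date
        (4, '2015-06-10', '2015-06-20');  -- late, but the order is out of range
""")

result = conn.execute("""
    SELECT o_orderpriority, COUNT(*) AS order_count
    FROM orders
    WHERE o_orderdate >= :start
      AND o_orderdate < date(:start, '+3 months')
      AND EXISTS (SELECT * FROM lineitem
                  WHERE l_orderkey = o_orderkey
                    AND l_commitdate < l_receiptdate)
    GROUP BY o_orderpriority
    ORDER BY o_orderpriority
""", {"start": "2015-01-01"}).fetchall()
print(result)
```

Only orders 1 and 3 fall inside the window and have a late line item, so each priority is counted once.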
Relational Query Languages
•Other Relational Query Languages:
– Datalog
– QUEL
– Query By Example (QBE)
– SQL variations
– shell scripts, with relational extensions
What is Wrong With RDBMS?
• Nothing. One size fits all? Not really.
• Impedance mismatch.
– Object Relational Mapping doesn't work quite well.
• Rigid schema design.
• Harder to scale.
• Replication.
• Joins across multiple nodes? Hard.
• How does an RDBMS handle data growth? Hard.
• Need for a DBA.
• Many programmers are already familiar with it.
• Transactions and ACID make development easy.
• Lots of tools to use.
ACID Semantics
• Atomicity: All or nothing.
• Consistency: Consistent state of data and transactions.
• Isolation: Transactions are isolated from each other.
• Durability: When the transaction is committed, state will be durable.
Any data store can achieve Atomicity, Isolation and Durability but do you always need consistency? No.
By giving up ACID properties, one can achieve higher performance and scalability.
Enter CAP Theorem
• Also known as Brewer’s Theorem, after Prof. Eric Brewer of the University of California, Berkeley, who published it in 2000.
• “Of three properties of a shared data system: data consistency, system availability and tolerance to network partitions, only two can be achieved at any given moment.”
• Proven by Nancy Lynch et al. at MIT.
• http://www.cs.berkeley.edu/~brewer/cs262b-2004/PODC-keynote.pdf
CAP Semantics
• Consistency: Clients should read the same data. There are many levels of consistency.
– Strict Consistency – RDBMS.
– Tunable Consistency – Cassandra.
– Eventual Consistency – Amazon Dynamo.
• Availability: Data to be available.
• Partition Tolerance: The system keeps working when data is partitioned across network segments due to network failures.
A Simple Proof
Consistent and available
No partition.
[Figure: App, nodes A and B, both holding Data]
A Simple Proof
Consistent and partitioned
Not available, waiting…
[Figure: App, nodes A and B; Data; Wait for new data]
A Simple Proof
Available and partitioned
Not consistent, we get back old data.
[Figure: App, nodes A and B; New Data, Old Data]
BASE, an ACID Alternative
Almost the opposite of ACID.
• Basically available: Nodes in a distributed environment can go down, but the whole system shouldn’t be affected.
• Soft State (scalable): The state of the system and data changes over time.
• Eventual Consistency: Given enough time, data will be consistent across the distributed system.
A Clash of cultures
ACID:
• Strong consistency.
• Less availability.
• Pessimistic concurrency.
• Complex.
BASE:
• Availability is the most important thing. Willing to sacrifice for this (CAP).
• Weaker consistency (Eventual).
• Best effort.
• Simple and fast.
• Optimistic.
Distributed Transactions
• Two phase commit.
– Starbucks doesn’t use two phase commit by Gregor Hohpe.
• Possible failures
– Network errors.
– Node errors.
– Database errors.
[Figure: Coordinator exchanging Commit, Rollback and Acknowledge messages; Complete operation, Release locks]
Problems:
• Locking the entire cluster if one node is down
• Possible to implement timeouts.
• Possible to use Quorum.
Quorum: in a distributed environment, if there is a partition, then the nodes vote to commit or rollback.
Consistent Hashing
• Solves Partitioning Problem.
• Consistent Hashing, Memcached.
– servers = [s1, s2, s3, s4, s5]
– serverToSendData = servers[hash(data) % servers.length]
• A New Hope
– Continuum Approach.
• Virtual Nodes in a cycle.
• Hash both objects and caches.
• Easy Replication.
– Eventually Consistent.
• What happens if nodes fail?
• How do you add nodes?
http://www.akamai.com/dl/technical_publications/ConsistenHashingandRandomTreesDistributedCachingprotocolsforrelievingHotSpotsontheworldwideweb.pdf
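The difference between `servers[hash(data) % servers.length]` and the continuum approach shows up when a server is removed. The following Python sketch is one possible way to build such a ring, not the implementation used by any particular product. The names `stable_hash` and `HashRing`, the MD5 hash and the choice of 100 virtual nodes per server are all assumptions made for the illustration.

```python
import bisect
import hashlib

def stable_hash(key):
    """Deterministic hash (Python's built-in hash() of strings is salted per process)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)

def modulo_server(key, servers):
    """The naive scheme: hash(data) % servers.length."""
    return servers[stable_hash(key) % len(servers)]

class HashRing:
    """Consistent hashing: servers and keys are hashed onto the same circle."""

    def __init__(self, servers, vnodes=100):
        self.vnodes = vnodes
        self.ring = []  # sorted list of (point on the circle, server)
        for server in servers:
            self.add(server)

    def add(self, server):
        # Each server owns several virtual nodes spread around the circle.
        for i in range(self.vnodes):
            bisect.insort(self.ring, (stable_hash(f"{server}#{i}"), server))

    def remove(self, server):
        self.ring = [(point, s) for point, s in self.ring if s != server]

    def lookup(self, key):
        # First virtual node clockwise from the key's point, wrapping around.
        i = bisect.bisect(self.ring, (stable_hash(key), ""))
        return self.ring[i % len(self.ring)][1]

servers = ["s1", "s2", "s3", "s4", "s5"]
keys = [f"key{i}" for i in range(1000)]

ring = HashRing(servers)
before = {k: ring.lookup(k) for k in keys}
ring.remove("s5")  # a node fails
after = {k: ring.lookup(k) for k in keys}

moved_ring = sum(before[k] != after[k] for k in keys)
moved_modulo = sum(
    modulo_server(k, servers) != modulo_server(k, servers[:-1]) for k in keys
)
print("keys remapped with the ring:", moved_ring)
print("keys remapped with modulo:  ", moved_modulo)
```

With the ring, only the keys that lived on the removed server change owner. With the modulo scheme most keys are remapped, because the divisor itself changes.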
Concurrency models
• Optimistic concurrency.
• Pessimistic concurrency.
• MVCC.
Vector Clocks
• Used for conflict detection of data.
• Timestamp based resolution of conflicts is not enough.
[Figure: timeline from Time 1 to Time 5 with the events Replicated, Update, Update, Replicated and Conflict detection]
Vector Clocks
[Figure: nodes A, B and C]
Document.v.1([A, 1])
Update
Document.v.2([A, 2])
Document.v.2([A, 2],[B,1])
Document.v.2([A, 2],[C,1])
Conflicts are detected.
Read Repair
[Figure: Client sends GET (K, Q=2); replies Value = Data.v2, Value = Data.v2 and Value = Data.v1; then Update K = Data.v2]
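The vector clock example, with a document updated at A and then independently at B and C, can be replayed in a few lines of Python. This is a minimal sketch: the clock is modelled as a plain dict from node name to counter, and the helper names `increment`, `descends` and `compare` are hypothetical.

```python
def increment(clock, node):
    """Return a copy of the clock with the node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated

def descends(a, b):
    """True if clock a has seen every update that clock b has seen."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def compare(a, b):
    """Classify two versions of the same document."""
    if a == b:
        return "equal"
    if descends(a, b):
        return "a is newer"
    if descends(b, a):
        return "b is newer"
    return "conflict"

v1 = increment({}, "A")      # Document.v.1([A, 1])
v2 = increment(v1, "A")      # Document.v.2([A, 2])
at_b = increment(v2, "B")    # Document.v.2([A, 2],[B,1]), updated at B
at_c = increment(v2, "C")    # Document.v.2([A, 2],[C,1]), updated at C

print(compare(v2, v1))       # v2 replaces v1
print(compare(at_b, v2))     # B's copy replaces v2
print(compare(at_b, at_c))   # neither descends from the other
```

The last comparison is the case a single timestamp cannot express: both copies descend from ([A, 2]), but neither has seen the other's update, so the conflict has to be detected and resolved rather than decided by the later wall-clock time.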
Gossip Protocol & Hinted Handoffs
• Most preferred communication protocol in a distributed environment is Gossip Protocol.
• All the nodes talk to each other peer wise.
• There is no global state.
• No single point of coordinator.
• If one node goes down and there is a Quorum, load for that node is shared among others.
• Self managing system.
• If a new node joins, load is also distributed.
[Figure: ring of nodes A, B, C, D, F, G, H]
Requests coming to F will be handled by the nodes that take over F's load, say C, with a hint that the requests were meant for F. When F becomes available again, F gets this information from C. Self-healing property.
Data Models
• Key/Value Pairs.
• Tuples (rows).
• Documents.
• Columns.
• Objects.
• Graphs.
There are corresponding data stores.
Complexity
[Figure not reproduced]
Key-Value Stores
• Memcached – Key value stores.
• Membase – Memcached with persistence and improved
consistent hashing.
• AppFabric Cache – Multi region Cache.
• Redis – Data structure server.
• Riak – Based on Amazon’s Dynamo.
• Project Voldemort – eventual consistent key value
stores, auto scaling.
Memcached
• Very easy to setup and use.
• Consistent hashing.
• Scales very well.
• In memory caching, no persistence.
• LRU eviction policy.
• O(1) to set/get/delete.
• Atomic operations set/get/delete.
• No iterators, or very difficult.
Membase
• Easy to manage via web console.
• Monitoring and management via Web console.
• Consistency and Availability.
• Dynamic/Linear Scalability, add a node, hit join to cluster and rebalance.
• Low latency, high throughput.
• Compatible with current Memcached Clients.
• Data Durability, persistent to disk asynchronously.
• Rebalancing (Peer to peer replication).
• Fail over (Master/Slave).
• vBuckets are used for consistent hashing.
• O(1) to set/get/delete.
Redis
• Distributed Data structure server.
• Consistent hashing at client.
• Non-blocking I/O, single threaded.
• Values are binary safe strings: byte strings.
• String: Key/Value Pair, set/get. O(1) many string operations.
• Lists: lpush, lpop, rpush, rpop. You can use it as stack or queue. O(1). Publisher/Subscriber is available.
• Set: Collection of Unique elements, add, pop, union, intersection etc. set operations.
• Sorted Set: Unique elements sorted by scores. O(logn). Supports range operations.
• Hashes: Multiple Key/Value pairs
HMSET user:1 username foo password bar age 30
HGET user:1 age
Microsoft AppFabric
• Add a node to the cluster easily. Elastic scalability.
• Namespaces to organize different caches.
• LRU Eviction policy.
• Timeout/Time to live is default to 10 min.
• No persistence.
• O(1) to set/get/delete.
• Optimistic and pessimistic concurrency.
• Supports tagging.
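The LRU eviction policy and the O(1) set/get/delete listed for Memcached and AppFabric can be illustrated with a small in-process cache. This is a sketch of the idea only, assuming Python's `collections.OrderedDict`; the class name `LRUCache` is hypothetical and the real servers manage memory very differently.

```python
from collections import OrderedDict

class LRUCache:
    """A tiny cache that evicts the least recently used key when full."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.items = OrderedDict()  # oldest entry first, newest last

    def get(self, key):
        if key not in self.items:
            return None
        self.items.move_to_end(key)  # mark as most recently used
        return self.items[key]

    def set(self, key, value):
        self.items[key] = value
        self.items.move_to_end(key)
        if len(self.items) > self.capacity:
            self.items.popitem(last=False)  # evict the least recently used entry

    def delete(self, key):
        self.items.pop(key, None)

cache = LRUCache(capacity=2)
cache.set("a", 1)
cache.set("b", 2)
cache.get("a")     # "a" is now more recently used than "b"
cache.set("c", 3)  # the cache is full, so "b" is evicted
print(list(cache.items))
```

Each operation is a constant number of dictionary and linked-list steps, which is where the O(1) figure comes from. Nothing is written to disk, matching the “no persistence” bullet.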
Document Stores
• Schema Free.
• Usually JSON like interchange model.
• Query Model: JavaScript or custom.
• Aggregations: Map/Reduce.
• Indexes are done via B-Trees.
MongoDB
• Data types: bool, int, double, string, object(bson), oid, array, null, date.
• Database and collections are created automatically.
• Lots of Language Drivers.
• Capped collections are fixed size collections, buffers, very fast, FIFO, good for logs. No indexes.
• Object ids are generated by the client, 12 bytes packed data. 4 byte time, 3 byte machine, 2 byte pid, 3 byte counter.
• Possible to refer to other documents in different collections but more efficient to embed documents.
• Replication is very easy to setup. You can read from slaves.
MongoDB
• Connection pooling is done for you. Sweet.
• Supports aggregation.
– Map Reduce with JavaScript.
• You have indexes, B-Trees. Ids are always indexed.
• Updates are atomic. Low contention locks.
• Querying mongo done with a document:
– Lazy, returns a cursor.
– Reducible to SQL, select, insert, update, limit, sort etc.
• There is more: upsert (either inserts or updates)
– Several operators:
• $ne, $and, $or, $lt, $gt, $inc and so on.
• Repository Pattern makes development very easy.
MongoDB - Sharding
• Config servers: Keeps mapping
• Mongos: Routing servers
• Mongod: master-slave replicas
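The 12-byte object id layout described for MongoDB (4 byte time, 3 byte machine, 2 byte pid, 3 byte counter) can be made concrete by packing and unpacking the fields by hand. The sketch below uses only Python's standard library; the function names and sample values are hypothetical, and the big-endian byte order is an assumption of this sketch. It is not the driver's own generator. Newer MongoDB versions describe the middle five bytes as a random value rather than machine and pid.

```python
import struct

def make_object_id(timestamp, machine, pid, counter):
    """Pack the four fields into 12 bytes (big-endian is assumed here)."""
    if len(machine) != 3:
        raise ValueError("machine identifier must be exactly 3 bytes")
    return (
        struct.pack(">I", timestamp)               # 4-byte time, seconds since the epoch
        + machine                                  # 3-byte machine identifier
        + struct.pack(">H", pid & 0xFFFF)          # 2-byte process id
        + (counter & 0xFFFFFF).to_bytes(3, "big")  # 3-byte counter
    )

def parse_object_id(oid):
    """Split a 12-byte id back into (timestamp, machine, pid, counter)."""
    timestamp = struct.unpack(">I", oid[0:4])[0]
    machine = oid[4:7]
    pid = struct.unpack(">H", oid[7:9])[0]
    counter = int.from_bytes(oid[9:12], "big")
    return timestamp, machine, pid, counter

# Sample values: 1445817600 is 2015-10-26 00:00:00 UTC.
oid = make_object_id(1445817600, b"\x01\x02\x03", 4242, 7)
print(len(oid), parse_object_id(oid))
```

Because the timestamp comes first, ids created later sort after earlier ones, and because machine, pid and counter are part of the id, clients can generate ids without asking the server.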
CouchDB
• Availability and Partition Tolerance.
• Views are used to query. Map/Reduce.
• MVCC – Multi-Version Concurrency Control. No locks.
– A little overhead with this approach due to garbage collection.
– Conflict resolution.
• Very simple, REST based. Schema Free.
• Shared nothing, seamless peer based Bi-Directional replication.
• Auto Compaction. Manual with MongoDB.
• Uses B-Trees
• Documents and indexes are kept in memory and flushed to disc periodically.
• Documents have states, in case of a failure, recovery can continue from the state documents were left.
• No built in auto-sharding, there are open source projects.
• You can’t define your indexes.
Objectivity
• No need for ORM. Closer to OOP.
• Complex data modeling.
• Schema evolution.
• Scalable Collections: List, Set, Map.
• Object relations.
– Bi-Directional relations
• ACID properties.
• Blazingly fast, uses paging.
• Supports replication and clustering.
Column Stores
Row oriented
Id  username  email         Department
1   John      john@foo.com  Sales
2   Mary      mary@foo.com  Marketing
3   Yoda      yoda@foo.com  IT
Column oriented
Id  Username  email         Department
1   John      john@foo.com  Sales
2   Mary      mary@foo.com  Marketing
3   Yoda      yoda@foo.com  IT
Cassandra
• Tunable consistency.
• Decentralized.
• Writes are faster than reads.
• No Single point of failure.
• Incremental scalability.
• Uses consistent hashing (logical partitioning) when clustered.
• Hinted handoffs.
• Peer to peer routing (ring).
• Thrift API.
• Multi data center support.
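The row-oriented and column-oriented tables under Column Stores hold the same data and differ only in which values sit next to each other in storage. A minimal Python sketch, with in-memory lists standing in for disk pages (an assumption of the illustration, not how any particular column store is built):

```python
rows = [
    {"id": 1, "username": "John", "email": "john@foo.com", "department": "Sales"},
    {"id": 2, "username": "Mary", "email": "mary@foo.com", "department": "Marketing"},
    {"id": 3, "username": "Yoda", "email": "yoda@foo.com", "department": "IT"},
]

# Row-oriented: the fields of one record are stored together.
row_store = [tuple(r.values()) for r in rows]

# Column-oriented: the values of one column are stored together.
column_store = {name: [r[name] for r in rows] for name in rows[0]}

# Fetching a whole record is one lookup in the row layout ...
record_2 = row_store[1]

# ... while scanning a single attribute touches one list in the column layout.
departments = column_store["department"]

print(record_2)
print(departments)
```

Queries that read a few columns over many rows favour the second layout, while workloads that read or write whole records favour the first.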
Cassandra at Netflix
[Figure not reproduced]
http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
Graph Stores
• Based on Graph Theory.
• Scale vertically, no clustering.
• You can use graph algorithms easily.
Neo4J
• Nodes, Relationship.
• Traversals.
• HTTP/REST.
• ACID.
• Web Admin.
• Not too much support for languages.
• Has transactions.
Which one to use?
• Key-value stores:
– Processing a constant stream of small reads and writes.
• Document databases:
– Natural data modeling. Programmer friendly. Rapid development. Web friendly, CRUD.
• RDBMS:
– OLTP. SQL. Transactions. Relations.
• OODBMS
– Complex object models.
• Data Structure Server:
– Quirky stuff.
• Columnar:
– Handles size well. Massive write loads. High availability. Multiple data centers. MapReduce
• Graph:
– Graph algorithms and relations.
• Want more ideas?
http://highscalability.com/blog/2011/6/20/35-use-cases-for-choosing-your-nextnosql-database.html
The NoSQL RDBMS
One of the first uses of the phrase NoSQL is due to Carlo Strozzi, circa 1998.
NoSQL:
• A fast, portable, open-source RDBMS
• A derivative of the RDB database system (Walter Hobbs, RAND)
• Not a full-function DBMS, per se, but a shell-level tool
• User interface – Unix shell
• Based on the “operator/stream paradigm”
http://www.strozzi.it/cgi-bin/CSA/tw7/I/en_US/nosql/Home%20Page
Operator/stream Paradigm
Commonly referenced papers:
• “The Next Generation,” E. Schaffer and M. Wolf, UNIX Review, March, 1991, page 24.
• “The UNIX Shell as a Fourth Generation Language,” E. Schaffer and M. Wolf, Revolutionary Software.
Regarding Database Management Systems:
• “…almost all are software prisons that you must get into and leave the power of UNIX behind.”
• “…large, complex programs which degrade total system performance, especially when they are run in a multi-user environment.”
• “…put walls between the user and UNIX, and the power of UNIX is thrown away.”
In summary:
• Relational model => yes
• UNIX => big yes
• Big, COTS, relational DBMS => no
• SQL => no
The NoSQL RDBMS
•Getting back to Strozzi’s NoSQL RDBMS:
– Based on the relational model
– Based on UNIX and shell scripts
– Does not have an SQL interface
•In that sense, and interpreted literally, NoSQL means “no sql,” i.e., we are not using the SQL language.
NoSQL Today
More recently:
• The term has taken on different meanings
• One common interpretation is “not only SQL”
Most modern NoSQL systems diverge from the relational model or standard RDBMS functionality:
The data model: relations, tuples, attributes, domains, normalization vs. documents, graphs, key/values
The query model: relational algebra, tuple calculus vs. graph traversal, text search, map/reduce
The implementation: rigid schemas vs. flexible schemas (schema-less); ACID compliance vs. BASE
In that sense, NoSQL today is more commonly meant to be something like “non-relational”
NoSQL Today
Motivation for recent NoSQL systems is also quite varied:
• “…there are significant advantages to building our own storage solution at Google,” Chang et al., 2006
• Scalability, performance, availability, flexibility
• Speculation - $$$, control
MySQL vs. MongoDB:
• http://www.youtube.com/watch?v=b2F-DItXtZs
NoSQL Today
How “big” is the NoSQL movement?
Is this another grand conspiracy by the government and, you know, that guy….
• http://www.vertabelo.com/blog/vertabelo-news/jdd-2013-what-we-found-out-about-databases
It is easy to find diagrams that look like this:
http://db-engines.com/en/ranking_categories
It is easy to find diagrams that look like this:
• http://www.odbms.org/2014/11/gartner-2014-magic-quadrant-operational-database-managementsystems-2/
NoSQL Today
(a partial, unrefined list)
Hypertable
BigTable
QD Technology
Vertica
Qbase–MetaCarta
OpenNeptune
Accumulo
Stratosphere
flare
SmartFocus
KDI
Alterian
Cloudera
C-Store
HPCC
Mongo DB
Amazon SimpleDB
SciDB
CouchDB
Clusterpoint Server
Terrastore
Djondb
SchemaFreeDB
Jackrabbit
OrientDB
Persevere
CloudKit
RaptorDB
ThruDB
RavenDB
DynamoDB
Azure Table Storage
Couchbase Server
Riak
LevelDB
Chordless
GenieDB
Scalaris
Tokyo Cabinet / Tyrant
Kyoto Cabinet
Faircom C-Tree
Berkeley DB
Voldemort
Dynomite
KAI
MemcacheDB
Tarantool/Box
Maxtable
Pincaster
RaptorDB
TIBCO Active Spaces
Hibari
BangDB
SDB
JasDB
Scalien
HamsterDB
STSdb
allegro-C
nessDB
HyperDex
Mnesia
LightCloud
OpenLDAP/MDB/Lightning
Scality
Redis
KaTree
TomP2P
Kumofs
TreapDB
NMDB
luxio
actord
Keyspace
schema-free
RAMCloud
SubRecord
MotionDb
Dovetaildb
JDBM
Neo4j
InfiniteGraph
DEX
BrightstarDB
Sones
InfoGrid
HyperGraphDB
GraphBase
Trinity
AllegroGraph
Bigdata
Meronymy
OpenLink Virtuoso
VertexDB
FlockDB
Execom IOG
Java Univ Netwrk/Graph Framework
OpenRDF/Sesame
Filament
OWLim
iGraph
Jena
SPARQL
ArangoDB
AlchemyDB
Soft NoSQL Systems
Db4o
Versant
Objectivity
Starcounter
ZODB
Magma
NEO
siaqodb
Sterling
Morantex
EyeDB
FramerD
NetworkX
PicoList
Ninja Database Pro
StupidDB
Hazelcast
KiokuDB
Perl solution
OrientDb
Durus
GigaSpaces
Infinispan
Queplix
GridGain
Galaxy
SpaceBase
Joafip
Coherence
eXtremeScale
MarkLogic Server
EMC Documentum xDB
eXist
Sedna
BaseX
Qizx
Berkeley DB XML
Xindice
Tamino
Intersystems Cache
GT.M
EGTM
ESENT
MultiValue
Globals
U2
OpenInsight
Reality
OpenQM
eXtremeDB
RDM Embedded
ISIS Family
Prevayler
Yserial
Vmware vFabric GemFire
KirbyBase
Tokutek
Recutils
FileDB
Armadillo
illuminate Correlation Database
FluidDB
Fleet DB
Twisted Storage
Rindo
Sherpa
tin
Dryad
SkyNet
Disco
MUMPS
Adabas
Oracle Big Data Appliance
jBASE
Lotus/Domino
Btrieve
XAP In-Memory Grid
eXtreme Scale
MckoiDDB
Mckoi SQL Database
Innostore
No-List
KDI
Perst
FleetDB
IODB
Cassandra
Cloudata
HSS Database
HBase
Will they eventually eliminate the need for relational databases?
Primary NoSQL Categories
•General Categories of NoSQL Systems:
– Key/value store
– (wide) Column store
– Graph store
– Document store
•Compared to the relational model:
– Query models are not as developed.
– Distinction between abstraction & implementation is not as clear.
Key/Value Store
•“Dynamo: Amazon’s Highly Available Key-value Store,” DeCandia, G., et al., SOSP’07, 21st ACM Symposium on Operating Systems Principles.
•The basic data model:
– Database is a collection of key/value pairs
– The key for each pair is unique
•Primary operations:
– insert(key,value)
– delete(key)
– update(key,value)
– lookup(key)
•Additional operations:
– variations on the above, e.g., reverse lookup
– iterators
Wide Column Store
•“Bigtable: A Distributed Storage System for Structured Data,” Chang, F., et al., OSDI’06: Seventh Symposium on Operating System Design and Implementation, 2006.
•The basic data model:
– Database is a collection of key/value pairs
– Key consists of 3 parts – a row key, a column key, and a time-stamp (i.e., the version)
– Flexible schema - the set of columns is not fixed, and may differ from row-to-row
•One last column detail:
– Column key consists of two parts – a column family, and a qualifier
Warning #1!
No requirement for normalization (and consequently dependency preservation or lossless join)
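The key/value primary operations listed above (insert, delete, update, lookup) and the additional ones (reverse lookup, iterators) can be sketched in a few lines of Python. This is only a minimal in-memory illustration, not how Dynamo or any real store is implemented; the class name KeyValueStore and the sample keys and values are made up for the example.

```python
class KeyValueStore:
    """Toy in-memory key/value store (hypothetical); every key is unique."""

    def __init__(self):
        self._data = {}

    def insert(self, key, value):
        # Refuse to overwrite: the key for each pair must be unique.
        if key in self._data:
            raise KeyError(f"duplicate key: {key!r}")
        self._data[key] = value

    def delete(self, key):
        # Raises KeyError if the key is not present.
        del self._data[key]

    def update(self, key, value):
        if key not in self._data:
            raise KeyError(f"unknown key: {key!r}")
        self._data[key] = value

    def lookup(self, key):
        return self._data[key]

    def reverse_lookup(self, value):
        # "Additional operation": all keys currently mapped to this value.
        return [k for k, v in self._data.items() if v == value]

    def __iter__(self):
        # Iterator over (key, value) pairs.
        return iter(self._data.items())


store = KeyValueStore()
store.insert("student:1", "John")      # sample data, invented for the sketch
store.insert("student:2", "Mary")
store.update("student:2", "Maria")
store.delete("student:1")
```

Note that the store knows nothing about the structure of a value; any search other than lookup(key) has to scan all pairs, which is what reverse_lookup does here.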
Wide Column Store
[Figure: one “table” in a wide column store, with a row key (ID), the column families Personal data, Professional data, and Medical data, and column qualifiers such as First Name, Middle Name, Last Name, Date of Birth, Job Category, Salary, Hourly Rate, Date of Hire, Employer, Group, Seniority, Insurance ID, Bldg #, Office #, and Emergency Contact; the set of qualifiers differs from row to row.]
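The wide-column data model (a key made of row key, column key, and time-stamp, where the column key is a column family plus a qualifier) can be sketched as a nested mapping. This is a rough Python illustration only; the class WideColumnTable, its method names, and the employee data are assumptions for the example and are not Bigtable's actual API.

```python
from collections import defaultdict


class WideColumnTable:
    """Toy wide-column table (hypothetical): versioned cells keyed by
    (row key, column family, column qualifier)."""

    def __init__(self):
        # (row_key, family, qualifier) -> {timestamp: value}
        self._cells = defaultdict(dict)

    def put(self, row_key, family, qualifier, timestamp, value):
        self._cells[(row_key, family, qualifier)][timestamp] = value

    def get(self, row_key, family, qualifier, timestamp=None):
        versions = self._cells.get((row_key, family, qualifier))
        if not versions:
            return None                      # this row has no such column
        if timestamp is None:
            timestamp = max(versions)        # default: latest version
        return versions.get(timestamp)

    def row(self, row_key):
        """Latest value of every column that exists in one row."""
        return {
            f"{family}:{qualifier}": versions[max(versions)]
            for (rk, family, qualifier), versions in self._cells.items()
            if rk == row_key
        }


table = WideColumnTable()
# Sample rows, invented for the sketch; the two rows have different columns.
table.put("emp1", "personal", "first_name", 0, "John")
table.put("emp1", "professional", "salary", 0, 50000)
table.put("emp1", "professional", "salary", 1, 55000)   # newer version
table.put("emp2", "personal", "first_name", 0, "Mary")
table.put("emp2", "professional", "hourly_rate", 0, 30)
```

Here emp1 carries two versions of its salary while emp2 has an hourly rate and no salary column at all, which is the flexible schema the slides describe.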
Wide Column Store
[Figure: one “row” of the table, identified by its row key, with time-stamped versions (t0, t1) of the columns ID, First Name, Last Name, Date of Birth (Personal data) and Job Category, Salary, Date of Hire, Employer (Professional data).]
One “row” in a wide-column NoSQL database table = many rows in several relations/tables in a relational database
Graph Store
•Neo4j - “The Neo Database – A Technology Introduction,” 2006.
•The basic data model:
– Directed graphs
– Nodes & edges, with properties, i.e., “labels”
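The graph data model (directed graphs whose nodes and edges carry properties) and a simple graph traversal can be sketched as follows. This is a minimal Python illustration under stated assumptions: the class PropertyGraph and its methods are hypothetical and are not Neo4j's API, and the student/course data is invented.

```python
from collections import deque


class PropertyGraph:
    """Toy directed property graph (hypothetical)."""

    def __init__(self):
        self.nodes = {}    # node_id -> dict of properties
        self.edges = []    # list of (source, target, dict of properties)

    def add_node(self, node_id, **props):
        self.nodes[node_id] = props

    def add_edge(self, source, target, **props):
        self.edges.append((source, target, props))

    def neighbors(self, node_id, label=None):
        # Follow outgoing edges only, optionally filtered by edge label.
        return [t for s, t, p in self.edges
                if s == node_id and (label is None or p.get("label") == label)]

    def reachable(self, start):
        """Breadth-first traversal along edge direction, in visit order."""
        seen, queue, order = {start}, deque([start]), []
        while queue:
            node = queue.popleft()
            order.append(node)
            for nxt in self.neighbors(node):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return order


g = PropertyGraph()
g.add_node("john", kind="student")          # sample data for the sketch
g.add_node("prof1", kind="professor")
g.add_node("csc2320", kind="course")
g.add_edge("john", "csc2320", label="takes")
g.add_edge("prof1", "csc2320", label="teaches")
```

Queries are expressed as traversals from a starting node, e.g. g.neighbors("john", label="takes"), rather than as joins between tables.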
Document Store
•MongoDB - “How a Database Can Make Your Organization Faster, Better, Leaner,” February 2015.
•The basic data model:
– The general notion of a document – words, phrases, sentences, paragraphs, sections, subsections, footnotes, etc.
– Flexible schema – subcomponent structure may be nested, and vary from document-to-document.
– Metadata – title, author, date, embedded tags, etc.
– Key/identifier.
•One implementation detail:
– Formats vary greatly – PDF, XML, JSON, BSON, plain text, various binary, scanned image.
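The document-store data model (a key/identifier per document, nested structure, metadata, and a schema that may vary from document to document) can be sketched with JSON-like Python dictionaries. This is an illustrative sketch only; the class DocumentStore, the dotted-path query, and the sample documents are assumptions for the example and do not reproduce MongoDB's API.

```python
import json

_MISSING = object()   # marks "this document has no such field"


def _resolve(document, path):
    """Walk a dotted path such as 'meta.author' through nested dicts."""
    node = document
    for part in path.split("."):
        if not isinstance(node, dict) or part not in node:
            return _MISSING
        node = node[part]
    return node


class DocumentStore:
    """Toy document store (hypothetical): identifier -> JSON-like document."""

    def __init__(self):
        self._docs = {}

    def put(self, doc_id, document):
        # Round-trip through JSON: checks serializability and stores a copy.
        self._docs[doc_id] = json.loads(json.dumps(document))

    def get(self, doc_id):
        return self._docs[doc_id]

    def find(self, path, value):
        """Identifiers of all documents whose field at `path` equals value."""
        return [doc_id for doc_id, doc in self._docs.items()
                if _resolve(doc, path) == value]


docs = DocumentStore()
# Two sample documents with different structure (invented for the sketch).
docs.put("d1", {"title": "Intro to SQL",
                "meta": {"author": "A", "date": "2015-10-26"},
                "sections": [{"heading": "Overview"}]})
docs.put("d2", {"title": "Notes",
                "meta": {"author": "B"},
                "tags": ["nosql"]})
```

Neither document declares a schema: d1 has sections and a date, d2 has tags instead, and a query on a field that a document lacks simply does not match it.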
ACID vs. BASE
•Database systems traditionally support ACID requirements:
– Atomicity, Consistency, Isolation, Durability
•In distributed web applications the focus shifts to:
– Consistency, Availability, Partition tolerance
•CAP theorem - At most two of the above can be enforced at any given time.
– Conjecture – Eric Brewer, ACM Symposium on the Principles of Distributed Computing, 2000.
– Proved – Seth Gilbert & Nancy Lynch, ACM SIGACT News, 2002.
•Reducing consistency, at least temporarily, maintains the other two.
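What “reducing consistency, at least temporarily” can look like is sketched below with two replicas that stay available for reads and writes and only converge when they synchronize. This is a toy Python model under stated assumptions: the Replica class, the version counter, and the highest-version-wins rule are chosen for illustration and are not taken from the slides or from any particular system.

```python
class Replica:
    """Toy replica (hypothetical): key -> (version, value)."""

    def __init__(self):
        self.data = {}

    def write(self, key, value, version):
        # Assumed conflict rule: keep the entry with the highest version.
        current = self.data.get(key)
        if current is None or version > current[0]:
            self.data[key] = (version, value)

    def read(self, key):
        entry = self.data.get(key)
        return None if entry is None else entry[1]


def sync(a, b):
    """Exchange entries in both directions so the replicas converge."""
    for src, dst in ((a, b), (b, a)):
        for key, (version, value) in list(src.data.items()):
            dst.write(key, value, version)


r1, r2 = Replica(), Replica()
r1.write("x", "old", 1)
sync(r1, r2)                 # both replicas agree on "old"

r1.write("x", "new", 2)      # the update reaches only r1 for now
stale = r2.read("x")         # r2 still answers, but with the old value
sync(r1, r2)                 # once the replicas can talk again...
fresh = r2.read("x")         # ...r2 returns the new value
```

Between the second write and the second sync the two replicas disagree, yet both keep answering requests; that temporary disagreement is the price paid for availability.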
ACID vs. BASE
Thus, distributed NoSQL systems are typically said to support some form of BASE:
• Basically Available
• Soft state
• Eventual consistency*
“We’d really like everything to be structured, consistent and harmonious,…, but what we are faced with is a little bit of punk-style anarchy. And actually, whilst it might scare our grandmothers, it’s OK...”
-Julian Browne
Questions
Thank You!
Email: mhan@cs.gsu.edu