Building a Distributed Processing System with Hadoop and R, with Examples

한상우, 지현웅, 이수지, 박새롬, 장희수, 이재욱
swhan@snu.ac.kr
<1>
Contents
I. Introduction to Hadoop
II. RHadoop
III. MapReduce
IV. Examples
<2>
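Background
Problems with existing approaches
1. The data cannot be stored
2. Not cost-effective
3. Analysis takes an enormous amount of time
Hadoop
- Handles massive volumes of data
- Distributed processing
- Delivers results quickly
- Open source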
<3>
Hadoop
Hadoop = Hadoop Distributed File System (HDFS) + Hadoop MapReduce
- HDFS : Distributed File System
  Links thousands of servers over a network so that they can be used like the file system of a single server
- MapReduce : Distributed Processing System
  Processes the data stored on each server simultaneously, in parallel
<4>
Hadoop Structure
[Diagram: a MASTER tier with a Master and a Back-up Master, and a SLAVE tier with Slave 1 to Slave 4]
- Master Node : manages the slave nodes
  Acts as the Namenode (HDFS) and the Job Tracker (MapReduce)
- Slave Node : stores and delivers data
  Acts as a Datanode (HDFS) and a Task Tracker (MapReduce)
<5>
RHadoop?
1. rhdfs – R and HDFS
2. rhbase – R and HBase
3. rmr – R and MapReduce
With RHadoop, R users can manage and analyze data with Hadoop
<6>
MapReduce
• Map function processes a key/value pair to generate a set of intermediate key/value pairs
• Reduce function merges all intermediate values associated with the same intermediate key
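The two definitions above can be tried out without a cluster. The following is a minimal sketch in base R, assuming a hypothetical word-count task; `map_fn`, `reduce_fn` and the sample `lines` are names made up for illustration and are not part of Hadoop or RHadoop.

```r
# Hypothetical input: each element is one record (a line of text).
lines <- c("hadoop and r", "r and mapreduce", "r hadoop")

# Map: turn one record into intermediate (key, value) pairs, here (word, 1).
map_fn <- function(line) {
  words <- strsplit(line, " ", fixed = TRUE)[[1]]
  data.frame(key = words, value = 1, stringsAsFactors = FALSE)
}

# Reduce: merge all values that share the same intermediate key.
reduce_fn <- function(key, values) sum(values)

# Apply map to every record and collect the intermediate pairs.
intermediate <- do.call(rbind, lapply(lines, map_fn))

# Shuffle: group the intermediate values by key.
grouped <- split(intermediate$value, intermediate$key)

# Apply reduce once per key; the result is a named vector of counts.
counts <- mapply(reduce_fn, names(grouped), grouped)
print(counts)
```

In a real MapReduce system the map calls run on different machines and the grouping step moves data across the network; here everything runs in one R session, so only the logic is illustrated.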
<7>
Getting Started with RHadoop
• With the RHadoop rmr package, the ‘mapreduce’ function can be used to apply the same calculation to every element of a list of data
• Simple example that doubles all the numbers from 1 to 100:
    ints = to.dfs(1:100)
    calc = mapreduce(input = ints,
                     map = function(k, v) cbind(v, 2*v))
    from.dfs(calc)

    $val
          v
    [1,]  1   2
    [2,]  2   4
    [3,]  3   6
    [4,]  4   8
    [5,]  5  10
    .....
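To see what the map function produces without a Hadoop cluster, it can be called directly on the values. This is a sketch in base R, assuming the whole input arrives as `v` in a single call (on a cluster, rmr may hand the map function one chunk at a time); `double.map` is a name introduced here for illustration.

```r
# The map function from the example, given a (hypothetical) name so it can be called directly.
double.map <- function(k, v) cbind(v, 2 * v)

# Call it once on the full input; the key is unused, so NULL is passed.
out <- double.map(NULL, 1:100)

print(head(out, 5))   # first five rows: each value next to its double
print(dim(out))       # one row per input value, two columns
```

The rmr2 package also has a local backend, selected with `rmr.options(backend = "local")`, which is meant for trying out jobs on a single machine before submitting them to a cluster.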
<8>
RHadoop Examples
• Map and Reduce functions are defined with respect to data structured in (key, value) pairs
    nums = to.dfs(rnorm(100, 100, 10))

    # Map: label each number as "less" or "greater" than 100 and emit (label, 1)
    sort.map.fn <- function(k, v) {
      key <- ifelse(v < 100, "less", "greater")
      keyval(key, 1)
    }

    # Reduce: count the values collected under each key
    count.reduce.fn <- function(k, v) {
      keyval(k, length(v))
    }

  Output of the map function (intermediate pairs):
    $key         $value
    "less"       1
    "greater"    1
    ⋮            ⋮
    "less"       1

    count <- mapreduce(input = nums,
                       map = sort.map.fn,
                       reduce = count.reduce.fn)
    from.dfs(count)

    $key
    [1] "greater" "less"
    $val
    [1] 45 55

• Reduce functions handle data separately for each key value
<9>
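The counting example above can be checked on a single machine. The sketch below mirrors its map and reduce steps in base R, with `split()` standing in for the grouping that Hadoop performs between the two; the seed is arbitrary and the variable names are chosen for illustration.

```r
set.seed(1)                      # arbitrary seed, only for reproducibility
x <- rnorm(100, 100, 10)         # same kind of input as in the example

# Map step: one intermediate pair per number, key = label, value = 1.
key   <- ifelse(x < 100, "less", "greater")
value <- rep(1, length(x))

# Shuffle + reduce step: group the values by key and count them per group.
counts <- sapply(split(value, key), length)
print(counts)
```

Because the input is random, the two counts differ from run to run, but they always sum to the number of inputs.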
Simple Simulations for Option Pricing
[Figure with labels: Call option value, Exercise price, Stock price]
• We assume that the asset price $S(t)$ follows the stochastic differential equation under the risk-neutral probability:
  $dS(t) = r\,S(t)\,dt + \sigma\,S(t)\,dW(t)$
  where $W$ is the Brownian motion under the risk-neutral probability
• We simulate the future prices of the underlying asset, and from them the future payoffs to be obtained
• The sample mean of the discounted payoffs is the value of the option contract
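The simulation described above can be written out in two formulas. This is a sketch assuming the Euler scheme, which matches the update used in the code on the next slide; $n$ (number of time steps, `nPas` in the code), $N$ (number of trajectories, `nTraj`), $K$ (exercise price) and $Z_i$ (independent standard normal draws) are labels chosen here.

```latex
% One Euler step of the price path, with \Delta t = T / n
S_{t_{i+1}} = S_{t_i} + r\,S_{t_i}\,\Delta t + \sigma\,S_{t_i}\,\sqrt{\Delta t}\,Z_i,
\qquad i = 0,\dots,n-1

% Monte Carlo estimate of the call value: sample mean of the discounted payoffs
C_0 \approx e^{-rT}\,\frac{1}{N}\sum_{j=1}^{N}\max\bigl(S_T^{(j)} - K,\;0\bigr)
```

The put value follows in the same way with the payoff $\max(K - S_T^{(j)}, 0)$.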
< 10 >
Example - Option Pricing
• An example for European option pricing :
    inp = cbind(S0*rep(1,nTraj), rep(0,nTraj));
    inp = to.dfs(inp);

    # Map: simulate the trajectories and emit the discounted payoff of each one
    buildTraj <- function(k, v) {
      deltaT = T/nPas;
      for (i in 1:nPas) {
        dW = sqrt(deltaT)*rnorm(length(v[,1]));
        v[,2] = v[,1] + r*v[,1]*deltaT + sigma*v[,1]*dW;
        v[,1] = v[,2];
      }
      key <- ifelse(v[,1]-K>0, "call", "put");
      value <- ifelse(v[,1]-K>0, exp(-r*T)*(v[,1]-K), exp(-r*T)*(K-v[,1]));
      keyval(key, value)
    }

    # Reduce: mean payoff per key, weighted by the share of trajectories with that key
    price.reduce.fn <- function(k, v) {
      keyval(k, mean(v)*(length(v)/nTraj))
    }

    call <- mapreduce(input = inp, map = buildTraj, reduce = price.reduce.fn);
    from.dfs(call)

    $key
    [1] "call" "put"
    $val
    [1] 6.038343 10.676495

  Note: data-specific quantities should be stated in terms of the data (here, length(v[,1]) for the number of trajectories passed to the map function)
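The pricing logic above can be tried in plain R before it is run through Hadoop. The sketch below is a single-machine version under assumed parameters: all numeric values are hypothetical and not taken from the slides, and the maturity is named `Tmat` instead of `T` so that R's `T` shorthand for `TRUE` is not masked.

```r
set.seed(42)                 # arbitrary seed for reproducibility

# Hypothetical parameters (not taken from the slides).
S0 <- 100; K <- 100          # initial price and exercise price
r <- 0.05; sigma <- 0.2      # risk-free rate and volatility
Tmat <- 1; nPas <- 50        # maturity and number of time steps
nTraj <- 20000               # number of simulated trajectories

# "Map" step: simulate all trajectories with the Euler update used in buildTraj.
deltaT <- Tmat / nPas
S <- rep(S0, nTraj)
for (i in 1:nPas) {
  dW <- sqrt(deltaT) * rnorm(nTraj)
  S <- S + r * S * deltaT + sigma * S * dW
}
key   <- ifelse(S - K > 0, "call", "put")
value <- exp(-r * Tmat) * abs(S - K)   # discounted payoff of the option that ends in the money

# "Reduce" step: mean payoff per key, weighted by the share of paths with that key.
# mean(v) * length(v) / nTraj equals sum(v) / nTraj, i.e. the average over ALL paths,
# where paths filed under the other key contribute a zero payoff.
price <- sapply(split(value, key), function(v) mean(v) * (length(v) / nTraj))
print(price)
```

The weighting by `length(v) / nTraj` is what turns the per-key mean into an option price: a call pays nothing on the paths that end below the exercise price, so those paths must still count in the denominator.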
< 11 >
RHadoop Examples
• Elapsed time :
- R
    #Timestep \ #traj    100,000    1,000,000    5,000,000
    100                      0.8          9.8        130.3
    500                      3.8         48.3        201.3
- RHadoop
    #Timestep \ #traj    100,000    1,000,000    5,000,000
    100                     41.9         55.5        130.6
    500                     44.3         82.4        237.9
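The slides do not say how the elapsed times were measured. One common way to time a computation in R is `system.time()`; the sketch below applies it to a hypothetical `simulate()` helper with made-up default parameters, so it shows the measurement technique only and does not reproduce the figures in the tables.

```r
# Hypothetical helper: simulate nTraj Euler trajectories with nPas time steps
# and return the terminal prices (parameter defaults are made up).
simulate <- function(nTraj, nPas, S0 = 100, r = 0.05, sigma = 0.2, Tmat = 1) {
  deltaT <- Tmat / nPas
  S <- rep(S0, nTraj)
  for (i in seq_len(nPas)) {
    S <- S + r * S * deltaT + sigma * S * sqrt(deltaT) * rnorm(nTraj)
  }
  S
}

# system.time() reports user, system and elapsed (wall-clock) time in seconds.
timing <- system.time(ST <- simulate(nTraj = 10000, nPas = 100))
print(timing[["elapsed"]])
```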
< 12 >
Example – k-means Clustering
k-Means Clustering
• Simple partitional clustering
• Chooses the number of clusters k
Iterate {
  MAP:    Compute the distance from all points to all k centers
          Assign each point to the nearest center
  REDUCE: Compute the average of all points assigned to each center
          Replace the k centers with the new averages
}
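The iteration above can be sketched in base R on a single machine. The data, the seed, the choice of starting centers and the stopping tolerance are all assumptions made for illustration; the slides do not specify them.

```r
set.seed(7)   # arbitrary seed for reproducibility

# Hypothetical data: two well-separated groups of 2-D points.
X <- rbind(matrix(rnorm(100, mean = 0, sd = 0.5), ncol = 2),
           matrix(rnorm(100, mean = 5, sd = 0.5), ncol = 2))
k <- 2
centers <- X[c(1, nrow(X)), , drop = FALSE]   # one starting center from each end of the data

for (iter in 1:10) {
  # MAP: squared distance from every point to every center, then nearest center.
  d2 <- sapply(seq_len(k), function(j) rowSums(sweep(X, 2, centers[j, ])^2))
  cluster <- max.col(-d2, ties.method = "first")

  # REDUCE: average the points assigned to each center and replace the centers.
  new.centers <- t(sapply(seq_len(k), function(j)
    colMeans(X[cluster == j, , drop = FALSE])))

  if (all(abs(new.centers - centers) < 1e-9)) break   # stop once the centers no longer move
  centers <- new.centers
}
print(centers)
```

This sketch does not handle a center that ends up with no assigned points; with other data or starting centers that case would need explicit treatment.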
< 13 >
Example – k-means Clustering
• In the Map function:
  Input data (n points, columns dim1 and dim2)
    dim1   dim2
    x_11   x_12
    x_21   x_22
    ...
    x_n1   x_n2
  Distances from every point to every center
    distance(x1,c1)   distance(x1,c2)   distance(x1,c3)
    distance(x2,c1)   distance(x2,c2)   distance(x2,c3)
    ...
    distance(xn,c1)   distance(xn,c2)   distance(xn,c3)
  Output (key = cluster #, value = the point)
    $key   $value
    1      x_11   x_12
    3      x_21   x_22
    ...
    2      x_n1   x_n2
< 14 >
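A small worked instance of the map step may help. The sketch below uses four hypothetical points and three hypothetical centers, builds the distance matrix, and takes the index of the nearest center as the key; all values and names are made up for illustration.

```r
# Hypothetical data: four 2-D points (columns dim1, dim2) and three centers.
X <- rbind(c(0, 0), c(1, 0), c(10, 10), c(0, 9))
colnames(X) <- c("dim1", "dim2")
centers <- rbind(c(0, 1), c(9, 9), c(0, 10))

# Distance matrix: row i, column j holds distance(x_i, c_j).
dist.mat <- sapply(seq_len(nrow(centers)),
                   function(j) sqrt(rowSums(sweep(X, 2, centers[j, ])^2)))

# Key = index of the nearest center (cluster #); value = the point itself.
key <- apply(dist.mat, 1, which.min)
print(cbind(key, X))
```

Here the first two points are nearest to center 1, the third to center 2 and the fourth to center 3, so the keys are 1, 1, 2, 3.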
Example – k-means Clustering
• In the Reduce function:
  Input, grouped by key (one group per cluster)
    $key   $value
    1      x_a1,1   x_a1,2
    1      x_a2,1   x_a2,2
    ...
    1      x_al,1   x_al,2

    $key   $value
    2      x_b1,1   x_b1,2
    2      x_b2,1   x_b2,2
    ...
    2      x_bm,1   x_bm,2

    $key   $value
    3      x_c1,1   x_c1,2
    3      x_c2,1   x_c2,2
    ...
    3      x_cp,1   x_cp,2
  Output: new centers
    $key   $value
    1      c_11   c_12
    2      c_21   c_22
    3      c_31   c_32
< 15 >
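The reduce step can be worked through on a few numbers as well. In the sketch below, `key` and `value` stand for a hypothetical map output; each group of points is averaged column by column to give the new center of its cluster.

```r
# Hypothetical map output: cluster keys and the 2-D points assigned to them.
key   <- c(1, 1, 2, 3, 3)
value <- rbind(c(0, 0), c(2, 0), c(10, 10), c(0, 8), c(2, 10))

# Reduce: for each key, average the points in its group to get the new center.
groups <- split(seq_along(key), key)
new.centers <- t(sapply(groups, function(idx) colMeans(value[idx, , drop = FALSE])))
print(new.centers)
```

The new centers are (1, 0) for cluster 1, (10, 10) for cluster 2 and (1, 9) for cluster 3. In a MapReduce job each group would be handled by a separate reduce call, possibly on a different machine.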
Example – k-means Clustering
< 16 >
Reference
- Jeffrey Dean and Sanjay Ghemawat, MapReduce: Simplified Data Processing on Large Clusters, Google, Inc.
< 17 >