Example: How to Write Parallel Queries in Parallel Secondo

Jiamin Lu, Oct 2012
SECONDO was built by the Secondo team.

1 Overview of Parallel Secondo
Parallel SECONDO extends the scalability of SECONDO, in order to process moving objects data in parallel on a
computer cluster. It consists of Hadoop and a set of single-computer SECONDO databases, using Hadoop as
the task coordinator and the communication layer. Most database queries are processed by the SECONDO databases
simultaneously.
The basic infrastructure of Parallel SECONDO is shown in Figure 1. HDFS is a distributed file system provided
by Hadoop to shuffle its intermediate data during parallel procedures. PSFS (Parallel SECONDO File System)
is a similar system provided by Parallel SECONDO; it exchanges data directly among the single-computer SECONDO
databases, without passing through HDFS, in order to improve the data transfer efficiency of the parallel system.
The minimum execution unit in Parallel SECONDO is called a Data Server (DS). Each DS contains a compact
SECONDO system named Mini-SECONDO, its database storage and a PSFS node. At present, it is quite common that
a low-end PC has several hard disks, a large memory and several processors with multiple cores. Therefore, the
user can set up several Data Servers on the same computer, in order to fully use the computing resources
of the underlying cluster. Hadoop nodes are set up independently of Data Servers; each Hadoop node is placed in
the first Data Server of its cluster node.
The functions of Parallel SECONDO are provided by two SECONDO algebras, Hadoop and HadoopParallel. These
two algebras must be activated before installing Parallel SECONDO.
More details about Parallel SECONDO can be found in the “User Guide For Parallel SECONDO”. In this document,
we use an example query to show the user how to set up Parallel SECONDO on a single computer
and achieve a speedup of up to four times.
2 Set Up Parallel Secondo
Parallel SECONDO can be installed on a single computer or on a computer cluster composed of tens or even
hundreds of computers. Here we quickly guide the user through installing Parallel SECONDO on a single
computer. The computer used here as the testbed has an AMD Phenom(tm) II X6 1055T processor with six cores,
8 GB memory and two 500 GB hard disks, and uses Ubuntu 10.04.2-LTS-Desktop-64bit as the operating system.
A set of auxiliary tools, which are essentially bash scripts, is provided to help the user set up and use Parallel
SECONDO easily. These tools are kept in the Hadoop algebra, in a folder named clusterManagement. Among
them, ps-cluster-format helps the user to install Parallel SECONDO automatically, including both the Data Servers
and Hadoop. The deployment of Parallel SECONDO on one computer includes the following steps:
Figure 1: The Infrastructure of Parallel Secondo. The master Data Server contains the master Mini-SECONDO node with its meta database, a DS Catalog and the Parallel Query Converter, and hosts the Hadoop master node; each slave Data Server contains a slave Mini-SECONDO node with its slave database and DS Catalog, and hosts a Hadoop slave node that executes partial SECONDO queries. The Hadoop nodes communicate via HDFS, while the Data Servers exchange data via PSFS.
1. Prepare the authentication key-pair. Both the Data Servers and the underlying Hadoop platform rely on secure
shell as their basic communication layer, and it is better to connect shells without a password, even on a single
computer. For this purpose, the user should create and set up the authentication key-pair on the computer
with the commands:
$ cd $HOME
$ ssh-keygen -t dsa -P '' -f ~/.ssh/id_dsa
$ cat ~/.ssh/id_dsa.pub >> ~/.ssh/authorized_keys
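To verify that the password-less login works, one can, for example, open a connection to the local machine once (this also adds it to the list of known hosts) and close it again:

$ ssh localhost
$ exit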
2. Install SECONDO. The installation guide can be found on our website, and the user can install the SECONDO
database system as usual (the version must be 3.3 or higher). After the installation is finished, the user can verify
the correctness of the installation, and then compile SECONDO.
$ env | grep "^SECONDO"
SECONDO_CONFIG=.... /secondo/bin/SecondoConfig.ini
SECONDO_BUILD_DIR=... /secondo
SECONDO_JAVA=.... /java
SECONDO_PLATFORM=...
$ cd $SECONDO_BUILD_DIR
$ make
3. Download Hadoop. Go to the official website of Hadoop, and download the Hadoop distribution with
version 0.20.2. The downloaded package should be put into the $SECONDO_BUILD_DIR/bin directory
without changing its name.
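For example, assuming the Apache archive still hosts this release under its usual layout (the URL below is an assumption and may need to be adjusted), the package can be fetched directly into the right place:

$ cd $SECONDO_BUILD_DIR/bin
$ wget https://archive.apache.org/dist/hadoop/core/hadoop-0.20.2/hadoop-0.20.2.tar.gz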
4. Set up ParallelSecondoConfig.ini according to the current computer. This file contains all configurations of Parallel SECONDO. An example file is kept in the clusterManagement folder, prepared for setting
up Parallel SECONDO on a single computer with an Ubuntu operating system. Detailed explanations of
this file can be found in the User Guide. Normally only the configurations in the cluster section have to be
set according to the respective environment. In our example cluster, it is set as:
Master = 192.168.1.1:/disk1/dataServer1:11234
Slaves += 192.168.1.1:/disk1/dataServer1:11234
Slaves += 192.168.1.1:/disk2/dataServer2:12234
The IP address of this computer is 192.168.1.1. Two Data Servers are set up in /disk1/dataServer1 and
/disk2/dataServer2, respectively. The access ports for these two DSs' Mini-SECONDO instances are 11234 and 12234.
Two DSs are prepared on this computer, since it has two hard disks. The first DS serves as master and
slave DS at the same time.
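The same Master/Slaves format, i.e. IP address, Data Server path and port separated by colons, also describes a real cluster. As a purely hypothetical example (the addresses, paths and ports below are made up), a two-machine cluster with one Data Server per machine could be configured as:

Master = 192.168.1.1:/disk1/dataServer1:11234
Slaves += 192.168.1.1:/disk1/dataServer1:11234
Slaves += 192.168.1.2:/disk1/dataServer1:11234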
In Parallel SECONDO, the transaction feature is normally turned off, in order to improve the efficiency of
loading data into the databases. For this purpose, the following line in the SECONDO configuration file should
be uncommented. The file is named SecondoConfig.ini and is kept in the $SECONDO_BUILD_DIR/bin
directory.
RTFlags += SMI:NoTransactions
After all required parameters are set, also copy the file ParallelSecondoConfig.ini to the directory $SECONDO_BUILD_DIR/bin, and initialize the environment with ps-cluster-format.
$ cd $SECONDO_BUILD_DIR/Algebras/Hadoop/clusterManagement
$ cp ParallelSecondoConfig.ini $SECONDO_BUILD_DIR/bin
$ ps-cluster-format
Since this step defines a set of environment variables, the user should start a new shell after the format
command has finished. The correctness of the initialization can be checked with the following commands:
$ cd $HOME
$ env | grep "^PARALLEL_SECONDO"
PARALLEL_SECONDO_MASTER=.../conf/master
PARALLEL_SECONDO_CONF=.../conf
PARALLEL_SECONDO_BUILD_DIR=.../secondo
PARALLEL_SECONDO_MINIDB_NAME=msec-databases
PARALLEL_SECONDO_MINI_NAME=msec
PARALLEL_SECONDO_PSFSNAME=PSFS
PARALLEL_SECONDO_DBCONFIG=
PARALLEL_SECONDO_SLAVES=.../conf/slaves
PARALLEL_SECONDO_MAINDS=.../dataServer1/...
PARALLEL_SECONDO=...:...
PARALLEL_SECONDO_DATASERVER_NAME=...
5. The above steps set up both the Data Servers and Hadoop on the computer. Afterwards, the Namenode of the
Hadoop framework has to be formatted before starting the framework.
$ hadoop namenode -format
$ start-all.sh
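As a rough check (assuming a JDK is installed), the jps tool lists the running Java processes; after start-all.sh one would expect to see the Hadoop daemons, e.g. NameNode, DataNode, JobTracker and TaskTracker:

$ jps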
6. The fourth step initializes the Data Servers on the computer, but the Mini-SECONDO systems are not
distributed yet. At first, both the Hadoop and the HadoopParallel algebra have to be activated, by adding the
following lines to the file $SECONDO_BUILD_DIR/bin/makefile.algebras.
ALGEBRA_DIRS += HadoopParallel
ALGEBRAS     += HadoopParallelAlgebra
ALGEBRA_DIRS += Hadoop
ALGEBRAS     += HadoopAlgebra
After re-compiling the SECONDO system, an auxiliary tool called ps-secondo-buildMini can be used to
distribute Mini-SECONDO on the cluster.
$ cd $SECONDO_BUILD_DIR
$ make
$ cd $SECONDO_BUILD_DIR/Algebras/Hadoop/clusterManagement
$ ps-secondo-buildMini -lo
7. Start up Mini-SECONDO monitors.
$ cd $SECONDO_BUILD_DIR/Algebras/Hadoop/clusterManagement
$ ps-startMonitors
$ ps-cluster-queryMonitorStatus
The second script is used to check whether all Mini-SECONDO monitors have been started.
8. Open the text interface of the master database.
$ cd $SECONDO_BUILD_DIR/Algebras/Hadoop/clusterManagement
$ ps-startTTYCS -s 1
So far, Parallel SECONDO has been set up on the example computer. Next we introduce how to process a
sequential SECONDO query in parallel, using different methods.
3 The Sequential Query
In order to guide the user towards writing parallel queries quickly, an example is demonstrated here. This query studies
the problem of dividing a massive number of large regions into small pieces, based on a set of lines that overlap
them. For example, a forest can be split into pieces by roads and power lines. For this purpose an operator
named findCycles is provided in SECONDO. It is used as follows:
query Roads feed {r}
Natural feed filter[.Type contains ’forest’]
itSpatialJoin[GeoData_r, GeoData, 4, 8]
sortby[Osm_id]
groupby[Osm_id; InterGraph:
intersection( group feed projecttransformstream[GeoData_r]
collect_line[TRUE], group feed extract[GeoData])
union boundary(group feed extract[GeoData])]
projectextendstream[Osm_id; Curves: findCycles(.InterGraph)]
count;
Total runtime ...
Times (elapsed / cpu): 9:32min (571.688sec) / 199.6sec = 2.86417
113942
Here the forest regions in the relation Natural are split into pieces, according to the road network in the relation
Roads. The query is written in the SECONDO executable language, consisting of SECONDO objects and operators. It
describes the data flow of the query, hence the user can use different methods and operators to process it.
Note that in order to make this query run, the algebra named SpatialJoin must be activated in SECONDO.
This example query mainly consists of three steps. At first, the forest regions in the relation Natural are joined
with the relation Roads, using the itSpatialJoin operator. The itSpatialJoin operation is the most efficient
sequential spatial join operation in the current SECONDO. It builds up an in-memory R-tree on one side relation,
and probes this R-tree with the tuples of the other relation. All pairs of tuples whose geometric attributes have
overlapping bounding boxes are put together. Afterwards, the join results are grouped by the forest
regions' identifiers. For each group, a graph is built up consisting of all boundary edges of the region and all roads
intersecting it. For example, as shown in Figure 2a, the region R is divided into five pieces by three intersecting
lines a, b and c, while the created graph is shown in Figure 2b.

Figure 2: findCycles Operation: (a) a region R split by intersecting lines, (b) the graph created for it
First, all roads intersecting the region are put together into one line value by the collect_line operator. The
intersection operation then computes all parts of this line (the road lines) lying within the forest region, like P1P2,
P3P4, P7P8 etc. The boundary operation produces all boundary edges of the region, like P5P6. Boundary edges
and roads inside the region are then put together with the union operation.
At last, for each such graph (represented as a line value), the findCycles operation traverses it and returns the split
regions, like the red cycle shown in Figure 2b. The findCycles operation is used as an argument function
of the operator projectextendstream, so that each forest region's identifier is duplicated for each of its split
regions.
This query produces 113942 split regions, and costs about 571 seconds.
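As a small sanity check, the join step can also be run on its own, for instance to inspect the intermediate cardinality before grouping. The following sketch reuses only operators from the query above:

query Roads feed {r}
      Natural feed filter[.Type contains 'forest']
      itSpatialJoin[GeoData_r, GeoData, 4, 8]
      count;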
In this example, both data sets come from the OpenStreetMap project and describe the geographic information
of the German state of North Rhine-Westphalia (NRW). They can be fetched with the following commands.
$ cd $SECONDO_BUILD_DIR/bin
$ wget http://download.geofabrik.de/openstreetmap/europe/germany/nordrhein-westfalen.shp.zip
$ mkdir nrw
$ unzip -d nrw nordrhein-westfalen.shp.zip
By now, all shape files about streets and regions in NRW are kept in the directory named nrw in your $SECONDO_BUILD_DIR/bin. You can use the following SECONDO queries to load these shape files into a database.
create database nrw;
open database nrw;
Type        Operator        Signature
Flow        spread          stream(T) → flist(T)
            collect         flist(T) → stream(T)
PQC         hadoopMap       flist x fun(mapQuery) x bool → flist
            hadoopReduce    flist x fun(reduceQuery) → flist
            hadoopReduce2   flist x flist x fun(reduceQuery) → flist
Assistant   para            flist(T) → T
PSFS        fconsume        stream(T) x fileMeta → bool
            fdistribute     stream(T) x fileMeta x Key → stream(key, tupNum)
            ffeed           fileMeta → stream(T)

Table 1: Operators Provided as Extensions In Parallel Secondo
let Roads =
    dbimport2('../bin/nrw/roads.dbf') addcounter[No, 1]
    shpimport2('../bin/nrw/roads.shp') namedtransformstream[GeoData]
    addcounter[No2, 1]
    mergejoin[No, No2] remove[No, No2]
    filter[isdefined(bbox(.GeoData))] validateAttr
    consume;

let Natural =
    dbimport2('../bin/nrw/natural.dbf') addcounter[No, 1]
    shpimport2('../bin/nrw/natural.shp') namedtransformstream[GeoData]
    addcounter[No2, 1]
    mergejoin[No, No2] remove[No, No2]
    filter[isdefined(bbox(.GeoData))] validateAttr
    consume;
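Before closing the database, one may for instance verify that both relations have been populated (a quick check; the exact counts depend on the OSM snapshot used):

query Roads feed count;
query Natural feed count;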
close database;
4 Parallel Queries
4.1 Parallel Data Model
A parallel data model is provided in Parallel SECONDO, to help the user write parallel queries like normal
sequential queries. The data type flist encapsulates all SECONDO objects and distributes their content over the
computer cluster. Besides, a set of parallel operators is implemented in Parallel SECONDO, making the system
compatible with the single-computer database. These operators can be roughly divided into four kinds: flow
operators, PQC (Hadoop) operators, PSFS operators, and other assistant operators, as shown in Table 1.
Flow operators connect parallel operators with other existing SECONDO operators. Relations can be distributed
over the cluster with flow operations in order to be processed simultaneously by all Data Servers. Later the distributed
result, which is represented as an flist object, can be collected from the cluster in order to be processed by
single-computer SECONDO. PQC operators accept queries written in the SECONDO executable language and create
Hadoop jobs from their respective template Hadoop programs. Thereby the user can write parallel queries like
common sequential queries. Assistant operators work together with PQC operators, helping to denote flist objects
in parameter functions. At last, PSFS operators are particularly prepared for accessing data exchanged via PSFS,
and are used internally by the Hadoop operators. Details about these operators are not explained here, since we
want to quickly familiarize the user with Parallel SECONDO by means of the example.
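As a minimal illustration of the flow operators (a sketch only, assuming the nrw database from above is open and that the constant CLUSTER_SIZE is defined as in Method 1 below), a relation can be distributed over the cluster by one of its attributes and collected back:

let Roads_List = Roads feed spread[; Osm_id, CLUSTER_SIZE, TRUE;];
query Roads_List collect[] count;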
With the parallel data model, the above example query can be converted into parallel queries in different ways. The most efficient one achieves a speedup of up to four times on the given computer.
4.2 Method 1
The first parallel solution divides the example query into two stages, Map and Reduce, following the MapReduce
paradigm. The spatial join operation is finished in the Map stage; then the join results are distributed over
the cluster based on the forest identifiers, and the remaining steps up to the findCycles operation are processed on
all slaves simultaneously in the Reduce stage. The queries are listed below.
let CLUSTER_SIZE = 2;
let PS_SCALE = 6;
let Roads_BCell_List = Roads feed extend[Box: bbox(.GeoData)]
projectextendstream[Osm_id, GeoData, Box; Cell:
cellnumber(.Box, Big_RNCellGrid)]
spread[;Cell, CLUSTER_SIZE, TRUE;];
Total runtime ...
Times (elapsed / cpu): 53.0651sec / 16.95sec = 3.13069
let Forest_BCell_List = Natural feed filter[.Type contains ’forest’]
extend[Box: bbox(.GeoData)]
projectextendstream[Osm_id, GeoData, Box; Cell:
cellnumber(.Box, Big_RNCellGrid)]
spread[;Cell, CLUSTER_SIZE, TRUE;];
Total runtime ...
Times (elapsed / cpu): 5.20549sec / 1.35sec = 3.85592
query Roads_BCell_List
hadoopMap[DLF, FALSE
; . {r} para(Forest_BCell_List)
itSpatialJoin[GeoData_r, GeoData, 4, 8]
filter[ (.Cell = .Cell_r) and
gridintersects(Big_RNCellGrid, .Box_r, .Box, .Cell_r)]]
hadoopReduce[Osm_id, DLF, PS_SCALE; . sortby[Osm_id]
groupby[Osm_id; InterGraph:
intersection( group feed projecttransformstream[GeoData_r]
collect_line[TRUE], group feed extract[GeoData])
union boundary(group feed extract[GeoData])]
projectextendstream[Osm_id; Curves: findCycles(.InterGraph)]
]
collect[]
count;
Total runtime ...
Times (elapsed / cpu): 34:23min (2062.86sec) / 1.69sec = 1220.63
113942
The spatial join operation is processed with the PBSM (Partition Based Spatial Merge join) method. Both relations are first partitioned by an identical
cell-grid, and then distributed over the cluster based on cell numbers, with the flow operator spread. Here the
constant CLUSTER_SIZE is set to two, since we installed Parallel SECONDO on a single computer, with two
slave Data Servers in total. The constant PS_SCALE indicates the number of reduce tasks running in parallel during
the Hadoop procedure; it is set to six as there are six processor cores in the current computer. The creation of
the cell-grid Big_RNCellGrid is introduced in Appendix A.
In the parallel query, two PQC operators, hadoopMap and hadoopReduce, are used; each describes one stage
of the Hadoop job. In principle, each PQC operator starts a Hadoop job, and processes the embedded argument
function in one stage of that job. Nonetheless, here the executed argument of the hadoopMap operation is set
to FALSE, hence the Map stage is actually put together with the following hadoopReduce operation, and both
operations are processed in one Hadoop job at last.
The hadoopMap is a unary operation, though the itSpatialJoin requires two inputs. Regarding this issue,
the assistant operator para is used to identify the distributed relation Forest_BCell_List. The spatial join results
produced in the Map stage are then shuffled based on the forest regions' identifiers. We can see that the argument
function embedded in the hadoopReduce query is exactly the same as in the sequential query. In the end, the
distributed results are collected back to the standard SECONDO by using the flow operator collect.
Unfortunately, the first solution is not efficient, and much slower than the sequential query. This is mainly caused
by two reasons. First, the intermediate data between the Map and Reduce stage have to be materialized. In the
sequential solution, the spatial join results are all kept in memory. In contrast, in this method, they have
to be exported as disk files after being produced in the Map stage. In addition, when Parallel SECONDO is
deployed on a computer cluster, these disk files also have to be transferred among the computers. All these overheads
degrade the performance. Second, in the Reduce stage, the sort operation must be used before the groupby
operation. During this procedure, the FLOB data have to be cached in a disk buffer if they overflow the limited
memory. Since several reduce tasks run in parallel, the disk interference further encumbers the query.
4.3 Method 2
Because of the performance degradation of the first solution, a second method is proposed. This query intends
to finish both the spatial join and the findCycles operation in one stage, in order to avoid the shuffle procedure.
Therefore, both relations are partitioned by the same cell-grid, and the complete query can be processed within the cells
independently.
However, during the experiment, a new problem arises. When a region overlaps several cells, not all of its intersecting
lines overlap the same cells. For example, as shown in Figure 3, the region is partitioned into two cells C1 and
C2. According to PBSM, it is duplicated in both cells. Lines P3P7 and P4P8 are also partitioned into C1 and C2.
However, the line P2P9 is only assigned to C1. Therefore, in cell C2, because of the absence of P2P9, the
wrong region P1-P3-P10-P8-P9 is produced. This problem is called the Over Cell Problem.
Figure 3: Over Cell Problem
let Roads_BCell_List = Roads feed extend[Box: bbox(.GeoData)]
    projectextendstream[Osm_id, GeoData, Box; Cell:
    cellnumber(.Box, Big_RNCellGrid)]
    spread[; Cell, CLUSTER_SIZE, TRUE;];
Total runtime ...
Times (elapsed / cpu): 53.0651sec / 16.95sec = 3.13069
let OneCellForest_BCell_OBJ_List =
Natural feed filter[.Type contains ’forest’]
extend[Box: bbox(.GeoData)]
filter[cellnumber( .Box, Big_RNCellGrid) count = 1]
projectextendstream[Osm_id, GeoData, Box
; Cell: cellnumber(.Box, Big_RNCellGrid)]
spread[; Cell, CLUSTER_SIZE, TRUE;]
hadoopMap[; . consume];
Total runtime ...
Times (elapsed / cpu): 37.5881sec / 2.27sec = 16.5586
query Roads_BCell_List
hadoopMap[DLF
; . {r} para(OneCellForest_BCell_OBJ_List) feed
itSpatialJoin[GeoData_r, GeoData, 4, 8]
filter[gridintersects(Big_RNCellGrid, .Box_r, .Box, .Cell_r)]
sortby[Osm_id]
groupby[Osm_id; InterGraph:
intersection( group feed projecttransformstream[GeoData_r]
collect_line[TRUE], group feed extract[GeoData])
union boundary(group feed extract[GeoData])]
projectextendstream[Osm_id; Curves: findCycles(.InterGraph)]
]
collect[] count;
Total runtime ...
Times (elapsed / cpu): 3:34min (213.522sec) / 0.58sec = 368.141
104642
let MulCellForest = Natural feed filter[.Type contains ’forest’]
filter[cellnumber( bbox(.GeoData), Big_RNCellGrid) count > 1]
consume;
Total runtime ...
Times (elapsed / cpu): 2.7419sec / 0.83sec = 3.3035
query Roads feed {r}
MulCellForest feed
itSpatialJoin[GeoData_r, GeoData, 4, 8]
sortby[Osm_id]
groupby[Osm_id; InterGraph:
intersection( group feed projecttransformstream[GeoData_r]
collect_line[TRUE], group feed extract[GeoData])
union boundary(group feed extract[GeoData])]
projectextendstream[Osm_id; Curves: findCycles(.InterGraph)]
count;
Total runtime ...
Times (elapsed / cpu): 1:48min (107.942sec) / 26.6sec = 4.05799
9300
Regarding the over-cell problem, we divide the forest regions into two types: those covering only one cell of the
grid and those covering multiple cells. The first kind is processed in Parallel SECONDO, while the second
kind is processed with a standard sequential query.
In particular, a hadoopMap operation is used to load all forest regions that cover only one cell into the distributed
SECONDO databases, creating a DLO type flist object at last. This is because such an object is used with the assistant
operator para in the parallel query, and each task can use part of its value only when it is a DLO flist.
This solution costs in total 414 seconds, and the complete result is correct. Although it is a little more efficient
than the sequential query, it is too complicated for the user to understand and express.
4.4 Method 3
Apparently, the second solution is complicated and not intuitive for the user. Therefore, a third solution is proposed,
adopting the idea of a semi-join. First the spatial join happens in a standard SECONDO, and only the roads
which intersect forest regions are kept. If a road intersects several regions, it is duplicated, and each road tuple
is extended with the identifier of the region that it passes.
Afterwards, both the intersecting roads and the forest regions are distributed over the cluster based on the regions'
identifiers. Thereby, a forest and all roads that pass it are assigned to the same task. In each task, a hashjoin
operation is performed on the roads and forests, in order to recreate the forest-road pairs. Thereafter, all steps
are the same as in the sequential query, and they only happen in the Reduce stage of the Hadoop job.
let JoinedRoads_List = Roads feed
Natural feed filter[.Type contains ’forest’] {r}
itSpatialJoin[GeoData, GeoData_r, 4, 8]
projectextend[Osm_id, GeoData; OverRegion: .Osm_id_r]
spread[; OverRegion, CLUSTER_SIZE, TRUE;]
Total runtime ...
Times (elapsed / cpu): 29.0833sec / 13.17sec = 2.2083
let Forest_List = Natural feed filter[.Type contains ’forest’]
spread[; Osm_id, CLUSTER_SIZE, TRUE;]
Total runtime ...
Times (elapsed / cpu): 5.27066sec / 1.15sec = 4.58319
query JoinedRoads_List Forest_List
hadoopReduce2[OverRegion, Osm_id, DLF, PS_SCALE
; . {r} .. hashjoin[OverRegion_r, Osm_id]
sortby[Osm_id]
groupby[Osm_id; InterGraph:
intersection( group feed projecttransformstream[GeoData_r]
collect_line[TRUE], group feed extract[GeoData])
union boundary(group feed extract[GeoData])]
projectextendstream[Osm_id; Curves: findCycles(.InterGraph)]
]
collect[] count;
Total runtime ...
Times (elapsed / cpu): 1:46min (106.202sec) /1.09sec = 97.4333
113942
The last solution is very efficient: it costs only 140 seconds in total, and gains about a four-fold speedup by processing
the query with Parallel SECONDO on a single computer. The hadoopReduce2 operator is used in this method. It is a
binary operation; both input flist objects are distributed into PS_SCALE reduce tasks. The workload of every task
is even and small, hence the complete query becomes very efficient.
5 Stop And Remove Parallel Secondo
Parallel SECONDO can be turned off by stopping all Mini-SECONDO monitors and the Hadoop framework.
$ cd $SECONDO_BUILD_DIR/Algebras/Hadoop/clusterManagement
$ ps-stopMonitors
$ stop-all.sh
If the user wants to uninstall Parallel SECONDO from the current computer or the cluster, an easy-to-use script is
also provided. It is named ps-cluster-uninstall, and is used as follows.
$ cd $SECONDO_BUILD_DIR/Algebras/Hadoop/clusterManagement
$ ps-cluster-uninstall
A Assistant Objects
In the first two methods of the example query, PBSM is used to process the parallel spatial join operation.
For this purpose, a cell-grid is required to partition the data into independent groups. Here a coarse-grained cell-grid is created, whose cell number is set to the size of the cluster. At present, the two relations have to
be completely scanned in order to determine the bounding box of the grid. The grid is created with an operator named
createCellGrid2D, by setting five arguments. The first two arguments indicate the bottom-left point of the
grid, while the other three define the edge sizes of the cells and the number of cells along the X-axis.
let RoadsMBR = Roads feed
    extend[Box: bbox(.GeoData)]
    aggregateB[Box; fun(r:rect, s:rect) r union s
    ; [const rect value undef]];

let NaturalMBR = Natural feed
    extend[Box: bbox(.GeoData)]
    aggregateB[Box; fun(r:rect, s:rect) r union s
    ; [const rect value undef]];

let RN_MBR = intersection(RoadsMBR, NaturalMBR);

let BCellNum = CLUSTER_SIZE;

let BCellSize = (maxD(RN_MBR,1) - minD(RN_MBR,1)) / BCellNum;

let Big_RNCellGrid =
    createCellGrid2D(minD(RN_MBR,1), minD(RN_MBR,2),
    BCellSize, BCellSize, BCellNum);
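To get a feeling for how this grid partitions the data, one can, for example, count the road tuples falling into each cell before spreading them (a sketch reusing only operators from Method 1):

query Roads feed extend[Box: bbox(.GeoData)]
      projectextendstream[Osm_id, Box; Cell: cellnumber(.Box, Big_RNCellGrid)]
      sortby[Cell] groupby[Cell; Cnt: group feed count]
      consume;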