Computing and Finding a Minimum Bottleneck Spanning Tree in Parallel
Ahmad Traboulsi
School of Computer Science
Carleton University
Ottawa, Canada K1S 5B6
ahmadtraboulsi@cmail.carleton.ca
April 19, 2015
Abstract
Finding a minimum bottleneck spanning tree consists essentially of finding the minimum
bottleneck edge. In this paper I parallelize an approach to finding that minimum
bottleneck edge. The parallelization operates on two levels, the cluster level and the cluster
node level (a CGM cluster); the algorithm is presented along with an evaluation and results.
The approach used is derived from the reverse-delete algorithm.
1 Introduction
The world of computing has led people to think about optimizing all solutions, i.e. finding
the best solution to a problem from the set of all feasible solutions. Many such problems
require a great deal of processing time and power, especially problems that involve processing
big data, and sometimes it is impractical to solve them with serial computing due to
hardware limitations. Nowadays, however, parallel computers with different architectures
(multicore, GPUs, clusters) are available, which allow better performance and efficiency
when the hardware is utilized well through parallel programming. One of the classical graph
optimization problems is the Minimum Spanning Tree (MST) problem; a related problem that this
paper addresses is the Minimum Bottleneck Spanning Tree (MBST) problem.
Formally, the Minimum Spanning Tree problem is defined as follows. Let G = (V, E)
be an undirected connected graph with a cost function w mapping edges to positive real
numbers. A spanning tree is an undirected tree connecting all vertices of G. The cost
of a spanning tree is the sum of the costs of its edges. A minimum
spanning tree is a spanning tree whose cost is minimum over all possible spanning trees of G.
The Minimum Bottleneck Spanning Tree problem is defined as follows. Let G = (V, E)
be an undirected connected graph with a cost function w mapping edges to positive real
numbers. The bottleneck edge of a spanning tree is the edge with the maximum cost among
all edges of that tree; there may be more than one bottleneck edge in a spanning tree, in
which case they all have the same cost. A spanning tree T is called a minimum bottleneck
spanning tree (MBST) if its bottleneck edge cost is minimum among all possible spanning
trees.
Some of the applications of MBST and MST ([15]) are:
• Taxonomy
• Cluster analysis: clustering points in the plane, single-linkage clustering (a method
of hierarchical clustering), graph-theoretic clustering, and clustering gene expression
data.
• Constructing trees for broadcasting in computer networks. On Ethernet networks this
is accomplished by means of the Spanning tree protocol.
• Image registration and segmentation
• Curvilinear feature extraction in computer vision.
• Handwriting recognition of mathematical expressions.
• Circuit design: implementing efficient multiple constant multiplications, as used in
finite impulse response filters.
• Regionalisation of socio-geographic areas, the grouping of areas into homogeneous,
contiguous regions.
• Comparing ecotoxicology data.
• Topological observability in power systems.
• Measuring homogeneity of two-dimensional materials.
• Minimax process control.
In addition, the MBST and MST are often a key module in solving more complex
graph problems. In this paper I present an approach that uses concurrent threads on the
reverse-delete algorithm to solve the MBST problem.
2 Literature Review
The literature contains no parallel algorithms or implementations for the Minimum Bottleneck
Spanning Tree problem. For sequential MBST algorithms I referred to the paper by
Camerini [2], which presents one algorithm for finding a minimum bottleneck spanning tree in
a weighted undirected graph and another for finding a minimum bottleneck spanning tree
in a directed graph. A second paper that solves the MBST problem in a directed graph is by
Gabow and Tarjan [5], which again presents a new algorithm for finding an MBST in a directed
graph, as well as a second algorithm, a modified Dijkstra algorithm, that finds an MBST in a
directed graph. Most of those algorithms are inherently sequential and therefore not a good
choice to parallelize; however, Kruskal's original paper includes an algorithm
called reverse-delete which can be used to obtain an MST or an MBST. On the other hand,
there are many papers about finding minimum spanning trees on parallel computing
systems with different architectures. Among the three well-known MST algorithms,
Kruskal's, Prim's and Boruvka's, the most targeted is the latter, because it is naturally
parallel, whereas Kruskal's and Prim's algorithms are inherently sequential, which makes them
difficult to parallelize; as far as I have observed, the papers that targeted those two algorithms
without combining or modifying them achieve little to no improvement or speedup. In what
follows I have divided the review into several subsections according to the parallel architecture used.
2.1 MST - GPUs
Three papers were reviewed for this architecture, starting with the current state-of-the-art
approach by Vineet et al. [13], which gives a speedup of 30 to 50 times over a CPU
implementation; on one quarter of a Tesla S1070 GPU, an MST is constructed in under one
second for a graph with 5 million nodes and 30 million edges. Their algorithm is based on
Boruvka's algorithm and uses scalable primitives such as scan, split and segmented scan, in
addition to efficient data-mapping primitives including sort, scan and reduce; it is basically
a recursive approach that applies a series of basic primitives at each step. A second paper,
by Nobari et al. [10], claims to outperform the state-of-the-art algorithm by Vineet et al.
Their approach is based on Prim's algorithm and also uses parallel primitives, namely
prefix-sum, stream compaction and sorting, as intermediate components. The algorithm
lets each processor try to grow a tree using Prim's algorithm; whenever a collision between
two trees occurs, one of the processors hands over its tree to the other and starts building
a tree from a new unvisited vertex. The idea is somewhat similar to an approach for finding
an MST with multicore transactional memory by Kang and Bader [7]. Finally, the third
paper, by Wang et al. [14], similarly uses Prim's algorithm; however, by not parallelizing
the outer loop, the algorithm does not perform well compared to the two previously
mentioned papers, since it does not try to grow multiple trees and only parallelizes the two
inner loops: finding the minimum-weight edge and updating the candidate edge set.
2.2 MST - Multicores
A fast shared-memory algorithm [1] by David A. Bader and Guojing Cong for computing an
MST of a sparse graph gives three variants of Boruvka's algorithm plus a new MST algorithm
for Symmetric Multiprocessors (SMPs). Their best variant of Boruvka's algorithm has a time
complexity of O(((m + n)/p) · log(n)), and their new MST algorithm has the same worst-case
time complexity of O(((m + n)/p) · log(n)). The new MST algorithm is nevertheless
interesting, since it combines Prim's and Boruvka's algorithms: the processors start growing
trees as in Prim's algorithm, then contract them as in Boruvka's, and finally the whole
procedure is repeated recursively; however, the lock-free mechanism used has excessive
overhead. The second paper, by Kang and Bader [7], builds on the latter algorithm of the
previous paper, the new MST algorithm. This paper also targets sparse graphs and provides
an average speedup of 8 and up to 16 times. The algorithm grows trees using Prim's
algorithm, and a processor stops growing its tree when it touches another tree; in this case
processor 1 hands over its tree to processor 2 and starts over from a new unvisited node.
The drawback of the algorithm presented is that it decreases the utilization of the
processors and therefore the degree of parallelism. The third paper in this section, by
Katsigiannis et al. [8], tries to parallelize Kruskal's algorithm using helper threads: a main
thread proceeds as in the usual Kruskal's algorithm while the helper threads try to shrink
the search space of the main thread. This is done by assigning to each helper thread a
partition of the list of edges; each helper keeps looping through its partition, testing whether
each edge would create a cycle with the MST edges found so far by the main thread, and as
soon as the main thread enters a helper thread's partition, that helper thread stops. They
report a speedup of 5.5 times over the sequential Kruskal's algorithm. The drawback of
this algorithm is, again, that the utilization of threads and processors decreases as the main
thread advances. The current state of the art for multicores is thus the second paper
presented in this section; however, that algorithm requires costly inter-processor
communication to merge subtrees when they come into contact.
2.3 MST - Clusters
Parallelizations of both Prim's and Kruskal's algorithms are presented by V. Loncar et al. [12].
A master-slave approach parallelizes Prim's algorithm by having several
processes find the minimum-weight edge in their sets of edges and vertices and finally
collecting the data and processing the results; this algorithm runs in O(n²/p) + O(n·log(p)). A
parallelization of Kruskal's algorithm is also presented, which works as follows: partitions of the
main graph are assigned to the processors, each of which locally computes an MST using Kruskal's
algorithm, and the results are merged; the time complexity of this algorithm is O(n²/p) + O(n²·log(p)).
The second paper is based on MapReduce and describes how to achieve a very
simple Java implementation of the Minimum Spanning Tree problem in MapReduce [11]. It
only gives implementation details; no analysis is provided. It basically uses Kruskal's
algorithm as a reducer after partitioning the graph into subgraphs.
2.4 MST - Abstract Machines
Two abstract machines were considered in two papers. F. Dehne and S. Götz [6] present a
Boruvka-based algorithm that computes the MST by having each processing unit find a local
MST, then pruning and merging the resulting MSTs into a single one using a D-ary
tree, on the BSP abstract computer. The second paper, by K. W. Chong et al. [4], achieves
optimal logarithmic time O(log(n)) on the EREW PRAM abstract computer. It takes log(n)
steps by using multiple threads working on different parts of the search space; as
soon as one thread finishes, the following thread requires only O(1) time to finish, and there
are log(n) threads, resulting in a time complexity of the order O(log(n)), making it the
state of the art.
2.5 MST - Architecture-Independent Methodology
One paper, by C. da Silva Sousa et al. [9], presents a platform-independent algorithm, a
variant of Boruvka's algorithm. The implementation rests on specific design and
implementation decisions such as the data representation. It claims to outperform all existing
algorithms; however, no results were shown comparing it to the state-of-the-art
algorithms stated previously. The implementation and the approach taken are interesting, and
from the implementation it is apparent that it would perform best on a GPU architecture.
The material above relates to the Minimum Spanning Tree problem; the problem I am
addressing, however, has not yet been touched in parallel computing, so I will be
reading more papers and doing further literature review on my approach to minimum
bottleneck spanning trees. In particular, I need to learn more about parallel algorithms
for connected components, since they are part of my approach to parallelizing the
reverse-delete algorithm.
3 Project Report
In this project I parallelize an approach inspired by the reverse-delete algorithm for
computing a minimum bottleneck spanning tree on two levels, the cluster level and the cluster
node level. The remainder of this report is organized as follows. Section 3.1 defines the MBST in
more detail. Section 3.2 presents the approach for computing a minimum bottleneck spanning
tree, and section 3.3 the approach for parallelizing the algorithm, with subsection 3.3.1 covering
parallelization on the cluster level and subsection 3.3.2 parallelization on the cluster node level.
Finally, section 3.4 illustrates the evaluation and results in figures.
3.1 Minimum Bottleneck Spanning Trees
Let G = (V, E) be an undirected connected graph with a cost function w mapping edges to
positive real numbers. A spanning tree is a tree connecting all vertices of G. The bottleneck
edge of a spanning tree is the edge with the highest cost among all edges of that tree; there
may be more than one bottleneck edge in a spanning tree, in which case they all have the same
cost. A spanning tree T is called a minimum bottleneck spanning tree (MBST) if its
bottleneck edge cost is minimum among all possible spanning trees. It is easy to see that a graph
may have many MBSTs (e.g. consider a graph where all edge costs are the same; then all
the spanning trees of that graph have the same bottleneck edge cost and no spanning tree exists
with a bottleneck edge cost lower than any other, therefore any spanning tree of
such a graph is an MBST).
The well-known Minimum Spanning Tree (MST) problem is related to the MBST in that
the former is necessarily an MBST while the opposite is not true. Therefore any algorithm
that computes an MST is also an algorithm that computes an MBST.
3.2 Reverse-Delete-Inspired Approach for Computing an MBST
Reverse-delete is an algorithm that is the exact reverse of Kruskal's algorithm. The
algorithm sorts the edges in non-decreasing order, then starts removing edges beginning with
the maximum-weight edge at index m (see figure 1); if the removal of an edge would disconnect
the graph, the edge is kept, and the algorithm proceeds checking down to the edge
at index 1.
Figure 1: Reverse delete for computing an MST.
To compute an MBST it is possible to proceed as in reverse-delete; however, the algorithm
stops at the first edge that disconnects the graph (see figure 2), adds that edge back
to the graph, and finally takes any spanning tree of the remaining edges, which will
be an MBST. To do this more efficiently, one can search for that first edge that
disconnects the graph by applying a binary-search-like technique.
Figure 2: Computing an MBST.
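To make the binary-search idea concrete, the following is a minimal sequential sketch (my illustration under the definitions above, not the paper's attached code); the helper names isConnectedWithFirst and bottleneckIndex are hypothetical, and connectivity is tested here with a plain sequential BFS over the k cheapest edges.

// Sequential sketch of the binary-search approach to finding the
// bottleneck edge (illustrative; helper names are hypothetical).
#include <algorithm>
#include <queue>
#include <vector>

struct Edge { int u, v; double w; };

// BFS connectivity check using only edges[0..k-1] of the sorted edge list.
bool isConnectedWithFirst(int n, const std::vector<Edge>& edges, int k) {
    std::vector<std::vector<int>> adj(n);
    for (int i = 0; i < k; ++i) {
        adj[edges[i].u].push_back(edges[i].v);
        adj[edges[i].v].push_back(edges[i].u);
    }
    std::vector<char> seen(n, 0);
    std::queue<int> q;
    q.push(0);
    seen[0] = 1;
    int visited = 1;
    while (!q.empty()) {
        int x = q.front(); q.pop();
        for (int y : adj[x])
            if (!seen[y]) { seen[y] = 1; ++visited; q.push(y); }
    }
    return visited == n;
}

// Binary search for the smallest prefix of the sorted edge list that keeps
// G connected; the last edge of that prefix is the minimum bottleneck edge.
int bottleneckIndex(int n, std::vector<Edge> edges) {
    std::sort(edges.begin(), edges.end(),
              [](const Edge& a, const Edge& b) { return a.w < b.w; });
    int lo = n - 1, hi = (int)edges.size();  // a spanning tree needs n-1 edges
    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (isConnectedWithFirst(n, edges, mid)) hi = mid;
        else lo = mid + 1;
    }
    return lo - 1;  // index of the bottleneck edge in the sorted list
}

Each probe of the search costs one O(m + n) connectivity check, and there are O(log(m)) probes, matching the sequential complexity stated in section 3.4.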
3.3 Parallelizing the Computation of an MBST
The aforementioned approach to computing an MBST can be parallelized on a cluster with
two-level parallelization: the cluster level and the cluster node level. The search for the edge is
parallelized on the cluster level, while some computations, mainly the connectivity check, are
parallelized on the cluster node level. The algorithm is presented below.
Algorithm 1 Parallel computation of an MBST of graph G
1: Sort the set E of edges in non-decreasing order
2: while the bottleneck edge is not found do
3:   FindOut()
4:   result = PBFS()
5:   Share and collect the results and the last edge index using Allgather
6:   analyse(collected data from Allgather)
7: end while
8: Find a spanning tree from the set of edges whose weight is ≤ the bottleneck edge weight
where each of the functions above does the following:
FindOut() is a function that lets each cluster node know, according to its rank, the set of
edges it is allowed to use in the current round.
PBFS() performs a parallel breadth-first search and returns 1 if the graph is
connected or 0 if it is not.
analyse() is a function that analyzes the collected data and updates the two variables
max_disconnected_edge and min_connected_edge, which are used in FindOut(); if the
bottleneck edge is found, it breaks the while loop.
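The cluster-level loop maps naturally onto MPI. Below is a hedged C++/MPI sketch of Algorithm 1's main loop (an illustration of the structure described above, not the paper's implementation); the bodies of findOut(), isConnected() and analyse() are omitted, and their signatures are assumptions.

// Hedged MPI sketch of Algorithm 1's cluster-level loop.
#include <mpi.h>
#include <vector>

// Helpers described in the text; bodies omitted, signatures assumed.
int findOut(int rank, int p);        // last edge index of this node's prefix
int isConnected(int lastEdgeIndex);  // the PBFS role: 1 if connected, else 0
int analyse(const std::vector<int>& L, int p);  // bottleneck index, or -1

int main(int argc, char** argv) {
    MPI_Init(&argc, &argv);
    int rank, p;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &p);

    // ... load the graph and sort edges in non-decreasing order (omitted) ...

    int bottleneck = -1;
    while (bottleneck < 0) {
        int lastEdge = findOut(rank, p);              // this round's edge set
        int local[2] = { isConnected(lastEdge), lastEdge };

        // Every node contributes (connected?, last edge index); Allgather
        // assembles the array L of section 3.3.1 on all nodes at once.
        std::vector<int> L(2 * p);
        MPI_Allgather(local, 2, MPI_INT, L.data(), 2, MPI_INT, MPI_COMM_WORLD);

        bottleneck = analyse(L, p);  // identical analysis on every node
    }
    // ... extract any spanning tree from edges of weight <= w(bottleneck) ...
    MPI_Finalize();
    return 0;
}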
3.3.1 Parallelization on Cluster Level
On the cluster level, each cluster node is assigned a set of edges as shown in figure 3,
performs local computations including the connectivity check, and then shares the results with
the other cluster nodes by participating in the filling of the array L in figure 4. The array size
equals the number of cluster nodes, and the array is expected to contain zeros followed by ones,
or all zeros, or all ones. This reflects the fact that if a cluster node p_i finds the graph connected
for the set of edges K_i, then processor p_{i+1} will also find the graph connected for the set
K_{i+1}, since K_i ⊂ K_{i+1}; conversely, if p_{i+1} finds the graph disconnected, then p_i will
also find the graph disconnected. After the array L is filled and shared among all processors
using Allgather, it is analyzed by each of the cluster nodes. In the analysis, the algorithm
updates its knowledge of the index of the minimum-weight edge required to keep the graph
connected and the index of the maximum-weight edge that disconnects the graph. Maintaining
these two indices helps in finding the bottleneck edge: it is found once the difference between
the two indices is one, i.e. if the maximum disconnected edge index is i and the minimum
connected edge index is i+1, then the edge at the latter index is the minimum bottleneck edge.
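A hedged sketch of this analysis step is shown below (one possible body for analyse(), with the per-node last-edge indices passed alongside L; all names are illustrative):

#include <algorithm>
#include <climits>
#include <vector>

struct SearchState {
    int maxDisconnected = -1;    // largest edge index known to disconnect G
    int minConnected = INT_MAX;  // smallest index known to keep G connected
};

// L[i] is 1 if cluster node i found the graph connected using its edge
// prefix ending at lastEdge[i], and 0 otherwise. Returns the bottleneck
// edge index once the two bounds become adjacent, or -1 for another round.
int analyse(const std::vector<int>& L, const std::vector<int>& lastEdge,
            SearchState& s) {
    for (std::size_t i = 0; i < L.size(); ++i) {
        if (L[i] == 0)  // prefix too short: the graph was disconnected
            s.maxDisconnected = std::max(s.maxDisconnected, lastEdge[i]);
        else            // prefix keeps the graph connected
            s.minConnected = std::min(s.minConnected, lastEdge[i]);
    }
    return (s.minConnected - s.maxDisconnected == 1) ? s.minConnected : -1;
}

Because L is monotone (zeros then ones), a single pass suffices to tighten both bounds.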
Each cluster node will be deleting and adding edges at each round, and since
the graph is represented as a compressed adjacency list, deleting an edge would cost
Figure 3: Three rounds of the algorithm.
Figure 4: Array L, whose size equals the number of cluster nodes.
O(deg(v)), which would cause a huge overhead. However, the deletion is done using an
array of edge statuses rather than by deleting edges from the compressed adjacency list, and since
each processor will be either adding or deleting edges at each round, but never both in the same
round, it can be seen that the number of deletion and addition operations is in
O(m/p²). Let p be the number of cluster nodes, S the size of a region, R the number
of regions, and Mx the current search space, so that
R = p + 1, S = Mx/R, and Mx = m/p^i, where m is the number of edges and i is the
round number.
The maximum number of operations (either deletions or additions) at any round is (p − 1)·S. The
total number of moves is therefore

    Σ_{i=0}^{log(m)} m·p / ((p + 1)·p^i)  =  m·p^{1−log(m)} · (p^{log(m)+1} − 1) / ((p − 1)(p + 1)),

and since the logarithm is of base p, this is m / ((p − 1)(p + 1)), which is O(m/p²).
Figure 5: Graph representation along with the sorted edges and the edge status array.
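To make the representation of figure 5 concrete, here is a hedged C++ sketch (field names are my own): the compressed adjacency list stays immutable, each adjacency entry stores its index into the sorted edge list, and a status array turns edge deletion and re-addition into O(1) flag flips that traversals simply skip.

#include <vector>

// Hedged sketch of the representation in figure 5; names are illustrative.
struct Graph {
    int n;                      // number of vertices
    std::vector<int> rowStart;  // rowStart[v]..rowStart[v+1]-1 index into adj
    std::vector<int> adj;       // neighbour vertex ids
    std::vector<int> edgeId;    // adjacency entry -> index in sorted edge list
    std::vector<char> active;   // active[e] == 1 iff edge e is currently kept

    void removeEdge(int e) { active[e] = 0; }  // O(1), not O(deg(v))
    void addEdge(int e)    { active[e] = 1; }

    // Visit the live neighbours of v, skipping edges flagged inactive.
    template <class F> void forEachNeighbour(int v, F f) const {
        for (int i = rowStart[v]; i < rowStart[v + 1]; ++i)
            if (active[edgeId[i]]) f(adj[i]);
    }
};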
3.3.2 Parallelization on Cluster Node Level
The parallelization on the cluster node level is mainly applied to the connectivity check, which
uses BFS; the parallelization here targets multicore processor nodes. There are several parallel
BFS versions and implementations; the one considered here is Parallel BFS (PBFS) by
C. Leiserson and T. Schardl [3]. According to this paper, the parallelization works over the
levels of the BFS tree. A BFS queue only ever holds nodes from two distinct levels, so the queue
can be replaced with a data structure called a bag. Essentially two bags are
required: the In-Bag holds the nodes of level i and the Out-Bag holds the nodes of level i+1 (see
figure 6). The nodes in the In-Bag are processed in parallel and their output goes into the
Out-Bag. To do this in parallel and efficiently, the In-Bag is split into smaller In-Bags, and for
each In-Bag there is an Out-Bag (see figure 7); these are later merged to become the In-Bag of
the next round, and the algorithm repeats. This results in a benign race when two processors
process the same neighbour of two distinct nodes, in which case the same node is added to two
different Out-Bags; this does not affect the correctness of the algorithm but causes some extra
work. The race could be eliminated by using locking; however, the results in [3] show that with
the locking technique the performance actually got worse. A simplified sketch of this
level-synchronous scheme follows the figures below.
Figure 6: Two bags, one for each distinct layer.
Figure 7: Processing the nodes in the In-Bags and placing the output in Out-Bags.
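The sketch below illustrates the level-synchronous idea in simplified form (this is not the Leiserson-Schardl bag data structure: per-thread vectors play the role of the split In-Bags and Out-Bags, and the benign race is avoided here with an atomic test-and-set rather than duplicated insertions).

// Simplified level-synchronous BFS connectivity check in the spirit of
// PBFS [3]; per-thread out-frontiers stand in for the Out-Bags.
#include <atomic>
#include <omp.h>
#include <vector>

bool pbfsConnected(int n, const std::vector<std::vector<int>>& adj) {
    std::vector<std::atomic<char>> visited(n);
    for (auto& v : visited) v.store(0);
    std::vector<int> frontier{0};  // the "In-Bag": nodes of the current level
    visited[0] = 1;
    std::atomic<int> count{1};

    while (!frontier.empty()) {
        std::vector<std::vector<int>> next(omp_get_max_threads());
        #pragma omp parallel for schedule(dynamic, 64)
        for (int i = 0; i < (int)frontier.size(); ++i) {
            int t = omp_get_thread_num();
            for (int y : adj[frontier[i]]) {
                char expected = 0;  // atomic test-and-set: each node enters
                if (visited[y].compare_exchange_strong(expected, 1)) {
                    next[t].push_back(y);  // this thread's "Out-Bag"
                    ++count;
                }
            }
        }
        frontier.clear();  // merge the Out-Bags into the next round's In-Bag
        for (auto& b : next)
            frontier.insert(frontier.end(), b.begin(), b.end());
    }
    return count == n;  // connected iff every vertex was reached
}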
3.4 Evaluation and Results
For evaluation and testing purposes I implemented a program, under the filename
CAG.cc in the attached codes folder, that creates a simple connected graph of 1,000 nodes
and 800,000 edges. The complexity of Algorithm 1 would be O(p(m + n) log_p(m)),
while the sequential one would be O((m + n) log_2(m)).
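CAG.cc itself is not reproduced here, but a minimal sketch of one way to generate such a test graph follows (my illustration, not the attached code): a random spanning tree guarantees connectivity and random edges fill the rest. Note that 800,000 edges on 1,000 vertices exceeds the simple-graph limit of n(n−1)/2 = 499,500, so this sketch permits parallel edges.

// Hedged sketch of a connected test-graph generator in the spirit of
// CAG.cc (illustrative only; the real CAG.cc is in the codes folder).
#include <cstdio>
#include <random>

int main() {
    const int n = 1000, m = 800000;
    std::mt19937 rng(42);
    std::uniform_int_distribution<int> vert(0, n - 1);
    std::uniform_real_distribution<double> weight(1.0, 1000.0);

    // Spanning tree first: attach each vertex i to a random earlier vertex,
    // which guarantees the generated graph is connected.
    for (int i = 1; i < n; ++i) {
        int j = std::uniform_int_distribution<int>(0, i - 1)(rng);
        std::printf("%d %d %.3f\n", i, j, weight(rng));
    }
    // Fill with random edges up to m in total (parallel edges allowed).
    for (int e = n - 1; e < m; ++e) {
        int u = vert(rng), v = vert(rng);
        if (u == v) { --e; continue; }  // retry on self-loops
        std::printf("%d %d %.3f\n", u, v, weight(rng));
    }
    return 0;
}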
The evaluation and tests were done as follows. Four implementations were to be tested;
however, the cluster at the lab has no Cilk installed, so only three implementations were
tested:
1. Sequential MBST based on the binary search approach.
2. Parallel MBST with sequential BFS, i.e. parallelization on the cluster level only.
3. One cluster node with parallel BFS, i.e. parallelization on the cluster node level only,
basically one node.
The implementation that was not tested is the parallel MBST with parallel BFS, where the
parallelization is done on both the cluster and cluster node levels. The code is implemented,
however, and only testing was required.
The figures below show how each of the tested implementations performed.
Figure 8: Comparison of the time taken by each implementation; the yellow bar shows the
parallel MBST-BFS with 5 nodes, the green the sequential MBST, and the blue the
MBST-PBFS with one node.
Figure 9: Comparison of the time taken by parallel MBST-BFS with 1, 3, 4 and 5 nodes.
Figure 10: Comparison of the number of rounds for parallel MBST-BFS with 1, 3, 4 and 5 nodes.
4 Conclusion
The parallelization on the cluster level yielded a small improvement in speedup, which can be
clearly seen since the improvement was from a log base 2 to a log base p factor, where p is the
number of nodes in the cluster. The parallelization on the cluster node level, however, gave a
significantly better speedup, and combining both should achieve a better speedup still, which
could be confirmed with more testing using different numbers of nodes and edges. Indeed, the
parallel BFS performed significantly better, since the most expensive part of the algorithm is
the BFS, and parallelizing it had a big impact on performance. The parallelization on the
cluster level has an overhead that I worked hard to minimize; on its own it offered little
improvement, but in combination with the parallelization on the cluster node level it achieves
better performance.
References
[1] David A. Bader and Guojing Cong. Fast shared-memory algorithms for computing
the minimum spanning forest of sparse graphs. Journal of Parallel and Distributed
Computing, 66(11):1366–1378, November 2006.
[2] P.M. Camerini. The min-max spanning tree problem and some extensions. Information
Processing Letters, pages 10–14, January 1978.
[3] Charles E. Leiserson and Tao B. Schardl. A work-efficient parallel breadth-first search
algorithm. In SPAA '10: Proceedings of the Twenty-Second Annual ACM Symposium on
Parallelism in Algorithms and Architectures, pages 303–314, June 2010.
[4] Ka Wong Chong, Yijie Han, and Tak Wah Lam. Concurrent threads and optimal
parallel minimum spanning trees algorithm. Journal of the ACM, 48(2):297–323, March 2001.
[5] Harold N Gabow and Robert E Tarjan. Algorithms for two bottleneck optimization
problems. Journal of Algorithms, 9(3):411–417, September 1988.
[6] Frank Dehne and Silvia Götz. Practical parallel algorithms for minimum spanning trees. IEEE, 1998.
[7] Seunghwa Kang and David A. Bader. An efficient transactional memory algorithm for
computing minimum spanning forest of sparse graphs. In Proceedings of the 14th ACM
SIGPLAN Symposium on Principles and Practice of Parallel Programming, February
2009.
[8] A. Katsigiannis, N. Anastopoulos, K. Nikas, and N. Koziris. An approach to parallelize
Kruskal's algorithm using helper threads. In Parallel and Distributed Processing
Symposium Workshops and PhD Forum (IPDPSW), 2012 IEEE 26th International,
May 2012.
[9] Artur Mariano, Cristiano da Silva Sousa, and Alberto Proença. A generic and highly
efficient parallel variant of Boruvka's algorithm.
[10] Sadegh Nobari, Thanh-Tung Cao, Panagiotis Karras, and Stéphane Bressan. Scalable
parallel minimum spanning forest computation. In PPoPP '12: Proceedings of the
17th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming.
ACM, February 2012.
[11] Antonio Paolacci. A mapreduce algorithm: How-to approach to the bigdata.
[12] V. Lončar et al. Distributed memory parallel algorithms for minimum spanning trees.
In Proceedings of the World Congress on Engineering, volume 2, 2013.
[13] Vibhav Vineet, Pawan Harish, Suryakant Patidar, and P. J. Narayanan. Fast minimum
spanning tree for large graphs on the GPU. In HPG '09: Proceedings of the Conference
on High Performance Graphics 2009, August 2009.
[14] Wei Wang, Shaozhong Guo, Fan Yang, and Jianxun Chen. GPU-based fast minimum
spanning tree using data parallel primitives. In The 2nd International Conference on
Information Engineering and Computer Science. IEEE, December 2010.
[15] Wikipedia. https://en.wikipedia.org/wiki/Minimum_spanning_tree. Accessed:
2015-02-12.