Geocluster
Diploma thesis presentation
Geocluster: Server-side clustering for mapping in Drupal based on Geohash
Master's programme: Software Engineering & Internet Computing
Josef Dabernig
Technische Universität Wien (Vienna University of Technology)
Institute of Software Technology and Interactive Systems
Research group: Information & Software Engineering Group
Supervisor: O.Univ.Prof. Dr. A Min Tjoa

Problem
Maps visualize data in an intuitive way. Performance and readability of digital mapping applications decrease when displaying large amounts of data. Client-side clustering uses JavaScript to group overlapping items. Server-side clustering is needed when too many items slow down processing and create network bottlenecks.
[Figure: User, Interact with map, Visualize map, Map]

Goals
◘ Implement real-time, server-side clustering
◘ Cluster up to 1,000,000 items within 1 second
◘ Visualize clusters on an interactive map
◘ Integrate with the Drupal framework
◘ Publish under the Open Source GPL license
◘ Implement use cases and evaluate results

Approach
◘ Research clustering, mapping and visualization
◘ Evaluate state-of-the-art technologies
◘ Design a scalable algorithm for clustering
◘ Implement and test the algorithm

Clustering
Clustering is the task of grouping unlabeled data in an automated way. The thesis researches cluster analysis to create an algorithm for server-side clustering with maps.
Algorithm considerations
◘ Pattern representation: spatial clusters
◘ Proximity measure: Euclidean distance
◘ Cluster type: prototype-based
◘ Algorithm: based on Geohash

Types of cluster analysis
Excerpt from the thesis page shown on the poster (Chapter 2, Cluster analysis):
[...] then a cluster can be defined as a connected component: a group of objects that are connected to one another, but have no connection to objects outside the group. An example can be seen in Figure 2.3c.
Density-based. A cluster is a dense region of objects that is surrounded by a region of low density. A density-based definition of a cluster is often employed when the clusters are irregular or intertwined and when noise and outliers are present. A density-based cluster can take on any shape; an example can be seen in Figure 2.3d.
Shared-Property (Conceptual Clusters). More generally, we can define a cluster as a set of objects that share a property. This definition encompasses all the previous definitions of a cluster. The process of finding such clusters is called conceptual clustering. When this conceptual clustering gets too sophisticated, it becomes pattern recognition on its own. Then this definition is no basic definition any more.
[...] specific interpretation of clusters that a method uses to create these clusters can result in totally different mathematical approaches. It is important to decide which type of clusters are needed to solve a problem.
When the wanted type of cluster is known, a suited method for extracting these clusters is needed. A variety of methods for searching clusters is available, each producing its own type of clusters. The way these methods work can be divided based on three characteristics. This defines not the [...]
[Figure 2.3: Types of [...]: (a) Well-separated, (b) Prototype-based, (c) Graph-based, (d) Density-based]

Geohash
Geohash is a latitude/longitude geocode system based on the Morton order. Coordinates are encoded as string identifiers with a hierarchical spatial structure.
[Figure: Geohash space decomposition on level 1. The letter "D" covers parts of the Americas.]

Mapping
◘ Spatial data is represented by points, lines or polygons in vector format or rastered images
◘ Projections map the geoid earth onto a planar surface which causes distortion
◘ A modern web mapping stack uses image base tiles with overlays of vector data
◘ The slippy map is rendered client-side by a JavaScript mapping library
The Drupal mapping stack has been studied for integration of a server-side clustering solution.
[Figure: A modern web mapping stack. Client: Browser, Web page, JavaScript mapping library, Vector data layer, Base image layer. Server: Tile server (Base layer tiles), Feature server (Vector data), Spatial database.]

Visualization
Foundations of geovisualization, visual variables, data exploration techniques and clutter reduction have been researched. A state-of-the-art analysis enumerates map visualization types and techniques for putting clustered, multi-variate data on maps.
◘ Map types: Geographic maps with markers, Heat/choropleth maps, Dot grid maps and Voronoi maps
◘ Cluster visualization techniques: Icon-based/Glyphs, Pixel-oriented as well as Geometric techniques and Diagrams
An evaluation classifies the stated techniques for cluster visualization on maps, based on exploratory analysis.
[Figures: Cluster islands; Drupal clusters; Simple glyph types; no edge entry; Matrix edge detection]

Drupal
Drupal is a free and open source content management system and framework. Developed and maintained by an international community, it currently backs more than 2% of all websites. The Drupal mapping stack has been evaluated for integration of a server-side clustering implementation, including modules for spatial data storage and presentation.

Implementation
Create a Geohash-based hierarchical spatial index
1) initialize algorithm variables (cluster level)
2) pre-cluster points based on Geohash
3) merge clusters by neighbor-check
The algorithm has been integrated into the Drupal mapping stack as shown in the figure below:
[Figure: Geocluster Solr architecture overview. Drupal Server: Views, Search API, Search API Solr, Geocluster, Geocluster solr, Geocluster Algorithm, Geocluster Visualization, Array of Geodata, Views GeoJSON, Leaflet Library. Apache Solr Server: Apache Solr. Client Browser: HTML Map Wrapper, Interactive Map, BBOX Strategy, GeoJSON.]

Geocluster
Geocluster integrates with state-of-the-art Drupal 7 modules like Geofield, Views and Leaflet to provide interactive, scalable, clustered maps. It has been released under the GPL license and can be downloaded from:
http://drupal.org/project/geocluster

Results
Two use cases have been realized and evaluated for performance and visualization: a geocluster demo use case and a GeoRecruiter prototype that extends the Recruiter distribution for job boards in Drupal 7.
The performance tests show that one of the three algorithm implementations fulfills the objective:
◘ the PHP implementation doesn't scale well
◘ the MySQL clustering scales up to 100,000 items
◘ the Solr version scales beyond 1,000,000 items
[Figure: Cluster algorithm performance (Geocluster performance). Request time from 0 to 1000 ms versus number of clustered items, with series none, mysql, php and solr.]

Contact: http://dasjo.at