Going the Distance - Western Users of SAS Software
Going the Distance: Google Maps Capabilities in a Friendly SAS Environment

Anton Bekkerman, Ph.D., Montana State University, Bozeman, MT

ABSTRACT

While the GEODIST function allows users to calculate “as the crow flies,” straightline distances, SAS does not directly provide capabilities to calculate road distances between locations. For many users who might only have addresses (rather than geographic coordinates) and who need to determine actual road distances for optimizing routes, minimizing transportation costs, or simply translating postal addresses to geographic coordinates, existing SAS functionality may be insufficient. I demonstrate how Google Maps can be integrated with SAS to perform these functions and output the desired results within the SAS environment. That is, after a SAS user specifies a location or multiple locations (as postal addresses, city names, state names, etc.), the information is passed to Google Maps from within SAS, the underlying Google Maps HTML code with the coordinates and/or directions is retrieved and parsed, and the desired results are recorded to a SAS dataset. The entire process is completed using only a few lines of code within a single DATA step. Moreover, I demonstrate how the process can be easily automated for numerous location entries within a MACRO environment. A comparison of the native SAS straightline and integrated road distance methods indicates that, on average, the straightline method underestimates the true road distance by approximately 25%, and this error becomes larger as the distance between locations increases.

INTRODUCTION

When you are asked to get from location A to location B, what is your first reaction? Perhaps it is to pull out your smart phone and use one of the myriad driving directions apps. Or maybe it is to access a web-based option, such as Google Maps, MapQuest, or Bing.
Or, perhaps you may even be tempted to pull out the circa 1997 road atlas, which has had its corners chewed off by your dog (or kid), proudly displays travel mug coffee stains, and has been accumulating dust in your car’s trunk and waiting for the “just-in-case” scenario when there is neither a wifi nor cellular phone signal.[1] Regardless of your preferred method, rarely do you consider calculating distances using the “as the crow flies” method: a straightline connection between two spatially separated points, which accounts for the Earth’s curvature but ignores the constraints associated with traveling on roads. Such constraints are manifest in routes being indirect connections between starting and ending locations due to factors such as geological characteristics (e.g., unbridged bodies of water), construction or road repair projects, or simply no available routes that mimic the “as the crow flies” path. Moreover, it is reasonable to assume that most travel occurs using ground transportation, rather than other methods that may be more characteristic of a straightline distance.[2]

While the SAS software has continued to update and expand its spatial analysis capabilities, tools for easily determining and automating road distances between locations are not directly available. Moreover, the constantly changing road conditions and accessibility to driving routes require a dynamic method for recognizing these changes and providing the most current spatial analysis results.

This paper presents a relatively straightforward method for determining road distances by integrating the Google Maps directions tool, which provides a mechanism for optimizing transportation routes within much of North America and the world. A preliminary example demonstrates the underlying process for calling the Google Maps directions tool directly from SAS and extracting relevant distance information into a SAS dataset.
The technique is then generalized to determine distances for any number of starting and ending location combinations. The presented methodology is then compared to the native distance calculation tool in SAS, the GEODIST function, which calculates the straightline distance between two spatially separated points. The comparison analysis shows that the GEODIST function underestimates road distances by approximately 25%. Such errors can have non-trivial impacts on studies that rely on a precise understanding of distances and travel routes for estimating costs and revenues, optimizing logistics, and improving marketing efforts, among other activities.

[1] Yes. From a recent personal experience, I can attest that such places still exist.
[2] One could argue that travel by rail or air follows straightline routes. However, railroads are often subject to constraints similar to those of roads, and air travel is subject to layovers in locations that prevent direct routes.

NATIVE SAS DISTANCE CALCULATION TOOLS

The GEODIST function is used to calculate distances between two geographic coordinates using the Haversine formula (SAS Institute, Inc. 2011). The formula determines the shortest, straightline distance between two coordinates, accounting for the approximate curvature of the Earth. The function requires four arguments: the latitude and longitude of the starting location and the latitude and longitude of the destination.
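For reference, the haversine formula named above is commonly written as shown below. The notation is an assumption introduced here, not taken from the paper: φ1 and φ2 are the latitudes and λ1 and λ2 the longitudes of the two points (in radians), R is the Earth's radius, and d is the resulting straightline (great-circle) distance.

```latex
d = 2R \arcsin\!\left( \sqrt{ \sin^2\!\left( \frac{\varphi_2 - \varphi_1}{2} \right)
      + \cos\varphi_1 \, \cos\varphi_2 \, \sin^2\!\left( \frac{\lambda_2 - \lambda_1}{2} \right) } \right)
```

The distance d is returned in whatever units are used for R, which is why a units option (miles or kilometers) is all that is needed to switch the output.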
While manually obtaining these coordinates from postal addresses or location names is not overly costly when dealing with only a few locations, increasing the number of observations can become expensive or even impractical.[3] Using a known set of coordinates, the GEODIST function can be called within the DATA step as follows:

    distance = geodist(latitudeStart,longitudeStart,latitudeEnd,longitudeEnd,'M');

where latitudeStart and longitudeStart represent columns of the latitude and longitude coordinates of the starting locations, latitudeEnd and longitudeEnd represent columns of the latitude and longitude coordinates of the destinations, and distance is the column containing the resulting straightline distances. The option 'M' requests that the distance be output in miles rather than kilometers, which are the default units.

The straightline route is rarely the same as the driving route between the two locations. Moreover, it is expected that the difference between the two alternatives will be more substantial as the distance between two locations increases. Figure 1 provides a visual comparison of the straightline distance and one that is based on drivable routes between Bozeman, MT and Las Vegas, NV. The figure makes evident the constraints that bind road travel but not necessarily straightline approximations.

INTEGRATING GOOGLE MAPS

As shown in Figure 1, the Google Maps directions tool can be used to obtain a more precise estimate of driving distances. This is the underlying mechanism for generating driving distance data within SAS. The following SAS code demonstrates a basic framework for performing the SAS-Google Maps integration.
    %let addr1 = Bozeman,MT;
    %let addr2 = Las+Vegas,NV;

    filename google url "http://maps.google.com/maps?daddr=&addr2.%nrstr(&saddr)=&addr1";

    data dist(drop=html);
        infile google recfm=f lrecl=10000;
        input @ '<div class="altroute-rcol altroute-info"> <span>' @;
        input html $50.;
        if _n_ = 1;
        locStart = "&addr1";
        locEnd = "&addr2";
        roaddistance = input(scan(html,1," "),comma12.);
    run;

    proc print data=dist noobs;
    run;

The MACRO variables addr1 and addr2 specify the starting and ending locations, respectively, and are the only user-input variables. The fileref google points to a URL that requests that Google Maps generate driving directions between the two specified locations. The HTML code underlying the route displayed in Google Maps is then read into SAS and parsed within the DATA step. The third line of the DATA step specifies that SAS begins to parse the HTML code after the text <div class="altroute-rcol altroute-info"> <span>. That is, the DATA step skips all text that precedes the location where the road distance value is reported. Lastly, the SCAN function is used to extract the road distance value into the SAS dataset. Table 1 shows the contents of the resulting dist dataset.

[3] The Appendix presents SAS code that helps automate the process for obtaining geographic coordinates for postal addresses and location names. Users can also use the GEOCODE procedure, but a detailed discussion of this procedure is out of the scope of this paper.

Figure 1: Comparison of Straightline and Driving Routes Between Bozeman, MT and Las Vegas, NV
Source: The map was generated using Google Maps.
Notes: The starting location is Bozeman, MT (45.682677,-111.053288) and the ending location is Las Vegas, NV (36.116799,-115.174534).
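The parsing step in the basic example relies on the @'character-string' column pointer of the INPUT statement, which moves the read position to the first column after a matching string. The sketch below isolates that mechanism without any network access. The single data line is a hypothetical stand-in for the retrieved page source (the "832 mi" text and the demo dataset name are assumptions for illustration), and the pointer and the read are combined into one INPUT statement rather than the two used in the paper's code.

```sas
/* Hypothetical one-line stand-in for the retrieved HTML; not real page source */
data demo(drop=html);
    infile datalines truncover;
    /* Move to the first column after the marker text, then read up to 50 characters */
    input @'<div class="altroute-rcol altroute-info"> <span>' html $50.;
    /* Keep the first blank-delimited token and convert it to a number */
    roaddistance = input(scan(html, 1, " "), comma12.);
    datalines;
<div class="altroute-rcol altroute-info"> <span>832 mi</span>
;
run;

proc print data=demo noobs;
run;
```

Because SCAN keeps only the text before the first blank, the units label and any trailing markup are discarded before the COMMA12. informat converts the value.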
Table 1: Contents of the dist Dataset: Road Distance Information

    locStart      locEnd          roaddistance
    Bozeman,MT    Las+Vegas,NV    832

Of course, the advantages of using this approach are minimal when determining distances for one or a few location pairs, because users can go directly to Google Maps and obtain the same outputs. Substantial improvements in efficiency (and cost-savings) begin to be realized when road distances need to be recorded for a large number of location pairs. For example, consider a courier service that has three warehouses from which packages could be delivered to customers. The courier service may be interested in understanding how to efficiently allocate delivery packages to the warehouses such that the final delivery distances are minimized. This requires that the courier service determine the driving distances from each of the three warehouses to the final destinations.

The following data represent randomly generated courier service warehouse sites and customer locations in the Bozeman, MT area.

    data courier;
        /* The & modifier allows embedded single blanks, so fields that use it */
        /* must be followed by at least two blanks in the data lines           */
        input warehouse_address & $19. warehouse_city $ & warehouse_state $
              customer_address & $19. customer_city $ & customer_state $;
        datalines;
    8250 Huffine Lane  Bozeman  MT  2884 Caterpillar Dr.  Bozeman  MT
    8250 Huffine Lane  Bozeman  MT  408 S 12th Ave.  Bozeman  MT
    8250 Huffine Lane  Bozeman  MT  30 Main Street  Belgrade  MT
    6553 N 19th Ave  Bozeman  MT  2884 Caterpillar Dr.  Bozeman  MT
    6553 N 19th Ave  Bozeman  MT  408 S 12th Ave.  Bozeman  MT
    6553 N 19th Ave  Bozeman  MT  30 Main Street  Belgrade  MT
    1340 Kagy Blvd  Bozeman  MT  2884 Caterpillar Dr.  Bozeman  MT
    1340 Kagy Blvd  Bozeman  MT  408 S 12th Ave.  Bozeman  MT
    1340 Kagy Blvd  Bozeman  MT  30 Main Street  Belgrade  MT
    ...
    ;
    run;

The following MACRO uses the location pair information in the courier dataset and creates an output dataset containing the driving distance for each pair.
    /**********************************************************************/
    /* Purpose: Determine road distances for location pairs               */
    /* Author:  Anton Bekkerman                                           */
    /*                                                                    */
    /* User inputs:                                                       */
    /*   input     = name of SAS input dataset                            */
    /*               (e.g., libname.inputName)                            */
    /*   output    = name of SAS output dataset                           */
    /*               (if empty, then libname.inputName_dist)              */
    /*   startAddr = variable name of starting location address           */
    /*               (variable content example: 555 StreetName Dr.)       */
    /*   startCity = variable name of starting location city              */
    /*               (variable content example: Bozeman)                  */
    /*   startSt   = variable name of starting location state             */
    /*               (variable content example: MT)                       */
    /*   endAddr   = variable name of destination address                 */
    /*   endCity   = variable name of destination city                    */
    /*   endSt     = variable name of destination state                   */
    /**********************************************************************/
    %macro road(input,output,startAddr,startCity,startSt,endAddr,endCity,endSt);

        /* Check if input data set exists; otherwise, throw exception */
        %if %sysfunc(exist(&input)) ne 1 %then %do;
            data _null_;
                file print;
                put #3 @10 "Data set &input. does not exist";
            run;
            %abort;
        %end;

        /* Check if user specified output dataset name; otherwise, create default. */
        /* %length is used because quotation marks are literal text in a macro     */
        /* comparison, so an empty value never equals ""                           */
        %if %length(&output) > 0 %then %let outData = &output;
        %else %let outData = &input._dist;

        /* Replace all inter-word spaces with plus signs */
        data tmp;
            set &input;
            addr1 = tranwrd(left(trim(&startAddr))," ","+")||","||
                    tranwrd(left(trim(&startCity))," ","+")||","||
                    left(trim(&startSt));
            addr2 = tranwrd(left(trim(&endAddr))," ","+")||","||
                    tranwrd(left(trim(&endCity))," ","+")||","||
                    left(trim(&endSt));
            n = _n_;
        run;

        /* Store the number of observations in the macro variable nObs */
        data _null_;
            if 0 then set tmp nobs=n;
            call symputx("nObs",n);
            stop;
        run;

        %do i=1 %to &nObs;

            /* Place starting and ending locations into macro variables */
            data _null_;
                set tmp(where=(n=&i));
                call symput("addr1",trim(left(addr1)));
                call symput("addr2",trim(left(addr2)));
            run;

            /* Determine road distance */
            options noquotelenmax;
            filename google url "http://maps.google.com/maps?daddr=&addr2.%nrstr(&saddr)=&addr1";

            data dist(drop=html);
                infile google recfm=f lrecl=10000;
                input @ '<div class="altroute-rcol altroute-info"> <span>' @;
                input html $50.;
                if _n_ = 1;
                roaddistance = input(scan(html,1," "),comma12.);
            run;

            /* Attach the distance to the original location pair */
            data dist;
                merge tmp(where=(n=&i)) dist;
            run;

            /* Append to output dataset */
            %if &i=1 %then %do;
                data &outData;
                    set dist(drop=n addr:);
                run;
            %end;
            %else %do;
                proc append base=&outData data=dist(drop=n addr:) force;
                run;
            %end;

        %end;

        /* Delete the temporary dataset */
        proc datasets library=work noprint;
            delete tmp;
        quit;

    %mend;

The MACRO road is used to evaluate road distances for the destinations contained in the courier dataset. Table 2 presents an abbreviated representation of the resulting output data, courier_dist. These data can now be used to evaluate the optimal courier warehouse location (conditional on distance to final destination) to minimize the total costs for delivering packages to their final destinations.
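As a minimal sketch of how the macro might be called and its output used, the code below passes the courier variable names in the order of the macro's parameter list and then keeps, for each customer, the warehouse with the shortest road distance. It assumes the road macro and the courier dataset have already been defined as shown earlier; the dataset names courier_sorted and nearest_warehouse are hypothetical, and choosing the warehouse by minimum distance alone is only one simple allocation rule.

```sas
/* Assumes the road macro and the courier dataset from the paper are available */
%road(courier, courier_dist,
      warehouse_address, warehouse_city, warehouse_state,
      customer_address, customer_city, customer_state);

/* Order each customer's rows from shortest to longest road distance.   */
/* Rows with a missing distance (e.g., a failed lookup) are excluded,   */
/* because missing values would otherwise sort first.                   */
proc sort data=courier_dist(where=(roaddistance is not missing))
          out=courier_sorted;
    by customer_address customer_city customer_state roaddistance;
run;

/* Keep the first (closest) warehouse for each customer */
data nearest_warehouse;
    set courier_sorted;
    by customer_address customer_city customer_state;
    if first.customer_state;
run;
```

The output dataset name is passed explicitly here so that the call does not depend on the macro's default naming.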
Table 2: Contents of the courier_dist Dataset: Road Distance Information for Multiple Location Pairs

    Warehouse                            Destination                            Road Distance
    Address            City     State    Address              City      State   (miles)
    8250 Huffine Lane  Bozeman  MT       2884 Caterpillar Dr  Bozeman   MT       6.3
    8250 Huffine Lane  Bozeman  MT       408 S 12th Ave.      Bozeman   MT       6.4
    8250 Huffine Lane  Bozeman  MT       30 Main Street       Belgrade  MT       8.1
    6553 N 19th Ave    Bozeman  MT       2884 Caterpillar Dr  Bozeman   MT       6.2
    6553 N 19th Ave    Bozeman  MT       408 S 12th Ave.      Bozeman   MT       4.8
    6553 N 19th Ave    Bozeman  MT       30 Main Street       Belgrade  MT      14.1
    1340 Kagy Blvd     Bozeman  MT       2884 Caterpillar Dr  Bozeman   MT       3.2
    1340 Kagy Blvd     Bozeman  MT       408 S 12th Ave.      Bozeman   MT       1.3
    1340 Kagy Blvd     Bozeman  MT       30 Main Street       Belgrade  MT      11.1
    ...                ...      ...      ...                  ...       ...      ...

AN EMPIRICAL COMPARISON OF METHODS

As noted above and shown in Figure 1, there is likely a discrepancy between the straightline and driving directions methods for calculating distances. However, if the discrepancy is only trivial, then the integrated Google Maps approach may not be cost-effective. To evaluate whether the dissimilarities are statistically significant and to quantify the potential error, I use the road MACRO and the GEODIST function to determine distances using a large number of location pairs. As an example, the comparison is made using the travel distance between Las Vegas, NV (the 2013 location of the Western Users of SAS Software annual conference) and the locations of four- and two-year universities in California, Colorado, Montana, Nevada (excluding those in Las Vegas), Oregon, Washington, and Wyoming. The resulting dataset contains a total of 272 location pairs.

Figure 2 shows a comparison of these distances across all location pairs, across pairs that are separated by less than or equal to 500 miles, and across locations that are separated by a distance greater than 500 miles. In each case, the straightline approximation underestimates the road distance.
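To make the error measure concrete for a single pair, the sketch below combines the Bozeman and Las Vegas coordinates from the notes to Figure 1 with the road distance reported in Table 1. The percent underestimation is defined here relative to the road distance, which is an assumption, since the paper does not state its exact formula; the dataset and variable names are likewise hypothetical.

```sas
data one_pair;
    /* Coordinates taken from the notes to Figure 1 */
    latStart = 45.682677;  lonStart = -111.053288;   /* Bozeman, MT   */
    latEnd   = 36.116799;  lonEnd   = -115.174534;   /* Las Vegas, NV */

    /* Road distance in miles, as reported in Table 1 */
    roaddistance = 832;

    /* Straightline distance in miles from the native SAS function */
    straightline = geodist(latStart, lonStart, latEnd, lonEnd, 'M');

    /* Assumed definition: shortfall as a percentage of the road distance */
    pctUnder = 100 * (roaddistance - straightline) / roaddistance;
run;

proc print data=one_pair noobs;
    var roaddistance straightline pctUnder;
run;
```

Applying the same calculation to every row of a dataset that holds both distances would give the per-pair values that an average error or a plot against distance could be built from.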
More importantly, this difference is statistically significant across all scenarios. This suggests that using straightline distances as approximations to road distances could lead to inaccurate inferences. In the sample used for this example, the average error is approximately 25%; that is, the road distance is underestimated by approximately 25% when using the straightline approach. The results also indicate that the error is larger when two locations are farther apart. This is generally observable in Figure 2, but is much more clearly observed in Figure 3. The latter figure shows that as the distance between the two locations in a pair increases, so does the degree of underestimation due to the use of a straightline distance approximation.

Figure 2: Comparison of Haversine (Straightline) and Road Distances Across 272 Location Pairs
Source: Figure generated by the author.
Notes: Bar heights indicate average distances and bands represent 95% confidence limits.

Figure 3: Percent Underestimation of Road Distance when Using the Straightline Distance Approximation
Source: Figure generated by the author.

CONCLUSION

The capabilities for spatial analysis continue to rapidly improve in SAS, but there remain aspects that require additional external resources. One such deficiency is the inability to calculate road distances between spatially separated locations. While the GEODIST function offers an approximation (which is appropriate to use in some cases), a more precise mechanism is not currently available. This imprecision is relatively straightforward to overcome by using the geocoding and driving direction functions of Google Maps. The Google Maps directions feature becomes even more powerful when it is coupled with SAS, enabling users to easily automate the road distance data collection process. This allows users to determine road distances across large datasets and immediately employ these data for statistical analyses.
Being able to obtain a more detailed and precise understanding of distances can substantially improve individuals’ and companies’ abilities to optimize their decisions and strategies, and can have significant economic impacts.

APPENDIX: DETERMINING LATITUDE AND LONGITUDE COORDINATES

The following code uses the SAS-Google Maps integration to geocode an address or location (determine the latitude and longitude coordinates).

    %let addr1 = Bozeman,MT;

    filename google url "http://maps.google.com/maps?q=&addr1";

    data location(keep=lat long);
        infile google recfm=f lrecl=10000;
        input @ 'viewport:{center:{' @;
        input html $50.;
        if _n_ = 1;
        /* Locate the text that delimits the latitude and longitude values */
        ystart = index(html,"lat:");
        yend   = index(html,",lng");
        xstart = index(html,"lng:");
        xend   = index(html,"},span");
        /* The third SUBSTR argument is a length, not an end position */
        lat  = input(substr(html,ystart+4,yend-ystart-4),best32.);
        long = input(substr(html,xstart+4,xend-xstart-4),best32.);
    run;

REFERENCES

SAS Institute, Inc. 2011. SAS/STAT 9.3 User's Guide. Cary, NC: SAS Institute Inc.

CONTACT INFORMATION

All SAS code described in this paper can be accessed by visiting the “Tools/Code” tab on the website listed below. Please address comments and questions to:

    Anton Bekkerman, Ph.D.
    205 Linfield Hall
    Montana State University
    P.O. Box 172920
    Bozeman, MT 59717-2920
    Phone: (406) 994-3032
    anton.bekkerman@montana.edu
    http://www.montana.edu/bekkerman

SAS and all other SAS Institute Inc. product or service names are registered trademarks or trademarks of SAS Institute Inc. in the USA and other countries. ® indicates USA registration. Other brand and product names are trademarks of their respective companies.