Document 6502291
HOW TO USE SAS® SOFTWARE EFFECTIVELY ON LARGE FILES

Juliana M. Ma, University of North Carolina

INTRODUCTION

The intent of this tutorial was to present basic principles worth remembering when working with SAS software and large files. The paper varies slightly from the original tutorial to compensate for format differences. The methods discussed are directed toward project managers, data managers, programmers, and other users who have data files that must be processed carefully because of size. The objective is to present ideas for improving efficiency in projects using the SAS system on large files.

The topics include:

• documentation of projects,
• data storage considerations,
• efficient programming techniques.

Examples with actual SAS statements show how the principles can be applied. Basic knowledge of base SAS software is assumed. The emphasis is on batch processing in a mainframe computing environment, e.g., under OS.

What is a Large File?

A basic characteristic of a "large" file is that you do not want to process it unnecessarily. When used alone, "file" refers to either a raw file or a SAS dataset. Processing a large file is generally expensive and is only done after careful program testing. Any production run, which processes an entire file, becomes a significant event. Common avoidable reasons for wasted production runs are programming mistakes, programming oversights, or slight changes in the original program request.

In a mainframe computing environment a file with 10,000 to 1,000,000 records may be considered large. The number of variables is obviously another factor in determining a file's size category. Batch processing, overnight processing, and tape storage are other indications that you are dealing with large files.

In computing environments using microcomputers or minicomputers, the definition of "large" clearly depends on the computer system's capacity. The principles presented here remain useful with appropriate adjustments to the specific techniques.

PROJECT MANAGEMENT

The importance of project management is magnified when large files are involved. The first principle is to require advance planning for all processes and programs. Just a few of the advantages of proper planning are:

• efficient use of resources,
• easy progress tracking,
• cooperative programming,
• reusable programs,
• reusable databases.

Planning should begin well before a project requires any programming. Early planning ensures that computing resources, both machinery and personnel, are used efficiently. A simple example of poor planning is a project for which programming must begin, but no arrangement has been made for access to the mainframe computer through an appropriate account.

During a project, planning results in better management control. Project management is simpler when a structure for reporting project progress is well defined. If you know the current status of a project, then making a modification is usually easier. Cooperative programming, in which one programmer continues a task begun by another, benefits from planned documentation of programs.

Programs and datasets are more useful after a project is completed when project wrap-up is an integral part of project planning. For instance, a program may contain programming techniques applicable to another situation, but without documentation such a program may be lost or too difficult to use effectively. Datasets are difficult to reuse effectively without thorough documentation from the original project.

Project Documentation

Documentation is the key to simplifying the job of project management. Some of the basic forms for documentation are:

• flowcharts,
• comments and TITLEs in programs,
• codebooks,
• programmer notes.

Flowcharts are still the best way to follow complex program logic, or the flow of data through many processing stages. Program flowcharts are particularly appropriate for complex system programs that require maintenance. The CONTENTS procedure with the HISTORY option provides one way of reviewing the background of a dataset, but the graphical format of a flowchart is easier to understand quickly.

An effective way to accomplish program documentation is to encourage the creation of self-documenting programs. The modular structure of the SAS system, in which programs have separate DATA and PROC steps, provides a good foundation on which to build. A self-documenting program includes blank lines, indentation, TITLE statements, labels, descriptive variable names, and comments. Any programmer finds it easier to modify a program with built-in documentation; even the original programmer benefits when changes are required after a time lapse (i.e., a month or more). Comments can apply to program logic, version changes, basically anything without an obvious SAS statement.

Any permanent file must have a codebook with detailed information about the values of each variable. A codebook for a SAS dataset may use output from the CONTENTS and FMTLIB procedures, especially if all variables are associated with permanent formats. New permanent variables should be added to a project's codebook as they are created.

Along with the formal documentation required for a project, programmers should be encouraged to keep their own notes. Plan to incorporate these notes into the archived documentation at the end of a project.

Naming Conventions

A well organized project has pre-defined naming conventions that are readily apparent. Naming conventions make it easier to understand any project or program. You should establish rules and guidelines for the following:

• program names,
• dataset names,
• variable names,
• format names,
• value labels.

The name of a program should indicate which project it belongs to, as well as its place in the program sequence. Even small programs should adhere to the conventions since any program has the potential to grow beyond original expectations. Whether you choose to use descriptive names (e.g., NC84REP, KIDTAB) or sequential names (e.g., STEP1, STEP2) is relatively unimportant as long as the choice is deliberate. (See Muller, Smith, and Christiansen, SUGI '81.)

Names need to be chosen carefully for all types of files. Raw files, SAS datasets, SAS databases, and files of program statements should each follow prescribed guidelines. In particular, the name of a file should immediately indicate the file type. Planned file names make it easier to locate a specific file, associate files clearly with a particular project, and speed the cleanup of unnecessary files at the end of a project.

As with program names, define variable names and format names deliberately. In general, descriptive names simplify program development. Although sequential names (VAR1, VAR2, etc.) can be used in variable lists (VAR1-VAR16), remembering that driver's age is VAR3 is more difficult than remembering DR_AGE. Choose format names that have natural associations with their variables. For instance, use AGEF as the format name for DR_AGE.

Data Storage

The method used for data storage is more critical with large files. You should understand the advantages and disadvantages of using raw files versus SAS datasets. In some cases the relatively high cost of converting a large file into a SAS dataset is quickly recouped in later processing. Using a SAS dataset, with a SET statement, eliminates certain sources of programming errors because SAS variables can have permanent labels, formats, and storage lengths. However, if a large raw file will only be used infrequently, or processing will always involve selecting a very small subset, then using an INPUT statement may be the best method.
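The raw-file-versus-SAS-dataset trade-off can be sketched as follows. This is a minimal sketch, not part of the original figures; the file names, DDNAME, variables, and column positions are all hypothetical:

```sas
/* Infrequent use: read only the needed fields from the raw file with  */
/* INPUT, selecting records as early as possible.                      */
DATA WORK.SUBSET1 ;
   INFILE RAWDD ;               /* DDNAME for the raw file (hypothetical) */
   INPUT VEHTYPE 12-13 @ ;      /* read the selection field first         */
   IF VEHTYPE = 1 ;             /* Select IF: stop here for other records */
   INPUT DRAGE 214-215 ;        /* convert remaining fields only when kept */

/* Frequent use: pay the one-time cost of a permanent SAS dataset;     */
/* later steps use SET and inherit labels, formats, and lengths.       */
DATA WORK.SUBSET2 ;
   SET PROJ.MASTER (KEEP = VEHTYPE DRAGE) ;
   IF VEHTYPE = 1 ;
```

The first pattern keeps the expensive raw-data conversion to a minimum; the second pattern amortizes one conversion over many later analysis steps.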
Another consideration is that concatenating files stored on many separate tapes may be simpler with raw files.

When a permanent SAS dataset is created, consider whether or not permanent formats are desirable. A format is permanently associated with a variable when a FORMAT statement is placed in a DATA step. All procedures then display formatted values unless an overriding FORMAT statement is included. One result of including a permanent format is that a user cannot even list observations without providing a format library or an appropriate PROC FORMAT. On the other hand, that means only a well informed user has ready access to the data. A compromise is to associate permanent formats with key variables only.

The storage space used by the SAS system for numeric variables is a very important aspect to understand when creating large SAS datasets. The LENGTH statement gives you control over the storage space used. Remember that, without a LENGTH statement, a raw two-digit value can become a SAS value that takes four times the storage space. Always use the shortest length possible for efficiency; when in doubt, use a conservative length to avoid truncation problems. Storing numeric identifiers (e.g., eight-digit case numbers) as character variables eliminates the possibility of numerical conversion problems.

PROGRAMMING TECHNIQUES

Although a complete discussion of large file programming techniques is impossible in this tutorial, some examples are instructive. An important point to remember is that processing time becomes a factor with large files. In contrast, techniques that decrease the time spent programming and testing may be more appropriate for projects with smaller files.

Efficient Programming

Efficient programming techniques that decrease processing costs are especially important for large file processing. Three techniques to consider are:

• Select records as soon as possible.
• Select variables when possible.
• Use intermediate working files.

A variety of methods exist for selecting records in a DATA step. The difference in execution logic between using a Select IF and IF condition THEN OUTPUT is worth understanding. DATA step statements following a Select IF (or DELETE) are executed selectively because of an implicit RETURN. Statements following IF ... THEN OUTPUT are executed regardless of whether or not a record satisfies the condition. For processing efficiency, place appropriate Select IF statements before other executable statements when creating a subset from a large file.

In some situations, limiting the number of variables is a way to improve program performance. However, do not eliminate variables unless you are positive that they have no potential uses for related tasks. When a SAS dataset is used as input, consider including a dataset DROP/KEEP option in the SET, MERGE, or UPDATE statement.

Intermediate working files are helpful, and occasionally unavoidable, with large file processing. (See Fischell and Hamilton, SUGI '87.) These files are usually stored as permanent SAS datasets during a project, but are not retained after project completion. One type of working file is a permanent sample dataset; probability or systematic samples are easily created using a DATA step. Alternatively, summary statistics can be generated and used as input for analyses. In Example 2, PROC SUMMARY is used as an intermediate step for producing PROC FREQ tables.

Program Testing

Even an efficient program must be tested carefully before a production run. The SAS system provides many ways of simplifying the testing process. For less complex programs, a test run with the general OBS= option may be sufficient. Regardless of how you create an appropriate test file, check intermediate results with procedures such as CONTENTS and PRINT. Ideally, statements used for testing should be easy to identify and change, or should not require any changes before a production run. Changing test statements to comments, which remain in the production run, is a helpful documenting technique.

APPLICATIONS

The examples of programs for large file processing are based on motor vehicle accident files. One input file has about 300,000 records for one calendar year. The data for a single year is stored on a separate full-size magnetic tape (2400 feet at density 6250). Each raw record is about 300 characters long. One record has information about one accident-involved unit; a unit is a motor vehicle (car, truck, bus, motorcycle), a pedestrian, or a bicyclist.

Example 1

This program, shown in Figure 1, is a prototype for extracting information from raw records. We want to create a permanent SAS dataset of selected variables for a specific analysis task. Our interest is only in cars, which make up seventy-five percent of the records. The analysis dataset will become input for a series of programs to investigate seat belt usage of car drivers.

Techniques are emphasized that relate to efficiently creating a subset from a large file. The following elements are discussed:

• test statements,
• meaningful SAS dataset names,
• use of the dataset LABEL option,
• efficient record selection,
• intelligent variable selection,
• definition of variable attributes,
• use of PROCs to check results,
• use of comments,
• creation of working files.

Figure 1: Creating a SAS Dataset From a Large Raw File

*OPTIONS OBS=1000 ;  /* for testing */

DATA ddtape.name (LABEL = 'description') ;
   INFILE indd ;
   INPUT VEHTYPE 12-13 @ ;
   IF VEHTYPE = 1 OR VEHTYPE = 2 ;  /* CARS */
   DROP VEHTYPE ;
   INPUT TOWN      3-5
         MONTH    12-13
         ACCYEAR  14-15
         RDCLASS  30
         /* ... other variables (1 or 2 digit fields) ... */
         DRINJ    214
         DRBELT   215 ;
   LENGTH TOWN 3 DEFAULT = 2 ;
   LABEL MONTH   = 'MONTH OF ACCIDENT'
         ACCYEAR = 'YEAR OF ACCIDENT'
         RDCLASS = 'ROAD CLASS'
         DRINJ   = 'DRIVER INJURY'
         DRBELT  = 'DRIVER SEAT BELT USAGE' ;
   OUTPUT ;  /* this is an optional statement */
   RETURN ;  /* this is an optional statement */

PROC CONTENTS NOSOURCE ;
   TITLE 'description' ;
PROC PRINT DATA = ddtape.name (OBS=10) ;
PROC FREQ DATA = ddtape.name (OBS=1000) ;
   TITLE2 'FIRST 1000 OBS ONLY' ;

The program is meant to be self-documenting. For an actual run, the SAS dataset name would be chosen to be as informative as possible, e.g., NC86_CAR. Note that variable names are descriptive. Comments and labels are included; blank lines and indentation make the program easier to understand. The procedures included to check DATA step results have appropriate TITLEs associated with them. For instance, a descriptive title could indicate that the dataset contained records for cars in accidents that occurred in 1986.

The DATA statement includes several components for which naming conventions should be followed. The first level of the SAS dataset name, ddtape, corresponds to an appropriate DDNAME in the IBM® Job Control Language associated with the SAS statements. The second level, sasname, is permanent and limited to eight alphanumeric characters. The dataset LABEL option allows better documentation since its limit is forty characters.

The first INPUT statement (for VEHTYPE) uses a trailing @ to efficiently select car records only. Since a Select IF statement is used, no further DATA step statements are executed for non-car records. The relatively expensive process of converting raw data is only done for the records of interest. The variable VEHTYPE is not kept in the output dataset since we have no further interest in the vehicle type information. Notice how a simple comment makes it easier to understand that cars are being selected.

The second INPUT statement uses column input to read the required variables for car records only. Placing each variable on a separate line and listing variables in the same order as a codebook makes checking the program simpler. Adding another variable is also simplified with this format. Variable names are descriptive and standardized. With this type of extract program, think carefully about which variables are necessary. Including a variable that is never analyzed is usually less expensive than reprocessing a large file because one variable was not included in the extract program.

In this example, specifying DEFAULT=2 on the LENGTH statement is appropriate because only one variable requires a storage length greater than two. However, an explicit length specification is necessary for TOWN. Note that no error messages would be generated if TOWN were given a length of two, even though numeric truncation would result in errors for values greater than 255.

The CONTENTS, PRINT, and FREQ procedures provide simple ways to check the results of the INPUT, LENGTH, and LABEL statements. PROC CONTENTS shows attributes of all variables in the output dataset. The NOSOURCE option is appropriate for batch printouts in which program statements, called source code, are printed together with the output from PROC CONTENTS. The PRINT and FREQ procedures can highlight obvious errors in the INPUT or LENGTH statements, e.g., incorrect input column specifications.

Testing is accomplished by using the OBS option. In this fashion a separate test file is unnecessary. All references to the actual raw file are automatically verified during test runs. For example, the tape containing the complete file is mounted and used, but only 1000 records are processed. Converting the OPTIONS statement to a comment provides documentation about the number of records required to test the program adequately.

Example 2

These programs show two ways of creating simple cross-tabulation tables from a subset of a large SAS dataset. The objective is to generate three tables considering the use of seat belts by car drivers. The input SAS dataset contains 150 variables and we expect it to be sorted by month.

The first program, Figure 2a, gets the job done, but is not self-documenting and does not allow for potential changes in the table specifications. A follow-up request to combine months into quarters would require reprocessing the large dataset or "hand calculations" based on the first tables. The combination of an uninformative dataset name, SUBSET, and a lack of comments makes it difficult to know which records are being selected.

The second program, Figure 2b, strives to be self-documenting and produces a permanent intermediate dataset of summary counts appropriate for producing a variety of tables. A summary dataset provides input for PROC FREQ using the WEIGHT statement, or PROC TABULATE using the FREQ statement, or even a DATA step if recoding of unknown categories to SAS missing values is needed (see Ma and Leininger, SUGI '84). Collapsing categories of any of the CLASS variables is possible using FORMAT statements or DATA step recoding. In addition to providing more flexibility, PROC SUMMARY is more efficient than PROC FREQ (see Sharlin, SUGI '83).

The dataset OBS option serves to avoid inadvertent problems when the complete large file is processed. The dataset KEEP option appears in the SET statement of both programs to select only the required variables. This is more efficient than a KEEP statement, which only affects the output dataset.
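The distinction between the dataset KEEP option and the KEEP statement can be sketched as follows. This contrast is a sketch, not part of the original programs; the dataset names follow the hypothetical ddtape.name convention used in the figures:

```sas
/* Efficient: only the two listed variables are read from the large dataset */
DATA WORK.FAST ;
   SET ddtape.name (KEEP = MONTH DRBELT) ;

/* Less efficient: all 150 variables are read into the DATA step; */
/* the KEEP statement only limits the output dataset.             */
DATA WORK.SLOW ;
   SET ddtape.name ;
   KEEP MONTH DRBELT ;
```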
When MONTH is a CLASS variable, as opposed to a BY variable, the successful completion of a production run depends less on whether the input dataset is sorted by MONTH. With BY processing, records are processed until the first out-of-place record is encountered. This could cause an expensive mistake if a problem occurs at record number 290,000 of 300,000.

The first TITLE statement applies to the output of all procedures in this program. The second applies only to PROC FREQ and serves to clarify that the frequencies are for program verification only.

Figure 2a: Simple Tables From a Large SAS Dataset

DATA SUBSET ;
   SET ddtape.name (KEEP = MONTH VEHTYPE DRINJ DRBELT DRSEX DRAGE) ;
   IF VEHTYPE = 1 OR VEHTYPE = 2 ;
   IF DRINJ = 6 THEN DELETE ;
   IF DRBELT = 5 THEN DRBELT = 0 ;

PROC FREQ DATA = SUBSET ;
   BY MONTH ;
   TABLES (DRINJ DRSEX DRAGE) * DRBELT ;
   TITLE 'description' ;

Figure 2b: Simple Tables From a Large SAS Dataset, Using a Summary Dataset

*OPTIONS OBS=5000 ;  /* for testing */

DATA CAR_DR (LABEL = 'description') ;
   SET ddtape.name (KEEP = MONTH VEHTYPE DRINJ DRBELT DRSEX DRAGE) ;
   IF VEHTYPE = 1 OR VEHTYPE = 2 ;
   DROP VEHTYPE ;                    /* CARS ONLY */
   IF DRINJ = 6 THEN DELETE ;        /* NO DRIVER PRESENT */
   IF DRBELT = 5 THEN DRBELT = 0 ;   /* COMBINE UNKNOWNS */
   OUTPUT ;
   RETURN ;

PROC CONTENTS NOSOURCE ;
   TITLE 'description' ;
PROC PRINT DATA = CAR_DR (OBS = 20) ;

PROC SUMMARY DATA = CAR_DR NWAY ;
   CLASS MONTH DRINJ DRSEX DRAGE DRBELT ;
   OUTPUT OUT = ddkeep.DRSTAT (LABEL = 'description') ;

PROC PRINT DATA = ddkeep.DRSTAT (OBS = 50) ;
   TITLE2 'SUMMARY STATS' ;
PROC FREQ DATA = ddkeep.DRSTAT ;
   TABLES MONTH--DRBELT ;
   WEIGHT _FREQ_ ;
PROC FREQ DATA = ddkeep.DRSTAT ;
   BY MONTH ;
   TABLES (DRINJ DRSEX DRAGE) * DRBELT ;
   WEIGHT _FREQ_ ;
   TITLE2 'description of requested tables' ;

SAS Procedures to Remember

The following SAS procedures prove useful for data management or checking programs. The list demonstrates the wide variety of utility procedures found in the SAS system.

COMPARE, CONTENTS, COPY, DATACHK*, DATASETS, FMTLIB*, ISAM*, PDS, PDSCOPY, PRINT, QPRINT (new in Version 5), RELEASE, SOURCE, TAPECOPY, TAPELABEL

* SUGI Supplemental Library User's Guide, Version 5.

Using SAS procedures in place of operating system utilities often makes data management tasks easier. The SAS utility procedures usually have clearer syntax than comparable operating system utilities. Information provided by the SAS system is more understandable in some cases, e.g., the output from PROC TAPELABEL.

Procedures are available that simplify the process of checking programming logic and results. As shown in the examples, procedures such as CONTENTS and PRINT provide simple ways to verify intermediate results as well as final output.

CONCLUSION

The objective of discussing a variety of topics related to processing large files is to encourage a better understanding of the data management and program checking tools in SAS software. The topics covered in this paper are meant to inspire you to explore the power of SAS software, not to provide a comprehensive set of techniques. In the long run, any project with large files is better when you take time to think ahead, and document, at every step. The main objective: efficient, self-documenting projects.

REFERENCES

Clark, S. and Konowal, L. (1986), "Efficient Use of PROC SUMMARY," Proc. of the Eleventh Annual SAS Users Group Intl. Conference, Cary, NC: SAS Institute, Inc., 664-669.

Council, K.A. (1980), SAS Applications Guide, 1980 Edition, Chapter 10: Processing Large Data Sets with SAS, Cary, NC: SAS Institute, Inc., 149-157.

Fischell, T.R. and Hamilton, E.G. (1987), "Processing Large Data Sets into Customized Tables," Proc. of the Twelfth Annual SAS Users Group Intl. Conference, Cary, NC: SAS Institute, Inc., in press.

Helms, R. (1978), "An Overview of RDM: I. Project and Data Management System Planning," American Statistical Association Proceedings of the Statistical Computing Section, 10-17.

Ma, J.M. and Fischell, T.R. (1985), "Descriptive Statistics: Using Indicator Variables," Proc. of the Tenth Annual SAS Users Group Intl. Conference, Cary, NC: SAS Institute, Inc., 1026-1030.

Ma, J.M. and Leininger, C. (1984), "PROC SUMMARY As the Basis for Efficient Analysis of Large Files," Proc. of the Ninth Annual SAS Users Group Intl. Conference, Cary, NC: SAS Institute, Inc., 309-312.

Merlin, R.Z. (1984), "Design Concepts for SAS Applications," Proc. of the Ninth Annual SAS Users Group Intl. Conference, Cary, NC: SAS Institute, Inc., 283-287.

Muller, K.E., Smith, J., and Bass, J. (1982), "Managing 'not small' Datasets in a Research Environment," Proc. of the Seventh Annual SAS Users Group Intl. Conference, Cary, NC: SAS Institute, Inc., 371-376.

Muller, K.E., Smith, J., and Christiansen, D.H. (1981), "Rules We Followed and Wish We Had Followed in Managing Datasets, Programs and Printouts," Proc. of the Sixth Annual SAS Users Group Intl. Conference, Cary, NC: SAS Institute, Inc., 401-405.

Ramsay, A. (1984), "Keyed Access to SAS Data Sets," Proc. of the Ninth Annual SAS Users Group Intl. Conference, Cary, NC: SAS Institute, Inc., 322-325.

SAS Institute (1985), SAS® User's Guide: Basics, Version 5 Edition, Cary, NC: SAS Institute Inc.

SAS Institute (1986), SUGI Supplemental Library User's Guide, Version 5 Edition, Cary, NC: SAS Institute Inc.

Sharlin, J. (1983), "Data Reduction and Summarization," Proc. of the Eighth Annual SAS Users Group Intl. Conference, Cary, NC: SAS Institute, Inc., 912-919.

ACKNOWLEDGEMENTS

The examples are based on research done at the UNC Highway Safety Research Center, Chapel Hill, NC 27514.

SAS is a registered trademark of SAS Institute Inc., Cary, NC 27511-8000, USA. IBM is a registered trademark of International Business Machines Corp.