Document 6493969
Transcription
Document 6493969
Building a Window to a Data Archive or How to Provide Access to Large Statistical Data Files Tom Boggess Cornell University Abstract the VAX/VMS~ operating system; a variety of VMS and Unix~ based MicroVAXs~; and the Cornell National Supercomputer Facility's IBM 4381 R14 and IBM 3090-600 with attached array proces- Researchers that use secondary data files, such as those produced by the Bureau of the Census, face the problems of obtaining information about the contents of such files and of formatting these diverse files for analytical evaluation. A data archive is charged with the responsibility to maintain and provide access to the data. An electronic information and data retrieval system would seem an ideal interface between a data archive and its users. sors. Consulting on the use of files in the CISER Data Archive and computing consulting is provided by CISER o 'Mainframe' system developments include: o eMS Utility Panels (ISPf'M) o o OASIS [Online Archive Statistical Information System] Phase I (Data Retrieval System) Phase II (Information Retrieval System) An information subsystem might allow users to obtain information about the databases by browsing or entering specific characteristics of interest, obtaining information about which data bases contain such information, inform the user about the specifics of each field or variable in the database, and indicate any problems with merging disparate databases. A data retrieval subsystem could let the user select the databases of interest, subset according to case and The Problems There is no one standard format for files distributed by governmental and other agencies. The data files are frequently organized in a hierarchical fashion with complicated coding schemes and difficult to decipher codebooks. Further, these data are rarely completely clean and must be checked for accuracy against printed and other sources. o Many faculty members depend on graduate students to perform computing work and focus on the results of that work. Training faculty and graduate students to access data and become computer literate is a major task. • Due to the eclectic nature of social science research, there is no way to predict which files, tables, and cases a researcher will find valuable. o field, merge 'common ease' subsets, obtain descrip- tive statistics on the subsets, and store the results in a variety of formats. A complete system would integrate the two subsystems. This presentation will outline completed work for Phase I (the data retrieval subsystem). Introduction and Background The Organization: Cornell Institute for Social & Economic Research KISER) There are over 400 social science faculty and research associates at Cornell o 300+ of these social scientists are CISER members. o CISER maintains a Data Archive of 7000 separate data files on over 600 reels of tape. o The CISER computing team supports use of the SAS'System and SPSS"'" on an IBM" 4381 as well as general use of IBM PCs and Apple" Macintoshs~. o CISER also provides proposal submission support with local (Cornell) peer review. o A Survey Research Facility fo, collection and entry of primary data is another CISER service. o OASIS Phase I Solution: Data Retrieval From the user's point of view, OASIS is simple: choose a database, select observations, keep variables, and process (including reformat and store) the data. This process is outlined in the figure below. Welcome Keep Variables The Social Science Computing Environment at Cornell o IBM 4381 Q13 (soon R14) running VM/SP CMS operating system with a minimum of half the CPU resources, access to several tape drives (allocated o dynamically), and 5.0 gigabytes (Gb) of diskspace. Other hardware available to social scientists on campus: IBM 3090-200 with CMS and MVS (batch only); a primarily instructional DEC® 8500 running 496 files. However, one major data management task is required: placing a standard merge variable, standard identifying codes, and descriptive names that are equivalent across files that have similar units of observation. An example for county-based files is to match county names and Federal Information Processing Standards (FIPS) codes. Each file tends to have differing descriptive names and PIPS codes vary over time, so some standards are set to allow for these differences. This system is written using REXX~ executable programs (EXECs) interfacing SAS System files currently stored only on magnetic disk. See section 'Software Selections' for a discussion the choice of using the SAS System and RExx. Value Added by the Data Archive Group The 'behind the scenes' picture is, as expected, more complex and requires many person hours per dataset. First, information about the data from the codebook or data dictionary is entered into a dBase III Plus™ file. Each record of this database includes a variable or 'flag" name, the position of the variable in the file, a SAS System forty character label, a label that can be as long as 255 characters (for use with the retrieval software), the record number and column position of the variable, the variable length, and the number of decimal places (if appropriate), and the year the variable covers. A user can access any installed dataset by entering 'OASIS' and after an introductory screen the current list of datasets is presented as shown in the figure below. The list is currently quite short since efforts have been primarily directed toward the user interface. There are some sixty files waiting to be installed. I. 2. Once checked for duplications and entry errors, a text file containing primarily INPUT and LABEL SAS System statements is created and again checked for accuracy. This file is then uploaded to the mainframe and edited to include relevant Job Control Language, Data Control Block (DCB) information (record length, block size, and record format), a constrained PROC PRINT, a PROC CONTENTS, and PROC MEANS. MVS is used since these files tend to be large and work space is more easily obtained in MVS. When the SAS System file is created, the archive tape information database is updated and the tape manager is notified. STF3<A) COSTftTI 3_ SEASCS) 4. ":1'. 6. 7. BEA25(S) FEtlSTC02 P(')Pt1IGa2 POPtH083 .-:----------+ +---If£LP un) CHOOSE II INfO n INfO n f'IlIMT cc:o..-.d1ilELPPF (ff2) ~El.P PFlOC£S:s <rfUI) QUIT (£IIIT 0tIS!S) RlIflING CORNE1...LA The follOwing figure presents the initial screen the user would use to select observations. Typically, a user of the system would require OASIS documentation that would outline the observations and variables available for each dataset. Phase II of this project will allow a user to browse the contents of the files by entering areas of interest and being presented with alternatives. This information system (sometimes called a meta data system) is still on the drawing board but would use classic computer science information retrieval system concepts such as those advanced by Gerard Salton and others'. There are five additional files that must be created and uploaded for use in the OASIS System: a variable description file including the variable name, length, type, decimal, and long label; an observation file that contains key information associated with the unit of observation (e.g., PIPS codes for county-level data); a help file that provides an overview of the types of variables included including flag information and references on where to obtain more information from printed sources; a help file that summarizes the number and types of cases (observations) to be found in the data; and an abstract of the file. These files are documented and stored on a personal computer, uploaded to the mainframe, and a computing team member is informed and given complete file information. You "'""'" ",I\osen ob"""""ti.,1\S of the "COlItITY ....,..... .-il.. 1I~INC2. PI ...... ind.",..,.. i~ I"l sele.,H"'I .... " err the opt.ions I isl .... bel ..... 1_ Z. 3. 4. S. IIL1. Ul&IMITIUNS flU. U.S. ClIUNTIH fILL N.Y. COUNTIES stH1fIIIY IWFllflHflTION PmTICULFIIl tIlllHTl£S C.......... loKU Value Added by the Computing Group +----I n Once the information and data files have been readied by the Data Archive group, the next step is to load these files into the OASIS system. Since OASIS primarily consists of EXECs that incorporate text and SAS System files, much of the loading process is editing EXECs to acknowledge the new ; ; ; HELl' (f'Fl. HELl' IlBS fIE11lJIN (fF4) cct.oone .... H .... n I (General Help on tIIoosinOj ~iOllS) • (Specific:: llelp .... 1lI>_.... <tt.iuns in tbis 1111\<01><1_) 1 (Het...... 'to f'M! .. i _ 1'101 .... ) ; +------------~--- 497 --------------------+ Screen Management The next figure presents the screen viewed by a user when selecting variables. Again, OASIS documentation would be required to fully utilize the system. However, the 'long labels' can be accessed from this screen using the DESCRIBE command. LIST DF IJAfIlflBLES FUI BSINU 1--FfPSTCl'l 2--1JlB1.E 3-l.1HECOIl£ 4-f1E(l1(lH >-"""""" .......... >--"""""" 0--~- 9-85WISO IB-85HfPISo;I 11--8SF159 12--1Sf'OP59 13--8S1"U'I59 14--El5lE59 1:1--Ei5f'CSI-59 l6-nflllll.J59 17-1l5N£lEi9 111--B5DIRW 1'J-1l5111f'S<J 28--IISWSIlS9 21--IIS()lU-59 :22-115PII159 2l-----r15F:i1l 24---1I5NF:l9 2S---lISFfIIlm9 216-1l5NFttS9 <>" UfIR 100l..£S ~ 2~f!lV59 <le-B'5F'EOO5') ,,......,.,.. 29"-B5H 1N59 41-BSHILI-59 42---tl5SUG59' 55----t!5DUI62 >8-8"'" 3fI-s=iCcrt!l-59 31-B5H~ 32-B5t1IlNDS9 ".....,..""". ~,,,. 3S-II5loIH59 36--85fIE:iO 3?--B5FIFlIE:I<J 38-85SEflV59 39--e5GOU1S9 REXX EXECs using TERMF commands were selected over ISPF (Interactive System Productivity Facility) panels due to the belief that when invoked, the SAS System and ISPF would fracture memory too severely in the three megabyte virtual machine the user has available. In fact, many conflicts occurred between the EXECs and running of the SAS System and use of ISPF is being reevaluated with machines allocated larger virtual memory. Another 43-8S1P162 44--8SHFP162 4:r-fI5F 162 - ..,,---.... ,,~ 41-B5PU'162 4'J~lti2 54---1r513D&2 -56--1r.i1'll162 51--1!:5F62 5B--II5HF62 59--1l5FfIIltt62 6t!-1i5NF1'162 61--1l5PIIIU62 62--1l5f11!62 63--El5H 1N62 ,,~ 64-115CClNS62 Sz-..IISI)lf162 6:!i--Il~ solution being considered is to build a service vir- tual machine that would invoke the SAS System and return results to the EXEC or panel that did the call. SAS / AF'" was not used due to nonexistent support atComell. ------+ KEU' (I"FO I I:EEf' ••••• , ... ( ..... , I fILL I (lONE (PF18) I fi£LP <~..n ; IlESCfll1i£ •••• __ .(.. _U ; $UlEDI ; ; J.I£l.I> 'I.ftIS ; WHEIII: _ _ (C,'."') _I .... I ; ; (Uft f'f7,fF8,+,-, .... Df1Dl t.o _"" .. II .... I~I. list) ; --------+ COHMAHD _> Descriptive Statistics RJliNING Finally, after naming the subset, the SAS System is invoked and a subset of the data is created. At this point the user can subset another dataset available to the system or perform a variety of tasks on the dataset just created. These tasks include reformatting the data into SPSS', comma delimited, tab delimited, column aligned, etc.; storing the data on mainframe disk, sending to another user, or downloading the data to a personal computer (using special terminal emulation software); merging the data with data that have similar units of observation; and so on. OASIS uses PROC UNIVARIATE PLOT, PROC FREQ (for non-continuous variables), and PROC MEANS as descriptive statistics for the user. These are fairly straight forward procedures and no complicated options are necessary. It was felt that these procedures provided enough of the basic information needed by the user. Within limits, requests for other descriptive statistics procedures will be honored. Of course, once created and stored, the user can run any desired procedure independent of OASIS. Managing the Process OA$IS sw.s..b t:r.ated Teq> FII _ _ I. SEASIHe ~t • FI1.ty~ ~ Oc:Itobo_ Obs SSI B51HC2 8ZlEI1P2 2. BEA25El1P SS2 F 6 fInf' (PFI) lIEU' c~~ I'I£lWtE .. Hardware Resource Management Size U_ ForlllClt (Hb) 66 31 SAS 9.92 f>2 38 SAS 8.93 StQr~ SAS tlo"g Stat LIS Disk space is needed for both storage of the data and for work space. One potential solution to the central storage concern is to use optical disk technology. CISER is evaluating an optical disk device by Data/Ware®that will allow hundreds of gigabytes of data to be stored in IBM 3380 tape format. Though this is not a solution for work space, centrally stored SAS System files will be relatively quickly accessible as compared to tape mounts and much larger datasets can be made accessible. "" "" "" I HEIIG£ •••••• ("-") I SRIII£ • c$fISlllG. I I S1II1S .. I FIETlIRt ',..4) I ; oorc;: (WIf)} ; : alMRT. ------- --- RUtttU HG COfltELLR Work space is an ongoing problem. It is not expected that a large number of users will access OASIS simultaneously, however the system depends on an adequate supply of temporary disk space made available for any user request. The amount of temporary space defined is determined by the size of the parent database selected. This is done so the user can be told immediately if there is inadequate work space to continue. Another approach is to wait until the actual subsetting request is made and then estimate the amount of space needed to complete the task. H no space is available at this point, it is more disappointing to the user who must make Software Selections Database Management The SAS System was selected as a database manager due to price; its ability to handle data stored on disk or tape; the ability to use the SAS System in CMS (both semi-interactive and 'batch') and MVS; the savings in disk space that a SAS System file requires as compared to a more traditional database management system file; familiarity by inhouse programmers; its user population popularity; and SAS System versatility as both a database manager and statistical package. another request at another time. For certain very 498 From Concept to Product large datasets (50 megabytes or larger) pre-formatted, non-temporary disks are kept ready to handle the data requests without adversely effecting the temporary disk pool. Of course, this means that these disks generally sit idle. Still, the savings of disk space is substantial when compared to conventional database managers. There are three basic stages in the completion of a project of this kind: design, testing, and promotion. One of the first design questions that needs to be addressed is 'For whom will this software serve?' This question is not answered in a day or two with Another real concern is memory management. .Three megabytes of virtual memory becomes full and well fractured when one defines, links, and accesses multiple disks, opens tens of EXEC files and invokes the SAS System. If virtual machine storage limits were increased to four megabytes, system problems could occur since there seems to be a minimum of eighty virtual machines attached and a high load day can see as many as 200 active virtual machines. One solution is to upgrade the CPU and . increase main memory and paging. CISER and central computing services is considering just such a plan. input from a few. Generally, the answers to basic design questions need to evolve over time. For ClSER, the basic use of the system is by faculty, research associates, and their assistants. However, since research could include Cooperative Extension faculty a dilemma was created. Would this be a system to provide well defined descriptive information that could be made available for county extension agents? Or would it be better if the system offered 'complete" files with no judgement as to which table or variable needed to be included? It was agreed that OASIS incorporate complete files, but it certainly would have changed the nature of the software had the alternative been chosen. Perhaps more importantly, the design of the system Personnel may have floundered without a clear answer to Coordinating the efforts of six FIEs across two 'departments' with a faculty oversight committee is a difficult task. Fortunately, CISER remains small enough to maintain a family-like atmosphere of these questions. Another basic design element was to limit the scope of the project. What OASIS was not designed to do is as important as the tasks it is designed to accomplish. For the initial project, only county based data files stored on magnetic disk in SAS System format were used. Future OASIS Phase I projects will evaluate the inclusion of hierarchical files (e.g., person records nested in family records nested in household records), files stored in tape format, processing the requests on separate processors, merging on selected variables, using expanded WHERE options, and so on. cooperation and communication. At the outset the project, there was more discussion than production but finally a prototype system was developed and has become the OASIS system. Achievements must be recognized and celebrated. It has been important for workers on this project to have clearly defined tasks that have reasonable completion dates. To advance Phase I of the OASIS system from its current prototype status to a mature system will require moving the system to ISPF panels, evaluating other database management systems, and incorporating enhancements. An additional systems programmer will be required for this effort. Designing a system is more a consort than a solo with input from a number of sources on desires and system limitations. It requires a high level of under- The building of OASIS Phase II (the information management subsystem) will require continued evaluation of this type of system and the creation of intelligent front end software to aid users in selecting information from available data. This project is being directed from the data archive group with input from the computing team. standing of the task at hand, system possibilities and limitations, software knowledge, database management techniques, knowledge of the data, and an understanding of what users need and will use. One of the more challenging aspects of designing a mainframe-based system is that today's university researcher is often more accustomed to what a personal computer can do and is generally not enam- Another often overlooked activity is documentation. For the Phase I prototype, the computing manager produced an introduction and tutorial while the data archive provided information about each data file including 10ng label' descriptions of ored with the complications of a mainframe operating system. Thus, the system needs to address that user knowledge base and be logical, comprehensible, and consistent. The words on the screen must convey the function they perform. [More than one long discussion has been generated on the relative merits of 'QUIT', 'STOP', 'EXIT', and 'DONE'.J Finally, the design must be forward-looking. It must allow for changes in the hardware and software and variables, a numeric list of the observations from which to choose, and an abstract. It is estimated that a half-time person will be needed to produce more complete documentation on the use of the system and the data files it contains. 499 SAS, and SAS/ AF are registered trademarks of SAS Institute, Inc., Cary, NC, USA IBM is a registered trademark of International Busi- be flexible enough to adapt to major changes. For example, it is expected that the interactive portion of this system may reside on a personal computer with calls to the larger processor(s) (via a sophisticated network) to perform the subsetting, merging and file transfer. This will need to be a seamless operation. If the system is not carefully designed and properly fragmented, it could prove impossible to alter the system without a complete overhaul of the software. ness Machines Corporation. ISPF and REXX are trademarks of International Business Machines Corporation. Apple is a registered trademark of Apple Computer, Inc. Macintosh is a trademark of Apple Computer, Inc. DEC is a registered trademark of Digital Equipment Corporation. VAX/VMS and MicroVAX are trademarks of Digital Equipment Corporation. SPSS'is a trademark of SPSS, Inc. Unix is a trademark of Bell Laboratories. dBase III Plus is a trademark of Ashton-Tate. Data/Ware is a registered trademark of Data/Ware Development, Inc. Testing the system is not as straight forward as it may at first seem. Of course. . it is important to find programming 'bugs', but logical inconsistencies and an awkward flow or command name may be just as damaging to the success of the software. Therefore, it is as important for a group of users as well as programmers to test the system. Finding a willing set of faculty researchers is not always an easy task since research agenda and testing of software rarely jive. For Additional Information Contact: Promotion of the completed system is as important as its creation. It is important that all concerned in the creation of the software believe in it and are willing and able to use it themselves. Promotion of the system must include workshops, on-going formal presentations, and quick on-the-fly demonstra- Tom Boggess CISER, 382 Uris Hall Cornell UniverSity Ithaca, New York 14853 tions. The reason the OASIS system was created was to reduce the time it took a researcher to produce a clean, workable dataset. CISER staff had noticed that it often took months of working with a complicated data file to get it into a format against which analytical procedures could be run. If success is achieved, users should immediately recognize the value of the software and want to use it. It is still too early to know whether researchers will embrace this system, but when shown to a set of social science institute personnel from universities across the country, the reaction was most favorable. 1 2 A flag is information about variables that relate to suppression of the data due to privacy concerns. . refusals to answer, and so on. Gerard Salton, Dynamic Infonnation and Library Proc- eSSing, Prentice-Hall, Inc., Englewood Cliffs, N.L 1975. 3 Files in OASIS contain all variables (tables) but not necessarily all observations. 500