Datorzinatne un p informacijas tehnologijas
Transcription
Datorzinatne un p informacijas tehnologijas
Datorzinatne un informacijas tehnologijas p Computer Science and Information Technologies _669 ISSN 1407-2157 LATVIJAS UNIVERSITATES RAKSTI Datorzinatne un informacijas tehnologijas SCIENTIFIC PAPERS UNIVERSITY OF LATVIA Computer Science and Information Technologies SCIENTIFIC PAPERS UNIVERSITY OF LATVIA Computer Science and Information Technologies Automation of Information Processing LATVIJAS LNIVERSITATE LATVIJAS UNIVERSITATES RAKSTI Datorzinatne un informacijas tehnologijas Informacijas apstrades automatizacija I'DK 004 Da 814 Editor-in-Chief prof. Rusins-Martins Freivalds. University of Latvia. Latvia Deputy Editors-in-Chief: as. prof. Janis Cirulis. L'nivcrsity of Latvia. Latvia prof. Audris Kalnins, L'nivcrsity of Latvia. Latvia Members: prof. Mikhail Auguston. Software Engineering Naval Postgraduate School, L'SA prof. Janis Barzdins. L'nivcrsity of Latvia prof. Janis Biccvskis. L'nivcrsity of Latvia as. prof. Juris Borzovs. University of Latvia prof. Janis Bubenko, Royal Institute of Technology. Sweden prof. Albertas Caplinskas. Institute of Mathematics and Informatics. Lithuania prof. Janis Grundspenkis. Riga Technical University. Latvia prof. Hele-Mai Haav. Tallinn Technical University, Estonia prof. Ahto Kalja. Tallinn Technical University. Estonia prof. Jaan Penjam. Tallinn Technical University. Estonia Litcrarais redaktors Imants Mezaraups Visi krajuma ievictotie raksti ir recenzeti. Parpubliccsanas gadljuma nepieciesama Latvijas Lniversitatcs atjauja Citcjot atsaucc uz izde\umu obligata O Latvijas L'niversitiile. 2004 O Apgads "Rasa ABC". SIA. datorsalikums. 2004 ISSN 1407-2157 ISBN 9984-770-04-4 Contents Baiba Ap'me. Software Development Effort Estimation Guntis Amicans. 7 Description of Semantics and Code Generation Possibilities for a Multi-Language Interpreter 13 Janis Benefelds. Data Staging in the Data Warehouse 34 Edgars Celms. Generic Data Representation by Table in Mctamodel Based Modelling Tool Kadis Fnivalds. 53 Nondifferentiable Optimization Based Algorithm for Graph Ratio-Cut Partitioning 61 Martins Gills. Review of Traccabilitv Models for Software Testing Processes SO Mara Culhe. Anus Gulbis. Benchmarking Problems of Topic Telecommunications and Access in Latvia Janis Iljins. Design Interpretation Principles in Development and 89 Usage of Informative Systems 99 Girts Linde. Semantics and Equivalence of UML Class Diagrams 107 Laila Niedrite. Requirements and Options for Data Warehouses at Universities 117 Preface This v o l u m e is the first o n e in the n e w series of A c t a Universitatis Latviensis in Computer Science and Information Technologies. R e s e a r c h in c o m p u t e r science and software engineering has been d o n e at the University of Latvia since 1960s, and there are hundreds of publications by c o m p u t e r scientists of the U n i v e r s i t y of Latvia in various scientific journals and conference p r o c e e d i n g s w o r l d w i d e . H o w e v e r , there have been only a few attempts to publish collections of the University c o m p u t e r science research papers in the Acta - there w e r e three v o l u m e s in the m i d 1970s in Russian. T h e first v o l u m e in the n e w series - Automation of Information Processing contains recent results of y o u n g r e s e a r c h e r s , most of t h e m d o c t o r a l students at the University of Latvia. T h o u g h the topics of the papers a r e quite different, they are all centered around the p r o b l e m of p r o v i d i n g theory, m e t h o d o l o g y , d e v e l o p m e n t tools and supporting e n v i r o n m e n t for t h e d e v e l o p m e n t of i n f o r m a t i o n s y s t e m s . All the papers in the volume arc related to the most up-to-date issues in the r e s p e c t i v e area. Theoretical p r o b l e m s are d i s c u s s e d in papers by Girts L i n d e and Karlis Freivalds both of them papers of high scientific quality. T h e p a p e r by L i n d e is devoted to the formalization of semantics of class d i a g r a m s - t h e m a i n U M L notation for system design and to formalization of class d i a g r a m e q u i v a l e n c e . Freivalds" paper provides new results and an efficient heuristic algorithm for a classical o p t i m i z a t i o n p r o b l e m in graph theory - finding the m i n i m u m ratio-cut. Several papers are d e v o t e d to generic m o d e l i n g a n d d e v e l o p m e n t tools for information systems. T h e Paper by E d g a r s C e l m s discusses an i m p o r t a n t aspect of metamodel based m o d e l i n g tools - g e n e r i c facilities for defining tables. T h e p a p e r by Guntis Arnicans is devoted to a g e n e r i c language interpreter i m p l e m e n t i n g a n e w m e t h o d for defining language semantics. T h e paper by Janis Iljins d i s c u s s e s an experience in using a generic development e n v i r o n m e n t - the IS Technology, w h e r e t h e design specification can be directly interpreted. T h e papers by Janis B e n e f e l d s and Laila Niedrite c o n s i d e r various aspects of data w a r e h o u s e s - an overview of d a t a staging and an e x p e r i e n c e of application to the education domain. T w o papers discuss the software d e v e l o p m e n t m a n a g e m e n t p r o b l e m s - Baiba Apine the issues of various software d e v e l o p m e n t estimation m o d e l s a n d Martins Gills - the traceability issues in software testing. Finally, the paper by M a r a G u l b e and A m i s G u l b i s c o n s i d e r the availability of IT services in Latvia. All the papers in the v o l u m e have been r e v i e w e d by s e v e r a l m e m b e r s of the international Editorial Board of the series. Prof Audris Kalnins, Deputy Editor-in-Chief University of Lat\ ia LATVIJAS L'NIVERSITATES RAKSTI. 201)4. 6 6 y . sej.: D A T O R Z I N A T N E UN INFORMACIJAS TEHNOLOGIJAS. 7.-12. [pp. Software Development Effort Estimation Baiba Apine PricewaterhouseCoopers baiba.apine@lv.pwc.com Formal analytical and analogy-based software development cost estimation models are analysed to find out what factors are considered to have an influence on the development process in each of the models, and to provide recommendations for the application of those models. Key words: estimation, software development cost, effor estimation models. Introduction Software d e v e l o p m e n t effort estimation is a rather old, btit still relevant, problem. Since the middle of the last century, w h e n the very first formal software development effort and schedule estimation m o d e l s appeared, the software d e v e l o p m e n t process, the subject of these e s t i m a t e s , has c h a n g e d dramatically. For instance. [ E X P 0 2 ] highlights that productivity of the software d e v e l o p m e n t process increases by 1 0 % annually due to the progress of software d e v e l o p m e n t technologies. T h e c o n t i n u o u s i m p r o v e m e n t of the d e v e l o p m e n t process, and different factors which influence the process productivity and must be taken into account, create a n i g h t m a r e for estimators. T h e following e x p e r i m e n t w a s carried out during classes on software cost estimation. A group of 2 0 0 students w a s asked to estimate the effort and schedule for the d e v e l o p m e n t of information system (IS) storing data about software development projects ( n a m e , d e v e l o p m e n t e n v i r o n m e n t , start date, expected end date etc.). d e v e l o p e r s and c u s t o m e r s ( n a m e s , skills, office hours, phone n u m b e r s etc.) and d o c u m e n t s produced d u r i n g the project Iifecycle ( n a m e , c o m m e n t s , author). All the students had the s a m e input information about the IS. T h e results for this rather small project differed significantly: starting from 10 m a n - d a y s (2 weeks s c h e d u l e ) to 12 manm o n t h s (1 year schedule) with an a v e r a g e of 3 m a n - m o n t h s (schedule 3.5 months). The students t h e m s e l v e s were very surprised about the differences in estimates. If this is acceptable with students in the c l a s s r o o m , in real-life such a situation might be rather painful and must be discussed in order to solve the p r o b l e m and find consensus. Here formal cost estimation m e t h o d s could help: 1. 2. To provide a disciplined way of thinking during the estimation process, which was not available for students. To provide a subject, subjects for discussions for all involved parties. Formal software d e v e l o p m e n t cost estimation models are analysed to find out what factors are considered to have an influence on the d e v e l o p m e n t process in each o f the m o d e l s , and to provide r e c o m m e n d a t i o n s for the application (if those m o d e l s Software Development Effort Estimation Models T o be honest, not always are the formal cost estimation models r e c o m m e n d e d for effort and schedule estimation. [ C O C 2 1 . [ K E M 9 3 ] do not r e c o m m e n d formal effort 8 DATORZINATNE UN INFORMACIJAS TEHNOLOGIJAS estimation models for small projects (less than 2 K lines of c o d e [ B O E 9 1 ] or less than 10 d e v e l o p e r s w o r k for 3-6 m o n t h s [ Y O U 9 7 ] ) . For small projects the d e v e l o p m e n t productivity d e p e n d s on t h e individuals very m u c h , h e n c e the right w a y of e s t i m a t i n g is asking t h e d e v e l o p e r for an estimate. Formal m e t h o d s are w e l c o m e for a v e r a g e ( 2 0 - 3 0 d e v e l o p e r s ' w o r k for 1-2 years) and large ( 1 0 0 - 3 0 0 e m p l o y e e s ' w o r k for 3-5 years) projects, b e c a u s e : 1. 2. Individual experience of the estimator is limited, as even a very experienced project manager has worked for 5-7 average and 3-5 large projects during his/her career. Typically there are more than two stakeholders in average and large projects, therefore it is very crucial to have a documented, step by step estimation process as a subject for discussions. Different analytical and a n a l o g y - b a s e d software cost estimation m o d e l s d e v e l o p e d . Analytical m o d e l s are based on a n e g a t i v e e x p o n e n t i a l curve called R a y l e i g h - N o r d e n m a n p o w e r a p p r o x i m a t i o n c u r v e (see F i g u r e 1). Functionality of software product, usualy e x p r e s s e d in function points or lines of code, is used as input for analytical models. are the the an M a n p o w e r distribution for o n e real-life software d e v e l o p m e n t project is s h o w n in Figure 1. This illustrates a rather typical fault in project m a n a g e m e n t - as t h e d e v e l o p m e n t process s e e m s to be stable (see S E P _ 9 8 to A P R _ 9 9 ) , s o m e d e v e l o p e r s might be taken away from the project (see M A Y _ 9 9 ) . This yields an increasing response time to c u s t o m e r queries and l o w e r s c u s t o m e r satisfaction. Additional d e v e l o p e r s must be added in the project urgently (see J U N _ 9 9 ) . Otherwise the real n e w software d e v e l o p m e n t life cycle a p p r o x i m a t e s the theoretical o n e quite closely. A n a l o g y based models use information about past projects. T h e estimation p r o c e s s basically m e a n s database b r o w s i n g to find similar projects. T h e success of estimation using a n a l o g y - b a s e d m o d e l s d e p e n d s on h o w successfully the criteria for finding similar projects from past experience are found. Figure 1. Rayleigh-Noren manpower approximation curve for new software development process (optimal) in comparison with the real one Bciiha Apine. Software Development Effort Estimation 9 Samples of Analytical Models S L I M (Software Lifecycle M o d e l ) w a s d e v e l o p e d in the middle of the 1970's [K.EM87] and currently is m a i n t a i n e d by Q S M (Quantitative Software M a n a g e m e n t ) [ Q S M 0 2 ] . T h e M o d e l w a s developed using data about 5000 software d e v e l o p m e n t projects. T h e S L I M m o d e l uses a productivity p a r a m e t e r and productivity index, which characterise the d e v e l o p m e n t e n v i r o n m e n t (tools used, developers' skills etc.). In c o m p a r i s o n to other analytical models, in the SLIM m o d e l a schedule must be set before the effort is estimated. H o w e v e r this is not a d i s a d v a n t a g e of the model, because in real life very often the s c h e d u l e is already predefined. T h e Jensen m o d e l [JEN84] estimates the d e v e l o p m e n t effort and s c h e d u l e based on the size of the product, technology index and 11 d e v e l o p m e n t e n v i r o n m e n t factors, which characterize the d e v e l o p m e n t t e c h n o l o g y . A Current version of the Jensen model is i m p l e m e n t e d in the tool S A G E [ J E N 9 5 ] . C O C O M O II [ C O C 2 ] estimates the d e v e l o p m e n t effort and schedule based on the size of the product, 5 scaling drivers and 6 adjustment factors, which describe the product's quality and reliability r e q u i r e m e n t s , t e c h n o l o g i e s used, d e v e l o p e r s ' skills etc At the end of last year a survey about the m o s t popular d e v e l o p m e n t effort estimation m e t h o d w a s carried out by [ISB01] in more than 100 software d e v e l o p m e n t organizations of varied sizes [ C L T 0 3 ] . M o r e than a quarter of the respondents (27%) use various analytical m o d e l s . A surprisingly low n u m b e r (9%) report the use of C O C O M O , which w a s considered by m a n y to be a g r o u n d b r e a k i n g t e c h n i q u e in the 1980s. O n e possible explanation for this is that the C O C O M O applications are often c u m b e r s o m e and require the use of other tools and t e c h n i q u e s to generate input to C O C O M O , such as lines of code estimators. Samples of Analogy Based Models T h e C h e c k p o i n t m o d e l is based on information about 8000 software d e v e l o p m e n t projects. T h i s model is property of S P R (Software Productivity Research) [ J O N 9 6 ] . The C h e c k p o i n t model uses a survey containing multiple-choice questions about the d e v e l o p m e n t process and the product must be d e v e l o p e d . Information from the survey then is used to adjust the productivity o f the d e v e l o p m e n t process. T h e s a m e principle as for the C h e c k p o i n t model is used in the E x p e r i e n c e P r o model and tool, w h i c h is property of Software T e c h n o l o g y the Transfer [ E X P 0 2 ] . This model uses information about approximately 500 past projects. All the users of this model arc j o i n e d in F i S M A [FIS03] and e n c o u r a g e d to collect information a b o u t projects to update the E x p e r i e n c e P r o model continously. A c c o r d i n g to t h e survey by [ I S B 0 1 ] , most organizations (65%) collect historical data and u s e it to h e l p them estimate software d e v e l o p m e n t . 2 6 % report that they do so diligently, by m a i n t a i n i n g a central database that formally receives and stores data from every project. 10 DATORZINATNE UN INFORMACIJAS TEHNOLOGIJAS Conclusions Analytical models as well as a n a l o g y b a s e d o n e s use data about the software d e v e l o p m e n t process ( m e a s u r e m e n t s ) , consider different risks a n d their influence on the development process. A Set of risks is predefined and d e p e n d s on the model. T a b l e 1 shows software d e v e l o p m e n t risks found out in the s u r v e y of risk m a n a g e m e n t [ A P I 0 2 ] . There are risks which are considered in each cost e s t i m a t i o n m o d e l , for instance, an unrealistic project schedule and budget or lack of k n o w l e d g e in software d e v e l o p m e n t technologies and e n v i r o n m e n t . At the same time there are risks which are important and influence the d e v e l o p m e n t s c h e d u l e and effort, but are not c o n s i d e r e d by any of the estimation models analysed. For instance, software d e v e l o p m e n t e n v i r o n m e n t bugs or change of qualified p e r s o n n e l . Table 1. Risks considered in different software cost estimation models (+ - the model considers that) S. C/) Checkpoint Software development influencing factor '-> '_j 53 •j 2 3 3 3 Lack of hardware on the customer side Difficult communication with customer Low quality of software requirements Unstable software requirements L'nrealistic schedules and budgets Weak project management Software development environment buss Lack of developer motivation Lack of hardware on developers side Change of qualified personnel Lack of knowledge in software development technologies and environment -- - - - - - - - - - - - M o d e l s concentrate on effort and s c h e d u l e e s t i m a t i o n for new software development [PLT92], [JEN84], [JON96], [ C O C 2 ] . [EXP02], [PAR95], [TAU81]. There are add-ons d e v e l o p e d for s o m e m o d e l s to a p p l y t h e m to more specific projects like software m a i n t e n a n c e [ C O C 2 ] , [ E X P 0 2 ] . T a k i n g into account that m o d e l s rely on development process m e a s u r e m e n t s , they are b e c o m i n g out-of-date continuously as technologies develop. F o r instance, software d e v e l o p m e n t productivity increases by 10% annually. H e n c e using a year old cost estimation m o d e l , it overestimates the effort by 10% even if we exactly h o w k n o w to use the m o d e l . T o solve this p r o b l e m , authors continuously gather data about projects to m a i n t a i n their m o d e l s [ Q S M 0 2 ] , [ J O N 9 6 ] . This problem could not be solved for analytical m o d e l s , but for analogy based ones it is not so actual if the project base is updated with information about new projects continuously, like it is d o n e for [ E X P 0 2 ] . Bulba Apine. Software Development Effort EMimation u T h e r e are a d d - o n s for d e v e l o p m e n t process operational p l a n n i n g developed for s o m e of the m o d e l s , w h i c h are based on the R a y l e i g h - N o r d e n m a n p o w e r curve [ P U T 9 2 ] . There are no such a d d - o n s for analogy based m o d e l s , b e c a u s e data gathering for such models is m u c h more c o m p l e x than approximation of the d e v e l o p m e n t process using analytical m e t h o d s . T h e d e v e l o p m e n t of such a d d - o n requires integration of the m e a s u r e m e n t and risk m a n a g e m e n t processes. W h a t to do? T o estimate the software product d e v e l o p m e n t effort and s c h e d u l e the following steps must be carried out. 1. Perform a risk analysis for the particular project to find out what factors will influence the development of the product. 2. Choose the development effort estimation model which takes into account all the risks pertaining to the particular project. Such an a p p r o a c h is not feasible due to the following reasons: 1. There is no knowledge available about different models. Experience shows that most companies have knowledge about one or two effort estimation models. 2. Application of most of the models requires acquisition of specific, rather expensive tools (several thousand EUR for licence and approximately a thousand EUR for annual maintenance of the model). Therefore one formal model must be chosen for u s a g e in the c o m p a n y . There are s o m e sources which r e c c o m e n d usage of two formal m o d e l s for estimation in parallel [ I S B 0 1 ] , [ J O N 9 6 ] . T h e second model is to verify the results given by the first one. Instead of the s e c o n d formal m o d e l , d e v e l o p m e n t process m e a s u r e m e n t s could be applied to adjust the formal results. Developers and m a n a g e r s must be trained to use the c h o s e n formal model to avoid painful estimation mistakes. T h e d e v e l o p m e n t process must be measured in order to use the gathered information for analysis of the results obtained from application of the formal model. Results must be c o m m u n i c a t e d a m o n g the developers. References [EXP02] "Estimation and Measurement Methodology of Experience Pro 3.0". [Electronical resource].-Access: WWW. URL: http^Avww.sttffi/html.-Resource described 12th of June, 2002. [COC2] C.Abts, B.Boehm, B.Clark. S.Devnani-Chulani. "COCOMO II Model Definition Manual". University of Southern California, 2000, 68 p. [KEM93]Chris F. Kemerer. "Empirical studies of assumptions that underlie software cost estimation". Information and Softw. Techno!., Vol.34 =4, 1992, pp. 21 1-218 [BOE91] Barry-' W. Boehm. "Software Risk Management: Principles and Practices". IEEE Software, January, 1991, pp. 32-41. •YOU97] Yourdon. E. "Death March. The complete Software Developer's Guide to Surviving 'Mission Impossible' Projects". Prentice Hall, 1997. 2 I 8 p. [K.EM87] Chris F. Kemerer. "An Empirical Validation of Software Cost Estimation". ACM Communication. May, 1987. pp. 416-429 12 DATORZINATNE UN INFORMACIJAS TEHNOLOGIJAS [QSM02] "SLIM-Estimate 5.0". [Electronical resource].-Access: WWW. URL: http:/ 'www.qsm.com-'slim estimate.html.-Resource described 3rd of January, 2003. [JENS4] Randall W.Jensen. "A Comparision of the Jensen and COCOMO Schedule and Cost Estimation Model". Proceedings of the International Society of Parametric Analysis, 1984, pp. 96-106. [JEN95] Randall W.Jensen. "A New Perspective in Software Schedule and Cost Estimation". [Electronical resource].-Access: WWW. URL: http://www.seisage.com.'publications.htm -Resource described: 17th of February, 2003. [ISB01] International Software Benchmarking Standards Group. "Practical Project Estimation" 2001. 46 p. [CUT03] "The Guru Method Prevails". [Electronical resource].-Access: WWW. URL: http:/'www.cutter.com 'benchmark, index.html.-Resource described 12th of February. 2003. [JON96] Caper Jones. "Applied Software Metrics". McGraw Hill, 1996, 324 p. [F1S03]"Finnish Software Metrics Association". [Electronical resource],-Access: WWW. URL: http:/Avww.fisma-network.org.-Resource described 12th of February, 2003. [API02] Baiba Apine. "Software Development Risk Management Survey", accepted paper for I 1th International Conference on Information Systems Development, Riga. Latvia. September 12 14, 2002. http://www.cs.rtu.Ivisd2002 accepted.asp.-Resource described:2003.g. 3.jan. [PUT92] Lawrence H. Putnam, Ware Myers. "Measures for Excellence". Yourdon Press Computing Series, 1992 TAR95] Naval Sea Systems Command . "Parametric Cost Estimating Handbook". [Electronical resource].-Access: WWW. URL: http:,Vw~ww.jsc.nasa.gov/bu2/PCEHHTML/pceh.htm.Resource described: 2003. g. 17.fob. [TAUSI] Robert C. Tausworthe. "Deep Space Network Software^Cost Estimation Model". Jet Propulsion Laboratory Publication 8 1-7. Pasadena. C.A., April, 1981 L A T V I J A S U N I V E R S I T A T E S KAKSTI. 2004. 669. s g . : D A T O R Z I N A T N E I'N INFORMACIJAS : : . i : \ o i . o i i . : . \ s . 1 3 . - 3 3 . [pp. Description of Semantics and Code Generation Possibilities for a Multi-Language Interpreter Guntis Arnicans Faculty o f Physics and M a t h e m a t i c s . University of Latvia Raina B l v d . 19. Riga L V - 1 5 8 6 . Latvia, garnican@lanet.lv In this paper we describe the definition of semantics for a Multi-Language interpreter (MLI), which provides the execution of the given program, receiving and exploiting corresponding language syntax and the desired semantics. We analyze the simplest solution the MLI receives the language syntax and the semantics descriptions, which have already been compiled to executable objects. Semantics is defined as a composition from several semantic aspects, considering the pragmatics of a language. Semantic aspects are translated to semantic functions by composing descnptions of the aspects. A traversing program's intermediate representation and the calling out of semantic functions similarly to the principle of the Visitor pattern perfonn the desired semantics. To simplify the semantic descriptions, we use abstract components that are joined by connectors at the meta-levcl. The implementation of these components and connectors can be very different. Examples of conventional and specific semantics are given for the simple imperative language in tins paper. Key words: interpreter, programming language specifications, tool generation. 1. Introduction T h e number of n e w languages that are related to the IT sector has increased rapidly over t h e last several y e a r s ( p r o g r a m m i n g languages and data description l a n g u a g e s , for e x a m p l e ) . P r o b l e m s associated with the implementation a n d use of these languages has also e x p a n d e d , of c o u r s e . Kinnersley [Kin95] has reported that there were 2,00(1 languages in 1995. w h i c h were being put to serious use. Even back then specialists found that the n e w languages w e r e mostly to be classified as domain-specific languages. Most o f them are not e a s y to i m p l e m e n t and maintain [ I T S E 9 9 , DKVOO ( D S L analysis, p r o b l e m s and an annotated b i b l i o g r a p h y ) ] . It is also true that we need not j u s t a c o m p i l e r or an interpreter, but also a n u m b e r of supportive tools. Questions of p r o g r a m m i n g quality a r e very important today, and t h e s e questions often cannot be a n s w e r e d without specialized and a u t o m a t e d ancillary resources. C o m p u t e r s are b e i n g used with increasing d y n a m i s m today: s y s t e m s have been d i v i d e d up in t e r m s o f time and space, the operational e n v i r o n m e n t is h e t e r o g e n e o u s , and w e have to e n s u r e the implementation o f parallel processes w h i l e organizing cooperation a m o n g c o m p o n e n t s and systems, a d a p t i n g to c h a n g i n g circumstances w i t h o u t interrupting o u r work. etc. We are m a k i n g increasing use of interpreters or of code generation and c o m p i l a t i o n just in time. T h e formal resources that are used to describe the s e m a n t i c s of a language, however, cannot fully satisfy our needs in the m o d e r n age, and they a r e starting to lose their positions [ S c h 9 7 , L o u 9 7 , Paa95], T h e basic p r o b l e m that is associated with the formal specifications of p r o g r a m m i n g languages is that these specifications are far too c o m p l e x . It is not clear how they are administered, w e c a n n o t use t h e m to explain all of our practical needs, and in the end we are still faced with a p r o b l e m - who can prove that these c o m p l e x specifications are really correct? T h e literature c l a i m s that the best c o m m e r c i a l compilers (interpreters or other language-based tools) are written without formalism or are used only in the first 14 DATORZINATNE L'N INFORMACIJAS TEHNOLOGIJAS phases - s c a n n i n g and parsing [e.g. L o u 9 7 ] . F o r m a l i s m s are elaborated and used mostly for research purposes in educational and scientific institutions at this time. T h e d e v e l o p m e n t of s e m a n t i c s is gradually m o v i n g a w a y from the d e v e l o p m e n t of languages and tools. O n e w a y to o v e r c o m e this gap is to take a tool-oriented approach to semantics, m a k i n g the definitions of s e m a n t i c s far m o r e useful and p r o d u c t i v e in practice and generating as m a n y l a n g u a g e - b a s e d tools as possible from t h e m [HKOO]. We support this a p p r o a c h in principle, but our aim is to p r o p o s e a different approach toward the definition of s e m a n t i c s , m a k i n g r o o m for far less formal records. Those w h o prepared descriptions of s e m a n t i c s in the past have long since been looking for w a y s in which semantics can be divided up into reusable c o m p o n e n t s , and it is not yet clear whether the formal or the partly formal m e t h o d o l o g y is the best in this case. We c h o s e a less formal and m o r e free form o f description k e e p i n g from the theoretical perspective, and o u r empirical research s h o w e d that rank-and-file developers of tools understand this m e t h o d far m o r e easily. 2. T h e concept of a Multi-Language Interpreter T h e concept of a multi-language interpreter w a s i n t r o d u c e d in [ A A B 9 6 ] . A MultiL a n g u a g e Interpreter ( M L I ) is a program w h i c h receives s o u r c e language syntax, source language s e m a n t i c s and a program written in the s o u r c e l a n g u a g e , then performs the operations on the basis of the program and t h e relevant s e m a n t i c s . C o n c e p t u a l l y , we parse an input token stream, build a parse tree a n d then t r a v e r s e the tree as needed so as to evaluate the semantic functions that are a s s o c i a t e d with the parse tree nodes. O n c e an explicit parse tree is available, we visit the nodes in s o m e order and call out an appropriate function. This approach is similar to the p r i n c i p l e build a tree, save a parse and traverse it [Cla99] and to a Visitor p a t t e r n [ G H J V 9 5 ] , except in terms of the m e t h o d o l o g y which w e apply in obtaining s e m a n t i c functions and organizing physical implementation. T h e idea of M L I is e x p r e s s e d in Figure 1. Syntax Semantics Multi-language nterpreter -/ Results Program Figure I. The concept of a Multi-Language Interpreter T h e concept of a M L I p r e s u p p o s e s that w e can p r e p a r e several semantics for one syntax, and w e can exploit one semantic for various s y n t a x e s . T h e descriptions of syntaxes and semantics must be translated to the e x e c u t a b l e form (before or during the running of the M L I ) . MLI implementation a r c h i t e c t u r e s m a y vary. T h e one w e use receives and exploits syntax and semantic descriptions that h a v e already been compiled as e x e c u t a b l e objects (Figure 2). Syntax is r e p r e s e n t e d b y the SyntaxObject, and semantics by the TraverserObject. the S e m a n t i c O b j e c t . the S y m b o l T a b l e . and the necessary v o l u m e of the C o m p o n e n t (the c o m p o n e n t s A , B. C in our figure). T h e MLI Kernel, w h i c h provides the initial b o n d i n g of all s y n t a x and semantics objects, initializes the execution of the program. Gitntis Amicans. Description of S e m a n t i c s and Code Generation Possibilities 15 Results Figure 2. MLI runtime architeeture Each of the c o m p o n e n t s can be implemented in various w a y s - with a different semantic a s s i g n m e n t and physical implementation. Here w e have a c h a n c e to c o m b i n e syntaxes and s e m a n t i c s in both w a y s - in terms of architecture and in terms of implementation. T h e n , h o w e v e r , w e immediately face the question of the compatibility of the syntax and s e m a n t i c s so as to avoid senseless interpretation. T h e obtaining of an executable syntax and of semantic objects from their descriptions can be d o n e before or during the actual p r o g r a m execution (analogue to a classical compiler and interpreter). D y n a m i c code generation is m o r e difficult because all generation p h a s e s m u s t be d o n e automatically. 3 . Language Specifications for MLI P r o g r a m m i n g l a n g u a g e is an artificial means to c o m m u n i c a t e with a computer and to fix the algorithms for problem solving. Like a natural language, a p r o g r a m m i n g l a n g u a g e ' s definition consists of three c o m p o n e n t s or aspects: syntax, semantics and p r a g m a t i c s [ P a g 8 1 , S K 9 5 ] , All of these aspects are significant in dealing with our p r o b l e m s . Usually exploited rarely, pragmatics deals with the practical use of a language, and this is an important element in defining s e m a n t i c s . W e can look at s y n t a x and s e m a n t i c s from two perspectives - the definition or description phase and t h e runtime phase. Our goal is to achieve runtime c o m p o n e n t s which can freely be e x c h a n g e d or m i x e d together in pursuit of the desired collaboration. First w e must look at t h e principles of syntax and s e m a n t i c s descriptions, and then we can view the target c o d e generation steps. O u r basic principle is to divide syntax and s e m a n t i c s into small parts, and later, with a simple m e t h o d , to c o m b i n e these parts thus providing a m e c h a n i s m to tie together the semantic parts and the syntax elements. Our m e t h o d is close to some of the structuring p a r a d i g m s o f attribute g r a m m a r s [Paa95]: T h e definition p h a s e is similar to the relationship S e m a n t i c aspect = M o d u l e , but the runtime p h a s e is similar to N o n t e r m i n a l = P r o c e d u r e . That m e a n s that we basically use the language pragmatics and divide the s e m a n t i c s into semantic aspects. 3.1. Syntax T h e formalisms for dealing with the syntax aspect of a p r o g r a m m i n g language are well developed. T h e theory of scanning, parsing and attribute analysis provides not only the m e a n s to perform syntactical analysis, but also a way to generate a w h o l e compiler as well. Such t e r m s , c o n c e p t s or tools as finite a u t o m a t a , regular expression, contextfree g r a m m a r , attribute g r a m m a r ( A G ) , Backus-Naur form (BN'F). extended B N F 16 DATORZINATNE EN INFORMACIJAS TEHNOLOGIJAS ( E B N F ) , Lex (also Flex). Y a c c (also B i s o n ) , a n d P C C T S are well k n o w n and accepted by the c o m p u t e r science c o m m u n i t y . W e do not need to reinvent the w h e e l a n d it is r e a s o n a b l e to c h o o s e the existing formalisms and generators (lexers and p a r s e r s ) . T h e main task w h e n d e a l i n g with syntax description for a given language is c o d e g e n e r a t i n g which can transform the written program, w h i c h uses the s y n t a x , into i n t e r m e d i a t e representation (IR). A d d i t i o n a l l y , we need to attach a library with functions, which p r o v i d e the m e a n s to m a n i p u l a t e with the IR and to c o m p i l e the w h o l e code. T h e result is the S y n t a x O b j e c t (Figure 2). In this paper we concentrate mostly on the class of imperative p r o g r a m m i n g languages, but our method is adaptable for other languages t o o , such as d i a g r a m m a t i c languages (e.g., Petri nets. E-R d i a g r a m s , Statecharts. V P L - visual p r o g r a m m i n g languages, etc.). which exploit other f o r m a l i s m s (e.g., SR G r a m m a r s . R e s e r v e d Graph G r a m m a r ) and processing styles [FNT-i-97, Z Z 9 7 ] , 3.2. Semantics T h e chosen principle for the r u n t i m e s e m a n t i c s parse and traverse states that the most important things are a traversing strategy and the s e m a n t i c functions w h i c h must be executed when visiting a node (Figure J ) . Therefore, the central c o m p o n e n t s of the semantics are TraversalObject and S e m a n t i c O b j e c t (Figure 2). T h e TraversalObject m a n a g e s the node visiting order, p r o v i d e s semantic functions with information from the IR, and is the m a i n engine of the M L I . The SemanticObject, for its part, contains all o f the necessary s e m a n t i c functions and provides for the execution e n v i r o n m e n t . At the same time, w e can also put into the semantic functions certain c o m m a n d s which force the T r a v e r s e r to search for the needed node and to c h a n g e the current execution point in t h e IR (traversing strategy c h a n g e s and a transition to another node are p r o b l e m s in the Visitor pattern [e.g. VisOl]). Semantic functions h a v e to be as s i m p l e and as small as possible. This can be achieved by using a m e t a - l a n g u a g e and by e m p l o y i n g high-level expression m e a n s , which allow for easy understanding and verification of the description. F o l l o w i n g this Gunris Arnicans. Inscription of S e m a n t i c s and Code Generation 17 Possibilities principle b e c o m e s more natural if w e use abstract c o m p o n e n t s so that the underlying semantic can be clear without additional explanations (in Figure 3, the abstract c o m p o n e n t s already h a v e a concrete i m p l e m e n t a t i o n c o m p o n e n t - A, B and C). This statement m a y lead to objections from the advocates of formal semantics, because the c o m p o n e n t s are not described with m a t h e m a t i c precision. At the same time, however, formal semantics s o m e t i m e s use such c o n c e p t s as Stack or Symbol table. Let us introduce a conceptual syntax element, which is a g r a m m a r symbol with a n a m e (e.g., a n a m e d n o n t e r m i n a l s y m b o l or a n a m e d terminal symbol). C o n s i d e r i n g the various types of syntax elements and the traversing strategy, w e separate various visitations and introduce the concept of the traversing aspect. For instance, we can distinguish the arriving into node from the parent n o d e (PreVisit) as well the arriving into n o d e from the child node (PostVisit). Thus w e create the semantic functions and n a m e t h e m not only on the basis of the n a m e of the syntax element but also on the basis of the arriving aspect (traversing aspect) into this element (Figure 3). R u n t i m e semantics o r simply s e m a n t i c s for multi-language interpreters are a set of semantic functions. W e represent the runtime s e m a n t i c in T a b l e 1. There is an executable code ( c ) or nothing (A.) for the syntax e l e m e n t , according to the traversing aspect, n depends on the size of syntax (e.g.. the count of all nonterminal and terminal s y m b o l s ) , and m d e p e n d s on the c o m p l e x i t y of the traversing strategy (usually 1..3). We notice that the matrix m a i n l y consists of e m p t y functions (X.). Tabic 1. The matrix of syntax elements and semantic function correspondence Syntax Element (SE) A, Traversing Aspect (TA) T T T A-, A-, T X SE, SE : X - X SE, X X X SE„ • X K X A A X T h e identification o f semantic functions is realized both by the syntax n a m e and by the traversing aspect n a m e . Technical implementation may differ, but it is very advisable that functions identification and calling b e performed with constant complexity 0 ( 1 ) . N o w we arrive at the most difficult and important p r o b l e m - how can we obtain semantic functions and ensure correct collaboration b e t w e e n them, and how is it possible to create r e u s a b l e semantic descriptions? Let us explain our ideas about how to define semantics and h o w to gain the matrix o b s e r v e d above, i.e.. h o w to generate executable semantics from the semantic description. 18 DATORZINATNE UN INFORMACIJAS TEHNOLOGIJAS 4. Semantic Aspects and Abstract C o m p o n e n t s 4.1. Semantic aspects In practice, p r o g r a m m i n g languages are frequently presented through the pragmatics of the p r o g r a m m i n g language, i.e.. e x a m p l e s are used to s h o w h o w the language constructs are exploited and what their u n d e r l y i n g m e a n i n g is. Let us call these language constructs and their m e a n i n g like s e m a n t i c aspects. We h a v e chosen to define the s e m a n t i c as a set of m u t u a l l y c o n n e c t e d semantic aspects. Here are s o m e e x a m p l e s for typical g r o u p s of s e m a n t i c aspects for imperative p r o g r a m m i n g languages: execution of c o m m a n d s or s t a t e m e n t s (e.g., basic operations, variable declaring, assigning of a value to the variable, execution of arithmetic expressions), program control flow m a n a g e m e n t (e.g., loop with a counter, conditional loop, conditional branching), dealing with s y m b o l s (e.g., variables, constants), e n v i r o n m e n t m a n a g e m e n t (e.g.. the s c o p e s of visibility). Here, t o o . are e x a m p l e s of nontraditional semantic aspects: attractive printing o f the p r o g r a m , d y n a m i c accounting of statistics, symbolic execution, specific p r o g r a m i n s t r u m e n t a t i o n , etc. W e have chosen an operational a p p r o a c h to describe the semantic aspect - we define the c o m p u t a t i o n s , which a c o m p u t e r h a s to d o to perform the semantic action. 4.2. Abstract data types and abstract c o m p o n e n t s T h e next significant principle to define the s e m a n t i c aspect is using abstract data types ( A D T ) as much as possible. A D T is a collection of data type and value definitions and operations on those definitions, w h i c h b e h a v e as a primitive data type. This software design approach breaks down the p r o b l e m into c o m p o n e n t s by identifying the public interface and the private i m p l e m e n t a t i o n . In o u r case, typical e x a m p l e s of A D T are Stack. Q u e u e , Dictionary, and Symbol table (in c o m p i l e r construction theory [ A S L ' 8 6 , F L 8 8 ] , in formal semantics [SK95]). In this w a y w e hide most of the i m p l e m e n t a t i o n details a n d concentrate mainly on the logic of the semantic aspect. Later w e can c h o o s e the best implementation of A D T for the given task. Seeing that s o m e exploited c o m p o n e n t s can be complicated (E-mail. Graph visualization. Distributed c o m m u n i c a t i o n , T r a n s a c t i o n m a n a g e r , etc.) and have no standards, w e use another term - abstract c o m p o n e n t . S o m e t i m e s w e want to utilize an already existing c o m p o n e n t , and t h e term abstract c o m p o n e n t seems more appropriate to us. It is advisable to describe the s e m a n t i c aspect t h r o u g h m e t a - l a n g u a g e , even if one does not h a v e a translator for this. Then o n e can translate or simply rewrite it by hand to the target p r o g r a m m i n g language, select a p p r o p r i a t e implementation for the abstract c o m p o n e n t s , and use the needed interface, c o l l a b o r a t i n g protocol and execution e n v i r o n m e n t . For instance. Stack can be i m p l e m e n t e d in a contiguous m e m o r y or in a linked m e m o r y . Symbol table - as a list or as a dictionary with the hashing technique. F u r t h e r m o r e , instances of abstract c o m p o n e n t s can be v i e w e d as distributed objects in a heterogeneous c o m p u t i n g network. Gunlis Arnicans. Description ot S e m a n t i c s and Code Generation Possibilities 19 4.3. Examples of abstract components S o m e abstract c o m p o n e n t s and their operations are very popular, e.g.. Stack (createStack. push, p o p , t o p . etc.). Q u e u e ( c r e a t e Q u e u e , e n q u e u e , d e q u e u e , first, etc.), while s o m e are g u e s s e d , e.g.. E-mail (prepare, send, r e c e i v e , open). A m o n g the many specific c o m p o n e n t s w e would like to emphasize o n e that is useful for most of semantics - S y m b o l table ( S y m b o l T a b l e in Figure 2) or its analogue to provide the execution e n v i r o n m e n t . While building prototypes of the M L I . w e h a v e created an implementation of S y m b o l table - M O M S (Memory Object M a n a g e m e n t S y s t e m ) - that is appropriate for i m p l e m e n t i n g the i m p e r a t i v e p r o g r a m m i n g languages. It is possible to define basic and user defined data t y p e s , to define base operations and functions, to operate with variables and their v a l u e s , to m a n a g e the scope of visibility of all objects, etc. T h e most important data types, c o n c e p t s , and operations of M O M S are listed in the appendix to this paper so as to g i v e the reader a better idea about M O M S . The second i m p o r t a n t c o m p o n e n t is Traverser (TraverserObjeet in Figure 2). Its main task is realizing the traversing strategy, to c h a n g e the current execution point and to organize cooperation with the syntax object. There is a depth-first left-to-right traversing strategy, which is used in the following e x a m p l e s (Table 2 ) . T h i s strategy has three visiting aspects: Visit (for tree leaves terminals), PreVisit a n d PostVisit (for the other tree n o d e s - nonterminals). T o define semantic functions for e x a m p l e s , w e have used the f o l l o w i n g operations: N o d e V a l u e ( ) returns a value for t h e current terminal or nonterminal symbol (value from the current IR n o d e ) , and both g o S i b l F o r w ( a N a m e ) and g o S i b l B a c k w ( a N a m e ) provide for a changing of the current n o d e , searching the node with t h e n a m e a N a m e b e t w e e n siblings going forward or b a c k w a r d . Table 2. A depth-first left-to-right traversing strategy Travcrse(nodc P) if IsLeaf(P) Visit{V) PreVisiiy?) for each child Q of P. in order, do Traverse! 0 ) PostVisit(P) It is possible to describe interfaces for S y n t a x O b j e c t . TraverserObject and S e m a n t i c O b j e c t with domain-specific language. T h e n interfaces for obtaining the IR. manipulating with it and working with the symbol table can be compiled together, and it is possible to e n g a g e in high-level optimization and verification [Eng99], The T r a v e r s i n g strategy can also be described with domain-specific language. This is important if the strategy is not trivial and d e p e n d s on syntax elements and the program state [ O W 9 9 (traversing p r o b l e m s and solutions for Visitor pattern)]. The traversal strategy should b e independent from syntax as m u c h as possible and organized 20 DATORZINATNE UN INFORMACIJAS TEHNOLOGIJAS (combined) by patterns [ V i s O l ] . In addition to c o m m o n t r a v e r s i n g strategies there are also less traditional o n e s , e.g., the strategy for r e v e r s e e x e c u t i o n o f the p r o g r a m [ B M 9 9 ] 4.4. Defining the semantic aspect It is more convenient to define the s e m a n t i c aspect b y using d i a g r a m s (as in Figure 7). W e can write a m e t a - p r o g r a m or a p r o g r a m in the target l a n g u a g e in textual form, too. D i a g r a m s contain syntax e l e m e n t s that are important for the s e m a n t i c aspect and are visualized with g r a p h i c s y m b o l s . W e can use different graphic notations. If the visiting order of syntax e l e m e n t s is important, then we m a r k the order with arrows. Let us call the o p e r a t i o n s that are performed d u r i n g the aspect n o d e visiting semantic action. S e m a n t i c action is s i m i l a r to s e m a n t i c function, but it is written at the meta-level and relates only to a given s e m a n t i c aspect. S e m a n t i c action is s h o w n as a box with the m e t a - c o d e c o n n e c t e d to the syntax e l e m e n t and takes into account the traversing aspect. T h e r e are all kinds of abstract d a t a types that are n e e d e d for the semantic aspect into the box with the key w o r d s I M P O R T G L O B A L . F o r better perceptibility of the semantic aspect, it is p e r m i s s i b l e to use additional graphic s y m b o l s that are not needed in real execution. For instance, w e use O t h e r aspects to signal that w e expect there to be a composition with the other s e m a n t i c a s p e c t s . 4.5. Examples of semantic aspects Let us look at s o m e e x a m p l e s of s e m a n t i c aspects (Figure 4 - Figure 8) that are applicable for the s i m p l e imperative p r o g r a m m i n g l a n g u a g e P a m [ P a g 8 1 ] . Terminal s y m b o l s are denoted b y a rectangle, while n o n t e r m i n a l s y m b o l s are indicated by rounded rectangles. T h e left circle in the n o n t e r m i n a l s c o r r e s p o n d s to the PreVisit semantic action, the right one - to the PostVisit semantic action, while for the terminals, the Visit semantic action is assigned. IMPORT G L O B A L Env of AD7_8 y T r i b o 1 Tab 1 e ^ J0 program Gj ENV. releas«?rogtr.v () L-OGE.-.V ( ) C_Other a s p e c t s ~' Figure 4. The semantic aspect PROGRAM. It prepares the program environment to manage variables, constants, etc. and operations involving them. The environment is destroyed at the end IMPORT G L O B A L C a n C r e a t e V a r of Stack ! /r(G v a r i a b l e _ d e f (Tj) 1 i r . C r e a t i e V ' a r . p e p () CanCreateVar. '„ O t h e r a s p e c t s _ J Figure 5. The semantic aspect VARIABLE DEFINITION. It allows for variable creation Gumis Amicans. D e s c r i p t i o n of S e m a n t i c s a n d C o d e G e n e r a t i o n 21 Possibilities IMPOST GLOBAL Trav cf ATTJTreeTraverser, ?efStack of ADT Stack, Er.v of ADT SymbolTabie, CanCreateVar of ADT Stack LOCAL VarT-xt = Trav. oodeValue () if CanCreateVar.top() - TRCE and Env.flftdvar(Vertex t) = FALSE Env.createVar(VarTe xt, INT) endif LOCAL ?.ef = Env.getRe f VarText) RefStack.p-jsh (Ref) LOCAL IntText = Trav.nodeValue() LOCAL Ref = Env.getRef ( " TNT_" + lntTe>:t! If Ref = EMPTY LOCAL Integer = TextToInteger (IntText) Ref = Er.v.createLitJ" INT_"»rr.t>.xt, :XT; Env.p'utValue (Ref, Ir.teger) endr f RefStack.push(Refi Figure 6. The semantic aspect ELEMENT. It provides for the pushing into the stack all references to each variable encountered while traversing. Variable creation is forbidden by default. The Trav provides for getting the values of the current terminal node in IR. IMPORT GLOBAL RefStack of A S T Stack, Snv of ADT SymbolTabie (O LOCAL Res = RefStack.pop() LOCAL Var = Ref S tac•:. pop () LOCAL Vol = Env.getValue(Pes) Er-.v. p-JtValue (Var, Val) assignment_statement(7-H I (O !eft_hand„side Y Q)—>- "t ASSIGN \ X ' ' O t h e r a s p e c t s ~> —>{Q t \ Re fStack.push(NULL) 5 right_hand_side 1 'T_Other a s p e c t s / , 1 Ref Stack, pcsh (NULL) Figure 7. The semantic aspect ASSIGNMENT. It takes reference to the variable and reference to the value from the stack and assigns a value to the variable. Pushing of references is simulated 22 DATORZINATNE IMPORT GLOBAL RefSrack, Sort, Flag of Trav of ADT TreeTraverser, Env of ADT WHILE - * { / comparisonC^—>~ DO IN INFORMACIJAS TEHNOLOGIJAS ADT_5tack, SyrtbolTable —>-(C series J)—>- END Other a s p e c t s if Sort, top () = INDEF t h e n LOCAL Ref = R e f S t a c k . p o p O if Er.v.getValue(Ref) - FALSE Flag.replaceTop(FALSE! Trav.goSibiForw(@END) e n d if endif Ref Stack, push (NULL) if S o r t . t o p O = INDEF and Flag, top () - T R U E Trav.goSiblBackw(@WHILE) endif Figure <H. The semantic aspect INDEFINITE LOOP. It "goes through"' the series and back to WHILE until the comparison sets a NL'LL reference or a reference with the value FALSE 5. Meta-semantics M e t a - s e m a n t i c s is a term which relates to a m e t a - n r o g r a m that describes the counterparts of which semantics consists and the way in which these counterparts are connected together. T h e conceptual s c h e m e of m e t a - s e m a n t i c s is s h o w n in Figure 9. Nonterminals Syntax Terminals Figure 9. The conceptual scheme of meta-semantics. For instance. SA1 involves nonterminal NT I and terminal T l , and SCI is performed while prcvisiting. SCI is a composition of ISA I and ISA4 (funtis Arnicans. Description of S e m a n t i c s a n d Code Generation 23 Possibililics If w e look at s e m a n t i c s from the definition side (the logical view) then semantics is Conned by traversing strategy and by a set of semantic aspects that consist of semantic actions realized while visiting the appropriate node of the intermediate representation. If we look at s e m a n t i c s from the runtime side (the physical view) then while visiting one node it is possible that several semantic actions have to be realized, and we h a v e to ensure correct collaboration between all involved instances of abstract c o m p o n e n t s into the desired e n v i r o n m e n t . H o w can w e put all of tins together correctly'? T o solve this p r o b l e m w e introduce the concept o f the semantic connector. This concept has been a d a p t e d from the concept G r e y - b o x connector, which solves a similar problem: h o w to c o n n e c t the pre-buiit c o m p o n e n t s in a distributed and h e t e r o g e n e o u s e n v i r o n m e n t for collaborating work [AGBOO]. The G r e y - b o x c o n n e c t o r is a rrictap r o g r a m that introduces a concrete c o m m u n i c a t i o n s connection into a set of c o m p o n e n t s , i.e., it generates the adaptation and c o m m u n i c a t i o n s glue code for a specific connection. 6. F r o m semantic aspects to meta-semantics and executable semantics T h e obtaining of semantics for the fixed syntax is achieved in several steps: 1) select predefined s e m a n t i c aspects or define new o n e s i'or the desired s e m a n t i c s , 2) r e n a m e syntax e l e m e n t s and traversing aspect in the selected semantic aspects with n a m e s from fixed syntax and traversing strategy. 3) rename instances of abstract c o m p o n e n t s to o r g a n i z e collaboration between s e m a n t i c aspects. 4) m a k e composition from s e m a n t i c a s p e c t s . 5) specify the runtime e n v i r o n m e n t and translate the mcta-code to the code of the target p r o g r a m m i n g language, and 6) compile the semantics. • Selection of s e m a n t i c aspects (step 1) If w e h a v e a library with previously created semantic aspects, then w e can search for a p p r o p r i a t e o n e s . i.e. reuse some parts of the s e m a n t i c s . T h e traversing strategy also has to be considered. • Syntax e l e m e n t and traversing aspect renaming in semantic aspects (step 2) Actually this is semantic action m a p p i n g . M a t c h i n g to the fixed syntax elements is achieved by m a p p i n g syntax elements of semantic aspects to fixed syntax elements ( r e n a m e with - simple m a p p i n g and duplicate to - m a p p i n g of a semantic aspect syntax element to several fixed syntax elements): rename < n a m e > <travcrsing aspec!> duplicate < n a m e > <travorsing <targct namc> <target with target n a m e • <targct traversing aspect- aspect ' to visiting aspcet> [,<targct namc> • ' t a r g e t visiting aspect> ...j For instance, r e n a m e left hand side PostVisit with V A R I A B L E Visit ; d u p l i c a t e C O L ' N ' T A B L F . N O D E Visit to V A R I A B L E Visit. : i s s i g n m c n t _ s t a t e m e n t PnslYLit • R e n a m i n g of instances of abstract c o m p o n e n t s (step 3) M a t c h i n g of c o m p o n e n t s is necessary-' because there is no direct data e x c h a n g e between semantic functions, and we need collaborative work. T h e program slate is fixed by using the runtime states o f c o m p o n e n t s . At first w e decide what instances they have 24 DATORZINATNE UN INFORMACIJAS TEHNOLOGIJAS in c o m m o n and what n a m e s they h a v e to get, and then w e r e n a m e instances in the semantic aspects: replace <name> with <target namc> For instance, replace RefStack with D a t a S t a c k ; r e p l a c e Sort with L o o p S o r t S t a c k • Composition of s e m a n t i c aspects (step 4) The goal of a s e m a n t i c aspect c o m p o s i t i o n is to bring together several semantic aspects into one m o r e c o m p l i c a t e d aspect that nearly d e s c r i b e s entire semantics. While c o m p o s i n g , w e stick t o g e t h e r the m e t a - c o d e of s e m a n t i c a c t i o n s that have the same n a m e . T h e sticking principles can vary, for instance, sequential (one c o d e is a p p e n d e d to the other o n e ) , parallel (codes can be e x e c u t e d s i m u l t a n e o u s l y or sequentially in any order), free (the user can modify the c o d e u n i o n as h e likes). T o achieve better results, w e ignore s o m e semantic actions or apply the sticking principle to the semantic aspects in the reverse order. At this t i m e we have to be aware o f conflicts b e t w e e n local variable names. compose aspect « n c w S A » (<refmed SA>) [[append ; parallel, free] (<refined SA>) . . . ] , where <refined SA> = « o l d S A » [ignore <name> <trav aspcct> [,<name> <trav aspect>]] I [reverse] The result is m e t a - s e m a n t i c s . A n e x a m p l e of a m e t a - c o d e fragment semantics is given in T a b l e 3. for meta- Gunlis Arnicans. Description of Semantics and C o d e Generation 25 Possibilities Table 3. An e x a m p l e of m e t a - s e m a n t i c s compatible with ir_type ParseTree traverser_type syntax ParseTreeTraverser elements semantic (prograrr., expression, actions VARIABLE, (<PROGRAM> ...) program PreVisit iEKV.preparePrcgEnv() <?ROGRAK> program PostVisit . . . ) global Trav of ADT_TreeTraverser global E n v of ADT_Syrr.bclTabie create DataStack, OperatcrStack, CanCreateVar, LoopSortStack, L o c p C c u n t e r S t a c k , L c c p F i a g S t a c k , I t F l a g S t a c k cf ADT Stack c r e a t e I n p u t F i l e , O u t p u t : i^e cf ADT FILE compose aspect / / c o m p o s e s s e m a n t i c a s p e c t s from a s p e c t s g i v e n a b o v e <A1> /,' s e m a n t i c a s p e c t s P R O G R A M r e m a i n s the s a m e (<PROGRAK>) append (<ELEMEKT> replace RefStack rename INTEGER with Visit D a t a S t a c k /,' r e p l a c e s s t a c k for c o l l a b o r a t i n g w o r k with CONSTANT Visit) // r e n a m e s n o n t e r m i n a l a c c o r d i n g to P A M s y n t a x append (<ASSIGNMENT> replace RefStack rename left right ignore append hand hand lef t_hand (<INDEFINITE repiace RefStack with DataStack side PostVisit siae side with V A R I A B L E with PostVisit Visit, expression PostVisit // i g n o r e p u s h i n g o f N U L L r e f e r e n c e PostVisit) LOOP> with DataStack, Sort with LoopSortStack, Flag append (<VARIA3LE DEFINITIONS with LoopFiagStack) rename variable_def FreVisit with assign statement v a r i a b l e def PostVisit with ASSIGN Visit) end compose aspect compose (<A1> aspect ignore append <A2> expression ( ... PostVisit, coxpari sicn /* Others aspect are appended <INPUT>, <OUT?LT>, <BAS5 3YNARY C O N D I T I O N A L S T A T E K E N T > */ . . . ) e n d ccivpose • PreVisit, FcstVisit; as <TY?E AND OPERATOR>, so o h 0 FERATION>, <DEF:N;TE LOOF>, aspect M e t a - c o d e translating (step 5) M e t a - c o d e is t r a n s l a t e d to the target p r o g r a m m i n g language, taking into account the target l a n g u a g e (e.g. C + + ) , the i m p l e m e n t a t i o n of abstract c o m p o n e n t s (e.g. Stack in linked m e m o r y ) , the o p e r a t i n g s y s t e m (e.g. U n i x ) , t h e c o m m u n i c a t i o n s between c o m p o n e n t s (e.g. C O R B A ) , MLI c o m p o n e n t s type (e.g. D L L ) , etc. T h e translation may be d o n e by hand or a u t o m a t i c a l l y (desirable in c o m m o n cases). 26 I) A T O R Z I N A T N F: U N INFORMACIJAS TEHNOLOGIJAS • Obtaining semantic objects (step 6) By compiling the c o d e we get e x e c u t a b l e objects that p r o v i d e semantic performance, i.e. they contain the s e m a n t i c functions that are called w h i l e traversing the program intermediate representation. T h e instances of the c o n c r e t e i m p l e m e n t a t i o n of abstract c o m p o n e n t s are created, or the existing ones are d y n a m i c a l l y linked via the selected c o m m u n i c a t i o n s protocols. 7. Examples of alternative semantic aspects In this section we p r o v i d e a short insight on h o w w e can build nontraditional semantics. By adding new features to e x i s t i n g s e m a n t i c s w e can create a specific tool that works with a given p r o g r a m m i n g l a n g u a g e . W e w o u l d like briefly to survey two e x a m p l e s : 1) statistics a c c o u n t i n g of p r o g r a m point visiting, a n d 2) storing of symbolic values for variables. Both aspects are a d d e d to the c o n v e n t i o n a l s e m a n t i c s , and this is program instrumentation if w e speak in t e r m s of software testing. 7.2. Accounting of program point visiting Our goal is to account for any visiting o f a desirable p r o g r a m point. This m e a n s that w e need to set counters at these points. At first w e write the s e m a n t i c aspect N O D E C O U N T E R (Figure 10). W e use the abstract c o m p o n e n t Dictionary w h e r e w e can store, read and update records in form <key, v a l u e > . IMPORT G L O B A L Trav of A D T _ T r e e T r a v e r s e r , Diet ci A D T _ D i c L i o n a r y COUNTABLENODE LOCAL key = T r o v . g c t N o d c T D I : LOCAL reco-U = D i e t . g e c R e c N o m ( k e y ) if r e c o r d = 0 D i e t . c r e a t e R e c ( k e y , 1) else Diet.updat8(ro:crd, Diet.cot(record) endi f - 1) Figure 10. The semantic aspect NODE COUNTER Defining m e t a - s e m a n t i c s . w e add this s e m a n t i c aspect to the others. F o r instance, we define accounting for the use of any variable and a s s i g n m e n t operation: dublicate countable_node Visit to VARIABLE Visit. assignmcnt_statcraent PostVisit Another semantic aspect can be built which a c c o u n t s for every concrete variable using statistics into an additional d i c t i o n a r y (the variable n a m e serves as the key). We can improve this aspect further by a c c o u n t i n g for an aspect of variable use - defined, modified, referenced, released, etc. In the p r o g r a m analysis and instrumentation area, our approach is similar to the W y o n g s y s t e m (based on the Eli compiler generation system and the A T O M p r o g r a m instrumentation s y s t e m ) , b e c a u s e specific operations are attached to syntax e l e m e n t s , and in this way w e obtain a specific tool with additional semantics [Slo97]. Guruis Arnicctns. Description ni S e m a n t i c s and Code Generation Po%sihiliiics 27 7.2. Storing of symbolic values for variables T h e second e x a m p l e provides for the fixing of symbolic values for variables (Figure 11). T o do this task in an effective way, w e have additional operations in our M O M S (symbol table). T h e operation c r e a t e S y m b V a l t i e creates an entry for symbolic value, a n d with the operations a d d T e x t T o S y m b V a l u e and a d d V a r T o S y m b V a l u e , w e form the value while traversing all nodes in the desired subtree ( w e store all needed p r o g r a m s y m b o l s and symbolic values of variables). At the end w e store accumulated value w i t h the operation s t o r e S y m b V a l u e . IMPORT GLOBAL T r a v o f A O T _ T r e e T r a v e r s e r SymfcR«fSi:ack o f A D T _ s t a e k , Er.v at A C T ^ S y m b a l T a h l e , Car.CreateSyrr.bVar o f A C T _ s t a c l c H LOCAL Syrr-cVar - £ y n i : P . e r S t a c k . p o p [] Env. s t o r e S y r i : V a l u e (SynbVar, S y r i V a U ast.- left h a n d Side car,Craar:eSyr.bVar . p u s n I FALSE) CafiCzeateSyrriVar . p o p \ ) ••••en:_5'.aierr;e'il 2 > — u)—^\ T ASSIGN nght_hand_s;de '^_0'her aspects^' '„ O t h e r a s p e c t : ?ack,pusr-tHULL) | LOCAL S y n b T e x i » T r a v . n o d e V a l u e I } LOCAL Synfcp.ef - S y m b R e E S t a c k . t o p ( ) En v . addTeKCToSymfiValije (SymbP-ef, S y n b T e x t ) LOCAL V a r T e K t =• T r a v . n o d e v a L u e < ) LOCAL Eyru'-'flcName = " SYMB_" i VarTusc: i f C a n C r e a t e S y m t r t / a r . t o p l) - TRUE 1£ E i w . f L r . d V j r ( 5 y r n b v a i 7 e x * 0 - FALSE E n v . c r e a t e V a r (SymfcVarTexTi^ SYKSJ endif LOCAL S y n b R e : - E n v . q e t R e f (SyrctoVarTexc) Er.v. c r e a t e s y r n b v a l u e ( S y r r i R e : ) SVT-bRe i S zack. p u s n IZynvt Re f ) else LOCAL, SymbP.ef - Enir.getP-e f [Syirf-jVarText) Er.v . acdVarTc-SyrrJ-A'alue (SymbRef.) er.da f Figure 11. The semantic aspect SYMBOLIC VALUES 8. Conclusions This method was developed with the goal of reducing the gap b e t w e e n practitioners (tool d e v e l o p e r s ) and theoreticians (developers of formal specifications for semantics). Our e x p e r i e n c e shows that a significant a c h i e v e m e n t is attained if abstract c o m p o n e n t s or abstract data types realize the greater part of s e m a n t i c s , because that w a y it is easier to p e r c e i v e the full implementation of semantics. T h e second acquisition of our m e t h o d involves significant disjoining o f syntax and s e m a n t i c s from each other. It allows us to c o m b i n e various syntaxes and semantics and to find out the most desirable semantics for the given syntax. So. if w e have written syntax for a new language, we can m a t c h several s e m a n t i c s to it in a comparatively short t i m e . As a result, w e can develop a wide spectrum of tools in support of our new language. Our approach allows us to c h a n g e semantics d y n a m i c a l l y while the interpreter is running, i.e. replace semantics or execute various ones simultaneously. It is possible to reduce a derived parse tree (by deleting nodes with e m p t y connectors) or to optimize it (tree restructuring statically and d y n a m i c a l l y , considering performance statistics). 28 DATORZINATNE UN INFORMACIJAS TEHNOLOGIJAS At this m o m e n t the e n v i r o n m e n t for tool construction or s e m a n t i c s generation is not completely d e v e l o p e d . O u r experience s h o w s that tools can be developed without significant i n v e s t m e n t s , for e x a m p l e , by u s i n g L e x / Y A C C as a generator to create a syntactic object, which p r o d u c e s p r o g r a m i n t e r m e d i a t e representation. It is not too difficult to d e v e l o p a T r a v e r s e r and a simple S y m b o l T a b i e . A n d as the last j o b , w e have to w o r k up semantic a s p e c t s on the basis of our m e t h o d and c o m p o s e t h e m , thus obtaining c o n n e c t o r s , w h i c h can be written in s o m e c o m m o n p r o g r a m m i n g language (skipping m e t a - l a n g u a g e use and its translation). T h e u s e of abstract c o m p o n e n t s depends on target s e m a n t i c s . W e believe that the d e v e l o p m e n t of the serious t o o l s d e m a n d s a more universal implementation of the S y m b o l table. O u r S y m b o l table i m p l e m e n t a t i o n - M O M S - is not applicable only for imperative l a n g u a g e i m p l e m e n t a t i o n s . It w a s also the basic object-oriented d a t a b a s e for the c o m m e r c i a l application M o s a i k (Sietec consulting G m b H C o . O H G . graphical C A S E tool for b u s i n e s s m o d e l l i n g ) . T h e w e a k n e s s of o u r m e t h o d lies in the s e m a n t i c a s p e c t s c o m p o s i t i o n stage. At this m o m e n t w e have not a n a l y z e d all risks in t e r m s of o b t a i n i n g senseless or erroneous semantics. T h e p r o b l e m s are not trivial, and they are similar to p r o b l e m s in the proper collaboration of objects or c o m p o n e n t s in object-oriented p r o g r a m m i n g , too. [e.g. M L 9 8 ] , M o s t n a m e conflicts can be p r e c l u d e d a u t o m a t i c a l l y , but it is considerably harder to organize collaboration a m o n g the c o m m o n c o m p o n e n t s in s e m a n t i c aspects (it is easier if the s e m a n t i c aspects are m u t u a l l y i n d e p e n d e n t ) . Another p r o b l e m is that the l a n g u a g e g r a m m a r is frequently not context free (this is true of our e x a m p l e a b o v e , too). In this case w e h a v e to introduce additional flags to m e m o r i z e the context of s y n t a x e l e m e n t s . It is a d v i s a b l e to rewrite t h e syntax and to use context-free g r a m m a r s . 9. References [AAB96] V. Arnicane, G. Amicans, and J. Bicevskis. Multilanguage interpreter. In H.-M. Haav and B. Thalheim, editors, Proceedings of the Second International Baltic Workshop on Databases and Information Systems (DB&IS '96), Volume 2: Technology Track, pages 173-174. Tampere University of Technology Press, 1996. [AGBOO] U. ABmann, T. GenBler, and H. Bar. Meta-programming Grey-box Connectors. Proceedings of the Technology of Object-Oriented Languages and Systems (TOOLS 33), pp.300-31 1, 2000. [ASUS6] Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Technigucs, and Tools. Addison-Wesley, 1986. [BM99] B. Biswas and R. Mall. Reverse Execution of Programs. ACM SIGPLAN Notices, 34(4):61-69. April 2000. [Cla99]C. Clark. Build a Tree - Save a Parse. ACM SIGPLAN Notices, 34(4): 19-24, April 2000. [DKV00] A. Deursen. P. Klint, and J. Visser. Domain-Specific Languages: An Annotated Bibliography. ACM SIGPLAN Notices, 35(6):26-36. June 2000. [Eng99] Dawson R. Engler. Interface Compilation: Steps toward Compiling Program Interfaces as Languages. In DSL-99 [ITSE99], pp.387-400. [FL88] Charles X. Fisher, and Richard J. LeBlanc. Jr. Crafting A Compiler. BenjaminCummings, 1988. Guntis Amicans. Description of S e m a n t i c s and Code Generation Possibilities 29 [FNT-97JF. Ferrucci, F. Napolitano, G. Tortora. M. Tucci, and G. Vitiello. An Interpreter for Diagrammatic Languages Based on SR Grammars. Proceedings of the 1997 IEEE Symposium on Visual Languages (VL '97), pages 292-299, 1997. [GHJV95]E. Gamma, R. Helm, R. Johnson, and J. Vlisides. Design Patterns: Elements of Reusable Software, pages 331-334. Addison-Wesley, 1995. [HKOO] J. Fleering and P. Klint. Semantics of Programming Languages: A Tool-Oriented Approach. ACM SIGPLAN Notices. 35(3):39^8. March 2000. [ITSE99] Special issue on domain-specific languages. IEEE Transactions on Software Engineering, 25(3), May/Tune 1999. [Kin95] W.Kinnersley, ed.. The Language List. 1995. http://wuarehive.wust fed ii/doc/misc/lunglist.txt [Lou97] Kenneth C. Louden. Compilers and Interpreters. In Tucker [Tuc97], pp.2120-2147. [ML98] M. Mezini and K. Lieberherr. Adaptive Plug-and-Play Components for Evolutionary Software Development. SIGPLAN Notices. 33( 10):97-l 16. 1998. Proceedings of the 1998 ACM SIGPLAN Conference on Object-Oriented Programming. Systems, Languages & Applications (OOPSLA '98). [OW99] J. Ovlinger and M. Wand. A Language for Specifying Recursive Traversals of Object Structures. SIGPLAN Notices, 34(10):70-8L 1999. Proceedings of the 1999 ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages & Applications (OOPSLA '99). [Paa95]J. Paakki. Attribute Grammar Paradigms - A High-Level Methodology in Language Implementation. ACM Computing Surveys, 27(2): 196-255. June 1995. [Pag81] Frank G. Pagan. Formal Specification of Programming Languages: A Panoramic Primer. Prentice-Hall, 1981. [Sch97] David A. Schmidt. Programming Language Semantics. In Tucker [Tuc97], pp.22372254. [SK95] K. Slonneger and B. L. Kurtz. Formal Syntax and semantics of Programming Languages: A Laboratory Based Approach. Addison-Wesly, 1995. [Slo97] A. M. Sloane. Generating Dynamic Program Analysis Tools. Proceedings of the Autralian Software Endineering Conference (ASWEC'97), pp.166-173, 1997. [Tuc97] Allen B. Tucker, editor. The computer science and engineering handbook. CRC Press, 1997. [Vis01]J. Visser. Visitor Combination and Traversal Control. SIGPLAN Notices, 36(11):270282, 2001. Proceedings of the 1998 ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages & Applications (OOPSLA '01). ~ZZ97] D.-Q.Zhang and K.Zhang. Reserved Graph Grammar: A Specification Tool For Diagrammatic VPLs. Proceedings of the 1997 IEEE Symposium on Visual Languages (VL '97), pages 292-299, 1997. 30 DATORZINATNE UN INFORMACIJAS TEHNOLOGIJAS 10. Appendix Table 4. MOMS types Type Constructor Value Reference MemoryMap IdentifDict TypcsDict FunctDict NamesTable ConstrTable ValuesTable MemoryBlock ...Others Description Handle to an obiect type description Handle to a byte stream that contains an object value Handle to a memory obiect Main central object that organizes other objects (other MemoryMap also) Dictionary of all identifiers (variables, constants, etc.) Dictionary of all types (basic types and user defined) Dictionary of all functions (basic operators, basic functions, user defined functions) Compact storing of all strings Compact storing of all constructor descriptions Storing and managing of all object values Organizing scope visibility in all dictionaries and providing memory management Stack, queue, collection, etc. Guntis Arnicans. Description of Semantics and Code Generation Possibilities 31 Table 5. System initialization and global operations Operation MemoryMap mcmoryCount) initialize(Uint Uint switchMemoryTo(Uint memNum) Uint getCurrentMemorv(void) defineCountOfBaseTypes(Lint countOfBaseTypes) defineBaseType(char* typcNamc. Uint type, Uint rypcSize) defincBascFunetion(char* langFunctN'ame. char* internalName, Uint return Type, int paramCount, ...) prepareProgramEnv(Uchar scope) releaseProgramEnv(void) define Automat icMemS witching!'Uint firstMemNum, Uint lastMemNum) ... Others Description Creates MOMS with internal parallel but related memories (MOMS) We can exploit only the specific internal memory Returns the number of the actual memory Defines a count of the basic types of MOMS Defines base types. This interface is in C and depends on previously defined types. Some examples: defineBaseType("long ", LONG , sizeof (long )); define Base Type ("boolean". BOOLEAN sizeof (boolean_)); define Base Type ( " d a t e " , DATE_. sizeof(date )) Defines the base operations and functions. This interface is in C and depends on previously defined types. Some examples: define Base Function ("-". "PLUS". LONG_, 2. LONG_. L O N G J ; define Base Function ("day", "day". LONG . 1, DATE ) Prepares a new McmoryBlock, defines the scope (visibility) ot previously defined variables, types, functions Releases a current McmoryBlock and all related memory in other objects (dictionaries, tables) and restores a previously defined Memory-Block * Provides for automatic switching in various functions. For instance, we look up the variable in a local memory and then in a global memory" (if the variable is not founded yet). 32 DATORZINATNE UN INFORMACIJAS TEHNOLOGIJAS Table 6. Defining of user defined data types Operation putValue(Ref aRef, char* aValue) char* getValue(Ref aReO char* createDynamicVaIue(ConstrPtr ptrToConstr) deleteDynamicValue(char* a Value) setValucProtectionOn(char* aName) setValueProtectionOff(char* aName) gotoArrayElementConstr(Ref& aRef, Sint index) gotoNameConstr(Ref& aRef) gotoPointerConstr(Ref& aRef) gotoProductionLeftConstr(Ref& aRef) gotoProductionRightConstr(Ref& aRef) gotoRecordConstr(Ref& aRef) gotoNameInList(Ref& aRef, aName) ... Other char* Description Sets a new value for the object. Returns a value for the object. Provides dynamic memory allocation for the object given by type. Releases dynamically allocated memory. Protects a value of the given object against modification, for instance, protects constants. Takes off a value protection. Sets a virtual mark to element constructor and to a given array element value. aRef is modified, it refers to the array element. Moves the virtual mark to the name constructor. Moves the virtual mark to the pointer subconstructor and to the start of value. Moves the virtual mark to the left subcon-structor and to the start of the corresponding value. Moves the virtual mark to the right subconstructor and to the start of the correspondding value. Moves the virtual mark to the start of record. Moves to the appropriate type and value (list of named types linked by productions) E.g., search a field in the user-defined structure. Guntis Aniicuns. Description of S e m a n t i c s and C o d e Generation 33 Possibilities Table 7. Operations with variables and similar objects Operation ConstrPtr create Constr Array (Uint min Index, Uint maxlndcx, ConstrPtr ptrTo ElemConstr) ConstrPtr createConstrFunct (ConstrPtr ptr To Return Constr, ConstrPtr ptr To Param Constr) ConstrPtr createConstrName(char* aXame, ConstrPtr trToSubConstr) ConstrPtr createConstrPointer(ConstrPtr ptrToSubConstr) ConstrPtr ereateConstrProduct(ConstrPtr ptrToSubConstr], ConstrPtr ptrToSubConstr2) ConstrPtr createConstrRecord(ConstrPtr ptrToSubConstr) ... Other constructors constr ArraySetMinIndex(ConstrPtr ptrToConstr, Uint minlndex) Uint constrArrayGetMinIndex(ConstrPtr ptrToConstr) ... Others Description Defines an array type with the given dimensions and element types. Here and in other functions we can use any previously defined (or partly defined) data type. Defines a function type with the given parameters and return type. Assigns a user-defined name for the given type. Defines a pointer type to the given type. Creates a production of two types (establishes some relation between them). It is useful to construct a serious data structure. Defines a record data type (a set of pairs [name, tvpe}). For instance, base data type constructors Modifies the type description (attributes). Provides details about data type attributes. Table X. Operations with value Operation Create Var (char* aName, ConstrPtr ptrToConstr) createVar(char* aName, char* typeName) createLiteral (char* aName, Constr Ptr ptrTo Constr) Ref createRef (ConstrPtr ptrToConstr) [ ConstrPtr getConstrRef (char* typeName) createSynonym (char* aName, Ref aRef) ... Other Description Creates a variable with the given name and type. Creates a variable with the given name and type name. Creates a literal (constant) with the given name and type. Creates an object without a name with the given type, for instance, internal loop counter, return value of function. Returns pointer to type with the given type name. Creates another reference by name to the existing object. LATVIJAS UNIVERSITATES RAKSTI. 2004. 669, sfej.: DATORZINATNE L'N INFORMACIJAS TEHNOLOGIJAS. 34.-52. Ipp. Data Staging in the Data Warehouse Janis Benefelds C o m p u t e r S c i e n c e Ph.D. student. University of Latvia J a m s . Be n o t e klst» u n i b t m k a . l v This topic describes one of Data Warehousing components - Data Staging. Data staging ensures a) data extraction from the source systems: b) data transformation and standardisation; c) data transfer to the Data Warehouse. The Importance of such important things like data quality and data integrity is described in more details too. This paper intends to be a survey based on practical experience in real projects and theoretical professional literature studies. The Paper consists of a Foreword, six Chapters and References. The Foreword describes the scope or the subject to be analysed in this article. The first Chapter is an introduction that gives definitions and common understanding about concepts used in this article The second Chapter is the largest one. and it describes three main parts of Data transformation Staging: data export, data transform and data load. Two types of Data Staging - dimensional and fact table staging - are described in more details. This Chapter is concerned with relative problems, like data quality, as well. The third and fourth Chapters describe such an important component of Data Staging like Metadata and the mutual relation between it and other parts of Data Staging described in the previous Chapter. The Fifth Chapter summarises the more important items and makes emphases on them to understand how essential they are to be successful in Data Staging. The Sixth Chapter points out the directions that follow as logic and sequential continuation to be a subject for further research and analysis References contains a list of information sources. Key words: metadata, data warehouse, data staging, data integrity. Foreword This article is a s u m m a r y about Data S t a g i n g as o n e of the c o m p o n e n t s o f Data W a r e h o u s i n g . There is an e x p l a n a t i o n of D a t a Staging — w h a t it m e a n s and w h a t it consists of. S o m e c o m m o n definitions are given and the basic elements (data extraction, transformation, loading and metadata) arc d e s c r i b e d . T h e content is additionally illustrated by many e x a m p l e s that are taken from the financial industry. This Article points out the importance o f the right u n d e r s t a n d i n g o f the functions which Data Staging is responsible for, to be successful in designing and deploying projects in real life. The following picture (Figure I) s h o w s the basic D a t a W a r e h o u s i n g e l e m e n t s and describes w h a t part of it this article covers. Janis ttene Data Stat- in the Data Sources Data 35 Wareho Data Staging Area Data Export Data Transform j I Data Warehouse End User Applications Data Load — i Metadata Figure I' The basic elements of Data Warehousing 1. Introduction Data Staging is o n e o f the first and one of the m o s t important things, if we arc discussing issues about d e v e l o p i n g a Data W a r e h o u s e . In a lot of b o o k s , the Internet and other sources w e can find n u m e r o u s Data Staging definitions that are very similar. I w o u l d like to mention s o m e of t h e m that would be sufficient and describe all the main functions performed by Data Staging. Data Staging is a set of processes that cleans, transforms, combines, deduplicates. households, archives and prepares source data for use in the Data Warehouse. Data Staging Area is everything in between the source system and the presentation server where Data Staging processes are performed. The key defining restriction on the Data Staging area is that it does not provide query and presentation services to end-users. The Data Staging area is the Data Warehouse workbench. It is the place where raw data is loaded, cleaned, combined, archived and exported to one or more presentation server platforms. O n e of the D W H g u r u s , Ralph Kimball, has m e n t i o n e d that Data Staging adds to the data an additional v a l u e , e.g.. the data quality and sense of use of these data in some analysis is higher than before. In general the Data S t a g i n g process consists of one c o m m o n service c o m p o n e n t - metadata - and three functional c o m p o n e n t s : • Data Export from the source systems or data c o n s o l i d a t i o n ; • Data cleaning, t r a n s f o r m i n g and soiling or data standardization; • Data import into the D a t a W a r e h o u s e . Data Mart or l o a d i n g : 36 DATORZINATNE UN INFORMACIJAS TEHNOLOGIJAS T h e r e m a y be s o m e different a b b r e v i a t i o n s used to n a m e these p r o c e s s e s . E T L stands for Export. Transfer and Load, E T T s t a n d s for Extraction, T r a n s f o r m a t i o n s and T r a n s p o r t a t i o n . In this article w e ' l l use E T L . It certainly does not m e a n that D a t a S t a g i n g consists o f three strictly different m o d u l e s , w h e r e the s e q u e n c e and c o m p l e t e n e s s of tasks to be performed is highly prescribed. Data Staging, d e p e n d i n g o n the situation, can be very a c o m p l i c a t e d process from the functional and technical point o f v i e w . It is r e c o m m e n d e d to start the Data S t a g i n g design process with a very simple s c h e m a t i c of the pieces that y o u k n o w : the s o u r c e s a n d targets. Keep it very high-level and highlight where data are c o m i n g from a n d w h e r e data are going t o , point out the major challenges that you already k n o w a b o u t . Figure 2 s h o w s an e x a m p l e o f a highlevel Data Staging design plan that c a m e from the financial industry. Figure 2: Example of the high-level Data Staging plan for some financial institution. For a better u n d e r s t a n d i n g of the p r i n c i p l e s a n d mission of Data S t a g i n g , l e t ' s analyse e a c h of the three steps m e n t i o n e d b e f o r e m o r e closely. 2. E T L 2.1. D a t a export from transactional s y s t e m s or data consolidation. Practically every on-line transactional system ( O L T P ) used in organization is a potential data source for the Data W a r e h o u s e . H o w e v e r a data source for the Data W a r e h o u s e can be s o m e external i n f o r m a t i o n s y s t e m s and not classified information Janis Benefelds. D a t a S t a g i n g in the D a t a Warehouse 37 (flat files etc.) as well. E v e r y t h i n g that serves as a data source can be d e v e l o p e d in different e n v i r o n m e n t s on different platforms and it can be " h o m e - m a d e " and delivered by external v e n d o r s . F r o m the point of view of the D a t a W a r e h o u s i n g it should make no difference - the only thing w e are interested in is data. N o r m a l l y , the s c o p e o f the data to be collected in the D a t a W a r e h o u s e is defined by the end u s e r s - usually b u s i n e s s p e o p l e . T h a t m e a n s the s c o p e is defined in business not Information T e c h n o l o g y concepts and therefore high quality business analysts and system analysts are required at this stage o f the Data W a r e h o u s i n g process. To d e t e r m i n e the right database objects (and their relationship) that potentially could b e included in a data export p r o c e s s e s , business analysts should very precisely define t h e r e q u i r e m e n t s o f data n e e d e d and the s y s t e m analyst should be able to point out h o w and w h e r e t h e s e data physically can be allocated. N o r m a l l y the s y s t e m analyst uses the system d o c u m e n t a t i o n that should help to find out the right information about data w i t h i n a system. S a d l y , there m a y be situations w h e n d o c u m e n t a t i o n does not help, or there is no d o c u m e n t a t i o n at all. S u c h a situation can arise b e c a u s e of poor m a n a g e m e n t during s y s t e m d e v e l o p m e n t , i m p l e m e n t a t i o n and m a i n t e n a n c e processes. T h e n t h e only source o f information about the inside of a particular information system are p e o p l e . S o m e kind of interviews should be o r g a n i z e d and system d o c u m e n t a t i o n , should be redesigned accordingly to the actual status of the system. O n e of the resulting d o c u m e n t s of these interviews is a data dictionary and it is actually a part o f the m e t a d a t a (see chapter 3 . M e t a d a t a ) . A Data dictionary is a collection of descriptions o f the data objects or items and their relationships in a data m o d e l . A data dictionary can be consulted to understand w h e r e a data item fits in the structure, what values it m a y contain a n d basically w h a t t h e data item m e a n s in realw o r l d t e r m s . A small e x a m p l e of that part, which explains what the data item m e a n s in r e a l - w o r l d terms, is s h o w n in the following table. Table No. I "Example of a Data dictionary from the financial industry" Real-world term Account (Acct) Active Acct Dormant Acct Closed Acct Client Active client Dormant client Closed client Database object and its relationship One record of the table ACCOUNTS Accounts.Status = 1 AND Accounts.CloseDate IS NULL Accounts.Status = 2 AND <&today> - Accounts.LastTranDate > 3M Accounts.Status = 2 AND Accounts.CloseDate <= <today> One record of the table CLIENTS Client that has at least one active account (see 'Active Acct') Client with all accounts being dormant (see 'Dormant Acct') Client with all accounts being closed (see 'Closed Acct') U s i n g such a data dictionary i m p r o v e s the efficiency o f c o m m u n i c a t i o n s between business and IT p e o p l e . T h e r e should be n o questions about what the business people are r e q u i r i n g and w h e r e it can be found. W h e n we h a v e clarified what data is n e e d e d and w h e r e it can bee allocated, we should g i v e an a n s w e r to the question " h o w " the data can be extracted or e x p o r t e d from the particular data source. Usually the first step[s] of the technical solution is very close DATORZINATNE UN INFORMACIJAS TEHNOLOGIJAS to the technical solution of the data s o u r c e itself. It m e a n s a special p r o g r a m should be developed that extracts specific data from all data s o u r c e s and puts it together in a Data Staging area. Figure 3 s h o w s a small s c h e m a of data extracting from the different data sources and gathering the particular data together in the Data Staging area. S o m e financial figures of the c u s t o m e r s are taken to build a c u s t o m e r profile. At the Start of building the Data S t a g i n g application the p r i m a r y goal is to w o r k out the infrastructure issues, including connectivity, security and file transformation. Figure 3: An Example of data extraction processes from different data sources and putting it together in the Data Staging area. Certainly data in a Data Staging area are stored in s o m e particular structure and therefore the way in which the data c o m e s into these structures should be strictly predefined and c o m p l e t e l y d o c u m e n t e d . T h e n a m e of s u c h a d o c u m e n t is data m a p p i n g . Data m a p p i n g s h o w s a c o m p l e t e w a y in w h i c h the particular data from s o m e source system is transported (and transformed in b e t w e e n if n e c e s s a r y ) to a Data Staging area. C o r r e s p o n d i n g to a dimensional data model (generally that is a data model for the Data W a r e h o u s e ) there are two different types of Data Staging, each for one of two main c o m p o n e n t s of the dimensional m o d e l : d i m e n s i o n a l Data Staging and fact Data Stasiimi. Dimensional Model - a specific discipline for modelling data that is an alternative to entityrelationship (ER1 modelling. Janis Benefelds. Data Staging in the Data 39 Warehouse A fact tabic is the primary table in each dimensional model that is meant to contain measurements of the business. The most useful facts arc numeric and additive. Every fact table represents a many-to-many relationship and every fact table contains a set of two or more foreign keys that join to their respective dimensional tables. A dimensional table is one of a set of companion tables to a fact table. Each dimension is defined by its primary key that serves as the basis for referential integrity with any given fact table to which it is joined. | 2.1.1. Dimensional tabic staging. There are three t y p e s of processing dimensional data: overwrite, create a new dimension record, and displace the c h a n g e d value into an " o l d " attribute field. T h e second one is the t e c h n i q u e usually used for h a n d l i n g occasional c h a n g e s in a dimension record. W e shall actually focus on that type of dimensional data processing. In terms of c o m p l e x i t y the major problem could be the identification and extraction of data that has been c h a n g e d since the last data extraction instead of extraction of all data every time. It is c o n v e n i e n t to have a m e c h a n i s m to select new or c h a n g e d data only. Ideally there is an additional c o l u m n in the source data that shows the last change date and time or j u s t a y e s / n o stamp that indicates w h e t h e r the particular data is or is not changed since the last data extraction. That m e a n s that only these data will be the subject of the Data S t a g i n g . Taking into account that usually s o m e of the data source information s y s t e m s are either delivered by an external v e n d o r or long a g o " i n - h o u s e " s y s t e m s without any k n o w l e d g e about their contens and without full d o c u m e n t a t i o n , there may be s o m e p r o b l e m s to c h a n g e these systems for h a v i n g such a s t a m p indicating a c h a n g e status a c c o r d i n g to the last data extraction p r o c e s s . That m e a n s there should be another way to d e t e r m i n e the existence of the n e w r e c o r d s and changes in the old ones. In this case the easiest way is to pull the current snapshot of the d i m e n s i o n table in its entirety and let the transformation process deal with d e t e r m i n i n g what has changed and how to handle it. C h a n g e determination could be d o n e by c o m p a r i n g the actual data snapshot either with y e s t e r d a y ' s data snapshot or with the current Data W a r e h o u s e dimensional table. T h i s w a y is more t i m e - c o n s u m i n g and complicated, f i g u r e 4 s h o w s data flow for this t y p e of d i m e n s i o n a l Data Staging. Data source Yesterday's dimensional table data Today's dimensional table data / Compare -»•-. and allocate Vdifl'crences/ New Records to be inserted in the D W I I Surrogate key dimension table . Generation, Figure 4: Allocation of new and changed records in dimension data 40 DATORZINATNE UN INFORMACIJAS TEHNOLOGIJAS B e c a u s e of the large a m o u n t of data to b e extracted, c o m p a r e d (or transferred), special attention to usage of software a n d h a r d w a r e r e s o u r c e s should b e paid. Special c o m p a r i s o n m e t h o d s could b e used for p e r f o r m a n c e i m p r o v e m e n t , such as the Cyclic R e d u n d a c y C h e c k s u m ( C R C ) . T h i s m e t h o d calculates a c h e c k s u m for a particular string - it could b e a record of a table or a w h o l e text file or s o m e t h i n g else. T h e c h e c k s u m calculated for t h e n e w data is t h e n c o m p a r e d with the c h e c k s u m of the y e s t e r d a y ' s data that is already calculated. If these c h e c k s u m s a r e equal there is no need to c o m p a r e details for particular details. T h e U s a g e of such m e t h o d s substantially saves lime and h a r d w a r e resources. 2.1.2. Fact table staging Facts usually represent a s n a p s h o t o f a d a t a s o u r c e at that particular time point. A c c o r d i n g l y facts n o r m a l l y a r e the n u m e r i c attributes of the subject of t h e transactional system, like daily balance of an a c c o u n t , sales a m o u n t o f a product, etc. P r o b l e m s regarding extracting o f such data from the data s o u r c e n o r m a l l y are c o n n e c t e d with the a m o u n t of data to be p r o c e s s e d and w i t h t h e c h a n g e a b i l i t y o f data. P r o b l e m s regarding data a m o u n t s usually can b e solved on a c c o u n t of technical issues or data processing p e r f o r m a n c e optimization. A l m o s t e v e r y transactional s y s t e m p r o v i d e s today on-line services a n d that m e a n s transactions are p e r f o r m e d all the t i m e ( 2 4 h o u r s / 7 d a y s ) . The problem g r o w s even b i g g e r in case o f a m u l t i n a t i o n a l s y s t e m , w h e r e m a n y time zones are involved. To get a s n a p s h o t of such a s y s t e m in the particular t i m e point, s o m e technical tricks can help. T h e r e c a n b e several solutions. In current data base m a n a g e m e n t s y s t e m s , like Oracle, a b a c k u p t e c h n o l o g y is p r o v i d e d without d i s c o n n e c t i n g users or shutting d o w n a d a t a base - a b a c k - u p is created taking into account the events in the s y s t e m logs. A c c o r d i n g l y fact data could b e extracted from a b a c k - u p or directly from t h e system using t h e s a m e t e c h n o l o g y . Actually, normally every s y s t e m has s o m e h o u s e k e e p i n g p h a s e for s u p p o r t i n g s o m e internal functionality (like E O D - end of d a y ) that m a k e s a s y s t e m constant for s o m e t i m e . W h e t h e r the s y s t e m serves business transactions or n o t d u r i n g this t i m e - o u t d e p e n d s on the particular system - there are s o m e distinctions. S u c h t i m e - o u t is a g o o d place for fact data (and d i m e n s i o n a l data as well) extracting. Conclusion. The Financial industry differs from t h e o t h e r major sectors of b u s i n e s s with its nonstop b u s i n e s s . B e c a u s e of m a n y c h a n n e l s ( A T M , Internet, W A P . etc.), w h i c h banks provide, t h e customer has t h e possibility t o d o b u s i n e s s all t h e time any day. A Similar situation exists in the t e l e c o m m u n i c a t i o n b u s i n e s s as well. That is the N ° l p r o b l e m for d a t a extraction, w h i c h s h o u l d b e solved. N o r m a l l y every IS operating in the financial industry h a s already solved the p r o b l e m o f on-line transactions. There are s o m e t e c h n i q u e s that ensure on-line data p r o c e s s i n g and daily (weekly, m o n t h l y , etc.) data servicing at t h e s a m e t i m e . T h e s e data servicing p r o c e d u r e s ( h o u s e k e e p i n g , calculations, etc.) could be p o p u l a t e d with data extraction. Problem N°2 is the large a m o u n t of h e t e r o g e n e o u s data (funds transfers, loan a g r e e m e n t s , securities, trust operations, i n s u r a n c e , etc.) to be processed. T h a t could be solved by 1) using optimal data allocating a l g o r i t h m s t o find data to b e p r o c e s s e d and 2) fast data extraction m e c h a n i s m s . A Data allocating m e c h a n i s m could b e , for instance. Janis Benefelds. Data Staging in the D a t a Warehouse 41 Cyclic R e d u n d a c y C h e c k s u m m e t h o d or using different ' c h a n g e d d a t a ' flags in the source s y s t e m s . A s fast a data extraction m e c h a n i s m specific tools could be used or just m a k i n g a copy and p r o c e s s i n g 9 0 % of data instead of selecting 9 0 % o f data and processing all of the resulting data set. 2.2. Data cleaning, transforming and sorting or data standardization This is the place w h e r e an additional value to the data actually is added. Usually there are s o m e activities b e t w e e n data extraction from data source systems and data import into the data target. T h e s e activities ensure that data in the right structure, format, and s e q u e n c e , and that quality should be adequate for the data model of the data target. T h e Data should satisfy all logical rules provided by b u s i n e s s analysts and all physical rules that c o m e from the D a t a W a r e h o u s e data model and end user applications. 2.2.1. Data quality Actually, most o f all data transformation deals with data quality. T h e r e is no reason to a s s u m e the quality of data in O L T P systems a l w a y s will be acceptable for the D a t a W a r e h o u s e . S o m e data validations can be integrated directly in the source ( O L T P ) to p r e s e r v e data from logical and structural mistakes. In any case, there are s o m e p r o b l e m s . First - every data validation p r o c e d u r e m a k e s the system slower and it conflicts with a definition o f the O L T P "on-line (identical "fast") transaction p r o c e s s i n g " . S e c o n d - even if w e create all necessary data validations in O L T P , t h e r e always will arise s o m e n e w r e q u i r e m e n t s or conditions against data to be realized in the Data W a r e h o u s e . It is easier, cheaper and more secure to realize data cleaning and transformation p r o c e d u r e s outside of the data source. In the Situation w h e n data are validated after they h a v e been captured, another p r o b l e m arises - h o w to edit t h e m n o w ? If a validation error occurs by data capture, they can b e entered o n c e m o r e till they are correct. W h e r e a s if a data validation error is discovered during a fully a u t o m a t e d data cleaning and standardization p r o c e s s , there should be s o m e p r e v i o u s l y described algorithms to m a n a g e the situation. A u t o m a t i c data editing algorithms s h o u l d be a p p r o v e d by business analysts and every data change should be stored in the L O G files. W h e r e v e r and w h e n e v e r data are modified before importing t h e m into the Data W a r e h o u s e , they should be at the best quality possible. T h e r e are attributes that indicate the level of data quality. 42 DATORZINATNE UN INFORMACIJAS TEHNOLOGIJAS Table No.2 "Attributes i n d i c a t i n g the data q u a l i t y . " Attribute ACCURATE COMPLETE CONSISTENT UNIQUE TIMELY Description means data should be accordingly equal to the data extracted from data sources. All data modifications, standardizations and other processes performed by data transforming arc accountable and an explanation of every difference can be found in audit trails - LOG files; means the particular data set (data mart) should contain absolutely all data and nothing over accordingly the Data Staging procedures. If one particular data mart should contain all active customer records then all customers with an 'active' status should be included there and all customers with a status not equal to 'active' are not included there; means during data transformation process they should not lose their meaning and value. These arc situations when one should solve problems regarding to rounding of numeric, interpretation of aggregated values and other similar issues; means data should not have duplicates. Especially when data are extracted from more than one source. The unique ID from business understanding must be allocated or generated for the same business unit and business analyst improved algorithms used for data joining from multiple sources; means data should be current. Nobody is interested in out of date information. The data used in the Data Warehouse should be taken from and validated against 'fresh' data whenever possible. Of course, 'fresh' has many meanings in different businesses. If we are talking, for instance, about hotel chains, the 'fresh' data could be a turnover of the last day or even week. Whereas if we are talking about financial institutions, like money markets or stock exchanges, then minutes or even seconds is a very long time. To e n s u r e the data characteristics m e n t i o n e d a b o v e m a n y p r o b l e m s h a v e to be solved. T h e s e are p r o b l e m s r e g a r d i n g t h e data itself, not p e r f o r m a n c e p r o b l e m s of h a r d w a r e or software. Janis Benefelds. Data Staging in the D a t a Warehouse Table ho. 3 "Types of problems to be solved regarding data quality" Type Description Inconsistent use of code or different look-up values The same value could have a different ok key in different information systems. Sex, for instance, in one source could have a look-up table like M - male and F - female, while in another system I - male and 2 female. These values should be standardized during the data transformation process by unification of the particular values; Unofficial or undocumented functionality Sometimes some of the fields are used for another reason than documented in the user manual or data in the particular field are unclassified. A big mistake could be the address fields that arc not structured as an address usually is - five text fields of 25 characters length, for instance. From the business point of view the content of these fields is gold, but it is absolutely unusable for data processing; Overloaded codes Usually overloaded codes occur in the systems that are developed a long time ago when every saved bite was very important. That means, a single code was used for multiple purposes. If the customer is an individual then a particular field could be used for birth date. If it is a corporate customer - for registration date: Evolving data Systems that have evolved over time may have old data and new data that uses the same fields for different purposes or where the meaning of the codes has changed over time. It is very important if data are planned to load into Data Warehouse from historical data archives; Missing, incorrect or duplicate values Names and addresses are the classic example of this problem. Transactional systems do not care about data that are not needed for performing the particular transaction. It does not matter whether the name of the customer is IBM or I.B.M. But they can end up looking like two different customers in the warehouse; 1 w o u l d like to a d d o n e more type of p r o b l e m from m y o w n experience. T h a t relates to the different d a t a formats - data should be in the s a m e format to be processed automatically. Table No. 3 (continued) "Types of problems to be solved regarding data quality" Different data formats When Gathering data from many sources, data could be in different formats. That refers to any . data format: date Cdd.mm.yy'; 'YYYY-MON-DD': . . . ) . numeric C0.99':'9.9999E"; ...) etc. To use such data for common data processing they have to be in one previously described format; If it is p o s s i b l e a feedback from Data Staging b a c k to the data source could be c o n s i d e r e d . This is also a w a y to improve the data quality in the data sources. H o w e v e r it should be estimated very critically, b e c a u s e data sources - analytical transactional s y s t e m s — most o f all are m a d e for data p r o c e s s i n g captured by users. Data modifications m a d e by a n o t h e r software, like Data S t a g i n g p r o c e d u r e s , could not be the best w a y of i m p r o v i n g data quality in the analytical s y s t e m . 44 DATORZINATNE 2.2.2. Data UN INFORMACIJAS TEHNOLOGIJAS integrity Besides data quality this part of D a t a S t a g i n g is r e s p o n s i b l e for data integration as well. E n s u r i n g of data integrity m e a n s s u r r o g a t e key (as p r i m a r y key) generation for dimensional data tables and fact table p o p u l a t i o n (or even creation) using j u s t generated surrogate keys. All d i m e n s i o n a l tables h a v e single part k e y s , w h i c h , by definition, are primary keys. In other w o r d s , the key itself defines u n i q u e n e s s in the dimensional tables. It is a significant m i s t a k e to use, for u n i q u e n e s s o f d i m e n s i o n a l table record, an S Q L - b a s e d date stamp, a set of various attributes in the d i m e n s i o n a l table, production keys used in the data s o u r c e . In most cases an integer m a k e s a great surrogate key. A 4byte integer, for instance, can contain 2 ~ v a l u e s , or m o r e than two billion positive integers starting with 1, and that is e n o u g h for j u s t about a n y d i m e n s i o n . ? Since fact tables are a l w a y s a s s o c i a t e d with d i m e n s i o n a l tables, surrogate keys should replace c o r r e s p o n d i n g foreign k e y s there. Figure 5 s h o w s a dimensional data model w h e r e generated surrogate k e y s ( T i m e _ k e y , C u s t o m e r _ k e y , A c c t k e y ) of dimensional tables are u s e d as foreign k e y s in t h e fact t a b l e instead of production keys (Date. Ctistomer_ID, A c c t _ l D ) . Customer dimension Time dimension Time_key (PK) Date Day_of_week Day_of_year Week_number etc. Daily balance fact table Account dimension Acct_key (PK) Acct ID Acct type Currency Ledger_balance etc. T i m e k e y (FK) Acct_key (FK) Customer_key (FK) Ledgerbal Acct_ccy Ledger_bal 1 Base ccy Customer_key (PK) CustomerlD Customertype Name Address etc. Figure 5: Example of using surrogate keys instead of production keys in the Daily balance fact table For better q u e r y i n g p e r f o r m a n c e besides d i m e n s i o n a l a n d fact data, the Data W a r e h o u s e has the a g g r e g a t e date as w e l l . Since the D a t a W a r e h o u s e is used m o s t of all to g e t summarized d a t a over any d i m e n s i o n a l attributes, a significant data aggregation would be performed b y every data request. T h e r e f o r e data b a s e m a n a g e m e n t systems provide an o p p o r t u n i t y to maintain a g g r e g a t e d data - materialized v i e w s or data where subtotals are already calculated up to a predefined level of granularity. Figure 6 shows how an aggregate d a t a table is h a n d l e d that ensures a n s w e r to data q u e r y about product sales subtotals for a particular w e e k or a few w e e k s . Janis Benefelds. Data Staging Aggregate Daily balance fact table Day_of_year (FK) => Acct_key (FK) SUM(Ledger_bal) MIN(Ledger_baI) iMAX(Ledger_bal) AVG(Ledger_bal) Acct ccy SUM(Ledger_bal 1) MlN(Ledger_baI 1) MAX(Ledger_ball) AVG(Ledger_ball) Baseccy in t h e D a t a 45 Warehouse Time dimension Time_key (PK) Date Day of week Day of year Week number etc. Product dimension Acct_key (PK) AcctJD Acct_ type Currency Ledger balance etc. Sales fact table • « T i m e k e y (FK) Acct key (FK) Customer key (FK) Ledger bal Acct ccy Ledger ball Base ccy Figure 6: Example of the aggregate Daily balance fact table. Conclusion. I would say that data quality is o n e of the major units of m e a s u r e (the second one is data accessing p e r f o r m a n c e , but that is not the subject o f this topic) for the Data W a r e h o u s i n g . T h e benefit that users obtain from the u s a g e of the data w a r e h o u s e is as large as how clean and g o o d are the data. So, if data are inaccurate, the user gets the wrong picture of the b u s i n e s s and a decision that is b a s e d on that picture could be wrong. N o recourses s h o u l d be e c o n o m i z e d to i m p r o v e d a t a quality. R e g a r d i n g the data that b a n k s capture a n d m a i n t a i n t h e m s e l v e s - that is all their o w n business, and they do everything to avoid ' d i r t y ' data. S o m e t i m e s a solution could be s o m e administrative order to stop doing things in the w r o n g way. Data that w e get from outside of the bank, or even the financial industry, should h a v e a warranty, e.g., the data supplier should be trustworthy and secure e n o u g h . Usually these are s o m e state institutions, b u t they could be s o m e c o m m e r c i a l o r g a n i s a t i o n s ( i n s u r a n c e c o m p a n i e s , travel agencies, etc.) as well contract about p r o v i d i n g data should contain r e q u i r e m e n t s and warranties regarding data quality and security. A separate Data Q u a l i t y d e p a r t m e n t within the organisational structure would help significantly as well. S u c h a d e p a r t m e n t could deal with ' d i r t y ' data - to ensure the certain level of the data quality by developing data transformation and cleaning algorithms as well as l o o k i n g for s o m e external data sources that could serve as a standard for data c o m p a r i s o n . 2.3. Data loading into the Data Warehouse, Data Mart or data loading Data loading into the Data W a r e h o u s e m e a n s physical data transportation from the Data Staging area ( d a t a b a s e A) to the Data W a r e h o u s e (database B). T o populate an existing d a t a b a s e with n e w data usually s o m e data m a n a g e m e n t p r o b l e m s should be 46 DATORZINATNE UN IN F O R M A C ! J A S TEHNOLOGIJAS solved. T h i s refers to such technical t e r m s like table spaces, table partitioning, indexing, data load method (usually bulk load), audit action, m e m o r y , parallel processing, network architecture etc. T h e software, h a r d w a r e and time resources can be significantly saved by configuring items like these. T o s p e e d up the load cycle the frequency of load can be c h a n g e d , especially regarding s o m e a g g r e g a t e data. If s o m e monthly totals should be a c c u m u l a t e d , it d o e s not m e a n that data load can be performed only o n c e a month. It is possible to a c c u m u l a t e data on a daily basis by incremental load. T h e n s o m e additional c h a n g e s to the presentation layer should be a d d e d - whether users see ' m o n t h - t o - d a t e t o t a l s ' (instead o f "totals per m o n t h ' ) or it could be restricted by software to see just full m o n t h s (the p r e v i o u s o n e will be the n e w e s t then). B e c a u s e of data integrity the s e q u e n c e for l o a d i n g data into the Data W a r e h o u s e is prescribed accordingly. First of all all d i m e n s i o n a l data arc loaded in the right sequence to e n s u r e star schema data model data integrity. W h e n all d i m e n s i o n a l (look-up) data tables are populated, fact data and a g g r e g a t e d data are loaded and/or calculated. If necessary, s o m e additional data validation can be p e r f o r m e d . For duplication and possible repetition p r o c e s s e s data transportation (load) supports data a r c h i v i n g and back-up. T h a t m e a n s w e can repeat s o m e data transformation and load p r o c e s s from any point of the staging p r o c e s s . That could be n e e d e d if s o m e errors are found and removed, or if data load s h o u l d be repeated b e c a u s e of s o m e technical disaster. Such a duplicate table loading t e c h n i q u e is s h o w n in Figure 7. B a c k - u p s of load data are needed for t o m o r r o w ' s i n c r e m e n t a l load as well (see 2 . 1 . 1 . D i m e n s i o n a l table staging). (From t h e —- staging area) (1) Load tact table load ( 6 ) Rename liiad tact 1 to active tact I factl (7) W H comes up active factl •(3) Index / (4) W H down (2) Duplicate 3up!ie; b L-CK! o d j_afca< lI i ( ) 8 «<"P.foa' to load tact 1 (5) Rename acme factl to old tact I dup fact 1 (9) Drop l!ie okl tact I table old factl Figure 7: Duplicate table loading technique. Janis Benefelds. Step 0 Step 1 Step Step Step Step Step Step Step Step 2 3 4 5 6 7 8 9 Data Staging in the D a t a 47 Warehouse The active_factl table is active and queried by users. The load_fact I table starts off being identical to active_fact I. Load the incremental (different) facts into the load_factl table. Duplicate load fact 1 into dup_factl for tomorrows load. Index Ioad_faet I table according to the final index list. "Bring down" the warehouse, forbidding user access. Rename the active_factl table to old_factl (back-up). Rename the loadfactl to a c t i v e f a c t l . Open the warehouse to user activity. Total downtime: seconds! Rename the dup_factl tabic to load_factl for performing of next load. Drop the old_factl table. 3 . Metadata S o m e tools that a u t o m a t e the transformation and load layer can actually reside on a third party platform and generate the c o d e necessary to perform those activities (see Figure 8 as well). A n d the most important thing that ensures such a possibility is metadata. T h e definition of m e t a d a t a is really simple - m e t a d a t a is data about data. D o e s it help a n y b o d y to u n d e r s t a n d what exactly this elusive information? In my opinion - no. M o r e descriptive definition could be as follows. M e t a d a t a is all of the information (or descriptive information) in the particular e n v i r o n m e n t that is not the actual data itself. S o m e a u t h o r s are e v e n talking about the " b a c k r o o m m e t a d a t a " and "front room m e t a d a t a " . T h e back r o o m metadata is process related a n d it guides the d a t a extraction, cleaning and loading p r o c e s s . T h e front r o o m metadata is more descriptive and it helps query tools report writers function s m o o t h l y . O f c o u r s e , both metadata types overlap, but it is useful to think about them separately. In this c a s e w e are interested in the first one - b a c k r o o m metadata. M e t a d a t a is like traditional systems d o c u m e n t a t i o n (in fact, it is d o c u m e n t a t i o n ) and at s o m e point the resources needed to create and maintain it will be diverted to other " m o r e u r g e n t " projects. Active metadata helps solve this p r o b l e m . A c t i v e metadata is metadata that drives a process rather than d o c u m e n t s it, while d o c u m e n t a t i o n of these p r o c e s s e s is an important issue as well. T h e following is the general m e t a d a t a n e e d e d to get the data into the Data Staging area and p r e p a r e it for l o a d i n g into one or m o r e data marts. Data extraction information • D e s c r i p t i o n of t h e source s y s t e m s ( s c h e m a s , formats etc.); • A c c e s s m e t h o d s , rights, privileges and p a s s w o r d s ; • D a t a transportation scheduling a n d results of them: • File u s a g e in the Data Staging area including duration and o w n e r s h i p : Data transformation information • Data m o d e l : • J o b specifications attributes: for joining s o u r c e s , stripping out fields and looking up 48 DATORZINATNE UN INFORMACUAS TEHNOUOCIJAS • Data m a p p i n g b e t w e e n sources a n d Data S t a g i n g area; • Y e s t e r d a y s copy o f production data to use as the b a s i s for D I F F C O M P A R E • Data update (refreshing) policies for each i n c o m i n g attribute; • Current surrogate key a s s i g n m e n t s and m e t h o d o l o g y of that; • Data cleaning, formatting, filtering and sorting: • Populating data for data mining (e.g., interpret nulls, scale n u m e r i c s etc.); Data load information • Target schema d e s i g n , source-to-target (staging a r e a to data mart) data flows and target data o w n e r s h i p ; • D B M S load scripts; • A g g r e g a t e definitions; • Data (especial a g g r e g a t e ) usage statistics and potential i m p r o v e m e n t s ; Audit, Job Logs a n d D o c u m e n t a t i o n • Audit records ( w h e r e exactly did this record c o m e from and w h e n ? ) ; • Data transform r u n time logs, s u c c e s s s u m m a r i e s a n d time stamps; • Information a b o u t software (version n u m b e r s ) and h a r d w a r e (configuration); • Security settings; Conclusion. Metadata is an important issue in terms o f d e s c r i p t i o n s . It describes (it should at least) any object and a n y process dealing with these objects. It is very h a r d to define or to s h o w - this is m e t a d a t a and that is not. T h e c o n t e n t o f m e t a d a t a usually describes the structure or nature of o t h e r d a t a (the real ones). M e t a d a t a , for instance is ER diagram, description o f table structures, d o c u m e n t a t i o n o f the p r o g r a m p a c k a g e s etc. 4. Relationships between Export, Transfer a n d Load. A s I h a v e already m e n t i o n e d data extraction, t r a n s f o r m a t i o n and load are very closely related to each other o n the o n e hand (usually in the small projects) and these parts could be very strictly separated o n the o t h e r hand (large projects s o m e t i m e s use separate software p r o d u c t s to m a n a g e data e x t r a c t i o n , transformation, loading and metadata in their entirety). S o m e t i m e s it is really i m p o s s i b l e to strictly separate which piece of software or h a r d w a r e belongs to one part of D a t a S t a g i n g and which one to another. For instance, let us s a y we h a v e j u s t o n e data s o u r c e s y s t e m , w h i c h operates in the Oracle, and data s h o u l d b e loaded in the D a t a W a r e h o u s e , w h i c h is d e v e l o p e d in an Oracle environment t o o . If h a r d w a r e and software c o n f i g u r a t i o n allows creating a database link between these t w o Oracle d a t a b a s e s , then Data S t a g i n g could be j u s t one SQL statement. Of c o u r s e , it is a very simplified e x a m p l e , but it is true: Janis Benefelds. Data Staging in t h e D a t a 49 Warehouse INSERT INTO CUSTUM_DIMENSION ( SELECT NEXTVAL.cust_seq cust_pk, cust_id, cust_name, TOJ3ATE(birth_date.'DD.MM.RRRR )birth_date. DECODE(scx,' 1 7 M 7 F ' ) sex FROM CUSTOMERS@;SOURCE ORDER BY cusMd ) - A n d there w e can see: Data extraction: Data transformation: Data transportation: SELECT cust_id, c u s t n a m e . birth_date. sex FROM CUSTOMERS® SOURCE NEXTVAL.custseq key generation TojDATE(birthdate.'DD.MM.RRRR') adjustment DECODE(sex,'L/M\'F') of code keys ORDER BY cust_id record order INSERT INTO CUSTUM DIMENSION - surrogate - data fomiat - unification - particular It actually m e a n s that from the technical point of view D a t a S t a g i n g looks like Figure 8. Data Data sources Warehouse Extract and | transformation programms J Data Data movement movement • > Cleansing, transformation. sorting Transformation, addition, update programms I programmsl Figure S: Role of data transformation and transportation tools within a Data Staging process. 50 DATORZINATNE L'N INFORMACIJAS TEHNOLOGIJAS 5. S u m m a r y A s a conclusion 1 w o u l d say that Data S t a g i n g is o n e of the main issues of the Data W a r e h o u s i n g one must pay m u c h attention to. Data S t a g i n g c o v e r s any action with data starting from extracting them from the source s y s t e m s and finishing with data loading into the presentation server ( D a t a W a r e h o u s e a n d ' o r Data M a r t ) . Data Staging itself does not include data presentation, e.g., data cannot be a c c e s s e d by the end users from the Data Staging area. Data A c c e s s by end users is p e r f o r m e d by other tools. Actions that Data Staging performs arc (at least, s h o u l d be) very short in t i m e , w h e r e a s the developing of these p r o c e s s e s could t a k e m u c h of time a n d m a n y other resources. The main tasks of Data S t a g i n g are 1) fast particular data extraction from the source s y s t e m s 2) fast data transformation, s t a n d a r d i z a t i o n and data quality assurance according to the business user r e q u i r e m e n t s and 3) fast data loading into the target according to the data m o d e l (usually d i m e n s i o n a l ) it requires. N o w a d a y s technologies usually p r o v i d e a range of possibilities to ensure fast physical data processing and t r a n s p o r t a t i o n - I w o u l d s a y . that should solve all p r o b l e m s regarding physical data e x t r a c t i o n , m o v i n g t h e m through the hardware infrastructure and loading t h e m into t h e data target, w h a t e v e r that may be. B e s i d e s that, the logical side of data p r o c e s s i n g still should be taken into account. This part of the data processing brings out the i m p o r t a n c e of t h e h u m a n sense or intelligence that should be both the driver and supervisor of the m a n n e r in w h i c h data is m a n a g e d . That means Data Staging d e v e l o p m e n t requires additional special skills and e x p e r i e n c e in the data m a n a g e m e n t area. In the case of D a t a W a r e h o u s i n g these special skills w o u l d be d i m e n s i o n a l data m o d e l l i n g and u s a g e of a n d m a n a g e m e n t o f the m e t a d a t a . It could be reasonable to look t o w a r d s o m e v e n d o r s that p r o v i d e tools for the functionality j u s t mentioned (dimensional data m o d e l l i n g and m e t a d a t a m a n a g e m e n t ) . In any case, it is very good to have at least a theoretical b a c k g r o u n d and u n d e r s t a n d i n g about how it should work. On the one hand there is no r e a d y solution (software) that will satisfy you for 1 0 0 % of your n e e d s , and on the other h a n d you will never d e v e l o p that functionality by yourself according to the budget, time a n d end user r e q u i r e m e n t restrictions. T h e truth is a l w a y s in the middle! The main deliverables (in terms o f the m e t a d a t a ) of the D a t a S t a g i n g would be: • Data dictionary • Data mapping o Data s o u r c e to Data S t a g i n g area o Data S t a g i n g area to data target • Description of the data transportation (extract, b a c k - u p and load) • Data quality'' checklist • Data transformation a l g o r i t h m s and a u d i t trails Conclusion Regarding the financial industry, data extraction a n d consolidation is very important in terms of c o m p l e t e n e s s and a c c u r a t e n e s s . Only c o m p l e t e and accurate data give the right picture - c o m m o n trends and a g g r e g a t e s . Janis Benefelds. Data Staging in the D a t a Warehouse 51 A g g r e g a t e d information is usually passed to a c o m p a n y ' s m a n a g e m e n t , external s u p e r v i s o r s and auditors. Each of t h e m requires special and very similar information. T h e y are different but c o m p a r a b l e ! Data r e q u i r e m e n t s are both regular and on demand. T h i s is a g o o d w a y to s h o w others that data is u n d e r control in o n e ' s c o m p a n y , especially if response times to irregular data requests are short enough. Besides the c o m m o n trends financial institutions focus on particular individuals or c o m p a n i e s as well. Therefore data o f the particular individual can be evaluated in relation with c o m m o n trends only partly. T h e other, the larger part o f an evaluation is based on a client's individual data — client profile. Actually, such a p p r o a c h to data analysis is part of C u s t o m e r Relationship M a n a g e m e n t ( C R M ) , which is very important for financial institutions. T h i s is exactly the reason w h y data quality should be the best possible. B a n k s , insurers and other financial industry players are m a r k e t i n g themselves as o n e s w h o are professionals in w h o m y o u can trust. A n d clients are e x p e c t i n g an a d e q u a t e attitude: " S o , if you know e v e r y t h i n g about m y finances, w h y d o you offer me a G o l d C r e d i t card? M y m o n t h l y i n c o m e is b y far insufficient to buy such a product!" S u c h a situation c o m e s from unclean and in consistent data, and it leads to the customer that d o u b t s whether the b a n k is dealing with his m a n y at a level that is professional enough. 6. F u r t h e r research Sequentially the next part after D a t a S t a g i n g is D a t a b a s e - either D a t a W a r e h o u s e , Data Mart or by a special Data M i n i n g tool predefined data structures. D e v e l o p i n g the right Data M o d e l and overall data architecture strategy could be the direction o f the next research. This includes understanding the difference between d i m e n s i o n a l and traditional O L T P design practices. Since Data M o d e l is d e v e l o p e d data is c o m i n g in. Data M a n a g e m e n t is a very important issue as well. By Data M a n a g e m e n t 1 m e a n such things like data storage solutions, fast data access p r o b l e m s and other p r o b l e m s similar to it. A s data volumes a c c u m u l a t e d by Data W a r e h o u s e generally are extremely large, a technical and architectural solution should be ready to p r o c e s s it. T h i s part of the Data W a r e h o u s i n g definitely is a subject for following research as well. A s D a t a W a r e h o u s i n g p r o d u c t s (reporting, analysing, data mining etc. tools) are required b y business, it is very close to the particular end users. O f c o u r s e , every c o m p a n y has its o w n specific r e q u i r e m e n t s , but there are s o m e c o m m o n specificities from industry to industry. Financial industry specificities and possible solutions to them are g o o d a r e a for research and for h a v i n g key k n o w l e d g e too. 52 DATORZINATNE UN INFORMACIJAS TEHNOLOGIJAS References Kimball R., Reeves L., Ross M., Thornthwaite W. (1998), The Data Warehouse Lifecycle Toolkit, John Wiley & Sons, Inc. Groth R. (1998), Data Mining, Prentice Hall PTR Inmon W. H., (1999). Building the Operational Data Store - Second Edition, John Wilev & Sons, Inc. Oracle Corporation, (1996 - 2000), Oracle Documentation Library. Release 8.1.7 Kimball R., Merz R. (2000). The Data Webhouse Toolkit, John Wiley & Sons, Inc. Marco D. (2000). Building and Managing the Meta Data Repository. John Wiley & Sons. Inc. Todman C. (2001). Designing a Data Warehouse, Hewlett-Packard Company Greenberg P. (2001). CRM at the Speed of Light. McGrow-Hill Companies Imhoff C , Loftis L., Geiger J.G. (2001), Building the Customer-Centric Enterprise. John Wiley & Sons, Inc. Kimball R., Ross M. (2002), The Data Warehouse Toolkit - Second Edition, John Wiley & Sons, Inc. LATVIJAS UNIVERSITATES R A K . S T I . 20IM. 6 6 9 . sej.: D A T O R Z I N A T N E UN INFORMACUAS TEHNOLOGIJAS. 53.-60. 1pp. Generic Data Representation by Table in Metamodel Based Modelling Tool Edgars Celms Institute of M a t h e m a t i c s and C o m p u t e r Science, University of Latvia Raina bulv. 2 9 , L V - 1 4 5 9 , Riga, Latvia Edgars. Celms@,mii. lu.lv The foundation of a metamodel based modelling tool is its flexible facility to declaratively define the modelling method, notation and tool support. One of the problems for such tools is generic data representation by table. The paper proposes a method for declarative definition of a table based on the logical metamodel. The most interesting aspect of this definition is the specification of element selection criterion. Some practical examples of table specification by the described method are also explained. Key words: modelling tool, generic table editor, metamodel, editor definition. 1. Introduction T o d a y t h e r e are a lot of m o d e l l i n g tools on the market. M o d e l l i n g tools are designed to p r o v i d e all that is necessary to support major areas of m o d e l l i n g , including business p r o c e s s m o d e l l i n g , object-oriented and c o m p o n e n t modelling with U M L [ 1 ] , relational d a t a m o d e l l i n g , and structured analysis and design, etc. W h y is it not sufficient to use " h a r d - c o d e d " m o d e l l i n g tools? Let us consider, for e x a m p l e , the situation in b u s i n e s s m o d e l l i n g . On the o n e hand there exist several w e l l - k n o w n business m o d e l l i n g l a n g u a g e s ( I D E F 3 [ 2 ] , A R I S [ 3 ] etc.), each with a set of tools supporting it. B u t there are also Activity d i a g r a m s in U M L . w h o s e main role n o w is to serve b u s i n e s s m o d e l l i n g . T h e r e is G R A D E B M [4] - a specialized l a n g u a g e for business m o d e l l i n g and simulation. T h u s for the area of b u s i n e s s m o d e l l i n g there is no one b e s t or m o s t u s e d l a n g u a g e or tool, a n d each of t h e m e m p h a s i z e s its o w n aspects. For e x a m p l e , G R A D E B M presents very c o n v e n i e n t facilities for specifying performers of a task a n d its triggering conditions. H o w e v e r any n e w language feature d o e s not c o m e for free, and the l a n g u a g e b e c o m e s m o r e c o m p l i c a t e d for use. T h e r e f o r e one universal b u s i n e s s m o d e l l i n g language, w h i c h would s u p p o r t all wishes, w o u l d b e c o m e extremely difficult for use in simple cases. U M L for this situation offers o n e ingenious solution — s t e r e o t y p e s for adjusting the m o d e l l i n g l a n g u a g e to a specific area. In many cases the idea w o r k s perfectly, and it is well supported in several tools including G R A D E . T h e latest version of U M L - 1.4 e x t e n d s the notion o f stereotype, by assigning tagged v a l u e s to it and g r o u p i n g stereotypes into profiles (thus actually e x t e n d i n g the m e t a m o d e l ) . But currently n o tool fully supports it and already a new version of U M L 2.0 is c o m i n g with significant changes, particularly in the area of Activity d i a g r a m s . T h e p r o b l e m with flexible m o d e l l i n g e n v i r o n m e n t is e v e n more urgent for d o m a i n specific m o d e l l i n g , w h e r e countless special notations are used for separate d o m a i n s . T h e r e are p r o b a b l y several w a y s to solve such a p r o b l e m . You can d e v e l o p a m o d e l l i n g tool specially for any specific modelling m e t h o d but this w a y can be very time and cost c o n s u m i n g . 54 DATORZINATNE UN INFORMACIJAS TEHNOLOGIJAS Y o u can m a k e a tool as universal as p o s s i b l e to s u p p o r t all needs. For example, there is A R I S tool by IDS prof. Scheer. w h o s e " h o m e notation" is the A R I S B M language. But it supports also the L'ML notation with Activity d i a g r a m s , as well as n u m e r o u s modifications of the main p r o c e s s notation via e E P C (office processes, industrial processes e t c ) . In general, A R I S tool can be characterised to b e extremely "wide", with about 110 different types o f d i a g r a m s , a n d frequently having about 100 different s y m b o l types per d i a g r a m ( m o s t , in fact, are predefined stereotypes of the basic ones). At the s a m e time, there are practically n o facilities for defining new stereotypes. In practice such a universal tool is difficult for u s e in simple cases. An alternative w a y is a c o m p l e t e l y m e t a m o d e l b a s e d generic m o d e l l i n g tool (previously called m e t a C A S E ) . S u c h a tool h a s n o built-in m o d e l l i n g m e t h o d o l o g y . It has to be filled up w i t h specific m e t a m o d e l and additional information to start modelling something. 2. Metamodel Based Modelling Tool In this paper s o m e aspects of the m e t a m o d e l b a s e d a p p r o a c h to b u i l d i n g flexible modelling e n v i r o n m e n t s are explained. T h e m e t a m o d e l c o n c e p t h a s b e c o m e p o p u l a r in recent years especially d u e to the p r i n c i p l e u s e d in L ' M L definition [ 1 ] . What is a metamodel in a m e t a m o d e l based m o d e l l i n g tool? To put it in short, a m e t a m o d e l is a class d i a g r a m c o n t a i n i n g all m o d e l l i n g c o n c e p t s , their attributes and their relationships. A m e t a m o d e l based tool has n o built-in m o d e l l i n g m e t h o d o l o g y . It has to b e filled up with specific m e t a m o d e l and additional information to start m o d e l l i n g s o m e t h i n g . An earlier alternative n a m e for the a p p r o a c h is m e t a C A S E . Let me m e n t i o n the key research in this area. P e r h a p s , the first similar a p p r o a c h h a s been m a d e by Metaedit [5], but for a long time its editor definition facilities h a v e b e e n fairly limited. T h e latest version n a m e d M e t a e d i t - n o w can s u p p o r t definition o f most used d i a g r a m types, but via very restricted m e t a m o d e l l i n g features ( n o n - g r a p h i c a l ) , the resulting diagrams c o r r e s p o n d i n g to the simplest c o n c e p t o f labelled directed g r a p h . T h e m o s t flexible definition facilities (and s o m e t i m e a g o , also t h e most p o p u l a r in practice, but n o w the tool is out of t h e market) seem to be offered by the T o o l b u i l d e r by Lincoln Software. Being a typical m e t a C A S E of the early nineties, t h e a p p r o a c h is based on E R m o d e l for describing the repository contents and on special e x t e n s i o n s to E R notation for defining derived data objects w h i c h are in a o n e - t o - o n e relation to objects in a graphical diagram. A m o r e a c a d e m i c a p p r o a c h is that p r o p o s e d by K o g g e [ 6 ] , with a very flexible, but very complicated a n d mainly procedural e d i t o r definition l a n g u a g e . Other n e w e r similar approaches are proposed by ISIS G M E [7], D O M E [8] and M o s e s [9] projects, with main e m p h a s i s for creating e n v i r o n m e n t s for d o m a i n - s p e c i f i c m o d e l l i n g in the engineering world. T h e richest editor definition possibilities o f them are in G M E . Several commercial m o d e l l i n g tools ( A R I S b y I D S prof. Scheer, System Architect by Popkin Software[10], etc.) u s e a similar a p p r o a c h internally, for easy c u s t o m i s a t i o n of their products, but their tool definition l a n g u a g e s have n e v e r been made explicit. T h e approach explained in this paper is in a sense a further d e v e l o p m e n t of the above mentioned a p p r o a c h e s . In this a p p r o a c h at first the d o m a i n m e t a m o d e l of the modelling area (logical m e t a m o d e l ) is built. T h e n the m o d e l l i n g method, notation and tool support are defined d e c l a r a t i v e l y , by m e a n s of a special m e t a m o d e l l i n g Eitg&ri Celms. Generic Data Representation by Table in M e t a m o d e l Based Modelling Tool 55 e n v i r o n m e n t . T h e s e activities also require extension of the metamodel ( a d d i n g s o m e presentation classes), b u t it is not relevant to the p u r p o s e s o f this paper. A very significant part of the tool support in m e t a m o d e l based m o d e l l i n g tools is flexible data representation facilities. T h e r e are s o m e m a i n principles how data can be represented in m o d e l l i n g tools: • Diagrammatic, • Hierarchical (e.g. "tree v i e w s " ) , • D a t a dictionaries, • Tables, • Object editors. O b v i o u s l y the m o s t p o p u l a r data representation form (modelling notation) in any d o m a i n n o w is d i a g r a m based. Therefore a very important part of the m e t a m o d e l based a p p r o a c h is the E d i t o r Definition L a n g u a g e ( E d D L ) for a simple and convenient definition of a w i d e s p e c t r u m of d i a g r a m m a t i c graphical editors [ 1 1 . 12]. N e v e r t h e l e s s not all information is c o n v e n i e n t t o represent by d i a g r a m s . T h e r e is also the necessity to have a possibility in m e t a m o d e l based m o d e l l i n g tools to define such facilities as flexible model c o n t e n t b r o w s i n g and flexible definition of an editor for a single m e t a m o d e l class instance. Perhaps one of the m o s t u n d e r v a l u e d data representation m a n n e r in m o d e l l i n g tools t o d a y is data representation by table. For e x a m p l e , in Rational Rose [13] L ' M L tool, in real size models it is c o m p l i c a t e d to b r o w s e through p a c k a g e s and classes in the m o d e l tree. Frequently a p a c k a g e contains m o r e than ten class d i a g r a m s with ten to thirty classes in each. That leads to a situation w h e r e there are h u n d r e d s of classes at o n e tree level and it is not easy to find the right one. It would be r e a s o n a b l e to represent such uniform information not in hierarchical (tree) form but in s o m e easily configurable tabular form. S o m e tools ( S y s t e m Architect) h a v e s o m e t h i n g like t a b l e s but with a v e r y restricted functionality (reports only). M o r e or less usable in practice tables are i m p l e m e n t e d in G R A D E tool but they a r e " h a r d - c o d e d " . A l t h o u g h all of t h e data representations principles m e n t i o n e d here are interesting p r o b l e m s in relation to m e t a m o d e l based m o d e l l i n g tools, this is out of s c o p e for this paper to explain all of t h e m . T h e paper introduces a m e t h o d for declarative definition of a table, b a s e d on the logical m e t a m o d e l . Partly the d e s c r i b e d a p p r o a c h has been developed within the EU E S P R I T project A D D E [14], see [15] for a preliminary report. 3. Generic Data Representation by Table T h e foundation of a m e t a m o d e l based m o d e l l i n g tool is its flexible facility to define declaratively the m o d e l l i n g method, notation and tool support. O n e of the p r o b l e m s for such t o o l s is generic data representation b y table. T h e paper p r o p o s e s a method for table editor definition based on the logical m e t a m o d e l . Certainly, the logical m e t a m o d e l for the selected modelling method should be defined before we start defining any of the editors used for lite tool. This logical structure is described by a L'ML class diagram [Lj. Fig. 1 s h o w s an e x a m p l e of a logical m e t a m o d e l for L'ML class diagram (here the L'ML class m e t a m o d e l is described by a 56 DATORZINATNE UN INFORMACUAS TEHNOUOGIJAS U M L class diagram). T h i s e x a m p l e will be used in the p a p e r to d e m o n s t r a t e the table editor definition features. jleteotype ot altr&-!( base d a s j Opeocon visiMty rniital value mJbpliaty Jan' stereotype rwiv iiereoiype ol das* ope ) 1 sggregatron is navigable n ability larg rale name Wrg rrullipfpaty AssodabonEnd role name muttiplidty ordering vistoihty Figure 1. Logical metamodel example (fragment) A table editor definition for a n e w table starts with the selection o f the domain objects to be represented by the table. It should be e m p h a s i z e d that the logical m e t a m o d e l itself is n e v e r modified d u r i n g this p r o c e s s . O u r goal is to define the c o r r e s p o n d i n g table editor, w h i c h in this c a s e should be a b l e to p r e s e n t the selected set of m e t a m o d e l class instances in a tabular form. F o r e x a m p l e . Fig.2 s h o w s a table which represents all instances o f Class in s o m e m o d e l . Classes are g r o u p e d by the association o w n e d _ b y ("Is in P a c k a g e " c o l u m n in the table) and o r d e r e d by t h e N a m e attribute. Edgars Celms. Generic Data Representation b y T a b l e in M e t a m o d e l Based Modelling 57 Tool Table View - Class • x| Objects - 1 9 [191" Name" Description Policy Claims Policy_owner_d Property claim Witness Is bi Package" Claims this is descrj Claims Claims Core entities Policy Core entities Insurance Claim Insurance Health claim Insurance Insutance company Insutance Policy Insurance Policy owner Insurance i ' Indicates tequited field. 'Sorted by: Is in Package, Name iGiouped by: Is in Package • Figure 2. Class table example Definition of a table consists essentially of t w o sections: h o w and w h a t data should be represented. H o w - it m e a n s to define such things as: • T a b l e c o l u m n s properties - the user should be a l l o w e d to define w h i c h c o l u m n s are visible in t h e table (based o n t h e m e t a m o d e l elements). For e x a m p l e , t h e C l a s s table in Fig.2 is configured t o s h o w three c o l u m n s , w h i c h c o r r e s p o n d to t w o attributes [name and description) and one association (owned_by) of the class Class in the m e t a m o d e l . Other attributes and associations of t h e Class are not visible in this table. T h e user should b e also able t o define for each c o l u m n in the table such properties as c o l u m n title, c o l u m n edit m o d e (either it is a l l o w e d to do in-place editing or s o m e " n a t i v e " object editor should be i n v o k e d ) , column ordering a n d g r o u p i n g (by w h i c h c o l u m n ( s ) the table is ordered or g r o u p e d ) , etc. • P r o m p t i n g / d i s p l a y form for associations - it is a very important feature to m a k e the table more usable for end users. For e x a m p l e , the association column ownedjoy for t h e table in Fig.2 is defined to be s h o w n as the value o f t h e attribute name of Package. In other w o r d s , it is the facility t o define the text assembly rules for c o l u m n s , w h i c h c o r r e s p o n d to associations in the m e t a m o d e l . T h e user can define the text to display as a c o n c a t e n a t i o n o f s e g m e n t s , each c o n s i s t i n g of a constant prefix, a variable part and a constant suffix. Each variable part is defined as t h e object attribute value located at the end of the chain defined b y associations (for e x a m p l e . Class.ownedjoy.name). T h e Association chain m a y b e e m p t y , or it must start at t h e m e t a m o d e l class c o r r e s p o n d i n g to the objects in the table. Attribute must b e l o n g to t h e m e t a m o d e l class at t h e e n d object of t h e associations chain. 58 DATORZINATNE L'N I N F O R M A C U A S TEHNOLOGIJAS • N a v i g a t i o n facilities for objects, attributes a n d associations - what to do in response to user activities. For e x a m p l e , what to d o on single or double m o u s e click on s o m e object attribute or association. • P o p u p menu configuration - w h e n defining p o p u p m e n u s for table objects the user should b e a l l o w e d to define m e n u items at least for the following actions: create a new object (for e x a m p l e , " N e w C l a s s " ) , delete the object represented in the table, open an object v i e w e r / e d i t o r ( " P r o p e r t i e s . . . " ) , copy and paste, execute any other p r o g r a m , etc. T h e a b o v e described facilities for table definition allow the user to define nearly any reasonable table form for practical use. T h e s e c o n d part of the definition o f the table is used to specify what data should be represented by the table. T h e specification of the e l e m e n t selection criterion (selection rule) is the most non-trivial and t h e most interesting aspect o f the table definition. A selection rule can be specified, s a y i n g w h i c h of t h e possible instances actually must appear in the table. T o be short, a selection rule is a B o o l e a n expression ( A N D , O R , N O T ) on link conditions (specifying that a certain s e q u e n c e o f m e t a m o d e l associations must link t h e given object to another object) a n d attribute value conditions (defined by an association chain, attribute at its e n d a n d a fixed attribute value). T h e actual expression language contains s o m e m o r e details (including simple existential quantifiers). E x a m i n e one m o r e e x a m p l e of the t a b l e . L e t ' s a s s u m e that w e need to define a table which contains S t e r e o t y p e s of Classes, w h i c h are contained in one concrete Package and these Stereotypes do not have Tag_definition ( s e e F i g . l ) . T h e r e are languages w h e r e it is possible to define selections or assertions of such a kind. Such languages, for e x a m p l e , a r e Object Q u e r y L a n g u a g e ( O Q L [ 1 6 j ) , w h i c h is used for q u e r y i n g an Object D a t a b a s e o r Object Constraint L a n g u a g e ( O C L [ l ] ) , which is used in the U M L to specify t h e w e l l - f o r m e d n e s s rules o f the m e t a c l a s s e s c o m p r i s i n g the U M L m e t a m o d e l . In O Q L a n d O C L t h e a b o v e m e n t i o n e d selection criterion c a n be written as ( w h e r e the concrete Package is given by the p a r a m e t e r c u r r _ p a c k a g e ) : In OQL: select st from Stereotype st where curr_package in st.stereotype_of_class.owned_by andthen is_undefmed(st.contains) In OCL: self.stereotype_of_class.owned_by->exists(pck:Package pek ~ curr_ package) and self.contains->IsEmpty() H o w e v e r in practice both t h e s e l a n g u a g e s are difficult to u s e for declarative definition of the tables. O Q L is a v e r y powerful and e x t e n s i v e language. O C L is a very concise a n d rather c o m p l i c a t e d to u s e l a n g u a g e a n d it is not intended for q u e r y i n g but for specifying static constraints in a m e t a m o d e l . In o r d e r to use O Q L or O C L y o u need to k n o w precise s e m a n t i c s o f the l a n g u a g e s . Let's say, that the p a p e r p r o p o s e s a m e t h o d to specify a selection rule for a table, which is a "reasonable s u b s e t " of O Q L or O C L with a graphical realization ( s e e Fig.3). " R e a s o n a b l e subset" in this case m e a n s - such a subset, w h i c h is sufficient for practical use and allows also an efficient i m p l e m e n t a t i o n . [ulnars Celms. Generic Data Representation by Table in M e t a m o d e l Based Modelling Tool 59 Select Condition 13 F Stereotype 3 I— contains-:* Tag_definition L- I stereotype_of_class -> Class - |I^{pi|. ^'i!*J^'|if^^i^TJ!^ s n H f~ has_attribute •> Attribute [+] r~ has_operation -> Opetation E r~ has_tagged_value •> TaggerJValue El [~ source •> BinaryAssociation H f (arget -> BinaryAssociation I OK Cancel _LJ J^J Help Figure 3. Selection criterion definition example T h e definition o f the selection rule is done by tree. T h e tree is s o m e view (fragment) of the m e t a m o d e l where t h e root node c o r r e s p o n d s to the m e t a m o d e l class for w h i c h w e want t o define the table (in o u r e x a m p l e t h e Stereotype class). Every next level of t h e tree c o n t a i n s n o d e s (classes) which are linked in the m e t a m o d e l with associations to the p a r e n t n o d e (class) in the tree. Fig.3 gives an example of such a tree where the a b o v e m e n t i o n e d selection criterion is defined. T h e tree row m a r k e d by t h e red " m i n u s " sign m e a n s that there might not be a link to a Tag definition instance. In other w o r d s , a c c o r d i n g to this a p p r o a c h it is possible in an easy, usable m a n n e r to c o m p o s e textual selection expression, w h i c h is based on m e t a m o d e l e l e m e n t s . W e can also specify s o m e additional constraints for attribute values for classes. It s h o u l d be e m p h a s i z e d , that unlike O Q L or O C L , in the p r o p o s e d m e t h o d it is not important t o " k n o w " such technical issues as cardinalities of associations or kind of collections ( b a g , set, list in O Q L or b a g , set, s e q u e n c e in O C L ) . Therefore t h e proposed method is easy to u s e in practice. Practical e x p e r i m e n t s have d e m o n s t r a t e d that the described m e t a m o d e l based table definition m e t h o d h a s a realistic i m p l e m e n t a t i o n and can reach an industrial quality of the defined editors. 4. Conclusions T h e a p p r o a c h to table definition d e s c r i b e d in this p a p e r permits us to define nearly any r e a s o n a b l e table for practical use, a n d t h e described m e t h o d also allows an efficient i m p l e m e n t a t i o n (in t h e sense of run-time necessary to build the tables from real a m o u n t s of data). Practical e x p e r i m e n t s have d e m o n s t r a t e d that t h e table editors obtained by the described definition m e t h o d can reach an industrial quality. H o w e v e r this a p p r o a c h can be m a d e significantly m o r e universal a n d applicable not only to tables but also for other similar data r e p r e s e n t a t i o n s , since it is generally a c c e p t e d that a m e t a m o d e l (class diagram) can be used for describing the logical structure of nearly any s y s t e m . 60 DATORZINATNE UN INFORMACUAS TEHNOLOGIJAS References [I] Rumbaugh, J., Booch, G., Jackobson, I.: The Unified Modeling Language Reference Manual. Addison-Wesley (1999) [2] Mayer, R., Mcnzel, C.. Painter. M.: IDEF3 Process Description Capture Method Report http://wu'w.idef.com/T)ownloads/pdf/Tdcf3_fn.pdf, Knowledge Based Systems Inc (1995) [3] Scheer, A.-W.: ARIS Business Process Modeling. 3rd edn. Springer-Verlag, Berlin Heidelberg New York (2000) [4] Barzdins, J., Tentcns, J., Vilums, E.: Business Modeling Language GRAPES-BM (Version 4.0) and its Application. Dati, Riga (1998) [5] Smolander, K . Martiin, P., Lyytinen, K., Tahvanainen. V-P.: Metaedit - a flexible graphical environment for methodology modelling. Springer-Verlag (1991) [6] Ebcrt. J., Suttenbach, R.. Uhe, I.: Meta-CASE in Practice: a Case for KOGGE. Proceedings of the 9th International Conference, CAiSE'97, Barcelona, Catalonia, Spain (1997) [7] Ledeczi A.. Maroti M., Bakay A.. Karsai G., Garrett J., Thomason IV C , Nordstrom G.. Sprinkle J., Volgyesi P.: The Generic Modeling Environment. Workshop on Intelligent Signal Processing. Budapest, Hungary, May 17, 2001 (available in PDF at http://www.isis.vanderbilt.edu) [8] DOME Users Guide, http://www.htc.honeywell.com/dome/support.htni [9] Esser, R.: The Moses Project. http://www.tik.ee.ethz.ch'~moses/ N'IiniConference2000/pdf'Overview.PDF [10] Popkin Software: System Architect, http://www.popkin.com/products/systcm_architect.htm r [ I I ] Audris Kalnins, Janis Barzdins. Edgars Cclms, Lclde Lace, Martins Opmanis, Karlis Podnieks, Andns Zarins.: The First Step Towards Generic Modelling Tool. Proceedings of Baltic DB&IS 2002, Tallinn, 2002, v. 2, pp. 167-1 80. [12] Audris Kalnins, Karlis Podnieks, Andns Zarins, Edgars Celms, Janis Barzdins.: Editor Definition Language and its Implementation. Proceedings of the 4th International Conference "Perspectives of System Informatics", Novosibirsk, 2001. LNCS, v.2244, pp. 530 - 537 [13] Quatrani, T.: Visual Modeling with Rational Rose 2002 and UML. 3rd edn. Addison Wesley Professional, Boston (2002) [14] ESPRIT project ADDE. http://www.fast.de/ADDE [15] Sarkans. L\. Barzdins, J., Kalnins, A., Podnieks, K.: Towards a Metamodel-Based Universal Graphical Editor. Proceedings of the Third International Baltic Workshop DB&IS, Vol. 1. University of Latvia, Riga. (1998) 187-197 [16] Cattel, R.G.G., Barry, D.: The Object Database Standard: ODMG 3.0. Morgan Kaufmann (2000) LATVIJAS UNIVERSITATES RAKSTI. 2 0 0 4 . 6 6 9 . sej.: D A T O R Z I N A T N E UN INFORMACUAS TEHNOLOGIJAS. 61.-79. lpp. Nondifferentiable Optimization Based Algorithm for Graph Ratio-Cut Partitioning Karlis Freivalds University of Latvia. Institute of Mathematics and Computer Science Raina blvd. 29, LV-1459. Riga, Latvia Karlis.Freivaldsramii.lu.lv, Fax: - 3 7 1 7 8 2 0 1 5 3 We propose a new method for finding the minimum ratio-cut of a graph. Ratio-cut is an NP-hard problem for which the best previously known algorithm gives an 0(log n)-factor approximation by solving its dually related maximum concurrent flow problem. We formulate the minimum ratio-cut as a certain nondifferentiable optimization problem, and show that the global minimum of the optimization problem is equal to the minimum ratio-cut. Moreover, we provide strong symbolic computation based evidence that any strict local minimum gives an approximation by a factor of 2. We also give an efficient heuristic algorithm for finding a local minimum of the proposed optimization problem based on standard nondifferentiable optimization methods and evaluate its performance on several families of graphs. We achieve 0 ( « ' ' ' ) experimentally obtained running time on these graphs. Key words: graphs, heuristic graph algorithms, minimum ratio-cut. Introduction B a l a n c e d cuts of a graph are difficult computation p r o b l e m s that are important both in t h e o r y a n d practice. Ratio-cut is the m o s t fundamental one since most of the others including m i n i m u m quotient cut, m i n i m u m bisection, and m u l t i - w a y cuts, can be easily a p p r o x i m a t e d using it [ 1 1 , 17]. Also several other important a p p r o x i m a t i o n algorithms, such as crossing n u m b e r and m i n i m u m cut linear a r r a n g e m e n t are based o n the ratio-cut [11]. Ratio-cut has m a n y practical applications, the m o s t important being V L S I design, clustering and partitioning [20, 12, 1]. Since ratio-cut is an N P - h a r d p r o b l e m [13], w e must seek a p p r o x i m a t i o n algorithms to s o l v e it in practically r e a s o n a b l e time. M a n y purely heuristic a l g o r i t h m s were d e v e l o p e d [ 2 0 , 22, 18, 6] m o s t of t h e m relying on simulated annealing, spectral m e t h o d s or iterative m o v e m e n t of n o d e s from o n e side of the partition to the other. A c o m m o n idea e x p l o i t e d by several authors [22, 2, 7, 18, 19] to i m p r o v e their quality is using multi-scale graph representation u s u a l l y obtained by edge contraction. At first a partition at t h e coarsest scale is o b t a i n e d and then refined to a m o r e detailed o n e by one of m e n t i o n e d algorithms. A l t h o u g h these algorithms m a y perform well in practice, no optimality b o u n d s are k n o w n for them. T h e best previously k n o w n algorithm with proven optimality b o u n d s finds an 0 (log »)-factor a p p r o x i m a t i o n , w h e r e n is the n u m b e r of n o d e s in the graph. It is based on reduction of the ratio-cut p r o b l e m to the m u l t i - c o m m o d i t y flow problem, w h i c h can be solved with p o l y n o m i a l time linear p r o g r a m m i n g m e t h o d s . Unfortunately, this method is not practical, since the resulting linear p r o g r a m is of quadratic size of the n u m b e r of nodes in the graph and c a n n o t be solved efficiently. T h e n , a p p r o x i m a t i o n algorithms 62 DATORZINATNE UN INFORMACIJAS TEHNOLOGIJAS [15, 16, 10. 21] w e r e discovered for the m u l t i - c o m m o d i t y flow problem itself m a k i n g this approach usable in practice. Several heuristic i m p l e m e n t a t i o n s [15, 9, 21] are based on this idea, s o m e o f t h e m quite effective and practical. T h e most elaborate one [9] can deal with up to 1 0 0 0 0 0 - n o d e graphs in r e a s o n a b l e t i m e . In this paper w e p r o p o s e a n e w w a y o f finding the m i n i m u m ratio-cut of a graph. We construct a nondifferentiable o p t i m i z a t i o n p r o b l e m w h o s e m i n i m u m solution equals the m i n i m u m ratio-cut value and use n o n l i n e a r p r o g r a m m i n g m e t h o d s to search for it. Since the problem is n o n - c o n v e x , w e m a y find only a local m i n i m u m . H o w e v e r , we s h o w that any strict local m i n i m u m g i v e s us a factor of 2 a p p r o x i m a t e cut. For that purpose w e introduce a notion of locally m i n i m a l ratio cut for which n o subset of nodes taken from one side of the cut and m o v e d to the o t h e r side decrease the cut value. W e establish a one-to-one c o r r e s p o n d e n c e b e t w e e n strictly locally minimal cuts and strict local m i n i m a of the p r o p o s e d optimization p r o b l e m . The reduction o f an N P - h a r d discrete p r o b l e m to a c o n t i n u o u s one is not a novel idea. For e x a m p l e the m a x i m u m clique p r o b l e m of a g r a p h can also be stated as an optimization p r o b l e m [24] and n u m e r i c a l o p t i m i z a t i o n m e t h o d s for finding the o p t i m u m m a y be used. H o w e v e r , for the m a x i m u m c l i q u e p r o b l e m no optimality b o u n d s of a local m i n i m u m are k n o w n . To s h o w that our m e t h o d is practical w e p r e s e n t an efficient heuristic algorithm for finding a local m i n i m u m of the p r o p o s e d p r o b l e m , w h i c h is b a s e d on the standard m e t h o d s of nondifferentiable o p t i m i z a t i o n and a n a l y z e its p e r f o r m a n c e on several families o f graphs. With the p r o p o s e d m e t h o d w e can find a good partition of 2 0 0 0 0 0 n o d e g r a p h s in less than o n e hour. Problem Formulation We are dealing with an undirected graph G = (K, £ ) , w h e r e F i s its node set and E is its edge set. T h e n o d e s of the graph are identified b y natural n u m b e r s from 1 to n. Each node ; has a weight d > 0, and each e d g e (i, j) has a capacity e,, > 0, satisfying the properties c = c c:„ = 0. W e define e,, = 0 w h e n there is n o e d g e in G. W e a s s u m e that there are at least t w o nodes with n o n - z e r o w e i g h t s . (A, A') denotes a cut that separates a set of n o d e s A from its c o m p l e m e n t A' = V\ A. T h e capacity of the cut C{A, A') is the sum of e d g e capacities b e t w e e n A and A'. T h e ratio of the cut R(A, A') is defined as follows ; y lh a A, A) 1 (1) We will focus on finding the m i n i m u m ratio-cut i.e. the cut (A. A') m i n i m u m ratio o v e r all n o n - e m p t y A, A'. with the Definition i . A cut (A. A') is locally minimal if for all n o n - e m p t y UdA, U i A: R(A, A') < R(A\U, A'\J U) and for all n o n - e m p t y U'CZ A', V =- A': R(A, A') <R(A\J If, A' \ L '). Similarly w e call a ratio-cut strictly locally m i n i m a l if strict inequalities hold in : this definition. Karlis h'reivalds. NnndifierciitiLtblc O p u m i / a i i u n B a s e d A k ' n r i l h m l o r G r a p h Ralio-Cul Partitioning Ratio-Cut as an Optimization Problem W e c a n assign a variable v . e R to each node i a n d consider t h e following o p t i m i z a t i o n p r o b l e m over x from R". m i n F ( x ) = _^/,j|-V, ~Xj\ (2) ^ < 7 , C / L | A - , . - A \ | = 1. s u b j e c t to 77(A) = (3) f./ef £*,-=fl. (4) id' W e a r e g o i n g to s h o w that this optimization p r o b l e m is equivalent to t h e ratio-cut p r o b l e m in the sense described below. Definition 2. A characteristic vector x components A \a, x'; = < [b, if for a c u t (.4, A') is defined such that its /e A , where (5) ifieA' 1 Ml 1 in A ieA' IQA le.-f It is straightforward from this definition that for a c u t (A, A') x constraints ( 3 , 4 ) , a n d F(x ) = R{A, A'). A satisfies t h e A Definition 3. F o r s o m e feasible x and s o m e p e R w e call the cut (P, 7") positional, if P = {/ x, <p\. P' = V\ P, and both P a n d 7 " are n o n - e m p t y . T h e ratio o f this cut R (x) = R(P, 7"). F o r a fixed x w e can speak o f m i n i m u m positional cut R (x) over all possible positional cuts obtained for different p: p mm R (x) mm = mmR R p (a) . pS Theorem 1. For each feasible x* o f (2, 3 , 4 ) Fix) > R (x'). it'there are at least t w o positional cuts with different values. iwn Also F(x') > Proof. Let us partition all nodes into three sets L i. L' , U as follows. r 2 1,' = m i n x , . vs*= , m i n a", , 3 R (x') nm 64 DATORZINATNE UN INFORMACIJAS TEHNOLOGIJAS If U , is empty there is exactly o n e positional cut, x is in the form of characteristic vector and F(x ) = R (x') that c o n c l u d e s this proof, o t h e r w i s e define : mm >•;, = minx, We create a s u b - p r o b l e m of (2. 3, 4) by r e d u c i n g the original one to a n e w variable min F,(v) = 2 ^c t j | y , - v, | + i<EU., / c i A 2> <4 ^ c , | v, + / , - y, | + ; ieG.JcG, 3 ' (6) J subject to H,(y) = 2 Y. > i d d h " >'• I X . l/(/ + ICU JtL'z J >'i ~ l + -/'c^A { (7) |c/,|v| + |L/-,|>S + | L / , | V , (8) =0, where c o n s t a n t s /, = x, - v . 3 1 - I , X, ^ ;cC-,./eC/, Next, we further restrict it w i t h y , < v'2 < y j and can d r o p the absolute value signs: min F (y) : = 2 (y, - y,) + ^(y.+lj-y.) ^c / ; (y 3 + - V,) + +K, (9) subject to /7<v) = 2 ^c/,c/,(y -y,)+ 2 ( V ; +lj-y ) 2 >\/t/ ^P=l (v 3 • 1; v,) • (10) Karlis Frcivalds. Nondif'fcrcntiahlc Optimisation Based Algorithm lor Graph Ralio-Cul Partitioning (11) V] (12) < V'2 < > ' j . W e h a v e obtained a locally e q u i v a l e n t linear p r o g r a m in the sense that for v the constraints ( 1 0 , 11, 12) a r e satisfied a n d F(x') = F {\'). A l s o , if w e can find a better solution for ( 9 - 1 2 ) w e c a n substitute the result back to the original p r o b l e m giving a better feasible solution for it. 2 F r o m t h e linear p r o g r a m m i n g theory it is k n o w n that w e can find t h e optimal solution b y e x a m i n i n g t h e vertices o f the polytope defined by the constraints - in our case that m e a n s w h e n o n e o f the inequalities in ( 1 2 ) is satisfied a s equality. Let us e x a m i n e both cases: H 2 ( V | , V l . V;,) = ( v - > • , ) 3 _dd = 2 ( v - v,) i 3 l ] - I. 1. inU^C.jci:, .here 1 = 2 + icUilM F~2 ( v ' i , > ' i , v ) 3 = 2 ( . v 3 ,JCL' , - v , ) : _ _ > , Z + C v / , + A ' ' t O ' i L't/'i . y e t ' , Ot/'j .yet', By substituting vj - Vi from p r e v i o u s equation we get L/ f^Ot-V,, >';,)= Z (1 -L)R(U, where 5 = 2 ' . X ' ' - ' . + ' U CA, tTjl + B^ + A*. case v =j'.v 2 Similarly w e get F 0 ' i - > : . = ( 1 - I ) /?(£/,. £A U t/.O - 5 , 2 W e c h o s e t h e solution y with y, = y if /x(Lj U t A , CAJ < /?(£/,. £A IJ c ' ) , otherwise c h o o s e the solution with i = y . It is evident that if both these ratio costs are n o n - e q u a l w e get a strictly smaller function value. W e substitute the solution back into the original problem o b t a i n i n g a n e w .v. x is a feasible solution o f (2, 3. 4 ) with a smaller or equal function value a n d the set LA m e r g e d to U\ or Uj. W e repeat the described process until l A is e m p t y . : 2 3 3 Let us analyze the resulting x and sets b\ and LA W e have F(x) = R(L'\, U ). where (L'i. U ) is s o m e positional c u t of x (in fact the minimal o n e ) , hence Fix ) > F(x) > 2 2 66 DATORZINATNE UN INFORMACIJAS TEHNOLOGIJAS R \ (.x )• It" w e had s o m e non-equal eu'.s c o m p a r e d d u r i n g the process, w e w o u l d have a strict decrease in the function and h e n c e the s e c o n d s t a t e m e n t of the t h e o r e m holds. m n D Definition 4. A feasible x is a local m i n i m u m of ( 2 . 3, 4) if there exists 8 > 0 such that Fi/) < F(x) for each .v satisfying ( 3 , 4) and | x-x' || < e. ; Definition 5. A feasible A is a strict local m i n i m u m o f (2. 3 . 4) if there exists s > 0 such thai F{x ) < Fix) for each a - x satisfying ( 3 , 4) a n d A - v* \\ < z. 1 T h e o r e m 2. Each strict local m i n i m u m .v o f (2. 3, 4) is in the form .v* - x" for some cut (A. A'). 1 Proof. W e need to prove that in a strict local m i n i m u m the expression (5) holds for s o m e a and />. and the correct values for a a n d /; are g u a r a n t e e d by the constraints (3) and (4). A s s u m e to the contrary that there a r e m o r e than t w o distinct values for the c o m p o n e n t s o f x . W e form the reduced p r o b l e m as in T h e o r e m 1 obtaining equations (9 - 12). We are able to do that since L\ is n o n - e m p t y d u e to our assumption. From the linear p r o g r a m m i n g theory' a strict local m i n i m u m can only be on the vertex of the p o l y t o p e d e n n e d by the constraints ( 1 0 - 12), h o w e v e r in our case y cannot be on the vertex since the constraints w h e r e equality holds at v are less then the n u m b e r of variables. C o n s e q u e n t l y , x c a n n o t be a strict local m i n i m u m for the original p r o b l e m . H e n c e our a s s u m p t i o n is false and the t h e o r e m is p r o v e n . There is one-to-one c o r r e s p o n d e n c e b e t w e e n strictly locally minimal cuts and strict local m i n i m u m s of (2, 3. 4) as stated in the following t h e o r e m . T h e o r e m 3 . x is a strict local m i n i m u m o f (2, 3. 4) if and only if .v is a characteristic v e c t o r of s o m e strictly locally minimal cut (A, A'). Proof. First w e prove that the characteristic Vector x o f a strictly locally minimal cut (A, A') c o r r e s p o n d s to a strict local m i n i m u m of (2, 3. 4). A s s u m e to the contrary that x is not a strict local m i n i m u m point; then by Definition 5 there exists an arbitrary s m a l l vector / ± 0 that leads to smaller or equal feasible solution ,v - .v' - t, F(x) <Fix'). B y the nature oi the constraints ( 3 . 4 ) there will be at least t w o positional cuLs of x. and sin.ee t was sufficiently small, one o f t h e m is (A, A'). T h e r e are two cases: (i) there exists a positional cut R (x) with value not equal to R(A. A'). Then by Theorem 1 there is a positional cut with value R (x) < Fix) < Fix') = R(A, A') which divides the nodes into sets 0", and (A. Since / was sufficiently small, both U and LA cannot contain nodes from both sets A and A' i.e. either 6 ) C A. or L) C A\ or UjC A, or .'A_C A' contradicting the local minimality of (A, A') by Definition 1. (ii) all positional cuts of.v are of the same value. Let us choose some positional cut (U,, L* ) = [A, A')- And again, since we can choose / small enough., either L\ c A, or V- C A', or LVcz A, or (A c A' which contradicts strict local optimality of (A, A'} by Definition 1. A 1 p r t : Secondly, we prove that each strict local m i n i m u m x of (2, 3, 4) is in the form of a characteristic vector of a strictly locally m i n i m a l cut. From T h e o r e m 2 we have that each strict local m i n i m u m is in the form of a characteristic vector A'' of s o m e cut (A. A'). W e need to show that this cut is strictly locally minimal. A s s u m e to the contrary that there exist S <z .•(, S 0, A \ S - 0 for w h i c h R(A \ S, A' U S) < R(A. A'). Denote U = A \ S. 1 Kivlis l-'rcivaLls. N i M i i i i l t c r c i i l i j b l c O p u m i / a t m n B;iscd A l g o r i t h m i o r G r a p h R j i j o - C u i PatUiiuimi^ Let us repeat the construction from the p r o o f of T h e o r e m 1 taking the sets Ui, U a n d U3 as 5 , U, A' c o r r e s p o n d i n g l y . W e obtain equations (9 - 12) and y with v, = y . From Definition 5 each small m o v e increases the function. L e t us take such feasible m o v e t that d o e s not destroy t h e groups L'i, U a n d L" but separates U\, from U . Let /' be the projection o f / o n t o the sets i •. U . £/ i.e. /' = (/,. ke U\, t p G L' , t \ r e (I,). Let us d e n o t e y " = y + / ' = y ~ at' where a is chosen such t h a t y ' - v . Since t h e constraints (10, 11) are linear and they have the s a m e value for y a n d y"', they will h a v e the same value for y ' , which is a linear c o m b i n a t i o n o f y a n d v"'. Obviously y' satisfies the constraint ( 1 2 ) . Similar a r g u m e n t s hold for the function: since F (y"') > F (y), also F:(y')> F (y) from t h e linearity of (9). From v w e can g o back to the original problem and obtain . y \ which is a characteristic vector of the cut {A \ S. A' [j S). Since H (x ) * / / . ( / ' ) = 1 a n d F{x') = F (y) > F (y) = R(A, A'). R(A \ S. A' U 5) > R(A. A') which contradicts o u r a s s u m p t i o n . 2 2 2 2 3 2 3 p 2 r L ; 3 2 2 c 2 2 2 • W e shall note that also for each non-strictly m i n i m a l ratio cut, its characteristic vector x' gives a local m i n i m u m of the function, h o w e v e r t h e converse is not true. T h e r e exist non-strict local m i n i m a o f (2, 3 . 4) with the function value not equal to any locally m i n i m a l cut value. In F i g . 3 a graph a n d their n o d e coordinates in a local m i n i m u m of (2, 3 , 4 ) arc s h o w n . It is a local m i n i m u m because n o small m o v e gives a decreased positional cut or a better function value. But none o f its positional cuts are locally minimal. T h e o r e m 4. T h e m i n i m u m ratio-cut R is equal to the o p t i m u m solution o f (2, 3,4). Proof. F r o m T h e o r e m 1 F(x) > R.. .. (x) > R. F o r the characteristic vector x m i n i m u m ratio cut F(x' ) = R. T h e claim follows i m m e d i a t e l y . • 4 n n o f the : T h e o r e m 5. Each locally minimal cut {A, A') is not greater than t w o times the m i n i m u m ratio cut. Proof. Let us d e n o t e the optimal cut ( 5 . B'). W e form four sets ADB. Af\B', A'PiB. AT\B' s h o w n in Fig. 1 From Definition 1 the ratio of each o f the cuts C\ - (ACiB, V 4 0 5 ) . C = (AnB\ V - APiB'). C , = (AT\B\ : ;TIB'), C = (A'DB. V - A'OB) is at least a s large as ratio o f C\ = (T, A') a n d C = (B. B'). : A 2 3 3 W e form a full g r a p h by taking each o f the sets AHB, Af)B'. A'PiB, A'DB' as the nodes o f the graph a n d assign edge capacities as the s u m of all edge capacities in the original graph b e t w e e n the c o r r e s p o n d i n g sets. W e obtain c\ + c + 4 d {d-> + x c^ R = d,(d l R t + - 2 2 R,>R . r d (d Ri>R\ . 2 2 c. — T d ^ ^ i (<:/, + d )(dj l c, + c-, + n 4 + c, 5 } c, +d) +d) . R,' 4 2 c, d (</, 4 <L_, A + L', + d - rf,) 2 (d +ci )(ci^J ) 4 3 d) fl,,- + d) / ? > A',2- - r c, i c. -r d, - 2 R+>R-.i y A 68 DATORZINATNE Fig. UN INFORMACIJAS TEHNOLOGIJAS L I l l u s t r a t i o n o f t h e p r o o f o f T h e o r e m 5. And the statement of the t h e o r e m to be p r o v e n translates to —— < 2 . W e verified this using symbolical c o m p u t a t i o n in M a t h e m a t i c a 4.0. See A p p e n d i x A for details. C o r o l l a r y 6. E a c h strict local m i n i m u m of (2, 3 . 4) is not greater than t w o times the m i n i m u m ratio cut. T h e p r o o f is straightforward from T h e o r e m 3 and T h e o r e m 5. T h e bound of T h e o r e m 5 is tight; the graph with 4 n o d e s s h o w n in Fig. 2 achieves this b o u n d . O n e can m a k e larger e x a m p l e s easily by substituting any connected graph with sufficiently high e d g e capacities in place of the n o d e s of the given graph. The Algorithm In this section w e present a heuristic a l g o r i t h m b a s e d on standard nondifferentiable optimization m e t h o d s for finding a local m i n i m u m of ( 2 , 3). F i n d i n g the m i n i m u m of a nondifferentiable function is one of several w e l l - e x p l o r e d n o n l i n e a r p r o g r a m m i n g topics [5, 2 3 ] . O n e of the possibilities is to a p p r o x i m a t e the nondifferentiable function with a Fig. 2. A graph achieving the boundary of Theorem 5. Fig. 3. A graph with assigned node coordinates in a local minimum of (2, 3. 4) where no positional cut is locally minimal. Kj/iis Frt'ivatils N n n d i t l e r c i u i a h l e O p t i m i z a t i o n Ba^cd AI^uriLhm for G r a p h R a t i o - C u t Partitioning 69 smooth o n e and apply o n e o f the w e l l - k n o w n algorithms to find its local m i n i m u m [5. 3]. Often a better a p p r o a c h is to handle it directly. Indeed, in our case w e obtain a very simple and fast algorithm. M o s t of the o p t i m i z a t i o n theory deals with convex p r o b l e m s for w h i c h algorithms with proven c o n v e r g e n c e c a n be d e v e l o p e d . M a n y of these m e t h o d s also w o r k for nonconvex functions finding a local m i n i m u m . However, then the c o n v e r g e n c e cannot be shown or c a n be s h o w n only in a local n e i g h b o r h o o d of s o m e local m i n i m u m which is not satisfactory in our case. T h e very basic algorithm of nondifferentiable optimization is a subgradient a l g o r i t h m [ 5 . 2 3 ] . W e will adopt it for s o l v i n g our p r o b l e m . Since we apply it to a n o n - c o n v e x p r o b l e m , it should be considered m o s t l y as heuristic, h o w e v e r practice s h o w s that it a c t u a l l y c o n v e r g e s to a local m i n i m u m of our p r o b l e m . T h e a l g o r i t h m is an iterative o n e . T h e iteration of the a l g o r i t h m consists of g o i n g in the n e g a t i v e direction o f a subgradient of the function by a fixed step and then performing a projection o n t o the constraint (3). T h e constraint (4) was introduced for technical r e a s o n s required in proofs and m a y be not c o n s i d e r e d in the algorithm. A n a p p r o p r i a t e s u b g r a d i e n t q of F c a n be calculated with the following equation for its c o m p o n e n t s 1& 1,A->0 w h e r e sign(x) 0,.v = 0 -1,*<0 C h o o s i n g the right step size is crucial for the c o n v e r g e n c e speed. O u r heuristic observation is that it s h o u l d b e proportional to the n o d e spread. W e c h o o s e the step equal to s t e p F f - . o r * ( . r otepP'acLor m a x -x, K m ) . w h e r e s t e p F a c t ^ r is s o m e parameter. Initially = 1 / 1 4 . Its update d u r i n g the algorithm is be discussed later. T h e projection is performed b y g o i n g in the direction of a s u b g r a d i e n t of the constraint until the constraint is r e a c h e d . F o r the constraint H we can write its s u b g r a d i e n t /• in a different form to allow its faster evaluation r.•,= t 7 , ^ c / s i g n ( . T . / - x . ) = d:( } - V(/ ) ( If w e sort the .v, v a l u e s a n d consider them in increasing order, then the needed s u m s can be updated i n c r e m e n t a l l y leading to linear time e v a l u a t i o n (not counting the sorting). T o perform t h e projection w e need to calculate the step length towards the constraint. T o simplify t h e calculations w e will a s s u m e that the ordering of .v, will not change d u r i n g the projection. Then the constraint function H b e c o m e s linear and the desired step can be easily calculated from the linear e q u a t i o n defined us the point of value 1 on the ray defined by the s u b g r a d i e n t from the current point. In the case of unit node w e i g h t s our a s s u m p t i o n is fulfilled. T o see this w e m u s t observe that the step in the function will a l w a y s lead to a point with If < 1. T h e n , if w e have unit n o d e w e i g h t s , r, < ;> for all / and k satisfying .v, < A,;, so after the projection the distance b e t w e e n them can only increase thus k e e p i n g the s a m e ordering. For the case of arbitrary n o d e weights such an a l g o r i t h m gives a u s a b l e a p p r o x i m a t i o n of the projection step length. 70 DATORZINATNE L'N I N F O R M A C I J A S TEHNOLOGIJAS The w h o l e algorithm starts with a r a n d o m initialization. W e assign a r a n d o m position for each node such that H < 1 and then perform a projection to obtain a feasible starting position. W e e x p e r i m e n t e d also with several other initializations, but obtained significant i m p r o v e m e n t s only for tree g r a p h s . S i n c e the o p t i m a l cut for trees can be found in linear time, w e c a n construct a starting position that reveals it and the algorithm will only perform a few iterations to confirm that the solution does not improve. Hence to get a c o m p r e h e n s i v e picture of the a l g o r i t h m behavior w e decided to consider r a n d o m initialization only. After the initialization w e perform iterations until c o n v e r g e n c e . T o tell w h e n the process is c o n v e r g e d w e track the m i n i m u m positional c u t values obtained at each iteration. T h e s a m e values will also tell u s h o w to c h a n g e the step size p a r a m e t e r s t e p F a c t o r . If the n e w cut value is lower than the p r e v i o u s one. then w e are m a k i n g progress a n d we should c o n t i n u e with the s a m e step size. If the value is higher than the previous o n e , then the step size is t o o large. If the cut d o e s n o t c h a n g e , then w e have a clue that the process has c o n v e r g e d . O f c o u r s e w e d o not h u r r y to m a k e decisions only from o n e iteration. Instead w e wait for a certain n u m b e r o f iterations controlled b y the constants M A . x _ O S C I L L A T I O N S a n d MAX_EQUAL before m a k i n g t h e decision. Such a delay also i m p r o v e s the c o n v e r g e n c e speed by a l l o w i n g to iterate longer with a larger step size To determine the positional cut value at each iteration, w e proceed as follows. W e consider the soiled s e q u e n c e of nodes, c a l c u l a t e the positional cut in each interval between t w o consecutive n o d e s , a n d take t h e m i n i m u m o n e . W e can d o it incrementally on the sorted sequence in t i m e 0(n r m) p r o v i d e d by the s o r t i n g , where m is the n u m b e r of edges in the graph. As suggested in o n e o f the exercises in [ 5 ] , the p e r f o r m a n c e of this algorithm can be improved b y taking the p r e v i o u s directions into account. W e add the previous direction to the current one reduced by s o m e constant R E DU C T I ON_ F AC TO R b e t w e e n 0 and 1. It models a heavy ball m o t i o n in the p r e s e n c e o f a force in t h e direction of the subgradient. In our e x p e r i m e n t s s u c h a modification with R E D U C T I O N _ F A C T O R = 0.95 performed substantially better. All the steps described before can be i m p l e m e n t e d to r u n in time 0 ( « - m). A d d i n g the time needed for sorting the nodes o n e iteration takes time 0(m - n log n). T h e n u m b e r o f iterations is hard to estimate s o w e will p r o v i d e e x p e r i m e n t a l data in the next sections. T h e constants MAX O S C I L L A T I O N S a n d MAX_EQUAL have the most impact on the iteration c o u n t a n d also o n t h e quality o f t h e obtained cut. S o we must select them carefully. After s o m e e x p e r i m e n t a t i o n w e c h o s e MAX_EQUAL = 2 0 0 a n d N!AX_OSCILLATIONS - 3 0 . algorithm ra~io-cut calculate a ran.co:ri feasible initial position acam - 0 osci 1 la uioo.Countor = 0 equalityCounter = 0 stepFactor = 1/14 while (equalit-yCounter < MAX EQUAL! Karlis r'rt'ivdlds. NondilTercniuhlc O p t i m i z a t i o n B a s e d A l g u n i h m tor G r a p h R a M i > - O i L Ptetitipning X = x -- ' acu~ - scam * REjU-JTIO'." FACTOR + d if iteration) :'r.:.s (minimum cot positional equalityCcunter = 0 o s c i l l a t i c n C o u n t e r ++ or else (minimum p o s i t i o n a l iteratiir.) equalityCcunter = • or else e q u a l i t y C o u n t e r +4 if (oscillationCounter value has cut value incrioses has : r. this decreased in > M&XJ3SCILLATION5) stepFactor /= 1 . 3 oscillationCounter = 0 endwhile end function d i r e c t i o n (x) d = s u b g r a d i e n t (x) s t e p = {•/... ... - x.. _.,) * s t e p F a c t o r x. = x + d * s t e p x = p r o j e c t i o n of x or. t h e c o n s t r a i n t return x end ; T h e Data W e evaluated the p r o p o s e d algorithm on three families of graphs: r a n d o m cubic g r a p h s , r a n d o m geometric g r a p h s and r a n d o m trees. W e considered only g r a p h s with unit weight n o d e s and edges. R a n d o m cubic graphs a r e potentially hard for ratio-cut a l g o r i t h m s , b e c a u s e in [11] it w a s s h o w n that there is actually an 0 ( l o g n) g a p b e t w e e n the m i n i m u m ratio-cut and the m a x i m u m concurrent flow on these g r a p h s . W e generate them using the algorithm provided in [14]. R a n d o m geometric g r a p h s are standard test suite for b a l a n c e d cut p r o b l e m s used in several papers [9, 2 2 ] , T o g e n e r a t e a g e o m e t r i c graph w e p l a c e the nodes o f the graph r a n d o m l y in the unit square. Then w e include an e d g e b e t w e e n each of t w o nodes that are within distance 6 in the graph, w h e r e <5 is the m i n i m u m value such that the resulting graph is connected. Tree graphs are s e e m i n g l y easy g r a p h s because their optimal ralio-cul can be calculated in linear time. A l s o it is not hard to show that only one local m i n i m u m exists for the c o r r e s p o n d i n g optimization p r o b l e m . Nevertheless, it is an interesting family since our e x p e r i m e n t s indicate a slow c o n v e r g e n c e of o u r algorithm on these graphs. Also, w e can c o m p a r e our result with the optimal o n e . W e generate r a n d o m trees with the classical algorithm, w h e r e each tree is produced with the same probability. This 72 DATORZINATNE UN INFORMACIJAS TEHNOLOGIJAS algorithm produces long and skinny trees, w h i c h are particularly difficult algorithm. for our Experimental Results We i m p l e m e n t e d the algorithm in a n d e v a l u a t e d its p e r f o r m a n c e on a c o m p u t e r e q u i p p e d with a Pentium III 8 0 0 M H z p r o c e s s o r and 2 5 6 M b y t e s of R A M . For each graph family vve measured the r u n n i n g time in s e c o n d s , the n u m b e r of iterations and the quality of the p r o d u c e d cuts. S i n c e w e did not k n o w the exact cut values for r a n d o m and geometric graphs, w e e v a l u a t e d h o w much the ratio-cut value decreases w h e n w e continue the a l g o r i t h m for the s a m e n u m b e r of iterations as performed before termination. M e a s u r i n g the d e c r e a s e o f the ratio-cut value w e can estimate h o w far the result is from the o p t i m a l . T h e algorithm was run on a series of g r a p h s o f e x p o n e n t i a l l y increasing size from 100 up to 2 0 4 8 0 0 n o d e s . Ten graphs w e r e g e n e r a t e d for e a c h size and the results were averaged. T h e average n o d e degree for all g r a p h families is constant. A l t h o u g h we cannot specify the d e g r e e explicitly for g e o m e t r i c g r a p h s , due to their nature it was about 10 on all instances. T h e e x p e r i m e n t a l results are g i v e n in tables 1-3. For each graph size tables s h o w the iteration count, the r u n n i n g t i m e , the obtained ratio-cut value and the i m p r o v e d ratio-cut value w h e n the iteration c o u n t is doubled. For tree g r a p h s the optimal ratio-cut value is s h o w n also. Let us d i s c u s s the results separately for each graph family. Cubic Graphs A l t h o u g h these g r a p h s were s u g g e s t e d as difficult, the algorithm performed very well on t h e m . It took on average about 35 m i n u t e s to partition the 2 0 4 8 0 0 n o d e graphs. The r u n n i n g t i m e d e p e n d e n c e on the g r a p h size is s h o w n in Fig. 4. W h e n w e a p p r o x i m a t e d the r u n n i n g time with a function in a form 0(rr") w e get the asymptotical running t i m e about 0{n ) on these graphs. b T h e algorithm s e e m s to find a very close iteration count, the quality increased only by improved for the larger graphs a p p r o a c h i n g surprising since it is not hard to prove that the to o p t i m a l cut since after d o u b l i n g the less than 0 . 4 % . E v e n m o r e , the quality 1 (see Fig. 7). Such b e h a v i o r is not ratio of a cut (A, A') of a r a n d o m cubic on a v e r a g e independently of the sizes o f A and A' (a similar proof for n -1 general r a n d o m graphs is presented in [20]). T h e n it is unlikely that the m i n i m u m ratio cut will be m u c h different from this a v e r a g e value. graph is Kitrlis f-reivalds. Nondit'tcrenliablc Optimization Based Algorithm tor Graph Ralio-Cul 73 Partjiiuniny 4000 3500 3000 2500 2000 1500 1000 500 \ 0 81920 122880 163840 0 20*300 40960 Fig. 4. Running time for cubic graphs. 0 40960 81920 81920 122880 163840 204800 g r a p h size g r a p h size 122880 163843 Fig. 5. Running time for geometric graphs. tflfj 204800 4O0 1603 64 X 25600 102400 g r a p h size g r a p h size Fig. 6. Running time for tree graphs. Pig. 7. Quality for cubic graphs. 1.01 , 0.75 i 0 94 1 100 400 ; 1600 6400 25600 0 7 102400 g r a p h size Fig. iV. Quality for geometric graphs. , 100 400 1600 6400 25600 102400 g r a p h size Fig. 9. Quality for tree graphs Geometric Graphs T h e a l g o r i t h m p e r f o r m e d very well on these graphs both in terms of speed and quality. It took on a v e r a g e abotit 1 hour to partition the 2 0 4 8 0 0 node g r a p h s . T h e running time d e p e n d e n c e on the graph size is s h o w n in F i g . 5. T h e asymptotical running t i m e b e h a v i o r on t h e s e graphs w a s about Q ( « A f t e r d o u b l i n g the iteration count the quality increased by less than (see Fig. 8). A l s o visually the cuts seemed the best possible. Fig. 10 s h o w s a typical cut of a 1 0 0 0 0 - n o d e graph. 74 DATORZINATNE EN INFORMACIJAS TEHNOLOGIJAS Tree Graphs T h e running time for trees w a s better than for other families. T h e largest graphs w e r e partitioned in about 2 0 minutes. T h e r u n n i n g t i m e d e p e n d e n c e on the graph size is s h o w n m Fig. 6. T h e a s y m p t o t i c a l r u n n i n g t i m e on these g r a p h s w a s about 0(n ). H o w e v e r the quality w a s poor. As s h o w n in F i g . 1 1 the o b t a i n e d cuts were far from the optimal and the quality d e c r e a s e d with increasing g r a p h size. A l s o d o u b l i n g the iteration count s h o w e d 10% to 2 0 % quality i m p r o v e m e n t (see Fig. 9). ] Fig, 10. The obtained cut for a 10000-node geometric graph. 45 Fig. 11. Optimality for tree graphs W h e n w e explored further the r e a s o n for the p o o r b e h a v i o r , w e found out that c o n v e r g e n c e is m u c h s l o w e r than for other g r a p h families and the stopping criterion does not work correctly in this case. W h e n w e a l l o w e d the algorithm to run for a sufficiently long time, it a l w a y s found the o p t i m u m solution. H o w e v e r w e did not find a robust stopping criterion that correctly w o r k s with tree g r a p h s and does not increase running time m u c h for other g r a p h families. A s already m e n t i o n e d , a smarter initialization can be u s e d to i m p r o v e the quality of the partition if such tree or tree-like graphs are c o m m o n for s o m e application. Conclusions and O p e n Problems We have p r o p o s e d a nondifferentiable o p t i m i z a t i o n based m e t h o d for solving the ratio-cut problem and presented a heuristic a l g o r i t h m i m p l e m e n t i n g it. W e have shown that any strict local m i n i m u m is 2-optimal. T h e presented algorithm, h o w e v e r , in certain cases can find a non-strict m i n i m u m , but w e can easily transform the obtained .v vector into the characteristic form. T h e n the a l g o r i t h m can be run again from this starting position and this p r o c e s s can be iterated until the result d o e s not c h a n g e giving a locally minimal cut. which by T h e o r e m 5 is 2-optimal. T h e obtained a l g o r i t h m is s i m p l e and fast and uses an a m o u n t of m e m o r y that is proportional to the size of the g r a p h . Its r u n n i n g time and quality are verified Karlis I-reivahts. Nimdificrenliablc Optimization B a s e d A l g o r i t h m tor G r a p h R a l i o - C u l Partitioning 75 experimentally. Its practical running time is about 0(n ') on our test data. The algorithm p r o d u c e s high quality cuts on r a n d o m cubic a n d g e o m e t r i c g r a p h s . On trees and other very sparse g r a p h s the quality can be significantly improved by c h o o s i n g a better starting position than a r a n d o m o n e . W e evaluated t h e algorithm on artificially generated data. A s further s t u d y it w o u l d be important to evaluate its p e r f o r m a n c e on real-life p r o b l e m s . )t Although the algorithm p e r f o r m e d well on most graphs,, it is heuristic a n y w a y . Is it an open question w h e t h e r w e c a n find a local m i n i m u m o f ( 2 . 3 , 4 ) in p o l y n o m i a l time? Acknowledgements The author would like to thank Paulis Kikusts and Krists B o i t m a n i s for valuable discussion a n d help in p r e p a r a t i o n of this paper. References 1. C. J. Alpert and A. Kahng. Recent Directions in Netlist Partitioning: a Survey. Tech. Report, Computer Science Department. University of California at Los Angeles. 1995. 2. C. J. Alpert, J.-H. Huang, and A. B. Kahng. Multilevel circuit partitioning. Proc. Design Automation Conf, pp. 530 533, 1997. 3. R. Baldick. A. B. Kahng, A. Kennings and T. L. Markov. Function Smoothing with Applications to VLSI Layout, Proc. Asia and South Pacific Design Automation Conf., Jan. 1999. 4. J. W. Bern,' and M. K. Goldberg. Path Optimization for Graph Partitioning Problems. Technical report TR: 95-34, DiMAC'S, 1995. 5. D. Bertsekas. Nonlinear Programming. Athena Scientific. 1 999. 6. L. Hagen and A. B. Kahng. New Spectral Methods for Ratio Cut Partitioning and Clustering. IEEE Trans. Computer-Aided Design. 1 1 (1992). pp. 1074 1085. 7. T. Hamada. C.-K. Cheng. P. M. Chau. An Efficient Multilevel Placement Technique using Hierarchical Partitioning. IEEE Trans, on Circuits and Systems, vol. 39(6) pp. 432 439. June 1992. 8. P. Klein. S. Plotkin, C. Stein, E. Tardos. Faster Approximation Algorithms for Unit Capacity Concurrent Flow Problems with Applications to Routing and Sparsest Cuts. SIA.M J Computing. 3(231(1994). pp. 466-488. '•). K. Lang and S. Rao. Finding Near-Optimal Cuts: An Empirical Evaluation. 4th. ACXf-SIAM Svmposium on Discrete Algorithms, pp. 2 1 2 -221. 1993. 10. T. Leighton. F. Makcdon. S. Plotkin, C. Stein. E. Tardos, S. Tragoudas. Fast Approximation Algorithms for Multicommodity Flow Problems. Journal of Computer and System Sciences. 50(2), pp. 228-243. April 1995. 11. T. Leighton. S. Rao. Multicommodity Max-Flow Min-Cut Theorems and their use in Designing Approximation Algorithms. Journal of the ACM. vol 46 . No. 6 (Now 1999), pp. 787 • 832. 12. T. Lengauer. Combinatorial Algorithms for Integrated Circuit Layout. Stuttgart. John Wiley & Sons 1 994. 76 DATORZINATNE UN INFORMACIJAS TEHNOLOGIJAS 13. D. W. Matula, F. Shahrokhi. Sparsest Cuts and Bottlenecks in Graphs. Journal of Disc. Applied Math., vol. 27 (1990). pp. 113 123. 14. A. Steger and X. C. Wormald. Generating Random Regular Graphs Quickly. Combinatorics, Probah. and Comput. vol. S (1999). pp. 377 396. 15. F. Shahrokhi, D. W. Matula. On Solving Large Maximum Concurrent Flow Problems. Proceedings of ACM 1987 National Conference, pp. 205-209. 16. F. Shahroki, D. W. Manila. The Maximum Concurrent Flow Problem. Journal of the ACM, vol. 37. pp. 3 IS 334. 1990. 17. M. Wang, S. K. Lim, J. Cong, M. Sarrafzadeh. Multi-way Partitioning using Bi-partition I leuristics. Proc. Asia and South Pacific Design Automation Conf. pp. 667-672. 2000. 18. Y. C. Wei, C. K. Cheng. An Improved Two-way Partitioning Algorithm with Stable Performance, IEEE Trans, on Computer-Aided Design, 1990, pp. 1502-1511. 19. Y. C. Wei. C. K. Cheng. A Two-level Two-way Partitioning Algorithm. Proc. lnt'l. Conf Computer-Aided Design, pp. 516-519. 1990. 20. Y. C. Wei. C. K. Cheng. Ratio Cut Partitioning for Hierarchical Designs. IEEE Trans, on Computer-Aided Design, vol. 10. pp. 9 i 1 -92 I. July 1991. 21. C.-W. Yeh. C.-K. Cheng, and T.-T. Y. Lin. A Probabilistic Multicommodity-flow Solution to Circuit Clustering Problems. IEEE International Conference on Computer-Aided Design. pp. 428 4 3 1 , 1992." 22. C.-W. Yeh. C.-K. Cheng, T.-T. Y. Lin. An Experimental Evaluation of Partitioning Algorithms. IEEE International ASIC Conference. P I 4 - I . I - P14-14: 1991. 23. X. Z. Shor. Methods of Minimization of Nondifferentiable Functions and their Applications. Kiev, "Naukova Dumka" 1979. (in Russian). 24. I. Bomze, M. Budinich. P. Pardalos, and M. Pclillo. The Maximum Clique Problem. Handbook of Combinatorial Optimization, Volume 4. Kluwer Academic Publishers. Boston, MA, 1999. Appendix A - J u s t i f i c a t i o n o f T h e o r e m 2. Unfortunately all attempts p r o v i n g this t h e o r e m analytically failed. W e succeeded only in s o m e special cases w h e n all n o d e s or all edges have unit weight. Therefore we resorted to a '"computer assisted p r o o f . T h e equalities can be eqtuvalently written as: c, + c, +c, c, — c\ + c- A'i/n'.-.v Frftvalds. A, >R , tl Nondiriercnliahlc Oplirm/iiiion Based A l ^ i i n l h m lor Graph Ralio-Cul R >R\z2 A' ; ' A' .-. 77 Pavilioning • A'- . prove: Since all the inequalities do not change w h e n w e multiply all c. with some constant, w e can c h o o s e this constant such, that R = I- Solving this equality for c and substituting the result into the inequalities w e obtain. (liven: ]2 ; (.1) gs - g(, > 0 This is the s a m e as p r o v i n g that (1)—(5) together with the negation of (6) can never be satisfied for any values of the involved variables. Let us also r e m e m b e r that the variables c o m e from n o d e w e i g h t s and e d g e capacities that are n o n n e g a n v e . W e can further require that g, are strictly positive; in the opposite case the proof is easy. W e verified that these inequalities have no solution with several n u m e r i c a l solvers, h o w e v e r that does not g u a r a n t e e the correctness due to numerical errors. Therefore we used the Cylindrical A l g e b r a i c D e c o m p o s i t i o n m e t h o d i m p l e m e n t e d in M a t h e m a t i c s 4.0 to verify the inequalities using symbolic c o m p u t a t i o n . T h e running time of the Cylindrical Algebraic D e c o m p o s i t i o n algorithm is inherently doubly e x p o n e n t i a l : hence w e had to transform the inequalities and write supporting functions to obtain results in practically reasonable time. U s i n g s y m b o l i c c o m p u t a t i o n we can be certain, if M a t h e m a t i c a has n o implementation b u g s , our conjecture holds. T h e M a t h e m a t i c a 4.0 n o t e b o o k to verify these inequalities follows. 78 DATORZINATNE: (ri[11 = ( v We i n t r o d u c e returns V e r i f y [a_ ||b Verify[a ] := Examine [a a, auxilary False (gl, i f and ] := function only i f to the (Examine[a] IN INFORMACUAS speed up expression the i s TEHNOLOCIJAS calculations. always false 11 *) || V e r i f y [ O r [ b ] ] ) ; Examine[a] ; ] : = Experiment al g2, g3, $E.ecursionlimit g4, = CylindricalJSVlgebraicDe composition [ g6, g5 , c5, c6, cl, c3, c 4 } ] ; 10000; e q l ^ c l * c 4 + c 6 - g l - g4 - g6; e q 2 - c l - c 4 - c 6 - g l + g 4 +• g 6 ; eq3 c 3 - c 4 - c 5 - g 3 * g4 + E e q 4 - c 3 + c 4 + c 5 - g 3 - g4 - g5 ; g5 ; e q i m j i = 2 ( c l + c 3 *• c 5 + c G ) - g l - g 3 - g.5 - g 6 ; (w Do the takes verification too tmpResult much time in two steps. Doing i t all together *•) = Experimental" CylindricalAlgebraicDecomposition[ c 5 > = 0 & & c 6 > = 0 S & e q 2 > = 0 && e q 4 > = 0 && e q l > = 0 && e q 3 > = 0 && eqimp < 0 {c5, c6, gl, gl > 0 g2, g3, g3 > 0 && g 5 > 0 , g4, g5 , g 6 , V e r i f y [L o g i c a l E x p a n d [ tmpResult 0Ji\]"= cl, c3 g l wg3 c 4 } ] ; y — g2 * g 4 == g 5 * g 6 ] ] F a l s e Table I. E x p e r i m e n t a l r e s u l t s for tree g r a p h s . Node count 100 200 400 800 1600 3200 6400 12800 25600 5 1200 102400 204800 Iterations 257 377 512 739 906 987 1085 1294 1343 1374 1115 1734 Time (s) Cut value 0.01755 0.05105 0.1377 0.43965 1.14515 2.72245 8.336 27.96925 79.48075 199.9305 359.242b 1197.7344 0.000424058 0.000131803 3.38841 E-05 9.62S36E-06 3.49721E-06 1.3683E-06 7.15262E-07 3.34456E-07 1.78731 E-07 1.031S7E-07 6.66622E-0S 2.S1771E-08 Improved cut 0.000421019 0.00010598 3.14657E-05 7.94314E-06 3.10424E-O6 1.14917E-06 5.51447E-07 2.98915E-07 1.59801 E-07 9.352E-08 6.17942 E-08 2.60658E-08 Optimal cut 0.000407809 0.000101407 2.55342E-05 6.32I58E-06 1.58787E-06 3.97123E-07 9.884S5E-08 2.46653E-08 6.19941E-09 1.53453E-09 3.89972E-I0 9.6445E-I I Ktirlis i'reivtdds. Nondtflcrcnliahlc Optimization B a s e d A l g o r i t h m for G r a p h Raiio-Cui Pariitinnin-j Table 2. E x p e r i m e n t a l results for cubic g r a p h s . Node count 100 200 400 800 1600 3200 6400 12800 25600 51200 102400 204800 Iteration count 199 290 383 570 727 835 1012 1 195 1455 1672 1998 2500 Time (s) 0.013 0.04205 0.11065 0.36905 1.02045 2.69475 8.92785 30.71865 107.70535 320.36415 834.6924 2254.4636 Cut value 0.005185495 0.002513133 0.001214463 0.0005994 0.000296 0.000147637 7.36747E-05 3.67635E-05 1.8375E-05 9.15616E-06 4.56971E-06 2.28649E-06 Improved cut 0.005166088 0.002510172 0.001213595 0.000598147 0.000295773 0.000147552 7.36234E-05 3.675E-05 1.83679E-05 9.15434E-06 4.56927E-06 2.28627E-06 Table 3. E x p e r i m e n t a l results for g e o m e t r i c g r a p h s . Node count 100 200 400 800 1600 3200 6400 12800 25600 51200 102400 204800 Iteration count 129 146 204 266 389 430 530 681 800 886 976 1417 Time (s) 0.016 0.04305 0.1282 0.4532 1.9732 5.28105 14.3365 45.96155 149.0739 461.85255 1 169.0154 3797.3582 Cut value 0.002848427 0.001631237 0.000592549 0.000338845 0.0001 16216 5.88979E-05 2.34042E-05 1.13718E-05 5.18003E-06 2.88345E-06 1.4I57E-06 3.32351E-07 Improved cut 0.002845499 0.001620948 0.0005825 0.000333014 0.000113355 5.82837E-05 2.25542E-05 1.09091E-05 5.18003E-06 2.88345E-06 1.4157E-06 3.14612E-07 LATVIJAS U N1V E R SI T A T E S R A K S T 1 . 2 0 0 4 . 6 6 9 . sej.: D A T O R Z I N A T N E UN INFORMACIJAS TEHNOLOGIJAS. tO 88. l p. P Review of Traceability Models for Software Testing Processes Martins Gills Riga Information T e c h n o l o g y Institute Kuldigas i e l a 4 5 , R i g a . L V - 1 0 8 3 . Latvia martins.gills(a riti.lv ; A considerable part of software development activities is devoted to testing. The scale of testing may depend on quality criteria for the particular development project. As soon as development and testing is performed in some organized manner, there exist some relations and dependencies between all types of project items. The relations outline some path of traceability that starts from the origins of a particular item, and follows to its derivatives in a form of other item types. This paper reviews various traceability models for software projects, and analyses the sub-models of testing processes. The common properties of these models arc identified. Key words: traceability, problem reporting, testing process. Introduction The property of traceability for software development projects has been k n o w n for long. It usually assures a better control o v e r software development artefacts, provides a more organized project life-cycle and. consequently, reduces the risk of developing a low quality product [6], Testing as one of software processes is extensively dependent on the established traceability principles or. more formally, a traceability model within the given project. A vital problem in software testing is to establish an appropriate coverage of requirements. Typically, the formal coverage could be achieved according to s o m e criteria, e.g. stating that each requirement must have at least one appropriate test, or in more advanced situations using standard requirements like ones o f BS 7925-2 [3j. It is especially important to evaluate the impacts of requirement's changes or to analyse the updates within the test suite. The previously mentioned issues are related to traceability property within the software development life cycle and to the testing process in particular. Software testing process properties A large part of the effort devoted to software development is due to software testing [11]. Software testing can be regarded as one of the processes presenting within a n y software development life cycle. Although it may not appear explicitly within the formal list of processes (like e.g. in ISO/IEC 12207 [8]). it m a y be derived as a tailored process. Typically, the testing is organized at various levels, depending on what is the focus of the specific testing activity. The concept of test levels (test activities with a c o m m o n direction) is mainly derived from the software d e v e l o p m e n t V-model. The model itself originally comes from the Software Lifecycle Process M o d e l [4], m o r e often it is represented in a very simplified form [10], and as a more generic model is s h o w n in F i g . l . In principle, the V-model is close to the more classical Waterfall software development model [11]. Every project has a different testing process, and not all projects do have tasks corresponding to Marlins Gills. Review of T r a c e a b i l i t y Models for S o f t w a r e Testing Processes 81 every test level - some may have very specific ones. Main differences can be due to different naming and due to the combination or subdivision of particular test levels. In each case various organizational and methodological issues play a role in defining the w a y these test levels are implemented. For example, a unit test may be called a component test, but acceptance testing may be subdivided as factory acceptance testing and user acceptance testing, or this level may be supplemented by qualification testing. Every test level has a different object to be tested, and the tests themselves are based on different project artefacts. Acceptance Testing Requirements Specification System lesting Architectural Design \ .s > Integration Testing / / Detailed Design Unit T e s t i n g \ / / Coding - > o n the b a s i s of Figure /. The V-model. T h e Author has found out that in rare cases the projects can immediately say what kind of life cycle model they use. In most cases such a model exists, but it is not formally described in an explicit form, rather it is hidden within the project plan, procedures or standards. Although the notion of test levels is explicitly used in the V-model. they can be a part of other models as well. It could be extremely difficult to indicate a development process where testing is just a once-through activity with only one testing level. Even the simplest approach of quality requires a product (e.g. software element) checking activity prior to passing it further for integration into a larger product or prior to delivery. A s the typical software development consists of numerous developers, project stages (typically, with cycles), there are at least two test levels present. Traceability models for development process According to the concept of the traceability model [5]. it is formed from project items and relations between them. T h e most convenient form of representation is graphical where boxes are item types and arrows indicate relations. To analyse the testing process items, the author compiled preliminary information from 14 projects. Each of those represented a slightly different development approach, but at the same time, each project from this group could be m a p p e d to the typical life cycle building blocks defined by the standard J-STD-016 [7]. For this paper, four different models were selected, each representing s o m e possible class of traceability models. Three of them represent custom information system development projects, and one represents development of small utility software. Project A (see Fig.2A) represents the most popular class of traceability relations. Formal requirements are derived from customer proposals, tests are based on these requirements, 82 DATORZINATNE UN INFORMACUAS TEHNOLOGIJAS and problems are documented in problem reports. T h e test execution results are kept together with the test information. Items directly related to testing are: test a n d problem reports. B D e s i g n item Requirement implements Change request Problem report Test A described ki j execution ol T e s t case V T e s t log Figure 2. Traceability models for Projects A and B. Project B (see Fig.2B) shows the project where the testing process is highly refined — the testing process items are test class, test, test case, test log and problem report. This project m a y correspond more to a situation where requirements are not undergoing changes, and the project stage represented in the model is m o r e oriented to implementation and testing rather than designing or developing specifications. Two other projects h a d more relaxed requirements for t h e development. Project C (see Fig.3A) is a sample of the model that includes quite detailed implementation description. The particular model w a s present in a project, developing a web-based application. Testing process items were: test, test case, test log a n d problem report. Testing was well organized, and in the particular case these tests were mostly at unit and integration level. A slightly different situation from all t h e previous ones w a s in Project D (see Fig.3B). This w a s a development project of a software utility with n o particular external requirements in mind, but they were rather defined iteratively and served at the same time as tests. Something similar m a y happen when special d e v e l o p m e n t methodologies like Extreme Programming are practiced. Tire Project D case is n o t dominant, but it characterizes a semiformal approach in projects with a rapid and evolutionary development. Martins Gills. Review of Traceability Models for S o f t w a r e Testing 83 Processes Figure 3. Traceability models for Projects C and D. The A u t h o r found out that it was possible to map the traceability models of all 14 reviewed projects to the four models previously presented. T h e main difference could be that specific n a m e s are given for particular item types in each particular project. For example, requirement could be a particular diagram, use case, scenario, instruction, feature, etc. Typical item types of testing process The identification of testing items within the development process traceability model allows defining the traceability for the testing process. T h e typical constituents of the testing process are project items s h o w n in Table 1. Table I. Testing-related project i t e m s . Test Test case Test class Test configuration Test suite Test log record Problem report A set of objectives, methods and criteria to be applied in verification/checking purposes to a certain software item. A set of inputs, execution preconditions, and expected outcomes developed for a particular test objective. In some cases the test case may be created in the form of a test script for automated tests. A set of tests with similar test objectives, methods or test levels. A set of additional test attributes to be applied during execution for certain test cases. A set of test cases to be executed within a certain test session, scenario or by the particular personnel. Test suite can be implemented in the form of a test script for automated tests. Information about the status of test execution. A detailed description of test incidents, a record describing all events requiring an additional exploration. The identification of testing process items within project traceability models shows that the relations between these items are not random. There are typical directions of dependencies. In every project, the testing process items form some subset of a more generalized model. Figure 4 shows a generalized model of the testing process. Each individual project m a y include a subset of these, or define some specific additional testing- 84 DATORZINATNE UN INFORMACUAS TEHNOLOGIJAS related items. For example, the test script as an item type can be added to the model for projects with a high level of test automation. Test T e s t class b e i o r g s to t ^ is p a r i of contains T e s t case T e s t suite ^ m a r k s execution of 1 ; T e s t log record Problem report is e x e c u t e d with Test configuration marfcs p r o b l e m in Figure 4. A generalized traceability model for testing process. Items shown on the left side (Fig. 4) represent the typical constituents of the information set for testing that could be derived from software engineering standards like J-STD-016 [7]. The information set of the testing process has to provide answers to the following three question areas that are vital for project m a n a g e m e n t and product quality: • what is to be (or was) tested and how? • when and how were the tests executed? • what problems are encountered? In Fig. 4, the project items o n the right side d o have a more classifying role - in some cases they could not be regarded as separate items but rather as attributes to the project items on the left side. T h e most c o m m o n approach in industrial projects is n o t to distinguish between tests and test cases. All are called tests, and the item could be either very detailed (like the test case), or it could outline j u s t the main issues (like the test). S o m e other optimisations could take place, sometimes reducing the minimal test item set to only two project item types. Fig. 5 s h o w s two possible cases that were observed in the projects. In case (a) there are separate tests defined, but all testing results are maintained as o n e log. T h i s can function in small teams where it is not important to establish a formalized communication between testers and developers, but at the same t i m e testers recognize the n e e d to document the progress and problems. Case (b) formalizes the situation in which attention is paid mainly to problem registering, and probably tests are not even in every case formally documented, but rather briefly described while they are executed. This m a y be typical for larger project teams than in case (a). a) Test marks e x e c u t i o n ot Results of testing (including Pr.Rep.) b) Test & execution info describes p r o b l e m of Problem report Figure 5. Possible optimisations of traceability models of the testing process. Martins Gills. In s o m e as a n of T r a c e a b i l i t y projects, exact added. Review c o p y the (due m a i n to it a s s u r e s a g o o d T h e evaluation items is t h e process Test i t e m s not play f o u n d - to for that b e the present o n base in the this. T h e - it i n d i c a t e s to this T h i s traceability, but d e p e n d a significant role seen w o r d s reasons). s h o w s it h a s m a i n l y b e e n list w i t h motivation e c o n o m y engineering, O v e r a l l h a v e o f requirements S o m e t i m e s d o c u m e n t i n g intentions M o d e l s for S o f t w a r e Testing o m i t writing like " m a k e is t h e n e e d the sure to f o r m a l i s m is project the m i n u n i z e the case in item if the d e p e n d e n c e tests, o r tests a r e that", " c h e c k " quite it is n o t e f f i c i e n t element, terms type testing n e w the A l l on software testing [1]. project other the p r o b l e m be spent in test-related is m a d e . tests m a y "verify" effort o f g o o d for written or widespread o f the test o n w h e r e 85 Processes testing report does a d d e d as a result o f p r o b l e m s . External relations of testing process items T h e least Previous t w o items chapter - the defined test a n d the the generalized p r o b l e m project items that are not part o f the testing traceability report - w i t h i n m o d e l this for m o d e l testing h a v e process. relations A t w i t h process. Dependencies of the test Items externally classified into three • software • the • co-products T h e levels directly to the test (relative to the testing process) can be system d e f i n i n g items (requirements, design items), code, (e.g. user's test is u s u a l l y a n d p r o b l e m related groups: the strongly dependencies report (see manual). d e p e n d e n t o f the o n test (see these items. T a b l e T h e r e 3). T h e F i g . 4 ) , b u t this is n o t d e p e n d e n t o n test is a c o r r e l a t i o n can the test also be b e t w e e n derived f r o m level, rather o n test considered as a test the process organization. T h e relation functional the expectations primarily to testing. in testers w o u l d the This o f the developer initial or provides c u s t o m e r tests, refined in the are because r e q u i r e m e n t m o s t direct met. D e s i g n in traditional a h a v e the availability to exploit could w a y items the b e a n s w e r a n d setting c o d e to users n o r m question elements neither internal d e v e l o p m e n t the are nor for w h e t h e r referenced independent information. Table 3. Items related to the test and the c o r r e s p o n d i n g test levels. Project item Initial requirement Requirement Design item Code element User's manual Problem report Test levels acceptance system integration unit system, acceptance unit, integration, system, acceptance 86 DATORZINATNE UN INFORMACUAS TEHNOLOGIJAS The relation of the test to the user's manual is quite interesting. It could be applied in two cases - either when the documentation is under test, or w h e n there are n o formally specified requirements. O f course, there may be s o m e defined requirements, but at the time of testing they are completely outdated, and there is n o practical feasibility to base the test on these requirements. An interesting alternative approach with case studies has been proposed - to use the u s e r ' s manual as a requirements specification [2]. Such an approach requires systematic writing of the manual, b u t for m a n y real-life projects this method could be less expensive than writing and maintaining t w o d o c u m e n t s . A Special advantage is gained when the customer does not request these two. Relation with a problem report is typical for the tests added during the test execution phase to check the problem areas in later testing cycles. Relations of the problem report The problem report is an event-provoked item, and therefore it is tied with a weak relation to other items - n o changes in other related items can provoke change into the problem report itself. T h e problem report c a n b e related to almost any project item, but the most typical are the same as for the test. It may have numerous indirect relations (e.g. through an interface element problem it m a y point to s o m e specific requirement). T h e existence of particular direct relations depends on the test level (see Table 4). In each particular project, only part of the above mentioned items are usually used. T h e problem report's relation to the test or test log indicates what kind activity led to the given problem report. Design, code, interface elements or t h e user's manual or do indicate the location of the problem observable by the tester. Reference to the requirement could be used either to indicate problem in requirements or into its derivatives. Table 4. Items related to the problem report and the corresponding test levels. Project item Test Test log record Design element Interface element Code element Requirement User manual Test levels unit, integration, system, acceptance unit, integration, system, acceptance integration, system system, acceptance unit, integration acceptance system Additionally, there is an external item that d e p e n d s on the problem report - the change request. Usually, it is further related to the requirements. Related work A comprehensive analysis of traceability models within software development projects is made in [12]. At the same time, it is m o r e focused on generalization of the problem rather than analysing individual relations of project items. In [9] a model of trace links within testing activities is shown. At the same time it describes a particular case of traceability links with generalized, bi-directional relations. T h e Main focus is on coverage-oriented questions. Some papers (e.g. [13]) express the need for increased traceability for software development Marlins Gills. Review of T r a c e a b i l i t y Models for Software Testing Processes 87 quality techniques to be more similar to traditional measurement principles. N o reports were found that are focused on traceability and software testing relations. Summary The traceability models of software testing processes belonging to different development life cycle models indicate similar properties. There are considerable similarities between every testing process, and no contradictory principles were found. These particular testing process models can be regarded as a subset of a generalised traceability model of the testing process. T h e M a i n differences could be found within the terminology - how each item type is called in a particular project, and the level of optimisation - which items are skipped or regrouped. T h e quality of testing can be considerably improved if traceability information is taken into account. This increases the level of maintainability of tests, and could reduce effort and t i m e spent to carry out a change. T h e Author has indicated possible research areas concerning the traceability within the projects. From the application point of view, a potential interest could be in finding influence on the properties of the traceability model caused by the duration of the project, its particular life cycle m o d e l , project team size, project (sub-jcontract conditions, project scope and tasks, development methodology and development environment. Acknowledgement The author would like to thank the Riga Infonnation Technology Institute for making it possible to analyse the software development projects and the Latvian Council of Science Grant (grant assignation decision No. 3-2-1, 2002.07.09) for partial financing of the research. References 1. Bach J. Risk and Requirements-Based Testing. Computer, June 1999. pp. 113-114. 2. Berry, D.M., Daudjee. K., Dong. J.. Nelson, M.A., and Nelson, T. User's Manual as a Requirements Specification. Technical Report, CS 2001-17. University of Waterloo, Waterloo, ON, Canada. May. 2001. 3. BS 7925-2 Software component testing, BSI 1998. 4. DSK RR413320010. Allgemeiner Umdruck 250/4. BWB IT 1997. Entwicklungsstandard fur IT-Systeme des Bundes, Vorgehcnsmodell. (Toil 1: Regclungsteil. Toil 3: Handbuchsammlung). 5. Gills. M.. The Concept of a Universal Traceability Tool for Software Development Projects. Scientific Proceedings of the Riga Technical University. Computer Science. Applied Computer Systems. 2st thematic issue. Riga. 2001. pp 33-4 I. h. Gotel O. C. Z.; Finkelstein A. C. \Y.. An Analysis of the Requirements Traceability Problem. 1994. p.8 http://citesccr.nj.nec.com/78573.htm! 7. IEEE. J-STD-016-1995. Standard for Information Technology Software Life Cycle Processes. 1995. 88 8. 9. 10. 11. 12. 13. DATORZINATNE UN INFORMACUAS TEHNOLOGIJAS ISO. Standard ISO/TEC 12207. Standard Information technology - Software life cycle processes. 1995. Linde S.. Relationships between quality and testing - Kvalitates un testesanas attiecfbas (in Latvian). Scientific Proceedings of the Riga Technical University, Computer Science, Applied Computer Systems, 2st thematic issue. Riga, 2001. pp 68-74. Pol, Martin; van Veenendaal, Erik. Structured testing of information systems - an introduction to TMap. Kluwer Bedrijfslnformatie, 1998. p 177. Pressman. Roger S. Software engineering: a practitioner's approach. McGraw-Hill, Inc. 1997 (4th ed.) p. 852. Ramesh, Balasubramaniam; Jarke, Matthias. Toward Reference Models for Requirements Traceability. IEEE Transactions On Software Engineering, vol. 27. no. 1, January 2001. pp 58-93. Wakid. Shukn A.; Kuhn, D. Richard. Wallace. Dolores R.. Toward Credible IT Testing and Certification. IEEE Software. July 1999. pp. 39-47. LATVIJAS UNTVERSITATES RAKSTI. 20O4. 6 6 9 . s e j . : D A T O R Z I N A T N E UN INFORMACIJAS TEHNOLOGIJAS. X9.-9X. Ipp. Benchmarking Problems of Topic Telecommunications and Access in Latvia Mara Gulbe*, Amis Gulbis** *University o f Latvia, Aspazijas blv. 5, Riga, m g u l b e @ l a t n e t . l v **Riga T e c h n i c a l University, A z e n e s 12. Riga, arnisg(g(di.Lv The problems related to benchmarking of telecommunications and access area in Latvia are analyzed. The topic under consideration is important because it represents the backbone of all information Society characterizing issues. The availability of data on T&A from national statistical sources is analyzed and indicators used for benchmarking of the topic T&A in Latvia are considered. Key words: telecommunication, benchmarking. Introduction T a k i n g into a c c o u n t the fact that Latvia is going to b e c o m e a m e m b e r state of EU the priorities of EU also b e c o m e actual in Latvia. T h e e E u r o p e + action plan [1] directed towards the acceleration of the formation of Information Society in M A S . serves as proof of this statement. All MAS c o u n t r i e s have agreed to take part in the i m p l e m e n t a t i o n of this action plan in their respective countries. U p to now our g o v e r n m e n t s h a v e s u p p o r t e d this p r o c e s s in Latvia. A s an e x a m p l e w e can mention National p r o g r a m " I n f o r m a t i c s " [2] and s o c i o - e c o n o m i c p r o g r a m e-Latvia [3]. At the same t i m e the attention paid to b e n c h m a r k i n g of action results is unsatisfied, t h o u g h the 2003 action plan of central statistical b u r e a u includes the data collection in a c c o r d a n c e with the n e e d s o f e E u r o p e + activities. T h e s e data will be available only in 2 0 0 4 . N e v e r t h e l e s s EC has significant interest a b o u t the present state of IS (Information Society) characterizing areas in N A S countries. Therefore, independent investigations are s u p p o r t e d by EC. H e r e w e can m e n t i o n the SIBIS project [ 4 ] , which also e x t e n d s the b e n c h m a r k i n g process of IS to N A S countries. T h e level a c h i e v e d in the d e v e l o p m e n t of T e l e c o m m u n i c a t i o n s and A c c e s s serves as the b a c k b o n e for IS activities in different areas o f society and individual lives. T h e present paper reflects the investigation of the topic T & A in Latvia carried out in framework of SIBIS project (supported by EC as FP5 investigation). Telecommunications Infrastructure T o facilitate the better u n d e r s t a n d i n g of the topic s o m e insight into t e l e c o m m u n i c a t i o n s and access infrastructure in Latvia is n e c e s s a r y . It includes: • Fixed public exchange network; • State significance data transmission network and other specialized networks: • Mobile telephone network: • Cable T V network. the 90 DATORZINATNE L'N INFORMACUAS TEHNOLOGIJAS Fixed Public Exchange Network After t h e renewal of i n d e p e n d e n c e in 1 9 9 1 , t h e g o v e r n m e n t o w n e d a rather b a c k w a r d fixed public e x c h a n g e n e t w o r k infrastructure w h e r e there w a s an almost h u n d r e d percent d o m i n a n c e o f a n a l o g u e t e l e p h o n e lines. Very often o n e t e l e p h o n e line was shared by two u s e r s m a k i n g s i m u l t a n e o u s calls i m p o s s i b l e . T h e access d e m a n d was h i g h e r then the supply. U n d e r these c i r c u m s t a n c e s q u i c k m e a s u r e s w e r e necessary to accelerate t h e progress in the field u n d e r consideration. T o stimulate the implementation of digital t e c h n o l o g i e s in the area of t e l e c o m m u n i c a t i o n s and to satisfy the access needs of t h e population, foreign capital w a s involved. T h e o w n e r s h i p of t h e fixed public e x c h a n g e network w a s shared b e t w e e n L a t t e l e k o m ( r e p r e s e n t a t i v e of t h e state) a n d one foreign t e l e c o m m u n i c a t i o n c o m p a n y (after t h e c h a n g e o f its s h a r e h o l d e r it is n o w the Finnish S o n e r a ) u n d e r conditions specified in t h e a g r e e m e n t b e t w e e n b o t h parties. The agreement foresees t h e c o m p l e t e digitalization o f the fixed p u b l i c e x c h a n g e n e t w o r k in Latvia by L a t t e l e k o m a c c o r d i n g to a t i m e s c h e d u l e fixed in the a g r e e m e n t . The agreement also established t h e L a t t e l e k o m ( t o g e t h e r w i t h a s e c o n d o w n e r ) m o n o p o l y on the fixed p u b l i c e x c h a n g e network ( s u p e r v i s i o n , d e v e l o p m e n t , maintenance) for a time period of t w e n t y y e a r s (up to 2 0 1 3 ) . This m e a n s t h a t the fixed t e l e c o m m u n i c a t i o n m a r k e t was forbidden for o t h e r parties w h o w a n t e d to p l a y in that market up to t h i s date. T h i s , in turn, acts against the price reduction in the t e l e c o m m u n i c a t i o n market. Later it turned out that this m o n o p o l y also m a k e s a serious b a r r i e r for Internet penetration and u s a g e by h o u s e h o l d s in Latvia due t o inappropriately high p r i c e w h e n c o m p a r e d with the average salary o f e m p l o y e e s in this country. After this b e c a m e clear to the g o v e r n m e n t , efforts were u n d e r t a k e n to get free from that m o n o p o l y . This w a s also stimulated by E U recommendations demanding the liberalization of the t e l e c o m m u n i c a t i o n market. In case o f a successful a g r e e m e n t between both players the liberalization of the fixed t e l e c o m m u n i c a t i o n market should b e achieved in 2 0 0 3 . This is stated in the l a w " O n t e l e c o m m u n i c a t i o n s " a n d is w h a t is e x p e c t e d by the government. Now L a t t e l e k o m Ltd is the largest t e l e c o m m u n i c a t i o n c o m p a n y in Latvia. It offers to clients different t e l e c o m m u n i c a t i o n s e r v i c e s i n c l u d i n g v o i c e t e l e p h o n y , Internet, lease of telephone lines, etc. T h r o u g h its s u b s i d i a r y A p o l l o L a t t e l e k o m also acts as an Internet provider. T h e following Internet a c c e s s possibilities a r e offered: • Dial-up: . ISDN; • xDSL. D e p e n d i n g on the p u r p o s e ( b u s i n e s s o r h o u s e h o l d ) different t y p e s of I S D N and x D S L are offered. S o m e t y p e s of x D S L can n o t be c o n s i d e r e d as broadband. One m o r e point must b e m e n t i o n e d h e r e . Lattelekom d o e s not keep its promises r e g a r d i n g digitalization o f t e l e p h o n e lines e v e r y w h e r e in Latvia according to the time s c h e d u l e set in the a g r e e m e n t with t h e g o v e r n m e n t . The a r g u m e n t a t i o n is based on the statement t h a t in rtiral regions, digitalization is unprofitable. At t h e same time the quality o f data t r a n s m i s s i o n by dial-up c o n n e c t i o n through a n a l o g u e lines is s o m e t i m e s very bad. This could b e c o m e a reason for digital divide. Mara Culbe. Amis (julbis. Benchmarking Problems of Topic T e l e c o m m u n i c a t i o n s and Access .. 91 O n the o t h e r hand other Internet service providers can not extend their services through the fixed t e l e p h o n e network to rural regions and m u s t try to lind an alternative (radio-link or Internet t h r o u g h satellite if possible). State Significance Data Transmission Network and Other Specialized Networks T h e r e is a state significance data t r a n s m i s s i o n n e t w o r k (Latvian abbreviation V N D P T ) in Latvia. T h i s network is i n d e p e n d e n t from the fixed public e x c h a n g e network (though s o m e lines are leased from L a t t e l e k o m ) and is supervised by the state agency V I T A (Latvian abbreviation from the A g e n c y of G o v e r n m e n t Information N e t w o r k s ) . V I T A is a N o n p r o f i t Organization State Joint-stock C o m p a n y established in 1997 a c c o r d i n g to order of C a b i n e t of Ministers N o 44. T h e aim of building up o f the agency was to p r o v i d e the d e v e l o p m e n t and m a i n t e n a n c e of a joint g o v e r n m e n t a l information s y s t e m . A s provider of State significance data transmission network services V I T A m a i n t a i n s the corporative (intranet) c o m p u t e r network c o v e r i n g all the territory of Latvia. In this n e t w o r k the F r a m e Relay p a c k e t switching t e c h n o l o g y is used. T h e a d v a n t a g e o f such a t e c h n o l o g y is t h e possibility of several logical channels in o n e physical c h a n n e l and t h e foreseeable r e q u i r e m e n t s of both data transmission speed (28.8 k b p s - 2 M b p s ) and query r e s p o n s e t i m e in t h e address space g o v e r n e d by T C P / I P protocol. B e s i d e s , the logical c h a n n e l of one user is not available for others. S e r v i c e s p r o v i d e d b y V I T A are available in all of Latvia for both g o v e r n m e n t a l establishments and state registers. T h e V N D P T n e t w o r k is a closed n e t w o r k available for g o v e r n m e n t a l institutions on-line 2 4 h o u r s a day. A c c e s s t o State significance information s y s t e m s from outside (Internet) is controlled t h r o u g h a fire-wall system. A c c e s s to public servers (containing e.g. h o m e page i n f o r m a t i o n of Ministries) is open. In order to i m p r o v e client needs r e g a r d i n g an increased data transmission rate a transition to optical fibre connections is in p r o g r e s s . F i v e ministries a n d s o m e state significance registers a r e connected t h r o u g h optical cables. A n e w service offered by V I T A is IP telephony, w h i c h allows the reduction of t e l e c o m m u n i c a t i o n c o s t s for their clients. In general benefits from V I T A services for g o v e r n m e n t a l institutions are as follows: • Lower access and traffic costs when compared with those offered by other providers: • Increased security of access and data transmission. In E u r o p e a n context the future integration w i t h IDA n e t w o r k s . objective for the V N D P T network is closer B e s i d e s the n e t w o r k supervised b y V I T A there are s o m e other specialized t e l e c o m m u n i c a t i o n n e t w o r k s in Latvia, those o f L a t v i a ' s Railroad, the Ministry o f the Interior, L a t v e n e r g o (the biggest e n e r g y supplier in Latvia), Latvia State C e n t e r of Radio and T e l e v i s i o n . D u e to restricted r a n g e o f usage t h e y are less important. 92 DATORZINATNE UN INFORMACUAS TEHNOLOGIJAS Mobile network There are two m o b i l e t e l e p h o n e n e t w o r k s in Latvia: • L M T (Latvian abbreviation from Latvia's Mobile Telephone); • T E L L 2 (subsidiary of the Tele2 A B (at the time " N e t C o m AB")). Both operators are p r o v i d i n g t e l e c o m m u n i c a t i o n services G S M 1800 frequencies a n d offering t h e following set o f s e r v i c e s : in GSM900 and • Voice telephony, • SMS: • Fax transmission; • Data transmission; • Internet: • E-mail; • WAP. T y p e s of connection u s e d to initiate d a t a t r a n s m i s s i o n a r e : • Analogue; • ISDN. L M T services s u p p o r t w i d e r possibilities for m o b i l e c o n n e c t i o n with PC w h e n c o m p a r e d with those offered by T E L E 2 . T h e s e include also B l u e t o o t h technology. The data t r a n s m i s s i o n rate 9.6 Kbps is p r o v i d e d by b o t h o p e r a t o r s , but L M T also offers another m o r e a d v a n c e d t e c h n o l o g y H S C S D ( H i g h S p e e d Circuit Switched Data) with a d a t a transmission rate up to 38.4 K b p s . L M T s e n d e e s also i n c l u d e a n e w data t r a n s m i s s i o n t e c h n o l o g y for G S M networks G P R S (General Packet Radio S e r v i c e ) based o n p a c k e t s w i t c h i n g a n d offering additional possibilities for data t r a n s m i s s i o n s e n d e e s . Historically the first m o b i l e o p e r a t o r in Latvia w a s L M T . T h e a p p e a r a n c e o f a second operator leads to r e d u c t i o n o f costs for m o b i l e t e l e p h o n e services. O n e more point m u s t be m e n t i o n e d here. T h e r e is little h o p e that in the nearest years t h e services of the fixed t e l e p h o n e n e t w o r k w i l l be a v a i l a b l e for m a n y inhabitants of rural regions. T h i s is d u e to the lack o f t e l e p h o n e lines o f the fixed n e t w o r k in m a n y places. T h e alternative is the u s a g e o f services p r o v i d e d by m o b i l e operators. T h e services offered b y L M T are a v a i l a b l e almost in all o f the territory of Latvia. T h e connection possibilities o f the T e l e 2 n e t w o r k are not so good. D u e to c i r c u m s t a n c e s described many people of rural r e g i o n s are u s i n g m o b i l e s b u t m a i n l y for voice telephony. Of c o u r s e , Internet a c c e s s t h r o u g h m o b i l e is a l s o available for t h e m but only with a rather l o w data t r a n s m i s s i o n rate p r o v i d e d by G S M . In the context o f an increased data t r a n s m i s s i o n rate t h e possibilities offered b y 3G m o b i l e services seems to be very attractive for inhabitants o f rural regions, o f c o u r s e , only in case o f acceptable service costs. T h e Internet via satellite is also p o s s i b l e but the exploitation of this service is less p r o b a b l e . In order to increase the n u m b e r o f players in the t e l e c o m m u n i c a t i o n market and facilitate its liberalization the bid of three L ' M T S licenses w a s organized by the government in 2 0 0 2 . B o t h m o b i l e o p e r a t o r s L M T and T E L E 2 s u c c e e d e d in that bid. The third license w a s n o t sold a n d . therefore, the bid m u s t b e r e p e a t e d once more. Mara Gulbe, Amis Culbis. Benchmarking Problems o f Topic T e l e c o m m u n i c a t i o n s and Access .. 93 Cable TV Networks T h e r e a r e several cable T V providers in Latvia, m a i n l y in the capital Riga (one w e l l - k n o w n in Riga is B a l t k o m ( w w w . b a l t k o m . l v ) , but there are also others). The service is available in areas with a high density of inhabitants w h e r e it is profitable. Up to n o w the barrier for d e v e l o p m e n t of cable T V services w a s the L a t t e l e k o m m o n o p o l y on u n d e r g r o u n d cables. T o o v e r c o m e this difficulty very often T V p r o g r a m s were sent through the air from the b a s e station to s o m e antenna located on a definite building connected w i t h n e i g h b o r i n g buildings b y cables also g o i n g through the air. T h e same solution is also used b y Internet providers offering Internet access via radio-link. In fact, under the conditions o f the L a t t e l e k o m m o n o p o l y , Internet a c c e s s via radio-link is a very p o p u l a r and w i d e l y used service. A s could be e x p e c t e d , s o m e cable T V providers also offer Internet services, of c o u r s e , only in the p l a c e s w h e r e the cable T V network is available. F o r this the cable m o d e m is necessary. T h e service offered is on-line c o n n e c t i o n with rather good parameters (broadband). T h e barrier for Internet through cable T V network is rather high service prices. The cable m o d e m is also n o t c h e a p . Internet Via Satellite A c c e s s to Internet is also offered b y satellite TV p r o v i d e r s (e.g. U n i s a t in Riga). T w o possibilities for Internet via satellite are offered. T h e s e are: • Europe Online (the query to specialized European proxy server and the query result through satellite to user); • D i R E C W A Y (specialized equipment for up-link to satellite and cable modem is necessary). T h e last possibility is c o n v e n i e n t not o n l y for individual access but also for collective a c c e s s (for not very large Intranets), especially in places where other kinds of access are n o t available or are o f bad quality. T h e barrier for Internet through satellite is a rather high price for the service. Policy D o c u m e n t s on Telecommunications and Access T h i s chapter c o v e r s a variety o f policy d o c u m e n t s at the national level, which contain information on national p o l i c y directions a n d priorities, regulation and legislation. These d o c u m e n t s contain those w h i c h must be highlighted as particularly important or central to the topic T & A . From the T & A infrastructure c o n s i d e r e d above and the analysis of p o l i c y d o c u m e n t s a n d national data sources given b e l o w , w e shall try t o identify the indicators which m u s t be applied to the b e n c h m a r k i n g o f m a i n issues of t h e topic T & A . N a t i o n a l P r o g r a m " I n f o r m a t i c s " 1 9 9 9 — 2 0 0 5 , Ministry o f T r a n s p o r t . 1998. This d o c u m e n t is an action plan which outlines the Latvian w a y t o w a r d s the Information Society. It includes the large set o f tasks which m u s t be solved to achieve the target. It is a c o m p l e x p r o g r a m c o v e r i n g t h e time period 1 9 9 9 - 2 0 0 5 and consisting of 13 s u b p r o g r a m s . It foresees the realization of m o r e then 120 individual projects 94 DATORZINATNE UN INFORMACIJAS TEHNOLOGIJAS directed mainly to t h e i m p l e m e n t a t i o n o f information and c o m m u n i c a t i o n technologies in the different areas of society and individual lives as w e l l as w i d e international cooperation in integration of E u r o p e a n d a t a t r a n s m i s s i o n n e t w o r k s and services. R e g a r d i n g the topic T & A the following s u b p r o g r a m s m u s t be selected: • telecommunication networks and services (9 projects), • information and data transmission networks (15 projects). • development of information and telecommunication networks (3 projects), • state significance statistics (4 projects). The goal of the s u b p r o g r a m " T e l e c o m m u n i c a t i o n N e t w o r k s and S s e r v i c e s " is the reform of t e l e c o m m u n i c a t i o n sector in Latvia a c c o r d i n g to principles a c c e p t e d b y EU. A plan of such reform is c o n s i d e r e d in detail. In the s u b p r o g r a m ' i n f o r m a t i o n a n d Data T r a n s m i s s i o n N e t w o r k s " the architecture of existing data t r a n s m i s s i o n n e t w o r k s in L a t v i a is c o n s i d e r e d and actions necessary for further d e v e l o p m e n t of these n e t w o r k s are outlined. In the s u b p r o g r a m " D e v e l o p m e n t o f Information and T e l e c o m m u n i c a t i o n N e t w o r k s " a survey on p e r s p e c t i v e I n f o r m a t i o n a n d c o m m u n i c a t i o n technologies is given and a special role of n e t w o r k t e c h n o l o g i e s (Intranet, Internet) is e m p h a s i z e d . The s t a t e ' s role in the d e v e l o p m e n t of I T C is outlined. The goal of the s u b p r o g r a m " S t a t e Significance S t a t i s t i c s " is the d e v e l o p m e n t of a m o d e m ICT-based statistical system in Latvia which is in a c c o r d a n c e w i t h standards of such a s y s t e m in EU'. The laws related to the topic T & A a n d c o n s i d e r e d a b o v e a r e results of the i m p l e m e n t a t i o n o f national p r o g r a m " I n f o r m a t i c s " , or m o r e precisely, the legislation part of this p r o g r a m . C o n c e p t u a l G u i d e l i n e s of S o c i o - E c o n o m i c P r o g r a m E c o n o m i c s , 2000 "•e-Latvia", Ministry of The d o c u m e n t outlines the main objectives of t h e s o c i o - e c o n o m i c p r o g r a m e-Latvia w h i c h appeared as a r e s p o n s e to e - E u r o p e initiative a n d in fact can b e considered as a further d e v e l o p m e n t of the national p r o g r a m " I n f o r m a t i c s " . The p r o g r a m e-Latvia represents the Latvian contribution into e - E u r o p e . T h e m a i n objectives of e-Latvia are as follows: • significantly improve the access possibilities to Internet for citizens, • based on improved computer literacy and training possibilities, to provide everybody the use of modern information technologies. for • provide for everybody the access to global and national information resources, • stimulate the development of e-commerce and e-government. Analogously to e - E u r o p e action e v a l u a t i o n e-Latvia is structured along three key objectives: • A cheaper, faster and secure internet: • Investing in people and skills; • Stimulating the use of internet. Mara Culbe, Amis Gulhis. Benchmarking Problems of Topic T e l e c o m m u n i c a t i o n s and Access .. 95 F o r T & A t h e following issues from e-Latvia are of importance: • Full digitalization of the telecommunication network, • Full liberalization of the telecommunication market, opening of the market of leased lines, • Regulation of Internet sector and services, Internet services provider's responsibility regarding data security, • Installation of public Internet access terminals in every local government, school and library, • Provision of personal data security. T h u s the c o n c e p t u a l guidelines p r o p o s e actions stimulating both the d e v e l o p m e n t of t e l e c o m m u n i c a t i o n s a n d access possibilities to Internet. R e g a r d i n g the topic T & A two projects involved in t h e e-Latvia action plan m u s t be highlighted. T h e s e are a Latvian e d u c a t i o n informatization s y s t e m and a Unified information system for L a t v i a ' s libraries. Both projects are interesting from the point of v i e w of public Internet access points ( P I A P ) . A c c e s s to Internet in schools is available for pupils and teachers and not available for p e o p l e from outside (restricted public access possibilities). In libraries such access is available for e v e r y b o d y . T h u s the i m p l e m e n t a t i o n of both projects is very important to extend the Internet access possibilities for the public. Relevant National Statistical Sources Identification and Documentation A t first it must be noted here that Information Society statistics in Latvia are in their b e g i n n i n g stage. Up to 2 0 0 2 t h e r e was n o systematical collection of statistical data on Information society, t h o u g h s o m e aspects o f such statistics were c o v e r e d . T h e d e v e l o p m e n t in the area is greatly stimulated by e-Europe^- initiative. T h e 2 0 0 3 action plan o f central statistical bureau ( w w w . c s b . l v ) ( g o v e r n m e n t a l institution) serves as a proof o f this statement t h o u g h t h e respective data will be available only at t h e b e g i n n i n g of 2 0 0 4 . Therefore the main issues from this plan related to topic T & A are not included in t h e f o l l o w i n g table c o v e r i n g t h e main national and international data sources. E v e r y y e a r data collected by Central statistical b u r e a u are published in " T h e Statistical Y e a r b o o k " . Important information regarding prices of c o r r e s p o n d i n g t e l e c o m m u n i c a t i o n services are given in the w e b s i t e s of the largest fixed and m o b i l e n e t w o r k operators. It should be noted that exactly the service price is the m o s t important barrier to w i d e penetration of Internet into h o u s e h o l d s . It is also clear that there is a strong correlation b e t w e e n the n u m b e r of c o m p u t e r s and the n u m b e r of Internet connections (if we e x c l u d e W A P ) . W i t h o u t a c o m p u t e r there is no need for an Internet connection. 96 N a m e o f data s o u r c e ( A c r o n y m ) DATORZINATNE M a i n publication(s) of i n t e r e s t for S I B I S UN INFORMACUAS TEHNOLOGIJAS Description (incl. target, survey unit) Responsible C o m p u t e r u s a g e in e n t e r p r i s e s , Data of Central statistical R e s p o n s e on e - E u r o p e + Central statistical Internet usage ( n u m b e r of bureau initiative, data bureau computers, n u m b e r of Internet c o l l e c t i o n is b a s e d o n connections, etc.), e-commerce Eurosiat module 491, survey questionnaires arc u s e d C a b l e TV network usage D a t a of C e n t r a l s t a t i s t i c a l R e s p o n s e on U N E S C O Central statistical ( n u m b e r of c u s t o m e r s , bureau initiative, data bureau e m p l o y e e s , financial figures, col l e c t i o n b a s e d o n etc.) Eurostat module 493, s u r v e y q u e s t i o n n a i r e is used T h e statistical y e a r b o o k y y y y D a t a of C e n t r a l s t a t i s t i c a l D a t a o n all t o p i c s Central statistical bureau characterising the bureau d e v e l o p m e n t o f country R e p o r t on d e v e l o p m e n t o f Ministry of e c o n o m i c s Latvia's national e c o n o m y D a t a o n all t o p i c s Ministry of characterizing the economics d e v e l o p m e n t o f country eLurope-r2003, progress report, Joint High Level J u n e 2002 Committee Information society statistics, S t a t i s t i c s in f o c u s , T h e m e Information society D a t a on candidate c o u n t r i e s 4-17/2002 statistics Richard Deiss Information society statistics, S t a t i s t i c s in f o c u s . T h e m e Information society Eurostat, Author R a p i d g r o w t h o f Internet a n d 4-37/2001 statistics Richard Deiss eEurope^- survey Joint High Level Committee Eurostat, Author m o b i l e p h o n e u s a g e in c a n d i d a t e countries M a r k e t of IT a n d Department of C o n t r o l of d e v e l o p m e n t Ministry of telecommunications (percentage I n f o r m a t i c s at M i n i s t r y o f of I C T sector transport o f d i g i t a l n e t w o r k , n u m b e r of Transport, website mobile network customers, n u m b e r of I S D N lines, e t c . ) E q u i p m e n t u s e d to b e n e f i t from D e p a r t m e n t of C o n t r o l of d e v e l o p m e n t I C T services ( n u m b e r o f I n f o r m a t i c s at M i n i s t r y o f of I C T sector Ministry of transport telephones, m o d e m s , computers Transport, website Lattelekom in h o u s e h o l d s a n d b u s i n e s s , etc.) W e b s i t e of L a t t e l e k o m W e b s i t e of L M T Website ot'Tele2 D a t a on l a r g e s t fixed Available data network services provider i m p o r t a n t for t o p i c in L a t v i a T&A D a t a on o n e (of two) Available data mobile services provider i m p o r t a n t for t o p i c in L a t v i a T&A Data on o n e (of t w o ) Available data mobile services provider i m p o r t a n t for t o p i c in L a t v i a T&A W e b s i t e of L a t v i a ' s I n t e r n e t Data on Internet s e r v i c e s Available data association customers i m p o r t a n t for t o p i c LMT Tele2 LI A T&A Here w e shall focus o u r attention on c o n v e n t i o n a l indicators u s e d to characterize t h e topic T & A in Latvia. A list of s u c h indicators is as follows. 1. Percentage of households that have fixed telephone service 2. Fixed lines per 100 inhabitants Mara Gulbe, Amis Gulbis. Benchmarking P r o b l e m * of T o p i c T e l e c o m m u n i c a t i o n s and Access .. 97 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. Mobile phone subscriptions Percentage of households with Internet access Percentage of population regularly using Internet Internet access costs Number of personal computers Internet users Number of Public Internet Access Points per 1000 inhabitants Number of Internet hosts Computerized enterprises Enterprises with access to the Internet Type of Internet connection in enterprises Enterprises with a home page on the Internet Number of employees that at their work place regularly use a computer with internet connection 16. 17. 18. 19. Number of computers with access to the internet used by enterprises Number of cable T V stations Number of cable T V customers Expenditure for cable T V business 20. Revenue from cable TV business 2 1 . Information technology expenditure 22. Telecommunication expenditure Glossary CSB DSL EC EDI ESS EU Eurostat FP5 ICT IDA IS ISDN MS NAS SIBIS SMS T&A UMTS WAP Central statistical bureau Digital Subscriber Line European Commission Electronic Data Interchange European Statistical System European Union Statistical Office of the European Commission 5 Framework Program Information and Communication Technologies Interchange Data between Administration Information Society Integrated Services Digital Network Member States Newly Accessing States Statistical Indicators Benchmarking the Information Society Short Message Service Telecommunications and Access Universal Mobile Telecommunications System Wireless Application Protocol Ih 98 DATORZINATNE UN INFORMACUAS TEHNOLOGIJAS References [1] eEurope- 2003, Action Plan, prepared by the candidate countries with the assistance of the European Commission, June 2001. [2] National program "Informatics", Ministry of Transports, 1998 (in Latvian). [3] Conceptual guidelines of socio-economic program e-Latvia, Ministry of economics, 2000 (in Latvian). [4] www.sibis-eu.org. LATVIJAS UNIVERSITATES R A K . S T I . 20()4. 6 6 9 . s e j . : D A T O R Z I N A T N E UN INFORMACUAS TEHNOLOGIJAS. 99.-106. lpp Design Interpretarion Principles in Development and Usage of Informative Systems Janis Iljins U n i v e r s i t y of Latvia, Raina 2 9 , Riga, Latvia Janisj@lanet.lv The paper summarizes the experience and ideas of the last five years in developing Informative Systems by means of Informative System Technology (1ST) and shows the application of these ideas in practice. One of the main methods discussed in the paper allows exact implementation of the svstem compatible to the design - design interpreter method. In the paper the system design formalization method interpreted during the operation of the system is suggested. The 1ST environment acts as the design interpreter. Other advantages of 1ST are described - modularity and simple maintenance, the possibility to automate the system prototype development, to generate the system documentation automatically and other advantages. All the described methods are evaluated according to their practical application. Key words: informative systems, 1ST. system prototype, maintenance. Introduction In o r d e r to d e v e l o p a n d operate the Informative S y s t e m (IS) according to the classic IS " w a t e r f a l l " life cycle it is necessary to describe the p r o b l e m statement or design from the n e e d s of the real w o r l d . It is the b a s e of system i m p l e m e n t a t i o n [ I ] . Typical p r o b l e m s o c c u r r i n g in this a p p r o a c h are design i n c o m p l i a n c e to actual requirements and i m p l e m e n t a t i o n i n c o m p l i a n c e to design. T h e last most often occurs in the phase of system i m p l e m e n t a t i o n a n d m a i n t e n a n c e w h e n errors are stated in the implemented system; they are e l i m i n a t e d w i t h o u t any c h a n g e s in the d e s i g n . In the practice often systems t h e design specification and s y s t e m d o c u m e n t a t i o n of which are older then i m p l e m e n t a t i o n are met. T h e y d o not contain information on the latest c h a n g e s and innovations in the system. T h e r e are w e l l - k n o w n w a y s h o w to solve the a b o v e - m e n t i o n e d problems. The system p r o t o t y p e is u s u a l l y u s e d to c h e c k the c o m p l i a n c e of the design to the requirements. Before i m p l e m e n t a t i o n of t h e system the p r o t o t y p e is tested and changes can be e a s i l y d o n e . As often the system p r o t o t y p e is automatically generated from the design specification, it exactly c o m p l i e s with t h e design. To achieve the system implementation c o m p l i a n c e with the design used code generation is usually. In such a case the design should b e formalized by m e a n s insuring c o d e generation. S u c h means are offered by the specification l a n g u a g e G R A P E S and its e n v i r o n m e n t G R A D E elaborated in Latvia [ 2 ] [ 3 ] , M a n y other tools like R A T I O N A L R O S E . O R A C L E Designer 2 0 0 0 [4] etc. a r e w i d e l y successfully used in the world. As the design specification does not c o n t a i n information on implementation details, code generation makes p r o g r a m m i n g e a s i e r but does not eliminate it. It is possible to c h a n g e the generated system code, s o causing danger that the c h a n g e d c o d e m a y not c o m p l y with the design. 100 DATORZINATNE UN INFORMACIJAS TEHNOLOGIJAS At the B a n k of Latvia in 1995 d e v e l o p m e n t of the C u r r e n c y O p e r a t i o n M a n a g e m e n t Informative System ( V O P I S ) w a s started u n d e r the l e a d e r s h i p of Professor M. T r e i m a n i s . It w a s clear at the b e g i n n i n g that the s y s t e m will h a v e to be able to operate in an e n v i r o n m e n t in which r e q u i r e m e n t s are c h a n g i n g very fast and it will be necessary to m a k e m a n y c h a n g e s d u r i n g its m a i n t e n a n c e . T h e r e f o r e special attention w a s paid to the fact h o w to d e v e l o p the s y s t e m so that it is easy to c h a n g e it in the future and how t o achieve that the c h a n g e s are easily reflected in the system d o c u m e n t a t i o n . Professor M . T r e i m a n i s offered to d e v e l o p t h e 1ST c o n t a i n i n g the d e s i g n m e t h o d and the e n v i r o n m e n t s u p p o r t i n g the m e t h o d [5]. T h e specified design is stored in the database b y m e a n s of the m e t h o d . 1ST e n v i r o n m e n t c a n interpret the design. Such an approach allows us to: • combine design and implementation as the system implementation will directly use the design definition, so implementation will exactly comply with the design; • change the applied system easily, for to make changes in the system functionality, in most cases, it is necessary to m a k e changes only in the data stored in the database; • maintain the system documentation corresponding to implementation as it can be obtained automatically from the design definition. 1ST w a s successfully u s e d in V O P I S d e v e l o p m e n t and operation. Later 1ST b e c a m e the technology o f all IS in the B a n k of Latvia w h e r e it is still used. 1ST is also successfully used in D a t o r i k a s Instituts Ltd. e l a b o r a t e d IS. Following p a p e r c h a p t e r s s h o w the 1ST offered d e s i g n formalization m e t h o d s and describe t h e project interpreter. Different 1ST possibility a p p l i c a t i o n s in real projects, their advantages a n d d i s a d v a n t a g e s are a n a l y z e d . System Design M e t h o d The IS d e v e l o p e d b a s e d on this design m e t h o d is o p e r a t e d by m e a n s o f the design interpreter - 1ST. To d e v e l o p a s y s t e m b y 1ST m e t h o d t h e w o r k c o n s i s t s o f three i n d e p e n d e n t parts: • database design, • technology definition, • development of specific applications. The essence o f these parts will be d e s c r i b e d further. Database Design T a r g e t system d a t a b a s e design is d e v e l o p e d by m e a n s o f the classic E R model. Usually, in actual 1ST d e s i g n s in d a t a b a s e E R m o d e l d i a g r a m s , Object M o d e l notation is used [6]. 1ST also e n v i s a g e s t h e registration o f the entities and relations o f ER model in the database, although it is n o t c o m p u l s o r y . If the d e v e l o p e d E R m o d e l is registered in the t e c h n o l o g y database, it g i v e s the f o l l o w i n g a d v a n t a g e s : • automatic data structure (table) formation (definition script generation) is possible; Janis lljins. Design Interpretation Principles in Development and Usage of Informative S y s t e m s 101 • development of the system prototype even before development of the actual data structure is possible; • automatic system documentation generation; • technology definition work is made easer. Several real s y s t e m s h a v e been d e v e l o p e d in which the target s y s t e m d a t a b a s e is d e v e l o p e d b y m e a n s of classical relation d a t a b a s e tools. In such cases these a d v a n t a g e s are n o t u s e d and E R m o d e l description (entities and relations) is not registered in the t e c h n o l o g y d a t a b a s e . N e v e r t h e l e s s , the s y s t e m is successfully operated by m e a n s of 1ST interpreter. Technology Definition T h e basic principles and m e t a m o d e l of t e c h n o l o g y definition are described in [5]. T o d e v e l o p IS it is n e c e s s a r y to define the organization w o r k t e c h n o l o g y . It is done by c o m b i n i n g various m e t h o d s already k n o w n described in [6] [7] [8] [9]. In order to define the t e c h n o l o g y the following questions should be answered; • " W h y ? " - the tasks of the organization should be understood. 1ST envisages defining the tasks, to define the subtasks for the tasks that can be described in more detail. In the result a hierarchic network is formed. 1ST task module helps programmers understand the target system, but it is not necessary during operation. Therefore the design interpreter does not use it. • " W h o ? " - it should be clarified what structural units and what staff of the organization are involved in implementation of the task. The design interpreter uses the information on the staff from 1ST organization module in order to give them Login rights and control the staff rights to work with IS. • " W h e r e ? " - according to the posts and responsibilities of the organization's staff members the necessary work places and what rights and appearance are needed for every work place should be specified. Starting a work session with the design interpreter in 1ST environment the user identifies and chooses one of the available work places where he will work during the work session. The user will be offered those IS possibilities which are envisaged in the design for the particular work place. • " W h a t ? " - the design determines which database objects the user may process at the particular work place. T h e 1ST view concept is introduced - 1ST view is a set of objects of a single object class of the target system. In the context of a work place a hierarchic view network may be formed. As a view is a set of objects a sub view can be defined as the set of objects related to the selected view object. The project interpreter offers the user to see the available views and the objects in them, to find and select them for implementation of operation. • " H o w ? " - in the context of a work place an operation can be implemented for the view or object in view. Calling the operation the design interpreter starts another application shown in the design. The application receives a parameter from 1ST environment - the selected object in the view network. • " W h e n ? " - for particular target system object classes state-transition diagrams can be designed. The technological process for each object of this class is started during the work and the object is set in a definite state of the state-transition diagram. 1ST environment offers the possibility to change the object state according to the state transition diagram. The view network should be designed so that the views containing 102 DATORZINATNE UN INFORMACIJAS TEHNOLOGIJAS the objects in the needed state could be available in the particular work places. The operation should be linked to the views that change the object state. The descriptions of the system u s e r s , w o r k p l a c e s , w o r k place view n e t w o r k and operations linked to the v i e w s as w e l l as state transition d i a g r a m s together form an interpreted t e c h n o l o g y that at the s a m e t i m e is IS d e s i g n . In order to operate IS only the design description h a s to be interpreted. Application Development Specific target s y s t e m applications are d e v e l o p e d separately. T h e y differ for every system. Therefore they are designed s e p a r a t e l y a n d i n d e p e n d e n t l y o f the organization t e c h n o l o g y used b y 1ST. The principle o f m o d u l a r i t y is o b s e r v e d in d e v e l o p m e n t of applications - every specific function o f the system has its o w n a p p l i c a t i o n . T y p i c a l l y every target system object class has its o w n application to register the objects o f this class and to edit their attributes. In 1ST operation it is indicated w h i c h specific a p p l i c a t i o n and w h i c h specific m o d e to call it. For all a p p l i c a t i o n s it is n e c e s s a r y to u s e a p r e d e f i n e d c o m m a n d line interface by m e a n s of w h i c h the 1ST e n v i r o n m e n t c o n v e y s information to application on the u s e r ' s activities - w h a t objects the user has selected in the 1ST e n v i r o n m e n t and wants to indicate as p a r a m e t e r s for application. Design Interpreter Implementation Implemented Metamodel The interpreted design description is f o r m a l i z e d b y the 1ST m e t a m o d e l w o r k place m o d u l e [5] (Figure 1.) W o r k Place is c o n n e c t e d with Q provides provides is a l l o w e d to -OMessaee is related to is c o n n e c t e d with is c o n n e c t e d in c o n t e x t with View of in c o n t e x t s£ has been apphed Operation Employee Figure I: Part of the 1ST Workplace Module T h e notion o f the W o r k p l a c e is u s e d to define the e m p l o y e e ' s rights and duties. M o r e formally, the 1ST W o r k p l a c e M o d u l e defines • Workplace o T h e Workplace entity attributes are the name and the comment. This entity is used as a placeholder to define the employees' rights and duties. Janis lljins. Design Interpretation Principles in Development and Usage of Informative Systems 103 • Message o The Message entity main attributes are the text (for instance. Warning: FX deals are waiting confirmation!), the icon-representing message and SELECT statement. It is possible to use arbitrary SELECT statements with a set of predefined variables. ISTE executes this statement after defined time intervals. The message text appears, if SELECT statement, when executed, finds some object. • Set of technological relations o the relation between the Employee and the Workplace entities allows employees to participate in the organization's technological process with a predefined set of rights, which will be defined below; o the relation between the View and the Workplace entities allows the employee w h o works in a particular Workplace to get access only to predefined sets of Views and Objects; o the relation between the View, the Operation and the Workplace entities allows the employee to execute a predefined set of Operations with Views and objects in a particular Workplace; T h e relation b e t w e e n the M e s s a g e and the W o r k p l a c e entities allows the employee to receive M e s s a g e s a b o u t predefined events, w h i c h is c o n n e c t e d with the Workplace. Usually these m e s s a g e s are related to a necessity to p r o c e s s s o m e object that "arrives" in the W o r k p l a c e in s o m e V i e w . T h e relation b e t w e e n the M e s s a g e and the View entities a l l o w s the e m p l o y e e to find those objects easily. 1ST interprets t h e s e technological relationships and thus enforces e m p l o y e e s to perform the o r g a n i z a t i o n ' s tasks according to predefined O T M . Functional Shell T h e Functional Shell is a p r o g r a m - design interpreter operating according to the above d e s c r i b e d m e t a m o d e l a n d interprets the system design description coded in it. The p r o g r a m controls t h e user a p p r o a c h to the database, thus so controlling operation of the s y s t e m . T h e p r o g r a m insures w o r k in a particular w o r k place available to the user. In w i n d o w s the v i e w n e t w o r k - views and the objects they contain which may have linked sub v i e w s . S o it is c o n t r o l l e d what the user sees and can process from the database. Selecting a particular object or view o p e r a t i o n s by m e a n s of w h i c h the selected object can b e p r o c e s s e d are s h o w n . T h i s indicates h o w the p r o c e s s i n g is done. After a definite time interval the m e s s a g e w i n d o w is refreshed. In it t h e user can see w h e n to do the appropriate activities. T h e m a i n w i n d o w of t h e functional shell as it is seen by the system user is s h o w n in Figure 2, 104 DATORZINATNE UN INFORMACUAS Who Skatit darijumu TEHNOLOGIJAS 9 »izsauktsf Figure 2. Main window of the Functional Shell. 1ST Additional Possibilities Besides the F u n c t i o n a l Shell it is also p o s s i b l e to realize o t h e r possibilities using the s y s t e m design e n t e r e d in the d a t a b a s e . O n e of s u c h possibilities is data structure generation. If in t h e t e c h n o l o g y description the s y s t e m d a t a b a s e entity types and relations b e t w e e n t h e m as well as entity attributes are defined it is possible to d e v e l o p a p r o g r a m that transforms the design description c o d e d in t h e d a t a b a s e into S Q L C R E A T E T A B L E s t a t e m e n t s , so creating a real d a t a b a s e structure. It is possible to enter in the t e c h n o l o g y d a t a b a s e o b j e c t s - e x a m p l e s for every type of entities. T h e s y s t e m design can b e m a d e i n c l u d i n g w o r k places and the views they contain so that the objects s h o w n in the v i e w s are o b j e c t s - e x a m p l e s from the t e c h n o l o g y database. T h u s it is possible to d e v e l o p a s y s t e m p r o t o t y p e e v e n before the real database structure is d e v e l o p e d . T h e p r o t o t y p e will be o p e r a t e d b y the real design interpreted by m e a n s of w h i c h the i m p l e m e n t e d s y s t e m will be o p e r a t e d later. So the prototype will precisely coincide w i t h the system i m p l e m e n t a t i o n . T h e design description that can be interpreted or u s e d for data base structure generation can be p r i n t e d in a p r e p a r e d d o c u m e n t a t i o n t e m p l a t e . For it a p r o g r a m is needed that reads the design description in the d a t a b a s e and passes it to a tool like V I S I O or M S W o r d . Information on reports necessary to print from the target system can be entered into the t e c h n o l o g y d a t a b a s e . T h e t e c h n o l o g y d a t a b a s e stores information on the data needed for the report ( S Q L S E L E C T s t a t e m e n t r e t u r n i n g the data) and information on how the data should be s h o w n in the report t e m p l a t e s . So a p r o g r a m - report interpreter can be implemented by m e a n s o f w h i c h m o s t o f the s y s t e m r e p o r t s can be developed. Janis Iljins. Design Interpretation Principles in Development and Usage of Informative Systems 105 1ST Application E x a m p l e s Experience I S T e c h n o l o g y c u r r e n t l y is used in the B a n k of Latvia as the standard framework for building Information S y s t e m s . Besides the Bank of Latvia between 1996 and 2001 I S T e c h n o l o g y f r a m e w o r k w a s used in m a n y other organizations to build and maintain c o m p l e x Information S y s t e m s (Figure 3). Organization Information Svstem ISTechnology (as described in this paper) Salaries Management IS Employees Management IS Commercial Banks Management IS Trade and Treasury Management IS for Central Bank Computer and Software Engineering Institute I Ltd. Bank of Latvia Bank of Latvia Bank of Latvia Bank of Latvia Trade and Treasury Management IS for Commercial Bank Unibank of Latvia Trade and Treasury Management IS for Investment fund Optimus fund Patient Register IS Health Department of Riga Municipality 1 Pension Capital Management • Private Pension Fund IS for Unipensija Figure 3. Examples of ISTechnology usage. Application Evaluation All 1ST possibilities w e r e m o s t c o m p l e t e l y applied in the Bank of Latvia. All s y s t e m s w h e r e deigned b y t h e 1ST m e t h o d and operated by t h e design interpreter - 1ST F u n c t i o n a l Shell. Several s y s t e m d e v e l o p m e n t e n v i r o n m e n t s w e r e created in one 1ST e n v i r o n m e n t . A part o f t h e 1ST m o d u l e s w a s assessed as successful. But for a part of the i m p l e m e n t e d m o d u l e s t h e application in practice did not p r o v e to be useful. The functions of these m o d u l e s w e r e carried out by other tools a n d they were not actually used. So the data structures w e r e not g e n e r a t e d by m e a n s of 1ST, as the applied SQL Server Software includes sufficiently easy to u s e structure creation tools, and there was no need to u s e 1ST tools. S y s t e m d o c u m e n t a t i o n generation and system prototyping before data structure g e n e r a t i o n also w a s not justified in practice because too much configuration work w a s n e e d e d for it. T h e m a i n reason w h y 1ST m o d u l e s w e r e not used is the complicated configuration work n e e d e d to enter the t e c h n o l o g y definition in the database. Handy editors are lacking. It would be w o r t h d e v e l o p i n g such, but it is a time c o n s u m i n g process. T h e 1ST m o d u l e that p r o v e d to be the most efficient is the design interpreter described in this paper. T h e m e t h o d for the technology definition by w h i c h w o r k place configuration is d e s c r i b e d indicating the w o r k place view network and the linked 106 DATORZINATNE UN INFORMACUAS TEHNOLOGIJAS operations and interpretation of the t e c h n o l o g y by the functional shell m a k e s 1ST u n i q u e . S y s t e m s d e v e l o p e d and o p e r a t e d in this w a y are easy to u s e and maintain. Therefore 1ST is u s e d in m o r e and m o r e n e w projects. In s o m e s y s t e m s d e v e l o p e d and o p e r a t e d b y 1ST, especially in U n i b a n k o f Latvia, the system d y n a m i c m o d e l is successfully u s e d , the 1ST state-transition and m e s s a g e m o d u l e is based on it and the report g e n e r a t i o n m o d u l e . Both these m o d u l e s are additional to the Functional Shell h e l p i n g to m a k e the s y s t e m m a i n t e n a n c e even easier. Future Perspective 1ST is w o r k e d o u t in client-server a r c h i t e c t u r e b y m e a n s o f w h i c h client-server IS is d e v e l o p e d . N e v e r t h e l e s s , t o d a y other t e c h n i c a l s o l u t i o n s are found. A possibility to d e v e l o p 1ST as an I n t e m e t - b a s e d IS d e v e l o p m e n t and i m p l e m e n t a t i o n tool is considered. N e w 1ST versions are c r e a t e d built in 3-level architecture - database, m i d d l e w a r e and user interface. So the IS d a t a security w o u l d be i m p r o v e d , and as well as there w o u l d be a possibility for o n e s y s t e m to u s e different user interfaces, for e x a m p l e , users c o u l d w o r k s i m u l t a n e o u s l y with o n e s y s t e m from M S W i n d o w s e n v i r o n m e n t and Internet. N e w data presentation possibilities to substitute t h e trees seen in the w i n d o w s in the current i m p l e m e n t a t i o n are also b e i n g s e a r c h e d for. O n e of the solutions is to approach m a x i m a l l y the a p p e a r a n c e o f the 1ST w i n d o w s to those of generally a c c e p t e d M S W i n d o w s Explorer. Conclusion The p a p e r describes a n e w a p p r o a c h to s y s t e m d e s i g n i m p l e m e n t a t i o n - the design interpretation. T h i s a p p r o a c h e n s u r e s exact c o m p l i a n c e of s y s t e m i m p l e m e n t a t i o n w i t h design description and m a k e s the s y s t e m e a s y to m a i n t a i n . I m p l e m e n t a t i o n o f the a b o v e m e n t i o n e d m e t h o d is also discussed. T h e m e t h o d is applied in practice in real projects. It should be d e v e l o p e d a c c o r d i n g to t o d a y ' s updated t e c h n o l o g i e s , a n d n e w e a s y to u s e v e r s i o n s in w h i c h application of all 1ST m o d u l e s w o u l d justify t h e m s e l v e s s h o u l d be elaborated. References Brace I. Blum. Software Engineering a Holistic View. Oxford University Press, 1992. Treimanis M.. tvrasts O., Nazartchik I. GRADE. GRAPES Development Environment. University of Latvia, 1992. Bicevskis J. Specification Language GRAPES/PLUS. University of Latvia, 1992. Oracle Designer'2000. Product Overview. Oracle Corporation. Treimanis M. ISTchnology - Tehnology Based approach to Information System Development. University of Latvia, 1997. James Ruabaugh and Co. Object-Oriented Modeling and Design. Prentice-Hall, 1991. Sowa J.F., Zachman J.A. Extending and Formalizing the Framework for Information Systems Architecture. IBM Systems Journal, Vol. 3 1, No. 3. pp. 590 - 6 1 6 , 1992. David R.Hampton. Contemporary Management. McGraw-Hill, 1981. Capers Jones. CASE'S Missing Elements. IEEE Spectrum, June pp. 38-41. 1992. LATVIJAS UNI VERS ITATES RAKSTI. 2 0 0 4 . 669. sej.: D A T O R Z I N A T N E UN INFORMACIJAS TEHNOLOGIJAS. I07.-1I6. lpp Semantics and Equivalence of UML Class Diagrams Girts Linde I M C S , U n i v e r s i t y of Latvia, glinde@acm.org One and the same "real world" can be modeled by different UML class diagrams, which in such a case can be considered "intuitively equivalent". A new approach to the formalization of this "intuitive equivalence" of class diagrams is proposed, based on the fonnal object-oriented cognitive process. Two theorems are proved to support this approach. The new formalization can be used to construct algorithms for class diagram analysis. Key words: UML, class diagrams, equivalence. Introduction A s object modeling gains popularity, starting from the pioneering work [1], a need emerges for methods o f handling and evaluating models. Object models are used in various stages of system development - from definition of requirements to system design. One p r o b l e m is to validate the model as it gets transformed and refined in the lifecycle of system d e v e l o p m e n t . R e s e a r c h is already being done on this topic [ 2 - 5 ] . A m o r e general problem is to compare several alternative object models of the same "real world". O u r goal is to formalize this "intuitive equivalence" for object models represented with U M L class d i a g r a m s . Consider the class diagrams m o d e l i n g directed graphs shown in Figure 1. 1 Node Node , 1 S t a r t s at , Edge E n d s at E d g e to C o n n e c t e d to , Standpoint S t a r t s with 1 Node Edge 1 * C o n n e c t e d to E n d s with Endpoint Fig. L Three class diagrams of a directed graph F o r m a l l y , we have here 3 totally different class diagrams. Each of t h e m describes different instance diagrams. But intuitively we know that all three of these diagrams describe the s a m e "real world" - directed graphs. 108 DATORZINATNE UN INFORMACUAS TEHNOLOGIJAS In [6] a formal definition of the "intuitive equivalence" o f class diagrams is proposed called reduction. It is defined on the instance diagram level a n d uses c o m p o s i t e classes to define the transformation between the class d i a g r a m s . In this paper a new approach to the definition of the "intuitive equivalence" o f class d i a g r a m s is proposed, called semantical implication. It is based on the formalization of the object-oriented cognitive process, introduced in [7]. T w o t h e o r e m s are p r o v e d confirming that the n e w approach is more general and intuitive. The paper is structured as follows. In section 2 a restricted formalization of the cognitive process is outlined. In section 3 t h e new definition of the "intuitive equivalence" is presented. In section 4 the reduction of class d i a g r a m s is briefly outlined. A n d finally in section 5 the difference between the t w o definitions of the "intuitive equivalence" is examined, proving t w o t h e o r e m s . The formalization of the cognitive process T h e definition of t h e class diagram semantical implication is based on a restricted version of the formal cognitive process (for the full version see [7]) h a p p e n i n g in the mind of a modeler w h e n he analyzes a fragment o f the real w o r l d and builds an object model for it. T h e fragment of the real world to be m o d e l e d , t o g e t h e r with the object-oriented structure that gets created in the mind of t h e m o d e l e r w e represent with a labeled directed graph and call it the cognition state (Fig. 2 ) . D u r i n g the cognitive process, as the modeler learns n e w concepts from the fragment o f the real world, his mind is populated with new elements that correspond to objects, classes, links and associations. W i t h every addition of a new element a new cognition state is created. T h e resulting sequence of cognition states we call the cognition path. The cognition path e n d s with a cognition state that contains the final object m o d e l as a part of it (Fig. 3). State of t h e abstract world ,< State o f tbe-'real wDrld Fig. 2. A cognition state Girts Linde. Semantics and Equivalence Snapshot of the Real worid of L'ML Class 109 Diagrams n i t i o n start s t a t e State Gf the bslracl worid Fig. 3. Overview A m o r e precise description of the cognition state and the cognition path follows in the next t w o subsections. Cognition state T h e cognition state is defined as a labeled directed graph that consists of t w o interconnected parts: • the state of the real world, which formally represents the fragment of the real world that is being modeled. • the state o f the abstract world, which formally represents the object model created in the modeler's mind on a fixed moment in time. Each n o d e o f the state o f the abstract w o r l d has o n e of the four labels, according to the element o f the object m o d e l the node represents: O - o b j e c t node, C - c l a s s n o d e , L - l i n k node, A —association node. (Note that links a n d associations are represented as nodes in the graph.) T h e following predefined edges are used to connect the nodes (they have a graphical notation, as s h o w n in Fig. 4 ) : • startpoint (S) - connects a link or association node to its startpoint node, • endpoint (E) - connects a link or association node to its endpoint node. • instance-of instances, • part-of- - connects a class or association node to the nodes representing its connects an object node to nodes representing the parts of the object. 110 DATORZINATNE UN INFORMACUAS TEHNOLOGIJAS Start state of the abstract world (empty) startpoint -S- endpoint -E- State of the real world instance-of • part-of • Fig. 4. Notation of predefined edges s r Fig. 5. A cognition start state We define the cognition start state as a cognition state t h a t consists of a state of the real world and an empty state of the abstract w o r l d (Fig. 5). Cognition path During the cognitive process, a cognition path is incrementally created. It is a sequence of cognition states, beginning with a cognition start state. Given a cognition state, the next cognition state is produced by extending the state of the abstract world in it - by adding a new node or b y adding o n e of t h e predefined e d g e s between the nodes. A s in [6], in this paper w e w o r k with the basic class d i a g r a m s containing classes, binary associations and t h e four m o s t popular multiplicity constraints (0..1, 1..1, 1..*, 0..*). So, the cognitive process can b e restricted to u s e only the n e e d e d steps. W e call the resulting cognition path a constructive cognition path. Definition. W e call a cognition path constructive cognition path (CCP): constructed u s i n g the following steps a • "create a new aggregate object" (Fig. 6), which represents a subgraph of the real world state: create a node with label O, and connect it to all nodes of the subgraph by the edge part-of; • "create a new class": create a node with label C (a class node); • create an edge instance-of from an existing class node to any existing object node that has no instance-of edges connected to it; • "create a new association": create a node with label A (an association node), connect it to two existing class nodes by the edge S and the edge E; • "create a new link" (Fig. 7): create a node with label L (a link node), connect it to two existing object nodes by the edge S and the edge E; • create an edge instance-of from an existing association n o d e to any existing link node that has no instance-of edges connected to it; this is allowed only if the nodes connected by the link (nodes to which edges S and E go) are connected by instance-of edges to the class nodes, connected by the association (with edges S and E. respectively). Girls Linde. S e m a n t i c s a n d E q u i v a l e n c e of U M L C l a s s 111 Diagrams Fig. 6. Adding an object r-s,„.„,... "S (O) f s£/ F;g. 7. Adding a link V E If w e assign n a m e s to the class and association nodes of a given C C P then we can construct the corresponding instance diagram. The new definition of the "intuitive equivalence" Consider a real world state R, and two C C P s PI and P2 starting from this real world state R. If P2 can be obtained from PI aggregating s o m e of the objects and classes while preserving s o m e properties o f the structure, then w e a s s u m e that these cognition paths model the s a m e concept in different levels of detail. A more precise definition follows. Definition. W e say that P 2 is less detailed than P I , if the following 4 conditions hold: 1) if t w o nodes of R in P I are connected b y part-of edges t o one and the same object node (Fig. 8 a), then in P2 they are also connected by part-of edges to one and the same object node (Fig. 8 b), Cognition path P1 Cognition path P2 (a) (b) Fig. S. Condition for object nodes 112 DATORZINATNE UN INFORMACUAS TEHNOLOGIJAS 2) if there are two nodes a and b in R that are in P I c o n n e c t e d by part-of e d g e s to object nodes cl and d l , respectively, a n d there is a link n o d e connected by S e d g e to cl and b y E e d g e to dl (Fig. 9 a), then in P2 they are connected b y part-of edges to one object node (Fig. 9 b) or they are c o n n e c t e d b y part-of e d g e s to object n o d e s c2 a n d d2, respectively, and there is a link node c o n n e c t e d b y S e d g e to c2 and by E e d g e to d 2 (Fig. 9 c). C o g n i t i o n p a t h P2 Cognition path P1 , s \ b (o) c K 6 E dl (0) l (a) 6 o > 6 0) 2 b (b) c d2 2 6 6 i (c) Fig. 9. Condition for link nodes 3) if there are t w o nodes a and b in R that are in P I c o n n e c t e d by part-of e d g e s to object nodes cl and d l , respectively, a n d the object n o d e s cl and dl are connected by instance-of edges to the same class n o d e (Fig. 10 a), then in P2 they are connected b y partof e d g e s to o n e object node (Fig. 10 b) or they are connected b y part-of edges to object n o d e s c2 a n d d2, respectively, and the object n o d e s c2 a n d d2 a r e connected by instanceof e d g e s to the same class node (Fig. 10 c). C o g n i t i o n p a t h P1 C o g n i t i o n p a t h P2 (o)c. (o) 6 6 (a) dl b O) 6 (b) & c 2 b 6 (2) (c) d2 6 Fie. 10. Condition for class nodes 4) if there are 4 n o d e s a, b. c and d in R. and PI contains 4 object nodes a l , b l , c l and d l , 2 link n o d e s el a n d fl, and an association n o d e g l , c o n n e c t e d as in (Fig. 11 a), then in P2 there are t w o cases possible: • a and b are connected by part-of edges to one and the same object node, and c and d are connected by part-of edges to one and the same object node (Fig. 11 b) • or there is the same configuration as in P1 - there are 4 object nodes a2, b2, c2 and d2, and 2 link nodes e2 and f2, and an association node g2, connected as in Fig. 1 1 c. Gins Linde. S e m a n t i c s a n d E q u i v a l e n c e of U M L Class Cognition path P 1 (0) a l ©>> 113 Diagrams Cognition path P2 1 (O)" (O) 1 1 1 Co)* © » vo)" 2 (") C$ K c2 (°)' lb] la) f7g. / / . Condition for association nodes N o w , using this method o f comparing cognition paths w e define the "intuitive equivalence" o f class diagrams, in this paper called semantical implication. Definition. W e will say that a class diagram CD1 semantically implies a class diagram C D 2 , if for every state o f the real world R: • if there exists a CCP PI starting from R whose corresponding instance diagram ID1 satisfies CD1, • then there exists a CPP P2 starting from R that is less detailed than PI and whose corresponding instance diagram ID2 satisfies CD2. Reduction of class d i a g r a m s In [6] the "intuitive equivalence" of class diagrams is formalized as a reduction. The reduction o f the more detailed diagram to the less detailed one is defined introducing a set of concepts. Every concept is represented with a composite class containing s o m e of the existing classes and associations, and assigning multiplicity constraints to the enclosed classes. The set o f new concepts must cover all the classes o f the class diagram. The multiplicity constraints assigned to the contained classes must guarantee that the composite classes have only connected instances. When reducing an instance diagram with a set o f concepts, each group o f objects and links that forms an instance o f one o f the concepts is replaced with a new object, and so a n e w (less detailed) instance diagram is constructed. Definition. W e will s a y that a class diagram D l can be reduced to a class diagram D2, if there exists such a set o f n e w concepts S, using which every instance diagram of D1 can be reduced to an instance diagram o f D 2 , and all instance diagrams of D 2 can be constructed this w a y from instance diagrams o f D l . The relation between the t w o definitions In this section w e s h o w that the semantical implication is more general than the reduction. Theorem 1. If a class diagram D l can be reduced to a class diagram D 2 , then D l semantically implies D2. 114 DATORZINATNE UN INFORMACUAS TEHNOLOGIJAS Proof. Consider • a class diagram D l that can be reduced to D2 by a set of concepts S, • a state of the real world R, for which a constructive cognition path PI exists starting from R, whose corresponding instance diagram ID1 satisfies D l , • an instance diagram ID2 that is reduced from ID1 using a set of concepts S. To prove the theorem w e show that w e can construct a constructive cognition path P2 starting from R that is less detailed than PI and w h o s e corresponding instance diagram is ID2. First we construct the last cognition state X o f P2 w h o s e corresponding instance diagram is ID2 and which is based on the s a m e state o f the real world R: • group the object and link nodes and the class and association nodes into subgraphs according to the new concepts in S, • replace each object-link subgraph with an object node, and each class-association subgraph with a class node, • replace multiple instance-of edges between two nodes with one instance-of edge, • replace multiple link nodes (together with S and E edges) that are between the same object nodes, and that are connected by instance-of edges to the same association node, with one link node (and one pair o f S and E edges). N o w it is easy to construct a constructive cognition path P2 that starts with the state of the real world R and ends with the state X. The cognition path P2 is constructed in the following order: 1) create aggregate object nodes with part-of edges (in any order), 2) create link nodes with S and E edges (in any order), 3) create class nodes (in any order), 4) create instance-of edges between object nodes and class nodes (in any order), 5) create association nodes with S and E edges (in any order), 6) create instance-of edges between link nodes and association nodes (in any order). The set o f new concepts S by definition is constructed so that every class and association of D l belongs to exactly one new concept. Keeping that in mind it is easy to see that the new cognition path P2 is less detailed than the cognition path PI (all four conditions are satisfied). T h e o r e m 2. There exist two class diagrams D l and D 2 , such that • D l semantically implies D 2 , but • D l cannot be reduced to D2. Proof. Consider the two class diagrams in Fig. 12. They both describe directed graphs. The difference is that D l a l l o w s us to model the structure o f graphs, but for D2 every object instance is a whole graph. Girts Linde. Semantics and Equivalence c l a s s diagram D1 of U M L C l a s s Diagrams 115 c l a s s diagram D2 E d g e to Node Fig. 12. Two class diagrams of a directed graph Let's prove that D l semantically implies D2. Consider an arbitrary state o f the real world R, for which there exists a CCP PI w h o s e corresponding instance diagram satisfies D l . W e can construct a C C P P2 that starts from the same state of the real world R and contains one object node, which is connected by part-of edges to all the nodes o f R, and which is connected by an instance-of edge to a class node. The corresponding instance diagram o f P2 satisfies D 2 . It is easy to see that P2 is a less detailed CCP than PI. N o w , let's see if D l can be reduced to D 2 . It can be proved that the only new concept that can be constructed for the diagram D l is a trivial composite class that contains the class Node, but doesn't contain the association Edge-to. The association Edge-to cannot be included in a new concept together with Node, because it forms a cycle, and cycles are not allowed inside the concept (that comes from the requirement that a new concept must have only connected instances). Therefore, D l can be reduced only to D l , but cannot be reduced to D2. Conclusion In the paper a new formalization approach of the U M L class diagram "intuitive equivalence" is presented. It is enabled by the semantics o f class diagrams that is based on the formalization o f the object-oriented cognitive process. T w o theorems are proved to compare the two definitions o f the class diagram "intuitive equivalence". The theorems confirm that the n e w definition is more general - the reduction is a special case of the semantical implication. In fact, Theorem 2 can be generalized to prove that any class diagram semantically implies the trivial class diagram containing one class. That is a fundamental property of system modeling. These theoretical results show that the semantical implication is now very close to the concept o f the class diagram "intuitive equivalence". The next problem that must be researched is the algorithmical decidability o f the class diagram semantical implication, as it is already done for the reduction in [6]. Finally, it must be noted that this paper gives another confirmation of the sufficient generality o f the object-oriented cognitive process formalization. It is shown that the formalization can be successfully used to reason about the problems o f the class diagram equivalence. 116 DATORZINATNE UN INFORMACIJAS TEHNOLOGIJAS References J.Rumbaugh, M.Blaha, W.Premerlani, F.Eddy W.Lorsen. Object-oriented modeling and design. Englewood Cliffs, NJ: Prentice Hall, 1991. A.S.Evans, R.B.France, K.C.Lano, B.Rumpe. The UML as a Formal Modelling Notation. UML'98. LNCS, Vol. 1618, Springer, 1999 K.Lano, J.Bicarregui. Semantics and Transformations for UML Models. UML'98. LNCS, Vol. 1618, Springer, 1999 A.S.Evans. Reasoning with UML class diagrams. WIFT'98. IEEE CS Press, 1998 K.C.Lano, A.S.Evans. Rigorous Development in UML. FASE'99, ETAPS'99. LNCS, Vol. 1577, Springer, 1999 G. Linde. Reduction of UML Class Diagrams. In: proceedings of the Fifth International Baltic Conference on Databases and Information Systems. Tallinn, 2002. G. Linde. The Cartoon Method of Semantics Definition for Modeling Languages. Submitted to: Proceedings of the Latvian Academy of Sciences. Riga, 2003. LATVIJAS UNIVERSITATES RAKSTI. 2004. 669. sej.: D A T O R Z I N A T N E UN TEHNOLOGIJAS. INFORMACIJAS I17.-125. Ipp. Requirements and Options for Data Warehouses at Universities Laila Niedrite Lecturer at the U n i v e r s i t y o f Latvia Raina blvd. 19, R i g a , Latvia, + 3 7 1 7 0 3 4 4 3 9 lnied@lanet.lv This paper deals with the problems of the changing role of traditional universities on the educational market, and how data warehousing could help the universities solve their problems. It describes the experience of different universities and the issues of feasibility study of the data warehousing project at the university. Key words: data warehouses, requirements, universities. Introduction Data w a r e h o u s i n g is traditionally u s e d in b u s i n e s s . H i g h e r education is trying to follow the g r o w i n g trend in d e v e l o p m e n t a n d u s a g e o f d a t a w a r e h o u s e s ; h o w e v e r , the reasons for d o i n g it are often m o r e c o n c e r n e d w i t h scientific a n d educational issues than w i t h financial benefits. H i g h e r educational institutions h a v e g r o w n in recent years into large b u s i n e s s e s a n d the m a n a g e m e n t of t h e s e institutions has changed and b e c o m e m o r e b u s i n e s s - l i k e . T h e m a n a g e m e n t of h i g h e r educational institutions can benefit from using data w a r e h o u s i n g j u s t as other traditional b u s i n e s s organizations. T h e t o p i c o f d a t a w a r e h o u s i n g [5], [ 1 3 ] , [4], a n d [7] comprises architectures, a l g o r i t h m s , m o d e l s , tools, organizational a n d m a n a g e m e n t issues for integrating data from several operational s y s t e m s in order to p r o v i d e information for decision support, e.g., u s i n g data m i n i n g or O L A P tools. T h e data w a r e h o u s i n g technology provides integrated, c o n s o l i d a t e d a n d historical data. A data w a r e h o u s e can be realized as a separate d a t a b a s e c o n t a i n i n g these integrated data. A data w a r e h o u s e is built by e x t r a c t i n g data from t r a n s f o r m i n g t h e m and then loading into a separate d a t a b a s e m o d e l l i n g c o n c e p t s , e.g. star schema, and d a t a b a s e structures [14] are u s e d . Finally, the d a t a are accessed w i t h different data in F i g u r e 1. different data sources, where specialized data like materialized views analysis tools as shown 118 DATORZINATNE Data source systems: Students, Financing, HRM .. UN INFORMACUAS The Data Warehouse target database TEHNOLOGIJAS Data access and Analysis f^Data M i n i n g ) Statistics Q ^ Reporting ^ Figure J. Common data warehousing environment This paper will give further insight into the business processes o f higher educational institutions, and some key benefits o f having a data warehouse in business are examined concerning higher education. The following section presents the business process m o d e l for higher educational institutions and university management issues in the new competitive environment. An overview o f possible areas o f application o f data warehousing in education is given in Section 2. Section 3 presents the possible system architecture for the n e w application area, and the relevant experience o f universities using data warehousing. Section 4 draws conclusions on the data warehouse feasibility study at the University o f Latvia. The Education Process as a Business Process The starting point for finding out the necessity for higher education institutions to build and use data warehousing could be their traditional management and business processes. The t w o main university activities are education and research, and the most important support processes are human resource management and financial resource management [11]. In the last decade universities have experienced significant changes in the education area. One of the important issues is the growing competition between educational institutions. Colleges and universities rarely express their policies, intentions, and practices in competitive terms. H o w e v e r , the pressure on traditional resources doubled with the emergence of technology-based education delivery s y s t e m s will force competitive thinking [6], If w e consider education as business; w e have to define the key terms in this context. The universities as "education providers" have to find out, what is the product? What are the resources? What is the price o f the product? The students as „customers" have different important questions: from w h i c h institution to buy, which product exactly to buy? Finally, the customer will have to know about the quality and price o f the product to be able to make the right decision. Laila Niedrite. Requirements and O p t i o n s for D a t a W a r e h o u s e s at Universities 119 There are many examples o f using the market terminology in higher education now. One of the best known cases in this area is the University o f Phoenix, which focuses on educational needs o f adults. The new trend is the formation and growth of corporate educational institutions, many o f which have become universities with accredited study programs in recent years. For example, Motorola University co-operates with universities around the world to develop and deliver courses for the Motorola Corporation employees. The most challenging issues for university management in the new competitive environment are [12]: • Public relations, • Competition, including universities abroad, • Market analysis, • Strategic planning, • Revenues/ expend iture analysis, • Calculation of expenses, • Cooperation with traditional businesses. Alongside developing their policies in the new competitive environment, universities are looking for methods to help them evaluate the answers and to support them and their customers to make the right decisions. Higher education is facing many problems in the next years and IT will play a major role in determining h o w institutions resolve and solve these problems [ 1 5 ] . Many universities are looking at data warehousing with the hope that it can help them to solve their problems. The reasons w h y data warehouses are developed and successfully used in management and decision support in traditional business areas like sales, banking and others are the following: • the data warehouse makes the information in the organization more accessible, • integrates data from many operational data sources and thus ensures quick access to the information about business activities o f the whole organization • the data are user-friendly and consistent. As it has been mentioned above, the universities' main activities are education, research and management, and also the n e w business activities, for example, processes which b e c o m e important for the existence and development o f universities. Therefore, we can assume that universities, like businesses, can benefit from data warehousing. The following section deals with s o m e scenario examples for data warehouse development at universities, which are connected with the traditional higher education institution's processes (scenarios 3 and 4 ) and also with the new business processes (scenarios 1 and 2). DATORZINATNE UN INFORMACUAS TEHNOLOGIJAS 120 Specifications for the University D a t a W a r e h o u s e There are some attractive university data warehouse implementation scenarios. Components of them have been implemented at European and American universities [1], [2], [3]. Data warehouse implementation scenario 1. The first specification is about attracting good students. With the ever escalating costs and competition associated with higher education, universities and colleges are very interested in attracting and retaining high-quality students [8]. A data warehouse can enhance the existing information system that handles admissions to university study programmes, providing access To the required information. One example data warehouse star schema for the admission process is s h o w n in Figure 3. The star schema reflects the current administrative processes for student admission in the IS o f the University o f Latvia. rime PK Time Study Programme kev Date Month Year A c a d e m i c year Faculty PK Facultv kev Faculty_code N u m b e r student PK Admisian fact PK PK PK PK PK PK PK Degree No_of_appltcants No_of_approved Max_grade_approved Start effective d a t a Time admiss FK Time e x a m l FK Time e x a m 2 FK Time results FK P r o a r a m m e FK ADDlicant FK Faculty FK Status Gradel Grade2 Priority Proaramme kev > - Applicant PK ADDlicant kev First n a m e Family n a m e Date of birth Student number Sex Address Figure 3. Example data warehouse star schema for admissions. For example [1], if the admission algorithm is based on school grades and selected number o f study programmes in priority order, the information about the admission process in previous years could help applicants avoid mistakes in choosing the programmes, because they do not really comprehend the selection algorithm. Data warehouse implementation scenario 2 The next specification is to support customer relationship management (CRM) strategies. Luila Niedrite. Requirements and Options for Data Warehouses at Universities 121 ' C R M aligns business processes with customer strategies to build customer loyalty and increase profits over time.' (Harvard Business Review, 2 0 0 2 ) . Each year colleges and universities confront the challenge o f meeting higher levels of service demanded by an increasingly sophisticated and competitive marketplace and consumer base. At the same time there is an ever present need to cut costs and tighten budgets. Therefore, it b e c o m e s more and more important for each service interaction to be effective in both cost and outcome. In order to do this, a greater understanding o f the customer is critical [ 9 ] . Universities are n o w challenged with the needs of many different groups of learners. The universities are working with multi-generational students, and also the correspondence students are expecting new services. Another important issue is the feature o f learners in the information age to get information when and how they choose. Universities must now evaluate their services taking into account that the students are customers. Universities are starting to develop new strategies to manage their customer relationships in various stages o f the student's lifecycle. The universities have to think about new services in each phase o f the lifecycle to support students' different needs on the w a y towards graduation. There are some additional outcomes with effective C R M strategy: reduced costs, if the self-service is supported with web—applications; improved document flow in the institution and influence on n e w technologies and investments. Data warehouse implementation scenario 3 This scenario focuses on the improvement o f the study process [2], for example, the assessment o f study courses concerning the student and course performance. The evaluation o f detailed statistics such as the number o f students enrolled, evaluated and approved, the average mark, and the average approved mark can help to find out the potential problems for a particular course before they arise. It helps better understand the student needs, or in business terms, what are the students buying? Another issue is the quality assessment o f the study programme. Statistics on admissions, n e w students, drop-outs and graduates and average studying duration before graduation can give the faculty management the necessary feedback and can influence how the programme works and changes. Another possible way to conduct course and study programme evaluation is by getting information from student questionnaires and by publication of aggregated data for all participants: students, lecturers and managers. Data warehouse implementation scenario 4 This scenario describes resource planning at the university, for example, planning the teaching service. If the managers o f the faculty plan the necessary number o f classes for a particular course, they must know the statistics about the previous year. It can help to avoid empty or full classes and to find out the workload o f teaching staff. Another example is Human Resources, when saving all transactions made with university personnel records can help plan the carrier o f employees and make the right decisions about new recruitments. 122 DATORZINATNE UN INFORMACUAS TEHNOLOGIJAS The financial resources o f a university, in context with the size o f the student body, as scientific results over several years are also very significant issues to analyze. One less obvious example [3] for this scenario aims to measure the international attractiveness o f the institution. The purpose is to control the institution's policy on its international reputation and exchanges, with budget involved. Data Warehouse a n d University Information Technology Architecture Because many universities' data warehousing scenarios mentioned above are concerned with Customer Relationship Management ( C R M ) , it is necessary to understand the role o f the data warehouse in the university IT architecture and its connection with CRM applications The Connect Enterprise Architecture [9] also fits for universities and illustrates as in Figure 4, how C R M integrates many existing applications into a unified customer facing strategy. Students Application Intsgratinn Infnrmatinn analysis Employees Customer relationship management C^Data m i n ^ n g ^ ) < ^ ^ e p o m r v g ^ > < ^ ^ Data Integration Alumni Management 0 1 ^ ~ p Data warehouse IS-data source Figure 4. Data warehouse connections with CRM applications. The layer of Enterprise Application Integration provides access to Back Office applications (ERP, Student IS . . . ) from various solutions: data mining, CRM applications, Portals, K n o w l e d g e Management applications. The integration o f Back Office Applications' data can b e solved by using a data warehouse. The data warehouse also serves the reporting and analytical needs of an Laikt Niedrite. Requirements and O p t i o n s for Data Warehouses at Universities 123 institution, so the data warehouse can be a part of CRM solution which focuses on front office activities, e.g., customer service and support. The present developments in universities in the field o f data warehousing are very different in their decisions, which architecture and tools to use, and also which business processes to support with data warehousing solutions. A m o n g higher education institutions in the U S A , less than 2 5 % have or are planning a data warehouse [ 1 0 ] . Only a limited number o f database tools is used. The choice mostly depends on existing environments in other projects at the university. This is also true for client tools. The most popular is the ORACLE database and W e b based client side tools. Only a very small number of universities have all their information sources integrated into a corporate warehouse. Most are starting with students' data, human resources or finance data. Many data warehousing projects in universities have been initiated in the IT department, and not all o f them have obtained management sponsorship yet. N o similar survey has been done in Europe, and to asses the situation with data warehouses in European universities one has to look up the information on the Internet and the annual conference 'European Universities Information Systems' (EUNIS). The number o f participating universities from different countries is usually around 100 every year, but during the last 5 years, the number o f presentations on data warehouses was less than 10. For example, in 1999 there were 2 presentations from Ljubljana in Slovenia and from Porto in Portugal. T h e most significant data warehouse project in Europe a m o n g universities is in France, where the project is developed on a national scale. All French universities which had previously implemented the university information system A P O G E E (also a national-scale project) can now participate in the data warehouse project. The Data W a r e h o u s e Feasibility Study at T h e University of Latvia A s an example o f a university starting the development of the university data warehouse w e can consider the University o f Latvia. The following issues are important at the starting point o f the project: The existence and quality of data sources In the case o f the University o f Latvia, an information system was developed based on the Oracle database and Oracle Application Server. During the 5 years o f the system development and implementation process, a large amount o f various data have been collected. It is now possible to provide access to operational data, but the possibility to analyze historical data in different w a y s is limited. One o f the questions for the evaluation o f the feasibility o f the data warehouse is the technical feasibility, which means the existence o f data for expected data warehouse and the quality o f existing data. At the University o f Latvia the management information system (LUIS in Latvian) has been developed since 1996. The starting year of the students' module was 1997, of the study fees module - 1998, o f the courses and courses' enrolment module - 2000, and the starting year o f the admission module was 1998. The data accuracy differs not 124 DATORZINATNE UN INFORMACUAS TEHNOLOGIJAS depending on the module, but on the faculty, in some cases it depends on a particular program of study. The existing data are accurate enough, but in s o m e cases the lack of data is obvious. The second possible data source for the data warehouse at the University o f Latvia in addition to LUIS is the accounting system. Reasons for having a data warehouse For example, one reason for starting a feasibility study o f a data warehousing project at the University o f Latvia is burdened access to the necessary information in the operational systems. Another reason is the growth o f competition a m o n g universities and the new roles o f traditional universities. O n e more question regarding readiness is the business motivation of the institution. The business motivation of the University of Latvia is based on the changing situation in education, which w a s described earlier in this paper. The management o f the University o f Latvia is interested in a more flexible reporting facility, in the analysis of the integrated information, and the most actual need is a new portal. The idea o f the portal fits CRM solutions and also fits the data warehouse as a data integrator. Users' acceptance Prior to starting a data warehouse project it is important to understand the readiness of the institution to have a data warehouse [ 7 ] . One o f the questions for evaluation is the existence o f data for the expected data warehouse. A l s o an important question is the readiness o f people to use the data warehouse information. In the University o f Latvia the users have long experience in usage o f operational systems, and if the data warehouse will satisfy also the reporting needs of the users, the acceptance o f the data warehouse will depend on the correctness of extracted data and reports. The users were also involved in the requirements analysis. The decision between different d e v e l o p m e n t scenarios and the right choice for the first development stage. The possible scenarios regarding specifications have to be presented to the university management and discussed. The information about users' needs and priorities help to make the right decision. In our case study w e collected information about the number o f potential users and about the purposes o f the usage. W e evaluated this information to get quantitative measurements expressing the more urgent needs. The earlier mentioned 4 * implementation scenario's subset o f problems concerning the human resources and the financial resources of the university in context with the number of students, and scientific results w e r e accepted as the starting point o f the implementation. Conclusions Universities are not the typical owners o f a data warehouse system, because the data warehouse projects need financial investments, which in the c a s e of universities, are not obviously comparable with gained results in terms of money. T h e reasons why universities have or are starting the development of the data warehouse are only in the long term the expected return on investment, or the g r o w i n g competition. According to Laila Niedrite. Requirements a n d O p t i o n s for D a t a W a r e h o u s e s at Universities 125 their prior business functions and goals, universities build data warehouses to improve the quality o f education and communication with their customers - students. The improvement o f the existing reporting facility o f operational systems and improvement of data accessibility can be mentioned as a secondary reason. References [I] [2] [3] [4] [5] [6] [7] [8] [9] [10] [II] [12] [13] [14] Bajec,M., Rupnik.R., Krisper,M. Using Data Warehouses in University Information Systems, Proceedings of conference EUNIS99, Espoo, Finland, 1999. David,G., Ribeiro.L. Getting Management Support from a University Information System. Proceedings of conference EUNIS99, Espoo, Finland, 1999. DesnosJ.F. A National Data Warehouse Project for French Universities, Proceedings of conference EUNIS2001, Berlin, 2001. Hammer.J., Garcia-Molina,H., Widom,J., Labio.W., Zhuge,Y. The Stanford Data Warehousing Project. Inmon,W.H. Building the Data Warehouse, John Wiley, 1996. Katz,R.N. Competitive Strategies for Higher.Education in the Information Age, Proceedings of conference EUNIS99, Espoo, Finland, 1999. Kimball,R., Reeves,L., Ross,M., Thornthwite,W. The Data Warehouse Lifecycle Toolkit, John Wiley, 1998. KimbalfR., Ross M. The Data Warehouse Toolkit: The Complete Guide to Dimensional Modeling, John Wiley, 2002. KPMG Consulting, Managing Customer Relationships in Higher Education, 2002 N C H E M S , Information Architecture: the Data Warehouse Foundation, CAUSE/EFFECT , Volume 20, Number 2, Summer 1997, pp. 31-33, 38-40, 60. Sinz, E. J.; Bohnlein, M.; Ulbrich-vom Ende, A.: Konzeption eines Data WarehouseSystems fur Hochschulen. In: Workshop "Unternehmen Hochschule" (Inforniatik'99, 29. Jahrestagung der Gesellschaft fur Informatik, Paderbom, 5.-9. Oktober), 1999, S. 111-124. Stonis J., Augstaka izglitlba ka pakalpojumu tirgus dallbnieks, LU konference, 2003. W i d o m J . (ed). Special Issue on Materialized Views and Data Warehousing, IEEE Data Engineering Bulletin, June 1995. YangJ., Karlapalem,K., Li,Q. Algorithms for Materialized View Design in Data Warehousing Environment, Proceedings of the 2 3 International Conference on Very Large Data Bases, Athens, Greece, August 1997. Zastrocky, M.R. The Information- and Technology- Enabled University of 2004, Proceedings of conference EUNIS99, Espoo, Finland, 1999. rd [15]