LATVIJAS UNIVERSITATES RAKSTI, 669
Datorzinatne un informacijas tehnologijas
Informacijas apstrades automatizacija

SCIENTIFIC PAPERS, UNIVERSITY OF LATVIA
Computer Science and Information Technologies
Automation of Information Processing

ISSN 1407-2157
UDK 004
Da 814
Editor-in-Chief:
prof. Rusins-Martins Freivalds, University of Latvia, Latvia
Deputy Editors-in-Chief:
assoc. prof. Janis Cirulis, University of Latvia, Latvia
prof. Audris Kalnins, University of Latvia, Latvia
Members:
prof. Mikhail Auguston, Software Engineering, Naval Postgraduate School, USA
prof. Janis Barzdins, University of Latvia
prof. Janis Bicevskis, University of Latvia
assoc. prof. Juris Borzovs, University of Latvia
prof. Janis Bubenko, Royal Institute of Technology, Sweden
prof. Albertas Caplinskas, Institute of Mathematics and Informatics, Lithuania
prof. Janis Grundspenkis, Riga Technical University, Latvia
prof. Hele-Mai Haav, Tallinn Technical University, Estonia
prof. Ahto Kalja, Tallinn Technical University, Estonia
prof. Jaan Penjam, Tallinn Technical University, Estonia
Literary editor: Imants Mezaraups
All papers included in this collection have been peer-reviewed.
Republication requires the permission of the University of Latvia.
When citing, a reference to this publication is mandatory.
© University of Latvia, 2004
© Publishing house "Rasa ABC", SIA, typesetting, 2004
ISSN 1407-2157
ISBN 9984-770-04-4
Contents
Baiba Apine. Software Development Effort Estimation
Guntis Arnicans. Description of Semantics and Code Generation Possibilities for a Multi-Language Interpreter
Janis Benefelds. Data Staging in the Data Warehouse
Edgars Celms. Generic Data Representation by Table in Metamodel Based Modelling Tool
Karlis Freivalds. Nondifferentiable Optimization Based Algorithm for Graph Ratio-Cut Partitioning
Martins Gills. Review of Traceability Models for Software Testing Processes
Mara Gulbe, Arnis Gulbis. Benchmarking Problems of Topic Telecommunications and Access in Latvia
Janis Iljins. Design Interpretation Principles in Development and Usage of Informative Systems
Girts Linde. Semantics and Equivalence of UML Class Diagrams
Laila Niedrite. Requirements and Options for Data Warehouses at Universities
Preface
This volume is the first one in the new series of Acta Universitatis Latviensis in Computer Science and Information Technologies. Research in computer science and software engineering has been done at the University of Latvia since the 1960s, and there are hundreds of publications by computer scientists of the University of Latvia in various scientific journals and conference proceedings worldwide. However, there have been only a few attempts to publish collections of the University's computer science research papers in the Acta: there were three volumes in the mid-1970s, in Russian.
The first volume in the new series, Automation of Information Processing, contains recent results of young researchers, most of them doctoral students at the University of Latvia. Though the topics of the papers are quite different, they are all centered around the problem of providing theory, methodology, development tools and a supporting environment for the development of information systems. All the papers in the volume are related to the most up-to-date issues in the respective area.
Theoretical problems are discussed in the papers by Girts Linde and Karlis Freivalds, both of them papers of high scientific quality. The paper by Linde is devoted to the formalization of the semantics of class diagrams, the main UML notation for system design, and to the formalization of class diagram equivalence. Freivalds' paper provides new results and an efficient heuristic algorithm for a classical optimization problem in graph theory: finding the minimum ratio-cut.
Several papers are devoted to generic modeling and development tools for information systems. The paper by Edgars Celms discusses an important aspect of metamodel-based modeling tools: generic facilities for defining tables. The paper by Guntis Arnicans is devoted to a generic language interpreter implementing a new method for defining language semantics. The paper by Janis Iljins discusses experience in using a generic development environment, the IS Technology, where the design specification can be directly interpreted.
The papers by Janis Benefelds and Laila Niedrite consider various aspects of data warehouses: an overview of data staging and experience of application to the education domain.
Two papers discuss software development management problems: Baiba Apine covers the issues of various software development estimation models, and Martins Gills the traceability issues in software testing.
Finally, the paper by Mara Gulbe and Arnis Gulbis considers the availability of IT services in Latvia.
All the papers in the volume have been reviewed by several members of the international Editorial Board of the series.
Prof. Audris Kalnins,
Deputy Editor-in-Chief
University of Latvia
LATVIJAS UNIVERSITATES RAKSTI. 2004. Vol. 669: DATORZINATNE UN INFORMACIJAS TEHNOLOGIJAS, pp. 7-12.
Software Development Effort Estimation
Baiba Apine
PricewaterhouseCoopers
baiba.apine@lv.pwc.com
Formal analytical and analogy-based software development cost estimation models are analysed
to find out what factors are considered to have an influence on the development process in each of
the models, and to provide recommendations for the application of those models.
Key words: estimation, software development cost, effort estimation models.
Introduction
Software development effort estimation is a rather old, but still relevant, problem. Since the middle of the last century, when the very first formal software development effort and schedule estimation models appeared, the software development process, the subject of these estimates, has changed dramatically. For instance, [EXP02] highlights that the productivity of the software development process increases by 10% annually due to the progress of software development technologies. The continuous improvement of the development process, together with the different factors which influence the process productivity and must be taken into account, creates a nightmare for estimators.
The following experiment was carried out during classes on software cost estimation. A group of 200 students was asked to estimate the effort and schedule for the development of an information system (IS) storing data about software development projects (name, development environment, start date, expected end date etc.), developers and customers (names, skills, office hours, phone numbers etc.) and documents produced during the project lifecycle (name, comments, author). All the students had the same input information about the IS. The results for this rather small project differed significantly: from 10 man-days (2-week schedule) to 12 man-months (1-year schedule), with an average of 3 man-months (3.5-month schedule). The students themselves were very surprised by the differences in the estimates. While this is acceptable with students in the classroom, in real life such a situation might be rather painful and must be discussed in order to solve the problem and find consensus. Here formal cost estimation methods could help:
1. To provide a disciplined way of thinking during the estimation process, which was not available to the students.
2. To provide a subject for discussion for all involved parties.
Formal software development cost estimation models are analysed to find out what factors are considered to have an influence on the development process in each of the models, and to provide recommendations for the application of those models.
Software Development Effort Estimation Models
To be honest, formal cost estimation models are not always recommended for effort and schedule estimation. [COC2] and [KEM93] do not recommend formal effort estimation models for small projects (less than 2K lines of code [BOE91], or fewer than 10 developers working for 3-6 months [YOU97]). For small projects the development productivity depends very much on the individuals, hence the right way of estimating is to ask the developer for an estimate.
Formal methods are welcome for average (20-30 developers working for 1-2 years) and large (100-300 employees working for 3-5 years) projects, because:
1. The individual experience of the estimator is limited, as even a very experienced project manager has worked on 5-7 average and 3-5 large projects during his/her career.
2. Typically there are more than two stakeholders in average and large projects, therefore it is crucial to have a documented, step-by-step estimation process as a subject for discussion.
Different analytical and analogy-based software cost estimation models have been developed. Analytical models are based on a negative exponential curve called the Rayleigh-Norden manpower approximation curve (see Figure 1). The functionality of the software product, usually expressed in function points or lines of code, is used as an input for analytical models.
The manpower distribution for one real-life software development project is shown in Figure 1. It illustrates a rather typical fault in project management: as the development process seems to be stable (see SEP_98 to APR_99), some developers might be taken away from the project (see MAY_99). This yields an increasing response time to customer queries and lowers customer satisfaction. Additional developers must then be added to the project urgently (see JUN_99). Otherwise the real new software development life cycle approximates the theoretical one quite closely.
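As a rough illustration of the theoretical curve, the Rayleigh manpower profile is commonly written as m(t) = 2·K·a·t·exp(-a·t²), where K is the total effort and a = 1/(2·t_peak²). The sketch below assumes this standard form. The function names and the sample values (60 man-months, staffing peak in month 8) are illustrative and are not taken from the project shown in Figure 1.

```python
import math


def rayleigh_staffing(t, total_effort, t_peak):
    """Staffing level (people) at time t on a Rayleigh manpower curve.

    total_effort: area under the curve (e.g. man-months), an assumed input.
    t_peak: time at which staffing is highest, an assumed input.
    """
    a = 1.0 / (2.0 * t_peak ** 2)
    return 2.0 * total_effort * a * t * math.exp(-a * t * t)


def cumulative_effort(t, total_effort, t_peak):
    """Effort spent up to time t (the integral of the staffing curve)."""
    a = 1.0 / (2.0 * t_peak ** 2)
    return total_effort * (1.0 - math.exp(-a * t * t))


if __name__ == "__main__":
    # Print the theoretical staffing level every four months.
    for month in range(0, 25, 4):
        print(month, round(rayleigh_staffing(month, 60.0, 8.0), 2))
```

Comparing such a theoretical profile with the recorded monthly headcount is one way to spot deviations like the MAY_99 dip described above.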
Analogy-based models use information about past projects. The estimation process basically means browsing a database to find similar projects. The success of estimation using analogy-based models depends on how well the criteria for finding similar projects in past experience are chosen.
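The idea of browsing past projects for analogues can be sketched as a nearest-neighbour lookup. This is a minimal sketch under stated assumptions: the two similarity criteria (size in function points and team experience), the size-proportional scaling of effort, and all project data are hypothetical and do not describe how any particular commercial model works.

```python
from dataclasses import dataclass


@dataclass
class PastProject:
    name: str
    function_points: float
    team_experience: float  # hypothetical scale: 1 (novice) .. 5 (expert)
    effort_man_months: float


def estimate_by_analogy(target_fp, target_exp, history, k=2):
    """Average the size-scaled effort of the k most similar past projects."""
    fps = [p.function_points for p in history]
    exps = [p.team_experience for p in history]
    # Normalise each criterion by its range so neither dominates the distance.
    fp_range = (max(fps) - min(fps)) or 1.0
    exp_range = (max(exps) - min(exps)) or 1.0

    def distance(p):
        return (((p.function_points - target_fp) / fp_range) ** 2
                + ((p.team_experience - target_exp) / exp_range) ** 2)

    nearest = sorted(history, key=distance)[:k]
    # Scale each analogue's effort in proportion to the size difference.
    scaled = [p.effort_man_months * target_fp / p.function_points for p in nearest]
    return sum(scaled) / len(scaled), [p.name for p in nearest]


# Hypothetical project base, for illustration only.
HISTORY = [
    PastProject("A", 100.0, 3.0, 10.0),
    PastProject("B", 120.0, 3.0, 12.0),
    PastProject("C", 500.0, 4.0, 70.0),
    PastProject("D", 480.0, 2.0, 90.0),
]

if __name__ == "__main__":
    effort, analogues = estimate_by_analogy(110.0, 3.0, HISTORY)
    print(effort, analogues)
```

The choice of criteria inside `distance` is exactly the point the text makes: a poor choice returns misleading analogues no matter how large the project base is.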
Figure 1. Rayleigh-Norden manpower approximation curve for a new software development process (optimal) in comparison with the real one
Samples of Analytical Models
SLIM (Software Lifecycle Model) was developed in the middle of the 1970s [KEM87] and is currently maintained by QSM (Quantitative Software Management) [QSM02]. The model was developed using data about 5000 software development projects. The SLIM model uses a productivity parameter and a productivity index, which characterise the development environment (tools used, developers' skills etc.). In contrast to other analytical models, in the SLIM model a schedule must be set before the effort is estimated. However, this is not a disadvantage of the model, because in real life the schedule is very often already predefined.
The Jensen model [JEN84] estimates the development effort and schedule based on the size of the product, a technology index and 11 development environment factors, which characterize the development technology. The current version of the Jensen model is implemented in the tool SAGE [JEN95].
COCOMO II [COC2] estimates the development effort and schedule based on the size of the product, 5 scaling drivers and 6 adjustment factors, which describe the product's quality and reliability requirements, technologies used, developers' skills etc.
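The general shape of such an estimate can be sketched as follows. The sketch assumes the effort equation usually given for COCOMO II, PM = A · Size^E · ΠEM with E = B + 0.01 · ΣSF. The calibration constants A = 2.94 and B = 0.91 and all driver values below are assumptions used for illustration and should be checked against [COC2] before any real use.

```python
from math import prod

# Assumed calibration constants; verify against the model definition manual.
A = 2.94
B = 0.91


def cocomo2_effort(ksloc, scale_factors, effort_multipliers):
    """Estimated effort in person-months.

    ksloc: product size in thousands of source lines of code.
    scale_factors: the scaling drivers (they change the exponent).
    effort_multipliers: the adjustment factors (they multiply the result).
    """
    exponent = B + 0.01 * sum(scale_factors)
    return A * ksloc ** exponent * prod(effort_multipliers)


if __name__ == "__main__":
    # Hypothetical project: 10 KSLOC, five scaling drivers, six adjustment factors.
    nominal = cocomo2_effort(10.0, [3.0] * 5, [1.0] * 6)
    risky = cocomo2_effort(10.0, [3.0] * 5, [1.2, 1.1, 1.0, 1.0, 1.0, 1.0])
    print(round(nominal, 1), round(risky, 1))
```

The split between exponent drivers and multiplicative factors is what makes size the dominant input: the lines-of-code estimate has to exist before anything else can be computed.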
At the end of last year a survey about the most popular development effort estimation methods was carried out by [ISB01] in more than 100 software development organizations of varied sizes [CUT03]. More than a quarter of the respondents (27%) use various analytical models. A surprisingly low number (9%) report the use of COCOMO, which was considered by many to be a groundbreaking technique in the 1980s. One possible explanation for this is that COCOMO applications are often cumbersome and require the use of other tools and techniques, such as lines of code estimators, to generate input for COCOMO.
Samples of Analogy Based Models
The Checkpoint model is based on information about 8000 software development projects. This model is the property of SPR (Software Productivity Research) [JON96]. The Checkpoint model uses a survey containing multiple-choice questions about the development process and the product to be developed. Information from the survey is then used to adjust the productivity of the development process.
The same principle as in the Checkpoint model is used in the Experience Pro model and tool, which is the property of Software Technology Transfer [EXP02]. This model uses information about approximately 500 past projects. All the users of this model are joined in FiSMA [FIS03] and encouraged to collect information about projects in order to update the Experience Pro model continuously.
According to the survey by [ISB01], most organizations (65%) collect historical data and use it to help them estimate software development. 26% report that they do so diligently, by maintaining a central database that formally receives and stores data from every project.
Conclusions
Analytical models as well as analogy-based ones use data about the software development process (measurements) and consider different risks and their influence on the development process. The set of risks is predefined and depends on the model. Table 1 shows the software development risks identified in the survey of risk management [API02]. There are risks which are considered in each cost estimation model, for instance, an unrealistic project schedule and budget, or lack of knowledge of software development technologies and environment. At the same time there are risks which are important and influence the development schedule and effort, but are not considered by any of the estimation models analysed, for instance, software development environment bugs or change of qualified personnel.
Table 1. Risks considered in different software cost estimation models (+ : the model considers the risk)
Software development influencing factors (table rows; of the model column headers only "Checkpoint" is legible, and the per-model marks are not recoverable from the source):
Lack of hardware on the customer side
Difficult communication with customer
Low quality of software requirements
Unstable software requirements
Unrealistic schedules and budgets
Weak project management
Software development environment bugs
Lack of developer motivation
Lack of hardware on developers' side
Change of qualified personnel
Lack of knowledge in software development technologies and environment
Models concentrate on effort and schedule estimation for new software development [PUT92], [JEN84], [JON96], [COC2], [EXP02], [PAR95], [TAU81]. There are add-ons developed for some models to apply them to more specific projects like software maintenance [COC2], [EXP02]. Since the models rely on development process measurements, they continuously become out of date as technologies develop. For instance, software development productivity increases by 10% annually. Hence a year-old cost estimation model overestimates the effort by 10% even if we know exactly how to use the model. To solve this problem, authors continuously gather data about projects to maintain their models [QSM02], [JON96]. This problem cannot be solved for analytical models, but for analogy-based ones it is less pressing if the project base is continuously updated with information about new projects, as is done for [EXP02].
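The 10% figure follows directly if effort is taken to be product size divided by productivity. That relationship is a simplifying assumption made here for illustration, and the symbols S (size), P (last year's productivity) and E (effort) are notation introduced only for this sketch.

```latex
E_{\text{old}} = \frac{S}{P}, \qquad
E_{\text{actual}} = \frac{S}{1.1\,P}, \qquad
\frac{E_{\text{old}}}{E_{\text{actual}}} = 1.1
```

Under this assumption, a model calibrated one year ago returns an estimate 10% above the effort that is actually needed today.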
There are add-ons for operational planning of the development process, developed for some of the models which are based on the Rayleigh-Norden manpower curve [PUT92]. There are no such add-ons for analogy-based models, because data gathering for such models is much more complex than approximation of the development process using analytical methods. The development of such an add-on requires integration of the measurement and risk management processes.
What to do?
To estimate the software product development effort and schedule, the following steps must be carried out:
1. Perform a risk analysis for the particular project to find out what factors will influence the development of the product.
2. Choose the development effort estimation model which takes into account all the risks pertaining to the particular project.
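The two steps above amount to a set-coverage check, which can be sketched as follows. The model names and the risks attributed to them are hypothetical placeholders; they do not reproduce the contents of Table 1.

```python
def models_covering(project_risks, model_risks):
    """Names of the models that consider every risk found for the project."""
    needed = set(project_risks)
    return sorted(name for name, covered in model_risks.items()
                  if needed <= set(covered))


def uncovered_risks(project_risks, model_risks):
    """For each model, the project risks that the model does not consider."""
    needed = set(project_risks)
    return {name: sorted(needed - set(covered))
            for name, covered in model_risks.items()}


# Hypothetical coverage data, for illustration only.
MODEL_RISKS = {
    "Model X": {"Unrealistic schedules and budgets", "Unstable software requirements"},
    "Model Y": {"Unrealistic schedules and budgets", "Weak project management",
                "Unstable software requirements"},
}

if __name__ == "__main__":
    risks = ["Unstable software requirements", "Weak project management"]
    print(models_covering(risks, MODEL_RISKS))
    print(uncovered_risks(risks, MODEL_RISKS))
```

When no model covers all the risks, the second function at least shows which risks would have to be handled outside the chosen model.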
Such an approach is not feasible for the following reasons:
1. There is no knowledge available about the different models. Experience shows that most companies have knowledge about one or two effort estimation models.
2. Application of most of the models requires the acquisition of specific, rather expensive tools (several thousand EUR for a licence and approximately a thousand EUR for annual maintenance of the model).
Therefore one formal model must be chosen for usage in the company. Some sources recommend using two formal models for estimation in parallel [ISB01], [JON96], where the second model verifies the results given by the first one. Instead of the second formal model, development process measurements could be applied to adjust the formal results. Developers and managers must be trained to use the chosen formal model to avoid painful estimation mistakes.
The development process must be measured in order to use the gathered information for analysis of the results obtained from application of the formal model. Results must be communicated among the developers.
References
[EXP02] "Estimation and Measurement Methodology of Experience Pro 3.0". [Electronic resource]. Access: WWW. URL: http://www.sttf.fi/html. Resource described 12th of June, 2002.
[COC2] C. Abts, B. Boehm, B. Clark, S. Devnani-Chulani. "COCOMO II Model Definition Manual". University of Southern California, 2000, 68 p.
[KEM93] Chris F. Kemerer. "Empirical studies of assumptions that underlie software cost estimation". Information and Software Technology, Vol. 34, No. 4, 1992, pp. 211-218.
[BOE91] Barry W. Boehm. "Software Risk Management: Principles and Practices". IEEE Software, January, 1991, pp. 32-41.
[YOU97] E. Yourdon. "Death March. The Complete Software Developer's Guide to Surviving 'Mission Impossible' Projects". Prentice Hall, 1997, 218 p.
[KEM87] Chris F. Kemerer. "An Empirical Validation of Software Cost Estimation Models". Communications of the ACM, May, 1987, pp. 416-429.
[QSM02] "SLIM-Estimate 5.0". [Electronic resource]. Access: WWW. URL: http://www.qsm.com/slim_estimate.html. Resource described 3rd of January, 2003.
[JEN84] Randall W. Jensen. "A Comparison of the Jensen and COCOMO Schedule and Cost Estimation Model". Proceedings of the International Society of Parametric Analysis, 1984, pp. 96-106.
[JEN95] Randall W. Jensen. "A New Perspective in Software Schedule and Cost Estimation". [Electronic resource]. Access: WWW. URL: http://www.seisage.com/publications.htm. Resource described 17th of February, 2003.
[ISB01] International Software Benchmarking Standards Group. "Practical Project Estimation". 2001, 46 p.
[CUT03] "The Guru Method Prevails". [Electronic resource]. Access: WWW. URL: http://www.cutter.com/benchmark/index.html. Resource described 12th of February, 2003.
[JON96] Caper Jones. "Applied Software Metrics". McGraw Hill, 1996, 324 p.
[FIS03] "Finnish Software Metrics Association". [Electronic resource]. Access: WWW. URL: http://www.fisma-network.org. Resource described 12th of February, 2003.
[API02] Baiba Apine. "Software Development Risk Management Survey", accepted paper for the 11th International Conference on Information Systems Development, Riga, Latvia, September 12-14, 2002. http://www.cs.rtu.lv/isd2002/accepted.asp. Resource described 3rd of January, 2003.
[PUT92] Lawrence H. Putnam, Ware Myers. "Measures for Excellence". Yourdon Press Computing Series, 1992.
[PAR95] Naval Sea Systems Command. "Parametric Cost Estimating Handbook". [Electronic resource]. Access: WWW. URL: http://www.jsc.nasa.gov/bu2/PCEHHTML/pceh.htm. Resource described 17th of February, 2003.
[TAU81] Robert C. Tausworthe. "Deep Space Network Software Cost Estimation Model". Jet Propulsion Laboratory Publication 81-7, Pasadena, CA, April, 1981.
LATVIJAS UNIVERSITATES RAKSTI. 2004. Vol. 669: DATORZINATNE UN INFORMACIJAS TEHNOLOGIJAS, pp. 13-33.
Description of Semantics and Code Generation Possibilities
for a Multi-Language Interpreter
Guntis Arnicans
Faculty of Physics and Mathematics, University of Latvia
Raina Blvd. 19, Riga LV-1586, Latvia, garnican@lanet.lv
In this paper we describe the definition of semantics for a Multi-Language Interpreter (MLI), which provides the execution of a given program by receiving and exploiting the corresponding language syntax and the desired semantics. We analyze the simplest solution: the MLI receives the language syntax and semantics descriptions, which have already been compiled to executable objects. Semantics is defined as a composition of several semantic aspects, considering the pragmatics of a language. Semantic aspects are translated to semantic functions by composing the descriptions of the aspects. Traversing the program's intermediate representation and calling the semantic functions, similarly to the principle of the Visitor pattern, performs the desired semantics. To simplify the semantic descriptions, we use abstract components that are joined by connectors at the meta-level. The implementation of these components and connectors can be very different. Examples of conventional and specific semantics are given for a simple imperative language in this paper.
Key words: interpreter, programming language specifications, tool generation.
1. Introduction
The number of new languages related to the IT sector has increased rapidly over the last several years (programming languages and data description languages, for example). Problems associated with the implementation and use of these languages have also expanded, of course. Kinnersley [Kin95] has reported that there were 2,000 languages in 1995 which were being put to serious use. Even back then specialists found that the new languages were mostly to be classified as domain-specific languages. Most of them are not easy to implement and maintain [ITSE99, DKV00 (DSL analysis, problems and an annotated bibliography)]. It is also true that we need not just a compiler or an interpreter, but also a number of supportive tools. Questions of programming quality are very important today, and these questions often cannot be answered without specialized and automated ancillary resources.
Computers are being used with increasing dynamism today: systems have been divided up in terms of time and space, the operational environment is heterogeneous, and we have to ensure the implementation of parallel processes while organizing cooperation among components and systems, adapting to changing circumstances without interrupting our work, etc. We are making increasing use of interpreters or of just-in-time code generation and compilation. The formal resources that are used to describe the semantics of a language, however, cannot fully satisfy our needs in the modern age, and they are starting to lose their positions [Sch97, Lou97, Paa95].
The basic problem associated with the formal specifications of programming languages is that these specifications are far too complex. It is not clear how they are to be administered, we cannot use them to cover all of our practical needs, and in the end we are still faced with a problem: who can prove that these complex specifications are really correct? The literature claims that the best commercial compilers (interpreters or other language-based tools) are written without formalisms, or that formalisms are used only in the first phases, scanning and parsing [e.g. Lou97]. Formalisms are elaborated and used mostly for research purposes in educational and scientific institutions at this time.
The development of semantics is gradually moving away from the development of languages and tools. One way to overcome this gap is to take a tool-oriented approach to semantics, making the definitions of semantics far more useful and productive in practice and generating as many language-based tools as possible from them [HK00]. We support this approach in principle, but our aim is to propose a different approach toward the definition of semantics, making room for far less formal records.
Those who prepare descriptions of semantics have long been looking for ways in which semantics can be divided up into reusable components, and it is not yet clear whether the formal or the partly formal methodology is the best in this case. We chose a form of description that is less formal and freer from the theoretical perspective, and our empirical research showed that rank-and-file developers of tools understand this method far more easily.
2. The concept of a Multi-Language Interpreter
The concept of a multi-language interpreter was introduced in [AAB96]. A Multi-Language Interpreter (MLI) is a program which receives source language syntax, source language semantics and a program written in the source language, and then performs the operations on the basis of the program and the relevant semantics. Conceptually, we parse an input token stream, build a parse tree and then traverse the tree as needed so as to evaluate the semantic functions that are associated with the parse tree nodes. Once an explicit parse tree is available, we visit the nodes in some order and call the appropriate function. This approach is similar to the principle "build a tree, save a parse and traverse it" [Cla99] and to the Visitor pattern [GHJV95], except in terms of the methodology which we apply in obtaining semantic functions and organizing the physical implementation. The idea of the MLI is expressed in Figure 1.
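The parse, build and traverse cycle can be illustrated with a toy arithmetic language. This is a minimal sketch under stated assumptions, not the MLI implementation: the grammar, the tuple-based tree and the names `tokenize`, `parse`, `traverse`, `EVALUATE` and `TO_POSTFIX` are all invented for the example. The same tree is traversed with two interchangeable sets of semantic functions.

```python
import operator
import re

OPS = {"+": operator.add, "-": operator.sub,
       "*": operator.mul, "/": operator.truediv}


def tokenize(text):
    """Split the source text into integer and operator tokens."""
    return re.findall(r"\d+|[-+*/()]", text)


def parse(tokens):
    """Recursive-descent parser that returns a parse tree of nested tuples."""
    pos = 0

    def peek():
        return tokens[pos] if pos < len(tokens) else None

    def take():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        return tok

    def expr():
        node = term()
        while peek() in ("+", "-"):
            node = ("binop", take(), node, term())
        return node

    def term():
        node = factor()
        while peek() in ("*", "/"):
            node = ("binop", take(), node, factor())
        return node

    def factor():
        tok = take()
        if tok == "(":
            node = expr()
            take()  # consume the closing parenthesis
            return node
        return ("num", int(tok))

    return expr()


def traverse(node, semantics):
    """Visit children first, then call the semantic function of the node kind."""
    if node[0] == "num":
        return semantics["num"](node[1])
    _, op, left, right = node
    return semantics["binop"](op, traverse(left, semantics),
                              traverse(right, semantics))


# Two semantics for one syntax: evaluation and translation to postfix notation.
EVALUATE = {"num": lambda v: v, "binop": lambda op, l, r: OPS[op](l, r)}
TO_POSTFIX = {"num": str, "binop": lambda op, l, r: f"{l} {r} {op}"}

if __name__ == "__main__":
    tree = parse(tokenize("2 + 3 * (4 - 1)"))
    print(traverse(tree, EVALUATE))    # 11
    print(traverse(tree, TO_POSTFIX))  # 2 3 4 1 - * +
```

Keeping the traversal separate from the semantic functions is what allows the semantics to be replaced without touching the parser.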
Figure 1. The concept of a Multi-Language Interpreter (inputs: Syntax, Semantics, Program; output: Results)
The concept of an MLI presupposes that we can prepare several semantics for one syntax, and that we can exploit one semantics for various syntaxes. The descriptions of syntaxes and semantics must be translated to an executable form (before or during the running of the MLI). MLI implementation architectures may vary. The one we use receives and exploits syntax and semantics descriptions that have already been compiled as executable objects (Figure 2). Syntax is represented by the SyntaxObject, and semantics by the TraverserObject, the SemanticObject, the SymbolTable, and the necessary number of Components (the components A, B, C in our figure). The MLI Kernel, which provides the initial binding of all syntax and semantics objects, initializes the execution of the program.
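How a kernel might bind such precompiled objects can be sketched as below. The class names follow the text, but every method, the toy statement syntax ("NAME = INT" and "print NAME") and the output list standing in for a component are assumptions made for illustration; the paper's actual interfaces are not shown in this section.

```python
class SymbolTable:
    """Stores variable values shared by the semantic functions."""

    def __init__(self):
        self._values = {}

    def set(self, name, value):
        self._values[name] = value

    def get(self, name):
        return self._values[name]


class SyntaxObject:
    """Hypothetical syntax: one statement per line, 'NAME = INT' or 'print NAME'."""

    def parse(self, text):
        nodes = []
        for line in text.strip().splitlines():
            parts = line.split()
            if parts[0] == "print":
                nodes.append(("print", parts[1]))
            else:
                nodes.append(("assign", parts[0], int(parts[2])))
        return nodes


class SemanticObject:
    """Conventional semantics: execute each statement."""

    def __init__(self, symbols, output):
        self.symbols = symbols
        self.output = output  # stands in for an output component

    def on_assign(self, name, value):
        self.symbols.set(name, value)

    def on_print(self, name):
        self.output.append(self.symbols.get(name))


class TraverserObject:
    """Visits the nodes in order and calls the matching semantic function."""

    def run(self, nodes, semantics):
        for kind, *args in nodes:
            getattr(semantics, "on_" + kind)(*args)


class MLIKernel:
    """Binds the syntax and semantics objects, then starts the execution."""

    def __init__(self, syntax, traverser, semantics):
        self.syntax = syntax
        self.traverser = traverser
        self.semantics = semantics

    def execute(self, program_text):
        self.traverser.run(self.syntax.parse(program_text), self.semantics)


if __name__ == "__main__":
    output = []
    kernel = MLIKernel(SyntaxObject(), TraverserObject(),
                       SemanticObject(SymbolTable(), output))
    kernel.execute("x = 5\nprint x")
    print(output)
```

Because the kernel only holds references, a different SemanticObject (for example one that merely counts assignments) could be bound to the same SyntaxObject without other changes.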
Figure 2. MLI runtime architecture
Each of the components can be implemented in various ways, with a different semantic assignment and physical implementation. Here we have a chance to combine syntaxes and semantics in both ways: in terms of architecture and in terms of implementation. Then, however, we immediately face the question of the compatibility of the syntax and semantics, so as to avoid senseless interpretation.
Executable syntax and semantic objects can be obtained from their descriptions before or during the actual program execution (analogous to a classical compiler and interpreter). Dynamic code generation is more difficult because all generation phases must be done automatically.
3. Language Specifications for MLI
A programming language is an artificial means to communicate with a computer and to record algorithms for problem solving. Like that of a natural language, a programming language's definition consists of three components or aspects: syntax, semantics and pragmatics [Pag81, SK95]. All of these aspects are significant in dealing with our problems. Pragmatics, which is usually rarely exploited, deals with the practical use of a language, and this is an important element in defining semantics.
We can look at syntax and semantics from two perspectives: the definition or description phase and the runtime phase. Our goal is to achieve runtime components which can freely be exchanged or mixed together in pursuit of the desired collaboration. First we must look at the principles of syntax and semantics descriptions, and then we can view the target code generation steps.
Our basic principle is to divide syntax and semantics into small parts, and later, with a simple method, to combine these parts, thus providing a mechanism to tie together the semantic parts and the syntax elements. Our method is close to some of the structuring paradigms of attribute grammars [Paa95]: the definition phase is similar to the relationship Semantic aspect = Module, but the runtime phase is similar to Nonterminal = Procedure. That means that we basically use the language pragmatics and divide the semantics into semantic aspects.
3.1. Syntax
The formalisms for dealing with the syntax aspect of a programming language are
well developed. The theory of scanning, parsing and attribute analysis provides not only
the means to perform syntactical analysis, but also a way to generate a whole compiler.
Such terms, concepts or tools as finite automata, regular expressions, context-free
grammars, attribute grammars (AG), Backus-Naur form (BNF), extended BNF
(EBNF), Lex (also Flex), Yacc (also Bison), and PCCTS are well known and accepted
by the computer science community.
We do not need to reinvent the wheel, and it is reasonable to choose the existing
formalisms and generators (lexers and parsers). The main task when dealing with the syntax
description for a given language is to generate code that transforms a program
written in this syntax into an intermediate representation (IR). Additionally, we
need to attach a library of functions that provide the means to manipulate the
IR, and to compile the whole code. The result is the SyntaxObject (Figure 2).
In this paper we concentrate mostly on the class of imperative programming
languages, but our method is adaptable for other languages too, such as diagrammatic
languages (e.g., Petri nets, E-R diagrams, Statecharts, VPL - visual programming
languages, etc.), which exploit other formalisms (e.g., SR Grammars, Reserved Graph
Grammar) and processing styles [FNT+97, ZZ97].
3.2. Semantics
The chosen principle for the runtime semantics, parse and traverse, states that the
most important things are a traversing strategy and the semantic functions which must
be executed when visiting a node (Figure 3). Therefore, the central components of the
semantics are the TraverserObject and the SemanticObject (Figure 2).
The TraverserObject manages the node visiting order, provides semantic functions
with information from the IR, and is the main engine of the MLI. The SemanticObject,
for its part, contains all of the necessary semantic functions and provides the
execution environment. At the same time, we can also put into the semantic functions
certain commands which force the Traverser to search for the needed node and to
change the current execution point in the IR (traversing strategy changes and a
transition to another node are problems in the Visitor pattern [e.g. Vis01]).
Semantic functions have to be as simple and as small as possible. This can be
achieved by using a meta-language and by employing high-level means of expression,
which allow for easy understanding and verification of the description. Following this
principle becomes more natural if we use abstract components, so that the underlying
semantics can be clear without additional explanations (in Figure 3, the abstract
components already have a concrete implementation component - A, B and C). This
statement may lead to objections from the advocates of formal semantics, because the
components are not described with mathematical precision. At the same time, however,
formal semantics sometimes uses such concepts as Stack or Symbol table.
Let us introduce a conceptual syntax element, which is a grammar symbol with a
name (e.g., a named nonterminal symbol or a named terminal symbol). Considering the
various types of syntax elements and the traversing strategy, we separate various
visitations and introduce the concept of the traversing aspect. For instance, we can
distinguish arriving at a node from the parent node (PreVisit) from arriving
at a node from the child node (PostVisit). Thus we create the semantic functions and
name them not only on the basis of the name of the syntax element but also on the basis
of the arriving aspect (traversing aspect) into this element (Figure 3).
Runtime semantics, or simply semantics, for multi-language interpreters is a set of
semantic functions. We represent the runtime semantics in Table 1. For each syntax
element and traversing aspect there is either executable code (c) or nothing (λ).
The number of rows n depends on the size of the syntax (e.g., the count of all nonterminal
and terminal symbols), and the number of columns m depends on the complexity of the
traversing strategy (usually 1..3). We notice that the matrix mainly consists of empty
functions (λ).
Table 1.
The matrix of syntax elements and semantic function correspondence
[Table: rows are the syntax elements SE_1 ... SE_n, columns are the traversing aspects
TA_1 ... TA_m; each cell holds either executable code (c) or the empty function (λ),
and most cells are empty.]
The identification of semantic functions is realized both by the syntax name and by
the traversing aspect name. The technical implementation may differ, but it is very
advisable that function identification and calling be performed with constant
complexity O(1).
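As an illustration of constant-time identification, the matrix of Table 1 can be held in a hash table keyed by the pair (syntax element name, traversing aspect name). The following is only a minimal sketch under our own assumptions: the names `table`, `call_semantic` and the logged strings are hypothetical and do not come from the MLI implementation.

```python
from typing import Callable, Dict, Tuple

# Key: (syntax element name, traversing aspect name).
# A missing key plays the role of the empty function (lambda) in Table 1.
SemanticTable = Dict[Tuple[str, str], Callable[[], None]]

log = []

table: SemanticTable = {
    ("program", "PreVisit"): lambda: log.append("prepare environment"),
    ("program", "PostVisit"): lambda: log.append("release environment"),
    ("VARIABLE", "Visit"): lambda: log.append("push variable reference"),
}

def call_semantic(table: SemanticTable, element: str, aspect: str) -> bool:
    """Run the semantic function for (element, aspect); return False for an empty cell."""
    func = table.get((element, aspect))  # hash lookup, O(1) on average
    if func is None:
        return False
    func()
    return True

call_semantic(table, "program", "PreVisit")
call_semantic(table, "series", "PreVisit")    # empty cell: nothing is executed
call_semantic(table, "program", "PostVisit")
```

Storing only the non-empty cells also reflects the observation that the matrix is sparse.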
Now we arrive at the most difficult and important problem - how can we obtain
semantic functions and ensure correct collaboration between them, and how is it
possible to create reusable semantic descriptions? Let us explain our ideas about how to
define semantics and how to obtain the matrix described above, i.e., how to generate
executable semantics from the semantic description.
4. Semantic Aspects and Abstract Components
4.1. Semantic aspects
In practice, programming languages are frequently presented through the
pragmatics of the programming language, i.e., examples are used to show how the
language constructs are exploited and what their underlying meaning is. Let us call
these language constructs, together with their meaning, semantic aspects.
We have chosen to define the semantics as a set of mutually connected semantic
aspects. Here are some examples of typical groups of semantic aspects for imperative
programming languages: execution of commands or statements (e.g., basic operations,
variable declaration, assignment of a value to a variable, execution of arithmetic
expressions), program control flow management (e.g., loop with a counter, conditional
loop, conditional branching), dealing with symbols (e.g., variables, constants), and
environment management (e.g., the scopes of visibility). Here, too, are examples of
nontraditional semantic aspects: pretty-printing of the program, dynamic accounting
of statistics, symbolic execution, specific program instrumentation, etc.
We have chosen an operational approach to describe the semantic aspect - we
define the computations which a computer has to do to perform the semantic action.
4.2. Abstract data types and abstract components
The next significant principle in defining the semantic aspect is to use abstract data
types (ADT) as much as possible. An ADT is a collection of data type and value definitions
and operations on those definitions, which behaves as a primitive data type. This
software design approach breaks down the problem into components by identifying the
public interface and the private implementation.
In our case, typical examples of ADT are Stack, Queue, Dictionary, and Symbol
table (in compiler construction theory [ASU86, FL88], in formal semantics [SK95]). In
this way we hide most of the implementation details and concentrate mainly on the
logic of the semantic aspect. Later we can choose the best implementation of the ADT for
the given task. Seeing that some exploited components can be complicated (E-mail,
Graph visualization, Distributed communication, Transaction manager, etc.) and have
no standards, we use another term - abstract component. Sometimes we want to utilize
an already existing component, and the term abstract component seems more
appropriate to us.
It is advisable to describe the semantic aspect in a meta-language, even if one
does not have a translator for it. Then one can translate or simply rewrite it by hand in
the target programming language, select an appropriate implementation for the abstract
components, and use the needed interface, collaboration protocol and execution
environment. For instance, a Stack can be implemented in contiguous memory or in
linked memory, and a Symbol table as a list or as a dictionary with the hashing technique.
Furthermore, instances of abstract components can be viewed as distributed objects in a
heterogeneous computing network.
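The separation between the interface of an abstract component and its implementation can be sketched as follows. This is an illustrative example only: the class names `ArrayStack` and `LinkedStack` and the helper `swap_top_two` are our own assumptions, chosen to mirror the contiguous and linked implementations mentioned above.

```python
from abc import ABC, abstractmethod

class Stack(ABC):
    """Abstract component: only the public interface is fixed."""
    @abstractmethod
    def push(self, item): ...
    @abstractmethod
    def pop(self): ...
    @abstractmethod
    def top(self): ...

class ArrayStack(Stack):
    """Stack kept in contiguous memory (a Python list)."""
    def __init__(self):
        self._items = []
    def push(self, item):
        self._items.append(item)
    def pop(self):
        return self._items.pop()
    def top(self):
        return self._items[-1]

class LinkedStack(Stack):
    """Stack kept in linked memory: each cell is an (item, next cell) pair."""
    def __init__(self):
        self._head = None
    def push(self, item):
        self._head = (item, self._head)
    def pop(self):
        item, self._head = self._head
        return item
    def top(self):
        return self._head[0]

def swap_top_two(stack: Stack) -> None:
    """A semantic action written against the interface only."""
    a = stack.pop()
    b = stack.pop()
    stack.push(a)
    stack.push(b)
```

Because `swap_top_two` depends only on the `Stack` interface, either implementation can be selected later without changing the semantic action.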
4.3. Examples of abstract components
Some abstract components and their operations are very popular, e.g., Stack
(createStack, push, pop, top, etc.) and Queue (createQueue, enqueue, dequeue, first, etc.),
while the operations of others can be guessed, e.g., E-mail (prepare, send, receive, open).
Among the many specific components we would like to emphasize one that is useful for most
semantics - the Symbol table (SymbolTable in Figure 2) or its analogue, which provides the
execution environment.
While building prototypes of the MLI, we have created an implementation of the
Symbol table - MOMS (Memory Object Management System) - that is appropriate for
implementing imperative programming languages. With it, it is possible to define basic and
user-defined data types, to define base operations and functions, to operate with
variables and their values, to manage the scope of visibility of all objects, etc. The most
important data types, concepts, and operations of MOMS are listed in the appendix to
this paper so as to give the reader a better idea about MOMS.
The second important component is the Traverser (TraverserObject in Figure 2). Its
main tasks are to realize the traversing strategy, to change the current execution point and
to organize cooperation with the syntax object.
The following examples use a depth-first left-to-right traversing strategy
(Table 2). This strategy has three visiting aspects: Visit (for tree leaves -
terminals), PreVisit and PostVisit (for the other tree nodes - nonterminals). To define
semantic functions for the examples, we have used the following operations: nodeValue()
returns a value for the current terminal or nonterminal symbol (the value from the current
IR node), and both goSiblForw(aName) and goSiblBackw(aName) change
the current node by searching for the node with the name aName among the siblings,
going forward or backward.
Table 2.
A depth-first left-to-right traversing strategy
Traverse(node P)
    if IsLeaf(P)
        Visit(P)
    else
        PreVisit(P)
        for each child Q of P, in order, do
            Traverse(Q)
        PostVisit(P)
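The strategy of Table 2 can be written as a short recursive function. The sketch below is an assumption-laden illustration: the `Node` class and the callback parameters `visit`, `pre_visit` and `post_visit` are hypothetical names, not part of the MLI interfaces.

```python
class Node:
    """A node of the intermediate representation (hypothetical minimal form)."""
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)

def traverse(node, visit, pre_visit, post_visit):
    """Depth-first left-to-right traversal with three visiting aspects."""
    if not node.children:            # IsLeaf(P): terminals get a single Visit
        visit(node)
        return
    pre_visit(node)                  # arriving from the parent node
    for child in node.children:      # children in order
        traverse(child, visit, pre_visit, post_visit)
    post_visit(node)                 # arriving back from the last child

trace = []
tree = Node("assignment_statement",
            [Node("VARIABLE"), Node("ASSIGN"), Node("CONSTANT")])
traverse(
    tree,
    visit=lambda n: trace.append(("Visit", n.name)),
    pre_visit=lambda n: trace.append(("PreVisit", n.name)),
    post_visit=lambda n: trace.append(("PostVisit", n.name)),
)
```

After the call, `trace` records one PreVisit and one PostVisit for the nonterminal and one Visit for each of the three terminals, in source order.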
It is possible to describe interfaces for the SyntaxObject, TraverserObject and
SemanticObject with a domain-specific language. Then the interfaces for obtaining the IR,
manipulating it and working with the symbol table can be compiled together, and it
is possible to engage in high-level optimization and verification [Eng99].
The traversing strategy can also be described with a domain-specific language. This
is important if the strategy is not trivial and depends on syntax elements and the
program state [OW99 (traversing problems and solutions for the Visitor pattern)]. The
traversal strategy should be independent from syntax as much as possible and organized
(combined) by patterns [Vis01]. In addition to common traversing strategies there are
also less traditional ones, e.g., the strategy for reverse execution of the program [BM99].
4.4. Defining the semantic aspect
It is more convenient to define the semantic aspect by using diagrams (as in Figure
7). We can also write a meta-program or a program in the target language in textual form.
Diagrams contain the syntax elements that are important for the semantic aspect,
visualized with graphic symbols. We can use different graphic notations. If the
visiting order of syntax elements is important, then we mark the order with arrows.
Let us call the operations that are performed while visiting a node of the aspect a
semantic action. A semantic action is similar to a semantic function, but it is written at the
meta-level and relates only to a given semantic aspect. A semantic action is shown as a
box with the meta-code; it is connected to the syntax element and takes into account the
traversing aspect.
All abstract data types that are needed for the semantic aspect are listed
in the box with the keywords IMPORT GLOBAL. To make the
semantic aspect easier to perceive, it is permissible to use additional graphic symbols that
are not needed in real execution. For instance, we use Other aspects to signal that we
expect there to be a composition with other semantic aspects.
4.5. Examples of semantic aspects
Let us look at some examples of semantic aspects (Figure 4 - Figure 8) that are
applicable to the simple imperative programming language Pam [Pag81]. Terminal
symbols are denoted by a rectangle, while nonterminal symbols are indicated by
rounded rectangles. The left circle in a nonterminal corresponds to the PreVisit
semantic action, the right one - to the PostVisit semantic action, while for the terminals,
the Visit semantic action is assigned.
[Figure 4 (diagram): IMPORT GLOBAL Env of ADT_SymbolTable. The nonterminal program
carries the PreVisit action Env.prepareProgEnv() and the PostVisit action
Env.releaseProgEnv(); Other aspects may be attached.]
Figure 4. The semantic aspect PROGRAM. It prepares the program environment to manage
variables, constants, etc. and operations involving them. The environment is destroyed at the end
[Figure 5 (diagram): IMPORT GLOBAL CanCreateVar of ADT_Stack. The nonterminal variable_def
carries two semantic actions on CanCreateVar, one of which is CanCreateVar.pop() (the other
operation is illegible); Other aspects may be attached.]
Figure 5. The semantic aspect VARIABLE DEFINITION. It allows for variable creation
IMPORT GLOBAL Trav of ADT_TreeTraverser, RefStack of ADT_Stack,
    Env of ADT_SymbolTable, CanCreateVar of ADT_Stack

Visit action for the variable terminal:
    LOCAL VarText = Trav.nodeValue()
    if CanCreateVar.top() = TRUE and Env.findVar(VarText) = FALSE
        Env.createVar(VarText, INT)
    endif
    LOCAL Ref = Env.getRef(VarText)
    RefStack.push(Ref)

Visit action for the integer terminal:
    LOCAL IntText = Trav.nodeValue()
    LOCAL Ref = Env.getRef("INT_" + IntText)
    if Ref = EMPTY
        LOCAL Integer = TextToInteger(IntText)
        Ref = Env.createLit("INT_" + IntText, INT)
        Env.putValue(Ref, Integer)
    endif
    RefStack.push(Ref)

Figure 6. The semantic aspect ELEMENT. It pushes onto the stack a reference
to each variable or constant encountered while traversing. Variable creation is forbidden
by default. Trav provides the value of the current terminal node in the IR.
IMPORT GLOBAL RefStack of ADT_Stack, Env of ADT_SymbolTable

Syntax elements (in visiting order): assignment_statement with the children
left_hand_side, ASSIGN, right_hand_side; Other aspects may be attached to both sides.

PostVisit action of left_hand_side:
    RefStack.push(NULL)

PostVisit action of right_hand_side:
    RefStack.push(NULL)

PostVisit action of assignment_statement:
    LOCAL Res = RefStack.pop()
    LOCAL Var = RefStack.pop()
    LOCAL Val = Env.getValue(Res)
    Env.putValue(Var, Val)

Figure 7. The semantic aspect ASSIGNMENT. It takes reference to the variable and reference to
the value from the stack and assigns a value to the variable. Pushing of references is simulated
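To make the collaboration through the shared stack concrete, the following sketch plays through the assignment x := 42 in Python. It is only an illustration under simplifying assumptions: the toy `SymbolTable` uses names as references, and the functions `variable_visit`, `integer_visit` and `assignment_post_visit` are hypothetical stand-ins for the semantic actions of Figures 6 and 7, not the MOMS interface.

```python
class SymbolTable:
    """Toy stand-in for the symbol table: a reference is simply a name."""
    def __init__(self):
        self._values = {}
    def get_ref(self, name):
        return name
    def put_value(self, ref, value):
        self._values[ref] = value
    def get_value(self, ref):
        return self._values[ref]

env = SymbolTable()
ref_stack = []  # plays the role of RefStack

def variable_visit(var_text):
    """ELEMENT aspect: push a reference to the variable."""
    ref_stack.append(env.get_ref(var_text))

def integer_visit(int_text):
    """ELEMENT aspect: store the literal and push a reference to it."""
    ref = env.get_ref("INT_" + int_text)
    env.put_value(ref, int(int_text))
    ref_stack.append(ref)

def assignment_post_visit():
    """ASSIGNMENT aspect: pop value and variable references, then assign."""
    res = ref_stack.pop()
    var = ref_stack.pop()
    env.put_value(var, env.get_value(res))

# Visiting order for "x := 42": left-hand side, right-hand side, then PostVisit.
variable_visit("x")
integer_visit("42")
assignment_post_visit()
```

The two aspects never call each other directly; the only connection between them is the state of the shared stack and symbol table.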
IMPORT GLOBAL RefStack, Sort, Flag of ADT_Stack,
    Trav of ADT_TreeTraverser, Env of ADT_SymbolTable

Syntax elements (in visiting order): WHILE, comparison, DO, series, END; Other aspects
may be attached. The figure shows the following semantic actions:

    if Sort.top() = INDEF then
        LOCAL Ref = RefStack.pop()
        if Env.getValue(Ref) = FALSE
            Flag.replaceTop(FALSE)
            Trav.goSiblForw(@END)
        endif
    endif

    RefStack.push(NULL)

    if Sort.top() = INDEF and Flag.top() = TRUE
        Trav.goSiblBackw(@WHILE)
    endif

Figure 8. The semantic aspect INDEFINITE LOOP. It "goes through" the series and back to
WHILE until the comparison sets a NULL reference or a reference with the value FALSE
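The way a semantic action can redirect the traverser among siblings may be sketched as follows. This is a simplified illustration, not the MLI Traverser: the classes `Node` and `Traverser`, the method names `go_sibl_forw` and `go_sibl_backw`, and the convention that only direct children of the loop node request jumps are all our own assumptions.

```python
class Node:
    def __init__(self, name, children=(), action=None):
        self.name = name
        self.children = list(children)
        self.action = action  # optional callable executed when the node is visited

class Traverser:
    def __init__(self):
        self.jump = None  # pending request: (direction, sibling name)

    def go_sibl_forw(self, name):
        self.jump = (+1, name)

    def go_sibl_backw(self, name):
        self.jump = (-1, name)

    def traverse(self, node):
        if node.action:
            node.action(self)
        i = 0
        while i < len(node.children):
            self.traverse(node.children[i])
            if self.jump:                      # a child asked for a sibling jump
                step, name = self.jump
                self.jump = None
                while node.children[i].name != name:
                    i += step                  # search forward or backward
            else:
                i += 1

env = {"x": 0}
flag = [True]  # plays the role of the top of the Flag stack

def comparison(trav):
    if not env["x"] < 3:          # the comparison yields FALSE
        flag[0] = False
        trav.go_sibl_forw("END")

def series(trav):
    env["x"] += 1                 # loop body

def end(trav):
    if flag[0]:
        trav.go_sibl_backw("WHILE")

loop = Node("indefinite_loop", [
    Node("WHILE"), Node("comparison", action=comparison), Node("DO"),
    Node("series", action=series), Node("END", action=end),
])
Traverser().traverse(loop)
```

The loop body runs until the comparison fails, at which point the traverser skips forward to END and leaves the construct.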
5. Meta-semantics
Meta-semantics is a term which relates to a meta-program that describes the
constituent parts of which semantics consists and the way in which these parts are
connected together. The conceptual scheme of meta-semantics is shown in Figure 9.
[Figure 9 (diagram): syntax (nonterminals and terminals) connected to semantic aspects
and semantic connectors.]
Figure 9. The conceptual scheme of meta-semantics. For instance, SA1 involves nonterminal
NT1 and terminal T1, and SC1 is performed while previsiting. SC1 is a composition
of ISA1 and ISA4
If we look at semantics from the definition side (the logical view), then semantics is
formed by the traversing strategy and by a set of semantic aspects that consist of semantic
actions realized while visiting the appropriate node of the intermediate representation. If
we look at semantics from the runtime side (the physical view), then while visiting one
node it is possible that several semantic actions have to be realized, and we have to
ensure correct collaboration between all involved instances of abstract components in
the desired environment. How can we put all of this together correctly?
To solve this problem we introduce the concept of the semantic connector. This
concept has been adapted from the concept of the Grey-box connector, which solves a similar
problem: how to connect pre-built components in a distributed and heterogeneous
environment for collaborative work [AGB00]. The Grey-box connector is a
meta-program that introduces a concrete communications connection into a set of
components, i.e., it generates the adaptation and communications glue code for a
specific connection.
6. From semantic aspects to meta-semantics and executable semantics
Semantics for a fixed syntax is obtained in several steps: 1)
select predefined semantic aspects or define new ones for the desired semantics, 2)
rename syntax elements and traversing aspects in the selected semantic aspects with
names from the fixed syntax and traversing strategy, 3) rename instances of abstract
components to organize collaboration between semantic aspects, 4) make a composition
from the semantic aspects, 5) specify the runtime environment and translate the meta-code
to the code of the target programming language, and 6) compile the semantics.
• Selection of semantic aspects (step 1)
If we have a library with previously created semantic aspects, then we can search
for appropriate ones, i.e. reuse some parts of the semantics. The traversing strategy also
has to be considered.
• Syntax element and traversing aspect renaming in semantic aspects (step 2)
Actually this is semantic action mapping. Matching to the fixed syntax elements is
achieved by mapping syntax elements of semantic aspects to fixed syntax elements
(rename with - simple mapping, and duplicate to - mapping of a semantic aspect syntax
element to several fixed syntax elements):
rename <name> <traversing aspect> with <target name> <target traversing aspect>
duplicate <name> <traversing aspect> to <target name> <target visiting aspect>
    [, <target name> <target visiting aspect> ...]
For instance, rename left_hand_side PostVisit with VARIABLE Visit;
duplicate COUNTABLE_NODE Visit to VARIABLE Visit, assignment_statement
PostVisit
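The two mapping operations can be pictured as key manipulations on a table of semantic actions. The sketch below is an illustration only: it assumes that a semantic aspect is stored as a dictionary from (syntax element, traversing aspect) pairs to meta-code strings, and the function names `rename` and `duplicate` simply mirror the meta-language keywords.

```python
def rename(actions, old, new):
    """Return a copy of `actions` with the action at key `old` moved to key `new`."""
    result = dict(actions)
    result[new] = result.pop(old)
    return result

def duplicate(actions, old, targets):
    """Return a copy of `actions` with the action at `old` attached to every target key."""
    result = dict(actions)
    code = result.pop(old)
    for target in targets:
        result[target] = code
    return result

assignment = {("left_hand_side", "PostVisit"): "RefStack.push(NULL)"}
renamed = rename(assignment,
                 ("left_hand_side", "PostVisit"), ("VARIABLE", "Visit"))

counter = {("COUNTABLE_NODE", "Visit"): "<count the visit>"}
duplicated = duplicate(counter, ("COUNTABLE_NODE", "Visit"),
                       [("VARIABLE", "Visit"),
                        ("assignment_statement", "PostVisit")])
```

Both functions return new dictionaries, so a library aspect can be reused for several syntaxes without being modified.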
• Renaming of instances of abstract components (step 3)
Matching of components is necessary because there is no direct data exchange
between semantic functions, and we need collaborative work. The program state is fixed
by using the runtime states of components. At first we decide what instances they have
in common and what names they have to get, and then we rename the instances in the
semantic aspects:
replace <name> with <target name>
For instance, replace RefStack with DataStack; replace Sort with LoopSortStack
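If the meta-code of a semantic action is kept as text, the replace operation amounts to a whole-identifier substitution. The following is a minimal sketch under that assumption; the function name `replace_instance` and the identifier `RefStackBackup` are hypothetical and serve only to show that longer identifiers are left untouched.

```python
import re

def replace_instance(meta_code: str, old: str, new: str) -> str:
    """Rename one component instance in meta-code, matching whole identifiers only."""
    return re.sub(rf"\b{re.escape(old)}\b", new, meta_code)

code = "LOCAL Res = RefStack.pop()\nRefStackBackup.push(Res)"
renamed = replace_instance(code, "RefStack", "DataStack")
```

Matching on word boundaries avoids renaming an instance whose name merely starts with the old name.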
• Composition of semantic aspects (step 4)
The goal of a semantic aspect composition is to bring together several semantic
aspects into one more complicated aspect that describes nearly the entire semantics. While
composing, we stick together the meta-code of semantic actions that have the same
name. The sticking principles can vary, for instance, sequential (one code is appended to
the other one), parallel (codes can be executed simultaneously or sequentially in any
order), or free (the user can modify the code union as he likes). To achieve better results,
we may ignore some semantic actions or apply the sticking principle to the semantic aspects
in the reverse order. At this point we have to be aware of conflicts between local variable
names.
compose aspect <<newSA>> (<refined SA>) [[append | parallel | free] (<refined SA>) ...], where
<refined SA> = <<oldSA>> [ignore <name> <trav aspect> [, <name> <trav aspect>]] | [reverse]
The result is meta-semantics. An example of a meta-code fragment for
meta-semantics is given in Table 3.
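The sequential sticking principle, together with ignore, can be sketched in a few lines. The sketch assumes, purely for illustration, that each semantic aspect is a dictionary from (syntax element, traversing aspect) pairs to lists of meta-code lines; the function names `ignore` and `append` mirror the meta-language keywords but are otherwise our own.

```python
def ignore(aspect, *keys):
    """Return a copy of the aspect without the listed semantic actions."""
    return {k: v for k, v in aspect.items() if k not in keys}

def append(*aspects):
    """Sequential composition: code for the same key is concatenated in order."""
    composed = {}
    for aspect in aspects:
        for key, code in aspect.items():
            composed.setdefault(key, []).extend(code)
    return composed

element = {("VARIABLE", "Visit"): ["DataStack.push(Ref)"]}
assignment = {
    ("VARIABLE", "Visit"): ["DataStack.push(NULL)"],   # simulated push
    ("assignment_statement", "PostVisit"): ["Env.putValue(Var, Val)"],
}

# Without ignore, both pushes would be stuck together on VARIABLE Visit.
naive = append(element, assignment)
# With ignore, the simulated push of the ASSIGNMENT aspect is dropped.
a1 = append(element, ignore(assignment, ("VARIABLE", "Visit")))
```

The example shows why ignore is needed: the simulated push exists only so that the aspect can be read in isolation, and it must disappear once a real producer of references is composed in.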
Table 3.
An example of meta-semantics
compatible with ir_type ParseTree traverser_type ParseTreeTraverser
syntax elements (program, expression, VARIABLE, ...)
semantic actions (<PROGRAM> program PreVisit {Env.prepareProgEnv()}
                  <PROGRAM> program PostVisit ...)
global Trav of ADT_TreeTraverser
global Env of ADT_SymbolTable
create DataStack, OperatorStack, CanCreateVar, LoopSortStack,
       LoopCounterStack, LoopFlagStack, IfFlagStack of ADT_Stack
create InputFile, OutputFile of ADT_FILE
compose aspect <A1>    // composes semantic aspects from aspects given above
    (<PROGRAM>)        // semantic aspect PROGRAM remains the same
    append (<ELEMENT>
        replace RefStack with DataStack    // replaces stack for collaborating work
        rename INTEGER Visit with CONSTANT Visit)
                                           // renames nonterminal according to PAM syntax
    append (<ASSIGNMENT>
        replace RefStack with DataStack
        rename left_hand_side PostVisit with VARIABLE Visit,
               right_hand_side PostVisit with expression PostVisit
        ignore left_hand_side PostVisit)   // ignore pushing of NULL reference
    append (<INDEFINITE LOOP>
        replace RefStack with DataStack, Sort with LoopSortStack,
                Flag with LoopFlagStack)
    append (<VARIABLE DEFINITION>
        rename variable_def PreVisit with assign_statement PreVisit,
               variable_def PostVisit with ASSIGN Visit)
end compose aspect
compose aspect <A2>
    (<A1> ignore expression PostVisit, comparison PostVisit)
    append ( ...
    /* Other aspects are appended, such as <INPUT>, <OUTPUT>, <BASE TYPE AND OPERATOR>,
       <BINARY OPERATION>, <DEFINITE LOOP>, <CONDITIONAL STATEMENT> and so on */
    ... )
end compose aspect
• Meta-code translating (step 5)
Meta-code is translated to the target programming language, taking into account the
target language (e.g. C++), the implementation of abstract components (e.g. Stack in
linked memory), the operating system (e.g. Unix), the communications between
components (e.g. CORBA), the MLI components type (e.g. DLL), etc. The translation may
be done by hand or automatically (desirable in common cases).
• Obtaining semantic objects (step 6)
By compiling the code we get executable objects that provide semantic
performance, i.e. they contain the semantic functions that are called while traversing the
program intermediate representation. The instances of the concrete implementation of
abstract components are created, or the existing ones are dynamically linked via the
selected communications protocols.
7. Examples of alternative semantic aspects
In this section we provide a short insight into how we can build nontraditional
semantics. By adding new features to existing semantics we can create a specific tool
that works with a given programming language. We would like briefly to survey two
examples: 1) statistics accounting of program point visiting, and 2) storing of symbolic
values for variables. Both aspects are added to the conventional semantics, and this is
program instrumentation if we speak in terms of software testing.
7.1. Accounting of program point visiting
Our goal is to account for every visit to a desired program point. This means that
we need to set counters at these points. At first we write the semantic aspect NODE
COUNTER (Figure 10). We use the abstract component Dictionary, where we can store,
read and update records in the form <key, value>.
IMPORT GLOBAL Trav of ADT_TreeTraverser, Dict of ADT_Dictionary

Visit action of COUNTABLE_NODE:
    LOCAL key = Trav.getNodeID()
    LOCAL record = Dict.getRecNum(key)
    if record = 0
        Dict.createRec(key, 1)
    else
        Dict.update(record, Dict.get(record) + 1)
    endif

Figure 10. The semantic aspect NODE COUNTER
When defining meta-semantics, we add this semantic aspect to the others. For instance,
we define accounting for the use of any variable and assignment operation:
duplicate countable_node Visit to VARIABLE Visit, assignment_statement PostVisit
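A sketch of the NODE COUNTER aspect in Python is given below. It assumes, for illustration only, that the Dictionary component is an ordinary Python dictionary and that node identifiers are strings; the function `countable_node_visit` and the identifiers `"var:x"` and `"assign:1"` are hypothetical.

```python
def countable_node_visit(counts: dict, node_id: str) -> None:
    """Create a record on the first visit of a node, otherwise increment it."""
    if node_id not in counts:
        counts[node_id] = 1
    else:
        counts[node_id] += 1

counts = {}
# The same action is attached to VARIABLE Visit and assignment_statement PostVisit;
# here we simply replay the visits produced by traversing "x := x" twice.
for node_id in ["var:x", "var:x", "assign:1", "var:x", "var:x", "assign:1"]:
    countable_node_visit(counts, node_id)
```

Because the action only needs the identifier of the current node, it can be duplicated to any syntax element without modification.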
Another semantic aspect can be built which records usage statistics for every concrete
variable in an additional dictionary (the variable name serves as the key). We
can improve this aspect further by accounting for the kind of variable use - defined,
modified, referenced, released, etc. In the program analysis and instrumentation area,
our approach is similar to the Wyong system (based on the Eli compiler generation
system and the ATOM program instrumentation system), because specific operations
are attached to syntax elements, and in this way we obtain a specific tool with additional
semantics [Slo97].
7.2. Storing of symbolic values for variables
The second example provides for the storing of symbolic values for variables
(Figure 11). To do this task in an effective way, we have additional operations in our
MOMS (symbol table). The operation createSymbValue creates an entry for a symbolic
value, and with the operations addTextToSymbValue and addVarToSymbValue we
form the value while traversing all nodes in the desired subtree (we store all needed
program symbols and symbolic values of variables). At the end we store the accumulated
value with the operation storeSymbValue.
IMPORT GLOBAL Trav of ADT_TreeTraverser, SymbRefStack of ADT_Stack,
    Env of ADT_SymbolTable, CanCreateSymbVar of ADT_Stack

Syntax elements (in visiting order): assignment_statement with the children
left_hand_side, ASSIGN, right_hand_side; Other aspects may be attached to both sides.
The figure shows the following semantic actions:

    LOCAL SymbVar = SymbRefStack.pop()
    Env.storeSymbValue(SymbVar, SymbVal)

    CanCreateSymbVar.push(FALSE)

    CanCreateSymbVar.pop()

    SymbRefStack.push(NULL)

    LOCAL SymbText = Trav.nodeValue()
    LOCAL SymbRef = SymbRefStack.top()
    Env.addTextToSymbValue(SymbRef, SymbText)

    LOCAL VarText = Trav.nodeValue()
    LOCAL SymbVarText = "SYMB_" + VarText
    if CanCreateSymbVar.top() = TRUE
        if Env.findVar(SymbVarText) = FALSE
            Env.createVar(SymbVarText, SYMB)
        endif
        LOCAL SymbRef = Env.getRef(SymbVarText)
        Env.createSymbValue(SymbRef)
        SymbRefStack.push(SymbRef)
    else
        LOCAL SymbRef = Env.getRef(SymbVarText)
        Env.addVarToSymbValue(SymbRef)
    endif

Figure 11. The semantic aspect SYMBOLIC VALUES
8. Conclusions
This method was developed with the goal of reducing the gap between practitioners
(tool developers) and theoreticians (developers of formal specifications for semantics).
Our experience shows that a significant gain is attained if abstract components
or abstract data types realize the greater part of semantics, because that way it is easier
to perceive the full implementation of semantics.
The second gain of our method is a significant decoupling of syntax and
semantics from each other. It allows us to combine various syntaxes and semantics and
to find the most desirable semantics for the given syntax. So, if we have written
syntax for a new language, we can match several semantics to it in a comparatively
short time. As a result, we can develop a wide spectrum of tools in support of our new
language.
Our approach allows us to change semantics dynamically while the interpreter is
running, i.e. to replace semantics or execute various ones simultaneously. It is possible to
reduce a derived parse tree (by deleting nodes with empty connectors) or to optimize it
(tree restructuring statically and dynamically, considering performance statistics).
At this moment the environment for tool construction or semantics generation is not
completely developed. Our experience shows that tools can be developed without
significant investments, for example, by using Lex/YACC as a generator to create a
syntactic object, which produces the program intermediate representation. It is not too
difficult to develop a Traverser and a simple SymbolTable. As the last job, we have
to work out semantic aspects on the basis of our method and compose them, thus
obtaining connectors, which can be written in some common programming language
(skipping meta-language use and its translation). The use of abstract components
depends on target semantics.
We believe that the development of serious tools demands a more universal
implementation of the Symbol table. Our Symbol table implementation, MOMS, is
applicable not only to imperative language implementations. It was also the basic
object-oriented database for the commercial application Mosaik (Sietec Consulting
GmbH Co. OHG, graphical CASE tool for business modelling).
The weakness of our method lies in the semantic aspects composition stage. At this
moment we have not analyzed all risks in terms of obtaining senseless or erroneous
semantics. The problems are not trivial, and they are similar to problems in the proper
collaboration of objects or components in object-oriented programming [e.g.
ML98]. Most name conflicts can be precluded automatically, but it is considerably
harder to organize collaboration among the common components in semantic aspects (it
is easier if the semantic aspects are mutually independent).
Another problem is that the language grammar is frequently not context free (this is
true of our example above, too). In this case we have to introduce additional flags to
memorize the context of syntax elements. It is advisable to rewrite the syntax and to use
context-free grammars.
9. References
[AAB96] V. Arnicane, G. Arnicans, and J. Bicevskis. Multilanguage interpreter. In H.-M. Haav
and B. Thalheim, editors, Proceedings of the Second International Baltic Workshop on
Databases and Information Systems (DB&IS '96), Volume 2: Technology Track, pages
173-174. Tampere University of Technology Press, 1996.
[AGB00] U. Aßmann, T. Genßler, and H. Bär. Meta-programming Grey-box Connectors.
Proceedings of the Technology of Object-Oriented Languages and Systems (TOOLS 33),
pp. 300-311, 2000.
[ASU86] Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques,
and Tools. Addison-Wesley, 1986.
[BM99] B. Biswas and R. Mall. Reverse Execution of Programs. ACM SIGPLAN Notices,
34(4):61-69, April 1999.
[Cla99] C. Clark. Build a Tree - Save a Parse. ACM SIGPLAN Notices, 34(4):19-24, April 1999.
[DKV00] A. van Deursen, P. Klint, and J. Visser. Domain-Specific Languages: An Annotated
Bibliography. ACM SIGPLAN Notices, 35(6):26-36, June 2000.
[Eng99] Dawson R. Engler. Interface Compilation: Steps toward Compiling Program Interfaces
as Languages. In DSL-99 [ITSE99], pp. 387-400.
[FL88] Charles N. Fischer and Richard J. LeBlanc, Jr. Crafting a Compiler. Benjamin/Cummings, 1988.
[FNT+97] F. Ferrucci, F. Napolitano, G. Tortora, M. Tucci, and G. Vitiello. An Interpreter for
Diagrammatic Languages Based on SR Grammars. Proceedings of the 1997 IEEE
Symposium on Visual Languages (VL '97), pages 292-299, 1997.
[GHJV95] E. Gamma, R. Helm, R. Johnson, and J. Vlissides. Design Patterns: Elements of
Reusable Object-Oriented Software, pages 331-334. Addison-Wesley, 1995.
[HK00] J. Heering and P. Klint. Semantics of Programming Languages: A Tool-Oriented
Approach. ACM SIGPLAN Notices, 35(3):39-48, March 2000.
[ITSE99] Special issue on domain-specific languages. IEEE Transactions on Software
Engineering, 25(3), May/June 1999.
[Kin95] W. Kinnersley, ed. The Language List. 1995. http://wuarchive.wustl.edu/doc/misc/lang-list.txt
[Lou97] Kenneth C. Louden. Compilers and Interpreters. In Tucker [Tuc97], pp. 2120-2147.
[ML98] M. Mezini and K. Lieberherr. Adaptive Plug-and-Play Components for Evolutionary
Software Development. SIGPLAN Notices, 33(10):97-116, 1998. Proceedings of the
1998 ACM SIGPLAN Conference on Object-Oriented Programming, Systems, Languages
& Applications (OOPSLA '98).
[OW99] J. Ovlinger and M. Wand. A Language for Specifying Recursive Traversals of Object
Structures. SIGPLAN Notices, 34(10):70-81, 1999. Proceedings of the 1999 ACM
SIGPLAN Conference on Object-Oriented Programming, Systems, Languages &
Applications (OOPSLA '99).
[Paa95] J. Paakki. Attribute Grammar Paradigms - A High-Level Methodology in Language
Implementation. ACM Computing Surveys, 27(2):196-255, June 1995.
[Pag81] Frank G. Pagan. Formal Specification of Programming Languages: A Panoramic
Primer. Prentice-Hall, 1981.
[Sch97] David A. Schmidt. Programming Language Semantics. In Tucker [Tuc97], pp. 2237-2254.
[SK95] K. Slonneger and B. L. Kurtz. Formal Syntax and Semantics of Programming
Languages: A Laboratory Based Approach. Addison-Wesley, 1995.
[Slo97] A. M. Sloane. Generating Dynamic Program Analysis Tools. Proceedings of the
Australian Software Engineering Conference (ASWEC '97), pp. 166-173, 1997.
[Tuc97] Allen B. Tucker, editor. The Computer Science and Engineering Handbook. CRC Press,
1997.
[Vis01] J. Visser. Visitor Combination and Traversal Control. SIGPLAN Notices, 36(11):270-282,
2001. Proceedings of the 2001 ACM SIGPLAN Conference on Object-Oriented
Programming, Systems, Languages & Applications (OOPSLA '01).
[ZZ97] D.-Q. Zhang and K. Zhang. Reserved Graph Grammar: A Specification Tool For
Diagrammatic VPLs. Proceedings of the 1997 IEEE Symposium on Visual Languages (VL
'97), pages 292-299, 1997.
10. Appendix
Table 4.
MOMS types

Type          Description
Constructor   Handle to an object type description
Value         Handle to a byte stream that contains an object value
Reference     Handle to a memory object
MemoryMap     Main central object that organizes other objects (other MemoryMap also)
IdentifDict   Dictionary of all identifiers (variables, constants, etc.)
TypesDict     Dictionary of all types (basic types and user defined)
FunctDict     Dictionary of all functions (basic operators, basic functions, user defined
              functions)
NamesTable    Compact storing of all strings
ConstrTable   Compact storing of all constructor descriptions
ValuesTable   Storing and managing of all object values
MemoryBlock   Organizing scope visibility in all dictionaries and providing memory
              management
... Others    Stack, queue, collection, etc.
Table 5.
System initialization and global operations

Operation
    Description

MemoryMap initialize(Uint memoryCount)
    Creates MOMS with internal parallel but related memories (MOMS).
Uint switchMemoryTo(Uint memNum)
    We can exploit only the specific internal memory.
Uint getCurrentMemory(void)
    Returns the number of the actual memory.
defineCountOfBaseTypes(Uint countOfBaseTypes)
    Defines a count of the basic types of MOMS.
defineBaseType(char* typeName, Uint type, Uint typeSize)
    Defines base types. This interface is in C and depends on previously defined types.
    Some examples:
    defineBaseType("long", LONG_, sizeof(long_));
    defineBaseType("boolean", BOOLEAN_, sizeof(boolean_));
    defineBaseType("date", DATE_, sizeof(date_))
defineBaseFunction(char* langFunctName, char* internalName, Uint returnType, int paramCount, ...)
    Defines the base operations and functions. This interface is in C and depends on
    previously defined types. Some examples:
    defineBaseFunction("+", "PLUS", LONG_, 2, LONG_, LONG_);
    defineBaseFunction("day", "day", LONG_, 1, DATE_)
prepareProgramEnv(Uchar scope)
    Prepares a new MemoryBlock, defines the scope (visibility) of previously defined
    variables, types, functions.
releaseProgramEnv(void)
    Releases the current MemoryBlock and all related memory in other objects
    (dictionaries, tables) and restores the previously defined MemoryBlock.
defineAutomaticMemSwitching(Uint firstMemNum, Uint lastMemNum)
    Provides for automatic switching in various functions. For instance, we look up the
    variable in a local memory and then in a global memory (if the variable has not been
    found yet).
... Others
Table 6.
Defining of user defined data types

Operation
    Description

putValue(Ref aRef, char* aValue)
    Sets a new value for the object.
char* getValue(Ref aRef)
    Returns a value for the object.
char* createDynamicValue(ConstrPtr ptrToConstr)
    Provides dynamic memory allocation for the object given by type.
deleteDynamicValue(char* aValue)
    Releases dynamically allocated memory.
setValueProtectionOn(char* aName)
    Protects a value of the given object against modification, for instance, protects
    constants.
setValueProtectionOff(char* aName)
    Takes off a value protection.
gotoArrayElementConstr(Ref& aRef, Sint index)
    Sets a virtual mark to the element constructor and to a given array element value.
    aRef is modified; it refers to the array element.
gotoNameConstr(Ref& aRef)
    Moves the virtual mark to the name constructor.
gotoPointerConstr(Ref& aRef)
    Moves the virtual mark to the pointer subconstructor and to the start of the value.
gotoProductionLeftConstr(Ref& aRef)
    Moves the virtual mark to the left subconstructor and to the start of the
    corresponding value.
gotoProductionRightConstr(Ref& aRef)
    Moves the virtual mark to the right subconstructor and to the start of the
    corresponding value.
gotoRecordConstr(Ref& aRef)
    Moves the virtual mark to the start of the record.
gotoNameInList(Ref& aRef, char* aName)
    Moves to the appropriate type and value (list of named types linked by productions).
    E.g., search for a field in the user-defined structure.
... Other
Table 7.
Operations with variables and similar objects

Operation
    Description

ConstrPtr createConstrArray(Uint minIndex, Uint maxIndex, ConstrPtr ptrToElemConstr)
    Defines an array type with the given dimensions and element types. Here and in other
    functions we can use any previously defined (or partly defined) data type.
ConstrPtr createConstrFunct(ConstrPtr ptrToReturnConstr, ConstrPtr ptrToParamConstr)
    Defines a function type with the given parameters and return type.
ConstrPtr createConstrName(char* aName, ConstrPtr ptrToSubConstr)
    Assigns a user-defined name for the given type.
ConstrPtr createConstrPointer(ConstrPtr ptrToSubConstr)
    Defines a pointer type to the given type.
ConstrPtr createConstrProduct(ConstrPtr ptrToSubConstr1, ConstrPtr ptrToSubConstr2)
    Creates a production of two types (establishes some relation between them). It is
    useful for constructing a complex data structure.
ConstrPtr createConstrRecord(ConstrPtr ptrToSubConstr)
    Defines a record data type (a set of pairs [name, type]).
... Other constructors
    For instance, base data type constructors.
constrArraySetMinIndex(ConstrPtr ptrToConstr, Uint minIndex)
    Modifies the type description (attributes).
Uint constrArrayGetMinIndex(ConstrPtr ptrToConstr)
    Provides details about data type attributes.
... Others
Table 8.
Operations with value

Operation
    Description

createVar(char* aName, ConstrPtr ptrToConstr)
    Creates a variable with the given name and type.
createVar(char* aName, char* typeName)
    Creates a variable with the given name and type name.
createLiteral(char* aName, ConstrPtr ptrToConstr)
    Creates a literal (constant) with the given name and type.
Ref createRef(ConstrPtr ptrToConstr)
    Creates an object without a name with the given type, for instance, an internal loop
    counter or the return value of a function.
ConstrPtr getConstrRef(char* typeName)
    Returns a pointer to the type with the given type name.
createSynonym(char* aName, Ref aRef)
    Creates another reference by name to the existing object.
... Other
LATVIJAS UNIVERSITATES RAKSTI. 2004. 669. sej.: DATORZINATNE UN INFORMACIJAS TEHNOLOGIJAS, 34.-52. lpp.
Data Staging in the Data Warehouse
Janis Benefelds
Computer Science Ph.D. student, University of Latvia
Janis.Benefelds@unibanka.lv
This paper describes one of the components of Data Warehousing: Data Staging. Data staging ensures
a) data extraction from the source systems; b) data transformation and standardisation; c) data
transfer to the Data Warehouse. The importance of data quality and data integrity is also
described in more detail.
This paper is intended as a survey based on practical experience in real projects and on studies
of the professional literature.
The paper consists of a Foreword, six Chapters and References.
The Foreword describes the scope of the subject analysed in this article.
The first Chapter is an introduction that gives definitions and a common understanding of the
concepts used in this article.
The second Chapter is the largest one, and it describes the three main parts of Data
Staging: data export, data transformation and data load. Two types of Data Staging, dimensional and
fact table staging, are described in more detail. This Chapter is also concerned with related
problems, such as data quality.
The third and fourth Chapters describe Metadata, an important component of Data Staging,
and the mutual relation between it and the other parts of Data Staging described in the
previous Chapter.
The fifth Chapter summarises the more important items and emphasises how essential they are
for success in Data Staging.
The sixth Chapter points out directions that follow as a logical continuation and are
subjects for further research and analysis.
References contains a list of information sources.
Key words: metadata, data warehouse, data staging, data integrity.
Foreword
This article is a summary of Data Staging as one of the components of Data
Warehousing. It explains what Data Staging means and what it
consists of. Some common definitions are given and the basic elements (data extraction,
transformation, loading and metadata) are described. The content is additionally
illustrated by many examples taken from the financial industry.
This article points out that a correct understanding of the functions
which Data Staging is responsible for is important for success in designing and deploying
projects in real life.
The following picture (Figure 1) shows the basic Data Warehousing elements and
indicates what part of them this article covers.
[Figure 1 diagram: Data Sources; Data Staging Area (Data Export, Data Transform, Data Load); Data Warehouse; End User Applications; Metadata]
Figure 1: The basic elements of Data Warehousing
1. Introduction
Data Staging is one of the first and one of the most important things to consider when
discussing the development of a Data Warehouse.
In many books, on the Internet and in other sources we can find numerous Data Staging
definitions that are very similar. I would like to mention some of them that are
sufficient to describe all the main functions performed by Data Staging.
Data Staging is a set of processes that cleans, transforms, combines, deduplicates, households,
archives and prepares source data for use in the Data Warehouse.
Data Staging Area is everything in between the source system and the presentation server
where Data Staging processes are performed. The key defining restriction on the Data Staging
area is that it does not provide query and presentation services to end-users.
The Data Staging area is the Data Warehouse workbench. It is the place where raw data is
loaded, cleaned, combined, archived and exported to one or more presentation server
platforms.
Ralph Kimball, one of the leading Data Warehouse (DWH) authorities, has noted that Data Staging
adds value to the data, e.g., the quality of the data and their usefulness in
analysis are higher than before. In general the Data Staging process consists of one
common service component, metadata, and three functional components:
• Data export from the source systems, or data consolidation;
• Data cleaning, transforming and sorting, or data standardization;
• Data import into the Data Warehouse or Data Mart, or loading.
Different abbreviations may be used to name these processes. ETL
stands for Extract, Transform and Load; ETT stands for Extraction, Transformation and
Transportation. In this article we'll use ETL.
It certainly does not mean that Data Staging consists of three strictly separate
modules, where the sequence and completeness of tasks to be performed is strictly
prescribed. Data Staging, depending on the situation, can be a very complicated process
from the functional and technical point of view.
It is recommended to start the Data Staging design process with a very simple
schematic of the pieces that you know: the sources and targets. Keep it very high-level
and highlight where data are coming from and where data are going to; point out the
major challenges that you already know about. Figure 2 shows an example of a high-level
Data Staging design plan that came from the financial industry.
Figure 2: Example of the high-level Data Staging plan for some financial institution.
For a better understanding of the principles and mission of Data Staging, let's
analyse each of the three steps mentioned before more closely.
2. ETL
2.1. Data export from transactional systems or data consolidation.
Practically every on-line transaction processing (OLTP) system used in an organization is a
potential data source for the Data Warehouse. However, a data source for the Data
Warehouse can be some external information systems and not classified information
(flat files etc.) as well. Everything that serves as a data source can be developed in
different environments on different platforms, and it can be "home-made" or delivered
by external vendors. From the point of view of Data Warehousing it should make no
difference: the only thing we are interested in is data.
Normally, the scope of the data to be collected in the Data Warehouse is defined by
the end users, usually business people. That means the scope is defined in business, not
Information Technology, concepts, and therefore high-quality business analysts and
system analysts are required at this stage of the Data Warehousing process.
To determine the right database objects (and their relationships) that could potentially
be included in the data export processes, business analysts should very precisely
define the requirements for the data needed, and the system analyst should be able to point
out how and where these data can physically be located. Normally the system analyst
uses the system documentation, which should help to find the right information about
data within a system. Sadly, there may be situations when documentation does not help,
or there is no documentation at all. Such a situation can arise because of poor
management during the system development, implementation and maintenance processes.
Then the only source of information about the inside of a particular information system
is people. Interviews should be organized, and the system documentation
should be revised according to the actual status of the system.
One of the resulting documents of these interviews is a data dictionary, which is
actually a part of the metadata (see chapter 3. Metadata). A data dictionary is a
collection of descriptions of the data objects or items and their relationships in a data
model. A data dictionary can be consulted to understand where a data item fits in the
structure, what values it may contain and basically what the data item means in
real-world terms. A small example of that part, which explains what the data item means in
real-world terms, is shown in the following table.
Table No. 1
"Example of a Data dictionary from the financial industry"

Real-world term   Database object and its relationship
Account (Acct)    One record of the table ACCOUNTS
Active Acct       Accounts.Status = 1 AND Accounts.CloseDate IS NULL
Dormant Acct      Accounts.Status = 2 AND <today> - Accounts.LastTranDate > 3M
Closed Acct       Accounts.Status = 2 AND Accounts.CloseDate <= <today>
Client            One record of the table CLIENTS
Active client     Client that has at least one active account (see 'Active Acct')
Dormant client    Client with all accounts being dormant (see 'Dormant Acct')
Closed client     Client with all accounts being closed (see 'Closed Acct')
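
The rules of such a data dictionary can be written down as executable predicates, which makes them easy to check against sample data. The following Python sketch is only an illustration under stated assumptions: the record layout (dictionaries with the keys `status`, `close_date` and `last_tran_date`) is hypothetical, "3M" is approximated as 90 days, and because the Dormant and Closed rules in the table can overlap, the sketch tests the Closed rule first.

```python
from datetime import date, timedelta

# Assumption: "3M" in the data dictionary is approximated as 90 days.
DORMANT_AFTER = timedelta(days=90)


def account_state(acct, today):
    """Classify one ACCOUNTS record according to the data dictionary rules."""
    if acct["status"] == 1 and acct["close_date"] is None:
        return "active"
    # Assumption: the Closed rule takes precedence over the Dormant rule.
    if (acct["status"] == 2 and acct["close_date"] is not None
            and acct["close_date"] <= today):
        return "closed"
    if acct["status"] == 2 and today - acct["last_tran_date"] > DORMANT_AFTER:
        return "dormant"
    return "unknown"


def client_state(accounts, today):
    """Classify a client from the states of all of the client's accounts."""
    states = [account_state(a, today) for a in accounts]
    if "active" in states:
        return "active"
    if states and all(s == "dormant" for s in states):
        return "dormant"
    if states and all(s == "closed" for s in states):
        return "closed"
    return "unknown"


today = date(2004, 1, 31)
active = {"status": 1, "close_date": None, "last_tran_date": date(2004, 1, 20)}
dormant = {"status": 2, "close_date": None, "last_tran_date": date(2003, 6, 1)}
closed = {"status": 2, "close_date": date(2003, 12, 1),
          "last_tran_date": date(2003, 11, 1)}

print(client_state([active, dormant], today))
print(client_state([dormant, dormant], today))
print(client_state([closed], today))
```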
Using such a data dictionary improves the efficiency of communications between
business and IT people. There should be no questions about what the business people
are requiring and where it can be found.
When we have clarified what data is needed and where it can be located, we
should give an answer to the question "how" the data can be extracted or exported from
the particular data source. Usually the first steps of the technical solution are very close
to the technical solution of the data source itself. It means a special program should be
developed that extracts specific data from all data sources and puts it together in a Data
Staging area. Figure 3 shows a small schema of data extraction from the different data
sources and gathering the particular data together in the Data Staging area. Some
financial figures of the customers are taken to build a customer profile.
At the start of building the Data Staging application the primary goal is to work out
the infrastructure issues, including connectivity, security and file transformation.
Figure 3: An Example of data extraction processes from different data sources and putting it
together in the Data Staging area.
Certainly data in a Data Staging area are stored in some particular structure, and
therefore the way in which the data comes into these structures should be strictly
predefined and completely documented. The name of such a document is data mapping.
Data mapping shows the complete way in which the particular data from some source
system is transported (and transformed in between if necessary) to a Data Staging area.
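
A data mapping can also be kept in a machine-readable form, so that the same document drives the transport of the data. The following Python sketch is only an illustration: the source system names, field names and target names are hypothetical, and each rule states where a value comes from, where it goes in the Data Staging area and which transformation is applied in between.

```python
# Each rule: source system, source field, target field, transformation.
# All names are hypothetical and serve only as an illustration.
MAPPING = [
    {"system": "deposits", "field": "BAL", "target": "deposit_balance",
     "transform": float},
    {"system": "loans", "field": "OUTSTANDING", "target": "loan_balance",
     "transform": float},
    {"system": "crm", "field": "NAME", "target": "customer_name",
     "transform": str.strip},
]


def apply_mapping(source_rows, mapping):
    """Build one staging record from the rows of several source systems."""
    staged = {}
    for rule in mapping:
        raw = source_rows[rule["system"]][rule["field"]]
        staged[rule["target"]] = rule["transform"](raw)
    return staged


source_rows = {
    "deposits": {"BAL": "1200.50"},
    "loans": {"OUTSTANDING": "300"},
    "crm": {"NAME": " Ann Smith "},
}
profile = apply_mapping(source_rows, MAPPING)
print(profile)
```

Keeping the rules as data rather than as code means the mapping can be reviewed by analysts and stored together with the rest of the metadata.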
Corresponding to a dimensional data model (generally that is the data model for the
Data Warehouse) there are two different types of Data Staging, one for each of the two
main components of the dimensional model: dimensional Data Staging and fact Data
Staging.
Dimensional Model - a specific discipline for modelling data that is an alternative to
entity-relationship (ER) modelling.
A fact table is the primary table in each dimensional model that is meant to contain
measurements of the business. The most useful facts are numeric and additive. Every fact table
represents a many-to-many relationship and every fact table contains a set of two or more
foreign keys that join to their respective dimensional tables.
A dimensional table is one of a set of companion tables to a fact table. Each dimension is
defined by its primary key that serves as the basis for referential integrity with any given fact
table to which it is joined.
2.1.1. Dimensional table staging.
There are three types of processing dimensional data: overwrite, create a new
dimension record, and displace the changed value into an "old" attribute field. The
second one is the technique usually used for handling occasional changes in a
dimension record. We shall focus on that type of dimensional data processing.
In terms of complexity the major problem could be the identification and extraction
of only the data that has been changed since the last data extraction, instead of extracting all
data every time. It is convenient to have a mechanism to select new or changed data
only. Ideally there is an additional column in the source data that shows the last change
date and time, or just a yes/no stamp that indicates whether the particular data has or has not
changed since the last data extraction. That means that only these data will be the
subject of the Data Staging. Taking into account that some of the data source
information systems are usually either delivered by an external vendor or are old "in-house"
systems without any knowledge about their contents and without full documentation,
it may be problematic to change these systems to provide such a stamp indicating
the change status relative to the last data extraction process. That means there should be
another way to determine the existence of new records and changes in the old ones.
In this case the easiest way is to pull the current snapshot of the dimension table in its
entirety and let the transformation process deal with determining what has changed and
how to handle it. Change determination could be done by comparing the actual data
snapshot either with yesterday's data snapshot or with the current Data Warehouse
dimensional table. This way is more time-consuming and complicated. Figure 4 shows
the data flow for this type of dimensional Data Staging.
[Figure 4 diagram: Data source; Yesterday's dimensional table data; Today's dimensional table data; Compare and allocate differences; Surrogate key generation; New records to be inserted in the DWH dimension table]
Figure 4: Allocation of new and changed records in dimension data
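
The comparison of two full snapshots can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each snapshot is assumed to be a dictionary keyed by the natural key of the dimension record, and the function name `diff_snapshots` and the sample data are hypothetical. Surrogate key generation and loading into the DWH dimension table are not shown.

```python
def diff_snapshots(yesterday, today):
    """Return the new records and the changed records of today's snapshot."""
    new = {k: v for k, v in today.items() if k not in yesterday}
    changed = {k: v for k, v in today.items()
               if k in yesterday and yesterday[k] != v}
    return new, changed


yesterday = {"C1": ("Ann", "Riga"), "C2": ("Bob", "Liepaja")}
today = {"C1": ("Ann", "Riga"), "C2": ("Bob", "Ventspils"),
         "C3": ("Eve", "Jelgava")}

new, changed = diff_snapshots(yesterday, today)
print(new)
print(changed)
```

With the "create a new dimension record" technique, both the new and the changed records would then receive a fresh surrogate key before being inserted.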
Because of the large amount of data to be extracted and compared (or transferred),
special attention should be paid to the usage of software and hardware resources. Special
comparison methods could be used for performance improvement, such as the Cyclic
Redundancy Check (CRC). This method calculates a checksum for a particular
string: it could be a record of a table, a whole text file or something else. The
checksum calculated for the new data is then compared with the checksum of
yesterday's data, which has already been calculated. If these checksums are equal there is no need
to compare the data in detail. The usage of such methods substantially saves
time and hardware resources.
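
As an illustration of the checksum comparison, the following Python sketch uses `zlib.crc32` from the standard library. The record layout, the field separator and the function names are assumptions made for the example. Note that equal checksums make a change very unlikely but do not strictly exclude it, since different strings can share a CRC value.

```python
import zlib


def row_crc(row):
    """CRC-32 checksum of one record, with its fields joined into a string."""
    text = "|".join(str(value) for value in row)
    return zlib.crc32(text.encode("utf-8"))


def changed_keys(stored_crcs, today_rows):
    """Keys whose checksum differs from the stored one, or that are new."""
    return [key for key, row in today_rows.items()
            if stored_crcs.get(key) != row_crc(row)]


yesterday_rows = {"C1": ("Ann", "Riga"), "C2": ("Bob", "Liepaja")}
today_rows = {"C1": ("Ann", "Riga"), "C2": ("Bob", "Ventspils"),
              "C3": ("Eve", "Jelgava")}

# The checksums of yesterday's data are calculated once and stored.
stored_crcs = {key: row_crc(row) for key, row in yesterday_rows.items()}
print(changed_keys(stored_crcs, today_rows))
```

Only the stored checksums of yesterday's data are needed for the comparison, not yesterday's records themselves, which is where the saving in time and storage comes from.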
2.1.2. Fact table staging
Facts usually represent a snapshot of a data source at a particular point in time.
Accordingly, facts are normally the numeric attributes of the subject of the transactional
system, like the daily balance of an account, the sales amount of a product, etc. Problems
regarding the extraction of such data from the data source are normally connected with the
amount of data to be processed and with the changeability of data. Problems regarding
data amounts can usually be solved by technical means or by optimizing data processing
performance. Almost every transactional system today provides on-line
services, and that means transactions are performed all the time (24 hours / 7 days). The
problem grows even bigger in the case of a multinational system, where many time zones
are involved. To get a snapshot of such a system at a particular point in time, some
technical tricks can help. There can be several solutions. Current database
management systems, like Oracle, provide a backup technology that works without
disconnecting users or shutting down the database: a backup is created taking into
account the events in the system logs. Accordingly, fact data could be extracted from a
backup or directly from the system using the same technology. Normally
every system also has some housekeeping phase for supporting internal functionality
(like EOD, end of day) that keeps the system constant for some time. Whether the
system serves business transactions or not during this time-out depends on the particular
system. Such a time-out is a good place for extracting fact data (and
dimensional data as well).
Conclusion.
The financial industry differs from the other major sectors of business with its non-stop
business. Because of the many channels (ATM, Internet, WAP, etc.) which banks
provide, the customer has the possibility to do business at any time on any day. A similar
situation exists in the telecommunication business as well.
That is problem No. 1 for data extraction, which should be solved. Normally
every IS operating in the financial industry has already solved the problem of on-line
transactions. There are some techniques that ensure on-line data processing and daily
(weekly, monthly, etc.) data servicing at the same time. These data servicing procedures
(housekeeping, calculations, etc.) could be extended with data extraction.
Problem No. 2 is the large amount of heterogeneous data (funds transfers, loan
agreements, securities, trust operations, insurance, etc.) to be processed. That could be
solved by 1) using optimal data allocating algorithms to find the data to be processed and 2)
fast data extraction mechanisms. A data allocating mechanism could be, for instance, the
Janis
Benefelds.
Data Staging
in the D a t a
Warehouse
41
Cyclic R e d u n d a c y C h e c k s u m m e t h o d or using different ' c h a n g e d d a t a ' flags in the
source s y s t e m s . A s fast a data extraction m e c h a n i s m specific tools could be used or just
m a k i n g a copy and p r o c e s s i n g 9 0 % of data instead of selecting 9 0 % o f data and
processing all of the resulting data set.
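As an illustration of the checksum idea mentioned above, the following is a minimal sketch, not the method of any particular product. It assumes that a CRC-32 value was stored for every source row during the previous extraction; the names `row_crc` and `changed_keys` and the sample rows are hypothetical.

```python
import zlib

def row_crc(row):
    """CRC-32 over a canonical text form of the row (fields joined by a separator)."""
    return zlib.crc32("\x1f".join(str(v) for v in row).encode("utf-8"))

def changed_keys(previous_crcs, current_rows):
    """Return the keys of rows that are new or whose checksum differs from the stored one."""
    changed = []
    for key, row in current_rows.items():
        if previous_crcs.get(key) != row_crc(row):
            changed.append(key)
    return changed

# Hypothetical data: key -> (customer name, balance as text).
yesterday = {1: ("Alice", "300"), 2: ("Bob", "100")}
today = {1: ("Alice", "300"), 2: ("Bob", "250"), 3: ("Carol", "700")}

stored = {key: row_crc(row) for key, row in yesterday.items()}
to_extract = changed_keys(stored, today)
print(to_extract)
```

Only the rows listed in `to_extract` would then go through the rest of the staging process. A checksum comparison of this kind can, in rare cases, miss a change, which is one reason 'changed data' flags in the source system are named as an alternative.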
2.2. Data cleaning, transforming and sorting or data standardization
This is the place where additional value is actually added to the data.
Usually there are some activities between data extraction from the data source systems
and data import into the data target. These activities ensure that the data are in the right
structure, format, and sequence, and that their quality is adequate for the data model
of the data target. The data should satisfy all logical rules provided by business analysts
and all physical rules that come from the Data Warehouse data model and end user
applications.
2.2.1. Data quality
Actually, most data transformation deals with data quality.
There is no reason to assume that the quality of data in OLTP systems will always be
acceptable for the Data Warehouse. Some data validations can be integrated directly in
the source (OLTP) to protect data from logical and structural mistakes. Even so,
there are some problems. First, every data validation procedure makes the system
slower, and that conflicts with the definition of OLTP: "on-line (that is, "fast")
transaction processing". Second, even if we create all necessary data validations in
OLTP, there will always arise some new requirements or conditions on the data to be
realized in the Data Warehouse. It is easier, cheaper and more secure to realize data
cleaning and transformation procedures outside of the data source.
In the situation when data are validated after they have been captured, another
problem arises: how to edit them now? If a validation error occurs during data capture,
the data can be entered once more until they are correct. If, however, a data validation
error is discovered during a fully automated data cleaning and standardization process, there
should be some previously described algorithms to manage the situation. Automatic
data editing algorithms should be approved by business analysts, and every data change
should be stored in the LOG files.
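A minimal sketch of such a pre-approved, logged correction step could look as follows. The rule set, the field names and the in-memory `audit` list (standing in for the LOG files) are assumptions made only for this example.

```python
def clean_record(record, rules, log):
    """Apply pre-approved correction rules to one record and log every change."""
    cleaned = dict(record)
    for field, rule in rules.items():
        old = cleaned.get(field)
        new = rule(old)
        if new != old:
            # Every automatic modification leaves an audit trail entry.
            log.append({"id": record["id"], "field": field, "old": old, "new": new})
            cleaned[field] = new
    return cleaned

# Hypothetical rules, assumed to have been approved by business analysts.
rules = {
    "name": lambda v: v.strip(),
    "country": lambda v: v.upper(),
}

audit = []
cleaned = clean_record({"id": 7, "name": "  IBM ", "country": "lv"}, rules, audit)
print(cleaned)
print(audit)
```

Keeping the old and the new value in each log entry is what later makes every difference between source and warehouse explainable.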
Wherever and whenever data are modified before importing them into the Data
Warehouse, they should be of the best quality possible. There are attributes that indicate
the level of data quality.
Table No. 2
"Attributes indicating the data quality."
Attribute: Description
ACCURATE: means data should correspond exactly to the data extracted from the
data sources. All data modifications, standardizations and other
processes performed by data transforming are accountable, and an
explanation of every difference can be found in audit trails - LOG
files;
COMPLETE: means the particular data set (data mart) should contain absolutely
all the data, and nothing more, according to the Data Staging procedures. If
one particular data mart should contain all active customer records,
then all customers with an 'active' status should be included there
and all customers with a status not equal to 'active' are not included
there;
CONSISTENT: means that during the data transformation process data should not lose
their meaning and value. These are situations when one should solve
problems regarding rounding of numerics, interpretation of
aggregated values and other similar issues;
UNIQUE: means data should not have duplicates, especially when data are
extracted from more than one source. A unique ID, in the business
sense, must be allocated or generated for the same business
unit, and algorithms approved by business analysts must be used for
joining data from multiple sources;
TIMELY: means data should be current. Nobody is interested in out of date
information. The data used in the Data Warehouse should be taken
from and validated against 'fresh' data whenever possible. Of
course, 'fresh' has many meanings in different businesses. If we are
talking, for instance, about hotel chains, the 'fresh' data could be the
turnover of the last day or even week. Whereas if we are talking
about financial institutions, like money markets or stock exchanges,
then minutes or even seconds are a very long time.
To ensure the data characteristics mentioned above, many problems have to be
solved. These are problems regarding the data itself, not performance problems of
hardware or software.
Table No. 3
"Types of problems to be solved regarding data quality"
Type: Description
Inconsistent use of code or different look-up values:
The same value could have a different key in different information systems. Sex, for
instance, in one source could have a look-up table like M - male and F - female, while in
another system 1 - male and 2 - female. These values should be standardized during the data
transformation process by unification of the particular values;
Unofficial or undocumented functionality:
Sometimes some of the fields are used for another reason than documented in the user manual,
or data in the particular field are unclassified. A big mistake could be address fields that are
not structured as an address usually is - five text fields of 25 characters length, for instance.
From the business point of view the content of these fields is gold, but it is absolutely unusable
for data processing;
Overloaded codes:
Usually overloaded codes occur in systems that were developed a long time ago, when every
saved byte was very important. That means a single code was used for multiple purposes. If
the customer is an individual, then a particular field could be used for the birth date. If it is a
corporate customer - for the registration date;
Evolving data:
Systems that have evolved over time may have old data and new data that use the same fields
for different purposes, or where the meaning of the codes has changed over time. This is very
important if data are planned to be loaded into the Data Warehouse from historical data archives;
Missing, incorrect or duplicate values:
Names and addresses are the classic example of this problem. Transactional systems do not
care about data that are not needed for performing the particular transaction. It does not matter
whether the name of the customer is IBM or I.B.M. But they can end up looking like two
different customers in the warehouse;
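Two of the problem types in the table can be sketched in a few lines: unification of look-up codes, and a matching key that lets 'IBM' and 'I.B.M.' be recognised as the same customer. The source names `source_a` and `source_b`, the target codes and the matching rule are hypothetical; real matching rules would be agreed with business analysts.

```python
# Hypothetical per-source look-up tables mapped onto one standard code set.
SEX_CODES = {
    "source_a": {"M": "M", "F": "F"},
    "source_b": {"1": "M", "2": "F"},
}

def standardize_sex(source, value):
    """Translate a source-specific code into the standard code (None if unknown)."""
    return SEX_CODES[source].get(str(value).strip().upper())

def name_key(name):
    """Matching key for customer names: letters and digits only, upper-cased."""
    return "".join(ch for ch in name.upper() if ch.isalnum())

print(standardize_sex("source_a", "f"), standardize_sex("source_b", 1))
print(name_key("IBM") == name_key("I.B.M."))
```

Returning `None` for an unknown code, instead of guessing, leaves the decision about such records to a separately approved rule.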
I would like to add one more type of problem from my own experience. It relates
to different data formats: data should be in the same format to be processed
automatically.
Table No. 3 (continued)
"Types of problems to be solved regarding data quality"
Different data formats:
When gathering data from many sources, data could be in different formats. That refers to any
data format: date ('dd.mm.yy'; 'YYYY-MON-DD'; ...), numeric ('0.99'; '9.9999E'; ...), etc.
To use such data for common data processing, they have to be in one previously described
format;
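A small sketch of bringing such values into one agreed format is given below. The list of accepted date patterns is an assumption for the example, and the month-name pattern relies on Python's default (English) locale for time formatting.

```python
from datetime import datetime, date
from decimal import Decimal

# Assumed set of source patterns: 'dd.mm.yy', 'YYYY-MON-DD' and ISO 'YYYY-MM-DD'.
DATE_FORMATS = ("%d.%m.%y", "%Y-%b-%d", "%Y-%m-%d")

def parse_date(text):
    """Try each known source format in turn and return a date object."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date format: {text!r}")

def parse_number(text):
    """Plain and exponent notation both end up as exact decimal values."""
    return Decimal(text.strip())

print(parse_date("15.03.04"), parse_date("2004-MAR-15"))
print(parse_number("0.99"), parse_number("9.9999E2"))
```

Raising an error for an unknown pattern, rather than silently passing the text through, keeps unconverted values out of the common processing step.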
If possible, feedback from Data Staging back to the data source could be
considered. This is also a way to improve the data quality in the data sources. However,
it should be assessed very critically, because data sources (analytical transactional
systems) are made above all for processing data captured by users. Data
modifications made by other software, like Data Staging procedures, might not be the
best way of improving data quality in the analytical system.
2.2.2. Data integrity
Besides data quality, this part of Data Staging is responsible for data integration as
well. Ensuring data integrity means generating surrogate keys (as primary keys) for
dimensional data tables and populating (or even creating) fact tables using the newly
generated surrogate keys. All dimensional tables have single-part keys, which, by definition, are
primary keys. In other words, the key itself defines uniqueness in the dimensional
tables. It is a significant mistake to rely, for the uniqueness of a dimensional table record, on an
SQL-based date stamp, a set of various attributes in the dimensional table, or production
keys used in the data source. In most cases an integer makes a great surrogate key. A
4-byte integer, for instance, can contain 2^32 values, or more than two billion positive
integers starting with 1, and that is enough for just about any dimension.
Since fact tables are always associated with dimensional tables, surrogate keys
should replace the corresponding foreign keys there. Figure 5 shows a dimensional data
model where the generated surrogate keys (Time_key, Customer_key, Acct_key) of the
dimensional tables are used as foreign keys in the fact table instead of production keys
(Date, Customer_ID, Acct_ID).
[Figure 5 diagram, contents of the boxes:
Time dimension: Time_key (PK), Date, Day_of_week, Day_of_year, Week_number, etc.
Account dimension: Acct_key (PK), Acct_ID, Acct_type, Currency, Ledger_balance, etc.
Customer dimension: Customer_key (PK), Customer_ID, Customer_type, Name, Address, etc.
Daily balance fact table: Time_key (FK), Acct_key (FK), Customer_key (FK), Ledger_bal,
Acct_ccy, Ledger_bal1, Base_ccy]
Figure 5: Example of using surrogate keys instead
of production keys in the Daily balance fact table
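The replacement of production keys by surrogate keys can be sketched as follows. This is only an in-memory illustration under the assumption that each dimension keeps a map from production key to surrogate key; the class `SurrogateKeyMap` and the sample rows are hypothetical, and in a real warehouse the map would live in the dimension table itself.

```python
class SurrogateKeyMap:
    """Hands out integer surrogate keys, starting with 1, one per production key."""

    def __init__(self):
        self._keys = {}
        self._next = 1

    def key_for(self, production_key):
        if production_key not in self._keys:
            self._keys[production_key] = self._next
            self._next += 1
        return self._keys[production_key]

customer_keys = SurrogateKeyMap()
acct_keys = SurrogateKeyMap()

# Hypothetical source rows that still carry production keys.
source_rows = [
    {"Customer_ID": "C-100", "Acct_ID": "LV01", "Ledger_bal": 250.0},
    {"Customer_ID": "C-200", "Acct_ID": "LV02", "Ledger_bal": 90.0},
    {"Customer_ID": "C-100", "Acct_ID": "LV03", "Ledger_bal": 10.0},
]

# Fact rows reference the dimensions only through surrogate keys.
fact_rows = [
    {
        "Customer_key": customer_keys.key_for(r["Customer_ID"]),
        "Acct_key": acct_keys.key_for(r["Acct_ID"]),
        "Ledger_bal": r["Ledger_bal"],
    }
    for r in source_rows
]
print(fact_rows)

# Capacity of a 4-byte integer: 2**32 bit patterns, 2**31 - 1 positive signed values.
print(2**32, 2**31 - 1)
```

The same customer always receives the same surrogate key, so the two rows of customer C-100 point to one dimension record.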
For better querying performance, besides dimensional and fact data, the Data
Warehouse holds aggregate data as well. Since the Data Warehouse is used above all
to get summarized data over any dimensional attributes, a significant data aggregation
would otherwise be performed by every data request. Therefore database management systems
provide an opportunity to maintain aggregated data: materialized views, or data where
subtotals are already calculated up to a predefined level of granularity. Figure 6 shows
how an aggregate data table is handled so that it answers a data query about product
sales subtotals for a particular week or a few weeks.
[Figure 6 diagram, contents of the boxes:
Aggregate Daily balance fact table: Day_of_year (FK), Acct_key (FK), SUM(Ledger_bal),
MIN(Ledger_bal), MAX(Ledger_bal), AVG(Ledger_bal), Acct_ccy, SUM(Ledger_bal1),
MIN(Ledger_bal1), MAX(Ledger_bal1), AVG(Ledger_bal1), Base_ccy
Time dimension: Time_key (PK), Date, Day_of_week, Day_of_year, Week_number, etc.
Product dimension: Acct_key (PK), Acct_ID, Acct_type, Currency, Ledger_balance, etc.
Sales fact table: Time_key (FK), Acct_key (FK), Customer_key (FK), Ledger_bal, Acct_ccy,
Ledger_bal1, Base_ccy]
Figure 6: Example of the aggregate Daily balance fact table.
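The idea of a precalculated aggregate can be tried out with the following sketch. It uses an in-memory SQLite database purely for illustration; SQLite has no materialized views, so an ordinary table built with CREATE TABLE ... AS SELECT stands in for one. The table and column names are simplified assumptions, and the aggregation level here is the week rather than the day of year shown in the figure.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE time_dim (time_key INTEGER PRIMARY KEY, date TEXT, week_number INTEGER);
CREATE TABLE daily_balance_fact (time_key INTEGER, acct_key INTEGER, ledger_bal REAL);

INSERT INTO time_dim VALUES (1, '2004-01-05', 2), (2, '2004-01-06', 2), (3, '2004-01-12', 3);
INSERT INTO daily_balance_fact VALUES (1, 10, 100.0), (2, 10, 140.0), (3, 10, 90.0);

-- Subtotals are calculated once, at load time, at the week/account level.
CREATE TABLE weekly_balance_agg AS
SELECT t.week_number      AS week_number,
       f.acct_key         AS acct_key,
       SUM(f.ledger_bal)  AS sum_bal,
       MIN(f.ledger_bal)  AS min_bal,
       MAX(f.ledger_bal)  AS max_bal,
       AVG(f.ledger_bal)  AS avg_bal
FROM daily_balance_fact f
JOIN time_dim t ON t.time_key = f.time_key
GROUP BY t.week_number, f.acct_key;
""")

# A weekly query now reads the small aggregate table instead of the fact table.
rows = con.execute(
    "SELECT week_number, acct_key, sum_bal, min_bal, max_bal, avg_bal "
    "FROM weekly_balance_agg ORDER BY week_number"
).fetchall()
print(rows)
```

The aggregate has to be rebuilt or incrementally updated after every load, which is the maintenance work a materialized view would otherwise take over.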
Conclusion.
I would say that data quality is one of the major units of measure for Data
Warehousing (the second one is data access performance, but that is not the subject of
this topic). The benefit that users obtain from using the data warehouse is only as
large as the data are clean and good. So, if data are inaccurate, the user gets the
wrong picture of the business, and a decision based on that picture could be
wrong.
No resources should be spared to improve data quality. Regarding the data
that banks capture and maintain themselves, that is entirely their own business, and they do
everything to avoid 'dirty' data. Sometimes a solution could be an administrative
order to stop doing things in the wrong way. Data that we get from outside of the bank,
or even from outside the financial industry, should have a warranty, i.e., the data supplier
should be trustworthy and secure enough. Usually these are state institutions, but they could
be commercial organisations (insurance companies, travel agencies, etc.) as well. The
contract about providing data should contain requirements and warranties regarding data
quality and security.
A separate Data Quality department within the organisational structure would help
significantly as well. Such a department could deal with 'dirty' data, ensuring a
certain level of data quality by developing data transformation and cleaning
algorithms as well as looking for external data sources that could serve as a
standard for data comparison.
2.3. Data loading into the Data Warehouse, Data Mart or data loading
Data loading into the Data Warehouse means physical data transportation from the
Data Staging area (database A) to the Data Warehouse (database B). To populate an
existing database with new data, usually some data management problems should be
solved. This refers to such technical matters as table spaces, table partitioning, indexing,
data load method (usually bulk load), audit action, memory, parallel processing,
network architecture, etc. Software, hardware and time resources can be saved
significantly by configuring items like these. To speed up the load cycle, the
frequency of load can be changed, especially regarding some aggregate data. If some
monthly totals should be accumulated, it does not mean that the data load can be performed
only once a month. It is possible to accumulate data on a daily basis by incremental
load. Then some additional changes should be made to the presentation layer: either
users see 'month-to-date totals' (instead of 'totals per month'), or the software restricts
them to seeing only full months (the previous month will then be the newest one).
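The daily accumulation and the two presentation choices can be sketched as follows. The class `MonthlyTotals` and its method names are hypothetical and only illustrate the logic; a real implementation would update an aggregate table in the warehouse.

```python
from collections import defaultdict
from datetime import date

class MonthlyTotals:
    """Monthly totals that are accumulated by daily incremental loads."""

    def __init__(self):
        self.totals = defaultdict(float)

    def load_day(self, day, amounts):
        # Each daily load adds its increment to the total of its month.
        self.totals[(day.year, day.month)] += sum(amounts)

    def visible(self, today, full_months_only):
        """Presentation rule: show month-to-date totals or hide the running month."""
        current = (today.year, today.month)
        return {
            month: total
            for month, total in sorted(self.totals.items())
            if not (full_months_only and month == current)
        }

totals = MonthlyTotals()
totals.load_day(date(2004, 1, 30), [100.0, 50.0])
totals.load_day(date(2004, 1, 31), [25.0])
totals.load_day(date(2004, 2, 1), [10.0])

print(totals.visible(date(2004, 2, 1), full_months_only=False))
print(totals.visible(date(2004, 2, 1), full_months_only=True))
```

With `full_months_only=True` the newest month a user sees is the previous one, exactly as described above.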
Because of data integrity, the sequence for loading data into the Data Warehouse is
prescribed accordingly. First of all, all dimensional data are loaded in the right sequence
to ensure the data integrity of the star schema data model. When all dimensional (look-up)
data tables are populated, fact data and aggregated data are loaded and/or calculated. If
necessary, some additional data validation can be performed.
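Why the load sequence matters can be shown with a small sketch in which the database itself enforces the foreign key between a fact table and a dimension table. An in-memory SQLite database is used only for illustration, and the table names `acct_dim` and `balance_fact` are assumptions.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # SQLite checks foreign keys only when asked to
con.execute("CREATE TABLE acct_dim (acct_key INTEGER PRIMARY KEY, acct_id TEXT)")
con.execute(
    "CREATE TABLE balance_fact ("
    "acct_key INTEGER NOT NULL REFERENCES acct_dim(acct_key), ledger_bal REAL)"
)

def load_fact(con, rows):
    con.executemany("INSERT INTO balance_fact VALUES (?, ?)", rows)

# Wrong sequence: the fact row arrives before its dimension row exists.
try:
    load_fact(con, [(1, 100.0)])
    fact_before_dimension_ok = True
except sqlite3.IntegrityError:
    fact_before_dimension_ok = False

# Right sequence: dimension first, then the fact.
con.execute("INSERT INTO acct_dim VALUES (1, 'LV01')")
load_fact(con, [(1, 100.0)])
loaded = con.execute("SELECT COUNT(*) FROM balance_fact").fetchone()[0]
print(fact_before_dimension_ok, loaded)
```

Many warehouses switch such constraints off during bulk load for speed; in that case the prescribed sequence is the only thing that keeps the star schema consistent.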
To allow duplication and possible repetition of processes, data transportation (load) supports
data archiving and back-up. That means we can repeat a data transformation and
load process from any point of the staging process. That could be needed if some errors
are found and removed, or if the data load should be repeated because of some technical
disaster. Such a duplicate table loading technique is shown in Figure 7. Back-ups of
load data are needed for tomorrow's incremental load as well (see 2.1.1. Dimensional
table staging).
[Figure 7 diagram: data arrive from the staging area and pass through the tables load_fact1,
dup_fact1, active_fact1 and old_fact1. Labels: (1) Load fact table; (2) Duplicate; (3) Index;
(4) WH down; (5) Rename active_fact1 to old_fact1; (6) Rename load_fact1 to active_fact1;
(7) WH comes up; (8) Rename dup_fact1 to load_fact1; (9) Drop the old_fact1 table]
Figure 7: Duplicate table loading technique.
Step 0: The active_fact1 table is active and queried by users.
        The load_fact1 table starts off being identical to active_fact1.
Step 1: Load the incremental (different) facts into the load_fact1 table.
Step 2: Duplicate load_fact1 into dup_fact1 for tomorrow's load.
Step 3: Index the load_fact1 table according to the final index list.
Step 4: "Bring down" the warehouse, forbidding user access.
Step 5: Rename the active_fact1 table to old_fact1 (back-up).
Step 6: Rename load_fact1 to active_fact1.
Step 7: Open the warehouse to user activity. Total downtime: seconds!
Step 8: Rename the dup_fact1 table to load_fact1 for performing the next load.
Step 9: Drop the old_fact1 table.
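The steps listed above can be replayed on a small scale with the following sketch. It uses an in-memory SQLite database only to make the sequence executable; the two-column fact table, the function `run_load_cycle` and the index naming are assumptions, and forbidding user access (steps 4 and 7) is only marked by a comment.

```python
import sqlite3

def run_load_cycle(con, increment, cycle):
    """One pass through steps 1-9; `cycle` only keeps the index name unique."""
    # Step 1: load the incremental facts into load_fact1.
    con.executemany("INSERT INTO load_fact1 VALUES (?, ?)", increment)
    # Step 2: duplicate load_fact1 into dup_fact1 for tomorrow's load.
    con.execute("CREATE TABLE dup_fact1 AS SELECT * FROM load_fact1")
    # Step 3: index load_fact1 (a single index stands in for the final index list).
    con.execute(f"CREATE INDEX idx_fact1_{cycle} ON load_fact1 (acct_key)")
    # Steps 4-7: with user access closed, swap the tables by renaming them.
    con.execute("ALTER TABLE active_fact1 RENAME TO old_fact1")
    con.execute("ALTER TABLE load_fact1 RENAME TO active_fact1")
    # Step 8: the duplicate becomes the load table of the next cycle.
    con.execute("ALTER TABLE dup_fact1 RENAME TO load_fact1")
    # Step 9: drop the back-up of the previously active table.
    con.execute("DROP TABLE old_fact1")
    con.commit()

# Step 0: active_fact1 is queried by users; load_fact1 starts off identical to it.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE active_fact1 (acct_key INTEGER, ledger_bal REAL)")
con.execute("INSERT INTO active_fact1 VALUES (1, 100.0)")
con.execute("CREATE TABLE load_fact1 AS SELECT * FROM active_fact1")

run_load_cycle(con, [(2, 50.0)], cycle=1)
run_load_cycle(con, [(3, 75.0)], cycle=2)

active_count = con.execute("SELECT COUNT(*) FROM active_fact1").fetchone()[0]
tables = {r[0] for r in con.execute("SELECT name FROM sqlite_master WHERE type = 'table'")}
print(active_count, sorted(tables))
```

Only the two renames happen while users are locked out, which is why the downtime stays short however long the load and the indexing take.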
3. Metadata
Some tools that automate the transformation and load layer can actually reside on a
third-party platform and generate the code necessary to perform those activities (see
Figure 8 as well). The most important thing that makes this possible is
metadata.
The definition of metadata is really simple: metadata is data about data. Does it
help anybody to understand what exactly this elusive information is? In my opinion, no.
A more descriptive definition could be as follows. Metadata is all of the information (or
descriptive information) in the particular environment that is not the actual data itself.
Some authors even talk about "back room metadata" and "front room
metadata". The back room metadata is process related and guides the data extraction,
cleaning and loading processes. The front room metadata is more descriptive and helps
query tools and report writers function smoothly. Of course, both metadata types overlap,
but it is useful to think about them separately. In this case we are interested in the first
one, back room metadata.
Metadata is like traditional systems documentation (in fact, it is documentation), and
at some point the resources needed to create and maintain it will be diverted to other
"more urgent" projects. Active metadata helps solve this problem. Active metadata is
metadata that drives a process rather than documents it, although documentation of these
processes is an important issue as well.
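What "metadata that drives a process" can mean in practice is sketched below: the transformation code contains no field names at all and simply executes whatever the mapping describes. The mapping entries, the transform names and the sample row are hypothetical.

```python
from datetime import datetime

# Hypothetical back room metadata: source field, target field, transformation name.
MAPPING = [
    {"source": "cust_name", "target": "Name", "transform": "strip"},
    {"source": "sex", "target": "Sex", "transform": "sex_code"},
    {"source": "birth_date", "target": "Birth_date", "transform": "date_ddmmyyyy"},
]

# Library of transformations that the metadata may refer to by name.
TRANSFORMS = {
    "strip": lambda v: v.strip(),
    "sex_code": lambda v: {"1": "M", "2": "F"}.get(v),
    "date_ddmmyyyy": lambda v: datetime.strptime(v, "%d.%m.%Y").date().isoformat(),
}

def transform(row, mapping=MAPPING):
    """Build the target row by executing the mapping; no field is hard-coded here."""
    return {m["target"]: TRANSFORMS[m["transform"]](row[m["source"]]) for m in mapping}

result = transform({"cust_name": " Janis ", "sex": "1", "birth_date": "05.11.1970"})
print(result)
```

Because the process cannot run without the mapping, the mapping cannot silently go out of date the way separate documentation can.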
The following is the general metadata needed to get the data into the Data Staging
area and prepare it for loading into one or more data marts.
Data extraction information
• Description of the source systems (schemas, formats etc.);
• Access methods, rights, privileges and passwords;
• Data transportation scheduling and its results;
• File usage in the Data Staging area including duration and ownership;
Data transformation information
• Data model;
• Job specifications for joining sources, stripping out fields and looking up
attributes;
• Data mapping between sources and Data Staging area;
• Yesterday's copy of production data to use as the basis for DIFF COMPARE;
• Data update (refreshing) policies for each incoming attribute;
• Current surrogate key assignments and the methodology for them;
• Data cleaning, formatting, filtering and sorting;
• Populating data for data mining (e.g., interpret nulls, scale numerics etc.);
Data load information
• Target schema design, source-to-target (staging area to data mart) data flows and
target data ownership;
• DBMS load scripts;
• Aggregate definitions;
• Data (especially aggregate) usage statistics and potential improvements;
Audit, Job Logs and Documentation
• Audit records (where exactly did this record come from and when?);
• Data transform run time logs, success summaries and time stamps;
• Information about software (version numbers) and hardware (configuration);
• Security settings;
Conclusion.
Metadata is an important issue in terms of descriptions. It describes (or at least it
should) any object and any process dealing with these objects. It is very hard to define or
to show what is metadata and what is not. The content of metadata usually describes the
structure or nature of other data (the real data). Examples of metadata are an ER diagram,
a description of table structures, documentation of the program packages, etc.
4. Relationships between Export, Transfer and Load.
As I have already mentioned, data extraction, transformation and load are very
closely related to each other on the one hand (usually in small projects), while on the
other hand these parts can be very strictly separated (large projects sometimes use
separate software products to manage data extraction, transformation, loading and
metadata in their entirety). Sometimes it is really impossible to say strictly which
piece of software or hardware belongs to one part of Data Staging and which to
another. For instance, let us say we have just one data source system, which operates in
Oracle, and data should be loaded into the Data Warehouse, which is developed in an
Oracle environment too. If the hardware and software configuration allows creating a
database link between these two Oracle databases, then Data Staging could be just one
SQL statement. Of course, it is a very simplified example, but it is true:
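INSERT INTO CUSTUM_DIMENSION
(
  -- Oracle does not allow a sequence's NEXTVAL in a query block that has its own
  -- ORDER BY, so the source rows are sorted in an inline view first.
  SELECT cust_seq.NEXTVAL cust_pk,                         -- surrogate key generation
         cust_id,
         cust_name,
         birth_date,
         sex
  FROM (
    SELECT cust_id,
           cust_name,
           TO_DATE(birth_date, 'DD.MM.RRRR') birth_date,   -- data format adjustment
           DECODE(sex, '1', 'M', 'F') sex                  -- unification of code keys
    FROM CUSTOMERS@SOURCE                                  -- read over the database link
    ORDER BY cust_id
  )
)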
And there we can see:
Data extraction:
    SELECT cust_id, cust_name, birth_date, sex
    FROM CUSTOMERS@SOURCE
Data transformation:
    cust_seq.NEXTVAL                    - surrogate key generation
    TO_DATE(birth_date, 'DD.MM.RRRR')   - data format adjustment
    DECODE(sex, '1', 'M', 'F')          - unification of code keys
    ORDER BY cust_id                    - particular record order
Data transportation:
    INSERT INTO CUSTUM_DIMENSION
It actually means that, from the technical point of view, Data Staging looks like
Figure 8.
[Figure 8 diagram, labels: Data sources; Extract and transformation programs; Data movement;
Cleansing, transformation, sorting programs; Data movement; Transformation, addition, update
programs; Data Warehouse]
Figure 8: Role of data transformation and transportation tools within a Data Staging process.
5. Summary
As a conclusion I would say that Data Staging is one of the main issues of Data
Warehousing that one must pay much attention to. Data Staging covers any action with data,
starting from extracting them from the source systems and finishing with data loading
into the presentation server (Data Warehouse and/or Data Mart). Data Staging itself
does not include data presentation, i.e., data cannot be accessed by the end users from
the Data Staging area. Data access by end users is performed by other tools. Actions
that Data Staging performs are (at least, should be) very short in time, whereas
developing these processes could take much time and many other resources.
The main tasks of Data Staging are 1) fast extraction of particular data from the source
systems, 2) fast data transformation, standardization and data quality assurance
according to the business user requirements, and 3) fast data loading into the target
according to the data model (usually dimensional) it requires.
Today's technologies usually provide a range of possibilities to ensure fast
physical data processing and transportation. I would say that this should solve all
problems regarding physical data extraction, moving the data through the hardware
infrastructure and loading them into the data target, whatever that may be. Besides that,
the logical side of data processing still should be taken into account. This part of the
data processing brings out the importance of the human sense or intelligence that should
be both the driver and supervisor of the manner in which data is managed. That means
Data Staging development requires additional special skills and experience in the data
management area. In the case of Data Warehousing these special skills would be
dimensional data modelling and the usage and management of metadata. It could be
reasonable to look toward vendors that provide tools for the functionality just
mentioned (dimensional data modelling and metadata management). In any case, it is
very good to have at least a theoretical background and understanding of how it
should work. On the one hand, there is no ready solution (software) that will satisfy
100% of your needs, and on the other hand you will never develop that functionality
by yourself within the budget, time and end user requirement restrictions. The
truth is always in the middle!
The main deliverables (in terms of metadata) of Data Staging would be:
• Data dictionary
• Data mapping
  o Data source to Data Staging area
  o Data Staging area to data target
• Description of the data transportation (extract, back-up and load)
• Data quality checklist
• Data transformation algorithms and audit trails
Conclusion
Regarding the financial industry, data extraction and consolidation is very
important in terms of completeness and accuracy. Only complete and accurate data
give the right picture: common trends and aggregates.
Aggregated information is usually passed to a company's management, external
supervisors and auditors. Each of them requires special and very similar information.
Their requirements are different but comparable! Data requirements are both regular and on demand.
This is a good way to show others that data is under control in one's company,
especially if response times to irregular data requests are short enough.
Besides the common trends, financial institutions focus on particular individuals or
companies as well. Therefore the data of a particular individual can be evaluated in
relation to common trends only partly. The other, larger part of an evaluation is
based on a client's individual data, the client profile. Actually, such an approach to data
analysis is part of Customer Relationship Management (CRM), which is very important
for financial institutions. This is exactly the reason why data quality should be the best
possible. Banks, insurers and other financial industry players market themselves
as professionals whom you can trust. And clients expect an
adequate attitude: "So, if you know everything about my finances, why do you offer me
a Gold Credit card? My monthly income is by far insufficient to buy such a product!"
Such a situation comes from unclean and inconsistent data, and it leads to a customer
who doubts whether the bank is dealing with his money at a level that is professional
enough.
6. Further research
Sequentially, the next part after Data Staging is the database: either a Data Warehouse,
a Data Mart or data structures predefined by a special Data Mining tool.
Developing the right Data Model and overall data architecture strategy could be the
direction of the next research. This includes understanding the difference between
dimensional and traditional OLTP design practices.
Once the Data Model is developed, data start coming in. Data Management is a very
important issue as well. By Data Management I mean such things as data storage
solutions, fast data access problems and other similar problems. As the data volumes
accumulated by a Data Warehouse generally are extremely large, a technical and
architectural solution should be ready to process them. This part of Data Warehousing
definitely is a subject for following research as well.
As Data Warehousing products (reporting, analysing, data mining and other tools) are
required by business, they are very close to the particular end users. Of course, every
company has its own specific requirements, but there are some common specificities
from industry to industry. Financial industry specificities and possible solutions to them
are a good area for research and for acquiring key knowledge too.
References
Kimball R., Reeves L., Ross M., Thornthwaite W. (1998). The Data Warehouse Lifecycle
Toolkit. John Wiley & Sons, Inc.
Groth R. (1998). Data Mining. Prentice Hall PTR.
Inmon W. H. (1999). Building the Operational Data Store - Second Edition. John Wiley & Sons,
Inc.
Oracle Corporation (1996 - 2000). Oracle Documentation Library. Release 8.1.7.
Kimball R., Merz R. (2000). The Data Webhouse Toolkit. John Wiley & Sons, Inc.
Marco D. (2000). Building and Managing the Meta Data Repository. John Wiley & Sons, Inc.
Todman C. (2001). Designing a Data Warehouse. Hewlett-Packard Company.
Greenberg P. (2001). CRM at the Speed of Light. McGraw-Hill Companies.
Imhoff C., Loftis L., Geiger J. G. (2001). Building the Customer-Centric Enterprise. John Wiley &
Sons, Inc.
Kimball R., Ross M. (2002). The Data Warehouse Toolkit - Second Edition. John Wiley & Sons,
Inc.
LATVIJAS UNIVERSITATES RAKSTI. 2004. 669. sej.: DATORZINATNE UN INFORMACIJAS TEHNOLOGIJAS. 53.-60. lpp.
Generic Data Representation by Table in Metamodel
Based Modelling Tool
Edgars Celms
Institute of Mathematics and Computer Science, University of Latvia
Raina bulv. 29, LV-1459, Riga, Latvia
Edgars.Celms@mii.lu.lv
The foundation of a metamodel based modelling tool is its flexible facility to declaratively define
the modelling method, notation and tool support. One of the problems for such tools is generic
data representation by table. The paper proposes a method for declarative definition of a table
based on the logical metamodel. The most interesting aspect of this definition is the specification
of element selection criterion. Some practical examples of table specification by the described
method are also explained.
Key words: modelling tool, generic table editor, metamodel, editor definition.
1. Introduction
Today there are a lot of modelling tools on the market. Modelling tools are designed to provide all that is necessary to support the major areas of modelling, including business process modelling, object-oriented and component modelling with UML [1], relational data modelling, structured analysis and design, etc. Why is it not sufficient to use "hard-coded" modelling tools? Let us consider, for example, the situation in business modelling. On the one hand, there exist several well-known business modelling languages (IDEF3 [2], ARIS [3], etc.), each with a set of tools supporting it. But there are also Activity diagrams in UML, whose main role now is to serve business modelling. There is GRADE BM [4], a specialized language for business modelling and simulation. Thus, for the area of business modelling there is no single best or most used language or tool, and each of them emphasizes its own aspects. For example, GRADE BM presents very convenient facilities for specifying the performers of a task and its triggering conditions. However, no new language feature comes for free: the language becomes more complicated to use. Therefore one universal business modelling language that supported all wishes would become extremely difficult to use in simple cases. For this situation UML offers one ingenious solution: stereotypes for adjusting the modelling language to a specific area. In many cases the idea works perfectly, and it is well supported in several tools, including GRADE. The latest version of UML, 1.4, extends the notion of stereotype by assigning tagged values to it and grouping stereotypes into profiles (thus actually extending the metamodel). But currently no tool fully supports it, and a new version, UML 2.0, is already coming with significant changes, particularly in the area of Activity diagrams. The problem of a flexible modelling environment is even more urgent for domain-specific modelling, where countless special notations are used for separate domains.
There are probably several ways to solve such a problem.
You can develop a modelling tool specially for each specific modelling method, but this way can be very time and cost consuming.
You can make a tool as universal as possible to support all needs. For example, there is the ARIS tool by IDS prof. Scheer, whose "home notation" is the ARIS BM language. But it also supports the UML notation with Activity diagrams, as well as numerous modifications of the main process notation via eEPC (office processes, industrial processes, etc.). In general, the ARIS tool can be characterised as extremely "wide", with about 110 different types of diagrams, frequently having about 100 different symbol types per diagram (most, in fact, are predefined stereotypes of the basic ones). At the same time, there are practically no facilities for defining new stereotypes. In practice such a universal tool is difficult to use in simple cases.
An alternative way is a completely metamodel based generic modelling tool (previously called metaCASE). Such a tool has no built-in modelling methodology. It has to be filled with a specific metamodel and additional information before one can start modelling anything.
2. Metamodel Based Modelling Tool
In this paper some aspects of the metamodel based approach to building flexible modelling environments are explained. The metamodel concept has become popular in recent years, especially due to the principle used in the UML definition [1]. What is a metamodel in a metamodel based modelling tool? In short, a metamodel is a class diagram containing all modelling concepts, their attributes and their relationships. A metamodel based tool has no built-in modelling methodology. It has to be filled with a specific metamodel and additional information before one can start modelling anything.
An earlier alternative name for the approach is metaCASE. Let me mention the key research in this area. Perhaps the first similar approach was made by Metaedit [5], but for a long time its editor definition facilities were fairly limited. The latest version, named MetaEdit+, can now support the definition of most used diagram types, but via very restricted (non-graphical) metamodelling features, with the resulting diagrams corresponding to the simplest concept of a labelled directed graph. The most flexible definition facilities (and some time ago also the most popular in practice, though the tool is now off the market) seem to be offered by Toolbuilder by Lincoln Software. Being a typical metaCASE of the early nineties, the approach is based on an ER model for describing the repository contents and on special extensions to the ER notation for defining derived data objects, which are in a one-to-one relation to objects in a graphical diagram. A more academic approach is that proposed by KOGGE [6], with a very flexible, but very complicated and mainly procedural editor definition language. Other, newer similar approaches are proposed by the ISIS GME [7], DOME [8] and Moses [9] projects, with the main emphasis on creating environments for domain-specific modelling in the engineering world. The richest editor definition possibilities among them are in GME. Several commercial modelling tools (ARIS by IDS prof. Scheer, System Architect by Popkin Software [10], etc.) use a similar approach internally for easy customisation of their products, but their tool definition languages have never been made explicit.
The approach explained in this paper is in a sense a further development of the above mentioned approaches. In this approach, first the domain metamodel of the modelling area (the logical metamodel) is built. Then the modelling method, notation and tool support are defined declaratively, by means of a special metamodelling
environment. These activities also require an extension of the metamodel (adding some presentation classes), but this is not relevant to the purposes of this paper.
A very significant part of the tool support in metamodel based modelling tools is flexible data representation facilities. There are several main principles by which data can be represented in modelling tools:
• Diagrammatic,
• Hierarchical (e.g. "tree views"),
• Data dictionaries,
• Tables,
• Object editors.
Obviously the most popular data representation form (modelling notation) in any domain now is diagram based. Therefore a very important part of the metamodel based approach is the Editor Definition Language (EdDL) for a simple and convenient definition of a wide spectrum of diagrammatic graphical editors [11, 12]. Nevertheless, not all information is convenient to represent by diagrams. Metamodel based modelling tools also need facilities such as flexible model content browsing and flexible definition of an editor for a single metamodel class instance. Perhaps one of the most undervalued ways of representing data in modelling tools today is data representation by table. For example, in the Rational Rose [13] UML tool, in real-size models it is complicated to browse through packages and classes in the model tree. Frequently a package contains more than ten class diagrams with ten to thirty classes in each. That leads to a situation where there are hundreds of classes at one tree level and it is not easy to find the right one. It would be reasonable to represent such uniform information not in hierarchical (tree) form but in some easily configurable tabular form. Some tools (System Architect) have something like tables, but with very restricted functionality (reports only). Tables that are more or less usable in practice are implemented in the GRADE tool, but they are "hard-coded".
Although all of the data representation principles mentioned here are interesting problems in relation to metamodel based modelling tools, it is out of the scope of this paper to explain all of them. The paper introduces a method for declarative definition of a table, based on the logical metamodel.
The described approach has partly been developed within the EU ESPRIT project ADDE [14]; see [15] for a preliminary report.
3. Generic Data Representation by Table
The foundation of a metamodel based modelling tool is its flexible facility to define declaratively the modelling method, notation and tool support. One of the problems for such tools is generic data representation by table.
The paper proposes a method for table editor definition based on the logical metamodel. Certainly, the logical metamodel for the selected modelling method should be defined before we start defining any of the editors used for the tool. This logical structure is described by a UML class diagram [1]. Fig. 1 shows an example of a logical metamodel for the UML class diagram (here the UML class metamodel is described by a
UML class diagram). This example will be used in the paper to demonstrate the table editor definition features.
[Figure 1 image: UML class diagram of the logical metamodel. Labels legible in the transcription include the class AssociationEnd, the attributes visibility, initial value, multiplicity, role name, ordering, aggregation and is navigable, and the associations stereotype of class and stereotype of attribute.]
Figure 1. Logical metamodel example (fragment)
A table editor definition for a new table starts with the selection of the domain objects to be represented by the table. It should be emphasized that the logical metamodel itself is never modified during this process. Our goal is to define the corresponding table editor, which in this case should be able to present the selected set of metamodel class instances in a tabular form. For example, Fig. 2 shows a table which represents all instances of Class in some model. Classes are grouped by the association owned_by ("Is in Package" column in the table) and ordered by the Name attribute.
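The grouping and ordering described for the Class table can be sketched in a few lines of Python. This is only an illustration under assumed data structures: the dictionaries standing in for Class and Package instances and the function name class_table are hypothetical and are not part of the tool described in the paper.

```python
from itertools import groupby

# Hypothetical in-memory model: each Class instance is a dict whose
# "owned_by" entry refers to the Package object it belongs to.
insurance = {"name": "Insurance"}
claims = {"name": "Claims"}
classes = [
    {"name": "Policy", "description": "", "owned_by": insurance},
    {"name": "Witness", "description": "", "owned_by": claims},
    {"name": "Claim", "description": "", "owned_by": insurance},
    {"name": "Property claim", "description": "", "owned_by": claims},
]

def class_table(instances):
    """Return (package name, [class names]) groups, sorted by package, then name."""
    rows = sorted(instances, key=lambda c: (c["owned_by"]["name"], c["name"]))
    return [
        (package, [c["name"] for c in group])
        for package, group in groupby(rows, key=lambda c: c["owned_by"]["name"])
    ]

table = class_table(classes)
```

Sorting by the pair (package name, class name) before grouping is what makes the groups contiguous, which mirrors the "Sorted by: Is in Package, Name" note in Fig. 2.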
[Figure 2 image: screenshot of the "Table View - Class" window listing 19 objects in the columns Name, Description and Is in Package, with rows from packages such as Claims, Core entities and Insurance. Footer notes: an asterisk indicates a required field; sorted by Is in Package, Name; grouped by Is in Package.]
Figure 2. Class table example
The definition of a table consists essentially of two sections: how and what data should be represented. "How" means defining such things as:
• Table column properties - the user should be allowed to define which columns are visible in the table (based on the metamodel elements). For example, the Class table in Fig. 2 is configured to show three columns, which correspond to two attributes (name and description) and one association (owned_by) of the class Class in the metamodel. Other attributes and associations of Class are not visible in this table. The user should also be able to define for each column such properties as the column title, the column edit mode (whether in-place editing is allowed or some "native" object editor should be invoked), column ordering and grouping (by which column(s) the table is ordered or grouped), etc.
• Prompting/display form for associations - a very important feature for making the table more usable for end users. For example, the association column owned_by for the table in Fig. 2 is defined to be shown as the value of the attribute name of Package. In other words, it is the facility to define the text assembly rules for columns which correspond to associations in the metamodel. The user can define the text to display as a concatenation of segments, each consisting of a constant prefix, a variable part and a constant suffix. Each variable part is defined as the object attribute value located at the end of a chain of associations (for example, Class.owned_by.name). The association chain may be empty; otherwise it must start at the metamodel class corresponding to the objects in the table. The attribute must belong to the metamodel class of the object at the end of the association chain.
• Navigation facilities for objects, attributes and associations - what to do in response to user activities, for example on a single or double mouse click on some object attribute or association.
• Popup menu configuration - when defining popup menus for table objects, the user should be allowed to define menu items at least for the following actions: create a new object (for example, "New Class"), delete the object represented in the table, open an object viewer/editor ("Properties..."), copy and paste, execute any other program, etc.
The facilities for table definition described above allow the user to define nearly any reasonable table form for practical use.
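The text assembly rule for association columns (segments made of a constant prefix, a variable part reached through an association chain, and a constant suffix) can be illustrated with a minimal Python sketch. The Segment class, the helper names and the dictionary-based objects are assumptions made for this example only; for simplicity every association here is single-valued, which the paper does not require.

```python
from dataclasses import dataclass

@dataclass
class Segment:
    prefix: str
    chain: tuple      # association names to follow, may be empty
    attribute: str    # attribute read at the end of the chain
    suffix: str = ""

def follow(obj, chain):
    """Walk a chain of single-valued associations; None if a link is missing."""
    for association in chain:
        if obj is None:
            return None
        obj = obj.get(association)
    return obj

def display_text(obj, segments):
    """Concatenate prefix + attribute value + suffix for every segment."""
    parts = []
    for seg in segments:
        target = follow(obj, seg.chain)
        value = "" if target is None else str(target.get(seg.attribute, ""))
        parts.append(seg.prefix + value + seg.suffix)
    return "".join(parts)

package = {"name": "Insurance"}
policy = {"name": "Policy", "owned_by": package}

# Column "Is in Package": the chain Class.owned_by.name
in_package = display_text(policy, [Segment("", ("owned_by",), "name")])
# Two segments, the first one with an empty association chain
label = display_text(policy, [
    Segment("", (), "name"),
    Segment(" (in ", ("owned_by",), "name", ")"),
])
```

With this data, in_package is "Insurance" and label is "Policy (in Insurance)"; an empty chain simply reads the attribute from the table object itself.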
The second part of the table definition specifies what data should be represented by the table. The specification of the element selection criterion (selection rule) is the most non-trivial and the most interesting aspect of the table definition. A selection rule says which of the possible instances must actually appear in the table. In short, a selection rule is a Boolean expression (AND, OR, NOT) on link conditions (specifying that a certain sequence of metamodel associations must link the given object to another object) and attribute value conditions (defined by an association chain, an attribute at its end and a fixed attribute value). The actual expression language contains some more details (including simple existential quantifiers).
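A selection rule of this shape can be evaluated with a small recursive interpreter. The sketch below is an assumption-laden illustration, not the expression language of the tool: the tuple encoding of rules, the function names and the existential reading of conditions over multi-valued associations are all choices made for this example.

```python
def targets(obj, chain):
    """All objects reachable from obj by following the association chain."""
    current = [obj]
    for association in chain:
        current = [t for o in current for t in o.get(association, [])]
    return current

# A rule is a nested tuple (hypothetical encoding):
#   ("link", chain)                    - some object is reachable via chain
#   ("attr", chain, attribute, value)  - some reachable object has attribute == value
#   ("and", r1, r2, ...), ("or", r1, r2, ...), ("not", r)
def holds(rule, obj):
    kind = rule[0]
    if kind == "link":
        return bool(targets(obj, rule[1]))
    if kind == "attr":
        _, chain, attribute, value = rule
        return any(t.get(attribute) == value for t in targets(obj, chain))
    if kind == "and":
        return all(holds(r, obj) for r in rule[1:])
    if kind == "or":
        return any(holds(r, obj) for r in rule[1:])
    if kind == "not":
        return not holds(rule[1], obj)
    raise ValueError(f"unknown rule kind: {kind}")

def select(instances, rule):
    """Keep the instances for which the selection rule holds."""
    return [o for o in instances if holds(rule, o)]

claims = {"name": "Claims"}
core = {"name": "Core entities"}
witness = {"name": "Witness", "owned_by": [claims], "has_operation": []}
claim = {"name": "Claim", "owned_by": [claims], "has_operation": [{"name": "settle"}]}
policy = {"name": "Policy", "owned_by": [core], "has_operation": []}

# Classes in package "Claims" that have no operations.
rule = ("and",
        ("attr", ("owned_by",), "name", "Claims"),
        ("not", ("link", ("has_operation",))))
selected = [c["name"] for c in select([witness, claim, policy], rule)]
```

Here associations are stored as lists of target objects, so a link condition is simply a non-emptiness test on the objects reachable through the chain.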
Consider one more example of a table. Let us assume that we need to define a table which contains the Stereotypes of Classes that are contained in one concrete Package, where these Stereotypes do not have a Tag_definition (see Fig. 1). There are languages in which selections or assertions of this kind can be defined. Examples are the Object Query Language (OQL [16]), which is used for querying an object database, and the Object Constraint Language (OCL [1]), which is used in UML to specify the well-formedness rules of the metaclasses comprising the UML metamodel. In OQL and OCL the above mentioned selection criterion can be written as follows (where the concrete Package is given by the parameter curr_package):
In OQL: select st from Stereotype st
where curr_package in st.stereotype_of_class.owned_by
andthen is_undefined(st.contains)
In OCL: self.stereotype_of_class.owned_by->exists(pck:Package |
pck = curr_package) and self.contains->isEmpty()
However, in practice both these languages are difficult to use for declarative definition of tables. OQL is a very powerful and extensive language. OCL is very concise but rather complicated to use, and it is intended not for querying but for specifying static constraints in a metamodel. In order to use OQL or OCL you need to know the precise semantics of the languages.
The paper proposes a method of specifying a selection rule for a table which is a "reasonable subset" of OQL or OCL with a graphical realization (see Fig. 3). "Reasonable subset" here means a subset which is sufficient for practical use and also allows an efficient implementation.
[Figure 3 image: "Select Condition" dialog showing a tree rooted at Stereotype with the rows contains -> Tag_definition and stereotype_of_class -> Class; below Class the legible rows are has_attribute -> Attribute, has_operation -> Operation, has_tagged_value -> TaggedValue, source -> BinaryAssociation and target -> BinaryAssociation (one row is illegible in the transcription). Buttons: OK, Cancel, Help.]
Figure 3. Selection criterion definition example
The selection rule is defined by a tree. The tree is a view (fragment) of the metamodel in which the root node corresponds to the metamodel class for which we want to define the table (in our example, the Stereotype class). Every next level of the tree contains nodes (classes) which are linked in the metamodel by associations to the parent node (class) in the tree. Fig. 3 gives an example of such a tree in which the above mentioned selection criterion is defined. The tree row marked by the red "minus" sign means that there must be no link to a Tag_definition instance.
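One possible reading of such a tree as an executable criterion is sketched below in Python. Everything here is an assumption made for illustration: the Row class, the forbidden flag standing for the "minus" mark, the must_be field for fixing a concrete object such as curr_package, and the dictionary-based model are not taken from the tool itself.

```python
from dataclasses import dataclass, field

@dataclass
class Row:
    association: str          # association leading from the parent class
    forbidden: bool = False   # True models the red "minus" mark
    must_be: object = None    # optional fixed target object (e.g. curr_package)
    children: list = field(default_factory=list)

def matches(obj, rows):
    """True if obj satisfies every row of the (sub)tree below it."""
    for row in rows:
        found = any(
            (row.must_be is None or target is row.must_be)
            and matches(target, row.children)
            for target in obj.get(row.association, [])
        )
        # A required row needs a matching link; a forbidden row needs none.
        if found == row.forbidden:
            return False
    return True

curr_package = {"name": "Insurance"}
other_package = {"name": "Claims"}
policy = {"name": "Policy", "owned_by": [curr_package]}
witness = {"name": "Witness", "owned_by": [other_package]}

entity = {"name": "entity", "stereotype_of_class": [policy], "contains": []}
tagged = {"name": "tagged", "stereotype_of_class": [policy],
          "contains": [{"name": "author"}]}
external = {"name": "external", "stereotype_of_class": [witness], "contains": []}

# Stereotype tree: no Tag_definition, and some Class owned by curr_package.
criterion = [
    Row("contains", forbidden=True),
    Row("stereotype_of_class", children=[Row("owned_by", must_be=curr_package)]),
]
selected = [s["name"] for s in (entity, tagged, external) if matches(s, criterion)]
```

In this reading each tree row is an existential condition over the linked objects, so the user never has to state the cardinality or collection kind of an association.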
In other words, with this approach a textual selection expression based on metamodel elements can be composed in an easy, usable manner. We can also specify some additional constraints on attribute values for classes.
It should be emphasized that, unlike in OQL or OCL, in the proposed method it is not important to "know" such technical issues as the cardinalities of associations or the kinds of collections (bag, set, list in OQL or bag, set, sequence in OCL). Therefore the proposed method is easy to use in practice. Practical experiments have demonstrated that the described metamodel based table definition method has a realistic implementation and can reach an industrial quality of the defined editors.
4. Conclusions
The approach to table definition described in this paper permits us to define nearly any reasonable table for practical use, and the described method also allows an efficient implementation (in the sense of the run time necessary to build the tables from realistic amounts of data). Practical experiments have demonstrated that the table editors obtained by the described definition method can reach an industrial quality. Furthermore, this approach can be made significantly more universal and applicable not only to tables but also to other similar data representations, since it is generally accepted that a metamodel (class diagram) can be used for describing the logical structure of nearly any system.
References
[1] Rumbaugh, J., Booch, G., Jacobson, I.: The Unified Modeling Language Reference Manual. Addison-Wesley (1999)
[2] Mayer, R., Menzel, C., Painter, M.: IDEF3 Process Description Capture Method Report. http://www.idef.com/Downloads/pdf/Idef3_fn.pdf, Knowledge Based Systems Inc. (1995)
[3] Scheer, A.-W.: ARIS Business Process Modeling. 3rd edn. Springer-Verlag, Berlin Heidelberg New York (2000)
[4] Barzdins, J., Tenteris, J., Vilums, E.: Business Modeling Language GRAPES-BM (Version 4.0) and its Application. Dati, Riga (1998)
[5] Smolander, K., Martiin, P., Lyytinen, K., Tahvanainen, V-P.: Metaedit - a flexible graphical environment for methodology modelling. Springer-Verlag (1991)
[6] Ebert, J., Suttenbach, R., Uhe, I.: Meta-CASE in Practice: a Case for KOGGE. Proceedings of the 9th International Conference, CAiSE'97, Barcelona, Catalonia, Spain (1997)
[7] Ledeczi A., Maroti M., Bakay A., Karsai G., Garrett J., Thomason IV C., Nordstrom G., Sprinkle J., Volgyesi P.: The Generic Modeling Environment. Workshop on Intelligent Signal Processing, Budapest, Hungary, May 17, 2001 (available in PDF at http://www.isis.vanderbilt.edu)
[8] DOME Users Guide, http://www.htc.honeywell.com/dome/support.htm
[9] Esser, R.: The Moses Project. http://www.tik.ee.ethz.ch/~moses/MiniConference2000/pdf/Overview.PDF
[10] Popkin Software: System Architect, http://www.popkin.com/products/system_architect.htm
[11] Audris Kalnins, Janis Barzdins, Edgars Celms, Lelde Lace, Martins Opmanis, Karlis Podnieks, Andris Zarins: The First Step Towards Generic Modelling Tool. Proceedings of Baltic DB&IS 2002, Tallinn, 2002, v. 2, pp. 167-180
[12] Audris Kalnins, Karlis Podnieks, Andris Zarins, Edgars Celms, Janis Barzdins: Editor Definition Language and its Implementation. Proceedings of the 4th International Conference "Perspectives of System Informatics", Novosibirsk, 2001. LNCS, v. 2244, pp. 530-537
[13] Quatrani, T.: Visual Modeling with Rational Rose 2002 and UML. 3rd edn. Addison Wesley Professional, Boston (2002)
[14] ESPRIT project ADDE. http://www.fast.de/ADDE
[15] Sarkans, U., Barzdins, J., Kalnins, A., Podnieks, K.: Towards a Metamodel-Based Universal Graphical Editor. Proceedings of the Third International Baltic Workshop DB&IS, Vol. 1. University of Latvia, Riga (1998) 187-197
[16] Cattell, R.G.G., Barry, D.: The Object Database Standard: ODMG 3.0. Morgan Kaufmann (2000)
LATVIJAS UNIVERSITATES RAKSTI. 2004. 669. sej.: DATORZINATNE UN INFORMACIJAS TEHNOLOGIJAS. 61.-79. lpp.
Nondifferentiable Optimization Based Algorithm for Graph
Ratio-Cut Partitioning
Karlis Freivalds
University of Latvia, Institute of Mathematics and Computer Science
Raina blvd. 29, LV-1459, Riga, Latvia
Karlis.Freivalds@mii.lu.lv, Fax: +371 7820153
We propose a new method for finding the minimum ratio-cut of a graph. Ratio-cut is an NP-hard
problem for which the best previously known algorithm gives an O(log n)-factor approximation
by solving its dually related maximum concurrent flow problem. We formulate the minimum
ratio-cut as a certain nondifferentiable optimization problem, and show that the global minimum
of the optimization problem is equal to the minimum ratio-cut. Moreover, we provide strong
symbolic computation based evidence that any strict local minimum gives an approximation by a
factor of 2. We also give an efficient heuristic algorithm for finding a local minimum of the
proposed optimization problem based on standard nondifferentiable optimization methods and
evaluate its performance on several families of graphs. We achieve O(n^...) experimentally
obtained running time on these graphs.
Key words: graphs, heuristic graph algorithms, minimum ratio-cut.
Introduction
Balanced cuts of a graph are difficult computational problems that are important both in theory and practice. Ratio-cut is the most fundamental one, since most of the others, including minimum quotient cut, minimum bisection and multi-way cuts, can be easily approximated using it [11, 17]. Also several other important approximation algorithms, such as those for crossing number and minimum cut linear arrangement, are based on the ratio-cut [11]. Ratio-cut has many practical applications, the most important being VLSI design, clustering and partitioning [20, 12, 1].
Since ratio-cut is an NP-hard problem [13], we must seek approximation algorithms to solve it in practically reasonable time. Many purely heuristic algorithms have been developed [20, 22, 18, 6], most of them relying on simulated annealing, spectral methods or iterative movement of nodes from one side of the partition to the other. A common idea exploited by several authors [22, 2, 7, 18, 19] to improve their quality is using a multi-scale graph representation, usually obtained by edge contraction. First a partition at the coarsest scale is obtained, and then it is refined to a more detailed one by one of the mentioned algorithms. Although these algorithms may perform well in practice, no optimality bounds are known for them.
The best previously known algorithm with proven optimality bounds finds an O(log n)-factor approximation, where n is the number of nodes in the graph. It is based on reduction of the ratio-cut problem to the multi-commodity flow problem, which can be solved with polynomial time linear programming methods. Unfortunately, this method is not practical, since the size of the resulting linear program is quadratic in the number of nodes in the graph and it cannot be solved efficiently. Then, approximation algorithms
[15, 16, 10, 21] were discovered for the multi-commodity flow problem itself, making this approach usable in practice. Several heuristic implementations [15, 9, 21] are based on this idea, some of them quite effective and practical. The most elaborate one [9] can deal with up to 100000-node graphs in reasonable time.
In this paper we propose a new way of finding the minimum ratio-cut of a graph. We construct a nondifferentiable optimization problem whose minimum solution equals the minimum ratio-cut value and use nonlinear programming methods to search for it. Since the problem is non-convex, we may find only a local minimum. However, we show that any strict local minimum gives us a factor of 2 approximate cut. For that purpose we introduce a notion of locally minimal ratio-cut, for which no subset of nodes taken from one side of the cut and moved to the other side decreases the cut value. We establish a one-to-one correspondence between strictly locally minimal cuts and strict local minima of the proposed optimization problem.
The reduction of an NP-hard discrete problem to a continuous one is not a novel idea. For example, the maximum clique problem of a graph can also be stated as an optimization problem [24], and numerical optimization methods for finding the optimum may be used. However, for the maximum clique problem no optimality bounds of a local minimum are known.
To show that our method is practical, we present an efficient heuristic algorithm for finding a local minimum of the proposed problem, which is based on the standard methods of nondifferentiable optimization, and analyze its performance on several families of graphs. With the proposed method we can find a good partition of 200000-node graphs in less than one hour.
Problem Formulation
We are dealing with an undirected graph $G = (V, E)$, where $V$ is its node set and $E$ is its edge set. The nodes of the graph are identified by natural numbers from 1 to $n$. Each node $i$ has a weight $d_i \ge 0$, and each edge $(i, j)$ has a capacity $c_{ij} \ge 0$, satisfying the properties $c_{ij} = c_{ji}$, $c_{ii} = 0$. We define $c_{ij} = 0$ when there is no edge $(i, j)$ in $G$. We assume that there are at least two nodes with non-zero weights. $(A, A')$ denotes a cut that separates a set of nodes $A$ from its complement $A' = V \setminus A$. The capacity of the cut $C(A, A')$ is the sum of edge capacities between $A$ and $A'$. The ratio of the cut $R(A, A')$ is defined as follows:
$$R(A, A') = \frac{C(A, A')}{\sum_{i \in A} d_i \cdot \sum_{j \in A'} d_j} \qquad (1)$$
We will focus on finding the minimum ratio-cut, i.e. the cut $(A, A')$ with the minimum ratio over all non-empty $A$, $A'$.
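As a small illustration of definition (1), the following Python sketch computes the capacity and the ratio of a cut. The dictionary-based graph representation and the function names are assumptions of this example; each undirected edge is stored once, and both sides of the cut are assumed to have positive total weight.

```python
def cut_capacity(capacity, side):
    """Sum of capacities of edges with exactly one endpoint in `side`.

    `capacity` maps an undirected edge, stored once as (i, j), to c_ij.
    """
    return sum(c for (i, j), c in capacity.items() if (i in side) != (j in side))

def cut_ratio(weights, capacity, side):
    """R(A, A') from equation (1); `weights` maps node -> d_i, `side` is A."""
    inside = sum(d for i, d in weights.items() if i in side)
    outside = sum(d for i, d in weights.items() if i not in side)
    return cut_capacity(capacity, side) / (inside * outside)

# Path 1 - 2 - 3 - 4 with unit weights; the middle edge is the weakest.
weights = {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0}
capacity = {(1, 2): 3.0, (2, 3): 1.0, (3, 4): 3.0}
balanced = cut_ratio(weights, capacity, {1, 2})   # 1 / (2 * 2)
lopsided = cut_ratio(weights, capacity, {1})      # 3 / (1 * 3)
```

The balanced cut through the weak middle edge has ratio 0.25, while cutting off a single end node gives ratio 1.0, which shows how the denominator rewards balanced partitions.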
Definition 1. A cut $(A, A')$ is locally minimal if for all non-empty $U \subset A$, $U \ne A$: $R(A, A') \le R(A \setminus U, A' \cup U)$ and for all non-empty $U' \subset A'$, $U' \ne A'$: $R(A, A') \le R(A \cup U', A' \setminus U')$. Similarly, we call a ratio-cut strictly locally minimal if strict inequalities hold in this definition.
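Definition 1 can be checked directly on small graphs by enumerating all proper subsets of each side. The sketch below does exactly that; it is an exponential-time illustration with hypothetical function names, not the algorithm proposed in the paper, and it assumes strictly positive node weights so that every ratio is defined.

```python
from itertools import combinations

def ratio(weights, capacity, side):
    """R(A, A') from equation (1) for A = side."""
    cut = sum(c for (i, j), c in capacity.items() if (i in side) != (j in side))
    inside = sum(d for i, d in weights.items() if i in side)
    outside = sum(d for i, d in weights.items() if i not in side)
    return cut / (inside * outside)

def proper_subsets(nodes):
    """Non-empty subsets of `nodes` other than `nodes` itself."""
    nodes = sorted(nodes)
    for size in range(1, len(nodes)):
        for subset in combinations(nodes, size):
            yield set(subset)

def is_locally_minimal(weights, capacity, side, strict=False):
    """Check Definition 1 by exhaustive enumeration (exponential time)."""
    a = set(side)
    a_prime = set(weights) - a
    base = ratio(weights, capacity, a)
    # Move U out of A, or move U' from A' into A.
    moved = [a - u for u in proper_subsets(a)]
    moved += [a | u for u in proper_subsets(a_prime)]
    for new_side in moved:
        other = ratio(weights, capacity, new_side)
        if other < base or (strict and other == base):
            return False
    return True

weights = {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0}
capacity = {(1, 2): 3.0, (2, 3): 1.0, (3, 4): 3.0}
middle_is_strict = is_locally_minimal(weights, capacity, {1, 2}, strict=True)
end_is_minimal = is_locally_minimal(weights, capacity, {1})
```

On this path graph the middle cut is strictly locally minimal, whereas the cut isolating node 1 is not, because moving node 2 across lowers the ratio from 1.0 to 0.25.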
Ratio-Cut as an Optimization Problem
We can assign a variable $x_i \in \mathbb{R}$ to each node $i$ and consider the following optimization problem over $x$ from $\mathbb{R}^n$:
$$\min F(x) = \sum_{i,j \in V} c_{ij} |x_i - x_j| \qquad (2)$$
subject to
$$H(x) = \sum_{i,j \in V} d_i d_j |x_i - x_j| = 1, \qquad (3)$$
$$\sum_{i \in V} x_i = 0. \qquad (4)$$
We are going to show that this optimization problem is equivalent to the ratio-cut problem in the sense described below.
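Definition 2. A characteristic vector $x^A$ for a cut $(A, A')$ is defined such that its components
$$x^A_i = \begin{cases} a, & \text{if } i \in A \\ b, & \text{if } i \in A' \end{cases} \tag{5}$$
where, with $D = \sum_{i \in A} d_i \cdot \sum_{i \in A'} d_i$, the constants are $a = -\frac{|A'|}{nD}$ and $b = \frac{|A|}{nD}$, the values forced by the constraints (3) and (4) when $a < b$.
It is straightforward from this definition that for a cut $(A, A')$ $x^A$ satisfies the constraints (3, 4), and $F(x^A) = R(A, A')$.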
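The claim that a characteristic vector is feasible and that its objective value equals the cut ratio can be checked numerically. The sketch below is an illustration under stated assumptions: the helper names are hypothetical, sums in $F$ and $H$ run over unordered pairs, constraint (4) is taken as the plain sum of coordinates, and the value on $A$ is chosen to be the smaller one.

```python
def objective(capacities, x):
    """F(x): capacity-weighted sum of coordinate differences over the edges."""
    return sum(c * abs(x[i] - x[j]) for (i, j), c in capacities.items())


def constraint(weights, x):
    """H(x): sum of d_i * d_j * |x_i - x_j| over unordered node pairs."""
    nodes = sorted(weights)
    return sum(
        weights[i] * weights[j] * abs(x[i] - x[j])
        for k, i in enumerate(nodes)
        for j in nodes[k + 1:]
    )


def characteristic_vector(weights, side_a):
    """Two-valued vector for the cut (A, A') satisfying H(x) = 1 and sum(x) = 0."""
    a = set(side_a)
    n = len(weights)
    weight_a = sum(d for node, d in weights.items() if node in a)
    weight_rest = sum(d for node, d in weights.items() if node not in a)
    gap = 1.0 / (weight_a * weight_rest)  # b - a, forced by H(x) = 1
    low = -(n - len(a)) * gap / n         # value on A, forced by sum(x) = 0
    high = len(a) * gap / n               # value on A'
    return {node: (low if node in a else high) for node in weights}


if __name__ == "__main__":
    weights = {1: 1, 2: 1, 3: 1, 4: 1}
    capacities = {(1, 2): 1, (2, 3): 1, (3, 4): 1}
    x = characteristic_vector(weights, {1, 2})
    print(x, objective(capacities, x), constraint(weights, x))
```

For the path 1 - 2 - 3 - 4 and $A = \{1, 2\}$ the objective evaluates to the ratio of that cut, while $H(x) = 1$ and the coordinates sum to zero.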
Definition 3. For some feasible $x$ and some $p \in \mathbb{R}$ we call the cut $(P, P')$ positional, if $P = \{i \mid x_i \le p\}$, $P' = V \setminus P$, and both $P$ and $P'$ are non-empty. The ratio of this cut is $R_p(x) = R(P, P')$. For a fixed $x$ we can speak of the minimum positional cut $R_{\min}(x)$ over all possible positional cuts obtained for different $p$:
$$R_{\min}(x) = \min_p R_p(x).$$
Theorem 1. For each feasible $x^*$ of (2, 3, 4) $F(x^*) \ge R_{\min}(x^*)$. Also $F(x^*) > R_{\min}(x^*)$ if there are at least two positional cuts with different values.
Proof. Let us partition all nodes into three sets $U_1$, $U_2$, $U_3$ as follows:
$$U_1 = \{i \mid x^*_i = y^*_1\}, \quad y^*_1 = \min_{i \in V} x^*_i,$$
$$U_2 = \{i \mid x^*_i = y^*_2\}, \quad y^*_2 = \min_{i \notin U_1} x^*_i,$$
$$U_3 = V \setminus (U_1 \cup U_2).$$
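If $U_3$ is empty there is exactly one positional cut, $x^*$ is in the form of a characteristic vector and $F(x^*) = R_{\min}(x^*)$, which concludes this proof; otherwise define
$$y^*_3 = \min_{i \in U_3} x^*_i.$$
We create a sub-problem of (2, 3, 4) by reducing the original one to a new variable $y = (y_1, y_2, y_3)$:
$$\min F_1(y) = \sum_{i \in U_1, j \in U_2} c_{ij} |y_2 - y_1| + \sum_{i \in U_1, j \in U_3} c_{ij} |y_3 + l_j - y_1| + \sum_{i \in U_2, j \in U_3} c_{ij} |y_3 + l_j - y_2| + \sum_{i<j;\ i, j \in U_3} c_{ij} |l_i - l_j| \tag{6}$$
subject to
$$H_1(y) = \sum_{i \in U_1, j \in U_2} d_i d_j |y_2 - y_1| + \sum_{i \in U_1, j \in U_3} d_i d_j |y_3 + l_j - y_1| + \sum_{i \in U_2, j \in U_3} d_i d_j |y_3 + l_j - y_2| + \sum_{i<j;\ i, j \in U_3} d_i d_j |l_i - l_j| = 1, \tag{7}$$
$$|U_1| y_1 + |U_2| y_2 + |U_3| y_3 + \sum_{j \in U_3} l_j = 0, \tag{8}$$
where the constants $l_j = x^*_j - y^*_3$.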
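Next, we further restrict it with $y_1 \le y_2 \le y_3$ and can drop the absolute value signs:
$$\min F_2(y) = \sum_{i \in U_1, j \in U_2} c_{ij} (y_2 - y_1) + \sum_{i \in U_1, j \in U_3} c_{ij} (y_3 + l_j - y_1) + \sum_{i \in U_2, j \in U_3} c_{ij} (y_3 + l_j - y_2) + K \tag{9}$$
subject to
$$H_2(y) = \sum_{i \in U_1, j \in U_2} d_i d_j (y_2 - y_1) + \sum_{i \in U_1, j \in U_3} d_i d_j (y_3 + l_j - y_1) + \sum_{i \in U_2, j \in U_3} d_i d_j (y_3 + l_j - y_2) + P = 1, \tag{10}$$
$$|U_1| y_1 + |U_2| y_2 + |U_3| y_3 + \sum_{j \in U_3} l_j = 0, \tag{11}$$
$$y_1 \le y_2 \le y_3, \tag{12}$$
where $K = \sum_{i<j;\ i, j \in U_3} c_{ij} |l_i - l_j|$ and $P = \sum_{i<j;\ i, j \in U_3} d_i d_j |l_i - l_j|$ are constants.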
We have obtained a locally equivalent linear program in the sense that for $y^*$ the constraints (10, 11, 12) are satisfied and $F(x^*) = F_2(y^*)$. Also, if we can find a better solution for (9 - 12) we can substitute the result back to the original problem giving a better feasible solution for it.
From the linear programming theory it is known that we can find the optimal solution by examining the vertices of the polytope defined by the constraints; in our case that means the points where one of the inequalities in (12) is satisfied as equality. Let us examine both cases.
Case $y_2 = y_1$:
$$H_2(y_1, y_1, y_3) = \sum_{i \in U_1 \cup U_2,\ j \in U_3} d_i d_j (y_3 - y_1) + L = 1, \quad \text{where } L = \sum_{i \in U_1 \cup U_2,\ j \in U_3} d_i d_j l_j + P,$$
$$F_2(y_1, y_1, y_3) = \sum_{i \in U_1 \cup U_2,\ j \in U_3} c_{ij} (y_3 - y_1) + B, \quad \text{where } B = \sum_{i \in U_1 \cup U_2,\ j \in U_3} c_{ij} l_j + K.$$
By substituting $y_3 - y_1$ from the previous equation we get
$$F_2(y_1, y_1, y_3) = (1 - L) R(U_1 \cup U_2, U_3) + B.$$
Case $y_2 = y_3$: Similarly we get
$$F_2(y_1, y_3, y_3) = (1 - L) R(U_1, U_2 \cup U_3) + B.$$
We choose the solution $y$ with $y_2 = y_1$ if $R(U_1 \cup U_2, U_3) < R(U_1, U_2 \cup U_3)$, otherwise we choose the solution with $y_2 = y_3$. It is evident that if these two ratios are not equal we get a strictly smaller function value. We substitute the solution back into the original problem obtaining a new $x$. $x$ is a feasible solution of (2, 3, 4) with a smaller or equal function value and with the set $U_2$ merged into $U_1$ or $U_3$. We repeat the described process until $U_3$ is empty.
Let us analyze the resulting $x$ and sets $U_1$ and $U_2$. We have $F(x) = R(U_1, U_2)$, where $(U_1, U_2)$ is some positional cut of $x^*$ (in fact the minimal one), hence $F(x^*) \ge F(x) \ge R_{\min}(x^*)$. If we had some non-equal cuts compared during the process, we would have a strict decrease in the function and hence the second statement of the theorem holds. □
Definition 4. A feasible $x^*$ is a local minimum of (2, 3, 4) if there exists $\varepsilon > 0$ such that $F(x^*) \le F(x)$ for each $x$ satisfying (3, 4) and $\|x - x^*\| < \varepsilon$.
Definition 5. A feasible $x^*$ is a strict local minimum of (2, 3, 4) if there exists $\varepsilon > 0$ such that $F(x^*) < F(x)$ for each $x \ne x^*$ satisfying (3, 4) and $\|x - x^*\| < \varepsilon$.
Theorem 2. Each strict local minimum $x^*$ of (2, 3, 4) is in the form $x^* = x^A$ for some cut $(A, A')$.
Proof. We need to prove that in a strict local minimum the expression (5) holds for some $a$ and $b$; the correct values for $a$ and $b$ are then guaranteed by the constraints (3) and (4). Assume to the contrary that there are more than two distinct values for the components of $x^*$. We form the reduced problem as in Theorem 1 obtaining equations (9 - 12). We are able to do that since $U_3$ is non-empty due to our assumption.
From the linear programming theory a strict local minimum can only be at a vertex of the polytope defined by the constraints (10 - 12); however, in our case $y^*$ cannot be at a vertex, since the number of constraints where equality holds at $y^*$ is less than the number of variables. Consequently, $x^*$ cannot be a strict local minimum for the original problem. Hence our assumption is false and the theorem is proven. □
There is a one-to-one correspondence between strictly locally minimal cuts and strict local minima of (2, 3, 4) as stated in the following theorem.
Theorem 3. $x^*$ is a strict local minimum of (2, 3, 4) if and only if $x^*$ is a characteristic vector of some strictly locally minimal cut $(A, A')$.
Proof. First we prove that the characteristic vector $x^A$ of a strictly locally minimal cut $(A, A')$ corresponds to a strict local minimum of (2, 3, 4). Assume to the contrary that $x^A$ is not a strict local minimum point; then by Definition 5 there exists an arbitrarily small vector $t \ne 0$ that leads to a feasible solution $x = x^A + t$ with a smaller or equal function value, $F(x) \le F(x^A)$. By the nature of the constraints (3, 4) there will be at least two positional cuts of $x$, and since $t$ was sufficiently small, one of them is $(A, A')$.
There are two cases:
(i) there exists a positional cut $R_p(x)$ with value not equal to $R(A, A')$. Then by Theorem 1 there is a positional cut with value $R_p(x) < F(x) \le F(x^A) = R(A, A')$ which divides the nodes into sets $U_1$ and $U_2$. Since $t$ was sufficiently small, $U_1$ and $U_2$ cannot both contain nodes from both sets $A$ and $A'$, i.e. either $U_1 \subset A$, or $U_1 \subset A'$, or $U_2 \subset A$, or $U_2 \subset A'$, contradicting the local minimality of $(A, A')$ by Definition 1.
(ii) all positional cuts of $x$ are of the same value. Let us choose some positional cut $(U_1, U_2) \ne (A, A')$. And again, since we can choose $t$ small enough, either $U_1 \subset A$, or $U_1 \subset A'$, or $U_2 \subset A$, or $U_2 \subset A'$, which contradicts the strict local optimality of $(A, A')$ by Definition 1.
Secondly, we prove that each strict local minimum $x^*$ of (2, 3, 4) is in the form of a characteristic vector of a strictly locally minimal cut. From Theorem 2 we have that each strict local minimum is in the form of a characteristic vector $x^A$ of some cut $(A, A')$. We need to show that this cut is strictly locally minimal. Assume to the contrary that there exists $S \subset A$, $S \ne \emptyset$, $A \setminus S \ne \emptyset$ for which $R(A \setminus S, A' \cup S) \le R(A, A')$. Denote $U = A \setminus S$. Let us repeat the construction from the proof of Theorem 1 taking the sets $U_1$, $U_2$ and $U_3$ as $S$, $U$, $A'$ correspondingly. We obtain equations (9 - 12) and $y$ with $y_1 = y_2$. From Definition 5 each small move increases the function. Let us take such a feasible move $t$ that does not destroy the groups $U_1$, $U_2$ and $U_3$ but separates $U_1$ from $U_2$. Let $t'$ be the projection of $t$ onto the sets $U_1$, $U_2$, $U_3$, i.e. $t' = (t_k, k \in U_1;\ t_p, p \in U_2;\ t_r, r \in U_3)$. Let us denote $y''' = y + t'$ and $y' = y + \alpha t'$, where $\alpha$ is chosen such that $y'_2 = y'_3$. Since the constraints (10, 11) are linear and they have the same value for $y$ and $y'''$, they will have the same value for $y'$, which is a linear combination of $y$ and $y'''$. Obviously $y'$ satisfies the constraint (12). Similar arguments hold for the function: since $F_2(y''') > F_2(y)$, also $F_2(y') > F_2(y)$ from the linearity of (9). From $y'$ we can go back to the original problem and obtain $x'$, which is a characteristic vector of the cut $(A \setminus S, A' \cup S)$. Since $H(x') = H_2(y') = 1$ and $F(x') = F_2(y') > F_2(y) = R(A, A')$, we get $R(A \setminus S, A' \cup S) > R(A, A')$, which contradicts our assumption. □
We shall note that also for each non-strictly minimal ratio cut, its characteristic vector $x^A$ gives a local minimum of the function; however, the converse is not true. There exist non-strict local minima of (2, 3, 4) with the function value not equal to any locally minimal cut value. In Fig. 3 a graph and its node coordinates in a local minimum of (2, 3, 4) are shown. It is a local minimum because no small move gives a decreased positional cut or a better function value. But none of its positional cuts are locally minimal.
Theorem 4. The minimum ratio-cut $R$ is equal to the optimum solution of (2, 3, 4).
Proof. From Theorem 1 $F(x) \ge R_{\min}(x) \ge R$. For the characteristic vector $x^A$ of the minimum ratio cut $F(x^A) = R$. The claim follows immediately. □
Theorem 5. Each locally minimal cut $(A, A')$ is not greater than two times the minimum ratio cut.
Proof. Let us denote the optimal cut $(B, B')$. We form four sets $A \cap B$, $A \cap B'$, $A' \cap B$, $A' \cap B'$ shown in Fig. 1. From Definition 1 the ratio of each of the cuts $C_3 = (A \cap B, V - A \cap B)$, $C_4 = (A \cap B', V - A \cap B')$, $C_5 = (A' \cap B', V - A' \cap B')$, $C_6 = (A' \cap B, V - A' \cap B)$ is at least as large as the ratio of $C_1 = (A, A')$ and $C_2 = (B, B')$.
We form a full graph by taking each of the sets $A \cap B$, $A \cap B'$, $A' \cap B$, $A' \cap B'$ as the nodes of the graph and assign edge capacities as the sum of all edge capacities in the original graph between the corresponding sets. Writing $R_1, \dots, R_6$ for the ratios of the cuts $C_1, \dots, C_6$, expressed through the aggregated node weights $d_1, \dots, d_4$ and the aggregated edge capacities of this four-node graph, we obtain the conditions
$$R_3 \ge R_1, \quad R_4 \ge R_1, \quad R_5 \ge R_1, \quad R_6 \ge R_1.$$
Fig. 1. Illustration of the proof of Theorem 5.
And the statement of the theorem to be proven translates to $R_1 / R_2 \le 2$. We verified this using symbolic computation in Mathematica 4.0. See Appendix A for details.
Corollary 6. Each strict local minimum of (2, 3, 4) is not greater than two times the minimum ratio cut.
The proof is straightforward from Theorem 3 and Theorem 5.
The bound of Theorem 5 is tight; the graph with 4 nodes shown in Fig. 2 achieves this bound. One can make larger examples easily by substituting any connected graph with sufficiently high edge capacities in place of the nodes of the given graph.
The Algorithm
In this section we present a heuristic algorithm based on standard nondifferentiable optimization methods for finding a local minimum of (2, 3). Finding the minimum of a nondifferentiable function is one of several well-explored nonlinear programming topics [5, 23]. One of the possibilities is to approximate the nondifferentiable function with a smooth one and apply one of the well-known algorithms to find its local minimum [5, 3]. Often a better approach is to handle it directly. Indeed, in our case we obtain a very simple and fast algorithm.
Fig. 2. A graph achieving the bound of Theorem 5.
Fig. 3. A graph with assigned node coordinates in a local minimum of (2, 3, 4) where no positional cut is locally minimal.
Most of the optimization theory deals with convex problems for which algorithms with proven convergence can be developed. Many of these methods also work for non-convex functions, finding a local minimum. However, then the convergence cannot be shown, or can be shown only in a local neighborhood of some local minimum, which is not satisfactory in our case. The very basic algorithm of nondifferentiable optimization is a subgradient algorithm [5, 23]. We will adapt it for solving our problem. Since we apply it to a non-convex problem, it should be considered mostly as a heuristic; however, practice shows that it actually converges to a local minimum of our problem.
The algorithm is an iterative one. An iteration of the algorithm consists of going in the negative direction of a subgradient of the function by a fixed step and then performing a projection onto the constraint (3). The constraint (4) was introduced for technical reasons required in proofs and may be ignored in the algorithm.
An appropriate subgradient $q$ of $F$ can be calculated with the following equation for its components:
$$q_i = \sum_{j} c_{ij} \operatorname{sign}(x_i - x_j), \quad \text{where } \operatorname{sign}(x) = \begin{cases} 1, & x > 0 \\ 0, & x = 0 \\ -1, & x < 0 \end{cases}$$
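The subgradient formula can be evaluated in one pass over the edges. The following is a minimal sketch; the function names and the edge-dictionary representation are assumptions made for illustration.

```python
def sign(value):
    """Return 1, 0 or -1 according to the sign of value."""
    return (value > 0) - (value < 0)


def objective_subgradient(capacities, x):
    """q_i = sum over neighbours j of c_ij * sign(x_i - x_j).

    capacities: dict (i, j) -> c_ij, one entry per undirected edge
    x:          dict node -> coordinate
    """
    q = {node: 0.0 for node in x}
    for (i, j), c in capacities.items():
        s = sign(x[i] - x[j])
        q[i] += c * s  # contribution of edge (i, j) to node i
        q[j] -= c * s  # sign(x_j - x_i) = -sign(x_i - x_j)
    return q


if __name__ == "__main__":
    capacities = {(1, 2): 1.0, (2, 3): 1.0}
    print(objective_subgradient(capacities, {1: 0.0, 2: 0.0, 3: 1.0}))
```

Each edge is visited once, so the evaluation takes time proportional to the number of edges.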
Choosing the right step size is crucial for the convergence speed. Our heuristic observation is that it should be proportional to the node spread. We choose the step equal to stepFactor * ($x_{\max} - x_{\min}$), where stepFactor is some parameter. Initially stepFactor = 1/14. Its update during the algorithm is discussed later. The projection is performed by going in the direction of a subgradient of the constraint until the constraint is reached. For the constraint $H$ we can write its subgradient $r$ in a different form to allow its faster evaluation:
$$r_i = d_i \sum_{j} d_j \operatorname{sign}(x_i - x_j) = d_i \Big( \sum_{j:\ x_j < x_i} d_j - \sum_{j:\ x_j > x_i} d_j \Big).$$
If we sort the $x_i$ values and consider them in increasing order, then the needed sums can be updated incrementally, leading to linear time evaluation (not counting the sorting). To perform the projection we need to calculate the step length towards the constraint. To simplify the calculations we will assume that the ordering of $x_i$ will not change during the projection. Then the constraint function $H$ becomes linear and the desired step can be easily calculated from the linear equation defining the point of value 1 on the ray defined by the subgradient from the current point. In the case of unit node weights our assumption is fulfilled. To see this we must observe that the step in the function will always lead to a point with $H < 1$. Then, if we have unit node weights, $r_i < r_k$ for all $i$ and $k$ satisfying $x_i < x_k$, so after the projection the distance between them can only increase, thus keeping the same ordering. For the case of arbitrary node weights such an algorithm gives a usable approximation of the projection step length.
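The sorted evaluation of the constraint subgradient and the projection step can be sketched as follows. This is an illustration under the assumptions stated in the text (the ordering of coordinates does not change during the projection, so $H$ is linear along the ray and $H(x) = \sum_i r_i x_i$); the function names are hypothetical, and the sketch assumes that not all coordinates are equal.

```python
def constraint_subgradient(weights, x):
    """r_i = d_i * (weight strictly below x_i - weight strictly above x_i)."""
    order = sorted(x, key=lambda node: x[node])
    total = sum(weights.values())
    r = {}
    below = 0.0
    k = 0
    while k < len(order):
        # Group nodes that share a coordinate, so that ties contribute sign 0.
        end = k
        group_weight = 0.0
        while end < len(order) and x[order[end]] == x[order[k]]:
            group_weight += weights[order[end]]
            end += 1
        above = total - below - group_weight
        for node in order[k:end]:
            r[node] = weights[node] * (below - above)
        below += group_weight
        k = end
    return r


def project_onto_constraint(weights, x):
    """Move along r until H = 1, assuming the node ordering stays fixed."""
    r = constraint_subgradient(weights, x)
    h_value = sum(r[node] * x[node] for node in x)  # H(x) for this ordering
    norm_sq = sum(value * value for value in r.values())
    step = (1.0 - h_value) / norm_sq  # solves H(x) + step * (r . r) = 1
    return {node: x[node] + step * r[node] for node in x}


if __name__ == "__main__":
    weights = {1: 1.0, 2: 1.0, 3: 1.0}
    print(project_onto_constraint(weights, {1: 0.0, 2: 0.1, 3: 0.3}))
```

After sorting, the running sums `below` and `above` are updated in a single sweep, which is the linear-time evaluation mentioned in the text.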
The whole algorithm starts with a random initialization. We assign a random position for each node such that $H < 1$ and then perform a projection to obtain a feasible starting position. We experimented also with several other initializations, but obtained significant improvements only for tree graphs. Since the optimal cut for trees can be found in linear time, we can construct a starting position that reveals it, and the algorithm will only perform a few iterations to confirm that the solution does not improve. Hence to get a comprehensive picture of the algorithm behavior we decided to consider random initialization only.
After the initialization we perform iterations until convergence. To tell when the process has converged we track the minimum positional cut values obtained at each iteration. The same values also tell us how to change the step size parameter stepFactor. If the new cut value is lower than the previous one, then we are making progress and we should continue with the same step size. If the value is higher than the previous one, then the step size is too large. If the cut does not change, then we have a clue that the process has converged. Of course we do not hurry to make decisions from only one iteration. Instead we wait for a certain number of iterations controlled by the constants MAX_OSCILLATIONS and MAX_EQUAL before making the decision. Such a delay also improves the convergence speed by allowing the algorithm to iterate longer with a larger step size.
To determine the positional cut value at each iteration, we proceed as follows. We consider the sorted sequence of nodes, calculate the positional cut in each interval between two consecutive nodes, and take the minimum one. We can do it incrementally on the sorted sequence in time $O(n + m)$ once the nodes are sorted, where $m$ is the number of edges in the graph.
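The incremental sweep over the sorted nodes can be sketched as below. The function name, the adjacency-list construction and the handling of equal coordinates (a threshold cannot separate them) are assumptions of this illustration.

```python
def min_positional_cut(weights, capacities, x):
    """Return (ratio, left side) of the minimum positional cut of x.

    Nodes are moved one by one from the right side to the left side in
    increasing order of x; the cut capacity is updated per incident edge,
    so the sweep costs O(n + m) after sorting.
    """
    adjacency = {node: [] for node in weights}
    for (i, j), c in capacities.items():
        adjacency[i].append((j, c))
        adjacency[j].append((i, c))
    order = sorted(weights, key=lambda node: x[node])
    total_weight = sum(weights.values())
    in_left = set()
    left_weight = 0.0
    cut_capacity = 0.0
    best_ratio, best_size = None, None
    for position, node in enumerate(order[:-1]):
        in_left.add(node)
        left_weight += weights[node]
        for neighbour, c in adjacency[node]:
            # An edge to the left side stops crossing; any other edge starts to.
            cut_capacity += -c if neighbour in in_left else c
        if x[order[position + 1]] == x[node]:
            continue  # equal coordinates cannot be separated by a threshold
        right_weight = total_weight - left_weight
        if left_weight == 0 or right_weight == 0:
            continue  # the ratio is undefined for a zero-weight side
        ratio = cut_capacity / (left_weight * right_weight)
        if best_ratio is None or ratio < best_ratio:
            best_ratio, best_size = ratio, position + 1
    if best_size is None:
        return None, set()
    return best_ratio, set(order[:best_size])


if __name__ == "__main__":
    weights = {1: 1, 2: 1, 3: 1, 4: 1}
    capacities = {(1, 2): 1, (2, 3): 1, (3, 4): 1}
    print(min_positional_cut(weights, capacities, {1: 0.0, 2: 0.1, 3: 0.5, 4: 0.6}))
```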
As suggested in one of the exercises in [5], the performance of this algorithm can be improved by taking the previous directions into account. We add the previous direction to the current one, reduced by some constant REDUCTION_FACTOR between 0 and 1. It models a heavy ball motion in the presence of a force in the direction of the subgradient. In our experiments such a modification with REDUCTION_FACTOR = 0.95 performed substantially better.
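The accumulation of previous directions can be sketched in a few lines. The function name and the dictionary representation are assumptions for illustration; only the constant 0.95 comes from the text.

```python
REDUCTION_FACTOR = 0.95  # value reported in the text


def heavy_ball_update(accumulated, direction, reduction_factor=REDUCTION_FACTOR):
    """Return the damped previous direction plus the current one, per node."""
    return {
        node: accumulated[node] * reduction_factor + direction[node]
        for node in direction
    }


if __name__ == "__main__":
    # With a constant unit direction the accumulated value approaches
    # 1 / (1 - 0.95) = 20, the sum of the geometric series.
    velocity = {0: 0.0}
    for _ in range(2000):
        velocity = heavy_ball_update(velocity, {0: 1.0})
    print(velocity[0])
```

The geometric-series limit shows why a factor close to 1 matters: a persistent direction is amplified up to twentyfold, while directions that alternate in sign largely cancel.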
All the steps described before can be implemented to run in time $O(n + m)$. Adding the time needed for sorting the nodes, one iteration takes time $O(m + n \log n)$. The number of iterations is hard to estimate, so we will provide experimental data in the next sections. The constants MAX_OSCILLATIONS and MAX_EQUAL have the most impact on the iteration count and also on the quality of the obtained cut, so we must select them carefully. After some experimentation we chose MAX_EQUAL = 200 and MAX_OSCILLATIONS = 30.
algorithm ratio-cut
    calculate a random feasible initial position x
    accum = 0
    oscillationCounter = 0
    equalityCounter = 0
    stepFactor = 1/14
    while (equalityCounter < MAX_EQUAL)
        d = direction(x) - x
        accum = accum * REDUCTION_FACTOR + d
        x = x + accum
        if (minimum positional cut value has increased in this iteration)
            equalityCounter = 0
            oscillationCounter++
        or else (minimum positional cut value has decreased in this iteration)
            equalityCounter = 0
        or else
            equalityCounter++
        if (oscillationCounter > MAX_OSCILLATIONS)
            stepFactor /= 1.3
            oscillationCounter = 0
    endwhile
end

function direction(x)
    d = subgradient(x)
    step = (x_max - x_min) * stepFactor
    x = x - d * step
    x = projection of x on the constraint
    return x
end
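The counter logic of the listing can be isolated into a small helper. The sketch below is an illustrative rendering, not the author's implementation: the class name `StepController` and its interface are assumptions, while the constants 200, 30, 1/14 and the divisor 1.3 are taken from the text and the listing.

```python
MAX_EQUAL = 200
MAX_OSCILLATIONS = 30


class StepController:
    """Tracks minimum positional cut values and adapts the step factor."""

    def __init__(self, step_factor=1.0 / 14.0):
        self.step_factor = step_factor
        self.oscillation_counter = 0
        self.equality_counter = 0
        self.previous_cut = None

    def update(self, cut_value):
        """Record one iteration; return True once the stopping rule fires."""
        if self.previous_cut is not None:
            if cut_value > self.previous_cut:
                # The cut got worse: the step is probably too large.
                self.equality_counter = 0
                self.oscillation_counter += 1
            elif cut_value < self.previous_cut:
                # Progress: keep the current step size.
                self.equality_counter = 0
            else:
                # No change: a hint that the process has converged.
                self.equality_counter += 1
        self.previous_cut = cut_value
        if self.oscillation_counter > MAX_OSCILLATIONS:
            self.step_factor /= 1.3
            self.oscillation_counter = 0
        return self.equality_counter >= MAX_EQUAL


if __name__ == "__main__":
    controller = StepController()
    for value in [1.0, 2.0] * 31:
        controller.update(value)
    print(controller.step_factor)
```

Keeping the counters in one object makes the delayed decisions explicit: the step shrinks only after more than MAX_OSCILLATIONS increases, and the loop stops only after MAX_EQUAL unchanged cut values in a row.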
The Data
We evaluated the proposed algorithm on three families of graphs: random cubic graphs, random geometric graphs and random trees. We considered only graphs with unit weight nodes and edges.
Random cubic graphs are potentially hard for ratio-cut algorithms, because in [11] it was shown that there is actually an $O(\log n)$ gap between the minimum ratio-cut and the maximum concurrent flow on these graphs. We generate them using the algorithm provided in [14].
Random geometric graphs are a standard test suite for balanced cut problems used in several papers [9, 22]. To generate a geometric graph we place the nodes of the graph randomly in the unit square. Then we include an edge between each two nodes that are within distance $\delta$ of each other, where $\delta$ is the minimum value such that the resulting graph is connected.
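The construction of a random geometric graph with the smallest connecting distance can be sketched as follows. This is one possible way to find $\delta$, not necessarily the one used in the experiments: pairs are processed in order of increasing distance with a union-find structure, and $\delta$ is the length of the pair that merges the last two components. The quadratic number of pairs makes it suitable for small examples only; all names are assumptions.

```python
import math
import random


def random_geometric_graph(n, seed=None):
    """Return (points, edges, delta) for n random points in the unit square."""
    rng = random.Random(seed)
    points = [(rng.random(), rng.random()) for _ in range(n)]
    pairs = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i in range(n)
        for j in range(i + 1, n)
    )
    parent = list(range(n))

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]  # path halving
            v = parent[v]
        return v

    components = n
    delta = 0.0
    for distance, i, j in pairs:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j
            components -= 1
            delta = distance  # longest distance needed so far
            if components == 1:
                break
    edges = [(i, j) for distance, i, j in pairs if distance <= delta]
    return points, edges, delta


def is_connected(n, edges):
    """Depth-first search connectivity check on nodes 0..n-1."""
    adjacency = {v: [] for v in range(n)}
    for i, j in edges:
        adjacency[i].append(j)
        adjacency[j].append(i)
    seen, stack = {0}, [0]
    while stack:
        for w in adjacency[stack.pop()]:
            if w not in seen:
                seen.add(w)
                stack.append(w)
    return len(seen) == n


if __name__ == "__main__":
    points, edges, delta = random_geometric_graph(40, seed=7)
    print(len(edges), round(delta, 4), is_connected(40, edges))
```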
Tree graphs are seemingly easy graphs because their optimal ratio-cut can be calculated in linear time. Also it is not hard to show that only one local minimum exists for the corresponding optimization problem. Nevertheless, it is an interesting family since our experiments indicate a slow convergence of our algorithm on these graphs. Also, we can compare our result with the optimal one. We generate random trees with the classical algorithm, where each tree is produced with the same probability. This algorithm produces long and skinny trees, which are particularly difficult for our algorithm.
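One well-known way to draw every labelled tree with the same probability is to decode a uniformly random Prüfer sequence. The text does not say which classical algorithm was used, so the sketch below is an assumption chosen for illustration; the function name is hypothetical.

```python
import heapq
import random


def random_labelled_tree(n, seed=None):
    """Return the edge list of a uniformly random labelled tree on 0..n-1."""
    if n < 2:
        return []
    rng = random.Random(seed)
    sequence = [rng.randrange(n) for _ in range(n - 2)]
    # A node appears in the Pruefer sequence (degree - 1) times.
    degree = [1] * n
    for node in sequence:
        degree[node] += 1
    leaves = [node for node in range(n) if degree[node] == 1]
    heapq.heapify(leaves)
    edges = []
    for node in sequence:
        leaf = heapq.heappop(leaves)  # smallest remaining leaf
        edges.append((leaf, node))
        degree[node] -= 1
        if degree[node] == 1:
            heapq.heappush(leaves, node)
    # Exactly two leaves remain; they form the last edge.
    edges.append((heapq.heappop(leaves), heapq.heappop(leaves)))
    return edges


if __name__ == "__main__":
    print(random_labelled_tree(8, seed=3))
```

Because Prüfer sequences of length n - 2 are in one-to-one correspondence with labelled trees on n nodes, a uniform sequence gives a uniform tree.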
Experimental Results
We implemented the algorithm and evaluated its performance on a computer equipped with a Pentium III 800 MHz processor and 256 Mbytes of RAM. For each graph family we measured the running time in seconds, the number of iterations and the quality of the produced cuts. Since we did not know the exact cut values for random cubic and geometric graphs, we evaluated how much the ratio-cut value decreases when we continue the algorithm for the same number of iterations as performed before termination. By measuring the decrease of the ratio-cut value we can estimate how far the result is from the optimal.
The algorithm was run on a series of graphs of exponentially increasing size from 100 up to 204800 nodes. Ten graphs were generated for each size and the results were averaged. The average node degree for all graph families is constant. Although we cannot specify the degree explicitly for geometric graphs, due to their nature it was about 10 on all instances. The experimental results are given in tables 1-3. For each graph size the tables show the iteration count, the running time, the obtained ratio-cut value and the improved ratio-cut value when the iteration count is doubled. For tree graphs the optimal ratio-cut value is shown also. Let us discuss the results separately for each graph family.
Cubic Graphs
Although these graphs were suggested as difficult, the algorithm performed very well on them. It took on average about 35 minutes to partition the 204800 node graphs. The running time dependence on the graph size is shown in Fig. 4. When we approximated the running time with a function of the form $O(n^{\alpha})$ we got the asymptotical running time about $O(n^{\ldots})$ on these graphs.
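Fitting a function of the form $c \cdot n^{\alpha}$ to measured running times is commonly done by linear least squares in log-log coordinates. The sketch below shows that procedure under this assumption; the function name is hypothetical and the timings in the usage example are synthetic numbers invented for the illustration, not measurements from this paper.

```python
import math


def fit_power_law(sizes, times):
    """Least-squares fit of log(time) = alpha * log(size) + log(scale)."""
    logs_n = [math.log(n) for n in sizes]
    logs_t = [math.log(t) for t in times]
    mean_n = sum(logs_n) / len(logs_n)
    mean_t = sum(logs_t) / len(logs_t)
    covariance = sum((a - mean_n) * (b - mean_t) for a, b in zip(logs_n, logs_t))
    variance = sum((a - mean_n) ** 2 for a in logs_n)
    alpha = covariance / variance
    scale = math.exp(mean_t - alpha * mean_n)
    return alpha, scale


if __name__ == "__main__":
    # Graph sizes doubling from 100 to 204800, as in the experiments.
    sizes = [100 * 2 ** k for k in range(12)]
    # Synthetic timings following an exact power law, for demonstration only.
    times = [3e-6 * n ** 1.5 for n in sizes]
    print(fit_power_law(sizes, times))
```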
The algorithm seems to find a cut very close to the optimal one, since after doubling the iteration count the quality increased by less than 0.4%. Moreover, the quality improved for the larger graphs, approaching 1 (see Fig. 7). Such behavior is not surprising, since it is not hard to prove that the ratio of a cut $(A, A')$ of a random cubic graph is $\frac{3}{n-1}$ on average independently of the sizes of $A$ and $A'$ (a similar proof for general random graphs is presented in [20]). Then it is unlikely that the minimum ratio cut will be much different from this average value.
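The average cut ratio mentioned above can be checked with a short computation. The outline below assumes unit node weights and capacities and a cubic graph drawn uniformly at random, so that by symmetry every pair of nodes is adjacent with the same probability; it is a sketch under these assumptions, not the proof referred to in [20].

```latex
\begin{align*}
\Pr[(i,j) \in E] &= \frac{3}{n-1}
  && \text{(each node has 3 neighbours among the other } n-1 \text{ nodes)} \\
\mathbb{E}\,[C(A, A')] &= \sum_{i \in A} \sum_{j \in A'} \Pr[(i,j) \in E]
  = \frac{3\,|A|\,|A'|}{n-1} \\
\mathbb{E}\,[R(A, A')] &= \frac{\mathbb{E}\,[C(A, A')]}{|A|\,|A'|} = \frac{3}{n-1}
\end{align*}
```

The sizes of $A$ and $A'$ cancel, which is why the average does not depend on how balanced the cut is.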
Fig. 4. Running time for cubic graphs (horizontal axis: graph size).
Fig. 5. Running time for geometric graphs (horizontal axis: graph size).
Fig. 6. Running time for tree graphs (horizontal axis: graph size).
Fig. 7. Quality for cubic graphs (horizontal axis: graph size).
Fig. 8. Quality for geometric graphs (horizontal axis: graph size).
Fig. 9. Quality for tree graphs (horizontal axis: graph size).
Geometric Graphs
The algorithm performed very well on these graphs both in terms of speed and quality. It took on average about 1 hour to partition the 204800 node graphs. The running time dependence on the graph size is shown in Fig. 5. The asymptotical running time behavior on these graphs was about $O(n^{\ldots})$. After doubling the iteration count the quality increased only slightly (see Fig. 8). Also visually the cuts seemed the best possible. Fig. 10 shows a typical cut of a 10000-node graph.
Tree Graphs
The running time for trees was better than for other families. The largest graphs were partitioned in about 20 minutes. The running time dependence on the graph size is shown in Fig. 6. The asymptotical running time on these graphs was about $O(n^{\ldots})$. However, the quality was poor. As shown in Fig. 11, the obtained cuts were far from the optimal and the quality decreased with increasing graph size. Also doubling the iteration count showed 10% to 20% quality improvement (see Fig. 9).
Fig. 10. The obtained cut for a 10000-node geometric graph.
Fig. 11. Optimality for tree graphs.
When we explored further the reason for the poor behavior, we found out that convergence is much slower than for other graph families and the stopping criterion does not work correctly in this case. When we allowed the algorithm to run for a sufficiently long time, it always found the optimum solution. However, we did not find a robust stopping criterion that works correctly with tree graphs and does not increase the running time much for other graph families. As already mentioned, a smarter initialization can be used to improve the quality of the partition if such tree or tree-like graphs are common for some application.
Conclusions and Open Problems
We have proposed a nondifferentiable optimization based method for solving the ratio-cut problem and presented a heuristic algorithm implementing it. We have shown that any strict local minimum is 2-optimal. The presented algorithm, however, in certain cases can find a non-strict minimum, but we can easily transform the obtained $x$ vector into the characteristic form. Then the algorithm can be run again from this starting position and this process can be iterated until the result does not change, giving a locally minimal cut, which by Theorem 5 is 2-optimal.
The obtained algorithm is simple and fast and uses an amount of memory that is proportional to the size of the graph. Its running time and quality are verified experimentally. Its practical running time is about $O(n^{\ldots})$ on our test data. The algorithm produces high quality cuts on random cubic and geometric graphs. On trees and other very sparse graphs the quality can be significantly improved by choosing a better starting position than a random one. We evaluated the algorithm on artificially generated data. As further study it would be important to evaluate its performance on real-life problems.
Although the algorithm performed well on most graphs, it is heuristic anyway. It is an open question whether we can find a local minimum of (2, 3, 4) in polynomial time.
Acknowledgements
The author would like to thank Paulis Kikusts and Krists Boitmanis for valuable discussion and help in preparation of this paper.
References
1. C. J. Alpert and A. Kahng. Recent Directions in Netlist Partitioning: a Survey. Tech. Report, Computer Science Department, University of California at Los Angeles, 1995.
2. C. J. Alpert, J.-H. Huang, and A. B. Kahng. Multilevel circuit partitioning. Proc. Design Automation Conf., pp. 530-533, 1997.
3. R. Baldick, A. B. Kahng, A. Kennings and I. L. Markov. Function Smoothing with Applications to VLSI Layout. Proc. Asia and South Pacific Design Automation Conf., Jan. 1999.
4. J. W. Berry and M. K. Goldberg. Path Optimization for Graph Partitioning Problems. Technical report TR: 95-34, DIMACS, 1995.
5. D. Bertsekas. Nonlinear Programming. Athena Scientific, 1999.
6. L. Hagen and A. B. Kahng. New Spectral Methods for Ratio Cut Partitioning and Clustering. IEEE Trans. Computer-Aided Design, 11 (1992), pp. 1074-1085.
7. T. Hamada, C.-K. Cheng, P. M. Chau. An Efficient Multilevel Placement Technique using Hierarchical Partitioning. IEEE Trans. on Circuits and Systems, vol. 39(6), pp. 432-439, June 1992.
8. P. Klein, S. Plotkin, C. Stein, E. Tardos. Faster Approximation Algorithms for Unit Capacity Concurrent Flow Problems with Applications to Routing and Sparsest Cuts. SIAM J. Computing, 23(3) (1994), pp. 466-488.
9. K. Lang and S. Rao. Finding Near-Optimal Cuts: An Empirical Evaluation. 4th ACM-SIAM Symposium on Discrete Algorithms, pp. 212-221, 1993.
10. T. Leighton, F. Makedon, S. Plotkin, C. Stein, E. Tardos, S. Tragoudas. Fast Approximation Algorithms for Multicommodity Flow Problems. Journal of Computer and System Sciences, 50(2), pp. 228-243, April 1995.
11. T. Leighton, S. Rao. Multicommodity Max-Flow Min-Cut Theorems and their use in Designing Approximation Algorithms. Journal of the ACM, vol. 46, No. 6 (Nov. 1999), pp. 787-832.
12. T. Lengauer. Combinatorial Algorithms for Integrated Circuit Layout. Stuttgart, John Wiley & Sons, 1994.
13. D. W. Matula, F. Shahrokhi. Sparsest Cuts and Bottlenecks in Graphs. Journal of Disc. Applied Math., vol. 27 (1990), pp. 113-123.
14. A. Steger and N. C. Wormald. Generating Random Regular Graphs Quickly. Combinatorics, Probab. and Comput., vol. 8 (1999), pp. 377-396.
15. F. Shahrokhi, D. W. Matula. On Solving Large Maximum Concurrent Flow Problems. Proceedings of ACM 1987 National Conference, pp. 205-209.
16. F. Shahrokhi, D. W. Matula. The Maximum Concurrent Flow Problem. Journal of the ACM, vol. 37, pp. 318-334, 1990.
17. M. Wang, S. K. Lim, J. Cong, M. Sarrafzadeh. Multi-way Partitioning using Bi-partition Heuristics. Proc. Asia and South Pacific Design Automation Conf., pp. 667-672, 2000.
18. Y. C. Wei, C. K. Cheng. An Improved Two-way Partitioning Algorithm with Stable Performance. IEEE Trans. on Computer-Aided Design, 1990, pp. 1502-1511.
19. Y. C. Wei, C. K. Cheng. A Two-level Two-way Partitioning Algorithm. Proc. Int'l. Conf. Computer-Aided Design, pp. 516-519, 1990.
20. Y. C. Wei, C. K. Cheng. Ratio Cut Partitioning for Hierarchical Designs. IEEE Trans. on Computer-Aided Design, vol. 10, pp. 911-921, July 1991.
21. C.-W. Yeh, C.-K. Cheng, and T.-T. Y. Lin. A Probabilistic Multicommodity-flow Solution to Circuit Clustering Problems. IEEE International Conference on Computer-Aided Design, pp. 428-431, 1992.
22. C.-W. Yeh, C.-K. Cheng, T.-T. Y. Lin. An Experimental Evaluation of Partitioning Algorithms. IEEE International ASIC Conference, P14-1.1 - P14-1.4, 1991.
23. N. Z. Shor. Methods of Minimization of Nondifferentiable Functions and their Applications. Kiev, "Naukova Dumka", 1979 (in Russian).
24. I. Bomze, M. Budinich, P. Pardalos, and M. Pelillo. The Maximum Clique Problem. Handbook of Combinatorial Optimization, Volume 4. Kluwer Academic Publishers, Boston, MA, 1999.
Appendix A - Justification of Theorem 2.
Unfortunately, all attempts at proving this theorem analytically failed. We succeeded only in some special cases when all nodes or all edges have unit weight. Therefore we resorted to a "computer assisted proof".
The equalities can be equivalently written as:
[The equalities and the "Given"/"Prove" inequalities stated in terms of R are illegible in the source.]
Since none of the inequalities changes when we multiply all c_i by some constant, we can choose this constant such that R = 1. Solving this equality for c and substituting the result into the inequalities we obtain:
[The normalized inequalities (1)-(6) are largely illegible in the source; the only surviving fragment reads g_5 - g_6 > 0.]
This is the same as proving that (1)-(5) together with the negation of (6) can never be satisfied for any values of the involved variables. Let us also remember that the variables come from node weights and edge capacities, which are nonnegative. We can further require that the g_i are strictly positive; in the opposite case the proof is easy. We verified with several numerical solvers that these inequalities have no solution; however, that does not guarantee correctness because of numerical errors. Therefore we used the Cylindrical Algebraic Decomposition method implemented in Mathematica 4.0 to verify the inequalities using symbolic computation. The running time of the Cylindrical Algebraic Decomposition algorithm is inherently doubly exponential; hence we had to transform the inequalities and write supporting functions to obtain results in practically reasonable time. Using symbolic computation we can be certain that, if Mathematica has no implementation bugs, our conjecture holds.
The Mathematica 4.0 notebook to verify these inequalities follows.
In[11]:= (* We introduce an auxiliary function to speed up the calculations.
   Verify returns False if and only if the expression is always false *)
Verify[a_ || b__] := (Examine[a] || Verify[Or[b]]);
Verify[a_] := Examine[a];
Examine[a_] := Experimental`CylindricalAlgebraicDecomposition[a,
   {g1, g2, g3, g4, g6, g5, c5, c6, c1, c3, c4}];
$RecursionLimit = 10000;
eq1 = c1 + c4 + c6 - g1 - g4 - g6;
eq2 = c1 - c4 - c6 - g1 + g4 + g6;
eq3 = c3 - c4 - c5 - g3 + g4 + g5;
eq4 = c3 + c4 + c5 - g3 - g4 - g5;
eqimp = 2 (c1 + c3 + c5 + c6) - g1 - g3 - g5 - g6;
(* Do the verification in two steps. Doing it all together takes too much time *)
tmpResult = Experimental`CylindricalAlgebraicDecomposition[
   c5 >= 0 && c6 >= 0 && eq2 >= 0 && eq4 >= 0 && eq1 >= 0 && eq3 >= 0 &&
   eqimp < 0 && g1 > 0 && g3 > 0 && g5 > 0,
   {c5, c6, g1, g2, g3, g4, g5, g6, c1, c3, c4}];
Verify[LogicalExpand[tmpResult && g1*g3 - g2*g4 == g5*g6]]
Out[11]= False
Table 1. Experimental results for tree graphs.
Node count   Iterations   Time (s)     Cut value      Improved cut   Optimal cut
100          257          0.01755      0.000424058    0.000421019    0.000407809
200          377          0.05105      0.000131803    0.00010598     0.000101407
400          512          0.1377       3.38841E-05    3.14657E-05    2.55342E-05
800          739          0.43965      9.62836E-06    7.94314E-06    6.32158E-06
1600         906          1.14515      3.49721E-06    3.10424E-06    1.58787E-06
3200         987          2.72245      1.3683E-06     1.14917E-06    3.97123E-07
6400         1085         8.336        7.15262E-07    5.51447E-07    9.88485E-08
12800        1294         27.96925     3.34456E-07    2.98915E-07    2.46653E-08
25600        1343         79.48075     1.78731E-07    1.59801E-07    6.19941E-09
51200        1374         199.9305     1.03187E-07    9.352E-08      1.53453E-09
102400       1115         359.2426     6.66622E-08    6.17942E-08    3.89972E-10
204800       1734         1197.7344    2.81771E-08    2.60658E-08    9.6445E-11
Table 2. Experimental results for cubic graphs.
Node count   Iteration count   Time (s)     Cut value      Improved cut
100          199               0.013        0.005185495    0.005166088
200          290               0.04205      0.002513133    0.002510172
400          383               0.11065      0.001214463    0.001213595
800          570               0.36905      0.0005994      0.000598147
1600         727               1.02045      0.000296       0.000295773
3200         835               2.69475      0.000147637    0.000147552
6400         1012              8.92785      7.36747E-05    7.36234E-05
12800        1195              30.71865     3.67635E-05    3.675E-05
25600        1455              107.70535    1.8375E-05     1.83679E-05
51200        1672              320.36415    9.15616E-06    9.15434E-06
102400       1998              834.6924     4.56971E-06    4.56927E-06
204800       2500              2254.4636    2.28649E-06    2.28627E-06
Table 3. Experimental results for geometric graphs.
Node count   Iteration count   Time (s)     Cut value      Improved cut
100          129               0.016        0.002848427    0.002845499
200          146               0.04305      0.001631237    0.001620948
400          204               0.1282       0.000592549    0.0005825
800          266               0.4532       0.000338845    0.000333014
1600         389               1.9732       0.000116216    0.000113355
3200         430               5.28105      5.88979E-05    5.82837E-05
6400         530               14.3365      2.34042E-05    2.25542E-05
12800        681               45.96155     1.13718E-05    1.09091E-05
25600        800               149.0739     5.18003E-06    5.18003E-06
51200        886               461.85255    2.88345E-06    2.88345E-06
102400       976               1169.0154    1.4157E-06     1.4157E-06
204800       1417              3797.3582    3.32351E-07    3.14612E-07
LATVIJAS UNIVERSITATES RAKSTI. 2004. 669. sej.: DATORZINATNE UN INFORMACIJAS TEHNOLOGIJAS. 80.-88. lpp.
Review of Traceability Models for Software Testing Processes
Martins Gills
Riga Information Technology Institute
Kuldigas iela 45, Riga, LV-1083, Latvia
martins.gills@riti.lv
A considerable part of software development activities is devoted to testing. The scale of testing may depend on the quality criteria for the particular development project. As soon as development and testing are performed in some organized manner, there exist relations and dependencies between all types of project items. The relations outline a path of traceability that starts from the origins of a particular item and follows to its derivatives in the form of other item types. This paper reviews various traceability models for software projects, and analyses the sub-models of testing processes. The common properties of these models are identified.
Key words: traceability, problem reporting, testing process.
Introduction
The property of traceability for software development projects has been known for a long time. It usually assures better control over software development artefacts, provides a more organized project life cycle and, consequently, reduces the risk of developing a low-quality product [6]. Testing, as one of the software processes, is extensively dependent on the established traceability principles or, more formally, on a traceability model within the given project. A vital problem in software testing is to establish an appropriate coverage of requirements. Typically, formal coverage can be achieved according to some criteria, e.g. stating that each requirement must have at least one appropriate test, or in more advanced situations using standard requirements like the ones of BS 7925-2 [3]. It is especially important to evaluate the impact of requirement changes or to analyse the updates within the test suite. The previously mentioned issues are related to the traceability property within the software development life cycle and to the testing process in particular.
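The simplest coverage criterion mentioned above (each requirement must have at least one appropriate test) can be checked mechanically once test-to-requirement links are recorded. The following is a minimal sketch, assuming a hypothetical mapping from test identifiers to the requirement identifiers they are based on; the identifiers and the function name `uncovered_requirements` are illustrative and do not come from any of the reviewed projects.

```python
from typing import Dict, Iterable, List, Set


def uncovered_requirements(requirements: Iterable[str],
                           test_links: Dict[str, Set[str]]) -> List[str]:
    """Return requirements that no test traces to (hypothetical helper).

    test_links maps a test id to the set of requirement ids it is based on.
    """
    covered: Set[str] = set()
    for linked in test_links.values():
        covered |= linked
    # Preserve the input order so the report is stable.
    return [req for req in requirements if req not in covered]


# Illustrative data only: three requirements, two tests.
requirements = ["R1", "R2", "R3"]
test_links = {"T1": {"R1"}, "T2": {"R1", "R3"}}

missing = uncovered_requirements(requirements, test_links)
print(missing)  # requirements still lacking a test
```

With the sample data, only R2 has no test linked to it, so it is the single requirement reported.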
Software testing process properties
A large part of the effort devoted to software development is due to software testing [11]. Software testing can be regarded as one of the processes present within any software development life cycle. Although it may not appear explicitly within the formal list of processes (as e.g. in ISO/IEC 12207 [8]), it may be derived as a tailored process. Typically, testing is organized at various levels, depending on the focus of the specific testing activity. The concept of test levels (test activities with a common direction) is mainly derived from the software development V-model. The model itself originally comes from the Software Lifecycle Process Model [4]; more often it is represented in a very simplified form [10], and a more generic model is shown in Fig. 1. In principle, the V-model is close to the more classical Waterfall software development model [11]. Every project has a different testing process, and not all projects have tasks corresponding to every test level - some may have very specific ones. The main differences can be due to different naming and due to the combination or subdivision of particular test levels. In each case various organizational and methodological issues play a role in defining the way these test levels are implemented. For example, a unit test may be called a component test, while acceptance testing may be subdivided into factory acceptance testing and user acceptance testing, or this level may be supplemented by qualification testing. Every test level has a different object to be tested, and the tests themselves are based on different project artefacts.
[Figure 1: diagram with the development stages Requirements Specification, Architectural Design, Detailed Design and Coding, and the test levels Unit Testing, Integration Testing, System Testing and Acceptance Testing; arrows denote "on the basis of".]
Figure 1. The V-model.
The author has found that projects can only rarely say immediately what kind of life cycle model they use. In most cases such a model exists, but it is not formally described in an explicit form; rather, it is hidden within the project plan, procedures or standards. Although the notion of test levels is explicitly used in the V-model, they can be a part of other models as well. It would be extremely difficult to find a development process where testing is just a once-through activity with only one testing level. Even the simplest approach to quality requires a product (e.g. a software element) checking activity prior to passing it further for integration into a larger product or prior to delivery. As typical software development involves numerous developers and project stages (typically with cycles), there are at least two test levels present.
Traceability models for development process
According to the concept of the traceability model [5], it is formed from project items and relations between them. The most convenient form of representation is graphical, where boxes are item types and arrows indicate relations. To analyse the testing process items, the author compiled preliminary information from 14 projects. Each of those represented a slightly different development approach, but at the same time each project from this group could be mapped to the typical life cycle building blocks defined by the standard J-STD-016 [7]. For this paper, four different models were selected, each representing some possible class of traceability models. Three of them represent custom information system development projects, and one represents the development of small utility software.
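A model built from items and relations can also be held as a small directed graph, which makes a question such as "which items are affected if this requirement changes?" answerable by traversal. The sketch below is an illustration only: the class `TraceabilityModel`, its method names, the item identifiers and the relation labels are assumptions, loosely following a chain of customer proposal, requirement, test and problem report.

```python
from collections import defaultdict, deque
from typing import Dict, List, Set, Tuple


class TraceabilityModel:
    """Directed graph: nodes are project items, edges are named relations."""

    def __init__(self) -> None:
        # source item -> list of (relation name, derived item)
        self._edges: Dict[str, List[Tuple[str, str]]] = defaultdict(list)

    def relate(self, source: str, relation: str, derived: str) -> None:
        """Record that `derived` depends on `source` through `relation`."""
        self._edges[source].append((relation, derived))

    def impacted_by(self, item: str) -> List[str]:
        """Breadth-first walk: every item reachable from `item`."""
        seen: Set[str] = {item}
        order: List[str] = []
        queue = deque([item])
        while queue:
            current = queue.popleft()
            for _relation, derived in self._edges[current]:
                if derived not in seen:
                    seen.add(derived)
                    order.append(derived)
                    queue.append(derived)
        return order


# Hypothetical items and relation labels, for illustration only.
model = TraceabilityModel()
model.relate("Proposal P1", "is refined into", "Requirement R1")
model.relate("Requirement R1", "is basis of", "Test T1")
model.relate("Test T1", "reveals", "Problem report PR1")
print(model.impacted_by("Requirement R1"))
```

Edges are stored in the direction of derivation, so a change to an origin item propagates to its derivatives; answering the reverse question (where did this item come from?) would need a second index keyed by the derived item.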
Project A (see Fig. 2A) represents the most popular class of traceability relations. Formal requirements are derived from customer proposals, tests are based on these requirements, and problems are documented in problem reports. The test execution results are kept together with the test information. Items directly related to testing are the test and the problem report.
[Figure 2: two traceability diagrams, A and B. Legible item types: Design item, Requirement, Change request, Problem report, Test, Test case, Test log. Legible relation labels include "implements" and "execution of".]
Figure 2. Traceability models for Projects A and B.
Project B (see Fig. 2B) shows a project where the testing process is highly refined: the testing process items are test class, test, test case, test log and problem report. This project may correspond more to a situation where requirements are not undergoing changes, and the project stage represented in the model is more oriented to implementation and testing than to designing or developing specifications.
Two other projects had more relaxed requirements for the development. Project C (see Fig. 3A) is a sample of a model that includes a quite detailed implementation description. This model was present in a project developing a web-based application. Testing process items were test, test case, test log and problem report. Testing was well organized, and in this case the tests were mostly at unit and integration level.
A slightly different situation from all the previous ones was found in Project D (see Fig. 3B). This was a development project of a software utility with no particular external requirements in mind; rather, they were defined iteratively and served at the same time as tests. Something similar may happen when special development methodologies like Extreme Programming are practiced. The Project D case is not dominant, but it characterizes a semiformal approach in projects with rapid and evolutionary development.
Figure 3. Traceability models for Projects C and D.
The author found that it was possible to map the traceability models of all 14 reviewed projects to the four models previously presented. The main difference could be that specific names are given to particular item types in each project. For example, a requirement could be a particular diagram, use case, scenario, instruction, feature, etc.
Typical item types of testing process
The identification of testing items within the development process traceability model allows defining the traceability for the testing process. The typical constituents of the testing process are the project items shown in Table 1.
Table 1. Testing-related project items.
Test: A set of objectives, methods and criteria to be applied for verification/checking purposes to a certain software item.
Test case: A set of inputs, execution preconditions, and expected outcomes developed for a particular test objective. In some cases the test case may be created in the form of a test script for automated tests.
Test class: A set of tests with similar test objectives, methods or test levels.
Test configuration: A set of additional test attributes to be applied during execution for certain test cases.
Test suite: A set of test cases to be executed within a certain test session, scenario or by particular personnel. A test suite can be implemented in the form of a test script for automated tests.
Test log record: Information about the status of test execution.
Problem report: A detailed description of test incidents, a record describing all events requiring additional exploration.
The identification of testing process items within project traceability models shows that the relations between these items are not random. There are typical directions of dependencies. In every project, the testing process items form some subset of a more generalized model. Figure 4 shows a generalized model of the testing process. Each individual project may include a subset of these items, or define some specific additional testing-related items. For example, the test script as an item type can be added to the model for projects with a high level of test automation.
[Figure 4: diagram of the item types Test, Test class, Test case, Test suite, Test log record, Problem report and Test configuration, connected by the relations "belongs to", "is part of", "contains", "marks execution of", "is executed with" and "marks problem in".]
Figure 4. A generalized traceability model for testing process.
The items shown on the left side of Fig. 4 represent the typical constituents of the information set for testing that could be derived from software engineering standards like J-STD-016 [7]. The information set of the testing process has to provide answers to the following three question areas that are vital for project management and product quality:
• what is to be (or was) tested and how?
• when and how were the tests executed?
• what problems were encountered?
In Fig. 4, the project items on the right side have a more classifying role - in some cases they could be regarded not as separate items but rather as attributes of the project items on the left side.
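The idea that each project's testing items form a subset of the generalized model, possibly with project-specific additions, can be expressed as a simple set comparison. The sketch below is an illustration under stated assumptions: the item type names are taken from Table 1, the split into "core" and "classifying" item types is only one reading of the left and right sides of Fig. 4, and the function `check_project_model` is hypothetical.

```python
from typing import Dict, Set

# Item types from Table 1; the core/classifying split is an assumed
# reading of the left and right sides of Fig. 4.
CORE_ITEMS: Set[str] = {"test", "test case", "test log record", "problem report"}
CLASSIFYING_ITEMS: Set[str] = {"test class", "test suite", "test configuration"}
GENERALIZED_MODEL: Set[str] = CORE_ITEMS | CLASSIFYING_ITEMS


def check_project_model(project_items: Set[str]) -> Dict[str, Set[str]]:
    """Compare a project's testing item types with the generalized model."""
    return {
        "used": project_items & GENERALIZED_MODEL,
        "omitted": GENERALIZED_MODEL - project_items,
        # Project-specific additions, e.g. a test script item type.
        "additional": project_items - GENERALIZED_MODEL,
    }


# A hypothetical project that keeps only tests and problem reports
# and adds its own test script item type.
report = check_project_model({"test", "problem report", "test script"})
print(sorted(report["additional"]))
```

Keeping the classifying item types in a separate set makes it easy to treat them as attributes rather than as items of their own, should a project choose to do so.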
The most common approach in industrial projects is not to distinguish between tests and test cases. All are called tests, and the item can be either very detailed (like the test case), or it can outline just the main issues (like the test). Some other optimisations can take place, sometimes reducing the minimal test item set to only two project item types. Fig. 5 shows two possible cases that were observed in the projects. In case (a) separate tests are defined, but all testing results are maintained as one log. This can function in small teams where it is not important to establish formalized communication between testers and developers, but where testers still recognize the need to document the progress and problems. Case (b) formalizes the situation in which attention is paid mainly to problem registering, and tests are probably not formally documented in every case, but rather briefly described while they are executed. This may be typical for larger project teams than in case (a).
[Figure 5: (a) item types Test and Results of testing (including Pr.Rep.), relation "marks execution of"; (b) item types Test & execution info and Problem report, relation "describes problem of".]
Figure 5. Possible optimisations of traceability models of the testing process.
In some projects, the main motivation is the need to minimize the effort spent on documenting (due to economy reasons). Sometimes the writing of tests is omitted, or tests are written as an exact copy of requirements with words like "make sure that", "check" or "verify" added. This assures a good traceability, but in terms of good software engineering, it is not efficient testing [1].
Overall, the evaluation of the testing process items found to be present in the projects shows that the test is the main project item type - it indicates the testing intentions. All other test-related items mainly depend on this. The problem report does not play a significant role for traceability, but it has been seen that the formalism is quite widespread if the dependence of the test on the problem report is made. This is the case where new tests may be added as a result of problems.
External relations of testing process items
The previous chapter defined the generalized traceability model for the testing process. At least two items within this model - the test and the problem report - have relations with project items that are not part of the testing process.
Dependencies of the test
Items externally directly related to the test (relative to the testing process) can be classified into three groups:
• software system defining items (requirements, design items),
• the code,
• co-products (e.g. user's manual).
The test is usually strongly dependent on these items. There is a correlation between the test levels and the dependencies of the test (see Table 3). The test can also be derived from the problem report (see Fig. 4), but this is not dependent on the test level, rather on the test process organization.
The relation of the test to the initial or refined requirement could be considered as a norm for functional testing. This provides the most direct way to answer the question whether the expectations of the customer are met. Design items and code elements are referenced primarily in the developer tests, because in a traditional setting neither users nor independent testers would have the availability to exploit the internal development information.
Table 3. Items related to the test and the corresponding test levels.
Project item          Test levels
Initial requirement   acceptance
Requirement           system
Design item           integration
Code element          unit
User's manual         system, acceptance
Problem report        unit, integration, system, acceptance
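Table 3 can be read in both directions. As a small illustration, the sketch below stores the table as a dictionary and inverts it, so that for a given test level one can list the project items a test at that level is typically related to. The names `TEST_ITEM_LEVELS` and `items_for_level` are hypothetical; the data is copied from Table 3.

```python
from typing import Dict, List, Tuple

# Data transcribed from Table 3: project item -> test levels.
TEST_ITEM_LEVELS: Dict[str, Tuple[str, ...]] = {
    "initial requirement": ("acceptance",),
    "requirement": ("system",),
    "design item": ("integration",),
    "code element": ("unit",),
    "user's manual": ("system", "acceptance"),
    "problem report": ("unit", "integration", "system", "acceptance"),
}


def items_for_level(level: str) -> List[str]:
    """List the project items a test at `level` is typically related to."""
    return [item for item, levels in TEST_ITEM_LEVELS.items() if level in levels]


print(items_for_level("system"))
```

The same structure would serve for the items related to the problem report, with the dictionary contents replaced accordingly.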
The relation of the test to the user's manual is quite interesting. It can be applied in two cases - either when the documentation is under test, or when there are no formally specified requirements. Of course, there may be some defined requirements, but at the time of testing they are completely outdated, and it is not practically feasible to base the test on these requirements. An interesting alternative approach, supported by case studies, has been proposed - to use the user's manual as a requirements specification [2]. Such an approach requires systematic writing of the manual, but for many real-life projects this method could be less expensive than writing and maintaining two documents. A special advantage is gained when the customer does not request both documents. The relation with a problem report is typical for tests added during the test execution phase to check the problem areas in later testing cycles.
Relations of the problem report
The problem report is an event-provoked item, and therefore it is tied to other items by a weak relation - no change in other related items can provoke a change in the problem report itself. The problem report can be related to almost any project item, but the most typical are the same as for the test. It may have numerous indirect relations (e.g. through an interface element problem it may point to some specific requirement). The existence of particular direct relations depends on the test level (see Table 4). In each particular project, usually only part of the above-mentioned items is used. The problem report's relation to the test or test log indicates what kind of activity led to the given problem report. Design, code and interface elements or the user's manual indicate the location of the problem observable by the tester. A reference to the requirement can be used to indicate a problem either in the requirements or in their derivatives.
Table 4. Items related to the problem report and the corresponding test levels.
Project item          Test levels
Test                  unit, integration, system, acceptance
Test log record       unit, integration, system, acceptance
Design element        integration, system
Interface element     system, acceptance
Code element          unit, integration
Requirement           acceptance
User manual           system
Additionally, there is an external item that depends on the problem report - the change request. Usually, it is further related to the requirements.
Related work
A comprehensive analysis of traceability models within software development projects is made in [12]. However, it is more focused on generalization of the problem than on analysing individual relations of project items. In [9] a model of trace links within testing activities is shown. However, it describes a particular case of traceability links with generalized, bi-directional relations. The main focus is on coverage-oriented questions. Some papers (e.g. [13]) express the need for increased traceability for software development quality techniques to be more similar to traditional measurement principles. No reports were found that focus on the relations between traceability and software testing.
Summary
The traceability models of software testing processes belonging to different development life cycle models indicate similar properties. There are considerable similarities between all testing processes, and no contradictory principles were found. These particular testing process models can be regarded as subsets of a generalised traceability model of the testing process. The main differences can be found in the terminology - how each item type is called in a particular project - and in the level of optimisation - which items are skipped or regrouped. The quality of testing can be considerably improved if traceability information is taken into account. This increases the maintainability of tests, and could reduce the effort and time spent to carry out a change.
The author has indicated possible research areas concerning traceability within projects. From the application point of view, a potential interest could be in finding the influence on the properties of the traceability model caused by the duration of the project, its particular life cycle model, project team size, project (sub-)contract conditions, project scope and tasks, development methodology and development environment.
Acknowledgement
The author would like to thank the Riga Information Technology Institute for making it possible to analyse the software development projects and the Latvian Council of Science Grant (grant assignation decision No. 3-2-1, 2002.07.09) for partial financing of the research.
References
1. Bach J. Risk and Requirements-Based Testing. Computer, June 1999, pp. 113-114.
2. Berry D. M., Daudjee K., Dong J., Nelson M. A., and Nelson T. User's Manual as a Requirements Specification. Technical Report CS 2001-17, University of Waterloo, Waterloo, ON, Canada, May 2001.
3. BS 7925-2 Software component testing, BSI 1998.
4. DSK RR413320010. Allgemeiner Umdruck 250/4. BWB IT 1997. Entwicklungsstandard fur IT-Systeme des Bundes, Vorgehensmodell. (Teil 1: Regelungsteil. Teil 3: Handbuchsammlung).
5. Gills M. The Concept of a Universal Traceability Tool for Software Development Projects. Scientific Proceedings of the Riga Technical University, Computer Science, Applied Computer Systems, 2nd thematic issue. Riga, 2001, pp. 33-41.
6. Gotel O. C. Z., Finkelstein A. C. W. An Analysis of the Requirements Traceability Problem. 1994, p. 8. http://citeseer.nj.nec.com/78573.html
7. IEEE. J-STD-016-1995. Standard for Information Technology Software Life Cycle Processes. 1995.
8. ISO. Standard ISO/IEC 12207. Standard Information technology - Software life cycle processes. 1995.
9. Linde S. Relationships between quality and testing - Kvalitates un testesanas attiecibas (in Latvian). Scientific Proceedings of the Riga Technical University, Computer Science, Applied Computer Systems, 2nd thematic issue. Riga, 2001, pp. 68-74.
10. Pol, Martin; van Veenendaal, Erik. Structured testing of information systems - an introduction to TMap. Kluwer BedrijfsInformatie, 1998, p. 177.
11. Pressman, Roger S. Software engineering: a practitioner's approach. McGraw-Hill, Inc., 1997 (4th ed.), p. 852.
12. Ramesh, Balasubramaniam; Jarke, Matthias. Toward Reference Models for Requirements Traceability. IEEE Transactions on Software Engineering, vol. 27, no. 1, January 2001, pp. 58-93.
13. Wakid, Shukri A.; Kuhn, D. Richard; Wallace, Dolores R. Toward Credible IT Testing and Certification. IEEE Software, July 1999, pp. 39-47.
LATVIJAS UNIVERSITATES RAKSTI. 2004. 669. sej.: DATORZINATNE UN INFORMACIJAS TEHNOLOGIJAS. 89.-98. lpp.
Benchmarking Problems of Topic Telecommunications and Access in Latvia
Mara Gulbe*, Arnis Gulbis**
*University of Latvia, Aspazijas blv. 5, Riga, mgulbe@latnet.lv
**Riga Technical University, Azenes 12, Riga, arnisg@di.lv
The problems related to benchmarking of the telecommunications and access (T&A) area in Latvia are analyzed. The topic under consideration is important because it represents the backbone of all Information Society characterizing issues. The availability of data on T&A from national statistical sources is analyzed, and indicators used for benchmarking of the topic T&A in Latvia are considered.
Key words: telecommunication, benchmarking.
Introduction
Taking into account the fact that Latvia is going to become a member state of the EU, the priorities of the EU also become relevant in Latvia. The eEurope+ action plan [1], directed towards accelerating the formation of the Information Society in NAS, serves as proof of this statement. All NAS countries have agreed to take part in the implementation of this action plan in their respective countries. Up to now our governments have supported this process in Latvia. As an example we can mention the National program "Informatics" [2] and the socio-economic program e-Latvia [3]. At the same time the attention paid to benchmarking of action results is unsatisfactory, though the 2003 action plan of the Central Statistical Bureau includes data collection in accordance with the needs of eEurope+ activities. These data will be available only in 2004. Nevertheless, the EC has a significant interest in the present state of IS (Information Society) characterizing areas in NAS countries. Therefore, independent investigations are supported by the EC. Here we can mention the SIBIS project [4], which also extends the benchmarking process of IS to NAS countries. The level achieved in the development of Telecommunications and Access serves as the backbone for IS activities in different areas of society and individual lives. The present paper reflects the investigation of the topic T&A in Latvia carried out in the framework of the SIBIS project (supported by the EC as an FP5 investigation).
Telecommunications Infrastructure
To facilitate a better understanding of the topic, some insight into the
telecommunications and access infrastructure in Latvia is necessary. It includes:
• Fixed public exchange network;
• State significance data transmission network and other specialized networks;
• Mobile telephone network;
• Cable TV network.
Fixed Public Exchange Network
After the renewal of independence in 1991, the government owned a rather
backward fixed public exchange network infrastructure with an almost
one hundred percent dominance of analogue telephone lines. Very often one telephone line
was shared by two users, making simultaneous calls impossible. The access demand was
higher than the supply. Under these circumstances quick measures were necessary to
accelerate progress in the field under consideration. To stimulate the implementation
of digital technologies in the area of telecommunications and to satisfy the access needs
of the population, foreign capital was involved. The ownership of the fixed public
exchange network was shared between Lattelekom (representative of the state) and one
foreign telecommunication company (after the change of its shareholder it is now the
Finnish Sonera) under conditions specified in the agreement between both parties. The
agreement foresees the complete digitalization of the fixed public exchange network in
Latvia by Lattelekom according to a time schedule fixed in the agreement. The
agreement also established the Lattelekom (together with the second owner) monopoly on
the fixed public exchange network (supervision, development, maintenance) for a time
period of twenty years (up to 2013). This means that the fixed telecommunication
market was closed to other parties who wanted to enter that market up to this date.
This, in turn, acts against price reduction in the telecommunication market. Later it
turned out that this monopoly also creates a serious barrier to Internet penetration and
usage by households in Latvia, due to inappropriately high prices when compared with
the average salary of employees in this country. After this became clear to the
government, efforts were undertaken to get free from that monopoly. This was also
stimulated by EU recommendations demanding the liberalization of the
telecommunication market. In case of a successful agreement between both players the
liberalization of the fixed telecommunication market should be achieved in 2003. This
is stated in the law "On telecommunications" and is what is expected by the
government.
Now Lattelekom Ltd is the largest telecommunication company in Latvia. It offers
clients different telecommunication services including voice telephony, Internet,
lease of telephone lines, etc. Through its subsidiary Apollo, Lattelekom also acts as an
Internet provider. The following Internet access possibilities are offered:
• Dial-up;
• ISDN;
• xDSL.
Depending on the purpose (business or household) different types of ISDN and
xDSL are offered. Some types of xDSL cannot be considered broadband.
One more point must be mentioned here. Lattelekom does not keep its promises
regarding digitalization of telephone lines everywhere in Latvia according to the time
schedule set in the agreement with the government.
The argumentation is based on the statement that in rural regions digitalization is
unprofitable. At the same time the quality of data transmission by dial-up connection
through analogue lines is sometimes very bad. This could become a reason for a digital
divide.
Mara Gulbe, Amis Gulbis. Benchmarking Problems of Topic Telecommunications and Access ...
On the other hand, other Internet service providers cannot extend their services
through the fixed telephone network to rural regions and must try to find an alternative
(radio-link or Internet through satellite if possible).
State Significance Data Transmission Network and Other Specialized Networks
There is a state significance data transmission network (Latvian abbreviation
VNDPT) in Latvia. This network is independent of the fixed public exchange
network (though some lines are leased from Lattelekom) and is supervised by the state
agency VITA (Latvian abbreviation for the Agency of Government Information
Networks). VITA is a Nonprofit Organization State Joint-stock Company established in
1997 according to order No 44 of the Cabinet of Ministers. The aim of establishing the
agency was to provide the development and maintenance of a joint governmental
information system. As provider of State significance data transmission network
services VITA maintains the corporate (intranet) computer network covering all the
territory of Latvia. In this network the Frame Relay packet switching technology is
used. The advantage of such a technology is the possibility of several logical channels
in one physical channel and the predictable requirements of both data transmission
speed (28.8 kbps - 2 Mbps) and query response time in the address space governed by
the TCP/IP protocol. Besides, the logical channel of one user is not available to others.
Services provided by VITA are available in all of Latvia for both governmental
establishments and state registers.
The VNDPT network is a closed network available to governmental institutions
on-line 24 hours a day.
Access to State significance information systems from outside (Internet) is
controlled through a firewall system. Access to public servers (containing e.g. home
page information of Ministries) is open.
In order to meet client needs for an increased data transmission rate, a
transition to optical fibre connections is in progress. Five ministries and some state
significance registers are connected through optical cables.
A new service offered by VITA is IP telephony, which allows the reduction of
telecommunication costs for their clients.
In general, the benefits of VITA services for governmental institutions are as follows:
• Lower access and traffic costs when compared with those offered by other providers;
• Increased security of access and data transmission.
In the European context the future objective for the VNDPT network is closer
integration with IDA networks.
Besides the network supervised by VITA there are some other specialized
telecommunication networks in Latvia: those of Latvia's Railroad, the Ministry of the
Interior, Latvenergo (the biggest energy supplier in Latvia), and the Latvia State Center of
Radio and Television. Due to their restricted range of usage they are less important.
Mobile Network
There are two mobile telephone networks in Latvia:
• LMT (Latvian abbreviation for Latvia's Mobile Telephone);
• TELE2 (subsidiary of Tele2 AB (at the time "NetCom AB")).
Both operators provide telecommunication services in the GSM900 and
GSM1800 frequency bands and offer the following set of services:
• Voice telephony;
• SMS;
• Fax transmission;
• Data transmission;
• Internet;
• E-mail;
• WAP.
Types of connection used to initiate data transmission are:
• Analogue;
• ISDN.
LMT services support wider possibilities for mobile connection with a PC when
compared with those offered by TELE2. These also include Bluetooth technology.
The data transmission rate of 9.6 kbps is provided by both operators, but LMT also
offers another, more advanced technology, HSCSD (High Speed Circuit Switched Data),
with a data transmission rate of up to 38.4 kbps.
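To give a feel for the difference between the two rates, the following minimal Python sketch
computes the ideal transfer time of a payload at 9.6 kbps and at 38.4 kbps. The 100 kB payload
and the function name transfer_seconds are illustrative assumptions, not operator figures;
protocol overhead and retransmissions are ignored, so real transfers would be slower.

```python
def transfer_seconds(size_bytes: int, rate_kbps: float) -> float:
    """Ideal transfer time in seconds, ignoring overhead and retransmissions."""
    if rate_kbps <= 0:
        raise ValueError("rate must be positive")
    bits = size_bytes * 8
    return bits / (rate_kbps * 1000)


# Hypothetical payload: a 100 kB page (assumption, not from the paper).
PAGE_BYTES = 100_000

gsm_csd = transfer_seconds(PAGE_BYTES, 9.6)   # basic GSM data rate from the text
hscsd = transfer_seconds(PAGE_BYTES, 38.4)    # HSCSD upper bound from the text

print(f"9.6 kbps:  {gsm_csd:.1f} s")
print(f"38.4 kbps: {hscsd:.1f} s")
```

Under these assumptions the HSCSD upper bound shortens the transfer by a factor of four, which
is simply the ratio of the two rates.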
LMT services also include a new data transmission technology for GSM networks,
GPRS (General Packet Radio Service), based on packet switching and offering
additional possibilities for data transmission services.
Historically the first mobile operator in Latvia was LMT. The appearance of a
second operator led to a reduction of costs for mobile telephone services.
One more point must be mentioned here. There is little hope that in the coming
years the services of the fixed telephone network will be available to many inhabitants
of rural regions. This is due to the lack of fixed network telephone lines in many
places. The alternative is the usage of services provided by mobile operators. The
services offered by LMT are available in almost all of the territory of Latvia. The
connection possibilities of the Tele2 network are not so good. Due to the circumstances
described, many people in rural regions are using mobile phones, but mainly for voice
telephony. Of course, Internet access through mobile phones is also available to them, but only
with the rather low data transmission rate provided by GSM. In the context of an
increased data transmission rate, the possibilities offered by 3G mobile services seem
very attractive for inhabitants of rural regions, of course only in the case of acceptable
service costs. Internet via satellite is also possible, but the exploitation of this
service is less probable.
In order to increase the number of players in the telecommunication market and
facilitate its liberalization, a tender for three UMTS licenses was organized by the
government in 2002. Both mobile operators, LMT and TELE2, succeeded in that tender.
The third license was not sold and, therefore, the tender must be repeated once more.
Cable TV Networks
There are several cable TV providers in Latvia, mainly in the capital Riga (one
well-known in Riga is Baltkom (www.baltkom.lv), but there are also others). The
service is available in areas with a high density of inhabitants, where it is profitable. Up
to now the barrier to the development of cable TV services was the Lattelekom monopoly
on underground cables. To overcome this difficulty, TV programs were very often sent
through the air from the base station to an antenna located on a particular building,
connected with neighboring buildings by cables also going through the air. The same
solution is also used by Internet providers offering Internet access via radio-link. In fact,
under the conditions of the Lattelekom monopoly, Internet access via radio-link is a
very popular and widely used service.
As could be expected, some cable TV providers also offer Internet services, of
course only in the places where the cable TV network is available. For this a cable
modem is necessary. The service offered is an on-line connection with rather good
parameters (broadband).
The barrier to Internet through the cable TV network is rather high service prices. The
cable modem is also not cheap.
Internet Via Satellite
Access to the Internet is also offered by satellite TV providers (e.g. Unisat in Riga).
Two possibilities for Internet via satellite are offered. These are:
• Europe Online (the query goes to a specialized European proxy server and the query
result comes through satellite to the user);
• DiRECWAY (specialized equipment for up-link to satellite and a cable modem are
necessary).
The last possibility is convenient not only for individual access but also for
collective access (for not very large Intranets), especially in places where other kinds of
access are not available or are of bad quality.
The barrier to Internet through satellite is a rather high price for the service.
Policy Documents on Telecommunications and Access
This chapter covers a variety of policy documents at the national level, which
contain information on national policy directions and priorities, regulation and
legislation. These are the documents that must be highlighted as particularly
important or central to the topic T&A. From the T&A infrastructure considered above
and the analysis of policy documents and national data sources given below, we shall
try to identify the indicators which must be applied to the benchmarking of the main issues
of the topic T&A.
National Program "Informatics" 1999-2005, Ministry of Transport, 1998
This document is an action plan which outlines the Latvian way towards the
Information Society. It includes a large set of tasks which must be solved to achieve
the target. It is a complex program covering the time period 1999-2005 and consisting
of 13 subprograms. It foresees the realization of more than 120 individual projects
directed mainly to the implementation of information and communication technologies
in the different areas of society and individual lives, as well as wide international
cooperation in the integration of European data transmission networks and services.
Regarding the topic T&A the following subprograms must be selected:
• telecommunication networks and services (9 projects),
• information and data transmission networks (15 projects),
• development of information and telecommunication networks (3 projects),
• state significance statistics (4 projects).
The goal of the subprogram "Telecommunication Networks and Services" is the
reform of the telecommunication sector in Latvia according to principles accepted by the EU. A
plan of such reform is considered in detail.
In the subprogram "Information and Data Transmission Networks" the architecture
of existing data transmission networks in Latvia is considered and actions necessary for
further development of these networks are outlined.
In the subprogram "Development of Information and Telecommunication
Networks" a survey of prospective information and communication technologies is
given and a special role of network technologies (Intranet, Internet) is emphasized. The
state's role in the development of ICT is outlined.
The goal of the subprogram "State Significance Statistics" is the development of a
modern ICT-based statistical system in Latvia which is in accordance with the standards of
such a system in the EU.
The laws related to the topic T&A and considered above are results of the
implementation of the national program "Informatics", or more precisely, the legislation
part of this program.
Conceptual Guidelines of Socio-Economic Program "e-Latvia", Ministry of Economics, 2000
The document outlines the main objectives of the socio-economic program e-Latvia,
which appeared as a response to the e-Europe initiative and in fact can be considered a
further development of the national program "Informatics". The program e-Latvia
represents the Latvian contribution to e-Europe. The main objectives of e-Latvia are
as follows:
• significantly improve the access possibilities to Internet for citizens,
• based on improved computer literacy and training possibilities, provide for
everybody the use of modern information technologies,
• provide for everybody the access to global and national information resources,
• stimulate the development of e-commerce and e-government.
Analogously to the e-Europe action evaluation, e-Latvia is structured along three key
objectives:
• A cheaper, faster and secure internet;
• Investing in people and skills;
• Stimulating the use of internet.
For T&A the following issues from e-Latvia are of importance:
• Full digitalization of the telecommunication network,
• Full liberalization of the telecommunication market, opening of the market of leased
lines,
• Regulation of Internet sector and services, Internet services provider's responsibility
regarding data security,
• Installation of public Internet access terminals in every local government, school and
library,
• Provision of personal data security.
Thus the conceptual guidelines propose actions stimulating both the development of
telecommunications and access possibilities to the Internet. Regarding the topic T&A, two
projects involved in the e-Latvia action plan must be highlighted. These are the Latvian
education informatization system and the Unified information system for Latvia's
libraries. Both projects are interesting from the point of view of public Internet access
points (PIAP). Access to the Internet in schools is available to pupils and teachers and not
available to people from outside (restricted public access possibilities). In libraries such
access is available to everybody. Thus the implementation of both projects is very
important to extend the Internet access possibilities for the public.
Relevant National Statistical Sources Identification and Documentation
At first it must be noted here that Information Society statistics in Latvia are in their
beginning stage. Up to 2002 there was no systematic collection of statistical data on the
Information Society, though some aspects of such statistics were covered. The
development in the area is greatly stimulated by the eEurope+ initiative. The 2003 action
plan of the Central statistical bureau (www.csb.lv) (a governmental institution) serves as
proof of this statement, though the respective data will be available only at the beginning
of 2004. Therefore the main issues from this plan related to the topic T&A are not included
in the following table covering the main national and international data sources.
Every year data collected by the Central statistical bureau are published in "The
Statistical Yearbook". Important information regarding prices of corresponding
telecommunication services is given on the websites of the largest fixed and mobile
network operators. It should be noted that the service price is precisely the most important
barrier to wide penetration of the Internet into households. It is also clear that there is a
strong correlation between the number of computers and the number of Internet
connections (if we exclude WAP). Without a computer there is no need for an Internet
connection.
Name of data source (Acronym) | Main publication(s) of interest for SIBIS | Description (incl. target, survey unit) | Responsible
------------------------------------------------------------------------------------------------------------------------------
Computer usage in enterprises, Internet usage (number of computers, number of Internet connections, etc.), e-commerce | Data of Central statistical bureau | Response on e-Europe+ initiative, data collection is based on Eurostat module 491, survey questionnaires are used | Central statistical bureau
Cable TV network usage (number of customers, employees, financial figures, etc.) | Data of Central statistical bureau | Response on UNESCO initiative, data collection based on Eurostat module 493, survey questionnaire is used | Central statistical bureau
The statistical yearbook yyyy | Data of Central statistical bureau | Data on all topics characterizing the development of country | Central statistical bureau
Report on development of Latvia's national economy | Ministry of economics | Data on all topics characterizing the development of country | Ministry of economics
eEurope+ 2003, progress report, June 2002 | Joint High Level Committee | eEurope+ survey | Joint High Level Committee
Information society statistics, Data on candidate countries | Statistics in focus, Theme 4-17/2002 | Information society statistics | Eurostat, Author Richard Deiss
Information society statistics, Rapid growth of Internet and mobile phone usage in candidate countries | Statistics in focus, Theme 4-37/2001 | Information society statistics | Eurostat, Author Richard Deiss
Market of IT and telecommunications (percentage of digital network, number of mobile network customers, number of ISDN lines, etc.) | Department of Informatics at Ministry of Transport, website | Control of development of ICT sector | Ministry of transport
Equipment used to benefit from ICT services (number of telephones, modems, computers in households and business, etc.) | Department of Informatics at Ministry of Transport, website | Control of development of ICT sector | Ministry of transport
Website of Lattelekom | Data on largest fixed network services provider in Latvia | Available data important for topic T&A | Lattelekom
Website of LMT | Data on one (of two) mobile services provider in Latvia | Available data important for topic T&A | LMT
Website of Tele2 | Data on one (of two) mobile services provider in Latvia | Available data important for topic T&A | Tele2
Website of Latvia's Internet association | Data on Internet services customers | Available data important for topic T&A | LIA
Here we shall focus our attention on conventional indicators used to characterize
the topic T&A in Latvia. A list of such indicators is as follows.
1. Percentage of households that have fixed telephone service
2. Fixed lines per 100 inhabitants
3. Mobile phone subscriptions
4. Percentage of households with Internet access
5. Percentage of population regularly using Internet
6. Internet access costs
7. Number of personal computers
8. Internet users
9. Number of Public Internet Access Points per 1000 inhabitants
10. Number of Internet hosts
11. Computerized enterprises
12. Enterprises with access to the Internet
13. Type of Internet connection in enterprises
14. Enterprises with a home page on the Internet
15. Number of employees that at their work place regularly use a computer with
internet connection
16. Number of computers with access to the internet used by enterprises
17. Number of cable TV stations
18. Number of cable TV customers
19. Expenditure for cable TV business
20. Revenue from cable TV business
21. Information technology expenditure
22. Telecommunication expenditure
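Several of the indicators above (for example items 1, 2, 4 and 9) are simple ratios normalized
by population or by the number of households. A minimal Python sketch of that normalization
follows; the function names and all input figures are hypothetical placeholders chosen for
illustration and are not Latvian statistics.

```python
def per_inhabitants(count: int, population: int, base: int = 100) -> float:
    """Return `count` expressed per `base` inhabitants (e.g. lines per 100)."""
    if population <= 0:
        raise ValueError("population must be positive")
    return count * base / population


def percentage(part: int, total: int) -> float:
    """Return `part` as a percentage of `total` (e.g. households with Internet)."""
    return per_inhabitants(part, total, base=100)


# Hypothetical inputs, for illustration only.
fixed_lines_per_100 = per_inhabitants(count=300, population=1_000)
piap_per_1000 = per_inhabitants(count=5, population=10_000, base=1_000)
households_online_pct = percentage(part=120, total=1_000)

print(fixed_lines_per_100, piap_per_1000, households_online_pct)
```

Keeping the base as a parameter lets the same helper serve both the "per 100" and the
"per 1000" indicators.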
Glossary
CSB       Central statistical bureau
DSL       Digital Subscriber Line
EC        European Commission
EDI       Electronic Data Interchange
ESS       European Statistical System
EU        European Union
Eurostat  Statistical Office of the European Commission
FP5       5th Framework Program
ICT       Information and Communication Technologies
IDA       Interchange of Data between Administrations
IS        Information Society
ISDN      Integrated Services Digital Network
MS        Member States
NAS       Newly Accessing States
SIBIS     Statistical Indicators Benchmarking the Information Society
SMS       Short Message Service
T&A       Telecommunications and Access
UMTS      Universal Mobile Telecommunications System
WAP       Wireless Application Protocol
References
[1] eEurope+ 2003, Action Plan, prepared by the candidate countries with the assistance of the
European Commission, June 2001.
[2] National program "Informatics", Ministry of Transport, 1998 (in Latvian).
[3] Conceptual guidelines of socio-economic program e-Latvia, Ministry of Economics, 2000 (in
Latvian).
[4] www.sibis-eu.org.
LATVIJAS UNIVERSITATES RAKSTI, 2004, Vol. 669: DATORZINATNE UN INFORMACIJAS TEHNOLOGIJAS, pp. 99-106
Design Interpretation Principles in Development and Usage of Informative Systems
Janis Iljins
University of Latvia, Raina 29, Riga, Latvia
Janisj@lanet.lv
The paper summarizes the experience and ideas of the last five years in developing Informative
Systems by means of Informative System Technology (IST) and shows the application of these
ideas in practice.
One of the main methods discussed in the paper, the design interpreter method, allows exact
implementation of the system in compliance with the design. The paper suggests a system design
formalization method in which the design is interpreted during the operation of the system. The
IST environment acts as the design interpreter.
Other advantages of IST are described - modularity and simple maintenance, the possibility to
automate the system prototype development, to generate the system documentation automatically
and other advantages.
All the described methods are evaluated according to their practical application.
Key words: informative systems, IST, system prototype, maintenance.
Introduction
In order to develop and operate an Informative System (IS) according to the classic
IS "waterfall" life cycle, it is necessary to describe the problem statement or design from
the needs of the real world. It is the base of system implementation [1]. Typical
problems occurring in this approach are noncompliance of the design with actual requirements
and noncompliance of the implementation with the design. The latter most often occurs in the
phase of system implementation and maintenance, when errors are found in the implemented
system and eliminated without any changes in the design. In practice one often
meets systems whose design specification and system documentation are older than
the implementation. They do not contain information on the latest changes and
innovations in the system.
There are well-known ways to solve the above-mentioned problems. The
system prototype is usually used to check the compliance of the design with the
requirements. Before implementation of the system the prototype is tested and changes
can be easily made. As the system prototype is often automatically generated from the
design specification, it exactly complies with the design. To achieve compliance of the
system implementation with the design, code generation is usually used. In such a
case the design should be formalized by means that support code generation. Such means
are offered by the specification language GRAPES and its environment GRADE,
elaborated in Latvia [2][3]. Many other tools, such as RATIONAL ROSE, ORACLE
Designer 2000 [4], etc., are widely and successfully used in the world. As the design
specification does not contain information on implementation details, code generation
makes programming easier but does not eliminate it. It is possible to change the
generated system code, creating the danger that the changed code may not comply with
the design.
At the Bank of Latvia, development of the Currency Operation Management
Informative System (VOPIS) was started in 1995 under the leadership of Professor
M. Treimanis. It was clear from the beginning that the system would have to
operate in an environment in which requirements change very fast, and that it would be
necessary to make many changes during its maintenance. Therefore special attention
was paid to how to develop the system so that it is easy to change in the future
and how to ensure that the changes are easily reflected in the system documentation.
Professor M. Treimanis proposed developing the IST, containing the design method and
the environment supporting the method [5]. The specified design is stored in the
database by means of the method. The IST environment can interpret the design. Such an
approach allows us to:
• combine design and implementation: the system implementation directly uses
the design definition, so implementation exactly complies with the design;
• change the applied system easily: to change the system functionality it is, in
most cases, necessary to change only the data stored in the database;
• maintain system documentation corresponding to the implementation, as it can be
obtained automatically from the design definition.
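The three properties listed above can be illustrated with a minimal Python sketch of the general
idea of a design interpreter. It is not the IST environment itself: the table design_fields, its
columns, the form name "deal" and the functions below are hypothetical names introduced only for
this illustration. The design is kept as rows in a database, one routine interprets those rows at
run time, and another derives documentation from the same rows.

```python
import sqlite3


def open_design_db() -> sqlite3.Connection:
    """Create an in-memory database holding a tiny, hypothetical design."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE design_fields ("
        " form TEXT, field TEXT, label TEXT, required INTEGER)"
    )
    conn.executemany(
        "INSERT INTO design_fields VALUES (?, ?, ?, ?)",
        [
            ("deal", "currency", "Currency code", 1),
            ("deal", "amount", "Amount", 1),
            ("deal", "comment", "Comment", 0),
        ],
    )
    return conn


def missing_fields(conn: sqlite3.Connection, form: str, record: dict) -> list:
    """Interpret the stored design: list required fields absent from `record`."""
    rows = conn.execute(
        "SELECT field FROM design_fields"
        " WHERE form = ? AND required = 1 ORDER BY field",
        (form,),
    )
    return [field for (field,) in rows if record.get(field) is None]


def describe(conn: sqlite3.Connection, form: str) -> list:
    """Derive documentation lines from the same design rows."""
    rows = conn.execute(
        "SELECT field, label, required FROM design_fields"
        " WHERE form = ? ORDER BY field",
        (form,),
    )
    return [
        f"{field}: {label} ({'required' if required else 'optional'})"
        for field, label, required in rows
    ]


conn = open_design_db()
before = missing_fields(conn, "deal", {"currency": "EUR"})
# Changing behaviour means changing design data, not program code.
conn.execute("UPDATE design_fields SET required = 1 WHERE field = 'comment'")
after = missing_fields(conn, "deal", {"currency": "EUR"})
docs = describe(conn, "deal")
```

In this sketch the validation rule changes after a single UPDATE of the design data, and the
documentation lines returned by describe reflect the change without any separate editing step.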
IST was successfully used in VOPIS development and operation. Later IST became
the technology of all IS in the Bank of Latvia, where it is still used. IST is also
successfully used in the IS elaborated by Datorikas Instituts Ltd.
The following chapters of the paper show the design formalization methods offered by IST and
describe the project interpreter. Applications of different IST possibilities in real projects,
their advantages and disadvantages are analyzed.
System Design Method
An IS developed with this design method is operated by means of the design
interpreter - IST.
Developing a system by the IST method consists of three independent parts:
• database design,
• technology definition,
• development of specific applications.
The essence of these parts is described below.
Database Design
The target system database design is developed by means of the classic ER model.
Usually, in actual IST designs, Object Model notation [6] is used in the database
ER model diagrams.
IST also envisages registering the entities and relations of the ER model in the
technology database, although this is not compulsory. If the developed ER model is
registered in the technology database, it gives the following advantages:
• automatic data structure (table) formation (definition script generation) is possible;
• development of the system prototype even before development of the actual data
structure is possible;
• automatic system documentation generation;
• technology definition work is made easier.
Several real systems have been developed in which the target system database was
developed by means of classical relational database tools. In such cases these advantages
are not used and the ER model description (entities and relations) is not registered in the
technology database. Nevertheless, the system is successfully operated by means of the IST
interpreter.
Technology Definition
The basic principles and metamodel of technology definition are described in [5].
To develop an IS it is necessary to define the work technology of the organization. This is
done by combining various known methods described in [6] [7] [8] [9].
In order to define the technology the following questions should be answered:
• "Why?" - the tasks of the organization should be understood. IST envisages defining
the tasks and, for tasks that can be described in more detail, their subtasks. As a
result a hierarchic network is formed. The IST task module helps programmers
understand the target system, but it is not needed during operation. Therefore the
design interpreter does not use it.
• "Who?" - it should be clarified which structural units and which staff of the
organization are involved in implementing the task. The design interpreter uses
the staff information from the IST organization module to grant login rights
and to control the staff rights to work with the IS.
• "Where?" - according to the posts and responsibilities of the organization's staff
members, the necessary work places should be specified, together with the rights and
appearance needed for every work place. When starting a work session with the design
interpreter in the IST environment, the user identifies himself and chooses one of the
available work places for the session. The user is offered those IS possibilities
which the design envisages for the particular work place.
• "What?" - the design determines which database objects the user may process at the
particular work place. For this the IST view concept is introduced: an IST view is a set of
objects of a single object class of the target system. In the context of a work place a
hierarchic view network may be formed. As a view is a set of objects, a subview can
be defined as the set of objects related to the selected view object. The design
interpreter lets the user see the available views and the objects in them, and find
and select objects on which to perform operations.
• "How?" - in the context of a work place an operation can be performed on a
view or on an object in a view. When the operation is called, the design interpreter starts
the application specified in the design. The application receives a parameter from the IST
environment - the object selected in the view network.
• "When?" - for particular target system object classes state-transition diagrams can be
designed. The technological process for each object of such a class is started during the
work, and the object is set in a definite state of the state-transition diagram. The IST
environment offers the possibility to change the object state according to the
state-transition diagram. The view network should be designed so that the views containing
the objects in the needed state are available in the particular work places. The
operations that change the object state should be linked to these views.
The descriptions of the system users, work places, work place view networks and
operations linked to the views, together with the state-transition diagrams, form an
interpreted technology that is at the same time the IS design. To operate the IS, only the
design description has to be interpreted.
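The "When?" rule above (an object state may change only along the state-transition diagram stored in the design) can be illustrated with a minimal Python sketch. The names `StateDiagram` and `change_state`, the dictionary representation of an object, and the example states are assumptions made for illustration; they are not taken from the IST implementation.

```python
from dataclasses import dataclass, field


@dataclass
class StateDiagram:
    """Allowed transitions for one object class (hypothetical structure)."""
    transitions: dict = field(default_factory=dict)  # state -> set of next states

    def allow(self, source: str, target: str) -> None:
        # Register one edge of the state-transition diagram.
        self.transitions.setdefault(source, set()).add(target)

    def can_move(self, source: str, target: str) -> bool:
        return target in self.transitions.get(source, set())


def change_state(obj: dict, target: str, diagram: StateDiagram) -> None:
    """Change the object state only if the diagram contains the transition."""
    if not diagram.can_move(obj["state"], target):
        raise ValueError(f"transition {obj['state']} -> {target} is not in the diagram")
    obj["state"] = target


# Hypothetical diagram for a "deal" object class.
deal_diagram = StateDiagram()
deal_diagram.allow("entered", "confirmed")
deal_diagram.allow("confirmed", "settled")

deal = {"id": 1, "state": "entered"}
change_state(deal, "confirmed", deal_diagram)
```

In this sketch the diagram is plain data, so changing the allowed transitions means editing stored data rather than code, which mirrors the idea that the design description itself is interpreted.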
Application Development
Specific target system applications are developed separately. They differ for every
system. Therefore they are designed separately and independently of the organization
technology used by IST.
The principle of modularity is observed in the development of applications - every
specific function of the system has its own application. Typically every target system
object class has its own application to register the objects of this class and to edit their
attributes.
An IST operation indicates which specific application to call and in which specific
mode. All applications must use a predefined command line interface
by means of which the IST environment conveys information to the application about the
user's activities - which objects the user has selected in the IST environment and wants
to pass as parameters to the application.
Design Interpreter Implementation
Implemented Metamodel
The interpreted design description is formalized by the IST metamodel work place
module [5] (Figure 1).
Figure 1: Part of the IST Workplace Module (entities: Work Place, Employee, View, Operation, Message)
The notion of the Workplace is used to define the employee's rights and duties.
More formally, the IST Workplace Module defines
• Workplace
o The Workplace entity attributes are the name and the comment. This
entity is used as a placeholder to define the employees' rights and duties.
• Message
o The main attributes of the Message entity are the text (for instance, "Warning:
FX deals are waiting confirmation!"), the icon representing the message and a
SELECT statement. It is possible to use arbitrary SELECT statements
with a set of predefined variables. ISTE executes this statement at
defined time intervals. The message text appears if the SELECT statement,
when executed, finds some object.
• Set of technological relations
o the relation between the Employee and the Workplace entities allows
employees to participate in the organization's technological process with
a predefined set of rights, which will be defined below;
o the relation between the View and the Workplace entities allows the
employee who works in a particular Workplace to get access only to
predefined sets of Views and Objects;
o the relation between the View, the Operation and the Workplace entities
allows the employee to execute a predefined set of Operations with Views
and objects in a particular Workplace;
The relation between the Message and the Workplace entities allows the employee
to receive Messages about predefined events which are connected with the Workplace.
Usually these messages are related to a necessity to process some object that "arrives"
in the Workplace in some View. The relation between the Message and the View
entities allows the employee to find those objects easily.
IST interprets these technological relationships and thus makes employees
perform the organization's tasks according to the predefined OTM.
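The Message mechanism described above (a stored SELECT statement is executed periodically, and the message text is shown when the statement finds some object) can be sketched as follows. This is a minimal illustration in Python with an in-memory SQLite database; the table `fx_deal`, the variable name `workplace` and the function `pending_messages` are hypothetical and are not part of IST.

```python
import sqlite3


def pending_messages(conn, messages, variables):
    """Return the texts of the messages whose SELECT finds at least one row.

    `messages` is a list of (text, select_sql) pairs as they might be stored in
    the design; `variables` holds the predefined variables bound into the SQL.
    """
    shown = []
    for text, select_sql in messages:
        row = conn.execute(select_sql, variables).fetchone()
        if row is not None:
            shown.append(text)
    return shown


# Hypothetical target-system table and one stored message definition.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fx_deal (id INTEGER PRIMARY KEY, state TEXT, workplace TEXT)")
conn.execute("INSERT INTO fx_deal (state, workplace) VALUES ('waiting', 'back_office')")

messages = [
    (
        "Warning: FX deals are waiting confirmation!",
        "SELECT id FROM fx_deal WHERE state = 'waiting' AND workplace = :workplace",
    ),
]

print(pending_messages(conn, messages, {"workplace": "back_office"}))
```

A real shell would call such a function on a timer to refresh the message window; the refresh interval and the set of predefined variables would come from the design description.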
Functional Shell
The Functional Shell is a program - a design interpreter that operates according to the
metamodel described above and interprets the system design description coded in it.
The program controls user access to the database, thus controlling operation of
the system. The program ensures work in a particular work place available to the user. Its
windows show the view network - views and the objects they contain, which may have linked
subviews. This controls what the user sees and can process from the database.
When a particular object or view is selected, the operations by which the selected object
can be processed are shown. This indicates how the processing is done. After a definite
time interval the message window is refreshed. In it the user can see when to do the
appropriate activities. The main window of the Functional Shell as seen by the
system user is shown in Figure 2.
Figure 2. Main window of the Functional Shell.
IST Additional Possibilities
Besides the Functional Shell, other possibilities can be realized using the
system design entered in the database. One such possibility is data structure
generation. If the technology description defines the system database entity types, the
relations between them and the entity attributes, it is possible to develop a
program that transforms the design description coded in the database into SQL
CREATE TABLE statements, thus creating a real database structure.
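The data structure generation idea can be illustrated with a short Python sketch that turns a registered entity description into CREATE TABLE statements. The dictionary layout of `design`, the entity names and the convention that every reference points to an `id` column are assumptions made for this example; the paper does not specify how IST stores or generates these definitions.

```python
def create_table_sql(entity, attributes, references=()):
    """Build one CREATE TABLE statement from a registered entity description.

    attributes: list of (column_name, sql_type) pairs
    references: list of (column_name, referenced_entity) pairs; the referenced
                entity is assumed (for this sketch) to have an `id` key column
    """
    columns = [f"    {name} {sql_type}" for name, sql_type in attributes]
    for column, target in references:
        columns.append(f"    {column} INTEGER REFERENCES {target}(id)")
    return f"CREATE TABLE {entity} (\n" + ",\n".join(columns) + "\n);"


# Hypothetical ER model as it might be registered in the technology database.
design = {
    "client": {
        "attributes": [("id", "INTEGER PRIMARY KEY"), ("name", "VARCHAR(100)")],
        "references": [],
    },
    "deal": {
        "attributes": [("id", "INTEGER PRIMARY KEY"), ("amount", "DECIMAL(15, 2)")],
        "references": [("client_id", "client")],
    },
}

script = "\n\n".join(
    create_table_sql(name, spec["attributes"], spec["references"])
    for name, spec in design.items()
)
print(script)
```

Entities are emitted in registration order here, so a referenced entity must be registered before the entity that refers to it; a fuller generator would sort the entities by their dependencies.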
It is possible to enter example objects for every entity type into the technology
database. The system design, including work places and the views they
contain, can then be made so that the objects shown in the views are example objects from the
technology database. Thus it is possible to develop a system prototype even before the real
database structure is developed. The prototype is operated by the same design interpreter
by means of which the implemented system will be operated later. So the prototype
coincides precisely with the system implementation.
The design description that can be interpreted or used for database structure
generation can also be printed in a prepared documentation template. This requires a
program that reads the design description in the database and passes it to a tool like
Visio or MS Word.
Information on the reports that need to be printed from the target system can be entered into
the technology database. The technology database stores information on the data needed
for the report (an SQL SELECT statement returning the data) and information on how the
data should be shown in the report templates. So a program - a report interpreter - can be
implemented by means of which most of the system reports can be developed.
IST Application Examples
Experience
ISTechnology is currently used in the Bank of Latvia as the standard framework for
building Information Systems. Besides the Bank of Latvia, between 1996 and 2001 the
ISTechnology framework was used in many other organizations to build and maintain
complex Information Systems (Figure 3).
Information System                                        Organization
ISTechnology (as described in this paper)                 Computer and Software Engineering Institute Ltd.
Salaries Management IS                                    Bank of Latvia
Employees Management IS                                   Bank of Latvia
Commercial Banks Management IS                            Bank of Latvia
Trade and Treasury Management IS for Central Bank         Bank of Latvia
Trade and Treasury Management IS for Commercial Bank      Unibank of Latvia
Trade and Treasury Management IS for Investment fund      Optimus fund
Patient Register IS                                       Health Department of Riga Municipality
Pension Capital Management IS for Private Pension Fund    Unipensija
Figure 3. Examples of ISTechnology usage.
Application Evaluation
The IST possibilities were applied most completely in the Bank of Latvia. All
systems were designed by the IST method and are operated by the design interpreter - the IST
Functional Shell. Several system development environments were created in one IST
environment. Some of the IST modules were assessed as successful. But for some of the
implemented modules the application in practice did not prove to be useful. The
functions of these modules were carried out by other tools, and the modules were not actually
used. Thus the data structures were not generated by means of IST, as the applied SQL
Server software includes sufficiently easy-to-use structure creation tools, and there was
no need to use the IST tools. System documentation generation and system prototyping
before data structure generation were also not justified in practice because too much
configuration work was needed for them.
The main reason why these IST modules were not used is the complicated configuration
work needed to enter the technology definition in the database. Handy editors are
lacking. It would be worth developing them, but it is a time-consuming process.
The IST module that proved to be the most efficient is the design interpreter
described in this paper. What makes IST unique is the method of technology definition, in
which the work place configuration is described by indicating the work place view network
and the linked operations, together with the interpretation of the technology by the
Functional Shell. Systems developed and operated in this way are easy to use and maintain.
Therefore IST is used in more and more new projects.
In some systems developed and operated by IST, especially in Unibank of Latvia,
the system dynamic model, on which the IST state-transition and message
module is based, and the report generation module are successfully used. Both these modules
are additional to the Functional Shell and help to make system maintenance even easier.
Future Perspective
IST is built in a client-server architecture and is used to develop client-server IS.
Nevertheless, other technical solutions are available today. The possibility of
developing IST into an Internet-based IS development and implementation tool is being
considered. New IST versions are being built in a 3-level architecture - database,
middleware and user interface. This would improve IS data security and would
allow one system to use different user interfaces; for
example, users could work simultaneously with one system from the MS Windows
environment and from the Internet.
New data presentation possibilities to replace the trees shown in the windows of the
current implementation are also being sought. One of the solutions is to bring the
appearance of the IST windows as close as possible to that of the generally accepted MS
Windows Explorer.
Conclusion
The paper describes a new approach to system design implementation - design
interpretation. This approach ensures exact compliance of the system implementation with
the design description and makes the system easy to maintain. Implementation of the
above-mentioned method is also discussed.
The method is applied in practice in real projects. It should be developed further
according to today's technologies, and new easy-to-use versions, in which the application of
all IST modules would be justified, should be elaborated.
References
[1] Bruce I. Blum. Software Engineering: A Holistic View. Oxford University Press, 1992.
[2] Treimanis M., Krasts O., Nazartchik I. GRADE. GRAPES Development Environment. University
of Latvia, 1992.
[3] Bicevskis J. Specification Language GRAPES/PLUS. University of Latvia, 1992.
[4] Oracle Designer/2000. Product Overview. Oracle Corporation.
[5] Treimanis M. ISTechnology - Technology Based Approach to Information System Development.
University of Latvia, 1997.
[6] James Rumbaugh et al. Object-Oriented Modeling and Design. Prentice-Hall, 1991.
[7] Sowa J.F., Zachman J.A. Extending and Formalizing the Framework for Information Systems
Architecture. IBM Systems Journal, Vol. 31, No. 3, pp. 590-616, 1992.
[8] David R. Hampton. Contemporary Management. McGraw-Hill, 1981.
[9] Capers Jones. CASE's Missing Elements. IEEE Spectrum, June, pp. 38-41, 1992.
LATVIJAS UNIVERSITATES RAKSTI. 2004. 669. sej.: DATORZINATNE UN INFORMACIJAS TEHNOLOGIJAS. 107.-116. lpp.
Semantics and Equivalence of UML Class Diagrams
Girts Linde
IMCS, University of Latvia,
glinde@acm.org
One and the same "real world" can be modeled by different UML class diagrams, which in such a
case can be considered "intuitively equivalent". A new approach to the formalization of this
"intuitive equivalence" of class diagrams is proposed, based on the formal object-oriented
cognitive process. Two theorems are proved to support this approach. The new formalization can
be used to construct algorithms for class diagram analysis.
Key words: UML, class diagrams, equivalence.
Introduction
As object modeling gains popularity, starting from the pioneering work [1], a need
emerges for methods of handling and evaluating models. Object models are used in
various stages of system development - from definition of requirements to system design.
One problem is to validate the model as it gets transformed and refined in the lifecycle of
system development. Research is already being done on this topic [2-5]. A more general
problem is to compare several alternative object models of the same "real world".
Our goal is to formalize this "intuitive equivalence" for object models represented
with UML class diagrams. Consider the class diagrams modeling directed graphs shown
in Figure 1.
Fig. 1. Three class diagrams of a directed graph (classes: Node, Edge, Startpoint, Endpoint; associations: Starts at, Ends at, Edge to, Connected to, Starts with, Ends with)
Formally, we have here 3 totally different class diagrams. Each of them describes
different instance diagrams. But intuitively we know that all three of these diagrams
describe the same "real world" - directed graphs.
In [6] a formal definition of the "intuitive equivalence" of class diagrams is proposed,
called reduction. It is defined on the instance diagram level and uses composite classes to
define the transformation between the class diagrams. In this paper a new approach to the
definition of the "intuitive equivalence" of class diagrams is proposed, called semantical
implication. It is based on the formalization of the object-oriented cognitive process
introduced in [7]. Two theorems are proved confirming that the new approach is more
general and intuitive.
The paper is structured as follows. In section 2 a restricted formalization of the
cognitive process is outlined. In section 3 the new definition of the "intuitive equivalence"
is presented. In section 4 the reduction of class diagrams is briefly outlined. And finally in
section 5 the difference between the two definitions of the "intuitive equivalence" is
examined, proving two theorems.
The formalization of the cognitive process
The definition of the class diagram semantical implication is based on a restricted
version of the formal cognitive process (for the full version see [7]) happening in the mind
of a modeler when he analyzes a fragment of the real world and builds an object model for
it. The fragment of the real world to be modeled, together with the object-oriented
structure that gets created in the mind of the modeler, is represented with a labeled directed
graph, which we call the cognition state (Fig. 2). During the cognitive process, as the modeler
learns new concepts from the fragment of the real world, his mind is populated with new
elements that correspond to objects, classes, links and associations. With every addition of
a new element a new cognition state is created. The resulting sequence of cognition states
we call the cognition path. The cognition path ends with a cognition state that contains the
final object model as a part of it (Fig. 3).
Fig. 2. A cognition state (parts: state of the abstract world, state of the real world)
Fig. 3. Overview (snapshot of the real world, cognition start state, state of the abstract world)
A more precise description of the cognition state and the cognition path follows in the
next two subsections.
Cognition state
The cognition state is defined as a labeled directed graph that consists of two
interconnected parts:
• the state of the real world, which formally represents the fragment of the real world
that is being modeled.
• the state o f the abstract world, which formally represents the object model created in
the modeler's mind on a fixed moment in time.
Each node of the state of the abstract world has one of four labels, according to the
element of the object model the node represents: O - object node, C - class node, L - link
node, A - association node. (Note that links and associations are represented as nodes in
the graph.) The following predefined edges are used to connect the nodes (they have a
graphical notation, as shown in Fig. 4):
• startpoint (S) - connects a link or association node to its startpoint node,
• endpoint (E) - connects a link or association node to its endpoint node,
• instance-of - connects a class or association node to the nodes representing its
instances,
• part-of - connects an object node to nodes representing the parts of the object.
Fig. 4. Notation of predefined edges (startpoint - S, endpoint - E, instance-of, part-of)
Fig. 5. A cognition start state (a state of the real world with an empty start state of the abstract world)
We define the cognition start state as a cognition state that consists of a state of the
real world and an empty state of the abstract world (Fig. 5).
Cognition path
During the cognitive process, a cognition path is incrementally created. It is a
sequence of cognition states, beginning with a cognition start state. Given a cognition
state, the next cognition state is produced by extending the state of the abstract world in it
- by adding a new node or by adding one of the predefined edges between the nodes. As
in [6], in this paper we work with basic class diagrams containing classes, binary
associations and the four most popular multiplicity constraints (0..1, 1..1, 1..*, 0..*). So
the cognitive process can be restricted to use only the needed steps. We call the resulting
cognition path a constructive cognition path.
Definition. We call a cognition path constructed using the following steps a
constructive cognition path (CCP):
• "create a new aggregate object" (Fig. 6), which represents a subgraph of the real
world state: create a node with label O, and connect it to all nodes of the subgraph by
the edge part-of;
• "create a new class": create a node with label C (a class node);
• create an edge instance-of from an existing class node to any existing object node that
has no instance-of edges connected to it;
• "create a new association": create a node with label A (an association node), connect
it to two existing class nodes by the edge S and the edge E;
• "create a new link" (Fig. 7): create a node with label L (a link node), connect it to two
existing object nodes by the edge S and the edge E;
• create an edge instance-of from an existing association node to any existing link node
that has no instance-of edges connected to it; this is allowed only if the nodes
connected by the link (nodes to which edges S and E go) are connected by instance-of
edges to the class nodes connected by the association (with edges S and E,
respectively).
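The six construction steps above can be sketched as operations on a small Python class, which may help to see what each step checks. This is a simplified illustration rather than the formalization from [7]: the real world state is reduced to a set of node names (its own edges are ignored), the class `CognitionState` and its method names are hypothetical, and each call corresponds to one step of a constructive cognition path.

```python
class CognitionState:
    """Abstract-world part of a cognition state over a fixed real-world state."""

    def __init__(self, real_nodes):
        self.real = set(real_nodes)  # nodes of the real world state
        self.labels = {}             # abstract node -> "O", "C", "L" or "A"
        self.part_of = {}            # object node -> set of real-world nodes
        self.ends = {}               # link/association node -> (startpoint, endpoint)
        self.instance_of = {}        # object or link node -> class or association node

    def _add(self, node, label):
        if node in self.labels or node in self.real:
            raise ValueError(f"node {node!r} already exists")
        self.labels[node] = label

    def _require(self, node, label):
        if self.labels.get(node) != label:
            raise ValueError(f"{node!r} is not an existing {label} node")

    def new_object(self, node, parts):
        """Create an aggregate object node with part-of edges to real nodes."""
        parts = set(parts)
        if not parts or not parts <= self.real:
            raise ValueError("parts must be a non-empty set of real-world nodes")
        self._add(node, "O")
        self.part_of[node] = parts

    def new_class(self, node):
        self._add(node, "C")

    def new_association(self, node, start, end):
        self._require(start, "C")
        self._require(end, "C")
        self._add(node, "A")
        self.ends[node] = (start, end)

    def new_link(self, node, start, end):
        self._require(start, "O")
        self._require(end, "O")
        self._add(node, "L")
        self.ends[node] = (start, end)

    def classify_object(self, obj, cls):
        """Add an instance-of edge from a class node to an unclassified object."""
        self._require(obj, "O")
        self._require(cls, "C")
        if obj in self.instance_of:
            raise ValueError(f"{obj!r} already has an instance-of edge")
        self.instance_of[obj] = cls

    def classify_link(self, link, assoc):
        """Add an instance-of edge from an association node to an unclassified
        link; the link ends must be instances of the association's end classes."""
        self._require(link, "L")
        self._require(assoc, "A")
        if link in self.instance_of:
            raise ValueError(f"{link!r} already has an instance-of edge")
        (s, e), (cs, ce) = self.ends[link], self.ends[assoc]
        if self.instance_of.get(s) != cs or self.instance_of.get(e) != ce:
            raise ValueError("link ends do not match the association's classes")
        self.instance_of[link] = assoc


# A short constructive cognition path over a three-node real world state.
state = CognitionState({"r1", "r2", "r3"})
state.new_object("a", {"r1"})
state.new_object("b", {"r2", "r3"})
state.new_class("Node")
state.classify_object("a", "Node")
state.classify_object("b", "Node")
state.new_association("ConnectedTo", "Node", "Node")
state.new_link("l1", "a", "b")
state.classify_link("l1", "ConnectedTo")
```

Using the class and association names as node identifiers corresponds to assigning names to the class and association nodes, so the final `instance_of` and `ends` mappings describe an instance diagram.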
Fig. 6. Adding an object
Fig. 7. Adding a link
If we assign names to the class and association nodes of a given CCP, then we can
construct the corresponding instance diagram.
The new definition of the "intuitive equivalence"
Consider a real world state R, and two CCPs P1 and P2 starting from this real world
state R. If P2 can be obtained from P1 by aggregating some of the objects and classes while
preserving some properties of the structure, then we assume that these cognition paths
model the same concept at different levels of detail. A more precise definition follows.
Definition. We say that P2 is less detailed than P1, if the following 4 conditions hold:
1) if two nodes of R in P1 are connected by part-of edges to one and the same object
node (Fig. 8 a), then in P2 they are also connected by part-of edges to one and the same
object node (Fig. 8 b),
Fig. 8. Condition for object nodes ((a) cognition path P1, (b) cognition path P2)
2) if there are two nodes a and b in R that are in P1 connected by part-of edges to
object nodes c1 and d1, respectively, and there is a link node connected by S edge to c1
and by E edge to d1 (Fig. 9 a), then in P2 they are connected by part-of edges to one
object node (Fig. 9 b) or they are connected by part-of edges to object nodes c2 and d2,
respectively, and there is a link node connected by S edge to c2 and by E edge to d2 (Fig.
9 c),
Fig. 9. Condition for link nodes ((a) cognition path P1, (b) and (c) cognition path P2)
3) if there are two nodes a and b in R that are in P1 connected by part-of edges to
object nodes c1 and d1, respectively, and the object nodes c1 and d1 are connected by
instance-of edges to the same class node (Fig. 10 a), then in P2 they are connected by
part-of edges to one object node (Fig. 10 b) or they are connected by part-of edges to object
nodes c2 and d2, respectively, and the object nodes c2 and d2 are connected by
instance-of edges to the same class node (Fig. 10 c),
Fig. 10. Condition for class nodes ((a) cognition path P1, (b) and (c) cognition path P2)
4) if there are 4 nodes a, b, c and d in R, and P1 contains 4 object nodes a1, b1, c1 and
d1, 2 link nodes e1 and f1, and an association node g1, connected as in Fig. 11 a, then in
P2 there are two cases possible:
• a and b are connected by part-of edges to one and the same object node, and c and d
are connected by part-of edges to one and the same object node (Fig. 11 b)
• or there is the same configuration as in P1 - there are 4 object nodes a2, b2, c2 and d2,
and 2 link nodes e2 and f2, and an association node g2, connected as in Fig. 11 c.
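As an illustration of how such conditions could be checked mechanically, the following Python sketch tests only condition 1 (the condition for object nodes). It assumes, for this example only, that the final state of each cognition path is summarized by a mapping from object nodes to the sets of real-world nodes they aggregate; the function names are hypothetical, and conditions 2 to 4 are not covered.

```python
from itertools import combinations


def share_object(part_of, a, b):
    """True if real-world nodes a and b are parts of one and the same object node."""
    return any(a in parts and b in parts for parts in part_of.values())


def condition_1(real_nodes, part_of_p1, part_of_p2):
    """Condition 1 of 'P2 is less detailed than P1': every pair of real-world
    nodes aggregated into one object in P1 is also aggregated into one object in P2."""
    return all(
        share_object(part_of_p2, a, b)
        for a, b in combinations(sorted(real_nodes), 2)
        if share_object(part_of_p1, a, b)
    )


# Hypothetical example: P2 merges the two objects of P1 into a single object.
R = {"a", "b", "c"}
p1 = {"o1": {"a", "b"}, "o2": {"c"}}
p2 = {"x": {"a", "b", "c"}}

print(condition_1(R, p1, p2))
print(condition_1(R, p2, p1))
```

In this example P2 aggregates more than P1, so the condition holds from P1 to P2 but fails in the opposite direction, which matches the reading of "less detailed" as coarser aggregation.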
Fig. 11. Condition for association nodes ((a) cognition path P1, (b) and (c) cognition path P2)
Now, using this method of comparing cognition paths, we define the "intuitive
equivalence" of class diagrams, in this paper called semantical implication.
Definition. We will say that a class diagram CD1 semantically implies a class
diagram CD2, if for every state of the real world R:
• if there exists a CCP P1 starting from R whose corresponding instance diagram ID1
satisfies CD1,
• then there exists a CCP P2 starting from R that is less detailed than P1 and whose
corresponding instance diagram ID2 satisfies CD2.
Reduction of class diagrams
In [6] the "intuitive equivalence" of class diagrams is formalized as a reduction. The
reduction of the more detailed diagram to the less detailed one is defined by introducing a set
of concepts. Every concept is represented with a composite class containing some of the
existing classes and associations, and assigning multiplicity constraints to the enclosed
classes. The set of new concepts must cover all the classes of the class diagram.
The multiplicity constraints assigned to the contained classes must guarantee that the
composite classes have only connected instances. When reducing an instance diagram
with a set of concepts, each group of objects and links that forms an instance of one of the
concepts is replaced with a new object, and so a new (less detailed) instance diagram is
constructed.
Definition. We will say that a class diagram D1 can be reduced to a class diagram
D2, if there exists a set of new concepts S using which every instance diagram of D1
can be reduced to an instance diagram of D2, and all instance diagrams of D2 can be
constructed this way from instance diagrams of D1.
The relation between the two definitions
In this section we show that the semantical implication is more general than the
reduction.
Theorem 1. If a class diagram D1 can be reduced to a class diagram D2, then D1
semantically implies D2.
Proof.
Consider
• a class diagram D1 that can be reduced to D2 by a set of concepts S,
• a state of the real world R, for which a constructive cognition path P1 exists starting
from R, whose corresponding instance diagram ID1 satisfies D1,
• an instance diagram ID2 that is reduced from ID1 using the set of concepts S.
To prove the theorem we show that we can construct a constructive cognition path P2
starting from R that is less detailed than P1 and whose corresponding instance diagram is
ID2.
First we construct the last cognition state X of P2 whose corresponding instance
diagram is ID2 and which is based on the same state of the real world R:
• group the object and link nodes and the class and association nodes into subgraphs
according to the new concepts in S,
• replace each object-link subgraph with an object node, and each class-association
subgraph with a class node,
• replace multiple instance-of edges between two nodes with one instance-of edge,
• replace multiple link nodes (together with S and E edges) that are between the same
object nodes, and that are connected by instance-of edges to the same association
node, with one link node (and one pair o f S and E edges).
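The object-level part of the steps above can be sketched in code. This is only a simplified illustration, not the construction from [6]: it assumes the grouping of objects into concept instances is already given (the hypothetical `group_of` mapping), it ignores class and association nodes, and the names `Link` and `contract` are invented for this example.

```python
from collections import namedtuple

# A directed link between two objects, typed by the name of its association.
Link = namedtuple("Link", ["source", "target", "association"])


def contract(objects, links, group_of):
    """Collapse each group of objects into one aggregate object.

    objects  -- iterable of object names
    links    -- iterable of Link tuples between those objects
    group_of -- dict mapping every object name to the name of its group
                (assumed to be given; finding the groups is not shown here)
    """
    # One aggregate object per group.
    aggregates = sorted({group_of[o] for o in objects})
    # Every original object becomes a part of its aggregate.
    part_of = {o: group_of[o] for o in objects}
    # Links inside one group are absorbed by the aggregate; parallel links
    # of the same association between two aggregates are merged into one
    # (the set removes the duplicates).
    reduced = set()
    for link in links:
        src, dst = group_of[link.source], group_of[link.target]
        if src != dst:
            reduced.add(Link(src, dst, link.association))
    return aggregates, part_of, sorted(reduced)
```

For example, four nodes split into two groups, with two parallel links from the first group to the second, yield two aggregate objects joined by a single link in that direction.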
Now it is easy to construct a constructive cognition path P2 that starts with the state of
the real world R and ends with the state X. The cognition path P2 is constructed in the
following order:
1) create aggregate object nodes with part-of edges (in any order),
2) create link nodes with S and E edges (in any order),
3) create class nodes (in any order),
4) create instance-of edges between object nodes and class nodes (in any order),
5) create association nodes with S and E edges (in any order),
6) create instance-of edges between link nodes and association nodes (in any order).
The set of new concepts S by definition is constructed so that every class and
association of D1 belongs to exactly one new concept. Keeping that in mind it is easy to
see that the new cognition path P2 is less detailed than the cognition path P1 (all four
conditions are satisfied).
Theorem 2. There exist two class diagrams D1 and D2, such that
• D1 semantically implies D2, but
• D1 cannot be reduced to D2.
Proof.
Consider the two class diagrams in Fig. 12. They both describe directed graphs. The
difference is that D1 allows us to model the structure of graphs, but for D2 every object
instance is a whole graph.
[Figure: class diagram D1 (class Node with the association Edge-to) and class diagram D2]
Fig. 12. Two class diagrams of a directed graph
Let's prove that D1 semantically implies D2. Consider an arbitrary state of the real
world R, for which there exists a CCP P1 whose corresponding instance diagram satisfies
D1. We can construct a CCP P2 that starts from the same state of the real world R and
contains one object node, which is connected by part-of edges to all the nodes of R, and
which is connected by an instance-of edge to a class node. The corresponding instance
diagram of P2 satisfies D2. It is easy to see that P2 is a less detailed CCP than P1.
Now, let's see if D1 can be reduced to D2. It can be proved that the only new concept
that can be constructed for the diagram D1 is a trivial composite class that contains the
class Node, but doesn't contain the association Edge-to. The association Edge-to cannot be
included in a new concept together with Node, because it forms a cycle, and cycles are not
allowed inside the concept (that comes from the requirement that a new concept must
have only connected instances). Therefore, D1 can be reduced only to D1, but cannot be
reduced to D2.
Conclusion
This paper presents a new approach to formalizing the "intuitive equivalence" of UML class
diagrams. It is enabled by the semantics of class diagrams that is based on
the formalization of the object-oriented cognitive process. Two theorems are proved to
compare the two definitions of the class diagram "intuitive equivalence". The theorems
confirm that the new definition is more general: the reduction is a special case of the
semantical implication.
In fact, Theorem 2 can be generalized to prove that any class diagram semantically
implies the trivial class diagram containing one class. That is a fundamental property of
system modeling. These theoretical results show that the semantical implication is now
very close to the concept of the class diagram "intuitive equivalence". The next problem
that must be researched is the algorithmic decidability of the class diagram semantical
implication, as has already been done for the reduction in [6].
Finally, it must be noted that this paper gives another confirmation of the sufficient
generality of the object-oriented cognitive process formalization. It is shown that the
formalization can be successfully used to reason about the problems of class diagram
equivalence.
References
[1] J. Rumbaugh, M. Blaha, W. Premerlani, F. Eddy, W. Lorensen. Object-Oriented Modeling and Design.
Englewood Cliffs, NJ: Prentice Hall, 1991.
[2] A.S. Evans, R.B. France, K.C. Lano, B. Rumpe. The UML as a Formal Modelling Notation.
UML'98. LNCS, Vol. 1618, Springer, 1999.
[3] K. Lano, J. Bicarregui. Semantics and Transformations for UML Models. UML'98. LNCS, Vol.
1618, Springer, 1999.
[4] A.S. Evans. Reasoning with UML Class Diagrams. WIFT'98. IEEE CS Press, 1998.
[5] K.C. Lano, A.S. Evans. Rigorous Development in UML. FASE'99, ETAPS'99. LNCS, Vol. 1577,
Springer, 1999.
[6] G. Linde. Reduction of UML Class Diagrams. In: Proceedings of the Fifth International Baltic
Conference on Databases and Information Systems. Tallinn, 2002.
[7] G. Linde. The Cartoon Method of Semantics Definition for Modeling Languages. Submitted to:
Proceedings of the Latvian Academy of Sciences. Riga, 2003.
LATVIJAS UNIVERSITATES RAKSTI. 2004. 669. sej.: DATORZINATNE UN INFORMACIJAS TEHNOLOGIJAS. 117.-125. lpp.
Requirements and Options for Data
Warehouses at Universities
Laila Niedrite
Lecturer at the University of Latvia
Raina blvd. 19, Riga, Latvia, +371 7034439
lnied@lanet.lv
This paper deals with the problems of the changing role of traditional universities on the
educational market, and how data warehousing could help the universities solve their problems. It
describes the experience of different universities and the issues of feasibility study of the data
warehousing project at the university.
Key words: data warehouses, requirements, universities.
Introduction
Data warehousing is traditionally used in business. Higher education is trying to
follow the growing trend in development and usage of data warehouses; however, the
reasons for doing it are often more concerned with scientific and educational issues than
with financial benefits. Higher educational institutions have grown in recent years into
large businesses and the management of these institutions has changed and become
more business-like. The management of higher educational institutions can benefit from
using data warehousing just as other traditional business organizations.
The topic of data warehousing [5], [13], [4], and [7] comprises architectures,
algorithms, models, tools, organizational and management issues for integrating data
from several operational systems in order to provide information for decision support,
e.g., using data mining or OLAP tools. The data warehousing technology provides
integrated, consolidated and historical data. A data warehouse can be realized as a
separate database containing these integrated data.
A data warehouse is built by extracting data from different data sources,
transforming them and then loading into a separate database where specialized data
modelling concepts, e.g. star schema, and database structures like materialized views
[14] are used. Finally, the data are accessed with different data analysis tools as shown
in Figure 1.
[Figure: data source systems (Students, Financing, HRM, ...); the data warehouse target database; data access and analysis (data mining, statistics, reporting)]
Figure 1. Common data warehousing environment
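The extract, transform and load steps described above can be illustrated with a minimal sketch. It is not taken from any university system: the two source extracts, the table `student_fees` and all function names are hypothetical, and an in-memory SQLite database stands in for the separate warehouse database.

```python
import sqlite3

# Hypothetical extracts from two operational systems (students and financing).
STUDENT_ROWS = [("s1", " phys "), ("s2", "MATH"), ("s3", "math")]
FEE_ROWS = [("s1", "120.50"), ("s2", "80"), ("s3", "99.5")]


def extract():
    """Read the raw rows from the source systems (here: constants)."""
    return STUDENT_ROWS, FEE_ROWS


def transform(students, fees):
    """Clean faculty codes, convert amounts to numbers, and join both sources."""
    fee_by_student = {sid: float(amount) for sid, amount in fees}
    return [
        (sid, faculty.strip().upper(), fee_by_student.get(sid, 0.0))
        for sid, faculty in students
    ]


def load(conn, rows):
    """Write the integrated rows into the warehouse table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS student_fees ("
        "student_id TEXT PRIMARY KEY, faculty TEXT, fee_paid REAL)"
    )
    conn.executemany("INSERT OR REPLACE INTO student_fees VALUES (?, ?, ?)", rows)
    conn.commit()


def run_etl(conn):
    students, fees = extract()
    load(conn, transform(students, fees))


def fees_per_faculty(conn):
    """A simple analysis query over the integrated data."""
    return conn.execute(
        "SELECT faculty, SUM(fee_paid) FROM student_fees "
        "GROUP BY faculty ORDER BY faculty"
    ).fetchall()
```

After `run_etl(sqlite3.connect(":memory:"))`, the analysis query sees one consistent faculty code per faculty, although the student source spelled the codes in different ways.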
This paper gives further insight into the business processes of higher educational
institutions and examines some key benefits of having a data warehouse in business
with respect to higher education.
The following section presents the business process model for higher educational
institutions and university management issues in the new competitive environment. An
overview of possible areas of application of data warehousing in education is given in
Section 2. Section 3 presents the possible system architecture for the new application
area, and the relevant experience of universities using data warehousing. Section 4
draws conclusions on the data warehouse feasibility study at the University of Latvia.
The Education Process as a Business Process
The starting point for finding out whether higher education institutions need to
build and use data warehousing could be their traditional management and business
processes. The two main university activities are education and research, and the most
important support processes are human resource management and financial resource
management [11].
In the last decade universities have experienced significant changes in the education
area. One of the important issues is the growing competition between educational
institutions.
Colleges and universities rarely express their policies, intentions, and practices in
competitive terms. However, the pressure on traditional resources coupled with the
emergence of technology-based education delivery systems will force competitive
thinking [6].
If we consider education as business, we have to define the key terms in this
context. The universities as "education providers" have to find out what the product is,
what the resources are, and what the price of the product is. The students as "customers"
have different important questions: from which institution to buy, and which product exactly
to buy? Finally, the customer will have to know about the quality and price of the
product to be able to make the right decision.
There are many examples of using the market terminology in higher education now.
One of the best known cases in this area is the University of Phoenix, which focuses on
educational needs of adults. The new trend is the formation and growth of corporate
educational institutions, many of which have become universities with accredited study
programs in recent years. For example, Motorola University co-operates with
universities around the world to develop and deliver courses for the Motorola
Corporation employees.
The most challenging issues for university management in the new competitive
environment are [12]:
• Public relations,
• Competition, including universities abroad,
• Market analysis,
• Strategic planning,
• Revenues/expenditure analysis,
• Calculation of expenses,
• Cooperation with traditional businesses.
Alongside developing their policies in the new competitive environment,
universities are looking for methods to help them evaluate the answers and to support
them and their customers in making the right decisions.
Higher education will face many problems in the coming years and IT will play a
major role in determining how institutions solve these problems [15]. Many
universities are looking at data warehousing with the hope that it can help them to solve
their problems.
The reasons why data warehouses are developed and successfully used in
management and decision support in traditional business areas like sales, banking and
others are the following:
• the data warehouse makes the information in the organization more accessible,
• it integrates data from many operational data sources and thus ensures quick access to
the information about business activities of the whole organization,
• the data are user-friendly and consistent.
As mentioned above, the universities' main activities are education,
research and management, together with the new business activities, that is, processes
that have become important for the existence and development of universities. Therefore,
we can assume that universities, like businesses, can benefit from data warehousing.
The following section deals with some scenario examples for data warehouse
development at universities, which are connected with the traditional higher education
institution's processes (scenarios 3 and 4) and also with the new business processes
(scenarios 1 and 2).
Specifications for the University Data Warehouse
There are some attractive university data warehouse implementation scenarios.
Components of them have been implemented at European and American universities
[1], [2], [3].
Data warehouse implementation scenario 1.
The first specification is about attracting good students. With the ever escalating
costs and competition associated with higher education, universities and colleges are
very interested in attracting and retaining high-quality students [8].
A data warehouse can enhance the existing information system that handles
admissions to university study programmes, providing access to the required
information. One example data warehouse star schema for the admission process is
shown in Figure 3. The star schema reflects the current administrative processes for
student admission in the IS of the University of Latvia.
[Figure: star schema with a central Admission fact table and the dimension tables Time, Study Programme, Faculty, and Applicant.
Time: Time key (PK), Date, Month, Year, Academic year
Study Programme: Programme key (PK)
Faculty: Faculty key (PK), Faculty_code
Applicant: Applicant key (PK), First name, Family name, Date of birth, Student number, Sex, Address
Admission fact: Time admiss (FK), Time exam1 (FK), Time exam2 (FK), Time results (FK), Programme (FK), Applicant (FK), Faculty (FK), Status, Grade1, Grade2, Priority
Further attribute labels in the figure: Number student, Degree, No_of_applicants, No_of_approved, Max_grade_approved, Start effective date]
Figure 3. Example data warehouse star schema for admissions.
For example [1], if the admission algorithm is based on school grades and a selected
number of study programmes in priority order, the information about the admission
process in previous years could help applicants avoid mistakes in choosing the
programmes, mistakes that arise because they do not fully understand the selection algorithm.
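How such historical admission information could be obtained from a star schema is sketched below. The sketch uses only a small subset of the tables in Figure 3 and is built on assumptions: the column `programme_name`, the status value 'approved', the use of the sum of the two grades as the admission score, and all sample rows are invented for illustration.

```python
import sqlite3

# A reduced star schema: one fact table and three of its dimensions.
SCHEMA = """
CREATE TABLE time_dim (
    time_key INTEGER PRIMARY KEY, date TEXT, month INTEGER,
    year INTEGER, academic_year TEXT);
CREATE TABLE programme_dim (
    programme_key INTEGER PRIMARY KEY, programme_name TEXT);
CREATE TABLE applicant_dim (
    applicant_key INTEGER PRIMARY KEY, first_name TEXT, family_name TEXT);
CREATE TABLE admission_fact (
    time_results_fk INTEGER REFERENCES time_dim(time_key),
    programme_fk INTEGER REFERENCES programme_dim(programme_key),
    applicant_fk INTEGER REFERENCES applicant_dim(applicant_key),
    status TEXT, grade1 REAL, grade2 REAL, priority INTEGER);
"""

# Applicants, approved applicants and lowest approved score
# per programme and academic year.
HISTORY_QUERY = """
SELECT p.programme_name,
       t.academic_year,
       COUNT(*) AS no_of_applicants,
       SUM(f.status = 'approved') AS no_of_approved,
       MIN(CASE WHEN f.status = 'approved'
                THEN f.grade1 + f.grade2 END) AS min_approved_score
FROM admission_fact f
JOIN programme_dim p ON p.programme_key = f.programme_fk
JOIN time_dim t ON t.time_key = f.time_results_fk
GROUP BY p.programme_name, t.academic_year
ORDER BY p.programme_name, t.academic_year
"""


def build_sample_warehouse():
    """Create the schema in memory and load a few invented rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany("INSERT INTO time_dim VALUES (?, ?, ?, ?, ?)", [
        (1, "2002-07-20", 7, 2002, "2002/2003"),
        (2, "2003-07-21", 7, 2003, "2003/2004"),
    ])
    conn.execute("INSERT INTO programme_dim VALUES (1, 'Computer Science')")
    conn.executemany("INSERT INTO applicant_dim VALUES (?, ?, ?)", [
        (1, "A", "One"), (2, "B", "Two"), (3, "C", "Three"), (4, "D", "Four"),
    ])
    conn.executemany("INSERT INTO admission_fact VALUES (?, ?, ?, ?, ?, ?, ?)", [
        (1, 1, 1, "approved", 8, 9, 1),
        (1, 1, 2, "rejected", 5, 6, 1),
        (2, 1, 3, "approved", 7, 8, 2),
        (2, 1, 4, "approved", 9, 9, 1),
    ])
    conn.commit()
    return conn


def admission_history(conn):
    return conn.execute(HISTORY_QUERY).fetchall()
```

An applicant could compare the lowest approved score of previous years with his or her own grades before ordering the programme priorities.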
Data warehouse implementation scenario 2
The next specification is to support customer relationship management (CRM)
strategies.
'CRM aligns business processes with customer strategies to build customer loyalty
and increase profits over time.' (Harvard Business Review, 2002).
Each year colleges and universities confront the challenge of meeting higher levels
of service demanded by an increasingly sophisticated and competitive marketplace and
consumer base. At the same time there is an ever present need to cut costs and tighten
budgets. Therefore, it becomes more and more important for each service interaction to
be effective in both cost and outcome. In order to do this, a greater understanding of the
customer is critical [9].
Universities are now challenged with the needs of many different groups of
learners. The universities are working with multi-generational students, and
correspondence students are also expecting new services. Another important issue is that
learners in the information age expect to get information when and how they choose.
Universities must now evaluate their services taking into account that the students
are customers. Universities are starting to develop new strategies to manage their
customer relationships in various stages of the student's lifecycle. The universities have
to think about new services in each phase of the lifecycle to support students' different
needs on the way towards graduation.
An effective CRM strategy has some additional outcomes: reduced costs, if
self-service is supported with web applications; improved document flow in the
institution; and influence on new technologies and investments.
Data warehouse implementation scenario 3
This scenario focuses on the improvement of the study process [2], for example, the
assessment of study courses concerning the student and course performance. The
evaluation of detailed statistics such as the number of students enrolled, evaluated and
approved, the average mark, and the average approved mark can help to find out the
potential problems for a particular course before they arise. It helps better understand
the student needs, or in business terms, what are the students buying?
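The course statistics named above can be computed as in the following sketch. The input format (one mark per enrolled student, `None` for a student who was not evaluated) and the pass mark of 4 are assumptions made for this illustration and do not come from [2].

```python
from statistics import mean


def course_statistics(marks, pass_mark=4):
    """Summarise one course.

    marks     -- one entry per enrolled student; None means "not evaluated"
    pass_mark -- lowest mark counted as approved (assumed value)
    """
    evaluated = [m for m in marks if m is not None]
    approved = [m for m in evaluated if m >= pass_mark]
    return {
        "enrolled": len(marks),
        "evaluated": len(evaluated),
        "approved": len(approved),
        # Averages are undefined when nobody was evaluated or approved.
        "average_mark": mean(evaluated) if evaluated else None,
        "average_approved_mark": mean(approved) if approved else None,
    }
```

A large gap between the number of enrolled and evaluated students, or between the two averages, is the kind of signal that could point to a problem with a particular course.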
Another issue is the quality assessment of the study programme. Statistics on
admissions, new students, drop-outs and graduates and average studying duration before
graduation can give the faculty management the necessary feedback and can influence
how the programme works and changes.
Another possible way to conduct course and study programme evaluation is by
getting information from student questionnaires and by publication of aggregated data
for all participants: students, lecturers and managers.
Data warehouse implementation scenario 4
This scenario describes resource planning at the university, for example, planning
the teaching service. If the managers of the faculty plan the necessary number of classes
for a particular course, they must know the statistics about the previous year. It can help
to avoid empty or full classes and to find out the workload of teaching staff.
Another example is Human Resources, where saving all transactions made with
university personnel records can help plan the careers of employees and make the right
decisions about new recruitments.
The financial resources of a university in relation to the size of the student body,
as well as scientific results over several years, are also very significant issues to analyze.
One less obvious example [3] for this scenario aims to measure the international
attractiveness of the institution. The purpose is to control the institution's policy on its
international reputation and exchanges, with budget involved.
Data Warehouse and University Information Technology Architecture
Because many of the university data warehousing scenarios mentioned above are
concerned with Customer Relationship Management (CRM), it is necessary to
understand the role of the data warehouse in the university IT architecture and its
connection with CRM applications.
The Connect Enterprise Architecture [9] also fits universities and illustrates, as shown
in Figure 4, how CRM integrates many existing applications into a unified
customer-facing strategy.
[Figure: user groups Students, Employees, Alumni, and Management; components Customer relationship management, Data mining, Reporting, and Information analysis; layers Application Integration and Data Integration; Data warehouse and IS data source]
Figure 4. Data warehouse connections with CRM applications.
The layer of Enterprise Application Integration provides access to Back Office
applications (ERP, Student IS, ...) from various solutions: data mining, CRM
applications, Portals, Knowledge Management applications.
The integration of Back Office Applications' data can be solved by using a data
warehouse. The data warehouse also serves the reporting and analytical needs of an
institution, so the data warehouse can be a part of CRM solution which focuses on front
office activities, e.g., customer service and support.
The present developments in universities in the field of data warehousing differ
widely in their decisions on which architecture and tools to use, and also which business
processes to support with data warehousing solutions.
Among higher education institutions in the USA, less than 25% have or are
planning a data warehouse [10]. Only a limited number of database tools are used.
The choice mostly depends on existing environments in other projects at the
university. This is also true for client tools. The most popular are the ORACLE database
and Web-based client-side tools.
Only a very small number of universities have all their information sources
integrated into a corporate warehouse. Most are starting with students' data, human
resources or finance data.
Many data warehousing projects in universities have been initiated in the IT
department, and not all of them have obtained management sponsorship yet.
No similar survey has been done in Europe, and to assess the situation with data
warehouses in European universities one has to look up the information on the Internet
and the annual conference 'European Universities Information Systems' (EUNIS). The
number of participating universities from different countries is usually around 100 every
year, but during the last 5 years, the number of presentations on data warehouses was
less than 10. For example, in 1999 there were 2 presentations, from Ljubljana in
Slovenia and from Porto in Portugal. The most significant data warehouse project in
Europe among universities is in France, where the project is developed on a national
scale. All French universities which had previously implemented the university
information system APOGEE (also a national-scale project) can now participate in the
data warehouse project.
The Data Warehouse Feasibility Study at the University of Latvia
As an example of a university starting the development of the university data
warehouse we can consider the University of Latvia. The following issues are
important at the starting point of the project:
The existence and quality of data sources
In the case of the University of Latvia, an information system was developed based
on the Oracle database and Oracle Application Server. During the 5 years of the system
development and implementation process, a large amount of various data has been
collected. It is now possible to provide access to operational data, but the possibility to
analyze historical data in different ways is limited. One of the questions for the
evaluation of the feasibility of the data warehouse is the technical feasibility, which
means the existence of data for the expected data warehouse and the quality of existing
data. At the University of Latvia the management information system (LUIS in Latvian)
has been developed since 1996. The starting year of the students' module was 1997, of
the study fees module 1998, of the courses and course enrolment module 2000,
and of the admission module 1998. The data accuracy depends not
on the module but on the faculty, and in some cases on a particular
programme of study. The existing data are accurate enough, but in some cases the lack of
data is obvious. The second possible data source for the data warehouse at the
University of Latvia in addition to LUIS is the accounting system.
Reasons for having a data warehouse
For example, one reason for starting a feasibility study of a data warehousing
project at the University of Latvia is the cumbersome access to the necessary information in the
operational systems. Another reason is the growth of competition among universities
and the new roles of traditional universities. One more question regarding readiness is
the business motivation of the institution. The business motivation of the University of
Latvia is based on the changing situation in education, which was described earlier in
this paper. The management of the University of Latvia is interested in a more flexible
reporting facility and in the analysis of the integrated information, and the most pressing need
is a new portal. The idea of the portal fits CRM solutions and also fits the data
warehouse as a data integrator.
Users' acceptance
Prior to starting a data warehouse project it is important to understand the readiness
of the institution to have a data warehouse [7]. One of the questions for evaluation is the
existence of data for the expected data warehouse. Another important question is the
readiness of people to use the data warehouse information. At the University of Latvia
the users have long experience in the usage of operational systems, and if the data
warehouse also satisfies the reporting needs of the users, the acceptance of the data
warehouse will depend on the correctness of extracted data and reports. The users were
also involved in the requirements analysis.
The decision between different development scenarios and the right choice for the
first development stage
The possible scenarios regarding specifications have to be presented to the
university management and discussed. The information about users' needs and priorities
helps to make the right decision. In our case study we collected information about the
number of potential users and about the purposes of the usage. We evaluated this
information to get quantitative measurements expressing the more urgent needs. A subset
of the problems from the fourth implementation scenario mentioned earlier, concerning the
human resources and the financial resources of the university in the context of the
number of students and scientific results, was accepted as the starting point of the
implementation.
Conclusions
Universities are not the typical owners of a data warehouse system, because
data warehouse projects need financial investments which, in the case of universities, are not
obviously comparable with the gained results in terms of money. Only in the long term
are the expected return on investment or the growing competition the reasons why
universities have a data warehouse or are starting to develop one. According to
their primary business functions and goals, universities build data warehouses to improve
the quality of education and communication with their customers, the students. The
improvement of the existing reporting facility of operational systems and improvement
of data accessibility can be mentioned as a secondary reason.
References
[1] Bajec, M., Rupnik, R., Krisper, M. Using Data Warehouses in University Information
Systems, Proceedings of conference EUNIS99, Espoo, Finland, 1999.
[2] David, G., Ribeiro, L. Getting Management Support from a University Information System,
Proceedings of conference EUNIS99, Espoo, Finland, 1999.
[3] Desnos, J.F. A National Data Warehouse Project for French Universities, Proceedings of
conference EUNIS2001, Berlin, 2001.
[4] Hammer, J., Garcia-Molina, H., Widom, J., Labio, W., Zhuge, Y. The Stanford Data
Warehousing Project.
[5] Inmon, W.H. Building the Data Warehouse, John Wiley, 1996.
[6] Katz, R.N. Competitive Strategies for Higher Education in the Information Age, Proceedings
of conference EUNIS99, Espoo, Finland, 1999.
[7] Kimball, R., Reeves, L., Ross, M., Thornthwaite, W. The Data Warehouse Lifecycle Toolkit,
John Wiley, 1998.
[8] Kimball, R., Ross, M. The Data Warehouse Toolkit: The Complete Guide to Dimensional
Modeling, John Wiley, 2002.
[9] KPMG Consulting, Managing Customer Relationships in Higher Education, 2002.
[10] NCHEMS, Information Architecture: the Data Warehouse Foundation, CAUSE/EFFECT,
Volume 20, Number 2, Summer 1997, pp. 31-33, 38-40, 60.
[11] Sinz, E.J., Böhnlein, M., Ulbrich-vom Ende, A. Konzeption eines Data-Warehouse-Systems
für Hochschulen. In: Workshop "Unternehmen Hochschule" (Informatik'99, 29.
Jahrestagung der Gesellschaft für Informatik, Paderborn, 5.-9. Oktober), 1999, S. 111-124.
[12] Stonis, J. Augstākā izglītība kā pakalpojumu tirgus dalībnieks, LU konference, 2003.
[13] Widom, J. (ed). Special Issue on Materialized Views and Data Warehousing, IEEE Data
Engineering Bulletin, June 1995.
[14] Yang, J., Karlapalem, K., Li, Q. Algorithms for Materialized View Design in Data
Warehousing Environment, Proceedings of the 23rd International Conference on Very Large
Data Bases, Athens, Greece, August 1997.
[15] Zastrocky, M.R. The Information- and Technology-Enabled University of 2004,
Proceedings of conference EUNIS99, Espoo, Finland, 1999.