NO 1, 1995 - Ericssonhistory.com
Transcription
NO 1, 1995

APZ 21220 - The New High-end Processor
Measuring Quality of Service in Public Telecommunications Networks
AXE 10 System Processing Capacity
Using Predictions to Improve Software Reliability
Test Marketing of Mobile Intelligent Network Services
AXE 10 Dependability

CONTENTS No. 1, 1995 • Vol. 72

Cover: To meet the great demand for processing capacity in telecommunications systems, Ericsson manufactures proprietary ASICs in VLSI technology. The photo shows one of the ASICs for the latest version of the AXE 10 central processor.
Photos by Labe Allwin, Karl-Evert Eklund, Nina Reistad, Malcolm Brow, Olav Hammero

APZ 21220 - The New High-end Processor  5
Measuring Quality of Service in Public Telecommunications Networks  13
AXE 10 System Processing Capacity  22
Using Predictions to Improve Software Reliability  30
Test Marketing of Mobile Intelligent Network Services  36
AXE 10 Dependability  42

Previous issues 1994

No. 1 / No. 2
A New Standard for North American Digital Cellular
RBS 884 - A New Generation Radio Base Station for the American Standard
A 10 Gbit/s Demonstrator
A Prototype Demonstrating User Mobility and Flexible Service Profiles
New-generation True Pocket Phones
The Exchange Manager - An Operation & Maintenance System for the Switched Network
Operations Support System for CME20
Implementation of UPT - Universal Personal Telecommunication
An Integrated TMN Solution - Eripax and TMOS
DCT 1800 - A DECT Solution for Radio Access Application
Fibre to the Home Field Trial in Ballerup, Denmark

No. 3
An Information-Based Approach to Engineering Telecommunication Networks
Integrated Photonics for Optical Networks
Ericsson's Turnkey System for GSM 900 and DCS 1800 Networks - CME 20
An Optical Transport Network Layer - Concept and Demonstrator
In-House Plant for Submicron
Intelligent Network Architecture in the Japanese Digital Cellular Standard - PDC

No. 4
ATM Traffic Management at the Initial Deployment of B-ISDN
Re-defining Management Systems - The TMOS Architecture
Evolution Trends in Wide Area Paging
Network Traffic Management (NTM) Using AXE and TMOS Systems
MOMS - An Operations System for MINI-LINK

Ericsson Review © Telefonaktiebolaget L M Ericsson, Stockholm 1995 • Publisher Hakan Jansson • Editor • Editorial Board • Editorial staff Eva Karlstein • Address: Telefonaktiebolaget L M Ericsson, S-126 25 Stockholm, Sweden • Fax +46 8 681 2710 • Published in English and Spanish with four issues per year.

Ericsson Review No. 1, 1995

CONTRIBUTORS in this issue

Terje Egeland, Technical Coordinator and Chief Designer at Ericsson Telecom's core unit Basic Systems. He received an MSc in Applied Physics from the Royal Institute of Technology, Stockholm, in 1981.

Ragnar Huslende, Product Manager at Ericsson AS' Network Management Systems department in Oslo; as a senior engineer, he is engaged in QoS measurements and object-oriented modelling for the Telecommunications Management Network (TMN). He also participates in standardisation work in ETSI and ITU. Ragnar Huslende received an MSc in 1974 and his doctor's degree in 1983 from the Telecommunications Engineering Department of the Norwegian Institute of Technology. In 1978 and 1979 he was a visiting scholar at the University of California, Los Angeles.

Leif Hakansson, Product Manager at Ericsson Telecom's core unit Basic Systems, responsible for the AXE 10 control system (regional and central processors). He has been appointed Senior Expert in control system capacity. In 1970, Leif Hakansson received his MSc in Electrical Engineering from the Royal Institute of Technology, Stockholm.

Bjorn Kihlblom, Section Manager, responsible for Traffic and Systems Dimensioning at Ericsson Telecom's Systems Management department. He holds an MSc in Applied Mathematics, awarded by the Royal Institute of Technology, Stockholm, in 1988.
Hans Lundberg, Program Manager for the APZ System at Ericsson Telecom's Program Management department. Hans Lundberg received his MSc in Electrical Engineering from the Royal Institute of Technology, Stockholm, in 1983.

Camilla Nord, member of the Software Reliability Team and editor of the Software Reliability network newsletter at Ericsson Telecom's core unit Basic Systems. She has been engaged in research on methods for fault prediction in SW and in the definition, verification and prediction of SW reliability. In 1993, Camilla Nord received an MSc in Industrial Engineering and Management from the Linkoping Institute of Technology.

Ojvind Johansson, Systems Engineer, member of the Software Reliability Team at Ericsson Telecom's core unit Basic Systems. He has been engaged in research on methods for fault prediction in SW and on software reliability, as related to release criteria. Ojvind Johansson holds an MSc in Engineering Physics, awarded by the Royal Institute of Technology, Stockholm, in 1993.

Rima Qureshi is Technical Project Manager at the Systems Design department of Ericsson Research Canada. She holds Bachelor's degrees in Management and Computer Science and has completed 75% of the courses required for an MSc in Business Administration.

Stephen Crombie, Marketing Manager at the Cellular Systems department of Ericsson Communications Ltd, New Zealand, is responsible for the marketing and sales of cellular infrastructure and services to Telecom Mobile. Stephen Crombie holds an MSc in Technology Management, awarded by the University of Sussex, England.

Tina Sutton, Senior Product Manager at Telecom Mobile Communications Ltd, New Zealand, is responsible for the development, launch and management of value-added services in Telecom Mobile's cellular network. Tina Sutton has an MA from Massey University in New Zealand and an MA in Library and Information Science from the University of Hawaii.
Karl-Axel Englund, Senior Specialist in dependability engineering at Ericsson Telecom's Network and Systems Characteristics department, currently engaged in reliability and maintainability analysis in the APZ 212 20 project. In 1969, Karl-Axel Englund graduated from Eskilstuna Upper Secondary Technical School, specialising in control systems engineering. Additional courses include mathematical statistics and reliability engineering.

APZ 21220 - The New High-end Processor for AXE 10

Terje Egeland

The demand for processing capacity in telephone systems doubles every four years, due to increased use of existing services, new service offerings and a rising demand for operations support. In AXE 10, this demand is met by introducing dedicated regional processors, optimising compiler and instruction primitives, enhancing software and increasing capacity in the central computer system. The author describes how this is achieved in the latest version of the central processor in AXE 10, through new and faster technology, dedicated hardware logic, and enhanced architectural features.

APZ 212 20 is the latest AXE 10 control system in the high-performance APZ 212 series. It includes a new central processor (CP) with substantially increased capacity. The previous CP generations for APZ 212 were released in 1985 (APZ 212 02) and in 1990 (APZ 212 10). The low and medium capacity ranges for AXE 10 are served by the APZ 211 control system. From an application point of view, all variants of the APZ control system are highly compatible. The only fundamental difference is in the implementation of the central processor subsystem. Thus, both APZ 212 11 and APZ 211 11 can be replaced by APZ 212 20.

Fig. 1 The IPU board with the ASICs CPC and UMC plus the PSCM and CM-E memories
APZ consists of a number of different processors, such as the regional processors RP, RPD and EMRPD, I/O processors, and the central processor, CP. The central processor is part of a subsystem, CPS, which forms part of the APZ control system. The main focus of this article is on the CP hardware. A general description of the APZ system was given in Ericsson Review, No. 3, 1990.

OBJECTIVE
The objective of the new CP was to increase the capacity of AXE 10 to meet the expanding needs of the market, primarily caused by rapidly increasing demands for new and enhanced services. The improvement, it was deemed, must at least quadruple the capacity of APZ 212 10. Another requirement was full backward compatibility with other APZ versions, thereby allowing existing program code to run without having to recompile it. The changeover to the new APZ processor must also be possible with a minimum of disturbance to the traffic, and without taking the system out of operation. As in previous versions of APZ, it must also be possible to upgrade installed software while the APZ is in full operation.

Originally, the central processor, the chosen instruction set and the language PLEX in AXE 10 were designed together, to provide telecom applications with maximum processor performance. Since then, the language, the compiler, the instruction set and the hardware have been developed further, with an aim to optimise performance. The use of a language, hardware and instruction set that have been optimised to work together yields superior capacity, compared with general-purpose computer systems.

Fig. 2 The central processing system consists of two identical processors, CP-A and CP-B, each with its own instruction processor unit, IPU, signal processor unit, SPU, regional processor bus handler, RPH, program store, PS, and data store, DRS. The system's parallel-synchronous operation is supervised by the maintenance unit, MAU
Because Ericsson and its customers have made large investments in AXE 10 application software, they want to be able to run existing software on new processors.

CONCEPT
The methods used to increase capacity are:
- to take advantage of faster components
- to achieve higher speed through integration
- to provide more extensive HW support
- to increase clock frequency.

One advantage of a proprietary processor is that it can be customised to perform specific tasks. In such cases it is necessary to know exactly what is executed in the machine. To determine this, the executed instructions in different AXE 10 applications were recorded in exactly the same order and frequency as they appear in calls. A model of the architecture was then built, using a simulation language. In this model it was possible to change parameters, such as queue length, cache size, memory access principles, etc. By applying instructions from the recorded calls to the model, it was possible to optimise the processor architecture and provide the necessary hardware support to achieve the highest possible capacity for switching applications.

ARCHITECTURE
The new architecture is based on proven features from APZ 212 10 and APZ 211, to which some new concepts have been added. At the top level, APZ is made up of two equivalent CP sides supervised and controlled by a maintenance unit, MAU, Fig. 2. Each CP side in APZ 212 is built up of three different modules: the regional processor bus handler, RPH, the signal processor unit, SPU, and the instruction processor unit, IPU.

Fig. 3 Hardware structure of the IPU and the SPU in APZ 212 20
ACC Address calculation circuit
ALU Arithmetic and logic unit
BAH BAS address handler
BAS Base address store
CMAI CPC maintenance interface
CM-E Control memory - external
CM-I Control memory - internal
CONM Control memory
CPC Central processor circuit
DRS Data and reference store
EXDB External data bus
IPI IPU interface
IPU Instruction processor unit
JAM Jump address memory
JBU Job buffer unit
LMU Load and measurement unit
MAI Maintenance interface
MAU Maintenance unit
MCC Memory control circuit
MCU Microprogram control unit
OPAB Operand A bus
OPBB Operand B bus
PCU Priority control unit
PS Program store
PSCM Program store cache memory
PSH Program store handler
RESB Result bus
RPB Regional processor bus
RPH Regional processor handler
RPHI RPH interface
SPU Signal processor unit
TCU Timer and counter unit
TRU Trace unit
UMB Update and match bus
UMU Update and match unit
VDSH Variable and data store handler

One RPH is used for each regional processor bus, instead of a single handler for all buses, as in APZ 212 10. The RPHs are connected to the SPU by an RPH bus, RPHB. Today, the maximum number of RPHBs in APZ 212 20 is thirty-two, but this number can be increased and the associated hardware expanded if the need arises. The SPU transmits and receives signals from the RPHs, analyses and assigns priority to incoming signals, and prepares them for execution in the IPU.

In the IPU, the basic structure used in APZ 212 10 is retained, with two operand buses (OPAB and OPBB) and one result bus (RESB); a fourth bus, EXDB, is added, Fig. 3. Each bus carries 32 bits plus parity. The EXDB connects a number of different ASICs, while OPAB, OPBB and RESB are internal buses contained within an ASIC. These buses are visible when data is being processed in the machine.
However, to obtain an efficient flow of instructions, there are also separate address and data buses to the dedicated program store, PS, and two similar buses to the data and reference store, DRS. The IPU is built up of eleven different logical modules. Seven of them are placed in one ASIC, called CPC. Nine different memories are used to ensure high execution speed.

Fig. 4 Instruction and data flow for an SCC instruction (subtract character constant from register) (left) and an RS instruction (read from store) (right)
ALU Arithmetic unit
BAH Base address handler
BAS Base address store
CMI Central memory - internal
DRS Data and reference store
PSCM Program store cache memory
PSH Program store handler
RMU Register memory unit
MCU Microprogram control unit
VDSH Variable and data store handler

Another way to describe the structure is to sketch the flow of an instruction as it is processed in the machine, Fig. 4. The flow of a simple instruction, like SCC (subtract character constant from register), starts by calculating the address in the program memory (PSCM or PS). This step takes one cycle, followed by one cycle to access the PSCM; the instruction is then prepared and decoded in the program store handler, PSH, which requires two cycles. In the actual execution cycle, a read is made in the register memory, RM; the constant is subtracted in the arithmetic unit, ALU, and data is written back to the RM, all in one cycle. In this case, the instruction flow takes five cycles. However, due to the prefetch mechanism described below, and the pipelining of the flow, the effective load to execute the SCC instruction lasts only one cycle.
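The cycle counts above lend themselves to a simple throughput model. The sketch below is a toy calculation, not a model of the actual APZ hardware; it only illustrates why a five-cycle instruction flow costs an effective one cycle once the pipeline is filled.

```python
# Toy model (not the APZ microarchitecture): cost of executing n
# SCC-like instructions through a five-stage instruction flow.

def cycles_unpipelined(n_instructions, depth=5):
    # Without pipelining, each instruction occupies the machine for
    # all `depth` cycles before the next one can start.
    return n_instructions * depth

def cycles_pipelined(n_instructions, depth=5):
    # With a filled pipeline, the first instruction takes `depth`
    # cycles and every following one completes one cycle later.
    if n_instructions == 0:
        return 0
    return depth + (n_instructions - 1)

print(cycles_unpipelined(1000))  # 5000 cycles
print(cycles_pipelined(1000))    # 1004 cycles, i.e. roughly 1 per instruction
```

For the longer RS flow described next, the same model applies with depth set to eleven, which is why keeping the pipeline filled matters even more for variable accesses.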
Box A  Characteristics of CP hardware
Call handling capacity: four times that of APZ 212 10, eight times that of APZ 211 10
Data store: max 4 Gword of 16 bits
Program store: max 256 Mword of 16 bits
Max number of RPs: 256,000

An efficient pipeline becomes even more important for instructions that require variable access, like RS (read from store). Cycles one to three, for this type of instruction, are identical with those just described. In cycle four, the address of the base address store, BAS, is calculated in the base address handler, BAH. In cycle five, access is gained to BAS. In cycle six, the address of the data and reference store, DRS, is calculated. Access to DRS is then gained in cycles seven to nine; error correction and variable extraction are executed in cycle ten, and, finally, the data is moved to the register memory, RM, in cycle eleven. If the prefetch mechanism has filled the pipeline, only cycle eleven will load the system. For the prefetch mechanism to be efficient, all logic and bus resources must be independent of each other. In APZ 212 20, the pipelining permits up to eleven different instructions at a time.

Description of hardware
As shown in Fig. 2, each CP side is divided into three hardware blocks: RPH, SPU and IPU. Each of these blocks can be described as an independent processor. The RPH, which sends and receives signals on the RPB and to and from the SPU, has its program coded in hardware. The SPU analyses and prepares incoming signals, and assigns priority to these signals before they are sent to the IPU. The task to be performed by the SPU is more complex, and it is therefore controlled by a microprogram in a random access memory. The purpose of the RPH and SPU is to decrease the load on the IPU, which executes the application software. The IPU is thus the bottleneck in the processing system, which explains why the greatest emphasis in the design of APZ 212 20 was placed on that hardware block.

Fig. 5 Logical description of the instruction handler in the program store handler. Data from the program store is sorted in the input buffers, decoded in the instruction decoder and then stored in the SQ (sequential queue) or BTQ (branch target queue), depending on which order is given by the queue manager

Prefetch of variables
The data store, DRS, must be capable of storing a large amount of data. Due to the size and cost of the memory, DRAMs are used. These are not as fast as other logic, which makes the DRS relatively slow. To reduce the effect of a slow data store, access to the DRS must be gained early, by fetching the instructions from the program store, PS, well in advance of execution.

Instructions are read from the program store, PS, continuously and more or less independently of how fast they are executed. If an instruction is fetched from the PS before it is needed, it is stored in a six-position FIFO queue, called the sequential queue, SQ, Fig. 5. The instruction is decoded before it is stored in the SQ. If it uses variables, the names of these variables (the a-parameter) are extracted and sent to the BAH for addressing the base address table, BAT, in BAS. The information from the BAT is then loaded into a new queue, called the BAS data queue, BDQ, Fig. 6. The BDQ is explained in greater detail in the section 'Pipeline interrupts'. The BDQ is located in the address calculation circuit, ACC, which calculates the real address in the DRS, together with an index and a pointer.

Fig. 6 Data from BAS is temporarily stored in the BAS data queue, BDQ. The address is calculated together with the index register slave, IRS, and the pointer register slave, PRS.
Variable information to be used in the variable and data store handler, VDSH, is stored in the variable control queue, VCQ. The address itself is stored in the variable address queue, VAQ; at the same time it is decoded from a logical into a physical address. A check is made to ensure that there is no collision with ongoing accesses, whose addresses are stored in the DRS access control queue, DACQ. DPBAC and DTYPR contain the number of boards and the type of memory device used in the DRS. The addresses in the VAQ are used to check whether there has been an earlier write access to the same address. If so, the new access will not start until the writing has been performed. The MCU access queue, MACQ, is used to execute write instructions efficiently and to support accesses ordered by the MIP through the DRS read address register, DRSRAR, and the DRS write address register, DRSWAR
BDQ BAS data queue
DACQ DRS access control queue
DPBAC DRS PBA counter (number of PBAs in DRS)
DRSRAR DRS read address register
DRSWAR DRS write address register
DTYPR DRS type register (type of memory device in DRS)
IRS Index register slave
PRS Pointer register slave
MACQ MCU access queue
VAQ Variable access queue
VCQ Variable control queue
VDSH Variable and data store handler

This calculated address is stored in the variable address queue, VAQ, until previously ordered memory access or refresh instructions have been executed. Data retrieved from the DRS is checked and corrected for errors and then stored in the variable read data queue, VARRDQ, in the variable and data store handler, VDSH, Fig. 7. Since all prefetched instructions may relate to variables, the VARRDQ has the same number of positions as the sequential queue, SQ. Data in the DRS has a length of 32 bits, and the variables range from one to 128 bits. Special hardware in the VDSH is used to extract variables that are shorter than 32 bits.
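The extraction and insertion performed by this hardware can be sketched in software. The function names and the bit numbering (least significant bit = 0) are illustrative assumptions, not the actual VDSH interface:

```python
def extract_variable(word, offset, width):
    """Extract a `width`-bit variable starting at bit `offset`
    of a 32-bit data word (LSB = bit 0)."""
    assert 0 < width <= 32 and 0 <= offset <= 32 - width
    return (word >> offset) & ((1 << width) - 1)

def insert_variable(word, offset, width, value):
    """Insert `value` into a word, as in the read-modify-write
    sequence used for write instructions on short variables."""
    mask = ((1 << width) - 1) << offset
    return (word & ~mask) | ((value << offset) & mask)

# Example: an 8-bit variable stored at bit offset 8 of a word.
w = insert_variable(0x00000000, 8, 8, 0x5A)
print(hex(w))                          # 0x5a00
print(hex(extract_variable(w, 8, 8)))  # 0x5a
```

Variables longer than 32 bits, up to the 128-bit maximum mentioned above, would span several consecutive store words; the sketch covers only the short-variable case the dedicated hardware handles.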
The variable is now ready for use when the instruction is to be executed. The most common instruction for variables is read from store, RS, which moves the variable to the register memory in the register memory unit, RMU.

Fig. 7 Data from the DRS is first checked for errors in the ECC and then corrected, if needed. If it is a variable that has been requested, the data is stored in the variable read data queue, VARRDQ. The variable is then extracted from the data under control of signals from the variable control queue, VCQ. Accesses ordered by microprograms are stored in the DRSRDQ, if they arrive before they are needed
DRSRDQ DRS read data queue
ECC Error correction and check

For a write instruction, the complete word is first read as described above. The variable, if less than 32 bits in length, is then inserted into the data word and written back to the DRS. Thus, the steps are the same as for a read instruction. To execute a write instruction efficiently, the address resulting from the address calculation, Fig. 6, is also saved in the MCU access queue, MACQ, for use when the instruction is executed. After execution, the word is written back to the DRS in the background.

Pipeline interrupts
Great efforts have been made to handle instructions for variables as effectively as possible. When the pipeline is working, only one cycle is used to execute a normal instruction, but building up the pipeline requires many cycles. Depending on what interrupts the pipeline - an internal jump, a jump to another block, or pointer or index changes - the number of cycles required to restore the pipeline will vary. If the pointer or index is changed after the address has been calculated, a new address must be calculated before the instruction can be executed - but only if any of the instructions in the queue use the pointer or index. Data for calculating the address is fetched from the BDQ, Fig. 6.
The separate BAS data queue, BDQ, makes it possible to recalculate the address without having to regain access to the BAT.

Jump support
The pipeline is interrupted by jumps. To minimise the effects of local unconditional jumps, these jumps are handled exclusively by the PSH and are transparent to the rest of the system. For a conditional jump loaded in the SQ, it is impossible to know where the execution will continue. In this case, the first instruction in the branch at which the jump is targeted is read from the PS and stored in the branch target queue, BTQ, Fig. 5. When this is done, the program store handler, PSH, continues to fill the SQ with instructions. An instruction stored in the BTQ is decoded in the same way as an instruction retrieved from the SQ, but not all actions are performed to completion. If the instruction in the BTQ relates to a variable, access is gained to the BAT, but the address to the DRS is not calculated. If the instruction specifies another jump, or if there are more conditional jumps in the SQ, then no action is taken. When a conditional jump instruction is executed, the execution may continue with either the instructions in the SQ or - if the jump is effected - those in the BTQ. If the jump is not effected, and there are more conditional jump instructions in the SQ, the target instruction for the next conditional jump is read and stored in the BTQ. An effected jump can save three to five cycles, compared with a situation without the BTQ.

Interleaving in DRAM memories
Prefetch logic is used to compensate for the relatively slow operation of the DRAM memories in the DRS, compared with the logic. Normal memory access to the DRS takes five cycles: one cycle to calculate and transport the address to the memory, three cycles for the actual access, and one cycle to transport, check for errors, and correct read-out data. This means that one memory access can be made every sixth cycle.

To speed up access handling, the memory is split into banks which can be accessed independently. Each memory board contains eight banks. If consecutive accesses address different banks, access can be gained to as many as five memory banks at the same time, and new data can be delivered every cycle. The risk of collision diminishes with the number of banks available. The maximum number of boards in the DRS is six, which makes 48 independently addressable banks. To ensure that this interleaving capability is used efficiently, an address scrambler is employed to distribute the accesses evenly among the banks.

Fig. 8 The CPC ASIC with the memories JAM, CM-I and RM plus 50k gates to implement the modules TRU, CMAI, RMU, ALU, PSH, MCU and BAH

Cache for Program Store
Studies have shown that the average number of instructions in a call-handling code sequence without jumps is eight. This means that the five cycles needed for the first access is a very long time. To build up the prefetch queue, instructions from the PS must come in faster than they are executed. This is solved by using static random access memory (SRAM) components as a cache memory for the capacity-critical blocks. SRAM is smaller and faster than DRAM, having an access time of 10 ns, compared with DRAM's 60 ns. The SRAM is located on the same board as the PSH, which means that the time required for a complete access is reduced to one cycle instead of five.

The program store cache, PSC, is located in a memory called PSCM, Fig. 3. The PSCM also holds tables used for signal-sending instructions. The PSC holds the complete program for capacity-critical blocks. The content of the PSC is updated once every day. To decide which blocks should be held in the PSC, the time each block is used is measured. These measurements are made only at traffic level and during the most loaded three-hour period every day. Both the PS and the PSC are 32 bits wide.
The instruction length in APZ varies from 16 to 64 bits, but the most common instructions are 16 or 32 bits long. This means that, on average, more than one instruction is read from the PS each cycle. However, because some instructions take more than one cycle to execute, and because of the waiting time that arises whenever the pipeline is interrupted, it is possible to fill the prefetch queues.

Cache for Base Address Table
Address calculation is part of the activities associated with the prefetch of variables. This calculation must be made in parallel with the execution of other instructions. The information needed to calculate the address of a variable is stored in the base address table, BAT. To do this in parallel, information from the BAT on the most frequently used variables is placed in a cache memory called BAS (base address store). This memory is dimensioned to handle information on all variables, even for very large applications.

Support for signal-sending instructions
The signal-sending instructions are optimised for the PLEX language and supported by hardware, to be executed as efficiently as possible. They are complex and take more than one cycle to execute.

Box B  Component technology
ASICs:
ACC  0.7 µm GaAs, 90,000 random gates
CPC  0.7 µm BiCMOS, 50,000 random gates, 96 kbit SRAM
MCC  0.8 µm BiCMOS, 4,000 random gates
UMC  0.7 µm BiCMOS, 20,000 random gates, 3.5 kbit SRAM
SPC  0.7 µm BiCMOS, 60,000 random gates, 30 kbit SRAM
RAC  1.0 µm CMOS, 20,000 random gates
Memories:
DRAM 4M x 4, 60 ns access time, used for PS and DRS
SRAM 256k x 4, 10 ns access time, used for BAS and PSCM
SRAM 32k x 8, 8 ns access time, used for CONM and CM-E

When a signal-sending instruction (in this example SSN) is detected by the decoder, Fig. 5, and targeted at the SQ, special hardware masks out the signal-sending pointer, SSP, from the instruction.
The SSP is used together with the program start address, PSA, to calculate the address of the global signal number, GSN, in the signal-sending table, SST. The GSN is read, to be used on order by the microprogram. All this preparation is made by the hardware before the instruction is executed. When the instruction is executed, special hardware works in parallel with the microprogram, MIP, to speed up execution. When ordered by the MIP, this hardware uses the GSN to calculate the address and to access the global signal distribution table, GSDT; it reads the four 32-bit words from the reference table, RT, accesses the local signal distribution table, SDT, calculates the address in the new block where execution should begin, and starts prefetching instructions.

To reduce execution time as much as possible, all accesses to tables must be fast. Both the GSDT and - for each block - four words from the RT are therefore placed in the fast SRAM on the same board as the PSH. This same memory is used for the PSC, Fig. 3. In the PSCM, a 16k x 32 area is reserved for the RTC, which is sufficient for 4k software blocks. A 128k x 32 area is used for the GSDT, and the rest for the PSC. In the first version of APZ 212 20, the PSCM will use 256k x 32, which leaves 112k x 32 (covering eight to twelve blocks) for the PSC. The PSCM will be upgraded to 1M x 32 when the next generation of SRAM components becomes available on the market.

PHYSICAL IMPLEMENTATION
The IPU is implemented on two K3 boards (344 x 178 mm) containing three proprietary ASICs, SRAM chips and buffers. The IPU hardware block also includes the PS (built on one K3 board) and the DRS (built on one to six K3 boards). The PS and DRS use the same type of board, built with DRAM circuits and one ASIC. Each memory board has a capacity of 64M x 16 words. This capacity can be increased when enhanced memory components become available. The SPU is implemented on one K3 board with one ASIC, SRAM and buffer components.
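The PSCM word budget quoted above can be verified with simple arithmetic (a sketch; "k" is read as 1,024 and all areas are counted in 32-bit words):

```python
k = 1024                # sizes below are in 32-bit words
pscm_total = 256 * k    # PSCM in the first version of APZ 212 20
rtc_area = 16 * k       # reference table area, enough for 4k software blocks
gsdt_area = 128 * k     # global signal distribution table

psc_area = pscm_total - rtc_area - gsdt_area
print(psc_area // k)    # 112 -> the 112k x 32 left for the program store cache
```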
The SPU and the IPU are located in the same magazine. All functions in the RPH are implemented in one ASIC, called RAC. Moreover, all components needed to handle the electrical transmission and reception of signals on the RPB are placed on the same board (RPIRS). The board size is K2 (222 x 178 mm). One magazine accommodates eight RPIRS, the power supply, and an interface board to the SPU. Using two K3-size boards, the MAU is built around a standard microprocessor and programmable logic device circuits.

CONCLUSION
New technology and a new architecture make it possible to meet the demand for increased processing capacity. The new central processor for AXE 10 is the result of continuous development of the APZ 212 concept. By investing in the development of customised circuits, faster and more complex functions can be implemented in smaller hardware units, optimised for the specific task of controlling telecom exchanges. The same strategy will be used to keep new generations of APZ CPs compatible with those in service today, thereby allowing existing software to be used in the future as well.

Measuring Quality of Service in Public Telecommunications Networks

Ragnar Huslende

For many users of telecommunication facilities, the quality of service is an important factor in their choice of service provider. To measure this quality, Ericsson has developed a system called NEAT. It covers both fixed and cellular networks and can be applied to a wide range of services, including basic telephony, virtual private network services, intelligent network services and international network services. The author describes the NEAT system, shows how it can be used to measure quality of service in various networks, and comments on some results from the use of NEAT.

Rapidly changing technologies and the advent of new services have led to very complex telecommunication networks.
Many different types - and even different generations - of equipment are interconnected and must cooperate properly in order to carry the services to the users on an end-to-end basis. Both transmission and switching equipment are involved, which means that - although the physical size of individual hardware components has been reduced - the overall functional complexity of a telecommunications network is increasing. This trend, reflected in large software systems, sophisticated signalling procedures and various specialised service nodes, makes it extremely difficult to predict the quality of service (QoS) by analytical methods. The best possible measuring principle must therefore be chosen.

Keeping in mind ITU's definition of quality of service in Recommendation E.800 [3] - "The collective effect of service performances which determine the degree of satisfaction of a user of the service" - it seems that at least three basic requirements should be fulfilled:

Fig. 1 A variety of reports on quality-of-service parameters and the different fault types are available to the NEAT user

Box A  Abbreviations
ISDN Integrated services digital network
NCU NEAT communication unit
NEAT Network evaluation and test system
NTC NEAT test centre
NTU NEAT test unit
PDH Plesiochronous digital hierarchy
PSTN Public switched telephone network
SDH Synchronous digital hierarchy
QoS Quality of service

- Quality of service is defined from the user's point of view and, therefore,
All these requirements can be fulfilled if a system for automatic generation of test traffic is used - a system that complies with the principles recommended by ITU in Rec. E.434 (Ref. 2). Typically, such a system consists of one test centre and a number of small test units, as shown in Fig. 2. The test units communicate via ordinary customer access interfaces; for example, subscriber lines in the local exchanges or the air interface of the radio base stations in cellular networks. Fig. 2 Scenario for QoS measurements by means of automatic generation of test calls The test centre is in charge of the preparation and scheduling of tests. Post-processing, presentation and distribution of test results are also carried out by the test centre. The actual measurements and observations are made during test calls between the test units. This approach to measuring quality of service has several important and unique advantages. The system can measure end-to-end transmission quality at the service level. The measured end-to-end connection may include a number of different transit nodes and both PDH and SDH sections in the transmission network. During test calls, the receiving level of defined test tones, signal/noise ratio, idle channel noise, bit error ratio, etc. can be measured. A system based on test calls can also measure the initial network response delay (e.g. dial tone delay) quite accurately, and provides independent, regular monitoring of the call metering equipment. Unlike a live subscriber, a test unit will never be busy or absent when called: it will answer by returning its unique identity code. Thus, it can be positively verified that the correct B-number has been reached and that the call has been successful. Conversely, an unsuccessful call will always be due to a network problem; never to subscriber behaviour. The system used to generate test calls can serve as an independent, "neutral" observer with overall responsibility for QoS measurements in the network.
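The verification principle described above - a called test unit always answers with its unique identity code - can be sketched in a few lines. This is an illustrative simplification, not the actual NEAT implementation; all numbers and codes are invented.

```python
# Sketch: classifying the outcome of one test call from the identity code
# returned by the answering test unit. Names and codes are hypothetical.
from typing import Dict, Optional

def classify_test_call(dialled_b_number: str,
                       response_code: Optional[str],
                       identity_codes: Dict[str, str]) -> str:
    """Classify one test call from the returned identity code (None = no answer)."""
    if response_code is None:
        # A test unit is never busy or absent, so a missing answer
        # indicates a network problem, never subscriber behaviour.
        return "lost call (network problem)"
    if response_code != identity_codes[dialled_b_number]:
        return "wrong B-number reached"
    return "successful call"

codes = {"21550001": "ID-A7", "21550002": "ID-B4"}
print(classify_test_call("21550001", "ID-A7", codes))  # successful call
print(classify_test_call("21550001", "ID-B4", codes))  # wrong B-number reached
print(classify_test_call("21550002", None, codes))     # lost call (network problem)
```

Because every outcome falls into one of these unambiguous classes, the statistics gathered over many such calls can be attributed to the network alone.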
The subscriber access interfaces represent a stable standard, valid throughout the network for long periods. Thus, one type of system can cover the entire network although the switching and transmission equipment may differ from one region to another, and although technological changes may occur in the network from time to time. The user of the system can define a standard set of parameters and report formats for service statistics and fault reports. Measurements from various parts of the network are immediately comparable, and quality trends can be observed in an objective manner as the network evolves. The test units are connected to the network like ordinary telephone sets, which makes installation very flexible and straightforward. Since the measurements are made from a subscriber's point of view, the results can be intuitively well understood, even by personnel without in-depth technical knowledge of the network. The statistical material supplied must be comprehensive enough to reflect the various quality parameters with sufficient confidence. A detailed mathematical elaboration and discussion of this kind of material was presented at the 5th Nordic Teletraffic Seminar (Ref. 6). The results of the discussions indicate that 500-1000 test calls per traffic route will identify, with statistical significance, those parts of the network which may cause quality problems. It is also shown that the higher the probability of faults (or lost calls), the fewer test calls are required. This is a very favourable effect. It means that serious bottle-necks or trouble spots in the network, requiring quick corrective actions, are detected after short periods of testing. NEAT - FOR MEASURING QUALITY OF SERVICE Ericsson is marketing a modern implementation of a test-call-generating system called NEAT. The system, which works as illustrated in Fig. 2, can perform both routine measurements and on-demand measurements. A possible configuration of a NEAT test centre, NTC, is shown in Fig. 3.
The NTC application has been designed to be largely independent of the computer platform. The current version runs on a UNIX server supplied by Sun Microsystems with a Sybase relational database management system, but alternative platforms can be used. The user interface is based on Open Look or Motif. Standard functions for window handling (move, resize, iconise, quit, etc.), pull-down menus and drag-and-drop functions are used to create a user-friendly environment for the NEAT operator. Fig. 3 Example of configuration of a NEAT test centre NTC is a multi-user system; various user categories with different authorisation levels can be defined. It is equipped with one or more communication units, NCUs, with dial-up modems to communicate with the test units, the NTUs. The NTC application is coded in C++ using recognised methods for object-oriented design. According to ITU-T Recommendation E.434, a test unit is a combined transponder/responder. The NEAT test units are very compact microprocessor systems, in which advanced digital signal processing techniques have been used to implement the measurement functions. Each NTU can be individually configured with the capacity needed at the actual site. Some key figures, specifying NEAT capacity, are given in Table 1.
Table 1
Number of lines for test calls per NTU: 2-48
Number of simultaneous test calls per NTU: 2-8
Max. number of NTUs per NTC: 1000
Configuring the system Various data must be entered into the NTC before tests are run. These data include a network model with groups/subgroups of exchanges, NTU telephone numbers, definition of signalling tones, tariff zones and rates, scaling factor for the test traffic, defined test types, etc. Different types of tests can be defined: - Quality tests. These tests produce QoS and fault statistics for the various groups/subgroups of exchanges. - Metering tests.
Test calls are generated within and between the various tariff zones. Various schemes for generating call metering pulses on the different wires of the subscriber interface can be checked. Multi-metering checks may cover all the actual tariff zones, metering rates and switch-over times. - Toll billing tests. The expected charge for all calls generated from a given A-number is accumulated. This accumulated sum can then be compared with the amount actually specified on the bill as produced by the service provider. Thus, the entire billing chain can be monitored, including both the metering system in the exchanges and the various post-processing systems involved in producing the bills that are sent to the subscribers. - Fault trace tests and on-demand tests. Used for special investigations, typically initiated when the above tests have revealed certain network problems that require more detailed follow-up. When these basic configuration data have been entered, the user of the system may compose a test schedule that exactly matches the needs for QoS measurements in the actual network. He can choose among the defined test types, define start time and duration, and schedule tests for weekly repetition, if desired. Performing tests Once the test schedule has been entered, all the remaining tasks that are required to execute the tests and report the results are performed automatically, on time, by the NEAT system. Each individual test is carried out as an independent test sequence. During test execution, all NTUs work in parallel, synchronised by time slots. The algorithm for call-pattern generation is designed to randomise test traffic to/from each group. Weight factors per exchange are used to create the desired test-traffic profile. For instance, one objective may be to generate test traffic with a distribution similar to real subscriber traffic. The algorithm for call-pattern generation is also designed to avoid call collision.
This ensures that no more than one test call is generated to any one test number in any time slot, which means that the "busy B-number" case is avoided. In the signalling phase of a test call, frequency, cadence and signal level of the various signalling tones are checked against specified tolerances. A number of events may be detected and reported by various event codes; for example:
- missing, delayed, illegal, unexpected or out-of-sequence signalling tones
- congestion tone
- wrong B-number reached, detected by a missing response (transmission test tone) from the called party or by receiving an incorrect identity code from the called NTU
- missing or incorrectly timed metering pulse at B-answer
- metering pulse interval differing from the expected, nominal value
- violation of transmission quality threshold.
During a test call, the values of a number of parameters are measured and recorded:
- dial tone delay
- ringing tone (post-dialling) delay
- level of dial tone and other detected tones
- transmission quality parameters, including a fixed test tone and three user-selectable speech-band tones, idle channel noise, and signal/noise ratio; all measurements are made end-to-end, both by the calling NTU and the called NTU
- metering pulse interval.
To comply with ITU-T Recommendation E.434, additional measurements are implemented, e.g. round-trip propagation delay, clipping, echo and impulsive noise. "Call continuity" can also be tested by detecting short signal interruptions during the test call. Fig. 4a Testing the national network Fig. 4b Testing regional networks (see explanatory text in Fig. 4a) The NTUs can also be instructed to supervise network elements. If a certain number of consecutive unsuccessful calls for specified combinations of A-numbers/B-numbers is registered, an alarm is emitted. This is also the case if the dial tone is missing for a certain number of consecutive call attempts.
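The collision-avoiding, weighted call-pattern generation described above can be sketched as follows. This is a hypothetical simplification, not the actual NEAT algorithm; the NTU names and weights are invented.

```python
# Sketch: in each time slot every NTU picks a weighted-random destination,
# but no B-number is used twice, so the "busy B-number" case cannot occur.
import random

def schedule_slot(ntus, weights, rng=random):
    """Return a list of (caller, callee) pairs for one time slot.

    ntus    - list of NTU identifiers
    weights - dict NTU -> weight, shaping the test-traffic profile
    """
    callers = list(ntus)
    rng.shuffle(callers)
    free = set(ntus)          # B-numbers not yet called in this slot
    pairs = []
    for caller in callers:
        choices = [n for n in free if n != caller]
        if not choices:
            continue          # no free destination left for this caller
        callee = rng.choices(choices, weights=[weights[n] for n in choices])[0]
        free.remove(callee)   # at most one call per B-number per slot
        pairs.append((caller, callee))
    return pairs

ntus = ["NTU1", "NTU2", "NTU3", "NTU4"]
weights = {"NTU1": 3, "NTU2": 1, "NTU3": 1, "NTU4": 1}  # NTU1 attracts more traffic
pairs = schedule_slot(ntus, weights)
callees = [c for _, c in pairs]
assert len(callees) == len(set(callees))   # collision-free within the slot
```

Running the scheduler over many slots and counting callees per NTU would reproduce the desired test-traffic profile, e.g. one resembling real subscriber traffic.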
Dial-tone supervision can be performed even in periods when no ordinary test calls are being made. As a background test, the NTUs may then regularly check that dial tone is detected on all test lines. This has proved to be a safeguard in cases where the exchange has died a "silent death", which means that it can neither perform normal call processing nor generate an alarm about the situation. Using the test results All the above data are post-processed in the NTC and formatted into different reports with various degrees of detail. Some reports can support the daily activities of the operations and maintenance staff. After each test sequence, a summary report is created in the NTC showing a "snapshot" of network status. Based on this picture, the system user can ask for full measurement details of the interesting test calls. Special fault statistics are also accumulated during the month, to support problem diagnosis. Some reports are created to support the strategic planning of maintenance resources and investments in new network equipment. These reports are based on statistics accumulated over one or more statistical periods. They may show total values of various QoS parameters for the entire network or for the individual administrative regions. Trends in these parameters over the past months and years can be shown. Some examples of reports are: - lost calls and ringing tone delay for the various traffic routes during busy hour - dial tone delay for the various exchanges during busy hour - transmission quality figures for the various routes Fig. 4c Testing local networks (see explanatory text in Fig. 4a) Such test traffic will support QoS measurements in the regional-level subnetworks. Fig. 4c illustrates a test type for the lowest level in this example. Test traffic is generated only between the exchanges in each local area, and internally in each exchange.
- undercharging/overcharging percentages for the various network groups - various trend statistics covering the previous/current month/year. NETWORK APPLICATIONS Thanks to the general network access interface used by the NTUs and the flexible grouping feature in the NTC, the system can be applied in several different ways. Some examples are given in the following. National PSTN/ISDN networks Different types of test can be defined in the NTC to test the various levels of a national telecommunications network. Nested groups/subgroups can be defined to support testing at an arbitrary number of network levels. Typically (but not necessarily), the groups can reflect the various geographical regions of the network. Another possibility is that of defining groups according to the type of network equipment. For example, all digital exchanges from supplier X may share one subgroup. International networks NEAT test configurations have been installed to perform end-to-end tests of international services. Test traffic is generated through international switching centres in the participating countries. This is useful, partly because it supports the procedures recommended by ITU in Rec. E.424 (Ref. 1). For example, regular tests are being performed by a dedicated NEAT system with test units located in Denmark, Finland, Iceland, the Netherlands, Norway and Sweden. Mobile networks Mobile services represent an area of very rapid growth. Old and new service providers are competing for customers. QoS is a crucial aspect for many reasons, e.g.: - radio coverage differs from network to network, and from one region to another within the same network - traffic is growing rapidly - subscriber mobility may produce unexpected effects on traffic load distribution. NEAT is used to generate test traffic to/from various parts of the fixed network, as well as mobile-to-mobile traffic. Figs. 4a, b and c can be interpreted as a network with three main regions.
Each region is further divided into a number of local areas served by a few exchanges. Each exchange has a number of dedicated test lines that are connected to NTUs. Delays, transmission quality, call metering and various signalling events on the air interface can be recorded. The following test configuration may be used: - one NTU per cell, installed at a fixed location to give a stable reference for measurements - a number of additional NTUs installed in vehicles to monitor service characteristics related to vehicle movement within a cell, handover between cells, and roaming. A test type for the national level of the network will only generate test calls between the main geographical regions (Fig. 4a). Test traffic is long-distance traffic routed via the national-level trunk network. A test type defined for the next lower level will only generate test calls between the subgroups within each region (Fig. 4b). Corporate telecommunications networks Many business customers, by virtue of their size, may require guaranteed minimum figures for QoS to be stated in contracts with the service provider. The possibility for a service provider to monitor and report on the quality of the delivered services, in order to verify that the agreed quality levels are being maintained, will provide a competitive advantage and sometimes even be a necessity from a legal point of view. Corporate telecommunications networks can be built in several different ways. Typically, a combination of public and private network resources is configured to constitute a virtual private network, as shown in Fig. 5. Fig. 5 QoS measurements by means of NEAT in a virtual private network Both PABXs and exchanges providing centrex services may be included and equipped with NEAT test units. QoS measurements can then be made for internal traffic in the corporate network and for traffic to/from the public network. Depending on customer requirements, NEAT can be used in various ways, e.g.: - Continuous supervision: The test units are permanently connected to the PABX/centrex lines that belong to a specific customer. QoS reports, including trend reports, are produced on a regular basis - Temporary supervision: A number of transportable NEAT test units are moved between the PABX/centrex sites. Thus, the various corporate networks can be tested for limited periods whenever necessary. Such tests may be performed after major reconfigurations of a network, or triggered by customer complaints. Intelligent networks Intelligent network, IN, services are now offered in many countries worldwide. Examples of IN services are: calls charged to the called party (often referred to as Green number or Freephone service), routing to a time/day-dependent B-number, private numbering plan, etc. Quality monitoring of IN services is very important, for several reasons, e.g.: - complex signalling procedures during call set-up and release. IN may be integrated into both fixed networks and mobile networks. Complex interworking situations may occur, possibly with unexpected effects on quality. This means that end-to-end measurements are needed - dynamic service environment (new services, distribution of service logic, etc.) - differentiated charging for the various services - demanding customers; an example is a private numbering plan service used as the basis for a corporate network, Fig. 5. A scenario for testing IN services is shown in Fig. 6. NEAT test units are defined as IN subscribers in the IN service control point, SCP. When test calls are made towards these IN numbers, normal call processing takes place in the SCP. Problems such as congestion, delay, charging faults and calls lost due to various technical faults can then be monitored by NEAT. STRATEGIES FOR THE USE OF TEST RESULTS Fig. 6 IN QoS measurement scenario SCP Service control point SSP Service switching point NEAT ties in very well with current philosophies in network O&M and planning.
One interesting possibility is a management-by-objectives approach to selected QoS issues in the network, Fig. 7. When NEAT measurements are introduced, the user organisation sets future goals for the quality norms and stipulates when these norms should be fulfilled. The goals can be broken down into the various quality parameters and the various parts of the network. Fig. 7 NEAT supports a management-by-objectives approach to pursuing future quality objectives Since all test traffic is generated on an end-to-end basis, the local area network is always included. This is true even of tests at the highest levels of the network. Therefore, it may be an advisable strategy to start a QoS improvement programme by first focusing on the lower levels of the network. When any problems found there have been analysed and solved, the system user can successively go on to test the higher levels. When an approach like this one has been used systematically for some time, it is likely that the given quality objectives have been achieved. Then, the next challenge is to maintain the desired minimum quality level. In the past, preventive maintenance was widely used in telecommunications networks. But as networks grew in size and complexity, this became very resource-demanding, and new technology also required other methods. Today, so-called controlled corrective maintenance, CCM, is often considered as a more attractive approach. This is a strategy based on continuous supervision of the network's quality parameters. As long as the specified service norms are met, no action is taken. Only when these norms are violated are corrective measures introduced, Fig. 8. In other words, a certain amount of faults and problems is allowed in the network at any time. Controlled corrective maintenance and management by objectives are related approaches which both require some efficient means for QoS measurements and problem identification. NEAT has proved to be a valuable tool in both cases. Fig. 8 NEAT supports a controlled corrective maintenance strategy SOME EXPERIENCES ITU's general definition of quality of service was stated in the introduction. To support quantitative measurements in the network, a more explicit and constrained definition has been worked out by Norwegian Telecom, a long-time user of NEAT: "The percentage of calls that - with a defined transmission quality in both directions - within a defined time reaches the correct B-number and is correctly charged". In order to evaluate the quality of service according to this definition, a number of parameters must be measured. Assuming that the telecommunications services are the service provider's end product, NEAT tests can be compared to the final quality control inspection in a manufacturing plant. Another point emphasised by the operator is the value of NEAT tests from a marketing point of view. Reduced quality problems and more efficient handling of complaints result in a more positive public image. In fact, quality norms for the national telephone service, as verified by NEAT, are now published in the telephone directories in Norway and distributed to all subscribers. When NEAT was introduced, it produced some quite unexpected results. In many cases, NEAT indicated a significantly lower QoS than previous measurement techniques, such as manual test calls. This difference has mainly been ascribed to the more objective and comprehensive tests performed by NEAT. The number of faults in a supposedly high-quality network was surprisingly large. Many different types of faults that it would otherwise have been difficult (and expensive) to find have been corrected by detailed follow-up of NEAT test results. Some published examples (Ref. 4) are shown in Box B.
Box B Examples of technical problems uncovered by NEAT
- Oscillator fault in radio link station, 1800 channels (noise)
- Ice problems on radio link antenna
- Antenna coverage (radome) fault, 140 Mbit/s system, bit fault
- Noise on 60-groups in a 300-group
- High room temperature in a radio link station
- Fading on radio links (over lakes in special weather conditions)
- Bad connectors in transmission equipment
- Bit faults on 2, 8 and 34 Mbit/s systems
- Faults in hybrids
- Faults in coaxial cables
- Impedance mismatch
- No voice transmitted in a time-slot in a 2 Mbit/s system
- Coil loading (pupinisation)
- Humid and wet cables in the ground
- Bad/missing solderings
- Crosstalk (noise) caused by bad insulation in jumper fields and distribution frames
- Attenuation/frequency distortion (highest frequency in the speech band is too low)
- Noise on some coaxial cables caused by tiers
- Missing dial tone (fault in registers or other equipment in local exchange)
- Unknown tone (incorrect frequency or other tones on the line)
Metering faults:
- No B-answer signal on certain lines
- Metering pulse before B-answer signal
- Too long/short metering interval
- No metering for a new area code
- No metering for certain area codes due to software fault in digital exchange
Congestion and miscellaneous:
- Faults in relays, selectors and MFC equipment
- Faults in digital trunk modules
- Faults in wiring and cable connections
- No alternative route for certain area codes
- Misrouting
- Sticking (remanence) of relays and selectors
- Faults on SS#7 signalling route
- Vibration in building caused by street traffic (interrupting calls)
- Bundles too small during busy hour
- Silicone (insulation from coils) on relay contacts
- 2 Mbit/s system out of service
- 2 Mbit/s system in loop
Wrong B-number reached:
- Misrouting in selectors
- Register faults (counting chain/code receivers)
- Faults in MFC equipment
Reduction of non-paying call attempts It is commonly assumed that a certain percentage of unsuccessful calls in a network represents lost traffic and, hence, lost revenues. If we assume that the success rate is increased by two percentage points, and that 20% of unsuccessful calls can be considered as lost traffic, then the volume of revenue-generating traffic is increased by 0.4%. In the Norwegian network, the percentage of successful calls has been increased, as reported by Norwegian Telecom (Ref. 5). Some of these results are shown in Fig. 9. In the measurement period, the objective for the service norm has been redefined in several steps, from 5% to 2% unsuccessful calls. Especially during the first two years of testing, a very high improvement rate was achieved. The quality is now being kept relatively constant, in a controlled manner. Fig. 9 Busy hour traffic loss percentages, national and regional network level, measured from the beginning of regular NEAT testing, 2nd quarter 1987 to 1st quarter 1993 Increased utilisation of the network Continuous supervision and fault detection by NEAT makes for more efficient utilisation of the network.
After some time of active use of the test results in the daily O&M activities, the capacity available for revenue-generating subscriber traffic can be increased, Fig. 10; for example, by
- eliminating latent faults in the network
- removing bottle-necks through more optimal traffic routing
- reducing the volume of repeated call attempts.
Fig. 10 Utilisation of network capacity: the effect of using test traffic measurements Of course, this must be weighed against the capacity required by the test traffic itself. Experience has shown that a net capacity gain of 1-2% can easily be achieved, Fig. 10. Thus, more traffic can be handled - resulting in increased revenues without the need for immediate investments in network equipment. CONCLUSION Systems that automatically generate test calls - like NEAT - offer an effective way of measuring quality of service in telecommunications networks. Since observations are made at the subscriber access interface, the measurement concept is simple and general in nature. A variety of services in different networks (fixed and mobile) can be monitored, including services that involve interworking between several networks. End-to-end measurements, as recommended by ITU, imply that the quality is measured from a subscriber's point of view. Experience has shown that charging complaints are reduced and that a number of economic benefits can be derived from the information provided by a system like NEAT. Increased revenues are obtained by increasing the probability of successful service completion.
References
1. Test Calls. ITU-T Recommendation E.424, 1992.
2. Subscriber-to-Subscriber Measurement of the Public Switched Telephone Network. ITU-T Recommendation E.434, 1992.
3. Quality of Service and Dependability Vocabulary. ITU-T Recommendation E.800, 1993.
4. Network failures in the Norwegian telecommunications network detected by the 'NEAT' system. ISCC, Norwegian Telecom, Sept. 1991.
5.
Loss in busy hours, trunk network, May '87 - May '94. ISCC, Norwegian Telecom, 1994.
6. Huslende, R.: Traffic Route Test System as a Tool in Network Operations, Maintenance and Planning. 5th Nordic Teletraffic Seminar, Trondheim, Norway, June 5-7, 1984.
AXE 10 System Processing Capacity Leif Hakansson, Bjorn Kihlblom and Hans Lundberg The world is rapidly changing into an information society based on electronic communication. Telecom networks are necessary and strategic assets in this development, which requires telecom equipment to evolve in order to support continuously expanding traffic, as well as new and more complex functions. The authors describe the driving factors behind this development, and its influence on the continuing enhancement of the AXE 10 control system. The real-time capacity of a computer in the switching network is often measured by the number of instructions it can execute per second (millions of instructions per second, MIPS), or specified by the clock frequency of the processor. In truth, however, such figures give very little information about the processor's ability to execute different tasks. For a telecom switching system, the number of call attempts that can be handled per second, or per hour (BHCA, busy hour call attempts), is a more important characteristic. Fig. 1 The increased use of specified billing and powerful inter-exchange signalling has swelled the number of instructions required to handle even ordinary calls. The figure depicts the evolution in a metropolitan exchange, during a 20-year period, for local (black) and transit (red) calls. 1990 = 100% Historically, the most demanding steps of development with respect to capacity have included the replacement of analog technology with digital, processor-controlled functions, and the transition from decadic and multi-frequency signalling systems to packet mode digital signalling.
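The point about MIPS versus call-handling capacity can be made concrete with a rough conversion. The figures below are hypothetical, chosen only to show how instructions per call link BHCA to required processor capacity.

```python
# Sketch (invented figures): converting a busy-hour call attempt rate and an
# instructions-per-call figure into the processor capacity that must be
# available, ignoring overhead and load margins.

def required_mips(bhca: float, instructions_per_call: float) -> float:
    """MIPS needed to sustain a given busy-hour call attempt rate."""
    calls_per_second = bhca / 3600.0
    return calls_per_second * instructions_per_call / 1e6

# A hypothetical exchange handling 800,000 BHCA:
print(required_mips(800_000, 200_000))  # simple calls: about 44.4 MIPS
print(required_mips(800_000, 480_000))  # calls 2.4x as complex (cf. ISUP vs. TUP)
```

The same MIPS rating thus supports very different BHCA figures depending on call complexity, which is why raw MIPS or clock frequency says little on its own.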
At the same time, protocols for inter-exchange communication have become much more powerful and complex, from MFC via TUP to ISUP, together with MAP and TCAP, Box A. Similarly, the trend to change charging methods from pulse metering to specified billing (TT) and the introduction of ISDN subscribers in the public network markedly raised the requirement for processing capacity. The increasing number of instructions required to execute a typical call, Fig. 1, is a measure of how complex functions and systems have become. The amount of memory per subscriber line or trunk line is another measurement of complexity in telecom applications. As Fig. 2 shows, the trend is the same in this area, although the demand for memory is seldom a limiting factor for system capacity, except when multiple storage of program and data is needed. The rapid growth in memory demand is more accentuated for transit exchanges than for local exchanges, mainly because the introduction of ISUP trunk signalling is faster than the introduction of ISDN subscribers. A third measurement of a switching system's capacity is the volume of data transferred from a switching node to other nodes, such as other switching nodes, billing centres, network databases, or nodes for centralised operation or post-processing of statistics. The number of data transfers increased slowly until the middle of the 1980s. Since then, the demand for signalling link capacity has been steadily rising, due to the introduction of large signal transfer points, STPs, for System 7 signalling. The demand for higher data transfer capacity from exchanges to billing centres has also grown, because of the increased use of detailed billing, and because a higher rate of transfer is needed to reload the increased amount of data when major failures have occurred in exchanges. Multi-purpose, open-interface protocols have several obvious advantages, but also create an overhead that loads the processing system substantially. The most important examples are ISUP, TCAP, MAP, SCCP and MTP, Box A. Protocol handling requires considerable capacity for mapping, screening, and syntax checking, as illustrated by the following three examples of powerful protocols. Example 1. The number of instructions required for a typical transit call that uses an ISUP trunk is around 2.4 times as great as for the same call via a TUP trunk. Example 2. A typical local call between two subscribers in ISDN requires nearly three times as many instructions as a conventional local call between two non-ISDN subscribers. Example 3. A simple intelligent network (IN) service that uses the TCAP protocol between the service control point, SCP, and the service switching point, SSP, requires about 1.8 times as many instructions as when the same IN service is implemented in a combined SSP/SCP node, whose SSP and SCP software communicates via an internal protocol. In this example, the execution of the protocol - in both the SSP and the SCP - has been included. Fig. 2 The requirement to store information related to subscribers has increased in the same manner as the number of instructions. The black curve represents local exchanges; the red curve represents transit exchanges. 1989 = 100% Box A The relationship between different protocols is shown in two examples. In both cases, the message transfer part, MTP, and the signalling connection control part, SCCP, form the basis of other protocols. In communication from a subscriber in ISDN to a mobile subscriber, the ISDN user part, ISUP, is used at the originating side, and the mobile application part, MAP, and the transaction capabilities application part, TCAP, are used at the terminating side. In communication between two subscribers who use an intelligent network, IN (for example, a virtual private network, VPN, service), the intelligent network application part, INAP, is used together with TCAP.
The complexity of the IN protocols is also illustrated by the fact that MTP, SCCP, TCAP and INAP together represent 40% of the total number of instructions executed in the SSP for this call.

The intelligence in the emerging network is mainly allocated to centralised databases for subscriber, terminal and service data. Because these databases communicate with the traffic-handling exchanges using powerful protocols, such as TCAP, they require substantial capacity at both ends. A growing portion of calls will use at least one such database, which considerably increases the processing capacity required for an average call.

In mobile telephony, functions for roaming, paging, activation and deactivation, handover, and communication with the home location register, HLR, have added to the demands for processing capacity when compared with conventional calls in PSTN. The node with the highest capacity requirements in mobile telephony networks is the mobile switching centre, MSC. In terms of capacity, the major difference between digital GSM technology and analog mobile telephony is that GSM uses more complex protocols for communication, which means that more syntactic and semantic checks are required. The demand for, and the use of, subscriber services is also greater for GSM subscribers than for subscribers to analog mobile telephony services or non-mobile services, which also makes normal GSM calls more complex. For example, the number of instructions used in a GSM mobile switching centre for a call from a mobile subscriber towards an ordinary subscriber is 1.5 to 2.0 times greater than what is required for a corresponding call in a switching centre in the analog TACS system.

Fig. 3. Logical model of the application modularity concept. The system consists of: one system platform; one application platform, which provides access to shared system resources; and one or more application modules, which consist of software only and implement most of the functions of telecommunications applications.
The PSTN and ISDN subscribers are dependent on network databases for their mobility. The disadvantage of using centrally stored data is increased signalling in the network, as well as increased processor load, both for handling protocols for communication between the local exchange and the database and for transferring subscriber data between local exchanges, as subscribers move from one exchange area to another.

The widespread use of detailed billing greatly influences the demand for processing power, since each call requires that more data be recorded and transferred to the billing centre. In GSM, more than one detailed bill may be made for each call. For detailed billing in general, the amount of data per charging record is increasing dramatically, from 30 bytes in the early 1980s to an expected 500 bytes for GSM calls in 1998.

Example 1. The number of instructions needed to provide detailed billing is sometimes as large as ten times the number of instructions required for pulse metering.

Example 2. By 1998, it is expected that the amount of data that must be transferred from an exchange to a billing centre will be between 0.1 and 0.2 Mbyte/s.

The output of statistical counters for measurement and supervision increases the processor load only slightly. However, if a large number of counters are used, they will substantially increase the volume of data transferred from the switch to the statistical post-processing centres. By 1998, 500,000 to 1 million counters can be expected in large transit exchanges.

The deployment of processor-controlled switches in the commercially most interesting parts of the networks has increased the flexibility of telecom services.

Fig. 4a. Application programs register the details of a call in software records. Each function block reports the specific individual used for the call to the forlopp manager. This information is stored in a software record identified by the forlopp identity, FID.
The rapid introduction of new and enhanced services has become an important factor when competing for subscribers and traffic. Another trend is intensified efforts to reduce operational costs by keeping down the demand for labour and increasing the size of switching nodes, to reduce their total number. Both these factors imply an increasing need for processing power.

One effect of new services is that the requirements for detailed billing have been accentuated, not only to specify the duration and destination of calls but also to specify the types of service invoked during each call.

The architecture of the AXE 10 software has been enhanced to meet the demands for shorter time-to-market, improved in-service performance and reduced cost of operation. Two major developments in this software architecture that will have an impact on the system's processing capacity are better support for application modularity and the introduction of the "Forlopp" concept.

The introduction of application modules, AM, which are used to model applications on a common system resource - the application platform - has significantly enhanced the support of modular applications in AXE 10. Communication between AMs is strictly limited to protocols that are similar to network protocols, Fig. 3. This will increase the signalling within the system, both between AMs and to common system resources, but it will also permit a more flexible mixture of applications in the same switch. Increased signalling, of course, means increased internal demands for processing capacity.

The Forlopp concept allows the software to tie data items that take part in the same process to a unique identifier, the forlopp identity, Fig. 4a. This feature is used for fault recovery actions. When the operating system or the application program detects a fatal error in a process that uses a forlopp identity, a forlopp release is initiated, Fig. 4b. Through an interface dedicated to the forlopp identity, the operating system requests all blocks that take part in the forlopp to initiate release of all data items connected to the forlopp. The forlopp concept adds work to each process - for example, a call or call attempt - and thereby increases the demands for system processing capacity.

Fig. 4b. If a fault occurs, the operating system can release the system resources affected by the fault and save the relevant information for analysis. The release is initiated by the forlopp manager, which instructs the affected function blocks to use normal disconnection.

IMPACT ON THE AXE 10 CONTROL SYSTEM

The increasing demands for rapid introduction of new features, larger nodes and improved billing facilities all put greater demands on the AXE control system. The most important requirement is for increased real-time capacity. From experience, it seems fair to assume that in the foreseeable future the requirements for real-time capacity will double every two years. This assumption is valid for the entire system, and will thus influence every part of the AXE 10 control system. The central processing subsystem, the regional processing subsystem and the I/O handling parts will all be subject to demands for upgraded capacity.

Due to the increased complexity of the switching nodes, more program code must be stored in each node, and this software will be subject to more frequent upgrades. The introduction of new features does not necessarily entail loading new program code, but the logic of the switch will have to be changed in a flexible and efficient manner.

The switches are also increasing in size, which means that data for an increasing number of subscribers and traffic-carrying devices must be stored. Since this data must be readily available during call setup, the use of anything but random access memories, RAM, is excluded.
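To make the Forlopp concept described above concrete, the bookkeeping it implies can be sketched in a few lines of code. Python is used here purely for illustration; the class and method names (ForloppManager, register, forlopp_release) are hypothetical and do not reflect actual AXE 10 interfaces.

```python
# Illustrative sketch of the forlopp concept: resources that take part in
# the same process (for example, a call attempt) are tied to a forlopp
# identity (FID), so that a fatal error can release exactly those
# resources and no others. All names are hypothetical.

class ForloppManager:
    def __init__(self):
        self._records = {}      # FID -> list of (function block, individual)
        self._next_fid = 0

    def new_forlopp(self):
        """Allocate a unique forlopp identity for a new process."""
        self._next_fid += 1
        self._records[self._next_fid] = []
        return self._next_fid

    def register(self, fid, block, individual):
        """A function block reports the individual it uses for this forlopp."""
        self._records[fid].append((block, individual))

    def forlopp_release(self, fid):
        """On a fatal error: collect each involved (block, individual) pair
        so the block can run its normal disconnection for that individual,
        then drop the record. Other forlopps are unaffected."""
        released = list(self._records.pop(fid, []))
        return released

manager = ForloppManager()
fid = manager.new_forlopp()
manager.register(fid, "trunk_block", 17)
manager.register(fid, "charging_block", 4)
print(manager.forlopp_release(fid))
# -> [('trunk_block', 17), ('charging_block', 4)]
```

The extra registration step on every call is exactly the per-process overhead the article mentions: the forlopp concept trades some processing capacity for sharply limited recovery scope.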
Storage of charging data, on the other hand, is affected by other requirements. Even a major power failure must not be allowed to destroy this information before it has been safely stored in a post-processing centre. Larger switching nodes and more detailed charging requirements imply that even secondary storage on semi-permanent media, such as hard disks, will be subject to capacity upgrades.

The increasing volumes of data needed for charging and statistics, and the necessary ability to recover from lengthy outages of data communication facilities, put great demands on data transfer capacity. This means not only that faster communication links will be required, but also that the number of data links connected to the switching node will increase.

Reducing the number of switching nodes means that the same number of subscribers will be distributed over fewer nodes. The addressing capabilities of the present AXE 10 control system allow over one million subscribers to be handled in an efficient way. It is likely that, in the future, this limit must be extended, especially in mobile applications.

The cost of the access network - that is, the connection of subscribers to the nodes - must not rise, even though the size of the switch increases. The remote subscriber switch (RSS) concept used in AXE 10, to concentrate traffic close to the subscriber, will be the solution to this problem. Each RSS can connect up to 2048 subscribers. The present limitation of 256 remote units per exchange will be extended as the need arises.

Table 1. Historical evolution of APZ system capacity

Characteristic       APZ 210 04    APZ 212 02    APZ 212 11
Primary memory*      12 MW16b**    48 MW16b**    225 MW16b**
Relative capacity    0.12          0.5           1.0
Release date         1977          1984          1989

*  in the latest version of the hardware platform
** 1 MW16b = 1,048,576 16-bit words

Please observe that the relative capacity figures are only valid within the table in which they appear and not between tables.
With larger nodes, the telecom network becomes vulnerable to faults, in the sense that a single fault may cause disruption of service to more subscribers than in present networks. Availability requirements must therefore be more stringent for larger nodes.

Today, recovery from hardware faults is almost transparent to users. The duplication of all AXE 10 control system equipment at all levels means that every single fault is completely masked to the application. In the rare event of multiple faults, these must affect the same part of the system to have an impact on the service. However, recovery from software faults can be improved further. Faults can be isolated to affect only parts of the switch, thereby reducing the scope of the recovery action.¹ Other disturbances that occur during extensions or functional upgrades of the switch will also be reduced. The long-term goal is to perfect the system to the point that hardware or software faults will not cause any service disruption, and to achieve this without a major cost penalty.

THE CURRENT AXE 10 CONTROL SYSTEM

The development of processing capacity in the central processor, CP, and the available CP memory is illustrated in Table 1. Due to the large number of regional processors (RP), the design of earlier versions aimed primarily at reducing cost and size. Capacity requirements caused no problems, since the tasks to be performed by the processors were fairly simple.

Table 2. Evolution of RP characteristics

RP release          1976            1984                1994
Memory              20 kW16b PROM   256 kW16b loadable  256 kW16b loadable
Relative capacity   1               2                   2
Language            Assembly        Assembly            Assembly
Size*               19 boards       3 boards            1 board
Processor           Proprietary     Proprietary         Proprietary
Internal bus        12 bits         16 bits             16 bits
Clock frequency     5 MHz           5 MHz               5 MHz

* excluding regional processor bus interface PCB

Please observe that the relative capacity figures are only valid within the table in which they appear and not between tables.
However, the advent of ISDN subscribers and complex protocol handling has changed this situation. The processing and storage capacities of the regional processors have evolved rapidly, and it is now feasible to move complex functions from the CP level to the regional processor level, which requires simple handling of program and data and the use of high-level programming languages in these processors.

The 1976 version of the regional processor had its memory in PROM. In 1984, RAM was introduced, loadable from the CP. High-level programs were introduced in a new regional processor, the EMRP (extension module regional processor), in 1980, and in two other versions, the RPD (regional processor, device-specific) and the EMRPD (EMRP, device-specific), in 1991. These latter processors in particular have required greater capacity, especially when used as signalling terminal controllers in an application. The evolution of the regional processors is shown in Tables 3 to 5.

When AXE 10 was introduced in 1976, the input/output (I/O) system mainly had two functions: to provide a man-machine interface and to store the system's backup copy.

Table 3. Evolution of RPD characteristics

RPD release         1991          1995
Memory              4 MW8b        16 MW8b
Relative capacity   1             12
Language            C++/C         C++/C
Size*               1 board       1 board
Processor           Commercial    Commercial
Internal bus        32 bits       32 bits
Clock frequency     25 MHz        >50 MHz

* excluding regional processor bus interface PCB

Table 4. Evolution of EMRPD characteristics

EMRPD release       1991          1995
Memory              5 MW8b        16 MW8b
Relative capacity   1             1.5
Language            C++/C         C++/C
Size                1 board       1 board
Processor           Commercial    Commercial
Internal bus        32 bits       32 bits
Clock frequency     20 MHz        >33 MHz

Please observe that the relative capacity figures are only valid within the table in which they appear and not between tables.
Consequently, by present standards, a rather primitive system was sufficient. The RP was used for control, and a non-volatile storage medium was provided by magnetic tape cassettes. Capacity requirements were typically much less than one transaction per second and, for throughput, far below one kilobyte per second.

With increasing requirements for itemised billing, statistics and centralised operation, the I/O system was developed into a proprietary and dedicated processing platform, based on the so-called support processor, SP. The SP provides an I/O system with improved capabilities and availability and is only loosely coupled to the switching functions.

Table 5. Evolution of EMRP characteristics

EMRP release        1981          1984          1986          1989
Memory              128 kW8b      128 kW8b      256 kW8b      256 kW8b
Relative capacity   1             1             1.3           1.3
Language            Plex-M        Plex-M        Plex-M        Plex-M
Size                5 boards      4 boards      2 boards      1 board
Processor           Commercial    Commercial    Commercial    Commercial
Internal bus        8 bits        8 bits        8 bits        8 bits
Clock frequency     1.5 MHz       1.5 MHz       2 MHz         2 MHz

Table 6. Evolution of I/O characteristics

I/O release         1978          1987          1990          1994          1997
Secondary storage   100 MB        268 MB        1.2 GB        2.1 GB        >4 GB
Access              Sequential    Direct        Direct        Direct        Direct
Throughput          N/A           3 kbit/s      6 kbit/s      20 kbit/s     150-200 kbit/s
Language            Assembly      Eripascal     Eripascal     Eripascal     Eripascal
Processor           RP            Commercial    Commercial    Commercial    Commercial
Redundancy          Hot standby   Hot standby   Hot standby   Hot standby   Hot standby

Please observe that the relative capacity figures are only valid within the table in which they appear and not between tables.
Moreover, a sophisticated file management system that handles hard disks instead of magnetic tape cassettes has been added to the SP functions. The evolution of the support processor is shown in Table 6.

CONTINUED DEVELOPMENT OF THE AXE 10 CONTROL SYSTEM

With respect to central and regional processors, there are basically two ways to improve the processing capacity of AXE 10: to apply technological improvements, and to develop a more advanced architecture. The open-endedness of the system also makes it possible to distribute functions to other platforms. The AXE concept of functional modularity - based on the function block, where each block contains both its own programs and its own data - allows functions to be allocated to different processors, or even to different systems.

As in all processors, the processing capacity of the central processor depends on the clock frequency and the memory access time. For AXE 10, the relative processing capacity of each consecutive generation of processors has nearly quadrupled. Approximately half of each increase stems from doubling the clock frequency, and the other half comes from architectural developments aimed at optimising the use of memory. Table 7 shows the current and projected growth in processing capacity, as it relates to clock frequency and memory components. The architectural developments referred to above have been achieved through the introduction of more parallelism and pipelining into the current architecture, which creates opportunities for the compiler to optimise memory accesses.

Throughout this path of evolution, a constant evaluation of other architectures has been made. This is exemplified by the introduction of common microprocessor families and multi-processing concepts. To date, however, microprocessor components have only been introduced at the regional processor level.
Multi-processing has always been considered as an alternative in the evolution of the AXE 10 control system. In fact, even though multi-processing was never used in the first generation of processors, the APZ 210 was designed to permit it. So far, proprietary development of single processors (although internally duplicated to improve dependability) has outperformed all alternative evolution paths, including those mentioned above. This might change in the future. New architectures are continuously being evaluated, to ensure the required processing capacity in AXE 10.

The evolution of capacity at the regional processor level is linked to the evolution of microprocessors in general, see Tables 3 to 5. The main task of the RPs today is to control the application hardware, which may be very simple in some cases and very complex in others. The distribution of functions between the CP and the RP is determined by factors such as reliability, capacity and cost. One way to increase the processing capacity of the system is to move as much of the device control as possible to the RP. Another way is to move entire functions from the central processor to the regional processor level.

Table 7. Evolution of APZ central processor capacity

Characteristic      APZ 212 11           APZ 212 20            Next generation
Clock frequency     10 MHz               40 MHz                >80 MHz
Memory components   4 Mbit DRAM,         16 Mbit DRAM,         64 Mbit DRAM,
                    256 kbit SRAM        1 Mbit SRAM           16 Mbit SRAM
Architecture        no pipelining and    use of pipelining     extended use of pipelining
                    no cache; micro-     and cache; program    and cache; parallel handling
                    program pre-fetch    instruction           of instructions and
                                         pre-fetch             signalling data
ASIC technology     1 µm CMOS            0.7 µm BiCMOS/CMOS    0.5 µm CMOS
Relative capacity   1                    4                     16
Release date        1989                 1995                  1998

Please observe that the relative capacity figures are only valid within the table in which they appear and not between tables.
However, the functions residing in RPs must ensure high reliability and data consistency. In the CP, the duplicated hardware ensures that hardware faults are transparent to the application, whereas the reliability of RPs is partly achieved through software. Currently, several possible ways of migrating functions from the CP to RPs are being evaluated.

The open-endedness of the system can improve its processing capacity by allowing functions to be migrated to other platforms. Recently, a new subsystem - the open communication subsystem, OCS - has been introduced in AXE 10. The OCS provides data communication, according to common standards, between applications in AXE 10 and external computer systems, and supports the Internet protocols TCP/IP as well as Ethernet links. One of the first examples of using the OCS connection for data communication to external computer systems is the adjunct processor, AP. The AP is based on a fault-tolerant, high-availability, commercial UNIX computer platform, to which - initially - parts of the AXE 10 charging functions will be migrated. The AP, which is also an extension of the I/O system, offers very high transaction capability, typically more than 100 transactions per second, and throughput capacity that exceeds 100 kilobytes per second.

CONCLUSION

The evolution of AXE 10 processor hardware is characterised by developments that minimise costs and risk while complying with new demands for functions and capacity. The AXE 10 system architecture - while retaining the original, underlying principles - has been developed to meet new demands. It has not only been possible, but also feasible, to continue developing the system to keep it modern and based on state-of-the-art technology. The future will no doubt bring new requirements for further development of AXE 10, but even today a number of solutions are feasible.
As always, the objective will be to select a path that minimises technological risks and at the same time ensures that capacity is available when Ericsson's customers need it.

References
1. Englund, T.: AXE 10 Dependability. Ericsson Review 72 (1995):1, pp. 42-50.
2. Johansson, Ö. and Nord, C.: Using Predictions to Improve Software Reliability. Ericsson Review 72 (1995):1, pp. 30-35.

Using Predictions to Improve Software Reliability

Öjvind Johansson and Camilla Nord

High software reliability is becoming increasingly important. To meet the demands for rapid development of reliable software, Ericsson is now increasing the use of prediction methods in the software design process. When a software system is designed, the majority of faults are often found in a few of its modules. By early identification of these modules, software design management - and thereby software reliability - can be considerably improved. The authors describe predictors capable of identifying fault-prone modules, how these predictors can be evaluated, and how software projects can benefit from predictions.

As hardware is becoming cheaper and more reliable, the importance of software reliability is increasing. Software applications are more complex than ever, and this leads to stricter demands on the software development process. Fig. 1 shows the distribution of costs during a typical software life cycle. Maintenance costs - that is, post-installation costs of error correction and minor changes - account for approximately 50 percent. It is therefore important to improve the software design process by providing the designers with means of avoiding faults, and by discovering faults as early in the design process as possible. Fig. 2 shows how this can be achieved.

Fig. 1. Typical breakdown of software costs. The cost of testing equals the cost of design and coding, and maintenance accounts for 50 percent of lifetime costs.
Early fault prevention and detection is more easily achieved if designers and project management know where faults are likely to occur. In software systems, the majority of faults are often found in a few modules. Fig. 3 illustrates, with figures from a recent project, that the most fault-dense modules - which together made up 20 percent of the total code - caused 55 percent of the function test trouble reports. Here, a predictor capable of pointing out modules likely to contain many faults would have been a valuable tool; the earlier in the design process it is available, the more valuable it is. The search for predictors also provides more profound knowledge of the development process and code characteristics, which is useful when improving and standardising the design process through enhanced design rules, training, etc.

Finding predictors of fault-prone modules

The objective is to find methods of pointing out the individual modules where faults are likely to be introduced during design and coding. It must be possible to use these methods in a design project based on an existing system, where some software modules are kept as they are, some are modified, and new modules are added. Two studies,³ ⁵ made in order to find suitable prediction methods, have focused on finding good predictors for the number of trouble reports related to each module in the function test.

Fig. 2. An important goal is to improve software reliability through fault prevention; that is, to improve the way the design process handles real and potential faults. Fault tolerance, on the other hand, is related to how the system handles run-time failures.

Fig. 3. The new and modified modules in a project have been arranged in descending order by the ratio of function test trouble reports to code statements. This shows that 80 percent of the faults were found in modules containing only 40 percent of the total amount of code.
The reason for this approach is that the function test is performed in a standardised manner, is rigorously reported, and reveals a large proportion of existing faults. The studies cover several design projects and hundreds of new and modified software modules.

The results are interesting. One would expect a close relationship between the number of modified or new statements (M) and the number of trouble reports. But many of the modules with high values of M had few faults, Fig. 4a. The same is true of the total number of statements (S); many large modules were almost free from faults, Fig. 4b. Both the number of implementation proposals (IP) referring to a module and the number of new or modified signals (SigFF) affecting a module showed a closer correspondence with the number of trouble reports. (An implementation proposal is a document that describes how a requirement influences a source system. Signals are the primary means of ordering execution of code. They may contain data and can be sent between modules or within them.)

The products S*IP and S*SigFF turned out to be even more useful. For example, modules with high values of S*SigFF were shown to be likely sources of many trouble reports in the function test, Fig. 4c.

Fig. 4a. Plot of the number of function test trouble reports versus the number of new or modified source code statements. The plot indicates no close relationship between the number of new code statements and the number of trouble reports generated. The modules are identical with those shown in Fig. 3.

Fig. 4b. The relationship between module size, measured as the total number of source code statements, and the number of trouble reports, for the same modules as those shown in Fig. 4a.

Fig. 4c. Function test trouble reports versus predictor S*SigFF. A high predictor value points out modules that are likely to contain many faults.
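Ranking modules by such a predictor is straightforward bookkeeping, as the following sketch shows. The S*SigFF formula is the one described above, but the module names and figures are invented for illustration; Python is used only as a convenient notation.

```python
# Sketch of ranking modules by the fault-proneness predictor S*SigFF
# (total statements S times new-or-modified signals SigFF).
# The module data below is invented for illustration.

modules = {
    # name: (S = total statements, SigFF = new or modified signals)
    "MOD_A": (4000, 25),
    "MOD_B": (12000, 2),    # large module, few interface changes
    "MOD_C": (900, 40),     # small module, many interface changes
    "MOD_D": (7000, 18),
}

def predictor(s, sigff):
    """S*SigFF: high values point out likely fault-prone modules."""
    return s * sigff

# Most fault-prone first: these modules get extra inspection and testing.
ranking = sorted(modules, key=lambda m: predictor(*modules[m]), reverse=True)

for name in ranking:
    s, sigff = modules[name]
    print(f"{name}: S*SigFF = {predictor(s, sigff)}")
```

Note how the ranking differs from a ranking by size alone: the largest module, MOD_B, ends up last, which mirrors the studies' observation that many large modules were almost free from faults.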
This is exactly the kind of result the studies were aiming at - a tool that enables project management to allocate design, inspection, and test resources effectively.

When predictions are made at an early stage, the exact number of statements (S) that a module will have when it is coded cannot possibly be known. Estimates have to be used. But when the modules' flow charts have been designed, data obtained from these flow charts can be used instead of estimates of S. It has been found that in predictor S*SigFF, S can be replaced by one of several measures on the flow chart, for example McCabe's cyclomatic complexity measure (see Box A), the number of decision points, or the total number of decision alternatives.

Interpretation of predictors requires great care, since variables may be correlated without affecting each other directly. However, for the above-mentioned predictors there are some interesting possible explanations. In the case of predictor S*IP, it seems reasonable that the likelihood of faults is more or less proportional to the amount of new requirements for a module (measured by the number of implementation proposals), and also increases with the amount of code (S) that may be affected by these requirements.

Box A. McCabe's Cyclomatic Complexity Measure
For a program consisting of one component only (a program without subroutines), the McCabe cyclomatic complexity measure is the number of basic paths through the program. For example, the graph in Fig. A represents a one-component program with cyclomatic complexity 4.

Fig. A. The figure depicts the different paths through a program. In the graph, each circle is a statement, or a number of statements that are always executed.

Table 1.

Module (actual order)   FTTR        Pred1 order (acc.)   Pred2 order (acc.)
A                       50 (50)     B (30)               D (9)
B                       30 (80)     A (80)               C (20)
C                       11 (91)     C (91)               B (50)
D                       9 (100)     D (100)              A (100)

For each module in the example, the number of trouble reports is shown without parentheses.
For a program consisting of several components, such as a main component and subroutines, McCabe defined the cyclomatic measure as the sum of the measures for the individual components. The cyclomatic complexity V(S) is given by the formula:

V(S) = e - n + 2p

where
e = number of edges
n = number of nodes
p = number of connected components

In the example shown in the figure, V(S) = 10 - 8 + 2 = 4.

Predictor S*SigFF is similar to S*IP. Signals make up the interfaces between modules, and it is not surprising that SigFF, the number of new or changed signals, correlates with IP, the number of implementation proposals. A paper presented at the 11th Nordic Teletraffic Seminar, 1993,¹ gives some more specific explanations of the relationship between a high occurrence of faults and high values of SigFF. For example, the presence of a large number of signals indicates high coupling (the module interacts with its environment in many ways), which means that the system is difficult to understand. A higher value of SigFF indicates that the module designer may have to communicate more frequently with other designers, possibly in other countries, which involves a greater risk of misunderstanding. Also, careless mistakes are easily made with the signals themselves.

Fig. 5. Example from a study⁵ in which the accuracy of two predictors was compared by means of an Alberg diagram. The accumulated percentage of trouble reports is plotted starting with the module that the prediction points out as the most fault-prone. For the curve FTTR, the modules are in the order given by the actual number of trouble reports.

Numbers within parentheses in Table 1 are accumulated trouble reports when the modules are sorted in actual or predicted descending order.

Evaluating predictors

The Alberg diagram can be used to show and evaluate the usefulness of various predictors.
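Before turning to the evaluation itself, the formula in Box A can be made concrete with a short sketch that computes V(S) from a control-flow graph. The graph below is a hypothetical one chosen to match the counts in Box A's example (8 nodes, 10 edges, one component); it is not the exact graph of Fig. A.

```python
# Sketch of McCabe's cyclomatic complexity, V(S) = e - n + 2p, computed
# from a control-flow graph given as an adjacency list.
# The example graph is a hypothetical illustration with 8 nodes and
# 10 edges in one connected component, so V(S) = 10 - 8 + 2 = 4.

def cyclomatic_complexity(graph, components=1):
    """graph: dict mapping each node to a list of its successor nodes."""
    n = len(graph)                                  # number of nodes
    e = sum(len(succ) for succ in graph.values())   # number of edges
    return e - n + 2 * components

flow_graph = {
    1: [2, 3],      # an if-statement: two outgoing branches
    2: [4],
    3: [4, 5],      # another decision point
    4: [6],
    5: [6, 7],      # a third decision point
    6: [8],
    7: [8],
    8: [],          # exit node
}

print(cyclomatic_complexity(flow_graph))  # -> 4
```

Each two-way decision adds one to the result, which is why the measure can stand in for the number of decision points mentioned above.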
The data need not have a normal distribution, which is favourable because software measurement data is very often found to be highly skewed and to contain many outliers. The method is easy to use and does not require any complicated mathematical modelling.

The easiest way of explaining the Alberg diagram is through an example, taken from one of the studies.⁵ Assume that a design project contains four modules - A, B, C and D - listed in descending order by the number of function test trouble reports. Table 1 shows the individual and accumulated number of trouble reports (FTTR) for these modules. Two predictors, Pred1 and Pred2, are to be evaluated. Assume that Pred1 points out module B as the most fault-prone, followed by A, C and D. The actual (not predicted) values of FTTR and the module order predicted by Pred1 are used to calculate the accumulated FTTR values for Pred1. Then assume that Pred2 indicates module D as the most risky, and calculate the accumulated FTTR values for this predictor, too. The result is shown in Table 1.

An Alberg diagram, Fig. 5, is created by using the y axis for the accumulated percentage of trouble reports, and the x axis for the accumulated percentage of modules. The accumulated percentage of trouble reports is plotted both when the modules are sorted according to the actual outcome and when they are in the order given by the predictors. Thus, the Alberg diagram shows the ability of the predictors to rank the modules in much the same order as that of the actual outcome. For a predictor to be considered good, its curve should lie close to the uppermost curve, which represents the actual outcome.

Fig. 6. An Alberg diagram for predictor S*SigFF. The modules are sorted in descending order by the number of actual (black curve) or predicted (red curve) trouble reports per source code statement. In both cases, the actual number of trouble reports is used for the accumulated percentage on the vertical axis.
In both cases, the actual number of trouble reports is used for the accumulated percentage on the vertical axis.

The Alberg diagram in Fig. 6 shows the usefulness of the predictor S*SigFF. In this case, the x-axis does not show the accumulated percentage of modules but the accumulated percentage of code. Therefore, the modules have not been sorted as described above, but in descending order by the number of trouble reports (actual and predicted) per code statement. The diagram shows that the predictor makes it possible to point out, at an early stage, a 20-percent portion of the final code which is expected to generate around 40 percent of the failures.

Using predictors in software projects
Fault avoidance and early fault detection are of the utmost importance for software productivity and quality. This means that good predictors are of great value in project planning, where time and resources are critical factors. Any project will benefit from early fault detection, simply because late changes are more expensive than those made early. If the most fault-prone modules can be predicted at the beginning of a software development project, closer attention can be paid to these modules in the early phases of design and testing, which will reduce the number of remaining faults. In general, the total software life cycle can be divided into five phases, or steps: definition of requirements; system and software design; implementation and unit testing; integration and system testing; and, finally, operation and maintenance, Fig. 7. Those phases in Ericsson's software development process which correspond to steps two, three and four of the life cycle are shown in a simplified way. Fig. 7 shows where in the software development process predictions can be made, and on which parameters these predictions are based.

Fig. 7 General software life cycle model related to Ericsson's software development process. After the system analysis phase, the S*IP predictor can be calculated.
The number of implementation proposals related to each module can be identified by reading the different implementation proposals. Estimates of the size of new and modified modules are also available at this stage. After the function design phase, the S*SigFF predictor can be calculated by searching the signal coordination register for the number of new and modified signals. Estimates of the size of new or modified modules are also required. After block design, flowcharts and source code metrics can be used to derive other predictors.

Fig. 8 The development process. Experiences from each step in the cycle are used for continuous improvement and standardisation of all parts of the process.

After system study and analysis, a first prediction of the most fault-prone blocks can be made using the S*IP predictor. A block is a software unit that can be designed and maintained by one designer. Hence, prediction-based planning can be made at a detailed, individual level. The project manager can use this prediction when planning designer and tester training, when assigning the most fault-prone blocks to the most experienced designers, and when requesting a group of experts to inspect these blocks. Also refer to the means shown in Fig. 2. Not all possible test cases for complex systems can be performed, or even specified. But by predicting which blocks contain the largest number of faults, extra resources for the testing of these blocks can be allocated. In this way, overall reliability can be significantly improved within a given margin of expenditure. The S*SigFF predictor can be seen as a complement to the S*IP predictor and used to improve project planning. Other possible predictions could be based on flowcharts or on the source code itself. Software development is a continuous process; new experience is gained from each new iteration, Fig. 8.
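The ranking produced by the S*IP and S*SigFF predictors described above can be sketched in a few lines. The per-module figures (estimated size S, number of implementation proposals IP, number of new or changed signals SigFF) are invented for illustration.

```python
# Ranking modules by the S*IP and S*SigFF predictors.
# All per-module figures below are hypothetical.

modules = {
    "blockA": {"S": 1200, "IP": 5, "SigFF": 14},
    "blockB": {"S": 300,  "IP": 1, "SigFF": 2},
    "blockC": {"S": 800,  "IP": 4, "SigFF": 9},
}

def rank_by(predictor):
    """Modules in descending order of predicted fault-proneness."""
    return sorted(modules, key=lambda m: predictor(modules[m]), reverse=True)

s_ip = rank_by(lambda m: m["S"] * m["IP"])        # S*IP predictor
s_sigff = rank_by(lambda m: m["S"] * m["SigFF"])  # S*SigFF predictor
print(s_ip)     # ['blockA', 'blockC', 'blockB']
print(s_sigff)  # ['blockA', 'blockC', 'blockB']
```

The blocks at the head of such a ranking are the candidates for assignment to the most experienced designers and for extra inspection and test resources.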
Trouble report analysis is one means of acquiring such knowledge, which can be used to improve and standardise the different steps in the design process. When the design process is modified, the prediction formulas may have to be modified too. Minor changes can be compensated for through model parameter correction, but when the design process is considerably improved or changed, a thorough model revision may have to be made. A positive side effect is that the use of prediction stresses the importance of thorough and accurate measurements, which are essential in order to monitor progress and to verify the effect of process improvements.

CONCLUSIONS
Generally, the majority of faults in a software system are found in only a few of its modules. The use of predictors makes it possible to point out which modules are likely to be the most fault-prone. For instance, the predictors S*IP and S*SigFF can be used for predicting the number of trouble reports written for each module in the function test. These predictors are useful because they enable predictions early in the design process. An Alberg diagram is an efficient tool for evaluating and comparing different predictors, especially for data that does not have a normal distribution - a phenomenon of frequent occurrence in software projects. Predictions can be used for planning the software design process such that extra resources are allocated for design and testing of the most fault-prone modules. In this way, software reliability can be improved, both through fault avoidance and through fault detection.

References
1. Alberg, H., Johansson, G. and Ohlsson, N.: Predicting Error-prone Software Modules, 11th Nordic Teletraffic Seminar (1993).
2. Ericsson Telecom AB: Software Reliability Handbook, EN/LZG 205 603 R2 (1993).
3. Johansson, G.: Software Reliability, TRITA/MAT-93/0015, Royal Institute of Technology, Stockholm (1993).
4. Myers, G. J.: Software Reliability - Principles & Practices, John Wiley & Sons, New York (1976).
5.
Ohlsson, N.: Predicting Error-prone Software Modules in Telephone Switches, Industriserien, LiTH-IDA-Ex-9346, Linköping University.
6. Shepperd, M.: A Critique of Cyclomatic Complexity as a Software Metric, Software Engineering Journal, March 1988, pp. 30-36.

Test Marketing of Mobile Intelligent Network Services
Tina Sutton, Stephen Crombie and Rima Qureshi

The competitive nature of the telecommunications industry makes it increasingly necessary for network operators and equipment suppliers to work as strategic partners. Often this means that the equipment supplier must be involved with end users very early in the product cycle, even at the conceptual stage. The authors describe how services derived from mobile IN technology are being tested in the New Zealand market and how these tests are used to support further development of Ericsson's and Telecom Mobile's marketing and product strategies.

Telecom Mobile Communications Ltd, a cellular network operator that employs Ericsson's CMS 88 D-AMPS system in New Zealand, is cooperating with Ericsson on a joint project. The objective of this project is to provide a greater understanding of end-user and implementation aspects of intelligent network (IN) services in mobile cellular networks. Increasing competition and demands for advanced, customised end-user services mean that differentiation and diversification of end-user service offerings are becoming critically important. The test marketing project launched by Telecom Mobile and Ericsson offers the advantage of early market entry with new services, and both companies gain a considerable amount of knowledge about the marketing and deployment of Ericsson Mobile IN services. The project also assists Ericsson in the development of its Mobile IN concept for the CMS 88 Cellular System.

Fig. 1 Mobile IN is providing advanced end-user services to subscribers in New Zealand.
Telecom Mobile and Ericsson both have their own service providers, which furnish a wealth of information about cellular user behaviour.

APPLICATION OF IN TECHNOLOGY IN D-AMPS NETWORKS
Intelligent network technology promises to deliver end-user services that allow a higher degree of:
- service diversity
- customisation at the user or business group level
- rapid design and deployment of new services
- service provider control.
The rapid development and deployment of cellular networks offers new opportunities for utilising IN technology to derive benefits similar to those already seen in the wired telecommunications environment. Catering for terminal mobility adds complexity to the implementation of IN technology in cellular networks. The interaction between existing end-user services and new IN-derived services must also be considered. In the CMS 88 mobile IN platform, these issues are dealt with through the use of a combined Home Location Register and IN Service Control Point, HLR/SCP, Fig. 2. During call processing, the mobile switching centre, MSC, queries the HLR for information about the cellular subscriber, such as location, end-user services and other supplementary information, in order for the call to be progressed. In the case of a call that requires an IN service to be invoked, the HLR detects this and passes the request on to the SCP.

Fig. 2 The mobile intelligent network architecture consists of a collocated HLR/SCP, an MSC and SMAS. SMAS provides a graphical user interface for IN service script development. Service scripts are translated by SMAS into man-machine language (MML) commands that are sent to the SCP via an X.25 link. The MSC communicates with the HLR/SCP using MTUP or IS-41+. Communication between the HLR and the SCP is according to a TCAP-based protocol similar to IS-41. IN subscribers are assigned IN categories in the HLR, depending on the type of service (originating, terminating or transfer) that they subscribe to.
The SCP responds accordingly, depending on what IN services are invoked, and forwards this information to the HLR which, in turn, forwards it to the MSC.

Private numbering plan
The private numbering plan provides an abbreviated-dialling facility for cellular phone users.

THE NETWORK STRUCTURE OF THE CMS 88 MOBILE IN PLATFORM
In the current implementation of CMS 88 mobile IN, service information is passed between the HLR/SCP and the MSC using IS-41+ signalling - a signalling protocol used in the American standard cellular system. The interface between the HLR and the SCP is a proprietary one, but it resides on a standard signalling system 7 platform to allow physical separation of the HLR and SCP, if required. IN call triggers within the existing CMS 88 IN call model are limited to originating calls, terminating calls and call transfer. In later product releases, this will be extended as the service switching function, SSF, in the MSC and the MSC to HLR/SCP protocols are enhanced.

Selective call rejection
Selective call rejection - based on a subscriber-specific A-number restriction list - prevents certain calls from being forwarded to a mobile.

Telecom Mobile has implemented an HLR/SCP in Auckland, to support its entire New Zealand network. The HLR functions are managed by means of the existing service provisioning and management system developed by Telecom Mobile. The IN services are provisioned and managed with Ericsson's SMAS - the service management application system for IN - and the subscriber management application, SMA. The SMA will provide the "front end" for Telecom Mobile's customer services personnel, allowing them to provision and administer IN services in an efficient manner. The SMA is implemented through windows-based extensions to Telecom Mobile's provisioning and management system.

INITIAL MOBILE IN SERVICES
Ericsson provided seven initial services on the IN platform at the time of installation.
These services would form the basis for the first phase of the test marketing project. The initial mobile IN services are:

Timed call diversion
Timed call diversion - based on time of day, day of week or special day - allows calls to be redirected to other numbers.

Outgoing call restriction
Outgoing call restriction - based on a subscriber-specific restriction list - prevents certain numbers from being called from a mobile phone.

Selective call acceptance
Selective call acceptance - based on a subscriber-specific A-number allowance list - allows only certain calls to be forwarded to a mobile.

Cellular business group
The cellular business group service comprises a set of services which can be made available to mobile phone users in the same organisation. Normally, a unique short number, similar to an extension number on a PBX, is assigned to each mobile phone in the cellular business group. Users belonging to the same group can call each other by dialling the short number. Users belonging to a cellular business group can have access to:
- private numbering plan numbers
- the selective call acceptance service
- the selective call rejection service
- the outgoing call restriction service.
The above-mentioned services are common to every user in the cellular business group. In addition, each user in the group can have an individualised timed call diversion service.

0800-type service
This is a network-oriented service - based on time of day, day of week or special day - which allows calls to be redirected to other numbers.

THE TEST MARKETING PROCESS
Telecom Mobile and Ericsson have defined IN service test marketing as: "A systematic approach to obtaining empirical data about cellular subscriber behaviour and implementation issues through deployment of mobile IN services in the marketplace prior to commercial product introduction", Fig. 3.

Fig. 3 Test marketing can be seen as a separate information-gathering process, working in parallel with the overall service creation process. In test marketing, information is collected primarily from the test service deployment process.

The test marketing process enables Telecom Mobile and Ericsson to:
- gain competitive advantage by providing advanced end-user services
- optimise service offerings so that they meet proven and well-defined end-user requirements
- improve time to market by optimising the integration and deployment processes
- improve customer support by optimising the provisioning and administration processes prior to commercial product launch
- gain information about appropriate target markets and on how to optimise pricing, market positioning and promotion of the services.
The test marketing project is being undertaken in a number of phases as new mobile IN services are developed. The test service deployment process, which is designed to expose the service to actual end users, has two phases:

Alpha Test. An in-house test of the service with 10-20 internal users.
Beta Test. An external test of the service using 50 to 500 real end users.

Information from the test marketing process is used to facilitate service launching and then to provide feedback to the service definition and development processes.

Fig. 4 In this application, cellular business group services are used to integrate the numbering plans of a PBX network and cellular phones and to provide access to mobile data facilities.

SERVICE APPLICATIONS USED FOR TEST MARKETING
Telecom Mobile has identified a number of service applications which are used as a basis for test marketing research. These applications are:

Field force
Telecom New Zealand's national fault and work management project has provided the opportunity to test-deploy the cellular business group service.
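As an illustration of the kind of service logic involved, the screening and timed-diversion services described earlier might be sketched as simple rule checks. All phone numbers, lists and diversion rules below are invented, and the real SCP service scripts are of course not written in Python.

```python
# A sketch of selective call rejection, outgoing call restriction and
# timed call diversion as plain rule checks. All data is hypothetical.
from datetime import datetime

REJECT_LIST = {"6495550099"}      # A-numbers barred from reaching the mobile
RESTRICTED = {"0900"}             # outgoing destination prefixes to block

# Timed call diversion rules: (weekdays, from_hour, to_hour, divert_to);
# weekday numbers are 0=Monday .. 6=Sunday, first matching rule wins.
DIVERSIONS = [
    ({5, 6}, 0, 24, "6495550100"),           # week-end: divert all day
    ({0, 1, 2, 3, 4}, 18, 8, "6495550200"),  # weekdays: after hours
]

def accept_incoming(a_number):
    """Selective call rejection: screen on the calling (A) number."""
    return a_number not in REJECT_LIST

def allow_outgoing(b_number):
    """Outgoing call restriction: block listed destination prefixes."""
    return not any(b_number.startswith(p) for p in RESTRICTED)

def diversion_target(when):
    """Timed call diversion; None means ring the mobile as usual."""
    for days, start, end, target in DIVERSIONS:
        in_window = (start <= when.hour < end) if start < end \
                    else (when.hour >= start or when.hour < end)
        if when.weekday() in days and in_window:
            return target
    return None

print(accept_incoming("6495550099"))                  # False: rejected
print(allow_outgoing("0900123456"))                   # False: restricted
print(diversion_target(datetime(1995, 3, 6, 20, 0)))  # weekday evening
```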
The overall objective of the project was to improve customer service by equipping Telecom's fault and maintenance personnel with a wireless handheld data terminal that allows field staff to receive information about customer faults. The cellular network is used to transport data to the mobile data terminal and to provide integrated communications with Telecom's PBX network, Fig. 4. The integration of PBX and cellular communications is achieved through the use of a linked numbering plan for cellular phones and PBX extensions. A cellular phone user only needs to dial a five-digit extension number to reach a PBX extension or a cellular phone. Similarly, a PBX user can reach any one of the field force cellular phone users by dialling a five-digit number. Calls from the public are always routed through the PBX before they reach the cellular phone user. Information about this application of IN was gathered through interviews with the implementation team (Telecom Mobile and Ericsson) and through interviews with users and their managers. The outcome of the initial research was that:
- number translation services could be enhanced to make the service application more flexible and easier to administer
- enhancements of the billing systems were required, to enable the services to be billed individually
- improvements in graphical user interfaces were required, for service provisioning and administration
- the impact on existing billing, provisioning and management systems needs to be taken into consideration
- the processes in the service supply chain need adapting to customised service applications
- the integration of PBX and cellular phone services has wide application for businesses, making communications easier for the user.
This project is continuing to provide information on the application of cellular business group services.

Personal communications service
Telecom New Zealand's PCS (personal communications service) trial began in early 1995 in Auckland and will continue for 12 months. It is a market-oriented rather than a technological trial, seeking to establish approximate demand, pricing, migration from other services, provisioning across fixed and mobile (cellular and paging) networks, and the cost and revenue attributes of PCS. The attempt to integrate some aspects of the existing mobile and fixed networks has proved a significant challenge. Three market segments have been identified for testing PCS services: residential users, small business users, and a large corporate customer. Each PCS customer will be provided with a single PCS number. By utilising IN functionality, users can be contacted whether they are on their mobile phone, at work or at home. The standard PCS package consists of a mobile phone, a common voice mailbox, messaging alert, timed call diversions, call diversion override, divert on busy, home cell calling and twelve-hour help desk support.

Large corporate trial
Although the PCS service is targeted at the mass market, the inclusion of a large corporation in the trial makes it possible to investigate end-user aspects of providing PBX-like features to corporate users. An important technology related to the provision of PCS is microcell/picocell radio access. Through the use of microcell technology and IN services, Telecom and Ericsson have been able to provide PBX-like features on a cellular phone with location-dependent tariffs. The objective, from the large corporate's point of view, is to improve call completion in a cost-effective manner by using microcell technology and location-based tariffs - calls within the microcell being tariffed relative to PBX costs, Fig. 5.

Fig. 5 The application of mobile IN services in combination with microcell is tested with a large corporate user. A microcell has been placed on the customer's premises, and the cellular business group feature provides "PBX-like" services.
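The linked five-digit numbering plan used in the field-force application might be modelled as a simple translation table, consulted by the IN number translation service at call set-up. All short numbers and routing numbers below are invented.

```python
# A linked numbering plan: a five-digit short number reaches either a
# PBX extension or a field-force cellular phone. All data is made up.

SHORT_NUMBERS = {
    "51234": ("pbx", "4-921-1234"),        # office extension
    "62001": ("cellular", "6421555001"),   # field technician's mobile
}

def translate(dialled):
    """Map a dialled five-digit short number to a routable number."""
    if len(dialled) != 5 or dialled not in SHORT_NUMBERS:
        return None                        # not part of the linked plan
    return SHORT_NUMBERS[dialled]

print(translate("62001"))  # ('cellular', '6421555001')
print(translate("99999"))  # None
```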
The IN features being used (cellular business group and timed call diversion) allow for integration of the PBX and the cellular phone services, including extension dialling from PBX to PCS mobiles and vice versa. Users within the large corporation are taking part in telephone interviews, in-depth interviews and focus groups to obtain a user perspective on PCS as applied to a business environment.

Further research
Focus groups demonstrating IN features were held, and as a result a number of other service applications have been identified. Specific, favoured applications are based on the cellular business group and outgoing call restriction services.

INITIAL RESULTS
Although the test marketing process has only been in place since mid-1994, a considerable amount of new information has been gained. This information falls into four main categories: new service requirements, impact on organisation, impact on processes, and marketing data.

New service requirements
The following new requirements have been identified and will appear as new IN services in the near future:
- enhanced number translation functions
- enhanced screening services, i.e. selective call diversion
- selective forwarding of calls
- location-dependent call forwarding
- improvements to voice announcements
- services to support fixed and mobile integration, i.e. personal communication service, PCS, and universal personal telecommunication, UPT
- enhanced billing.
The significance of this list lies in its having been derived from and tested against real end-user requirements.

Fig. 6 The service creation process starts with market research to identify end-user requirements. These requirements are specified by the network operator and subsequently developed by Ericsson. Services are delivered as AXE software or IN service scripts to the operator, who provisions the service to end users via service providers and dealers.
Impact on organisation and processes
When attempting to increase flexibility and reduce the time-to-market of new end-user services, the organisational and process issues become critically important throughout the service creation process, Fig. 6. By undertaking the test marketing process, Telecom Mobile and Ericsson identified these issues at an early stage. In summary, the areas of prime importance were:
- an increase in operator control over new service development, allowing the market to be the driver rather than waiting for new services to be provided in a generic software package
- a requirement for closer operational and strategic relationships between the end user, the network operator and the service developer
- a requirement for improvement of service creation processes to ensure that time-to-market goals are achieved
- significant changes to implementation and post-implementation support processes, to take advantage of the possibility of customised service applications
- the emergence of more highly-trained distribution channels to market and sell new services, particularly customised service applications.
Along with the development of new mobile IN services and of the platform itself, there is a manifest requirement for change and development of the organisation and its processes. This is essential if the flexibility and time-to-market goals of mobile IN are to be attained.

Marketing aspects
Information from end users has provided the most valuable feedback regarding the marketing of mobile IN features. What has been gathered to date will be validated through further research as the test marketing project continues. Some observations follow. As the mobile market moves from business to consumer, the ability to retain the relatively high-revenue business customer will become increasingly important to network operators.
The customisation of services (to meet the requirements of a specific user) which IN facilitates will become a key competitive differentiator and a means of protecting that business revenue stream.

Pricing
Business customers are seeking increased functionality; at the same time, they expect telecommunications costs to fall and are looking for features that will cut their cellular service costs. The end user is not necessarily willing to pay for the extra functionality IN services provide. After all, a feature such as PBX-extension dialling from a cellular phone - which makes the cellular phone operate in a more "PBX-like" fashion - will cause a customer to consider similarities to PBX/PSTN/VPN pricing rather than to mobile pricing. In this scenario, customers are not only unwilling to pay more for the functionality of a cellular business group; they often expect to pay less.

The corporate user
The large corporate or national customer is an obvious target for the features of a cellular business group. Clearly, there are certain features that the customer would like all the members of the group to have (such as extension dialling), but other features may need to be tailored to the specific individual (timed call diversion and outgoing call restriction by time of day, for example). A truck driver might only be allowed to make calls from his cellular phone during working hours, while senior executives should be able to use their cellular phones for outgoing calls at any time of the day. By controlling who can make calls to which destinations, the company will also control its costs. IN functionality must therefore cater for both individual and generic requirements. Providing microcell/picocell technology in a corporate office environment may primarily be a question of relieving network congestion. However, the creation of a specific system area gives the network operator an opportunity to develop special tariff options for "internal" cellular calls.
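Such location-dependent tariffing amounts to picking a rate from the serving cell and the call type at set-up time. The cell identities and per-minute rates below are purely illustrative.

```python
# Location-dependent tariffing: business-group calls set up while the
# mobile is served by the on-premises microcell are charged at a
# PBX-like "internal" rate. All cell names and rates are invented.

HOME_CELLS = {"microcell-hq-1", "microcell-hq-2"}

def tariff_per_minute(serving_cell, business_group_call):
    """Pick a per-minute rate from the serving cell and call type."""
    if serving_cell in HOME_CELLS and business_group_call:
        return 0.05    # internal, PBX-like rate
    return 0.40        # ordinary cellular rate

print(tariff_per_minute("microcell-hq-1", True))   # 0.05
print(tariff_per_minute("macrocell-17", True))     # 0.4
```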
The "home cell" tariffing of the PCS environment may effectively be transported to a cellular business environment, to benefit both the network operator and the customer.

The small business
Advanced call forwarding services are particularly attractive to small business users who want to be in contact with their customers at any time of the day; for instance, calls during the day being forwarded to voice-mail if the customer is unavailable, and calls after working hours being forwarded to their home number. The small business user wants a standard call forwarding profile for most of the year, but must have control of how this profile is set and the ability to change it at will while he is on vacation or when special circumstances arise. The optimal situation is that the customer has control, but it must be made simple for him. This is especially the case as the market moves into the consumer area, with its less sophisticated telecommunications users.

Positioning and promotion
Customers were interested in understanding how IN services could provide them with competitive advantage. In attempting to market IN services, network operators must position them in a way that shows benefits to the customers in terms of their own business situation. Customisation and control are key components here, along with flexibility and ease of use. This should be kept in mind as IN services are further developed.

SUMMARY
The test marketing project has completed its first phase; investigation of new IN services and marketing issues is now being planned. The project is seen as an ongoing and evolutionary process, designed to gain a better understanding of end users and their requirements. Telecom Mobile and Ericsson have acquired a considerable amount of information about the implementation and marketing aspects of mobile IN services.
This information will be a great help in the further development of services, because it shows the way organisations and processes have to change to meet the increasing demands of end users.

AXE 10 Dependability
Karl-Axel Englund

Because telecommunication exchanges are expected to work 24 hours a day, 365 days a year, with negligible down time, they must have an extremely high level of dependability. The author outlines the fault-tolerance and maintainability functions, and the different tasks and activities that are emphasised during the design and development of the AXE 10 system, to ensure high reliability and maintainability.

Today's users of telecommunication networks and services expect superior quality of the services offered. They require high performance of the network during the establishment and retention of calls, and they expect network operators to provide good service support in terms of short installation time, quality of user instructions and charging integrity. Moreover, services have to be provided at a reasonable cost for the users and with a reasonable profit for the operator. Today, when transmission costs are decreasing, thanks in part to the increased use of optical fibres in the network, requests to reduce costs further will have to focus on the switching elements. This requires exchanges that are very reliable and easy to maintain.

Fig. 1 The Quality of Service concept and its relation to dependability, according to ITU-T recommendation E.800 (revised edition).

AXE 10 has been designed with stringent requirements for reliability, maintainability and quality of service. These requirements apply to software, hardware, documentation and the man-machine interfaces. The proven low failure rates and the philosophy of integrated operation and maintenance, which incorporates scheduled corrective maintenance, contribute to the comparatively low life-cycle cost of AXE 10.
Predictions and estimations of reliability and maintainability performance and quality of service are important parts of the assurance activities. Fig. 1 shows the quality-of-service concept and its related performances. Estimates based on field observations provide the necessary feedback for the creation and updating of databases with the information needed for future, more comprehensive and accurate predictions. These predictions also provide operators with the input they need to plan maintenance support. The technological enhancements - continuously introduced into the AXE 10 system, thanks to its unique modularity - can easily be covered by updated predictions based on comprehensive field studies.

DEPENDABILITY OBJECTIVES
In recent years, the in-service performance of telecom exchanges has improved. Demands for higher reliability have also increased, as has the need for larger switching nodes capable of providing more functionality to a growing number of subscribers. In terms of reliability and unavailability, the mean accumulated down-time objective for an exchange is approaching a few minutes per exchange and year. The past few years have seen a rapid growth of interest in network survivability, especially in the US. In the standardisation work, this has been accompanied by more stringent requirements for node switches.

Fault tolerance
The architecture chosen heavily influences the reliability of the system. The exchanges in the telecom network must have adequate redundancy, to limit the effects on the service if major faults occur. A primary characteristic of AXE 10 is that hardware redundancy is provided for all equipment handling more than:
- 128 lines (POTS)
- 32 trunks (one PCM system)
- 64 ISDN basic access lines (2B+D)
- 4 ISDN primary access lines (30B+D).
The central processor (CP) is duplicated.
Its two sides work in parallel-synchronous mode and continuously compare the program execution in both sides, which ensures high system hardware availability. Error-correcting codes in the program and data store units ensure that single-bit errors do not cause an outage of either of the CP sides. Redundancy is provided for the decentralised regional processors, for the bus systems between the central processor and the regional processors, and between the regional processors and their subunits. The reliability performance of the digital group switch is ensured by the duplicated switching network and the triplicated clock modules for synchronisation.

Fig. 2 When a hardware fault occurs, the supervisory functions will initiate a series of actions that lead to the removal of the faulty printed board assembly. Recovery in the case of software errors is accomplished through the most appropriate action with respect to the fault situation. The range of the automatic recovery actions depends on the application, from low-level recovery by forlopp release to a major restart with reload. The faulty function block is corrected after analysis and testing.

HARDWARE RELIABILITY
Reliable equipment performance necessitates components that conform to defined quality and performance requirements and meet the demands for long-term reliability. To ensure that only high-quality and reliable components are used, a comprehensive component reliability assurance program, consisting of the following elements, is essential:
- lot-to-lot control
- vendor qualification
- process qualification
- family qualification
- type approval
- standardisation
- feedback and corrective actions.
The power supply system is sectioned in such a way that the converters and the distribution follow the redundancy principles and structure of the powered units. The configuration of the I/O subsystems is flexible and can be adapted to the performance requirements that apply to the network.
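The single-bit error correction performed by the store units can be illustrated with a classic Hamming(7,4) code: the parity bits locate a flipped bit, which is then inverted back. This is only a sketch of the principle; the actual codes used in the AXE 10 stores are not described in the text.

```python
# Hamming(7,4) single-bit error correction: encode 4 data bits into a
# 7-bit word, flip one stored bit, and recover the original word.

def encode(d):
    """Encode 4 data bits as a 7-bit Hamming code word."""
    c = [0] * 8                  # 1-indexed working array, c[1..7]
    c[3], c[5], c[6], c[7] = d
    c[1] = c[3] ^ c[5] ^ c[7]    # even parity over positions 1, 3, 5, 7
    c[2] = c[3] ^ c[6] ^ c[7]    # even parity over positions 2, 3, 6, 7
    c[4] = c[5] ^ c[6] ^ c[7]    # even parity over positions 4, 5, 6, 7
    return c[1:]

def correct(word):
    """Repair at most one flipped bit; the syndrome gives its position."""
    c = [0] + list(word)
    s = (c[1] ^ c[3] ^ c[5] ^ c[7]) \
        + 2 * (c[2] ^ c[3] ^ c[6] ^ c[7]) \
        + 4 * (c[4] ^ c[5] ^ c[6] ^ c[7])
    if s:                        # s = 0 means no single-bit error
        c[s] ^= 1
    return c[1:]

code = encode([1, 0, 1, 1])
damaged = list(code)
damaged[2] ^= 1                  # a single stored bit flips
print(correct(damaged) == code)  # True: the error is corrected
```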
MAINTAINABILITY
Extensive maintenance functions are built into the system to supervise hardware and software functions and to support maintenance activities. These functions automatically supervise connections through the exchange and check that the quality-of-service performance values remain within prescribed limits. Real traffic is supervised. Fig. 2 shows a flow diagram of the maintenance actions taken when faults occur.

Software supervision
The methods used to detect software errors in AXE 10 range from direct supervision, such as plausibility checks, to indirect supervision of time, pointers, signals, etc. Full-coverage supervisory circuits in the central processor check, at regular intervals, that program handling is started from a certain point. Plausibility checks of microprograms are made to discover program faults before they disperse incorrect data. Each time data is written in - or read from - the data store, the index and pointer values used to calculate the absolute store address are checked, to ensure that they will not cause over-addressing. Automatic actions are initiated when errors are detected. When a program error has been traced to a function block, relevant data about the cause of the error is stored in memory, for subsequent printout.

Fig. 3 Typical recovery from a hardware fault in the central processor. In normal operation, the central processors work in parallel: the two CP sides (CP-A and CP-B) execute the same program at the microprogram level. However, only one side - CP-A - is executive and handles traffic. If a hardware failure occurs in one CP side, a mismatch between CP-A and CP-B is detected. The faulty CP side is either directly pointed out or identified by a side-determining test program. If the failure occurs in the executive side, a switchover immediately takes place. The faulty CP side is then stopped. Traffic handling is not disturbed by this procedure.
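The index and pointer validation performed at each data store access can be sketched as follows. This is a minimal illustration in Python; the store layout, names and the exception used are assumptions, not the APZ mechanism itself:

```python
class DataStore:
    """Sketch of over-addressing protection on a data store (illustrative only)."""

    def __init__(self, size):
        self.size = size
        self.words = [0] * size

    def _checked_address(self, base, index):
        # The absolute address is computed from a base pointer and an index;
        # both are validated before the store is touched.
        address = base + index
        if index < 0 or not 0 <= address < self.size:
            raise MemoryError(f"over-addressing: base={base} index={index}")
        return address

    def read(self, base, index):
        return self.words[self._checked_address(base, index)]

    def write(self, base, index, value):
        self.words[self._checked_address(base, index)] = value
```

An out-of-range access is caught before any data is corrupted, so an automatic recovery action can be initiated instead.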
If serious errors in the execution can be traced to central software units, the system is restarted. If the fault is contained within a single process - for example, a call process adapted to the forlopp handling feature - then a low-level fault recovery mechanism called forlopp release clears the faulty process without affecting any other process in the system. The word 'forlopp' is used to denote a sequence of events.

Audit functions
Audit functions are implemented in APZ in order to detect latent errors; for example, those caused by software errors (bugs). When an audit function detects an error, an alarm is issued to inform operators about errors that require manual intervention. The alarm condition ceases after correction of the error. Currently, the following audit functions are available:
- detection and correction of processor store errors
- check summing of the program store
- detection and reporting of data errors
- consistency check of the reference store and data store.

Hardware supervision
Some of the characteristics of hardware maintainability are:
- automatic supervision of each connection through the exchange, combined with checks to ensure that the serveability remains within prescribed limits
- automatic reconfiguration in the case of hardware failures
- automatic fault localisation
- advanced built-in functions for diagnostics and repair.
Supervision of the switching system takes the form of supervision of the real traffic it handles. Traffic circuits, such as trunk lines, are supervised by means of statistical methods; an alarm is issued when abnormal behaviour is detected. The processor hardware is checked by means of various supervisory functions, such as:
- CP-side matching
- parity checks
- check summing of memories
- ECC (error-correcting code)
- microprogrammed addressing checks
- watchdog function
- routine tests.
Other fault-detection mechanisms, essentially used to activate infrequently operated circuits and functions, operate on a scheduled basis.
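A check-summing audit of the kind listed above can be sketched as follows. The sketch uses a CRC-32 checksum purely for illustration; the actual APZ checksum algorithm is not described in this article:

```python
import zlib


class ProgramStoreAudit:
    """Sketch of a scheduled check-summing audit (illustrative names).

    A reference checksum is recorded for the program store; the audit
    recomputes it on a schedule and signals an alarm on mismatch."""

    def __init__(self, program_store: bytes):
        self.reference = zlib.crc32(program_store)

    def run(self, program_store: bytes) -> bool:
        """Return True if the store is intact, False to raise an alarm."""
        return zlib.crc32(program_store) == self.reference
```

Run on a scheduled basis, such an audit detects latent errors that would otherwise go unnoticed until the corrupted program is next executed.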
The APZ fault-detection functions are implemented in hardware, software and microprograms. The supervisory functions detect both permanent and transient faults. The time to fault detection is short, and a clear indication of the faulty unit is given in a diagnostic printout.

Error-correcting code
The need for large memory volumes in APZ has affected the hardware failure intensity, which is closely related to the number of memory circuits used in an application and to their failure rate. For this reason, all APZ processors are provided with error-correcting codes, on a per-word basis, in the CPS store units, to deal with both permanent and transient failures. The CPS has been designed to be tolerant of single memory faults; it can continue to handle traffic without taking the faulty CP side out of service. Replacement of faulty memory boards can be postponed until other repair activities in the affected side have been performed, which extends the time between repairs.

Forlopp release
The forlopp handling mechanism provides correction at the call process level. This means that higher-order recovery procedures, such as restarts, can be avoided for most faults.

Fig. 4 The AXE system is provided with various protective mechanisms to handle software-related errors, in order to keep traffic handling intact to the greatest possible extent. Depending on the fault code and recovery implementation, faulty program execution may affect only a single call attempt or all call processing. The system has built-in facilities for escalation from low-level recovery, e.g. a forlopp release, to higher-level recovery, e.g. a system restart, if necessary.

Hardware reconfiguration
For traffic handling not to be affected, the hardware must be reconfigured quickly. The reconfiguration of hardware is implemented at different levels of the AXE system.
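A per-word error-correcting code of the kind described above is, in principle, a single-error-correcting Hamming code. The sketch below is a generic textbook Hamming code in Python - not the actual APZ circuit logic - showing how a parity syndrome both detects and locates a single flipped bit in a code word:

```python
def hamming_encode(data):
    """Place data bits at non-power-of-two positions (1-indexed) and
    compute Hamming parity bits at the power-of-two positions."""
    r = 0
    while (1 << r) < len(data) + r + 1:   # number of parity bits needed
        r += 1
    n = len(data) + r
    code = [0] * (n + 1)                  # 1-indexed; position 0 unused
    it = iter(data)
    for pos in range(1, n + 1):
        if pos & (pos - 1):               # not a power of two: data bit
            code[pos] = next(it)
    for k in range(r):
        p = 1 << k
        parity = 0
        for pos in range(1, n + 1):
            if pos & p and pos != p:      # positions covered by this parity
                parity ^= code[pos]
        code[p] = parity
    return code[1:]

def hamming_correct(code):
    """Return (data bits, syndrome); a non-zero syndrome is the 1-indexed
    position of a single erroneous bit, which is corrected in place."""
    bits = [0] + list(code)
    syndrome = 0
    for pos in range(1, len(bits)):
        if bits[pos]:
            syndrome ^= pos
    if syndrome:
        bits[syndrome] ^= 1               # flip the single faulty bit
    data = [bits[pos] for pos in range(1, len(bits)) if pos & (pos - 1)]
    return data, syndrome
```

In hardware this check is applied to every word on every access, which is why a single transient bit error never causes an outage of a CP side.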
In the RPs that control the traffic devices, redundancy is implemented through pairs of RPs working in load-sharing mode, whereas switching is controlled by RPs that work in active/standby mode. A recovery mechanism that is transparent to the handling of traffic is used in the event of a hardware failure in the central processors, Fig. 3. The faulty CP side is taken out of service without affecting the traffic handling process; the fault-free CP side continues to work. Once the faulty side has been repaired, the two sides are brought to work in parallel again.

Software recovery
Large software systems are never completely free of faults. They must therefore have protective mechanisms that eliminate or limit the effects of faults on the handling of traffic. Fig. 4 illustrates the system recovery mechanisms used in AXE 10. Automatic system restarts were long the only mechanism for recovery from severe software errors, restoring the system to a predefined stable state from which the execution of programs can be continued. Although this is a simple and robust recovery mechanism, the need for more sophisticated and selective mechanisms has emerged. Today, the system is gradually being upgraded with new, low-level fault recovery mechanisms. Depending on the location and severity of the error, the control system can use the protective mechanisms described in the following.

The design of forlopp handling, which enables forlopp release in an application system, is supported by the APZ system. The forlopp handling mechanism can also be used as a test tool that can manage hanging devices (hung lines or trunks) and report abnormalities in the call process. When neither forlopp release nor selective restart can handle a fault (for example, job buffer congestion), an unconditional system restart is performed. However, forlopp release adequately handles the most frequent faults and substantially improves in-service performance.
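The principle of forlopp release - clearing one faulty call process and its resources while every other process continues undisturbed - can be sketched as follows. The names are invented for illustration; the real mechanism operates on processor state and telephony resources, not a Python dictionary:

```python
class ForloppManager:
    """Illustrative sketch of forlopp-style low-level recovery."""

    def __init__(self):
        self.forlopps = {}   # forlopp id -> set of allocated resources

    def allocate(self, fid, resource):
        """Register a resource (line, trunk, timer, ...) with a forlopp."""
        self.forlopps.setdefault(fid, set()).add(resource)

    def release(self, fid):
        """Clear one faulty call process and return its resources;
        all other forlopps in the system are untouched."""
        return self.forlopps.pop(fid, set())
```

Because the fault is confined to one forlopp, a system-wide restart is avoided for the most frequent fault cases.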
Selective restart
A selective restart functions as follows: if a software error is detected in a block that is less important to the traffic process, the system restart can be either terminated or delayed until traffic is low. The recovery action is determined by
- the fault code
- the block category
- the time of day.
A selective restart can be activated or deactivated by command. It is controlled by the block category, which is also set by command or contained in the exchange data. Each time an error interrupt does not restart the system, an error intensity counter is incremented, until a maximum limit is reached. This maximum limit can be set by command; when it is reached, the system is restarted. The error information is preserved, and an alarm is initiated for each recovery; detailed information about the contents of the registers at the time the error occurred can be printed out by command. The error intensity counter is automatically decremented at pre-defined intervals. The available block categories and their actions are:
Category 0 Error is ignored
Category 1 Delayed system restart
Category 2 Immediate system restart
Category 3 Delayed reload with major restart
In cases where not all processes in the system are adapted to the forlopp handling mechanism, the selective restart complements the forlopp release. The capability to assign block categories for restart purposes greatly reduces the impact of minor software errors on AXE in-service performance.

Fig. 5 Calculation tools have been integrated into a system that contains both databases and programs. The tools are used for predictions in the design phase, dimensioning of spare parts inventories, and estimates based on in-service performance statistics. The system is called DEPEND (short for "dependability").
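The block categories and the error intensity counter described under Selective restart can be sketched as follows. This is a minimal illustration; in a real exchange the limit and the decrement interval are set by command:

```python
# Block categories and their recovery actions (as listed in the article)
CATEGORY_ACTION = {
    0: "error is ignored",
    1: "delayed system restart",
    2: "immediate system restart",
    3: "delayed reload with major restart",
}


class ErrorIntensityCounter:
    """Sketch of error intensity supervision for selective restart.

    Each error interrupt that does not restart the system increments the
    counter; when the limit is reached, a restart is forced. The counter
    is decremented at pre-defined intervals."""

    def __init__(self, limit):
        self.limit = limit
        self.count = 0

    def on_error(self):
        """Record one suppressed error; return True if a restart is now due."""
        self.count += 1
        return self.count >= self.limit

    def on_interval(self):
        """Periodic decrement, so isolated errors never force a restart."""
        self.count = max(0, self.count - 1)
```

The decrement step is what distinguishes a burst of errors (which forces a restart) from occasional, harmless ones.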
Minimal restart
If the control system can determine from the fault code that the regional software in RPs and EMRPs is not involved in the faulty process, these units are not updated when a restart is initiated. This kind of restart is called a minimal restart and is shorter than an ordinary minor restart.

Restart escalation
Previously, when the time interval between two consecutive program errors was less than a fixed ten-minute interval, the restart level was automatically escalated to a higher level. The ten-minute restart window has been improved in later APZ versions, and it can now be changed by command. The number of consecutive restarts allowed before escalation to a higher restart level can also be stated. The possibility of changing the restart window reduces the risk of cyclic restarts before escalation.

Box A Quality of Service
"The collective effect of service performances which determine the degree of satisfaction of a user of the service.
Note 1. - The quality of service is characterised by the combined aspects of service support performance, service operability performance, serveability performance, service integrity and other factors specific to each service.
Note 2. - The term "quality of service" is not used to express a degree of excellence in a comparative sense, nor is it used in a quantitative sense for technical evaluations. In these cases a qualifying adjective (modifier) should be used."
From ITU-T Recommendation E.800

Repair facilities
The repair function includes alarm handling and support for repairs. An alarm is issued when a permanent hardware fault has been found and localised, or when the number of transient errors has reached the specified alarm level. Support functions locate the faulty printed board assembly. The system also guides operators through the actions they need to take when intervening manually.
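The restart window and escalation rule can be sketched as follows. The level names and default values below are illustrative, not actual APZ parameters:

```python
class RestartEscalator:
    """Sketch of restart escalation with a command-settable window.

    If more than `max_restarts` occur within `window` seconds, the next
    restart is escalated to a higher level."""

    LEVELS = ["minimal", "minor", "major", "reload"]   # illustrative names

    def __init__(self, window=600.0, max_restarts=2):
        self.window = window          # the classic fixed value was 600 s
        self.max_restarts = max_restarts
        self.level = 0
        self.history = []             # timestamps of recent restarts

    def on_restart(self, now):
        """Record a restart; return the level at which it is performed."""
        self.history = [t for t in self.history if now - t < self.window]
        self.history.append(now)
        if len(self.history) > self.max_restarts:
            self.level = min(self.level + 1, len(self.LEVELS) - 1)
            self.history = [now]      # count afresh at the new level
        return self.LEVELS[self.level]
```

Widening the window or lowering the restart count makes escalation happen sooner, which is how cyclic restarts at a level that cannot clear the fault are avoided.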
A repair check is always made after a board has been replaced, and a printout acknowledgement states that the alarm has been removed and that the system has been restored to the normal working mode. In rare situations, when the normal procedures and diagnostics are insufficient, expert support may be necessary. To assist consulting experts, an error log with detailed information about the contents of the registers at the time the error occurred can be printed out by command.

RELIABILITY AND MAINTAINABILITY ASSURANCE DURING DESIGN

Dependability plan
Reliability and maintainability performance analyses are integral parts of the design process. The specified reliability requirements are analysed, and low-level requirements are set for each subunit. For example, in a feasibility study, various alternatives for the product structure are studied and the reliability and maintainability performances of the different solutions are predicted. When a new hardware project is started, a dependability plan describes the activities and resources that will be employed during the design process to ensure reliability and maintainability performance. Reliability and maintainability performance are predicted to check compliance with requirements. The results of these predictions, including the necessary analyses (stress analyses, fault modes and effects analyses, FMEA), are presented in a formal document according to recommended standards. Within the project, review meetings are held to evaluate the results and to compare them with the stated requirements, for the purpose of providing input to the continued design. Predictions often comprise a large amount of data. The results of the complex calculations must be available in time to support the ongoing design activities. A set of computerised tools has therefore been created.

Component reliability predictions
At an early stage of the design process, failure rates are predicted for components and printed board assemblies, to provide project management with the information they need to ensure that reliability and maintainability performance meet the specified requirements.
All predictions are based on Ericsson's database for component failure rates, TILDA. A component reliability prediction model, based on findings from extensive field studies, is also used. Ericsson Telecom has gathered this kind of data for several years and sees a consistent and ongoing downward trend in the failure rates of microcircuits. The designer can choose the optimal design by selecting appropriate components and acting on available information from production, testing and operation. In particular, vital components with high integration, such as ASICs, are followed up with reliability tests before they are released. Manufacturers' life-cycle test results are thoroughly analysed and reviewed. Information on reliability is further used by allocating failure rate data to the component type concerned.

Fig. 6 Failure rate prediction for a printed board assembly using one of the SYPREX programs. In this case, the components are listed in order of their contributions to the total failure rate. A default template with parameter values for a typical environment and application is provided. For temperature sensitivity analysis, the parameters can be changed by the user. Follow-up of "old" prediction data for a certain unit (magazine or printed board assembly) is possible by changing the cut-over date (cut date). The position and failure rate of each individual component on the board can also be presented. Reports can be recorded on files for use by other tools.

Fault modes and effects analysis, FMEA
FMEA is a method used during the design and development phase to analyse reliability by identifying failures that significantly affect system performance. How the analysis is made depends on the specific purposes for which its results are needed. These results are used to produce reliability block diagrams and state diagrams.
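A simple way to obtain such early failure-rate predictions is the parts-count method: summing the failure rates of all components used. A minimal sketch, with invented component figures (not TILDA data):

```python
def parts_count_failure_rate(components):
    """Parts-count prediction: the board failure rate is the sum of
    quantity x failure rate over all component types, in FIT
    (failures per 1e9 component hours)."""
    return sum(qty * fit for qty, fit in components.values())


# Hypothetical board: quantities and FIT values are purely illustrative.
board = {
    "64k SRAM": (16, 50.0),      # 16 devices at 50 FIT each
    "ASIC": (2, 120.0),
    "resistor": (200, 0.5),
}
total_fit = parts_count_failure_rate(board)   # total failure rate in FIT
mtbf_hours = 1e9 / total_fit                  # crude MTBF estimate
```

Predictions of this rough kind are refined later with stress analyses and field data as the design matures.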
To select new components, the design centres are supported by the material analysis laboratory. Components based on new technologies, such as BiCMOS, are studied carefully in order to achieve good reliability assessments. Potential fault modes are assessed, and the resulting data is used in subsequent fault modes and effects analyses of the boards. If necessary, the design and layout of printed board assemblies are analysed using temperature simulations. Reliability versus ambient temperature is calculated, to provide a basis for improved performance. Weak components, and components that severely affect reliability performance, are thereby identified. Initial predictions are achieved by roughly calculating the overall failure rates for the different boards using the parts-count method; that is, by adding the failure rates of all components used. The results can verify that an appropriate reliability structure is feasible. As the design progresses, the predictions are refined to provide more adequate reliability assessments.

Prediction tools and databases
Most of the tools used for ordinary reliability work are integrated into the data application system available on host processors in Ericsson's corporate network. These tools have been collected into a system called DEPEND, Fig. 5, which contains databases and calculation programs. They are accessible on-line from workstations and closely connected to other Ericsson databases used for design, ordering and manufacture. Prediction tools for complex reliability models and other work based on statistical theory are accessible as separate programs on the workstations.

TILDA
Prediction of the hardware failure rates used in the reliability models is based on the component failure rates accessible in the TILDA database. This database also contains information on the temperature range and stress level for each component type.
All data in TILDA is continuously updated by a committee whose job it is to ensure that the information in the database reflects the most recent developments and knowledge of the components.

SYPREX
A set of programs called SYPREX is used by hardware designers to predict the reliability performance of components, printed board assemblies, magazines and other units containing components. Predictions can be based either on the results of detailed stress analyses and fault modes and effects analyses, or on typical stress data stored in TILDA (tentative predictions). Fig. 6 shows an example of a reliability prediction for a printed board assembly. Prediction results for hardware units are automatically stored in FELIX (see below) for later retrieval.

FELIX
A database called FELIX contains predicted failure rates for printed board assemblies and other items. These values can be retrieved for use in predictions for more complex composite items, for calculations of spare parts stores, or for comparison with field data.

RESEX
A set of programs called RESEX is used in the support process to calculate repair costs and to dimension spare parts inventories (components, printed board assemblies and other replacement units) for single- and multi-level spare parts store organisations.

VERA
To follow up the in-service performance of hardware, a set of programs - VERA - is used to estimate the failure rates of components, printed board assemblies and other replaceable units. Statistical methods are used to compare the predictions with the observed data, adding precision to the estimates.

FIDEX
FIDEX is a collective term for the databases that contain information on failed and replaced hardware equipment reported to a repair centre, together with information gathered from fault analyses.

Dependability modelling
Various types of dependability modelling techniques are used, depending on the subsystem, the type of measure to be predicted, and the availability of data.
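One widely used technique is Markovian modelling of the kind shown in Fig. 7. The sketch below builds a three-state model of a duplicated processor (both sides working, one side down, both down) and solves for the steady-state probabilities with plain Gaussian elimination; the failure and repair rates are invented for illustration:

```python
def steady_state(Q):
    """Steady-state distribution of a continuous-time Markov chain with
    generator matrix Q: solve pi * Q = 0 with sum(pi) = 1."""
    n = len(Q)
    A = [[Q[j][i] for j in range(n)] for i in range(n)]   # transpose of Q
    b = [0.0] * n
    A[-1] = [1.0] * n        # replace one balance equation by normalisation
    b[-1] = 1.0
    for col in range(n):     # Gaussian elimination with partial pivoting
        piv = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[piv] = A[piv], A[col]
        b[col], b[piv] = b[piv], b[col]
        for row in range(col + 1, n):
            f = A[row][col] / A[col][col]
            for c in range(col, n):
                A[row][c] -= f * A[col][c]
            b[row] -= f * b[col]
    pi = [0.0] * n
    for row in range(n - 1, -1, -1):     # back substitution
        s = b[row] - sum(A[row][c] * pi[c] for c in range(row + 1, n))
        pi[row] = s / A[row][row]
    return pi


# States: 0 = both CP sides working, 1 = one side failed, 2 = both failed.
lam, mu = 1e-4, 0.1       # invented failure and repair rates (per hour)
Q = [
    [-2 * lam, 2 * lam, 0.0],
    [mu, -(mu + lam), lam],
    [0.0, mu, -mu],
]
pi = steady_state(Q)      # pi[2] is the system-down probability
```

The system-down probability multiplied by the hours in a year gives the kind of mean accumulated down-time figure quoted in the dependability objectives.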
Dependability models are based on the initial predictions made during design; they are further updated as the design progresses and when the predictions are updated during later life-cycle phases. A large number of calculations are made on the basis of state-transition diagrams and Markovian modelling techniques, Fig. 7, using the SYPREX calculation programs.

Maintainability analysis
In a very early phase of the design, when the layout of the hardware and software for the maintenance functions has been outlined, a maintainability analysis is made. The focus of this analysis is on the system's capability for failure detection, fault localisation and repair. The analysis enables management to assess, for example, how well the system meets the diagnostic accuracy requirement, and whether improvements in the design are necessary. The precision of the analysis depends on the quality of the component failure rate data and the depth of the analysis.

Fig. 7 An example of a dependability model using the Markovian modelling technique. The tool is used for designing and examining Markov chains in an interactive way. The diagram, which is drawn directly in the window, shows the expected states of a central processor configuration with redundancy, and how the system is handled in fault situations. An arc between states represents a transition as a result of a failure, logistic delay, repair, or recovery. Each transition has a parameter value. The result of a calculation contains the steady-state probabilities indicated for each state, the mean recurrence time, and the mean time spent by the system in each state or in a combined state, e.g. a system fault state.

Fig. 8 Evaluation of maintainability and reliability performances through fault simulation. The basic data that has to be prepared before the actual fault simulation starts at a test site includes a list of faults randomly selected according to a specification. Test probes are connected to circuit pin positions on the circuit boards when a fault is simulated.
Fault simulation is performed by fault simulator equipment controlled by PC programs that enable the triggering of single or multiple test points to correspond to fault modes. The behaviour of the system when a failure occurs is recorded. The observed data is evaluated against requirements, and improvements are introduced if needed. The results are also used for the prediction of reliability and maintainability measures.

The FMEA may provide a basis for an analysis of this kind, if the relevant information is available.

Maintainability and reliability system architecture testing
The test used to verify that maintainability performance complies with requirements must be performed in conditions as similar to real operation as possible. Therefore, hardware faults are simulated at a test site to study how the system behaves when the hardware fails. The purpose of the test is to evaluate the maintainability performance and the reliability structure of the system, Fig. 8. This is done by simulating both single and multiple fault situations. The hardware failures are chosen at random, in accordance with their expected relative occurrence in normal operation. All data is based on the dependability predicted in the design phase. The system's reactions to the simulated hardware faults are observed and classified according to specified rules. The main performance parameters are evaluated statistically, to provide the characteristic, expected in-service performance, and to verify that the system reacts properly and can be maintained according to the specified requirements.

Telefonaktiebolaget L M Ericsson S-126 25 Stockholm, Sweden Phone: +46 8 7190000 Fax: +46 8 6812710 ISSN 0014-0171 Ljungforetagen, Orebro 1995