Advances in Soft Computing
Editor-in-Chief: J. Kacprzyk
53

Advances in Soft Computing
Editor-in-Chief
Prof. Janusz Kacprzyk
Systems Research Institute
Polish Academy of Sciences
ul. Newelska 6
01-447 Warsaw
Poland
E-mail: kacprzyk@ibspan.waw.pl

Further volumes of this series can be found on our homepage: springer.com

Bernd Reusch (Ed.)
Computational Intelligence, Theory and Applications, 2006
ISBN 978-3-540-34780-4

Jonathan Lawry, Enrique Miranda, Alberto Bugarín, Shoumei Li, María Á. Gil, Przemysław Grzegorzewski, Olgierd Hryniewicz (Eds.)
Soft Methods for Integrated Uncertainty Modelling, 2006
ISBN 978-3-540-34776-7

Ashraf Saad, Erel Avineri, Keshav Dahal, Muhammad Sarfraz, Rajkumar Roy (Eds.)
Soft Computing in Industrial Applications, 2007
ISBN 978-3-540-70704-2

Bing-Yuan Cao (Ed.)
Fuzzy Information and Engineering, 2007
ISBN 978-3-540-71440-8

Patricia Melin, Oscar Castillo, Eduardo Gómez Ramírez, Janusz Kacprzyk, Witold Pedrycz (Eds.)
Analysis and Design of Intelligent Systems Using Soft Computing Techniques, 2007
ISBN 978-3-540-72431-5

Oscar Castillo, Patricia Melin, Oscar Montiel Ross, Roberto Sepúlveda Cruz, Witold Pedrycz, Janusz Kacprzyk (Eds.)
Theoretical Advances and Applications of Fuzzy Logic and Soft Computing, 2007
ISBN 978-3-540-72433-9

Katarzyna M. Węgrzyn-Wolska, Piotr S. Szczepaniak (Eds.)
Advances in Intelligent Web Mastering, 2007
ISBN 978-3-540-72574-9

Emilio Corchado, Juan M. Corchado, Ajith Abraham (Eds.)
Innovations in Hybrid Intelligent Systems, 2007
ISBN 978-3-540-74971-4

Marek Kurzynski, Edward Puchala, Michal Wozniak, Andrzej Zolnierek (Eds.)
Computer Recognition Systems 2, 2007
ISBN 978-3-540-75174-8

Van-Nam Huynh, Yoshiteru Nakamori, Hiroakira Ono, Jonathan Lawry, Vladik Kreinovich, Hung T. Nguyen (Eds.)
Interval / Probabilistic Uncertainty and Non-classical Logics, 2008
ISBN 978-3-540-77663-5

Ewa Pietka, Jacek Kawa (Eds.)
Information Technologies in Biomedicine, 2008
ISBN 978-3-540-68167-0

Didier Dubois, M. Asunción Lubiano, Henri Prade, María Ángeles Gil, Przemysław Grzegorzewski, Olgierd Hryniewicz (Eds.)
Soft Methods for Handling Variability and Imprecision, 2008
ISBN 978-3-540-85026-7

Juan M. Corchado, Francisco de Paz, Miguel P. Rocha, Florentino Fernández Riverola (Eds.)
2nd International Workshop on Practical Applications of Computational Biology and Bioinformatics (IWPACBB 2008), 2009
ISBN 978-3-540-85860-7

Juan M. Corchado, Sara Rodriguez, James Llinas, Jose M. Molina (Eds.)
International Symposium on Distributed Computing and Artificial Intelligence 2008 (DCAI 2008), 2009
ISBN 978-3-540-85862-1

Juan M. Corchado, Dante I. Tapia, José Bravo (Eds.)
3rd Symposium of Ubiquitous Computing and Ambient Intelligence 2008, 2009
ISBN 978-3-540-85866-9

Erel Avineri, Mario Köppen, Keshav Dahal, Yos Sunitiyoso, Rajkumar Roy (Eds.)
Applications of Soft Computing, 2009
ISBN 978-3-540-88078-3

Emilio Corchado, Rodolfo Zunino, Paolo Gastaldo, Álvaro Herrero (Eds.)
Proceedings of the International Workshop on Computational Intelligence in Security for Information Systems CISIS 2008, 2009
ISBN 978-3-540-88180-3

Emilio Corchado, Rodolfo Zunino, Paolo Gastaldo, Álvaro Herrero (Eds.)
Proceedings of the International Workshop on Computational Intelligence in Security for Information Systems CISIS 2008

Editors
Prof. Dr. Emilio S. Corchado
Área de Lenguajes y Sistemas Informáticos
Departamento de Ingeniería Civil
Escuela Politécnica Superior
Universidad de Burgos
Campus Vena
C/ Francisco de Vitoria s/n
E-09006 Burgos
Spain
E-mail: escorchado@ubu.es

Prof.
Rodolfo Zunino
DIBE – Department of Biophysical and Electronic Engineering
University of Genova
Via Opera Pia 11A
16145 Genova
Italy
E-mail: rodolfo.zunino@unige.it

Paolo Gastaldo
DIBE – Department of Biophysical and Electronic Engineering
University of Genova
Via Opera Pia 11A
16145 Genova
Italy
E-mail: paolo.gastaldo@unige.it

Álvaro Herrero
Área de Lenguajes y Sistemas Informáticos
Departamento de Ingeniería Civil
Escuela Politécnica Superior
Universidad de Burgos
Campus Vena
C/ Francisco de Vitoria s/n
E-09006 Burgos
Spain
E-mail: ahcosio@ubu.es

ISBN 978-3-540-88180-3
e-ISBN 978-3-540-88181-0
DOI 10.1007/978-3-540-88181-0
Advances in Soft Computing ISSN 1615-3871
Library of Congress Control Number: 2008935893

© 2009 Springer-Verlag Berlin Heidelberg

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable for prosecution under the German Copyright Law.

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Typeset & Cover Design: Scientific Publishing Services Pvt. Ltd., Chennai, India.

Printed on acid-free paper

5 4 3 2 1 0

springer.com

Preface

The research scenario in advanced systems for protecting critical infrastructures and for deeply networked information tools highlights a growing link between security issues and the need for intelligent processing abilities in the area of information systems. To face the ever-evolving nature of cyber-threats, monitoring systems must have adaptive capabilities for continuous adjustment and timely, effective response to modifications in the environment. Moreover, the risks of improper access call for advanced identification methods, including protocols to enforce computer-security policies and biometry-related technologies for physical authentication. Computational Intelligence methods offer a wide variety of approaches that can be fruitful in those areas, and can play a crucial role in the adaptive process by their ability to learn empirically and adapt a system's behaviour accordingly.

The International Workshop on Computational Intelligence for Security in Information Systems (CISIS) proposes a meeting ground for the various communities involved in building intelligent systems for security, namely: information security, data mining, adaptive learning methods and soft computing, among others. The main goal is to allow experts and researchers to assess the benefits of learning methods in the data-mining area for information-security applications. The Workshop offers the opportunity to interact with the leading industries actively involved in the critical area of security, and to get a picture of the current solutions adopted in practical domains.

This volume of Advances in Soft Computing contains accepted papers presented at CISIS'08, which was held in Genova, Italy, on October 23rd–24th, 2008.
The selection process to set up the Workshop program yielded a collection of about 40 papers. This allowed the Scientific Committee to verify the vital and crucial nature of the topics involved in the event, and resulted in an acceptance rate of about 60% of the originally submitted manuscripts. CISIS'08 has teamed up with the Journal of Information Assurance and Security (JIAS) and the International Journal of Computational Intelligence Research (IJCIR) for a suite of special issues including selected papers from CISIS'08. The extended papers, together with contributed articles received in response to subsequent open calls, will go through further rounds of peer refereeing in the remits of these two journals.

We would like to thank the Programme Committee members for their work; they performed admirably under tight deadline pressures. Our warmest and special thanks go to the Keynote Speakers: Dr. Piero P. Bonissone (Coolidge Fellow, General Electric Global Research) and Prof. Marios M. Polycarpou (University of Cyprus). Prof. Vincenzo Piuri, former President of the IEEE Computational Intelligence Society, provided invaluable assistance and guidance in enhancing the scientific level of the event.

Particular thanks go to the Organising Committee, chaired by Dr. Clotilde Canepa Fertini (IIC) and composed of Dr. Sergio Decherchi, Dr. Davide Leoncini, Dr. Francesco Picasso and Dr. Judith Redi, for their precious work and for their suggestions about the organisation and promotion of CISIS'08. Particular thanks go as well to the Workshop main Sponsors, Ansaldo Segnalamento Ferroviario Spa and Elsag Datamat Spa, who jointly contributed in an active and constructive manner to the success of this initiative.

We wish to thank Prof. Dr. Janusz Kacprzyk (Editor-in-chief), Dr. Thomas Ditzinger (Senior Editor, Engineering/Applied Sciences) and Mrs. Heather King at Springer-Verlag for their help and collaboration in this demanding scientific publication project. We thank as well all the authors and participants for their great contributions that made this conference possible and all the hard work worthwhile.

October 2008
Emilio Corchado
Rodolfo Zunino
Paolo Gastaldo
Álvaro Herrero

Organization

Honorary Chairs
Gaetano Bignardi – Rector, University of Genova (Italy)
Giovanni Bocchetti – Ansaldo STS (Italy)
Michele Fracchiolla – Elsag Datamat (Italy)
Vincenzo Piuri – President, IEEE Computational Intelligence Society
Gianni Vernazza – Dean, Faculty of Engineering, University of Genova (Italy)

General Chairs
Emilio Corchado – University of Burgos (Spain)
Rodolfo Zunino – University of Genova (Italy)

Program Committee
Cesare Alippi – Politecnico di Milano (Italy)
Davide Anguita – University of Genoa (Italy)
Enrico Appiani – Elsag Datamat (Italy)
Alessandro Armando – University of Genova (Italy)
Piero Bonissone – GE Global Research (USA)
Juan Manuel Corchado – University of Salamanca (Spain)
Rafael Corchuelo – University of Sevilla (Spain)
Andre CPLF de Carvalho – University of São Paulo (Brazil)
Keshav Dahal – University of Bradford (UK)
José Dorronsoro – Autonomous University of Madrid (Spain)
Bianca Falcidieno – CNR (Italy)
Dario Forte – University of Milano Crema (Italy)
Bogdan Gabrys – Bournemouth University (UK)
Manuel Graña – University of Pais Vasco (Spain)
Petro Gopych – V.N. Karazin Kharkiv National University (Ukraine)
Francisco Herrera – University of Granada (Spain)
R.J. Howlett – University of Brighton (UK)
Giacomo Indiveri – ETH Zurich (Switzerland)
Lakhmi Jain – University of South Australia (Australia)
Janusz Kacprzyk – Polish Academy of Sciences (Poland)
Juha Karhunen – Helsinki University of Technology (Finland)
Antonio Lioy – Politecnico di Torino (Italy)
Wenjian Luo – University of Science and Technology of China (China)
Nadia Mazzino – Ansaldo STS (Italy)
José Francisco Martínez – INAOE (Mexico)
Ermete Meda – Ansaldo STS (Italy)
Evangelia Tzanakou – Rutgers University (USA)
José Mira – UNED (Spain)
José Manuel Molina – University Carlos III of Madrid (Spain)
Witold Pedrycz – University of Alberta (Canada)
Dennis K. Nilsson – Chalmers University of Technology (Sweden)
Tomas Olovsson – Chalmers University of Technology (Sweden)
Carlos Pereira – Universidade de Coimbra (Portugal)
Kostas Plataniotis – University of Toronto (Canada)
Fernando Podio – NIST (USA)
Marios Polycarpou – University of Cyprus (Cyprus)
Jorge Posada – VICOMTech (Spain)
Perfecto Reguera – University of Leon (Spain)
Bernardete Ribeiro – University of Coimbra (Portugal)
Sandro Ridella – University of Genova (Italy)
Ramón Rizo – University of Alicante (Spain)
Dymitr Ruta – British Telecom (UK)
Fabio Scotti – University of Milan (Italy)
Kate Smith-Miles – Deakin University (Australia)
Sorin Stratulat – University Paul Verlaine – Metz (France)
Carmela Troncoso – Katholieke Univ. Leuven (Belgium)
Tzai-Der Wang – Cheng Shiu University (Taiwan)
Lei Xu – Chinese University of Hong Kong (Hong Kong)
Xin Yao – University of Birmingham (UK)
Hujun Yin – University of Manchester (UK)
Alessandro Zanasi – TEMIS (France)
David Zhang – Hong Kong Polytechnic University (Hong Kong)

Local Arrangements
Bruno Baruque – University of Burgos
Andrés Bustillo – University of Burgos
Clotilde Canepa Fertini – International Institute of Communications, Genova
Leticia Curiel – University of Burgos
Sergio Decherchi – University of Genova
Paolo Gastaldo – University of Genova
Álvaro Herrero – University of Burgos
Francesco Picasso – University of Genova
Judith Redi – University of Genova

Contents

Computational Intelligence Methods for Fighting Crime
An Artificial Neural Network for Bank Robbery Risk Management: The OS.SI.F Web On-Line Tool of the ABI Anti-crime Department – Carlo Guazzoni, Gaetano Bruno Ronsivalle
Secure Judicial Communication Exchange Using Soft-computing Methods and Biometric Authentication – Mauro Cislaghi, George Eleftherakis, Roberto Mazzilli, Francois Mohier, Sara Ferri, Valerio Giuffrida, Elisa Negroni
Identity Resolution in Criminal Justice Data: An Application of NORA – Queen E. Booker
PTK: An Alternative Advanced Interface for the Sleuth Kit – Dario V. Forte, Angelo Cavallini, Cristiano Maruti, Luca Losio, Thomas Orlandi, Michele Zambelli

Text Mining and Intelligence
Stalker, a Multilingual Text Mining Search Engine for Open Source Intelligence – F. Neri, M. Pettoni
Computational Intelligence Solutions for Homeland Security – Enrico Appiani, Giuseppe Buslacchi
Virtual Weapons for Real Wars: Text Mining for National Security – Alessandro Zanasi
Hypermetric k-Means Clustering for Content-Based Document Management – Sergio Decherchi, Paolo Gastaldo, Judith Redi, Rodolfo Zunino

Critical Infrastructure Protection
Security Issues in Drinking Water Distribution Networks – Demetrios G. Eliades, Marios M. Polycarpou
Trusted-Computing Technologies for the Protection of Critical Information Systems – Antonio Lioy, Gianluca Ramunno, Davide Vernizzi
A First Simulation of Attacks in the Automotive Network Communications Protocol FlexRay – Dennis K. Nilsson, Ulf E. Larson, Francesco Picasso, Erland Jonsson
Wireless Sensor Data Fusion for Critical Infrastructure Security – Francesco Flammini, Andrea Gaglione, Nicola Mazzocca, Vincenzo Moscato, Concetta Pragliola
Development of Anti Intruders Underwater Systems: Time Domain Evaluation of the Self-informed Magnetic Networks Performance – Osvaldo Faggioni, Maurizio Soldani, Amleto Gabellone, Paolo Maggiani, Davide Leoncini
Monitoring and Diagnosing Railway Signalling with Logic-Based Distributed Agents – Viviana Mascardi, Daniela Briola, Maurizio Martelli, Riccardo Caccia, Carlo Milani
SeSaR: Security for Safety – Ermete Meda, Francesco Picasso, Andrea De Domenico, Paolo Mazzaron, Nadia Mazzino, Lorenzo Motta, Aldo Tamponi

Network Security
Automatic Verification of Firewall Configuration with Respect to Security Policy Requirements – Soutaro Matsumoto, Adel Bouhoula
Automated Framework for Policy Optimization in Firewalls and Security Gateways – Gianluca Maiolini, Lorenzo Cignini, Andrea Baiocchi
An Intrusion Detection System Based on Hierarchical Self-Organization – E.J. Palomo, E. Domínguez, R.M. Luque, J. Muñoz
Evaluating Sequential Combination of Two Genetic Algorithm-Based Solutions for Intrusion Detection – Zorana Banković, Slobodan Bojanić, Octavio Nieto-Taladriz
Agents and Neural Networks for Intrusion Detection – Álvaro Herrero, Emilio Corchado
Cluster Analysis for Anomaly Detection – Giuseppe Lieto, Fabio Orsini, Genoveffa Pagano
Statistical Anomaly Detection on Real e-Mail Traffic – Maurizio Aiello, Davide Chiarella, Gianluca Papaleo
On-the-fly Statistical Classification of Internet Traffic at Application Layer Based on Cluster Analysis – Andrea Baiocchi, Gianluca Maiolini, Giacomo Molina, Antonello Rizzi
Flow Level Data Mining of DNS Query Streams for Email Worm Detection – Nikolaos Chatzis, Radu Popescu-Zeletin
Adaptable Text Filters and Unsupervised Neural Classifiers for Spam Detection – Bogdan Vrusias, Ian Golledge
A Preliminary Performance Comparison of Two Feature Sets for Encrypted Traffic Classification – Riyad Alshammari, A. Nur Zincir-Heywood
Dynamic Scheme for Packet Classification Using Splay Trees – Nizar Ben-Neji, Adel Bouhoula
A Novel Algorithm for Freeing Network from Points of Failure – Rahul Gupta, Suneeta Agarwal

Biometry
A Multi-biometric Verification System for the Privacy Protection of Iris Templates – S. Cimato, M. Gamassi, V. Piuri, R. Sassi, F. Scotti
Score Information Decision Fusion Using Support Vector Machine for a Correlation Filter Based Speaker Authentication System – Dzati Athiar Ramli, Salina Abdul Samad, Aini Hussain
Application of 2DPCA Based Techniques in DCT Domain for Face Recognition – Messaoud Bengherabi, Lamia Mezai, Farid Harizi, Abderrazak Guessoum, Mohamed Cheriet
Fingerprint Based Male-Female Classification – Manish Verma, Suneeta Agarwal
BSDT Multi-valued Coding in Discrete Spaces – Petro Gopych
A Fast and Distortion Tolerant Hashing for Fingerprint Image Authentication – Thi Hoi Le, The Duy Bui
The Concept of Application of Fuzzy Logic in Biometric Authentication Systems – Anatoly Sachenko, Arkadiusz Banasik, Adrian Kapczyński

Information Protection
Bidirectional Secret Communication by Quantum Collisions – Fabio Antonio Bovino
Semantic Region Protection Using Hu Moments and a Chaotic Pseudo-random Number Generator – Paraskevi Tzouveli, Klimis Ntalianis, Stefanos Kollias
Random r-Continuous Matching Rule for Immune-Based Secure Storage System – Cai Tao, Ju ShiGuang, Zhong Wei, Niu DeJiao

Industrial Perspectives
nokLINK: A New Solution for Enterprise Security – Francesco Pedersoli, Massimiliano Cristiano
SLA & LAC: New Solutions for Security Monitoring in the Enterprise – Bruno Giacometti

Author Index

An Artificial Neural Network for Bank Robbery Risk Management: The OS.SI.F Web On-Line Tool of the ABI Anti-crime Department

Carlo Guazzoni and Gaetano Bruno Ronsivalle*
OS.SI.F - Centro di Ricerca dell'ABI per la sicurezza Anticrimine
Piazza del Gesù 49 – 00186 Roma, Italy
spsricercasviluppo@abiformazione.it, gabrons@gabrons.com

Abstract.
The ABI (Associazione Bancaria Italiana) Anti-crime Department, OS.SI.F (Centro di Ricerca dell'ABI per la sicurezza Anticrimine) and the banking working group created an artificial neural network (ANN) for Robbery Risk Management in the Italian banking sector. The logic analysis model is based on the global Robbery Risk index of the single banking branch. The global index is composed of the Exogenous Risk, related to the geographic area of the branch, and the Endogenous Risk, connected to its specific variables. The implementation of a neural network for Robbery Risk management provides 5 advantages: (a) it represents, in a coherent way, the complexity of the "robbery" event; (b) the database that supports the ANN is an exhaustive historical representation of Italian robbery phenomenology; (c) the model represents the state of the art of Risk Management; (d) the ANN guarantees the maximum level of flexibility, dynamism and adaptability; (e) it allows an effective integration between a solid calculation model and the common sense of the safety/security manager of the bank.

Keywords: Risk Management, Robbery Risk, Artificial Neural Network, Quickprop, Logistic Activation Function, Banking Application, ABI, OS.SI.F, Anti-crime.

1 Toward an Integrated Vision of the "Robbery Risk"

In the first pages of The Risk Management Standard (note 1) - published by IRM (note 2), AIRMIC (note 3) and ALARM (note 4) - the "risk" is defined as «the combination of the probability of an event and its consequences». Although simple and linear, this definition has many implications from a theoretical and pragmatic point of view. Any type of risk analysis shouldn't be limited to an evaluation of a specific event's probability without considering the effects, presumably negative, of the event. The correlation of these two concepts is not trivial but, unfortunately, most Risk Management models currently in use for banking security are characterized by low attention to these factors. In fact, they are often focused on the definition of methods and tools to foresee the "harmful event" in probabilistic terms, without paying attention to the importance of a composite index that considers the intensity levels of the events that causally derive from this "harmful event". The publication of the above-mentioned English institutes, however, embraces an integrated vision of Risk Management. It is related to a systemic and strategic meaning, giving a coherent description of data and defining the properties of the examined phenomenon. This wide vision provides explanatory hypotheses and, given a series of historical conditions, may foresee the probable evolutions of the system. Thanks to the joint effort of the inter-banking working group - coordinated by ABI and OS.SI.F -, the new support tool for Robbery Risk Management takes into account this integrated idea of "risk".

* Thanks to Marco Iaconis, Francesco Protani, Fabrizio Capobianco, Giorgio Corito, Riccardo Campisi, Luigi Rossi and Diego Ronsivalle for the scientific and operating support in the development of the theoretical model, and to Antonella De Luca for the translation of the paper.
Note 1: http://www.theirm.org/publications/PUstandard.html
Note 2: IRM – The Institute of Risk Management.
Note 3: AIRMIC – The Association of Insurance and Risk Managers.
Note 4: ALARM – The National Forum for Risk Management in the Public Sector.
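Purely to illustrate the IRM definition quoted above - and not as part of the OS.SI.F model described in the following sections - a composite risk score can be read as an event probability weighted by the intensity of its consequences. The function and the numbers below are invented for the illustration.

```python
# Illustrative only: a composite reading of the IRM definition of risk,
# "the combination of the probability of an event and its consequences".
# The weighting scheme and the example values are hypothetical.

def composite_risk(probability: float, consequence_intensity: float) -> float:
    """Combine event probability (0..1) with consequence intensity (0..1).

    A model focused on probability alone would return `probability`;
    the composite index also reflects how severe the outcome would be.
    """
    assert 0.0 <= probability <= 1.0 and 0.0 <= consequence_intensity <= 1.0
    return probability * consequence_intensity


# A branch with a modest robbery probability but severe expected consequences
# can rank above one with a higher probability but a mild expected impact.
print(composite_risk(0.10, 0.9) > composite_risk(0.20, 0.3))   # True
```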
The tool represents, in fact, the Risk Management process by considering the strategic and organizational factors that characterize the phenomenon of robbery; hence, it defines the role of the safety/security manager in the banking sector. The online software tool, indeed, integrates a general plan with a series of resources to assist the manager during the various phases of the decisional process scheduled by the IRM standard:

1. from the Robbery Risk mapping - articulated in analysis, various activities of identification, description and appraisal - to the risk evaluation;
2. from the Risk Reporting to the definition of threats and opportunities connected to the robbery;
3. from the decisional moment, supported by the simulation module (where one can virtually test the Risk Management and analyze the Residual Risk), to the phase of virtual and real monitoring (note 5).

The various functions are included in a multi-layer software architecture, composed of a database and a series of modules that elaborate the information to support the analysis of the risk and its components. In this way, the user can always retrace the various steps that gradually determine the Robbery Risk Global Index and the relative importance of each of its components; starting from the primary components of the risk, he/she can focus the attention on the minimum element of the organizational dimension. Thus, the analysis is focused on the complex relationship between the single banking branch - which represents the system cell and unit of measurement - and the structure of relationships, connections and relevant factors from a local and national point of view. In this theoretical frame, the Robbery Risk is not completely identified with the mere probability that the event occurs. In accordance with the IRM standard, it also takes into account the possibility that the robbery may cause harm, as well as the combined probability that the event occurs and that the possible negative consequences for the system may have a different intensity.

Note 5: "The Risk Management Standard", pp. 6-13.

2 Exogenous Risk and Endogenous Risk

According to the inter-banking working group, coordinated by OS.SI.F, what are the factors or variables that compose and influence the robbery and its harmful effects?

2.1 The Exogenous Risk

The "Exogenous" components include environment variables (from regional data to local detailed data), tied to the particular geographic position, population density, crime rate and number of general criminal actions in the area, as well as the "history" and/or evolution of the relationship between the number of banking branches, the defense implementations and the Robbery Risk. So the mathematical function that integrates these factors must take into account the time variable. In fact, it is essential to consider the influence of each variable according to the changes that occur at any time in a certain geographic zone. The analysis carried out by the ABI working group has shown that the composition of environment conditions is represented by a specific index of "Exogenous risk". Its dynamic nature makes it possible to define a probabilistic frame in order to calculate the Robbery Risk. Such an index of "Exogenous" risk allows considering the possibility of a dynamic computation regarding the variation rate of the density of criminal actions. This variation depends on the direct or indirect intervention of police or central/local administrations in a geographic area.
The Exogenous risk could provide some relevant empirical bases in order to allow banks to share common strategies for the management/mitigation of the Robbery Risk. The aim is to avoid possible negative effects tied to activities connected to only one banking branch.

2.2 The Endogenous Risk

A second class of components corresponds, instead, to material, organizational, logistic, instrumental and technological factors. They characterize the single banking branch and determine its specific architecture in relation to the robbery. Such factors are the following:

1. the "basic characteristics" (note 6)
2. the "services" (note 7)
3. the "plants" (note 8)

The interaction of these factors contributes to determine a significant part of the so-called "Endogenous" risk. It is calculated through a complex function of the number of robberies at a single branch, computed over a unit of time in which no "event" has modified, in meaningful terms, the internal set-up of the branch. In other terms, a dynamic connection between the risk index and the various interventions planned by the safety/security managers has been created, both for the single cell and for the whole system. The aim was to control the causal sequence between the possible variation of an Endogenous characteristic and its relative importance (%) in order to calculate the specific impact on the number of robberies.

Note 6: E.g. the number of employees, the location, the cash risk, the target-hardening strategies, etc.
Note 7: E.g. the bank security guards, the bank surveillance cameras, etc.
Note 8: E.g. the access control vestibules (man-catchers or mantraps), the bandit barriers, broad-band internet video feeds directly to police, the alarms, etc.

2.3 The Global Risk

Complying with the objectives of an exhaustive Robbery Risk management, the composition of the two risk indexes (Exogenous and Endogenous) defines the perimeter of a hypothetical "global" risk referred to the single branch: a sort of integrated index that includes both environment and "internal" factors. Thus the calculation of the Global Risk index derives from the normalization of the bi-dimensional vector obtained from the above-mentioned functions. Let us propose a calculation model in order to support the definition of the various indexes.

3 Methodological Considerations about the Definition of "Robbery Risk"

Before dealing with the computation techniques, however, it is necessary to clarify some issues.

3.1 Possible Extent of "Robbery Risk"

First of all, the demarcation between the "Exogenous" and "Endogenous" dimensions cannot be considered absolute. In fact, in some specific cases of "Endogenous" characteristics, an analysis that considers only factors describing the branch is not representative enough of the combination of variables. The weighting of each single element makes it possible to assign a "weight" related to the Endogenous risk index, but the variability of this influence according to the geographic area must absolutely be taken into account. This produces an inevitable contamination, even though circumscribed, between the two dimensions of the Robbery Risk. Then, it must be taken into account that these particular elements of the single cell are the result of the permanent activity of comparison carried out by the inter-banking working group members coordinated by ABI and OS.SI.F. Given the extremely delicate nature of the theme, the architecture of the "Endogenous" characteristics of the branch must be considered temporary and in continuous evolution.
The "not final" nature of the scheme that we propose depends, therefore, on the progressive transformation of the technological tools supporting security, as well as on the slow change - both in terms of national legislation and in terms of reorganization of the safety/security manager role - in the ways in which the contents of Robbery Risk are interpreted. Finally, the possibility of opening the theoretical scheme, in the future, to criminological models tied to the description of criminal behaviour and to the definition of indexes of risk perception is not excluded. But, in both cases, the problem is to find a shared theoretical basis and to translate the intangible factors into quantitative variables that can be elaborated through the calculation model (note 9).

Note 9: On this topic, some research is going on with the aim of defining a possible evolution of this particular type of "socio-psychological" index of perceived risk. But it is not yet clear if and how this index can be considered as a category of risk distinct from both the "exogenous" and the "endogenous" risk.

3.2 The Robbery Risk from an Evolutionistic Point of View

Most bank models consider the Robbery Risk index of a branch as a linear index depending on the number of attempted and/or committed robberies in a certain time interval. This index is usually translated into a value, with reference to a risk scale, and it is the criterion according to which the Security Manager decides. These models may receive a series of criticisms:

1. they extremely simplify the relation between variables, not considering the reciprocal influences between Exogenous and Endogenous risk;
2. they represent the history of a branch inaccurately, with reference to very wide temporal criteria;
3. they don't foresee a monitoring system of the historical evolution of the link between changes of the branch and the number of robberies.

To avoid these criticisms, the OS.SI.F team has developed a theoretical framework based on the following methodological principles:

1. the Robbery Global Risk is a probability index (it varies from 0 to 1) and it depends on the non-linear combination of Exogenous and Endogenous risk;
2. the Robbery Global Risk of a branch corresponds to the trend calculated by applying the Least Squares formula to the numerical set of monthly Robbery Risk values, from January 2000 to March 2008, in the branch;
3. the monthly value of the Robbery Global Risk corresponds to the ratio between the number of robberies per month and the number of days in which the branch is open to the public. It is expressed as a value from 0 to 1.

A consequence follows from these principles: it is necessary to describe the history of the branch as a sequence of states of the branch itself, in relation to changes that occurred in its internal structure (for example, the introduction of a new defending service or a new plant). Thus it is possible to create a direct relation between the evolution of the branch and the evolution of the Robbery Global Risk. In particular, we interpret the various transformations of the branch over time as "mutations" in a population of biological organisms (represented by isomorphic branches). The Robbery Global Risk thus becomes a kind of value suggesting how the "robbery market" rewards the activities of the security managers, even though without awareness of the intentional nature of certain choices (note 10).
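Principles 2 and 3 above lend themselves to a compact numerical reading. The following fragment is a minimal sketch with invented data: the monthly index as the ratio of robberies to opening days, and the branch-level Robbery Global Risk as the least-squares trend of that monthly series. The data layout is an assumption; the actual OS.SI.F computation is not published in this form.

```python
# Sketch of methodological principles 2 and 3 (hypothetical data layout):
# monthly risk = robberies / opening days, clipped to [0, 1];
# branch-level Robbery Global Risk = least-squares trend of the monthly series.
from typing import List, Tuple


def monthly_risk(robberies: int, opening_days: int) -> float:
    """Principle 3: robberies per month over days open to the public, in [0, 1]."""
    if opening_days <= 0:
        return 0.0
    return min(1.0, robberies / opening_days)


def least_squares_trend(series: List[float]) -> Tuple[float, float]:
    """Principle 2: ordinary least-squares fit y = a + b*t over the monthly series."""
    n = len(series)
    t_mean = (n - 1) / 2.0
    y_mean = sum(series) / n
    num = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    den = sum((t - t_mean) ** 2 for t in range(n))
    slope = num / den if den else 0.0
    intercept = y_mean - slope * t_mean
    return intercept, slope


# Example: monthly (robberies, opening days) pairs from January 2000 onwards (invented values).
observations = [(0, 21), (1, 22), (0, 20), (2, 22), (1, 21)]
series = [monthly_risk(r, d) for r, d in observations]
print(least_squares_trend(series))
```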
This evolutionary reading also makes it possible to analyze indirect strategies of the various banking groups in the management and distribution of risk across the different regions of the country (note 11). This methodological framework constitutes the logical basis for the construction of the Robbery Risk management simulator. It is a direct answer to the criticisms of the 2nd and 3rd points (see above). It also allows the calculation of the fluctuations of the Exogenous risk in relation to the increase - over certain thresholds - of the Robbery Global Risk (criticism of the 1st point).

Note 10: This "biological metaphor", inspired by Darwin's theory, is the methodological basis to overcome a wrong conception of the term "deterrent" inside the security managers' vocabulary. In many cases, the introduction of new safety services constitutes only an indirect deterrent, since the potential robber cannot know about the change. The analysis per populations of branches allows, instead, to extend the concept of "deterrent" to large numbers.
Note 11: While analyzing data, we discovered a number of "perverse effects" in Robbery Risk management, including the transfer of risk to branches of other competing groups as a result of corrective actions conceived for a branch and then extended to all branches in the same location.

4 The Calculation Model for the Simulator

The choice of a good calculation model as a support tool for risk management is essentially conditioned by the following elements:

1. the nature of the phenomenon;
2. the availability of information and historical data on the phenomenon;
3. the quality of the available information;
4. the presence of a scientific literature and/or of possible applications on the theme;
5. the type of tool and the output required;
6. the perception and the general consent level related to the specific model adopted;
7. the degree of obsolescence of the results;
8. the impact of the results in social, economic and political terms.

In the specific case of the Robbery Risk:

1. the extreme complexity of the "robbery" phenomenon suggests the adoption of analysis tools that take into account the various components, according to a non-linear logic;
2. there is a big database on the phenomenon: it can represent the pillar for a historical analysis and a search for regularities, correlations and possible nomic and/or probabilistic connections among the factors that determine the "robbery" risk;
3. the current database has recently been "normalized", with the aim of guaranteeing the maximum degree of coherence between the information included in the archive and the real state of the system;
4. the scientific literature on risk analysis models related to criminal events is limited to a mere qualitative analysis of the phenomenon, without considering quantitative models;
5. the inter-banking group has expressed the need for a tool to support the decisional processes in order to manage the Robbery Risk, through a decomposition of the fundamental elements that influence the event at an Exogenous and Endogenous level;
6. the banking world aims to have innovative tools and sophisticated calculation models in order to guarantee objective and scientifically founded results within the Risk Management domain;
7. given the nature of the phenomenon, the calculation model of the Robbery Risk must guarantee the maximum of flexibility and dynamism according to the time variable and the possible transformations at a local and national level;
8.
the object of the analysis is matched with a series of ethical, political, social and economic topics, and requires, indeed, an integrated systemic approach.

These considerations have led the team to pursue an innovative way of creating the calculation model for the Robbery Risk indexes: artificial neural networks (ANN).

5 Phases of ANN Design and Development for the Management of the Robbery Risk

How did we come to the creation of the neural network for the management of the robbery Global Risk? The creation of the calculation model is based on the logical scheme exposed above. It is articulated in five fundamental phases:

1. Re-design of the OS.SI.F database and data analysis;
2. Data normalization;
3. OS.SI.F network design;
4. OS.SI.F network training;
5. Network testing and delivery.

5.1 First Phase: Re-design OS.SI.F Database and Data Analysis

Once the demarcation between Exogenous and Endogenous risks had been defined, as well as the structure of variables concerning each single component of the Global Risk, some characteristic elements of the OS.SI.F historical archive have been revised. The revision of the database allowed us to remove possible macroscopic redundancies and occasional critical factors before starting the data analysis. Through a neural network based on genetic algorithms, all possible incoherencies and contradictions have been highlighted. The aim was to isolate patterns that would have been potentially "dangerous" for the network and to produce a "clean" database, deprived of logical "impurities" (within the limits of human reason). At this point, the team defined the number of entry variables (ANN inputs) - related to the characteristics mentioned above - and the exit variables (ANN output) representing the criteria for designing the network. The structure of the dataset was determined, as well as the single field types (categorical or numerical) and the distinction among (a) information for the training of the ANN, (b) data to validate the neuronal architecture, and (c) a dataset dedicated to testing the ANN after training.

5.2 Second Phase: Data Normalization

The database cleaning allowed the translation of the data into a new archive of information for the elaboration of an ANN. In other words, all the variables connected to the Exogenous risk (environment and geographic variables) and the Endogenous risk (basic characteristics, services and plants of each single branch) have been re-written and normalized. All this has produced the historical sequence of examples provided to the ANN with the aim of letting it "discover" the general rules that govern the "robbery" phenomenon - the real formal "vocabulary" for the calculation of the Global Risk.

5.3 Third Phase: OS.SI.F Network Design

This phase has been dedicated to determining the general architecture of the ANN and its mathematical properties. Concerning topology, in particular, after a series of unsuccessful attempts with a single hidden layer, we opted for a network with two hidden layers:

Fig. 1. A schematic representation of the architecture of the OS.SI.F ANN

This architecture was, in fact, more appropriate to solve a series of problems connected to the particular nature of the "robbery" phenomenon. This allowed us, therefore, to optimize the choice of the single neuron activation function and of the error function.
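As a purely illustrative sketch of the kind of architecture just described - a feed-forward network with two hidden layers and a single output bounded in [0, 1] - the following fragment shows a minimal forward pass. The layer sizes, the random weights and the use of the logistic function throughout are assumptions, not the actual OS.SI.F topology or parameters.

```python
# Minimal sketch of a feed-forward network with two hidden layers and a
# single output in (0, 1). Layer sizes and weights are hypothetical; they do
# not reproduce the actual OS.SI.F ANN.
import numpy as np


def logistic(x: np.ndarray) -> np.ndarray:
    """Sigmoid activation, F(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))


def forward(x: np.ndarray, params: list) -> np.ndarray:
    """Propagate the (normalized) exogenous + endogenous inputs through the net."""
    a = x
    for w, b in params:
        a = logistic(a @ w + b)
    return a


rng = np.random.default_rng(0)
n_inputs, n_hidden1, n_hidden2 = 12, 8, 4          # hypothetical sizes
shapes = [(n_inputs, n_hidden1), (n_hidden1, n_hidden2), (n_hidden2, 1)]
params = [(rng.normal(size=s), np.zeros(s[1])) for s in shapes]

x = rng.random((1, n_inputs))                      # one normalized input pattern
print(forward(x, params))                          # value in (0, 1), read as a risk index
```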
After a first, disastrous implementation of the linear option, a logistic activation function with a sigmoid curve was adopted. It is characterized by an output range between 0 and 1 and is calculated through the following formula:

F(x) = \frac{1}{1 + e^{-x}}    (1)

Since it was useful for the evaluation of the ANN quality, an error function has been associated with the logistic function. It was based on the analysis of the differences between the outputs of the historical archive and the outputs produced by the neural network. In this way we reached the definition of the logical calculation model, even though it did not yet have the knowledge necessary to describe, explain and foresee the probability of the "robbery" event. This knowledge, in fact, derives only from an intense training activity.

5.4 Fourth Phase: OS.SI.F Network Training

The ANN training constitutes the most delicate moment of the whole process of creation of the network, in particular with Supervised Learning. In our case, in fact, the training consists in providing the network with a large series of examples of robberies associated with particular geographic areas and specific characteristics of the branches. From this data, the network has to infer the rule through an abstraction process. For the Robbery Risk ANN, we decided to implement a variation of Backpropagation, the most common learning algorithm for multi-layer networks. It is based on error propagation and on the transformation of the weights (originally randomly assigned) from the output layer towards the intermediate layers, up to the input neurons. In our special version, the "OS.SI.F Quickpropagation", the variation of each weight of the synaptic connections changes according to the following formula:

\Delta w(t) = \left( \frac{s(t)}{s(t-1) - s(t)} \, \Delta w(t-1) \right) + k    (2)

where k is a hidden variable to solve the numerical instability of this formula (note 12). We can state that the fitness of the ANN-Robbery Risk has been subordinated to a substantial correspondence between the values of Endogenous and Exogenous risk (included in the historical archive) and the results of the network's elaboration after each learning iteration.

Note 12: Moreover, during the training, a quantity of "noise" has been introduced (injected) into the calculation process. The value of the "noise" has been calculated in relation to the error function and has made it possible to avoid the network remaining stuck in critical situations of local minima.

5.5 Fifth Phase: Network Testing and Delivery

In the final phase of the process, a lot of time has been dedicated to verifying the neural network architecture defined in the previous phases. Moreover, a series of datasets not previously included in the training have been considered, with the aim of removing the last calculation errors and making some adjustments to the general system of weights. In this phase, some critical nodes have been modified: they were related to the variations of the Exogenous risk according to the population density and to the relationship between the Endogenous risk and some new plants (biometric devices). Only after this last testing activity has the ANN been integrated and implemented in the OS.SI.F Web module, to allow users (banking security/safety managers) to verify the coherence of the tool through a module for the simulation of new scenarios.

6 Advantages of the Application of Neural Networks to Robbery Risk Management

The implementation of an ANN to support Robbery Risk management has at least 5 fundamental advantages:

1. Unlike any linear system based on proportions and simple systems of equations, an ANN makes it possible to face, in a coherent way, the high degree of complexity of the "robbery" phenomenon.
The banal logic of the sum of variables and causal connections of many common models is replaced by a more articulated design that contemplates, in dynamic and flexible terms, the innumerable connections among the Exogenous and Endogenous variables.

2. The OS.SI.F ANN database is based on a historical archive continually fed by the whole Italian banking system. This makes it possible to overcome any limited local vision, in line with the absolute need for a systemic approach to Robbery Risk analysis. In fact, it is not possible to continue to face such a delicate topic through visions circumscribed to one's own business dimension.

3. The integration of neural algorithms constitutes the state of the art within the Risk Management domain. In fact, it guarantees the definition of a net of variables opportunely measured according to a probabilistic - and not banally linear - logic. The Robbery Risk ANN foresees a real Bayesian network that dynamically determines the weight of each variable (Exogenous and Endogenous) in the probability of the robbery. This provides a higher degree of accuracy and scientific reliability to the definition of "risk" and to the whole calculation model.

4. A tool based on neural networks guarantees the maximum level of flexibility, dynamism and adaptability to contexts and conditions that are in rapid evolution. These are assured by (a) a direct connection of the database to the synaptic weights of the ANN, and (b) the possible reconfiguration of the network architecture in case of the introduction of new types of plants and/or services and/or basic characteristics of branches.

5. The ANN allows an effective integration between a solid calculation model (the historical archive of information related to the robberies of the last years) and the professional and human experience of security/safety managers. The general plan of the database (and of the composition of the two risk indexes) takes into account the considerations, observations and indications of the main representatives of the national banking safety/security sectors. The general plan of the database is based on the synthesis done by the inter-banking team, on the normalization of the robbery event descriptions, and on the sharing of some guidelines in relation to a common vocabulary for the description of the robbery event. The final result of this integration is a tool that guarantees the maximum level of decisional liberty, through the scientific validation of virtuous practices and, thanks to the simulator of new branches, an a priori evaluation of the possible effects deriving from future interventions.

References

1. Corradini, I., Iaconis, M.: Antirapina. Guida alla sicurezza per gli operatori di sportello. Bancaria Editrice, Roma (2007)
2. Fahlman, S.E.: Fast-Learning Variations on Back-Propagation: An Empirical Study. In: Proceedings of the 1988 Connectionist Models Summer School, pp. 38–51. Morgan Kaufmann, San Francisco (1989)
3. Floreano, D.: Manuale sulle reti neurali. Il Mulino, Bologna (1996)
4. McClelland, J.L., Rumelhart, D.E.: PDP: Parallel Distributed Processing: Explorations in the Microstructure of Cognition: Psychological and Biological Models, vol. II.
MIT Press/Bradford Books, Cambridge (1986)
5. Pessa, E.: Statistica con le reti neurali. Un'introduzione. Di Renzo Editore, Roma (2004)
6. Sietsma, J., Dow, R.J.F.: Neural Net Pruning - Why and How. In: Proceedings of the IEEE International Conference on Neural Networks, pp. 325–333. IEEE Press, New York (1988)
7. von Lehman, A., Paek, G.E., Liao, P.F., Marrakchi, A., Patel, J.S.: Factors Influencing Learning by Back-propagation. In: Proceedings of the IEEE International Conference on Neural Networks, vol. I, pp. 335–341. IEEE Press, New York (1988)
8. Weisel, D.L.: Bank Robbery. In: COPS, Community Oriented Policing Services, U.S. Department of Justice, No. 48, Washington (2007), http://www.cops.usdoj.gov

Secure Judicial Communication Exchange Using Soft-computing Methods and Biometric Authentication

Mauro Cislaghi (1), George Eleftherakis (2), Roberto Mazzilli (1), Francois Mohier (3), Sara Ferri (4), Valerio Giuffrida (5), and Elisa Negroni (6)

(1) Project Automation, Viale Elvezia, Monza, Italy
{mauro.cislaghi,roberto.mazzilli}@p-a.it
(2) SEERC, 17 Mitropoleos Str, Thessaloniki, Greece
eleftherakis@city.academic.gr
(3) Airial Conseil, Rue Bellini 3, Paris, France
francois.mohier@airial.com
(4) AMTEC S.p.A., Loc. San Martino, Piancastagnaio, Italy
sara.ferri@elsagdatamat.com
(5) Italdata, Via Eroi di Cefalonia 153, Roma, Italy
valerio.giuffrida@italdata-roma.com
(6) Gov3 Ltd, UK
j-web.project@gov3innovation.eu

Abstract. This paper describes how "Computer supported cooperative work", coupled with security technologies and advanced knowledge management techniques, can support penal judicial activities, in particular the national and trans-national investigation phases in which different judicial systems have to cooperate together. The increase of illegal immigration, trafficking of drugs, weapons and human beings, and the advent of terrorism have made a stronger judicial collaboration between States necessary. The J-WeB project (http://www.jweb-net.com/), financially supported by the European Union under the FP6 – Information Society Technologies Programme, is designing and developing an innovative judicial cooperation environment capable of enabling effective judicial cooperation during cross-border criminal investigations carried out between the EU and Countries of enlarging Europe, having the Italian and Montenegrin Ministries of Justice as partners. In order to reach a higher security level, an additional biometric identification system is integrated in the security environment.

Keywords: Critical Infrastructure Protection, Security, Collaboration, Cross border investigations, Cross Border Interoperability, Biometrics, Identity and Access Management.

1 Introduction

Justice is a key success factor in regional development, in particular in areas whose development is lagging behind the average development of the European Union. In the last years particular attention has been paid to judicial collaboration between the Western Balkans and the rest of the EU, and the CARDS Programme [1] is clear evidence of this cooperation. According to this programme, funds were provided for the development of closer relations and regional cooperation among SAp (Stabilisation and Association process) countries, and between them and all the EU member states, to promote direct cooperation in tackling the common threats of organised crime, illegal migration and other forms of trafficking.
Mutual assistance [2] is subject to different agreements and different judicial procedures. The JWeB project [3], [9], based on the experience of the e-Court [4] and SecurE-Justice [5] projects, funded by the European Commission in the IST programme, is developing an innovative judicial cooperation environment capable of enabling effective judicial cooperation during cross-border criminal investigations, having the Italian and Montenegrin Ministries of Justice as partners. JWeB (started in 2007 and ending in 2009) will experiment with a cross-border secure cooperative judicial workspace (SCJW), distributed on different ICT platforms called Judicial Collaboration Platforms (JCP) [6], based on Web-based groupware tools supporting collaboration and knowledge sharing among geographically distributed workforces, within and between judicial organizations.

2 Investigation Phase and Cross-Border Judicial Cooperation

The investigation phase includes all the activities carried out from crime notification to the trial. Cross-border judicial cooperation is one of them. It may vary from simple to complex judicial actions, but it has complex procedures and requirements, such as information security and non-repudiation. A single investigation may include multiple cross-border judicial cooperation requests; this is quite usual when investigating financial flows. Judicial cooperation develops as follows (see also the schematic sketch further below):

1) In the requesting country, the magistrate starts preliminary checks to understand if her/his requests to another country are likely to produce the expected results. Liaison magistrate support and contacts with magistrates in the other country are typical actions.
2) The "requesting" magistrate prepares and sends the judicial cooperation request (often referred to as "letter of rogatory") containing the list of specific requests to the other country. Often the flow in the requesting country is named "active rogatory", while the flow in the requested country is named "passive rogatory".
3) The judicial cooperation request coming from the other country is evaluated, usually by a court of appeal that, in case of positive evaluation, appoints the prosecutors' office in charge of the requested activities. This prosecutors' office appoints a magistrate. The requesting magistrate, directly or via the office delegated to international judicial cooperation, receives back this information and judicial cooperation starts.
4) Judicial cooperation actions are performed. They may cover requests for documents, requests for evidence, requests for interrogations, requests for specific actions (for example interceptions, seizures or an arrest), and requests for joint investigation.

Most of the activities are still paper based. The listed activities may imply complex actions in the requested country, involving people (magistrates, police, etc.) in different departments. The requesting country is interested in the results of the activities, not in the procedures through which the requested judicial organisation fulfils the requests. The liaison magistrate can support the magistrate, helping her/him to understand how to address the judicial counterpart and, once judicial cooperation has been granted, to understand and overcome possible obstacles. Each national judicial system is independent of the other, both in legal and in infrastructural terms.
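Purely as an illustration, the four-step flow just described can be read as a small state machine of the kind the JCP workflow engine (see Sect. 3) would model. The state names, events and transitions below are an interpretation of the description above, not the actual JWeB workflow definition.

```python
# Illustrative state machine for the cross-border cooperation (rogatory) flow
# described above. State and event names are assumptions for the sketch only.
from enum import Enum, auto


class RogatoryState(Enum):
    PRELIMINARY_CHECKS = auto()     # step 1: informal checks, liaison magistrate contacts
    REQUEST_SENT = auto()           # step 2: letter of rogatory sent ("active rogatory")
    UNDER_EVALUATION = auto()       # step 3: court of appeal evaluates ("passive rogatory")
    COOPERATION_ACTIVE = auto()     # step 4: documents, evidence, interrogations, joint work
    REJECTED = auto()
    CLOSED = auto()


TRANSITIONS = {
    (RogatoryState.PRELIMINARY_CHECKS, "send_request"): RogatoryState.REQUEST_SENT,
    (RogatoryState.REQUEST_SENT, "start_evaluation"): RogatoryState.UNDER_EVALUATION,
    (RogatoryState.UNDER_EVALUATION, "grant"): RogatoryState.COOPERATION_ACTIVE,
    (RogatoryState.UNDER_EVALUATION, "reject"): RogatoryState.REJECTED,
    (RogatoryState.COOPERATION_ACTIVE, "all_requests_fulfilled"): RogatoryState.CLOSED,
}


def advance(state: RogatoryState, event: str) -> RogatoryState:
    """Apply an event to the current state; invalid events leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


state = RogatoryState.PRELIMINARY_CHECKS
for event in ("send_request", "start_evaluation", "grant", "all_requests_fulfilled"):
    state = advance(state, event)
print(state)   # RogatoryState.CLOSED
```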
Judicial cooperation, from the ICT point of view, implies cooperation between two different infrastructures, the "requesting" one ("active") and the "requested" one ("passive"), and activities such as judicial cooperation set-up, joint activities of the workgroups, and secure exchange of non-repudiable information between the two countries. These activities can be effectively supported by a secure collaborative workspace, as described in the next paragraph.

3 The Judicial Collaboration Platform (JCP)

A workspace for judicial cooperation involves legal, organisational and technical issues, and requires a wide consensus in judicial organisations. It has to allow a straightforward user interface, easy data retrieval, and seamless integration with procedures and systems already in place, all implemented while providing top-level security standards. Accordingly, the main issues for judicial collaboration are:

• A Judicial Case is a secure private virtual workspace accessed by law enforcement and judicial authorities that need to collaborate in order to achieve common objectives and tasks;
• JCP services are on-line services, supplying various collaborative functionalities to the judicial authorities in a secure and non-repudiable communication environment;
• A User profile is a set of access rights assigned to a user. The access to a judicial case and to JCP services is based on predefined, as well as customised, role-based user profiles;
• Mutual assistance during investigations creates a shared part of the investigation folder;
• Each country will have its own infrastructure.

The core system supporting judicial cooperation is the secure JCP [6]. It is part of a national ICT judicial infrastructure, within the national judicial space. Different JCPs in different countries may cooperate during judicial cooperation. The platform, organised in three layers (presentation, business, persistence) and supporting availability and data security, provides the following main services:

• Profiling: user details, user preferences.
• Web Services:
  o Collaboration: collaborative tools so that users can participate and discuss on the judicial cooperation cases.
  o Data Mining: customization of user interfaces based on users' profiles.
  o Workflow Management: design and execution of judicial cooperation processes.
  o Audio/Video Management: real-time audio/video streaming of a multimedia file, videoconference support.
  o Knowledge Management: document uploading, indexing, search.
• Security and non-repudiation: biometric access, digital certificates, digital signature, secure communication, cryptography, role-based access control.

Services may be configured according to the different needs of the judicial systems. The modelling of Workflow Processes is based on the Workflow Management Coalition (WfMC) specifications, while software developments are based on Open Source and the J2EE framework. Communications are based on HTTPS and SSL, SOAP, RMI, LDAP and XML. Videoconferencing is based on H.323.

4 The Cross-Border Judicial Cooperation Via Secure JCPs

4.1 The Judicial Collaborative Workspace and Judicial Cooperation Activities

A secure collaborative judicial workspace (SCJW) is a secure inter-connected environment related to a judicial case, in which all entitled judicial participants in dispersed locations can access and interact with each other just as inside a single entity. The environment is supported by electronic communications and groupware which enable participants to overcome space and time differentials.
On the physical point of view, the workspace is supported by the JCP. The SCJW allows the actors to use communication and scheduling instruments (agenda, shared data, videoconference, digital signature, document exchange) in a secured environment. A judicial cooperation activity (JCA) is the implementation of a specific judicial cooperation request. It is a self contained activity, opened inside the SCJWs in the requesting and requested countries, supported by specific judicial workflows and by the collaboration tools, having as the objective to fulfil a number of judicial actions issued by the requesting magistrate. The SCJW is connected one-to-one to a judicial case and may contain multiple JCAs running in parallel. A single JCA ends when rejected or when all requests contained in the letter of rogatory have been fulfilled and the information collected have been inserted into the target investigation folder, external to the JCP. In this moment the JCA may be archived. The SCJW does not end when a JCA terminates, but when the investigation phase is concluded. Each JCA may have dedicated working teams, in particular in case of major investigations. The “owner” of the SCJW is the investigating magistrate in charge of the judicial case. SCJW is implemented in a single JCP, while the single JCA is distributed on two JCP connected via secure communication channels (crypto-routers, with certificate exchange), implementing a secured Web Service Interface via a collaboration gateway. Each SCJW has a global repository and a dedicated repository for each JCA. This is due to the following constraints: 1) the security, confidentiality and non repudiation constraints 2) each JCA is an independent entity, accessible only by the authorised members of the judicial workgroup and with a limited time duration. Secure Judicial Communication Exchange Using Soft-computing Methods 15 The repository associated to the single JCA contains: • • JCA persistence data 1) “JCA metadata” containing data such as: information coming from the national registry (judicial case protocol numbers, etc.), the users profiles and the related the access rights, the contact information, the information related to the workflows (state, transitions), etc. 2) “JCP semantic repository”. It will be the persistence tier for the JCP semantic engine, containing: ontology, entity identifiers, Knowledge Base (KB) JCA judicial information The documentation produced during the judicial cooperation will be stored in a configurable tree folder structure. Typical contents are: 1) “JCA judicial cooperation request”. It contains information related to the judicial cooperation request, including further documents exchanged during the set-up activities. 2) “JCA decisions”. It contains the outcomes of the formal process of judicial cooperation and any internal decision relevant to the specific JCA (for example letter of appointment of the magistrate(s), judicial acts authorising interceptions or domicile violation, etc.) 3) “JCA investigation evidences”. It contains the documents to be sent/ received (Audio/video recordings, from audio/video conferences and phone interceptions, Images, Objects and documents, Supporting documentation, not necessarily to be inserted in the investigation folder) 4.2 The Collaboration Gateway Every country has it own ICT judicial infrastructure, interfaced but not shared with other countries. 
Accordingly, a SCJW in a JCP must support 1:n relationships between judicial systems, including data communication, in particular when the judicial case implies more than one JCA. A single JCA has a 1:1 relationship between the JCA in the requesting country and the corresponding "requested" JCA. For example, a single judicial case in Montenegro may require cross-border judicial cooperation with Italy, Serbia, Switzerland, France and the United Kingdom, and the JCP in Montenegro will then support n cross-border judicial cooperations. Since JCP platforms are hosted in different locations and countries, the architecture of the collaboration module is based on the mechanism of a secured gateway. It is based on a set of Web Services allowing one JWeB site, based on a JCP, to exchange the needed data with another JWeB site and vice versa. The gateway architecture, under development in the JWeB project, is composed of:

• Users and Profiling module
• Judicial Cases and Profiling module
• Calendar/Meeting module

Workflow engines exchange information about workflow states through the collaboration gateway.

4.3 Communication Security, User Authentication and RBAC in JCP

Security [7] is managed through the Security Module, designed to properly manage connectivity domains and to assure access rights to different entities, protecting information and segmenting the IP network into secured domains. Any communication is hidden from third parties, protecting privacy, preventing unauthorised usage and assuring data integrity. The JCP environment is protected by the VPN system, which allows access only to authenticated and pre-registered users; no access is allowed without the credentials given by the PKI. Users are authenticated on access to any resource by means of their X.509v3 digital certificates issued by the Certification Authority, stored in their smart cards and protected by biometry [7], [8]. The Network Security System is designed to grant access to the networks and the resources only to authenticated users; it is composed of the following components:

• Security Access Systems (crypto-routers). Crypto-routers prevent unauthorised intrusions, offer protection against external attacks, and provide tunnelling capabilities and data encryption.
• Security Network Manager. This is the core of the security management system, allowing configurations to be managed, monitored and modified, including the accounting of new users.
• S-VPN clients (Secure Virtual Private Network clients). Software through which users can enter the IP VPN and be authenticated by the Security Access System.

The crypto-router supports routing and encryption functions with the RSA public key algorithm on standard TCP/IP networks in end-to-end mode. Inside the JCP security architecture, the crypto-router's main task is to establish the secure tunnel used to access the JCP VPN (Virtual Private Network) and to provide both network and resource authentication. In order to reach a higher security level, an additional biometric identification system is integrated in the security environment. The device integrates a smart card reader with a capacitive ST Microelectronics fingerprint scanner and an "Anti Hacking Module" that makes the device unusable in case of any kind of physical intrusion attempt. The biometric authentication device entirely manages the biometric verification process. No biometric data is exchanged between the device and the workstation or any other device.
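The match-on-card principle just described — the enrolled fingerprint never leaves the smart card, and the workstation only ever receives a pass/fail result — can be sketched conceptually as follows. This is a simplified illustration with hypothetical class names; in the real device the comparison runs inside the tamper-protected module, not in application code.

```python
from dataclasses import dataclass


@dataclass
class SmartCard:
    """Holds the user's X.509 certificate subject and the enrolled fingerprint template.
    The template is never exported; only the verification result leaves the card/reader."""
    certificate_subject: str
    _enrolled_template: bytes

    def verify_fingerprint(self, live_template: bytes) -> bool:
        # A byte comparison stands in for the on-card biometric matcher.
        return live_template == self._enrolled_template


class BiometricReader:
    """Models the reader plus fingerprint scanner: the host sees only a boolean."""

    def __init__(self, card: SmartCard):
        self.card = card

    def authenticate(self, scanned_template: bytes) -> bool:
        return self.card.verify_fingerprint(scanned_template)


if __name__ == "__main__":
    card = SmartCard("CN=Investigating Magistrate", b"enrolled-minutiae")
    reader = BiometricReader(card)
    print(reader.authenticate(b"enrolled-minutiae"))   # True  -> VPN session may open
    print(reader.authenticate(b"someone-else"))        # False -> access denied
```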
Biometric personal data will remain in the user’s smart card and the comparison between the live and the smart card stored fingerprint will be performed inside the device. After biometric authentication, access control of judicial actors to JCP is rolebased. In Role Based Access Control [11] (RBAC), permissions are associated with roles, and users are made members of appropriate roles. This model simplifies access administration, management, and audit procedures. The role-permissions relationship changes much less frequently than the role-user relationship, in particular in the judicial field. RBAC allows these two relationships to be managed separately and gives much clearer guidance to system administrators on how to properly add new users and Secure Judicial Communication Exchange Using Soft-computing Methods 17 their associated permissions. RBAC is particularly appropriate in justice information sharing systems where there are typically several organizationally diverse user groups that need access, in varying degrees, to enterprise-wide data. Each JCP system will maintain its own Access Control List (ACL). Example of roles related to judicial cooperation are: • • • • • SCJW magistrate supervisor: Basically he/she has the capability to manage all JCAs. JCA magistrate: he/she has the capability to handle the cases that are assigned to him Liaison Magistrate: a magistrate located in a foreign country that supports the magistrate(s) in case of difficulties. Judicial Clerk: supporting the magistrate for secretarial and administrative tasks (limited access to judicial information). System Administrator: He is the technical administrator of the JCP platform (no access to judicial information) 5 Conclusions Council Decision of 12 February 2007 establishes for the period 2007-2013 the Programme ‘Criminal Justice’ (2007/126/JHA), with the objective to foster judicial cooperation in criminal matter. CARDS project [1] and IPA funds represent today a relevant financial support to regional development in Western Balkans, including justice as one of the key factors. This creates a strong EU support to JCP deployment, while case studies such as the ongoing JWeB and SIDIP [10] projects, demonstrated that electronic case management is now ready for deployment on the technological point of view. Judicial secure collaboration environment will be the basis for the future judicial trans-national cooperation, and systems such as the JCP may lead to a considerable enhancement of cross-border judicial cooperation. The experience in progress in JWeB project is demonstrating that features such as security, non repudiation, strong authentication can be obtained through integration of state of the art technologies and can be coped with collaboration tools, in order to support a more effective and straightforward cooperation between investigating magistrates in full compliance with national judicial procedures and practices. The JCP platform represents a possible bridge between national judicial spaces, allowing through secure web services the usage of the Web as a cost effective and the same time secured interconnection between judicial systems. While technologies are mature and ready to be used, their impact on the judicial organisations in cross-border cooperation is still under analysis. It is one of the main non technological challenges for deployment of solutions such as the one under development in JWeB project. 
The analysis conducted so far in the JWeB project gives a reasonable confidence that needed organisational changes will become evident through the pilot usage of the developed ICT solutions, so giving further contributions to the Ministries of Justice about the activities needed for a future deployment of ICT solutions in a delicate area such as the one of the international judicial cooperation. 18 M. Cislaghi et al. References 1. CARDS project: Support to the Prosecutors Network, EuropeAid/125802/C/ACT/Multi (2007), http://ec.europa.eu/europeaid/cgi/frame12.pl 2. Armone, G., et al.: Diritto penale europeo e ordinamento italiano: le decisioni quadro dell’Unione europea: dal mandato d’arresto alla lotta al terrorismo. Giuffrè edns. (2006) ISBN 88-14-12428-0 3. JWeB consortium (2007), http://www.jweb-net.com 4. European Commission, ICT in the courtroom, the evidence (2005), http://ec.europa.eu/information_society/activities/ policy_link/documents/factsheets/jus_ecourt.pdf 5. European Commission. Security for judicial cooperation (2006), http://ec.europa.eu/information_society/activities/ policy_link/documents/factsheets/just_secure_justice.pdf 6. Cislaghi, M., Cunsolo, F., Mazzilli, R., Muscillo, R., Pellegrini, D., Vuksanovic, V.: Communication environment for judicial cooperation between Europe and Western Balkans. In: Expanding the knowledge economy, eChallenges 2007 conference proceedings, The Hague, The Netherlands (October 2007); ISBN 978-1-58603-801-4, 757-764. 7. Italian Committee for IT in Public Administrations (CNIPA), Linee guida per la sicurezza ICT delle pubbliche amministrazioni. In: Quaderni CNIPA 2006 (2006), http://www.cnipa.gov.it/site/_files/Quaderno20.pdf 8. Italian Committee for IT in Public Administrations (CNIPA), CNIPA Linee guida per l’utilizzo della Firma Digitale, in CNIPA (May 2004), http://www.cnipa.gov.it/site/_files/LineeGuidaFD_200405181.pdf 9. JWeB project consortium (2007-2008), http://www.jweb-net.com/index.php? option=com_content&task=category§ionid=4&id=33&Itemid=63 10. SIDIP project (ICT system supporting trial and hearings in Italy) (2007), http://www.giustiziacampania.it/file/1012/File/ progettosidip.pdf, https://www.giustiziacampania.it/file/1053/File/ mozzillopresentazionesistemasidip.doc 11. Ferraiolo, D.F., Sandhu, R., Gavrila, S., Kuhn, D.R., Chandramouli: A proposed standard for rolebased access control. Technical report, National Institute of Standards & Technology (2000) Identity Resolution in Criminal Justice Data: An Application of NORA Queen E. Booker Minnesota State University, Mankato, 150 Morris Hall Mankato, Minnesota Queen.booker@mnsu.edu Abstract. Identifying aliases is an important component of the criminal justice system. Accurately identifying a person of interest or someone who has been arrested can significantly reduce the costs within the entire criminal justice system. This paper examines the problem domain of matching and relating identities, examines traditional approaches to the problem, and applies the identity resolution approach described by Jeff Jonas [1] and relationship awareness to the specific case of client identification for the indigent defense office. The combination of identity resolution and relationship awareness offered improved accuracy in matching identities. Keywords: Pattern Analysis, Identity Resolution, Text Mining. 1 Introduction Appointing counsel for indigent clients is a complex task with many constraints and variables. 
The manager responsible for assigning the attorney is limited by the number of attorneys at his/her disposal. If the manager assigns an attorney to a case with which the attorney has a conflict of interest, the office loses the funds already invested in the case by the representing attorney. Additional resources are needed to bring the next attorney “up to speed.” Thus, it is in the best interest of the manager to be able to accurately identify the client, the victim and any potential witnesses to minimize any conflict of interest. As the number of cases grows, many times, the manager simply selects the next person on the list when assigning the case. This type of assignment can lead to a high number of withdrawals due to a late identified conflict of interest. Costs to the office increase due to additional incarceration expenses while the client is held in custody as well as the sunk costs of prior and repeated attorney representation regardless of whether the client is in or out of custody. These problems are further exacerbated when insufficient systems are in place to manage the data that could be used to make assignments easier. The data on the defendant is separately maintained by the various criminal justice agencies including the indigent defense service agency itself. This presents a challenge as the number of cases increases but without a concomitant increase in staff available to make the assignments. Thus those individuals responsible for assigning attorneys want not only the ability to better assign attorneys, but also to do so in a more expedient fashion. E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 19–26, 2009. springerlink.com © Springer-Verlag Berlin Heidelberg 2009 20 Q.E. Booker The aggregate data from all the information systems in the criminal justice process have been proven to improve the attorney assignment process [2]. Criminal justice systems have many disparate information systems, each with their own data sets. These include systems concerned with arrests, court case scheduling, the prosecuting attorneys office, to name a few. In many cases, relationships are nonobvious. It is not unusual for a repeat offender to provide an alternative name that is not validated prior to sending the arrest data to the indigent defense office. Likewise it is not unusual for potential witnesses to provide alternative names in an attempt to protect their identities. And further, it is not unusual for a victim to provide yet another name in an attempt to hide a previous interaction with the criminal justice process. Detecting aliases becomes harder as the indigent defense problem grows in complexity. 2 Problems with Matching Matching identities or finding aliases is a difficult process to perform manually. The process relies on institutional knowledge and/or visual stimulation. For example, if an arrest report is accompanied by a picture, the manager or attorney can easily ascertain the person’s identity. But that is not the case. Arrest reports sent generally are textual with the defendant’s name, demographic information, arrest charges, victim, and any witness information. With the institutional knowledge, the manager or an attorney can review the information on the report and identify the person by the use of a previous alias or by other pertinent information on the report. So essentially, it is possible to identify many aliases by humans, and hence possible for an information system because the enterprise contains all the necessary knowledge. 
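To make the manager's problem concrete, the sketch below shows the kind of conflict-of-interest check an assignment tool has to perform. The record layout is a deliberately simplified, hypothetical example, not the office's actual case management schema.

```python
from dataclasses import dataclass, field


@dataclass
class Attorney:
    name: str
    former_clients: set[str] = field(default_factory=set)  # people previously represented


@dataclass
class Case:
    defendant: str
    victims: set[str]
    witnesses: set[str]


def has_conflict(attorney: Attorney, case: Case) -> bool:
    """A crude conflict test: the attorney previously represented the victim or a witness."""
    involved = case.victims | case.witnesses
    return bool(attorney.former_clients & involved)


def assign(case: Case, roster: list[Attorney]) -> Attorney | None:
    """Pick the first attorney on the list without a detectable conflict of interest."""
    for attorney in roster:
        if not has_conflict(attorney, case):
            return attorney
    return None


if __name__ == "__main__":
    roster = [Attorney("A. Smith", {"J. Doe"}), Attorney("B. Jones")]
    case = Case(defendant="R. Roe", victims={"J. Doe"}, witnesses=set())
    print(assign(case, roster).name)   # B. Jones; A. Smith is conflicted out
```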
But the knowledge and the process are trapped across isolated operational systems within the criminal justice agencies. One approach to improving the indigent defense agency problem is to amass information from as many available data sources as possible, clean the data, and find matches to improve the defense process. Traditional algorithms are not well suited to this process. Matching is further encumbered by the poor quality of the underlying data. Lists containing subjects of interest commonly contain typographical errors, data from defendants who intentionally misspell their names to frustrate data matching efforts, and legitimate natural variability (Mike versus Michael, and 123 Main Street versus 123 S. Maine Street). Dates are often a problem as well. Months and days are sometimes transposed, especially in international settings. Numbers often have transposition errors or might have been entered with a different number of leading zeros.

2.1 Current Identity Matching Approaches

Organizations typically employ three general types of identity matching systems: merge/purge and match/merge, binary matching engines, and centralized identity catalogues. Merge/purge and match/merge is the process of combining two or more lists or files while simultaneously identifying and eliminating duplicate records. This process was developed by direct marketing organizations to eliminate duplicate customer records in mailing lists. Binary matching engines test an identity in one data set for its presence in a second data set. These matching engines are also sometimes used to compare one identity with another single identity (versus a list of possibilities), with the output often expected to be a confidence value pertaining to the likelihood that the two identity records are the same. These systems were designed to help organizations recognize individuals with whom they had previously done business or, alternatively, recognize that the identity under evaluation is known as a subject of interest—that is, on a watch list—thus warranting special handling [1]. Centralized identity catalogues are systems that collect identity data from disparate and heterogeneous data sources and assemble it into unique identities, while retaining pointers to the original data source and record, with the purpose of creating an index.

Each of the three types of identity matching systems uses either probabilistic or deterministic matching algorithms. Probabilistic techniques rely on training data sets to compute attribute distributions and frequencies, looking for both common and uncommon patterns. These statistics are stored and used later to determine confidence levels in record matching. As a result, any record containing similar but uncommon data might be considered a record of the same person with a high degree of probability. These systems lose accuracy when the underlying data's statistics deviate from the original training set and must be retrained frequently to maintain their accuracy. Deterministic techniques rely on pre-coded expert rules to define when records should be matched. One rule might be that if the names are close (Robert versus Rob) and the social security numbers are the same, the system should consider the records as matching identities. These systems often have complex rules based on itemsets such as name, birthdate, zipcode, telephone number, and gender. However, these systems fail as data becomes more complex.
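A few of the deterministic rules and data-quality problems just described can be sketched as follows. The nickname table, similarity threshold and field names are illustrative assumptions, not the rules of any production matcher.

```python
from difflib import SequenceMatcher

# Tiny illustrative nickname table; a real matcher would use a curated list.
NICKNAMES = {"mike": "michael", "rob": "robert", "bob": "robert", "bill": "william"}


def normalize_name(name: str) -> str:
    first = name.strip().lower()
    return NICKNAMES.get(first, first)


def names_close(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat nickname variants and small typos as the same name."""
    a, b = normalize_name(a), normalize_name(b)
    return a == b or SequenceMatcher(None, a, b).ratio() >= threshold


def dates_match(d1: str, d2: str) -> bool:
    """Accept day/month transposition, e.g. 03-07-1980 versus 07-03-1980."""
    p1, p2 = d1.split("-"), d2.split("-")
    return p1 == p2 or (p1[0] == p2[1] and p1[1] == p2[0] and p1[2] == p2[2])


def same_identity(rec1: dict, rec2: dict) -> bool:
    """One deterministic rule: close name AND (same identifier OR compatible birth date)."""
    if not names_close(rec1["first_name"], rec2["first_name"]):
        return False
    if rec1.get("id_number") and rec1.get("id_number") == rec2.get("id_number"):
        return True
    return dates_match(rec1["dob"], rec2["dob"])


if __name__ == "__main__":
    a = {"first_name": "Mike", "dob": "03-07-1980", "id_number": "X123"}
    b = {"first_name": "Michael", "dob": "07-03-1980", "id_number": None}
    print(same_identity(a, b))   # True
```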
3 NORA

Jeff Jonas introduced a system called NORA, which stands for non-obvious relationship awareness. He developed the system specifically to solve Las Vegas casinos' identity matching problems. NORA accepts data feeds from numerous enterprise information systems and builds a model of identities and relationships between identities (such as shared addresses or phone numbers) in real time. If a new identity matched or related to another identity in a manner that warranted human scrutiny (based on basic rules, such as good guy connected to very bad guy), the system would immediately generate an intelligence alert. The system approach for the Las Vegas casinos is very similar to the needs of the criminal justice system. The data needed to identify aliases and relationships for conflict of interest concerns comes from multiple data sources – arresting agency, probation offices, court systems, prosecuting attorney's office, and the defense agency itself – and the ability to successfully identify a client is needed in real time to reduce costs to the defense office. The NORA system requirements were:

• Sequence neutrality. The system needed to react to new data in real time.
• Relationship awareness. Relationship awareness was designed into the identity resolution process so that newly discovered relationships could generate real-time intelligence. Discovered relationships also persisted in the database, which is essential to generate alerts beyond one degree of separation.
• Perpetual analytics. When the system discovered something of relevance during the identity matching process, it had to publish an alert in real time to secondary systems or users before the opportunity to act was lost.
• Context accumulation. Identity resolution algorithms evaluate incoming records against fully constructed identities, which are made up of the accumulated attributes of all prior records. This technique enabled new records to match to known identities in toto, rather than relying on binary matching that could only match records in pairs. Context accumulation improved accuracy and greatly improved the handling of low-fidelity data that might otherwise have been left as a large collection of unmatched orphan records.
• Extensible. The system needed to accept new data sources and new attributes through the modification of configuration files, without requiring that the system be taken offline.
• Knowledge-based name evaluations. The system needed detailed name evaluation algorithms for high-accuracy name matching. Ideally, the algorithms would be based on actual names taken from all over the world and developed into statistical models to determine how and how often each name occurred in its variant form. This empirical approach required that the system be able to automatically determine the culture that the name most likely came from, because names vary in predictable ways depending on their cultural origin.
• Real time. The system had to handle additions, changes, and deletions from real-time operational business systems. Processing times are so fast that matching results and accompanying intelligence (such as whether the person is on a watch list or the address is missing an apartment number based on prior observations) could be returned to the operational systems in sub-seconds.
• Scalable. The system had to be able to process records on a standard transaction server, adding information to a repository that holds hundreds of millions of identities.
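Context accumulation, the fourth requirement above, means that an incoming record is evaluated against fully constructed identities rather than against individual records. The toy in-memory resolver below illustrates the idea under deliberately simplified matching rules; the real engine persists identities in a database and re-evaluates earlier orphan records, which is what gives it sequence neutrality.

```python
class Identity:
    """A constructed identity: the accumulated attributes of every record matched to it."""

    def __init__(self, record: dict):
        self.names = {record["name"].lower()}
        self.phones = {p for p in [record.get("phone")] if p}
        self.addresses = {a for a in [record.get("address")] if a}

    def matches(self, record: dict) -> bool:
        # Simplified rule: same name, or a shared phone/address among accumulated attributes.
        return (record["name"].lower() in self.names
                or record.get("phone") in self.phones
                or record.get("address") in self.addresses)

    def absorb(self, record: dict) -> None:
        self.names.add(record["name"].lower())
        if record.get("phone"):
            self.phones.add(record["phone"])
        if record.get("address"):
            self.addresses.add(record["address"])


def resolve(records: list[dict]) -> list[Identity]:
    identities: list[Identity] = []
    for rec in records:
        for ident in identities:
            if ident.matches(rec):
                ident.absorb(rec)      # new attributes enrich the identity ("in toto" matching)
                break
        else:
            identities.append(Identity(rec))
    return identities


if __name__ == "__main__":
    feed = [
        {"name": "John Q. Public", "phone": "555-0100"},
        {"name": "J. Public", "phone": "555-0100"},          # matches via the accumulated phone
        {"name": "John Q. Public", "address": "12 Oak St"},  # adds an address to the identity
    ]
    print(len(resolve(feed)))   # 1 constructed identity
```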
[1] Like the gaming industry, the defense attorney’s office has relatively low daily transactional volumes. Although it receives booking reports on an ongoing basis, initial court appearances are handled by a specific attorney, and the assignments are made daily, usually the day after the initial court appearance. The attorney at the initial court appearance is not the officially assigned attorney, allowing the manager a window of opportunity from booking to assigning the case to accurately identify the client. But the analytical component of accurate identification involves numerous records with accurate linkages including aliases as well as past relationships and networks as related to the case. The legal profession has rules and regulations that constitute conflict of interest. Lawyers must follow these rules to maintain their license to practice which makes the assignment process even more critical. [3] NORA’s identity resolution engine is capable of performing in real time against extraordinary data volumes. The gaming industry's requirements of less than 1 million affected records a day means that a typical installation might involve a single Intelbased server and any one of several leading SQL database engines. This performance establishes an excellent baseline for application to the defense attorney data since the NORA system demonstrated that the system could handle multibillion-row databases Identity Resolution in Criminal Justice Data: An Application of NORA 23 consisting of hundreds of millions of constructed identities and ingest new identities at a rate of more than 2,000 identity resolutions per second; such ultra-large deployments require 64 or more CPUs and multiple terabytes of storage, and move the performance bottleneck from the analytic engine to the database engine itself. While the defense attorney dataset is not quite as large, the processing time on the casino data suggests that NORA would be able to accurately and easily handle the defense attorney’s needs in real-time. 4 Identity Resolution Identity resolution is an operational intelligence process, typically powered by an identity resolution engine, whereby organizations can connect disparate data sources with a view to understanding possible identity matches and non-obvious relationships across multiple data sources. It analyzes all of the information relating to individuals and/or entities from multiple sources of data, and then applies likelihood and probability scoring to determine which identities are a match and what, if any, nonobvious relationships exist between those identities. These engines are used to uncover risk, fraud, and conflicts of interest. Identity resolution is designed to assemble i identity records from j data sources into k constructed, persistent identities. The term "persistent" indicates that matching outcomes are physically stored in a database at the moment a match is computed. Accurately evaluating the similarity of proper names is undoubtedly one of the most complex (and most important) elements of any identity matching system. Dictionary- based approaches fail to handle the complexities of names such as common names such as Robert Johnson. The approaches fail even greater when cultural influences in naming are involved. Soundex is an improvement over traditional dictionary approaches. It uses a phonetic algorithm for indexing names by their sound when pronounced in English. 
The basic aim is for names with the same pronunciation to be encoded to the same string so that matching can occur despite minor differences in spelling. Such systems' attempts to neutralize slight variations in name spelling by assigning some form of reduced "key" to a name (by eliminating vowels or eliminating double consonants) frequently fail because of external factors—for example, different fuzzy matching rules are needed for names from different cultures. Jonas found that the deterministic method is essential for eliminating dependence on training data sets. As such, the system no longer needed periodic reloads to account for statistical changes to the underlying universe of data. However, he also identified many common conditions in which deterministic techniques fail—specifically, certain attributes were so overused that it made more sense to ignore them than to use them for identity matching and detecting relationships. For example, two people with the first name of "Rick" who share the same social security number are probably the same person—unless the number is 111-11-1111. Two people who have the same phone number probably live at the same address—unless that phone number is a travel agency's phone number. He refers to such values as generic because the overuse diminishes the usefulness of the value itself. It is impossible to know all of these generic values a priori—for one reason, they keep changing—thus probabilistic-like techniques are used to automatically detect and remember them. His identity resolution system uses a hybrid matching approach that combines deterministic expert rules with a probabilistic-like component to detect generics in real time (to avoid the drawback of training data sets). The result is expert rules that look something like this:

If the name is similar
AND there is a matching unique identifier
THEN match
UNLESS this unique identifier is generic

In his system, a unique identifier might include social security or credit-card numbers, or a passport number, but would not include such values as phone number or date of birth. The term "generic" here means the value has become so widely used (across a predefined number of discrete identities) that one can no longer use this same value to disambiguate one identity from another [1]. The approach for this study of the defense data, however, used a merged itemset combining date of birth, gender, and ethnicity code, because of the legal constraint that the social security number cannot be used for identification. Thus, an identifier was developed from a merged itemset after using the SUDA algorithm to identify infrequent itemsets based on data mining [4]. The actual deterministic matching rules for NORA, as well as for the defense attorney system, are much more elaborate in practice because they must explicitly address fuzzy matching to scrub and clean the data as well as address transposition errors in numbers, malformed addresses, and other typographical errors. The current defense attorney agency model has thirty-six rules. Once the data is "cleansed" it is stored and indexed to provide user-friendly views of the data that make it easy for the user to find specific information when performing queries and ad hoc reporting. Then, a data-mining algorithm using a combination of binary regression and logit models is run to update patterns for assigning attorneys based on the day's outcomes [5].
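Returning to the hybrid rule quoted above — match on a similar name and a shared unique identifier, unless that identifier has become generic — a minimal sketch follows. The similarity threshold and the generic-value counter are illustrative assumptions; the counter mimics the probabilistic-like component that notices overused values.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# How many distinct identities may share a value before it is treated as "generic"
# (e.g. 111-11-1111, or a travel agency's phone number). The threshold is illustrative.
GENERIC_THRESHOLD = 5

identifier_usage: dict[str, set[str]] = defaultdict(set)   # identifier -> identity ids seen


def record_usage(identifier: str, identity_id: str) -> None:
    identifier_usage[identifier].add(identity_id)


def is_generic(identifier: str) -> bool:
    return len(identifier_usage[identifier]) >= GENERIC_THRESHOLD


def names_similar(a: str, b: str) -> bool:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= 0.8


def hybrid_match(rec_a: dict, rec_b: dict) -> bool:
    """IF the name is similar AND there is a matching unique identifier
    THEN match, UNLESS this unique identifier is generic."""
    same_id = rec_a["identifier"] == rec_b["identifier"]
    return (names_similar(rec_a["name"], rec_b["name"])
            and same_id
            and not is_generic(rec_a["identifier"]))


if __name__ == "__main__":
    # Five different people were seen with 111-11-1111, so it no longer disambiguates anyone.
    for i in range(5):
        record_usage("111-11-1111", f"identity-{i}")
    a = {"name": "Rick Smith", "identifier": "111-11-1111"}
    b = {"name": "Rick Smith", "identifier": "111-11-1111"}
    print(hybrid_match(a, b))   # False: the shared identifier is generic
    c = {"name": "Rick Smith", "identifier": "987-65-4321"}
    d = {"name": "Rick Smyth", "identifier": "987-65-4321"}
    print(hybrid_match(c, d))   # True
```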
The algorithm identifies patterns for the outcomes and tree structure for attorney and defendant combinations where the attorney “completed the case.” [6] Although matching accuracy is highly dependent on the available data, using the techniques described here achieves the goals of identity resolution, which essentially boil down to accuracy, scalability, and sustainability even in extremely large transactional environments. 5 Relationship Awareness According to Jonas, detecting relationships is vastly simplified when a mechanism for doing so is physically embedded into the identity matching algorithm. Stating the obvious, before analyzing meaningful relationships, the system must be able to resolve unique identities. As such, identity resolution must occur first. Jonas purported that it was computationally efficient to observe relationships at the moment the Identity Resolution in Criminal Justice Data: An Application of NORA 25 identity record is resolved because in-memory residual artifacts (which are required to match an identity) comprise a significant portion of what's needed to determine relevant relationships. Relevant relationships, much like matched identities, were then persisted in the same database. Notably, some relationships are stronger than others; a relationship score that's assigned with each relationship pair captures this strength. For example, living at the same address three times over 10 years should yield a higher score than living at the same address once for three months. As identities are matched and relationships detected, the NORA evaluates userconfigurable rules to determine if any new insight warrants an alert being published as an intelligence alert to a specific system or user. One simplistic way to do this is via conflicting roles. A typical rule for the defense attorney might be notification any time a client rule is associated to a role of victim, witness, co-defendant, or previously represented relative, for example. In this case, associated might mean zero degrees of separation (they're the same person) or one degree of separation (they're roommates). Relationships are maintained in the database to one degree of separation; higher degrees are determined by walking the tree. Although the technology supports searching for any degree of separation between identities, higher orders include many insignificant leads and are thus less useful. 6 Comparative Results This research is an ongoing process to improve the attorney assignment process in the defense attorney offices. As economic times get harder, crime increases and as crimes increase, so do the number of people who require representation by the public defense offices. The ability to quickly identify conflicts of interests reduces the amount of time a person stays in the system and also reduces the time needed to process the case. The original system built to work with the alias/identity matching as called the Court Appointed Counsel System or CACS. CACS identified 83% more conflicts of interests than the indigent defense managers during the initial assignments [2]. Using the merged itemset and an algorithm using NORA’s underlying technology, the conflicts improved from 83% to 87%. But the real improvement came in the processing time. The key to the success of these systems is the ability to update and provide accurate data at a moments notice. 
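The alerting behaviour described in Section 5 — flag a client who resolves to, or sits one degree of separation from, a victim, witness, co-defendant or previously represented person — can be sketched as follows, with a toy in-memory graph and illustrative role names.

```python
from collections import defaultdict

CONFLICTING_ROLES = {"victim", "witness", "co-defendant", "previously_represented"}

# identity id -> roles it carries; relationship pairs carry a strength score
roles: dict[str, set[str]] = defaultdict(set)
relationships: dict[str, dict[str, float]] = defaultdict(dict)


def add_relationship(a: str, b: str, score: float) -> None:
    """Persist a scored relationship pair (e.g. years at the same address -> higher score)."""
    relationships[a][b] = max(relationships[a].get(b, 0.0), score)
    relationships[b][a] = max(relationships[b].get(a, 0.0), score)


def alerts_for_client(client: str) -> list[str]:
    """Zero degrees: the client himself carries a conflicting role.
    One degree: a directly related identity carries one."""
    found = []
    direct = roles[client] & CONFLICTING_ROLES
    if direct:
        found.append(f"{client} also appears as {sorted(direct)}")
    for other, score in relationships[client].items():
        hits = roles[other] & CONFLICTING_ROLES
        if hits:
            found.append(f"{client} related to {other} ({sorted(hits)}, score {score:.1f})")
    return found


if __name__ == "__main__":
    roles["id-17"] |= {"witness"}
    add_relationship("id-42", "id-17", score=0.8)   # e.g. roommates at the same address
    print(alerts_for_client("id-42"))
```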
Utilizing NORA’s underlying algorithms improved the updating and matching process significantly, allowing for new data to be entered and analyzed within a couple of hours as opposed to the days it took to process using the CACS algorithms. Further, the merged itemset approach helped to provide a unique identifier in 90% of the cases significantly increasing automated relationship identifications. The ability to handle real-time transactional data with sustained accuracy will continue to be of "front and center" importance as organizations seek competitive advantage. The identity resolution technology applied here provides evidence that such technologies can be applied to more than simple fraud detection but also to improve business decision making and intelligence support to entities whose purpose are to make expedient decisions regarding individual identities. 26 Q.E. Booker References 1. Jonas, J.: Threat and Fraud Intelligence, Las Vegas Style. IEEE Security & Privacy 4(06), 28–34 (2006) 2. Booker, Q., Kitchens, F.K., Rebman, C.: A Rule Based Decision Support System Prototype for Assigning Felony Court Appointed Counsel. In: Proceedings of the 2004 Decision Sciences Annual Meeting, Boston, MA (2004) 3. Gross, L.: Are Differences Among the Attorney Conflict of Interest Rules Consistent with Principles of Behavioral Economics. Georgetown Journal of Legal Ethics 19, 111 (2006) 4. Manning, A.M., Haglin, D.J., Keane, J.A.: A Recursive Search Algorithm for Statistical Disclosure Assessment. Data Mining and Knowledge Discovery (accepted, 2007) 5. Kitchens, F.L., Sharma, S.K., Harris, T.: Cluster Computers for e-Business Applications. Asian Journal of Information Systems (AJIS) 3(10) (2004) 6. Forgy, C.: Rete: A Fast Algorithm for the Many Pattern/ Many Object Pattern Match Problem. Artificial Intelligence 19 (1982) PTK: An Alternative Advanced Interface for the Sleuth Kit Dario V. Forte, Angelo Cavallini, Cristiano Maruti, Luca Losio, Thomas Orlandi, and Michele Zambelli The IRItaly Project at DFlabs Italy www.dflabs.com Abstract. PTK is a new open-source tool for all complex digital investigations. It represents an alternative to the well-known but now obsolete front-end Autopsy Forensic Browser. This latter tool has a number of inadequacies taking the form of a cumbersome user interface, complicated case and evidence management, and a non-interactive timeline that is difficult to consult. A number of important functions are also lacking, such as an effective bookmarking system or a section for file analysis in graphic format. The need to accelerate evidence analysis through greater automation has prompted DFLabs to design and develop this new tool. PTK provides a new interface for The Sleuth Kit (TSK) suite of tools and also adds numerous extensions and features, one of which is an internal indexing engine that is capable of carrying out complex evidence pre-analysis processes. PTK was written from scratch using Ajax technology for graphic contents and a MySql database management system server for saving indexing results and investigator-generated bookmarks. This feature allows a plurality of users to work simultaneously on the same or different cases, accessing previously indexed contents. The ability to work in parallel greatly reduces analysis times. These characteristics are described in greater detail below. 
PTK includes a dedicated “Extension Management” module that allows existing or newly developed tools to be integrated into it, effectively expanding its analysis and automation capacity. Keywords: Computer Forensics, Open Source, SleuthKit, Autopsy Forensic, Incident Response. 1 Multi-investigator Management One of the major features of this software is its case access control mechanism and high level user profiling, allowing more than one investigator to work simultaneously on the same case. The administrator creates new cases and assigns investigators to them, granting appropriate access privileges. The investigators are then able to work in parallel on the same case. PTK user profiling may be used to restrict access to sensitive cases to a handpicked group of investigators or even a single investigator. The advantages of this type of system are numerous: above all, evidence analysis is speeded up by the ability of a team of investigators to work in parallel; secondly, the problem of case synchronization is resolved since all operations reference the same database. Each investigator is also able to save specific notes and references directly relating to his or her activities on a case in a special bookmark section. All user actions are logged in CSV format so that all application activity can be retraced. Furthermore, the administrator is able to manage PTK log files from the interface, viewing the contents in table format and exporting them locally. E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 27–34, 2009. © Springer-Verlag Berlin Heidelberg 2009 springerlink.com 28 D.V. Forte et al. 2 Direct Evidence Analysis As a graphic interface for the TSK suite of tools, PTK inherits all the characteristics of this system, starting with the recognized evidence formats. PTK supports Raw (e.g., dd), Expert Witness (e.g., EnCase) and AFF evidence. Evidence may only be added to a case by the Administrator, who follows a guided three-step procedure: 1. 2. 3. Insertion of information relating to the disk image, such as type and place of acquisition; Selection of the image file and any partitions to be included in the analysis; File hashing (MD5 and SHA1) while adding the image. One important PTK function is automatic recognition of the disk file system and partitions during image selection. PTK also recognizes images in various formats that have been split up. Here the investigator needs only to select one split, since PTK is able to recognize all splits belonging to the same image. PTK and TSK interact directly for various analysis functions which therefore do not require preliminary indexing operations: • File analysis • Data unit analysis • Data export 2.1 File Analysis This section analyzes the contents of files (also deleted files) in the disk image. PTK introduces the important new feature of the tree-view, a dynamic disk directory tree which provides immediate evidence browsing capability. PTK allows multiple files to be opened simultaneously on different tabs to facilitate comparative analysis. The following information is made available to investigators: • • • • • Contents of file in ASCII format; Contents of file in hexadecimal format; Contents of file in ASCII string format; File categorization; All TSK output information: permissions, file name, MAC times, dimensions, UID, GID and inode. 3 Indexing Engine In order to provide the user with the greatest amount of information in the least time possible, an indexing system has been designed and developed for PTK. 
The objective is to minimize the time needed to recover all the file information required in forensic analysis, such as hash values, timeline, file type (the file extension alone does not determine the type) and keywords. For small files, indexing may not be necessary, since the time required to recover information such as the MD5 hash may be negligible. However, for files whose size is on the order of megabytes these operations begin to slow down, and the wait time for the results becomes excessive. Hence a procedure was developed in which all the files in an image are processed just once and the result is saved in a database. The following indices have been implemented in PTK:

• Timeline
• File type
• MD5
• SHA1
• Keyword search

4 Indexed Evidence Analysis

All analysis functions that require preliminary indexing are collected under the name "Indexed Analysis", which includes timeline analysis, keyword search and hash comparison.

4.1 Timeline Analysis

The disk timeline helps the investigator concentrate on the areas of the image where evidence may be located. It displays the chronological succession of actions carried out on allocated and non-allocated files. These actions are traced by means of analysis of the metadata known as MAC times (Modification, Access, and Creation). The degree of detail depends on the file system type: FAT32, for example, does not record the time of last access to a file, but only the date, so in the timeline analysis phase this information is displayed as 00:00:00. PTK allows investigators to analyze the timeline by means of time filters. The time unit, in relation to the file system, is on the order of one second. Investigators have two types of timelines at their disposal: one in table format and one in graphic format. The former allows investigators to view each single timeline entry, organized into fields (time and date, file name, actions performed, dimension, permissions), and provides direct access to content analysis or export operations. The latter is a graphic representation plotting the progress of each action (MAC times) over a given time interval. This is a useful tool for viewing file access activity peaks.

4.2 Keyword Search

The indexing process generates a database of keywords which makes it possible to carry out high-performance searches in real time. Searches are carried out by means of the direct use of strings or the creation of regular expressions. The interface offers various templates of regular expressions that the user can use and customize. The search templates described by regular expressions are stored in text files and can thus be customized by users.

4.3 Hash Set Manager and Comparison

Once the indexing process has been completed, PTK generates an MD5 or SHA1 hash value for each file present in the evidence: these values are used in comparisons with hash sets (either public or user-generated), making it possible to determine whether a file belongs to the "known good" or "known bad" category. Investigators can also use this section to import the contents of Rainbow Tables in order to compare a given hash, perhaps one recovered via a keyword search, with those in the hash set.
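The index-once idea can be sketched as follows: walk the evidence tree a single time, compute the hashes and MAC-style times for every file, and store them so that later timeline or hash-set queries read the index instead of touching the evidence again. SQLite stands in here for PTK's MySQL back end, and the schema is an illustrative assumption, not PTK's actual one.

```python
import hashlib
import os
import sqlite3


def index_evidence(root: str, db_path: str = "file_index.db") -> None:
    """Walk a mounted evidence tree once, storing MD5/SHA1 and MAC-style times per file."""
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS file_index (
                       path TEXT PRIMARY KEY, md5 TEXT, sha1 TEXT,
                       size INTEGER, mtime REAL, atime REAL, ctime REAL)""")
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                md5, sha1 = hashlib.md5(), hashlib.sha1()
                with open(path, "rb") as fh:
                    for chunk in iter(lambda: fh.read(1 << 20), b""):
                        md5.update(chunk)
                        sha1.update(chunk)
                st = os.stat(path)
            except OSError:
                continue  # unreadable file: skip rather than abort the indexing run
            con.execute("INSERT OR REPLACE INTO file_index VALUES (?,?,?,?,?,?,?)",
                        (path, md5.hexdigest(), sha1.hexdigest(),
                         st.st_size, st.st_mtime, st.st_atime, st.st_ctime))
    con.commit()
    con.close()


def timeline(db_path: str = "file_index.db") -> list[tuple]:
    """A timeline-style view ordered by modification time, read from the index."""
    con = sqlite3.connect(db_path)
    rows = con.execute("SELECT mtime, path, md5 FROM file_index ORDER BY mtime").fetchall()
    con.close()
    return rows


if __name__ == "__main__":
    index_evidence(".")                 # stand-in for a mounted evidence directory
    for row in timeline()[:5]:
        print(row)
```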
5 Data Carving Process

Data carving seeks files or other data structures in an incoming data flow based on their contents rather than on the meta-information that a file system associates with each file or directory. The initial approach chosen for PTK is based on the techniques of header/footer carving and header/maximum (file) size carving, following the taxonomy of Simson Garfinkel and Joachim Metz. The PTK indexing step provides the possibility of enabling data carving for the non-allocated space of evidence imported into the case. It is possible to configure the data carving module directly, by adding or eliminating entries for the headers and footers used in carving. The investigator can also set up custom search patterns directly from the interface. In this way the investigator can search for patterns not only to find files, by means of new headers and footers, but also to find file contents. The particular structure of PTK also allows investigators to run data carving operations on evidence consisting of a RAM dump. Note that the data carving results are not saved directly in the database; only references to the data identified during the process are saved. The indexing process also uses matching headers and footers for the categorization of all the files in the evidence. The output of this process allows the analyzed data to be subdivided into different categories:

• Documents (Word, Excel, ASCII, etc.)
• Graphic or multimedia content (images, video, audio)
• Executable programs
• Compressed or encrypted data (zip, rar, etc.)

6 Bookmarking and Reporting

The entire analysis section is flanked by a bookmarking subsystem that allows investigators to bookmark evidence at any time. All operations are facilitated by the back-end MySql database, so no data is written locally in the client file system. When an investigator saves a bookmark, the reference to the corresponding evidence is written in the database, in terms of inodes and sectors, without any data being transferred from the disk being examined to the database. Each bookmark is also associated with a tag specifying the category and a text field for any user notes. Each investigator has a private bookmark management section, which can be used, at the investigator's total discretion, to share bookmarks with other users. Reports are generated automatically on the basis of the bookmarks saved by the user. PTK provides two report formats: HTML and PDF. Reports are highly customizable in terms of graphics (header, footer, logos) and contents, with the option of inserting additional fields for enhanced description and documentation of the investigation results.

7 PTK External Modules (Extensions)

This PTK section allows users to use external tools for the execution of various tasks. It is designed to give the application the flexibility of performing automatic operations on different operating systems, running data search or analysis processes, and recovering deleted files. The "PTK extension manager" creates an interface between third-party tools and the evidence management system and runs various processes on them. The currently enabled extensions provide for memory dump analysis, Windows registry analysis and OS artifact recovery. The first extension provides PTK with the ability to analyze the contents of RAM dumps.
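An extension manager of this kind essentially needs a registry of tools and a uniform way to run them against evidence and collect their output for bookmarking. The sketch below is a hypothetical API in that spirit, not PTK's actual extension interface (which belongs to its Ajax/MySQL code base); the extension names simply mirror those listed above.

```python
from typing import Callable

# Registry of extensions: name -> callable taking an evidence path and returning findings.
EXTENSIONS: dict[str, Callable[[str], list[str]]] = {}


def extension(name: str):
    """Decorator registering a third-party tool as an extension."""
    def register(func: Callable[[str], list[str]]):
        EXTENSIONS[name] = func
        return func
    return register


@extension("memory_dump_strings")
def memory_dump_strings(evidence_path: str) -> list[str]:
    # Placeholder: a real extension would extract printable strings from a RAM dump.
    return [f"strings extracted from {evidence_path}"]


@extension("windows_registry")
def windows_registry(evidence_path: str) -> list[str]:
    # Placeholder: a real extension would parse registry hives found in the evidence.
    return [f"registry keys of interest in {evidence_path}"]


def run_extensions(evidence_path: str) -> dict[str, list[str]]:
    """Run every registered extension; the caller decides what to bookmark."""
    return {name: func(evidence_path) for name, func in EXTENSIONS.items()}


if __name__ == "__main__":
    for name, findings in run_extensions("/cases/17/ram.dd").items():
        print(name, findings)
```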
The memory dump analysis extension allows both evidence from long-term data storage media and evidence from memory dumps to be associated with a case, thus allowing important information to be extracted, such as a list of strings in memory, which could potentially contain passwords to be used for the analysis of protected or encrypted archives found on the disk. The registry analysis extension gives PTK the capability of recognizing and interpreting a Microsoft Windows registry file and navigating within it with the same facility as the regedit tool. Additionally, PTK provides for automatic search within the most important sections of the registry and generation of output results. The Artifact Recovery extension was implemented in order to reconstruct or recover specific contents relating to the functions of an operating system or its components or applications. The output from these automatic processes can be included among the investigation bookmarks. PTK extensions do not write their output to the database, in order to prevent it from becoming excessively large. User-selected output from these processes may be included in the bookmark section in the database. If bookmarks are not created before PTK is closed, the results are lost.

8 Comparative Assessment

The use of Ajax in the development of PTK has drastically reduced execution times on the server side and, by delegating part of the code execution to the client, has reduced user wait times by minimizing the amount of information loaded into pages. An assessment was carried out to compare the performance of PTK with that of Autopsy Forensic Browser. Given that these are two web-based applications using different technologies, it is not possible to make a direct, linear comparison of performance. For these reasons, it is useful to divide the assessment into two parts: the first highlights the main differences in the interfaces, examining the necessary user procedures; the second makes a closer examination of the performance of the PTK indexing engine, providing a more technical comparison on the basis of such objective parameters as command execution, parsing, and output presentation times.

8.1 Interface

The following comparative assessment of Autopsy and PTK (Table 1) highlights the differences at the interface level, evaluated in terms of the number of pages loaded for the execution of the requested action. All pages (and thus the steps taken by the user) are counted starting from and excluding the home page of each application.

Table 1. Interface comparison between Autopsy and PTK

Action: New case creation
  Autopsy: You click "New case" and a new page is loaded where you add the case name, description, and assigned investigator names (text fields). Pages loaded: 2
  PTK: You click "Add new case" and a modal form is opened where it is sufficient to provide a case name and brief description. Pages loaded: 1

Action: Investigator assignment
  Autopsy: Investigators are assigned to the case when it is created. However, these are only text references.
  PTK: You click on the icon in the case table to access the investigator management panel. These assignments represent bona fide user profiles.

Action: Image addition
  Autopsy: You select a case and a host and then click "Add image file". A page is displayed where you indicate the image path (manually) and specify a number of import parameters. On the next page, you specify integrity control operations and select the partitions. Then you click "Add". Pages loaded: 6
  PTK: In the case table, you click on the image management icon and then click "Add new image". A modal form opens with a guided 3-step process for adding the image. Path selection is based on automatic folder browsing. Pages loaded: 1

Action: Image integrity verification
  Autopsy: After selecting the case, the host and the image, you click "Image integrity". The next page allows you to create an MD5 hash of the file and to verify it on request. Pages loaded: 4
  PTK: You open image management for a case and click "Integrity check". A panel opens where you can generate and/or verify both MD5 and SHA1 hashes. Pages loaded: 1

Action: Evidence analysis
  Autopsy: After selecting the case, the host and the image, you click "Analyze" to access the analysis section. Pages loaded: 4
  PTK: After opening the panel displaying the images in a case, you click the icon "Analyze image" to access the analysis section. Pages loaded: 1

Action: Evidence timeline creation
  Autopsy: After selecting the case, the host and the image, you click "File activity time lines". You then have to create a data file by providing appropriate creation parameters and create the timeline file based on the file thus generated. Pages loaded: 8
  PTK: You open image management for a case and click on the indexing icon. The option of generating a timeline comes up and the process is run. The timeline is saved in the database and is available during analyses. Pages loaded: 1

Action: String extraction
  Autopsy: After selecting the case, the host and the image, you click "Details". On the next page you click "Extract strings" to run the process. Pages loaded: 5
  PTK: You open image management for a case and click on the indexing icon. The option of extracting strings comes up and the process is run. All ASCII strings for each image file are saved in the database. Pages loaded: 1

8.2 Indexing Performance

The following tests were performed on the same evidence: file system FAT32; size 1.9 GB; acquired with dd. A direct comparison (Table 2) can be made for timeline generation and keyword extraction in terms of the time required to perform the operations.

Table 2. Indexing performance of Autopsy and PTK

Action: Timeline generation
  Autopsy: 54" + 2"
  PTK: 18"

Action: Keyword extraction
  Autopsy: 8' 10"
  PTK: 8' 33"

Action: File hash generation
  Autopsy: Autopsy manages the hash values (MD5) for each file at the directory level. The hash generation operation must therefore be run from the file analysis page; however, this process does not save any of the generated hash values.
  PTK: PTK optimizes the generation of file hashes via its indexing operations, eliminating wait time during analysis and making the hash values easy to consult.

9 Conclusions and Further Steps

The main idea behind the project was to provide an "alternative" interface to the TSK suite so as to offer a new and valid open source tool for forensic investigations. We use the term "alternative" because PTK was not designed to be a completely different piece of software from its forerunner, Autopsy, but a product that seeks to improve the performance of existing functions and resolve their inadequacies. The strong point of this project is thus the careful initial analysis of Autopsy Forensic Browser, which allowed the developers to establish the bases for a robust product that represents a real step forward. Future developments of the application will certainly include:

• Integration of new tools as extensions of the application in order to address a greater number of analysis types within the capabilities of PTK.
• Creation of customized installation packages for the various platforms.
• Adaption of style sheets to all browser types in order to extend the portability of the tool. References 1. Carrier, Brian: File System Forensic Analysis. Addison Wesley, Reading (2005) 2. Carrier, Brian: Digital Forensic Tool Testing Images (2005), http://dftt.sourceforge.net 3. Carvey, Harlan: Windows Forensic Analysis. Syngress (2007) 34 D.V. Forte et al. 4. Casey, Eoghan: Digital Evidence and Computer Crime. Academic Press, London (2004) 5. Garfinkel, Simson: Carving Contiguous and Fragmented Files with Fast Object Validation. In: Digital Forensics Workshop (DFRWS 2007), Pittsburgh, PA (August 2007) 6. Jones, Keith, J., Bejtlich, Richard, Rose, Curtis, W.: Real Digital Forensics: Computer Security and Incident Response. Addison-Wesley, Reading (2005) 7. Schwartz, Randal, L., Phoenix, Tom: Learning Perl. O’Reilly, Sebastopol (2001) 8. The Sleuthkit documentation, http://www.sleuthkit.org/ 9. Forte, D.V.: The State of the Art in Digital Forensics. Advances in Computers 67, 254– 300 (2006) 10. Forte, D.V., Maruti, C., Vetturi, M.R., Zambelli, M.: SecSyslog: an Approach to Secure Logging Based on Covert Channels. In: SADFE 2005, 248–263 (2005) Stalker, a Multilingual Text Mining Search Engine for Open Source Intelligence F. Neri1 and M. Pettoni2 1 Lexical Systems Department, Synthema, Via Malasoma 24, 56121 Ospedaletto – Pisa, Italy federico.neri@synthema.it 2 CIFI/GE, II Information and Security Department (RIS), Stato Maggiore Difesa, Rome, Italy Abstract. Open Source Intelligence (OSINT) is an intelligence gathering discipline that involves collecting information from open sources and analyzing it to produce usable intelligence. The international Intelligence Communities have seen open sources grow increasingly easier and cheaper to acquire in recent years. But up to 80% of electronic data is textual and most valuable information is often hidden and encoded in pages which are neither structured, nor classified. The process of accessing all these raw data, heterogeneous in terms of source and language, and transforming them into information is therefore strongly linked to automatic textual analysis and synthesis, which are greatly related to the ability to master the problems of multilinguality. This paper describes a content enabling system that provides deep semantic search and information access to large quantities of distributed multimedia data for both experts and general public. STALKER provides with a language independent search and dynamic classification features for a broad range of data collected from several sources in a number of culturally diverse languages. Keywords: open source intelligence, focused crawling, natural language processing, morphological analysis, syntactic analysis, functional analysis, supervised clustering, unsupervised clustering. 1 Introduction Open Source Intelligence (OSINT) is an intelligence gathering discipline that involves collecting information from open sources and analyzing it to produce usable intelligence. The specific term “open” refers to publicly available sources, as opposed to classified sources. OSINT includes a wide variety of information and sources. With the Internet, the bulk of predictive intelligence can be obtained from public, unclassified sources. The revolution in information technology is making open sources more accessible, ubiquitous, and valuable, making open intelligence at less cost than ever before. 
In fact, monitors no longer need an expensive infrastructure of antennas to listen to radio, watch television or gather textual data from Internet newspapers and magazines. The availability of a huge amount of data in the open sources information channels leads to the well-identified modern paradox: an overload of information means, most of the time, a no usable knowledge. Besides, open source texts are - and will be - written in various native languages, but these documents are relevant even to non-native E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 35–42, 2009. springerlink.com © Springer-Verlag Berlin Heidelberg 2009 36 F. Neri and M. Pettoni speakers. Independent information sources can balance the limited information normally available, particularly if related to non-cooperative targets. The process of accessing all these raw data, heterogeneous both for type (web pages, crime reports), source (Internet/Intranet, database, etc), protocol (HTTP/HTTPS, FTP, GOPHER, IRC, NNTP, etc) and language used, transforming them into information, is therefore inextricably linked to the concepts of textual analysis and synthesis, hinging greatly on the ability to master the problems of multilinguality. 1.1 State of Art Current-generation information retrieval (IR) systems excel with respect to scale and robustness. However, if it comes to deep analysis and precision, they lack power. Users are limited by keywords search, which is not sufficient if answers to complex problems are sought. This becomes more acute when knowledge and information are needed from diverse linguistic and cultural backgrounds, so that both problems and answers are necessarily more complex. Developments in the IR have mostly been restricted to improvements in link and click analysis or smart query expansion or profiling, rather than focused on a deeper analysis of text and the building of smarter indexes. Traditionally, text and data mining systems can be seen as specialized systems that convert more complex information into a structured database, allowing people to find knowledge rather than information. For some domains, text mining applications are well-advanced, for example in the domains of medicine, military and intelligence, and aeronautics [1], [15]. In addition to domain-specific miners, general technology has been developed to detect Named Entities [2], co-reference relations, geographical data [3], and time points [4]. The field of knowledge acquisition is growing rapidly with many enabling technologies being developed that eventually will approach Natural Language Understanding (NLU). Despite much progress in Natural Language Processing (NLP), the field is still a long way from language understanding. The reason is that full semantic interpretation requires the identification of every individual conceptual component and the semantic roles it play. In addition, understanding requires processing and knowledge that goes beyond parsing and lexical lookup and that is not explicitly conveyed by linguistic elements. First, contextual understanding is needed to deal with the omissions. Ambiguities are a common aspect of human communication. Speakers are cooperative in filling gaps and correcting errors, but automatic systems are not. Second, lexical knowledge does not provide background or world knowledge, which is often required for non-trivial inferences. 
Any automatic system trying to understand a simple sentence will require - among others - accurate capabilities for Named Entity Recognition and Classification (NERC), full Syntactic Parsing, Word Sense Disambiguation (WSD) and Semantic Role Labeling (SRL) [5]. Current baseline information systems are either large-scale, robust but shallow (standard IR systems), or they are small-scale, deep but ad hoc (Semantic-Web ontology-based systems). Furthermore, these systems are maintained by experts in IR, ontologies or language technology and not by the people in the field. Finally, hardly any of the systems is multilingual, let alone cross-lingual, and definitely not cross-cultural. The next table gives a comparison across different state-of-the-art information systems, where we compare ad-hoc Semantic Web solutions, wordnet-based information systems and traditional information retrieval with STALKER [6].

Table 1. Comparison of semantic information systems

Features                           Semantic web   Wordnet-based   Traditional IR   STALKER
Large scale and multiple domains   NO             YES             YES              YES
Deep semantics                     YES            NO              NO               YES
Automatic acquisition/indexing     NO             YES/NO          YES              YES
Multi-lingual                      NO             YES             YES              YES
Cross-lingual                      NO             YES             NO               YES
Data and fact mining               YES            NO              NO               YES

2 The Logical Components

The system is built on the following components:
− a Crawler, an adaptive and selective component that gathers documents from Internet/Intranet sources.
− a Lexical system, which identifies relevant knowledge by detecting semantic relations and facts in the texts.
− a Search engine that enables Functional, Natural Language and Boolean queries.
− a Classification system which classifies search results into clusters and sub-clusters recursively, highlighting meaningful relationships among them.

2.1 The Crawler

In any large company or public administration the goal of aggregating contents from different and heterogeneous sources is really hard to accomplish. Searchbox is a multimedia content gathering and indexing system, whose main goal is managing huge collections of data coming from different and geographically distributed information sources. Searchbox provides very flexible and high-performance dynamic indexing for content retrieval [7], [8], [9]. The gathering activities of Searchbox are not limited to the standard Web, but also operate on other sources such as remote databases by ODBC, Web sources by FTP-Gopher, Usenet news by NNTP, WebDAV and SMB shares, mailboxes by POP3-POP3/S-IMAP-IMAP/S, file systems and other proprietary sources. The Searchbox indexing and retrieval system does not work on the original version of the data, but on the "rendered version". For instance, the features rendered and extracted from a portion of text might be a list of words/lemmas/concepts, while the extraction of features from a bitmap image might be extremely sophisticated. Even more complex sources, like video, might be suitably processed so as to extract a textual-based labeling, which can be based on both the recognition of speech and sounds. All of the extracted and indexed features can be combined in the query language which is available in the user interface. Searchbox provides default plug-ins to extract text from most common types of documents, like HTML, XML, TXT, PDF, PS and DOC. Other formats can be supported using specific plugins.
2.2 The Lexical System

This component is intended to identify relevant knowledge from the whole raw text, by detecting semantic relations and facts in texts. Concept extraction and text mining are applied through a pipeline of linguistic and semantic processors that share a common ground and a knowledge base. The shared knowledge base guarantees a uniform interpretation layer for the diverse information coming from different sources and languages.

Fig. 1. Lexical Analysis

The automatic linguistic analysis of the textual documents is based on Morphological, Syntactic, Functional and Statistical criteria. Recognizing and labeling semantic arguments is a key task for answering Who, When, What, Where, Why questions in all NLP tasks in which some kind of semantic interpretation is needed. At the heart of the lexical system is McCord's theory of Slot Grammar [10]. A slot, explains McCord, is a placeholder for the different parts of a sentence associated with a word. A word may have several slots associated with it, forming a slot frame for the word. In order to identify the most relevant terms in a sentence, the system analyzes it and, for each word, the Slot Grammar parser draws on the word's slot frames to cycle through the possible sentence constructions. Using a series of word relationship tests to establish the context, the system tries to assign the context-appropriate meaning to each word, determining the meaning of the sentence. Each slot structure can be partially or fully instantiated and it can be filled with representations from one or more statements to incrementally build the meaning of a statement. This includes most of the treatment of coordination, which uses a method of 'factoring out' unfilled slots from elliptical coordinated phrases. The parser - a bottom-up chart parser - employs a parse evaluation scheme used for pruning away unlikely analyses during parsing as well as for ranking final analyses. By including semantic information directly in the dependency grammar structures, the system relies on the lexical semantic information combined with functional relations. The detected terms are then extracted and reduced to their Part of Speech (noun, verb, adjective, adverb, etc.) and Functional (Agent, Object, Where, Cause, etc.) tagged base form [12]. Once referred to their synsets inside the domain dictionaries, they are used as document metadata [12], [13], [14]. Each synset denotes a concept that can be referred to by its members. Synsets are interlinked by means of semantic relations, such as the super-subordinate relation, the part-whole relation and several lexical entailment relations.

2.3 The Search Engine

2.3.1 Functional Search
Users can search and navigate by roles, exploring sentences and documents by the functional role played by each concept. Users can navigate on the relations chart by simply clicking on nodes or arcs, expanding them and gaining access to the set of sentences/documents characterized by the selected criterion.

Fig. 2. Functional search and navigation

This can be considered a visual investigative analysis component specifically designed to bring clarity to complex investigations. It automatically enables investigative information to be represented as visual elements that can be easily analyzed and interpreted.
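The Slot Grammar parser and the Synthema lexical resources behind this functional analysis are proprietary, so the fragment below is only a rough, minimal sketch of the idea: it uses the open-source spaCy library (an assumption on our side, not a component of STALKER) to pull approximate agent-action-object triples out of a dependency parse.

```python
# Simplified illustration only: STALKER relies on a proprietary Slot Grammar
# parser; here spaCy's dependency parse stands in for it (assumed installed
# via `pip install spacy` plus `python -m spacy download en_core_web_sm`).
import spacy

nlp = spacy.load("en_core_web_sm")

def functional_triples(text):
    """Return rough (agent, action, object) triples, one per verb."""
    doc = nlp(text)
    triples = []
    for token in doc:
        if token.pos_ != "VERB":
            continue
        agents = [c.text for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
        objects = [c.text for c in token.children if c.dep_ in ("dobj", "obj", "attr", "dative")]
        for agent in agents or ["?"]:
            for obj in objects or ["?"]:
                triples.append((agent, token.lemma_, obj))
    return triples

print(functional_triples("The suspect transferred the funds to an offshore account."))
# e.g. [('suspect', 'transfer', 'funds')]
```

Such triples are the kind of Agent/Action/Object annotations that the functional search described below lets analysts navigate.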
Functional relationships - Agent, Action, Object, Qualifier, When, Where, How - among human beings and organizations can be searched for and highlighted, and patterns and hidden connections can be instantly revealed to help investigations, promoting efficiency within investigative teams. Should human beings be cited, their photos can be shown by simply clicking on the related icon.

2.3.2 Natural Language Search
Users can search documents by queries in Natural Language, expressed using normal conversational syntax, or by keywords combined with Boolean operators. Reasoning over facts and ontological structures makes it possible to handle diverse and more complex types of questions. Traditional Boolean queries, in fact, while precise, require a strict interpretation that can often exclude information that is relevant to user interests. This is why the system analyzes the query, identifying the most relevant terms it contains and their semantic and functional interpretation. By mapping a query to concepts and relations, very precise matches can be generated, without the loss of scalability and robustness found in regular search engines that rely on string matching and context windows. The search engine returns as results all the documents which contain the query concepts/lemmas in the same functional role as in the query, trying to retrieve all the texts which constitute a real answer to the query.

Fig. 3. Natural language query and its functional and conceptual expansion

Results are then displayed and ranked by relevance, reliability and credibility.

Fig. 4. Search results

2.4 The Clustering System

The automatic classification of results is performed by the TEMIS Insight Discoverer Categorizer and Clusterer, fulfilling both the Supervised and Unsupervised Classification schemas. The application assigns texts to predefined categories and dynamically discovers the groups of documents which share some common traits.

2.4.1 Supervised Clustering
The categorization model was created during the learning phase, on representative sets of training documents focused on news about the Middle East and North Africa, the Balkans, Eastern Europe, International Organizations and ROW (Rest Of the World). The Bayesian method was used as the learning method: the probabilistic classification model was built on around 1,000 documents. The overall performance measures used were Recall (number of categories correctly assigned divided by the total number of categories that should be assigned) and Precision (number of categories correctly assigned divided by the total number of categories assigned): in our tests, they were 75% and 80% respectively.

2.4.2 Unsupervised Clustering
Result documents are represented by a sparse matrix, whose rows and columns are normalized in order to give more weight to rare terms. Each document is turned into a vector comparable with the others. Similarity is measured by a simple cosine calculation between document vectors, whilst clustering is based on the K-Means algorithm. The application provides a visual summary of the clustering analysis. A map shows the different groups of documents as differently sized bubbles and the meaningful correlations among them as lines drawn with different thickness. Users can search inside topics, project clusters on lemmas and their functional links.

Fig. 5.
Thematic map, functional search and projection inside topics 3 Conclusions This paper describes a Multilingual Text Mining platform for Open Source Intelligence, adopted by Joint Intelligence and EW Training Centre (CIFIGE) to train the military and civilian personnel of Italian Defence in the OSINT discipline. Multilanguage Lexical analysis permits to overcome linguistic barriers, allowing the automatic indexation, simple navigation and classification of documents, whatever it might be their language, or the source they are collected from. This approach enables the research, the analysis, the classification of great volumes of heterogeneous documents, helping intelligence analysts to cut through the information labyrinth. References 1. Grishman, R., Sundheim, B.: Message Understanding Conference - 6: A Brief History. In: Proceedings of the 16th International Conference on Computational Linguistics (COLING), I, Kopenhagen, pp. 466–471 (1996) 42 F. Neri and M. Pettoni 2. Hearst, M.: Untangling Text Data Mining. In: ACL 1999. University of Maryland, June 20-26 (1999) 3. Miller, H.J., Han, J.: Geographic Data Mining and Knowledge Discovery. CRC Press, Boca Raton (2001) 4. Wei, L., Keogh, E.: Semi-Supervised Time Series Classification, SIGKDD (2006) 5. Carreras, X., Màrquez, L.: Introduction to the CoNLL-2005 Shared Task: Semantic Role Labeling. In: CoNLL 2005, Ann Arbor, MI, USA (2005) 6. Vossen, P., Neri, F., et al.: KYOTO: A System for Mining, Structuring, and Distributing Knowledge Across Languages and Cultures. In: Proceedings of GWC 2008, The Fourth Global Wordnet Conference, Szeged, Hungary, January 2008, pp. 22–25 (2008) 7. Baldini, N., Bini, M.: Focuseek searchbox for digital content gathering. In: AXMEDIS 2005 - 1st International Conference on Automated Production of Cross Media Content for Multi-channel Distribution, Proceedings Workshop and Industrial, pp. 24–28 (2005) 8. Baldini, N., Gori, M., Maggini, M.: Mumblesearch: Extraction of high quality Web information for SME. In: 2004 IEEE/WIC/ACM International Conference on Web Intelligence (2004) 9. Diligenti, M., Coetzee, F.M., Lawrence, S., Giles, C.L., Gori, M.: Focused Crawling Using Context Graphs. In: Proceedings of 26th International Conference on Very Large Databases, VLDB, September 2000, pp. 10–12 (2000) 10. McCord, M.C.: Slot Grammar: A System for Simpler Construction of Practical Natural Language Grammars Natural Language and Logic 1989, pp. 118–145 (1989) 11. McCord, M.C.: Slot Grammars. American Journal of Computational Linguistics 6(1), 31– 43 (1980) 12. Marinai, E., Raffaelli, R.: The design and architecture of a lexical data base system. In: COLING 1990, Workshop on advanced tools for Natural Language Processing, Helsinki, Finland, August 1990, p. 24 (1990) 13. Cascini, G., Neri, F.: Natural Language Processing for Patents Analysis and Classification. In: ETRIA World Conference, TRIZ Future 2004, Florence, Italy (2004) 14. Neri, F., Raffaelli, R.: Text Mining applied to Multilingual Corpora. In: Sirmakessis, S. (ed.) Knowledge Mining: Proceedings of the NEMIS 2004 Final Conference. Springer, Heidelberg (2004) 15. Baldini, N., Neri, F.: A Multilingual Text Mining based content gathering system for Open Source Intelligence. In: IAEA International Atomic Energy Agency, Symposium on International Safeguards: Addressing Verification Challenges, Wien, Austria, IAEA-CN148/192P, Book of Extended Synopses, October 16-20, 2006, pp. 
368–369 (2006) Computational Intelligence Solutions for Homeland Security Enrico Appiani and Giuseppe Buslacchi Elsag Datamat spa, via Puccini 2, 16154 Genova, Italy {Enrico.Appiani,Giuseppe.Buslacchi}@elsagdatamat.com Abstract. On the basis of consolidated requirements from international Polices, Elsag Datamat has developed an integrated tool suite, supporting all the main Homeland Security activities like operations, emergency response, investigation and intelligence analysis. The last support covers the whole “Intelligence Cycle” along its main phases and integrates a wide variety of automatic and semi-automatic tools, coming from both original company developments and from the market (COTS), in a homogeneous framework. Support to Analysis phase, most challenging and computing-intensive, makes use of Classification techniques, Clustering techniques, Novelty Detection and other sophisticated algorithms. An innovative and promising use of Clustering and Novelty Detection, supporting the analysis of “information behavior”, can be very useful to the analysts in identifying relevant subjects, monitoring their evolution and detecting new information events who may deserve attention in the monitored scenario. 1 Introduction Modern Law Enforcement Agencies experiment challenging and contrasting needs: on one side, the Homeland Security mission has become more and more complex, due to many national and international factors such as stronger crime organization, asymmetric threats, the complexity of civil and economic life, the criticality of infrastructures, and the rapidly growing information environment; on the other side, the absolute value of public security is not translated in large resource availability for such Agencies, which must cope with similar problems as business organizations, namely to conjugate the results with the search of internal efficiency, clarity of roles, careful programming and strategic resource allocation. Strategic programming for security, however, is not a function of business, but rather of the evolution of security threats, whose prevision and prevention capability has a double target: externally to an Agency, improving the coordination and the public image to the citizen, for better enforcing everyone’s cooperation to civil security; internally, improving the communication between corps and departments, the individual motivation through better assignation of roles and missions, and ultimately the efficiency of Law Enforcement operations. Joining good management of resources, operations, prevention and decisions translates into the need of mastering internal and external information with an integrated approach, in which different tools cooperate to a common, efficient and accurate information flow, from internal management to external intelligence, from resource allocation to strategic security decisions. E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 43–52, 2009. springerlink.com © Springer-Verlag Berlin Heidelberg 2009 44 E. Appiani and G. Buslacchi The rest of the paper describes an integrated framework trying to enforce the above principles, called Law Enforcement Agency Framework (LEAF) by Elsag Datamat, whose tools are in use by National Police and Carabinieri, and still under development for both product improvement and new shipments to national and foreign Agencies. Rather than technical details and individual technologies, this work tries to emphasize their flexible integration, in particular for intelligence and decision support. 
Next sections focus on the following topics: LEAF architecture and functions, matching the needs of Law Enforcement Agencies; the support to Intelligence and Investigations, employing a suite of commercial and leading-edge technologies; the role of the Clustering and Semantic technology combination in looking for elements of known or novel scenarios in large, unstructured and noisy document bases; and some conclusions with future perspectives.

2 Needs and Solutions for Integrated Homeland Security Support

Polices of advanced countries have developed or purchased their own IT support to manage their operations and administration. US Polices, facing their multiplicity (more than one Agency for each State), have focused on common standards for Record Management Systems (RMS) [1], in order to exchange and analyze data of Federal importance, such as criminal records. European Polices generally have their own IT support and are improving their international exchange and integration, also thanks to the Commission effort for a common European Security and Defense Policy [3], including common developments in the FP7 Research Program and Europol [2], offering data and services for criminal intelligence. At the opposite end, other Law Enforcement Agencies are building or completely replacing their IT support, motivated by rising security challenges and various internal needs, such as improving their organizations, achieving more accurate border control and fighting international crime trafficking more effectively. Elsag Datamat's LEAF aims at providing an answer both to Polices just improving their IT support and to Polices requiring complete solutions, from base IT infrastructure to top-level decision support. The LEAF architecture includes the following main functions, from low- to high-level information, as illustrated in Fig. 1 (a compact sketch of this layered decomposition is given at the end of this section):
• Infrastructure – IT information equipment, sensors, network, applications and management;
• Administration – Enterprise Resource Planning (ERP) for Agency personnel and other resources;
• Operations – support to Law Enforcement daily activities through recording all relevant events, decisions and documents;
• Emergency – facing and resolving security-compromising facts, with real-time and efficient resource scheduling, with possible escalation to crises;
• Intelligence – support to crime prevention, investigation and security strategy, through a suite of tools to acquire, process, analyze and disseminate useful information.

Fig. 1. LEAF Functional and Layered Architecture

We can now have a closer look at each main function, just to recall which concrete functionalities lie behind our common idea of Law Enforcement.

2.1 Infrastructure

The IT infrastructure of a Police force can host applications of similar complexity to business ones, but with more critical requirements of security, reliability, geographical articulation and communication with many fixed and mobile users. This requires the capability to perform most activities for managing complex distributed IT systems:
• Infrastructure management
• Service management
• Geographic Information System – common geo-processing service for the other applications in the system

2.2 Administration

This has the same requirements of personnel, resource and budget administration as multi-site companies, with stronger needs for supporting mission continuity.
• Enterprise Resource Planning (human resources, materials, vehicles, infrastructures, budget, procurement, etc.)
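As a purely hypothetical illustration of the layered decomposition listed before Fig. 1, the fragment below encodes the five LEAF layers and their example functions as a simple lookup structure; the real framework is a full product suite, and every name used here is an assumption made only for the example.

```python
# Hypothetical sketch only: a compact encoding of the five LEAF layers of
# Fig. 1 and the example functions listed above; not part of the real system.
from enum import Enum

class Layer(Enum):
    INFRASTRUCTURE = 1
    ADMINISTRATION = 2
    OPERATIONS = 3
    EMERGENCY = 4
    INTELLIGENCE = 5

LEAF_FUNCTIONS = {
    Layer.INFRASTRUCTURE: ["IT equipment", "sensors", "network", "applications", "management"],
    Layer.ADMINISTRATION: ["ERP for Agency personnel and other resources"],
    Layer.OPERATIONS: ["recording of relevant events, decisions and documents"],
    Layer.EMERGENCY: ["real-time resource scheduling", "escalation to crises"],
    Layer.INTELLIGENCE: ["acquire", "process", "analyze", "disseminate"],
}

def functions_of(layer: Layer) -> list:
    """Return the example functions associated with a LEAF layer."""
    return LEAF_FUNCTIONS[layer]

print(functions_of(Layer.INTELLIGENCE))
```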
2.3 Operations The Operations Support System is based on RMS implementation, partly inspired to the American standard, recording all actors (such as people, vehicles and other objects), events (such as incidents, accidents, field interviews), activities (such as arrest, booking, wants and warrants) and documents (such as passports and weapon licenses) providing a common information ground to everyday tasks and to the operations of the upper level (emergency and intelligence). In other words, this is the fundamental database of LEAF, whose data and services can be split as follows: • Operational activity (events and actors) • Judicial activity (support to justice) • Administrative activity (support to security administration) 46 E. Appiani and G. Buslacchi 2.4 Emergencies This is the core Police activity for reaction to security-related emergencies, whose severity and implications can be very much different (e.g. from small robberies to large terrorist attacks). An emergency alarm can be triggered in some different ways: by the Police itself during surveillance and patrolling activity; by sensor-triggered automatic alarms; by citizen directly signaling events to Agents; and by citizen calling a security emergency telephone number, such as 112 or 113 in Italy. IT support to organize a proper reaction is crucial, for saving time and choosing the most appropriate means. • Call center and Communications • Emergency and Resource Management 2.5 Intelligence This is the core support to prevention (detecting threats before they are put into action) and investigation (detecting the authors and the precise modalities of committed crimes); besides, it provides statistical and analytical data for understanding the evolution of crimes and takes strategic decisions for next Law Enforcement missions. More accurate description is demanded to the next section. 3 Intelligence and Investigations Nowadays threats are asymmetric, international, aiming to strike more than to win, and often moved by individuals or small groups. Maintenance of Homeland Security requires, much more than before, careful monitoring of every information sources through IT support, with a cooperating and distributed approach, in order to perform the classical Intelligence cycle on two basic tasks: • Pursuing Intelligence targets – performing research on specific military or civil targets, in order to achieve timely and accurate answers, moreover to prevent specific threats before they are realized; • Monitoring threats – listening to Open, Specialized and Private Sources, to capture and isolate security sensitive information possibly revealing new threats, in order to generate alarms and react with mode detailed investigation and careful prevention. In addition to Intelligence tasks, • Investigation relies on the capability to collect relevant information on past events and analyze it in order to discover links and details bringing to the complete situation picture; • Crisis management comes from emergency escalation, but requires further capabilities to simulate the situation evolution, like intelligence, and understand what has just happened, like investigations. 
Computational Intelligence Solutions for Homeland Security 47 The main Intelligence support functions are definable as follows: • Information source Analysis and Monitoring – the core information collection, processing and analysis • Investigation and Intelligence – the core processes • Crisis, Main events and Emergencies Management – the critical reaction to large events • Strategies, Management, Direction and Decisions – understand and forecast the overall picture Some supporting technologies are the same across the supporting functions above. In fact, every function involves a data processing flow, from sources to the final report, which can have similar steps as other flows. This fact shows that understanding the requirements, modeling the operational scenario and achieving proper integration of the right tools, are much more important steps that just acquiring a set of technologies. Another useful viewpoint to focus on the data processing flow is the classical Intelligence cycle, modeled with similar approach by different military (e.g. NATO rules and practices for Open Source Analysis [4] and Allied Joint Intelligence [5]) and civil institutions, whose main phases are: • Management - Coordination and planning, including resource and task management, mission planning (strategy and actions to take for getting the desired information) and analysis strategy (approach to distil and analyze a situation from the collected information). Employs the activity reports to evaluate results and possibly reschedule the plan. • Collection - Gathering signals and raw data from any relevant source, acquiring them in digital format suitable for next processing. • Exploiting - Processing signals and raw data in order to become useful “information pieces” (people, objects, events, locations, text documents, etc.) which can be labeled, used as indexes and put into relation. • Processing - Processing information pieces in order to get their relations and aggregate meaning, transformed and filtered at light of the situation to be analyzed or discovered. This is the most relevant support to human analysis, although in many cases this is strongly based on analyst experience and intuition. • Dissemination - This does not mean diffusion to a large public, but rather aggregating the analysis outcomes in suitable reports which can be read and exploited by decision makers, with precise, useful and timely information. The Intelligence process across phases and input data types can be represented by a pyramidal architecture of contributing tools and technologies, represented as boxes in fig. 2, not exhaustive and not necessarily related to the corresponding data types (below) and Intelligence steps (on the left). The diagram provides a closer look to the functions supporting for the Intelligence phases, some of which are, or are becoming, commercial products supporting in particular Business Intelligence, Open Source analysis and the Semantic Web. An example of industrial subsystem taking part in this vision is called IVAS, namely Integrated Video Archiving System, capable to receive a large number of radio/TV channels (up to 60 in current configurations), digitize them with both Web 48 E. Appiani and G. Buslacchi Fig. 2. 
The LEAF Intelligence architecture for integration of supporting technologies stream and high quality (DVD) data rates, store them in a disk buffer, allow the operators to browse the recorded channels and perform both manual and automatic indexing, synthesize commented emissions or clips, and archive or publish such selected video streams for later retrieval. IVAS manages Collection and Exploiting phases with Audio/Video data, and indirectly supports the later processing. Unattended indexing modules include Face Recognition, Audio Transcription, Tassonomic and Semantic Recognition. IVAS implementations are currently working for the Italian National Command Room of Carabinieri and for the Presidency of Republic. In summary, this section has shown the LEAF component architecture and technologies to support Intelligence and Investigations. The Intelligence tasks look at future scenarios, already known or partially unknown. Support to their analysis thus requires a combination of explicit knowledge-based techniques and inductive, implicit information extraction, as it being studied with the so called Hybrid Artificial Intelligence Systems (HAIS). An example of such combination is shown in the next section. Investigation tasks, instead, aim at reconstruct past scenarios, thus requiring the capability to model them and look for their related information items through text and multimedia mining on selected sources, also involving explicit knowledge processing, ontology and conceptual networks. 4 Inductive Classification for Non-structured Information Open sources are heterogeneous, unstructured, multilingual and often noisy, in the sense of being possibly written with improper syntax, various mistakes, automatic translation, OCR and other conversion techniques. Open sources to be monitored include: press selections, broadcast channels, Web pages (often from variable and short-life sites), blogs, forums, and emails. All them may be acquired in form of Computational Intelligence Solutions for Homeland Security 49 documents of different formats and types, either organized in renewing streams (e.g. forums, emails) or specific static information, at most updated over time. In such a huge and heterogeneous document base, classical indexing and text mining techniques may fail in looking for and isolating relevant content, especially with unknown target scenarios. Inductive technologies can be usefully exploited to characterize and classify a mix of information content and behavior, so as to classify sources without explicit knowledge indexing, acknowledge them based on their style, discover recurring subjects and detect novelties, which may reveal new hidden messages and possible new threats. Inductive clustering of text documents is achieved with source filtering, feature extraction and clustering tools based on Support Vector Machines (SVM) [7], through a list of features depending on both content (most used words, syntax, semantics) and style (such as number of words, average sentence length, first and last words, appearance time or refresh frequency). Documents are clustered according to their vector positions and distances, trying to optimize the cluster number by minimizing a distortion cost function, so as to achieve a good compromise between the compactness (not high number of cluster with a few documents each) and representativeness (common meaning and similarity among the clustered documents) of the obtained clusters. Some clustering parameters can be tuned manually through document subsets. 
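The paper gives no implementation details for this step, and the actual engine is based on proprietary SVM-derived clustering; purely as an illustrative stand-in, the sketch below combines TF-IDF content features with a few style features, clusters them with plain k-means while scanning the number of clusters against a crude distortion penalty, and flags the most distant documents as novelty candidates. The use of scikit-learn, numpy and the specific penalty weight are all assumptions made for the example.

```python
# Illustrative sketch only (not the Elsag Datamat implementation, which uses a
# proprietary SVM-based engine): mixed content/style features, k-means with a
# simple distortion-based choice of k, and distance-based outliers treated as
# novelty candidates.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def style_features(doc):
    words = doc.split()
    sentences = [s for s in doc.split(".") if s.strip()]
    return [
        len(words),                                         # document length
        np.mean([len(w) for w in words]) if words else 0.0, # average word length
        len(words) / max(len(sentences), 1),                # average sentence length
    ]

def cluster_with_novelty(docs, k_range=range(2, 8), outlier_quantile=0.95):
    content = TfidfVectorizer(max_features=500).fit_transform(docs).toarray()
    style = np.array([style_features(d) for d in docs])
    style = (style - style.mean(axis=0)) / (style.std(axis=0) + 1e-9)
    X = np.hstack([content, style])

    best = None
    for k in k_range:
        if k >= len(docs):
            break
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        # crude distortion penalty discouraging many tiny clusters
        cost = km.inertia_ + 0.05 * k * X.shape[0]
        if best is None or cost < best[0]:
            best = (cost, km)
    if best is None:
        raise ValueError("need more documents than clusters")
    km = best[1]

    # documents far from their cluster centre are reported as novelty candidates
    dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    outliers = np.where(dists > np.quantile(dists, outlier_quantile))[0]
    return km.labels_, outliers
```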
The clustered documents are then partitioned in different folders whose name include the most recurring words, excluding the “stop-words”, namely frequent words with low semantic content, such as prepositions, articles and common verbs. This way we can obtain a pseudo-classification of documents, expressed by the common concepts associated to the resulting keywords of each cluster. The largest experiment has been led upon about 13,000 documents of a press release taken from Italian newspapers in 2006, made digital through scanning and OCR. The document base was much noisy, with many words changed, abbreviated, concatenated with others, or missed; analyzing this sample with classical text processing, if not semantic analysis, would have been problematic indeed, since language technology is very sensitive to syntactical correctness. Instead, with this clustering technique, most of the about 30 clusters obtained had a true common meaning among the composing documents (for instance, criminality of specific types, economy, terrorist attacks, industry news, culture, fiction, etc.), with more specific situations expressed by the resulting keywords. Further, such keywords were almost correct, except for a few clusters grouping so noisy documents that it would have been impossible to find some common sense. In practice, the document noise has been removed when asserting the prevailing sense of the most representative clusters. Content Analysis (CA) and Behavior Analysis (BA) can support each other in different ways. CA applied before BA can add content features to the clustering space. Conversely, BA applied before CA can reduce the number of documents to be processed for content, by isolating relevant groups expressing a certain type of style, source and/or conceptual keyword set. CA further contributes in the scenario interpretation by applying reasoning tools to inspect clustering results at the light of domain-specific knowledge. Analyzing cluster contents may help application-specific ontologies discover unusual patterns in the observed domain. Conversely, novel information highlighted by BA might help dynamic ontologies to update their knowledge in a semi-automated way. Novelty Detection is obtained through the “outliers”, 50 E. Appiani and G. Buslacchi namely documents staying at a certain relative distance from their cluster centers, thus expressing a loose commonality with more central documents. Outliers from a certain source, for instance an Internet forum, can reveal some odd style or content with respect to the other documents. Dealing with Intelligence for Security, we can have two different operational solutions combining CA and BA, respectively supporting Prevention and Investigation [6]. This is still subject of experiments, the major difficulty being to collect relevant documents and some real, or at least realistic, scenario. Prevention-mode operation is illustrated in Fig. 3, and proceeds as follows. 1) Every input document is searched for basic terms, context, and eventually key concepts and relations among these (by using semantic networks and/or ontology), in order to develop an understanding of the overall content of each document and its relevance to the reference scenario. 2a) In the knowledge-base set-up phase, the group of semantically analyzed documents forming a training set, undergoes a clustering process, whose similarity metrics is determined by both linguistic features (such as lexicon, syntax, style, etc.) 
and semantic information (concept similarity derived from the symbolic information tools). 2b) At run-time operation, each new document is matched with the existing clusters; outlier detection, together with a history of the categorization process, highlights possibly interesting elements and subsets of documents bearing novel contents. 3) Since clustering tools are not able to extract mission-specific knowledge from input information, ontology processing interprets the detected trends in the light of possible criminal scenarios. If the available knowledge base cannot explain the extracted information adequately, the component may decide to bring an alert to the analyst's attention and possibly tag the related information for future use.

Fig. 3. Functional CA-BA combined dataflow for prevention mode

Novelty detection is the core activity of this operation mode, and relies on the interaction between BA and CA to define a 'normal-state scenario', to be used for identifying interesting deviations. The combination of inductive clustering and explicit knowledge extraction is promising in helping analysts both to perform a gross classification of large, unknown and noisy document bases and to find promising content in some clusters, hence refining the analysis through explicit knowledge processing. This combination lies between the information exploitation and processing phases of the intelligence cycle, as recalled in the previous section; in fact, it contributes both to isolating meaningful information items and to analyzing the overall results.

Fig. 4. Functional CA-BA combined dataflow for Investigation mode

Investigation-mode operation is illustrated in Fig. 4 and proceeds as follows. 1) A reference criminal scenario is assumed as a basic hypothesis. 2) As in prevention-mode operation, input documents are searched in order to develop an understanding of the overall content of each document and its relevance to the reference scenario. 3) BA (re)groups the existing base of documents by embedding the scenario-specific conceptual similarity in the document- and cluster-distance criterion. The result is a grouping of documents that indirectly takes into account the relevance of the documents to the reference criminal scenario. 4) CA uses high-level knowledge describing the assumed scenario to verify the consistency of BA. The output is a confirmation of the sought-for hypothesis or, more likely, a structural description of the possibly partial match, which provides useful directives for actively searching for missing information, which ultimately serves to validate or disclaim the investigative assumption. Elsag Datamat already uses CA with different tools within the LEAF component for Investigation, Intelligence and Decision Support. The combination with BA is being tested on Italian document sets in order to set up a prototype for Carabinieri (one of the Italian security forces), to be tried in the field between 2009 and 2010. In parallel, multi-language CA-BA solutions are being studied for the needs of international Polices. 52 5 E. Appiani and G.
Buslacchi Conclusions In this position paper we have described an industrial solution for integrated IT support to Law Enforcement Agencies, called LEAF, in line with the state of the art of this domain, including some advanced computational intelligence functions to support Intelligence and Investigations. The need for an integrated Law Enforcement support, realized by the LEAF architecture, has been explained. The key approach for LEAF computational intelligence is modularity and openness to different tools, in order to realize the most suitable processing workflow for any analysis needs. LEAF architecture just organizes such workflow to support, totally or partially, the intelligence cycle phases: acquisition, exploitation, processing and dissemination, all coordinated by management. Acquisition and exploitation involve multimedia data processing, while processing and dissemination work at conceptual object level. An innovative and promising approach to analysis of Open and Private Sources combines Content and Behavior Analysis, this last exploring the application of | clustering techniques to textual documents, usually subject to text and language processing. BA shows the potential to tolerate the high number, heterogeneity and information noise of large Open Sources, creating clusters whose most representative keywords can express an underlying scenario, directly to the analysts’ attention or with the help of knowledge models. References 1. Law Enforcement Record Management Systems (RMS) – Standard functional specifications by Law Enforcement Information Technology Standards Council (LEITSC) (updated 2006), http://www.leitsc.org 2. Europol – mission, mandates, security objectives, http://www.europol.europa.eu 3. European Security and Defense Policy – Wikipedia article, http://en.wikipedia.org/wiki/ European_Security_and_Defence_Policy 4. NATO Open Source Intelligence Handbook (November 2001), http://www.oss.net 5. NATO Allied Joint Intelligence, Counter Intelligence and Security Doctrine. Allied Joint Publication (July 2003) 6. COBASIM Proposal for FP7 – Security Call 1 – Proposal no. 218012 (2007) 7. Jing, L., Ng, M.K., Zhexue Huang, J.: An Entropy Weighting k-Means Algorithm for Subspace Clustering of High-Dimensional Sparse Data. IEEE Transactions on knowledge and data engineering 19(8) (August 2007) Virtual Weapons for Real Wars: Text Mining for National Security Alessandro Zanasi ESRIF-European Security Research and Innovation Forum University of Bologna Professor Temis SA Cofounder a_zanasi@yhaoo.it Abstract. Since the end of the Cold War, the threat of large scale wars has been substituted by new threats: terrorism, organized crime, trafficking, smuggling, proliferation of weapons of mass destruction. The new criminals, especially the so called “jihadist” terrorists are using the new technologies, as those enhanced by Web2.0, to fight their war. Text mining is the most advanced knowledge management technology which allow intelligence analysts to automatically analyze the content of information rich online data banks, suspected web sites, blogs, emails, chat lines, instant messages and all other digital media detecting links between people and organizations, trends of social and economic actions, topics of interest also if they are “sunk” among terabytes of information. Keywords: National security, information sharing, text mining. 
1 Introduction After the 9/11 shock, the world of intelligence is reshaping itself, since that the world is requiring a different intelligence: dispersed, not concentrated; open to several sources; sharing its analysis with a variety of partners, without guarding its secrets tightly; open to strong utilization of new information technologies to take profit of the information (often contradictory) explosion (information density doubles every 24 months and its costs are halved every 18 months [1]); open to the contributions of the best experts, also outside government or corporations [2], e.g. through a publicprivate partnership (PPP or P3: a system in which a government service or private business venture is funded and operated through a partnership of government and one or more private sector companies). The role of competitive intelligence has assumed great importance not only in the corporate world but also in the government one, largely due to the changing nature of national power. Today the success of foreign policy rests to a significant extent on energy production control, industrial and financial power, and energy production control, industrial and financial power in turn are dependent on science and technology and business factors and on the capacity of detecting key players and their actions. New terrorists are typically organized in small, widely dispersed units and coordinate their activities on line, obviating the need for central command. Al Qaeda and E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 53–60, 2009. © Springer-Verlag Berlin Heidelberg 2009 springerlink.com 54 A. Zanasi similar groups rely on the Internet to contact potential recruits and donors, sway public opinion, instruct would-be terrorists, pool tactics and knowledge, and organize attacks. This phenomenon has been called Netwar (a form of conflict marked by the use of network forms of organizations and related doctrines, strategies, and technologies) [3]. In many ways, such groups use Internet in the same way that peaceful political organizations do; what makes terrorists’ activity threatening is their intent. This approach reflects what the world is experiencing in the last ten years: a paradigm shift from an organization-driven threat architecture (i.e., communities and social activities focused around large companies or organizations) to an individual-centric threat architecture (increased choice and availability of opportunities focused around individual wants and desires). This is a home, local community, and virtual-communitycentric societal architecture: the neo-renaissance paradigm. This new lifestyle is due to a growing workforce composed of digitally connected free-agent (e.g. terrorists) able to operate from any location and to be engaged anywhere on the globe [4]. Due to this growing of virtual communities, a strong interest towards the capability of automatic evaluation of communications exchanged inside these communities and of their authors is also growing, directed to profiling activity, to extract authors personal characteristics (useful in investigative actions too). So, to counter netwar terrorists, intelligence must learn how to monitor their network activity, also online, in the same way it keeps tabs on terrorists in the real world. 
Doing so will require a realignment of western intelligence and law enforcement agencies, which lag behind terrorist organizations in adopting information technologies [5] and, at least for NSA and FBI, to upgrade their computers to better coordinate intelligence information [6]. The structure of the Internet allows malicious activity to flourish and the perpetrators to remain anonymous. Since it would be nearly impossible to identify and disable every terrorist news forum on the internet given the substantial legal and technical hurdles involved (there are some 4,500 web sites that disseminate the al Qaeda leadership’s messages [7]), it would make more sense to leave those web sites online but watch them carefully. These sites offer governments’ intelligence unprecedented insight into terrorists’ ideology and motivations. Deciphering these Web sites will require not just Internet savvy but also the ability to read Arabic and understand terrorists’ cultural backgrounds-skills most western counterterrorism agencies currently lack [5]. These are the reasons for which the text mining technologies, which allow the reduction of information overload and complexity, analyzing texts also in unknown, exotic languages (Arabic included: the screenshots in the article, as all the other ones, are Temis courtesy), have become so important in the government as in the corporate intelligence world. For an introduction to text mining technology and its applications to intelligence: [8]. The question to be answered is: once defined the new battle field (i.e. the Web) how to use the available technologies to fight the new criminals and terrorists? A proposed solution is: through an “Internet Center”. That is a physical place where to concentrate advanced information technologies (including Web 2.0, machine translation, crawlers, text mining) and human expertise, not only in information technologies but also in online investigations). Virtual Weapons for Real Wars: Text Mining for National Security 55 Fig. 1. Text mining analysis of Arabic texts We present here the scenarios into which these technologies, especially those regarding text mining are utilized, with some real cases. 2 New Challenges to the Market State The information revolution is the key enabler of economic globalization. The age of information is also the age of emergence of the so called market-state [9] which maximizes the opportunities of its people, facing lethal security challenges which dramatically change the roles of government and of private actors and of intelligence. Currently governments power is being challenged from both above (international commerce, which erodes what used to be thought of as aspects of national sovereignty) and below (terrorist and organized crime challenge the state power from beneath, by trying to compel states to acquiesce or by eluding the control of states). Tackling these new challenges is the role of the new government intelligence. From the end of Cold War there is general agreement about the nature of the threats that posed a challenge to the intelligence community: drugs, organized crime, and proliferation of conventional and unconventional weapons, terrorism, financial crimes. All these threats aiming for violence, not victory, may be tackled through the help of technologies as micro-robots, bio-sniffers, and sticky electronics. 
Information technology, applied to open sources analysis, is a support in intelligence activities directed to prevent these threats [10] and to assure homeland protection [11]. 56 A. Zanasi Since 2001 several public initiatives involving data and text mining appeared in USA and Europe. All of them shared the same conviction: information is the best arm against the asymmetric threats. Fig. 2. The left panel allows us to highlight all the terms which appear in the collected documents and into which we are interested in (eg: Hezbollah). After clicking on the term, in the right column appear the related documents. 3 The Web as a Source of Information Until some years ago the law enforcement officers were interested only in retrieving data coming from web sites and online databanks (collections of information, available online, dedicated to press surveys, patents, scientific articles, specific topics, commercial data). Now they are interested in analyzing the data coming from internet seen as a way of communication: e-mails, chat rooms, forums, newsgroups, blogs (obtained, of course, after being assured that data privacy rules have been safeguarded). Of course, this nearly unending stream of new information, especially regarding communications, also in exotic languages, created not only an opportunity but also a new problem. The information data is too large to be analyzed by human beings and the languages in which this data are written are very unknown to the analysts. Luckily these problems, created by technologies, may be solved thanks to other information technologies. Virtual Weapons for Real Wars: Text Mining for National Security 57 4 Text Mining The basic text mining technique is Information Extraction consisting in linguistic processing, where semantic models are defined according to the user requirements, allowing the user to extract the principal topics of interest to him. These semantic models are contained in specific ontologies, engineering artefacts which contain a specific vocabulary used to describe a certain reality, plus a set of explicit assumptions regarding the intended meaning of the vocabulary words. This technique allows, for example, the extraction of organization and people names, email addresses, bank account, phone and fax numbers as they appear in the data set. For example, once defined a technology or a political group, we can quickly obtain the list of organizations working with that technology or the journalists supporting that opinion or the key players for that political group. 5 Virtual Communities Monitoring A virtual community, whose blog and chats are typical examples, are communities of people sharing and communicating common interests, ideas, and feelings over the Internet or other collaborative networks. The possible inventor of this term was Howard Rheingold, who defines virtual communities as social aggregations that emerge from the Internet when enough people carry on public discussions long enough and with sufficient human feeling to form webs of personal relationships in cyberspace [12]. Most community members need to interact with a single individual in a one-to-one conversation or participate and collaborate in idea development via threaded conversations with multiple people or groups. This type of data is, clearly, an exceptional source to be mined [19]. 6 Accepting the Challenges to National Security 6.1 What We Need It is difficult for government intelligence to counter the threat that terrorists pose. 
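Before listing those needs in detail, a minimal sketch can make the information-extraction step of Sect. 4 concrete: the fragment below uses only regular expressions as a crude stand-in for the ontology-driven linguistic processing described above, and every pattern in it is an illustrative assumption rather than a Temis component.

```python
# Minimal regular-expression sketch of the entity extraction described in
# Sect. 4; the real system uses ontology-driven linguistic processing, so the
# patterns below are purely illustrative assumptions.
import re

PATTERNS = {
    "email": r"[\w.+-]+@[\w-]+\.[\w.]+",
    "phone": r"\+?\d[\d ()/.-]{7,}\d",
    "iban":  r"\b[A-Z]{2}\d{2}[A-Z0-9]{10,30}\b",
    # crude person/organization guess: consecutive capitalized words
    "name":  r"\b(?:[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+)\b",
}

def extract_entities(text):
    """Return a dict mapping entity type to the matches found in `text`."""
    return {label: re.findall(regex, text) for label, regex in PATTERNS.items()}

sample = ("Contact Mario Rossi at mario.rossi@example.org or +39 055 123456; "
          "funds moved to IT60X0542811101000000123456.")
print(extract_entities(sample))
```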
To fight them we need solutions able to detect their names in the communications, to detect their financial movements, to recognize the real authors of anonymous documents, to put in evidence connections inside social structures, to track individuals through collecting as much information about them as possible and using computer algorithms and human analysis to detect potential activity. 6.2 Names and Relationships Detection New terrorist groups rise each week, new terrorists each day. Their names, often written in a different alphabet, are difficult to be caught and checked against other names already present in the databases. Text mining technology allows their detection, also with their connections to other groups or people. 58 A. Zanasi Fig. 3. Extraction of names with detection of connection to suspect terrorist names and the reason of this connection 6.3 Money Laundering Text mining is used in detecting anomalies in the fund transfer request process and in the automatic population of black lists. 6.4 Insider Trading To detect insider trading it is necessary to track the stock trading activity for every publicly traded company, plot it on a time line and compare anomalous peaks to company news: if there is no news to spur a trading peak, that is a suspected insider trading. To perform this analysis it is necessary to extract the necessary elements (names of officers and events, separated by category) from news text and then correlate them with the structured data coming from stock trading activity [13]. 6.5 Defining Anonymous Terrorist Authorship Frequently the only traces available after a terrorist attack are the emails or the communications claiming the act. The analyst must analyze the style, the concepts and feelings [14] expressed in a communication to establish connections and patterns between documents [15], comparing them with documents coming from known Virtual Weapons for Real Wars: Text Mining for National Security 59 authors: famous attackers (Unabomber was the most famous one) were precisely described, before being really detected, using this type of analysis. 6.6 Digital Signatures Human beings are habit beings and have some personal characteristics (more than 1000 “style markers” have been quoted in literature) that are inclined to persist, useful in profiling the authors of the texts. 6.7 Lobby Detection Analyzing connections, similarities and general patterns in public declarations and/or statements of different people allows the recognition of unexpected associations («lobbies») of authors (as journalists, interest groups, newspapers, media groups, politicians) detecting whom, among them, is practically forming an alliance. 6.8 Monitoring of Specific Areas/Sectors In business there are several examples of successful solutions applied to competitive intelligence. E.g. Unilever, text mining patents discovered that a competitor was planning new activities in Brazil which really took place a year later [16]. Telecom Italia, discovered that a competitor (NEC-Nippon Electric Company) was going to launch new services in multimedia [16]. Total (F), mines Factiva and Lexis-Nexis databases to detect geopolitical and technical information. 6.9 Chat Lines, Blogs and Other Open Sources Analysis The first enemy of intelligence activity is the “avalanche” of information that daily the analysts must retrieve, read, filter and summarize. 
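Returning briefly to the authorship analysis of Sects. 6.5 and 6.6, a few of the "style markers" mentioned there can be computed with very little code. The toy fragment below profiles texts by simple markers (sentence length, word length, function-word rates) and compares an anonymous message against known-author samples by cosine similarity; it is an illustrative sketch under these assumptions, not a forensic tool, and the small marker set is far from the more than 1000 markers cited in the literature.

```python
# Toy stylometric comparison in the spirit of Sects. 6.5-6.6: a handful of
# simple style markers and a cosine comparison of an anonymous text against
# known-author profiles. Purely illustrative.
import math
import re

FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "is", "for", "with", "as"]

def style_profile(text):
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n = max(len(words), 1)
    profile = [
        len(words) / max(len(sentences), 1),   # average sentence length
        sum(len(w) for w in words) / n,        # average word length
        len(set(words)) / n,                   # type/token ratio
    ]
    profile += [words.count(fw) / n for fw in FUNCTION_WORDS]
    return profile

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def most_similar_author(anonymous_text, known_texts):
    """known_texts: dict author -> sample text. Returns (author, similarity)."""
    anon = style_profile(anonymous_text)
    scores = {a: cosine(anon, style_profile(t)) for a, t in known_texts.items()}
    return max(scores.items(), key=lambda kv: kv[1])
```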
The Al Qaeda terrorists declared to interact among them through chat lines to avoid being intercepted [17]: interception and analysis of chat lines content is anyway possible and frequently done in commercial situations [18], [19]. Using different text mining techniques it is possible to identify the context of the communication and the relationships among documents detecting the references to the interesting topics, how they are treated and what impression they create in the reader [20]. 6.10 Social Network Links Detection “Social structure” has long been an important concept in sociology. Network analysis is a recent set of methods for the systematic study of social structure and offers a new standpoint from which to judge social structures [21]. Text mining is giving an important help in detection of social network hidden inside large volumes of text also detecting the simultaneous appearance of entities (names, events and concepts) measuring their distance (proximity). 60 A. Zanasi References 1. Lisse, W.: The Economics of Information and the Internet. Competitive Intelligence Review 9(4) (1998) 2. Treverton, G.F.: Reshaping National Intelligence in an Age of Information. Cambridge University Press, Cambridge (2001) 3. Ronfeldt, D., Arquilla, J.: The Advent of Netwar –Rand Corporation (1996) 4. Goldfinger, C.: Travail et hors Travail: vers une societe fluide. In: Jacob, O. (ed.) (1998) 5. Kohlmann, E.: The Real Online Terrorist Threat – Foreign Affairs (September/October 2006) 6. Mueller, J.: Is There Still a Terrorist Threat? – Foreign Affairs (September/ October 2006) 7. Riedel, B.: Al Qaeda Strikes Back – Foreign Affairs (May/June 2007) 8. Zanasi, A. (ed.): Text Mining and its Applications to Intelligence, CRM and Knowledge Management. WIT Press, Southampton (2007) 9. Bobbitt, P.: The Shield of Achilles: War, Peace, and the Course of History, Knopf (2002) 10. Zanasi, A.: New forms of war, new forms of Intelligence: Text Mining. In: ITNS Conference, Riyadh (2007) 11. Steinberg, J.: In Protecting the Homeland 2006/2007 - The Brookings Institution (2006) 12. Rheingold, H.: The Virtual Community. MIT Press, Cambridge (2000) 13. Feldman, S.: Insider Trading and More, IDC Report, Doc#28651 (December 2002) 14. de Laat, M.: Network and content analysis in an online community discourse. University of Nijmegen (2002) 15. Benedetti, A.: Il linguaggio delle nuove Brigate Rosse, Erga Edizioni (2002) 16. Zanasi, A.: Competitive Intelligence Thru Data Mining Public Sources - Competitive Intelligence Review, vol. 9(1). John Wiley & Sons, Inc., Chichester (1998) 17. The Other War, The Economist March 26 (2003) 18. Campbell, D.: - World under Watch, Interception Capabilities in the 21st Century – ZDNet.co (2001) (updated version of Interception Capabilities 2000, A report to European Parlement - 1999) 19. Zanasi, A.: Email, chatlines, newsgroups: a continuous opinion surveys source thanks to text mining. In: Excellence in Int’l Research 2003 - ESOMAR (Nl) (2003) 20. Jones, C.W.: Online Impression Management. University of California paper (July 2005) 21. Degenne, A., Forse, M.: Introducing Social Networks. Sage Publications, London (1999) Hypermetric k-Means Clustering for Content-Based Document Management Sergio Decherchi, Paolo Gastaldo, Judith Redi, and Rodolfo Zunino Dept. Biophysical and Electronic Engineering, University of Genoa, 16145 Genova, Italy {sergio.decherchi,paolo.gastaldo,judith.redi, rodolfo.zunino}@unige.it Abstract. 
Text-mining methods have become a key feature for homeland-security technologies, as they can help explore effectively increasing masses of digital documents in the search for relevant information. This research presents a model for document clustering that arranges unstructured documents into content-based homogeneous groups. The overall paradigm is hybrid because it combines pattern-recognition grouping algorithms with semantic-driven processing. First, a semantic-based metric measures distances between documents, by combining a content-based with a behavioral analysis; the metric considers both lexical properties and the structure and styles that characterize the processed documents. Secondly, the model relies on a Radial Basis Function (RBF) kernel-based mapping for clustering. As a result, the major novelty aspect of the proposed approach is to exploit the implicit mapping of RBF kernel functions to tackle the crucial task of normalizing similarities while embedding semantic information in the whole mechanism. Keywords: document clustering, homeland security, kernel k-means. 1 Introduction The automated surveillance of information sources is of strategic importance to effective homeland security [1], [2]. The increased availability of data-intensive heterogeneous sources provides a valuable asset for the intelligence task; data-mining methods have therefore become a key feature for security-related technologies [2], [3] as they can help explore effectively increasing masses of digital data in the search for relevant information. Text mining techniques provide a powerful tool to deal with large amounts of unstructured text data [4], [5] that are gathered from any multimedia source (e.g. from Optical Character Recognition, from audio via speech transcription, from webcrawling agents, etc.). The general area of text-mining methods comprises various approaches [5]: detection/tracking tools continuously monitor specific topics over time; document classifiers label individual files and build up models for possible subjects of interest; clustering tools process documents for detecting relevant relations among those subjects. As a result, text mining can profitably support intelligence and security activities in identifying, tracking, extracting, classifying and discovering patterns, so that the outcomes can generate alerts notifications accordingly [6] ,[7]. E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 61–68, 2009. © Springer-Verlag Berlin Heidelberg 2009 springerlink.com 62 S. Decherchi et al. This work addresses document clustering and presents a dynamic, adaptive clustering model to arrange unstructured documents into content-based homogeneous groups. The framework implements a hybrid paradigm, which combines a contentdriven similarity processing with pattern-recognition grouping algorithms. Distances between documents are worked out by a semantic-based hypermetric: the specific approach integrates a content-based with a user-behavioral analysis, as it takes into account both lexical and style-related features of the documents at hand. The core clustering strategy exploits a kernel-based version of the conventional k-means algorithm [8]; the present implementation relies on a Radial Basis Function (RBF) kernelbased mapping [9]. 
The advantage of using such a kernel is that it supports normalization implicitly; normalization is a critical issue in most text-mining applications, since it prevents extensive properties of documents (such as length or lexicon) from distorting the representation and affecting performance. A standard benchmark for content-based document management, the Reuters database [10], provided the experimental domain for the proposed methodology. The research shows that the document clustering framework based on kernel k-means can generate consistent structures for information access and retrieval. 2 Document Clustering Text mining can effectively support the strategic surveillance of information sources through automatic means, which is of paramount importance to homeland security [6], [7]. For prevention, text mining techniques can help identify novel "information trends" revealing new scenarios and threats to be monitored; for investigation, these technologies can help distil relevant information about known scenarios. Within the text mining framework, this work addresses document clustering, which is one of the most effective techniques to organize documents in an unsupervised manner. When applied to text mining, clustering algorithms are designed to discover groups in the set of documents such that the documents within a group are more similar to one another than to documents of other groups. As opposed to text categorization [5], in which categories are predefined and are part of the input information to the learning procedure, document clustering follows an unsupervised paradigm and partitions a set of documents into several subsets. Thus, the document clustering problem can be defined as follows. One should first define a set of documents D = {D1, . . . , Dn}, a similarity measure (or distance metric), and a partitioning criterion, which is usually implemented by a cost function. In the case of flat clustering, one sets the desired number of clusters, Z, and the goal is to compute a membership function φ : D → {1, . . . , Z} such that φ minimizes the partitioning cost with respect to the similarities among documents. Conversely, hierarchical clustering does not need to define the cardinality, Z, and applies a series of nested partitioning tasks which eventually yield a hierarchy of clusters. Indeed, every text mining framework should always be supported by an information extraction (IE) model [11], [12], which is designed to pre-process digital text documents and to organize the information according to a given structure that can be directly interpreted by a machine learning system. Thus, a document D is eventually reduced to a sequence of terms and is represented as a vector, which lies in a space spanned by the dictionary (or vocabulary) T = {tj; j = 1, ..., nT}. The dictionary collects all terms used to represent any document D, and can be assembled empirically by gathering the terms that occur at least once in a document collection D; by this representation one loses the original relative ordering of terms within each document. Different models [11], [12] can be used to retrieve index terms and to generate the vector that represents a document D. However, the vector space model [13] is the most widely used method for document clustering. Given a collection of documents D, the vector space model represents each document D as a vector of real-valued term weights v = {wj; j = 1, ..., nT}.
Each component of the nT-dimensional vector is a non-negative term weight, wj, that characterizes the j-th term and denotes the relevance of the term itself within the document D. 3 Hypermetric k-Means Clustering The hybrid approach described in this Section combines the specific advantages of content-driven processing with the effectiveness of an established pattern-recognition grouping algorithm. Document similarity is defined by a content-based distance, which combines a classical distribution-based measure with a behavioral analysis of the style features of the compared documents. The core engine relies on a kernel-based version of the classical k-means partitioning algorithm [8] and groups similar documents by a top-down hierarchical process. In the kernel-based approach, every document is mapped into an infinite-dimensional Hilbert space, where only inner products among elements are meaningful and computable. In the present case the kernel-based version of k-means [15] provides a major advantage over the standard k-means formulation. In the following, D = {Du; u = 1, ..., nD} will denote the corpus, holding the collection of documents to be clustered. The set T = {tj; j = 1, ..., nT} will denote the vocabulary, which is the collection of terms that occur at least once in D after the preprocessing steps applied to each document D ∈ D (e.g., stop-words removal, stemming [11]). 3.1 Document Distance Measure A novel aspect of the method described here is the use of a document distance that takes into account both a conventional content-based similarity metric and a behavioral similarity criterion. The latter term aims to improve the overall performance of the clustering framework by including the structure and style of the documents in the process of similarity evaluation. To support the proposed document distance measure, a document D is here represented by a pair of vectors, v′ and v′′. Vector v′(D) addresses the content description of a document D; it can be viewed as the conventional nT-dimensional vector that associates each term t ∈ T with the normalized frequency, tf, of that term in the document D. Therefore, the k-th element of the vector v′(Du) is defined as:

v'_{k,u} = \frac{tf_{k,u}}{\sum_{l=1}^{n_T} tf_{l,u}} ,   (1)

where tfk,u is the frequency of the k-th term in document Du. Thus v′ represents a document by a classical vector model, and uses term frequencies to set the weights associated with each element. From a different perspective, the structural properties of a document, D, are represented by a set of probability distributions associated with the terms in the vocabulary. Each term t ∈ T that occurs in Du is associated with a distribution function that gives the spatial probability density function (pdf) of t in Du. Such a distribution, pt,u(s), is generated under the hypothesis that, when detecting the k-th occurrence of a term t at the normalized position sk ∈ [0,1] in the text, the spatial pdf of the term can be approximated by a Gaussian distribution centered around sk. In other words, if the term tj is found at position sk within a document, another document with a similar structure is expected to include the same term at the same position or in a neighborhood thereof, with a probability defined by a Gaussian pdf.
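A minimal Python sketch of this two-fold representation may help fix ideas: it builds the normalized term-frequency vector of Eq. (1) and the discretized positional histograms that approximate the spatial pdf just described (the section-based discrete approximation is formalized right after Eq. (2) below). The naive whitespace tokenization and the fixed number of sections are simplifying assumptions of this sketch, not part of the authors' pipeline.

from collections import Counter

def tf_vector(tokens, vocabulary):
    """Normalized term-frequency vector v' of Eq. (1) over a fixed vocabulary."""
    counts = Counter(tokens)
    total = sum(counts[t] for t in vocabulary) or 1
    return [counts[t] / total for t in vocabulary]

def positional_histograms(tokens, vocabulary, n_sections=8):
    """Discrete stand-in for the spatial pdf: for each term, the fraction of its
    occurrences falling in each of the S equal sections of the document."""
    hists = {t: [0.0] * n_sections for t in vocabulary}
    n = len(tokens) or 1
    for pos, tok in enumerate(tokens):
        if tok in hists:
            section = min(int(n_sections * pos / n), n_sections - 1)
            hists[tok][section] += 1.0
    for t, h in hists.items():
        s = sum(h)
        if s > 0:
            hists[t] = [x / s for x in h]
    return hists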
To derive a formal expression of the pdf, assume that the u-th document, Du, holds nO occurrences of terms after simplifications; if a term occurs more than once, each occurrence is counted individually when computing nO, which can be viewed as a measure of the length of the document. The spatial pdf can be defined as:

p_{t,u}(s) = \frac{1}{A} \sum_{k=1}^{n_O} G(s_k, \lambda) = \frac{1}{A} \sum_{k=1}^{n_O} \frac{1}{\sqrt{2\pi}\,\lambda} \exp\!\left[ -\frac{(s - s_k)^2}{\lambda^2} \right] ,   (2)

where A is a normalization term and λ is a regularization parameter. In practice one uses a discrete approximation of (2). First, the document D is segmented evenly into S sections. Then, an S-dimensional vector is generated for each term t ∈ T, and each element estimates the probability that the term t occurs in the corresponding section of the document. As a result, v′′(D) is an array of nT vectors having dimension S. Vector v′ and vector v′′ support the computations of the frequency-based distance, ∆(f), and the behavioral distance, ∆(b), respectively. The former term is usually measured according to a standard Minkowski distance, hence the content distance between a pair of documents (Du, Dv) is defined by:

\Delta^{(f)}(D_u, D_v) = \left[ \sum_{k=1}^{n_T} \left| v'_{k,u} - v'_{k,v} \right|^{p} \right]^{1/p} .   (3)

The present approach adopts the value p = 1 and therefore actually implements a Manhattan distance metric. The term computing the behavioral distance, ∆(b), applies a Euclidean metric to compute the distance between the probability vectors v′′. Thus:

\Delta^{(b)}(D_u, D_v) = \sum_{k=1}^{n_T} \Delta^{(b)}_{t_k}(D_u, D_v) = \sum_{k=1}^{n_T} \sum_{s=1}^{S} \left[ v''_{(k)s,u} - v''_{(k)s,v} \right]^2 .   (4)

Both terms (3) and (4) contribute to the computation of the eventual distance value, ∆(Du, Dv), which is defined as follows:

\Delta(D_u, D_v) = \alpha\,\Delta^{(f)}(D_u, D_v) + (1 - \alpha)\,\Delta^{(b)}(D_u, D_v) ,   (5)

where the mixing coefficient α ∈ [0,1] weights the relative contribution of ∆(f) and ∆(b). It is worth noting that the distance expression (5) obeys the basic properties of non-negativity and symmetry that characterize general metrics, but does not necessarily satisfy the triangular property. 3.2 Kernel k-Means The conventional k-means paradigm supports an unsupervised grouping process [8], which partitions the set of samples, D = {Du; u = 1, ..., nD}, into a set of Z clusters, Cj (j = 1, ..., Z). In practice, one defines a "membership vector," which indexes the partitioning of input patterns over the Z clusters as: mu = j ⇔ Du ∈ Cj, otherwise mu = 0; u = 1, ..., nD. It is also useful to define a "membership function" δuj(Du, Cj) that encodes the membership of the u-th document in the j-th cluster: δuj = 1 if mu = j, and 0 otherwise. Hence, the number of members of a cluster is expressed as

N_j = \sum_{u=1}^{n_D} \delta_{uj} ; \quad j = 1, \dots, Z ;   (6)

and the cluster centroid is given by:

\mathbf{w}_j = \frac{1}{N_j} \sum_{u=1}^{n_D} \mathbf{x}_u\,\delta_{uj} ; \quad j = 1, \dots, Z ;   (7)

where xu is any vector-based representation of document Du. The kernel-based version of the algorithm is based on the assumption that a function, Φ, can map any element, D, into a corresponding position, Φ(D), in a possibly infinite-dimensional Hilbert space. The mapping function defines the actual 'kernel', which is formulated as the expression to compute the inner product:

K(D_u, D_v) \stackrel{\mathrm{def}}{=} K_{uv} = \Phi(D_u) \cdot \Phi(D_v) = \Phi_u \cdot \Phi_v .   (8)

In our particular case we employ the widely used RBF kernel

K(D_u, D_v) = \exp\!\left[ -\frac{\Delta(D_u, D_v)}{\sigma^2} \right] .   (9)
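Assuming document representations of the kind sketched above, the combined distance of Eqs. (3)-(5) and the RBF kernel of Eq. (9) can be prototyped in a few lines of Python; the mixing coefficient alpha and the kernel width sigma are left as free parameters, and the plain discrete histograms stand in for the smoothed pdf of Eq. (2).

import math

def content_distance(v_u, v_v):
    """Delta^(f) of Eq. (3) with p = 1, i.e. the Manhattan distance between v' vectors."""
    return sum(abs(a - b) for a, b in zip(v_u, v_v))

def behavioral_distance(hists_u, hists_v, vocabulary):
    """Delta^(b) of Eq. (4): summed squared differences of the positional histograms."""
    return sum((a - b) ** 2
               for t in vocabulary
               for a, b in zip(hists_u[t], hists_v[t]))

def combined_distance(v_u, v_v, hists_u, hists_v, vocabulary, alpha=0.5):
    """Delta of Eq. (5): convex combination of the content and behavioral terms."""
    return (alpha * content_distance(v_u, v_v)
            + (1.0 - alpha) * behavioral_distance(hists_u, hists_v, vocabulary))

def rbf_kernel(d_uv, sigma=1.0):
    """RBF kernel of Eq. (9), applied to a precomputed document distance."""
    return math.exp(-d_uv / sigma ** 2)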
It is worth stressing here an additional, crucial advantage of using a kernel-based formulation in the text-mining context: the approach (9) effectively supports the critical normalization process by confining all inner products to a limited range, thereby preventing extensive properties of documents (length, lexicon, etc.) from distorting the representation and ultimately affecting clustering performance. The kernel-based version of the k-means algorithm, according to the method proposed in [15], replicates the basic partitioning schema (6)-(7) in the Hilbert space, where the centroid positions, Ψ, are given by the averages of the mapping images, Φu:

\Psi_j = \frac{1}{N_j} \sum_{u=1}^{n_D} \Phi_u\,\delta_{uj} ; \quad j = 1, \dots, Z .   (10)

The ultimate result of the clustering process is the membership vector, m, which determines the prototype positions (10) even though they cannot be stated explicitly. As a consequence, for a document, Du, the distance in the Hilbert space from the mapped image, Φu, to the cluster prototype Ψj as per (10) can be worked out as:

d(\Phi_u, \Psi_j) = \left\| \Phi_u - \frac{1}{N_j} \sum_{v=1}^{n_D} \Phi_v\,\delta_{vj} \right\|^2 = 1 + \frac{1}{(N_j)^2} \sum_{m,v=1}^{n_D} \delta_{mj}\,\delta_{vj}\,K_{mv} - \frac{2}{N_j} \sum_{v=1}^{n_D} \delta_{vj}\,K_{u,v} .   (11)

By using expression (11), which includes only kernel computations, one can identify the closest prototype to the image of each input pattern, and assign sample memberships accordingly. In clustering domains, kernel k-means can notably help separate groups and discover clusters that would have been difficult to identify in the base space. From this viewpoint one might even conclude that a kernel-based method might represent a viable approach to tackle the dimensionality issue. 4 Experimental Results A standard benchmark for content-based document management, the Reuters database [10], provided the experimental domain for the proposed framework. The database includes 21,578 documents, which appeared on the Reuters newswire in 1987. One or more topics derived from economic subject categories have been associated by human indexing with each document; eventually, 135 different topics were used. In this work, the experimental session involved a corpus DR including 8267 documents out of the 21,578 originally provided by the database. The corpus DR was obtained by adopting the criterion used in [14]. First, all the documents with multiple topics were discarded. Then, only the documents associated with topics having at least 18 occurrences were included in DR. As a result, 32 topics were represented in the corpus. In the following experiments, the performance of the clustering framework has been evaluated by using the purity parameter. Let Nk denote the number of elements lying in a cluster Ck and let Nmk be the number of elements of the class Im in the cluster Ck. Then, the purity pur(k) of the cluster Ck is defined as follows:

pur(k) = \frac{1}{N_k} \max_{m} \left( N_{mk} \right) .   (12)

Accordingly, the overall purity of the clustering results is defined as follows:

purity = \sum_{k} \frac{N_k}{N} \, pur(k) ,   (13)

where N is the total number of elements. The purity parameter has been preferred to other measures of performance (e.g. the F-measure) since it is the most accepted measure for machine learning classification problems [11].
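For readers who want to experiment with the scheme, the following NumPy sketch implements one assignment pass of kernel k-means according to Eq. (11), together with the purity measures (12)-(13); it assumes a precomputed kernel matrix K obtained from Eq. (9), uses 0-based cluster labels, and leaves out the initialization and stopping criteria of the actual implementation.

import numpy as np

def kernel_kmeans_assign(K, membership, n_clusters):
    """One assignment pass of kernel k-means: for every document, evaluate the
    Hilbert-space distance of Eq. (11) to each cluster and pick the closest one.
    `membership` is an integer array of current cluster labels (0-based)."""
    n = K.shape[0]
    delta = np.zeros((n, n_clusters))
    delta[np.arange(n), membership] = 1.0          # membership indicators delta_uj
    Nj = delta.sum(axis=0)
    Nj[Nj == 0] = 1.0                              # guard against empty clusters
    quad = np.einsum('mj,vj,mv->j', delta, delta, K) / Nj ** 2
    cross = (K @ delta) / Nj
    dist = 1.0 + quad[None, :] - 2.0 * cross       # Eq. (11), assuming K_uu = 1
    return dist.argmin(axis=1)

def purity(labels_true, labels_pred):
    """Overall purity of a partition, Eqs. (12)-(13)."""
    total = len(labels_true)
    score = 0
    for k in set(labels_pred):
        members = [t for t, p in zip(labels_true, labels_pred) if p == k]
        score += max(members.count(c) for c in set(members))
    return score / total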
The clustering performance of the proposed methodology was evaluated by analyzing the results obtained in three different experiments: the documents in the corpus DR were partitioned by using a flat clustering paradigm and three different settings for the parameter α, which, as per (5), weights the relative contribution of ∆(f) and ∆(b) in the document distance measure. The values used in the experiments were α = 0.3, α = 0.7 and α = 0.5; thus, two of the experiments were characterized by a strong preponderance of one of the two components, while in the third experiment ∆(f) and ∆(b) contribute evenly to the eventual distance measure. Table 1 outlines the results obtained with the setting α = 0.3. The evaluations were conducted with different numbers of clusters Z, ranging from 20 to 100. For each experiment, four quality parameters are presented:
• the overall purity, purityOV, of the clustering result;
• the lowest purity value pur(k) over the Z clusters;
• the highest purity value pur(k) over the Z clusters;
• the number of elements (i.e. documents) associated with the smallest cluster.
Analogously, Tables 2 and 3 report the results obtained with α = 0.5 and α = 0.7, respectively.

Table 1. Clustering performances obtained on Reuters-21578 with α = 0.3
Number of clusters:  20        40        60        80        100
Overall purity:      0.712108  0.77138   0.81154   0.799685  0.82666
pur(k) minimum:      0.252049  0.236264  0.175     0.181818  0.153846
pur(k) maximum:      1         1         1         1         1
Smallest cluster:    109       59        13        2         1

Table 2. Clustering performances obtained on Reuters-21578 with α = 0.5
Number of clusters:  20        40        60        80        100
Overall purity:      0.696383  0.782267  0.809121  0.817467  0.817467
pur(k) minimum:      0.148148  0.222467  0.181818  0.158333  0.139241
pur(k) maximum:      1         1         1         1         1
Smallest cluster:    59        4         1         1         2

Table 3. Clustering performances obtained on Reuters-21578 with α = 0.7
Number of clusters:  20        40        60        80        100
Overall purity:      0.690577  0.742833  0.798718  0.809483  0.802589
pur(k) minimum:      0.145719  0.172638  0.18      0.189655  0.141732
pur(k) maximum:      1         1         1         1         1
Smallest cluster:    13        6         5         2         4

As expected, the numerical figures show that, in general, the overall purity grows as the number of clusters Z increases. Indeed, the value of the overall purity seems to indicate that clustering performance improves with the setting α = 0.3. Hence, the empirical outcomes confirm the effectiveness of the proposed document distance measure, which combines the conventional content-based similarity with the behavioral similarity criterion. References 1. Chen, H., Chung, W., Xu, J.J., Wang, G., Qin, Y., Chau, M.: Crime data mining: a general framework and some examples. IEEE Trans. Computer 37, 50–56 (2004) 2. Seifert, J.W.: Data Mining and Homeland Security: An Overview. CRS Report RL31798 (2007), http://www.epic.org/privacy/fusion/crs-dataminingrpt.pdf 3. Mena, J.: Investigative Data Mining for Security and Criminal Detection. Butterworth-Heinemann (2003) 4. Sullivan, D.: Document warehousing and text mining. John Wiley and Sons, Chichester (2001) 5. Fan, W., Wallace, L., Rich, S., Zhang, Z.: Tapping the power of text mining. Comm. of the ACM 49, 76–82 (2006) 6. Popp, R., Armour, T., Senator, T., Numrych, K.: Countering terrorism through information technology. Comm. of the ACM 47, 36–43 (2004) 7. Zanasi, A. (ed.): Text Mining and its Applications to Intelligence, CRM and KM, 2nd edn. WIT Press (2007) 8. Linde, Y., Buzo, A., Gray, R.M.: An algorithm for vector quantizer design. IEEE Trans. Commun.
COM-28, 84–95 (1980) 9. Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge (2004) 10. Reuters-21578 Text Categorization Collection. UCI KDD Archive 11. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008) 12. Baeza-Yates, R., Ribiero-Neto, B.: Modern Information Retrieval. ACM Press, New York (1999) 13. Salton, G., Wong, A., Yang, L.S.: A vector space model for information retrieval. Journal Amer. Soc. Inform. Sci. 18, 613–620 (1975) 14. Cai, D., He, X., Han, J.: Document Clustering Using Locality Preserving Indexing. IEEE Transaction on knowledge and data engineering 17, 1624–1637 (2005) 15. Girolami, M.: Mercer kernel based clustering in feature space. IEEE Trans. Neural Networks. 13, 2780–2784 (2002) Security Issues in Drinking Water Distribution Networks Demetrios G. Eliades and Marios M. Polycarpou* KIOS Research Center for Intelligent Systems and Networks Dept. of Electrical and Computer Engineering University of Cyprus, CY-1678 Nicosia, Cyprus {eldemet,mpolycar}@ucy.ac.cy Abstract. This paper formulates the security problem of sensor placement in water distribution networks for contaminant detection. An initial attempt to develop a problem formulation is presented, suitable for mathematical analysis and design. Multiple risk-related objectives are minimized in order to compute the Pareto front of a set of possible solutions; the considered objectives are the contamination impact average, worst-case and worst-cases average. A multiobjective optimization methodology suitable for considering more that one objective function is examined and solved using a multiple-objective evolutionary algorithm. Keywords: contamination, water distribution, sensor placement, multi-objective optimization, security of water systems. 1 Introduction A drinking water distribution network is the infrastructure which facilitates delivery of water to consumers. It is comprised of pipes which are connected to other pipes at junctions or connected to tanks and reservoirs. Junctions represent points in the network where pipes are connected, with inflows and outflows. Each junction is assumed to serve a number of consumers whose aggregated water demands are the junction’s demand outflow. Reservoirs (such as lakes, rivers etc.) are assumed to have infinite water capacity which they outflow to the distribution network. Tanks are dynamic elements with finite capacity that fill, store and return water back to the network. Valves are usually installed to some of the pipes in order to adjust flow, pressure, or to close part of the network if necessary. Water quality monitoring in distribution networks involves manual sampling or placing sensors at various locations to determine the chemical concentrations of various species such as disinfectants (e.g. chlorine) or for various contaminants that can be harmful to the consumers. Distribution networks are susceptible to intrusions due to their open and uncontrolled nature. Accidental faults or intentional actions could cause a contamination, that may affect significantly the health and economic activities of a city. Contaminants are substances, usually chemical, biological or radioactive, which travel along the water flow, and may exhibit decay or growth dynamics. 
The concentration dynamics of a substance in a water pipe can be modelled by the first-order hyperbolic * This work is partially supported by the Research Promotion Foundation (Cyprus) and the University of Cyprus. E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 69–76, 2009. springerlink.com © Springer-Verlag Berlin Heidelberg 2009 70 D.G. Eliades and M.M. Polycarpou equations of advection and reaction [1]. When a contaminant reaches a water consumer node, it can expose some of the population served at risk, or cause economic losses. The issue of modelling dangerous contaminant transport in water distribution networks was examined in [2], where the authors discretized the equations of contaminant transport and simulated a network under contamination. Currently, in water research an open-source hydraulic and quality numerical solver, called EPANET, is frequently used for computing the advection and reaction dynamics in discrete-time [3]. The security problem of contaminant detection in water distribution networks was first examined in [4]. The algorithmic “Battle of the Water Sensor Networks” competition in 2006 boosted research on the problem and established some benchmarks [5]. While previous research focused on specific cases of the water security problem, there has not been a unified problem formulation. In this work, we present an initial attempt to develop such a problem formulation, suitable for mathematical analysis and design. In previous research the main solution approach has been the formulation of an integer program which is solved using either evolutionary algorithms [6] or mathematical programming [7]. Various groups have worked in an operational research framework in formulating the mathematical program as in the ‘p-median’ problem [8]. Although these formulations seek to minimize one objective, it is often the case the solutions are not suitable with respect to some other objectives. In this work we propose a multi-objective optimization methodology suitable for considering more that one objective function. Some work has been conducted within a multi-objective framework, computing the Pareto fronts for conflicting objectives and finding the sets of non-dominant feasible solutions [9], [10]. However some of the objectives considered did not capture the contamination risk. The most frequently used risk objective metric is the average impact on the network. Recently, other relevant metrics have also been applied [11], [7], such as the ‘Conditional Value at Risk’ (CVaR) which corresponds to the average impact of the worst case scenarios. In this work we present a security-oriented formulation and solution of the problem when the average, the worst-case (maximum impact) and the average of worst-cases (CVaR) impact is considered. For computing the solution, we examine the use of a multi-objective evolutionary algorithm. In Section 2 the problem is formulated; in Section 3, the solution methodology is described and an algorithmic solution is presented. In Section 4 simulation results are demonstrated using a realistic water distribution network. Finally, the results are summarized and future work is discussed in Section 5. 2 Problem Formulation We first express the network into a generic graph with nodes and edges. We consider nodes in the graph as locations in the distribution network where water consumption can occur, such as reservoirs, pipe junctions and tanks. Pipes that transport water from one node to another are represented as edges in the graph. 
Let V be the set of n nodes in the network, such that V = {v1, …, vn}, and E be the set of m edges connecting pairs of nodes, where for e ∈ E, e = (vi, vj). The two sets V and E capture the topology of the water distribution network. The function g(t), g: ℝ+ → ℝ+, describes the rate of the contaminant's mass injection in time at a certain node. A typical example of this injection profile is a pulse signal of finite duration. A contamination event ψi(gvi(t)) is the contaminant injection at node vi ∈ V with rate gvi(t). A contamination scenario s = {ψ1, …, ψn} is defined as the set of contamination events ψi at each node vi describing a possible "attack" on the network. Typically, the contamination event ψi at most nodes will be zero, since the injection will occur at a few specific nodes. The set of nodes where intrusion occurs for a scenario s is V* = {vi | ψi ≠ 0, ψi ∈ s}, so that V* ⊆ V. Let S be the set of all possible contamination scenarios w.r.t. the specific water distribution system. We define the function ω̃(s,t), ω̃: S × ℝ+ → ℝ, as the impact of a contamination scenario s until time t, for s ∈ S. This impact is computed through

\tilde{\omega}(s, t) = \sum_{v_i \in V} \varphi(v_i, s, t) ,   (1)

where φ: V × S × ℝ+ → ℝ is a function that computes the impact of a specific scenario s at node vi until time t. The way to compute φ(⋅) is determined by the problem specifications; for instance it can be related to the number of people infected at each node due to contamination, or to the consumed volume of contaminated water. For an edge (vi, vj) ∈ E, the function τ(vi, vj, t), τ: V × V × ℝ+ → ℝ, expresses the transport time between nodes vi and vj when a particle departs node vi at time t. This is computed by solving the network water hydraulics with a numerical solver for a certain time window and for given water demands, tank levels and hydraulic control actions. This corresponds to a time-varying weight for each edge. We further define the function τ*: S × V → ℝ so that, for a scenario s, τ*(s, vi) is the minimum transport time for the contaminant to reach node vi ∈ V. To compute this we consider

\tau^*(s, v_i) = \min_{v_j \in V^*} F(v_i, v_j, s) ,

where, for each intrusion node vj ∈ V* of scenario s, F(·) is computed by a shortest-path algorithm and gives the time at which the contaminant first reaches node vi. Finally, we define the function ω: V × S → ℝ in order to express the impact of a contamination scenario s until it reaches node vi, such that ω(vi, s) = ω̃(s, τ*(s, vi)). This function will be used in the optimization formulation in the next section. 3 Solution Methodology Since the set of all possible scenarios S contains infinitely many elements, an increased computational complexity is imposed on the problem; moreover, contaminations at certain nodes are unrealistic or have trivial impacts. We can relax the problem by considering S0 as a representative finite subset of S, such that S0 ⊂ S. In the simulations that follow, we assume that a scenario s ∈ S0 has a non-zero element for ψi and zero elements for all ψj for which i ≠ j. We further assume that the non-zero contamination event is ψi = g0(t, θ), where g0(·) is a known signal structure and θ is a parameter vector in the bounded parameter space Θ, θ ∈ Θ. Since Θ has infinitely many elements, we perform grid sampling and the selected parameter samples constitute a finite set Θ0 ⊂ Θ. We assume that the parameter vector θ of a contamination event ψi also belongs to Θ0, such that θ ∈ Θ0.
Therefore, a scenario s ∈ S0 is comprised of one contamination event with parameter θ ∈ Θ0; the finite scenario set S0 is comprised of |V|·|Θ0| elements. 3.1 Optimization Problem In relation to the sensor placement problem, when there is more than one sensor in the network, the impact of a fault scenario s ∈ S0 is the minimum impact among all the impacts computed for each node/sensor; essentially it corresponds to the sensor that detects the fault first. We define three objective functions fi: X → ℝ, i ∈ {1,2,3}, that map a set of nodes X ⊂ V to a real number. Specifically, f1(X) is the average impact over S0, such that

f_1(X) = \frac{1}{|S_0|} \sum_{s \in S_0} \min_{x \in X} \omega(x, s) .   (2)

Function f2(X) is the maximum impact over the set of all scenarios, such that

f_2(X) = \max_{s \in S_0} \min_{x \in X} \omega(x, s) .   (3)

Finally, function f3(X) corresponds to the CVaR risk metric and is the average impact of the scenarios in the set S0* ⊂ S0 with impact larger than αf2(X), where α ∈ [0,1]:

f_3(X) = \frac{1}{|S_0^*|} \sum_{s \in S_0^*} \min_{x \in X} \omega(x, s) , \qquad s \in S_0^* \Leftrightarrow \min_{x \in X} \omega(x, s) \ge \alpha f_2(X) .   (4)

The multi-objective optimization problem is formulated as

\min_{X} \{ f_1(X), f_2(X), f_3(X) \} ,   (5)

subject to X ⊂ V' and |X| = N, where V' ⊆ V is the set of feasible nodes and N the number of sensors to be placed. Minimizing one objective function may result in maximizing others; it is thus not possible to find one optimal solution that satisfies all objectives at the same time. It is possible, however, to find a set of solutions, lying on a Pareto front, where each solution is no worse than the others. 3.2 Algorithmic Solution In general a feasible solution X is called Pareto optimal if, for a set of objectives Γ and i, j ∈ Γ, there exists no other feasible solution X' such that fi(X') ≤ fi(X) for all i, with fj(X') < fj(X) for at least one j. Stated differently, a solution is Pareto optimal if there is no other feasible solution that would reduce some objective function without simultaneously causing an increase in at least one other objective function [15, p. 779]. The solution space is extremely large, even for networks with a few nodes. The set of computed solutions may or may not represent the actual Pareto front. Heuristic searching [9] or computational intelligence techniques such as multi-objective evolutionary algorithms [16], [17], [6] have been applied to this problem. In this work we consider an algorithm suitable for the problem, the NSGA-II [18]. This algorithm is examined in the multi-objective sensor placement formulation for risk minimization. In summary, the algorithm randomly creates a set of possible solutions P; these solutions are examined for non-dominance and are separated into ranks. Specifically, the subset of solutions that are non-dominant in the set is P1 ⊂ P. By removing P1, the subset of solutions that are non-dominant in the remaining set is P2 ⊂ {P − P1}, and so on. A 'crowding' metric is used to express the proximity of one solution to its neighbour solutions; this is used to achieve a 'better' spreading of solutions on the Pareto front. A subset of P is selected for computing a new set of solutions, in a rank and crowding-metric competition. The set of new solutions P' is computed through genetic algorithm operators such as crossover and mutation. The sets P and P' are combined and their elements are ranked with the non-dominance criterion. The best solutions in the mixed set are selected to continue to the next iteration.
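Given a matrix of precomputed impacts ω(x, s), with rows indexed by candidate sensor nodes and columns by scenarios, the three risk objectives (2)-(4) can be evaluated with a few lines of NumPy, as in the illustrative sketch below; the impact values themselves would come from hydraulic and quality simulations (e.g. EPANET), which are not reproduced here, and the array layout is an assumption of this illustration.

import numpy as np

def risk_objectives(omega, sensor_idx, alpha=0.8):
    """Evaluate f1 (average), f2 (worst case) and f3 (tail average / CVaR-like)
    of Eqs. (2)-(4) for a candidate sensor set.

    omega      : (n_nodes, n_scenarios) array; omega[x, s] is the impact of
                 scenario s when it is first detected by a sensor at node x
    sensor_idx : indices of the nodes where sensors are placed (the set X)
    """
    detected = omega[sensor_idx, :].min(axis=0)   # min over sensors, per scenario
    f1 = detected.mean()                          # Eq. (2)
    f2 = detected.max()                           # Eq. (3)
    tail = detected[detected >= alpha * f2]       # scenarios belonging to S0*
    f3 = tail.mean()                              # Eq. (4)
    return f1, f2, f3

# example (hypothetical indices): f1, f2, f3 = risk_objectives(omega, [3, 17, 42, 58, 90, 101])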
The NSGA-II algorithm was modified in order to accept discrete inputs, specifically the index numbers of nodes. 4 Simulation Results To illustrate the solution methodology we examine the sensor placement problem in a realistic distribution system. The network model comprises the topological information as well as the outflows for a certain period of time. A graphical representation of the network is shown in Fig. 1. Fig. 1. Spatial schematic of the network. The Source represents a reservoir supplying water to the network and Tanks are temporary water storage facilities. Nodes are consumption points. The network has 126 nodes and 168 pipes; in detail it consists of two tanks, one infinite source reservoir, two pumps and eight valves. All nodes in the network are possible locations for placing sensors. The water demands across the nodes are not uniformly distributed; specifically, 20 of the nodes are responsible for more than 80% of all water demands. The network is in EPANET 2.0 format and was used in the 'Battle of the Water Sensor Networks' design challenge [5, 17]. The demands for a typical day are provided with the network. According to our solution methodology, we assumed that g0(·) is a pulse signal with three parameters: θ1 = 28.75 kg/hr, the rate of contaminant injection; θ2 = 2 hr, the duration of the injection; and 0 ≤ θ3 ≤ 24 hr, the injection start time (as in [5]). By performing 5-minute sampling on θ3, the finite set Θ0 is built with |Θ0| = 288 parameters. All nodes were considered as possible intrusion locations, and it is assumed that only one node can be attacked in each contamination scenario. Therefore, the finite scenario set S0 has |S0| = 126 ⋅ 288 = 36,288 elements. The impact φ(⋅) is computed using a nonlinear function described in [5] representing the number of people infected due to contaminant consumption. Hydraulic and quality dynamics were computed using the EPANET software. Fig. 2. (a): Histogram of the maximum normalized impact over all contamination scenarios. (b): Histogram of the maximum normalized impact over all contamination scenarios when 6 sensors are placed so as to minimize the average impact. For simplicity and without loss of generality, the impact metrics presented hereafter are normalized. Figure 2(a) depicts the histogram of the maximum normalized impact over all contamination scenarios. From Fig. 2(a) it appears that about 40% of all scenarios under consideration have impacts of more than 10% of the maximum. The long tail in the distribution shows that there is a subset of scenarios which have a large impact on the network. From simulations we identified two locations that, in certain scenarios, yield the largest impact, specifically near the reservoir and at one tank (labelled with numbers '1' and '2' in Fig. 1). The worst possible outcome among all intrusions is a contamination at node '1' at time 23:55, given that the contaminant propagates undetected in the network. The optimization problem is to place six sensors at six nodes in order to minimize the three objectives described in the problem formulation. For the third objective we use α = 0.8. The general assumption is that the impact of a contamination is measured until a sensor has been triggered, i.e. a contaminant has reached a sensor node; afterwards it is assumed that there are no delays in stopping the service.
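As a rough point of comparison for the multi-objective search, a naive greedy placement that only minimizes the average impact f1 can be coded in a few lines; this baseline is not the NSGA-II procedure used in the paper and ignores f2 and f3, but it gives a feel for how the precomputed impact matrix is exploited when selecting sensor locations.

import numpy as np

def greedy_average_placement(omega, n_sensors):
    """Greedy baseline: repeatedly add the candidate location that most reduces
    the average detected impact f1. This is NOT the NSGA-II search used in the
    paper; it serves only as a simple single-objective point of comparison."""
    chosen = []
    best = np.full(omega.shape[1], np.inf)   # lowest detection impact so far, per scenario
    for _ in range(n_sensors):
        gains = [(np.minimum(best, omega[x, :]).mean(), x)
                 for x in range(omega.shape[0]) if x not in chosen]
        _, x_star = min(gains)
        chosen.append(x_star)
        best = np.minimum(best, omega[x_star, :])
    return chosen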
For illustrating the impact reduction by placing sensors, we choose one solution from the solution set computed with the smallest average impact; the proposed six sensor locations are indicated as nodes with circles in Fig. 1. Figure 2(b) shows the histogram of the maximum impact on the network when these six sensors are installed. We observe a reduction of the impact to the system, with worst case impact near 20% w.r.t. the unprotected worst case. We performed two experiments with the NSGA-II algorithm to solve the optimization problem. In the one experiment, the parameters were 200 solutions in population Security Issues in Drinking Water Distribution Networks 75 for 1000 generations; for the second experiment it was 400 and 2000 respectively. The Pareto fronts are depicted in Fig. 3, along with the results computed using a deterministic tree algorithm presented in [10]. The simulations have shown that for our example, the maximum impact objective was almost the same in all computed Pareto solutions and is not presented in the figures. Fig. 3. Pareto fronts computed by the NSGA-II algorithm, as well as from a deterministic algorithm for comparison. The normalized tail average and total average impact are compared. 5 Conclusions In this work we have presented the security problem of sensor placement problem in water distribution networks for contaminant detection. We have presented an initial attempt to formulate the problem in order to be suitable for mathematical analysis. Furthermore, we examined a multiple-objective optimization problem using certain risk metrics and demonstrated a solution on a realistic network using a suitable multiobjective evolutionary algorithm, the NSGA-II. Good results were obtained considering the stochastic nature of the algorithm and the extremely large solution search space. Security of water systems is an open problem where computational intelligence could provide suitable solutions. Besides sensor placement, other interesting aspects of the security problem are the fault detection based on telemetry signals, the identification and the accommodation of the problem. References 1. LeVeque, R.: Nonlinear Conservation Laws and Finite Volume Methods. In: LeVeque, R.J., Mihalas, D., Dor, E.A., Müller, E. (eds.) Computational Methods for Astrophysical Fluid Flow, pp. 1–159. Springer, Berlin (1998) 2. Kurotani, K., Kubota, M., Akiyama, H., Morimoto, M.: Simulator for contamination diffusion in a water distribution network. In: Proc. IEEE International Conference on Industrial Electronics, Control, and Instrumentation, vol. 2, pp. 792–797 (1995) 3. Rossman, L.A.: The EPANET Programmer’s Toolkit for Analysis of Water Distribution Systems. In: ASCE 29th Annual Water Resources Planning and Management Conference, pp. 39–48 (1999) 76 D.G. Eliades and M.M. Polycarpou 4. Kessler, A., Ostfeld, A., Sinai, G.: Detecting accidental contaminations in municipal water networks. ASCE Journal of Water Resources Planning and Management 124(4), 192–198 (1998) 5. Ostfeld, A., Uber, J.G., Salomons, E.: Battle of the Water Sensor Networks (BWSN): A Design Challenge for Engineers and Algorithms. In: ASCE 8th Annual Water Distibution System Analysis Symposium (2006) 6. Huang, J.J., McBean, E.A., James, W.: Multi-Objective Optimization for Monitoring Sensor Placement in Water Distribution Systems. In: ASCE 8th Annual Water Distibution System Analysis Symposium (2006) 7. 
Hart, W., Berry, J., Riesen, L., Murray, R., Phillips, C., Watson, J.: SPOT: A sensor placement optimization toolkit for drinking water contaminant warning system design. In: Proc. World Water and Environmental Resources Conference (2007) 8. Berry, J.W., Fleischer, L., Hart, W.E., Phillips, C.A., Watson, J.P.: Sensor Placement in Municipal Water Networks. ASCE Journal of Water Resources Planning and Management 131(3), 237–243 (2005) 9. Eliades, D., Polycarpou, M.: Iterative Deepening of Pareto Solutions in Water Sensor Networks. In: Buchberger, S.G. (ed.) ASCE 8th Annual Water Distibution System Analysis Symposium. ASCE (2006) 10. Eliades, D.G., Polycarpou, M.M.: Multi-Objective Optimization of Water Quality Sensor Placement in Drinking Water Distribution Networks. In: European Control Conference, pp. 1626–1633 (2007) 11. Watson, J.P., Hart, W.E., Murray, R.: Formulation and Optimization of Robust Sensor Placement Problems for Contaminant Warning Systems. In: Buchberger, S.G. (ed.) ASCE 8th Annual Water Distibution System Analysis Symposium (2006) 12. Rockafellar, R., Uryasev, S.: Optimization of Conditional Value-at-Risk. Journal of Risk 2(3), 21–41 (2000) 13. Rockafellar, R., Uryasev, S.: Conditional Value-at-Risk for General Loss Distributions. Journal of Banking and Finance 26(7), 1443–1471 (2002) 14. Topaloglou, N., Vladimirou, H., Zenios, S.: CVaR models with selective hedging for international asset allocation. Journal of Banking and Finance 26(7), 1535–1561 (2002) 15. Rao, S.: Engineering Optimization: Theory and Practice. Wiley-Interscience, Chichester (1996) 16. Preis, A., Ostfeld, A.: Multiobjective Sensor Design for Water Distribution Systems Security. In: ASCE 8th Annual Water Distibution System Analysis Symposium (2006) 17. Ostfeld, A., Uber, J.G., Salomons, E., Berry, J.W., Hart, W.E., Phillips, C.A., Watson, J.P., Dorini, G., Jonkergouw, P., Kapelan, Z., di Pierro, F., Khu, S.T., Savic, D., Eliades, D., Polycarpou, M., Ghimire, S.R., Barkdoll, B.D., Gueli, R., Huang, J.J., McBean, E.A., James, W., Krause, A., Leskovec, J., Isovitsch, S., Xu, J., Guestrin, C., VanBriesen, J., Andc, M.S., Andd, P.F., Preis, A., Propato, M., Piller, O., Trachtman, G.B., Wu, Z.Y., Walski, T.: The Battle of the Water Sensor Networks (BWSN): A Design Challenge for Engineers and Algorithms. ASCE Journal of Water Resources Planning and Management (to appear, 2008) 18. Deb, K., Pratap, A., Agarwal, S., Meyarivan, T.: A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation 6(2), 182–197 (2002) Trusted-Computing Technologies for the Protection of Critical Information Systems Antonio Lioy, Gianluca Ramunno, and Davide Vernizzi* Politecnico di Torino Dip. di Automatica e Informatica c. Duca degli Abruzzi, 24 – 10129 Torino, Italy {antonio.lioy,gianluca.ramunno,davide.vernizzi}@polito.it Abstract. Information systems controlling critical infrastructures are vital elements of our modern society. Purely software-based protection techniques have demonstrated limits in fending off attacks and providing assurance of correct configuration. Trusted computing techniques promise to improve over this situation by using hardware-based security solutions. This paper introduces the foundations of trusted computing and discusses how it can be usefully applied to the protection of critical information systems. Keywords: Critical infrastructure protection, trusted computing, security, assurance. 
1 Introduction Trusted-computing (TC) technologies have historically been proposed by the TCG (Trusted Computing Group) to protect personal computers from those software attacks that cannot be countered by purely software solutions. However these techniques are now mature enough to spread out to both bigger and smaller systems. Trusted desktop environments are already available and easy to setup. Trusted computing servers and embedded systems are just around the corner, while proof-ofconcept trusted environments for mobile devices have been demonstrated and are just waiting for the production of the appropriate hardware anchor (MTM, Mobile Trust Module). TC technologies are not easily understood and to many people they immediately evoke the “Big Brother” phantom, mainly due to their initial association with controversial projects from operating system vendors (to lock the owner into using only certified and licensed software components) and from providers of multimedia content (to avoid copyright breaches). However, TC is nowadays increasingly being associated with secure open environments, also thanks to pioneer work performed by various projects around the world, such as the Open-TC [1] one funded by the European Commission. On another hand, we have an increasing number of vital control systems (such as electric power distribution, railway traffic, and water supply) that heavily and almost exclusively rely on computer-based infrastructures for their correct operation. In the * This work has been partially funded by the EC as part of the OpenTC project (ref. 027635). It is the work of the authors alone and may not reflect the opinion of the whole project. E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 77–83, 2009. springerlink.com © Springer-Verlag Berlin Heidelberg 2009 78 A. Lioy, G. Ramunno, and D. Vernizzi following we will refer to these infrastructures with the term “Critical Information Systems” (CIS) because their proper behaviour in handling information is critical for the operation of some very important system. This paper briefly describes the foundations of TC and shows how they can help in creating more secure and trustworthy CIS. 2 Critical Information Systems (CIS) CIS are typically characterized by being very highly distributed systems, because the underlying controlled system (e.g. power distribution, railway traffic) is highly distributed itself on a geographic scale. In turn this bears an important consequence: it is nearly impossible to control physical access to all its components and to the communication channels that must therefore be very trustworthy. In other words, we must consider the likelihood that someone is manipulating the hardware, software, or communication links of the various distributed components. This likelihood is larger than in normal networked systems (e.g. corporate networks) because nodes are typically located outside the company’s premises and hosted in shelters easily violated. For example, think of the small boxes containing control equipment alongside railway tracks or attached to electrical power lines. An additional problem is posed by the fact that quite often CIS are designed, developed, and deployed by a company (the system developer), owned by a different one (the service provider), and finally maintained by yet another company (the maintainer) on a contract with one of the first two. When a problem occurs and it leads to an accident, it is very important to be able to track the source of the problem: is it due to a design mistake? 
or to a development bug? or to an incorrect maintenance procedure? The answer has influence over economical matters (costs for fixing the problem, penalties to be paid) and may be also over legal ones, in the case of damage to a third-party. Even if no damage is produced, it is nonetheless important to be able to quickly identify problems with components of a CIS, be them produced incidentally or by a deliberate act. For example, in real systems many problems are caused by mistakes made by maintainers when upgrading or replacing hardware or software components. Any technical solution that can help in thwarting attacks and detecting mistakes and breaches is of interest, but nowadays the solutions commonly adopted – such as firewall, VPN, and IDS – heavily rely on correct software configuration of all the nodes. Unfortunately this cannot be guaranteed in a highly distributed and physically insecure system as a CIS. Therefore better techniques should be adopted not only to protect the system against attacks and errors but also to provide assurance that each node is configured and operating as expected. This is exactly one of the possible applications of the TC paradigm, which is introduced in the next section. 3 Trusted Computing Principles In order to protect computer systems and networks from attacks we rely on software tools in the form of security applications (e.g. digital signature libraries), kernel Trusted-Computing Technologies 79 modules (e.g. IPsec) or firmware, as in the case of firewall appliances. However software can be manipulated either locally by privileged and un-privileged users, or remotely via network connections that exploit known vulnerabilities or insecure configurations (e.g. accepting unknown/unsigned Active-X components in your browser). It is therefore clear that is nearly impossible to protect a computer system from software attacks while relying purely on software defences. To progress beyond this state, the TCG [2], a not-for-profit group of ICT industry players, developed a set of specification to create a computer system with enhanced security named “trusted platform”. A trusted platform is based on two key components: protected capabilities and shielded memory locations. A protected capability is a basic operation (performed with an appropriate mixture of hardware and firmware) that is vital to trust the whole TCG subsystem. In turn capabilities rely on shielded memory locations, special regions where is safe to store and operate on sensitive data. From the functional perspective, a trusted platform provides three important features rarely found in other systems: secure storage, integrity measurement and reporting. The integrity of the platform is defined as a set of metrics that identify the software components (e.g. operating system, applications and their configurations) through the use of fingerprints that act as unique identifiers for each component. Considered as a whole, the integrity measures represent the configuration of the platform. A trusted platform must be able to measure its own integrity, locally store the related measurements and report these values to remote entities. 
In order to trust these operations, the TCG defines three so-called “root of trust”, components that must be trusted because their misbehaviour might not be detected: • the Root of Trust for Measurements (RTM) that implements an engine capable of performing the integrity measurements; • the Root of Trust for Storage (RTS) that securely holds integrity measures and protect data and cryptographic keys used by the trusted platform and held in external storages; • the root of trust for reporting (RTR) capable of reliably reporting to external entities the measures held by the RTS. The RTM can be implemented by the first software module executed when a computer system is switched on (i.e. a small portion of the BIOS) or directly by the hardware itself when using processors of the latest generation. The central component of a TCG trusted platform is the Trusted Platform Module (TPM). This is a low cost chip capable to perform cryptographic operations, securely maintain the integrity measures and report them. Given its functionalities, it is used to implement RTS and RTR, but it can also be used by the operating system and applications for cryptographic operations although its performance is quite low. The TPM is equipped with two special RSA keys, the Endorsement Key (EK) and the Storage Root Key (SRK). The EK is part of the RTR and it is a unique (i.e. each TPM has a different EK) and “non-migratable” key created by the manufacturer of the TPM and that never leaves this component. Furthermore the specification requires that a certificate must be provided to guarantee that the key belongs to a genuine TPM. The SRK is part of the RTS and it is a “non-migratable” key that protects the 80 A. Lioy, G. Ramunno, and D. Vernizzi other keys used for cryptographic functions1 and stored outside the TPM. Also SRK never leaves the TPM and it is used to build a key hierarchy. The integrity measures are held into the Platform Configuration Registers (PCR). These are special registers within the TPM acting as accumulators: when the value of a register is updated, the new value depends both on the new measure and on the old value to guarantee that once initialized it is not possible to fake the value of a PCR. The action of reporting the integrity of the platform is called Remote Attestation. A remote attestation is requested by a remote entity that wants evidence about the configuration of the platform. The TPM then makes a digital signature over the values of a subset of PCR to prove to the remote entity the integrity and authenticity of the platform configuration. For privacy reasons, the EK cannot be used to make the digital signature. Instead, to perform the remote attestation the TPM uses an Attestation Identity Key (AIK), which is an “alias” for the EK. The AIK is a RSA key created by the TPM whose private part is never released outside the chip; this guarantees that the AIK cannot be used by anyone except the TPM itself. In order to use the AIK for authenticating the attestation data (i.e. the integrity measures) it is necessary to obtain a certificate proving that the key was actually generated by a genuine TPM and it is managed in a correct way. Such certificates are issued by a special certification authority called Privacy CA (PCA). Before creating the certificate, the PCA must verify the genuineness of the TPM. This verification is done through the EK certificate. 
Many AIKs can be created and, to prevent the traceability of the platform operations, ideally a different AIK should be used for interacting with each different remote attester. Using trusted computing it is possible to protect data via asymmetric encryption in a way that only the platform’s TPM can access them: this operation is called binding. It is however possible to migrate keys and data to another platform, with a controlled procedure, if they were created as “migratable”. The TPM also offers a stronger capability to protect data: sealing. When the user seals some data, he must specify an “unsealing configuration”. The TPM assures that sealed data can be only be accessed if the platform is in the “unsealing configuration” that was specified at the sealing time. The TPM is a passive chip disabled at factory and only the owner of a computer equipped with a TPM may choose to activate this chip. Even when activated, the TPM cannot be remotely controlled by third entities: every operation must be explicitly requested by software running locally and the possible disclosure of local data or the authorisation to perform the operations depend on the software implementation. In the TCG architecture, the owner of the platform plays a central role because the TPM requires authorisation from the owner for all the most critical operations. Furthermore, the owner can decide at any time to deactivate the TPM, hence disabling the trusted computing features. The identity of the owner largely depends on the scenario where trusted computing is applied: in a corporate environment, the owner is usually the administrator of the IT department, while in a personal scenario normally the end-user is also the owner of the platform. 1 In order to minimize attacks, SRK is never used for any cryptographic function, but only to protect other keys. Trusted-Computing Technologies 81 Run-time isolation between software modules with different security requirement can be an interesting complementary requirement for a trusted platform. Given that memory areas of different modules are isolated and inter-module communication can occur only under well specified control flow policies, then if a specific module of the system is compromised (e.g. due to a bug or a virus), the other modules that are effectively isolated from that one are not affected at all. Today virtualization is an emerging technology for PC class platforms to achieve run-time isolation and hence is a perfect partner for a TPM-based trusted platform. The current TCG specifications essentially focus on protecting a platform against software attacks. The AMD-V [3] and the Intel TXT [4] initiatives, besides providing hardware assistance for virtualization, increase the robustness against software attacks and the latter also starts dealing with some basic hardware attacks. In order to protect the platforms also from physical attacks, memory curtaining and secure input/output should be provided: memory curtaining extends memory protection in a way that sensitive areas are fully isolated while secure input/output protects communication paths (such as the buses and input/output channels) among the various components of a computer system. Intel TXT focuses only on some so called “open box” attacks, by protecting the slow buses and by guaranteeing the integrity verification of the main hardware components on the platform. 
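Returning to the sealing capability described above, its semantics can be sketched conceptually as follows (this is our illustration, not the actual TPM command interface; the data structures and the decrypt helper are hypothetical, and in a real TPM the comparison and the decryption happen inside the chip):

```python
from dataclasses import dataclass
from typing import Dict

def decrypt(ciphertext: bytes) -> bytes:
    # Placeholder for the TPM-internal decryption under an SRK-protected key.
    return ciphertext

@dataclass
class SealedBlob:
    """Data bound to an 'unsealing configuration' (expected PCR values)."""
    unsealing_configuration: Dict[int, bytes]  # PCR index -> expected value
    ciphertext: bytes

def unseal(blob: SealedBlob, current_pcrs: Dict[int, bytes]) -> bytes:
    """Release the plaintext only if the platform is in the sealed-to state.

    A real TPM performs this check internally and never exposes the
    protecting key; decrypt() stands in for that step in this sketch.
    """
    for index, expected in blob.unsealing_configuration.items():
        if current_pcrs.get(index) != expected:
            raise PermissionError("platform configuration does not match sealing state")
    return decrypt(blob.ciphertext)
```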
4 The OpenTC Project The OpenTC project has applied TC techniques to the creation of an open and secure computing environment by coupling them with advanced virtualization techniques. In this way it is possible to create on the same computer different execution environments mutually protected and with different security properties. OpenTC uses virtualisation layers – also called Virtual Machine Monitors (VMM) or hypervisors – and supports two different implementations: Xen and L4/Fiasco. This layer hosts compartments, also called virtual machines (VM), domains or tasks, depending on the VMM being used. Some domains host trust services that are available to authorised user compartments. Various system components make use of TPM capabilities, e.g. in order to measure other components they depend on or to prove the system integrity to remote challengers. Each VM can host an open or proprietary operating environment (e.g. Linux or Windows) or just a minimal library-based execution support for a single application. The viability of the OpenTC approach has been demonstrated by creating two proof-of-concept prototypes, the so-called PET and CC@H ones, that are publicly available at the project’s web site The PET (for Private Electronic Transactions) scenario [5] aims to improve the trustworthiness of interactions with remote servers. Transactions are simply performed by accessing a web server through a standard web browser running in a dedicated trusted compartment. The server is assumed to host web pages related to a critical financial service, such as Internet banking or another e-commerce service. The communication setup between the browser compartment and the web server is extended by a protocol for mutual remote attestation tunnelled through an SSL/TLS channel. During the attestation phase, each side assesses the trustworthiness of the other. If this assessment is negative on either side, the SSL/TLS tunnel is closed, 82 A. Lioy, G. Ramunno, and D. Vernizzi preventing further end-to-end communication. If the assessment is positive, end-toend communication between browser and server is enabled via standard HTTPS tunnelled over SSL/TLS. In this way the user is reassured about having connected to the genuine server of its business provider and about its integrity, and the provider knows that the user has connected by using a specific browser and that the hosting environment (i.e. operating system and drivers) has not been tampered with, for example by inserting a key-logger. The CC@H (for Corporate Computing at Home) scenario [6] demonstrates the usage of the OpenTC solution to run simultaneously on the same computer different non-interfering applications. It reflects the situation where employers tolerate, within reasonable limits, the utilization of corporate equipment (in particular notebooks) for private purposes but want assurance that the compartment dedicated to corporate applications is not manipulated by the user. In turn the user has a dedicated compartment for his personal matters, included free Internet surfing which, on the contrary, is not allowed from the corporate compartment. 
The CC@H scenario is based on the following main functional components: • boot-loaders capable of producing cryptographic digests for lists of partitions and arbitrary files that are logged into PCRs of the TPM prior to passing on control of the execution flow to the virtual machine monitor (VMM) or kernel it has loaded into memory; • virtualization layers with virtual machine loaders that calculate and log cryptographic digests for virtual machines prior to launching them; • a graphical user interface enabling the user to launch, stop and switch between different compartments with a simple mouse click; • a virtual network device for forwarding network packets from and to virtual machine domains; • support for binding the release of keys for encrypted files and partitions to defined platform integrity metrics. 5 How TC Can Help in Building Better CIS TC can benefit the design and operations of a CIS in several ways. First of all, attestation can layer the foundation for better assurance in the systems operation. This is especially important when several players concur to the design, management, operation, and maintenance of a CIS and thus unintentional modification of software modules is possible, as well as deliberate attacks due to the distributed nature of these systems. TC opens also the door to reliable logging, where log files contain not only a list of events but also a trusted trace of the component that generated the event and the system state when the event was generated. Moreover the log file could be manipulated only by trusted applications if it was sealed against them, so for example no direct editing (for insertion or cancellation) would be possible. The security of remote operations could be improved by exploiting remote attestation, so that network connections are permitted only if requested by nodes executing specific software modules. Intruders would therefore be unable to connect to the other Trusted-Computing Technologies 83 nodes: as they could not even open a communication channel towards the application, they could not in any way try to exploit the software vulnerabilities that unfortunately plague most applications. This feature could also be used in a different way: a central management facility could poll the various nodes by opening network connection just to check via remote attestation the software status of the nodes. In this way it would be very easy to detect misconfigured elements and promptly reconfigure or isolate them. In general sealing protects critical data from access by unauthorized software because access can be bound to a well-defined system configuration. This feature allows the implementation of fine-grained access control schemes that can prevent agents from accessing data they are not authorized for. This is very important when several applications with different trust level are running on the same node and may be these applications have been developed by providers with different duties. In this way we can easily implement the well-known concept of “separation of duties”: even if a module can bypass the access control mechanisms of the operating system and directly access the data source, it will be unable to operate on it because the TPM and/or security services running on the system will not release to the module the required cryptographic credentials. Finally designers and managers of CIS should not be scared by the overhead introduced by TC. 
In our experience, the L4/Fiasco set-up is very lightweight and therefore suitable also for nodes with limited computational resources. Moreover, inside a partition we can execute a full operating system with all its capabilities (e.g. Suse Linux), a stripped-down OS (such as DSL, Damn Small Linux) with only the required drivers and capabilities, or just a mini-execution environment providing only the required libraries for small embedded single task applications. In case multiple compartments are not needed, the TC paradigm can be directly built on top of the hardware, with no virtualization layer, hence further reducing its footprint. 6 Conclusions While several technical problems are still to be solved before large-scale adoption of TC is a reality, we nonetheless think that it is ready to become a major technology of the current IT scenario, especially for critical information systems where its advantages in terms of protection and assurance far outweigh its increased design and management complexity. References 1. 2. 3. 4. Open Trusted Computing (OpenTC) project, IST-027635, http://www.opentc.net Trusted Computing Group, http://www.trustedcomputing.org AMD Virtualization, http://www.amd.com/virtualization Intel Trusted Execution Technology, http://www.intel.com/technology/security/ 5. OpenTC newsletter no.3, http://www.opentc.net/publications/OpenTC_Newsletter_03.html 6. OpenTC newsletter no.5, http://www.opentc.net/publications/OpenTC_Newsletter_05.html A First Simulation of Attacks in the Automotive Network Communications Protocol FlexRay Dennis K. Nilsson1 , Ulf E. Larson1 , Francesco Picasso2, and Erland Jonsson1 1 2 Department of Computer Science and Engineering Chalmers University of Technology SE-412 96 Gothenburg, Sweden Department of Computer Science and Engineering University of Genoa Genoa, Italy {dennis.nilsson,ulf.larson,erland.jonsson}@chalmers.se, francesco.picasso@unige.it Abstract. The automotive industry has over the last decade gradually replaced mechanical parts with electronics and software solutions. Modern vehicles contain a number of electronic control units (ECUs), which are connected in an in-vehicle network and provide various vehicle functionalities. The next generation automotive network communications protocol FlexRay has been developed to meet the future demands of automotive networking and can replace the existing CAN protocol. Moreover, the upcoming trend of ubiquitous vehicle communication in terms of vehicle-to-vehicle and vehicle-to-infrastructure communication introduces an entry point to the previously isolated in-vehicle network. Consequently, the in-vehicle network is exposed to a whole new range of threats known as cyber attacks. In this paper, we have analyzed the FlexRay protocol specification and evaluated the ability of the FlexRay protocol to withstand cyber attacks. We have simulated a set of plausible attacks targeting the ECUs on a FlexRay bus. From the results, we conclude that the FlexRay protocol lacks sufficient protection against the executed attacks, and we therefore argue that future versions of the specification should include security protection. Keywords: Automotive, vehicle, FlexRay, security, attacks, simulation. 1 Introduction Imagine a vehicle accelerating and exceeding a certain velocity. At this point, the airbag suddenly triggers rendering the driver unable to maneuver the vehicle. This event leads to the vehicle crashing, leaving the driver seriously wounded. 
One might ask if this event was caused by a software malfunction or a hardware fault. In the near future, one might even ask if the event was the result of a deliberate cyber attack on the vehicle. In the last decade electronics and firmware have replaced mechanical components in vehicles at an unprecedented rate. Modern vehicles contain an in-vehicle network consisting of a number of electronic control units (ECUs) responsible E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 84–91, 2009. c Springer-Verlag Berlin Heidelberg 2009 springerlink.com A First Simulation of Attacks 85 for the functionality in the vehicle. As a result, the ECUs constitute a likely target for cyber attackers. An emerging trend for automotive manufacturers is to create an infrastructure for performing remote diagnostics and firmware updates over the air (FOTA) [1]. There are several benefits with this approach. It involves a minimum of customer inconvenience since there exists no need for the customer to bring the vehicle to a service station for a firmware update. In addition, it allows faster updates; it is possible to update the firmware as soon as it is released. Furthermore, this approach reduces the lead time from fault to action since it is possible to analyze errors and identify causes using diagnostics before a vehicle arrives at a service station. However, the future infrastructure allows external communication to interact with the in-vehicle network, which introduces a number of security risks. The previously isolated in-vehicle network is thus exposed to a whole new type of attacks, collectively known as cyber attacks. Since the ECUs connected to the FlexRay bus are used for providing control and maneuverability in the vehicle, this bus is a likely target for attackers. We have analyzed what an attacker can do once access to the bus is achieved. The main contributions of this paper are as follows. – We have analyzed the FlexRay protocol specification with respect to desired security properties and found that functionalities to achieve these properties are missing. – We have identified the actions an attacker can take in a FlexRay network as a result of the lack of security protection. – We have successfully implemented and simulated the previously identified attack actions in the CANoe simulator environment. – We discuss the potential safety effects of these attacks and emphasize the need for future security protection in in-vehicle networks. 2 Related Work Much research on the in-vehicle network has been on safety issues. Little research has been done on the security aspects in such networks. Only a few papers focusing on the security of those networks have been published. However, the majority of these papers have focused on the CAN protocol. These papers are described in more detail as follows. Wolf et al. [2] present several weaknesses in the CAN and FlexRay bus protocols. Weaknesses include confidentiality and authenticity problems. However, the paper does not give any specific attack examples. Simulated attacks have been performed on the CAN bus [3, 4, 5] and illustrate the severity of such attacks. Moreover, the notion of vehicle virus is introduced in [5] which describes more advanced attacks on the CAN bus. As a result, the safety of the driver can be affected. The Electronic Architecture and System Engineering for Integrated Safety Systems (EASIS) project has done work in embedded security for safety applications [6]. The work dicusses the need for safe and reliable communication for 86 D.K. 
Nilsson et al. external and internal vehicle communication. A security manager responsible for crypto operations and authentication management is described. In this paper, we focus on identifying possible attacker actions on the FlexRay bus and discussing the safety effects of such security threats. 3 Background Traditionally, in-vehicle networks are designed to meet reliability requirements. As such, they primarily address failures caused by non-malicious and inadvertent flaws, which are produced by chance or by component malfunction. Protection is realized by fault-tolerance mechanisms, such as redundancy, replication, and diversity. Since the in-vehicle network has been isolated, protection against intelligent attackers (i.e., security) has not been previously considered. However, recent wireless technology allow for external interaction with the vehicle through remote diagnostics and FOTA. To benefit from the new technology, the in-vehicle network needs to allow wireless communication with external parties, including service stations, business systems and fleet management. Since the network must be open for external access, new threats need to be accounted for and new requirements need to be stated and implemented. Consider for example an attacker using a compromised host in the business network to obtain unauthorized access to the in-vehicle network. Once inside the in-vehicle network, the attacker sends a malicious diagnostic request to trigger the airbag, which in turn could cause injury to the driver and the vehicle to crash. As illustrated by the example scenario, the safety of the vehicle is strongly linked to the security, and a security breach may well affect the safety of the driver. It is thus reasonable to believe that not only reliability requirements but also security requirements need to be fulfilled to protect the driver. Since security has yet not been required in the in-vehicle networks, it can be assumed that a set of successful attacks targeting the in-vehicle network should be possible to produce. To assess our assumption, we analyze the FlexRay protocol specification version 2.1 revision A [7]. We then identify a set of security properties for the network, evaluate the correspondence between the properties and the protocol specifications, and develop an attacker model. 3.1 In-Vehicle Network The in-vehicle network consists of a number of ECUs and buses. The ECUs and buses form networks, and the networks are connected through gateways. Critical applications, such as the engine management system and the antilock braking system (ABS) use the CAN bus for communication. To meet future application demands, CAN is gradually being replaced with FlexRay. A wireless gateway connected to the FlexRay bus allows access to external networks such as the Internet. A conceptual model of the in-vehicle network is shown in Fig. 1. A First Simulation of Attacks 87 External Network ECUs Wireless Gateway FlexRay Fig. 1. Conceptual model of the in-vehicle network consisting of a FlexRay network and ECUs including a wireless gateway 3.2 FlexRay Protocol The FlexRay protocol [7] is designed to meet the requirements of today’s automotive industry, including flexible data communications, support for three different topologies (bus, star, and mixed), fault-tolerant operation and higher data rate than previous standards. 
FlexRay allows both asynchronous transfer mode and real-time data transfer, and operates as a dual-channel system, where each channel delivers a maximum data bit rate of 10 Mbps. The FlexRay protocol belongs to the time-triggered protocol family which is characterized by a continuous communication of all connected nodes via redundant data buses at predefined intervals. It defines the specifics of a data-link layer independent of a physical layer. It focuses on real-time redundant communications. Redundancy is achieved by using two communications channels. Real time is assured by the adoption of a time division multiplexing communication scheme: each ECU has a given time slot to communicate but it cannot decide when since the slots are allocated at design time. Moreover, a redundant masterless protocol is used for synchronizing the ECU clocks by measuring time differences of arriving frames, thus implementing a fault-tolerant procedure. The data-link layer provides basic but strongly reliable communication functionality, and for the protocol to be practically useful, an application layer needs to be implemented on top of the link layer. Such an application layer could be used to assure the desired security properties. In FlexRay, this application layer is missing. 3.3 Desired Security Properties To evaluate the security of the FlexRay protocol, we use a set of established security properties commonly used for, e.g., sensor networks [8, 9, 10]. We believe that most of the properties are desirable in the vehicle setting due to the many similarities between the network types. The following five properties are considered. – Data Confidentiality. The contents of messages between the ECUs should be kept confidential to avoid unauthorized reads. 88 D.K. Nilsson et al. – Data Integrity. Data integrity is necessary for ensuring that messages have not been modified in transit. – Data Availability. Data availability is necessary to ensure that measured data from ECUs can be accessed at requested times. – Data Authentication. To prevent an attacker from spoofing packets, it is important that the receiver can verify the sender of the packets (authenticity). – Data Freshness. Data freshness ensures that communicated data is recent and that an attacker is not replaying old data. 3.4 Security Evaluation of FlexRay Specification We perform a security evaluation of the FlexRay protocol based on the presented desired security properties. We inspect the specification and look for functionalities that would address those properties. The FlexRay protocol itself does not provide any security features since it is a pure communication protocol. However, CRC check values are included to provide some form of integrity protection against transmission errors. The time division multiplexing assures availability since the ECUs can communicate during the allocated time slot. The security properties are at best marginally met by the FlexRay specification. We find some protection of data availability and data integrity, albeit the intention of the protection is for safety reasons. However, the specification does not indicate any assurance of confidentiality, authentication, or freshness of data. Thus, security has not been considered during the design of the FlexRay protocol specification. 3.5 Attacker Model The evaluation concluded that the five security properties were at best slightly addressed in the FlexRay protocol. 
Therefore, we can safely make the assumption that a wide set of attacks in the in-vehicle network should be possible to execute. We apply the Nilsson-Larson attacker model [5], where an attacker has access to the in-vehicle network via the wireless gateway and can perform the following actions: read, spoof, drop, modify, flood, steal, and replay. In the following section, we focus on simulating the read and spoof actions. 4 Cyber Attacks In this section, we define and discuss two attacker actions: read and spoof. The actions are derived from the security evaluation and the attacker model in the previous section. 4.1 Attacker Actions Read and spoof can be performed from any ECU in the FlexRay network. Read. Due to the lack of confidentiality protection, an attacker can read all data sent on the FlexRay bus, and possibly send the data to a remote location via A First Simulation of Attacks 89 the wireless gateway. If secret keys, proprietary or private data are sent on the FlexRay bus, an attacker can easily learn that data. Spoof. An attacker can create and inject messages since there exists no data authentication on the data sent on the FlexRay bus. Therefore, messages can be spoofed and target arbitrary ECUs claiming to be from any ECU. An attacker can easily create and inject diagnostics messages on the FlexRay bus to cause ECUs to perform arbitrary actions. 4.2 Simulation Environment We have used CANoe version 7.0 from Vector Informatik [11] to simulate the attacks. ECUs for handling the engine, console, brakes, and back light functionalities were connected to a FlexRay bus to simulate a simplified view of an in-vehicle network. 4.3 Simulated Attack Actions We describe the construction and simulation of the read and spoof attack actions. Read. Messages sent on the FlexRay bus can be recorded by an attacker who has access to the network. The messages are Time stamped, and the channel (Chn), identifier (ID), direction (Dir ), data length (DLC ), and any associated Data bytes are recorded. The log entries for a few messages are shown in Table 1. Also, we have included the corresponding message Name (interpreted from ID) for each message. Spoof. An attacker can create and inject a request on the bus to cause an arbitrary effect. For example, an attacker can create and inject a request to light up the brake light. The BreakLight message is spoofed with the data value 0x01 and sent on the bus resulting in the brake light being turned on. The result of the attack is shown in Fig. 2. The transmission is in the fifth gear, and the vehicle is accelerating and traveling with a velocity of 113 mph. At the same time, the brake light is turned on, as indicated by the data value 1 of the BreakLight message although no brakes are applied (BrakePressure has the value 0). 4.4 Effect on Safety Based on Lack of Proper Security Protection As noted in Section 3, safety and security are tightly coupled in the vehicle setting. From our analysis, it is evident that security-related incidents can affect Table 1. Log entries with corresponding names and values Time 15.540518 15.540551 15.549263 15.549362 15.549659 15.549692 Chn ID Dir DLC Name FR FR FR FR FR FR 1A 1A 1A 1A 1A 1A 51 52 13 16 25 26 Tx Tx Tx Tx Tx Tx 16 16 16 16 16 16 BackLightInfo GearBoxInfo ABSInfo BreakControl EngineData EngineStatus Data 128 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 160 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 59 0 0 0 31 0 0 0 0 0 0 0 0 0 0 0 0 64 31 0 0 0 0 0 0 0 0 0 0 0 0 0 73 8 81 44 140 10 152 58 0 0 0 0 0 0 0 0 0000000000000000 90 D.K. 
Nilsson et al. Fig. 2. Spoof attack where the brake lights are lit up while the vehicle is accelerating and no brake pressure is applied the safety of the driver. Our example with the spoofed brake light message could cause other drivers to react based on the false message. Even worse, spoofed control messages could affect the control and maneuverability of the vehicle and cause serious injury to the drivers, passengers, and other road-users. It is therefore imperative that proper security protection is highly prioritized in future development of the protocol. 5 Conclusion and Future Work We have created and simulated attacks in the automotive communications protocol FlexRay and shown that such attacks can easily be created. In addition, we have discussed how safety is affected by the lack of proper security protection. These cyber attacks can target control and maneuverability ECUs in the in-vehicle network and lead to serious injury for the driver. The attacker actions are based on weaknesses in the FlexRay 2.1 revision A protocol specification. The security of the in-vehicle network must be taken into serious consideration when external access to these previously isolated networks is introduced. The next step is to investigate other types of attacks and simulate such attacks on the FlexRay bus. Then, based on the various attacks identified, a set of appropriate solutions for providing security features should be investigated. The most pertinent future work to be further examined are prevention and detection mechanisms. An analysis of how to secure the in-vehicle protocol should be performed, and the possibility to introduce lightweight mechanisms for data integrity and authentication protection should be investigated. Moreover, to A First Simulation of Attacks 91 detect attacks in the in-vehicle network a lightweight intrusion detection system needs to be developed. References 1. Miucic, R., Mahmud, S.M.: Wireless Multicasting for Remote Software Upload in Vehicles with Realistic Vehicle Movement. Technical report, Electrical and Computer Engineering Department, Wayne State University, Detroit, MI 48202 USA (2005) 2. Wolf, M., Weimerskirch, A., Paar, C.: Security in Automotive Bus Systems. In: Workshop on Embedded IT-Security in Cars, Bochum, Germany (November 2004) 3. Hoppe, T., Dittman, J.: Sniffing/Replay Attacks on CAN Buses: A simulated attack on the electric window lift classified using an adapted CERT taxonomy. In: Proceedings of the 2nd Workshop on Embedded Systems Security (WESS), Salzburg, Austria (2007) 4. Lang, A., Dittman, J., Kiltz, S., Hoppe, T.: Future Perspectives: The car and its IP-address - A potential safety and security risk assessment. In: The 26th International Conference on Computer Safety, Reliability and Security (SAFECOMP), Nuremberg, Germany (2007) 5. Nilsson, D.K., Larson, U.E.: Simulated Attacks on CAN Buses: Vehicle virus. In: Proceedings of the Fifth IASTED Asian Conference on Communication Systems and Networks (ASIACSN) (2008) 6. EASIS. Embedded Security for Integrated Safety Applications (2006), http://www.car-to-car.org/fileadmin/dokumente/pdf/security 2006/ sec 06 10 eyeman easis security.pdf 7. FlexRay Consortium. FlexRay Communications System Protocol Specification 2.1 Revision A (2005) (Visited August, 2007), http://www.softwareresearch.net/site/teaching/SS2007/ds/ FlexRay-ProtocolSpecification V2.1.revA.pdf 8. Luk, M., Mezzour, G., Perrig, A., Gligor, V.: MiniSec: A secure sensor network communication architecture. 
In: IPSN 2007: Proceedings of the 6th International Conference on Information Processing in Sensor Networks, pp. 479–488. ACM Press, New York (2007) 9. Perrig, A., Szewczyk, R., Wen, V., Culler, D.E., Tygar, J.D.: SPINS: Security protocols for sensor networks. In: Mobile Computing and Networking, pp. 189–199 (2001) 10. Karlof, C., Sastry, N., Wagner, D.: TinySec: A link layer security architecture for wireless sensor networks. In: SenSys 2004: Proceedings of the 2nd International Conference on Embedded Networked Sensor Systems, Baltimore, November 2004, pp. 162–175 (2004) 11. Vector Informatik. CANoe and DENoe 7.0 (2007) (Visited December, 2007), http://www.vector-worldwide.com/vi canoe en.html Wireless Sensor Data Fusion for Critical Infrastructure Security Francesco Flammini1,2, Andrea Gaglione2, Nicola Mazzocca2, Vincenzo Moscato2, and Concetta Pragliola1 1 ANSALDO STS - Ansaldo Segnalamento Ferroviario S.p.A. Business Innovation Unit Via Nuova delle Brecce 260, Naples, Italy {flammini.francesco,pragliola.concetta}@asf.ansaldo.it 2 Università di Napoli “Federico II” Dipartimento di Informatica e Sistemistica Via Claudio 21, Naples, Italy {frflammi,andrea.gaglione,nicola.mazzocca,vmoscato}@unina.it Abstract. Wireless Sensor Networks (WSN) are being investigated by the research community for resilient distributed monitoring. Multiple sensor data fusion has proven as a valid technique to improve detection effectiveness and reliability. In this paper we propose a theoretical framework for correlating events detected by WSN in the context of critical infrastructure protection. The aim is to develop a decision support and early warning system used to effectively face security threats by exploiting the advantages of WSN. The research addresses two relevant issues: the development of a middleware for the integration of heterogeneous WSN (SeNsIM, Sensor Networks Integration and Management) and the design of a model-based event correlation engine for the early detection of security threats (DETECT, DEcision Triggering Event Composer & Tracker). The paper proposes an overall system architecture for the integration of the SeNsIM and DETECT frameworks and provides example scenarios in which the system features can be exploited. Keywords: Critical Infrastructure Protection, Sensor Data Fusion, Railways. 1 Introduction Several methodologies (e.g. risk assessment [5]) and technologies (e.g. physical protection systems [4]) have been proposed to enhance the security of critical infrastructure systems. The aim of this work is to propose the architecture for a decision support and early warning system used to effectively face security threats (e.g. terrorist attacks) based on wireless sensors. Wireless sensors feature several advantages when applied to critical infrastructure surveillance [8], as they are: • Cheap, and this allows for fine grained and highly redundant configurations; • Resilient, due to their fault-tolerant mesh topology; • Power autonomous, due to the possibility of battery and photovoltaic energy supplies; E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 92–99, 2009. springerlink.com © Springer-Verlag Berlin Heidelberg 2009 Wireless Sensor Data Fusion for Critical Infrastructure Security 93 • Easily installable, due to their wireless nature and auto-adapting multi-hop routing; • Intelligent, due to the on-board processor and operating systems which allow for some data elaborations being performed locally. 
All these features support the use of WSN in highly distributed monitoring applications in critical environments. The example application we will refer to in this paper is railway infrastructure protection against external threats which can be natural (fire, flooding, landslide, etc.) or human-made malicious (sabotage, terrorism, etc.). Examples of useful sensors in this domain are listed in the following: smoke and heat – useful for fire detection; moisture and water – useful for flooding detection; Pressure – useful for explosion detection; movement detection (accelerometer or GPS based shifting measurement) – useful for theft detection or structural integrity checks; gas and explosive – useful for chemical or bombing attack detection; vibration and sound – useful for earthquake or crash detection. WSN could also be used for video surveillance and on-board intelligent video-analysis, as reported in [7]. Theoretically, any kind of sensor could be interfaced with a WSN, as it would just substitute the sensing unit of the so called “motes”. For instance, it would be useful (and apparently easy) to interface on WSN intrusion detection devices (like volumetric detectors, active infrared barriers, microphonic cables, etc.) in order to save on cables and junction boxes and exploit an improved resiliency and a more cohesive integration. With respect to traditional connections based on serial buses, wireless sensors are also less prone to tampering, when proper cryptographic protocols are adopted [6]. However, for some classes of sensors (e.g. radiation portals) some of the features of motes (e.g. size, battery power) would be lost. In spite of their many expected advantages, there are several technological difficulties (e.g. energy consumption) and open research issues associated with WSN. The issue addressed in this paper relates to the “data fusion” aspect. In fact, the heterogeneity of network topologies and measured data requires integration and analysis at different levels (see Fig. 1), as partly surveyed in reference [9]. As first, the monitoring of wide geographical areas and the diffusion of WSNs managed by different middlewares have highlighted the research problem of the integrated management of data coming from the various networks. Unfortunately such information is not available in a unique container, but in distributed repositories and the major challenge lies in the heterogeneity of repositories which complicates data management and retrieval processes. This issue is addressed by the SeNsIM framework [1], as described in Section 2. Secondly, there is the need for an on-line reasoning about the events captured by sensor nodes, in order to early detect and properly manage security threats. The availability of possibly redundant data allows for the correlation of basic events in order to increase the probability of detection, decrease the false alarm rate, warn the operators about suspect situations, and even automatically trigger adequate countermeasures by the Security Management System (SMS). This issue is addressed by the DETECT framework [2], as described in Section 3. The rest of the paper is organized as follows: Section 4 discusses about the SeNsIM and DETECT software integration; Section 5 introduces an example railway security application; Section 6 draws conclusions and hints about future developments. 94 F. Flammini et al. DB DETECT (reasoning) SeNsIM (integration) THREAT ROUTE SENSING POINTS WSN 1 ... SMS WSN N ... (a) (b) Fig. 1. 
(a) Distributed sensing in physical security; (b) Monitoring architecture 2 The SeNsIM Framework The main objectives of SeNsIM are: • To integrate information from distributed sensor networks managed by local middlewares (e.g. TinyDB); • To provide an unique interface for local networks by which a generic user can easily execute queries on specific sensor nodes; • To ensure system’s scalability in case of connection of new sensor networks. From an architectural point of view, the integration has been realized by exploiting the wrapper-mediator paradigm: when a sensor network is activated, an apposite wrapper agent aims at extracting its features and functionalities and to send (e.g. in a XML format) them to one or more mediator agents that are, in the opposite, responsible to drive the querying process and the communication with users. Thus, a query is first submitted through a user interface, and then analyzed by the mediator, converted in a standard XML format and sent to the apposite wrapper. The latter, in a first moment executes the translated query on its local network, by means of a low-layer middleware (TinyDB in the current implementation), and then retrieve the results to send (in a XML format) to the mediator, which show them to the user. According to the data model, the wrapper agent provides a local network virtualization in terms of objects, network and sensors. An object of the class Sensor can be associated to an object of Network type. Moreover, inside the same network one or more sensors can be organized into objects of Cluster or Group type. The state of a sensor can be modified by means of classical getting/setting functions, while the measured variables can be accessed using the sensing function. Fig.2 schematizes the levels of abstraction in the data management perspective provided by SeNsIM using TinyDB as low-level middleware layer and outlines the system architecture. The framework is described in more details in reference [1]. 3 The DETECT Framework Among the best ways to prevent attacks and disruptions is to stop any perpetrators before they strike. DETECT is a framework aimed at the automatic detection of threats Wireless Sensor Data Fusion for Critical Infrastructure Security 95 (a) (b) Fig. 2. (a) Levels of abstraction; (b) SeNsIM architecture against critical infrastructures, possibly before they evolve to disastrous consequences. In fact, non trivial attack scenarios are made up by a set of basic steps which have to be executed in a predictable sequence (with possible variants). Such scenarios must be precisely identified during the risk analysis process. DETECT operates by performing a model-based logical, spatial and temporal correlation of basic events detected by sensor networks, in order to “sniff” sequence of events which indicate (as early as possible) the likelihood of threats. In order to achieve this aim, DETECT is based on a real-time detection engine which is able to reason about heterogeneous data, implementing a centralized application of “data fusion”. The framework can be interfaced with or integrated in existing SMS systems in order to automatically trigger adequate countermeasures (e.g. emergency/crisis management). Attack scenarios are described in DETECT using a specific Event Description Language (EDL) and stored in a Scenario Repository. Starting from the Scenario Repository, one or more detection models are automatically generated using a suitable formalism (Event Graphs in the current implementation). In the operational phase, a 96 F. 
Flammini et al. model manager macro-module has the responsibility of performing queries on the Event History database for the real-time feeding of detection models according to predetermined policies. When a composite event is recognized, the output of DETECT consists of: the identifier(s) of the detected/suspected scenario(s); an alarm level, associated to scenario evolution (only used in deterministic detection as a linear progress indicator); a likelihood of attack, expressed in terms of probability (only used as a threshold in heuristic detection). DETECT can be used as an on-line decision support system, by alerting in advance SMS operators about the likelihood and nature of the threat, as well as an autonomous reasoning engine, by automatically activating responsive actions, including audio and visual alarms, unblock of exit turnstiles, air conditioned flow inversion, activation of sprinkles, emergency calls to first responders, etc. DETECT is depicted as a black-box in Fig. 3 and described in more details in [2]. Scenari o DETECT Engine Detected Attack Scenario Event History Criticality Level (1, 2, 3, ...) Fig. 3. The DETECT framework 4 Integration of SeNsIM and DETECT The SeNsIM and DETECT frameworks need to be integrated in order to obtain an online reasoning about the events captured by different WSNs. As mentioned above, the aim is to early detect and manage security threats against critical infrastructures. In this section we provide the description of the sub-components involved in the software integration of SeNsIM and DETECT. During the query processing task of SeNsIM, user queries are first submitted by means of a User Interface; then, a specific module (Query Builder) is used to build a query. The user queries are finally processed by means of a Query Processing module which sends the query to the appropriate wrappers. The partial and global query results are then stored in a database named Event History. All the results are captured and managed by a Results Handler, which implements the interface with wrappers. The Model Feeder is the DETECT component which performs periodic queries on the Event History to access primitive event occurrences. The Model Feeder instantiates the inputs of the Detection Engine according to the nature of the model(s). Therefore, the integration is straightforward and mainly consists in the management of the Event History as a shared database, written by the mediator and read by the Model Feeder according to an appropriate concurrency protocol. Wireless Sensor Data Fusion for Critical Infrastructure Security 97 In Figure 4 we report the overall software architecture as a result of the integration between SeNsIM and DETECT. The figure also shows the modules of SeNsIM involved in the query processing task. User interaction is only needed in the configuration phase, to define attack scenarios and query parameters. To this aim, a userfriendly unified GUIs will be available (as indicated in Fig.4). According to the query strategy, both SeNsIM and DETECT can access data from the lower layers using either a cyclic or event driven retrieval process. 
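A minimal sketch of the cyclic retrieval performed by the Model Feeder is given below; the relational schema of the Event History, the SQLite back-end and the feed_detection_model hook are assumptions made for illustration and may differ from the actual implementation:

```python
import sqlite3
import time

def feed_detection_model(sensor_id, event_code, timestamp):
    # Hypothetical hook: hand the primitive event to the detection model(s).
    print(f"feeding model: sensor={sensor_id} event={event_code} @ {timestamp}")

def poll_event_history(db_path: str, last_seen_id: int = 0, period_s: float = 1.0):
    """Cyclic retrieval of new primitive events written by the SeNsIM mediator.

    Assumed shared table: event_history(event_id, sensor_id, event_code, timestamp).
    """
    while True:
        with sqlite3.connect(db_path) as conn:
            rows = conn.execute(
                "SELECT event_id, sensor_id, event_code, timestamp "
                "FROM event_history WHERE event_id > ? ORDER BY event_id",
                (last_seen_id,),
            ).fetchall()
        for event_id, sensor_id, event_code, timestamp in rows:
            feed_detection_model(sensor_id, event_code, timestamp)
            last_seen_id = event_id
        time.sleep(period_s)
```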
Fig. 4. Query processing and software integration (only the modules used for the integration are shown; Query Processing is a macro-module containing several sub-modules; the GUI is used both to edit DETECT scenarios with a graphical formalism translatable to EDL files and to define SeNsIM queries for sensor data retrieval, either by cyclic polling or event-driven) 5 Example Application Scenario In this section we report an example application of the overall framework to the case study of a railway transportation system, an attractive target for thieves, vandals and terrorists. Several application scenarios can be envisaged with the proposed architecture: several wireless sensors (track line break detection, on-track obstacle detection, etc.) and actuators (e.g. virtual or light signalling devices) could be installed to monitor track integrity against external threats and to notify anomalies. In the following we describe how to detect a more complex scenario, namely a strategic terrorist attack. Let us suppose a terrorist decides to attack a high-speed railway line, which is completely supervised by a computer-based control system. A possible scenario consisting of multiple train halting and railway bridge bombing is reported in the following: 1. Artificial occupation (e.g. by using a wire) of the track circuits immediately after the location in which the trains need to be stopped (let us suppose a high bridge), in both directions. 2. Interruption of the railway power line, in order to prevent the trains from restarting using a staff responsible operating mode. 3. Bombing of the bridge shafts by remotely activating the already positioned explosive charges. Variants of this scenario exist: for instance, trains can be (less precisely) stopped by activating jammers to disturb the wireless communication channel used for radio signalling, or the attack can start from point (2) (but this would be even less precise). The described scenario could be identified early by detecting the abnormal events reported in point (1) and activating proper countermeasures. By using proper on-track sensors it is possible to monitor the abnormal occupation of track circuits, and a possible countermeasure consists in immediately sending an unconditional emergency stop message to the train. This would prevent the terrorist from stopping the train at the desired location and therefore halt the evolution of the attack scenario. Even though the detection of events in points (2) and (3) would happen too late to prevent the disaster, it could be useful to achieve a greater situational awareness about what is happening in order to rationalize the intervention of first responders.
Now, let us formally describe the scenario using wireless sensors and detected events, using the notation “sensor description (sensor ID) :: event description (event ID)”: FENCE VIBRATION DETECTOR (S1) :: POSSIBLE ON TRACK INTRUSION (E1) TRACK CIRCUIT X (S2) :: OCCUPATION (E2) LINESIDE TRAIN DETECTOR (S3) :: NO TRAIN DETECTED (E3) TRACK CIRCUIT Y (S4) :: OCCUPATION (E4) LINESIDE TRAIN DETECTOR (S5) :: NO TRAIN DETECTED (E5) VOLTMETER (S6) :: NO POWER (E6) ON-SHAFT ACCELEROMETER (S7) :: STRUCTURAL MOVEMENT (E7) Due to the integration middleware made available by SeNsIM, these events are not required to be detected on the same physical WSN, but they just need to share the same sensor group identifier at the DETECT level. Event (a) is not mandatory, as the detection probability is not 100%. Please not that each of the listed events taken singularly would not imply a security anomaly or be a reliable indicator of it. The EDL description of the above scenario is provided in the following (in the assumption of unique event identifiers): (((E1 SEQ ((E2 AND E3) OR (E4 AND E5))) OR ((E2 AND E3) AND (E4 AND E5))) SEQ E6 ) SEQ E7 Top-down and left to right, using 4 levels of alarm severity: a) E1 can be associated a level 1 warning (alert to the security officer); b) The composite events determined by the first group of 4 operators and the second group of 3 operators can be both associated a level 2 warning (triggering the unconditional emergency stop message); c) The composite event terminating with E6 can be associated a level 3 warning (switch on back-up power supply, whenever available) d) The composite event terminating with E7 (complete scenario) can be associated a level 4 warning (emergency call to first responders). Wireless Sensor Data Fusion for Critical Infrastructure Security 99 In the design phase, the scenario is represented using Event Trees and stored in the Scenario Repository of DETECT. In the operational phase, SeNsIM records the sequence of detected events in the Event History. When the events corresponding to the scenario occur, DETECT provides the scenario identifier and the alarm level (with a likelihood index in case of non deterministic detection models). Pre-configured countermeasures can then be activated by the SMS on the base of such information. 6 Conclusions and Future Works Wireless sensors are being investigated in several applications. In this paper we have provided the description of a framework which can be employed to collect and analyze data measured by such heterogeneous sources in order to enhance the protection of critical infrastructures. One of the research threads points at connecting by WSN traditionally wired sensors and application specific devices, which can serve as useful information sources for a superior situational awareness in security critical applications (like in the example scenario provided above). The verification of the overall system is also a delicate issue which can be addressed using the methodology described in [3]. We are currently developing the missing modules of the software system and testing the already available ones in a simulated environment. The next step will be the interfacing with a real SMS for the on-the-field experimentation. References 1. Chianese, A., Gaglione, A., Mazzocca, N., Moscato, V.: SeNsIM: a system for Sensor Networks Integration and Management. In: Proc. Intl. Conf. on Intelligent Sensors, Sensor Networks and Information Processing (ISSNIP 2008) (to appear, 2008) 2. 
Flammini, F., Gaglione, A., Mazzocca, N., Pragliola, C.: DETECT: a novel framework for the detection of attacks to critical infrastructures. In: Proc. European Safety & Reliability Conference (ESREL 2008) (to appear, 2008) 3. Flammini, F., Mazzocca, N., Orazzo, A.: Automatic instantiation of abstract tests to specific configurations for large critical control systems. In: Journal of Software Testing, Verification & Reliability (Wiley) (2008), doi:10.1002/stvr.389 4. Garcia, M.L.: The Design and Evaluation of Physical Protection Systems. ButterworthHeinemann, USA (2001) 5. Lewis, T.G.: Critical Infrastructure Protection in Homeland Security: Defending a Networked Nation. John Wiley, New York (2006) 6. Perrig, A., Stankovic, J., Wagner, D.: Security in Wireless Sensor Networks. Communications of the ACM 47(6), 53–57 (2004) 7. Rahimi, M., Baer, R., et al.: Cyclops: In situ image sensing and interpretation in wireless sensor networks. In: Proc. 3rd ACM Conference on Embedded Networked Sensor Systems (SenSys 2005) (2005) 8. Roman, R., Alcaraz, C., Lopez, J.: The role of Wireless Sensor Networks in the area of Critical Information Infrastructure Protection. Inf. Secur. Tech. Rep. 12(1), 24–31 (2007) 9. Wang, M.M., Cao, J.N., Li, J., Dasi, S.K.: Middleware for Wireless Sensor Networks: A Survey. J. Comput. Sci. & Technol. 23(3), 305–326 (2008) Development of Anti Intruders Underwater Systems: Time Domain Evaluation of the Self-informed Magnetic Networks Performance Osvaldo Faggioni1, Maurizio Soldani1, Amleto Gabellone2, Paolo Maggiani3, and Davide Leoncini4 1 INGV Sez. ROMA2, Stazione di Geofisica Marina, Fezzano (SP), Italy {faggioni,soldani}@ingv.it 2 CSSN ITE, Italian Navy, Viale Italia 72, Livorno, Italy amleto.gabellone@marina.difesa.it 3 COMFORDRAG, Italian Navy, La Spezia, Italy paolov.maggiani@marina.difesa.it 4 DIBE, University of Genoa, Genova, Italy davide.leoncini@unige.it Abstract. This paper shows the result obtained during the operative test of an anti-intrusion undersea magnetic system based on a magnetometers’ new self-informed network. The experiment takes place in a geomagnetic space characterized by medium-high environmental noise with a relevant human origin magnetic noise component. The system has two different input signals: the magnetic background field (natural + artificial) and a signal composed by the magnetic background field and the signal due to the target magnetic field. The system uses the first signal as filter for the second one to detect the target magnetic signal. The effectiveness of the procedure is related to the position of the magnetic field observation points (reference devices and sentinel devices). The sentinel devices must obtain correlation in the noise observations and de-correlations in the target signal observations. The system, during four tries of intrusion, has correctly detected all magnetic signals generated by divers. Keywords: Critical Systems, Port Protection, Magnetic Systems. 1 Introduction The recent evolution of the world strategic scenarios is characterized by a change of the first threat type: from the military high power attack to terrorist attack. In these new conditions our submarine areas control systems must redesigned to obtain the capability of detecting of very small targets closer to the objective. 
Acoustic systems, the basis of current and probably future port protection options, give very good results in the control of large volumes of water, but they fail in high-definition control tasks such as the control of the port sea bottom and of the water volumes close to the docks. Magnetic detection is therefore a very interesting option to make Anti Intruders Port Systems more effective. The magnetic method performs very well in proximity detection but loses effectiveness over large water volumes, whereas the acoustic method performs well in free water but fails in the sea bottom surface areas and, more importantly, in the proximity of the docks. The integration of these two methods gives the MAC (Magnetic Acoustic) class of composite underwater alert systems. Past studies on magnetic detection showed a phenomenological limitation of its effectiveness, due to the very high interference between the target magnetic signal (a diver is a low-power source) and the environmental magnetic field. The geomagnetic field is the convolution of several elementary contributions, originated by sources external or internal to the planet, both static and dynamic; moreover, in areas with human activity there are transient magnetic signals with a very large band and high amplitude variations. Classically, target detection is attempted by classifying the elementary signals of the magnetic field, tentatively associating them with their hypothetical sources, and separating the target signal from the noise: essentially the use of numerical frequency-filtering techniques (LP, HP, BP) whose cut-off frequencies are defined from empirical experience. The result is a subjective method, dependent on the decisions of the operators (or of the array/chain designers), capable of great successes as well as strong failures. The effectiveness of the method is low, and its development for the detection of small magnetic signals was therefore neglected [1-3]. Increasing the sensitivity of the devices does not solve the problem either, because the problem is related to the informative capability of the magnetograms with respect to the target signal, not to the sensitivity of the measurements. In the present paper we show the results obtained with a new approach to this problem: instead of statistical-conjectural frequency filters, we use a reference magnetometer to inform the sentinel magnetometer about the noise field without the target signal; the system uses the reference magnetogram RM as a time-domain (or frequency-domain) filter function for the sentinel magnetogram SM. The result of the de-convolution RM-SM is the target signal. The critical point of the system is the design of the network: to obtain an effective magneto-detection system, the devices must be placed at a distance that gives amplitude correlation in the noise measurements and de-correlation in the target measurement. 2 On the System Design Options and Metrological Response To build the self-informed magnetic system we propose two design solutions for the geometry of the device network: the Referred Integrated MAgnetic Network (RIMAN) and the Self-referred Integrated MAgnetic Network (SIMAN) [4]. 2.1 RIMAN System The RIMAN system consists of a magnetometer array and an identical stand-alone magnetometer (referring node) deployed within the protected area.
2.2 SIMAN System

In the SIMAN system all the array's magnetometers are used to obtain the zero-level condition. The control unit checks in sequence the zero-level condition between each pair of magnetometers and signals any non-zero differential function.

Fig. 2. Scheme of the Self-referred Integrated Magnetometers Array Network

2.3 Signal Processing

Signal processing in the SIMAN system also gives very good accuracy. The only issue is an ambiguity when the target crosses a pair of magnetometers at the same distance from both; this ambiguity can be solved by evaluating the differential functions between the adjacent nodes. The drawback is that a SIMAN system requires a continuous second-order check at all the nodes. The advantage of a SIMAN system is the possibility of covering an unlimited area, since the stability condition is required only for each pair of the array's magnetometers.

3 Results

The experiment consists of recording the magnetic field variations during multiple runs performed by a team of divers. The test system is the elementary cell of the SIMAN class [5, 6]. The devices are two tri-axial fluxgate magnetometers, positioned at a water depth of 12 meters and at an initial mutual distance of 12 meters; the computational procedure is based on the measurement of the vector component Z (variations of the vector direction over time are not considered) (Fig. 3).

Fig. 3. Experiment configuration
The diver's runs were performed at zero CPA (Closest Point of Approach), at about 1 meter from the bottom, along 50-meter tracks centered on each magnetometer. The runs' bearings were approximately E-W and W-E. The two magnetometers were cable-connected, through their respective electronic control devices (deployed 1 meter from the sensors), to a data-logger station placed on the beach at a distance of about 150 meters. The environmental noise of the geomagnetic space of the measurement area is classified as medium-high and is characterized by contributions of human origin coming from city noise, industrial noise (an electrical power production plant), electrical railway noise, maritime commercial traffic, port activity traffic, etc. (all at distances < 8 km). The target source for the experiment was a diver equipped with a standard air-bottle system.

The result of the experiment is shown in the diagrams of figure 4. The magnetogram of the sentinel device (Fig. 4A) is characterized by 5 impulsive magnetic anomalies. These signals are compatible with the magnetic signal of our target: the signals marked in figure 4 as 1, 2, 4 and 5 have a very clear impulsive origin (monopolar or dipolar geometry), while signal 3 has a geometrical character that is not fully impulsive, so the operator (human or automatic) proposes the classification in Table 1. The availability of the reference magnetogram, and its preliminary (subjective) inspection, already reveals the fatal error on signal number 2: this signal appears in both magnetograms A and B because it is not related to a diver crossing in the proximity of the sentinel device but has very large spatial coherence (i.e., it is noise).

Table 1. Alarm classification table

Signal     1     2     3        4     5
Geometry   MONO  MONO  COMPLEX  MONO  DIPOLE
Alarm      YES   YES   YES      YES   YES
Uncertain  NO    NO    YES      NO    NO

Fig. 4. Magnetograms obtained by the SIMAN elementary cell devices

Fig. 5. SIMAN elementary cell self-informed magnetogram

Figure 5 shows the result obtained by the time-domain de-convolution B-A. This is the numerical procedure that gives a quantitative evaluation of the presence of impulsive signals in the proximity of the sentinel device. The SIMAN magnetogram correctly reproduces the real sequence of diver crossings in the proximity of the sentinel device: signal 2 is declassified, and signal 3 is classified as a dipolar target signal without uncertainty, because the de-convolution procedure cleans the complex signal of its superimposed noise component. The classification table becomes Table 2. The SIMAN classification of impulsive signals (Table 2) corresponds fully to the four crossings of the diver: signals 1 and 3 for the first course (E-W, W-E directions) and signals 4 and 5 for the second course (E-W, W-E directions).

Table 2. Alarm classification table after the time de-convolution B-A

Signal     1     3       4     5
Geometry   MONO  DIPOLE  MONO  DIPOLE
Alarm      YES   YES     YES   YES
Uncertain  NO    NO      NO    NO

Fig. 6. Comparison of the self-referred system and classical-technique performances

The effectiveness of the system (in either the RIMAN or the SIMAN option) depends strongly on the position of the reference device. The reference magnetometer must be far enough from the sentinel to obtain good de-correlation of the target signal, but close enough to obtain good correlation of the noise. This condition makes the reference magnetogram act as a filter function that cuts off the noise while remaining transparent to the target signal.
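In the time domain, the de-convolution used here reduces to a sample-by-sample differencing of the reference and sentinel magnetograms. The sketch below, built on entirely synthetic data with hypothetical amplitudes, shows why an anomaly with large spatial coherence (such as signal 2 above) is suppressed while an impulse local to the sentinel survives.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
t = np.arange(n)

def impulse(center, amp, width=30):
    return amp * np.exp(-0.5 * ((t - center) / width) ** 2)

# Common geomagnetic background plus a wide-coherence disturbance seen by BOTH sensors.
background = np.cumsum(rng.normal(size=n)) / 30.0
coherent_disturbance = impulse(center=800, amp=4.0, width=80)

# Sentinel magnetogram A: background + coherent disturbance + local diver impulse.
A = background + coherent_disturbance + impulse(center=1500, amp=2.5)
# Reference magnetogram B: background + coherent disturbance only.
B = background + coherent_disturbance

# Self-informed magnetogram: time-domain differencing B - A (reference minus sentinel).
self_informed = B - A

threshold = 1.0  # hypothetical alarm threshold
alarm_samples = np.where(np.abs(self_informed) > threshold)[0]
if alarm_samples.size:
    print(f"local impulse detected around sample {int(alarm_samples.mean())}")
else:
    print("no local impulse detected")
```

The disturbance common to both sensors cancels in B - A, while the impulse local to the sentinel remains: this is the mechanism by which signal 2 is declassified in Table 2. The cancellation clearly works only as long as the noise observed at the two positions stays correlated, which is the RD-SD distance issue examined next.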
In effect, the kernel of the self-information procedure is the RD-SD distance. We now analyze the numerical quality of the self-informed system's performance with reference to the performance of a classical LP noise-cleaning procedure and to variations of the distance RD-SD between reference devices and sentinel devices (Fig. 6). The diagrams in figure 6 show the elaboration of the same data subset (source magnetograms in figure 4): diagram A corresponds to the pure extraction of the subset from the magnetograms in figure 4 (best RD-SD distance), B is the LP filter (the typical informative approach), C is the self-informed response with too short an RD-SD distance, and D is the self-informed response with too large an RD-SD distance.

3.1 Comparison A/B

The metrological and numerical effectiveness of a magnetic detection procedure is defined by the increase of the informative capability of the magnetogram with respect to the target signal. We define, quantitatively, the Informative Capability IC of a composite signal, for each of its "i" elementary harmonic components, as the ratio between the energy of component i and the total energy of the composite signal:

ICi = Ei / (E1 + … + En)    (1)

In condition A, SIMAN produces a very high IC for impulse 1 and a strong reduction of the IC of impulse 2; SIMAN therefore permits the correct classification of impulse 1 as the target and of impulse 2 as environmental noise. In diagram B, the classical LP filtering technique produces, even with the best choice of cut frequency, a good cleaning of the high-frequency components, but the IC of impulse 1 is lost because of the energy of impulse 2 (which survives the filter). The numerical value of IC2 is more or less the same as that of IC1, so the second impulse is classified as a target: an egregious false alarm.

3.2 Comparison A/C/D

The comparison A/C/D (Fig. 6) shows that the RD-SD distance is the kernel of the detection effectiveness of the self-informed magnetic system. In the present case the A distance is 11 m, the C distance 9 m, and the D distance 13 m (±0.5 m). The C RD-SD distance produces a very good suppression of the high-frequency noise, but it also loses the target signal, because the target is present both in the sentinel magnetogram and in the reference magnetogram (the devices are too close). The Q of the target is drastically reduced and SIMAN loses its detection capability. In condition D the distance is too large: the noise at the geomagnetic position of the reference device is no longer correlated with the noise at the sentinel device, so the SIMAN procedure adds the noise of the reference magnetogram to the noise of the sentinel magnetogram, with a large increase of the total energy ETOT of the self-informed signal; in this condition too, the Q of the target is lost.
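As a numeric illustration of Eq. (1) in Sect. 3.1, the sketch below computes the informative capability of two impulses in a synthetic magnetogram. The band energies are approximated here by the energy inside a time window around each impulse, which is a deliberate simplification of the harmonic decomposition used in the paper; the signal, windows and amplitudes are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
fs = 10.0                       # hypothetical sampling rate, Hz
t = np.arange(0, 600, 1 / fs)

def impulse(center, amp, width=5.0):
    return amp * np.exp(-0.5 * ((t - center) / width) ** 2)

# Synthetic magnetogram: two impulses plus broadband noise.
signal = impulse(150, 2.0) + impulse(400, 1.8) + 0.3 * rng.normal(size=t.size)

def window_energy(x, center, half_width=30.0):
    mask = np.abs(t - center) < half_width
    return float(np.sum(x[mask] ** 2))

E1 = window_energy(signal, 150)          # energy attributed to impulse 1
E2 = window_energy(signal, 400)          # energy attributed to impulse 2
E_residual = float(np.sum(signal ** 2)) - E1 - E2
total = E1 + E2 + E_residual

# Eq. (1): ICi = Ei / (E1 + ... + En)
for name, Ei in (("impulse 1", E1), ("impulse 2", E2), ("residual noise", E_residual)):
    print(f"IC({name}) = {Ei / total:.3f}")
```

A processing step (LP filtering or self-informed differencing) is effective, in the paper's terms, exactly when it raises the IC of the target impulse and lowers the IC of the noise components.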
4 Conclusion

The field test of the self-informed magnetic detection system SIMAN shows a relevant growth of effectiveness in the detection of weak undersea magnetic sources (divers) with respect to the classical techniques used in magnetic port-protection networks. The full success of the SIMAN barrier in detecting the diver's attempted crossings of our experimental device barrier is due to the use of the magnetogram not perturbed by the target magnetic signal as a time-domain filter for the perturbed magnetogram. This approach is quantitative and objective, and it increases the target-signal informative capability IC without empirical and subjective signal-cleaning techniques such as the well-known frequency LPF. This performance of SIMAN depends on the accuracy of the choice of the distance between the acquisition points of the magnetograms: a good definition of this distance produces the best condition of noise spatial correlation and target-signal spatial de-correlation. In the magnetic environment of our experiment, a displacement of the reference magnetometer of one meter from its best position can compromise the results and cancel the effectiveness of SIMAN's sensitivity. Deployed on the sea bottom with a good geometry, our test system produced very high detection performance irrespective of the time variations of the magnetic noise.

Acknowledgments. The study of self-informed undersea magnetic networks was launched and developed by Ufficio Studi e Sviluppo of COMFORDRAG – Italian Navy. This experiment was supported by the NATO Undersea Research Centre.

References

1. Berti, G., Cantini, P., Carrara, R., Faggioni, O., Pinna, E.: Misure di Anomalie Magnetiche Artificiali. Atti X Conv. Gr. Naz. Geof. Terra Solida, Roma, pp. 809–814 (1991)
2. Faggioni, O., Palangio, P., Pinna, E.: Osservatorio Geomagnetico Stazione Baia Terra Nova: Ricostruzione Sintetica del Magnetogramma 01.00(UT)GG1.88:12.00(UT)GG18.88. Atti X Conv. Gr. Naz. Geof. Terra Solida, Roma, pp. 687–706 (1991)
3. Cowan, D.R., Cowan, S.: Separation Filtering Applied to Aeromagnetic Data. Exploration Geophysics 24, 429–436 (1993)
4. Faggioni, O.: Protocollo Operativo CF05EX, COMFORDRAG, Ufficio Studi e Sviluppo, Italian Navy, Official Report (2005)
5. Gabellone, A., Faggioni, O., Soldani, M.: CAIMAN (Coastal Anti Intruders MAgnetic Network) Experiment, UDT Europe – Naples, Topic: Maritime Security and Force Protection (2007)
6. Gabellone, A., Faggioni, O., Soldani, M., Guerrini, P.: CAIMAN (Coastal Anti Intruder MAgnetometers Network). In: RTO-MP-SET-130 Symposium on NATO Military Sensing, Orlando, Florida, USA, March 12–14 (classified, 2008)

Monitoring and Diagnosing Railway Signalling with Logic-Based Distributed Agents

Viviana Mascardi1, Daniela Briola1, Maurizio Martelli1, Riccardo Caccia2, and Carlo Milani2
1 DISI, University of Genova, Italy {Viviana.Mascardi,Daniela.Briola,Maurizio.Martelli}@disi.unige.it
2 IAG/FSW, Ansaldo Segnalamento Ferroviario S.p.A., Italy {Caccia.Riccardo,Milani.Carlo}@asf.ansaldo.it

Abstract. This paper describes an ongoing project that involves DISI, the Computer Science Department of Genova University, and Ansaldo Segnalamento Ferroviario, the Italian leader in the design and construction of signalling and automation systems for railway lines. We are implementing a multiagent system that monitors processes running in a railway signalling plant, detects functioning anomalies, provides diagnoses for explaining them, and promptly notifies problems to the Command and Control System Assistance. Due to the intrinsic rule-based nature of monitoring and diagnostic agents, we have adopted a logic-based language for implementing them.

1 Introduction

According to the well-known definition of N. Jennings, K. Sycara and M. Wooldridge, an agent is "a computer system, situated in some environment, that is capable of flexible autonomous action in order to meet its design objectives" [1]. Distributed diagnosis and monitoring represent one of the oldest application fields of declarative software agents. For example, ARCHON (ARchitecture for Cooperative Heterogeneous ON-line systems [2]) was Europe's largest ever project in the area of Distributed Artificial Intelligence and exploited rule-based agents.
In [3], Schroeder et al. describe a diagnostic agent based on extended logic programming. Many other multiagent systems (MASs) for diagnosis and monitoring based on declarative approaches have been developed in the past [4] ,[5], [6], [7]. One of the most important reasons for exploiting agents in the monitoring and diagnosis application domain is that an agent-based distributed infrastructure can be added to any existing system with minimal or no impact over it. Agents look at the processes they must monitor, be they computer processes, business processes, chemical processes, by “looking over their shoulders” without interfering with their activities. The “no-interference” feature has an enormous importance, since changing the processes in order to monitor them would be often unfeasible. Also the increase of situational awareness, essential for coping with the situational complexity of most large real applications, motivates an agent-oriented approach. Situational awareness is a mandatory feature for the successful monitoring and decision-making in many scenarios. When combined with reactivity, situatedness may E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 108–115, 2009. springerlink.com © Springer-Verlag Berlin Heidelberg 2009 Monitoring and Diagnosing Railway Signalling 109 lead to the early detection of, and reaction to, anomalies. The simplest and most natural form of reasoning for producing diagnoses starting from observations is based on rules. This paper describes a joint academy-industry project for monitoring and diagnosing railway signalling with distributed agents implemented in Prolog, a logic programming language suitable for rule-based reasoning. The project involves the Computer Science Department of Genova University, Italy, and Ansaldo Segnalamento Ferroviario, a company of Ansaldo STS group controlled by Finmeccanica, the Italian leader in design and construction of railway signalling and automation systems. The project aims to develop a monitoring and diagnosing multiagent system implemented in JADE [8] extended with the tuProlog implementation of the Prolog language [9] by means of the DCaseLP libraries [10]. The paper is structured in the following way: Section 2 describes the scenario where the MAS will operate, Section 3 describes the architecture of the MAS and the DCaseLP libraries, Section 4 concludes and outlines the future directions of the project. 2 Operating Scenario The Command and Control System for Railway Circulation (“Sistema di Comando e Controllo della Circolazione Ferroviaria”, SCC) is a framework project for the technological development of the Italian Railways (“Ferrovie dello Stato”, FS), with the following targets: • introducing and extending automation to the command and control of railway circulation over the principal lines and nodes of the FS network; • moving towards innovative approaches for the infrastructure and process management, thanks to advanced monitoring and diagnosis systems; • improving the quality of the service offered to the FS customers, thanks to a better regularity of the circulation and to the availability of more efficient services, like delivery of information to the customers, remote surveillance, and security. The SCC project is based on the installation of centralized Traffic Command and Control Systems, able to remotely control the plants located in the railway stations and to manage the movement of trains from the Central Plants (namely, the offices where instances of the SCC system are installed). 
In this way, Central Plants become command and control centres for railway lines that represent the main axes of circulation, and for railway nodes with high traffic volumes, corresponding to the main metropolitan and intermodal nodes of the FS network. An element that strongly characterizes the SCC is the strict coupling of functionalities for circulation control and functionalities for diagnosis and support to upkeep activities, with particular regard to predictive diagnostic functionalities aimed to enable on-condition upkeep. The SCC of the node of Genova, which will be employed as a case-study for the implementation of the first MAS prototype, belongs to the first six plants developed by Ansaldo Segnalamento Ferroviario. They are independent one from the other, but networked by means of a WAN, and cover 3000 Km 110 V. Mascardi et al. of the FS network. The area controlled by the SCC of Genova covers 255 km, with 28 fully equipped stations plus 20 stops. The SCC can be decomposed, both from a functional and from an architectural point of view, into five subsystems. The MAS that we are implementing will monitor and diagnose critical processes belonging to the Circulation subsystem whose aims are to implement remote control of traffic and to make circulation as regular as possible. The two processes we have taken under consideration are Path Selection and Planner. The Planner process is the back-end elaboration process for the activities concerned with Railway Regulation. There is only one instance of the Planner process in the SCC, running on the server. It continuously receives information on the position of trains from sensors located in the stations along the railway lines, checks the timetable, and formulates a plan for ensuring that the train schedule is respected. A plan might look like the following: “Since the InterCity (IC) train 5678 is late, and IC trains have highest priority, and there is the chance to reduce the delay of 5678 if it overtakes the regional train 1234, then 1234 must stop on track 1 of Arquata station, and wait for 5678 to overtake it”. Plans formulated by the Planner may be either confirmed by the operators working at the workstations or modified by them. The process that allows operators to modify the Planner's decision is Path Selection. The Path Selection process is the front-end user interface for the activities concerned with Railway Regulation. There is one Path Selection process running on each workstation in the SCC. Each operator interacts with one instance of this process. There are various operators responsible for controlling portions of the railway line, and only one senior operator with a global view of the entire area controlled by the SCC. The senior operator coordinates the activities of the other operators and takes the final decisions. The Path Selection process visualizes decisions made by the Planner process and allows the operator to either confirm or modify them. For example, the Planner might decide that a freight train should stop on track 3 of Ronco Scrivia station in order to allow a regional train to overtake it, but the operator in charge for that railway segment might think that track 4 is a better choice for making the freight train stop. In this case, after receiving an “ok” from the senior operator, the operator may input its choice thanks to the interface offered by the Path Selection process, and this choice overcomes the Planner's decisions. 
The Planner will re-plan its decisions according to the new input given by the human operator. The SCC Assistance Centre, that provides assistance to the SCC operators in case of problems, involves a large number of domain experts, and is always contacted after the evidence of a malfunctioning. The cause of the malfunctioning, however, might have generated minutes, and sometimes hours, before that the SCC operator(s) experienced the malfunctioning. By integrating a monitoring and diagnosing MAS to the circulation subsystem, we aim to equip any operator of the Central Plant with the means for early detecting anomalies that, if reported in a short time, and before their effects have propagated to the entire system, may allow the prevention of more serious problems including circulation delays. When a malfunctioning is detected, the MAS, besides alerting the SCC operator, will always alert the SCC Assistance Centre in an automatic way. This will not only speed up the implementation of repair actions, but also allow the SCC Assistance Centre with recorded traces of what happened in the Central Plant. Monitoring and Diagnosing Railway Signalling 111 3 Logic-Based Monitoring Agents In order to provide the functionalities required by the operating scenario, we designed a MAS where different agents exist and interact. The MAS implementation is under way but its architecture has been completely defined, as well as the tools and languages that will be exploited for implementing a first prototype. The following sections provide a brief overview of DCaseLP, the multi-language prototyping environment that we are going to use, and of the MAS architecture and implementation. 3.1 DCaseLP: A Multi-language Prototyping Environment for MASs DCaseLP [10] stands for Distributed Complex Applications Specification Environment based on Logic Programming. Although initially born as a logic-based framework, as the acronym itself suggests, DCaseLP has evolved into a multi-language prototyping environment that integrates both imperative (object-oriented) and declarative (rule-based and logic-based) languages, as well as graphical ones. The languages and tools that DCaseLP integrates are UML and an XML-based language for the analysis and design stages, Java, JESS and tuProlog for the implementation stage, and JADE for the execution stage. Software libraries for translating UML class diagrams into code and for integrating JESS and tuProlog agents into the JADE platform are also provided. Both the architecture of the MAS, and the agent interactions taking place there, are almost simple. On the contrary, the rules that guide the behaviour of agents are sophisticated, leading to implement agents able to monitor running processes and to quickly diagnose malfunctions. For these reasons, among the languages offered by DCaseLP, we have chosen to use tuProlog for implementing the monitoring agents, and Java for implementing the agents at the lowest architectural level, namely those that implement the interfaces to the processes. In this project we will take advantage neither of UML nor of XML, which prove useful for specifying complex interaction protocols and complex system architectures. Due to space constraints, we cannot provide more details on DCaseLP. Papers describing its usage, as well as manuals, tutorials, and the source code of the libraries it provides for integrating JESS and tuProlog into JADE can be downloaded from http://www.disi.unige.it/person/MascardiV/Software/DCaseLP.html. 
3.2 MAS Architecture The architecture of the MAS is depicted in Figure 1. There are four kinds of agent, organized in a hierarchy: Log Reader Agents, Process Monitoring Agents, Computer Monitoring Agents, and Plant Monitoring Agents. Agents running on remote computers are connected via a reliable network whose failures are quickly detected and solved by an ad hoc process (already existing, out of the MAS). If the network becomes unavailable for a short time, the groups of agents running on the same computer can go on with their local work. Messages directed to remote agents are saved in a local buffer, and are sent as soon as the network comes up again. The human operator is never alerted by the MAS about a network failure, and local diagnoses continue to be produced. 112 V. Mascardi et al. Fig. 1. The system architecture Log Reader Agent. In our MAS, there is one Log Reader Agent (LRA) for each process that needs to be monitored. Thus, there may be many LRAs running on the same computer. Once every m minutes, where m can be set by the MAS configurator, the LRA reads the log file produced by the process P it monitors, extracts information from it, produces a symbolic representation of the extracted information in a format amenable of logic-based reasoning, and sends the symbolic representation to the Process Monitoring Agent in charge of monitoring P. Relevant information to be sent to the Process Monitoring Agent include loss of connection to the net and life of the process. The parameters, and corresponding admissible values, that an LRA associated with the “Path Selection” process extracts from its log file, include connection_to_server (active, lost); answer_to_life (ready, slow, absent); cpu_usage (normal, high); memory_usage (normal, high); disk_usage (normal, high); errors (absent, present). The answer_to_life parameter corresponds to the time required by a process to answer a message. A couple of configurable thresholds determines the value of this paramenter: time < threshold1 implies answer_to_life ready; threshold1 <= time < threshold2 implies answer_to_life slow; time >= threshold2 implies answer_to_life absent. In a similar way, the parameters and corresponding admissible values that an LRA associated with the “Planner” process extracts from its log file include connection_to_client (active, lost); computing_time (normal, high); managed_conflicts (normal, high); managed_trains (normal, high); answer_to_life, cpu_usage, memory_usage, disk_usage, errors, in the same way as the “Path Selection” process. LRAs have a time-driven behaviour, do not perform any reasoning over the information extracted from the log file, and are neither proactive, nor autonomous. They may be considered very simple agents, mainly characterized by their social ability, employed to decouple the syntactic processing of the log files from their semantic Monitoring and Diagnosing Railway Signalling 113 processing, entirely demanded to the Process Monitoring Agents. In this way, if the format of the log produced by process P changes, only the LRA that reads that log needs to be modified, with no impact on the other MAS components. Process Monitoring Agent. 
Process Monitoring Agents (PMAs) are in a one-to-one correspondence with LRAs: the PMA associated with process P receives the information sent by the LRA associated with P, looks for anomalies in the functioning of P, provides diagnoses for explaining their cause, reports them to the Computer Monitoring Agent (CMA), and in case kills and restarts P if necessary. It implements a sort of social, context-aware, reactive and proactive expert system characterized by rules like: • if the answer to life of process P is slow, and it was slow also in the previous check, then there might be a problem either with the network, or with P. The PMA has to inform the CMA, and to wait for a diagnosis from the CMA. If the CMA answers that there is no problem with the network, then the problem concerns P. The action to take is to kill and restart P. • if the answer to life of process P is absent, then the PMA has to inform the CMA, and to kill and restart P. • if the life of process P is right, then the PMA has to do nothing. Computer Monitoring Agent. The CMA receives all the messages arriving from the PMAs that run on that computer, and is able to monitor parameters like network availability, CPU usage, memory usage, hard disk usage. The messages received from PMAs together with the values of the monitored parameters allow the CMA to make hypotheses on the functioning of the computer where it is running, and of the entire plant. For achieving its goal, the CMA includes rules like: • if only one PMA reported problems local to the process it monitors, then there might be temporary problems local to the process. No action needs to be taken. • if more than one PMA reported problems local to the process it monitors, and the CPU usage is high, then there might be problems local to this computer. The action to take is to send a message to the Plant Monitoring Agent and to make a pop-up window appear on the computer monitor, in order to alert the operator working with this computer. • if more than one PMA reported problems due either to the network, or to the server accessed by the process, and the network is up, then there might be problems to the server accessed by the process. The action to take is to send a message to the Plant Monitoring Agent to alert it and to make a pop-up window appear on the computer monitor. If necessary, the CMA may ask for more information to the PMA that reported the anomaly. For example, it may ask to send a detailed description of which hypotheses led to notify the anomaly, and which rules were used. Plant Monitoring Agent. There is one Plant Monitoring Agent (PlaMA) for each plant. The PlaMA receives messages from all the CMAs in the plant and makes diagnoses and decisions according to the information it gets from them. It implements rules like: 114 V. Mascardi et al. • if more than one CMA reported a problem related to the same server S, then the server S might have a problem. The action to take is to notify the SCC Assistance Centre in an automatic way. • if more than one CMA reported a problem related to a server, and the servers referred to by the CMAs are the different, then there might be a problem of network, but more information is needed. The action to take is to ask to the CMAs that reported the anomalies more information about them. • if more than one CMA reported a problem of network then there might be a problem of network. The action to take is to notify the SCC Assistance Centre in an automatic way. 
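In the project itself these conditions are written as tuProlog clauses running inside JADE; purely to illustrate their shape, here is a language-neutral sketch in Python of the answer-to-life classification performed on the Log Reader Agent side and of two of the Process Monitoring Agent rules described above. The class and helper names (cma.notify, ask_network_status, kill_and_restart) are hypothetical.

```python
def classify_answer_to_life(response_time, threshold1, threshold2):
    """Log Reader Agent side: map a measured response time to the symbolic
    values used in the reports (the two thresholds are configurable in the MAS)."""
    if response_time < threshold1:
        return "ready"
    if response_time < threshold2:
        return "slow"
    return "absent"


class ProcessMonitoringAgent:
    """Hedged sketch of PMA logic; the real agents are tuProlog rules in JADE."""

    def __init__(self, process_name, cma):
        self.process_name = process_name
        self.cma = cma                      # proxy to the Computer Monitoring Agent
        self.previous_answer = None

    def on_report(self, report):
        """report: dict sent by the Log Reader Agent, e.g.
        {'answer_to_life': 'slow', 'cpu_usage': 'normal', 'errors': 'absent'}"""
        answer = report.get("answer_to_life")

        if answer == "slow" and self.previous_answer == "slow":
            # Slow twice in a row: either the network or process P is at fault.
            self.cma.notify(self.process_name, "answer_to_life slow twice")
            if self.cma.ask_network_status() == "network_ok":
                self.kill_and_restart()      # the problem concerns P itself
        elif answer == "absent":
            self.cma.notify(self.process_name, "answer_to_life absent")
            self.kill_and_restart()
        # answer == "ready": nothing to do.

        self.previous_answer = answer

    def kill_and_restart(self):
        print(f"restarting process {self.process_name}")
```

Rules of this if-then shape translate almost one-to-one into Prolog clauses, which is one reason tuProlog was chosen for the PMA, CMA and PlaMA layers.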
The SCC Assistance Centre receives notifications from all the plants spread around Italy. It may cross-relate information about anomalies and occurred failures, and take repair actions in a centralized, efficient way. As far as the MAS implementation is concerned, LRAs are being implemented as “pure” JADE agents: they just parse the log file and translate it into another format, thus there is no need to exploit Prolog for their implementation. PMAs, CMAs, and PlaMA are being implemented in tuProlog and integrated into JADE by means of the libraries offered by DCaseLP. Prolog is very suitable for implementing the agent rules that we gave in natural language before. 4 Conclusions The joint DISI-Ansaldo project confirms the applicability of MAS technologies for concrete industrial problems. Although the adoption of agents for process control is not a novelty, the exploitation of declarative approaches outside the boundaries of academia is not widespread. Instead, the Ansaldo partners consider Prolog a suitable language for rapid prototyping of complex systems: this is a relevant result. The collaboration will lead to the implementation of a first MAS prototype by the end of June 2008, to its experimentation and, in the long term perspective, to the implementation of a real MAS and its installation in both central plants and stations along the railway lines. The possibility to integrate agents that exploit statistical learning methods to classify malfunctions and agents that mine functioning reports in order to identify patterns that led to problems, will be considered. More sophisticated rules will be added in the agents in order to capture the situation where a process is killed and restarted a suspicious number of times in a time unit. In that case, the PlaMa will need to urgently inform the SCC Assistance. The transfer of knowledge that DISI is currently performing will allow Ansaldo to pursue its goals by exploiting internal competencies in a not-so-distant future. Acknowledgments. The authors acknowledge Gabriele Arecco who is implementing the MAS described in this paper, and the anonymous reviewers for their constructive comments. Monitoring and Diagnosing Railway Signalling 115 References 1. Jennings, N.R., Sycara, K.P., Wooldridge, M.: A roadmap of agent research and development. Autonomous Agents and Multi-Agent Systems 1(1), 7–38 (1998) 2. Jennings, N.R., Mamdani, E.H., Corera, J.M., Laresgoiti, I., Perriollat, F., Skarek, P., Zsolt Varga, L.: Using Archon to develop real-world DAI applications, part 1. IEEE Expert 11(6), 64–67 (1996) 3. Schroeder, M., De Almeida Mòra, I., Moniz Pereira, L.: A deliberative and reactive diagno-sis agent based on logic programming. In: Rao, A., Singh, M.P., Wooldridge, M.J. (eds.) ATAL 1997. LNCS, vol. 1365, pp. 293–307. Springer, Heidelberg (1998) 4. Leckie, C., Senjen, R., Ward, B., Zhao, M.: Communication and coordination for intelligent fault diagnosis agents. In: Proc. of 8th IFIP/IEEE International Workshop for Distributed Systems Operations and Management, DSOM 1997, pp. 280–291 (1997) 5. Semmel, G.S., Davis, S.R., Leucht, K.W., Rowe, D.A., Smith, K.E., Boloni, L.: Space shuttle ground processing with monitoring agents. IEEE Intelligent Systems 21(1), 68–73 (2006) 6. Weihmayer, T., Tan, M.: Modeling cooperative agents for customer network control using planning and agent-oriented programming. In: Proc. of IEEE Global Telecommunications Conference, Globecom 1992, pp. 537–543. IEEE, Los Alamitos (1992) 7. 
Balduccini, M., Gelfond, M.: Diagnostic reasoning with a-prolog. Theory Pract. Log. Program. 3(4), 425–461 (2003) 8. Bellifemine, F.L., Caire, G., Greenwood, D.: Developing Multi-Agent Systems with JADE. Wiley, Chichester (2007) 9. Denti, E., Omicini, A., Ricci, A.: Tuprolog: A lightweight prolog for internet applications and infrastructures. In: Ramakrishnan, I.V. (ed.) 3rd International Symposium on Practical Aspects of Declarative Languages, PADL 2001, Proc., pp. 184–198. Springer, Heidelberg (2001) 10. Mascardi, V., Martelli, M., Gungui, I.: DCaseLP: A prototyping environment for multilanguage agent systems. In: Dastani, M., El-Fallah Seghrouchni, A., Leite, J., Torroni, P. (eds.) 1st Int. Workshop on Languages, Methodologies and Development Tools for MultiAgent Systems, LADS 2007, Proc. LNCS. Springer, Heidelberg (to appear, 2008) SeSaR: Security for Safety Ermete Meda1, Francesco Picasso2, Andrea De Domenico1, Paolo Mazzaron1, Nadia Mazzino1, Lorenzo Motta1, and Aldo Tamponi1 1 Ansaldo STS, Via P. Mantovani 3-5, 16151 Genoa, Italy {Meda.Ermete,DeDomenico.Andrea,Mazzaron.Paolo,Mazzino.Nadia, Motta.Lorenzo,Tamponi.Aldo}@asf.ansaldo.it 2 University of Genoa, DIBE, Via Opera Pia 11-A, 16145 Genoa, Italy Francesco.Picasso@unige.it Abstract. The SeSaR (Security for Safety in Railways) project is a HW/SW system conceived to protect critical infrastructures against cyber threats - both deliberate and accidental – arising from misuse actions operated by personnel from outside or inside the organization, or from automated malware programs (viruses, worms and spyware). SeSaR’s main objective is to strengthen the security aspects of a complex system exposed to possible attacks or to malicious intents. The innovative aspects of SeSaR are manifold: it is a non-invasive and multi-layer defense system, which subjects different levels and areas of computer security to its checks and it is a reliable and trusted defense system, implementing the functionality of Trusted Computing. SeSaR is an important step since it applies to different sectors and appropriately responds to the more and more predominant presence of interconnected networks, of commercial systems with heterogeneous software components and of the potential threats and vulnerabilities that they introduce. 1 Introduction The development and the organization of industrialized countries are based on an increasingly complex and computerized infrastructure system. The Critical National Infrastructures, such as Healthcare, Banking & Finance, Energy, Transportation and others, are vital to citizen security, national economic security, national public health and safety. Since the Critical National Infrastructures may be subject to critical events, such as failures, natural disasters and intentional attacks, capable to affect human life and national efficiency both directly and indirectly, these infrastructures need a an aboveaverage level of protection. In the transportation sector of many countries the introduction into their national railway signaling systems of the new railway interoperability system, called ERTMS/ETCS (European Rail Traffic Management System/European Train Control System) is very important: in fact the goal of ERTMS/ETCS is the transformation of a considerable part of the European - and not just European - railway system into an interoperable High Speed/High Capacity system. 
In particular, thanks to this innovative signaling system the Italian railways intend to meet the new European requirements and the widespread demand for higher speed and efficiency in transportation.

The development of ERTMS/ETCS has set the Italian railways and the Italian railway signaling industry among the world leaders, but it has introduced new risks associated with the use of information infrastructures, such as failures due to natural events and human-induced problems. The latter may be due to negligence and/or malice and can directly and heavily affect ICT security. ICT security management is therefore becoming complex. Even for railway applications, where the safety of the transport systems is still maintained, information threats and vulnerabilities can affect the continuity of operation, causing not only inconvenience to train traffic but also financial losses and damage to the corporate image of the railway authorities and of their suppliers.

2 Implementation of ICT Security

To illustrate the main idea behind the SeSaR (Security for Safety in Railways) project, the concept of ICT security should be defined first: implementing security means developing the activities of prevention, detection and reaction to an incident. In particular, the following definitions hold:

• Prevention: all the technical defences and countermeasures enabling a system to avoid attacks and damage. The drawback of this activity is that, in order to be effective, prevention must be prolonged indefinitely in time, thus requiring a constant organisational commitment, continuous training of personnel and adjustment/updating of the countermeasures.
• Detection: when prevention fails, detection should intervene; but new malicious code is capable of self-mutation, and "zero-day" attacks render the adopted countermeasures ineffective, so an adequate defence architecture cannot do without instruments capable of analyzing, in real time, the tracks and records stored in computers and in the defence network equipment.
• Reaction: the last activity, triggered either as a consequence of prompt detection or when the first two techniques were ineffective and problems start to affect the system in an apparent way. Of course, in the latter case it is difficult to guarantee the absence of inefficiency, breakdowns and malfunctions.

The goal of SeSaR was to find a tool that could merge the Prevention and Detection activities, in the belief that the effectiveness of the solution sought could be based on the ability to discover and communicate what is new, unusual and anomalous. The idea was to design a hardware/software platform that operates as follows:

• Prevention: perform a real-time integrity control of the hardware and software assets.
• Detection: perform a real-time analysis of the tracks and records stored in critical systems and in defence systems, seeking meaningful events and asset changes.
• Operator Interface: create a simple, concise representation, consisting of a console and a mimic-panel overview, that gives the operator a clear and comprehensive indication of the anomalies identified and enables immediate reaction.
3 Architecture

The scope of SeSaR is the ERTMS/ETCS signaling system infrastructure used for "Alta Velocità Roma-Napoli", i.e. the Rome-Naples high-speed railway line. In summary, the Rome-Naples high-speed system consists of the Central Post (PCS, acronym for "Posto Centrale Satellite") located at the Termini station in Rome, and of 19 Peripheral Posts (PPF, acronym for "Posto Periferico Fisso"). The PPFs are distributed along the line and linked together and with the PCS by a WAN.

SeSaR supplies hardware/software protection against ICT threats - deliberate and accidental - arising from undue actions operated by individuals external or internal to the ICT infrastructure or by automated malicious programs, such as viruses, worms, trojans and spyware. The distinctive and innovative features of SeSaR are:

• a non-invasive defence system: it works independently of the ICT infrastructure and operating system (Unix, Linux or Windows);
• a multi-layer defence system, operating controls over different levels and areas of computer security, such as Asset Control, Log Correlation and an integrated, interacting Honeynet;
• a reliable defence system with the capability to implement the functionality of Trusted Computing.

In particular, SeSaR carries out the following main functions, as shown in Fig. 1:

Fig. 1. SeSaR: multi-level defence system

• real-time monitoring of the assets of the ICT infrastructure;
• real-time correlation of the tracks and records collected by security devices and critical systems;
• a Honeynet feature, to facilitate the identification of intrusions or incorrect behaviour by creating "traps and information bait";
• a Forensic feature, which allows the records accumulated during the treatment of attacks to be accompanied by a digital signature and a time stamp so that they can have probative value. This feature also contributes to building "confidence" in the system through constant monitoring of the system's integrity.

SeSaR protects both the Central Post network and the whole PPF network against malicious attacks by internal or external illicit actions. As already mentioned, three functional components are included: the Asset Control Engine, the Log Correlation Engine and the Honeynet.

3.1 Asset Control Engine

This component checks in real time the network configuration of the system's IP-based assets, showing any relevant modification, in order to detect the unauthorized intrusion or the absence (including shut-down and Ethernet card malfunctioning) of any IP node. The principle of operation is based on:

• monitoring the real-time network configuration of all IP devices through SNMP (Simple Network Management Protocol) queries to all the switches of the system;
• comparing this information with the reference configuration known and stored in the SeSaR database.

Each variation found generates an alarm on the console and on a mimic panel. To get updated information from the switches, the SeSaR Asset Control Engine (SACE, also called CAI, an acronym for Controllo Asset Impianto) uses the MIB (Management Information Base) through the SNMP protocol. In a switch, the MIB is a kind of database containing information about the status of the ports and the IP and MAC addresses of the connected hosts. By interrogating the switches, CAI determines whether their ports are connected to IP nodes and the corresponding IP and MAC addresses.
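A minimal sketch of the CAI comparison step follows. It assumes that the switch interrogation has already produced a list of (IP, MAC, switch port) tuples — in the real system these come from SNMP MIB queries — and diffs them against the reference database, raising intrusion, relocation and missing-node alarms. The records, addresses and host names are hypothetical.

```python
# Hypothetical reference configuration: for each known host,
# the expected (hostname, switch port), keyed by (IP, MAC).
REFERENCE_DB = {
    ("10.1.1.10", "00:1a:2b:3c:4d:01"): ("pcs-server-1", "sw01/3"),
    ("10.1.1.11", "00:1a:2b:3c:4d:02"): ("pcs-console-1", "sw01/4"),
    ("10.2.5.20", "00:1a:2b:3c:4d:10"): ("ppf07-logger", "sw07/1"),
}

def compare_with_reference(discovered):
    """discovered: iterable of (ip, mac, port) tuples read from the switches.
    Returns the list of alarm messages to show on the console and mimic panel."""
    alarms, seen = [], set()
    for ip, mac, port in discovered:
        key = (ip, mac)
        seen.add(key)
        if key not in REFERENCE_DB:
            alarms.append(f"INTRUSION: unknown node {ip} ({mac}) on port {port}")
        else:
            hostname, expected_port = REFERENCE_DB[key]
            if port != expected_port:
                alarms.append(f"MOVED: {hostname} now on {port}, expected {expected_port}")
    for (ip, mac), (hostname, expected_port) in REFERENCE_DB.items():
        if (ip, mac) not in seen:
            alarms.append(f"MISSING: {hostname} ({ip}) not seen on {expected_port}")
    return alarms

# Example discovery result: one known host missing, one unknown host present.
discovered_nodes = [
    ("10.1.1.10", "00:1a:2b:3c:4d:01", "sw01/3"),
    ("10.9.9.99", "00:de:ad:be:ef:00", "sw01/7"),
    ("10.2.5.20", "00:1a:2b:3c:4d:10", "sw07/1"),
]
for alarm in compare_with_reference(discovered_nodes):
    print(alarm)
```

In the fully operational phase this comparison runs cyclically, and each refused modification stays on the operator console until it is explicitly acknowledged, as described in the phases below.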
The CAI activities can be divided into the following subsequent phases: PHASE 1) Input Data Definition and Populating the SeSaR Reference Data-Base This phase includes the definition of a suitable database and the importation into it of the desired values for the following data, forming the initial specification of the network configuration: a) IP address and Netmask for each system switch and router; b) IP address, Netmask and Hostname for each system host; c) Each System subnet with the relevant Geographic name. Such operations can also be performed with SeSaR disconnected from the system. PHASE 2) Connecting SeSaR to the system and implementing the Discovery activity This phase is constituted by a Discovery activity, which consists in getting, for each switch of the previous phase, the Port number, MAC address and IP address of each node connected to it. The SeSaR reference database is thus verified and completed with MAC Address and Port number, obtaining the 4 network parameters which describe each node (IP address, Hostname, MAC address and Number of the switch port connected to the node). 120 E. Meda et al. Said research already lets the operator identify possible nodes which are missing or intruding and lets him/her activate alarms: an explicit operator acknowledge is required for each node corresponding to a modification and all refused nodes will be treated by SeSaR as alarms on the Operator Console (or GUI, Graphic User Interface) and appropriately signalled on the mimic panel. Of course whenever modifications are accepted the reference database is updated. PHASE 3) Asset Control fully operational activity In such phase CAI cyclically and indefinitely verifies in real-time the network configuration of all the system assets, by interrogating the switches and by getting, for each network node, any modification concerning not only the IP address or Port number, as in the previous phase, but also the MAC address; the alarm management is performed as in the said phase. 3.2 Log Correlation Engine Firewall and Antivirus devices set up the underlying first defence countermeasures and they are usually deployed to defend the perimeter of a network. Independently of how they are deployed, they discard or block data according to predefined rules: the rules are defined by administrators in the firewall case, while they are signature matching rules defined by the vendor in the antivirus case. Once set up, these devices operate without the need of human participation: it could be said that they silently defend the network assets. Recently new kinds of devices, such as intrusion detection/prevention systems, have become available, capable to monitor and to defend the assets: they supply proactive defence in addition to the prevention defence of firewalls and antiviruses. Even using ID systems it is difficult for administrators to get a global vision of the asset’s security state since all devices work separately from each other. To achieve this comprehensive vision there is a need of a correlation process of the information made available by security devices: moreover, given the great amount of data, this process should be an automatic or semi-automatic one. The SeSaR Log Correlation Engine (SLCE, also called CLI, an acronym for Correlazione Log Impianto) achieves this goal by operating correlations based on security devices’ logs. 
These logs are created by security devices while monitoring the information flow and taking actions: such logs are typically used to debug rules configurations and to investigate a digital incident, but they can be used in a log correlation process as well. It is important to note that the correlation process is located above the assets’ defence layer, so it is transparent to the assets it monitors. This property has a great relevance especially when dealing with critical infrastructures, where changes in the assets require a large effort – time and resources – to be developed. The SLCE’s objective is to implement a defensive control mechanism by executing correlations within each single device’s log and among multiple devices’ alerts (Fig. 2). The correlation process is set up in two phases: the Event Correlation phase (E.C.) named Vertical Correlation and the Alert Correlation (A.C.) phase named Horizontal Correlation. This approach rises from a normalization need: syntactic, semantic and numeric normalization. Every sensor (security device) either creates logs expressed in its own format or, when a common format is used, it gives particular meanings to its log fields. Moreover the number of logs created by different sensors may differ by several orders of magnitude: for example, the number of logs created by a firewall sensor is usually largely greater than the number of logs created by an antivirus sensor. SeSaR: Security for Safety 121 Fig. 2. Logs management by the SeSaR Log Correlation Engine So the first log management step is to translate the different incoming formats into a common one, but there is the need for a next step at the same level. In fact, by simply translating formats, the numeric problem is not taken into account: depending on the techniques used in the correlation processes, by mixing two entities, which greatly differ in size, the information carried by the smallest one could be lost. In fact, at higher level, it should be better to manage a complete new entity than working on different entities expressed in a common format. The Vertical Correlation takes into account these aspects by acting on every sensor in four steps: 1. 2. 3. 4. it extracts the logs from a device; it translates the logs from the original format into an internal format (metadata) useful to the correlation process; it correlates the metadata on-line and with real-time constraints by using sketches [3]; if anomalies are detected, it creates a new message (Alert), which will carry this information by using the IDMEF format [4]. The correlation process is called vertical since it is performed on a single sensor basis: the SensorAgent software manages a single device and implements E.C. (the Vertical Correlation) using the four steps described above. The on-line, real-time and fixed upper-bound memory usage constraints allow for the development of lightweight software SensorAgents which can be modified to fit the context and which could be easily embedded in special devices. The Alerts created by Sensor Agents (one for each security device) are sent to the Alert Correlator Engine (ACE) to be further investigated and correlated. Basically ACE works in two steps: the aggregation step, in which the alerts are aggregated based on content’s fields located at the same level in IDMEF, without taking into consideration the link between different fields; the matching scenario step, in which 122 E. Meda et al. the alerts are aggregated based on the reconstruction and match of predefined scenarios. 
In both cases multiple Alerts – or even a single Alert, if it represents a critical situation – form an Alarm which will be signalled by sending it to the Operator Interface (console and mimic panel). Alarms share the IDMEF format with Alerts but the former ones get an augmented semantic meaning given to them during the correlation processes: Alerts represent anomalies that must be verified and used to match scenarios, while Alarms represent threats. 3.3 Honeynet This SeSaR’s component identifies anomalous behaviours of users: not only operators, but also maintainers, inspectors, assessors, external attackers, etc... Honeynet is composed of two or more honeypots. Any honeypot simulates the behaviour of the real host of the critical infrastructure. The honeypot security level is lower than the real host security level, so the honeypot creates a trap both for malicious and negligent users. Data capturing (keystroke collection and capture traffic entering the honeypots), data controlling (monitoring and filtering permitted or unauthorised information) and data analysis (operating activities on the honeypots) the functionalities it performs. 4 User Console and Mimic Panel Any kind of alarms coming from Asset Control, Log Correlation and Honeynet are displayed to the operator both in textual and graphical format: the textual format by a web based GUI (Graphic User Interface), the graphical format by a mimic panel. The operators can manage the incoming alarms on the GUI. There are three kinds of alarm status: incoming (waiting to be processed by an operator), in progress and closed. The latter ones are the responsibility of operators. 5 Conclusion SeSaR works independently of the ICT infrastructure and Operative System. Therefore SeSaR can be applied in any other context, even if SeSaR is being designed for railway computerized infrastructure. SeSaR and its component are useful to improve any kind of existing security infrastructure and it can be a countermeasure for any kind of ICT infrastructure. References 1. Meda, E., Sabbatino, V., Doro Altan, A., Mazzino, N.: SeSaR: Security for Safety, AirPort&Rail, security, logistic and IT magazine, EDIS s.r.l., Bologna (2008) 2. Forte, D.V.: Security for safety in railways, Network Security. Elsevier Ltd., Amsterdam (2008) 3. Muthukrishnan, S.: Data Streams: Algorithms and Applications. Foundations and Trends in Theoretical Computer Science (2005) 4. RFC 4765, The Intrusion Detection Message Exchange Format (IDMEF)(March 2007) Automatic Verification of Firewall Configuration with Respect to Security Policy Requirements Soutaro Matsumoto1 and Adel Bouhoula2 1 Graduate School of System and Information Engineering University of Tsukuba – Japan soutaro@score.cs.tsukuba.ac.jp 2 Higher School of Communication of Tunis (Sup'Com) University of November 7th at Carthage – Tunisia adel.bouhoula@supcom.rnu.tn Abstract. Firewalls are key security components in computer networks. They filter network traffics based on an ordered list of filtering rules. Firewall configurations must be correct and complete with respect to security policies. Security policy is a set of predicates, which is a high level description of traffic controls. In this paper, we propose an automatic method to verify the correctness of firewall configuration. We have defined a boolean formula representation of security policy. With the boolean formula representations of security policy and firewall configuration, we can formulate the condition that ensures correctness of firewall configuration. 
We use a SAT solver to check the validity of the condition. If the configuration is not correct, our method produces an example packet to help users correct the configuration. We have implemented a prototype verifier and obtained some experimental results. The first results were very promising.

Keywords: Firewall Configuration, Security Policy, Automatic Verification, SAT Solver.

1 Introduction

Firewalls are key security components in computer networks. They filter packets to control network traffic based on an ordered list of filtering rules. Filtering rules describe which packets should be accepted or rejected; each rule consists of a packet pattern and an action. A firewall rejects / accepts a packet if the first rule in its configuration that matches the packet rejects / accepts it. Security policies are high-level descriptions of traffic controls. They define which connections are allowed, and they are sets of predicates. A security policy rejects / accepts a connection if the most specific predicate that matches the connection rejects / accepts it.

Firewalls should be configured correctly with respect to security policies. A correct configuration rejects / accepts a connection if and only if the security policy rejects / accepts it. However, the correctness of a firewall configuration is not obvious. Consider for example the following security policy.

1. All users in LAN can access Internet
2. All users in LAN1 cannot access youtube.com
3. All users in LAN2 can access youtube.com

Fig. 1. Structure of the Network

There are three predicates. Assume that we have a network LAN, two sub-networks LAN1 and LAN2, and a website youtube.com in the Internet. The structure of the networks is shown in figure 1. The first predicate is the most general one; it allows all users in LAN to access any web site in the Internet. The second predicate is more specific than the first and prohibits users in LAN1 from accessing youtube.com. The third predicate allows users in LAN2 to access youtube.com. Under this security policy, users in LAN1 cannot access youtube.com: since the second predicate is the most specific one for a connection from LAN1 to youtube.com, the connection will be rejected.

Consider for example the following firewall configuration in Cisco's format.

    access-list 101 permit tcp 192.168.0.0 0.0.255.255 any eq 80
    access-list 102 deny tcp 192.168.1.0 0.0.0.255 233.114.23.1 0.0.0.0 eq 80
    access-list 103 permit tcp 192.168.2.0 0.0.0.255 233.114.23.1 0.0.0.0 eq 80

There are three rules, named 101, 102, and 103. Rule 101 permits connections from 192.168.*.* to any host (* is a wildcard). Rule 102 rejects packets from 192.168.1.* to 233.114.23.1. Rule 103 permits packets from 192.168.2.* to 233.114.23.1. Assume that the network LAN has address 192.168.*.*, LAN1 and LAN2 have 192.168.1.* and 192.168.2.* respectively, and youtube.com has address 233.114.23.1. This configuration can be read as a straightforward translation of the security policy. Unfortunately, it is incorrect: since the first rule that matches a given connection is applied, a connection from LAN1 matches the first rule and will be accepted even if the destination is youtube.com.

We propose a method to verify the correctness of firewall configurations with respect to security policies. Security policy P and firewall configuration F are translated into boolean formulae QP and QF. The correctness of the firewall configuration is reduced to the equivalence of the two formulae, which is checked via the satisfiability of QP ⇎ QF. If the formula is satisfiable, QP and QF are not equivalent and F is not correct; a counterexample packet is then produced for which P and F give different answers. The counterexample helps users to find and correct mistakes in their configurations.
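The paper's prototype encodes QP and QF for a SAT solver; purely as an illustration of the same check, the sketch below encodes the example policy and configuration above with the Z3 solver's Python API (a substitution made for this sketch, not the tool used by the authors) and asks for a packet on which the two disagree.

```python
from z3 import BitVec, BitVecVal, BoolVal, If, And, Solver, sat

src = BitVec('src', 32)
dst = BitVec('dst', 32)

def ip(a, b, c, d):
    return BitVecVal((a << 24) | (b << 16) | (c << 8) | d, 32)

def in_net(addr, net, prefix_len):
    # True when addr lies inside net/prefix_len.
    mask = BitVecVal(((1 << 32) - 1) ^ ((1 << (32 - prefix_len)) - 1), 32)
    return (addr & mask) == (net & mask)

lan   = in_net(src, ip(192, 168, 0, 0), 16)
lan1  = in_net(src, ip(192, 168, 1, 0), 24)
lan2  = in_net(src, ip(192, 168, 2, 0), 24)
ytube = dst == ip(233, 114, 23, 1)

# QF: firewall semantics -- the first matching rule wins (rules 101, 102, 103).
QF = If(lan, BoolVal(True),
     If(And(lan1, ytube), BoolVal(False),
     If(And(lan2, ytube), BoolVal(True), BoolVal(False))))

# QP: policy semantics -- the most specific matching predicate wins,
# so the specific predicates are tested before the general one.
QP = If(And(lan1, ytube), BoolVal(False),
     If(And(lan2, ytube), BoolVal(True),
     If(lan, BoolVal(True), BoolVal(False))))

solver = Solver()
solver.add(QP != QF)      # satisfiable iff F disagrees with P on some packet
if solver.check() == sat:
    model = solver.model()
    def dotted(v):
        x = model[v].as_long()
        return ".".join(str((x >> k) & 0xFF) for k in (24, 16, 8, 0))
    print("counterexample packet:", dotted(src), "->", dotted(dst))
else:
    print("configuration is correct with respect to the policy")
```

The only packets on which the two formulae disagree are those from 192.168.1.* to 233.114.23.1: the configuration accepts them because rule 101 matches first, while the policy denies them, which is exactly the mistake discussed above.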
We propose a method to verify the correctness of firewall configurations with respect to security policies. Security policy P and firewall configuration F are translated into boolean formulae QP and QF. The correctness of the firewall configuration is reduced to the equivalence of the two formulae. The equivalence of the two formulae is checked through the satisfiability of QP ⇎ QF. If the formula is satisfiable, QP and QF are not equivalent and F is not correct. A counterexample packet is then produced such that P and F give different answers for that packet. The counterexample helps users to find and correct mistakes in their configurations.

1.1 Related Work

There is a large body of work on the development of formal languages for security policy description [1, 2]. Our formalization of security policy is very abstract, but essentially equivalent to them. Since those formalizations are quite technical, we work with a simplified definition of security policy. There are some works on efficient testing of firewalls based on security policies [3, 4]. Their approaches generate test cases from the security policy and then test the network with those test cases. Our method, in contrast, only verifies the correctness of the firewall configuration. This simplification makes our method very simple and fast. The essence of our method is the definition of the boolean formula representation; the other steps are applications of well-known logical operations. The fact that our method runs entirely offline, on a single computer, makes verification much faster and easier than testing by sending packets to real networks. It also makes it possible to use SAT solvers to prove correctness. Detecting anomalies in firewall configurations is also an important issue [5]. Detection of anomalies helps to find mistakes in a configuration, but it does not find the mistakes by itself. Hazelhurst has presented a boolean formula representation of firewall configurations, which can be used to express filtering rules [6]. The boolean formula representation can be simplified with Binary Decision Diagrams to improve the performance of filtering. In this paper, we use the boolean formula representations of firewall configurations and packets.

2 Security Policy and Firewall Configuration

We present a formal definition of security policy and firewall configuration. The definitions are very abstract. We clarify the assumptions of our formalization about security policy.

2.1 Security Policy

A security policy is a set of policy predicates. Policy predicates consist of an action, a source address, and a destination address. Actions are accept or deny. The syntax of security policy is given in figure 2. A policy is a set of predicates. A predicate consists of an action K, a source address, and a destination address. The action K is accept A or deny D. A predicate A(s, d) reads "Connections from s to d are accepted". A predicate D(s, d) reads "Connections from s to d are denied". A network address is a set. Packets are pairs of a source address and a destination address. A packet p from source address s to destination address d matches predicate K(s', d') if and only if s ⊆ s' ⋀ d ⊆ d' holds. We have a partial order relation ⊒ on policy predicates: K(s, d) ⊒ K'(s', d') ⇔ s ⊇ s' ⋀ d ⊇ d'. If q ⊒ q' holds, q is more general than q', or q' is more specific than q. A packet p is accepted by a security policy P if the action of the most specific policy predicate that matches p is A, and p is rejected by P if the action of the most specific policy predicate that matches p is D.
policy ::= { predicate, …, predicate }
predicate ::= K(address, address)
address ::= Network Address
K ::= A | D   (A = Accept, D = Deny)
Fig. 2. Syntax of Security Policy

configuration ::= rule :: configuration | φ
rule ::= K(address, address)
address ::= Network Address
K ::= A | D   (A = Accept, D = Deny)
Fig. 3. Syntax of firewall configuration

Example. The security policy we have shown in section 1 can be represented as P = { q1, q2, q3 }, where q1, q2 and q3 are defined as follows:

1. q1 = A(LAN, Internet)
2. q2 = D(LAN1, youtube.com)
3. q3 = A(LAN2, youtube.com)

We have the ordering of policy predicates q1 ⊒ q2 and q1 ⊒ q3.

Assumptions. To ensure the consistency of a security policy, we assume that we can find the most specific policy predicate for any packet. That is, we assume that the following formula holds for any two different predicates K(s, d) and K'(s', d'): s × d ⊇ s' × d' ⋁ s × d ⊆ s' × d' ⋁ s × d ∩ s' × d' = φ.

Tree View of Security Policy. We can see security policies as trees, such that the root is the most general predicate and the children of each node are the set of more specific predicates. We define an auxiliary function CP(q) which maps a predicate q in security policy P to the set of its children: CP(q) = { q' | q' ∈ P ⋀ q ⊒ q' ⋀ ¬(∃q'' ∈ P . q ⊒ q'' ⋀ q'' ⊒ q') }. For instance, CP(q1) = { q2, q3 } and CP(q2) = CP(q3) = φ for the previous example.

2.2 Firewall Configuration

The syntax of firewall configuration is given in figure 3. A firewall configuration is an ordered list of rules. Rules consist of an action, a source address, and a destination address. Actions are accept A or deny D. A packet is accepted by a firewall configuration if and only if the action of the first rule in the configuration that matches the source and destination address of the packet is A. The main difference between security policies and firewall configurations is that rules in firewalls form an ordered list, while predicates in security policies form trees.

3 Boolean Formula Representation

We present a boolean formula representation of security policies and firewall configurations in this section. This section also includes a boolean formula representation of network addresses and packets; in the previous section we did not define a concrete representation of network addresses and packets.¹ The boolean formula representations of network addresses and firewall configurations were proposed by Hazelhurst [6].

3.1 Boolean Formula Representation of Network Addresses and Packets

Network addresses are IPv4 addresses. Since IPv4 addresses are 32 bit unsigned integers, we need 32 logical variables to represent each address. An IP address is represented as a conjunction of the 32 variables or their negations, so that each variable represents a bit of the IP address: if the ith bit of the address is 1 then variable ai should evaluate to true. For example, the IP address 192.168.0.1 is represented as follows.

a32 ⋀ a31 ⋀ ¬a30 ⋀ ¬a29 ⋀ ¬a28 ⋀ ¬a27 ⋀ ¬a26 ⋀ ¬a25 ⋀ a24 ⋀ ¬a23 ⋀ a22 ⋀ ¬a21 ⋀ a20 ⋀ ¬a19 ⋀ ¬a18 ⋀ ¬a17 ⋀ ¬a16 ⋀ ¬a15 ⋀ ¬a14 ⋀ ¬a13 ⋀ ¬a12 ⋀ ¬a11 ⋀ ¬a10 ⋀ ¬a9 ⋀ ¬a8 ⋀ ¬a7 ⋀ ¬a6 ⋀ ¬a5 ⋀ ¬a4 ⋀ ¬a3 ⋀ ¬a2 ⋀ a1

Here, a1 is the variable for the lowest bit and a32 is for the highest bit. «p» is an environment which represents a packet p. Packets consist of a source address and a destination address. We have two sets of boolean variables, s = { s1, …, s32 } and d = { d1, …, d32 }, which represent the source address and the destination address of a packet, respectively. If packet p is from address 192.168.0.1, «p»|s is the following.
{ s32 ↦ T, s31 ↦ T, s30 ↦ F, s29 ↦ F, s28 ↦ F, s27 ↦ F, s26 ↦ F, s25 ↦ F, s24 ↦ T, s23 ↦ F, s22 ↦ T, s21 ↦ F, s20 ↦ T, s19 ↦ F, s18 ↦ F, s17 ↦ F, s16 ↦ F, s15 ↦ F, s14 ↦ F, s13 ↦ F, s12 ↦ F, s11 ↦ F, s10 ↦ F, s9 ↦ F, s8 ↦ F, s7 ↦ F, s6 ↦ F, s5 ↦ F, s4 ↦ F, s3 ↦ F, s2 ↦ F, s1 ↦ T }

«p» also includes assignments of the d variables for the destination address of p. ‹a, b› is a boolean formula such that «p» ⊢ ‹a, b› holds if and only if packet p is from a to b. ‹192.168.0.1, b› is a boolean formula like the following:

s32 ⋀ s31 ⋀ ¬s30 ⋀ ¬s29 ⋀ ¬s28 ⋀ ¬s27 ⋀ ¬s26 ⋀ ¬s25 ⋀ …

Only the first eight components of the formula are shown; they are the same as the highest eight components of the boolean formula representation of the IP address 192.168.0.1.

¹ Without loss of generality, we use a simplified representation of network addresses. We can easily extend this representation to support net-masks, ranges of ports, or other features, as proposed by Hazelhurst.

3.2 Boolean Formula Representation of Security Policy

A security policy P can be represented as a boolean formula QP such that ∀p : Packet . P accepts p ⇔ «p» ⊢ QP holds. We define a translation BP(q, β) which maps a policy predicate q in security policy P to its boolean formula representation:

BP(A(a, b), T) = ¬‹a, b› ⋁ (⋀q∈C BP(q, T))
BP(A(a, b), F) = ‹a, b› ⋀ (⋀q∈C BP(q, T))
BP(D(a, b), T) = ¬‹a, b› ⋁ (⋁q∈C BP(q, F))
BP(D(a, b), F) = ‹a, b› ⋀ (⋁q∈C BP(q, F))

where C = CP(q). We obtain the boolean formula representation of security policy P as BP(q, F), where q is the most general predicate in P.

Example. The following is an example of the transformation of the security policy P from section 2 into its boolean formula representation. The boolean formula representation of P is obtained as BP(q1, F), since q1 is the most general predicate.

BP(A(LAN, Internet), F) = ‹LAN, Internet› ⋀ BP(q2, T) ⋀ BP(q3, T)
BP(D(LAN1, youtube.com), T) = ¬‹LAN1, youtube.com›
BP(A(LAN2, youtube.com), T) = ¬‹LAN2, youtube.com› ⋁ T

Finally, we have the following formula after some simplifications:

‹LAN, Internet› ⋀ ¬‹LAN1, youtube.com›

Consider a packet from LAN1 to youtube.com: the first component of the formula evaluates to true, but the second component evaluates to false. The whole expression evaluates to false, so the packet is rejected.

3.3 Boolean Formula Representation of Firewall Configuration

A firewall configuration F can be represented as a boolean formula QF such that ∀p : Packet . F accepts p ⇔ «p» ⊢ QF holds. B(F) is a mapping from F to QF:

B(φ) = F
B(A(a, b) :: rules) = ‹a, b› ⋁ B(rules)
B(D(a, b) :: rules) = ¬‹a, b› ⋀ B(rules)
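As a concrete illustration of the two translations and of the satisfiability check of QP ⇎ QF, the following Python sketch is a simplified stand-in (it is not the authors' verifier): addresses are modelled as small sets of integers instead of 32+16-bit vectors, the formulas are built as closures, and the SAT step is replaced by brute-force enumeration; the names bp, bf, match and the toy networks are all assumptions made for the example.

from itertools import product

def match(p, a, b):                      # ‹a, b›: packet p = (src, dst) is from a to b
    return p[0] in a and p[1] in b

def bp(pred, beta):
    # B_P(q, beta) for a predicate q = (action, a, b, children), returned as a closure.
    action, a, b_, children = pred
    if action == 'A':
        subs = [bp(c, True) for c in children]
        inner = lambda p: all(f(p) for f in subs)       # conjunction over the children
    else:                                               # 'D'
        subs = [bp(c, False) for c in children]
        inner = lambda p: any(f(p) for f in subs)       # disjunction over the children
    if beta:
        return lambda p: (not match(p, a, b_)) or inner(p)
    return lambda p: match(p, a, b_) and inner(p)

def bf(rules):
    # B(F) for an ordered rule list [(action, a, b), ...].
    if not rules:
        return lambda p: False
    action, a, b_ = rules[0]
    rest = bf(rules[1:])
    if action == 'A':
        return lambda p: match(p, a, b_) or rest(p)
    return lambda p: (not match(p, a, b_)) and rest(p)

# Toy universe standing in for IP addresses.
LAN1, LAN2 = {10, 11}, {20, 21}
LAN = LAN1 | LAN2 | {30}
YOUTUBE = {99}
INTERNET = {99, 100, 101}
HOSTS = LAN | INTERNET

policy = ('A', LAN, INTERNET, [('D', LAN1, YOUTUBE, []), ('A', LAN2, YOUTUBE, [])])
config = [('A', LAN, INTERNET), ('D', LAN1, YOUTUBE), ('A', LAN2, YOUTUBE)]

qp, qf = bp(policy, False), bf(config)
# Satisfiability of QP xor QF by exhaustive enumeration (the paper uses a SAT solver here).
counterexamples = [p for p in product(HOSTS, HOSTS) if qp(p) != qf(p)]
print(counterexamples)                   # packets from LAN1 to youtube.com

Running the sketch reports exactly the packets from LAN1 to youtube.com, for which QF accepts but QP rejects — the same kind of counterexample the SAT-based verifier returns.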
4 Experimental Results

We have implemented a prototype verifier. The verifier reads a security policy and a firewall configuration, and verifies the correctness of the configuration. It assumes IPv4 addresses with net-masks, and 16-bit unsigned integer port numbers with range support. The verifier uses MiniSAT to solve the SAT instances [7]. In our verifier, packets consist of two network addresses and a protocol. Network addresses are pairs of a 32 bit unsigned integer, which represents an IPv4 address, and a 16 bit unsigned integer, which represents a port number. The protocols are TCP, UDP, or ICMP. Thus, a formula for one packet includes up to 99 variables: 32+16 variables each for the source and destination addresses, and three variables for the protocol. We have verified some firewall configurations. Our experiments were performed on an Intel Core Duo 2.16 GHz processor with 2 Gbytes of RAM.

Table 1 summarizes our results. The first two columns show the size of the inputs, i.e. the number of predicates in the security policy and the number of filtering rules in the firewall configuration. The third column shows the size of the input for the SAT solver. The last column shows the running time of our verifier; it includes all processing time from reading the inputs to printing the results. All of the inputs were correct configurations, since this is the most time-consuming case. These results show that our method is fast enough for inputs that are not too big. If a configuration is not correct, the verifier produces a counterexample packet. The following is an output of our verifier that shows a counterexample.

% verifyconfig verify ../samples/policy.txt ../samples/rules.txt
Loading policy ... ok
Loading configuration ... ok
Translating to CNF ... ok
MiniSAT running ... ok
Incorrect: for example [tcp - 192.168.0.0:0 - 111.123.185.1:80]

The counterexample indicates that the security policy and the firewall configuration give different answers for a TCP packet from 192.168.0.0, port 0, to 111.123.185.1, port 80. Testing the firewall with the counterexample will help users to correct the configuration.

Table 1. Experimental Results
# of predicates   # of rules   size of SAT   running time (s)
3                 3            9183          0.04
13                11           21513         0.08
26                21           112327        0.37

5 Conclusion

In this paper, we have defined a boolean formula representation of security policy, which can be used in many applications. We have also proposed an automatic method to verify the correctness of firewall configurations with respect to security policies. The method translates both inputs into boolean formulae and then verifies their equivalence by checking satisfiability. We have obtained experimental results on some small examples using our prototype implementation. Our method can verify the configuration of a centralized firewall; we are working on generalizing it to distributed firewalls.

References
1. Hamdi, H., Bouhoula, A., Mosbah, M.: A declarative approach for easy specification and automated enforcement of security policy. International Journal of Computer Science and Network Security 8(2), 60–71 (2008)
2. Abou El Kalam, A., Baida, R.E., Balbiani, P., Benferhat, S., Cuppens, F., Deswarte, Y., Miége, A., Saurel, C., Trouessin, G.: Organization Based Access Control. In: 4th IEEE International Workshop on Policies for Distributed Systems and Networks (Policy 2003) (June 2003)
3. Senn, D., Basin, D.A., Caronni, G.: Firewall conformance testing. In: Khendek, F., Dssouli, R. (eds.) TestCom 2005. LNCS, vol. 3502, pp. 226–241. Springer, Heidelberg (2005)
4. Darmaillacq, V., Fernandez, J.C., Groz, R., Mounier, L., Richier, J.L.: Test generation for network security rules. In: Uyar, M.Ü., Duale, A.Y., Fecko, M.A. (eds.) TestCom 2006. LNCS, vol. 3964, pp. 341–356. Springer, Heidelberg (2006)
5. Abbes, T., Bouhoula, A., Rusinowitch, M.: Inference System for Detecting Firewall Filtering Rules Anomalies. In: Proceedings of the 23rd Annual ACM Symposium on Applied Computing, Fortaleza, Ceara, Brazil, pp. 2122–2128 (March 2008)
6. Hazelhurst, S.: Algorithms for analysing firewall and router access lists. CoRR cs.NI/0008006 (2000)
7. Eén, N., Sörensson, N.: An extensible sat-solver. In: Giunchiglia, E., Tacchella, A. (eds.) SAT 2003. LNCS, vol. 2919, pp. 502–518.
Springer, Heidelberg (2003) Automated Framework for Policy Optimization in Firewalls and Security Gateways Gianluca Maiolini1, Lorenzo Cignini1, and Andrea Baiocchi2 1 Elsag Datamat – Divisione Automazione Sicurezza e Trasporti Rome, Italy {Gianluca.Maiolini, Lorenzo2.Cignini}@elsagdatamat.com 2 University of Roma “La Sapienza” Rome, Italy {Andrea.Baiocchi}@uniroma1.it Abstract. The challenge to address in multi-firewall and security gateway environment is to implement conflict-free policies, necessary to avoid security inconsistency, and to optimize, at the same time, performances in term of average filtering time, in order to make firewalls stronger against DoS and DDoS attacks. Additionally the approach should be real time, based on the characteristics of network traffic. Our work defines an algorithm to find conflict free optimized device rule sets in real time, by relying on information gathered from traffic analysis. We show results obtained from our test environment demonstrating for computational power savings up to 24% with fully conflict free device policies. Keywords: Firewall; Data mining; network management; security policy; optimization. 1 Introduction A key challenge of secure systems is the management of security policies, from those at high level down to the platform specific implementation. Security policy defines constraints, limitations and authorization on data handling and communications. The need for high speed links follows the increasing demand for improved packet filtering devices performance, such as firewall and S-VPN gateway. As hacking techniques evolves and routing protocols are becoming more complex there is a growing need of automated network management systems that can rapidly adapt to different and new environments. We assume that policies are formally stated according to a well defined formal language, so that the access lists of a security gateway can be reduced to an ordered list of predicates of the form: C Æ A, where C is a condition and A is an action. We refer to predicates implementing security policies as rules. For security gateway the condition of a filtering rule is composed of five selectors: <protocol> <src ip> <src port> <dst ip> <dst port>. The action that could be performed on the packet is allow, deny or process, where process imply that the packet has to be submitted to the IPSec algorithm. How to process that packet is described in a specific rule which details how to apply the security mechanism. Conditions are checked on each packet flowing through the device. The process of inspecting incoming packets and looking up the policy rule set for a match often results in CPU overload and traffic or application’s delay. Packets that match high rank rules require a small computation time compared to those one at the end of rule set. Shaping list of rules on traffic E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 131–138, 2009. springerlink.com © Springer-Verlag Berlin Heidelberg 2009 132 G. Maiolini, L. Cignini, and A. Baiocchi flowing through devices could be useful to improve devices performance. This operation performed on all packet filtering devices give an improvement in global network performance. Our analysis shows how shaping access list based on network traffic can often results in conflicts between policies. As reported by many authors [1-6], conflicts in a policy can cause holes in security, and often they can be hard to find when performing only visual or manual inspection. 
In this paper we propose an architecture, based on our algorithm, to automatically adapt packet filtering device configurations to traffic behaviour, achieving the best performance while ensuring a conflict-free solution. The architecture retrieves traffic patterns from log information sent in real time by all devices deployed in the network.

2 Related Works

In the last few years the critical role of the firewall in policy-based network management has led to a large amount of work. Much of it concerns the correctness of implemented policies. In [1] the authors only aim at detecting whether firewall rules are correlated to each other, while in [2][3][4] a set of techniques and algorithms are defined to discover all of the possible policy conflicts. Along this line, [5] and [6] provide an automatic conflict resolution algorithm in a single-firewall environment and a tuning algorithm in a multi-firewall environment, respectively. Recently, great emphasis has been placed on how to optimize firewall performance. In [7] a simple algorithm based on rule re-ordering is presented. In [11] the authors present an algorithm to optimize firewall performance that orders the rules according to their weights and considers two factors to determine the weight of a rule: rule frequency and recency, which reflect the number and the time of rule matchings, respectively. Finally, extracting rules from the "deny all" rule is another big problem to address. The few works on this issue [8][10] do not define how many rules must be extracted or which values to combine, nor how to define their priorities, and they do not check whether this process really improves the firewall packet filtering performance. In this paper we propose a fully automated framework composed of a log management infrastructure, policy compliance checking and a tool that, based on the log messages collected from all devices in the network, calculates rule rates from traffic data and re-orders rule ranks, guaranteeing a conflict-free configuration and maximum performance optimization. Moreover, our tool is able to extract a rule from the "deny all" rule if this leads to further improved performance. To make the framework fully automatic we are currently working on defining a threshold to decide how many logs are needed to automatically start a rule set update.

3 Adaptive Conflict-Free Optimization (ACO) Algorithm

In the following we refer to a tagged device rule set, denoted by R = [R1,…,RN]. The index i of each rule in R is also called its rule rank. We use the following definitions:

• Pi is the rule rate, i.e. the matching ratio of rule i, defined as the ratio of the number ni(T) of packets matching Ri to the overall number n(T) of packets processed by the tagged device in the reference time interval T.
• Ci is the rule weight, i.e. the computational cost of rule i; if the same processing complexity can be assumed for each rule in R, then Ci = i·Pi.
• C(R) is the device rule set overall computational cost, computed as the sum of the rule weights, C(R) = Σi Ci.

Our aim is to minimize C(R) in all network devices, under the constraint of full rule consistency, so as to improve both device and global filtering operation. In the following paragraphs we describe the phases of the algorithm. To develop our algorithm we use two repository systems, in particular:

DVDB: a database storing device configurations, including the security policies.
LogDB: a database designed to store all log messages coming from the devices.
Analysis and correlation among logs are performed in order to know how many times each device’s rule was matched. 3.1 Phase 1: Starting ACO ACO starts operation when: i) policy configurations (rule set) are retrieved to solve conflicts in all devices; ii) a sufficiently large amount of log for each device are collected, e.g. to allow reliable weight estimates (i.e. logs collected in a day). Since ACO is aimed at working in real time, we need to decide which events trigger its run. We monitor in real time all devices and decide to start optimization process when at least one of the following events occurs: • • rule set change (such as rule insertion, modification and removal); the number of logs received from a device in the last collection time interval (in our implementation set to 60 s) has grown more than 10% with respect to the previous collection interval. The first criterion is motivated mainly to check the policy consistency; the second one to optimize performance adapting to traffic load. Specifically, performance optimization is needed the more the higher the traffic load, i.e. as traffic load attains critical values. In fact, rule set processing time optimization is seen as a form of protection of secure networks from malicious overloads (DoS attacks by dummy traffic flooding). 3.2 Phase 2: Data Import The algorithm retrieves from Device DB (DVDB): • • the IP address of devices interfaces to the networks; devices rule set R. For each device algorithm retrieves all rules hit number (how many times a rule was applied to a packet) from Log DB (LogDB). Then it calculates rule match rate (Pi) and rule weight (Ci). 3.3 Phase 3: Rules Classification In this phase for each device a classifier analyzes one by one the rules in R and it determines the relations between Ri and all the rules previously analysed [R1,…,Ri–1]. A data structure called Complete Pseudo Tree (CPT) is built out of this analysis. 134 G. Maiolini, L. Cignini, and A. Baiocchi Definition. A pseudo tree is a hierarchically organized data structure that represents relations between each rule analysed. A pseudo tree might be formed by more than one tree. Each tree node represents a rule in R. The relation parent-children in the trees reflect the inclusion relation between the portions of traffic matched by the selectors of the rules: a rule will be a child node if the traffic matched by its selectors is included in the portion of traffic matched by the selectors of the parent node. Any two rules whose selectors match no overlapping portions of traffic will not be related by any inclusion relation. In each tree there will be a root node which represents a rule that includes all the rules in the tree and there will be one or more leaves which represent the most specific rules. When the classifier finds a couple of rules which are not related by an inclusion relation, it will split one of them into two or three new rules so as to obtain derived rules that can be classified in the pseudo tree. The output of this phase is a conflict-free tree where there remain only redundant rules that will be eliminated in the next phase. The pseudo code of the algorithm for the identification of the Complete Pseudo Tree is listed below. 
Algorithm
1: Create new Complete Pseudo Tree CPT
2: foreach rule r in Ruleset do
3:   foreach ClassifiedRule cR in CPT do
4:     classify(r, cR)
5:     if r is to be fragmented then
6:       fragment(r, classifiedRule)
7:       remove(r, pseudoTree)
8:       insert fragments at the bottom of the Ruleset
9:       calculate statistics from LogDB (fragments)
10:    else
11:      insert(r, CPT)

3.4 Phase 4: Optimization

In this phase the core operations for optimizing the single device rule lists are performed. The aim of these operations is twofold: to restrict the number of rules in every rule list without changing the external behavior of the device, and to optimize filtering performance. We take into consideration the data structures introduced in the previous section, namely the Device Pseudo Tree (DPT), one for each device, obtained from the CPT by considering only the rules belonging to that device. Each of these structures gives a hierarchical representation of the rules in the rule list of each device. It may happen that a rule has the same action as a rule that directly includes it. This means that the child rule is in a way redundant because, if it were not in the rule list, the same portion of traffic would be matched by the parent rule, which has the same action. The child rule is therefore not necessary to describe the device behavior and can be eliminated, simplifying the device rule list. Our algorithm will locate, in every device pseudo tree, all the cases in which a child rule has the same action as its father's, and will delete the child rule. The rule set obtained is thus composed of two kinds of rules: completely disjoint rules, which can be placed at any rank of the rule list, and rules characterized by dependency constraints. In addition, we need to update the rate of the father rule when one or more child rules are deleted. At this point each device rule set R is re-ordered according to non-increasing rule rates, i.e. so that Pi ≥ Pj for i ≤ j. The resulting ordering minimizes the overall cost C(R), yet it does not guarantee the correctness of the policies implemented. As a matter of fact, it may happen that a more specific rule is placed after a more general rule, violating the constraints imposed by security policy consistency. For this reason, after the re-ordering operation, the father-child relations of the DPT are restored. It is clear that the gain achieved by optimization heavily depends on the degree of dependency among the rules. The two limit cases are: i) no rule dependencies (all disjoint rules), which yields the biggest optimization margin; ii) complete dependency (every rule depends on every other one), where the optimization process produces a near-zero gain. The device rule set total cost C(R) is evaluated and fed as input to the next phase. The pseudo code of the optimization algorithm is listed below.
Algorithm
1: foreach Node node in CPT do
2:   get the id of the device the rule belongs to
3:   if exist DPT.id == id then
4:     insert(node, DPT)
5:   else
6:     create(DPT, id)
7:     insert(node, DPT)
8: foreach DPT do
9:   foreach Rule rule in DPT do
10:    if rule.action == rule's father.action then
11:      update rule's father.rate
12:      delete(rule)
13: foreach Device dv do
14:   sort the dv.ruleset except deny all in non-increasing order of rule.rate
15:   foreach Rule rule do
16:     if rule is a child
17:       move rule just above father rule
18:   calculate dv.cost

3.5 Phase 5: Extracting Rules from the Deny all Rule

The common idea behind extracting rules from the deny all rule is to obtain a better optimization rate. It consists in selecting only heavily invoked rules and simply inserting them into the rule set according to their rates. However, this is a very delicate operation, since the inclusion of these rules often does not improve performance. Table 1 shows such a case: two new rules extracted and ordered according to their rates produce an increment of the starting value of C(R), which was 5.16.

Table 1. Rule list with two rules extracted (Rank = 2 and Rank = 7)
Rank   Pi     Ci
1      0.18   0.18
2      0.10   0.20
3      0.02   0.06
4      0.07   0.28
5      0.17   0.85
6      0.12   0.72
7      0.05   0.35
8      0.03   0.24
9      0.01   0.09
10     0.25   2.50
C(R) = 5.47

In addition, extracted rules are always disjoint from all the others in the rule set, so it is impossible to introduce additional conflicts. Our algorithm extracts a new rule when its rate exceeds 20% of the deny all rule rate. But this is not enough; in fact we perform an additional check to assess the efficacy of the new rule in the optimization process. Changing the position of rules implies cost changes, so we choose the position of the rule that grants the lowest overall cost C(R). The derived rule is actually inserted in the rule list only if the overall cost improves over the value it had before rule extraction. Detailed pseudo code of this phase is listed below.

Algorithm
1: foreach Device dv
2:   foreach denyall log record in LogDB
3:     count log occurrence
4:     if log.rate > 0.2 · denyall.rate
5:       extract new rule from log record
6:       add rule to extracted_ruleset
7:   foreach Rule rule in extracted_ruleset
8:     calculate vector dv.costvector
9:     if min(dv.costvector) < dv.cost
10:      update dv.cost
11:      update ruleset including rule

3.6 Phase 6: Update Devices

At this point we have obtained, for each packet filtering device, an optimized and conflict-free rule list, shaped on the traffic that flowed through the network. In this phase the algorithm updates the device configurations in the device DB. The network management system manages the configuration upload to the deployed devices.

4 Performance Evaluations

Our approach is based on a real test scenario, even if, due to privacy issues, we cannot provide references and traffic contents. We observed the traffic behaviour during one day on ten different devices deployed in an internal network. We analyzed the configurations in order to detect and solve possible conflicts; the results of this phase are not reported here because we focus on log gathering and optimization. We stored the conflict-free configurations in the device DB. We also configured the devices to send their logs to the machine where our tool is installed. We collected logs for 24 hours, storing them in the LogDB. We then started our tool, based on the algorithm described in section 3, obtaining different levels of optimization depending on device configuration and traffic. Our optimization rate is measured by the parameter C(R).
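For illustration (this is not the authors' tool), the following Python sketch computes C(R) = Σi i·Pi for the rule set of Table 3 below under three orderings, reproducing the figures quoted in the text; the constrained ordering is taken directly from Table 4, and the helper names are assumptions made for the example.

rates = {1: 0.01, 2: 0.12, 3: 0.02, 4: 0.18, 5: 0.03, 6: 0.07, 7: 0.17, 8: 0.40}  # rule -> Pi (rule 8 = deny all)

def cost(order):
    # C(R) for a given ordering (list of rule ids, first element = rank 1).
    return sum(rank * rates[rule] for rank, rule in enumerate(order, start=1))

initial = [1, 2, 3, 4, 5, 6, 7, 8]
# Line 14 of the phase-4 algorithm: sort everything except the final deny-all rule by rate.
sorted_free = sorted(initial[:-1], key=lambda r: rates[r], reverse=True) + [8]
# Order after restoring the father-child constraints (the order reported in Table 4).
constrained = [4, 3, 6, 7, 2, 5, 1, 8]

print(round(cost(initial), 2))      # 5.99
print(round(cost(sorted_free), 2))  # 4.70  (about 22% below the initial cost)
print(round(cost(constrained), 2))  # 5.16  (about 14% below, and conflict-free)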
In this section we describe the results obtained on the most significant device deployed in the network, as it illustrates the concept of the algorithm. Table 2 shows the initial device rule set, comprising the access list for IP traffic and the IPSec configuration. It is easy to see that if we exchange rules 3, 6 and 7, shadowing conflicts occur. Table 3 shows the Pi and Ci values of the initial rule set, calculated by retrieving the data stored in the LogDB. According to the metric used, we obtain a value of C(R) equal to 5.99. The re-ordering operation (line 14 of the phase 4 algorithm) produces the best optimization gain (+22%), but adjustments are necessary to ensure a conflict-free configuration. At the end of phase 4 we obtained the order shown in Table 4. The value of C(R) obtained is 5.16, so the improvement is about 14%. Another good optimization is obtained by the algorithm in phase 5. In this phase new rules are extracted because a packet flow matched a specific denied rule (obviously covered by deny all) many times. Our algorithm calculates the best position at which to insert the rule in the list in order to minimize the cost C(R); in this case it has been positioned at the top of the rule list. Table 5 shows the resulting changes in the Pi and Ci values. The value of C(R) obtained is 4.56, so the additional gain obtained in this phase is about 10%. At the end of the process the total optimization gain is about 24%. Similar values (±2%) were obtained on the remaining devices, although optimization is not always feasible. In the end, we have achieved a conflict-free configuration with a gain of 24%. These results depend on the specific traffic behaviour and on the security policies applied on the devices. Our work is continuing with further traffic tests, looking for relations between the number of rules and the optimization rate, and refining the extraction from deny all rules. A further issue is the fine tuning of the heuristic parameters used in the algorithm, such as the time interval between two updates. Finally, a device spends significant CPU time sending logs, especially when it has to send them for all packets flowing through it.

Table 2.
Rank   Protocol   Source IP      Source Port   Dest. IP       Dest. Port   Action
1      tcp        192.168.10.*   80            192.168.20.*   80           Deny
2      tcp        192.168.10.*   21            192.168.20.*   21           Allow
3      tcp        10.1.1.23      any           20.1.1.23      80           Allow
4      udp        192.168.3.5    53            192.168.3.5    any          Deny
5      udp        192.168.3.5    any           192.168.*.*    80           Allow
6      tcp        10.1.1.*       any           20.1.1.*       80           Deny
7      tcp        10.1.*.*       any           20.1.1.*       80           Allow
8      any        any            any           any            any          Deny

Table 3.
Rank   Pi     Ci
1      0.01   0.01
2      0.12   0.24
3      0.02   0.06
4      0.18   0.72
5      0.03   0.15
6      0.07   0.42
7      0.17   1.19
8      0.40   3.20
C(R) = 5.99

Table 4.
Old rank   Rank   Pi     Ci
4          1      0.18   0.18
3          2      0.02   0.04
6          3      0.07   0.21
7          4      0.17   0.68
2          5      0.12   0.60
5          6      0.03   0.18
1          7      0.01   0.07
8          8      0.40   3.20
C(R) = 5.16 (+14%)

Table 5.
Rank   Pi     Ci
1      0.20   0.20
2      0.18   0.36
3      0.02   0.06
4      0.07   0.28
5      0.17   0.85
6      0.12   0.72
7      0.03   0.21
8      0.01   0.08
9      0.20   1.80
C(R) = 4.56 (+10%)

References
1. Hari, H.B., Suri, S., Parulkar, G.: Detecting and Resolving Packet Filter Conflicts. In: Proceedings of IEEE INFOCOM 2000, Tel Aviv (2000)
2. Al-Shaer, E., Hamed, H.: Modeling and Management of Firewall Policies. In: IEEE eTransactions on Network and Service Management, vol. 1-1 (2004)
3. Al-Shaer, E., Hamed, H., Boutaba, R., Hasan, M.: Conflict Classification and Analysis of Distributed Firewall Policies. IEEE Journal on Selected Areas in Communications 23(10) (2005)
4.
Al-Shaer, E., Hamed, H.: Firewall Policy Advisor for Anomaly Detection and Rule Editing. In: Proceedings of IEEE/IFIP Integrated Management Conference (IM 2003), Colorado Springs (2003) 5. Ferraresi, S., Pesic, S., Trazza, L., Baiocchi, A.: Automatic Conflict Analysis and Resolution of Traffic Filtering Policy for Firewall and Security Gateway. In: IEEE International Conference on Communications 2007 (ICC 2007), Glasgow (2007) 6. Ferraresi, S., Francocci, E., Quaglini, A., Picasso, F.: Security Policy Tuning among IP Devices. In: Apolloni, B., Howlett, R.J., Jain, L. (eds.) KES 2007, Part II. LNCS (LNAI), vol. 4693. Springer, Heidelberg (2007) 7. Fulp, E.W.: Optimization of network firewall policies using directed acyclical graphs. In: Proceedings of the IEEE Internet Management Conference (2005) 8. Acharya, S., Wang, J., Ge, Z., Znati, T., Greenberg, A.: Simulation study of firewalls to aid improved performance. In: Proceedings of 39th Annual Simulation Symposium (ANSS 2006), Huntsville (2006) 9. Acharya, S., Wang, J., Ge, Z., Znati, T., Greenberg, A.: Traffic-aware firewall optimization Strategies. In: IEEE International Conference on Communications (ICC 2006), Istambul (2006) 10. Zhao, L., Inoue, Y., Yamamoto, H.: Delay reduction for linear-search based packet filters. In: International Technical Conference on Circuits/Systems, Computers and Communication (ITC-CSCC 2004), Japan (2004) 11. Hamed, H., Al-Shaer, E.: Dynamic rule ordering optimization for high speed firewall Filtering. In: ACM Symposium on InformAtion, Computer and Communications Security (ASIACCS 2006), Taipei (2006) An Intrusion Detection System Based on Hierarchical Self-Organization E.J. Palomo, E. Domínguez, R.M. Luque, and J. Muñoz Department of Computer Science E.T.S.I. Informatica, University of Malaga Campus Teatinos s/n, 29071 – Malaga, Spain {ejpalomo,enriqued,rmluque,munozp}@lcc.uma.es Abstract. An intrusion detection system (IDS) monitors the IP packets flowing over the network to capture intrusions or anomalies. One of the techniques used for anomaly detection is building statistical models using metrics derived from observation of the user's actions. A neural network model based on self organization is proposed for detecting intrusions. The selforganizing map (SOM) has shown to be successful for the analysis of high-dimensional input data as in data mining applications such as network security. The proposed growing hierarchical SOM (GHSOM) addresses the limitations of the SOM related to the static architecture of this model. The GHSOM is an artificial neural network model with hierarchical architecture composed of independent growing SOMs. Randomly selected subsets that contain both attacks and normal records from the KDD Cup 1999 benchmark are used for training the proposed GHSOM. Keywords: Network security, self-organization, intrusion detection. 1 Introduction Nowadays, network communications become more and more important to the information society. Business, e-commerce and other network transactions require more secured networks. As these operations increases, computer crimes and attacks become more frequents and dangerous, compromising the security and the trust of a computer system and causing costly financial losses. In order to detect and prevent these attacks, intrusion detection systems have been used and have become an important area of research over the years. An intrusion detection system (IDS) monitors the network traffic to detect intrusions or anomalies. 
There are two different approaches used to detect intrusions [1]. The first approach is known as misuse detection, which compares previously stored signatures of known attacks. This method is good detecting many or all known attacks, having a successful detection rate. However, they are not successful in detecting unknown attacks occurrences and the signature database has to be manually modified. The second approach is known as anomaly detection. First, these methods establish a normal activity profile. Thus, variations from this normal activity are considered anomalous activity. Anomaly-based systems assume that anomalous activities are intrusion attempts. Many of these anomalous activities are frequently normal E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 139–146, 2009. springerlink.com © Springer-Verlag Berlin Heidelberg 2009 140 E.J. Palomo et al. activities, showing false positives. Many anomaly detection systems build statistical models using metrics derived from the user's actions [1]. Also, anomaly detection systems using data mining techniques such as clustering, support vector machines (SVM) and neural network systems have been proposed [2-4]. Several IDS using a self-organizing maps have been done, however they have many difficulties detecting a wide variety of attacks with low false positive rates [5]. The self-organizing map (SOM) [6] has been applied successfully in multiple areas. However, the network architecture of SOMs has to be established in advance and it requires knowledge about the problem domain. Moreover, the hierarchical relations among input data are difficult to represent. The growing hierarchical SOM (GHSOM) faces these problems. The GHSOM is an artificial neural network which consists of several growing SOMs [7] arranged in layers. The number of layers, maps and neurons of maps are automatically adapted and established during a training process of the GHSOM, according to a set of input patterns fed to the neural network. Applied to an intrusion detection system, the input patterns will be samples of network traffic, which can be classified as anomalous or normal activities. The data analyzed from network traffic can be numerical (i.e. number of seconds of the connection) or symbolic (i.e. protocol type used). Usually, the Euclidean distance has been used as metric to compare two input data. However, this is not suitable for symbolic data. Therefore, in this paper an alternative metric for GHSOMs where both symbolic and numerical data are considered is proposed. The implemented IDS using the GHSOM, was trained with the KDD Cup 1999 benchmark data set [8]. This data set has served as the first and only reliable benchmark data set that has been used for most of the research work on intrusion detection algorithms [9]. This data set includes a wide variety of simulated attacks. The remainder of this paper is organized as follows. Section 2 discusses the new GHSOM model used to build the IDS. Then, the training algorithm is described. In Section 3, we show some experimental results obtained from comparing our implementation of the Intrusion Detection System (IDS) with other related works, where data from the KDD Cup 1999 benchmark are used. Section 4 concludes this paper. 2 GHSOM for Intrusion Detection The IDS implemented is based on GHSOM. Initially, the GHSOM consists of a single SOM of 2x2 neurons. 
Then, the GHSOM adapts its architecture depending on the input patterns, so that the GHSOM structure mirrors the structure of the input data, yielding a good data representation. This neural network structure is used to classify the input data into groups, where each neuron represents a group of data with similar features. The level of data representation of each neuron i is measured by the quantization error of the neuron (qe_i). The qe measures how well a neuron represents the data set mapped onto it: the higher the qe, the higher the heterogeneity of that data. Usually the qe has been expressed in terms of the Euclidean distance between the input pattern and the weight vector of a neuron. However, in many real life problems not only numerical features are present, but also symbolic features can be found. To take an obvious example, among the features to analyze from the data for building an IDS, three symbolic features are found: protocol type (i.e. TCP, UDP or ICMP), service (i.e. HTTP, SMTP, FTP, TELNET, etc.) and the flag of the status of the connection (i.e. SF, S1, REJ, etc.). Unlike numerical data, symbolic data do not have an associated order and cannot be measured by a distance. It makes no sense to use the Euclidean distance between two symbolic values, for example the distance between the HTTP and FTP protocols. It seems better to use a similarity measure rather than a distance measure for symbolic data. For that reason, in this paper we introduce the entropy as the measure of the representation error of a neuron for symbolic data, together with the Euclidean distance for numerical data.

Fig. 1. Sample architecture of a GHSOM

Let x_j be the j-th input pattern, where x_j^n is the vector of its numerical features and x_j^s is the vector of its symbolic features. The error of a unit i (e_i) in the representation is defined as follows:

e_i = ( Σ_{x_j ∈ C_i} ‖w_i^n − x_j^n‖ , − Σ_x p_x log p_x ),   (1)

where the two components are the error components of the numerical and symbolic features, respectively, C_i is the set of patterns mapped onto the unit i, and p_x is the probability of the element x in C_i. The quantization error of the unit i is given by expression (2):

qe_i = ‖e_i‖.   (2)

First of all, the quantization error at the layer 0 map (qe_0) has to be computed as shown above. In this case, the error is computed as specified in (1), but using instead of w the mean of all the input data, and taking as C the set of all input patterns. The training of a map is done as the training of a single SOM [6]. An input pattern is randomly selected and each neuron determines its activation according to a similarity measure. In our case, since we take into account the presence of both symbolic and numerical data, the neuron with the smallest similarity measure defined in (3) becomes the winner:

d(x_j, w_i) = ( ‖x_j^n − w_i^n‖ , − log p(x_j^s, w_i^s) ).   (3)

For the numerical component, the first term is the Euclidean distance between the two vectors. For symbolic data, the measure checks whether the two vectors are the same or not, that is, the probability p can only take the value 1, if the vectors are the same, or 0.5 if they are different. Therefore, for symbolic data the second term will be 0 if x_j^s and w_i^s are equal, and log 2 if they are not the same. By taking into account the new similarity measure, the index c of the winner neuron is defined in (4):

c = arg min_i ‖d(x_j, w_i)‖.   (4)

The weight vector of a neuron w_i is adapted according to expression (5). For the numerical components, the winner and its neighbors, whose amount of adaptation follows a Gaussian neighborhood function h_ci(t), are adapted.
For symbolic data, just the winner is adapted: its symbolic components are set to the mode of the symbolic components of the set of input patterns mapped onto the winner. For the numerical components, the update is the usual SOM rule:

w_i(t+1) = w_i(t) + α(t) h_ci(t) ( x_j(t) − w_i(t) ).   (5)

The GHSOM growing and expansion is controlled by means of two parameters: τ1, which is used to control the growth of a single map, and τ2, which is used to control the expansion of each neuron of the GHSOM. Specifically, a map m stops growing if the mean of the map's quantization errors (MQE_m) reaches a certain fraction τ1 of the qe of the corresponding neuron that was expanded to create the map m. Also, a neuron is not expanded if its quantization error (qe_i) is smaller than a certain fraction τ2 of qe_0. Thus, the choice of τ2 determines how deep the resulting hierarchy will be, while τ1 determines how large the individual maps will be. In this way, these two parameters provide control over the resulting hierarchical architecture. Note that these parameters are the only ones that have to be established in advance. The pseudocode of the training algorithm of the proposed GHSOM is defined as follows.

Step 1. Compute the mean of all the input data and then the initial quantization error qe_0.
Step 2. Create an initial map with 2x2 neurons.
Step 3. Train the map for a number of iterations as a single SOM, using expressions (3), (4) and (5).
Step 4. Compute the quantization error qe_i of each neuron according to expression (2).
Step 5. Calculate the mean of all units' quantization errors of the map (MQE_m).
Step 6. If MQE_m < τ1 · qe_u, go to step 9, where qe_u is the qe of the corresponding unit u in the upper layer that was expanded. For the first layer, this is qe_0.
Step 7. Select the neuron e with the highest qe and its most dissimilar neighbor e' according to expression (6), where Λ_e is the set of neighbors of e:

e' = arg max_{w_i ∈ Λ_e} ‖d(w_e, w_i)‖.   (6)

Step 8. Insert a row or column of neurons between e and e', initializing their weight vectors as the means of their respective neighbors. Go to step 3.
Step 9. If qe_i < τ2 · qe_0 for all neurons i in the map, go to step 11.
Step 10. Select an unsatisfied neuron and expand it, creating a new map in the lower layer. The parent of the new map is the expanded neuron, and the weight vectors of the new map are initialized as the mean of their parent and its neighbors. Go to step 2.
Step 11. If there exist remaining maps to be processed, select one and go to step 3. Otherwise, the algorithm ends.
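As a minimal sketch of the mixed similarity measure of expressions (3)-(4) — illustrative only, not the authors' implementation, with toy weight vectors and hypothetical helper names — the following Python code computes the Euclidean term on the numerical features, the −log p term on the symbolic features (p = 1 if equal, 0.5 otherwise), and selects the winner by the norm of the two-component measure.

import math

def similarity(x_num, x_sym, w_num, w_sym):
    d_num = math.dist(x_num, w_num)                 # Euclidean distance on numerical features
    p = 1.0 if x_sym == w_sym else 0.5              # probability of the symbolic match
    d_sym = -math.log(p)                            # 0 if equal, log 2 if different
    return math.hypot(d_num, d_sym)                 # norm of (d_num, d_sym)

def winner(x_num, x_sym, units):
    # Index of the unit minimizing the similarity measure (expression (4)).
    return min(range(len(units)),
               key=lambda i: similarity(x_num, x_sym, units[i][0], units[i][1]))

# Toy 2x2 map: each unit holds (numerical weights, symbolic weights).
units = [([0.1, 0.2], ('tcp', 'http', 'SF')),
         ([0.9, 0.8], ('udp', 'domain', 'SF')),
         ([0.5, 0.5], ('tcp', 'ftp', 'REJ')),
         ([0.0, 1.0], ('icmp', 'ecr_i', 'SF'))]

print(winner([0.12, 0.22], ('tcp', 'http', 'SF'), units))   # -> 0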
3 Experimental Results

Our IDS based on the new GHSOM model was trained with the pre-processed KDD Cup 1999 benchmark data set created by MIT Lincoln Laboratory. The purpose of this benchmark was to build a network intrusion detector capable of distinguishing between intrusions, or attacks, and normal connections. We have used the 10% KDD Cup 1999 benchmark data set, which contains 494021 connection records, each of them with 41 features. Here, a connection is a sequence of TCP packets flowing between a source and a target IP address. Since some features alone cannot constitute a sign of anomalous activity, it is better to analyze connection records rather than individual packets. Among the 41 features, three are symbolic: protocol type, service and the status flag of the connection. The training data set contains 22 attack types in addition to normal records, which fall into four main categories [10]: Denial of Service (DoS), Probe, Remote-to-Local (R2L) and User-to-Root (U2R). In this paper, we have selected two data subsets for training the GHSOM from the total of 494021 connection records: SetA with 100000 connection records and SetB with 169000. Both SetA and SetB contain the 22 attack types. We tried to select the data in such a way that all record types had the same distribution. However, the 10% KDD Cup data set has an irregular distribution, and this was finally mirrored in our selection. The two data subsets were trained with 0.1 as the value of both parameters τ1 and τ2, since with these values we achieved good results and a very simple architecture. In fact, each trained GHSOM generated just two layers with 16 neurons, although with a different arrangement. The GHSOM trained with SetB is the one shown in Fig. 1.

Table 1. Training results for SetA and SetB
Training Set   Detected (%)   False Positive (%)   Identified (%)
SetA           99.98          3.03                 94.99
SetB           99.98          5.74                 95.09

Many related works are only interested in classifying an input pattern as one of two record types: anomalous or normal. Taking into account just these two groups, normal records that are classified as anomalous are known as false positives, whereas anomalous records that are classified as normal records are known as missed detections. However, we are also interested in classifying an anomalous record into its attack type, that is, taking into account 23 groups (22 attack types plus normal records) instead of 2 groups. Hence, we call identification rate the fraction of connection records that are correctly identified as their respective record types. The training results of the two GHSOMs obtained with the subsets SetA and SetB are shown in Table 1. Both subsets achieve a 99.98% attack detection rate and false positive rates of 3.03% and 5.74%, respectively. Regarding the identification of the attack type, around 95% of the records were correctly identified in both cases. We have also simulated the trained GHSOMs with the 494021 connection records from the benchmark data set. This simulation consists of classifying these data with the trained GHSOMs without any modification of the GHSOMs, that is, without a learning process. The simulation results of both GHSOMs are given in Table 2. Here, a 99.9% detection rate is achieved. Also, the identification rate rises up to 97% in both cases. The false positive rate increases for SetA during the simulation, although for SetB it is lower than the rate obtained after training. Note that during both training and simulation we used the 41 features and all the 22 attacks existing in the training data set. Moreover, the resulting number of neurons is much reduced; in fact, there are fewer neurons than groups, although this is due to the scarce amount of connection records for certain attack groups.

Table 2. Simulation results with 494021 records for SetA and SetB
Training Set   Detected (%)   False Positive (%)   Identified (%)
SetA           99.99          5.41                 97.09
SetB           99.91          5.11                 97.15

In Table 3, we compare the training results of SetA with the results obtained in [9, 10], where SOMs were also used to build IDSs. In order to differentiate the two IDSs based on self-organization, we call them SOM and K-Map, respectively, using the authors' notation, whereas our trained IDS is called GHSOM. From the first work, we chose the only SOM trained on all the 41 features, which was composed of 20x20 neurons. Another IDS implementation was proposed in the second work. There, a hierarchical Kohonen net (K-Map), composed of three layers, was trained with a subset of 169000 connection records, taking into account the 41 features and 22 attack types. Their best result was a 99.63% detection rate after testing the K-Map, although with several restrictions.
This result was achieved using a pre-selected combination of 20 features, which were divided into three levels, where each feature sub-combination was fed to a different layer. Also, they used just three attack types during testing, whereas we used the 22 attack types. Moreover, the architecture of the hierarchical K-Map was established in advance, using 48 neurons in each layer, that is, 144 neurons, whereas we used just 16 neurons that were generated during the training process without any human intervention.

Table 3. Comparison results for different IDS implementations
           Detected (%)   False Positive (%)
GHSOM      99.98          3.03
SOM        81.85          0.03
K-Map      99.63          0.34

4 Conclusions

This paper has presented a novel Intrusion Detection System based on growing hierarchical self-organizing maps (GHSOMs). These neural networks are composed of several SOMs arranged in layers, where the number of layers, maps and neurons is established during the training process, mirroring the inherent data structure. Moreover, we have taken into account the presence of symbolic features in addition to numerical features in the input data. In order to improve the classification of the input data, we have introduced a new metric for GHSOMs based on entropy for symbolic data together with the Euclidean distance for numerical data. We have used the 10% KDD Cup 1999 benchmark data set to train our GHSOM-based IDS; it contains 494021 connection records, where 22 attack types in addition to normal records can be found. We trained and simulated two GHSOMs with two different subsets, SetA with 100000 connection records and SetB with 169000 connection records. SetA and SetB achieved a 99.98% detection rate and false positive rates of 3.03% and 5.74%, respectively. The identification rate, that is, the fraction of connection records identified as their correct connection record types, was 94.99% and 95.09%, respectively. After the simulation of the two trained GHSOMs with the 494021 connection records, we achieved a 99.9% detection rate and false positive rates between 5.11% and 5.41% on both subsets. The identification rate was around 97%.

Acknowledgements. This work is partially supported by the Spanish Ministry of Education and Science under contract TIN-07362.

References
1. Denning, D.: An intrusion-detection model. IEEE Transactions on Software Engineering SE-13(2), 222–232 (1987)
2. Lee, W., Stolfo, S., Chan, P., Eskin, E., Fan, W., Miller, M., Hershkop, S., Zhang, J.: Real time data mining-based intrusion detection. In: DARPA Information Survivability Conference & Exposition II, vol. 1, pp. 89–100 (2001)
3. Maxion, R., Tan, K.: Anomaly detection in embedded systems. IEEE Transactions on Computers 51(2), 108–120 (2002)
4. Tan, K., Maxion, R.: Determining the operational limits of an anomaly-based intrusion detector. IEEE Journal on Selected Areas in Communications 21(1), 96–110 (2003)
5. Ying, H., Feng, T.J., Cao, J.K., Ding, X.Q., Zhou, Y.H.: Research on some problems in the kohonen som algorithm. In: International Conference on Machine Learning and Cybernetics, vol. 3, pp. 1279–1282 (2002)
6. Kohonen, T.: Self-organized formation of topologically correct feature maps. Biological Cybernetics 43(1), 59–69 (1982)
7. Fritzke, B.: Growing grid - a self-organizing network with constant neighborhood range and adaptation strength. Neural Processing Letters 2(5), 9–13 (1995)
8.
Lee, W., Stolfo, S., Mok, K.: A data mining framework for building intrusion detection models. In: IEEE Symposium on Security and Privacy, pp. 120–132 (1999) 9. Sarasamma, S., Zhu, Q., Hu, J.: Hierarchical kohonenen net for anomaly detection in network security. IEEE Transactions on Systems Man and Cybernetics Part BCybernetics 35(2), 302–312 (2005) 10. DeLooze, L., DeLooze, A.F.: Attack characterization and intrusion detection using an ensemble of self-organizing maps. In: 7th Annual IEEE Information Assurance Workshop, pp. 108–115 (2006) Evaluating Sequential Combination of Two Genetic Algorithm-Based Solutions for Intrusion Detection Zorana Banković, Slobodan Bojanić, and Octavio Nieto-Taladriz ETSI Telecomunicación, Universidad Politécnica de Madrid, Ciudad Universitaria s/n, 28040 Madrid, Spain {zorana,slobodan,nieto}@die.upm.es Abstract. The paper presents a serial combination of two genetic algorithm-based intrusion detection systems. Feature extraction techniques are deployed in order to reduce the amount of data that the system needs to process. The designed system is simple enough not to introduce significant computational overhead, but at the same time is accurate, adaptive and fast. There is a large number of existing solutions based on machine learning techniques, but most of them introduce high computational overhead. Moreover, due to its inherent parallelism, our solution offers a possibility of implementation using reconfigurable hardware with the implementation cost much lower than the one of the traditional systems. The model is verified on KDD99 benchmark dataset, generating a solution competitive with the solutions of the state-of-the-art. Keywords: intrusion detection, genetic algorithm, sequential combination, principal component analysis, multi expression programming. 1 Introduction Computer networks are usually protected against attacks by a number of access restriction policies (anti-virus software, firewall, message encryption, secured network protocols, password protection). Since these solutions are proven to be insufficient for providing high level of network security, there is a need for additional support in detecting malicious traffic. Intrusion detection systems (IDS) are placed inside the protected network, looking for potential threats in network traffic and/or audit data recorded by hosts. IDS have three common problems that should be tackled when designing a system of the kind: speed, accuracy and adaptability. The speed problem arises from the extensive amount of data that needs to be monitored in order to perceive the entire situation. An existing approach to solving this problem is to split network stream into more manageable streams and analyze each in real-time using separate IDS. The event stream must be split in a way that covers all relevant attack scenarios, but this assumes that all attack scenarios must be known a priori. We are deploying a different approach. Instead of defining different attack scenarios, we extract the features of network traffic that are likely to take part in an attack. This provides higher flexibility since a feature can be relevant for more than one attack or is prone to be abused by an unknown attack. Moreover, we need only one IDS E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 147–154, 2009. springerlink.com © Springer-Verlag Berlin Heidelberg 2009 148 Z. Banković, S. Bojanić, and O. Nieto-Taladriz to perform the detection. Finally, in this way the total amount of data to be processed is highly reduced. 
Hence, the amount of time spent for offline training of the system and, afterwards, the time spent for attack detection are also reduced.

Incorporation of learning algorithms provides a potential solution for the adaptation and accuracy issues. A great number of machine learning techniques have been deployed for intrusion detection in both commercial systems and the state-of-the-art. These techniques can introduce a certain amount of 'intelligence' into the process of detecting attacks and are capable of processing large amounts of data much faster than a human. However, these systems can introduce high computational overhead. Furthermore, many of them do not deal properly with so-called 'rare' classes [5], i.e. the classes that have a significantly smaller number of elements than the rest of the classes. This problem occurs mostly due to the tendency for generalization that most of these techniques exhibit. Intrusions can be considered rare classes since the amount of intrusive traffic is considerably smaller than the amount of normal traffic. Thus, we need a machine learning technique that is capable of dealing with this issue.

In this work we present a genetic algorithm (GA) [9] approach for classifying network connections. GAs are robust, inherently parallel, adaptable and suitable for dealing with the classification of rare classes [5]. Moreover, due to their inherent parallelism, they offer the possibility of implementing the system using reconfigurable devices without the need to deploy a microprocessor. In this way, the implementation cost would be much lower than the cost of implementing a traditional IDS, while at the same time offering a higher level of adaptability. This work represents a continuation of our previous one [1], where we investigated the possibilities of applying a GA to intrusion detection while deploying a small subset of features. The experiments confirmed the robustness of the GA and inspired us to continue experimenting on the subject. Here we further investigate a combination of two GA-based intrusion detection systems. The first system in the chain is an anomaly-based IDS implemented as a simple linear classifier. This system exhibits a high false-positive rate. Thus, we have added a simple system based on if-then rules that filters the decisions of the linear classifier and in that way significantly reduces the false-positive rate. We actually create a strong classifier built upon weak classifiers, but without the need to follow the process of a boosting algorithm [15], as both of the created systems can be trained separately.

For evolving our GA-based system, the KDD99Cup training and testing dataset was used [6]. The KDD99Cup dataset was found to have quite a few drawbacks [7], [8], but it is still the prevailing dataset used for training and testing of IDS due to its good structure and availability [3], [4]. Because of these shortcomings, the presented results do not illustrate the behavior of the system in a real-world environment, but they do reflect its possibilities. In the general case, the performance of the implemented system highly depends on the training data.

In the following text, Section 2 gives a survey and comparison of machine learning techniques deployed for intrusion detection. Section 3 details the implementation of the system. Section 4 presents the benchmark KDD99 training and testing dataset, evaluates the performance of the system using this dataset and discusses the results.
Finally, the conclusions are drawn in Section 5. 2 Machine Learning Techniques for Intrusion Detection – Survey and Comparison In the recent past there has been a growing recognition of intelligent techniques for the construction of efficient and reliable intrusion detection systems. A complete survey of these techniques is hard to present at this point, since there are more than hundred IDS based on machine learning techniques. Some of the best-performed techniques used in the state-of-the-art apply GA [4], combination of neural networks and C4.5 [11], genetic programming (GP) ensemble [12], support vector machines [13] or fuzzy logic [14]. All of the named techniques have two steps: training and testing. The systems have to be constantly retrained using new data since new attacks are emerging every day. The advantage of all GA or GP-based techniques lies in their easy retraining. It’s enough to use the best population evolved in the previous iteration as initial population and repeat the process, but this time including new data. Thus, our system is inherently adaptive which is an imperative quality of an IDS. Furthermore, GAs are intrinsically parallel, since they have multiple offspring, they can explore the solution space in multiple directions at once. Due to the parallelism that allows them to implicitly evaluate many schemas at once, GAs are particularly well-suited to solving problems where the space of all potential solutions is too vast to search exhaustively in any reasonable amount of time, as network data. GA-based techniques are appropriate for dealing with rare classes. As they work with populations of candidate solutions rather than a single solution and employ stochastic operators to guide the search process, GAs cope well with attribute interactions and avoid getting stuck in local maxima, which together make them very suitable for dealing with classifying rare classes [5]. We have gone further by deploying standard F-measure as fitness function. F-value is proven to be very suitable when dealing with rare classes [5]. F-measure is a combination of precision and recall. Rare cases and classes are valued when using these metrics because both precision and recall are defined with respect to a rare class. None of the GA or GP techniques stated above considers the problem of rare classes. A technique that considers the problem of rare classes is given in [15]. Their solution is similar to ours in the sense that they deploy a boosting technique, which also assumes creating a strong classifier by combining several weak classifiers. Furthermore, they present the results in the terms of F-measure. The advantage of our system is that we can train the parts of our system independently, while boosting algorithm trains its parts one after another. Finally, if we want to consider the possibility of hardware implementation using reconfigurable hardware, not all the systems are appropriate due to their sequential nature or high computational complexity. Due to the parallelism of our algorithm a hardware implementation using reconfigurable devices is possible. This can lead to lower implementation cost with higher level of adaptability compared to the existing solutions and reduced amount of time for system training and testing. 150 Z. Banković, S. Bojanić, and O. 
Nieto-Taladriz

In short, the main advantage of our solution lies in the fact that it includes important characteristics (high accuracy and performance, dealing with rare classes, inherent adaptability, feasibility of hardware implementation) in one solution. We are not familiar with any existing solution that would cover all the characteristics mentioned above.

3 System Implementation

The implemented IDS is a serial combination of two IDSs. The complete system is presented in Fig. 1. The first part is a linear classifier that classifies connections into normal ones and potential attacks. Due to its very low false-negative rate, the decision that it makes on normal connections is considered correct. However, as it exhibits a high false-positive rate, if it opts for an attack, the decision has to be re-checked. This re-checking is performed by a rule-based system whose rules are trained to recognize normal connections. This part of the system exhibits a very low false-positive rate, i.e. the probability for an attack to be incorrectly classified as a normal connection is very low. In this way, the achieved false-positive rate of the entire system is significantly reduced while maintaining a high detection rate. As our system is trained and tested on the KDD99 dataset, the selection of the most important features is performed once at the beginning of the process. An implementation for a real-world environment, however, would require performing the feature selection process before each training process.

Fig. 1. Block Diagram of the Complete System

The linear classifier is based on a linear combination of three features. The features are identified as those that have the highest possibility to take part in an attack by deploying PCA [1]. The details of the PCA algorithm are explained in [16]. The selected features and their explanations are presented in Table 1.

Table 1. The features used to describe the attacks

Name of the feature        Explication
duration                   length (number of seconds) of the connection
src_bytes                  number of data bytes from source to destination
dst_host_srv_serror_rate   percentage of connections that have "SYN" errors

The linear classifier is evolved using a GA [9]. Each chromosome, i.e. potential solution to the problem, in the population consists of four genes, where the first three represent the coefficients of the linear classifier and the fourth one the threshold value. The decision about the current connection is made according to formula (1):

gene(1)*con(duration) + gene(2)*con(src_bytes) + gene(3)*con(dst_host_srv_serror_rate) < gene(4)    (1)

where con(duration), con(src_bytes) and con(dst_host_srv_serror_rate) are the values of the duration, src_bytes and dst_host_srv_serror_rate features of the current connection. The linear classifier is trained using an incremental GA [9]. The population contains 1000 individuals, which were trained during 300 generations. The mutation rate was 0.1 while the crossover rate was 0.9. These values were chosen after a number of experiments: the size of the population and the number of generations were increased until doing so no longer brought a significant performance improvement. The type of crossover deployed was uniform crossover, i.e. a new individual had equal chances to contain either of the genes of both of its parents.
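To make formula (1) concrete, here is a minimal sketch of how a four-gene chromosome classifies a single connection. The gene values and the connection record are hypothetical, and the convention that satisfying the inequality means "normal" is an assumption of the sketch, since the paper does not state the convention explicitly.

```python
# Illustrative sketch of the GA-evolved linear classifier of formula (1).
# The gene values below are hypothetical; the real coefficients are evolved by the GA.

def linear_classifier(genes, connection):
    """genes = [w_duration, w_src_bytes, w_serror_rate, threshold], as in formula (1)."""
    score = (genes[0] * connection["duration"]
             + genes[1] * connection["src_bytes"]
             + genes[2] * connection["dst_host_srv_serror_rate"])
    # Assumption: a weighted sum below the evolved threshold is taken as "normal",
    # otherwise the connection is flagged as a potential attack for re-checking.
    return "normal" if score < genes[3] else "potential attack"

chromosome = [0.02, 0.001, 5.0, 40.0]          # hypothetical individual
conn = {"duration": 12, "src_bytes": 230, "dst_host_srv_serror_rate": 0.9}
print(linear_classifier(chromosome, conn))
```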
The performance measure, i.e. the fitness function, was the squared percentage of the correctly classified connections, computed according to the formula:

fitness = (count / numOfCon)²    (2)

where count is the number of correctly classified connections, while numOfCon is the number of connections in the training dataset. The squared percentage was chosen rather than the simple percentage value because the achieved results were better. The result of this GA was its best individual, which forms the first part of the system presented in Fig. 1.

The second part of the system (Fig. 1) is a rule-based system, where simple if-then rules for recognizing normal connections are evolved. The most important features were taken over from the results obtained in [2] using Multi Expression Programming (MEP). The features and their explanations are listed in Table 2.

Table 2. The features used to describe normal connections

Name of the feature   Explication
service               destination service (e.g. telnet, ftp)
hot                   number of hot indicators
logged_in             1 if successfully logged in; 0 otherwise

An example of a rule can be the following one: if (service="http" and hot="0" and logged_in="0") then normal;

The rules are trained using an incremental GA with the same parameters used for the linear classifier. Each 3-gene chromosome represents a rule, where the value of each gene is the value of its corresponding feature. However, the population used in this case contained 500 individuals, as no improvements were achieved with larger populations. The result of the training was a set of the 200 best-performing rules. The fitness function in this case was the F-value with the parameter 0.8:

fitness = (1.8 * recall * precision) / (0.8 * precision + recall),  precision = TP / (TP + FP),  recall = TP / (TP + FN)    (3)

where TP, FP and FN stand for the number of true positives, false positives and false negatives, respectively.

The system presented here was implemented in the C++ programming language. The software for this work used the GAlib genetic algorithm package, written by Matthew Wall at the Massachusetts Institute of Technology [10]. The time for training the implemented system is 185 seconds, while the testing process takes 45 seconds. The reason for the short training time lies in deploying an incremental GA whose population is not big all the time, i.e. it grows after each iteration. The system was demonstrated on an AMD Athlon 64 X2 Dual Core Processor 3800+ with 1GB of RAM at its disposal.
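A minimal sketch of the second stage follows: hypothetical if-then rules matched against a connection, plus the F-value of formula (3) with parameter 0.8 used as the fitness. The rule set and the counts passed to the function are illustrative only, not the 200 evolved rules or the paper's measurements.

```python
# Minimal sketch of the rule-based second stage and of the F-value fitness of
# formula (3). The rules, the connection and the TP/FP/FN counts are hypothetical.

def matches_any_rule(connection, rules):
    # A rule is a dict of required feature values; any match means "normal".
    return any(all(connection.get(k) == v for k, v in rule.items()) for rule in rules)

def f_value(tp, fp, fn, beta_sq=0.8):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = beta_sq * precision + recall
    # Formula (3): (1 + beta^2) * recall * precision / (beta^2 * precision + recall)
    return (1 + beta_sq) * recall * precision / denom if denom else 0.0

rules = [{"service": "http", "hot": 0, "logged_in": 0}]   # hypothetical rule set
conn = {"service": "http", "hot": 0, "logged_in": 0}
print("normal" if matches_any_rule(conn, rules) else "attack")
print(f_value(tp=90, fp=5, fn=10))                        # illustrative counts only
```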
4 Results

4.1 Training and Testing Datasets

The dataset contains 5000000 network connection records. A connection is a sequence of TCP packets starting and ending at defined times, between which data flows from a source IP to a target IP under a certain protocol [6]. The training portion of the dataset ("kdd_10_percent") contains 494021 connections, of which 20% are normal (97277) and the rest (396743) are attacks. Each connection record contains 41 independent fields and a label (normal or type of attack). Attacks belong to one of four attack categories: user to root, remote to local, probe, and denial of service. The testing dataset ("corrected") provides a dataset with a significantly different statistical distribution than the training dataset (250436 attacks and 60593 normal connections) and contains an additional 14 attacks not included in the training dataset.

The most important flaws of the mentioned dataset are given in the following [7]. The dataset contains biases that may be reflected in the performance of the evaluated systems, for example the skewed nature of the attack distribution. None of the sources explaining the dataset contains any discussion of data rates, and their variation with time is not specified. There is no discussion of whether the quantity of data presented is sufficient to train a statistical anomaly system or a learning-based system. Furthermore, it is demonstrated in [8] that the transformation model used for transforming DARPA's raw network data into a well-featured data item set is 'poor'. Here 'poor' refers to the fact that some attribute values are the same in different data items that have different class labels. Due to this, some of the attacks cannot be classified correctly.

4.2 Obtained Rates

The system was trained using "kdd_10_percent" and tested on the "corrected" dataset. The obtained results are summarized in Table 3. The last column gives the value of the classical F-measure, so that the learning results can easily be compared using a single figure that combines both recall and precision. The false-positive rate is reduced from 40.7% to 1.4%, while the detection rate decreases by only 0.15%. An increase in the F-value is also exhibited.

Table 3. The performance of the whole system and its parts separately

System             Detection rate (Num. / %)   False Positive Rate (Num. / %)   F-measure
Linear Classifier  231030 / 92.25              24628 / 40.7                     0.913
Rule-based         45504 / 75.1                5537 / 2.2                       0.815
Whole system       230625 / 92.1               862 / 1.4                        0.96

The adaptability of the system was tested as well. At first, the system was trained with a subset of "kdd_10_percent" (250000 connections out of 494021). The generated rules were taken as the initial generation and re-trained with the remaining data of the "kdd_10_percent" dataset. Both of the systems were tested on the "corrected" dataset. The system exhibited improvements in both the detection and the false positive rate. The improvements are presented in Table 4.

Table 4. The performance of the system after re-training

System                                          Detection rate (Num. / %)   False Positive rate (Num. / %)   F-measure
Trained with a subset                           183060 / 73.1               1468 / 2.4                       0.84
Re-trained with the rest of the training data   231039 / 92.3               862 / 1.4                        0.96

The drawbacks of the dataset have influenced the obtained rates. As a comparison, the detection rate of the system tested on the same data it was trained on, i.e. "kdd_10_percent", is 99.2%, compared to the detection rate of 92.1% after testing the system using the "corrected" dataset. Thus, the dataset deficiencies stated previously in this section had negative effects on the rates obtained in this work.

5 Conclusions

In this work a novel approach consisting of a serial combination of two GA-based IDSes is introduced. The properties of the resulting system, including its adaptability, were analyzed. The resulting system exhibits very good characteristics in terms of both the detection and false-positive rates and the F-measure. The implementation of the system has been performed in a way that corresponds well to the deployed dataset, mostly in terms of the chosen features. In a real system this does not have to be the case. Thus, an implementation of the system for a real-world environment has to be adjusted in the sense that the set of chosen features has to be changed according to changes in the environment. Due to the inherent high parallelism of the presented system, there is a possibility of implementing it using reconfigurable hardware. This will result in a high-performance real-world implementation with considerably lower implementation cost, 154 Z. Banković, S. Bojanić, and O.
Nieto-Taladriz size and power consumption compared to the existing solutions. Part of the future work will consist in pursuing hardware implementation of the presented system. Acknowledgements. This work has been partially funded by the Spanish Ministry of Education and Science under the project TEC2006-13067-C03-03 and by the European Commission under the FastMatch project FP6 IST 27095. References 1. Banković, Z., Stepanović, D., Bojanić, S., Nieto-Taladriz, O.: Improving Network Security Using Genetic Algorithm Approach. Computers & Electrical Engineering 33(5-6), 438–451 2. Grosan, C., Abraham, A., Chis, M.: Computational Intelligence for light weight intrusion detection systems. In: International Conference on Applied Computing (IADIS 2006), San Sebastian, Spain, pp. 538–542 (2006); ISBN: 9728924097 3. Gong, R.H., Zulkernine, M., Abolmaesumi, P.: A Software Implementation of a Genetic Algorithm Based Approach to Network Intrusion Detection. In: Proceedings of SNPD/SAWN 2005 (2005) 4. Chittur, A.: Model Generation for an Intrusion Detection System Using Genetic Algorithms (accessed in 2006), http://www1.cs.columbia.edu/ids/ publications/gaids-thesis01.pdf 5. Weiss, G.: Mining with rarity: A unifying framework. SIGKDD Explorations 6(1), 7–19 (2004) 6. http://kdd.ics.uci.edu/ (October 1999) 7. McHugh, J.: Testing Intrusion Detection Systems: A Critique of the 1998 and 1999 DARPA IDS Evaluation as Performed by Lincoln Laboratory. ACM Trans. on Information and System security 3(4), 262–294 (2000) 8. Bouzida, Y., Cuppens, F.: Detecting known and novel network intrusion. In: IFIP/SEC 2006 21st International Information Security Conference, Karlstad, Sweden (2006) 9. Goldberg, D.E.: Genetic algorithms for search, optimization, and machine learning. Addison-Wesley, Reading (1989) 10. GAlib, A.: C++ Library of Genetic Algorithm Components, http://lancet.mit.edu/ga/ 11. Pan, Z., Chen, S., Hu, G., Zhang, D.: Hybrid Neural Network and C4.5 for Misuse Detection. In: Proceedings of the Second International Conference on Machine Learning and Cybernetics, November 2003, vol. 4, pp. 2463–2467 (2003) 12. Folino, G., Pizzuti, C., Spezzano, G.: GP Ensemble for Distributed Intrusion Detection Systems. In: Singh, S., Singh, M., Apte, C., Perner, P. (eds.) ICAPR 2005. LNCS, vol. 3686. Springer, Heidelberg (2005) 13. Laskov, P., Düssel, P., Schäfer, C., Rieck, K.: Learning Intrusion Detection: Supervised or Unsaupervised? In: Roli, F., Vitulano, S. (eds.) ICIAP 2005. LNCS, vol. 3617, pp. 50–57. Springer, Heidelberg (2005) 14. Yao, J.T., Zhao, S.L., Saxton, L.V.: A Study on Fuzzy Intrusion Detection. Data mining, intrusion detection, information assurance and data networks security (2005) 15. Chawla, N.V., Lazarevic, A., Hall, L.O., Bowyer, K.: SMOTEBoost: Improving prediction of the minority class in boosting. In: Proceedings of Principles of Knowledge Discovery in Databases (2003) 16. Fodor, I.K.: A Survey of Dimension Reduction Techniques, http://llnl.gov/CASC/sapphire/pubs Agents and Neural Networks for Intrusion Detection Álvaro Herrero and Emilio Corchado Department of Civil Engineering, University of Burgos C/ Francisco de Vitoria s/n, 09006 Burgos, Spain {ahcosio, escorchado}@ubu.es Abstract. Up to now, several Artificial Intelligence (AI) techniques and paradigms have been successfully applied to the field of Intrusion Detection in Computer Networks. Most of them were proposed to work in isolation. 
On the contrary, the new approach of hybrid artificial intelligent systems, which is based on the combination of AI techniques and paradigms, is proving to successfully address complex problems. In keeping with this idea, we propose a hybrid use of three widely proven paradigms of computational intelligence, namely Multi-Agent Systems, Case Based Reasoning and Neural Networks, for Intrusion Detection. Some neural models based on different statistics (such as the distance, the variance, the kurtosis or the skewness) have been tested to detect anomalies in packet-based network traffic. The projection method of Curvilinear Component Analysis has been applied for the first time in this study to perform packet-based intrusion detection. The proposed framework has been tested on anomalous situations related to the Simple Network Management Protocol as well as on normal traffic.

Keywords: Multiagent Systems, Case Based Reasoning, Artificial Neural Networks, Unsupervised Learning, Projection Methods, Computer Network Security, Intrusion Detection.

1 Introduction

Firewalls are the most widely used tools for securing networks, but Intrusion Detection Systems (IDS's) are becoming more and more popular [1]. IDS's monitor the activity of the network with the purpose of identifying intrusive events and can take actions to abort these risky events. A wide range of techniques have been used to build IDS's. On the one hand, there have been some previous attempts to take advantage of agents and Multi-Agent Systems (MAS) [2] in the field of Intrusion Detection (ID) [3], [4], [5], including the mobile-agents approach [6], [7]. On the other hand, some different machine learning models – including Data Mining techniques and Artificial Neural Networks (ANN) – have been successfully applied for ID [8], [9], [10], [11]. Additionally, some other Artificial Intelligence techniques (such as Genetic Algorithms and Fuzzy Logic, Genetic Algorithms and K-Nearest Neighbor (K-NN), or K-NN and ANN, among others) [12], [13] have been combined in order to face ID from a hybrid point of view.

This paper employs a framework based on a dynamic multiagent architecture employing deliberative agents capable of learning and evolving with the environment [14]. These agents may incorporate different identification or projection algorithms depending on their goals. In this case, a neural model based on the study of some statistical features (such as the variance, the interpoint distance or the skew and kurtosis indexes) will be embedded in such agents. One of the main novelties of this paper is the application of Curvilinear Component Analysis (CCA) for packet-based intrusion detection.

The overall structure of this paper is the following: Section 2 outlines the ID multiagent system, Section 3 describes the neural models applied in this research, Section 4 presents some experimental results and, finally, Section 5 goes over some conclusions and future work.

2 Agent-Based IDS

An ID framework, called Visualization Connectionist Agent-Based IDS (MOVICAB-IDS) [14] and based on software agents [2] and neural models, is introduced. This MAS incorporates different types of agents; some of the agents have been designed as CBR-BDI agents [15], [16] including an ANN for ID tasks, while some others are reactive agents.
CBR-BDI agents use Case Based Reasoning (CBR) systems [17] as a reasoning mechanism, which allows them to learn from initial knowledge, to interact autonomously with the environment, users and other agents within the system, and to have a large capacity for adaptation to the needs of its surroundings. MOVICAB-IDS includes deliberative agents using a CBR architecture. These CBR-BDI agents work at a high level with the concepts of Believes, Desires and Intentions (BDI) [18]. CBR-BDI agents have learning and adaptation capabilities, what facilitates their work in dynamic environments. The extended version of the Gaia methodology [19] was applied, and some roles and protocols where identified after the Architectural Design Stage [14]. The six agent classes identified in the Detailed Design Stage were: SNIFFER, PREPROCESSOR, ANALYZER, CONFIGURATIONMANAGER, COORDINATOR and VISUALIZER. 2.1 Agent Classes The agent classes previously mentioned are described in the following paragraphs. Sniffer Agent This reactive agent is in charge of capturing traffic data. The continuous traffic flow is captured and split into segments in order to send it through the network for further process. As these agents are the most critical ones, there are cloned agents (one per network segment) ready to substitute the active ones when they fail. Preprocessor Agent After splitting traffic data, the generated segments are preprocessed to apply subsequent analysis. Once the data has been preprocessed, an analysis for this new piece of data is requested. Analyzer Agent This is a CBR-BDI agent embedding a neural model within the adaptation stage of its CBR system that helps to analyze the preprocessed traffic data. This agent is based on the application of different neural models allowing the projection of network data. In this paper, PCA [20], CCA [21], MLHL [22] and CHMLHL [23] (See Section 3) have Agents and Neural Networks for Intrusion Detection 157 been applied for comparison reasons. This agent generates a solution (getting an adequate projection of the preprocessed data) by retrieving a case and analyzing the new one using a neural network. Each case incorporates several features, such as segment length (in ms), total number of packets and neural model parameters among others. A further description of the CBR four steps for this agent can be found in [14]. ConfigurationManager Agent The processes of data capture, split, preprocess and analysis depends on the values of several parameters, as for example: packets to capture, segment length, features to extract... This information is managed by the CONFIGURATIONMANAGER reactive agent, which is in charge of providing this information to some other agents. Coordinator Agent There can be several instances (from 1 to m) of the ANALYZER class of agent. In order to improve the efficiency and perform a real-time processing, the preprocessed data must be dynamically and optimally assigned to ANALYZER agents. This assignment is performed by the COORDINATOR agent. Visualizer Agent At the very end of the process, this interface agent presents the analyzed data to the network administrator by means of a functional and mobile visualization interface. To improve the accessibility of the system, the administrator may visualize the results on a mobile device, enabling informed decisions to be taken anywhere and at any time. 
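The division of labour among the six agent classes can be pictured with the toy sketch below; it only mimics the message flow (capture, split, preprocessing, assignment to analyzers, visualization) and makes no attempt to reproduce the Gaia design, the CBR-BDI reasoning cycle or the neural models, so all class names, parameters and data are illustrative.

```python
# Toy sketch of the MOVICAB-IDS message flow described above; responsibilities only,
# not the authors' implementation.

class ConfigurationManager:            # provides capture/split/analysis parameters
    params = {"segment_size": 3, "features": ["timestamp", "protocol",
                                              "source_port", "dest_port", "size"]}

class Sniffer:                         # captures traffic and splits it into segments
    def capture(self, traffic, segment_size):
        return [traffic[i:i + segment_size] for i in range(0, len(traffic), segment_size)]

class Preprocessor:                    # keeps only the configured features
    def preprocess(self, segment, features):
        return [{f: pkt.get(f) for f in features} for pkt in segment]

class Analyzer:                        # placeholder for the neural projection step
    def analyze(self, data):
        return {"packets": len(data), "projection": data}

class Coordinator:                     # assigns preprocessed segments to analyzers
    def __init__(self, analyzers):
        self.analyzers = analyzers
    def assign(self, data, i):
        return self.analyzers[i % len(self.analyzers)].analyze(data)

class Visualizer:                      # presents results to the administrator
    def show(self, result):
        print("segment of", result["packets"], "packets projected")

config = ConfigurationManager()
traffic = [{"timestamp": t, "protocol": 17, "source_port": 1025, "dest_port": 161, "size": 90}
           for t in range(7)]
coordinator = Coordinator([Analyzer(), Analyzer()])
for i, segment in enumerate(Sniffer().capture(traffic, config.params["segment_size"])):
    data = Preprocessor().preprocess(segment, config.params["features"])
    Visualizer().show(coordinator.assign(data, i))
```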
3 Neural Projection Models Projection models are used as tools to identify and remove correlations between problem variables, which enable us to carry out dimensionality reduction, visualization or exploratory data analysis. In this study, some neural projection models, namely PCA, MLHL, CMLHL and CCA have been applied for ID. Principal Component Analysis (PCA) [20] is a standard statistical technique for compressing data; it can be shown to give the best linear compression of the data in terms of least mean square error. There are several ANN which have been shown to perform PCA [24], [25], [26]. It describes the variation in a set of multivariate data in terms of a set of uncorrelated variables each of which is a linear combination of the original variables. Its goal is to derive new variables, in decreasing order of importance, which are linear combinations of the original variables and are uncorrelated with each other. Curvilinear Component Analysis (CCA) [21] is a nonlinear dimensionality reduction method. It was developed as an improvement on the Self Organizing Map (SOM) [27], trying to circumvent the limitations inherent in some linear models such as PCA. CCA is performed by a self-organised neural network calculating a vector quantization of the submanifold in the data set (input space) and a nonlinear projection of these quantising vectors toward an output space. As regards its goal, the projection part of CCA is similar to other nonlinear mapping methods, as it minimizes a cost function based on interpoint distances in both input and output spaces. Quantization and nonlinear mapping are separately performed: firstly, the input vectors are forced to become prototypes of the distribution 158 Á. Herrero and E. Corchado using a vector quantization (VQ) method, and then, the output layer builds a nonlinear mapping of the input vectors. Cooperative Maximum Likelihood Hebbian Learning (CMLHL) [23] extends the MLHL model [22] that is a neural implementation of Exploratory Projection Pursuit (EPP). The statistical method of EPP [28] linearly project a data set onto a set of basis vectors which best reveal the interesting structure in data. MLHL identifies interestingness by maximising the probability of the residuals under specific probability density functions which are non-Gaussian. CMLHL extends the MLHL paradigm by adding lateral connections [23], which have been derived from the Rectified Gaussian Distribution [29]. The resultant model can find the independent factors of a data set but does so in a way that captures some type of global ordering in the data set. 4 Experiments and Results In this work, the above described neural models have been applied to a real traffic data set [11] containing “normal” traffic and some anomalous situations. These anomalous situations are related to the Simple Network Management Protocol (SNMP), known by its vulnerabilities [30]. Apart from “normal” traffic, the data set includes: SNMP ports sweeps (scanning of network hosts at different ports - a random port number: 3750, and SNMP default port numbers: 161 and 162), and a transfer of information stored in the Management Information Base (MIB), that is, the SNMP database. This data set contains only five variables extracted from the packet headers: timestamp (the time when the packet was sent), protocol, source port (the port of the source host that sent the packet), destination port (the destination host port number to which the packet is sent) and size: (total packet size in Bytes). 
This data set was generated in a medium-sized network, so the "normal" and anomalous traffic flows were known in advance. As SNMP is based on UDP, only 5866 UDP-based packets were included in the dataset. In this work, the performance of the previously described projection models (PCA, CCA, MLHL and CMLHL) has been analysed and compared on this dataset (see Figs. 1 and 2).

PCA was initially applied to the previously described dataset. The PCA projection is shown in Fig. 1.a. After analysing this projection, it is discovered that the normal traffic evolves in parallel straight lines. Based on this parallelism to normal traffic, PCA is only able to identify the port sweeps (Groups 3, 4 and 5 in Fig. 1.a). On the contrary, it fails to detect the MIB information transfer (Groups 1 and 2 in Fig. 1.a) because the packets in this anomalous situation evolve in a direction parallel to the "normal" one.

Fig. 1.b shows the MLHL projection of the dataset. Once again, the normal traffic evolves in parallel straight lines. There are some other groups (Groups 1, 2, 3, 4 and 5 in Fig. 1.b) evolving in an anomalous way. In this case, all the anomalous situations contained in the dataset can be identified due to their non-parallel evolution with respect to the normal direction. Additionally, in the case of the MIB transfer (Groups 1 and 2 in Fig. 1.b), the high concentration of packets must be considered an anomalous feature.

Fig. 1. (a) PCA projection. (b) MLHL projection. (c) CMLHL projection. (d) CCA (Euclidean dist.) projection.

It can be seen in Fig. 1.c how the CMLHL model is able to identify the two anomalous situations contained in the data set. As in the case of MLHL, the MIB information transfer (Groups 1 and 2 in Fig. 1.c) is identified due to its orthogonal direction with respect to the normal traffic and to the high density of packets. The sweeps (Groups 3, 4 and 5 in Fig. 1.c) are identified due to their non-parallel direction to the normal one.

Several experiments were conducted to apply CCA to the analysed data set, tuning the different options and parameters, such as the type of initialization, the number of epochs and the distance criterion. The best (from a projection point of view) CCA result, based on the Standardized Euclidean Distance, is depicted in Fig. 2. There is a marked contrast between the behavioral pattern shown by the normal traffic in the previous projections and the evolution of normal traffic in the CCA projection. In the latter, some of the packets belonging to normal traffic do not evolve in parallel straight lines. That is the case of Groups 1 and 2 in Fig. 2. The anomalous traffic shows an abnormal evolution once again (Groups 3 and 4 in Fig. 2), so it is not as easy as in the previous projections to distinguish the anomalous traffic from the "normal" one.

Fig. 2. CCA projection (employing Standardized Euclidean distance)

For comparison purposes, a different CCA projection for the same dataset is shown in Fig. 1.d. This projection is based on the simple Euclidean Distance. The anomalous situations cannot be identified in this case, as the evolution of the normal and anomalous traffic is similar.
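Since the choice of distance criterion turned out to be decisive for the CCA projection, the short sketch below contrasts the plain and the Standardized Euclidean distances on made-up packet feature vectors (timestamp, protocol, source port, destination port, size); it illustrates the metric only, not the CCA implementation used in this study.

```python
import numpy as np

# Made-up packet feature vectors: timestamp, protocol, source port, dest port, size.
X = np.array([
    [0.0, 17, 1025, 161, 90],
    [1.2, 17, 1026, 161, 92],
    [1.3, 17, 3750, 161, 60],
], dtype=float)

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def standardized_euclidean(a, b, data):
    # Each squared difference is divided by the variance of that feature over the
    # data set, so wide-range features (e.g. port numbers) do not dominate.
    var = data.var(axis=0)
    var[var == 0] = 1.0          # guard against constant features
    return np.sqrt(np.sum((a - b) ** 2 / var))

print(euclidean(X[0], X[2]), standardized_euclidean(X[0], X[2], X))
```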
Different distance criteria such as Cityblock, Humming and some others were tested as well. None of them surpass the Standardized Euclidean Distance, whose projection is shown on Fig. 2. 5 Conclusions and Future Work The use of embedded ANN in the deliberative agents of a dynamic MAS let us take advantage of some of the properties of ANN (such as generalization) and agents (reactivity, proactivity and sociability) making the ID task possible. It is worth mentioning that, as in other application fields, the tuning of the different neural models is of extreme importance. Although the neural model can get a useful projection, a wrong tuning of the model can lead to a useless outcome, as is the case of Fig 1.d. We can conclude as well that CMLHL outperforms MLHL, PCA and CCA. This probes the intrinsic robustness of CMLHL, which is able to properly respond to a complex data set that includes time as a variable. Further work will focus on the application of high-performance computing clusters. Increased system power will be used to enable the IDS to process and display the traffic data in real time. Acknowledgments. This research has been partially supported by the project BU006A08 of the JCyL. Agents and Neural Networks for Intrusion Detection 161 References 1. Chuvakin, A.: Monitoring IDS. Information Security Journal: A Global Perspective 12(6), 12–16 (2004) 2. Wooldridge, M., Jennings, N.R.: Agent theories, architectures, and languages: A survey. Intelligent Agents (1995) 3. Spafford, E.H., Zamboni, D.: Intrusion Detection Using Autonomous Agents. Computer Networks: The Int. Journal of Computer and Telecommunications Networking 34(4), 547– 570 (2000) 4. Hegazy, I.M., Al-Arif, T., Fayed, Z.T., Faheem, H.M.: A Multi-agent Based System for Intrusion Detection. IEEE Potentials 22(4), 28–31 (2003) 5. Dasgupta, D., Gonzalez, F., Yallapu, K., Gomez, J., Yarramsettii, R.: CIDS: An agentbased intrusion detection system. Computers & Security 24(5), 387–398 (2005) 6. Wang, H.Q., Wang, Z.Q., Zhao, Q., Wang, G.F., Zheng, R.J., Liu, D.X.: Mobile Agents for Network Intrusion Resistance. In: Shen, H.T., Li, J., Li, M., Ni, J., Wang, W. (eds.) APWeb Workshops 2006. LNCS, vol. 3842, pp. 965–970. Springer, Heidelberg (2006) 7. Deeter, K., Singh, K., Wilson, S., Filipozzi, L., Vuong, S.: APHIDS: A Mobile AgentBased Programmable Hybrid Intrusion Detection System. In: Karmouch, A., Korba, L., Madeira, E.R.M. (eds.) MATA 2004. LNCS, vol. 3284, pp. 244–253. Springer, Heidelberg (2004) 8. Laskov, P., Dussel, P., Schafer, C., Rieck, K.: Learning Intrusion Detection: Supervised or Unsupervised? In: Roli, F., Vitulano, S. (eds.) ICIAP 2005. LNCS, vol. 3617, pp. 50–57. Springer, Heidelberg (2005) 9. Liao, Y.H., Vemuri, V.R.: Use of K-Nearest Neighbor Classifier for Intrusion Detection. Computers & Security 21(5), 439–448 (2002) 10. Sarasamma, S.T., Zhu, Q.M.A., Huff, J.: Hierarchical Kohonenen Net for Anomaly Detection in Network Security. IEEE Transactions on Systems Man and Cybernetics, Part B 35(2), 302–312 (2005) 11. Corchado, E., Herrero, A., Sáiz, J.M.: Detecting Compounded Anomalous SNMP Situations Using Cooperative Unsupervised Pattern Recognition. In: Duch, W., Kacprzyk, J., Oja, E., Zadrożny, S. (eds.) ICANN 2005. LNCS, vol. 3697, pp. 905–910. Springer, Heidelberg (2005) 12. Middlemiss, M., Dick, G.: Feature Selection of Intrusion Detection Data Using a Hybrid Genetic Algorithm/KNN Approach. In: Design and Application of Hybrid Intelligent Systems, pp. 519–527. IOS Press, Amsterdam (2003) 13. 
Kholfi, S., Habib, M., Aljahdali, S.: Best Hybrid Classifiers for Intrusion Detection. Journal of Computational Methods in Science and Engineering 6(2), 299–307 (2006) 14. Herrero, Á., Corchado, E., Pellicer, M., Abraham, A.: Hybrid Multi Agent-Neural Network Intrusion Detection with Mobile Visualization. In: Innovations in Hybrid Intelligent Systems. Advances in Soft Computing, vol. 44, pp. 320–328. Springer, Heidelberg (2007) 15. Corchado, J.M., Laza, R.: Constructing Deliberative Agents with Case-Based Reasoning Technology. International Journal of Intelligent Systems 18(12), 1227–1241 (2003) 16. Pellicer, M.A., Corchado, J.M.: Development of CBR-BDI Agents. International Journal of Computer Science and Applications 2(1), 25–32 (2005) 17. Aamodt, A., Plaza, E.: Case-Based Reasoning - Foundational Issues, Methodological Variations, and System Approaches. AI Communications 7(1), 39–59 (1994) 18. Bratman, M.E.: Intentions, Plans and Practical Reason. Harvard University Press, Cambridge (1987) 162 Á. Herrero and E. Corchado 19. Zambonelli, F., Jennings, N.R., Wooldridge, M.: Developing Multiagent Systems: the Gaia Methodology. ACM Transactions on Software Engineering and Methodology 12(3), 317–370 (2003) 20. Pearson, K.: On Lines and Planes of Closest Fit to Systems of Points in Space. Philosophical Magazine 2(6), 559–572 (1901) 21. Demartines, P., Herault, J.: Curvilinear Component Analysis: A Self-Organizing Neural Network for Nonlinear Mapping of Data Sets. IEEE Transactions on Neural Networks 8(1), 148–154 (1997) 22. Corchado, E., MacDonald, D., Fyfe, C.: Maximum and Minimum Likelihood Hebbian Learning for Exploratory Projection Pursuit. Data Mining and Knowledge Discovery 8(3), 203–225 (2004) 23. Corchado, E., Fyfe, C.: Connectionist Techniques for the Identification and Suppression of Interfering Underlying Factors. Int. Journal of Pattern Recognition and Artificial Intelligence 17(8), 1447–1466 (2003) 24. Oja, E.: A Simplified Neuron Model as a Principal Component Analyzer. Journal of Mathematical Biology 15(3), 267–273 (1982) 25. Sanger, D.: Contribution Analysis: a Technique for Assigning Responsibilities to Hidden Units in Connectionist Networks. Connection Science 1(2), 115–138 (1989) 26. Fyfe, C.: A Neural Network for PCA and Beyond. Neural Processing Letters 6(1-2), 33–41 (1997) 27. Kohonen, T.: The Self-Organizing Map. Proceedings of the IEEE 78(9), 1464–1480 (1990) 28. Friedman, J.H., Tukey, J.W.: A Projection Pursuit Algorithm for Exploratory DataAnalysis. IEEE Transactions on Computers 23(9), 881–890 (1974) 29. Seung, H.S., Socci, N.D., Lee, D.: The Rectified Gaussian Distribution. Advances in Neural Information Processing Systems 10, 350–356 (1998) 30. Cisco Secure Consulting. Vulnerability Statistics Report (2000) Cluster Analysis for Anomaly Detection Giuseppe Lieto, Fabio Orsini, and Genoveffa Pagano System Management, Inc. Italy glieto@sysmanagement.it, forsini@sysmanagement.it, gpagano@pridelabs.it Abstract. This document presents a technique of traffic analysis, looking for attempted intrusion and information attacks. A traffic classifier aggregates packets in clusters by means of an adapted genetic algorithm. In a network with traffic homogenous over the time, clusters do not vary in number and characteristics. In the event of attacks or introduction of new applications the clusters change in number and characteristics. 
The set of data processed for the test are extracted from traffic DARPA, provided by MIT Lincoln Labs and commonly used to test effectiveness and efficiency of systems for Intrusion Detection. The target events of the trials are Denial of Service and Reconaissance. The experimental evidence shows that, even with an input of unrefined data, the algorithm is able to classify, with discrete accuracy, malicious events. 1 Introduction Anomaly detection techniques are based on traffic experience retrieval on the network to protect, so that abnormal traffic, both in quantity and in quality, can be detected. Almost all approaches are based on anti-intrusion system learning of what is normal traffic: anything different from the normal traffic is malicious or suspect. The normal traffic classification is made analyzing: • Application level content, with a textual characterization; • The whole connection and packet headers, usually using clustering techniques; • Traffic quantity and connections transition frequencies, by modelling the users behaviour in different hours, according to the services and the applications they use. 1.1 Anomaly Detection with Clustering on Header Data The most interesting studies are related to learning algorithms without human supervision. They classify the traffic in different clusters [3], each of them contains strongly correlated packets. Packets characterization is based on header fields, while the cluster creation can be realized with different algorithms. All the traffic that doesn’t belong to normal clusters, is classified as abnormal; the following step is to distinguish between abnormal and malicious traffic. Traffic characterization starts with data mining and creation of multidimensional vectors, called feature vectors, whose components represent the instance dimensions. The choice of relevant attributes for the instances is really important for the characterization and many studies focus on the evaluation of the best techniques for features choice. E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 163–169, 2009. springerlink.com © Springer-Verlag Berlin Heidelberg 2009 164 G. Lieto, F. Orsini, and G. Pagano The approach chosen in [4] considers the connections as an unique entity, containing several packets. In that way is possible to retrieve from network data information as the observing domain, the hosts number, the active applications, the number of users connected to a host or to a service. The approach introduced above is the most used one. Anyway there are classification trials based on raw data. An interesting study field is DoS attack detection: such attacks produce an undistinguishable traffic. [1] proposes a defence based on edge routers, that can create different queues for malicious packets and normal traffic. The distinction takes place through an anomaly detection algorithm that classifies the normal traffic using a kmeans clustering, based on observation of the statistic trend of the traffic. Experimental results state that, during a DoS attack, new denser, more populated and bigger clusters are created. Even the sudden increase in density of an existing cluster can mean an attach is ongoing, if it is observed together with an immediate variation of the single features mean values. The different queues allow to shorten up the normal connection time; moreover, without the need to stop the suspect connection, the band dedicated to DoS traffic decreases (because of long queues), together with the effects on resources management. 
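The density-monitoring idea attributed to [1] can be sketched as follows: cluster one window of traffic with k-means and compare cluster populations and dispersion with the next window. The synthetic data, the value of k and the use of scikit-learn are assumptions made only for illustration; they are not part of [1] or of our system.

```python
import numpy as np
from sklearn.cluster import KMeans

# Sketch of density monitoring: cluster a window of traffic, then watch how cluster
# sizes and dispersion change in the next window. Synthetic data stands in for
# real packet feature vectors.
rng = np.random.default_rng(1)
normal_window = rng.normal(0, 1, size=(300, 4))
attack_window = np.vstack([rng.normal(0, 1, size=(150, 4)),
                           rng.normal(5, 0.05, size=(150, 4))])   # dense DoS-like burst

def cluster_profile(window, k=4):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(window)
    sizes = np.bincount(km.labels_, minlength=k)
    return sizes, km.inertia_ / len(window)      # population per cluster, mean dispersion

for name, window in [("normal", normal_window), ("attack", attack_window)]:
    sizes, dispersion = cluster_profile(window)
    print(name, "cluster sizes:", sizes, "mean dispersion:", round(dispersion, 3))
```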
In our approach, we chose a genetic clustering technique, Unsupervised Niche Clustering (UNC) [2], to classify the network traffic. UNC uses an evolutionary algorithm with a niching strategy, allowing it to maintain genetic niches as candidate clusters rather than losing awareness of small groups of strongly peculiar genetic individuals, which would end up extinguished using a traditional evolutionary algorithm. Analysing the traffic, we applied the algorithm to several groups of individuals, monitoring the formation of new clusters and the trend of the density in already existing clusters [1]. Moreover, we observed the trend in the number of clusters extracted from a fixed number of individuals. The two approaches were tested during normal network activities and then compared to the results obtained during a DoS attack or a reconnaissance activity.

2 Description of the Algorithm: Unsupervised Niche Clustering

Unsupervised Niche Clustering aims at searching the solution space for any number of clusters. It maintains dense areas in the solution space using an evolutionary algorithm and a niching technique. As in nature, niches represent subspaces of the environment that support different types of individuals with similar genetic characteristics.

2.1 Encoding Scheme

Each individual represents a candidate cluster: it is characterized by the center, an individual in n dimensions, a robust measure of the scale and its fitness.

Table 1. Feature vector of each individual

Genome   Gi = (gi1, gi2, …, gin)
Scale    σi²
Fitness  fi

2.2 Genetic Operators and Scale

At the very first step, the scale is assigned to each individual in an empirical way: we assume the i-th individual is the center, that is, the mean value, of a cluster containing all the individuals in the solution space. At each generation, parents and offspring update their scale in a recursive manner, according to equation (1):

σi² = Σj wij·dij² / Σj wij    (1)

wij = exp(−dij² / (2σi²))    (2)

where wij represents a robust weight measuring how much the j-th individual belongs to the i-th cluster; dij is the Euclidean distance of the j-th individual from the center of the i-th cluster; N is the number of individuals in the solution space, over which the sums run. At the moment of their birth, children inherit their closest parent's scale. Generation of an offspring is made by two genetic operators: crossover and mutation. In our work we implemented one-point crossover in each dimension, combining the most significant bits of one parent with the least significant ones of the second parent. Mutation can modify each bit of the genome with a given probability: in our case, we chose a mutation probability of 0.001. Equation (1) maximizes the fitness value (3) for the i-th cluster.

2.3 Fitness Function

The fitness of the i-th individual is represented by the density of a hypothetical cluster having the i-th individual as its center:

fi = Σj wij / σi²    (3)

2.4 Niching

UNC uses Deterministic Crowding (DC) to create and maintain niches. The DC steps are:

1. choose the couple of parents;
2. apply crossover and mutation;
3. calculate the distance from each parent to each child;
4. couple one child with one parent, so that the sum of the two parent-child distances is minimized;
5. in each parent-child couple, the one with the best fitness survives, and the other is discarded from the population.

Through DC, a child's fitness for survival is evaluated by comparing it only to its paired parent's fitness: such an approach keeps the comparisons within a limited portion of the solution space, as sketched below.
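The sketch below runs one DC generation over real-valued candidate centers using the density fitness of formula (3); for brevity it replaces the bitwise one-point crossover and bit mutation described above with simplified per-gene operators and keeps a single fixed scale, so it illustrates the niching logic rather than the authors' implementation.

```python
import random, math

# One Deterministic Crowding generation for UNC-style candidate centers (sketch).
# Simplified per-gene crossover, Gaussian mutation and a fixed scale are assumptions.

def dist2(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def density_fitness(center, data, sigma2):
    # Fitness (3): sum of robust weights (2) divided by the scale.
    return sum(math.exp(-dist2(center, x) / (2 * sigma2)) for x in data) / sigma2

def crossover(a, b):
    return [random.choice(pair) for pair in zip(a, b)]      # per-gene choice

def mutate(ind, rate):
    return [g + random.gauss(0, 0.1) if random.random() < rate else g for g in ind]

def dc_generation(population, data, sigma2, mut_rate=0.001):
    random.shuffle(population)                               # odd-sized last individual is dropped here
    new_pop = []
    for p1, p2 in zip(population[0::2], population[1::2]):
        c1 = mutate(crossover(p1, p2), mut_rate)
        c2 = mutate(crossover(p2, p1), mut_rate)
        # Pair each child with the parent minimizing the sum of parent-child distances.
        if dist2(p1, c1) + dist2(p2, c2) > dist2(p1, c2) + dist2(p2, c1):
            c1, c2 = c2, c1
        for parent, child in ((p1, c1), (p2, c2)):
            better = child if density_fitness(child, data, sigma2) > density_fitness(parent, data, sigma2) else parent
            new_pop.append(better)
    return new_pop

data = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(200)]
population = [[random.uniform(-2, 2), random.uniform(-2, 2)] for _ in range(20)]
population = dc_generation(population, data, sigma2=1.0)
```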
In addition, we analysed a conservative approach for step 1. The parents were chosen so that their distance was under a fixed threshold and their fitness had the same order of magnitude, so as to maintain genetic diversity. Coupling very distant individuals with highly different fitness values would quickly extinguish the weakest individual, losing the notion of evolutionary niches.

2.5 Extraction of the Cluster Centers

The final cluster centers are the individuals in the final population with a fitness greater than a given value: in our case, greater than the mean fitness of the entire population.

2.6 Cluster Characterization

The assignment of each individual to a cluster does not follow a binary logic; fuzzy logic is applied instead. Clusters do not have a radius; instead, we assigned to each individual a degree of belonging to each cluster. The member functions of the fuzzy set are the Gaussian functions in equation (4):

μi(xj) = exp(−dij² / (2σi²))    (4)

where the mean value is the center of the cluster, and the scale of the center coincides with the scale of the corresponding Gaussian function. An individual will be considered as belonging to the cluster which maximizes the belonging function (4).

3 Data Set

For the correct exploration of the solution space, the genomes must correctly represent the physical reality under study. Our work can be divided into the following phases:

• We created an instrument to extract network traffic data.
• We investigated the results of a genetic clustering without any data manipulation, in order to observe whether raw header data could correctly represent the network traffic population, without any human understanding of attribute meaning. This approach proved to be completely different from the analytic one proposed in [4].
• We observed the evolution of the cluster centres to detect DoS and scanning attacks.
• All data were extracted from tcpdump text files, obtained from real network traffic; the packets have been studied starting from the third level of the TCP/IP stack. This choice causes the loss of the information linking IP addresses to physical addresses, contained in ARP tables.
• The header values were extracted, separated, and converted into long integers. IP addresses were divided into two separate segments and later converted, because of the maximum representation capacity of our computers.
• From a single data set we implemented an object made up of the whole population under examination. The choice of the header fields and the population size was made using a heuristic method.

4 Experimental Results

We focused on attacks such as Denial of Service and scanning activities. Our data set was extracted from the DARPA data set, built by the Lincoln Laboratory at MIT in Boston, USA.

4.1 Experimentation 1

The first example is an IP sweep attack: the attacker sends an ICMP packet to each machine, in order to discover whether they are on at the moment of the attack. We applied the algorithm to an initial train of 5000 packets; then, we monitored the trend of the centers on the following trains of 1000 packets.

Fig. 1. Trends of clusters for Neptune attack

As the number of clusters is not predefined, we related each cluster representative of a train to the closest one belonging to the following train of packets (see the sketch below). For unassigned centers, we calculated the minimum distance from the preceding clusters.
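The matching of centers between consecutive trains, and the threshold test on unassigned centers discussed in the next paragraph, can be sketched as follows; the center coordinates and the alarm threshold are hypothetical values chosen for illustration.

```python
import math

# Sketch of relating each new train's cluster centers to the previous train's centers
# and flagging those that stay far from every known cluster. Values are hypothetical.

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def relate_centers(prev_centers, new_centers, threshold):
    alarms = []
    for center in new_centers:
        d_min = min(distance(center, p) for p in prev_centers)
        if d_min > threshold:
            alarms.append((center, d_min))    # unassigned center, possible new event
    return alarms

prev = [(6, 80, 25), (17, 53, 53), (6, 1025, 80)]   # (protocol, dst port, src port)-style centers
new = [(6, 80, 25), (1, 0, 0)]                      # an ICMP-like center appears
print(relate_centers(prev, new, threshold=30.0))
```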
We observed that ICMP clusters can never be assigned to a preceding train of packets, and their minimum distance is far larger than that of any other unassigned cluster. Figure 1 shows the evolution of the normal clusters and the isolated cluster created during the attack (the red one). The three dimensions are transport layer protocol, destination port and source port. We identified the attack when an unassigned cluster's minimum distance was higher than a threshold value. We had the same results with a Port Sweep attack. However, we faced some false positives in the presence of DNS requests: this could be avoided by accurately assigning weight functions to balance the different kinds of normal traffic in the network.

4.2 Experimentation 2

The second case we analysed was a Neptune attack, which causes a denial of service by flooding the target machine with SYN TCP packets and never finalizing the three-way handshake. Handling an attack producing a huge number of packets, we expected a rise in the number of clusters calculated on the packet trains containing the attack, characterized by a density higher than the one observed during normal activities.

Table 2. Evolution of the cluster centers during Neptune attack

Train number  Attack  Colour in fig. 2  Population  Number of Clusters  Average Scale of the Clusters in the Train
1             No      1                 5000        18                  2.37E+08
2             No      2                 1000        13                  4.35E+08
3             No      3                 1000        13                  5.29E+08
4             Yes     4                 1000        21                  5.95E+08
5             Yes     5                 1000        2                   8.40E+07
6             Yes     6                 1000        2                   1.70E+08

Fig. 2. Dispersion around the centers in each train of packets (scale vs. data set)

In figure 2, we represent the clusters' centers in each train of packets. We observed a strong contraction in the number of clusters. Moreover, the individuals in the population were much less dispersed around these centers than around the normal centers. It is evident that the centers, though calculated from the same number of packets, diminish abruptly in number. Moreover, the dispersion around the centers diminishes just as abruptly, as seen in figure 3.

5 Conclusions

Experimental results show that our algorithm can identify new events happening in trains consisting of a given, small number of packets: its sensitivity applies to attacks producing a large number of homogeneous packets. The evolutionary approach proved to be feasible, stressing a trend in traffic: thanks to the recombination of data and to the random component, the cluster centers can be identified in genomes not present in the initial solution space. Using a hill-climbing procedure, UNC selects the fittest individuals, preserving evolutionary niches generation by generation; by monitoring the evolution of the centers, we obtained an approach robust against noise compared to a statistical approach to clustering: individuals not representative of an evolutionary niche have a low probability of surviving. The performance of the algorithm can be improved by separately processing the traffic entering and leaving the network under analysis: this would help to keep the false positive rate under control. Moreover, the process of data mining from packet headers can be refined and improved, so as to build a feature vector containing not only raw data from the header, but more refined data, containing knowledge about the network and its hosts, the connections, the services and so on.
A different approach of the same algorithm could be monitoring the traffic and evaluating it compared to the existing clusters, rather than observing the evolution of the clusters: once the clusters are formed, a score of abnormality can be assigned to each individual under investigation, according to how much it belongs to each cluster of the solution space. In a few empirical tests, we simulated a wide range of attacks using Nessus tool: although some trains of anomalous packets show substantially normal scores, and the number of false positive is quite relevant, we observed that abnormal traffic has got a sensitive higher abnormal score than the normal traffic has, if referring to the mean values. References 1. Rouil, Chevrollier, Golmie: Unsupervised anomaly detection system using next-generation router architecture (2005) 2. Leon, Nasraoui, Gomez: Anomaly detection based on unsupervised niche clustering with application to network intrusion detection 3. Cerbara, I.: Cenni sulla cluster analysis (1999) 4. Lee, S.: A framework for constructing features and models for intrusion detection systems (2001) Statistical Anomaly Detection on Real e-Mail Traffic Maurizio Aiello1, Davide Chiarella1,2, and Gianluca Papaleo1,2 1 2 National Research Council, IEIIT, Genoa, Italy University of Genoa, Department of Computer and Information Sciences, Italy {maurizio.aiello,davide.chiarella, gianluca.papaleo}@ieiit.cnr.it Abstract. There are many recent studies and proposal in Anomaly Detection Techniques, especially in worm and virus detection. In this field it does matter to answer few important questions like at which ISO/OSI layer data analysis is done and which approach is used. Furthermore these works suffer of scarcity of real data due to lack of network resources or privacy problem: almost every work in this sector uses synthetic (e.g. DARPA) or pre-made set of data. Our study is based on layer seven quantities (number of e-mail sent in a chosen period): we analyzed quantitatively our network e-mail traffic (4 SMTP servers, 10 class C networks) and applied our method on gathered data to detect indirect worm infection (worms which use e-mail to spread infection). The method is a threshold method and, in our dataset, it identified various worm activities. In this document we show our data analysis and results in order to stimulate new approaches and debates in Anomaly Intrusion Detection Techniques. Keywords: Anomaly Detection Techniques; indirect worm; real e-mail traffic. 1 Introduction Network security and Intrusion Detection Systems have become one of the research focus with the ever fast development of the Internet and the growing of unauthorized activities on the Net. Intrusion Detection Techniques are an important security barrier against computer intrusions, virus infections, spam and phishing. In the known literature there are two main approaches to worm detection [1], [2]: misuse intrusion detection and anomaly intrusion detection. The first one is based upon the signature concept, it is more accurate but it lacks the ability to identify the presence of intrusions that do not fit a pre-defined signature, resulting not adaptive [3], [4]. The second one tries to create a model to characterize a normal behaviour: the system defines the expected network behaviour and, if there are significant deviations in the short term usage from the profile, raises an alarm. It is a more adaptive system, ready to counterattack new threats, but it has a high rate of false positives [5], [6], [7], [8], 9]. 
Theoretically Misuse and Anomaly detection integrated together can get the holistic estimation of malicious situations on a network. Which kind of threats spread via e-mails? Primarily we can say that the main ones are worms and viruses, spam and phishing. Let’s try to summarize the whole situation. At present Internet surpasses one billion users [10] and we witness more and more cyber criminal activities originate and misuse this worldwide network by using different tools: one of the most important and relevant is the electronic-mail. In fact E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 170–177, 2009. springerlink.com © Springer-Verlag Berlin Heidelberg 2009 Statistical Anomaly Detection on Real e-Mail Traffic 171 nowadays Internet users are flooded by a huge amount of emails infected with worms and viruses: indeed the majority of mass-mailing worms employ a Simple Mail Transfer Protocol engine as infection delivery mechanism, so in the last years, a multitude of worm epidemics has affected millions of networked computing devices [11]. Are worms a real and growing threat? The answer is simple and we can find it in the virulent events of the last years: we have in fact thousands hosts infected and billion dollars in damage [12]. Moreover recently we witness a merge between worm and spam activities: it has been estimated that 80% of spam is sent by spam zombies [13]: an event which can make us think that future time hides bad news. How can we neutralize all these menaces? In Intrusion Detection Techniques many types of research have been developed during years. Proposed Network Intrusion Detection Systems worked and work on information available on different TCP stack layers: Datalink ( e.g. monitoring ARP [14] ), Network (e.g. monitoring IP, BGP and ICMP [15], [16] ), Transport (e.g. monitoring DNS MX query [17,18]) and Application (e.g. monitoring System Calls[19,20] ). Sometimes, because of enormous relative features available on different levels researcher correlate information gathered on each level in order to improve the effectiveness of the detection system. In a similar way we propose to work with e-mail focusing our attention on quantities considering that all the three above phenomena have something in common: they all use SMTP [21] as proliferation medium. In this paper our goal is to present a dataset analysis which reflects the complete SMTP traffic sent by seven /24 network in order to detect worm and virus infections given that no user in our network is a spammer. Moreover we want to stress that the dataset we worked on is genuine and not a synthetic one ( like KDD 99 [22] and DARPA [23] ) so we hope that it might be inspiring to other researcher and that probably in the near future our work might produce a genuine data set at everyone’s disposal. The paper is structured as following. Section 2 introduces the analysis’ scenario. Section 3 discusses the dataset we worked on. Our analysis’ theory and methods are described in section 4, and our experimental results using our tools to analyze mail activities discussed in section 5. In section 6, we give our conclusion. 2 Scenario Our approach is highly experimental. In fact we work on eleven local area network (C class) interconnected by a layer three switch and directly connected to Internet (no NAT [24] policies, all public IP). In this network we have five different mail-servers, varying from Postfix to Sendmail. 
As every system administrator knows every server has its own kind of log and the main problem with log files is that every transaction is merged with the other ones and they are of difficult reading by a human beings: for this reason we focused our anomaly detection on a single Postfix mail-server, optimizing our efforts. Every mail is checked by an antivirus server (Sophos). To circumvent spam we have SpamAssassin [25] and Greylisting [26]. SpamAssassin is a software which uses a variety of mechanisms including header and text analysis, Bayesian filtering, DNS blocklists, and collaborative filtering databases to detect spam. A mail transfer agent which uses greylisting temporarily rejects email from 172 M. Aiello, D. Chiarella, and G. Papaleo unknown senders. If the mail is legitimate, the originating server will try again to send it later according to RFC, while a mail which is from a spammer, it will probably not be retried because very few spammers are RFC compliant. The few spam sources which re-transmit later are more likely to be listed in DNSBLs [27] and distributed signature systems. Greylisting and SpamAssassin reduced heavily our spam percentage. To make a complete description we must add that port 25 is monitored and filtered: in fact the hosts inside our network can’t communicate with a host outside our network on port 25 and an outsider can’t communicate with an our host on port 25. These restriction nullify two threats: the first one concerns the infected hosts which can become spam-zombie pc; the second one concerns the SMTP relaying misuse problem. In fact since we are a research institution almost all the hosts are used by a single person who detains root privileges, so she can eventually install a SMTP server. Only few of total hosts are shared among different people (students, fellow researcher etc.). We have a good balancing between Linux operating systems distribution and Windows ones. We focus our attention on one mail server which has installed a Postfix e-mail server. Every mail is checked by the antivirus server updated once an hour: this is an important fact because it assures that all the worms found during analysis are zero-day worm [28-30]. This server supplies service to 300 users with a wide international social network due to the research mission of our Institution [31]: this fact grant us a huge amount of SMTP traffic. 3 Dataset We analyze mail-server log of 1065 days length period (2 years and 11 months). To speed up the process we used LMA (Log Mail Analyzer [32]) to make the log more readable. LMA is a Perl program, open source, available on Sourceforge, which makes Postfix and Sendmail logs human readable. It reconstructs every single e-mail transaction spread across the mail server log and it creates a plain text file in a simpler format like. Every row represents a single transaction and it has the following fields: • Timestamp: it is the moment in which the e-mail has been sent: it is possible to have this information in Unix timestamp format or through the Julian format in standard date. • Client: it is the hostname of e-mail sender (HELO identifier). • Client IP: it is the IP of the sender’s host. • From: it is the e-mail address of the sender. • To: it is the e-mail address of the receiver. • Status: it is the server response (e.g. 450, 550 etc.). With this format is possible to find the moment in which the e-mail has been sent, the sender client name and IP, the from and to field of the e-mail and the server response. 
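To make the record layout concrete, the sketch below defines a container for one LMA transaction row with the six fields just listed; the field names and the whitespace-separated row format are illustrative assumptions rather than the exact LMA output (the worked example that follows shows one such record).

```python
from dataclasses import dataclass

@dataclass
class MailTransaction:
    """One reconstructed e-mail transaction as produced by LMA (illustrative layout)."""
    timestamp: str   # moment of sending, Unix timestamp or standard date
    client: str      # HELO identifier of the sending host
    client_ip: str   # IP address of the sender's host
    sender: str      # From: e-mail address
    recipient: str   # To: e-mail address
    status: str      # server response code, e.g. "250", "450", "550"

def parse_row(row: str) -> MailTransaction:
    """Parse one whitespace-separated row of the simplified format (assumed layout)."""
    ts, ip, frm, to, status = row.split()
    return MailTransaction(timestamp=ts, client="", client_ip=ip,
                           sender=frm, recipient=to, status=status)
```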
Lets make an example: if Paul@myisp.com send an e-mail on 23 march 2006 to Pamela@myisp.com from X.X.2.235 and all the e-mail server transactions go successful we will have a record like this: 23/03/2006 X.X.2.235 Paul@myisp.com Pamela@myisp.com 250 Statistical Anomaly Detection on Real e-Mail Traffic 173 As already said, we want to stress that our data is not synthetic and so it doesn’t suffer of bias of any form: it reflects the complete set of emails received by a single hightraffic e-mail server and it represents the overall view of a typical network operator. Furthermore, contrary to synthetic ones, it suffers of accidental hardware faults: can you say that a network topology is static and it is not prone to wanted and unwanted changes? Intrusion Detection evaluation dataset have some hypothesis, one of these is the never changing topology and immortal hardware health. As a matter of fact this is not true, this is not reality: Murphy’s Law holds true and strikes with extraordinary efficiency. In addition our data are only about SMTP flow and, due to the long-term monitoring, are a good snapshot of all-day life and, romantically, a silent witness of Internet growth and e-mail use growth. 4 Analysis Our analysis has been made on the e-mail traffic of ten C-class network in a period of 900 days, from January 2004 to November 2006. In our analysis, we work on the global e-mail flow in a given time interval. We use a threshold detection [33], like other software do (e.g. Snort ): if the traffic volume rises above a given threshold, the system triggers an alarm. The given threshold is calculated in a statistical way, where we determine the network normal e-mail traffic in selected slices of time: for example we take the activity of a month and we divide the month in five-minutes slices, calculating how many e-mails are normally sent in five minutes. After that, we check that the number of e-mails sent in a day during each interval don’t exceed the threshold. We call this kind of analysis base-line analysis. Our strategy is to study the temporal correlation between the present behaviour (maybe modified by the presence of a worm activity) of a given entity (pc, entire network) and its past behaviour (normal activity, no virus or worm presence). Before proceeding, however, we pre-process the data subtracting the mean to the values and cutting all the interval with a negative number of e-mails, because we wanted to obfuscate the no-activity and few activity periods, not interesting for our purposes. In other words we trashed all the time slices characterized by a number of e-mail sent below the month average, with the purpose of dynamically selecting activity periods (working hours, no holidays etc). If we didn’t perform this pre-processing we could have had an average which depended on night time, weekend or holidays duration. E-mails sent mean in 2004, before pre-processing, was 524 in a day for 339 activity day: after data pre-processing was 773 in a day for 179 activity day. After this we calculate the baseline activity of working hours according to the following: µ + 3σ. The mean and the variance are calculated for every month, modelling the network behaviour, taking into account every chosen time interval (e.g. we divide February in five-minutes slices, we count how many e-mails are sent in these periods and then we calculate the mean of these intervals). Values have been compared with the baseline threshold and if found greater than it they have been marked. 
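The base-line computation just described can be summarised in a short sketch: slice a month into fixed-width bins, drop the bins below the monthly mean (the pre-processing step), and set the alarm threshold to µ + 3σ of the remaining bins. This is a simplified reconstruction of the procedure; the function names and the bin width are assumptions.

```python
import statistics

def baseline_threshold(counts_per_bin):
    """Compute the monthly alarm threshold from per-bin e-mail counts.

    counts_per_bin: e-mails sent in each fixed-width slice (e.g. 5 minutes)
    of one month. Bins below the monthly mean are discarded first, so that
    nights, weekends and holidays do not drag the baseline down.
    """
    mean_all = statistics.mean(counts_per_bin)
    active = [c for c in counts_per_bin if c >= mean_all]  # keep activity periods only
    mu = statistics.mean(active)
    sigma = statistics.stdev(active)
    return mu + 3 * sigma

def alerts(counts_per_bin, threshold):
    """Indices of the bins whose e-mail count exceeds the base-line threshold."""
    return [i for i, c in enumerate(counts_per_bin) if c > threshold]
```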
Analyzing the first five months with a five minutes slice we found too many alerts and a lot of them exceeded the threshold only for few e-mails. So we thought to correlate the alerts found with a five minutes period with those found with an hour period, with the hypothesis that a worm which has infected a host sends a lot of e-mail both in a short period and in a bit 174 M. Aiello, D. Chiarella, and G. Papaleo longer period. To clarify the concept lets take the analysis for a month: April 2004 (see Fig. 1 and Fig. 2). The five minutes base-line resulted in 63 e-mails while the one hour base-line is 463. In five-minute analysis we found sixteen alerts, meanwhile in one-hour analysis only three. Why do we find a so big gap between the two approaches? In five-minutes analysis we have a lot of false alarms, due to the presence of e-mails sent to very large mailing lists while in one-hour analysis we find very few alarms, but these alarms result more significant because they represents a continuative violation of the normal (expected) activity. Correlating these results, searching the selected five-minutes periods in the five one-hour alert we detected that a little set of the five-minute alarms were near in the temporal line: after a deeper analysis, using our knowledge and experience on real user’s activity we concluded that it was a worm activity. Fig. 1. Example of e-mail traffic: hour base-line Fig. 2. Example of e-mail traffic: five minutes base-line 4.1 SMTP Sender Analysis Sometimes, peaks catch from flow analysis were e-mail sent to mailing list which are, as already said, bothersome hoaxes. This fact produced from analysis, where we analyze how many different e-mail address every host use: we look which from field is Statistical Anomaly Detection on Real e-Mail Traffic 175 used by every host. In fact an host, owned by a single person or few persons, is not likely to use a lot of different e-mail addresses in a short time and if it does so, it is highly considerable a suspicious behaviour. So we think that this analysis could be used to identify true positives, or to suggest suspect activity. Of course it isn’t so straight that a worm will change from field continuously, but it is a likely event. 4.2 SMTP Reject Analysis One typical feature of a malware is haste in spreading the infection. This haste leads indirect worms to send a lot of e-mail to unknown receivers or nonexistent e-mail address: this is a mistake that, we think, it is very important. In fact all e-mails sent to a nonexistent e-mail address are rejected by the mail-server, and they are tracked in the log. In this step of our work we analyze rejected e-mail flow: we work only on emails referred by internet server. By this approach we identified worm activity. Table 1. 
Experimental results Date Infected Host Analysis 28/01/04 18:00 X.X.6.24 Baseline, from, reject 29/01/04 10:30 X.X.4.24 From, reject 28/04/2004 14:58-15:03 X.X.7.20 Baseline, from, reject 28/04/2004 15:53-15:58 X.X.5.216 Baseline, from, reject 29/04/2004 09:08-10:03 X.X.6.36, X.X.7.20 Baseline, from, reject 04/05/2004 12:05-12:10 X.X.5.158 Baseline, from, reject 04/05/2004 13:15-13:25 X.X.5.158 Baseline, from, reject 31/08/04 14:51 X.X.3.234 Baseline, reject X.X.3.234 Baseline, reject X.X.3.101, X.X.3.200, X.X.3.234 X.X.5.123 Baseline, reject X.X.10.10 Baseline, from, reject X.X.10.10 Baseline, from, reject X.X.10.10 Baseline, from, reject X.X.10.10 Baseline, from, reject 31/08/04 17:46 23/11/04 11:38 22/08/05 17:13 22/08/05 17:43 22/08/05 20:18 22/08/05 22:08-22:13 176 M. Aiello, D. Chiarella, and G. Papaleo 5 Results The approach does detect 14 worms activity, mostly concentrated in 2004. We think that this fact is caused by new firewall policies introduced in 2005 and by the introduction of a second antivirus engine in our Mail Transfer Agent. Moreover in last years we haven’t got very large worm infection. The results we obtained are summarized in Table 1. 6 Conclusion Baseline analysis can be useful in identifying some indirect worm activity, but this approach need some integration by some other methods, because it lacks a complete vision of SMTP activity: this lack can be filled by methods which analyze some other SMTP aspects, like From and To e-mail field. In future this method can be integrated in an anomaly detection system to get more accuracy in detecting anomalies. Acknowledgments. This work was supported by National Research Council of Italy and University of Genoa. References 1. Axelsson, S.: Intrusion detection systems: A survey and taxonomy,Tech. Rep. 99-15, Chalmers Univ (March 2000) 2. Verwoerd, T., Hunt, R.: Intrusion detection techniques and approaches. Comput. Commun. 25(15), 1356–1365 (2002) 3. Ilgun, K., Kemmerer, R.A., Porras, P.A.: State transition analysis: A rule-based intrusion detection approach. IEEE Transactions on Software Engineering 21(3), 181–199 (1995) 4. Kumar, S., Spafford, E.H.: A software architecture to support misuse intrusion detection. In: Proceedings of the 18th National Information Security Conference, pp. 194–204 (1995) 5. Denning, D.E.: An intrusion detection model. IEEE Transactions on Software Engineering (1987) 6. Estvez-Tapiador, J.M., Garcia-Teodoro, P., Diaz-Verdejo, J.E.: Anomaly detection methods in wired networks: A survey and taxonomy. Comput. Commun. 27(16), 1569–1584 (2004) 7. Du, Y., Wang, W.-q., Pang, Y.-G.: An intrusion detection method using average hamming distance. In: Proceedings of the Third International Conference on Machine Learning and Cybernetics, Shanghai, 26-29 August (2004) 8. Anderson, D., Frivold, T., Valdes, A.: Next-generation intrusion detection expert system (NIDES). Computer Science Laboratory (SRI Intemational, Menlo Park, CA): Technical reportSRI-CSL-95-07 (1995) 9. Wang, Y., Abdel-Wahab, H.: A Multilayer Approach of Anomaly Detection for Email Systems. In: Proceedings of the 11th IEEE Symposium on Computers and Communications (ISCC 2006) (2006) 10. http://www.internetworldstats.com/stats.htm 11. http://en.wikipedia.org/wiki/ Notable_computer_viruses_and_worms Statistical Anomaly Detection on Real e-Mail Traffic 177 12. Moore, D., Paxson, V., Savage, S., Shannon, C., Staniford, S., Weaver, N.: Inside the slammer worm. IEEE Magazine of Security and Privacy, 33–39 (July/August 2003) 13. 
Leyden, J.: Zombie PCs spew out 80% of spam. The Register (June 2004) 14. Yasami, Y., Farahmand, M., Zargari, V.: An ARP-based Anomaly Detection Algorithm Using Hidden Markov Model in Enterprise Networks. In: Second International Conference on Systems and Networks Communications (ICSNC 2007) (2007) 15. Berk, V., Bakos, G., Morris, R.: Designing a Framework for Active Worm Detection on Global Networks. In: Proceedings of the first IEEE International Workshop on Information Assurance (IWIA 2003), Darmstadt, Germany (March 2003) 16. Bakos, G., Berk, V.: Early detection of internet worm activity by metering icmp destination unreachable messages. In: Proceedings of the SPIE Aerosense 2002 (2002) 17. Whyte, D., Kranakis, E., van Oorschot, P.C.: DNS-based Detection of Scanning Worms in an Enterprise Network. In: Proceedings of the 12th Annual Network and Distributed System Security Symposium, San Diego, USA, February 3-4 (2005) 18. Whyte, D., van Oorschot, P.C., Kranakis, E.: Addressing Malicious SMTP-based MassMailing Activity Within an Enterprise Network 19. Hofmeyr, S.A., Forrest, S., Somayaji, A.: Intrusion Detection using Sequences of System Calls. Journal of Computer Security 6(3), 151–180 (1998) 20. Cha, B.: Host anomaly detection performance analysis based on system call of NeuroFuzzy using Soundex algorithm and N-gram technique. In: Proceedings of the 2005 Systems Communications (ICW 2005) (2005) 21. http://www.ietf.org/rfc/rfc0821.txt 22. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html 23. http://www.ll.mit.edu/IST/ideval/data/data_index.html 24. http://en.wikipedia.org/wiki/Network_address_translation 25. http://spamassassin.apache.org/ 26. Harris, E.: The Next Step in the Spam Control War: Greylisting 27. http://en.wikipedia.org/wiki/DNSBL 28. Crandall, J.R., Su, Z., Wu, S.F., Chong, F.T.: On Deriving Unknown Vulnerabilities from Zero-Day Polymorphic and Metamorphic Worm Exploits. In: CCS 2005, Alexandria, Virginia, USA, November 7–11 (2005) 29. Portokalidis, G., Bos, H.: SweetBait: Zero-Hour Worm Detection and Containment Using Honeypots 30. Akritidis, P., Anagnostakis, K., Markatos, E.P.: Efficient Content-Based Detection of Zero-DayWorms 31. http://www.cnr.it/sitocnr/home.html 32. http://lma.sourceforge.net/ 33. Behaviour-Based Network Security Goes Mainstream, David Geer, Computer (March 2006) On-the-fly Statistical Classification of Internet Traffic at Application Layer Based on Cluster Analysis Andrea Baiocchi1, Gianluca Maiolini2, Giacomo Molina1, and Antonello Rizzi1 1 INFOCOM Dept., University of Roma “Sapienza” Via Eudossiana 18 - 00184 Rome, Italy andrea.baiocchi@uniroma1.it, giacomo.molina@libero.it, rizzi@infocom.uniroma1.it 2 ELSAG Datamat – Divisione automazione sicurezza e trasporti, Via Laurentina 760 – 00143 Rome, Italy gianluca.maiolini@elsagdatamat.com Abstract. We address the problem of classifying Internet packet flows according to the application level protocol that generated them. Unlike deep packet inspection, which reads up to application layer payloads and keeps track of packet sequences, we consider classification based on statistical features extracted in real time from the packet flow, namely IP packet lengths and inter-arrival times. A statistical classification algorithm is proposed, built upon the powerful and rich tools of cluster analysis. By exploiting traffic traces taken at the Networking Lab of our Department and traces from CAIDA, we defined data sets made up of thousands of flows for up to five different application protocols. 
With the classic approach of training and test data sets we show that cluster analysis yields very good results in spite of the little information it is based on, to stick to the real time decision requirement. We aim to show that the investigated applications are characterized from a ”signature” at the network layer that can be useful to recognize such applications even when the port number is not significant. Numerical results are presented to highlight the effect of major algorithm parameters. We discuss complexity and possible exploitation of the statistical classifier. 1 Introduction As broadband communications widen the range of popular applications, there is an increasing demand of fast traffic classification means according to the services data is generated by. The specific meaning of service depends on the context and purpose of traffic classifications. In case of traffic filtering for security or policy enforcement purposes, service can be usually identified with application layer protocol. However, many kind of different services exploit http or ssh (e.g. file transfer, multimedia communications, even P2P), so that a simple header based filter (e.g. exploiting the IP address and TCP/UDP port numbers) may be inadequate. Traffic classification at application level can be therefore based on the analysis of the entire packets content (header plus payload), usually by means of finite state machine schemes. Although there are widely available software tools for such a classification approach (e.g. L7filter, BRO, Snort), they can hardly catch up with high speed links and are usually inadequate for backbone use (e.g. Gbps links). E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 178–185, 2009. springerlink.com © Springer-Verlag Berlin Heidelberg 2009 On-the-fly Statistical Classification of Internet Traffic 179 The solution based on port analysis is becoming ineffective because of applications running on non-standard ports (e.g. peer-to-peer). Furthermore, traffic classification based on deep packet inspection is resource-consuming and hard to implement on high capacity links (e.g. Gbps OC links). For these reasons, different approaches to traffic classification have been developed, using all the information available at network layer. Some proposals ([4], [5]), however, need semantically complete TCP flows as input: we target a real-time tool, able to classify the application layer protocol of a TCP connection by observing just the first few packets of the connection (hereinafter referred to as a flow). A number of works [5], [6], [7] rely on unsupervised learning techniques. The only features they use is packets size and packets direction: they demonstrate the effectiveness of these algorithms even using a small number of packets (e.g. the first four of a TCP connection). We believe that even packets inter-arrival time contains pieces of information relevant to address the classification problem. We provide a way to exalt the information contained in inter-arrival times, preserving the real-time characteristic of the approach described in [7]: we try to clean the interarrival time (as better explained in section 3) assessing the contribution of network congestion, to exalt the time depending on the application layer protocol. The paper is organized as follows. In Section 2 the classification problem is defined and notation is introduced. Section 3 is devoted to the description of the traffic data sets used for the defined algorithm assessment and the numerical evaluation. 
The cluster analysis based statistical classifier is defined in Section 4. Numerical examples are given in Section 5 and final remarks are stated in Section 6. 2 Problem Statement In this paper, we focus on the classification of IP flows generated from network applications communicating through TCP protocol as HTTP, SMTP, POP3, etc. With this in mind, we define flow F as the unidirectional, ordered sequence of IP packets produced either by the client towards the server, or by the server towards the client during an application layer session. The server-client flow FServer will be composed of (NServer + 1) IP packets, from PK0 to PKNserver , where PKj represents the j-th IP packet sent by the server to the client; the corresponding client-server flow FClient will be composed by (NClient + 1) IP packets. At the IP layer, each flow F can be characterized as an ordered sequence of N pairs Pi = (Ti ; Li), with 1 < i < N, where Li represents the size of PKi (including TCP/IP Header) and Ti represents the inter-arrival time between PKi-1 and PKi. In our study we consider only semantically complete TCP flows, namely flows starting with SYN or SYN-ACK TCP segment (respectively for client-to-server and server-to-client direction). Because of the limited number of packets considered in this work, we don’t care about the FIN TCP segment to be observed. With this in mind, we aim to recognize a description of protocols (through clustering techniques): such a description should be based on the first few packets of the flows and should be able to strongly characterize each analyzed protocol. The purpose of this work is the definition of an algorithm that takes as input a traffic flow from an unknown application and that gives as output the (probable) application responsible of its generation. 180 A. Baiocchi et al. 3 Dataset Description In this work, we focus our attention on five different application layer protocols, namely HTTP, FTP-Control, POP3, SMTP and SSH, which are among the most used protocols on the INTERNET. As for HTTP and FTP-Control (FTP-C in the following), we collected traffic traces in the Networking Lab at our Department. By means of automated tools mounted on machines within the Lab, thousands of web pages carefully selected have been visited in a random order, over thousands of web sites distributed in various geographical areas (Italy, Europe, North America, Asia). FTP sites have been addressed as well and control FTP session established with thousands remote servers, again distributed in a wide area. The generated traffic has been captured on our LAN switch; we verified that the TCP connections bottleneck was never the link connecting our LAN to the big Internet to avoid the measured inter-arrival times to be too noisy. This experimental set up, while allowing the capture of artificial traffic that (realistically) emulates user activity, gives us traces with reliable application layer protocol classification. Traffic flows for the other protocols (POP3, SMTP, SSH) are extracted form backbone traffic traces made available by CAIDA. Precisely, we randomically extracted flows from the OC-48 traces of the days 2002-08-14, 2003-01-15 and 2003-04-24. Due to privacy reasons, only anonymized packet traces with no payloads are made available. Regarding to SSH, it can be configured as an encrypted tunnel to transport every kind of applications. 
Even in its "normal" behaviour (remote management), it would be difficult to recognize a specific behavioural pattern because of its human-interactive nature. For these reasons we expect the classification results involving SSH flows to be worse than those without them. Starting from these traffic traces, and focusing our attention only on semantically complete server-client flows, we created two different data sets with 1000 flows for each application. Each flow in a data set is described by the following fields:

• a protocol label coded as an integer from 1 to 5;
• P ≥ 1 couples (Ti, Li), where Ti is the inter-arrival time (difference between timestamps) of the (i−1)-th and i-th packet of the considered flow and Li is the IP packet length of the i-th packet of the flow, i = 1,…, P.

Inter-arrival times are in seconds, packet lengths are in bytes. The 0-th packet of a flow, used as a reference for the first inter-arrival time, is conventionally defined as the one carrying the SYN-ACK TCP segment for the server-to-client direction. The label in the first field is used as the target class for flows in both the training and test sets. The other 2P quantities are normalized and define a 2P-dimensional array associated with the considered flow. Normalization has to be done carefully: we choose to normalize packet lengths between 40 and 1500 bytes, which are the minimum and maximum observed lengths. As for inter-arrival times, normalization is done over an entire data set of M flows making up a training or a test set. Let (T_i^{(j)}, L_i^{(j)}) be the i-th couple of the j-th flow (i = 1,…, P; j = 1,…, M). Then we let:

\hat{T}_i^{(j)} = \frac{T_i^{(j)} - \min_{1 \le k \le M} T_i^{(k)}}{\max_{1 \le k \le M} T_i^{(k)} - \min_{1 \le k \le M} T_i^{(k)}}, \qquad
\hat{L}_i^{(j)} = \frac{L_i^{(j)} - 40}{1500 - 40}, \qquad i = 1, \ldots, P    (1)

In the following we assume P = 5. A different version of this data set, the so-called pre-processed data set, has also been used. In this case, inter-arrival times are replaced by the differential inter-arrival times, obtained as DT_i = T_i − T_0, i = 1,…, P, where T_0 is the time elapsing between the packet carrying the TCP SYN-ACK of the flow and the next packet, most of the time a presentation message (as for FTP-C) or an ACK (as for HTTP). So T_0 approximates the first RTT of the connection, including only time depending on TCP computation (as we have seen during our experimental setup). The differential delay can therefore be expressed as

DT_i = T_i - T_0 \approx RTT_i - RTT_0 + TA_i    (2)

where we account for the fact that T_0 ≈ RTT_0 and that T_i comprises the i-th RTT and, in general, an application-dependent time TA_i. Hence, we expect application layer protocol features to be more evident in the pre-processed data set than in the plain one, since the contribution of the applications to inter-arrival times is usually much smaller than the average RTT in a wide area network. On the other hand, in the case of differential inter-arrival times, the noise affecting the application-dependent inter-arrival times is reduced to the RTT variation (zero on average).

4 A Basic Classification System Based on Cluster Analysis

In this section we give some details about the adopted classification system. Basically, a classification problem can be defined as follows. Let P : X → L be an unknown oriented process to be modeled, where X is the domain set and the codomain L is a label set, i.e.
a set in which it is not possible (or misleading) to define an ordering function, and hence any dissimilarity measure between its elements. If P is a single-valued function, we call it a classification function. Let Str and Sts be two sets of input-output pairs, namely the training set and the test set. We call an instance of a classification problem a given pair (Str, Sts) with the constraint Str ∩ Sts = Ø. A classification system is a pair (M, TA), where TA is the training algorithm, i.e. the set of instructions responsible for generating, exclusively on the basis of Str, a particular instance M¯ of the classification model family M, such that the classification error of M¯ computed on Sts is minimized. The generalization capability, i.e. the capability to correctly classify any pattern belonging to the input space of the oriented process to be modeled, is certainly the most important desired feature of a classification system. From this point of view, the mean classification error on Sts can be considered an estimate of the expected behaviour of the classifier over all possible inputs. In the following, we describe a classification system trained by an unsupervised (clustering) procedure. When dealing with patterns belonging to the vector space R^n we can adopt a distance measure, such as the Euclidean distance; moreover, in this case we can define the prototype of a cluster as the centroid (the mean vector) of all the patterns in the cluster, thanks to the algebraic structure defined on R^n. Consequently, the distance between a given pattern x_i and a cluster C_k can easily be defined as the Euclidean distance d(x_i, µ_k), where µ_k is the centroid of the m_k patterns belonging to C_k:

\mu_k = \frac{1}{m_k} \sum_{x_i \in C_k} x_i    (3)

A direct way to synthesize a classification model on the basis of a training set Str consists in partitioning the patterns in the input space (discarding the class label information) with a clustering algorithm (in our case, K-means). Subsequently, each cluster is labeled with the most frequent class among its patterns. Thus, a classification model is a set of labeled clusters (centroids); note that more than one cluster can be associated with the same label, i.e. a class can be represented by more than one cluster. Assuming a floating-point number is represented with four bytes and class labels are coded with one byte, the amount of memory needed to store a classification model is K · (4 · n + 1) bytes, where n is the input space dimension. An unlabeled pattern x is classified by determining the closest centroid µ_i (and thus the closest cluster C_i) and by labeling x with the class label associated with C_i. It is important to underline that, since the initialization step of K-means is not deterministic, in order to compute a precise estimate of the performance of the classification model on the test set Sts, the whole algorithm must be run several times, averaging the classification errors on Sts yielded by the different classification models obtained in each run.

5 Numerical Results

In this section we provide numerical results for the classification algorithm. We investigated two groups of applications, the first containing HTTP, FTP-C, POP3 and SMTP, the second also including SSH. Using the non-preprocessed data set (hereinafter referred to as original) we obtain a classification accuracy on Sts comparable with that achievable with port-based classification.
This happens because the effect of RTT almost completely covers the information carried from inter-arrival times. In Table 1 and 2 are listed the global results and the individual contributions of the protocols to the average value. Using the pre-processed data set we obtain much better results, in particular for the case with only 4 protocols as we can see in Table 3 and Table 4. An important thing to consider is the complexity of the classification model, namely the number of clusters used. The performance does not significantly increase after 20 clusters (Fig. 2 and Fig. 3): this means we can achieve good results with a simple model that requires On-the-fly Statistical Classification of Internet Traffic 183 Table 1. Average classification accuracy vs # Clusters, P=5, # flows (training+test)=1000, original data set Table 2. Average classification accuracy vs # Clusters, P=5, # flows (training+test) =1000, original data set Table 3. Average classification accuracy vs # Clusters, P=5, # flows (training+test)=1000, preprocessed data set not much computation. We can see from Table 4 the negative impact SSH has on the overall classification accuracy, mainly because it is a human-driven protocol, hard to characterize with few hundreds of flows. Moreover, we can see that SSH suffers of overfitting problems, as the probability of success decreases as the number of clusters increases. 184 A. Baiocchi et al. Table 4. Average classification accuracy vs # Clusters, P=5, # flows (training+test)=1000, preprocessed data set Extending the data sets with unknown traffic, the classification probability is significantly reduced. Although the performance of the overall classification accuracy decreases, we can see in Table 5 the effect of unknown traffic. This means that our classifier is not mistaking flows of the considered protocols, but is just raising the false positive classifications due to unknown traffic, erroneously labeled as known traffic. Table 5. Average classification accuracy vs # Clusters, P=5, # flows (training+test)=1000, preprocessed data set with HTTP, FTP-C, POP3, SMTP, SSH, Unknown 6 Concluding Remarks In this work we present a model that could be useful to address the problem of traffic classification. To this end, we use only (poor) information available at network layer, namely packets size and inter-arrival times. In the next future we plan to better test the performances of this model, mainly extending the data sets we use to a greater number of protocols. We are also planning to collect traffic traces from our Department link to be able to accurately classify all protocols we want to analyze through payload analysis. Moreover, we will have to enforce the C-means algorithm to automatically select the optimal number of clusters relatively to the used data set. The following step will On-the-fly Statistical Classification of Internet Traffic 185 be the use of more recent and powerful fuzzy-like algorithms to achieve better performances in a real environment. Acknowledgments. Authors thank Claudio Mammone for his development work of the software package used in the collection of part of traffic traces analysed in this work. References 1. Karagiannis, T., Papagiannaki, K., Faloutsos, M.: BLINC: Multilevel traffic classification in the dark. In: Proc. of ACM SIGCOMM 2005, Philadelphia, PA, USA (August 2005) 2. Crotti, M., Dusi, M., Gringoli, F., Salgarelli, L.: Traffic Classification through Simple Statistical Fingerprinting. 
ACM SIGCOMM Computer Communication Review 37(1), 5–16 (2007) 3. Wright, C., Monrose, F., Masson, G.: On Inferring Application Protocol Behaviors in Encrypted Network Traffic. Journal of Machine Learning Research (JMLR): Special issue on Machine Learning for Computer Security 7, 2745–2769 (2006) 4. Moore, A.W., Zuev, D.: Internet traffic classification using Bayesian analysis techniques. In: ACM SIGMETRICS 2005, Banff, Alberta, Canada (June 2005) 5. McGregor, A., Hall, M., Lorier, P., Brunskill, J.: Flow clustering using machine learning techniques. In: PAM 2004, Antibes Juan-les-Pins, France (April 2004) 6. Zander, S., Nguyen, T., Armitage, G.: Automated traffic classification and application identification using machine learning. In: LCN 2005, Sydney, Australia (November 2005) 7. Bernaille, L., Teixeira, R., Salamatian, K.: ’Early Application Identification. In: Proceedings of CoNEXT (December 2006) Flow Level Data Mining of DNS Query Streams for Email Worm Detection Nikolaos Chatzis and Radu Popescu-Zeletin Fraunhofer Institute FOKUS, Kaiserin-Augusta-Allee 31, 10589 Berlin, Germany {nikolaos.chatzis,radu.popescu-zeletin}@fokus.fraunhofer.de Abstract. Email worms remain a major network security concern, as they increasingly attack systems with intensity using more advanced social engineering tricks. Their extremely high prevalence clearly indicates that current network defence mechanisms are intrinsically incapable of mitigating email worms, and thereby reducing unwanted email traffic traversing the Internet. In this paper we study the effect email worms have on the flow-level characteristics of DNS query streams a user machine generates. We propose a method based on unsupervised learning and time series analysis to early detect email worms on the local name server, which is located topologically near the infected machine. We evaluate our method against an email worm DNS query stream dataset that consists of 68 email worm instances and show that it exhibits remarkable accuracy in detecting various email worm instances1. 1 Introduction Email worms remain an ever-evolving threat, and unwanted email traffic traversing the Internet steadily escalates [1]. This causes network congestion, which results in loss of service or degradation in the performance of network resources [2]. In addition, email worms populate almost exclusively the monthly top threat lists of antivirus companies [3,4], and are used to deliver Trojans, viruses, and phishing attempts. Email worms rely mainly on social engineering to infect a user machine, and then they exploit information found on the infected machine about the email network of the user to spread via email among social contacts. Social engineering is a nontechnical kind of intrusion, which depends on human interaction to break normal security procedures. This propagation strategy differs significantly from IP address scanning propagation; therefore, it renders network detection methods that look for high rates at which unique destination addresses are contacted [5], or high number of failed connections [6], or high self-similarity of packet contents [7] incapable of detecting this class of Internet worms. Likewise, Honeypot-based systems [8], which provide a reliable anti-scanning mechanism, are not effective against email worms. Commonly applied approaches like antivirus and antispam software and recipient or message based rules have two deficiencies. 
They suffer from poor detection time 1 The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 216585 (INTERSECTION Project). E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 186–194, 2009. © Springer-Verlag Berlin Heidelberg 2009 springerlink.com Flow Level Data Mining of DNS Query Streams for Email Worm Detection 187 against novel email worm instances because they entail non-trivial human labour in order to develop a signature or a rule, and go to no lengths to reducing the unwanted email traffic traversing the Internet, as their target is to detect abusive email traffic in the network of the potential victim. In the past much research effort has been devoted to analyzing the traffic email worm-infected user machines generate [9-13]. These studies share the positive contribution that email worm infection affects at application layer the Domain Name System (DNS) traffic of a user machine. Detection methods based on this observation focus on straightforward application layer detection, which makes them suitable for detecting few specific outdated email worms. In this work we go a step beyond earlier work, and identify anomalies in DNS traffic that are common for email spreading malicious software, and as such they can serve as a strong basis for detecting various instances of email worms in the long run. We show that DNS query streams that non-infected user machines generate share at flow level many of the same canonical behaviours, while email worms rely on similar spreading methods that generate DNS traffic that share common patterns. We present a detection method that builds on unsupervised learning and time series analysis, and uses the wavelet transform in a different way than this often proposed in the Internet traffic analysis literature. We experiment with 68 worm instances that appeared in the wild between April 2004 and July 2007 to show that flow-level characteristics remain unaltered in the long run, and that our method is remarkably accurate. Inspecting packets at flow level does not involve deep packet analysis. This ensures user privacy, renders our approach unaffected by encryption and keeps the processing overhead low, which is a strong requirement for busy, high-speed networks. Moreover, DNS query streams consist of significantly less data than the input of conventional network intrusion detection systems. This is advantageous, since high volumes of input data inevitably degrade the effectiveness of these systems [14]. Additionally, the efficiency and deployment ease of an in-network detection system increase as the topological proximity between the system and the user machines decreases. Local name servers are the first link of the chain of Internet connectivity. Moreover, detection at the local name server, which is located topologically near the infected machine, contributes to reducing unwanted traffic traversing the Internet. The paper is organized as follows. In Section 2, we discuss related work. In Section 3, we explain our method for detecting email worms by flow-level analysis of DNS query streams. In Section 4, we validate our approach by examining its detection capabilities over various worm instances. We conclude in Section 5. 
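As a small illustration of the flow-level viewpoint described above (one query count per user machine per time bin, with no packet payloads involved), the sketch below aggregates a list of (timestamp, source IP) query records into per-machine time series; the field names, bin width and series length are assumptions made for the example.

```python
from collections import defaultdict

def query_time_series(queries, bin_seconds=60, num_bins=512):
    """Build one time series per user machine from flow-level DNS query records.

    queries: list of (timestamp_seconds, source_ip) pairs taken from the
    local name server log; no query names or payloads are needed.
    Returns {source_ip: [count per bin]} with num_bins bins of bin_seconds each.
    """
    start = min(t for t, _ in queries)
    series = defaultdict(lambda: [0] * num_bins)
    for t, ip in queries:
        b = int((t - start) // bin_seconds)
        if b < num_bins:
            series[ip][b] += 1
    return series
```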
2 Related Work Since our work builds on DNS traffic analysis for detecting email worms and security oriented time series analysis of Internet traffic signals using the wavelet transform we provide below the necessary background on these areas. Previously published work provides evidence that the majority of today’s open Internet operational security issues affect DNS traffic. In this section we concentrate solely on email worms and refer the interest reader to [15] for a thorough analysis. Wong et al. [11] analyze DNS traffic captured at the local name server of a campus 188 N. Chatzis and R. Popescu-Zeletin network during the outbreak of SoBig.F and MyDoom.A. Musashi et al. [12] present similar measurements for the same worms, and extend their work by studying Netsky.Q and Mydoom.S in [13]. Whyte et al. [9] focus on enterprise networks and measure the DNS activity of NetSky.Q. Despite, their positive contribution in proving that there exists a correlation between email worm infection and DNS traffic, the efficacy of the detection methods these studies propose would have been dwarfed, if the methods had been evaluated against various worm instances. Indeed, the methods presented in [9,11,12,13] are straightforward and focus on application layer detection. They propose that many queries for Mail eXchange (MX) resource records (RR) or the relative numbers of queries for pointer (PTR) RR, MX and address (A) RR a user machine generates give a telltale sign of email worm infection. Although this observation holds for the few email worm instances studied in each paper, it can not be generalized for detecting various email worm instances. Furthermore, these methods neglect that DNS queries carry user sensitive information, and due to the high processing overhead they introduce by analyzing packet payloads, they are not suitable for busy, high speed networks. Moreover, as any other volume based method they require an artificial boundary as threshold on which the decision whether a user machine is infected or not is taken. In [10] the authors argue that anomaly detection is a promising solution to detect email worm-infected user machines. They use Bayesian inference assuming a priori knowledge of worm signature DNS queries to detect email worm. However, such knowledge is not apparent, and if it was it would allow straightforward detection. The wavelet transform is a powerful tool, since its time and scale localization abilities make it ideally suited to detect irregular patterns in traffic traces. Although, many research papers appear that analyze Denial of Service (DoS) attack traffic [16,17] by means of wavelets, only a handful of papers deals with applying the wavelet transform on Internet worm signals [18,19]. Inspired by similar methodology, used to analyze DoS traffic, these papers concentrate solely on looking at worm traffic for repeating behaviors by means of the self-similarity parameter Hurst. In this work we present a method that accurately detects various email worms by analyzing DNS query streams at flow level. We show that flow-level characteristics remain unaltered in the long run. Flow-level analysis does not violate user privacy, renders our method unaffected by encryption, eliminates the need for not anonymized data; and makes our method suitable for high-speed network environments. 
In our framework, we see DNS query streams from different hosts as independent time series and use the wavelet transform as a dimensionality reduction tool rather than a tool for searching self-similar patterns on a single signal. 3 Proposed Approach Our approach is based on time series analysis of DNS query streams. Given the time series representation, we show by similarity search over time series using clustering that user machines’ DNS activity fall into two canonical profiles: legitimate user behaviour and email worm infected behaviour. DNS query streams generated by user machines share many of the same canonical behaviours. Likewise, email worms rely on similar spreading methods generating query streams that share common patterns. Flow Level Data Mining of DNS Query Streams for Email Worm Detection 189 3.1 Data Management As input, our method uses the complete set of DNS queries that a local name server received within an observation interval. Since we are not interested in application level information, we retain for each query the time of the query and the IP address of the requesting user machine. We group DNS queries per requesting user machine. For each user machine we consider successive time bins of equal width, and we count the DNS queries in each bin. Thereby, we get a set of univariate time series, where each one of them expresses the number of DNS queries of a user machine through time. The set of time series can be expressed as an n× p time series matrix; n is the number of user machines and p the number of time bins. 3.2 Data Pre-processing A time series of length p can be seen as a point in the p-dimensional space. This allows using multivariate data mining algorithms directly to time series data. However, most data mining algorithms, and in particular, clustering algorithms do not work well for time series. Working with each and every time point, makes the significance of distance metrics, which are used to measure the similarity between objects, questionable [20]. To attack this problem numerous time series representations have been proposed that facilitate extracting a feature vector from the time series. The feature vector is a compressed representation of the time series, which serves as input to the data mining algorithms. Although many representations have been proposed in the time series literature, only few of them are suitable for our framework. The fundamental requirement for our work is that clustering the feature vectors should result in clusters containing feature vectors of time series, which are similar in the time series space. The authors in [26] proved that in order for this requirement to hold the representation should satisfy the lower bounding lemma. In [21] the authors give an overview of the state of art in representations and highlight those that satisfy the lower bounding lemma. We opt to use the Discrete Wavelet Transform (DWT). Motivation for our choice is that the DWT representation is intrinsically multi-resolution and allows simultaneous time and frequency analysis. Furthermore, for time series typically found in practice, many of the coefficients in a DWT representation are either zero or very small, which allows for efficient compression. Moreover, DWT is applicable for analyzing non-stationary signals, and performs well in compressing sparse spike time series. We apply the DWT independently on each time series of the time series matrix using Mallat's algorithm [22]. 
The Mallat algorithm is of O(p) time complexity and decomposes a time series of length p, where p must be power of two, in log2p levels. The number of wavelet coefficients computed after decomposing a time series is equal to the number of time points of the original time series. Therefore, applying the DWT on each line of the time series matrix gives an n× p wavelet coefficient matrix. To reduce the dimensionality of the wavelet coefficient matrix, we apply a compression technique. In [23] the author gives a detailed description of four compression techniques. Two of them operate on each time series independently and suggest retaining the k first wavelet coefficients or the k largest coefficients in terms of absolute normalized value. Whereas the rest two are applicable to a set of n time series, so 190 N. Chatzis and R. Popescu-Zeletin directly to the wavelet coefficient matrix. The first suggests retaining the k columns of the matrix that have the largest mean squared value. The second suggests retaining for a given k the n × k largest coefficients of the wavelet coefficient matrix. In the interest of space, we present here experimental results only retaining the first k wavelet coefficients. This produces an n × k feature vector matrix. 3.3 Data Clustering We validate our hypothesis that DNS query streams generated by non-infected user machines share similar characteristics, and that these characteristics are dissimilar to those of email worm-infected user machines in two steps. First we cluster the rows of the feature vector matrix in two clusters. Then we examine if one cluster contains only feature vectors of non-infected user machines and the other only feature vectors of email worm-infected machines. We use hierarchical clustering, since it produces relatively good results, as it facilitates the exploration of data at different levels of granularity, and is robust to variations of cluster size and shape. Hierarchical clustering methods require the user to specify a dissimilarity measure. We use as dissimilarity measure the Euclidean distance, since it produces comparable results to more sophisticated distance functions [24]. Hierarchical clustering methods are categorized into agglomerative and divisive. Agglomerative methods start with each observation in its own cluster and recursively merge the less dissimilar clusters into a single cluster. By contrast, the divisive scheme starts with all observations in one cluster and subdivides them into smaller clusters until each cluster consists of only one observation. Since, we search for a small number of clusters we use the divisive scheme. The most commonly cited disadvantage of the divisive scheme is its computational cost. A divisive algorithm considers first all divisions of the entire dataset into two non-empty sets. There are 2(n-1) - 1 possibilities of dividing n observations in two clusters, which is intractable for many practical datasets. However, DIvisive ANAlysis (DIANA) [25] uses a splitting heuristic to limit the number of possible partitions, which results in O(n2). Given its quadratic time on the number of observations, DIANA scales poor with the number of observations, however we use it because it is the only divisive method generally available and in our framework it achieves clustering on a personal computer in computing time in the order of milliseconds. 
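A minimal end-to-end sketch of the pipeline described in this section follows: a Haar wavelet decomposition computed with a Mallat-style cascade, truncation to the first k coefficients, and a two-way split of the feature vectors. It is illustrative only; it uses a hand-rolled Haar transform and SciPy's agglomerative (Ward) clustering as a stand-in for the divisive DIANA method used in the paper, and the function names and parameters are assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def haar_dwt(series):
    """Full Haar wavelet decomposition (Mallat-style cascade).

    The length must be a power of two (512 one-minute bins in the paper).
    Returns a coefficient vector of the same length as the input,
    ordered from the coarsest to the finest scale.
    """
    x = np.asarray(series, dtype=float)
    coeffs = []
    while len(x) > 1:
        avg = (x[0::2] + x[1::2]) / np.sqrt(2.0)   # approximation
        det = (x[0::2] - x[1::2]) / np.sqrt(2.0)   # detail
        coeffs.append(det)
        x = avg
    coeffs.append(x)
    return np.concatenate(coeffs[::-1])            # coarsest coefficients first

def feature_matrix(time_series_matrix, k):
    """Decompose every row and retain the first k wavelet coefficients."""
    return np.array([haar_dwt(row)[:k] for row in time_series_matrix])

def two_way_split(features):
    """Split feature vectors into two clusters and return the indices of the
    members of the less populated cluster, which are flagged as suspicious."""
    labels = fcluster(linkage(features, method="ward"), t=2, criterion="maxclust")
    small = min(set(labels), key=lambda c: np.sum(labels == c))
    return np.where(labels == small)[0]
```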
4 Experimental Evaluation

We set up an isolated computer cluster and launch 68 out of a total of 164 email worms reported between April 2004 and July 2007 in the monthly updated top threat lists of Virus Radar [3] and Viruslist [4]. Over a period of eight hours we capture the DNS query streams the infected machines generate, creating – to the best of our knowledge – the largest email worm DNS dataset that has to date been used to evaluate an in-network email worm detection method. As our isolated computer cluster has no real users, we merge the worm traffic with legitimate DNS traffic captured at the primary name server of our research institute, which serves between 350 and 500 users daily. We use three recent DNS log file fragments that we split into eight-hour datasets, and we present here experiments with four datasets. For each DNS query stream we consider a time series of length 512 with one-minute time bins. We decompose the time series, compress the wavelet coefficient matrix, retain k wavelet coefficients to make the feature vector matrix, and cluster the rows of the feature vector matrix into two clusters. In this paper we focus on showing that our method detects various instances of email worms; therefore, we assume that only one user machine is infected at any one time. The procedure is shown in Fig. 1:

\underbrace{\begin{bmatrix} ts_1^{um-1} & \cdots & ts_{512}^{um-1} \\ \vdots & & \vdots \\ ts_1^{um-n} & \cdots & ts_{512}^{um-n} \\ ts_1^{worm} & \cdots & ts_{512}^{worm} \end{bmatrix}}_{(n+1) \times 512}
\rightarrow
\underbrace{\begin{bmatrix} wc_1^{um-1} & \cdots & wc_{512}^{um-1} \\ \vdots & & \vdots \\ wc_1^{um-n} & \cdots & wc_{512}^{um-n} \\ wc_1^{worm} & \cdots & wc_{512}^{worm} \end{bmatrix}}_{(n+1) \times 512}
\rightarrow
\underbrace{\begin{bmatrix} fv_1^{um-1} & \cdots & fv_k^{um-1} \\ \vdots & & \vdots \\ fv_1^{um-n} & \cdots & fv_k^{um-n} \\ fv_1^{worm} & \cdots & fv_k^{worm} \end{bmatrix}}_{(n+1) \times k},
\qquad k \in \{4, 8, 16, 32, 64, 128, 256\}

Fig. 1. We append one infectious time series to the non-infected user machines' time series (Time Series Matrix); we decompose the time series to form the Wavelet Coefficient Matrix, which we compress to get the Feature Vector Matrix. This is the input to the clustering analysis, which we repeat 4 Datasets × 7 Feature vector lengths × 68 Email Worms = 1904 times.

We examine the resulting two-cluster scheme and find that it comprises one densely populated cluster and one sparsely populated cluster. Our method detects an email worm when its feature vector belongs to the sparsely populated cluster. In Fig. 2 we present the false positive and false negative rates. Our method erroneously reports legitimate user activity as suspicious in less than 1% of cases for every value of k, whereas worms are misclassified in less than 2% of cases if at least 16 wavelet coefficients are used.

Fig. 2. False negative and false positive rates for detecting various instances of email worms over four different DNS datasets (DS1, DS2, DS3 and DS4), while retaining 4, 8, 16, 32, 64, 128 or 256 wavelet coefficients. With 16 or more wavelet coefficients both rates fall below 2%.

Fig. 3. Accuracy measures computed independently for each worm show that our method is remarkably accurate. Only k values lower than or equal to the first that maximizes accuracy are plotted. With 16 or more wavelet coefficients only the Eyeveg.F email worm is not detected.

In Fig. 3, using the accuracy statistical measure, we look at the detection of each worm independently.
Accuracy is defined as the proportion of true results – both true positives and true negatives – in the population, and measures the ability of our method to identify correctly email worm-infected and non-infected user machines. The accuracy is in the range between 0 and 1, where values close to 1 indicate good detection. In the interest of space we present here only results over one dataset. Our method fails to detect only one out of 68 total email worms i.e. Eyeveg.F for every k, while with more of 16 wavelet coefficients all other email worms are detected. 5 Conclusion Email worms and the high amount of email abusive traffic continue to be a serious security concern. This paper presented a method to early detect email worms in the local name server, which is topologically near the infected host by analyzing DNS query streams characteristics at flow level. By experimenting with various email worm instances we show that these characteristics remain unaltered in the long run. Our method builds on unsupervised learning and similarity search over time series using wavelets. Our experimental results show that our method identifies various email worm instances with remarkable accuracy. Future work calls for analysing DNS data from other networking environments to assess the detection efficacy of our method over DNS data that might have different flow-level characteristics, and investigating the actions that can be triggered at the local name server once an email worm-infected user machine has been detected to contain email worm propagation. References 1. Messaging Anti-Abuse Working Group: Email Metrics Report, http://www.maawg.org 2. Symantec: Internet Security Threat Report Trends (January-June 2007), http://www.symantec.com Flow Level Data Mining of DNS Query Streams for Email Worm Detection 193 3. ESET Virus Radar, http://www.virus-radar.com 4. Kaspersky Lab Viruslist, http://www.viruslist.com 5. Roesch, M.: Snort - Lightweight Intrusion Detection for Networks. In: LISA 1999, 13th USENIX Systems Administration Conference, pp. 229–238. USENIX (1999) 6. Paxson, V.: Bro: A System for Detecting Network Intruders in Real-Time. In: 7th Conference on USENIX Security Symposium. USENIX (1998) 7. Singh, S., Estan, C., Varghese, G., Savage, S.: The Earlybird System for Real-time Detection of Unknown Worms. Tech. Report CS2003-0761, University of California (2003) 8. Provos, N., Holz, T.: Virtual Honeypots: From Botnet Tracking to Intrusion Detection. Addison Wesley Professional, Reading (2007) 9. Whyte, D., van Oorschot, P., Kranakis, E.: Addressing Malicious SMTP-based Mass Mailing Activity within an Enterprise Network. Technical Report TR-05-06, Carleton University, School of Computer Science (2005) 10. Ishibashi, K., Toyono, T., Toyama, K., Ishino, M., Ohshima, H., Mizukoshi, I.: Detecting Mass-Mailing Worm Infected Hosts by Mining DNS Traffic Data. In: MineNet 2005 ACM SIGCOMM Workshop, pp. 159–164. ACM Press, New York (2005) 11. Wong, C., Bielski, S., McCune, J., Wang, C.: A Study of Mass-Mailing Worms. In: WORM 2004 ACM Workshop, pp. 1–10. ACM Press, New York (2004) 12. Musashi, Y., Matsuba, R., Sugitani, K.: Indirect Detection of Mass Mailing WormInfected PC Terminals for Learners. In: 3rd International Conference on Emerging Telecommunications Technologies and Applications, pp. 233–237 (2004) 13. Musashi, Y., Rannenberg, K.: Detection of Mass Mailing Worm-Infected PC Terminals by Observing DNS Query Access. IPSJ SIG Notes, pp. 39–44 (2004) 14. 
Schaelicke, L., Slabach, T., Moore, B., Freeland, C.: Characterizing the Performance of Network Intrusion Detection Sensors. In: Recent Advances in Intrusion Detection, 6th International Symposium, RAID. LNCS, pp. 155–172. Springer, Heidelberg (2003) 15. Chatzis, N.: Motivation for Behaviour-Based DNS Security: A Taxonomy of DNS-related Internet Threats. In: International Conference on Emerging Security Information Systems, and Technologies, pp. 36–41. IEEE, Los Alamitos (2007) 16. Dainotti, A., Pescape, A., Ventre, G.: Wavelet-based Detection of DoS Attacks. In: Global Telecommunications Conference, GLOBECOM 2006, pp. 1–6. IEEE, Los Alamitos (2006) 17. Li, L., Lee, G.: DDoS Attack Detection and Wavelets. In: 12th International Conference on Computer Communications and Networks, ICCCN 2003, pp. 421–427. IEEE, Los Alamitos (2003) 18. Chong, K., Song, H., Noh, S.: Traffic Characterization of the Web Server Attacks of Worm Viruses. In: Int. Conference on Computational Science, pp. 703–712. Springer, Heidelberg (2003) 19. Dainotti, A., Pescape, A., Ventre, G.: Worm Traffic Analysis and Characterization. In: International Conference on Communications, ICC 2007, pp. 1435–1442. IEEE, Los Alamitos (2007) 20. Aggarwal, C., Hinneburg, A., Keim, D.: On the Surprising Behavior of Distance Metrics in High Dimensional Space. In: 8th Int. Conf. on Database Theory. LNCS, pp. 420–434. Springer, Heidelberg (2001) 21. Bagnall, A., Ratanamahatana, C., Keogh, E., Lonardi, S., Janacek, G.: A Bit Level Representation for Time Series Data Mining with Shape Based Similarity. Data Min. and Knowl. Discovery 13(1), 11–40 (2006) 194 N. Chatzis and R. Popescu-Zeletin 22. Mallat, S.: A Theory for Multiresolution Signal Decomposition: The Wavelet Representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 11(7), 674–693 (1989) 23. Mörchen, F.: Time Series Feature Extraction for Data Mining Using DWT and DFT. Technical Report No. 33, Dept. of Maths and CS, Philipps-U. Marburg (2003) 24. Keogh, E., Kasetty, S.: On the Need for Time Series Data Mining Benchmarks: A survey and empirical demonstration. Data Min. and Knowl. Discovery 7(4), 349–371 (2003) 25. Kaufman, L., Rousseeuw, P.: Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, Chichester (1990) 26. Faloutsos, C., Ranganathan, M., Manolopoulos, Y.: Fast Subsequence Matching in Time Series Databases. In: ACM SIGMOD International Conference on Management of Data, pp. 419–429. ACM Press, New York (1994) Adaptable Text Filters and Unsupervised Neural Classifiers for Spam Detection Bogdan Vrusias and Ian Golledge Department of Computing, Faculty of Electronic and Physical Sciences, University of Surrey, Guildford, UK {b.vrusias,cs31ig}@surrey.ac.uk Abstract. Spam detection has become a necessity for successful email communications, security and convenience. This paper describes a learning process where the text of incoming emails is analysed and filtered based on the salient features identified. The method described has promising results and at the same time significantly better performance than other statistical and probabilistic methods. The salient features of emails are selected automatically based on functions combining word frequency and other discriminating matrices, and emails are then encoded into a representative vector model. Several classifiers are then used for identifying spam, and self-organising maps seem to give significantly better results. 
Keywords: Spam Detection, Self-Organising Maps, Naive Bayesian, Adaptive Text Filters. 1 Introduction The ever increasing volume of spam brings with it a whole series of problems to a network provider and to an end user. Networks are flooded every day with millions of spam emails wasting network bandwidth while end users suffer with spam engulfing their mailboxes. Users have to spend time and effort sorting through to find legitimate emails, and within a work environment this can considerably reduce productivity. Many anti-spam filter techniques have been developed to achieve this [4], [8]. The overall premise of spam filtering is text categorisation where an email can belong in either of two classes: Spam or Ham (legitimate email). Text categorisation can be applied here as the content of a spam message tends to have few mentions in that of a legitimate email. Therefore the content of spam belongs to a specific genre which can be separated from normal legitimate email. Original ideas for filtering focused on matching keyword patterns in the body of an email that could identify it as spam [9]. A manually constructed list of keyword patterns such as “cheap Viagra” or “get rich now” would be used. For the most effective use of this approach, the list would have to be constantly updated and manually tuned. Overtime the content and topic of spam would vary providing a constant challenge to keep the list updated. This method is infeasible as it would be impossible to manually keep up with the spammers. Sahami et al. is the first to apply a machine learning technique to the field of antispam filtering [5]. They trained a Naïve Bayesian (NB) classifier on a dataset of pre-categorised ham and spam. A vector model is then built up of Boolean values representing the existence of pre-selected attributes of a given message. As well as E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 195–202, 2009. © Springer-Verlag Berlin Heidelberg 2009 springerlink.com 196 B. Vrusias and I. Golledge word attributes, the vector model could also contain attributes that represent nontextual elements of a message. For example, this could include the existence of a nonmatching URL embedded in the email. Other non-textual elements could include whether an email has an attachment, the use of bright fonts to draw attention to certain areas of an email body and the use of embedded images. Metsis et al. evaluated five different versions of Naïve Bayes on particular dataset [3]. Some of these Naïve Bayesian versions are more common in spam filtering than others. The conclusion of the paper is that the two Naïve Bayes versions used least in spam filtering provided the best success. These are a Flexible Bayes method and a Multinomial Naïve Bayes (MNB) with Boolean attributes. The lower computational complexity of the MNB provided it the edge. The purpose of their paper is not only to contrast the success of five different Naïve Bayes techniques but to implement the techniques in a situation of a new user training a personalized learning anti-spam filter. This involved incremental retraining and evaluating of each technique. Furthermore, methods like Support Vector Machines (SVM) have also been used to identify spam [13]. Specifically, term frequency with boosting trees and binary features with SVM’s had acceptable test performance, but both methods used high dimensional (1000 - 7000) feature vectors. Another approach is to look into semantics. 
Youn & McLeod introduced a method to allow for machine-understandable semantics of data [7]. The basic idea here is to model the concept of spam in order to semantically identify it. The results reported are very encouraging, but the model constructed is static and therefore not adaptable. Most of the above proposed techniques struggled with changes in the email styles and words used in spam emails. Therefore, it made sense to consider an automatic learning approach to spam filtering, in order to adapt to changes. In this approach spam features are updated based on newly arriving spam messages. This, together with a novel method for training online Self-Organising Maps (SOM) [2] and retrieving the classification of a new email, indicated good performance. Most importantly, the proposed method misclassified only very few ham messages as spam, and correctly identified most spam messages. This exceeds the performance of other probabilistic approaches, as shown later in the paper.

2 Spam Detection Methods

As indicated by previous research, one of the best ways so far to classify spam is to use probabilistic models, i.e. Bayesian classifiers [3], [5], [6], [8], [9]. For that reason, this paper compares the approach of using SOMs to what appears to be the best classifier for spam, the MNB Boolean classifier. Both approaches need to transform the text of an email message into a numerical vector; therefore several vector models have been proposed and are described later on.

2.1 Classifying with Multinomial NB Boolean

The MNB treats each message d as a set of tokens. Therefore d is represented by a numerical feature vector model. Each element of the vector model is a Boolean value indicating whether that token exists in the message or not. The probability P(x|c) can be calculated by trialling the probability of each token t occurring in a category c. The product of these trials, P(ti|c), for each category results in P(x|c) for the respective category. The equation is then [6]:

\[ \frac{P(c_s)\,\prod_{i=1}^{m} P(t_i \mid c_s)^{x_i}}{\sum_{c \in \{c_s, c_h\}} P(c)\,\prod_{i=1}^{m} P(t_i \mid c)^{x_i}} > T \qquad (1) \]

Each trial P(t|c) is estimated using a Laplacean prior:

\[ P(t \mid c) = \frac{1 + M_{t,c}}{2 + M_c} \qquad (2) \]

where Mt,c is the number of training messages of category c that contain the token t, and Mc is the total number of training messages of category c. The outcomes of all these trials are considered independent given the category, which is a naïve assumption. This simplistic assumption overlooks the fact that co-occurrences of words in a category are not independent; however, the technique still results in very good performance on classification tasks.

2.2 Classifying with Self-Organising Maps

Self-organising map (SOM) systems have been used consistently for classification and data visualisation in general [2]. The main function of a SOM is to identify salient features in the n-dimensional input space and squash that space into two dimensions according to similarity. Despite their popularity, SOMs are difficult to use after training is over. Although visually some clusters emerge in the output map, computationally it is difficult to classify a new input into a formed cluster and be able to semantically label it. For classification, a weighted majority voting (WMV) method is used to identify the label of a new, unknown input [12]. Voting is based on the distance of each input vector from the node vector.
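A short sketch makes the decision rule of Eqs. (1)–(2) concrete. This is an illustrative reading of the equations rather than the authors' code: the toy token matrix, the function names and the threshold T = 0.5 are assumptions, and the computation is done in log space for numerical stability.

```python
import numpy as np

def train_mnb_boolean(X, y):
    """X: (N, m) Boolean token-presence matrix; y: 1 for spam, 0 for ham.
    Returns the priors P(c) and the Laplace-smoothed token probabilities
    P(t|c) of Eq. (2): (1 + M_t,c) / (2 + M_c)."""
    priors, token_probs = {}, {}
    for c in (0, 1):
        Xc = X[y == c]
        priors[c] = len(Xc) / len(X)
        token_probs[c] = (1.0 + Xc.sum(axis=0)) / (2.0 + len(Xc))
    return priors, token_probs

def spam_score(x, priors, token_probs):
    """Left-hand side of Eq. (1): the spam joint probability divided by
    the sum of the joint probabilities over both categories."""
    log_joint = {}
    for c in (0, 1):
        # p^x_i: only tokens present in the message (x_i = 1) contribute.
        log_joint[c] = np.log(priors[c]) + np.sum(x * np.log(token_probs[c]))
    m = max(log_joint.values())
    return np.exp(log_joint[1] - m) / sum(np.exp(v - m) for v in log_joint.values())

# Toy example with four tokens; classify as spam when the score exceeds T.
X = np.array([[1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 1, 1], [0, 1, 1, 1]])
y = np.array([1, 1, 0, 0])
priors, probs = train_mnb_boolean(X, y)
print(spam_score(np.array([1, 1, 0, 0]), priors, probs) > 0.5)  # True
```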
For the proposed process for classifying an email and adapting it to new coming emails, the feature vector model is calculated based on the first batch of emails and then the SOM is trained on the first batch of emails with random weights. Then, a new batch of emails is appended, the feature vector model is recalculated, and the SOM is retrained from the previous weights, but on the new batch only. Finally, a new batch of emails inserted and the process is repeated until all batches are finished. For the purpose of the experiments, as described later, a 10x10 SOM is trained for 1000 cycles, where each cycle is a complete run of all inputs. The learning rate and neighbourhood value is started at high values, but then decreased exponentially towards the end of the training [12]. Each training step is repeated several times and results are averaged to remove any initial random bias. 198 B. Vrusias and I. Golledge 3 Identifying Salient Features The process of extracting salient features is probably the most important part of the methodology. The purpose here is to identify the keywords (tokens) that differentiate spam from ham. Typical approaches so far focused on pure frequency measures for that purpose, or the usage of the term frequency inverse document frequency (tf*idf) metric [10]. Furthermore, the weirdness metric that calculates the frequency ration of tokens used in special domains like spam, against the ratio in the British National Corpus (BNC), reported accuracy to some degree [1], [12]. This paper uses a combination function of the weirdness and TFIDF metrics, and both metrics are used in their normalised form. The ranking Rt of each token is therefore calculated based on: Rt = weirdness t × tf ∗ idf t (3) The weirdness metric compares the frequency of the token in the spam domain against the frequency of the same token in BNC. For tf*idf the “document” is considered as a category where all emails belonging to that same category are merged together, and document frequency is the total number of categories (i.e. 2 in this instance: spam and ham). The rating metric R is used to build a list of most salient features in order to encode emails into binary numerical input vectors (see Fig. 1). After careful examination a conclusion is drawn, that in total, the top 500 tokens are enough to represent the vector model. Furthermore, one more feature (dimension) is added in the vector, indicating whether there are any ham tokens present in an email. The binary vector model seemed to work well and has the advantage of the computational simplicity. SPAM EMAIL HAM EMAIL Subject: dobmeos with hgh my energy level has gone up ! stukm Introducing doctor – formulated hgh Subject: re : entex transistion thanks so much for the memo . i would like to reiterate my support on two key issues : 1 ) . thu - best of luck on this new assignment . howard has worked hard and done a great job ! please don ' t be shy on asking questions . entex is critical to the texas business , and it is critical to our team that we are timely and accurate . 2 ) . rita : thanks for setting up the account team . communication is critical to our success , and i encourage you all to keep each other informed at all times . the p & l impact to our business can be significant . additionally , this is high profile , so we want to assure top quality . thanks to all of you for all of your efforts . let me know if there is anything i can do to help provide any additional support . 
rita wynne … human growth hormone - also called hgh is referred to in medical science as the master hormone. it is very plentiful when we are young , but near the age of twenty - one our bodies begin to produce less of it . by the time we are forty nearly everyone is deficient in hgh , and at eighty our production has normally diminished at least 90 - 95 % . advantages of hgh : - increased muscle strength - loss in body fat - increased bone density - lower blood pressure - quickens wound healing - reduces cellulite - increased … sexual potency

Fig. 1. Sample spam and ham emails. Large bold words indicate top-ranked spam words and smaller words indicate spam words with low ranking, whereas normal black text indicates non-spam words.

In most cases of generating feature vectors, scientists usually concentrate on static models that require complete refactoring when information changes or when the user provides feedback. In order to cope with the demand for change, the proposed model can automatically recalculate the salient features and appropriately adapt the vector model to accommodate this. The method can safely modify/update the vector model every 100 emails in order to achieve best performance. Basically, the rank list is modified depending on the contents of the newly arriving emails. This is clearly visualised in Fig. 2, where it is observable that as more email batches (of 100 emails) are presented, the tokens in the list get updated. New "important" tokens are quickly placed at the top of the rank, but the ranking changes based on other new entries.

Fig. 2. Random spam keyword ranking (rank on the y-axis, from 0 to 500; batch number on the x-axis) as it evolved through the training process for the Enron1 dataset, tracked for the keywords viagra, hotlist, xanax, pharmacy, vicodin, pills, sofftwaares, valium, prozac and computron. Each batch contains 100 emails. The graph shows that each new keyword entry has an impact on the ranking list, and it then fluctuates to accommodate newly arriving keywords.

4 Experimentation: Spam Detection

In order to evaluate spam filters a dataset with a large volume of spam and ham messages is required. Gathering public benchmark datasets of a large size has proven difficult [8]. This is mainly due to privacy issues of the senders and receivers of ham emails within a particular dataset. Some datasets have tried to bypass the privacy issue by considering ham messages collected from freely accessible sources such as mailing lists. The Ling-Spam dataset consists of spam received at the time and a collection of ham messages from an archived list of linguist mails. The SpamAssassin corpus uses ham messages donated by the public or collected from public mailing lists. Other datasets like SpamBase and PU only provide the feature vectors rather than the content itself and are therefore considered inappropriate for the proposed method.

4.1 Setup

One of the most widely used datasets in spam filtering research is the Enron dataset. From a set of 150 mailboxes with messages, various benchmark datasets have been constructed. A subset as constructed by Androutsopoulos et al. [6] is used, containing the mailboxes of six users within the dataset. To reflect the different scenarios of a personalised filter, each dataset is interlaced with varying amounts of spam (from a variety of sources), so that some had a ham-spam ratio of 1:3 and others 3:1.
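The ranking that drives these updates (Eq. 3 in Section 3) can be sketched as follows. This is a schematic reading rather than the authors' implementation: the exact normalisation is not specified in the paper, so min-max normalisation is assumed, the BNC relative frequencies are assumed to be available as a dictionary, and the per-batch recomputation is only indicated in the comments.

```python
import math
from collections import Counter

def rank_tokens(spam_tokens, ham_tokens, bnc_freq, top_n=500):
    """Rank tokens by R_t = weirdness_t * tfidf_t over the emails seen so
    far; intended to be re-run after every batch of 100 emails.

    spam_tokens / ham_tokens: flat token lists of the spam and ham emails.
    bnc_freq: relative token frequencies in the British National Corpus.
    """
    spam_tf, ham_tf = Counter(spam_tokens), Counter(ham_tokens)
    n_spam = sum(spam_tf.values())

    weird, tfidf = {}, {}
    for t, f in spam_tf.items():
        # Weirdness: frequency ratio in the spam domain vs. ratio in the BNC.
        weird[t] = (f / n_spam) / max(bnc_freq.get(t, 0.0), 1e-9)
        # tf*idf with two "documents": the merged spam and ham categories.
        df = 1 + (1 if t in ham_tf else 0)
        tfidf[t] = f * math.log(2.0 / df)

    def minmax(d):  # assumed normalisation of both metrics
        lo, hi = min(d.values()), max(d.values())
        return {k: (v - lo) / (hi - lo + 1e-9) for k, v in d.items()}

    w, ti = minmax(weird), minmax(tfidf)
    return sorted(spam_tf, key=lambda t: w[t] * ti[t], reverse=True)[:top_n]
```

The returned top-500 list would then be used to encode each email as a binary vector, with the extra ham-token indicator dimension appended as described in Section 3.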
To implement the process of incremental retraining, the approach suggested by Androutsopoulos et al. [6] is adapted, where the messages of each dataset are split into batches b1,…,bl of k adjacent messages. Then, for batch i=1 to l-1, the filter is trained on batch bi and tested on batch bi+1. The number of emails per batch is k=100. The performance of a spam filter is measured by its ability to correctly identify spam and ham while minimising misclassification. nh→h and ns→s represent the numbers of correctly classified ham and spam messages. nh→s represents the number of ham messages misclassified as spam (false positives) and ns→h represents the number of spam messages misclassified as ham (false negatives). Spam precision and recall are then calculated. These measurements are useful for showing the basic performance of a spam filter. However, they do not take into account the fact that misclassifying a ham message as spam is an order of magnitude worse than misclassifying a spam message as ham. A user can cope with a number of false negatives, however a false positive could result in the loss of a potentially important legitimate email, which is unacceptable to the user. So, when considering the statistical success of a spam filter, the consequence weight associated with false positives should be taken into account. Androutsopoulos et al. [6] introduced the idea of a weighted accuracy measurement (WAcc):

\[ WAcc_{\lambda} = \frac{\lambda \cdot n_{h \rightarrow h} + n_{s \rightarrow s}}{\lambda \cdot N_h + N_s} \qquad (4) \]

Nh and Ns represent the total numbers of ham and spam messages respectively. In this measurement each legitimate ham message is treated as λ messages. Every false positive is counted as λ errors instead of just one. The higher the value of λ, the higher the cost of each ham misclassification. When λ=99, misclassifying a ham message is as bad as letting 99 spam messages through the filter. The value of λ can be adjusted depending on the scenario and consequences involved.

4.2 Results

Across the six datasets the results show a variance in the performance of the MNB and a consistent performance of the SOM. Across the first three datasets, with a ratio of 3:1 in favour of ham, the MNB almost perfectly classifies ham messages; however, the recall of spam is noticeably low. This is especially apparent in Enron 1, which appears to be the most difficult dataset. The last three datasets have a 3:1 ratio in favour of spam, and this change in ratio is reflected in a change in the pattern of the MNB results. The recall of spam is highly accurate; however, many ham messages are missed by the classifier. The pattern of performance of the SOM across the 6 datasets is consistent. Recall of spam is notably high over each dataset with the exception of a few batches. The ratio of spam to ham in the datasets appears to have no bearing on the results of the SOM. The recall of ham messages in each batch is very high, with a very low percentage of ham messages misclassified. The resulting overall accuracy is very high for the SOM and consistently higher than for the MNB. However, the weighted accuracy puts a different perspective on the results. Fig. 3 (a), (b) and (c) show the MNB achieving consistently better weighted accuracy than the SOM. Although the MNB misclassified a lot of spam, the cost of the SOM misclassifying a small number of ham messages results in the MNB being more effective on these datasets. Fig. 3 (d), (e) and (f) show the SOM outperforming the MNB on the spam-heavy datasets.
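For concreteness, Eq. (4) with λ=99 amounts to the following few lines; the per-batch counts in the example are invented for illustration.

```python
def weighted_accuracy(n_h_to_h, n_s_to_s, total_ham, total_spam, lam=99):
    """Eq. (4): WAcc_lambda = (lam * n_h->h + n_s->s) / (lam * N_h + N_s).
    Each ham message counts lam times, so one false positive costs as much
    as lam false negatives."""
    return (lam * n_h_to_h + n_s_to_s) / (lam * total_ham + total_spam)

# Example batch of 100 messages (75 ham, 25 spam): one ham misclassified
# as spam and two spam messages missed.
print(round(weighted_accuracy(74, 23, 75, 25, lam=99), 4))  # 0.9864
```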
The MNB missed a large proportion of ham messages and consequently the weighted accuracy is considerably lower than that of the SOM.

Fig. 3. WAcc results for all Enron datasets (1-6), panels (a) to (f), for SOM and MNB with λ=99. Training / testing is conducted based on 30 batches of Spam and Ham, and a total number of 3000 emails. The y-axis shows WAcc and the x-axis indicates the batch number.

5 Conclusions

This paper has discussed and evaluated two classifiers for the purpose of categorising emails into the classes of spam and ham. Both the MNB and SOM methods are incrementally trained and tested on 6 subsets of the Enron dataset. The methods are evaluated using a weighted accuracy measurement. The results of the SOM proved consistent over each dataset, maintaining an impressive spam recall. A small percentage of ham emails are misclassified by the SOM. Each ham missed is treated as the equivalent of missing 99 spam emails. This lowered the overall effectiveness of the SOM. The MNB demonstrated a trade-off between false positives and false negatives as it struggled to maintain high performance on both. Where it struggled to classify spam in the first three datasets, ham recall is impressive and consequently the WAcc is consistently better than that of the SOM. This pattern is reversed in the final three datasets as many ham messages are missed, and the SOM outperformed the MNB. Further evaluations are currently being made into the selection of salient features and the size of the attribute set. This work aims to reduce the small percentage of ham misclassified by the SOM to improve its weighted accuracy performance.

References

1. Manomaisupat, P., Vrusias, B., Ahmad, K.: Categorization of Large Text Collections: Feature Selection for Training Neural Networks. In: Corchado, E., Yin, H., Botti, V., Fyfe, C. (eds.) IDEAL 2006. LNCS, vol. 4224, pp. 1003–1013. Springer, Heidelberg (2006)
2. Kohonen, T.: Self-organizing Maps, 2nd edn. Springer, New York (1997)
3. Metsis, V., Androutsopoulos, I., Paliouras, G.: Spam Filtering with Naïve Bayes – Which Naïve Bayes? In: CEAS, 3rd Conf. on Email and AntiSpam, California, USA (2006)
4. Zhang, L., Zhu, J., Yao, T.: An Evaluation of Statistical Spam Filtering Techniques. ACM Trans. on Asian Language Information Processing 3(4), 243–269 (2004)
5. Sahami, M., Dumais, S., Heckerman, D., Horvitz, E.: A Bayesian Approach to Filtering Junk E-mail. In: Learning for Text Categorization – Papers from the AAAI Workshop, Madison, Wisconsin, pp. 55–62 (1998)
6. Androutsopoulos, I., Paliouras, G., Karkaletsi, V., Sakkis, G., Spyropoulos, C.D., Stamatopoulos, P.: Learning to Filter Spam E-Mail: A Comparison of a Naïve Bayesian and a Memory-Based Approach. In: Proceedings of the Workshop Machine Learning and Textual Information Access, 4th European Conf. on KDD, Lyon, France, pp. 1–13 (2000)
7.
Youn, S., McLeod, D.: Efficient Spam Email Filtering using Adaptive Ontology. In: 4th International Conf. on Information Technology, ITNG 2007, pp. 249–254 (2007) 8. Hunt, R., Carpinter, J.: Current and New Developments in Spam Filtering. In: 14th IEEE International Conference on Networks, ICON 2006, vol. 2, pp. 1–6 (2006) 9. Peng, F., Schuurmans, D., Wang, S.: Augmenting Naive Bayes Classifiers with Statistical Language Models. Information Retrieval 7, 317–345 (2004) 10. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Information Processing & Management 24(5), 513–523 (1988) 11. Vrusias, B.: Combining Unsupervised Classifiers: A Multimodal Case Study, PhD thesis, University of Surrey (2004) 12. Drucker, H., Wu, D., Vapnik, V.N.: Support Vector Machines for Spam Categorization. IEEE Transactions on Neural Networks 10(5), 1048–1054 (1999) A Preliminary Performance Comparison of Two Feature Sets for Encrypted Traffic Classification Riyad Alshammari and A. Nur Zincir-Heywood Dalhousie University, Faculty of Computer Science {riyad,zincir}@cs.dal.ca Abstract. The objective of this work is the comparison of two types of feature sets for the classification of encrypted traffic such as SSH. To this end, two learning algorithms – RIPPER and C4.5 – are employed using packet header and flow-based features. Traffic classification is performed without using features such as IP addresses, source/destination ports and payload information. Results indicate that the feature set based on packet header information is comparable with flow based feature set in terms of a high detection rate and a low false positive rate. Keywords: Encrypted Traffic Classification, Packet, Flow, and Security. 1 Introduction In this work our objective is to explore the utility of two possible feature sets – Packet header based and Flow based – to represent the network traffic to the machine learning algorithms. To this end, we employed two machine learning algorithms – C4.5 and RIPPER [1] – in order to classify encrypted traffic, specifically SSH (Secure Shell). In this work, traffic classification is performed without using features such as IP addresses, source/destination ports and payload information. By doing so, we aim to develop a framework where privacy concerns of users are respected but also an important task of network management, i.e. accurate identification of network traffic, is achieved. Having an encrypted payload and being able to run different applications over SSH makes it a challenging problem to classify SSH traffic. Traditionally, one approach to classifying network traffic is to inspect the payload of every packet. This technique can be extremely accurate when the payload is not encrypted. However, encrypted applications such as SSH imply that the payload is opaque. Another approach to classifying applications is using well-known TCP/UDP port numbers. However, this approach has become increasingly inaccurate, mostly because applications can use non-standard ports to by-pass firewalls or circumvent operating systems restrictions. Thus, other techniques are needed to increase the accuracy of network traffic classification. The rest of this paper is organized as follows. Related work is discussed in section 2. Section 3 details the methodology followed. Aforementioned machine learning algorithms are detailed in Section 4, and the experimental results are presented in Section 5. Finally, conclusions are drawn and future work is discussed in Section 6. E. Corchado et al. 
(Eds.): CISIS 2008, ASC 53, pp. 203–210, 2009. © Springer-Verlag Berlin Heidelberg 2009 springerlink.com 204 R. Alshammari and A.N. Zincir-Heywood 2 Related Work In literature, Zhang and Paxson present one of the earliest studies of techniques based on matching patterns in the packet payloads [2]. Early et al. employed a decision tree classifier on n-grams of packets for classifying flows [3]. Moore et al. used Bayesian analysis to classify flows into broad categories [4]. Karagiannis et al. proposed an approach that does not use port numbers or payload information [5], but their system cannot classify distinct flows. Wright et al. investigate the extent to which common application protocols can be identified using only packet size, timing and direction information of a connection [6]. They employed a kNN and HMM learning systems to compare the performance. Their performance on SSH classification is 76% detection rate and 8% false positive rate. Bernaille et al. employed first clustering and then classification to the first three packets in each connection to identify SSL connections [7]. Haffner et al. employed AdaBoost, Hidden Markov Models (HMM), Naive Bayesian and Maximum Entropy models to classify network traffic into different applications [8]. Their results showed AdaBoost performed the best on their data sets. In their work, the classification rate for SSH was 86% detection rate and 0% false positive rate but they employed the first 64 bytes of the payload. Recently, Williams et al. [9] compared five different classifiers – Bayesian Network, C4.5, Naive Bayes (two different types) and Naive Bayes Tree – using flows. They found that C4.5 performed better than the others. In our previous work [10], we employed RIPPER and AdaBoost algorithms for classifying SSH traffic. RIPPER performed better than AdaBoost by achieving 99% detection rate and 0.7% false positive rate. However, in that work, all tests were performed using flow based feature sets, whereas in this work we not only employ other types of classifiers but also investigate the usage of packet header based feature sets. 3 Methodology In this work, RIPPER and C4.5 based classifiers are employed to identify the most relevant feature set – Packet header vs. Flow – to the problem of SSH traffic classification. For packet header based features used, the underlying principle is that features employed should be simple and clearly defined within the networking community. They should represent a reasonable benchmark feature set to which more complex features might be added in the future. Given the above model, Table 1 lists the packet header Feature Set used to represent each packet to our framework. In the above table, payload length and inter-arrival time are the only two features, which are not directly obtained from the header information but are actually derived using the data in the header information. In the case of inter-arrival time, we take the Table 1. Packet Header Based Features Employed IP Header length IP Time to live TCP Header length Payload length (derived) IP Fragment flags IP Protocol TCP Control bits Inter-arrival time (derived) A Preliminary Performance Comparison of Two Feature Sets 205 difference in milliseconds between the current packet and the previous packet sent from the same host within the same session. In the case of payload length, we calculate it using Eq. 1 for TCP packets and using Eq. 2 for UDP packets. 
Payload length = IPTotalLength - (IPHeaderLength x 4) - (TCPHeaderLength x 4) (1) Payload length = IPTotalLength - (IPHeaderLength x 4) – 8 (2) For the flow based feature set, a feature is a descriptive statistic that can be calculated from one or more packets for each flow. To this end, NetMate [11] is employed to generate flows and compute feature values. Flows are bidirectional and the first packet seen by the tool determines the forward direction. We consider only UDP and TCP flows that have no less than one packet in each direction and transport no less than one byte of payload. Moreover, UDP flows are terminated by a flow timeout, whereas TCP flows are terminated upon proper connection teardown or by a flow timeout, whichever occurs first. The flow timeout value employed in this work is 600 seconds [12]. We extract the same set of features used in [9, 10] to provide a comparison environment for the reader, Table 2. Table 2. Flow Based Features Employed Protocol # Packets in forward direction Min forward inter-arrival time Std. deviation of forward inter-arrival times Mean forward inter-arrival time Max forward inter-arrival time Std. deviation of backward inter-arrival times Min forward packet length Max forward packet length Std deviation of forward packet length Mean backward packet length Duration of the flow # Bytes in forward direction # Bytes in backward direction # Packets in backward direction Mean backward inter-arrival time Max backward inter-arrival time Min backward inter-arrival time Mean forward packet length Min backward packet length Std. deviation of backward packet length Max backward packet length Fig. 1. Generation of network traffic for the NIMS data set In our experiments, the performance of the different machine learning algorithms is established on two different traffic sources: Dalhousie traces and NIMS traces. 206 R. Alshammari and A.N. Zincir-Heywood - Dalhousie traces were captured by the University Computing and Information Services Centre (UCIS) in January 2007 on the campus network between the university and the commercial Internet. Given the privacy related issues university may face, data is filtered to scramble the IP addresses and further truncate each packet to the end of the IP header so that all payload is excluded. Moreover, the checksums are set to zero since they could conceivably leak information from short packets. However, any length information in the packet is left intact. Thus the data sets given to us are anonymized and without any payload information. Furthermore, Dalhousie traces are labeled by a commercial classification tool (deep packet analyzer) called PacketShaper [13] by the UCIS. This provides us the ground truth for the training. PacketShaper labeled all traffic either as SSH or non-SSH. - NIMS traces consist of packets collected on our research test-bed network. Our data collection approach is to simulate possible network scenarios using one or more computers to capture the resulting traffic. We simulate an SSH connection by connecting a client computer to four SSH servers outside of our test-bed via the Internet, Figure 1.We ran the following six SSH services: (i) Shell login; (ii) X11; (iii) Local tunneling; (iv) Remote tunneling; (v) SCP; (vi) SFTP. We also captured the following application traffic: DNS, HTTP, FTP, P2P (limewire), and Telnet. These traces include all the headers, and the application payload for each packet. Since both of the traffic traces contain millions of packets (40 GB of traffic). 
We performed subset sampling to limit the memory and CPU time required for training and testing. Subset sampling algorithms are a mature field of machine learning in which it has already been thoroughly demonstrated that performance of the learner (classifier) is not impacted by restricting the learner to a subset of the exemplars during training [14]. The only caveat to this is that the subset be balanced. Should one samples without this constraint one will provide a classifier which maximizes accuracy, where this is known to be a rather poor performance metric. However, the balanced subset sampling heuristic here tends to maximize the AUC (measurements of the fraction of the total area that falls under the ROC curve) statistic, a much more robust estimator of performance [14]. Thus, we sub-sampled 360,000 packets from each aforementioned data source. The 360,000 packets consist of 50% in-class (application running over SSH) and 50% out-class. From the NIMS traffic trace, we choose the first 30000 packets of X11, SCP, SFTP, Remote-tunnel, Local-tunnel and Remote login that had payload size bigger than zero. These packets are combined together in the in-class. The out-class is sampled from the first 180000 packets that had payload size bigger than zero. The out-class consists of the following applications FTP, Telnet, DNS, HTTP and P2P (lime-wire). On the other hand, from the Dalhousie traces, we filter the first 180000 packets of SSH traffic for the in-class data. The out-class is sampled from the first 180000 packets. It consists of the following applications FTP, DNS, HTTP and MSN. We then run these data sets through NetMate to generate the flow feature set. We generated 30 random training data sets from both sub-sampled traces. Each training data set is formed by randomly selecting (uniform probability) 75% of the in-class and 75% of the out-class without replacement. In case of Packet header feature set, each training data set contains 270,000 packets while in case of NetMate feature set, each training data set contains 18095 flows (equivalent of 270,000 packets in terms of flows), Table 3. In this table, some applications have packets in the training sample A Preliminary Performance Comparison of Two Feature Sets 207 but no flows. This is due to the fact that we consider only UDP and TCP flows that have no less than one packet in each direction and transport no less than one byte of payload. Table 3. Private and Dalhousie Data Sets SSH FTP TELNET DNS HTTP P2P (limewire) NIMS Training Sample for IPheader (total = 270000) x 30 135000 14924 13860 17830 8287 96146 NIMS Training Sample for NetMate (total = 18095) x 30 1156 406 777 1422 596 13738 SSH FTP TELNET DNS HTTP MSN Dalhousie Training Sample for IPheader (total = 270000) x 30 135000 139 0 2985 127928 3948 Dalhousie Training Sample for NetMate (total = 12678) x 30 11225 2 0 1156 295 0 4 Classifiers Employed In order to classify SSH traffic; two different machine learning algorithms – RIPPER and C4.5 – are employed. The reason is two-folds: As discussed earlier Williams et al. compared five different classifiers and showed that a C4.5 classifier performed better than the others [9], whereas in our previous work [10], a RIPPER based classifier performed better than the AdaBoost, which was shown to be the best performing model in [8]. RIPPER, Repeated Incremental Pruning to Produce Error Reduction, is a rulebased algorithm, where the rules are learned from the data directly [1]. 
Rule induction does a depth-first search and generates one rule at a time. Each rule is a conjunction of conditions on discrete or numeric attributes and these conditions are added one at a time to optimize some criterion. In RIPPER, conditions are added to the rule to maximize an information gain measure [1]. To measure the quality of a rule, minimum description length is used [1]. RIPPER stops adding rules when the description length of the rule base is 64 (or more) bits larger than the best description length. Once a rule is grown and pruned, it is added to the rule base and all the training examples that satisfy that rule are removed from the training set. The process continues until enough rules are added. In the algorithm, there is an outer loop of adding one rule at a time to the rule base and inner loop of adding one condition at a time to the current rule. These steps are both greedy and do not guarantee optimality. C4.5 is a decision tree based classification algorithm. A decision tree is a hierarchical data structure for implementing a divide-and-conquer strategy. It is an efficient non-parametric method that can be used both for classification and regression. In nonparametric models, the input space is divided into local regions defined by a distance metric. In a decision tree, the local region is identified in a sequence of recursive splits in smaller number of steps. A decision tree is composed of internal decision nodes and terminal leaves. Each node m implements a test function fm(x) with discrete outcomes labeling the branches. This process starts at the root and is repeated until a leaf node is hit. The value of a leaf constitutes the output. A more detailed explanation of the algorithm can be found in [1]. 208 R. Alshammari and A.N. Zincir-Heywood 5 Experimental Results In this work we have 30 training data sets. Each classifier is trained on each data set via 10-fold cross validation. The results given below are averaged over these 30 data sets. Moreover, results are given using two metrics: Detection Rate (DR) and False Positive Rate (FPR). In this work, DR reflects the number of SSH flows correctly classified, whereas FPR reflects the number of Non-SSH flows incorrectly classified as SSH. Naturally, a high DR rate and a low FPR would be the desired outcomes. They are calculated as follows: DR = 1 − (#FNClassifications / TotalNumberSSHClassifications) FPR = #FPClassifications / TotalNumberNonSSHClassifications where FN, False Negative, means SSH traffic classified as non- SSH traffic. Once the aforementioned feature vectors are prepared, RIPPER, and C4.5, based classifiers are trained using WEKA [15] (an open source tool for data mining tasks) with its default parameters for both algorithms. Tables 4 and 6 show that the difference between the two feature sets based on DR is around 1%, whereas it is less than 1% for FPR. Moreover, C4.5 performs better than RIPPER using both feature sets, but again the difference is less than 1%. C4.5 achieves 99% DR and 0.4% FPR using flow based features. Confusion matrixes, Tables 5 and 7, show that the number of SSH packets/flows that are misclassified are notably small using C4.5 classifier. Table 4. Average Results for the NIMS data sets Classifiers Employed RIPPER for Non-SSH RIPPER for –SSH C4.5 for Non-SSH C4.5 for –SSH IP-header Feature DR FPR 0.99 0.008 0.991 0.01 0.991 0.007 0.993 0.01 NetMate Feature DR FPR 1.0 0.002 0.998 0.0 1.0 0.001 0.999 0.0 Table 5. 
Confusion Matrix for the NIMS data sets Classifiers Employed RIPPER for Non-SSH RIPPER for –SSH C4.5 for Non-SSH C4.5 for –SSH IP-header Feature Non-SSH SSH 133531 1469 1130 133870 133752 1248 965 134035 NetMate Feature Non-SSH SSH 16939 0 2 1154 16937 2 1 1155 Results reported above are averaged over 30 runs, where 10-fold cross validation is employed at each run. In average, a C4.5 based classifier achieves a 99% DR and almost 0.4% FPR using flow based features and 98% DR and 2% FPR using packet header based features. In both cases, no payload information, IP addresses or port numbers are used, whereas Haffner et al. achieved 86% DR and 0% FPR using the A Preliminary Performance Comparison of Two Feature Sets 209 Table 6. Average Results for the Dalhousie data sets Classifiers Employed RIPPER for Non-SSH RIPPER for –SSH C4.5 for Non-SSH C4.5 for –SSH IP-header Feature DR FPR 0.974 0.027 0.972 0.025 0.98 0.02 0.98 0.02 NetMate Feature DR FPR 0.994 0.0008 0.999 0.005 0.996 0.0004 0.999 0.004 Table 7. Confusion Matrix for the Dalhousie data sets Classifiers Employed RIPPER for Non-SSH RIPPER for –SSH C4.5 for Non-SSH C4.5 for –SSH IP-header Feature Non-SSH SSH 131569 3431 3696 131304 13239 2651 2708 132292 NetMate Feature Non-SSH SSH 1445 8 8 11217 1447 6 5 11220 first 64 bytes of the payload of the SSH traffic [8]. This implies that they have used the un-encrypted part of the payload, where the handshake for SSH takes place. On the other hand, Wright et al. achieved a 76% DR and 8% FPR using packet size, time and direction information only [6]. These results show that our proposed approach achieves better performance in terms of DR and FPR for SSH traffic than the above existing approaches in the literature. 6 Conclusions and Future Work In this work, we investigate the performance of two feature sets using C4.5 and RIPPER learning algorithms for classifying SSH traffic from a given traffic trace. To do so, we employ data sets generated at our lab as well as employing traffic traces captured on our Campus network. We tested the aforementioned learning algorithms using packet header and flow based features. We have employed WEKA (with default settings) for both algorithms. Results show that feature set based on packet header is compatible with the statistical flow based feature set. Moreover, C4.5 based classifier performs better than RIPPER on the above data sets. C4.5 can achieve a 99% DR and less than 0.5% FPR at its test performance to detect SSH traffic. It should be noted again that in this work, the objective of automatically identifying SSH traffic from a given network trace is performed without using any payload, IP addresses or port numbers. This shows that the feature sets proposed in this work are both sufficient to classify any encrypted traffic since no payload or other biased features are employed. Our results are encouraging to further explore the packet header based features. Given that such an approach requires less computational cost and can be employed on-line. Future work will follow similar lines to perform more tests on different data sets in order to continue to test the robustness and adaptability of the classifiers and the feature sets. We are also interested in defining a framework for generating good training data sets. Furthermore, investigating our approach for other encrypted applications such as VPN and Skype traffic is some of the future directions that we want to pursue. 210 R. Alshammari and A.N. Zincir-Heywood Acknowledgments. 
This work was in part supported by MITACS, NSERC and the CFI new opportunities program. Our thanks to John Sherwood, David Green and Dalhousie UCIS team for providing us the anonymozied Dalhousie traffic traces. All research was conducted at the Dalhousie Faculty of Computer Science NIMS Laboratory, http://www.cs.dal.ca/projectx. References [1] Alpaydin, E.: Introduction to Machine Learning. MIT Press, Cambridge; ISBN: 0- 26201211-1 [2] Zhang, Y., Paxson, V.: Detecting back doors. In: Proceedings of the 9th USENIX Security Symposium, pp. 157–170 (2000) [3] Early, J., Brodley, C., Rosenberg, C.: Behavioral authentication of server flows. In: Proceedings of the ACSAC, pp. 46–55 (2003) [4] Moore, A.W., Zuev, D.: Internet Traffic Classification Using Bayesian Analysis Techniques. In: Proceedings of the ACM SIGMETRICS, pp. 50–60 (2005) [5] Karagiannis, T., Papagiannaki, K., Faloutsos, M.: BLINC: Multilevel Traffic Classification in the Dark. In: Proceedings of the ACM SIGCOMM, pp. 229–240 (2006) [6] Wright, C.V., Monrose, F., Masson, G.M.: On Inferring Application Protocol Behaviors in Encrypted Network Traffic. Journal of Machine Learning Research 7, 2745–2769 (2006) [7] Bernaille, L., Teixeira, R.: Early Recognition of Encrypted Applications. In: Passive and Active Measurement Conference (PAM), Louvain-la-neuve, Belgium (April 2007) [8] Haffner, P., Sen, S., Spatscheck, O., Wang, D.: ACAS: Automated Construction of Application Signatures. In: Proceedings of the ACM SIGCOMM, pp. 197–202 (2005) [9] Williams, N., Zander, S., Armitage, G.: A Prelimenary Performance Comparison of Five Machine Learning Algorithms for Practical IP Traffic Flow Comparison. ACM SIGCOMM Computer Communication Review 36(5), 7–15 (2006) [10] Alshammari, R., Zincir-Heywood, A.N.: A flow based approach for SSH traffic detection, IEEE SMC, pp. 296–301 (2007) [11] NetMate (last accessed, January 2008), http://www.ip-measurement.org/tools/netmate/ [12] IETF (last accessed January 2008), http://www3.ietf.org/proceedings/97apr/ 97apr-final/xrtftr70.htm [13] PacketShaper (last accessed, January 2008), http://www.packeteer.com/products/packetshaper/ [14] Weiss, G.M., Provost, F.J.: Learning When Training Data are Costly: The Effect of Class Distribution on Tree Induction. Journal of Artificial Intelligence Research 19, 315–354 (2003) [15] WEKA Software (last accessed, January 2008), http://www.cs.waikato.ac.nz/ml/weka/ Dynamic Scheme for Packet Classification Using Splay Trees Nizar Ben-Neji and Adel Bouhoula Higher School of Communications of Tunis (Sup’Com) University November 7th at Carthage City of Communications Technologies, 2083 El Ghazala, Ariana, Tunisia nizar.bennaji@certification.tn, bouhoula@supcom.rnu.tn Abstract. Many researches are about optimizing schemes for packet classification and matching filters to increase the performance of many network devices such as firewalls and QoS routers. Most of the proposed algorithms do not process dynamically the packets and give no specific interest in the skewness of the traffic. In this paper, we conceive a set of self-adjusting tree filters by combining the scheme of binary search on prefix length with the splay tree model. Hence, we have at most 2 hash accesses per filter for consecutive values. Then, we use the splaying technique to optimize the early rejection of unwanted flows, which is important for many filtering devices such as firewalls. Thus, to reject a packet, we have at most 2 hash accesses per filter and at least only one. 
Keywords: Packet Classification, Binary Search on Prefix Length, Splay Tree, Early Rejection. 1 Introduction In the packet classification problems we wish to classify incoming packets into classes based on predefined rules. Classes are defined by rules composed of multiple header fields, mainly source and destination IP addresses, source and destination port numbers, and a protocol type. On one hand, packet classifiers must be constantly optimized to cope with the network traffic demands. On the other hand, few of proposed algorithms process dynamically the packets and the lack of dynamic packet filtering solutions has been the motivation for this research. Our study shows that the use of a dynamic data structure is the best solution to take into consideration the skewness in the traffic distribution. In this case, in order to achieve this goal, we adapt the splay tree data structure to the binary search on prefix length algorithm. Hence, we have conceived a set of dynamic filters for each packet header-field to minimize the average matching time. On the other hand, discarded packets represent the most important part of the traffic treated then reject by a firewall. So, those packets might cause more harm than others if they are rejected by the default-deny rule as they traverse a long matching path. Therefore, we use the technique of splaying to reject the maximum number of unwanted packets as early as possible. This paper is organized as follows. In Section 2 we describe the previously published related work. In Section 3 we present the proposed techniques used to perform E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 211–218, 2009. springerlink.com © Springer-Verlag Berlin Heidelberg 2009 212 N. Ben-Neji and A. Bouhoula the binary search on prefix length algorithm. In Section 4 we illustrate theoretical analysis of the proposed work. At the end, in Section 5 we present the conclusion and our plans for future work. 2 Previous Work Since our proposed work in this paper applies binary search on prefix length with splay trees, we describe the binary search on prefix length algorithm in detail, then we present a previous dynamic packet classification technique using splay trees called Splay Tree based Packet Classification (ST-PC). After that, we present an early rejection technique for maximizing the rejection of unwanted packets. Table 1. Example Rule Set Rule no. R1 R2 R3 R4 R5 R6 R7 R8 R9 Src Prefix 01001* 01001* 010* 0001* 1011* 1011* 1010* 110* * Dst Prefix 000111* 00001* 000* 0011* 11010* 110000* 110* 1010* * Src Port * * * * * * * * * Dst Port 80 80 443 443 80 80 443 443 * Proto. TCP TCP TCP TCP UDP UDP UDP UDP * 2.1 Binary Search on Prefix Length Waldvogel et al. [1] have proposed the IP lookup scheme based on binary search on prefix length Fig.1. Their scheme performs a binary search on hash tables that are organized by prefix length. Each hash table in their scheme contains prefixes of the same length together with markers for longer-length prefixes. In that case, IP Lookup can be done with O(log(Ldis)) hash table searches, where Ldis is the number of distinct prefix lengths and Ldis<W-1 where W is the maximum possible length, in bits, of a prefix in the filter table. Note that W=32 for IPv4, W=128 for IPv6. Many works were proposed to perform this scheme. For instance, Srinivasan and Varghese [3] and Kim and Sahni [4] have proposed ways to improve the performance of the binary search on lengths scheme by using prefix expansion to reduce the value of Ldis. 
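To make the lookup concrete, the following is a minimal Python sketch of the scheme described above: prefixes are grouped into hash tables by length, markers stand in for longer prefixes, and a binary search over the sorted distinct lengths performs one hash probe per step. It is an illustration under simplifying assumptions rather than Waldvogel's exact construction — markers are placed at every shorter distinct length instead of only on the binary-search path, and each marker stores the precomputed best matching prefix of its own bit string. The example reuses the destination prefixes of Table 1 (the default prefix * is handled implicitly by returning None).

```python
from typing import Dict, Optional

def build_filter(prefixes):
    """Build one hash table per distinct prefix length. Each entry maps a
    bit string to the best matching real prefix (bmp) of that string: a
    real prefix maps to itself, a marker to the longest real prefix of its
    own string (or None if it has none)."""
    lengths = sorted({len(p) for p in prefixes})
    tables: Dict[int, Dict[str, Optional[str]]] = {l: {} for l in lengths}

    def bmp(bits):
        cands = [p for p in prefixes if bits.startswith(p)]
        return max(cands, key=len) if cands else None

    for p in prefixes:                      # real prefixes
        tables[len(p)][p] = p
    for p in prefixes:                      # markers for longer prefixes
        for l in lengths:
            if l < len(p):
                tables[l].setdefault(p[:l], bmp(p[:l]))
    return lengths, tables

def longest_match(lengths, tables, addr_bits):
    """Binary search on the sorted distinct prefix lengths; one hash probe
    per step. A hit (prefix or marker) sends the search to longer lengths,
    a miss to shorter ones."""
    best, lo, hi = None, 0, len(lengths) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        key = addr_bits[:lengths[mid]]
        if key in tables[lengths[mid]]:
            if tables[lengths[mid]][key] is not None:
                best = tables[lengths[mid]][key]   # best real prefix so far
            lo = mid + 1
        else:
            hi = mid - 1
    return best

# Destination prefixes of Table 1 (without the default prefix "*").
dst = ["000111", "00001", "000", "0011", "11010", "110000", "110", "1010"]
L, T = build_filter(dst)
print(longest_match(L, T, "00011101"))      # -> '000111'
```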
On the other hand, an asymmetric binary search tree was proposed to reduce the average number of hash computations. This tree basically inserts values of higher occurrence probability (matching frequency) at higher tree levels then the values of less probability. In fact, we have to rebuild periodically the search tree based on the traffic characteristics. Also, a rope search algorithm was proposed to reduce the average number of hash computations but it increases the rebuild time of the search tree because it use precomputation to accomplish this goal. That is why we have O(NW) time complexity when we rebuild the tree after a rule insertion or deletion according to the policy. Dynamic Scheme for Packet Classification Using Splay Trees 213 Fig. 1. This shows a binary tree for the destination prefix field of Table 1 and the access order performing the binary search on prefix lengths proposed by Waldvogel [1]. (M: Marker) 2.2 Splay Tree Packet Classification Technique (ST-PC) The idea of the Splay Tree Packet Classification Technique [2] is to convert the set of prefixes into integer ranges as shown in Table 2 then we put all the lower and upper values of the ranges into a splay tree data structure. On the other hand, we need to store in each node of the data structure all the matching rules as shown in Fig.2. The same procedure is repeated for the filters of the other packet's header fields. Finally, all the splay trees are to be linked to each other to determine the corresponding action of the incoming packets. When we look for the list of the matching rules, we first convert the values of the packet’s header fields to integers then we find separately the matching rules for each field and finally we select a final set of matching rules for the incoming packet. Each newly accessed node has to become at root by the property of a splay tree (Fig.2). Table 2. Destination Prefix Conversion Rule no. R1 R2 R3 R4 R5 R6 R7 R8 R9 Dst Prefix 000111* 00001* 000* 0011* 11010* 110000* 110* 1010* * Lower Bound 00011100 00001000 00000000 00110000 11010000 11000000 11000000 10100000 00000000 Upper Bound 00011111 00001111 00011111 00111111 11010111 11000011 11011111 10101111 11111111 Start 28 8 0 48 208 192 192 160 0 Finish 31 15 31 63 215 195 223 175 255 2.3 Early Rejection Rules The technique of early rejection rules was proposed by Adel El-Atawy et al. [5], using an approximation algorithm that analyzes the firewall policy in order to construct a set of early rejection rules. This set can reject the maximum number of unwanted packets as early as possible. Unfortunately, the construction of the optimal set of rejection 214 N. Ben-Neji and A. Bouhoula Fig. 2. The figure (a) shows the destination splay tree constructed as shown in Table 2 with the corresponding matching rules. The newly accessed node becomes the root as shown in (b). rules is an NP-complete problem and adding them may increase the size of matching rules list. Early rejection works periodically by building a list of most frequently hit rejection rules. And then, starts comparing the incoming packets to that list prior to trying normal packet filters. 3 Proposed Filter Our basic scheme consists of a set of self-adjusting filters (SA-BSPL: Self-Adjusting Binary Search on Prefix Length). Our proposed filter can be applied to filter every packet's header field. Accordingly, it can easily assure exact matching for protocol field, prefix matching for IP addresses, and range matching for port numbers. 
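Since the per-field filters must support prefix matching for addresses and range matching for ports, the range view of a prefix used by ST-PC (Table 2) is worth making explicit. The sketch below is illustrative only and assumes 8-bit values, as in the Table 2 example.

```python
def prefix_to_range(prefix, width=8):
    """Convert a bit-string prefix such as "000111*" into the inclusive
    integer range it covers, as in Table 2 ('*' alone is the default prefix)."""
    bits = prefix.rstrip("*")
    free = width - len(bits)
    low = (int(bits, 2) << free) if bits else 0
    high = low + (1 << free) - 1
    return low, high

# Reproduces the Table 2 bounds for R1 and for the default prefix R9.
print(prefix_to_range("000111*"))  # (28, 31)
print(prefix_to_range("*"))        # (0, 255)
```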
3.1 Splay Tree Filters Our work performs the scheme of Waldvogel et al [1] by conceiving splay tree filters to ameliorate the global average search time. A filter consists of a collection of hash tables and a splay tree with no need to represent the default prefix (with zero length). Prefixes are grouped by lengths in hash tables (Fig.3). Also, each hash table is augmented with markers for longer length prefixes (Fig.1 shows an example). We still use binary search on prefix length but with splaying operations to push every newly accessed node to the top of the tree and because of the search step starts at root and ends at the leaves of the search tree as described in [1], we have to splay also the successor of the best length to the root.right position (Fig4-a). Consequently, the tree is adequately adjusted to have at most 2 hash accesses for all repeated values. Fig. 3. This figure shows the collection of hash tables of the destination address filter according to the example in Table 1. We denote by M the marker and by P the prefix. Dynamic Scheme for Packet Classification Using Splay Trees 215 Fig. 4. This figure shows the operations of splaying to early accept the repeated packets (a) and to early reject the unwanted ones (b). Note that x+ is the successor of x, and M is the minimum item in the tree. Fig. 5. This figure shows the alternative operations of splaying when x+ appears after x in the search path (a) or before x (b) Therefore, we start the search from the root node and if we get a match, we update the best length value and the list of matching rules then we go for higher lengths and if nothing is matched, we go for lower lengths (Fig 6). We stop the search process if a leaf is met. After that, the best length value and its successor have to be splayed to the top of the tree (Fig.4-a). We can go faster since we use the top-down splay tree model because we are able to combine searching and splaying steps together. We have also conceived an alternative implementation of splaying that have slightly better amortized time bound. We look for x and x+, then we splay x or x+ until we obtain x+ as a right child of x. Then, these two nodes have to be splayed to the top of the tree, as a single node (Fig.5). Hence, we have a better amortized cost than given in Fig.4-a. 3.2 Early Rejection Technique In this section, we focus on optimizing matching of traffic discarded due to the default-deny rule because it has more profound effect on the performance of the firewalls. In our case, we have no need to represent the default prefix (with zero length) so if a packet don’t match any length it will be automatically rejected by the Min-node. Generally, rejected packets might traverse long decision path of rule matching before they are finally rejected by the default-deny rule. The left child of the Min-node is Null, hence if a packet doesn’t match the Min-node we go to its left child which is Null, so it means that this node is the end of the search path. 216 N. Ben-Neji and A. Bouhoula Fig. 6. The algorithm of binary search on prefix length combined with splaying operations In our case, in each filter, we have to traverse the entire tree until we arrive to the node with the minimum value. Subsequently, a packet might traverse a long path in the search tree before it is rejected by the Min-node. Hence, we have to rotate always the Min-node in the upper levels of the self-adjusting tree. We have to splay the Minnode to the root.left position as shown in (Fig.4-b). 
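The search loop of Fig. 6 can be sketched as follows; this is a simplified illustration under our own assumptions, where `tables` maps each prefix length to its hash table as in Fig. 3, the nodes form a binary search tree over the distinct prefix lengths, and `splay` stands for an available top-down splay routine, rather than the authors' implementation.

```python
# Simplified sketch of the search of Fig. 6 (illustration only).

class Node:
    def __init__(self, length, left=None, right=None):
        self.length, self.left, self.right = length, left, right

def search_filter(root, addr_bits, tables, splay):
    node, best = root, None
    while node is not None:                       # stop once a leaf has been passed
        if addr_bits[:node.length] in tables[node.length]:
            best = node.length                    # match or marker: try longer lengths
            node = node.right
        else:
            node = node.left                      # miss: try shorter lengths
    if best is not None:
        root = splay(root, best)                  # best length to the root; its successor
                                                  # goes to root.right (Fig. 4-a), while
                                                  # early rejection keeps the minimum
                                                  # length at root.left (Fig. 4-b)
    return best, root
```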
In our case, we can use either bottom-up or top-down splay trees. However, the top-down splay tree is much more efficient for the early rejection technique because we are able to keep the Min-node fixed in the desired position while searching for the best matching value.

4 Complexity Analysis

In this section, we first give the complexity analysis of the operations used to adapt the splay tree data structure to the binary search on length algorithm, and we also give the per-filter cost of our early rejection technique.

4.1 Amortized Analysis

On an n-node splay tree, all the standard search tree operations have an amortized time bound of O(log n) per operation. According to the analysis of Sleator and Tarjan [6], if each item of the splay tree is given a weight $w_x$, with $W_t$ denoting the sum of the weights in the tree $t$, then the amortized cost to access an item $x$ has the following upper bounds (let $x^+$ denote the item following $x$ in the tree $t$ and $x^-$ the item preceding $x$):

$3\log(W_t/w_x) + O(1)$, if $x$ is in the tree $t$ . (0)

$3\log\bigl(W_t/\min(w_{x^-}, w_{x^+})\bigr) + O(1)$, if $x$ is not in the tree $t$ . (1)

Dynamic Scheme for Packet Classification Using Splay Trees 217 In our case, we have to rotate the best length value to the root position and its successor to the root.right position (Fig. 4-a). These two operations have a logarithmic amortized complexity, and their time cost is calculated using (0) and (1):

$3\log(W_t/w_x) + 3\log\bigl((W_t - w_x)/w_{x^+}\bigr) + O(1)$ . (2)

With the alternative optimized implementation of splaying (Fig. 5) we obtain a slightly better amortized time bound than the one given in (2):

$3\log\bigl((W_t - w_x)/\min(w_x, w_{x^+})\bigr) + O(1)$ . (3)

On the other hand, the cost of the early rejection step is $3\log\bigl((W_t - w_x)/w_{\min}\bigr) + O(1)$, where $w_{\min}$ is the weight of the Min-node. With the proposed early rejection technique, we have at most 2 hash accesses before we reject a packet, and at least one. If we have m values to be rejected by the default-deny rule, the search time will be at most m+1 hash accesses per filter and at least m hash accesses.

4.2 Number of Nodes

The time complexity is related to the number of nodes. For the ST-PC scheme, in the worst case, if we assume that all the prefixes are distinct, then we have at most 2W nodes per tree, where W is the length of the longest prefix. Besides, if all bounds are distinct, then we have 2r nodes, where r is the number of rules. So, the actual number of nodes in ST-PC in the worst case is the minimum of these two values. Our self-adjusting tree structure is based on binary search on hash tables organized by prefix length. Hence, the number of nodes in our case is equal to the number of hash tables. If we denote by W the length of the longest prefix, we have at most W nodes. Since the number of nodes in our self-adjusting tree is smaller than in ST-PC, especially for a large number of rules (Fig. 7), our scheme is much more competitive in terms of time and scalability.

Fig. 7. (a) This figure shows the distribution of the number of nodes with respect to W and r, where W is the maximum possible length in bits and r the number of rules 218 N. Ben-Neji and A. Bouhoula

5 Conclusion

The packet classification optimization problem has received the attention of the research community for many years. Nevertheless, there is a manifest need for new innovative directions to enable filtering devices such as firewalls to keep up with high-speed networking demands.
In our work, we have suggested a dynamic scheme, based on using collection of hash tables and splay trees for filtering each header field. We have also proved that our scheme is suitable to take advantage of locality in the incoming requests. Locality in this context is a tendency to look for the same element multiple times. Finally, we have reached this goal with a logarithmic time cost, and in our future works, we wish to optimize data storage and the other performance aspects. References 1. Waldvogel, M., Varghese, G., Turner, J., Plattner, B.: Scalable High-Speed IP Routing Lookups. ACM SIGCOMM Comput. Commu. Review. 27(4), 25–36 (1997) 2. Srinivasan, T., Nivedita, M., Mahadevan, V.: Efficient Packet Classification Using Splay Tree Models. IJCSNS International Journal of Computer Science and Network Security 6(5), 28–35 (2006) 3. Srinivasan, V., Varghese, G.: Fast Address Lookups using Controlled Prefix Expansion. ACM Trans. Comput. Syst. 17(1), 1–40 (1999) 4. Kim, K., Sahni, S.: IP Lookup by Binary Search on Prefix Length. J. Interconnect. Netw. 3, 105–128 (2002) 5. Hamed, H., El-Atawy, A., Al-Shaer, E.: Adaptive Statistical Optimization Techniques for Firewall Packet Filtering. In: IEEE INFOCOM, Barcelona, Spain, pp. 1–12 (2006) 6. Sleator, D., Tarjan, R.: Self Adjusting Binary Search Trees. Journal of the ACM 32, 652– 686 (1985) A Novel Algorithm for Freeing Network from Points of Failure Rahul Gupta and Suneeta Agarwal Department of Computer Science and Engineering, Motilal Nehru National Institute of Technology, Allahabad, India rahulgupta_mnnit@yahoo.co.in, suneeta@mnnit.ac.in Abstract. A network design may have many points of failure, the failure of any of which breaks up the network into two or more parts, thereby disrupting the communication between the nodes. This paper presents a heuristic for making an existing network more reliable by adding new communication links between certain nodes. The algorithm ensures the absence of any point of failure after addition of addition of minimal number of communication links determined by the algorithm. The paper further presents theoretical proofs and results which prove the minimality of the number of new links added in the network. Keywords: Points of Failure, Network Management, Safe Network Component, Connected Network, Reliable Network. 1 Introduction A network consists of number of interconnected nodes communicating among each other through communication channels between them. A wired communication link between two nodes is more reliable [1]. Various topology designs have been proposed for various network protocols and applications [2][3] such as bus topology, star topology, ring topology and mesh topology. All these network designs leave certain nodes as failure points [4][5][6]. These nodes become very important and must remain working all the time. If one of these nodes is down for any reason, it breaks the network into segments and the communication among the nodes in different segments is disrupted. Hence these nodes make the network unreliable. In this paper, we have designed a heuristic which has the capability to handle a single failure of node. The algorithm adds minimal number of new communication links between the nodes so that a single node failure does not disrupt communication among communicating nodes. 2 Basic Outline The various network designs common in use are ring topology, star topology, mesh topology, bus topology [1][4][5]. All these topology designs have their own advantages and disadvantages. 
Ring topology does not contain any points of failure. Bus topology on the other hand, has many points of failure. Star topology contains a single E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 219–226, 2009. © Springer-Verlag Berlin Heidelberg 2009 springerlink.com 220 R. Gupta and S. Agarwal point of failure, the failure of which disrupts communication between any pair of communicating nodes. Points marked P in figure 2 are the points of failure in the network design. Star topology and bus topology are least reliable from the point of view of failure of a single communication node in the network. In star topology, there is always one point of failure, the failure of which breaks the communication between all pairs of nodes and no nodes can communicate further. In a network of n nodes connected by bus topology, there are (n-2) points of failure. Ring topology is most advantageous and has no points of failure. For a reliable network, there must be no point of failure in the network design. These points of failure can be made safe by adding new communication links between nodes in the network. In this paper, we have presented an algorithm which finds the points of failure in a given network design. The paper further presents a heuristic which adds minimal number of communication links in the network to make it reliable. This ensures the removal of points of failure with least possible cost. 3 Algorithm for Making Network Reliable In this paper, we have designed an algorithm to find the points of failure in the network and an algorithm for converting these failure points into non failure points by the addition of minimal number of communication links. 3.1 New Terms Coined We have coined the following terms which aid in the algorithm development and network design understanding. N – Nodes of the Network E – Links in the Network P – Set of Points of Failure S – Set of Safe Network Components Pi – Point of Failure Si – Safe Network Component Si(a) – Safe Network Component Attached to the Failure Point ‘a’ B – Set of all Safe Network Components each having a Single Point of Failure in the Original Network Bi – A Safe Network Component having a Single Point of Failure in the Original Network |B| - Cardinality of Set B Fi – Point of Failure Corresponding to the Original Network in the member ‘Bi’ NFi – Non Failure Point corresponding to the Original Network in the member ‘Bi’ L – Set of New Communication Links Added Li – A New Communication Link C – Matrix List for the Components Reachable dfn(i) – Depth First Number of the node ‘i’ low(i) – Lowest Depth First Number of the Node Reachable from ‘i’. A Novel Algorithm for Freeing Network from Points of Failure 221 The points of failure are the nodes in the network, the failure of any of which breaks the network into isolated segments which can not have any communication among each other. A safe network component is the maximal subset of the connected nodes from the complete network which do not contain any point of failure. The safe component can handle a single failure occurring at any of its node within the subset. We have developed an algorithm which finds the minimal number of communication links to be added to the network to make the network capable of handling a single failure of any node. A safe component may have more than one point of failure in the original network. The algorithm considers the components having only a single point of failure differently. 
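Before describing the construction, the detection of the points of failure themselves can be illustrated by the textbook depth-first-search computation of dfn and low (articulation points) [7], [8]; the sketch below is our own, written under the assumption that the connected network is given as an adjacency dictionary, and it is not the authors' exact procedure.

```python
# dfn/low computation used to locate the points of failure (articulation points).

def points_of_failure(adj, start):
    dfn, low, parent, failures = {}, {}, {start: None}, set()
    counter = [0]

    def dfs(u):
        dfn[u] = low[u] = counter[0]
        counter[0] += 1
        children = 0
        for v in adj[u]:
            if v not in dfn:                      # tree edge
                parent[v] = u
                children += 1
                dfs(v)
                low[u] = min(low[u], low[v])
                # u separates v's subtree unless a back edge climbs above u
                if parent[u] is not None and low[v] >= dfn[u]:
                    failures.add(u)
            elif v != parent.get(u):              # back edge
                low[u] = min(low[u], dfn[v])
        if parent[u] is None and children >= 2:   # the DFS root is a failure point
            failures.add(u)                       # iff it has two or more children

    dfs(start)
    return failures
```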
‘B’ is the set of all safe components having only a single point of failure in the original network. ‘Fi’ is the point of failure in the original network design. ‘C’ corresponds to the matrix having the reachable components. Each Row in the matrix corresponds to the components reachable through one outgoing link from the point of failure. All the components having single point of failure and occurring on one outgoing link corresponds to the representatives in each row. The new communication links added in the network are collected in the set ‘L’. The set contains the pairs of nodes between which links must be added to make network free of points of failure. Fig. 1. (a) An example network design, (b) Safe components in the design 3.2 Algorithm for Finding Points of Failure and Safe Components To find all the points of failure in the network, we use depth first search [7][8] technique starting from any node in the network. Nodes that are reachable through more than one path become part of the safe component and the ones which are connected through only one path are vulnerable and the communication can get disrupted because of any one node in the single path of communication available for the node. The network is represented by a matrix of nodes connected to each other with edges representing the communication links. Each node of the network is numbered sequentially in the depth first search order starting from 0. This forms the dfn of each node. The unmarked nodes reachable from a node are called the child of each node and the node itself becomes the parent of those child nodes. The algorithm finds the low of each node and the points of failure in the network design and all the safe components in the network. The algorithm finds the safe components and all points of failure in the network. The starting node is a pint of failure if some unmarked nodes remain even on fully exploring any one single path from the node. 222 R. Gupta and S. Agarwal 3.3 Algorithm for Finding Points of Failure and Safe Components In this section, we describe our algorithm for the conversion of points of failure into non failure points by the addition of new communication links. The algorithm adds minimal number of new links which ensures least possible cost to make the network reliable. The algorithm is based on the concept that the safe components having more than one point of failure are necessarily connected to a safe component having only one point of failure directly or indirectly. Thus this component can become a part of larger safe component through more than one path which originates from any of the points of failure in the original network present in the component. Thus if the component having only one point of failure in the original network is made a part of larger safe component, the component having more than one point of failure is made safe itself. The algorithm finds new links to be added for making the safe component larger and larger and thus finally including all the nodes of the network making the complete network safe. When the maximal component that is safe consists of all the nodes of the network, the whole network is made safe and all points of failure are removed. The following steps are followed in order. 1. Initially the set L = ∅ is taken. 2. P, the set of points of failure is found using algorithm described in section 3.2. The algorithm also finds all safe components of the network and adds them to the set S. Each of the Si has a copy of failure point within it. 
Hence, the failure points are replicated in each component. A Novel Algorithm for Freeing Network from Points of Failure 223 3. Find the subset B of safe components having only single point of failure in the original network by using set S and set P found in step 2. Let each of these component members be B1, B2, B3,…. , Bk. These Bi`s are mutually disjoint with respect to non failure points. 4. Each of the components Bi has at least one non failure point. Any non failure point node is named as NFi and taken as the representative of the component Bi. 5. The failure point present in maximum number of safe components is chosen i.e, the node, the failure of which creates maximum number of safe components is chosen. Let it be named ‘s’. 6. Let S1(s), S2(s), S3(s),…. Sm(s) be all the safe components having the failure point ‘s’. Each of these components may have one or more points of failure corresponding to the original network. If the component has more than one point of failure, other safe components are reachable from these safe components through points of failure other than ‘s’. 7. Now we create the lists of components reachable from point of failure‘s’. For each Sj(s), j=1, 2,… m, if the component contains only one point of failure, add the representative of this component to the list as the next row element and if the component contains more than one point of failure, then the reachable safe components having only one point of failure are taken and their representatives are added to the list C. These components are found by going using depth search from this component. All the components that are reachable from the same component are considered for the same row and their corresponding representatives are added in the same row in the matrix C. The number of elements in each row of matrix C corresponds to the number of components that are reachable from the point of failure ‘s’ through that one outgoing link. It is to be noted here that the components having one point of failure only are considered for the algorithm. Now we have a row for each Sj(s), j=1, 2,… m. Thus the number of rows in matrix C is m. 8. The number of elements in each row of matrix C corresponds to the number of components that are reachable from the point of failure ‘s’ through that one outgoing link. It is to be noted that each component is represented just by a non failure point representative. Arrange the matrix rows in non decreasing order based on the size of the row i.e, on the basis of the number of elements in each row. 9. If all Ci(s) `s are of size 1, pair the only member of each row with the only member of next row. Here pairing means adding a communication link between the non failure point members acting as representatives of their corresponding components. Thus giving (k-1) new communication links to be added to the network for ‘k’ members. Add all these edges to set L, the set of all new communication links and exit from the algorithm. If the size of some Ci(s) `s is greater than 1, start with the last list Ci ( the list of the maximum size). For every k>=2, pair the kth element of this row with the (k-1) th element of the preceding row (if it exists). Here again pairing means addition of a communication link between the representative nodes. Remove these paired up elements from the lists and the lists are contracted. 10.Now if more than one element is left in the second last list, shift the last element from this list to the last list and append to the last list. 
11.If the number of non empty lists is greater than one, go to step 8 for further processing. If the size of the last and the only left row is one, pair its only member with any of the non failure points in the network and exit from the algorithm. If the 224 R. Gupta and S. Agarwal last and the only row left have only two elements left in it, then pair the two representatives and exit from the algorithm. If the size of the last and the only left row is greater than two, add the edges from set L into the network design and repeat the algorithm from step 2 on updated network design. Since in every iteration of the algorithm at least one communication link is added to set L and only finite number of edges are added, the algorithm will terminate in finite number of steps. The algorithm ensures that there are at least two paths between any pair of nodes in the network. Thus, because of multiple paths of communication between any pair of nodes, the failure of any one of the node does not effect the communication between any other pair of nodes in the network. Thus the algorithm makes the points of failure in the original network safe by adding minimal number of communication links. 4 Theoretical Results and Proofs In this section, we describe the theoretical proofs for the correctness of the algorithm and sufficiency of the number of the new communication links added. Further, the lower and upper bounds on the number of links added to the network are proved. Theorem 1. If | B | = k, i.e., there are only k safe network components having only one point of failure in the original network, then the number of new edges necessary to make all points of failure safe varies between ⎡k/2⎤ and (k-1) both inclusive. Proof: Each safe component Bi has only point of failure corresponding to the original network. Failure of this node will separate the whole component Bi from remaining part of the network. Thus, for having communication from any node of this component Bi with any other node outside of Bi, at least one extra communication link is required to be added with this component. This argument is valid for each Bi. Thus at least one extra edge is to be added from each of the component Bi. This needs at least ⎡k/2⎤ extra links to be added each being incident on a distinct pair of Bi’s. This forms the lower bound on the number of links to be added to make the points of failure safe in the network design. Fig. 2. (a) and (b) Two Sample Network Designs In figure 2(a), there are k = 6 safe components each having only one point of failure and thus requiring k/2 = 3 new links to be added to make all the points of failure safe. It is easy to see that k/2 = 3 new links are sufficient to make the network failure free. A Novel Algorithm for Freeing Network from Points of Failure 225 Now, we consider the upper bound on the number of new communication links to be added to the network. This occurs when | B | = | S | = k, i.e, when each safe components in the network contain only one point of failure. Since, there is no safe component which can become safe through more than one path. Thus all the safe components are to be considered by the algorithm. Thus, it requires the addition of (k1) new communication links to join ‘k’ safe components. Theorem 2. If the edges determined by the algorithm are added to the network, the nodes will keep on communicating even after the failure of any single node in the network. Proof: We arbitrarily take 2 nodes ‘x’ and ‘y’ from the set ‘N’ of the network. 
Now we show that ‘x’ and ‘y’ can communicate even after the failure of any single node from the network. CASE 1: If the node that fails is not a point of failure, ‘x’ and ‘y’ can continue to communicate with each other. CASE 2: If the node that fails is a point of failure and both ‘x’ and ‘y’ are in the same safe component of the network, then by the definition of safe component ‘x’ and ‘y’ can still communicate because the failure of this node has no effect on the nodes that are in the same safe component. CASE 3: If the node that fails is a point of failure and ‘x’ and ‘y’ are in different safe components and ‘x’ and ‘y’ both are members of safe components in set ‘B’. We know that the algorithm makes all members of set ‘B’ safe by using only non failure points of each component so the failure of any point of failure will not effect the communication of any node member of the safe component formed. This is because the algorithm has already created an alternate path for each of the node in any of the safe member. CASE 4: If the node that fails is a point of failure and ‘x’ and ‘y’ are in different safe components and ‘x’ is a member of component belonging to set ‘B’ and ‘y’ a member of component belonging to set ‘(S-B)’. Now we know that any node occurring in any member of set ‘(S-B)’ is connected to at least 2 points of failure in the safe component and through each of these points of failure we can reach to a member of set ‘B’. So even after deletion of any point of failure, ‘y’ will remain connected with at least one member of set B. The algorithm has already connected all the members of set ‘B’ by finding new communication links, hence ‘x’ and ‘y’ can still communicate with each other. CASE 5: If the node that fails is a point of failure and ‘x’ and ‘y’ are in different safe components and both ‘x’ and ‘y’ belong to components that are members of set ‘(S-B)’. Now each member of set ‘(S-B)’ has at least 2 points of failure. So after the failure of any one of the failure point, ‘x’ can send message to at least one component that is a member of set ‘B’. Similarly, ‘y’ can send message to at least one component that is a member of set ‘B’. Now, the algorithm has already connected all the components belonging to set ‘B’, so ‘x’ and ‘y’ can continue to communicate with each other after the failure of any one node. After the addition of links determined by the algorithm, there exist multiple paths of communication between any pair of communicating nodes. Thus, no node is dependent on just one path. 226 R. Gupta and S. Agarwal Theorem 3. The algorithm provides the minimal number of new communication links to be added to the network to make it capable of handling any single failure. Proof: The algorithm considers only the components having a single point of failure corresponding to the original network. Since | B | = k, thus it requires at least ⎡k/2⎤ new communication links to be added to pair up these k components and making them capable of handling single failure of any node in the network. Thus adding less than ⎡k/2⎤ new communication links can never result in safe network. Thus the algorithm finds minimal number of new communication links as shown by the example discussed in theorem 1. In all the steps of the algorithm, except the last, only one link is added to join 2 members of set ‘B’ and these members are not further considered for the algorithm and hence do not generate any further edge in set ‘L’. 
In the last step, when only one vertical column of x rows with each row having single member is left, then (x-1) new links are added. These members have the property that only single point of failure ‘s’ can separate these into x disjoint groups, hence addition of (x-1) links is justified. When only single row of just one element is left, this can only be made safe by joining it with any one of the non failure nodes. Hence, the algorithm adds minimal number of new communication links to make the network. 5 Conclusion and Future Research This paper described an algorithm for making points of failure safe in the network. The new communication links determined by the algorithm are minimal and guarantees to make the network capable of handling a single failure of any node. The algorithm guarantees at least two paths of communication between any pair of nodes in the network. References 1. Tanenbaum, A.S.: Computer Networks, 4th edn. Pearson Education, London (2004) 2. Pearlman, R.: Interconnections: Bridges, Routers, Switches, and Internetworking Protocols, 2nd edn. Pearson Education, London (2006) 3. Kamiyana, N.: Network Topology Design Using Data Envelopment Analysis. In: IEEE Global Telecommunications Conference (2007) 4. Dengiz, B., Altiparmak, F., Smith, A.E.: Efficient optimization of all-terminal reliable networks, using an evolutionary approach. IEEE Transactions on Reliability 46(1), 18–26 (1997) 5. Mandal, S., Saha, D., Mukherjee, R., Roy, A.: An efficient algorithm for designing optimal backbone topology for a communication network. In: International Conference on Communication Technology, vol. 1, pp. 103–106 (2003) 6. Ray, G.A., Dunsmore, J.J.: Reliability of network topologies. In: IEEE INFOCOM 1988 Networks, pp. 842–850 (1988) 7. Horowitz, E., Sahni, S., Anderson-Freed, S.: Fundamentals of Data Structures in C, 8th edn. Computer Science Press (1998) 8. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, 2nd edn. Prentice-Hall, India (2004) A Multi-biometric Verification System for the Privacy Protection of Iris Templates S. Cimato, M. Gamassi, V. Piuri, R. Sassi, and F. Scotti Dipartimento di Tecnologie dell’Informazione, Università di Milano, Via Bramante, 65 – 26013 Crema (CR), Italy {cimato,gamassi,piuri,sassi,fscotti}@dti.unimi.it Abstract. Biometric systems have been recently developed and used for authentication or identification in several scenarios, ranging from institutional purposes (border control) to commercial applications (point of sale). Two main issues are raised when such systems are applied: reliability and privacy for users. Multi-biometric systems, i.e. systems involving more than a biometric trait, increase the security of the system, but threaten users’ privacy, which are compelled to release an increased amount of sensible information. In this paper, we propose a multi-biometric system, which allows the extraction of secure identifiers and ensures that the stored information does not compromise the privacy of users’ biometrics. Furthermore, we show the practicality of our approach, by describing an effective construction, based on the combination of two iris templates and we present the resulting experimental data. 1 Introduction Nowadays, biometric systems are deployed in several commercial, institutional, and forensic applications as a tool for identification and authentication [1], [2]. 
The advantages of such systems over traditional authentication techniques, like the ones based on the possession (of a password or a token), come from the fact that identity is established on the basis of physical or behavioral characteristics of the subject taken into consideration and not on something he/she carries. In fact, biometrics cannot be lost or stolen, they are difficult to copy or reproduce, and in general they require the presence of the user when the biometric authentication procedure takes place. However, side to side with the widespread diffusion of biometrics an opposition grows towards the acceptance of the technology itself. Two main reasons might motivate such resistance: the reliability of a biometric system and the possible threatens to users’ privacy. In fact, a fault in a biometric system, due to a poor implementation or to an overestimation of its accuracy could lead to a security breach. Moreover since biometric traits are permanently associated to a person, releasing the biometric information acquired during the enrollment can be dangerous, since an impostor could reuse that information to break the biometric authentication process. For this reason, privacy agencies of many countries have ruled in favor of a legislation which limits the biometric information that can be centrally stored or carried on a personal ID. For example, templates, e.g. mathematical information derived from a fingerprint, are retained instead of the picture of the fingerprint itself. Also un-encrypted biometrics are discouraged. E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 227–234, 2009. springerlink.com © Springer-Verlag Berlin Heidelberg 2009 228 S. Cimato et al. A possible key to enhance the reliability of biometric systems might be that of simultaneously using different biometric traits. Such systems are termed in literature multi-biometric [3] and they usually rely on a combination of one of several of the followings: (i) multiple sensors, (ii) multiple acquisitions (e.g., different frames/poses of the face), (iii) multiple traits (e.g., an eye and a fingerprint), (iv) multiple instances of the same kind of trait (e.g., left eye, and right eye). As a rule of thumb, the performances of two of more biometric systems which each operate on a single trait might be enhanced when the same systems are organized in a single multimodal one. This is easy to understand if we refer to the risk of admitting an impostor: two or more different subsequent verifications are obviously more difficult to tamper with than a single one (AND configuration). But other less obvious advantages might occur. Population coverage might be increased, for example, in an OR configuration since some individuals could not have one biometric traits (illnesses, injuries, etc.). Or the global fault tolerance of the system might be enhanced in the same configuration, since, if one biometric subsystem is not working properly (e.g., a sensor problem occurred), the multimodal system can still keep working using the remaining biometric submodules. On the other hand, the usage of multimodal biometric systems has also some important drawbacks related to the higher cost of the systems, and user perception of larger invasiveness for his/her privacy. In the following, we will derive a multi-biometric authentication system which limits the threats posed to the privacy of users while still benefiting from the increase reliability of multiple biometrics. 
It was introduced in [4] and it is based on the secure sketch, a cryptographic primitive introduced by Dodis et al. in [5]. In fact, a main problem in using biometrics as cryptographic keys is their inherent variability in subsequent acquisitions. The secure sketch absorbs such variability to retrieve a fixed binary string from a set of similar biometric readings. In literature biometric authentication schemes based on secure sketches have been presented and applied to face and iris biometrics [6], [7]. Our proposal is generally applicable to a wider range of biometric traits and, compared to previous works, exploits multimodality in innovative way. In the following we describe the proposed construction and show its application to the case where two biometrics are used, the right and left iris. Iris templates are extracted from the iris images and used in the enrolment phase to generate a secure identifier, where the biometric information is protected and any malicious attempt to break the users’ privacy is prevented. 2 A Multimodal Sketch Based (MSB) Verification Scheme The MSB verification scheme we propose is composed of two basic modules: the first one (enroll module) creates an identifier (ID) for each user starting from the biometric samples. The ID can then be stored and must be provided during the verification phase. The second one, the (verification module) performs the verification process starting from the novel biometric readings and the information contained into the ID. Verification is successful if the biometric matching succeeds when comparing the novel reading with the stored biometrics, concealed into the ID. A Multi-biometric Verification System for the Privacy Protection of Iris Templates 229 2.1 Enrollment Module The general structure of the enroll module is depicted in Figure 1 in its basic configuration where the multimodality is restricted at two biometrics. The scheme can be generalized and we refer the reader to [5] for further details. First, two independent biometrics are acquired and processed with two feature extraction algorithms F1 and F2 to extract sets of biometric features. Each set of features is then collected into a template, a binary string. We refer to each template as I1 and I2. The feature extraction algorithms can be freely selected; they represent the single biometric systems which compose the multimodal one. Let us denote with ri the binary tolerable error rate of each biometric subsystem, i.e., the rate of bits in the templates which could be modified without affecting the biometric verification of the subject. The second biometric feature I2 is given as input to a pseudo random permutation block, which returns a bit string of the same length, having almost uniform distribution. δ I1 I2 Pseudo-Random Permutation Error Correction Encoding {H( ), δ} ID Hash Function H(I2) Fig. 1. The MSB Enroll Module The string is then encoded by using an error correcting code and the resulting codeword c is xored with the other biometric feature I1 to obtain δ. Given N1, the bitlength of I1, the code must be selected so that it corrects at most r1N1 single bit errors on codewords which are N1 bits long. Finally, I2 is given as input to a hash function and the digest H(I2), together with δ, and other additional information possibly needed (to invert the pseudo random permutation) are collected and published as the identifier of the enrolled person. 2.2 Verification Module Figure 2 shows the structure of the verification module. 
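Before walking through it, the essence of both modules can be condensed into the following sketch; `ecc_encode`/`ecc_decode`, `prp`/`prp_inverse` and `biometric_match` are assumed helpers standing in for the error-correcting code, the keyed pseudo-random permutation and the final matcher of Figs. 1 and 2, so this is an illustrative outline rather than the reference implementation (SHA-1 is the hash actually used later in Section 3.3).

```python
# Condensed sketch of the enroll (Fig. 1) and verification (Fig. 2) modules.
import hashlib, os

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))     # assumes equal lengths

def enroll(i1: bytes, i2: bytes, ecc_encode, prp) -> dict:
    """Build the public identifier ID = {H(I2), delta, permutation key}."""
    key = os.urandom(16)                          # key of the pseudo-random permutation
    c = ecc_encode(prp(i2, key))                  # permute I2, then encode it
    return {"h": hashlib.sha1(i2).digest(),
            "delta": xor(c, i1),                  # conceal the codeword with I1
            "key": key}

def verify(i1_new: bytes, i2_new: bytes, ident: dict,
           ecc_decode, prp_inverse, biometric_match) -> bool:
    c_corrupted = xor(i1_new, ident["delta"])     # differs from c in at most r1*N1 bits
    i2_recovered = prp_inverse(ecc_decode(c_corrupted), ident["key"])
    if hashlib.sha1(i2_recovered).digest() != ident["h"]:
        return False                              # first reading too far from enrollment
    return biometric_match(i2_recovered, i2_new)  # second, classical biometric matching
```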
Let us denote with I’1 and I’2 the biometric features freshly collected. The ID provided by the subject is split into δ, the hash H(I2) and the key needed to invert the pseudo random permutation. A corrupted version of the codeword c, concealed at enrollment, is retrieved by xoring the fresh reading I’1 with δ. Under the hypothesis that both readings I1 and I’1 belong to the same subject, the corrupted codeword c’ and c should differ for at most r1 bits. Thus the subsequent application of the error correcting decoding and of the inverse pseudo random permutation, should allow the exact reconstruction of the original reading I2. 230 S. Cimato et al. Verification SubModule ID {H( ), δ} ==? H(I2) δ I 1’ Error Correction Decoding Inverse Pseudo-Random Permutation Hash Function Biometric matching I2’ Enable Yes/No Fig. 2. The MSB Verification Module The identity of the user is verified in two steps. First a check is performed to compare the hash of the retrieved value for I2 with the value H(I2) stored into the identifier. If the check succeeds it means that the readings of the first biometric trait did not differ more than what permitted by the biometric employed. Then a second biometric matching is performed using as input the retrieved value of I2 and the fresh biometric reading I’2. The authentication is successful when also this second match is positive. 3 Experimental Data and Results 3.1 Dataset Creation The proposed scheme has been tested by using the public CASIA dataset [8]. (version 1.0) which contains seven images of the same eye obtained from 108 subjects. The images were collected by the Chinese Academy of Science waiting at least one month between two capturing stages using near infrared light for illumination (3 images during the first session and 4 for the second one). We used the first 3 images in the enroll operations, and the last 4 images in the verification phase. At the best of our knowledge, there is no public dataset containing the left and right eyes sample of each individual with the sufficient iris resolution to be effectively used in identification tests. For this reason we synthetically created a new dataset by composing two irises of different individuals taken from the CASIA dataset. Table 1 shows the details of the composition method used to create the synthetic dataset from the CASIA samples. The method we used to create the dataset can be considered as a pessimistic estimation of real conditions, since the statistical independence of the features extracted from the iris samples coming from the left and right eye of the same individual is likely to be equal or lower than the one related to the eyes coming from different individuals. In the literature it has been showed that the similarities of the iris templates A Multi-biometric Verification System for the Privacy Protection of Iris Templates 231 Table 1. 
Creation of the synthetic dataset CASIA Individual Identifier 001 002 CASIA File Name Enroll/ Validation 001_1_1.bmp 001_1_2.bmp 001_1_3.bmp 001_2_1.bmp … 001_2_4.bmp 002_1_1.bmp 002_1_2.bmp 002_1_3.bmp 002_2_1.bmp … 002_2_4.bmp Enroll Enroll Enroll Validation … Validation Enroll Enroll Enroll Validation … Validation Synthetic DB Individual Identifier 01 Notes Right eye, Enroll, Sample 1 Right eye, Enroll, Sample 2 Right eye, Enroll, Sample 3 Right eye, Validation, Sample 1 … Right eye, Validation, Sample 4 Left eye, Enroll, Sample 1 Left eye, Enroll, Sample 2 Left eye, Enroll, Sample 3 Left eye, Validation, Sample 1 … Left eye, Validation, Sample 4 coming from the left and right eyes of the same individuals are negligible when Iriscodes templates are used [9]. 3.2 Template Creation The iris templates of the left and right eyes were computed using the code presented in [10] (a completely open implementation which builds over the original ideas of Daugman [9]). The code has been used to obtain the iris codes of the right and left eye of each individual present in the synthetic database. The primary biometric template I1 has been associated to the right eye of the individual by using a 9600 bits wide iris template. As suggested in [10], the 9600 bits have been obtained by processing the iris image with a radial resolution (the number of points selected along a radial line) of 20. The author suggested for the CASIA database a matching criterion with a separation point of r1 = 0.4 (Hamming distance between two different iris templates). Using such a threshold, we independently verified that the algorithm was capable of a false match rate (FMR, the probability of an individual not enrolled being identified) and false non-match rate (FNMR, the probability of an enrolled individual not being identified by the system) of 0.028% and 9.039%, respectively using the CASIA version 1.0 database. Such rates rise to 0.204% and 16.799% respectively if the masking bits are not used. The masking bits mark bits in the iris code which should not be considered when evaluating the Hamming distance between different patterns due to reflections, eyelids and eyelashes coverage, etc. Due to security issues, we preferred to not include the masking bits of the iris code in the final templates since the distribution of zero valued bits in the masks is far from being uniform. The higher FNMR compared with the work of [10] can be explained by considering that using the adopted code failed segmentations of the pupil were reported to happen in the CASIA database in 17.4% of the cases. 3.3 Enroll and Verification Procedures The enroll procedure for the right eye has been executed according to the following steps. The three iris codes available in the enroll phase (Table 1) of each individual 232 S. Cimato et al. (D) ROC Comparison (Linear scale) (A) Right Eye System: 9600 bits 1 20 0.8 10 0.6 0 FNMR Freq. 30 0 0.2 0.4 0.6 0.8 Match score (B) Left Eye System: 1920 bits 1 0.4 0.2 Freq. 15 0 0 10 0.2 0.4 0.6 0.8 1 FMR 5 0 (E) ROC Comparison (Logartimic scale) 0 0.2 0.4 0.6 Match score (C) Proposed Scheme 0.8 0 FNMR 20 Right Eye 9600 bits Left Eye 1920 bits Proposed Scheme 10 1 30 Freq. Right Eye 9600 bits Left Eye 1920 bits Proposed Scheme -1 10 10 0 0 0.2 0.4 0.6 Match score 0.8 1 -3 10 -2 -1 10 10 0 10 FMR Fig. 3. Impostor and genuine frequency distributions of the iris templates composed by 9600 bits (A) and 1920 bits (B) using the synthetic dataset and for the proposed scheme (C and D respectively). 
The corresponding FNMR versus FMR are plotted in linear (D) and logarithmic scale (E). were evaluated for quality, in term of number of masking bits. The iris code with the highest “quality” was retained for further processing. The best of three approach was devised to avoid that segmentation errors might further jeopardize the verification stage. Then, the remaining enroll phases were performed according to the description previously made. A Reed-Solomon [9600,1920,7681]m=14 correction code has been adopted with n1 = 9600 and r1 = 0.4. In such set up, the scheme allows for up to k = 1920 bits for storing the second biometric template. If list decoding is taken into consideration the parameters should be adapted to take into account the enhanced error correcting rate of the list decoding algorithm. The former has been chosen by selecting the available left iris template with highest quality (best of three method) in the same fashion adopted for the right eye. Using this approach, a single identifier ID has been created for every individual present in the synthetic dataset. In particular, the shorter iris code was first subjected to a pseudo random permutation (we used AES in CTR mode) and then it was encoded with the RS code and then xored with the first one to obtain δ. Note that the RS codewords are 14 bits long. The unusual usage of the RS code (here we didn’t pack the bits in the iris code to form symbols, as in typical industrial application) is due to the fact that here we want to correct “at most” a certain number of error (and not “at least”). Each bit of the iris code was then inserted in a separate symbol adding random bits to complete the symbols. Finally an hash value of the second biometric template was computed to get the final ID with δ. In the implementation we used the hash function SHA-1 (Java JDK 6). A Multi-biometric Verification System for the Privacy Protection of Iris Templates 233 In the verification procedure, the left eye related portion was processed only if one of the iris codes was able to unlock the first part of the scheme. Otherwise the matching was considered as failed, and a maximum Hamming distance of 1 was associated to the failed matching value. If the first part of the scheme was successful, the recovered left eye template was matched by using a classical biometric system with the left eye template selected for the validation. The Hamming distance between the two strings is used to measure the distance between the considered templates. The best of four strategy is applied using the four left eye images available in the validation partition of the synthetic dataset. 3.4 Experimental Results for the Proposed Scheme The performances of the proposed method are strictly related to the performance of the code that constructs the iris templates. As such, a fair comparison should be done by considering as reference the performances of the original iris code system working on the same dataset. If we adopt the original iris templates of 9600 and 1920 bits by using the same enroll and verification procedure in a traditional fashion (best of three in verification, best of four in verification, no masking bits), we obtain the system behaviors described in Figure 3. The right eye system (9600 bits) has good separation between the genuine and impostor distributions and it achieves an equal error rate (ERR, the value of the threshold used for matching at which FMR equals FNMR) that can be estimated to about 0.5% on the synthetic dataset. 
The left eye system is working only with 1920 bits and achieves a worst separation between the two populations. The corresponding EER has been estimated to be equal to 9.9%. On the other hand, our multimodal scheme achieves an EER that can be estimated to be equal to 0.96%, and shows then an intermediate behavior between the ROC curves of each single biometric system based on the right or on the left eye (Figure 3). For a wide portion of the ROC curve, the proposed scheme achieves a better performance with respect to the right eye biometric system. That behavior is common for traditional multimodal systems where, very often, the multimodal system can work better than the best single biometric sub-system. The proposed scheme seem to show this interesting property and the slightly worse EER with respect to the best single biometric system (right eye, 9600 bits) is balanced by the protection of the biometric template. We may suppose that the small worsening for the EER is related to the specific code we used to compute the iris code templates and that it might be ameliorated by selecting a different code. Further experiments with enlarged datasets, different coding algorithms and error correction codes will be useful to validate the generality of the discussed results. 4 Conclusions In this work we proposed a method based on the secure sketch cryptographic primitive to provide an effective and easily deployable multimodal biometric verification system. Privacy of user templates is guaranteed by the randomization transformations which avoid any attempt to reconstruct the biometric features from the public identifier, preventing thus any abuse of biometric information. We also showed the 234 S. Cimato et al. feasibility of our approach, by constructing a biometric authentication system that combines two iris biometrics. The experiments confirm that only the owner of the biometric ID can “unlock” her/his biometric templates, once fixed proper thresholds. More complex systems, involving several biometric traits as well as traits of different kinds will be object of further investigations. Acknowledgments. The research leading to these results has received funding from the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 216483. References 1. Jain, A.K., Ross, A., Pankanti, S.: Biometrics: A tool for information security. IEEE Trans. on information forensics and security 1(2), 125–143 (2006) 2. Uludag, U., Pankanti, S., Prabhakar, S., Jain, A.: Biometric cryptosystems: Issues and challenges. Proceedings of the IEEE, Special Issue on Enabling Security Technologies for Digital Rights Management 92, 948–960 (2004) 3. Snelick, R., Uludag, U., Mink, A., Indovina, M., Jain, A.K.: Large scale evaluation of multi-modal biometric authentication using state of the art systems. IEEE Trans. Pattern Analysis and Machine Intelligence 27(3), 450–455 (2005) 4. Cimato, S., Gamassi, M., Piuri, V., Sassi, R., Scotti, F.: A biometric verification system addressing privacy concerns. In: IEEE International Conference on Computational Intelligence and Security (CIS 2007), pp. 594–598 (2007) 5. Dodis, Y., Ostrovsky, R., Reyzin, L., Smith, A.: Fuzzy extractors: How to generate strong keys from biometrics and other noisy data, Cryptology Eprint Archive, Tech. Rep. 2006/235 (2006) 6. Bringer, J., Chabanne, H., Cohen, G., Kindari, B., Zemor, G.: An application of the goldwasser-micali cryptosystem to biometric authentication. In: Pieprzyk, J., Ghodosi, H., Dawson, E. (eds.) 
ACISP 2007. LNCS, vol. 4586, pp. 96–106. Springer, Heidelberg (2007) 7. Sutcu, Y., Li, Q., Memon, N.: Protecting biometric templates with sketch: Theory and practice. IEEE Trans. on Information Forensics and Security 2(3) (2007) 8. Chinese Academy of Sciences: Database of 756 greyscale eye images; Version 1.0 (2003), http://www.sinobiometrics.com/IrisDatabase.htm 9. Daugman, J.G.: High confidence visual recognition of persons by a test of statistical independence. IEEE Trans. on Pattern Analysis and Machine Intelligence 15, 1148–1161 (1993) 10. Masek, L.: Recognition of human iris patterns for biometric identification. Bachelor’s Thesis, School of Computer Science and Software Engineering, University of Western Australia (2003) Score Information Decision Fusion Using Support Vector Machine for a Correlation Filter Based Speaker Authentication System Dzati Athiar Ramli, Salina Abdul Samad, and Aini Hussain Department of Electrical, Electronic and Systems Engineering, Faculty of Engineering, University Kebangsaan Malaysia, 43600 Bangi Selangor, Malaysia dzati@vlsi.eng.ukm.my, salina@vlsi.eng.ukm.my, aini@vlsi.eng.ukm.my. Abstract. In this paper, we propose a novel decision fusion by fusing score information from multiple correlation filter outputs of a speaker authentication system. Correlation filter classifier is designed to yield a sharp peak in the correlation output for an authentic person while no peak is perceived for the imposter. By appending the scores from multiple correlation filter outputs as a feature vector, Support Vector Machine (SVM) is then executed for the decision process. In this study, cepstrumgraphic and spectrographic images are implemented as features to the system and Unconstrained Minimum Average Correlation Energy (UMACE) filters are used as classifiers. The first objective of this study is to develop a multiple score decision fusion system using SVM for speaker authentication. Secondly, the performance of the proposed system using both features are then evaluated and compared. The Digit Database is used for performance evaluation and an improvement is observed after implementing multiple score decision fusion which demonstrates the advantages of the scheme. Keywords: Correlation Filters, Decision Fusion, Support Vector Machine, Speaker Authentication. 1 Introduction Biometric speaker authentication is used to verify a person’s claimed identity. Authentication system compares the claimant’s speech with the client model during the authentication process [1]. The development of a client model database can be a complicated procedure due to voice variations. These variations occur when the condition of the vocal tract is affected by the influence of internal problems such as cold or dry mouth, and also by external problems, for example temperature and humidity. The performance of a speaker authentication system is also affected by room and line noise, changing of recording equipment and uncooperative claimants [2], [3]. Thus, the implementation of biometric systems has to correctly discriminate the biometric features from one individual to another, and at the same time, the system also needs to handle the misrepresentations in the features due to the problems stated. In order to overcome these limitations, we improve the performance of speaker authentication systems by extracting more information (samples) from the claimant and then executing fusion techniques in the decision process. E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 235–242, 2009. 
© Springer-Verlag Berlin Heidelberg 2009 springerlink.com 236 D.A. Ramli, S.A. Samad, and A. Hussain So far, many fusion techniques in the literature have been implemented in biometric systems for the purpose of enhancing system performance. These include the fusion of multiple modalities, multiple classifiers and multiple samples [4]. Teoh et al. in [5] proposed a combination of features of the face modality and the speech modality so as to improve the accuracy of biometric authentication systems. Person identification based on visual and acoustic features has also been reported by Brunelli and Falavigna in [6]. Suutala and Roning in [7] used Learning Vector Quantization (LVQ) and Multilayer Perceptron (MLP) as classifiers for footstep-profile based person identification, whereas in [8] Kittler et al. utilized Neural Networks and Hidden Markov Models (HMM) for a handwritten digit recognition task. The implementation of the multiple-sample fusion approach can be found in [4] and [9]. In general, these studies revealed that the implementation of fusion approaches in biometric systems can improve system performance significantly. This paper focuses on the fusion of score information from multiple correlation filter outputs for a correlation filter based speaker authentication system. Here, we use scores extracted from the correlation outputs by considering several samples extracted from the same modality as independent samples. The scores are concatenated to form a feature vector, and a Support Vector Machine (SVM) is then used to classify the feature vector into either the authentic or the imposter class. Correlation filters have been effectively applied in biometric systems for visual applications such as face verification and fingerprint verification, as reported in [10], [11]. Lower face verification and lip movement for person identification using correlation filters have been implemented in [12] and [13], respectively. A study of using correlation filters in speaker verification with the speech signal as features can be found in [14]. The advantages of correlation filters are shift invariance, the ability to trade off between discrimination and distortion tolerance, and having a closed-form expression.

2 Methodology

The database used in this study is obtained from the Audio-Visual Digit Database (2001) [15]. The database consists of video and corresponding audio of people reciting digits zero to nine. The video of each person is stored as a sequence of JPEG images with a resolution of 512 x 384 pixels, while the corresponding audio is provided as a monophonic, 16 bit, 32 kHz WAV file.

2.1 Spectrographic Features

A spectrogram is an image representing the time-varying spectrum of a signal. The vertical axis (y) shows frequency, the horizontal axis (x) represents time, and the pixel intensity or color represents the amount of energy (acoustic peaks) in the frequency band y at time x [16], [17]. Fig. 1 shows samples of the spectrogram of the word ‘zero’ from person 3 and person 4 obtained from the database. From the figure, it can be seen that the spectrogram image contains personal information in terms of the way the speaker utters the word, such as speed and pitch, as shown by the spectrum. Score Information Decision Fusion Using Support Vector Machine 237 Fig. 1.
Examples of the spectrogram image from person 3 and person 4 for the word ‘zero’ Comparing both figures, it can be observed that although the spectrogram image holds inter-class variations, it also comprises intra-class variations. In order to be successfully classified by correlation filters, we propose a novel feature extraction technique. The computation of the spectrogram is described below. a. Pre-emphasis task. By using a high-pass filter, the speech signal is filtered using the following equation: x (t ) = (s(t ) − 0.95) ∗ x (t − 1) (1) x ( t ) is the filtered signal, s( t ) is the input signal and t represents time. b. Framing and windowing task. A Hamming window with 20ms length and 50% overlapping is used on the signal. c. Specification of FFT length. A 256-point FFT is used and this value determines the frequencies at which the discrete-time Fourier transform is computed. d. The logarithm of energy (acoustic peak) of each frequency bin is then computed. e. Retaining the high energies. After a spectrogram image is obtained, we aim to eliminate the small blobs in the image which impose the intra-class variations. This can be achieved by retaining the high energies of the acoustic peak by setting an appropriate threshold. Here, the FFT magnitudes which are above a certain threshold are maintained, otherwise they are set to be zero. f. Morphological opening and closing. Morphological opening process is used to clear up the residue noisy spots in the image whereas morphological closing is the task used to recover the original shape of the image caused by the morphological opening process. 2.2 Cepstrumgraphic Features Linear Predictive Coding (LPC) is used for the acoustic measurements of speech signals. This parametric modeling is an approach used to match closely the resonant structure of the human vocal tract that produces the corresponding sounds [17]. The computation of the cepstrumgraphic features is described below. a. Pre-emphasis task. By using a high-pass filter, the speech signal is filtered using equation 1. b. Framing and windowing task. A Hamming window with 20ms length and 50% overlapping is used on the signal. c. Specification of FFT length. A 256-point FFT is used and this value determines the frequencies at which the discrete-time Fourier transform is computed. 238 D.A. Ramli, S.A. Samad, and A. Hussain d. Auto-correlation task. For each frame, a vector of LPC coefficients is computed from the autocorrelation vector using Durbin recursion method. The LPC-derived cepstral coefficients (cepstrum) are then derived that lead to 14 coefficients per vector. e. Resizing task. The feature vectors are then down sampled to the size of 64x64 in order to be verified by UMACE filters. 2.3 Correlation Filter Classifier Unconstrained Minimum Average Correlation Energy (UMACE) filters which evolved from Matched Filter are synthesized in the Fourier domain using a closed form solution. Several training images are used to synthesize a filter template. The designed filter is then used for cross-correlating the test image in order to determine whether the test image is from the authentic class or imposter class. In this process, the filter optimizes a criterion to produce a desired correlation output plane by minimizing the average correlation energy and at the same time maximizing the correlation output at the origin [10][11]. 
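Before the filter synthesis is detailed, the spectrographic pipeline of Sect. 2.1 (steps a-f) can be summarised in code. The sketch below is illustrative only: the numpy/scipy routines, the 75th-percentile energy threshold and the 3x3 structuring element are assumptions of the sketch rather than parameters reported above, and the pre-emphasis is written in its standard high-pass form x(t) = s(t) - 0.95 s(t-1), which appears to be the intent of Eq. (1).

```python
# Illustrative sketch of Sect. 2.1: spectrographic image with only the
# high-energy acoustic peaks retained.
import numpy as np
from scipy.ndimage import binary_opening, binary_closing

def spectrographic_image(signal, fs=32000, frame_ms=20, nfft=256,
                         keep_percentile=75.0):
    # (a) Pre-emphasis: x[t] = s[t] - 0.95 * s[t-1].
    x = np.append(signal[0], signal[1:] - 0.95 * signal[:-1])

    # (b) Framing with a 20 ms Hamming window and 50% overlap.
    frame_len = int(fs * frame_ms / 1000)
    hop = frame_len // 2
    window = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])

    # (c)-(d) 256-point FFT and log energy of each frequency bin.
    energy = np.abs(np.fft.rfft(frames, n=nfft)) ** 2
    log_energy = np.log(energy + 1e-12).T      # rows: frequency, columns: time

    # (e) Retain only the energies above a threshold (here a percentile).
    mask = log_energy > np.percentile(log_energy, keep_percentile)

    # (f) Morphological opening to clear residual noisy spots, then closing
    #     to recover the shape of the remaining regions.
    selem = np.ones((3, 3))
    mask = binary_closing(binary_opening(mask, structure=selem), structure=selem)
    return np.where(mask, log_energy, 0.0)
```

The cepstrumgraphic features of Sect. 2.2 reuse steps (a)-(c) and replace the remaining steps with the Durbin recursion for the LPC coefficients, the derived cepstral coefficients (14 per frame) and a down-sampling of the feature matrix to 64x64.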
The optimization of UMACE filter equation can be summarized as follows, U mace = D −1m (2) D is a diagonal matrix with the average power spectrum of the training images placed along the diagonal elements while m is a column vector containing the mean of the Fourier transforms of the training images. The resulting correlation plane produce a sharp peak in the origin and the values at everywhere else are close to zero when the test image belongs to the same class of the designed filter [10][11]. Fig. 2 shows the correlation outputs when using a UMACE filter to determine the test image from the authentic class (left) and imposter class (right). 0.8 0.8 0.6 0.6 0.4 0.4 0.2 0.2 0 0 30 30 30 20 10 0 30 20 20 10 20 10 10 0 0 0 Fig. 2. Examples of the correlation plane for the test image from the authentic class (left) and imposter class (right) Peak-to-Sidelobe ratio (PSR) metric is used to measure the sharpness of the peak. The PSR is given by PSR = peak − mean σ (3) Here, the peak is the largest value of the test image yield from the correlation output. Mean and standard deviation are calculated from the 20x20 sidelobe region by excluding a 5x5 central mask [10], [11]. Score Information Decision Fusion Using Support Vector Machine 239 2.4 Support Vector Machine Support vector machine (SVM) classifier in its simplest form, linear and separable case is the optimal hyperplane that maximizes the distance of the separating hyperplane from the closest training data point called the support vectors [18], [19]. From [18], the solution of a linearly separable case is given as follows. Consider a problem of separating the set of training vectors belonging to two separate classes, {( ) ( )} D = x 1 , y1 ,... x L , y L , x ∈ ℜ n , y ∈ {− 1,−1} (4) with a hyperplane, w, x + b = 0 (5) The hyperplane that optimally separates the data is the one that minimizes φ( w ) = 1 w 2 2 (6) which is equivalent to minimizing an upper bound on VC dimension. The solution to the optimization problem (7) is given by the saddle point of the Lagrange functional (Lagrangian) φ( w , b, α) = L 1 2 w − ∑ α i ⎛⎜ y i ⎡ w , x i + b⎤ − 1⎞⎟ ⎢ ⎥⎦ ⎠ 2 i =1 ⎝ ⎣ (7) where α are the Lagrange multipliers. The Lagrangian has to be minimized with respect to w , b and maximized with respect to α ≥ 0 . Equation (7) is then transformed to its dual problem. Hence, the solution of the linearly separable case is given by, α* = arg min α L 1 L L ∑ ∑ αiα jyi y j xi , x j − ∑ αk 2 i =1 j=1 k =1 (8) with constrains, α i ≥ 0, i = 1,..., L and L ∑ α jy j = 0 j=1 (9) Subsequently, consider a SVM as a non-linear and non-separable case. Non-separable case is considered by adding an upper bound to the Lagrange multipliers and nonlinear case is considered by replacing the inner product by a kernel function. From [18], the solution of the non-linear and non-separable case is given as α* = arg min α ( ) L 1 L L ∑ ∑ α i α j yi y jK x i , x j − ∑ α k 2 i =1 j=1 k =1 (10) with constrains, 0 ≤ α i ≤ C, i = 1,..., L and L ∑ α j y j = 0 x (t ) = (s(t ) − 0.95) ∗ x (t − 1) j=1 (11) 240 D.A. Ramli, S.A. Samad, and A. Hussain Non-linear mappings (kernel functions) that can be employed are polynomials, radial basis functions and certain sigmoid functions. 3 Results and Discussion Assume that N streams of testing data are extracted from M utterances. Let s = {s1 , s 2 ,..., s N } be a pool of scores from each utterance. The proposed verification system is shown in Fig.3. a11 am1 ... Filter design 1 . . . . a1n amn ... Correlation filter Filter design n Correlation filter . 
. . . b1 FFT bn IFFT Correlation output psr1 . . . . . . . . FFT IFFT Correlation output psrn Support vector machine (polynomial kernel) (a11… am1 ) … (a1n … amn )– training data b1, b2 … bn – testing data m – number of training data n - number of groups (zero to nine) Decision Fig. 3. Verification process using spectrographic / ceptrumgraphic images For the spectrographic features, we use 250 filters which represent each word for the 25 persons. Our spectrographic image database consists of 10 groups of spectrographic images (zero to nine) of 25 persons with 46 images per group of size 32x32 pixels, thus 11500 images in total. For each filter, we used 6 training images for the synthesis of a UMACE filter. Then, 40 images are used for the testing process. These six training images were chosen based on the largest variations among the images. In the testing stage, we performed cross correlations of each corresponding word with 40 authentic images and another 40x24=960 imposter images from the other 24 persons. For the ceptrumgraphic features, we also have 250 filters which represent each word for the 25 persons. Our ceptrumgraphic image database consists of 10 groups of ceptrumgraphic images (zero to nine) of 25 persons with 43 images per group of size 64x64 pixels, thus 10750 images in total. For each filter, we used 3 training images for the synthesis of the UMACE filter and 40 images are used for the testing process. We performed cross correlations of each corresponding word with 40 authentic images and another 40x24=960 imposter images from the other 24 persons. Score Information Decision Fusion Using Support Vector Machine 241 For both cases, polynomial kernel has been employed for the decision fusion procedure using SVM. Table 1 below compares the performance of single score decision and multiple score decision fusions for both spectrographic and ceptrumgrapic features. The false accepted rate (FAR) and false rejected rate (FRR) of multiple score decision fusion are described in Table 2. Table 1. Performance of single score decision and multiple score decision fusion features spectrographic cepstrumgraphic single score 92.75% 90.67% multiple score 96.04% 95.09% Table 2. FAR and FRR percentages of multiple score decision fusion features spectrographic cepstrumgraphic FAR 3.23% 5% FRR 3.99% 4.91% 4 Conclusion The multiple score decision fusion approach using support vector machine has been developed in order to enhance the performance of a correlation filter based speaker authentication system. Spectrographic and cepstrumgraphic features, are employed as features and UMACE filters are used as classifiers in the system. By implementing the proposed decision fusion, the error due to the variation of data can be reduced hence further enhance the performance of the system. The experimental result is promising and can be an alternative method to biometric authentication systems. Acknowledgements. This research is supported by Fundamental Research Grant Scheme, Malaysian Ministry of Higher Education, FRGS UKM-KK-02-FRGS00362006 and Science Fund, Malaysian Ministry of Science, Technology and Innovation, 01-01-02-SF0374. References 1. Campbell, J.P.: Speaker Recognition: A Tutorial. Proceeding of the IEEE 85, 1437–1462 (1997) 2. Rosenberg, A.: Automatic speaker verification: A review. Proceeding of IEEE 64(4), 475– 487 (1976) 3. Reynolds, D.A.: An overview of Automatic Speaker Recognition Technology. Proceeding of IEEE on Acoustics Speech and Signal Processing 4, 4065–4072 (2002) 4. 
Poh, N., Bengio, S., Korczak, J.: A multi-sample multi-source model for biometric authentication. In: 10th IEEE on Neural Networks for Signal Processing, pp. 375–384 (2002) 5. Teoh, A., Samad, S.A., Hussein, A.: Nearest Neighborhood Classifiers in a Bimodal Biometric Verification System Fusion Decision Scheme. Journal of Research and Practice in Information Technology 36(1), 47–62 (2004) 242 D.A. Ramli, S.A. Samad, and A. Hussain 6. Brunelli, R., Falavigna, D.: Personal Identification using Multiple Cue. Proceeding of IEEE Trans. on Pattern Analysis and Machine Intelligence 17(10), 955–966 (1995) 7. Suutala, J., Roning, J.: Combining Classifier with Different Footstep Feature Sets and Multiple Samples for Person Identification. In: Proceeeding of International Conference on Acoustics, Speech and Signal Processing, pp. 357–360 (2005) 8. Kittler, J., Hatef, M., Duin, R.P.W., Matas, J.: On combining classifiers. Proceeding of IEEE Trans on Pattern Analysis and Machine Intelligence 20(3), 226–239 (1998) 9. Cheung, M.C., Mak, M.W., Kung, S.Y.: Multi-Sample Data-Dependent Fusion of Sorted Score Sequences for Biometric verification. In: IEEE Conference on Acoustics Speech and Signal Processing (ICASSP 2004), pp. 229–232 (2004) 10. Savvides, M., Vijaya Kumar, B.V.K., Khosla, P.: Face Verification using Correlation Filters. In: 3rd IEEE Automatic Identification Advanced Technologies, pp. 56–61 (2002) 11. Venkataramani, K., Vijaya Kumar, B.V.K.: Fingerprint Verification using Correlation Filters. In: System AVBPA, pp. 886–894 (2003) 12. Samad, S.A., Ramli, D.A., Hussain, A.: Lower Face Verification Centered on Lips using Correlation Filters. Information Technology Journal 6(8), 1146–1151 (2007) 13. Samad, S.A., Ramli, D.A., Hussain, A.: Person Identification using Lip Motion Sequence. In: Apolloni, B., Howlett, R.J., Jain, L. (eds.) KES 2007, Part I. LNCS (LNAI), vol. 4692, pp. 839–846. Springer, Heidelberg (2007) 14. Samad, S.A., Ramli, D.A., Hussain, A.: A Multi-Sample Single-Source Model using Spectrographic Features for Biometric Authentication. In: IEEE International Conference on Information, Communications and Signal Processing, CD ROM (2007) 15. Sanderson, C., Paliwal, K.K.: Noise Compensation in a Multi-Modal Verification System. In: Proceeding of International Conference on Acoustics, Speech and Signal Processing, pp. 157–160 (2001) 16. Spectrogram, http://cslu.cse.ogi.edu/tutordemo/spectrogramReading/spectrogram.html 17. Klevents, R.L., Rodman, R.D.: Voice Recognition: Background of Voice Recognition, London (1997) 18. Gunn, S.R.: Support Vector Machine for Classification and Regression. Technical Report, University of Southampton (2005) 19. Wan, V., Campbell, W.M.: Support Vector Machines for Speaker Verification and Identification. In: Proceeding of Neural Networks for Signal Processing, pp. 
775–784 (2000) Application of 2DPCA Based Techniques in DCT Domain for Face Recognition Messaoud Bengherabi1, Lamia Mezai1, Farid Harizi1, Abderrazak Guessoum2, and Mohamed Cheriet3 1 Centre de Développement des Technologies Avancées- Algeria Division Architecture des Systèmes et MultiMédia Cité 20 Aout, BP 11, Baba Hassen, Algiers-Algeriabengherabi@yahoo.com, l_mezai@yahoo.fr, harizihourizi@yahoo.fr 2 Université Saad Dahlab de Blida – Algeria Laboratoire Traitement de signal et d’imagerie Route De Soumaa BP 270 Blida guessouma@hotmail.com 3 École des Technologies Supérieur –Québec- CanadaLaboratoire d’Imagerie, de Vision et d’Intelligence Artificielle 1100, Rue Notre-Dame Ouest, Montréal (Québec) H3C 1K3 Canada mohamed.cheriet@gpa.etsmtl.ca Abstract. In this paper, we introduce 2DPCA, DiaPCA and DiaPCA+2DPCA in DCT domain for the aim of face recognition. The 2D DCT transform has been used as a preprocessing step, then 2DPCA, DiaPCA and DiaPCA+2DPCA are applied on the upper left corner block of the global 2D DCT transform matrix of the original images. The ORL face database is used to compare the proposed approach with the conventional ones without DCT under Four matrix similarity measures: Frobenuis, Yang, Assembled Matrix Distance (AMD) and Volume Measure (VM). The experiments show that in addition to the significant gain in both the training and testing times, the recognition rate using 2DPCA, DiaPCA and DiaPCA+2DPCA in DCT domain is generally better or at least competitive with the recognition rates obtained by applying these three 2D appearance based statistical techniques directly on the raw pixel images; especially under the VM similarity measure. Keywords: Two-Dimensional PCA (2DPCA), Diagonal PCA (DiaPCA), DiaPCA+2DPCA, face recognition, 2D Discrete Cosine Transform (2D DCT). 1 Introduction Different appearance based statistical methods for face recognition have been proposed in literature. But the most popular ones are Principal Component Analysis (PCA) [1] and Linear Discriminate Analysis (LDA) [2], which process images as 2D holistic patterns. However, a limitation of PCA and LDA is that both involve eigendecomposition, which is extremely time-consuming for high dimensional data. Recently, a new technique called two-dimensional principal component analysis 2DPCA was proposed by J. Yang et al. [3] for face recognition. Its idea is to estimate the covariance matrix based on the 2D original training image matrices, resulting in a E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 243–250, 2009. © Springer-Verlag Berlin Heidelberg 2009 springerlink.com 244 M. Bengherabi et al. covariance matrix whose size is equal to the width of images, which is quite small compared with the one used in PCA. However, the projection vectors of 2DPCA reflect only the variations between the rows of images, while discarding the variations of columns. A method called Diagonal Principal Component Analysis (DiaPCA) is proposed by D. Zhang et al. [4] to resolve this problem. DiaPCA seeks the projection vectors from diagonal face images [4] obtained from the original ones to ensure that the correlation between rows and those of columns is taken into account. An efficient 2D techniques that results from the combination of DiaPCA and 2DPCA (DiaPCA+2DPCA) is proposed also in [4]. Discrete cosine transform (DCT) has been used as a feature extraction step in various studies on face recognition. This results in a significant reduction of computational complexity and better recognition rates [5, 6]. 
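The block-selection step, detailed in Sect. 3 below, amounts to computing the global 2D DCT of a face image and keeping only its w x w upper-left corner of low-frequency coefficients. A minimal sketch is given here; the scipy DCT routine and the orthonormal DCT-II normalisation are assumptions of the sketch, not choices stated by the authors.

```python
# Keep only the low-frequency w x w corner of the global 2D DCT of an image.
import numpy as np
from scipy.fft import dct

def dct_block(image, w=16):
    # Separable 2D DCT-II: 1D DCT along the columns, then along the rows.
    coeffs = dct(dct(image, axis=0, norm='ortho'), axis=1, norm='ortho')
    # Energy compaction concentrates the most significant information in the
    # first coefficients, so the corner block serves as the reduced feature matrix.
    return coeffs[:w, :w]

# Example: an ORL-sized 112 x 92 image reduced to a 16 x 16 block
# (a block of 16 x 16 or less is reported in Sect. 4 to be sufficient).
block = dct_block(np.random.rand(112, 92), w=16)
```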
DCT provides excellent energy compaction and a number of fast algorithms exist for calculating it. In this paper, we introduce 2DPCA, DiaPCA and DiaPCA+2DPCA in DCT domain for face recognition. The DCT transform has been used as a feature extraction step, then 2DPCA, DiaPCA and DiaPCA+2DPCA are applied only on the upper left corner block of the global DCT transform matrix of the original images. Our proposed approach is tested against conventional approaches without DCT under Four matrix similarity measures: Frobenuis, Yang, Assembled Matrix Distance (AMD) and Volume Measure (VM). The rest of this paper is organized as follows. In Section 2 we give a review of 2DPCA, DiaPCA and DiaPCA+2DPCA approaches and also we review different matrix similarity measures. In section 3, we present our contribution. In section 4 we report the experimental results and highlight a possible perspective of this work. Finally, in section 5 we conclude this paper. 2 Overview of 2DPCA, DiaPCA, DiaPCA+2DPCA and Matrix Similarity Measures 2.1 Overview of 2D PCA, DiaPCA and DiaPCA+2DPCA 2.1.1 Two-Dimensional PCA Given M training face images, denoted by m×n matrices Ak (k = 1, 2… M), twodimensional PCA (2DPCA) first uses all the training images to construct the image covariance matrix G given by [3] G= 1 M ∑ (A M k −A k =1 ) (A T k −A ) (1) Where A is the mean image of all training images. Then, the projection axes of 2DPCA, Xopt=[x1… xd] can be obtained by solving the algebraic eigenvalue problem Gxi=λixi, where xi is the eigenvector corresponding to the ith largest eigenvalue of G [3]. The low dimensional feature matrix C of a test image matrix A is extracted by C = AX opt (2) In Eq.(2) the dimension of 2DPCA projector Xopt is n×d, and the dimension of 2DPCA feature matrix C is m×d. Application of 2DPCA Based Techniques in DCT Domain 245 2.1.2 Diagonal Principal Component Analysis Suppose that there are M training face images, denoted by m×n matrices Ak(k = 1, 2, …, M). For each training face image Ak, we calculate the corresponding diagonal face image Bk as it is defined in [4]. Based on these diagonal faces, diagonal covariance matrix is defined as [4]: G DIAG = Where B = 1 M 1 M ∑ (B M k −B k =1 ) (B T k −B ) (3) M ∑B k is the mean diagonal face. According to Eq. (3), the projection k =1 vectors Xopt=[x1, …, xd] can be obtained by computing the d eigenvectors corresponding to the d biggest eigenvalues of GDIAG. The training faces Ak’s are projected onto Xopt, yielding m×d feature matrices. C k = Ak X opt (4) Given a test face image A, first use Eq. (4) to get the feature matrix C = AX opt , then a matrix similarity metric can be used for classification. 2.1.3 DiaPCA+2DPCA Suppose the n by d matrix X=[x1, …, xd] is the projection matrix of DiaPCA. Let Y=[y1, …, yd] the projection matrix of 2DPCA is computed as follows: When the height m is equal to the width n, Y is obtained by computing the q eigenvectors corresponding to the q biggest eigenvalues of the image covarinace matrix 1 M (A − A)T (A − A) . On the other hand, when the height m is not equal to the width M ∑ k k k =1 n, Y is obtained by computing the q eigenvectors corresponding to the q biggest ei- ∑ (A M genvalues of the alternative image covariance matrix 1 M k k =1 )( ) T − A Ak − A . Projecting training faces Aks onto X and Y together, yielding the q×d feature matrices (5) D k = Y T Ak X Given a test face image A, first use Eq. (5) to get the feature matrix D = Y T AX , then a matrix similarity metric can be used for classification. 
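As a compact illustration of Eqs. (1)-(5), the sketch below computes the 2DPCA projector and a DiaPCA+2DPCA style bilinear feature matrix. It is a sketch only: numpy is a tooling assumption, the random array stands in for the training images, and the diagonal-face construction of DiaPCA [4] is not reproduced, so the 2DPCA projector stands in for the DiaPCA one.

```python
# Sketch of 2DPCA (Eqs. (1)-(2)) and the bilinear projection of Eq. (5).
import numpy as np

def image_covariance(images):
    """images: (M, m, n) stack; returns the n x n matrix G of Eq. (1)."""
    centred = images - images.mean(axis=0)
    # G = (1/M) * sum_k (A_k - mean)^T (A_k - mean)
    return np.einsum('kmi,kmj->ij', centred, centred) / len(images)

def top_eigenvectors(G, d):
    """The d eigenvectors of the symmetric matrix G with the largest eigenvalues."""
    vals, vecs = np.linalg.eigh(G)
    return vecs[:, np.argsort(vals)[::-1][:d]]

train = np.random.rand(200, 112, 92)          # stand-in for the training images
X = top_eigenvectors(image_covariance(train), d=8)
C = train[0] @ X                              # Eq. (2): m x d feature matrix

# Non-square images: Y comes from the alternative covariance
# (1/M) * sum_k (A_k - mean)(A_k - mean)^T; then D = Y^T A X (Eq. (5)).
# In DiaPCA+2DPCA proper, X would be obtained from the diagonal face images.
centred = train - train.mean(axis=0)
G_alt = np.einsum('kim,kjm->ij', centred, centred) / len(train)
Y = top_eigenvectors(G_alt, d=8)
D = Y.T @ train[0] @ X                        # q x d feature matrix
```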
2.2 Overview of Matrix Similarity Measures An important aspect of 2D appearance based face recognition approaches is the similarity measure between matrix features used at the decision level. In our work, we have used four matrix similarity measures. 2.2.1 Frobenius Distance Given two feature matrices A = (aij)m×d and B = (bij)m×d, the Frobenius distance [7] measure is given by: ⎛ m d F ( A, B ) = ⎜⎜ ∑ ⎝ i =1 ∑ (a 12 d j =1 ij 2⎞ − bij ) ⎟⎟ ⎠ (6) 246 M. Bengherabi et al. 2.2.2 Yang Distance Measure Given two feature matrices A = (aij)m×d and B = (bij)m×d, the Yang distance [7] is given by: 12 d ⎛ m 2⎞ dY ( A, B ) = ∑ ⎜ ∑ (aij − bij ) ⎟ j =1 ⎝ i =1 ⎠ (7) 2.2.3 Assembled Matrix Distance (AMD) A new distance called assembled matrix distance (AMD) metric to calculate the distance between two feature matrices is proposed recently by Zuo et al [7]. Given two feature matrices A = (aij)m×d and B = (bij)m×d, the assembled matrix distance dAMD(A,B) is defined as follows : (1 2 ) p ⎞ ⎛ d ⎛ m 2⎞ ⎟ d AMD ( A, B ) = ⎜ ∑ ⎜ ∑ (aij − bij ) ⎟ ⎟ ⎜ j =1 ⎝ i =1 ⎠ ⎠ ⎝ 12 ( p > 0) (8) It was experimentally verified in [7] that best recognition rate can be obtained when p≤0.125 while it decrease as p increases. In our work the parameter p is set equal to 0.125. 2.2.4 Volume Measure (VM) The VM similarity measure is based on the theory of high-dimensional geometry space. The volume of an m×n matrix of rank p is given by [8] ∑det Vol A = ( I ,J )∈N 2 A IJ (9) where AIJ denotes the submatrix of A with rows I and columns J, N is the index set of p×p nonsingular submatrix of A, and if p=0, then Vol A = 0 by definition. 3 The Proposed Approach In this section, we introduce 2DPCA, DiaPCA and DiaPCA+2DPCA in DCT domain for the aim of face recognition. The DCT is a popular technique in imaging and video compression, which was first applied in image compression in 1974 by Ahmed et al [9]. Applying the DCT to an input sequence decomposes it into a weighted sum of basis cosine sequences. our methodology is based on the use of the 2D DCT as a feature extraction or preprocessing step, then 2DPCA, DiaPCA and DiaPCA+2DPCA are applied to w×w upper left block of the global 2D DCT transform matrix of the original images. In this approach, we keep only a sub-block containing the first coefficients of the 2D DCT matrix as shown in Fig.1, from the fact that, the most significant information is contained in these coefficients. 2D DCT c11 c12 … c1w . . cw1 cw2 … cww Fig. 1. Feature extraction in our approach Application of 2DPCA Based Techniques in DCT Domain 247 With this approach and inversely to what is presented in literature of DCT-based face recognition approaches, the 2D structure is kept and the dimensionality reduction is carried out. Then, the 2DPCA, DiaPCA and DiaPCA+2DPCA are applied to w×w block of 2D DCT coefficients. The training and testing block diagrams describing the proposed approach is illustrated in Fig.2. Training a lgorithm based on 92DPCA 9Dia PCA 9Dia PCA+2DPCA Block w*w of 2D DCT coefficients 2D DCT Tra ined Model Training data 2D DCT ima ge Projection of the DCT bloc of the test ima ge using the eigenvectors of 92DPCA 9Dia PCA 9Dia PCA+2DPCA Block w*w of 2D DCT coefficients 2D DCT Test ima ge 2D DCT Block Features 2D DCT ima ge Compa rison using 9Frobenius 9Yang 9AMD 9VM 2D DCT Block Fea tures Decision Fig. 2. 
Block diagram of 2DPCA, DiaPCA and DiaPCA+2DPCA in DCT domain 4 Experimental Results and Discussion In this part, we evaluate the performance of 2DPCA, DiaPCA and DiaPCA+2DPCA in DCT domain and we compare it to the original 2DPCA, DiaPCA and DiaPCA+2DPCA methods. All the experiments are carried out on a PENTUIM 4 PC with 3.2GHz CPU and 1Gbyte memory. Matlab [10] is used to carry out these experiments. The database used in this research is the ORL [11] (Olivetti Research Laboratory) face database. This database contains 400 images for 40 individuals, for each person we have 10 different images of size 112×92 pixels. For some subjects, the images captured at different times. The facial expressions and facial appearance also vary. Ten images of one person from the ORL database are shown in Fig.3. In our experiment, we have used the first five image samples per class for training and the remaining images for test. So, the total number of training samples and test samples were both 200. Herein and without DCT the size of diagonal covariance matrix is 92×92, and each feature matrix with a size of 112×p where p varies from 1 to 92. However with DCT preprocessing the dimension of these matrices depends on the w×w DCT block where w varies from 8 to 64. We have calculated the recognition rate of 2DPCA, DiaPCA, DiaPCA+2DPCA with and without DCT. In this experiment, we have investigated the effect of the matrix metric on the performance of the 2D face recognition approaches presented in section 2. We see from table 1, that the VM provides the best results whereas the Frobenius gives the worst ones, this is justified by the fact that the Frobenius metric is just the sum of the 248 M. Bengherabi et al. (a) (b) Fig. 3. Ten images of one subject in the ORL face database, (a) Training, (b) Testing Euclidean distance between two feature vectors in a feature matrix. So, this measure is not compatible with the high-dimensional geometry theory [8]. Table 1. Best recognition rates of 2DPCA, DiaPCA and DiaPCA+2DPCA without DCT Methods 2DPCA DiaPCA DiaPCA+2DPCA Frobenius 91.50 (112×8) 91.50 (112×8) 92.50 (16×10) Yang 93.00 (112×7) 92.50 (112×10) 94.00 (13×11) AMD p=0,125 95.00 (112×4) 91.50 (112×8) 93.00 (12×6) Volume Distance 95.00 (112×3) 94.00 (112×9) 96.00 (21×8) Tables 2, and Table 3 summarize the best performances under different 2D DCT block sizes and different matrix similarity measures. Table 2. 
2DPCA, DiaPCA and DiaPCA+2DPCA under different DCT block sizes using the Frobenius and Yang matrix distance Best Recognition rate (feature matrix dimension) DiaPCA+2DPCA 2DPCA DiaPCA Yang 91.50 (6×6) 93.50 (8×6) 93.50 (8×5) 92.00 (9×5) 93.00 (9×6) 95.00 (9×9) 92.00 (10×5) 94.50 (10×6) 95.50 (10×9) 92.00 (9×5) 94.00 (11×6) 95.50 (11×5) 91.50 (9×5) 94.50 (12×6) 95.50 (12×5) 92.00 (12×11) 94.50 (13×6) 95.00 (13×5) 92.00 (12×7) 94.50 (14×6) 94.50 (14×5) 2D DCT block size 8×8 9×9 10×10 11×11 12×12 13×13 14×14 91.50 (8×8) 92.00 (9×9) 91.50 (10×5) 92.00 (11×8) 92.00 (12×8) 91.50 (13×7) 92.00 (14×7) DiaPCA Frobenius 91.50 (8×6) 92.00 (9×5) 92.00 (10×5) 91.50 (11×5) 91.50 (12×10) 92.00 (13×11) 91.50 (14×7) 15×15 16×16 32×32 91.50 (15×5) 92.50 (16×10) 92.00 (32×6) 91.50 (15×5) 91.50 (16×11) 91.50 (32×6) 92.00 (13×15) 92.00 (4×10) 92.00 (11×7) 94.00 (15×9) 94.00 (16×7) 93.00 (32×6) 94.50 (15×5) 94.50 (16×5) 93.50 (32×5) 95.50 (12×5) 95.00 (12×5) 95.00 (12×5) 64×64 91.50 (64×6) 91.00 (32×6) 92.00 (14×12) 93.00 (64×7) 93.50 (64×5) 95.00 (12×5) 2DPCA DiaPCA+2DPCA 93.50 (8×5) 95.00 (9×9) 95.50 (10×9) 95.50 (11×5) 95.50 (12×5) 95.00 (11×5) 95.00 (12×5) From these four tables, we notice that in addition to the importance of matrix similarity measures, by the use of DCT we have always better performance in terms of recognition rate and this is valid for all matrix measures, we have only to choose the DCT block size and appropriate feature matrix dimension. An important remark is that a block size of 16×16 or less is sufficient to have the optimal performance. So, this results in a significant reduction in training and testing time. This significant gain Application of 2DPCA Based Techniques in DCT Domain 249 Table 3. 2DPCA, DiaPCA and DiaPCA+2DPCA under different DCT block sizes using the AMD distance and VM similarity measure on the ORL database 2D DCT block size 2DPCA 8×8 9×9 10×10 11×11 12×12 13×13 14×14 15×15 16×16 32×32 64×64 94.00 (8×4) 94.50 (9×4) 94.50 (10×4) 95.50 (11×5) 95.50 (12×5) 96.00 (13×4) 96.00 (14×4) 96.00 (15×4) 96.00 (16×4) 95.50 (32×4) 95.00 (64×4) DiaPCA AMD 95.00 (8×6) 94.50 (9×5) 95.50 (10×5) 96.00 (11×5) 96.50 (12×7) 95.50 (13×5) 95.00 (14×5) 95.00 (15×5) 95.50 (16×5) 95.00 (32×9) 94.50 (64×9) Best Recognition rate (feature matrix dimension) DiaPCA+2DPCA 2DPCA DiaPCA VM 95.00 (7×5) 96.00 (8×3) 93.50 (8×4) 94.50 (9×5) 95.00 (9×4) 95.00 (9×5) 96.00 (9×7) 95.00 (10×3) 95.00 (10×4) 94.50 (11×3) 95.50 (11×3) 96.50 (9×6) 95.50 (12×5) 96.00 (12×5) 96.50 (9×7) 95.50 (12×5) 96.00 (13×9) 96.00 (13×5) 95.50 (10×5) 95.00 (14×3) 95.50 (14×5) 96.00 (9×7) 96.00 (15×8) 96.00 (15×5) 95.50 (16×8) 96.00 (16×5) 96.50 (12×5) 96.00 (11×5) 95.00 (32×3) 95.50 (32×5) 96.00 (12×5) 95.00 (64×3) 95.00 (64×5) DiaPCA+2DPCA 93.50 (8×4) 95.00 (9×5) 95.00 (10×4) 95.50 (11×3) 96.00 (11×5) 96.50 (10×5) 96.50 (10×5) 96.50 (10×5) 96.50 (10×5) 96.50 (9×5) 96.50 (21×5) in computation is better illustrated in table 4 and table 5, which illustrate the total training and total testing time of 200 persons -in seconds - of the ORL database under 2DPCA, DiaPCA and DiaPCA+2DPCA without and with DCT, respectively. We should mention that the computation of DCT was not taken into consideration when computing the training and testing time of DCT based approaches. Table 4. Training and testing time without DCT using Frobenius matrix distance Methods Training time in sec Testing time in sec 2DPCA 5.837 (112×8) 1.294 (112×8) DiaPCA 5.886 (112×8) 2.779 (112×8) DiaPCA+2DPCA 10.99 (16×10) 0.78 (16×10) Table 5. 
Training and testing time with DCT using the Frobenius distance and the same matrixfeature dimensions as in Table2 2D DCT block size 8×8 9×9 10×10 11×11 12×12 13×13 14×14 15×15 16×16 2DPCA 0.047 0.048 0.047 0.048 0.063 0.062 0.079 0.094 0.125 Training time in sec DiaPCA DiaPCA+2DPCA 0.047 0.047 0.048 0.124 0.048 0.094 0.047 0.063 0.046 0.094 0.047 0.126 0.062 0.14 0.078 0.173 0.141 0.219 2DPCA 0.655 0.626 0.611 0.578 0.641 0.642 0.656 0.641 0.813 Testing time in sec DiaPCA DiaPCA+2DPCA 0.704 0.61 0.671 0.656 0.719 0.625 0.734 0.5 0.764 0.657 0.843 0.796 0.735 0.718 0.702 0.796 0.829 0.827 We can conclude from this experiment, that the proposed approach is very efficient in weakly constrained environments, which is the case of the ORL database. 5 Conclusion In this paper, 2DPCA, DiaPCA and DiaPCA+2PCA are introduced in DCT domain. The main advantage of the DCT transform is that it discards redundant information and it can be used as a feature extraction step. So, computational complexity is significantly reduced. The experimental results show that in addition to the significant gain in both the training and testing times, the recognition rate using 2DPCA, DiaPCA and DiaPCA+2DPCA in DCT domain is generally better or at least competitive with the 250 M. Bengherabi et al. recognition rates obtained by applying these three techniques directly on the raw pixel images; especially under the VM similarity measure. The proposed approaches will be very efficient for real time face identification applications such as telesurveillance and access control. References 1. Turk, M., Pentland, A.: “Eigenfaces for Recognition. Journal of Cognitive Neurosicence 3(1), 71–86 (1991) 2. Belhumeur, P.N., Hespanha, J.P., Kriegman, D.J.: Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEETrans. on Patt. Anal. and Mach. Intel. 19(7), 711–720 (1997) 3. Yang, J., Zhang, D., Frangi, A.F., Yang, J.Y.: Two-Dimensional PCA: A New Approach to Appearance- Based Face Representation and Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 26(1), 131–137 (2004) 4. Zhang, D., Zhou, Z.H., Chen, S.: “Diagonal Principal Component Analysis for Face Recognition. Pattern Recognition 39(1), 140–142 (2006) 5. Hafed, Z.M., Levine, M.D.: “Face recognition using the discrete cosine transform. International Journal of Computer Vision 43(3) (2001) 6. Chen, W., Er, M.J., Wu, S.: PCA and LDA in DCT domain. Pattern Recognition Letters 26(15), 2474–2482 (2005) 7. Zuo, W., Zhang, D., Wang, K.: An assembled matrix distance metric for 2DPCA-based image recognition. Pattern Recognition Letters 27(3), 210–216 (2006) 8. Meng, J., Zhang, W.: Volume measure in 2DPCA-based face recognition. Pattern Recognition Letters 28(10), 1203–1208 (2007) 9. Ahmed, N., Natarajan, T., Rao, K.: Discrete cosine transform. IEEE Trans. on Computers 23(1), 90–93 (1974) 10. Matlab, The Language of Technical Computing, Version 7 (2004), http://www.mathworks.com 11. ORL. The ORL face database at the AT&T (Olivetti) Research Laboratory (1992), http://www.uk.research.att.com/facedatabase.html Fingerprint Based Male-Female Classification Manish Verma and Suneeta Agarwal Computer Science Department, Motilal Nehru National Institute of Technology Allahabad Uttar Pradesh India manishverma649@gmail.com, suneeta@mnnit.ac.in Abstract. Male-female classification from a fingerprint is an important step in forensic science, anthropological and medical studies to reduce the efforts required for searching a person. 
The aim of this research is to establish a relationship between gender and the fingerprint using some special features such as ridge density, ridge thickness to valley thickness ratio (RTVTR) and ridge width. Ahmed Badawi et. al. showed that male-female classification can be done correctly upto 88.5% based on white lines count, RTVTR & ridge count using Neural Network as Classifier. We have used RTVTR, ridge width and ridge density for classification and SVM as classifier. We have found male-female can be correctly classified upto 91%. Keywords: gender classification, fingerprint, ridge density, ridge width, RTVTR, forensic, anthropology. 1 Introduction For over centuries, fingerprint has been used for both identification and verification because of its uniqueness. A fingerprint contains three level of information. Level 1 features contain macro details of fingerprint such as ridge flow and pattern type e.g. arch, loop, whorl etc. Level 2 features refer to the Galton characteristics or minutiae, such as ridge bifurcation or ridge termination e.g. eye, hook, bifurcation, ending etc. Level 3 features include all dimensional attributes of ridge e.g. ridge path deviation, width, shape, pores, edge contour, ridges breaks, creases, scars and other permanent details [10]. Till now little work has been done in the field of male-female fingerprint classification. In 1943, Harold Cummnins and Charles Midlo in the book “Fingerprints, Palm and Soles” first gave the relation between gender and the fingerprint. In 1968, Sarah B Holt, Charles C. Thomas in the book “The Genetics of the Dermal Ridges” gave same theory with little modification. Both state the same fact that female ridges are finer/smaller and have higher ridge density than males. Acree showed that females have higher ridge density [9]. Kralik showed that males have higher ridge width [6]. Moore also carried out a study on ridge to ridge distance and found that mean distance is more in male compared to female [7]. Dr. Sudesh Gungadin showed that a ridge count of ≤13 ridges/25 mm2 is more likely to be of males and that of ≥14 ridges/25 mm2 is likely to be of females [2]. Ahmed Badawi et. al. showed that male-female can be correctly classified upto 88.5% [1] based on white lines count, RTVTR & ridge count using Neural Network as Classifier. According to the research of E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 251–257, 2009. © Springer-Verlag Berlin Heidelberg 2009 springerlink.com 252 M. Verma and S. Agarwal Cummnins and Midlo, A typical young male has, on an average, 20.7 ridges per centimeter while a young female has 23.4 ridges per centimeter [8]. On the basis of studies made in [6], [1], [2], ridge width, RTVTR and ridge density are significant features for male-female classification. In this paper, we studied the significance of ridge width, ridge density and ridge thickness to valley thickness ratio (RTVTR) for the classification purpose. For classification we have used SVM classifier because of its significant advantage. Artificial Neural Networks (ANNs) can suffer from multiple local minima, the solution with an SVM is global and unique. Unlike ANNs, the computational complexity of SVMs does not depend on the dimensionality of the input space. ANNs use empirical risk minimization, whilst SVMs use structural risk minimization. SVMs are less prone to overfitting [13]. 2 Materials and Methods In our Male-Female classification analysis with respect to fingerprints, we extracted three features from each fingerprint. 
The features are ridge width, ridge density and RTVTR. Male & Female are classified using these features with the help of SVM classifier. 2.1 Dataset We have taken 400 fingerprints (200 Male & 200 Female) of indian origin in the age group of 18-60 years. These fingerprint are divided into two disjoint set for training and testing, each set contains 100 male and 100 female fingerprints. 2.2 Fingerprint Feature Extraction Algorithm The flowchart of the Fingerprint Feature Extraction and Classification Algorithm is shown in Fig. 1. The main steps of the algorithm are: Normalization [4] Normalization is used to standardize the intensity values of an image by adjusting the range of its grey-level values so that they lie within a desired range of values e.g. zero mean and unit standard deviation. Let I(i,j) denotes the gray-level value at pixel (i,j), M & VAR denote the estimated mean & variance of I(i,j) respectively & N(i,j) denotes the normalized gray-level value at pixel (i,j). The normalized values is defined as follows: , , , , (1) , , Where M0 and VAR0 are desired mean and variance values respectively. Image Orientation [3] Orientation of a fingerprint is estimated by the least mean square orientation estimation algorithm given by Hong et. al. Given a normalized image, N, the main steps of Fingerprint Based Male-Female Classification 253 Fig. 1. Flow chart for Fingerprint Feature Extraction and Classification Algorithm the orientation estimation are as follows: Firstly, a block of size wXw (25X25) is centred at pixel (i, j) in the normalized fingerprint image. For each pixel in this block, compute the Gaussian gradients ∂x(i, j) and ∂y(i, j), which are the gradient magnitudes in the x & y directions respectively. The local orientation of each block centered at pixel (i, j) is estimated using the following equations [11]. i, j 2∂ . ∂ , , 1 tan 2 ∂ , , , , ∂ , , (2) (3) (4) 254 M. Verma and S. Agarwal where θ(i,j) is the least square estimate of the local orientation at the block centered at pixel (i,j). Now orient the block with θ degree around the center of the block, so that the ridges of this block are in vertical direction. Fingerprint Feature Extraction In the oriented image, ridges are in vertical direction. Projection of the ridges and valleys on the horizontal line forms an almost sinusoidal shape wave with the local minima points corresponding to ridges and maxima points corresponds to valleys of the fingerprint. Ridge Width R is defined as thickness of a ridge. It is computed by counting the number of pixels between consecutive maxima points of projected image, number of 0’s between two clusters of 1’s will give ridge width e.g. 11110000001111 in above example, ridge width is 6 pixels. Valley Width V is defined as thickness of valleys. It is computed by counting the number of pixels between consecutive minima points of projected image, number of 1’s between two clusters of 0’s will give valley width e.g. 00001111111000 in above example, valley width is 7 pixels. Ridge Density is defined as number of ridges in a given block. e.g. 001111100011111011 Above string contains 3 ridges in a block. So ridge density is 3. Ridge Thickness to Valley Thickness Ratio (RTVTR) is defined as the ratio of ridge width to the valley width and is given by RTVTR = R/V. Fig. 2. Segmented image is Oriented and then projected to line from its binary transform we got ridge and valley width Example 1. Fig. 2. 
shows a segment of normalized fingerprint, which is oriented so that the ridges are in vertical direction. Then these ridges are projected on horizontal line. In the projected image, black dots show a ridge and white show a valley. Fingerprint Based Male-Female Classification 255 Classification SVM’s are used for classification and regression. SVM’s are set of related supervised learning methods. A special property of SVMs is that they simultaneously minimize the empirical classification error and maximize the geometric margin; hence they are also known as maximum margin classifiers. For a given training set of instance-label pairs (Xi, yi), i=1..N, where N is any integer showing number of training sample, Xi ∈Rn (n denotes the dimension of input space) belongs to the two separating classes labeled by yi∈{-1,1}, This classification problem is to find an optimal hyperplane WTZ + b =0 in a high dimension feature space Z by constructing a map Z=φ(X) between Rn and Z. SVM determines this hyperplane by finding W and b which satisfy ∑ , 1,2, . . , . (5) Where ξ i ≥ 0 and yi[WTφ(Xi)+b] ≥ 1- ξ i holds. Coefficient c is the given upper bound and N is the number of samples. The optimal W and b can be found by solving the dual problem of Eqn. (5), namely ∑ max ∑ φX φX ∑ . (6) 00. Where 0 ≤ αi ≤ c (i = 1,….,N) is Lagrange multiplier and it satisfies ∑ α y Let K X , X φ X φ X ) and we adopt the RBF function to map the input vectors into the high dimensional space Z. The RBF Function is given by , exp γ| | , γ 0 . (7) Where γ= 0.3125,c=512.Values of c & γ are computed by grid search [5]. The decision function of the SVM classifier is presented as . , (8) Where K(.,.) is the kernel function, which defines an inner product in higher dimenW φ X . The decision func, sion space Z and it satisfies that ∑ α y . tion sgn(φ) is the sign function and if φ≥0 then sgn(φ)=1 otherwise sgn(φ)=-1 [12]. 3 Results Our experimental result showed that if we consider any single feature for Male– Female classification then the classification rate is very low. Confusion matrix for Ridge Density (Table 1), RTVTR (Table 2) and Ridge Width (Table 3) show that their classification rate is 53, 59.5 and 68 respectively for testing set. But by taking all these features together we obtained the classification rate 91%. Five fold cross validation is used for the evaluation of the model. For testing set, Combining all these features together classification rate is 88% (Table 4). 256 M. Verma and S. Agarwal Table 1. Confusion Matrix for Male-Female classification based on Ridge Density only for Testing set Actual\Estimated Male Female Total Male 47 41 88 Female 53 59 112 Total 100 100 200 For Ridge Density the classification rate is 53% Table 2. Confusion Matrix for Male-Female classification based on RTVTR only for Testing set Actual\Estimated Male Female Total Male 30 11 41 Female 70 89 159 Total 100 100 200 For RTVTR the classification rate is 59.5% Table 3. Confusion Matrix for Male-Female classification based on Ridge Width only for Testing set Actual\Estimated Male Female Total Male 51 15 66 Female 49 85 134 Total 100 100 200 For Ridge Width the classification rate is 68% Table 4. Confusion Matrix for Male-Female classification based on combining Ridge Density, Ridge Width and RTVTR only for Testing set Actual\Estimated Male Female Total Male 86 10 96 Female 14 90 104 Total 100 100 200 For Testing set the classification rate is 88% 4 Conclusion Accuracy of our model obtained by five fold cross validation method is 91%. 
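For completeness, the feature extraction from the projected profile and the RBF-kernel SVM described in Sect. 2.2 can be sketched as follows. The sketch assumes scikit-learn and numpy, and random arrays stand in for the real feature vectors and labels; only the kernel and the grid-searched values c = 512 and gamma = 0.3125 are taken from the text. Runs of 0s in the binarized projection are treated as ridges and runs of 1s as valleys, as in the examples above.

```python
# Sketch: profile-based features (ridge density, ridge width, RTVTR) and an
# RBF-kernel SVM with the grid-searched c = 512, gamma = 0.3125.
import numpy as np
from itertools import groupby
from sklearn.svm import SVC

def profile_features(profile):
    """profile: 1D sequence of 0s (ridge) and 1s (valley) from the projection."""
    runs = [(v, len(list(g))) for v, g in groupby(profile)]
    ridge_runs = [n for v, n in runs if v == 0]
    valley_runs = [n for v, n in runs if v == 1]
    ridge_width = float(np.mean(ridge_runs)) if ridge_runs else 0.0
    valley_width = float(np.mean(valley_runs)) if valley_runs else 1.0
    ridge_density = len(ridge_runs)            # number of ridges in the block
    rtvtr = ridge_width / valley_width         # RTVTR = R / V
    return [ridge_density, ridge_width, rtvtr]

# Illustrative training call on precomputed feature vectors (one per print);
# labels follow the +1 / -1 convention of the SVM formulation above.
X_train = np.random.rand(200, 3)               # stand-in feature vectors
y_train = np.where(np.random.rand(200) > 0.5, 1, -1)
clf = SVC(kernel='rbf', C=512, gamma=0.3125).fit(X_train, y_train)
prediction = clf.predict(np.random.rand(1, 3))
```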
Our results have shown that the ridge density, RTVTR and Ridge Width gave gave 53%, 59.5% and 68% classification rates respectively. Combining all these features together, we obtained 91% classification rate. Hence, our method gave 2.5 % better result than the method given by Ahmed Badawi et al. Fingerprint Based Male-Female Classification 257 References 1. Badawi, A., Mahfouz, M., Tadross, R., Jantz, R.: Fingerprint Based Gender Classification. In: IPCV 2006, June 29 (2006) 2. Sudesh, G.: Sex Determination from Fingerprint Ridge Density. Internet Journal of Medical Update 2(2) (July-December 2007) 3. Hong, L., Wan, Y., Jain, A.K.: Fingerprint Image Enhancement: Algorithms and Performance Evaluation. IEEE Trans. Pattern Analysis and Machine Intelligence 20(8), 777–789 (1998) 4. Raymond thai, Fingerprint Image Enhancement and Minutiae Extraction (2003) 5. Hsu, C.-W., Chang, C.-C., Lin, C.-J.: A practical guide to support vector classification, http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf 6. Kralik, M., Novotny, V.: Epidermal ridge breadth: an indicator of age and sex in paleodermatoglyphics. Variability and Evolution 11, 5–30 (2003) 7. Moore, R.T.: Automatic fingerprint identification systems. In: Lee, H.C., Gaensslen, R.E. (eds.) Advances in Fingerprint Technology, p. 169. CRC Press, Boca Raton (1994) 8. Cummins, H., Midlo, C.: Fingerprints, Palms and Soles. An introduction to dermatoglyphics, p. 272. Dover Publ., New York (1961) 9. Acree M.A.: Is there a gender difference in fingerprint ridge density? Federal Bureau of Investigation, Washington, DC 20535-0001, USA 10. Jain, A.K., Chen, Y., Demerkus, M.: Pores and Ridges: High-Resolution Fingerprint Matching Using Level 3 Fetures. IEEE Transaction on Pattern Analysis and Matching Intelligence 29 (January 2007) 11. Rao, A.: A Taxonomy for Texture Description and Identification. Springer, New York (1990) 12. Ji, L., Yi, Z.: SVM-based Fingerprint Classification Using Orientation Field. In: Third International Conference on Natural Computation (ICNC 2007) (2007) 13. Support Vector Machines vs Artificial Neural Networks, http://www.svms.org/anns.html BSDT Multi-valued Coding in Discrete Spaces Petro Gopych Universal Power Systems USA-Ukraine LLC, 3 Kotsarskaya Street, Kharkiv 61012 Ukraine pmg@kharkov.com Abstract. Recent binary signal detection theory (BSDT) employs a 'replacing' binary noise (RBN). In this paper it has been demonstrated that RBN generates some related N-dimensional discrete vector spaces, transforming to each other under different network synchrony conditions and serving 2-, 3-, and 4-valued neurons. These transformations explain optimal BSDT coding/decoding rules and provide a common mathematical framework, for some competing types of signal coding in neurosciences. Results demonstrate insufficiency of almost ubiquitous binary codes and, in complex cases, the need of multi-valued ones. Keywords: neural networks, replacing binary noise, colored spaces, degenerate spaces, spikerate coding, time-rate coding, meaning, synchrony, criticality. 1 Introduction Data coding (a way of taking noise into account) is a problem whose solving depends essentially on the accepted noise model [1]. Recent binary signal detection theory (BSDT, [2-4] and references therein) employs an original replacing binary noise (RBN, see below) which is an alternative to traditional additive noise models. 
For this reason, BSDT coding has unexpected features, leading in particular to the conclusion that in some important cases almost ubiquitous binary codes are insufficient and multi-valued ones are essentially required. The BSDT defines 2N different N-dimensional vectors x with spin-like components i x = ±1, reference vector x = x0 representing the information stored in a neural network (NN), and noise vectors x = xr. Vectors x are points in a discrete N-dimensional binary vector space, N-BVS, where variables take values +1 and –1 only. As in the N-BVS additive noise is impossible, vectors x(d) in this space (damaged versions of x0) are introduced by using a 'replacing' coding rule based on the RBN, xr: ⎧ x i , if u i = 0, xi (d ) = ⎨ 0i d = ∑ u i / N , i = 1,..., N ⎩ x r , if u i = 1 (1) where ui are marks, 0 or 1. If m is the number of marks ui = 1 then d = m/N is a fraction of noise components in x(d) or a damage degree of x0, 0 ≤ d ≤ 1; q = 1 – d is a fraction of intact components of x0 in x(d) or an intensity of cue, 0 ≤ q ≤ 1. If d = m/N, the number of different x(d) is 2mCNm, CNm = N!/(N – m)!/m!; if 0 ≤ d ≤ 1, this number is ∑2mCNm = 3N (0 ≤ m ≤ N). If ui = 1 then, to obtain xi(d), the ith component of x0, xi0, is replaced by the ith component of noise, xir, otherwise xi0 remains intact (1). E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 258–265, 2009. © Springer-Verlag Berlin Heidelberg 2009 springerlink.com BSDT Multi-valued Coding in Discrete Spaces 259 2 BSDT Binary, Ternary and Quaternary Vector Spaces With respect to vectors x, vectors x(d) have an additional range of discretion because their ±1 projections have the property to be a component of a noise, xr, or a signal message, x0. To formalize this feature, we ascribe to vectors x a new range of freedom ― a 'color' (meaning) of their components. Thus, within the N-BVS (Sect. 1), we define an additional two-valued color variable (a discrete-valued non-locality) labeling the components of x. Thanks to that extension, components of x become colored (meaningful), e.g. either 'red' (noise) or 'black' (signal), and each x transforms into 2N different vectors, numerically equivalent but colored in different colors (in column 3 of Table 1, such 2N vectors are underlined). We term the space of these items an Ndimensional colored BVS, N-CBVS. As the N-CBVS comprises 2N vectors x colored in 2N ways, the total number of N-CBVS items is 2N×2N = 4N. Table 1. Two complete sets of binary N-PCBVS(x0) vectors (columns 1-5) and ternary N-TVS vectors (columns 5-8) at N = 3. m, the number of noise ('red,' shown in bold face in column 3) components of x(d); sm = 2mCNm, the number of different x(d) for a given m; 3N = ∑sm, the same for 0 ≤ m ≤ N. In column 3, all the x(d) for a given x0 (column 1) are shown; 2N two-color vectors obtained by coloring the x = x0 are here underlined. n, the number of zeros among the – components of N-TVS vectors; sn = 2N nC NN – n, the number of N-TVS vectors for a given n; N 3 = ∑sn, the same for 0 ≤ n ≤ N. In columns 3 and 7, table cells containing complete set of 2N one-color N-BVS vectors are between the shaded cells m = 3 and sm = 8, sn = 8 and n = 0. Positive, negative and zero vector components are designated as +, – and 0, respectively. –+– x0 1 m 2 0 1 2 3 –++ 0 1 2 N-PCBVS(x0) vectors 3 – + –, – + –, + + –, – + –, – – –, – + –, – + +, – + –, + + –, + – –, – – –, – + –, – + +, – – –, – – +, – + –, – + +, + + –, + + +, – + –, + + –, – + +, – – +, + – –, + + +, – – –, + – +. 
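A minimal sketch of the replacing rule (1) is given below. Numpy, the uniform random placement of the m marks and the uniform +1/-1 noise are assumptions of the sketch; the rule itself follows the definition above.

```python
# Sketch of Eq. (1): damage x0 by replacing a fraction d of its components
# with 'replacing' binary noise (RBN).
import numpy as np

def damage(x0, d, rng=np.random.default_rng()):
    """Return x(d): a copy of x0 with a fraction d of components replaced."""
    N = len(x0)
    m = int(round(d * N))                      # number of marks u_i = 1
    u = np.zeros(N, dtype=bool)
    u[rng.choice(N, size=m, replace=False)] = True
    xr = rng.choice([-1, +1], size=N)          # replacing binary noise x_r
    return np.where(u, xr, x0)                 # x_i(d) = x_r,i if u_i = 1, else x0,i

x0 = np.array([-1, +1, -1])                    # N = 3 reference vector
xd = damage(x0, d=2/3)                         # cue intensity q = 1 - d = 1/3
```

Enumerating every mask u with m ones and every noise fill reproduces the 2^m C(N,m) count of distinct x(d) per damage level and their total of 3^N quoted above.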
– + +, – + +, + + +, – + +, – – +, – + +, – + –, – + +, + + +, + – +, – – +, – + +, – – +, – + –, – – –, – + +, + + +, – + –, + + –, sm 4 1 6 12 3N 5 27 8 1 6 12 – + –, + + –, – + +, – – +, 8 + – –, + + +, – – –, + – +. Complete set of N synchronized neurons, 'dense' spike-time coding 27 3 33 sn 6 1 6 N-TVS vectors 7 n 8 3 2 0 0 0, + 0 0, 0 0 +, 0 + 0, – 0 0, 0 0 –, 0 – 0, 12 + + 0, + 0 +, 0 + +, 1 – + 0, – 0 +, 0 – +, + – 0, + 0 –, 0 + –, – – 0, – 0 –, 0 – –, 8 – + –, + + –, – + +, – – +, 0 + – –, + + +, – – –, + – +. 1 0 0 0, 3 6 + 0 0, 0 0 +, 0 + 0, 2 – 0 0, 0 0 –, 0 – 0, 12 + + 0, + 0 +, 0 + +, 1 – + 0, – 0 +, 0 – +, + – 0, + 0 –, 0 + –, – – 0, – 0 –, 0 – –, 8 – + –, + + –, – + +, – – +, 0 + – –, + + +, – – –, + – +. Complete set of N unsynchronized neurons, 'sparse' spike-rate coding 260 P. Gopych Of 4N colored vectors x(d) constituting the N-CBVS, (1) selects a fraction (subspace, subset) of them specific to particular x = x0 and consisting of 3N x(d) only. We refer to such an x0-specific subspace as an N-dimensional partial colored BVS, NPCBVS(x0). An N-PCBVS(x0) consists of particular x0 and all its possible distortions (corresponding vectors x(d) have m colored in red components, see column 3 of Table 1). As an N-BVS generating the N-CBVS supplies 2N vectors x = x0, the total number of different N-PCBVS(x0) is also 2N. 2N spaces N-PCBVS(x0) each of which consists of 3N items contain in sum 2N×3N = 6N vectors x(d) while the total amount of different x(d) is only 4N. The intersection of all the spaces N-PCBVS(x0) (sets of corresponding space points) and the unity of them are I x0∈N -BVS N -PCBVS( x0 ) = N -BVS, U N -PCBVS( x0 ) = N -CBVS. x0 ∈N -BVS (2) The first relation means that any two spaces N-PCBVS(x0) contain at least 2N common space points, constituting together the N-BVS (e.g., 'red' vectors x(d) in Table 1 column 3 rows m = 3). The second relation reflects the fact that spaces N-PCBVS(x0) are overlapped subspaces of the N-CBVS. Spaces N-PCBVS(x0) and N-CBVS consist of 3N and 4N items what is typically for N-dimensional spaces of 3- and 4-valued vectors, respectively. Of this an obvious insight arises ― to consider an N-PCBVS(x0) as a vector space 'built' for serving 3valued neurons (an N-dimensional ternary vector space, N-TVS; Table 1 columns 58) and to consider an N-CBVS as a vector space 'built' for serving 4-valued neurons (an N-dimensional quaternary vector space, N-QVS; Table 3 column 2). After accepting this idea it becomes clear that the BSDT allows an intermittent three-fold (2-, 3-, and 4-valued) description of signal data processing. 3 BSDT Degenerate Binary Vector Spaces Spaces N-PCBVS(x0) and N-CBVS are devoted to the description of vectors x(d) by means of explicit specifying the origin or 'meaning' of their components (either signal or noise). At the stage of decoding, BSDT does not differ the colors (meanings) of x(d) components and reads out their numerical values only. Consequently, for BSDT decoding algorithm, all N-PCBVS(x0) and N-CBVS items are color-free. By means of ignoring the colors, two-color x(d) are transforming into one-color x and, consequently, spaces N-CBVS and N-PCBVS(x0) are transforming, respectively, into spaces N-DBVS (N-dimensional degenerate BVS) and N-DBVS(x0) (N-dimensional degenerate BVS given x0). The N-DBVS(x0) and the N-DBVS contain respectively 3N and 4N items, though only 2N of them (related to the N-BVS) are different. As a result, N-DBVS and N-DBVS(x0) items are degenerate, i.e. 
in these spaces they may exist in some equivalent copies. We refer to the number of such copies related to a given x as its degeneracy degree, τ (1 ≤ τ ≤ 2N, τ = 1 means no degeneracy). When N-CBVS vectors x(d) lose their color identity, their one-color counterparts, x, are 'breeding' 2N times. For this reason, all N-DBVS space points have equal degeneracy degree, τ = 2N. When N-PCBVS(x0) vectors x(d) lose their color identity, their one-color counterparts, x, are 'breeding' the number of times which is specified by (1) BSDT Multi-valued Coding in Discrete Spaces 261 given x0 and coincides with the number of x(d) related to particular x in the NPCBVS(x0). Consequently, all N-DBVS(x0) items have different in general degeneracy degrees, τ(x,x0), depending on x as well as x0. As the number of different vectors x in an N-DBVS(x0) and the number of different spaces N-DBVS(x0) are the same (and equal to 2N), discrete function τ(x,x0) is a square non-zero matrix. As the number of x in an N-DBVS(x0) and the number of x(d) in corresponding N-PCBVS(x0) is 3N, the sum of matrix element values made over each row (or over each column) is also the same and equals 3N. Remembering that ∑2mCNm = 3 N (m = 0, 1, …, N), we see that in each the matrix's row or column the number of its elements, which are equal to 2m, is CNm (e.g. in Table 2, the number of 8s, 4s, 2s and 1s is 1, 3, 3 and 1, respectively). If x (columns) and x0 (rows) are ordered in the same way (as in Table 2) then matrix τ(x,x0) is symmetrical with respect to its main diagonal; if x = x0, then in the column x and the row x0 τ(x,x0)-values are equally arranged (as in the column and the row shaded in Table 2: 4, 2, 8, 4, 1, 4, 2, 2; corresponding sets of two-colored N-PCBVS(x0) vectors x(d) are shown in column 3 of Table 1). Degeneracy degree averaged across all the x given N-DBVS(x0) or across all the N-DBVS(x0) given x does not depend on x and x0: τa = <τ(x,x0)> = (3/2)N (for the example presented in Table 2, τa = (3/2)3 = 27/8 = 3.375). x x0 –+– ∑τ(x,x0) given x –+– ++– –++ ––+ +–– +++ ––– +–+ ++– –++ ––+ +–– +++ ––– +–+ ∑τ(x,x0) given x0 Table 2. Degeneracy degree, τ(x,x0), for all the vectors x in all the spaces N-DBVS(x0), N = 3. 2N = 8, the number of different x (and different x0); 3N = 27, the total number of x in an NDBVS(x0) or the number of two-colored x(d) in corresponding N-PCBVS(x0); positive and negative components of x and x0 are designated as + and –, respectively. Rows provide τ(x,x0) for all the x given N-DBVS(x0) (the row x0 = – + + is shaded); columns show τ(x,x0) for all the spaces N-DBVS(x0) given x (the column x = – + + is shaded). 8 4 4 2 2 2 4 1 4 8 2 1 4 4 2 2 4 2 8 4 1 4 2 2 2 1 4 8 2 2 4 4 2 4 1 2 8 2 4 4 2 4 4 2 2 8 1 4 4 2 2 4 4 1 8 2 1 2 2 4 4 4 2 8 27 27 27 27 27 27 27 27 27 27 27 27 27 27 27 27 3N Of the view of the set theory, the N-DBVS consists of 2N equivalent N-BVS; the intersection and the unity of them are the N-BVS itself. Each N-DBVS(x0) includes all the 2N N-BVS vectors each of which is repeated in an x0-specific number of copies, as it is illustrated by rows (columns) of Table 2 (N + 1 of these numbers are only different). 262 P. Gopych 4 BSDT Multi-valued Codes and Multi-valued Neurons For the description of a network of the size N in spaces above discussed, the BSDT defines 2-, 3- and 4-valued N-dimensional (code) vectors each of which represents a set of N firing 2-, 3- and 4-valued neurons, respectively (see Tables 1 and 3). 
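Before the code values of Table 3 are listed, the space sizes and degeneracy degrees quoted in Sects. 2 and 3 can be checked numerically. The short sketch below is an illustration added here, not part of the original text; it verifies the counts of Table 1 and the row structure of Table 2 for N = 3.

```python
# Numerical check of the N = 3 counts: 2^m * C(N, m) vectors x(d) per damage
# level m (Table 1, column 4), 3^N in total, and degeneracy degrees whose row
# sum is 3^N with mean (3/2)^N (Table 2).
from math import comb

N = 3
per_m = [2**m * comb(N, m) for m in range(N + 1)]
assert per_m == [1, 6, 12, 8] and sum(per_m) == 3**N

# Each row of tau(x, x0) contains C(N, m) entries equal to 2^m
# (for N = 3: one 8, three 4s, three 2s and one 1).
row = sorted((2**m for m in range(N + 1) for _ in range(comb(N, m))), reverse=True)
assert sum(row) == 3**N
assert sum(row) / 2**N == (3 / 2)**N           # average degeneracy 3.375
```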
23valued valued 3 –1, black –1, red +1, red +1, black –1, black/red no black/red +1, black/red –1, black/red +1, black/red 4 –1, signal –1, noise +1, noise +1, signal –1, signal/noise no signal/noise +1, signal/noise –1, signal/noise +1, signal/noise 5 inhibitory inhibitory excitatory excitatory inhibitory no spike excitatory inhibitory excitatory Space examples The target neuron's synapse Signal/noise numerical code, SNNC 2 –2 –1 +1 +2 –1 0 +1 –1 +1 Colored numerical code, CNC Numerical code, NC 1 4valued Type of neurons Table 3. Values/meanings of code vector components for BSDT neurons and codes. Shaded table cells display code items implementing the N-BVS (its literal implementation by 3- and 4valued neurons is possible under certain conditions only; parentheses in column 6 indicate this fact, see also process 6 in Fig. 1); in column 5 the content/meaning of each code item is additionally specified in neuroscience terms. 6 N-QVS, N-CBVS, N-PCBVS(x0), (N-BVS) N-TVS, (N-BVS) N-BVS, N-DBVS, N-DBVS(x0) In Table 3 numerical code (NC, column 2) is the simplest, most general, and meaning-irrelevant: its items (numerical values of code vector components) may arbitrary be interpreted (e.g., for 4-valued neurons, code values ±1 and ±2 may be treated as related to noise and signal, respectively). Colored numerical code (CNC, column 3) combines binary NC with binary color code; for 3- and 2-valued neurons, CNC vector components are ambiguously defined (they are either black or red, 'black/ red'). Signal/noise numerical code (SNNC, column 4) specifies the colors of the CNC with respect to a signal/noise problem: 'black' and 'red' units are interpreted as ones that represent respectively signal and noise components of SNNC vectors (marks 'signal' and 'noise' reflect this fact). As in the case of the CNC, for 3- and 2-valued neurons, SNNC code vector components say only that (with equal probability) they can represent either signal or noise (this fact reflects the mark 'signal/ noise'). Further code specification (by adding a neuroscience meaning to each signal, noise, or signal/noise numerical code item) is given in column 5: it is assumed [3] that vector components –1 designate signal, noise or signal/noise spikes, affecting the inhibitory synapses of target neurons, while vector components +1 designate spikes, affecting the excitatory synapses of target neurons (i.e. BSDT neurons are intimately embedded into their environment, as they always 'know' the type of synapses of their postsynaptic neurons); zero-valued components of 3-valued vectors designate 'silent' or 'dormant' BSDT Multi-valued Coding in Discrete Spaces 263 neurons generating no spikes at the moment and, consequently, not affecting on their target neurons at all (marks 'no signal/noise' and 'no spike' in columns 4 and 5 reflect this fact). BSDT spaces implementing different types of coding and dealing with different types of neurons are classified in column 6. The unity of mathematical description of spiking and silent neurons (see Table 3) explains why BSDT spin-like +1/–1 coding cannot be replaced by popular 1/0 coding. Ternary vectors with large fractions of zero components may also contribute to explaining the so-called sparse neuron codes (e.g., [5-6], Table 1 columns 5-8). 
Quaternary as well as binary vectors without zero components (and, perhaps, ternary vectors with small fractions of zero components) may contribute to explaining the so-called dense neuron codes (a reverse counterpart to sparse neuron codes, Table 1 columns 1-5). 5 Reciprocal Transformations of BSDT Vector Spaces The major physical condition explaining the diversity of BSDT vector spaces and the need of their transformations is the network's state of synchrony (Fig. 1). We understand synchrony as simultaneous (within a time window/bin ∆t ~ 10 ms) spike firing of N network neurons (cf. [7]). Unsynchronized neurons fire independently at instants t1, …, tN whose variability is much greater than ∆t. As the BSDT imposes no constraints on network neuron space distribution, they may arbitrary be arranged occupying positions even in distinct brain areas (within a population map [8]). Unsynchronized and synchronized networks are respectively described by BSDT vectors with and without zero-valued components and consequently each such an individual vector represents a pattern of network spike activity at a moment t ± ∆t. Hence, the BSDT deals with network spike patterns only (not with spike trains of individual neurons [9]) while diverse neuron wave dynamics, responsible in particular for changes in network synchrony [10], remains out of the consideration. For unsynchronized networks (box 2 in Fig. 1), spike timing can in principle not be used for coding; in this case signals of interest may be coded by the number of network spikes randomly emerged per a given time bin ― that is spike-rate or firingrate population coding [9], implementing the independent-coding hypothesis [8]. For partially ordered networks, firing of their neurons is in time to some extent correlated and such mutual correlations can already be used for spike-time coding (e.g., [1,9]) ― that is an implementation of the coordinated-coding hypothesis [8]. The case of completely synchronized networks (box 4 in Fig. 1) is an extreme case of spike-time coding describable by BSDT vectors without zero components. Time-to-first-spike coding (e.g., [11]) and phase coding (e.g., [12]) may be interpreted as particular implementations of such a consideration. Figure 1 shows also the transformations (circled dashed arrows) related to changes in network synchrony and accompanied by energy exchange (vertical arrows). BSDT optimal decoding probability (that is equal to BSDT optimal generalization degree) is the normalized number of N-DBVS(x0) vectors x for which their Hamming distance to x0 is smaller than a given threshold. Hence, the BSDT defines its coding/decoding [2,3] and generalization/codecorrection [2] rules but does not specify mechanisms implementing them: for BSDT applicability already synchronized/unsynchronized networks are required. 264 P. Gopych 1 The global network's environment Energy input 2 Energy dissipation N-TVS Unsynchronized neurons Entropy 1 'The edge of chaos' 5 N-PCBVS(x0) N-DBVS(x0) N-CBVS N-PCBVS(x0) N-DBVS(x0) 2 Synchronized neurons ... N-PCBVS(x0) 3 1 2 Complexity N-BVS N-BVS 6 3 4 N-TVS ... 4 N-DBVS(x0) N 2 Fig. 1. Transformations of BSDT vector spaces. Spaces for pools of synchronized (box 4) and unsynchronized (box 2) neurons are framed separately. 
In box 4, right-most numbers enumerate different spaces of the same type; framed numbers mark space transformation processes (arrows): 1, coloring all the components of N-BVS vectors; 2, splitting the N-CBVS into 2N different but overlapping spaces N-PCBVS(x0); 3, transformation of two-color N-PCBVS(x0) vectors x(d) into one-color N-TVS vectors (because of network desynchronization); 4, equalizing the colors of all the components of all the N-PCBVS(x0) vectors; 5, transformation of one-color NTVS vectors into two-color N-PCBVS(x0) vectors; 6, random coincident spiking of unsynchronized neurons. Vertical left/right-most arrows remind trends in entropy and complexity for a global network, containing synchronized and unsynchronized parts and open for energy exchange with the environment (box 1). Box 3 comprises rich and diverse neuron individual and collective nonlinear wave/oscillatory dynamics implementing synchrony/unsynchrony transitions (i.e. the global network is as a rule near its 'criticality'). In most behavioral and cognitive tasks, spike synchrony and neuron coherent wave/oscillatory activity are tightly entangled (e.g. [10,13]) though which of these phenomena is the prime mechanism, for dynamic temporal and space binding (synchronization) of neurons into a cell assembly, remains unknown. If gradual changes of the probability of occurring zero-valued components of ternary vectors is implied then the BSDT is consistent in general with scenarios of gradual synchrony/ unsynchrony transitions, but we are here interesting in such abrupt transitions only. This means that, for the BSDT biological relevance, it is needed to take the popular stance according to which the brain is a very large and complex nonlinear dynamic system having long-distant and reciprocal connectivity [14], being in a metastable state and running near its 'criticality' (box 3 in Fig. 1). If so, then abrupt unsynchronyto-synchrony transitions may be interpreted as the network's 'self-organization,' 'bifurcation,' or 'phase transition' (see ref. 15 for review) while abrupt synchrony decay may be considered as reverse phase transitions. This idea stems from statistical physics and offers a mechanism contributing most probably to real biological processes underlying BSDT space transformations shown in Fig. 1 (i.e. arrows crossing 'the edge of chaos' may have real biological counterparts). Task-related brain activity takes <5% of energy consumed by the resting human brain [16]; spikes are high energy-consuming and, consequently, rather seldom and important brain signals [17]. Of these follows that the BSDT, as a theory for spike computations, describes though rather small but perhaps most important ('high-level') fraction of brain activity responsible for maintaining behavior and cognition. BSDT Multi-valued Coding in Discrete Spaces 265 6 Conclusion A set of N-dimensional discrete vector spaces (some of which are 'colored' and some degenerate) has been introduced by using original BSDT coding rules. These spaces, serving 2-, 3- and 4-valued neurons, ensure optimal BSDT decoding/generalization and provide a common description of spiking and silencing neurons under different conditions of network synchrony. The BSDT is a theory for spikes/impulses and can describe though rather small but probably most important ('high-level') fraction of brain activity responsible for human/animal behavior and cognition. 
Results demonstrate also that for the description of complex living or artificial digital systems, where conditions of signal synchrony and unsynchrony coexist, binary codes are insufficient and the use of multi-valued ones is essentially required. References 1. Averbeck, B.B., Latham, P.E., Pouget, A.: Neural Correlations, Population Coding and Computation. Nat. Rev. Neurosci. 7, 358–366 (2006) 2. Gopych, P.M.: Generalization by Computation through Memory. Int. J. Inf. Theo. Appl. 13, 145–157 (2006) 3. Gopych, P.M.: Foundations of the Neural Network Assembly Memory Model. In: Shannon, S. (ed.) Leading Edge Computer Sciences, pp. 21–84. Nova Science, New York (2006) 4. Gopych, P.M.: Minimal BSDT Abstract Selectional Machines and Their Selectional and Computational Performance. In: Yin, H., Tino, P., Corchado, E., Byrne, W., Yao, X. (eds.) IDEAL 2007. LNCS, vol. 4881, pp. 198–208. Springer, Heidelberg (2007) 5. Kanerva, P.: Sparse Distributed Memory. MIT Press, Cambridge (1988) 6. Olshausen, B.A., Field, D.J.: Sparse Coding of Sensory Inputs. Curr. Opin. Neurobiol. 14, 481–487 (2004) 7. Engel, A.K., Singer, W.: Temporal Binding and the Neural Correlates of Sensory Awareness. Trends Cog. Sci. 5, 16–21 (2001) 8. de Charms, C.R., Zador, A.: Neural Representations and the Cortical Code. Ann. Rev. Neurosci. 23, 613–647 (2000) 9. Tiesinga, P., Fellous, J.-M., Sejnowski, T.J.: Regulation of Spike Timing in Visual Cortical Circuits. Nat. Rev. Neurosci. 9, 97–109 (2008) 10. Buzáki, G., Draghun, A.: Neuronal Oscillations in Cortical Networks. Science 304, 1926– 1929 (2004) 11. Johansson, R.S., Birznieks, I.: First Spikes in Ensemble of Human Tactile Afferents Code Complex Spatial Fingertip Events. Nat. Neurosci. 7, 170–177 (2004) 12. Jacobs, J., Kahana, M.J., Ekstrom, A.D., Fried, I.: Brain Oscillations Control Timing of Single-Neuron Activity in Humans. J. Neurosci. 27, 3839–3844 (2007) 13. Varela, F., Lachaux, J.-P., Rodriguez, E., Martinerie, J.: The Brainweb: Phase Synchronization and Large-Scale Integration. Nat. Rev. Neurosci. 2, 229–239 (2001) 14. Sporns, O., Chialvo, D.R., Kaiser, M., Hilgetag, C.C.: Organization, Development and Function of Complex Brain Networks. Trends Cog. Sci. 8, 418–425 (2004) 15. Werner, G.: Perspectives on the Neuroscience of Cognition and Consciousness. BioSystems 87, 82–95 (2007) 16. Fox, M.D., Raichle, M.E.: Spontaneous Fluctuations in Brain Activity Observed with Functional Magnetic Resonance Imaging. Nat. Rev. Neurosci. 8, 700–711 (2007) 17. Lennie, P.: The Cost of Cortical Computation. Curr. Biology 13, 493–497 (2003) A Fast and Distortion Tolerant Hashing for Fingerprint Image Authentication Thi Hoi Le and The Duy Bui Faculty of Information Technology Vietnam National University, Hanoi hoilt@vnu.edu.vn, duybt@vnu.edu.vn Abstract. Biometrics such as fingerprint, face, eye retina, and voice offers means of reliable personal authentication is now a widely used technology both in forensic and civilian domains. Reality, however, makes it difficult to design an accurate and fast biometric recognition due to large biometric database and complicated biometric measures. In particular, fast fingerprint indexing is one of the most challenging problems faced in fingerprint authentication system. In this paper, we present a specific contribution to advance the state of the art in this field by introducing a new robust indexing scheme that is able to fasten the fingerprint recognition process. 
Keywords: fingerprint hashing, fingerprint authentication, error correcting code, image authentication. 1 Introduction With the development of digital world, reliable personal authentication has become a big interest in human computer interface activity. National ID card, electronic commerce, and access to computer networks are some scenarios where declaration of a person’s identity is crucial. Existing security measures rely on knowledge-based approaches like passwords or token-based such as magnetic cards and passports are used to control access to real and virtual societies. Though ubiquitous, such methods are not very secure. More severely, they may be shared or stolen easily. Passwords and PIN numbers may be even stolen electronically. Furthermore, they cannot differentiate between authorized user and fraudulent imposter. Otherwise, biometrics has a special characteristic that user is the key; hence, it is not easily compromised or shared. Therefore, biometrics offers means of reliable personal authentication that can address these problems and is gaining citizen and government acceptance. Although significant progress has been made in fingerprint authentication system, there are still a number of research issues that need to be addressed to improve the system efficiency. Automatic fingerprint identification which requires 1-N matching is usually computationally demanding. For a small database, a common approach is to exhaustively match a query fingerprint against all the fingerprints in database [21]. For a large database, however, it is not desirable in practice without an effective fingerprint indexing scheme. There are two technical choices to reduce the number of comparisons and consequently to reduce the response time of the identification process: classification and E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 266–273, 2009. © Springer-Verlag Berlin Heidelberg 2009 springerlink.com A Fast and Distortion Tolerant Hashing for Fingerprint Image Authentication 267 indexing techniques. Traditional classification techniques (e.g. [5], [16]) attempt to classify fingerprint into five classes: Right Loop (R), Left Loop (L), Whorl (W), Arch (A), and Tented Arch (A). Due to the uneven natural distribution, the number of classes is small and real fingerprints are unequally distributed among them: over 90% of fingerprints fall in to only three classes (Loops and Whorl) [18]. This is resulted in the inability of reducing the search space enough of such systems. Indexing technique performs better than classification in terms of the size of space need to be searched. Fingerprint indexing algorithms select most probable candidates and sort them by the similarity to the query. Many indexing algorithms have been proposed recently. A.K. Jain et al [11] use the features around the core point of a Gabor filtered image to realize indexing. Although this approach makes use of global information (core point) but the discrimination power of just one core is limited. In [2], the singular point (SP) is used to estimate the search priority which is resulted in the mean search space below 4% the whole dataset. However, detecting singular point is a hard problem. Some fingerprints even do not have SPs and the uncertainty of SP location is large [18]. Besides, several attempts to account for fingerprint indexing have shown the improvement. R. Cappelli et al. [6] proposed an approach which reaches the reasonable performance and identification time. R.S Germain et al. 
[9] use the triplets of minutiae in their indexing procedure. J.D Boer et al. [3] make effort in combining multiple features (orientation field, FingerCode, and minutiae triplets). One of another key point is to make indexing algorithm more accurate, fingerprint distortion must be considered. There are two kinds of distortion: transformation distortion and system distortion. In particular, due to fingerprint scanners can only capture partial fingerprints; some minutiae – primary fingerprint features - are missing during acquisition process. These distortions of fingerprint including minutia missing cause several problems: (i) the number of minutia points available in such prints is few, thus reducing its discrimination power; (ii) loss of singular points (core and delta) is likely [15]. Therefore, a robust indexing algorithm independent of such global features is required. However, in most of existed indexing scheme mentioned above, they perform indexing by utilizing these global feature points. In this paper, we propose a hashing scheme that can achieve high efficiency in hashing performance by perform hashing on localized features that are able to tolerate distortions. This hashing scheme can perform on any localized features such as minutiae or triplet. To avoid alignment, in this paper we use the triplet feature introduced by Tsai-Yang Jea et. al. [15]. One of our main contributions is that we present a definition of codeword for such fingerprint feature points and our hashing scheme performs on those codewords. By producing codewords for feature points, this scheme can tolerate the distortion of each point. Moreover, to reduce number of candidates for matching stage more efficiently, a randomized hash function scheme is used to reduce the output collisions. The paper is organized as follows. Section 2 introduces some notions and our definition of codeword for fingerprint feature points. Section 3 presents our hashing scheme based on these codewords and a scheme to retrieve fingerprint from hashing results. We present experiment results of our scheme in Section 4. 268 T.H. Le and T.D. Bui 2 Preliminaries 2.1 Error Correction Code For a given choice of metric d, one can define error correction codes in the corresponding space M. A code is a subset C={w1,…,wk} ⊂ M. The set C is sometimes called codebook; its K elements are the codewords. The (minimum) distance of a code is the smallest distance d between two distinct codewords (according to the metric d). Given a codebook C, we can define a pair of functions (C,D). The encoding function C is an injective map for the elements of some domain of size K to the elements of C. The decoding function D maps any element w ∈ M to the pre-image C-1[wk] of the codeword wk that minimizes the distance d[w,wk]. The error correcting distance is the largest radius t such that for every element w in M there is at most one codeword in the ball of radius t centered on w. For integer distance functions we have t=(d-1)/2. A standard shorthand notation in coding theory is that of a (M,K,t)-code. 2.2 Fingerprint Feature Point Primary feature point of fingerprint is called minutia. Minutiae are the various ridge discontinuities of a fingerprint. There are two types of widely used minutiae which are bifurcations and endings (Fig.1). Minutia contains only local information of fingerprints. Each minutia is represented by its coordinates and orientation. Fig. 1. (left) Ridge bifurcation. (b) Ridge endings [15]. Secondary feature is a vector of five-elements (Fig.2). 
For each minutiae Mi(xi,yi,θi) and its two nearest neighbors N0(xn0,yn0,θn0) and N1(xn1,yn1,θn1), the secondary feature is constructed by form a vector Si(ri0 ,ri1,φi0,φi1,δi) in which ri0 and ri1 are the Euclidean distances between the central minutia Mi and its neighbors N0 and N1 respectively. φik is the orientation difference between Mi and Nk, where k is 0 or 1. δi represents the acute angle between the line segments MiN0 and MiN1. Note that N0 and N1 are the two nearest neighbors of the central minutia Mi and ordered not by their Euclidean distances but by satisfying the equation: N0MixN1Mi ≥ 0. A Fast and Distortion Tolerant Hashing for Fingerprint Image Authentication 269 Fig. 2. Secondary feature of Mi. Where ri0 and ri1 are the Euclidean distances between central minutia Mi and its neighbors N0 and N1 respectively. φik is the orientation difference between Mi and Nk where k is 0 or 1. δi represents the acute angle between MiN0 and MiN1 [15]. N0 is the first and N1 is the second minutia encountered while traversing the angle ∠N0MiN1. For the given matched reference minutiae pair pi and qj, it is said that minutiae pi(ri,i’,Φi,i’;,θi,i’) matches qj(rj,j’,Φj,j’,θj,j’), if qj is within the tolerance area of pi. Thus for given threshold functions Thldr(.), ThldΦ(.), and Thldθ(.), |ri,i’ – rj,j’| ≤ Thldr(ri,i’), |Φi,i’ – Φj,j’| ≤ ThldΦ(Φi,i’) and |θi,i’ – θj,j’| ≤ Thldθ(θi,i’). Note that the thresholds are not predefined values but are adjustable according to rii, and rjj. In this literature, we call this secondary fingerprint feature point as feature point. 2.3 Our Error Correction Code for Fingerprint Feature Point Most existing error correction schemes are respect to Hamming distance and used to correct the message at bit level (e.g. parity check bits, repetition scheme, CRC…). Therefore, we define a new error correction scheme for fingerprint feature point. In particular, we consider each feature x as a corrupted codeword and try to correct to the correct codeword by using an encoding function C(x). Definition 1. Let x ∈ RD, and C(x) is an encoding function of the error correction scheme. We define C(x) = (q(x-t)|q(x)|q(x+t)) where q(x) is a quantization function of x with quantization step t. We call the output of C(x) (cx-t,cx,cx+t) is the codeword set of x and cx=q(x) is the nearest codeword of x or codeword of x for short. Lemma 1. Let x ∈ R, given a tolerant threshold t ∈ R. For every y such that |x-y| ≤ t, then the codeword of y cy = q(y) takes one of three elements in the codeword set of x. Lemma 1 can be proved easily by some algebraic transformations. In our approach, we generate codeword set for every dimension of template fingerprint feature point q and only codeword for every dimension of query point p. Following lemma 1 and definition of two matched feature points, we can see that if p and q are corresponding 270 T.H. Le and T.D. Bui feature points of two versions from one fingerprint, codeword of q will be an element in codeword set of p. 3 Codeword-Based Hashing Scheme We present a fingerprint hash generation scheme based on the codeword of the feature point. The key idea is: first, “correct” the error feature point to its codeword, then use that codeword as the input of a randomized hash function which can scatter the input set and ensure that the probability of collision of two feature points is closely related to the distance between their corresponding coordinate pairs (refer to definition of two matched feature points). 3.1 Our Approach Informal description. 
Fingerprint features stored in database can be considered as a very large set. Moreover, the distribution of fingerprint feature points is not uniform and unknown. Therefore, we want to design a randomized hash scheme such that for the large input sets of fingerprint feature points, it is unlikely that elements collide. Fortunately, some standard hash functions (e.g. MD5, SHA) can be made randomized to satisfy the property of target collision resistance. Our scheme works as follows. First, the message bits are permuted by a random permutation and then the hash of resulting message is computed (by a compression function e.g. SHA). A permutation is a special kind of block cipher where the output and input have the same length. A random permutation is widely used in cryptography since it possesses two important properties: randomness and efficient computation if the key is random. The basic idea so far is for any given set of inputs, this algorithm will scatter the inputs among the range of the function well based on a random permutation so that the probabilistic expectation of the output will be distributed more randomly. To ensure the error tolerant property, we perform hashing on codeword (for query points) and on the whole codeword set (for template points) instead of the feature point itself. Follow the lemma 1 we have the query point y and the template point x are matched if and only if codeword of y is an element in codeword set of x . Formal description. We set up our hashing scheme as follows: 1. Choose random dimensions from (1,2,…,D); by this way, we adjust the trade off between the collision of hashing values and the space of our database. 2. Choose an tolerant threshold t and appropriate metric ld for selected dimensions. For each template point p: 1. Generate the codeword set for pd. These values are mapped to binary strings. Binary strings in the codeword set of selected dimensions are then concatenated to form L3 binary strings mt for i=1,…,L3 where L is the number of selected dimensions. 2. Each mi is padded with zero bits to form a n – bit message mi. 3. Generate a random key K for a permutation π of {0,1}n. 4. mi is permuted by π and the resulting message is hashed using a compression function such as SHA.. Let shi = SHA( π (mi)) for i=1,…,L3. A Fast and Distortion Tolerant Hashing for Fingerprint Image Authentication 271 5. There are at most L3 hash values for p. These values can be stored in the same hash table as well as separate ones. For query point q: 1. Compute the nearest codeword value for every selected dimension of q. Then map the value to a binary string. 2. Binary strings of selected dimensions are then concatenated. 3. Perform steps 2 and 4 as for template point. Note that there is only one binary string for query point q. 4. Return all the points which are sharing identical hash value with q. For query evaluation, all candidate matches are returned by our hashing for every query feature point. Hence, each query fingerprint is treated as a bag of points, simulating a multi-point query evaluation. To do this efficiently, we use an identical framework as Ke et al. [17], in that, we maintain two auxiliary index structure –File Table (FT) and Keypoint Table (KT) – to map feature points to their corresponding fingerprint; an entry in KT consists of the file ID (index location of FT) and feature point information. 
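To make the scheme above concrete, here is a minimal Python sketch of the codeword-based hashing, written under stated assumptions rather than as the authors' implementation: the paper does not fix the quantisation function q, the binary encoding, the zero-padding or the keyed permutation π, so a uniform step-t quantiser, a plain string encoding, a seeded byte shuffle and SHA-256 stand in for them, and the function and variable names are ours. Under this reading each template point stores one hash per combination of codewords over the L selected dimensions (three per dimension, i.e. 3^L strings, which is how we interpret the 'L3' of the formal description), while a query point produces a single hash; by Lemma 1, a query point within tolerance t of a template point then collides with one of the template's stored hashes.

```python
import hashlib, itertools, random

def q(x, t):
    """Quantisation with step t (assumed uniform; the paper leaves q unspecified)."""
    return int(x // t)

def codeword_set(x, t):
    """Definition 1: C(x) = (q(x - t), q(x), q(x + t))."""
    return (q(x - t, t), q(x, t), q(x + t, t))

def keyed_hash(codewords, key):
    """Permute the encoded codewords with a keyed permutation, then hash (SHA-256 here)."""
    msg = bytearray("|".join(str(c) for c in codewords).encode())
    perm = list(range(len(msg)))
    random.Random(key).shuffle(perm)              # stand-in for the random permutation pi
    return hashlib.sha256(bytes(msg[i] for i in perm)).hexdigest()

def template_hashes(point, dims, t, key):
    """All hash values stored for a template point: one per combination of
    codewords over the selected dimensions (3 per dimension, i.e. 3**len(dims))."""
    sets = [codeword_set(point[d], t) for d in dims]
    return {keyed_hash(combo, key) for combo in itertools.product(*sets)}

def query_hash(point, dims, t, key):
    """Single hash for a query point: only the nearest codeword per dimension."""
    return keyed_hash(tuple(q(point[d], t) for d in dims), key)

# toy check with five-element secondary features (r0, r1, phi0, phi1, delta):
# a slightly distorted query collides with one of the template's stored hashes
t, dims, key = 4.0, (0, 1, 4), 1234
template = (37.2, 52.9, 0.61, 1.12, 0.35)
query    = (39.0, 50.1, 0.60, 1.10, 0.34)
assert query_hash(query, dims, t, key) in template_hashes(template, dims, t, key)
```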
The template fingerprints that have points sharing identical hash values (collisions) with the query version are then ranked by the number of similarity points. Only top T templates are selected for identifying the query fingerprint by matching. Thus, the search space is greatly reduced. The candidate selection process requires only linear computational cost so that it can be applied for online interactive querying on large image collections. 3.2 Analysis Space complexity. For an automate fingerprint identification, we must allocate extra storage for the hash values of template fingerprints. Assume that the number of Ddimension feature points extracted from template version is N. With L selected dimensions, the total extra space required for one template is O(L3.N). Thus, the hashing scheme requires a polynomial-sized data structure that allows sub-linear time retrievals of near neighbors as shown in following section. Time complexity. On query fingerprint Y with N feature points in a database of M fingerprints, we must: compute the codeword for each feature point which takes O(N) due to quantization hashing functions required constant time complexity; compute the hash value which takes O(time(f,x)) where f is the hash function used and time(f,x) is the time required by function f with input x; and compute the similarity scores of templates that has any point sharing identical hash value with the query fingerprint, requiring time O(M.N/2m). Thus the key quantity is O(M.N/2n + time(f,x)) which is approximately equivalent to 1/2n computations of exhaustive search. 4 Experiments We evaluate our method by testing it on Database FVC2004 [19] which consists of 300 images, 3 prints each of 100 distinct fingers. DB1_A database contains the partial fingerprint templates of various sizes. These images are captured by a optical sensor with a resolution of 500dpi, resulting in images of 300x200 pixels in 8 bit gray scale. 272 T.H. Le and T.D. Bui The full original templates are used to construct the database, while the remaining 7 impressions are used to hashing. We use the feature extraction algorithm described in [15] in our system. However, the authors do not mention in detail how to determine the threshold of error tolerances. Therefore, in our experiments, we assume that the error correction distance t is fixed for all fingerprint feature points. This assumption makes the implementation not optimal so that two feature points are recognized as “match” by matching algorithm proposed in [15] may not share the same hash value in our experiments. The Euclidean distances between the corresponding dimensions of feature vectors are used in quantization hashing function. Table 1 shows the Correct Index Power (CIP) which is defined as the percentage of correctly indexed queries based on the percentage of hypotheses that need to be searched in the verification step. Although our implementation is not optimal, scheme still achieves good CIP result. As can be easily seen, the larger search percentage is, the better results are obtained. It indicates that the optimal implementation can improve the result performance. Table 1. Correct Indexing Power of our algorithm Correct Index Power Search Percentage CIP 5% 10% 15% 20% 80% 87% 94% 96% Compare with some published experiments in the literature, at the search percentage 10%, [13] comes up with 84.5% CIP and [23] reaches a result of 92.8% CIP. 
However, unlike previous works, our scheme is much simpler and by adjusting t carefully, it is promising that our scheme will reach 100% CIP with low search percentage. 5 Conclusion In this paper, we have presented a new robust approach to perform indexing on fingerprint which provides both accurate and fast indexing. However, there is still some works need to be done to in order to make the system more persuasive and to obtain the optimal result like studying optimal choices of t parameter. Moreover, to guarantee the privacy of fingerprint template in any indexing scheme is another important problem that must be considered. References [1] Bazen, A.M., Gerez, S.H.: Fingerprint matching by thin-plate spline modeling of elastic deformations. Pattern Recognition 36, 1859–1867 (2003) [2] Bazen, A.M., Verwaaijen, G.T.B., Garez, S.H., Veelunturf, L.P.J.: A correlation-based fingerprint verification system. In: ProRISC 2000 Workshops on Circuits, Systems and Signal Processing (2000) A Fast and Distortion Tolerant Hashing for Fingerprint Image Authentication 273 [3] Boer, J., Bazen, A., Cerez, S.: Indexing fingerprint database based on multiple features. In: ProRISC 2001 Workshop on Circuits, Systems and Singal Processing. (2001) [4] Brown, L.: A survey of image registration techniques. ACM Computing Surveys (1992) [5] Cappelli, R., Lumini, A., Maio, D., Maltoni, D.: Fingerprint Classification by Directional Image Partitioning. IEEE Trans. on PAMI 21(5), 402–421 (1999) [6] Cappelli, R., Maio, D., Maltoni, D.: Indexing fingerprint databases for efficicent 1: n matching. In: Sixth Int.Conf. on Control, Automation, Robotics and Vision, Singapore (2000) [7] Choudhary, A.M., Awwal, A.A.S.: Optical pattern recognition of fingerprints using distortion-invariant phase-only filter. In: Proc. SPIE, vol. 3805(20), pp. 162–170 (1999) [8] Fingerprint verification competition, http://bias.csr.unibo.it/fvc2002/ [9] Germain, R., Califano, A., Colville, S.: Fingerprint matching using transformation parameter clustering. IEEE Computational Science and Eng. 4(4), 42–49 (1997) [10] Gonzalez, Woods, Eddins: Digital Image Processing, Prentice Hall, Englewood Cliffs (2004) [11] Jain, A., Ross, A., Prabhakar, S.: Fingerprint matching using minutiae texture features. In: International Conference on Image Processing, pp. 282–285 (2001) [12] Jain, A., Prabhakar, S., Hong, L., Pankanti, S.: Filterbank-based fingerprint matching. Transactions on Image Processing 9, 846–859 (2000) [13] Jain, A.K., Prabhakar, S., Hong, L., Pankanti, S.: FingerCode: a filterbank for fingerprint representation and matching. In: CVPR IEEE Computer Society Conference (2), pp. 187– 193 (1999) [14] Jea, T., Chavan, V.K., Govindaraju, V., Schneider, J.K.: Security and matching of partial fingerprint recognition systems, pp. 39–50. SPIE (2004) [15] Tsai-Yang, J., Venu, G.: A minutia-based partial fingerprint recognition system. Pattern Recognition 38(10), 1672–1684 (2005) [16] Karu, K., Jain, A.K.: Fingerprint Classification. Pattern Recognition 18(3), 389–404 (1996) [17] Ke, Y., Sukthankar, R., Huston, L.: An efficient parts-based near duplicate and sub-image retrieval system. In: MM International Conference on Multimedia, pp. 869–876 (2004) [18] Liang, X., Asano, T.,, B.: Distorted Fingerprint indexing using minutiae detail and delaunay triangle. In: ISVD 2006, pp. 217–223 (2006) [19] Maio, D., Maltoni, D., Cappelli, R., Wayman, J.L., Jain, A.K.: FVC2004: Third Fingerprint Verification Competition. In: Proc. ICBA, Hong Kong, July 2004, pp. 
1–7 (2004) [20] Nandakumar, K., Jain, A.K.: Local correlation-based fingerprint matching. In: Indian Conference on Computer Vision, Graphics and Image Processing, pp. 503–508 (2004) [21] Nist fingerprint vendor technology evaluation, http://fpvte.nist.gov/ [22] Ruud, B., Connell, J.H., Pankanti, S., Ratha, N.K., Senior, A.W.: Guide to Biometrics. Springer, Heidelberg (2003) [23] Liu, T., Zhang, G.Z.C., Hao, P.: Fingerprint Indexing Based on Singular Point Correlation. In: ICIP 2005 (2005) The Concept of Application of Fuzzy Logic in Biometric Authentication Systems Anatoly Sachenko, Arkadiusz Banasik, and Adrian Kapczyński Silesian University of Technology, Department of Computer Science and Econometrics, F. D. Roosevelt 26-28, 41-800 Zabrze, Poland sachenkoa@yahoo.com, arkadiusz.banasik@polsl.pl, adrian.kapczynski@polsl.pl Abstract. In the paper the key topics concerning architecture and rules of working of biometric authentication systems were described. Significant role is played by threshold which constitutes acceptance or rejection given authentication attempt. Application of elements of fuzzy logic was proposed in order to define threshold value of authentication system. The concept was illustrated by an example. Keywords: biometrics, fuzzy logic, authentication systems. 1 Introduction The aim of this paper is to present on the basis of theoretical foundations of fuzzy logic and how to use it as a hypothetical, single-layered biometric authentication system. In the first part it will be provided biometric authentication systems primer and the fundamentals of fuzzy logic. On that basis the idea of use of fuzzy logic in biometric authentication systems was formulated. 2 Biometric Authentication Systems Primer Biometric authentication system is basically a system which identifies patterns and carries out the objectives of authentication by identifying the authenticity of physical or behavioral characteristics possessed by the user [5]. The logical system includes the following modules [1]: enrollment module and identification or verification module. The first module is responsible for the registration of user ID and the association of this identifier with the biometric pattern (called biometric template), which is understood as a vector of vales presented in an appropriate form, as a result of processing the collected by the biometric device, raw human characteristics. The identification Module (verification) is responsible for carrying out the collection and processing of raw biometric characteristics in order to obtain a biometric template, which is compared with patterns saved by the registration module. Those modules cooperate with each other and carry out the tasks related to the collection of raw biometric data, features extraction and comparison of features and finally decision making. The session with biometric systems begins with taking anatomical or behavioral features by the biometric reader. Biometric reader generates in n-dimensional biometric E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 274–279, 2009. springerlink.com © Springer-Verlag Berlin Heidelberg 2009 The Concept of Application of Fuzzy Logic 275 data. Biometric data represented by the form of vector features are the output of signals processing algorithms. Then, vector features are identified, and the result is usually presented in the form of the match (called confidence degree). 
On the basis of their relevance and value of the threshold (called threshold value) the module responsible for decision making produce output considered as acceptance or rejection. The result of the work of biometric system is a confirmation of the identity of the user. In light of the existence of the positive population (genuine users) and negative population (impostors), there are four possible outcomes: • • • • Genuine user is accepted (correct response), Genuine user is rejected (wrong response), Impostor is accepted (wrong response), Impostor is rejected (correct response.). A key decision-making is based on the value of the threshold T, which determines the classification of the individual characteristics of vectors, leading to a positive class (the sheep population) and a negative class (the wolf population). When threshold is increased (decreased) the likelihood of false acceptance decreases (increases) and the probability of false rejection increases (decreases). It is not possible to minimize the likelihood of erroneous acceptance and rejection simultaneously. In empirical conditions it can be found that precise threshold value makes it clear that confidence scores which are very close to threshold level, but still lower than it, are rejected. It can be found that the use of fuzzy logic can be helpful mean of reducing identified problem. 3 Basics of Fuzzy Logic A fuzzy set is an object which is characterized by its membership function. That function is assigned to every object in the set and it is ranging between zero and one. The membership (characteristic) function is the grade of membership of that object in the mentioned set [4]. That definition allows to declare more adequate if the object is within a range of the set or not; to be more precise the degree of being in range. That is a useful feature for expressions in natural language, e.g. the price is around thirty dollars, etc. It is obvious that if we consider sets (not fuzzy sets) it is very hard to declare objects and their membership function. The visualization of the example membership function S is presented on fig. 1 and elaborated on eq. 1. 0 ⎧ 2 ⎪ ⎛ x−a⎞ ⎪1 − 2⎜ ⎟ ⎪ ⎝ c−a ⎠ s ( x; a, b, c) = ⎨ 2 ⎪1 − 2⎛⎜ x − c ⎞⎟ ⎪ ⎝c−a⎠ ⎪ 1 ⎩ for x≤a for a≤ x≤b (1) for b≤x≤c for x≥c 276 A. Sachenko, A. Banasik, and A. Kapczyński 1,2 1 µ(x) 0,8 0,6 0,4 0,2 0 0 a 2 4 6 b 8 c 10 Fig. 1. Membership function S applied to fuzzy set high level of security” It is necessary to indicate the meaning of membership function [4]. The first possibility is to indicate similarity between object and the standard. Another is to indicate level of preferences. In that case the membership function is concerned as level of acceptance of an object in order to declared preferences. And the last but not least possibility is to consider it as a level of uncertainty. In that case membership function is concerned as a level of validity that variable X will be equal to value x. Another important aspect of fuzzy sets and fuzzy logic is possibility of fuzzyfication – ability to change sharp values into fuzzy ones and defuzzyfication as a process of changing fuzzy values into crisp values. That approach is very useful in case of natural language problems and natural language variables. Fuzzyfication and defuzzyfication is also used in analysis of group membership. It is the best way of presenting average values in order to whole set of objects I ndimensional space. This space may be a multicriteria analysis of the problem. 
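The printed form of equation (1) is garbled in this copy; what it evidently denotes is the standard S-type (Zadeh) membership function shown in Fig. 1, which is zero up to a, rises quadratically through the crossover point b (conventionally the midpoint of a and c) and saturates at one beyond c. The Python reconstruction below is offered as an illustration of that standard form, with illustrative parameter values rather than the exact ones used in the figure.

```python
def s_membership(x, a, c, b=None):
    """Standard Zadeh S-function: 0 below a, quadratic rise, 1 above c.

    Reconstruction of the garbled equation (1); b defaults to (a + c) / 2."""
    if b is None:
        b = (a + c) / 2.0
    if x <= a:
        return 0.0
    if x <= b:
        return 2.0 * ((x - a) / (c - a)) ** 2
    if x <= c:
        return 1.0 - 2.0 * ((x - c) / (c - a)) ** 2
    return 1.0

# membership in the fuzzy set "high level of security", illustrative a = 2, c = 9
print([round(s_membership(x, 2.0, 9.0), 3) for x in range(0, 11)])
```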
It is well known that there are a lot of practical applications of fuzzy techniques. In many applications, people start with fuzzy values and then propagate the original fuzziness all the way to the answer, by transforming fuzzy rules and fuzzy inputs into fuzzy recommendations. However, there is one well known exception to this general feature of fuzzy technique applications. One of the main applications of fuzzy techniques is intelligent control. In fuzzy control, the objective is not so much to provide an advise to the expert, but rather to generate a single (crisp) control value uc that will be automatically applied by the automated controller. To get this value it is necessary to use the fuzzy control rules to combine the membership functions of the inputs into a membership function µ(u) for the desired control u. This function would be a good output if the result will be an advise to a human expert. However, our point o concern is in generating a crisp value for the automatical controller, we must transform the fuzzy membership function µ(u) into a single value uc. This transformation from fuzzy to crisp is called defuzzification [3]. One of the most widely used defuzzification technique based on centroids is centroid defuzzification. It is based on minimizing the mean square difference between The Concept of Application of Fuzzy Logic 277 the actual (unknown) optimal control u and the generated control uc produced as a result. In this least square optimization it is possible to weigh each value u with the weight proportional to its degree of possibility µ(u). The resulting optimization problem [2]: ∫ µ (u ) ⋅ (u − u ) c 2 du → min uc (2) can be explicitly solved if there is a possibility of differentiation the corresponding objective function by uc and equate the resulting value to 0. The result of it is a formula [2]: uc = ∫ u ⋅ µ (u )du ∫ µ (u )du (3) This formula is called centroid defuzzification because it describes the ucoordinate of the center of mass of the region bounded by the graph of the membership function µ(u). As it was mentioned before fuzzy sets and fuzzy logic is a way to cope with qualitative and quantitative problems. That possibility is a great advantage of presented approach and it is very commonly used in different fields. That provides us a possibility of using its mechanisms in many different problems and it usually gives reasonable solutions. 4 The Use of Fuzzy Logic in Biometric Authentication Systems There are two main characteristics of biometric authentication system: false acceptance rate and false rejection rate. The false acceptance rate can be defined as relation of number of accepted authentication attempts to number of all attempts undertaken by impostors. Authentication attempt is successful only if confidence score resulted from comparison of template created during enrolment process with template created from current authentication attempts eqauls or is greater than specified threshold value. The threshold value can be specified apriori basing on theoretical estimations or can be defined based on requirements from given environment. For example for high security environments the importance of false acceptance errors is greater than of false rejection errors; for low security environments the situation is quite opposite. Threshold is the parameter which defines the levels of false acceptance and false rejection errors. 
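Returning to equation (3): centroid defuzzification is simply the µ-weighted mean of the candidate control values, and a few lines of numerical integration reproduce it. The sketch below is our own illustration (the function names and the triangular membership function are assumptions, not taken from the paper).

```python
import numpy as np

def centroid_defuzzify(u, mu):
    """Centroid defuzzification, eq. (3): u_c = int(u * mu(u)) du / int(mu(u)) du.

    `u` is a grid of candidate control values, `mu` the sampled membership function."""
    u = np.asarray(u, dtype=float)
    mu = np.asarray(mu, dtype=float)
    return np.trapz(u * mu, u) / np.trapz(mu, u)

# triangular membership function centred at 70 on a 0..100 control range
u = np.linspace(0.0, 100.0, 1001)
mu = np.clip(1.0 - np.abs(u - 70.0) / 20.0, 0.0, None)
print(centroid_defuzzify(u, mu))   # ~70.0, the centre of mass of the triangle
```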
One of the most popular approach assumes that threshold value is chosen at level were false acceptance rate equals false reject rate. In our paper we consider the theoretical biometric authentication system with five levels of security: very low, low, medium, high, very high and a main error considered is only a false acceptance error. The security level is associated with a given level of false acceptance error and the values of false acceptance rates are obtained emipirically from given set of biometric templates. Obtained false acceptance rates are function of threshold value which is expressed precisely as a value from range 0 to 100. 278 A. Sachenko, A. Banasik, and A. Kapczyński In biometric systems the threshold value can be set globally (for all users) or individually. From perspective of security officer responsible for effictient work of whole biometric system the choice of indivual thresholds requires setting as many thresholds as number of users enrolled in the system. We propose the use of fuzzy logic as a mean of more natural expression of accepted level of false acceptance errors. In our approach the first parameter considered is the value of false accept rate and basing on which we the level of security is determined which is finally transformed into threshold value. Step-by-step proposed procedure consists of three steps. In fist step for a given value of false acceptance rate we determine the value of the membership functions for given level of security. If we consider five levels of false acceptance rate, e.g. 5%, 2%, 1%, 0.5% and 0.1% than for each of those levels we can calculate the values of membership functions to one of five levels of security: very low, low, medium, high and very high (see fig. 2). µ Fig. 2. Membership functions for given security levels (S). Medium level was distinguished. In second step through appropriate application of fuzzy rules we receive a result of fuzzy request of the threshold value. Those rules are developed in order to obtain an answer to a question about the relationship between threshold (T) and the specified level of security (S): Rule Rule Rule Rule Rule 1: 2: 3: 4: 5: IF IF IF IF IF “S “S “S “S “S is very low” THEN “T is very low” level is low” THEN “T is low” is medium” THEN “T is medium” level is high” THEN “T is high” is very high” THEN “T is very high”. In third step we apply defuzzyfication during which the fuzzy values are transformed based on specified values of the membership functions and point a fuzzy centroid of given values. Our approach was depicted on fig. 3. For example if we assume that the false acceptance rate is 0.1 and is fuzzified and belongs to "S is high" with value of the membership function of 0.2 and belongs to "S is very high" with value of the membership function of 0.8. Then, based on rule 4 and rule 5 we can see that threshold value is high with the value of membership function equaled to 0.2 and threshold value is very high at the The Concept of Application of Fuzzy Logic 279 Fig. 3. Steps of fuzzy reasoning applied in biometric authentication system value of membership function equaled to 0.8. The modal established at the threshold value of membership functions shall be: 10 (very low level), 30 (low level), 50 (medium), 70 (high level), 90 (very high level). The calculation of fuzzy centroid is carried out by the following calculation: T = 90 ⋅ 0.8 + 70 ⋅ 0.2 = 86 (3) In this example, for the value of false acceptance rate of 0.1, the threshold value equals to 86. 
5 Conclusions The development of this concept in the use of fuzzy logic to determine the threshold value associated with a given level of security provides an interesting alternative to the traditional concept to define threshold values in biometric authentication systems. References 1. Kapczyński, A.: Evaluation of the application of the method chosen in the process of biometric authentication of users, Informatica studies. Science series No. 1 (43), vol. 22. Gliwice (2001) 2. Kreinovich, V., Mouzouris, G.C., Nguyen, G.C., H.T.: Fuzzy rule based modeling as a universal approximation tool. In: Nguyen, H.T., Sugeno, M. (eds.) Fuzzy Systems: Modeling and Control, pp. 135–195. Kluwer, Boston (1998) 3. Mendel, J.M., Gang X.: Fast Computation of Centroids for Constant-Width Interval-Valued Fuzzy Sets. In: Fuzzy Information Processing Society. NAFIPS 2006, pp. 621–626. Annual meeting of the North American (2006) 4. Zadeh, L.A.: Fuzzy sets. Information and Control 8 (1965) 5. Zhang, D.: Automated biometrics. Kluwer Academic Publishers, Dordrecht (2000) Bidirectional Secret Communication by Quantum Collisions Fabio Antonio Bovino Elsag Datamat, via Puccini 2, Genova. 16154, Italy fabio.bovino@elsagdatamat.com Abstract. A novel secret communication protocol based on quantum entanglement is introduced. We demonstrate that Alice and Bob can perform a bidirectional secret communication exploiting the “collisions” on linear optical devices between partially shared entangled states. The protocol is based on the phenomenon of coalescence and anti-coalescence experimented by photons when they are incident on a 50:50 beam splitter. Keywords: secret communications, quantum entanglement. 1 Introduction Interference between different alternatives is in the nature of quantum mechanics [1]. For two photons, the best known example is the superposition on a 50:50 beamsplitter (Hong Ou Mandel –HOM– interferometer): two photons with the same polarization are subjected to a coalescence effect when they are superimposed in time [2]. HOM interferometer is used for Bell States measurements too, and it is the crucial element in the experiment of teleportation or entanglement swapping. Multi-particle entanglement has attracted much attention in these years. GHZ (Greemberger, Horne, Zeilinger) states showed stronger violation of locality. The generation of multi-particle entangled states is based on interference between independent fields generated by Spontaneous Parametric Down Conversion (SPDC) from non-linear crystals. As an example four photons GHZ states are created by two pairs emitted from two different sources. For most applications high visibility in interference is necessary to increase the fidelity of the produced states. Usually, high visibility can be reached in experiments involving only a pair of down-converted photons emitted by one source and quantum correlation between two particles is generally ascribed to the fact that particles involved are either generated by the same source or have interacted at some earlier time. In the case of two independent sources of down converted pairs stationary fields cannot be used, unless the bandwidth of the fields is much smaller than that of the detectors. In other words the coherent length of the down conversion fields must be so long that, within the detection time period, the phase of the fields is constant. This limitation can be overcome by using pulsed pump laser with sufficiently narrow temporal width. E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 
280–285, 2009. springerlink.com © Springer-Verlag Berlin Heidelberg 2009 Bidirectional Secret Communication by Quantum Collisions 281 2 Four-Photons Interference by SPDC We want to consider now the interference of two down-converted pairs generated by distinct sources that emit the same state. We take, as source, non-linear crystals, cut for a Type II emission, so that the photons are emitted in pairs with orthogonal polarizations, and they satisfy the well known phase-matching conditions, i.e. energy and momentum conservation. In the analysis we select two k-modes (1,3) for the first source and two k-modes for the second source (2,4), along which the emission can be considered degenerate, or in other words, the central frequency of the two photons is half of the central frequency of the pump pulse. We are interested to the case in which only four photons are detected in the experimental set-up, then we can reduce the total state to 2 ⊗2 Ψ1234 = η2 2 2 Ψ13 + η2 2 2 1 1 Ψ24 + η 2 Ψ13 Ψ24 (1) where |η|² is the probability of photon-conversion in a single pump pulse: η is proportional to the interaction time, to the χ⁽²⁾ non-linear susceptibility of the non-linearcrystal and to the intensity of the pump field, here assumed classical and un-depleted during the parametric interaction. Thus we have a coherent superposition of a double pair emission from the first source (eq. 2), or a double pair emission from the second source (eq. 3), or the emission of a pair in the first crystal and a pair in the second one (eq. 4): 2 Ψ13 = η2 2 ∫ ∫ ∫ ∫ dω1dω2dω3dω4 × Φ(ω1 + ω2 )Φ(ω3 + ω4 )e−i (ω1 +ω2 )ϕ e−i (ω3 +ω4 )ϕ [ × [aˆ ] (ω )] vac × aˆ1+e (ω1 )aˆ3+o (ω2 ) − aˆ3+e (ω1 )aˆ1+o (ω2 ) + 1e (ω3 )aˆ3+o (ω4 ) − aˆ3+e (ω3 )aˆ1+o 2 Ψ24 = η2 2 4 ∫ ∫ ∫ ∫ dω1dω2 dω3dω4 × Φ(ω1 + ω2 )Φ(ω3 + ω4 )e −i (ω1 +ω2 )ϕ [ × [aˆ ] (ω )] vac × aˆ 2+e (ω1 )aˆ 4+o (ω2 ) − aˆ2+e (ω1 )aˆ 4+o (ω2 ) + 2e (ω3 )aˆ4+o (ω4 ) − aˆ2+e (ω3 )aˆ4+o 1 1 Ψ13 Ψ24 = η2 2 4 ] × aˆ1+e (ω1 )aˆ3+o (ω2 ) − aˆ1+e (ω1 )aˆ3+o (ω2 ) + 2e (3) ∫ ∫ ∫ ∫ dω1dω2 dω3dω4 × Φ(ω1 + ω2 )Φ(ω3 + ω4 ) [ × [aˆ (2) (4) (ω3 )aˆ4+o (ω4 ) − aˆ2+e (ω3 )aˆ4+o (ω4 )] vac The function Φ (ω1 + ω2 ) contains the information about the pump field and the parametric interaction, and can be expanded in the form 282 Fabio Antonio Bovino Φ (ω1 + ω 2 ) = Ε p (ω1 + ω 2 )φ (ω1 + ω 2 , ω1 − ω 2 ) (5) where Ε p (ω1 + ω 2 ) describes the pump field spectrum and φ (ω1 + ω 2 , ω1 − ω 2 ) is the two photon amplitude for single frequency pumped parametric down-conversion. Without loss of generality, we impose a normalization condition on Φ (ω1 + ω 2 ) so that 2 ∫∫ dω1dω 2 Φ (ω1 + ω 2 ) = 1 (6) We want to calculate the probability to obtain Anti-Coalescence or Coalescence on the second Beam-splitter (BS) conditioned to Anti-coalescence at the first one. For Anti-Coalescence-Anti-Coalescence probability (AA) we obtain: AA = 5 + 3Cos(4Ω 0ϕ ) 20 (7) For Anti-Coalescence-Coalescence Probability (AC) we obtain: 3Sin 2 (2Ω 0ϕ ) 5 AC = (8) For Coalescence-Coalescence probability (CC) we obtain: CC = 5 3 + Cos(4Ω 0ϕ ) 20 (9) If the two sources emit different states, the result is different. 
In fact, for the AA probability we obtain

AA = [3 + cos(4Ω0φ)] / 20 ,   (10)

for the AC probability we obtain

AC = [5 − cos(4Ω0φ)] / 10 ,   (11)

and for the CC probability we obtain

CC = [7 + cos(4Ω0φ)] / 20 .   (12)

3 Four-Photons Interference: Ideal Case

Now let us consider the interference of two pairs generated by distinct ideal polarization-entangled sources that emit the same state, for example two singlet states:

|Ψ¹₁₃⟩|Ψ¹₂₄⟩ = (1/2)(a†₁ₑ a†₃ₒ − a†₁ₒ a†₃ₑ)(a†₂ₑ a†₄ₒ − a†₂ₒ a†₄ₑ)|vac⟩ .   (13)

For the AA, AC and CC probabilities we obtain

AA = 1/4,  AC = 0,  CC = 3/4 .   (14)

If the two distinct ideal polarization-entangled sources emit different states, for example a singlet state from the first one and a triplet state from the second one, we have

|Ψ¹₁₃⟩|Ψ¹₂₄⟩ = (1/2)(a†₁ₑ a†₃ₒ − a†₁ₒ a†₃ₑ)(a†₂ₑ a†₄ₒ + a†₂ₒ a†₄ₑ)|vac⟩ ,   (15)

and the AA, AC and CC probabilities become

AA = 0,  AC = 1/2,  CC = 1/2 .   (16)

4 Bidirectional Secret Communication by Quantum Collision

The last result can be used in a quantum communication protocol for exchanging secret messages between Alice and Bob. Alice has a Bell-state synthesizer (i.e. an entangled-state source, a phase shifter and a polarization rotator), a quantum memory and a 50:50 beam-splitter. Alice encodes binary classical information into two different Bell states: she encodes 1 as the singlet state and 0 as the triplet state. She then sends one photon of the state to Bob and keeps the second one in her quantum memory.

Fig. 1. Set-up used by Alice and Bob to perform the bidirectional secret communication

Bob has the same set-up as Alice, so he is able to encode a binary sequence with the same two Bell states used in the protocol. Bob sends one photon of his state to Alice and keeps the second one in his quantum memory. Alice and Bob then perform a Bell measurement on the two 50:50 beam-splitters.

If Alice and Bob have exchanged the same bit, after the Bell measurements the probability of Coalescence-Coalescence is ¾ and the probability of Anti-Coalescence-Anti-Coalescence is ¼. If Alice and Bob have exchanged different bits, the probability of Coalescence-Coalescence is ½ and the probability of Anti-Coalescence-Coalescence is ½.

After the measurements, Alice and Bob communicate the results over an authenticated classical channel: if the result is Anti-Coalescence-Anti-Coalescence, they learn that the same state was used; if the result is Anti-Coalescence-Coalescence or Coalescence-Anti-Coalescence, they learn that different states were used. In the case of Coalescence-Coalescence, they have to repeat the procedure until a useful result is obtained. If the singlet and triplet states are used with equal probability ½, Alice (Bob) can reconstruct 37.5% of the message sent by Bob (Alice).

Fig. 2. Possible outputs of a Bell's measurement

5 Conclusion

Sources emitting entangled states “on demand” do not yet exist, so the protocol cannot be exploited for now. The novelty is the use of quantum properties to encode and decode a message without exchanging a key, and the protocol performs a genuinely quantum cryptographic process. The fundamental condition appears to be the authentication of the users: if the classical channel is authenticated, it is not possible to extract information from the quantum communication channel.
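As a numerical sanity check of the 37.5% figure quoted in Section 4, the short Monte Carlo sketch below (our own illustration, not part of the protocol) draws random bit choices for Alice and Bob and samples the Bell-measurement outcomes with the probabilities given above: for equal bits AA occurs with probability 1/4, for different bits AC (or CA) occurs with probability 1/2, and only these outcomes are conclusive, so the expected fraction of reconstructible bits is 1/2·1/4 + 1/2·1/2 = 3/8 = 0.375.

```python
import random

def simulate_rounds(n_rounds, seed=7):
    """Monte Carlo of the conclusive-outcome rate of the collision protocol."""
    rng = random.Random(seed)
    conclusive = 0
    for _ in range(n_rounds):
        alice, bob = rng.randint(0, 1), rng.randint(0, 1)   # 1 = singlet, 0 = triplet
        if alice == bob:
            # same Bell state: AA with probability 1/4 (conclusive), CC with 3/4
            outcome = "AA" if rng.random() < 0.25 else "CC"
        else:
            # different Bell states: AC/CA with probability 1/2 (conclusive), CC with 1/2
            outcome = "AC" if rng.random() < 0.5 else "CC"
        conclusive += outcome in ("AA", "AC")
    return conclusive / n_rounds

print(simulate_rounds(200_000))   # converges to 3/8 = 0.375
```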
The analysis of security is not complete and comments are welcome. Bidirectional Secret Communication by Quantum Collisions References 1. Schrödinger, E.: Die Naturewissenschaften 48, 807 (1935) 2. Hong, C.K., Ou, Z.Y., Mandel, L.: Phys. Rev. Lett. 59, 2044 (1987) 285 Semantic Region Protection Using Hu Moments and a Chaotic Pseudo-random Number Generator Paraskevi Tzouveli, Klimis Ntalianis, and Stefanos Kollias National Technical University of Athens Electrical and Computer Engineering School 9, Heroon Polytechniou str. Zografou 15773, Athens, Greece tpar@image.ntua.gr Abstract. Content analysis technologies give more and more emphasis on multimedia semantics. However most watermarking systems are frame-oriented and do not focus on the protection of semantic regions. As a result, they fail to protect semantic content especially in case of the copy-paste attack. In this framework, a novel unsupervised semantic region watermark encoding scheme is proposed. The proposed scheme is applied to human objects, localized by a face and body detection method that is based on an adaptive two-dimensional Gaussian model of skin color distribution. Next, an invariant method is designed, based on Hu moments, for properly encoding the watermark information into each semantic region. Finally, experiments are carried out, illustrating the advantages of the proposed scheme, such as robustness to RST and copy-paste attacks, and low overhead transmission. Keywords: Semantic region protection, Hu moments, Chaotic generator. pseudo-random number 1 Introduction Copyright protection of digital images and video is still an urgent issue of ownership identification. Several watermarking techniques have been presented in literature, trying to confront the problem of copyright protection. Many of them are not resistant enough to geometric attacks, such as rotation, scaling, translation and shearing. Several researchers [1]-[4] have tried to overcome this inefficiency by designing watermarking techniques resistant to geometric attacks. Some of them are based on the invariant property of the Fourier transform. Others use moment based image normalization [5], with a standard size and orientation or other normalization technique [6]. In most of the aforementioned techniques the watermark is a random sequence of bits [7] and it is retrieved by subtracting the original from the candidate image and choosing an experimental threshold value to determine when the cross-correlation coefficient denotes a watermarked image or not. On the other hand, the majority of the proposed techniques are frame-based and thus semantic regions such as humans, buildings, cars etc., are not considered. At the same time, multimedia analysis technologies give more and more importance to semantic content. However in several applications (such as TV news, TV weather E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 286–293, 2009. springerlink.com © Springer-Verlag Berlin Heidelberg 2009 Semantic Region Protection Using Hu Moments 287 forecasting, films) semantic regions, and especially humans, are addressed as independent video objects and thus should be independently protected. In this direction the proposed system is designed to provide protection of semantic content. To achieve this goal, a human object detection module is required both in watermark encoding and during authentication. 
Afterwards the watermark encoding phase is activated, where chaotic noise is generated and properly added to each human object, producing the watermarked human object. The watermark encoding procedure is guided by a feedback mechanism in order to satisfy an equality, formed as a weighted difference between Hu moments of the original and watermarked human objects. During authentication, initially every received image passes through the human object detection module. Then Hu moments are calculated for each detected region, and an inequality is examined. A semantic region is copyrighted only if the inequality is satisfied. Experimental results on real sequences indicate the advantages of the proposed scheme in cases of mixed attacks, affine distortions and the copypaste attack. 2 Semantic Region Extraction Extraction of human objects is a two-step procedure. In the first step the human face is detected and in the second step the human body is localized based on information of the position and size of human face. In particular human face detection is based on the distribution of chrominance values corresponding to human faces. These values occupy a very small region of the YCbCr color space [9]. Blocks of an image that are located within this small region can be considered as face blocks, belonging to the face class Ωf. According to this assumption, the histogram of chrominance values corresponding to the face class can be initially modeled by a Gaussian probability density function. Each Bi block of the image is considered to belong to the face class, if the respective probability P(x(Bi)|Ωf) is high. a) Initial Image b) Object Mask c) Object Extraction Fig. 1. Human Video Object Extraction Method On the other hand and in order to filter non-face regions with similar chrominance values, the aspect ratio R=Hf /Wf (where Hf is the height and Wf the width of the head) is adopted, which was experimentally found to lie within the interval [1.4 1.6] [9]. Using R and P, a binary mask, say Mf, is build containing the face area. Detection of the body area is then achieved using geometric attributes that relate face and body areas [9]. After calculating the geometric attributes of the face region, the human body can be localized by incorporating a similar probabilistic model. Finally the face 288 P. Tzouveli, K. Ntalianis, and S. Kollias and body masks are fused and human video objects are extracted. The phases of human object extraction are illustrated in Figure 1. This algorithm is an efficient method of finding face locations in a complex background when the size of the face is unknown. It can be used for a wide range of face sizes. The performance of the algorithm is based on the distribution of chrominance values corresponding to human faces providing 92% successful. 3 The Watermark Encoding Module Let us assume that human object O has been extracted from an image or frame, using the human object extraction module described in Section 2. Initially Hu moments of human object O are computed [6] providing an invariant feature of an object. Traditionally, moment invariants are computed based both on the shape boundary of the area and its interior object. Hu first introduced [19] the mathematical foundation of 2D moment invariants, based on methods of algebraic invariants and demonstrated their application to shape recognition. 
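The rest of this section manipulates these seven invariants directly, so a concrete reference may help. The sketch below is ours (not the authors' code): it computes φ1–φ7 for a segmented binary region with NumPy, using the standard normalized-central-moment formulas of [6]; in the actual scheme the invariants would be evaluated on the extracted human object.

```python
import numpy as np

def hu_moments(mask: np.ndarray) -> np.ndarray:
    """Seven Hu invariants of a binary object mask (1 inside the region, 0 outside)."""
    ys, xs = np.nonzero(mask)
    m00 = float(len(xs))                    # zero-order moment (region area)
    xbar, ybar = xs.mean(), ys.mean()

    def mu(p, q):                           # central moment mu_pq
        return np.sum((xs - xbar) ** p * (ys - ybar) ** q)

    def eta(p, q):                          # normalized central moment eta_pq
        return mu(p, q) / m00 ** (1 + (p + q) / 2)

    e20, e02, e11 = eta(2, 0), eta(0, 2), eta(1, 1)
    e30, e03, e21, e12 = eta(3, 0), eta(0, 3), eta(2, 1), eta(1, 2)
    phi1 = e20 + e02
    phi2 = (e20 - e02) ** 2 + 4 * e11 ** 2
    phi3 = (e30 - 3 * e12) ** 2 + (3 * e21 - e03) ** 2
    phi4 = (e30 + e12) ** 2 + (e21 + e03) ** 2
    phi5 = ((e30 - 3 * e12) * (e30 + e12) * ((e30 + e12) ** 2 - 3 * (e21 + e03) ** 2)
            + (3 * e21 - e03) * (e21 + e03) * (3 * (e30 + e12) ** 2 - (e21 + e03) ** 2))
    phi6 = ((e20 - e02) * ((e30 + e12) ** 2 - (e21 + e03) ** 2)
            + 4 * e11 * (e30 + e12) * (e21 + e03))
    phi7 = ((3 * e21 - e03) * (e30 + e12) * ((e30 + e12) ** 2 - 3 * (e21 + e03) ** 2)
            - (e30 - 3 * e12) * (e21 + e03) * (3 * (e30 + e12) ** 2 - (e21 + e03) ** 2))
    return np.array([phi1, phi2, phi3, phi4, phi5, phi6, phi7])
```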
Hu's method is based on nonlinear combinations of 2nd and 3rd order normalized central moments, providing a set of absolute orthogonal moment invariants that can be used for RST-invariant pattern identification. Hu [19] derived seven functions from regular moments which are rotation, scaling and translation invariant. In [20], Hu's moment invariant functions are incorporated and the watermark is embedded by modifying the moment values of the image; in that implementation an exhaustive search must be performed to determine the embedding strength. The method proposed in [20] provides a watermark that is invariant to both geometric and signal-processing attacks, based on moment invariants.

Hu moments are seven invariant values computed from central moments up to order three, and are independent of object translation, scale and orientation. Let Φ = [φ1, φ2, φ3, φ4, φ5, φ6, φ7]ᵀ be a vector containing the Hu moments of O. In this paper, the watermark information is encoded into the invariant moments of the original human object. To accomplish this, let us define the following function:

f(X, Φ) = Σ_{i=1..7} w_i (x_i − φ_i) / φ_i   (1)

where X is a vector containing the φ values of an object, Φ contains the φ invariants of object O, and w_i are weights that put different emphasis on different invariants. Each weight w_i receives a value within a specific interval, based on the output of a chaotic random number generator.

Chaotic functions, first studied in the 1960s, present numerous interesting properties that can be exploited by modern cryptographic and watermarking schemes. For example, the iterative values generated by such functions are completely random in nature, although they remain bounded, and they never converge, no matter how many iterations are performed. The most fascinating aspect of these functions, however, is their extreme sensitivity to initial conditions, which makes them very important for applications in cryptography. One of the simplest chaotic functions, and the one incorporated in our work, is the logistic map. In particular, the logistic function is used as the core component of a chaotic pseudo-random number generator (C-PRNG) [8]. The procedure is triggered and guided by a secret 256-bit key that is split into 32 8-bit session keys (k0, k1, …, k31). Two successive session keys kn and kn+1 are used to regulate the initial conditions of the chaotic map in each iteration. The robustness of the system is further reinforced by a feedback mechanism, which leads to acyclic behavior, so that the next value to be produced depends both on the key and on the current value.

The first seven output values of the C-PRNG are linearly mapped to the following intervals: [1.5, 1.75] for w1, [1.25, 1.5] for w2, [1, 1.25] for w3, [0.75, 1] for w4 and w5, and [0.5, 0.75] for w6 and w7. These intervals have been experimentally estimated and reflect the importance and robustness of each of the φ invariants. Watermark encoding is then achieved by enforcing the following condition:

f(Φ*, Φ) = Σ_{i=1..7} w_i (φ*_i − φ_i) / φ_i = N*   (2)

where Φ* is the moment vector of the watermarked human object O* and N* is a target value, also properly determined by the C-PRNG, taking into consideration a tolerable content distortion. The value N* expresses the weighted difference between the φ invariants of the original and the watermarked human objects.
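The text above fixes only the interval mapping of the weights; the internal construction of the C-PRNG of [8] is not spelled out here. The following sketch is therefore an assumption-laden illustration: a logistic map re-seeded from successive pairs of 8-bit session keys, with a feedback term, whose first seven outputs are mapped linearly onto the stated intervals. The seeding, feedback and all function names are ours.

```python
def session_keys(key_256bit: bytes):
    """Split a 256-bit secret key into 32 8-bit session keys k0..k31."""
    assert len(key_256bit) == 32
    return list(key_256bit)

def c_prng(key_256bit: bytes, count: int, r: float = 3.99):
    """Yield `count` chaotic values in (0, 1) from the logistic map x <- r*x*(1-x),
    re-seeded from successive session-key pairs and perturbed by a feedback term
    so that the next value depends on the key and on the current value."""
    ks = session_keys(key_256bit)
    out, feedback = [], 0.0
    for n in range(count):
        kn, kn1 = ks[n % 32], ks[(n + 1) % 32]
        # initial condition regulated by two successive session keys (+ feedback)
        x = ((kn * 256 + kn1 + 1) / 65537.0 + feedback) % 1.0
        x = min(max(x, 1e-6), 1 - 1e-6)       # keep x strictly inside (0, 1)
        for _ in range(64):                    # iterate the logistic map
            x = r * x * (1 - x)
        out.append(x)
        feedback = x
    return out

# Map the first seven outputs to the weight intervals given in the text.
INTERVALS = [(1.5, 1.75), (1.25, 1.5), (1.0, 1.25),
             (0.75, 1.0), (0.75, 1.0), (0.5, 0.75), (0.5, 0.75)]

def weights(key_256bit: bytes):
    vals = c_prng(key_256bit, 7)
    return [lo + v * (hi - lo) for v, (lo, hi) in zip(vals, INTERVALS)]

if __name__ == "__main__":
    print(weights(bytes(range(32))))           # seven weights, one per interval
```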
The greater N* is, the larger the perturbation that must be added to the original video object and the higher the visual distortion that is introduced.

Fig. 2. Block diagram of the encoding module

This is achieved by generating a perturbation region ΔΟ of the same size as O such that, when ΔΟ is added to the original human object O, it produces a region

O* = O + βΔΟ   (3)

that satisfies Eq. (2). Here, β is a parameter that controls the distortion introduced to O by ΔΟ. The C-PRNG generates values until the mask ΔΟ is fully filled. After generating all sensitive parameters of the watermark encoding module, a proper O* is iteratively produced using Eqs. (2) and (3). In this way, the watermark information is encoded into the φ values of O, producing O*. An overview of the proposed watermark encoding module is presented in Figure 2.

4 The Decoding Module

The decoding module is responsible for detecting copyrighted human objects. The decoding procedure is split into two phases (Figure 3). During the first phase, the received image passes through the human object extraction module described in Section 2. During the second phase each human object undergoes an authentication test to check whether it is copyrighted or not.

Fig. 3. Block diagram of the decoding module

In particular, let us consider the following sets of objects and respective φ invariants: (a) (O, Φ) for the original human object, (b) (O*, Φ*) for the watermarked human object and (c) (O′, Φ′) for a candidate human object. Then O′ is declared authentic if:

| f(Φ*, Φ) − f(Φ′, Φ) | ≤ ε   (4)

where f(Φ*, Φ) is given by Eq. (2), while f(Φ′, Φ) is given by:

f(Φ′, Φ) = Σ_{i=1..7} w_i (φ′_i − φ_i) / φ_i = N′   (5)

Then Eq. (4) becomes

N_d = | N* − N′ | ≤ ε  ⇒  | Σ_{i=1..7} w_i (φ*_i − φ′_i) / φ_i | ≤ ε   (6)

where ε is an experimentally determined, case-specific margin of error and w_i are the weights (see Section 3). Two observations need to be stressed at this point. First, it is advantageous that the decoder does not need the original image: it only needs w_i, Φ, Φ* and the margin of error ε. Secondly, since the decoder only checks the validity of Eq. (6) for the received human object, the resulting watermarking scheme answers a yes/no question (i.e. copyrighted or not). As a consequence, this watermarking scheme belongs to the family of algorithms of 1-bit capacity.

In order to determine ε, we should first observe that Eq. (6) delimits a normalized margin of error between Φ and Φ*. This margin depends on the severity of the attack: the more severe the attack, the larger the value of N_d will be. Thus, ε should be selected so as to keep the false-reject and false-accept rates as low as possible (ideally zero). More specifically, the value of ε is not set heuristically, but depends on the content of each distinct human object. In particular, each watermarked human object O* undergoes a sequence of plain attacks (e.g. compression, filtering) and mixed attacks (e.g. cropping and filtering, noise addition and compression) of increasing strength. The strength of the attack increases until either the SNR falls below a predetermined value or a subjective criterion is satisfied. In the following, the subjective criterion, which is related to the content's visual quality, is adopted. According to this criterion and for each attack, when the quality of the human object's content is considered unacceptable by the majority of evaluators, an upper level of attack, say Ah, is set.
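Given the stored weights and moment vectors, the authentication test of Eq. (6) is only a few lines. The sketch below is ours (reusing the hu_moments helper sketched earlier to obtain the moment vectors); it also shows how the margin ε would be calibrated as the maximum N_d over a set of attacked copies, as formalized in Eqs. (7)–(9) right after this point.

```python
import numpy as np

def f(X, Phi, w):
    """Weighted, normalized moment difference used in Eqs. (1), (2) and (5)."""
    X, Phi, w = map(np.asarray, (X, Phi, w))
    return float(np.sum(w * (X - Phi) / Phi))

def n_d(Phi_candidate, Phi_orig, Phi_marked, w):
    """Nd = |N* - N'| of Eq. (6)."""
    return abs(f(Phi_marked, Phi_orig, w) - f(Phi_candidate, Phi_orig, w))

def is_copyrighted(Phi_candidate, Phi_orig, Phi_marked, w, eps):
    """Yes/no authentication test: the decoder needs only w, Phi, Phi* and eps."""
    return n_d(Phi_candidate, Phi_orig, Phi_marked, w) <= eps

def calibrate_eps(attacked_moment_vectors, Phi_orig, Phi_marked, w):
    """eps = max over the attacked copies O*_i of their Nd values (Eqs. (7)-(9))."""
    return max(n_d(Phi_i, Phi_orig, Phi_marked, w) for Phi_i in attacked_moment_vectors)
```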
This upper level of attack can also be determined automatically from the SNR, since a minimum SNR value can be defined before any attack is performed. Let us now define an operator p(·) that performs attack i on O* (reaching the upper level A_h^i) and produces an object O*_i:

p(O*, A_h^i) = O*_i ,  i = 1, 2, …, M   (7)

Then for each O*_i, N_{d_i} is calculated according to Eq. (6). By gathering the N_{d_i} values, a vector is produced:

N_d = [N_{d_1}, N_{d_2}, …, N_{d_M}]   (8)

Then the margin of error is determined as:

ε = max N_d   (9)

Since ε is the maximum value of N_d, it is guaranteed that a human object would have to be visually unacceptable in order to deceive the watermark decoder.

5 Experimental Results

Several experiments were performed to examine the advantages and open issues of the proposed method. Firstly, face and body detection is performed on different images. Afterwards, the watermark information is encoded into each human object, and the decoding module is tested under a wide class of geometric distortions, copy-paste and mixed attacks. When an attack of a specific kind is performed on the watermarked human object (Fig. 4a), it leads to an SNR reduction that is proportional to the severity of the attack.

Firstly we examine JPEG compression for quality factors in the range 10 to 50. Result sets (N*, SNR, Nd) are provided in the first group of rows of Table 1. It can be observed that Nd changes rapidly for SNR < 9.6 dB; furthermore, the subjective visual quality is not acceptable for SNR < 10 dB. Similar behavior can be observed in the cases of Gaussian noise for SNR < 6.45 dB (using different means and deviations) and median filtering for SNR < 14.3 dB (changing the filter size). Again in these cases the subjective visual quality is not acceptable for SNR < 5.8 dB and SNR < 9.2 dB respectively. Furthermore, we examine some mixed attacks, combining Gaussian noise and JPEG compression, scaling and JPEG compression, and Gaussian noise and median filtering.

Table 1. Watermark detection after different attacks

In the following, we illustrate the ability of the method to protect content against copy-paste attacks. The encoding module receives an image which contains a weather forecaster and provides the watermarked human object (Fig. 4a). In this case ε was automatically set equal to 0.65 according to Eq. (9), so as to confront even a cropping inaccuracy of 6%. It should be mentioned that, for larger ε, larger cropping inaccuracies can be addressed; however, the possibility of false alarms also increases. Now let us assume that a malicious user initially receives the image of Fig. 4a and then copies, modifies (cropping inaccuracy 2%, scaling 5%, rotation 10°) and pastes the watermarked human object into new content (Fig. 4b). Let us also assume that the decoding module receives Fig. 4b. Initially, the human object is extracted and then the decoder checks the validity of Eq. (6). In this case Nd = 0.096, a value smaller than ε. As a result the watermark decoder certifies that the human object of Fig. 4b is copyrighted.

Fig. 4. Copy-paste attack. (a) Watermarked human object (b) Modified watermarked human object in new content.

6 Conclusions

In this paper an unsupervised, robust and low-complexity semantic object watermarking scheme has been proposed. Initially, human objects are extracted and the watermark information is properly encoded into their Hu moments.
The authentication module needs only the moment values of the original and watermarked human objects and the corresponding weights. Consequently, both encoding and decoding modules have low complexity. Several experiments have been performed on real sequences, illustrating the robustness of the proposed watermarking method to various signal distortions, mixed processing and copy-paste attacks. References 1. Cox, J., Miller, M.L., Bloom, J.A.: Digital Watermarking, San Mateo. Morgan Kaufmann, San Francisco (2001) 2. Lin, C.Y., Wu, M., Bloom, J., Cox, I., Miller, M., Lui, Y.: Rotation, scale, and translation resilient watermarking for images. IEEE Trans. on Image Processing 10, 767–782 (2001) 3. Wu, M., Yu, H.: Video access control via multi-level data hiding. In: Proc. of the IEEE ICME, N.Y. York (2000) 4. Pereira, S., Pun, T.: Robust template matching for affine resistant image watermarks. IEEE Transactions on Image Processing 9(6) (2000) 5. Abu-Mostafa, Y., Psaltis, D.: Image normalization by complex moments. IEEE Trans. on Pattern Analysis and Machine Intelligent 7 (1985) 6. Hu, M.K.: Visual pattern recognition by moment invariants. IEEE Trans. on Information Theory 8, 179–187 (1962) 7. Alghoniemy, M., Tewfik, A.H.: Geometric Invariance in Image. Watermarking in IEEE Trans. on Image Processing 13(2) (2004) 8. Devaney, R.: An Introduction to Chaotic Dynamical Systems. Addison-Wesley, Redwood City (1989) 9. Yang, Huang, T.S.: Human Face Detection in Complex Background. Pattern Recognition 27(1), 53–63 (1994) Random r-Continuous Matching Rule for Immune-Based Secure Storage System Cai Tao, Ju ShiGuang, Zhong Wei, and Niu DeJiao* JiangSu University, Computer Department, ZhengJiang, China, 212013 caitao@ujs.edu.cn Abstract. On the basis of analyzing demand of secure storage system, this paper use the artificial immune algorithm to research access control system for the secure storage system. Firstly some current matching rules are introduced and analyzed. Then the elements in immune-based access control system are defined. To improve the efficiency of the artificial immune algorithm, this paper proposes the random r-continuous matching rule, and analyze the number of illegal access requests that one detector can check out. Implementing prototype of the random rcontinuous matching rule to evaluate and compare its performance with current matching rules. The result proves the random r-continuous matching rule is more efficient than current matching rules. At last, we use the random r-continuous matching rule to realize immune-based access control system for OST in Lustre. Evaluating its I/O performance, the result shows its I/O performance loss is below 8%, it proves that the random r-continuous matching rule can be used to realize the secure storage system that can keep high I/O performance. Keywords: matching rule; artificial immune algorithm; secure storage system. 1 Introduction The secure storage system is the hot topic in current researching. There are six directions in current researching. Firstly, encrypting file system to ensure security of storage system, it contains CFS[1], AFS[2], SFS[3,4] and Secure NAS [5]. Secondly, researching new disk structure to realize secure storage system, it contains NASD[6] and Self Securing Storage[7,8]. Thirdly, researching survivable strategy for storage system, it contains PASIS[9,10] and OceanStore[11]. 
Fourth, researching efficient key management strategy for secure storage system, it contains SNAD[12,13,14], PLUTUS[15] and iSCSI-Based Network Attached Storage Secure System[16]. Fifth, researching the secure middle-ware module to ensure security of storage system, it contains SiRiUS[17] and two-layered secure structure for storage system [18]. Sixth, using zone and mask code to ensure security of storage system. Encryption, authentication, data redundancy and intrusion detection are used in current researching of secure storage system. But large time and space consumption are needed to ensure security of enormous data stored in storage system, and the loss of I/O performance in the secure storage system is very large. High I/O performance is important character for storage system. We use the artificial immune algorithm to research fast access control strategy for the secure storage system. * Support by JiangSu Science Foundation of China No.2007086. E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 294–300, 2009. springerlink.com © Springer-Verlag Berlin Heidelberg 2009 Random r-Continuous Matching Rule for Immune-Based Secure Storage System 295 The artificial immune algorithm simulates natural immune system, it has many good characters such as distributability, multi-layered, diversity, autonomy and adaptability and so on. It can protect system efficiently. The classic theory is the negative selection algorithm that presented by Forrest in 1994. It simulates selftolerance of T cells. We use the artificial immune algorithm to judge whether access request is legal in secure storage system and realize fast access control system, then ensure the security of the storage system. The matching rule used to select selftolerance detectors and judge whether detector matches access request, so it is important for efficiency and accuracy the secure storage system. When self, access request and detector are represented by binary string, matching rule is used to compare two binary strings. The remainder of this paper is organized as follows. Section 2 analyzes current matching rules. Section 3 gives definition of some elements and presents random rcontinuous matching rule. Section 4 analyzes the efficiency of random r-continuous matching rule and compare with current matching rules. Section 5 implements prototype of random r-continuous matching rule to evaluate its performance, then compares efficiency and accuracy with current matching rules. Section 6 use random r-continuous matching rule to realize access control system for object-based storage target in storage area network system named Lustre, and evaluate its I/O performance. 2 Related Works The matching rule used to judge whether two binary strings matches between detector and self or between detector and access request. Current matching rules contain rcontiguous matching rule, r-chunk matching rule, Hamming distance matching rule and Rogers and Tanimoto matching rule. Forrest presented r-contiguous matching rule in 1994[24]. r-contiguous matching rule can discriminate non-self accurately, but need large number of detectors. Every detector with l bits contains l-r+1 characteristic sub-string. Every characteristic subl −r string can detect 2 non-self. One detector can recognize (l − r + 1)2 non-self mostly. Balthrop presented r-chunk matching rule to improve accuracy and efficiency of rcontiguous matching rule in 2002[25]. r-chunk matching rule is to add condition to rcontiguous matching rule. 
The start-position restriction can improve the accuracy: only the bits from the i-th to the (i+r−1)-th position of the detector are valid, where i is specific to each detector. One detector contains l−r−i+1 characteristic sub-strings and can recognize at most (l − r − i + 1)·2^(l−r) non-self strings, so the recognition capability of a detector is smaller than with the r-contiguous matching rule.

The Hamming distance matching rule was proposed by Farmer in 1986 [26]. It checks whether there are at least r positions at which the two strings carry identical bits, without requiring these r bits to be contiguous. Every detector can recognize at most (l − r + 1)·2^(l−r) non-self strings, but the rule has lower accuracy. Harmer analyzes different matching rules by calculating the signal-to-noise ratio and the function-value distribution of each matching rule when applied to a randomly generated data set [27]. Overall, current matching rules have limited efficiency.

3 The Random r-Continuous Matching Rule

We give definitions of the elements and present the random r-continuous matching rule.

3.1 Definition of Elements

Definition 1. Domain. U = {0,1}^l is the set of all binary strings with l bits; it is composed of the self set and the non-self set.
Definition 2. Self set. S ⊆ U is the set of all legal strings in the domain.
Definition 3. Non-self set. NS ⊆ U is the set of all illegal strings in the domain.
Definition 4. Access request. x = x1x2…xl (xi ∈ {0,1}) is the representation of one access request in the storage system and is one string in the domain.
Definition 5. Threshold. r is the criterion used to judge whether x matches d.
Definition 6. Detector. d = (d1d2…dl, r) (di ∈ {0,1}) is a binary string with l bits together with the threshold r.
Definition 7. Characteristic sub-string. A sub-string of a detector that is used to check access requests.

3.2 Random r-Continuous Matching Rule

When access requests are checked by the artificial immune algorithm, the characteristic sub-string is critical, so increasing the number of non-self strings that every characteristic sub-string can match is the key to improving the efficiency of a matching rule. The r-contiguous and r-chunk matching rules require the identical sub-string shared by antigen and detector to occur at the same (counterpart) positions and to be at least as long as the threshold; this condition limits their efficiency. The Hamming distance matching rule counts the number of identical bits at counterpart positions, and this positional condition limits its efficiency as well. We propose the random r-continuous matching rule to increase the number of illegal access requests that one detector can match and thus improve efficiency. Given a detector d and an access request x, the rule is defined by Formula 1.

Formula 1: d matches x ≡ ∃ i ≤ l − r + 1 and ∃ j ≤ l − r + 1 such that x_{i+t} = d_{j+t} for t = 0, 1, …, r − 1

That is, if the detector and the access request share an identical sub-string whose length is at least the threshold r, at possibly different positions in the two strings, then d matches x.

4 Performance Analysis

We analyze and compare the number of illegal access requests that one detector can recognize under the different matching rules. Using the random r-continuous matching rule, one detector with l bits contains l−r+1 characteristic sub-strings, and one detector can match at most (l − r + 1)·2^(l−r+1) illegal access requests.
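The matching predicates compared in this paper are short enough to state directly in code. The sketch below is ours (the function names are not from the paper); it implements the rules for bit strings following the definitions of Sections 2 and 3, and the small driver reproduces, on a toy scale, the kind of per-detector count reported for the 8-bit prototype in Section 5.

```python
def r_contiguous(d: str, x: str, r: int) -> bool:
    """Match if d and x agree on at least r contiguous bits at the same positions."""
    run = best = 0
    for a, b in zip(d, x):
        run = run + 1 if a == b else 0
        best = max(best, run)
    return best >= r

def r_chunk(d_chunk: str, i: int, x: str) -> bool:
    """r-chunk detector: a window of length r anchored at position i of the string."""
    return x[i:i + len(d_chunk)] == d_chunk

def hamming(d: str, x: str, r: int) -> bool:
    """Match if at least r (not necessarily contiguous) counterpart bits are equal."""
    return sum(a == b for a, b in zip(d, x)) >= r

def random_r_continuous(d: str, x: str, r: int) -> bool:
    """Formula 1: d and x share some identical substring of length >= r,
    at possibly different positions in the two strings."""
    subs = {d[j:j + r] for j in range(len(d) - r + 1)}
    return any(x[i:i + r] in subs for i in range(len(x) - r + 1))

if __name__ == "__main__":
    # Toy version of the 8-bit setting: how many of the 256 strings does one
    # detector match under each rule (l = 8, r = 4)?  r_chunk is omitted here
    # because its detector is a single anchored window rather than a full string.
    l, r = 8, 4
    d = "10110010"
    universe = [format(v, "08b") for v in range(2 ** l)]
    for name, rule in [("r-contiguous", r_contiguous),
                       ("Hamming distance", hamming),
                       ("random r-continuous", random_r_continuous)]:
        print(name, sum(rule(d, x, r) for x in universe))
```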
Table 1 shows how many illegal access requests one detector can match under the different matching rules. We find that this number is largest for the random r-continuous matching rule, which shows that the rule can clearly improve the efficiency of the artificial immune algorithm.

Table 1. Number of illegal access requests one detector can recognize using different matching rules

  random r-continuous matching rule:   (l − r + 1)·2^(l−r+1)
  r-contiguous matching rule:          (l − r + 1)·2^(l−r)
  r-chunk matching rule:               (l − r − i + 1)·2^(l−r)
  Hamming distance matching rule:      (l − r + 1)·2^(l−r)

5 Prototype of the Random r-Continuous Matching Rule

We implement a prototype of the immune-based access control system on Linux, using the random r-continuous matching rule and the other matching rules. Access requests and detectors are represented by strings of eight bits. Self strings and access requests are stored in two text files. The exhaustive detector-generating algorithm is used to generate the original detectors, and the number of self-tolerant detectors is not limited. The prototype outputs the detection result and the number of detectors needed to check out all illegal access requests, which is then compared across the matching rules. The minimum value of r is 1, the maximum value is 8 and the increment is 1.

Firstly we use the exhaustive strategy to generate all 256 different strings of 8 bits. We choose some strings as the self sequence and the others as access requests; self and access requests are therefore complementary and all access requests are illegal. We create seven self files containing 0, 8, 16, 32, 64, 128 and 192 self strings respectively, together with the corresponding access request files, and we evaluate how many detectors are needed to check out all illegal access requests. The result is shown in Figure 1.

Fig. 1. Number of self-tolerance detectors needed to recognize all illegal access requests, as a function of the number of self strings (0–192), for the prototypes of the random r-continuous, r-contiguous, r-chunk and Hamming distance matching rules

From Figure 1 we find that the prototype can check out all illegal access requests with the smallest number of detectors when the random r-continuous matching rule is used. This result shows that the random r-continuous matching rule is more efficient than the other matching rules.

6 Prototype of Immune-Based Secure Storage

Lustre is an open-source storage area network system. It consists of three modules: client, MDS and OST, where OST is an object-based storage target. We use the random r-continuous matching rule to realize an immune-based access control system for the OST, and use Iozone to test I/O performance. We test the writing performance for a 1 MB file with block sizes of 4k, 8k, 16k, 32k, 64k, 128k, 256k, 512k and 1024k. The result is shown in Figure 2: the prototype of the secure storage system loses less than 8% of the writing performance of plain Lustre, so it can keep high I/O performance.

Fig. 2. Writing performance (bytes/s versus block size) of plain Lustre and of Lustre with the immune-based access control system

7 Conclusion

This paper presents the random r-continuous matching rule to improve the efficiency of the artificial immune algorithm. We analyze the number of illegal access requests that one detector can check out, and evaluate and compare the rule with the current matching rules.
The result proves the random r-continuous matching rule is more efficient than current matching rules. At last we using the random r-continuous matching rule to realize immune-based access control system for OST in the Lustre, the I/O performance testing proves that random r-continuous matching rule can used to realize the secure storage system that can keep high I/O performance. Different detector will contain same characteristic sub-string that will increase consumption of detection. Next step we analyze characteristic sub-string in detector and research new detector generating algorithm to improve efficiency. Random r-Continuous Matching Rule for Immune-Based Secure Storage System 299 References 1. Blaze, M.: A cryptographic file system for UNIX. In: Proceedings of 1st ACM Conference on Communications and Computing Security (1993) 2. Howard, J., Kazar, M., Menees, S., Nichols, D., Satyanarayanan, M., Sidebotham, R., West, M.: Scale and performance in a distributed file system. ACM TOCS 6(1) (February 1988) 3. Fu, K., Kaashoek, M., Mazieres, D.: Fast and secure distributed read-only file system. OSDI (October 2000) 4. Mazieres, D., Kaminsky, M., Kaashoek, M., Witchel, E.: Separating key management from file system security. SOSP (December 1999) 5. Li, X., Yang, J., Wu, Z.: An NFSv4-Based Security Scheme for NAS, Parallel and Distributed Processing and Applications, NanJiang, China (2005) 6. Gobioff, H., Nagle, D., Gibson, G.: Embedded Security for Network-Attached Storage, CMU SCS technical report CMU-CS-99-154 (June 1999) 7. John, D., Strunk, G.R., Goodson, M.L., Sheinholtz, C.A.N., Soules, G.R.: Self-Securing Storage: Protecting Data in Compromised Systems. In: 4th Symposium on Operating System Design and Implementation, San Diego, CA (October 2000) 8. Craig, A.N., Soules, G.R., Goodson, J.D., Strunk, G.R.: Metadata Efficiency in Versioning File Systems. In: 2nd USENIX Conference on File and Storage Technologies, San Francisco, CA, March 31-April 2 (2003) 9. Wylie, J., Bigrigg, M., Strunk, J., Ganger, G., Kiliccote, H., Khosla, P.: Survivable information storage systems. IEEE Computer, Los Alamitos (2000) 10. Ganger, G.R., Khosla, P.K., Bakkaloglu, M., Bigrigg, M.W., Goodson, G.R., Oguz, S., Pandurangan, V., Soules, C.A.N., Strunk, J.D., Wylie, J.J.: Survivable Storage Systems. In: DARPA Information Survivability Conference and Exposition, Anaheim, CA, 12-14 June 2001, vol. 2, pp. 184–195. IEEE, Los Alamitos (2001) 11. Kubiatowicz, J., Bindel, D., Chen, Y., Czerwinski, S., Eaton, P., Geels, D., Gummadi, R., Rhea, S., Weatherspoon, H., Weimer, W., Wells, C., Zhao, B.: OceanStore: An Architecture for Global-Scale Persistent Storage. In: ASPLOS (December 2000) 12. Freeman, W., Miller, E.: Design for a decentralized security system for network-attached storage. In: Proceedings of the 17th IEEE Symposium on Mass Storage Systems and Technologies, College Park, MD, pp. 361–373 (March 2000) 13. Miller, E.L., Long, D.D.E., Freeman, W., Reed, B.: Strong security for distributed file systems. In: Proceedings of the 20th IEEE international Performance, Computing and Communications Conference (IPCCC 2001), Phoenix, April 2001, pp. 34–40. IEEE, Los Alamitos (2001) 14. Miller, E.L., Long, D.D.E., Freeman, W.E., Reed, B.C.: Strong Security for NetworkAttached Storage. In: Proceedings of the 2002 Conference on File and Storage Technologies (FAST), January 2002, pp. 1–13 (2002) 15. Kallahalla, M., Riedel, E., Swaminathan, R., Wang, Q., Fu, K.: PLUTUS: Scalable secure file sharing on untrusted storage. 
In: Conference on File andStorage Technology (FAST 2003), San Francisco, CA, 31 March - 2 April 2003, pp. 29–42. USENIX, Berkeley (2003) 16. De-zhi, H., Xiang-lin, F., Jiang-zhong, H.: Study and Implementation of a iSCSI-Based Network Attached Storage Secure System. MINI-MICRO SYSTEMS 7, 1223–1227 (2004) 17. Goh, E.-J., Shacham, H., Modadugu, N., Boneh, D.: SiRiUS:Securing Remote Untrusted Storage. In: The proceedings of the Internet Society (ISOC) Network and Distributed Systems Security (NDSS) Symposium 2003(2003) 300 C. Tao et al. 18. Azagury, A., Cabetti, R., Factor, M., Halevi, S., Henis, E., Naor, D., Rinetzky, N., Rodeh, O., Satran, J.: A Two Layered Approach for Secuting an Object Store Network. In: SISW 2002 (2002) 19. Hewlett-Packard Company. HP OpenView storage allocator (October 2001), http://www.openview.hp.com 20. Brocade Communications Systems, Inc. Advancing Security in Storage Area Networks. White Paper (June 2001) 21. Hewlett-Packard Company. HP SureStore E Secure Manager XP (March 2001), http://www.hp.com/go/storage 22. Dasgupta, D.: An overview of artificial immune systems and their applications. In: Dasgupta, D. (ed.) Artificial immune systems and their applications, pp. 3–23. Springer, Heidelberg (1999) 23. de Castro, L.N., Timmis, J.: Artificial Immune Systems: A New Computational Approach. Springer, London (2002) 24. Forrest, S., Perelson, A., Allen, L., Cherukuri, R.: Self-nonself discrimination in a computer. In: Proceedings IEEE Symposium on Research in Security and Privacy, Los Alamitos, CA, pp. 202–212. IEEE Computer Society Press, Los Alamitos (1994) 25. Balthrop, J., Esponda, F., Forrest, S., Glickman, M.: Coverage and generalization in an artificial immune system. In: Langdon, W.B., Cantú-Paz, E., Mathias, K., Roy, R., Davis, D., Poli, R., Balakrishnan, K., Honavar, V., Rudolph, G., Wegener, J., Bull, L., Potter, M.A., Schultz, A.C., Miller, J.F., Burke, E., Jonoska, N. (eds.) Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), 9-13 July 2002, pp. 3–10. Morgan Kaufmann Publishers, San Francisco (2002) 26. Farmer, J.D., Packard, N.H., Perelson, A.S.: The immune system, adaptation, and machine learning. Physica D 22, 187–204 (1986) 27. Harmer, P., Williams, G., Gnusch, P.D., Lamont, G.: An Artificial Immune System Architecture for Computer Security Applications. IEEE Transactions on Evolutionary Computation 6(3), 252–280 (2002) 28. Forrest, S., Perelson, A.S., Allen, L., Cherukuri, R.: Self-Nonself Discrimination in a computer. In: Proceeding of IEEE Symposium on Research in Security and Privacy, pp. 202–212. IEEE Computer Society Press, Los Alamitos (1994) 29. Helman, P., Forrest, S.: An efficient algorithm for generating random antibody strings, Technical Report CS-94-07, The University of New Mexico, Albuquerque, NM (1994) 30. D’haeseleer, P., Forrest, S., Helman, P.: An immunological approach to change detection: algorithms, analysis and implications. In: McHugh, J., Dinolt, G. (eds.) Proceedings of the 1996 IEEE Symposium on Computer Security and Privacy, USA, pp. 110–119. IEEE Press, Los Alamitos (1996) 31. D’haeseleer, P.: Further efficient algorithms for generating antibody strings, Technical Report CS95-3, The University of New Mexico, Albuquerque, NM (1995) nokLINK: A New Solution for Enterprise Security Francesco Pedersoli and Massimiliano Cristiano Spin Networks Italia, Via Bernardino Telesio 14, 00195 Roma, Italy {fpedersoli,mcristiano}@spinnetworks.com Abstract. 
The product nokLINK is a communication protocol carrier which transports data with greater efficiency and vastly greater security by encrypting, compressing and routing information between two or more end-points. nokLINK creates a virtual, “dark” application (port) specific tunnel which ensures protection of end-points by removing their exposure to the Internet. By removing the exposure of both end-points (Client and Server) to the Internet and LAN, you remove the ability for someone or something to attack either end-point. If both endpoint have not entry point, attack becomes extremely, if not impossible to succeed. nokLINK is Operating System independent and the protection level is applied starting from the application itself: advanced anti-reverse engineering technique are used, a full executable encryption and also the space memory used by nokLINK is encrypted. The MASTER-DNS like structure permit to be very resistant also to Denial of Service attack and the solution management is completely decoupled by the Admin or root rights: only nokLINK Administrator can access to security configuration parameters. 1 Overview The inherent makeup of nokLINK implies two purposes. The first is the name of a communications protocol that has the potential to work with or without TCP/IP. This protocol includes everything needed to encrypt, route, resolve names, and ensure the delivery of upper layer packets. The protocol itself is independent of any particular operating system. It has the potential of running on any OS or even be included in any hardware solutions as firmware. The second is “nokLINK The Product, a communication protocol carrier which transports data with greater efficiency and vastly greater security by encrypting, compressing and routing information between two or more end-points. nokLINK creates a virtual “dark” application [port] specific tunnel, which ensures protection of endpoints by removing their exposure to the Internet. By removing the exposure of both end-points to the internet and LAN, you remove the ability for someone or something to attack either end-point. If both end-point have not entry point, attack becomes extremely, if not impossible to succeed. If it can’t been seen, it can’t be attacked. In most scenarios, if you block inbound access to an end-point, then you loose the ability to communicate with that device, but with nokLINK any permitted application can communicate in a bi-directional (2-way) manner but contrary to typical communication, without exposing those applications to not-authorized devices. The result is increased security plus improved availability without inheriting security threats. E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 301–308, 2009. © Springer-Verlag Berlin Heidelberg 2009 springerlink.com 302 F. Pedersoli and M. Cristiano nokLINK works similar to DNS by receiving client requests and resolving to a server but in addition to routing requests, nokLINK provides strong authentication. nokLINK provides this “DNS” like functionality without exposing in-bound connections to the internet through the use of an intermediate “Master Broker”. Communication routing is not possible without the nokLINK Master Broker authorizing each device’s permission. Once permission is granted, client and server can communicate via the Master without interruption. Conceptually, the nokLINK master is a smart router with built in encryption and authentication. 
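As a conceptual illustration of this broker model (not nokLINK's actual implementation; all class and function names below are ours), the master can be thought of as a relay that first checks a permission table and then forwards opaque, already-encrypted payloads between registered endpoints without ever holding the keys.

```python
class MasterBroker:
    """Toy model of a permission-checking relay: it routes by .vsx name but only
    ever sees ciphertext produced by the endpoints themselves."""

    def __init__(self):
        self.endpoints = {}        # vsx name -> callable receiving (sender, ciphertext)
        self.permissions = set()   # allowed (src, dst) pairs

    def register(self, name, deliver):
        self.endpoints[name] = deliver

    def allow(self, src, dst):
        self.permissions.add((src, dst))

    def route(self, src, dst, ciphertext: bytes):
        if (src, dst) not in self.permissions:
            raise PermissionError(f"{src} is not authorized to reach {dst}")
        self.endpoints[dst](src, ciphertext)   # forwarded unread

# usage sketch
broker = MasterBroker()
broker.register("server.company.vsx", lambda s, c: print(f"from {s}: {len(c)} bytes"))
broker.allow("client.company.vsx", "server.company.vsx")
broker.route("client.company.vsx", "server.company.vsx", b"...already encrypted payload...")
```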
If a system like nokLINK can be deployed without exposing both client and server end-point to inbound requests, then a device firewall can be used to ensure both endpoints are protected from potential intrusions. nokLINK includes a software firewall with equivalent to or better security than that of a hardware-based firewall to protect each machine from any other device. An increasing threat in corporate systems is LAN based attacks which are typically much harder to stop without losing productivity. By implementing nokLINK and the nokLINK firewall, an organization can maintain even higher levels of availability without exposure to attacks. Almost all systems today relay on Internet Protocol (IP) to communicate but even on the same LAN. nokLINK removes the dependency on Internet Protocol (to date IP is still utilized, but simply for convenience). In fact, nokLINK allows for the elimination of virtually all of the complex private communications lines, IP router configuration, and management. Given that it is protocol-independent, it means that almost any IP-based communication can benefit from the secure tunneling that nokLINK provide. nokLINK can be used for many IP-based applications. 2 Architecture There are four nokLINK components which make up the nokLINK architecture. A single device may contain: just the client, client + server, the master or master + authenticator in a single installation. 1. NOKLINK CLIENT: The nokLINK client is the component that allows a computer to access applications [ports] on another device with the nokLINK server component. This client component is part of all nokLINK installs in the form of an “Installation ID”. The Installation ID is associated with this component. The client itself may be context-less; this means that the nokLINK may have permission to connect to any server in any context (given proper permission configuration) without having to reinstall any software. In other words, any client could connect to http://company.vsx and http://other.vsx just by typing in the address in the browser. 2. NOKLINK SERVER: The nokLINK server component is the component that allows nokLINK clients to connect to local applications based on a vsx name. For instance, if a web server needed securing, a nokLINK server would be installed on the web server; then anyone with a nokLINK client and permission could access that server from anywhere in the world by its vsx name, i.e. http://company.vsx. The server and client together are the components that create the tunnel. No other component can “see” into the transmission tunnel of any other real time pair of communicating server and client. The encryption system used between client and nokLINK: A New Solution for Enterprise Security 303 server ensures that only the intended recipient has the ability to un-package the communication and read the data, this includes the master component. 3. NOKLINK MASTERCOMPONENT: The Master components has two main purposes: authenticating devices and routing communications. While the Master is responsible for routing communications between end points, it is not part of the communication tunnel and therefore cannot read data between them. This ensure that endpoint to endpoint security is always maintained. 4. NOKLINK MASTER AUTHENTICATOR (NA): The nokLINK Master Authenticator (NA) is the console for setting authentication and access rights for each nokLINK enabled device within each nokLINK vsx context. 
A web interface provides administrators a system to control nokLINK’s transport security via nokLINK names, nokLINK domains and nokLINK sub-domains. For example an administrator can allow a machine called webserver.sales.company.vsx to communicate only to xxx.sales.company.vsx or xxx.company.vsx or one specific nokLINK machine. Administrators can manage device security settings in a global manner or in a very specific manner depending on the companies objectives. Besides other main functions are: 1. nokLINK Communication Interceptor: The component that provides seamless use of nokLINK for the client and server is a “Shim” which intercepts .vsx communication and routes the requests to the nokLINK master. The nokLINK shim intercepts, compresses, encrypts and routes data including attaching the routing information required for the master to deliver. The data is wrapped by the nokLINK protocol, essentially transforming it from the original protocol to the nokLINK protocol. By wrapping nokLINK around the original protocol you can further ensure the privacy of the data and the privacy of the protocol in use. Packet inspection systems used to filter and block specific protocols are ineffective in identifying protocols secured by nokLINK. Upon arrival of the data at the endpoint, nokLINK unpacks the communication back to the original protocol and reintroduces the data to the local IP stack to ensure the data is presented transparently to the upper level applications. As a result, nokLINK can be introduced to virtually any application seamlessly. 2. Device Authorization: The node authorization and rules configuration is managed at the nokLINK Authenticator. The Master authenticates, thus it dictates which client can be a part of a specific nokLINK context. During install, a unique “DNA” signature (like TPM via software) is created along with a .vsx name which is registered with the nokLINK Authenticator (NA). The nokLINK device identifies itself to the Master and registers its name upon installation. The Master determines the authenticity of inquiring nokLINK device and its right to conduct the requested activity. When access is requested for a specific machine, the master authenticates the machine but does not interfere with authentication for the application in use. The Master is like a hall monitor; i.e. it does not know what the person will do in any particular room he has permission to visit but has full control of who can get to what room. 304 F. Pedersoli and M. Cristiano 3 Features and Functionality nokLINK provides many features and functionality depending on implementation, objectives and configuration including: • Secure communication protocol able to encrypting, compressing and routing information between end-points. • Virtual “Dark” network that ensures protection of end-points removing exposure to the Internet. • Seamless access to services from network to network without re-configuration of firewalls or IP addresses. • Communication between systems without those systems being visible to Internet. • Low level software firewall. • Protocol independent, which means that any communication can be secured. Most extra-net connectivity products today offer connectivity for clients to a LAN from within a LAN or from the internet. A simple client is installed on the users’ PC; this allows users access to the corporate network. Unfortunately this access is also available to anybody else who knows a user’s name and has time and/or the patience to guess passwords. 
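The text gives only an example of such a rule; the configuration format of the nokLINK Authenticator is not documented here. The following is a toy sketch of how name-scoped permissions of that kind could be expressed and checked; the pattern syntax, rule table and function names are our assumptions.

```python
from fnmatch import fnmatch

# Hypothetical rule table: device name -> name patterns it may talk to.
RULES = {
    "webserver.sales.company.vsx": ["*.sales.company.vsx", "*.company.vsx"],
    "laptop.guest.company.vsx":    ["mail.company.vsx"],    # one specific machine
}

def may_communicate(src: str, dst: str) -> bool:
    """True if the authenticator's rules allow src to open a tunnel to dst."""
    return any(fnmatch(dst, pattern) for pattern in RULES.get(src, []))

print(may_communicate("webserver.sales.company.vsx", "db.sales.company.vsx"))  # True
print(may_communicate("webserver.sales.company.vsx", "db.other.vsx"))          # False
```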
nokLINK functions differently than a VPN. nokLINK is not network specific and does not attach clients to foreign networks. nokLINK install client software that identifies each PC individually and provide remote access to applications instead of remote access to VPNs. This, coupled with the nokLINK authenticator, ensures the identification of any device containing nokLINK attempting to get at company data. For further security, nokLINK opens only individual, user configured ports to individual nodes, thus protecting other assets to which access is not permitted from outside PCs. End-point to end-point security starts with the PC identification. At installation the nokLINK client creates a unique DNA signature based on many variables including hardware characteristics of the PC and time of installation. Every instance of nokLINK is unique regardless of the operating environment to further eliminate the possibility of spoofing. When communication is initiated, the nokLINK server receive a noklink name terminating in .vsx. This naming scheme is identical to DNS naming schemes. The difference is that only nokLINK clients understand .vsx extension. This name is used instead of standard DNS names when accessing nokLINK servers. For instance, if a web server is being protected by nokLINK than the nokLINK enabled end user would type http://webserver.mycomp.vsx into their browser. The nokKERNEL take the request, encrypts the information and sends it out to one or more nokLINK Master. This allow a workstation to communicate with a server without either of them being visible on the Internet, as it is shown in the Fig. 1. 4 Security Elements nokLINK is a multi-layered monolithic security solution. Using various techniques, it encloses everything needed to secure communications between any two nokLINK nokLINK: A New Solution for Enterprise Security 305 enabled nodes, using various techniques to do this. It impacts three different security areas: Encryption Security, Transport Security, End Point Security. 4.1 Encryption Security The strength of public algorithms is well-known. nokLINK uses state of the art encryption algorithm, but goes further than just the single level of encryption. The information traded between systems is not the actual key or algorithm. It is simply synchronization information which only the two end points understand, that is the equivalent of one end node telling the other “Use RNG (Random Number Generators) four to decode this message.” The strength of nokLINK’s encryption is based on a family of new Random Number Generators This RNG family is based on years of research in this area. The nokLINK encryption system encrypts three times before sending out the packet: once for the actual data going out, once for the packet header and finally both together. The upper-layer data is encrypted with a synchronization key. The key is not an actual key, it contains information for the system to synchronize the RNGs on the end points. This way the system stays as secure, but with much less overhead. The only two nodes that understand this encrypted data are the client and the server. The intermediate machines do not and cannot open this section of the packet. Fig. 1. 4.2 Transport Security This layer of security is an extra layer of security in comparison with other security solutions. It deals with permissions for communications and the dynamic format of the nokLINK packets and it is composed by: • TRAFFIC CONTROLLER: nokLINK affords a new control that eliminates this type of attack. 
While maintaining all the encryption security of other products, nokLINK includes controls to mandate which nodes can communicate with which 306 F. Pedersoli and M. Cristiano other nodes. The basic requirement is to get a hold of a particular nokLINK client using the same nokLINK context. Each nokLINK context uses a different family of encryption RNG’s and cannot be used to communicate with another context. Without that nokLINK client, the nokLINK server remains invisible to potential intruders, as it is shown in Fig. 2. In the nokLINK vsx name environment the attacker can’t see the server to attack because the name is sent to a DNS server for resolution. This makes it extremely difficult, if not impossible to break into a nokLINK server, while leaving it completely accessible to those who need it. In a nokLINK environment the nodes are identified uniquely. The master server uses this particular ID to determine which nodes are permitted to communicate with which servers; all controlled by the end user. In the unlikely event that someone comes up with a super crack in the next ten years that can read nokLINK packets, they still will not be able to communicate directly with another nokLINK node because of this level of security, as it is shown in Fig.3. Here you see users accessing exactly those services and applications they are allowed to access. There is redundancy in the security. A PC must have a nokLINK v2 client installed and permission must have been granted between the client and the server in the nokLINK Master Authenticator. Each PC generates an exclusive unique identifier at install time. The system recognizes this ID and uses it for control, i.e. nokLINK Client ID 456 can communicate with nokLINK Server ID 222. If the hard drive is removed from the PC, or is attempted to be cloned, it is likely that the ID will be corrupted because of the change of hardware. If not, as soon as one of the PCs [cloned or original] connects, all the device with the same ID [whether the original and/or cloned] will stop communicating, as the system will allow only one PC with a specific ID to operate in the nokLINK environment. • Dynamic Packet Format: the format of the nokLINK protocol is dynamic. The location of “header” information changes from packet to packet. The implication is that, in the unlikely event that a packet is broken, it will be absolutely useless to attempt to replicate it in efforts of breaking other packets. Entirely new efforts will have to be put forth to break a second and a third and a forth (and so on) packet. With other protocols, the header information is always in the same spot, making it easy to sniff, analyze and manipulate data. 4.3 End Point Security Another enhancement, and probably the most significant compared to other security solutions, is nokLINK’s end point security. A significant effort of this security is based on anti-reverse engineering techniques. There are many facets to anti-reverse engineering including: • • • • • All code in memory is encrypted. Each time the code passes from ring 0 to ring 3 it is encrypted to avoid monitoring. Each executable is generated individually. A unique identifier is generated at install time for each node. Protection versus cloning - if someone successfully clones it will stop working, thus alerting the system administrator of a problem. nokLINK: A New Solution for Enterprise Security 307 • There are certain features embedded in the software to allow it to detect the presence of debugging software. 
Once detected, nokLINK takes steps to avoid being hacked into. Fig. 2. Fig. 3. 4.4 nokLINK Firewall The nokLINK firewall is a low level firewall which blocks incoming and outgoing packets to and from the Network Interface Card (NIC). This would normally block all communications in and out of the PC. nokLINK still manages to communicate via standard ports to other vsx nodes through the permanent inclusion of an outgoing connection exception from port 14015 (the default nokLINK tunneling port). Any vsx name presented to the TCP/IP stack is treated well before it reaches the Internet card. 308 F. Pedersoli and M. Cristiano Once the system recognizes the vsx name, it diverts the packet to nokLINK. nokLINK encrypts and encapsulates the packet sending it out via port 14015, to the nokLINK Master. In this way, any applications utilizing the vsx name can communicate from “behind” the nokLINK firewall. The following graphic shows the internal configuration of a PC. This machine would be able to open a browser and access a site using a vsx name (i.e http://webserver.noklink.vsx). It would not be able to access a site without a vsx name. In addition to nokLINK traffic, this machine would only be capable of SMTP traffic for outgoing mail and POP3 for incoming mail, as it is shown in Fig. 4. Fig. 4. References 1. Gleeson, B., Lin, A., Heinane, J., Armitage, G., Malis, A.: A Framework for IP based Virtual Private Networks. Internet Engineering Task Force, RFC 2764 (2000) 2. Herrero, Á., Corchado, E., Gastaldo, P., Leoncini, D., Picasso, F., Zunino, R.: Intrusion Detection at Packet Level by Unsupervised Architectures. In: Yin, H., Tino, P., Corchado, E., Byrne, W., Yao, X. (eds.) IDEAL 2007. LNCS, vol. 4881, pp. 718–727. Springer, Heidelberg (2007) 3. Kaufman, C., Perlman, R., Speciner, M.: Network Security: Private Communication in a Public World, 2nd edn. Prentice Hall, Englewood Cliffs (2002) SLA and LAC: New Solutions for Security Monitoring in the Enterprise Bruno Giacometti IFINET s.r.l., Via XX Settembre 12, 37129 Verona, Italy b.giacometti@ifinet.it Abstract. SLA and LAC are the solutions developed by IFInet to better analyze firewalls logs and monitor network accesses respectively. SLA collects the logs generated by several firewalls and consolidates them; by means of SLA all the logs are analyzed, catalogued and related based on rules and algorithms defined by IFInet. LAC allows IFInet to identify and isolate devices that access to a LAN in an unauthorized manner, its operation is totally non-intrusive and its installation does not require any change neither to the structure of the network nor to single hosts that compose it. 1 Overview As of today most organizations use firewalls and monitor their networks. The ability to screening firewall logs to determine suspicious traffic is the key to an efficient utilization of firewall and IDS/IPS systems. Anyway this is a difficult task, particularly if there is the need to analyze a great amount of log. SLA is IFInet approach to solve this problem. In the same way it’s fundamental to monitor internal networks, but monitoring alone is not sufficient: it’s necessary a proactive control on internal networks to prevent unauthorized accesses. 2 SLA: Security Log Analyzer In order to make more effective monitoring firewalls activities log, IFInet has created a proprietary application, called Security Log Analyzer (SLA), which collects the logs generated by firewalls and consolidates them into a SQL database. 
SLA points out to technicians any anomaly or suspicious activity detected and automatically recognizes and catalogues most of the traffic (about 90%) seen by the firewall, allowing IFInet's technical staff to focus on the analysis and correlation of the remaining logs. Through SLA, the perimeter security system is controlled by analysing data on traffic, ports and services, and through a wide variety of charts that allow IFInet technicians to monitor security events: analysis of the traffic in a given timeframe, comparison with the traffic volume over a significant reference period (e.g. the previous month) and detection of possible anomalies. The chart in Fig. 1, for example, highlights the outgoing traffic, divided by service.
Fig. 1. The outgoing traffic, divided by service
SLA records in a Black List all IP addresses that carry out illegal or potentially intrusive activities, on the basis of the rules implemented by IFInet technicians: from the moment an IP address is added to the Black List, any subsequent activity from it is detected in real time by IFInet technicians, who analyse the event and take immediate action. With SLA, IFInet technicians can run extremely detailed queries on the log database and can view, in real time, only the logs relating to events of critical relevance, which therefore represent a danger to the integrity of customer networks and systems.
Fig. 2 shows how, for one customer, SLA highlighted many IP scans of its network and, in particular, indicated that part of this potentially intrusive activity passed through the firewall, taking advantage of traffic permitted by the security policies. Thanks to the highlighted event, the technical staff can instantly detect whether a particular external address, which previously performed a host scan (and was therefore added to the Black List by SLA), has managed to cross the perimeter security system through an open port and reach a system (e.g. a web server) inside the customer's network. By selecting the event of interest (the line highlighted by the red rectangle), the technician can see the details of the anomalous activity. SLA reports the number of host-scan attempts against the public IP addresses assigned to a particular customer; as a further example, a host scan made towards a range of 18 addresses can be seen. By selecting the relevant line, the technician can analyse the details of this anomalous activity and, for example, establish that the host scan was caused by the Sasser worm present on 151.8.35.67. All intrusive events, or events that may otherwise pose a danger to the integrity of the customer's network, are promptly notified by email (or telephone).
Fig. 2. Example of SLA report
This service provides constant and effective control of all the events recorded by the firewall and allows for timely intervention in case of critical events.
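Before turning to LAC, a sketch of the Black List mechanism described above may help fix ideas. The actual rules and thresholds SLA applies are proprietary; the code below only illustrates the general shape of one such rule, flagging a source address as a host scanner (and black-listing it) once it has probed more than a chosen number of distinct destination hosts within a time window. The threshold, window and data structures are assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SCAN_WINDOW = timedelta(minutes=5)   # assumed observation window
SCAN_THRESHOLD = 15                  # assumed number of distinct targets

blacklist = set()
recent_targets = defaultdict(list)   # src_ip -> [(timestamp, dst_ip), ...]


def process_log_entry(ts: datetime, src_ip: str, dst_ip: str) -> None:
    """Flag src_ip as a host scanner if it probes too many distinct hosts."""
    recent_targets[src_ip].append((ts, dst_ip))
    # Keep only events inside the observation window.
    recent_targets[src_ip] = [
        (t, d) for (t, d) in recent_targets[src_ip] if ts - t <= SCAN_WINDOW
    ]
    distinct_hosts = {d for (_, d) in recent_targets[src_ip]}
    if len(distinct_hosts) > SCAN_THRESHOLD and src_ip not in blacklist:
        blacklist.add(src_ip)
        print(f"{ts} blacklisted {src_ip}: host scan over "
              f"{len(distinct_hosts)} addresses")


# Toy example: one source probing a range of 18 addresses.
now = datetime(2008, 6, 1, 10, 0, 0)
for i in range(18):
    process_log_entry(now + timedelta(seconds=i), "151.8.35.67",
                      f"10.0.0.{i + 1}")
```

In SLA the equivalent logic runs against the consolidated log database, and further rules then correlate later traffic from black-listed addresses with the ports actually open in the customer's security policy.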
3 LAC: LAN Access Control
LAC (LAN Access Control) is the solution that allows IFInet to identify and isolate devices (PCs and other network equipment) that access a LAN in an unauthorized manner. Unlike many systems currently on the market, the operation of LAC is totally non-intrusive and its installation requires no change either to the structure of the network or to the individual hosts that compose it. LAC therefore prevents unauthorized devices from accessing the corporate network: access is allowed only after certification by an authorized user (e.g. the network administrator).
LAC can operate in two modes: active and passive. In passive mode (the default) all the hosts on the network segment to which LAC is connected are detected. In active mode the unauthorized hosts are virtually isolated from the corporate network. LAC therefore enforces the following rules:
• the organization's hosts and equipment are allowed to use the network without any limitation;
• guests are allowed to access the network after authorization, but only for a limited amount of time and using certain services (e.g. Internet navigation and e-mail);
• all other hosts cannot connect, because they are not authorized.
LAC has the following key aspects:
• Security: it identifies and isolates devices (PCs and other network equipment) that access a LAN in an unauthorized manner.
• Ease of management: LAC operates in an entirely non-intrusive way and requires no changes either to the structure of the network or to the individual hosts that make it up.
• Guest user management: total control over how and what guests are allowed to do.
• Important features: census of the nodes of one or more LAN/WLAN or wireless networks, isolation of unauthorized hosts, control of permitted services, reporting tools and alerting.
The main features of LAC are:
1. Census of the LAN nodes. When LAC runs in passive mode, it records in a database all the hosts detected on the network; in particular, it keeps track of each host's IP and MAC addresses. In order to detect all the hosts, LAC must keep running for a variable amount of time, depending on how often hosts access the network. An example of this detection is shown in Fig. 3.
2. Isolation of unauthorized hosts. Used in active mode, LAC performs a virtual isolation of all non-authorized hosts: once isolated, a host can no longer send or receive data to or from any other host except LAC.
3. Host detection. One interesting piece of information about a detected host is its physical location within the organization: LAC detects the switch port to which the unauthorized host is physically connected. This feature works only if the network switches in use support SNMP; all the network switches (included in the LAC configuration) are interrogated using specific SNMP queries.
4. Managing guest users. An organization is often visited by external staff who need to connect their notebooks to the network and to use certain services (e.g. Internet navigation and e-mail). If such a host accesses the network without any authorization, LAC detects the new host and treats it as unauthorized. The guest, however, if in possession of appropriate credentials (e.g. supplied by staff at the reception), can authenticate to unlock the PC and use the enabled services. When registering a new host, LAC can be asked to generate a new user (or edit an existing one), to generate access credentials (username and password) and to set their period of validity (a minimal sketch of this step is given after this list). Using this information the staff, through a specific web interface, can enable or unlock the desired host. Guests only need to connect their PC to the network, browse to the LAC web page and enter the credentials provided. As an optional feature, the system administrator can choose from a list the protocols that the host may use once authenticated; this makes it possible to limit the resources available on the network and thus to keep effective control over allowed activities.
5. Reports. LAC provides a series of reports that allow the administrator to analyse network access data, also with graphs, based on several criteria: MAC address, IP address, usernames, denied accesses, enabled accesses, average duration of accesses, etc.
6. Alerting. LAC records all the collected events in a database and is able to send the information needed to control access to the network to the network administrator via email, SMS or the administrative GUI, as shown in Fig. 4.
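How LAC actually generates and stores guest credentials is not specified in the paper. The sketch below (all names and the default validity period are assumptions) only illustrates the idea described in feature 4: create a guest user, give it a random password and attach an expiry time after which the host reverts to the unauthorized status.

```python
import secrets
import string
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class GuestCredential:
    username: str
    password: str
    valid_until: datetime

    def is_valid(self, at: datetime) -> bool:
        return at <= self.valid_until


def create_guest(name: str, hours_valid: int = 8) -> GuestCredential:
    """Generate time-limited access credentials for a guest host."""
    alphabet = string.ascii_letters + string.digits
    password = "".join(secrets.choice(alphabet) for _ in range(10))
    return GuestCredential(
        username=f"guest-{name}",
        password=password,
        valid_until=datetime.now() + timedelta(hours=hours_valid),
    )


# Reception staff registers a visiting consultant for one working day.
cred = create_guest("consultant01")
print(cred.username, cred.password, "valid until", cred.valid_until)
```

In LAC these credentials are handed to the guest at the reception and checked by the captive web page mentioned above; when `valid_until` passes, the host falls back to the unauthorized status.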
Fig. 3. Census of the LAN nodes
3.1 Sure to Be Safe?
Organizations with an information system, whether public or private, certainly have a network infrastructure that enables the exchange of information and teamwork. The corporate network is in most cases built according to the Ethernet standard, which combines speed and reliability with very limited cost. It is very easy, in fact, to attach a device to an Ethernet network. For the same reason, however, such a network is exposed to a very common kind of violation, which may be called "physical intrusion": it can take place through a spare RJ45 socket, the network cable of a disconnected device, a switch or hub port, and so on. If an attacker connects his laptop to an access point of an Ethernet network, the PC becomes part of the network and can therefore access its resources and even compromise the security of corporate data and information. It therefore becomes extremely important to recognize immediately the moment an unauthorized host accesses the network and to prevent it from using the network's resources. At the same time it is important to guarantee normal operation of the network to all allowed hosts, for example a commercial agent or an external consultant (a "guest" user) who needs to connect a notebook to the network in order to use its resources.
3.2 LAC Architecture
LAC therefore takes care of ensuring compliance with the network access rules and of protecting the integrity and safety of corporate data. LAC is a software solution consisting of a set of applications running on a Linux-based system, composed of the following elements: a web administration interface, the core program (LAC), a database for user management, a database for information recovery and, optionally, an appliance equipped with 3 or 6 interfaces to manage up to 3 or 6 LANs/VLANs.
Fig. 4. Alerting
3.3 LAC Operational Modes
LAC can operate in two modes: active and passive. In passive mode (the default) all the hosts on the network segment to which LAC is connected are detected. In active mode the unauthorized hosts are virtually isolated from the network. Each node on the network can be in one of the following statuses: unauthorized, authorized, authorized with authentication, or address mismatch. In the unauthorized status, a host is virtually isolated from the network; this prevents it from transmitting and receiving data. In the authorized status a host receives no treatment from LAC and operates normally. A user of an unauthorized host may authenticate to LAC; if authentication succeeds, LAC revokes the unauthorized status and grants authorization for a limited amount of time. This status is called "authorized with authentication"; when the authorization time expires, the host reverts to the unauthorized status. LAC is also able to detect changes in the IP address of a host and handles this potential anomaly by assigning the address mismatch status to that host. The management of statuses and of their validity times is entrusted to LAC, which is responsible for assigning a status to each host and may change the status of a host at any time. The credentials used to authenticate a user (in case the administrator needs to unlock a client, enabling its access to the network) are generated by LAC on behalf of the administrator.
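As a compact restatement of the status model just described, the sketch below encodes the four node statuses and the transitions mentioned in the text (successful authentication, expiry of the authorization time, detection of an IP change). Class, method and attribute names are assumptions, not LAC's API.

```python
from datetime import datetime, timedelta
from enum import Enum


class Status(Enum):
    UNAUTHORIZED = "unauthorized"                      # virtually isolated
    AUTHORIZED = "authorized"                          # untouched by LAC
    AUTHORIZED_BY_AUTH = "authorized with authentication"
    ADDRESS_MISMATCH = "address mismatch"              # IP change detected


class Node:
    def __init__(self, mac: str, ip: str):
        self.mac, self.ip = mac, ip
        self.status = Status.UNAUTHORIZED
        self.expires = None

    def authenticate(self, ok: bool, now: datetime,
                     validity: timedelta = timedelta(hours=8)) -> None:
        """Successful authentication grants time-limited authorization."""
        if ok and self.status == Status.UNAUTHORIZED:
            self.status = Status.AUTHORIZED_BY_AUTH
            self.expires = now + validity

    def tick(self, now: datetime, current_ip: str) -> None:
        """Periodic re-evaluation of the node's status."""
        if current_ip != self.ip:
            self.status = Status.ADDRESS_MISMATCH
        elif (self.status == Status.AUTHORIZED_BY_AUTH
              and self.expires is not None and now > self.expires):
            self.status = Status.UNAUTHORIZED   # authorization expired


# Example: a guest host authenticates, then its authorization lapses.
now = datetime(2008, 6, 1, 9, 0)
node = Node("00:11:22:33:44:55", "192.168.1.50")
node.authenticate(ok=True, now=now)
node.tick(now + timedelta(hours=9), "192.168.1.50")
print(node.status)   # Status.UNAUTHORIZED again
```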
3.4 LAC Requirements
Briefly, the requirements for the operation of LAC are:
• an Ethernet network;
• the IP version 4 network protocol;
• a network access point for LAC on each network segment (LAN or VLAN) to be monitored/controlled.
4 Conclusion
We have analysed how a network can be proactively protected, and the results have been used to implement SLA and LAC. With SLA and LAC it is possible to identify and block suspicious and potentially dangerous network traffic. Future development of SLA will focus on two major areas: real-time analysis of the monitored devices, in order to identify suspicious traffic as soon as it enters the perimeter, and correlation of firewall logs with logs originating from other security devices, e.g. antivirus systems. LAC will also correlate information from the Ethernet network with information from other devices, e.g. 802.1x switches or network scanners, in order to enforce a finer-grained control on the network.
References
1. Abad, C., Taylor, J., Sengul, C., Yurcik, W., Zhou, Y., Rowe, K.: Log correlation for intrusion detection: a proof of concept. In: Proc. 19th Annual Computer Security Applications Conference, pp. 255–264. IEEE Press, New York (2003)
2. Cuppens, F., Miege, A.: Alert correlation in a cooperative intrusion detection framework. In: Proc. 2002 IEEE Symposium on Security and Privacy, pp. 202–215. IEEE Press, New York (2002)
3. Debar, H., Wespi, A.: Aggregation and Correlation of Intrusion-Detection Alerts. In: Proc. 4th Int. Symp. Recent Advances in Intrusion Detection, RAID 2001, pp. 85–103. Springer, Berlin (2001)
4. Corchado, E., Herrero, Á., Sáiz, J.M.: Detecting compounded anomalous SNMP situations using cooperative unsupervised pattern recognition. In: Duch, W., Kacprzyk, J., Oja, E., Zadrożny, S. (eds.) ICANN 2005. LNCS, vol. 3697, pp. 905–910. Springer, Heidelberg (2005)
5. Herrero, Á., Corchado, E., Gastaldo, P., Zunino, R.: A comparison of neural projection techniques applied to Intrusion Detection Systems. In: Sandoval, F., Gonzalez Prieto, A., Cabestany, J., Graña, M. (eds.) IWANN 2007. LNCS, vol. 4507, pp. 1138–1146. Springer, Heidelberg (2007)
6. Ridella, S., Rovetta, S., Zunino, R.: Circular back-propagation networks for classification. IEEE Trans. on Neural Networks 8, 84–97 (1997)

Author Index
Agarwal, Suneeta 219, 251
Aiello, Maurizio 170
Alshammari, Riyad 203
Appiani, Enrico 43
Baiocchi, Andrea 131, 178
Banasik, Arkadiusz 274
Banković, Zorana 147
Ben-Neji, Nizar 211
Bengherabi, Messaoud 243
Bojanić, Slobodan 147
Booker, Queen E. 19
Bouhoula, Adel 123, 211
Bovino, Fabio Antonio 280
Briola, Daniela 108
Bui, The Duy 266
Buslacchi, Giuseppe 43
Caccia, Riccardo 108
Cavallini, Angelo 27
Chatzis, Nikolaos 186
Cheriet, Mohamed 243
Chiarella, Davide 170
Cignini, Lorenzo 131
Cimato, S. 227
Cislaghi, Mauro 11
Corchado, Emilio 155
Cristiano, Massimiliano 301
De Domenico, Andrea 116
Decherchi, Sergio 61
DeJiao, Niu 294
Domínguez, E. 139
Eleftherakis, George 11
Eliades, Demetrios G. 69
Faggioni, Osvaldo 100
Ferri, Sara 11
Flammini, Francesco 92
Forte, Dario V. 27
Gabellone, Amleto 100
Gaglione, Andrea 92
Gamassi, M. 227
Gastaldo, Paolo 61
Giacometti, Bruno 309
Giuffrida, Valerio 11
Golledge, Ian 195
Gopych, Petro 258
Guazzoni, Carlo 1
Guessoum, Abderrazak 243
Gupta, Rahul 219
Harizi, Farid 243
Herrero, Álvaro 155
Hussain, Aini 235
Jonsson, Erland 84
Kapczyński, Adrian 274
Kollias, Stefanos 286
Larson, Ulf E. 84
Le, Thi Hoi 266
Leoncini, Davide 100
Lieto, Giuseppe 163
Lioy, Antonio 77
Losio, Luca 27
Luque, R.M. 139
Maggiani, Paolo 100
Maiolini, Gianluca 131, 178
Martelli, Maurizio 108
Maruti, Cristiano 27
Mascardi, Viviana 108
Matsumoto, Soutaro 123
Mazzaron, Paolo 116
Mazzilli, Roberto 11
Mazzino, Nadia 116
Mazzocca, Nicola 92
Meda, Ermete 116
Mezai, Lamia 243
Milani, Carlo 108
Mohier, Francois 11
Molina, Giacomo 178
Moscato, Vincenzo 92
Motta, Lorenzo 116
Muñoz, J. 139
Negroni, Elisa 11
Neri, F. 35
Nieto-Taladriz, Octavio 147
Nilsson, Dennis K. 84
Ntalianis, Klimis 286
Orlandi, Thomas 27
Orsini, Fabio 163
Pagano, Genoveffa 163
Palomo, E.J. 139
Papaleo, Gianluca 170
Pedersoli, Francesco 301
Pettoni, M. 35
Picasso, Francesco 84, 116
Piuri, V. 227
Polycarpou, Marios M. 69
Popescu-Zeletin, Radu 186
Pragliola, Concetta 92
Ramli, Dzati Athiar 235
Ramunno, Gianluca 77
Redi, Judith 61
Rizzi, Antonello 178
Ronsivalle, Gaetano Bruno 1
Sachenko, Anatoly 274
Samad, Salina Abdul 235
Sassi, R. 227
Scotti, F. 227
ShiGuang, Ju 294
Soldani, Maurizio 100
Tamponi, Aldo 116
Tao, Cai 294
Tzouveli, Paraskevi 286
Verma, Manish 251
Vernizzi, Davide 77
Vrusias, Bogdan 195
Wei, Zhong 294
Zambelli, Michele 27
Zanasi, Alessandro 53
Zincir-Heywood, A. Nur 203
Zunino, Rodolfo 61