Advances in Soft Computing
Editor-in-Chief: J. Kacprzyk
53

Advances in Soft Computing
Editor-in-Chief
Prof. Janusz Kacprzyk
Systems Research Institute
Polish Academy of Sciences
ul. Newelska 6
01-447 Warsaw
Poland
E-mail: kacprzyk@ibspan.waw.pl

Further volumes of this series can be found on our homepage: springer.com

Bernd Reusch (Ed.)
Computational Intelligence, Theory and Applications, 2006
ISBN 978-3-540-34780-4

Jonathan Lawry, Enrique Miranda, Alberto Bugarín, Shoumei Li, María Á. Gil, Przemysław Grzegorzewski, Olgierd Hryniewicz (Eds.)
Soft Methods for Integrated Uncertainty Modelling, 2006
ISBN 978-3-540-34776-7

Ashraf Saad, Erel Avineri, Keshav Dahal, Muhammad Sarfraz, Rajkumar Roy (Eds.)
Soft Computing in Industrial Applications, 2007
ISBN 978-3-540-70704-2

Bing-Yuan Cao (Ed.)
Fuzzy Information and Engineering, 2007
ISBN 978-3-540-71440-8

Patricia Melin, Oscar Castillo, Eduardo Gómez Ramírez, Janusz Kacprzyk, Witold Pedrycz (Eds.)
Analysis and Design of Intelligent Systems Using Soft Computing Techniques, 2007
ISBN 978-3-540-72431-5

Oscar Castillo, Patricia Melin, Oscar Montiel Ross, Roberto Sepúlveda Cruz, Witold Pedrycz, Janusz Kacprzyk (Eds.)
Theoretical Advances and Applications of Fuzzy Logic and Soft Computing, 2007
ISBN 978-3-540-72433-9

Katarzyna M. Węgrzyn-Wolska, Piotr S. Szczepaniak (Eds.)
Advances in Intelligent Web Mastering, 2007
ISBN 978-3-540-72574-9

Emilio Corchado, Juan M. Corchado, Ajith Abraham (Eds.)
Innovations in Hybrid Intelligent Systems, 2007
ISBN 978-3-540-74971-4

Marek Kurzynski, Edward Puchala, Michal Wozniak, Andrzej Zolnierek (Eds.)
Computer Recognition Systems 2, 2007
ISBN 978-3-540-75174-8

Van-Nam Huynh, Yoshiteru Nakamori, Hiroakira Ono, Jonathan Lawry, Vladik Kreinovich, Hung T. Nguyen (Eds.)
Interval / Probabilistic Uncertainty and Non-classical Logics, 2008
ISBN 978-3-540-77663-5

Ewa Pietka, Jacek Kawa (Eds.)
Information Technologies in Biomedicine, 2008
ISBN 978-3-540-68167-0

Didier Dubois, M. Asunción Lubiano, Henri Prade, María Ángeles Gil, Przemysław Grzegorzewski, Olgierd Hryniewicz (Eds.)
Soft Methods for Handling Variability and Imprecision, 2008
ISBN 978-3-540-85026-7

Juan M. Corchado, Francisco de Paz, Miguel P. Rocha, Florentino Fernández Riverola (Eds.)
2nd International Workshop on Practical Applications of Computational Biology and Bioinformatics (IWPACBB 2008), 2009
ISBN 978-3-540-85860-7

Juan M. Corchado, Sara Rodriguez, James Llinas, Jose M. Molina (Eds.)
International Symposium on Distributed Computing and Artificial Intelligence 2008 (DCAI 2008), 2009
ISBN 978-3-540-85862-1

Juan M. Corchado, Dante I. Tapia, José Bravo (Eds.)
3rd Symposium of Ubiquitous Computing and Ambient Intelligence 2008, 2009
ISBN 978-3-540-85866-9

Erel Avineri, Mario Köppen, Keshav Dahal, Yos Sunitiyoso, Rajkumar Roy (Eds.)
Applications of Soft Computing, 2009
ISBN 978-3-540-88078-3

Emilio Corchado, Rodolfo Zunino, Paolo Gastaldo, Álvaro Herrero (Eds.)
Proceedings of the International Workshop on Computational Intelligence in Security for Information Systems CISIS 2008, 2009
ISBN 978-3-540-88180-3

Emilio Corchado, Rodolfo Zunino, Paolo Gastaldo, Álvaro Herrero (Eds.)
Proceedings of the International Workshop on Computational Intelligence in Security for Information Systems CISIS 2008

Editors
Prof. Dr. Emilio S. Corchado
Área de Lenguajes y Sistemas Informáticos
Departamento de Ingeniería Civil
Escuela Politécnica Superior
Universidad de Burgos
Campus Vena
C/ Francisco de Vitoria s/n
E-09006 Burgos
Spain
E-mail: escorchado@ubu.es

Prof.
Rodolfo Zunino
DIBE – Department of Biophysical and Electronic Engineering
University of Genova
Via Opera Pia 11A
16145 Genova
Italy
E-mail: rodolfo.zunino@unige.it

Paolo Gastaldo
DIBE – Department of Biophysical and Electronic Engineering
University of Genova
Via Opera Pia 11A
16145 Genova
Italy
E-mail: paolo.gastaldo@unige.it

Álvaro Herrero
Área de Lenguajes y Sistemas Informáticos
Departamento de Ingeniería Civil
Escuela Politécnica Superior
Universidad de Burgos
Campus Vena
C/ Francisco de Vitoria s/n
E-09006 Burgos
Spain
E-mail: ahcosio@ubu.es

ISBN 978-3-540-88180-3
e-ISBN 978-3-540-88181-0
DOI 10.1007/978-3-540-88181-0
Advances in Soft Computing ISSN 1615-3871
Library of Congress Control Number: 2008935893

© 2009 Springer-Verlag Berlin Heidelberg

This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilm or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable for prosecution under the German Copyright Law.

The use of general descriptive names, registered names, trademarks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use.

Typeset & Cover Design: Scientific Publishing Services Pvt. Ltd., Chennai, India.

Printed on acid-free paper

5 4 3 2 1 0

springer.com

Preface

The research scenario in advanced systems for protecting critical infrastructures and for deeply networked information tools highlights a growing link between security issues and the need for intelligent processing abilities in the area of information systems. To face the ever-evolving nature of cyber-threats, monitoring systems must have adaptive capabilities for continuous adjustment and timely, effective response to modifications in the environment. Moreover, the risks of improper access call for advanced identification methods, including protocols to enforce computer-security policies and biometry-related technologies for physical authentication. Computational Intelligence methods offer a wide variety of approaches that can be fruitful in those areas, and can play a crucial role in the adaptive process by their ability to learn empirically and adapt a system's behaviour accordingly.

The International Workshop on Computational Intelligence for Security in Information Systems (CISIS) proposes a meeting ground for the various communities involved in building intelligent systems for security, namely: information security, data mining, adaptive learning methods and soft computing, among others. The main goal is to allow experts and researchers to assess the benefits of learning methods in the data-mining area for information-security applications. The Workshop offers the opportunity to interact with the leading industries actively involved in the critical area of security, and to get a picture of the current solutions adopted in practical domains.

This volume of Advances in Soft Computing contains accepted papers presented at CISIS'08, which was held in Genova, Italy, on October 23rd–24th, 2008.
The selection process to set up the Workshop program yielded a collection of about 40 papers. This allowed the Scientific Committee to verify the vital and crucial nature of the topics involved in the event, and resulted in an acceptance rate of about 60% of the originally submitted manuscripts. CISIS'08 has teamed up with the Journal of Information Assurance and Security (JIAS) and the International Journal of Computational Intelligence Research (IJCIR) for a suite of special issues including selected papers from CISIS'08. The extended papers, together with contributed articles received in response to subsequent open calls, will go through further rounds of peer refereeing in the remits of these two journals.

We would like to thank the Programme Committee members for their work; they performed admirably under tight deadline pressures. Our warmest and special thanks go to the Keynote Speakers: Dr. Piero P. Bonissone (Coolidge Fellow, General Electric Global Research) and Prof. Marios M. Polycarpou (University of Cyprus). Prof. Vincenzo Piuri, former President of the IEEE Computational Intelligence Society, provided invaluable assistance and guidance in enhancing the scientific level of the event.

Particular thanks go to the Organising Committee, chaired by Dr. Clotilde Canepa Fertini (IIC) and composed of Dr. Sergio Decherchi, Dr. Davide Leoncini, Dr. Francesco Picasso and Dr. Judith Redi, for their precious work and for their suggestions about the organisation and promotion of CISIS'08. Particular thanks go as well to the Workshop main Sponsors, Ansaldo Segnalamento Ferroviario Spa and Elsag Datamat Spa, who jointly contributed in an active and constructive manner to the success of this initiative.

We wish to thank Prof. Dr. Janusz Kacprzyk (Editor-in-chief), Dr. Thomas Ditzinger (Senior Editor, Engineering/Applied Sciences) and Mrs. Heather King at Springer-Verlag for their help and collaboration in this demanding scientific publication project. We thank as well all the authors and participants for their great contributions that made this conference possible and all the hard work worthwhile.

October 2008
Emilio Corchado
Rodolfo Zunino
Paolo Gastaldo
Álvaro Herrero

Organization

Honorary Chairs
Gaetano Bignardi – Rector, University of Genova (Italy)
Giovanni Bocchetti – Ansaldo STS (Italy)
Michele Fracchiolla – Elsag Datamat (Italy)
Vincenzo Piuri – President, IEEE Computational Intelligence Society
Gianni Vernazza – Dean, Faculty of Engineering, University of Genova (Italy)

General Chairs
Emilio Corchado – University of Burgos (Spain)
Rodolfo Zunino – University of Genova (Italy)

Program Committee
Cesare Alippi – Politecnico di Milano (Italy)
Davide Anguita – University of Genoa (Italy)
Enrico Appiani – Elsag Datamat (Italy)
Alessandro Armando – University of Genova (Italy)
Piero Bonissone – GE Global Research (USA)
Juan Manuel Corchado – University of Salamanca (Spain)
Rafael Corchuelo – University of Sevilla (Spain)
Andre CPLF de Carvalho – University of São Paulo (Brazil)
Keshav Dahal – University of Bradford (UK)
José Dorronsoro – Autonomous University of Madrid (Spain)
Bianca Falcidieno – CNR (Italy)
Dario Forte – University of Milano Crema (Italy)
Bogdan Gabrys – Bournemouth University (UK)
Manuel Graña – University of Pais Vasco (Spain)
Petro Gopych – V.N. Karazin Kharkiv National University (Ukraine)
Francisco Herrera – University of Granada (Spain)
R.J. Howlett – University of Brighton (UK)
Giacomo Indiveri – ETH Zurich (Switzerland)
Lakhmi Jain – University of South Australia (Australia)
Janusz Kacprzyk – Polish Academy of Sciences (Poland)
Juha Karhunen – Helsinki University of Technology (Finland)
Antonio Lioy – Politecnico di Torino (Italy)
Wenjian Luo – University of Science and Technology of China (China)
Nadia Mazzino – Ansaldo STS (Italy)
José Francisco Martínez – INAOE (Mexico)
Ermete Meda – Ansaldo STS (Italy)
Evangelia Tzanakou – Rutgers University (USA)
José Mira – UNED (Spain)
José Manuel Molina – University Carlos III of Madrid (Spain)
Witold Pedrycz – University of Alberta (Canada)
Dennis K. Nilsson – Chalmers University of Technology (Sweden)
Tomas Olovsson – Chalmers University of Technology (Sweden)
Carlos Pereira – Universidade de Coimbra (Portugal)
Kostas Plataniotis – University of Toronto (Canada)
Fernando Podio – NIST (USA)
Marios Polycarpou – University of Cyprus (Cyprus)
Jorge Posada – VICOMTech (Spain)
Perfecto Reguera – University of Leon (Spain)
Bernardete Ribeiro – University of Coimbra (Portugal)
Sandro Ridella – University of Genova (Italy)
Ramón Rizo – University of Alicante (Spain)
Dymitr Ruta – British Telecom (UK)
Fabio Scotti – University of Milan (Italy)
Kate Smith-Miles – Deakin University (Australia)
Sorin Stratulat – University Paul Verlaine – Metz (France)
Carmela Troncoso – Katholieke Univ. Leuven (Belgium)
Tzai-Der Wang – Cheng Shiu University (Taiwan)
Lei Xu – Chinese University of Hong Kong (Hong Kong)
Xin Yao – University of Birmingham (UK)
Hujun Yin – University of Manchester (UK)
Alessandro Zanasi – TEMIS (France)
David Zhang – Hong Kong Polytechnic University (Hong Kong)

Local Arrangements
Bruno Baruque – University of Burgos
Andrés Bustillo – University of Burgos
Clotilde Canepa Fertini – International Institute of Communications, Genova
Leticia Curiel – University of Burgos
Sergio Decherchi – University of Genova
Paolo Gastaldo – University of Genova
Álvaro Herrero – University of Burgos
Francesco Picasso – University of Genova
Judith Redi – University of Genova

Contents

Computational Intelligence Methods for Fighting Crime
An Artificial Neural Network for Bank Robbery Risk Management: The OS.SI.F Web On-Line Tool of the ABI Anti-crime Department – Carlo Guazzoni, Gaetano Bruno Ronsivalle
Secure Judicial Communication Exchange Using Soft-computing Methods and Biometric Authentication – Mauro Cislaghi, George Eleftherakis, Roberto Mazzilli, Francois Mohier, Sara Ferri, Valerio Giuffrida, Elisa Negroni
Identity Resolution in Criminal Justice Data: An Application of NORA – Queen E. Booker
PTK: An Alternative Advanced Interface for the Sleuth Kit – Dario V. Forte, Angelo Cavallini, Cristiano Maruti, Luca Losio, Thomas Orlandi, Michele Zambelli

Text Mining and Intelligence
Stalker, a Multilingual Text Mining Search Engine for Open Source Intelligence – F. Neri, M. Pettoni
Computational Intelligence Solutions for Homeland Security – Enrico Appiani, Giuseppe Buslacchi
Virtual Weapons for Real Wars: Text Mining for National Security – Alessandro Zanasi
Hypermetric k-Means Clustering for Content-Based Document Management – Sergio Decherchi, Paolo Gastaldo, Judith Redi, Rodolfo Zunino

Critical Infrastructure Protection
Security Issues in Drinking Water Distribution Networks – Demetrios G. Eliades, Marios M. Polycarpou
Trusted-Computing Technologies for the Protection of Critical Information Systems – Antonio Lioy, Gianluca Ramunno, Davide Vernizzi
A First Simulation of Attacks in the Automotive Network Communications Protocol FlexRay – Dennis K. Nilsson, Ulf E. Larson, Francesco Picasso, Erland Jonsson
Wireless Sensor Data Fusion for Critical Infrastructure Security – Francesco Flammini, Andrea Gaglione, Nicola Mazzocca, Vincenzo Moscato, Concetta Pragliola
Development of Anti Intruders Underwater Systems: Time Domain Evaluation of the Self-informed Magnetic Networks Performance – Osvaldo Faggioni, Maurizio Soldani, Amleto Gabellone, Paolo Maggiani, Davide Leoncini
Monitoring and Diagnosing Railway Signalling with Logic-Based Distributed Agents – Viviana Mascardi, Daniela Briola, Maurizio Martelli, Riccardo Caccia, Carlo Milani
SeSaR: Security for Safety – Ermete Meda, Francesco Picasso, Andrea De Domenico, Paolo Mazzaron, Nadia Mazzino, Lorenzo Motta, Aldo Tamponi

Network Security
Automatic Verification of Firewall Configuration with Respect to Security Policy Requirements – Soutaro Matsumoto, Adel Bouhoula
Automated Framework for Policy Optimization in Firewalls and Security Gateways – Gianluca Maiolini, Lorenzo Cignini, Andrea Baiocchi
An Intrusion Detection System Based on Hierarchical Self-Organization – E.J. Palomo, E. Domínguez, R.M. Luque, J. Muñoz
Evaluating Sequential Combination of Two Genetic Algorithm-Based Solutions for Intrusion Detection – Zorana Banković, Slobodan Bojanić, Octavio Nieto-Taladriz
Agents and Neural Networks for Intrusion Detection – Álvaro Herrero, Emilio Corchado
Cluster Analysis for Anomaly Detection – Giuseppe Lieto, Fabio Orsini, Genoveffa Pagano
Statistical Anomaly Detection on Real e-Mail Traffic – Maurizio Aiello, Davide Chiarella, Gianluca Papaleo
On-the-fly Statistical Classification of Internet Traffic at Application Layer Based on Cluster Analysis – Andrea Baiocchi, Gianluca Maiolini, Giacomo Molina, Antonello Rizzi
Flow Level Data Mining of DNS Query Streams for Email Worm Detection – Nikolaos Chatzis, Radu Popescu-Zeletin
Adaptable Text Filters and Unsupervised Neural Classifiers for Spam Detection – Bogdan Vrusias, Ian Golledge
A Preliminary Performance Comparison of Two Feature Sets for Encrypted Traffic Classification – Riyad Alshammari, A. Nur Zincir-Heywood
Dynamic Scheme for Packet Classification Using Splay Trees – Nizar Ben-Neji, Adel Bouhoula
A Novel Algorithm for Freeing Network from Points of Failure – Rahul Gupta, Suneeta Agarwal

Biometry
A Multi-biometric Verification System for the Privacy Protection of Iris Templates – S. Cimato, M. Gamassi, V. Piuri, R. Sassi, F. Scotti
Score Information Decision Fusion Using Support Vector Machine for a Correlation Filter Based Speaker Authentication System – Dzati Athiar Ramli, Salina Abdul Samad, Aini Hussain
Application of 2DPCA Based Techniques in DCT Domain for Face Recognition – Messaoud Bengherabi, Lamia Mezai, Farid Harizi, Abderrazak Guessoum, Mohamed Cheriet
Fingerprint Based Male-Female Classification – Manish Verma, Suneeta Agarwal
BSDT Multi-valued Coding in Discrete Spaces – Petro Gopych
A Fast and Distortion Tolerant Hashing for Fingerprint Image Authentication – Thi Hoi Le, The Duy Bui
The Concept of Application of Fuzzy Logic in Biometric Authentication Systems – Anatoly Sachenko, Arkadiusz Banasik, Adrian Kapczyński

Information Protection
Bidirectional Secret Communication by Quantum Collisions – Fabio Antonio Bovino
Semantic Region Protection Using Hu Moments and a Chaotic Pseudo-random Number Generator – Paraskevi Tzouveli, Klimis Ntalianis, Stefanos Kollias
Random r-Continuous Matching Rule for Immune-Based Secure Storage System – Cai Tao, Ju ShiGuang, Zhong Wei, Niu DeJiao

Industrial Perspectives
nokLINK: A New Solution for Enterprise Security – Francesco Pedersoli, Massimiliano Cristiano
SLA & LAC: New Solutions for Security Monitoring in the Enterprise – Bruno Giacometti

Author Index

An Artificial Neural Network for Bank Robbery Risk Management: The OS.SI.F Web On-Line Tool of the ABI Anti-crime Department

Carlo Guazzoni and Gaetano Bruno Ronsivalle*
OS.SI.F - Centro di Ricerca dell'ABI per la sicurezza Anticrimine
Piazza del Gesù 49 – 00186 Roma, Italy
spsricercasviluppo@abiformazione.it, gabrons@gabrons.com

Abstract.
The ABI (Associazione Bancaria Italiana) Anti-crime Department, OS.SI.F (Centro di Ricerca dell'ABI per la sicurezza Anticrimine) and the banking working group created an artificial neural network (ANN) for Robbery Risk Management in the Italian banking sector. The logic analysis model is based on the global Robbery Risk index of the single banking branch. The global index is composed of the Exogenous Risk, related to the geographic area of the branch, and the Endogenous Risk, connected to its specific variables. The implementation of a neural network for Robbery Risk management provides 5 advantages: (a) it represents, in a coherent way, the complexity of the "robbery" event; (b) the database that supports the ANN is an exhaustive historical representation of Italian robbery phenomenology; (c) the model represents the state of the art of Risk Management; (d) the ANN guarantees the maximum level of flexibility, dynamism and adaptability; (e) it allows an effective integration between a solid calculation model and the common sense of the safety/security manager of the bank.

Keywords: Risk Management, Robbery Risk, Artificial Neural Network, Quickprop, Logistic Activation Function, Banking Application, ABI, OS.SI.F, Anti-crime.

1 Toward an Integrated Vision of the "Robbery Risk"

In the first pages of The Risk Management Standard (note 1) - published by IRM (note 2), AIRMIC (note 3) and ALARM (note 4) - the "risk" is defined as «the combination of the probability of an event and its consequences». Although simple and linear, this definition has many implications from a theoretical and pragmatic point of view. Any type of risk analysis shouldn't be limited to an evaluation of a specific event's probability without considering the effects, presumably negative, of the event. The correlation of these two concepts is not trivial but, unfortunately, most Risk Management models currently in use for banking security are characterized by low attention to these factors. In fact, they are often focused on the definition of methods and tools to foresee the "harmful event" in probabilistic terms, without paying attention to the importance of a composite index that considers the intensity levels of the events that causally derive from this "harmful event". The publication of the above-mentioned English institutes, however, embraces an integrated vision of Risk Management. It is related to a systemic and strategic meaning, giving a coherent description of data and defining the properties of the examined phenomenon. This wide vision provides explanatory hypotheses and, given a series of historical conditions, may foresee the probable evolutions of the system. Thanks to the joint effort of the inter-banking working group - coordinated by ABI and OS.SI.F -, the new support tool for Robbery Risk Management takes into account this integrated idea of "risk".

* Thanks to Marco Iaconis, Francesco Protani, Fabrizio Capobianco, Giorgio Corito, Riccardo Campisi, Luigi Rossi and Diego Ronsivalle for the scientific and operating support in the development of the theoretical model, and to Antonella De Luca for the translation of the paper.
Note 1: http://www.theirm.org/publications/PUstandard.html
Note 2: IRM – The Institute of Risk Management.
Note 3: AIRMIC – The Association of Insurance and Risk Managers.
Note 4: ALARM – The National Forum for Risk Management in the Public Sector.
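Purely to illustrate the IRM definition quoted above - and not as part of the OS.SI.F model described in the following sections - a composite risk score can be read as an event probability weighted by the intensity of its consequences. The function and the numbers below are invented for the illustration.

```python
# Illustrative only: a composite reading of the IRM definition of risk,
# "the combination of the probability of an event and its consequences".
# The weighting scheme and the example values are hypothetical.

def composite_risk(probability: float, consequence_intensity: float) -> float:
    """Combine event probability (0..1) with consequence intensity (0..1).

    A model focused on probability alone would return `probability`;
    the composite index also reflects how severe the outcome would be.
    """
    assert 0.0 <= probability <= 1.0 and 0.0 <= consequence_intensity <= 1.0
    return probability * consequence_intensity


# A branch with a modest robbery probability but severe expected consequences
# can rank above one with a higher probability but a mild expected impact.
print(composite_risk(0.10, 0.9) > composite_risk(0.20, 0.3))   # True
```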
The tool represents, in fact, the Risk Management process by considering the strategic and organizational factors that characterize the phenomenon of robbery; hence, it defines the role of the safety/security manager in the banking sector. The online software tool, indeed, integrates a general plan with a series of resources to assist the manager during the various phases of the decisional process scheduled by the IRM standard:

1. from the Robbery Risk mapping - articulated in analysis, various activities of identification, description and appraisal - to the risk evaluation;
2. from the Risk Reporting to the definition of threats and opportunities connected to the robbery;
3. from the decisional moment, supported by the simulation module (where one can virtually test the Risk Management and analyze the Residual Risk), to the phase of virtual and real monitoring (note 5).

The various functions are included in a multi-layer software architecture, composed of a database and a series of modules that elaborate the information to support the analysis of the risk and its components. In this way, the user can always retrace the various steps that gradually determine the Robbery Risk Global Index and the relative importance of each of its components; starting from the primary components of the risk, he/she can focus the attention on the minimum element of the organizational dimension. Thus, the analysis is focused on the complex relationship between the single banking branch - which represents the system cell and unit of measurement - and the structure of relationships, connections and relevant factors from a local and national point of view. In this theoretical frame, the Robbery Risk is not completely identified with the mere probability that the event occurs. In accordance with the IRM standard, it also takes into account the possibility that the robbery may cause harm, as well as the combined probability that the event occurs and that the possible negative consequences for the system may have a different intensity.

Note 5: "The Risk Management Standard", pp. 6-13.

2 Exogenous Risk and Endogenous Risk

According to the inter-banking working group, coordinated by OS.SI.F, what are the factors or variables that compose and influence the robbery and its harmful effects?

2.1 The Exogenous Risk

The "Exogenous" components include environment variables (from regional data to local detailed data), tied to the particular geographic position, population density, crime rate and number of general criminal actions in the area, as well as the "history" and/or evolution of the relationship between the number of banking branches, the defense implementations and the Robbery Risk. So the mathematical function that integrates these factors must take into account the time variable. In fact, it is essential to consider the influence of each variable according to the changes that occur at any time in a certain geographic zone. The analysis carried out by the ABI working group has shown that the composition of environment conditions is represented by a specific index of "Exogenous risk". Its dynamic nature makes it possible to define a probabilistic frame in order to calculate the Robbery Risk. Such an index of "Exogenous" risk allows considering the possibility of a dynamic computation regarding the variation rate of the density of criminal actions. This variation depends on the direct or indirect intervention of police or central/local administrations in a geographic area.
The Exogenous risk could provide some relevant empirical bases in order to allow banks to share common strategies for the management/mitigation of the Robbery Risk. The aim is to avoid possible negative effects tied to activities connected to only one banking branch.

2.2 The Endogenous Risk

A second class of components corresponds, instead, to material, organizational, logistic, instrumental and technological factors. They characterize the single banking branch and determine its specific architecture in relation to the robbery. Such factors are the following:

1. the "basic characteristics" (note 6)
2. the "services" (note 7)
3. the "plants" (note 8)

The interaction of these factors contributes to determine a significant part of the so-called "Endogenous" risk. It is calculated through a complex function of the number of robberies at a single branch, computed over a unit of time in which no "event" has modified, in meaningful terms, the internal set-up of the branch. In other terms, a dynamic connection between the risk index and the various interventions planned by the safety/security managers has been created, both for the single cell and for the whole system. The aim was to control the causal sequence between the possible variation of an Endogenous characteristic and its relative importance (%) in order to calculate the specific impact on the number of robberies.

Note 6: E.g. the number of employees, the location, the cash risk, the target-hardening strategies, etc.
Note 7: E.g. the bank security guards, the bank surveillance cameras, etc.
Note 8: E.g. the access control vestibules (man-catchers or mantraps), the bandit barriers, broad-band internet video feeds directly to police, the alarms, etc.

2.3 The Global Risk

Complying with the objectives of an exhaustive Robbery Risk management, the composition of the two risk indexes (Exogenous and Endogenous) defines the perimeter of a hypothetical "global" risk referred to the single branch: a sort of integrated index that includes both environment and "internal" factors. Thus the calculation of the Global Risk index derives from the normalization of the bi-dimensional vector obtained from the above-mentioned functions. Let us propose a calculation model in order to support the definition of the various indexes.

3 Methodological Considerations about the Definition of "Robbery Risk"

Before dealing with the computation techniques, however, it is necessary to clarify some issues.

3.1 Possible Extent of "Robbery Risk"

First of all, the demarcation between the "Exogenous" and "Endogenous" dimensions cannot be considered absolute. In fact, in some specific cases of "Endogenous" characteristics, an analysis that considers only factors describing the branch is not representative enough of the combination of variables. The weighting of each single element makes it possible to assign a "weight" related to the Endogenous risk index, but the variability of this influence according to the geographic area must absolutely be taken into account. This produces an inevitable contamination, even though circumscribed, between the two dimensions of the Robbery Risk. Then, it must be taken into account that these particular elements of the single cell are the result of the permanent activity of comparison carried out by the inter-banking working group members coordinated by ABI and OS.SI.F. Given the extremely delicate nature of the theme, the architecture of the "Endogenous" characteristics of the branch must be considered temporary and in continuous evolution.
The "not final" nature of the scheme that we propose depends, therefore, on the progressive transformation of the technological tools supporting security, as well as on the slow change - both in terms of national legislation and in terms of reorganization of the safety/security manager role - in the ways in which the contents of Robbery Risk are interpreted. Finally, the possibility of opening the theoretical scheme, in the future, to criminological models tied to the description of criminal behaviour and to the definition of indexes of risk perception is not excluded. But, in both cases, the problem is to find a shared theoretical basis and to translate the intangible factors into quantitative variables that can be elaborated through the calculation model (note 9).

Note 9: On this topic, some research is going on with the aim of defining a possible evolution of this particular type of "socio-psychological" index of perceived risk. But it is not yet clear if and how this index can be considered as a category of risk distinct from both the "exogenous" and the "endogenous" risk.

3.2 The Robbery Risk from an Evolutionistic Point of View

Most bank models consider the Robbery Risk index of a branch as a linear index depending on the number of attempted and/or committed robberies in a certain time interval. This index is usually translated into a value, with reference to a risk scale, and it is the criterion according to which the Security Manager decides. These models may receive a series of criticisms:

1. they extremely simplify the relation between variables, not considering the reciprocal influences between Exogenous and Endogenous risk;
2. they represent the history of a branch inaccurately, with reference to very wide temporal criteria;
3. they don't foresee a monitoring system of the historical evolution of the link between changes of the branch and the number of robberies.

To avoid these criticisms, the OS.SI.F team has developed a theoretical framework based on the following methodological principles:

1. the Robbery Global Risk is a probability index (it varies from 0 to 1) and it depends on the non-linear combination of Exogenous and Endogenous risk;
2. the Robbery Global Risk of a branch corresponds to the trend calculated by applying the Least Squares formula to the numerical set of monthly Robbery Risk values, from January 2000 to March 2008, in the branch;
3. the monthly value of the Robbery Global Risk corresponds to the ratio between the number of robberies per month and the number of days in which the branch is open to the public. It is expressed as a value from 0 to 1.

A consequence follows from these principles: it is necessary to describe the history of the branch as a sequence of states of the branch itself, in relation to changes that occurred in its internal structure (for example, the introduction of a new defending service or a new plant). Thus it is possible to create a direct relation between the evolution of the branch and the evolution of the Robbery Global Risk. In particular, we interpret the various transformations of the branch over time as "mutations" in a population of biological organisms (represented by isomorphic branches). The Robbery Global Risk thus becomes a kind of value suggesting how the "robbery market" rewards the activities of the security managers, even though without awareness of the intentional nature of certain choices (note 10).
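Principles 2 and 3 above lend themselves to a compact numerical reading. The following fragment is a minimal sketch with invented data: the monthly index as the ratio of robberies to opening days, and the branch-level Robbery Global Risk as the least-squares trend of that monthly series. The data layout is an assumption; the actual OS.SI.F computation is not published in this form.

```python
# Sketch of methodological principles 2 and 3 (hypothetical data layout):
# monthly risk = robberies / opening days, clipped to [0, 1];
# branch-level Robbery Global Risk = least-squares trend of the monthly series.
from typing import List, Tuple


def monthly_risk(robberies: int, opening_days: int) -> float:
    """Principle 3: robberies per month over days open to the public, in [0, 1]."""
    if opening_days <= 0:
        return 0.0
    return min(1.0, robberies / opening_days)


def least_squares_trend(series: List[float]) -> Tuple[float, float]:
    """Principle 2: ordinary least-squares fit y = a + b*t over the monthly series."""
    n = len(series)
    t_mean = (n - 1) / 2.0
    y_mean = sum(series) / n
    num = sum((t - t_mean) * (y - y_mean) for t, y in enumerate(series))
    den = sum((t - t_mean) ** 2 for t in range(n))
    slope = num / den if den else 0.0
    intercept = y_mean - slope * t_mean
    return intercept, slope


# Example: monthly (robberies, opening days) pairs from January 2000 onwards (invented values).
observations = [(0, 21), (1, 22), (0, 20), (2, 22), (1, 21)]
series = [monthly_risk(r, d) for r, d in observations]
print(least_squares_trend(series))
```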
This evolutionary reading also makes it possible to analyze indirect strategies of the various banking groups in the management and distribution of risk across the different regions of the country (note 11). This methodological framework constitutes the logical basis for the construction of the Robbery Risk management simulator. It is a direct answer to the criticisms of the 2nd and 3rd points (see above). It also allows the calculation of the fluctuations of the Exogenous risk in relation to the increase - over certain thresholds - of the Robbery Global Risk (criticism of the 1st point).

Note 10: This "biological metaphor", inspired by Darwin's theory, is the methodological basis to overcome a wrong conception of the term "deterrent" inside the security managers' vocabulary. In many cases, the introduction of new safety services constitutes only an indirect deterrent, since the potential robber cannot know about the change. The analysis per populations of branches allows, instead, to extend the concept of "deterrent" to large numbers.
Note 11: While analyzing data, we discovered a number of "perverse effects" in Robbery Risk management, including the transfer of risk to branches of other competing groups as a result of corrective actions conceived for a branch and then extended to all branches in the same location.

4 The Calculation Model for the Simulator

The choice of a good calculation model as a support tool for risk management is essentially conditioned by the following elements:

1. the nature of the phenomenon;
2. the availability of information and historical data on the phenomenon;
3. the quality of the available information;
4. the presence of a scientific literature and/or of possible applications on the theme;
5. the type of tool and the output required;
6. the perception and the general consent level related to the specific model adopted;
7. the degree of obsolescence of the results;
8. the impact of the results in social, economic and political terms.

In the specific case of the Robbery Risk:

1. the extreme complexity of the "robbery" phenomenon suggests the adoption of analysis tools that take into account the various components, according to a non-linear logic;
2. there is a big database on the phenomenon: it can represent the pillar for a historical analysis and a search for regularities, correlations and possible nomic and/or probabilistic connections among the factors that determine the "robbery" risk;
3. the current database has recently been "normalized", with the aim of guaranteeing the maximum degree of coherence between the information included in the archive and the real state of the system;
4. the scientific literature on risk analysis models related to criminal events is limited to a mere qualitative analysis of the phenomenon, without considering quantitative models;
5. the inter-banking group has expressed the need for a tool to support the decisional processes in order to manage the Robbery Risk, through a decomposition of the fundamental elements that influence the event at an Exogenous and Endogenous level;
6. the banking world aims to have innovative tools and sophisticated calculation models in order to guarantee objective and scientifically founded results within the Risk Management domain;
7. given the nature of the phenomenon, the calculation model of the Robbery Risk must guarantee the maximum of flexibility and dynamism according to the time variable and the possible transformations at a local and national level;
8.
the object of the analysis is matched with a series of ethical, political, social and economic topics, and requires, indeed, an integrated systemic approach.

These considerations have led the team to pursue an innovative way of creating the calculation model for the Robbery Risk indexes: artificial neural networks (ANN).

5 Phases of ANN Design and Development for the Management of the Robbery Risk

How did we come to the creation of the neural network for the management of the robbery Global Risk? The creation of the calculation model is based on the logical scheme exposed above. It is articulated in five fundamental phases:

1. Re-design of the OS.SI.F database and data analysis;
2. Data normalization;
3. OS.SI.F network design;
4. OS.SI.F network training;
5. Network testing and delivery.

5.1 First Phase: Re-design OS.SI.F Database and Data Analysis

Once the demarcation between Exogenous and Endogenous risks had been defined, as well as the structure of variables concerning each single component of the Global Risk, some characteristic elements of the OS.SI.F historical archive have been revised. The revision of the database allowed us to remove possible macroscopic redundancies and occasional critical factors before starting the data analysis. Through a neural network based on genetic algorithms, all possible incoherencies and contradictions have been highlighted. The aim was to isolate patterns that would have been potentially "dangerous" for the network and to produce a "clean" database, deprived of logical "impurities" (within the limits of human reason). At this point, the team defined the number of entry variables (ANN inputs) - related to the characteristics mentioned above - and the exit variables (ANN output) representing the criteria for designing the network. The structure of the dataset was determined, as well as the single field types (categorical or numerical) and the distinction among (a) information for the training of the ANN, (b) data to validate the neuronal architecture, and (c) a dataset dedicated to testing the ANN after training.

5.2 Second Phase: Data Normalization

The database cleaning allowed the translation of the data into a new archive of information for the elaboration of an ANN. In other words, all the variables connected to the Exogenous risk (environment and geographic variables) and the Endogenous risk (basic characteristics, services and plants of each single branch) have been re-written and normalized. All this has produced the historical sequence of examples provided to the ANN with the aim of letting it "discover" the general rules that govern the "robbery" phenomenon - the real formal "vocabulary" for the calculation of the Global Risk.

5.3 Third Phase: OS.SI.F Network Design

This phase has been dedicated to determining the general architecture of the ANN and its mathematical properties. Concerning topology, in particular, after a series of unsuccessful attempts with a single hidden layer, we opted for a network with two hidden layers:

Fig. 1. A schematic representation of the architecture of the OS.SI.F ANN

This architecture was, in fact, more appropriate to solve a series of problems connected to the particular nature of the "robbery" phenomenon. This allowed us, therefore, to optimize the choice of the single neuron activation function and of the error function.
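As a purely illustrative sketch of the kind of architecture just described - a feed-forward network with two hidden layers and a single output bounded in [0, 1] - the following fragment shows a minimal forward pass. The layer sizes, the random weights and the use of the logistic function throughout are assumptions, not the actual OS.SI.F topology or parameters.

```python
# Minimal sketch of a feed-forward network with two hidden layers and a
# single output in (0, 1). Layer sizes and weights are hypothetical; they do
# not reproduce the actual OS.SI.F ANN.
import numpy as np


def logistic(x: np.ndarray) -> np.ndarray:
    """Sigmoid activation, F(x) = 1 / (1 + exp(-x))."""
    return 1.0 / (1.0 + np.exp(-x))


def forward(x: np.ndarray, params: list) -> np.ndarray:
    """Propagate the (normalized) exogenous + endogenous inputs through the net."""
    a = x
    for w, b in params:
        a = logistic(a @ w + b)
    return a


rng = np.random.default_rng(0)
n_inputs, n_hidden1, n_hidden2 = 12, 8, 4          # hypothetical sizes
shapes = [(n_inputs, n_hidden1), (n_hidden1, n_hidden2), (n_hidden2, 1)]
params = [(rng.normal(size=s), np.zeros(s[1])) for s in shapes]

x = rng.random((1, n_inputs))                      # one normalized input pattern
print(forward(x, params))                          # value in (0, 1), read as a risk index
```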
After a first, disastrous implementation of the linear option, a logistic activation function with a sigmoid curve was adopted. It is characterized by an output range between 0 and 1 and is calculated through the following formula:

F(x) = \frac{1}{1 + e^{-x}}    (1)

Since it was useful for the evaluation of the ANN quality, an error function has been associated with the logistic function. It was based on the analysis of the differences between the outputs of the historical archive and the outputs produced by the neural network. In this way we reached the definition of the logical calculation model, even though it did not yet have the knowledge necessary to describe, explain and foresee the probability of the "robbery" event. This knowledge, in fact, derives only from an intense training activity.

5.4 Fourth Phase: OS.SI.F Network Training

The ANN training constitutes the most delicate moment of the whole process of creation of the network, in particular with Supervised Learning. In our case, in fact, the training consists in providing the network with a large series of examples of robberies associated with particular geographic areas and specific characteristics of the branches. From this data, the network has to infer the rule through an abstraction process. For the Robbery Risk ANN, we decided to implement a variation of Backpropagation, the most common learning algorithm for multi-layer networks. It is based on error propagation and on the transformation of the weights (originally randomly assigned) from the output layer towards the intermediate layers, up to the input neurons. In our special version, the "OS.SI.F Quickpropagation", the variation of each weight of the synaptic connections changes according to the following formula:

\Delta w(t) = \left( \frac{s(t)}{s(t-1) - s(t)} \, \Delta w(t-1) \right) + k    (2)

where k is a hidden variable to solve the numerical instability of this formula (note 12). We can state that the fitness of the ANN-Robbery Risk has been subordinated to a substantial correspondence between the values of Endogenous and Exogenous risk (included in the historical archive) and the results of the network's elaboration after each learning iteration.

Note 12: Moreover, during the training, a quantity of "noise" has been introduced (injected) into the calculation process. The value of the "noise" has been calculated in relation to the error function and has made it possible to avoid the network remaining stuck in critical situations of local minima.

5.5 Fifth Phase: Network Testing and Delivery

In the final phase of the process, a lot of time has been dedicated to verifying the neural network architecture defined in the previous phases. Moreover, a series of datasets not previously included in the training have been considered, with the aim of removing the last calculation errors and making some adjustments to the general system of weights. In this phase, some critical nodes have been modified: they were related to the variations of the Exogenous risk according to the population density and to the relationship between the Endogenous risk and some new plants (biometric devices). Only after this last testing activity has the ANN been integrated and implemented in the OS.SI.F Web module, to allow users (banking security/safety managers) to verify the coherence of the tool through a module for the simulation of new scenarios.

6 Advantages of the Application of Neural Networks to Robbery Risk Management

The implementation of an ANN to support Robbery Risk management has at least 5 fundamental advantages:

1. Unlike any linear system based on proportions and simple systems of equations, an ANN makes it possible to face, in a coherent way, the high degree of complexity of the "robbery" phenomenon.
The banal logic of the sum of variables and causal connections of many common models is replaced by a more articulated design that contemplates, in dynamic and flexible terms, the innumerable connections among the Exogenous and Endogenous variables.

2. The OS.SI.F ANN database is based on a historical archive continually fed by the whole Italian banking system. This makes it possible to overcome any limited local vision, in line with the absolute need for a systemic approach to Robbery Risk analysis. In fact, it is not possible to continue to face such a delicate topic through visions circumscribed to one's own business dimension.

3. The integration of neural algorithms constitutes the state of the art within the Risk Management domain. In fact, it guarantees the definition of a net of variables opportunely measured according to a probabilistic - and not banally linear - logic. The Robbery Risk ANN foresees a real Bayesian network that dynamically determines the weight of each variable (Exogenous and Endogenous) in the probability of the robbery. This provides a higher degree of accuracy and scientific reliability to the definition of "risk" and to the whole calculation model.

4. A tool based on neural networks guarantees the maximum level of flexibility, dynamism and adaptability to contexts and conditions that are in rapid evolution. These are assured by (a) a direct connection of the database to the synaptic weights of the ANN, and (b) the possible reconfiguration of the network architecture in case of the introduction of new types of plants and/or services and/or basic characteristics of branches.

5. The ANN allows an effective integration between a solid calculation model (the historical archive of information related to the robberies of the last years) and the professional and human experience of security/safety managers. The general plan of the database (and of the composition of the two risk indexes) takes into account the considerations, observations and indications of the main representatives of the national banking safety/security sectors. The general plan of the database is based on the synthesis done by the inter-banking team, on the normalization of the robbery event descriptions, and on the sharing of some guidelines in relation to a common vocabulary for the description of the robbery event. The final result of this integration is a tool that guarantees the maximum level of decisional liberty, through the scientific validation of virtuous practices and, thanks to the simulator of new branches, an a priori evaluation of the possible effects deriving from future interventions.

References

1. Corradini, I., Iaconis, M.: Antirapina. Guida alla sicurezza per gli operatori di sportello. Bancaria Editrice, Roma (2007)
2. Fahlman, S.E.: Fast-Learning Variations on Back-Propagation: An Empirical Study. In: Proceedings of the 1988 Connectionist Models Summer School, pp. 38–51. Morgan Kaufmann, San Francisco (1989)
3. Floreano, D.: Manuale sulle reti neurali. Il Mulino, Bologna (1996)
4. McClelland, J.L., Rumelhart, D.E.: PDP: Parallel Distributed Processing: Explorations in the Microstructure of Cognition: Psychological and Biological Models, vol. II.
MIT Press/Bradford Books, Cambridge (1986)
5. Pessa, E.: Statistica con le reti neurali. Un'introduzione. Di Renzo Editore, Roma (2004)
6. Sietsma, J., Dow, R.J.F.: Neural Net Pruning - Why and How. In: Proceedings of the IEEE International Conference on Neural Networks, pp. 325–333. IEEE Press, New York (1988)
7. von Lehman, A., Paek, G.E., Liao, P.F., Marrakchi, A., Patel, J.S.: Factors Influencing Learning by Back-propagation. In: Proceedings of the IEEE International Conference on Neural Networks, vol. I, pp. 335–341. IEEE Press, New York (1988)
8. Weisel, D.L.: Bank Robbery. In: COPS, Community Oriented Policing Services, U.S. Department of Justice, No. 48, Washington (2007), http://www.cops.usdoj.gov

Secure Judicial Communication Exchange Using Soft-computing Methods and Biometric Authentication

Mauro Cislaghi (1), George Eleftherakis (2), Roberto Mazzilli (1), Francois Mohier (3), Sara Ferri (4), Valerio Giuffrida (5), and Elisa Negroni (6)

(1) Project Automation, Viale Elvezia, Monza, Italy
{mauro.cislaghi,roberto.mazzilli}@p-a.it
(2) SEERC, 17 Mitropoleos Str, Thessaloniki, Greece
eleftherakis@city.academic.gr
(3) Airial Conseil, Rue Bellini 3, Paris, France
francois.mohier@airial.com
(4) AMTEC S.p.A., Loc. San Martino, Piancastagnaio, Italy
sara.ferri@elsagdatamat.com
(5) Italdata, Via Eroi di Cefalonia 153, Roma, Italy
valerio.giuffrida@italdata-roma.com
(6) Gov3 Ltd, UK
j-web.project@gov3innovation.eu

Abstract. This paper describes how "Computer supported cooperative work", coupled with security technologies and advanced knowledge management techniques, can support penal judicial activities, in particular the national and trans-national investigation phases in which different judicial systems have to cooperate together. The increase of illegal immigration, trafficking of drugs, weapons and human beings, and the advent of terrorism have made a stronger judicial collaboration between States necessary. The J-WeB project (http://www.jweb-net.com/), financially supported by the European Union under the FP6 – Information Society Technologies Programme, is designing and developing an innovative judicial cooperation environment capable of enabling effective judicial cooperation during cross-border criminal investigations carried out between the EU and Countries of enlarging Europe, having the Italian and Montenegrin Ministries of Justice as partners. In order to reach a higher security level, an additional biometric identification system is integrated in the security environment.

Keywords: Critical Infrastructure Protection, Security, Collaboration, Cross border investigations, Cross Border Interoperability, Biometrics, Identity and Access Management.

1 Introduction

Justice is a key success factor in regional development, in particular in areas whose development is lagging behind the average development of the European Union. In the last years particular attention has been paid to judicial collaboration between the Western Balkans and the rest of the EU, and the CARDS Programme [1] is clear evidence of this cooperation. According to this programme, funds were provided for the development of closer relations and regional cooperation among SAp (Stabilisation and Association process) countries, and between them and all the EU member states, to promote direct cooperation in tackling the common threats of organised crime, illegal migration and other forms of trafficking.
Mutual assistance [2] is subject to different agreements and different judicial procedures. The JWeB project [3], [9], based on the experience of the e-Court [4] and SecurE-Justice [5] projects, funded by the European Commission in the IST programme, is developing an innovative judicial cooperation environment capable of enabling effective judicial cooperation during cross-border criminal investigations, having the Italian and Montenegrin Ministries of Justice as partners. JWeB (started in 2007 and ending in 2009) will experiment with a cross-border secure cooperative judicial workspace (SCJW), distributed on different ICT platforms called Judicial Collaboration Platforms (JCP) [6], based on Web-based groupware tools supporting collaboration and knowledge sharing among geographically distributed workforces, within and between judicial organizations.

2 Investigation Phase and Cross-Border Judicial Cooperation

The investigation phase includes all the activities carried out from crime notification to the trial. Cross-border judicial cooperation is one of them. It may vary from simple to complex judicial actions, but it has complex procedures and requirements, such as information security and non-repudiation. A single investigation may include multiple cross-border judicial cooperation requests; this is quite usual when investigating financial flows. Judicial cooperation develops as follows (see also the schematic sketch further below):

1) In the requesting country, the magistrate starts preliminary checks to understand if her/his requests to another country are likely to produce the expected results. Liaison magistrate support and contacts with magistrates in the other country are typical actions.
2) The "requesting" magistrate prepares and sends the judicial cooperation request (often referred to as "letter of rogatory") containing the list of specific requests to the other country. Often the flow in the requesting country is named "active rogatory", while the flow in the requested country is named "passive rogatory".
3) The judicial cooperation request coming from the other country is evaluated, usually by a court of appeal that, in case of positive evaluation, appoints the prosecutors' office in charge of the requested activities. This prosecutors' office appoints a magistrate. The requesting magistrate, directly or via the office delegated to international judicial cooperation, receives back this information and judicial cooperation starts.
4) Judicial cooperation actions are performed. They may cover requests for documents, requests for evidence, requests for interrogations, requests for specific actions (for example interceptions, seizures or an arrest), and requests for joint investigation.

Most of the activities are still paper based. The listed activities may imply complex actions in the requested country, involving people (magistrates, police, etc.) in different departments. The requesting country is interested in the results of the activities, not in the procedures through which the requested judicial organisation fulfils the requests. The liaison magistrate can support the magistrate, helping her/him to understand how to address the judicial counterpart and, once judicial cooperation has been granted, to understand and overcome possible obstacles. Each national judicial system is independent of the other, both in legal and in infrastructural terms.
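Purely as an illustration, the four-step flow just described can be read as a small state machine of the kind the JCP workflow engine (see Sect. 3) would model. The state names, events and transitions below are an interpretation of the description above, not the actual JWeB workflow definition.

```python
# Illustrative state machine for the cross-border cooperation (rogatory) flow
# described above. State and event names are assumptions for the sketch only.
from enum import Enum, auto


class RogatoryState(Enum):
    PRELIMINARY_CHECKS = auto()     # step 1: informal checks, liaison magistrate contacts
    REQUEST_SENT = auto()           # step 2: letter of rogatory sent ("active rogatory")
    UNDER_EVALUATION = auto()       # step 3: court of appeal evaluates ("passive rogatory")
    COOPERATION_ACTIVE = auto()     # step 4: documents, evidence, interrogations, joint work
    REJECTED = auto()
    CLOSED = auto()


TRANSITIONS = {
    (RogatoryState.PRELIMINARY_CHECKS, "send_request"): RogatoryState.REQUEST_SENT,
    (RogatoryState.REQUEST_SENT, "start_evaluation"): RogatoryState.UNDER_EVALUATION,
    (RogatoryState.UNDER_EVALUATION, "grant"): RogatoryState.COOPERATION_ACTIVE,
    (RogatoryState.UNDER_EVALUATION, "reject"): RogatoryState.REJECTED,
    (RogatoryState.COOPERATION_ACTIVE, "all_requests_fulfilled"): RogatoryState.CLOSED,
}


def advance(state: RogatoryState, event: str) -> RogatoryState:
    """Apply an event to the current state; invalid events leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)


state = RogatoryState.PRELIMINARY_CHECKS
for event in ("send_request", "start_evaluation", "grant", "all_requests_fulfilled"):
    state = advance(state, event)
print(state)   # RogatoryState.CLOSED
```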
Judicial cooperation, from the ICT point of view, implies cooperation between two different infrastructures, the "requesting" one ("active") and the "requested" one ("passive"), and activities such as judicial cooperation set-up, joint activities of the workgroups, and secure exchange of non-repudiable information between the two countries. These activities can be effectively supported by a secure collaborative workspace, as described in the next paragraph.

3 The Judicial Collaboration Platform (JCP)

A workspace for judicial cooperation involves legal, organisational and technical issues, and requires a wide consensus in judicial organisations. It has to allow a straightforward user interface, easy data retrieval, and seamless integration with procedures and systems already in place, all implemented while providing top-level security standards. Accordingly, the main issues for judicial collaboration are:

• A Judicial Case is a secure private virtual workspace accessed by law enforcement and judicial authorities that need to collaborate in order to achieve common objectives and tasks;
• JCP services are on-line services, supplying various collaborative functionalities to the judicial authorities in a secure and non-repudiable communication environment;
• A User profile is a set of access rights assigned to a user. The access to a judicial case and to JCP services is based on predefined, as well as customised, role-based user profiles;
• Mutual assistance during investigations creates a shared part of the investigation folder;
• Each country will have its own infrastructure.

The core system supporting judicial cooperation is the secure JCP [6]. It is part of a national ICT judicial infrastructure, within the national judicial space. Different JCPs in different countries may cooperate during judicial cooperation. The platform, organised in three layers (presentation, business, persistence) and supporting availability and data security, provides the following main services:

• Profiling: user details, user preferences.
• Web Services:
  o Collaboration: collaborative tools so that users can participate and discuss on the judicial cooperation cases.
  o Data Mining: customization of user interfaces based on users' profiles.
  o Workflow Management: design and execution of judicial cooperation processes.
  o Audio/Video Management: real-time audio/video streaming of a multimedia file, videoconference support.
  o Knowledge Management: document uploading, indexing, search.
• Security and non-repudiation: biometric access, digital certificates, digital signature, secure communication, cryptography, role-based access control.

Services may be configured according to the different needs of the judicial systems. The modelling of Workflow Processes is based on the Workflow Management Coalition (WfMC) specifications, while software developments are based on Open Source and the J2EE framework. Communications are based on HTTPS and SSL, SOAP, RMI, LDAP and XML. Videoconferencing is based on H.323.

4 The Cross-Border Judicial Cooperation Via Secure JCPs

4.1 The Judicial Collaborative Workspace and Judicial Cooperation Activities

A secure collaborative judicial workspace (SCJW) is a secure inter-connected environment related to a judicial case, in which all entitled judicial participants in dispersed locations can access and interact with each other just as inside a single entity. The environment is supported by electronic communications and groupware which enable participants to overcome space and time differentials.
On the physical point of view, the workspace is supported by the JCP. The SCJW allows the actors to use communication and scheduling instruments (agenda, shared data, videoconference, digital signature, document exchange) in a secured environment. A judicial cooperation activity (JCA) is the implementation of a specific judicial cooperation request. It is a self contained activity, opened inside the SCJWs in the requesting and requested countries, supported by specific judicial workflows and by the collaboration tools, having as the objective to fulfil a number of judicial actions issued by the requesting magistrate. The SCJW is connected one-to-one to a judicial case and may contain multiple JCAs running in parallel. A single JCA ends when rejected or when all requests contained in the letter of rogatory have been fulfilled and the information collected have been inserted into the target investigation folder, external to the JCP. In this moment the JCA may be archived. The SCJW does not end when a JCA terminates, but when the investigation phase is concluded. Each JCA may have dedicated working teams, in particular in case of major investigations. The “owner” of the SCJW is the investigating magistrate in charge of the judicial case. SCJW is implemented in a single JCP, while the single JCA is distributed on two JCP connected via secure communication channels (crypto-routers, with certificate exchange), implementing a secured Web Service Interface via a collaboration gateway. Each SCJW has a global repository and a dedicated repository for each JCA. This is due to the following constraints: 1) the security, confidentiality and non repudiation constraints 2) each JCA is an independent entity, accessible only by the authorised members of the judicial workgroup and with a limited time duration. Secure Judicial Communication Exchange Using Soft-computing Methods 15 The repository associated to the single JCA contains: • • JCA persistence data 1) “JCA metadata” containing data such as: information coming from the national registry (judicial case protocol numbers, etc.), the users profiles and the related the access rights, the contact information, the information related to the workflows (state, transitions), etc. 2) “JCP semantic repository”. It will be the persistence tier for the JCP semantic engine, containing: ontology, entity identifiers, Knowledge Base (KB) JCA judicial information The documentation produced during the judicial cooperation will be stored in a configurable tree folder structure. Typical contents are: 1) “JCA judicial cooperation request”. It contains information related to the judicial cooperation request, including further documents exchanged during the set-up activities. 2) “JCA decisions”. It contains the outcomes of the formal process of judicial cooperation and any internal decision relevant to the specific JCA (for example letter of appointment of the magistrate(s), judicial acts authorising interceptions or domicile violation, etc.) 3) “JCA investigation evidences”. It contains the documents to be sent/ received (Audio/video recordings, from audio/video conferences and phone interceptions, Images, Objects and documents, Supporting documentation, not necessarily to be inserted in the investigation folder) 4.2 The Collaboration Gateway Every country has it own ICT judicial infrastructure, interfaced but not shared with other countries. 
Accordingly, a SCJW in a JCP must support 1:n relationships between judicial systems, including data communication, in particular when the judicial case implies more than one JCA. A single JCA has a 1:1 relationship between the JCA in the requesting country and the corresponding "requested" JCA. For example, a single judicial case in Montenegro may require cross-border judicial cooperation with Italy, Serbia, Switzerland, France and the United Kingdom, and the JCP in Montenegro will then support n cross-border judicial cooperations. Since JCP platforms are hosted in different locations and countries, the architecture of the collaboration module is based on the mechanism of a secured gateway. It is based on a set of Web Services allowing one JWeB site, based on a JCP, to exchange the needed data with another JWeB site and vice versa. The gateway architecture, under development in the JWeB project, is composed of:

• Users and Profiling module
• Judicial Cases and Profiling module
• Calendar/Meeting module

Workflow engines exchange information about workflow states through the collaboration gateway.

4.3 Communication Security, User Authentication and RBAC in JCP

Security [7] is managed through the Security Module, designed to properly manage connectivity domains and to assure access rights to different entities, protecting information and segmenting the IP network into secured domains. Any communication is hidden from third parties, protecting privacy, preventing unauthorised usage and assuring data integrity. The JCP environment is protected by the VPN system, which allows access only to authenticated and pre-registered users; no access is allowed without the credentials given by the PKI. Users are authenticated on access to any resource by means of their X.509v3 digital certificates issued by the Certification Authority, stored in their smart cards and protected by biometry [7], [8]. The Network Security System is designed to grant access to the networks and the resources only to authenticated users; it is composed of the following components:

• Security Access Systems (crypto-routers). Crypto-routers prevent unauthorised intrusions, offer protection against external attacks, and provide tunnelling capabilities and data encryption.
• Security Network Manager. This is the core of the security management system, allowing configurations to be managed, monitored and modified, including the accounting of new users.
• S-VPN clients (Secure Virtual Private Network clients). Software through which users can enter the IP VPN and be authenticated by the Security Access System.

The crypto-router supports routing and encryption functions with the RSA public key algorithm on standard TCP/IP networks in end-to-end mode. Inside the JCP security architecture, the crypto-router's main task is to establish the secure tunnel used to access the JCP VPN (Virtual Private Network) and to provide both network and resource authentication. In order to reach a higher security level, an additional biometric identification system is integrated in the security environment. The device integrates a smart card reader with a capacitive ST Microelectronics fingerprint scanner and an "Anti Hacking Module" that makes the device unusable in case of any kind of physical intrusion attempt. The biometric authentication device entirely manages the biometric verification process. No biometric data is exchanged between the device and the workstation or any other device.
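The match-on-card principle just described — the enrolled fingerprint never leaves the smart card, and the workstation only ever receives a pass/fail result — can be sketched conceptually as follows. This is a simplified illustration with hypothetical class names; in the real device the comparison runs inside the tamper-protected module, not in application code.

```python
from dataclasses import dataclass


@dataclass
class SmartCard:
    """Holds the user's X.509 certificate subject and the enrolled fingerprint template.
    The template is never exported; only the verification result leaves the card/reader."""
    certificate_subject: str
    _enrolled_template: bytes

    def verify_fingerprint(self, live_template: bytes) -> bool:
        # A byte comparison stands in for the on-card biometric matcher.
        return live_template == self._enrolled_template


class BiometricReader:
    """Models the reader plus fingerprint scanner: the host sees only a boolean."""

    def __init__(self, card: SmartCard):
        self.card = card

    def authenticate(self, scanned_template: bytes) -> bool:
        return self.card.verify_fingerprint(scanned_template)


if __name__ == "__main__":
    card = SmartCard("CN=Investigating Magistrate", b"enrolled-minutiae")
    reader = BiometricReader(card)
    print(reader.authenticate(b"enrolled-minutiae"))   # True  -> VPN session may open
    print(reader.authenticate(b"someone-else"))        # False -> access denied
```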
Biometric personal data will remain in the user’s smart card and the comparison between the live and the smart card stored fingerprint will be performed inside the device. After biometric authentication, access control of judicial actors to JCP is rolebased. In Role Based Access Control [11] (RBAC), permissions are associated with roles, and users are made members of appropriate roles. This model simplifies access administration, management, and audit procedures. The role-permissions relationship changes much less frequently than the role-user relationship, in particular in the judicial field. RBAC allows these two relationships to be managed separately and gives much clearer guidance to system administrators on how to properly add new users and Secure Judicial Communication Exchange Using Soft-computing Methods 17 their associated permissions. RBAC is particularly appropriate in justice information sharing systems where there are typically several organizationally diverse user groups that need access, in varying degrees, to enterprise-wide data. Each JCP system will maintain its own Access Control List (ACL). Example of roles related to judicial cooperation are: • • • • • SCJW magistrate supervisor: Basically he/she has the capability to manage all JCAs. JCA magistrate: he/she has the capability to handle the cases that are assigned to him Liaison Magistrate: a magistrate located in a foreign country that supports the magistrate(s) in case of difficulties. Judicial Clerk: supporting the magistrate for secretarial and administrative tasks (limited access to judicial information). System Administrator: He is the technical administrator of the JCP platform (no access to judicial information) 5 Conclusions Council Decision of 12 February 2007 establishes for the period 2007-2013 the Programme ‘Criminal Justice’ (2007/126/JHA), with the objective to foster judicial cooperation in criminal matter. CARDS project [1] and IPA funds represent today a relevant financial support to regional development in Western Balkans, including justice as one of the key factors. This creates a strong EU support to JCP deployment, while case studies such as the ongoing JWeB and SIDIP [10] projects, demonstrated that electronic case management is now ready for deployment on the technological point of view. Judicial secure collaboration environment will be the basis for the future judicial trans-national cooperation, and systems such as the JCP may lead to a considerable enhancement of cross-border judicial cooperation. The experience in progress in JWeB project is demonstrating that features such as security, non repudiation, strong authentication can be obtained through integration of state of the art technologies and can be coped with collaboration tools, in order to support a more effective and straightforward cooperation between investigating magistrates in full compliance with national judicial procedures and practices. The JCP platform represents a possible bridge between national judicial spaces, allowing through secure web services the usage of the Web as a cost effective and the same time secured interconnection between judicial systems. While technologies are mature and ready to be used, their impact on the judicial organisations in cross-border cooperation is still under analysis. It is one of the main non technological challenges for deployment of solutions such as the one under development in JWeB project. 
The analysis conducted so far in the JWeB project gives a reasonable confidence that needed organisational changes will become evident through the pilot usage of the developed ICT solutions, so giving further contributions to the Ministries of Justice about the activities needed for a future deployment of ICT solutions in a delicate area such as the one of the international judicial cooperation. 18 M. Cislaghi et al. References 1. CARDS project: Support to the Prosecutors Network, EuropeAid/125802/C/ACT/Multi (2007), http://ec.europa.eu/europeaid/cgi/frame12.pl 2. Armone, G., et al.: Diritto penale europeo e ordinamento italiano: le decisioni quadro dell’Unione europea: dal mandato d’arresto alla lotta al terrorismo. Giuffrè edns. (2006) ISBN 88-14-12428-0 3. JWeB consortium (2007), http://www.jweb-net.com 4. European Commission, ICT in the courtroom, the evidence (2005), http://ec.europa.eu/information_society/activities/ policy_link/documents/factsheets/jus_ecourt.pdf 5. European Commission. Security for judicial cooperation (2006), http://ec.europa.eu/information_society/activities/ policy_link/documents/factsheets/just_secure_justice.pdf 6. Cislaghi, M., Cunsolo, F., Mazzilli, R., Muscillo, R., Pellegrini, D., Vuksanovic, V.: Communication environment for judicial cooperation between Europe and Western Balkans. In: Expanding the knowledge economy, eChallenges 2007 conference proceedings, The Hague, The Netherlands (October 2007); ISBN 978-1-58603-801-4, 757-764. 7. Italian Committee for IT in Public Administrations (CNIPA), Linee guida per la sicurezza ICT delle pubbliche amministrazioni. In: Quaderni CNIPA 2006 (2006), http://www.cnipa.gov.it/site/_files/Quaderno20.pdf 8. Italian Committee for IT in Public Administrations (CNIPA), CNIPA Linee guida per l’utilizzo della Firma Digitale, in CNIPA (May 2004), http://www.cnipa.gov.it/site/_files/LineeGuidaFD_200405181.pdf 9. JWeB project consortium (2007-2008), http://www.jweb-net.com/index.php? option=com_content&task=category§ionid=4&id=33&Itemid=63 10. SIDIP project (ICT system supporting trial and hearings in Italy) (2007), http://www.giustiziacampania.it/file/1012/File/ progettosidip.pdf, https://www.giustiziacampania.it/file/1053/File/ mozzillopresentazionesistemasidip.doc 11. Ferraiolo, D.F., Sandhu, R., Gavrila, S., Kuhn, D.R., Chandramouli: A proposed standard for rolebased access control. Technical report, National Institute of Standards & Technology (2000) Identity Resolution in Criminal Justice Data: An Application of NORA Queen E. Booker Minnesota State University, Mankato, 150 Morris Hall Mankato, Minnesota Queen.booker@mnsu.edu Abstract. Identifying aliases is an important component of the criminal justice system. Accurately identifying a person of interest or someone who has been arrested can significantly reduce the costs within the entire criminal justice system. This paper examines the problem domain of matching and relating identities, examines traditional approaches to the problem, and applies the identity resolution approach described by Jeff Jonas [1] and relationship awareness to the specific case of client identification for the indigent defense office. The combination of identity resolution and relationship awareness offered improved accuracy in matching identities. Keywords: Pattern Analysis, Identity Resolution, Text Mining. 1 Introduction Appointing counsel for indigent clients is a complex task with many constraints and variables. 
The manager responsible for assigning the attorney is limited by the number of attorneys at his/her disposal. If the manager assigns an attorney to a case with which the attorney has a conflict of interest, the office loses the funds already invested in the case by the representing attorney. Additional resources are needed to bring the next attorney “up to speed.” Thus, it is in the best interest of the manager to be able to accurately identify the client, the victim and any potential witnesses to minimize any conflict of interest. As the number of cases grows, many times, the manager simply selects the next person on the list when assigning the case. This type of assignment can lead to a high number of withdrawals due to a late identified conflict of interest. Costs to the office increase due to additional incarceration expenses while the client is held in custody as well as the sunk costs of prior and repeated attorney representation regardless of whether the client is in or out of custody. These problems are further exacerbated when insufficient systems are in place to manage the data that could be used to make assignments easier. The data on the defendant is separately maintained by the various criminal justice agencies including the indigent defense service agency itself. This presents a challenge as the number of cases increases but without a concomitant increase in staff available to make the assignments. Thus those individuals responsible for assigning attorneys want not only the ability to better assign attorneys, but also to do so in a more expedient fashion. E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 19–26, 2009. springerlink.com © Springer-Verlag Berlin Heidelberg 2009 20 Q.E. Booker The aggregate data from all the information systems in the criminal justice process have been proven to improve the attorney assignment process [2]. Criminal justice systems have many disparate information systems, each with their own data sets. These include systems concerned with arrests, court case scheduling, the prosecuting attorneys office, to name a few. In many cases, relationships are nonobvious. It is not unusual for a repeat offender to provide an alternative name that is not validated prior to sending the arrest data to the indigent defense office. Likewise it is not unusual for potential witnesses to provide alternative names in an attempt to protect their identities. And further, it is not unusual for a victim to provide yet another name in an attempt to hide a previous interaction with the criminal justice process. Detecting aliases becomes harder as the indigent defense problem grows in complexity. 2 Problems with Matching Matching identities or finding aliases is a difficult process to perform manually. The process relies on institutional knowledge and/or visual stimulation. For example, if an arrest report is accompanied by a picture, the manager or attorney can easily ascertain the person’s identity. But that is not the case. Arrest reports sent generally are textual with the defendant’s name, demographic information, arrest charges, victim, and any witness information. With the institutional knowledge, the manager or an attorney can review the information on the report and identify the person by the use of a previous alias or by other pertinent information on the report. So essentially, it is possible to identify many aliases by humans, and hence possible for an information system because the enterprise contains all the necessary knowledge. 
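To make the manager's problem concrete, the sketch below shows the kind of conflict-of-interest check an assignment tool has to perform. The record layout is a deliberately simplified, hypothetical example, not the office's actual case management schema.

```python
from dataclasses import dataclass, field


@dataclass
class Attorney:
    name: str
    former_clients: set[str] = field(default_factory=set)  # people previously represented


@dataclass
class Case:
    defendant: str
    victims: set[str]
    witnesses: set[str]


def has_conflict(attorney: Attorney, case: Case) -> bool:
    """A crude conflict test: the attorney previously represented the victim or a witness."""
    involved = case.victims | case.witnesses
    return bool(attorney.former_clients & involved)


def assign(case: Case, roster: list[Attorney]) -> Attorney | None:
    """Pick the first attorney on the list without a detectable conflict of interest."""
    for attorney in roster:
        if not has_conflict(attorney, case):
            return attorney
    return None


if __name__ == "__main__":
    roster = [Attorney("A. Smith", {"J. Doe"}), Attorney("B. Jones")]
    case = Case(defendant="R. Roe", victims={"J. Doe"}, witnesses=set())
    print(assign(case, roster).name)   # B. Jones; A. Smith is conflicted out
```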
But the knowledge and the process are trapped across isolated operational systems within the criminal justice agencies. One approach to improving the indigent defense agency problem is to amass information from as many available data sources as possible, clean the data, and find matches to improve the defense process. Traditional algorithms are not well suited to this process. Matching is further encumbered by the poor quality of the underlying data. Lists containing subjects of interest commonly contain typographical errors, data from defendants who intentionally misspell their names to frustrate data matching efforts, and legitimate natural variability (Mike versus Michael, and 123 Main Street versus 123 S. Maine Street). Dates are often a problem as well. Months and days are sometimes transposed, especially in international settings. Numbers often have transposition errors or might have been entered with a different number of leading zeros.

2.1 Current Identity Matching Approaches

Organizations typically employ three general types of identity matching systems: merge/purge and match/merge, binary matching engines, and centralized identity catalogues. Merge/purge and match/merge is the process of combining two or more lists or files while simultaneously identifying and eliminating duplicate records. This process was developed by direct marketing organizations to eliminate duplicate customer records in mailing lists. Binary matching engines test an identity in one data set for its presence in a second data set. These matching engines are also sometimes used to compare one identity with another single identity (versus a list of possibilities), with the output often expected to be a confidence value pertaining to the likelihood that the two identity records are the same. These systems were designed to help organizations recognize individuals with whom they had previously done business or, alternatively, recognize that the identity under evaluation is known as a subject of interest—that is, on a watch list—thus warranting special handling [1]. Centralized identity catalogues are systems that collect identity data from disparate and heterogeneous data sources and assemble it into unique identities, while retaining pointers to the original data source and record, with the purpose of creating an index.

Each of the three types of identity matching systems uses either probabilistic or deterministic matching algorithms. Probabilistic techniques rely on training data sets to compute attribute distributions and frequencies, looking for both common and uncommon patterns. These statistics are stored and used later to determine confidence levels in record matching. As a result, any record containing similar but uncommon data might be considered a record of the same person with a high degree of probability. These systems lose accuracy when the underlying data's statistics deviate from the original training set and must be retrained frequently to maintain their accuracy. Deterministic techniques rely on pre-coded expert rules to define when records should be matched. One rule might be that if the names are close (Robert versus Rob) and the social security numbers are the same, the system should consider the records as matching identities. These systems often have complex rules based on itemsets such as name, birthdate, zipcode, telephone number, and gender. However, these systems fail as data becomes more complex.
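A few of the deterministic rules and data-quality problems just described can be sketched as follows. The nickname table, similarity threshold and field names are illustrative assumptions, not the rules of any production matcher.

```python
from difflib import SequenceMatcher

# Tiny illustrative nickname table; a real matcher would use a curated list.
NICKNAMES = {"mike": "michael", "rob": "robert", "bob": "robert", "bill": "william"}


def normalize_name(name: str) -> str:
    first = name.strip().lower()
    return NICKNAMES.get(first, first)


def names_close(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat nickname variants and small typos as the same name."""
    a, b = normalize_name(a), normalize_name(b)
    return a == b or SequenceMatcher(None, a, b).ratio() >= threshold


def dates_match(d1: str, d2: str) -> bool:
    """Accept day/month transposition, e.g. 03-07-1980 versus 07-03-1980."""
    p1, p2 = d1.split("-"), d2.split("-")
    return p1 == p2 or (p1[0] == p2[1] and p1[1] == p2[0] and p1[2] == p2[2])


def same_identity(rec1: dict, rec2: dict) -> bool:
    """One deterministic rule: close name AND (same identifier OR compatible birth date)."""
    if not names_close(rec1["first_name"], rec2["first_name"]):
        return False
    if rec1.get("id_number") and rec1.get("id_number") == rec2.get("id_number"):
        return True
    return dates_match(rec1["dob"], rec2["dob"])


if __name__ == "__main__":
    a = {"first_name": "Mike", "dob": "03-07-1980", "id_number": "X123"}
    b = {"first_name": "Michael", "dob": "07-03-1980", "id_number": None}
    print(same_identity(a, b))   # True
```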
3 NORA

Jeff Jonas introduced a system called NORA, which stands for non-obvious relationship awareness. He developed the system specifically to solve Las Vegas casinos' identity matching problems. NORA accepts data feeds from numerous enterprise information systems and builds a model of identities and relationships between identities (such as shared addresses or phone numbers) in real time. If a new identity matched or related to another identity in a manner that warranted human scrutiny (based on basic rules, such as good guy connected to very bad guy), the system would immediately generate an intelligence alert. The system approach for the Las Vegas casinos is very similar to the needs of the criminal justice system. The data needed to identify aliases and relationships for conflict of interest concerns comes from multiple data sources – arresting agency, probation offices, court systems, prosecuting attorney's office, and the defense agency itself – and the ability to successfully identify a client is needed in real time to reduce costs to the defense office. The NORA system requirements were:

• Sequence neutrality. The system needed to react to new data in real time.
• Relationship awareness. Relationship awareness was designed into the identity resolution process so that newly discovered relationships could generate real-time intelligence. Discovered relationships also persisted in the database, which is essential to generate alerts beyond one degree of separation.
• Perpetual analytics. When the system discovered something of relevance during the identity matching process, it had to publish an alert in real time to secondary systems or users before the opportunity to act was lost.
• Context accumulation. Identity resolution algorithms evaluate incoming records against fully constructed identities, which are made up of the accumulated attributes of all prior records. This technique enabled new records to match to known identities in toto, rather than relying on binary matching that could only match records in pairs. Context accumulation improved accuracy and greatly improved the handling of low-fidelity data that might otherwise have been left as a large collection of unmatched orphan records.
• Extensible. The system needed to accept new data sources and new attributes through the modification of configuration files, without requiring that the system be taken offline.
• Knowledge-based name evaluations. The system needed detailed name evaluation algorithms for high-accuracy name matching. Ideally, the algorithms would be based on actual names taken from all over the world and developed into statistical models to determine how and how often each name occurred in its variant form. This empirical approach required that the system be able to automatically determine the culture that the name most likely came from, because names vary in predictable ways depending on their cultural origin.
• Real time. The system had to handle additions, changes, and deletions from real-time operational business systems. Processing times are so fast that matching results and accompanying intelligence (such as whether the person is on a watch list or the address is missing an apartment number based on prior observations) could be returned to the operational systems in sub-seconds.
• Scalable. The system had to be able to process records on a standard transaction server, adding information to a repository that holds hundreds of millions of identities.
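Context accumulation, the fourth requirement above, means that an incoming record is evaluated against fully constructed identities rather than against individual records. The toy in-memory resolver below illustrates the idea under deliberately simplified matching rules; the real engine persists identities in a database and re-evaluates earlier orphan records, which is what gives it sequence neutrality.

```python
class Identity:
    """A constructed identity: the accumulated attributes of every record matched to it."""

    def __init__(self, record: dict):
        self.names = {record["name"].lower()}
        self.phones = {p for p in [record.get("phone")] if p}
        self.addresses = {a for a in [record.get("address")] if a}

    def matches(self, record: dict) -> bool:
        # Simplified rule: same name, or a shared phone/address among accumulated attributes.
        return (record["name"].lower() in self.names
                or record.get("phone") in self.phones
                or record.get("address") in self.addresses)

    def absorb(self, record: dict) -> None:
        self.names.add(record["name"].lower())
        if record.get("phone"):
            self.phones.add(record["phone"])
        if record.get("address"):
            self.addresses.add(record["address"])


def resolve(records: list[dict]) -> list[Identity]:
    identities: list[Identity] = []
    for rec in records:
        for ident in identities:
            if ident.matches(rec):
                ident.absorb(rec)      # new attributes enrich the identity ("in toto" matching)
                break
        else:
            identities.append(Identity(rec))
    return identities


if __name__ == "__main__":
    feed = [
        {"name": "John Q. Public", "phone": "555-0100"},
        {"name": "J. Public", "phone": "555-0100"},          # matches via the accumulated phone
        {"name": "John Q. Public", "address": "12 Oak St"},  # adds an address to the identity
    ]
    print(len(resolve(feed)))   # 1 constructed identity
```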
[1] Like the gaming industry, the defense attorney’s office has relatively low daily transactional volumes. Although it receives booking reports on an ongoing basis, initial court appearances are handled by a specific attorney, and the assignments are made daily, usually the day after the initial court appearance. The attorney at the initial court appearance is not the officially assigned attorney, allowing the manager a window of opportunity from booking to assigning the case to accurately identify the client. But the analytical component of accurate identification involves numerous records with accurate linkages including aliases as well as past relationships and networks as related to the case. The legal profession has rules and regulations that constitute conflict of interest. Lawyers must follow these rules to maintain their license to practice which makes the assignment process even more critical. [3] NORA’s identity resolution engine is capable of performing in real time against extraordinary data volumes. The gaming industry's requirements of less than 1 million affected records a day means that a typical installation might involve a single Intelbased server and any one of several leading SQL database engines. This performance establishes an excellent baseline for application to the defense attorney data since the NORA system demonstrated that the system could handle multibillion-row databases Identity Resolution in Criminal Justice Data: An Application of NORA 23 consisting of hundreds of millions of constructed identities and ingest new identities at a rate of more than 2,000 identity resolutions per second; such ultra-large deployments require 64 or more CPUs and multiple terabytes of storage, and move the performance bottleneck from the analytic engine to the database engine itself. While the defense attorney dataset is not quite as large, the processing time on the casino data suggests that NORA would be able to accurately and easily handle the defense attorney’s needs in real-time. 4 Identity Resolution Identity resolution is an operational intelligence process, typically powered by an identity resolution engine, whereby organizations can connect disparate data sources with a view to understanding possible identity matches and non-obvious relationships across multiple data sources. It analyzes all of the information relating to individuals and/or entities from multiple sources of data, and then applies likelihood and probability scoring to determine which identities are a match and what, if any, nonobvious relationships exist between those identities. These engines are used to uncover risk, fraud, and conflicts of interest. Identity resolution is designed to assemble i identity records from j data sources into k constructed, persistent identities. The term "persistent" indicates that matching outcomes are physically stored in a database at the moment a match is computed. Accurately evaluating the similarity of proper names is undoubtedly one of the most complex (and most important) elements of any identity matching system. Dictionary- based approaches fail to handle the complexities of names such as common names such as Robert Johnson. The approaches fail even greater when cultural influences in naming are involved. Soundex is an improvement over traditional dictionary approaches. It uses a phonetic algorithm for indexing names by their sound when pronounced in English. 
The basic aim is for names with the same pronunciation to be encoded to the same string so that matching can occur despite minor differences in spelling. Such systems' attempts to neutralize slight variations in name spelling by assigning some form of reduced "key" to a name (by eliminating vowels or eliminating double consonants) frequently fail because of external factors—for example, different fuzzy matching rules are needed for names from different cultures. Jonas found that the deterministic method is essential for eliminating dependence on training data sets. As such, the system no longer needed periodic reloads to account for statistical changes to the underlying universe of data. However, he also identified many common conditions in which deterministic techniques fail—specifically, certain attributes were so overused that it made more sense to ignore them than to use them for identity matching and detecting relationships. For example, two people with the first name of "Rick" who share the same social security number are probably the same person—unless the number is 111-11-1111. Two people who have the same phone number probably live at the same address—unless that phone number is a travel agency's phone number. He refers to such values as generic because the overuse diminishes the usefulness of the value itself. It is impossible to know all of these generic values a priori—for one reason, they keep changing—thus probabilistic-like techniques are used to automatically detect and remember them. His identity resolution system uses a hybrid matching approach that combines deterministic expert rules with a probabilistic-like component to detect generics in real time (to avoid the drawback of training data sets). The result is expert rules that look something like this:

If the name is similar
AND there is a matching unique identifier
THEN match
UNLESS this unique identifier is generic

In his system, a unique identifier might include social security or credit-card numbers, or a passport number, but would not include such values as phone number or date of birth. The term "generic" here means the value has become so widely used (across a predefined number of discrete identities) that one can no longer use this same value to disambiguate one identity from another [1]. The approach for this study of the defense data, however, used a merged itemset combining date of birth, gender, and ethnicity code, because of the legal constraint that the social security number cannot be used for identification. Thus, an identifier was developed from a merged itemset after using the SUDA algorithm to identify infrequent itemsets based on data mining [4]. The actual deterministic matching rules for NORA, as well as for the defense attorney system, are much more elaborate in practice because they must explicitly address fuzzy matching to scrub and clean the data as well as address transposition errors in numbers, malformed addresses, and other typographical errors. The current defense attorney agency model has thirty-six rules. Once the data is "cleansed" it is stored and indexed to provide user-friendly views of the data that make it easy for the user to find specific information when performing queries and ad hoc reporting. Then, a data-mining algorithm using a combination of binary regression and logit models is run to update patterns for assigning attorneys based on the day's outcomes [5].
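Returning to the hybrid rule quoted above — match on a similar name and a shared unique identifier, unless that identifier has become generic — a minimal sketch follows. The similarity threshold and the generic-value counter are illustrative assumptions; the counter mimics the probabilistic-like component that notices overused values.

```python
from collections import defaultdict
from difflib import SequenceMatcher

# How many distinct identities may share a value before it is treated as "generic"
# (e.g. 111-11-1111, or a travel agency's phone number). The threshold is illustrative.
GENERIC_THRESHOLD = 5

identifier_usage: dict[str, set[str]] = defaultdict(set)   # identifier -> identity ids seen


def record_usage(identifier: str, identity_id: str) -> None:
    identifier_usage[identifier].add(identity_id)


def is_generic(identifier: str) -> bool:
    return len(identifier_usage[identifier]) >= GENERIC_THRESHOLD


def names_similar(a: str, b: str) -> bool:
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= 0.8


def hybrid_match(rec_a: dict, rec_b: dict) -> bool:
    """IF the name is similar AND there is a matching unique identifier
    THEN match, UNLESS this unique identifier is generic."""
    same_id = rec_a["identifier"] == rec_b["identifier"]
    return (names_similar(rec_a["name"], rec_b["name"])
            and same_id
            and not is_generic(rec_a["identifier"]))


if __name__ == "__main__":
    # Five different people were seen with 111-11-1111, so it no longer disambiguates anyone.
    for i in range(5):
        record_usage("111-11-1111", f"identity-{i}")
    a = {"name": "Rick Smith", "identifier": "111-11-1111"}
    b = {"name": "Rick Smith", "identifier": "111-11-1111"}
    print(hybrid_match(a, b))   # False: the shared identifier is generic
    c = {"name": "Rick Smith", "identifier": "987-65-4321"}
    d = {"name": "Rick Smyth", "identifier": "987-65-4321"}
    print(hybrid_match(c, d))   # True
```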
The algorithm identifies patterns for the outcomes and tree structure for attorney and defendant combinations where the attorney “completed the case.” [6] Although matching accuracy is highly dependent on the available data, using the techniques described here achieves the goals of identity resolution, which essentially boil down to accuracy, scalability, and sustainability even in extremely large transactional environments. 5 Relationship Awareness According to Jonas, detecting relationships is vastly simplified when a mechanism for doing so is physically embedded into the identity matching algorithm. Stating the obvious, before analyzing meaningful relationships, the system must be able to resolve unique identities. As such, identity resolution must occur first. Jonas purported that it was computationally efficient to observe relationships at the moment the Identity Resolution in Criminal Justice Data: An Application of NORA 25 identity record is resolved because in-memory residual artifacts (which are required to match an identity) comprise a significant portion of what's needed to determine relevant relationships. Relevant relationships, much like matched identities, were then persisted in the same database. Notably, some relationships are stronger than others; a relationship score that's assigned with each relationship pair captures this strength. For example, living at the same address three times over 10 years should yield a higher score than living at the same address once for three months. As identities are matched and relationships detected, the NORA evaluates userconfigurable rules to determine if any new insight warrants an alert being published as an intelligence alert to a specific system or user. One simplistic way to do this is via conflicting roles. A typical rule for the defense attorney might be notification any time a client rule is associated to a role of victim, witness, co-defendant, or previously represented relative, for example. In this case, associated might mean zero degrees of separation (they're the same person) or one degree of separation (they're roommates). Relationships are maintained in the database to one degree of separation; higher degrees are determined by walking the tree. Although the technology supports searching for any degree of separation between identities, higher orders include many insignificant leads and are thus less useful. 6 Comparative Results This research is an ongoing process to improve the attorney assignment process in the defense attorney offices. As economic times get harder, crime increases and as crimes increase, so do the number of people who require representation by the public defense offices. The ability to quickly identify conflicts of interests reduces the amount of time a person stays in the system and also reduces the time needed to process the case. The original system built to work with the alias/identity matching as called the Court Appointed Counsel System or CACS. CACS identified 83% more conflicts of interests than the indigent defense managers during the initial assignments [2]. Using the merged itemset and an algorithm using NORA’s underlying technology, the conflicts improved from 83% to 87%. But the real improvement came in the processing time. The key to the success of these systems is the ability to update and provide accurate data at a moments notice. 
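The alerting behaviour described in Section 5 — flag a client who resolves to, or sits one degree of separation from, a victim, witness, co-defendant or previously represented person — can be sketched as follows, with a toy in-memory graph and illustrative role names.

```python
from collections import defaultdict

CONFLICTING_ROLES = {"victim", "witness", "co-defendant", "previously_represented"}

# identity id -> roles it carries; relationship pairs carry a strength score
roles: dict[str, set[str]] = defaultdict(set)
relationships: dict[str, dict[str, float]] = defaultdict(dict)


def add_relationship(a: str, b: str, score: float) -> None:
    """Persist a scored relationship pair (e.g. years at the same address -> higher score)."""
    relationships[a][b] = max(relationships[a].get(b, 0.0), score)
    relationships[b][a] = max(relationships[b].get(a, 0.0), score)


def alerts_for_client(client: str) -> list[str]:
    """Zero degrees: the client himself carries a conflicting role.
    One degree: a directly related identity carries one."""
    found = []
    direct = roles[client] & CONFLICTING_ROLES
    if direct:
        found.append(f"{client} also appears as {sorted(direct)}")
    for other, score in relationships[client].items():
        hits = roles[other] & CONFLICTING_ROLES
        if hits:
            found.append(f"{client} related to {other} ({sorted(hits)}, score {score:.1f})")
    return found


if __name__ == "__main__":
    roles["id-17"] |= {"witness"}
    add_relationship("id-42", "id-17", score=0.8)   # e.g. roommates at the same address
    print(alerts_for_client("id-42"))
```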
Utilizing NORA’s underlying algorithms improved the updating and matching process significantly, allowing for new data to be entered and analyzed within a couple of hours as opposed to the days it took to process using the CACS algorithms. Further, the merged itemset approach helped to provide a unique identifier in 90% of the cases significantly increasing automated relationship identifications. The ability to handle real-time transactional data with sustained accuracy will continue to be of "front and center" importance as organizations seek competitive advantage. The identity resolution technology applied here provides evidence that such technologies can be applied to more than simple fraud detection but also to improve business decision making and intelligence support to entities whose purpose are to make expedient decisions regarding individual identities. 26 Q.E. Booker References 1. Jonas, J.: Threat and Fraud Intelligence, Las Vegas Style. IEEE Security & Privacy 4(06), 28–34 (2006) 2. Booker, Q., Kitchens, F.K., Rebman, C.: A Rule Based Decision Support System Prototype for Assigning Felony Court Appointed Counsel. In: Proceedings of the 2004 Decision Sciences Annual Meeting, Boston, MA (2004) 3. Gross, L.: Are Differences Among the Attorney Conflict of Interest Rules Consistent with Principles of Behavioral Economics. Georgetown Journal of Legal Ethics 19, 111 (2006) 4. Manning, A.M., Haglin, D.J., Keane, J.A.: A Recursive Search Algorithm for Statistical Disclosure Assessment. Data Mining and Knowledge Discovery (accepted, 2007) 5. Kitchens, F.L., Sharma, S.K., Harris, T.: Cluster Computers for e-Business Applications. Asian Journal of Information Systems (AJIS) 3(10) (2004) 6. Forgy, C.: Rete: A Fast Algorithm for the Many Pattern/ Many Object Pattern Match Problem. Artificial Intelligence 19 (1982) PTK: An Alternative Advanced Interface for the Sleuth Kit Dario V. Forte, Angelo Cavallini, Cristiano Maruti, Luca Losio, Thomas Orlandi, and Michele Zambelli The IRItaly Project at DFlabs Italy www.dflabs.com Abstract. PTK is a new open-source tool for all complex digital investigations. It represents an alternative to the well-known but now obsolete front-end Autopsy Forensic Browser. This latter tool has a number of inadequacies taking the form of a cumbersome user interface, complicated case and evidence management, and a non-interactive timeline that is difficult to consult. A number of important functions are also lacking, such as an effective bookmarking system or a section for file analysis in graphic format. The need to accelerate evidence analysis through greater automation has prompted DFLabs to design and develop this new tool. PTK provides a new interface for The Sleuth Kit (TSK) suite of tools and also adds numerous extensions and features, one of which is an internal indexing engine that is capable of carrying out complex evidence pre-analysis processes. PTK was written from scratch using Ajax technology for graphic contents and a MySql database management system server for saving indexing results and investigator-generated bookmarks. This feature allows a plurality of users to work simultaneously on the same or different cases, accessing previously indexed contents. The ability to work in parallel greatly reduces analysis times. These characteristics are described in greater detail below. 
PTK includes a dedicated “Extension Management” module that allows existing or newly developed tools to be integrated into it, effectively expanding its analysis and automation capacity. Keywords: Computer Forensics, Open Source, SleuthKit, Autopsy Forensic, Incident Response. 1 Multi-investigator Management One of the major features of this software is its case access control mechanism and high level user profiling, allowing more than one investigator to work simultaneously on the same case. The administrator creates new cases and assigns investigators to them, granting appropriate access privileges. The investigators are then able to work in parallel on the same case. PTK user profiling may be used to restrict access to sensitive cases to a handpicked group of investigators or even a single investigator. The advantages of this type of system are numerous: above all, evidence analysis is speeded up by the ability of a team of investigators to work in parallel; secondly, the problem of case synchronization is resolved since all operations reference the same database. Each investigator is also able to save specific notes and references directly relating to his or her activities on a case in a special bookmark section. All user actions are logged in CSV format so that all application activity can be retraced. Furthermore, the administrator is able to manage PTK log files from the interface, viewing the contents in table format and exporting them locally. E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 27–34, 2009. © Springer-Verlag Berlin Heidelberg 2009 springerlink.com 28 D.V. Forte et al. 2 Direct Evidence Analysis As a graphic interface for the TSK suite of tools, PTK inherits all the characteristics of this system, starting with the recognized evidence formats. PTK supports Raw (e.g., dd), Expert Witness (e.g., EnCase) and AFF evidence. Evidence may only be added to a case by the Administrator, who follows a guided three-step procedure: 1. 2. 3. Insertion of information relating to the disk image, such as type and place of acquisition; Selection of the image file and any partitions to be included in the analysis; File hashing (MD5 and SHA1) while adding the image. One important PTK function is automatic recognition of the disk file system and partitions during image selection. PTK also recognizes images in various formats that have been split up. Here the investigator needs only to select one split, since PTK is able to recognize all splits belonging to the same image. PTK and TSK interact directly for various analysis functions which therefore do not require preliminary indexing operations: • File analysis • Data unit analysis • Data export 2.1 File Analysis This section analyzes the contents of files (also deleted files) in the disk image. PTK introduces the important new feature of the tree-view, a dynamic disk directory tree which provides immediate evidence browsing capability. PTK allows multiple files to be opened simultaneously on different tabs to facilitate comparative analysis. The following information is made available to investigators: • • • • • Contents of file in ASCII format; Contents of file in hexadecimal format; Contents of file in ASCII string format; File categorization; All TSK output information: permissions, file name, MAC times, dimensions, UID, GID and inode. 3 Indexing Engine In order to provide the user with the greatest amount of information in the least time possible, an indexing system has been designed and developed for PTK. 
The objective is to minimize the time needed to recover all the file information required in forensic analysis, such as hash values, timeline, file type (the file extension alone does not determine the type) and keywords. For small files, indexing may not be necessary, since the time required to recover information such as the MD5 hash may be negligible. However, for files whose size is on the order of megabytes these operations begin to slow down, and the wait time for the results becomes excessive. Hence a procedure was developed in which all the files in an image are processed just once and the result is saved in a database. The following indices have been implemented in PTK:

• Timeline
• File type
• MD5
• SHA1
• Keyword search

4 Indexed Evidence Analysis

All analysis functions that require preliminary indexing are collected under the name "Indexed Analysis", which includes timeline analysis, keyword search and hash comparison.

4.1 Timeline Analysis

The disk timeline helps the investigator concentrate on the areas of the image where evidence may be located. It displays the chronological succession of actions carried out on allocated and non-allocated files. These actions are traced by means of analysis of the metadata known as MAC times (Modification, Access, and Creation). The degree of detail depends on the file system type: FAT32, for example, does not record the time of last access to a file, but only the date, so in the timeline analysis phase this information is displayed as 00:00:00. PTK allows investigators to analyze the timeline by means of time filters. The time unit, in relation to the file system, is on the order of one second. Investigators have two types of timelines at their disposal: one in table format and one in graphic format. The former allows investigators to view each single timeline entry, organized into fields (time and date, file name, actions performed, dimension, permissions), and provides direct access to content analysis or export operations. The latter is a graphic representation plotting the progress of each action (MAC times) over a given time interval. This is a useful tool for viewing file access activity peaks.

4.2 Keyword Search

The indexing process generates a database of keywords which makes it possible to carry out high-performance searches in real time. Searches are carried out by means of the direct use of strings or the creation of regular expressions. The interface offers various templates of regular expressions that the user can use and customize. The search templates described by regular expressions are stored in text files and can thus be customized by users.

4.3 Hash Set Manager and Comparison

Once the indexing process has been completed, PTK generates an MD5 or SHA1 hash value for each file present in the evidence: these values are used in comparisons with hash sets (either public or user-generated), making it possible to determine whether a file belongs to the "known good" or "known bad" category. Investigators can also use this section to import the contents of Rainbow Tables in order to compare a given hash, perhaps one recovered via a keyword search, with those in the hash set.
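The index-once idea can be sketched as follows: walk the evidence tree a single time, compute the hashes and MAC-style times for every file, and store them so that later timeline or hash-set queries read the index instead of touching the evidence again. SQLite stands in here for PTK's MySQL back end, and the schema is an illustrative assumption, not PTK's actual one.

```python
import hashlib
import os
import sqlite3


def index_evidence(root: str, db_path: str = "file_index.db") -> None:
    """Walk a mounted evidence tree once, storing MD5/SHA1 and MAC-style times per file."""
    con = sqlite3.connect(db_path)
    con.execute("""CREATE TABLE IF NOT EXISTS file_index (
                       path TEXT PRIMARY KEY, md5 TEXT, sha1 TEXT,
                       size INTEGER, mtime REAL, atime REAL, ctime REAL)""")
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                md5, sha1 = hashlib.md5(), hashlib.sha1()
                with open(path, "rb") as fh:
                    for chunk in iter(lambda: fh.read(1 << 20), b""):
                        md5.update(chunk)
                        sha1.update(chunk)
                st = os.stat(path)
            except OSError:
                continue  # unreadable file: skip rather than abort the indexing run
            con.execute("INSERT OR REPLACE INTO file_index VALUES (?,?,?,?,?,?,?)",
                        (path, md5.hexdigest(), sha1.hexdigest(),
                         st.st_size, st.st_mtime, st.st_atime, st.st_ctime))
    con.commit()
    con.close()


def timeline(db_path: str = "file_index.db") -> list[tuple]:
    """A timeline-style view ordered by modification time, read from the index."""
    con = sqlite3.connect(db_path)
    rows = con.execute("SELECT mtime, path, md5 FROM file_index ORDER BY mtime").fetchall()
    con.close()
    return rows


if __name__ == "__main__":
    index_evidence(".")                 # stand-in for a mounted evidence directory
    for row in timeline()[:5]:
        print(row)
```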
5 Data Carving Process

Data carving seeks files or other data structures in an incoming data flow based on their contents rather than on the meta-information that a file system associates with each file or directory. The initial approach chosen for PTK is based on the techniques of header/footer carving and header/maximum (file) size carving, following the taxonomy of Simson Garfinkel and Joachim Metz. The PTK indexing step provides the possibility of enabling data carving for the non-allocated space of evidence imported into the case. It is possible to configure the data carving module directly, by adding or eliminating entries for the headers and footers used in carving. The investigator can also set up custom search patterns directly from the interface. In this way the investigator can search for patterns not only to find files, by means of new headers and footers, but also to find file contents. The particular structure of PTK also allows investigators to run data carving operations on evidence consisting of a RAM dump. Note that the data carving results are not saved directly in the database; only references to the data identified during the process are saved. The indexing process also uses matching headers and footers for the categorization of all the files in the evidence. The output of this process allows the analyzed data to be subdivided into different categories:

• Documents (Word, Excel, ASCII, etc.)
• Graphic or multimedia content (images, video, audio)
• Executable programs
• Compressed or encrypted data (zip, rar, etc.)

6 Bookmarking and Reporting

The entire analysis section is flanked by a bookmarking subsystem that allows investigators to bookmark evidence at any time. All operations are facilitated by the back-end MySql database, so no data is written locally in the client file system. When an investigator saves a bookmark, the reference to the corresponding evidence is written in the database, in terms of inodes and sectors, without any data being transferred from the disk being examined to the database. Each bookmark is also associated with a tag specifying the category and a text field for any user notes. Each investigator has a private bookmark management section, which can be used, at the investigator's total discretion, to share bookmarks with other users. Reports are generated automatically on the basis of the bookmarks saved by the user. PTK provides two report formats: HTML and PDF. Reports are highly customizable in terms of graphics (header, footer, logos) and contents, with the option of inserting additional fields for enhanced description and documentation of the investigation results.

7 PTK External Modules (Extensions)

This PTK section allows users to use external tools for the execution of various tasks. It is designed to give the application the flexibility of performing automatic operations on different operating systems, running data search or analysis processes, and recovering deleted files. The "PTK extension manager" creates an interface between third-party tools and the evidence management system and runs various processes on them. The currently enabled extensions provide for memory dump analysis, Windows registry analysis and OS artifact recovery. The first extension provides PTK with the ability to analyze the contents of RAM dumps.
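An extension manager of this kind essentially needs a registry of tools and a uniform way to run them against evidence and collect their output for bookmarking. The sketch below is a hypothetical API in that spirit, not PTK's actual extension interface (which belongs to its Ajax/MySQL code base); the extension names simply mirror those listed above.

```python
from typing import Callable

# Registry of extensions: name -> callable taking an evidence path and returning findings.
EXTENSIONS: dict[str, Callable[[str], list[str]]] = {}


def extension(name: str):
    """Decorator registering a third-party tool as an extension."""
    def register(func: Callable[[str], list[str]]):
        EXTENSIONS[name] = func
        return func
    return register


@extension("memory_dump_strings")
def memory_dump_strings(evidence_path: str) -> list[str]:
    # Placeholder: a real extension would extract printable strings from a RAM dump.
    return [f"strings extracted from {evidence_path}"]


@extension("windows_registry")
def windows_registry(evidence_path: str) -> list[str]:
    # Placeholder: a real extension would parse registry hives found in the evidence.
    return [f"registry keys of interest in {evidence_path}"]


def run_extensions(evidence_path: str) -> dict[str, list[str]]:
    """Run every registered extension; the caller decides what to bookmark."""
    return {name: func(evidence_path) for name, func in EXTENSIONS.items()}


if __name__ == "__main__":
    for name, findings in run_extensions("/cases/17/ram.dd").items():
        print(name, findings)
```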
The memory dump analysis extension allows both evidence from long-term data storage media and evidence from memory dumps to be associated with a case, thus allowing important information to be extracted, such as a list of strings in memory, which could potentially contain passwords to be used for the analysis of protected or encrypted archives found on the disk. The registry analysis extension gives PTK the capability of recognizing and interpreting a Microsoft Windows registry file and navigating within it with the same facility as the regedit tool. Additionally, PTK provides for automatic search within the most important sections of the registry and generation of output results. The Artifact Recovery extension was implemented in order to reconstruct or recover specific contents relating to the functions of an operating system or its components or applications. The output from these automatic processes can be included among the investigation bookmarks. PTK extensions do not write their output to the database, in order to prevent it from becoming excessively large. User-selected output from these processes may be included in the bookmark section in the database. If bookmarks are not created before PTK is closed, the results are lost.

8 Comparative Assessment

The use of Ajax in the development of PTK has drastically reduced execution times on the server side and, by delegating part of the code execution to the client, has reduced user wait times by minimizing the amount of information loaded into pages. An assessment was carried out to compare the performance of PTK with that of Autopsy Forensic Browser. Given that these are two web-based applications using different technologies, it is not possible to make a direct, linear comparison of performance. For these reasons, it is useful to divide the assessment into two parts: the first highlights the main differences in the interfaces, examining the necessary user procedures; the second makes a closer examination of the performance of the PTK indexing engine, providing a more technical comparison on the basis of such objective parameters as command execution, parsing, and output presentation times.

8.1 Interface

The following comparative assessment of Autopsy and PTK (Table 1) highlights the differences at the interface level, evaluated in terms of the number of pages loaded for the execution of the requested action. All pages (and thus the steps taken by the user) are counted starting from and excluding the home page of each application.

Table 1. Interface comparison between Autopsy and PTK

Action: New case creation
  Autopsy: You click "New case" and a new page is loaded where you add the case name, description, and assigned investigator names (text fields). Pages loaded: 2
  PTK: You click "Add new case" and a modal form is opened where it is sufficient to provide a case name and brief description. Pages loaded: 1

Action: Investigator assignment
  Autopsy: Investigators are assigned to the case when it is created. However, these are only text references.
  PTK: You click on the icon in the case table to access the investigator management panel. These assignments represent bona fide user profiles.

Action: Image addition
  Autopsy: You select a case and a host and then click "Add image file". A page is displayed where you indicate the image path (manually) and specify a number of import parameters. On the next page, you specify integrity control operations and select the partitions. Then you click "Add". Pages loaded: 6
  PTK: In the case table, you click on the image management icon and then click "Add new image". A modal form opens with a guided 3-step process for adding the image. Path selection is based on automatic folder browsing. Pages loaded: 1

Action: Image integrity verification
  Autopsy: After selecting the case, the host and the image, you click "Image integrity". The next page allows you to create an MD5 hash of the file and to verify it on request. Pages loaded: 4
  PTK: You open image management for a case and click "Integrity check". A panel opens where you can generate and/or verify both MD5 and SHA1 hashes. Pages loaded: 1

Action: Evidence analysis
  Autopsy: After selecting the case, the host and the image, you click "Analyze" to access the analysis section. Pages loaded: 4
  PTK: After opening the panel displaying the images in a case, you click the icon "Analyze image" to access the analysis section. Pages loaded: 1

Action: Evidence timeline creation
  Autopsy: After selecting the case, the host and the image, you click "File activity time lines". You then have to create a data file by providing appropriate creation parameters and create the timeline file based on the file thus generated. Pages loaded: 8
  PTK: You open image management for a case and click on the indexing icon. The option of generating a timeline comes up and the process is run. The timeline is saved in the database and is available during analyses. Pages loaded: 1

Action: String extraction
  Autopsy: After selecting the case, the host and the image, you click "Details". On the next page you click "Extract strings" to run the process. Pages loaded: 5
  PTK: You open image management for a case and click on the indexing icon. The option of extracting strings comes up and the process is run. All ASCII strings for each image file are saved in the database. Pages loaded: 1

8.2 Indexing Performance

The following tests were performed on the same evidence: file system FAT32; size 1.9 GB; acquired with dd. A direct comparison (Table 2) can be made for timeline generation and keyword extraction in terms of the time required to perform the operations.

Table 2. Indexing performance of Autopsy and PTK

Action: Timeline generation
  Autopsy: 54" + 2"
  PTK: 18"

Action: Keyword extraction
  Autopsy: 8' 10"
  PTK: 8' 33"

Action: File hash generation
  Autopsy: Autopsy manages the hash values (MD5) for each file at the directory level. The hash generation operation must therefore be run from the file analysis page; however, this process does not save any of the generated hash values.
  PTK: PTK optimizes the generation of file hashes via its indexing operations, eliminating wait time during analysis and making the hash values easy to consult.

9 Conclusions and Further Steps

The main idea behind the project was to provide an "alternative" interface to the TSK suite so as to offer a new and valid open source tool for forensic investigations. We use the term "alternative" because PTK was not designed to be a completely different piece of software from its forerunner, Autopsy, but a product that seeks to improve the performance of existing functions and resolve their inadequacies. The strong point of this project is thus the careful initial analysis of Autopsy Forensic Browser, which allowed the developers to establish the bases for a robust product that represents a real step forward. Future developments of the application will certainly include:

• Integration of new tools as extensions of the application in order to address a greater number of analysis types within the capabilities of PTK.
• Creation of customized installation packages for the various platforms.
• Adaption of style sheets to all browser types in order to extend the portability of the tool. References 1. Carrier, Brian: File System Forensic Analysis. Addison Wesley, Reading (2005) 2. Carrier, Brian: Digital Forensic Tool Testing Images (2005), http://dftt.sourceforge.net 3. Carvey, Harlan: Windows Forensic Analysis. Syngress (2007) 34 D.V. Forte et al. 4. Casey, Eoghan: Digital Evidence and Computer Crime. Academic Press, London (2004) 5. Garfinkel, Simson: Carving Contiguous and Fragmented Files with Fast Object Validation. In: Digital Forensics Workshop (DFRWS 2007), Pittsburgh, PA (August 2007) 6. Jones, Keith, J., Bejtlich, Richard, Rose, Curtis, W.: Real Digital Forensics: Computer Security and Incident Response. Addison-Wesley, Reading (2005) 7. Schwartz, Randal, L., Phoenix, Tom: Learning Perl. O’Reilly, Sebastopol (2001) 8. The Sleuthkit documentation, http://www.sleuthkit.org/ 9. Forte, D.V.: The State of the Art in Digital Forensics. Advances in Computers 67, 254– 300 (2006) 10. Forte, D.V., Maruti, C., Vetturi, M.R., Zambelli, M.: SecSyslog: an Approach to Secure Logging Based on Covert Channels. In: SADFE 2005, 248–263 (2005) Stalker, a Multilingual Text Mining Search Engine for Open Source Intelligence F. Neri1 and M. Pettoni2 1 Lexical Systems Department, Synthema, Via Malasoma 24, 56121 Ospedaletto – Pisa, Italy federico.neri@synthema.it 2 CIFI/GE, II Information and Security Department (RIS), Stato Maggiore Difesa, Rome, Italy Abstract. Open Source Intelligence (OSINT) is an intelligence gathering discipline that involves collecting information from open sources and analyzing it to produce usable intelligence. The international Intelligence Communities have seen open sources grow increasingly easier and cheaper to acquire in recent years. But up to 80% of electronic data is textual and most valuable information is often hidden and encoded in pages which are neither structured, nor classified. The process of accessing all these raw data, heterogeneous in terms of source and language, and transforming them into information is therefore strongly linked to automatic textual analysis and synthesis, which are greatly related to the ability to master the problems of multilinguality. This paper describes a content enabling system that provides deep semantic search and information access to large quantities of distributed multimedia data for both experts and general public. STALKER provides with a language independent search and dynamic classification features for a broad range of data collected from several sources in a number of culturally diverse languages. Keywords: open source intelligence, focused crawling, natural language processing, morphological analysis, syntactic analysis, functional analysis, supervised clustering, unsupervised clustering. 1 Introduction Open Source Intelligence (OSINT) is an intelligence gathering discipline that involves collecting information from open sources and analyzing it to produce usable intelligence. The specific term “open” refers to publicly available sources, as opposed to classified sources. OSINT includes a wide variety of information and sources. With the Internet, the bulk of predictive intelligence can be obtained from public, unclassified sources. The revolution in information technology is making open sources more accessible, ubiquitous, and valuable, making open intelligence at less cost than ever before. 
In fact, monitors no longer need an expensive infrastructure of antennas to listen to radio, watch television or gather textual data from Internet newspapers and magazines. The availability of a huge amount of data in the open sources information channels leads to the well-identified modern paradox: an overload of information means, most of the time, a no usable knowledge. Besides, open source texts are - and will be - written in various native languages, but these documents are relevant even to non-native E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 35–42, 2009. springerlink.com © Springer-Verlag Berlin Heidelberg 2009 36 F. Neri and M. Pettoni speakers. Independent information sources can balance the limited information normally available, particularly if related to non-cooperative targets. The process of accessing all these raw data, heterogeneous both for type (web pages, crime reports), source (Internet/Intranet, database, etc), protocol (HTTP/HTTPS, FTP, GOPHER, IRC, NNTP, etc) and language used, transforming them into information, is therefore inextricably linked to the concepts of textual analysis and synthesis, hinging greatly on the ability to master the problems of multilinguality. 1.1 State of Art Current-generation information retrieval (IR) systems excel with respect to scale and robustness. However, if it comes to deep analysis and precision, they lack power. Users are limited by keywords search, which is not sufficient if answers to complex problems are sought. This becomes more acute when knowledge and information are needed from diverse linguistic and cultural backgrounds, so that both problems and answers are necessarily more complex. Developments in the IR have mostly been restricted to improvements in link and click analysis or smart query expansion or profiling, rather than focused on a deeper analysis of text and the building of smarter indexes. Traditionally, text and data mining systems can be seen as specialized systems that convert more complex information into a structured database, allowing people to find knowledge rather than information. For some domains, text mining applications are well-advanced, for example in the domains of medicine, military and intelligence, and aeronautics [1], [15]. In addition to domain-specific miners, general technology has been developed to detect Named Entities [2], co-reference relations, geographical data [3], and time points [4]. The field of knowledge acquisition is growing rapidly with many enabling technologies being developed that eventually will approach Natural Language Understanding (NLU). Despite much progress in Natural Language Processing (NLP), the field is still a long way from language understanding. The reason is that full semantic interpretation requires the identification of every individual conceptual component and the semantic roles it play. In addition, understanding requires processing and knowledge that goes beyond parsing and lexical lookup and that is not explicitly conveyed by linguistic elements. First, contextual understanding is needed to deal with the omissions. Ambiguities are a common aspect of human communication. Speakers are cooperative in filling gaps and correcting errors, but automatic systems are not. Second, lexical knowledge does not provide background or world knowledge, which is often required for non-trivial inferences. 
Any automatic system trying to understand a simple sentence will require - among others - accurate capabilities for Named Entity Recognition and Classification (NERC), full Syntactic Parsing, Word Sense Disambiguation (WSD) and Semantic Role Labeling (SRL) [5]. Current baseline information systems are either large-scale, robust but shallow (standard IR systems), or they are small-scale, deep but ad hoc (Semantic-Web ontology-based systems). Furthermore, these systems are maintained by experts in IR, ontologies or language technology and not by the people in the field. Finally, hardly any of the systems is multilingual, let alone cross-lingual, and definitely not cross-cultural. The next table gives a comparison across different state-of-the-art information systems, where we compare ad-hoc Semantic Web solutions, wordnet-based information systems and traditional information retrieval with STALKER [6].

Table 1. Comparison of semantic information systems

Features                           Semantic web   Wordnet-based   Traditional IR   STALKER
Large scale and multiple domains   NO             YES             YES              YES
Deep semantics                     YES            NO              NO               YES
Automatic acquisition/indexing     NO             YES/NO          YES              YES
Multi-lingual                      NO             YES             YES              YES
Cross-lingual                      NO             YES             NO               YES
Data and fact mining               YES            NO              NO               YES

2 The Logical Components

The system is built on the following components:
− a Crawler, an adaptive and selective component that gathers documents from Internet/Intranet sources.
− a Lexical system, which identifies relevant knowledge by detecting semantic relations and facts in the texts.
− a Search engine that enables Functional, Natural Language and Boolean queries.
− a Classification system which classifies search results into clusters and sub-clusters recursively, highlighting meaningful relationships among them.

2.1 The Crawler

In any large company or public administration the goal of aggregating contents from different and heterogeneous sources is really hard to accomplish. Searchbox is a multimedia content gathering and indexing system, whose main goal is managing huge collections of data coming from different and geographically distributed information sources. Searchbox provides very flexible and high-performance dynamic indexing for content retrieval [7], [8], [9]. The gathering activities of Searchbox are not limited to the standard Web, but also operate on other sources such as remote databases by ODBC, Web sources by FTP-Gopher, Usenet news by NNTP, WebDAV and SMB shares, mailboxes by POP3-POP3/S-IMAP-IMAP/S, file systems and other proprietary sources. The Searchbox indexing and retrieval system does not work on the original version of the data, but on the "rendered version". For instance, the features rendered and extracted from a portion of text might be a list of words/lemmas/concepts, while the extraction of features from a bitmap image might be extremely sophisticated. Even more complex sources, like video, might be suitably processed so as to extract a textual-based labeling, which can be based on both the recognition of speech and sounds. All of the extracted and indexed features can be combined in the query language which is available in the user interface. Searchbox provides default plug-ins to extract text from most common types of documents, like HTML, XML, TXT, PDF, PS and DOC. Other formats can be supported using specific plugins.
2.2 The Lexical System

This component is intended to identify relevant knowledge from the whole raw text, by detecting semantic relations and facts in texts. Concept extraction and text mining are applied through a pipeline of linguistic and semantic processors that share a common ground and a knowledge base. The shared knowledge base guarantees a uniform interpretation layer for the diverse information coming from different sources and languages.

Fig. 1. Lexical Analysis

The automatic linguistic analysis of the textual documents is based on Morphological, Syntactic, Functional and Statistical criteria. Recognizing and labeling semantic arguments is a key task for answering Who, When, What, Where, Why questions in all NLP tasks in which some kind of semantic interpretation is needed. At the heart of the lexical system is McCord's theory of Slot Grammar [10]. A slot, explains McCord, is a placeholder for the different parts of a sentence associated with a word. A word may have several slots associated with it, forming a slot frame for the word. In order to identify the most relevant terms in a sentence, the system analyzes it and, for each word, the Slot Grammar parser draws on the word's slot frames to cycle through the possible sentence constructions. Using a series of word relationship tests to establish the context, the system tries to assign the context-appropriate meaning to each word, determining the meaning of the sentence. Each slot structure can be partially or fully instantiated and it can be filled with representations from one or more statements to incrementally build the meaning of a statement. This includes most of the treatment of coordination, which uses a method of 'factoring out' unfilled slots from elliptical coordinated phrases. The parser - a bottom-up chart parser - employs a parse evaluation scheme used for pruning away unlikely analyses during parsing as well as for ranking final analyses. By including semantic information directly in the dependency grammar structures, the system relies on the lexical semantic information combined with functional relations. The detected terms are then extracted and reduced to their Part of Speech (noun, verb, adjective, adverb, etc.) and Functional (Agent, Object, Where, Cause, etc.) tagged base form [12]. Once referred to their synsets inside the domain dictionaries, they are used as document metadata [12], [13], [14]. Each synset denotes a concept that can be referred to by its members. Synsets are interlinked by means of semantic relations, such as the super-subordinate relation, the part-whole relation and several lexical entailment relations.

2.3 The Search Engine

2.3.1 Functional Search
Users can search and navigate by roles, exploring sentences and documents by the functional role played by each concept. Users can navigate on the relations chart by simply clicking on nodes or arcs, expanding them and gaining access to the set of sentences/documents characterized by the selected criterion.

Fig. 2. Functional search and navigation

This can be considered a visual investigative analysis component specifically designed to bring clarity to complex investigations. It automatically enables investigative information to be represented as visual elements that can be easily analyzed and interpreted.
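The Slot Grammar parser and the Synthema lexical resources behind this functional analysis are proprietary, so the fragment below is only a rough, minimal sketch of the idea: it uses the open-source spaCy library (an assumption on our side, not a component of STALKER) to pull approximate agent-action-object triples out of a dependency parse.

```python
# Simplified illustration only: STALKER relies on a proprietary Slot Grammar
# parser; here spaCy's dependency parse stands in for it (assumed installed
# via `pip install spacy` plus `python -m spacy download en_core_web_sm`).
import spacy

nlp = spacy.load("en_core_web_sm")

def functional_triples(text):
    """Return rough (agent, action, object) triples, one per verb."""
    doc = nlp(text)
    triples = []
    for token in doc:
        if token.pos_ != "VERB":
            continue
        agents = [c.text for c in token.children if c.dep_ in ("nsubj", "nsubjpass")]
        objects = [c.text for c in token.children if c.dep_ in ("dobj", "obj", "attr", "dative")]
        for agent in agents or ["?"]:
            for obj in objects or ["?"]:
                triples.append((agent, token.lemma_, obj))
    return triples

print(functional_triples("The suspect transferred the funds to an offshore account."))
# e.g. [('suspect', 'transfer', 'funds')]
```

Such triples are the kind of Agent/Action/Object annotations that the functional search described below lets analysts navigate.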
Functional relationships - Agent, Action, Object, Qualifier, When, Where, How - among human beings and organizations can be searched for and highlighted, and patterns and hidden connections can be instantly revealed to help investigations, promoting efficiency within investigative teams. Should human beings be cited, their photos can be shown by simply clicking on the related icon.

2.3.2 Natural Language Search
Users can search documents by queries in Natural Language, expressed using normal conversational syntax, or by keywords combined with Boolean operators. Reasoning over facts and ontological structures makes it possible to handle diverse and more complex types of questions. Traditional Boolean queries, in fact, while precise, require a strict interpretation that can often exclude information that is relevant to user interests. This is why the system analyzes the query, identifying the most relevant terms it contains and their semantic and functional interpretation. By mapping a query to concepts and relations, very precise matches can be generated, without the loss of scalability and robustness found in regular search engines that rely on string matching and context windows. The search engine returns as results all the documents which contain the query concepts/lemmas in the same functional role as in the query, trying to retrieve all the texts which constitute a real answer to the query.

Fig. 3. Natural language query and its functional and conceptual expansion

Results are then displayed and ranked by relevance, reliability and credibility.

Fig. 4. Search results

2.4 The Clustering System

The automatic classification of results is performed by the TEMIS Insight Discoverer Categorizer and Clusterer, fulfilling both the Supervised and Unsupervised Classification schemas. The application assigns texts to predefined categories and dynamically discovers the groups of documents which share some common traits.

2.4.1 Supervised Clustering
The categorization model was created during the learning phase, on representative sets of training documents focused on news about the Middle East and North Africa, the Balkans, Eastern Europe, International Organizations and ROW (Rest Of the World). The Bayesian method was used as the learning method: the probabilistic classification model was built on around 1,000 documents. The overall performance measures used were Recall (number of categories correctly assigned divided by the total number of categories that should be assigned) and Precision (number of categories correctly assigned divided by the total number of categories assigned): in our tests, they were 75% and 80% respectively.

2.4.2 Unsupervised Clustering
Result documents are represented by a sparse matrix, whose rows and columns are normalized in order to give more weight to rare terms. Each document is turned into a vector comparable with the others. Similarity is measured by a simple cosine calculation between document vectors, whilst clustering is based on the K-Means algorithm. The application provides a visual summary of the clustering analysis. A map shows the different groups of documents as differently sized bubbles and the meaningful correlations among them as lines drawn with different thickness. Users can search inside topics, project clusters on lemmas and their functional links.

Fig. 5.
Thematic map, functional search and projection inside topics 3 Conclusions This paper describes a Multilingual Text Mining platform for Open Source Intelligence, adopted by Joint Intelligence and EW Training Centre (CIFIGE) to train the military and civilian personnel of Italian Defence in the OSINT discipline. Multilanguage Lexical analysis permits to overcome linguistic barriers, allowing the automatic indexation, simple navigation and classification of documents, whatever it might be their language, or the source they are collected from. This approach enables the research, the analysis, the classification of great volumes of heterogeneous documents, helping intelligence analysts to cut through the information labyrinth. References 1. Grishman, R., Sundheim, B.: Message Understanding Conference - 6: A Brief History. In: Proceedings of the 16th International Conference on Computational Linguistics (COLING), I, Kopenhagen, pp. 466–471 (1996) 42 F. Neri and M. Pettoni 2. Hearst, M.: Untangling Text Data Mining. In: ACL 1999. University of Maryland, June 20-26 (1999) 3. Miller, H.J., Han, J.: Geographic Data Mining and Knowledge Discovery. CRC Press, Boca Raton (2001) 4. Wei, L., Keogh, E.: Semi-Supervised Time Series Classification, SIGKDD (2006) 5. Carreras, X., Màrquez, L.: Introduction to the CoNLL-2005 Shared Task: Semantic Role Labeling. In: CoNLL 2005, Ann Arbor, MI, USA (2005) 6. Vossen, P., Neri, F., et al.: KYOTO: A System for Mining, Structuring, and Distributing Knowledge Across Languages and Cultures. In: Proceedings of GWC 2008, The Fourth Global Wordnet Conference, Szeged, Hungary, January 2008, pp. 22–25 (2008) 7. Baldini, N., Bini, M.: Focuseek searchbox for digital content gathering. In: AXMEDIS 2005 - 1st International Conference on Automated Production of Cross Media Content for Multi-channel Distribution, Proceedings Workshop and Industrial, pp. 24–28 (2005) 8. Baldini, N., Gori, M., Maggini, M.: Mumblesearch: Extraction of high quality Web information for SME. In: 2004 IEEE/WIC/ACM International Conference on Web Intelligence (2004) 9. Diligenti, M., Coetzee, F.M., Lawrence, S., Giles, C.L., Gori, M.: Focused Crawling Using Context Graphs. In: Proceedings of 26th International Conference on Very Large Databases, VLDB, September 2000, pp. 10–12 (2000) 10. McCord, M.C.: Slot Grammar: A System for Simpler Construction of Practical Natural Language Grammars Natural Language and Logic 1989, pp. 118–145 (1989) 11. McCord, M.C.: Slot Grammars. American Journal of Computational Linguistics 6(1), 31– 43 (1980) 12. Marinai, E., Raffaelli, R.: The design and architecture of a lexical data base system. In: COLING 1990, Workshop on advanced tools for Natural Language Processing, Helsinki, Finland, August 1990, p. 24 (1990) 13. Cascini, G., Neri, F.: Natural Language Processing for Patents Analysis and Classification. In: ETRIA World Conference, TRIZ Future 2004, Florence, Italy (2004) 14. Neri, F., Raffaelli, R.: Text Mining applied to Multilingual Corpora. In: Sirmakessis, S. (ed.) Knowledge Mining: Proceedings of the NEMIS 2004 Final Conference. Springer, Heidelberg (2004) 15. Baldini, N., Neri, F.: A Multilingual Text Mining based content gathering system for Open Source Intelligence. In: IAEA International Atomic Energy Agency, Symposium on International Safeguards: Addressing Verification Challenges, Wien, Austria, IAEA-CN148/192P, Book of Extended Synopses, October 16-20, 2006, pp. 
368–369 (2006) Computational Intelligence Solutions for Homeland Security Enrico Appiani and Giuseppe Buslacchi Elsag Datamat spa, via Puccini 2, 16154 Genova, Italy {Enrico.Appiani,Giuseppe.Buslacchi}@elsagdatamat.com Abstract. On the basis of consolidated requirements from international Polices, Elsag Datamat has developed an integrated tool suite, supporting all the main Homeland Security activities like operations, emergency response, investigation and intelligence analysis. The last support covers the whole “Intelligence Cycle” along its main phases and integrates a wide variety of automatic and semi-automatic tools, coming from both original company developments and from the market (COTS), in a homogeneous framework. Support to Analysis phase, most challenging and computing-intensive, makes use of Classification techniques, Clustering techniques, Novelty Detection and other sophisticated algorithms. An innovative and promising use of Clustering and Novelty Detection, supporting the analysis of “information behavior”, can be very useful to the analysts in identifying relevant subjects, monitoring their evolution and detecting new information events who may deserve attention in the monitored scenario. 1 Introduction Modern Law Enforcement Agencies experiment challenging and contrasting needs: on one side, the Homeland Security mission has become more and more complex, due to many national and international factors such as stronger crime organization, asymmetric threats, the complexity of civil and economic life, the criticality of infrastructures, and the rapidly growing information environment; on the other side, the absolute value of public security is not translated in large resource availability for such Agencies, which must cope with similar problems as business organizations, namely to conjugate the results with the search of internal efficiency, clarity of roles, careful programming and strategic resource allocation. Strategic programming for security, however, is not a function of business, but rather of the evolution of security threats, whose prevision and prevention capability has a double target: externally to an Agency, improving the coordination and the public image to the citizen, for better enforcing everyone’s cooperation to civil security; internally, improving the communication between corps and departments, the individual motivation through better assignation of roles and missions, and ultimately the efficiency of Law Enforcement operations. Joining good management of resources, operations, prevention and decisions translates into the need of mastering internal and external information with an integrated approach, in which different tools cooperate to a common, efficient and accurate information flow, from internal management to external intelligence, from resource allocation to strategic security decisions. E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 43–52, 2009. springerlink.com © Springer-Verlag Berlin Heidelberg 2009 44 E. Appiani and G. Buslacchi The rest of the paper describes an integrated framework trying to enforce the above principles, called Law Enforcement Agency Framework (LEAF) by Elsag Datamat, whose tools are in use by National Police and Carabinieri, and still under development for both product improvement and new shipments to national and foreign Agencies. Rather than technical details and individual technologies, this work tries to emphasize their flexible integration, in particular for intelligence and decision support. 
Next sections focus on the following topics: LEAF architecture and functions, matching the needs of Law Enforcement Agencies; the support to Intelligence and Investigations, employing a suite of commercial and leading-edge technologies; the role of the Clustering and Semantic technology combination in looking for elements of known or novel scenarios in large, unstructured and noisy document bases; and some conclusions with future perspectives.

2 Needs and Solutions for Integrated Homeland Security Support

Polices of advanced countries have developed or purchased their own IT support to manage their operations and administration. US Polices, facing their multiplicity (more than one Agency for each State), have focused on common standards for Record Management Systems (RMS) [1], in order to exchange and analyze data of Federal importance, such as criminal records. European Polices generally have their own IT support and are improving their international exchange and integration, also thanks to the Commission effort for a common European Security and Defense Policy [3], including common developments in the FP7 Research Program and Europol [2], offering data and services for criminal intelligence. At the opposite end, other Law Enforcement Agencies are building or completely replacing their IT support, motivated by rising security challenges and various internal needs, such as improving their organizations, achieving more accurate border control and fighting international crime trafficking more effectively. Elsag Datamat's LEAF aims at providing an answer both to Polices just improving their IT support and to Polices requiring complete solutions, from base IT infrastructure to top-level decision support. The LEAF architecture includes the following main functions, from low- to high-level information, as illustrated in Fig. 1 (a compact sketch of this layered decomposition is given at the end of this section):
• Infrastructure – IT information equipment, sensors, network, applications and management;
• Administration – Enterprise Resource Planning (ERP) for Agency personnel and other resources;
• Operations – support to Law Enforcement daily activities through recording all relevant events, decisions and documents;
• Emergency – facing and resolving security-compromising facts, with real-time and efficient resource scheduling, with possible escalation to crises;
• Intelligence – support to crime prevention, investigation and security strategy, through a suite of tools to acquire, process, analyze and disseminate useful information.

Fig. 1. LEAF Functional and Layered Architecture

We can now have a closer look at each main function, just to recall which concrete functionalities lie behind our common idea of Law Enforcement.

2.1 Infrastructure

The IT infrastructure of a Police force can host applications of similar complexity to business ones, but with more critical requirements of security, reliability, geographical articulation and communication with many fixed and mobile users. This requires the capability to perform most activities for managing complex distributed IT systems:
• Infrastructure management
• Service management
• Geographic Information System – common geo-processing service for the other applications in the system

2.2 Administration

This has the same requirements of personnel, resource and budget administration as multi-site companies, with stronger needs for supporting mission continuity.
• Enterprise Resource Planning (human resources, materials, vehicles, infrastructures, budget, procurement, etc.)
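As a purely hypothetical illustration of the layered decomposition listed before Fig. 1, the fragment below encodes the five LEAF layers and their example functions as a simple lookup structure; the real framework is a full product suite, and every name used here is an assumption made only for the example.

```python
# Hypothetical sketch only: a compact encoding of the five LEAF layers of
# Fig. 1 and the example functions listed above; not part of the real system.
from enum import Enum

class Layer(Enum):
    INFRASTRUCTURE = 1
    ADMINISTRATION = 2
    OPERATIONS = 3
    EMERGENCY = 4
    INTELLIGENCE = 5

LEAF_FUNCTIONS = {
    Layer.INFRASTRUCTURE: ["IT equipment", "sensors", "network", "applications", "management"],
    Layer.ADMINISTRATION: ["ERP for Agency personnel and other resources"],
    Layer.OPERATIONS: ["recording of relevant events, decisions and documents"],
    Layer.EMERGENCY: ["real-time resource scheduling", "escalation to crises"],
    Layer.INTELLIGENCE: ["acquire", "process", "analyze", "disseminate"],
}

def functions_of(layer: Layer) -> list:
    """Return the example functions associated with a LEAF layer."""
    return LEAF_FUNCTIONS[layer]

print(functions_of(Layer.INTELLIGENCE))
```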
2.3 Operations The Operations Support System is based on RMS implementation, partly inspired to the American standard, recording all actors (such as people, vehicles and other objects), events (such as incidents, accidents, field interviews), activities (such as arrest, booking, wants and warrants) and documents (such as passports and weapon licenses) providing a common information ground to everyday tasks and to the operations of the upper level (emergency and intelligence). In other words, this is the fundamental database of LEAF, whose data and services can be split as follows: • Operational activity (events and actors) • Judicial activity (support to justice) • Administrative activity (support to security administration) 46 E. Appiani and G. Buslacchi 2.4 Emergencies This is the core Police activity for reaction to security-related emergencies, whose severity and implications can be very much different (e.g. from small robberies to large terrorist attacks). An emergency alarm can be triggered in some different ways: by the Police itself during surveillance and patrolling activity; by sensor-triggered automatic alarms; by citizen directly signaling events to Agents; and by citizen calling a security emergency telephone number, such as 112 or 113 in Italy. IT support to organize a proper reaction is crucial, for saving time and choosing the most appropriate means. • Call center and Communications • Emergency and Resource Management 2.5 Intelligence This is the core support to prevention (detecting threats before they are put into action) and investigation (detecting the authors and the precise modalities of committed crimes); besides, it provides statistical and analytical data for understanding the evolution of crimes and takes strategic decisions for next Law Enforcement missions. More accurate description is demanded to the next section. 3 Intelligence and Investigations Nowadays threats are asymmetric, international, aiming to strike more than to win, and often moved by individuals or small groups. Maintenance of Homeland Security requires, much more than before, careful monitoring of every information sources through IT support, with a cooperating and distributed approach, in order to perform the classical Intelligence cycle on two basic tasks: • Pursuing Intelligence targets – performing research on specific military or civil targets, in order to achieve timely and accurate answers, moreover to prevent specific threats before they are realized; • Monitoring threats – listening to Open, Specialized and Private Sources, to capture and isolate security sensitive information possibly revealing new threats, in order to generate alarms and react with mode detailed investigation and careful prevention. In addition to Intelligence tasks, • Investigation relies on the capability to collect relevant information on past events and analyze it in order to discover links and details bringing to the complete situation picture; • Crisis management comes from emergency escalation, but requires further capabilities to simulate the situation evolution, like intelligence, and understand what has just happened, like investigations. 
Computational Intelligence Solutions for Homeland Security 47 The main Intelligence support functions are definable as follows: • Information source Analysis and Monitoring – the core information collection, processing and analysis • Investigation and Intelligence – the core processes • Crisis, Main events and Emergencies Management – the critical reaction to large events • Strategies, Management, Direction and Decisions – understand and forecast the overall picture Some supporting technologies are the same across the supporting functions above. In fact, every function involves a data processing flow, from sources to the final report, which can have similar steps as other flows. This fact shows that understanding the requirements, modeling the operational scenario and achieving proper integration of the right tools, are much more important steps that just acquiring a set of technologies. Another useful viewpoint to focus on the data processing flow is the classical Intelligence cycle, modeled with similar approach by different military (e.g. NATO rules and practices for Open Source Analysis [4] and Allied Joint Intelligence [5]) and civil institutions, whose main phases are: • Management - Coordination and planning, including resource and task management, mission planning (strategy and actions to take for getting the desired information) and analysis strategy (approach to distil and analyze a situation from the collected information). Employs the activity reports to evaluate results and possibly reschedule the plan. • Collection - Gathering signals and raw data from any relevant source, acquiring them in digital format suitable for next processing. • Exploiting - Processing signals and raw data in order to become useful “information pieces” (people, objects, events, locations, text documents, etc.) which can be labeled, used as indexes and put into relation. • Processing - Processing information pieces in order to get their relations and aggregate meaning, transformed and filtered at light of the situation to be analyzed or discovered. This is the most relevant support to human analysis, although in many cases this is strongly based on analyst experience and intuition. • Dissemination - This does not mean diffusion to a large public, but rather aggregating the analysis outcomes in suitable reports which can be read and exploited by decision makers, with precise, useful and timely information. The Intelligence process across phases and input data types can be represented by a pyramidal architecture of contributing tools and technologies, represented as boxes in fig. 2, not exhaustive and not necessarily related to the corresponding data types (below) and Intelligence steps (on the left). The diagram provides a closer look to the functions supporting for the Intelligence phases, some of which are, or are becoming, commercial products supporting in particular Business Intelligence, Open Source analysis and the Semantic Web. An example of industrial subsystem taking part in this vision is called IVAS, namely Integrated Video Archiving System, capable to receive a large number of radio/TV channels (up to 60 in current configurations), digitize them with both Web 48 E. Appiani and G. Buslacchi Fig. 2. 
The LEAF Intelligence architecture for integration of supporting technologies stream and high quality (DVD) data rates, store them in a disk buffer, allow the operators to browse the recorded channels and perform both manual and automatic indexing, synthesize commented emissions or clips, and archive or publish such selected video streams for later retrieval. IVAS manages Collection and Exploiting phases with Audio/Video data, and indirectly supports the later processing. Unattended indexing modules include Face Recognition, Audio Transcription, Tassonomic and Semantic Recognition. IVAS implementations are currently working for the Italian National Command Room of Carabinieri and for the Presidency of Republic. In summary, this section has shown the LEAF component architecture and technologies to support Intelligence and Investigations. The Intelligence tasks look at future scenarios, already known or partially unknown. Support to their analysis thus requires a combination of explicit knowledge-based techniques and inductive, implicit information extraction, as it being studied with the so called Hybrid Artificial Intelligence Systems (HAIS). An example of such combination is shown in the next section. Investigation tasks, instead, aim at reconstruct past scenarios, thus requiring the capability to model them and look for their related information items through text and multimedia mining on selected sources, also involving explicit knowledge processing, ontology and conceptual networks. 4 Inductive Classification for Non-structured Information Open sources are heterogeneous, unstructured, multilingual and often noisy, in the sense of being possibly written with improper syntax, various mistakes, automatic translation, OCR and other conversion techniques. Open sources to be monitored include: press selections, broadcast channels, Web pages (often from variable and short-life sites), blogs, forums, and emails. All them may be acquired in form of Computational Intelligence Solutions for Homeland Security 49 documents of different formats and types, either organized in renewing streams (e.g. forums, emails) or specific static information, at most updated over time. In such a huge and heterogeneous document base, classical indexing and text mining techniques may fail in looking for and isolating relevant content, especially with unknown target scenarios. Inductive technologies can be usefully exploited to characterize and classify a mix of information content and behavior, so as to classify sources without explicit knowledge indexing, acknowledge them based on their style, discover recurring subjects and detect novelties, which may reveal new hidden messages and possible new threats. Inductive clustering of text documents is achieved with source filtering, feature extraction and clustering tools based on Support Vector Machines (SVM) [7], through a list of features depending on both content (most used words, syntax, semantics) and style (such as number of words, average sentence length, first and last words, appearance time or refresh frequency). Documents are clustered according to their vector positions and distances, trying to optimize the cluster number by minimizing a distortion cost function, so as to achieve a good compromise between the compactness (not high number of cluster with a few documents each) and representativeness (common meaning and similarity among the clustered documents) of the obtained clusters. Some clustering parameters can be tuned manually through document subsets. 
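The paper gives no implementation details for this step, and the actual engine is based on proprietary SVM-derived clustering; purely as an illustrative stand-in, the sketch below combines TF-IDF content features with a few style features, clusters them with plain k-means while scanning the number of clusters against a crude distortion penalty, and flags the most distant documents as novelty candidates. The use of scikit-learn, numpy and the specific penalty weight are all assumptions made for the example.

```python
# Illustrative sketch only (not the Elsag Datamat implementation, which uses a
# proprietary SVM-based engine): mixed content/style features, k-means with a
# simple distortion-based choice of k, and distance-based outliers treated as
# novelty candidates.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

def style_features(doc):
    words = doc.split()
    sentences = [s for s in doc.split(".") if s.strip()]
    return [
        len(words),                                         # document length
        np.mean([len(w) for w in words]) if words else 0.0, # average word length
        len(words) / max(len(sentences), 1),                # average sentence length
    ]

def cluster_with_novelty(docs, k_range=range(2, 8), outlier_quantile=0.95):
    content = TfidfVectorizer(max_features=500).fit_transform(docs).toarray()
    style = np.array([style_features(d) for d in docs])
    style = (style - style.mean(axis=0)) / (style.std(axis=0) + 1e-9)
    X = np.hstack([content, style])

    best = None
    for k in k_range:
        if k >= len(docs):
            break
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        # crude distortion penalty discouraging many tiny clusters
        cost = km.inertia_ + 0.05 * k * X.shape[0]
        if best is None or cost < best[0]:
            best = (cost, km)
    if best is None:
        raise ValueError("need more documents than clusters")
    km = best[1]

    # documents far from their cluster centre are reported as novelty candidates
    dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    outliers = np.where(dists > np.quantile(dists, outlier_quantile))[0]
    return km.labels_, outliers
```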
The clustered documents are then partitioned in different folders whose name include the most recurring words, excluding the “stop-words”, namely frequent words with low semantic content, such as prepositions, articles and common verbs. This way we can obtain a pseudo-classification of documents, expressed by the common concepts associated to the resulting keywords of each cluster. The largest experiment has been led upon about 13,000 documents of a press release taken from Italian newspapers in 2006, made digital through scanning and OCR. The document base was much noisy, with many words changed, abbreviated, concatenated with others, or missed; analyzing this sample with classical text processing, if not semantic analysis, would have been problematic indeed, since language technology is very sensitive to syntactical correctness. Instead, with this clustering technique, most of the about 30 clusters obtained had a true common meaning among the composing documents (for instance, criminality of specific types, economy, terrorist attacks, industry news, culture, fiction, etc.), with more specific situations expressed by the resulting keywords. Further, such keywords were almost correct, except for a few clusters grouping so noisy documents that it would have been impossible to find some common sense. In practice, the document noise has been removed when asserting the prevailing sense of the most representative clusters. Content Analysis (CA) and Behavior Analysis (BA) can support each other in different ways. CA applied before BA can add content features to the clustering space. Conversely, BA applied before CA can reduce the number of documents to be processed for content, by isolating relevant groups expressing a certain type of style, source and/or conceptual keyword set. CA further contributes in the scenario interpretation by applying reasoning tools to inspect clustering results at the light of domain-specific knowledge. Analyzing cluster contents may help application-specific ontologies discover unusual patterns in the observed domain. Conversely, novel information highlighted by BA might help dynamic ontologies to update their knowledge in a semi-automated way. Novelty Detection is obtained through the “outliers”, 50 E. Appiani and G. Buslacchi namely documents staying at a certain relative distance from their cluster centers, thus expressing a loose commonality with more central documents. Outliers from a certain source, for instance an Internet forum, can reveal some odd style or content with respect to the other documents. Dealing with Intelligence for Security, we can have two different operational solutions combining CA and BA, respectively supporting Prevention and Investigation [6]. This is still subject of experiments, the major difficulty being to collect relevant documents and some real, or at least realistic, scenario. Prevention-mode operation is illustrated in Fig. 3, and proceeds as follows. 1) Every input document is searched for basic terms, context, and eventually key concepts and relations among these (by using semantic networks and/or ontology), in order to develop an understanding of the overall content of each document and its relevance to the reference scenario. 2a) In the knowledge-base set-up phase, the group of semantically analyzed documents forming a training set, undergoes a clustering process, whose similarity metrics is determined by both linguistic features (such as lexicon, syntax, style, etc.) 
and semantic information (concept similarity derived from the symbolic information tools). 2b) At run-time operation, each new document is matched with the existing clusters; outlier detection, together with a history of the categorization process, highlights possibly interesting elements and subsets of documents bearing novel contents. 3) Since clustering tools are not able to extract mission-specific knowledge from input information, ontology processing interprets the detected trends in the light of possible criminal scenarios. If the available knowledge base cannot explain the extracted information adequately, the component may decide to bring an alert to the analyst's attention and possibly tag the related information for future use.

Fig. 3. Functional CA-BA combined dataflow for prevention mode

Novelty detection is the core activity of this operation mode, and relies on the interaction between BA and CA to define a 'normal-state scenario', to be used for identifying interesting deviations. The combination of inductive clustering and explicit knowledge extraction is promising in helping analysts both to perform a gross classification of large, unknown and noisy document bases and to find promising content in some clusters, hence refining the analysis through explicit knowledge processing. This combination lies between the information exploitation and processing phases of the intelligence cycle, as recalled in the previous section; in fact, it contributes both to isolating meaningful information items and to analyzing the overall results.

Fig. 4. Functional CA-BA combined dataflow for Investigation mode

Investigation-mode operation is illustrated in Fig. 4 and proceeds as follows. 1) A reference criminal scenario is assumed as a basic hypothesis. 2) As in prevention-mode operation, input documents are searched in order to develop an understanding of the overall content of each document and its relevance to the reference scenario. 3) BA (re)groups the existing base of documents by embedding the scenario-specific conceptual similarity in the document- and cluster-distance criterion. The result is a grouping of documents that indirectly takes into account the relevance of the documents to the reference criminal scenario. 4) CA uses high-level knowledge describing the assumed scenario to verify the consistency of BA. The output is a confirmation of the sought-for hypothesis or, more likely, a structural description of the possibly partial match, which provides useful directives for actively searching for missing information, which ultimately serves to validate or disclaim the investigative assumption. Elsag Datamat already uses CA with different tools within the LEAF component for Investigation, Intelligence and Decision Support. The combination with BA is being tested on Italian document sets in order to set up a prototype for Carabinieri (one of the Italian security forces), to be tried in the field between 2009 and 2010. In parallel, multi-language CA-BA solutions are being studied for the needs of international Polices. 52 5 E. Appiani and G.
Buslacchi Conclusions In this position paper we have described an industrial solution for integrated IT support to Law Enforcement Agencies, called LEAF, in line with the state of the art of this domain, including some advanced computational intelligence functions to support Intelligence and Investigations. The need for an integrated Law Enforcement support, realized by the LEAF architecture, has been explained. The key approach for LEAF computational intelligence is modularity and openness to different tools, in order to realize the most suitable processing workflow for any analysis needs. LEAF architecture just organizes such workflow to support, totally or partially, the intelligence cycle phases: acquisition, exploitation, processing and dissemination, all coordinated by management. Acquisition and exploitation involve multimedia data processing, while processing and dissemination work at conceptual object level. An innovative and promising approach to analysis of Open and Private Sources combines Content and Behavior Analysis, this last exploring the application of | clustering techniques to textual documents, usually subject to text and language processing. BA shows the potential to tolerate the high number, heterogeneity and information noise of large Open Sources, creating clusters whose most representative keywords can express an underlying scenario, directly to the analysts’ attention or with the help of knowledge models. References 1. Law Enforcement Record Management Systems (RMS) – Standard functional specifications by Law Enforcement Information Technology Standards Council (LEITSC) (updated 2006), http://www.leitsc.org 2. Europol – mission, mandates, security objectives, http://www.europol.europa.eu 3. European Security and Defense Policy – Wikipedia article, http://en.wikipedia.org/wiki/ European_Security_and_Defence_Policy 4. NATO Open Source Intelligence Handbook (November 2001), http://www.oss.net 5. NATO Allied Joint Intelligence, Counter Intelligence and Security Doctrine. Allied Joint Publication (July 2003) 6. COBASIM Proposal for FP7 – Security Call 1 – Proposal no. 218012 (2007) 7. Jing, L., Ng, M.K., Zhexue Huang, J.: An Entropy Weighting k-Means Algorithm for Subspace Clustering of High-Dimensional Sparse Data. IEEE Transactions on knowledge and data engineering 19(8) (August 2007) Virtual Weapons for Real Wars: Text Mining for National Security Alessandro Zanasi ESRIF-European Security Research and Innovation Forum University of Bologna Professor Temis SA Cofounder a_zanasi@yhaoo.it Abstract. Since the end of the Cold War, the threat of large scale wars has been substituted by new threats: terrorism, organized crime, trafficking, smuggling, proliferation of weapons of mass destruction. The new criminals, especially the so called “jihadist” terrorists are using the new technologies, as those enhanced by Web2.0, to fight their war. Text mining is the most advanced knowledge management technology which allow intelligence analysts to automatically analyze the content of information rich online data banks, suspected web sites, blogs, emails, chat lines, instant messages and all other digital media detecting links between people and organizations, trends of social and economic actions, topics of interest also if they are “sunk” among terabytes of information. Keywords: National security, information sharing, text mining. 
1 Introduction After the 9/11 shock, the world of intelligence is reshaping itself, since that the world is requiring a different intelligence: dispersed, not concentrated; open to several sources; sharing its analysis with a variety of partners, without guarding its secrets tightly; open to strong utilization of new information technologies to take profit of the information (often contradictory) explosion (information density doubles every 24 months and its costs are halved every 18 months [1]); open to the contributions of the best experts, also outside government or corporations [2], e.g. through a publicprivate partnership (PPP or P3: a system in which a government service or private business venture is funded and operated through a partnership of government and one or more private sector companies). The role of competitive intelligence has assumed great importance not only in the corporate world but also in the government one, largely due to the changing nature of national power. Today the success of foreign policy rests to a significant extent on energy production control, industrial and financial power, and energy production control, industrial and financial power in turn are dependent on science and technology and business factors and on the capacity of detecting key players and their actions. New terrorists are typically organized in small, widely dispersed units and coordinate their activities on line, obviating the need for central command. Al Qaeda and E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 53–60, 2009. © Springer-Verlag Berlin Heidelberg 2009 springerlink.com 54 A. Zanasi similar groups rely on the Internet to contact potential recruits and donors, sway public opinion, instruct would-be terrorists, pool tactics and knowledge, and organize attacks. This phenomenon has been called Netwar (a form of conflict marked by the use of network forms of organizations and related doctrines, strategies, and technologies) [3]. In many ways, such groups use Internet in the same way that peaceful political organizations do; what makes terrorists’ activity threatening is their intent. This approach reflects what the world is experiencing in the last ten years: a paradigm shift from an organization-driven threat architecture (i.e., communities and social activities focused around large companies or organizations) to an individual-centric threat architecture (increased choice and availability of opportunities focused around individual wants and desires). This is a home, local community, and virtual-communitycentric societal architecture: the neo-renaissance paradigm. This new lifestyle is due to a growing workforce composed of digitally connected free-agent (e.g. terrorists) able to operate from any location and to be engaged anywhere on the globe [4]. Due to this growing of virtual communities, a strong interest towards the capability of automatic evaluation of communications exchanged inside these communities and of their authors is also growing, directed to profiling activity, to extract authors personal characteristics (useful in investigative actions too). So, to counter netwar terrorists, intelligence must learn how to monitor their network activity, also online, in the same way it keeps tabs on terrorists in the real world. 
Doing so will require a realignment of western intelligence and law enforcement agencies, which lag behind terrorist organizations in adopting information technologies [5] and, at least for NSA and FBI, to upgrade their computers to better coordinate intelligence information [6]. The structure of the Internet allows malicious activity to flourish and the perpetrators to remain anonymous. Since it would be nearly impossible to identify and disable every terrorist news forum on the internet given the substantial legal and technical hurdles involved (there are some 4,500 web sites that disseminate the al Qaeda leadership’s messages [7]), it would make more sense to leave those web sites online but watch them carefully. These sites offer governments’ intelligence unprecedented insight into terrorists’ ideology and motivations. Deciphering these Web sites will require not just Internet savvy but also the ability to read Arabic and understand terrorists’ cultural backgrounds-skills most western counterterrorism agencies currently lack [5]. These are the reasons for which the text mining technologies, which allow the reduction of information overload and complexity, analyzing texts also in unknown, exotic languages (Arabic included: the screenshots in the article, as all the other ones, are Temis courtesy), have become so important in the government as in the corporate intelligence world. For an introduction to text mining technology and its applications to intelligence: [8]. The question to be answered is: once defined the new battle field (i.e. the Web) how to use the available technologies to fight the new criminals and terrorists? A proposed solution is: through an “Internet Center”. That is a physical place where to concentrate advanced information technologies (including Web 2.0, machine translation, crawlers, text mining) and human expertise, not only in information technologies but also in online investigations). Virtual Weapons for Real Wars: Text Mining for National Security 55 Fig. 1. Text mining analysis of Arabic texts We present here the scenarios into which these technologies, especially those regarding text mining are utilized, with some real cases. 2 New Challenges to the Market State The information revolution is the key enabler of economic globalization. The age of information is also the age of emergence of the so called market-state [9] which maximizes the opportunities of its people, facing lethal security challenges which dramatically change the roles of government and of private actors and of intelligence. Currently governments power is being challenged from both above (international commerce, which erodes what used to be thought of as aspects of national sovereignty) and below (terrorist and organized crime challenge the state power from beneath, by trying to compel states to acquiesce or by eluding the control of states). Tackling these new challenges is the role of the new government intelligence. From the end of Cold War there is general agreement about the nature of the threats that posed a challenge to the intelligence community: drugs, organized crime, and proliferation of conventional and unconventional weapons, terrorism, financial crimes. All these threats aiming for violence, not victory, may be tackled through the help of technologies as micro-robots, bio-sniffers, and sticky electronics. 
Information technology, applied to open sources analysis, is a support in intelligence activities directed to prevent these threats [10] and to assure homeland protection [11]. 56 A. Zanasi Since 2001 several public initiatives involving data and text mining appeared in USA and Europe. All of them shared the same conviction: information is the best arm against the asymmetric threats. Fig. 2. The left panel allows us to highlight all the terms which appear in the collected documents and into which we are interested in (eg: Hezbollah). After clicking on the term, in the right column appear the related documents. 3 The Web as a Source of Information Until some years ago the law enforcement officers were interested only in retrieving data coming from web sites and online databanks (collections of information, available online, dedicated to press surveys, patents, scientific articles, specific topics, commercial data). Now they are interested in analyzing the data coming from internet seen as a way of communication: e-mails, chat rooms, forums, newsgroups, blogs (obtained, of course, after being assured that data privacy rules have been safeguarded). Of course, this nearly unending stream of new information, especially regarding communications, also in exotic languages, created not only an opportunity but also a new problem. The information data is too large to be analyzed by human beings and the languages in which this data are written are very unknown to the analysts. Luckily these problems, created by technologies, may be solved thanks to other information technologies. Virtual Weapons for Real Wars: Text Mining for National Security 57 4 Text Mining The basic text mining technique is Information Extraction consisting in linguistic processing, where semantic models are defined according to the user requirements, allowing the user to extract the principal topics of interest to him. These semantic models are contained in specific ontologies, engineering artefacts which contain a specific vocabulary used to describe a certain reality, plus a set of explicit assumptions regarding the intended meaning of the vocabulary words. This technique allows, for example, the extraction of organization and people names, email addresses, bank account, phone and fax numbers as they appear in the data set. For example, once defined a technology or a political group, we can quickly obtain the list of organizations working with that technology or the journalists supporting that opinion or the key players for that political group. 5 Virtual Communities Monitoring A virtual community, whose blog and chats are typical examples, are communities of people sharing and communicating common interests, ideas, and feelings over the Internet or other collaborative networks. The possible inventor of this term was Howard Rheingold, who defines virtual communities as social aggregations that emerge from the Internet when enough people carry on public discussions long enough and with sufficient human feeling to form webs of personal relationships in cyberspace [12]. Most community members need to interact with a single individual in a one-to-one conversation or participate and collaborate in idea development via threaded conversations with multiple people or groups. This type of data is, clearly, an exceptional source to be mined [19]. 6 Accepting the Challenges to National Security 6.1 What We Need It is difficult for government intelligence to counter the threat that terrorists pose. 
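Before listing those needs in detail, a minimal sketch can make the information-extraction step of Sect. 4 concrete: the fragment below uses only regular expressions as a crude stand-in for the ontology-driven linguistic processing described above, and every pattern in it is an illustrative assumption rather than a Temis component.

```python
# Minimal regular-expression sketch of the entity extraction described in
# Sect. 4; the real system uses ontology-driven linguistic processing, so the
# patterns below are purely illustrative assumptions.
import re

PATTERNS = {
    "email": r"[\w.+-]+@[\w-]+\.[\w.]+",
    "phone": r"\+?\d[\d ()/.-]{7,}\d",
    "iban":  r"\b[A-Z]{2}\d{2}[A-Z0-9]{10,30}\b",
    # crude person/organization guess: consecutive capitalized words
    "name":  r"\b(?:[A-Z][a-z]+(?:\s+[A-Z][a-z]+)+)\b",
}

def extract_entities(text):
    """Return a dict mapping entity type to the matches found in `text`."""
    return {label: re.findall(regex, text) for label, regex in PATTERNS.items()}

sample = ("Contact Mario Rossi at mario.rossi@example.org or +39 055 123456; "
          "funds moved to IT60X0542811101000000123456.")
print(extract_entities(sample))
```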
To fight them we need solutions able to detect their names in the communications, to detect their financial movements, to recognize the real authors of anonymous documents, to put in evidence connections inside social structures, to track individuals through collecting as much information about them as possible and using computer algorithms and human analysis to detect potential activity. 6.2 Names and Relationships Detection New terrorist groups rise each week, new terrorists each day. Their names, often written in a different alphabet, are difficult to be caught and checked against other names already present in the databases. Text mining technology allows their detection, also with their connections to other groups or people. 58 A. Zanasi Fig. 3. Extraction of names with detection of connection to suspect terrorist names and the reason of this connection 6.3 Money Laundering Text mining is used in detecting anomalies in the fund transfer request process and in the automatic population of black lists. 6.4 Insider Trading To detect insider trading it is necessary to track the stock trading activity for every publicly traded company, plot it on a time line and compare anomalous peaks to company news: if there is no news to spur a trading peak, that is a suspected insider trading. To perform this analysis it is necessary to extract the necessary elements (names of officers and events, separated by category) from news text and then correlate them with the structured data coming from stock trading activity [13]. 6.5 Defining Anonymous Terrorist Authorship Frequently the only traces available after a terrorist attack are the emails or the communications claiming the act. The analyst must analyze the style, the concepts and feelings [14] expressed in a communication to establish connections and patterns between documents [15], comparing them with documents coming from known Virtual Weapons for Real Wars: Text Mining for National Security 59 authors: famous attackers (Unabomber was the most famous one) were precisely described, before being really detected, using this type of analysis. 6.6 Digital Signatures Human beings are habit beings and have some personal characteristics (more than 1000 “style markers” have been quoted in literature) that are inclined to persist, useful in profiling the authors of the texts. 6.7 Lobby Detection Analyzing connections, similarities and general patterns in public declarations and/or statements of different people allows the recognition of unexpected associations («lobbies») of authors (as journalists, interest groups, newspapers, media groups, politicians) detecting whom, among them, is practically forming an alliance. 6.8 Monitoring of Specific Areas/Sectors In business there are several examples of successful solutions applied to competitive intelligence. E.g. Unilever, text mining patents discovered that a competitor was planning new activities in Brazil which really took place a year later [16]. Telecom Italia, discovered that a competitor (NEC-Nippon Electric Company) was going to launch new services in multimedia [16]. Total (F), mines Factiva and Lexis-Nexis databases to detect geopolitical and technical information. 6.9 Chat Lines, Blogs and Other Open Sources Analysis The first enemy of intelligence activity is the “avalanche” of information that daily the analysts must retrieve, read, filter and summarize. 
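Returning briefly to the authorship analysis of Sects. 6.5 and 6.6, a few of the "style markers" mentioned there can be computed with very little code. The toy fragment below profiles texts by simple markers (sentence length, word length, function-word rates) and compares an anonymous message against known-author samples by cosine similarity; it is an illustrative sketch under these assumptions, not a forensic tool, and the small marker set is far from the more than 1000 markers cited in the literature.

```python
# Toy stylometric comparison in the spirit of Sects. 6.5-6.6: a handful of
# simple style markers and a cosine comparison of an anonymous text against
# known-author profiles. Purely illustrative.
import math
import re

FUNCTION_WORDS = ["the", "of", "and", "to", "in", "that", "is", "for", "with", "as"]

def style_profile(text):
    words = re.findall(r"[A-Za-z']+", text.lower())
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n = max(len(words), 1)
    profile = [
        len(words) / max(len(sentences), 1),   # average sentence length
        sum(len(w) for w in words) / n,        # average word length
        len(set(words)) / n,                   # type/token ratio
    ]
    profile += [words.count(fw) / n for fw in FUNCTION_WORDS]
    return profile

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def most_similar_author(anonymous_text, known_texts):
    """known_texts: dict author -> sample text. Returns (author, similarity)."""
    anon = style_profile(anonymous_text)
    scores = {a: cosine(anon, style_profile(t)) for a, t in known_texts.items()}
    return max(scores.items(), key=lambda kv: kv[1])
```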
The Al Qaeda terrorists declared to interact among them through chat lines to avoid being intercepted [17]: interception and analysis of chat lines content is anyway possible and frequently done in commercial situations [18], [19]. Using different text mining techniques it is possible to identify the context of the communication and the relationships among documents detecting the references to the interesting topics, how they are treated and what impression they create in the reader [20]. 6.10 Social Network Links Detection “Social structure” has long been an important concept in sociology. Network analysis is a recent set of methods for the systematic study of social structure and offers a new standpoint from which to judge social structures [21]. Text mining is giving an important help in detection of social network hidden inside large volumes of text also detecting the simultaneous appearance of entities (names, events and concepts) measuring their distance (proximity). 60 A. Zanasi References 1. Lisse, W.: The Economics of Information and the Internet. Competitive Intelligence Review 9(4) (1998) 2. Treverton, G.F.: Reshaping National Intelligence in an Age of Information. Cambridge University Press, Cambridge (2001) 3. Ronfeldt, D., Arquilla, J.: The Advent of Netwar –Rand Corporation (1996) 4. Goldfinger, C.: Travail et hors Travail: vers une societe fluide. In: Jacob, O. (ed.) (1998) 5. Kohlmann, E.: The Real Online Terrorist Threat – Foreign Affairs (September/October 2006) 6. Mueller, J.: Is There Still a Terrorist Threat? – Foreign Affairs (September/ October 2006) 7. Riedel, B.: Al Qaeda Strikes Back – Foreign Affairs (May/June 2007) 8. Zanasi, A. (ed.): Text Mining and its Applications to Intelligence, CRM and Knowledge Management. WIT Press, Southampton (2007) 9. Bobbitt, P.: The Shield of Achilles: War, Peace, and the Course of History, Knopf (2002) 10. Zanasi, A.: New forms of war, new forms of Intelligence: Text Mining. In: ITNS Conference, Riyadh (2007) 11. Steinberg, J.: In Protecting the Homeland 2006/2007 - The Brookings Institution (2006) 12. Rheingold, H.: The Virtual Community. MIT Press, Cambridge (2000) 13. Feldman, S.: Insider Trading and More, IDC Report, Doc#28651 (December 2002) 14. de Laat, M.: Network and content analysis in an online community discourse. University of Nijmegen (2002) 15. Benedetti, A.: Il linguaggio delle nuove Brigate Rosse, Erga Edizioni (2002) 16. Zanasi, A.: Competitive Intelligence Thru Data Mining Public Sources - Competitive Intelligence Review, vol. 9(1). John Wiley & Sons, Inc., Chichester (1998) 17. The Other War, The Economist March 26 (2003) 18. Campbell, D.: - World under Watch, Interception Capabilities in the 21st Century – ZDNet.co (2001) (updated version of Interception Capabilities 2000, A report to European Parlement - 1999) 19. Zanasi, A.: Email, chatlines, newsgroups: a continuous opinion surveys source thanks to text mining. In: Excellence in Int’l Research 2003 - ESOMAR (Nl) (2003) 20. Jones, C.W.: Online Impression Management. University of California paper (July 2005) 21. Degenne, A., Forse, M.: Introducing Social Networks. Sage Publications, London (1999) Hypermetric k-Means Clustering for Content-Based Document Management Sergio Decherchi, Paolo Gastaldo, Judith Redi, and Rodolfo Zunino Dept. Biophysical and Electronic Engineering, University of Genoa, 16145 Genova, Italy {sergio.decherchi,paolo.gastaldo,judith.redi, rodolfo.zunino}@unige.it Abstract. 
Text-mining methods have become a key feature for homeland-security technologies, as they can help explore effectively increasing masses of digital documents in the search for relevant information. This research presents a model for document clustering that arranges unstructured documents into content-based homogeneous groups. The overall paradigm is hybrid because it combines pattern-recognition grouping algorithms with semantic-driven processing. First, a semantic-based metric measures distances between documents, by combining a content-based with a behavioral analysis; the metric considers both lexical properties and the structure and styles that characterize the processed documents. Secondly, the model relies on a Radial Basis Function (RBF) kernel-based mapping for clustering. As a result, the major novelty aspect of the proposed approach is to exploit the implicit mapping of RBF kernel functions to tackle the crucial task of normalizing similarities while embedding semantic information in the whole mechanism. Keywords: document clustering, homeland security, kernel k-means. 1 Introduction The automated surveillance of information sources is of strategic importance to effective homeland security [1], [2]. The increased availability of data-intensive heterogeneous sources provides a valuable asset for the intelligence task; data-mining methods have therefore become a key feature for security-related technologies [2], [3] as they can help explore effectively increasing masses of digital data in the search for relevant information. Text mining techniques provide a powerful tool to deal with large amounts of unstructured text data [4], [5] that are gathered from any multimedia source (e.g. from Optical Character Recognition, from audio via speech transcription, from webcrawling agents, etc.). The general area of text-mining methods comprises various approaches [5]: detection/tracking tools continuously monitor specific topics over time; document classifiers label individual files and build up models for possible subjects of interest; clustering tools process documents for detecting relevant relations among those subjects. As a result, text mining can profitably support intelligence and security activities in identifying, tracking, extracting, classifying and discovering patterns, so that the outcomes can generate alerts notifications accordingly [6] ,[7]. E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 61–68, 2009. © Springer-Verlag Berlin Heidelberg 2009 springerlink.com 62 S. Decherchi et al. This work addresses document clustering and presents a dynamic, adaptive clustering model to arrange unstructured documents into content-based homogeneous groups. The framework implements a hybrid paradigm, which combines a contentdriven similarity processing with pattern-recognition grouping algorithms. Distances between documents are worked out by a semantic-based hypermetric: the specific approach integrates a content-based with a user-behavioral analysis, as it takes into account both lexical and style-related features of the documents at hand. The core clustering strategy exploits a kernel-based version of the conventional k-means algorithm [8]; the present implementation relies on a Radial Basis Function (RBF) kernelbased mapping [9]. 
The advantage of using such a kernel is that it supports normalization implicitly; normalization is a critical issue in most text-mining applications, since it prevents extensive properties of documents (such as length or lexicon) from distorting the representation and affecting performance. A standard benchmark for content-based document management, the Reuters database [10], provided the experimental domain for the proposed methodology. The research shows that the document clustering framework based on kernel k-means can generate consistent structures for information access and retrieval. 2 Document Clustering Text mining can effectively support the strategic surveillance of information sources through automatic means, which is of paramount importance to homeland security [6], [7]. For prevention, text mining techniques can help identify novel "information trends" revealing new scenarios and threats to be monitored; for investigation, these technologies can help distil relevant information about known scenarios. Within the text mining framework, this work addresses document clustering, which is one of the most effective techniques to organize documents in an unsupervised manner. When applied to text mining, clustering algorithms are designed to discover groups in the set of documents such that the documents within a group are more similar to one another than to documents of other groups. As opposed to text categorization [5], in which categories are predefined and are part of the input information to the learning procedure, document clustering follows an unsupervised paradigm and partitions a set of documents into several subsets. Thus, the document clustering problem can be defined as follows. One should first define a set of documents D = {D1, . . . , Dn}, a similarity measure (or distance metric), and a partitioning criterion, which is usually implemented by a cost function. In the case of flat clustering, one sets the desired number of clusters, Z, and the goal is to compute a membership function φ : D → {1, . . . , Z} such that φ minimizes the partitioning cost with respect to the similarities among documents. Conversely, hierarchical clustering does not need to define the cardinality, Z, and applies a series of nested partitioning tasks which eventually yield a hierarchy of clusters. Indeed, every text mining framework should always be supported by an information extraction (IE) model [11], [12], which is designed to pre-process digital text documents and to organize the information according to a given structure that can be directly interpreted by a machine learning system. Thus, a document D is eventually reduced to a sequence of terms and is represented as a vector, which lies in a space spanned by the dictionary (or vocabulary) T = {tj; j = 1, ..., nT}. The dictionary collects all terms used to represent any document D, and can be assembled empirically by gathering the terms that occur at least once in a document collection D; by this representation one loses the original relative ordering of terms within each document. Different models [11], [12] can be used to retrieve index terms and to generate the vector that represents a document D. However, the vector space model [13] is the most widely used method for document clustering. Given a collection of documents D, the vector space model represents each document D as a vector of real-valued term weights v = {wj; j = 1, ..., nT}.
Each component of the nT-dimensional vector is a non-negative term weight, wj, that characterizes the j-th term and denotes the relevance of the term itself within the document D. 3 Hypermetric k-Means Clustering The hybrid approach described in this Section combines the specific advantages of content-driven processing with the effectiveness of an established pattern-recognition grouping algorithm. Document similarity is defined by a content-based distance, which combines a classical distribution-based measure with a behavioral analysis of the style features of the compared documents. The core engine relies on a kernel-based version of the classical k-means partitioning algorithm [8] and groups similar documents by a top-down hierarchical process. In the kernel-based approach, every document is mapped into an infinite-dimensional Hilbert space, where only inner products among elements are meaningful and computable. In the present case the kernel-based version of k-means [15] provides a major advantage over the standard k-means formulation. In the following, D = {Du; u = 1, ..., nD} will denote the corpus, holding the collection of documents to be clustered. The set T = {tj; j = 1, ..., nT} will denote the vocabulary, which is the collection of terms that occur at least once in D after the preprocessing steps applied to each document D ∈ D (e.g., stop-words removal, stemming [11]). 3.1 Document Distance Measure A novel aspect of the method described here is the use of a document distance that takes into account both a conventional content-based similarity metric and a behavioral similarity criterion. The latter term aims to improve the overall performance of the clustering framework by including the structure and style of the documents in the process of similarity evaluation. To support the proposed document distance measure, a document D is here represented by a pair of vectors, v′ and v′′. Vector v′(D) addresses the content description of a document D; it can be viewed as the conventional nT-dimensional vector that associates each term t ∈ T with the normalized frequency, tf, of that term in the document D. Therefore, the k-th element of the vector v′(Du) is defined as:

v'_{k,u} = \frac{tf_{k,u}}{\sum_{l=1}^{n_T} tf_{l,u}} ,   (1)

where tfk,u is the frequency of the k-th term in document Du. Thus v′ represents a document by a classical vector model, and uses term frequencies to set the weights associated with each element. From a different perspective, the structural properties of a document, D, are represented by a set of probability distributions associated with the terms in the vocabulary. Each term t ∈ T that occurs in Du is associated with a distribution function that gives the spatial probability density function (pdf) of t in Du. Such a distribution, pt,u(s), is generated under the hypothesis that, when detecting the k-th occurrence of a term t at the normalized position sk ∈ [0,1] in the text, the spatial pdf of the term can be approximated by a Gaussian distribution centered around sk. In other words, if the term tj is found at position sk within a document, another document with a similar structure is expected to include the same term at the same position or in a neighborhood thereof, with a probability defined by a Gaussian pdf.
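A minimal Python sketch of this two-fold representation may help fix ideas: it builds the normalized term-frequency vector of Eq. (1) and the discretized positional histograms that approximate the spatial pdf just described (the section-based discrete approximation is formalized right after Eq. (2) below). The naive whitespace tokenization and the fixed number of sections are simplifying assumptions of this sketch, not part of the authors' pipeline.

from collections import Counter

def tf_vector(tokens, vocabulary):
    """Normalized term-frequency vector v' of Eq. (1) over a fixed vocabulary."""
    counts = Counter(tokens)
    total = sum(counts[t] for t in vocabulary) or 1
    return [counts[t] / total for t in vocabulary]

def positional_histograms(tokens, vocabulary, n_sections=8):
    """Discrete stand-in for the spatial pdf: for each term, the fraction of its
    occurrences falling in each of the S equal sections of the document."""
    hists = {t: [0.0] * n_sections for t in vocabulary}
    n = len(tokens) or 1
    for pos, tok in enumerate(tokens):
        if tok in hists:
            section = min(int(n_sections * pos / n), n_sections - 1)
            hists[tok][section] += 1.0
    for t, h in hists.items():
        s = sum(h)
        if s > 0:
            hists[t] = [x / s for x in h]
    return hists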
To derive a formal expression of the pdf, assume that the u-th document, Du, holds nO occurrences of terms after simplifications; if a term occurs more than once, each occurrence is counted individually when computing nO, which can be viewed as a measure of the length of the document. The spatial pdf can be defined as:

p_{t,u}(s) = \frac{1}{A} \sum_{k=1}^{n_O} G(s_k, \lambda) = \frac{1}{A} \sum_{k=1}^{n_O} \frac{1}{\sqrt{2\pi}\,\lambda} \exp\!\left[ -\frac{(s - s_k)^2}{\lambda^2} \right] ,   (2)

where A is a normalization term and λ is a regularization parameter. In practice one uses a discrete approximation of (2). First, the document D is segmented evenly into S sections. Then, an S-dimensional vector is generated for each term t ∈ T, and each element estimates the probability that the term t occurs in the corresponding section of the document. As a result, v′′(D) is an array of nT vectors having dimension S. Vector v′ and vector v′′ support the computations of the frequency-based distance, ∆(f), and the behavioral distance, ∆(b), respectively. The former term is usually measured according to a standard Minkowski distance, hence the content distance between a pair of documents (Du, Dv) is defined by:

\Delta^{(f)}(D_u, D_v) = \left[ \sum_{k=1}^{n_T} \left| v'_{k,u} - v'_{k,v} \right|^{p} \right]^{1/p} .   (3)

The present approach adopts the value p = 1 and therefore actually implements a Manhattan distance metric. The term computing the behavioral distance, ∆(b), applies a Euclidean metric to compute the distance between the probability vectors v′′. Thus:

\Delta^{(b)}(D_u, D_v) = \sum_{k=1}^{n_T} \Delta^{(b)}_{t_k}(D_u, D_v) = \sum_{k=1}^{n_T} \sum_{s=1}^{S} \left[ v''_{(k)s,u} - v''_{(k)s,v} \right]^2 .   (4)

Both terms (3) and (4) contribute to the computation of the eventual distance value, ∆(Du, Dv), which is defined as follows:

\Delta(D_u, D_v) = \alpha\,\Delta^{(f)}(D_u, D_v) + (1 - \alpha)\,\Delta^{(b)}(D_u, D_v) ,   (5)

where the mixing coefficient α ∈ [0,1] weights the relative contribution of ∆(f) and ∆(b). It is worth noting that the distance expression (5) obeys the basic properties of non-negativity and symmetry that characterize general metrics, but does not necessarily satisfy the triangular property. 3.2 Kernel k-Means The conventional k-means paradigm supports an unsupervised grouping process [8], which partitions the set of samples, D = {Du; u = 1, ..., nD}, into a set of Z clusters, Cj (j = 1, ..., Z). In practice, one defines a "membership vector," which indexes the partitioning of input patterns over the Z clusters as: mu = j ⇔ Du ∈ Cj, otherwise mu = 0; u = 1, ..., nD. It is also useful to define a "membership function" δuj(Du, Cj) that encodes the membership of the u-th document in the j-th cluster: δuj = 1 if mu = j, and 0 otherwise. Hence, the number of members of a cluster is expressed as

N_j = \sum_{u=1}^{n_D} \delta_{uj} ; \quad j = 1, \dots, Z ;   (6)

and the cluster centroid is given by:

\mathbf{w}_j = \frac{1}{N_j} \sum_{u=1}^{n_D} \mathbf{x}_u\,\delta_{uj} ; \quad j = 1, \dots, Z ;   (7)

where xu is any vector-based representation of document Du. The kernel-based version of the algorithm is based on the assumption that a function, Φ, can map any element, D, into a corresponding position, Φ(D), in a possibly infinite-dimensional Hilbert space. The mapping function defines the actual 'kernel', which is formulated as the expression to compute the inner product:

K(D_u, D_v) \stackrel{\mathrm{def}}{=} K_{uv} = \Phi(D_u) \cdot \Phi(D_v) = \Phi_u \cdot \Phi_v .   (8)

In our particular case we employ the widely used RBF kernel

K(D_u, D_v) = \exp\!\left[ -\frac{\Delta(D_u, D_v)}{\sigma^2} \right] .   (9)
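Assuming document representations of the kind sketched above, the combined distance of Eqs. (3)-(5) and the RBF kernel of Eq. (9) can be prototyped in a few lines of Python; the mixing coefficient alpha and the kernel width sigma are left as free parameters, and the plain discrete histograms stand in for the smoothed pdf of Eq. (2).

import math

def content_distance(v_u, v_v):
    """Delta^(f) of Eq. (3) with p = 1, i.e. the Manhattan distance between v' vectors."""
    return sum(abs(a - b) for a, b in zip(v_u, v_v))

def behavioral_distance(hists_u, hists_v, vocabulary):
    """Delta^(b) of Eq. (4): summed squared differences of the positional histograms."""
    return sum((a - b) ** 2
               for t in vocabulary
               for a, b in zip(hists_u[t], hists_v[t]))

def combined_distance(v_u, v_v, hists_u, hists_v, vocabulary, alpha=0.5):
    """Delta of Eq. (5): convex combination of the content and behavioral terms."""
    return (alpha * content_distance(v_u, v_v)
            + (1.0 - alpha) * behavioral_distance(hists_u, hists_v, vocabulary))

def rbf_kernel(d_uv, sigma=1.0):
    """RBF kernel of Eq. (9), applied to a precomputed document distance."""
    return math.exp(-d_uv / sigma ** 2)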
It is worth stressing here an additional, crucial advantage of using a kernel-based formulation in the text-mining context: the approach (9) effectively supports the critical normalization process by confining all inner products to a limited range, thereby preventing extensive properties of documents (length, lexicon, etc.) from distorting the representation and ultimately affecting clustering performance. The kernel-based version of the k-means algorithm, according to the method proposed in [15], replicates the basic partitioning schema (6)-(7) in the Hilbert space, where the centroid positions, Ψ, are given by the averages of the mapping images, Φu:

\Psi_j = \frac{1}{N_j} \sum_{u=1}^{n_D} \Phi_u\,\delta_{uj} ; \quad j = 1, \dots, Z .   (10)

The ultimate result of the clustering process is the membership vector, m, which determines the prototype positions (10) even though they cannot be stated explicitly. As a consequence, for a document, Du, the distance in the Hilbert space from the mapped image, Φu, to the cluster prototype Ψj as per (10) can be worked out as:

d(\Phi_u, \Psi_j) = \left\| \Phi_u - \frac{1}{N_j} \sum_{v=1}^{n_D} \Phi_v\,\delta_{vj} \right\|^2 = 1 + \frac{1}{(N_j)^2} \sum_{m,v=1}^{n_D} \delta_{mj}\,\delta_{vj}\,K_{mv} - \frac{2}{N_j} \sum_{v=1}^{n_D} \delta_{vj}\,K_{u,v} .   (11)

By using expression (11), which includes only kernel computations, one can identify the closest prototype to the image of each input pattern, and assign sample memberships accordingly. In clustering domains, kernel k-means can notably help separate groups and discover clusters that would have been difficult to identify in the base space. From this viewpoint one might even conclude that a kernel-based method might represent a viable approach to tackle the dimensionality issue. 4 Experimental Results A standard benchmark for content-based document management, the Reuters database [10], provided the experimental domain for the proposed framework. The database includes 21,578 documents, which appeared on the Reuters newswire in 1987. One or more topics derived from economic subject categories have been associated by human indexing with each document; eventually, 135 different topics were used. In this work, the experimental session involved a corpus DR including 8267 documents out of the 21,578 originally provided by the database. The corpus DR was obtained by adopting the criterion used in [14]. First, all the documents with multiple topics were discarded. Then, only the documents associated with topics having at least 18 occurrences were included in DR. As a result, 32 topics were represented in the corpus. In the following experiments, the performance of the clustering framework has been evaluated by using the purity parameter. Let Nk denote the number of elements lying in a cluster Ck and let Nmk be the number of elements of the class Im in the cluster Ck. Then, the purity pur(k) of the cluster Ck is defined as follows:

pur(k) = \frac{1}{N_k} \max_{m} \left( N_{mk} \right) .   (12)

Accordingly, the overall purity of the clustering results is defined as follows:

purity = \sum_{k} \frac{N_k}{N} \, pur(k) ,   (13)

where N is the total number of elements. The purity parameter has been preferred to other measures of performance (e.g. the F-measure) since it is the most accepted measure for machine learning classification problems [11].
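For readers who want to experiment with the scheme, the following NumPy sketch implements one assignment pass of kernel k-means according to Eq. (11), together with the purity measures (12)-(13); it assumes a precomputed kernel matrix K obtained from Eq. (9), uses 0-based cluster labels, and leaves out the initialization and stopping criteria of the actual implementation.

import numpy as np

def kernel_kmeans_assign(K, membership, n_clusters):
    """One assignment pass of kernel k-means: for every document, evaluate the
    Hilbert-space distance of Eq. (11) to each cluster and pick the closest one.
    `membership` is an integer array of current cluster labels (0-based)."""
    n = K.shape[0]
    delta = np.zeros((n, n_clusters))
    delta[np.arange(n), membership] = 1.0          # membership indicators delta_uj
    Nj = delta.sum(axis=0)
    Nj[Nj == 0] = 1.0                              # guard against empty clusters
    quad = np.einsum('mj,vj,mv->j', delta, delta, K) / Nj ** 2
    cross = (K @ delta) / Nj
    dist = 1.0 + quad[None, :] - 2.0 * cross       # Eq. (11), assuming K_uu = 1
    return dist.argmin(axis=1)

def purity(labels_true, labels_pred):
    """Overall purity of a partition, Eqs. (12)-(13)."""
    total = len(labels_true)
    score = 0
    for k in set(labels_pred):
        members = [t for t, p in zip(labels_true, labels_pred) if p == k]
        score += max(members.count(c) for c in set(members))
    return score / total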
The clustering performance of the proposed methodology was evaluated by analyzing the results obtained in three different experiments: the documents in the corpus DR were partitioned by using a flat clustering paradigm and three different settings for the parameter α, which, as per (5), weights the relative contribution of ∆(f) and ∆(b) in the document distance measure. The values used in the experiments were α = 0.3, α = 0.7 and α = 0.5; thus, two of the experiments were characterized by a strong preponderance of one of the two components, while in the third experiment ∆(f) and ∆(b) contribute evenly to the eventual distance measure. Table 1 outlines the results obtained with the setting α = 0.3. The evaluations were conducted with different numbers of clusters Z, ranging from 20 to 100. For each experiment, four quality parameters are presented:
• the overall purity, purityOV, of the clustering result;
• the lowest purity value pur(k) over the Z clusters;
• the highest purity value pur(k) over the Z clusters;
• the number of elements (i.e. documents) associated with the smallest cluster.
Analogously, Tables 2 and 3 report the results obtained with α = 0.5 and α = 0.7, respectively.

Table 1. Clustering performances obtained on Reuters-21578 with α = 0.3
Number of clusters:  20        40        60        80        100
Overall purity:      0.712108  0.77138   0.81154   0.799685  0.82666
pur(k) minimum:      0.252049  0.236264  0.175     0.181818  0.153846
pur(k) maximum:      1         1         1         1         1
Smallest cluster:    109       59        13        2         1

Table 2. Clustering performances obtained on Reuters-21578 with α = 0.5
Number of clusters:  20        40        60        80        100
Overall purity:      0.696383  0.782267  0.809121  0.817467  0.817467
pur(k) minimum:      0.148148  0.222467  0.181818  0.158333  0.139241
pur(k) maximum:      1         1         1         1         1
Smallest cluster:    59        4         1         1         2

Table 3. Clustering performances obtained on Reuters-21578 with α = 0.7
Number of clusters:  20        40        60        80        100
Overall purity:      0.690577  0.742833  0.798718  0.809483  0.802589
pur(k) minimum:      0.145719  0.172638  0.18      0.189655  0.141732
pur(k) maximum:      1         1         1         1         1
Smallest cluster:    13        6         5         2         4

As expected, the numerical figures show that, in general, the overall purity grows as the number of clusters Z increases. Indeed, the value of the overall purity seems to indicate that clustering performance improves with the setting α = 0.3. Hence, the empirical outcomes confirm the effectiveness of the proposed document distance measure, which combines the conventional content-based similarity with the behavioral similarity criterion. References 1. Chen, H., Chung, W., Xu, J.J., Wang, G., Qin, Y., Chau, M.: Crime data mining: a general framework and some examples. IEEE Trans. Computer 37, 50–56 (2004) 2. Seifert, J.W.: Data Mining and Homeland Security: An Overview. CRS Report RL31798 (2007), http://www.epic.org/privacy/fusion/crs-dataminingrpt.pdf 3. Mena, J.: Investigative Data Mining for Security and Criminal Detection. Butterworth-Heinemann (2003) 4. Sullivan, D.: Document warehousing and text mining. John Wiley and Sons, Chichester (2001) 5. Fan, W., Wallace, L., Rich, S., Zhang, Z.: Tapping the power of text mining. Comm. of the ACM 49, 76–82 (2006) 6. Popp, R., Armour, T., Senator, T., Numrych, K.: Countering terrorism through information technology. Comm. of the ACM 47, 36–43 (2004) 7. Zanasi, A. (ed.): Text Mining and its Applications to Intelligence, CRM and KM, 2nd edn. WIT Press (2007) 8. Linde, Y., Buzo, A., Gray, R.M.: An algorithm for vector quantizer design. IEEE Trans. Commun.
COM-28, 84–95 (1980) 9. Shawe-Taylor, J., Cristianini, N.: Kernel Methods for Pattern Analysis. Cambridge University Press, Cambridge (2004) 10. Reuters-21578 Text Categorization Collection. UCI KDD Archive 11. Manning, C.D., Raghavan, P., Schütze, H.: Introduction to Information Retrieval. Cambridge University Press, Cambridge (2008) 12. Baeza-Yates, R., Ribiero-Neto, B.: Modern Information Retrieval. ACM Press, New York (1999) 13. Salton, G., Wong, A., Yang, L.S.: A vector space model for information retrieval. Journal Amer. Soc. Inform. Sci. 18, 613–620 (1975) 14. Cai, D., He, X., Han, J.: Document Clustering Using Locality Preserving Indexing. IEEE Transaction on knowledge and data engineering 17, 1624–1637 (2005) 15. Girolami, M.: Mercer kernel based clustering in feature space. IEEE Trans. Neural Networks. 13, 2780–2784 (2002) Security Issues in Drinking Water Distribution Networks Demetrios G. Eliades and Marios M. Polycarpou* KIOS Research Center for Intelligent Systems and Networks Dept. of Electrical and Computer Engineering University of Cyprus, CY-1678 Nicosia, Cyprus {eldemet,mpolycar}@ucy.ac.cy Abstract. This paper formulates the security problem of sensor placement in water distribution networks for contaminant detection. An initial attempt to develop a problem formulation is presented, suitable for mathematical analysis and design. Multiple risk-related objectives are minimized in order to compute the Pareto front of a set of possible solutions; the considered objectives are the contamination impact average, worst-case and worst-cases average. A multiobjective optimization methodology suitable for considering more that one objective function is examined and solved using a multiple-objective evolutionary algorithm. Keywords: contamination, water distribution, sensor placement, multi-objective optimization, security of water systems. 1 Introduction A drinking water distribution network is the infrastructure which facilitates delivery of water to consumers. It is comprised of pipes which are connected to other pipes at junctions or connected to tanks and reservoirs. Junctions represent points in the network where pipes are connected, with inflows and outflows. Each junction is assumed to serve a number of consumers whose aggregated water demands are the junction’s demand outflow. Reservoirs (such as lakes, rivers etc.) are assumed to have infinite water capacity which they outflow to the distribution network. Tanks are dynamic elements with finite capacity that fill, store and return water back to the network. Valves are usually installed to some of the pipes in order to adjust flow, pressure, or to close part of the network if necessary. Water quality monitoring in distribution networks involves manual sampling or placing sensors at various locations to determine the chemical concentrations of various species such as disinfectants (e.g. chlorine) or for various contaminants that can be harmful to the consumers. Distribution networks are susceptible to intrusions due to their open and uncontrolled nature. Accidental faults or intentional actions could cause a contamination, that may affect significantly the health and economic activities of a city. Contaminants are substances, usually chemical, biological or radioactive, which travel along the water flow, and may exhibit decay or growth dynamics. 
The concentration dynamics of a substance in a water pipe can be modelled by the first-order hyperbolic * This work is partially supported by the Research Promotion Foundation (Cyprus) and the University of Cyprus. E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 69–76, 2009. springerlink.com © Springer-Verlag Berlin Heidelberg 2009 70 D.G. Eliades and M.M. Polycarpou equations of advection and reaction [1]. When a contaminant reaches a water consumer node, it can expose some of the population served at risk, or cause economic losses. The issue of modelling dangerous contaminant transport in water distribution networks was examined in [2], where the authors discretized the equations of contaminant transport and simulated a network under contamination. Currently, in water research an open-source hydraulic and quality numerical solver, called EPANET, is frequently used for computing the advection and reaction dynamics in discrete-time [3]. The security problem of contaminant detection in water distribution networks was first examined in [4]. The algorithmic “Battle of the Water Sensor Networks” competition in 2006 boosted research on the problem and established some benchmarks [5]. While previous research focused on specific cases of the water security problem, there has not been a unified problem formulation. In this work, we present an initial attempt to develop such a problem formulation, suitable for mathematical analysis and design. In previous research the main solution approach has been the formulation of an integer program which is solved using either evolutionary algorithms [6] or mathematical programming [7]. Various groups have worked in an operational research framework in formulating the mathematical program as in the ‘p-median’ problem [8]. Although these formulations seek to minimize one objective, it is often the case the solutions are not suitable with respect to some other objectives. In this work we propose a multi-objective optimization methodology suitable for considering more that one objective function. Some work has been conducted within a multi-objective framework, computing the Pareto fronts for conflicting objectives and finding the sets of non-dominant feasible solutions [9], [10]. However some of the objectives considered did not capture the contamination risk. The most frequently used risk objective metric is the average impact on the network. Recently, other relevant metrics have also been applied [11], [7], such as the ‘Conditional Value at Risk’ (CVaR) which corresponds to the average impact of the worst case scenarios. In this work we present a security-oriented formulation and solution of the problem when the average, the worst-case (maximum impact) and the average of worst-cases (CVaR) impact is considered. For computing the solution, we examine the use of a multi-objective evolutionary algorithm. In Section 2 the problem is formulated; in Section 3, the solution methodology is described and an algorithmic solution is presented. In Section 4 simulation results are demonstrated using a realistic water distribution network. Finally, the results are summarized and future work is discussed in Section 5. 2 Problem Formulation We first express the network into a generic graph with nodes and edges. We consider nodes in the graph as locations in the distribution network where water consumption can occur, such as reservoirs, pipe junctions and tanks. Pipes that transport water from one node to another are represented as edges in the graph. 
Let V be the set of n nodes in the network, such that V = {v1, …, vn}, and E be the set of m edges connecting pairs of nodes, where for e ∈ E, e = (vi, vj). The two sets V and E capture the topology of the water distribution network. The function g(t), g: ℝ+ → ℝ+, describes the rate of the contaminant's mass injection in time at a certain node. A typical example of this injection profile is a pulse signal of finite duration. A contamination event ψi(gvi(t)) is the contaminant injection at node vi ∈ V with rate gvi(t). A contamination scenario s = {ψ1, …, ψn} is defined as the set of contamination events ψi at each node vi describing a possible "attack" on the network. Typically, the contamination event ψi at most nodes will be zero, since the injection will occur at a few specific nodes. The set of nodes where intrusion occurs for a scenario s is V* = {vi | ψi ≠ 0, ψi ∈ s}, so that V* ⊆ V. Let S be the set of all possible contamination scenarios w.r.t. the specific water distribution system. We define the function ω̃(s,t), ω̃: S × ℝ+ → ℝ, as the impact of a contamination scenario s until time t, for s ∈ S. This impact is computed through

\tilde{\omega}(s, t) = \sum_{v_i \in V} \varphi(v_i, s, t) ,   (1)

where φ: V × S × ℝ+ → ℝ is a function that computes the impact of a specific scenario s at node vi until time t. The way to compute φ(⋅) is determined by the problem specifications; for instance it can be related to the number of people infected at each node due to contamination, or to the consumed volume of contaminated water. For an edge (vi, vj) ∈ E, the function τ(vi, vj, t), τ: V × V × ℝ+ → ℝ, expresses the transport time between nodes vi and vj when a particle departs node vi at time t. This is computed by solving the network water hydraulics with a numerical solver for a certain time window and for given water demands, tank levels and hydraulic control actions. This corresponds to a time-varying weight for each edge. We further define the function τ*: S × V → ℝ so that, for a scenario s, τ*(s, vi) is the minimum transport time for the contaminant to reach node vi ∈ V. To compute this we consider

\tau^*(s, v_i) = \min_{v_j \in V^*} F(v_i, v_j, s) ,

where, for each intrusion node vj ∈ V* of scenario s, F(·) is computed by a shortest-path algorithm and gives the time at which the contaminant first reaches node vi. Finally, we define the function ω: V × S → ℝ in order to express the impact of a contamination scenario s until it reaches node vi, such that ω(vi, s) = ω̃(s, τ*(s, vi)). This function will be used in the optimization formulation in the next section. 3 Solution Methodology Since the set of all possible scenarios S contains infinitely many elements, an increased computational complexity is imposed on the problem; moreover, contaminations at certain nodes are unrealistic or have trivial impacts. We can relax the problem by considering S0 as a representative finite subset of S, such that S0 ⊂ S. In the simulations that follow, we assume that a scenario s ∈ S0 has a non-zero element for ψi and zero elements for all ψj for which i ≠ j. We further assume that the non-zero contamination event is ψi = g0(t, θ), where g0(·) is a known signal structure and θ is a parameter vector in the bounded parameter space Θ, θ ∈ Θ. Since Θ has infinitely many elements, we perform grid sampling and the selected parameter samples constitute a finite set Θ0 ⊂ Θ. We assume that the parameter vector θ of a contamination event ψi also belongs to Θ0, such that θ ∈ Θ0.
Therefore, a scenario s ∈ S0 is comprised of one contamination event with parameter θ ∈ Θ0; the finite scenario set S0 is comprised of |V|·|Θ0| elements. 3.1 Optimization Problem In relation to the sensor placement problem, when there is more than one sensor in the network, the impact of a fault scenario s ∈ S0 is the minimum impact among all the impacts computed for each node/sensor; essentially it corresponds to the sensor that detects the fault first. We define three objective functions fi: X → ℝ, i ∈ {1,2,3}, that map a set of nodes X ⊂ V to a real number. Specifically, f1(X) is the average impact over S0, such that

f_1(X) = \frac{1}{|S_0|} \sum_{s \in S_0} \min_{x \in X} \omega(x, s) .   (2)

Function f2(X) is the maximum impact over the set of all scenarios, such that

f_2(X) = \max_{s \in S_0} \min_{x \in X} \omega(x, s) .   (3)

Finally, function f3(X) corresponds to the CVaR risk metric and is the average impact of the scenarios in the set S0* ⊂ S0 with impact larger than αf2(X), where α ∈ [0,1]:

f_3(X) = \frac{1}{|S_0^*|} \sum_{s \in S_0^*} \min_{x \in X} \omega(x, s) , \qquad s \in S_0^* \Leftrightarrow \min_{x \in X} \omega(x, s) \ge \alpha f_2(X) .   (4)

The multi-objective optimization problem is formulated as

\min_{X} \{ f_1(X), f_2(X), f_3(X) \} ,   (5)

subject to X ⊂ V' and |X| = N, where V' ⊆ V is the set of feasible nodes and N the number of sensors to be placed. Minimizing one objective function may result in maximizing others; it is thus not possible to find one optimal solution that satisfies all objectives at the same time. It is possible, however, to find a set of solutions, lying on a Pareto front, where each solution is no worse than the others. 3.2 Algorithmic Solution In general a feasible solution X is called Pareto optimal if, for a set of objectives Γ and i, j ∈ Γ, there exists no other feasible solution X' such that fi(X') ≤ fi(X) for all i, with fj(X') < fj(X) for at least one j. Stated differently, a solution is Pareto optimal if there is no other feasible solution that would reduce some objective function without simultaneously causing an increase in at least one other objective function [15, p. 779]. The solution space is extremely large, even for networks with a few nodes. The set of computed solutions may or may not represent the actual Pareto front. Heuristic searching [9] or computational intelligence techniques such as multi-objective evolutionary algorithms [16], [17], [6] have been applied to this problem. In this work we consider an algorithm suitable for the problem, the NSGA-II [18]. This algorithm is examined in the multi-objective sensor placement formulation for risk minimization. In summary, the algorithm randomly creates a set of possible solutions P; these solutions are examined for non-dominance and are separated into ranks. Specifically, the subset of solutions that are non-dominant in the set is P1 ⊂ P. By removing P1, the subset of solutions that are non-dominant in the remaining set is P2 ⊂ {P − P1}, and so on. A 'crowding' metric is used to express the proximity of one solution to its neighbour solutions; this is used to achieve a 'better' spreading of solutions on the Pareto front. A subset of P is selected for computing a new set of solutions, in a rank and crowding-metric competition. The set of new solutions P' is computed through genetic algorithm operators such as crossover and mutation. The sets P and P' are combined and their elements are ranked with the non-dominance criterion. The best solutions in the mixed set are selected to continue to the next iteration.
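Given a matrix of precomputed impacts ω(x, s), with rows indexed by candidate sensor nodes and columns by scenarios, the three risk objectives (2)-(4) can be evaluated with a few lines of NumPy, as in the illustrative sketch below; the impact values themselves would come from hydraulic and quality simulations (e.g. EPANET), which are not reproduced here, and the array layout is an assumption of this illustration.

import numpy as np

def risk_objectives(omega, sensor_idx, alpha=0.8):
    """Evaluate f1 (average), f2 (worst case) and f3 (tail average / CVaR-like)
    of Eqs. (2)-(4) for a candidate sensor set.

    omega      : (n_nodes, n_scenarios) array; omega[x, s] is the impact of
                 scenario s when it is first detected by a sensor at node x
    sensor_idx : indices of the nodes where sensors are placed (the set X)
    """
    detected = omega[sensor_idx, :].min(axis=0)   # min over sensors, per scenario
    f1 = detected.mean()                          # Eq. (2)
    f2 = detected.max()                           # Eq. (3)
    tail = detected[detected >= alpha * f2]       # scenarios belonging to S0*
    f3 = tail.mean()                              # Eq. (4)
    return f1, f2, f3

# example (hypothetical indices): f1, f2, f3 = risk_objectives(omega, [3, 17, 42, 58, 90, 101])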
The NSGA-II algorithm was modified in order to accept discrete inputs, specifically the index numbers of nodes. 4 Simulation Results To illustrate the solution methodology we examine the sensor placement problem in a realistic distribution system. The network model comprises the topological information as well as the outflows for a certain period of time. A graphical representation of the network is shown in Fig. 1. Fig. 1. Spatial schematic of the network. The Source represents a reservoir supplying water to the network and Tanks are temporary water storage facilities. Nodes are consumption points. The network has 126 nodes and 168 pipes; in detail it consists of two tanks, one infinite source reservoir, two pumps and eight valves. All nodes in the network are possible locations for placing sensors. The water demands across the nodes are not uniformly distributed; specifically, 20 of the nodes are responsible for more than 80% of all water demands. The network is in EPANET 2.0 format and was used in the 'Battle of the Water Sensor Networks' design challenge [5, 17]. The demands for a typical day are provided with the network. According to our solution methodology, we assumed that g0(·) is a pulse signal with three parameters: θ1 = 28.75 kg/hr, the rate of contaminant injection; θ2 = 2 hr, the duration of the injection; and 0 ≤ θ3 ≤ 24 hr, the injection start time (as in [5]). By performing 5-minute sampling on θ3, the finite set Θ0 is built with |Θ0| = 288 parameters. All nodes were considered as possible intrusion locations, and it is assumed that only one node can be attacked in each contamination scenario. Therefore, the finite scenario set S0 has |S0| = 126 ⋅ 288 = 36,288 elements. The impact φ(⋅) is computed using a nonlinear function described in [5] representing the number of people infected due to contaminant consumption. Hydraulic and quality dynamics were computed using the EPANET software. Fig. 2. (a): Histogram of the maximum normalized impact over all contamination scenarios. (b): Histogram of the maximum normalized impact over all contamination scenarios when 6 sensors are placed so as to minimize the average impact. For simplicity and without loss of generality, the impact metrics presented hereafter are normalized. Figure 2(a) depicts the histogram of the maximum normalized impact over all contamination scenarios. From Fig. 2(a) it appears that about 40% of all scenarios under consideration have impacts of more than 10% of the maximum. The long tail in the distribution shows that there is a subset of scenarios which have a large impact on the network. From simulations we identified two locations that, in certain scenarios, yield the largest impact, specifically near the reservoir and at one tank (labelled with numbers '1' and '2' in Fig. 1). The worst possible outcome among all intrusions is a contamination at node '1' at time 23:55, given that the contaminant propagates undetected in the network. The optimization problem is to place six sensors at six nodes in order to minimize the three objectives described in the problem formulation. For the third objective we use α = 0.8. The general assumption is that the impact of a contamination is measured until a sensor has been triggered, i.e. a contaminant has reached a sensor node; afterwards it is assumed that there are no delays in stopping the service.
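As a rough point of comparison for the multi-objective search, a naive greedy placement that only minimizes the average impact f1 can be coded in a few lines; this baseline is not the NSGA-II procedure used in the paper and ignores f2 and f3, but it gives a feel for how the precomputed impact matrix is exploited when selecting sensor locations.

import numpy as np

def greedy_average_placement(omega, n_sensors):
    """Greedy baseline: repeatedly add the candidate location that most reduces
    the average detected impact f1. This is NOT the NSGA-II search used in the
    paper; it serves only as a simple single-objective point of comparison."""
    chosen = []
    best = np.full(omega.shape[1], np.inf)   # lowest detection impact so far, per scenario
    for _ in range(n_sensors):
        gains = [(np.minimum(best, omega[x, :]).mean(), x)
                 for x in range(omega.shape[0]) if x not in chosen]
        _, x_star = min(gains)
        chosen.append(x_star)
        best = np.minimum(best, omega[x_star, :])
    return chosen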
For illustrating the impact reduction by placing sensors, we choose one solution from the solution set computed with the smallest average impact; the proposed six sensor locations are indicated as nodes with circles in Fig. 1. Figure 2(b) shows the histogram of the maximum impact on the network when these six sensors are installed. We observe a reduction of the impact to the system, with worst case impact near 20% w.r.t. the unprotected worst case. We performed two experiments with the NSGA-II algorithm to solve the optimization problem. In the one experiment, the parameters were 200 solutions in population Security Issues in Drinking Water Distribution Networks 75 for 1000 generations; for the second experiment it was 400 and 2000 respectively. The Pareto fronts are depicted in Fig. 3, along with the results computed using a deterministic tree algorithm presented in [10]. The simulations have shown that for our example, the maximum impact objective was almost the same in all computed Pareto solutions and is not presented in the figures. Fig. 3. Pareto fronts computed by the NSGA-II algorithm, as well as from a deterministic algorithm for comparison. The normalized tail average and total average impact are compared. 5 Conclusions In this work we have presented the security problem of sensor placement problem in water distribution networks for contaminant detection. We have presented an initial attempt to formulate the problem in order to be suitable for mathematical analysis. Furthermore, we examined a multiple-objective optimization problem using certain risk metrics and demonstrated a solution on a realistic network using a suitable multiobjective evolutionary algorithm, the NSGA-II. Good results were obtained considering the stochastic nature of the algorithm and the extremely large solution search space. Security of water systems is an open problem where computational intelligence could provide suitable solutions. Besides sensor placement, other interesting aspects of the security problem are the fault detection based on telemetry signals, the identification and the accommodation of the problem. References 1. LeVeque, R.: Nonlinear Conservation Laws and Finite Volume Methods. In: LeVeque, R.J., Mihalas, D., Dor, E.A., Müller, E. (eds.) Computational Methods for Astrophysical Fluid Flow, pp. 1–159. Springer, Berlin (1998) 2. Kurotani, K., Kubota, M., Akiyama, H., Morimoto, M.: Simulator for contamination diffusion in a water distribution network. In: Proc. IEEE International Conference on Industrial Electronics, Control, and Instrumentation, vol. 2, pp. 792–797 (1995) 3. Rossman, L.A.: The EPANET Programmer’s Toolkit for Analysis of Water Distribution Systems. In: ASCE 29th Annual Water Resources Planning and Management Conference, pp. 39–48 (1999) 76 D.G. Eliades and M.M. Polycarpou 4. Kessler, A., Ostfeld, A., Sinai, G.: Detecting accidental contaminations in municipal water networks. ASCE Journal of Water Resources Planning and Management 124(4), 192–198 (1998) 5. Ostfeld, A., Uber, J.G., Salomons, E.: Battle of the Water Sensor Networks (BWSN): A Design Challenge for Engineers and Algorithms. In: ASCE 8th Annual Water Distibution System Analysis Symposium (2006) 6. Huang, J.J., McBean, E.A., James, W.: Multi-Objective Optimization for Monitoring Sensor Placement in Water Distribution Systems. In: ASCE 8th Annual Water Distibution System Analysis Symposium (2006) 7. 
Hart, W., Berry, J., Riesen, L., Murray, R., Phillips, C., Watson, J.: SPOT: A sensor placement optimization toolkit for drinking water contaminant warning system design. In: Proc. World Water and Environmental Resources Conference (2007) 8. Berry, J.W., Fleischer, L., Hart, W.E., Phillips, C.A., Watson, J.P.: Sensor Placement in Municipal Water Networks. ASCE Journal of Water Resources Planning and Management 131(3), 237–243 (2005) 9. Eliades, D., Polycarpou, M.: Iterative Deepening of Pareto Solutions in Water Sensor Networks. In: Buchberger, S.G. (ed.) ASCE 8th Annual Water Distibution System Analysis Symposium. ASCE (2006) 10. Eliades, D.G., Polycarpou, M.M.: Multi-Objective Optimization of Water Quality Sensor Placement in Drinking Water Distribution Networks. In: European Control Conference, pp. 1626–1633 (2007) 11. Watson, J.P., Hart, W.E., Murray, R.: Formulation and Optimization of Robust Sensor Placement Problems for Contaminant Warning Systems. In: Buchberger, S.G. (ed.) ASCE 8th Annual Water Distibution System Analysis Symposium (2006) 12. Rockafellar, R., Uryasev, S.: Optimization of Conditional Value-at-Risk. Journal of Risk 2(3), 21–41 (2000) 13. Rockafellar, R., Uryasev, S.: Conditional Value-at-Risk for General Loss Distributions. Journal of Banking and Finance 26(7), 1443–1471 (2002) 14. Topaloglou, N., Vladimirou, H., Zenios, S.: CVaR models with selective hedging for international asset allocation. Journal of Banking and Finance 26(7), 1535–1561 (2002) 15. Rao, S.: Engineering Optimization: Theory and Practice. Wiley-Interscience, Chichester (1996) 16. Preis, A., Ostfeld, A.: Multiobjective Sensor Design for Water Distribution Systems Security. In: ASCE 8th Annual Water Distibution System Analysis Symposium (2006) 17. Ostfeld, A., Uber, J.G., Salomons, E., Berry, J.W., Hart, W.E., Phillips, C.A., Watson, J.P., Dorini, G., Jonkergouw, P., Kapelan, Z., di Pierro, F., Khu, S.T., Savic, D., Eliades, D., Polycarpou, M., Ghimire, S.R., Barkdoll, B.D., Gueli, R., Huang, J.J., McBean, E.A., James, W., Krause, A., Leskovec, J., Isovitsch, S., Xu, J., Guestrin, C., VanBriesen, J., Andc, M.S., Andd, P.F., Preis, A., Propato, M., Piller, O., Trachtman, G.B., Wu, Z.Y., Walski, T.: The Battle of the Water Sensor Networks (BWSN): A Design Challenge for Engineers and Algorithms. ASCE Journal of Water Resources Planning and Management (to appear, 2008) 18. Deb, K., Pratap, A., Agarwal, S., Meyarivan, T.: A fast and elitist multiobjective genetic algorithm: NSGA-II. IEEE Transactions on Evolutionary Computation 6(2), 182–197 (2002) Trusted-Computing Technologies for the Protection of Critical Information Systems Antonio Lioy, Gianluca Ramunno, and Davide Vernizzi* Politecnico di Torino Dip. di Automatica e Informatica c. Duca degli Abruzzi, 24 – 10129 Torino, Italy {antonio.lioy,gianluca.ramunno,davide.vernizzi}@polito.it Abstract. Information systems controlling critical infrastructures are vital elements of our modern society. Purely software-based protection techniques have demonstrated limits in fending off attacks and providing assurance of correct configuration. Trusted computing techniques promise to improve over this situation by using hardware-based security solutions. This paper introduces the foundations of trusted computing and discusses how it can be usefully applied to the protection of critical information systems. Keywords: Critical infrastructure protection, trusted computing, security, assurance. 
1 Introduction Trusted-computing (TC) technologies have historically been proposed by the TCG (Trusted Computing Group) to protect personal computers from those software attacks that cannot be countered by purely software solutions. However these techniques are now mature enough to spread out to both bigger and smaller systems. Trusted desktop environments are already available and easy to setup. Trusted computing servers and embedded systems are just around the corner, while proof-ofconcept trusted environments for mobile devices have been demonstrated and are just waiting for the production of the appropriate hardware anchor (MTM, Mobile Trust Module). TC technologies are not easily understood and to many people they immediately evoke the “Big Brother” phantom, mainly due to their initial association with controversial projects from operating system vendors (to lock the owner into using only certified and licensed software components) and from providers of multimedia content (to avoid copyright breaches). However, TC is nowadays increasingly being associated with secure open environments, also thanks to pioneer work performed by various projects around the world, such as the Open-TC [1] one funded by the European Commission. On another hand, we have an increasing number of vital control systems (such as electric power distribution, railway traffic, and water supply) that heavily and almost exclusively rely on computer-based infrastructures for their correct operation. In the * This work has been partially funded by the EC as part of the OpenTC project (ref. 027635). It is the work of the authors alone and may not reflect the opinion of the whole project. E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 77–83, 2009. springerlink.com © Springer-Verlag Berlin Heidelberg 2009 78 A. Lioy, G. Ramunno, and D. Vernizzi following we will refer to these infrastructures with the term “Critical Information Systems” (CIS) because their proper behaviour in handling information is critical for the operation of some very important system. This paper briefly describes the foundations of TC and shows how they can help in creating more secure and trustworthy CIS. 2 Critical Information Systems (CIS) CIS are typically characterized by being very highly distributed systems, because the underlying controlled system (e.g. power distribution, railway traffic) is highly distributed itself on a geographic scale. In turn this bears an important consequence: it is nearly impossible to control physical access to all its components and to the communication channels that must therefore be very trustworthy. In other words, we must consider the likelihood that someone is manipulating the hardware, software, or communication links of the various distributed components. This likelihood is larger than in normal networked systems (e.g. corporate networks) because nodes are typically located outside the company’s premises and hosted in shelters easily violated. For example, think of the small boxes containing control equipment alongside railway tracks or attached to electrical power lines. An additional problem is posed by the fact that quite often CIS are designed, developed, and deployed by a company (the system developer), owned by a different one (the service provider), and finally maintained by yet another company (the maintainer) on a contract with one of the first two. When a problem occurs and it leads to an accident, it is very important to be able to track the source of the problem: is it due to a design mistake? 
or to a development bug? or to an incorrect maintenance procedure? The answer has influence over economical matters (costs for fixing the problem, penalties to be paid) and may be also over legal ones, in the case of damage to a third-party. Even if no damage is produced, it is nonetheless important to be able to quickly identify problems with components of a CIS, be them produced incidentally or by a deliberate act. For example, in real systems many problems are caused by mistakes made by maintainers when upgrading or replacing hardware or software components. Any technical solution that can help in thwarting attacks and detecting mistakes and breaches is of interest, but nowadays the solutions commonly adopted – such as firewall, VPN, and IDS – heavily rely on correct software configuration of all the nodes. Unfortunately this cannot be guaranteed in a highly distributed and physically insecure system as a CIS. Therefore better techniques should be adopted not only to protect the system against attacks and errors but also to provide assurance that each node is configured and operating as expected. This is exactly one of the possible applications of the TC paradigm, which is introduced in the next section. 3 Trusted Computing Principles In order to protect computer systems and networks from attacks we rely on software tools in the form of security applications (e.g. digital signature libraries), kernel Trusted-Computing Technologies 79 modules (e.g. IPsec) or firmware, as in the case of firewall appliances. However software can be manipulated either locally by privileged and un-privileged users, or remotely via network connections that exploit known vulnerabilities or insecure configurations (e.g. accepting unknown/unsigned Active-X components in your browser). It is therefore clear that is nearly impossible to protect a computer system from software attacks while relying purely on software defences. To progress beyond this state, the TCG [2], a not-for-profit group of ICT industry players, developed a set of specification to create a computer system with enhanced security named “trusted platform”. A trusted platform is based on two key components: protected capabilities and shielded memory locations. A protected capability is a basic operation (performed with an appropriate mixture of hardware and firmware) that is vital to trust the whole TCG subsystem. In turn capabilities rely on shielded memory locations, special regions where is safe to store and operate on sensitive data. From the functional perspective, a trusted platform provides three important features rarely found in other systems: secure storage, integrity measurement and reporting. The integrity of the platform is defined as a set of metrics that identify the software components (e.g. operating system, applications and their configurations) through the use of fingerprints that act as unique identifiers for each component. Considered as a whole, the integrity measures represent the configuration of the platform. A trusted platform must be able to measure its own integrity, locally store the related measurements and report these values to remote entities. 
In order to trust these operations, the TCG defines three so-called “root of trust”, components that must be trusted because their misbehaviour might not be detected: • the Root of Trust for Measurements (RTM) that implements an engine capable of performing the integrity measurements; • the Root of Trust for Storage (RTS) that securely holds integrity measures and protect data and cryptographic keys used by the trusted platform and held in external storages; • the root of trust for reporting (RTR) capable of reliably reporting to external entities the measures held by the RTS. The RTM can be implemented by the first software module executed when a computer system is switched on (i.e. a small portion of the BIOS) or directly by the hardware itself when using processors of the latest generation. The central component of a TCG trusted platform is the Trusted Platform Module (TPM). This is a low cost chip capable to perform cryptographic operations, securely maintain the integrity measures and report them. Given its functionalities, it is used to implement RTS and RTR, but it can also be used by the operating system and applications for cryptographic operations although its performance is quite low. The TPM is equipped with two special RSA keys, the Endorsement Key (EK) and the Storage Root Key (SRK). The EK is part of the RTR and it is a unique (i.e. each TPM has a different EK) and “non-migratable” key created by the manufacturer of the TPM and that never leaves this component. Furthermore the specification requires that a certificate must be provided to guarantee that the key belongs to a genuine TPM. The SRK is part of the RTS and it is a “non-migratable” key that protects the 80 A. Lioy, G. Ramunno, and D. Vernizzi other keys used for cryptographic functions1 and stored outside the TPM. Also SRK never leaves the TPM and it is used to build a key hierarchy. The integrity measures are held into the Platform Configuration Registers (PCR). These are special registers within the TPM acting as accumulators: when the value of a register is updated, the new value depends both on the new measure and on the old value to guarantee that once initialized it is not possible to fake the value of a PCR. The action of reporting the integrity of the platform is called Remote Attestation. A remote attestation is requested by a remote entity that wants evidence about the configuration of the platform. The TPM then makes a digital signature over the values of a subset of PCR to prove to the remote entity the integrity and authenticity of the platform configuration. For privacy reasons, the EK cannot be used to make the digital signature. Instead, to perform the remote attestation the TPM uses an Attestation Identity Key (AIK), which is an “alias” for the EK. The AIK is a RSA key created by the TPM whose private part is never released outside the chip; this guarantees that the AIK cannot be used by anyone except the TPM itself. In order to use the AIK for authenticating the attestation data (i.e. the integrity measures) it is necessary to obtain a certificate proving that the key was actually generated by a genuine TPM and it is managed in a correct way. Such certificates are issued by a special certification authority called Privacy CA (PCA). Before creating the certificate, the PCA must verify the genuineness of the TPM. This verification is done through the EK certificate. 
Many AIKs can be created and, to prevent the traceability of the platform operations, ideally a different AIK should be used for interacting with each different remote attester. Using trusted computing it is possible to protect data via asymmetric encryption in a way that only the platform’s TPM can access them: this operation is called binding. It is however possible to migrate keys and data to another platform, with a controlled procedure, if they were created as “migratable”. The TPM also offers a stronger capability to protect data: sealing. When the user seals some data, he must specify an “unsealing configuration”. The TPM assures that sealed data can be only be accessed if the platform is in the “unsealing configuration” that was specified at the sealing time. The TPM is a passive chip disabled at factory and only the owner of a computer equipped with a TPM may choose to activate this chip. Even when activated, the TPM cannot be remotely controlled by third entities: every operation must be explicitly requested by software running locally and the possible disclosure of local data or the authorisation to perform the operations depend on the software implementation. In the TCG architecture, the owner of the platform plays a central role because the TPM requires authorisation from the owner for all the most critical operations. Furthermore, the owner can decide at any time to deactivate the TPM, hence disabling the trusted computing features. The identity of the owner largely depends on the scenario where trusted computing is applied: in a corporate environment, the owner is usually the administrator of the IT department, while in a personal scenario normally the end-user is also the owner of the platform. 1 In order to minimize attacks, SRK is never used for any cryptographic function, but only to protect other keys. Trusted-Computing Technologies 81 Run-time isolation between software modules with different security requirement can be an interesting complementary requirement for a trusted platform. Given that memory areas of different modules are isolated and inter-module communication can occur only under well specified control flow policies, then if a specific module of the system is compromised (e.g. due to a bug or a virus), the other modules that are effectively isolated from that one are not affected at all. Today virtualization is an emerging technology for PC class platforms to achieve run-time isolation and hence is a perfect partner for a TPM-based trusted platform. The current TCG specifications essentially focus on protecting a platform against software attacks. The AMD-V [3] and the Intel TXT [4] initiatives, besides providing hardware assistance for virtualization, increase the robustness against software attacks and the latter also starts dealing with some basic hardware attacks. In order to protect the platforms also from physical attacks, memory curtaining and secure input/output should be provided: memory curtaining extends memory protection in a way that sensitive areas are fully isolated while secure input/output protects communication paths (such as the buses and input/output channels) among the various components of a computer system. Intel TXT focuses only on some so called “open box” attacks, by protecting the slow buses and by guaranteeing the integrity verification of the main hardware components on the platform. 
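Returning to the sealing capability described above, its semantics can be sketched conceptually as follows (this is our illustration, not the actual TPM command interface; the data structures and the decrypt helper are hypothetical, and in a real TPM the comparison and the decryption happen inside the chip):

```python
from dataclasses import dataclass
from typing import Dict

def decrypt(ciphertext: bytes) -> bytes:
    # Placeholder for the TPM-internal decryption under an SRK-protected key.
    return ciphertext

@dataclass
class SealedBlob:
    """Data bound to an 'unsealing configuration' (expected PCR values)."""
    unsealing_configuration: Dict[int, bytes]  # PCR index -> expected value
    ciphertext: bytes

def unseal(blob: SealedBlob, current_pcrs: Dict[int, bytes]) -> bytes:
    """Release the plaintext only if the platform is in the sealed-to state.

    A real TPM performs this check internally and never exposes the
    protecting key; decrypt() stands in for that step in this sketch.
    """
    for index, expected in blob.unsealing_configuration.items():
        if current_pcrs.get(index) != expected:
            raise PermissionError("platform configuration does not match sealing state")
    return decrypt(blob.ciphertext)
```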
4 The OpenTC Project The OpenTC project has applied TC techniques to the creation of an open and secure computing environment by coupling them with advanced virtualization techniques. In this way it is possible to create on the same computer different execution environments mutually protected and with different security properties. OpenTC uses virtualisation layers – also called Virtual Machine Monitors (VMM) or hypervisors – and supports two different implementations: Xen and L4/Fiasco. This layer hosts compartments, also called virtual machines (VM), domains or tasks, depending on the VMM being used. Some domains host trust services that are available to authorised user compartments. Various system components make use of TPM capabilities, e.g. in order to measure other components they depend on or to prove the system integrity to remote challengers. Each VM can host an open or proprietary operating environment (e.g. Linux or Windows) or just a minimal library-based execution support for a single application. The viability of the OpenTC approach has been demonstrated by creating two proof-of-concept prototypes, the so-called PET and CC@H ones, that are publicly available at the project’s web site The PET (for Private Electronic Transactions) scenario [5] aims to improve the trustworthiness of interactions with remote servers. Transactions are simply performed by accessing a web server through a standard web browser running in a dedicated trusted compartment. The server is assumed to host web pages related to a critical financial service, such as Internet banking or another e-commerce service. The communication setup between the browser compartment and the web server is extended by a protocol for mutual remote attestation tunnelled through an SSL/TLS channel. During the attestation phase, each side assesses the trustworthiness of the other. If this assessment is negative on either side, the SSL/TLS tunnel is closed, 82 A. Lioy, G. Ramunno, and D. Vernizzi preventing further end-to-end communication. If the assessment is positive, end-toend communication between browser and server is enabled via standard HTTPS tunnelled over SSL/TLS. In this way the user is reassured about having connected to the genuine server of its business provider and about its integrity, and the provider knows that the user has connected by using a specific browser and that the hosting environment (i.e. operating system and drivers) has not been tampered with, for example by inserting a key-logger. The CC@H (for Corporate Computing at Home) scenario [6] demonstrates the usage of the OpenTC solution to run simultaneously on the same computer different non-interfering applications. It reflects the situation where employers tolerate, within reasonable limits, the utilization of corporate equipment (in particular notebooks) for private purposes but want assurance that the compartment dedicated to corporate applications is not manipulated by the user. In turn the user has a dedicated compartment for his personal matters, included free Internet surfing which, on the contrary, is not allowed from the corporate compartment. 
The CC@H scenario is based on the following main functional components: • boot-loaders capable of producing cryptographic digests for lists of partitions and arbitrary files that are logged into PCRs of the TPM prior to passing on control of the execution flow to the virtual machine monitor (VMM) or kernel it has loaded into memory; • virtualization layers with virtual machine loaders that calculate and log cryptographic digests for virtual machines prior to launching them; • a graphical user interface enabling the user to launch, stop and switch between different compartments with a simple mouse click; • a virtual network device for forwarding network packets from and to virtual machine domains; • support for binding the release of keys for encrypted files and partitions to defined platform integrity metrics. 5 How TC Can Help in Building Better CIS TC can benefit the design and operations of a CIS in several ways. First of all, attestation can layer the foundation for better assurance in the systems operation. This is especially important when several players concur to the design, management, operation, and maintenance of a CIS and thus unintentional modification of software modules is possible, as well as deliberate attacks due to the distributed nature of these systems. TC opens also the door to reliable logging, where log files contain not only a list of events but also a trusted trace of the component that generated the event and the system state when the event was generated. Moreover the log file could be manipulated only by trusted applications if it was sealed against them, so for example no direct editing (for insertion or cancellation) would be possible. The security of remote operations could be improved by exploiting remote attestation, so that network connections are permitted only if requested by nodes executing specific software modules. Intruders would therefore be unable to connect to the other Trusted-Computing Technologies 83 nodes: as they could not even open a communication channel towards the application, they could not in any way try to exploit the software vulnerabilities that unfortunately plague most applications. This feature could also be used in a different way: a central management facility could poll the various nodes by opening network connection just to check via remote attestation the software status of the nodes. In this way it would be very easy to detect misconfigured elements and promptly reconfigure or isolate them. In general sealing protects critical data from access by unauthorized software because access can be bound to a well-defined system configuration. This feature allows the implementation of fine-grained access control schemes that can prevent agents from accessing data they are not authorized for. This is very important when several applications with different trust level are running on the same node and may be these applications have been developed by providers with different duties. In this way we can easily implement the well-known concept of “separation of duties”: even if a module can bypass the access control mechanisms of the operating system and directly access the data source, it will be unable to operate on it because the TPM and/or security services running on the system will not release to the module the required cryptographic credentials. Finally designers and managers of CIS should not be scared by the overhead introduced by TC. 
In our experience, the L4/Fiasco set-up is very lightweight and therefore suitable also for nodes with limited computational resources. Moreover, inside a partition we can execute a full operating system with all its capabilities (e.g. Suse Linux), a stripped-down OS (such as DSL, Damn Small Linux) with only the required drivers and capabilities, or just a mini-execution environment providing only the required libraries for small embedded single task applications. In case multiple compartments are not needed, the TC paradigm can be directly built on top of the hardware, with no virtualization layer, hence further reducing its footprint. 6 Conclusions While several technical problems are still to be solved before large-scale adoption of TC is a reality, we nonetheless think that it is ready to become a major technology of the current IT scenario, especially for critical information systems where its advantages in terms of protection and assurance far outweigh its increased design and management complexity. References 1. 2. 3. 4. Open Trusted Computing (OpenTC) project, IST-027635, http://www.opentc.net Trusted Computing Group, http://www.trustedcomputing.org AMD Virtualization, http://www.amd.com/virtualization Intel Trusted Execution Technology, http://www.intel.com/technology/security/ 5. OpenTC newsletter no.3, http://www.opentc.net/publications/OpenTC_Newsletter_03.html 6. OpenTC newsletter no.5, http://www.opentc.net/publications/OpenTC_Newsletter_05.html A First Simulation of Attacks in the Automotive Network Communications Protocol FlexRay Dennis K. Nilsson1 , Ulf E. Larson1 , Francesco Picasso2, and Erland Jonsson1 1 2 Department of Computer Science and Engineering Chalmers University of Technology SE-412 96 Gothenburg, Sweden Department of Computer Science and Engineering University of Genoa Genoa, Italy {dennis.nilsson,ulf.larson,erland.jonsson}@chalmers.se, francesco.picasso@unige.it Abstract. The automotive industry has over the last decade gradually replaced mechanical parts with electronics and software solutions. Modern vehicles contain a number of electronic control units (ECUs), which are connected in an in-vehicle network and provide various vehicle functionalities. The next generation automotive network communications protocol FlexRay has been developed to meet the future demands of automotive networking and can replace the existing CAN protocol. Moreover, the upcoming trend of ubiquitous vehicle communication in terms of vehicle-to-vehicle and vehicle-to-infrastructure communication introduces an entry point to the previously isolated in-vehicle network. Consequently, the in-vehicle network is exposed to a whole new range of threats known as cyber attacks. In this paper, we have analyzed the FlexRay protocol specification and evaluated the ability of the FlexRay protocol to withstand cyber attacks. We have simulated a set of plausible attacks targeting the ECUs on a FlexRay bus. From the results, we conclude that the FlexRay protocol lacks sufficient protection against the executed attacks, and we therefore argue that future versions of the specification should include security protection. Keywords: Automotive, vehicle, FlexRay, security, attacks, simulation. 1 Introduction Imagine a vehicle accelerating and exceeding a certain velocity. At this point, the airbag suddenly triggers rendering the driver unable to maneuver the vehicle. This event leads to the vehicle crashing, leaving the driver seriously wounded. 
One might ask if this event was caused by a software malfunction or a hardware fault. In the near future, one might even ask if the event was the result of a deliberate cyber attack on the vehicle. In the last decade electronics and firmware have replaced mechanical components in vehicles at an unprecedented rate. Modern vehicles contain an in-vehicle network consisting of a number of electronic control units (ECUs) responsible E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 84–91, 2009. c Springer-Verlag Berlin Heidelberg 2009 springerlink.com A First Simulation of Attacks 85 for the functionality in the vehicle. As a result, the ECUs constitute a likely target for cyber attackers. An emerging trend for automotive manufacturers is to create an infrastructure for performing remote diagnostics and firmware updates over the air (FOTA) [1]. There are several benefits with this approach. It involves a minimum of customer inconvenience since there exists no need for the customer to bring the vehicle to a service station for a firmware update. In addition, it allows faster updates; it is possible to update the firmware as soon as it is released. Furthermore, this approach reduces the lead time from fault to action since it is possible to analyze errors and identify causes using diagnostics before a vehicle arrives at a service station. However, the future infrastructure allows external communication to interact with the in-vehicle network, which introduces a number of security risks. The previously isolated in-vehicle network is thus exposed to a whole new type of attacks, collectively known as cyber attacks. Since the ECUs connected to the FlexRay bus are used for providing control and maneuverability in the vehicle, this bus is a likely target for attackers. We have analyzed what an attacker can do once access to the bus is achieved. The main contributions of this paper are as follows. – We have analyzed the FlexRay protocol specification with respect to desired security properties and found that functionalities to achieve these properties are missing. – We have identified the actions an attacker can take in a FlexRay network as a result of the lack of security protection. – We have successfully implemented and simulated the previously identified attack actions in the CANoe simulator environment. – We discuss the potential safety effects of these attacks and emphasize the need for future security protection in in-vehicle networks. 2 Related Work Much research on the in-vehicle network has been on safety issues. Little research has been done on the security aspects in such networks. Only a few papers focusing on the security of those networks have been published. However, the majority of these papers have focused on the CAN protocol. These papers are described in more detail as follows. Wolf et al. [2] present several weaknesses in the CAN and FlexRay bus protocols. Weaknesses include confidentiality and authenticity problems. However, the paper does not give any specific attack examples. Simulated attacks have been performed on the CAN bus [3, 4, 5] and illustrate the severity of such attacks. Moreover, the notion of vehicle virus is introduced in [5] which describes more advanced attacks on the CAN bus. As a result, the safety of the driver can be affected. The Electronic Architecture and System Engineering for Integrated Safety Systems (EASIS) project has done work in embedded security for safety applications [6]. The work dicusses the need for safe and reliable communication for 86 D.K. 
Nilsson et al. external and internal vehicle communication. A security manager responsible for crypto operations and authentication management is described. In this paper, we focus on identifying possible attacker actions on the FlexRay bus and discussing the safety effects of such security threats. 3 Background Traditionally, in-vehicle networks are designed to meet reliability requirements. As such, they primarily address failures caused by non-malicious and inadvertent flaws, which are produced by chance or by component malfunction. Protection is realized by fault-tolerance mechanisms, such as redundancy, replication, and diversity. Since the in-vehicle network has been isolated, protection against intelligent attackers (i.e., security) has not been previously considered. However, recent wireless technology allow for external interaction with the vehicle through remote diagnostics and FOTA. To benefit from the new technology, the in-vehicle network needs to allow wireless communication with external parties, including service stations, business systems and fleet management. Since the network must be open for external access, new threats need to be accounted for and new requirements need to be stated and implemented. Consider for example an attacker using a compromised host in the business network to obtain unauthorized access to the in-vehicle network. Once inside the in-vehicle network, the attacker sends a malicious diagnostic request to trigger the airbag, which in turn could cause injury to the driver and the vehicle to crash. As illustrated by the example scenario, the safety of the vehicle is strongly linked to the security, and a security breach may well affect the safety of the driver. It is thus reasonable to believe that not only reliability requirements but also security requirements need to be fulfilled to protect the driver. Since security has yet not been required in the in-vehicle networks, it can be assumed that a set of successful attacks targeting the in-vehicle network should be possible to produce. To assess our assumption, we analyze the FlexRay protocol specification version 2.1 revision A [7]. We then identify a set of security properties for the network, evaluate the correspondence between the properties and the protocol specifications, and develop an attacker model. 3.1 In-Vehicle Network The in-vehicle network consists of a number of ECUs and buses. The ECUs and buses form networks, and the networks are connected through gateways. Critical applications, such as the engine management system and the antilock braking system (ABS) use the CAN bus for communication. To meet future application demands, CAN is gradually being replaced with FlexRay. A wireless gateway connected to the FlexRay bus allows access to external networks such as the Internet. A conceptual model of the in-vehicle network is shown in Fig. 1. A First Simulation of Attacks 87 External Network ECUs Wireless Gateway FlexRay Fig. 1. Conceptual model of the in-vehicle network consisting of a FlexRay network and ECUs including a wireless gateway 3.2 FlexRay Protocol The FlexRay protocol [7] is designed to meet the requirements of today’s automotive industry, including flexible data communications, support for three different topologies (bus, star, and mixed), fault-tolerant operation and higher data rate than previous standards. 
FlexRay allows both asynchronous transfer mode and real-time data transfer, and operates as a dual-channel system, where each channel delivers a maximum data bit rate of 10 Mbps. The FlexRay protocol belongs to the time-triggered protocol family which is characterized by a continuous communication of all connected nodes via redundant data buses at predefined intervals. It defines the specifics of a data-link layer independent of a physical layer. It focuses on real-time redundant communications. Redundancy is achieved by using two communications channels. Real time is assured by the adoption of a time division multiplexing communication scheme: each ECU has a given time slot to communicate but it cannot decide when since the slots are allocated at design time. Moreover, a redundant masterless protocol is used for synchronizing the ECU clocks by measuring time differences of arriving frames, thus implementing a fault-tolerant procedure. The data-link layer provides basic but strongly reliable communication functionality, and for the protocol to be practically useful, an application layer needs to be implemented on top of the link layer. Such an application layer could be used to assure the desired security properties. In FlexRay, this application layer is missing. 3.3 Desired Security Properties To evaluate the security of the FlexRay protocol, we use a set of established security properties commonly used for, e.g., sensor networks [8, 9, 10]. We believe that most of the properties are desirable in the vehicle setting due to the many similarities between the network types. The following five properties are considered. – Data Confidentiality. The contents of messages between the ECUs should be kept confidential to avoid unauthorized reads. 88 D.K. Nilsson et al. – Data Integrity. Data integrity is necessary for ensuring that messages have not been modified in transit. – Data Availability. Data availability is necessary to ensure that measured data from ECUs can be accessed at requested times. – Data Authentication. To prevent an attacker from spoofing packets, it is important that the receiver can verify the sender of the packets (authenticity). – Data Freshness. Data freshness ensures that communicated data is recent and that an attacker is not replaying old data. 3.4 Security Evaluation of FlexRay Specification We perform a security evaluation of the FlexRay protocol based on the presented desired security properties. We inspect the specification and look for functionalities that would address those properties. The FlexRay protocol itself does not provide any security features since it is a pure communication protocol. However, CRC check values are included to provide some form of integrity protection against transmission errors. The time division multiplexing assures availability since the ECUs can communicate during the allocated time slot. The security properties are at best marginally met by the FlexRay specification. We find some protection of data availability and data integrity, albeit the intention of the protection is for safety reasons. However, the specification does not indicate any assurance of confidentiality, authentication, or freshness of data. Thus, security has not been considered during the design of the FlexRay protocol specification. 3.5 Attacker Model The evaluation concluded that the five security properties were at best slightly addressed in the FlexRay protocol. 
Therefore, we can safely make the assumption that a wide set of attacks in the in-vehicle network should be possible to execute. We apply the Nilsson-Larson attacker model [5], where an attacker has access to the in-vehicle network via the wireless gateway and can perform the following actions: read, spoof, drop, modify, flood, steal, and replay. In the following section, we focus on simulating the read and spoof actions. 4 Cyber Attacks In this section, we define and discuss two attacker actions: read and spoof. The actions are derived from the security evaluation and the attacker model in the previous section. 4.1 Attacker Actions Read and spoof can be performed from any ECU in the FlexRay network. Read. Due to the lack of confidentiality protection, an attacker can read all data sent on the FlexRay bus, and possibly send the data to a remote location via A First Simulation of Attacks 89 the wireless gateway. If secret keys, proprietary or private data are sent on the FlexRay bus, an attacker can easily learn that data. Spoof. An attacker can create and inject messages since there exists no data authentication on the data sent on the FlexRay bus. Therefore, messages can be spoofed and target arbitrary ECUs claiming to be from any ECU. An attacker can easily create and inject diagnostics messages on the FlexRay bus to cause ECUs to perform arbitrary actions. 4.2 Simulation Environment We have used CANoe version 7.0 from Vector Informatik [11] to simulate the attacks. ECUs for handling the engine, console, brakes, and back light functionalities were connected to a FlexRay bus to simulate a simplified view of an in-vehicle network. 4.3 Simulated Attack Actions We describe the construction and simulation of the read and spoof attack actions. Read. Messages sent on the FlexRay bus can be recorded by an attacker who has access to the network. The messages are Time stamped, and the channel (Chn), identifier (ID), direction (Dir ), data length (DLC ), and any associated Data bytes are recorded. The log entries for a few messages are shown in Table 1. Also, we have included the corresponding message Name (interpreted from ID) for each message. Spoof. An attacker can create and inject a request on the bus to cause an arbitrary effect. For example, an attacker can create and inject a request to light up the brake light. The BreakLight message is spoofed with the data value 0x01 and sent on the bus resulting in the brake light being turned on. The result of the attack is shown in Fig. 2. The transmission is in the fifth gear, and the vehicle is accelerating and traveling with a velocity of 113 mph. At the same time, the brake light is turned on, as indicated by the data value 1 of the BreakLight message although no brakes are applied (BrakePressure has the value 0). 4.4 Effect on Safety Based on Lack of Proper Security Protection As noted in Section 3, safety and security are tightly coupled in the vehicle setting. From our analysis, it is evident that security-related incidents can affect Table 1. Log entries with corresponding names and values Time 15.540518 15.540551 15.549263 15.549362 15.549659 15.549692 Chn ID Dir DLC Name FR FR FR FR FR FR 1A 1A 1A 1A 1A 1A 51 52 13 16 25 26 Tx Tx Tx Tx Tx Tx 16 16 16 16 16 16 BackLightInfo GearBoxInfo ABSInfo BreakControl EngineData EngineStatus Data 128 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 160 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 59 0 0 0 31 0 0 0 0 0 0 0 0 0 0 0 0 64 31 0 0 0 0 0 0 0 0 0 0 0 0 0 73 8 81 44 140 10 152 58 0 0 0 0 0 0 0 0 0000000000000000 90 D.K. 
Nilsson et al. Fig. 2. Spoof attack where the brake lights are lit up while the vehicle is accelerating and no brake pressure is applied the safety of the driver. Our example with the spoofed brake light message could cause other drivers to react based on the false message. Even worse, spoofed control messages could affect the control and maneuverability of the vehicle and cause serious injury to the drivers, passengers, and other road-users. It is therefore imperative that proper security protection is highly prioritized in future development of the protocol. 5 Conclusion and Future Work We have created and simulated attacks in the automotive communications protocol FlexRay and shown that such attacks can easily be created. In addition, we have discussed how safety is affected by the lack of proper security protection. These cyber attacks can target control and maneuverability ECUs in the in-vehicle network and lead to serious injury for the driver. The attacker actions are based on weaknesses in the FlexRay 2.1 revision A protocol specification. The security of the in-vehicle network must be taken into serious consideration when external access to these previously isolated networks is introduced. The next step is to investigate other types of attacks and simulate such attacks on the FlexRay bus. Then, based on the various attacks identified, a set of appropriate solutions for providing security features should be investigated. The most pertinent future work to be further examined are prevention and detection mechanisms. An analysis of how to secure the in-vehicle protocol should be performed, and the possibility to introduce lightweight mechanisms for data integrity and authentication protection should be investigated. Moreover, to A First Simulation of Attacks 91 detect attacks in the in-vehicle network a lightweight intrusion detection system needs to be developed. References 1. Miucic, R., Mahmud, S.M.: Wireless Multicasting for Remote Software Upload in Vehicles with Realistic Vehicle Movement. Technical report, Electrical and Computer Engineering Department, Wayne State University, Detroit, MI 48202 USA (2005) 2. Wolf, M., Weimerskirch, A., Paar, C.: Security in Automotive Bus Systems. In: Workshop on Embedded IT-Security in Cars, Bochum, Germany (November 2004) 3. Hoppe, T., Dittman, J.: Sniffing/Replay Attacks on CAN Buses: A simulated attack on the electric window lift classified using an adapted CERT taxonomy. In: Proceedings of the 2nd Workshop on Embedded Systems Security (WESS), Salzburg, Austria (2007) 4. Lang, A., Dittman, J., Kiltz, S., Hoppe, T.: Future Perspectives: The car and its IP-address - A potential safety and security risk assessment. In: The 26th International Conference on Computer Safety, Reliability and Security (SAFECOMP), Nuremberg, Germany (2007) 5. Nilsson, D.K., Larson, U.E.: Simulated Attacks on CAN Buses: Vehicle virus. In: Proceedings of the Fifth IASTED Asian Conference on Communication Systems and Networks (ASIACSN) (2008) 6. EASIS. Embedded Security for Integrated Safety Applications (2006), http://www.car-to-car.org/fileadmin/dokumente/pdf/security 2006/ sec 06 10 eyeman easis security.pdf 7. FlexRay Consortium. FlexRay Communications System Protocol Specification 2.1 Revision A (2005) (Visited August, 2007), http://www.softwareresearch.net/site/teaching/SS2007/ds/ FlexRay-ProtocolSpecification V2.1.revA.pdf 8. Luk, M., Mezzour, G., Perrig, A., Gligor, V.: MiniSec: A secure sensor network communication architecture. 
In: IPSN 2007: Proceedings of the 6th International Conference on Information Processing in Sensor Networks, pp. 479–488. ACM Press, New York (2007) 9. Perrig, A., Szewczyk, R., Wen, V., Culler, D.E., Tygar, J.D.: SPINS: Security protocols for sensor networks. In: Mobile Computing and Networking, pp. 189–199 (2001) 10. Karlof, C., Sastry, N., Wagner, D.: TinySec: A link layer security architecture for wireless sensor networks. In: SenSys 2004: Proceedings of the 2nd International Conference on Embedded Networked Sensor Systems, Baltimore, November 2004, pp. 162–175 (2004) 11. Vector Informatik. CANoe and DENoe 7.0 (2007) (Visited December, 2007), http://www.vector-worldwide.com/vi canoe en.html Wireless Sensor Data Fusion for Critical Infrastructure Security Francesco Flammini1,2, Andrea Gaglione2, Nicola Mazzocca2, Vincenzo Moscato2, and Concetta Pragliola1 1 ANSALDO STS - Ansaldo Segnalamento Ferroviario S.p.A. Business Innovation Unit Via Nuova delle Brecce 260, Naples, Italy {flammini.francesco,pragliola.concetta}@asf.ansaldo.it 2 Università di Napoli “Federico II” Dipartimento di Informatica e Sistemistica Via Claudio 21, Naples, Italy {frflammi,andrea.gaglione,nicola.mazzocca,vmoscato}@unina.it Abstract. Wireless Sensor Networks (WSN) are being investigated by the research community for resilient distributed monitoring. Multiple sensor data fusion has proven as a valid technique to improve detection effectiveness and reliability. In this paper we propose a theoretical framework for correlating events detected by WSN in the context of critical infrastructure protection. The aim is to develop a decision support and early warning system used to effectively face security threats by exploiting the advantages of WSN. The research addresses two relevant issues: the development of a middleware for the integration of heterogeneous WSN (SeNsIM, Sensor Networks Integration and Management) and the design of a model-based event correlation engine for the early detection of security threats (DETECT, DEcision Triggering Event Composer & Tracker). The paper proposes an overall system architecture for the integration of the SeNsIM and DETECT frameworks and provides example scenarios in which the system features can be exploited. Keywords: Critical Infrastructure Protection, Sensor Data Fusion, Railways. 1 Introduction Several methodologies (e.g. risk assessment [5]) and technologies (e.g. physical protection systems [4]) have been proposed to enhance the security of critical infrastructure systems. The aim of this work is to propose the architecture for a decision support and early warning system used to effectively face security threats (e.g. terrorist attacks) based on wireless sensors. Wireless sensors feature several advantages when applied to critical infrastructure surveillance [8], as they are: • Cheap, and this allows for fine grained and highly redundant configurations; • Resilient, due to their fault-tolerant mesh topology; • Power autonomous, due to the possibility of battery and photovoltaic energy supplies; E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 92–99, 2009. springerlink.com © Springer-Verlag Berlin Heidelberg 2009 Wireless Sensor Data Fusion for Critical Infrastructure Security 93 • Easily installable, due to their wireless nature and auto-adapting multi-hop routing; • Intelligent, due to the on-board processor and operating systems which allow for some data elaborations being performed locally. 
All these features support the use of WSN in highly distributed monitoring applications in critical environments. The example application we will refer to in this paper is railway infrastructure protection against external threats which can be natural (fire, flooding, landslide, etc.) or human-made malicious (sabotage, terrorism, etc.). Examples of useful sensors in this domain are listed in the following: smoke and heat – useful for fire detection; moisture and water – useful for flooding detection; Pressure – useful for explosion detection; movement detection (accelerometer or GPS based shifting measurement) – useful for theft detection or structural integrity checks; gas and explosive – useful for chemical or bombing attack detection; vibration and sound – useful for earthquake or crash detection. WSN could also be used for video surveillance and on-board intelligent video-analysis, as reported in [7]. Theoretically, any kind of sensor could be interfaced with a WSN, as it would just substitute the sensing unit of the so called “motes”. For instance, it would be useful (and apparently easy) to interface on WSN intrusion detection devices (like volumetric detectors, active infrared barriers, microphonic cables, etc.) in order to save on cables and junction boxes and exploit an improved resiliency and a more cohesive integration. With respect to traditional connections based on serial buses, wireless sensors are also less prone to tampering, when proper cryptographic protocols are adopted [6]. However, for some classes of sensors (e.g. radiation portals) some of the features of motes (e.g. size, battery power) would be lost. In spite of their many expected advantages, there are several technological difficulties (e.g. energy consumption) and open research issues associated with WSN. The issue addressed in this paper relates to the “data fusion” aspect. In fact, the heterogeneity of network topologies and measured data requires integration and analysis at different levels (see Fig. 1), as partly surveyed in reference [9]. As first, the monitoring of wide geographical areas and the diffusion of WSNs managed by different middlewares have highlighted the research problem of the integrated management of data coming from the various networks. Unfortunately such information is not available in a unique container, but in distributed repositories and the major challenge lies in the heterogeneity of repositories which complicates data management and retrieval processes. This issue is addressed by the SeNsIM framework [1], as described in Section 2. Secondly, there is the need for an on-line reasoning about the events captured by sensor nodes, in order to early detect and properly manage security threats. The availability of possibly redundant data allows for the correlation of basic events in order to increase the probability of detection, decrease the false alarm rate, warn the operators about suspect situations, and even automatically trigger adequate countermeasures by the Security Management System (SMS). This issue is addressed by the DETECT framework [2], as described in Section 3. The rest of the paper is organized as follows: Section 4 discusses about the SeNsIM and DETECT software integration; Section 5 introduces an example railway security application; Section 6 draws conclusions and hints about future developments. 94 F. Flammini et al. DB DETECT (reasoning) SeNsIM (integration) THREAT ROUTE SENSING POINTS WSN 1 ... SMS WSN N ... (a) (b) Fig. 1. 
(a) Distributed sensing in physical security; (b) Monitoring architecture 2 The SeNsIM Framework The main objectives of SeNsIM are: • To integrate information from distributed sensor networks managed by local middlewares (e.g. TinyDB); • To provide an unique interface for local networks by which a generic user can easily execute queries on specific sensor nodes; • To ensure system’s scalability in case of connection of new sensor networks. From an architectural point of view, the integration has been realized by exploiting the wrapper-mediator paradigm: when a sensor network is activated, an apposite wrapper agent aims at extracting its features and functionalities and to send (e.g. in a XML format) them to one or more mediator agents that are, in the opposite, responsible to drive the querying process and the communication with users. Thus, a query is first submitted through a user interface, and then analyzed by the mediator, converted in a standard XML format and sent to the apposite wrapper. The latter, in a first moment executes the translated query on its local network, by means of a low-layer middleware (TinyDB in the current implementation), and then retrieve the results to send (in a XML format) to the mediator, which show them to the user. According to the data model, the wrapper agent provides a local network virtualization in terms of objects, network and sensors. An object of the class Sensor can be associated to an object of Network type. Moreover, inside the same network one or more sensors can be organized into objects of Cluster or Group type. The state of a sensor can be modified by means of classical getting/setting functions, while the measured variables can be accessed using the sensing function. Fig.2 schematizes the levels of abstraction in the data management perspective provided by SeNsIM using TinyDB as low-level middleware layer and outlines the system architecture. The framework is described in more details in reference [1]. 3 The DETECT Framework Among the best ways to prevent attacks and disruptions is to stop any perpetrators before they strike. DETECT is a framework aimed at the automatic detection of threats Wireless Sensor Data Fusion for Critical Infrastructure Security 95 (a) (b) Fig. 2. (a) Levels of abstraction; (b) SeNsIM architecture against critical infrastructures, possibly before they evolve to disastrous consequences. In fact, non trivial attack scenarios are made up by a set of basic steps which have to be executed in a predictable sequence (with possible variants). Such scenarios must be precisely identified during the risk analysis process. DETECT operates by performing a model-based logical, spatial and temporal correlation of basic events detected by sensor networks, in order to “sniff” sequence of events which indicate (as early as possible) the likelihood of threats. In order to achieve this aim, DETECT is based on a real-time detection engine which is able to reason about heterogeneous data, implementing a centralized application of “data fusion”. The framework can be interfaced with or integrated in existing SMS systems in order to automatically trigger adequate countermeasures (e.g. emergency/crisis management). Attack scenarios are described in DETECT using a specific Event Description Language (EDL) and stored in a Scenario Repository. Starting from the Scenario Repository, one or more detection models are automatically generated using a suitable formalism (Event Graphs in the current implementation). In the operational phase, a 96 F. 
Flammini et al. model manager macro-module has the responsibility of performing queries on the Event History database for the real-time feeding of detection models according to predetermined policies. When a composite event is recognized, the output of DETECT consists of: the identifier(s) of the detected/suspected scenario(s); an alarm level, associated to scenario evolution (only used in deterministic detection as a linear progress indicator); a likelihood of attack, expressed in terms of probability (only used as a threshold in heuristic detection). DETECT can be used as an on-line decision support system, by alerting in advance SMS operators about the likelihood and nature of the threat, as well as an autonomous reasoning engine, by automatically activating responsive actions, including audio and visual alarms, unblock of exit turnstiles, air conditioned flow inversion, activation of sprinkles, emergency calls to first responders, etc. DETECT is depicted as a black-box in Fig. 3 and described in more details in [2]. Scenari o DETECT Engine Detected Attack Scenario Event History Criticality Level (1, 2, 3, ...) Fig. 3. The DETECT framework 4 Integration of SeNsIM and DETECT The SeNsIM and DETECT frameworks need to be integrated in order to obtain an online reasoning about the events captured by different WSNs. As mentioned above, the aim is to early detect and manage security threats against critical infrastructures. In this section we provide the description of the sub-components involved in the software integration of SeNsIM and DETECT. During the query processing task of SeNsIM, user queries are first submitted by means of a User Interface; then, a specific module (Query Builder) is used to build a query. The user queries are finally processed by means of a Query Processing module which sends the query to the appropriate wrappers. The partial and global query results are then stored in a database named Event History. All the results are captured and managed by a Results Handler, which implements the interface with wrappers. The Model Feeder is the DETECT component which performs periodic queries on the Event History to access primitive event occurrences. The Model Feeder instantiates the inputs of the Detection Engine according to the nature of the model(s). Therefore, the integration is straightforward and mainly consists in the management of the Event History as a shared database, written by the mediator and read by the Model Feeder according to an appropriate concurrency protocol. Wireless Sensor Data Fusion for Critical Infrastructure Security 97 In Figure 4 we report the overall software architecture as a result of the integration between SeNsIM and DETECT. The figure also shows the modules of SeNsIM involved in the query processing task. User interaction is only needed in the configuration phase, to define attack scenarios and query parameters. To this aim, a userfriendly unified GUIs will be available (as indicated in Fig.4). According to the query strategy, both SeNsIM and DETECT can access data from the lower layers using either a cyclic or event driven retrieval process. 
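A minimal sketch of the cyclic retrieval performed by the Model Feeder is given below; the relational schema of the Event History, the SQLite back-end and the feed_detection_model hook are assumptions made for illustration and may differ from the actual implementation:

```python
import sqlite3
import time

def feed_detection_model(sensor_id, event_code, timestamp):
    # Hypothetical hook: hand the primitive event to the detection model(s).
    print(f"feeding model: sensor={sensor_id} event={event_code} @ {timestamp}")

def poll_event_history(db_path: str, last_seen_id: int = 0, period_s: float = 1.0):
    """Cyclic retrieval of new primitive events written by the SeNsIM mediator.

    Assumed shared table: event_history(event_id, sensor_id, event_code, timestamp).
    """
    while True:
        with sqlite3.connect(db_path) as conn:
            rows = conn.execute(
                "SELECT event_id, sensor_id, event_code, timestamp "
                "FROM event_history WHERE event_id > ? ORDER BY event_id",
                (last_seen_id,),
            ).fetchall()
        for event_id, sensor_id, event_code, timestamp in rows:
            feed_detection_model(sensor_id, event_code, timestamp)
            last_seen_id = event_id
        time.sleep(period_s)
```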
Fig. 4. Query processing and software integration (only the modules used for the integration are shown; Query Processing is a macro-module containing several sub-modules; the GUI is used both to edit DETECT scenarios with a graphical formalism translatable to EDL files and to define SeNsIM queries for sensor data retrieval, either by cyclic polling or event-driven) 5 Example Application Scenario In this section we report an example application of the overall framework to the case study of a railway transportation system, an attractive target for thieves, vandals and terrorists. Several application scenarios can be envisaged with the proposed architecture: several wireless sensors (track line break detection, on-track obstacle detection, etc.) and actuators (e.g. virtual or light signalling devices) could be installed to monitor track integrity against external threats and to notify anomalies. In the following we describe how to detect a more complex scenario, namely a strategic terrorist attack. Let us suppose a terrorist decides to attack a high-speed railway line, which is completely supervised by a computer-based control system. A possible scenario consisting of multiple train halting and railway bridge bombing is reported in the following: 1. Artificial occupation (e.g. by using a wire) of the track circuits immediately after the location in which the trains need to be stopped (let us suppose a high bridge), in both directions. 2. Interruption of the railway power line, in order to prevent the trains from restarting using a staff responsible operating mode. 3. Bombing of the bridge shafts by remotely activating the already positioned explosive charges. Variants of this scenario exist: for instance, trains can be (less precisely) stopped by activating jammers to disturb the wireless communication channel used for radio signalling, or the attack can start from point (2) (but this would be even less precise). The described scenario could be identified early by detecting the abnormal events reported in point (1) and activating proper countermeasures. By using proper on-track sensors it is possible to monitor the abnormal occupation of track circuits, and a possible countermeasure consists in immediately sending an unconditional emergency stop message to the train. This would prevent the terrorist from stopping the train at the desired location and therefore halt the evolution of the attack scenario. Even though the detection of events in points (2) and (3) would happen too late to prevent the disaster, it could be useful to achieve a greater situational awareness about what is happening in order to rationalize the intervention of first responders.
Now, let us formally describe the scenario using wireless sensors and detected events, using the notation “sensor description (sensor ID) :: event description (event ID)”: FENCE VIBRATION DETECTOR (S1) :: POSSIBLE ON TRACK INTRUSION (E1) TRACK CIRCUIT X (S2) :: OCCUPATION (E2) LINESIDE TRAIN DETECTOR (S3) :: NO TRAIN DETECTED (E3) TRACK CIRCUIT Y (S4) :: OCCUPATION (E4) LINESIDE TRAIN DETECTOR (S5) :: NO TRAIN DETECTED (E5) VOLTMETER (S6) :: NO POWER (E6) ON-SHAFT ACCELEROMETER (S7) :: STRUCTURAL MOVEMENT (E7) Due to the integration middleware made available by SeNsIM, these events are not required to be detected on the same physical WSN, but they just need to share the same sensor group identifier at the DETECT level. Event (a) is not mandatory, as the detection probability is not 100%. Please not that each of the listed events taken singularly would not imply a security anomaly or be a reliable indicator of it. The EDL description of the above scenario is provided in the following (in the assumption of unique event identifiers): (((E1 SEQ ((E2 AND E3) OR (E4 AND E5))) OR ((E2 AND E3) AND (E4 AND E5))) SEQ E6 ) SEQ E7 Top-down and left to right, using 4 levels of alarm severity: a) E1 can be associated a level 1 warning (alert to the security officer); b) The composite events determined by the first group of 4 operators and the second group of 3 operators can be both associated a level 2 warning (triggering the unconditional emergency stop message); c) The composite event terminating with E6 can be associated a level 3 warning (switch on back-up power supply, whenever available) d) The composite event terminating with E7 (complete scenario) can be associated a level 4 warning (emergency call to first responders). Wireless Sensor Data Fusion for Critical Infrastructure Security 99 In the design phase, the scenario is represented using Event Trees and stored in the Scenario Repository of DETECT. In the operational phase, SeNsIM records the sequence of detected events in the Event History. When the events corresponding to the scenario occur, DETECT provides the scenario identifier and the alarm level (with a likelihood index in case of non deterministic detection models). Pre-configured countermeasures can then be activated by the SMS on the base of such information. 6 Conclusions and Future Works Wireless sensors are being investigated in several applications. In this paper we have provided the description of a framework which can be employed to collect and analyze data measured by such heterogeneous sources in order to enhance the protection of critical infrastructures. One of the research threads points at connecting by WSN traditionally wired sensors and application specific devices, which can serve as useful information sources for a superior situational awareness in security critical applications (like in the example scenario provided above). The verification of the overall system is also a delicate issue which can be addressed using the methodology described in [3]. We are currently developing the missing modules of the software system and testing the already available ones in a simulated environment. The next step will be the interfacing with a real SMS for the on-the-field experimentation. References 1. Chianese, A., Gaglione, A., Mazzocca, N., Moscato, V.: SeNsIM: a system for Sensor Networks Integration and Management. In: Proc. Intl. Conf. on Intelligent Sensors, Sensor Networks and Information Processing (ISSNIP 2008) (to appear, 2008) 2. 
Flammini, F., Gaglione, A., Mazzocca, N., Pragliola, C.: DETECT: a novel framework for the detection of attacks to critical infrastructures. In: Proc. European Safety & Reliability Conference (ESREL 2008) (to appear, 2008) 3. Flammini, F., Mazzocca, N., Orazzo, A.: Automatic instantiation of abstract tests to specific configurations for large critical control systems. In: Journal of Software Testing, Verification & Reliability (Wiley) (2008), doi:10.1002/stvr.389 4. Garcia, M.L.: The Design and Evaluation of Physical Protection Systems. ButterworthHeinemann, USA (2001) 5. Lewis, T.G.: Critical Infrastructure Protection in Homeland Security: Defending a Networked Nation. John Wiley, New York (2006) 6. Perrig, A., Stankovic, J., Wagner, D.: Security in Wireless Sensor Networks. Communications of the ACM 47(6), 53–57 (2004) 7. Rahimi, M., Baer, R., et al.: Cyclops: In situ image sensing and interpretation in wireless sensor networks. In: Proc. 3rd ACM Conference on Embedded Networked Sensor Systems (SenSys 2005) (2005) 8. Roman, R., Alcaraz, C., Lopez, J.: The role of Wireless Sensor Networks in the area of Critical Information Infrastructure Protection. Inf. Secur. Tech. Rep. 12(1), 24–31 (2007) 9. Wang, M.M., Cao, J.N., Li, J., Dasi, S.K.: Middleware for Wireless Sensor Networks: A Survey. J. Comput. Sci. & Technol. 23(3), 305–326 (2008) Development of Anti Intruders Underwater Systems: Time Domain Evaluation of the Self-informed Magnetic Networks Performance Osvaldo Faggioni1, Maurizio Soldani1, Amleto Gabellone2, Paolo Maggiani3, and Davide Leoncini4 1 INGV Sez. ROMA2, Stazione di Geofisica Marina, Fezzano (SP), Italy {faggioni,soldani}@ingv.it 2 CSSN ITE, Italian Navy, Viale Italia 72, Livorno, Italy amleto.gabellone@marina.difesa.it 3 COMFORDRAG, Italian Navy, La Spezia, Italy paolov.maggiani@marina.difesa.it 4 DIBE, University of Genoa, Genova, Italy davide.leoncini@unige.it Abstract. This paper shows the result obtained during the operative test of an anti-intrusion undersea magnetic system based on a magnetometers’ new self-informed network. The experiment takes place in a geomagnetic space characterized by medium-high environmental noise with a relevant human origin magnetic noise component. The system has two different input signals: the magnetic background field (natural + artificial) and a signal composed by the magnetic background field and the signal due to the target magnetic field. The system uses the first signal as filter for the second one to detect the target magnetic signal. The effectiveness of the procedure is related to the position of the magnetic field observation points (reference devices and sentinel devices). The sentinel devices must obtain correlation in the noise observations and de-correlations in the target signal observations. The system, during four tries of intrusion, has correctly detected all magnetic signals generated by divers. Keywords: Critical Systems, Port Protection, Magnetic Systems. 1 Introduction The recent evolution of the world strategic scenarios is characterized by a change of the first threat type: from the military high power attack to terrorist attack. In these new conditions our submarine areas control systems must redesigned to obtain the capability of detecting of very small targets closer to the objective. 
Acoustic systems, the basis of current and probably future port protection options, give very good results in the control of large volumes of water, but they fail in high-definition control tasks such as the control of the port sea bottom and of the water volumes close to the docks. Magnetic detection is therefore a very interesting option to make Anti Intruders Port Systems more effective. The magnetic method performs very well in proximity detection but loses effectiveness over large water volumes, whereas the acoustic method performs well in free water but fails in the sea bottom surface areas and, more importantly, in the proximity of the docks. The integration of these two methods gives the MAC (Magnetic Acoustic) class of composite underwater alert systems. Past studies on magnetic detection showed a phenomenological limitation of its effectiveness, due to the very high interference between the target magnetic signal (a diver is a low-power source) and the environmental magnetic field. The geomagnetic field is the convolution of several elementary contributions, originated by sources external or internal to the planet, both static and dynamic; moreover, in areas with human activity there are transient magnetic signals with a very large band and high amplitude variations. Classically, target detection is attempted by classifying the elementary signals of the magnetic field, tentatively associating them with their hypothetical sources, and separating the target signal from the noise: essentially the use of numerical frequency-filtering techniques (LP, HP, BP) whose cut-off frequencies are defined from empirical experience. The result is a subjective method, dependent on the decisions of the operators (or of the array/chain designers), capable of great successes as well as strong failures. The effectiveness of the method is low, and its development for the detection of small magnetic signals was therefore neglected [1-3]. Increasing the sensitivity of the devices does not solve the problem either, because the problem is related to the informative capability of the magnetograms with respect to the target signal, not to the sensitivity of the measurements. In the present paper we show the results obtained with a new approach to this problem: instead of statistical-conjectural frequency filters, we use a reference magnetometer to inform the sentinel magnetometer about the noise field without the target signal; the system uses the reference magnetogram RM as a time-domain (or frequency-domain) filter function for the sentinel magnetogram SM. The result of the de-convolution RM-SM is the target signal. The critical point of the system is the design of the network: to obtain an effective magneto-detection system, the devices must be placed at a distance that gives amplitude correlation in the noise measurements and de-correlation in the target measurement. 2 On the System Design Options and Metrological Response To build the self-informed magnetic system we propose two design solutions for the geometry of the device network: the Referred Integrated MAgnetic Network (RIMAN) and the Self-referred Integrated MAgnetic Network (SIMAN) [4]. 2.1 RIMAN System The RIMAN system consists of a magnetometer array and an identical stand-alone magnetometer (referring node) deployed within the protected area.
2.2 SIMAN System

In the SIMAN system all the array's magnetometers are used to obtain the zero-level condition. The control unit checks in sequence the zero-level condition between each pair of magnetometers and signals any non-zero differential function.

Fig. 2. Scheme of the Self-referred Integrated Magnetometers Array Network

2.3 Signal Processing

Signal processing in the SIMAN system also gives very good accuracy. The only issue is an ambiguity when the target crosses a pair of magnetometers at the same distance from both; this ambiguity can be solved by evaluating the differential functions between the adjacent nodes. The drawback is that a SIMAN system requires a continuous second-order check at all the nodes. The advantage of a SIMAN system is the possibility of covering an unlimited area, since the stability condition is required only for each pair of the array's magnetometers.

3 Results

The experiment consists of recording the magnetic field variations during multiple runs performed by a team of divers. The test system is the elementary cell of the SIMAN class [5, 6]. The devices are two tri-axial fluxgate magnetometers, positioned at a water depth of 12 meters and at an initial mutual distance of 12 meters; the computational procedure is based on the measurement of the vector component Z (variations of the vector direction over time are not considered) (Fig. 3).

Fig. 3. Experiment configuration
The diver's runs were performed at zero CPA (Closest Point of Approach), at about 1 meter from the bottom, along 50-meter tracks centered on each magnetometer. The runs' bearings were approximately E-W and W-E. The two magnetometers were cable-connected, through their respective electronic control devices (deployed 1 meter from the sensors), to a data-logger station placed on the beach at a distance of about 150 meters. The environmental noise of the geomagnetic space of the measurement area is classified as medium-high and is characterized by contributions of human origin coming from city noise, industrial noise (an electrical power production plant), electrical railway noise, maritime commercial traffic, port activity traffic, etc. (all at distances < 8 km). The target source for the experiment was a diver equipped with a standard air-bottle system.

The result of the experiment is shown in the diagrams of figure 4. The magnetogram of the sentinel device (Fig. 4A) is characterized by 5 impulsive magnetic anomalies. These signals are compatible with the magnetic signal of our target: the signals marked in figure 4 as 1, 2, 4 and 5 have a very clear impulsive origin (monopolar or dipolar geometry), while signal 3 has a geometrical character that is not fully impulsive, so the operator (human or automatic) proposes the classification in Table 1. The availability of the reference magnetogram, and its preliminary (subjective) inspection, already reveals the fatal error on signal number 2: this signal appears in both magnetograms A and B because it is not related to a diver crossing in the proximity of the sentinel device but has very large spatial coherence (i.e., it is noise).

Table 1. Alarm classification table

Signal     1     2     3        4     5
Geometry   MONO  MONO  COMPLEX  MONO  DIPOLE
Alarm      YES   YES   YES      YES   YES
Uncertain  NO    NO    YES      NO    NO

Fig. 4. Magnetograms obtained by the SIMAN elementary cell devices

Fig. 5. SIMAN elementary cell self-informed magnetogram

Figure 5 shows the result obtained by the time-domain de-convolution B-A. This is the numerical procedure that gives a quantitative evaluation of the presence of impulsive signals in the proximity of the sentinel device. The SIMAN magnetogram correctly reproduces the real sequence of diver crossings in the proximity of the sentinel device: signal 2 is declassified, and signal 3 is classified as a dipolar target signal without uncertainty, because the de-convolution procedure cleans the complex signal of its superimposed noise component. The classification table becomes Table 2. The SIMAN classification of impulsive signals (Table 2) corresponds fully to the four crossings of the diver: signals 1 and 3 for the first course (E-W, W-E directions) and signals 4 and 5 for the second course (E-W, W-E directions).

Table 2. Alarm classification table after the time de-convolution B-A

Signal     1     3       4     5
Geometry   MONO  DIPOLE  MONO  DIPOLE
Alarm      YES   YES     YES   YES
Uncertain  NO    NO      NO    NO

Fig. 6. Comparison of the self-referred system and classical-technique performances

The effectiveness of the system (in either the RIMAN or the SIMAN option) depends strongly on the position of the reference device. The reference magnetometer must be far enough from the sentinel to obtain good de-correlation of the target signal, but close enough to obtain good correlation of the noise. This condition makes the reference magnetogram act as a filter function that cuts off the noise while remaining transparent to the target signal.
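In the time domain, the de-convolution used here reduces to a sample-by-sample differencing of the reference and sentinel magnetograms. The sketch below, built on entirely synthetic data with hypothetical amplitudes, shows why an anomaly with large spatial coherence (such as signal 2 above) is suppressed while an impulse local to the sentinel survives.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000
t = np.arange(n)

def impulse(center, amp, width=30):
    return amp * np.exp(-0.5 * ((t - center) / width) ** 2)

# Common geomagnetic background plus a wide-coherence disturbance seen by BOTH sensors.
background = np.cumsum(rng.normal(size=n)) / 30.0
coherent_disturbance = impulse(center=800, amp=4.0, width=80)

# Sentinel magnetogram A: background + coherent disturbance + local diver impulse.
A = background + coherent_disturbance + impulse(center=1500, amp=2.5)
# Reference magnetogram B: background + coherent disturbance only.
B = background + coherent_disturbance

# Self-informed magnetogram: time-domain differencing B - A (reference minus sentinel).
self_informed = B - A

threshold = 1.0  # hypothetical alarm threshold
alarm_samples = np.where(np.abs(self_informed) > threshold)[0]
if alarm_samples.size:
    print(f"local impulse detected around sample {int(alarm_samples.mean())}")
else:
    print("no local impulse detected")
```

The disturbance common to both sensors cancels in B - A, while the impulse local to the sentinel remains: this is the mechanism by which signal 2 is declassified in Table 2. The cancellation clearly works only as long as the noise observed at the two positions stays correlated, which is the RD-SD distance issue examined next.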
In effect, the kernel of the self-information procedure is the RD-SD distance. We now analyze the numerical quality of the self-informed system's performance with reference to the performance of a classical LP noise-cleaning procedure and to variations of the distance RD-SD between reference devices and sentinel devices (Fig. 6). The diagrams in figure 6 show the elaboration of the same data subset (source magnetograms in figure 4): diagram A corresponds to the pure extraction of the subset from the magnetograms in figure 4 (best RD-SD distance), B is the LP filter (the typical informative approach), C is the self-informed response with too short an RD-SD distance, and D is the self-informed response with too large an RD-SD distance.

3.1 Comparison A/B

The metrological and numerical effectiveness of a magnetic detection procedure is defined by the increase of the informative capability of the magnetogram with respect to the target signal. We define, quantitatively, the Informative Capability IC of a composite signal, for each of its "i" elementary harmonic components, as the ratio between the energy of component i and the total energy of the composite signal:

ICi = Ei / (E1 + … + En)    (1)

In condition A, SIMAN produces a very high IC for impulse 1 and a strong reduction of the IC of impulse 2; SIMAN therefore permits the correct classification of impulse 1 as the target and of impulse 2 as environmental noise. In diagram B, the classical LP filtering technique produces, even with the best choice of cut frequency, a good cleaning of the high-frequency components, but the IC of impulse 1 is lost because of the energy of impulse 2 (which survives the filter). The numerical value of IC2 is more or less the same as that of IC1, so the second impulse is classified as a target: an egregious false alarm.

3.2 Comparison A/C/D

The comparison A/C/D (Fig. 6) shows that the RD-SD distance is the kernel of the detection effectiveness of the self-informed magnetic system. In the present case the A distance is 11 m, the C distance 9 m, and the D distance 13 m (±0.5 m). The C RD-SD distance produces a very good suppression of the high-frequency noise, but it also loses the target signal, because the target is present both in the sentinel magnetogram and in the reference magnetogram (the devices are too close). The Q of the target is drastically reduced and SIMAN loses its detection capability. In condition D the distance is too large: the noise at the geomagnetic position of the reference device is no longer correlated with the noise at the sentinel device, so the SIMAN procedure adds the noise of the reference magnetogram to the noise of the sentinel magnetogram, with a large increase of the total energy ETOT of the self-informed signal; in this condition too, the Q of the target is lost.
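As a numeric illustration of Eq. (1) in Sect. 3.1, the sketch below computes the informative capability of two impulses in a synthetic magnetogram. The band energies are approximated here by the energy inside a time window around each impulse, which is a deliberate simplification of the harmonic decomposition used in the paper; the signal, windows and amplitudes are all invented for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
fs = 10.0                       # hypothetical sampling rate, Hz
t = np.arange(0, 600, 1 / fs)

def impulse(center, amp, width=5.0):
    return amp * np.exp(-0.5 * ((t - center) / width) ** 2)

# Synthetic magnetogram: two impulses plus broadband noise.
signal = impulse(150, 2.0) + impulse(400, 1.8) + 0.3 * rng.normal(size=t.size)

def window_energy(x, center, half_width=30.0):
    mask = np.abs(t - center) < half_width
    return float(np.sum(x[mask] ** 2))

E1 = window_energy(signal, 150)          # energy attributed to impulse 1
E2 = window_energy(signal, 400)          # energy attributed to impulse 2
E_residual = float(np.sum(signal ** 2)) - E1 - E2
total = E1 + E2 + E_residual

# Eq. (1): ICi = Ei / (E1 + ... + En)
for name, Ei in (("impulse 1", E1), ("impulse 2", E2), ("residual noise", E_residual)):
    print(f"IC({name}) = {Ei / total:.3f}")
```

A processing step (LP filtering or self-informed differencing) is effective, in the paper's terms, exactly when it raises the IC of the target impulse and lowers the IC of the noise components.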
4 Conclusion

The field test of the self-informed magnetic detection system SIMAN shows a relevant growth of effectiveness in the detection of weak undersea magnetic sources (divers) with respect to the classical techniques used in magnetic port-protection networks. The full success of the SIMAN barrier in detecting the diver's attempted crossings of our experimental device barrier is due to the use of the magnetogram not perturbed by the target magnetic signal as a time-domain filter for the perturbed magnetogram. This approach is quantitative and objective, and it increases the target-signal informative capability IC without empirical and subjective signal-cleaning techniques such as the well-known frequency LPF. This performance of SIMAN depends on the accuracy of the choice of the distance between the acquisition points of the magnetograms: a good definition of this distance produces the best condition of noise spatial correlation and target-signal spatial de-correlation. In the magnetic environment of our experiment, a displacement of the reference magnetometer of one meter from its best position can compromise the results and cancel the effectiveness of SIMAN's sensitivity. Deployed on the sea bottom with a good geometry, our test system produced very high detection performance irrespective of the time variations of the magnetic noise.

Acknowledgments. The study of self-informed undersea magnetic networks was launched and developed by Ufficio Studi e Sviluppo of COMFORDRAG – Italian Navy. This experiment was supported by the NATO Undersea Research Centre.

References

1. Berti, G., Cantini, P., Carrara, R., Faggioni, O., Pinna, E.: Misure di Anomalie Magnetiche Artificiali. Atti X Conv. Gr. Naz. Geof. Terra Solida, Roma, pp. 809–814 (1991)
2. Faggioni, O., Palangio, P., Pinna, E.: Osservatorio Geomagnetico Stazione Baia Terra Nova: Ricostruzione Sintetica del Magnetogramma 01.00(UT)GG1.88:12.00(UT)GG18.88. Atti X Conv. Gr. Naz. Geof. Terra Solida, Roma, pp. 687–706 (1991)
3. Cowan, D.R., Cowan, S.: Separation Filtering Applied to Aeromagnetic Data. Exploration Geophysics 24, 429–436 (1993)
4. Faggioni, O.: Protocollo Operativo CF05EX, COMFORDRAG, Ufficio Studi e Sviluppo, Italian Navy, Official Report (2005)
5. Gabellone, A., Faggioni, O., Soldani, M.: CAIMAN (Coastal Anti Intruders MAgnetic Network) Experiment, UDT Europe – Naples, Topic: Maritime Security and Force Protection (2007)
6. Gabellone, A., Faggioni, O., Soldani, M., Guerrini, P.: CAIMAN (Coastal Anti Intruder MAgnetometers Network). In: RTO-MP-SET-130 Symposium on NATO Military Sensing, Orlando, Florida, USA, March 12–14 (classified, 2008)

Monitoring and Diagnosing Railway Signalling with Logic-Based Distributed Agents

Viviana Mascardi1, Daniela Briola1, Maurizio Martelli1, Riccardo Caccia2, and Carlo Milani2
1 DISI, University of Genova, Italy {Viviana.Mascardi,Daniela.Briola,Maurizio.Martelli}@disi.unige.it
2 IAG/FSW, Ansaldo Segnalamento Ferroviario S.p.A., Italy {Caccia.Riccardo,Milani.Carlo}@asf.ansaldo.it

Abstract. This paper describes an ongoing project that involves DISI, the Computer Science Department of Genova University, and Ansaldo Segnalamento Ferroviario, the Italian leader in the design and construction of signalling and automation systems for railway lines. We are implementing a multiagent system that monitors processes running in a railway signalling plant, detects functioning anomalies, provides diagnoses for explaining them, and promptly notifies problems to the Command and Control System Assistance. Due to the intrinsic rule-based nature of monitoring and diagnostic agents, we have adopted a logic-based language for implementing them.

1 Introduction

According to the well-known definition of N. Jennings, K. Sycara and M. Wooldridge, an agent is "a computer system, situated in some environment, that is capable of flexible autonomous action in order to meet its design objectives" [1]. Distributed diagnosis and monitoring represent one of the oldest application fields of declarative software agents. For example, ARCHON (ARchitecture for Cooperative Heterogeneous ON-line systems [2]) was Europe's largest ever project in the area of Distributed Artificial Intelligence and exploited rule-based agents.
In [3], Schroeder et al. describe a diagnostic agent based on extended logic programming. Many other multiagent systems (MASs) for diagnosis and monitoring based on declarative approaches have been developed in the past [4] ,[5], [6], [7]. One of the most important reasons for exploiting agents in the monitoring and diagnosis application domain is that an agent-based distributed infrastructure can be added to any existing system with minimal or no impact over it. Agents look at the processes they must monitor, be they computer processes, business processes, chemical processes, by “looking over their shoulders” without interfering with their activities. The “no-interference” feature has an enormous importance, since changing the processes in order to monitor them would be often unfeasible. Also the increase of situational awareness, essential for coping with the situational complexity of most large real applications, motivates an agent-oriented approach. Situational awareness is a mandatory feature for the successful monitoring and decision-making in many scenarios. When combined with reactivity, situatedness may E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 108–115, 2009. springerlink.com © Springer-Verlag Berlin Heidelberg 2009 Monitoring and Diagnosing Railway Signalling 109 lead to the early detection of, and reaction to, anomalies. The simplest and most natural form of reasoning for producing diagnoses starting from observations is based on rules. This paper describes a joint academy-industry project for monitoring and diagnosing railway signalling with distributed agents implemented in Prolog, a logic programming language suitable for rule-based reasoning. The project involves the Computer Science Department of Genova University, Italy, and Ansaldo Segnalamento Ferroviario, a company of Ansaldo STS group controlled by Finmeccanica, the Italian leader in design and construction of railway signalling and automation systems. The project aims to develop a monitoring and diagnosing multiagent system implemented in JADE [8] extended with the tuProlog implementation of the Prolog language [9] by means of the DCaseLP libraries [10]. The paper is structured in the following way: Section 2 describes the scenario where the MAS will operate, Section 3 describes the architecture of the MAS and the DCaseLP libraries, Section 4 concludes and outlines the future directions of the project. 2 Operating Scenario The Command and Control System for Railway Circulation (“Sistema di Comando e Controllo della Circolazione Ferroviaria”, SCC) is a framework project for the technological development of the Italian Railways (“Ferrovie dello Stato”, FS), with the following targets: • introducing and extending automation to the command and control of railway circulation over the principal lines and nodes of the FS network; • moving towards innovative approaches for the infrastructure and process management, thanks to advanced monitoring and diagnosis systems; • improving the quality of the service offered to the FS customers, thanks to a better regularity of the circulation and to the availability of more efficient services, like delivery of information to the customers, remote surveillance, and security. The SCC project is based on the installation of centralized Traffic Command and Control Systems, able to remotely control the plants located in the railway stations and to manage the movement of trains from the Central Plants (namely, the offices where instances of the SCC system are installed). 
In this way, Central Plants become command and control centres for railway lines that represent the main axes of circulation, and for railway nodes with high traffic volumes, corresponding to the main metropolitan and intermodal nodes of the FS network. An element that strongly characterizes the SCC is the strict coupling of functionalities for circulation control and functionalities for diagnosis and support to upkeep activities, with particular regard to predictive diagnostic functionalities aimed to enable on-condition upkeep. The SCC of the node of Genova, which will be employed as a case-study for the implementation of the first MAS prototype, belongs to the first six plants developed by Ansaldo Segnalamento Ferroviario. They are independent one from the other, but networked by means of a WAN, and cover 3000 Km 110 V. Mascardi et al. of the FS network. The area controlled by the SCC of Genova covers 255 km, with 28 fully equipped stations plus 20 stops. The SCC can be decomposed, both from a functional and from an architectural point of view, into five subsystems. The MAS that we are implementing will monitor and diagnose critical processes belonging to the Circulation subsystem whose aims are to implement remote control of traffic and to make circulation as regular as possible. The two processes we have taken under consideration are Path Selection and Planner. The Planner process is the back-end elaboration process for the activities concerned with Railway Regulation. There is only one instance of the Planner process in the SCC, running on the server. It continuously receives information on the position of trains from sensors located in the stations along the railway lines, checks the timetable, and formulates a plan for ensuring that the train schedule is respected. A plan might look like the following: “Since the InterCity (IC) train 5678 is late, and IC trains have highest priority, and there is the chance to reduce the delay of 5678 if it overtakes the regional train 1234, then 1234 must stop on track 1 of Arquata station, and wait for 5678 to overtake it”. Plans formulated by the Planner may be either confirmed by the operators working at the workstations or modified by them. The process that allows operators to modify the Planner's decision is Path Selection. The Path Selection process is the front-end user interface for the activities concerned with Railway Regulation. There is one Path Selection process running on each workstation in the SCC. Each operator interacts with one instance of this process. There are various operators responsible for controlling portions of the railway line, and only one senior operator with a global view of the entire area controlled by the SCC. The senior operator coordinates the activities of the other operators and takes the final decisions. The Path Selection process visualizes decisions made by the Planner process and allows the operator to either confirm or modify them. For example, the Planner might decide that a freight train should stop on track 3 of Ronco Scrivia station in order to allow a regional train to overtake it, but the operator in charge for that railway segment might think that track 4 is a better choice for making the freight train stop. In this case, after receiving an “ok” from the senior operator, the operator may input its choice thanks to the interface offered by the Path Selection process, and this choice overcomes the Planner's decisions. 
The Planner will re-plan its decisions according to the new input given by the human operator. The SCC Assistance Centre, that provides assistance to the SCC operators in case of problems, involves a large number of domain experts, and is always contacted after the evidence of a malfunctioning. The cause of the malfunctioning, however, might have generated minutes, and sometimes hours, before that the SCC operator(s) experienced the malfunctioning. By integrating a monitoring and diagnosing MAS to the circulation subsystem, we aim to equip any operator of the Central Plant with the means for early detecting anomalies that, if reported in a short time, and before their effects have propagated to the entire system, may allow the prevention of more serious problems including circulation delays. When a malfunctioning is detected, the MAS, besides alerting the SCC operator, will always alert the SCC Assistance Centre in an automatic way. This will not only speed up the implementation of repair actions, but also allow the SCC Assistance Centre with recorded traces of what happened in the Central Plant. Monitoring and Diagnosing Railway Signalling 111 3 Logic-Based Monitoring Agents In order to provide the functionalities required by the operating scenario, we designed a MAS where different agents exist and interact. The MAS implementation is under way but its architecture has been completely defined, as well as the tools and languages that will be exploited for implementing a first prototype. The following sections provide a brief overview of DCaseLP, the multi-language prototyping environment that we are going to use, and of the MAS architecture and implementation. 3.1 DCaseLP: A Multi-language Prototyping Environment for MASs DCaseLP [10] stands for Distributed Complex Applications Specification Environment based on Logic Programming. Although initially born as a logic-based framework, as the acronym itself suggests, DCaseLP has evolved into a multi-language prototyping environment that integrates both imperative (object-oriented) and declarative (rule-based and logic-based) languages, as well as graphical ones. The languages and tools that DCaseLP integrates are UML and an XML-based language for the analysis and design stages, Java, JESS and tuProlog for the implementation stage, and JADE for the execution stage. Software libraries for translating UML class diagrams into code and for integrating JESS and tuProlog agents into the JADE platform are also provided. Both the architecture of the MAS, and the agent interactions taking place there, are almost simple. On the contrary, the rules that guide the behaviour of agents are sophisticated, leading to implement agents able to monitor running processes and to quickly diagnose malfunctions. For these reasons, among the languages offered by DCaseLP, we have chosen to use tuProlog for implementing the monitoring agents, and Java for implementing the agents at the lowest architectural level, namely those that implement the interfaces to the processes. In this project we will take advantage neither of UML nor of XML, which prove useful for specifying complex interaction protocols and complex system architectures. Due to space constraints, we cannot provide more details on DCaseLP. Papers describing its usage, as well as manuals, tutorials, and the source code of the libraries it provides for integrating JESS and tuProlog into JADE can be downloaded from http://www.disi.unige.it/person/MascardiV/Software/DCaseLP.html. 
3.2 MAS Architecture The architecture of the MAS is depicted in Figure 1. There are four kinds of agent, organized in a hierarchy: Log Reader Agents, Process Monitoring Agents, Computer Monitoring Agents, and Plant Monitoring Agents. Agents running on remote computers are connected via a reliable network whose failures are quickly detected and solved by an ad hoc process (already existing, out of the MAS). If the network becomes unavailable for a short time, the groups of agents running on the same computer can go on with their local work. Messages directed to remote agents are saved in a local buffer, and are sent as soon as the network comes up again. The human operator is never alerted by the MAS about a network failure, and local diagnoses continue to be produced. 112 V. Mascardi et al. Fig. 1. The system architecture Log Reader Agent. In our MAS, there is one Log Reader Agent (LRA) for each process that needs to be monitored. Thus, there may be many LRAs running on the same computer. Once every m minutes, where m can be set by the MAS configurator, the LRA reads the log file produced by the process P it monitors, extracts information from it, produces a symbolic representation of the extracted information in a format amenable of logic-based reasoning, and sends the symbolic representation to the Process Monitoring Agent in charge of monitoring P. Relevant information to be sent to the Process Monitoring Agent include loss of connection to the net and life of the process. The parameters, and corresponding admissible values, that an LRA associated with the “Path Selection” process extracts from its log file, include connection_to_server (active, lost); answer_to_life (ready, slow, absent); cpu_usage (normal, high); memory_usage (normal, high); disk_usage (normal, high); errors (absent, present). The answer_to_life parameter corresponds to the time required by a process to answer a message. A couple of configurable thresholds determines the value of this paramenter: time < threshold1 implies answer_to_life ready; threshold1 <= time < threshold2 implies answer_to_life slow; time >= threshold2 implies answer_to_life absent. In a similar way, the parameters and corresponding admissible values that an LRA associated with the “Planner” process extracts from its log file include connection_to_client (active, lost); computing_time (normal, high); managed_conflicts (normal, high); managed_trains (normal, high); answer_to_life, cpu_usage, memory_usage, disk_usage, errors, in the same way as the “Path Selection” process. LRAs have a time-driven behaviour, do not perform any reasoning over the information extracted from the log file, and are neither proactive, nor autonomous. They may be considered very simple agents, mainly characterized by their social ability, employed to decouple the syntactic processing of the log files from their semantic Monitoring and Diagnosing Railway Signalling 113 processing, entirely demanded to the Process Monitoring Agents. In this way, if the format of the log produced by process P changes, only the LRA that reads that log needs to be modified, with no impact on the other MAS components. Process Monitoring Agent. 
Process Monitoring Agents (PMAs) are in a one-to-one correspondence with LRAs: the PMA associated with process P receives the information sent by the LRA associated with P, looks for anomalies in the functioning of P, provides diagnoses for explaining their cause, reports them to the Computer Monitoring Agent (CMA), and in case kills and restarts P if necessary. It implements a sort of social, context-aware, reactive and proactive expert system characterized by rules like: • if the answer to life of process P is slow, and it was slow also in the previous check, then there might be a problem either with the network, or with P. The PMA has to inform the CMA, and to wait for a diagnosis from the CMA. If the CMA answers that there is no problem with the network, then the problem concerns P. The action to take is to kill and restart P. • if the answer to life of process P is absent, then the PMA has to inform the CMA, and to kill and restart P. • if the life of process P is right, then the PMA has to do nothing. Computer Monitoring Agent. The CMA receives all the messages arriving from the PMAs that run on that computer, and is able to monitor parameters like network availability, CPU usage, memory usage, hard disk usage. The messages received from PMAs together with the values of the monitored parameters allow the CMA to make hypotheses on the functioning of the computer where it is running, and of the entire plant. For achieving its goal, the CMA includes rules like: • if only one PMA reported problems local to the process it monitors, then there might be temporary problems local to the process. No action needs to be taken. • if more than one PMA reported problems local to the process it monitors, and the CPU usage is high, then there might be problems local to this computer. The action to take is to send a message to the Plant Monitoring Agent and to make a pop-up window appear on the computer monitor, in order to alert the operator working with this computer. • if more than one PMA reported problems due either to the network, or to the server accessed by the process, and the network is up, then there might be problems to the server accessed by the process. The action to take is to send a message to the Plant Monitoring Agent to alert it and to make a pop-up window appear on the computer monitor. If necessary, the CMA may ask for more information to the PMA that reported the anomaly. For example, it may ask to send a detailed description of which hypotheses led to notify the anomaly, and which rules were used. Plant Monitoring Agent. There is one Plant Monitoring Agent (PlaMA) for each plant. The PlaMA receives messages from all the CMAs in the plant and makes diagnoses and decisions according to the information it gets from them. It implements rules like: 114 V. Mascardi et al. • if more than one CMA reported a problem related to the same server S, then the server S might have a problem. The action to take is to notify the SCC Assistance Centre in an automatic way. • if more than one CMA reported a problem related to a server, and the servers referred to by the CMAs are the different, then there might be a problem of network, but more information is needed. The action to take is to ask to the CMAs that reported the anomalies more information about them. • if more than one CMA reported a problem of network then there might be a problem of network. The action to take is to notify the SCC Assistance Centre in an automatic way. 
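In the project itself these conditions are written as tuProlog clauses running inside JADE; purely to illustrate their shape, here is a language-neutral sketch in Python of the answer-to-life classification performed on the Log Reader Agent side and of two of the Process Monitoring Agent rules described above. The class and helper names (cma.notify, ask_network_status, kill_and_restart) are hypothetical.

```python
def classify_answer_to_life(response_time, threshold1, threshold2):
    """Log Reader Agent side: map a measured response time to the symbolic
    values used in the reports (the two thresholds are configurable in the MAS)."""
    if response_time < threshold1:
        return "ready"
    if response_time < threshold2:
        return "slow"
    return "absent"


class ProcessMonitoringAgent:
    """Hedged sketch of PMA logic; the real agents are tuProlog rules in JADE."""

    def __init__(self, process_name, cma):
        self.process_name = process_name
        self.cma = cma                      # proxy to the Computer Monitoring Agent
        self.previous_answer = None

    def on_report(self, report):
        """report: dict sent by the Log Reader Agent, e.g.
        {'answer_to_life': 'slow', 'cpu_usage': 'normal', 'errors': 'absent'}"""
        answer = report.get("answer_to_life")

        if answer == "slow" and self.previous_answer == "slow":
            # Slow twice in a row: either the network or process P is at fault.
            self.cma.notify(self.process_name, "answer_to_life slow twice")
            if self.cma.ask_network_status() == "network_ok":
                self.kill_and_restart()      # the problem concerns P itself
        elif answer == "absent":
            self.cma.notify(self.process_name, "answer_to_life absent")
            self.kill_and_restart()
        # answer == "ready": nothing to do.

        self.previous_answer = answer

    def kill_and_restart(self):
        print(f"restarting process {self.process_name}")
```

Rules of this if-then shape translate almost one-to-one into Prolog clauses, which is one reason tuProlog was chosen for the PMA, CMA and PlaMA layers.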
The SCC Assistance Centre receives notifications from all the plants spread around Italy. It may cross-relate information about anomalies and occurred failures, and take repair actions in a centralized, efficient way. As far as the MAS implementation is concerned, LRAs are being implemented as “pure” JADE agents: they just parse the log file and translate it into another format, thus there is no need to exploit Prolog for their implementation. PMAs, CMAs, and PlaMA are being implemented in tuProlog and integrated into JADE by means of the libraries offered by DCaseLP. Prolog is very suitable for implementing the agent rules that we gave in natural language before. 4 Conclusions The joint DISI-Ansaldo project confirms the applicability of MAS technologies for concrete industrial problems. Although the adoption of agents for process control is not a novelty, the exploitation of declarative approaches outside the boundaries of academia is not widespread. Instead, the Ansaldo partners consider Prolog a suitable language for rapid prototyping of complex systems: this is a relevant result. The collaboration will lead to the implementation of a first MAS prototype by the end of June 2008, to its experimentation and, in the long term perspective, to the implementation of a real MAS and its installation in both central plants and stations along the railway lines. The possibility to integrate agents that exploit statistical learning methods to classify malfunctions and agents that mine functioning reports in order to identify patterns that led to problems, will be considered. More sophisticated rules will be added in the agents in order to capture the situation where a process is killed and restarted a suspicious number of times in a time unit. In that case, the PlaMa will need to urgently inform the SCC Assistance. The transfer of knowledge that DISI is currently performing will allow Ansaldo to pursue its goals by exploiting internal competencies in a not-so-distant future. Acknowledgments. The authors acknowledge Gabriele Arecco who is implementing the MAS described in this paper, and the anonymous reviewers for their constructive comments. Monitoring and Diagnosing Railway Signalling 115 References 1. Jennings, N.R., Sycara, K.P., Wooldridge, M.: A roadmap of agent research and development. Autonomous Agents and Multi-Agent Systems 1(1), 7–38 (1998) 2. Jennings, N.R., Mamdani, E.H., Corera, J.M., Laresgoiti, I., Perriollat, F., Skarek, P., Zsolt Varga, L.: Using Archon to develop real-world DAI applications, part 1. IEEE Expert 11(6), 64–67 (1996) 3. Schroeder, M., De Almeida Mòra, I., Moniz Pereira, L.: A deliberative and reactive diagno-sis agent based on logic programming. In: Rao, A., Singh, M.P., Wooldridge, M.J. (eds.) ATAL 1997. LNCS, vol. 1365, pp. 293–307. Springer, Heidelberg (1998) 4. Leckie, C., Senjen, R., Ward, B., Zhao, M.: Communication and coordination for intelligent fault diagnosis agents. In: Proc. of 8th IFIP/IEEE International Workshop for Distributed Systems Operations and Management, DSOM 1997, pp. 280–291 (1997) 5. Semmel, G.S., Davis, S.R., Leucht, K.W., Rowe, D.A., Smith, K.E., Boloni, L.: Space shuttle ground processing with monitoring agents. IEEE Intelligent Systems 21(1), 68–73 (2006) 6. Weihmayer, T., Tan, M.: Modeling cooperative agents for customer network control using planning and agent-oriented programming. In: Proc. of IEEE Global Telecommunications Conference, Globecom 1992, pp. 537–543. IEEE, Los Alamitos (1992) 7. 
Balduccini, M., Gelfond, M.: Diagnostic reasoning with a-prolog. Theory Pract. Log. Program. 3(4), 425–461 (2003) 8. Bellifemine, F.L., Caire, G., Greenwood, D.: Developing Multi-Agent Systems with JADE. Wiley, Chichester (2007) 9. Denti, E., Omicini, A., Ricci, A.: Tuprolog: A lightweight prolog for internet applications and infrastructures. In: Ramakrishnan, I.V. (ed.) 3rd International Symposium on Practical Aspects of Declarative Languages, PADL 2001, Proc., pp. 184–198. Springer, Heidelberg (2001) 10. Mascardi, V., Martelli, M., Gungui, I.: DCaseLP: A prototyping environment for multilanguage agent systems. In: Dastani, M., El-Fallah Seghrouchni, A., Leite, J., Torroni, P. (eds.) 1st Int. Workshop on Languages, Methodologies and Development Tools for MultiAgent Systems, LADS 2007, Proc. LNCS. Springer, Heidelberg (to appear, 2008) SeSaR: Security for Safety Ermete Meda1, Francesco Picasso2, Andrea De Domenico1, Paolo Mazzaron1, Nadia Mazzino1, Lorenzo Motta1, and Aldo Tamponi1 1 Ansaldo STS, Via P. Mantovani 3-5, 16151 Genoa, Italy {Meda.Ermete,DeDomenico.Andrea,Mazzaron.Paolo,Mazzino.Nadia, Motta.Lorenzo,Tamponi.Aldo}@asf.ansaldo.it 2 University of Genoa, DIBE, Via Opera Pia 11-A, 16145 Genoa, Italy Francesco.Picasso@unige.it Abstract. The SeSaR (Security for Safety in Railways) project is a HW/SW system conceived to protect critical infrastructures against cyber threats - both deliberate and accidental – arising from misuse actions operated by personnel from outside or inside the organization, or from automated malware programs (viruses, worms and spyware). SeSaR’s main objective is to strengthen the security aspects of a complex system exposed to possible attacks or to malicious intents. The innovative aspects of SeSaR are manifold: it is a non-invasive and multi-layer defense system, which subjects different levels and areas of computer security to its checks and it is a reliable and trusted defense system, implementing the functionality of Trusted Computing. SeSaR is an important step since it applies to different sectors and appropriately responds to the more and more predominant presence of interconnected networks, of commercial systems with heterogeneous software components and of the potential threats and vulnerabilities that they introduce. 1 Introduction The development and the organization of industrialized countries are based on an increasingly complex and computerized infrastructure system. The Critical National Infrastructures, such as Healthcare, Banking & Finance, Energy, Transportation and others, are vital to citizen security, national economic security, national public health and safety. Since the Critical National Infrastructures may be subject to critical events, such as failures, natural disasters and intentional attacks, capable to affect human life and national efficiency both directly and indirectly, these infrastructures need a an aboveaverage level of protection. In the transportation sector of many countries the introduction into their national railway signaling systems of the new railway interoperability system, called ERTMS/ETCS (European Rail Traffic Management System/European Train Control System) is very important: in fact the goal of ERTMS/ETCS is the transformation of a considerable part of the European - and not just European - railway system into an interoperable High Speed/High Capacity system. 
In particular, thanks to this innovative signaling system the Italian railways intend to meet the new European requirements and the widespread demand for higher speed and efficiency in transportation.

The development of ERTMS/ETCS has set the Italian railways and the Italian railway signaling industry among the world leaders, but it has introduced new risks associated with the use of information infrastructures, such as failures due to natural events and human-induced problems. The latter may be due to negligence and/or malice and can directly and heavily affect ICT security. ICT security management is therefore becoming complex. Even for railway applications, where the safety of the transport systems is still maintained, information threats and vulnerabilities can affect the continuity of operation, causing not only inconvenience to train traffic but also financial losses and damage to the corporate image of the railway authorities and of their suppliers.

2 Implementation of ICT Security

To illustrate the main idea behind the SeSaR (Security for Safety in Railways) project, the concept of ICT security should be defined first: implementing security means developing the activities of prevention, detection and reaction to an incident. In particular, the following definitions hold:

• Prevention: all the technical defences and countermeasures enabling a system to avoid attacks and damage. The drawback of this activity is that, in order to be effective, prevention must be prolonged indefinitely in time, thus requiring a constant organisational commitment, continuous training of personnel and adjustment/updating of the countermeasures.
• Detection: when prevention fails, detection should intervene; but new malicious code is capable of self-mutation, and "zero-day" attacks render the adopted countermeasures ineffective, so an adequate defence architecture cannot do without instruments capable of analyzing, in real time, the tracks and records stored in computers and in the defence network equipment.
• Reaction: the last activity, triggered either as a consequence of prompt detection or when the first two techniques were ineffective and problems start to affect the system in an apparent way. Of course, in the latter case it is difficult to guarantee the absence of inefficiency, breakdowns and malfunctions.

The goal of SeSaR was to find a tool that could merge the Prevention and Detection activities, in the belief that the effectiveness of the solution sought could be based on the ability to discover and communicate what is new, unusual and anomalous. The idea was to design a hardware/software platform that operates as follows:

• Prevention: perform a real-time integrity control of the hardware and software assets.
• Detection: perform a real-time analysis of the tracks and records stored in critical systems and in defence systems, seeking meaningful events and asset changes.
• Operator Interface: create a simple, concise representation, consisting of a console and a mimic-panel overview, that gives the operator a clear and comprehensive indication of the anomalies identified and enables immediate reaction.
3 Architecture

The scope of SeSaR is the ERTMS/ETCS signaling system infrastructure used for "Alta Velocità Roma-Napoli", i.e. the Rome-Naples high-speed railway line. In summary, the Rome-Naples high-speed system consists of the Central Post (PCS, acronym for "Posto Centrale Satellite") located at the Termini station in Rome, and of 19 Peripheral Posts (PPF, acronym for "Posto Periferico Fisso"). The PPFs are distributed along the line and linked together and with the PCS by a WAN.

SeSaR supplies hardware/software protection against ICT threats - deliberate and accidental - arising from undue actions operated by individuals external or internal to the ICT infrastructure or by automated malicious programs, such as viruses, worms, trojans and spyware. The distinctive and innovative features of SeSaR are:

• a non-invasive defence system: it works independently of the ICT infrastructure and operating system (Unix, Linux or Windows);
• a multi-layer defence system, operating controls over different levels and areas of computer security, such as Asset Control, Log Correlation and an integrated, interacting Honeynet;
• a reliable defence system with the capability to implement the functionality of Trusted Computing.

In particular, SeSaR carries out the following main functions, as shown in Fig. 1:

Fig. 1. SeSaR: multi-level defence system

• real-time monitoring of the assets of the ICT infrastructure;
• real-time correlation of the tracks and records collected by security devices and critical systems;
• a Honeynet feature, to facilitate the identification of intrusions or incorrect behaviour by creating "traps and information bait";
• a Forensic feature, which allows the records accumulated during the treatment of attacks to be accompanied by a digital signature and a time stamp so that they can have probative value. This feature also contributes to building "confidence" in the system through constant monitoring of the system's integrity.

SeSaR protects both the Central Post network and the whole PPF network against malicious attacks by internal or external illicit actions. As already mentioned, three functional components are included: the Asset Control Engine, the Log Correlation Engine and the Honeynet.

3.1 Asset Control Engine

This component checks in real time the network configuration of the system's IP-based assets, showing any relevant modification, in order to detect the unauthorized intrusion or the absence (including shut-down and Ethernet card malfunctioning) of any IP node. The principle of operation is based on:

• monitoring the real-time network configuration of all IP devices through SNMP (Simple Network Management Protocol) queries to all the switches of the system;
• comparing this information with the reference configuration known and stored in the SeSaR database.

Each variation found generates an alarm on the console and on a mimic panel. To get updated information from the switches, the SeSaR Asset Control Engine (SACE, also called CAI, an acronym for Controllo Asset Impianto) uses the MIB (Management Information Base) through the SNMP protocol. In a switch, the MIB is a kind of database containing information about the status of the ports and the IP and MAC addresses of the connected hosts. By interrogating the switches, CAI determines whether their ports are connected to IP nodes and the corresponding IP and MAC addresses.
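A minimal sketch of the CAI comparison step follows. It assumes that the switch interrogation has already produced a list of (IP, MAC, switch port) tuples — in the real system these come from SNMP MIB queries — and diffs them against the reference database, raising intrusion, relocation and missing-node alarms. The records, addresses and host names are hypothetical.

```python
# Hypothetical reference configuration: for each known host,
# the expected (hostname, switch port), keyed by (IP, MAC).
REFERENCE_DB = {
    ("10.1.1.10", "00:1a:2b:3c:4d:01"): ("pcs-server-1", "sw01/3"),
    ("10.1.1.11", "00:1a:2b:3c:4d:02"): ("pcs-console-1", "sw01/4"),
    ("10.2.5.20", "00:1a:2b:3c:4d:10"): ("ppf07-logger", "sw07/1"),
}

def compare_with_reference(discovered):
    """discovered: iterable of (ip, mac, port) tuples read from the switches.
    Returns the list of alarm messages to show on the console and mimic panel."""
    alarms, seen = [], set()
    for ip, mac, port in discovered:
        key = (ip, mac)
        seen.add(key)
        if key not in REFERENCE_DB:
            alarms.append(f"INTRUSION: unknown node {ip} ({mac}) on port {port}")
        else:
            hostname, expected_port = REFERENCE_DB[key]
            if port != expected_port:
                alarms.append(f"MOVED: {hostname} now on {port}, expected {expected_port}")
    for (ip, mac), (hostname, expected_port) in REFERENCE_DB.items():
        if (ip, mac) not in seen:
            alarms.append(f"MISSING: {hostname} ({ip}) not seen on {expected_port}")
    return alarms

# Example discovery result: one known host missing, one unknown host present.
discovered_nodes = [
    ("10.1.1.10", "00:1a:2b:3c:4d:01", "sw01/3"),
    ("10.9.9.99", "00:de:ad:be:ef:00", "sw01/7"),
    ("10.2.5.20", "00:1a:2b:3c:4d:10", "sw07/1"),
]
for alarm in compare_with_reference(discovered_nodes):
    print(alarm)
```

In the fully operational phase this comparison runs cyclically, and each refused modification stays on the operator console until it is explicitly acknowledged, as described in the phases below.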
The CAI activities can be divided into the following subsequent phases: PHASE 1) Input Data Definition and Populating the SeSaR Reference Data-Base This phase includes the definition of a suitable database and the importation into it of the desired values for the following data, forming the initial specification of the network configuration: a) IP address and Netmask for each system switch and router; b) IP address, Netmask and Hostname for each system host; c) Each System subnet with the relevant Geographic name. Such operations can also be performed with SeSaR disconnected from the system. PHASE 2) Connecting SeSaR to the system and implementing the Discovery activity This phase is constituted by a Discovery activity, which consists in getting, for each switch of the previous phase, the Port number, MAC address and IP address of each node connected to it. The SeSaR reference database is thus verified and completed with MAC Address and Port number, obtaining the 4 network parameters which describe each node (IP address, Hostname, MAC address and Number of the switch port connected to the node). 120 E. Meda et al. Said research already lets the operator identify possible nodes which are missing or intruding and lets him/her activate alarms: an explicit operator acknowledge is required for each node corresponding to a modification and all refused nodes will be treated by SeSaR as alarms on the Operator Console (or GUI, Graphic User Interface) and appropriately signalled on the mimic panel. Of course whenever modifications are accepted the reference database is updated. PHASE 3) Asset Control fully operational activity In such phase CAI cyclically and indefinitely verifies in real-time the network configuration of all the system assets, by interrogating the switches and by getting, for each network node, any modification concerning not only the IP address or Port number, as in the previous phase, but also the MAC address; the alarm management is performed as in the said phase. 3.2 Log Correlation Engine Firewall and Antivirus devices set up the underlying first defence countermeasures and they are usually deployed to defend the perimeter of a network. Independently of how they are deployed, they discard or block data according to predefined rules: the rules are defined by administrators in the firewall case, while they are signature matching rules defined by the vendor in the antivirus case. Once set up, these devices operate without the need of human participation: it could be said that they silently defend the network assets. Recently new kinds of devices, such as intrusion detection/prevention systems, have become available, capable to monitor and to defend the assets: they supply proactive defence in addition to the prevention defence of firewalls and antiviruses. Even using ID systems it is difficult for administrators to get a global vision of the asset’s security state since all devices work separately from each other. To achieve this comprehensive vision there is a need of a correlation process of the information made available by security devices: moreover, given the great amount of data, this process should be an automatic or semi-automatic one. The SeSaR Log Correlation Engine (SLCE, also called CLI, an acronym for Correlazione Log Impianto) achieves this goal by operating correlations based on security devices’ logs. 
These logs are created by security devices while monitoring the information flow and taking actions: such logs are typically used to debug rules configurations and to investigate a digital incident, but they can be used in a log correlation process as well. It is important to note that the correlation process is located above the assets’ defence layer, so it is transparent to the assets it monitors. This property has a great relevance especially when dealing with critical infrastructures, where changes in the assets require a large effort – time and resources – to be developed. The SLCE’s objective is to implement a defensive control mechanism by executing correlations within each single device’s log and among multiple devices’ alerts (Fig. 2). The correlation process is set up in two phases: the Event Correlation phase (E.C.) named Vertical Correlation and the Alert Correlation (A.C.) phase named Horizontal Correlation. This approach rises from a normalization need: syntactic, semantic and numeric normalization. Every sensor (security device) either creates logs expressed in its own format or, when a common format is used, it gives particular meanings to its log fields. Moreover the number of logs created by different sensors may differ by several orders of magnitude: for example, the number of logs created by a firewall sensor is usually largely greater than the number of logs created by an antivirus sensor. SeSaR: Security for Safety 121 Fig. 2. Logs management by the SeSaR Log Correlation Engine So the first log management step is to translate the different incoming formats into a common one, but there is the need for a next step at the same level. In fact, by simply translating formats, the numeric problem is not taken into account: depending on the techniques used in the correlation processes, by mixing two entities, which greatly differ in size, the information carried by the smallest one could be lost. In fact, at higher level, it should be better to manage a complete new entity than working on different entities expressed in a common format. The Vertical Correlation takes into account these aspects by acting on every sensor in four steps: 1. 2. 3. 4. it extracts the logs from a device; it translates the logs from the original format into an internal format (metadata) useful to the correlation process; it correlates the metadata on-line and with real-time constraints by using sketches [3]; if anomalies are detected, it creates a new message (Alert), which will carry this information by using the IDMEF format [4]. The correlation process is called vertical since it is performed on a single sensor basis: the SensorAgent software manages a single device and implements E.C. (the Vertical Correlation) using the four steps described above. The on-line, real-time and fixed upper-bound memory usage constraints allow for the development of lightweight software SensorAgents which can be modified to fit the context and which could be easily embedded in special devices. The Alerts created by Sensor Agents (one for each security device) are sent to the Alert Correlator Engine (ACE) to be further investigated and correlated. Basically ACE works in two steps: the aggregation step, in which the alerts are aggregated based on content’s fields located at the same level in IDMEF, without taking into consideration the link between different fields; the matching scenario step, in which 122 E. Meda et al. the alerts are aggregated based on the reconstruction and match of predefined scenarios. 
In both cases multiple Alerts – or even a single Alert, if it represents a critical situation – form an Alarm which will be signalled by sending it to the Operator Interface (console and mimic panel). Alarms share the IDMEF format with Alerts but the former ones get an augmented semantic meaning given to them during the correlation processes: Alerts represent anomalies that must be verified and used to match scenarios, while Alarms represent threats. 3.3 Honeynet This SeSaR’s component identifies anomalous behaviours of users: not only operators, but also maintainers, inspectors, assessors, external attackers, etc... Honeynet is composed of two or more honeypots. Any honeypot simulates the behaviour of the real host of the critical infrastructure. The honeypot security level is lower than the real host security level, so the honeypot creates a trap both for malicious and negligent users. Data capturing (keystroke collection and capture traffic entering the honeypots), data controlling (monitoring and filtering permitted or unauthorised information) and data analysis (operating activities on the honeypots) the functionalities it performs. 4 User Console and Mimic Panel Any kind of alarms coming from Asset Control, Log Correlation and Honeynet are displayed to the operator both in textual and graphical format: the textual format by a web based GUI (Graphic User Interface), the graphical format by a mimic panel. The operators can manage the incoming alarms on the GUI. There are three kinds of alarm status: incoming (waiting to be processed by an operator), in progress and closed. The latter ones are the responsibility of operators. 5 Conclusion SeSaR works independently of the ICT infrastructure and Operative System. Therefore SeSaR can be applied in any other context, even if SeSaR is being designed for railway computerized infrastructure. SeSaR and its component are useful to improve any kind of existing security infrastructure and it can be a countermeasure for any kind of ICT infrastructure. References 1. Meda, E., Sabbatino, V., Doro Altan, A., Mazzino, N.: SeSaR: Security for Safety, AirPort&Rail, security, logistic and IT magazine, EDIS s.r.l., Bologna (2008) 2. Forte, D.V.: Security for safety in railways, Network Security. Elsevier Ltd., Amsterdam (2008) 3. Muthukrishnan, S.: Data Streams: Algorithms and Applications. Foundations and Trends in Theoretical Computer Science (2005) 4. RFC 4765, The Intrusion Detection Message Exchange Format (IDMEF)(March 2007) Automatic Verification of Firewall Configuration with Respect to Security Policy Requirements Soutaro Matsumoto1 and Adel Bouhoula2 1 Graduate School of System and Information Engineering University of Tsukuba – Japan soutaro@score.cs.tsukuba.ac.jp 2 Higher School of Communication of Tunis (Sup'Com) University of November 7th at Carthage – Tunisia adel.bouhoula@supcom.rnu.tn Abstract. Firewalls are key security components in computer networks. They filter network traffics based on an ordered list of filtering rules. Firewall configurations must be correct and complete with respect to security policies. Security policy is a set of predicates, which is a high level description of traffic controls. In this paper, we propose an automatic method to verify the correctness of firewall configuration. We have defined a boolean formula representation of security policy. With the boolean formula representations of security policy and firewall configuration, we can formulate the condition that ensures correctness of firewall configuration. 
We use a SAT solver to check the validity of the condition. If the configuration is not correct, our method produces an example packet to help users correct the configuration. We have implemented a prototype verifier and obtained some experimental results. The first results were very promising.

Keywords: Firewall Configuration, Security Policy, Automatic Verification, SAT Solver.

1 Introduction

Firewalls are key security components in computer networks. They filter packets to control network traffic based on an ordered list of filtering rules. Filtering rules describe which packets should be accepted or rejected; each rule consists of a packet pattern and an action. A firewall rejects / accepts a packet if the first rule in its configuration that matches the packet rejects / accepts it. Security policies are high-level descriptions of traffic controls. They define which connections are allowed, and they are sets of predicates. A security policy rejects / accepts a connection if the most specific predicate that matches the connection rejects / accepts it.

Firewalls should be configured correctly with respect to security policies. A correct configuration rejects / accepts a connection if and only if the security policy rejects / accepts it. However, the correctness of a firewall configuration is not obvious. Consider for example the following security policy.

1. All users in LAN can access Internet
2. All users in LAN1 cannot access youtube.com
3. All users in LAN2 can access youtube.com

Fig. 1. Structure of the Network

There are three predicates. Assume that we have a network LAN, two sub-networks LAN1 and LAN2, and a website youtube.com in the Internet. The structure of the networks is shown in figure 1. The first predicate is the most general one; it allows all users in LAN to access any web site in the Internet. The second predicate is more specific than the first and prohibits users in LAN1 from accessing youtube.com. The third predicate allows users in LAN2 to access youtube.com. Under this security policy, users in LAN1 cannot access youtube.com: since the second predicate is the most specific one for a connection from LAN1 to youtube.com, the connection will be rejected.

Consider for example the following firewall configuration in Cisco's format.

    access-list 101 permit tcp 192.168.0.0 0.0.255.255 any eq 80
    access-list 102 deny tcp 192.168.1.0 0.0.0.255 233.114.23.1 0.0.0.0 eq 80
    access-list 103 permit tcp 192.168.2.0 0.0.0.255 233.114.23.1 0.0.0.0 eq 80

There are three rules, named 101, 102, and 103. Rule 101 permits connections from 192.168.*.* to any host (* is a wildcard). Rule 102 rejects packets from 192.168.1.* to 233.114.23.1. Rule 103 permits packets from 192.168.2.* to 233.114.23.1. Assume that the network LAN has address 192.168.*.*, LAN1 and LAN2 have 192.168.1.* and 192.168.2.* respectively, and youtube.com has address 233.114.23.1. This configuration can be read as a straightforward translation of the security policy. Unfortunately, it is incorrect: since the first rule that matches a given connection is applied, a connection from LAN1 matches the first rule and will be accepted even if the destination is youtube.com.

We propose a method to verify the correctness of firewall configurations with respect to security policies. Security policy P and firewall configuration F are translated into boolean formulae QP and QF. The correctness of the firewall configuration is reduced to the equivalence of the two formulae, which is checked via the satisfiability of QP ⇎ QF. If the formula is satisfiable, QP and QF are not equivalent and F is not correct; a counterexample packet is then produced for which P and F give different answers. The counterexample helps users to find and correct mistakes in their configurations.
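The paper's prototype encodes QP and QF for a SAT solver; purely as an illustration of the same check, the sketch below encodes the example policy and configuration above with the Z3 solver's Python API (a substitution made for this sketch, not the tool used by the authors) and asks for a packet on which the two disagree.

```python
from z3 import BitVec, BitVecVal, BoolVal, If, And, Solver, sat

src = BitVec('src', 32)
dst = BitVec('dst', 32)

def ip(a, b, c, d):
    return BitVecVal((a << 24) | (b << 16) | (c << 8) | d, 32)

def in_net(addr, net, prefix_len):
    # True when addr lies inside net/prefix_len.
    mask = BitVecVal(((1 << 32) - 1) ^ ((1 << (32 - prefix_len)) - 1), 32)
    return (addr & mask) == (net & mask)

lan   = in_net(src, ip(192, 168, 0, 0), 16)
lan1  = in_net(src, ip(192, 168, 1, 0), 24)
lan2  = in_net(src, ip(192, 168, 2, 0), 24)
ytube = dst == ip(233, 114, 23, 1)

# QF: firewall semantics -- the first matching rule wins (rules 101, 102, 103).
QF = If(lan, BoolVal(True),
     If(And(lan1, ytube), BoolVal(False),
     If(And(lan2, ytube), BoolVal(True), BoolVal(False))))

# QP: policy semantics -- the most specific matching predicate wins,
# so the specific predicates are tested before the general one.
QP = If(And(lan1, ytube), BoolVal(False),
     If(And(lan2, ytube), BoolVal(True),
     If(lan, BoolVal(True), BoolVal(False))))

solver = Solver()
solver.add(QP != QF)      # satisfiable iff F disagrees with P on some packet
if solver.check() == sat:
    model = solver.model()
    def dotted(v):
        x = model[v].as_long()
        return ".".join(str((x >> k) & 0xFF) for k in (24, 16, 8, 0))
    print("counterexample packet:", dotted(src), "->", dotted(dst))
else:
    print("configuration is correct with respect to the policy")
```

The only packets on which the two formulae disagree are those from 192.168.1.* to 233.114.23.1: the configuration accepts them because rule 101 matches first, while the policy denies them, which is exactly the mistake discussed above.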
We propose a method to verify the correctness of firewall configurations with respect to security policies. Security policy P and firewall configuration F are translated into boolean formulae QP and QF. The correctness of the firewall configuration is reduced to the equivalence of the two formulae. The equivalence of the two formulae is checked through the satisfiability of QP ⇎ QF. If the formula is satisfiable, QP and QF are not equivalent and F is not correct. A counterexample packet is then produced such that P and F give different answers for that packet. The counterexample helps users to find and correct mistakes in their configurations.

1.1 Related Work

There is a large body of work on the development of formal languages for security policy description [1, 2]. Our formalization of security policy is very abstract, but essentially equivalent to them. Since those formalizations are quite technical, we work with a simplified definition of security policy. There are some works on efficient testing of firewalls based on security policies [3, 4]. Their approaches generate test cases from the security policy and then test the network with those test cases. Our method, in contrast, only verifies the correctness of the firewall configuration. This simplification makes our method very simple and fast. The essence of our method is the definition of the boolean formula representation; the other steps are applications of well-known logical operations. The fact that our method runs entirely offline, on a single computer, makes verification much faster and easier than testing by sending packets to real networks. It also makes it possible to use SAT solvers to prove correctness. Detecting anomalies in firewall configurations is also an important issue [5]. Detection of anomalies helps to find mistakes in a configuration, but it does not find the mistakes by itself. Hazelhurst has presented a boolean formula representation of firewall configurations, which can be used to express filtering rules [6]. The boolean formula representation can be simplified with Binary Decision Diagrams to improve the performance of filtering. In this paper, we use the boolean formula representations of firewall configurations and packets.

2 Security Policy and Firewall Configuration

We present a formal definition of security policy and firewall configuration. The definitions are very abstract. We clarify the assumptions of our formalization about security policy.

2.1 Security Policy

A security policy is a set of policy predicates. Policy predicates consist of an action, a source address, and a destination address. Actions are accept or deny. The syntax of security policy is given in figure 2. A policy is a set of predicates. A predicate consists of an action K, a source address, and a destination address. The action K is accept A or deny D. A predicate A(s, d) reads "Connections from s to d are accepted". A predicate D(s, d) reads "Connections from s to d are denied". A network address is a set. Packets are pairs of a source address and a destination address. A packet p from source address s to destination address d matches predicate K(s', d') if and only if s ⊆ s' ⋀ d ⊆ d' holds. We have a partial order relation ⊒ on policy predicates: K(s, d) ⊒ K'(s', d') ⇔ s ⊇ s' ⋀ d ⊇ d'. If q ⊒ q' holds, q is more general than q', or q' is more specific than q. A packet p is accepted by a security policy P if the action of the most specific policy predicate that matches p is A, and p is rejected by P if the action of the most specific policy predicate that matches p is D.
policy ::= { predicate, …, predicate }
predicate ::= K(address, address)
address ::= Network Address
K ::= A | D   (A = Accept, D = Deny)
Fig. 2. Syntax of Security Policy

configuration ::= rule :: configuration | φ
rule ::= K(address, address)
address ::= Network Address
K ::= A | D   (A = Accept, D = Deny)
Fig. 3. Syntax of firewall configuration

Example. The security policy we have shown in section 1 can be represented as P = { q1, q2, q3 }, where q1, q2 and q3 are defined as follows:

1. q1 = A(LAN, Internet)
2. q2 = D(LAN1, youtube.com)
3. q3 = A(LAN2, youtube.com)

We have the ordering of policy predicates q1 ⊒ q2 and q1 ⊒ q3.

Assumptions. To ensure the consistency of a security policy, we assume that we can find the most specific policy predicate for any packet. That is, we assume that the following formula holds for any two different predicates K(s, d) and K'(s', d'): s × d ⊇ s' × d' ⋁ s × d ⊆ s' × d' ⋁ s × d ∩ s' × d' = φ.

Tree View of Security Policy. We can see security policies as trees, such that the root is the most general predicate and the children of each node are the set of more specific predicates. We define an auxiliary function CP(q) which maps a predicate q in security policy P to the set of its children: CP(q) = { q' | q' ∈ P ⋀ q ⊒ q' ⋀ ¬(∃q'' ∈ P . q ⊒ q'' ⋀ q'' ⊒ q') }. For instance, CP(q1) = { q2, q3 } and CP(q2) = CP(q3) = φ for the previous example.

2.2 Firewall Configuration

The syntax of firewall configuration is given in figure 3. A firewall configuration is an ordered list of rules. Rules consist of an action, a source address, and a destination address. Actions are accept A or deny D. A packet is accepted by a firewall configuration if and only if the action of the first rule in the configuration that matches the source and destination address of the packet is A. The main difference between security policies and firewall configurations is that rules in firewalls form an ordered list, while predicates in security policies form trees.

3 Boolean Formula Representation

We present a boolean formula representation of security policies and firewall configurations in this section. This section also includes a boolean formula representation of network addresses and packets; in the previous section we did not define a concrete representation of network addresses and packets.¹ The boolean formula representations of network addresses and firewall configurations were proposed by Hazelhurst [6].

3.1 Boolean Formula Representation of Network Addresses and Packets

Network addresses are IPv4 addresses. Since IPv4 addresses are 32 bit unsigned integers, we need 32 logical variables to represent each address. An IP address is represented as a conjunction of the 32 variables or their negations, so that each variable represents a bit of the IP address: if the ith bit of the address is 1 then variable ai should evaluate to true. For example, the IP address 192.168.0.1 is represented as follows.

a32 ⋀ a31 ⋀ ¬a30 ⋀ ¬a29 ⋀ ¬a28 ⋀ ¬a27 ⋀ ¬a26 ⋀ ¬a25 ⋀ a24 ⋀ ¬a23 ⋀ a22 ⋀ ¬a21 ⋀ a20 ⋀ ¬a19 ⋀ ¬a18 ⋀ ¬a17 ⋀ ¬a16 ⋀ ¬a15 ⋀ ¬a14 ⋀ ¬a13 ⋀ ¬a12 ⋀ ¬a11 ⋀ ¬a10 ⋀ ¬a9 ⋀ ¬a8 ⋀ ¬a7 ⋀ ¬a6 ⋀ ¬a5 ⋀ ¬a4 ⋀ ¬a3 ⋀ ¬a2 ⋀ a1

Here, a1 is the variable for the lowest bit and a32 is for the highest bit. «p» is an environment which represents a packet p. Packets consist of a source address and a destination address. We have two sets of boolean variables, s = { s1, …, s32 } and d = { d1, …, d32 }, which represent the source address and the destination address of a packet, respectively. If packet p is from address 192.168.0.1, «p»|s is the following.
{ s32 ↦ T, s31 ↦ T, s30 ↦ F, s29 ↦ F, s28 ↦ F, s27 ↦ F, s26 ↦ F, s25 ↦ F, s24 ↦ T, s23 ↦ F, s22 ↦ T, s21 ↦ F, s20 ↦ T, s19 ↦ F, s18 ↦ F, s17 ↦ F, s16 ↦ F, s15 ↦ F, s14 ↦ F, s13 ↦ F, s12 ↦ F, s11 ↦ F, s10 ↦ F, s9 ↦ F, s8 ↦ F, s7 ↦ F, s6 ↦ F, s5 ↦ F, s4 ↦ F, s3 ↦ F, s2 ↦ F, s1 ↦ T }

«p» also includes assignments of the d variables for the destination address of p. ‹a, b› is a boolean formula such that «p» ⊢ ‹a, b› holds if and only if packet p is from a to b. ‹192.168.0.1, b› is a boolean formula like the following:

s32 ⋀ s31 ⋀ ¬s30 ⋀ ¬s29 ⋀ ¬s28 ⋀ ¬s27 ⋀ ¬s26 ⋀ ¬s25 ⋀ …

Only the first eight components of the formula are shown; they are the same as the highest eight components of the boolean formula representation of the IP address 192.168.0.1.

¹ Without loss of generality, we use a simplified representation of network addresses. We can easily extend this representation to support net-masks, ranges of ports, or other features, as proposed by Hazelhurst.

3.2 Boolean Formula Representation of Security Policy

A security policy P can be represented as a boolean formula QP such that ∀p : Packet . P accepts p ⇔ «p» ⊢ QP holds. We define a translation BP(q, β) which maps a policy predicate q in security policy P to its boolean formula representation:

BP(A(a, b), T) = ¬‹a, b› ⋁ (⋀q∈C BP(q, T))
BP(A(a, b), F) = ‹a, b› ⋀ (⋀q∈C BP(q, T))
BP(D(a, b), T) = ¬‹a, b› ⋁ (⋁q∈C BP(q, F))
BP(D(a, b), F) = ‹a, b› ⋀ (⋁q∈C BP(q, F))

where C = CP(q). We obtain the boolean formula representation of security policy P as BP(q, F), where q is the most general predicate in P.

Example. The following is an example of the transformation of the security policy P from section 2 into its boolean formula representation. The boolean formula representation of P is obtained as BP(q1, F), since q1 is the most general predicate.

BP(A(LAN, Internet), F) = ‹LAN, Internet› ⋀ BP(q2, T) ⋀ BP(q3, T)
BP(D(LAN1, youtube.com), T) = ¬‹LAN1, youtube.com›
BP(A(LAN2, youtube.com), T) = ¬‹LAN2, youtube.com› ⋁ T

Finally, we have the following formula after some simplifications:

‹LAN, Internet› ⋀ ¬‹LAN1, youtube.com›

Consider a packet from LAN1 to youtube.com: the first component of the formula evaluates to true, but the second component evaluates to false. The whole expression evaluates to false, so the packet is rejected.

3.3 Boolean Formula Representation of Firewall Configuration

A firewall configuration F can be represented as a boolean formula QF such that ∀p : Packet . F accepts p ⇔ «p» ⊢ QF holds. B(F) is a mapping from F to QF:

B(φ) = F
B(A(a, b) :: rules) = ‹a, b› ⋁ B(rules)
B(D(a, b) :: rules) = ¬‹a, b› ⋀ B(rules)
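As a concrete illustration of the two translations and of the satisfiability check of QP ⇎ QF, the following Python sketch is a simplified stand-in (it is not the authors' verifier): addresses are modelled as small sets of integers instead of 32+16-bit vectors, the formulas are built as closures, and the SAT step is replaced by brute-force enumeration; the names bp, bf, match and the toy networks are all assumptions made for the example.

from itertools import product

def match(p, a, b):                      # ‹a, b›: packet p = (src, dst) is from a to b
    return p[0] in a and p[1] in b

def bp(pred, beta):
    # B_P(q, beta) for a predicate q = (action, a, b, children), returned as a closure.
    action, a, b_, children = pred
    if action == 'A':
        subs = [bp(c, True) for c in children]
        inner = lambda p: all(f(p) for f in subs)       # conjunction over the children
    else:                                               # 'D'
        subs = [bp(c, False) for c in children]
        inner = lambda p: any(f(p) for f in subs)       # disjunction over the children
    if beta:
        return lambda p: (not match(p, a, b_)) or inner(p)
    return lambda p: match(p, a, b_) and inner(p)

def bf(rules):
    # B(F) for an ordered rule list [(action, a, b), ...].
    if not rules:
        return lambda p: False
    action, a, b_ = rules[0]
    rest = bf(rules[1:])
    if action == 'A':
        return lambda p: match(p, a, b_) or rest(p)
    return lambda p: (not match(p, a, b_)) and rest(p)

# Toy universe standing in for IP addresses.
LAN1, LAN2 = {10, 11}, {20, 21}
LAN = LAN1 | LAN2 | {30}
YOUTUBE = {99}
INTERNET = {99, 100, 101}
HOSTS = LAN | INTERNET

policy = ('A', LAN, INTERNET, [('D', LAN1, YOUTUBE, []), ('A', LAN2, YOUTUBE, [])])
config = [('A', LAN, INTERNET), ('D', LAN1, YOUTUBE), ('A', LAN2, YOUTUBE)]

qp, qf = bp(policy, False), bf(config)
# Satisfiability of QP xor QF by exhaustive enumeration (the paper uses a SAT solver here).
counterexamples = [p for p in product(HOSTS, HOSTS) if qp(p) != qf(p)]
print(counterexamples)                   # packets from LAN1 to youtube.com

Running the sketch reports exactly the packets from LAN1 to youtube.com, for which QF accepts but QP rejects — the same kind of counterexample the SAT-based verifier returns.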
4 Experimental Results

We have implemented a prototype verifier. The verifier reads a security policy and a firewall configuration, and verifies the correctness of the configuration. It assumes IPv4 addresses with net-masks, and 16-bit unsigned integer port numbers with range support. The verifier uses MiniSAT to solve the SAT instances [7]. In our verifier, packets consist of two network addresses and a protocol. Network addresses are pairs of a 32 bit unsigned integer, which represents an IPv4 address, and a 16 bit unsigned integer, which represents a port number. The protocols are TCP, UDP, or ICMP. Thus, a formula for one packet includes up to 99 variables: 32+16 variables each for the source and destination addresses, and three variables for the protocol. We have verified some firewall configurations. Our experiments were performed on an Intel Core Duo 2.16 GHz processor with 2 Gbytes of RAM.

Table 1 summarizes our results. The first two columns show the size of the inputs, i.e. the number of predicates in the security policy and the number of filtering rules in the firewall configuration. The third column shows the size of the input for the SAT solver. The last column shows the running time of our verifier; it includes all processing time from reading the inputs to printing the results. All of the inputs were correct configurations, since this is the most time-consuming case. These results show that our method is fast enough for inputs that are not too big. If a configuration is not correct, the verifier produces a counterexample packet. The following is an output of our verifier that shows a counterexample.

% verifyconfig verify ../samples/policy.txt ../samples/rules.txt
Loading policy ... ok
Loading configuration ... ok
Translating to CNF ... ok
MiniSAT running ... ok
Incorrect: for example [tcp - 192.168.0.0:0 - 111.123.185.1:80]

The counterexample indicates that the security policy and the firewall configuration give different answers for a TCP packet from 192.168.0.0, port 0, to 111.123.185.1, port 80. Testing the firewall with the counterexample will help users to correct the configuration.

Table 1. Experimental Results
# of predicates   # of rules   size of SAT   running time (s)
3                 3            9183          0.04
13                11           21513         0.08
26                21           112327        0.37

5 Conclusion

In this paper, we have defined a boolean formula representation of security policy, which can be used in many applications. We have also proposed an automatic method to verify the correctness of firewall configurations with respect to security policies. The method translates both inputs into boolean formulae and then verifies their equivalence by checking satisfiability. We have obtained experimental results on some small examples using our prototype implementation. Our method can verify the configuration of a centralized firewall; we are working on generalizing it to distributed firewalls.

References
1. Hamdi, H., Bouhoula, A., Mosbah, M.: A declarative approach for easy specification and automated enforcement of security policy. International Journal of Computer Science and Network Security 8(2), 60–71 (2008)
2. Abou El Kalam, A., Baida, R.E., Balbiani, P., Benferhat, S., Cuppens, F., Deswarte, Y., Miége, A., Saurel, C., Trouessin, G.: Organization Based Access Control. In: 4th IEEE International Workshop on Policies for Distributed Systems and Networks (Policy 2003) (June 2003)
3. Senn, D., Basin, D.A., Caronni, G.: Firewall conformance testing. In: Khendek, F., Dssouli, R. (eds.) TestCom 2005. LNCS, vol. 3502, pp. 226–241. Springer, Heidelberg (2005)
4. Darmaillacq, V., Fernandez, J.C., Groz, R., Mounier, L., Richier, J.L.: Test generation for network security rules. In: Uyar, M.Ü., Duale, A.Y., Fecko, M.A. (eds.) TestCom 2006. LNCS, vol. 3964, pp. 341–356. Springer, Heidelberg (2006)
5. Abbes, T., Bouhoula, A., Rusinowitch, M.: Inference System for Detecting Firewall Filtering Rules Anomalies. In: Proceedings of the 23rd Annual ACM Symposium on Applied Computing, Fortaleza, Ceara, Brazil, pp. 2122–2128 (March 2008)
6. Hazelhurst, S.: Algorithms for analysing firewall and router access lists. CoRR cs.NI/0008006 (2000)
7. Eén, N., Sörensson, N.: An extensible sat-solver. In: Giunchiglia, E., Tacchella, A. (eds.) SAT 2003. LNCS, vol. 2919, pp. 502–518.
Springer, Heidelberg (2003) Automated Framework for Policy Optimization in Firewalls and Security Gateways Gianluca Maiolini1, Lorenzo Cignini1, and Andrea Baiocchi2 1 Elsag Datamat – Divisione Automazione Sicurezza e Trasporti Rome, Italy {Gianluca.Maiolini, Lorenzo2.Cignini}@elsagdatamat.com 2 University of Roma “La Sapienza” Rome, Italy {Andrea.Baiocchi}@uniroma1.it Abstract. The challenge to address in multi-firewall and security gateway environment is to implement conflict-free policies, necessary to avoid security inconsistency, and to optimize, at the same time, performances in term of average filtering time, in order to make firewalls stronger against DoS and DDoS attacks. Additionally the approach should be real time, based on the characteristics of network traffic. Our work defines an algorithm to find conflict free optimized device rule sets in real time, by relying on information gathered from traffic analysis. We show results obtained from our test environment demonstrating for computational power savings up to 24% with fully conflict free device policies. Keywords: Firewall; Data mining; network management; security policy; optimization. 1 Introduction A key challenge of secure systems is the management of security policies, from those at high level down to the platform specific implementation. Security policy defines constraints, limitations and authorization on data handling and communications. The need for high speed links follows the increasing demand for improved packet filtering devices performance, such as firewall and S-VPN gateway. As hacking techniques evolves and routing protocols are becoming more complex there is a growing need of automated network management systems that can rapidly adapt to different and new environments. We assume that policies are formally stated according to a well defined formal language, so that the access lists of a security gateway can be reduced to an ordered list of predicates of the form: C Æ A, where C is a condition and A is an action. We refer to predicates implementing security policies as rules. For security gateway the condition of a filtering rule is composed of five selectors: <protocol> <src ip> <src port> <dst ip> <dst port>. The action that could be performed on the packet is allow, deny or process, where process imply that the packet has to be submitted to the IPSec algorithm. How to process that packet is described in a specific rule which details how to apply the security mechanism. Conditions are checked on each packet flowing through the device. The process of inspecting incoming packets and looking up the policy rule set for a match often results in CPU overload and traffic or application’s delay. Packets that match high rank rules require a small computation time compared to those one at the end of rule set. Shaping list of rules on traffic E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 131–138, 2009. springerlink.com © Springer-Verlag Berlin Heidelberg 2009 132 G. Maiolini, L. Cignini, and A. Baiocchi flowing through devices could be useful to improve devices performance. This operation performed on all packet filtering devices give an improvement in global network performance. Our analysis shows how shaping access list based on network traffic can often results in conflicts between policies. As reported by many authors [1-6], conflicts in a policy can cause holes in security, and often they can be hard to find when performing only visual or manual inspection. 
In this paper we propose an architecture, based on our algorithm, to automatically adapt packet filtering device configurations to traffic behaviour, achieving the best performance while ensuring a conflict-free solution. The architecture retrieves traffic patterns from log information sent in real time by all devices deployed in the network.

2 Related Works

In the last few years the critical role of the firewall in policy-based network management has led to a large amount of work. Much of it concerns the correctness of implemented policies. In [1] the authors only aim at detecting whether firewall rules are correlated to each other, while in [2][3][4] a set of techniques and algorithms are defined to discover all of the possible policy conflicts. Along this line, [5] and [6] provide an automatic conflict resolution algorithm in a single-firewall environment and a tuning algorithm in a multi-firewall environment, respectively. Recently, great emphasis has been placed on how to optimize firewall performance. In [7] a simple algorithm based on rule re-ordering is presented. In [11] the authors present an algorithm to optimize firewall performance that orders the rules according to their weights and considers two factors to determine the weight of a rule: rule frequency and recency, which reflect the number and the time of rule matchings, respectively. Finally, extracting rules from the "deny all" rule is another big problem to address. The few works on this issue [8][10] do not define how many rules must be extracted or which values to combine, nor how to define their priorities, and they do not check whether this process really improves the firewall packet filtering performance. In this paper we propose a fully automated framework composed of a log management infrastructure, policy compliance checking and a tool that, based on the log messages collected from all devices in the network, calculates rule rates from traffic data and re-orders rule ranks, guaranteeing a conflict-free configuration and maximum performance optimization. Moreover, our tool is able to extract a rule from the "deny all" rule if this leads to further improved performance. To make the framework fully automatic we are currently working on defining a threshold to decide how many logs are needed to automatically start a rule set update.

3 Adaptive Conflict-Free Optimization (ACO) Algorithm

In the following we refer to a tagged device rule set, denoted by R = [R1,…,RN]. The index i of each rule in R is also called its rule rank. We use the following definitions:

• Pi is the rule rate, i.e. the matching ratio of rule i, defined as the ratio of the number ni(T) of packets matching Ri to the overall number n(T) of packets processed by the tagged device in the reference time interval T.
• Ci is the rule weight, i.e. the computational cost of rule i; if the same processing complexity can be assumed for each rule in R, then Ci = i·Pi.
• C(R) is the device rule set overall computational cost, computed as the sum of the rule weights, C(R) = Σi Ci.

Our aim is to minimize C(R) in all network devices, under the constraint of full rule consistency, so as to improve both device and global filtering operation. In the following paragraphs we describe the phases of the algorithm. To develop our algorithm we use two repository systems, in particular:

DVDB: a database storing device configurations, including the security policies.
LogDB: a database designed to store all log messages coming from the devices.
Analysis and correlation among logs are performed in order to know how many times each device’s rule was matched. 3.1 Phase 1: Starting ACO ACO starts operation when: i) policy configurations (rule set) are retrieved to solve conflicts in all devices; ii) a sufficiently large amount of log for each device are collected, e.g. to allow reliable weight estimates (i.e. logs collected in a day). Since ACO is aimed at working in real time, we need to decide which events trigger its run. We monitor in real time all devices and decide to start optimization process when at least one of the following events occurs: • • rule set change (such as rule insertion, modification and removal); the number of logs received from a device in the last collection time interval (in our implementation set to 60 s) has grown more than 10% with respect to the previous collection interval. The first criterion is motivated mainly to check the policy consistency; the second one to optimize performance adapting to traffic load. Specifically, performance optimization is needed the more the higher the traffic load, i.e. as traffic load attains critical values. In fact, rule set processing time optimization is seen as a form of protection of secure networks from malicious overloads (DoS attacks by dummy traffic flooding). 3.2 Phase 2: Data Import The algorithm retrieves from Device DB (DVDB): • • the IP address of devices interfaces to the networks; devices rule set R. For each device algorithm retrieves all rules hit number (how many times a rule was applied to a packet) from Log DB (LogDB). Then it calculates rule match rate (Pi) and rule weight (Ci). 3.3 Phase 3: Rules Classification In this phase for each device a classifier analyzes one by one the rules in R and it determines the relations between Ri and all the rules previously analysed [R1,…,Ri–1]. A data structure called Complete Pseudo Tree (CPT) is built out of this analysis. 134 G. Maiolini, L. Cignini, and A. Baiocchi Definition. A pseudo tree is a hierarchically organized data structure that represents relations between each rule analysed. A pseudo tree might be formed by more than one tree. Each tree node represents a rule in R. The relation parent-children in the trees reflect the inclusion relation between the portions of traffic matched by the selectors of the rules: a rule will be a child node if the traffic matched by its selectors is included in the portion of traffic matched by the selectors of the parent node. Any two rules whose selectors match no overlapping portions of traffic will not be related by any inclusion relation. In each tree there will be a root node which represents a rule that includes all the rules in the tree and there will be one or more leaves which represent the most specific rules. When the classifier finds a couple of rules which are not related by an inclusion relation, it will split one of them into two or three new rules so as to obtain derived rules that can be classified in the pseudo tree. The output of this phase is a conflict-free tree where there remain only redundant rules that will be eliminated in the next phase. The pseudo code of the algorithm for the identification of the Complete Pseudo Tree is listed below. 
Algorithm
1: Create new Complete Pseudo Tree CPT
2: foreach rule r in Ruleset do
3:   foreach ClassifiedRule cR in CPT do
4:     classify(r, cR)
5:     if r is to be fragmented then
6:       fragment(r, classifiedRule)
7:       remove(r, pseudoTree)
8:       insert fragments at the bottom of the Ruleset
9:       calculate statistics from LogDB (fragments)
10:    else
11:      insert(r, CPT)

3.4 Phase 4: Optimization

In this phase the core operations for optimizing the single device rule lists are performed. The aim of these operations is twofold: to restrict the number of rules in every rule list without changing the external behavior of the device, and to optimize filtering performance. We take into consideration the data structures introduced in the previous section, namely the Device Pseudo Tree (DPT), one for each device, obtained from the CPT by considering only the rules belonging to that device. Each of these structures gives a hierarchical representation of the rules in the rule list of each device. It may happen that a rule has the same action as a rule that directly includes it. This means that the child rule is in a way redundant because, if it were not in the rule list, the same portion of traffic would be matched by the parent rule, which has the same action. The child rule is therefore not necessary to describe the device behavior and can be eliminated, simplifying the device rule list. Our algorithm will locate, in every device pseudo tree, all the cases in which a child rule has the same action as its father's, and will delete the child rule. The rule set obtained is thus composed of two kinds of rules: completely disjoint rules, which can be placed at any rank of the rule list, and rules characterized by dependency constraints. In addition, we need to update the rate of the father rule when one or more child rules are deleted. At this point each device rule set R is re-ordered according to non-increasing rule rates, i.e. so that Pi ≥ Pj for i ≤ j. The resulting ordering minimizes the overall cost C(R), yet it does not guarantee the correctness of the policies implemented. As a matter of fact, it may happen that a more specific rule is placed after a more general rule, violating the constraints imposed by security policy consistency. For this reason, after the re-ordering operation, the father-child relations of the DPT are restored. It is clear that the gain achieved by optimization heavily depends on the degree of dependency among the rules. The two limit cases are: i) no rule dependencies (all disjoint rules), which yields the biggest optimization margin; ii) complete dependency (every rule depends on every other one), where the optimization process produces a near-zero gain. The device rule set total cost C(R) is evaluated and fed as input to the next phase. The pseudo code of the optimization algorithm is listed below.
Algorithm
1: foreach Node node in CPT do
2:   get the id of the device the rule belongs to
3:   if exist DPT.id == id then
4:     insert(node, DPT)
5:   else
6:     create(DPT, id)
7:     insert(node, DPT)
8: foreach DPT do
9:   foreach Rule rule in DPT do
10:    if rule.action == rule's father.action then
11:      update rule's father.rate
12:      delete(rule)
13: foreach Device dv do
14:   sort the dv.ruleset except deny all in non-increasing order of rule.rate
15:   foreach Rule rule do
16:     if rule is a child
17:       move rule just above father rule
18:   calculate dv.cost

3.5 Phase 5: Extracting Rules from the Deny all Rule

The common idea behind extracting rules from the deny all rule is to obtain a better optimization rate. It consists in selecting only heavily invoked rules and simply inserting them into the rule set according to their rates. However, this is a very delicate operation, since the inclusion of these rules often does not improve performance. Table 1 shows such a case: two new rules extracted and ordered according to their rates produce an increment of the starting value of C(R), which was 5.16.

Table 1. Rule list with two rules extracted (Rank = 2 and Rank = 7)
Rank   Pi     Ci
1      0.18   0.18
2      0.10   0.20
3      0.02   0.06
4      0.07   0.28
5      0.17   0.85
6      0.12   0.72
7      0.05   0.35
8      0.03   0.24
9      0.01   0.09
10     0.25   2.50
C(R) = 5.47

In addition, extracted rules are always disjoint from all the others in the rule set, so it is impossible to introduce additional conflicts. Our algorithm extracts a new rule when its rate exceeds 20% of the deny all rule rate. But this is not enough; in fact we perform an additional check to assess the efficacy of the new rule in the optimization process. Changing the position of rules implies cost changes, so we choose the position of the rule that grants the lowest overall cost C(R). The derived rule is actually inserted in the rule list only if the overall cost improves over the value it had before rule extraction. Detailed pseudo code of this phase is listed below.

Algorithm
1: foreach Device dv
2:   foreach denyall log record in LogDB
3:     count log occurrence
4:     if log.rate > 0.2 · denyall.rate
5:       extract new rule from log record
6:       add rule to extracted_ruleset
7:   foreach Rule rule in extracted_ruleset
8:     calculate vector dv.costvector
9:     if min(dv.costvector) < dv.cost
10:      update dv.cost
11:      update ruleset including rule

3.6 Phase 6: Update Devices

At this point we have obtained, for each packet filtering device, an optimized and conflict-free rule list, shaped on the traffic that flowed through the network. In this phase the algorithm updates the device configurations in the device DB. The network management system manages the configuration upload to the deployed devices.

4 Performance Evaluations

Our approach is based on a real test scenario, even if, due to privacy issues, we cannot provide references and traffic contents. We observed the traffic behaviour during one day on ten different devices deployed in an internal network. We analyzed the configurations in order to detect and solve possible conflicts; the results of this phase are not reported here because we focus on log gathering and optimization. We stored the conflict-free configurations in the device DB. We also configured the devices to send their logs to the machine where our tool is installed. We collected logs for 24 hours, storing them in the LogDB. We then started our tool, based on the algorithm described in section 3, obtaining different levels of optimization depending on device configuration and traffic. Our optimization rate is measured by the parameter C(R).
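For illustration (this is not the authors' tool), the following Python sketch computes C(R) = Σi i·Pi for the rule set of Table 3 below under three orderings, reproducing the figures quoted in the text; the constrained ordering is taken directly from Table 4, and the helper names are assumptions made for the example.

rates = {1: 0.01, 2: 0.12, 3: 0.02, 4: 0.18, 5: 0.03, 6: 0.07, 7: 0.17, 8: 0.40}  # rule -> Pi (rule 8 = deny all)

def cost(order):
    # C(R) for a given ordering (list of rule ids, first element = rank 1).
    return sum(rank * rates[rule] for rank, rule in enumerate(order, start=1))

initial = [1, 2, 3, 4, 5, 6, 7, 8]
# Line 14 of the phase-4 algorithm: sort everything except the final deny-all rule by rate.
sorted_free = sorted(initial[:-1], key=lambda r: rates[r], reverse=True) + [8]
# Order after restoring the father-child constraints (the order reported in Table 4).
constrained = [4, 3, 6, 7, 2, 5, 1, 8]

print(round(cost(initial), 2))      # 5.99
print(round(cost(sorted_free), 2))  # 4.70  (about 22% below the initial cost)
print(round(cost(constrained), 2))  # 5.16  (about 14% below, and conflict-free)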
In this section we describe the results obtained on the most significant device deployed in the network, as it illustrates the concept of the algorithm. Table 2 shows the initial device rule set, comprising the access list for IP traffic and the IPSec configuration. It is easy to see that if we exchange rules 3, 6 and 7, shadowing conflicts occur. Table 3 shows the Pi and Ci values of the initial rule set, calculated by retrieving the data stored in the LogDB. According to the metric used, we obtain a value of C(R) equal to 5.99. The re-ordering operation (line 14 of the phase 4 algorithm) produces the best optimization gain (+22%), but adjustments are necessary to ensure a conflict-free configuration. At the end of phase 4 we obtained the order shown in Table 4. The value of C(R) obtained is 5.16, so the improvement is about 14%. Another good optimization is obtained by the algorithm in phase 5. In this phase new rules are extracted because a packet flow matched a specific denied rule (obviously covered by deny all) many times. Our algorithm calculates the best position at which to insert the rule in the list in order to minimize the cost C(R); in this case it has been positioned at the top of the rule list. Table 5 shows the resulting changes in the Pi and Ci values. The value of C(R) obtained is 4.56, so the additional gain obtained in this phase is about 10%. At the end of the process the total optimization gain is about 24%. Similar values (±2%) were obtained on the remaining devices, although optimization is not always feasible. In the end, we have achieved a conflict-free configuration with a gain of 24%. These results depend on the specific traffic behaviour and on the security policies applied on the devices. Our work is continuing with further traffic tests, looking for relations between the number of rules and the optimization rate, and refining the extraction from deny all rules. A further issue is the fine tuning of the heuristic parameters used in the algorithm, such as the time interval between two updates. Finally, a device spends significant CPU time sending logs, especially when it has to send them for all packets flowing through it.

Table 2.
Rank   Protocol   Source IP      Source Port   Dest. IP       Dest. Port   Action
1      tcp        192.168.10.*   80            192.168.20.*   80           Deny
2      tcp        192.168.10.*   21            192.168.20.*   21           Allow
3      tcp        10.1.1.23      any           20.1.1.23      80           Allow
4      udp        192.168.3.5    53            192.168.3.5    any          Deny
5      udp        192.168.3.5    any           192.168.*.*    80           Allow
6      tcp        10.1.1.*       any           20.1.1.*       80           Deny
7      tcp        10.1.*.*       any           20.1.1.*       80           Allow
8      any        any            any           any            any          Deny

Table 3.
Rank   Pi     Ci
1      0.01   0.01
2      0.12   0.24
3      0.02   0.06
4      0.18   0.72
5      0.03   0.15
6      0.07   0.42
7      0.17   1.19
8      0.40   3.20
C(R) = 5.99

Table 4.
Old rank   Rank   Pi     Ci
4          1      0.18   0.18
3          2      0.02   0.04
6          3      0.07   0.21
7          4      0.17   0.68
2          5      0.12   0.60
5          6      0.03   0.18
1          7      0.01   0.07
8          8      0.40   3.20
C(R) = 5.16 (+14%)

Table 5.
Rank   Pi     Ci
1      0.20   0.20
2      0.18   0.36
3      0.02   0.06
4      0.07   0.28
5      0.17   0.85
6      0.12   0.72
7      0.03   0.21
8      0.01   0.08
9      0.20   1.80
C(R) = 4.56 (+10%)

References
1. Hari, H.B., Suri, S., Parulkar, G.: Detecting and Resolving Packet Filter Conflicts. In: Proceedings of IEEE INFOCOM 2000, Tel Aviv (2000)
2. Al-Shaer, E., Hamed, H.: Modeling and Management of Firewall Policies. In: IEEE eTransactions on Network and Service Management, vol. 1-1 (2004)
3. Al-Shaer, E., Hamed, H., Boutaba, R., Hasan, M.: Conflict Classification and Analysis of Distributed Firewall Policies. IEEE Journal on Selected Areas in Communications 23(10) (2005)
4.
Al-Shaer, E., Hamed, H.: Firewall Policy Advisor for Anomaly Detection and Rule Editing. In: Proceedings of IEEE/IFIP Integrated Management Conference (IM 2003), Colorado Springs (2003) 5. Ferraresi, S., Pesic, S., Trazza, L., Baiocchi, A.: Automatic Conflict Analysis and Resolution of Traffic Filtering Policy for Firewall and Security Gateway. In: IEEE International Conference on Communications 2007 (ICC 2007), Glasgow (2007) 6. Ferraresi, S., Francocci, E., Quaglini, A., Picasso, F.: Security Policy Tuning among IP Devices. In: Apolloni, B., Howlett, R.J., Jain, L. (eds.) KES 2007, Part II. LNCS (LNAI), vol. 4693. Springer, Heidelberg (2007) 7. Fulp, E.W.: Optimization of network firewall policies using directed acyclical graphs. In: Proceedings of the IEEE Internet Management Conference (2005) 8. Acharya, S., Wang, J., Ge, Z., Znati, T., Greenberg, A.: Simulation study of firewalls to aid improved performance. In: Proceedings of 39th Annual Simulation Symposium (ANSS 2006), Huntsville (2006) 9. Acharya, S., Wang, J., Ge, Z., Znati, T., Greenberg, A.: Traffic-aware firewall optimization Strategies. In: IEEE International Conference on Communications (ICC 2006), Istambul (2006) 10. Zhao, L., Inoue, Y., Yamamoto, H.: Delay reduction for linear-search based packet filters. In: International Technical Conference on Circuits/Systems, Computers and Communication (ITC-CSCC 2004), Japan (2004) 11. Hamed, H., Al-Shaer, E.: Dynamic rule ordering optimization for high speed firewall Filtering. In: ACM Symposium on InformAtion, Computer and Communications Security (ASIACCS 2006), Taipei (2006) An Intrusion Detection System Based on Hierarchical Self-Organization E.J. Palomo, E. Domínguez, R.M. Luque, and J. Muñoz Department of Computer Science E.T.S.I. Informatica, University of Malaga Campus Teatinos s/n, 29071 – Malaga, Spain {ejpalomo,enriqued,rmluque,munozp}@lcc.uma.es Abstract. An intrusion detection system (IDS) monitors the IP packets flowing over the network to capture intrusions or anomalies. One of the techniques used for anomaly detection is building statistical models using metrics derived from observation of the user's actions. A neural network model based on self organization is proposed for detecting intrusions. The selforganizing map (SOM) has shown to be successful for the analysis of high-dimensional input data as in data mining applications such as network security. The proposed growing hierarchical SOM (GHSOM) addresses the limitations of the SOM related to the static architecture of this model. The GHSOM is an artificial neural network model with hierarchical architecture composed of independent growing SOMs. Randomly selected subsets that contain both attacks and normal records from the KDD Cup 1999 benchmark are used for training the proposed GHSOM. Keywords: Network security, self-organization, intrusion detection. 1 Introduction Nowadays, network communications become more and more important to the information society. Business, e-commerce and other network transactions require more secured networks. As these operations increases, computer crimes and attacks become more frequents and dangerous, compromising the security and the trust of a computer system and causing costly financial losses. In order to detect and prevent these attacks, intrusion detection systems have been used and have become an important area of research over the years. An intrusion detection system (IDS) monitors the network traffic to detect intrusions or anomalies. 
There are two different approaches used to detect intrusions [1]. The first approach is known as misuse detection, which compares previously stored signatures of known attacks. This method is good detecting many or all known attacks, having a successful detection rate. However, they are not successful in detecting unknown attacks occurrences and the signature database has to be manually modified. The second approach is known as anomaly detection. First, these methods establish a normal activity profile. Thus, variations from this normal activity are considered anomalous activity. Anomaly-based systems assume that anomalous activities are intrusion attempts. Many of these anomalous activities are frequently normal E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 139–146, 2009. springerlink.com © Springer-Verlag Berlin Heidelberg 2009 140 E.J. Palomo et al. activities, showing false positives. Many anomaly detection systems build statistical models using metrics derived from the user's actions [1]. Also, anomaly detection systems using data mining techniques such as clustering, support vector machines (SVM) and neural network systems have been proposed [2-4]. Several IDS using a self-organizing maps have been done, however they have many difficulties detecting a wide variety of attacks with low false positive rates [5]. The self-organizing map (SOM) [6] has been applied successfully in multiple areas. However, the network architecture of SOMs has to be established in advance and it requires knowledge about the problem domain. Moreover, the hierarchical relations among input data are difficult to represent. The growing hierarchical SOM (GHSOM) faces these problems. The GHSOM is an artificial neural network which consists of several growing SOMs [7] arranged in layers. The number of layers, maps and neurons of maps are automatically adapted and established during a training process of the GHSOM, according to a set of input patterns fed to the neural network. Applied to an intrusion detection system, the input patterns will be samples of network traffic, which can be classified as anomalous or normal activities. The data analyzed from network traffic can be numerical (i.e. number of seconds of the connection) or symbolic (i.e. protocol type used). Usually, the Euclidean distance has been used as metric to compare two input data. However, this is not suitable for symbolic data. Therefore, in this paper an alternative metric for GHSOMs where both symbolic and numerical data are considered is proposed. The implemented IDS using the GHSOM, was trained with the KDD Cup 1999 benchmark data set [8]. This data set has served as the first and only reliable benchmark data set that has been used for most of the research work on intrusion detection algorithms [9]. This data set includes a wide variety of simulated attacks. The remainder of this paper is organized as follows. Section 2 discusses the new GHSOM model used to build the IDS. Then, the training algorithm is described. In Section 3, we show some experimental results obtained from comparing our implementation of the Intrusion Detection System (IDS) with other related works, where data from the KDD Cup 1999 benchmark are used. Section 4 concludes this paper. 2 GHSOM for Intrusion Detection The IDS implemented is based on GHSOM. Initially, the GHSOM consists of a single SOM of 2x2 neurons. 
Then, the GHSOM adapts its architecture depending on the input patterns, so that the GHSOM structure mirrors the structure of the input data, yielding a good data representation. This neural network structure is used to classify the input data into groups, where each neuron represents a group of data with similar features. The level of data representation of each neuron i is measured by the quantization error of the neuron (qe_i). The qe measures how well a neuron represents the data set mapped onto it: the higher the qe, the higher the heterogeneity of that data. Usually the qe has been expressed in terms of the Euclidean distance between the input pattern and the weight vector of a neuron. However, in many real life problems not only numerical features are present, but also symbolic features can be found. To take an obvious example, among the features to analyze from the data for building an IDS, three symbolic features are found: protocol type (i.e. TCP, UDP or ICMP), service (i.e. HTTP, SMTP, FTP, TELNET, etc.) and the flag of the status of the connection (i.e. SF, S1, REJ, etc.). Unlike numerical data, symbolic data do not have an associated order and cannot be measured by a distance. It makes no sense to use the Euclidean distance between two symbolic values, for example the distance between the HTTP and FTP protocols. It seems better to use a similarity measure rather than a distance measure for symbolic data. For that reason, in this paper we introduce the entropy as the measure of the representation error of a neuron for symbolic data, together with the Euclidean distance for numerical data.

Fig. 1. Sample architecture of a GHSOM

Let x_j be the j-th input pattern, where x_j^n is the vector of its numerical features and x_j^s is the vector of its symbolic features. The error of a unit i (e_i) in the representation is defined as follows:

e_i = ( Σ_{x_j ∈ C_i} ‖w_i^n − x_j^n‖ , − Σ_x p_x log p_x ),   (1)

where the two components are the error components of the numerical and symbolic features, respectively, C_i is the set of patterns mapped onto the unit i, and p_x is the probability of the element x in C_i. The quantization error of the unit i is given by expression (2):

qe_i = ‖e_i‖.   (2)

First of all, the quantization error at the layer 0 map (qe_0) has to be computed as shown above. In this case, the error is computed as specified in (1), but using instead of w the mean of all the input data, and taking as C the set of all input patterns. The training of a map is done as the training of a single SOM [6]. An input pattern is randomly selected and each neuron determines its activation according to a similarity measure. In our case, since we take into account the presence of both symbolic and numerical data, the neuron with the smallest similarity measure defined in (3) becomes the winner:

d(x_j, w_i) = ( ‖x_j^n − w_i^n‖ , − log p(x_j^s, w_i^s) ).   (3)

For the numerical component, the first term is the Euclidean distance between the two vectors. For symbolic data, the measure checks whether the two vectors are the same or not, that is, the probability p can only take the value 1, if the vectors are the same, or 0.5 if they are different. Therefore, for symbolic data the second term will be 0 if x_j^s and w_i^s are equal, and log 2 if they are not the same. By taking into account the new similarity measure, the index c of the winner neuron is defined in (4):

c = arg min_i ‖d(x_j, w_i)‖.   (4)

The weight vector of a neuron w_i is adapted according to expression (5). For the numerical components, the winner and its neighbors, whose amount of adaptation follows a Gaussian neighborhood function h_ci(t), are adapted.
For symbolic data, just the winner is adapted: its symbolic components are set to the mode of the symbolic components of the set of input patterns mapped onto the winner. For the numerical components, the update is the usual SOM rule:

w_i(t+1) = w_i(t) + α(t) h_ci(t) ( x_j(t) − w_i(t) ).   (5)

The GHSOM growing and expansion is controlled by means of two parameters: τ1, which is used to control the growth of a single map, and τ2, which is used to control the expansion of each neuron of the GHSOM. Specifically, a map m stops growing if the mean of the map's quantization errors (MQE_m) reaches a certain fraction τ1 of the qe of the corresponding neuron that was expanded to create the map m. Also, a neuron is not expanded if its quantization error (qe_i) is smaller than a certain fraction τ2 of qe_0. Thus, the choice of τ2 determines how deep the resulting hierarchy will be, while τ1 determines how large the individual maps will be. In this way, these two parameters provide control over the resulting hierarchical architecture. Note that these parameters are the only ones that have to be established in advance. The pseudocode of the training algorithm of the proposed GHSOM is defined as follows.

Step 1. Compute the mean of all the input data and then the initial quantization error qe_0.
Step 2. Create an initial map with 2x2 neurons.
Step 3. Train the map for a number of iterations as a single SOM, using expressions (3), (4) and (5).
Step 4. Compute the quantization error qe_i of each neuron according to expression (2).
Step 5. Calculate the mean of all units' quantization errors of the map (MQE_m).
Step 6. If MQE_m < τ1 · qe_u, go to step 9, where qe_u is the qe of the corresponding unit u in the upper layer that was expanded. For the first layer, this is qe_0.
Step 7. Select the neuron e with the highest qe and its most dissimilar neighbor e' according to expression (6), where Λ_e is the set of neighbors of e:

e' = arg max_{w_i ∈ Λ_e} ‖d(w_e, w_i)‖.   (6)

Step 8. Insert a row or column of neurons between e and e', initializing their weight vectors as the means of their respective neighbors. Go to step 3.
Step 9. If qe_i < τ2 · qe_0 for all neurons i in the map, go to step 11.
Step 10. Select an unsatisfied neuron and expand it, creating a new map in the lower layer. The parent of the new map is the expanded neuron, and the weight vectors of the new map are initialized as the mean of their parent and its neighbors. Go to step 2.
Step 11. If there exist remaining maps to be processed, select one and go to step 3. Otherwise, the algorithm ends.
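As a minimal sketch of the mixed similarity measure of expressions (3)-(4) — illustrative only, not the authors' implementation, with toy weight vectors and hypothetical helper names — the following Python code computes the Euclidean term on the numerical features, the −log p term on the symbolic features (p = 1 if equal, 0.5 otherwise), and selects the winner by the norm of the two-component measure.

import math

def similarity(x_num, x_sym, w_num, w_sym):
    d_num = math.dist(x_num, w_num)                 # Euclidean distance on numerical features
    p = 1.0 if x_sym == w_sym else 0.5              # probability of the symbolic match
    d_sym = -math.log(p)                            # 0 if equal, log 2 if different
    return math.hypot(d_num, d_sym)                 # norm of (d_num, d_sym)

def winner(x_num, x_sym, units):
    # Index of the unit minimizing the similarity measure (expression (4)).
    return min(range(len(units)),
               key=lambda i: similarity(x_num, x_sym, units[i][0], units[i][1]))

# Toy 2x2 map: each unit holds (numerical weights, symbolic weights).
units = [([0.1, 0.2], ('tcp', 'http', 'SF')),
         ([0.9, 0.8], ('udp', 'domain', 'SF')),
         ([0.5, 0.5], ('tcp', 'ftp', 'REJ')),
         ([0.0, 1.0], ('icmp', 'ecr_i', 'SF'))]

print(winner([0.12, 0.22], ('tcp', 'http', 'SF'), units))   # -> 0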
3 Experimental Results

Our IDS based on the new GHSOM model was trained with the pre-processed KDD Cup 1999 benchmark data set created by MIT Lincoln Laboratory. The purpose of this benchmark was to build a network intrusion detector capable of distinguishing between intrusions, or attacks, and normal connections. We have used the 10% KDD Cup 1999 benchmark data set, which contains 494021 connection records, each of them with 41 features. Here, a connection is a sequence of TCP packets flowing between a source and a target IP address. Since some features alone cannot constitute a sign of anomalous activity, it is better to analyze connection records rather than individual packets. Among the 41 features, three are symbolic: protocol type, service and the status flag of the connection. The training data set contains 22 attack types in addition to normal records, which fall into four main categories [10]: Denial of Service (DoS), Probe, Remote-to-Local (R2L) and User-to-Root (U2R). In this paper, we have selected two data subsets for training the GHSOM from the total of 494021 connection records: SetA with 100000 connection records and SetB with 169000. Both SetA and SetB contain the 22 attack types. We tried to select the data in such a way that all record types had the same distribution. However, the 10% KDD Cup data set has an irregular distribution, and this was finally mirrored in our selection. The two data subsets were trained with 0.1 as the value of both parameters τ1 and τ2, since with these values we achieved good results and a very simple architecture. In fact, each trained GHSOM generated just two layers with 16 neurons, although with a different arrangement. The GHSOM trained with SetB is the one shown in Fig. 1.

Table 1. Training results for SetA and SetB
Training Set   Detected (%)   False Positive (%)   Identified (%)
SetA           99.98          3.03                 94.99
SetB           99.98          5.74                 95.09

Many related works are only interested in classifying an input pattern as one of two record types: anomalous or normal. Taking into account just these two groups, normal records that are classified as anomalous are known as false positives, whereas anomalous records that are classified as normal records are known as missed detections. However, we are also interested in classifying an anomalous record into its attack type, that is, taking into account 23 groups (22 attack types plus normal records) instead of 2 groups. Hence, we call identification rate the fraction of connection records that are correctly identified as their respective record types. The training results of the two GHSOMs obtained with the subsets SetA and SetB are shown in Table 1. Both subsets achieve a 99.98% attack detection rate and false positive rates of 3.03% and 5.74%, respectively. Regarding the identification of the attack type, around 95% of the records were correctly identified in both cases. We have also simulated the trained GHSOMs with the 494021 connection records from the benchmark data set. This simulation consists of classifying these data with the trained GHSOMs without any modification of the GHSOMs, that is, without a learning process. The simulation results of both GHSOMs are given in Table 2. Here, a 99.9% detection rate is achieved. Also, the identification rate rises up to 97% in both cases. The false positive rate increases for SetA during the simulation, although for SetB it is lower than the rate obtained after training. Note that during both training and simulation we used the 41 features and all the 22 attacks existing in the training data set. Moreover, the resulting number of neurons is much reduced; in fact, there are fewer neurons than groups, although this is due to the scarce amount of connection records for certain attack groups.

Table 2. Simulation results with 494021 records for SetA and SetB
Training Set   Detected (%)   False Positive (%)   Identified (%)
SetA           99.99          5.41                 97.09
SetB           99.91          5.11                 97.15

In Table 3, we compare the training results of SetA with the results obtained in [9, 10], where SOMs were also used to build IDSs. In order to differentiate the two IDSs based on self-organization, we call them SOM and K-Map, respectively, using the authors' notation, whereas our trained IDS is called GHSOM. From the first work, we chose the only SOM trained on all the 41 features, which was composed of 20x20 neurons. Another IDS implementation was proposed in the second work. There, a hierarchical Kohonen net (K-Map), composed of three layers, was trained with a subset of 169000 connection records, taking into account the 41 features and 22 attack types. Their best result was a 99.63% detection rate after testing the K-Map, although with several restrictions.
This result was achieved using a pre-selected combination of 20 features, which were divided into three levels, where each feature sub-combination was fed to a different layer. Also, they used just three attack types during testing, whereas we used the 22 attack types. Moreover, the architecture of the hierarchical K-Map was established in advance, using 48 neurons in each layer, that is, 144 neurons, whereas we used just 16 neurons that were generated during the training process without any human intervention.

Table 3. Comparison results for different IDS implementations
           Detected (%)   False Positive (%)
GHSOM      99.98          3.03
SOM        81.85          0.03
K-Map      99.63          0.34

4 Conclusions

This paper has presented a novel Intrusion Detection System based on growing hierarchical self-organizing maps (GHSOMs). These neural networks are composed of several SOMs arranged in layers, where the number of layers, maps and neurons is established during the training process, mirroring the inherent data structure. Moreover, we have taken into account the presence of symbolic features in addition to numerical features in the input data. In order to improve the classification of the input data, we have introduced a new metric for GHSOMs based on entropy for symbolic data together with the Euclidean distance for numerical data. We have used the 10% KDD Cup 1999 benchmark data set to train our GHSOM-based IDS; it contains 494021 connection records, where 22 attack types in addition to normal records can be found. We trained and simulated two GHSOMs with two different subsets, SetA with 100000 connection records and SetB with 169000 connection records. SetA and SetB achieved a 99.98% detection rate and false positive rates of 3.03% and 5.74%, respectively. The identification rate, that is, the fraction of connection records identified as their correct connection record types, was 94.99% and 95.09%, respectively. After the simulation of the two trained GHSOMs with the 494021 connection records, we achieved a 99.9% detection rate and false positive rates between 5.11% and 5.41% on both subsets. The identification rate was around 97%.

Acknowledgements. This work is partially supported by the Spanish Ministry of Education and Science under contract TIN-07362.

References
1. Denning, D.: An intrusion-detection model. IEEE Transactions on Software Engineering SE-13(2), 222–232 (1987)
2. Lee, W., Stolfo, S., Chan, P., Eskin, E., Fan, W., Miller, M., Hershkop, S., Zhang, J.: Real time data mining-based intrusion detection. In: DARPA Information Survivability Conference & Exposition II, vol. 1, pp. 89–100 (2001)
3. Maxion, R., Tan, K.: Anomaly detection in embedded systems. IEEE Transactions on Computers 51(2), 108–120 (2002)
4. Tan, K., Maxion, R.: Determining the operational limits of an anomaly-based intrusion detector. IEEE Journal on Selected Areas in Communications 21(1), 96–110 (2003)
5. Ying, H., Feng, T.J., Cao, J.K., Ding, X.Q., Zhou, Y.H.: Research on some problems in the kohonen som algorithm. In: International Conference on Machine Learning and Cybernetics, vol. 3, pp. 1279–1282 (2002)
6. Kohonen, T.: Self-organized formation of topologically correct feature maps. Biological Cybernetics 43(1), 59–69 (1982)
7. Fritzke, B.: Growing grid - a self-organizing network with constant neighborhood range and adaptation strength. Neural Processing Letters 2(5), 9–13 (1995)
8.
Lee, W., Stolfo, S., Mok, K.: A data mining framework for building intrusion detection models. In: IEEE Symposium on Security and Privacy, pp. 120–132 (1999) 9. Sarasamma, S., Zhu, Q., Hu, J.: Hierarchical kohonenen net for anomaly detection in network security. IEEE Transactions on Systems Man and Cybernetics Part BCybernetics 35(2), 302–312 (2005) 10. DeLooze, L., DeLooze, A.F.: Attack characterization and intrusion detection using an ensemble of self-organizing maps. In: 7th Annual IEEE Information Assurance Workshop, pp. 108–115 (2006) Evaluating Sequential Combination of Two Genetic Algorithm-Based Solutions for Intrusion Detection Zorana Banković, Slobodan Bojanić, and Octavio Nieto-Taladriz ETSI Telecomunicación, Universidad Politécnica de Madrid, Ciudad Universitaria s/n, 28040 Madrid, Spain {zorana,slobodan,nieto}@die.upm.es Abstract. The paper presents a serial combination of two genetic algorithm-based intrusion detection systems. Feature extraction techniques are deployed in order to reduce the amount of data that the system needs to process. The designed system is simple enough not to introduce significant computational overhead, but at the same time is accurate, adaptive and fast. There is a large number of existing solutions based on machine learning techniques, but most of them introduce high computational overhead. Moreover, due to its inherent parallelism, our solution offers a possibility of implementation using reconfigurable hardware with the implementation cost much lower than the one of the traditional systems. The model is verified on KDD99 benchmark dataset, generating a solution competitive with the solutions of the state-of-the-art. Keywords: intrusion detection, genetic algorithm, sequential combination, principal component analysis, multi expression programming. 1 Introduction Computer networks are usually protected against attacks by a number of access restriction policies (anti-virus software, firewall, message encryption, secured network protocols, password protection). Since these solutions are proven to be insufficient for providing high level of network security, there is a need for additional support in detecting malicious traffic. Intrusion detection systems (IDS) are placed inside the protected network, looking for potential threats in network traffic and/or audit data recorded by hosts. IDS have three common problems that should be tackled when designing a system of the kind: speed, accuracy and adaptability. The speed problem arises from the extensive amount of data that needs to be monitored in order to perceive the entire situation. An existing approach to solving this problem is to split network stream into more manageable streams and analyze each in real-time using separate IDS. The event stream must be split in a way that covers all relevant attack scenarios, but this assumes that all attack scenarios must be known a priori. We are deploying a different approach. Instead of defining different attack scenarios, we extract the features of network traffic that are likely to take part in an attack. This provides higher flexibility since a feature can be relevant for more than one attack or is prone to be abused by an unknown attack. Moreover, we need only one IDS E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 147–154, 2009. springerlink.com © Springer-Verlag Berlin Heidelberg 2009 148 Z. Banković, S. Bojanić, and O. Nieto-Taladriz to perform the detection. Finally, in this way the total amount of data to be processed is highly reduced. 
Hence, the amount of time spent for offline training of the system and, afterwards, the time spent for attack detection are also reduced.

Incorporation of learning algorithms provides a potential solution for the adaptation and accuracy issues. A great number of machine learning techniques have been deployed for intrusion detection in both commercial systems and the state-of-the-art. These techniques can introduce a certain amount of 'intelligence' into the process of detecting attacks and are capable of processing large amounts of data much faster than a human. However, these systems can introduce high computational overhead. Furthermore, many of them do not deal properly with so-called 'rare' classes [5], i.e. the classes that have a significantly smaller number of elements than the rest of the classes. This problem occurs mostly due to the tendency for generalization that most of these techniques exhibit. Intrusions can be considered rare classes since the amount of intrusive traffic is considerably smaller than the amount of normal traffic. Thus, we need a machine learning technique that is capable of dealing with this issue.

In this work we present a genetic algorithm (GA) [9] approach for classifying network connections. GAs are robust, inherently parallel, adaptable and suitable for dealing with the classification of rare classes [5]. Moreover, due to their inherent parallelism, they offer the possibility of implementing the system using reconfigurable devices without the need to deploy a microprocessor. In this way, the implementation cost would be much lower than the cost of implementing a traditional IDS, while at the same time offering a higher level of adaptability. This work represents a continuation of our previous one [1], where we investigated the possibilities of applying a GA to intrusion detection while deploying a small subset of features. The experiments confirmed the robustness of the GA and inspired us to continue experimenting on the subject. Here we further investigate a combination of two GA-based intrusion detection systems. The first system in the chain is an anomaly-based IDS implemented as a simple linear classifier. This system exhibits a high false-positive rate. Thus, we have added a simple system based on if-then rules that filters the decisions of the linear classifier and in that way significantly reduces the false-positive rate. We actually create a strong classifier built upon weak classifiers, but without the need to follow the process of a boosting algorithm [15], as both of the created systems can be trained separately.

For evolving our GA-based system, the KDD99Cup training and testing dataset was used [6]. The KDD99Cup dataset was found to have quite a few drawbacks [7], [8], but it is still the prevailing dataset used for training and testing of IDS due to its good structure and availability [3], [4]. Because of these shortcomings, the presented results do not illustrate the behavior of the system in a real-world environment, but they do reflect its possibilities. In the general case, the performance of the implemented system highly depends on the training data.

In the following text, Section 2 gives a survey and comparison of machine learning techniques deployed for intrusion detection. Section 3 details the implementation of the system. Section 4 presents the benchmark KDD99 training and testing dataset, evaluates the performance of the system using this dataset and discusses the results.
Finally, the conclusions are drawn in Section 5. 2 Machine Learning Techniques for Intrusion Detection – Survey and Comparison In the recent past there has been a growing recognition of intelligent techniques for the construction of efficient and reliable intrusion detection systems. A complete survey of these techniques is hard to present at this point, since there are more than hundred IDS based on machine learning techniques. Some of the best-performed techniques used in the state-of-the-art apply GA [4], combination of neural networks and C4.5 [11], genetic programming (GP) ensemble [12], support vector machines [13] or fuzzy logic [14]. All of the named techniques have two steps: training and testing. The systems have to be constantly retrained using new data since new attacks are emerging every day. The advantage of all GA or GP-based techniques lies in their easy retraining. It’s enough to use the best population evolved in the previous iteration as initial population and repeat the process, but this time including new data. Thus, our system is inherently adaptive which is an imperative quality of an IDS. Furthermore, GAs are intrinsically parallel, since they have multiple offspring, they can explore the solution space in multiple directions at once. Due to the parallelism that allows them to implicitly evaluate many schemas at once, GAs are particularly well-suited to solving problems where the space of all potential solutions is too vast to search exhaustively in any reasonable amount of time, as network data. GA-based techniques are appropriate for dealing with rare classes. As they work with populations of candidate solutions rather than a single solution and employ stochastic operators to guide the search process, GAs cope well with attribute interactions and avoid getting stuck in local maxima, which together make them very suitable for dealing with classifying rare classes [5]. We have gone further by deploying standard F-measure as fitness function. F-value is proven to be very suitable when dealing with rare classes [5]. F-measure is a combination of precision and recall. Rare cases and classes are valued when using these metrics because both precision and recall are defined with respect to a rare class. None of the GA or GP techniques stated above considers the problem of rare classes. A technique that considers the problem of rare classes is given in [15]. Their solution is similar to ours in the sense that they deploy a boosting technique, which also assumes creating a strong classifier by combining several weak classifiers. Furthermore, they present the results in the terms of F-measure. The advantage of our system is that we can train the parts of our system independently, while boosting algorithm trains its parts one after another. Finally, if we want to consider the possibility of hardware implementation using reconfigurable hardware, not all the systems are appropriate due to their sequential nature or high computational complexity. Due to the parallelism of our algorithm a hardware implementation using reconfigurable devices is possible. This can lead to lower implementation cost with higher level of adaptability compared to the existing solutions and reduced amount of time for system training and testing. 150 Z. Banković, S. Bojanić, and O. 
Nieto-Taladriz

In short, the main advantage of our solution lies in the fact that it includes important characteristics (high accuracy and performance, dealing with rare classes, inherent adaptability, feasibility of hardware implementation) in one solution. We are not familiar with any existing solution that would cover all the characteristics mentioned above.

3 System Implementation

The implemented IDS is a serial combination of two IDSs. The complete system is presented in Fig. 1. The first part is a linear classifier that classifies connections into normal ones and potential attacks. Due to its very low false-negative rate, the decision that it makes on normal connections is considered correct. However, as it exhibits a high false-positive rate, if it opts for an attack, the decision has to be re-checked. This re-checking is performed by a rule-based system whose rules are trained to recognize normal connections. This part of the system exhibits a very low false-positive rate, i.e. the probability for an attack to be incorrectly classified as a normal connection is very low. In this way, the achieved false-positive rate of the entire system is significantly reduced while maintaining a high detection rate. As our system is trained and tested on the KDD99 dataset, the selection of the most important features is performed once at the beginning of the process. An implementation for a real-world environment, however, would require performing the feature selection process before each training process.

Fig. 1. Block Diagram of the Complete System

The linear classifier is based on a linear combination of three features. The features are identified as those that have the highest possibility to take part in an attack by deploying PCA [1]. The details of the PCA algorithm are explained in [16]. The selected features and their explanations are presented in Table 1.

Table 1. The features used to describe the attacks

Name of the feature        Explication
duration                   length (number of seconds) of the connection
src_bytes                  number of data bytes from source to destination
dst_host_srv_serror_rate   percentage of connections that have "SYN" errors

The linear classifier is evolved using a GA [9]. Each chromosome, i.e. potential solution to the problem, in the population consists of four genes, where the first three represent the coefficients of the linear classifier and the fourth one the threshold value. The decision about the current connection is made according to formula (1):

gene(1)*con(duration) + gene(2)*con(src_bytes) + gene(3)*con(dst_host_srv_serror_rate) < gene(4)    (1)

where con(duration), con(src_bytes) and con(dst_host_srv_serror_rate) are the values of the duration, src_bytes and dst_host_srv_serror_rate features of the current connection. The linear classifier is trained using an incremental GA [9]. The population contains 1000 individuals, which were trained during 300 generations. The mutation rate was 0.1 while the crossover rate was 0.9. These values were chosen after a number of experiments: the size of the population and the number of generations were increased until doing so no longer brought a significant performance improvement. The type of crossover deployed was uniform crossover, i.e. a new individual had equal chances to contain either of the genes of both of its parents.
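To make formula (1) concrete, here is a minimal sketch of how a four-gene chromosome classifies a single connection. The gene values and the connection record are hypothetical, and the convention that satisfying the inequality means "normal" is an assumption of the sketch, since the paper does not state the convention explicitly.

```python
# Illustrative sketch of the GA-evolved linear classifier of formula (1).
# The gene values below are hypothetical; the real coefficients are evolved by the GA.

def linear_classifier(genes, connection):
    """genes = [w_duration, w_src_bytes, w_serror_rate, threshold], as in formula (1)."""
    score = (genes[0] * connection["duration"]
             + genes[1] * connection["src_bytes"]
             + genes[2] * connection["dst_host_srv_serror_rate"])
    # Assumption: a weighted sum below the evolved threshold is taken as "normal",
    # otherwise the connection is flagged as a potential attack for re-checking.
    return "normal" if score < genes[3] else "potential attack"

chromosome = [0.02, 0.001, 5.0, 40.0]          # hypothetical individual
conn = {"duration": 12, "src_bytes": 230, "dst_host_srv_serror_rate": 0.9}
print(linear_classifier(chromosome, conn))
```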
The performance measure, i.e. the fitness function, was the squared percentage of the correctly classified connections, computed according to the formula:

fitness = (count / numOfCon)²    (2)

where count is the number of correctly classified connections, while numOfCon is the number of connections in the training dataset. The squared percentage was chosen rather than the simple percentage value because the achieved results were better. The result of this GA was its best individual, which forms the first part of the system presented in Fig. 1.

The second part of the system (Fig. 1) is a rule-based system, where simple if-then rules for recognizing normal connections are evolved. The most important features were taken over from the results obtained in [2] using Multi Expression Programming (MEP). The features and their explanations are listed in Table 2.

Table 2. The features used to describe normal connections

Name of the feature   Explication
service               destination service (e.g. telnet, ftp)
hot                   number of hot indicators
logged_in             1 if successfully logged in; 0 otherwise

An example of a rule can be the following one: if (service="http" and hot="0" and logged_in="0") then normal;

The rules are trained using an incremental GA with the same parameters used for the linear classifier. Each 3-gene chromosome represents a rule, where the value of each gene is the value of its corresponding feature. However, the population used in this case contained 500 individuals, as no improvements were achieved with larger populations. The result of the training was a set of the 200 best-performing rules. The fitness function in this case was the F-value with the parameter 0.8:

fitness = (1.8 * recall * precision) / (0.8 * precision + recall),  precision = TP / (TP + FP),  recall = TP / (TP + FN)    (3)

where TP, FP and FN stand for the number of true positives, false positives and false negatives, respectively.

The system presented here was implemented in the C++ programming language. The software for this work used the GAlib genetic algorithm package, written by Matthew Wall at the Massachusetts Institute of Technology [10]. The time for training the implemented system is 185 seconds, while the testing process takes 45 seconds. The reason for the short training time lies in deploying an incremental GA whose population is not big all the time, i.e. it grows after each iteration. The system was demonstrated on an AMD Athlon 64 X2 Dual Core Processor 3800+ with 1GB of RAM at its disposal.
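A minimal sketch of the second stage follows: hypothetical if-then rules matched against a connection, plus the F-value of formula (3) with parameter 0.8 used as the fitness. The rule set and the counts passed to the function are illustrative only, not the 200 evolved rules or the paper's measurements.

```python
# Minimal sketch of the rule-based second stage and of the F-value fitness of
# formula (3). The rules, the connection and the TP/FP/FN counts are hypothetical.

def matches_any_rule(connection, rules):
    # A rule is a dict of required feature values; any match means "normal".
    return any(all(connection.get(k) == v for k, v in rule.items()) for rule in rules)

def f_value(tp, fp, fn, beta_sq=0.8):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = beta_sq * precision + recall
    # Formula (3): (1 + beta^2) * recall * precision / (beta^2 * precision + recall)
    return (1 + beta_sq) * recall * precision / denom if denom else 0.0

rules = [{"service": "http", "hot": 0, "logged_in": 0}]   # hypothetical rule set
conn = {"service": "http", "hot": 0, "logged_in": 0}
print("normal" if matches_any_rule(conn, rules) else "attack")
print(f_value(tp=90, fp=5, fn=10))                        # illustrative counts only
```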
4 Results

4.1 Training and Testing Datasets

The dataset contains 5000000 network connection records. A connection is a sequence of TCP packets starting and ending at defined times, between which data flows from a source IP to a target IP under a certain protocol [6]. The training portion of the dataset ("kdd_10_percent") contains 494021 connections, of which 20% are normal (97277) and the rest (396743) are attacks. Each connection record contains 41 independent fields and a label (normal or type of attack). Attacks belong to one of four attack categories: user to root, remote to local, probe, and denial of service. The testing dataset ("corrected") provides a dataset with a significantly different statistical distribution than the training dataset (250436 attacks and 60593 normal connections) and contains an additional 14 attacks not included in the training dataset.

The most important flaws of the mentioned dataset are given in the following [7]. The dataset contains biases that may be reflected in the performance of the evaluated systems, for example the skewed nature of the attack distribution. None of the sources explaining the dataset contains any discussion of data rates, and their variation with time is not specified. There is no discussion of whether the quantity of data presented is sufficient to train a statistical anomaly system or a learning-based system. Furthermore, it is demonstrated in [8] that the transformation model used for transforming DARPA's raw network data into a well-featured data item set is 'poor'. Here 'poor' refers to the fact that some attribute values are the same in different data items that have different class labels. Due to this, some of the attacks cannot be classified correctly.

4.2 Obtained Rates

The system was trained using "kdd_10_percent" and tested on the "corrected" dataset. The obtained results are summarized in Table 3. The last column gives the value of the classical F-measure, so that the learning results can easily be compared using a single figure that combines both recall and precision. The false-positive rate is reduced from 40.7% to 1.4%, while the detection rate decreases by only 0.15%. An increase in the F-value is also exhibited.

Table 3. The performance of the whole system and its parts separately

System             Detection rate (Num. / %)   False Positive Rate (Num. / %)   F-measure
Linear Classifier  231030 / 92.25              24628 / 40.7                     0.913
Rule-based         45504 / 75.1                5537 / 2.2                       0.815
Whole system       230625 / 92.1               862 / 1.4                        0.96

The adaptability of the system was tested as well. At first, the system was trained with a subset of "kdd_10_percent" (250000 connections out of 494021). The generated rules were taken as the initial generation and re-trained with the remaining data of the "kdd_10_percent" dataset. Both of the systems were tested on the "corrected" dataset. The system exhibited improvements in both the detection and the false positive rate. The improvements are presented in Table 4.

Table 4. The performance of the system after re-training

System                                          Detection rate (Num. / %)   False Positive rate (Num. / %)   F-measure
Trained with a subset                           183060 / 73.1               1468 / 2.4                       0.84
Re-trained with the rest of the training data   231039 / 92.3               862 / 1.4                        0.96

The drawbacks of the dataset have influenced the obtained rates. As a comparison, the detection rate of the system tested on the same data it was trained on, i.e. "kdd_10_percent", is 99.2%, compared to the detection rate of 92.1% after testing the system using the "corrected" dataset. Thus, the dataset deficiencies stated previously in this section had negative effects on the rates obtained in this work.

5 Conclusions

In this work a novel approach consisting of a serial combination of two GA-based IDSes is introduced. The properties of the resulting system, including its adaptability, were analyzed. The resulting system exhibits very good characteristics in terms of both the detection and false-positive rates and the F-measure. The implementation of the system has been performed in a way that corresponds well to the deployed dataset, mostly in terms of the chosen features. In a real system this does not have to be the case. Thus, an implementation of the system for a real-world environment has to be adjusted in the sense that the set of chosen features has to be changed according to changes in the environment. Due to the inherent high parallelism of the presented system, there is a possibility of implementing it using reconfigurable hardware. This will result in a high-performance real-world implementation with considerably lower implementation cost, 154 Z. Banković, S. Bojanić, and O.
Nieto-Taladriz size and power consumption compared to the existing solutions. Part of the future work will consist in pursuing hardware implementation of the presented system. Acknowledgements. This work has been partially funded by the Spanish Ministry of Education and Science under the project TEC2006-13067-C03-03 and by the European Commission under the FastMatch project FP6 IST 27095. References 1. Banković, Z., Stepanović, D., Bojanić, S., Nieto-Taladriz, O.: Improving Network Security Using Genetic Algorithm Approach. Computers & Electrical Engineering 33(5-6), 438–451 2. Grosan, C., Abraham, A., Chis, M.: Computational Intelligence for light weight intrusion detection systems. In: International Conference on Applied Computing (IADIS 2006), San Sebastian, Spain, pp. 538–542 (2006); ISBN: 9728924097 3. Gong, R.H., Zulkernine, M., Abolmaesumi, P.: A Software Implementation of a Genetic Algorithm Based Approach to Network Intrusion Detection. In: Proceedings of SNPD/SAWN 2005 (2005) 4. Chittur, A.: Model Generation for an Intrusion Detection System Using Genetic Algorithms (accessed in 2006), http://www1.cs.columbia.edu/ids/ publications/gaids-thesis01.pdf 5. Weiss, G.: Mining with rarity: A unifying framework. SIGKDD Explorations 6(1), 7–19 (2004) 6. http://kdd.ics.uci.edu/ (October 1999) 7. McHugh, J.: Testing Intrusion Detection Systems: A Critique of the 1998 and 1999 DARPA IDS Evaluation as Performed by Lincoln Laboratory. ACM Trans. on Information and System security 3(4), 262–294 (2000) 8. Bouzida, Y., Cuppens, F.: Detecting known and novel network intrusion. In: IFIP/SEC 2006 21st International Information Security Conference, Karlstad, Sweden (2006) 9. Goldberg, D.E.: Genetic algorithms for search, optimization, and machine learning. Addison-Wesley, Reading (1989) 10. GAlib, A.: C++ Library of Genetic Algorithm Components, http://lancet.mit.edu/ga/ 11. Pan, Z., Chen, S., Hu, G., Zhang, D.: Hybrid Neural Network and C4.5 for Misuse Detection. In: Proceedings of the Second International Conference on Machine Learning and Cybernetics, November 2003, vol. 4, pp. 2463–2467 (2003) 12. Folino, G., Pizzuti, C., Spezzano, G.: GP Ensemble for Distributed Intrusion Detection Systems. In: Singh, S., Singh, M., Apte, C., Perner, P. (eds.) ICAPR 2005. LNCS, vol. 3686. Springer, Heidelberg (2005) 13. Laskov, P., Düssel, P., Schäfer, C., Rieck, K.: Learning Intrusion Detection: Supervised or Unsaupervised? In: Roli, F., Vitulano, S. (eds.) ICIAP 2005. LNCS, vol. 3617, pp. 50–57. Springer, Heidelberg (2005) 14. Yao, J.T., Zhao, S.L., Saxton, L.V.: A Study on Fuzzy Intrusion Detection. Data mining, intrusion detection, information assurance and data networks security (2005) 15. Chawla, N.V., Lazarevic, A., Hall, L.O., Bowyer, K.: SMOTEBoost: Improving prediction of the minority class in boosting. In: Proceedings of Principles of Knowledge Discovery in Databases (2003) 16. Fodor, I.K.: A Survey of Dimension Reduction Techniques, http://llnl.gov/CASC/sapphire/pubs Agents and Neural Networks for Intrusion Detection Álvaro Herrero and Emilio Corchado Department of Civil Engineering, University of Burgos C/ Francisco de Vitoria s/n, 09006 Burgos, Spain {ahcosio, escorchado}@ubu.es Abstract. Up to now, several Artificial Intelligence (AI) techniques and paradigms have been successfully applied to the field of Intrusion Detection in Computer Networks. Most of them were proposed to work in isolation. 
On the contrary, the new approach of hybrid artificial intelligent systems, which is based on the combination of AI techniques and paradigms, is proving to successfully address complex problems. In keeping with this idea, we propose a hybrid use of three widely proven paradigms of computational intelligence, namely Multi-Agent Systems, Case Based Reasoning and Neural Networks, for Intrusion Detection. Some neural models based on different statistics (such as the distance, the variance, the kurtosis or the skewness) have been tested to detect anomalies in packet-based network traffic. The projection method of Curvilinear Component Analysis has been applied for the first time in this study to perform packet-based intrusion detection. The proposed framework has been tested on anomalous situations related to the Simple Network Management Protocol as well as on normal traffic.

Keywords: Multiagent Systems, Case Based Reasoning, Artificial Neural Networks, Unsupervised Learning, Projection Methods, Computer Network Security, Intrusion Detection.

1 Introduction

Firewalls are the most widely used tools for securing networks, but Intrusion Detection Systems (IDS's) are becoming more and more popular [1]. IDS's monitor the activity of the network with the purpose of identifying intrusive events and can take actions to abort these risky events. A wide range of techniques have been used to build IDS's. On the one hand, there have been some previous attempts to take advantage of agents and Multi-Agent Systems (MAS) [2] in the field of Intrusion Detection (ID) [3], [4], [5], including the mobile-agents approach [6], [7]. On the other hand, some different machine learning models – including Data Mining techniques and Artificial Neural Networks (ANN) – have been successfully applied for ID [8], [9], [10], [11]. Additionally, some other Artificial Intelligence techniques (such as Genetic Algorithms and Fuzzy Logic, Genetic Algorithms and K-Nearest Neighbor (K-NN), or K-NN and ANN, among others) [12], [13] have been combined in order to face ID from a hybrid point of view.

This paper employs a framework based on a dynamic multiagent architecture employing deliberative agents capable of learning and evolving with the environment [14]. These agents may incorporate different identification or projection algorithms depending on their goals. In this case, a neural model based on the study of some statistical features (such as the variance, the interpoint distance or the skew and kurtosis indexes) will be embedded in such agents. One of the main novelties of this paper is the application of Curvilinear Component Analysis (CCA) for packet-based intrusion detection.

The overall structure of this paper is the following: Section 2 outlines the ID multiagent system, Section 3 describes the neural models applied in this research, Section 4 presents some experimental results and, finally, Section 5 goes over some conclusions and future work.

2 Agent-Based IDS

An ID framework, called Visualization Connectionist Agent-Based IDS (MOVICAB-IDS) [14] and based on software agents [2] and neural models, is introduced. This MAS incorporates different types of agents; some of the agents have been designed as CBR-BDI agents [15], [16] including an ANN for ID tasks, while some others are reactive agents.
CBR-BDI agents use Case Based Reasoning (CBR) systems [17] as a reasoning mechanism, which allows them to learn from initial knowledge, to interact autonomously with the environment, users and other agents within the system, and to have a large capacity for adaptation to the needs of its surroundings. MOVICAB-IDS includes deliberative agents using a CBR architecture. These CBR-BDI agents work at a high level with the concepts of Believes, Desires and Intentions (BDI) [18]. CBR-BDI agents have learning and adaptation capabilities, what facilitates their work in dynamic environments. The extended version of the Gaia methodology [19] was applied, and some roles and protocols where identified after the Architectural Design Stage [14]. The six agent classes identified in the Detailed Design Stage were: SNIFFER, PREPROCESSOR, ANALYZER, CONFIGURATIONMANAGER, COORDINATOR and VISUALIZER. 2.1 Agent Classes The agent classes previously mentioned are described in the following paragraphs. Sniffer Agent This reactive agent is in charge of capturing traffic data. The continuous traffic flow is captured and split into segments in order to send it through the network for further process. As these agents are the most critical ones, there are cloned agents (one per network segment) ready to substitute the active ones when they fail. Preprocessor Agent After splitting traffic data, the generated segments are preprocessed to apply subsequent analysis. Once the data has been preprocessed, an analysis for this new piece of data is requested. Analyzer Agent This is a CBR-BDI agent embedding a neural model within the adaptation stage of its CBR system that helps to analyze the preprocessed traffic data. This agent is based on the application of different neural models allowing the projection of network data. In this paper, PCA [20], CCA [21], MLHL [22] and CHMLHL [23] (See Section 3) have Agents and Neural Networks for Intrusion Detection 157 been applied for comparison reasons. This agent generates a solution (getting an adequate projection of the preprocessed data) by retrieving a case and analyzing the new one using a neural network. Each case incorporates several features, such as segment length (in ms), total number of packets and neural model parameters among others. A further description of the CBR four steps for this agent can be found in [14]. ConfigurationManager Agent The processes of data capture, split, preprocess and analysis depends on the values of several parameters, as for example: packets to capture, segment length, features to extract... This information is managed by the CONFIGURATIONMANAGER reactive agent, which is in charge of providing this information to some other agents. Coordinator Agent There can be several instances (from 1 to m) of the ANALYZER class of agent. In order to improve the efficiency and perform a real-time processing, the preprocessed data must be dynamically and optimally assigned to ANALYZER agents. This assignment is performed by the COORDINATOR agent. Visualizer Agent At the very end of the process, this interface agent presents the analyzed data to the network administrator by means of a functional and mobile visualization interface. To improve the accessibility of the system, the administrator may visualize the results on a mobile device, enabling informed decisions to be taken anywhere and at any time. 
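The division of labour among the six agent classes can be pictured with the toy sketch below; it only mimics the message flow (capture, split, preprocessing, assignment to analyzers, visualization) and makes no attempt to reproduce the Gaia design, the CBR-BDI reasoning cycle or the neural models, so all class names, parameters and data are illustrative.

```python
# Toy sketch of the MOVICAB-IDS message flow described above; responsibilities only,
# not the authors' implementation.

class ConfigurationManager:            # provides capture/split/analysis parameters
    params = {"segment_size": 3, "features": ["timestamp", "protocol",
                                              "source_port", "dest_port", "size"]}

class Sniffer:                         # captures traffic and splits it into segments
    def capture(self, traffic, segment_size):
        return [traffic[i:i + segment_size] for i in range(0, len(traffic), segment_size)]

class Preprocessor:                    # keeps only the configured features
    def preprocess(self, segment, features):
        return [{f: pkt.get(f) for f in features} for pkt in segment]

class Analyzer:                        # placeholder for the neural projection step
    def analyze(self, data):
        return {"packets": len(data), "projection": data}

class Coordinator:                     # assigns preprocessed segments to analyzers
    def __init__(self, analyzers):
        self.analyzers = analyzers
    def assign(self, data, i):
        return self.analyzers[i % len(self.analyzers)].analyze(data)

class Visualizer:                      # presents results to the administrator
    def show(self, result):
        print("segment of", result["packets"], "packets projected")

config = ConfigurationManager()
traffic = [{"timestamp": t, "protocol": 17, "source_port": 1025, "dest_port": 161, "size": 90}
           for t in range(7)]
coordinator = Coordinator([Analyzer(), Analyzer()])
for i, segment in enumerate(Sniffer().capture(traffic, config.params["segment_size"])):
    data = Preprocessor().preprocess(segment, config.params["features"])
    Visualizer().show(coordinator.assign(data, i))
```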
3 Neural Projection Models Projection models are used as tools to identify and remove correlations between problem variables, which enable us to carry out dimensionality reduction, visualization or exploratory data analysis. In this study, some neural projection models, namely PCA, MLHL, CMLHL and CCA have been applied for ID. Principal Component Analysis (PCA) [20] is a standard statistical technique for compressing data; it can be shown to give the best linear compression of the data in terms of least mean square error. There are several ANN which have been shown to perform PCA [24], [25], [26]. It describes the variation in a set of multivariate data in terms of a set of uncorrelated variables each of which is a linear combination of the original variables. Its goal is to derive new variables, in decreasing order of importance, which are linear combinations of the original variables and are uncorrelated with each other. Curvilinear Component Analysis (CCA) [21] is a nonlinear dimensionality reduction method. It was developed as an improvement on the Self Organizing Map (SOM) [27], trying to circumvent the limitations inherent in some linear models such as PCA. CCA is performed by a self-organised neural network calculating a vector quantization of the submanifold in the data set (input space) and a nonlinear projection of these quantising vectors toward an output space. As regards its goal, the projection part of CCA is similar to other nonlinear mapping methods, as it minimizes a cost function based on interpoint distances in both input and output spaces. Quantization and nonlinear mapping are separately performed: firstly, the input vectors are forced to become prototypes of the distribution 158 Á. Herrero and E. Corchado using a vector quantization (VQ) method, and then, the output layer builds a nonlinear mapping of the input vectors. Cooperative Maximum Likelihood Hebbian Learning (CMLHL) [23] extends the MLHL model [22] that is a neural implementation of Exploratory Projection Pursuit (EPP). The statistical method of EPP [28] linearly project a data set onto a set of basis vectors which best reveal the interesting structure in data. MLHL identifies interestingness by maximising the probability of the residuals under specific probability density functions which are non-Gaussian. CMLHL extends the MLHL paradigm by adding lateral connections [23], which have been derived from the Rectified Gaussian Distribution [29]. The resultant model can find the independent factors of a data set but does so in a way that captures some type of global ordering in the data set. 4 Experiments and Results In this work, the above described neural models have been applied to a real traffic data set [11] containing “normal” traffic and some anomalous situations. These anomalous situations are related to the Simple Network Management Protocol (SNMP), known by its vulnerabilities [30]. Apart from “normal” traffic, the data set includes: SNMP ports sweeps (scanning of network hosts at different ports - a random port number: 3750, and SNMP default port numbers: 161 and 162), and a transfer of information stored in the Management Information Base (MIB), that is, the SNMP database. This data set contains only five variables extracted from the packet headers: timestamp (the time when the packet was sent), protocol, source port (the port of the source host that sent the packet), destination port (the destination host port number to which the packet is sent) and size: (total packet size in Bytes). 
This data set was generated in a medium-sized network, so the "normal" and anomalous traffic flows were known in advance. As SNMP is based on UDP, only 5866 UDP-based packets were included in the dataset. In this work, the performance of the previously described projection models (PCA, CCA, MLHL and CMLHL) has been analysed and compared on this dataset (see Figs. 1 and 2).

PCA was initially applied to the previously described dataset. The PCA projection is shown in Fig. 1.a. After analysing this projection, it is discovered that the normal traffic evolves in parallel straight lines. Based on this parallelism to normal traffic, PCA is only able to identify the port sweeps (Groups 3, 4 and 5 in Fig. 1.a). On the contrary, it fails to detect the MIB information transfer (Groups 1 and 2 in Fig. 1.a) because the packets in this anomalous situation evolve in a direction parallel to the "normal" one.

Fig. 1.b shows the MLHL projection of the dataset. Once again, the normal traffic evolves in parallel straight lines. There are some other groups (Groups 1, 2, 3, 4 and 5 in Fig. 1.b) evolving in an anomalous way. In this case, all the anomalous situations contained in the dataset can be identified due to their non-parallel evolution with respect to the normal direction. Additionally, in the case of the MIB transfer (Groups 1 and 2 in Fig. 1.b), the high concentration of packets must be considered an anomalous feature.

Fig. 1. (a) PCA projection. (b) MLHL projection. (c) CMLHL projection. (d) CCA (Euclidean dist.) projection.

It can be seen in Fig. 1.c how the CMLHL model is able to identify the two anomalous situations contained in the data set. As in the case of MLHL, the MIB information transfer (Groups 1 and 2 in Fig. 1.c) is identified due to its orthogonal direction with respect to the normal traffic and to the high density of packets. The sweeps (Groups 3, 4 and 5 in Fig. 1.c) are identified due to their non-parallel direction to the normal one.

Several experiments were conducted to apply CCA to the analysed data set, tuning the different options and parameters, such as the type of initialization, the number of epochs and the distance criterion. The best (from a projection point of view) CCA result, based on the Standardized Euclidean Distance, is depicted in Fig. 2. There is a marked contrast between the behavioral pattern shown by the normal traffic in the previous projections and the evolution of normal traffic in the CCA projection. In the latter, some of the packets belonging to normal traffic do not evolve in parallel straight lines. That is the case of Groups 1 and 2 in Fig. 2. The anomalous traffic shows an abnormal evolution once again (Groups 3 and 4 in Fig. 2), so it is not as easy as in the previous projections to distinguish the anomalous traffic from the "normal" one.

Fig. 2. CCA projection (employing Standardized Euclidean distance)

For comparison purposes, a different CCA projection for the same dataset is shown in Fig. 1.d. This projection is based on the simple Euclidean Distance. The anomalous situations cannot be identified in this case, as the evolution of the normal and anomalous traffic is similar.
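Since the choice of distance criterion turned out to be decisive for the CCA projection, the short sketch below contrasts the plain and the Standardized Euclidean distances on made-up packet feature vectors (timestamp, protocol, source port, destination port, size); it illustrates the metric only, not the CCA implementation used in this study.

```python
import numpy as np

# Made-up packet feature vectors: timestamp, protocol, source port, dest port, size.
X = np.array([
    [0.0, 17, 1025, 161, 90],
    [1.2, 17, 1026, 161, 92],
    [1.3, 17, 3750, 161, 60],
], dtype=float)

def euclidean(a, b):
    return np.sqrt(np.sum((a - b) ** 2))

def standardized_euclidean(a, b, data):
    # Each squared difference is divided by the variance of that feature over the
    # data set, so wide-range features (e.g. port numbers) do not dominate.
    var = data.var(axis=0)
    var[var == 0] = 1.0          # guard against constant features
    return np.sqrt(np.sum((a - b) ** 2 / var))

print(euclidean(X[0], X[2]), standardized_euclidean(X[0], X[2], X))
```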
Different distance criteria such as Cityblock, Humming and some others were tested as well. None of them surpass the Standardized Euclidean Distance, whose projection is shown on Fig. 2. 5 Conclusions and Future Work The use of embedded ANN in the deliberative agents of a dynamic MAS let us take advantage of some of the properties of ANN (such as generalization) and agents (reactivity, proactivity and sociability) making the ID task possible. It is worth mentioning that, as in other application fields, the tuning of the different neural models is of extreme importance. Although the neural model can get a useful projection, a wrong tuning of the model can lead to a useless outcome, as is the case of Fig 1.d. We can conclude as well that CMLHL outperforms MLHL, PCA and CCA. This probes the intrinsic robustness of CMLHL, which is able to properly respond to a complex data set that includes time as a variable. Further work will focus on the application of high-performance computing clusters. Increased system power will be used to enable the IDS to process and display the traffic data in real time. Acknowledgments. This research has been partially supported by the project BU006A08 of the JCyL. Agents and Neural Networks for Intrusion Detection 161 References 1. Chuvakin, A.: Monitoring IDS. Information Security Journal: A Global Perspective 12(6), 12–16 (2004) 2. Wooldridge, M., Jennings, N.R.: Agent theories, architectures, and languages: A survey. Intelligent Agents (1995) 3. Spafford, E.H., Zamboni, D.: Intrusion Detection Using Autonomous Agents. Computer Networks: The Int. Journal of Computer and Telecommunications Networking 34(4), 547– 570 (2000) 4. Hegazy, I.M., Al-Arif, T., Fayed, Z.T., Faheem, H.M.: A Multi-agent Based System for Intrusion Detection. IEEE Potentials 22(4), 28–31 (2003) 5. Dasgupta, D., Gonzalez, F., Yallapu, K., Gomez, J., Yarramsettii, R.: CIDS: An agentbased intrusion detection system. Computers & Security 24(5), 387–398 (2005) 6. Wang, H.Q., Wang, Z.Q., Zhao, Q., Wang, G.F., Zheng, R.J., Liu, D.X.: Mobile Agents for Network Intrusion Resistance. In: Shen, H.T., Li, J., Li, M., Ni, J., Wang, W. (eds.) APWeb Workshops 2006. LNCS, vol. 3842, pp. 965–970. Springer, Heidelberg (2006) 7. Deeter, K., Singh, K., Wilson, S., Filipozzi, L., Vuong, S.: APHIDS: A Mobile AgentBased Programmable Hybrid Intrusion Detection System. In: Karmouch, A., Korba, L., Madeira, E.R.M. (eds.) MATA 2004. LNCS, vol. 3284, pp. 244–253. Springer, Heidelberg (2004) 8. Laskov, P., Dussel, P., Schafer, C., Rieck, K.: Learning Intrusion Detection: Supervised or Unsupervised? In: Roli, F., Vitulano, S. (eds.) ICIAP 2005. LNCS, vol. 3617, pp. 50–57. Springer, Heidelberg (2005) 9. Liao, Y.H., Vemuri, V.R.: Use of K-Nearest Neighbor Classifier for Intrusion Detection. Computers & Security 21(5), 439–448 (2002) 10. Sarasamma, S.T., Zhu, Q.M.A., Huff, J.: Hierarchical Kohonenen Net for Anomaly Detection in Network Security. IEEE Transactions on Systems Man and Cybernetics, Part B 35(2), 302–312 (2005) 11. Corchado, E., Herrero, A., Sáiz, J.M.: Detecting Compounded Anomalous SNMP Situations Using Cooperative Unsupervised Pattern Recognition. In: Duch, W., Kacprzyk, J., Oja, E., Zadrożny, S. (eds.) ICANN 2005. LNCS, vol. 3697, pp. 905–910. Springer, Heidelberg (2005) 12. Middlemiss, M., Dick, G.: Feature Selection of Intrusion Detection Data Using a Hybrid Genetic Algorithm/KNN Approach. In: Design and Application of Hybrid Intelligent Systems, pp. 519–527. IOS Press, Amsterdam (2003) 13. 
Kholfi, S., Habib, M., Aljahdali, S.: Best Hybrid Classifiers for Intrusion Detection. Journal of Computational Methods in Science and Engineering 6(2), 299–307 (2006) 14. Herrero, Á., Corchado, E., Pellicer, M., Abraham, A.: Hybrid Multi Agent-Neural Network Intrusion Detection with Mobile Visualization. In: Innovations in Hybrid Intelligent Systems. Advances in Soft Computing, vol. 44, pp. 320–328. Springer, Heidelberg (2007) 15. Corchado, J.M., Laza, R.: Constructing Deliberative Agents with Case-Based Reasoning Technology. International Journal of Intelligent Systems 18(12), 1227–1241 (2003) 16. Pellicer, M.A., Corchado, J.M.: Development of CBR-BDI Agents. International Journal of Computer Science and Applications 2(1), 25–32 (2005) 17. Aamodt, A., Plaza, E.: Case-Based Reasoning - Foundational Issues, Methodological Variations, and System Approaches. AI Communications 7(1), 39–59 (1994) 18. Bratman, M.E.: Intentions, Plans and Practical Reason. Harvard University Press, Cambridge (1987) 162 Á. Herrero and E. Corchado 19. Zambonelli, F., Jennings, N.R., Wooldridge, M.: Developing Multiagent Systems: the Gaia Methodology. ACM Transactions on Software Engineering and Methodology 12(3), 317–370 (2003) 20. Pearson, K.: On Lines and Planes of Closest Fit to Systems of Points in Space. Philosophical Magazine 2(6), 559–572 (1901) 21. Demartines, P., Herault, J.: Curvilinear Component Analysis: A Self-Organizing Neural Network for Nonlinear Mapping of Data Sets. IEEE Transactions on Neural Networks 8(1), 148–154 (1997) 22. Corchado, E., MacDonald, D., Fyfe, C.: Maximum and Minimum Likelihood Hebbian Learning for Exploratory Projection Pursuit. Data Mining and Knowledge Discovery 8(3), 203–225 (2004) 23. Corchado, E., Fyfe, C.: Connectionist Techniques for the Identification and Suppression of Interfering Underlying Factors. Int. Journal of Pattern Recognition and Artificial Intelligence 17(8), 1447–1466 (2003) 24. Oja, E.: A Simplified Neuron Model as a Principal Component Analyzer. Journal of Mathematical Biology 15(3), 267–273 (1982) 25. Sanger, D.: Contribution Analysis: a Technique for Assigning Responsibilities to Hidden Units in Connectionist Networks. Connection Science 1(2), 115–138 (1989) 26. Fyfe, C.: A Neural Network for PCA and Beyond. Neural Processing Letters 6(1-2), 33–41 (1997) 27. Kohonen, T.: The Self-Organizing Map. Proceedings of the IEEE 78(9), 1464–1480 (1990) 28. Friedman, J.H., Tukey, J.W.: A Projection Pursuit Algorithm for Exploratory DataAnalysis. IEEE Transactions on Computers 23(9), 881–890 (1974) 29. Seung, H.S., Socci, N.D., Lee, D.: The Rectified Gaussian Distribution. Advances in Neural Information Processing Systems 10, 350–356 (1998) 30. Cisco Secure Consulting. Vulnerability Statistics Report (2000) Cluster Analysis for Anomaly Detection Giuseppe Lieto, Fabio Orsini, and Genoveffa Pagano System Management, Inc. Italy glieto@sysmanagement.it, forsini@sysmanagement.it, gpagano@pridelabs.it Abstract. This document presents a technique of traffic analysis, looking for attempted intrusion and information attacks. A traffic classifier aggregates packets in clusters by means of an adapted genetic algorithm. In a network with traffic homogenous over the time, clusters do not vary in number and characteristics. In the event of attacks or introduction of new applications the clusters change in number and characteristics. 
The set of data processed for the test are extracted from traffic DARPA, provided by MIT Lincoln Labs and commonly used to test effectiveness and efficiency of systems for Intrusion Detection. The target events of the trials are Denial of Service and Reconaissance. The experimental evidence shows that, even with an input of unrefined data, the algorithm is able to classify, with discrete accuracy, malicious events. 1 Introduction Anomaly detection techniques are based on traffic experience retrieval on the network to protect, so that abnormal traffic, both in quantity and in quality, can be detected. Almost all approaches are based on anti-intrusion system learning of what is normal traffic: anything different from the normal traffic is malicious or suspect. The normal traffic classification is made analyzing: • Application level content, with a textual characterization; • The whole connection and packet headers, usually using clustering techniques; • Traffic quantity and connections transition frequencies, by modelling the users behaviour in different hours, according to the services and the applications they use. 1.1 Anomaly Detection with Clustering on Header Data The most interesting studies are related to learning algorithms without human supervision. They classify the traffic in different clusters [3], each of them contains strongly correlated packets. Packets characterization is based on header fields, while the cluster creation can be realized with different algorithms. All the traffic that doesn’t belong to normal clusters, is classified as abnormal; the following step is to distinguish between abnormal and malicious traffic. Traffic characterization starts with data mining and creation of multidimensional vectors, called feature vectors, whose components represent the instance dimensions. The choice of relevant attributes for the instances is really important for the characterization and many studies focus on the evaluation of the best techniques for features choice. E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 163–169, 2009. springerlink.com © Springer-Verlag Berlin Heidelberg 2009 164 G. Lieto, F. Orsini, and G. Pagano The approach chosen in [4] considers the connections as an unique entity, containing several packets. In that way is possible to retrieve from network data information as the observing domain, the hosts number, the active applications, the number of users connected to a host or to a service. The approach introduced above is the most used one. Anyway there are classification trials based on raw data. An interesting study field is DoS attack detection: such attacks produce an undistinguishable traffic. [1] proposes a defence based on edge routers, that can create different queues for malicious packets and normal traffic. The distinction takes place through an anomaly detection algorithm that classifies the normal traffic using a kmeans clustering, based on observation of the statistic trend of the traffic. Experimental results state that, during a DoS attack, new denser, more populated and bigger clusters are created. Even the sudden increase in density of an existing cluster can mean an attach is ongoing, if it is observed together with an immediate variation of the single features mean values. The different queues allow to shorten up the normal connection time; moreover, without the need to stop the suspect connection, the band dedicated to DoS traffic decreases (because of long queues), together with the effects on resources management. 
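The density-monitoring idea attributed to [1] can be sketched as follows: cluster one window of traffic with k-means and compare cluster populations and dispersion with the next window. The synthetic data, the value of k and the use of scikit-learn are assumptions made only for illustration; they are not part of [1] or of our system.

```python
import numpy as np
from sklearn.cluster import KMeans

# Sketch of density monitoring: cluster a window of traffic, then watch how cluster
# sizes and dispersion change in the next window. Synthetic data stands in for
# real packet feature vectors.
rng = np.random.default_rng(1)
normal_window = rng.normal(0, 1, size=(300, 4))
attack_window = np.vstack([rng.normal(0, 1, size=(150, 4)),
                           rng.normal(5, 0.05, size=(150, 4))])   # dense DoS-like burst

def cluster_profile(window, k=4):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(window)
    sizes = np.bincount(km.labels_, minlength=k)
    return sizes, km.inertia_ / len(window)      # population per cluster, mean dispersion

for name, window in [("normal", normal_window), ("attack", attack_window)]:
    sizes, dispersion = cluster_profile(window)
    print(name, "cluster sizes:", sizes, "mean dispersion:", round(dispersion, 3))
```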
In our approach, we chose a genetic clustering technique, Unsupervised Niche Clustering (UNC) [2], to classify the network traffic. UNC uses an evolutionary algorithm with a niching strategy, allowing it to maintain genetic niches as candidate clusters rather than losing awareness of small groups of strongly peculiar genetic individuals, which would end up extinguished using a traditional evolutionary algorithm. Analysing the traffic, we applied the algorithm to several groups of individuals, monitoring the formation of new clusters and the trend of the density in already existing clusters [1]. Moreover, we observed the trend in the number of clusters extracted from a fixed number of individuals. The two approaches were tested during normal network activities and then compared to the results obtained during a DoS attack or a reconnaissance activity.

2 Description of the Algorithm: Unsupervised Niche Clustering

Unsupervised Niche Clustering aims at searching the solution space for any number of clusters. It maintains dense areas in the solution space using an evolutionary algorithm and a niching technique. As in nature, niches represent subspaces of the environment that support different types of individuals with similar genetic characteristics.

2.1 Encoding Scheme

Each individual represents a candidate cluster: it is characterized by the center, an individual in n dimensions, a robust measure of the scale and its fitness.

Table 1. Feature vector of each individual

Genome   Gi = (gi1, gi2, …, gin)
Scale    σi²
Fitness  fi

2.2 Genetic Operators and Scale

At the very first step, the scale is assigned to each individual in an empirical way: we assume the i-th individual is the center, that is, the mean value, of a cluster containing all the individuals in the solution space. At each generation, parents and offspring update their scale in a recursive manner, according to equation (1):

σi² = Σj wij·dij² / Σj wij    (1)

wij = exp(−dij² / (2σi²))    (2)

where wij represents a robust weight measuring how much the j-th individual belongs to the i-th cluster; dij is the Euclidean distance of the j-th individual from the center of the i-th cluster; N is the number of individuals in the solution space, over which the sums run. At the moment of their birth, children inherit their closest parent's scale. Generation of an offspring is made by two genetic operators: crossover and mutation. In our work we implemented one-point crossover in each dimension, combining the most significant bits of one parent with the least significant ones of the second parent. Mutation can modify each bit of the genome with a given probability: in our case, we chose a mutation probability of 0.001. Equation (1) maximizes the fitness value (3) for the i-th cluster.

2.3 Fitness Function

The fitness of the i-th individual is represented by the density of a hypothetical cluster having the i-th individual as its center:

fi = Σj wij / σi²    (3)

2.4 Niching

UNC uses Deterministic Crowding (DC) to create and maintain niches. The DC steps are:

1. choose the couple of parents;
2. apply crossover and mutation;
3. calculate the distance from each parent to each child;
4. couple one child with one parent, so that the sum of the two parent-child distances is minimized;
5. in each parent-child couple, the one with the best fitness survives, and the other is discarded from the population.

Through DC, a child's fitness for survival is evaluated by comparing it only to its paired parent's fitness: such an approach keeps the comparisons within a limited portion of the solution space, as sketched below.
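The sketch below runs one DC generation over real-valued candidate centers using the density fitness of formula (3); for brevity it replaces the bitwise one-point crossover and bit mutation described above with simplified per-gene operators and keeps a single fixed scale, so it illustrates the niching logic rather than the authors' implementation.

```python
import random, math

# One Deterministic Crowding generation for UNC-style candidate centers (sketch).
# Simplified per-gene crossover, Gaussian mutation and a fixed scale are assumptions.

def dist2(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def density_fitness(center, data, sigma2):
    # Fitness (3): sum of robust weights (2) divided by the scale.
    return sum(math.exp(-dist2(center, x) / (2 * sigma2)) for x in data) / sigma2

def crossover(a, b):
    return [random.choice(pair) for pair in zip(a, b)]      # per-gene choice

def mutate(ind, rate):
    return [g + random.gauss(0, 0.1) if random.random() < rate else g for g in ind]

def dc_generation(population, data, sigma2, mut_rate=0.001):
    random.shuffle(population)                               # odd-sized last individual is dropped here
    new_pop = []
    for p1, p2 in zip(population[0::2], population[1::2]):
        c1 = mutate(crossover(p1, p2), mut_rate)
        c2 = mutate(crossover(p2, p1), mut_rate)
        # Pair each child with the parent minimizing the sum of parent-child distances.
        if dist2(p1, c1) + dist2(p2, c2) > dist2(p1, c2) + dist2(p2, c1):
            c1, c2 = c2, c1
        for parent, child in ((p1, c1), (p2, c2)):
            better = child if density_fitness(child, data, sigma2) > density_fitness(parent, data, sigma2) else parent
            new_pop.append(better)
    return new_pop

data = [[random.gauss(0, 1), random.gauss(0, 1)] for _ in range(200)]
population = [[random.uniform(-2, 2), random.uniform(-2, 2)] for _ in range(20)]
population = dc_generation(population, data, sigma2=1.0)
```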
In addition, we analysed a conservative approach for step 1. The parents were chosen so that their distance was under a fixed threshold and their fitness had the same order of magnitude, so as to maintain genetic diversity. Coupling very distant individuals with highly different fitness values would quickly extinguish the weakest individual, losing the notion of evolutionary niches.

2.5 Extraction of the Cluster Centers

The final cluster centers are the individuals in the final population with a fitness greater than a given value: in our case, greater than the mean fitness of the entire population.

2.6 Cluster Characterization

The assignment of each individual to a cluster does not follow a binary logic; fuzzy logic is applied instead. Clusters do not have a radius; instead, we assigned to each individual a degree of belonging to each cluster. The member functions of the fuzzy set are the Gaussian functions in equation (4):

μi(xj) = exp(−dij² / (2σi²))    (4)

where the mean value is the center of the cluster, and the scale of the center coincides with the scale of the corresponding Gaussian function. An individual will be considered as belonging to the cluster which maximizes the belonging function (4).

3 Data Set

For the correct exploration of the solution space, the genomes must correctly represent the physical reality under study. Our work can be divided into the following phases:

• We created an instrument to extract network traffic data.
• We investigated the results of a genetic clustering without any data manipulation, in order to observe whether raw header data could correctly represent the network traffic population, without any human understanding of attribute meaning. This approach proved to be completely different from the analytic one proposed in [4].
• We observed the evolution of the cluster centres to detect DoS and scanning attacks.
• All data were extracted from tcpdump text files, obtained from real network traffic; the packets have been studied starting from the third level of the TCP/IP stack. This choice causes the loss of the information linking IP addresses to physical addresses, contained in ARP tables.
• The header values were extracted, separated, and converted into long integers. IP addresses were divided into two separate segments and later converted, because of the maximum representation capacity of our computers.
• From a single data set we implemented an object made up of the whole population under examination. The choice of the header fields and the population size was made using a heuristic method.

4 Experimental Results

We focused on attacks such as Denial of Service and scanning activities. Our data set was extracted from the DARPA data set, built by the Lincoln Laboratory at MIT in Boston, USA.

4.1 Experimentation 1

The first example is an IP sweep attack: the attacker sends an ICMP packet to each machine, in order to discover whether they are on at the moment of the attack. We applied the algorithm to an initial train of 5000 packets; then, we monitored the trend of the centers on the following trains of 1000 packets.

Fig. 1. Trends of clusters for Neptune attack

As the number of clusters is not predefined, we related each cluster representative of a train to the closest one belonging to the following train of packets (see the sketch below). For unassigned centers, we calculated the minimum distance from the preceding clusters.
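The matching of centers between consecutive trains, and the threshold test on unassigned centers discussed in the next paragraph, can be sketched as follows; the center coordinates and the alarm threshold are hypothetical values chosen for illustration.

```python
import math

# Sketch of relating each new train's cluster centers to the previous train's centers
# and flagging those that stay far from every known cluster. Values are hypothetical.

def distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def relate_centers(prev_centers, new_centers, threshold):
    alarms = []
    for center in new_centers:
        d_min = min(distance(center, p) for p in prev_centers)
        if d_min > threshold:
            alarms.append((center, d_min))    # unassigned center, possible new event
    return alarms

prev = [(6, 80, 25), (17, 53, 53), (6, 1025, 80)]   # (protocol, dst port, src port)-style centers
new = [(6, 80, 25), (1, 0, 0)]                      # an ICMP-like center appears
print(relate_centers(prev, new, threshold=30.0))
```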
We observed that ICMP clusters can never be assigned to a preceding train of packets, and their minimum distance is far larger than that of any other unassigned cluster. Figure 1 shows the evolution of the normal clusters and the isolated cluster created during the attack (the red one). The three dimensions are transport layer protocol, destination port and source port. We identified the attack when an unassigned cluster's minimum distance was higher than a threshold value. We had the same results with a Port Sweep attack. However, we faced some false positives in the presence of DNS requests: this could be avoided by accurately assigning weight functions to balance the different kinds of normal traffic in the network.

4.2 Experimentation 2

The second case we analysed was a Neptune attack, which causes a denial of service by flooding the target machine with SYN TCP packets and never finalizing the three-way handshake. Handling an attack producing a huge number of packets, we expected a rise in the number of clusters calculated on the packet trains containing the attack, characterized by a density higher than the one observed during normal activities.

Table 2. Evolution of the cluster centers during Neptune attack

Train number  Attack  Colour in fig. 2  Population  Number of Clusters  Average Scale of the Clusters in the Train
1             No      1                 5000        18                  2.37E+08
2             No      2                 1000        13                  4.35E+08
3             No      3                 1000        13                  5.29E+08
4             Yes     4                 1000        21                  5.95E+08
5             Yes     5                 1000        2                   8.40E+07
6             Yes     6                 1000        2                   1.70E+08

Fig. 2. Dispersion around the centers in each train of packets (scale vs. data set)

In figure 2, we represent the clusters' centers in each train of packets. We observed a strong contraction in the number of clusters. Moreover, the individuals in the population were much less dispersed around these centers than around the normal centers. It is evident that the centers, though calculated from the same number of packets, diminish abruptly in number. Moreover, the dispersion around the centers diminishes just as abruptly, as seen in figure 3.

5 Conclusions

Experimental results show that our algorithm can identify new events happening in trains consisting of a given, small number of packets: its sensitivity applies to attacks producing a large number of homogeneous packets. The evolutionary approach proved to be feasible, stressing a trend in traffic: thanks to the recombination of data and to the random component, the cluster centers can be identified in genomes not present in the initial solution space. Using a hill-climbing procedure, UNC selects the fittest individuals, preserving evolutionary niches generation by generation; by monitoring the evolution of the centers, we obtained an approach robust against noise compared to a statistical approach to clustering: individuals not representative of an evolutionary niche have a low probability of surviving. The performance of the algorithm can be improved by separately processing the traffic entering and leaving the network under analysis: this would help to keep the false positive rate under control. Moreover, the process of data mining from packet headers can be refined and improved, so as to build a feature vector containing not only raw data from the header, but more refined data, containing knowledge about the network and its hosts, the connections, the services and so on.
A different approach of the same algorithm could be monitoring the traffic and evaluating it compared to the existing clusters, rather than observing the evolution of the clusters: once the clusters are formed, a score of abnormality can be assigned to each individual under investigation, according to how much it belongs to each cluster of the solution space. In a few empirical tests, we simulated a wide range of attacks using Nessus tool: although some trains of anomalous packets show substantially normal scores, and the number of false positive is quite relevant, we observed that abnormal traffic has got a sensitive higher abnormal score than the normal traffic has, if referring to the mean values. References 1. Rouil, Chevrollier, Golmie: Unsupervised anomaly detection system using next-generation router architecture (2005) 2. Leon, Nasraoui, Gomez: Anomaly detection based on unsupervised niche clustering with application to network intrusion detection 3. Cerbara, I.: Cenni sulla cluster analysis (1999) 4. Lee, S.: A framework for constructing features and models for intrusion detection systems (2001) Statistical Anomaly Detection on Real e-Mail Traffic Maurizio Aiello1, Davide Chiarella1,2, and Gianluca Papaleo1,2 1 2 National Research Council, IEIIT, Genoa, Italy University of Genoa, Department of Computer and Information Sciences, Italy {maurizio.aiello,davide.chiarella, gianluca.papaleo}@ieiit.cnr.it Abstract. There are many recent studies and proposal in Anomaly Detection Techniques, especially in worm and virus detection. In this field it does matter to answer few important questions like at which ISO/OSI layer data analysis is done and which approach is used. Furthermore these works suffer of scarcity of real data due to lack of network resources or privacy problem: almost every work in this sector uses synthetic (e.g. DARPA) or pre-made set of data. Our study is based on layer seven quantities (number of e-mail sent in a chosen period): we analyzed quantitatively our network e-mail traffic (4 SMTP servers, 10 class C networks) and applied our method on gathered data to detect indirect worm infection (worms which use e-mail to spread infection). The method is a threshold method and, in our dataset, it identified various worm activities. In this document we show our data analysis and results in order to stimulate new approaches and debates in Anomaly Intrusion Detection Techniques. Keywords: Anomaly Detection Techniques; indirect worm; real e-mail traffic. 1 Introduction Network security and Intrusion Detection Systems have become one of the research focus with the ever fast development of the Internet and the growing of unauthorized activities on the Net. Intrusion Detection Techniques are an important security barrier against computer intrusions, virus infections, spam and phishing. In the known literature there are two main approaches to worm detection [1], [2]: misuse intrusion detection and anomaly intrusion detection. The first one is based upon the signature concept, it is more accurate but it lacks the ability to identify the presence of intrusions that do not fit a pre-defined signature, resulting not adaptive [3], [4]. The second one tries to create a model to characterize a normal behaviour: the system defines the expected network behaviour and, if there are significant deviations in the short term usage from the profile, raises an alarm. It is a more adaptive system, ready to counterattack new threats, but it has a high rate of false positives [5], [6], [7], [8], 9]. 
Theoretically Misuse and Anomaly detection integrated together can get the holistic estimation of malicious situations on a network. Which kind of threats spread via e-mails? Primarily we can say that the main ones are worms and viruses, spam and phishing. Let’s try to summarize the whole situation. At present Internet surpasses one billion users [10] and we witness more and more cyber criminal activities originate and misuse this worldwide network by using different tools: one of the most important and relevant is the electronic-mail. In fact E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 170–177, 2009. springerlink.com © Springer-Verlag Berlin Heidelberg 2009 Statistical Anomaly Detection on Real e-Mail Traffic 171 nowadays Internet users are flooded by a huge amount of emails infected with worms and viruses: indeed the majority of mass-mailing worms employ a Simple Mail Transfer Protocol engine as infection delivery mechanism, so in the last years, a multitude of worm epidemics has affected millions of networked computing devices [11]. Are worms a real and growing threat? The answer is simple and we can find it in the virulent events of the last years: we have in fact thousands hosts infected and billion dollars in damage [12]. Moreover recently we witness a merge between worm and spam activities: it has been estimated that 80% of spam is sent by spam zombies [13]: an event which can make us think that future time hides bad news. How can we neutralize all these menaces? In Intrusion Detection Techniques many types of research have been developed during years. Proposed Network Intrusion Detection Systems worked and work on information available on different TCP stack layers: Datalink ( e.g. monitoring ARP [14] ), Network (e.g. monitoring IP, BGP and ICMP [15], [16] ), Transport (e.g. monitoring DNS MX query [17,18]) and Application (e.g. monitoring System Calls[19,20] ). Sometimes, because of enormous relative features available on different levels researcher correlate information gathered on each level in order to improve the effectiveness of the detection system. In a similar way we propose to work with e-mail focusing our attention on quantities considering that all the three above phenomena have something in common: they all use SMTP [21] as proliferation medium. In this paper our goal is to present a dataset analysis which reflects the complete SMTP traffic sent by seven /24 network in order to detect worm and virus infections given that no user in our network is a spammer. Moreover we want to stress that the dataset we worked on is genuine and not a synthetic one ( like KDD 99 [22] and DARPA [23] ) so we hope that it might be inspiring to other researcher and that probably in the near future our work might produce a genuine data set at everyone’s disposal. The paper is structured as following. Section 2 introduces the analysis’ scenario. Section 3 discusses the dataset we worked on. Our analysis’ theory and methods are described in section 4, and our experimental results using our tools to analyze mail activities discussed in section 5. In section 6, we give our conclusion. 2 Scenario Our approach is highly experimental. In fact we work on eleven local area network (C class) interconnected by a layer three switch and directly connected to Internet (no NAT [24] policies, all public IP). In this network we have five different mail-servers, varying from Postfix to Sendmail. 
As every system administrator knows every server has its own kind of log and the main problem with log files is that every transaction is merged with the other ones and they are of difficult reading by a human beings: for this reason we focused our anomaly detection on a single Postfix mail-server, optimizing our efforts. Every mail is checked by an antivirus server (Sophos). To circumvent spam we have SpamAssassin [25] and Greylisting [26]. SpamAssassin is a software which uses a variety of mechanisms including header and text analysis, Bayesian filtering, DNS blocklists, and collaborative filtering databases to detect spam. A mail transfer agent which uses greylisting temporarily rejects email from 172 M. Aiello, D. Chiarella, and G. Papaleo unknown senders. If the mail is legitimate, the originating server will try again to send it later according to RFC, while a mail which is from a spammer, it will probably not be retried because very few spammers are RFC compliant. The few spam sources which re-transmit later are more likely to be listed in DNSBLs [27] and distributed signature systems. Greylisting and SpamAssassin reduced heavily our spam percentage. To make a complete description we must add that port 25 is monitored and filtered: in fact the hosts inside our network can’t communicate with a host outside our network on port 25 and an outsider can’t communicate with an our host on port 25. These restriction nullify two threats: the first one concerns the infected hosts which can become spam-zombie pc; the second one concerns the SMTP relaying misuse problem. In fact since we are a research institution almost all the hosts are used by a single person who detains root privileges, so she can eventually install a SMTP server. Only few of total hosts are shared among different people (students, fellow researcher etc.). We have a good balancing between Linux operating systems distribution and Windows ones. We focus our attention on one mail server which has installed a Postfix e-mail server. Every mail is checked by the antivirus server updated once an hour: this is an important fact because it assures that all the worms found during analysis are zero-day worm [28-30]. This server supplies service to 300 users with a wide international social network due to the research mission of our Institution [31]: this fact grant us a huge amount of SMTP traffic. 3 Dataset We analyze mail-server log of 1065 days length period (2 years and 11 months). To speed up the process we used LMA (Log Mail Analyzer [32]) to make the log more readable. LMA is a Perl program, open source, available on Sourceforge, which makes Postfix and Sendmail logs human readable. It reconstructs every single e-mail transaction spread across the mail server log and it creates a plain text file in a simpler format like. Every row represents a single transaction and it has the following fields: • Timestamp: it is the moment in which the e-mail has been sent: it is possible to have this information in Unix timestamp format or through the Julian format in standard date. • Client: it is the hostname of e-mail sender (HELO identifier). • Client IP: it is the IP of the sender’s host. • From: it is the e-mail address of the sender. • To: it is the e-mail address of the receiver. • Status: it is the server response (e.g. 450, 550 etc.). With this format is possible to find the moment in which the e-mail has been sent, the sender client name and IP, the from and to field of the e-mail and the server response. 
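To make the record layout concrete, the sketch below defines a container for one LMA transaction row with the six fields just listed; the field names and the whitespace-separated row format are illustrative assumptions rather than the exact LMA output (the worked example that follows shows one such record).

```python
from dataclasses import dataclass

@dataclass
class MailTransaction:
    """One reconstructed e-mail transaction as produced by LMA (illustrative layout)."""
    timestamp: str   # moment of sending, Unix timestamp or standard date
    client: str      # HELO identifier of the sending host
    client_ip: str   # IP address of the sender's host
    sender: str      # From: e-mail address
    recipient: str   # To: e-mail address
    status: str      # server response code, e.g. "250", "450", "550"

def parse_row(row: str) -> MailTransaction:
    """Parse one whitespace-separated row of the simplified format (assumed layout)."""
    ts, ip, frm, to, status = row.split()
    return MailTransaction(timestamp=ts, client="", client_ip=ip,
                           sender=frm, recipient=to, status=status)
```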
Lets make an example: if Paul@myisp.com send an e-mail on 23 march 2006 to Pamela@myisp.com from X.X.2.235 and all the e-mail server transactions go successful we will have a record like this: 23/03/2006 X.X.2.235 Paul@myisp.com Pamela@myisp.com 250 Statistical Anomaly Detection on Real e-Mail Traffic 173 As already said, we want to stress that our data is not synthetic and so it doesn’t suffer of bias of any form: it reflects the complete set of emails received by a single hightraffic e-mail server and it represents the overall view of a typical network operator. Furthermore, contrary to synthetic ones, it suffers of accidental hardware faults: can you say that a network topology is static and it is not prone to wanted and unwanted changes? Intrusion Detection evaluation dataset have some hypothesis, one of these is the never changing topology and immortal hardware health. As a matter of fact this is not true, this is not reality: Murphy’s Law holds true and strikes with extraordinary efficiency. In addition our data are only about SMTP flow and, due to the long-term monitoring, are a good snapshot of all-day life and, romantically, a silent witness of Internet growth and e-mail use growth. 4 Analysis Our analysis has been made on the e-mail traffic of ten C-class network in a period of 900 days, from January 2004 to November 2006. In our analysis, we work on the global e-mail flow in a given time interval. We use a threshold detection [33], like other software do (e.g. Snort ): if the traffic volume rises above a given threshold, the system triggers an alarm. The given threshold is calculated in a statistical way, where we determine the network normal e-mail traffic in selected slices of time: for example we take the activity of a month and we divide the month in five-minutes slices, calculating how many e-mails are normally sent in five minutes. After that, we check that the number of e-mails sent in a day during each interval don’t exceed the threshold. We call this kind of analysis base-line analysis. Our strategy is to study the temporal correlation between the present behaviour (maybe modified by the presence of a worm activity) of a given entity (pc, entire network) and its past behaviour (normal activity, no virus or worm presence). Before proceeding, however, we pre-process the data subtracting the mean to the values and cutting all the interval with a negative number of e-mails, because we wanted to obfuscate the no-activity and few activity periods, not interesting for our purposes. In other words we trashed all the time slices characterized by a number of e-mail sent below the month average, with the purpose of dynamically selecting activity periods (working hours, no holidays etc). If we didn’t perform this pre-processing we could have had an average which depended on night time, weekend or holidays duration. E-mails sent mean in 2004, before pre-processing, was 524 in a day for 339 activity day: after data pre-processing was 773 in a day for 179 activity day. After this we calculate the baseline activity of working hours according to the following: µ + 3σ. The mean and the variance are calculated for every month, modelling the network behaviour, taking into account every chosen time interval (e.g. we divide February in five-minutes slices, we count how many e-mails are sent in these periods and then we calculate the mean of these intervals). Values have been compared with the baseline threshold and if found greater than it they have been marked. 
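The base-line computation just described can be summarised in a short sketch: slice a month into fixed-width bins, drop the bins below the monthly mean (the pre-processing step), and set the alarm threshold to µ + 3σ of the remaining bins. This is a simplified reconstruction of the procedure; the function names and the bin width are assumptions.

```python
import statistics

def baseline_threshold(counts_per_bin):
    """Compute the monthly alarm threshold from per-bin e-mail counts.

    counts_per_bin: e-mails sent in each fixed-width slice (e.g. 5 minutes)
    of one month. Bins below the monthly mean are discarded first, so that
    nights, weekends and holidays do not drag the baseline down.
    """
    mean_all = statistics.mean(counts_per_bin)
    active = [c for c in counts_per_bin if c >= mean_all]  # keep activity periods only
    mu = statistics.mean(active)
    sigma = statistics.stdev(active)
    return mu + 3 * sigma

def alerts(counts_per_bin, threshold):
    """Indices of the bins whose e-mail count exceeds the base-line threshold."""
    return [i for i, c in enumerate(counts_per_bin) if c > threshold]
```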
Analyzing the first five months with a five minutes slice we found too many alerts and a lot of them exceeded the threshold only for few e-mails. So we thought to correlate the alerts found with a five minutes period with those found with an hour period, with the hypothesis that a worm which has infected a host sends a lot of e-mail both in a short period and in a bit 174 M. Aiello, D. Chiarella, and G. Papaleo longer period. To clarify the concept lets take the analysis for a month: April 2004 (see Fig. 1 and Fig. 2). The five minutes base-line resulted in 63 e-mails while the one hour base-line is 463. In five-minute analysis we found sixteen alerts, meanwhile in one-hour analysis only three. Why do we find a so big gap between the two approaches? In five-minutes analysis we have a lot of false alarms, due to the presence of e-mails sent to very large mailing lists while in one-hour analysis we find very few alarms, but these alarms result more significant because they represents a continuative violation of the normal (expected) activity. Correlating these results, searching the selected five-minutes periods in the five one-hour alert we detected that a little set of the five-minute alarms were near in the temporal line: after a deeper analysis, using our knowledge and experience on real user’s activity we concluded that it was a worm activity. Fig. 1. Example of e-mail traffic: hour base-line Fig. 2. Example of e-mail traffic: five minutes base-line 4.1 SMTP Sender Analysis Sometimes, peaks catch from flow analysis were e-mail sent to mailing list which are, as already said, bothersome hoaxes. This fact produced from analysis, where we analyze how many different e-mail address every host use: we look which from field is Statistical Anomaly Detection on Real e-Mail Traffic 175 used by every host. In fact an host, owned by a single person or few persons, is not likely to use a lot of different e-mail addresses in a short time and if it does so, it is highly considerable a suspicious behaviour. So we think that this analysis could be used to identify true positives, or to suggest suspect activity. Of course it isn’t so straight that a worm will change from field continuously, but it is a likely event. 4.2 SMTP Reject Analysis One typical feature of a malware is haste in spreading the infection. This haste leads indirect worms to send a lot of e-mail to unknown receivers or nonexistent e-mail address: this is a mistake that, we think, it is very important. In fact all e-mails sent to a nonexistent e-mail address are rejected by the mail-server, and they are tracked in the log. In this step of our work we analyze rejected e-mail flow: we work only on emails referred by internet server. By this approach we identified worm activity. Table 1. 
Experimental results Date Infected Host Analysis 28/01/04 18:00 X.X.6.24 Baseline, from, reject 29/01/04 10:30 X.X.4.24 From, reject 28/04/2004 14:58-15:03 X.X.7.20 Baseline, from, reject 28/04/2004 15:53-15:58 X.X.5.216 Baseline, from, reject 29/04/2004 09:08-10:03 X.X.6.36, X.X.7.20 Baseline, from, reject 04/05/2004 12:05-12:10 X.X.5.158 Baseline, from, reject 04/05/2004 13:15-13:25 X.X.5.158 Baseline, from, reject 31/08/04 14:51 X.X.3.234 Baseline, reject X.X.3.234 Baseline, reject X.X.3.101, X.X.3.200, X.X.3.234 X.X.5.123 Baseline, reject X.X.10.10 Baseline, from, reject X.X.10.10 Baseline, from, reject X.X.10.10 Baseline, from, reject X.X.10.10 Baseline, from, reject 31/08/04 17:46 23/11/04 11:38 22/08/05 17:13 22/08/05 17:43 22/08/05 20:18 22/08/05 22:08-22:13 176 M. Aiello, D. Chiarella, and G. Papaleo 5 Results The approach does detect 14 worms activity, mostly concentrated in 2004. We think that this fact is caused by new firewall policies introduced in 2005 and by the introduction of a second antivirus engine in our Mail Transfer Agent. Moreover in last years we haven’t got very large worm infection. The results we obtained are summarized in Table 1. 6 Conclusion Baseline analysis can be useful in identifying some indirect worm activity, but this approach need some integration by some other methods, because it lacks a complete vision of SMTP activity: this lack can be filled by methods which analyze some other SMTP aspects, like From and To e-mail field. In future this method can be integrated in an anomaly detection system to get more accuracy in detecting anomalies. Acknowledgments. This work was supported by National Research Council of Italy and University of Genoa. References 1. Axelsson, S.: Intrusion detection systems: A survey and taxonomy,Tech. Rep. 99-15, Chalmers Univ (March 2000) 2. Verwoerd, T., Hunt, R.: Intrusion detection techniques and approaches. Comput. Commun. 25(15), 1356–1365 (2002) 3. Ilgun, K., Kemmerer, R.A., Porras, P.A.: State transition analysis: A rule-based intrusion detection approach. IEEE Transactions on Software Engineering 21(3), 181–199 (1995) 4. Kumar, S., Spafford, E.H.: A software architecture to support misuse intrusion detection. In: Proceedings of the 18th National Information Security Conference, pp. 194–204 (1995) 5. Denning, D.E.: An intrusion detection model. IEEE Transactions on Software Engineering (1987) 6. Estvez-Tapiador, J.M., Garcia-Teodoro, P., Diaz-Verdejo, J.E.: Anomaly detection methods in wired networks: A survey and taxonomy. Comput. Commun. 27(16), 1569–1584 (2004) 7. Du, Y., Wang, W.-q., Pang, Y.-G.: An intrusion detection method using average hamming distance. In: Proceedings of the Third International Conference on Machine Learning and Cybernetics, Shanghai, 26-29 August (2004) 8. Anderson, D., Frivold, T., Valdes, A.: Next-generation intrusion detection expert system (NIDES). Computer Science Laboratory (SRI Intemational, Menlo Park, CA): Technical reportSRI-CSL-95-07 (1995) 9. Wang, Y., Abdel-Wahab, H.: A Multilayer Approach of Anomaly Detection for Email Systems. In: Proceedings of the 11th IEEE Symposium on Computers and Communications (ISCC 2006) (2006) 10. http://www.internetworldstats.com/stats.htm 11. http://en.wikipedia.org/wiki/ Notable_computer_viruses_and_worms Statistical Anomaly Detection on Real e-Mail Traffic 177 12. Moore, D., Paxson, V., Savage, S., Shannon, C., Staniford, S., Weaver, N.: Inside the slammer worm. IEEE Magazine of Security and Privacy, 33–39 (July/August 2003) 13. 
Leyden, J.: Zombie PCs spew out 80% of spam. The Register (June 2004) 14. Yasami, Y., Farahmand, M., Zargari, V.: An ARP-based Anomaly Detection Algorithm Using Hidden Markov Model in Enterprise Networks. In: Second International Conference on Systems and Networks Communications (ICSNC 2007) (2007) 15. Berk, V., Bakos, G., Morris, R.: Designing a Framework for Active Worm Detection on Global Networks. In: Proceedings of the first IEEE International Workshop on Information Assurance (IWIA 2003), Darmstadt, Germany (March 2003) 16. Bakos, G., Berk, V.: Early detection of internet worm activity by metering icmp destination unreachable messages. In: Proceedings of the SPIE Aerosense 2002 (2002) 17. Whyte, D., Kranakis, E., van Oorschot, P.C.: DNS-based Detection of Scanning Worms in an Enterprise Network. In: Proceedings of the 12th Annual Network and Distributed System Security Symposium, San Diego, USA, February 3-4 (2005) 18. Whyte, D., van Oorschot, P.C., Kranakis, E.: Addressing Malicious SMTP-based MassMailing Activity Within an Enterprise Network 19. Hofmeyr, S.A., Forrest, S., Somayaji, A.: Intrusion Detection using Sequences of System Calls. Journal of Computer Security 6(3), 151–180 (1998) 20. Cha, B.: Host anomaly detection performance analysis based on system call of NeuroFuzzy using Soundex algorithm and N-gram technique. In: Proceedings of the 2005 Systems Communications (ICW 2005) (2005) 21. http://www.ietf.org/rfc/rfc0821.txt 22. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html 23. http://www.ll.mit.edu/IST/ideval/data/data_index.html 24. http://en.wikipedia.org/wiki/Network_address_translation 25. http://spamassassin.apache.org/ 26. Harris, E.: The Next Step in the Spam Control War: Greylisting 27. http://en.wikipedia.org/wiki/DNSBL 28. Crandall, J.R., Su, Z., Wu, S.F., Chong, F.T.: On Deriving Unknown Vulnerabilities from Zero-Day Polymorphic and Metamorphic Worm Exploits. In: CCS 2005, Alexandria, Virginia, USA, November 7–11 (2005) 29. Portokalidis, G., Bos, H.: SweetBait: Zero-Hour Worm Detection and Containment Using Honeypots 30. Akritidis, P., Anagnostakis, K., Markatos, E.P.: Efficient Content-Based Detection of Zero-DayWorms 31. http://www.cnr.it/sitocnr/home.html 32. http://lma.sourceforge.net/ 33. Behaviour-Based Network Security Goes Mainstream, David Geer, Computer (March 2006) On-the-fly Statistical Classification of Internet Traffic at Application Layer Based on Cluster Analysis Andrea Baiocchi1, Gianluca Maiolini2, Giacomo Molina1, and Antonello Rizzi1 1 INFOCOM Dept., University of Roma “Sapienza” Via Eudossiana 18 - 00184 Rome, Italy andrea.baiocchi@uniroma1.it, giacomo.molina@libero.it, rizzi@infocom.uniroma1.it 2 ELSAG Datamat – Divisione automazione sicurezza e trasporti, Via Laurentina 760 – 00143 Rome, Italy gianluca.maiolini@elsagdatamat.com Abstract. We address the problem of classifying Internet packet flows according to the application level protocol that generated them. Unlike deep packet inspection, which reads up to application layer payloads and keeps track of packet sequences, we consider classification based on statistical features extracted in real time from the packet flow, namely IP packet lengths and inter-arrival times. A statistical classification algorithm is proposed, built upon the powerful and rich tools of cluster analysis. By exploiting traffic traces taken at the Networking Lab of our Department and traces from CAIDA, we defined data sets made up of thousands of flows for up to five different application protocols. 
With the classic approach of training and test data sets we show that cluster analysis yields very good results in spite of the little information it is based on, to stick to the real time decision requirement. We aim to show that the investigated applications are characterized from a ”signature” at the network layer that can be useful to recognize such applications even when the port number is not significant. Numerical results are presented to highlight the effect of major algorithm parameters. We discuss complexity and possible exploitation of the statistical classifier. 1 Introduction As broadband communications widen the range of popular applications, there is an increasing demand of fast traffic classification means according to the services data is generated by. The specific meaning of service depends on the context and purpose of traffic classifications. In case of traffic filtering for security or policy enforcement purposes, service can be usually identified with application layer protocol. However, many kind of different services exploit http or ssh (e.g. file transfer, multimedia communications, even P2P), so that a simple header based filter (e.g. exploiting the IP address and TCP/UDP port numbers) may be inadequate. Traffic classification at application level can be therefore based on the analysis of the entire packets content (header plus payload), usually by means of finite state machine schemes. Although there are widely available software tools for such a classification approach (e.g. L7filter, BRO, Snort), they can hardly catch up with high speed links and are usually inadequate for backbone use (e.g. Gbps links). E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 178–185, 2009. springerlink.com © Springer-Verlag Berlin Heidelberg 2009 On-the-fly Statistical Classification of Internet Traffic 179 The solution based on port analysis is becoming ineffective because of applications running on non-standard ports (e.g. peer-to-peer). Furthermore, traffic classification based on deep packet inspection is resource-consuming and hard to implement on high capacity links (e.g. Gbps OC links). For these reasons, different approaches to traffic classification have been developed, using all the information available at network layer. Some proposals ([4], [5]), however, need semantically complete TCP flows as input: we target a real-time tool, able to classify the application layer protocol of a TCP connection by observing just the first few packets of the connection (hereinafter referred to as a flow). A number of works [5], [6], [7] rely on unsupervised learning techniques. The only features they use is packets size and packets direction: they demonstrate the effectiveness of these algorithms even using a small number of packets (e.g. the first four of a TCP connection). We believe that even packets inter-arrival time contains pieces of information relevant to address the classification problem. We provide a way to exalt the information contained in inter-arrival times, preserving the real-time characteristic of the approach described in [7]: we try to clean the interarrival time (as better explained in section 3) assessing the contribution of network congestion, to exalt the time depending on the application layer protocol. The paper is organized as follows. In Section 2 the classification problem is defined and notation is introduced. Section 3 is devoted to the description of the traffic data sets used for the defined algorithm assessment and the numerical evaluation. 
The cluster analysis based statistical classifier is defined in Section 4. Numerical examples are given in Section 5 and final remarks are stated in Section 6. 2 Problem Statement In this paper, we focus on the classification of IP flows generated from network applications communicating through TCP protocol as HTTP, SMTP, POP3, etc. With this in mind, we define flow F as the unidirectional, ordered sequence of IP packets produced either by the client towards the server, or by the server towards the client during an application layer session. The server-client flow FServer will be composed of (NServer + 1) IP packets, from PK0 to PKNserver , where PKj represents the j-th IP packet sent by the server to the client; the corresponding client-server flow FClient will be composed by (NClient + 1) IP packets. At the IP layer, each flow F can be characterized as an ordered sequence of N pairs Pi = (Ti ; Li), with 1 < i < N, where Li represents the size of PKi (including TCP/IP Header) and Ti represents the inter-arrival time between PKi-1 and PKi. In our study we consider only semantically complete TCP flows, namely flows starting with SYN or SYN-ACK TCP segment (respectively for client-to-server and server-to-client direction). Because of the limited number of packets considered in this work, we don’t care about the FIN TCP segment to be observed. With this in mind, we aim to recognize a description of protocols (through clustering techniques): such a description should be based on the first few packets of the flows and should be able to strongly characterize each analyzed protocol. The purpose of this work is the definition of an algorithm that takes as input a traffic flow from an unknown application and that gives as output the (probable) application responsible of its generation. 180 A. Baiocchi et al. 3 Dataset Description In this work, we focus our attention on five different application layer protocols, namely HTTP, FTP-Control, POP3, SMTP and SSH, which are among the most used protocols on the INTERNET. As for HTTP and FTP-Control (FTP-C in the following), we collected traffic traces in the Networking Lab at our Department. By means of automated tools mounted on machines within the Lab, thousands of web pages carefully selected have been visited in a random order, over thousands of web sites distributed in various geographical areas (Italy, Europe, North America, Asia). FTP sites have been addressed as well and control FTP session established with thousands remote servers, again distributed in a wide area. The generated traffic has been captured on our LAN switch; we verified that the TCP connections bottleneck was never the link connecting our LAN to the big Internet to avoid the measured inter-arrival times to be too noisy. This experimental set up, while allowing the capture of artificial traffic that (realistically) emulates user activity, gives us traces with reliable application layer protocol classification. Traffic flows for the other protocols (POP3, SMTP, SSH) are extracted form backbone traffic traces made available by CAIDA. Precisely, we randomically extracted flows from the OC-48 traces of the days 2002-08-14, 2003-01-15 and 2003-04-24. Due to privacy reasons, only anonymized packet traces with no payloads are made available. Regarding to SSH, it can be configured as an encrypted tunnel to transport every kind of applications. 
Even in its "normal" behaviour (remote management), it would be difficult to recognize a specific behavioural pattern because of its human-interactive nature. For these reasons we expect the classification results involving SSH flows to be worse than those without them. Starting from these traffic traces, and focusing our attention only on semantically complete server-client flows, we created two different data sets with 1000 flows for each application. Each flow in a data set is described by the following fields:

• a protocol label coded as an integer from 1 to 5;
• P ≥ 1 couples (Ti, Li), where Ti is the inter-arrival time (difference between timestamps) of the (i−1)-th and i-th packet of the considered flow and Li is the IP packet length of the i-th packet of the flow, i = 1,…, P.

Inter-arrival times are in seconds, packet lengths are in bytes. The 0-th packet of a flow, used as a reference for the first inter-arrival time, is conventionally defined as the one carrying the SYN-ACK TCP segment for the server-to-client direction. The label in the first field is used as the target class for flows in both the training and test sets. The other 2P quantities are normalized and define a 2P-dimensional array associated with the considered flow. Normalization has to be done carefully: we choose to normalize packet lengths between 40 and 1500 bytes, which are the minimum and maximum observed lengths. As for inter-arrival times, normalization is done over an entire data set of M flows making up a training or a test set. Let (T_i^{(j)}, L_i^{(j)}) be the i-th couple of the j-th flow (i = 1,…, P; j = 1,…, M). Then we let:

\hat{T}_i^{(j)} = \frac{T_i^{(j)} - \min_{1 \le k \le M} T_i^{(k)}}{\max_{1 \le k \le M} T_i^{(k)} - \min_{1 \le k \le M} T_i^{(k)}}, \qquad
\hat{L}_i^{(j)} = \frac{L_i^{(j)} - 40}{1500 - 40}, \qquad i = 1, \ldots, P    (1)

In the following we assume P = 5. A different version of this data set, the so-called pre-processed data set, has also been used. In this case, inter-arrival times are replaced by the differential inter-arrival times, obtained as DT_i = T_i − T_0, i = 1,…, P, where T_0 is the time elapsing between the packet carrying the TCP SYN-ACK of the flow and the next packet, most of the time a presentation message (as for FTP-C) or an ACK (as for HTTP). So T_0 approximates the first RTT of the connection, including only time depending on TCP computation (as we have seen during our experimental setup). The differential delay can therefore be expressed as

DT_i = T_i - T_0 \approx RTT_i - RTT_0 + TA_i    (2)

where we account for the fact that T_0 ≈ RTT_0 and that T_i comprises the i-th RTT and, in general, an application-dependent time TA_i. Hence, we expect application layer protocol features to be more evident in the pre-processed data set than in the plain one, since the contribution of the applications to inter-arrival times is usually much smaller than the average RTT in a wide area network. On the other hand, in the case of differential inter-arrival times, the noise affecting the application-dependent inter-arrival times is reduced to the RTT variation (zero on average).

4 A Basic Classification System Based on Cluster Analysis

In this section we give some details about the adopted classification system. Basically, a classification problem can be defined as follows. Let P : X → L be an unknown oriented process to be modeled, where X is the domain set and the codomain L is a label set, i.e.
a set in which it is not possible (or misleading) to define an ordering function, and hence any dissimilarity measure between its elements. If P is a single-valued function, we call it a classification function. Let Str and Sts be two sets of input-output pairs, namely the training set and the test set. We call an instance of a classification problem a given pair (Str, Sts) with the constraint Str ∩ Sts = Ø. A classification system is a pair (M, TA), where TA is the training algorithm, i.e. the set of instructions responsible for generating, exclusively on the basis of Str, a particular instance M¯ of the classification model family M, such that the classification error of M¯ computed on Sts is minimized. The generalization capability, i.e. the capability to correctly classify any pattern belonging to the input space of the oriented process to be modeled, is certainly the most important desired feature of a classification system. From this point of view, the mean classification error on Sts can be considered an estimate of the expected behaviour of the classifier over all possible inputs. In the following, we describe a classification system trained by an unsupervised (clustering) procedure. When dealing with patterns belonging to the vector space R^n we can adopt a distance measure, such as the Euclidean distance; moreover, in this case we can define the prototype of a cluster as the centroid (the mean vector) of all the patterns in the cluster, thanks to the algebraic structure defined on R^n. Consequently, the distance between a given pattern x_i and a cluster C_k can easily be defined as the Euclidean distance d(x_i, µ_k), where µ_k is the centroid of the m_k patterns belonging to C_k:

\mu_k = \frac{1}{m_k} \sum_{x_i \in C_k} x_i    (3)

A direct way to synthesize a classification model on the basis of a training set Str consists in partitioning the patterns in the input space (discarding the class label information) with a clustering algorithm (in our case, K-means). Subsequently, each cluster is labeled with the most frequent class among its patterns. Thus, a classification model is a set of labeled clusters (centroids); note that more than one cluster can be associated with the same label, i.e. a class can be represented by more than one cluster. Assuming a floating-point number is represented with four bytes and class labels are coded with one byte, the amount of memory needed to store a classification model is K · (4 · n + 1) bytes, where n is the input space dimension. An unlabeled pattern x is classified by determining the closest centroid µ_i (and thus the closest cluster C_i) and by labeling x with the class label associated with C_i. It is important to underline that, since the initialization step of K-means is not deterministic, in order to compute a precise estimate of the performance of the classification model on the test set Sts, the whole algorithm must be run several times, averaging the classification errors on Sts yielded by the different classification models obtained in each run.

5 Numerical Results

In this section we provide numerical results for the classification algorithm. We investigated two groups of applications, the first containing HTTP, FTP-C, POP3 and SMTP, the second also including SSH. Using the non-preprocessed data set (hereinafter referred to as original) we obtain a classification accuracy on Sts comparable with that achievable with port-based classification.
This happens because the effect of RTT almost completely covers the information carried from inter-arrival times. In Table 1 and 2 are listed the global results and the individual contributions of the protocols to the average value. Using the pre-processed data set we obtain much better results, in particular for the case with only 4 protocols as we can see in Table 3 and Table 4. An important thing to consider is the complexity of the classification model, namely the number of clusters used. The performance does not significantly increase after 20 clusters (Fig. 2 and Fig. 3): this means we can achieve good results with a simple model that requires On-the-fly Statistical Classification of Internet Traffic 183 Table 1. Average classification accuracy vs # Clusters, P=5, # flows (training+test)=1000, original data set Table 2. Average classification accuracy vs # Clusters, P=5, # flows (training+test) =1000, original data set Table 3. Average classification accuracy vs # Clusters, P=5, # flows (training+test)=1000, preprocessed data set not much computation. We can see from Table 4 the negative impact SSH has on the overall classification accuracy, mainly because it is a human-driven protocol, hard to characterize with few hundreds of flows. Moreover, we can see that SSH suffers of overfitting problems, as the probability of success decreases as the number of clusters increases. 184 A. Baiocchi et al. Table 4. Average classification accuracy vs # Clusters, P=5, # flows (training+test)=1000, preprocessed data set Extending the data sets with unknown traffic, the classification probability is significantly reduced. Although the performance of the overall classification accuracy decreases, we can see in Table 5 the effect of unknown traffic. This means that our classifier is not mistaking flows of the considered protocols, but is just raising the false positive classifications due to unknown traffic, erroneously labeled as known traffic. Table 5. Average classification accuracy vs # Clusters, P=5, # flows (training+test)=1000, preprocessed data set with HTTP, FTP-C, POP3, SMTP, SSH, Unknown 6 Concluding Remarks In this work we present a model that could be useful to address the problem of traffic classification. To this end, we use only (poor) information available at network layer, namely packets size and inter-arrival times. In the next future we plan to better test the performances of this model, mainly extending the data sets we use to a greater number of protocols. We are also planning to collect traffic traces from our Department link to be able to accurately classify all protocols we want to analyze through payload analysis. Moreover, we will have to enforce the C-means algorithm to automatically select the optimal number of clusters relatively to the used data set. The following step will On-the-fly Statistical Classification of Internet Traffic 185 be the use of more recent and powerful fuzzy-like algorithms to achieve better performances in a real environment. Acknowledgments. Authors thank Claudio Mammone for his development work of the software package used in the collection of part of traffic traces analysed in this work. References 1. Karagiannis, T., Papagiannaki, K., Faloutsos, M.: BLINC: Multilevel traffic classification in the dark. In: Proc. of ACM SIGCOMM 2005, Philadelphia, PA, USA (August 2005) 2. Crotti, M., Dusi, M., Gringoli, F., Salgarelli, L.: Traffic Classification through Simple Statistical Fingerprinting. 
ACM SIGCOMM Computer Communication Review 37(1), 5–16 (2007) 3. Wright, C., Monrose, F., Masson, G.: On Inferring Application Protocol Behaviors in Encrypted Network Traffic. Journal of Machine Learning Research (JMLR): Special issue on Machine Learning for Computer Security 7, 2745–2769 (2006) 4. Moore, A.W., Zuev, D.: Internet traffic classification using Bayesian analysis techniques. In: ACM SIGMETRICS 2005, Banff, Alberta, Canada (June 2005) 5. McGregor, A., Hall, M., Lorier, P., Brunskill, J.: Flow clustering using machine learning techniques. In: PAM 2004, Antibes Juan-les-Pins, France (April 2004) 6. Zander, S., Nguyen, T., Armitage, G.: Automated traffic classification and application identification using machine learning. In: LCN 2005, Sydney, Australia (November 2005) 7. Bernaille, L., Teixeira, R., Salamatian, K.: ’Early Application Identification. In: Proceedings of CoNEXT (December 2006) Flow Level Data Mining of DNS Query Streams for Email Worm Detection Nikolaos Chatzis and Radu Popescu-Zeletin Fraunhofer Institute FOKUS, Kaiserin-Augusta-Allee 31, 10589 Berlin, Germany {nikolaos.chatzis,radu.popescu-zeletin}@fokus.fraunhofer.de Abstract. Email worms remain a major network security concern, as they increasingly attack systems with intensity using more advanced social engineering tricks. Their extremely high prevalence clearly indicates that current network defence mechanisms are intrinsically incapable of mitigating email worms, and thereby reducing unwanted email traffic traversing the Internet. In this paper we study the effect email worms have on the flow-level characteristics of DNS query streams a user machine generates. We propose a method based on unsupervised learning and time series analysis to early detect email worms on the local name server, which is located topologically near the infected machine. We evaluate our method against an email worm DNS query stream dataset that consists of 68 email worm instances and show that it exhibits remarkable accuracy in detecting various email worm instances1. 1 Introduction Email worms remain an ever-evolving threat, and unwanted email traffic traversing the Internet steadily escalates [1]. This causes network congestion, which results in loss of service or degradation in the performance of network resources [2]. In addition, email worms populate almost exclusively the monthly top threat lists of antivirus companies [3,4], and are used to deliver Trojans, viruses, and phishing attempts. Email worms rely mainly on social engineering to infect a user machine, and then they exploit information found on the infected machine about the email network of the user to spread via email among social contacts. Social engineering is a nontechnical kind of intrusion, which depends on human interaction to break normal security procedures. This propagation strategy differs significantly from IP address scanning propagation; therefore, it renders network detection methods that look for high rates at which unique destination addresses are contacted [5], or high number of failed connections [6], or high self-similarity of packet contents [7] incapable of detecting this class of Internet worms. Likewise, Honeypot-based systems [8], which provide a reliable anti-scanning mechanism, are not effective against email worms. Commonly applied approaches like antivirus and antispam software and recipient or message based rules have two deficiencies. 
They suffer from poor detection time 1 The research leading to these results has received funding from the European Community's Seventh Framework Programme (FP7/2007-2013) under grant agreement no. 216585 (INTERSECTION Project). E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 186–194, 2009. © Springer-Verlag Berlin Heidelberg 2009 springerlink.com Flow Level Data Mining of DNS Query Streams for Email Worm Detection 187 against novel email worm instances because they entail non-trivial human labour in order to develop a signature or a rule, and go to no lengths to reducing the unwanted email traffic traversing the Internet, as their target is to detect abusive email traffic in the network of the potential victim. In the past much research effort has been devoted to analyzing the traffic email worm-infected user machines generate [9-13]. These studies share the positive contribution that email worm infection affects at application layer the Domain Name System (DNS) traffic of a user machine. Detection methods based on this observation focus on straightforward application layer detection, which makes them suitable for detecting few specific outdated email worms. In this work we go a step beyond earlier work, and identify anomalies in DNS traffic that are common for email spreading malicious software, and as such they can serve as a strong basis for detecting various instances of email worms in the long run. We show that DNS query streams that non-infected user machines generate share at flow level many of the same canonical behaviours, while email worms rely on similar spreading methods that generate DNS traffic that share common patterns. We present a detection method that builds on unsupervised learning and time series analysis, and uses the wavelet transform in a different way than this often proposed in the Internet traffic analysis literature. We experiment with 68 worm instances that appeared in the wild between April 2004 and July 2007 to show that flow-level characteristics remain unaltered in the long run, and that our method is remarkably accurate. Inspecting packets at flow level does not involve deep packet analysis. This ensures user privacy, renders our approach unaffected by encryption and keeps the processing overhead low, which is a strong requirement for busy, high-speed networks. Moreover, DNS query streams consist of significantly less data than the input of conventional network intrusion detection systems. This is advantageous, since high volumes of input data inevitably degrade the effectiveness of these systems [14]. Additionally, the efficiency and deployment ease of an in-network detection system increase as the topological proximity between the system and the user machines decreases. Local name servers are the first link of the chain of Internet connectivity. Moreover, detection at the local name server, which is located topologically near the infected machine, contributes to reducing unwanted traffic traversing the Internet. The paper is organized as follows. In Section 2, we discuss related work. In Section 3, we explain our method for detecting email worms by flow-level analysis of DNS query streams. In Section 4, we validate our approach by examining its detection capabilities over various worm instances. We conclude in Section 5. 
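As a small illustration of the flow-level viewpoint described above (one query count per user machine per time bin, with no packet payloads involved), the sketch below aggregates a list of (timestamp, source IP) query records into per-machine time series; the field names, bin width and series length are assumptions made for the example.

```python
from collections import defaultdict

def query_time_series(queries, bin_seconds=60, num_bins=512):
    """Build one time series per user machine from flow-level DNS query records.

    queries: list of (timestamp_seconds, source_ip) pairs taken from the
    local name server log; no query names or payloads are needed.
    Returns {source_ip: [count per bin]} with num_bins bins of bin_seconds each.
    """
    start = min(t for t, _ in queries)
    series = defaultdict(lambda: [0] * num_bins)
    for t, ip in queries:
        b = int((t - start) // bin_seconds)
        if b < num_bins:
            series[ip][b] += 1
    return series
```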
2 Related Work Since our work builds on DNS traffic analysis for detecting email worms and security oriented time series analysis of Internet traffic signals using the wavelet transform we provide below the necessary background on these areas. Previously published work provides evidence that the majority of today’s open Internet operational security issues affect DNS traffic. In this section we concentrate solely on email worms and refer the interest reader to [15] for a thorough analysis. Wong et al. [11] analyze DNS traffic captured at the local name server of a campus 188 N. Chatzis and R. Popescu-Zeletin network during the outbreak of SoBig.F and MyDoom.A. Musashi et al. [12] present similar measurements for the same worms, and extend their work by studying Netsky.Q and Mydoom.S in [13]. Whyte et al. [9] focus on enterprise networks and measure the DNS activity of NetSky.Q. Despite, their positive contribution in proving that there exists a correlation between email worm infection and DNS traffic, the efficacy of the detection methods these studies propose would have been dwarfed, if the methods had been evaluated against various worm instances. Indeed, the methods presented in [9,11,12,13] are straightforward and focus on application layer detection. They propose that many queries for Mail eXchange (MX) resource records (RR) or the relative numbers of queries for pointer (PTR) RR, MX and address (A) RR a user machine generates give a telltale sign of email worm infection. Although this observation holds for the few email worm instances studied in each paper, it can not be generalized for detecting various email worm instances. Furthermore, these methods neglect that DNS queries carry user sensitive information, and due to the high processing overhead they introduce by analyzing packet payloads, they are not suitable for busy, high speed networks. Moreover, as any other volume based method they require an artificial boundary as threshold on which the decision whether a user machine is infected or not is taken. In [10] the authors argue that anomaly detection is a promising solution to detect email worm-infected user machines. They use Bayesian inference assuming a priori knowledge of worm signature DNS queries to detect email worm. However, such knowledge is not apparent, and if it was it would allow straightforward detection. The wavelet transform is a powerful tool, since its time and scale localization abilities make it ideally suited to detect irregular patterns in traffic traces. Although, many research papers appear that analyze Denial of Service (DoS) attack traffic [16,17] by means of wavelets, only a handful of papers deals with applying the wavelet transform on Internet worm signals [18,19]. Inspired by similar methodology, used to analyze DoS traffic, these papers concentrate solely on looking at worm traffic for repeating behaviors by means of the self-similarity parameter Hurst. In this work we present a method that accurately detects various email worms by analyzing DNS query streams at flow level. We show that flow-level characteristics remain unaltered in the long run. Flow-level analysis does not violate user privacy, renders our method unaffected by encryption, eliminates the need for not anonymized data; and makes our method suitable for high-speed network environments. 
In our framework, we see DNS query streams from different hosts as independent time series and use the wavelet transform as a dimensionality reduction tool rather than a tool for searching self-similar patterns on a single signal. 3 Proposed Approach Our approach is based on time series analysis of DNS query streams. Given the time series representation, we show by similarity search over time series using clustering that user machines’ DNS activity fall into two canonical profiles: legitimate user behaviour and email worm infected behaviour. DNS query streams generated by user machines share many of the same canonical behaviours. Likewise, email worms rely on similar spreading methods generating query streams that share common patterns. Flow Level Data Mining of DNS Query Streams for Email Worm Detection 189 3.1 Data Management As input, our method uses the complete set of DNS queries that a local name server received within an observation interval. Since we are not interested in application level information, we retain for each query the time of the query and the IP address of the requesting user machine. We group DNS queries per requesting user machine. For each user machine we consider successive time bins of equal width, and we count the DNS queries in each bin. Thereby, we get a set of univariate time series, where each one of them expresses the number of DNS queries of a user machine through time. The set of time series can be expressed as an n× p time series matrix; n is the number of user machines and p the number of time bins. 3.2 Data Pre-processing A time series of length p can be seen as a point in the p-dimensional space. This allows using multivariate data mining algorithms directly to time series data. However, most data mining algorithms, and in particular, clustering algorithms do not work well for time series. Working with each and every time point, makes the significance of distance metrics, which are used to measure the similarity between objects, questionable [20]. To attack this problem numerous time series representations have been proposed that facilitate extracting a feature vector from the time series. The feature vector is a compressed representation of the time series, which serves as input to the data mining algorithms. Although many representations have been proposed in the time series literature, only few of them are suitable for our framework. The fundamental requirement for our work is that clustering the feature vectors should result in clusters containing feature vectors of time series, which are similar in the time series space. The authors in [26] proved that in order for this requirement to hold the representation should satisfy the lower bounding lemma. In [21] the authors give an overview of the state of art in representations and highlight those that satisfy the lower bounding lemma. We opt to use the Discrete Wavelet Transform (DWT). Motivation for our choice is that the DWT representation is intrinsically multi-resolution and allows simultaneous time and frequency analysis. Furthermore, for time series typically found in practice, many of the coefficients in a DWT representation are either zero or very small, which allows for efficient compression. Moreover, DWT is applicable for analyzing non-stationary signals, and performs well in compressing sparse spike time series. We apply the DWT independently on each time series of the time series matrix using Mallat's algorithm [22]. 
The Mallat algorithm is of O(p) time complexity and decomposes a time series of length p, where p must be power of two, in log2p levels. The number of wavelet coefficients computed after decomposing a time series is equal to the number of time points of the original time series. Therefore, applying the DWT on each line of the time series matrix gives an n× p wavelet coefficient matrix. To reduce the dimensionality of the wavelet coefficient matrix, we apply a compression technique. In [23] the author gives a detailed description of four compression techniques. Two of them operate on each time series independently and suggest retaining the k first wavelet coefficients or the k largest coefficients in terms of absolute normalized value. Whereas the rest two are applicable to a set of n time series, so 190 N. Chatzis and R. Popescu-Zeletin directly to the wavelet coefficient matrix. The first suggests retaining the k columns of the matrix that have the largest mean squared value. The second suggests retaining for a given k the n × k largest coefficients of the wavelet coefficient matrix. In the interest of space, we present here experimental results only retaining the first k wavelet coefficients. This produces an n × k feature vector matrix. 3.3 Data Clustering We validate our hypothesis that DNS query streams generated by non-infected user machines share similar characteristics, and that these characteristics are dissimilar to those of email worm-infected user machines in two steps. First we cluster the rows of the feature vector matrix in two clusters. Then we examine if one cluster contains only feature vectors of non-infected user machines and the other only feature vectors of email worm-infected machines. We use hierarchical clustering, since it produces relatively good results, as it facilitates the exploration of data at different levels of granularity, and is robust to variations of cluster size and shape. Hierarchical clustering methods require the user to specify a dissimilarity measure. We use as dissimilarity measure the Euclidean distance, since it produces comparable results to more sophisticated distance functions [24]. Hierarchical clustering methods are categorized into agglomerative and divisive. Agglomerative methods start with each observation in its own cluster and recursively merge the less dissimilar clusters into a single cluster. By contrast, the divisive scheme starts with all observations in one cluster and subdivides them into smaller clusters until each cluster consists of only one observation. Since, we search for a small number of clusters we use the divisive scheme. The most commonly cited disadvantage of the divisive scheme is its computational cost. A divisive algorithm considers first all divisions of the entire dataset into two non-empty sets. There are 2(n-1) - 1 possibilities of dividing n observations in two clusters, which is intractable for many practical datasets. However, DIvisive ANAlysis (DIANA) [25] uses a splitting heuristic to limit the number of possible partitions, which results in O(n2). Given its quadratic time on the number of observations, DIANA scales poor with the number of observations, however we use it because it is the only divisive method generally available and in our framework it achieves clustering on a personal computer in computing time in the order of milliseconds. 
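A minimal end-to-end sketch of the pipeline described in this section follows: a Haar wavelet decomposition computed with a Mallat-style cascade, truncation to the first k coefficients, and a two-way split of the feature vectors. It is illustrative only; it uses a hand-rolled Haar transform and SciPy's agglomerative (Ward) clustering as a stand-in for the divisive DIANA method used in the paper, and the function names and parameters are assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def haar_dwt(series):
    """Full Haar wavelet decomposition (Mallat-style cascade).

    The length must be a power of two (512 one-minute bins in the paper).
    Returns a coefficient vector of the same length as the input,
    ordered from the coarsest to the finest scale.
    """
    x = np.asarray(series, dtype=float)
    coeffs = []
    while len(x) > 1:
        avg = (x[0::2] + x[1::2]) / np.sqrt(2.0)   # approximation
        det = (x[0::2] - x[1::2]) / np.sqrt(2.0)   # detail
        coeffs.append(det)
        x = avg
    coeffs.append(x)
    return np.concatenate(coeffs[::-1])            # coarsest coefficients first

def feature_matrix(time_series_matrix, k):
    """Decompose every row and retain the first k wavelet coefficients."""
    return np.array([haar_dwt(row)[:k] for row in time_series_matrix])

def two_way_split(features):
    """Split feature vectors into two clusters and return the indices of the
    members of the less populated cluster, which are flagged as suspicious."""
    labels = fcluster(linkage(features, method="ward"), t=2, criterion="maxclust")
    small = min(set(labels), key=lambda c: np.sum(labels == c))
    return np.where(labels == small)[0]
```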
4 Experimental Evaluation

We set up an isolated computer cluster and launch 68 out of a total of 164 email worms reported between April 2004 and July 2007 in the monthly updated top threat lists of Virus Radar [3] and Viruslist [4]. Over a period of eight hours we capture the DNS query streams the infected machines generate, creating – to the best of our knowledge – the largest email worm DNS dataset that has to date been used to evaluate an in-network email worm detection method. As our isolated computer cluster has no real users, we merge the worm traffic with legitimate DNS traffic captured at the primary name server of our research institute, which serves between 350 and 500 users daily. We use three recent DNS log file fragments that we split into eight-hour datasets, and we present here experiments with four datasets. For each DNS query stream we consider a time series of length 512 with one-minute time bins. We decompose the time series, compress the wavelet coefficient matrix, retain k wavelet coefficients to make the feature vector matrix, and cluster the rows of the feature vector matrix into two clusters. In this paper we focus on showing that our method detects various instances of email worms; therefore, we assume that only one user machine is infected at any one time. The procedure is shown in Fig. 1:

\underbrace{\begin{bmatrix} ts_1^{um-1} & \cdots & ts_{512}^{um-1} \\ \vdots & & \vdots \\ ts_1^{um-n} & \cdots & ts_{512}^{um-n} \\ ts_1^{worm} & \cdots & ts_{512}^{worm} \end{bmatrix}}_{(n+1) \times 512}
\rightarrow
\underbrace{\begin{bmatrix} wc_1^{um-1} & \cdots & wc_{512}^{um-1} \\ \vdots & & \vdots \\ wc_1^{um-n} & \cdots & wc_{512}^{um-n} \\ wc_1^{worm} & \cdots & wc_{512}^{worm} \end{bmatrix}}_{(n+1) \times 512}
\rightarrow
\underbrace{\begin{bmatrix} fv_1^{um-1} & \cdots & fv_k^{um-1} \\ \vdots & & \vdots \\ fv_1^{um-n} & \cdots & fv_k^{um-n} \\ fv_1^{worm} & \cdots & fv_k^{worm} \end{bmatrix}}_{(n+1) \times k},
\qquad k \in \{4, 8, 16, 32, 64, 128, 256\}

Fig. 1. We append one infectious time series to the non-infected user machines' time series (Time Series Matrix); we decompose the time series to form the Wavelet Coefficient Matrix, which we compress to get the Feature Vector Matrix. This is the input to the clustering analysis, which we repeat 4 Datasets × 7 Feature vector lengths × 68 Email Worms = 1904 times.

We examine the resulting two-cluster scheme and find that it comprises one densely populated cluster and one sparsely populated cluster. Our method detects an email worm when its feature vector belongs to the sparsely populated cluster. In Fig. 2 we present the false positive and false negative rates. Our method erroneously reports legitimate user activity as suspicious in less than 1% of cases for every value of k, whereas worms are misclassified in less than 2% of cases if at least 16 wavelet coefficients are used.

Fig. 2. False negative and false positive rates for detecting various instances of email worms over four different DNS datasets (DS1, DS2, DS3 and DS4), while retaining 4, 8, 16, 32, 64, 128 or 256 wavelet coefficients. With 16 or more wavelet coefficients both rates fall below 2%.

Fig. 3. Accuracy measures computed independently for each worm show that our method is remarkably accurate. Only k values lower than or equal to the first that maximizes accuracy are plotted. With 16 or more wavelet coefficients only the Eyeveg.F email worm is not detected.

In Fig. 3, using the accuracy statistical measure, we look at the detection of each worm independently.
Accuracy is defined as the proportion of true results – both true positives and true negatives – in the population, and measures the ability of our method to identify correctly email worm-infected and non-infected user machines. The accuracy is in the range between 0 and 1, where values close to 1 indicate good detection. In the interest of space we present here only results over one dataset. Our method fails to detect only one out of 68 total email worms i.e. Eyeveg.F for every k, while with more of 16 wavelet coefficients all other email worms are detected. 5 Conclusion Email worms and the high amount of email abusive traffic continue to be a serious security concern. This paper presented a method to early detect email worms in the local name server, which is topologically near the infected host by analyzing DNS query streams characteristics at flow level. By experimenting with various email worm instances we show that these characteristics remain unaltered in the long run. Our method builds on unsupervised learning and similarity search over time series using wavelets. Our experimental results show that our method identifies various email worm instances with remarkable accuracy. Future work calls for analysing DNS data from other networking environments to assess the detection efficacy of our method over DNS data that might have different flow-level characteristics, and investigating the actions that can be triggered at the local name server once an email worm-infected user machine has been detected to contain email worm propagation. References 1. Messaging Anti-Abuse Working Group: Email Metrics Report, http://www.maawg.org 2. Symantec: Internet Security Threat Report Trends (January-June 2007), http://www.symantec.com Flow Level Data Mining of DNS Query Streams for Email Worm Detection 193 3. ESET Virus Radar, http://www.virus-radar.com 4. Kaspersky Lab Viruslist, http://www.viruslist.com 5. Roesch, M.: Snort - Lightweight Intrusion Detection for Networks. In: LISA 1999, 13th USENIX Systems Administration Conference, pp. 229–238. USENIX (1999) 6. Paxson, V.: Bro: A System for Detecting Network Intruders in Real-Time. In: 7th Conference on USENIX Security Symposium. USENIX (1998) 7. Singh, S., Estan, C., Varghese, G., Savage, S.: The Earlybird System for Real-time Detection of Unknown Worms. Tech. Report CS2003-0761, University of California (2003) 8. Provos, N., Holz, T.: Virtual Honeypots: From Botnet Tracking to Intrusion Detection. Addison Wesley Professional, Reading (2007) 9. Whyte, D., van Oorschot, P., Kranakis, E.: Addressing Malicious SMTP-based Mass Mailing Activity within an Enterprise Network. Technical Report TR-05-06, Carleton University, School of Computer Science (2005) 10. Ishibashi, K., Toyono, T., Toyama, K., Ishino, M., Ohshima, H., Mizukoshi, I.: Detecting Mass-Mailing Worm Infected Hosts by Mining DNS Traffic Data. In: MineNet 2005 ACM SIGCOMM Workshop, pp. 159–164. ACM Press, New York (2005) 11. Wong, C., Bielski, S., McCune, J., Wang, C.: A Study of Mass-Mailing Worms. In: WORM 2004 ACM Workshop, pp. 1–10. ACM Press, New York (2004) 12. Musashi, Y., Matsuba, R., Sugitani, K.: Indirect Detection of Mass Mailing WormInfected PC Terminals for Learners. In: 3rd International Conference on Emerging Telecommunications Technologies and Applications, pp. 233–237 (2004) 13. Musashi, Y., Rannenberg, K.: Detection of Mass Mailing Worm-Infected PC Terminals by Observing DNS Query Access. IPSJ SIG Notes, pp. 39–44 (2004) 14. 
Schaelicke, L., Slabach, T., Moore, B., Freeland, C.: Characterizing the Performance of Network Intrusion Detection Sensors. In: Recent Advances in Intrusion Detection, 6th International Symposium, RAID. LNCS, pp. 155–172. Springer, Heidelberg (2003) 15. Chatzis, N.: Motivation for Behaviour-Based DNS Security: A Taxonomy of DNS-related Internet Threats. In: International Conference on Emerging Security Information Systems, and Technologies, pp. 36–41. IEEE, Los Alamitos (2007) 16. Dainotti, A., Pescape, A., Ventre, G.: Wavelet-based Detection of DoS Attacks. In: Global Telecommunications Conference, GLOBECOM 2006, pp. 1–6. IEEE, Los Alamitos (2006) 17. Li, L., Lee, G.: DDoS Attack Detection and Wavelets. In: 12th International Conference on Computer Communications and Networks, ICCCN 2003, pp. 421–427. IEEE, Los Alamitos (2003) 18. Chong, K., Song, H., Noh, S.: Traffic Characterization of the Web Server Attacks of Worm Viruses. In: Int. Conference on Computational Science, pp. 703–712. Springer, Heidelberg (2003) 19. Dainotti, A., Pescape, A., Ventre, G.: Worm Traffic Analysis and Characterization. In: International Conference on Communications, ICC 2007, pp. 1435–1442. IEEE, Los Alamitos (2007) 20. Aggarwal, C., Hinneburg, A., Keim, D.: On the Surprising Behavior of Distance Metrics in High Dimensional Space. In: 8th Int. Conf. on Database Theory. LNCS, pp. 420–434. Springer, Heidelberg (2001) 21. Bagnall, A., Ratanamahatana, C., Keogh, E., Lonardi, S., Janacek, G.: A Bit Level Representation for Time Series Data Mining with Shape Based Similarity. Data Min. and Knowl. Discovery 13(1), 11–40 (2006) 194 N. Chatzis and R. Popescu-Zeletin 22. Mallat, S.: A Theory for Multiresolution Signal Decomposition: The Wavelet Representation. IEEE Transactions on Pattern Analysis and Machine Intelligence 11(7), 674–693 (1989) 23. Mörchen, F.: Time Series Feature Extraction for Data Mining Using DWT and DFT. Technical Report No. 33, Dept. of Maths and CS, Philipps-U. Marburg (2003) 24. Keogh, E., Kasetty, S.: On the Need for Time Series Data Mining Benchmarks: A survey and empirical demonstration. Data Min. and Knowl. Discovery 7(4), 349–371 (2003) 25. Kaufman, L., Rousseeuw, P.: Finding Groups in Data: An Introduction to Cluster Analysis. Wiley, Chichester (1990) 26. Faloutsos, C., Ranganathan, M., Manolopoulos, Y.: Fast Subsequence Matching in Time Series Databases. In: ACM SIGMOD International Conference on Management of Data, pp. 419–429. ACM Press, New York (1994) Adaptable Text Filters and Unsupervised Neural Classifiers for Spam Detection Bogdan Vrusias and Ian Golledge Department of Computing, Faculty of Electronic and Physical Sciences, University of Surrey, Guildford, UK {b.vrusias,cs31ig}@surrey.ac.uk Abstract. Spam detection has become a necessity for successful email communications, security and convenience. This paper describes a learning process where the text of incoming emails is analysed and filtered based on the salient features identified. The method described has promising results and at the same time significantly better performance than other statistical and probabilistic methods. The salient features of emails are selected automatically based on functions combining word frequency and other discriminating matrices, and emails are then encoded into a representative vector model. Several classifiers are then used for identifying spam, and self-organising maps seem to give significantly better results. 
Keywords: Spam Detection, Self-Organising Maps, Naive Bayesian, Adaptive Text Filters. 1 Introduction The ever increasing volume of spam brings with it a whole series of problems to a network provider and to an end user. Networks are flooded every day with millions of spam emails wasting network bandwidth while end users suffer with spam engulfing their mailboxes. Users have to spend time and effort sorting through to find legitimate emails, and within a work environment this can considerably reduce productivity. Many anti-spam filter techniques have been developed to achieve this [4], [8]. The overall premise of spam filtering is text categorisation where an email can belong in either of two classes: Spam or Ham (legitimate email). Text categorisation can be applied here as the content of a spam message tends to have few mentions in that of a legitimate email. Therefore the content of spam belongs to a specific genre which can be separated from normal legitimate email. Original ideas for filtering focused on matching keyword patterns in the body of an email that could identify it as spam [9]. A manually constructed list of keyword patterns such as “cheap Viagra” or “get rich now” would be used. For the most effective use of this approach, the list would have to be constantly updated and manually tuned. Overtime the content and topic of spam would vary providing a constant challenge to keep the list updated. This method is infeasible as it would be impossible to manually keep up with the spammers. Sahami et al. is the first to apply a machine learning technique to the field of antispam filtering [5]. They trained a Naïve Bayesian (NB) classifier on a dataset of pre-categorised ham and spam. A vector model is then built up of Boolean values representing the existence of pre-selected attributes of a given message. As well as E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 195–202, 2009. © Springer-Verlag Berlin Heidelberg 2009 springerlink.com 196 B. Vrusias and I. Golledge word attributes, the vector model could also contain attributes that represent nontextual elements of a message. For example, this could include the existence of a nonmatching URL embedded in the email. Other non-textual elements could include whether an email has an attachment, the use of bright fonts to draw attention to certain areas of an email body and the use of embedded images. Metsis et al. evaluated five different versions of Naïve Bayes on particular dataset [3]. Some of these Naïve Bayesian versions are more common in spam filtering than others. The conclusion of the paper is that the two Naïve Bayes versions used least in spam filtering provided the best success. These are a Flexible Bayes method and a Multinomial Naïve Bayes (MNB) with Boolean attributes. The lower computational complexity of the MNB provided it the edge. The purpose of their paper is not only to contrast the success of five different Naïve Bayes techniques but to implement the techniques in a situation of a new user training a personalized learning anti-spam filter. This involved incremental retraining and evaluating of each technique. Furthermore, methods like Support Vector Machines (SVM) have also been used to identify spam [13]. Specifically, term frequency with boosting trees and binary features with SVM’s had acceptable test performance, but both methods used high dimensional (1000 - 7000) feature vectors. Another approach is to look into semantics. 
Youn & McLeod introduced a method to allow for machine-understandable semantics of data [7]. The basic idea here is to model the concept of spam in order to semantically identify it. The results reported are very encouraging, but the model constructed is static and therefore not adaptable. Most of the above proposed techniques struggled with changes in the email styles and words used in spam emails. Therefore, it made sense to consider an automatic learning approach to spam filtering, in order to adapt to changes. In this approach spam features are updated based on newly arriving spam messages. This, together with a novel method for training online Self-Organising Maps (SOM) [2] and retrieving the classification of a new email, indicated good performance. Most importantly, the proposed method misclassified only very few ham messages as spam, and correctly identified most spam messages. This exceeds the performance of other probabilistic approaches, as shown later in the paper.

2 Spam Detection Methods

As indicated by previous research, one of the best ways so far to classify spam is to use probabilistic models, i.e. Bayesian classifiers [3], [5], [6], [8], [9]. For that reason, this paper compares the approach of using SOMs to what appears to be the best classifier for spam, the MNB Boolean classifier. Both approaches need to transform the text of an email message into a numerical vector; therefore several vector models have been proposed and are described later on.

2.1 Classifying with Multinomial NB Boolean

The MNB treats each message d as a set of tokens. Therefore d is represented by a numerical feature vector model. Each element of the vector model is a Boolean value indicating whether that token exists in the message or not. The probability P(x|c) can be calculated by trialling the probability of each token t occurring in a category c. The product of these trials, P(ti|c), for each category results in P(x|c) for the respective category. The equation is then [6]:

\[ \frac{P(c_s)\,\prod_{i=1}^{m} P(t_i \mid c_s)^{x_i}}{\sum_{c \in \{c_s, c_h\}} P(c)\,\prod_{i=1}^{m} P(t_i \mid c)^{x_i}} > T \qquad (1) \]

Each trial P(t|c) is estimated using a Laplacean prior:

\[ P(t \mid c) = \frac{1 + M_{t,c}}{2 + M_c} \qquad (2) \]

where Mt,c is the number of training messages of category c that contain the token t, and Mc is the total number of training messages of category c. The outcomes of all these trials are considered independent given the category, which is a naïve assumption. This simplistic assumption overlooks the fact that co-occurrences of words in a category are not independent; however, the technique still results in very good performance on classification tasks.

2.2 Classifying with Self-Organising Maps

Self-organising map (SOM) systems have been used consistently for classification and data visualisation in general [2]. The main function of a SOM is to identify salient features in the n-dimensional input space and squash that space into two dimensions according to similarity. Despite their popularity, SOMs are difficult to use after training is over. Although visually some clusters emerge in the output map, computationally it is difficult to classify a new input into a formed cluster and be able to semantically label it. For classification, a weighted majority voting (WMV) method is used to identify the label of a new, unknown input [12]. Voting is based on the distance of each input vector from the node vector.
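A short sketch makes the decision rule of Eqs. (1)–(2) concrete. This is an illustrative reading of the equations rather than the authors' code: the toy token matrix, the function names and the threshold T = 0.5 are assumptions, and the computation is done in log space for numerical stability.

```python
import numpy as np

def train_mnb_boolean(X, y):
    """X: (N, m) Boolean token-presence matrix; y: 1 for spam, 0 for ham.
    Returns the priors P(c) and the Laplace-smoothed token probabilities
    P(t|c) of Eq. (2): (1 + M_t,c) / (2 + M_c)."""
    priors, token_probs = {}, {}
    for c in (0, 1):
        Xc = X[y == c]
        priors[c] = len(Xc) / len(X)
        token_probs[c] = (1.0 + Xc.sum(axis=0)) / (2.0 + len(Xc))
    return priors, token_probs

def spam_score(x, priors, token_probs):
    """Left-hand side of Eq. (1): the spam joint probability divided by
    the sum of the joint probabilities over both categories."""
    log_joint = {}
    for c in (0, 1):
        # p^x_i: only tokens present in the message (x_i = 1) contribute.
        log_joint[c] = np.log(priors[c]) + np.sum(x * np.log(token_probs[c]))
    m = max(log_joint.values())
    return np.exp(log_joint[1] - m) / sum(np.exp(v - m) for v in log_joint.values())

# Toy example with four tokens; classify as spam when the score exceeds T.
X = np.array([[1, 1, 0, 0], [1, 0, 1, 0], [0, 0, 1, 1], [0, 1, 1, 1]])
y = np.array([1, 1, 0, 0])
priors, probs = train_mnb_boolean(X, y)
print(spam_score(np.array([1, 1, 0, 0]), priors, probs) > 0.5)  # True
```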
For the proposed process for classifying an email and adapting it to new coming emails, the feature vector model is calculated based on the first batch of emails and then the SOM is trained on the first batch of emails with random weights. Then, a new batch of emails is appended, the feature vector model is recalculated, and the SOM is retrained from the previous weights, but on the new batch only. Finally, a new batch of emails inserted and the process is repeated until all batches are finished. For the purpose of the experiments, as described later, a 10x10 SOM is trained for 1000 cycles, where each cycle is a complete run of all inputs. The learning rate and neighbourhood value is started at high values, but then decreased exponentially towards the end of the training [12]. Each training step is repeated several times and results are averaged to remove any initial random bias. 198 B. Vrusias and I. Golledge 3 Identifying Salient Features The process of extracting salient features is probably the most important part of the methodology. The purpose here is to identify the keywords (tokens) that differentiate spam from ham. Typical approaches so far focused on pure frequency measures for that purpose, or the usage of the term frequency inverse document frequency (tf*idf) metric [10]. Furthermore, the weirdness metric that calculates the frequency ration of tokens used in special domains like spam, against the ratio in the British National Corpus (BNC), reported accuracy to some degree [1], [12]. This paper uses a combination function of the weirdness and TFIDF metrics, and both metrics are used in their normalised form. The ranking Rt of each token is therefore calculated based on: Rt = weirdness t × tf ∗ idf t (3) The weirdness metric compares the frequency of the token in the spam domain against the frequency of the same token in BNC. For tf*idf the “document” is considered as a category where all emails belonging to that same category are merged together, and document frequency is the total number of categories (i.e. 2 in this instance: spam and ham). The rating metric R is used to build a list of most salient features in order to encode emails into binary numerical input vectors (see Fig. 1). After careful examination a conclusion is drawn, that in total, the top 500 tokens are enough to represent the vector model. Furthermore, one more feature (dimension) is added in the vector, indicating whether there are any ham tokens present in an email. The binary vector model seemed to work well and has the advantage of the computational simplicity. SPAM EMAIL HAM EMAIL Subject: dobmeos with hgh my energy level has gone up ! stukm Introducing doctor – formulated hgh Subject: re : entex transistion thanks so much for the memo . i would like to reiterate my support on two key issues : 1 ) . thu - best of luck on this new assignment . howard has worked hard and done a great job ! please don ' t be shy on asking questions . entex is critical to the texas business , and it is critical to our team that we are timely and accurate . 2 ) . rita : thanks for setting up the account team . communication is critical to our success , and i encourage you all to keep each other informed at all times . the p & l impact to our business can be significant . additionally , this is high profile , so we want to assure top quality . thanks to all of you for all of your efforts . let me know if there is anything i can do to help provide any additional support . 
rita wynne … human growth hormone - also called hgh is referred to in medical science as the master hormone. it is very plentiful when we are young , but near the age of twenty - one our bodies begin to produce less of it . by the time we are forty nearly everyone is deficient in hgh , and at eighty our production has normally diminished at least 90 - 95 % . advantages of hgh : - increased muscle strength - loss in body fat - increased bone density - lower blood pressure - quickens wound healing - reduces cellulite - increased … sexual potency

Fig. 1. Sample spam and ham emails. Large bold words indicate top-ranked spam words and smaller words indicate spam words with low ranking, whereas normal black text indicates non-spam words.

In most cases of generating feature vectors, scientists usually concentrate on static models that require complete refactoring when information changes or when the user provides feedback. In order to cope with the demand for change, the proposed model can automatically recalculate the salient features and appropriately adapt the vector model to accommodate this. The method can safely modify/update the vector model every 100 emails in order to achieve best performance. Basically, the rank list is modified depending on the contents of the newly arriving emails. This is clearly visualised in Fig. 2, where it is observable that as more email batches (of 100 emails) are presented, the tokens in the list get updated. New "important" tokens are quickly placed at the top of the rank, but the ranking changes based on other new entries.

Fig. 2. Random spam keyword ranking (rank on the y-axis, from 0 to 500; batch number on the x-axis) as it evolved through the training process for the Enron1 dataset, tracked for the keywords viagra, hotlist, xanax, pharmacy, vicodin, pills, sofftwaares, valium, prozac and computron. Each batch contains 100 emails. The graph shows that each new keyword entry has an impact on the ranking list, and it then fluctuates to accommodate newly arriving keywords.

4 Experimentation: Spam Detection

In order to evaluate spam filters a dataset with a large volume of spam and ham messages is required. Gathering public benchmark datasets of a large size has proven difficult [8]. This is mainly due to privacy issues of the senders and receivers of ham emails within a particular dataset. Some datasets have tried to bypass the privacy issue by considering ham messages collected from freely accessible sources such as mailing lists. The Ling-Spam dataset consists of spam received at the time and a collection of ham messages from an archived list of linguist mails. The SpamAssassin corpus uses ham messages donated by the public or collected from public mailing lists. Other datasets like SpamBase and PU only provide the feature vectors rather than the content itself and are therefore considered inappropriate for the proposed method.

4.1 Setup

One of the most widely used datasets in spam filtering research is the Enron dataset. From a set of 150 mailboxes with messages, various benchmark datasets have been constructed. A subset as constructed by Androutsopoulos et al. [6] is used, containing the mailboxes of six users within the dataset. To reflect the different scenarios of a personalised filter, each dataset is interlaced with varying amounts of spam (from a variety of sources), so that some had a ham-spam ratio of 1:3 and others 3:1.
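The ranking that drives these updates (Eq. 3 in Section 3) can be sketched as follows. This is a schematic reading rather than the authors' implementation: the exact normalisation is not specified in the paper, so min-max normalisation is assumed, the BNC relative frequencies are assumed to be available as a dictionary, and the per-batch recomputation is only indicated in the comments.

```python
import math
from collections import Counter

def rank_tokens(spam_tokens, ham_tokens, bnc_freq, top_n=500):
    """Rank tokens by R_t = weirdness_t * tfidf_t over the emails seen so
    far; intended to be re-run after every batch of 100 emails.

    spam_tokens / ham_tokens: flat token lists of the spam and ham emails.
    bnc_freq: relative token frequencies in the British National Corpus.
    """
    spam_tf, ham_tf = Counter(spam_tokens), Counter(ham_tokens)
    n_spam = sum(spam_tf.values())

    weird, tfidf = {}, {}
    for t, f in spam_tf.items():
        # Weirdness: frequency ratio in the spam domain vs. ratio in the BNC.
        weird[t] = (f / n_spam) / max(bnc_freq.get(t, 0.0), 1e-9)
        # tf*idf with two "documents": the merged spam and ham categories.
        df = 1 + (1 if t in ham_tf else 0)
        tfidf[t] = f * math.log(2.0 / df)

    def minmax(d):  # assumed normalisation of both metrics
        lo, hi = min(d.values()), max(d.values())
        return {k: (v - lo) / (hi - lo + 1e-9) for k, v in d.items()}

    w, ti = minmax(weird), minmax(tfidf)
    return sorted(spam_tf, key=lambda t: w[t] * ti[t], reverse=True)[:top_n]
```

The returned top-500 list would then be used to encode each email as a binary vector, with the extra ham-token indicator dimension appended as described in Section 3.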
To implement the process of incremental retraining, the approach suggested by Androutsopoulos et al. [6] is adapted, where the messages of each dataset are split into batches b1,…,bl of k adjacent messages. Then, for batch i=1 to l-1, the filter is trained on batch bi and tested on batch bi+1. The number of emails per batch is k=100. The performance of a spam filter is measured by its ability to correctly identify spam and ham while minimising misclassification. nh→h and ns→s represent the numbers of correctly classified ham and spam messages. nh→s represents the number of ham messages misclassified as spam (false positives) and ns→h represents the number of spam messages misclassified as ham (false negatives). Spam precision and recall are then calculated. These measurements are useful for showing the basic performance of a spam filter. However, they do not take into account the fact that misclassifying a ham message as spam is an order of magnitude worse than misclassifying a spam message as ham. A user can cope with a number of false negatives, however a false positive could result in the loss of a potentially important legitimate email, which is unacceptable to the user. So, when considering the statistical success of a spam filter, the consequence weight associated with false positives should be taken into account. Androutsopoulos et al. [6] introduced the idea of a weighted accuracy measurement (WAcc):

\[ WAcc_{\lambda} = \frac{\lambda \cdot n_{h \rightarrow h} + n_{s \rightarrow s}}{\lambda \cdot N_h + N_s} \qquad (4) \]

Nh and Ns represent the total numbers of ham and spam messages respectively. In this measurement each legitimate ham message is treated as λ messages. Every false positive is counted as λ errors instead of just one. The higher the value of λ, the higher the cost of each ham misclassification. When λ=99, misclassifying a ham message is as bad as letting 99 spam messages through the filter. The value of λ can be adjusted depending on the scenario and consequences involved.

4.2 Results

Across the six datasets the results show a variance in the performance of the MNB and a consistent performance of the SOM. Across the first three datasets, with a ratio of 3:1 in favour of ham, the MNB almost perfectly classifies ham messages; however, the recall of spam is noticeably low. This is especially apparent in Enron 1, which appears to be the most difficult dataset. The last three datasets have a 3:1 ratio in favour of spam, and this change in ratio is reflected in a change in the pattern of the MNB results. The recall of spam is highly accurate; however, many ham messages are missed by the classifier. The pattern of performance of the SOM across the 6 datasets is consistent. Recall of spam is notably high over each dataset with the exception of a few batches. The ratio of spam to ham in the datasets appears to have no bearing on the results of the SOM. The recall of ham messages in each batch is very high, with a very low percentage of ham messages misclassified. The resulting overall accuracy is very high for the SOM and consistently higher than for the MNB. However, the weighted accuracy puts a different perspective on the results. Fig. 3 (a), (b) and (c) show the MNB achieving consistently better weighted accuracy than the SOM. Although the MNB misclassified a lot of spam, the cost of the SOM misclassifying a small number of ham messages results in the MNB being more effective on these datasets. Fig. 3 (d), (e) and (f) show the SOM outperforming the MNB on the spam-heavy datasets.
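For concreteness, Eq. (4) with λ=99 amounts to the following few lines; the per-batch counts in the example are invented for illustration.

```python
def weighted_accuracy(n_h_to_h, n_s_to_s, total_ham, total_spam, lam=99):
    """Eq. (4): WAcc_lambda = (lam * n_h->h + n_s->s) / (lam * N_h + N_s).
    Each ham message counts lam times, so one false positive costs as much
    as lam false negatives."""
    return (lam * n_h_to_h + n_s_to_s) / (lam * total_ham + total_spam)

# Example batch of 100 messages (75 ham, 25 spam): one ham misclassified
# as spam and two spam messages missed.
print(round(weighted_accuracy(74, 23, 75, 25, lam=99), 4))  # 0.9864
```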
The MNB missed a large proportion of ham messages and consequently the weighted accuracy is considerably lower than that of the SOM.

Fig. 3. WAcc results for all Enron datasets (1-6), panels (a) to (f), for SOM and MNB with λ=99. Training / testing is conducted based on 30 batches of Spam and Ham, and a total number of 3000 emails. The y-axis shows WAcc and the x-axis indicates the batch number.

5 Conclusions

This paper has discussed and evaluated two classifiers for the purpose of categorising emails into the classes of spam and ham. Both the MNB and SOM methods are incrementally trained and tested on 6 subsets of the Enron dataset. The methods are evaluated using a weighted accuracy measurement. The results of the SOM proved consistent over each dataset, maintaining an impressive spam recall. A small percentage of ham emails are misclassified by the SOM. Each ham missed is treated as the equivalent of missing 99 spam emails. This lowered the overall effectiveness of the SOM. The MNB demonstrated a trade-off between false positives and false negatives as it struggled to maintain high performance on both. Where it struggled to classify spam in the first three datasets, ham recall is impressive and consequently the WAcc is consistently better than that of the SOM. This pattern is reversed in the final three datasets as many ham messages are missed, and the SOM outperformed the MNB. Further evaluations are currently being made into the selection of salient features and the size of the attribute set. This work aims to reduce the small percentage of ham misclassified by the SOM to improve its weighted accuracy performance.

References

1. Manomaisupat, P., Vrusias, B., Ahmad, K.: Categorization of Large Text Collections: Feature Selection for Training Neural Networks. In: Corchado, E., Yin, H., Botti, V., Fyfe, C. (eds.) IDEAL 2006. LNCS, vol. 4224, pp. 1003–1013. Springer, Heidelberg (2006)
2. Kohonen, T.: Self-organizing Maps, 2nd edn. Springer, New York (1997)
3. Metsis, V., Androutsopoulos, I., Paliouras, G.: Spam Filtering with Naïve Bayes – Which Naïve Bayes? In: CEAS, 3rd Conf. on Email and AntiSpam, California, USA (2006)
4. Zhang, L., Zhu, J., Yao, T.: An Evaluation of Statistical Spam Filtering Techniques. ACM Trans. on Asian Language Information Processing 3(4), 243–269 (2004)
5. Sahami, M., Dumais, S., Heckerman, D., Horvitz, E.: A Bayesian Approach to Filtering Junk E-mail. In: Learning for Text Categorization – Papers from the AAAI Workshop, Madison, Wisconsin, pp. 55–62 (1998)
6. Androutsopoulos, I., Paliouras, G., Karkaletsi, V., Sakkis, G., Spyropoulos, C.D., Stamatopoulos, P.: Learning to Filter Spam E-Mail: A Comparison of a Naïve Bayesian and a Memory-Based Approach. In: Proceedings of the Workshop Machine Learning and Textual Information Access, 4th European Conf. on KDD, Lyon, France, pp. 1–13 (2000)
7.
Youn, S., McLeod, D.: Efficient Spam Email Filtering using Adaptive Ontology. In: 4th International Conf. on Information Technology, ITNG 2007, pp. 249–254 (2007) 8. Hunt, R., Carpinter, J.: Current and New Developments in Spam Filtering. In: 14th IEEE International Conference on Networks, ICON 2006, vol. 2, pp. 1–6 (2006) 9. Peng, F., Schuurmans, D., Wang, S.: Augmenting Naive Bayes Classifiers with Statistical Language Models. Information Retrieval 7, 317–345 (2004) 10. Salton, G., Buckley, C.: Term-weighting approaches in automatic text retrieval. Information Processing & Management 24(5), 513–523 (1988) 11. Vrusias, B.: Combining Unsupervised Classifiers: A Multimodal Case Study, PhD thesis, University of Surrey (2004) 12. Drucker, H., Wu, D., Vapnik, V.N.: Support Vector Machines for Spam Categorization. IEEE Transactions on Neural Networks 10(5), 1048–1054 (1999) A Preliminary Performance Comparison of Two Feature Sets for Encrypted Traffic Classification Riyad Alshammari and A. Nur Zincir-Heywood Dalhousie University, Faculty of Computer Science {riyad,zincir}@cs.dal.ca Abstract. The objective of this work is the comparison of two types of feature sets for the classification of encrypted traffic such as SSH. To this end, two learning algorithms – RIPPER and C4.5 – are employed using packet header and flow-based features. Traffic classification is performed without using features such as IP addresses, source/destination ports and payload information. Results indicate that the feature set based on packet header information is comparable with flow based feature set in terms of a high detection rate and a low false positive rate. Keywords: Encrypted Traffic Classification, Packet, Flow, and Security. 1 Introduction In this work our objective is to explore the utility of two possible feature sets – Packet header based and Flow based – to represent the network traffic to the machine learning algorithms. To this end, we employed two machine learning algorithms – C4.5 and RIPPER [1] – in order to classify encrypted traffic, specifically SSH (Secure Shell). In this work, traffic classification is performed without using features such as IP addresses, source/destination ports and payload information. By doing so, we aim to develop a framework where privacy concerns of users are respected but also an important task of network management, i.e. accurate identification of network traffic, is achieved. Having an encrypted payload and being able to run different applications over SSH makes it a challenging problem to classify SSH traffic. Traditionally, one approach to classifying network traffic is to inspect the payload of every packet. This technique can be extremely accurate when the payload is not encrypted. However, encrypted applications such as SSH imply that the payload is opaque. Another approach to classifying applications is using well-known TCP/UDP port numbers. However, this approach has become increasingly inaccurate, mostly because applications can use non-standard ports to by-pass firewalls or circumvent operating systems restrictions. Thus, other techniques are needed to increase the accuracy of network traffic classification. The rest of this paper is organized as follows. Related work is discussed in section 2. Section 3 details the methodology followed. Aforementioned machine learning algorithms are detailed in Section 4, and the experimental results are presented in Section 5. Finally, conclusions are drawn and future work is discussed in Section 6. E. Corchado et al. 
(Eds.): CISIS 2008, ASC 53, pp. 203–210, 2009. © Springer-Verlag Berlin Heidelberg 2009 springerlink.com 204 R. Alshammari and A.N. Zincir-Heywood 2 Related Work In literature, Zhang and Paxson present one of the earliest studies of techniques based on matching patterns in the packet payloads [2]. Early et al. employed a decision tree classifier on n-grams of packets for classifying flows [3]. Moore et al. used Bayesian analysis to classify flows into broad categories [4]. Karagiannis et al. proposed an approach that does not use port numbers or payload information [5], but their system cannot classify distinct flows. Wright et al. investigate the extent to which common application protocols can be identified using only packet size, timing and direction information of a connection [6]. They employed a kNN and HMM learning systems to compare the performance. Their performance on SSH classification is 76% detection rate and 8% false positive rate. Bernaille et al. employed first clustering and then classification to the first three packets in each connection to identify SSL connections [7]. Haffner et al. employed AdaBoost, Hidden Markov Models (HMM), Naive Bayesian and Maximum Entropy models to classify network traffic into different applications [8]. Their results showed AdaBoost performed the best on their data sets. In their work, the classification rate for SSH was 86% detection rate and 0% false positive rate but they employed the first 64 bytes of the payload. Recently, Williams et al. [9] compared five different classifiers – Bayesian Network, C4.5, Naive Bayes (two different types) and Naive Bayes Tree – using flows. They found that C4.5 performed better than the others. In our previous work [10], we employed RIPPER and AdaBoost algorithms for classifying SSH traffic. RIPPER performed better than AdaBoost by achieving 99% detection rate and 0.7% false positive rate. However, in that work, all tests were performed using flow based feature sets, whereas in this work we not only employ other types of classifiers but also investigate the usage of packet header based feature sets. 3 Methodology In this work, RIPPER and C4.5 based classifiers are employed to identify the most relevant feature set – Packet header vs. Flow – to the problem of SSH traffic classification. For packet header based features used, the underlying principle is that features employed should be simple and clearly defined within the networking community. They should represent a reasonable benchmark feature set to which more complex features might be added in the future. Given the above model, Table 1 lists the packet header Feature Set used to represent each packet to our framework. In the above table, payload length and inter-arrival time are the only two features, which are not directly obtained from the header information but are actually derived using the data in the header information. In the case of inter-arrival time, we take the Table 1. Packet Header Based Features Employed IP Header length IP Time to live TCP Header length Payload length (derived) IP Fragment flags IP Protocol TCP Control bits Inter-arrival time (derived) A Preliminary Performance Comparison of Two Feature Sets 205 difference in milliseconds between the current packet and the previous packet sent from the same host within the same session. In the case of payload length, we calculate it using Eq. 1 for TCP packets and using Eq. 2 for UDP packets. 
Payload length = IPTotalLength - (IPHeaderLength x 4) - (TCPHeaderLength x 4) (1) Payload length = IPTotalLength - (IPHeaderLength x 4) – 8 (2) For the flow based feature set, a feature is a descriptive statistic that can be calculated from one or more packets for each flow. To this end, NetMate [11] is employed to generate flows and compute feature values. Flows are bidirectional and the first packet seen by the tool determines the forward direction. We consider only UDP and TCP flows that have no less than one packet in each direction and transport no less than one byte of payload. Moreover, UDP flows are terminated by a flow timeout, whereas TCP flows are terminated upon proper connection teardown or by a flow timeout, whichever occurs first. The flow timeout value employed in this work is 600 seconds [12]. We extract the same set of features used in [9, 10] to provide a comparison environment for the reader, Table 2. Table 2. Flow Based Features Employed Protocol # Packets in forward direction Min forward inter-arrival time Std. deviation of forward inter-arrival times Mean forward inter-arrival time Max forward inter-arrival time Std. deviation of backward inter-arrival times Min forward packet length Max forward packet length Std deviation of forward packet length Mean backward packet length Duration of the flow # Bytes in forward direction # Bytes in backward direction # Packets in backward direction Mean backward inter-arrival time Max backward inter-arrival time Min backward inter-arrival time Mean forward packet length Min backward packet length Std. deviation of backward packet length Max backward packet length Fig. 1. Generation of network traffic for the NIMS data set In our experiments, the performance of the different machine learning algorithms is established on two different traffic sources: Dalhousie traces and NIMS traces. 206 R. Alshammari and A.N. Zincir-Heywood - Dalhousie traces were captured by the University Computing and Information Services Centre (UCIS) in January 2007 on the campus network between the university and the commercial Internet. Given the privacy related issues university may face, data is filtered to scramble the IP addresses and further truncate each packet to the end of the IP header so that all payload is excluded. Moreover, the checksums are set to zero since they could conceivably leak information from short packets. However, any length information in the packet is left intact. Thus the data sets given to us are anonymized and without any payload information. Furthermore, Dalhousie traces are labeled by a commercial classification tool (deep packet analyzer) called PacketShaper [13] by the UCIS. This provides us the ground truth for the training. PacketShaper labeled all traffic either as SSH or non-SSH. - NIMS traces consist of packets collected on our research test-bed network. Our data collection approach is to simulate possible network scenarios using one or more computers to capture the resulting traffic. We simulate an SSH connection by connecting a client computer to four SSH servers outside of our test-bed via the Internet, Figure 1.We ran the following six SSH services: (i) Shell login; (ii) X11; (iii) Local tunneling; (iv) Remote tunneling; (v) SCP; (vi) SFTP. We also captured the following application traffic: DNS, HTTP, FTP, P2P (limewire), and Telnet. These traces include all the headers, and the application payload for each packet. Since both of the traffic traces contain millions of packets (40 GB of traffic). 
We performed subset sampling to limit the memory and CPU time required for training and testing. Subset sampling algorithms are a mature field of machine learning in which it has already been thoroughly demonstrated that performance of the learner (classifier) is not impacted by restricting the learner to a subset of the exemplars during training [14]. The only caveat to this is that the subset be balanced. Should one samples without this constraint one will provide a classifier which maximizes accuracy, where this is known to be a rather poor performance metric. However, the balanced subset sampling heuristic here tends to maximize the AUC (measurements of the fraction of the total area that falls under the ROC curve) statistic, a much more robust estimator of performance [14]. Thus, we sub-sampled 360,000 packets from each aforementioned data source. The 360,000 packets consist of 50% in-class (application running over SSH) and 50% out-class. From the NIMS traffic trace, we choose the first 30000 packets of X11, SCP, SFTP, Remote-tunnel, Local-tunnel and Remote login that had payload size bigger than zero. These packets are combined together in the in-class. The out-class is sampled from the first 180000 packets that had payload size bigger than zero. The out-class consists of the following applications FTP, Telnet, DNS, HTTP and P2P (lime-wire). On the other hand, from the Dalhousie traces, we filter the first 180000 packets of SSH traffic for the in-class data. The out-class is sampled from the first 180000 packets. It consists of the following applications FTP, DNS, HTTP and MSN. We then run these data sets through NetMate to generate the flow feature set. We generated 30 random training data sets from both sub-sampled traces. Each training data set is formed by randomly selecting (uniform probability) 75% of the in-class and 75% of the out-class without replacement. In case of Packet header feature set, each training data set contains 270,000 packets while in case of NetMate feature set, each training data set contains 18095 flows (equivalent of 270,000 packets in terms of flows), Table 3. In this table, some applications have packets in the training sample A Preliminary Performance Comparison of Two Feature Sets 207 but no flows. This is due to the fact that we consider only UDP and TCP flows that have no less than one packet in each direction and transport no less than one byte of payload. Table 3. Private and Dalhousie Data Sets SSH FTP TELNET DNS HTTP P2P (limewire) NIMS Training Sample for IPheader (total = 270000) x 30 135000 14924 13860 17830 8287 96146 NIMS Training Sample for NetMate (total = 18095) x 30 1156 406 777 1422 596 13738 SSH FTP TELNET DNS HTTP MSN Dalhousie Training Sample for IPheader (total = 270000) x 30 135000 139 0 2985 127928 3948 Dalhousie Training Sample for NetMate (total = 12678) x 30 11225 2 0 1156 295 0 4 Classifiers Employed In order to classify SSH traffic; two different machine learning algorithms – RIPPER and C4.5 – are employed. The reason is two-folds: As discussed earlier Williams et al. compared five different classifiers and showed that a C4.5 classifier performed better than the others [9], whereas in our previous work [10], a RIPPER based classifier performed better than the AdaBoost, which was shown to be the best performing model in [8]. RIPPER, Repeated Incremental Pruning to Produce Error Reduction, is a rulebased algorithm, where the rules are learned from the data directly [1]. 
Rule induction does a depth-first search and generates one rule at a time. Each rule is a conjunction of conditions on discrete or numeric attributes and these conditions are added one at a time to optimize some criterion. In RIPPER, conditions are added to the rule to maximize an information gain measure [1]. To measure the quality of a rule, minimum description length is used [1]. RIPPER stops adding rules when the description length of the rule base is 64 (or more) bits larger than the best description length. Once a rule is grown and pruned, it is added to the rule base and all the training examples that satisfy that rule are removed from the training set. The process continues until enough rules are added. In the algorithm, there is an outer loop of adding one rule at a time to the rule base and inner loop of adding one condition at a time to the current rule. These steps are both greedy and do not guarantee optimality. C4.5 is a decision tree based classification algorithm. A decision tree is a hierarchical data structure for implementing a divide-and-conquer strategy. It is an efficient non-parametric method that can be used both for classification and regression. In nonparametric models, the input space is divided into local regions defined by a distance metric. In a decision tree, the local region is identified in a sequence of recursive splits in smaller number of steps. A decision tree is composed of internal decision nodes and terminal leaves. Each node m implements a test function fm(x) with discrete outcomes labeling the branches. This process starts at the root and is repeated until a leaf node is hit. The value of a leaf constitutes the output. A more detailed explanation of the algorithm can be found in [1]. 208 R. Alshammari and A.N. Zincir-Heywood 5 Experimental Results In this work we have 30 training data sets. Each classifier is trained on each data set via 10-fold cross validation. The results given below are averaged over these 30 data sets. Moreover, results are given using two metrics: Detection Rate (DR) and False Positive Rate (FPR). In this work, DR reflects the number of SSH flows correctly classified, whereas FPR reflects the number of Non-SSH flows incorrectly classified as SSH. Naturally, a high DR rate and a low FPR would be the desired outcomes. They are calculated as follows: DR = 1 − (#FNClassifications / TotalNumberSSHClassifications) FPR = #FPClassifications / TotalNumberNonSSHClassifications where FN, False Negative, means SSH traffic classified as non- SSH traffic. Once the aforementioned feature vectors are prepared, RIPPER, and C4.5, based classifiers are trained using WEKA [15] (an open source tool for data mining tasks) with its default parameters for both algorithms. Tables 4 and 6 show that the difference between the two feature sets based on DR is around 1%, whereas it is less than 1% for FPR. Moreover, C4.5 performs better than RIPPER using both feature sets, but again the difference is less than 1%. C4.5 achieves 99% DR and 0.4% FPR using flow based features. Confusion matrixes, Tables 5 and 7, show that the number of SSH packets/flows that are misclassified are notably small using C4.5 classifier. Table 4. Average Results for the NIMS data sets Classifiers Employed RIPPER for Non-SSH RIPPER for –SSH C4.5 for Non-SSH C4.5 for –SSH IP-header Feature DR FPR 0.99 0.008 0.991 0.01 0.991 0.007 0.993 0.01 NetMate Feature DR FPR 1.0 0.002 0.998 0.0 1.0 0.001 0.999 0.0 Table 5. 
Confusion Matrix for the NIMS data sets Classifiers Employed RIPPER for Non-SSH RIPPER for –SSH C4.5 for Non-SSH C4.5 for –SSH IP-header Feature Non-SSH SSH 133531 1469 1130 133870 133752 1248 965 134035 NetMate Feature Non-SSH SSH 16939 0 2 1154 16937 2 1 1155 Results reported above are averaged over 30 runs, where 10-fold cross validation is employed at each run. In average, a C4.5 based classifier achieves a 99% DR and almost 0.4% FPR using flow based features and 98% DR and 2% FPR using packet header based features. In both cases, no payload information, IP addresses or port numbers are used, whereas Haffner et al. achieved 86% DR and 0% FPR using the A Preliminary Performance Comparison of Two Feature Sets 209 Table 6. Average Results for the Dalhousie data sets Classifiers Employed RIPPER for Non-SSH RIPPER for –SSH C4.5 for Non-SSH C4.5 for –SSH IP-header Feature DR FPR 0.974 0.027 0.972 0.025 0.98 0.02 0.98 0.02 NetMate Feature DR FPR 0.994 0.0008 0.999 0.005 0.996 0.0004 0.999 0.004 Table 7. Confusion Matrix for the Dalhousie data sets Classifiers Employed RIPPER for Non-SSH RIPPER for –SSH C4.5 for Non-SSH C4.5 for –SSH IP-header Feature Non-SSH SSH 131569 3431 3696 131304 13239 2651 2708 132292 NetMate Feature Non-SSH SSH 1445 8 8 11217 1447 6 5 11220 first 64 bytes of the payload of the SSH traffic [8]. This implies that they have used the un-encrypted part of the payload, where the handshake for SSH takes place. On the other hand, Wright et al. achieved a 76% DR and 8% FPR using packet size, time and direction information only [6]. These results show that our proposed approach achieves better performance in terms of DR and FPR for SSH traffic than the above existing approaches in the literature. 6 Conclusions and Future Work In this work, we investigate the performance of two feature sets using C4.5 and RIPPER learning algorithms for classifying SSH traffic from a given traffic trace. To do so, we employ data sets generated at our lab as well as employing traffic traces captured on our Campus network. We tested the aforementioned learning algorithms using packet header and flow based features. We have employed WEKA (with default settings) for both algorithms. Results show that feature set based on packet header is compatible with the statistical flow based feature set. Moreover, C4.5 based classifier performs better than RIPPER on the above data sets. C4.5 can achieve a 99% DR and less than 0.5% FPR at its test performance to detect SSH traffic. It should be noted again that in this work, the objective of automatically identifying SSH traffic from a given network trace is performed without using any payload, IP addresses or port numbers. This shows that the feature sets proposed in this work are both sufficient to classify any encrypted traffic since no payload or other biased features are employed. Our results are encouraging to further explore the packet header based features. Given that such an approach requires less computational cost and can be employed on-line. Future work will follow similar lines to perform more tests on different data sets in order to continue to test the robustness and adaptability of the classifiers and the feature sets. We are also interested in defining a framework for generating good training data sets. Furthermore, investigating our approach for other encrypted applications such as VPN and Skype traffic is some of the future directions that we want to pursue. 210 R. Alshammari and A.N. Zincir-Heywood Acknowledgments. 
This work was in part supported by MITACS, NSERC and the CFI new opportunities program. Our thanks to John Sherwood, David Green and Dalhousie UCIS team for providing us the anonymozied Dalhousie traffic traces. All research was conducted at the Dalhousie Faculty of Computer Science NIMS Laboratory, http://www.cs.dal.ca/projectx. References [1] Alpaydin, E.: Introduction to Machine Learning. MIT Press, Cambridge; ISBN: 0- 26201211-1 [2] Zhang, Y., Paxson, V.: Detecting back doors. In: Proceedings of the 9th USENIX Security Symposium, pp. 157–170 (2000) [3] Early, J., Brodley, C., Rosenberg, C.: Behavioral authentication of server flows. In: Proceedings of the ACSAC, pp. 46–55 (2003) [4] Moore, A.W., Zuev, D.: Internet Traffic Classification Using Bayesian Analysis Techniques. In: Proceedings of the ACM SIGMETRICS, pp. 50–60 (2005) [5] Karagiannis, T., Papagiannaki, K., Faloutsos, M.: BLINC: Multilevel Traffic Classification in the Dark. In: Proceedings of the ACM SIGCOMM, pp. 229–240 (2006) [6] Wright, C.V., Monrose, F., Masson, G.M.: On Inferring Application Protocol Behaviors in Encrypted Network Traffic. Journal of Machine Learning Research 7, 2745–2769 (2006) [7] Bernaille, L., Teixeira, R.: Early Recognition of Encrypted Applications. In: Passive and Active Measurement Conference (PAM), Louvain-la-neuve, Belgium (April 2007) [8] Haffner, P., Sen, S., Spatscheck, O., Wang, D.: ACAS: Automated Construction of Application Signatures. In: Proceedings of the ACM SIGCOMM, pp. 197–202 (2005) [9] Williams, N., Zander, S., Armitage, G.: A Prelimenary Performance Comparison of Five Machine Learning Algorithms for Practical IP Traffic Flow Comparison. ACM SIGCOMM Computer Communication Review 36(5), 7–15 (2006) [10] Alshammari, R., Zincir-Heywood, A.N.: A flow based approach for SSH traffic detection, IEEE SMC, pp. 296–301 (2007) [11] NetMate (last accessed, January 2008), http://www.ip-measurement.org/tools/netmate/ [12] IETF (last accessed January 2008), http://www3.ietf.org/proceedings/97apr/ 97apr-final/xrtftr70.htm [13] PacketShaper (last accessed, January 2008), http://www.packeteer.com/products/packetshaper/ [14] Weiss, G.M., Provost, F.J.: Learning When Training Data are Costly: The Effect of Class Distribution on Tree Induction. Journal of Artificial Intelligence Research 19, 315–354 (2003) [15] WEKA Software (last accessed, January 2008), http://www.cs.waikato.ac.nz/ml/weka/ Dynamic Scheme for Packet Classification Using Splay Trees Nizar Ben-Neji and Adel Bouhoula Higher School of Communications of Tunis (Sup’Com) University November 7th at Carthage City of Communications Technologies, 2083 El Ghazala, Ariana, Tunisia nizar.bennaji@certification.tn, bouhoula@supcom.rnu.tn Abstract. Many researches are about optimizing schemes for packet classification and matching filters to increase the performance of many network devices such as firewalls and QoS routers. Most of the proposed algorithms do not process dynamically the packets and give no specific interest in the skewness of the traffic. In this paper, we conceive a set of self-adjusting tree filters by combining the scheme of binary search on prefix length with the splay tree model. Hence, we have at most 2 hash accesses per filter for consecutive values. Then, we use the splaying technique to optimize the early rejection of unwanted flows, which is important for many filtering devices such as firewalls. Thus, to reject a packet, we have at most 2 hash accesses per filter and at least only one. 
Keywords: Packet Classification, Binary Search on Prefix Length, Splay Tree, Early Rejection. 1 Introduction In the packet classification problems we wish to classify incoming packets into classes based on predefined rules. Classes are defined by rules composed of multiple header fields, mainly source and destination IP addresses, source and destination port numbers, and a protocol type. On one hand, packet classifiers must be constantly optimized to cope with the network traffic demands. On the other hand, few of proposed algorithms process dynamically the packets and the lack of dynamic packet filtering solutions has been the motivation for this research. Our study shows that the use of a dynamic data structure is the best solution to take into consideration the skewness in the traffic distribution. In this case, in order to achieve this goal, we adapt the splay tree data structure to the binary search on prefix length algorithm. Hence, we have conceived a set of dynamic filters for each packet header-field to minimize the average matching time. On the other hand, discarded packets represent the most important part of the traffic treated then reject by a firewall. So, those packets might cause more harm than others if they are rejected by the default-deny rule as they traverse a long matching path. Therefore, we use the technique of splaying to reject the maximum number of unwanted packets as early as possible. This paper is organized as follows. In Section 2 we describe the previously published related work. In Section 3 we present the proposed techniques used to perform E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 211–218, 2009. springerlink.com © Springer-Verlag Berlin Heidelberg 2009 212 N. Ben-Neji and A. Bouhoula the binary search on prefix length algorithm. In Section 4 we illustrate theoretical analysis of the proposed work. At the end, in Section 5 we present the conclusion and our plans for future work. 2 Previous Work Since our proposed work in this paper applies binary search on prefix length with splay trees, we describe the binary search on prefix length algorithm in detail, then we present a previous dynamic packet classification technique using splay trees called Splay Tree based Packet Classification (ST-PC). After that, we present an early rejection technique for maximizing the rejection of unwanted packets. Table 1. Example Rule Set Rule no. R1 R2 R3 R4 R5 R6 R7 R8 R9 Src Prefix 01001* 01001* 010* 0001* 1011* 1011* 1010* 110* * Dst Prefix 000111* 00001* 000* 0011* 11010* 110000* 110* 1010* * Src Port * * * * * * * * * Dst Port 80 80 443 443 80 80 443 443 * Proto. TCP TCP TCP TCP UDP UDP UDP UDP * 2.1 Binary Search on Prefix Length Waldvogel et al. [1] have proposed the IP lookup scheme based on binary search on prefix length Fig.1. Their scheme performs a binary search on hash tables that are organized by prefix length. Each hash table in their scheme contains prefixes of the same length together with markers for longer-length prefixes. In that case, IP Lookup can be done with O(log(Ldis)) hash table searches, where Ldis is the number of distinct prefix lengths and Ldis<W-1 where W is the maximum possible length, in bits, of a prefix in the filter table. Note that W=32 for IPv4, W=128 for IPv6. Many works were proposed to perform this scheme. For instance, Srinivasan and Varghese [3] and Kim and Sahni [4] have proposed ways to improve the performance of the binary search on lengths scheme by using prefix expansion to reduce the value of Ldis. 
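To make the lookup concrete, the following is a minimal Python sketch of the scheme described above: prefixes are grouped into hash tables by length, markers stand in for longer prefixes, and a binary search over the sorted distinct lengths performs one hash probe per step. It is an illustration under simplifying assumptions rather than Waldvogel's exact construction — markers are placed at every shorter distinct length instead of only on the binary-search path, and each marker stores the precomputed best matching prefix of its own bit string. The example reuses the destination prefixes of Table 1 (the default prefix * is handled implicitly by returning None).

```python
from typing import Dict, Optional

def build_filter(prefixes):
    """Build one hash table per distinct prefix length. Each entry maps a
    bit string to the best matching real prefix (bmp) of that string: a
    real prefix maps to itself, a marker to the longest real prefix of its
    own string (or None if it has none)."""
    lengths = sorted({len(p) for p in prefixes})
    tables: Dict[int, Dict[str, Optional[str]]] = {l: {} for l in lengths}

    def bmp(bits):
        cands = [p for p in prefixes if bits.startswith(p)]
        return max(cands, key=len) if cands else None

    for p in prefixes:                      # real prefixes
        tables[len(p)][p] = p
    for p in prefixes:                      # markers for longer prefixes
        for l in lengths:
            if l < len(p):
                tables[l].setdefault(p[:l], bmp(p[:l]))
    return lengths, tables

def longest_match(lengths, tables, addr_bits):
    """Binary search on the sorted distinct prefix lengths; one hash probe
    per step. A hit (prefix or marker) sends the search to longer lengths,
    a miss to shorter ones."""
    best, lo, hi = None, 0, len(lengths) - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        key = addr_bits[:lengths[mid]]
        if key in tables[lengths[mid]]:
            if tables[lengths[mid]][key] is not None:
                best = tables[lengths[mid]][key]   # best real prefix so far
            lo = mid + 1
        else:
            hi = mid - 1
    return best

# Destination prefixes of Table 1 (without the default prefix "*").
dst = ["000111", "00001", "000", "0011", "11010", "110000", "110", "1010"]
L, T = build_filter(dst)
print(longest_match(L, T, "00011101"))      # -> '000111'
```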
On the other hand, an asymmetric binary search tree was proposed to reduce the average number of hash computations. This tree basically inserts values of higher occurrence probability (matching frequency) at higher tree levels then the values of less probability. In fact, we have to rebuild periodically the search tree based on the traffic characteristics. Also, a rope search algorithm was proposed to reduce the average number of hash computations but it increases the rebuild time of the search tree because it use precomputation to accomplish this goal. That is why we have O(NW) time complexity when we rebuild the tree after a rule insertion or deletion according to the policy. Dynamic Scheme for Packet Classification Using Splay Trees 213 Fig. 1. This shows a binary tree for the destination prefix field of Table 1 and the access order performing the binary search on prefix lengths proposed by Waldvogel [1]. (M: Marker) 2.2 Splay Tree Packet Classification Technique (ST-PC) The idea of the Splay Tree Packet Classification Technique [2] is to convert the set of prefixes into integer ranges as shown in Table 2 then we put all the lower and upper values of the ranges into a splay tree data structure. On the other hand, we need to store in each node of the data structure all the matching rules as shown in Fig.2. The same procedure is repeated for the filters of the other packet's header fields. Finally, all the splay trees are to be linked to each other to determine the corresponding action of the incoming packets. When we look for the list of the matching rules, we first convert the values of the packet’s header fields to integers then we find separately the matching rules for each field and finally we select a final set of matching rules for the incoming packet. Each newly accessed node has to become at root by the property of a splay tree (Fig.2). Table 2. Destination Prefix Conversion Rule no. R1 R2 R3 R4 R5 R6 R7 R8 R9 Dst Prefix 000111* 00001* 000* 0011* 11010* 110000* 110* 1010* * Lower Bound 00011100 00001000 00000000 00110000 11010000 11000000 11000000 10100000 00000000 Upper Bound 00011111 00001111 00011111 00111111 11010111 11000011 11011111 10101111 11111111 Start 28 8 0 48 208 192 192 160 0 Finish 31 15 31 63 215 195 223 175 255 2.3 Early Rejection Rules The technique of early rejection rules was proposed by Adel El-Atawy et al. [5], using an approximation algorithm that analyzes the firewall policy in order to construct a set of early rejection rules. This set can reject the maximum number of unwanted packets as early as possible. Unfortunately, the construction of the optimal set of rejection 214 N. Ben-Neji and A. Bouhoula Fig. 2. The figure (a) shows the destination splay tree constructed as shown in Table 2 with the corresponding matching rules. The newly accessed node becomes the root as shown in (b). rules is an NP-complete problem and adding them may increase the size of matching rules list. Early rejection works periodically by building a list of most frequently hit rejection rules. And then, starts comparing the incoming packets to that list prior to trying normal packet filters. 3 Proposed Filter Our basic scheme consists of a set of self-adjusting filters (SA-BSPL: Self-Adjusting Binary Search on Prefix Length). Our proposed filter can be applied to filter every packet's header field. Accordingly, it can easily assure exact matching for protocol field, prefix matching for IP addresses, and range matching for port numbers. 
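Since the per-field filters must support prefix matching for addresses and range matching for ports, the range view of a prefix used by ST-PC (Table 2) is worth making explicit. The sketch below is illustrative only and assumes 8-bit values, as in the Table 2 example.

```python
def prefix_to_range(prefix, width=8):
    """Convert a bit-string prefix such as "000111*" into the inclusive
    integer range it covers, as in Table 2 ('*' alone is the default prefix)."""
    bits = prefix.rstrip("*")
    free = width - len(bits)
    low = (int(bits, 2) << free) if bits else 0
    high = low + (1 << free) - 1
    return low, high

# Reproduces the Table 2 bounds for R1 and for the default prefix R9.
print(prefix_to_range("000111*"))  # (28, 31)
print(prefix_to_range("*"))        # (0, 255)
```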
3.1 Splay Tree Filters Our work performs the scheme of Waldvogel et al [1] by conceiving splay tree filters to ameliorate the global average search time. A filter consists of a collection of hash tables and a splay tree with no need to represent the default prefix (with zero length). Prefixes are grouped by lengths in hash tables (Fig.3). Also, each hash table is augmented with markers for longer length prefixes (Fig.1 shows an example). We still use binary search on prefix length but with splaying operations to push every newly accessed node to the top of the tree and because of the search step starts at root and ends at the leaves of the search tree as described in [1], we have to splay also the successor of the best length to the root.right position (Fig4-a). Consequently, the tree is adequately adjusted to have at most 2 hash accesses for all repeated values. Fig. 3. This figure shows the collection of hash tables of the destination address filter according to the example in Table 1. We denote by M the marker and by P the prefix. Dynamic Scheme for Packet Classification Using Splay Trees 215 Fig. 4. This figure shows the operations of splaying to early accept the repeated packets (a) and to early reject the unwanted ones (b). Note that x+ is the successor of x, and M is the minimum item in the tree. Fig. 5. This figure shows the alternative operations of splaying when x+ appears after x in the search path (a) or before x (b) Therefore, we start the search from the root node and if we get a match, we update the best length value and the list of matching rules then we go for higher lengths and if nothing is matched, we go for lower lengths (Fig 6). We stop the search process if a leaf is met. After that, the best length value and its successor have to be splayed to the top of the tree (Fig.4-a). We can go faster since we use the top-down splay tree model because we are able to combine searching and splaying steps together. We have also conceived an alternative implementation of splaying that have slightly better amortized time bound. We look for x and x+, then we splay x or x+ until we obtain x+ as a right child of x. Then, these two nodes have to be splayed to the top of the tree, as a single node (Fig.5). Hence, we have a better amortized cost than given in Fig.4-a. 3.2 Early Rejection Technique In this section, we focus on optimizing matching of traffic discarded due to the default-deny rule because it has more profound effect on the performance of the firewalls. In our case, we have no need to represent the default prefix (with zero length) so if a packet don’t match any length it will be automatically rejected by the Min-node. Generally, rejected packets might traverse long decision path of rule matching before they are finally rejected by the default-deny rule. The left child of the Min-node is Null, hence if a packet doesn’t match the Min-node we go to its left child which is Null, so it means that this node is the end of the search path. 216 N. Ben-Neji and A. Bouhoula Fig. 6. The algorithm of binary search on prefix length combined with splaying operations In our case, in each filter, we have to traverse the entire tree until we arrive to the node with the minimum value. Subsequently, a packet might traverse a long path in the search tree before it is rejected by the Min-node. Hence, we have to rotate always the Min-node in the upper levels of the self-adjusting tree. We have to splay the Minnode to the root.left position as shown in (Fig.4-b). 
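The search loop of Fig. 6 can be sketched as follows; this is a simplified illustration under our own assumptions, where `tables` maps each prefix length to its hash table as in Fig. 3, the nodes form a binary search tree over the distinct prefix lengths, and `splay` stands for an available top-down splay routine, rather than the authors' implementation.

```python
# Simplified sketch of the search of Fig. 6 (illustration only).

class Node:
    def __init__(self, length, left=None, right=None):
        self.length, self.left, self.right = length, left, right

def search_filter(root, addr_bits, tables, splay):
    node, best = root, None
    while node is not None:                       # stop once a leaf has been passed
        if addr_bits[:node.length] in tables[node.length]:
            best = node.length                    # match or marker: try longer lengths
            node = node.right
        else:
            node = node.left                      # miss: try shorter lengths
    if best is not None:
        root = splay(root, best)                  # best length to the root; its successor
                                                  # goes to root.right (Fig. 4-a), while
                                                  # early rejection keeps the minimum
                                                  # length at root.left (Fig. 4-b)
    return best, root
```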
In our case, we can use either bottom-up or top-down splay trees. However, the top-down splay tree is much more efficient for the early rejection technique because we are able to keep the Min-node fixed in the desired position while searching for the best matching value.

4 Complexity Analysis

In this section, we first give the complexity analysis of the operations used to adapt the splay tree data structure to the binary search on length algorithm, and we also give the per-filter cost of our early rejection technique.

4.1 Amortized Analysis

On an n-node splay tree, all the standard search tree operations have an amortized time bound of O(log n) per operation. According to the analysis of Sleator and Tarjan [6], if each item of the splay tree is given a weight $w_x$, with $W_t$ denoting the sum of the weights in the tree $t$, then the amortized cost to access an item $x$ has the following upper bounds (let $x^+$ denote the item following $x$ in the tree $t$ and $x^-$ the item preceding $x$):

$3\log(W_t/w_x) + O(1)$, if $x$ is in the tree $t$ . (0)

$3\log\bigl(W_t/\min(w_{x^-}, w_{x^+})\bigr) + O(1)$, if $x$ is not in the tree $t$ . (1)

Dynamic Scheme for Packet Classification Using Splay Trees 217 In our case, we have to rotate the best length value to the root position and its successor to the root.right position (Fig. 4-a). These two operations have a logarithmic amortized complexity, and their time cost is calculated using (0) and (1):

$3\log(W_t/w_x) + 3\log\bigl((W_t - w_x)/w_{x^+}\bigr) + O(1)$ . (2)

With the alternative optimized implementation of splaying (Fig. 5) we obtain a slightly better amortized time bound than the one given in (2):

$3\log\bigl((W_t - w_x)/\min(w_x, w_{x^+})\bigr) + O(1)$ . (3)

On the other hand, the cost of the early rejection step is $3\log\bigl((W_t - w_x)/w_{\min}\bigr) + O(1)$, where $w_{\min}$ is the weight of the Min-node. With the proposed early rejection technique, we have at most 2 hash accesses before we reject a packet, and at least one. If we have m values to be rejected by the default-deny rule, the search time will be at most m+1 hash accesses per filter and at least m hash accesses.

4.2 Number of Nodes

The time complexity is related to the number of nodes. For the ST-PC scheme, in the worst case, if we assume that all the prefixes are distinct, then we have at most 2W nodes per tree, where W is the length of the longest prefix. Besides, if all bounds are distinct, then we have 2r nodes, where r is the number of rules. So, the actual number of nodes in ST-PC in the worst case is the minimum of these two values. Our self-adjusting tree structure is based on binary search on hash tables organized by prefix length. Hence, the number of nodes in our case is equal to the number of hash tables. If we denote by W the length of the longest prefix, we have at most W nodes. Since the number of nodes in our self-adjusting tree is smaller than in ST-PC, especially for a large number of rules (Fig. 7), our scheme is much more competitive in terms of time and scalability.

Fig. 7. (a) This figure shows the distribution of the number of nodes with respect to W and r, where W is the maximum possible length in bits and r the number of rules 218 N. Ben-Neji and A. Bouhoula

5 Conclusion

The packet classification optimization problem has received the attention of the research community for many years. Nevertheless, there is a manifest need for new innovative directions to enable filtering devices such as firewalls to keep up with high-speed networking demands.
In our work, we have suggested a dynamic scheme, based on using collection of hash tables and splay trees for filtering each header field. We have also proved that our scheme is suitable to take advantage of locality in the incoming requests. Locality in this context is a tendency to look for the same element multiple times. Finally, we have reached this goal with a logarithmic time cost, and in our future works, we wish to optimize data storage and the other performance aspects. References 1. Waldvogel, M., Varghese, G., Turner, J., Plattner, B.: Scalable High-Speed IP Routing Lookups. ACM SIGCOMM Comput. Commu. Review. 27(4), 25–36 (1997) 2. Srinivasan, T., Nivedita, M., Mahadevan, V.: Efficient Packet Classification Using Splay Tree Models. IJCSNS International Journal of Computer Science and Network Security 6(5), 28–35 (2006) 3. Srinivasan, V., Varghese, G.: Fast Address Lookups using Controlled Prefix Expansion. ACM Trans. Comput. Syst. 17(1), 1–40 (1999) 4. Kim, K., Sahni, S.: IP Lookup by Binary Search on Prefix Length. J. Interconnect. Netw. 3, 105–128 (2002) 5. Hamed, H., El-Atawy, A., Al-Shaer, E.: Adaptive Statistical Optimization Techniques for Firewall Packet Filtering. In: IEEE INFOCOM, Barcelona, Spain, pp. 1–12 (2006) 6. Sleator, D., Tarjan, R.: Self Adjusting Binary Search Trees. Journal of the ACM 32, 652– 686 (1985) A Novel Algorithm for Freeing Network from Points of Failure Rahul Gupta and Suneeta Agarwal Department of Computer Science and Engineering, Motilal Nehru National Institute of Technology, Allahabad, India rahulgupta_mnnit@yahoo.co.in, suneeta@mnnit.ac.in Abstract. A network design may have many points of failure, the failure of any of which breaks up the network into two or more parts, thereby disrupting the communication between the nodes. This paper presents a heuristic for making an existing network more reliable by adding new communication links between certain nodes. The algorithm ensures the absence of any point of failure after addition of addition of minimal number of communication links determined by the algorithm. The paper further presents theoretical proofs and results which prove the minimality of the number of new links added in the network. Keywords: Points of Failure, Network Management, Safe Network Component, Connected Network, Reliable Network. 1 Introduction A network consists of number of interconnected nodes communicating among each other through communication channels between them. A wired communication link between two nodes is more reliable [1]. Various topology designs have been proposed for various network protocols and applications [2][3] such as bus topology, star topology, ring topology and mesh topology. All these network designs leave certain nodes as failure points [4][5][6]. These nodes become very important and must remain working all the time. If one of these nodes is down for any reason, it breaks the network into segments and the communication among the nodes in different segments is disrupted. Hence these nodes make the network unreliable. In this paper, we have designed a heuristic which has the capability to handle a single failure of node. The algorithm adds minimal number of new communication links between the nodes so that a single node failure does not disrupt communication among communicating nodes. 2 Basic Outline The various network designs common in use are ring topology, star topology, mesh topology, bus topology [1][4][5]. All these topology designs have their own advantages and disadvantages. 
Ring topology does not contain any points of failure. Bus topology on the other hand, has many points of failure. Star topology contains a single E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 219–226, 2009. © Springer-Verlag Berlin Heidelberg 2009 springerlink.com 220 R. Gupta and S. Agarwal point of failure, the failure of which disrupts communication between any pair of communicating nodes. Points marked P in figure 2 are the points of failure in the network design. Star topology and bus topology are least reliable from the point of view of failure of a single communication node in the network. In star topology, there is always one point of failure, the failure of which breaks the communication between all pairs of nodes and no nodes can communicate further. In a network of n nodes connected by bus topology, there are (n-2) points of failure. Ring topology is most advantageous and has no points of failure. For a reliable network, there must be no point of failure in the network design. These points of failure can be made safe by adding new communication links between nodes in the network. In this paper, we have presented an algorithm which finds the points of failure in a given network design. The paper further presents a heuristic which adds minimal number of communication links in the network to make it reliable. This ensures the removal of points of failure with least possible cost. 3 Algorithm for Making Network Reliable In this paper, we have designed an algorithm to find the points of failure in the network and an algorithm for converting these failure points into non failure points by the addition of minimal number of communication links. 3.1 New Terms Coined We have coined the following terms which aid in the algorithm development and network design understanding. N – Nodes of the Network E – Links in the Network P – Set of Points of Failure S – Set of Safe Network Components Pi – Point of Failure Si – Safe Network Component Si(a) – Safe Network Component Attached to the Failure Point ‘a’ B – Set of all Safe Network Components each having a Single Point of Failure in the Original Network Bi – A Safe Network Component having a Single Point of Failure in the Original Network |B| - Cardinality of Set B Fi – Point of Failure Corresponding to the Original Network in the member ‘Bi’ NFi – Non Failure Point corresponding to the Original Network in the member ‘Bi’ L – Set of New Communication Links Added Li – A New Communication Link C – Matrix List for the Components Reachable dfn(i) – Depth First Number of the node ‘i’ low(i) – Lowest Depth First Number of the Node Reachable from ‘i’. A Novel Algorithm for Freeing Network from Points of Failure 221 The points of failure are the nodes in the network, the failure of any of which breaks the network into isolated segments which can not have any communication among each other. A safe network component is the maximal subset of the connected nodes from the complete network which do not contain any point of failure. The safe component can handle a single failure occurring at any of its node within the subset. We have developed an algorithm which finds the minimal number of communication links to be added to the network to make the network capable of handling a single failure of any node. A safe component may have more than one point of failure in the original network. The algorithm considers the components having only a single point of failure differently. 
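Before describing the construction, the detection of the points of failure themselves can be illustrated by the textbook depth-first-search computation of dfn and low (articulation points) [7], [8]; the sketch below is our own, written under the assumption that the connected network is given as an adjacency dictionary, and it is not the authors' exact procedure.

```python
# dfn/low computation used to locate the points of failure (articulation points).

def points_of_failure(adj, start):
    dfn, low, parent, failures = {}, {}, {start: None}, set()
    counter = [0]

    def dfs(u):
        dfn[u] = low[u] = counter[0]
        counter[0] += 1
        children = 0
        for v in adj[u]:
            if v not in dfn:                      # tree edge
                parent[v] = u
                children += 1
                dfs(v)
                low[u] = min(low[u], low[v])
                # u separates v's subtree unless a back edge climbs above u
                if parent[u] is not None and low[v] >= dfn[u]:
                    failures.add(u)
            elif v != parent.get(u):              # back edge
                low[u] = min(low[u], dfn[v])
        if parent[u] is None and children >= 2:   # the DFS root is a failure point
            failures.add(u)                       # iff it has two or more children

    dfs(start)
    return failures
```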
‘B’ is the set of all safe components having only a single point of failure in the original network. ‘Fi’ is the point of failure in the original network design. ‘C’ corresponds to the matrix having the reachable components. Each Row in the matrix corresponds to the components reachable through one outgoing link from the point of failure. All the components having single point of failure and occurring on one outgoing link corresponds to the representatives in each row. The new communication links added in the network are collected in the set ‘L’. The set contains the pairs of nodes between which links must be added to make network free of points of failure. Fig. 1. (a) An example network design, (b) Safe components in the design 3.2 Algorithm for Finding Points of Failure and Safe Components To find all the points of failure in the network, we use depth first search [7][8] technique starting from any node in the network. Nodes that are reachable through more than one path become part of the safe component and the ones which are connected through only one path are vulnerable and the communication can get disrupted because of any one node in the single path of communication available for the node. The network is represented by a matrix of nodes connected to each other with edges representing the communication links. Each node of the network is numbered sequentially in the depth first search order starting from 0. This forms the dfn of each node. The unmarked nodes reachable from a node are called the child of each node and the node itself becomes the parent of those child nodes. The algorithm finds the low of each node and the points of failure in the network design and all the safe components in the network. The algorithm finds the safe components and all points of failure in the network. The starting node is a pint of failure if some unmarked nodes remain even on fully exploring any one single path from the node. 222 R. Gupta and S. Agarwal 3.3 Algorithm for Finding Points of Failure and Safe Components In this section, we describe our algorithm for the conversion of points of failure into non failure points by the addition of new communication links. The algorithm adds minimal number of new links which ensures least possible cost to make the network reliable. The algorithm is based on the concept that the safe components having more than one point of failure are necessarily connected to a safe component having only one point of failure directly or indirectly. Thus this component can become a part of larger safe component through more than one path which originates from any of the points of failure in the original network present in the component. Thus if the component having only one point of failure in the original network is made a part of larger safe component, the component having more than one point of failure is made safe itself. The algorithm finds new links to be added for making the safe component larger and larger and thus finally including all the nodes of the network making the complete network safe. When the maximal component that is safe consists of all the nodes of the network, the whole network is made safe and all points of failure are removed. The following steps are followed in order. 1. Initially the set L = ∅ is taken. 2. P, the set of points of failure is found using algorithm described in section 3.2. The algorithm also finds all safe components of the network and adds them to the set S. Each of the Si has a copy of failure point within it. 
Hence, the failure points are replicated in each component. A Novel Algorithm for Freeing Network from Points of Failure 223 3. Find the subset B of safe components having only single point of failure in the original network by using set S and set P found in step 2. Let each of these component members be B1, B2, B3,…. , Bk. These Bi`s are mutually disjoint with respect to non failure points. 4. Each of the components Bi has at least one non failure point. Any non failure point node is named as NFi and taken as the representative of the component Bi. 5. The failure point present in maximum number of safe components is chosen i.e, the node, the failure of which creates maximum number of safe components is chosen. Let it be named ‘s’. 6. Let S1(s), S2(s), S3(s),…. Sm(s) be all the safe components having the failure point ‘s’. Each of these components may have one or more points of failure corresponding to the original network. If the component has more than one point of failure, other safe components are reachable from these safe components through points of failure other than ‘s’. 7. Now we create the lists of components reachable from point of failure‘s’. For each Sj(s), j=1, 2,… m, if the component contains only one point of failure, add the representative of this component to the list as the next row element and if the component contains more than one point of failure, then the reachable safe components having only one point of failure are taken and their representatives are added to the list C. These components are found by going using depth search from this component. All the components that are reachable from the same component are considered for the same row and their corresponding representatives are added in the same row in the matrix C. The number of elements in each row of matrix C corresponds to the number of components that are reachable from the point of failure ‘s’ through that one outgoing link. It is to be noted here that the components having one point of failure only are considered for the algorithm. Now we have a row for each Sj(s), j=1, 2,… m. Thus the number of rows in matrix C is m. 8. The number of elements in each row of matrix C corresponds to the number of components that are reachable from the point of failure ‘s’ through that one outgoing link. It is to be noted that each component is represented just by a non failure point representative. Arrange the matrix rows in non decreasing order based on the size of the row i.e, on the basis of the number of elements in each row. 9. If all Ci(s) `s are of size 1, pair the only member of each row with the only member of next row. Here pairing means adding a communication link between the non failure point members acting as representatives of their corresponding components. Thus giving (k-1) new communication links to be added to the network for ‘k’ members. Add all these edges to set L, the set of all new communication links and exit from the algorithm. If the size of some Ci(s) `s is greater than 1, start with the last list Ci ( the list of the maximum size). For every k>=2, pair the kth element of this row with the (k-1) th element of the preceding row (if it exists). Here again pairing means addition of a communication link between the representative nodes. Remove these paired up elements from the lists and the lists are contracted. 10.Now if more than one element is left in the second last list, shift the last element from this list to the last list and append to the last list. 
11.If the number of non empty lists is greater than one, go to step 8 for further processing. If the size of the last and the only left row is one, pair its only member with any of the non failure points in the network and exit from the algorithm. If the 224 R. Gupta and S. Agarwal last and the only row left have only two elements left in it, then pair the two representatives and exit from the algorithm. If the size of the last and the only left row is greater than two, add the edges from set L into the network design and repeat the algorithm from step 2 on updated network design. Since in every iteration of the algorithm at least one communication link is added to set L and only finite number of edges are added, the algorithm will terminate in finite number of steps. The algorithm ensures that there are at least two paths between any pair of nodes in the network. Thus, because of multiple paths of communication between any pair of nodes, the failure of any one of the node does not effect the communication between any other pair of nodes in the network. Thus the algorithm makes the points of failure in the original network safe by adding minimal number of communication links. 4 Theoretical Results and Proofs In this section, we describe the theoretical proofs for the correctness of the algorithm and sufficiency of the number of the new communication links added. Further, the lower and upper bounds on the number of links added to the network are proved. Theorem 1. If | B | = k, i.e., there are only k safe network components having only one point of failure in the original network, then the number of new edges necessary to make all points of failure safe varies between ⎡k/2⎤ and (k-1) both inclusive. Proof: Each safe component Bi has only point of failure corresponding to the original network. Failure of this node will separate the whole component Bi from remaining part of the network. Thus, for having communication from any node of this component Bi with any other node outside of Bi, at least one extra communication link is required to be added with this component. This argument is valid for each Bi. Thus at least one extra edge is to be added from each of the component Bi. This needs at least ⎡k/2⎤ extra links to be added each being incident on a distinct pair of Bi’s. This forms the lower bound on the number of links to be added to make the points of failure safe in the network design. Fig. 2. (a) and (b) Two Sample Network Designs In figure 2(a), there are k = 6 safe components each having only one point of failure and thus requiring k/2 = 3 new links to be added to make all the points of failure safe. It is easy to see that k/2 = 3 new links are sufficient to make the network failure free. A Novel Algorithm for Freeing Network from Points of Failure 225 Now, we consider the upper bound on the number of new communication links to be added to the network. This occurs when | B | = | S | = k, i.e, when each safe components in the network contain only one point of failure. Since, there is no safe component which can become safe through more than one path. Thus all the safe components are to be considered by the algorithm. Thus, it requires the addition of (k1) new communication links to join ‘k’ safe components. Theorem 2. If the edges determined by the algorithm are added to the network, the nodes will keep on communicating even after the failure of any single node in the network. Proof: We arbitrarily take 2 nodes ‘x’ and ‘y’ from the set ‘N’ of the network. 
Now we show that ‘x’ and ‘y’ can communicate even after the failure of any single node from the network. CASE 1: If the node that fails is not a point of failure, ‘x’ and ‘y’ can continue to communicate with each other. CASE 2: If the node that fails is a point of failure and both ‘x’ and ‘y’ are in the same safe component of the network, then by the definition of safe component ‘x’ and ‘y’ can still communicate because the failure of this node has no effect on the nodes that are in the same safe component. CASE 3: If the node that fails is a point of failure and ‘x’ and ‘y’ are in different safe components and ‘x’ and ‘y’ both are members of safe components in set ‘B’. We know that the algorithm makes all members of set ‘B’ safe by using only non failure points of each component so the failure of any point of failure will not effect the communication of any node member of the safe component formed. This is because the algorithm has already created an alternate path for each of the node in any of the safe member. CASE 4: If the node that fails is a point of failure and ‘x’ and ‘y’ are in different safe components and ‘x’ is a member of component belonging to set ‘B’ and ‘y’ a member of component belonging to set ‘(S-B)’. Now we know that any node occurring in any member of set ‘(S-B)’ is connected to at least 2 points of failure in the safe component and through each of these points of failure we can reach to a member of set ‘B’. So even after deletion of any point of failure, ‘y’ will remain connected with at least one member of set B. The algorithm has already connected all the members of set ‘B’ by finding new communication links, hence ‘x’ and ‘y’ can still communicate with each other. CASE 5: If the node that fails is a point of failure and ‘x’ and ‘y’ are in different safe components and both ‘x’ and ‘y’ belong to components that are members of set ‘(S-B)’. Now each member of set ‘(S-B)’ has at least 2 points of failure. So after the failure of any one of the failure point, ‘x’ can send message to at least one component that is a member of set ‘B’. Similarly, ‘y’ can send message to at least one component that is a member of set ‘B’. Now, the algorithm has already connected all the components belonging to set ‘B’, so ‘x’ and ‘y’ can continue to communicate with each other after the failure of any one node. After the addition of links determined by the algorithm, there exist multiple paths of communication between any pair of communicating nodes. Thus, no node is dependent on just one path. 226 R. Gupta and S. Agarwal Theorem 3. The algorithm provides the minimal number of new communication links to be added to the network to make it capable of handling any single failure. Proof: The algorithm considers only the components having a single point of failure corresponding to the original network. Since | B | = k, thus it requires at least ⎡k/2⎤ new communication links to be added to pair up these k components and making them capable of handling single failure of any node in the network. Thus adding less than ⎡k/2⎤ new communication links can never result in safe network. Thus the algorithm finds minimal number of new communication links as shown by the example discussed in theorem 1. In all the steps of the algorithm, except the last, only one link is added to join 2 members of set ‘B’ and these members are not further considered for the algorithm and hence do not generate any further edge in set ‘L’. 
In the last step, when only one vertical column of x rows with each row having single member is left, then (x-1) new links are added. These members have the property that only single point of failure ‘s’ can separate these into x disjoint groups, hence addition of (x-1) links is justified. When only single row of just one element is left, this can only be made safe by joining it with any one of the non failure nodes. Hence, the algorithm adds minimal number of new communication links to make the network. 5 Conclusion and Future Research This paper described an algorithm for making points of failure safe in the network. The new communication links determined by the algorithm are minimal and guarantees to make the network capable of handling a single failure of any node. The algorithm guarantees at least two paths of communication between any pair of nodes in the network. References 1. Tanenbaum, A.S.: Computer Networks, 4th edn. Pearson Education, London (2004) 2. Pearlman, R.: Interconnections: Bridges, Routers, Switches, and Internetworking Protocols, 2nd edn. Pearson Education, London (2006) 3. Kamiyana, N.: Network Topology Design Using Data Envelopment Analysis. In: IEEE Global Telecommunications Conference (2007) 4. Dengiz, B., Altiparmak, F., Smith, A.E.: Efficient optimization of all-terminal reliable networks, using an evolutionary approach. IEEE Transactions on Reliability 46(1), 18–26 (1997) 5. Mandal, S., Saha, D., Mukherjee, R., Roy, A.: An efficient algorithm for designing optimal backbone topology for a communication network. In: International Conference on Communication Technology, vol. 1, pp. 103–106 (2003) 6. Ray, G.A., Dunsmore, J.J.: Reliability of network topologies. In: IEEE INFOCOM 1988 Networks, pp. 842–850 (1988) 7. Horowitz, E., Sahni, S., Anderson-Freed, S.: Fundamentals of Data Structures in C, 8th edn. Computer Science Press (1998) 8. Cormen, T.H., Leiserson, C.E., Rivest, R.L., Stein, C.: Introduction to Algorithms, 2nd edn. Prentice-Hall, India (2004) A Multi-biometric Verification System for the Privacy Protection of Iris Templates S. Cimato, M. Gamassi, V. Piuri, R. Sassi, and F. Scotti Dipartimento di Tecnologie dell’Informazione, Università di Milano, Via Bramante, 65 – 26013 Crema (CR), Italy {cimato,gamassi,piuri,sassi,fscotti}@dti.unimi.it Abstract. Biometric systems have been recently developed and used for authentication or identification in several scenarios, ranging from institutional purposes (border control) to commercial applications (point of sale). Two main issues are raised when such systems are applied: reliability and privacy for users. Multi-biometric systems, i.e. systems involving more than a biometric trait, increase the security of the system, but threaten users’ privacy, which are compelled to release an increased amount of sensible information. In this paper, we propose a multi-biometric system, which allows the extraction of secure identifiers and ensures that the stored information does not compromise the privacy of users’ biometrics. Furthermore, we show the practicality of our approach, by describing an effective construction, based on the combination of two iris templates and we present the resulting experimental data. 1 Introduction Nowadays, biometric systems are deployed in several commercial, institutional, and forensic applications as a tool for identification and authentication [1], [2]. 
The advantages of such systems over traditional authentication techniques, like the ones based on the possession (of a password or a token), come from the fact that identity is established on the basis of physical or behavioral characteristics of the subject taken into consideration and not on something he/she carries. In fact, biometrics cannot be lost or stolen, they are difficult to copy or reproduce, and in general they require the presence of the user when the biometric authentication procedure takes place. However, side to side with the widespread diffusion of biometrics an opposition grows towards the acceptance of the technology itself. Two main reasons might motivate such resistance: the reliability of a biometric system and the possible threatens to users’ privacy. In fact, a fault in a biometric system, due to a poor implementation or to an overestimation of its accuracy could lead to a security breach. Moreover since biometric traits are permanently associated to a person, releasing the biometric information acquired during the enrollment can be dangerous, since an impostor could reuse that information to break the biometric authentication process. For this reason, privacy agencies of many countries have ruled in favor of a legislation which limits the biometric information that can be centrally stored or carried on a personal ID. For example, templates, e.g. mathematical information derived from a fingerprint, are retained instead of the picture of the fingerprint itself. Also un-encrypted biometrics are discouraged. E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 227–234, 2009. springerlink.com © Springer-Verlag Berlin Heidelberg 2009 228 S. Cimato et al. A possible key to enhance the reliability of biometric systems might be that of simultaneously using different biometric traits. Such systems are termed in literature multi-biometric [3] and they usually rely on a combination of one of several of the followings: (i) multiple sensors, (ii) multiple acquisitions (e.g., different frames/poses of the face), (iii) multiple traits (e.g., an eye and a fingerprint), (iv) multiple instances of the same kind of trait (e.g., left eye, and right eye). As a rule of thumb, the performances of two of more biometric systems which each operate on a single trait might be enhanced when the same systems are organized in a single multimodal one. This is easy to understand if we refer to the risk of admitting an impostor: two or more different subsequent verifications are obviously more difficult to tamper with than a single one (AND configuration). But other less obvious advantages might occur. Population coverage might be increased, for example, in an OR configuration since some individuals could not have one biometric traits (illnesses, injuries, etc.). Or the global fault tolerance of the system might be enhanced in the same configuration, since, if one biometric subsystem is not working properly (e.g., a sensor problem occurred), the multimodal system can still keep working using the remaining biometric submodules. On the other hand, the usage of multimodal biometric systems has also some important drawbacks related to the higher cost of the systems, and user perception of larger invasiveness for his/her privacy. In the following, we will derive a multi-biometric authentication system which limits the threats posed to the privacy of users while still benefiting from the increase reliability of multiple biometrics. 
It was introduced in [4] and it is based on the secure sketch, a cryptographic primitive introduced by Dodis et al. in [5]. In fact, a main problem in using biometrics as cryptographic keys is their inherent variability in subsequent acquisitions. The secure sketch absorbs such variability to retrieve a fixed binary string from a set of similar biometric readings. In literature biometric authentication schemes based on secure sketches have been presented and applied to face and iris biometrics [6], [7]. Our proposal is generally applicable to a wider range of biometric traits and, compared to previous works, exploits multimodality in innovative way. In the following we describe the proposed construction and show its application to the case where two biometrics are used, the right and left iris. Iris templates are extracted from the iris images and used in the enrolment phase to generate a secure identifier, where the biometric information is protected and any malicious attempt to break the users’ privacy is prevented. 2 A Multimodal Sketch Based (MSB) Verification Scheme The MSB verification scheme we propose is composed of two basic modules: the first one (enroll module) creates an identifier (ID) for each user starting from the biometric samples. The ID can then be stored and must be provided during the verification phase. The second one, the (verification module) performs the verification process starting from the novel biometric readings and the information contained into the ID. Verification is successful if the biometric matching succeeds when comparing the novel reading with the stored biometrics, concealed into the ID. A Multi-biometric Verification System for the Privacy Protection of Iris Templates 229 2.1 Enrollment Module The general structure of the enroll module is depicted in Figure 1 in its basic configuration where the multimodality is restricted at two biometrics. The scheme can be generalized and we refer the reader to [5] for further details. First, two independent biometrics are acquired and processed with two feature extraction algorithms F1 and F2 to extract sets of biometric features. Each set of features is then collected into a template, a binary string. We refer to each template as I1 and I2. The feature extraction algorithms can be freely selected; they represent the single biometric systems which compose the multimodal one. Let us denote with ri the binary tolerable error rate of each biometric subsystem, i.e., the rate of bits in the templates which could be modified without affecting the biometric verification of the subject. The second biometric feature I2 is given as input to a pseudo random permutation block, which returns a bit string of the same length, having almost uniform distribution. δ I1 I2 Pseudo-Random Permutation Error Correction Encoding {H( ), δ} ID Hash Function H(I2) Fig. 1. The MSB Enroll Module The string is then encoded by using an error correcting code and the resulting codeword c is xored with the other biometric feature I1 to obtain δ. Given N1, the bitlength of I1, the code must be selected so that it corrects at most r1N1 single bit errors on codewords which are N1 bits long. Finally, I2 is given as input to a hash function and the digest H(I2), together with δ, and other additional information possibly needed (to invert the pseudo random permutation) are collected and published as the identifier of the enrolled person. 2.2 Verification Module Figure 2 shows the structure of the verification module. 
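Before walking through it, the essence of both modules can be condensed into the following sketch; `ecc_encode`/`ecc_decode`, `prp`/`prp_inverse` and `biometric_match` are assumed helpers standing in for the error-correcting code, the keyed pseudo-random permutation and the final matcher of Figs. 1 and 2, so this is an illustrative outline rather than the reference implementation (SHA-1 is the hash actually used later in Section 3.3).

```python
# Condensed sketch of the enroll (Fig. 1) and verification (Fig. 2) modules.
import hashlib, os

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))     # assumes equal lengths

def enroll(i1: bytes, i2: bytes, ecc_encode, prp) -> dict:
    """Build the public identifier ID = {H(I2), delta, permutation key}."""
    key = os.urandom(16)                          # key of the pseudo-random permutation
    c = ecc_encode(prp(i2, key))                  # permute I2, then encode it
    return {"h": hashlib.sha1(i2).digest(),
            "delta": xor(c, i1),                  # conceal the codeword with I1
            "key": key}

def verify(i1_new: bytes, i2_new: bytes, ident: dict,
           ecc_decode, prp_inverse, biometric_match) -> bool:
    c_corrupted = xor(i1_new, ident["delta"])     # differs from c in at most r1*N1 bits
    i2_recovered = prp_inverse(ecc_decode(c_corrupted), ident["key"])
    if hashlib.sha1(i2_recovered).digest() != ident["h"]:
        return False                              # first reading too far from enrollment
    return biometric_match(i2_recovered, i2_new)  # second, classical biometric matching
```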
Let us denote with I’1 and I’2 the biometric features freshly collected. The ID provided by the subject is split into δ, the hash H(I2) and the key needed to invert the pseudo random permutation. A corrupted version of the codeword c, concealed at enrollment, is retrieved by xoring the fresh reading I’1 with δ. Under the hypothesis that both readings I1 and I’1 belong to the same subject, the corrupted codeword c’ and c should differ for at most r1 bits. Thus the subsequent application of the error correcting decoding and of the inverse pseudo random permutation, should allow the exact reconstruction of the original reading I2. 230 S. Cimato et al. Verification SubModule ID {H( ), δ} ==? H(I2) δ I 1’ Error Correction Decoding Inverse Pseudo-Random Permutation Hash Function Biometric matching I2’ Enable Yes/No Fig. 2. The MSB Verification Module The identity of the user is verified in two steps. First a check is performed to compare the hash of the retrieved value for I2 with the value H(I2) stored into the identifier. If the check succeeds it means that the readings of the first biometric trait did not differ more than what permitted by the biometric employed. Then a second biometric matching is performed using as input the retrieved value of I2 and the fresh biometric reading I’2. The authentication is successful when also this second match is positive. 3 Experimental Data and Results 3.1 Dataset Creation The proposed scheme has been tested by using the public CASIA dataset [8]. (version 1.0) which contains seven images of the same eye obtained from 108 subjects. The images were collected by the Chinese Academy of Science waiting at least one month between two capturing stages using near infrared light for illumination (3 images during the first session and 4 for the second one). We used the first 3 images in the enroll operations, and the last 4 images in the verification phase. At the best of our knowledge, there is no public dataset containing the left and right eyes sample of each individual with the sufficient iris resolution to be effectively used in identification tests. For this reason we synthetically created a new dataset by composing two irises of different individuals taken from the CASIA dataset. Table 1 shows the details of the composition method used to create the synthetic dataset from the CASIA samples. The method we used to create the dataset can be considered as a pessimistic estimation of real conditions, since the statistical independence of the features extracted from the iris samples coming from the left and right eye of the same individual is likely to be equal or lower than the one related to the eyes coming from different individuals. In the literature it has been showed that the similarities of the iris templates A Multi-biometric Verification System for the Privacy Protection of Iris Templates 231 Table 1. 
Creation of the synthetic dataset CASIA Individual Identifier 001 002 CASIA File Name Enroll/ Validation 001_1_1.bmp 001_1_2.bmp 001_1_3.bmp 001_2_1.bmp … 001_2_4.bmp 002_1_1.bmp 002_1_2.bmp 002_1_3.bmp 002_2_1.bmp … 002_2_4.bmp Enroll Enroll Enroll Validation … Validation Enroll Enroll Enroll Validation … Validation Synthetic DB Individual Identifier 01 Notes Right eye, Enroll, Sample 1 Right eye, Enroll, Sample 2 Right eye, Enroll, Sample 3 Right eye, Validation, Sample 1 … Right eye, Validation, Sample 4 Left eye, Enroll, Sample 1 Left eye, Enroll, Sample 2 Left eye, Enroll, Sample 3 Left eye, Validation, Sample 1 … Left eye, Validation, Sample 4 coming from the left and right eyes of the same individuals are negligible when Iriscodes templates are used [9]. 3.2 Template Creation The iris templates of the left and right eyes were computed using the code presented in [10] (a completely open implementation which builds over the original ideas of Daugman [9]). The code has been used to obtain the iris codes of the right and left eye of each individual present in the synthetic database. The primary biometric template I1 has been associated to the right eye of the individual by using a 9600 bits wide iris template. As suggested in [10], the 9600 bits have been obtained by processing the iris image with a radial resolution (the number of points selected along a radial line) of 20. The author suggested for the CASIA database a matching criterion with a separation point of r1 = 0.4 (Hamming distance between two different iris templates). Using such a threshold, we independently verified that the algorithm was capable of a false match rate (FMR, the probability of an individual not enrolled being identified) and false non-match rate (FNMR, the probability of an enrolled individual not being identified by the system) of 0.028% and 9.039%, respectively using the CASIA version 1.0 database. Such rates rise to 0.204% and 16.799% respectively if the masking bits are not used. The masking bits mark bits in the iris code which should not be considered when evaluating the Hamming distance between different patterns due to reflections, eyelids and eyelashes coverage, etc. Due to security issues, we preferred to not include the masking bits of the iris code in the final templates since the distribution of zero valued bits in the masks is far from being uniform. The higher FNMR compared with the work of [10] can be explained by considering that using the adopted code failed segmentations of the pupil were reported to happen in the CASIA database in 17.4% of the cases. 3.3 Enroll and Verification Procedures The enroll procedure for the right eye has been executed according to the following steps. The three iris codes available in the enroll phase (Table 1) of each individual 232 S. Cimato et al. (D) ROC Comparison (Linear scale) (A) Right Eye System: 9600 bits 1 20 0.8 10 0.6 0 FNMR Freq. 30 0 0.2 0.4 0.6 0.8 Match score (B) Left Eye System: 1920 bits 1 0.4 0.2 Freq. 15 0 0 10 0.2 0.4 0.6 0.8 1 FMR 5 0 (E) ROC Comparison (Logartimic scale) 0 0.2 0.4 0.6 Match score (C) Proposed Scheme 0.8 0 FNMR 20 Right Eye 9600 bits Left Eye 1920 bits Proposed Scheme 10 1 30 Freq. Right Eye 9600 bits Left Eye 1920 bits Proposed Scheme -1 10 10 0 0 0.2 0.4 0.6 Match score 0.8 1 -3 10 -2 -1 10 10 0 10 FMR Fig. 3. Impostor and genuine frequency distributions of the iris templates composed by 9600 bits (A) and 1920 bits (B) using the synthetic dataset and for the proposed scheme (C and D respectively). 
The corresponding FNMR versus FMR are plotted in linear (D) and logarithmic scale (E). were evaluated for quality, in term of number of masking bits. The iris code with the highest “quality” was retained for further processing. The best of three approach was devised to avoid that segmentation errors might further jeopardize the verification stage. Then, the remaining enroll phases were performed according to the description previously made. A Reed-Solomon [9600,1920,7681]m=14 correction code has been adopted with n1 = 9600 and r1 = 0.4. In such set up, the scheme allows for up to k = 1920 bits for storing the second biometric template. If list decoding is taken into consideration the parameters should be adapted to take into account the enhanced error correcting rate of the list decoding algorithm. The former has been chosen by selecting the available left iris template with highest quality (best of three method) in the same fashion adopted for the right eye. Using this approach, a single identifier ID has been created for every individual present in the synthetic dataset. In particular, the shorter iris code was first subjected to a pseudo random permutation (we used AES in CTR mode) and then it was encoded with the RS code and then xored with the first one to obtain δ. Note that the RS codewords are 14 bits long. The unusual usage of the RS code (here we didn’t pack the bits in the iris code to form symbols, as in typical industrial application) is due to the fact that here we want to correct “at most” a certain number of error (and not “at least”). Each bit of the iris code was then inserted in a separate symbol adding random bits to complete the symbols. Finally an hash value of the second biometric template was computed to get the final ID with δ. In the implementation we used the hash function SHA-1 (Java JDK 6). A Multi-biometric Verification System for the Privacy Protection of Iris Templates 233 In the verification procedure, the left eye related portion was processed only if one of the iris codes was able to unlock the first part of the scheme. Otherwise the matching was considered as failed, and a maximum Hamming distance of 1 was associated to the failed matching value. If the first part of the scheme was successful, the recovered left eye template was matched by using a classical biometric system with the left eye template selected for the validation. The Hamming distance between the two strings is used to measure the distance between the considered templates. The best of four strategy is applied using the four left eye images available in the validation partition of the synthetic dataset. 3.4 Experimental Results for the Proposed Scheme The performances of the proposed method are strictly related to the performance of the code that constructs the iris templates. As such, a fair comparison should be done by considering as reference the performances of the original iris code system working on the same dataset. If we adopt the original iris templates of 9600 and 1920 bits by using the same enroll and verification procedure in a traditional fashion (best of three in verification, best of four in verification, no masking bits), we obtain the system behaviors described in Figure 3. The right eye system (9600 bits) has good separation between the genuine and impostor distributions and it achieves an equal error rate (ERR, the value of the threshold used for matching at which FMR equals FNMR) that can be estimated to about 0.5% on the synthetic dataset. 
The left eye system is working only with 1920 bits and achieves a worst separation between the two populations. The corresponding EER has been estimated to be equal to 9.9%. On the other hand, our multimodal scheme achieves an EER that can be estimated to be equal to 0.96%, and shows then an intermediate behavior between the ROC curves of each single biometric system based on the right or on the left eye (Figure 3). For a wide portion of the ROC curve, the proposed scheme achieves a better performance with respect to the right eye biometric system. That behavior is common for traditional multimodal systems where, very often, the multimodal system can work better than the best single biometric sub-system. The proposed scheme seem to show this interesting property and the slightly worse EER with respect to the best single biometric system (right eye, 9600 bits) is balanced by the protection of the biometric template. We may suppose that the small worsening for the EER is related to the specific code we used to compute the iris code templates and that it might be ameliorated by selecting a different code. Further experiments with enlarged datasets, different coding algorithms and error correction codes will be useful to validate the generality of the discussed results. 4 Conclusions In this work we proposed a method based on the secure sketch cryptographic primitive to provide an effective and easily deployable multimodal biometric verification system. Privacy of user templates is guaranteed by the randomization transformations which avoid any attempt to reconstruct the biometric features from the public identifier, preventing thus any abuse of biometric information. We also showed the 234 S. Cimato et al. feasibility of our approach, by constructing a biometric authentication system that combines two iris biometrics. The experiments confirm that only the owner of the biometric ID can “unlock” her/his biometric templates, once fixed proper thresholds. More complex systems, involving several biometric traits as well as traits of different kinds will be object of further investigations. Acknowledgments. The research leading to these results has received funding from the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement n° 216483. References 1. Jain, A.K., Ross, A., Pankanti, S.: Biometrics: A tool for information security. IEEE Trans. on information forensics and security 1(2), 125–143 (2006) 2. Uludag, U., Pankanti, S., Prabhakar, S., Jain, A.: Biometric cryptosystems: Issues and challenges. Proceedings of the IEEE, Special Issue on Enabling Security Technologies for Digital Rights Management 92, 948–960 (2004) 3. Snelick, R., Uludag, U., Mink, A., Indovina, M., Jain, A.K.: Large scale evaluation of multi-modal biometric authentication using state of the art systems. IEEE Trans. Pattern Analysis and Machine Intelligence 27(3), 450–455 (2005) 4. Cimato, S., Gamassi, M., Piuri, V., Sassi, R., Scotti, F.: A biometric verification system addressing privacy concerns. In: IEEE International Conference on Computational Intelligence and Security (CIS 2007), pp. 594–598 (2007) 5. Dodis, Y., Ostrovsky, R., Reyzin, L., Smith, A.: Fuzzy extractors: How to generate strong keys from biometrics and other noisy data, Cryptology Eprint Archive, Tech. Rep. 2006/235 (2006) 6. Bringer, J., Chabanne, H., Cohen, G., Kindari, B., Zemor, G.: An application of the goldwasser-micali cryptosystem to biometric authentication. In: Pieprzyk, J., Ghodosi, H., Dawson, E. (eds.) 
ACISP 2007. LNCS, vol. 4586, pp. 96–106. Springer, Heidelberg (2007) 7. Sutcu, Y., Li, Q., Memon, N.: Protecting biometric templates with sketch: Theory and practice. IEEE Trans. on Information Forensics and Security 2(3) (2007) 8. Chinese Academy of Sciences: Database of 756 greyscale eye images; Version 1.0 (2003), http://www.sinobiometrics.com/IrisDatabase.htm 9. Daugman, J.G.: High confidence visual recognition of persons by a test of statistical independence. IEEE Trans. on Pattern Analysis and Machine Intelligence 15, 1148–1161 (1993) 10. Masek, L.: Recognition of human iris patterns for biometric identification. Bachelor’s Thesis, School of Computer Science and Software Engineering, University of Western Australia (2003) Score Information Decision Fusion Using Support Vector Machine for a Correlation Filter Based Speaker Authentication System Dzati Athiar Ramli, Salina Abdul Samad, and Aini Hussain Department of Electrical, Electronic and Systems Engineering, Faculty of Engineering, University Kebangsaan Malaysia, 43600 Bangi Selangor, Malaysia dzati@vlsi.eng.ukm.my, salina@vlsi.eng.ukm.my, aini@vlsi.eng.ukm.my. Abstract. In this paper, we propose a novel decision fusion by fusing score information from multiple correlation filter outputs of a speaker authentication system. Correlation filter classifier is designed to yield a sharp peak in the correlation output for an authentic person while no peak is perceived for the imposter. By appending the scores from multiple correlation filter outputs as a feature vector, Support Vector Machine (SVM) is then executed for the decision process. In this study, cepstrumgraphic and spectrographic images are implemented as features to the system and Unconstrained Minimum Average Correlation Energy (UMACE) filters are used as classifiers. The first objective of this study is to develop a multiple score decision fusion system using SVM for speaker authentication. Secondly, the performance of the proposed system using both features are then evaluated and compared. The Digit Database is used for performance evaluation and an improvement is observed after implementing multiple score decision fusion which demonstrates the advantages of the scheme. Keywords: Correlation Filters, Decision Fusion, Support Vector Machine, Speaker Authentication. 1 Introduction Biometric speaker authentication is used to verify a person’s claimed identity. Authentication system compares the claimant’s speech with the client model during the authentication process [1]. The development of a client model database can be a complicated procedure due to voice variations. These variations occur when the condition of the vocal tract is affected by the influence of internal problems such as cold or dry mouth, and also by external problems, for example temperature and humidity. The performance of a speaker authentication system is also affected by room and line noise, changing of recording equipment and uncooperative claimants [2], [3]. Thus, the implementation of biometric systems has to correctly discriminate the biometric features from one individual to another, and at the same time, the system also needs to handle the misrepresentations in the features due to the problems stated. In order to overcome these limitations, we improve the performance of speaker authentication systems by extracting more information (samples) from the claimant and then executing fusion techniques in the decision process. E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 235–242, 2009. 
© Springer-Verlag Berlin Heidelberg 2009 springerlink.com 236 D.A. Ramli, S.A. Samad, and A. Hussain So far, many fusion techniques in the literature have been implemented in biometric systems for the purpose of enhancing system performance. These include the fusion of multiple modalities, multiple classifiers and multiple samples [4]. Teoh et al. in [5] proposed a combination of features of the face modality and the speech modality so as to improve the accuracy of biometric authentication systems. Person identification based on visual and acoustic features has also been reported by Brunelli and Falavigna in [6]. Suutala and Roning in [7] used Learning Vector Quantization (LVQ) and Multilayer Perceptron (MLP) as classifiers for footstep-profile based person identification, whereas in [8] Kittler et al. utilized Neural Networks and Hidden Markov Models (HMM) for a handwritten digit recognition task. The implementation of the multiple-sample fusion approach can be found in [4] and [9]. In general, these studies revealed that the implementation of fusion approaches in biometric systems can improve system performance significantly. This paper focuses on the fusion of score information from multiple correlation filter outputs for a correlation filter based speaker authentication system. Here, we use scores extracted from the correlation outputs by considering several samples extracted from the same modality as independent samples. The scores are concatenated to form a feature vector, and a Support Vector Machine (SVM) is then used to classify the feature vector into either the authentic or the imposter class. Correlation filters have been effectively applied in biometric systems for visual applications such as face verification and fingerprint verification, as reported in [10], [11]. Lower face verification and lip movement for person identification using correlation filters have been implemented in [12] and [13], respectively. A study of using correlation filters in speaker verification with the speech signal as features can be found in [14]. The advantages of correlation filters are shift invariance, the ability to trade off between discrimination and distortion tolerance, and having a closed-form expression.

2 Methodology

The database used in this study is obtained from the Audio-Visual Digit Database (2001) [15]. The database consists of video and corresponding audio of people reciting digits zero to nine. The video of each person is stored as a sequence of JPEG images with a resolution of 512 x 384 pixels, while the corresponding audio is provided as a monophonic, 16 bit, 32 kHz WAV file.

2.1 Spectrographic Features

A spectrogram is an image representing the time-varying spectrum of a signal. The vertical axis (y) shows frequency, the horizontal axis (x) represents time, and the pixel intensity or color represents the amount of energy (acoustic peaks) in the frequency band y at time x [16], [17]. Fig. 1 shows samples of the spectrogram of the word ‘zero’ from person 3 and person 4 obtained from the database. From the figure, it can be seen that the spectrogram image contains personal information in terms of the way the speaker utters the word, such as speed and pitch, as shown by the spectrum. Score Information Decision Fusion Using Support Vector Machine 237 Fig. 1.
Examples of the spectrogram image from person 3 and person 4 for the word ‘zero’ Comparing both figures, it can be observed that although the spectrogram image holds inter-class variations, it also comprises intra-class variations. In order to be successfully classified by correlation filters, we propose a novel feature extraction technique. The computation of the spectrogram is described below. a. Pre-emphasis task. By using a high-pass filter, the speech signal is filtered using the following equation: x (t ) = (s(t ) − 0.95) ∗ x (t − 1) (1) x ( t ) is the filtered signal, s( t ) is the input signal and t represents time. b. Framing and windowing task. A Hamming window with 20ms length and 50% overlapping is used on the signal. c. Specification of FFT length. A 256-point FFT is used and this value determines the frequencies at which the discrete-time Fourier transform is computed. d. The logarithm of energy (acoustic peak) of each frequency bin is then computed. e. Retaining the high energies. After a spectrogram image is obtained, we aim to eliminate the small blobs in the image which impose the intra-class variations. This can be achieved by retaining the high energies of the acoustic peak by setting an appropriate threshold. Here, the FFT magnitudes which are above a certain threshold are maintained, otherwise they are set to be zero. f. Morphological opening and closing. Morphological opening process is used to clear up the residue noisy spots in the image whereas morphological closing is the task used to recover the original shape of the image caused by the morphological opening process. 2.2 Cepstrumgraphic Features Linear Predictive Coding (LPC) is used for the acoustic measurements of speech signals. This parametric modeling is an approach used to match closely the resonant structure of the human vocal tract that produces the corresponding sounds [17]. The computation of the cepstrumgraphic features is described below. a. Pre-emphasis task. By using a high-pass filter, the speech signal is filtered using equation 1. b. Framing and windowing task. A Hamming window with 20ms length and 50% overlapping is used on the signal. c. Specification of FFT length. A 256-point FFT is used and this value determines the frequencies at which the discrete-time Fourier transform is computed. 238 D.A. Ramli, S.A. Samad, and A. Hussain d. Auto-correlation task. For each frame, a vector of LPC coefficients is computed from the autocorrelation vector using Durbin recursion method. The LPC-derived cepstral coefficients (cepstrum) are then derived that lead to 14 coefficients per vector. e. Resizing task. The feature vectors are then down sampled to the size of 64x64 in order to be verified by UMACE filters. 2.3 Correlation Filter Classifier Unconstrained Minimum Average Correlation Energy (UMACE) filters which evolved from Matched Filter are synthesized in the Fourier domain using a closed form solution. Several training images are used to synthesize a filter template. The designed filter is then used for cross-correlating the test image in order to determine whether the test image is from the authentic class or imposter class. In this process, the filter optimizes a criterion to produce a desired correlation output plane by minimizing the average correlation energy and at the same time maximizing the correlation output at the origin [10][11]. 
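Before the filter synthesis is detailed, the spectrographic pipeline of Sect. 2.1 (steps a-f) can be summarised in code. The sketch below is illustrative only: the numpy/scipy routines, the 75th-percentile energy threshold and the 3x3 structuring element are assumptions of the sketch rather than parameters reported above, and the pre-emphasis is written in its standard high-pass form x(t) = s(t) - 0.95 s(t-1), which appears to be the intent of Eq. (1).

```python
# Illustrative sketch of Sect. 2.1: spectrographic image with only the
# high-energy acoustic peaks retained.
import numpy as np
from scipy.ndimage import binary_opening, binary_closing

def spectrographic_image(signal, fs=32000, frame_ms=20, nfft=256,
                         keep_percentile=75.0):
    # (a) Pre-emphasis: x[t] = s[t] - 0.95 * s[t-1].
    x = np.append(signal[0], signal[1:] - 0.95 * signal[:-1])

    # (b) Framing with a 20 ms Hamming window and 50% overlap.
    frame_len = int(fs * frame_ms / 1000)
    hop = frame_len // 2
    window = np.hamming(frame_len)
    n_frames = 1 + (len(x) - frame_len) // hop
    frames = np.stack([x[i * hop:i * hop + frame_len] * window
                       for i in range(n_frames)])

    # (c)-(d) 256-point FFT and log energy of each frequency bin.
    energy = np.abs(np.fft.rfft(frames, n=nfft)) ** 2
    log_energy = np.log(energy + 1e-12).T      # rows: frequency, columns: time

    # (e) Retain only the energies above a threshold (here a percentile).
    mask = log_energy > np.percentile(log_energy, keep_percentile)

    # (f) Morphological opening to clear residual noisy spots, then closing
    #     to recover the shape of the remaining regions.
    selem = np.ones((3, 3))
    mask = binary_closing(binary_opening(mask, structure=selem), structure=selem)
    return np.where(mask, log_energy, 0.0)
```

The cepstrumgraphic features of Sect. 2.2 reuse steps (a)-(c) and replace the remaining steps with the Durbin recursion for the LPC coefficients, the derived cepstral coefficients (14 per frame) and a down-sampling of the feature matrix to 64x64.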
The optimization of UMACE filter equation can be summarized as follows, U mace = D −1m (2) D is a diagonal matrix with the average power spectrum of the training images placed along the diagonal elements while m is a column vector containing the mean of the Fourier transforms of the training images. The resulting correlation plane produce a sharp peak in the origin and the values at everywhere else are close to zero when the test image belongs to the same class of the designed filter [10][11]. Fig. 2 shows the correlation outputs when using a UMACE filter to determine the test image from the authentic class (left) and imposter class (right). 0.8 0.8 0.6 0.6 0.4 0.4 0.2 0.2 0 0 30 30 30 20 10 0 30 20 20 10 20 10 10 0 0 0 Fig. 2. Examples of the correlation plane for the test image from the authentic class (left) and imposter class (right) Peak-to-Sidelobe ratio (PSR) metric is used to measure the sharpness of the peak. The PSR is given by PSR = peak − mean σ (3) Here, the peak is the largest value of the test image yield from the correlation output. Mean and standard deviation are calculated from the 20x20 sidelobe region by excluding a 5x5 central mask [10], [11]. Score Information Decision Fusion Using Support Vector Machine 239 2.4 Support Vector Machine Support vector machine (SVM) classifier in its simplest form, linear and separable case is the optimal hyperplane that maximizes the distance of the separating hyperplane from the closest training data point called the support vectors [18], [19]. From [18], the solution of a linearly separable case is given as follows. Consider a problem of separating the set of training vectors belonging to two separate classes, {( ) ( )} D = x 1 , y1 ,... x L , y L , x ∈ ℜ n , y ∈ {− 1,−1} (4) with a hyperplane, w, x + b = 0 (5) The hyperplane that optimally separates the data is the one that minimizes φ( w ) = 1 w 2 2 (6) which is equivalent to minimizing an upper bound on VC dimension. The solution to the optimization problem (7) is given by the saddle point of the Lagrange functional (Lagrangian) φ( w , b, α) = L 1 2 w − ∑ α i ⎛⎜ y i ⎡ w , x i + b⎤ − 1⎞⎟ ⎢ ⎥⎦ ⎠ 2 i =1 ⎝ ⎣ (7) where α are the Lagrange multipliers. The Lagrangian has to be minimized with respect to w , b and maximized with respect to α ≥ 0 . Equation (7) is then transformed to its dual problem. Hence, the solution of the linearly separable case is given by, α* = arg min α L 1 L L ∑ ∑ αiα jyi y j xi , x j − ∑ αk 2 i =1 j=1 k =1 (8) with constrains, α i ≥ 0, i = 1,..., L and L ∑ α jy j = 0 j=1 (9) Subsequently, consider a SVM as a non-linear and non-separable case. Non-separable case is considered by adding an upper bound to the Lagrange multipliers and nonlinear case is considered by replacing the inner product by a kernel function. From [18], the solution of the non-linear and non-separable case is given as α* = arg min α ( ) L 1 L L ∑ ∑ α i α j yi y jK x i , x j − ∑ α k 2 i =1 j=1 k =1 (10) with constrains, 0 ≤ α i ≤ C, i = 1,..., L and L ∑ α j y j = 0 x (t ) = (s(t ) − 0.95) ∗ x (t − 1) j=1 (11) 240 D.A. Ramli, S.A. Samad, and A. Hussain Non-linear mappings (kernel functions) that can be employed are polynomials, radial basis functions and certain sigmoid functions. 3 Results and Discussion Assume that N streams of testing data are extracted from M utterances. Let s = {s1 , s 2 ,..., s N } be a pool of scores from each utterance. The proposed verification system is shown in Fig.3. a11 am1 ... Filter design 1 . . . . a1n amn ... Correlation filter Filter design n Correlation filter . 
. . . b1 FFT bn IFFT Correlation output psr1 . . . . . . . . FFT IFFT Correlation output psrn Support vector machine (polynomial kernel) (a11… am1 ) … (a1n … amn )– training data b1, b2 … bn – testing data m – number of training data n - number of groups (zero to nine) Decision Fig. 3. Verification process using spectrographic / ceptrumgraphic images For the spectrographic features, we use 250 filters which represent each word for the 25 persons. Our spectrographic image database consists of 10 groups of spectrographic images (zero to nine) of 25 persons with 46 images per group of size 32x32 pixels, thus 11500 images in total. For each filter, we used 6 training images for the synthesis of a UMACE filter. Then, 40 images are used for the testing process. These six training images were chosen based on the largest variations among the images. In the testing stage, we performed cross correlations of each corresponding word with 40 authentic images and another 40x24=960 imposter images from the other 24 persons. For the ceptrumgraphic features, we also have 250 filters which represent each word for the 25 persons. Our ceptrumgraphic image database consists of 10 groups of ceptrumgraphic images (zero to nine) of 25 persons with 43 images per group of size 64x64 pixels, thus 10750 images in total. For each filter, we used 3 training images for the synthesis of the UMACE filter and 40 images are used for the testing process. We performed cross correlations of each corresponding word with 40 authentic images and another 40x24=960 imposter images from the other 24 persons. Score Information Decision Fusion Using Support Vector Machine 241 For both cases, polynomial kernel has been employed for the decision fusion procedure using SVM. Table 1 below compares the performance of single score decision and multiple score decision fusions for both spectrographic and ceptrumgrapic features. The false accepted rate (FAR) and false rejected rate (FRR) of multiple score decision fusion are described in Table 2. Table 1. Performance of single score decision and multiple score decision fusion features spectrographic cepstrumgraphic single score 92.75% 90.67% multiple score 96.04% 95.09% Table 2. FAR and FRR percentages of multiple score decision fusion features spectrographic cepstrumgraphic FAR 3.23% 5% FRR 3.99% 4.91% 4 Conclusion The multiple score decision fusion approach using support vector machine has been developed in order to enhance the performance of a correlation filter based speaker authentication system. Spectrographic and cepstrumgraphic features, are employed as features and UMACE filters are used as classifiers in the system. By implementing the proposed decision fusion, the error due to the variation of data can be reduced hence further enhance the performance of the system. The experimental result is promising and can be an alternative method to biometric authentication systems. Acknowledgements. This research is supported by Fundamental Research Grant Scheme, Malaysian Ministry of Higher Education, FRGS UKM-KK-02-FRGS00362006 and Science Fund, Malaysian Ministry of Science, Technology and Innovation, 01-01-02-SF0374. References 1. Campbell, J.P.: Speaker Recognition: A Tutorial. Proceeding of the IEEE 85, 1437–1462 (1997) 2. Rosenberg, A.: Automatic speaker verification: A review. Proceeding of IEEE 64(4), 475– 487 (1976) 3. Reynolds, D.A.: An overview of Automatic Speaker Recognition Technology. Proceeding of IEEE on Acoustics Speech and Signal Processing 4, 4065–4072 (2002) 4. 
Poh, N., Bengio, S., Korczak, J.: A multi-sample multi-source model for biometric authentication. In: 10th IEEE on Neural Networks for Signal Processing, pp. 375–384 (2002) 5. Teoh, A., Samad, S.A., Hussein, A.: Nearest Neighborhood Classifiers in a Bimodal Biometric Verification System Fusion Decision Scheme. Journal of Research and Practice in Information Technology 36(1), 47–62 (2004) 242 D.A. Ramli, S.A. Samad, and A. Hussain 6. Brunelli, R., Falavigna, D.: Personal Identification using Multiple Cue. Proceeding of IEEE Trans. on Pattern Analysis and Machine Intelligence 17(10), 955–966 (1995) 7. Suutala, J., Roning, J.: Combining Classifier with Different Footstep Feature Sets and Multiple Samples for Person Identification. In: Proceeeding of International Conference on Acoustics, Speech and Signal Processing, pp. 357–360 (2005) 8. Kittler, J., Hatef, M., Duin, R.P.W., Matas, J.: On combining classifiers. Proceeding of IEEE Trans on Pattern Analysis and Machine Intelligence 20(3), 226–239 (1998) 9. Cheung, M.C., Mak, M.W., Kung, S.Y.: Multi-Sample Data-Dependent Fusion of Sorted Score Sequences for Biometric verification. In: IEEE Conference on Acoustics Speech and Signal Processing (ICASSP 2004), pp. 229–232 (2004) 10. Savvides, M., Vijaya Kumar, B.V.K., Khosla, P.: Face Verification using Correlation Filters. In: 3rd IEEE Automatic Identification Advanced Technologies, pp. 56–61 (2002) 11. Venkataramani, K., Vijaya Kumar, B.V.K.: Fingerprint Verification using Correlation Filters. In: System AVBPA, pp. 886–894 (2003) 12. Samad, S.A., Ramli, D.A., Hussain, A.: Lower Face Verification Centered on Lips using Correlation Filters. Information Technology Journal 6(8), 1146–1151 (2007) 13. Samad, S.A., Ramli, D.A., Hussain, A.: Person Identification using Lip Motion Sequence. In: Apolloni, B., Howlett, R.J., Jain, L. (eds.) KES 2007, Part I. LNCS (LNAI), vol. 4692, pp. 839–846. Springer, Heidelberg (2007) 14. Samad, S.A., Ramli, D.A., Hussain, A.: A Multi-Sample Single-Source Model using Spectrographic Features for Biometric Authentication. In: IEEE International Conference on Information, Communications and Signal Processing, CD ROM (2007) 15. Sanderson, C., Paliwal, K.K.: Noise Compensation in a Multi-Modal Verification System. In: Proceeding of International Conference on Acoustics, Speech and Signal Processing, pp. 157–160 (2001) 16. Spectrogram, http://cslu.cse.ogi.edu/tutordemo/spectrogramReading/spectrogram.html 17. Klevents, R.L., Rodman, R.D.: Voice Recognition: Background of Voice Recognition, London (1997) 18. Gunn, S.R.: Support Vector Machine for Classification and Regression. Technical Report, University of Southampton (2005) 19. Wan, V., Campbell, W.M.: Support Vector Machines for Speaker Verification and Identification. In: Proceeding of Neural Networks for Signal Processing, pp. 
775–784 (2000) Application of 2DPCA Based Techniques in DCT Domain for Face Recognition Messaoud Bengherabi1, Lamia Mezai1, Farid Harizi1, Abderrazak Guessoum2, and Mohamed Cheriet3 1 Centre de Développement des Technologies Avancées- Algeria Division Architecture des Systèmes et MultiMédia Cité 20 Aout, BP 11, Baba Hassen, Algiers-Algeriabengherabi@yahoo.com, l_mezai@yahoo.fr, harizihourizi@yahoo.fr 2 Université Saad Dahlab de Blida – Algeria Laboratoire Traitement de signal et d’imagerie Route De Soumaa BP 270 Blida guessouma@hotmail.com 3 École des Technologies Supérieur –Québec- CanadaLaboratoire d’Imagerie, de Vision et d’Intelligence Artificielle 1100, Rue Notre-Dame Ouest, Montréal (Québec) H3C 1K3 Canada mohamed.cheriet@gpa.etsmtl.ca Abstract. In this paper, we introduce 2DPCA, DiaPCA and DiaPCA+2DPCA in DCT domain for the aim of face recognition. The 2D DCT transform has been used as a preprocessing step, then 2DPCA, DiaPCA and DiaPCA+2DPCA are applied on the upper left corner block of the global 2D DCT transform matrix of the original images. The ORL face database is used to compare the proposed approach with the conventional ones without DCT under Four matrix similarity measures: Frobenuis, Yang, Assembled Matrix Distance (AMD) and Volume Measure (VM). The experiments show that in addition to the significant gain in both the training and testing times, the recognition rate using 2DPCA, DiaPCA and DiaPCA+2DPCA in DCT domain is generally better or at least competitive with the recognition rates obtained by applying these three 2D appearance based statistical techniques directly on the raw pixel images; especially under the VM similarity measure. Keywords: Two-Dimensional PCA (2DPCA), Diagonal PCA (DiaPCA), DiaPCA+2DPCA, face recognition, 2D Discrete Cosine Transform (2D DCT). 1 Introduction Different appearance based statistical methods for face recognition have been proposed in literature. But the most popular ones are Principal Component Analysis (PCA) [1] and Linear Discriminate Analysis (LDA) [2], which process images as 2D holistic patterns. However, a limitation of PCA and LDA is that both involve eigendecomposition, which is extremely time-consuming for high dimensional data. Recently, a new technique called two-dimensional principal component analysis 2DPCA was proposed by J. Yang et al. [3] for face recognition. Its idea is to estimate the covariance matrix based on the 2D original training image matrices, resulting in a E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 243–250, 2009. © Springer-Verlag Berlin Heidelberg 2009 springerlink.com 244 M. Bengherabi et al. covariance matrix whose size is equal to the width of images, which is quite small compared with the one used in PCA. However, the projection vectors of 2DPCA reflect only the variations between the rows of images, while discarding the variations of columns. A method called Diagonal Principal Component Analysis (DiaPCA) is proposed by D. Zhang et al. [4] to resolve this problem. DiaPCA seeks the projection vectors from diagonal face images [4] obtained from the original ones to ensure that the correlation between rows and those of columns is taken into account. An efficient 2D techniques that results from the combination of DiaPCA and 2DPCA (DiaPCA+2DPCA) is proposed also in [4]. Discrete cosine transform (DCT) has been used as a feature extraction step in various studies on face recognition. This results in a significant reduction of computational complexity and better recognition rates [5, 6]. 
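The block-selection step, detailed in Sect. 3 below, amounts to computing the global 2D DCT of a face image and keeping only its w x w upper-left corner of low-frequency coefficients. A minimal sketch is given here; the scipy DCT routine and the orthonormal DCT-II normalisation are assumptions of the sketch, not choices stated by the authors.

```python
# Keep only the low-frequency w x w corner of the global 2D DCT of an image.
import numpy as np
from scipy.fft import dct

def dct_block(image, w=16):
    # Separable 2D DCT-II: 1D DCT along the columns, then along the rows.
    coeffs = dct(dct(image, axis=0, norm='ortho'), axis=1, norm='ortho')
    # Energy compaction concentrates the most significant information in the
    # first coefficients, so the corner block serves as the reduced feature matrix.
    return coeffs[:w, :w]

# Example: an ORL-sized 112 x 92 image reduced to a 16 x 16 block
# (a block of 16 x 16 or less is reported in Sect. 4 to be sufficient).
block = dct_block(np.random.rand(112, 92), w=16)
```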
DCT provides excellent energy compaction and a number of fast algorithms exist for calculating it. In this paper, we introduce 2DPCA, DiaPCA and DiaPCA+2DPCA in DCT domain for face recognition. The DCT transform has been used as a feature extraction step, then 2DPCA, DiaPCA and DiaPCA+2DPCA are applied only on the upper left corner block of the global DCT transform matrix of the original images. Our proposed approach is tested against conventional approaches without DCT under Four matrix similarity measures: Frobenuis, Yang, Assembled Matrix Distance (AMD) and Volume Measure (VM). The rest of this paper is organized as follows. In Section 2 we give a review of 2DPCA, DiaPCA and DiaPCA+2DPCA approaches and also we review different matrix similarity measures. In section 3, we present our contribution. In section 4 we report the experimental results and highlight a possible perspective of this work. Finally, in section 5 we conclude this paper. 2 Overview of 2DPCA, DiaPCA, DiaPCA+2DPCA and Matrix Similarity Measures 2.1 Overview of 2D PCA, DiaPCA and DiaPCA+2DPCA 2.1.1 Two-Dimensional PCA Given M training face images, denoted by m×n matrices Ak (k = 1, 2… M), twodimensional PCA (2DPCA) first uses all the training images to construct the image covariance matrix G given by [3] G= 1 M ∑ (A M k −A k =1 ) (A T k −A ) (1) Where A is the mean image of all training images. Then, the projection axes of 2DPCA, Xopt=[x1… xd] can be obtained by solving the algebraic eigenvalue problem Gxi=λixi, where xi is the eigenvector corresponding to the ith largest eigenvalue of G [3]. The low dimensional feature matrix C of a test image matrix A is extracted by C = AX opt (2) In Eq.(2) the dimension of 2DPCA projector Xopt is n×d, and the dimension of 2DPCA feature matrix C is m×d. Application of 2DPCA Based Techniques in DCT Domain 245 2.1.2 Diagonal Principal Component Analysis Suppose that there are M training face images, denoted by m×n matrices Ak(k = 1, 2, …, M). For each training face image Ak, we calculate the corresponding diagonal face image Bk as it is defined in [4]. Based on these diagonal faces, diagonal covariance matrix is defined as [4]: G DIAG = Where B = 1 M 1 M ∑ (B M k −B k =1 ) (B T k −B ) (3) M ∑B k is the mean diagonal face. According to Eq. (3), the projection k =1 vectors Xopt=[x1, …, xd] can be obtained by computing the d eigenvectors corresponding to the d biggest eigenvalues of GDIAG. The training faces Ak’s are projected onto Xopt, yielding m×d feature matrices. C k = Ak X opt (4) Given a test face image A, first use Eq. (4) to get the feature matrix C = AX opt , then a matrix similarity metric can be used for classification. 2.1.3 DiaPCA+2DPCA Suppose the n by d matrix X=[x1, …, xd] is the projection matrix of DiaPCA. Let Y=[y1, …, yd] the projection matrix of 2DPCA is computed as follows: When the height m is equal to the width n, Y is obtained by computing the q eigenvectors corresponding to the q biggest eigenvalues of the image covarinace matrix 1 M (A − A)T (A − A) . On the other hand, when the height m is not equal to the width M ∑ k k k =1 n, Y is obtained by computing the q eigenvectors corresponding to the q biggest ei- ∑ (A M genvalues of the alternative image covariance matrix 1 M k k =1 )( ) T − A Ak − A . Projecting training faces Aks onto X and Y together, yielding the q×d feature matrices (5) D k = Y T Ak X Given a test face image A, first use Eq. (5) to get the feature matrix D = Y T AX , then a matrix similarity metric can be used for classification. 
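As a compact illustration of Eqs. (1)-(5), the sketch below computes the 2DPCA projector and a DiaPCA+2DPCA style bilinear feature matrix. It is a sketch only: numpy is a tooling assumption, the random array stands in for the training images, and the diagonal-face construction of DiaPCA [4] is not reproduced, so the 2DPCA projector stands in for the DiaPCA one.

```python
# Sketch of 2DPCA (Eqs. (1)-(2)) and the bilinear projection of Eq. (5).
import numpy as np

def image_covariance(images):
    """images: (M, m, n) stack; returns the n x n matrix G of Eq. (1)."""
    centred = images - images.mean(axis=0)
    # G = (1/M) * sum_k (A_k - mean)^T (A_k - mean)
    return np.einsum('kmi,kmj->ij', centred, centred) / len(images)

def top_eigenvectors(G, d):
    """The d eigenvectors of the symmetric matrix G with the largest eigenvalues."""
    vals, vecs = np.linalg.eigh(G)
    return vecs[:, np.argsort(vals)[::-1][:d]]

train = np.random.rand(200, 112, 92)          # stand-in for the training images
X = top_eigenvectors(image_covariance(train), d=8)
C = train[0] @ X                              # Eq. (2): m x d feature matrix

# Non-square images: Y comes from the alternative covariance
# (1/M) * sum_k (A_k - mean)(A_k - mean)^T; then D = Y^T A X (Eq. (5)).
# In DiaPCA+2DPCA proper, X would be obtained from the diagonal face images.
centred = train - train.mean(axis=0)
G_alt = np.einsum('kim,kjm->ij', centred, centred) / len(train)
Y = top_eigenvectors(G_alt, d=8)
D = Y.T @ train[0] @ X                        # q x d feature matrix
```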
2.2 Overview of Matrix Similarity Measures An important aspect of 2D appearance based face recognition approaches is the similarity measure between matrix features used at the decision level. In our work, we have used four matrix similarity measures. 2.2.1 Frobenius Distance Given two feature matrices A = (aij)m×d and B = (bij)m×d, the Frobenius distance [7] measure is given by: ⎛ m d F ( A, B ) = ⎜⎜ ∑ ⎝ i =1 ∑ (a 12 d j =1 ij 2⎞ − bij ) ⎟⎟ ⎠ (6) 246 M. Bengherabi et al. 2.2.2 Yang Distance Measure Given two feature matrices A = (aij)m×d and B = (bij)m×d, the Yang distance [7] is given by: 12 d ⎛ m 2⎞ dY ( A, B ) = ∑ ⎜ ∑ (aij − bij ) ⎟ j =1 ⎝ i =1 ⎠ (7) 2.2.3 Assembled Matrix Distance (AMD) A new distance called assembled matrix distance (AMD) metric to calculate the distance between two feature matrices is proposed recently by Zuo et al [7]. Given two feature matrices A = (aij)m×d and B = (bij)m×d, the assembled matrix distance dAMD(A,B) is defined as follows : (1 2 ) p ⎞ ⎛ d ⎛ m 2⎞ ⎟ d AMD ( A, B ) = ⎜ ∑ ⎜ ∑ (aij − bij ) ⎟ ⎟ ⎜ j =1 ⎝ i =1 ⎠ ⎠ ⎝ 12 ( p > 0) (8) It was experimentally verified in [7] that best recognition rate can be obtained when p≤0.125 while it decrease as p increases. In our work the parameter p is set equal to 0.125. 2.2.4 Volume Measure (VM) The VM similarity measure is based on the theory of high-dimensional geometry space. The volume of an m×n matrix of rank p is given by [8] ∑det Vol A = ( I ,J )∈N 2 A IJ (9) where AIJ denotes the submatrix of A with rows I and columns J, N is the index set of p×p nonsingular submatrix of A, and if p=0, then Vol A = 0 by definition. 3 The Proposed Approach In this section, we introduce 2DPCA, DiaPCA and DiaPCA+2DPCA in DCT domain for the aim of face recognition. The DCT is a popular technique in imaging and video compression, which was first applied in image compression in 1974 by Ahmed et al [9]. Applying the DCT to an input sequence decomposes it into a weighted sum of basis cosine sequences. our methodology is based on the use of the 2D DCT as a feature extraction or preprocessing step, then 2DPCA, DiaPCA and DiaPCA+2DPCA are applied to w×w upper left block of the global 2D DCT transform matrix of the original images. In this approach, we keep only a sub-block containing the first coefficients of the 2D DCT matrix as shown in Fig.1, from the fact that, the most significant information is contained in these coefficients. 2D DCT c11 c12 … c1w . . cw1 cw2 … cww Fig. 1. Feature extraction in our approach Application of 2DPCA Based Techniques in DCT Domain 247 With this approach and inversely to what is presented in literature of DCT-based face recognition approaches, the 2D structure is kept and the dimensionality reduction is carried out. Then, the 2DPCA, DiaPCA and DiaPCA+2DPCA are applied to w×w block of 2D DCT coefficients. The training and testing block diagrams describing the proposed approach is illustrated in Fig.2. Training a lgorithm based on 92DPCA 9Dia PCA 9Dia PCA+2DPCA Block w*w of 2D DCT coefficients 2D DCT Tra ined Model Training data 2D DCT ima ge Projection of the DCT bloc of the test ima ge using the eigenvectors of 92DPCA 9Dia PCA 9Dia PCA+2DPCA Block w*w of 2D DCT coefficients 2D DCT Test ima ge 2D DCT Block Features 2D DCT ima ge Compa rison using 9Frobenius 9Yang 9AMD 9VM 2D DCT Block Fea tures Decision Fig. 2. 
Block diagram of 2DPCA, DiaPCA and DiaPCA+2DPCA in DCT domain 4 Experimental Results and Discussion In this part, we evaluate the performance of 2DPCA, DiaPCA and DiaPCA+2DPCA in DCT domain and we compare it to the original 2DPCA, DiaPCA and DiaPCA+2DPCA methods. All the experiments are carried out on a PENTUIM 4 PC with 3.2GHz CPU and 1Gbyte memory. Matlab [10] is used to carry out these experiments. The database used in this research is the ORL [11] (Olivetti Research Laboratory) face database. This database contains 400 images for 40 individuals, for each person we have 10 different images of size 112×92 pixels. For some subjects, the images captured at different times. The facial expressions and facial appearance also vary. Ten images of one person from the ORL database are shown in Fig.3. In our experiment, we have used the first five image samples per class for training and the remaining images for test. So, the total number of training samples and test samples were both 200. Herein and without DCT the size of diagonal covariance matrix is 92×92, and each feature matrix with a size of 112×p where p varies from 1 to 92. However with DCT preprocessing the dimension of these matrices depends on the w×w DCT block where w varies from 8 to 64. We have calculated the recognition rate of 2DPCA, DiaPCA, DiaPCA+2DPCA with and without DCT. In this experiment, we have investigated the effect of the matrix metric on the performance of the 2D face recognition approaches presented in section 2. We see from table 1, that the VM provides the best results whereas the Frobenius gives the worst ones, this is justified by the fact that the Frobenius metric is just the sum of the 248 M. Bengherabi et al. (a) (b) Fig. 3. Ten images of one subject in the ORL face database, (a) Training, (b) Testing Euclidean distance between two feature vectors in a feature matrix. So, this measure is not compatible with the high-dimensional geometry theory [8]. Table 1. Best recognition rates of 2DPCA, DiaPCA and DiaPCA+2DPCA without DCT Methods 2DPCA DiaPCA DiaPCA+2DPCA Frobenius 91.50 (112×8) 91.50 (112×8) 92.50 (16×10) Yang 93.00 (112×7) 92.50 (112×10) 94.00 (13×11) AMD p=0,125 95.00 (112×4) 91.50 (112×8) 93.00 (12×6) Volume Distance 95.00 (112×3) 94.00 (112×9) 96.00 (21×8) Tables 2, and Table 3 summarize the best performances under different 2D DCT block sizes and different matrix similarity measures. Table 2. 
2DPCA, DiaPCA and DiaPCA+2DPCA under different DCT block sizes using the Frobenius and Yang matrix distance Best Recognition rate (feature matrix dimension) DiaPCA+2DPCA 2DPCA DiaPCA Yang 91.50 (6×6) 93.50 (8×6) 93.50 (8×5) 92.00 (9×5) 93.00 (9×6) 95.00 (9×9) 92.00 (10×5) 94.50 (10×6) 95.50 (10×9) 92.00 (9×5) 94.00 (11×6) 95.50 (11×5) 91.50 (9×5) 94.50 (12×6) 95.50 (12×5) 92.00 (12×11) 94.50 (13×6) 95.00 (13×5) 92.00 (12×7) 94.50 (14×6) 94.50 (14×5) 2D DCT block size 8×8 9×9 10×10 11×11 12×12 13×13 14×14 91.50 (8×8) 92.00 (9×9) 91.50 (10×5) 92.00 (11×8) 92.00 (12×8) 91.50 (13×7) 92.00 (14×7) DiaPCA Frobenius 91.50 (8×6) 92.00 (9×5) 92.00 (10×5) 91.50 (11×5) 91.50 (12×10) 92.00 (13×11) 91.50 (14×7) 15×15 16×16 32×32 91.50 (15×5) 92.50 (16×10) 92.00 (32×6) 91.50 (15×5) 91.50 (16×11) 91.50 (32×6) 92.00 (13×15) 92.00 (4×10) 92.00 (11×7) 94.00 (15×9) 94.00 (16×7) 93.00 (32×6) 94.50 (15×5) 94.50 (16×5) 93.50 (32×5) 95.50 (12×5) 95.00 (12×5) 95.00 (12×5) 64×64 91.50 (64×6) 91.00 (32×6) 92.00 (14×12) 93.00 (64×7) 93.50 (64×5) 95.00 (12×5) 2DPCA DiaPCA+2DPCA 93.50 (8×5) 95.00 (9×9) 95.50 (10×9) 95.50 (11×5) 95.50 (12×5) 95.00 (11×5) 95.00 (12×5) From these four tables, we notice that in addition to the importance of matrix similarity measures, by the use of DCT we have always better performance in terms of recognition rate and this is valid for all matrix measures, we have only to choose the DCT block size and appropriate feature matrix dimension. An important remark is that a block size of 16×16 or less is sufficient to have the optimal performance. So, this results in a significant reduction in training and testing time. This significant gain Application of 2DPCA Based Techniques in DCT Domain 249 Table 3. 2DPCA, DiaPCA and DiaPCA+2DPCA under different DCT block sizes using the AMD distance and VM similarity measure on the ORL database 2D DCT block size 2DPCA 8×8 9×9 10×10 11×11 12×12 13×13 14×14 15×15 16×16 32×32 64×64 94.00 (8×4) 94.50 (9×4) 94.50 (10×4) 95.50 (11×5) 95.50 (12×5) 96.00 (13×4) 96.00 (14×4) 96.00 (15×4) 96.00 (16×4) 95.50 (32×4) 95.00 (64×4) DiaPCA AMD 95.00 (8×6) 94.50 (9×5) 95.50 (10×5) 96.00 (11×5) 96.50 (12×7) 95.50 (13×5) 95.00 (14×5) 95.00 (15×5) 95.50 (16×5) 95.00 (32×9) 94.50 (64×9) Best Recognition rate (feature matrix dimension) DiaPCA+2DPCA 2DPCA DiaPCA VM 95.00 (7×5) 96.00 (8×3) 93.50 (8×4) 94.50 (9×5) 95.00 (9×4) 95.00 (9×5) 96.00 (9×7) 95.00 (10×3) 95.00 (10×4) 94.50 (11×3) 95.50 (11×3) 96.50 (9×6) 95.50 (12×5) 96.00 (12×5) 96.50 (9×7) 95.50 (12×5) 96.00 (13×9) 96.00 (13×5) 95.50 (10×5) 95.00 (14×3) 95.50 (14×5) 96.00 (9×7) 96.00 (15×8) 96.00 (15×5) 95.50 (16×8) 96.00 (16×5) 96.50 (12×5) 96.00 (11×5) 95.00 (32×3) 95.50 (32×5) 96.00 (12×5) 95.00 (64×3) 95.00 (64×5) DiaPCA+2DPCA 93.50 (8×4) 95.00 (9×5) 95.00 (10×4) 95.50 (11×3) 96.00 (11×5) 96.50 (10×5) 96.50 (10×5) 96.50 (10×5) 96.50 (10×5) 96.50 (9×5) 96.50 (21×5) in computation is better illustrated in table 4 and table 5, which illustrate the total training and total testing time of 200 persons -in seconds - of the ORL database under 2DPCA, DiaPCA and DiaPCA+2DPCA without and with DCT, respectively. We should mention that the computation of DCT was not taken into consideration when computing the training and testing time of DCT based approaches. Table 4. Training and testing time without DCT using Frobenius matrix distance Methods Training time in sec Testing time in sec 2DPCA 5.837 (112×8) 1.294 (112×8) DiaPCA 5.886 (112×8) 2.779 (112×8) DiaPCA+2DPCA 10.99 (16×10) 0.78 (16×10) Table 5. 
Training and testing time with DCT using the Frobenius distance and the same matrixfeature dimensions as in Table2 2D DCT block size 8×8 9×9 10×10 11×11 12×12 13×13 14×14 15×15 16×16 2DPCA 0.047 0.048 0.047 0.048 0.063 0.062 0.079 0.094 0.125 Training time in sec DiaPCA DiaPCA+2DPCA 0.047 0.047 0.048 0.124 0.048 0.094 0.047 0.063 0.046 0.094 0.047 0.126 0.062 0.14 0.078 0.173 0.141 0.219 2DPCA 0.655 0.626 0.611 0.578 0.641 0.642 0.656 0.641 0.813 Testing time in sec DiaPCA DiaPCA+2DPCA 0.704 0.61 0.671 0.656 0.719 0.625 0.734 0.5 0.764 0.657 0.843 0.796 0.735 0.718 0.702 0.796 0.829 0.827 We can conclude from this experiment, that the proposed approach is very efficient in weakly constrained environments, which is the case of the ORL database. 5 Conclusion In this paper, 2DPCA, DiaPCA and DiaPCA+2PCA are introduced in DCT domain. The main advantage of the DCT transform is that it discards redundant information and it can be used as a feature extraction step. So, computational complexity is significantly reduced. The experimental results show that in addition to the significant gain in both the training and testing times, the recognition rate using 2DPCA, DiaPCA and DiaPCA+2DPCA in DCT domain is generally better or at least competitive with the 250 M. Bengherabi et al. recognition rates obtained by applying these three techniques directly on the raw pixel images; especially under the VM similarity measure. The proposed approaches will be very efficient for real time face identification applications such as telesurveillance and access control. References 1. Turk, M., Pentland, A.: “Eigenfaces for Recognition. Journal of Cognitive Neurosicence 3(1), 71–86 (1991) 2. Belhumeur, P.N., Hespanha, J.P., Kriegman, D.J.: Eigenfaces vs. fisherfaces: Recognition using class specific linear projection. IEEETrans. on Patt. Anal. and Mach. Intel. 19(7), 711–720 (1997) 3. Yang, J., Zhang, D., Frangi, A.F., Yang, J.Y.: Two-Dimensional PCA: A New Approach to Appearance- Based Face Representation and Recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence 26(1), 131–137 (2004) 4. Zhang, D., Zhou, Z.H., Chen, S.: “Diagonal Principal Component Analysis for Face Recognition. Pattern Recognition 39(1), 140–142 (2006) 5. Hafed, Z.M., Levine, M.D.: “Face recognition using the discrete cosine transform. International Journal of Computer Vision 43(3) (2001) 6. Chen, W., Er, M.J., Wu, S.: PCA and LDA in DCT domain. Pattern Recognition Letters 26(15), 2474–2482 (2005) 7. Zuo, W., Zhang, D., Wang, K.: An assembled matrix distance metric for 2DPCA-based image recognition. Pattern Recognition Letters 27(3), 210–216 (2006) 8. Meng, J., Zhang, W.: Volume measure in 2DPCA-based face recognition. Pattern Recognition Letters 28(10), 1203–1208 (2007) 9. Ahmed, N., Natarajan, T., Rao, K.: Discrete cosine transform. IEEE Trans. on Computers 23(1), 90–93 (1974) 10. Matlab, The Language of Technical Computing, Version 7 (2004), http://www.mathworks.com 11. ORL. The ORL face database at the AT&T (Olivetti) Research Laboratory (1992), http://www.uk.research.att.com/facedatabase.html Fingerprint Based Male-Female Classification Manish Verma and Suneeta Agarwal Computer Science Department, Motilal Nehru National Institute of Technology Allahabad Uttar Pradesh India manishverma649@gmail.com, suneeta@mnnit.ac.in Abstract. Male-female classification from a fingerprint is an important step in forensic science, anthropological and medical studies to reduce the efforts required for searching a person. 
The aim of this research is to establish a relationship between gender and the fingerprint using some special features such as ridge density, ridge thickness to valley thickness ratio (RTVTR) and ridge width. Ahmed Badawi et. al. showed that male-female classification can be done correctly upto 88.5% based on white lines count, RTVTR & ridge count using Neural Network as Classifier. We have used RTVTR, ridge width and ridge density for classification and SVM as classifier. We have found male-female can be correctly classified upto 91%. Keywords: gender classification, fingerprint, ridge density, ridge width, RTVTR, forensic, anthropology. 1 Introduction For over centuries, fingerprint has been used for both identification and verification because of its uniqueness. A fingerprint contains three level of information. Level 1 features contain macro details of fingerprint such as ridge flow and pattern type e.g. arch, loop, whorl etc. Level 2 features refer to the Galton characteristics or minutiae, such as ridge bifurcation or ridge termination e.g. eye, hook, bifurcation, ending etc. Level 3 features include all dimensional attributes of ridge e.g. ridge path deviation, width, shape, pores, edge contour, ridges breaks, creases, scars and other permanent details [10]. Till now little work has been done in the field of male-female fingerprint classification. In 1943, Harold Cummnins and Charles Midlo in the book “Fingerprints, Palm and Soles” first gave the relation between gender and the fingerprint. In 1968, Sarah B Holt, Charles C. Thomas in the book “The Genetics of the Dermal Ridges” gave same theory with little modification. Both state the same fact that female ridges are finer/smaller and have higher ridge density than males. Acree showed that females have higher ridge density [9]. Kralik showed that males have higher ridge width [6]. Moore also carried out a study on ridge to ridge distance and found that mean distance is more in male compared to female [7]. Dr. Sudesh Gungadin showed that a ridge count of ≤13 ridges/25 mm2 is more likely to be of males and that of ≥14 ridges/25 mm2 is likely to be of females [2]. Ahmed Badawi et. al. showed that male-female can be correctly classified upto 88.5% [1] based on white lines count, RTVTR & ridge count using Neural Network as Classifier. According to the research of E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 251–257, 2009. © Springer-Verlag Berlin Heidelberg 2009 springerlink.com 252 M. Verma and S. Agarwal Cummnins and Midlo, A typical young male has, on an average, 20.7 ridges per centimeter while a young female has 23.4 ridges per centimeter [8]. On the basis of studies made in [6], [1], [2], ridge width, RTVTR and ridge density are significant features for male-female classification. In this paper, we studied the significance of ridge width, ridge density and ridge thickness to valley thickness ratio (RTVTR) for the classification purpose. For classification we have used SVM classifier because of its significant advantage. Artificial Neural Networks (ANNs) can suffer from multiple local minima, the solution with an SVM is global and unique. Unlike ANNs, the computational complexity of SVMs does not depend on the dimensionality of the input space. ANNs use empirical risk minimization, whilst SVMs use structural risk minimization. SVMs are less prone to overfitting [13]. 2 Materials and Methods In our Male-Female classification analysis with respect to fingerprints, we extracted three features from each fingerprint. 
The features are ridge width, ridge density and RTVTR. Male & Female are classified using these features with the help of SVM classifier. 2.1 Dataset We have taken 400 fingerprints (200 Male & 200 Female) of indian origin in the age group of 18-60 years. These fingerprint are divided into two disjoint set for training and testing, each set contains 100 male and 100 female fingerprints. 2.2 Fingerprint Feature Extraction Algorithm The flowchart of the Fingerprint Feature Extraction and Classification Algorithm is shown in Fig. 1. The main steps of the algorithm are: Normalization [4] Normalization is used to standardize the intensity values of an image by adjusting the range of its grey-level values so that they lie within a desired range of values e.g. zero mean and unit standard deviation. Let I(i,j) denotes the gray-level value at pixel (i,j), M & VAR denote the estimated mean & variance of I(i,j) respectively & N(i,j) denotes the normalized gray-level value at pixel (i,j). The normalized values is defined as follows: , , , , (1) , , Where M0 and VAR0 are desired mean and variance values respectively. Image Orientation [3] Orientation of a fingerprint is estimated by the least mean square orientation estimation algorithm given by Hong et. al. Given a normalized image, N, the main steps of Fingerprint Based Male-Female Classification 253 Fig. 1. Flow chart for Fingerprint Feature Extraction and Classification Algorithm the orientation estimation are as follows: Firstly, a block of size wXw (25X25) is centred at pixel (i, j) in the normalized fingerprint image. For each pixel in this block, compute the Gaussian gradients ∂x(i, j) and ∂y(i, j), which are the gradient magnitudes in the x & y directions respectively. The local orientation of each block centered at pixel (i, j) is estimated using the following equations [11]. i, j 2∂ . ∂ , , 1 tan 2 ∂ , , , , ∂ , , (2) (3) (4) 254 M. Verma and S. Agarwal where θ(i,j) is the least square estimate of the local orientation at the block centered at pixel (i,j). Now orient the block with θ degree around the center of the block, so that the ridges of this block are in vertical direction. Fingerprint Feature Extraction In the oriented image, ridges are in vertical direction. Projection of the ridges and valleys on the horizontal line forms an almost sinusoidal shape wave with the local minima points corresponding to ridges and maxima points corresponds to valleys of the fingerprint. Ridge Width R is defined as thickness of a ridge. It is computed by counting the number of pixels between consecutive maxima points of projected image, number of 0’s between two clusters of 1’s will give ridge width e.g. 11110000001111 in above example, ridge width is 6 pixels. Valley Width V is defined as thickness of valleys. It is computed by counting the number of pixels between consecutive minima points of projected image, number of 1’s between two clusters of 0’s will give valley width e.g. 00001111111000 in above example, valley width is 7 pixels. Ridge Density is defined as number of ridges in a given block. e.g. 001111100011111011 Above string contains 3 ridges in a block. So ridge density is 3. Ridge Thickness to Valley Thickness Ratio (RTVTR) is defined as the ratio of ridge width to the valley width and is given by RTVTR = R/V. Fig. 2. Segmented image is Oriented and then projected to line from its binary transform we got ridge and valley width Example 1. Fig. 2. 
shows a segment of normalized fingerprint, which is oriented so that the ridges are in vertical direction. Then these ridges are projected on horizontal line. In the projected image, black dots show a ridge and white show a valley. Fingerprint Based Male-Female Classification 255 Classification SVM’s are used for classification and regression. SVM’s are set of related supervised learning methods. A special property of SVMs is that they simultaneously minimize the empirical classification error and maximize the geometric margin; hence they are also known as maximum margin classifiers. For a given training set of instance-label pairs (Xi, yi), i=1..N, where N is any integer showing number of training sample, Xi ∈Rn (n denotes the dimension of input space) belongs to the two separating classes labeled by yi∈{-1,1}, This classification problem is to find an optimal hyperplane WTZ + b =0 in a high dimension feature space Z by constructing a map Z=φ(X) between Rn and Z. SVM determines this hyperplane by finding W and b which satisfy ∑ , 1,2, . . , . (5) Where ξ i ≥ 0 and yi[WTφ(Xi)+b] ≥ 1- ξ i holds. Coefficient c is the given upper bound and N is the number of samples. The optimal W and b can be found by solving the dual problem of Eqn. (5), namely ∑ max ∑ φX φX ∑ . (6) 00. Where 0 ≤ αi ≤ c (i = 1,….,N) is Lagrange multiplier and it satisfies ∑ α y Let K X , X φ X φ X ) and we adopt the RBF function to map the input vectors into the high dimensional space Z. The RBF Function is given by , exp γ| | , γ 0 . (7) Where γ= 0.3125,c=512.Values of c & γ are computed by grid search [5]. The decision function of the SVM classifier is presented as . , (8) Where K(.,.) is the kernel function, which defines an inner product in higher dimenW φ X . The decision func, sion space Z and it satisfies that ∑ α y . tion sgn(φ) is the sign function and if φ≥0 then sgn(φ)=1 otherwise sgn(φ)=-1 [12]. 3 Results Our experimental result showed that if we consider any single feature for Male– Female classification then the classification rate is very low. Confusion matrix for Ridge Density (Table 1), RTVTR (Table 2) and Ridge Width (Table 3) show that their classification rate is 53, 59.5 and 68 respectively for testing set. But by taking all these features together we obtained the classification rate 91%. Five fold cross validation is used for the evaluation of the model. For testing set, Combining all these features together classification rate is 88% (Table 4). 256 M. Verma and S. Agarwal Table 1. Confusion Matrix for Male-Female classification based on Ridge Density only for Testing set Actual\Estimated Male Female Total Male 47 41 88 Female 53 59 112 Total 100 100 200 For Ridge Density the classification rate is 53% Table 2. Confusion Matrix for Male-Female classification based on RTVTR only for Testing set Actual\Estimated Male Female Total Male 30 11 41 Female 70 89 159 Total 100 100 200 For RTVTR the classification rate is 59.5% Table 3. Confusion Matrix for Male-Female classification based on Ridge Width only for Testing set Actual\Estimated Male Female Total Male 51 15 66 Female 49 85 134 Total 100 100 200 For Ridge Width the classification rate is 68% Table 4. Confusion Matrix for Male-Female classification based on combining Ridge Density, Ridge Width and RTVTR only for Testing set Actual\Estimated Male Female Total Male 86 10 96 Female 14 90 104 Total 100 100 200 For Testing set the classification rate is 88% 4 Conclusion Accuracy of our model obtained by five fold cross validation method is 91%. 
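For completeness, the feature extraction from the projected profile and the RBF-kernel SVM described in Sect. 2.2 can be sketched as follows. The sketch assumes scikit-learn and numpy, and random arrays stand in for the real feature vectors and labels; only the kernel and the grid-searched values c = 512 and gamma = 0.3125 are taken from the text. Runs of 0s in the binarized projection are treated as ridges and runs of 1s as valleys, as in the examples above.

```python
# Sketch: profile-based features (ridge density, ridge width, RTVTR) and an
# RBF-kernel SVM with the grid-searched c = 512, gamma = 0.3125.
import numpy as np
from itertools import groupby
from sklearn.svm import SVC

def profile_features(profile):
    """profile: 1D sequence of 0s (ridge) and 1s (valley) from the projection."""
    runs = [(v, len(list(g))) for v, g in groupby(profile)]
    ridge_runs = [n for v, n in runs if v == 0]
    valley_runs = [n for v, n in runs if v == 1]
    ridge_width = float(np.mean(ridge_runs)) if ridge_runs else 0.0
    valley_width = float(np.mean(valley_runs)) if valley_runs else 1.0
    ridge_density = len(ridge_runs)            # number of ridges in the block
    rtvtr = ridge_width / valley_width         # RTVTR = R / V
    return [ridge_density, ridge_width, rtvtr]

# Illustrative training call on precomputed feature vectors (one per print);
# labels follow the +1 / -1 convention of the SVM formulation above.
X_train = np.random.rand(200, 3)               # stand-in feature vectors
y_train = np.where(np.random.rand(200) > 0.5, 1, -1)
clf = SVC(kernel='rbf', C=512, gamma=0.3125).fit(X_train, y_train)
prediction = clf.predict(np.random.rand(1, 3))
```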
Our results have shown that the ridge density, RTVTR and Ridge Width gave gave 53%, 59.5% and 68% classification rates respectively. Combining all these features together, we obtained 91% classification rate. Hence, our method gave 2.5 % better result than the method given by Ahmed Badawi et al. Fingerprint Based Male-Female Classification 257 References 1. Badawi, A., Mahfouz, M., Tadross, R., Jantz, R.: Fingerprint Based Gender Classification. In: IPCV 2006, June 29 (2006) 2. Sudesh, G.: Sex Determination from Fingerprint Ridge Density. Internet Journal of Medical Update 2(2) (July-December 2007) 3. Hong, L., Wan, Y., Jain, A.K.: Fingerprint Image Enhancement: Algorithms and Performance Evaluation. IEEE Trans. Pattern Analysis and Machine Intelligence 20(8), 777–789 (1998) 4. Raymond thai, Fingerprint Image Enhancement and Minutiae Extraction (2003) 5. Hsu, C.-W., Chang, C.-C., Lin, C.-J.: A practical guide to support vector classification, http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf 6. Kralik, M., Novotny, V.: Epidermal ridge breadth: an indicator of age and sex in paleodermatoglyphics. Variability and Evolution 11, 5–30 (2003) 7. Moore, R.T.: Automatic fingerprint identification systems. In: Lee, H.C., Gaensslen, R.E. (eds.) Advances in Fingerprint Technology, p. 169. CRC Press, Boca Raton (1994) 8. Cummins, H., Midlo, C.: Fingerprints, Palms and Soles. An introduction to dermatoglyphics, p. 272. Dover Publ., New York (1961) 9. Acree M.A.: Is there a gender difference in fingerprint ridge density? Federal Bureau of Investigation, Washington, DC 20535-0001, USA 10. Jain, A.K., Chen, Y., Demerkus, M.: Pores and Ridges: High-Resolution Fingerprint Matching Using Level 3 Fetures. IEEE Transaction on Pattern Analysis and Matching Intelligence 29 (January 2007) 11. Rao, A.: A Taxonomy for Texture Description and Identification. Springer, New York (1990) 12. Ji, L., Yi, Z.: SVM-based Fingerprint Classification Using Orientation Field. In: Third International Conference on Natural Computation (ICNC 2007) (2007) 13. Support Vector Machines vs Artificial Neural Networks, http://www.svms.org/anns.html BSDT Multi-valued Coding in Discrete Spaces Petro Gopych Universal Power Systems USA-Ukraine LLC, 3 Kotsarskaya Street, Kharkiv 61012 Ukraine pmg@kharkov.com Abstract. Recent binary signal detection theory (BSDT) employs a 'replacing' binary noise (RBN). In this paper it has been demonstrated that RBN generates some related N-dimensional discrete vector spaces, transforming to each other under different network synchrony conditions and serving 2-, 3-, and 4-valued neurons. These transformations explain optimal BSDT coding/decoding rules and provide a common mathematical framework, for some competing types of signal coding in neurosciences. Results demonstrate insufficiency of almost ubiquitous binary codes and, in complex cases, the need of multi-valued ones. Keywords: neural networks, replacing binary noise, colored spaces, degenerate spaces, spikerate coding, time-rate coding, meaning, synchrony, criticality. 1 Introduction Data coding (a way of taking noise into account) is a problem whose solving depends essentially on the accepted noise model [1]. Recent binary signal detection theory (BSDT, [2-4] and references therein) employs an original replacing binary noise (RBN, see below) which is an alternative to traditional additive noise models. 
For this reason, BSDT coding has unexpected features, leading in particular to the conclusion that in some important cases almost ubiquitous binary codes are insufficient and multi-valued ones are essentially required. The BSDT defines 2N different N-dimensional vectors x with spin-like components i x = ±1, reference vector x = x0 representing the information stored in a neural network (NN), and noise vectors x = xr. Vectors x are points in a discrete N-dimensional binary vector space, N-BVS, where variables take values +1 and –1 only. As in the N-BVS additive noise is impossible, vectors x(d) in this space (damaged versions of x0) are introduced by using a 'replacing' coding rule based on the RBN, xr: ⎧ x i , if u i = 0, xi (d ) = ⎨ 0i d = ∑ u i / N , i = 1,..., N ⎩ x r , if u i = 1 (1) where ui are marks, 0 or 1. If m is the number of marks ui = 1 then d = m/N is a fraction of noise components in x(d) or a damage degree of x0, 0 ≤ d ≤ 1; q = 1 – d is a fraction of intact components of x0 in x(d) or an intensity of cue, 0 ≤ q ≤ 1. If d = m/N, the number of different x(d) is 2mCNm, CNm = N!/(N – m)!/m!; if 0 ≤ d ≤ 1, this number is ∑2mCNm = 3N (0 ≤ m ≤ N). If ui = 1 then, to obtain xi(d), the ith component of x0, xi0, is replaced by the ith component of noise, xir, otherwise xi0 remains intact (1). E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 258–265, 2009. © Springer-Verlag Berlin Heidelberg 2009 springerlink.com BSDT Multi-valued Coding in Discrete Spaces 259 2 BSDT Binary, Ternary and Quaternary Vector Spaces With respect to vectors x, vectors x(d) have an additional range of discretion because their ±1 projections have the property to be a component of a noise, xr, or a signal message, x0. To formalize this feature, we ascribe to vectors x a new range of freedom ― a 'color' (meaning) of their components. Thus, within the N-BVS (Sect. 1), we define an additional two-valued color variable (a discrete-valued non-locality) labeling the components of x. Thanks to that extension, components of x become colored (meaningful), e.g. either 'red' (noise) or 'black' (signal), and each x transforms into 2N different vectors, numerically equivalent but colored in different colors (in column 3 of Table 1, such 2N vectors are underlined). We term the space of these items an Ndimensional colored BVS, N-CBVS. As the N-CBVS comprises 2N vectors x colored in 2N ways, the total number of N-CBVS items is 2N×2N = 4N. Table 1. Two complete sets of binary N-PCBVS(x0) vectors (columns 1-5) and ternary N-TVS vectors (columns 5-8) at N = 3. m, the number of noise ('red,' shown in bold face in column 3) components of x(d); sm = 2mCNm, the number of different x(d) for a given m; 3N = ∑sm, the same for 0 ≤ m ≤ N. In column 3, all the x(d) for a given x0 (column 1) are shown; 2N two-color vectors obtained by coloring the x = x0 are here underlined. n, the number of zeros among the – components of N-TVS vectors; sn = 2N nC NN – n, the number of N-TVS vectors for a given n; N 3 = ∑sn, the same for 0 ≤ n ≤ N. In columns 3 and 7, table cells containing complete set of 2N one-color N-BVS vectors are between the shaded cells m = 3 and sm = 8, sn = 8 and n = 0. Positive, negative and zero vector components are designated as +, – and 0, respectively. –+– x0 1 m 2 0 1 2 3 –++ 0 1 2 N-PCBVS(x0) vectors 3 – + –, – + –, + + –, – + –, – – –, – + –, – + +, – + –, + + –, + – –, – – –, – + –, – + +, – – –, – – +, – + –, – + +, + + –, + + +, – + –, + + –, – + +, – – +, + – –, + + +, – – –, + – +. 
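A minimal sketch of the replacing rule (1) is given below. Numpy, the uniform random placement of the m marks and the uniform +1/-1 noise are assumptions of the sketch; the rule itself follows the definition above.

```python
# Sketch of Eq. (1): damage x0 by replacing a fraction d of its components
# with 'replacing' binary noise (RBN).
import numpy as np

def damage(x0, d, rng=np.random.default_rng()):
    """Return x(d): a copy of x0 with a fraction d of components replaced."""
    N = len(x0)
    m = int(round(d * N))                      # number of marks u_i = 1
    u = np.zeros(N, dtype=bool)
    u[rng.choice(N, size=m, replace=False)] = True
    xr = rng.choice([-1, +1], size=N)          # replacing binary noise x_r
    return np.where(u, xr, x0)                 # x_i(d) = x_r,i if u_i = 1, else x0,i

x0 = np.array([-1, +1, -1])                    # N = 3 reference vector
xd = damage(x0, d=2/3)                         # cue intensity q = 1 - d = 1/3
```

Enumerating every mask u with m ones and every noise fill reproduces the 2^m C(N,m) count of distinct x(d) per damage level and their total of 3^N quoted above.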
– + +, – + +, + + +, – + +, – – +, – + +, – + –, – + +, + + +, + – +, – – +, – + +, – – +, – + –, – – –, – + +, + + +, – + –, + + –, sm 4 1 6 12 3N 5 27 8 1 6 12 – + –, + + –, – + +, – – +, 8 + – –, + + +, – – –, + – +. Complete set of N synchronized neurons, 'dense' spike-time coding 27 3 33 sn 6 1 6 N-TVS vectors 7 n 8 3 2 0 0 0, + 0 0, 0 0 +, 0 + 0, – 0 0, 0 0 –, 0 – 0, 12 + + 0, + 0 +, 0 + +, 1 – + 0, – 0 +, 0 – +, + – 0, + 0 –, 0 + –, – – 0, – 0 –, 0 – –, 8 – + –, + + –, – + +, – – +, 0 + – –, + + +, – – –, + – +. 1 0 0 0, 3 6 + 0 0, 0 0 +, 0 + 0, 2 – 0 0, 0 0 –, 0 – 0, 12 + + 0, + 0 +, 0 + +, 1 – + 0, – 0 +, 0 – +, + – 0, + 0 –, 0 + –, – – 0, – 0 –, 0 – –, 8 – + –, + + –, – + +, – – +, 0 + – –, + + +, – – –, + – +. Complete set of N unsynchronized neurons, 'sparse' spike-rate coding 260 P. Gopych Of 4N colored vectors x(d) constituting the N-CBVS, (1) selects a fraction (subspace, subset) of them specific to particular x = x0 and consisting of 3N x(d) only. We refer to such an x0-specific subspace as an N-dimensional partial colored BVS, NPCBVS(x0). An N-PCBVS(x0) consists of particular x0 and all its possible distortions (corresponding vectors x(d) have m colored in red components, see column 3 of Table 1). As an N-BVS generating the N-CBVS supplies 2N vectors x = x0, the total number of different N-PCBVS(x0) is also 2N. 2N spaces N-PCBVS(x0) each of which consists of 3N items contain in sum 2N×3N = 6N vectors x(d) while the total amount of different x(d) is only 4N. The intersection of all the spaces N-PCBVS(x0) (sets of corresponding space points) and the unity of them are I x0∈N -BVS N -PCBVS( x0 ) = N -BVS, U N -PCBVS( x0 ) = N -CBVS. x0 ∈N -BVS (2) The first relation means that any two spaces N-PCBVS(x0) contain at least 2N common space points, constituting together the N-BVS (e.g., 'red' vectors x(d) in Table 1 column 3 rows m = 3). The second relation reflects the fact that spaces N-PCBVS(x0) are overlapped subspaces of the N-CBVS. Spaces N-PCBVS(x0) and N-CBVS consist of 3N and 4N items what is typically for N-dimensional spaces of 3- and 4-valued vectors, respectively. Of this an obvious insight arises ― to consider an N-PCBVS(x0) as a vector space 'built' for serving 3valued neurons (an N-dimensional ternary vector space, N-TVS; Table 1 columns 58) and to consider an N-CBVS as a vector space 'built' for serving 4-valued neurons (an N-dimensional quaternary vector space, N-QVS; Table 3 column 2). After accepting this idea it becomes clear that the BSDT allows an intermittent three-fold (2-, 3-, and 4-valued) description of signal data processing. 3 BSDT Degenerate Binary Vector Spaces Spaces N-PCBVS(x0) and N-CBVS are devoted to the description of vectors x(d) by means of explicit specifying the origin or 'meaning' of their components (either signal or noise). At the stage of decoding, BSDT does not differ the colors (meanings) of x(d) components and reads out their numerical values only. Consequently, for BSDT decoding algorithm, all N-PCBVS(x0) and N-CBVS items are color-free. By means of ignoring the colors, two-color x(d) are transforming into one-color x and, consequently, spaces N-CBVS and N-PCBVS(x0) are transforming, respectively, into spaces N-DBVS (N-dimensional degenerate BVS) and N-DBVS(x0) (N-dimensional degenerate BVS given x0). The N-DBVS(x0) and the N-DBVS contain respectively 3N and 4N items, though only 2N of them (related to the N-BVS) are different. As a result, N-DBVS and N-DBVS(x0) items are degenerate, i.e. 
in these spaces they may exist in some equivalent copies. We refer to the number of such copies related to a given x as its degeneracy degree, τ (1 ≤ τ ≤ 2N, τ = 1 means no degeneracy). When N-CBVS vectors x(d) lose their color identity, their one-color counterparts, x, are 'breeding' 2N times. For this reason, all N-DBVS space points have equal degeneracy degree, τ = 2N. When N-PCBVS(x0) vectors x(d) lose their color identity, their one-color counterparts, x, are 'breeding' the number of times which is specified by (1) BSDT Multi-valued Coding in Discrete Spaces 261 given x0 and coincides with the number of x(d) related to particular x in the NPCBVS(x0). Consequently, all N-DBVS(x0) items have different in general degeneracy degrees, τ(x,x0), depending on x as well as x0. As the number of different vectors x in an N-DBVS(x0) and the number of different spaces N-DBVS(x0) are the same (and equal to 2N), discrete function τ(x,x0) is a square non-zero matrix. As the number of x in an N-DBVS(x0) and the number of x(d) in corresponding N-PCBVS(x0) is 3N, the sum of matrix element values made over each row (or over each column) is also the same and equals 3N. Remembering that ∑2mCNm = 3 N (m = 0, 1, …, N), we see that in each the matrix's row or column the number of its elements, which are equal to 2m, is CNm (e.g. in Table 2, the number of 8s, 4s, 2s and 1s is 1, 3, 3 and 1, respectively). If x (columns) and x0 (rows) are ordered in the same way (as in Table 2) then matrix τ(x,x0) is symmetrical with respect to its main diagonal; if x = x0, then in the column x and the row x0 τ(x,x0)-values are equally arranged (as in the column and the row shaded in Table 2: 4, 2, 8, 4, 1, 4, 2, 2; corresponding sets of two-colored N-PCBVS(x0) vectors x(d) are shown in column 3 of Table 1). Degeneracy degree averaged across all the x given N-DBVS(x0) or across all the N-DBVS(x0) given x does not depend on x and x0: τa = <τ(x,x0)> = (3/2)N (for the example presented in Table 2, τa = (3/2)3 = 27/8 = 3.375). x x0 –+– ∑τ(x,x0) given x –+– ++– –++ ––+ +–– +++ ––– +–+ ++– –++ ––+ +–– +++ ––– +–+ ∑τ(x,x0) given x0 Table 2. Degeneracy degree, τ(x,x0), for all the vectors x in all the spaces N-DBVS(x0), N = 3. 2N = 8, the number of different x (and different x0); 3N = 27, the total number of x in an NDBVS(x0) or the number of two-colored x(d) in corresponding N-PCBVS(x0); positive and negative components of x and x0 are designated as + and –, respectively. Rows provide τ(x,x0) for all the x given N-DBVS(x0) (the row x0 = – + + is shaded); columns show τ(x,x0) for all the spaces N-DBVS(x0) given x (the column x = – + + is shaded). 8 4 4 2 2 2 4 1 4 8 2 1 4 4 2 2 4 2 8 4 1 4 2 2 2 1 4 8 2 2 4 4 2 4 1 2 8 2 4 4 2 4 4 2 2 8 1 4 4 2 2 4 4 1 8 2 1 2 2 4 4 4 2 8 27 27 27 27 27 27 27 27 27 27 27 27 27 27 27 27 3N Of the view of the set theory, the N-DBVS consists of 2N equivalent N-BVS; the intersection and the unity of them are the N-BVS itself. Each N-DBVS(x0) includes all the 2N N-BVS vectors each of which is repeated in an x0-specific number of copies, as it is illustrated by rows (columns) of Table 2 (N + 1 of these numbers are only different). 262 P. Gopych 4 BSDT Multi-valued Codes and Multi-valued Neurons For the description of a network of the size N in spaces above discussed, the BSDT defines 2-, 3- and 4-valued N-dimensional (code) vectors each of which represents a set of N firing 2-, 3- and 4-valued neurons, respectively (see Tables 1 and 3). 
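Before the code values of Table 3 are listed, the space sizes and degeneracy degrees quoted in Sects. 2 and 3 can be checked numerically. The short sketch below is an illustration added here, not part of the original text; it verifies the counts of Table 1 and the row structure of Table 2 for N = 3.

```python
# Numerical check of the N = 3 counts: 2^m * C(N, m) vectors x(d) per damage
# level m (Table 1, column 4), 3^N in total, and degeneracy degrees whose row
# sum is 3^N with mean (3/2)^N (Table 2).
from math import comb

N = 3
per_m = [2**m * comb(N, m) for m in range(N + 1)]
assert per_m == [1, 6, 12, 8] and sum(per_m) == 3**N

# Each row of tau(x, x0) contains C(N, m) entries equal to 2^m
# (for N = 3: one 8, three 4s, three 2s and one 1).
row = sorted((2**m for m in range(N + 1) for _ in range(comb(N, m))), reverse=True)
assert sum(row) == 3**N
assert sum(row) / 2**N == (3 / 2)**N           # average degeneracy 3.375
```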
23valued valued 3 –1, black –1, red +1, red +1, black –1, black/red no black/red +1, black/red –1, black/red +1, black/red 4 –1, signal –1, noise +1, noise +1, signal –1, signal/noise no signal/noise +1, signal/noise –1, signal/noise +1, signal/noise 5 inhibitory inhibitory excitatory excitatory inhibitory no spike excitatory inhibitory excitatory Space examples The target neuron's synapse Signal/noise numerical code, SNNC 2 –2 –1 +1 +2 –1 0 +1 –1 +1 Colored numerical code, CNC Numerical code, NC 1 4valued Type of neurons Table 3. Values/meanings of code vector components for BSDT neurons and codes. Shaded table cells display code items implementing the N-BVS (its literal implementation by 3- and 4valued neurons is possible under certain conditions only; parentheses in column 6 indicate this fact, see also process 6 in Fig. 1); in column 5 the content/meaning of each code item is additionally specified in neuroscience terms. 6 N-QVS, N-CBVS, N-PCBVS(x0), (N-BVS) N-TVS, (N-BVS) N-BVS, N-DBVS, N-DBVS(x0) In Table 3 numerical code (NC, column 2) is the simplest, most general, and meaning-irrelevant: its items (numerical values of code vector components) may arbitrary be interpreted (e.g., for 4-valued neurons, code values ±1 and ±2 may be treated as related to noise and signal, respectively). Colored numerical code (CNC, column 3) combines binary NC with binary color code; for 3- and 2-valued neurons, CNC vector components are ambiguously defined (they are either black or red, 'black/ red'). Signal/noise numerical code (SNNC, column 4) specifies the colors of the CNC with respect to a signal/noise problem: 'black' and 'red' units are interpreted as ones that represent respectively signal and noise components of SNNC vectors (marks 'signal' and 'noise' reflect this fact). As in the case of the CNC, for 3- and 2-valued neurons, SNNC code vector components say only that (with equal probability) they can represent either signal or noise (this fact reflects the mark 'signal/ noise'). Further code specification (by adding a neuroscience meaning to each signal, noise, or signal/noise numerical code item) is given in column 5: it is assumed [3] that vector components –1 designate signal, noise or signal/noise spikes, affecting the inhibitory synapses of target neurons, while vector components +1 designate spikes, affecting the excitatory synapses of target neurons (i.e. BSDT neurons are intimately embedded into their environment, as they always 'know' the type of synapses of their postsynaptic neurons); zero-valued components of 3-valued vectors designate 'silent' or 'dormant' BSDT Multi-valued Coding in Discrete Spaces 263 neurons generating no spikes at the moment and, consequently, not affecting on their target neurons at all (marks 'no signal/noise' and 'no spike' in columns 4 and 5 reflect this fact). BSDT spaces implementing different types of coding and dealing with different types of neurons are classified in column 6. The unity of mathematical description of spiking and silent neurons (see Table 3) explains why BSDT spin-like +1/–1 coding cannot be replaced by popular 1/0 coding. Ternary vectors with large fractions of zero components may also contribute to explaining the so-called sparse neuron codes (e.g., [5-6], Table 1 columns 5-8). 
Quaternary as well as binary vectors without zero components (and, perhaps, ternary vectors with small fractions of zero components) may contribute to explaining the so-called dense neuron codes (a reverse counterpart to sparse neuron codes, Table 1 columns 1-5). 5 Reciprocal Transformations of BSDT Vector Spaces The major physical condition explaining the diversity of BSDT vector spaces and the need of their transformations is the network's state of synchrony (Fig. 1). We understand synchrony as simultaneous (within a time window/bin ∆t ~ 10 ms) spike firing of N network neurons (cf. [7]). Unsynchronized neurons fire independently at instants t1, …, tN whose variability is much greater than ∆t. As the BSDT imposes no constraints on network neuron space distribution, they may arbitrary be arranged occupying positions even in distinct brain areas (within a population map [8]). Unsynchronized and synchronized networks are respectively described by BSDT vectors with and without zero-valued components and consequently each such an individual vector represents a pattern of network spike activity at a moment t ± ∆t. Hence, the BSDT deals with network spike patterns only (not with spike trains of individual neurons [9]) while diverse neuron wave dynamics, responsible in particular for changes in network synchrony [10], remains out of the consideration. For unsynchronized networks (box 2 in Fig. 1), spike timing can in principle not be used for coding; in this case signals of interest may be coded by the number of network spikes randomly emerged per a given time bin ― that is spike-rate or firingrate population coding [9], implementing the independent-coding hypothesis [8]. For partially ordered networks, firing of their neurons is in time to some extent correlated and such mutual correlations can already be used for spike-time coding (e.g., [1,9]) ― that is an implementation of the coordinated-coding hypothesis [8]. The case of completely synchronized networks (box 4 in Fig. 1) is an extreme case of spike-time coding describable by BSDT vectors without zero components. Time-to-first-spike coding (e.g., [11]) and phase coding (e.g., [12]) may be interpreted as particular implementations of such a consideration. Figure 1 shows also the transformations (circled dashed arrows) related to changes in network synchrony and accompanied by energy exchange (vertical arrows). BSDT optimal decoding probability (that is equal to BSDT optimal generalization degree) is the normalized number of N-DBVS(x0) vectors x for which their Hamming distance to x0 is smaller than a given threshold. Hence, the BSDT defines its coding/decoding [2,3] and generalization/codecorrection [2] rules but does not specify mechanisms implementing them: for BSDT applicability already synchronized/unsynchronized networks are required. 264 P. Gopych 1 The global network's environment Energy input 2 Energy dissipation N-TVS Unsynchronized neurons Entropy 1 'The edge of chaos' 5 N-PCBVS(x0) N-DBVS(x0) N-CBVS N-PCBVS(x0) N-DBVS(x0) 2 Synchronized neurons ... N-PCBVS(x0) 3 1 2 Complexity N-BVS N-BVS 6 3 4 N-TVS ... 4 N-DBVS(x0) N 2 Fig. 1. Transformations of BSDT vector spaces. Spaces for pools of synchronized (box 4) and unsynchronized (box 2) neurons are framed separately. 
In box 4, right-most numbers enumerate different spaces of the same type; framed numbers mark space transformation processes (arrows): 1, coloring all the components of N-BVS vectors; 2, splitting the N-CBVS into 2N different but overlapping spaces N-PCBVS(x0); 3, transformation of two-color N-PCBVS(x0) vectors x(d) into one-color N-TVS vectors (because of network desynchronization); 4, equalizing the colors of all the components of all the N-PCBVS(x0) vectors; 5, transformation of one-color NTVS vectors into two-color N-PCBVS(x0) vectors; 6, random coincident spiking of unsynchronized neurons. Vertical left/right-most arrows remind trends in entropy and complexity for a global network, containing synchronized and unsynchronized parts and open for energy exchange with the environment (box 1). Box 3 comprises rich and diverse neuron individual and collective nonlinear wave/oscillatory dynamics implementing synchrony/unsynchrony transitions (i.e. the global network is as a rule near its 'criticality'). In most behavioral and cognitive tasks, spike synchrony and neuron coherent wave/oscillatory activity are tightly entangled (e.g. [10,13]) though which of these phenomena is the prime mechanism, for dynamic temporal and space binding (synchronization) of neurons into a cell assembly, remains unknown. If gradual changes of the probability of occurring zero-valued components of ternary vectors is implied then the BSDT is consistent in general with scenarios of gradual synchrony/ unsynchrony transitions, but we are here interesting in such abrupt transitions only. This means that, for the BSDT biological relevance, it is needed to take the popular stance according to which the brain is a very large and complex nonlinear dynamic system having long-distant and reciprocal connectivity [14], being in a metastable state and running near its 'criticality' (box 3 in Fig. 1). If so, then abrupt unsynchronyto-synchrony transitions may be interpreted as the network's 'self-organization,' 'bifurcation,' or 'phase transition' (see ref. 15 for review) while abrupt synchrony decay may be considered as reverse phase transitions. This idea stems from statistical physics and offers a mechanism contributing most probably to real biological processes underlying BSDT space transformations shown in Fig. 1 (i.e. arrows crossing 'the edge of chaos' may have real biological counterparts). Task-related brain activity takes <5% of energy consumed by the resting human brain [16]; spikes are high energy-consuming and, consequently, rather seldom and important brain signals [17]. Of these follows that the BSDT, as a theory for spike computations, describes though rather small but perhaps most important ('high-level') fraction of brain activity responsible for maintaining behavior and cognition. BSDT Multi-valued Coding in Discrete Spaces 265 6 Conclusion A set of N-dimensional discrete vector spaces (some of which are 'colored' and some degenerate) has been introduced by using original BSDT coding rules. These spaces, serving 2-, 3- and 4-valued neurons, ensure optimal BSDT decoding/generalization and provide a common description of spiking and silencing neurons under different conditions of network synchrony. The BSDT is a theory for spikes/impulses and can describe though rather small but probably most important ('high-level') fraction of brain activity responsible for human/animal behavior and cognition. 
Results demonstrate also that for the description of complex living or artificial digital systems, where conditions of signal synchrony and unsynchrony coexist, binary codes are insufficient and the use of multi-valued ones is essentially required. References 1. Averbeck, B.B., Latham, P.E., Pouget, A.: Neural Correlations, Population Coding and Computation. Nat. Rev. Neurosci. 7, 358–366 (2006) 2. Gopych, P.M.: Generalization by Computation through Memory. Int. J. Inf. Theo. Appl. 13, 145–157 (2006) 3. Gopych, P.M.: Foundations of the Neural Network Assembly Memory Model. In: Shannon, S. (ed.) Leading Edge Computer Sciences, pp. 21–84. Nova Science, New York (2006) 4. Gopych, P.M.: Minimal BSDT Abstract Selectional Machines and Their Selectional and Computational Performance. In: Yin, H., Tino, P., Corchado, E., Byrne, W., Yao, X. (eds.) IDEAL 2007. LNCS, vol. 4881, pp. 198–208. Springer, Heidelberg (2007) 5. Kanerva, P.: Sparse Distributed Memory. MIT Press, Cambridge (1988) 6. Olshausen, B.A., Field, D.J.: Sparse Coding of Sensory Inputs. Curr. Opin. Neurobiol. 14, 481–487 (2004) 7. Engel, A.K., Singer, W.: Temporal Binding and the Neural Correlates of Sensory Awareness. Trends Cog. Sci. 5, 16–21 (2001) 8. de Charms, C.R., Zador, A.: Neural Representations and the Cortical Code. Ann. Rev. Neurosci. 23, 613–647 (2000) 9. Tiesinga, P., Fellous, J.-M., Sejnowski, T.J.: Regulation of Spike Timing in Visual Cortical Circuits. Nat. Rev. Neurosci. 9, 97–109 (2008) 10. Buzáki, G., Draghun, A.: Neuronal Oscillations in Cortical Networks. Science 304, 1926– 1929 (2004) 11. Johansson, R.S., Birznieks, I.: First Spikes in Ensemble of Human Tactile Afferents Code Complex Spatial Fingertip Events. Nat. Neurosci. 7, 170–177 (2004) 12. Jacobs, J., Kahana, M.J., Ekstrom, A.D., Fried, I.: Brain Oscillations Control Timing of Single-Neuron Activity in Humans. J. Neurosci. 27, 3839–3844 (2007) 13. Varela, F., Lachaux, J.-P., Rodriguez, E., Martinerie, J.: The Brainweb: Phase Synchronization and Large-Scale Integration. Nat. Rev. Neurosci. 2, 229–239 (2001) 14. Sporns, O., Chialvo, D.R., Kaiser, M., Hilgetag, C.C.: Organization, Development and Function of Complex Brain Networks. Trends Cog. Sci. 8, 418–425 (2004) 15. Werner, G.: Perspectives on the Neuroscience of Cognition and Consciousness. BioSystems 87, 82–95 (2007) 16. Fox, M.D., Raichle, M.E.: Spontaneous Fluctuations in Brain Activity Observed with Functional Magnetic Resonance Imaging. Nat. Rev. Neurosci. 8, 700–711 (2007) 17. Lennie, P.: The Cost of Cortical Computation. Curr. Biology 13, 493–497 (2003) A Fast and Distortion Tolerant Hashing for Fingerprint Image Authentication Thi Hoi Le and The Duy Bui Faculty of Information Technology Vietnam National University, Hanoi hoilt@vnu.edu.vn, duybt@vnu.edu.vn Abstract. Biometrics such as fingerprint, face, eye retina, and voice offers means of reliable personal authentication is now a widely used technology both in forensic and civilian domains. Reality, however, makes it difficult to design an accurate and fast biometric recognition due to large biometric database and complicated biometric measures. In particular, fast fingerprint indexing is one of the most challenging problems faced in fingerprint authentication system. In this paper, we present a specific contribution to advance the state of the art in this field by introducing a new robust indexing scheme that is able to fasten the fingerprint recognition process. 
Keywords: fingerprint hashing, fingerprint authentication, error correcting code, image authentication. 1 Introduction With the development of digital world, reliable personal authentication has become a big interest in human computer interface activity. National ID card, electronic commerce, and access to computer networks are some scenarios where declaration of a person’s identity is crucial. Existing security measures rely on knowledge-based approaches like passwords or token-based such as magnetic cards and passports are used to control access to real and virtual societies. Though ubiquitous, such methods are not very secure. More severely, they may be shared or stolen easily. Passwords and PIN numbers may be even stolen electronically. Furthermore, they cannot differentiate between authorized user and fraudulent imposter. Otherwise, biometrics has a special characteristic that user is the key; hence, it is not easily compromised or shared. Therefore, biometrics offers means of reliable personal authentication that can address these problems and is gaining citizen and government acceptance. Although significant progress has been made in fingerprint authentication system, there are still a number of research issues that need to be addressed to improve the system efficiency. Automatic fingerprint identification which requires 1-N matching is usually computationally demanding. For a small database, a common approach is to exhaustively match a query fingerprint against all the fingerprints in database [21]. For a large database, however, it is not desirable in practice without an effective fingerprint indexing scheme. There are two technical choices to reduce the number of comparisons and consequently to reduce the response time of the identification process: classification and E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 266–273, 2009. © Springer-Verlag Berlin Heidelberg 2009 springerlink.com A Fast and Distortion Tolerant Hashing for Fingerprint Image Authentication 267 indexing techniques. Traditional classification techniques (e.g. [5], [16]) attempt to classify fingerprint into five classes: Right Loop (R), Left Loop (L), Whorl (W), Arch (A), and Tented Arch (A). Due to the uneven natural distribution, the number of classes is small and real fingerprints are unequally distributed among them: over 90% of fingerprints fall in to only three classes (Loops and Whorl) [18]. This is resulted in the inability of reducing the search space enough of such systems. Indexing technique performs better than classification in terms of the size of space need to be searched. Fingerprint indexing algorithms select most probable candidates and sort them by the similarity to the query. Many indexing algorithms have been proposed recently. A.K. Jain et al [11] use the features around the core point of a Gabor filtered image to realize indexing. Although this approach makes use of global information (core point) but the discrimination power of just one core is limited. In [2], the singular point (SP) is used to estimate the search priority which is resulted in the mean search space below 4% the whole dataset. However, detecting singular point is a hard problem. Some fingerprints even do not have SPs and the uncertainty of SP location is large [18]. Besides, several attempts to account for fingerprint indexing have shown the improvement. R. Cappelli et al. [6] proposed an approach which reaches the reasonable performance and identification time. R.S Germain et al. 
[9] use the triplets of minutiae in their indexing procedure. J.D Boer et al. [3] make effort in combining multiple features (orientation field, FingerCode, and minutiae triplets). One of another key point is to make indexing algorithm more accurate, fingerprint distortion must be considered. There are two kinds of distortion: transformation distortion and system distortion. In particular, due to fingerprint scanners can only capture partial fingerprints; some minutiae – primary fingerprint features - are missing during acquisition process. These distortions of fingerprint including minutia missing cause several problems: (i) the number of minutia points available in such prints is few, thus reducing its discrimination power; (ii) loss of singular points (core and delta) is likely [15]. Therefore, a robust indexing algorithm independent of such global features is required. However, in most of existed indexing scheme mentioned above, they perform indexing by utilizing these global feature points. In this paper, we propose a hashing scheme that can achieve high efficiency in hashing performance by perform hashing on localized features that are able to tolerate distortions. This hashing scheme can perform on any localized features such as minutiae or triplet. To avoid alignment, in this paper we use the triplet feature introduced by Tsai-Yang Jea et. al. [15]. One of our main contributions is that we present a definition of codeword for such fingerprint feature points and our hashing scheme performs on those codewords. By producing codewords for feature points, this scheme can tolerate the distortion of each point. Moreover, to reduce number of candidates for matching stage more efficiently, a randomized hash function scheme is used to reduce the output collisions. The paper is organized as follows. Section 2 introduces some notions and our definition of codeword for fingerprint feature points. Section 3 presents our hashing scheme based on these codewords and a scheme to retrieve fingerprint from hashing results. We present experiment results of our scheme in Section 4. 268 T.H. Le and T.D. Bui 2 Preliminaries 2.1 Error Correction Code For a given choice of metric d, one can define error correction codes in the corresponding space M. A code is a subset C={w1,…,wk} ⊂ M. The set C is sometimes called codebook; its K elements are the codewords. The (minimum) distance of a code is the smallest distance d between two distinct codewords (according to the metric d). Given a codebook C, we can define a pair of functions (C,D). The encoding function C is an injective map for the elements of some domain of size K to the elements of C. The decoding function D maps any element w ∈ M to the pre-image C-1[wk] of the codeword wk that minimizes the distance d[w,wk]. The error correcting distance is the largest radius t such that for every element w in M there is at most one codeword in the ball of radius t centered on w. For integer distance functions we have t=(d-1)/2. A standard shorthand notation in coding theory is that of a (M,K,t)-code. 2.2 Fingerprint Feature Point Primary feature point of fingerprint is called minutia. Minutiae are the various ridge discontinuities of a fingerprint. There are two types of widely used minutiae which are bifurcations and endings (Fig.1). Minutia contains only local information of fingerprints. Each minutia is represented by its coordinates and orientation. Fig. 1. (left) Ridge bifurcation. (b) Ridge endings [15]. Secondary feature is a vector of five-elements (Fig.2). 
For each minutiae Mi(xi,yi,θi) and its two nearest neighbors N0(xn0,yn0,θn0) and N1(xn1,yn1,θn1), the secondary feature is constructed by form a vector Si(ri0 ,ri1,φi0,φi1,δi) in which ri0 and ri1 are the Euclidean distances between the central minutia Mi and its neighbors N0 and N1 respectively. φik is the orientation difference between Mi and Nk, where k is 0 or 1. δi represents the acute angle between the line segments MiN0 and MiN1. Note that N0 and N1 are the two nearest neighbors of the central minutia Mi and ordered not by their Euclidean distances but by satisfying the equation: N0MixN1Mi ≥ 0. A Fast and Distortion Tolerant Hashing for Fingerprint Image Authentication 269 Fig. 2. Secondary feature of Mi. Where ri0 and ri1 are the Euclidean distances between central minutia Mi and its neighbors N0 and N1 respectively. φik is the orientation difference between Mi and Nk where k is 0 or 1. δi represents the acute angle between MiN0 and MiN1 [15]. N0 is the first and N1 is the second minutia encountered while traversing the angle ∠N0MiN1. For the given matched reference minutiae pair pi and qj, it is said that minutiae pi(ri,i’,Φi,i’;,θi,i’) matches qj(rj,j’,Φj,j’,θj,j’), if qj is within the tolerance area of pi. Thus for given threshold functions Thldr(.), ThldΦ(.), and Thldθ(.), |ri,i’ – rj,j’| ≤ Thldr(ri,i’), |Φi,i’ – Φj,j’| ≤ ThldΦ(Φi,i’) and |θi,i’ – θj,j’| ≤ Thldθ(θi,i’). Note that the thresholds are not predefined values but are adjustable according to rii, and rjj. In this literature, we call this secondary fingerprint feature point as feature point. 2.3 Our Error Correction Code for Fingerprint Feature Point Most existing error correction schemes are respect to Hamming distance and used to correct the message at bit level (e.g. parity check bits, repetition scheme, CRC…). Therefore, we define a new error correction scheme for fingerprint feature point. In particular, we consider each feature x as a corrupted codeword and try to correct to the correct codeword by using an encoding function C(x). Definition 1. Let x ∈ RD, and C(x) is an encoding function of the error correction scheme. We define C(x) = (q(x-t)|q(x)|q(x+t)) where q(x) is a quantization function of x with quantization step t. We call the output of C(x) (cx-t,cx,cx+t) is the codeword set of x and cx=q(x) is the nearest codeword of x or codeword of x for short. Lemma 1. Let x ∈ R, given a tolerant threshold t ∈ R. For every y such that |x-y| ≤ t, then the codeword of y cy = q(y) takes one of three elements in the codeword set of x. Lemma 1 can be proved easily by some algebraic transformations. In our approach, we generate codeword set for every dimension of template fingerprint feature point q and only codeword for every dimension of query point p. Following lemma 1 and definition of two matched feature points, we can see that if p and q are corresponding 270 T.H. Le and T.D. Bui feature points of two versions from one fingerprint, codeword of q will be an element in codeword set of p. 3 Codeword-Based Hashing Scheme We present a fingerprint hash generation scheme based on the codeword of the feature point. The key idea is: first, “correct” the error feature point to its codeword, then use that codeword as the input of a randomized hash function which can scatter the input set and ensure that the probability of collision of two feature points is closely related to the distance between their corresponding coordinate pairs (refer to definition of two matched feature points). 3.1 Our Approach Informal description. 
Fingerprint features stored in database can be considered as a very large set. Moreover, the distribution of fingerprint feature points is not uniform and unknown. Therefore, we want to design a randomized hash scheme such that for the large input sets of fingerprint feature points, it is unlikely that elements collide. Fortunately, some standard hash functions (e.g. MD5, SHA) can be made randomized to satisfy the property of target collision resistance. Our scheme works as follows. First, the message bits are permuted by a random permutation and then the hash of resulting message is computed (by a compression function e.g. SHA). A permutation is a special kind of block cipher where the output and input have the same length. A random permutation is widely used in cryptography since it possesses two important properties: randomness and efficient computation if the key is random. The basic idea so far is for any given set of inputs, this algorithm will scatter the inputs among the range of the function well based on a random permutation so that the probabilistic expectation of the output will be distributed more randomly. To ensure the error tolerant property, we perform hashing on codeword (for query points) and on the whole codeword set (for template points) instead of the feature point itself. Follow the lemma 1 we have the query point y and the template point x are matched if and only if codeword of y is an element in codeword set of x . Formal description. We set up our hashing scheme as follows: 1. Choose random dimensions from (1,2,…,D); by this way, we adjust the trade off between the collision of hashing values and the space of our database. 2. Choose an tolerant threshold t and appropriate metric ld for selected dimensions. For each template point p: 1. Generate the codeword set for pd. These values are mapped to binary strings. Binary strings in the codeword set of selected dimensions are then concatenated to form L3 binary strings mt for i=1,…,L3 where L is the number of selected dimensions. 2. Each mi is padded with zero bits to form a n – bit message mi. 3. Generate a random key K for a permutation π of {0,1}n. 4. mi is permuted by π and the resulting message is hashed using a compression function such as SHA.. Let shi = SHA( π (mi)) for i=1,…,L3. A Fast and Distortion Tolerant Hashing for Fingerprint Image Authentication 271 5. There are at most L3 hash values for p. These values can be stored in the same hash table as well as separate ones. For query point q: 1. Compute the nearest codeword value for every selected dimension of q. Then map the value to a binary string. 2. Binary strings of selected dimensions are then concatenated. 3. Perform steps 2 and 4 as for template point. Note that there is only one binary string for query point q. 4. Return all the points which are sharing identical hash value with q. For query evaluation, all candidate matches are returned by our hashing for every query feature point. Hence, each query fingerprint is treated as a bag of points, simulating a multi-point query evaluation. To do this efficiently, we use an identical framework as Ke et al. [17], in that, we maintain two auxiliary index structure –File Table (FT) and Keypoint Table (KT) – to map feature points to their corresponding fingerprint; an entry in KT consists of the file ID (index location of FT) and feature point information. 
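To make the scheme above concrete, here is a minimal Python sketch of the codeword-based hashing, written under stated assumptions rather than as the authors' implementation: the paper does not fix the quantisation function q, the binary encoding, the zero-padding or the keyed permutation π, so a uniform step-t quantiser, a plain string encoding, a seeded byte shuffle and SHA-256 stand in for them, and the function and variable names are ours. Under this reading each template point stores one hash per combination of codewords over the L selected dimensions (three per dimension, i.e. 3^L strings, which is how we interpret the 'L3' of the formal description), while a query point produces a single hash; by Lemma 1, a query point within tolerance t of a template point then collides with one of the template's stored hashes.

```python
import hashlib, itertools, random

def q(x, t):
    """Quantisation with step t (assumed uniform; the paper leaves q unspecified)."""
    return int(x // t)

def codeword_set(x, t):
    """Definition 1: C(x) = (q(x - t), q(x), q(x + t))."""
    return (q(x - t, t), q(x, t), q(x + t, t))

def keyed_hash(codewords, key):
    """Permute the encoded codewords with a keyed permutation, then hash (SHA-256 here)."""
    msg = bytearray("|".join(str(c) for c in codewords).encode())
    perm = list(range(len(msg)))
    random.Random(key).shuffle(perm)              # stand-in for the random permutation pi
    return hashlib.sha256(bytes(msg[i] for i in perm)).hexdigest()

def template_hashes(point, dims, t, key):
    """All hash values stored for a template point: one per combination of
    codewords over the selected dimensions (3 per dimension, i.e. 3**len(dims))."""
    sets = [codeword_set(point[d], t) for d in dims]
    return {keyed_hash(combo, key) for combo in itertools.product(*sets)}

def query_hash(point, dims, t, key):
    """Single hash for a query point: only the nearest codeword per dimension."""
    return keyed_hash(tuple(q(point[d], t) for d in dims), key)

# toy check with five-element secondary features (r0, r1, phi0, phi1, delta):
# a slightly distorted query collides with one of the template's stored hashes
t, dims, key = 4.0, (0, 1, 4), 1234
template = (37.2, 52.9, 0.61, 1.12, 0.35)
query    = (39.0, 50.1, 0.60, 1.10, 0.34)
assert query_hash(query, dims, t, key) in template_hashes(template, dims, t, key)
```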
The template fingerprints that have points sharing identical hash values (collisions) with the query version are then ranked by the number of similarity points. Only top T templates are selected for identifying the query fingerprint by matching. Thus, the search space is greatly reduced. The candidate selection process requires only linear computational cost so that it can be applied for online interactive querying on large image collections. 3.2 Analysis Space complexity. For an automate fingerprint identification, we must allocate extra storage for the hash values of template fingerprints. Assume that the number of Ddimension feature points extracted from template version is N. With L selected dimensions, the total extra space required for one template is O(L3.N). Thus, the hashing scheme requires a polynomial-sized data structure that allows sub-linear time retrievals of near neighbors as shown in following section. Time complexity. On query fingerprint Y with N feature points in a database of M fingerprints, we must: compute the codeword for each feature point which takes O(N) due to quantization hashing functions required constant time complexity; compute the hash value which takes O(time(f,x)) where f is the hash function used and time(f,x) is the time required by function f with input x; and compute the similarity scores of templates that has any point sharing identical hash value with the query fingerprint, requiring time O(M.N/2m). Thus the key quantity is O(M.N/2n + time(f,x)) which is approximately equivalent to 1/2n computations of exhaustive search. 4 Experiments We evaluate our method by testing it on Database FVC2004 [19] which consists of 300 images, 3 prints each of 100 distinct fingers. DB1_A database contains the partial fingerprint templates of various sizes. These images are captured by a optical sensor with a resolution of 500dpi, resulting in images of 300x200 pixels in 8 bit gray scale. 272 T.H. Le and T.D. Bui The full original templates are used to construct the database, while the remaining 7 impressions are used to hashing. We use the feature extraction algorithm described in [15] in our system. However, the authors do not mention in detail how to determine the threshold of error tolerances. Therefore, in our experiments, we assume that the error correction distance t is fixed for all fingerprint feature points. This assumption makes the implementation not optimal so that two feature points are recognized as “match” by matching algorithm proposed in [15] may not share the same hash value in our experiments. The Euclidean distances between the corresponding dimensions of feature vectors are used in quantization hashing function. Table 1 shows the Correct Index Power (CIP) which is defined as the percentage of correctly indexed queries based on the percentage of hypotheses that need to be searched in the verification step. Although our implementation is not optimal, scheme still achieves good CIP result. As can be easily seen, the larger search percentage is, the better results are obtained. It indicates that the optimal implementation can improve the result performance. Table 1. Correct Indexing Power of our algorithm Correct Index Power Search Percentage CIP 5% 10% 15% 20% 80% 87% 94% 96% Compare with some published experiments in the literature, at the search percentage 10%, [13] comes up with 84.5% CIP and [23] reaches a result of 92.8% CIP. 
However, unlike previous works, our scheme is much simpler and by adjusting t carefully, it is promising that our scheme will reach 100% CIP with low search percentage. 5 Conclusion In this paper, we have presented a new robust approach to perform indexing on fingerprint which provides both accurate and fast indexing. However, there is still some works need to be done to in order to make the system more persuasive and to obtain the optimal result like studying optimal choices of t parameter. Moreover, to guarantee the privacy of fingerprint template in any indexing scheme is another important problem that must be considered. References [1] Bazen, A.M., Gerez, S.H.: Fingerprint matching by thin-plate spline modeling of elastic deformations. Pattern Recognition 36, 1859–1867 (2003) [2] Bazen, A.M., Verwaaijen, G.T.B., Garez, S.H., Veelunturf, L.P.J.: A correlation-based fingerprint verification system. In: ProRISC 2000 Workshops on Circuits, Systems and Signal Processing (2000) A Fast and Distortion Tolerant Hashing for Fingerprint Image Authentication 273 [3] Boer, J., Bazen, A., Cerez, S.: Indexing fingerprint database based on multiple features. In: ProRISC 2001 Workshop on Circuits, Systems and Singal Processing. (2001) [4] Brown, L.: A survey of image registration techniques. ACM Computing Surveys (1992) [5] Cappelli, R., Lumini, A., Maio, D., Maltoni, D.: Fingerprint Classification by Directional Image Partitioning. IEEE Trans. on PAMI 21(5), 402–421 (1999) [6] Cappelli, R., Maio, D., Maltoni, D.: Indexing fingerprint databases for efficicent 1: n matching. In: Sixth Int.Conf. on Control, Automation, Robotics and Vision, Singapore (2000) [7] Choudhary, A.M., Awwal, A.A.S.: Optical pattern recognition of fingerprints using distortion-invariant phase-only filter. In: Proc. SPIE, vol. 3805(20), pp. 162–170 (1999) [8] Fingerprint verification competition, http://bias.csr.unibo.it/fvc2002/ [9] Germain, R., Califano, A., Colville, S.: Fingerprint matching using transformation parameter clustering. IEEE Computational Science and Eng. 4(4), 42–49 (1997) [10] Gonzalez, Woods, Eddins: Digital Image Processing, Prentice Hall, Englewood Cliffs (2004) [11] Jain, A., Ross, A., Prabhakar, S.: Fingerprint matching using minutiae texture features. In: International Conference on Image Processing, pp. 282–285 (2001) [12] Jain, A., Prabhakar, S., Hong, L., Pankanti, S.: Filterbank-based fingerprint matching. Transactions on Image Processing 9, 846–859 (2000) [13] Jain, A.K., Prabhakar, S., Hong, L., Pankanti, S.: FingerCode: a filterbank for fingerprint representation and matching. In: CVPR IEEE Computer Society Conference (2), pp. 187– 193 (1999) [14] Jea, T., Chavan, V.K., Govindaraju, V., Schneider, J.K.: Security and matching of partial fingerprint recognition systems, pp. 39–50. SPIE (2004) [15] Tsai-Yang, J., Venu, G.: A minutia-based partial fingerprint recognition system. Pattern Recognition 38(10), 1672–1684 (2005) [16] Karu, K., Jain, A.K.: Fingerprint Classification. Pattern Recognition 18(3), 389–404 (1996) [17] Ke, Y., Sukthankar, R., Huston, L.: An efficient parts-based near duplicate and sub-image retrieval system. In: MM International Conference on Multimedia, pp. 869–876 (2004) [18] Liang, X., Asano, T.,, B.: Distorted Fingerprint indexing using minutiae detail and delaunay triangle. In: ISVD 2006, pp. 217–223 (2006) [19] Maio, D., Maltoni, D., Cappelli, R., Wayman, J.L., Jain, A.K.: FVC2004: Third Fingerprint Verification Competition. In: Proc. ICBA, Hong Kong, July 2004, pp. 
1–7 (2004) [20] Nandakumar, K., Jain, A.K.: Local correlation-based fingerprint matching. In: Indian Conference on Computer Vision, Graphics and Image Processing, pp. 503–508 (2004) [21] Nist fingerprint vendor technology evaluation, http://fpvte.nist.gov/ [22] Ruud, B., Connell, J.H., Pankanti, S., Ratha, N.K., Senior, A.W.: Guide to Biometrics. Springer, Heidelberg (2003) [23] Liu, T., Zhang, G.Z.C., Hao, P.: Fingerprint Indexing Based on Singular Point Correlation. In: ICIP 2005 (2005) The Concept of Application of Fuzzy Logic in Biometric Authentication Systems Anatoly Sachenko, Arkadiusz Banasik, and Adrian Kapczyński Silesian University of Technology, Department of Computer Science and Econometrics, F. D. Roosevelt 26-28, 41-800 Zabrze, Poland sachenkoa@yahoo.com, arkadiusz.banasik@polsl.pl, adrian.kapczynski@polsl.pl Abstract. In the paper the key topics concerning architecture and rules of working of biometric authentication systems were described. Significant role is played by threshold which constitutes acceptance or rejection given authentication attempt. Application of elements of fuzzy logic was proposed in order to define threshold value of authentication system. The concept was illustrated by an example. Keywords: biometrics, fuzzy logic, authentication systems. 1 Introduction The aim of this paper is to present on the basis of theoretical foundations of fuzzy logic and how to use it as a hypothetical, single-layered biometric authentication system. In the first part it will be provided biometric authentication systems primer and the fundamentals of fuzzy logic. On that basis the idea of use of fuzzy logic in biometric authentication systems was formulated. 2 Biometric Authentication Systems Primer Biometric authentication system is basically a system which identifies patterns and carries out the objectives of authentication by identifying the authenticity of physical or behavioral characteristics possessed by the user [5]. The logical system includes the following modules [1]: enrollment module and identification or verification module. The first module is responsible for the registration of user ID and the association of this identifier with the biometric pattern (called biometric template), which is understood as a vector of vales presented in an appropriate form, as a result of processing the collected by the biometric device, raw human characteristics. The identification Module (verification) is responsible for carrying out the collection and processing of raw biometric characteristics in order to obtain a biometric template, which is compared with patterns saved by the registration module. Those modules cooperate with each other and carry out the tasks related to the collection of raw biometric data, features extraction and comparison of features and finally decision making. The session with biometric systems begins with taking anatomical or behavioral features by the biometric reader. Biometric reader generates in n-dimensional biometric E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 274–279, 2009. springerlink.com © Springer-Verlag Berlin Heidelberg 2009 The Concept of Application of Fuzzy Logic 275 data. Biometric data represented by the form of vector features are the output of signals processing algorithms. Then, vector features are identified, and the result is usually presented in the form of the match (called confidence degree). 
On the basis of their relevance and value of the threshold (called threshold value) the module responsible for decision making produce output considered as acceptance or rejection. The result of the work of biometric system is a confirmation of the identity of the user. In light of the existence of the positive population (genuine users) and negative population (impostors), there are four possible outcomes: • • • • Genuine user is accepted (correct response), Genuine user is rejected (wrong response), Impostor is accepted (wrong response), Impostor is rejected (correct response.). A key decision-making is based on the value of the threshold T, which determines the classification of the individual characteristics of vectors, leading to a positive class (the sheep population) and a negative class (the wolf population). When threshold is increased (decreased) the likelihood of false acceptance decreases (increases) and the probability of false rejection increases (decreases). It is not possible to minimize the likelihood of erroneous acceptance and rejection simultaneously. In empirical conditions it can be found that precise threshold value makes it clear that confidence scores which are very close to threshold level, but still lower than it, are rejected. It can be found that the use of fuzzy logic can be helpful mean of reducing identified problem. 3 Basics of Fuzzy Logic A fuzzy set is an object which is characterized by its membership function. That function is assigned to every object in the set and it is ranging between zero and one. The membership (characteristic) function is the grade of membership of that object in the mentioned set [4]. That definition allows to declare more adequate if the object is within a range of the set or not; to be more precise the degree of being in range. That is a useful feature for expressions in natural language, e.g. the price is around thirty dollars, etc. It is obvious that if we consider sets (not fuzzy sets) it is very hard to declare objects and their membership function. The visualization of the example membership function S is presented on fig. 1 and elaborated on eq. 1. 0 ⎧ 2 ⎪ ⎛ x−a⎞ ⎪1 − 2⎜ ⎟ ⎪ ⎝ c−a ⎠ s ( x; a, b, c) = ⎨ 2 ⎪1 − 2⎛⎜ x − c ⎞⎟ ⎪ ⎝c−a⎠ ⎪ 1 ⎩ for x≤a for a≤ x≤b (1) for b≤x≤c for x≥c 276 A. Sachenko, A. Banasik, and A. Kapczyński 1,2 1 µ(x) 0,8 0,6 0,4 0,2 0 0 a 2 4 6 b 8 c 10 Fig. 1. Membership function S applied to fuzzy set high level of security” It is necessary to indicate the meaning of membership function [4]. The first possibility is to indicate similarity between object and the standard. Another is to indicate level of preferences. In that case the membership function is concerned as level of acceptance of an object in order to declared preferences. And the last but not least possibility is to consider it as a level of uncertainty. In that case membership function is concerned as a level of validity that variable X will be equal to value x. Another important aspect of fuzzy sets and fuzzy logic is possibility of fuzzyfication – ability to change sharp values into fuzzy ones and defuzzyfication as a process of changing fuzzy values into crisp values. That approach is very useful in case of natural language problems and natural language variables. Fuzzyfication and defuzzyfication is also used in analysis of group membership. It is the best way of presenting average values in order to whole set of objects I ndimensional space. This space may be a multicriteria analysis of the problem. 
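The printed form of equation (1) is garbled in this copy; what it evidently denotes is the standard S-type (Zadeh) membership function shown in Fig. 1, which is zero up to a, rises quadratically through the crossover point b (conventionally the midpoint of a and c) and saturates at one beyond c. The Python reconstruction below is offered as an illustration of that standard form, with illustrative parameter values rather than the exact ones used in the figure.

```python
def s_membership(x, a, c, b=None):
    """Standard Zadeh S-function: 0 below a, quadratic rise, 1 above c.

    Reconstruction of the garbled equation (1); b defaults to (a + c) / 2."""
    if b is None:
        b = (a + c) / 2.0
    if x <= a:
        return 0.0
    if x <= b:
        return 2.0 * ((x - a) / (c - a)) ** 2
    if x <= c:
        return 1.0 - 2.0 * ((x - c) / (c - a)) ** 2
    return 1.0

# membership in the fuzzy set "high level of security", illustrative a = 2, c = 9
print([round(s_membership(x, 2.0, 9.0), 3) for x in range(0, 11)])
```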
It is well known that there are a lot of practical applications of fuzzy techniques. In many applications, people start with fuzzy values and then propagate the original fuzziness all the way to the answer, by transforming fuzzy rules and fuzzy inputs into fuzzy recommendations. However, there is one well known exception to this general feature of fuzzy technique applications. One of the main applications of fuzzy techniques is intelligent control. In fuzzy control, the objective is not so much to provide an advise to the expert, but rather to generate a single (crisp) control value uc that will be automatically applied by the automated controller. To get this value it is necessary to use the fuzzy control rules to combine the membership functions of the inputs into a membership function µ(u) for the desired control u. This function would be a good output if the result will be an advise to a human expert. However, our point o concern is in generating a crisp value for the automatical controller, we must transform the fuzzy membership function µ(u) into a single value uc. This transformation from fuzzy to crisp is called defuzzification [3]. One of the most widely used defuzzification technique based on centroids is centroid defuzzification. It is based on minimizing the mean square difference between The Concept of Application of Fuzzy Logic 277 the actual (unknown) optimal control u and the generated control uc produced as a result. In this least square optimization it is possible to weigh each value u with the weight proportional to its degree of possibility µ(u). The resulting optimization problem [2]: ∫ µ (u ) ⋅ (u − u ) c 2 du → min uc (2) can be explicitly solved if there is a possibility of differentiation the corresponding objective function by uc and equate the resulting value to 0. The result of it is a formula [2]: uc = ∫ u ⋅ µ (u )du ∫ µ (u )du (3) This formula is called centroid defuzzification because it describes the ucoordinate of the center of mass of the region bounded by the graph of the membership function µ(u). As it was mentioned before fuzzy sets and fuzzy logic is a way to cope with qualitative and quantitative problems. That possibility is a great advantage of presented approach and it is very commonly used in different fields. That provides us a possibility of using its mechanisms in many different problems and it usually gives reasonable solutions. 4 The Use of Fuzzy Logic in Biometric Authentication Systems There are two main characteristics of biometric authentication system: false acceptance rate and false rejection rate. The false acceptance rate can be defined as relation of number of accepted authentication attempts to number of all attempts undertaken by impostors. Authentication attempt is successful only if confidence score resulted from comparison of template created during enrolment process with template created from current authentication attempts eqauls or is greater than specified threshold value. The threshold value can be specified apriori basing on theoretical estimations or can be defined based on requirements from given environment. For example for high security environments the importance of false acceptance errors is greater than of false rejection errors; for low security environments the situation is quite opposite. Threshold is the parameter which defines the levels of false acceptance and false rejection errors. 
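Returning to equation (3): centroid defuzzification is simply the µ-weighted mean of the candidate control values, and a few lines of numerical integration reproduce it. The sketch below is our own illustration (the function names and the triangular membership function are assumptions, not taken from the paper).

```python
import numpy as np

def centroid_defuzzify(u, mu):
    """Centroid defuzzification, eq. (3): u_c = int(u * mu(u)) du / int(mu(u)) du.

    `u` is a grid of candidate control values, `mu` the sampled membership function."""
    u = np.asarray(u, dtype=float)
    mu = np.asarray(mu, dtype=float)
    return np.trapz(u * mu, u) / np.trapz(mu, u)

# triangular membership function centred at 70 on a 0..100 control range
u = np.linspace(0.0, 100.0, 1001)
mu = np.clip(1.0 - np.abs(u - 70.0) / 20.0, 0.0, None)
print(centroid_defuzzify(u, mu))   # ~70.0, the centre of mass of the triangle
```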
One of the most popular approach assumes that threshold value is chosen at level were false acceptance rate equals false reject rate. In our paper we consider the theoretical biometric authentication system with five levels of security: very low, low, medium, high, very high and a main error considered is only a false acceptance error. The security level is associated with a given level of false acceptance error and the values of false acceptance rates are obtained emipirically from given set of biometric templates. Obtained false acceptance rates are function of threshold value which is expressed precisely as a value from range 0 to 100. 278 A. Sachenko, A. Banasik, and A. Kapczyński In biometric systems the threshold value can be set globally (for all users) or individually. From perspective of security officer responsible for effictient work of whole biometric system the choice of indivual thresholds requires setting as many thresholds as number of users enrolled in the system. We propose the use of fuzzy logic as a mean of more natural expression of accepted level of false acceptance errors. In our approach the first parameter considered is the value of false accept rate and basing on which we the level of security is determined which is finally transformed into threshold value. Step-by-step proposed procedure consists of three steps. In fist step for a given value of false acceptance rate we determine the value of the membership functions for given level of security. If we consider five levels of false acceptance rate, e.g. 5%, 2%, 1%, 0.5% and 0.1% than for each of those levels we can calculate the values of membership functions to one of five levels of security: very low, low, medium, high and very high (see fig. 2). µ Fig. 2. Membership functions for given security levels (S). Medium level was distinguished. In second step through appropriate application of fuzzy rules we receive a result of fuzzy request of the threshold value. Those rules are developed in order to obtain an answer to a question about the relationship between threshold (T) and the specified level of security (S): Rule Rule Rule Rule Rule 1: 2: 3: 4: 5: IF IF IF IF IF “S “S “S “S “S is very low” THEN “T is very low” level is low” THEN “T is low” is medium” THEN “T is medium” level is high” THEN “T is high” is very high” THEN “T is very high”. In third step we apply defuzzyfication during which the fuzzy values are transformed based on specified values of the membership functions and point a fuzzy centroid of given values. Our approach was depicted on fig. 3. For example if we assume that the false acceptance rate is 0.1 and is fuzzified and belongs to "S is high" with value of the membership function of 0.2 and belongs to "S is very high" with value of the membership function of 0.8. Then, based on rule 4 and rule 5 we can see that threshold value is high with the value of membership function equaled to 0.2 and threshold value is very high at the The Concept of Application of Fuzzy Logic 279 Fig. 3. Steps of fuzzy reasoning applied in biometric authentication system value of membership function equaled to 0.8. The modal established at the threshold value of membership functions shall be: 10 (very low level), 30 (low level), 50 (medium), 70 (high level), 90 (very high level). The calculation of fuzzy centroid is carried out by the following calculation: T = 90 ⋅ 0.8 + 70 ⋅ 0.2 = 86 (3) In this example, for the value of false acceptance rate of 0.1, the threshold value equals to 86. 
5 Conclusions The development of this concept in the use of fuzzy logic to determine the threshold value associated with a given level of security provides an interesting alternative to the traditional concept to define threshold values in biometric authentication systems. References 1. Kapczyński, A.: Evaluation of the application of the method chosen in the process of biometric authentication of users, Informatica studies. Science series No. 1 (43), vol. 22. Gliwice (2001) 2. Kreinovich, V., Mouzouris, G.C., Nguyen, G.C., H.T.: Fuzzy rule based modeling as a universal approximation tool. In: Nguyen, H.T., Sugeno, M. (eds.) Fuzzy Systems: Modeling and Control, pp. 135–195. Kluwer, Boston (1998) 3. Mendel, J.M., Gang X.: Fast Computation of Centroids for Constant-Width Interval-Valued Fuzzy Sets. In: Fuzzy Information Processing Society. NAFIPS 2006, pp. 621–626. Annual meeting of the North American (2006) 4. Zadeh, L.A.: Fuzzy sets. Information and Control 8 (1965) 5. Zhang, D.: Automated biometrics. Kluwer Academic Publishers, Dordrecht (2000) Bidirectional Secret Communication by Quantum Collisions Fabio Antonio Bovino Elsag Datamat, via Puccini 2, Genova. 16154, Italy fabio.bovino@elsagdatamat.com Abstract. A novel secret communication protocol based on quantum entanglement is introduced. We demonstrate that Alice and Bob can perform a bidirectional secret communication exploiting the “collisions” on linear optical devices between partially shared entangled states. The protocol is based on the phenomenon of coalescence and anti-coalescence experimented by photons when they are incident on a 50:50 beam splitter. Keywords: secret communications, quantum entanglement. 1 Introduction Interference between different alternatives is in the nature of quantum mechanics [1]. For two photons, the best known example is the superposition on a 50:50 beamsplitter (Hong Ou Mandel –HOM– interferometer): two photons with the same polarization are subjected to a coalescence effect when they are superimposed in time [2]. HOM interferometer is used for Bell States measurements too, and it is the crucial element in the experiment of teleportation or entanglement swapping. Multi-particle entanglement has attracted much attention in these years. GHZ (Greemberger, Horne, Zeilinger) states showed stronger violation of locality. The generation of multi-particle entangled states is based on interference between independent fields generated by Spontaneous Parametric Down Conversion (SPDC) from non-linear crystals. As an example four photons GHZ states are created by two pairs emitted from two different sources. For most applications high visibility in interference is necessary to increase the fidelity of the produced states. Usually, high visibility can be reached in experiments involving only a pair of down-converted photons emitted by one source and quantum correlation between two particles is generally ascribed to the fact that particles involved are either generated by the same source or have interacted at some earlier time. In the case of two independent sources of down converted pairs stationary fields cannot be used, unless the bandwidth of the fields is much smaller than that of the detectors. In other words the coherent length of the down conversion fields must be so long that, within the detection time period, the phase of the fields is constant. This limitation can be overcome by using pulsed pump laser with sufficiently narrow temporal width. E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 
280–285, 2009. springerlink.com © Springer-Verlag Berlin Heidelberg 2009 Bidirectional Secret Communication by Quantum Collisions 281 2 Four-Photons Interference by SPDC We want to consider now the interference of two down-converted pairs generated by distinct sources that emit the same state. We take, as source, non-linear crystals, cut for a Type II emission, so that the photons are emitted in pairs with orthogonal polarizations, and they satisfy the well known phase-matching conditions, i.e. energy and momentum conservation. In the analysis we select two k-modes (1,3) for the first source and two k-modes for the second source (2,4), along which the emission can be considered degenerate, or in other words, the central frequency of the two photons is half of the central frequency of the pump pulse. We are interested to the case in which only four photons are detected in the experimental set-up, then we can reduce the total state to 2 ⊗2 Ψ1234 = η2 2 2 Ψ13 + η2 2 2 1 1 Ψ24 + η 2 Ψ13 Ψ24 (1) where |η|² is the probability of photon-conversion in a single pump pulse: η is proportional to the interaction time, to the χ⁽²⁾ non-linear susceptibility of the non-linearcrystal and to the intensity of the pump field, here assumed classical and un-depleted during the parametric interaction. Thus we have a coherent superposition of a double pair emission from the first source (eq. 2), or a double pair emission from the second source (eq. 3), or the emission of a pair in the first crystal and a pair in the second one (eq. 4): 2 Ψ13 = η2 2 ∫ ∫ ∫ ∫ dω1dω2dω3dω4 × Φ(ω1 + ω2 )Φ(ω3 + ω4 )e−i (ω1 +ω2 )ϕ e−i (ω3 +ω4 )ϕ [ × [aˆ ] (ω )] vac × aˆ1+e (ω1 )aˆ3+o (ω2 ) − aˆ3+e (ω1 )aˆ1+o (ω2 ) + 1e (ω3 )aˆ3+o (ω4 ) − aˆ3+e (ω3 )aˆ1+o 2 Ψ24 = η2 2 4 ∫ ∫ ∫ ∫ dω1dω2 dω3dω4 × Φ(ω1 + ω2 )Φ(ω3 + ω4 )e −i (ω1 +ω2 )ϕ [ × [aˆ ] (ω )] vac × aˆ 2+e (ω1 )aˆ 4+o (ω2 ) − aˆ2+e (ω1 )aˆ 4+o (ω2 ) + 2e (ω3 )aˆ4+o (ω4 ) − aˆ2+e (ω3 )aˆ4+o 1 1 Ψ13 Ψ24 = η2 2 4 ] × aˆ1+e (ω1 )aˆ3+o (ω2 ) − aˆ1+e (ω1 )aˆ3+o (ω2 ) + 2e (3) ∫ ∫ ∫ ∫ dω1dω2 dω3dω4 × Φ(ω1 + ω2 )Φ(ω3 + ω4 ) [ × [aˆ (2) (4) (ω3 )aˆ4+o (ω4 ) − aˆ2+e (ω3 )aˆ4+o (ω4 )] vac The function Φ (ω1 + ω2 ) contains the information about the pump field and the parametric interaction, and can be expanded in the form 282 Fabio Antonio Bovino Φ (ω1 + ω 2 ) = Ε p (ω1 + ω 2 )φ (ω1 + ω 2 , ω1 − ω 2 ) (5) where Ε p (ω1 + ω 2 ) describes the pump field spectrum and φ (ω1 + ω 2 , ω1 − ω 2 ) is the two photon amplitude for single frequency pumped parametric down-conversion. Without loss of generality, we impose a normalization condition on Φ (ω1 + ω 2 ) so that 2 ∫∫ dω1dω 2 Φ (ω1 + ω 2 ) = 1 (6) We want to calculate the probability to obtain Anti-Coalescence or Coalescence on the second Beam-splitter (BS) conditioned to Anti-coalescence at the first one. For Anti-Coalescence-Anti-Coalescence probability (AA) we obtain: AA = 5 + 3Cos(4Ω 0ϕ ) 20 (7) For Anti-Coalescence-Coalescence Probability (AC) we obtain: 3Sin 2 (2Ω 0ϕ ) 5 AC = (8) For Coalescence-Coalescence probability (CC) we obtain: CC = 5 3 + Cos(4Ω 0ϕ ) 20 (9) If the two sources emit different states, the result is different. 
In fact, for the AA probability we obtain

AA = [3 + cos(4Ω0φ)] / 20 ,   (10)

for the AC probability we obtain

AC = [5 − cos(4Ω0φ)] / 10 ,   (11)

and for the CC probability we obtain

CC = [7 + cos(4Ω0φ)] / 20 .   (12)

3 Four-Photons Interference: Ideal Case

Now let us consider the interference of two pairs generated by distinct ideal polarization-entangled sources that emit the same state, for example two singlet states:

|Ψ¹₁₃⟩|Ψ¹₂₄⟩ = (1/2)(a†₁ₑ a†₃ₒ − a†₁ₒ a†₃ₑ)(a†₂ₑ a†₄ₒ − a†₂ₒ a†₄ₑ)|vac⟩ .   (13)

For the AA, AC and CC probabilities we obtain

AA = 1/4,  AC = 0,  CC = 3/4 .   (14)

If the two distinct ideal polarization-entangled sources emit different states, for example a singlet state from the first one and a triplet state from the second one, we have

|Ψ¹₁₃⟩|Ψ¹₂₄⟩ = (1/2)(a†₁ₑ a†₃ₒ − a†₁ₒ a†₃ₑ)(a†₂ₑ a†₄ₒ + a†₂ₒ a†₄ₑ)|vac⟩ ,   (15)

and the AA, AC and CC probabilities become

AA = 0,  AC = 1/2,  CC = 1/2 .   (16)

4 Bidirectional Secret Communication by Quantum Collision

The last result can be used in a quantum communication protocol for exchanging secret messages between Alice and Bob. Alice has a Bell-state synthesizer (i.e. an entangled-state source, a phase shifter and a polarization rotator), a quantum memory and a 50:50 beam-splitter. Alice encodes binary classical information into two different Bell states: she encodes 1 as the singlet state and 0 as the triplet state. She then sends one photon of the state to Bob and keeps the second one in her quantum memory.

Fig. 1. Set-up used by Alice and Bob to perform the bidirectional secret communication

Bob has the same set-up as Alice, so he is able to encode a binary sequence with the same two Bell states used in the protocol. Bob sends one photon of his state to Alice and keeps the second one in his quantum memory. Alice and Bob then perform a Bell measurement on the two 50:50 beam-splitters.

If Alice and Bob have exchanged the same bit, after the Bell measurements the probability of Coalescence-Coalescence is ¾ and the probability of Anti-Coalescence-Anti-Coalescence is ¼. If Alice and Bob have exchanged different bits, the probability of Coalescence-Coalescence is ½ and the probability of Anti-Coalescence-Coalescence is ½.

After the measurements, Alice and Bob communicate the results over an authenticated classical channel: if the result is Anti-Coalescence-Anti-Coalescence, they learn that the same state was used; if the result is Anti-Coalescence-Coalescence or Coalescence-Anti-Coalescence, they learn that different states were used. In the case of Coalescence-Coalescence, they have to repeat the procedure until a useful result is obtained. If the singlet and triplet states are used with equal probability ½, Alice (Bob) can reconstruct 37.5% of the message sent by Bob (Alice).

Fig. 2. Possible outputs of a Bell's measurement

5 Conclusion

Sources emitting entangled states “on demand” do not yet exist, so the protocol cannot be exploited for now. The novelty is the use of quantum properties to encode and decode a message without exchanging a key, and the protocol performs a genuinely quantum cryptographic process. The fundamental condition appears to be the authentication of the users: if the classical channel is authenticated, it is not possible to extract information from the quantum communication channel.
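As a numerical sanity check of the 37.5% figure quoted in Section 4, the short Monte Carlo sketch below (our own illustration, not part of the protocol) draws random bit choices for Alice and Bob and samples the Bell-measurement outcomes with the probabilities given above: for equal bits AA occurs with probability 1/4, for different bits AC (or CA) occurs with probability 1/2, and only these outcomes are conclusive, so the expected fraction of reconstructible bits is 1/2·1/4 + 1/2·1/2 = 3/8 = 0.375.

```python
import random

def simulate_rounds(n_rounds, seed=7):
    """Monte Carlo of the conclusive-outcome rate of the collision protocol."""
    rng = random.Random(seed)
    conclusive = 0
    for _ in range(n_rounds):
        alice, bob = rng.randint(0, 1), rng.randint(0, 1)   # 1 = singlet, 0 = triplet
        if alice == bob:
            # same Bell state: AA with probability 1/4 (conclusive), CC with 3/4
            outcome = "AA" if rng.random() < 0.25 else "CC"
        else:
            # different Bell states: AC/CA with probability 1/2 (conclusive), CC with 1/2
            outcome = "AC" if rng.random() < 0.5 else "CC"
        conclusive += outcome in ("AA", "AC")
    return conclusive / n_rounds

print(simulate_rounds(200_000))   # converges to 3/8 = 0.375
```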
The analysis of security is not complete and comments are welcome. Bidirectional Secret Communication by Quantum Collisions References 1. Schrödinger, E.: Die Naturewissenschaften 48, 807 (1935) 2. Hong, C.K., Ou, Z.Y., Mandel, L.: Phys. Rev. Lett. 59, 2044 (1987) 285 Semantic Region Protection Using Hu Moments and a Chaotic Pseudo-random Number Generator Paraskevi Tzouveli, Klimis Ntalianis, and Stefanos Kollias National Technical University of Athens Electrical and Computer Engineering School 9, Heroon Polytechniou str. Zografou 15773, Athens, Greece tpar@image.ntua.gr Abstract. Content analysis technologies give more and more emphasis on multimedia semantics. However most watermarking systems are frame-oriented and do not focus on the protection of semantic regions. As a result, they fail to protect semantic content especially in case of the copy-paste attack. In this framework, a novel unsupervised semantic region watermark encoding scheme is proposed. The proposed scheme is applied to human objects, localized by a face and body detection method that is based on an adaptive two-dimensional Gaussian model of skin color distribution. Next, an invariant method is designed, based on Hu moments, for properly encoding the watermark information into each semantic region. Finally, experiments are carried out, illustrating the advantages of the proposed scheme, such as robustness to RST and copy-paste attacks, and low overhead transmission. Keywords: Semantic region protection, Hu moments, Chaotic generator. pseudo-random number 1 Introduction Copyright protection of digital images and video is still an urgent issue of ownership identification. Several watermarking techniques have been presented in literature, trying to confront the problem of copyright protection. Many of them are not resistant enough to geometric attacks, such as rotation, scaling, translation and shearing. Several researchers [1]-[4] have tried to overcome this inefficiency by designing watermarking techniques resistant to geometric attacks. Some of them are based on the invariant property of the Fourier transform. Others use moment based image normalization [5], with a standard size and orientation or other normalization technique [6]. In most of the aforementioned techniques the watermark is a random sequence of bits [7] and it is retrieved by subtracting the original from the candidate image and choosing an experimental threshold value to determine when the cross-correlation coefficient denotes a watermarked image or not. On the other hand, the majority of the proposed techniques are frame-based and thus semantic regions such as humans, buildings, cars etc., are not considered. At the same time, multimedia analysis technologies give more and more importance to semantic content. However in several applications (such as TV news, TV weather E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 286–293, 2009. springerlink.com © Springer-Verlag Berlin Heidelberg 2009 Semantic Region Protection Using Hu Moments 287 forecasting, films) semantic regions, and especially humans, are addressed as independent video objects and thus should be independently protected. In this direction the proposed system is designed to provide protection of semantic content. To achieve this goal, a human object detection module is required both in watermark encoding and during authentication. 
Afterwards the watermark encoding phase is activated, where chaotic noise is generated and properly added to each human object, producing the watermarked human object. The watermark encoding procedure is guided by a feedback mechanism in order to satisfy an equality, formed as a weighted difference between Hu moments of the original and watermarked human objects. During authentication, initially every received image passes through the human object detection module. Then Hu moments are calculated for each detected region, and an inequality is examined. A semantic region is copyrighted only if the inequality is satisfied. Experimental results on real sequences indicate the advantages of the proposed scheme in cases of mixed attacks, affine distortions and the copypaste attack. 2 Semantic Region Extraction Extraction of human objects is a two-step procedure. In the first step the human face is detected and in the second step the human body is localized based on information of the position and size of human face. In particular human face detection is based on the distribution of chrominance values corresponding to human faces. These values occupy a very small region of the YCbCr color space [9]. Blocks of an image that are located within this small region can be considered as face blocks, belonging to the face class Ωf. According to this assumption, the histogram of chrominance values corresponding to the face class can be initially modeled by a Gaussian probability density function. Each Bi block of the image is considered to belong to the face class, if the respective probability P(x(Bi)|Ωf) is high. a) Initial Image b) Object Mask c) Object Extraction Fig. 1. Human Video Object Extraction Method On the other hand and in order to filter non-face regions with similar chrominance values, the aspect ratio R=Hf /Wf (where Hf is the height and Wf the width of the head) is adopted, which was experimentally found to lie within the interval [1.4 1.6] [9]. Using R and P, a binary mask, say Mf, is build containing the face area. Detection of the body area is then achieved using geometric attributes that relate face and body areas [9]. After calculating the geometric attributes of the face region, the human body can be localized by incorporating a similar probabilistic model. Finally the face 288 P. Tzouveli, K. Ntalianis, and S. Kollias and body masks are fused and human video objects are extracted. The phases of human object extraction are illustrated in Figure 1. This algorithm is an efficient method of finding face locations in a complex background when the size of the face is unknown. It can be used for a wide range of face sizes. The performance of the algorithm is based on the distribution of chrominance values corresponding to human faces providing 92% successful. 3 The Watermark Encoding Module Let us assume that human object O has been extracted from an image or frame, using the human object extraction module described in Section 2. Initially Hu moments of human object O are computed [6] providing an invariant feature of an object. Traditionally, moment invariants are computed based both on the shape boundary of the area and its interior object. Hu first introduced [19] the mathematical foundation of 2D moment invariants, based on methods of algebraic invariants and demonstrated their application to shape recognition. 
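The rest of this section manipulates these seven invariants directly, so a concrete reference may help. The sketch below is ours (not the authors' code): it computes φ1–φ7 for a segmented binary region with NumPy, using the standard normalized-central-moment formulas of [6]; in the actual scheme the invariants would be evaluated on the extracted human object.

```python
import numpy as np

def hu_moments(mask: np.ndarray) -> np.ndarray:
    """Seven Hu invariants of a binary object mask (1 inside the region, 0 outside)."""
    ys, xs = np.nonzero(mask)
    m00 = float(len(xs))                    # zero-order moment (region area)
    xbar, ybar = xs.mean(), ys.mean()

    def mu(p, q):                           # central moment mu_pq
        return np.sum((xs - xbar) ** p * (ys - ybar) ** q)

    def eta(p, q):                          # normalized central moment eta_pq
        return mu(p, q) / m00 ** (1 + (p + q) / 2)

    e20, e02, e11 = eta(2, 0), eta(0, 2), eta(1, 1)
    e30, e03, e21, e12 = eta(3, 0), eta(0, 3), eta(2, 1), eta(1, 2)
    phi1 = e20 + e02
    phi2 = (e20 - e02) ** 2 + 4 * e11 ** 2
    phi3 = (e30 - 3 * e12) ** 2 + (3 * e21 - e03) ** 2
    phi4 = (e30 + e12) ** 2 + (e21 + e03) ** 2
    phi5 = ((e30 - 3 * e12) * (e30 + e12) * ((e30 + e12) ** 2 - 3 * (e21 + e03) ** 2)
            + (3 * e21 - e03) * (e21 + e03) * (3 * (e30 + e12) ** 2 - (e21 + e03) ** 2))
    phi6 = ((e20 - e02) * ((e30 + e12) ** 2 - (e21 + e03) ** 2)
            + 4 * e11 * (e30 + e12) * (e21 + e03))
    phi7 = ((3 * e21 - e03) * (e30 + e12) * ((e30 + e12) ** 2 - 3 * (e21 + e03) ** 2)
            - (e30 - 3 * e12) * (e21 + e03) * (3 * (e30 + e12) ** 2 - (e21 + e03) ** 2))
    return np.array([phi1, phi2, phi3, phi4, phi5, phi6, phi7])
```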
Hu's method is based on nonlinear combinations of 2nd and 3rd order normalized central moments, providing a set of absolute orthogonal moment invariants that can be used for RST-invariant pattern identification. Hu [19] derived seven functions from regular moments which are rotation, scaling and translation invariant. In [20], Hu's moment invariant functions are incorporated and the watermark is embedded by modifying the moment values of the image; in that implementation an exhaustive search must be performed to determine the embedding strength. The method proposed in [20] provides a watermark that is invariant to both geometric and signal-processing attacks, based on moment invariants.

Hu moments are seven invariant values computed from central moments up to order three, and are independent of object translation, scale and orientation. Let Φ = [φ1, φ2, φ3, φ4, φ5, φ6, φ7]ᵀ be a vector containing the Hu moments of O. In this paper, the watermark information is encoded into the invariant moments of the original human object. To accomplish this, let us define the following function:

f(X, Φ) = Σ_{i=1..7} w_i (x_i − φ_i) / φ_i   (1)

where X is a vector containing the φ values of an object, Φ contains the φ invariants of object O, and w_i are weights that put different emphasis on different invariants. Each weight w_i receives a value within a specific interval, based on the output of a chaotic random number generator.

Chaotic functions, first studied in the 1960s, present numerous interesting properties that can be exploited by modern cryptographic and watermarking schemes. For example, the iterative values generated by such functions are completely random in nature, although they remain bounded, and they never converge, no matter how many iterations are performed. The most fascinating aspect of these functions, however, is their extreme sensitivity to initial conditions, which makes them very important for applications in cryptography. One of the simplest chaotic functions, and the one incorporated in our work, is the logistic map. In particular, the logistic function is used as the core component of a chaotic pseudo-random number generator (C-PRNG) [8]. The procedure is triggered and guided by a secret 256-bit key that is split into 32 8-bit session keys (k0, k1, …, k31). Two successive session keys kn and kn+1 are used to regulate the initial conditions of the chaotic map in each iteration. The robustness of the system is further reinforced by a feedback mechanism, which leads to acyclic behavior, so that the next value to be produced depends both on the key and on the current value.

The first seven output values of the C-PRNG are linearly mapped to the following intervals: [1.5, 1.75] for w1, [1.25, 1.5] for w2, [1, 1.25] for w3, [0.75, 1] for w4 and w5, and [0.5, 0.75] for w6 and w7. These intervals have been experimentally estimated and reflect the importance and robustness of each of the φ invariants. Watermark encoding is then achieved by enforcing the following condition:

f(Φ*, Φ) = Σ_{i=1..7} w_i (φ*_i − φ_i) / φ_i = N*   (2)

where Φ* is the moment vector of the watermarked human object O* and N* is a target value, also properly determined by the C-PRNG, taking into consideration a tolerable content distortion. The value N* expresses the weighted difference between the φ invariants of the original and the watermarked human objects.
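The text above fixes only the interval mapping of the weights; the internal construction of the C-PRNG of [8] is not spelled out here. The following sketch is therefore an assumption-laden illustration: a logistic map re-seeded from successive pairs of 8-bit session keys, with a feedback term, whose first seven outputs are mapped linearly onto the stated intervals. The seeding, feedback and all function names are ours.

```python
def session_keys(key_256bit: bytes):
    """Split a 256-bit secret key into 32 8-bit session keys k0..k31."""
    assert len(key_256bit) == 32
    return list(key_256bit)

def c_prng(key_256bit: bytes, count: int, r: float = 3.99):
    """Yield `count` chaotic values in (0, 1) from the logistic map x <- r*x*(1-x),
    re-seeded from successive session-key pairs and perturbed by a feedback term
    so that the next value depends on the key and on the current value."""
    ks = session_keys(key_256bit)
    out, feedback = [], 0.0
    for n in range(count):
        kn, kn1 = ks[n % 32], ks[(n + 1) % 32]
        # initial condition regulated by two successive session keys (+ feedback)
        x = ((kn * 256 + kn1 + 1) / 65537.0 + feedback) % 1.0
        x = min(max(x, 1e-6), 1 - 1e-6)       # keep x strictly inside (0, 1)
        for _ in range(64):                    # iterate the logistic map
            x = r * x * (1 - x)
        out.append(x)
        feedback = x
    return out

# Map the first seven outputs to the weight intervals given in the text.
INTERVALS = [(1.5, 1.75), (1.25, 1.5), (1.0, 1.25),
             (0.75, 1.0), (0.75, 1.0), (0.5, 0.75), (0.5, 0.75)]

def weights(key_256bit: bytes):
    vals = c_prng(key_256bit, 7)
    return [lo + v * (hi - lo) for v, (lo, hi) in zip(vals, INTERVALS)]

if __name__ == "__main__":
    print(weights(bytes(range(32))))           # seven weights, one per interval
```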
The greater N* is, the larger the perturbation that must be added to the original video object and the higher the visual distortion that is introduced.

Fig. 2. Block diagram of the encoding module

This is achieved by generating a perturbation region ΔΟ of the same size as O such that, when ΔΟ is added to the original human object O, it produces a region

O* = O + βΔΟ   (3)

that satisfies Eq. (2). Here, β is a parameter that controls the distortion introduced to O by ΔΟ. The C-PRNG generates values until the mask ΔΟ is fully filled. After generating all sensitive parameters of the watermark encoding module, a proper O* is iteratively produced using Eqs. (2) and (3). In this way, the watermark information is encoded into the φ values of O, producing O*. An overview of the proposed watermark encoding module is presented in Figure 2.

4 The Decoding Module

The decoding module is responsible for detecting copyrighted human objects. The decoding procedure is split into two phases (Figure 3). During the first phase, the received image passes through the human object extraction module described in Section 2. During the second phase each human object undergoes an authentication test to check whether it is copyrighted or not.

Fig. 3. Block diagram of the decoding module

In particular, let us consider the following sets of objects and respective φ invariants: (a) (O, Φ) for the original human object, (b) (O*, Φ*) for the watermarked human object and (c) (O′, Φ′) for a candidate human object. Then O′ is declared authentic if:

| f(Φ*, Φ) − f(Φ′, Φ) | ≤ ε   (4)

where f(Φ*, Φ) is given by Eq. (2), while f(Φ′, Φ) is given by:

f(Φ′, Φ) = Σ_{i=1..7} w_i (φ′_i − φ_i) / φ_i = N′   (5)

Then Eq. (4) becomes

N_d = | N* − N′ | ≤ ε  ⇒  | Σ_{i=1..7} w_i (φ*_i − φ′_i) / φ_i | ≤ ε   (6)

where ε is an experimentally determined, case-specific margin of error and w_i are the weights (see Section 3). Two observations need to be stressed at this point. First, it is advantageous that the decoder does not need the original image: it only needs w_i, Φ, Φ* and the margin of error ε. Secondly, since the decoder only checks the validity of Eq. (6) for the received human object, the resulting watermarking scheme answers a yes/no question (i.e. copyrighted or not). As a consequence, this watermarking scheme belongs to the family of algorithms of 1-bit capacity.

In order to determine ε, we should first observe that Eq. (6) delimits a normalized margin of error between Φ and Φ*. This margin depends on the severity of the attack: the more severe the attack, the larger the value of N_d will be. Thus, ε should be selected so as to keep the false-reject and false-accept rates as low as possible (ideally zero). More specifically, the value of ε is not set heuristically, but depends on the content of each distinct human object. In particular, each watermarked human object O* undergoes a sequence of plain attacks (e.g. compression, filtering) and mixed attacks (e.g. cropping and filtering, noise addition and compression) of increasing strength. The strength of the attack increases until either the SNR falls below a predetermined value or a subjective criterion is satisfied. In the following, the subjective criterion, which is related to the content's visual quality, is adopted. According to this criterion and for each attack, when the quality of the human object's content is considered unacceptable by the majority of evaluators, an upper level of attack, say Ah, is set.
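Given the stored weights and moment vectors, the authentication test of Eq. (6) is only a few lines. The sketch below is ours (reusing the hu_moments helper sketched earlier to obtain the moment vectors); it also shows how the margin ε would be calibrated as the maximum N_d over a set of attacked copies, as formalized in Eqs. (7)–(9) right after this point.

```python
import numpy as np

def f(X, Phi, w):
    """Weighted, normalized moment difference used in Eqs. (1), (2) and (5)."""
    X, Phi, w = map(np.asarray, (X, Phi, w))
    return float(np.sum(w * (X - Phi) / Phi))

def n_d(Phi_candidate, Phi_orig, Phi_marked, w):
    """Nd = |N* - N'| of Eq. (6)."""
    return abs(f(Phi_marked, Phi_orig, w) - f(Phi_candidate, Phi_orig, w))

def is_copyrighted(Phi_candidate, Phi_orig, Phi_marked, w, eps):
    """Yes/no authentication test: the decoder needs only w, Phi, Phi* and eps."""
    return n_d(Phi_candidate, Phi_orig, Phi_marked, w) <= eps

def calibrate_eps(attacked_moment_vectors, Phi_orig, Phi_marked, w):
    """eps = max over the attacked copies O*_i of their Nd values (Eqs. (7)-(9))."""
    return max(n_d(Phi_i, Phi_orig, Phi_marked, w) for Phi_i in attacked_moment_vectors)
```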
This upper level of attack can also be determined automatically from the SNR, since a minimum SNR value can be defined before any attack is performed. Let us now define an operator p(·) that performs attack i on O* (reaching the upper level A_h^i) and produces an object O*_i:

p(O*, A_h^i) = O*_i ,  i = 1, 2, …, M   (7)

Then for each O*_i, N_{d_i} is calculated according to Eq. (6). By gathering the N_{d_i} values, a vector is produced:

N_d = [N_{d_1}, N_{d_2}, …, N_{d_M}]   (8)

Then the margin of error is determined as:

ε = max N_d   (9)

Since ε is the maximum value of N_d, it is guaranteed that a human object would have to be visually unacceptable in order to deceive the watermark decoder.

5 Experimental Results

Several experiments were performed to examine the advantages and open issues of the proposed method. Firstly, face and body detection is performed on different images. Afterwards, the watermark information is encoded into each human object, and the decoding module is tested under a wide class of geometric distortions, copy-paste and mixed attacks. When an attack of a specific kind is performed on the watermarked human object (Fig. 4a), it leads to an SNR reduction that is proportional to the severity of the attack.

Firstly we examine JPEG compression for quality factors in the range 10 to 50. Result sets (N*, SNR, Nd) are provided in the first group of rows of Table 1. It can be observed that Nd changes rapidly for SNR < 9.6 dB; furthermore, the subjective visual quality is not acceptable for SNR < 10 dB. Similar behavior can be observed in the cases of Gaussian noise for SNR < 6.45 dB (using different means and deviations) and median filtering for SNR < 14.3 dB (changing the filter size). Again in these cases the subjective visual quality is not acceptable for SNR < 5.8 dB and SNR < 9.2 dB respectively. Furthermore, we examine some mixed attacks, combining Gaussian noise and JPEG compression, scaling and JPEG compression, and Gaussian noise and median filtering.

Table 1. Watermark detection after different attacks

In the following, we illustrate the ability of the method to protect content against copy-paste attacks. The encoding module receives an image which contains a weather forecaster and provides the watermarked human object (Fig. 4a). In this case ε was automatically set equal to 0.65 according to Eq. (9), so as to confront even a cropping inaccuracy of 6%. It should be mentioned that, for larger ε, larger cropping inaccuracies can be addressed; however, the possibility of false alarms also increases. Now let us assume that a malicious user initially receives the image of Fig. 4a and then copies, modifies (cropping inaccuracy 2%, scaling 5%, rotation 10°) and pastes the watermarked human object into new content (Fig. 4b). Let us also assume that the decoding module receives Fig. 4b. Initially, the human object is extracted and then the decoder checks the validity of Eq. (6). In this case Nd = 0.096, a value smaller than ε. As a result the watermark decoder certifies that the human object of Fig. 4b is copyrighted.

Fig. 4. Copy-paste attack. (a) Watermarked human object (b) Modified watermarked human object in new content.

6 Conclusions

In this paper an unsupervised, robust and low-complexity semantic object watermarking scheme has been proposed. Initially, human objects are extracted and the watermark information is properly encoded into their Hu moments.
The authentication module needs only the moment values of the original and watermarked human objects and the corresponding weights. Consequently, both encoding and decoding modules have low complexity. Several experiments have been performed on real sequences, illustrating the robustness of the proposed watermarking method to various signal distortions, mixed processing and copy-paste attacks. References 1. Cox, J., Miller, M.L., Bloom, J.A.: Digital Watermarking, San Mateo. Morgan Kaufmann, San Francisco (2001) 2. Lin, C.Y., Wu, M., Bloom, J., Cox, I., Miller, M., Lui, Y.: Rotation, scale, and translation resilient watermarking for images. IEEE Trans. on Image Processing 10, 767–782 (2001) 3. Wu, M., Yu, H.: Video access control via multi-level data hiding. In: Proc. of the IEEE ICME, N.Y. York (2000) 4. Pereira, S., Pun, T.: Robust template matching for affine resistant image watermarks. IEEE Transactions on Image Processing 9(6) (2000) 5. Abu-Mostafa, Y., Psaltis, D.: Image normalization by complex moments. IEEE Trans. on Pattern Analysis and Machine Intelligent 7 (1985) 6. Hu, M.K.: Visual pattern recognition by moment invariants. IEEE Trans. on Information Theory 8, 179–187 (1962) 7. Alghoniemy, M., Tewfik, A.H.: Geometric Invariance in Image. Watermarking in IEEE Trans. on Image Processing 13(2) (2004) 8. Devaney, R.: An Introduction to Chaotic Dynamical Systems. Addison-Wesley, Redwood City (1989) 9. Yang, Huang, T.S.: Human Face Detection in Complex Background. Pattern Recognition 27(1), 53–63 (1994) Random r-Continuous Matching Rule for Immune-Based Secure Storage System Cai Tao, Ju ShiGuang, Zhong Wei, and Niu DeJiao* JiangSu University, Computer Department, ZhengJiang, China, 212013 caitao@ujs.edu.cn Abstract. On the basis of analyzing demand of secure storage system, this paper use the artificial immune algorithm to research access control system for the secure storage system. Firstly some current matching rules are introduced and analyzed. Then the elements in immune-based access control system are defined. To improve the efficiency of the artificial immune algorithm, this paper proposes the random r-continuous matching rule, and analyze the number of illegal access requests that one detector can check out. Implementing prototype of the random rcontinuous matching rule to evaluate and compare its performance with current matching rules. The result proves the random r-continuous matching rule is more efficient than current matching rules. At last, we use the random r-continuous matching rule to realize immune-based access control system for OST in Lustre. Evaluating its I/O performance, the result shows its I/O performance loss is below 8%, it proves that the random r-continuous matching rule can be used to realize the secure storage system that can keep high I/O performance. Keywords: matching rule; artificial immune algorithm; secure storage system. 1 Introduction The secure storage system is the hot topic in current researching. There are six directions in current researching. Firstly, encrypting file system to ensure security of storage system, it contains CFS[1], AFS[2], SFS[3,4] and Secure NAS [5]. Secondly, researching new disk structure to realize secure storage system, it contains NASD[6] and Self Securing Storage[7,8]. Thirdly, researching survivable strategy for storage system, it contains PASIS[9,10] and OceanStore[11]. 
Fourth, researching efficient key management strategy for secure storage system, it contains SNAD[12,13,14], PLUTUS[15] and iSCSI-Based Network Attached Storage Secure System[16]. Fifth, researching the secure middle-ware module to ensure security of storage system, it contains SiRiUS[17] and two-layered secure structure for storage system [18]. Sixth, using zone and mask code to ensure security of storage system. Encryption, authentication, data redundancy and intrusion detection are used in current researching of secure storage system. But large time and space consumption are needed to ensure security of enormous data stored in storage system, and the loss of I/O performance in the secure storage system is very large. High I/O performance is important character for storage system. We use the artificial immune algorithm to research fast access control strategy for the secure storage system. * Support by JiangSu Science Foundation of China No.2007086. E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 294–300, 2009. springerlink.com © Springer-Verlag Berlin Heidelberg 2009 Random r-Continuous Matching Rule for Immune-Based Secure Storage System 295 The artificial immune algorithm simulates natural immune system, it has many good characters such as distributability, multi-layered, diversity, autonomy and adaptability and so on. It can protect system efficiently. The classic theory is the negative selection algorithm that presented by Forrest in 1994. It simulates selftolerance of T cells. We use the artificial immune algorithm to judge whether access request is legal in secure storage system and realize fast access control system, then ensure the security of the storage system. The matching rule used to select selftolerance detectors and judge whether detector matches access request, so it is important for efficiency and accuracy the secure storage system. When self, access request and detector are represented by binary string, matching rule is used to compare two binary strings. The remainder of this paper is organized as follows. Section 2 analyzes current matching rules. Section 3 gives definition of some elements and presents random rcontinuous matching rule. Section 4 analyzes the efficiency of random r-continuous matching rule and compare with current matching rules. Section 5 implements prototype of random r-continuous matching rule to evaluate its performance, then compares efficiency and accuracy with current matching rules. Section 6 use random r-continuous matching rule to realize access control system for object-based storage target in storage area network system named Lustre, and evaluate its I/O performance. 2 Related Works The matching rule used to judge whether two binary strings matches between detector and self or between detector and access request. Current matching rules contain rcontiguous matching rule, r-chunk matching rule, Hamming distance matching rule and Rogers and Tanimoto matching rule. Forrest presented r-contiguous matching rule in 1994[24]. r-contiguous matching rule can discriminate non-self accurately, but need large number of detectors. Every detector with l bits contains l-r+1 characteristic sub-string. Every characteristic subl −r string can detect 2 non-self. One detector can recognize (l − r + 1)2 non-self mostly. Balthrop presented r-chunk matching rule to improve accuracy and efficiency of rcontiguous matching rule in 2002[25]. r-chunk matching rule is to add condition to rcontiguous matching rule. 
The start-position restriction can improve the accuracy: only the bits from the i-th to the (i+r−1)-th position of the detector are valid, where i is specific to each detector. One detector contains l−r−i+1 characteristic sub-strings and can recognize at most (l − r − i + 1)·2^(l−r) non-self strings, so the recognition capability of a detector is smaller than with the r-contiguous matching rule.

The Hamming distance matching rule was proposed by Farmer in 1986 [26]. It checks whether there are at least r positions at which the two strings carry identical bits, without requiring these r bits to be contiguous. Every detector can recognize at most (l − r + 1)·2^(l−r) non-self strings, but the rule has lower accuracy. Harmer analyzes different matching rules by calculating the signal-to-noise ratio and the function-value distribution of each matching rule when applied to a randomly generated data set [27]. Overall, current matching rules have limited efficiency.

3 The Random r-Continuous Matching Rule

We give definitions of the elements and present the random r-continuous matching rule.

3.1 Definition of Elements

Definition 1. Domain. U = {0,1}^l is the set of all binary strings with l bits; it is composed of the self set and the non-self set.
Definition 2. Self set. S ⊆ U is the set of all legal strings in the domain.
Definition 3. Non-self set. NS ⊆ U is the set of all illegal strings in the domain.
Definition 4. Access request. x = x1x2…xl (xi ∈ {0,1}) is the representation of one access request in the storage system and is one string in the domain.
Definition 5. Threshold. r is the criterion used to judge whether x matches d.
Definition 6. Detector. d = (d1d2…dl, r) (di ∈ {0,1}) is a binary string with l bits together with the threshold r.
Definition 7. Characteristic sub-string. A sub-string of a detector that is used to check access requests.

3.2 Random r-Continuous Matching Rule

When access requests are checked by the artificial immune algorithm, the characteristic sub-string is critical, so increasing the number of non-self strings that every characteristic sub-string can match is the key to improving the efficiency of a matching rule. The r-contiguous and r-chunk matching rules require the identical sub-string shared by antigen and detector to occur at the same (counterpart) positions and to be at least as long as the threshold; this condition limits their efficiency. The Hamming distance matching rule counts the number of identical bits at counterpart positions, and this positional condition limits its efficiency as well. We propose the random r-continuous matching rule to increase the number of illegal access requests that one detector can match and thus improve efficiency. Given a detector d and an access request x, the rule is defined by Formula 1.

Formula 1: d matches x ≡ ∃ i ≤ l − r + 1 and ∃ j ≤ l − r + 1 such that x_{i+t} = d_{j+t} for t = 0, 1, …, r − 1

That is, if the detector and the access request share an identical sub-string whose length is at least the threshold r, at possibly different positions in the two strings, then d matches x.

4 Performance Analysis

We analyze and compare the number of illegal access requests that one detector can recognize under the different matching rules. Using the random r-continuous matching rule, one detector with l bits contains l−r+1 characteristic sub-strings, and one detector can match at most (l − r + 1)·2^(l−r+1) illegal access requests.
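The matching predicates compared in this paper are short enough to state directly in code. The sketch below is ours (the function names are not from the paper); it implements the rules for bit strings following the definitions of Sections 2 and 3, and the small driver reproduces, on a toy scale, the kind of per-detector count reported for the 8-bit prototype in Section 5.

```python
def r_contiguous(d: str, x: str, r: int) -> bool:
    """Match if d and x agree on at least r contiguous bits at the same positions."""
    run = best = 0
    for a, b in zip(d, x):
        run = run + 1 if a == b else 0
        best = max(best, run)
    return best >= r

def r_chunk(d_chunk: str, i: int, x: str) -> bool:
    """r-chunk detector: a window of length r anchored at position i of the string."""
    return x[i:i + len(d_chunk)] == d_chunk

def hamming(d: str, x: str, r: int) -> bool:
    """Match if at least r (not necessarily contiguous) counterpart bits are equal."""
    return sum(a == b for a, b in zip(d, x)) >= r

def random_r_continuous(d: str, x: str, r: int) -> bool:
    """Formula 1: d and x share some identical substring of length >= r,
    at possibly different positions in the two strings."""
    subs = {d[j:j + r] for j in range(len(d) - r + 1)}
    return any(x[i:i + r] in subs for i in range(len(x) - r + 1))

if __name__ == "__main__":
    # Toy version of the 8-bit setting: how many of the 256 strings does one
    # detector match under each rule (l = 8, r = 4)?  r_chunk is omitted here
    # because its detector is a single anchored window rather than a full string.
    l, r = 8, 4
    d = "10110010"
    universe = [format(v, "08b") for v in range(2 ** l)]
    for name, rule in [("r-contiguous", r_contiguous),
                       ("Hamming distance", hamming),
                       ("random r-continuous", random_r_continuous)]:
        print(name, sum(rule(d, x, r) for x in universe))
```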
Table 1 shows how many illegal access requests one detector can match under the different matching rules. We find that this number is largest for the random r-continuous matching rule, which shows that the rule can clearly improve the efficiency of the artificial immune algorithm.

Table 1. Number of illegal access requests one detector can recognize using different matching rules

  random r-continuous matching rule:   (l − r + 1)·2^(l−r+1)
  r-contiguous matching rule:          (l − r + 1)·2^(l−r)
  r-chunk matching rule:               (l − r − i + 1)·2^(l−r)
  Hamming distance matching rule:      (l − r + 1)·2^(l−r)

5 Prototype of the Random r-Continuous Matching Rule

We implement a prototype of the immune-based access control system on Linux, using the random r-continuous matching rule and the other matching rules. Access requests and detectors are represented by strings of eight bits. Self strings and access requests are stored in two text files. The exhaustive detector-generating algorithm is used to generate the original detectors, and the number of self-tolerant detectors is not limited. The prototype outputs the detection result and the number of detectors needed to check out all illegal access requests, which is then compared across the matching rules. The minimum value of r is 1, the maximum value is 8 and the increment is 1.

Firstly we use the exhaustive strategy to generate all 256 different strings of 8 bits. We choose some strings as the self sequence and the others as access requests; self and access requests are therefore complementary and all access requests are illegal. We create seven self files containing 0, 8, 16, 32, 64, 128 and 192 self strings respectively, together with the corresponding access request files, and we evaluate how many detectors are needed to check out all illegal access requests. The result is shown in Figure 1.

Fig. 1. Number of self-tolerance detectors needed to recognize all illegal access requests, as a function of the number of self strings (0–192), for the prototypes of the random r-continuous, r-contiguous, r-chunk and Hamming distance matching rules

From Figure 1 we find that the prototype can check out all illegal access requests with the smallest number of detectors when the random r-continuous matching rule is used. This result shows that the random r-continuous matching rule is more efficient than the other matching rules.

6 Prototype of Immune-Based Secure Storage

Lustre is an open-source storage area network system. It consists of three modules: client, MDS and OST, where OST is an object-based storage target. We use the random r-continuous matching rule to realize an immune-based access control system for the OST, and use Iozone to test I/O performance. We test the writing performance for a 1 MB file with block sizes of 4k, 8k, 16k, 32k, 64k, 128k, 256k, 512k and 1024k. The result is shown in Figure 2: the prototype of the secure storage system loses less than 8% of the writing performance of plain Lustre, so it can keep high I/O performance.

Fig. 2. Writing performance (bytes/s versus block size) of plain Lustre and of Lustre with the immune-based access control system

7 Conclusion

This paper presents the random r-continuous matching rule to improve the efficiency of the artificial immune algorithm. We analyze the number of illegal access requests that one detector can check out, and evaluate and compare the rule with the current matching rules.
The result proves the random r-continuous matching rule is more efficient than current matching rules. At last we using the random r-continuous matching rule to realize immune-based access control system for OST in the Lustre, the I/O performance testing proves that random r-continuous matching rule can used to realize the secure storage system that can keep high I/O performance. Different detector will contain same characteristic sub-string that will increase consumption of detection. Next step we analyze characteristic sub-string in detector and research new detector generating algorithm to improve efficiency. Random r-Continuous Matching Rule for Immune-Based Secure Storage System 299 References 1. Blaze, M.: A cryptographic file system for UNIX. In: Proceedings of 1st ACM Conference on Communications and Computing Security (1993) 2. Howard, J., Kazar, M., Menees, S., Nichols, D., Satyanarayanan, M., Sidebotham, R., West, M.: Scale and performance in a distributed file system. ACM TOCS 6(1) (February 1988) 3. Fu, K., Kaashoek, M., Mazieres, D.: Fast and secure distributed read-only file system. OSDI (October 2000) 4. Mazieres, D., Kaminsky, M., Kaashoek, M., Witchel, E.: Separating key management from file system security. SOSP (December 1999) 5. Li, X., Yang, J., Wu, Z.: An NFSv4-Based Security Scheme for NAS, Parallel and Distributed Processing and Applications, NanJiang, China (2005) 6. Gobioff, H., Nagle, D., Gibson, G.: Embedded Security for Network-Attached Storage, CMU SCS technical report CMU-CS-99-154 (June 1999) 7. John, D., Strunk, G.R., Goodson, M.L., Sheinholtz, C.A.N., Soules, G.R.: Self-Securing Storage: Protecting Data in Compromised Systems. In: 4th Symposium on Operating System Design and Implementation, San Diego, CA (October 2000) 8. Craig, A.N., Soules, G.R., Goodson, J.D., Strunk, G.R.: Metadata Efficiency in Versioning File Systems. In: 2nd USENIX Conference on File and Storage Technologies, San Francisco, CA, March 31-April 2 (2003) 9. Wylie, J., Bigrigg, M., Strunk, J., Ganger, G., Kiliccote, H., Khosla, P.: Survivable information storage systems. IEEE Computer, Los Alamitos (2000) 10. Ganger, G.R., Khosla, P.K., Bakkaloglu, M., Bigrigg, M.W., Goodson, G.R., Oguz, S., Pandurangan, V., Soules, C.A.N., Strunk, J.D., Wylie, J.J.: Survivable Storage Systems. In: DARPA Information Survivability Conference and Exposition, Anaheim, CA, 12-14 June 2001, vol. 2, pp. 184–195. IEEE, Los Alamitos (2001) 11. Kubiatowicz, J., Bindel, D., Chen, Y., Czerwinski, S., Eaton, P., Geels, D., Gummadi, R., Rhea, S., Weatherspoon, H., Weimer, W., Wells, C., Zhao, B.: OceanStore: An Architecture for Global-Scale Persistent Storage. In: ASPLOS (December 2000) 12. Freeman, W., Miller, E.: Design for a decentralized security system for network-attached storage. In: Proceedings of the 17th IEEE Symposium on Mass Storage Systems and Technologies, College Park, MD, pp. 361–373 (March 2000) 13. Miller, E.L., Long, D.D.E., Freeman, W., Reed, B.: Strong security for distributed file systems. In: Proceedings of the 20th IEEE international Performance, Computing and Communications Conference (IPCCC 2001), Phoenix, April 2001, pp. 34–40. IEEE, Los Alamitos (2001) 14. Miller, E.L., Long, D.D.E., Freeman, W.E., Reed, B.C.: Strong Security for NetworkAttached Storage. In: Proceedings of the 2002 Conference on File and Storage Technologies (FAST), January 2002, pp. 1–13 (2002) 15. Kallahalla, M., Riedel, E., Swaminathan, R., Wang, Q., Fu, K.: PLUTUS: Scalable secure file sharing on untrusted storage. 
In: Conference on File andStorage Technology (FAST 2003), San Francisco, CA, 31 March - 2 April 2003, pp. 29–42. USENIX, Berkeley (2003) 16. De-zhi, H., Xiang-lin, F., Jiang-zhong, H.: Study and Implementation of a iSCSI-Based Network Attached Storage Secure System. MINI-MICRO SYSTEMS 7, 1223–1227 (2004) 17. Goh, E.-J., Shacham, H., Modadugu, N., Boneh, D.: SiRiUS:Securing Remote Untrusted Storage. In: The proceedings of the Internet Society (ISOC) Network and Distributed Systems Security (NDSS) Symposium 2003(2003) 300 C. Tao et al. 18. Azagury, A., Cabetti, R., Factor, M., Halevi, S., Henis, E., Naor, D., Rinetzky, N., Rodeh, O., Satran, J.: A Two Layered Approach for Secuting an Object Store Network. In: SISW 2002 (2002) 19. Hewlett-Packard Company. HP OpenView storage allocator (October 2001), http://www.openview.hp.com 20. Brocade Communications Systems, Inc. Advancing Security in Storage Area Networks. White Paper (June 2001) 21. Hewlett-Packard Company. HP SureStore E Secure Manager XP (March 2001), http://www.hp.com/go/storage 22. Dasgupta, D.: An overview of artificial immune systems and their applications. In: Dasgupta, D. (ed.) Artificial immune systems and their applications, pp. 3–23. Springer, Heidelberg (1999) 23. de Castro, L.N., Timmis, J.: Artificial Immune Systems: A New Computational Approach. Springer, London (2002) 24. Forrest, S., Perelson, A., Allen, L., Cherukuri, R.: Self-nonself discrimination in a computer. In: Proceedings IEEE Symposium on Research in Security and Privacy, Los Alamitos, CA, pp. 202–212. IEEE Computer Society Press, Los Alamitos (1994) 25. Balthrop, J., Esponda, F., Forrest, S., Glickman, M.: Coverage and generalization in an artificial immune system. In: Langdon, W.B., Cantú-Paz, E., Mathias, K., Roy, R., Davis, D., Poli, R., Balakrishnan, K., Honavar, V., Rudolph, G., Wegener, J., Bull, L., Potter, M.A., Schultz, A.C., Miller, J.F., Burke, E., Jonoska, N. (eds.) Proceedings of the Genetic and Evolutionary Computation Conference (GECCO), 9-13 July 2002, pp. 3–10. Morgan Kaufmann Publishers, San Francisco (2002) 26. Farmer, J.D., Packard, N.H., Perelson, A.S.: The immune system, adaptation, and machine learning. Physica D 22, 187–204 (1986) 27. Harmer, P., Williams, G., Gnusch, P.D., Lamont, G.: An Artificial Immune System Architecture for Computer Security Applications. IEEE Transactions on Evolutionary Computation 6(3), 252–280 (2002) 28. Forrest, S., Perelson, A.S., Allen, L., Cherukuri, R.: Self-Nonself Discrimination in a computer. In: Proceeding of IEEE Symposium on Research in Security and Privacy, pp. 202–212. IEEE Computer Society Press, Los Alamitos (1994) 29. Helman, P., Forrest, S.: An efficient algorithm for generating random antibody strings, Technical Report CS-94-07, The University of New Mexico, Albuquerque, NM (1994) 30. D’haeseleer, P., Forrest, S., Helman, P.: An immunological approach to change detection: algorithms, analysis and implications. In: McHugh, J., Dinolt, G. (eds.) Proceedings of the 1996 IEEE Symposium on Computer Security and Privacy, USA, pp. 110–119. IEEE Press, Los Alamitos (1996) 31. D’haeseleer, P.: Further efficient algorithms for generating antibody strings, Technical Report CS95-3, The University of New Mexico, Albuquerque, NM (1995) nokLINK: A New Solution for Enterprise Security Francesco Pedersoli and Massimiliano Cristiano Spin Networks Italia, Via Bernardino Telesio 14, 00195 Roma, Italy {fpedersoli,mcristiano}@spinnetworks.com Abstract. 
The product nokLINK is a communication protocol carrier which transports data with greater efficiency and vastly greater security by encrypting, compressing and routing information between two or more end-points. nokLINK creates a virtual, “dark” application (port) specific tunnel which ensures protection of end-points by removing their exposure to the Internet. By removing the exposure of both end-points (Client and Server) to the Internet and LAN, you remove the ability for someone or something to attack either end-point. If both endpoint have not entry point, attack becomes extremely, if not impossible to succeed. nokLINK is Operating System independent and the protection level is applied starting from the application itself: advanced anti-reverse engineering technique are used, a full executable encryption and also the space memory used by nokLINK is encrypted. The MASTER-DNS like structure permit to be very resistant also to Denial of Service attack and the solution management is completely decoupled by the Admin or root rights: only nokLINK Administrator can access to security configuration parameters. 1 Overview The inherent makeup of nokLINK implies two purposes. The first is the name of a communications protocol that has the potential to work with or without TCP/IP. This protocol includes everything needed to encrypt, route, resolve names, and ensure the delivery of upper layer packets. The protocol itself is independent of any particular operating system. It has the potential of running on any OS or even be included in any hardware solutions as firmware. The second is “nokLINK The Product, a communication protocol carrier which transports data with greater efficiency and vastly greater security by encrypting, compressing and routing information between two or more end-points. nokLINK creates a virtual “dark” application [port] specific tunnel, which ensures protection of endpoints by removing their exposure to the Internet. By removing the exposure of both end-points to the internet and LAN, you remove the ability for someone or something to attack either end-point. If both end-point have not entry point, attack becomes extremely, if not impossible to succeed. If it can’t been seen, it can’t be attacked. In most scenarios, if you block inbound access to an end-point, then you loose the ability to communicate with that device, but with nokLINK any permitted application can communicate in a bi-directional (2-way) manner but contrary to typical communication, without exposing those applications to not-authorized devices. The result is increased security plus improved availability without inheriting security threats. E. Corchado et al. (Eds.): CISIS 2008, ASC 53, pp. 301–308, 2009. © Springer-Verlag Berlin Heidelberg 2009 springerlink.com 302 F. Pedersoli and M. Cristiano nokLINK works similar to DNS by receiving client requests and resolving to a server but in addition to routing requests, nokLINK provides strong authentication. nokLINK provides this “DNS” like functionality without exposing in-bound connections to the internet through the use of an intermediate “Master Broker”. Communication routing is not possible without the nokLINK Master Broker authorizing each device’s permission. Once permission is granted, client and server can communicate via the Master without interruption. Conceptually, the nokLINK master is a smart router with built in encryption and authentication. 
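As a conceptual illustration of this broker model (not nokLINK's actual implementation; all class and function names below are ours), the master can be thought of as a relay that first checks a permission table and then forwards opaque, already-encrypted payloads between registered endpoints without ever holding the keys.

```python
class MasterBroker:
    """Toy model of a permission-checking relay: it routes by .vsx name but only
    ever sees ciphertext produced by the endpoints themselves."""

    def __init__(self):
        self.endpoints = {}        # vsx name -> callable receiving (sender, ciphertext)
        self.permissions = set()   # allowed (src, dst) pairs

    def register(self, name, deliver):
        self.endpoints[name] = deliver

    def allow(self, src, dst):
        self.permissions.add((src, dst))

    def route(self, src, dst, ciphertext: bytes):
        if (src, dst) not in self.permissions:
            raise PermissionError(f"{src} is not authorized to reach {dst}")
        self.endpoints[dst](src, ciphertext)   # forwarded unread

# usage sketch
broker = MasterBroker()
broker.register("server.company.vsx", lambda s, c: print(f"from {s}: {len(c)} bytes"))
broker.allow("client.company.vsx", "server.company.vsx")
broker.route("client.company.vsx", "server.company.vsx", b"...already encrypted payload...")
```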
If a system like nokLINK can be deployed without exposing both client and server end-point to inbound requests, then a device firewall can be used to ensure both endpoints are protected from potential intrusions. nokLINK includes a software firewall with equivalent to or better security than that of a hardware-based firewall to protect each machine from any other device. An increasing threat in corporate systems is LAN based attacks which are typically much harder to stop without losing productivity. By implementing nokLINK and the nokLINK firewall, an organization can maintain even higher levels of availability without exposure to attacks. Almost all systems today relay on Internet Protocol (IP) to communicate but even on the same LAN. nokLINK removes the dependency on Internet Protocol (to date IP is still utilized, but simply for convenience). In fact, nokLINK allows for the elimination of virtually all of the complex private communications lines, IP router configuration, and management. Given that it is protocol-independent, it means that almost any IP-based communication can benefit from the secure tunneling that nokLINK provide. nokLINK can be used for many IP-based applications. 2 Architecture There are four nokLINK components which make up the nokLINK architecture. A single device may contain: just the client, client + server, the master or master + authenticator in a single installation. 1. NOKLINK CLIENT: The nokLINK client is the component that allows a computer to access applications [ports] on another device with the nokLINK server component. This client component is part of all nokLINK installs in the form of an “Installation ID”. The Installation ID is associated with this component. The client itself may be context-less; this means that the nokLINK may have permission to connect to any server in any context (given proper permission configuration) without having to reinstall any software. In other words, any client could connect to http://company.vsx and http://other.vsx just by typing in the address in the browser. 2. NOKLINK SERVER: The nokLINK server component is the component that allows nokLINK clients to connect to local applications based on a vsx name. For instance, if a web server needed securing, a nokLINK server would be installed on the web server; then anyone with a nokLINK client and permission could access that server from anywhere in the world by its vsx name, i.e. http://company.vsx. The server and client together are the components that create the tunnel. No other component can “see” into the transmission tunnel of any other real time pair of communicating server and client. The encryption system used between client and nokLINK: A New Solution for Enterprise Security 303 server ensures that only the intended recipient has the ability to un-package the communication and read the data, this includes the master component. 3. NOKLINK MASTERCOMPONENT: The Master components has two main purposes: authenticating devices and routing communications. While the Master is responsible for routing communications between end points, it is not part of the communication tunnel and therefore cannot read data between them. This ensure that endpoint to endpoint security is always maintained. 4. NOKLINK MASTER AUTHENTICATOR (NA): The nokLINK Master Authenticator (NA) is the console for setting authentication and access rights for each nokLINK enabled device within each nokLINK vsx context. 
A web interface provides administrators a system to control nokLINK’s transport security via nokLINK names, nokLINK domains and nokLINK sub-domains. For example an administrator can allow a machine called webserver.sales.company.vsx to communicate only to xxx.sales.company.vsx or xxx.company.vsx or one specific nokLINK machine. Administrators can manage device security settings in a global manner or in a very specific manner depending on the companies objectives. Besides other main functions are: 1. nokLINK Communication Interceptor: The component that provides seamless use of nokLINK for the client and server is a “Shim” which intercepts .vsx communication and routes the requests to the nokLINK master. The nokLINK shim intercepts, compresses, encrypts and routes data including attaching the routing information required for the master to deliver. The data is wrapped by the nokLINK protocol, essentially transforming it from the original protocol to the nokLINK protocol. By wrapping nokLINK around the original protocol you can further ensure the privacy of the data and the privacy of the protocol in use. Packet inspection systems used to filter and block specific protocols are ineffective in identifying protocols secured by nokLINK. Upon arrival of the data at the endpoint, nokLINK unpacks the communication back to the original protocol and reintroduces the data to the local IP stack to ensure the data is presented transparently to the upper level applications. As a result, nokLINK can be introduced to virtually any application seamlessly. 2. Device Authorization: The node authorization and rules configuration is managed at the nokLINK Authenticator. The Master authenticates, thus it dictates which client can be a part of a specific nokLINK context. During install, a unique “DNA” signature (like TPM via software) is created along with a .vsx name which is registered with the nokLINK Authenticator (NA). The nokLINK device identifies itself to the Master and registers its name upon installation. The Master determines the authenticity of inquiring nokLINK device and its right to conduct the requested activity. When access is requested for a specific machine, the master authenticates the machine but does not interfere with authentication for the application in use. The Master is like a hall monitor; i.e. it does not know what the person will do in any particular room he has permission to visit but has full control of who can get to what room. 304 F. Pedersoli and M. Cristiano 3 Features and Functionality nokLINK provides many features and functionality depending on implementation, objectives and configuration including: • Secure communication protocol able to encrypting, compressing and routing information between end-points. • Virtual “Dark” network that ensures protection of end-points removing exposure to the Internet. • Seamless access to services from network to network without re-configuration of firewalls or IP addresses. • Communication between systems without those systems being visible to Internet. • Low level software firewall. • Protocol independent, which means that any communication can be secured. Most extra-net connectivity products today offer connectivity for clients to a LAN from within a LAN or from the internet. A simple client is installed on the users’ PC; this allows users access to the corporate network. Unfortunately this access is also available to anybody else who knows a user’s name and has time and/or the patience to guess passwords. 
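The text gives only an example of such a rule; the configuration format of the nokLINK Authenticator is not documented here. The following is a toy sketch of how name-scoped permissions of that kind could be expressed and checked; the pattern syntax, rule table and function names are our assumptions.

```python
from fnmatch import fnmatch

# Hypothetical rule table: device name -> name patterns it may talk to.
RULES = {
    "webserver.sales.company.vsx": ["*.sales.company.vsx", "*.company.vsx"],
    "laptop.guest.company.vsx":    ["mail.company.vsx"],    # one specific machine
}

def may_communicate(src: str, dst: str) -> bool:
    """True if the authenticator's rules allow src to open a tunnel to dst."""
    return any(fnmatch(dst, pattern) for pattern in RULES.get(src, []))

print(may_communicate("webserver.sales.company.vsx", "db.sales.company.vsx"))  # True
print(may_communicate("webserver.sales.company.vsx", "db.other.vsx"))          # False
```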
nokLINK functions differently than a VPN. nokLINK is not network specific and does not attach clients to foreign networks. nokLINK install client software that identifies each PC individually and provide remote access to applications instead of remote access to VPNs. This, coupled with the nokLINK authenticator, ensures the identification of any device containing nokLINK attempting to get at company data. For further security, nokLINK opens only individual, user configured ports to individual nodes, thus protecting other assets to which access is not permitted from outside PCs. End-point to end-point security starts with the PC identification. At installation the nokLINK client creates a unique DNA signature based on many variables including hardware characteristics of the PC and time of installation. Every instance of nokLINK is unique regardless of the operating environment to further eliminate the possibility of spoofing. When communication is initiated, the nokLINK server receive a noklink name terminating in .vsx. This naming scheme is identical to DNS naming schemes. The difference is that only nokLINK clients understand .vsx extension. This name is used instead of standard DNS names when accessing nokLINK servers. For instance, if a web server is being protected by nokLINK than the nokLINK enabled end user would type http://webserver.mycomp.vsx into their browser. The nokKERNEL take the request, encrypts the information and sends it out to one or more nokLINK Master. This allow a workstation to communicate with a server without either of them being visible on the Internet, as it is shown in the Fig. 1. 4 Security Elements nokLINK is a multi-layered monolithic security solution. Using various techniques, it encloses everything needed to secure communications between any two nokLINK nokLINK: A New Solution for Enterprise Security 305 enabled nodes, using various techniques to do this. It impacts three different security areas: Encryption Security, Transport Security, End Point Security. 4.1 Encryption Security The strength of public algorithms is well-known. nokLINK uses state of the art encryption algorithm, but goes further than just the single level of encryption. The information traded between systems is not the actual key or algorithm. It is simply synchronization information which only the two end points understand, that is the equivalent of one end node telling the other “Use RNG (Random Number Generators) four to decode this message.” The strength of nokLINK’s encryption is based on a family of new Random Number Generators This RNG family is based on years of research in this area. The nokLINK encryption system encrypts three times before sending out the packet: once for the actual data going out, once for the packet header and finally both together. The upper-layer data is encrypted with a synchronization key. The key is not an actual key, it contains information for the system to synchronize the RNGs on the end points. This way the system stays as secure, but with much less overhead. The only two nodes that understand this encrypted data are the client and the server. The intermediate machines do not and cannot open this section of the packet. Fig. 1. 4.2 Transport Security This layer of security is an extra layer of security in comparison with other security solutions. It deals with permissions for communications and the dynamic format of the nokLINK packets and it is composed by: • TRAFFIC CONTROLLER: nokLINK affords a new control that eliminates this type of attack. 
While maintaining all the encryption security of other products, nokLINK includes controls to mandate which nodes can communicate with which 306 F. Pedersoli and M. Cristiano other nodes. The basic requirement is to get a hold of a particular nokLINK client using the same nokLINK context. Each nokLINK context uses a different family of encryption RNG’s and cannot be used to communicate with another context. Without that nokLINK client, the nokLINK server remains invisible to potential intruders, as it is shown in Fig. 2. In the nokLINK vsx name environment the attacker can’t see the server to attack because the name is sent to a DNS server for resolution. This makes it extremely difficult, if not impossible to break into a nokLINK server, while leaving it completely accessible to those who need it. In a nokLINK environment the nodes are identified uniquely. The master server uses this particular ID to determine which nodes are permitted to communicate with which servers; all controlled by the end user. In the unlikely event that someone comes up with a super crack in the next ten years that can read nokLINK packets, they still will not be able to communicate directly with another nokLINK node because of this level of security, as it is shown in Fig.3. Here you see users accessing exactly those services and applications they are allowed to access. There is redundancy in the security. A PC must have a nokLINK v2 client installed and permission must have been granted between the client and the server in the nokLINK Master Authenticator. Each PC generates an exclusive unique identifier at install time. The system recognizes this ID and uses it for control, i.e. nokLINK Client ID 456 can communicate with nokLINK Server ID 222. If the hard drive is removed from the PC, or is attempted to be cloned, it is likely that the ID will be corrupted because of the change of hardware. If not, as soon as one of the PCs [cloned or original] connects, all the device with the same ID [whether the original and/or cloned] will stop communicating, as the system will allow only one PC with a specific ID to operate in the nokLINK environment. • Dynamic Packet Format: the format of the nokLINK protocol is dynamic. The location of “header” information changes from packet to packet. The implication is that, in the unlikely event that a packet is broken, it will be absolutely useless to attempt to replicate it in efforts of breaking other packets. Entirely new efforts will have to be put forth to break a second and a third and a forth (and so on) packet. With other protocols, the header information is always in the same spot, making it easy to sniff, analyze and manipulate data. 4.3 End Point Security Another enhancement, and probably the most significant compared to other security solutions, is nokLINK’s end point security. A significant effort of this security is based on anti-reverse engineering techniques. There are many facets to anti-reverse engineering including: • • • • • All code in memory is encrypted. Each time the code passes from ring 0 to ring 3 it is encrypted to avoid monitoring. Each executable is generated individually. A unique identifier is generated at install time for each node. Protection versus cloning - if someone successfully clones it will stop working, thus alerting the system administrator of a problem. nokLINK: A New Solution for Enterprise Security 307 • There are certain features embedded in the software to allow it to detect the presence of debugging software. 
Once detected, nokLINK takes steps to avoid being hacked into. Fig. 2. Fig. 3. 4.4 nokLINK Firewall The nokLINK firewall is a low level firewall which blocks incoming and outgoing packets to and from the Network Interface Card (NIC). This would normally block all communications in and out of the PC. nokLINK still manages to communicate via standard ports to other vsx nodes through the permanent inclusion of an outgoing connection exception from port 14015 (the default nokLINK tunneling port). Any vsx name presented to the TCP/IP stack is treated well before it reaches the Internet card. 308 F. Pedersoli and M. Cristiano Once the system recognizes the vsx name, it diverts the packet to nokLINK. nokLINK encrypts and encapsulates the packet sending it out via port 14015, to the nokLINK Master. In this way, any applications utilizing the vsx name can communicate from “behind” the nokLINK firewall. The following graphic shows the internal configuration of a PC. This machine would be able to open a browser and access a site using a vsx name (i.e http://webserver.noklink.vsx). It would not be able to access a site without a vsx name. In addition to nokLINK traffic, this machine would only be capable of SMTP traffic for outgoing mail and POP3 for incoming mail, as it is shown in Fig. 4. Fig. 4. References 1. Gleeson, B., Lin, A., Heinane, J., Armitage, G., Malis, A.: A Framework for IP based Virtual Private Networks. Internet Engineering Task Force, RFC 2764 (2000) 2. Herrero, Á., Corchado, E., Gastaldo, P., Leoncini, D., Picasso, F., Zunino, R.: Intrusion Detection at Packet Level by Unsupervised Architectures. In: Yin, H., Tino, P., Corchado, E., Byrne, W., Yao, X. (eds.) IDEAL 2007. LNCS, vol. 4881, pp. 718–727. Springer, Heidelberg (2007) 3. Kaufman, C., Perlman, R., Speciner, M.: Network Security: Private Communication in a Public World, 2nd edn. Prentice Hall, Englewood Cliffs (2002) SLA and LAC: New Solutions for Security Monitoring in the Enterprise Bruno Giacometti IFINET s.r.l., Via XX Settembre 12, 37129 Verona, Italy b.giacometti@ifinet.it Abstract. SLA and LAC are the solutions developed by IFInet to better analyze firewalls logs and monitor network accesses respectively. SLA collects the logs generated by several firewalls and consolidates them; by means of SLA all the logs are analyzed, catalogued and related based on rules and algorithms defined by IFInet. LAC allows IFInet to identify and isolate devices that access to a LAN in an unauthorized manner, its operation is totally non-intrusive and its installation does not require any change neither to the structure of the network nor to single hosts that compose it. 1 Overview As of today most organizations use firewalls and monitor their networks. The ability to screening firewall logs to determine suspicious traffic is the key to an efficient utilization of firewall and IDS/IPS systems. Anyway this is a difficult task, particularly if there is the need to analyze a great amount of log. SLA is IFInet approach to solve this problem. In the same way it’s fundamental to monitor internal networks, but monitoring alone is not sufficient: it’s necessary a proactive control on internal networks to prevent unauthorized accesses. 2 SLA: Security Log Analyzer In order to make more effective monitoring firewalls activities log, IFInet has created a proprietary application, called Security Log Analyzer (SLA), which collects the logs generated by firewalls and consolidates them into a SQL database. 
SLA points out to technicians any anomaly or suspicious activity detected and automatically recognizes and catalogues most of the traffic (about 90%) seen by the firewall, allowing IFInet's technical staff to focus on the analysis and correlation of the remaining logs. Through SLA, the perimeter security system is controlled by analysing data on traffic, ports and services, and through a wide variety of charts that allow IFInet technicians to monitor security events: analysis of the traffic in a given timeframe, comparison with the traffic volume over a significant reference period (e.g. the previous month) and detection of possible anomalies. The chart in Fig. 1, for example, highlights the outgoing traffic, divided by service.
Fig. 1. The outgoing traffic, divided by service
SLA records in a Black List all IP addresses that carry out illegal or potentially intrusive activities, on the basis of the rules implemented by IFInet technicians: from the moment an IP address is added to the Black List, any subsequent activity from it is detected in real time by IFInet technicians, who analyse the event and take immediate action. With SLA, IFInet technicians can run extremely detailed queries on the log database and can view, in real time, only the logs relating to events of critical relevance, which therefore represent a danger to the integrity of customer networks and systems.
Fig. 2 shows how, for one customer, SLA highlighted many IP scans of its network and, in particular, indicated that part of this potentially intrusive activity passed through the firewall, taking advantage of traffic permitted by the security policies. Thanks to the highlighted event, the technical staff can instantly detect whether a particular external address, which previously performed a host scan (and was therefore added to the Black List by SLA), has managed to cross the perimeter security system through an open port and reach a system (e.g. a web server) inside the customer's network. By selecting the event of interest (the line highlighted by the red rectangle), the technician can see the details of the anomalous activity. SLA reports the number of host-scan attempts against the public IP addresses assigned to a particular customer; as a further example, a host scan made towards a range of 18 addresses can be seen. By selecting the relevant line, the technician can analyse the details of this anomalous activity and, for example, establish that the host scan was caused by the Sasser worm present on 151.8.35.67. All intrusive events, or events that may otherwise pose a danger to the integrity of the customer's network, are promptly notified by email (or telephone).
Fig. 2. Example of SLA report
This service provides constant and effective control of all the events recorded by the firewall and allows for timely intervention in case of critical events.
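Before turning to LAC, a sketch of the Black List mechanism described above may help fix ideas. The actual rules and thresholds SLA applies are proprietary; the code below only illustrates the general shape of one such rule, flagging a source address as a host scanner (and black-listing it) once it has probed more than a chosen number of distinct destination hosts within a time window. The threshold, window and data structures are assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta

SCAN_WINDOW = timedelta(minutes=5)   # assumed observation window
SCAN_THRESHOLD = 15                  # assumed number of distinct targets

blacklist = set()
recent_targets = defaultdict(list)   # src_ip -> [(timestamp, dst_ip), ...]


def process_log_entry(ts: datetime, src_ip: str, dst_ip: str) -> None:
    """Flag src_ip as a host scanner if it probes too many distinct hosts."""
    recent_targets[src_ip].append((ts, dst_ip))
    # Keep only events inside the observation window.
    recent_targets[src_ip] = [
        (t, d) for (t, d) in recent_targets[src_ip] if ts - t <= SCAN_WINDOW
    ]
    distinct_hosts = {d for (_, d) in recent_targets[src_ip]}
    if len(distinct_hosts) > SCAN_THRESHOLD and src_ip not in blacklist:
        blacklist.add(src_ip)
        print(f"{ts} blacklisted {src_ip}: host scan over "
              f"{len(distinct_hosts)} addresses")


# Toy example: one source probing a range of 18 addresses.
now = datetime(2008, 6, 1, 10, 0, 0)
for i in range(18):
    process_log_entry(now + timedelta(seconds=i), "151.8.35.67",
                      f"10.0.0.{i + 1}")
```

In SLA the equivalent logic runs against the consolidated log database, and further rules then correlate later traffic from black-listed addresses with the ports actually open in the customer's security policy.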
3 LAC: LAN Access Control
LAC (LAN Access Control) is the solution that allows IFInet to identify and isolate devices (PCs and other network equipment) that access a LAN in an unauthorized manner. Unlike many systems currently on the market, the operation of LAC is totally non-intrusive and its installation requires no change either to the structure of the network or to the individual hosts that compose it. LAC therefore prevents unauthorized devices from accessing the corporate network: access is allowed only after certification by an authorized user (e.g. the network administrator).
LAC can operate in two modes: active and passive. In passive mode (the default) all the hosts on the network segment to which LAC is connected are detected. In active mode the unauthorized hosts are virtually isolated from the corporate network. LAC therefore enforces the following rules:
• the organization's hosts and equipment are allowed to use the network without any limitation;
• guests are allowed to access the network after authorization, but only for a limited amount of time and using certain services (e.g. Internet navigation and e-mail);
• all other hosts cannot connect, because they are not authorized.
LAC has the following key aspects:
• Security: it identifies and isolates devices (PCs and other network equipment) that access a LAN in an unauthorized manner.
• Ease of management: LAC operates in an entirely non-intrusive way and requires no changes either to the structure of the network or to the individual hosts that make it up.
• Guest user management: total control over how and what guests are allowed to do.
• Important features: census of the nodes of one or more LAN/WLAN or wireless networks, isolation of unauthorized hosts, control of permitted services, reporting tools and alerting.
The main features of LAC are:
1. Census of the LAN nodes. When LAC runs in passive mode, it records in a database all the hosts detected on the network; in particular, it keeps track of each host's IP and MAC addresses. In order to detect all the hosts, LAC must keep running for a variable amount of time, depending on how often hosts access the network. An example of this detection is shown in Fig. 3.
2. Isolation of unauthorized hosts. Used in active mode, LAC performs a virtual isolation of all non-authorized hosts: once isolated, a host can no longer send or receive data to or from any other host except LAC.
3. Host detection. One interesting piece of information about a detected host is its physical location within the organization: LAC detects the switch port to which the unauthorized host is physically connected. This feature works only if the network switches in use support SNMP; all the network switches (included in the LAC configuration) are interrogated using specific SNMP queries.
4. Managing guest users. An organization is often visited by external staff who need to connect their notebooks to the network and to use certain services (e.g. Internet navigation and e-mail). If such a host accesses the network without any authorization, LAC detects the new host and treats it as unauthorized. The guest, however, if in possession of appropriate credentials (e.g. supplied by staff at the reception), can authenticate to unlock the PC and use the enabled services. When registering a new host, LAC can be asked to generate a new user (or edit an existing one), to generate access credentials (username and password) and to set their period of validity (a minimal sketch of this step is given after this list). Using this information the staff, through a specific web interface, can enable or unlock the desired host. Guests only need to connect their PC to the network, browse to the LAC web page and enter the credentials provided. As an optional feature, the system administrator can choose from a list the protocols that the host may use once authenticated; this makes it possible to limit the resources available on the network and thus to keep effective control over allowed activities.
5. Reports. LAC provides a series of reports that allow the administrator to analyse network access data, also with graphs, based on several criteria: MAC address, IP address, usernames, denied accesses, enabled accesses, average duration of accesses, etc.
6. Alerting. LAC records all the collected events in a database and is able to send the information needed to control access to the network to the network administrator via email, SMS or the administrative GUI, as shown in Fig. 4.
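How LAC actually generates and stores guest credentials is not specified in the paper. The sketch below (all names and the default validity period are assumptions) only illustrates the idea described in feature 4: create a guest user, give it a random password and attach an expiry time after which the host reverts to the unauthorized status.

```python
import secrets
import string
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class GuestCredential:
    username: str
    password: str
    valid_until: datetime

    def is_valid(self, at: datetime) -> bool:
        return at <= self.valid_until


def create_guest(name: str, hours_valid: int = 8) -> GuestCredential:
    """Generate time-limited access credentials for a guest host."""
    alphabet = string.ascii_letters + string.digits
    password = "".join(secrets.choice(alphabet) for _ in range(10))
    return GuestCredential(
        username=f"guest-{name}",
        password=password,
        valid_until=datetime.now() + timedelta(hours=hours_valid),
    )


# Reception staff registers a visiting consultant for one working day.
cred = create_guest("consultant01")
print(cred.username, cred.password, "valid until", cred.valid_until)
```

In LAC these credentials are handed to the guest at the reception and checked by the captive web page mentioned above; when `valid_until` passes, the host falls back to the unauthorized status.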
Fig. 3. Census of the LAN nodes
3.1 Sure to Be Safe?
Organizations with an information system, whether public or private, certainly have a network infrastructure that enables the exchange of information and teamwork. The corporate network is in most cases built according to the Ethernet standard, which combines speed and reliability with very limited cost. It is very easy, in fact, to attach a device to an Ethernet network. For the same reason, however, such a network is exposed to a very common kind of violation, which may be called "physical intrusion": it can take place through a spare RJ45 socket, the network cable of a disconnected device, a switch or hub port, and so on. If an attacker connects his laptop to an access point of an Ethernet network, the PC becomes part of the network and can therefore access its resources and even compromise the security of corporate data and information. It therefore becomes extremely important to recognize immediately the moment an unauthorized host accesses the network and to prevent it from using the network's resources. At the same time it is important to guarantee normal operation of the network to all allowed hosts, for example a commercial agent or an external consultant (a "guest" user) who needs to connect a notebook to the network in order to use its resources.
3.2 LAC Architecture
LAC therefore takes care of ensuring compliance with the network access rules and of protecting the integrity and safety of corporate data. LAC is a software solution consisting of a set of applications running on a Linux-based system, composed of the following elements: a web administration interface, the core program (LAC), a database for user management, a database for information recovery and, optionally, an appliance equipped with 3 or 6 interfaces to manage up to 3 or 6 LANs/VLANs.
Fig. 4. Alerting
3.3 LAC Operational Modes
LAC can operate in two modes: active and passive. In passive mode (the default) all the hosts on the network segment to which LAC is connected are detected. In active mode the unauthorized hosts are virtually isolated from the network. Each node on the network can be in one of the following statuses: unauthorized, authorized, authorized with authentication, or address mismatch. In the unauthorized status, a host is virtually isolated from the network; this prevents it from transmitting and receiving data. In the authorized status a host receives no treatment from LAC and operates normally. A user of an unauthorized host may authenticate to LAC; if authentication succeeds, LAC revokes the unauthorized status and grants authorization for a limited amount of time. This status is called "authorized with authentication"; when the authorization time expires, the host reverts to the unauthorized status. LAC is also able to detect changes in the IP address of a host and handles this potential anomaly by assigning the address mismatch status to that host. The management of statuses and of their validity times is entrusted to LAC, which is responsible for assigning a status to each host and may change the status of a host at any time. The credentials used to authenticate a user (in case the administrator needs to unlock a client, enabling its access to the network) are generated by LAC on behalf of the administrator.
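As a compact restatement of the status model just described, the sketch below encodes the four node statuses and the transitions mentioned in the text (successful authentication, expiry of the authorization time, detection of an IP change). Class, method and attribute names are assumptions, not LAC's API.

```python
from datetime import datetime, timedelta
from enum import Enum


class Status(Enum):
    UNAUTHORIZED = "unauthorized"                      # virtually isolated
    AUTHORIZED = "authorized"                          # untouched by LAC
    AUTHORIZED_BY_AUTH = "authorized with authentication"
    ADDRESS_MISMATCH = "address mismatch"              # IP change detected


class Node:
    def __init__(self, mac: str, ip: str):
        self.mac, self.ip = mac, ip
        self.status = Status.UNAUTHORIZED
        self.expires = None

    def authenticate(self, ok: bool, now: datetime,
                     validity: timedelta = timedelta(hours=8)) -> None:
        """Successful authentication grants time-limited authorization."""
        if ok and self.status == Status.UNAUTHORIZED:
            self.status = Status.AUTHORIZED_BY_AUTH
            self.expires = now + validity

    def tick(self, now: datetime, current_ip: str) -> None:
        """Periodic re-evaluation of the node's status."""
        if current_ip != self.ip:
            self.status = Status.ADDRESS_MISMATCH
        elif (self.status == Status.AUTHORIZED_BY_AUTH
              and self.expires is not None and now > self.expires):
            self.status = Status.UNAUTHORIZED   # authorization expired


# Example: a guest host authenticates, then its authorization lapses.
now = datetime(2008, 6, 1, 9, 0)
node = Node("00:11:22:33:44:55", "192.168.1.50")
node.authenticate(ok=True, now=now)
node.tick(now + timedelta(hours=9), "192.168.1.50")
print(node.status)   # Status.UNAUTHORIZED again
```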
3.4 LAC Requirements
Briefly, the requirements for the operation of LAC are:
• an Ethernet network;
• the IP version 4 network protocol;
• a network access point for LAC on each network segment (LAN or VLAN) to be monitored/controlled.
4 Conclusion
We have analysed how a network can be proactively protected, and the results have been used to implement SLA and LAC. With SLA and LAC it is possible to identify and block suspicious and potentially dangerous network traffic. Future development of SLA will focus on two major areas: real-time analysis of the monitored devices, in order to identify suspicious traffic as soon as it enters the perimeter, and correlation of firewall logs with logs originating from other security devices, e.g. antivirus systems. LAC will also correlate information from the Ethernet network with information from other devices, e.g. 802.1x switches or network scanners, in order to enforce a finer-grained control on the network.
References
1. Abad, C., Taylor, J., Sengul, C., Yurcik, W., Zhou, Y., Rowe, K.: Log correlation for intrusion detection: a proof of concept. In: Proc. 19th Annual Computer Security Applications Conference, pp. 255–264. IEEE Press, New York (2003)
2. Cuppens, F., Miege, A.: Alert correlation in a cooperative intrusion detection framework. In: Proc. 2002 IEEE Symposium on Security and Privacy, pp. 202–215. IEEE Press, New York (2002)
3. Debar, H., Wespi, A.: Aggregation and Correlation of Intrusion-Detection Alerts. In: Proc. 4th Int. Symp. Recent Advances in Intrusion Detection, RAID 2001, pp. 85–103. Springer, Berlin (2001)
4. Corchado, E., Herrero, Á., Sáiz, J.M.: Detecting compounded anomalous SNMP situations using cooperative unsupervised pattern recognition. In: Duch, W., Kacprzyk, J., Oja, E., Zadrożny, S. (eds.) ICANN 2005. LNCS, vol. 3697, pp. 905–910. Springer, Heidelberg (2005)
5. Herrero, Á., Corchado, E., Gastaldo, P., Zunino, R.: A comparison of neural projection techniques applied to Intrusion Detection Systems. In: Sandoval, F., Gonzalez Prieto, A., Cabestany, J., Graña, M. (eds.) IWANN 2007. LNCS, vol. 4507, pp. 1138–1146. Springer, Heidelberg (2007)
6. Ridella, S., Rovetta, S., Zunino, R.: Circular back-propagation networks for classification. IEEE Trans. on Neural Networks 8, 84–97 (1997)

Author Index
Agarwal, Suneeta 219, 251
Aiello, Maurizio 170
Alshammari, Riyad 203
Appiani, Enrico 43
Baiocchi, Andrea 131, 178
Banasik, Arkadiusz 274
Banković, Zorana 147
Ben-Neji, Nizar 211
Bengherabi, Messaoud 243
Bojanić, Slobodan 147
Booker, Queen E. 19
Bouhoula, Adel 123, 211
Bovino, Fabio Antonio 280
Briola, Daniela 108
Bui, The Duy 266
Buslacchi, Giuseppe 43
Caccia, Riccardo 108
Cavallini, Angelo 27
Chatzis, Nikolaos 186
Cheriet, Mohamed 243
Chiarella, Davide 170
Cignini, Lorenzo 131
Cimato, S. 227
Cislaghi, Mauro 11
Corchado, Emilio 155
Cristiano, Massimiliano 301
De Domenico, Andrea 116
Decherchi, Sergio 61
DeJiao, Niu 294
Domínguez, E. 139
Eleftherakis, George 11
Eliades, Demetrios G. 69
Faggioni, Osvaldo 100
Ferri, Sara 11
Flammini, Francesco 92
Forte, Dario V. 27
Gabellone, Amleto 100
Gaglione, Andrea 92
Gamassi, M. 227
Gastaldo, Paolo 61
Giacometti, Bruno 309
Giuffrida, Valerio 11
Golledge, Ian 195
Gopych, Petro 258
Guazzoni, Carlo 1
Guessoum, Abderrazak 243
Gupta, Rahul 219
Harizi, Farid 243
Herrero, Álvaro 155
Hussain, Aini 235
Jonsson, Erland 84
Kapczyński, Adrian 274
Kollias, Stefanos 286
Larson, Ulf E. 84
Le, Thi Hoi 266
Leoncini, Davide 100
Lieto, Giuseppe 163
Lioy, Antonio 77
Losio, Luca 27
Luque, R.M. 139
Maggiani, Paolo 100
Maiolini, Gianluca 131, 178
Martelli, Maurizio 108
Maruti, Cristiano 27
Mascardi, Viviana 108
Matsumoto, Soutaro 123
Mazzaron, Paolo 116
Mazzilli, Roberto 11
Mazzino, Nadia 116
Mazzocca, Nicola 92
Meda, Ermete 116
Mezai, Lamia 243
Milani, Carlo 108
Mohier, Francois 11
Molina, Giacomo 178
Moscato, Vincenzo 92
Motta, Lorenzo 116
Muñoz, J. 139
Negroni, Elisa 11
Neri, F. 35
Nieto-Taladriz, Octavio 147
Nilsson, Dennis K. 84
Ntalianis, Klimis 286
Orlandi, Thomas 27
Orsini, Fabio 163
Pagano, Genoveffa 163
Palomo, E.J. 139
Papaleo, Gianluca 170
Pedersoli, Francesco 301
Pettoni, M. 35
Picasso, Francesco 84, 116
Piuri, V. 227
Polycarpou, Marios M. 69
Popescu-Zeletin, Radu 186
Pragliola, Concetta 92
Ramli, Dzati Athiar 235
Ramunno, Gianluca 77
Redi, Judith 61
Rizzi, Antonello 178
Ronsivalle, Gaetano Bruno 1
Sachenko, Anatoly 274
Samad, Salina Abdul 235
Sassi, R. 227
Scotti, F. 227
ShiGuang, Ju 294
Soldani, Maurizio 100
Tamponi, Aldo 116
Tao, Cai 294
Tzouveli, Paraskevi 286
Verma, Manish 251
Vernizzi, Davide 77
Vrusias, Bogdan 195
Wei, Zhong 294
Zambelli, Michele 27
Zanasi, Alessandro 53
Zincir-Heywood, A. Nur 203
Zunino, Rodolfo 61