A NetFlow based flow analysis and monitoring system in enterprise
Available online at www.sciencedirect.com
Computer Networks 52 (2008) 1074–1092
www.elsevier.com/locate/comnet

A NetFlow based flow analysis and monitoring system in enterprise networks

Liu Bin a,*, Lin Chuang a, Qiao Jian b, He Jianping a, Peter Ungsunan a

a Department of Computer Science and Technology, Tsinghua University, Beijing, China
b School of Telecommunication Engineering, Beijing University of Posts and Telecommunications, Beijing, China

Received 2 April 2007; received in revised form 21 August 2007; accepted 29 December 2007
Available online 3 January 2008
Responsible Editor: A. Marshall

Abstract

In this paper, a flow analysis and monitoring system based on NetFlow is introduced. The system is built on a Browser–Server framework, aimed at enterprise networks. Data collection and display are separated into two modules, which makes the system clearly demarcated and easy to deploy. The data collection module receives and analyzes NetFlow-exported packets and inserts per flow record information into the Oracle database. The display module acts as a J2EE web server, fetches real-time or historical traffic information from the database and shows it to web users. In addition to the above-mentioned functions, the most important part of the system is an IDS. A real-time anomalous traffic monitoring module with a stable matching pattern algorithm and two traffic statistic based intrusion detection algorithms (one based on variance similarity, the other on Euclidean distance) is embedded in the system to detect worms and other malicious attacks. With the aim of identifying anomalous network traffic simply and effectively, a proved "join" strategy is also designed along with the two traffic statistic based intrusion detection algorithms. The whole IDS module is able to run with low computational complexity and high detection accuracy. Finally, we conduct experiments to verify the performance of our system.

© 2008 Elsevier B.V.
All rights reserved.

Keywords: Traffic measurement; NetFlow; Intrusion detection; Matching pattern; Similarity

* Corresponding author. E-mail address: liubin@csnet1.cs.tsinghua.edu.cn (L. Bin).
1389-1286/$ - see front matter © 2008 Elsevier B.V. All rights reserved. doi:10.1016/j.comnet.2007.12.004

1. Background

When analyzing network traffic information or defending against the threat of various worms and attacks in enterprise networks, we need powerful tools to handle traffic at Gigabit per second rates. Tcpdump or other software implemented on personal computers cannot meet this requirement, while Cisco's NetFlow, along with associated switches/routers, can easily capture enterprise network link traffic information, analyze it and give high level statistics for different purposes. NetFlow can also aggregate the raw data before exporting, which reduces not only the memory and computational load of the data collecting server but also the link bandwidth consumption.

The principle of NetFlow is as follows: when the router receives a packet, its NetFlow module scans the source IP address, the destination IP address, the source port number, the destination port number, the protocol type, the type of service (ToS) field in the IP header, and the input (or output) interface number on the router, to judge whether the packet belongs to a flow record that already exists. If so, it updates the flow record; otherwise, a new flow record is generated in the cache. The expired flow records in the cache are exported periodically to a destination IP address using UDP. (Note that a UDP packet contains several flow records.) The exporting packet format can be found in the NetFlow standard. The most common version of NetFlow is V5. The newest version, NetFlow V9, is described in RFC 3954 [1]. A flow record does not contain any upper-layer information; it contains only traffic profiles.
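The cache behavior described above can be illustrated with a minimal sketch. It is not Cisco's implementation: the class names `FlowKey`, `FlowRecord` and `FlowCache`, the counters kept per flow, and the 15-second inactivity timeout are all assumptions chosen for illustration. The sketch only shows the idea of keying the cache on the seven fields and exporting records once they expire.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FlowKey:
    """The seven fields that identify a flow (hypothetical container)."""
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: int
    tos: int
    interface: int


@dataclass
class FlowRecord:
    """Traffic profile only: no payload or upper-layer information."""
    packets: int = 0
    octets: int = 0
    first_seen: float = 0.0
    last_seen: float = 0.0


class FlowCache:
    def __init__(self, inactive_timeout=15.0):
        # The timeout value is an assumption made for this sketch.
        self.inactive_timeout = inactive_timeout
        self.flows = {}

    def observe(self, key, length, now):
        """Update the matching flow record, or create one if none exists."""
        record = self.flows.get(key)
        if record is None:
            record = FlowRecord(first_seen=now)
            self.flows[key] = record
        record.packets += 1
        record.octets += length
        record.last_seen = now

    def expire(self, now):
        """Remove and return the records that have been idle too long."""
        expired = [(k, r) for k, r in self.flows.items()
                   if now - r.last_seen >= self.inactive_timeout]
        for key, _ in expired:
            del self.flows[key]
        return expired
```

In a real exporter the list returned by `expire` would be packed, several records at a time, into UDP export packets.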
The advantage of recording only traffic profiles is high speed. Paying no attention to packet payloads greatly reduces the processing overhead and makes NetFlow based anomalous traffic detection a very good fit for busy, high-speed network environments [10].

In this paper, a flow analysis and monitoring system based on NetFlow is introduced. The system is built on a Browser–Server framework, aimed at enterprise networks. Data collection and display are separated into two modules, which makes the system clearly demarcated and easy to deploy. The data collection module receives and analyzes NetFlow-exported packets and inserts per flow record information into the Oracle database. The display module acts as a J2EE web server, fetches real-time or historical traffic information from the database and shows it to web users. Additionally, the most important part of the system is an IDS: a real-time anomalous traffic monitoring module with a stable matching pattern algorithm and two traffic statistic based intrusion detection algorithms (one based on variance similarity, the other on Euclidean distance) is embedded in the system to detect worms and other malicious attacks. With the aim of identifying anomalous network traffic simply and effectively, a proved "join" strategy is also designed along with the two traffic statistic based intrusion detection algorithms. The whole IDS module is able to run with low computational complexity and high detection accuracy.

2. Related works

Intrusion detection systems (IDSs) are designed to monitor activities on a network and recognize anomalies that may be symptomatic of misuse or malicious attacks [2]. An IDS consists of two components: monitoring and analysis. Data is collected by monitoring activities in the hosts or network. Then the raw data is analyzed to classify activities as normal or suspicious. Currently there are two basic approaches to intrusion detection: misuse detection and anomaly detection.
Anomaly detection is aimed at characterizing the legitimate behavior of the system and then detecting anomalous behavior. Common methods used in intrusion detection include statistics, immunology, neural networks, data mining, machine learning, finite state automata and so on. In our system, two kinds of intrusion detection methods are embedded: one uses pattern matching while the other is based on traffic statistics.

Pattern matching is a simple type of attack detection technique. Using the pattern matching technique, IDSs generally match the text (audit records) or binary sequences against known attack signatures [13]. In the pattern matching approach [14–16], rules for identifying attacks are easy to write and understand and are also very customizable. Recently, deep packet inspection (DPI) has often been used in network intrusion detection and prevention systems (NIDPS), where incoming packet payloads are compared against known attack signatures [38–40]. Attack signatures can be easily generated for new alerts and warnings. In [37], the author presents a novel scheme for pattern matching, called BFPM, that exploits a hardware based programmable state machine technology to achieve deterministic processing rates that are independent of input and pattern characteristics, on the order of 10 Gb/s for FPGA and at least 20 Gb/s for ASIC implementations. The limitation of the pattern matching approach is that it can recognize only known attacks; it requires continuous updates of attack signatures to identify new attacks. Most pattern matching algorithms in IDSs rest on simple string matching, which is not applicable to our system because NetFlow records carry no upper-layer information, as described in Section 1. Thus, a special pattern matching algorithm is proposed in Section 3 to meet the system's needs.

The advantage of anomaly detection over pattern matching is that previously unknown attacks can be discovered.
Using statistical analysis of network traffic to identify intrusion is a popular research direction. The following are some of the existing detection technologies which are based on network traffic statistics:

1. Cumulative sum method: The CUSUM algorithm is a commonly used algorithm in statistical process control, which can detect a change in the mean value of a statistical process [17,18]. CUSUM relies on the fact that if a change occurs, the probability distribution of the random sequence will also change. Generally, CUSUM requires a parametric model for the random sequence so that the probability density function can be applied to monitor the sequence. Unfortunately, in many situations we do not know the underlying distribution of the statistics we are observing. For example, while the arrival rate of TCP connections in the Internet has been shown to follow a Pareto distribution [19], other traffic measures that have been used for attack detection have no known distribution, e.g., the arrival rate of new source IP addresses [20], or the arrival rate of reset packets [21]. Thus, a key challenge for parametric methods is how to model the random sequence. An alternative approach is the use of non-parametric methods, which are not model-specific. For the most common SYN flooding attacks, a non-parametric cumulative sum (CUSUM) method is applied in [3] to detect the sequential change point. The two algorithms considered in [4] are an adaptive threshold algorithm and a particular application of the CUSUM algorithm for change point detection. The performance is investigated in terms of the detection probability, the false alarm ratio, and the detection delay, using workloads of real traffic traces. In [5], a detection method for SYN flooding called D-SAT (detecting SYN flooding attack by two-stage statistical approach) is proposed.
In its first stage, D-SAT monitors only the SYN count and the ratio between SYN and other TCP packets. In its second stage it detects SYN flooding and finds victims more accurately.

2. Statistical model method: The statistical model method is often used to detect anomalous intrusion. As a mature method, it identifies as anomalous those activities that deviate statistically by a large amount from usual ones. The commonly used methods are threshold detection [22], mean and standard deviation [23], and the Markov process model [24]. Threshold detection [22] is one of the most basic and simplest techniques for intrusion detection. Its goal is to record each occurrence of a specific event and detect when the number of occurrences of that event surpasses the amount one might reasonably expect within a specified time period. In the mean and standard deviation method, event measures are compared to the mean and standard deviation of a user profile, so that a confidence interval for an anomaly can be established. The user profile values are fixed or based on weighted historical data [22]. The Markov process model applies only to event counters. Each distinct type of event is a state variable, and a state transition matrix [23] characterizes the transition frequencies between states. A new observation is defined to be anomalous if its probability, which is determined by the previous state and the transition matrix, is too low. In [6], the authors describe statistical model based algorithms which first combine the expectation, variance and other statistical parameters, then use hypothesis testing to detect the attacks.

3. Data mining methods: Data mining is an information extraction activity to discover hidden facts contained in databases. Data mining techniques [25] are used to find patterns and subtle relationships in data and to infer rules that allow the prediction of future results. Various data mining techniques are proposed [26,27], e.g.
association rule, classification, clustering analysis, characterization/generalization, incremental updates and meta rules. In [7], technologies based on mining fuzzy rules and data mining techniques are designed to distinguish anomalous network traffic from normal traffic. When there is offensive traffic, the system can detect intrusion by finding the deviation from normal patterns.

4. Wavelet analysis method: The wavelet analysis technique has already been introduced to intrusion detection [8]. Wavelet analysis intrusion detection (WAID) is achieved by treating state-realizable protocols as discrete waveforms. WAID is offered as a multi-dimensional analysis approach to detecting temporally spaced network attacks. The wavelet analysis method can also be associated with the self-similar characteristics of network traffic. In [9], the author proved mathematically that significant changes in the Hurst parameter can help find attacks. In [41], the authors apply signal processing techniques in intrusion detection systems, and develop and implement a framework, called Waveman, for real-time wavelet based analysis of network traffic anomalies. Two metrics, namely percentage deviation and entropy, are then used to evaluate the performance of various wavelet functions in detecting different types of anomalies.

5. Neural networks: A neural network is a network of computational units that jointly implement complex mapping functions. The neural network approach to intrusion detection is to learn the behavior of actors (e.g. users, daemons, and so on) in the system [28,29]. Three phases (collecting training data, training, and performance) are required to build a neural network for an intrusion detection system [30].

6. Genetic algorithm: The genetic algorithm is a family of computational models based on principles of evolution and natural selection.
The process of a genetic algorithm usually begins with a randomly selected population of chromosomes (the population being the set of chromosomes during a stage of evolution) [31,32]. Genetic algorithms can be used to evolve simple rules for network traffic. These rules are used to differentiate normal network connections from anomalous connections. The rules consider parameters such as source and destination IP addresses and port numbers, duration of the connection, protocol used, etc., to indicate the probability of an intrusion [33]. Anomalous connections here are events with a high probability of intrusion.

7. Immune system: The human immune system (HIS) [34] provides the human body with a high level of protection from invading pathogens. Immune system approaches are also used for computer security, scheduling and virus detection. The role of the immune system in the human body is similar to the role of an intrusion detection system. Immune system approaches with distributed, lightweight computing and self-organizing capabilities can be applied to a network based intrusion detection system to provide better security [35,36].

Besides the pattern matching algorithm proposed in Section 3, two algorithms based on network traffic statistics will also be introduced in Section 4. Unlike most existing work, our algorithms focus on practical deployment. In current high-speed networks, intrusion detection algorithms must not only be highly accurate but also have low computational complexity and high processing speed. Although the previous algorithms can achieve high accuracy, they are complicated and require many hardware and software resources. Our algorithms are simple to implement and have low false positive rates.
Because certain exceptional scenarios may still reduce detection accuracy, we also propose and prove a "join results" strategy which can further reduce the false positive rate.

3. System structure

3.1. Overview

This system has the following characteristics:

1. It is based on a Browser–Server framework, which avoids the trouble of client software installation.
2. The data collection module and the data analysis and display module are separated, which makes the system clearly demarcated and easy to modify and deploy.
3. J2EE and Applet development technologies are used in the display module, which makes it possible to supply real-time traffic monitoring functions to web users.
4. Anomalous traffic analysis functions embedded in the system help find common worms and other malicious attacks. For the pattern matching method, a characteristic field based algorithm can detect known anomalies efficiently. For the traffic statistical method, two simple and accurate algorithms are used with a proved "join" strategy.
5. The data collection module supports all versions (1, 5, 7, 8 and 9) of NetFlow and provides data to other modules in fixed, version-independent formats.

Fig. 1 shows the overall structure of the system. The switches/routers collect traffic information and send flow records to data collectors. The collectors then analyze the raw data according to user-set rules, and insert both the raw data (selectively) and the processed data into the database. The web server queries the database, updates the display, and periodically scans the data to look for anomalous traffic.

Fig. 1. System structure.

3.2. Data collection and processing module

The main function of this module is collecting, aggregating, processing and storing NetFlow data. This module runs on the data collectors in Fig. 1. In the module, dozens of mutually cooperative threads are executed concurrently. Fig. 2 is a simplified flowchart of the handling procedures. Moreover, there are some configuration files which allow users to easily set up the module to meet different requirements.

The system supports 12 kinds of aggregations (in Fig. 2, only two of the 12 aggregation threads are listed), each of which aggregates NetFlow data according to a different flow-field-set and operates on different tables in the database. For example, in the host flow matrix aggregation function, the flow-field-set is [source IP address, destination IP address], so the flow records which have the same source IP address and destination IP address are aggregated by the aggregation thread. The aggregation threads are flexible in two ways. First, any aggregation function can be enabled or disabled at any time by modifying configuration files. Second, different aggregation functions can be deployed on different servers, which helps a high-load network increase processing speed.

Here we explain the principles of the protocol aggregation thread and the AS matrix aggregation thread.

1. Protocol aggregation: Strictly speaking, it should be called application layer protocol aggregation, because this thread classifies flow records using not only the IP protocol type field but also the source (or destination) port field. The aggregation rules are defined in the configuration file "Protocols.aggregate", part of which is shown in Fig. 3. Before the data collection starts, a corresponding item is generated in memory for each rule in "Protocols.aggregate". After it starts, the protocol type of a flow record is determined by matching these rules, and the corresponding rule item in memory is updated. The matching is "short circuit matching", that is, if a flow record matches a rule then no later rule needs to be checked.

2.
AS matrix aggregation: Sometimes, in addition to the traffic statistics of different hosts, people are more concerned with the traffic between different departments. The source and destination AS fields of NetFlow can serve this purpose, but only if the BGP protocol has been deployed. The system provides the function of aggregating and analyzing the traffic between different departments based on the AS fields of NetFlow. Meanwhile, because the deployment of BGP is not very common, we also allow an AS to be defined by IP subnet address rather than by the AS fields of BGP. For example, the IP address scope of a certain AS named AS_99 defined in the configuration file "AS_Source.aggregate" (Fig. 4) is "10.8.1.0–10.8.3.255". Note that all subnet masks are of the form 255.255..0. The rules for AS_99 are then generated by the following steps:

Step 1: Convert the start IP address 10.8.1.0 to hex: start = 0A080100, and convert the end IP address 10.8.3.255 to hex: end = 0A0803FF.
Step 2: (start >> 8) = 0A0801, (end >> 8) = 0A0803. The number of bits to shift right is 32 - 24 (the number of subnet mask bits) = 8. (Since the subnet mask is of the form 255.255..0, the start and end IP addresses are shifted right to reduce the hash table capacity.)
Step 3: For each long integer i between the shifted "start" and the shifted "end" (inclusive), insert [i, AS_99] into the hash table.

Then, for each flow record's source or destination IP address, the AS to which the address belongs can be found by shifting the address right by 8 bits and searching the hash table.

Fig. 2. Flowchart of collecting procedures.
Fig. 3. Protocols.aggregate.
Fig. 4. AS_Source.aggregate (AS_Destination.aggregate is quite the same).
Fig. 5. The system interface.

3.3. Display and anomalous traffic monitoring module

This module contains three function units: the traffic information monitoring unit, the anomalous traffic detection and query unit, and the system setting unit. Each of the former two units contains two parts: historical data queries and real-time monitoring. Additionally, the system interfaces are also classified in this module. Fig. 5 shows part of the system interfaces. In the figures, the bar-graph, the pie-graph and the blue and white form are some interfaces of real-time monitoring, the white form is an interface of historical data queries, and the remaining figure is the main interface of the system. In the following, we introduce each function unit in detail:

1. The traffic information monitoring unit mainly provides real-time monitoring and historical data queries of traffic going through the switches/routers in which NetFlow is embedded. A major part of this unit can be considered as the interface of the aggregation threads introduced in Section 3.2 (i.e. this part shows users the real-time or historical data processed by the aggregation threads). Commonly, different monitoring functions in this unit correspond to different aggregation threads in the data collection and processing module introduced in Section 3.2. For example, protocol and application traffic monitoring corresponds to protocol aggregation (Fig. 3), AS flow matrix monitoring corresponds to the department flow matrix aggregation (Fig. 4), end-to-end flow matrix monitoring corresponds to the host flow matrix aggregation, and so on. Moreover, this unit also offers a real-time monitoring function that is finer-grained than aggregation monitoring: it allows users to track traffic trends of a single IP address or even a single port without any aggregation.

2. The anomalous traffic detection and query unit mainly provides real-time monitoring and historical data queries of anomalous traffic. As we have explained in Section 1, we employ two types of anomalous traffic detection methods in this system: the pattern matching method and the traffic statistical method.

In the pattern matching method, anomalous flow records are mainly identified by characteristic fields and characteristic rules (sometimes by the statistics of ICMP and TCP_FLAGS), while in the traffic statistical method, anomalous traffic is identified by the "similar degree" between the traffic data and the trained sample. In the traffic statistical methods which will be introduced in Section 4.2, the traffic in a certain period is considered to be normal if its "feature extraction vector" and the "trained sample" are similar enough. In this section we discuss only the pattern matching method; the concepts "trained sample" and "feature extraction vector" and the details of the traffic statistical methods will be discussed in Section 4.2.

Because ICMP and TCP_FLAGS information is helpful when detecting some worms, the ICMP packet and TCP_FLAGS monitoring function is also implemented in this unit. It provides statistical information on the ICMP protocol and on the URG, ACK, PSH, RST, SYN, and FIN flag bits of the TCP protocol (a NetFlow export record contains the cumulative OR of the TCP flags over the entire flow duration).

Characteristic fields are the following flow record fields: the source port, the destination port, the number of packets, the number of octets, and the protocol type. A characteristic rule is defined by several of these characteristic fields. We say that a flow record matches a characteristic rule if the flow record matches all the characteristic fields of the rule. As shown in Fig. 6, the needed function is “for a period of time, identify all the flow records by a certain characteristics rule set (meaning that if a flow record matches an arbitrary rule of the set, we pick it up; otherwise we just ignore it).
When a characteristic flow record first appears, the system stores the matched flow record in memory in a specific format; if it has appeared before, the system updates the corresponding stored item instead of creating a new one”. To make the result clear, the matched result set distinguishes between different source addresses and different destination addresses, i.e. two flow records that match the same rule but have different source addresses or different destination addresses are seen as two different items in the output result set. To implement the function of Fig. 6, we propose the following matching algorithm.

Fig. 6. The needed function.

Matching algorithm:

Step 1: Given the characteristics rule set $A$, let each rule $a \in A$, named $a_{name}$, have the form $[a_1, a_2, a_3, a_4, a_5]$, where $a_1$–$a_5$ represent the characteristic fields ("source port", "destination port", "packets number", "octets number", and "protocol type"), and

\[
a_i \;(i \in [1,5]) =
\begin{cases}
\text{the string format of the } i\text{th element} & (\text{the } i\text{th element of } a \text{ is defined}),\\
\phi & (\text{the } i\text{th element of } a \text{ is undefined}).
\end{cases}
\]

Then for each rule $a \in A$, a mapping set is generated:

\[
a_{rule} = [a_{rule1}, [a_{rule2}, a_{rule3}]].
\]

Here $a_{rule1} = \sum_{i=1}^{5} a_i$, which means that the strings are concatenated; $a_{rule2} = a_{name}$; $a_{rule3} = [b_1, b_2, b_3, b_4, b_5]$, with

\[
b_i =
\begin{cases}
1 & (a_i \neq \phi),\\
0 & (a_i = \phi).
\end{cases}
\]

Let $A_{rule}$ (written Arule below) be the mapping rule set generated by $A$.

Step 2: The result set R is obtained by accessing the database and getting all the raw flow records between times t1 and t2. Let S (initialized as NULL) be the algorithm output set. Each item of S includes two attributes: the keyword (denoted by key, introduced in Step 5) and the value (denoted by value). S(key) is the set of keys, and value(key) is the value corresponding to a certain key, i.e. value(key) is the user-defined output of the matched key. m is the traversal cursor of Arule and n is the traversal cursor of R. Initialize m = 1 and n = 1.

Step 3: Let temprule = Arule[m].
According to Step 1, we have temprule = [temprule1, [temprule2, temprule3]].

Step 4: Let r = R[n]. According to the definition of table RAW, in which the raw flow records are stored, we have r = [srcaddr, dstaddr, srcport, dstport, dpkts, doctets, protocol, stamp], i.e. r includes seven attributes of a NetFlow record plus the timestamp at which the record was put into the database. The types of all these fields in r are string. Let string sign = NULL. We map [srcport, dstport, dpkts, doctets, protocol] to [b1, b2, b3, b4, b5] (temprule3), and append an element of r to sign if the mapping value of that element in temprule3 is 1. For example, if b1 = 1, we append srcport to sign. Then:
  if (sign == temprule1) goto Step 5; else goto Step 6;

Step 5: Let key' = temprule2 + srcaddr + dstaddr;
  if (key' ∈ S(key)) update value(key'); else put [key', value(key')] into S;

Step 6: n = n + 1;
  if (R[n] != NULL) goto Step 4; else goto Step 7;

Step 7: m = m + 1; n = 1;
  if (Arule[m] != NULL) goto Step 3; else goto Step 8;

Step 8: Output S.

One point in the algorithm above needs explanation. In Step 2, we access the database server to get flow information instead of receiving packets from the data collector and judging the flow records directly. First, receiving packets from the data collector and judging the flow records directly makes the matching algorithm passive: system fluctuations (such as burst traffic) force the algorithm to spend much of its time processing flow records and so degrade performance. Judgment based on the database makes the matching algorithm active: when to begin judging is determined entirely by the system once it has the flow information, so fluctuations do not degrade performance. In practice, our layout meets the system requirements for processing speed well.
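The matching algorithm (Steps 1 to 8) can be sketched in a few lines of Python. This is an illustrative reading of the steps, not the authors' code. Several things are assumptions: rules are given as a dictionary from rule name to the defined characteristic fields, flow records are dictionaries keyed by the RAW column names, `value(key)` is taken to be a small set of counters, and the strings are joined with a `|` separator (the steps above concatenate them directly) so that different field combinations cannot produce the same signature.

```python
# Characteristic fields in the order [a1 .. a5] used by the algorithm.
FIELDS = ("srcport", "dstport", "dpkts", "doctets", "protocol")


def compile_rules(rules):
    """Step 1: turn each rule into (signature, name, mask)."""
    compiled = []
    for name, fields in rules.items():
        mask = [1 if f in fields else 0 for f in FIELDS]
        signature = "|".join(str(fields[f]) for f in FIELDS if f in fields)
        compiled.append((signature, name, mask))
    return compiled


def match_flows(rules, records):
    """Steps 2-8: collect matched records per (rule, srcaddr, dstaddr)."""
    result = {}
    for signature, name, mask in compile_rules(rules):      # cursor m
        for r in records:                                   # cursor n
            # Step 4: build the record's signature from the masked fields.
            sign = "|".join(str(r[f]) for f, b in zip(FIELDS, mask) if b)
            if sign != signature:
                continue
            # Step 5: one output item per rule and address pair.
            key = (name, r["srcaddr"], r["dstaddr"])
            item = result.setdefault(
                key, {"flows": 0, "packets": 0, "octets": 0})
            item["flows"] += 1
            item["packets"] += int(r["dpkts"])
            item["octets"] += int(r["doctets"])
    return result
```

A rule such as `{"Slammer": {"dstport": "1434", "doctets": "404", "protocol": "UDP"}}` would then group every matching record by its source and destination address. The cost is proportional to the number of rules times the number of records fetched for the time window.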
Even with an execution interval of 1 s, the algorithm can handle the flow records in real time. Second, if the data collection and the algorithm are deployed on different servers, implementing the algorithm without the database would require flow records to be transmitted to both servers, wasting bandwidth.

3. The system setting unit supplies other functions such as automatic traffic report generation, user management, interface style setting and so on. The automatic traffic report generation function can generate reports and delete or export useless data in the database according to the configuration (e.g. the content of reports, the report generation time, and the data clearing strategy) to save database storage space.

4. Using the system to detect anomalous traffic

4.1. Pattern matching method

When using the pattern matching method (or ICMP and TCP_FLAGS monitoring) to detect anomalous flow records, the anomalous flow records are classified into three categories: malicious network attacks, such as denial of service (DoS); Trojan horses and worms; and unexpected network applications (e.g. eMule, BitTorrent).

1. Malicious network attacks

Malicious network attacks can be detected by our system effectively:

(a) Ping of death: Malicious and oversized ICMP packets can cause memory allocation errors, lead to TCP/IP stack collapse, and leave the recipients with a crashed system. When a ping of death attack breaks out, the volume of ping packets between the attacker and a victim can reach hundreds of MB. ICMP monitoring in this system creates statistics for recent ICMP packets, so when a ping of death attack breaks out, the particularly large and anomalous ICMP packets and their source IP addresses can be found in real time.

(b) SYN flood: This kind of attack denies service to the victim by sending many "semiconnected" packets.
As we all know, a normal TCP connection starts with a "three-way handshake". When the attacker repeatedly carries out only the first two handshakes, many connections are left hanging, and these "semiconnected" transactions exhaust the resources of the victim. TCP_FLAGS monitoring in this system can find the IP address which has the most connections in a period of time. If no SYN flood is occurring, the SYN and FIN flags should have almost the same count in the traffic monitoring window. If a SYN flood breaks out, many more SYN flags are seen.

In addition, Smurf attacks, UDP floods, Land attacks, Fraggle attacks, e-mail attacks and so on can also be detected using a similar method.

2. Trojan horses and worms

Trojan horses and worms with known characteristics can be identified using the "characteristic fields" monitoring function. For example, the characteristic fields of the Trojan horse "Wincrash_v2" are: destination port = 2583, protocol type = TCP. The characteristic fields of the worm "Shockwave Killer" are: destination port = 2048, protocol type = ICMP, number of octets = 92. The characteristic fields of some worms are shown below:

Code Red Worm: destination port = 80, protocol type = TCP, packet number = 3, octet number = 144.
Worm.Opasoft, W32.Opaserv.Worm: destination port = 137, protocol type = UDP, octet number = 78.
Worm.NetKiller2003, Worm.Sqlp1434, W32.Slammer, W32.SQLExp.Worm: destination port = 1434, protocol type = UDP, octet number = 404.
Worm.Blaster, W32.Blaster.Worm: destination port = 135, protocol type = TCP, octet number = 48.
Worm.KillMsBlast, W32.Nachi.worm, W32.Welchia.Worm: destination port = 2048, protocol type = ICMP, octet number = 92.
Worm.Sasser, W32.Sasser: destination port = 445, protocol type = TCP, octet number = 48.
W32.Witty.Worm: source port = 4000, protocol type = UDP.

The protocol of most Trojan horses is TCP.
The destination ports of some TCP Trojan horses are listed below (their characteristic fields can be written as: destination port = some value, protocol type = TCP): BackDoor (1999), Black Hole 2001 (2001), Ripper (2023), Wincrash v2 (2583), Remote Administrator (4899), VNC (5800, 5900), Dameware NT Utilities (6129), GuangWai Girl (6267), DeepThroat v1.0-3.1 (6670, 6671), Indoctrination (6939), Priority (6969), Netspy (7306), Genue (7511), Glacier (7626), Way2.4 (8011), back orifice (31337), back orifice2000 (54320), netbus (12345), subseven (27374, 1243). Because these ports are very unusual, further investigation and analysis are necessary whenever characteristic traffic like the above is observed.

When a worm scans the network before spreading, it randomly or pseudo-randomly generates a large number of IP addresses to scan, probing hosts for vulnerabilities. A major part of these scanned IP addresses are unassigned or unreachable, so no TCP FIN packets are returned. Hence, for TCP worms with unknown characteristics, as with the "SYN flood" detection above, we can use "TCP_FLAGS monitoring" and expect to see a large number of SYN packets in the flow records associated with the worm-infected host.

UDP worms with unknown characteristics: If the protocol field of a NetFlow record is "1", the protocol type of this record is ICMP. We can then convert the destination port field of the record to a three-digit hexadecimal number in which the first digit is the "ICMP type" and the other two digits are the "code field for the type", and use the "ICMP type" and the "code field" to look up detailed information in a table [11, Chapter 6]. When a host initiates a UDP request to a nonexistent host, an intermediate switch will send an "ICMP T_3" (destination unreachable) packet to the worm host.
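The destination-port encoding described above can be illustrated with a short sketch. It assumes that the 16-bit destination port of an ICMP flow record carries the ICMP type in the high byte and the code in the low byte, which is what the three-digit hexadecimal reading amounts to. The record layout (a dict with `prot`, `dstport` and `dstaddr` keys) and both function names are hypothetical, chosen only for illustration; this is not the system's own code.

```python
from collections import Counter

ICMP_PROTOCOL = 1        # IP protocol number for ICMP
DEST_UNREACHABLE = 3     # ICMP type 3, written "ICMP T_3" in the text


def decode_icmp_dstport(dst_port):
    """Split an ICMP flow record's destination port into (type, code)."""
    if not 0 <= dst_port <= 0xFFFF:
        raise ValueError("destination port must fit in 16 bits")
    return dst_port >> 8, dst_port & 0xFF


def count_unreachable(records):
    """Count destination-unreachable flow records per receiving address.

    `records` is an iterable of dicts with hypothetical keys
    'prot', 'dstport' and 'dstaddr'.
    """
    counts = Counter()
    for rec in records:
        if rec["prot"] != ICMP_PROTOCOL:
            continue  # only ICMP records carry a type/code in dstport
        icmp_type, _code = decode_icmp_dstport(rec["dstport"])
        if icmp_type == DEST_UNREACHABLE:
            counts[rec["dstaddr"]] += 1
    return counts


if __name__ == "__main__":
    print(decode_icmp_dstport(0x303))  # (3, 3)
    print(decode_icmp_dstport(2048))   # (8, 0)
```

An address whose count rises sharply within one monitoring window would then be a candidate for closer inspection.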
So we can use "Characteristic fields monitoring" to set "ICMP T_3" rules such as "Port Unreachable (characteristic fields: destination port = 771 (hex: 303), protocol type = ICMP)", "Host Unreachable (characteristic fields: destination port = 769 (hex: 301), protocol type = ICMP)", and so on. Thus, if a certain IP address suddenly receives a lot of "ICMP T_3" packets, it is very likely that someone is disseminating UDP worms.

3. Unexpected network applications

We can use "Protocol monitoring" or "Characteristic monitoring" to detect unexpected network applications by setting a new rule in the configuration file in Fig. 3 or in the characteristics rule set in Fig. 6. For example, some P2P software that may consume too much bandwidth is forbidden in the enterprise network. The characteristics of such P2P software (their characteristic fields can be written as: destination port = some value, protocol type = UDP/TCP) are as follows: QQ Live (UDP: 13000–14000), XunLei (TCP: 3076, 3077), NetFairy (TCP: 7777, 7778, 11300), eMule (TCP: 4242, 4661, 4662), KuGoo (TCP: 3318, 7000), BitSpirit (TCP: 16881), BaoCue (TCP: 6346), PTC (TCP: 50007), Poco (TCP: 2881, 5354, 8094), Kamun (TCP: 3751, 3753, 4772, 4774), 100bao (TCP: 3468), Bai Hua PP (TCP: 5093), Kuro (TCP: 6800, 6801, 7003), Baidu Download (TCP: 11000), eDonkey (4371, 4662), Baizhao P2P (TCP: 9000), OPENEXT (TCP: 2500, 4173, 5467, 10002, 10003), iLink (TCP: 5000), DDS (TCP: 11608), iMesh (TCP: 4662), winmx (TCP: 5690), PPlive (UDP: 4004, TCP: 8008).

4.2. Traffic statistical methods

With the pattern matching method, we detect anomalous flow records based on the flow characteristics of NetFlow. When the flow characteristics are unknown or have changed, we instead want to determine whether the traffic is normal from overall traffic statistics at the macro level. Based on this idea, we propose two traffic statistic based algorithms.
Unlike the pattern matching method, which has high accuracy, the traffic statistical methods may lose accuracy in certain exceptional scenarios. To prevent this, we also propose and prove a "join results" strategy that further reduces the false positive rate.

4.2.1. Intrusion detection algorithms

Algorithm 1. An anomalous traffic detection algorithm based on the variance similarity.

Detecting the system's anomalous traffic is a classification problem, i.e. distinguishing between the normal and anomalous traffic of a certain IP address. For management convenience, it can be done in more detail: with the help of multi-level standards, the level to which the anomalous traffic belongs can be clarified further.

A feature extraction vector $\vec{F}$ is defined as $\vec{F} = (F_1, F_2, \ldots, F_n)$. Here $F_i$ ($1 \le i \le n$) is an average value that can identify the traffic characteristics over a fixed length of time, such as a certain IP address's average number of passive connections, of received octets, of received packets and so on.

Let us take NetFlow V5 and "Host Behavior Based Detection" [42,43] as an example to explain how the data can be prepared. In host behavior based detection, three behavior classes (bytes sent/bytes received per minute, outgoing flow records per minute, and bidirectional flow records per minute) are defined as the standard for detecting worms. Thus, for our algorithm, the feature extraction vector of host behavior based detection is [bytes sent/bytes received per minute, outgoing flow records per minute, bidirectional flow records per minute]. Fig. 7 shows the structure of a V5 flow record, and each UDP packet sent from the router/switch to the data collector (in Fig. 1) contains 29 V5 flow records. When handling the received flow records, the data collector aggregates and stores records in the database (Oracle) as described in Fig.
2 (note that the record structure contains many traffic details such as dPkts, dOctets and tos). Then, with the help of Hibernate or JDBC, we can get the feature extraction vector of a certain IP address as follows.

Fig. 7. The structure of a V5 flow record.

To obtain the "bytes sent/bytes received per minute", the steps are:

1. To obtain the "bytes sent per minute", we can execute the SQL statement
select sum(dOctets) * 60 / (END_TIME - START_TIME) from TABLE where TIME between START_TIME and END_TIME and srcaddr = 'IP'
where TABLE is the table storing the traffic data of IP addresses (named HOSTMATRIX in our system), START_TIME and END_TIME bound the time interval, and the constraint "srcaddr = 'IP'" restricts the selected rows to those whose source address is 'IP'. Denote the result as i1.

2. To obtain the "bytes received per minute", we can execute the SQL statement
select sum(dOctets) * 60 / (END_TIME - START_TIME) from TABLE where TIME between START_TIME and END_TIME and dstaddr = 'IP'
Denote the result as i2.

3. The "bytes sent/bytes received per minute" is obtained as i1/i2.

To obtain the "outgoing flow records per minute", we can execute the SQL statement
select count(*) * 60 / (END_TIME - START_TIME) from TABLE where TIME between START_TIME and END_TIME and srcaddr = 'IP'

To obtain the "bidirectional flow records per minute", we can execute the SQL statement
select count(*) * 60 / (END_TIME - START_TIME) from TABLE where TIME between START_TIME and END_TIME and srcaddr = 'IP' and dstaddr in (select srcaddr from TABLE where TIME between START_TIME and END_TIME and dstaddr = 'IP')

The elements of the feature extraction vector $\vec{F}$ should be chosen flexibly to meet different requirements.
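The per-minute queries above can be tried out on a small in-memory table. The sketch below uses Python's standard `sqlite3` module rather than Oracle, and it assumes a simplified, hypothetical schema: a table `hostmatrix` with columns `ts` (seconds), `srcaddr`, `dstaddr` and `doctets`. The function name `feature_vector` and the handling of a zero "bytes received" value (the ratio is reported as infinity) are also assumptions made for illustration, not part of the described system.

```python
import sqlite3


def feature_vector(conn, ip, start, end):
    """Return [sent/received byte ratio, outgoing flows/min, bidirectional flows/min]."""
    span = float(end - start)  # window length in seconds
    cur = conn.cursor()

    def per_minute(select_expr, where, params):
        # Scale an aggregate over the window [start, end] to a per-minute rate.
        sql = (f"SELECT {select_expr} * 60.0 / ? FROM hostmatrix "
               f"WHERE ts BETWEEN ? AND ? AND {where}")
        return cur.execute(sql, (span, start, end, *params)).fetchone()[0]

    sent = per_minute("COALESCE(SUM(doctets), 0)", "srcaddr = ?", (ip,))
    received = per_minute("COALESCE(SUM(doctets), 0)", "dstaddr = ?", (ip,))
    outgoing = per_minute("COUNT(*)", "srcaddr = ?", (ip,))
    # Outgoing records whose destination also sent traffic back to `ip`.
    bidirectional = per_minute(
        "COUNT(*)",
        "srcaddr = ? AND dstaddr IN (SELECT srcaddr FROM hostmatrix "
        "WHERE ts BETWEEN ? AND ? AND dstaddr = ?)",
        (ip, start, end, ip),
    )
    ratio = sent / received if received else float("inf")  # assumed convention
    return [ratio, outgoing, bidirectional]


def demo_connection():
    """Build a tiny in-memory table with three illustrative flow records."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE hostmatrix "
                 "(ts INTEGER, srcaddr TEXT, dstaddr TEXT, doctets INTEGER)")
    conn.executemany("INSERT INTO hostmatrix VALUES (?, ?, ?, ?)", [
        (10, "10.0.0.1", "10.0.0.2", 600),
        (20, "10.0.0.1", "10.0.0.3", 400),
        (30, "10.0.0.2", "10.0.0.1", 500),
    ])
    return conn


if __name__ == "__main__":
    print(feature_vector(demo_connection(), "10.0.0.1", 0, 60))
```

For the three demo rows and a 60 s window, host 10.0.0.1 sends 1000 octets and receives 500, has two outgoing records, and one of them goes to a peer that also sent traffic back.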
As one example of choosing the elements flexibly: as discussed in Section 4.1, if a certain IP address suddenly receives a lot of "ICMP T_3" packets, it is very likely that someone is disseminating UDP worms. In this circumstance, the average number of ICMP packets received by the switches can be added as an element of $\vec{F}$ by the servers that monitor the network traffic.

A trained sample $\vec{E}$ is defined as the average of the feature extraction vectors $\vec{F}$ of some normal traffic records, which captures the main characteristics of normal traffic. The number and update frequency of trained samples depend on the specific application. For example, two trained samples can be prepared, one for busy times (Monday–Friday) and one for leisure times (Saturday, Sunday), or three trained samples can be prepared to detect traffic at different time granularities, with intervals of 1 min, 3 min and 10 min. When a streaming server is changed into a common web server, the trained samples obviously need to be regenerated.

The problem of detecting anomalous traffic thus becomes a classification problem. The traffic in a certain period is considered normal if its feature extraction vector $\vec{F}$ and the trained sample $\vec{E}$ are similar enough, and anomalous otherwise. This raises the question of how to define "similar enough": a quantification method is needed, i.e. a mathematical formulation by which the degree of similarity between $\vec{F}$ and $\vec{E}$ can be measured. The similarity-value sim(x1, x2) denotes the degree of similarity between x1 and x2; the larger sim(x1, x2) is, the more similar x1 and x2 are.

Here we compute the similarity-value between $\vec{F}$ and $\vec{E}$ as follows:

\[ \mathrm{sim}(\vec{F}, \vec{E}) = \sum_{i=1}^{n} w_i \, \frac{\min(F_i, E_i)}{\max(F_i, E_i)}, \]

where $w_i$ is related to $(F_i - E_i)^2$.
Here, we use the empirical formula:

\[ w_i = \frac{1/S_i}{\sum_{j=1}^{n} 1/S_j}, \qquad S_i = (F_i - E_i)^2 . \]

So $0 \le \mathrm{sim} \le \sum_{i=1}^{n} w_i = 1$.

If $S_i$ is large, the difference between $F_i$ and $E_i$ is obvious, and $w_i$ is given a lower value. On the other hand, if $S_i$ is small, $w_i$ is given a higher value. The weight based idea is simple, and unstable characteristics can be discarded according to the weight.

According to the definition of similarity-values, for the purpose of reducing the false positive rate, the determination of trained samples using Algorithm 1 should involve records whose calculated feature extraction vector elements are in the middle range.

Then we must find an appropriate similarity-value threshold. If the record's similarity-value to the trained sample $\vec{E}$ is not less than the threshold, it can be classified as normal traffic, and as anomalous otherwise. The choice of the threshold is directly related to detection accuracy. A more appropriate way to choose the threshold is to calculate the similarity-value for each of a large number of records (containing both normal and anomalous ones) and then, by the principle of minimal false positive rate, use certain methods to identify a suitable threshold. A simpler way to determine the threshold is to calculate the similarity-values for each of a large number of normal records and pick the peak similarity-value as the threshold.

Algorithm 2. An anomalous traffic detection algorithm based on the Euclidean distance.

In Algorithm 2, a trained sample $\vec{E}$ is also needed, which can be obtained by the same method as in Algorithm 1. For the trained sample $\vec{E}$ in Algorithm 2, however, the above limitation of Algorithm 1 is not necessary. In other words, the determination of the trained sample of Algorithm 2 can involve several types of records, and need not be restricted to records whose calculated feature extraction vector elements are in the middle range.
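Returning briefly to Algorithm 1, its similarity-value and threshold test can be sketched in a few lines. This is a minimal illustration rather than the system's implementation: the function names are hypothetical, all feature values are assumed to be positive, and the small constant `eps` is an added assumption that keeps $1/S_i$ finite when an element of the feature vector equals the trained sample exactly, a case the formula above leaves open.

```python
def similarity(F, E, eps=1e-12):
    """Weighted min/max similarity between feature vector F and trained sample E.

    w_i = (1/S_i) / sum_j (1/S_j) with S_i = (F_i - E_i)^2.
    `eps` is an illustrative guard against S_i == 0 (not from the paper).
    Assumes every F_i and E_i is positive.
    """
    if len(F) != len(E) or not F:
        raise ValueError("F and E must be non-empty and of equal length")
    inv = [1.0 / ((f - e) ** 2 + eps) for f, e in zip(F, E)]
    total = sum(inv)
    weights = [x / total for x in inv]          # weights sum to 1
    return sum(w * min(f, e) / max(f, e)
               for w, f, e in zip(weights, F, E))


def is_normal(F, E, threshold):
    """Traffic is classified as normal if the similarity is not less than the threshold."""
    return similarity(F, E) >= threshold


if __name__ == "__main__":
    # S = [1, 4] -> weights [0.8, 0.2]; ratios [1/2, 1/3]
    print(similarity([2.0, 3.0], [1.0, 1.0]))
```

With these weights, an element that deviates strongly from the trained sample contributes little to the sum, which matches the remark that unstable characteristics are discounted.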
In the next part we prove in detail that joining the results of the two algorithms enhances detection accuracy. If the algorithms use trained samples generated from different types of normal traffic records, they cover a wider range of normal scenarios, which helps to reduce the false positive rate and implement intrusion detection functions.

For the feature extraction vector $\vec{F}$ and the trained sample $\vec{E}$, define a new vector

\[ \vec{f} = (f_1, f_2, \ldots, f_n), \qquad f_i = \frac{F_i}{E_i} \quad (1 \le i \le n). \]

$d$ is the Euclidean distance used to weigh the difference between $\vec{F}$ and $\vec{E}$, defined as

\[ d = \left| \vec{f} - \vec{1} \right| = \sqrt{\sum_{i=1}^{n} (f_i - 1)^2}. \]

Following a similar principle as Algorithm 1, there must be a distance threshold that minimizes the false positive rate. (The method to determine the threshold is similar to that of Algorithm 1.) The traffic can therefore be classified by comparing the Euclidean distance between a certain feature extraction vector and the trained sample with the threshold. If the distance is less than the threshold, the traffic of the vector $\vec{f}$ for the corresponding time period is normal, and anomalous otherwise.

4.2.2. An improvement strategy of combining two algorithms

The main idea of the improvement strategy is to combine the two algorithms to improve the detection accuracy. As shown in Fig. 8, we "join" the results of Algorithms 1 and 2 in the strategy. Now we will prove that the strategy is better than using one algorithm individually.

Fig. 8. Judgment process.

Assume each algorithm is a Boolean function. When the actual traffic is normal, we say the input is 0, and 1 otherwise. When the actual traffic is classified as anomalous by Algorithm 1 or 2, we say the output A (Algorithm 1) or B (Algorithm 2) is 1, and 0 otherwise. So there are four (input, output) combinations for each of the two algorithms: (1, 1), (1, 0), (0, 1) and (0, 0). The final result is denoted as D, and $D = A \cap B$. The symbols needed are defined in Table 1.

Table 1
Symbols definition (i, j, a1, a2 are Boolean variables)

| Symbol | Definition |
| $P(j)$ | The probability of the input being j |
| $A_{ij}$ | The probability of detecting input j as i by Algorithm 1 |
| $B_{ij}$ | The probability of detecting input j as i by Algorithm 2 |
| $A(i \mid j)$ | The conditional probability that the detection output of Algorithm 1 is i given that the input is j; $A_{ij} = A(i \mid j) P(j)$ |
| $B(i \mid j)$ | The conditional probability that the detection output of Algorithm 2 is i given that the input is j; $B_{ij} = B(i \mid j) P(j)$ |
| $P(a_2, a_1)$ | The strategy output. Here a1 is the output of Algorithm 1 and a2 is the output of Algorithm 2 |
| $P_{r1}$ | The probability of accurate detection by the "join" strategy |

By the marginal probability distribution, we have $A_{11} + A_{01} = B_{11} + B_{01}$ and $A_{10} + A_{00} = B_{10} + B_{00}$. So,

\[ P(1,1) = P(1,1 \mid 1) P(1) + P(1,1 \mid 0) P(0) = (A_{11} + A_{01}) A(1 \mid 1) B(1 \mid 1) + (B_{10} + B_{00}) A(1 \mid 0) B(1 \mid 0) \]
\[ = (A_{11} + A_{01}) \frac{A_{11}}{A_{11} + A_{01}} \cdot \frac{B_{11}}{B_{11} + B_{01}} + (B_{10} + B_{00}) \frac{A_{10}}{A_{10} + A_{00}} \cdot \frac{B_{10}}{B_{10} + B_{00}} . \]

Similarly,

\[ P(1,0) = (A_{11} + A_{01}) \frac{A_{01}}{A_{11} + A_{01}} \cdot \frac{B_{11}}{B_{11} + B_{01}} + (B_{10} + B_{00}) \frac{A_{00}}{A_{10} + A_{00}} \cdot \frac{B_{10}}{B_{10} + B_{00}} , \]
\[ P(0,1) = (A_{11} + A_{01}) \frac{A_{11}}{A_{11} + A_{01}} \cdot \frac{B_{01}}{B_{11} + B_{01}} + (B_{10} + B_{00}) \frac{A_{10}}{A_{10} + A_{00}} \cdot \frac{B_{00}}{B_{10} + B_{00}} , \]
\[ P(0,0) = (A_{11} + A_{01}) \frac{A_{01}}{A_{11} + A_{01}} \cdot \frac{B_{01}}{B_{11} + B_{01}} + (B_{10} + B_{00}) \frac{A_{00}}{A_{10} + A_{00}} \cdot \frac{B_{00}}{B_{10} + B_{00}} . \]

And we can validate that

\[ P(1,1) + P(1,0) + P(0,1) + P(0,0) = 1 . \]

By the join Boolean operation, we get

\[ P_{r1} = \left[ P(1,0 \mid 0) + P(0,1 \mid 0) + P(0,0 \mid 0) \right] P(0) + P(1,1 \mid 1) P(1) \]
\[ = (B_{10} + B_{00}) \frac{A_{00}}{A_{10} + A_{00}} \cdot \frac{B_{10}}{B_{10} + B_{00}} + (B_{10} + B_{00}) \frac{A_{10}}{A_{10} + A_{00}} \cdot \frac{B_{00}}{B_{10} + B_{00}} + (B_{10} + B_{00}) \frac{A_{00}}{A_{10} + A_{00}} \cdot \frac{B_{00}}{B_{10} + B_{00}} + (A_{11} + A_{01}) \frac{A_{11}}{A_{11} + A_{01}} \cdot \frac{B_{11}}{B_{11} + B_{01}} \]
\[ = B_{00} + \frac{A_{00} B_{10}}{A_{10} + A_{00}} + \frac{A_{11} B_{11}}{A_{11} + A_{01}} . \]

When Algorithm 1 or 2 is used individually, the probability of accurate detection is $A_{11} + A_{00}$ or $B_{11} + B_{00}$. So,

\[ P_{r1} - (B_{11} + B_{00}) = \frac{A_{00} B_{10}}{A_{10} + A_{00}} + \frac{A_{11} B_{11}}{A_{11} + A_{01}} - B_{11} = \frac{A_{00}}{A_{10} + A_{00}} B_{10} - \frac{A_{01}}{A_{11} + A_{01}} B_{11} \]
\[ = A(0 \mid 0) B(1 \mid 0) (B_{10} + B_{00}) - A(0 \mid 1) B(1 \mid 1) (B_{11} + B_{01}) . \]

For a real system, the probability of accurate detection is greater than that of a false positive, and the probability of normal traffic is greater than that of anomalous traffic. So we have

\[ A(0 \mid 0) B(1 \mid 0) \ge A(0 \mid 1) B(1 \mid 1) \quad \text{and} \quad (B_{10} + B_{00}) = P(0) > (B_{11} + B_{01}) = P(1) . \]

Then $P_{r1} - (B_{11} + B_{00}) > 0$. Similarly, $P_{r1} - (A_{11} + A_{00}) > 0$; i.e., the probability of accurate detection of the "join" strategy is higher than that of either algorithm used individually. Thus, from the above formulations, the accuracy of detection is improved.

5. System performance

5.1. Overall performances

In this section, we analyze the overall performance of the system in the Gigabit network shown in Fig. 9 (in the figure, the NetFlow function is supported by Cisco 4000 and 6000 series). We hide the department names. According to Fig. 1, the hardware configuration of the database servers and data collectors is E3500 ((400 MHz CPU × 1, 1024 MB Memory × 1) × 4, TGX Card Quad Ethernet Card DFSCSI Card × 1, 18 GB Febrial Disk × 4, 8 mm Tape Driver Int × 1) and A1000 (36 GB DFSCSI Hard Disk × 1). The OS is Sun Solaris 8, and the database is Oracle 10g.

Fig. 9. The actual network.

As introduced in Section 3, the system supports 12 aggregation threads, each of which aggregates NetFlow data according to a different flow-field-set and operates on a different table in the database, and all the aggregation functions can be enabled or disabled at any time by modifying configuration files. Additionally, when the data collector receives a NetFlow UDP packet, it can determine whether to record the raw flow records in the packet or simply drop the packet after handling it by the process in Fig. 2. Fig.
10 shows part of the configuration file that controls the status of the above functions. We can see that the "SaveRaw" function (i.e. recording the raw flow records to the database), the "SrcAS aggregation" function and the "ASMatrix aggregation" function are enabled, and that the intervals for flushing the memory and storing the "SrcAS" and "ASMatrix" information to the database are 3 s and 29 s. "DstAs.interval = 0" means that the "DstAS aggregation" function is disabled.

Fig. 10. Configuration file.

The overall performance testing results for the network of Fig. 9 are listed in Table 2. The average flow record rate of the network in Fig. 9 is 90 flow records/s (3–4 UDP packets/s), while in a peak hour it is 150 flow records/s (5–6 UDP packets/s), and under burst conditions it can reach 250 flow records/s (8–9 UDP packets/s).

Table 2
Performance testing results (real-time processed packet ratio)

| Sample rate | Part of aggregating functions (%) | All aggregating functions, recording raw data (%) |
| No sampling | 49.90 | 25.30 |
| 1 in 2 | 99.50 | 50.66 |
| 1 in 4 | 100 | 88.11 |

With all aggregation functions enabled, including recording raw flow records to the database, we have to sample 1 in 4 flow records to keep up with the packet arrival speed during burst periods. If we need not record raw flow records and disable five aggregating functions, sampling 1 in 2 meets the requirement. Therefore, enabling only the necessary functions is an efficient way to improve system performance.

5.2. Pattern matching algorithm performance

By observing the matching algorithm we can find that the running time is related to the size of the rule set and the number of flow records involved, but has nothing to do with the proportion of matched records. (Because: (1) All the flow records will be processed by the algorithm no matter whether it is matched or not.
(2) Although processing a matched flow record differs from processing a non-matched one, the database access time, which is a thousand times slower than the execution time of an ordinary program statement, is the same.) Therefore, we discuss the performance of the matching algorithm in terms of the two aspects of rule size and flow record number. For different rule sizes (100, 200, 500, 1000, 2000, 5000, 10 000) and flow record numbers (100–10 000, increased by 50), we test each running time when the proportion of matched flow records is 0%, 10%, 20%, 30%, 40%, 50%, 60%, 70%, 80%, 90% and 100% (where the corresponding rule of a matched flow record is randomly chosen from the rule set) and then calculate the average running time over these circumstances. Results are shown in Fig. 11.

Fig. 11a. Matching algorithm running time 1.

Fig. 11b. Matching algorithm running time 2.

Fig. 11 shows that, for different rule set sizes, the running time of the matching algorithm grows linearly with the flow number, and, as discussed, the running time has nothing to do with the matched proportion, so our algorithm is stable. For the network of Fig. 9, the flow speed is 90 records/s on average, 150 records/s in the peak hour and 250 records/s during burst periods. From Fig. 11 we can get Table 3.

Table 3
Part of running times (ms), by rule set size (rows) and number of flow records (columns)

| Rule \ Data | 100 | 150 | 200 | 250 |
| 100 | 125 | 141 | 156 | 187 |
| 200 | 188 | 203 | 250 | 328 |
| 500 | 359 | 468 | 563 | 766 |
| 1000 | 672 | 984 | 1294 | 1500 |
| 2000 | 1250 | 1875 | 2357 | 3015 |

According to Section 5.1, with 1 in 4 sampling, the algorithm implementation can handle a scale of 1000 rules and 600 records/s with a running frequency of 1 s, far more than the actual scale during burst periods in the deployed system environment in Fig. 9, with 450 rules and 250 flows/s.

5.3. Traffic statistical algorithm performance

5.3.1.
NS simulation of anomalous traffic

Because we had not obtained permission to introduce attacks to verify our traffic statistical algorithms when we wrote this paper, NS2 was used. The topology for the normal traffic simulation is shown in Fig. 12. Nodes 6–45 denote 40 web servers and Nodes 46–465 denote 420 web clients, between which the communication is conducted in HTTP 1.0 and TCP Reno. The number of sessions is 400 for high-load scenarios and each session consists of a fixed number (300) of Web pages. This ensures that almost all simulations and sessions are active for the duration of the simulation (4200 s). For certain sessions, the server and client are randomly chosen from their respective boxes and the client will visit all web pages of that server.

Fig. 12. Topology for internet traffic simulation.

Parameters are set as follows:
Session Interval: exponential distribution of 1 s;
Visiting Interval: exponential distribution of 1 s;
Object Size and Shape: 12 packet/object, shape 1.2 Pareto II;
TCP packet: 1000 byte.

Feldmann et al. have shown that simulation under the above topology and parameters is similar to real network traffic [12]. Based on the above topology, we simulate some anomalous traffic in the following circumstances:

1. Long term attack: 1000–1500 s, attacker at Node 6 and recipient at Node 46 with an intruding traffic of 1 Mbps.
2. Heavy traffic attack: 2000–2050 s, attacker at Node 16 and recipient at Node 48 with an intruding traffic of 1.2 Mbps in a time burst.
3. Smooth attack: 3000–3100 s, attacker at Node 26 and recipient at Node 50 with an intruding traffic of 1.8 Mbps. Both the attack traffic and time range are moderate.

Fig. 13 shows the result of the simulation. The red traffic is for Node 46, green for Node 48 and blue for Node 50. Many varieties of attack modes exist, and the corresponding traffic is rather complicated.
Here we just consider the traffic's attack intensity and time range to test whether our algorithms can detect those kinds of attacks or not.

5.3.2. Calculation for trained samples

Compared with the real scenario, simulation provides less information in both the amount of data and the content of an individual packet, but it is still able to verify the performance of our strategy to a certain extent. According to the above simulation setting and the algorithms' needs, we define the two-dimensional feature extraction vector: [average number of received packets/100 s, average amount of received octets/100 s]. Here we select "average number of received packets/100 s" to reflect the attack time range, and "average amount of received octets/100 s" to reflect the attack intensity. Note that the two feature extraction vector fields are specific to our attack setting in the simulation; for other kinds of attack settings, more vector fields such as "Bytes sent/Bytes received", "outgoing flow records", or "bidirectional flow records" may be necessary (as introduced in Section 4.2.1).

In a real environment, the determination of the trained samples should depend on a large number of normal records, but for our 4200 s simulation, hundreds of seconds are enough. According to the requirements for the trained samples of Algorithm 1 in Section 4.2.1, we choose 500 "middle-value" records for each of the three nodes to generate the trained samples of Algorithm 1. Because there is no special limitation for the trained samples of Algorithm 2, for simplicity we use the same samples as for Algorithm 1. The results are shown in Table 4.

5.3.3. Results and analysis

Here we compare the results of the two algorithms using Table 5. In the table, "1" represents the detection of anomalous traffic, "0" represents not detected, and a false positive is marked in bold. From Table 5, we come to the following conclusions.

1.
Generally speaking, our strategy for intrusion detection is simple, effective and accurate. There is only one false positive (at Node 46) in all, achieving an accuracy rate of 99%. Even without using the "join" strategy, the accuracy rates of the two algorithms reach as high as 95% and 98%.

2. Algorithm 1 has more false positives than Algorithm 2. This is because the simulation time is limited and the records are insufficient. Therefore, the trained samples do not meet the requirements of Algorithm 1 well (meaning that the "middle-value" records we choose are not very exact). Because there is no special limitation for Algorithm 2, its false positive rate is low.

3. Algorithm 1 provides an indication of how similar the current traffic behavior is to the normal scenario. Therefore, with the support of a large number of records, it can better reflect the essence of intrusion detection, and it will be able to give the network traffic a more detailed delineation if multi-level similarity-values are defined.

4. The false positive in the period 3600–3700 s (at Node 46) appears both in the results of Algorithms 1 and 2.

Fig. 13. Result of the simulation for anomalous internet traffic.

Table 4
Sample and threshold

| Node | Trained sample | Similarity threshold | Distance threshold |
| 46 | [0.972, 700.64] | 0.2204 | 6.3984 |
| 48 | [0.826, 485.04] | 0.3291 | 4.2258 |
| 50 | [1.048, 733.92] | 0.3925 | 2.7552 |

Table 5
Results (I: Algorithm 1, II: Algorithm 2)

| Node | Algorithm | Result (calculated at 100 s intervals) |
| 46 | I | 000000011111000001000000010000000100000 |
| 46 | II | 000000011111000000000000000000000110000 |
| 46 | Join | 000000011111000000000000000000000100000 |
| 48 | I | 000000000000000001000000000010000000000 |
| 48 | II | 000000000000000001000000000000000000000 |
| 48 | Join | 000000000000000001000000000000000000000 |
| 50 | I | 000000100000000000000000000110000000000 |
| 50 | II | 000000000000000000000000000100000000000 |
| 50 | Join | 000000000000000000000000000100000000000 |
This is because in this period a sudden change occurred in the feature extraction vector elements, such as the average number of received packets/100 s. The values in the other periods range over 0–4, while here the value suddenly rises to 15. When calculating the trained sample and threshold of Node 46, we deliberately chose not to include this period, which leads to the false positive. This further shows the importance of sufficient and appropriate samples.

6. Conclusion

A flow analysis and monitoring system based on the J2EE framework is introduced above. This system receives flow records exported by NetFlow, extracts flow information and stores it in the Oracle database. A J2EE web server is used to provide real-time and historical traffic information to web users. Anomalous traffic monitoring functions are also embedded. All the functions and anomalous traffic definitions can be easily configured. The performance of the pattern matching algorithm is stable, so it can be easily expanded. The traffic statistic based detection algorithms, combined with the join strategy, are cost-effective and can determine and classify traffic types (normal or anomalous) online in real time with low computational complexity, which provides important insights for the design of not only this but also other intrusion detection and prevention systems for improving network security. This system has already been deployed in a government agency in Beijing and has worked well to date.

Our future work will focus on the improvement of system performance and algorithm efficiency, and the integration of more anomalous attack models.

Acknowledgment

This work is supported by the National High-Tech Research and Development Plan of China (863) under Grant No. 2006AA01Z225; the National Grand Fundamental Research Program of China (973) under Grant No. 2003CB314804 and 2006CB708301; the National Natural Science Foundation of China (NSFC) under Grant No. 60573122 and 60773138.
References

[1] Cisco, Cisco IOS NetFlow Technology Data Sheet. <http://www.cisco.comlao/NetFlow>.
[2] T. Chen, Intrusion detection for viruses and worms, IEC Annual Review of Communications 57 (Fall) (2004).
[3] Haining Wang, Danlu Zhang, Kang G. Shin, Detecting SYN flooding attacks, in: INFOCOM 2002. Twenty-First Annual Joint Conference of the IEEE Computer and Communications Societies, Proceedings IEEE 3 (23–27) (2002) 1530–1539.
[4] Vasilios A. Siris, Fotini Papagalou, Application of anomaly detection algorithms for detecting SYN flooding attacks, Global Telecommunications Conference 29 (3) (2004) 2050–2054.
[5] Seung-won Shin, Ki-young Kim, Jong-soo Jang, D-SAT: detecting SYN flooding attack by two-stage statistical approach, applications and the Internet, in: Proceedings, The 2005 Symposium on 31 January–4 February 2005, pp. 430–436.
[6] J.B.D. Caberera, T.B. Ravichandran, R.K. Mehra, Statistical traffic modeling for network intrusion detection, in: Proceedings of 8th International Symposium on Modeling, Analysis and Simulation of Computer and Telecommunication Systems, 2000, 29(1), 2000, pp. 466–473.
[7] John E. Dickerson, Jukka Juslin, Ourania Koukousoula, Julie A. Dickerson, Fuzzy intrusion detection, in: IFSA World Congress and 20th NAFIPS International Conference 9(3), 2001, vol. 1506–1510.
[8] R.C. Garcia, M.N.O. Sadiku, J.D. Cannady, WAID: wavelet analysis intrusion detection, circuits and systems, 2002, in: M-WSCAS-2002, The 2002 45th Midwest Symposium, vol. 3, 4–7 August 2002, pp. III-688–III-691.
[9] M. Li, W. Jia, W. Zhao, Decision analysis of network based intrusion detection systems for denial-of-service attacks, in: Proceedings, IEEE Conferences on Info-tech and Infonet, 2001.
[10] Yiming Gong, Detecting Worms, and Anomaly Activities with NetFlow. <http://www.securityfocus.com/infocus/1796>.
[11] W. Richard Stevens, TCP/IP Illustrated (Vol. 1: The Protocols), Addison Wesley, 1994.
[12] P. Huang, A. Feldmann, A.C. Gilbert, W. Willinger, Dynamics of IP traffic: a study of the role of variability and the impact of control, in: ACM SIGCOMM'99, vol. 29, Massachusetts, USA, 1999.
[13] Shiuh-Pyng Shieh, Virgil D. Gligor, On a pattern-oriented model for intrusion detection, IEEE Transactions on Knowledge and Data Engineering 9 (4) (1997).
[14] Shiuh-Pyng Shieh, Virgil D. Gligor, A pattern-oriented intrusion detection system and its applications, in: Proceedings of IEEE Symposium on Research in Security and Privacy, Oakland, CA, May 1991, pp. 327–342.
[15] Sandeep Kumar, Eugene H. Spafford, A pattern matching model for misuse intrusion detection, in: Proceedings of the 17th National Computer Security Conference, Baltimore, MD, 1994.
[16] C.J. Coit, S. Staniford, J. McAlerney, Towards faster string matching for intrusion detection or exceeding the speed of snort, in: DARPA Information Survivability Conference and Exposition (DISCEX II 01), Anaheim, CA, June 2001.
[17] B.E. Brodsky, B.S. Darkhovsky, Nonparametric Methods in Change-point Problems, Kluwer Academic Publishers, Dordrecht, 1993.
[18] M. Basseville, I.V. Nikiforov, Detection of Abrupt Changes: Theory and Application, Prentice Hall, Englewood Cliffs, NJ, 1993.
[19] V. Paxson, S. Floyd, Wide-area traffic: the failure of Poisson modeling, IEEE/ACM Trans Networking 3 (3) (1995).
[20] T. Peng, C. Leckie, K. Ramamohanarao, Detecting reflector attacks by sharing beliefs, in: Proceedings of the IEEE 2003 Global Communications Conference (Globecom 2003), vol. 3, San Francisco, California, USA, 2003b, pp. 1358–1362.
[21] T. Peng, C. Leckie, K. Ramamohanarao, Proactively detecting DDoS attack using source IP address monitoring, in: Proceedings of Networking 2004, Athens, Greece, 2004, pp. 771–782.
[22] Harold S. Javitz, Alfonso Valdes, The NIDES statistical component: description and justification, SRI International, March 1993.
[23] Zheng Zhang, Jun Li, C.N. Manikopoulos, Jay Jorgenson, Jose Ucles, HIDE: a hierarchical network intrusion detection system using statistical preprocessing and neural network classification, in: Proceedings of the 2001 IEEE Workshop on Information Assurance and Security, United States Military Academy, West Point, NY, 5–6 June, 2001.
[24] Daniel Q. Naiman, Statistical anomaly detection via httpd data analysis, Computational Statistics & Data Analysis (2004) 51–67.
[25] Wenke Lee, Salvatore J. Stolfo, Kui W. Mok, A data mining framework for building intrusion detection models, in: Proceedings of the 20th IEEE Symposium on Security and Privacy, Oakland, CA, 1999.
[26] Wenke Lee, Salvatore J. Stolfo, Data mining approaches for intrusion detection system, in: Proceedings of the 7th USENIX Security Symposium, San Antonio, TX, January, 1998.
[27] Bertrand Portier, Jerome Froment, Data mining techniques for intrusion detection, Data mining term paper, The University of Texas, Spring, 2000.
[28] Srinivas Mukkamala, Guadalupe Janoski, Andrew Sung, Intrusion detection using neural networks and support vector machines, Appeared in IEEE IJCNN, May 2002.
[29] Herve Debar, Monique Becker, Didier Siboni, A neural network component for an Intrusion Detection System, in: Proceedings of the 1992 IEEE Computer Society Symposium on Research in Computer Security and Privacy, 1992, pp. 240–250.
[30] Jake Ryan, Meng-Jang Lin, Intrusion detection with neural networks, Advances in Neural Information Processing Systems, vol. 10, MIT Press, 1998.
[31] Ludovic Me, GASSATA, a genetic algorithm as an alternative tool for security audit trail analysis, in: 1st International Conference on the Recent Advances in Intrusion Detection, Belgium, 1998.
[32] Susan M. Bridges, Rayford B. Vaughn, Fuzzy data mining and genetic algorithms, applied to Intrusion Detection.
[33] Wei Li, Using Genetic Algorithm for Network Intrusion Detection, SANS Institute, 2004.
[34] Kathia Regina L. Juca, Azzedine Boukerche, Human immune anomaly and misuse based detection for computer system operations: part II, Proceedings of the International Parallel and Distributed Processing Symposium (IPDPS'03), IEEE, 2003.
[35] Hiroyuki Nishiyama, Fumio Mizoguchi, Design of security system based on immune system, in: Proceedings of the 10th International Workshops on Enabling Technologies: Infrastructure for Collaborative Enterprises (WETICE'01), IEEE, 2001.
[36] Stephanie Forrest, Thomas A. Longstaff, A sense of self for UNIX processes, in: Proceedings of 1996 IEEE Symposium on Computer Security and Privacy, Los Alamos, CA, pp. 120–128.
[37] Jan van Lunteren, High-performance pattern-matching for intrusion detection, in: Proceedings of IEEE INFOCOM 2006, April 2006, pp. 1–13.
[38] Z.K. Baker, V.K. Prasanna, High-throughput linked-pattern matching for intrusion detection systems, in: Proceedings of the First Annual ACM Symposium on Architectures for Networking and Communications Systems, 2005.
[39] L. Tan, T. Sherwood, Architectures for bit-split string scanning in intrusion detection, IEEE Micro (January–February) (2006).
[40] N.S. Artan, H. Jonathan Chao, TriBiCa: Trie bitmap content analyzer for high-speed network intrusion detection, in: Proceedings of IEEE INFOCOM 2007, May 2007, pp. 125–133.
[41] C.-T. Huang, S. Thareja, Y.-J. Shin, Wavelet-based real time detection of network traffic anomalies, in: Proceedings of Workshop on Enterprise Network Security (WENS 2006) (in assoc. with Second SecureComm), August 2006.
[42] T. Dübendorfer, B. Plattner, Host behaviour based early detection of worm outbreaks in internet backbones, in: Proceedings of 14th IEEE WET ICE/STCA Security Workshop, IEEE, 2005.
[43] T. Dübendorfer, B. Plattner, A framework for real-time worm attack detection and backbone monitoring, in: Proceedings of IWCIP 2005, November 2005.
Bin Liu is an M.E. student in the Department of Computer Science at Tsinghua University, China. His research interests include sensor networks, performance evaluation and network management. He received his B.S. in computer science and technology from Beijing University of Posts and Telecommunications, China.
Chuang Lin is a professor in the Department of Computer Science and Technology, Tsinghua University, Beijing, China. He received his Ph.D. degree in Computer Science from Tsinghua University in 1994. His current research interests include computer networks, performance evaluation, network security analysis, and Petri net theory and its applications. He has published more than 300 papers in research journals and IEEE conference proceedings in these areas and has published three books. Professor Lin is a member of the ACM Council, a senior member of the IEEE and the Chinese Delegate in TC6 of IFIP. He serves as Technical Program Vice Chair of the 10th IEEE Workshop on Future Trends of Distributed Computing Systems (FTDCS 2004); General Chair of the ACM SIGCOMM Asia Workshop 2005; Associate Editor of IEEE Transactions on Vehicular Technology; Area Editor of the Journal of Computer Networks; and Area Editor of the Journal of Parallel and Distributed Computing.
Jian Qiao is an M.E. student in the School of Telecommunication Engineering, Beijing University of Posts and Telecommunications, Beijing, China. He received his B.S. degree in Telecommunication Engineering from Beijing University of Posts and Telecommunications. His current research interests include distributed systems, grid computing and optical networks.
Jianping He is an M.E. candidate in the Department of Computer Science at Tsinghua University, Beijing, P.R. China. His research interests include multi-hop wireless networks and network management. He received his B.S. in computer science from Beijing University of Posts and Telecommunications in 2006.
Peter Ungsunan is a doctoral student in the Department of Computer Science and Technology at Tsinghua University. Previously he taught at Beihang University and worked as a networking engineer at Lucent Technologies. He received MS and MBA degrees in CIS and Finance from the City University of New York and BS degrees in Electrical and Mechanical Engineering from Rensselaer Polytechnic Institute.