Accelerating the bit-split string matching algorithm using Bloom filters
Kun Huang (a), Dafang Zhang (b,c,*), Zheng Qin (c)

(a) Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, PR China
(b) School of Computer and Communication, Hunan University, Changsha, Hunan Province 410082, PR China
(c) School of Software, Hunan University, Changsha, Hunan Province 410082, PR China
* Corresponding author. E-mail addresses: huangkun09@ict.ac.cn (K. Huang), dfzhang@hunu.edu.cn (D. Zhang), qinzheng@hunu.edu.cn (Z. Qin).

Computer Communications 33 (2010) 1785-1794. doi:10.1016/j.comcom.2010.04.043. Received 19 July 2008; received in revised form 26 April 2010; accepted 28 April 2010; available online 5 May 2010.

Keywords: Intrusion detection; String matching; Bloom filter; Bit-split; Byte filter

Abstract

Deep Packet Inspection (DPI) is essential for network security; it scans both packet headers and payloads to search for predefined signatures. As Internet link rates and traffic volumes keep growing, string matching using a Deterministic Finite Automaton (DFA) becomes the performance bottleneck of DPI. The recently proposed bit-split string matching algorithm suffers from an unnecessary state transitions problem. The root cause is that the bit-split algorithm performs pattern matching in a "not seeing the forest for the trees" manner: each tiny DFA processes only a b-bit substring of each input character and cannot check whether the entire character belongs to the alphabet of the original DFA. This paper presents a byte-filtered string matching algorithm, in which Bloom filters preprocess each character of every incoming packet to check whether the character belongs to the original alphabet before bit-split string matching is performed. If not, each tiny DFA either makes a transition to its initial state or stops any state transition. Experimental results demonstrate that, compared with the bit-split algorithm, our byte-filtered algorithm greatly reduces the string matching time as well as the number of state transitions of the tiny DFAs on both synthetic and real signature rule sets.

© 2010 Elsevier B.V. All rights reserved.

1. Introduction

Computer networks are vulnerable to numerous break-in attacks, such as worms, spyware, and denial-of-service attacks [1,2]. These attacks intrude into publicly accessible computer systems to eavesdrop on and destroy sensitive data, and even hijack a large number of victim computers to launch new attacks. Hence, there has been widespread research interest in combating such attacks at every level, from end hosts to network taps to edge and core routers.

Network Intrusion Detection and Prevention Systems (NIDS/NIPS) have emerged as one of the most promising security components to safeguard computer networks. The worldwide NIDS/NIPS market is expected to grow from $932 million in 2007 to $2.1 billion in the next five years [3]. NIDS/NIPS, such as Snort [4] and Bro [5], perform real-time traffic monitoring to identify suspicious activities. Deep Packet Inspection (DPI) is the core of NIDS/NIPS; it scans both packet header and payload to search for predefined attack signatures. To define suspicious activities, DPI typically uses a set of rules which are applied to matching packets. A rule consists at least of a type of packet, a string of content (called a signature string) to match, a location where that string should be searched for, and an associated action to take if all the conditions of the rule are met.
DPI has also been widely used in a variety of other packet processing applications, such as the Linux Layer-7 filter [6], P2P traffic identification [7], context-based billing, etc. As Internet link rates and traffic volumes keep growing, DPI faces severe performance challenges. In high-speed routers, DPI is required to process packets at line speed with limited memory space.

First, string matching is computationally intensive. DPI uses string matching to compare every byte of every incoming packet against hundreds of thousands of rules for a potential match. Thus, string matching becomes the performance bottleneck of DPI. For example, the routines of the Aho-Corasick string matching algorithm [8] account for about 70% of total execution time and 80% of instructions [9]. Since software-based string matching algorithms cannot keep up with 10-40 Gbps packet processing, hardware-based string matching algorithms [10-25] have recently been proposed to improve the throughput of DPI. These hardware-based algorithms exploit modern embedded devices, such as Application Specific Integrated Circuits (ASIC), Field Programmable Gate Arrays (FPGA), Network Processors (NP), and Ternary Content Addressable Memory (TCAM), to perform high-speed packet processing, using their powerful parallelism and fast on-chip memory access.

Second, a large number of emerging threats result in an expanding rule set, which in turn leads to an explosion of memory requirements. String matching typically uses a Deterministic Finite Automaton (DFA) to represent multiple signature strings and track all possible matches at one time. However, a DFA requires large memory space to encode ever increasing signatures, and cannot fit in on-chip embedded memory of only about 10 Mb. Hence, it is critical to achieve high-throughput string matching in hardware with small memory requirements.

To improve the throughput and reduce the memory requirements of DPI, Tan et al. [19] proposed a bit-split string matching algorithm suitable for hardware implementation. In the bit-split algorithm, an original DFA of the Aho-Corasick algorithm is split apart into a set of parallel tiny DFAs, each of which is responsible for a portion of each signature string. However, the bit-split algorithm suffers from an unnecessary state transitions problem. The root cause is that the bit-split algorithm performs pattern matching in a "not seeing the forest for the trees" manner: each tiny DFA processes only a b-bit substring of each input character, but cannot check whether the entire character belongs to the alphabet of the original DFA. This can result in unnecessary state transitions of the tiny DFAs, ultimately consuming extra expensive CPU cycles and memory access bandwidth in hardware. Moreover, since the alphabet contains only a part of the possible characters, malicious attackers can exploit this weakness to overload the bit-split algorithm and evade DPI.

To address the above issues, this paper presents a byte-filtered string matching algorithm to improve the time efficiency of DPI.
The byte-filtered algorithm builds on the bit-split algorithm: Bloom filters are used to preprocess each character of every incoming packet and check whether the character belongs to the original alphabet, before bit-split string matching is performed. We show both theoretically and experimentally that, compared with the bit-split algorithm, the byte-filtered algorithm significantly reduces the unnecessary state transitions and improves the string matching time by several times.

Our contributions are as follows. First, to the best of our knowledge, we are the first to point out that the bit-split string matching algorithm suffers from an unnecessary state transitions problem. Second, we propose a byte-filtered string matching algorithm, which uses Bloom filters to filter out all characters that are not in the original alphabet before performing bit-split string matching. Third, experimental results on the Snort rule set show that, compared with the bit-split algorithm, the byte-filtered algorithm reduces the string matching time by up to 3.6 times and the number of state transitions by up to 54%.

The rest of the paper is organized as follows. Section 2 briefly introduces the bit-split algorithm, and its unnecessary state transitions problem is described in Section 3. In Section 4, we present the byte-filtered string matching algorithm in detail. Section 5 gives experimental results, and related work is discussed in Section 6. Finally, Section 7 concludes the paper.
2. Bit-split string matching overview

In essence, the bit-split string matching algorithm [19] is based on the Aho-Corasick algorithm [8]; it constructs a set of tiny alphabets and tiny DFAs which process every input character in parallel. For a given set of rules, the Aho-Corasick algorithm constructs an original alphabet and an original DFA. To clarify the presentation, a state of the original DFA is called an original state, whereas a state of a tiny DFA is called a tiny state.

The Aho-Corasick algorithm [8] is a classic exact string matching algorithm. Its key idea is to construct an original DFA that encodes all signature strings to be searched for. The original DFA is represented with a trie, which starts with an empty root node that is the initial state. The construction process of an original DFA is as follows. First, each signature string adds states to the trie, starting at the root and going to the end of the string. The trie has a branching factor equal to the number of unique characters in the original alphabet. Second, the trie is traversed and failure edges are added to shortcut from a partial suffix match of one string to a partial prefix match of another. String matching with the original DFA is straightforward: given a current state and an input character, the DFA either makes a normal transition on the character to a next state, or makes a failure transition. Fig. 1 depicts an example original DFA for the set of strings {he, she, his, hers}. In this DFA, the initial state is 0, and the accepting states are 2, 5, 7 and 9. As seen in Fig. 1, a solid line denotes a normal transition, while a dashed line denotes a failure transition to the initial state.

Fig. 1. Original DFA for a set {he, she, his, hers}.

The bit-split algorithm has two stages to construct a set of tiny alphabets and tiny DFAs. The first stage splits the original alphabet of the Aho-Corasick algorithm apart into a set of tiny alphabets. A character consists of m bytes (m >= 1); for example, an English character has one byte, while a Chinese character has two bytes. Each character is divided into a number of equal-sized portions called substrings, each with b bits. Each substring of every unique character in the original alphabet is then extracted to construct a tiny alphabet. The second stage splits the original DFA of the Aho-Corasick algorithm apart into a set of 8m/b tiny DFAs. For a given tiny state and a tuple of a tiny alphabet, we search for the subset of possible next original states, labeled as the next tiny state, to construct a tiny DFA. Note that a tiny state may contain a subset of original states, and a tiny state is marked accepting when it contains any accepting original state.

Fig. 2 depicts the original alphabet and two tiny alphabets for the set of strings {he, she, his, hers}. In Fig. 2(a), the original alphabet is Σ = {h, s, e, i, r}, and each unique character has one 8-bit byte. In Fig. 2(b), the first and second bits are extracted to construct the left tiny alphabet Σ_B01, and the third and fourth bits are extracted to construct the right tiny alphabet Σ_B23. In a tiny alphabet, each tuple (row) consists of a 2-bit substring and a subset of unique characters. For instance, the second tuple in Σ_B01 contains the substring 01 and the subset {e, i}.

Fig. 2(a) Original alphabet (character codes, bits 7..0):
  h  104  01101000
  s  115  01110011
  e  101  01100101
  i  105  01101001
  r  114  01110010

Fig. 2(b) Tiny alphabets:
  Σ_B01 (bits 1-0): 00 -> {h}, 01 -> {e, i}, 10 -> {r}, 11 -> {s}
  Σ_B23 (bits 3-2): 00 -> {s, r}, 01 -> {e}, 10 -> {h, i}, 11 -> {}

Fig. 2. Original alphabet and two tiny alphabets for a set {he, she, his, hers}.

Fig. 3 depicts the two tiny DFAs for the set of strings {he, she, his, hers}. The tiny DFA B01DFA corresponds to the tiny alphabet Σ_B01, while B23DFA corresponds to Σ_B23. B01DFA has the initial tiny state 0' and accepting tiny states 3', 6', 7' and 8', while B23DFA has the initial tiny state 0' and accepting tiny states 4', 6', 8' and 9'. Note that the tiny state 6' in B23DFA contains the subset of original states {2, 5}.

Fig. 3. Two tiny DFAs for a set {he, she, his, hers}.

The matching process of the bit-split algorithm is as follows. First, an input character is divided into a set of b-bit substrings. Second, each b-bit substring is fed to the corresponding tiny DFA, and then all tiny DFAs make transitions to their next tiny states in parallel. Finally, the actual output is the intersection of all the subsets of original states contained by the next tiny states, and the corresponding matched strings are reported. For example, assume that the input string {rhe} is read by the tiny DFAs in Fig. 3. B01DFA reads the 2-bit substrings {10, 00, 01}, while B23DFA reads the 2-bit substrings {00, 10, 01}. In Fig. 3(a), B01DFA makes the series of state transitions 0' -> 0' -> 1' -> 3'. At the same time, in Fig. 3(b), B23DFA makes the series of state transitions 0' -> 1' -> 3' -> 6'. States 0' and 1' are not accepting tiny states in B01DFA, and 0', 1' and 3' are not accepting tiny states in B23DFA. Hence, the intersection of {2} of 3' in B01DFA and {2, 5} of 6' in B23DFA is taken to output the original state {2}, and the matched string {he} is reported.
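As a concrete illustration of this split-and-intersect loop, the following is a minimal software sketch (our own code, not the authors' implementation; the TinyDfa layout and the function names are hypothetical), assuming b = 2 and one-byte characters as in Figs. 2 and 3:

    #include <algorithm>
    #include <array>
    #include <cstddef>
    #include <cstdint>
    #include <iterator>
    #include <set>
    #include <vector>

    // One tiny DFA over 2-bit substrings: a transition table plus, for every
    // tiny state, the subset of original accepting states it contains.
    struct TinyDfa {
        std::vector<std::array<int, 4>> next;   // next[tiny_state][substring]
        std::vector<std::set<int>> accepts;     // accepts[tiny_state]
    };

    // Extract the i-th 2-bit substring of an input byte (i = 0 is bits 1..0).
    inline int substring2(uint8_t c, int i) { return (c >> (2 * i)) & 0x3; }

    // Feed one byte to all tiny DFAs in parallel and return the intersection
    // of their partial match sets; a non-empty result identifies a match.
    std::set<int> step(std::vector<TinyDfa>& dfas, std::vector<int>& state, uint8_t c) {
        // Advance every tiny DFA on its own 2-bit substring of the byte.
        for (std::size_t j = 0; j < dfas.size(); ++j)
            state[j] = dfas[j].next[state[j]][substring2(c, static_cast<int>(j))];

        // The actual output is the intersection of all partial match sets.
        std::set<int> full = dfas[0].accepts[state[0]];
        for (std::size_t j = 1; j < dfas.size() && !full.empty(); ++j) {
            const std::set<int>& part = dfas[j].accepts[state[j]];
            std::set<int> tmp;
            std::set_intersection(full.begin(), full.end(), part.begin(), part.end(),
                                  std::inserter(tmp, tmp.begin()));
            full.swap(tmp);
        }
        return full;   // e.g. {2} after feeding "rhe", meaning "he" matched
    }

Building the transition tables themselves follows the two-stage construction described above; the sketch only shows the per-character matching step.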
3. Problem statement

The bit-split string matching algorithm suffers from an unnecessary state transitions problem. This is because a substring of an input character may belong to a tiny alphabet even if the entire character does not belong to the original alphabet. In the bit-split algorithm, the 8m-bit string of each input character is divided into a set of b-bit substrings, which are fed to the corresponding tiny DFAs for string matching. The 8m-bit string is the "forest", while each b-bit substring is a "tree". Since each b-bit substring of each unique character is extracted from the original alphabet Σ to construct a tiny alphabet Σ' = {0, 1, ..., 2^b - 1}, the substring always belongs to Σ', but the entire character may not belong to Σ. For example, assume that the 8-bit encoding of the input character c is 01100011. The first 2-bit substring is 11, and the second 2-bit substring is 00. As seen in Fig. 2, the two substrings belong to the corresponding tiny alphabets Σ_B01 and Σ_B23, whereas the entire character does not belong to the original alphabet Σ = {h, s, e, i, r}.

The bit-split algorithm thus performs pattern matching in a "not seeing the forest for the trees" manner: each tiny DFA processes only a b-bit substring of each input character, but cannot check whether the entire character belongs to the original alphabet. This results in unnecessary state transitions of the tiny DFAs, which in turn consume extra expensive CPU cycles and memory access bandwidth in hardware, and ultimately limit the throughput of DPI. Note that the Aho-Corasick algorithm does not have the unnecessary state transitions problem, because a failure transition (see Fig. 1) is made when reading any input character that is not in the original alphabet.

We illustrate an example of unnecessary state transitions in Fig. 3. When the input string is cxxn, B01DFA reads the 2-bit substrings {11, 00, 00, 10}, and B23DFA reads the 2-bit substrings {00, 10, 10, 11}. B01DFA makes the series of state transitions 0' -> 2' -> 4' -> 1' -> 0', depicted by dashed lines in Fig. 3(a). At the same time, B23DFA makes the series of state transitions 0' -> 1' -> 3' -> 5' -> 0' in Fig. 3(b). Each 2-bit substring of c, x, and n belongs to the corresponding tiny alphabet Σ_B01 or Σ_B23, but these characters do not belong to the original alphabet Σ = {h, s, e, i, r}. Consequently, B01DFA and B23DFA together make eight unnecessary state transitions, which incurs extra memory accesses.
To understand this issue better, we investigate the statistical distribution of unique characters in real network environments. The corpus of 8-bit characters has 2^8 = 256 unique characters, and the real original alphabet is a subset of this corpus. Fig. 4(a) depicts the unique character statistics in Snort 2.7 [26], and Fig. 4(b) depicts the statistics of the 55 out-of-alphabet unique characters in the second week of network traffic of the MIT DARPA 1999 evaluation set [27]. In Snort 2.7, there are about 50 types of attacks and 7840 signature rules in total. In Fig. 4(a), the x-axis is the number of signature rules for every five types of attacks, and the y-axis is the number of unique characters. Fig. 4(a) shows that the 7840 signature rules together contain 201 unique characters, so 55 unique characters remain that are not contained in the Snort rule set. In the MIT DARPA 1999 evaluation set, the second week of network traffic includes normal background traffic and labeled attack traffic. Fig. 4(b) shows that 11.4% of all network packets contain some of the 55 unique characters that are not in the original alphabet of Snort. Hence, malicious attackers can exploit this to insert a large number of out-of-alphabet characters into packet payloads, e.g. 100 copies of the input string cxxn, which causes frequent unnecessary state transitions of the tiny DFAs and overloads the bit-split algorithm to evade inspection.

Fig. 4. Unique characters statistics in real network applications: (a) unique characters statistics in Snort 2.7; (b) statistics of the 55 out-of-alphabet unique characters in the second week of DARPA 1999 traffic (at most 11.4% of packets).

4. Byte-filtered string matching

In this section, we present a byte-filtered string matching algorithm to overcome the unnecessary state transitions problem. The core idea of the byte-filtered algorithm is that each character of every incoming packet is filtered by checking whether the character belongs to the original alphabet, before performing string matching. If the character is in the original alphabet, the character is divided into a set of substrings, and all tiny DFAs make state transitions on the substrings in parallel for bit-split string matching. If not, each tiny DFA either makes a failure transition or stops any state transition.

Furthermore, we propose an optimization scheme to speed up the byte-filtered algorithm. In this scheme, given that an input character is not in the original alphabet, each tiny DFA checks whether its current state is the initial state before making a state transition. If not, the tiny DFA makes a transition to the initial state; otherwise, the tiny DFA stops any state transition. The main purpose of the optimization scheme is to further avoid unnecessary state transitions in hardware and improve the throughput of DPI.

In the byte-filtered algorithm, we use Bloom filters as the byte filter to preprocess every input character. In real applications, a character consists of one or multiple bytes. Unicode [28] typically uses two bytes to provide a unique number for every character, no matter what the platform, no matter what the program, no matter what the language. When a character has two bytes, the corpus has 65536 unique characters. If a plain bit vector of size 65536 were used to represent all such characters, it would consume 64 Kbits of memory, which is not suitable for costly, small on-chip memory in hardware. Hence, since Bloom filters are a space-efficient data structure for fast membership queries, our byte filter adopts Bloom filters to quickly check each input character of multiple bytes.

In the remainder of this section, we first briefly introduce Bloom filters, then present the architecture of the byte-filtered string matching engine, and finally analyze the performance gains of the byte-filtered algorithm.

4.1. Bloom filters overview

Bloom filters [29] are a simple, space-efficient data structure for fast approximate set-membership queries. They are widely used in network applications, from Web caching to P2P collaboration to packet processing [30]. A Bloom filter represents a set S of n items by an array V of m bits, initially all set to 0. It uses k independent hash functions h1, h2, ..., hk with range [1, ..., m]. For each item s in S, the bits V[hi(s)] are set to 1, for 1 <= i <= k. To query whether an item u is in S, we check whether all bits V[hi(u)] are set to 1. If not, u is not in S. If all bits V[hi(u)] are set to 1, u is assumed to be in S.

Fig. 5 illustrates an example of a Bloom filter with a bit vector of size m = 12 and k = 3 hash functions. As seen in Fig. 5(a), item s1 hashes to the bit locations {2, 5, 9}, and item s2 hashes to the bit locations {5, 7, 9}; thus the bit locations {2, 5, 9} and {5, 7, 9} are set to 1. As seen in Fig. 5(b), to query items u1 and u2, we hash u1 and u2 to the bit locations {2, 5, 9} and {5, 8, 10}, respectively. Since the bit locations {2, 5, 9} are all set to 1, u1 is reported to be in the set. But u2 is not in the set, because the bit locations {5, 8, 10} are not all set to 1. A Bloom filter may yield a false positive, where it suggests that an item is in the set even though it is not; a false positive arises because the queried bits in the bit vector may have been set by other items.

Fig. 5. An example of Bloom filters: (a) construction process; (b) query process.
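The following compact sketch shows such a filter specialised to 16-bit characters (our own code, not the authors'; the ByteFilter name and its interface are hypothetical, and the H3-style hashing anticipates the hardware-oriented choice discussed in Section 4.2):

    #include <cstddef>
    #include <cstdint>
    #include <random>
    #include <vector>

    // A small Bloom filter usable as the byte filter. It stores 16-bit
    // characters and uses k H3-style hash functions, i.e. each hash XORs a
    // random constant for every bit that is set in the key.
    class ByteFilter {
    public:
        ByteFilter(std::size_t m_bits, int k, uint32_t seed = 1)
            : bits_(m_bits, false), d_(k, std::vector<std::size_t>(16)) {
            std::mt19937 rng(seed);
            for (auto& row : d_)
                for (auto& coeff : row) coeff = rng() % m_bits;  // H3 coefficients
        }
        void insert(uint16_t x) {                 // add one alphabet character
            for (const auto& row : d_) bits_[h3(row, x)] = true;
        }
        bool maybe_contains(uint16_t x) const {   // false positives possible,
            for (const auto& row : d_)            // false negatives impossible
                if (!bits_[h3(row, x)]) return false;
            return true;
        }
    private:
        static std::size_t h3(const std::vector<std::size_t>& row, uint16_t x) {
            std::size_t h = 0;
            for (int b = 0; b < 16; ++b)
                if (x & (1u << b)) h ^= row[b];   // XOR of the selected constants
            return h;
        }
        std::vector<bool> bits_;                  // the m-bit vector V
        std::vector<std::vector<std::size_t>> d_; // k rows of 16 coefficients
    };

The byte filter would be populated once with every unique character of the original alphabet (for Snort 2.7, the 201 characters of Fig. 4(a)) and then queried once per input character.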
Fortunately, Bloom filters never yield a false negative. The false positive probability is

  f = (1 - e^{-kn/m})^k                                            (1)

where n is the size of S, m is the size of V, and k is the number of hash functions. For a given n, we choose appropriate values of m and k to reduce the false positive probability. The false positive probability is minimized when k = (m/n) ln 2, and the corresponding minimal false positive probability is

  f = (1/2)^k ≈ (0.6185)^{m/n}                                     (2)

The space savings of Bloom filters usually outweigh the cost of false positives when the false positive probability is made sufficiently small.

The false positives of Bloom filters do not affect the correctness of our byte-filtered string matching algorithm. When an input character generates a false positive, the character is divided into a set of equal-sized substrings, each of which is fed to the corresponding tiny DFA to make a state transition. Since the input character is not in the original alphabet, the intersection of all the original state subsets contained by the next tiny states is empty, which means no accepting original state is reported. Hence, a false positive of the Bloom filter only incurs a few extra state transitions, but cannot cause a wrong match. Moreover, we can choose appropriate parameters to minimize the false positive probability of the byte-filtered algorithm, which further reduces the unnecessary state transitions of the tiny DFAs.
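As a quick numerical illustration of formulas (1) and (2) (our own arithmetic, not a result from the paper), suppose the byte filter allocates m/n = 16 bits per unique character of the original alphabet. Then

  k = (m/n) ln 2 ≈ 11 and f ≈ (0.6185)^16 ≈ 4.6 x 10^-4,

so only about one out-of-alphabet character in every two thousand slips past the filter and triggers needless tiny-DFA transitions.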
4.2. Byte-filtered string matching engine

In this section, we present the architecture of the byte-filtered string matching engine, which is suitable for hardware implementation on platforms such as FPGA and ASIC. We also describe optimization schemes for the hardware design and implementation of the engine.

Fig. 6 depicts a top-level diagram of the byte-filtered string matching engine architecture. First, the byte-filtered engine reads as input a data stream with window size w. Second, each input character of multiple bytes is transferred to the byte filter, which checks once per clock cycle whether the character is in the original alphabet. If not, the four tiny DFAs B01DFA, B23DFA, B45DFA, and B67DFA either make a transition to the initial tiny state or stop any state transition. If the input character is in the original alphabet, the bytes of the character are divided into four parts, each of which is fed to the corresponding tiny DFA. The four tiny DFAs output the partial match vectors PMV0, PMV1, PMV2, and PMV3, which are transferred to a logical bitwise AND unit to take their intersection. Finally, the full match vector is output and the matched signatures are reported to system administrators.

Fig. 6. Byte-filtered string matching engine architecture.

Furthermore, we can optimize the hardware design and implementation of the byte-filtered string matching engine as in previous work [19,31]. The main purpose of these optimizations is to further speed up high-speed packet content filtering. (1) For a given character, our byte-filtered string matching engine implements four tiny DFAs on parallel processing units in hardware. Theoretical analysis of the bit-split string matching algorithm [19] shows that dividing an 8-bit character into four 2-bit substrings, i.e. using four parallel tiny DFAs, minimizes the memory requirements of the bit-split algorithm. Thus, the byte-filtered algorithm uses the same parameters as the bit-split algorithm to achieve space efficiency. (2) For the byte filter that uses Bloom filters, we choose appropriate parameters to speed up the byte-filtered algorithm. The Bloom filters adopt a class of universal hash functions called H3 [32], which is suitable for hardware implementation. An item X is represented as X = {x1, x2, ..., xb}. The ith hash function hi on X is calculated as hi(X) = (di1 . x1) XOR (di2 . x2) XOR ... XOR (dib . xb), where "." is a bitwise AND operator, XOR is a bitwise exclusive-or operator, and dij is a predefined random number in the range [1, ..., m]. We choose the number of H3 hash functions as k = floor((m/n) ln 2) to minimize the false positive probability, which achieves time efficiency.
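In software terms, the per-character control flow of the engine can be sketched as follows (our own illustration, reusing the hypothetical TinyDfa/step and ByteFilter types from the earlier sketches; the real engine evaluates the byte filter and the four tiny DFAs as parallel hardware units in one clock cycle):

    #include <algorithm>
    #include <cstdint>
    #include <set>
    #include <vector>

    // Scan one payload with the byte-filtered engine: query the Bloom filter
    // first, reset the tiny DFAs on out-of-alphabet characters, and otherwise
    // split the character and intersect the partial match vectors.
    std::vector<int> scan(const std::vector<uint8_t>& payload,
                          const ByteFilter& filter,
                          std::vector<TinyDfa>& dfas) {
        std::vector<int> state(dfas.size(), 0);    // every tiny DFA starts at 0'
        std::vector<int> matches;                  // accepting original states seen
        for (uint8_t c : payload) {
            if (!filter.maybe_contains(c)) {
                // Character is not in the original alphabet: return to (or stay
                // in) the initial tiny state instead of making useless moves.
                std::fill(state.begin(), state.end(), 0);
                continue;
            }
            std::set<int> full = step(dfas, state, c);   // split, step, AND
            matches.insert(matches.end(), full.begin(), full.end());
        }
        return matches;
    }

A Bloom-filter false positive simply falls through to the normal bit-split path, where the empty intersection guarantees that no spurious match is reported.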
4.3. Analysis

In this section, we analyze and compare the byte-filtered algorithm with the bit-split algorithm in terms of the number of state transitions. Assume that there is an incoming string of n characters, and that a fraction q of these characters are not in the original alphabet. Each character has 8m bits, and the byte filter uses Bloom filters with false positive probability f. When each input character is divided into a set of b-bit substrings, there are 8m/b tiny DFAs. For the string of n characters, the bit-split algorithm makes N_bs = 8mn/b state transitions, while the byte-filtered algorithm makes

  N_bf = (1 - q) * 8mn/b + f * q * 8mn/b + 8md/b
       = 8mn/b - (1 - f) * q * 8mn/b + 8md/b                       (3)

where d is the number of groups of consecutive characters that are not in the original alphabet.

Compared with the bit-split algorithm, the byte-filtered algorithm therefore achieves a performance gain, in terms of the number of state transitions, of

  Gain = (N_bs - N_bf) / N_bs = ((1 - f) * q * n - d) / n          (4)

Formula (4) shows that the achievable performance gain depends on the false positive probability of the Bloom filters, and on the fraction and the number of groups of input characters that are not in the original alphabet.

When there is only one group of consecutive characters that are not in the original alphabet (d = 1), the byte-filtered algorithm achieves the maximal performance gain

  Gain_max = ((1 - f) * q * n - 1) / n ≈ (1 - f) * q = (1 - (1/2)^k) * q   (5)

Fig. 7 depicts the maximal performance gain of the byte-filtered algorithm, where k is the number of hash functions and q is the fraction of characters that are not in the original alphabet. When k = 10 and q = 0.1-0.9, the maximal performance gain varies from 10% to 89.9%. Fig. 7 also shows that when k = 2-10 and q = 0.1-0.9, the maximal performance gain varies from 7.5% to 89.9%.

Fig. 7. Maximal performance gain of the byte-filtered algorithm.

On the other hand, the maximal number of groups of consecutive characters that are not in the original alphabet is

  d_max = (1 - (1 - f) * q) * n,  if q >= 1 / (2(1 - f))
  d_max = (1 - f) * q * n,        if 0 <= q < 1 / (2(1 - f))       (6)

Accordingly, for a given d_max, the byte-filtered algorithm achieves the minimal performance gain

  Gain_min = 2(1 - f) * q - 1 ≈ 2(1 - (1/2)^k) * q - 1,  if q >= 1 / (2(1 - f))
  Gain_min = 0,                                          if 0 <= q < 1 / (2(1 - f))   (7)
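To get a feel for these bounds, plug in the traffic measurement from Section 3 (our own arithmetic, not a figure from the paper): with k = 10 hash functions, f = (1/2)^10 ≈ 0.001, and with q = 0.114 as observed in the DARPA 1999 trace, formula (5) gives

  Gain_max ≈ (1 - f) * q ≈ 0.113,

i.e. roughly one tiny-DFA state transition in nine is avoided on that traffic mix, and more as attackers pad payloads with additional out-of-alphabet characters.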
5. Experimental evaluation

In this section, we present software simulation experiments on synthetic and real signature rule sets to compare the byte-filtered algorithm with the bit-split algorithm. In the experiments, we measure the string matching time and the number of state transitions as performance metrics.

We designed and implemented both the bit-split and byte-filtered algorithms in C/C++, and ran the programs on a machine with a 3 GHz Intel Pentium M CPU and 512 MB of main memory. We synthesize or choose a set of signature strings, and generate a set of testing strings to evaluate the performance of string matching. The simulation process is as follows. First, the simulator converts the set of signature strings into one DFA, and then the DFA is split into a set of tiny DFAs for both the bit-split and byte-filtered algorithms. Second, the simulator feeds the set of testing strings to all the tiny DFAs, which reside entirely in main memory during the matching phase. Note that the simulator measures the elapsed CPU clock and counts the number of state transitions of the tiny DFAs.

We conduct extensive experiments in three scenarios: the 1-bit split, 2-bit split, and 4-bit split settings. In both the bit-split and byte-filtered algorithms, the b-bit split scenario means that each input character of 8m bits is divided into a set of b-bit substrings, so there are 8m/b tiny DFAs. In all three scenarios, we set the number of hash functions to a constant k = 10 to minimize the false positives of the Bloom filters, which practically eliminates the unnecessary state transitions of the tiny DFAs.

5.1. On synthetic rule set

We developed a string stream generator in C/C++ to synthesize the evaluation set, including the signature and testing strings. The generator randomly generates 20-100 signature strings and 1000 testing strings from the ASCII character subset {'0'-'9', 'a'-'z', 'A'-'Z'}. Each signature string has 10 bytes, and each testing string has 1000 bytes. The alphabet size of the signature strings varies from 10 to 40, while the alphabet of the testing strings has 42 unique characters that are randomly chosen from the specified character subset.
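The generator itself is straightforward; a sketch along the lines described above (our own reconstruction, since the authors' generator is not published; function and constant names are hypothetical) is:

    #include <cstddef>
    #include <random>
    #include <string>
    #include <vector>

    // Corpus of 62 admissible characters: '0'-'9', 'a'-'z', 'A'-'Z'.
    static const std::string kCorpus =
        "0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ";

    // Generate `count` random strings of `length` bytes over the first
    // `alphabet_size` corpus characters (the paper picks the subset at random;
    // a fixed prefix keeps the sketch short).
    std::vector<std::string> make_strings(std::size_t count, std::size_t length,
                                          std::size_t alphabet_size, std::mt19937& rng) {
        std::uniform_int_distribution<std::size_t> pick(0, alphabet_size - 1);
        std::vector<std::string> out(count);
        for (auto& s : out)
            for (std::size_t i = 0; i < length; ++i) s += kCorpus[pick(rng)];
        return out;
    }

    // Example: 100 signature strings of 10 bytes over 20 characters, and
    // 1000 testing strings of 1000 bytes over 42 characters.
    // std::mt19937 rng(2010);
    // auto signatures = make_strings(100, 10, 20, rng);
    // auto traffic    = make_strings(1000, 1000, 42, rng);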
Fig. 8 depicts the string matching time under a variable number of signature rules; here the alphabet of the signature strings has 20 unique characters. Fig. 8(a) compares the matching time of the byte-filtered algorithm with that of the bit-split algorithm, and Fig. 8(b) gives the matching time ratio of the bit-split algorithm to the byte-filtered algorithm.

Fig. 8(a) shows that as the number of signature rules increases, the matching time of the bit-split algorithm increases quickly, while that of the byte-filtered algorithm increases slowly. For example, when the number of signature rules increases from 20 to 100 in the 1-bit split scenario, the matching time of the bit-split algorithm increases from 21307 to 81066 unit times, a factor of 3.8, while that of the byte-filtered algorithm increases from 2813 to 5195 unit times, only a factor of 1.8. Fig. 8(b) shows that the matching time ratio increases with the number of signature rules, and decreases as the split width b increases. For example, when the number of signature rules increases from 20 to 100, the matching time ratio increases from 7.6 to 15.6 in the 1-bit split scenario, while it increases only from 2.6 to 4.4 in the 4-bit split scenario. Fig. 8(b) also shows that, compared with the bit-split algorithm, the byte-filtered algorithm reduces the string matching time by 2.6-15.6 times.

Fig. 8. String matching time under variable number of signature rules: (a) matching time; (b) matching time ratio.

Fig. 9 depicts the number of state transitions under a variable number of signature rules. As seen in Fig. 9, the number of state transitions decreases as the split width b increases, since there are 8m/b tiny DFAs and the number of state transitions is proportional to the number of parallel tiny DFAs. Fig. 9 shows that, compared with the bit-split algorithm, the byte-filtered algorithm reduces the state transitions of the tiny DFAs by about 44%.

Fig. 9. Number of state transitions under variable number of signature rules.

Furthermore, we conduct experiments to explore how the number of characters in the alphabet impacts the string matching time. In these experiments, the alphabet has from 10 to 40 unique characters, randomly chosen from the corpus of 62 specified characters, and the split width is b = 2 bits. Figs. 10 and 11 depict the number of state transitions and the string matching time under a variable number of characters in the alphabet, respectively. As there are four tiny DFAs in the 2-bit split scenario, when 1,000,000 bytes are read the bit-split algorithm makes 4,000,000 state transitions in total. Fig. 10 shows that as the number of characters grows, the number of state transitions of the byte-filtered algorithm approaches that of the bit-split algorithm.

Fig. 10. Number of state transitions under variable number of characters in the alphabet.

Fig. 11(a) shows that the string matching time increases with the number of characters. For example, with 80 signature rules, when the number of characters increases from 10 to 40, the string matching time of the bit-split algorithm increases from 37504 to 49816 unit times, while that of the byte-filtered algorithm increases from 2598 to 13375 unit times. Fig. 11(a) also shows that the string matching time of the byte-filtered algorithm grows more slowly than that of the bit-split algorithm. Fig. 11(b) shows that as the number of characters grows, the matching time ratio decreases. For example, when the number of characters increases from 10 to 40 and there are 80 signature rules, the matching time ratio decreases from 14.4 to 3.7. Fig. 11(b) also shows that, with 10, 20, 30, and 40 unique characters in the alphabet, the matching time ratio increases with the number of rules from 7.5 to 14.4, from 4.5 to 8.3, from 3.3 to 5.2, and from 2.7 to 3.7, respectively.

Fig. 11. String matching time under variable number of characters in the alphabet: (a) matching time; (b) matching time ratio.

5.2. On real rule set

We use a set of Snort signature rules to compare the performance of the bit-split and byte-filtered algorithms. In Snort 2.7, there are 50 types of attacks in the rule set. We choose the first 20 types of attacks, from Scan to RPC, each of which has fewer than 100 signature rules. This is because DFA construction has high time and space complexity: when all 50 types of attacks are compiled, it is difficult to construct one original DFA and a set of tiny DFAs due to the explosive memory requirements. In practice, Snort often partitions all the signature rules into a large number of groups according to the source and destination ports, which facilitates constructing separate small DFAs. Hence, the 20 types of attacks are enough to reflect real network application requirements. Fig. 12 depicts the statistics of the 20 types of attacks in the Snort 2.7 rule set; the number of signature rules per attack type varies from 9 to 72.

Fig. 12. 20 types of attacks in Snort 2.7 rule set.
In this experiment, we synthesize 1000 testing strings, each of 1000 bytes, to examine the string matching performance in the 2-bit split scenario. We set the optimal parameters of both the bit-split and byte-filtered algorithms to minimize memory requirements and achieve time efficiency.

Fig. 13 depicts the string matching time on the Snort 2.7 rule set. Fig. 13(a) compares the matching time of the bit-split and byte-filtered algorithms, and Fig. 13(b) gives the matching time ratio of the bit-split algorithm to the byte-filtered algorithm. As seen in Fig. 13(a), on the 24 signature rules of DDoS, the bit-split algorithm consumes 48055 unit times, while the byte-filtered algorithm consumes only 16758 unit times. Fig. 13(b) shows that, compared with the bit-split algorithm, the byte-filtered algorithm reduces the string matching time by 1.6-3.6 times across the 20 types of attacks in the 2-bit split scenario.

Fig. 13. String matching time on Snort 2.7 rule set: (a) string matching time; (b) matching time ratio.

Fig. 14 depicts the unnecessary state transitions percentage on the Snort 2.7 rule set. As seen in Fig. 14, on the 24 signature rules of DDoS, the byte-filtered algorithm eliminates unnecessary state transitions amounting to about 33% of the total state transitions. Fig. 14 shows that, compared with the bit-split algorithm, the byte-filtered algorithm reduces the state transitions of the tiny DFAs by 7-54% across the 20 types of attacks in the 2-bit split scenario.

Fig. 14. Unnecessary state transitions percent on Snort 2.7 rule set.

6. Related work

String matching is a well studied classic problem in computer science. It has been widely used in a range of applications, such as information retrieval, pattern recognition, and network security. String matching algorithms can be classified into single-pattern and multi-pattern algorithms. In a single-pattern string matching algorithm, a single pattern string is searched for in the text at a time. This means that
The FPGA-based solution [11,13,18] can achieve the throughput of about 10 Gbps, but require the frequent time-consuming recompilation and reconfiguration of hardware circuit. The ASIC-based solution [19,20] is proposed by compiling the state automata into the silicon die. The solution can achieve about 20 Gbps throughput, but cannot support incremental updates. The TCAM-based solution [17,21,22] is proposed by exploiting multiple memory accesses in one I/O time cycle. The TCAM-based solution can achieve the high throughput of above 20 Gbps, but suffers from the expensive cost and high power consumption, which prevents their widespread applications. It is challenging for hardware-based string matching algorithms to keep up with high-speed packet processing with limited on-chip memory. In recent years, many memory-efficient string matching algorithms have been proposed to reduce the memory requirements. Tuck et al. [15] propose the bitmap compression and path compression schemes to reduce the amount of memory requirements of Aho–Corasick algorithm and enhance the worst case performance. Dharmapurikar and Lockwood [18] modify Aho– Corasick algorithm to scan multiple characters at a time, and suppressed the unnecessary memory accesses using Bloom filters. Lu et al. [21] implement a memory-efficient multiple-character parallel DFA using Binary Content Addressable Memory. Hua et al. [25] propose a variable-stride DFA using the winnowing scheme to speedup the string matching. Piyachon et al. [23] describe a memory model for parallel Aho–Corasick algorithms and find the optimal multi-byte bit-split string matching algorithm with the minimal memory requirements on network processors. Lunteren [20] and Song et al. [24] propose the B-FSM and Cached DFA to compress the state transitions of DFA. Dharmapurikar et al. [11] propose parallel Bloom filters to handle a set of signature strings with different lengths, in order to achieving the line-speed string matching with FPGA implementation. The expensive power and flexibility of regular expressions have been exploited in NIDS/NIPS. Many hardware-based regular expression matching algorithms have been proposed by recent literatures. Kumar et al. [37,38] propose a delayed input DFA by using default transitions to significantly reduce the memory requirements of DFA. Yu et al. [39] propose an efficient algorithm for partitioning a large set of regular expressions into multiple groups, reducing the overall memory space of DFA. Becchi and Cadambi [40] propose a novel algorithm to merge non-equivalent states of DFA using transition labeling, in order to achieve significant memory savings and provide worst-case speed guarantees. Smith et al. [41,42] propose an extended finite automaton (XFA) using auxiliary variables and simple instructions per state to reduce the explosive memory requirements of DFA. The key of the byte-filtered string matching algorithm is to eliminate the unnecessary state transitions using Bloom filters. 1793 Besides in the bit-split algorithm, the byte filter scheme is applied in above proposed hardware-based to further improve the throughput of string matching. 7. Conclusions Existing bit-split string matching algorithm suffers from the unnecessary state transitions problem. The bit-split algorithm makes pattern matching in a ‘‘not seeing the forest for trees” approach, where each tiny DFA only processes a substring of each input character, but cannot check whether the entire character belongs to the original alphabet. 
This paper proposes a byte-filtered string matching algorithm using Bloom filters. The byte-filtered algorithm preprocesses each character of every incoming packet by checking whether the character belongs to the original alphabet, before performing bit-split string matching. We further optimize the byte-filtered algorithm for hardware implementation, and theoretically analyze its performance gain in terms of the number of state transitions. Experimental results on a synthetic rule set show that, compared with the bit-split algorithm, the byte-filtered algorithm reduces the string matching time by up to 15.6 times and the number of state transitions of the tiny DFAs by about 44%; as the alphabet size increases, the matching time ratio of the byte-filtered algorithm decreases. Experimental results on the Snort rule set demonstrate that, compared with the bit-split algorithm, the byte-filtered algorithm reduces the string matching time by up to 3.6 times and the number of state transitions by up to 54%.

Acknowledgment

This work is supported by the National Science Foundation of China under Grant No. 90718008, and the National Basic Research Program of China (973) under Grant No. 2007CB310702.

References

[1] S. Staniford, V. Paxson, N. Weaver, How to own the Internet in your spare time, in: Proceedings of the 11th USENIX Security Symposium, 2002, pp. 149-167.
[2] S. Singh, C. Estan, G. Varghese, S. Savage, Automated worm fingerprinting, in: Proceedings of the Sixth ACM/USENIX Symposium on Operating System Design and Implementation, 2004, pp. 45-60.
[3] Frost & Sullivan, World intrusion detection and prevention systems markets, 2007.
[4] M. Roesch, Snort - lightweight intrusion detection for networks, in: Proceedings of the 13th Conference on Systems Administration (LISA), 1999, pp. 229-238.
[5] V. Paxson, Bro: a system for detecting network intruders in real-time, Computer Networks 31 (23-24) (1999) 2435-2463.
[6] Application layer packet classifier for Linux, 2008. <http://l7-filter.sourceforge.net>.
[7] S. Sen, O. Spatscheck, D. Wang, Accurate, scalable in-network identification of P2P traffic using application signatures, in: Proceedings of WWW 2004, Manhattan, 2004.
[8] A.V. Aho, M.J. Corasick, Efficient string matching: an aid to bibliographic search, Communications of the ACM 18 (6) (1975) 333-340.
[9] J.B.D. Cabrera, J. Gosar, W. Lee, R.K. Mehra, On the statistical distribution of processing times in network intrusion detection, in: Proceedings of the 43rd IEEE Conference on Decision and Control, 2004, pp. 75-80.
[10] I. Sourdis, D. Pnevmatikatos, Pre-decoded CAMs for efficient and high-speed NIDS pattern matching, in: Proceedings of IEEE FCCM, 2004.
[11] Z.K. Baker, V.K. Prasanna, Time and area efficient pattern matching on FPGAs, in: Proceedings of FPGA, 2004.
[12] Z.K. Baker, V.K. Prasanna, A methodology for synthesis of efficient intrusion detection systems on FPGAs, in: Proceedings of IEEE FCCM, 2004.
[13] C.R. Clark, D.E. Schimmel, Scalable pattern matching on high-speed networks, in: Proceedings of IEEE FCCM, 2004.
[14] S. Yusuf, W. Luk, Bitwise optimized CAM for network intrusion detection systems, in: Proceedings of IEEE FPL, 2005.
[15] N. Tuck, T. Sherwood, B. Calder, G. Varghese, Deterministic memory-efficient string matching algorithms for intrusion detection, in: Proceedings of IEEE INFOCOM, 2004.
[16] S. Dharmapurikar, P. Krishnamurthy, T.S. Sproull, J.W. Lockwood, Deep packet inspection using parallel Bloom filters, IEEE Micro 24 (1) (2004) 52-61.
[17] F. Yu, R. Katz, T.V. Lakshman, Gigabit rate packet pattern-matching using TCAM, in: Proceedings of IEEE ICNP, 2004.
[18] S. Dharmapurikar, J. Lockwood, Fast and scalable pattern matching for content filtering, in: Proceedings of IEEE/ACM ANCS, 2005.
[19] L. Tan, B. Brotherton, T. Sherwood, Bit-split string-matching engines for intrusion detection and prevention, ACM Transactions on Architecture and Code Optimization 3 (1) (2006) 3-34.
[20] J. van Lunteren, High performance pattern-matching for intrusion detection, in: Proceedings of IEEE INFOCOM, 2006.
[21] H. Lu, K. Zheng, B. Liu, X. Zhang, Y. Liu, A memory-efficient parallel string matching architecture for high-speed intrusion detection, IEEE Journal on Selected Areas in Communications 34 (10) (2006) 1793-1804.
[22] M. Alicherry, M. Muthuprasanna, V. Kumar, High speed pattern matching for network IDS/IPS, in: Proceedings of IEEE ICNP, 2006.
[23] P. Piyachon, Y. Luo, Efficient memory utilization on network processors for deep packet inspection, in: Proceedings of IEEE/ACM ANCS, 2006.
[24] T. Song, W. Zhang, D. Wang, A memory efficient multiple pattern matching architecture for network security, in: Proceedings of IEEE INFOCOM, 2008.
[25] N. Hua, H. Song, T.V. Lakshman, Variable-stride multi-pattern matching for scalable deep packet inspection, in: Proceedings of IEEE INFOCOM, 2009.
[26] Snort. Available from: <http://www.snort.org>.
[27] 1999 MIT/LL Evaluation Set. Available from: <http://www.ll.mit.edu/mission/communications/ist/corpora/ideval/data/1999data.html>.
[28] Unicode. Available from: <http://www.unicode.org>.
[29] B. Bloom, Space/time tradeoffs in hash coding with allowable errors, Communications of the ACM 13 (7) (1970) 422-426.
[30] A. Broder, M. Mitzenmacher, Network applications of Bloom filters: a survey, Internet Mathematics 1 (4) (2004) 485-509.
[31] H. Song, S. Dharmapurikar, J. Turner, J. Lockwood, Fast hash table lookup using extended Bloom filter: an aid to network processing, in: Proceedings of ACM SIGCOMM, 2005, pp. 181-192.
[32] M.V. Ramakrishna, E. Fu, E. Bahcekapili, Efficient hardware hashing functions for high performance computers, IEEE Transactions on Computers 46 (12) (1997) 1378-1381.
[33] D.E. Knuth, J.H. Morris, V.R. Pratt, Fast pattern matching in strings, SIAM Journal on Computing 6 (2) (1977) 323-350.
[34] R.S. Boyer, J.S. Moore, A fast string searching algorithm, Communications of the ACM 20 (10) (1977) 762-772.
[35] B. Commentz-Walter, A string matching algorithm fast on the average, in: Proceedings of the Sixth Colloquium on Automata, Languages and Programming, 1979.
[36] C. Coit, S. Staniford, J. McAlerney, Towards faster string matching for intrusion detection, in: Proceedings of DARPA Information Survivability Conference and Exhibition, 2002.
[37] S. Kumar, S. Dharmapurikar, F. Yu, P. Crowley, J. Turner, Algorithms to accelerate multiple regular expressions matching for deep packet inspection, in: Proceedings of ACM SIGCOMM, 2006.
[38] S. Kumar, J. Turner, J. Williams, Advanced algorithms for fast and scalable deep packet inspection, in: Proceedings of IEEE/ACM ANCS, 2006.
[39] F. Yu, Z. Chen, Y. Diao, T.V. Lakshman, R.H. Katz, Fast and memory-efficient regular expression matching for deep packet inspection, in: Proceedings of ACM/IEEE ANCS, 2006.
[40] M. Becchi, S. Cadambi, Memory-efficient regular expression search using state merging, in: Proceedings of IEEE INFOCOM, 2007.
[41] R. Smith, C. Estan, S. Jha, XFA: faster signature matching with extended automata, in: Proceedings of IEEE Symposium on Security and Privacy, 2008.
[42] R. Smith, C. Estan, S. Jha, S. Kong, Deflating the big bang: fast and scalable deep packet inspection with extended finite automata, in: Proceedings of ACM SIGCOMM, 2008.