D4.6 Protocol Learning for AMI Environments
SEVENTH FRAMEWORK PROGRAMME
Theme SEC-2011.2.5-1 (Cyber attacks against critical infrastructures)

Contract No. FP7-SEC-285477-CRISALIS

Workpackage: WP4 - System Discovery
Author: Corrado Leita
Version: 1.0
Date of delivery: M24
Actual Date of Delivery: M25
Dissemination level: Public
Responsible: SYM

The research leading to these results has received funding from the European Community’s Seventh Framework Programme (FP7/2007-2013) under grant agreement n°285477.

The CRISALIS Consortium consists of:
Symantec Ltd. (project coordinator), Ireland
Alliander, Netherlands
Chalmers University, Sweden
ENEL Ingegneria e Innovazione, Italy
EURECOM, France
Security Matters BV, Netherlands
Siemens AG, Germany
Universiteit Twente, Netherlands

Contact information:
Dr. Matthew Elder
Symantec Limited
Ballycoolin Business Park (GA11-35)
Blanchardstown
Dublin 15
Ireland
e-mail: matthew_elder@symantec.com

Contents
1 Introduction
2 Heuristics-Based Protocol Learning
2.1 Challenges
2.2 Approach
3 Results of Protocol Learning for AMI Environments
3.1 Alliander AMI Network Data - DLMS/COSEM Protocol
3.1.1 Analysis of TCP Data
3.1.2 Analysis of DLMS/COSEM Data
3.2 AMI Command Network Data - M-Bus Protocol
3.3 Discussion and Conclusions
4 Conclusion

Abstract

This deliverable presents the research on protocol learning, applied in this work to Advanced Metering Infrastructure (AMI) environments.
Previous work on protocol learning, applied to ICS/SCADA environments in deliverable D4.3, highlighted a number of challenges in our existing approaches to protocol learning in the context of critical infrastructure environments. Based on this experience, we have devised a heuristics-based protocol learning algorithm in an attempt to address the domain-specific challenges identified. We explore the approach on data collected from two AMI network environments containing two AMI protocols, DLMS/COSEM and M-Bus, and present the results of the analyses.

1 Introduction

Monitoring critical infrastructure environments, including those in industrial control systems (ICS) and advanced metering infrastructure (AMI), requires tools capable of generating semantics about system events. Previously, in deliverable D4.3 “Protocol learning in SCADA environments”, we identified the need to monitor the systems and interactions from the network level in the face of malicious adversaries. However, there are numerous challenges in monitoring critical infrastructure environments from the network, including the use of proprietary, evolving, and/or undocumented protocols and the frequency of traffic generated. To address these challenges, we explored two approaches to protocol learning, applied in the context of ICS/SCADA environments: protocol parsers (“protocol aware”) and automated analysis of protocol payloads (“protocol agnostic”). The investigation in deliverable D4.3 identified a number of challenges in those protocol learning approaches as applied to ICS/SCADA networks. For example, the network traffic exhibited long-lived sessions and the absence of separators, which was particularly problematic for the protocol-agnostic approach, which analyzes the network traffic in order to extract the protocol semantics.
Building upon our previous approaches and adapting them to the findings of the previous investigation, we propose and explore a heuristics-based protocol learning algorithm in this deliverable. The next chapter presents the challenges that have been encountered in protocol learning as applied to critical infrastructure networks and provides a description of and rationale for the heuristics-based algorithm. The following chapter presents the results of applying the heuristics-based protocol learning approach to network data from two AMI environments, containing two AMI network protocols: DLMS/COSEM and M-Bus.

2 Heuristics-Based Protocol Learning

We have seen in Deliverable D4.3, “Protocol learning in SCADA environments”, that manually generating protocol descriptions is tedious when a protocol specification is available, and even more so when the specification is closed. At the same time, it is of utmost importance to gain in-depth insight into the network exchanges, since the network is the only vantage point from which it is possible to monitor the activity of closed devices such as PLCs, as well as smart meters in Advanced Metering Infrastructure (AMI) environments.

2.1 Challenges

When applying protocol learning to SCADA/ICS protocols, we already identified a number of domain-specific challenges that required us to revisit existing techniques and adapt them to this specific scenario. More specifically, three core challenges were identified:

• Long-lived sessions. A core problem in most protocol learning approaches is “semantic clustering”: the ability to group together messages having a similar semantic meaning in order to make correct and useful guesses about the invariants that characterize a specific type of message. For instance, in the HTTP case we want to ensure HTTP GET messages are analyzed separately from HTTP POST messages.
If structurally different messages are mixed together in the same cluster, the outcome of the analysis is likely to be of unacceptable quality. Methods such as ScriptGen [4, 5, 6, 7], RolePlayer [3], or Netzob [2] mostly base their clustering on the contextualization of the message with respect to the TCP session. The intuition is that the position of a message in a TCP session gives insights into its semantics, e.g., a login message is likely to come at the very beginning of the TCP conversation. We have seen that this assumption does not hold in the SCADA/DCS case, where some protocols reuse a single TCP session for a number of independent exchanges.

• Absence of separators. Protocol learning approaches typically attempt to discover the structure of protocol messages by means of bioinformatics algorithms that align messages. The most common global alignment algorithm used in protocol learning is the Needleman-Wunsch [8] algorithm. It is a dynamic programming algorithm that seeks the best alignment of two sequences S1 and S2 according to a given scoring function. For every pair of bytes belonging to the two sequences, a different score is associated with identical values (I_score), differing values (D_score), or the insertion of a gap in one of the two (G_score). The algorithm is usually applied in a “greedy” fashion [1, 3, 4], by giving no penalty to the insertion of gaps in the sequences or to the presence of mismatching bytes, and simply attempting to obtain the maximum possible overlap between the two sequences. The identified overlaps are converted into a regular expression that is then used to match instances of the same message. Unlike the generic case, most ICS protocols are binary and tend to be very compact, encoding all the information in the minimum possible number of bytes.
This fits poorly with the intuition underlying alignment: alignment algorithms expect to find, throughout a given message, large unique and invariant byte regions that can be used to delimit the variable portions of the message (e.g., “GET /(.*).html HTTP/1.0\r\n”). This expectation is not met in ICS protocols.

• Protocol structure. Most IT protocols have a relatively “flat” structure: a header, followed by an optional payload whose semantics may depend on the header value. ICS protocols such as MODBUS instead tend to be much more structured: an initial framing header specifies information such as the transaction ID, the protocol ID, and the length of the payload; then, by means of one-byte function codes, one or more payloads can be queued. Function codes can incorporate multiple sub-payloads: for instance, function code 23 is associated with a read-write registers command, and is followed first by the parameters of a read operation from a specific memory location and then by the parameters of a write operation to another memory location, whose variable-length data is included at the end of the payload. Within a given protocol message, the semantics of a specific protocol byte tend to depend on the value of other preceding bytes.

Previous work done by the CRISALIS consortium on protocol learning has devised better ways to cope with the first challenge, by exploring more effective ways to perform semantic clustering and to group messages sharing similar structure. However, no major modification was made to the second phase of the protocol learning algorithm, the one in charge of identifying common patterns and structure in a set of messages and extracting a model (e.g., a regular expression) that could later be used to label the messages.
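For reference, the alignment step used in this second phase can be sketched as a textbook Needleman-Wunsch global alignment over byte strings. This is a minimal illustration and not the ScriptGen region analysis itself: the function names `align` and `invariant_regions` and the default scores are assumptions made for the example. With no penalty for gaps or mismatches, it corresponds to the “greedy” configuration that simply maximizes the overlap between two messages.

```python
def align(s1: bytes, s2: bytes, i_score=1, d_score=0, g_score=0):
    """Globally align two byte strings; None marks a gap in the returned lists."""
    n, m = len(s1), len(s2)
    # score[i][j] is the best score for aligning s1[:i] with s2[:j].
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * g_score
    for j in range(1, m + 1):
        score[0][j] = j * g_score
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            pair = i_score if s1[i - 1] == s2[j - 1] else d_score
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + g_score,
                              score[i][j - 1] + g_score)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    a1, a2 = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            pair = i_score if s1[i - 1] == s2[j - 1] else d_score
            if score[i][j] == score[i - 1][j - 1] + pair:
                a1.append(s1[i - 1])
                a2.append(s2[j - 1])
                i, j = i - 1, j - 1
                continue
        if i > 0 and score[i][j] == score[i - 1][j] + g_score:
            a1.append(s1[i - 1])
            a2.append(None)
            i -= 1
        else:
            a1.append(None)
            a2.append(s2[j - 1])
            j -= 1
    a1.reverse()
    a2.reverse()
    return a1, a2


def invariant_regions(a1, a2):
    """Collapse an alignment into the runs of bytes that are identical in both messages."""
    regions, current = [], bytearray()
    for x, y in zip(a1, a2):
        if x is not None and x == y:
            current.append(x)
        elif current:
            regions.append(bytes(current))
            current = bytearray()
    if current:
        regions.append(bytes(current))
    return regions
```

On two text-based requests such as `b"GET /a.html"` and `b"GET /bcd.html"`, the invariant regions are the two delimiters around the variable part. Compact binary messages offer no such delimiters, which is the limitation discussed in the challenges above.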
All the proposed algorithms invoked the standard ScriptGen region analysis in order to identify invariant segments of the protocol message, and later generated a regular expression that could match the invariant segments in unlabeled messages and identify further instances of a given message type. This approach proved in fact to be sufficient to correctly cope with SCADA/DCS protocols.

When looking more generally at ICS protocols, and when including for instance AMI protocols such as DLMS/COSEM, further considerations need to be made. First and foremost, AMI protocols often rely on encryption schemes to protect Personally Identifiable Information (PII). Secondly, binary ICS protocols pose challenges that clash with the assumptions of standard alignment approaches: the absence of separators and the nested structure make the generation of “useful” and robust regular expressions particularly complex, leading us to question the use of regular expressions as a tool to model the structure of a protocol.

2.2 Approach

Unlike previously investigated techniques, we look here into an approach that is specific to the problem under analysis and not necessarily applicable to any protocol of any type. We address the specific case of binary, compact and structured protocols that are typically observed in industrial control systems. The core idea is to reconstruct the protocol by moving sequentially from the first byte of the message towards the last, performing the analysis concurrently on a large number of protocol messages, similarly to the operation of a parser (as represented in Figure 2.1). By means of heuristics, we then identify values holding special meaning for the parser (opcodes, length fields) and, when applicable, we branch the analysis into subgroups associated with different opcode values.
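A minimal sketch of this sequential, parser-like analysis is given here to make the idea concrete. It is an illustration under simplifying assumptions, not the implementation evaluated in this deliverable: the names `mutual_information`, `classify` and `infer` and the thresholds `rand_frac`, `mi_bits` and `max_opcodes` are hypothetical. Length fields are restricted to a single byte that differs from the message length by a constant, and opcodes are detected only through their mutual information with the message length.

```python
import math
from collections import Counter, defaultdict


def mutual_information(xs, ys):
    """Mutual information (in bits) between two equally long discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def classify(msgs, pos, rand_frac=0.5, mi_bits=0.5, max_opcodes=16):
    """Label byte position `pos` of a message group (thresholds are assumptions)."""
    values = [m[pos] for m in msgs]
    lengths = [len(m) for m in msgs]
    distinct = set(values)
    if len(distinct) == 1:
        return "constant"
    if len(distinct) >= rand_frac * 256:
        return "random"  # only meaningful with thousands of messages
    if len(set(lengths)) > 1:
        # Length field: value and message length differ by the same offset everywhere.
        if len({l - v for v, l in zip(values, lengths)}) == 1:
            return "length"
        # Opcode: a small set of values that is informative about the message length.
        if (len(distinct) <= max_opcodes
                and mutual_information(values, lengths) >= mi_bits):
            return "opcode"
    return "parameter"


def infer(msgs, pos=0):
    """Walk left to right over a message group, forking the analysis on opcode bytes."""
    nodes = []
    while True:
        msgs = [m for m in msgs if len(m) > pos]  # drop messages that have ended
        if not msgs:
            return nodes
        kind = classify(msgs, pos)
        if kind == "opcode":
            groups = defaultdict(list)
            for m in msgs:
                groups[m[pos]].append(m)
            branches = {v: infer(g, pos + 1) for v, g in sorted(groups.items())}
            nodes.append((pos, "opcode", branches))
            return nodes
        nodes.append((pos, kind))
        pos += 1
```

Applied to the six sample messages that appear in Figure 2.1 (for instance `infer([bytes.fromhex("AA000C2801322C06801922A3"), ...])`), this sketch labels byte 0 as constant, byte 2 as a length field, and byte 3 as an opcode with one branch per value. Byte 1 is also labeled constant, because the high byte of the two-byte length never changes in so small a sample.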
The resulting approach proves to be significantly faster than state-of-the-art approaches for the reconstruction of protocol structure, and obtains significantly better accuracy when dealing with the specific challenges offered by ICS protocols.

[Figure 2.1: Intuitive representation of the approach. Six sample messages are shown:
AA 00 0C 28 01 32 2C 06 80 19 22 A3
AA 00 0C 28 32 03 2C 06 80 19 22 A3
AA 00 0C 28 22 43 2C 06 80 19 22 A3
AA 00 0C 28 14 67 2C 06 80 19 22 A3
AA 00 07 02 22 AA 2C
AA 00 07 02 12 1C 2C
Byte 0 = AA is a fixed byte, bytes 1-2 are a length field, and byte 3 is an opcode with values in {28, 02}. After the opcode byte the analysis is branched into several children, one for each opcode value (byte 3 = 28, byte 3 = 02), and the algorithm recursively proceeds on each branch.]

As previously explained, the purpose of the approach is to generate labeling information that can be used to discriminate among different message “types” pertaining to a specific ICS protocol. We achieve this objective by proceeding sequentially from the first byte until the end of each message, mimicking the operation of a protocol parser, and gradually building a decision tree as shown in Figure 2.1. A protocol parser would start from the beginning of the header, parsing field by field, obtaining information on the overall message length, and would then proceed to identify the different opcodes and interpret their semantics according to the specification. Unlike a protocol parser, we do not have access to the protocol specification. We can however proceed sequentially over the bytes of a large number of messages, and apply heuristics to identify bytes of particular relevance to the protocol:

1.
Random bytes: by considering a large number of messages (several thousand) we can immediately identify random bytes (e.g., transaction IDs), as they cover most of the 256 possible byte values uniformly. These bytes are ignored from a protocol parsing standpoint.

2. Fixed bytes: similarly, it is immediately possible to identify bytes whose value is fixed throughout all the message samples. These bytes have validation semantics: a well-formed protocol message is expected to carry exactly the same value as the messages seen in the training set.

3. Length fields: we identify length fields by interpreting a sequence of bytes (char, short, or word) as a numerical value (encoded in little or big endian) and identifying a linear dependence between the decoded value and the length of the associated message. Similarly to fixed bytes, length fields have validation semantics and can be used to verify the correctness of a specific protocol message.

4. OPcodes: the real challenge in the proposed approach is to correctly identify bytes whose semantics influence the overall structure of the protocol message. OPcodes assign specific “semantics” to the content of a given message, and allow us to discern, for instance, the standard reply of a PLC (containing the value of the PLC registers) from an error message that may be associated with a problem or anomaly. We have discovered that there is a “correlation” between the value of an opcode byte and certain high-level characteristics of a message. For instance, we often find a correlation between the opcode value and the overall message length: different OPcodes carry different information, and require a different number of bytes to carry it. In other cases, we have identified a “correlation” between the OPcode byte value and the role of the device in the network (e.g., source or destination IP address).
Intuitively, certain devices (DCS servers) are only interested in generating “read” requests towards the control system, while the control system devices usually respond to the requests by providing the requested data. All these “correlations” between the values of a given byte and the size of the message or the involved endpoints are discovered by leveraging a commonly accepted information theory measure, the mutual information. It is important to underline that once an OPcode byte is identified, the protocol inference is “forked”: the sequential analysis of the subsequent bytes is recursively split into different sub-analyses, where the content of the messages associated with different OPcode values is analyzed separately.

5. Parameters: values that do not belong to any of the above classes are likely to be message parameters, e.g., the value of the registers read from a given PLC.

Overall, the process described above starts from a set of training messages and sequentially analyzes the nature of each byte; whenever an OPcode is identified, the sequential analysis is branched recursively for the bytes following the opcode, and different opcode values are analyzed separately. In practice, the outcome of the above process is the creation of a decision tree: identified OPcode bytes branch the decision tree into multiple sub-branches, and running a new message through this decision tree inherently labels the message, also allowing us to extract parameters.

[Figure 2.2: Block diagram explaining the algorithm. Starting from a message group with position X at its first element (X=0), the message values at position X are tested in order: if they are constant, X is marked as constant; if they are randomly distributed, X is marked as random; if they express the length of the message, X is marked as length; if they are correlated with the structure of the rest of the message, X is marked as opcode, the message group is subdivided into separate groups according to the value of X, and the analysis proceeds recursively for each group; otherwise X is marked as parameter. X is then moved to the right (i.e., increased).]

3 Results of Protocol Learning for AMI Environments

In deliverable D4.3, we examined the use of protocol learning on two protocols from the Digital Control System (DCS) environment, PP and OPC. In this section, we explore the results from applying the modified protocol learning approach described in the previous section to packet captures from two data sources that include Advanced Metering Infrastructure (AMI) network data with two AMI protocols, DLMS/COSEM and M-Bus.

3.1 Alliander AMI Network Data - DLMS/COSEM Protocol

In order to experiment with and evaluate the protocol learning approach on AMI network data, we collected packet traces from the Alliander AMI network testbed. The packet traces and high-level descriptions of the associated hosts, flows, and interaction maps were described in deliverable D4.5. The packet trace consisted of 21,621 packets and contained a variety of traffic types, including some DLMS/COSEM protocol traffic going from the AMI head end server to various smart meters.

3.1.1 Analysis of TCP Data

In this first exploration we analyze all the TCP data traffic to present an overview of the approach. The TCP data is most interesting when analyzing AMI network data, given the long-lived sessions and the inclusion of DLMS/COSEM data in the TCP data. However, given the variety of protocols included across all TCP data, it is reasonable to assume that the protocol learning algorithm might not be able to find much commonality. For the exploration, we vary (1) the number of packets analyzed by the protocol learning algorithm and (2) the number of bytes in each packet that is analyzed.
Varying both of these parameters gives the algorithm more data to consider for matching, at the expense of the tractability of the analyses.

We first consider all TCP data traffic in the packet capture, but only matching on the first 24 bytes of each packet. This is reasonable given that we know that at least some of the traffic is encrypted, but perhaps the protocol header data for some protocols will be unencrypted in the first 24 bytes of the TCP data. Figure 3.1 shows the results of running the protocol learning algorithm on a very small amount of data, just the first 20 packets with TCP data, primarily to show the output of the analysis when commonality is found. In this analysis, the protocol learning algorithm determines that the first byte could be treated as an opcode with 10 possible values. However, when the first byte is 0 the pattern produced is a string of mostly random bytes, meaning that the algorithm found minimal overlap or alignment in that case: this line indicates that some of the first 20 packets matched only on a few bytes, namely a few “0” bytes and one “101” byte. At a high level, finding 10 packet sequences across the first 20 packets is not very informative, especially when one of those sequences is mostly random bytes. (Note that there are a number of short packets with fewer than 24 bytes of TCP data, indicated by the shorter sequences.)

Figure 3.2 shows the results of running the protocol learning algorithm on the first 50 packets with TCP data. The resulting pattern of all random bytes indicates that the protocol learning algorithm failed to find an alignment match across 50 packets, and this result remains consistent when increasing the number of packets analyzed to 100, 500, and 1,000 packets containing TCP data. The inability to find alignment across even just 50 packets is a disappointing result.
Next, we increase the number of bytes analyzed to 50 within each packet. Looking at only the first 20 packets with TCP data, Figure 3.3 shows a result consistent with that of Figure 3.1, which used the first 24 bytes. The analysis again interprets the first byte as a possible opcode with 10 possible values. Again, when the first byte is 0 the algorithm fails to find significant overlap or alignment, detecting mostly random bytes and matching only a few “0” bytes and the single “101” byte. Using the first 50 bytes did not change the fundamental result of finding 10 patterns across 20 packets, which does not indicate much matching. Similarly, when we look at the first 50 bytes of the first 50 packets, we get the same result as depicted previously in Figure 3.2: a pattern of all random bytes, indicating the failure of the algorithm to find overlap or alignment. This result is again consistent when extended across the first 100, 500, and 1,000 packets of TCP data.

[Figure 3.1: Analysis of first 24 bytes using first 20 packets with TCP data]
[Figure 3.2: Analysis of first 24 bytes using first 50 packets with TCP data]

Moving to the first 100 bytes of each packet, the analysis of only the first 20 packets with TCP data produces a result consistent with those obtained when analyzing 24 bytes (Figure 3.1) and 50 bytes (Figure 3.3). The result is basically the same as when using 50 bytes, meaning that the longer sequences that matched for 50 bytes also had more than 100 bytes in their packets. Interestingly, analysis of the first 100 bytes of the first 50 packets produces a different result from the previous analyses using the first 24 bytes and 50 bytes. A magnified image of the initial byte sequences is shown in Figure 3.4.
When the protocol learning algorithm is given the first 100 bytes of 50 packets, the matching threshold is satisfied differently and the analysis produces 26 patterns from those first 50 packets. This is still not a great result, but it does show that with some exploration of threshold parameters the analysis might be able to find some amount of commonality. As one might expect, when using the first 100, 500, and 1,000 packets the analysis fails to match anything and a single string of random bytes is produced, similar to that in Figure 3.2.

Finally, we conclude with an exploration of the first 200 bytes of each packet. This analysis largely corresponds to that using the first 100 bytes. The results using the first 20 packets match those of the previous analyses using 50 bytes (Figure 3.3) and 100 bytes, only with longer string sequences. Similarly, the results using the first 50 packets for the most part match those of Figure 3.4, where 26 patterns are discovered across those 50 packets. Lastly, using the first 100, 500, and 1,000 packets produces the single string of random bytes, failing to match anything.

3.1.2 Analysis of DLMS/COSEM Data

The results of the protocol learning algorithm when analyzing all TCP data in an AMI environment were largely underwhelming, as one might expect given the diversity of protocols present and the challenges identified in the previous chapter. In this section we focus on a more tractable problem: analysis of only DLMS/COSEM traffic on port 4059 from the packet capture in the Alliander AMI environment. There are only 419 packets of this traffic in the packet capture. We conduct the same type of exploration, varying the number of bytes analyzed in each packet (24, 50, 100, 200) and the number of packets analyzed (20, 50, 100, all). For a given number of packets, the results are fairly consistent as the number of bytes analyzed is varied.
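The parameter sweep described above can be sketched as follows. This is only an illustration of how the inputs to each run could be prepared: the name `sweep` is hypothetical, and `payloads` is assumed to be the list of TCP payloads on port 4059 in capture order, extracted beforehand by a tool of choice. Keeping the original length next to each truncated payload is a design choice of this sketch, since truncation otherwise hides the real packet length from any length heuristic.

```python
from itertools import product

BYTE_LIMITS = (24, 50, 100, 200)
PACKET_LIMITS = (20, 50, 100, None)  # None stands for "all packets"


def sweep(payloads, byte_limits=BYTE_LIMITS, packet_limits=PACKET_LIMITS):
    """Yield (n_packets, n_bytes, samples) for every combination of the two limits.

    Each sample is a (truncated_payload, original_length) pair.
    """
    for n_packets, n_bytes in product(packet_limits, byte_limits):
        subset = payloads if n_packets is None else payloads[:n_packets]
        samples = [(p[:n_bytes], len(p)) for p in subset]
        yield n_packets, n_bytes, samples
```

Each yielded triple corresponds to one run of the analysis, giving 16 runs for the four byte limits and four packet limits used here.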
[Figure 3.3: Analysis of first 50 bytes using first 20 packets with TCP data]
[Figure 3.4: Magnified analysis of first 100 bytes using first 50 packets with TCP data]
[Figure 3.5: Analysis of first 24 bytes using first 20 packets on port 4059]

Figure 3.5 shows the result of analyzing the first 20 packets on port 4059. The figure is for the analysis of the first 24 bytes, but when looking at the first 50, 100, and 200 bytes of these first 20 packets on port 4059, the same 5 packet sequences are extracted by the protocol learning algorithm, with a presumed opcode in the same position and one sequence containing a number of random bytes.

The results from analyzing the first 50 packets on port 4059 show two fundamentally different analyses, depending upon the number of bytes analyzed. When only looking at the first 24 or 50 bytes, the results are similar to those in Figure 3.5 for the first 20 packets, except that there are now 7 packet sequences detected across those 50 packets, with four containing significant random byte sequences. However, when using 100 or 200 bytes for the analysis, the alignment changes and Figure 3.6 is produced. In this analysis, what was previously taken as the opcode is now detected as a length field, the subsequent byte is now considered an opcode for five of the patterns, and a second length and opcode field sequence is produced. These conflicting analyses show the sensitivity of the protocol learning algorithm to the data being analyzed.

The results from analyzing the first 100 packets on port 4059 yield yet another interpretation of the data, made possible by again doubling the amount of data being analyzed.
Figure 3.7 shows the result for 24 bytes analyzed, a more complex structure, given the additional data. This result is generally consistent across the number of bytes analyzed (50, 100, or 200). What was detected as a possible length field in Figure 3.6 is now presumed to be an opcode, as it was in earlier analyses such as Figure 3.5.

Finally, we consider all of the packets seen on port 4059 in the AMI packet capture. The analysis depicted in Figure 3.8 for 24 bytes analyzed is relatively consistent across 50, 100, and 200 bytes analyzed. The first opcode is presumed to be in the same place as in most previous analyses (with the exception of Figure 3.6) and, as one might expect given all of the additional data, many additional types of conversations were detected by the protocol learning analysis.

[Figure 3.6: Magnified analysis of first 100 bytes using first 50 packets on port 4059]
[Figure 3.7: Analysis of first 24 bytes using first 100 packets on port 4059]

In order to address the length anomaly noticed in a few specific situations, we explored an alternate modification to the algorithm in which we provided the real length of the packet to the algorithm instead of just relying upon the truncated length for the heuristic analysis. Figure 3.9 shows an example of this analysis when the first 24 bytes are analyzed across all 419 packets. The primary observation is that one byte is detected as a length field prior to the first opcode detected, as was observed in the anomalous analyses before. This byte is detected as a length field across all parameters of the analysis (24/50/100/200 bytes and first 20/50/100/all 419 packets) when using the modified algorithm.
This has a “cascading” effect on the interpretation of the next few bytes when compared with Figure 3.8, but the two analyses converge again further into the packet stream, when the other levels of opcode are detected. This general decision tree structure is consistent across the number of bytes analyzed when using all packets on port 4059 in the packet capture.

[Figure 3.8: Analysis of first 24 bytes using all 419 packets on port 4059]
[Figure 3.9: Analysis of first 24 bytes using all 419 packets on port 4059 with length modification to the algorithm]

As one final investigation using the AMI network data, we analyzed the port 4059 traffic on a per-flow basis. As shown in the D4.5 interaction modeling deliverable, there were only six flows between the AMI head end server and the smart meters using DLMS/COSEM (on port 4059). One of them had only 8 packets and was thus uninteresting. Four of the other flows had 21, 24, 31, and 32 packets, again a fairly small number, and the sixth flow had 300 packets. The analyses of these last five flows generally mimic the structure of the analyses presented previously.

3.2 AMI Command Network Data - M-Bus Protocol

For a second, more controlled set of experiments, we tested the protocol learning approach on AMI network data containing M-Bus protocol data, collected by Chalmers. The packet traces were small captures associated with four individual M-Bus commands, ranging from 15 to 48 packets. For ground truth, we can compare the results generated by the protocol learning algorithm to the M-Bus protocol specification, which is an official standard that can be obtained and/or found on-line in various places (e.g., www.m-bus.com).
Figure 3.10 shows the result of running the protocol learning algorithm on a packet capture that includes a request for data from the master to the slave. Two distinct packet types with commonality were detected in the packet capture, with an opcode in the first data byte corresponding to either 16 or 104. The 16 (0x10) opcode corresponds to a short telegram in M-Bus, while the 104 (0x68) opcode corresponds to a long telegram. The protocol learning algorithm successfully detected the opcode in the first byte of the packet, as verified against the M-Bus specification. It should be noted, however, that a third value for the opcode, 0xe5, was not detected by the algorithm, although it was seen in two packets of the capture. This opcode corresponds to a single character telegram, used to acknowledge other telegrams received.

The short M-Bus telegram with opcode 16 matched the M-Bus protocol specification: the second byte (91, or 0x5b) is the control field specifying the data request from master to slave, the third byte (254, or 0xfe) is the address field specifying transmission to all participants in the M-Bus system, the fourth byte is the checksum field, a simple arithmetic sum of the preceding control and address bytes, and the fifth byte is the stop byte (22, or 0x16).

For the long M-Bus telegram with opcode 104, the algorithm was not able to match many of the bytes, because in this small packet capture there were only two long telegram packets and most bytes were not common. The protocol learning algorithm was able to detect that the second and third bytes of the packets correspond to two length bytes (in fact a single byte repeated), which is confirmed to be correct by the M-Bus protocol specification. The other common byte in these two long M-Bus telegram packets is a 0 byte, which corresponds to the address field and denotes the default address given by the manufacturer.
However, this byte is detected to be one position later than it should be, according to both the specification and the actual packet capture: a byte of value 104 (repeating the opcode value) should have been included immediately after the two length bytes. Also, the “P” byte between the length bytes and the “0” byte should have been split into another opcode byte corresponding to either 0x73 or 0x08, the master sending data to the slave or the slave transferring data to the master, respectively. With a longer packet capture that includes repetition of these long M-Bus telegrams, the algorithm should have split that single line in the decision tree into two with that second opcode of 0x08 or 0x73. (This will be elaborated upon and confirmed later in this subsection.)

Figure 3.10: Analysis of M-Bus request data command traffic

Figure 3.11: Analysis of second M-Bus request data command traffic capture

Figure 3.11 shows the result of running the protocol learning algorithm on another packet capture that includes a request for data from the master to the slave. The results are very similar to those shown in Figure 3.10, with the opcode detected in the first byte and two opcode values detected. The only difference is that the second byte of the short M-Bus telegram is 123 (0x7b hex), which also corresponds to a data request from master to slave, and the checksum byte is changed accordingly. The M-Bus specification confirms that this is a correct construction in the protocol as well. The long M-Bus telegram packet analysis yields the same results as in the previous packet capture, with the same shortcomings: a missing repeated opcode byte of 104 in the fourth position, and insufficient duplication to split the long M-Bus telegram line into two, with a second opcode in the fifth byte of either 0x08 or 0x73.
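The telegram layouts used as ground truth in the two captures above can be summarized in a small parser. The following is a minimal sketch, assuming the standard M-Bus link-layer framing (start byte, control, address, checksum, and stop byte for short telegrams; start byte, two length bytes, repeated start byte, control, address, control information, user data, checksum, and stop byte for long telegrams). The function names are hypothetical and are not part of the protocol learning tool described in this deliverable.

```python
START_SHORT, START_LONG, STOP, SINGLE_CHAR = 0x10, 0x68, 0x16, 0xE5


def mbus_checksum(fields: bytes) -> int:
    """Arithmetic sum of the given bytes, truncated to eight bits."""
    return sum(fields) & 0xFF


def build_long_telegram(control: int, address: int, ci: int, data: bytes = b"") -> bytes:
    """Assemble a long telegram; the length counts control, address, CI and data."""
    body = bytes([control, address, ci]) + bytes(data)
    header = bytes([START_LONG, len(body), len(body), START_LONG])
    return header + body + bytes([mbus_checksum(body), STOP])


def parse_telegram(frame: bytes) -> dict:
    """Classify a frame as single character, short, or long telegram."""
    if frame == bytes([SINGLE_CHAR]):
        return {"type": "single"}
    if len(frame) == 5 and frame[0] == START_SHORT and frame[4] == STOP:
        # Short telegram: checksum covers the control and address bytes only.
        if frame[3] != mbus_checksum(frame[1:3]):
            raise ValueError("bad checksum in short telegram")
        return {"type": "short", "control": frame[1], "address": frame[2]}
    if len(frame) >= 9 and frame[0] == START_LONG and frame[3] == START_LONG:
        length = frame[1]
        # The length byte is sent twice and must match the actual frame size.
        if frame[2] != length or len(frame) != length + 6 or frame[-1] != STOP:
            raise ValueError("inconsistent length or stop byte in long telegram")
        body = frame[4:4 + length]
        if frame[-2] != mbus_checksum(body):
            raise ValueError("bad checksum in long telegram")
        return {"type": "long", "control": body[0], "address": body[1],
                "ci": body[2], "data": body[3:]}
    raise ValueError("unrecognized telegram")


if __name__ == "__main__":
    # Short request telegram from the first capture: 0x5b + 0xfe = 0x159 -> 0x59.
    print(parse_telegram(bytes([0x10, 0x5B, 0xFE, 0x59, 0x16])))
    # Same request with control byte 0x7b: the checksum changes accordingly.
    print(hex(mbus_checksum(bytes([0x7B, 0xFE]))))
```

The parser makes the two shortcomings discussed above visible: the repeated start byte (104) sits in the fourth position of every long telegram, and the control byte in the fifth position is what distinguishes the two directions of data transfer.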
Figure 3.12 shows the result of running the protocol learning algorithm on a third packet capture that includes a request for data from the master to the slave. The M-Bus-specific results - corresponding to rows two and four of the decision tree - are the same as those shown in Figure 3.10, both positive and negative, as explained previously. However, this packet capture included some NetBIOS traffic in addition to the M-Bus traffic, resulting in two additional lines in the decision tree generated from this packet capture. If the protocol learning algorithm is run on traffic filtered to include only the M-Bus protocol port (6021), the results match Figure 3.10 exactly. (For the previous two packet captures, the full packet capture and the port 6021 filtered results were identical, given the lack of other, non-M-Bus protocol traffic in those captures.) This third packet capture shows the ability of the protocol learning algorithm to recognize commonality in another network protocol present in the same capture.

Figure 3.12: Analysis of third M-Bus request data command traffic capture, with NetBIOS traffic

Finally, Figure 3.13 shows the result of running the protocol learning algorithm on a fourth packet capture that includes a request for data from the master to the slave. In this packet capture, the short M-Bus telegram packets exhibited both 0x5b and 0x7b in the second byte, following the opcode byte of 16 (0x10 hex), but not with sufficient repetition to denote the second byte as another opcode, so the second byte is denoted with a “P”. More interestingly, the packet capture includes multiple packets consisting of a data transfer from slave to master, and these three packets are split using a second opcode further into the packet, with detected values of 44, 45, and 46. However, comparing this decision tree with the M-Bus specification yields mixed results.
The first issue is that there are now three bytes missing after the two length bytes (as opposed to just one missing byte previously): the repeated opcode byte of 104, the control field of 8 specifying the data transfer from slave to master, and the address field of 0 specifying the default address. On the positive side, the next nine bytes are detected correctly. 114 (0x72 hex) is the common control information field across these packets, denoting a telegram containing data for the master (a variable data response from the slave). Following that, the next eight bytes are the same across these packets: four bytes for the M-Bus interface identification number (17 65 71 0), two bytes for the manufacturer ID (66 4), one byte for the version number of the firmware (6), and one byte for the medium (2, denoting “Electricity”). The second issue is that the next byte, detected as an opcode, is the “access number” according to the M-Bus specification. These three unique values could possibly be interpreted as an opcode; more data is required to validate. Lastly, it should be noted that the analysis of the entire packet capture and the analysis filtered for just the M-Bus protocol port (6021) yielded the same result.

Figure 3.13: Analysis of fourth M-Bus request data command traffic capture

Figure 3.14: Analysis of aggregated M-Bus protocol traffic

The last investigation that we conducted using the M-Bus protocol traffic was to aggregate all four packet captures and run the protocol learning algorithm against that higher volume of data, the results of which are shown in Figure 3.14. As was alluded to earlier, with more data the protocol learning algorithm was better able to differentiate the packets that sent data from the master to the slave from those that sent data from the slave to the master, and the corresponding opcode was detected properly in the fourth byte with values of 115 (0x73 hex) and 8 (0x08 hex), respectively.
Being able to split on that second opcode enabled more commonality to be determined in the bytes of the master to slave data packets (whereas in previous results there had been a large number of “P” bytes). Furthermore, with the additional data the algorithm was able to combine values for the third “opcode” across the four packet captures (corresponding, to be precise, to the access number in the M-Bus specification). However, the additional data did not correct the issue with the missing repeated opcode byte of 104 in the fourth position, and the short M-Bus telegram is not split into two cases (with a second opcode of either 91 or 123), as one might have expected.

3.3 Discussion and Conclusions

At a high level, analysis of the DLMS/COSEM protocol traffic and M-Bus protocol command traffic indicates that the heuristics-based protocol learning algorithm is capable of finding commonality across network traffic for these particular AMI protocols. Further analysis using more data is still required (as well as a comparison to some sort of ground truth for DLMS/COSEM). In the case of the M-Bus protocol traffic, the heuristics-based approach performed generally well when compared to the ground truth of the protocol specification, though there were some issues identified. In the case of the DLMS/COSEM protocol traffic, it was found that the TCP sessions were generally very short and included one or two commands at most. While this was not true of the ICS/SCADA protocol sessions, the protocol learning algorithm could potentially take advantage of this. Also, the position of the message within the TCP session is not reliable for all protocols, and the heuristics-based learning algorithm no longer relies upon it.
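To make the notions of fixed bytes, opcode bytes, and “P” bytes used in the M-Bus results more concrete, the following is a minimal sketch of a byte-wise splitting heuristic. It is an illustration under stated assumptions, not the algorithm of Section 2.2: the function name `learn_rows`, the thresholds `max_values` and `min_repeats`, and the rule that a position is an opcode only when every observed value repeats often enough are all hypothetical, and length-field detection is omitted.

```python
from collections import Counter


def learn_rows(packets, pos=0, max_bytes=24, max_values=4, min_repeats=2):
    """Return the rows of a decision tree as lists of byte values or "P".

    Assumed heuristic (hypothetical thresholds):
      - one distinct value at a position  -> fixed byte
      - a few values, each seen repeatedly -> opcode, split the packets
      - anything else                      -> "P" (variable byte)
    """
    # Stop at the truncation limit or when any packet in the group ends.
    if pos >= max_bytes or any(len(p) <= pos for p in packets):
        return [[]]
    counts = Counter(p[pos] for p in packets)
    if len(counts) == 1:
        value = next(iter(counts))
        tails = learn_rows(packets, pos + 1, max_bytes, max_values, min_repeats)
        return [[value] + tail for tail in tails]
    repeated = all(c >= min_repeats for c in counts.values())
    if len(counts) <= max_values and repeated:
        rows = []
        for value in sorted(counts):
            group = [p for p in packets if p[pos] == value]
            tails = learn_rows(group, pos + 1, max_bytes, max_values, min_repeats)
            rows += [[value] + tail for tail in tails]
        return rows
    tails = learn_rows(packets, pos + 1, max_bytes, max_values, min_repeats)
    return [["P"] + tail for tail in tails]


if __name__ == "__main__":
    capture = [
        bytes([0x10, 0x5B, 0xFE, 0x59, 0x16]),
        bytes([0x10, 0x5B, 0xFE, 0x59, 0x16]),
        bytes([0x10, 0x7B, 0xFE, 0x79, 0x16]),  # 0x7b seen only once
        bytes([0x68, 0x03, 0x03, 0x68, 0x08]),
        bytes([0x68, 0x03, 0x03, 0x68, 0x73]),
    ]
    for row in learn_rows(capture):
        print(row)
```

With this toy capture the first byte is split into the two telegram types, while the second byte of the short telegram stays a “P” because 0x7b is not repeated, which mirrors the behavior reported for the fourth capture. Under this assumed rule, adding more packets with the under-represented values would turn such “P” positions into further opcode splits.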
Further investigation of a protocol learning approach that combines the results of both the previous scriptgen approach and the new heuristics-based approach could be interesting when experimenting further with both ICS/SCADA and AMI network data.

In our exploration of analysis parameters, the new algorithm is often capable of detecting what appear to be some of the “important” bytes from a relatively small number of packets, though more packet data continues to refine and expand on the initial analyses. Generally speaking, truncating the number of bytes used in the analysis makes the problem more tractable and also enables more readable visualization of the decision tree. It should be noted, however, that when analyzing more bytes from each packet, sometimes commonality is found further into the packet. In terms of readability, there are optimizations that could be explored for collapsing the longer strings of random bytes in order to focus attention on the more important opcode bytes where the tree is split. Finally, as a next step, given a decision tree for these protocols, it should then be possible to generate an implementation that would check traffic against this structure.

4 Conclusion

This work extends the protocol learning research conducted previously and presented in deliverable D4.3 (as applied to ICS/SCADA environments) with a heuristics-based protocol learning approach applied to AMI traffic from two environments containing two AMI protocols, DLMS/COSEM and M-Bus. The challenges discovered in applying protocol-agnostic analysis of network traffic in critical infrastructure contexts are identified and used to formulate a different protocol learning approach for these environments. The heuristics-based protocol learning algorithm proceeds sequentially over the bytes of some number of messages in order to identify bytes of relevance to the protocol, such as opcodes, length fields, and fixed bytes.
We explore the approach by analyzing captured network traffic from two AMI environments and present the results of the analyses on different traffic types, including the AMI protocols DLMS/COSEM and M-Bus. When analyzing specifically AMI traffic of the DLMS/COSEM protocol, the heuristics-based analysis produces generally consistent decision trees that help to understand the protocol and associated interactions. When analyzing the AMI traffic of the M-Bus protocol, the heuristics-based analysis identifies a great deal of commonality correctly, when compared to the M-Bus protocol specification, though some issues were uncovered as well. Further validation would be required as part of the incorporation of these analyses for the purposes of network-driven attack detection in work package WP6.

Bibliography

[1] M. A. Beddoe. Network protocol analysis using bioinformatics algorithms, 2005.
[2] G. Bossert, F. Guihéry, G. Hiet, et al. Netzob: un outil pour la rétro-conception de protocoles de communication. In Actes du Symposium sur la sécurité des technologies de l’information et des communications, 2012.
[3] W. Cui, V. Paxson, N. Weaver, and R. H. Katz. Protocol-independent adaptive replay of application dialog. In NDSS, 2006.
[4] C. Leita. SGNET: automated protocol learning for the observation of malicious threats. PhD thesis, University of Nice-Sophia Antipolis, December 2008.
[5] C. Leita and M. Dacier. SGNET: a worldwide deployable framework to support the analysis of malware threat models. In 7th European Dependable Computing Conference (EDCC 2008), May 2008.
[6] C. Leita, M. Dacier, and F. Massicotte. Automatic handling of protocol dependencies and reaction to 0-day attacks with scriptgen based honeypots. In Recent Advances in Intrusion Detection, pages 185–205. Springer, 2006.
[7] C. Leita, K. Mermoud, and M. Dacier. Scriptgen: an automated script generation tool for honeyd. In Computer Security Applications Conference, 21st Annual, pages 12–pp. IEEE, 2005.
[8] S. B. Needleman and C. D. Wunsch. A general method applicable to the search for similarities in the amino acid sequence of two proteins. Journal of Molecular Biology, 48(3):443–453, 1970.