LEVERAGING METADATA AND MACHINE LEARNING TO ENHANCE INTRUSION ANALYSIS
Robert Rotsted, Senior Engineer, Reservoir Labs

About Reservoir Labs
− Founded in 1990
− Private and profitable business based in New York City
− Specialize in Cyber Security R&D, commercialization of R&D
− R-Scope® Network Security Monitoring
  − Bro-powered appliance
  − 100 Gigabit / second in 1RU

Agenda
− Survey of the use of Machine Learning in NIDS
− Overview of Machine Learning concepts
− Recommendations for use of Machine Learning in NSM

Goals
− Emphasize value of NSM
− Address the problem of context and diminishing value of network payload data
− Propose ML as one method of combatting these issues

NIDS / NSM Today
− Extract network metadata, deliver to search engine, hunt threats
− Match IOC, alert
− Model attack pattern, alert when state of network matches modeled attack pattern
− Analyze metadata, draw conclusions

NIDS / NSM Today
Extract network metadata, deliver to search engine, hunt threats
− Pro
  − Agile
  − Discover previously unknown threats
− Con
  − Costs a lot of analyst cycles

NIDS / NSM Today
Match IOC, alert
− Pro
  − Discover previously known threats / vectors
  − Turn-key
− Con
  − False positives
  − Unknown threats / vectors

NIDS / NSM Today
Model attack pattern, alert when state of network matches modeled attack pattern
− Pro
  − Discover previously known threats / vectors
  − Turn-key
  − Low FP
− Con
  − Not agile
  − Unknown threats / vectors
  − Long development cycle

NIDS / NSM Today
Analyze metadata, draw conclusions
− Pro
  − Agile
  − Discover previously unknown threats
− Con
  − False positives
  − False negatives

Why: Machine Learning in NSM
− Transition from reactive alert driven workflow to data driven workflow
− Data driven workflow requires data driven tools

What: Machine Learning
Machine learning is a subfield of computer science and statistics that deals with the construction and study of systems that can learn from data, rather than follow only explicitly programmed instructions.
http://en.wikipedia.org/wiki/Machine_learning

What: Machine Learning
Unsupervised
− Bring structure to unlabeled data
− Pro
  − Discover previously unknown data patterns
− Con
  − Discover meaningless data patterns

What: Machine Learning
Supervised
− Infer a function from labeled data
− Pros
  − Easy to interpret
  − Highly configurable
− Cons
  − Subject to human error
  − Time consuming to construct

SOME HISTORY…

History: Machine Learning, Anomaly Based IDS
− Long history of thought that NIDS would benefit from Machine Learning
− In the 1987 paper “An Intrusion-Detection Model”, Dorothy Denning described a NIDS: “Profiles and checks for anomalous behavior whenever an audit record is generated or a period terminates. If abnormal behavior is detected, an anomaly record is generated”
− Cited over 3000 times according to Google Scholar!

Machine Learning, Anomaly Based NIDS
Indispensable since 1987? No way.

Some wisdom from the Sages …

“ … On using Machine Learning for IDS”
Sommer and Paxson wrote a paper back in 2010 that said essentially:
− Detecting adversaries with anomaly detection and ML algorithms is very difficult
− There is considerable research on the topic, but not a lot of anomaly detection and ML deployed in production NIDS

Paxson and Sommer’s Advice
Improve: methodologies, inputs and outputs
− Don’t apply machine learning, statistical analysis, or anomaly detection as a general solution
− Define and extract relevant data features from network traffic to feed to ML algorithms
− Deliver data to your user that enhances their contextual understanding of their environment and enhances automatability

CURATING DATA FOR MACHINE LEARNING

Extracting Data Features
Feature extraction is a general term for methods of constructing combinations of the variables in order to simplify the task of accurately describing a large set of data.
Quantitative Features
Quantitative features are numerical
− Examples:
  − Sent byte count
  − Connection duration
  − Received packet count

Producer Consumer Ratio
− Introduced at FloCon in January by Carter Bullard and John Gerth
− All network nodes are producers and consumers of data
− Exfiltration is a modification of the highly granular and aggregated consumer / producer relationships of an organization of systems

Producer Consumer Ratio
− A normalized value indicating directionality of application information transfer, independent of data load or rate

$$\mathrm{PCR} = \frac{\mathrm{SrcApplicationBytes} - \mathrm{DstApplicationBytes}}{\mathrm{SrcApplicationBytes} + \mathrm{DstApplicationBytes}}$$

− PCR ranges from -1 to 1:
  −  1.0 = Pure Push: FTP upload, multicast, beaconing
  −  0.4 = 70:30 export: Sending Email
  −  0.0 = Balanced Exchange: NTP, ARP probe
  − -0.5 = 3:1 import: HTTP Browsing
  − -1.0 = Pure Pull: HTTP Download

Baselining DNS PCR
Why baseline DNS?
− DNS is less adulterated than HTTP
− DNS tunneling is a commonly cited exfil channel
Observed PCR on the -1 to 1 scale:
− DNS: -0.72
− DNS Tunneling: 0.23
A change in PCR can indicate DNS Tunneling!

Categorical Features
Categorical features represent values that are labels
− Examples:
  − Season (Spring, Winter, Summer, Fall)
  − Operating System (Windows, Linux, Mac OS)

Representing Categorical Variables
A “dummy variable” is an artificial variable created to represent an attribute with two or more distinct categories/levels.
https://www.moresteam.com/whitepapers/download/dummy-variables.pdf

Dummy Variable Trap
Beware!
− Do not include a ‘dummy variable’ for every category of your variable
− Leads to the problem of multicollinearity: the dummy variables always sum to 1, so each one is fully determined by the others
− Solution: when representing categorical data, represent all but one of the categories in your data with a dummy variable
  − Ex. Spring, Winter, Fall (omitting Summer)

APPLYING MACHINE LEARNING

Analysts are important, perhaps too important.

Problems with ‘meatspace’
− It’s not scriptable …
− Memory corruption
− Sleep, meals …

Context should also live in ‘cyberspace’

Create your own features, derive context.
− Network data is fuzzy
− You need to make assumptions about hosts and sessions all the time based on imperfect information
− One thing that you may want to do with data coming off the network is use it to classify behavior or nodes, BUT HOW?
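
Before classifying anything, the feature ideas above can be made concrete. The following is a minimal sketch, not the Reservoir Labs Bro framework: it computes the Producer Consumer Ratio as a quantitative feature and dummy-encodes a categorical feature while omitting one category to avoid the dummy variable trap. The function names, the choice of "Mac OS" as the omitted category, and returning None for connections with no application bytes are all assumptions made for illustration.

```python
"""Sketch: turning connection metadata into ML-ready features (illustrative names)."""

# Example categories taken from the slides; the list order fixes the column order.
OPERATING_SYSTEMS = ["Windows", "Linux", "Mac OS"]


def pcr(src_app_bytes, dst_app_bytes):
    """Producer Consumer Ratio: (src - dst) / (src + dst), always in [-1, 1].

    Returns None when no application bytes were exchanged. That choice is an
    assumption of this sketch; the slides do not cover empty connections.
    """
    total = src_app_bytes + dst_app_bytes
    if total == 0:
        return None
    return (src_app_bytes - dst_app_bytes) / total


def dummy_encode(value, categories, baseline):
    """Encode a categorical value as 0/1 dummy variables, omitting `baseline`.

    Leaving one category out avoids the dummy variable trap; the omitted
    category is represented by all zeros.
    """
    if value not in categories:
        raise ValueError(f"unknown category: {value!r}")
    return [1 if value == c else 0 for c in categories if c != baseline]


def feature_vector(src_app_bytes, dst_app_bytes, os_name):
    """Combine one quantitative and one categorical feature into a flat list."""
    encoded_os = dummy_encode(os_name, OPERATING_SYSTEMS, baseline="Mac OS")
    return [pcr(src_app_bytes, dst_app_bytes)] + encoded_os


# A 70:30 export, a 3:1 import, and a host whose OS is the omitted baseline.
print(pcr(70, 30))
print(pcr(1, 3))
print(feature_vector(700, 300, "Mac OS"))
```

Which category is omitted does not matter for the model; it only determines which category the all-zero row stands for.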
Context Aware NSM / NIDS
− Apply appropriate policies to hosts on your network
− Increase
− Write workflow automation policy in more expressive terms

Classification with Logistic Regression
− Logistic Regression models the probability of a sample being a member of a class given its feature values
− Logistic Regression is a supervised classification method
http://www.datarobot.com/blog/classification-with-scikit-learn/

Example training data

Logistic Regression Probability

Prototyping Supervised ML
1. Derive a theory
2. Explore the data
3. Curate the data
4. Test your theory
5. Evaluate results
6. Repeat

Derive a Theory
The count of unique HTTP referrers / minute is a good indicator for classifying a host as a human web browser

Explore the Data
index=rscope sourcetype=http http_referrer!="-" dst_ip!=10.* | timechart distinct_count(http_referrer) by src_ip

Curate the Data ( Bro )
Reservoir Labs has created a Bro framework for curating this type of data
− Aggregate statistics about a connection in one line
− Easily add or remove data features
− Add labelling criteria

Curate the Data ( Bro )
Unique HTTP Referrers / Minute

Curate the Data
Add labels

Logistic Regression POC
− Leverages the sklearn Python library
− First learns from training data, then applies the learned model to test data to check its predictions
− Inputs data from Bro ml.log
− Outputs the coefficient and score of the regression

Perform the Logistic Regression
Execute: ./logistic-regression
− Score
  − The mean accuracy on the given test data and labels
  − Higher value = more accurate

Evaluate the Results
− Prediction – Algorithm’s prediction for ‘label’
− Prob(0) – Probability that ‘label’ is 0
− Prob(1) – Probability that ‘label’ is 1
− Confidence – Signed distance from the Logistic Curve

Conclusions on Logistic Regression… Be wise!
− Supervised learning algorithms are only as good as you teach them to be
− Choose your features wisely; a poorly chosen feature could be very misleading
− Just because it works for me does not mean it will work for you!

Future Work
− Implement Logistic Regression in Bro
  − In progress
− Experiment with other ML algorithms and network metadata

Code available soon!
https://github.com/reservoirlabs/

Wrapping up…
− Statistics and Machine Learning can and will be really useful in NSM
− There are some common pitfalls in using these tools for attack detection; see “Outside the Closed World” for more details
− Today, statistical analysis and ML tools are relegated primarily to data scientists and commercial tools
− For open source NSM to yield more than forensic data and IOC matching, these tools need to be in the hands of analysts!

Thanks!
− Follow Reservoir Labs on Twitter! @Reservoir_Labs
− Follow me on Twitter! @rrotsted
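
As a closing illustration of the Logistic Regression POC workflow described earlier (learn from labelled data, then report a coefficient, a score, and per-host probabilities), here is a small dependency-free sketch. It is not the Reservoir Labs code, which uses sklearn and reads Bro's ml.log; this version swaps in plain batch gradient descent so it runs with the standard library alone. The training values are invented for illustration, and every function name is hypothetical.

```python
"""Dependency-free logistic regression sketch (a stand-in for an sklearn-based POC)."""
import math


def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def fit(xs, ys, lr=0.05, epochs=20000):
    """Fit P(label=1 | x) = sigmoid(w*x + b) by batch gradient descent on log loss."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(w * x + b) - y  # derivative of log loss w.r.t. the logit
            grad_w += err * x
            grad_b += err
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def prob_human(model, x):
    """Prob(1): probability that the host is a human web browser."""
    w, b = model
    return sigmoid(w * x + b)


def predict(model, x):
    """Predicted label: 1 when Prob(1) is at least 0.5, else 0."""
    return 1 if prob_human(model, x) >= 0.5 else 0


def score(model, xs, ys):
    """Mean accuracy on the given data and labels (higher = more accurate)."""
    hits = sum(predict(model, x) == y for x, y in zip(xs, ys))
    return hits / len(xs)


# Hypothetical data: unique HTTP referrers per minute for ten hosts,
# labelled 1 if an analyst identified the host as a human browser, else 0.
train_x = [0, 0, 1, 1, 2, 6, 8, 9, 12, 15]
train_y = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

model = fit(train_x, train_y)
print("coefficient:", model[0], "intercept:", model[1])
print("score:", score(model, train_x, train_y))
for x in (1, 7):
    p1 = prob_human(model, x)
    print(x, "prediction:", predict(model, x), "Prob(0):", 1 - p1, "Prob(1):", p1)
```

With scikit-learn installed, `sklearn.linear_model.LogisticRegression` provides `fit`, `predict`, `predict_proba`, `decision_function`, `score`, and the `coef_` attribute, which look like a natural match for the Prediction, Prob(0), Prob(1), Confidence, score, and coefficient outputs listed on the slides; that mapping is an assumption, since the POC source is not shown.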