Proofpoint MLX: Machine Learning to Fight Spam
Proofpoint MLX Whitepaper
Machine learning to beat spam today... and tomorrow

Proofpoint, Inc.
892 Ross Drive
Sunnyvale, CA 94089
P 408 517 4710
F 408 517 4711
info@proofpoint.com
www.proofpoint.com

Mounting an effective defense against spam requires detection techniques that evolve as quickly as the attacks themselves. Without the ability to adapt automatically to new types of threats, anti-spam defenses will always remain a step behind the spammers. Proofpoint MLX™ technology uses advanced machine learning techniques to provide comprehensive spam detection that guards against the spam threats of today, as well as tomorrow. Proofpoint MLX continuously analyzes millions of messages and automatically adjusts its detection algorithms to identify even the newest, most cunning types of attacks. Proofpoint MLX provides accurate, adaptive, and continuous protection against spam without requiring manual tuning or administrator intervention.

Contents
Executive Summary
Why Does MLX Matter?
The Need for Machine Learning
Using Machine Learning to Beat Spam
Machine Learning in Action: Proofpoint MLX
Proofpoint MLX Spam Detection Process
Recent Spam Trends and Emerging Threats
Understanding the “Perception Crisis” in Spam Effectiveness
Winning the Battle Against Image- and Attachment-based Spam
Proofpoint Attack Response Center
Conclusion
Additional Resources
About Proofpoint, Inc.

Proofpoint MLX: Machine Learning to Beat Spam Today and Tomorrow

Executive Summary
As spammers employ increasingly sophisticated techniques to avoid detection, simplistic anti-spam solutions are leaving enterprises vulnerable to lost productivity, lost communications, malware attacks, data theft, and financial loss. Clearly, a new approach is needed to defend corporate messaging infrastructures and to reclaim email’s value as a business communications medium.
Mounting an effective defense against spam requires detection techniques that can evolve as quickly as the attacks themselves. Without the ability to automatically adapt to detect new types of threats, an anti-spam solution will always be a step behind the spammers.

Proofpoint MLX™ technology leverages patent-pending machine learning techniques to provide a revolutionary spam detection system. The Proofpoint solution employs a full range of classification methods, from legacy approaches such as heuristics and Bayesian analysis to state-of-the-art machine learning algorithms (such as those used in genomic sequence analysis) and proprietary image analysis methods. Analyzing millions of messages each day, Proofpoint MLX automatically adjusts its detection algorithms to identify even the newest spam attacks without manual tuning or administrator intervention. As a result, Proofpoint MLX is able to provide continuous spam detection and content filtering with a very high degree of accuracy, typically on the order of 99.8% or higher.

Unlike other anti-spam solutions, the Proofpoint platform delivers anti-spam defenses that don’t degrade over time. By adapting continuously to the changing nature of spam attacks, Proofpoint MLX ensures that enterprises always benefit from the latest anti-spam defenses, even if those defenses are only a few hours old. Proofpoint MLX technology protects corporate infrastructure against the spam threats of today, as well as tomorrow.

Proofpoint MLX technology is available in Proofpoint’s SaaS email security solutions, including Proofpoint ENTERPRISE™ and Proofpoint SHIELD™, as well as in Proofpoint’s on-premises solutions, including the Proofpoint Messaging Security Gateway™ email security appliance.

Why Does MLX Matter?
Proofpoint’s MLX-based solutions provide the most effective spam detection available today:
o Accurate: Proofpoint’s machine learning technology, based on techniques such as logistic regression, provides the foundation for a powerful, adaptive anti-spam solution capable of analyzing all types of message features, examining more than one million different attributes to accurately differentiate between spam and valid messages.
o Decisive: Traditional anti-spam solutions evaluate a limited number of attributes and are unable to decisively classify spam, which leads to a low rate of effectiveness and a high rate of false positives.

MLX ensures that Proofpoint’s solutions will remain effective against the tactics spammers try to employ tomorrow:
o Predictive: Continuously evolving spamming techniques can only be countered by a predictive solution capable of learning and self-adjusting. Traditional reactive approaches just can’t keep pace.
o Adaptive: Proofpoint’s MLX-based solutions automatically adapt to counter new threats. As more data from both valid email and spam is added to the machine learning model, the system identifies and weights relevant attributes to automatically tune the classification process. The result is a system that is just as effective at identifying tomorrow’s spam as it is at identifying spam today.

Proofpoint is the only vendor that has successfully combined machine learning techniques with traditional approaches to achieve near-perfect spam detection. Ongoing efforts by Proofpoint’s Attack Response Center scientists secure Proofpoint’s position as a technology pioneer and industry leader in the fight against spam.

New in This Version: Updated Content
This updated version of Proofpoint MLX: Machine Learning to Beat Spam Today and Tomorrow, published March 2010, describes how MLX technology is being used to fight the latest forms of spam, including blended threats, phishing attacks, and spam related to social media sites.
This whitepaper explains the key concepts, technologies and benefits associated with Proofpoint MLX technology.

Evolution of Techniques for Fighting Spam
Summary: Spamming techniques and anti-spam techniques have evolved on parallel paths. First-generation solutions were effective against “static” spam attacks, but spammers quickly developed randomization, obfuscation and new delivery strategies to bypass basic spam filters. Second-generation solutions incorporated simple heuristics in response, but these systems typically require a large amount of administration to stay effective. Proofpoint’s third-generation solution applies sophisticated machine learning techniques to deliver high accuracy without the administrative overhead and technology weaknesses of older techniques.

Traditional anti-spam solutions are reactive: they compare new messages to known spam, simply looking for words, phrases and other attributes previously encountered in spam, and flag messages from “known” spammers. These technologies cannot adapt quickly enough to detect new threats and are losing ground against increasingly sophisticated spam attacks. Proofpoint MLX machine learning technology provides the next generation in spam detection today: a highly effective, intelligent solution that can adapt to detect new types of spam with minimal intervention.
[Figure 1 plots increasing sophistication against time for three generations of spam detection:
1st Generation (Basic Filtering): signature-based, challenge/response, text pattern matching, RBLs, collaborative. Results: low false positives, low effectiveness, easily fooled by evolving techniques.
2nd Generation (Heuristics/Bayesian): linear models, simple word match, heuristic rules. Results: high false positives, high administration, effectiveness decays over time.
3rd Generation (MLX, Machine Learning): logistic regression, support vector machines, integrated reputation. Results: immune to evolving attacks, high effectiveness without decay, low false positives, low administration.]
Figure 1: Evolution of spam detection.

The Need for Machine Learning
Defending messaging systems against spammers requires an intelligent system that can adapt automatically as the attackers’ techniques evolve. Unlike yesterday’s anti-spam technologies, Proofpoint’s MLX technology counters new spam techniques as they emerge, defending messaging systems against the threats of today as well as tomorrow.

A Brief History of Anti-spam Technologies: First Generation Solutions
In the early days of the spam epidemic, before the introduction of enterprise anti-spam solutions, spammers used simple, straightforward techniques to deliver spam. Spam messages were typically simple text or HTML messages that were mass mailed over sustained periods of time. Given the “static” nature of this spam, first-generation technologies such as signatures and RBLs (Real-time Block Lists) were able to detect and stop attacks on a reactive basis. Companies such as Symantec originally used signature-based techniques, very similar to the way anti-virus products work. But spammers quickly developed techniques for randomizing multiple parts of their messages (maintaining the core message while changing its signature) to thwart detection by signature-based systems.
Similarly, RBL techniques rely on understanding the quality and volume of messages associated with a given sender’s IP address by gathering information over substantial periods of time. Again, in the early days of sustained spam campaigns, RBLs were reasonably effective after the initial attack was recognized. The problem today is that spammers rotate IP addresses frequently and often use hijacked machines (so-called “zombie” or “botnet” machines) to send small bursts of spam from an ever-changing array of locations.

Overall, first-generation approaches have a low rate of effectiveness against spam because they are easily defeated by randomization and obfuscation strategies. On the positive side, first-generation solutions did not introduce very many false positives (valid messages incorrectly marked as spam).

Second Generation Anti-spam Solutions: Heuristic and Bayesian Approaches
To address the increasing frequency and sophistication of spam attacks, second-generation anti-spam vendors (such as CipherTrust, Sophos and Postini) employed heuristics and Bayesian techniques, often in combination with certain first-generation technologies, in an attempt to create systems that deliver more proactive, resilient defenses against spam.

Heuristics are “rules of thumb” that attempt to make a judgment on whether an email is spam or not, based on a small number of “spammy” attributes. The problem is that rules of thumb are not always accurate and can be easily fooled by spammers, especially since most products taking this approach were based on open-source technologies that are readily available to spammers.

The introduction of Bayesian techniques began the trend toward using more sophisticated analytic techniques rather than general rules of thumb to identify spam. Bayesian solutions use statistical analysis to look for individual attributes that might indicate whether an email is valid or spam.
But this relatively basic statistical approach falls short due to its inability to understand the relationship between attributes. Bayesian systems can often be fooled simply by adding unrelated valid-looking text to a spam message.

Overall, second-generation systems based on heuristics or Bayesian weighting were more effective at catching spam than earlier solutions, but this higher effectiveness came at a cost. Second-generation systems suffer from a high rate of false positives and require a substantial amount of ongoing administration to stay effective.

Third Generation Anti-spam Solutions: Proofpoint MLX Machine Learning
Cognizant of the evolution of both spam and anti-spam techniques, Proofpoint developed a solution that uses considerably more advanced machine learning techniques. This advanced, statistical approach is more predictive and resilient than previous-generation solutions. Proofpoint MLX offers both high effectiveness and low occurrence of false positives while requiring very little ongoing administration to stay effective.
Yesterday’s Anti-spam Technologies

Technique: Spam Signature Detection
Description: Compare messages to known spam
Limitations:
o Minor modifications thwart detection, and spammers know this
o Cannot detect new threats, always a step behind spammers
o Poor effectiveness against image-based spam

Technique: Challenge-Response
Description: Require sender to respond
Limitations:
o Challenge is offensive in business context
o Misclassifies valid, automatically generated email

Technique: Text-pattern Matching
Description: Search for spam keywords such as Viagra or enlargement
Limitations:
o Difficult to manage large keyword lists
o Simplicity leads to high false positive rates
o Effectiveness plummets as new types of spam emerge

Technique: Heuristics or Naive Bayesian
Description: Apply rules of thumb to assign a spam score
Limitations:
o Based on a small number of independently assessed attributes
o High administration overhead for manually tuned systems
o Naive Bayes models ignore attribute dependencies and systematically under- or over-estimate spam probabilities
o Easily defeated by today’s text- and image-based spam obfuscation techniques

Technique: Community Resources
Description: Check messages against RBLs and other public anti-spam resources
Limitations:
o Spammers use this information to thwart detection
o Network queries are time intensive, reducing performance in enterprise-wide usage

Using Machine Learning to Beat Spam
Proofpoint MLX technology leverages advanced machine learning techniques to automate the generation of large-scale statistical models for spam and content filtering. Employing a full range of machine learning techniques enables Proofpoint to analyze many millions of messages per day and distill them down to more than 1 million different attributes that reflect the underlying characteristics of spam. The resulting statistical model provides formulas for combining message attributes and weights to estimate the probability that a particular message is spam. With this model in place, the Proofpoint platform can classify messages with a high degree of confidence to maintain high effectiveness rates and a very low occurrence of false positives.

The probability that a message is spam is estimated by applying statistical techniques such as logistic regression. To minimize classification errors, Proofpoint employs proprietary techniques to train the system to accurately determine whether or not new cases should be classified as spam. To avoid “overfitting” the model to the training data and maximize its accuracy for new data, Proofpoint performs large-scale cross-validation testing using data from a wide variety of sources.

About Logistic Regression
Proofpoint’s statistical models combine attributes and weights to generate an estimate of the probability that a particular message is spam. Logistic regression is one technique used to build these classification models. Logistic regression provides a way to predict a discrete outcome such as group membership from a set of variables that can be continuous, discrete, dichotomous or a mix. Logistic regression is a Bayesian technique: the most likely model is inferred from a combination of observed attributes and previous data. Instead of making the Naive-Bayes assumption that each attribute is conditionally independent, logistic regression provides a mechanism for taking interdependencies into account.

The Importance of Attribute Dependencies – What Naive-Bayes Models Ignore
Because machine learning models take into account the incremental impacts of different spam attributes and the dependencies between attributes, the system can very accurately classify messages. For example, if the phrases “Want to stop Snoring?” and “Get a good night’s sleep!!!” appear in an email, the marginal spam effect of the second phrase is lessened so that the likelihood that a message is spam is not overestimated.
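As a rough illustration of how attributes and weights combine into a spam probability, here is a minimal logistic scoring sketch. The attribute names, weights and bias are hypothetical values invented for this example, and mapping the probability onto a 0 to 100 score by simple multiplication is also an assumption; none of it is taken from Proofpoint’s actual model.

```python
import math

# Hypothetical attribute weights, invented for illustration only.
# The second phrase carries a smaller weight because it adds little
# evidence once the first, closely related phrase is already present.
WEIGHTS = {
    "phrase:stop_snoring": 2.0,
    "phrase:good_nights_sleep": 0.5,
    "sender:known_contact": -4.0,
}
BIAS = -1.0  # hypothetical baseline log-odds of spam


def spam_probability(attributes, weights=WEIGHTS, bias=BIAS):
    """Logistic model: p = 1 / (1 + exp(-(bias + sum of weights present)))."""
    z = bias + sum(weights.get(name, 0.0) for name in attributes)
    return 1.0 / (1.0 + math.exp(-z))


def spam_score(attributes):
    """Map the probability onto a 0-100 scale (an assumed convention)."""
    return round(100 * spam_probability(attributes))
```

Calling `spam_score(["phrase:stop_snoring", "phrase:good_nights_sleep"])` yields a high score, while adding `"sender:known_contact"` to the same list pulls the score down sharply, because each weight represents that attribute’s net effect given the others.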
Proofpoint’s proprietary spam analysis techniques use “supervised learning techniques” to understand the subtle differences between valid messages and spam. As a result, Proofpoint MLX technology can accurately classify valid messages that might otherwise be confused with spam. If a user receives an email from a colleague that says, “Bob, did you see the spam message about getting a good night’s sleep?”, the system will recognize that it is valid because other attributes are more important than the fact that the message contains the common spam phrase “good night’s sleep”.

Standard anti-spam systems are not able to detect these subtleties. For example, suppose a user receives the following email from his doctor:

    Dear Bob,
    Hope you did get a good night’s sleep after your treatment. Did you sleep well and did you stop snoring? It may take a few days for the medicine to kick in. Let me know if you have any questions.
    – Dr. Smith

A Naive-Bayes classifier trained on the following spam will incorrectly classify the doctor’s message as spam:

    Did you sleep well last night?? Get a good night’s sleep! Stop Snoring Today! Click here to learn more!

Because the phrases “Did you sleep well” and “Get a good night’s sleep” have appeared in spam before, and the Naive-Bayes classifier scores all attributes independently, the overlapping evidence from these related phrases is effectively counted twice. As a result, it overestimates the probability that the message is spam and mistakenly classifies the doctor’s legitimate email as spam. In contrast, Proofpoint MLX classifiers recognize that attributes sometimes appear together and the system takes these dependent attributes into account, resulting in a more accurate assessment of the “spam content” of each message. As a result, Proofpoint MLX is able to correctly conclude that the doctor’s message is valid.
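The double-counting effect described above can be reproduced on a toy corpus. The sketch below is an illustration under stated assumptions: the data set is invented, the two attributes always occur together, and the comparison is against a direct estimate from the joint counts rather than against Proofpoint’s classifiers.

```python
def naive_bayes_posterior(messages, labels, observed):
    """P(spam | observed attributes), treating every attribute as independent."""
    spam = [m for m, y in zip(messages, labels) if y == 1]
    ham = [m for m, y in zip(messages, labels) if y == 0]
    odds = len(spam) / len(ham)  # prior odds of spam
    for attr in observed:
        p_spam = sum(attr in m for m in spam) / len(spam)
        p_ham = sum(attr in m for m in ham) / len(ham)
        odds *= p_spam / p_ham  # each attribute multiplies the odds separately
    return odds / (1.0 + odds)


def joint_posterior(messages, labels, observed):
    """Fraction of training messages containing all observed attributes that are spam."""
    matching = [y for m, y in zip(messages, labels) if observed <= m]
    return sum(matching) / len(matching)


# Toy corpus (invented): the two phrases always appear together.
PAIR = {"did_you_sleep_well", "good_nights_sleep"}
MESSAGES = [PAIR] * 8 + [set()] * 2 + [PAIR] * 2 + [set()] * 8
LABELS = [1] * 10 + [0] * 10  # first ten are spam, last ten are valid

nb = naive_bayes_posterior(MESSAGES, LABELS, PAIR)
joint = joint_posterior(MESSAGES, LABELS, PAIR)
```

In this corpus, 8 of the 10 training messages that contain the pair are spam, so the joint estimate is 0.8. Naive Bayes multiplies the odds by 4 once for each phrase, giving odds of 16 to 1 (about 0.94), even though the second phrase adds no information beyond the first. A model that accounts for the dependency, such as logistic regression, would instead split the weight between the two redundant attributes.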
Attribute Dependencies
Because attributes associated with spam often have complex relationships and dependencies, taking those dependencies into account is critical for accurate spam detection. Heuristics and Naive-Bayes classifiers evaluate each spam attribute independently; they cannot take into account dependencies between attributes. Because these systems assume that all attributes are conditionally independent, the benefits of considering a larger number of attributes are overwhelmed by the proportional increases in the missing dependencies. This severely limits the number of attributes that these systems can evaluate. They reach a point where adding attributes actually degrades their ability to make accurate classifications.

In contrast, Proofpoint’s MLX classifiers accurately model attribute dependencies, enabling the system to analyze more than one million high-quality attributes selected from a pool of many millions. As new attributes are identified, Proofpoint scientists use the latest machine learning techniques, such as information gain analysis, to ensure that only the most useful attributes are processed by the MLX engine, ensuring the highest levels of performance and accuracy at all times. The ability of Proofpoint MLX to analyze many times the number of attributes considered by traditional systems results in a highly effective solution that accurately detects spam while maintaining a low incidence of false positives.

Estimating the Probability that a Message is Spam
To estimate the probability that a message is spam, Proofpoint uses logistic regression to define a statistical model that represents the complicated dependencies observed among spam attributes. Unlike Naive-Bayes classifiers, which evaluate each attribute independently, logistic regression enables smart scoring that leverages the knowledge that certain spam attributes commonly appear together.
This not only increases the classifier’s effectiveness at identifying spam, it enables it to differentiate between spam and valid messages much more accurately.

Logistic regression calculates the incremental impact each attribute has on a message’s spam score. A weight is assigned to each attribute to represent its net effect after the effects of other attributes are taken into account. Sets of attributes that are known to be dependent on one another are weighted accordingly and redundant attributes receive less weight, ensuring that the probability that a message is spam is not over- or underestimated. Because each attribute’s effect is modeled in relation to other attributes, gaps in the model can be filled by intersecting existing attributes to create new ones. In systems that evaluate attributes independently, continuing to add attributes can actually cause a degradation in the classifier’s accuracy and effectiveness. However, adding helper attributes to a logistic regression classifier produces a better model with more predictive power.

Minimizing Classification Errors
An anti-spam solution must be able to accurately classify messages: it must effectively block spam while avoiding false positives. To achieve this, Proofpoint employs statistical techniques and a set of training examples to determine whether or not new cases should be classified as spam.

Machine Learning in Action: Proofpoint MLX
Through its pioneering research into the application of tried-and-true statistical techniques to the problem of spam, and its continued focus on the security and messaging needs of large enterprises, Proofpoint has developed a highly configurable message-processing platform that provides a comprehensive defense against spam, viruses, and other messaging threats.
Proofpoint’s advanced machine learning classifiers and enterprise-strength platform enable Proofpoint solutions such as Proofpoint ENTERPRISE Protection to synthesize large amounts of data, to analyze millions of message characteristics, and to classify messages with a very high degree of confidence, resulting in a high rate of effectiveness and a very low rate of false positives.

Powered by Proofpoint MLX machine learning technology, the Proofpoint platform provides the most effective anti-spam solution available. By leveraging the best of first- and second-generation spam-detection techniques, applying state-of-the-art MLX classifiers and adapting to enterprise-specific message characteristics and policies, Proofpoint solutions keep pace with emerging message threats and changing corporate needs. The Proofpoint system is an enterprise-grade platform designed from the ground up to ensure high availability and performance, minimize management overhead and integrate seamlessly with existing enterprise management tools.

Maximum Protection Today: High Confidence
The large number of attributes that the Proofpoint platform is able to analyze ensures that messages can be classified with a high degree of confidence. Proofpoint’s advanced classifiers enable the system to classify messages decisively: most messages score very high or very low, with only 1.5% falling between 20 and 80 on a scale of 0-100. Competitors’ products are often unsure how to classify messages; upwards of 40% of messages typically receive scores between 20 and 80.

The image in Figure 2, below, is an actual screenshot from a Proofpoint customer deployment. This report shows how Proofpoint MLX scored inbound messages over a 24-hour period. Notice that 99% of the emails are confidently scored either very high (indicating that they are spam) or very low (indicating that they are valid email).
As a result, this customer has found that they can discard messages with a score of 80 or greater (the red region in Figure 2), automatically eliminating 98% of their spam, with zero false positives. These messages, which represent 80% of the company’s total inbound email volume, are rejected and discarded right at the enterprise gateway, substantially reducing the burden on downstream mail servers, storage systems and network bandwidth.

A very small number of messages (about 1% of total email traffic, indicated by the yellow region in Figure 2) score between 45 and 80. These “probable spam” messages are held in a quarantine and added to email digests that are sent to end users on a periodic basis. This policy blocks the last 2% of spam without the risk of losing any legitimate email messages.

Lastly, the remaining 19% of this company’s original email stream gets confidently delivered as valid messages (the green region in Figure 2). So in this case, Proofpoint correctly and confidently identifies more than 80% of the company’s inbound email volume as spam without the need for administrators to constantly manage the solution. Proofpoint MLX does the work.

Confident Scoring Enables Decisive Action Against Spam
Summary: Actual 24-hour spam score distribution from one of Proofpoint’s customer sites. The confident scoring and high accuracy provided by Proofpoint MLX has allowed this customer to adopt aggressive policies against spam. The green, yellow and red highlights graphically illustrate this customer’s actual spam policies. Most notably, a full 80% of the company’s inbound email is discarded as spam, with zero false positives, right at the email gateway.

[Figure 2 regions:
Deliver Valid Email: 19% of total mail volume.
Quarantine Suspect Email: 1% of total mail volume, 2% of spam volume.
Discard Spam: 80% of total mail volume, 98% of spam volume.]
Figure 2: Was it really spam? Proofpoint MLX’s decisive classification eliminates the uncertainty.

Typical Spam Score Distributions for Competing Products
Summary: Less sophisticated systems score spam with far less certainty than Proofpoint MLX. In this case, a large percentage of messages have scores in the middle range. Such systems are faced with a quandary: Should messages that fall in this middle range be delivered to end users (reducing the system’s effectiveness)? Or should this large volume of messages be quarantined (generating a large number of false positives)?

[Figure 3 plots the percentage of messages (vertical axis) against spam score from 0 to 100 (horizontal axis), with the middle range marked “Is this Spam... or Valid Mail?”]
Figure 3: Competing solutions are often unsure whether messages are spam or not.

Systems that are unable to decisively classify spam (Figure 3, above) are left with a difficult dilemma: should the messages that fall in the middle of the scoring range be sent to the user as valid email, or blocked as spam? Sending the messages to the user will lower the overall spam detection rate, greatly reducing the solution’s effectiveness. On the other hand, blocking the messages will cause a spike in false positives, which can be very detrimental to users. Clearly, when messages cannot be classified decisively, there are no good options. The ability of Proofpoint MLX to classify messages with a high degree of confidence eliminates this dilemma and greatly improves the system’s overall effectiveness while maintaining a low rate of false positives.

Maximum Protection Tomorrow: The Learning Cycle
Unlike traditional anti-spam tools whose effectiveness quickly degrades as spammers change their tactics to thwart detection, Proofpoint MLX is capable of learning and automatically adjusting to detect new threats.
As more data from both valid email and spam is added to the statistical model, the system identifies and weights relevant attributes to tune the classification process. The result is a system that is just as effective at identifying tomorrow’s spam as it is at identifying today’s.

Defeating Spammers’ Obfuscation and Randomization Tactics
Obfuscation, such as when spammers use variant spellings of “Viagra” or camouflage HTML text, is a very popular strategy that spammers use to deceive spam filters. Proofpoint researchers have developed new machine learning techniques that allow Proofpoint MLX to rapidly and accurately identify obfuscated text and differentiate intentional obfuscations from legitimate spelling errors. Proofpoint’s backend message processing systems also use machine learning techniques to automatically detect and “learn” new obfuscations of “spammy” words that are observed. For example, if the system observes a large number of variant spellings within a short time, those obfuscations are automatically added to the next Proofpoint MLX engine update. These techniques have a significant positive effect on anti-spam effectiveness with zero impact on false positive performance. The predictive nature of Proofpoint MLX is also highly resistant to randomization and “hash busting” techniques commonly used by spammers to bypass signature-based spam filters.

Proofpoint MLX Spam Detection at a Glance
[Figure 4 labels:
MLX spam engine creation: Multilingual Commercial Spam, Multilingual Pornographic Spam, Multilingual Valid Email; Automated Machine Learning; 200,000+ Attribute Identification; Model Creation; MLX Spam Engine; Version Independent MLX Updates.
Customer site: Spam Detection Module; MLX Spam Engine; Spam Scoring (Valid Email, Likely Spam, Spam/Phish, Adult Spam); Apply Policies (Deliver, Quarantine & Add to Digest, Delete); Feedback.]
Figure 4: Proofpoint MLX spam engine creation and the spam detection process at the customer site.
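The policy step at the customer site can be sketched as a simple mapping from spam score to disposition. The thresholds below are the ones reported for the customer deployment described earlier (discard at 80 or greater, quarantine between 45 and 80); the function, parameter and enum names are hypothetical, and real deployments would configure their own thresholds.

```python
from enum import Enum


class Disposition(Enum):
    DELIVER = "deliver"
    QUARANTINE = "quarantine and add to digest"
    DISCARD = "discard"


def apply_policy(spam_score, discard_at=80, quarantine_at=45):
    """Map a 0-100 spam score to a disposition using two thresholds."""
    if not 0 <= spam_score <= 100:
        raise ValueError("spam score must be between 0 and 100")
    if spam_score >= discard_at:
        return Disposition.DISCARD      # confident spam, dropped at the gateway
    if spam_score >= quarantine_at:
        return Disposition.QUARANTINE   # probable spam, held for the user digest
    return Disposition.DELIVER          # treated as valid email
```

Because most messages score very high or very low, only a small share of traffic would fall into the quarantine band under thresholds like these, which is what makes an aggressive discard threshold practical.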
Proofpoint MLX Spam Detection Process
The MLX detection process (Figure 4, above) begins at the Proofpoint Attack Response Center, where scientists and engineers build and refine mathematical models that represent Internet spam. These models are constantly updated and delivered to customers to ensure their messaging infrastructures stay ahead of the latest spam attacks.

Proofpoint examines every aspect of incoming messages, from the sender’s IP address, to the message envelope, headers, and structure, and finally the content and formatting of the message’s attachments and the message itself. At any given time, more than one million possible attributes, representing both content and structural components, may be taken into consideration. A typical message may trigger more than 300 MLX attributes.

Every email message can be broken down into three main components:
o The Message Envelope: The envelope contains information used by Mail Transfer Agents to route the message. Spammers can change the message envelope by capturing open relays or by planting zombies on unsuspecting computers and using them to send spam with a “valid” email address. Proofpoint MLX catches envelope-based spammer tricks.
o The Message Headers: Headers are key-value pairs that provide source and routing information for the message, along with other meta information such as the message sender, subject, and recipients. Headers are often spoofed by spammers. Proofpoint MLX catches header-based spammer tricks.
o The Message Body and Attachments: The actual content of the message. Spammers often obfuscate the text of the message body using HTML and other encoding tricks, in an attempt to exploit first- and second-generation spam filters. MLX catches message-body-based spammer tricks. It also examines the content of attachments looking for similar characteristics.

The first level of screening examines the network stream to identify the source of each incoming email.
The system then performs an in-depth contextual analysis of the email, from distilling the email’s linguistic structure to normalizing permutations of words (permutations are a common exploit whereby spammers may replace ‘Viagra’ with a term like ‘v1a<b>g</b>gra’). Once the contextual analysis is complete, the system evaluates the message according to the preferences set by both the end user and administrator. The results of this in-depth analysis are fed into Proofpoint MLX’s advanced classifiers to determine the appropriate disposition for the message.

On its own, no single test classifies a message as spam. But by taking all attributes of a message into account, Proofpoint’s advanced classifiers categorize each message with a high degree of certainty to accurately identify spam and minimize false positives.

Phishing Attacks and Pornographic Spam
Because of the mathematical foundation of MLX, its models can be easily adapted to subcategories of spam. Two common email attacks are phishing or scam attacks and pornographic spam.

A phishing attack is a type of fraud. Phishing email looks like a legitimate message from a business familiar to the recipient (typically a bank or a well-known online brand such as Amazon or PayPal) but is actually a fraudulent attempt to extract personal identity or financial information. Thinking the message is valid, the recipient submits personal account information. This information is then collected by the sender and used illegally.

Proofpoint has applied its machine learning algorithms to detecting phishing attacks, thereby ensuring they are blocked from end users’ inboxes. To aid in combatting phishing, Proofpoint accurately identifies spam in email attachments including, but not limited to, PDF, ZIP, XLS, DOC and RTF file types. (Phishing is discussed in more detail later in this paper.)
In addition to phishing attacks, organizations are being bombarded with pornographic spam, exposing them to risk, liability and embarrassment. Administrators need tools to enforce a policy of zero-tolerance for pornographic spam. Proofpoint not only detects pornographic spam with a high degree of accuracy using MLX technology, but also allows administrators to define separate, more aggressive policies for pornographic spam. Each message analyzed by Proofpoint MLX is assigned a general spam score as well as an adult spam score, enabling each type of message to be handled differently. For example, an organization might configure its policies to delete all pornographic spam, while quarantining non-pornographic spam.

Reputation Analysis for IP Addresses and URLs

New types of attributes are continually being assessed and added to Proofpoint MLX. For example, Proofpoint MLX spam engine updates include a dynamic set of attributes that represent reputation scores associated with the IP addresses of various senders. The Proofpoint Attack Response Center continually examines large volumes of Internet mail, external spam block lists and data from Proofpoint partners and customers to identify IP addresses that are commonly used to send spam. This ever-changing list of spam servers, suspected spam domains, botnet and “zombie” machines is constantly updated and automatically incorporated into the MLX spam engine updates that are delivered to Proofpoint customers on a regular basis. MLX performs this same reputation analysis on the URLs included in messages, ensuring that dangerous addresses are caught, even if the message is sent from a trusted domain.

Fighting Image-based Spam

In yet another attempt to bypass less sophisticated spam filters, an increasing amount of spam is now being sent with the spam “payload” contained in an attached image, sometimes accompanied by randomized text.
The Proofpoint MLX spam engine includes algorithms to detect and block image-based spam and image-based obfuscation techniques, which competing solutions cannot accurately catch. See “Recent Spam Trends and Emerging Threats,” below for more information on this problematic new form of spam.

MLX Speaks Your Language

As the volume of non-English language spam increases, Proofpoint’s machine learning engine is continually being trained to identify spam in a wide variety of European and Asian languages. Combined with the real-time, local reputation data generated by the MLX Dynamic Reputation™ features of each Proofpoint server and other message attributes, Proofpoint MLX can make intelligent decisions about which messages and connections to block or throttle without the negative performance impact of constant network blocklist (DNSBL) lookups. Proofpoint MLX also uses similar techniques to identify and block malicious URLs contained in spam and phishing messages.

Bounce Management

Proofpoint’s anti-spam solutions also support Bounce Address Tag Validation (BATV, a draft specification submitted to the IETF) in order to combat backscatter, which occurs when a spammer spoofs a legitimate address, resulting in a barrage of non-delivery reports (NDRs) directed at the legitimate user. Customers who secure their outbound email through Proofpoint solutions can take advantage of BATV tagging of outbound messages to detect and block backscatter from forged addresses. In addition, all Proofpoint customers can take advantage of MLX to differentiate between valid and backscatter NDRs and to configure the aggressiveness with which MLX rejects backscatter messages.

Obfuscation Detection

Proofpoint has developed a proprietary machine learning model to identify obfuscated words (which can be a strong indicator of spam) and to differentiate intentional obfuscations from unintentional obfuscations (such as spelling errors).
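As a rough illustration of the normalization that obfuscation detection depends on, the Python sketch below strips inline HTML and maps look-alike characters so that a variant such as ‘v1a<b>g</b>gra’ collapses to the same form as the plain word. It is a hand-written stand-in, not Proofpoint’s proprietary model: the LOOKALIKES table, the function names normalize_token and obfuscated_match, and the watch list are all assumptions made for this example, and the sketch does not try to separate deliberate obfuscation from honest spelling errors.

```python
import re

# Hypothetical look-alike substitutions. A learned model would infer these
# from data; they are hard-coded here only to keep the sketch short.
LOOKALIKES = str.maketrans(
    {"0": "o", "1": "i", "3": "e", "4": "a", "5": "s", "@": "a", "$": "s"}
)
TAG_RE = re.compile(r"<[^>]+>")


def normalize_token(token: str) -> str:
    """Reduce a token to a canonical lowercase form for comparison."""
    text = TAG_RE.sub("", token)               # drop inline HTML such as <b>...</b>
    text = text.lower().translate(LOOKALIKES)  # map digits/symbols to letters
    text = re.sub(r"[^a-z]", "", text)         # remove padding such as dots or dashes
    return re.sub(r"(.)\1+", r"\1", text)      # collapse repeated letters


def obfuscated_match(token: str, watchlist: set) -> bool:
    """True if the token matches a watch-list word only after normalization."""
    canonical = {normalize_token(word) for word in watchlist}
    return normalize_token(token) in canonical and token.lower() not in watchlist


print(normalize_token("v1a<b>g</b>gra"))                # viagra
print(obfuscated_match("v1a<b>g</b>gra", {"viagra"}))   # True
print(obfuscated_match("Viagra", {"viagra"}))           # False
```

Because repeated letters are collapsed, watch-list words are normalized the same way before comparison. A fixed table like this is easy for spammers to work around, which is the weakness of hand-built rules that a trained model is meant to avoid.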
Natural Language Tests

MLX performs context-aware analysis to derive meaning from text and to identify signature-busting language often used in spam email.

Using Machine Learning for Connection Management

Proofpoint’s connection management component—Proofpoint Dynamic Reputation™—also uses machine learning techniques. Powered by a combination of local, dynamic reputation analysis and global/network reputation data, Proofpoint’s “netMLX” service uses machine learning technologies to assess the reputation of IP addresses, in order to block or throttle incoming email connections from malicious or spammy IP addresses. While the analysis and computation are done “in the cloud,” netMLX can be queried by any Proofpoint deployment (whether deployed as SaaS or on-premises).

netMLX analyzes thousands of traffic properties, domain characteristics, and other features. To identify the most statistically relevant features to use in classification, Proofpoint uses a proprietary method known as discretization, which ensures that Proofpoint Dynamic Reputation operates at maximum efficiency from a computation and accuracy standpoint. Data is sourced from Proofpoint’s worldwide collection of honeypots and our customer base.

The properties (or “features”) analyzed are fed into a multi-stage classifier, meaning that different properties are analyzed using different machine learning algorithms and then sent to a final classifier to compute a “reputation” score for each incoming IP address. Based on this score, customers can enforce various connection management policies. For example, connections from malicious IPs can be rejected or throttled. Note that these IP reputation scores are distinct from the spam score that customers also have control over at the policy level.
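The shape of that pipeline, per-group scores feeding a final score that drives a connection policy, can be sketched in a few lines of Python. Everything specific here is an assumption for illustration: the 0-to-1 score scale (higher meaning more likely malicious), the stage names, the weights, the thresholds, and the use of a weighted average as a stand-in for the final classifier. None of it describes how netMLX actually computes or applies reputation.

```python
from enum import Enum


class Action(Enum):
    ACCEPT = "accept"
    THROTTLE = "throttle"
    REJECT = "reject"


def final_score(stage_scores: dict, weights: dict) -> float:
    """Combine per-stage scores into one reputation score.

    A weighted average stands in for the final classifier; a real
    system would learn this combination from training data.
    """
    total_weight = sum(weights[name] for name in stage_scores)
    weighted = sum(score * weights[name] for name, score in stage_scores.items())
    return weighted / total_weight


def connection_action(reputation: float,
                      throttle_at: float = 0.6,
                      reject_at: float = 0.9) -> Action:
    """Map a reputation score to a policy action (thresholds are hypothetical)."""
    if reputation >= reject_at:
        return Action.REJECT
    if reputation >= throttle_at:
        return Action.THROTTLE
    return Action.ACCEPT


# Hypothetical outputs of two first-stage models for one sending IP.
stages = {"traffic": 0.9, "domain": 0.7}
weights = {"traffic": 2.0, "domain": 1.0}
score = final_score(stages, weights)
print(round(score, 3), connection_action(score))  # 0.833 Action.THROTTLE
```

Keeping the thresholds as parameters mirrors the point that customers set the policy, while the score itself comes from the classifier.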
The whole process occurs in real time; the time it takes for a new IP address to be identified, analyzed and for data on that IP to be available to Proofpoint customers is on the order of one minute.

International Language Analysis

MLX has the ability to parse both single- and double-byte text with appropriate machine learning models for different languages (e.g., English, Japanese, Chinese, etc.). Ongoing training in these languages ensures high effectiveness against non-English language spam.

Recent Spam Trends and Emerging Threats

In spite of improved defenses and several high-profile criminal prosecutions, spam continues to plague organizations of all sizes. Closely monitoring spam attacks against its own customers, Proofpoint found that, for most enterprises, spam volumes rose between 150% and 400% last year. And enterprises are feeling the impact: in 2009, spam is estimated to have cost U.S. organizations $42 billion. Worldwide, the cost was $130 billion. Lost productivity accounts for roughly 85% of these costs, with the remainder primarily covering IT support costs, such as help desk salaries.[1]

But spam jeopardizes much more than employee productivity and IT budgets: it’s increasingly implicated in security attacks, which can lead to data breaches, regulatory fines, lost business, and other damages. Clearly, spam remains a serious threat to the enterprise. And it’s complicating IT’s adoption of other new technologies, such as social media platforms for cross-departmental collaboration.

Currently, six trends are making spam an especially difficult and urgent problem for enterprises.
These six trends are:

o The use of increasingly sophisticated botnets as the primary means of delivering spam

o The ongoing use of image-based and other forms of attachment-based spam to evade detection

o The growing sophistication of phishing attacks

o The increase in blended threats, which combine email with other technologies, such as Web sites or multimedia files

o The use of social media sites and tools to deliver spam and to steal data

o The rise of cybercriminal syndicates intent on stealing funds and data; hackers are no longer satisfied with causing network outages or data loss; now they’re after confidential data they can resell or exploit to steal funds or to blackmail enterprises

To assess the technical approach taken by any anti-spam technology, it’s important to understand the nature and impact of these trends.

The Rise of Botnets

Robot networks, or botnets (also called zombie networks), consist of network-connected machines that have been compromised by spyware or malware. These compromised machines are used by malware writers to send spam (or viruses) on their behalf and to launch other types of network attacks. Rather than sending spam directly from a server to a set of organizations, spammers use botnets to send spam indirectly. Each node in the botnet is responsible for sending a fraction of the spam campaign. Proofpoint estimates that more than 75% of all spam attacks are now sent using botnets.

The impact of using botnets as a spamming tool is two-fold:

o Quick and intense attacks: The spammer has precise control over the launch and duration of the attack. By sending commands over an IRC (Internet Relay Chat or similar communication) channel, the spammer can turn an attack on or off. Attacks now last under an hour and deliver huge volumes of spam during that period.
o Multiple IP addresses: Instead of having just one or two sending IP addresses for a spam campaign, a botnet allows a spammer to send spam from hundreds—or even thousands—of IP addresses. The use of these techniques has made centralized reputation services (which block spam based on the sender’s IP address) much less effective as a spam-fighting tool.

The rapid proliferation of botnets has made it possible for spammers to send an ever-increasing volume of spam. Botnets make sending large quantities of spam “cheap,” because spammers are able to tap into large pools of computing and network resources virtually free (or for a very low price, as when botnet controllers rent out their botnets to other spammers). This same “economy of scale” has also made it possible for spammers to send more resource-intensive types of spam, such as image-based, highly personalized or highly randomized spam. For example, the spam “payload” may be delivered as an attached image (or other document type), sometimes accompanied by large amounts of text. Both images and text are typically randomized or obfuscated in an attempt to defeat both signature- and heuristics-based spam filtering techniques, as described later in this paper.

Image-based and Attachment-based Spam

Image-based spam has “graduated” from an annoyance to being one of the core tricks used by spammers. In fact, image-based spam represents a whole collection of new techniques (some of which are described in more detail below) that leverage message attachments to deliver spam messages. From month to month the volume of image-based spam (or spam that uses other attachments such as Word or PDF files) varies widely, but spam campaigns that use these techniques continue to be a common problem.

Spammers have always used images in their spam. Spammers used to insert images referenced on remote servers through URLs.
But given the lower cost of bandwidth and availability of the massive processing power of botnets, spammers are now attaching images and other types of files directly to their messages. This has the advantage of tricking many spam filters. Proofpoint has invested heavily in developing and delivering technologies that block these difficult-to-detect messages.

The rise of image-based spam was enabled primarily by the proliferation of botnets. Today, botnets send large amounts of image-based spam where each individual message is randomized and obfuscated using a variety of different techniques.

Why Traditional Detection Technologies Fail to Detect Image-based Spam

Most significantly, image-spam is very hard to detect using traditional first- and second-generation technologies such as signature techniques and Bayesian filtering. Furthermore, since much of the image-based spam comes from botnets, it is difficult to detect using traditional reputation services, as the source IP addresses continually shift. Spam technologies that worked well in the past saw their effectiveness decline when these spam techniques came into widespread use. The image-based spam techniques described later in this paper serve to reduce the effectiveness of filters, and thus increase the amount of spam that gets through to users’ inboxes.

Image-based spam techniques wreak havoc against many of the most common anti-spam technologies:

o Signature-based detection: Because the images used in each individual spam message are obfuscated, each message is unique. That is, each spam email contains a “new” image that doesn’t match any known signature. Therefore, signature-based engines are fooled by the spam and it is delivered to end users.

o Reputation-based detection: The use of botnets allows image-based spam to be sent from an ever-changing or “rotating” set of IP addresses.
Most of the nodes in a botnet have no reputation rating at all (either positive or negative) and are used for sending messages in such a way that they avoid detection by reputation systems. Relying on the reputation of a sending IP address to catch image-based spam does not work effectively as an enterprise-scale solution.

o Bayesian-based detection: Many image-based spam messages are also accompanied by text (which may be visible or invisible) that looks “legitimate” to many types of spam filters. Simple Bayesian filters are unable to “see” and analyze the image. Instead, they rely on the text, which appears legitimate. Due to a limitation in the way Bayesian systems perform their computation, the final spam score for this analysis results in the email being misclassified as legitimate mail. These techniques are known as Bayesian-busting.

The Growing Sophistication of Phishing Attacks

Phishing attacks are email attacks that impersonate an email from a trusted site, such as a bank, brokerage, or social media site, in order to lure the recipient into clicking on a link or giving away confidential information. Phishing attacks can be used to deliver ads to users, to steal users’ login credentials and other information, or to infect the recipient’s computer system with malware. Many attacks lull users into clicking to learn more about topical events, such as elections, major sporting events, natural disasters, and news about celebrities. A typical phishing attack would consist of a spam message, purportedly from a major bank, telling the recipient that his or her account requires immediate action, and that he or she should click on the link and log in, thereby allowing hackers to harvest the recipient’s login credentials.

Using Machine Learning for Connection Management (continued)

Proofpoint uses connection-level sender reputation to significantly reduce CPU load by limiting the amount of email content that each system must analyze.
While Proofpoint Dynamic Reputation has a positive impact on anti-spam performance, enabling this feature is not required to achieve the 99.8% anti-spam effectiveness typically delivered by Proofpoint MLX. Rather, the goal of Proofpoint Dynamic Reputation is to mitigate the impact of sudden bursts of email traffic, denial-of-service attacks (both direct and indirect) and the vast numbers of email connections caused by spam campaigns.

In typical production enterprise deployments, Proofpoint Dynamic Reputation identifies 70%-80% of inbound connections as malicious, based on global netMLX reputation scoring. Local reputation analysis will often identify an additional 10% of connections as malicious. Note that these statistics are for connections, not messages. One cannot know how many email messages were in a given connection if it is not accepted, but on average there are multiple messages per spammy or malicious connection. The net result is that, for every connection blocked, multiple messages and the payloads they carry are blocked.

Since Proofpoint does not rely exclusively (or even heavily) on reputation for anti-spam effectiveness, Proofpoint Dynamic Reputation’s aggressiveness is tuned to ensure sufficiently high block rates while eliminating false positives. Reputation scores are maintained on individual IP addresses in order to avoid false positives caused by blocking overly broad IP ranges.

The automated machine learning processes at the heart of Proofpoint’s connection management system are designed to avoid false positives by including significant amounts of valid email connections in the training data to ensure that the engine recognizes both the characteristics of valid mail and the characteristics of malicious connections. Because this training process is performed continually over a constantly changing global dataset, the system is “self-annealing.”

A major differentiator of Proofpoint Dynamic Reputation is that IP scoring is highly dynamic and automated, whereas other reputation systems require senders to call or write to the vendor if they have been incorrectly added to their system’s “blocklist”. While Proofpoint also provides online feedback mechanisms in the event that senders need to request removal, IP addresses that are identified as malicious and then return to normal behavior quickly return to “good” reputation scores and are automatically “unblocked” by Proofpoint Dynamic Reputation.

Phishing attacks constitute a small fraction of spam attacks, but it’s an expensive fraction. Phishing attacks are estimated to have cost the U.S. economy more than $8.4 billion in 2009. Of that money, about $5.8 billion was in direct monetary losses–in other words, phishers stole nearly $6 billion from bank accounts and other reserves, using login credentials and other information obtained fraudulently through phishing. In addition, phishing attacks have tarnished more than 2,000 brands.[2]

Building on their successes, phishers are refining their attacks and focusing on high net-worth individuals and institutions. So-called spear phishing is a highly targeted form of phishing in which the email message appears to be more targeted and personal. For example, instead of sending 10,000 users a spam message that pretends to be a security notification from a major bank, a spear-phishing attack might send only 50 messages to select individuals whose data, assets, or login credentials are especially valuable.
Because the spam email contains personal information that a mass emailing would not, the recipient is more likely to trust it, click on a dangerous link, and fall prey to the attack.

Phishing attacks are an especially pernicious and increasingly common form of spam. Instead of delivering bothersome ads or malware that may shut down a computer, they can lead to identity theft or even a transfer of funds with serious, long-term repercussions.

Blended Threats

A blended threat is a security attack that achieves its ends by combining two or more attack vectors, such as email and Web sites. Blended threats often use spam to initiate contact with a victim. For example, a blended threat attack might consist of an email message that includes a link to a tampered Web site; when recipients click on a link in the message, their browsers navigate to a site infected with malware, which the browsers then automatically load, thereby infecting the recipients’ computers.

Because the spam message itself doesn’t contain malware, it stands a good chance of getting past most types of email and virus filters. Comparing the link to a known list of bad sites might not work either. The infected Web site might be new and hence not yet included on any malware blacklists. Or it may be a legitimate site, such as a popular news site or shopping site, which hackers have managed to infect with malware using an attack technique such as SQL injection.

Blended threats can be quite elaborate, involving a series of seemingly unrelated leaps between the initial contact with the victim (usually in the form of a spam message) and the ultimate theft of data or injection of malware. For example, an attack might begin with a spoofed email message that appears to be an automatically generated message from Facebook–perhaps from the security team at Facebook or from a friend of the recipient.
The recipient trusts Facebook and clicks on the link, which appears to lead to content posted on Google Reader, another Internet brand the user is likely to trust. The Google Reader content, in turn, might link to a video seemingly posted on YouTube, but in fact posted on a malware site cleverly designed to resemble YouTube. Playing the video unleashes a malware attack that takes advantage of security vulnerabilities in some older versions of Flash.

Attacks such as these rely on the email recipient’s trust in personal friends and in the brands of the biggest Internet properties. Like the simpler, two-step attack described above, this one begins with an email message that itself doesn’t include any malware. But within a matter of seconds—the length of time that a trusting user might spend clicking through the various links supposedly hosted on these well-known sites—the recipient’s computer is infected with malware.

In addition to infecting systems with malware, blended threats are being used by hackers to harvest login credentials and confidential financial data such as Social Security numbers and credit card information.

Threats from Social Media

The social-media spam floodgates have opened: in 2009, 50% of enterprises reported receiving spam from social media networks. More than a third of enterprises say they have received malware attacks from social media networks.[3] A recent survey by Proofpoint found that more than a quarter of US workers say that social-media network notifications account for about 5% (1 in 20) of the emails they receive at work. More than 10% of US workers say that such notifications account for about 10% (1 in 10) of the emails they receive at work.

The growth of spam that appears to be from social media networks isn’t too surprising given the surging popularity of these networks and related Web 2.0 communications platforms, such as Twitter.
The largest social media community, Facebook, has grown to 350 million users.[4] Users are spending more time on social media–82% more time year over year by December 2009.[5] Social media networks now account for 11% of all U.S. Internet traffic.[6] In December 2009, Facebook became the top referrer site to news portals such as Yahoo and MSN.[7] (Such referrals make social media sites ideal environments for phishing attacks.)

A couple of related trends are amplifying the vulnerability of social media users to spam and to phishing. First, social media marketing is on the upswing. Seventy percent of marketers plan to increase their social-media marketing budgets in 2010.[8] This re-allocation of marketing funds will lead to a sizeable increase in email solicitations from well-known brands and social media sites. This increase will make it harder for users to distinguish legitimate social-media marketing campaigns from spam campaigns that link to malware or phishing sites.

Second, enterprises are adopting social media platforms such as SharePoint and Jive to improve communication and collaboration internally. This adoption makes employees more comfortable posting social media content and following links, and more trusting of social media communications (including email notifications) in general.

Spammers and cybercriminals are taking advantage of the popularity of social media networks to create new types of spam, phishing attacks, and malware traps. Many social networks, such as LinkedIn, send invitations by email. Others, such as Facebook, send email notifications when users post comments. Spammers and phishers are spoofing these messages to trick users into viewing ads, revealing login credentials, or downloading malware.

Phishers have even begun harvesting users’ contact lists on networks such as Facebook to make spam messages appear more credible.
Many users will automatically trust a message that includes names and photographs of friends. They assume that only the network administrators could compile this information.

Social media spam is particularly well suited to blended threats. Because social media sites are filled with links to videos, photos, and audio clips, many users don’t think twice about clicking on a link that supposedly leads to trustworthy content. Users don’t realize that videos can contain malware, or that even trustworthy sites can be injected with malware.[9]

Social media networks like Facebook that support custom applications pose additional threats. Few networks screen these applications for security vulnerabilities or malware infections. Many users blindly trust these applications, naively assuming that they’re harmless fun. Unfortunately, some of these applications include Trojans or keyloggers. Some post bogus messages, soliciting the user’s friends to click on links and further propagate the attack. Social applications create dynamic opportunities for distributing malware or harvesting confidential data. Spam messages promoting these applications are an invitation to trouble.

Spam and Crime

Ten years ago, most malware authors had relatively simple goals: they wanted to demonstrate technical prowess, and they wanted to rebel against authority. To achieve these goals, hackers programmed increasingly sophisticated worms and viruses that slipped through the defenses of major corporations and caused millions of dollars of damage worldwide. But the malevolence stopped there. Early hackers didn’t try to extort money from their victims. They didn’t use Trojans or rootkits to steal login credentials for financial gain. Their motivation seemed to be egotism, not greed.

Over the past decade, the nature of malware attacks and other spam-related crimes has changed dramatically. Cybercrime has become big money.
Criminal syndicates around the globe are involved in creating and renting botnets, and breaking into financial institutions and siphoning funds. Hackers are no longer content with simply congesting networks; now they want financial gain from their nefarious labors.

An example of this change is the ongoing development of ransomware, a type of malware that first appeared around 2006. When a ransomware Trojan springs into action, it encrypts files on the user’s local hard drive and demands that the user wire funds to a foreign account in order to have the files restored.[10] A recent variant shuts down the infected computer’s Internet access, then orders the victim to send a text message, which turns out to be exorbitantly expensive, to a special number to receive the decryption key.[11] Other new spam payloads may modify the DNS settings on a victim’s computer and direct users to fraudulent dating or gambling sites.

Outbound Spam Detection

Increasingly, organizations of all types are concerned about preventing their networks from contributing to the global spam problem and want to ensure that no machines on their network are sending spam or other forms of malicious email. Even a single botnet-infected machine on an organization’s internal network can generate massive amounts of spam, quickly causing their pool of IP addresses to be blacklisted.

But most spam filtering solutions rely on a combination of reputation scores and content filtering to identify and stop spam and rely heavily on the reputation scoring component to ensure accuracy. While this approach may work satisfactorily for the inbound email stream, where the reputation of sending IPs can be easily monitored and tracked, this approach is ineffective in addressing outbound spam, where one must examine the content and be able to make an accurate determination of “spammyness” based solely on non-reputational factors.
In addition, some anti-spam solutions do not even support the ability to scan the outbound email stream for spam content. In contrast, Proofpoint’s anti-spam and anti-virus technology can be applied to both inbound and outbound email streams. And because Proofpoint MLX technology detects spam with extremely high accuracy without relying on IP reputation information, it is also highly effective at outbound spam detection, in contrast to solutions that depend heavily on reputation scoring.

The growth of cybercrime and increasingly malicious payloads makes spam a security issue too urgent to ignore. Spam jeopardizes not only employee productivity, IT asset availability, and business continuity; it also poses a direct threat to the financial wealth of enterprises and their employees.

Understanding the “Perception Crisis” in Spam Effectiveness

New spamming techniques and ongoing increases in the sheer volume of spam being sent, combined with the overall increase in legitimate email messages, have had a predictable but unwelcome result. Many email users who had become accustomed to spam-free inboxes find themselves now receiving a noticeable and often annoying amount of spam. End users perceive that effectiveness has declined and this often leads to an increase in end-user complaints and helpdesk calls related to spam. In addition, email administrators may find themselves spending more and more time trying to stem the flow of spam. In some cases, anti-spam solutions are unable to keep up with spammer sophistication, but in others, the spam filter effectiveness has remained constant—or even improved—but the higher overall volume of email results in an increase in the absolute number of spam messages making it through the filter.
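The arithmetic behind this effect is simple enough to sketch in Python. The function below computes how much spam is blocked, how much gets through, and how much lands in the average inbox; the function name spam_reaching_users is an assumption made for the example, while the sample figures are taken from the scenarios that follow in this paper. False positives are ignored, as they are in those scenarios.

```python
def spam_reaching_users(daily_spam: int, effectiveness: float, users: int):
    """Return (blocked, missed, missed per user) for one day of spam.

    effectiveness is the fraction of spam stopped at the gateway,
    for example 0.94 for 94%.
    """
    blocked = round(daily_spam * effectiveness)
    missed = daily_spam - blocked
    return blocked, missed, missed / users


# Lower volume at 94% effectiveness versus higher volume at 95%.
print(spam_reaching_users(500_000, 0.94, 5_000))    # (470000, 30000, 6.0)
print(spam_reaching_users(2_000_000, 0.95, 5_000))  # (1900000, 100000, 20.0)
```

Even with a slightly better filter, a fourfold rise in volume more than triples the spam each user sees, which is the perception gap this section describes.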
Let’s take a closer look at the relationship between rising inbound email volumes, spam filter effectiveness and end-user perceptions about the quality of your existing spam filtering solution. The following scenarios will help explain why spam has once again become a critical enterprise IT issue and why organizations now require anti-spam solutions with extremely high accuracy.

Scenario 1: The “Good Old Days”

Think “way back” to the time before botnets. You’ve recently replaced your first-generation anti-spam solution with a new solution that provides what seems like an incredible effectiveness of 94%, with a negligible number of false positives (i.e., legitimate messages inadvertently marked as spam). Your spam volume is 500,000 messages per day. Your 5,000 end users aren’t complaining, as they very rarely receive any spam at all (i.e., false negatives are also negligible). This is a huge improvement over your previous anti-spam solution, which had a 90% effectiveness rate. You are feeling good about the vendor you chose, and you focus your attention on other projects.

A quick calculation of your new spam solution looks like this:

Variable | Metric | Value | Formula
A | Daily spam volume | 500,000 |
B | New vendor’s effectiveness | 94% (with negligible FPs) |
C | Number of end users | 5,000 |
D | Spam being blocked at gateway | 470,000 | =A*B
E | Spam not blocked, that gets through the gateway and hits your mail servers | 30,000 | =A-D
F | Average number of spam messages/day that get to an end user’s inbox | 6 | =E/C

A quick calculation of the situation before you purchased the new solution looks like this.
We assume that the volume and number of employees stayed the same:

Variable | Metric | Value | Formula
G | Old vendor’s effectiveness | 90% (with some FPs) |
H | Spam being blocked at gateway | 450,000 | =A*G
I | Spam not blocked, that gets through the gateway and hits your mail servers | 50,000 | =A-H
J | Average number of spam messages/day that get to an end user’s inbox | 10 | =I/C

So the results are as follows:

Solution | False Negatives in End User’s Inbox | Load on Exchange Servers
Old solution | 10 | 50,000
New solution | 6 | 30,000

Your new vendor shows quite an improvement over the old vendor. The research that went into your spam solution has paid off immediately. Not only are your end users happier, but the entire mail team is as well, as you have “gained” capacity on your Exchange servers.

Scenario 2: Today’s Challenges

Fast forward to today. Things are different. The rise of botnets has resulted in a higher volume of spam hitting your gateway. You run some reports and see that inbound email volumes have increased from 500,000 spam messages per day to 2 million spam messages per day. Though it sounds extreme, this type of increase is pretty typical, as discussed previously.

Furthermore, because spammers are now sending more sophisticated, attachment-based spam campaigns, your gateway spam solution’s effectiveness might have declined a bit. But let’s be conservative and give your “new” vendor the benefit of the doubt—we’ll assume that effectiveness has actually improved by 1% to 95% average effectiveness.

Even so, your 5,000 end users are now complaining, because it looks to them like anti-spam effectiveness has declined. This can’t be, you tell your team. We just spent time and money evaluating and deploying a solution a year ago!
Let’s look at the numbers again to better understand what’s going on:

Variable | Metric | Value | Formula
A | Daily spam volume | 2,000,000 |
B | New vendor’s effectiveness | 95% |
C | Number of end users | 5,000 |
D | Spam being blocked at gateway | 1,900,000 | =A*B
E | Spam not blocked, that gets through the gateway and hits your mail servers | 100,000 | =A-D
F | Average number of spam messages/day that get to an end user’s inbox | 20 | =E/C

So the results are as follows:

Solution | False Negatives in End User’s Inbox | Load on Exchange Servers
Old solution | 10 | 50,000
New solution (when first deployed) | 6 | 30,000
New solution (one year later) | 20 | 100,000

No wonder the helpdesk phone is ringing off the hook. There are several observations to make here:

o The typical end user perceives a more than threefold increase in the amount of spam in their inbox.
o Furthermore, they complain that they are getting more spam than they did with the old anti-spam solution. Why did you make the wrong selection, they ask you.
o Your Exchange mail servers are also straining under increased load, and you are spending valuable resources trying to keep them performing well.
o All of this occurs even though your anti-spam effectiveness at the gateway increased from 94% to 95%. You are actually blocking roughly four times as much spam as you did before (1,900,000 versus 470,000 messages per day)!

You decide to solve this situation by educating your users through newsletters, seminars, etc. This certainly has a positive effect, and is in fact a best practice to follow. Your end users empathize with your situation. They tell you they understand what is going on, but that 20 spam messages per day is a bit too much for them to handle. Can’t you do anything about it?

Scenario 3: The Need for Extreme Anti-spam Effectiveness

In the final scenario, let’s look at what it would take to solve this problem with technology. Your end users are complaining due to a perceived decrease in effectiveness and the rising number of spam emails in their inboxes.
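One way to frame the question is to invert the formulas used in the tables and solve for the effectiveness needed to reach a target inbox count. As an illustrative assumption, take the target to be the Scenario 1 level of 6 spam messages per user per day; the symbols A, B, C and F are those used in the tables.

```latex
F = \frac{A\,(1 - B)}{C}
\quad\Longrightarrow\quad
B = 1 - \frac{F \cdot C}{A}
  = 1 - \frac{6 \times 5{,}000}{2{,}000{,}000}
  = 0.985
```

Under that assumption, matching the “good old days” at a volume of 2 million messages per day would take an effectiveness of roughly 98.5%.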
You can’t control the spam volumes hitting your organization, but you can deploy a system with increased effectiveness. “Good” just won’t cut it, though: an extraordinarily high effectiveness in the range of 98% to 99% is required. Let’s look again at the numbers:

Variable | Metric | Value | Formula
A | Daily spam volume | 2,000,000 |
B1 | Minimum new effectiveness | 98% |
B2 | Extreme effectiveness | 99% |
C | Number of end users | 5,000 |
D1 to D2 | Spam being blocked at gateway | 1.96M to 1.98M | =A*B1 to =A*B2
E1 to E2 | Spam not blocked, that gets through the gateway and hits your mail servers | 20,000 to 40,000 | =A-D2 to =A-D1
F1 to F2 | Average number of spam messages/day that get to an end user’s inbox | 4 to 8 | =E1/C to =E2/C

So the results are as follows:

Solution | False Negatives in End User’s Inbox | Load on Exchange Servers
Old solution | 10 | 50,000
New solution (when first deployed) | 6 | 30,000
New solution (one year later) | 20 | 100,000
Solution with 98% effectiveness | 8 | 40,000
Solution with 99% effectiveness | 4 | 20,000

As you can see, it is possible to get the situation under control and back to spam levels comparable to the “good old days,” but it requires a solution with extremely high effectiveness. Anti-spam effectiveness must be, at a minimum, 98%, and 99% or better is clearly preferable. This level of anti-spam effectiveness is possible, but only by using the most advanced technologies such as Proofpoint MLX, which is continually being enhanced to incorporate innovative new anti-spam techniques and responds intelligently to evolving spam conditions.

Winning the Battle Against Image- and Attachment-based Spam

Because so much of today’s spam volume increase is due to image- and attachment-based spam being sent by botnets, the best performing anti-spam solutions are those that correctly identify and block such messages.
In today’s environment, effective protection against attachment-based spam is a fundamental requirement for successful anti-spam solutions.

The use of attachments and images in spam is not new. Spammers switched from using referenced URLs for images to embedding images for two reasons. First, mail clients have become smarter, refusing to download images unless they are explicitly told to do so. For example, Microsoft Outlook and Google’s Gmail do not display images that are referenced through URLs unless the user has clicked on a button or configured a setting, thereby ordering the client to download the image. Second, the cost of computing resources available to spammers has declined, directly through Moore’s Law and indirectly through the use of botnets.

Let’s take a closer look at some of the techniques used in image-based spam to better understand the sort of randomization, “personalization” and obfuscation techniques used in nearly all spam campaigns today.

Image-Based Spam Obfuscation and Evasion Methods

Today’s approach of embedding images has proven to be successful, especially against Bayesian-based and signature-based products. Many spam filters are not able to correctly classify these messages. The illustration below shows five images that were extracted from five different spam messages. They are clearly all from the same spam campaign and, at first glance, the images appear identical.

Figure 5: To the human eye, these spam images are identical... But hidden differences confuse spam filters.
However, if we take a digital “signature” of each image, using the same sort of technique that a simple spam filter would use to compare the images, we find that each of them has a different electronic signature, as follows:

Image | Image Signature
1 | 122413e0682085f68c2b947a53af02cc
2 | 28de627c92a20b1043deebfa5f7715f8
3 | 6280188bd69ab41fd9764df2a10978f5
4 | 6e8a670f65570b1daf52dd3ae10c3a4c
5 | e3bdd4b0073a502544df4f07647764db

To an unsophisticated, signature-based spam filter, these images are completely different! That’s no comfort to your end users, however. One of them might receive this “same” spam five times over the course of five days and call your help desk asking, “This message is obviously spam, and I keep getting it! Why isn’t the filter catching it?”

A signature-based anti-spam filter might identify the first image as spammy based on a submission from an end user, but as it continues to look for that exact image, it will never again see an exact match. Each image-based spam is subtly different. By randomizing or obfuscating the image used in each individual spam email, the spammers are able to successfully bypass simple filtering methods. It’s impossible for a filter to predict all obfuscations of an image using signature-based approaches or simple Bayesian technology.

But back to those images: How is it that each of them is unique? Let’s look at just a few of the techniques spammers are using to confuse spam filters using images.

Randomized Image Borders

A thin border of different colors and pixel width is automatically placed (e.g., auto-generated by spamming software) around the image. The Proofpoint Attack Response Center has seen this technique used frequently with “pump and dump” spam stock pitches. For example, the border that we’ve zoomed in on in Figure 6 is unique for each individual spam message.

Figure 6: A stock pitch image-based spam with randomized border.
Each individual spam message sent as part of this campaign includes an image with a unique, pixelated border, making the images resistant to detection by simple signatures.

Figure 7: More image-based stock pitch spam. This image shows multiple obfuscation techniques being used in the same spam message. The spam’s “payload” is the information contained in the image at the top of the message. Taking a closer look at this image, we see that it has a very subtle background pattern of randomized lines. The small circle shows a zoomed-in area where we’ve used image enhancement to reveal the random pattern of lines. The larger circle shows another area of the spam where we’ve zoomed in on some randomized pixels, showing another technique the spammer has used to make image de-obfuscation more difficult. Finally, note that the spammer has combined image- and text-based obfuscation techniques by including randomized text to bypass text signatures and less sophisticated Bayesian filters.

Randomized Pixels Between Paragraphs

In this technique, randomized pixels or other types of high-frequency noise are inserted in between paragraphs (see large circular highlight in Figure 7, above). This is an attempt by spammers to make image de-obfuscation more difficult. There are many variants on this technique whereby visible, apparently random, distortions are added to individual spam images to make them unique so they can slip past signature-based filters.

Randomized Colortable Entries to Obfuscate the Image

Since most image-based spam messages don’t need many colors, unused color map entries are another place where spammers can insert invisible obfuscations. For example, the spammer enters random values into all of the unused color map entries for each individual spam message. This maintains the visual integrity of the image, but changes its invisible structure and makes the actual content of each image unique.
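To make the color map trick concrete, here is a minimal sketch in Python. It does not parse a real GIF; it models an image as a 256-entry palette plus a grid of palette indices (an assumption made to keep the example self-contained), and uses MD5 as a stand-in for the kind of exact signature a simple filter might compute. All function names are hypothetical.

```python
import hashlib
import random

PALETTE_SIZE = 256


def make_image(rng: random.Random):
    """Build a toy indexed image that draws only on palette entries 0..9."""
    palette = [(i * 20, i * 20, i * 20) for i in range(10)]
    palette += [(0, 0, 0)] * (PALETTE_SIZE - 10)      # unused entries
    pixels = [rng.randrange(10) for _ in range(64)]   # 8x8 grid of indices
    return palette, pixels


def randomize_unused(palette, pixels, rng: random.Random):
    """Return a copy of the palette with a random color in every unused entry."""
    used = set(pixels)
    return [
        color if index in used
        else (rng.randrange(256), rng.randrange(256), rng.randrange(256))
        for index, color in enumerate(palette)
    ]


def signature(palette, pixels) -> str:
    """Exact-match signature over the raw bytes (palette, then indices)."""
    raw = bytes(c for color in palette for c in color) + bytes(pixels)
    return hashlib.md5(raw).hexdigest()


def render(palette, pixels):
    """What the recipient actually sees: the color of each pixel."""
    return [palette[p] for p in pixels]


rng = random.Random(0)
palette, pixels = make_image(rng)
variant = randomize_unused(palette, pixels, rng)

# The two files hash differently even though they render identically.
print(signature(palette, pixels) == signature(variant, pixels))
print(render(palette, pixels) == render(variant, pixels))
```

Because only entries that no pixel refers to are touched, the rendered image cannot change, while any byte-level signature over the file almost certainly will.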
In the figures above, fewer than 10 colors are actually used in the visible image. If this is a 256-color, palette-based image such as a GIF, the spammer then has 246 color map entries (i.e., 256 - 10) that can be safely randomized while still ensuring that the image “looks” identical.

As shown in Figure 8, even something as simple as the GIF format offers plenty of opportunities for such mischief. For example, randomized data can also be inserted into the GIF terminator part of the file. Alternatively, randomized borders, lines, pixels or other graphical features can be inserted into the visible portion of the image.

GIF file block | Opportunity for obfuscation
GIF Signature |
Screen Descriptor |
Global Color Map | Randomizations can be placed in unused color map entries (typically invisible)
Image Descriptor |
Local Color Map | Randomizations can be placed in unused color map entries (typically invisible)
Raster (Image) Data | Many types of visible randomizations and obfuscations can be inserted here
GIF Terminator | Randomizations can also be inserted here (invisible)

Figure 8: This is a schematic of the GIF file format. Note that, even in this relatively simple graphical file format, there are many opportunities for inserting randomizations and obfuscations.

Animated GIF with Embedded Spam Image

Another interesting technique that has emerged recently is the inclusion of a spam payload as a single frame in an animated GIF. The other frames serve as “red herrings” to confuse anti-spam software. When the email recipient views the image, the frames with small time values quickly cycle through, and the user is presented with the image with the longest time sequence (which is, of course, the “spammy” image). See Figure 9, below.

A Spammy Animated GIF
Frame 1: Displayed for 0.1 sec
Frame 2 (spam payload): Displayed for 250 sec
Frame n...: Displayed for 0.1 sec

Figure 9: Animated GIFs can be used to “hide” a spam payload from most spam filters.
Decoy images display for an almost imperceptible amount of time while the payload image is displayed for long periods. Advanced forms of analysis are required to properly identify such files as spam.

Image Segmentation

This image-based spam technique obfuscates an image by breaking up the base image into a random assortment of smaller images. Each spam campaign will use the same image, but the sub-images are different. Automated software disassembles the image into its random parts and composes the HTML code that holds them together. Of course, to most anti-spam filters, each individual image looks entirely different from the base image, and from every other sub-image found in individual emails that are part of the spam campaign.

[Figure 10 panels: Base Image; Image subdivided into 7 sub-images; Image subdivided into 12 sub-images]

Figure 10: Image spam payloads are sometimes broken up into randomly sized sub-images that display properly when presented in a mail client, thanks to clever HTML coding.

OCR-resistant Images

This technique uses animated GIFs to break up an image into different overlapping frames of “broken” text. All frames, except for the last, quickly cycle through. Each ensuing frame after the first adds pixels to complete the words in the image. The ensuing frames also contain transparent colors to ensure that the frames underneath are visible. Only the parts required to complete the image are transparent. When all frames have cycled through, they appear stacked on top of each other, and thus the image becomes legible, as shown in Figure 11.

This technique is designed to make the text in the image resistant to spam filters that use OCR (optical character recognition). Optical character recognition is still an extremely compute-intensive technique, and Proofpoint’s research in this area has revealed that OCR itself is only minimally effective in the fight against image-based spam.
Current versions of Proofpoint MLX do not use literal OCR techniques. However, Proofpoint has developed proprietary, OCR-like techniques for analyzing image color spaces and other image attributes that help MLX differentiate between valid and spammy images in a highly efficient (i.e., minimal CPU overhead) manner.

Combining Image-based and Text-based Techniques

Spammers typically execute their attacks using a combination of techniques in the same campaign. With the flexibility available to them with image-based spam, they are able to evade many first- and second-generation filters. Refer back to Figure 7 and you’ll see that this spam combines at least three different obfuscation techniques. The top part of the spam is an embedded image. The image uses two of the obfuscation techniques discussed earlier. First, the image has random pixels between paragraphs. Second, it incorporates randomly spaced lines in the image background that serve to further obfuscate it from the base image. The third technique used is a text technique. Underneath the image is a large amount of “Bayesian busting” text that is likely to be interpreted as legitimate content.

This approach of mixing legitimate-sounding language with an image-based spam payload is a sophisticated but very common way to bypass Bayesian- and signature-based filters in one fell swoop. Not only is each image different, but most Bayesian filters will overcompensate in their scoring by treating the legitimate-sounding language as a “clue” that this email is legitimate.

Winning the Battle Against Image-based Spam

The most effective techniques against image-based spam are those that are machine learning based. Proofpoint continues to be at the forefront in the battle against image-based spam, from both primary research and practical development perspectives.
Proofpoint’s MLX machine learning technology applies artificial intelligence and advanced image analysis methods to the problem of correctly identifying image-based spam. Proofpoint’s MLX-based image spam detection technologies protect against all the techniques mentioned above as well as many other classes of images. Just a few of the patent-pending image-based analysis techniques used in Proofpoint MLX include:

o Fuzzy matching for obfuscated images: Proofpoint MLX detects obfuscated spam images by using techniques that mimic the way human beings perceive spam. Proofpoint has developed a variety of highly effective, but minimally compute-intensive, techniques for stripping out randomized borders, ignoring high-frequency randomized noise and analyzing image colormap entries to “see through” obfuscation tricks used by today’s image spammers.
o Dynamic spam image detection: Proofpoint software and appliances work locally to analyze incoming messages that contain images and track the number of similar (or similarly obfuscated) images. If the volume of these images exceeds a certain threshold, Proofpoint MLX classifies the image as “spammy,” similar to the way that Proofpoint MLX Dynamic Reputation monitors for malicious IP-level connections.
o Animated GIF spam detection: Proofpoint MLX analyzes the structural and temporal attributes of animated images to help identify those with spam characteristics.
o Dynamic botnet protection: Proofpoint MLX Dynamic Reputation continually profiles IP-level connections and source IP addresses, monitoring for activity characteristic of botnets. When botnet IPs are detected, Proofpoint MLX automatically rejects image-based and other types of spam from those sources.
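The first two ideas in the list above, fuzzy matching and volume-based detection, can be illustrated together. The sketch below is not Proofpoint’s algorithm, which this paper does not disclose; it is a deliberately simple stand-in that assumes grayscale images stored as lists of rows, a fixed border width, block averaging to suppress high-frequency noise, and an arbitrary threshold. Every name and constant is hypothetical.

```python
from collections import Counter

BORDER = 1      # assumed border width in pixels (must be >= 1 in this sketch)
BLOCK = 4       # side length of each averaging tile
LEVELS = 4      # brightness buckets kept per tile
THRESHOLD = 3   # assumed sighting count at which an image is flagged


def fingerprint(image):
    """Coarse fingerprint: drop the border, average each tile, quantize."""
    inner = [row[BORDER:-BORDER] for row in image[BORDER:-BORDER]]
    cells = []
    for top in range(0, len(inner) - BLOCK + 1, BLOCK):
        for left in range(0, len(inner[0]) - BLOCK + 1, BLOCK):
            tile = [inner[y][x]
                    for y in range(top, top + BLOCK)
                    for x in range(left, left + BLOCK)]
            mean = sum(tile) / len(tile)
            cells.append(int(mean * LEVELS / 256))
    return tuple(cells)


class ImageTracker:
    """Counts sightings of each fingerprint and flags frequent ones."""

    def __init__(self, threshold=THRESHOLD):
        self.threshold = threshold
        self.seen = Counter()

    def observe(self, image) -> bool:
        key = fingerprint(image)
        self.seen[key] += 1
        return self.seen[key] >= self.threshold


def make_variant(border_value, noisy_pixel):
    """10x10 test image: same 8x8 content, different border, one noisy pixel."""
    image = [[border_value] * 10 for _ in range(10)]
    for y in range(1, 9):
        for x in range(1, 9):
            image[y][x] = 32 if x < 5 else 224
    ny, nx = noisy_pixel
    image[ny][nx] += 10 if image[ny][nx] < 128 else -10
    return image


tracker = ImageTracker()
variants = [make_variant(b, p)
            for b, p in [(0, (2, 2)), (90, (5, 7)), (255, (8, 3))]]
flags = [tracker.observe(v) for v in variants]
print(flags)
```

The three test images differ byte for byte, yet they share one fingerprint, so the third sighting crosses the assumed threshold and `flags` comes out as `[False, False, True]`. A production system would need a far more robust similarity measure; the point here is only that counting near-duplicates requires a signature that tolerates small changes.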
An OCR-resistant Animated GIF Spam
Frame 1: Contains “broken” text with transparent background
Frame 2: Contains “broken” text that displays through the transparent regions of Frame 1
Visible Result: Animated frames combine visually to reveal the spam “payload”

Figure 11: An animated GIF image-based spam that uses an advanced technique to evade filters that use optical character recognition (OCR), which attempt to extract the text payload from an image. In this example, the spam payload is broken up into separate frames of an animated GIF. By itself, each frame consists of only fractional “broken” text which can’t be read by OCR. As the animation plays, the frames combine to form the human-readable spam payload (yet another pump and dump stock pitch).

Of course, these image-specific techniques work hand-in-hand with the hundreds of thousands of other message attributes analyzed by Proofpoint MLX. As Proofpoint’s automated machine learning systems and Proofpoint Attack Response Center staff identify new image-based spamming techniques and other threats, MLX engine updates are automatically delivered to customers’ local Proofpoint servers. These updates are automatically and immediately available, without requiring any administrator intervention, manual updates or system upgrades, ensuring that your organization is always protected against the latest threats.

Ongoing MLX Research and Development for Attachment-based Spam

In order to maintain its edge on spam detection, Proofpoint continues to invest heavily in research and development. Proofpoint’s spam research arm, the Proofpoint Attack Response Center, has developed sophisticated machine learning based “agents” that can reliably identify new classes of image-, attachment- and text-based spam.
These predictive systems are also capable of automatically responding to new threats as they appear, automatically creating new versions of the Proofpoint MLX engine, testing new algorithms for effectiveness and delivering those enhancements to customer sites, without the need for human intervention. This technology, which is unique to Proofpoint, enables extremely rapid response to new forms of spam and optimizes the number of updates delivered to Proofpoint SaaS services and local Proofpoint servers in such a way that they have zero negative impact on performance.

Proofpoint’s backend systems employ a wide variety of machine learning techniques, including methods that are especially helpful in the identification of attachment-based spam:

o Automated Image Extraction Threshold Analysis: A technique that automatically detects images being used in new spam campaigns. It works by looking at high-frequency variations of an image. If a large number of images with subtle differences are detected, that image is added as a potential “spammy” image and included in the MLX engine. This detection and automatic retraining of the MLX engine is performed in real time.
o Predominant Correlation: Information gain is a technique used to identify the very best attributes (or clues) to use in detecting spam versus valid mail. From the millions of available attributes, information gain selects those that are most valuable. Proofpoint has taken this technique a step further with the introduction of predominant correlation-based attribute selection. This technique allows Proofpoint MLX to identify attributes that are redundant and automatically remove them, ensuring that only the most effective indicators of spam are considered.
This intelligent approach to attribute analysis maximizes effectiveness (the system’s ability to accurately detect spam) and performance (the system’s ability to rapidly process messages) at the same time.
o URL Analysis Techniques: Proofpoint’s backend systems perform statistical analyses of URLs from Proofpoint honeypots and customer sites, coupled with correlative analysis of URLs and the IP addresses hosting them. By using advanced network analysis techniques, Proofpoint MLX can determine if a sending IP address is associated with a known malicious URL or suspicious ISP and use these associations as a strong indicator of spam.
o Clustering/Automation: Proofpoint MLX uses advanced automation that clusters messages in large data sets and extracts common elements, speeding up the process of identifying new spam attributes. This results in higher effectiveness and faster responses to new spam attacks of all types.
o Hadoop MapReduce Processing of Very Large Data Sets: Proofpoint is using cutting-edge technologies such as Apache Hadoop MapReduce to process extremely large amounts of data using distributed computing resources. These distributed computing capabilities enable Proofpoint to analyze statistics and trends more quickly and comprehensively, which in turn strengthens Proofpoint MLX’s ability to respond quickly to new spam campaigns.

Proofpoint Attack Response Center

Talking about machine learning is relatively easy. Developing an enterprise-class anti-spam solution that effectively leverages machine learning techniques and the best traditional techniques requires a major R&D investment and a world-class team. The Proofpoint Attack Response Center is a collection of dedicated professionals and automated systems that monitor the Internet for new spam attacks, virus outbreaks and other anomalous activities.
Data from the Proofpoint Attack Response Center is used to refine Proofpoint MLX models, develop new machine learning-based security technologies, power new services such as Proofpoint Zero-Hour Anti-Virus and address emerging threats. The updates created by Proofpoint researchers and their automated systems are automatically delivered to Proofpoint customer sites via the Proofpoint Dynamic Update service, ensuring that the most accurate statistical models and machine learning classifiers are always used.

A world-class team of scientists has been assembled for the Proofpoint Attack Response Center, unparalleled in its cross-disciplinary depth and breadth. The team consists of researchers and engineers with deep roots across several relevant disciplines including machine learning, statistics, natural language processing, information classification, messaging and security. The Proofpoint Attack Response Center brings together the expertise and resources necessary to ensure that Proofpoint solutions continue to set the standards that the rest of the industry strives to match.

Conclusion

The Proofpoint MLX system is continually training to detect the latest forms of spam. Information is fed back into the system to enable it to automatically tune its spam attributes, statistical processes and classifications. Rather than relying on any one technology, the MLX engine dynamically chooses the most effective set of attributes and models to process each message. Proofpoint MLX technology:

o Continuously adapts: to detect new types of spam without manual intervention; the system’s ability to identify spam does not degrade as spammers change their tactics.
o Employs next-generation machine learning techniques: including logistic regression and information gain techniques to build large-scale statistical models that accurately represent dependencies among spam attributes and delineate the boundary between spam and valid messages.
o Includes image- and attachment-specific machine learning techniques: to accurately identify even the most sophisticated spam messages. Proofpoint continues to identify the latest attachment-based spamming techniques and has built technology to handle these threats proactively and predictably. As new techniques emerge, Proofpoint delivers the latest spam detection technologies to customers automatically.
o Analyzes more than 1,000,000 spam attributes: including message envelope and header characteristics as well as the actual message and attachment content to accurately classify messages and ensure a low rate of false positives.
o Ensures the maximum protection today and improves in performance: even as spam evolves.

Additional Resources

To learn more about Proofpoint MLX technology and Proofpoint’s email security solutions, please consult the following online resources.

Webinar Replay: Defend Against Blended Threats

Blended Web and email threats are becoming increasingly complex and represent a huge potential risk to your organization, your customers and your employees. Watch this webinar replay featuring Web and email security experts from Blue Coat and Proofpoint and learn how you can:

o Defend your organization from email and Web threats, minimizing risk.
o Protect confidential corporate data and personal information.
o Use real-time intelligence and reputation data to safeguard your organization from malicious content.

Register to view this web seminar replay by visiting: http://www.proofpoint.com/id/blended-threats-webinar-0309/index.php

Learn More About SaaS Email Security and Email Archiving

Learn more about how Proofpoint’s Software-as-a-Service email security and email archiving solutions deliver maximum security at the lowest total cost of ownership. Download two Osterman Research whitepapers, Using SaaS to Reduce the Costs of Email Security and Email Archiving: Realizing the Cost Savings and Other Benefits from SaaS, by visiting: http://www.proofpoint.com/tco

Research Paper on Spam Obfuscation Tactics

Proofpoint’s research activities are ongoing and Proofpoint researchers regularly publish papers on their research. For example, the original research behind some of the obfuscation detection techniques described in this whitepaper is described in the following paper that Proofpoint scientists presented at the 2006 Virus Bulletin conference.
Download a copy of this paper by visiting: http://www.proofpoint.com/id/vb2006/index.php

Proofpoint Online Resource Center

Find the latest Proofpoint datasheets, whitepapers, webinars and more in the Proofpoint Resource Center. Please visit: http://www.proofpoint.com/resources

Proofpoint Email Security Blog

Stay up to date on the latest trends in email security by subscribing to our email security blog or by following us on Twitter:
http://blog.proofpoint.com
http://www.twitter.com/Proofpoint_Inc

About Proofpoint, Inc.

Proofpoint provides unified email security, data loss prevention and email archiving solutions that help enterprises, universities, government organizations and ISPs defend against spam and viruses, prevent leaks of confidential and private information, encrypt sensitive emails and comply with regulations that affect email use. Proofpoint’s products are controlled by a single management and policy console and are powered by Proofpoint MLX™ technology, an advanced machine learning system developed by Proofpoint scientists and engineers.

Proofpoint solutions can be deployed in hosted service, hardware appliance, virtual appliance, software and hybrid models, for maximum flexibility and scalability. For more information, please visit http://www.proofpoint.com.

US Worldwide Headquarters: Proofpoint, Inc., 892 Ross Drive, Sunnyvale, CA 94089, United States. Tel +1 408 517 4710
US Utah Satellite Office: Proofpoint, Inc., 13997 South Minuteman Drive, Suite 320, Draper, UT 84020, United States. Tel +1 801 748 4610
Asia Pacific: Proofpoint APAC, 5th Floor, Q.House Convent Bldg., 38 Convent Road, Silom, Bangrak, Bangkok 10500, Thailand. Tel +66 2 632 2997
EMEA: Proofpoint, Ltd., The Oxford Science Park, Magdalen Centre, Robert Robinson Avenue, Oxford, UK OX4 4GA. Tel +44 (0) 870 803 0704
Japan: Proofpoint Japan K.K., BUREX Kojimachi, Kojimachi 3-5-2, Chiyoda-ku, Tokyo, 102-0083, Japan. Tel +81 3 5210 3611
Canada: Proofpoint Canada, 210 King Street East, Suite 300, Toronto, Ontario, M5A 1J7, Canada. Tel +1 647 436 1036
Mexico: Proofpoint Mexico, Salaverry 1199, Col. Zacatenco, CP 07360, México D.F. Tel +52 55 5905 5306

www.proofpoint.com
©2010 Proofpoint, Inc. All rights reserved. 03/10 Rev A