Australian Journal of Information Technology and Communication, Volume II, Issue I, ISSN 2203-2843

A novel algorithm applied to filter spam e-mails using Machine Learning Techniques

Maninder Singh#1, Ranjan Sharma*2
#1 Guru Nanak Dev University, Amritsar (Punjab)
Corresponding author: er.maninder001@gmail.com
*2 sharma.ranjan@yahoo.com

Abstract: Email spam is one of the major problems of today's Internet, causing financial damage to companies and annoying individual users. Among the approaches developed to stop spam, filtering is an important and popular one. In this paper, a novel algorithmic approach is applied to filter spam emails. A spam base dataset obtained from the UCI repository is used to evaluate its performance. The results show that the proposed algorithm outperforms other existing algorithms.

Keywords: Spam mail filter, WEKA, MATLAB, mail classification, privacy protection, information security, Machine learning, Data mining, Decision trees, clustering

I. INTRODUCTION
Email has become one of the fastest and most economical forms of communication in all aspects of everyday life [1]. However, this benefit is being eroded by the sheer growth and availability of email. Nowadays, a typical user receives about 20-40 email messages every day. Mass unsolicited electronic mail, often known as spam, has recently increased enormously and has become a serious threat to society as well as to the Internet. The flood of spam consumes not only computing, storage and network resources but also the human time and attention needed to dismiss unwanted emails. Thus, users spend a significant part of their working time processing email. Since the cost of spam is borne mostly by the recipient, many individuals and businesses send bulk messages in the form of spam [2]. Not only is spam flooding our mailboxes, but locating important and vital information among the huge number of emails has turned into a laborious and time-consuming daily activity.
The amount of spam sent over the Internet has been rising dramatically in recent years, and no decline is expected in the near future. Therefore, email management is an important and growing problem for individuals and organizations.

II. RELATED RESEARCH
Jindal and Liu (2007) [4] describe the mining of opinions from product reviews, forum posts and blogs as an important research topic with many applications. Existing research has focused on the extraction, classification and summarization of opinions from these sources. They study review spam in the context of product reviews and note that, at the time, there was no published study on this topic, although Web page spam and email spam had been investigated extensively. Review spam is quite different from Web page spam and email spam, and thus requires different detection techniques.

III. PROBLEM STATEMENT
In this e-world, most transactions and business take place through email [3]. Email has become a powerful communication tool because it saves a lot of time and cost. However, due to social networks and advertisers, most emails contain unwanted information called spam. Even though many algorithms have been developed for email spam classification, none of them achieves 100% accuracy in classifying spam emails. Spam, also known as Unsolicited Commercial Email, is generated by sending unsolicited commercial messages to many recipients without their permission. Spammers use a computer program to crawl almost every website on the Internet. The program examines the code of each web page, looks for email addresses, and saves every address it finds to the spammer's database of millions of harvested addresses.

IV. INTRODUCTION TO DATA MINING
Data mining is a powerful new technology with great potential to help companies focus on the most important information in the data they have collected about the behaviour of their customers and potential customers.
It discovers information within the data that queries and reports cannot effectively reveal. Generally, data mining (sometimes called data or knowledge discovery) is the process of analyzing data from different perspectives and summarizing it into useful information: information that can be used to increase revenue, cut costs, or both. Data mining software is one of a number of analytical tools for analyzing data. It allows users to analyze data from many different dimensions or angles, categorize it, and summarize the relationships identified. Technically, data mining is the process of finding correlations or patterns among dozens of fields in large relational databases.

V. ROLE OF DATA MINING IN VARIOUS FIELDS
Although data mining is still in its infancy, companies in a wide range of industries, including retail, finance, health care, manufacturing, transportation, and aerospace, are already using data mining tools and techniques to take advantage of historical data. By using pattern recognition technologies and statistical and mathematical techniques to sift through warehoused information, data mining helps analysts recognize significant facts, relationships, trends, patterns, exceptions and anomalies that might otherwise go unnoticed. For businesses, data mining is used to discover patterns and relationships in the data in order to help make better business decisions. Data mining can help spot sales trends, develop smarter marketing campaigns, and accurately predict customer loyalty. Specific uses of data mining include:
1. Market segmentation: Identify the common characteristics of customers who buy the same products from your company.
2. Customer churn: Predict which customers are likely to leave your company and go to a competitor.
3. Fraud detection: Identify which transactions are most likely to be fraudulent.
4. Direct marketing: Identify which prospects should be included in a mailing list to obtain the highest response rate.
5. Interactive marketing: Predict what each individual accessing a Web site is most likely interested in seeing.
6. Market basket analysis: Understand what products or services are commonly purchased together, e.g., beer and diapers.
7. Trend analysis: Reveal the difference between a typical customer this month and last.

VI. PROPOSED ALGORITHM
The following steps are included in the proposed algorithm:

Algorithm (Decision Tree): Generate_decision_tree. Generate a decision tree from the training tuples of data partition D.

Input:
Step 1: Data partition, D, which is a set of training tuples and their associated class labels;
Step 2: Attribute_list, the set of candidate attributes;
Step 3: Attribute_selection_method, a procedure to determine the splitting criterion that "best" partitions the data tuples into individual classes. This criterion consists of a splitting_attribute and, possibly, either a split point or a splitting subset.

Output: A decision tree.

Method:
Step 1: Create a node N;
Step 2: If tuples in D are all of the same class, C, then
Step 3: Return N as a leaf node labelled with the class C;
Step 4: If attribute_list is empty then
Step 5: Return N as a leaf node labelled with the majority class in D; // majority voting
Step 6: Apply attribute_selection_method to find the "best" splitting_criterion;
Step 7: Label node N with splitting_criterion;
Step 8: If splitting_attribute is discrete valued and multiway splits are allowed then
Step 9: Attribute_list = attribute_list - splitting_attribute; // remove the splitting attribute
Step 10: For each outcome j of splitting_criterion
Step 11: Let Dj be the set of data tuples in D satisfying the outcome j;
Step 12: If Dj is empty then
Step 13: Attach a leaf labelled with the majority class in D to node N;
Step 14: Else attach the node returned by Generate_decision_tree(Dj, attribute_list) to node N;
Step 15: Return N;

VII. RESULTS AND DISCUSSION
First, the bar graph compares the different algorithms in terms of the percentage of instances, using WEKA. The plot shows that the proposed J48 algorithm has the lowest error and the highest accuracy. The number of correctly classified instances is highest for the proposed J48 algorithm. The outcome is shown below as Fig. 1.

Fig. 1 Comparative analysis (percentage parameters).

The next figure compares the different algorithms against further parameters. The plot shows that the proposed J48 algorithm has the lowest error and the highest accuracy. The kappa statistic is highest, at 0.8812, for the proposed J48 algorithm. The outcome is shown below as Fig. 2.

Fig. 2 Comparative analysis of algorithms in kappa statistic, mean absolute error, root mean squared error.

Fig. 3 compares the different algorithms against additional parameters. The plot shows that the proposed J48 algorithm has the highest accuracy. The Recall, F-Measure and ROC Area values are highest for the proposed J48 algorithm.

Fig. 3 Comparative analysis.

Fig. 4 Accuracy in MATLAB

An accuracy of 99.97% is achieved in MATLAB, with an error rate of 0.0217. The previous result achieved by the J48 algorithm on the spambase dataset was 92.195% of instances correctly classified. Various parameters have been modified with the introduction of new and modified features.

VIII. CONCLUSION
As the technology of machine learning continues to develop and mature, learning algorithms need to be brought to the desktops of people who work with data and understand the application domain from which it arises.
It is necessary to get the algorithms out of the laboratory and into the work environment of those who can use them. Mining frequent item sets for association rule mining from a large transactional database is a crucial task. Many approaches have been discussed, and they have scope for improvement. WEKA is a significant step in the transfer of machine learning technology into the workplace. WEKA has proved itself to be a useful and even essential tool in the analysis of real-world data sets. It reduces the level of complexity involved in getting real-world data into a variety of machine learning schemes and evaluating the output of those schemes. It has also provided a flexible aid for machine learning research and a tool for introducing people to machine learning in an educational environment. This research work focuses on improving the e-mail spam classification rate using the integrated MATLAB and WEKA tool. The proposed scenario has shown an accuracy rate of 99.97%, which is attributed to MATLAB's rich learning rate. Thus the proposed approach has shown a significant improvement over the available techniques. In the near future, we will use different kinds of data sets to validate the proposed work. Only the J48 algorithm has been considered in this work, so more machine learning algorithms will be considered in the near future.

IX. REFERENCES
[1] Jianchao Han, Juan C. Rodriguez, Mohsen Beheshti, "Discovering Decision Tree Based Diabetes Prediction Model." International Conference, ASEA 2009, Communications in Computer and Information Science, Volume 30, 2009, pp. 99-109. ISSN 1865-0929, Springer.
[2] Benevenuto, Fabrício, Gabriel Magno, Tiago Rodrigues, and Virgílio Almeida. "Detecting spammers on twitter." In Collaboration, electronic messaging, anti-abuse and spam conference (CEAS), vol. 6, p. 12. 2010.
[3] Salama, Gouda I., M. B. Abdelhalim, and Magdy Abdelghany Zeid. "Experimental comparison of classifiers for breast cancer diagnosis." In Computer Engineering & Systems (ICCES), 2012 Seventh International Conference on, pp. 180-185. IEEE, 2012.
[4] Jindal, Nitin, and Bing Liu. "Analyzing and detecting review spam." In Data Mining, 2007. ICDM 2007. Seventh IEEE International Conference on, pp. 547-552. IEEE, 2007.
[5] Maninder Singh, "A REVIEW ON DATA MINING ALGORITHMS." In IJCSITR, Vol. 2, Issue 2, pp. 8-14, 2014.